# 5.1 Introduction to NLP

## What is NLP?

Natural Language Processing (NLP) is a field of computer science that focuses on the interaction between human language and computers. NLP enables computers to understand, interpret, and generate human language. It has many applications in journalism, from performing complex analysis of written documents to generating parts of articles. In this notebook, we will cover the core concepts of NLP and try a more advanced method called Named Entity Recognition.

* **Setup and Packages**: There are several key NLP packages that provide a lot of functionality and are important to know. We will provide a basic overview of each, and will then use them in the following sections.
* **Tokenization**: This critical step breaks the original text into pieces such as words, which many algorithms require as input.
* **Part-of-Speech (POS) Tagging**: Grammar is a core part of every language, and identifying the nouns, verbs, and adjectives of a sentence is a foundational task for understanding language.
* **Dependency Parsing**: Dependency parsing is similar to POS tagging, but instead of just finding the part of speech of each word, it finds how the words are connected. We can use this to find what part of the sentence the verb acts on. For example, in the sentence "The boy kicks the ball," the direct object of "kicks" is "ball".
* **Named Entity Recognition (NER)**: Knowing who was mentioned in a text can be a very useful analysis, and we will cover how to extract names of people and businesses.

In the next notebook, we will cover representing words as "vectors", sentiment analysis, meaning similarity, and text generation.

## Setup and Packages

There are three key NLP packages we will learn and use in this notebook: NLTK, Hazm, and spaCy.

### NLTK

NLTK (Natural Language Toolkit) is one of the most widely used Python NLP packages. It provides utilities for all the common NLP tasks. It does not offer specific Persian support, but many tools use it, and it is very good for English tasks.

In [1]:
import nltk

### Hazm

For Persian NLP analysis, Hazm is a library that provides many core NLP algorithms for Persian text and is compatible with NLTK.

In [2]:
import hazm

### spaCy

spaCy (pronounced "spay-see") is a modern and efficient library that is used in many production applications. It can be used both for small analyses and for building a complex NLP system that handles thousands of documents quickly. It has basic support for Persian.

In [3]:
import spacy

## Tokenization

One foundation of NLP is tokenization. Tokenization is the process of splitting text into individual words, phrases, or symbols, known as tokens. It is usually the first step in an NLP pipeline, as it enables us to analyze text at a more granular level. Just as humans read or follow speech by understanding each word individually, computers can find these individual words and then use them for analysis.

In [4]:
from hazm import *

tokenizer = WordTokenizer()

# Sample Persian sentence
sentence = "این یک جمله نمونه است."

# Tokenize the sentence
tokens = tokenizer.tokenize(sentence)

# Print the tokens
print(tokens)

['این', 'یک', 'جمله', 'نمونه', 'است', '.']


As we can see, the sentence is now split into individual tokens. We can now perform an analysis such as getting the total token count. Note that the final period is counted as a token too.

In [5]:
# Show token count (the final '.' is included)
print(len(tokens))

6
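
To build intuition for what a word tokenizer does, here is a minimal sketch that uses only Python's `re` module. It is not how `hazm` tokenizes. `simple_tokenize` is a hypothetical helper that treats each run of word characters as one token and each punctuation mark as its own token.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores (Unicode-aware in Python 3).
    # [^\w\s] matches one character that is neither a word character nor whitespace,
    # so each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("این یک جمله نمونه است.")
print(tokens)
print(len(tokens))
```

A sketch like this breaks down on real Persian text. For example, it splits a word that contains a zero-width non-joiner into several pieces, which is one reason to prefer a dedicated tokenizer.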


#### Sentence Tokenization

Some tasks will also want us to analyze a document sentence by sentence, and to do this, we want to split the text into sentences.

In [6]:
# Initialize the sentence tokenizer
sentence_tokenizer = SentenceTokenizer()

# Sample Persian text
text = "سلام دنیا! این یک متن فارسی است. امیدوارم که این درس مفید باشد."

# Tokenize the text into sentences
sentences = sentence_tokenizer.tokenize(text)

# Print the sentences
print(sentences)

['سلام دنیا!', 'این یک متن فارسی است.', 'امیدوارم که این درس مفید باشد.']


Now, as an exercise, see how we can do word tokenization for each of the sentences. The first way uses a `for` loop and the second uses a more advanced concept called list comprehension, but this is not necessary to know.

In [7]:
# First way: Tokenize the words in each sentence
tokenized_sentences = []
for sentence in sentences:
    tokenized_sentences.append(tokenizer.tokenize(sentence))
print('Tokenized:', tokenized_sentences)

Tokenized: [['سلام', 'دنیا', '!'], ['این', 'یک', 'متن', 'فارسی', 'است', '.'], ['امیدوارم', 'که', 'این', 'درس', 'مفید', 'باشد', '.']]


In [8]:
# Second way: Tokenizing each sentence using list comprehension
print('Tokenized:', [tokenizer.tokenize(sentence) for sentence in sentences])

Tokenized: [['سلام', 'دنیا', '!'], ['این', 'یک', 'متن', 'فارسی', 'است', '.'], ['امیدوارم', 'که', 'این', 'درس', 'مفید', 'باشد', '.']]


#### Example Usage

Now we will take a text of several sentences and find the longest sentence and the three most common tokens. We use the `Counter` class from the `collections` library, which helps us count how often each token appears.


In [9]:
# Import required libraries
from hazm import WordTokenizer, SentenceTokenizer
from collections import Counter

# Sample Persian text
text = """
گربه در خانه است. او به دنبال موش است. موش در حال حاضر در حیاط پنهان شده است.
گربه دوباره به دنبال موش است. موش سعی می کند از گربه فرار کند. گربه در خانه است.
گربه بازی می کند. موش بازی می کند. گربه در خانه است. گربه به دنبال موش است.
"""

# Initialize the tokenizers
word_tokenizer = WordTokenizer()
sentence_tokenizer = SentenceTokenizer()

# Tokenize the text into words and sentences
words = word_tokenizer.tokenize(text)
sentences = sentence_tokenizer.tokenize(text)

# Count the occurrences of each word
word_counter = Counter(words)

# Find the top three most common words
top_three_common_words = word_counter.most_common(3)

# Find the longest sentence (measured in characters, since key=len is applied to strings)
longest_sentence = max(sentences, key=len)

# Print the results
for i, (word, count) in enumerate(top_three_common_words, start=1):
    print(f"{i}. '{word}' appears {count} times.")
print(f"Longest sentence: '{longest_sentence}'")

1. '.' appears 10 times.
2. 'گربه' appears 7 times.
3. 'است' appears 6 times.
Longest sentence: 'موش در حال حاضر در حیاط پنهان شده است.'


Notice here that `.` is counted as its own token, so we would need to remove it from the tokens if we don't want it included or counted.
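
One way to leave punctuation out of the count is to filter the tokens before passing them to `Counter`. The sketch below assumes a token counts as punctuation when every character in it belongs to a Unicode punctuation category. `is_punctuation` is a hypothetical helper, and the token list is a shortened hand-written sample rather than the output of `hazm`.

```python
import unicodedata
from collections import Counter

def is_punctuation(token):
    # Unicode categories starting with "P" cover punctuation such as '.', '!' and '،'
    return bool(token) and all(unicodedata.category(ch).startswith("P") for ch in token)

# Shortened, hand-written sample of tokens
tokens = ['گربه', 'در', 'خانه', 'است', '.', 'گربه', 'بازی', 'می', 'کند', '.']

# Keep only the tokens that are not punctuation, then count them
words_only = [token for token in tokens if not is_punctuation(token)]
word_counter = Counter(words_only)
print(word_counter.most_common(1))
```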

## Part-of-Speech Tagging

POS tagging is the process of assigning a grammatical category or part-of-speech to each word in a sentence. This is useful so we can remove things like the word "and" and just get a collection of nouns or other items.

To do this, the `hazm` library has trained a tagger model which can be downloaded from their [GitHub page](https://github.com/roshan-research/hazm), but, for this notebook, it has already been downloaded and placed in the `resources/` folder of this repository. We are two directories away from the repository root, so we need to put `../../` first in the path to reach it.

In [10]:
from hazm import POSTagger

# Initialize the tagger
tagger = POSTagger(model='../../resources/postagger.model')

# Sample Persian sentence
sentence = "گربه در حال بازی کردن با موش است"

# Tokenize the sentence
tokenizer = WordTokenizer()
tokens = tokenizer.tokenize(sentence)

# Perform POS tagging
pos_tags = tagger.tag(tokens)

# Print the POS tags
print(pos_tags)

[('گربه', 'N'), ('در', 'P'), ('حال', 'Ne'), ('بازی', 'N'), ('کردن', 'N'), ('با', 'P'), ('موش', 'N'), ('است', 'V')]


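Notice here that each word now has a part of speech associated with it. We can see that `N` marks the nouns, `P` the prepositions, and `V` the verb. The tag `Ne` on "حال" appears to mark a noun that is linked to the following word (ezafe), so a check for exactly `'N'` will skip it.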

To find all the nouns, all we have to do is either of the following.

In [11]:
all_nouns = []
all_verbs = []
for pos_tag in pos_tags:
    if pos_tag[1] == 'N':
        all_nouns.append(pos_tag[0])
    elif pos_tag[1] == 'V':
        all_verbs.append(pos_tag[0])
print('Nouns:', all_nouns)
print('Verbs:', all_verbs)

Nouns: ['گربه', 'بازی', 'کردن', 'موش']
Verbs: ['است']


In [12]:
# or a more advanced but equal way
print('Nouns:', [pos_tag[0] for pos_tag in pos_tags if pos_tag[1] == 'N'])

Nouns: ['گربه', 'بازی', 'کردن', 'موش']
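
The checks above compare against exactly `'N'`, so the word tagged `Ne` is left out. If we want it included, we can test membership in a set of tags instead. This is a minimal sketch that assumes `Ne` is also a noun tag in this tagset. The `noun_tags` set is our own choice, and the tagged list is copied from the output above so the block runs without loading the model.

```python
# Tagged tokens copied from the tagger output shown earlier
pos_tags = [('گربه', 'N'), ('در', 'P'), ('حال', 'Ne'), ('بازی', 'N'),
            ('کردن', 'N'), ('با', 'P'), ('موش', 'N'), ('است', 'V')]

# Assumption: both 'N' and 'Ne' mark nouns in this tagset
noun_tags = {'N', 'Ne'}

nouns = [word for word, tag in pos_tags if tag in noun_tags]
print('Nouns:', nouns)
```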


### Chunking

The above is useful for finding individual words, but another useful approach is to find the different phrases in a sentence, such as the noun phrases and verb phrases. A "chunker" is a model that can find the groups of words that together make up the individual phrases in a sentence. `hazm` provides a chunker model as well, which we will now use.

Once again, we need to load the chunker model which has already been downloaded and placed in the `resources/` folder.

In [13]:
chunker = Chunker(model='../../resources/chunker.model')
tagged = tagger.tag(word_tokenize('گربه در حال بازی کردن با موش است'))
tree2brackets(chunker.parse(tagged))

'[گربه NP] [در PP] [حال بازی کردن NP] [با PP] [موش NP] [است VP]'

Here "VP" stands for "verb phrase", "NP" stands for "noun phrase" and "PP" stands for "prepositional phrase".

We will now chunk a simple sentence to show how it breaks into just three phrases.

In [14]:
simple_sentence = "سگ زرد را دیدیم"
tagged = tagger.tag(word_tokenize(simple_sentence))
tree2brackets(chunker.parse(tagged))

'[سگ زرد NP] [را POSTP] [دیدیم VP]'
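
If only the bracketed string is at hand, it can be split into (phrase, label) pairs with a regular expression. This is a minimal sketch using the standard library. `parse_brackets` is a hypothetical helper, not part of `hazm`, and it assumes the label is always the last space-separated item inside each pair of brackets.

```python
import re

def parse_brackets(bracketed):
    # Each chunk looks like "[word word LABEL]": capture the words and the label separately
    return re.findall(r"\[(.+?) (\w+)\]", bracketed)

chunks = parse_brackets('[سگ زرد NP] [را POSTP] [دیدیم VP]')
for phrase, label in chunks:
    print(label, '->', phrase)

# Keep only the noun phrases
noun_phrases = [phrase for phrase, label in chunks if label == 'NP']
print(noun_phrases)
```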

`tree2brackets` provides an easy way to view the sentence broken into phrases, but we can also view the original data that the chunker produces. It has what is called a "tree" structure: the sentence (`S`) is at the top, each phrase sits below it, and each phrase contains its tagged words.

In [15]:
chunked = chunker.parse(tagged)
print(chunked)

(S (NP سگ/Ne زرد/AJ) (POSTP را/POSTP) (VP دیدیم/V))


In [16]:
# Get the first phrase
print(chunked[0])
# Get the first word of the first phrase
print(chunked[0][0])

(NP سگ/Ne زرد/AJ)
('سگ', 'Ne')


If we want to get the verb phrase, we can use a `for` loop to check each chunk and keep the one with type `'VP'`. This is done by looking at the `.label()` value of each chunk.

In [17]:
verb_phrase = None
for chunk in chunked:
    if chunk.label() == 'VP':
        verb_phrase = chunk
print(verb_phrase)

(VP دیدیم/V)


Understanding the structure of language is key for successful NLP, and we have now learned several ways that we can automatically find the part-of-speech and the phrase structure of sentences.

## Dependency Parsing

We have now learned some core NLP analysis approaches, but these focused on finding either the part of speech of individual words or a few groupings of words. We now want to find how the words connect to one another. This is called dependency parsing, and it lets us find, for example, which word in a sentence is the direct object of the verb or which noun is the subject, along with all the other dependencies that make up a sentence.

Before we can do dependency parsing though, there is one more concept to quickly learn.

### Lemmatization

Lemmatization is the process of reducing the different forms of a word to a single base form, called the lemma, so that they can be grouped together. For example, "sell" and "sells" share the lemma "sell", and "dog" and "dogs" share the lemma "dog".

In [18]:
lemmatizer = Lemmatizer()
print(lemmatizer.lemmatize('سگ'))
print(lemmatizer.lemmatize('سگها'))

سگ
سگ


One use of this technique is searching for a word that could appear in multiple forms. We can lemmatize each word in the text and then check whether it matches the lemma of the word we are searching for.
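
The search idea can be sketched as follows. To keep the block runnable without any model, it uses a tiny hand-written English lookup table, `toy_lemmas`, as a stand-in for a real lemmatizer. Both `toy_lemmatize` and `find_matches` are hypothetical helpers, and in practice the lookup would be replaced by a call such as `lemmatizer.lemmatize(word)`.

```python
# Tiny hand-written stand-in for a real lemmatizer (illustration only)
toy_lemmas = {"sells": "sell", "selling": "sell", "sold": "sell", "dogs": "dog"}

def toy_lemmatize(word):
    # Fall back to the lowercased word when it is not in the table
    return toy_lemmas.get(word.lower(), word.lower())

def find_matches(tokens, query):
    # Compare lemmas rather than surface forms
    target = toy_lemmatize(query)
    return [i for i, token in enumerate(tokens) if toy_lemmatize(token) == target]

tokens = "She sells shells and he sold dogs".split()
matches = find_matches(tokens, "sell")
print(matches)
print([tokens[i] for i in matches])
```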

### Dependency Parsing

We are now ready to perform dependency parsing. We will take the simple sentence from before and parse out its dependencies.

Note: `java` needs to be installed for the dependency parser to work. If you receive an error when running the cell, that is fine; just try to understand what the code is doing. If you need to do dependency parsing yourself, install [Open JDK](https://openjdk.org/); the specific instructions depend on your operating system.

In [19]:
simple_sentence = 'سگ زرد را دیدیم'
parser = DependencyParser(tagger=tagger, lemmatizer=lemmatizer, working_dir='../../resources/')
dependency_graph = parser.parse(word_tokenize(simple_sentence))
print(dependency_graph)

defaultdict(<function DependencyGraph.__init__.<locals>.<lambda> at 0x7f89041bb670>,
            {0: {'address': 0,
                 'ctag': 'TOP',
                 'deps': defaultdict(<class 'list'>, {'ROOT': [4]}),
                 'feats': None,
                 'head': None,
                 'lemma': None,
                 'rel': None,
                 'tag': 'TOP',
                 'word': None},
             1: {'address': 1,
                 'ctag': 'Ne',
                 'deps': defaultdict(<class 'list'>, {'NPOSTMOD': [2]}),
                 'feats': '_',
                 'head': 3,
                 'lemma': 'سگ',
                 'rel': 'PREDEP',
                 'tag': 'Ne',
                 'word': 'سگ'},
             2: {'address': 2,
                 'ctag': 'AJ',
                 'deps': defaultdict(<class 'list'>, {}),
                 'feats': '_',
                 'head': 1,
                 'lemma': 'زرد',
                 'rel': 'NPOSTMOD',
                 'tag': '

The raw output is hard to interpret, so we will display it in an easier-to-read format.

The first one to be aware of is the CoNLL format. Each line holds one word, and the four columns are the word, its part-of-speech tag, the number of the word it depends on (its head), and the type of relationship. Here the `ROOT` line at the bottom shows the start of the dependency graph, which is the verb. The `OBJ` line shows the object the verb acts on. This is a good way to see which role each word has, but it is hard to see how each word connects to the others.

In [20]:
print(dependency_graph.to_conll(4))

سگ	Ne	3	PREDEP
زرد	AJ	1	NPOSTMOD
را	POSTP	4	OBJ
دیدیم	V	0	ROOT
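
Because the four-column output is plain tab-separated text, it is easy to load into Python dictionaries. This is a minimal sketch using only the standard library. `parse_conll4` is a hypothetical helper, and the text is copied from the output above so the block runs without the parser.

```python
# Four-column CoNLL text copied from the output above: word, tag, head, relation
conll_text = (
    "سگ\tNe\t3\tPREDEP\n"
    "زرد\tAJ\t1\tNPOSTMOD\n"
    "را\tPOSTP\t4\tOBJ\n"
    "دیدیم\tV\t0\tROOT\n"
)

def parse_conll4(text):
    rows = []
    # Words are numbered from 1; head 0 means the word hangs off the artificial root
    for address, line in enumerate(text.strip().splitlines(), start=1):
        word, tag, head, rel = line.split("\t")
        rows.append({"address": address, "word": word, "tag": tag,
                     "head": int(head), "rel": rel})
    return rows

rows = parse_conll4(conll_text)
by_address = {row["address"]: row for row in rows}

# Print each word together with the word it depends on
for row in rows:
    head_word = by_address[row["head"]]["word"] if row["head"] else "(root)"
    print(f'{row["word"]} --{row["rel"]}--> {head_word}')
```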



If you are familiar with graph data, another option is the `to_dot` method, which returns the graph as text in the DOT language used by Graphviz to draw graphs. Each word has an associated number, and each arrow points from a word to one that depends on it. Below, we start with `0`, which points to item `4`, the `ROOT` (shown by `0 -> 4`). Then 4 points to 3, where 3 is the object, and so on.

In [21]:
print(dependency_graph.to_dot())

digraph G{
edge [dir=forward]
node [shape=plaintext]

0 [label="0 (None)"]
0 -> 4 [label="ROOT"]
1 [label="1 (سگ)"]
1 -> 2 [label="NPOSTMOD"]
2 [label="2 (زرد)"]
3 [label="3 (را)"]
3 -> 1 [label="PREDEP"]
4 [label="4 (دیدیم)"]
4 -> 3 [label="OBJ"]
}


Lastly, we can also visualize the sentence using parentheses which show which words form the center of the sentence and then how they connect to the rest.

In [22]:
dependency_graph.tree().pprint()

(دیدیم (را (سگ زرد)))
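
To see where the parentheses come from, the same string can be rebuilt with a short recursive function. This is a sketch on a plain dictionary whose contents are copied from the `to_dot` output above. `nodes` and `render` are illustrative names, not part of `hazm` or `nltk`.

```python
# address -> (word, addresses of the words that depend on it), copied from the to_dot output
nodes = {
    1: ("سگ", [2]),
    2: ("زرد", []),
    3: ("را", [1]),
    4: ("دیدیم", [3]),
}

def render(address):
    word, dependents = nodes[address]
    # A word with no dependents is printed on its own
    if not dependents:
        return word
    # Otherwise open a parenthesis, print the word, then render each dependent in turn
    inner = " ".join(render(child) for child in sorted(dependents))
    return f"({word} {inner})"

print(render(4))
```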


#### Using a Dependency Graph

Now that we have created and displayed a dependency graph, we will navigate one to show how we can get the root and the objects of a sentence.

The key idea is that a dependency graph is made up of "nodes", where each node is an actual word of the sentence. Each node has a parent node (its `head`) and may have child nodes (its `deps`). The root is the node at the top, which here is the main verb.

In [23]:
# We will first visualize the root node
dependency_graph.root

{'address': 4,
 'word': 'دیدیم',
 'lemma': 'دید#بین',
 'ctag': 'V',
 'tag': 'V',
 'feats': '_',
 'head': 0,
 'deps': defaultdict(list, {'OBJ': [3]}),
 'rel': 'ROOT'}

We see here we can get the word from the `'word'` key and the relationship type by the `'rel'` key.

In [24]:
root_word = dependency_graph.root['word']
root_relationship = dependency_graph.root['rel']
print('Word: ', root_word)
print('Relationship: ', root_relationship)

Word:  دیدیم
Relationship:  ROOT


Now we will find the `OBJ` of the sentence and any word that directly connects to the object, as an object is often made up of multiple words. To do this, we perform a basic graph search called breadth-first search. The key idea is that we want to visit every node in the graph, so we start with the "root" node, add the nodes directly below it to a list, and continue until we have added and visited all the nodes.

In [25]:
dependency_graph.nodes[1]

{'address': 1,
 'word': 'سگ',
 'lemma': 'سگ',
 'ctag': 'Ne',
 'tag': 'Ne',
 'feats': '_',
 'head': 3,
 'deps': defaultdict(list, {'NPOSTMOD': [2]}),
 'rel': 'PREDEP'}

In [26]:
# we will create a list which we will add all the nodes to, starting with the root
nodes_to_search = [dependency_graph.root]
# we will store the OBJ of the sentence here
object_nodes = []

while len(nodes_to_search) > 0:
    # get and remove the first item from the nodes_to_search list
    # (pop(0) takes from the front, which gives breadth-first order)
    current_node = nodes_to_search.pop(0)
    is_node_object = False
    if current_node['rel'] == 'OBJ':
        object_nodes.append(current_node)
        # now indicate to add any directly connected words
        is_node_object = True

    
    # now add this node's dependencies to the nodes_to_search
    for dependency in current_node['deps'].values():
        for node_index in dependency:
            nodes_to_search.append(dependency_graph.nodes[node_index])
            # add the dependent nodes to object_nodes as well if the current node is an object node
            if is_node_object:
                object_nodes.append(dependency_graph.nodes[node_index])

print('Objects:')
for obj in object_nodes:
    print(obj['word'])

Objects:
را
سگ
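
Breadth-first search is commonly written with `collections.deque`, whose `popleft()` removes the first item efficiently. The sketch below runs on a plain dictionary that mirrors the `to_dot` output shown earlier. The `nodes` dictionary and both helper functions are illustrative and not part of `hazm` or `nltk`. Unlike the loop above, it collects the whole subtree under the object, so the adjective is included too.

```python
from collections import deque

# address -> (word, relation, addresses of dependents), copied from the to_dot output
nodes = {
    0: (None, None, [4]),
    1: ("سگ", "PREDEP", [2]),
    2: ("زرد", "NPOSTMOD", []),
    3: ("را", "OBJ", [1]),
    4: ("دیدیم", "ROOT", [3]),
}

def bfs_order(start):
    # Visit nodes level by level: take from the front, add dependents to the back
    queue = deque([start])
    visited = []
    while queue:
        address = queue.popleft()
        visited.append(address)
        queue.extend(nodes[address][2])
    return visited

def object_subtree_words():
    # Find the first OBJ node, then return every word at or below it
    for address in bfs_order(0):
        if nodes[address][1] == "OBJ":
            return [nodes[a][0] for a in bfs_order(address)]
    return []

print(bfs_order(0))
print(object_subtree_words())
```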


We will now analyze a more complex sentence.

In [27]:
complex_sentence = '.پسرم را به مدرسه بردم'
complex_graph = parser.parse(word_tokenize(complex_sentence))
print(complex_graph.to_dot())

digraph G{
edge [dir=forward]
node [shape=plaintext]

0 [label="0 (None)"]
0 -> 6 [label="ROOT"]
1 [label="1 (.)"]
2 [label="2 (پسرم)"]
3 [label="3 (را)"]
3 -> 2 [label="PREDEP"]
4 [label="4 (به)"]
4 -> 5 [label="POSDEP"]
5 [label="5 (مدرسه)"]
6 [label="6 (بردم)"]
6 -> 1 [label="PUNC"]
6 -> 3 [label="OBJ"]
6 -> 4 [label="VPP"]
}


In [28]:
complex_graph.tree().pprint()

(بردم . (را پسرم) (به مدرسه))


We will use the same code we used above, but this time we will make it a function so that we can easily get the objects for any sentence.

In [29]:
# Define function that takes in a sentence and then returns a "node" object for each direct object word
def compute_direct_objects(sentence):
    dependency_graph = parser.parse(word_tokenize(sentence))
    
    # we will create a list which we will add all the nodes to, starting with the root
    nodes_to_search = [dependency_graph.root]
    # we will store the OBJ of the sentence here
    object_nodes = []

    while len(nodes_to_search) > 0:
        # get and remove the first item from the nodes_to_search list
        # (pop(0) takes from the front, which gives breadth-first order)
        current_node = nodes_to_search.pop(0)
        is_node_object = False
        if current_node['rel'] == 'OBJ':
            object_nodes.append(current_node)
            # now indicate to add any directly connected words
            is_node_object = True


        # now add this node's dependencies to the nodes_to_search
        for dependency in current_node['deps'].values():
            for node_index in dependency:
                nodes_to_search.append(dependency_graph.nodes[node_index])
                # add the dependent nodes to object_nodes as well if the current node is an object node
                if is_node_object:
                    object_nodes.append(dependency_graph.nodes[node_index])    
    return object_nodes

In [30]:
objects = compute_direct_objects(complex_sentence)
print('Objects:')
for obj in objects:
    print(obj['word'])

Objects:
را
پسرم


Dependency parsing is a key concept in NLP as it is one of the first concepts taught that shows how computers can understand not only words, but how words are connected together. For example, "I showed the teacher my son" and "I showed my son the teacher" have different meanings and dependency parsing is one way we can automatically find this meaning.

### Example Usage

Dependency parsing builds on the concepts we have covered so far: tokenization to separate each word of a sentence, part-of-speech tagging to know what type of word each word is, and lemmatization to reduce each word to its base form. It then finds how these words actually connect.

We will now look at a two-sentence document. We will break it up into sentences using sentence tokenization, and then we will return the direct objects of each sentence using the function we created above.


In [31]:
# Prepare an example document with two sentences
example_document = '''من یک کتاب خوب می‌خوانم. من یک ماشین جدید خریدم.'''

# Prepare the hazm SentenceTokenizer
sentence_tokenizer = SentenceTokenizer()
sentences = sentence_tokenizer.tokenize(example_document)
print(sentences)

['من یک کتاب خوب می\u200cخوانم.', 'من یک ماشین جدید خریدم.']


In [32]:
# Now use our `compute_direct_objects` function to find the objects of each sentence.
for sentence in sentences:
    objects = compute_direct_objects(sentence)
    print(sentence)
    print([obj['word'] for obj in objects])

من یک کتاب خوب می‌خوانم.
['کتاب', 'یک', 'خوب']
من یک ماشین جدید خریدم.
['ماشین', 'یک', 'جدید']


## Using NLTK

Now that we have used `hazm` to do these NLP analyses on Persian text, we will learn how to do similar operations using `nltk`, in case we want to analyze text written in a different language. The function names are largely the same, so we will quickly go through each method.

In [33]:
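import nltk
# Note: this replaces the `word_tokenize` that `from hazm import *` provided earlier,
# so re-import it from hazm before re-running the Persian cells above
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser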

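Now we need to download some models so `nltk` can do tokenization and part-of-speech tagging. Depending on your NLTK version, the resource names may differ; recent releases may ask for `punkt_tab` and `averaged_perceptron_tagger_eng` instead.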

In [34]:
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/noah/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /home/noah/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Now we will run part-of-speech tagging and then chunking on an English example sentence.

In [35]:
# Example sentence in English
sentence = "I am learning Natural Language Processing with NLTK"

# Part-of-speech tagging
pos_tags = pos_tag(word_tokenize(sentence))
print('POS tags:', pos_tags)

POS tags: [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'VBG'), ('with', 'IN'), ('NLTK', 'NNP')]


In [36]:
# Chunking
chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = RegexpParser(chunk_grammar)
tree = chunk_parser.parse(pos_tags)
print('Chunks:', tree)

Chunks: (S
  I/PRP
  am/VBP
  learning/VBG
  Natural/NNP
  Language/NNP
  Processing/VBG
  with/IN
  NLTK/NNP)


As we can see, some of the part-of-speech tags are different from the ones we saw in `hazm` because the model we downloaded for `nltk` uses the Penn Treebank tag convention. Here "I" is a `PRP`, which stands for "personal pronoun". The list of all tags in this convention is available at [this NYU webpage](https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html). Notice also that the chunker did not group any words: our grammar only matches the common-noun tag `NN`, and the nouns in this sentence are all tagged `NNP` (proper noun).

To get all nouns, one needs to list all the tags that are noun types and then check whether each word's tag matches any of them.

In [37]:
# create a list of all the part of speech tags we want to match
# (PRP, the personal pronoun tag, is included so that pronouns such as "I" are returned too)
all_noun_types = ['NN', 'NNS', 'NNP', 'NNPS', 'PRP']

# now go through the tags and add any that are in our list
nouns = []
for tag in pos_tags:
    if tag[1] in all_noun_types:
        nouns.append(tag[0])

print('Nouns:', nouns)

Nouns: ['I', 'Natural', 'Language', 'NLTK']
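
Since the four noun tags in this convention all begin with `NN`, a prefix check is a compact alternative to listing them. In this minimal sketch the tagged list is copied from the output above so the block runs without downloading any model, and pronouns are collected separately rather than mixed in with the nouns.

```python
# Tagged tokens copied from the pos_tag output shown earlier
pos_tags = [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'),
            ('Language', 'NNP'), ('Processing', 'VBG'), ('with', 'IN'), ('NLTK', 'NNP')]

# NN, NNS, NNP and NNPS all start with 'NN'
nouns = [word for word, tag in pos_tags if tag.startswith('NN')]

# Personal pronouns are kept in their own list
pronouns = [word for word, tag in pos_tags if tag == 'PRP']

print('Nouns:', nouns)
print('Pronouns:', pronouns)
```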


## Named Entity Recognition

The final part of this introduction shows how to do Named Entity Recognition (NER) using the spaCy library. NER finds proper nouns in a sentence so that we can know which people, businesses and other entities are mentioned.

NER is a statistical approach, meaning that a model is trained on example data, so it may miss some entities depending on the quality of the model used. spaCy does not have a Persian-specific model, but it does have a multi-language model that can perform some Persian analysis.

In [38]:
import spacy

To use spaCy, we need to either train it ourselves which requires a powerful computer and lots of data, or pick an already trained model. For this lesson, we will use an already trained model, which we will need to first download using the following line. This model is trained on a wide variety of languages, so it can do some Persian but is not fully reliable.

In [39]:
!python3 -m spacy download xx_ent_wiki_sm

✔ Download and installation successful
You can now load the package via spacy.load('xx_ent_wiki_sm')


We can now load this already trained model into the variable `nlp`. We can then process any sentence by just calling `nlp(sentence)`.

In [40]:
nlp = spacy.load('xx_ent_wiki_sm')
doc = nlp("محمد در حال شنا است")

This computed many values about the sentence, including the entities in it, in this case a person's name. To view the entities, all we need to do is show `doc.ents`.

In [41]:
print(doc.ents)

(محمد در,)


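Notice that the model included the preposition "در" in the entity, a sign of its limited accuracy on Persian. To get the actual name text, we can go through each of the entities and get the `.text` attribute. There are some other attributes such as `.label`, which is an integer ID (the readable version is available as `.label_`), but those labels are not very useful for Persian.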

In [42]:
print([(ent.text, ent.label) for ent in doc.ents])

[('محمد در', 4317129024397789502)]


NER is a challenging problem and requires a lot of training data, so the above model will not always find all the names as it is a generic model that is not optimized for Persian. In the next NLP lesson we will learn more about the technology behind this, and what actually happens when we call `nlp(sentence)`.

## Conclusion

We have now learned core ideas of NLP that are required for both the most simple methods and the latest advancements in NLP. For example, tokenization is a core part of advanced models such as GPT-4.

We now know how to get every noun in a sentence which can allow analysis such as taking a large document and then getting the items mentioned in it. Dependency parsing then allows going a further step and doing things like finding every object of a specific verb.

Finally, we covered a basic use of `spaCy` to perform named entity recognition, which gives a first look at what modern NLP methods can do.