# Introduction to NLP

## What is NLP?

Natural Language Processing (NLP) is a field of computer science that focuses on the interaction between human language and computers. NLP enables computers to understand, interpret, and generate human language. It has many applications in journalism, from performing complex analysis over written documents to generating parts of articles. In this notebook, we will cover the core concepts of NLP before moving on to some advanced applications.

* **Setup and Packages**: There are several key NLP packages that provide a lot of functionality and are important to know. We will provide a basic overview of each, and will then use them in the following sections.
* **Tokenization**: This step takes the original text and breaks it up into pieces such as words, which is a prerequisite for many algorithms.
* **Part-of-Speech (POS) Tagging**: Grammar is a core part of every language, and identifying the nouns, verbs, and other parts of speech in a sentence is a foundational task for understanding language.
* **Dependency Parsing**: Dependency parsing is similar to POS tagging, but instead of just finding what part of speech each word is, it finds how the words are connected. We can use this, for example, to find what part of the sentence the verb acts on: in the sentence "The boy kicks the ball," the direct object of "kicks" is "ball".
* **Named Entity Recognition (NER)**: Knowing who was mentioned in a text can be a very useful analysis, and we will cover how to extract names of people and businesses.

In the next section, we will cover representing words as "vectors", sentiment analysis, meaning similarity, and text generation.

## Setup and Packages

There are four key NLP packages we will learn and use: NLTK, Hazm, spaCy, and Gensim.

### NLTK

NLTK (Natural Language Toolkit) is one of the most widely used Python NLP packages. It provides many utilities for all the common NLP tasks. We will use this extensively in this notebook.

We will import it as follows.

In [1]:
import nltk

### Hazm

For Persian NLP analysis, Hazm is a library that provides many core NLP algorithms for Persian text and is compatible with NLTK.

In [2]:
import hazm

### spaCy

spaCy is a modern and efficient library that is used in many production applications. It can be used both for small analyses and for building a complex NLP system that handles thousands of documents quickly.

In [3]:
import spacy

## Tokenization

One foundation of NLP is tokenization. Tokenization is the process of splitting text into individual words, phrases, or symbols, known as tokens. It is typically the first step in an NLP pipeline, as it enables us to analyze text at a more granular level. Just as humans read text or follow speech by understanding each word individually, computers can find these individual words and then analyze them.
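
Before turning to a dedicated library, the idea can be sketched with nothing but Python's built-in `re` module. This is a deliberately naive illustration and not how `hazm` tokenizes; the helper name `simple_tokenize` is made up for this example.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Naive sketch: each run of word characters becomes one token, and
    every other non-space character becomes a token of its own.
    """
    return re.findall(r"\w+|[^\w\s]", text)

# Sample Persian sentence
sentence = "این یک جمله نمونه است."

tokens = simple_tokenize(sentence)
print(tokens)
print(len(tokens))
```

A real tokenizer does more than this. For example, `hazm` can join multi-word verb forms into a single token, which a plain regular expression cannot do.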

In [4]:
from hazm import *

tokenizer = WordTokenizer()

# Sample Persian sentence
sentence = "این یک جمله نمونه است."

# Tokenize the sentence
tokens = tokenizer.tokenize(sentence)

# Print the tokens
print(tokens)

['این', 'یک', 'جمله', 'نمونه', 'است', '.']


As we can see, the sentence is now split into individual tokens: the words plus the final period. We can now perform an analysis such as getting the total token count or finding the most common words.

In [5]:
# Show the token count (the final period counts as a token)
print(len(tokens))

6


#### Sentence Tokenization

Some tasks also require analyzing a document sentence by sentence. To do this, we split the text into sentences.

In [6]:
# Initialize the sentence tokenizer
sentence_tokenizer = SentenceTokenizer()

# Sample Persian text
text = "سلام دنیا! این یک متن فارسی است. امیدوارم که این درس مفید باشد."

# Tokenize the text into sentences
sentences = sentence_tokenizer.tokenize(text)

# Print the sentences
print(sentences)

['سلام دنیا!', 'این یک متن فارسی است.', 'امیدوارم که این درس مفید باشد.']


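Now, as an exercise, see how we can do word tokenization for each of the sentences. The first way uses a `for` loop and the second uses a list comprehension, a more compact construct that is not necessary to know. Both print the same result. Note that the output shown below lists the sentences of the longer sample text from the Example Usage section, because the cell was run after that section had redefined `sentences`.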

In [22]:
# Tokenize the words in each sentence
tokenized_sentences = []
for sentence in sentences:
  tokenized_sentences.append(tokenizer.tokenize(sentence))
print('Tokenized:', tokenized_sentences)

# Tokenizing each sentence using list comprehension
print('Tokenized:', [tokenizer.tokenize(sentence) for sentence in sentences])

Tokenized: [['گربه', 'در', 'خانه', 'است', '.'], ['او', 'به', 'دنبال', 'موش', 'است', '.'], ['موش', 'در', 'حال', 'حاضر', 'در', 'حیاط', 'پنهان', 'شده_است', '.'], ['گربه', 'دوباره', 'به', 'دنبال', 'موش', 'است', '.'], ['موش', 'سعی', 'می', 'کند', 'از', 'گربه', 'فرار', 'کند', '.'], ['گربه', 'در', 'خانه', 'است', '.'], ['گربه', 'بازی', 'می', 'کند', '.'], ['موش', 'بازی', 'می', 'کند', '.'], ['گربه', 'در', 'خانه', 'است', '.'], ['گربه', 'به', 'دنبال', 'موش', 'است', '.']]
Tokenized: [['گربه', 'در', 'خانه', 'است', '.'], ['او', 'به', 'دنبال', 'موش', 'است', '.'], ['موش', 'در', 'حال', 'حاضر', 'در', 'حیاط', 'پنهان', 'شده_است', '.'], ['گربه', 'دوباره', 'به', 'دنبال', 'موش', 'است', '.'], ['موش', 'سعی', 'می', 'کند', 'از', 'گربه', 'فرار', 'کند', '.'], ['گربه', 'در', 'خانه', 'است', '.'], ['گربه', 'بازی', 'می', 'کند', '.'], ['موش', 'بازی', 'می', 'کند', '.'], ['گربه', 'در', 'خانه', 'است', '.'], ['گربه', 'به', 'دنبال', 'موش', 'است', '.']]
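
Once each sentence is a list of tokens, sentence-by-sentence statistics become simple list operations. The sketch below copies the first three tokenized sentences from the output above and computes the number of tokens in each; the variable names are chosen for this example only.

```python
# First three tokenized sentences, copied from the output above
tokenized_sentences = [
    ['گربه', 'در', 'خانه', 'است', '.'],
    ['او', 'به', 'دنبال', 'موش', 'است', '.'],
    ['موش', 'در', 'حال', 'حاضر', 'در', 'حیاط', 'پنهان', 'شده_است', '.'],
]

# Number of tokens in each sentence (punctuation included)
sentence_lengths = [len(tokens) for tokens in tokenized_sentences]
print(sentence_lengths)

# Index of the sentence with the most tokens
longest_index = max(range(len(tokenized_sentences)),
                    key=lambda i: sentence_lengths[i])
print(longest_index)
```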


#### Example Usage

Now we will take an example with several sentences and return the three most common word tokens, and then the longest sentence. We use the `Counter` class from the `collections` module, which will help us get the most common words.


In [8]:
# Import required libraries
from hazm import WordTokenizer, SentenceTokenizer
from collections import Counter

# Sample Persian text
text = """
گربه در خانه است. او به دنبال موش است. موش در حال حاضر در حیاط پنهان شده است.
گربه دوباره به دنبال موش است. موش سعی می کند از گربه فرار کند. گربه در خانه است.
گربه بازی می کند. موش بازی می کند. گربه در خانه است. گربه به دنبال موش است.
"""

# Initialize the tokenizers
word_tokenizer = WordTokenizer()
sentence_tokenizer = SentenceTokenizer()

# Tokenize the text into words and sentences
words = word_tokenizer.tokenize(text)
sentences = sentence_tokenizer.tokenize(text)

# Count the occurrences of each word
word_counter = Counter(words)

# Find the top three most common words
top_three_common_words = word_counter.most_common(3)

# Find the longest sentence
longest_sentence = max(sentences, key=len)

# Print the results
for i, (word, count) in enumerate(top_three_common_words, start=1):
    print(f"{i}. '{word}' appears {count} times.")
print(f"Longest sentence: '{longest_sentence}'")

1. '.' appears 10 times.
2. 'گربه' appears 7 times.
3. 'است' appears 6 times.
Longest sentence: 'موش در حال حاضر در حیاط پنهان شده است.'


Notice here that `.` is counted as its own token, so we would need to remove it from the tokens if we don't want it included or counted.
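
One way to do this is to filter the tokens against a set of punctuation marks before counting. The sketch below uses a short token list copied from the first two tokenized sentences shown earlier. The set of Persian punctuation marks is an assumption made for this example and is not exhaustive.

```python
from collections import Counter
import string

# Tokens copied from the first two tokenized sentences shown earlier
tokens = ['گربه', 'در', 'خانه', 'است', '.',
          'او', 'به', 'دنبال', 'موش', 'است', '.']

# ASCII punctuation plus a few common Persian marks (assumed, not exhaustive)
punctuation = set(string.punctuation) | {'،', '؛', '؟', '«', '»'}

# Keep only the tokens that are not punctuation marks
word_tokens = [token for token in tokens if token not in punctuation]

word_counter = Counter(word_tokens)
print(word_counter.most_common(1))
```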

## Part-of-Speech Tagging

POS tagging is the process of assigning a grammatical category, or part of speech, to each word in a sentence. This is useful for filtering: for example, we can drop function words such as "and" and keep only the nouns.

To do this, the `hazm` project provides a trained tagger model which can be downloaded from their [GitHub page](https://github.com/roshan-research/hazm). It has already been downloaded and placed in the `resources/` folder of this repository. We are two directories away from the repository root, so we need to put `../../` first in the path to reach it.

In [10]:
from hazm import POSTagger

# Initialize the tagger
tagger = POSTagger(model='../../resources/postagger.model')

# Sample Persian sentence
sentence = "گربه در حال بازی کردن با موش است"

# Tokenize the sentence
tokenizer = WordTokenizer()
tokens = tokenizer.tokenize(sentence)

# Perform POS tagging
pos_tags = tagger.tag(tokens)

# Print the POS tags
print(pos_tags)

[('گربه', 'N'), ('در', 'P'), ('حال', 'Ne'), ('بازی', 'N'), ('کردن', 'N'), ('با', 'P'), ('موش', 'N'), ('است', 'V')]


Notice here that each word now has a part of speech associated with it. We can see that `N` marks the nouns and `V` marks the verb.

To find all the nouns (and, in the first version, the verbs as well), we can use either of the following approaches.

In [19]:
all_nouns = []
all_verbs = []
for pos_tag in pos_tags:
    if pos_tag[1] == 'N':
        all_nouns.append(pos_tag[0])
    elif pos_tag[1] == 'V':
        all_verbs.append(pos_tag[0])
print('Nouns:', all_nouns)
print('Verbs:', all_verbs)

# or an equivalent, more compact way using a list comprehension
print('Nouns:', [pos_tag[0] for pos_tag in pos_tags if pos_tag[1] == 'N'])

Nouns: ['گربه', 'بازی', 'کردن', 'موش']
Verbs: ['است']
Nouns: ['گربه', 'بازی', 'کردن', 'موش']
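
Instead of writing one `if` branch per tag, the words can also be grouped by whatever tags appear. The sketch below reuses the tagged output shown above and collects the words under each tag with a `defaultdict`. Note that `حال` carries the tag `Ne`, which is distinct from `N`, so an exact comparison with `'N'` skips it.

```python
from collections import defaultdict

# Tagged tokens copied from the tagger output above
pos_tags = [('گربه', 'N'), ('در', 'P'), ('حال', 'Ne'), ('بازی', 'N'),
            ('کردن', 'N'), ('با', 'P'), ('موش', 'N'), ('است', 'V')]

# Map each tag to the list of words that received it
words_by_tag = defaultdict(list)
for word, tag in pos_tags:
    words_by_tag[tag].append(word)

for tag, words in words_by_tag.items():
    print(tag, words)
```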


The above is useful for finding individual words, but another useful approach is to find the different phrases in a sentence, such as the noun phrases and verb phrases. A "chunker" is a model that can find the groups of words that together make up the individual phrases in a sentence. `hazm` provides a chunker model as well, which we will now use.

Once again, we need to load the chunker model which has already been downloaded and placed in the `resources/` folder.

In [25]:
chunker = Chunker(model='../../resources/chunker.model')
tagged = tagger.tag(word_tokenize('گربه در حال بازی کردن با موش است'))
tree2brackets(chunker.parse(tagged))

'[گربه NP] [در PP] [حال بازی کردن NP] [با PP] [موش NP] [است VP]'

Here "VP" stands for "verb phrase", "NP" stands for "noun phrase" and "PP" stands for "prepositional phrase".

We will now chunk a simple sentence to show how it breaks into just three phrases.

In [29]:
simple_sentence = "سگ زرد را دیدیم"
tagged = tagger.tag(word_tokenize(simple_sentence))
tree2brackets(chunker.parse(tagged))

'[سگ زرد NP] [را POSTP] [دیدیم VP]'

`tree2brackets` provides an easy way to view the sentence broken into phrases, but we can also view the original data that the chunker extracts. It has what is called a "tree" structure.

In [33]:
chunked = chunker.parse(tagged)
print(chunked)

(S (NP سگ/Ne زرد/AJ) (POSTP را/POSTP) (VP دیدیم/V))


In [38]:
# Get the first phrase
print(chunked[0])
# Get the first word of the first phrase
print(chunked[0][0])

(NP سگ/Ne زرد/AJ)
('سگ', 'Ne')


If we want to get the verb phrase, we can loop over the chunks and keep the one whose type is `'VP'`. The type of each chunk is available through its `.label()` method.

In [47]:
verb_phrase = None
for chunk in chunked:
    if chunk.label() == 'VP':
        verb_phrase = chunk
print(verb_phrase)

(VP دیدیم/V)


Understanding the structure of language is key for successful NLP, and we have now learned several ways that we can automatically find the part-of-speech and the phrase structure of sentences.

## Dependency Parsing

We have now learned some core NLP analysis approaches, but these focused on finding either the part of speech of individual words or a few groupings of words. We now want to find how the words connect to each other. This is called dependency parsing, and it lets us find, for example, which item in a sentence is the direct object of the verb and which noun is the subject, along with all the other dependencies that make up a sentence.

Before we can do dependency parsing though, there is one more concept to quickly learn.

### Lemmatization

Lemmatization is the process of taking different forms of the same word and grouping them together under a common base form, called the lemma. For example, "sell" and "sells" are both in the same group.

In [91]:
lemmatizer = Lemmatizer()
print(lemmatizer.lemmatize('سگ'))
print(lemmatizer.lemmatize('سگها'))

سگ
سگ
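
To see why this grouping matters for counting, consider the small sketch below. The lookup table `LEMMAS` and the helper `toy_lemmatize` are hypothetical stand-ins invented for this example; a real lemmatizer such as the one above relies on far richer linguistic resources.

```python
from collections import Counter

# Hypothetical lookup table standing in for a real lemmatizer
LEMMAS = {"sells": "sell", "sold": "sell", "selling": "sell"}

def toy_lemmatize(word):
    """Return the base form of a word, or the word itself if it is unknown."""
    return LEMMAS.get(word, word)

words = ["sell", "sells", "sold", "buy"]

# Counting lemmas merges the different forms of "sell" into one entry
lemma_counts = Counter(toy_lemmatize(word) for word in words)
print(lemma_counts)
```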


### Dependency Parsing

We are now ready to perform dependency parsing. We will take the simple sentence from before and parse out its dependencies.

Note: `java` needs to be installed for the dependency parser to work. If you receive an error when trying to run the cell, that is fine; just try to understand what the code is doing. If you need to do dependency parsing, install [Open JDK](https://openjdk.org/). The specific instructions depend on which operating system you are using.

In [98]:
simple_sentence = 'سگ زرد را دیدیم'
parser = DependencyParser(tagger=tagger, lemmatizer=lemmatizer, working_dir='../../resources/')
print(parser.parse(word_tokenize(simple_sentence)))

defaultdict(<function DependencyGraph.__init__.<locals>.<lambda> at 0x7f75ac37a5e0>,
            {0: {'address': 0,
                 'ctag': 'TOP',
                 'deps': defaultdict(<class 'list'>, {'ROOT': [4]}),
                 'feats': None,
                 'head': None,
                 'lemma': None,
                 'rel': None,
                 'tag': 'TOP',
                 'word': None},
             1: {'address': 1,
                 'ctag': 'Ne',
                 'deps': defaultdict(<class 'list'>, {'NPOSTMOD': [2]}),
                 'feats': '_',
                 'head': 3,
                 'lemma': 'سگ',
                 'rel': 'PREDEP',
                 'tag': 'Ne',
                 'word': 'سگ'},
             2: {'address': 2,
                 'ctag': 'AJ',
                 'deps': defaultdict(<class 'list'>, {}),
                 'feats': '_',
                 'head': 1,
                 'lemma': 'زرد',
                 'rel': 'NPOSTMOD',
                 'tag': '

The raw output is hard to read, so it helps to display it in an easier-to-interpret format.
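
A minimal sketch of such a display is given below. It works on a plain dictionary holding only the fields visible in the truncated output above (nodes 0 to 2), and the helper name `describe_nodes` is made up for this example. With a real parse result, the assumption is that the same mapping is available as the `.nodes` attribute, as it is on NLTK's `DependencyGraph`.

```python
# Fields copied from the truncated parser output above; only nodes 0-2 are visible there
nodes = {
    0: {'address': 0, 'word': None, 'head': None, 'rel': None},
    1: {'address': 1, 'word': 'سگ', 'head': 3, 'rel': 'PREDEP'},
    2: {'address': 2, 'word': 'زرد', 'head': 1, 'rel': 'NPOSTMOD'},
}

def describe_nodes(nodes):
    """Return one tab-separated line per token: address, word, relation, head."""
    lines = []
    for address in sorted(nodes):
        node = nodes[address]
        if node['word'] is None:  # skip the artificial TOP node
            continue
        lines.append(f"{address}\t{node['word']}\t{node['rel']}\thead={node['head']}")
    return lines

for line in describe_nodes(nodes):
    print(line)
```

Each line reads as "this word is attached to the word at address `head` through the relation `rel`". For example, word 2 is attached to word 1 with the relation `NPOSTMOD`.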

## Using NLTK

Now that we have used `hazm` to do these NLP analyses on Persian text, we will learn how to do similar operations using `nltk` in case we want to analyze a text written in a different language. The names are basically the same, so we will quickly go through each method.

In [62]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser

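Now we need to download some models so `nltk` can do part-of-speech tagging. The `word_tokenize` function also relies on a tokenizer resource (`punkt`, or `punkt_tab` in recent NLTK releases), which can be downloaded the same way if it is missing.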

In [58]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/noah/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

Now we will run part-of-speech tagging and then chunking on an English example sentence.

In [71]:
# Example sentence in English
sentence = "I am learning Natural Language Processing with NLTK"

# Part-of-speech tagging
pos_tags = pos_tag(word_tokenize(sentence))
print('POS tags:', pos_tags)

POS tags: [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'VBG'), ('with', 'IN'), ('NLTK', 'NNP')]


In [72]:
# Chunking
chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = RegexpParser(chunk_grammar)
tree = chunk_parser.parse(pos_tags)
print('Chunks:', tree)

Chunks: (S
  I/PRP
  am/VBP
  learning/VBG
  Natural/NNP
  Language/NNP
  Processing/VBG
  with/IN
  NLTK/NNP)


As we can see, some of the part-of-speech tags are different from the ones we saw in `hazm` because the model we downloaded uses the Penn Treebank tag set. Here "I" is a `PRP`, which stands for "personal pronoun". The list of all part-of-speech types in the Penn tag set is available at [this NYU webpage](https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html). Also note that the chunker found no noun phrases here: the grammar only matches the tag `NN`, and none of the words in this sentence received that tag.

To get all nouns, one needs to list all the part-of-speech tags that are noun types and then check whether each word's tag matches any of them.

In [73]:
# create a list of all the part of speech tags we want to match
all_noun_types = ['NN', 'NNS', 'NNP', 'NNPS', 'PRP']

# now go through the tags and add any that are in our list
nouns = []
for tag in pos_tags:
    if tag[1] in all_noun_types:
        nouns.append(tag[0])

print('Nouns:', nouns)

Nouns: ['I', 'Natural', 'Language', 'NLTK']
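
Since all the Penn noun tags in the list above start with `NN`, the same selection can be written as a prefix check. The sketch below reuses the tagged output shown earlier, keeps `PRP` as an explicit extra case to match the list above, and produces the same result.

```python
# Tagged tokens copied from the POS tagging output above
pos_tags = [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'),
            ('Language', 'NNP'), ('Processing', 'VBG'), ('with', 'IN'), ('NLTK', 'NNP')]

# Keep words whose tag starts with 'NN' (NN, NNS, NNP, NNPS) or is a personal pronoun
nouns = [word for word, tag in pos_tags if tag.startswith('NN') or tag == 'PRP']

print('Nouns:', nouns)
```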


## Named Entity Recognition

The final part of this introduction will now show how to do Named Entity Recognition (NER) using the spaCy library. NER finds proper nouns in a sentence so that we can know which people, businesses and other entities are mentioned.

NER is a statistical approach: the models are trained on data, so they may not find all entities, depending on the quality of the model used. spaCy's Persian option usually does well.

In [23]:
import spacy