## Natural Language Processing (NLP) - A hands-on introduction

### Popular Libraries

- [NLTK](https://www.nltk.org/)
- [spaCy](https://spacy.io/)

**NLTK** and **spaCy** are free, open-source Python libraries for Natural Language Processing (NLP) that support teaching, research, and development. Both are:
  - Free and open source
  - Easy to use
  - Modular
  - Well documented
  - Simple and extensible

--------------------------

* In this notebook, I will walk through the basic NLP tasks needed to **process raw text and extract useful information**.
* For each task, we will use **NLTK as well as spaCy**. The good news is that both come preinstalled in Google Colab.

### Some definitions

- **Corpus** - **A collection of written texts**; *corpora* is the plural. The term is used mainly in NLP and other application domains that deal with **texts/documents**.
    - **Example:** A collection of news documents.

- **Dataset** - A collection of any kind of data (**image/video/text/numerical/mixed**). Unlike "corpus", the term is used in every application domain.

- **Lexicon** - A vocabulary, or a list of words and their meanings.
    - **Example:** English dictionary.

- **Token** - Each "entity" that results from splitting text according to a set of rules.
    - For example, each word is a token when a sentence is "tokenized" into words. Each sentence can also be a token, if you tokenize the sentences out of a paragraph.
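
To make these terms concrete, here is a minimal sketch in plain Python. The three "documents" are made up for illustration, and the splitting rule (lowercase, drop the final period, split on whitespace) is an assumption of this sketch, not how NLTK or spaCy tokenize.

```python
# A tiny, made-up corpus: a collection of written texts (three "documents").
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "Dogs and cats are pets.",
]

def simple_tokens(text):
    """Naive rule for this sketch: lowercase, drop the final period, split on spaces."""
    return text.lower().rstrip(".").split()

# Tokens: every piece produced by the splitting rule, duplicates included.
tokens = [tok for doc in corpus for tok in simple_tokens(doc)]

# Vocabulary: the distinct tokens (a lexicon would also attach meanings to them).
vocabulary = sorted(set(tokens))

print(len(corpus), "documents")
print(len(tokens), "tokens")
print(len(vocabulary), "distinct words")
```

Note that "cat" and "cats" count as different vocabulary entries here, which is one motivation for the normalization steps later in the notebook.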

## Tokenization

Tokenization is the process of breaking a stream of text up into **sentences, words, phrases, symbols, or other meaningful elements called tokens**.


In [19]:
import nltk
nltk.download('punkt')

# For tokenizing words and sentences
from nltk.tokenize import word_tokenize, sent_tokenize

s = "Good muffins cost $3.88\nin New York. Please buy me two of them.\n\nThanks."

print(sent_tokenize(s))
print(word_tokenize(s))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['Good muffins cost $3.88\nin New York.', 'Please buy me two of them.', 'Thanks.']
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']


In [20]:
import spacy

# Small spaCy model
nlp = spacy.load("en_core_web_sm")

doc = nlp("Good muffins cost $3.88\nin New York. Please buy me two of them.\n\nThanks.")

print("\n\nTokenized Sentences")

for i, sent in enumerate(doc.sents):
    print('-->Sentence %d: %s' % (i, sent.text))

print("\n\nTokenized Words")

tokens = [token.text for token in doc]
print(tokens)



Tokenized Sentences
-->Sentence 0: Good muffins cost $3.88
in New York.
-->Sentence 1: Please buy me two of them.


-->Sentence 2: Thanks.


Tokenized Words
['Good', 'muffins', 'cost', '$', '3.88', '\n', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', '\n\n', 'Thanks', '.']
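
To get a feel for what "splitting based on rules" means, the following is a rough regular-expression tokenizer using only the standard library. The two patterns are assumptions chosen for this sketch; the NLTK and spaCy tokenizers are considerably more sophisticated.

```python
import re

# Assumed rule set: a token is a number (optionally with a decimal part),
# a run of word characters, or a single non-space symbol.
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")

def regex_word_tokenize(text):
    """Return all non-overlapping matches of TOKEN_PATTERN, left to right."""
    return TOKEN_PATTERN.findall(text)

def regex_sent_tokenize(text):
    """Naive rule: split on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

s = "Good muffins cost $3.88\nin New York. Please buy me two of them.\n\nThanks."

print(regex_sent_tokenize(s))
print(regex_word_tokenize(s))

# The naive sentence rule breaks on abbreviations:
print(regex_sent_tokenize("Jim bought shares of Acme Corp. in 2006."))
```

On this particular input the sketch splits the same way as the NLTK cell, but the last line shows its limits: "Corp." is treated as the end of a sentence. Handling such cases is a large part of what the library tokenizers add.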


### Downloading Large spaCy model

In [3]:
!python -m spacy download en_core_web_lg
 
import en_core_web_lg
 
nlp = en_core_web_lg.load()

Collecting en_core_web_lg==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz (827.9MB)
     |████████████████████████████████| 827.9MB 1.2MB/s 
Building wheels for collected packages: en-core-web-lg
  Building wheel for en-core-web-lg (setup.py) ... done
  Created wheel for en-core-web-lg: filename=en_core_web_lg-2.2.5-cp37-none-any.whl size=829180944 sha256=8a5ac941d22f1b672be9a7dc3d254afcc459b9df0521dcee735b8386f09ca71d
  Stored in directory: /tmp/pip-ephem-wheel-cache-71k_i5rh/wheels/2a/c1/a6/fc7a877b1efca9bc6a089d6f506f16d3868408f9ff89f8dbfc
Successfully built en-core-web-lg
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


## Filtering stopwords

- **Stopwords** are common words that **generally** do not contribute much to the meaning of a sentence.
- Many search engines filter stopwords out of search queries and documents in order to **save space and time** in their index.
- Removing stopwords is not a hard and fast rule in NLP; it depends on the task at hand.
- For tasks like text classification, where the text is assigned to one of several categories, stopwords are usually removed so that **more focus can be given to the words that define the meaning of the text**.
- The [stopwords-iso](https://github.com/stopwords-iso/stopwords-iso) collection provides stopword lists for many languages, including **Bengali**.
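
The "save space and time" point can be illustrated with a toy inverted index (a map from each word to the documents containing it). The three documents and the seven-word stop list are invented for this sketch and are not NLTK's or spaCy's lists; real search indexes are far more elaborate.

```python
from collections import defaultdict

# Hand-picked stop list for this sketch only.
STOPWORDS = {"the", "is", "a", "of", "in", "and", "on"}

docs = {
    0: "the cat is on the mat",
    1: "a dog is in the garden",
    2: "the cat and the dog",
}

def build_index(documents, stopwords=frozenset()):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            if word not in stopwords:
                index[word].add(doc_id)
    return index

def posting_count(index):
    """Total number of (word, document) pairs stored in the index."""
    return sum(len(doc_ids) for doc_ids in index.values())

full_index = build_index(docs)
filtered_index = build_index(docs, STOPWORDS)

print("without filtering:", len(full_index), "words,", posting_count(full_index), "postings")
print("with filtering:   ", len(filtered_index), "words,", posting_count(filtered_index), "postings")
```

Even in this tiny example most of the index entries belong to stopwords, while the content words ("cat", "dog", "mat", "garden") are the ones a query would actually need.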

In [24]:
nltk.download('stopwords')
from nltk.corpus import stopwords

# Full list of English stopwords shipped with NLTK
english_stops = set(stopwords.words('english'))

print(english_stops)

print(len(english_stops))

words = ['The', 'natural', 'language', 'processing', 'is', 'very', 'interesting']
# Lowercase each word before the lookup, because the stopword list is all lowercase
filtered_words = [word for word in words if word.lower() not in english_stops]

print(filtered_words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
{'for', 'o', 'each', 'above', 'won', 'of', 'd', 'on', 'them', "couldn't", 'did', 'out', 'he', "you're", "it's", 'before', 'will', 'their', 'doing', 'can', 'were', 'below', "weren't", 'the', 'should', 'very', 'just', 'down', 'so', "shan't", 'do', 've', 'both', "she's", 's', 'am', 'yourselves', 'shouldn', "wasn't", "won't", 'i', 'myself', 'a', 'haven', 'mightn', "you'll", 'other', 'ain', 'there', 'own', 'an', 'have', 'no', 'such', 'ours', 'up', 'me', 'most', "isn't", 'against', 'is', 'had', 'll', "that'll", 'theirs', "hasn't", 'through', 'than', 'too', 'because', 'nor', 're', 'been', "shouldn't", 'herself', "should've", 'this', 'which', 'what', "hadn't", 'being', 'at', 'itself', 'when', 'why', 'ourselves', 't', 'from', 'not', 'now', 'him', 'between', 'its', 'or', "don't", 'her', 'wouldn', 'himself', 'but', 'are', 'whom', 'didn', 'hasn', 'you', 'it', 'having', 'then', 'if

In [25]:
from spacy.lang.en.stop_words import STOP_WORDS

# spaCy's built-in set of English stop words
spacy_stopwords = STOP_WORDS

print('Number of stop words: %d' % len(spacy_stopwords))
print('First ten stop words: %s' % list(spacy_stopwords)[:10])

Number of stop words: 326
First ten stop words: ['twenty', 'he', 'before', '‘d', 'thence', 'never', "'s", 'very', 'down', 'am']


In [26]:
doc = nlp("Good muffins cost $3.88\nin New York. Please buy me two of them.\n\nThanks.")

tokens = [token.text for token in doc if not token.is_stop]

print(tokens)

['Good', 'muffins', 'cost', '$', '3.88', '\n', 'New', 'York', '.', 'buy', '.', '\n\n', 'Thanks', '.']


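## Adding Custom Stopwords

The NLTK stopword list is loaded into an ordinary Python `set`, so entries can be removed or added to suit the task.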



In [28]:

english_stops = set(stopwords.words('english'))

print(english_stops)

# Customize the list: keep 'is' in the text and treat 'natural' as a stopword
english_stops.remove('is')
english_stops.add('natural')


words = ['The', 'natural', 'language', 'processing', 'is', 'very', 'interesting']
filtered_words = [word for word in words if word.lower() not in english_stops]

print(filtered_words)

{'for', 'o', 'each', 'above', 'won', 'of', 'd', 'on', 'them', "couldn't", 'did', 'out', 'he', "you're", "it's", 'before', 'will', 'their', 'doing', 'can', 'were', 'below', "weren't", 'the', 'should', 'very', 'just', 'down', 'so', "shan't", 'do', 've', 'both', "she's", 's', 'am', 'yourselves', 'shouldn', "wasn't", "won't", 'i', 'myself', 'a', 'haven', 'mightn', "you'll", 'other', 'ain', 'there', 'own', 'an', 'have', 'no', 'such', 'ours', 'up', 'me', 'most', "isn't", 'against', 'is', 'had', 'll', "that'll", 'theirs', "hasn't", 'through', 'than', 'too', 'because', 'nor', 're', 'been', "shouldn't", 'herself', "should've", 'this', 'which', 'what', "hadn't", 'being', 'at', 'itself', 'when', 'why', 'ourselves', 't', 'from', 'not', 'now', 'him', 'between', 'its', 'or', "don't", 'her', 'wouldn', 'himself', 'but', 'are', 'whom', 'didn', 'hasn', 'you', 'it', 'having', 'then', 'if', 'in', 'couldn', 'and', 'isn', "you'd", 'y', 'we', 'any', 'those', 'hadn', 'yours', 'm', 'with', 'about', 'as', 'was'
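
The same customization can also be written without mutating the original list, which helps when several tasks share one base list. This is a sketch: `base_stops` is a four-word stand-in invented here so the example runs without downloading NLTK data.

```python
# Stand-in for the library stopword set (an assumption for this sketch).
base_stops = {"the", "is", "very", "a"}

# Build a task-specific set: drop 'is', add 'natural'. base_stops is left untouched.
custom_stops = (base_stops - {"is"}) | {"natural"}

words = ['The', 'natural', 'language', 'processing', 'is', 'very', 'interesting']
filtered_words = [word for word in words if word.lower() not in custom_stops]

print(filtered_words)
print("is" in base_stops)  # the base set still contains 'is'
```

Here 'is' survives the filter and 'natural' is removed, while the base set stays available for other uses.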

## Edit Distance

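The edit distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one word into the other, for example a given (possibly misspelled) word into a suggested word.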

In [30]:
from nltk.metrics import edit_distance

print(edit_distance("Birthday","Bday"))

print(edit_distance("university", "varsity"))

4
4
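
For readers curious how such a number can be computed, below is a minimal dynamic-programming sketch of the Levenshtein distance, with a cost of 1 for each insertion, deletion, and substitution. It is an illustration written for this notebook, not NLTK's actual implementation.

```python
def levenshtein(source, target):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `source` into `target`."""
    # Row 0: turning the empty prefix of source into target[:j] takes j insertions.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        # Column 0: turning source[:i] into the empty string takes i deletions.
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]

print(levenshtein("Birthday", "Bday"))
print(levenshtein("university", "varsity"))
```

Only two rows of the table are kept at a time, so memory grows with the length of the second word rather than with the product of both lengths.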


## Removing Punctuation 

In [31]:
import string
import nltk

nltk.download('punkt')

puncset = list(string.punctuation)

print(puncset)

sentence = "Hun Sen's Cambodian can't People's Party won 64 of the 122 parliamentary seats in party July's elections, short of the two-thirds majority needed to form a government on its own."

sentence = sentence.lower()
print(sentence)
sentence = nltk.word_tokenize(sentence)
print(sentence)
sentence = [i for i in sentence if i not in puncset] # Drop tokens that are a single punctuation character
print(sentence)
sentence = [w for w in sentence if w.isalpha()] # Keep only purely alphabetic tokens (also drops numbers, "'s", "n't" and "two-thirds")
print(sentence)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
hun sen's cambodian can't people's party won 64 of the 122 parliamentary seats in party july's elections, short of the two-thirds majority needed to form a government on its own.
['hun', 'sen', "'s", 'cambodian', 'ca', "n't", 'people', "'s", 'party', 'won', '64', 'of', 'the', '122', 'parliamentary', 'seats', 'in', 'party', 'july', "'s", 'elections', ',', 'short', 'of', 'the', 'two-thirds', 'majority', 'needed', 'to', 'form', 'a', 'government', 'on', 'its', 'own', '.']
['hun', 'sen', "'s", 'cambodian', 'ca', "n't", 'people', "'s", 'party', 'won', '64', 'of', 'the', '122', 'parliamentary', 'seats', 'in', 'party', 'july', "'s", 'elections', 'short', 'of', 'the', 'two-thirds', 'majority', 'needed', 'to', 'form', 'a', 'gov
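
An alternative is to strip punctuation at the character level, before tokenizing, with `str.translate`. The sketch below uses a shortened version of the sentence above and only the standard library; which approach is preferable depends on the task.

```python
import string

# Translation table that deletes every ASCII punctuation character.
remove_punct = str.maketrans("", "", string.punctuation)

sentence = "Hun Sen's party won 64 of the 122 seats, short of the two-thirds majority."

cleaned = sentence.lower().translate(remove_punct)
print(cleaned.split())
```

Character-level removal also deletes apostrophes and hyphens inside words, so "sen's" becomes "sens" and "two-thirds" becomes "twothirds". The token-level filter in the cell above leaves "two-thirds" intact and keeps "'s" as a separate token until the `isalpha()` step.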

## Normalizing Text

The goal of both stemming and lemmatization is to **"normalize"** words
to their **common base form**, which is useful for many text-processing
applications.

 - **Stemming** = heuristically removing the affixes of a word to get its **stem (root)**.
    - It is a rule-based process of stripping suffixes **(“ing”, “ly”, “es”, “s”, etc.)** from a word.
 - **Lemmatization** = first determining the part of speech of a word, and then applying different normalization rules for each part of speech to obtain its dictionary form (the **lemma**).

Consider:
 - I was taking a **ride** in the car.
 - I was **riding** in the car.

Imagine every word in the English language, with every possible tense and affix you can put on it. **Having an individual dictionary entry for each version would be highly redundant and inefficient.** For example, the verb "eat" appears in several forms:

- Lisa **ate** the food and washed the dishes.
- They were **eating** noodles at a cafe.
- Don’t you want to **eat** before we leave?
- We have just **eaten** our breakfast.
- It also **eats** fruit and vegetables.

A human reader immediately recognizes all of these as forms of the same verb. Unfortunately, that is not the case with machines. **They treat these words differently**. Therefore, we need to normalize them to their root word, which is **“eat”** in our example.


### Stemming

 - One of the **most popular** stemming algorithms is the **Porter stemmer**, which has been around **since 1979**.
 - Other stemming algorithms provided by NLTK include the **Lancaster Stemmer** and the **Snowball Stemmer**.

In [33]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

example_words = ["python","pythoner","pythoning","pythoned","pythonly"]

for w in example_words:
  print(stemmer.stem(w))

python
python
python
python
pythonli
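
The rule-based idea behind stemming can be sketched in a few lines. The suffix list and the minimum stem length below are assumptions of this toy example; it is deliberately much cruder than the Porter algorithm and is not a replacement for it.

```python
# Toy suffix rules, tried in this order. NOT the Porter algorithm.
SUFFIXES = ("ing", "ly", "es", "ed", "er", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, as long as enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["pythoning", "pythoned", "pythons", "riding", "eats", "believes"]:
    print(w, "->", toy_stem(w))
```

"riding" is reduced to "rid" and "believes" to "believ", which shows how blindly stripping suffixes can produce stems that are different words or not words at all.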


### Lemmatization

NLTK's `lemmatize` method takes a part-of-speech parameter, `pos`. **If it is not supplied,
the default is "noun".**

In [34]:
## Lemmatization using NLTK

from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize('cooking'))
print(lemmatizer.lemmatize('cooking', pos='v'))  # noun = n, verb = v, adjective = a

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
cooking
cook


In [35]:
## Lemmatization using spaCy

doc = nlp('Jim bought 300 shares of Acme Corp. in 2006.')

lemma_words = [token.lemma_ for token in doc]

print(lemma_words)

['Jim', 'buy', '300', 'share', 'of', 'Acme', 'Corp.', 'in', '2006', '.']
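
Why the part of speech matters can be sketched with a toy lookup table keyed on (word, POS). The table entries and the `toy_lemmatize` helper are hypothetical and exist only for this illustration; real lemmatizers rely on a full dictionary such as WordNet together with morphological rules.

```python
# Hypothetical mini-lexicon: (word form, POS) -> lemma.
LEMMA_TABLE = {
    ("cooking", "v"): "cook",
    ("ate", "v"): "eat",
    ("eaten", "v"): "eat",
    ("eating", "v"): "eat",
    ("eats", "v"): "eat",
    ("shares", "n"): "share",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos); unknown pairs come back unchanged."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("cooking"))           # looked up as a noun: unchanged
print(toy_lemmatize("cooking", pos="v"))  # looked up as a verb: 'cook'
print([toy_lemmatize(w, pos="v") for w in ["ate", "eating", "eaten", "eats"]])
```

As with the NLTK cell above, "cooking" stays as it is under the default noun reading and becomes "cook" only when it is looked up as a verb.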


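### Comparison between stemming and lemmatization

The major difference between the two is that, as you saw earlier, **stemming
can often create non-existent words**, whereas **lemmas are actual
words** that you can look up in an English dictionary. In the example below, the lemmatizer is called without a `pos` argument, so it treats "believes" as a noun and returns "belief"; with `pos='v'` it would return "believe".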

In [36]:
print(stemmer.stem('believes'))
print(lemmatizer.lemmatize('believes'))

believ
belief


## Part-of-speech Tagging

The English language is made up of different parts of speech (POS) such as nouns, verbs, pronouns, and adjectives. POS tagging analyzes the words in a sentence and assigns each one a POS tag depending on how it is used.

Full [tag list](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

### Penn Treebank Part-of-Speech Tags

<div align="center">
<img src="https://drive.google.com/uc?id=18MqGTRZcK3jYd5Ix8BOaODE-6SGcn_CC" width="700" height="380">
</div>

In [38]:
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

nltk.download('averaged_perceptron_tagger')

words = word_tokenize('Jim bought 300 shares of Acme Corp. in 2006.')

tagged_words = pos_tag(words)

print(tagged_words)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[('Jim', 'NNP'), ('bought', 'VBD'), ('300', 'CD'), ('shares', 'NNS'), ('of', 'IN'), ('Acme', 'NNP'), ('Corp.', 'NNP'), ('in', 'IN'), ('2006', 'CD'), ('.', '.')]


In [39]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp('Jim bought 300 shares of Acme Corp. in 2006.')

for token in doc:
    print(token.text, token.pos_, token.tag_)

Jim PROPN NNP
bought VERB VBD
300 NUM CD
shares NOUN NNS
of ADP IN
Acme PROPN NNP
Corp. PROPN NNP
in ADP IN
2006 NUM CD
. PUNCT .
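
POS tags are often fed back into lemmatization. A common pattern is to map each Penn Treebank tag to one of the single-letter codes that NLTK's WordNet lemmatizer accepts (`n`, `v`, `a`, and `r` for adverbs). The helper below is a sketch of such a mapping; its name and the choice to fall back to noun for every other tag are assumptions of this example.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code (hypothetical helper)."""
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # assumed fallback: treat everything else as a noun

# A few (word, tag) pairs copied from the NLTK output above.
tagged_words = [('Jim', 'NNP'), ('bought', 'VBD'), ('300', 'CD'), ('shares', 'NNS'), ('of', 'IN')]

for word, tag in tagged_words:
    print(word, tag, penn_to_wordnet(tag))
```

With this mapping, "bought" would be lemmatized as a verb rather than under the default noun reading.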


## Named-entity Recognition

Named-entity recognition is a subtask of information extraction that
seeks to locate and classify elements in text into pre-defined
categories such as the names of **persons**, **organizations**, **locations**,
**expressions of times**, **quantities**, **monetary values**, **percentages**, etc.

**NE types and examples:**

 - **ORGANIZATION** - Georgia-Pacific Corp., WHO
 - **PERSON** - Eddy Bonte, President Obama
 - **LOCATION** - Murray River, Mount Everest
 - **DATE** - June, 2008-06-29
 - **TIME** - two fifty a m, 1:30 p.m.
 - **MONEY** - 175 million Canadian Dollars, GBP 10.40
 - **PERCENT** - twenty pct, 18.75 %
 - **FACILITY** - Washington Monument, Stonehenge
 - **GPE** - South East Asia, Midlothian

In [40]:
from nltk import pos_tag, ne_chunk
from nltk.tokenize import wordpunct_tokenize

nltk.download('maxent_ne_chunker')
nltk.download('words')

sent = 'Jim bought 300 shares of Acme Corp. in 2006.'

print(ne_chunk(pos_tag(wordpunct_tokenize(sent))))

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
(S
  (PERSON Jim/NNP)
  bought/VBD
  300/CD
  shares/NNS
  of/IN
  (ORGANIZATION Acme/NNP Corp/NNP)
  ./.
  in/IN
  2006/CD
  ./.)


In [41]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Jim bought 300 shares of Acme Corp. in 2006.")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Jim 0 3 PERSON
300 11 14 CARDINAL
Acme Corp. 25 35 ORG
2006 39 43 DATE
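
The `start_char` and `end_char` values are character offsets into the original string, with the end offset exclusive, just like a Python slice. The sketch below checks this against the output above and groups the entities by label. The tuples are copied by hand from that output, so the example runs without spaCy.

```python
from collections import defaultdict

text = "Jim bought 300 shares of Acme Corp. in 2006."

# (text, start_char, end_char, label) tuples copied from the spaCy output above.
entities = [
    ("Jim", 0, 3, "PERSON"),
    ("300", 11, 14, "CARDINAL"),
    ("Acme Corp.", 25, 35, "ORG"),
    ("2006", 39, 43, "DATE"),
]

by_label = defaultdict(list)
for ent_text, start, end, label in entities:
    # Slicing the original string with the offsets recovers the entity text.
    assert text[start:end] == ent_text
    by_label[label].append(ent_text)

print(dict(by_label))
```

Keeping the offsets makes it possible to highlight or replace entities in the original text without re-tokenizing it.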
