# NLP 101
This notebook covers both classical and deep-learning NLP techniques.
Let's first import the packages we will be using. We start with NLTK, the Natural Language Toolkit, to demonstrate some classical NLP capabilities. These remain important because we can still use them as preprocessing steps in conjunction with our ML/DL models.

In [None]:
import nltk
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...


True

## A. Classical NLP
Let's first define an input paragraph that we want to analyse.

In [None]:
# Feel free to change the text - this is from Wikipedia
text = "Tower Bridge is a drawbridge in London. It crosses the River Thames near the Tower of London. It allows ships through the bridge deck when is raised at an angle in the centre. The north side of the bridge is Tower Hill, and the south side of the bridge comes down into Bermondsey, an area in Southwark. Tower Bridge is far more visible than London Bridge, which people often mistake it for. Many tourists go to London to see the Tower Bridge. It has its own exhibition centre in the horizontal walkway. This gives one of the best vantage points in London."

### 1. Sentence segmentation
Split the text into individual sentences.

In [None]:
sentences = nltk.sent_tokenize(text)
print(sentences)

['Tower Bridge is a drawbridge in London.', 'It crosses the River Thames near the Tower of London.', 'It allows ships through the bridge deck when is raised at an angle in the centre.', 'The north side of the bridge is Tower Hill, and the south side of the bridge comes down into Bermondsey, an area in Southwark.', 'Tower Bridge is far more visible than London Bridge, which people often mistake it for.', 'Many tourists go to London to see the Tower Bridge.', 'It has its own exhibition centre in the horizontal walkway.', 'This gives one of the best vantage points in London.']
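To see why a dedicated segmenter is worth using, here is a minimal sketch of a naive rule-based splitter built on the standard library only. The function name `naive_sent_split` and the sample strings are illustrative assumptions and are not part of NLTK.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows (a naive rule)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

sample = "Tower Bridge is a drawbridge in London. It crosses the River Thames."
print(naive_sent_split(sample))

# The naive rule also splits after abbreviations such as 'Dr.',
# which is the kind of case a proper sentence tokenizer has to handle.
tricky = "Dr. Smith visited London. He liked it."
print(naive_sent_split(tricky))
```

The second call splits 'Dr.' off as its own sentence, which shows how quickly a simple punctuation rule runs into trouble.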


### 2. Tokenization
Split the text into individual words. If this seems too easy, look up tokenization in Japanese, which cannot rely on spaces to separate words (search for 'MeCab Japanese').

In [None]:
words = nltk.word_tokenize(text)
print(words)

['Tower', 'Bridge', 'is', 'a', 'drawbridge', 'in', 'London', '.', 'It', 'crosses', 'the', 'River', 'Thames', 'near', 'the', 'Tower', 'of', 'London', '.', 'It', 'allows', 'ships', 'through', 'the', 'bridge', 'deck', 'when', 'is', 'raised', 'at', 'an', 'angle', 'in', 'the', 'centre', '.', 'The', 'north', 'side', 'of', 'the', 'bridge', 'is', 'Tower', 'Hill', ',', 'and', 'the', 'south', 'side', 'of', 'the', 'bridge', 'comes', 'down', 'into', 'Bermondsey', ',', 'an', 'area', 'in', 'Southwark', '.', 'Tower', 'Bridge', 'is', 'far', 'more', 'visible', 'than', 'London', 'Bridge', ',', 'which', 'people', 'often', 'mistake', 'it', 'for', '.', 'Many', 'tourists', 'go', 'to', 'London', 'to', 'see', 'the', 'Tower', 'Bridge', '.', 'It', 'has', 'its', 'own', 'exhibition', 'centre', 'in', 'the', 'horizontal', 'walkway', '.', 'This', 'gives', 'one', 'of', 'the', 'best', 'vantage', 'points', 'in', 'London', '.']


It seems we haven't kept the work we did in the sentence segmentation section. How should we apply word tokenization on top of sentence segmentation? Analysing one sentence at a time is usually preferable because it keeps complexity down.

In [None]:
# Write your code here!
words = [nltk.word_tokenize(sentence) for sentence in sentences]
print(words)

[['Tower', 'Bridge', 'is', 'a', 'drawbridge', 'in', 'London', '.'], ['It', 'crosses', 'the', 'River', 'Thames', 'near', 'the', 'Tower', 'of', 'London', '.'], ['It', 'allows', 'ships', 'through', 'the', 'bridge', 'deck', 'when', 'is', 'raised', 'at', 'an', 'angle', 'in', 'the', 'centre', '.'], ['The', 'north', 'side', 'of', 'the', 'bridge', 'is', 'Tower', 'Hill', ',', 'and', 'the', 'south', 'side', 'of', 'the', 'bridge', 'comes', 'down', 'into', 'Bermondsey', ',', 'an', 'area', 'in', 'Southwark', '.'], ['Tower', 'Bridge', 'is', 'far', 'more', 'visible', 'than', 'London', 'Bridge', ',', 'which', 'people', 'often', 'mistake', 'it', 'for', '.'], ['Many', 'tourists', 'go', 'to', 'London', 'to', 'see', 'the', 'Tower', 'Bridge', '.'], ['It', 'has', 'its', 'own', 'exhibition', 'centre', 'in', 'the', 'horizontal', 'walkway', '.'], ['This', 'gives', 'one', 'of', 'the', 'best', 'vantage', 'points', 'in', 'London', '.']]
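For intuition about what a word tokenizer does, the sketch below applies a simple regular expression sentence by sentence. It is a rough stand-in rather than NLTK's algorithm; `simple_tokenize` and `example_sentences` are names made up for this illustration.

```python
import re

def simple_tokenize(sentence):
    # \w+ grabs runs of letters/digits/underscores; [^\w\s] grabs single punctuation marks
    return re.findall(r"\w+|[^\w\s]", sentence)

example_sentences = [
    "Tower Bridge is a drawbridge in London.",
    "It crosses the River Thames near the Tower of London.",
]

# One token list per sentence, mirroring the nested structure above
tokens_per_sentence = [simple_tokenize(s) for s in example_sentences]
print(tokens_per_sentence)

# A limitation of the simple pattern: contractions are split at the apostrophe
print(simple_tokenize("don't"))
```

On plain sentences like these the result looks the same as the output above, but contractions and other edge cases are where a purpose-built tokenizer earns its keep.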


### 3. Part-of-speech tagging
Now we want to analyse the first sentence to see which parts of speech are present. Often the nouns alone are enough to guess what a sentence is about.

In [None]:
words = nltk.word_tokenize(sentences[0])
print(f'First sentence tokenized: {words}')
pos_tags = nltk.pos_tag(words)
print(f'Part of speech tags: {pos_tags}')

First sentence tokenized: ['Tower', 'Bridge', 'is', 'a', 'drawbridge', 'in', 'London', '.']
Part of speech tags: [('Tower', 'NNP'), ('Bridge', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('drawbridge', 'NN'), ('in', 'IN'), ('London', 'NNP'), ('.', '.')]
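As a small follow-up to the idea of using only the nouns, this sketch filters the tagged tokens printed above. The tag list is copied from that output; the only assumption is that Penn Treebank noun tags all begin with `NN`.

```python
# Tagged output copied from the cell above
pos_tags = [('Tower', 'NNP'), ('Bridge', 'NNP'), ('is', 'VBZ'), ('a', 'DT'),
            ('drawbridge', 'NN'), ('in', 'IN'), ('London', 'NNP'), ('.', '.')]

# Noun tags (NN, NNS, NNP, NNPS) all start with 'NN'
nouns = [word for word, tag in pos_tags if tag.startswith('NN')]
print(nouns)  # ['Tower', 'Bridge', 'drawbridge', 'London']
```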


### 4. Normalization
We want to reduce the vocabulary size so that different surface forms of the same word are treated as one, which usually helps downstream NLP models.
#### 4.1 Stemming
Stemming simply truncates words to their stems, which may not be real words themselves. Let's try two different stemmers on some examples. The Snowball stemmer is an 'upgraded' version of the Porter stemmer; among other things, it lets users leave stopwords unstemmed, which is useful because different forms can carry different meanings, e.g. 'to be' and 'a being'.

In [None]:
# Define your own word list
word_list = ['being', 'a', 'fairly', 'mischievous', 'cat', 'causing', 'trouble', 'late', 'into', 'the', 'night']

In [None]:
porter = nltk.stem.PorterStemmer()
snowball = nltk.stem.SnowballStemmer('english', ignore_stopwords = True)

print([porter.stem(word) for word in word_list])
print([snowball.stem(word) for word in word_list])

['be', 'a', 'fairli', 'mischiev', 'cat', 'caus', 'troubl', 'late', 'into', 'the', 'night']
['being', 'a', 'fair', 'mischiev', 'cat', 'caus', 'troubl', 'late', 'into', 'the', 'night']
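To make the vocabulary-reduction idea concrete, here is a deliberately crude suffix stripper. It is a toy sketch and not the Porter or Snowball algorithm; the suffix list, the `min_stem` parameter and the sample vocabulary are all assumptions chosen for illustration.

```python
def toy_stem(word, suffixes=("ing", "ed", "es", "s", "e"), min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

vocabulary = ["cause", "causes", "causing", "caused", "cat", "cats"]
stems = [toy_stem(w) for w in vocabulary]
print(stems)

# Six distinct surface forms collapse into two stems
print(len(set(vocabulary)), "->", len(set(stems)))
```

Real stemmers apply many more rules and conditions, but the effect on vocabulary size is the same in spirit.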


#### 4.2 Lemmatisation
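Another option is to reduce words to their base forms (lemmas), so that inflected forms such as 'causing' and 'caused' are standardized to 'cause'. Lemmatisation is usually preferable to stemming because it returns real words and can use the part of speech to pick the right lemma. Let's run our lemmatizer over the same word list we stemmed.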

In [None]:
# Function to convert our PoS tags from one type to another used by wordnet
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return nltk.corpus.wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return nltk.corpus.wordnet.VERB
    elif treebank_tag.startswith('N'):
        return nltk.corpus.wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return nltk.corpus.wordnet.ADV
    else:
        return None


In [None]:
lemmatizer = nltk.stem.WordNetLemmatizer()
pos_tags = nltk.pos_tag(word_list) # we need the part of speech tags for lemmatisation
print(pos_tags)

lemmatized_words = [lemmatizer.lemmatize(pos_tag[0], get_wordnet_pos(pos_tag[1])) if get_wordnet_pos(pos_tag[1]) else pos_tag[0] for pos_tag in pos_tags]
print(lemmatized_words)

[('being', 'VBG'), ('a', 'DT'), ('fairly', 'RB'), ('mischievous', 'JJ'), ('cat', 'NN'), ('causing', 'VBG'), ('trouble', 'NN'), ('late', 'RB'), ('into', 'IN'), ('the', 'DT'), ('night', 'NN')]
['be', 'a', 'fairly', 'mischievous', 'cat', 'cause', 'trouble', 'late', 'into', 'the', 'night']
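The reason the lemmatizer needs part-of-speech tags can be sketched with a tiny lookup table: the same surface form can map to different lemmas depending on its role. The table `TOY_LEMMAS` below is hypothetical and hand-written for this example; it is not WordNet data.

```python
# Hypothetical mini-lexicon: (surface form, part of speech) -> lemma
TOY_LEMMAS = {
    ("being", "v"): "be",
    ("being", "n"): "being",
    ("causing", "v"): "cause",
    ("cats", "n"): "cat",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for a (word, pos) pair; fall back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("being", "v"))  # be
print(toy_lemmatize("being", "n"))  # being
print(toy_lemmatize("late", "r"))   # late (not in the table, returned unchanged)
```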


### 5. Removing stopwords
We remove stopwords to get rid of noise that is usually irrelevant.

In [None]:
stop_words = nltk.corpus.stopwords.words('english')
without_stop_words = [word for word in lemmatized_words if word not in stop_words]
print(without_stop_words)

['fairly', 'mischievous', 'cat', 'cause', 'trouble', 'late', 'night']
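When filtering a lot of text, it can help to hold the stopwords in a set, since set lookups do not scan the whole list. The sketch below uses a deliberately tiny, hypothetical stopword set (`toy_stop_words`) rather than NLTK's list, and also lower-cases each word before the lookup.

```python
# Hypothetical, deliberately tiny stopword set; a real list is much longer
toy_stop_words = {"a", "be", "into", "the"}

lemmatized_words = ['be', 'a', 'fairly', 'mischievous', 'cat', 'cause',
                    'trouble', 'late', 'into', 'the', 'night']

# Lower-casing before the lookup also catches capitalised stopwords such as 'The'
content_words = [w for w in lemmatized_words if w.lower() not in toy_stop_words]
print(content_words)
```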


## B. Regex and Fuzzy Matching
Let's import some packages we will be using!

In [None]:
! pip install pyspellchecker
import re
from fuzzywuzzy import fuzz
from spellchecker import SpellChecker

### 1. Regex
Let's try regular expression matching: `re.sub` replaces every match and `re.findall` returns all matches.

In [None]:
print(re.sub('a', 'b', 'abacadabra is a magic spell'))
print(re.findall(r'[A-Z]+', 'I am feeling okay but if I am happy I ONLY USE CAPS'))

bbbcbdbbrb is b mbgic spell
['I', 'I', 'I', 'ONLY', 'USE', 'CAPS']


Okay, your turn. Feel free to search online for regex syntax you don't know; regex101.com is also great for trying out expressions.

Task 1: fill in the regular expression so that it matches only ['1YESa', 'aYES3', 'asdkfYES', 'YES'].

In [None]:
task_1 = '1YESa 2yesb aYES3 asdkfYES YES yES6'
print(re.findall(r'[a-z0-9]*YES[a-z0-9]*', task_1))

['1YESa', 'aYES3', 'asdkfYES', 'YES']


In [None]:
task_1 = '1YESa 2yesb aYES3 asdkfYES YES yES6'
print(re.findall(r'[a-z]+YES[0-9]*', task_1))

['aYES3', 'asdkfYES']
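The two attempts above differ only in their quantifiers and character classes, so it is worth comparing them side by side. The sketch below re-runs both patterns and adds a third approach that checks each whitespace-separated token as a whole with `re.fullmatch`; the variable names are illustrative.

```python
import re

task_1 = '1YESa 2yesb aYES3 asdkfYES YES yES6'

# Optional lowercase letters/digits on both sides: matches all four targets
loose = re.findall(r'[a-z0-9]*YES[a-z0-9]*', task_1)

# Requires at least one lowercase letter before YES and allows only digits after,
# so '1YESa' and 'YES' are missed
strict = re.findall(r'[a-z]+YES[0-9]*', task_1)

# Alternative: test each whitespace-separated token as a whole
whole_tokens = [t for t in task_1.split()
                if re.fullmatch(r'[a-z0-9]*YES[a-z0-9]*', t)]

print(loose)
print(strict)
print(whole_tokens)
```

The token-by-token version makes the intent ("the entire word must fit the pattern") explicit, which can be easier to reason about than a pattern scanned across the full string.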


Task 2: fill in the regular expression to match all 6 e-mail addresses in the string.

In [None]:
task_2 = 'Hi Sally, Could you please forward this e-mail to the e-mails in this list please? l&d@deloitte.co.uk, nlp101@deloitte.co.uk and aitesting@deloitte.com, ai101@deloitte.com and ukai@deloitte.com, l&d@deloitte.com Thanks! My instagram handle is @swiftnlp.'
print(task_2, '\n')
print(re.findall(r'YOUR REGEX STRING HERE', task_2))

Hi Sally, Could you please forward this e-mail to the e-mails in this list please? l&d@deloitte.co.uk, nlp101@deloitte.co.uk and aitesting@deloitte.com, ai101@deloitte.com and ukai@deloitte.com, l&d@deloitte.com Thanks! My instagram handle is @swiftnlp. 

[]


In [None]:
task_2 = 'Hi Sally, Could you please forward this e-mail to the e-mails in this list please? l&d@deloitte.co.uk, nlp101@deloitte.co.uk and aitesting@deloitte.com, ai101@deloitte.com and ukai@deloitte.com, l&d@deloitte.com Thanks! My instagram handle is @swiftnlp.'
print(task_2, '\n')
print(re.findall(r'[A-Za-z0-9&]+\@[A-Za-z]+(?:\.[A-Za-z]+)+', task_2))

Hi Sally, Could you please forward this e-mail to the e-mails in this list please? l&d@deloitte.co.uk, nlp101@deloitte.co.uk and aitesting@deloitte.com, ai101@deloitte.com and ukai@deloitte.com, l&d@deloitte.com Thanks! My instagram handle is @swiftnlp. 

['l&d@deloitte.co.uk', 'nlp101@deloitte.co.uk', 'aitesting@deloitte.com', 'ai101@deloitte.com', 'ukai@deloitte.com', 'l&d@deloitte.com']
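Longer patterns are easier to read when written with `re.VERBOSE`, which ignores whitespace and allows comments inside the pattern. The sketch below restates the same pattern piece by piece; `EMAIL_RE` and the shortened `sample` string are names made up here, and the pattern is a simplification that does not cover every valid e-mail address (for example, dots or hyphens in the local part).

```python
import re

EMAIL_RE = re.compile(r"""
    [A-Za-z0-9&]+        # local part: letters, digits or '&'
    @                    # the separator
    [A-Za-z]+            # first domain label
    (?:\.[A-Za-z]+)+     # one or more '.label' groups, e.g. '.co.uk'
""", re.VERBOSE)

sample = 'Mail l&d@deloitte.co.uk and ai101@deloitte.com. My handle is @swiftnlp.'
print(EMAIL_RE.findall(sample))
```

The Instagram handle is skipped because nothing from the local-part character class sits directly before its `@`.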


### 2. Fuzzy Matching

Fuzzy matching is a quick method for matching strings that are similar but not identical, which is great for data cleaning!

In [None]:
string1 = 'Online NLP 101 Course for Everyone!'
string2 = 'Course in NLP'

In [None]:
print(fuzz.ratio(string1, string2))
print(fuzz.partial_ratio(string1, string2))
print(fuzz.token_sort_ratio(string1, string2))
print(fuzz.token_set_ratio(string1, string2))

33
62
51
87
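To get a feel for why the token-based scores are higher than the plain ratio, here is a rough stand-in built on the standard library's `difflib`. It is only a sketch: `simple_ratio` and `sorted_token_ratio` are hypothetical helpers, and their scores are not guaranteed to equal what fuzzywuzzy reports.

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Similarity from 0 to 100 based on difflib's matching-blocks ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def sorted_token_ratio(a, b):
    """Compare after lower-casing and sorting the words, so word order is ignored."""
    def normalise(s):
        return ' '.join(sorted(s.lower().split()))
    return simple_ratio(normalise(a), normalise(b))

print(simple_ratio('Course in NLP', 'NLP Course in'))
print(sorted_token_ratio('Course in NLP', 'NLP Course in'))  # identical after sorting
```

Sorting the tokens first removes the penalty for word order, which is the intuition behind the token-sort style of score.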


Another use case for fuzzy matching is simple spell checking. Here we use pyspellchecker, which relies on Levenshtein distance together with how frequently each word appears in English. You can then replace the spelling errors it finds!

In [None]:
spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])

for word in misspelled:
    # Get the one `most likely` answer
    print('most likely candidate: ', spell.correction(word))

    # Get a list of `likely` options
    print('all likely candidates: ', spell.candidates(word))

most likely candidate:  happening
all likely candidates:  {'happening', 'henning', 'penning'}
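Levenshtein distance counts the minimum number of single-character insertions, deletions or substitutions needed to turn one string into another. The sketch below is a standard dynamic-programming version written for illustration (it is not pyspellchecker's internal code) and applies it to the candidates printed above.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

for candidate in ['happening', 'henning', 'penning']:
    print(candidate, levenshtein('hapenning', candidate))
```

All three candidates are two edits away from 'hapenning', so distance alone cannot separate them; word frequency is what makes 'happening' the most likely correction.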


## C. ML & DL NLP
### 1. Named Entity Recognition
Let us use a popular NLP package named spaCy, which also performs the pre-processing/classical NLP steps above as part of its pipeline!

In [None]:
! pip install contextualSpellCheck
import spacy
from spacy import displacy
import contextualSpellCheck


In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

displacy.render(doc, style="ent")

### 2. BERT use-case: Contextual spell checking
Let's use spaCy for contextual spell checking too! The contextualSpellCheck package adds a BERT-based component to the spaCy pipeline.

In [None]:
contextualSpellCheck.add_to_pipe(nlp)


In [None]:
doc = nlp('Incom was $9.4 milion conpar to the prior year of $2.7 milion.')
doc._.outcome_spellCheck

'It was $9.4 million compared to the prior year of $2.7 million.'

### 3. Sentiment Analysis

Let us use NLTK's built-in sentiment analyser, VADER. It is a lexicon- and rule-based model that needs no training on our side and is best suited to social media language: short texts with abbreviations and slang.

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
sia.polarity_scores("Wow, NLTK is really powerful!")

{'neg': 0.0, 'neu': 0.295, 'pos': 0.705, 'compound': 0.8012}
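The `compound` value is the one usually turned into a label. The sketch below maps it to positive/neutral/negative using a cut-off of 0.05, which is a commonly used convention but should be treated as an assumption to tune for your own data; `label_from_compound` is a hypothetical helper, not part of NLTK.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Scores copied from the output above
scores = {'neg': 0.0, 'neu': 0.295, 'pos': 0.705, 'compound': 0.8012}
print(label_from_compound(scores['compound']))  # positive
```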

## D. Build your own ML model for classification

Import all required packages

In [None]:
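!pip install contractions
import contractions
import string
from itertools import chain
from nltk.corpus import movie_reviews as mr
import pandas as pd
# Import the submodules used below explicitly; `import sklearn` alone does not guarantee they are loaded
import sklearn
import sklearn.feature_extraction.text
import sklearn.metrics
import sklearn.model_selection
from sklearn import naive_bayes, linear_model, ensemble, svm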


Load in data

In [None]:
positive_review_ids = nltk.corpus.movie_reviews.fileids(categories=["pos"])
negative_review_ids = nltk.corpus.movie_reviews.fileids(categories=["neg"])
pos_revs = pd.DataFrame([nltk.corpus.movie_reviews.raw(rev) for rev in positive_review_ids], columns = ['review'])
neg_revs = pd.DataFrame([nltk.corpus.movie_reviews.raw(rev) for rev in negative_review_ids], columns = ['review'])
pos_revs['positive_sentiment'] = 1
neg_revs['positive_sentiment'] = 0
all_revs = pd.concat([pos_revs, neg_revs], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
all_revs.head(5)

| | review | positive_sentiment |
|---|---|---|
| 0 | films adapted from comic books have had plenty... | 1 |
| 1 | every now and then a movie comes along from a ... | 1 |
| 2 | you've got mail works alot better than it dese... | 1 |
| 3 | " jaws " is a rare film that grabs your atten... | 1 |
| 4 | moviemaking is a lot like being the general ma... | 1 |


Apply the cleaning steps here. The cell below:
1. Expands contractions
2. Removes stopwords
3. Splits each review into words (tokenization)
4. Lemmatises the words

In [None]:
def lemmatize(word_list):
    pos_tags = nltk.pos_tag(word_list) # we need the part of speech tags for lemmatisation

    lemmatized_words = [lemmatizer.lemmatize(pos_tag[0], get_wordnet_pos(pos_tag[1])) if get_wordnet_pos(pos_tag[1]) else pos_tag[0] for pos_tag in pos_tags]
    return lemmatized_words

all_revs['review'] = all_revs['review'].apply(lambda x:' '.join([contractions.fix(word) for word in x.split()]))
all_revs['review'] = all_revs['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
all_revs['review'] = all_revs['review'].apply(lambda x: ' '.join(lemmatize(nltk.word_tokenize(x))))

all_revs.head(5)

| | review | positive_sentiment |
|---|---|---|
| 0 | film adapt comic book plenty success , whether... | 1 |
| 1 | every movie come along suspect studio , every ... | 1 |
| 2 | get mail work alot good deserves . order make ... | 1 |
| 3 | \`\` jaw \`\` rare film grab attention show single... | 1 |
| 4 | moviemaking lot like general manager nfl team ... | 1 |


Split the data into training and validation sets for model training.

In [None]:
train_x, valid_x, train_y, valid_y = sklearn.model_selection.train_test_split(all_revs['review'], all_revs['positive_sentiment'])

Vectorize the text using two methods: count vectorization and TF-IDF vectorization.

In [None]:
count_vect = sklearn.feature_extraction.text.CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(all_revs['review'])
tfidf_vect = sklearn.feature_extraction.text.TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features = 500)
tfidf_vect.fit(all_revs['review'])

xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)

xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)
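To show what the two vectorizers compute, here is a pure-Python sketch on three tiny made-up documents. The smoothed IDF formula and L2 normalisation used here are what we understand scikit-learn's defaults to be, but treat that as an assumption; the documents and helper names are invented for this example.

```python
import math
from collections import Counter

docs = ["good film good cast", "bad film", "good plot"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Count vectors: one column per vocabulary word
count_vectors = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
n_docs = len(tokenized)
doc_freq = {w: sum(w in doc for doc in tokenized) for w in vocab}
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_vector(doc):
    """Weight raw counts by IDF, then scale the vector to unit length."""
    counts = Counter(doc)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]

tfidf_vectors = [tfidf_vector(doc) for doc in tokenized]
print(vocab)
print(count_vectors)
print([[round(x, 2) for x in vec] for vec in tfidf_vectors])
```

Words that appear in fewer documents ('bad', 'cast', 'plot') receive a higher IDF weight than words shared across documents ('film', 'good').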

Let's try a few classic classification models:
1. Naive Bayes
2. Linear Model (Logistic Regression)
3. Support Vector Machines
4. Random Forest

In [None]:
# Helper to make it easier to train and evaluate various models/algorithms.
# Note: it scores predictions against the notebook-level `valid_y` defined in the split above.
def train_model(classifier, feature_vector_train, label, feature_vector_valid):
    classifier.fit(feature_vector_train, label)
    predictions = classifier.predict(feature_vector_valid)
    return sklearn.metrics.classification_report(valid_y, predictions)
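A variant of this helper can take the validation labels as an explicit argument, so it does not depend on a variable defined elsewhere in the notebook. The sketch below is library-free for illustration: `evaluate_model` returns plain accuracy, and `MajorityClassifier` is a hypothetical baseline that only mimics the `fit`/`predict` interface.

```python
def evaluate_model(classifier, train_features, train_labels, valid_features, valid_labels):
    """Fit on the training data and return accuracy on the validation data."""
    classifier.fit(train_features, train_labels)
    predictions = classifier.predict(valid_features)
    correct = sum(p == y for p, y in zip(predictions, valid_labels))
    return correct / len(valid_labels)

class MajorityClassifier:
    """Hypothetical baseline: always predicts the most common training label."""
    def fit(self, features, labels):
        self.majority_ = max(set(labels), key=list(labels).count)
        return self

    def predict(self, features):
        return [self.majority_ for _ in features]

accuracy = evaluate_model(MajorityClassifier(), [[0], [1], [2]], [1, 1, 0], [[3], [4]], [1, 0])
print(accuracy)  # 0.5
```

A majority-class baseline like this is also a useful sanity check: any real model should beat it comfortably.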

In [None]:
nb_cv = train_model(naive_bayes.BernoulliNB(), xtrain_count.toarray(), train_y, xvalid_count.toarray())
print("Naive Bayes + Count Vectors: ", nb_cv)

nb_tfidf = train_model(naive_bayes.BernoulliNB(), xtrain_tfidf.toarray(), train_y, xvalid_tfidf.toarray())
print("Naive Bayes + TFIDF Vectors: ", nb_tfidf)

lr_cv = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xvalid_count)
print("Logistic Regression + Count Vectors: ", lr_cv)

lr_tfidf = train_model(linear_model.LogisticRegression(), xtrain_tfidf, train_y, xvalid_tfidf)
print("Logistic Regression + TFIDF Vectors: ", lr_tfidf)

svm_cv = train_model(svm.SVC(), xtrain_count, train_y, xvalid_count)
print("Support Vector Machines + Count Vectors: ", svm_cv)

svm_tfidf = train_model(svm.SVC(), xtrain_tfidf, train_y, xvalid_tfidf)
print("Support Vector Machines + TFIDF Vectors: ", svm_tfidf)

rf_cv = train_model(ensemble.RandomForestClassifier(), xtrain_count, train_y, xvalid_count)
print("Random Forest + Count Vectors: ", rf_cv)

rf_tfidf = train_model(ensemble.RandomForestClassifier(), xtrain_tfidf, train_y, xvalid_tfidf)
print("Random Forest + TFIDF Vectors: ", rf_tfidf)

Naive Bayes + Count Vectors:                precision    recall  f1-score   support

           0       0.74      0.88      0.80       246
           1       0.86      0.70      0.77       254

    accuracy                           0.79       500
   macro avg       0.80      0.79      0.79       500
weighted avg       0.80      0.79      0.79       500

Naive Bayes + TFIDF Vectors:                precision    recall  f1-score   support

           0       0.71      0.78      0.74       246
           1       0.76      0.69      0.73       254

    accuracy                           0.73       500
   macro avg       0.74      0.73      0.73       500
weighted avg       0.74      0.73      0.73       500



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Logistic Regression + Count Vectors:                precision    recall  f1-score   support

           0       0.82      0.85      0.83       246
           1       0.84      0.81      0.83       254

    accuracy                           0.83       500
   macro avg       0.83      0.83      0.83       500
weighted avg       0.83      0.83      0.83       500

Logistic Regression + TFIDF Vectors:                precision    recall  f1-score   support

           0       0.78      0.80      0.79       246
           1       0.80      0.78      0.79       254

    accuracy                           0.79       500
   macro avg       0.79      0.79      0.79       500
weighted avg       0.79      0.79      0.79       500

Support Vector Machines + Count Vectors:                precision    recall  f1-score   support

           0       0.78      0.83      0.81       246
           1       0.83      0.77      0.80       254

    accuracy                           0.80       500
   macro a

The challenge now is to improve the sentiment predictor. There are many ways to do this, but the easiest place to start is the text preprocessing.

If you don't know where to start, try:
1. Expanding contractions
2. Removing stopwords
3. Lemmatising the text

Once your text is clean and processed, you can start adding extra features (e.g. word count) and optimizing the vectorization methods and classification algorithms.
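As a starting point for extra features, the sketch below computes a few simple numbers per review. The function name `extra_features` and the particular features chosen are assumptions for illustration; whether they actually help should be checked on the validation set.

```python
def extra_features(review):
    """Return a few simple numeric features for one review string."""
    words = review.split()
    n_words = len(words)
    return {
        'word_count': n_words,
        'avg_word_length': sum(len(w) for w in words) / n_words if n_words else 0.0,
        'exclamation_count': review.count('!'),
    }

print(extra_features("what a great , great film !"))
```

Features like these could be computed for every review and placed alongside the vectorized text before training.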