<a href="https://colab.research.google.com/github/RJuro/nlp-intro-cuny/blob/master/notebooks/Intro_to_nlp_and_supervised_tasks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro To NLP vs. Supervised ML

Roman Jurowetzki, Aalborg University In part based on the Intro from the DeepNLP course by Dan Anastasyev - https://github.com/DanAnastasyev/DeepNLP-Course

In this notebook we are going to explore supervised ML as used on vectorised text input. This is probably the most common application when working with NLP today and very useful if you want to generate (predict) indicators from text data for further exploration (e.g. simple statistical or econometric analysis)

We are going to use a (VERY!) standard dataset of movie reviews from IMDB and try solve a binary classification problem - is the movie good or bad. This is certainly a oversimplification but appropriate given the timeframe and that this here is an intro...

![alt text](https://media.giphy.com/media/7jNeb9CVSgyUE/giphy.gif)

In this tutorial we will be using the well known IMDB movie review dataset for simple classification with different vectorization techniques:


*   Simple bag-of-words
*   TF-IDF
*   LSI / SVD


We will also look at some more recent approaches to model explainability i.e. "Why did the model decide this or that?"


Finally, we will look at a simple approach to building a **semantic search** based on vector-similarity.


In [1]:
!pip -q install eli5 #installing a great package for explaining ML models

[?25l[K     |███                             | 10kB 21.1MB/s eta 0:00:01[K     |██████▏                         | 20kB 26.2MB/s eta 0:00:01[K     |█████████▎                      | 30kB 19.2MB/s eta 0:00:01[K     |████████████▍                   | 40kB 12.3MB/s eta 0:00:01[K     |███████████████▌                | 51kB 8.2MB/s eta 0:00:01[K     |██████████████████▋             | 61kB 8.2MB/s eta 0:00:01[K     |█████████████████████▊          | 71kB 8.9MB/s eta 0:00:01[K     |████████████████████████▊       | 81kB 8.5MB/s eta 0:00:01[K     |███████████████████████████▉    | 92kB 8.5MB/s eta 0:00:01[K     |███████████████████████████████ | 102kB 8.9MB/s eta 0:00:01[K     |████████████████████████████████| 112kB 8.9MB/s 
[?25h

In [2]:
import pandas as pd
import numpy as np

In [3]:
data = pd.read_csv("https://github.com/RJuro/nlp-intro-cuny/raw/master/images/imdb.zip", sep="\t")

In [4]:
data.head()

Unnamed: 0,is_positive,review
0,0,"Dreamgirls, despite its fistful of Tony wins i..."
1,0,This show comes up with interesting locations ...
2,1,I simply love this movie. I also love the Ramo...
3,0,Spoilers ahead if you want to call them that.....
4,1,My all-time favorite movie! I have seen many m...


In [5]:
# some basic text cleaning, removing HTML fragments (only a problem here)

import re

pattern = re.compile('<br /><br />')

print(data['review'].iloc[3])
print(pattern.subn(' ', data['review'].iloc[3])[0])

Spoilers ahead if you want to call them that...<br /><br />I would almost recommend this film just so people can truly see a 1/10. Where to begin, we'll start from the top...<br /><br />THE STORY: Don't believe the premise - the movie has nothing to do with abandoned cars, and people finially understanding what the mysterious happenings are. It's a draub, basic, go to cabin movie with no intensity or "effort".<br /><br />THE SCREENPLAY: I usually give credit to indie screenwriters, it's hard work when you are starting out...but this is crap. The story is flat - it leaves you emotionless the entire movie. The dialogue is extremely weak and predictable boasting lines of "Woah, you totally freaked me out" and "I was wondering if you'd uh...if you'd like to..uh, would you come to the cabin with me?". It makes me want to rip out all my hair, one strand at a time and feed it to myself.<br /><br />THE CHARACTERS: HOLY CRAP!!!! Some have described the characters as flat, I want to take it one 

In [6]:
# application of the cleaning mask to everthing

data['review'] = data['review'].apply(lambda text: pattern.subn(' ', text)[0])
data['review'] = data['review'].apply(lambda text: pattern.subn(' ', text)[0])

## Approach 1 - Sklearn
If you don't want to deal with language or much code you can just do that

In [7]:
# module to split data into training / test
from sklearn.model_selection import train_test_split

In [8]:
# define in and outputs

X = data['review'].values
y = data['is_positive'].values

In [9]:
# Split the data in 80% trainig 20% test

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=18)

In [10]:
# Simple BoW vectorizer
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_vec_1 = vectorizer.fit_transform(X_train)

In [11]:
# Instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=2000)

In [12]:
# Train the model

model.fit(X_train_vec_1, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=2000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [13]:
# Transform the test-set
X_test_vec_1 = vectorizer.transform(X_test)

In [14]:
# Check performance of the model
model.score(X_test_vec_1, y_test)

0.864

In [15]:
# Predict on new data

y_pred = model.predict(X_test_vec_1)

In [16]:
# confusion matrix by hand... :-)

pd.crosstab(y_test, y_pred)

col_0,0,1
row_0,Unnamed: 1_level_1,Unnamed: 2_level_1
0,2151,381
1,299,2169


In [17]:
# Or TFIDF
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train_vec_2 = vectorizer.fit_transform(X_train)

model = LogisticRegression(max_iter=2000)

# Train the model

model.fit(X_train_vec_2, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=2000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [18]:
# Transform the test-set
X_test_vec_2 = vectorizer.transform(X_test)

In [19]:
# Check performance of the model
model.score(X_test_vec_2, y_test)

0.878

In [20]:
import eli5
eli5.show_weights(model, feature_names=vectorizer.get_feature_names(), target_names=['negative','positive'], top=20)



Weight?,Feature
+7.619,great
+5.507,excellent
+4.663,best
+4.329,perfect
+4.220,wonderful
+4.042,well
+4.004,amazing
+3.757,loved
… 34620 more positive …,… 34620 more positive …
… 33875 more negative …,… 33875 more negative …


In [21]:
eli5.show_prediction(model, X_test[0], vec=vectorizer, target_names=['negative','positive'])

Contribution?,Feature
0.47,stupid
0.376,would
0.293,insult
0.253,rubbish
0.241,die
0.229,even
0.203,might
0.195,anything
0.191,but
0.177,<BIAS>


In [22]:
# Let's fire up spacy

import spacy

# and load the small english language model. Large models can be downloaded for many languages.
nlp = spacy.load("en")

# find more models for other languages here: https://spacy.io/models/

In [23]:
doc = nlp(X_test[1])

Spacy docs have POS (part of speech) and ENT (entity anotation) - let's see how we can use that to filter (bootstrap) a nice dictionary for future use.

In [24]:
# let's look at the POS tags
[(tok.text, tok.pos_) for tok in doc]

[('I', 'PRON'),
 ('like', 'VERB'),
 ('this', 'DET'),
 ('film', 'NOUN'),
 ('for', 'ADP'),
 ('several', 'ADJ'),
 ('reasons', 'NOUN'),
 ('.', 'PUNCT'),
 ('I', 'PRON'),
 ('have', 'AUX'),
 ('a', 'DET'),
 ('soft', 'ADJ'),
 ('spot', 'NOUN'),
 ('for', 'ADP'),
 ('films', 'NOUN'),
 ('about', 'ADP'),
 ('intricately', 'ADV'),
 ('plotted', 'VERB'),
 ('criminal', 'ADJ'),
 ('plots', 'NOUN'),
 ('like', 'SCONJ'),
 ('TOPKAPI', 'PROPN'),
 ('.', 'PUNCT'),
 ('I', 'PRON'),
 ('also', 'ADV'),
 ('enjoy', 'VERB'),
 ('films', 'NOUN'),
 ('(', 'PUNCT'),
 ('like', 'SCONJ'),
 ('TOPKAPI', 'PROPN'),
 ('and', 'CCONJ'),
 ('BIG', 'PROPN'),
 ('DEAL', 'PROPN'),
 ('ON', 'ADP'),
 ('MADONNA', 'PROPN'),
 ('STREET', 'PROPN'),
 (')', 'PUNCT'),
 ('that', 'DET'),
 ('spoof', 'VERB'),
 ('the', 'DET'),
 ('the', 'DET'),
 ('genre', 'NOUN'),
 ('.', 'PUNCT'),
 ('One', 'NUM'),
 ('of', 'ADP'),
 ('the', 'DET'),
 ('best', 'ADJ'),
 ('ones', 'NOUN'),
 ('is', 'AUX'),
 ('DISORGANIZED', 'PROPN'),
 ('CRIME', 'PROPN'),
 ('.', 'PUNCT'),
 ('Corbin', 

In [25]:
# Let's tokenize the first 2000 articles (that should take around 1 minute with this approach)
tokenlist = []
for doc in nlp.pipe(X_train[:2000]):
  tokens =[tok.text.lower() for tok in doc if tok.pos_ in ['NOUN','ADJ','ADV','VERB'] and not tok.is_stop]
  tokenlist.append(tokens)

In [26]:
from gensim.corpora.dictionary import Dictionary

In [27]:
dictionary = Dictionary(tokenlist)

In [28]:
len(dictionary)

18720

In [29]:
dictionary.filter_extremes(no_below=5, no_above=0.2)

In [30]:
len(dictionary)

4871

In [31]:
vectorizer = TfidfVectorizer(vocabulary=list(dictionary.values()))
X_train_vec_2 = vectorizer.fit_transform(X_train)

model = LogisticRegression(max_iter=2000)

# Train the model

model.fit(X_train_vec_2, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=2000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [32]:
X_test_vec_2 = vectorizer.fit_transform(X_test)

In [33]:
# Check performance of the model
model.score(X_test_vec_2, y_test)

0.8634

In [34]:
eli5.show_weights(model, feature_names=vectorizer.get_feature_names(), target_names=['negative','positive'], top=20)

Weight?,Feature
+5.778,excellent
+4.756,best
+4.388,perfect
+4.323,wonderful
+4.236,amazing
+4.129,favorite
… 2443 more positive …,… 2443 more positive …
… 2391 more negative …,… 2391 more negative …
-3.695,supposed
-3.716,unfortunately


In [35]:
eli5.show_prediction(model, X_test[0], vec=vectorizer, target_names=['negative','positive'])

Contribution?,Feature
0.603,stupid
0.566,insult
0.457,rubbish
0.32,die
0.103,piece
0.099,given
0.092,kind
0.085,stations
0.042,normally
0.026,tv


In [37]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.neural_network import MLPClassifier

tfidf = TfidfVectorizer(vocabulary=list(dictionary.values()))
svd = TruncatedSVD(n_components=100, n_iter=7, random_state=42)
clf = MLPClassifier(verbose=False)


pipe = make_pipeline(tfidf, svd, clf)

pipe.fit(X_train, y_train)



Pipeline(memory=None,
         steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token...
                               batch_size='auto', beta_1=0.9, beta_2=0.999,
                               early_stopping=False, epsilon=1e-08,
                               hidden_layer_sizes=(100,),
                               learning_rate='constant',
                               learning_rat

In [38]:
pipe.score(X_test, y_test)

0.8392

In [39]:
from eli5.lime import TextExplainer

te = TextExplainer(random_state=42)
te.fit(X_test[0], pipe.predict_proba)
te.show_prediction(target_names=['negative','positive'])

Contribution?,Feature
1.216,stupid
0.996,rubbish
0.64,normally
0.605,insult
0.541,die
0.484,given
0.448,piece
0.401,describe
0.277,kind
0.235,commercial


## Semantic search

Once you obtain dense vectors that represent your text you can calculate distance measures. Where distance is not high, you will probably find texts that are semantically similar... :-)

This can be done by calculating all distances in the corpus (which would be rather inefficient) or by using nearest-neighbor approximation.

We will be using Annoy, a popular technique for finding neighbors developed at spotify (to find similar songs)
https://github.com/spotify/annoy


In [263]:
!pip install annoy

Processing /root/.cache/pip/wheels/3a/c5/59/cce7e67b52c8e987389e53f917b6bb2a9d904a03246fadcb1e/annoy-1.17.0-cp36-cp36m-linux_x86_64.whl
Installing collected packages: annoy
Successfully installed annoy-1.17.0


Let's first vectorise our texts. For this we will be using Gensim, as it provides a more language-oriented approach as well as a good interlude into topic modelling...

In [242]:
# Import the dictionary builder
from gensim.corpora.dictionary import Dictionary

# Import the TfidfModel from Gensim
from gensim.models.tfidfmodel import TfidfModel

# Just like before, we import the model
from gensim.models.lsimodel import LsiModel

# Tooling to map between corpus (gensim) and matrix - more general
from gensim.matutils import corpus2csc, corpus2dense

In [239]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [236]:
# Generate a dictionary and filter
dictionary = Dictionary(tokenlist)
dictionary.filter_extremes(no_below=5, no_above=0.2)

In [241]:
# construct corpus using this dictionary
corpus = [dictionary.doc2bow(word_tokenize(doc.lower())) for doc in data['review']]

In [244]:
# Create and fit a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)

In [245]:
# transform corpus to TFIDF
corpus_tfidf = tfidf[corpus]

In [246]:
# Training the LSI model
model_lsi = LsiModel(corpus_tfidf, num_topics = 300, id2word=dictionary)

In [248]:
# Generating the corpus train & test

corpus_lsi = model_lsi[corpus_tfidf]

In [249]:
# turn into matrix
corpus_lsi_matrix = corpus2dense(corpus_lsi, 300 )

  result = np.column_stack(sparse2full(doc, num_terms) for doc in corpus)


In [251]:
corpus_lsi_matrix.shape

(300, 25000)

In [252]:
corpus_lsi_matrix = corpus_lsi_matrix.T

In [265]:
from annoy import AnnoyIndex

In [266]:
f = 300

t = AnnoyIndex(f, 'angular')  # Length of item vector that will be indexed

for i in range(len(corpus_lsi_matrix)):
    t.add_item(i, corpus_lsi_matrix[i])

In [267]:
t.build(10)

True

In [276]:
t.get_nns_by_item(0, 10)

[0, 2976, 21545, 15132, 11501, 24134, 17560, 11379, 965, 12661]

In [282]:
data['review'][44]

"The Woman in Black (1989) is a TV adaptation of Susan Hill's modern classic ghost story, published only a few years earlier than the film was made. Sadly, this film has not been released on DVD, and as far as I am aware it has been deleted on VHS. It's availability is in direct contrast to it's popularity amongst those in the know about horror films. The story revolves around events in a seaside community in the early 20th century where a young solicitor is sent by his firm to conclude the affairs of a recently deceased widow, who died on her isolated marshland estate. What he thought would be a routine and probably tedious task turns into a nightmare as he discovers that the old woman was haunted to her death, and that the ghosts of her past are not content to rest. The story is told in a subtle but concise way, never being self-indulgent, flashy or over-expositional. The obviously tight budget may have contributed to the no-nonsense approach, but it's just what the story needs, and 

In [283]:
data['review'][t.get_nns_by_item(44, 10)]

44       The Woman in Black (1989) is a TV adaptation o...
22386    "The Woman in Black" is easily one of the cree...
16672    Need I say--its a stinker! (I gave it a rating...
8306     This Lifetime style movie takes the middle age...
10741    I was lucky enough to watch this without any p...
22170    Lord Alan Cunningham(Antonio De Teffè)is a nut...
1536     'I don't understand. None of this makes any se...
12546    Someday somebody is going to write an essay co...
3427     A young woman who is a successful model, and i...
14178    Alain Resnais films are uncanny in the way tha...
Name: review, dtype: object