<a href="https://colab.research.google.com/github/CALDISS-AAU/sdsphd21/blob/master/notebooks/Intro_to_nlp_and_supervised_tasks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to NLP and Supervised ML

Roman Jurowetzki, Aalborg University. Based in part on the intro from the DeepNLP course by Dan Anastasyev: https://github.com/DanAnastasyev/DeepNLP-Course

In this notebook we are going to explore supervised ML applied to vectorised text input. This is probably the most common application of NLP today, and it is very useful if you want to generate (predict) indicators from text data for further exploration (e.g. simple statistical or econometric analysis).

We are going to use a (VERY!) standard dataset of movie reviews from IMDB and try to solve a binary classification problem: is the movie good or bad? This is certainly an oversimplification, but it is appropriate given the timeframe and the fact that this is an intro...

![alt text](https://media.giphy.com/media/7jNeb9CVSgyUE/giphy.gif)

We will classify the reviews using a simple model on top of different vectorization techniques:


*   Simple bag-of-words
*   TF-IDF
*   LSI / SVD
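
To make the first two techniques concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. The toy documents and the helper names (`bow_vector`, `tfidf_vector`) are illustrative assumptions, and the IDF formula is the plain textbook variant `log(N / df)`. scikit-learn's `TfidfVectorizer` applies additional smoothing and normalisation by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration, not taken from the IMDB data)
toy_docs = [
    "good movie good plot",
    "bad movie",
    "good acting bad plot",
]
tokenized = [d.split() for d in toy_docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

def bow_vector(tokens):
    """Bag-of-words: raw count of every vocabulary term in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

# Document frequency: in how many documents does each term occur?
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}
n_docs = len(tokenized)

def tfidf_vector(tokens):
    """Textbook TF-IDF: term count weighted by log(N / document frequency)."""
    counts = Counter(tokens)
    return [counts[term] * math.log(n_docs / doc_freq[term]) for term in vocab]

bow_matrix = [bow_vector(doc) for doc in tokenized]
tfidf_matrix = [tfidf_vector(doc) for doc in tokenized]

print(vocab)
for row in bow_matrix:
    print(row)
for row in tfidf_matrix:
    print([round(value, 2) for value in row])
```

Terms that occur in few documents (here "acting") receive a higher weight than terms spread across many documents, which is the whole point of the IDF factor.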


We will also look at some more recent approaches to model explainability, i.e. answering the question "Why did the model decide this or that?"


Finally, we will look at a simple approach to building a **semantic search** based on vector-similarity.


In [None]:
!pip -q install eli5 #installing a great package for explaining ML models

In [None]:
import pandas as pd
import numpy as np

In [None]:
data = pd.read_csv("https://github.com/RJuro/nlp-intro-cuny/raw/master/images/imdb.zip", sep="\t")

In [None]:
data.head()

In [None]:
# Some basic text cleaning: remove the HTML line-break fragments (<br /><br />) left in the raw reviews

import re

pattern = re.compile('<br /><br />')

print(data['review'].iloc[3])
print(pattern.subn(' ', data['review'].iloc[3])[0])

In [None]:
# apply the cleaning pattern to every review

data['review'] = data['review'].apply(lambda text: pattern.subn(' ', text)[0])

## Approach 1 - Sklearn
If you don't want to deal with language-specific preprocessing or write much code, scikit-learn alone gets you a long way:

In [None]:
# module to split data into training / test
from sklearn.model_selection import train_test_split

In [None]:
# define inputs and outputs

X = data['review'].values
y = data['is_positive'].values

In [None]:
# Split the data into 80% training and 20% test

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=18)

In [None]:
# Simple BoW vectorizer
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_vec_1 = vectorizer.fit_transform(X_train)

In [None]:
# Instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=2000)

In [None]:
# Train the model

model.fit(X_train_vec_1, y_train)

In [None]:
# Transform the test-set
X_test_vec_1 = vectorizer.transform(X_test)

In [None]:
# Check performance of the model
model.score(X_test_vec_1, y_test)

In [None]:
# Predict labels for the held-out test set

y_pred = model.predict(X_test_vec_1)

In [None]:
# confusion matrix by hand... :-)

pd.crosstab(y_test, y_pred)
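
The cross-tabulation above holds the four cells of a confusion matrix. As a small sketch of how the usual summary metrics follow from those cells, the helper below (a hypothetical `binary_metrics`, run on made-up labels rather than the model's real predictions) computes accuracy, precision and recall by hand.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for labels coded as 0 (negative) / 1 (positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up labels for illustration only
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 0, 1, 0, 1, 0, 1, 0]

metrics = binary_metrics(y_true_demo, y_pred_demo)
print(metrics)
```

On the real data the call would be `binary_metrics(y_test, y_pred)`; `sklearn.metrics.classification_report` reports the same kind of figures without the manual counting.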

In [None]:
# Alternatively, use TF-IDF weights instead of raw counts
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train_vec_2 = vectorizer.fit_transform(X_train)

model = LogisticRegression(max_iter=2000)

# Train the model

model.fit(X_train_vec_2, y_train)

In [None]:
# Transform the test-set
X_test_vec_2 = vectorizer.transform(X_test)

In [None]:
# Check performance of the model
model.score(X_test_vec_2, y_test)

In [None]:
import eli5
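# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
eli5.show_weights(model, feature_names=list(vectorizer.get_feature_names_out()), target_names=['negative','positive'], top=20)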

In [None]:
eli5.show_prediction(model, X_test[0], vec=vectorizer, target_names=['negative','positive'])
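
For a linear model such as logistic regression, a per-document explanation boils down to multiplying each feature value by its learned weight. The sketch below illustrates that idea with made-up weights and TF-IDF values. The names `weights`, `bias` and `features` are assumptions for illustration, and this is not eli5's actual implementation.

```python
import math

# Hypothetical learned weights (positive values push towards "positive")
weights = {"great": 2.0, "boring": -3.0, "plot": 0.5}
bias = -0.1

# Hypothetical TF-IDF values of the terms in one review
features = {"great": 0.6, "plot": 0.4, "boring": 0.2}

# Contribution of each term = weight * feature value
contributions = {term: weights[term] * value for term, value in features.items()}

# The model's raw score is the bias plus the sum of all contributions
score = bias + sum(contributions.values())
probability = 1 / (1 + math.exp(-score))  # logistic (sigmoid) function

# List terms from most positive to most negative contribution
for term, contrib in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:>8}: {contrib:+.2f}")
print(f"score={score:.2f}, P(positive)={probability:.2f}")
```

Terms with large positive or negative contributions are the ones highlighted most strongly in this kind of explanation.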

In [None]:
# Let's fire up spacy

import spacy

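# and load the small English language model. Larger models can be downloaded for many languages.
# (spaCy v3 dropped the "en" shortcut; if the model is missing, run: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")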

# find more models for other languages here: https://spacy.io/models/

In [None]:
doc = nlp(X_test[1])

spaCy docs carry POS (part-of-speech) tags and ENT (entity) annotations. Let's see how we can use them to filter (bootstrap) a nice dictionary for future use.

In [None]:
# let's look at the POS tags
[(tok.text, tok.pos_) for tok in doc]

In [None]:
# Let's tokenize the first 2000 reviews (this takes a little while).
# The tagger must stay enabled because we filter on tok.pos_ below;
# only the parser and the NER component are switched off for speed.
tokenlist = []
for doc in nlp.pipe(X_train[:2000], disable=["parser", "ner"]):
  tokens = [tok.text.lower() for tok in doc if tok.pos_ in ['NOUN','ADJ','ADV','VERB'] and not tok.is_stop]
  tokenlist.append(tokens)

In [None]:
from gensim.corpora.dictionary import Dictionary

In [None]:
dictionary = Dictionary(tokenlist)

In [None]:
len(dictionary)

In [None]:
dictionary.filter_extremes(no_below=5, no_above=0.2)

In [None]:
len(dictionary)
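
`filter_extremes` prunes the vocabulary by document frequency: `no_below=5` drops tokens that appear in fewer than 5 documents, and `no_above=0.2` drops tokens that appear in more than 20% of them. A rough pure-Python sketch of that rule follows. The function `filter_by_doc_freq` and the toy documents are made up for illustration, and gensim's own implementation has further options such as `keep_n`.

```python
from collections import Counter

def filter_by_doc_freq(token_lists, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and at most a `no_above` share of them."""
    n_docs = len(token_lists)
    # Count each token once per document (document frequency, not total count)
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    return sorted(
        tok for tok, freq in doc_freq.items()
        if freq >= no_below and freq / n_docs <= no_above
    )

# Made-up tokenized documents
toy_token_lists = [
    ["film", "great", "cast"],
    ["film", "boring", "plot"],
    ["film", "great", "plot"],
    ["film", "weak", "cast"],
]

kept = filter_by_doc_freq(toy_token_lists, no_below=2, no_above=0.5)
print(kept)
```

Here "film" is removed because it occurs in every document, while "boring" and "weak" are removed because they occur only once.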

In [None]:
vectorizer = TfidfVectorizer(vocabulary=list(dictionary.values()))
X_train_vec_2 = vectorizer.fit_transform(X_train)

model = LogisticRegression(max_iter=2000)

# Train the model

model.fit(X_train_vec_2, y_train)

In [None]:
# Only transform the test set; re-fitting here would re-estimate the IDF weights on test data
X_test_vec_2 = vectorizer.transform(X_test)

In [None]:
# Check performance of the model
model.score(X_test_vec_2, y_test)

In [None]:
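# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
eli5.show_weights(model, feature_names=list(vectorizer.get_feature_names_out()), target_names=['negative','positive'], top=20)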

In [None]:
eli5.show_prediction(model, X_test[0], vec=vectorizer, target_names=['negative','positive'])

In [None]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.neural_network import MLPClassifier

tfidf = TfidfVectorizer(vocabulary=list(dictionary.values()))
svd = TruncatedSVD(n_components=100, n_iter=7, random_state=42)
clf = MLPClassifier(verbose=False)


pipe = make_pipeline(tfidf, svd, clf)

pipe.fit(X_train, y_train)

In [None]:
pipe.score(X_test, y_test)
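
The `TruncatedSVD` step in the pipeline compresses the sparse TF-IDF matrix into a small number of dense components (this is the LSI idea). Below is a minimal NumPy sketch on a made-up document-term matrix, assuming NumPy is available. The matrix values and `k` are illustrative only, and scikit-learn uses a different (randomized) solver internally.

```python
import numpy as np

# Made-up document-term matrix: 4 documents x 5 terms
X_demo = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0],
    [0.0, 0.0, 2.0, 1.0, 1.0],
])

k = 2  # number of latent components to keep

# Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order
U, s, Vt = np.linalg.svd(X_demo, full_matrices=False)

# Keep the k strongest components; each row is now a dense k-dimensional document vector
X_reduced = U[:, :k] * s[:k]

# Rank-k approximation of the original matrix
X_approx = X_reduced @ Vt[:k, :]

print(X_reduced.shape)
print(np.round(X_approx, 2))
```

Documents 1 and 2 share terms, as do documents 3 and 4, so each pair ends up close together in the reduced space.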

In [None]:
from eli5.lime import TextExplainer

te = TextExplainer(random_state=42)
te.fit(X_test[0], pipe.predict_proba)
te.show_prediction(target_names=['negative','positive'])

## Semantic search

Once you have dense vectors that represent your texts, you can calculate distances between them. Texts whose vectors lie close together (low distance) are likely to be semantically similar... :-)

This can be done by calculating all distances in the corpus (which would be rather inefficient) or by using nearest-neighbor approximation.
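
As a sketch of the exhaustive variant, the snippet below ranks documents by cosine similarity to a query document using plain NumPy. The vectors are made up and `most_similar` is a hypothetical helper; an approximate nearest-neighbor index aims to return similar results without comparing against every document.

```python
import numpy as np

# Made-up dense document vectors (one row per document)
demo_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.2, 0.8, 0.6],
])

def most_similar(query_idx, vectors, top_n=2):
    """Return the indices of the top_n rows most similar to row query_idx (cosine similarity)."""
    # Normalise rows to unit length so that the dot product equals the cosine similarity
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed[query_idx]
    sims[query_idx] = -np.inf  # exclude the query document itself
    return np.argsort(-sims)[:top_n].tolist()

print(most_similar(0, demo_vectors))
print(most_similar(2, demo_vectors))
```

This costs one comparison per document for every query, which is why it becomes slow on large corpora.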

We will be using Annoy, a popular library for approximate nearest-neighbor search developed at Spotify (to find similar songs):
https://github.com/spotify/annoy


In [None]:
!pip install annoy

Let's first vectorise our texts. For this we will be using Gensim, as it provides a more language-oriented approach and is a good segue into topic modelling...

In [None]:
# Import the dictionary builder
from gensim.corpora.dictionary import Dictionary

# Import the TfidfModel from Gensim
from gensim.models.tfidfmodel import TfidfModel

# Import the LSI (latent semantic indexing) model
from gensim.models.lsimodel import LsiModel

# Tooling to map between corpus (gensim) and matrix - more general
from gensim.matutils import corpus2csc, corpus2dense

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for 'punkt_tab' instead of 'punkt'
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

In [None]:
# Generate a dictionary and filter
dictionary = Dictionary(tokenlist)
dictionary.filter_extremes(no_below=5, no_above=0.2)

In [None]:
# construct corpus using this dictionary
corpus = [dictionary.doc2bow(word_tokenize(doc.lower())) for doc in data['review']]

In [None]:
# Create and fit a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)

In [None]:
# transform corpus to TFIDF
corpus_tfidf = tfidf[corpus]

In [None]:
# Training the LSI model
model_lsi = LsiModel(corpus_tfidf, num_topics = 300, id2word=dictionary)

In [None]:
# Project the whole TF-IDF corpus into the LSI space

corpus_lsi = model_lsi[corpus_tfidf]

In [None]:
# turn into a dense matrix of shape (num_topics, num_docs)
corpus_lsi_matrix = corpus2dense(corpus_lsi, 300)

In [None]:
corpus_lsi_matrix.shape

In [None]:
# transpose so that each row is one document
corpus_lsi_matrix = corpus_lsi_matrix.T

In [None]:
from annoy import AnnoyIndex

In [None]:
f = 300  # length of the item vectors that will be indexed (number of LSI topics)

t = AnnoyIndex(f, 'angular')

# add every document vector to the index under its row number
for i in range(len(corpus_lsi_matrix)):
    t.add_item(i, corpus_lsi_matrix[i])

In [None]:
t.build(10)  # build the index with 10 trees

In [None]:
t.get_nns_by_item(0, 10)

In [None]:
data['review'][44]

In [None]:
data['review'][t.get_nns_by_item(44, 10)]