<a href="https://colab.research.google.com/github/RJuro/Africalics-PhD-Academy-2018/blob/master/notebooks/NLP_supervised_and_other_text_based_fun.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Roman Jurowetzki, Aalborg University
In part based on the Intro from the DeepNLP course by Dan Anastasyev - https://github.com/DanAnastasyev/DeepNLP-Course

In [0]:
# Some initial downloads and installs

!wget -O imdb.zip -qq --no-check-certificate "https://drive.google.com/uc?export=download&id=1vrQ5czMHoO3pEnmofFskymXMkq_u1dPc"
!unzip imdb.zip
!pip -q install eli5

# Supervised ML & Text & some other things

![alt text](https://www.dropbox.com/s/lyp8lvbphdnuhd2/antique-classic-close-up-1995842.jpg?dl=1)

In this tutorial we use the well-known IMDB movie review dataset for simple classification with different vectorization techniques:


*   Simple bag-of-words
*   TF-IDF
*   LSI / SVD
*   Average embeddings (general public domain vectors)
*   Average embeddings (custom-trained vectors)
*   TF-IDF weighted embeddings

We will also look at some more recent approaches to model explainability, i.e. "Why did the model decide this or that?"


Finally, we will look at a simple approach to building a **semantic search** based on vector similarity.



Disclaimer:

- This is more a demo than a class, so please don't be distracted by code that is sometimes complex. If you are interested, you can review the material later.
- The more complex things get, the more options we have. Yet this also means that complex models often underperform "out-of-the-box".

In [2]:
!head train.tsv

is_positive	review
0	"Dreamgirls, despite its fistful of Tony wins in an incredibly weak year on Broadway, has never been what one would call a jewel in the crown of stage musicals. However, that is not to say that in the right cinematic hands it could not be fleshed out and polished into something worthwhile on-screen. Unfortunately, what transfers to the screen is basically a slavishly faithful version of the stage hit with all of its inherent weaknesses intact. First, the score has never been one of the strong points of this production and the film does not change that factor. There are lots of songs (perhaps too many?), but few of them are especially memorable. The closest any come to catchy tunes are the title song and One Night Only - the much acclaimed And I Am Telling You That I Am Not Going is less a great song than it is a dramatic set piece for the character of Effie (Jennifer Hudson). The film is slick and technically well-produced, but the story and characters are surpris

In [1]:
# Read in the files and quickly print the size of the training and test set.

import pandas as pd
import numpy as np

train_df = pd.read_csv("train.tsv", delimiter="\t")
test_df = pd.read_csv("test.tsv", delimiter="\t")

print('Train size = {}'.format(len(train_df)))
print('Test size = {}'.format(len(test_df)))

Train size = 25000
Test size = 25000


In [2]:
# some prep work so we don't have to wait all the time

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize, sent_tokenize

import multiprocessing
p = multiprocessing.Pool()

train_df_review_tok = p.map(word_tokenize, train_df.review)
test_df_review_tok = p.map(word_tokenize, test_df.review)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
# preload the word-vectors to make things faster later on

import gensim.downloader as api

model_gensim_glove = api.load('glove-wiki-gigaword-300')





The dataset is from the IMDB review database and contains 50k reviews, split evenly into a training and a test set. As the outcome variable we use a simple binary "is positive" measure.

![alt text](https://www.dropbox.com/s/p6itk8lbh5yoki3/imdb.jpg?dl=1)



In [0]:
# some basic text cleaning, removing HTML fragments

import re

pattern = re.compile('<br /><br />')

print(train_df['review'].iloc[3])
print(pattern.subn(' ', train_df['review'].iloc[3])[0])

Spoilers ahead if you want to call them that...<br /><br />I would almost recommend this film just so people can truly see a 1/10. Where to begin, we'll start from the top...<br /><br />THE STORY: Don't believe the premise - the movie has nothing to do with abandoned cars, and people finially understanding what the mysterious happenings are. It's a draub, basic, go to cabin movie with no intensity or "effort".<br /><br />THE SCREENPLAY: I usually give credit to indie screenwriters, it's hard work when you are starting out...but this is crap. The story is flat - it leaves you emotionless the entire movie. The dialogue is extremely weak and predictable boasting lines of "Woah, you totally freaked me out" and "I was wondering if you'd uh...if you'd like to..uh, would you come to the cabin with me?". It makes me want to rip out all my hair, one strand at a time and feed it to myself.<br /><br />THE CHARACTERS: HOLY CRAP!!!! Some have described the characters as flat, I want to take it one 

In [0]:
# application of the cleaning mask

train_df['review'] = train_df['review'].apply(lambda text: pattern.subn(' ', text)[0])
test_df['review'] = test_df['review'].apply(lambda text: pattern.subn(' ', text)[0])
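
The pattern above only removes the exact fragment `<br /><br />`. As a hedged sketch, assuming that other variants of the line-break tag (such as a single `<br/>`) may also occur in the reviews, a slightly more general pattern could look like this. The sample text and the name `br_pattern` are made up for illustration.

```python
import re

# Matches one or more consecutive <br>, <br/> or <br /> tags,
# including any whitespace between them (an assumption about the data).
br_pattern = re.compile(r'(?:<br\s*/?>\s*)+')

text = 'Spoilers ahead...<br /><br />I would almost recommend this.<br/>Really.'
cleaned = br_pattern.sub(' ', text)
print(cleaned)
```

Each run of line-break tags is collapsed into a single space, so sentences do not get glued together.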

## Bag of words

The simplest way to represent text is the so-called bag-of-words approach. Here we simply count how often each word occurs in each phrase and build a table of phrases (rows) and words (columns).

![bow](https://raw.githubusercontent.com/DanAnastasyev/DeepNLP-Course/master/Week%2001/Images/BOW.png)

In Python we can do it like this:

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

dummy_data = ['The movie was excellent',
              'the movie was awful']

dummy_matrix = vectorizer.fit_transform(dummy_data)


# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
pd.DataFrame(data = dummy_matrix.toarray(), columns = vectorizer.get_feature_names_out())

Unnamed: 0,awful,excellent,movie,the,was
0,0,1,1,1,1
1,1,0,1,1,1


In [4]:
# Let's apply this method to the entire training dataset (50% of all reviews)

vectorizer = CountVectorizer()
vectorizer.fit(train_df['review'].values)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [5]:
# Some simple statistics and illustration

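# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
feature_names = vectorizer.get_feature_names_out()

print(feature_names[:20].tolist())   # first 20 terms of the vocabulary
print(feature_names[-20:].tolist())  # last 20 terms of the vocabulary
print(len(feature_names))            # vocabulary size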


['00', '000', '0000000000001', '00001', '00015', '000s', '001', '003830', '006', '007', '0079', '0080', '0083', '0093638', '00am', '00pm', '00s', '01', '01pm', '02']
['är', 'ääliöt', 'äänekoski', 'åge', 'åmål', 'æsthetic', 'écran', 'élan', 'émigré', 'émigrés', 'était', 'état', 'étc', 'évery', 'êxtase', 'ís', 'ísnt', 'østbye', 'über', 'üvegtigris']
74849


In Python's scikit-learn toolkit most models (and preprocessing steps) follow a fit-transform logic:
- `fit` here means "go learn all words and build an index"
- `transform` then generates the transformation in a second step
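
To make the two steps concrete, here is a minimal sketch of the fit-transform idea in plain Python. `TinyCountVectorizer` is a hypothetical, heavily simplified stand-in (whitespace tokens only), not scikit-learn's actual implementation.

```python
class TinyCountVectorizer:
    """Hypothetical, simplified stand-in for a count vectorizer."""

    def fit(self, docs):
        # "go learn all words and build an index"
        vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
        self.vocabulary_ = {tok: i for i, tok in enumerate(vocab)}
        return self

    def transform(self, docs):
        # use the learned index to turn each text into a row of counts
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for tok in doc.lower().split():
                if tok in self.vocabulary_:  # words not seen during fit are dropped
                    row[self.vocabulary_[tok]] += 1
            rows.append(row)
        return rows


vec = TinyCountVectorizer().fit(['The movie was excellent', 'the movie was awful'])
new_row = vec.transform(['the film was excellent excellent'])[0]
print(vec.vocabulary_)
print(new_row)
```

Note that the new text is encoded with the index learned during `fit`: the unseen word "film" simply does not show up. This is why the vectorizer is fitted on the training data and then only applied to the test data.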

In [6]:
# let's apply the fitted vectorizer to one review

vectorizer.transform([train_df['review'].iloc[3]])

<1x74849 sparse matrix of type '<class 'numpy.int64'>'
	with 207 stored elements in Compressed Sparse Row format>

As you can see, the output is a sparse matrix of shape (1, 74849): one row for the review and one column per word in the vocabulary.

The task is to build a classifier that automatically sorts reviews into positive or negative. Intuitively, if we had to write such a program by hand, we would do something like this:

We would give positive words a value of, let's say, 1, negative words -1, and all neutral words 0. The score of a review is then the sum of the values of its words.

The simplest way of learning such a linear model from data is logistic regression. Let's try that.

![bow with weights](https://github.com/DanAnastasyev/DeepNLP-Course/raw/master/Week%2001/Images/BOW_weights.png)
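
Before learning the weights from data, the hand-made version of this idea can be sketched in a few lines. The word list and weights below are made up for illustration; the logistic (sigmoid) function turns the summed score into a probability, which is also what logistic regression does with its learned weights.

```python
import math

# Hypothetical hand-picked weights: +1 for positive, -1 for negative words.
# Every word not listed here counts as neutral (weight 0).
weights = {'excellent': 1.0, 'great': 1.0, 'awful': -1.0, 'boring': -1.0}

def score(text):
    """Sum of the word weights in the text."""
    return sum(weights.get(tok, 0.0) for tok in text.lower().split())

def prob_positive(text):
    """Squash the score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score(text)))

s_pos = score('The movie was excellent')
s_neg = score('the movie was awful and boring')
print(s_pos, prob_positive('The movie was excellent'))
print(s_neg, prob_positive('the movie was awful and boring'))
```

A text with only neutral words gets a score of 0 and therefore a probability of exactly 0.5.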

In [7]:
# to make things more elegant, we use a pipeline here

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

dummy_data = ['The movie was excellent',
              'the movie was awful']

dummy_labels = [1, 0]

vectorizer = CountVectorizer()
classifier = LogisticRegression()

model = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier)
])

model.fit(dummy_data, dummy_labels)


# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
pd.DataFrame(data = classifier.coef_, columns = vectorizer.get_feature_names_out())



Unnamed: 0,awful,excellent,movie,the,was
0,-0.401058,0.401058,0.0,0.0,0.0


As we expected, the model learned that "awful" is something bad while "excellent" is something good.
We can now do the same for the whole dataset.

In [0]:
# note that `model` here refers to the whole pipeline, not only the logistic regression classifier

model.fit(train_df['review'], train_df['is_positive'])



Pipeline(memory=None,
         steps=[('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='warn', n_jobs=None,
             

In [0]:
# we can also evaluate the performance

from sklearn.metrics import accuracy_score

def eval_model(model, test_df):
    preds = model.predict(test_df['review'])
    print('Test accuracy = {:.2%}'.format(accuracy_score(test_df['is_positive'], preds)))
    
eval_model(model, test_df)

Test accuracy = 86.66%


In recent years there has been much focus on model explainability.

![alt text](https://media.giphy.com/media/9FvN85CcQU9fW/giphy.gif)

The community has developed some nice methods and tools to understand ML results. One of them is LIME (https://arxiv.org/abs/1602.04938), which is implemented in **eli5** with a focus on text.

In [0]:
# first we can check the weights

import eli5
eli5.show_weights(classifier, vec=vectorizer, top=30)



Weight?,Feature
+1.585,refreshing
+1.411,wonderfully
+1.354,erotic
+1.296,funniest
+1.288,perfect
+1.282,excellent
+1.279,carrey
+1.261,superb
+1.250,surprisingly
+1.250,appreciated


In [0]:
# we can repeat the exercise for a specific prediction

print('Positive' if test_df['is_positive'].iloc[1] else 'Negative')

eli5.show_prediction(classifier, test_df['review'].iloc[1], vec=vectorizer, 
                     targets=['positive'], target_names=['negative', 'positive'])

Positive


Contribution?,Feature
16.203,Highlighted in text (sum)
0.045,<BIAS>


In [0]:
print('Positive' if test_df['is_positive'].iloc[12] else 'Negative')
eli5.show_prediction(classifier, test_df['review'].iloc[12], vec=vectorizer, 
                     targets=['positive'], target_names=['negative', 'positive'])

Negative


Contribution?,Feature
0.045,<BIAS>
-4.122,Highlighted in text (sum)


## TF-IDF

Right now we attribute equal weight to all words. Yet some words are rare and others very common, and this frequency of occurrence is actually useful information.

The easiest approach to add statistical information on frequency is to apply *tf-idf* weights:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$

*tf* - term frequency - the number of times the term `t` occurs in the specific document `d` (in our case the individual review). This is exactly what we were counting before.

*idf* - inverse document frequency - a coefficient that gets larger the fewer documents contain the particular term. We calculate it like so:
$$\text{idf}(t) = \log\frac{1 + n_d}{1 + n_{d(t)}} + 1$$
where $n_d$ is the number of all documents and $n_{d(t)}$ is the number of documents containing the term `t`.
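
As a small worked example, the formula can be applied by hand to the two dummy sentences used earlier. This is only a sketch of the arithmetic with the natural logarithm; the helper names `idf` and `tf_idf` are made up, and scikit-learn's vectorizer additionally normalizes each row by default (see `norm='l2'` in its parameters).

```python
import math

docs = [['the', 'movie', 'was', 'excellent'],
        ['the', 'movie', 'was', 'awful']]

def idf(term, docs):
    n_d = len(docs)                           # number of all documents
    n_dt = sum(term in doc for doc in docs)   # documents containing the term
    return math.log((1 + n_d) / (1 + n_dt)) + 1

def tf_idf(term, doc, docs):
    return doc.count(term) * idf(term, docs)  # tf(t, d) * idf(t)

idf_movie = idf('movie', docs)   # in both documents: log(3/3) + 1
idf_awful = idf('awful', docs)   # in one document:   log(3/2) + 1
print(idf_movie, idf_awful)
print(tf_idf('awful', docs[1], docs))
```

The word "movie" appears everywhere and keeps the minimum weight of 1, while the rarer "awful" is weighted up to roughly 1.41.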

It's very easy to use - just replace `CountVectorizer` with `TfidfVectorizer`.

In [0]:
# To implement TF-IDF we only need to replace the vectorizer 

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
classifier = LogisticRegression()

dummy_data = ['The movie was excellent',
              'the movie was awful']

dummy_labels = [1, 0]


model = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier)
])

model.fit(dummy_data, dummy_labels)


# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
pd.DataFrame(data = classifier.coef_, columns = vectorizer.get_feature_names_out())



Unnamed: 0,awful,excellent,movie,the,was
0,-0.286673,0.286673,0.0,0.0,0.0


In [0]:
vectorizer = TfidfVectorizer()
classifier = LogisticRegression()

model = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier)
])

model.fit(train_df['review'], train_df['is_positive'])

eval_model(model, test_df)



Test accuracy = 88.29%


In [0]:
print('Positive' if test_df['is_positive'].iloc[12] else 'Negative')
eli5.show_prediction(classifier, test_df['review'].iloc[12], vec=vectorizer, 
                     targets=['positive'], target_names=['negative', 'positive'])

Negative


Contribution?,Feature
-0.155,<BIAS>
-1.129,Highlighted in text (sum)


## Topic modelling & more complex models


While >88% is a great result, we can try to improve it further by applying more complex methods. For instance, we can try to reduce the dimensionality of the input matrix. One way to achieve this is topic modelling. The illustration is better suited to explaining LDA (latent Dirichlet allocation) but can also be used to understand other related approaches. We will be using Latent Semantic Analysis (LSA).
In a nutshell, we transform the matrix (representing the phrases) into a matrix of lower dimensionality where each dimension is an (ideally) interpretable combination of co-occurring terms and thereby a topic. A phrase is then a combination of various topics rather than a combination of terms.

A very easy way of implementing LSA is scikit-learn's truncated SVD module (truncated singular value decomposition). Applied to a term-document matrix, it is exactly the same as LSA. Why two names? Because different communities call their things differently.

We will also use a different classifier, just to make the point that it often doesn't matter... or does it?
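
To see what the decomposition does, here is a minimal NumPy sketch on a made-up 4 x 5 document-term matrix. It uses a full SVD and keeps the first two components by hand, which is an assumption-level illustration of the idea rather than the randomized solver used on the real data.

```python
import numpy as np

# Toy document-term matrix: 4 documents x 5 terms (made-up counts).
# Documents 0-1 mostly use terms 0-1, documents 2-3 mostly use terms 2-3.
X = np.array([[2, 1, 0, 0, 0],
              [1, 2, 0, 0, 1],
              [0, 0, 2, 1, 0],
              [0, 0, 1, 2, 1]], dtype=float)

U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                            # number of "topics" to keep
doc_topics = U[:, :k] * S[:k]    # each document as a mix of k topics, shape (4, 2)
topic_terms = Vt[:k]             # each topic as a weighted mix of terms, shape (2, 5)

# Rank-k approximation of the original matrix and its reconstruction error
X_approx = doc_topics @ topic_terms
error = np.linalg.norm(X - X_approx)

print(doc_topics.round(2))
print(round(error, 3))
```

The rows of `doc_topics` are the low-dimensional document representations that are passed on to the classifier; the reconstruction error shows how much information is lost by dropping the remaining components.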


In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# we will use Extreme Gradient Boosting (XGBoost) for the classification
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline, make_pipeline

vec = TfidfVectorizer()
svd = TruncatedSVD(n_components=200, n_iter=7, random_state=42)

lsa = make_pipeline(vec, svd)

clf = XGBClassifier(max_depth=3, n_estimators=300, learning_rate=0.1)

pipe = make_pipeline(lsa, clf)

In [0]:
pipe.fit(train_df['review'], train_df['is_positive'])

Pipeline(memory=None,
         steps=[('pipeline',
                 Pipeline(memory=None,
                          steps=[('tfidfvectorizer',
                                  TfidfVectorizer(analyzer='word', binary=False,
                                                  decode_error='strict',
                                                  dtype=<class 'numpy.float64'>,
                                                  encoding='utf-8',
                                                  input='content',
                                                  lowercase=True, max_df=1.0,
                                                  max_features=None, min_df=1,
                                                  ngram_range=(1, 1), norm='l2',
                                                  preprocessor=None,
                                                  smooth_idf=True,
                                                  stop_words=None,
                                              

In [0]:
pipe.score(test_df.review, test_df['is_positive'])

0.83392

In the previous examples life was rather easy when it came to explainability. With LSA and XGBoost we are now dealing with a black-box model. Here again **eli5**, as a wrapper around **LIME**, is helpful.
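
The core idea behind LIME-style explanations can be sketched without any library: randomly remove words from the text, ask the black box for a prediction on each variant, and fit a simple linear model to those predictions. The sketch below is a strong simplification (no proximity weighting, plain least squares), and `black_box` is a made-up stand-in for something like `pipe.predict_proba`.

```python
import numpy as np

def black_box(texts):
    """Hypothetical black-box model returning P(positive) for each text."""
    probs = []
    for text in texts:
        toks = text.split()
        s = 2.0 * toks.count('great') - 2.0 * toks.count('dull')
        probs.append(1.0 / (1.0 + np.exp(-s)))
    return np.array(probs)

tokens = ['great', 'film', 'but', 'dull', 'ending']

# 1) Perturb: randomly keep (1) or drop (0) each word
rng = np.random.default_rng(0)
masks = rng.integers(0, 2, size=(500, len(tokens)))
masks[0] = 1  # make sure the original text is included
texts = [' '.join(t for t, keep in zip(tokens, row) if keep) for row in masks]

# 2) Query the black box on every perturbed text
probs = black_box(texts)

# 3) Fit a linear surrogate: probability ~ sum of word contributions + bias
A = np.hstack([masks, np.ones((len(masks), 1))])
coef, *_ = np.linalg.lstsq(A, probs, rcond=None)
contrib = dict(zip(tokens, coef[:-1]))

for tok, c in contrib.items():
    print('{:>7}: {:+.3f}'.format(tok, c))
```

The surrogate's coefficients play the role of the word contributions highlighted in the eli5 output: words the black box reacts to get large positive or negative values, words it ignores end up near zero.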

In [0]:
from eli5.lime import TextExplainer

In [0]:
te = TextExplainer(random_state=42)
te.fit(test_df.review[1], pipe.predict_proba)
te.show_prediction(target_names=['negative', 'positive'])

Contribution?,Feature
2.005,Highlighted in text (sum)
-0.013,<BIAS>


### We can also use Gensim

In [0]:
from gensim.corpora.dictionary import Dictionary

# Dictionary
dictionary = Dictionary(train_df_review_tok)
dictionary.filter_extremes(no_below=10, no_above=0.5, keep_n=2000)

# corpus
corpus_train = [dictionary.doc2bow(doc) for doc in train_df_review_tok]
corpus_test = [dictionary.doc2bow(doc) for doc in test_df_review_tok]

In [0]:
from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus_train)

corpus_train_tfidf = tfidf[corpus_train]
corpus_test_tfidf =  tfidf[corpus_test]

In [0]:
from gensim.models.lsimodel import LsiModel

lsi = LsiModel(corpus_train_tfidf, id2word=dictionary, num_topics=100)

corpus_train_lsi = lsi[corpus_train_tfidf]
corpus_test_lsi =  lsi[corpus_test_tfidf]

In [22]:
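from gensim.similarities import MatrixSimilarity

# MatrixSimilarity stores the (unit-length normalized) document vectors as a dense
# NumPy matrix in its .index attribute; we use it here as a shortcut to get that matrix
corpus_train_lsi_dense = MatrixSimilarity(corpus_train_lsi)
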

In [23]:
corpus_test_lsi_dense = MatrixSimilarity(corpus_test_lsi)



In [0]:
from sklearn.neural_network import MLPClassifier

In [0]:
clf = MLPClassifier(verbose=True, early_stopping=True)
clf.fit(corpus_train_lsi_dense.index, train_df['is_positive'])

In [56]:
clf.score(corpus_test_lsi_dense.index, test_df['is_positive'])

0.86216

## Word embeddings (aka. Word2Vec & Co.)

We will start by using some pretrained vectors that are readily available online. Often these are trained on text data from Wikipedia and other web sources.

In [0]:
#assigning preloaded model for the seminar

model = model_gensim_glove

In [0]:
#import gensim.downloader as api

#model = api.load('glove-wiki-gigaword-300')



In [0]:
model.most_similar('student')



[('students', 0.7690913677215576),
 ('teacher', 0.6873654723167419),
 ('graduate', 0.6737600564956665),
 ('school', 0.613064706325531),
 ('college', 0.6090279221534729),
 ('undergraduate', 0.6043775677680969),
 ('faculty', 0.5998986959457397),
 ('university', 0.5970512628555298),
 ('academic', 0.5810065865516663),
 ('campus', 0.5767688155174255)]

In [0]:
model.most_similar(positive=['king', 'woman'], negative=['man'])



[('queen', 0.6713277101516724),
 ('princess', 0.5432624220848083),
 ('throne', 0.5386104583740234),
 ('monarch', 0.5347574949264526),
 ('daughter', 0.498025119304657),
 ('mother', 0.4956442713737488),
 ('elizabeth', 0.4832652509212494),
 ('kingdom', 0.47747087478637695),
 ('prince', 0.4668239951133728),
 ('wife', 0.4647327661514282)]

In [0]:
model.most_similar(positive=['france', 'vodka'], negative=['russia'])



[('cognac', 0.5327103734016418),
 ('champagne', 0.5181883573532104),
 ('liqueur', 0.5106790661811829),
 ('bourbon', 0.452729195356369),
 ('whiskey', 0.4516924321651459),
 ('bottle', 0.45057547092437744),
 ('drink', 0.4444752335548401),
 ('drinks', 0.44175881147384644),
 ('brandy', 0.43809670209884644),
 ('wine', 0.43010368943214417)]

In [0]:
model.most_similar(positive=['france', 'beer'], negative=['germany'])



[('drink', 0.5580321550369263),
 ('wine', 0.5453617572784424),
 ('drinks', 0.5428332090377808),
 ('champagne', 0.5211746692657471),
 ('whiskey', 0.502285361289978),
 ('bottles', 0.4866787791252136),
 ('bottle', 0.48181623220443726),
 ('beers', 0.48144668340682983),
 ('beverage', 0.465888649225235),
 ('bottled', 0.46392548084259033)]

In [0]:
# a little helper function turning input text into its average vector

def get_phrase_embedding(model, phrase):
    # fallback for phrases without any known token: the zero vector
    vector = np.zeros([model.vector_size], dtype='float32')
    # raw strings are tokenized and lowercased; token lists are used as they are
    if isinstance(phrase, str):
        phrase = [tok.lower() for tok in word_tokenize(phrase)]
    # keep only tokens that are in the model's vocabulary
    vecs = [model.get_vector(tok) for tok in phrase if tok in model]
    if len(vecs) == 0:
        return vector
    return sum(vecs) / len(vecs)

In [39]:
# some modules needed to prepare our texts

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize, sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
# vectorizing the train and test-set using average vectors from the pretrained model

text_vectors_train = np.array([get_phrase_embedding(model, phrase) for phrase in train_df_review_tok])
text_vectors_test = np.array([get_phrase_embedding(model, phrase) for phrase in test_df_review_tok])

print(text_vectors_train.shape)

In [0]:
clf = LogisticRegression()
clf.fit(text_vectors_train,train_df['is_positive'])



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
clf.score(text_vectors_test,test_df['is_positive'])

0.83416

83% is not particularly good. Let's see if we can improve that score using custom-trained word embeddings. To accomplish this, we need to split all our data into sentences, which will serve as the context of each word.
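
To make the notion of "context" concrete, here is a small sketch of which (target, context) word pairs a sliding window produces within one sentence. It illustrates the idea only and is not how Gensim implements training internally; `context_pairs` is a made-up helper.

```python
def context_pairs(tokens, window=2):
    """All (target, context) pairs within `window` words on either side."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

pairs = context_pairs(['the', 'movie', 'was', 'awful'], window=1)
print(pairs)
```

Because pairs are only formed within a sentence, the last word of one sentence never becomes the context of the first word of the next, which is why the reviews are split into sentences first.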

In [0]:
# first we create a list of reviews
texts = list(train_df.review)

# optionally, we also add the reviews from the test set
# (training the embeddings does not use the labels)
texts.extend(test_df.review)

In [0]:
# here we split the sentences

sents = []
for text in texts:
  sents.extend(sent_tokenize(text))

In [0]:
# here we create tokenized sentences

tokenized_texts = [word_tokenize(text) for text in sents]
tokenized_texts = list(map(lambda x: [y.lower() for y in x], tokenized_texts))

In [0]:
# we can use Gensim to train the model

from gensim.models import Word2Vec

In [0]:
model = Word2Vec(tokenized_texts, 
                 size=300,      # embedding vector size (called vector_size in gensim >= 4.0)
                 #min_count=10,  # consider words that occurred at least 10 times
                 window=8,      # define context as an 8-word window around the target word
                 max_final_vocab=3000).wv  # limit the vocabulary to about 3000 words

In [0]:
model.most_similar('stupid')



[('dumb', 0.8213922381401062),
 ('lame', 0.753311038017273),
 ('ridiculous', 0.7447041273117065),
 ('pathetic', 0.6781675219535828),
 ('unrealistic', 0.677298903465271),
 ('silly', 0.6696971654891968),
 ('idiotic', 0.6617854833602905),
 ('unfunny', 0.6458582878112793),
 ('unbelievable', 0.645208477973938),
 ('corny', 0.6252151727676392)]

Now that we have spent some time training the model, let's spend a bit more to have a look at it.

In [0]:
# we can print some of the most frequent terms
words = sorted(model.vocab.keys(), 
               key=lambda word: model.vocab[word].count,
               reverse=True)[:1000]

print(words[::100])

['the', 'him', 'however', 'money', 'budget', 'tries', 'brother', 'light', 'famous', 'comment']


In [0]:
# the model can also return our word-vectors

word_vectors = model.vectors[[model.vocab[word].index for word in words]]

In [0]:
# little function to visualize the vectors

import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook

def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxiliary info on hover """
    output_notebook()
    
    if isinstance(color, str): 
        color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: 
        pl.show(fig)
    return fig

In [0]:
# we can project the word vectors into two-dimensional space using the UMAP library
import umap

def get_umap_projection(word_vectors):
    vecs = umap.UMAP(n_neighbors=15, metric='cosine').fit_transform(word_vectors)
    return vecs

In [0]:
word_umap = get_umap_projection(word_vectors[:1000])
draw_vectors(word_umap[:, 0], word_umap[:, 1], color='blue', token=words)

Now let's try to run the classification again. This time using our custom embeddings.

In [46]:
# again, we transform our tokenized phrases into matrices

text_vectors_train = np.array([get_phrase_embedding(model, phrase) for phrase in train_df_review_tok])
text_vectors_test = np.array([get_phrase_embedding(model, phrase) for phrase in test_df_review_tok])

  


In [0]:
# and again, we fit the model

clf = LogisticRegression()
clf.fit(text_vectors_train,train_df['is_positive'])
clf.score(text_vectors_test,test_df['is_positive'])



0.855

In [0]:
# and once more, this time with a neural network (MLP) classifier

clf = MLPClassifier(verbose=True, early_stopping=True)
clf.fit(text_vectors_train,train_df['is_positive'])
clf.score(text_vectors_test,test_df['is_positive'])

## TF-IDF weighted phrase embeddings

For our final exercise we will combine the ideas behind TF-IDF and word embeddings.
Basically, we need to weight the embeddings that go into the representation of a phrase by the TF-IDF values of the individual tokens.

We can achieve this by multiplying the TF-IDF document matrix with the word-vector matrix.

The approach shown here is not really suitable for larger datasets but works nicely for demonstration on the toy dataset in this example.
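
The matrix multiplication can be checked on a tiny made-up example: three words with two-dimensional embeddings and two documents. All numbers below are invented for illustration.

```python
import numpy as np

# Toy word-vector matrix: 3 words x 2 embedding dimensions
word_vectors = np.array([[1.0, 0.0],    # 'good'
                         [0.0, 1.0],    # 'bad'
                         [0.5, 0.5]])   # 'movie'

# Toy TF-IDF matrix: 2 documents x 3 words (same word order as above)
tfidf = np.array([[0.8, 0.0, 0.6],      # "good movie"
                  [0.0, 0.8, 0.6]])     # "bad movie"

# Each document vector is the TF-IDF weighted sum of its word vectors
doc_vectors = tfidf @ word_vectors

# The same result for the first document, written out by hand
manual_first = 0.8 * word_vectors[0] + 0.6 * word_vectors[2]

print(doc_vectors)
print(manual_first)
```

The only requirement is that the columns of the TF-IDF matrix and the rows of the word-vector matrix refer to the same words in the same order.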


In [50]:
# we start by creating a tf-idf document matrix using only the vocabulary that can be found in the embeddings
from sklearn.feature_extraction.text import TfidfVectorizer
# `model` is already a KeyedVectors object here, so the extra `.wv` is not needed
vectorizer = TfidfVectorizer(vocabulary=model.vocab.keys())

tfidf_matrix_train = vectorizer.fit_transform(train_df['review'])
tfidf_matrix_test = vectorizer.transform(test_df['review'])

  


In [51]:
# extract a matching word-vector matrix
vecs = np.vstack([model[x] for x in model.vocab.keys()])



In [0]:
doc_vecs_train = np.dot(tfidf_matrix_train.toarray(), vecs)
doc_vecs_test = np.dot(tfidf_matrix_test.toarray(), vecs)

In [54]:
# test the performance - better than before (not much, but still)
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(doc_vecs_train, train_df.is_positive)
classifier.score(doc_vecs_test, test_df.is_positive)



0.86372

In [0]:
classifier = MLPClassifier(verbose=True, early_stopping=True)

classifier.fit(doc_vecs_train, train_df.is_positive)


In [0]:
classifier.score(doc_vecs_test, test_df.is_positive)

## Finally - semantic search

Here we try to identify similar text fragments, i.e. those whose vectors have a low cosine distance (high cosine similarity) to the vector of a query.

![alt text](https://i1.wp.com/dataaspirant.com/wp-content/uploads/2015/04/cover_post_final.png)
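
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it only depends on the direction of the vectors and not on their magnitude. A minimal NumPy sketch with made-up two-dimensional vectors (the helper name `cosine_sim` is hypothetical):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 1.0])
docs = np.array([[2.0, 2.0],     # same direction as the query, but longer
                 [1.0, 0.0],     # 45 degrees away from the query
                 [-1.0, -1.0]])  # opposite direction

sims = [cosine_sim(doc, query) for doc in docs]
ranking = np.argsort(sims)[::-1]  # indices of the documents, most similar first

print(sims)
print(ranking)
```

Ranking all texts by their similarity to the query vector and returning the top `k` is all the search function needs to do.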

In [0]:
# we will use the existing cosine similarity implementation
from sklearn.metrics.pairwise import cosine_similarity


# we will also define a find_nearest function that finds k nearest texts given a query
def find_nearest(model, text_vectors, texts, query, k=10):
    query = get_phrase_embedding(model, query)
    c = cosine_similarity(text_vectors, query.reshape(1,-1))
    c = pd.DataFrame(c)
    ix = [int(x) for x in c.sort_values(by = 0, ascending=False)[:k].index]
    return [texts[x] for x in ix]

A quick test on some dummy data: we would expect the first sentence to have a higher similarity to the query than the second.

In [0]:
dummy_data = ['The movie was excellent',
              'the movie was horrible and awful']
dummy_query = 'this was an excellent movie'

dummy_data_embedding = np.array([get_phrase_embedding(model, phrase) for phrase in dummy_data])
dummy_query_embedding = np.array(get_phrase_embedding(model, dummy_query))

c = cosine_similarity(dummy_data_embedding, dummy_query_embedding.reshape(1,-1))
print(c)

[[0.7419639 ]
 [0.55409884]]


  


In [0]:
# for better readability
import textwrap

Below you can see the results from the semantic search.



*   First using average vectors
*   Second using TF-IDF weighted vectors

Which one is better? There is quite some overlap in the results, so it is hard to say.



In [0]:
# Let's try on real data, first using our average embedding vectors

results = find_nearest(model, text_vectors_train, train_df.review, query="bad horror", k=5)

for text in results:
  print(textwrap.fill(text, 60))
  print('\n')

When I rented this I was hoping for what "Reign of Fire" did
not deliver: a clash between modern technology and mythic
beasts. Instead I got a standard "monster hunts stupid
people in remote building" flick, with bad script, bad
music, bad effects, bad plot, bad acting. Bad, bad, bad.
Only reason why I did give it a 2 was that in theory there
could exist worse movies. In theory.....


I guess I've seen worse films, but that may be becuz I'm so
jaded by how standard these bad horror movies are. The
killer monster thing is really really bad, basically its a
guy in some kind of green body suit. There is much worse
acting as far as B movie go, but don't think for a second
this was anything stellar, hell no. It actually did have a
plot with substance, but was still pretty stupid. Basically
its just a bad low budget horror movie. But at least its not
as bad as titanic, that movie sucks balls, this one just
sucks. The blood looks really fake in this movie. Thats one
complaint I have about all

  


We can try the same with the TF-IDF weighted vectors.

In [0]:
# Now the same query using the TF-IDF weighted vectors

results = find_nearest(model, doc_vecs_train, train_df.review, query="bad horror", k=5)

for text in results:
  print(textwrap.fill(text, 60))
  print('\n')

  


One of my best friends brought this movie over one night
with the words 'Wanna watch the worst horror movie ever?' I
always enjoy a good laugh at a bad horror flick and said
yes. I had expected your typical cheesy b-slasher but this
was beyond B. This is Z-slasher, the lowest of the low. With
obviously low budget, extremely bad acting, bad lightning,
no plot, really bad so-called 'special effects', shaky
cameras and a horrible soundtrack this makes movies like
House of Wax look like Oscar-winning masterpieces. The only
good thing about it is about 15 seconds of one of the
characters getting topless - she had some very nice tits.
Most of what I said during this film was along the lines of
'Wow this is actually SO BAD', 'This is the worst movie
ever' and 'I'm not drunk enough for this'. So in conclusion:
don't waste your time (or money!).


I guess I've seen worse films, but that may be becuz I'm so
jaded by how standard these bad horror movies are. The
killer monster thing is really rea