Setting up the environment
```
python==3.6.3
xlrd==1.1.0
spaCy==2.0.12
gensim==3.4.0
scikit-learn==0.19.1
seaborn==0.8
```

In [2]:
import re  # For preprocessing
import pandas as pd  # For data handling
from time import time  # To time our operations
from collections import defaultdict  # For word frequency

import spacy  # For preprocessing

import logging  # Setting up the loggings to monitor gensim
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)

# Introduction
I chose to play with the script from the Simpsons, both because I love the Simpsons and because, with more than 150k lines of dialogue, the dataset is substantial!

This dataset contains the characters, locations, episode details, and script lines for approximately 600 Simpsons episodes, dating back to 1989. It can be found here: https://www.kaggle.com/ambarish/fun-in-text-mining-with-simpsons/data (~25MB)

## Preprocessing
We keep only two columns:

- raw_character_text: the character who speaks (can be useful when monitoring the preprocessing steps)
- spoken_words: the raw text from the line of dialogue

We do not keep normalized_text because we want to do our own preprocessing.

You can find the resulting file here: https://www.kaggle.com/pierremegret/dialogue-lines-of-the-simpsons

In [3]:
df = pd.read_csv('simpsons_dataset.csv')
df.shape

(158314, 2)

In [4]:
df.head()

Unnamed: 0,raw_character_text,spoken_words
0,Miss Hoover,"No, actually, it was a little of both. Sometim..."
1,Lisa Simpson,Where's Mr. Bergstrom?
2,Miss Hoover,I don't know. Although I'd sure like to talk t...
3,Lisa Simpson,That life is worth living.
4,Edna Krabappel-Flanders,The polls will be open from now until the end ...


In [5]:
df.isnull().sum()

raw_character_text    17814
spoken_words          26459
dtype: int64

The missing values come from the parts of the script where something happens but there is no dialogue. For instance "(Springfield Elementary School: EXT. ELEMENTARY - SCHOOL PLAYGROUND - AFTERNOON)"

In [6]:
df = df.dropna().reset_index(drop=True)
df.isnull().sum()

raw_character_text    0
spoken_words          0
dtype: int64

### Cleaning
For each line of dialogue, we lemmatize the words and remove stopwords and non-alphabetic characters.

In [14]:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [15]:
def cleaning(doc):
    # Lemmatizes and removes stopwords
    # doc needs to be a spacy Doc object
    txt = [token.lemma_ for token in doc if not token.is_stop]
    # Word2Vec uses context words to learn the vector representation of a target word,
    # if a sentence is only one or two words long,
    # the benefit for the training is very small
    if len(txt) > 2:
        return ' '.join(txt)

In [16]:
brief_cleaning = (re.sub("[^A-Za-z']+", ' ', str(row)).lower() for row in df['spoken_words'])

The expression above is a generator expression, which returns a lazy iterator. Lazy iterators are objects that you can loop over like a list. However, unlike lists, they do not store their contents in memory: each cleaned line is only produced when it is requested.
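To make the laziness concrete, here is a minimal sketch that applies the same regular expression to two made-up lines (`sample_lines` is a hypothetical stand-in for `df['spoken_words']`, not data from the dataset):

```python
import re

# Hypothetical sample lines standing in for df['spoken_words']
sample_lines = ["Where's Mr. Bergstrom?", "D'oh!!! 42 donuts..."]

# Same pattern as brief_cleaning: keep letters and apostrophes, then lowercase
lazy_cleaned = (re.sub("[^A-Za-z']+", ' ', str(row)).lower() for row in sample_lines)

# Nothing has been cleaned yet; each value is produced on demand
first = next(lazy_cleaned)      # "where's mr bergstrom "
rest = list(lazy_cleaned)       # consumes what remains: ["d'oh donuts "]
leftover = list(lazy_cleaned)   # a generator can only be consumed once: []

print(first, rest, leftover)
```

The one-shot behaviour is the thing to remember: once `nlp.pipe` has consumed the generator, it has to be re-created before it can be iterated again.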

In [17]:
brief_cleaning

<generator object <genexpr> at 0x7f37c9ea6ba0>

In [None]:
t = time()

txt = [cleaning(doc) for doc in nlp.pipe(brief_cleaning, batch_size=5000, n_threads=-1)]

print('Time to clean up everything: {} mins'.format(round((time() - t) / 60, 2)))

In [None]:
df_clean = pd.DataFrame({'clean': txt})
df_clean = df_clean.dropna().drop_duplicates()
df_clean.shape

### Bigrams
We use the Gensim Phrases package to automatically detect common phrases (bigrams) from a list of sentences: https://radimrehurek.com/gensim/models/phrases.html

The main reason we do this is to catch words like "mr_burns" or "bart_simpson"!
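To give an intuition for how a pair like "mr burns" gets picked up, here is a simplified sketch of a count-based collocation score in plain Python. The function `bigram_scores` and the toy sentences are my own illustration; Gensim's internal bookkeeping differs in its details, so these numbers are not expected to match what Phrases computes.

```python
from collections import Counter

def bigram_scores(sentences, min_count=1):
    """Score adjacent word pairs; a higher score means more phrase-like."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)  # simplification: count unigrams only
    return {
        pair: (n - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n in bigrams.items()
    }

# Toy corpus (hypothetical), already tokenized
toy = [["mr", "burns", "is", "rich"],
       ["mr", "burns", "is", "old"],
       ["homer", "is", "not", "rich"]]

scores = bigram_scores(toy, min_count=1)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```

Pairs that occur together more often than their individual frequencies would suggest get a high score, and pairs seen no more than `min_count` times score zero or below.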

In [19]:
from gensim.models.phrases import Phrases, Phraser

In [None]:
sent = [row.split() for row in df_clean['clean']]

In [None]:
type(sent)

In [None]:
sent

Creates the relevant phrases from the list of sentences

In [None]:
phrases = Phrases(sent, min_count=30, progress_per=10000)

In [None]:
bigram = Phraser(phrases)

In [None]:
sentences = bigram[sent]

### Most Frequent Words
Mainly a sanity check of the effectiveness of the lemmatization, removal of stopwords, and addition of bigrams.

In [None]:
word_freq = defaultdict(int)
for sentence in sentences:  # avoid rebinding `sent`, the list built above
    for word in sentence:
        word_freq[word] += 1
len(word_freq)

In [None]:
sorted(word_freq, key=word_freq.get, reverse=True)[:10]
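As an alternative idiom for the same sanity check, `collections.Counter` can count and rank in one go. This is a small sketch on hypothetical toy sentences, not on the Simpsons corpus:

```python
from collections import Counter

# Hypothetical tokenized sentences standing in for `sentences`
toy_sentences = [["homer", "eat", "donut"],
                 ["bart", "eat", "donut"],
                 ["lisa", "read", "donut"]]

# Count every token across all sentences
word_counts = Counter(word for sentence in toy_sentences for word in sentence)

# most_common(n) returns the n most frequent (word, count) pairs
top_two = word_counts.most_common(2)
print(top_two)
```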

In [None]:
for sentence in sentences[:10]:
    print(sentence)

## Training the model
Gensim Word2Vec Implementation:
We use the Gensim implementation of word2vec: https://radimrehurek.com/gensim/models/word2vec.html

In [20]:
import multiprocessing

from gensim.models import Word2Vec

I prefer to separate the training into 3 distinct steps for clarity and monitoring.

1. Word2Vec():
In this first step, I set up the parameters of the model one-by-one.
I do not supply the parameter sentences, and therefore leave the model uninitialized, purposefully.

2. .build_vocab():
Here it builds the vocabulary from a sequence of sentences and thus initializes the model.
With logging enabled, I can follow the progress and, even more importantly, the effect of min_count and sample on the word corpus. I noticed that these two parameters, and in particular sample, have a great influence on the performance of a model. Displaying both makes it easier to manage their influence accurately.

3. .train():
Finally, it trains the model.
The logs here are mainly useful for monitoring, making sure that no threads are executed instantaneously.

### The parameters

    min_count = int - Ignores all words with total absolute frequency lower than this - (2, 100)
    
    window = int - The maximum distance between the current and predicted word within a sentence. E.g. window words on the left and window words on the right of our target - (2, 10)
    
    size = int - Dimensionality of the feature vectors. - (50, 300)
    
    sample = float - The threshold for configuring which higher-frequency words are randomly downsampled. Highly influential. - (0, 1e-5)
    
    alpha = float - The initial learning rate - (0.01, 0.05)
    
    min_alpha = float - Learning rate will linearly drop to min_alpha as training progresses. To set it: alpha - (min_alpha * epochs) ~ 0.00
    
    negative = int - If > 0, negative sampling will be used, the int for negative specifies how many "noise words" should be drawn. If set to 0, no negative sampling is used. - (5, 20)
    
    workers = int - Use this many worker threads to train the model (=faster training with multicore machines)
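Since sample is so influential, it helps to see what it does to individual words. The sketch below follows the downsampling rule as I understand it from the word2vec reference implementation; treat the exact formula, the function name `keep_probability`, and the corpus size of one million words as assumptions for illustration.

```python
from math import sqrt

def keep_probability(word_count, total_words, sample):
    """Probability of keeping one occurrence of a word during training."""
    if sample <= 0:
        return 1.0  # sample=0 disables downsampling
    threshold = sample * total_words
    prob = (sqrt(word_count / threshold) + 1) * (threshold / word_count)
    return min(prob, 1.0)  # rare words are always kept

# Hypothetical corpus of one million words, with sample=6e-5 as used in this notebook
for count in (10, 600, 6000, 60000):
    print(count, round(keep_probability(count, 1_000_000, 6e-5), 3))
```

Under these assumptions, a word seen 6,000 times keeps only about 11% of its occurrences, while rare words are untouched, which is why a small change in sample can reshape the effective corpus.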

In [None]:
cores = multiprocessing.cpu_count()  # Count the number of cores in the machine

w2v_model = Word2Vec(min_count=20,
                     window=2,
                     size=300,
                     sample=6e-5, 
                     alpha=0.03, 
                     min_alpha=0.0007, 
                     negative=20,
                     workers=cores-1)

In [None]:
t = time()

w2v_model.build_vocab(sentences, progress_per=10000)

print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))

Parameters of the training:

- total_examples = int - Count of sentences
- epochs = int - Number of iterations (epochs) over the corpus - [10, 20, 30]
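Over those epochs the learning rate drops linearly from alpha to min_alpha, as described in the parameter list. A minimal sketch of that schedule, where `learning_rate` is a hypothetical helper and `progress` is the fraction of training completed:

```python
def learning_rate(progress, alpha=0.03, min_alpha=0.0007):
    """Linearly interpolate from alpha (progress=0) to min_alpha (progress=1)."""
    return alpha - (alpha - min_alpha) * progress

# Learning rate at the start, middle, and end of training
for progress in (0.0, 0.5, 1.0):
    print(progress, learning_rate(progress))
```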

In [None]:
t = time()

w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

In [None]:
vocab = w2v_model.wv.vocab

## Preprocessing and training template directly from dataFrame

In [None]:
import string  # needed for string.punctuation below

def _get_embeddings(df, column='text', embedding_dim=200, embedding_window=3, min_count=30, pretrained_path=None):
        """
        Returns gensim Word2Vec model trained on the column given.
        """

        list_words = []
        for des in df[column]:
            des = des.translate(des.maketrans({key: ' ' for key in string.punctuation}))
            des = des.lower()
            list_words.append(re.findall(r"[\w']+|[.,!?;]", des.strip()))

        model_small = Word2Vec(list_words, workers=4, size=embedding_dim, min_count=min_count, window=embedding_window,
                               sample=1e-3, sg=1, seed=1)
        vocab = model_small.wv.vocab
        total_examples = len(list_words)

        if pretrained_path is not None:
            model = Word2Vec.load(pretrained_path)
            model.min_count = 1
            model.build_vocab([list(vocab.keys())], update=True)
            model.train(list_words, total_examples=total_examples, epochs=model_small.epochs)
            return model, vocab

        return model_small, vocab
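To see what the tokenization step in this template produces, here is the same punctuation-stripping and regex logic isolated in a small helper (`tokenize` is a name I introduce for illustration; the input sentence is made up):

```python
import re
import string

def tokenize(text):
    # Replace every punctuation character with a space, then lowercase
    text = text.translate(str.maketrans({ch: ' ' for ch in string.punctuation}))
    # Split into word tokens; punctuation is already gone at this point
    return re.findall(r"[\w']+|[.,!?;]", text.lower().strip())

tokens = tokenize("Homer's donut, please!")
print(tokens)
```

Note that the apostrophe is part of `string.punctuation`, so "Homer's" is split into two tokens here, unlike in the spaCy-based cleaning earlier in the notebook, which keeps apostrophes.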

In [None]:
def save(path='Data/wordvectors.kv'):
    w2v_model.save(path)

In [21]:
def load(path='Data/wordvectors.kv'):
    # Return the loaded object; without the return, `w2v_model = load()` would be None
    return KeyedVectors.load(path)

## Evaluating the model

### Top K most similar words

In [None]:
from gensim.models.keyedvectors import KeyedVectors
w2v_model = load()

In [24]:
def most_similar(word, k=10):
        """
        Get the k nearest neighbors to the word.
        :param word: word to find nearest neighbors.
        :param k: number of neighbors to return
        :return: list of (word, similarity)
        """
        return w2v_model.wv.most_similar(word, topn=k)

In [None]:
most_similar(["homer"])

In [None]:
most_similar(["homer_simpson"])

In [None]:
most_similar(["marge"])

In [None]:
most_similar(["bart"])  # the helper takes the word positionally; it has no `positive` argument

### Similarity scores between 2 words

In [None]:
w2v_model.wv.similarity("moe_'s", 'tavern')

In [None]:
w2v_model.wv.similarity("maggie", 'baby')

In [None]:
w2v_model.wv.similarity("homer", 'marge')

In [None]:
w2v_model.wv.similarity("cat", 'dog')

In [None]:
w2v_model.wv.similarity("cat", 'ship')
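The scores above are cosine similarities between word vectors. As a reminder of what that measures, here is a minimal pure-Python sketch on toy 2-dimensional vectors (the helper `cosine_similarity` and the vectors are illustrative, not taken from the model):

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([3, 4], [4, 3]))   # similar directions
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal
print(cosine_similarity([1, 0], [-1, 0]))  # opposite
```

Because only the direction matters, two words can score highly even if one vector is much longer than the other.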

### Analogy difference
Which word is to woman as homer is to marge?

In [None]:
w2v_model.wv.most_similar(positive=["woman", "homer"], negative=["marge"], topn=3)

Which word is to woman as bart is to man?

In [None]:
w2v_model.wv.most_similar(positive=["woman", "bart"], negative=["man"], topn=3)
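The analogy queries boil down to vector arithmetic: add the positive words, subtract the negative ones, and look for the closest remaining word. Below is a rough sketch of that idea with hand-made 3-dimensional vectors; the `analogy` helper and the toy vectors are assumptions for illustration and are not Gensim's implementation.

```python
from math import sqrt

def unit(v):
    """Scale a vector to length 1."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + x for q, x in zip(query, unit(vectors[w]))]
    for w in negative:
        query = [q - x for q, x in zip(query, unit(vectors[w]))]
    query = unit(query)
    candidates = {
        w: sum(a * b for a, b in zip(query, unit(v)))
        for w, v in vectors.items()
        if w not in positive and w not in negative
    }
    return max(candidates, key=candidates.get)

# Hand-made toy vectors: axes loosely mean (male, female, child)
toy_vectors = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "bart": [1.0, 0.0, 1.0],
    "lisa": [0.0, 1.0, 1.0],
    "donut": [0.1, 0.1, -1.0],
}

answer = analogy(toy_vectors, positive=["woman", "bart"], negative=["man"])
print(answer)
```

With real embeddings the geometry is far noisier than in this toy setup, so the top results are best read as suggestions rather than exact answers.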

### Visualization

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
 
import seaborn as sns
sns.set_style("darkgrid")

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

In [None]:
def tsnescatterplot(model, word, list_names, n_comp):
    """ Plot in seaborn the results from the t-SNE dimensionality reduction algorithm of the vectors of a query word,
    its list of most similar words, and a list of words.
    """
    arrays = np.empty((0, 300), dtype='f')
    word_labels = [word]
    color_list  = ['red']

    # adds the vector of the query word
    arrays = np.append(arrays, model.wv.__getitem__([word]), axis=0)
    
    # gets list of most similar words
    close_words = model.wv.most_similar([word])
    
    # adds the vector for each of the closest words to the array
    for wrd_score in close_words:
        wrd_vector = model.wv.__getitem__([wrd_score[0]])
        word_labels.append(wrd_score[0])
        color_list.append('blue')
        arrays = np.append(arrays, wrd_vector, axis=0)
    
    # adds the vector for each of the words from list_names to the array
    for wrd in list_names:
        wrd_vector = model.wv.__getitem__([wrd])
        word_labels.append(wrd)
        color_list.append('green')
        arrays = np.append(arrays, wrd_vector, axis=0)
        
    # Reduces the dimensionality from 300 to n_comp dimensions with PCA
    reduc = PCA(n_components=n_comp).fit_transform(arrays)
    
    # Finds t-SNE coordinates for 2 dimensions
    np.set_printoptions(suppress=True)
    
    Y = TSNE(n_components=2, random_state=0, perplexity=15).fit_transform(reduc)
    
    # Sets everything up to plot
    df = pd.DataFrame({'x': [x for x in Y[:, 0]],
                       'y': [y for y in Y[:, 1]],
                       'words': word_labels,
                       'color': color_list})
    
    fig, _ = plt.subplots()
    fig.set_size_inches(9, 9)
    
    # Basic plot
    p1 = sns.regplot(data=df,
                     x="x",
                     y="y",
                     fit_reg=False,
                     marker="o",
                     scatter_kws={'s': 40,
                                  'facecolors': df['color']
                                 }
                    )
    
    # Adds annotations one by one with a loop
    for line in range(0, df.shape[0]):
         p1.text(df["x"][line],
                 df['y'][line],
                 '  ' + df["words"][line].title(),
                 horizontalalignment='left',
                 verticalalignment='bottom', size='medium',
                 color=df['color'][line],
                 weight='normal'
                ).set_size(15)

    
    plt.xlim(Y[:, 0].min()-50, Y[:, 0].max()+50)
    plt.ylim(Y[:, 1].min()-50, Y[:, 1].max()+50)
            
    plt.title('t-SNE visualization for {}'.format(word.title()))

most similar vs most dissimilar

In [None]:
tsnescatterplot(w2v_model, 'maggie', [i[0] for i in w2v_model.wv.most_similar(negative=["maggie"])], n_comp=21)

random vs most similar

In [None]:
tsnescatterplot(w2v_model, 'homer', ['dog', 'bird', 'ah', 'maude', 'bob', 'mel', 'apu', 'duff'], n_comp=19)

In [None]:
def plot_embedding(vocab):

        vocab = list(vocab)
        X = w2v_model.wv[vocab]

        tsne = TSNE(n_components=2)
        X_tsne = tsne.fit_transform(X)

        df = pd.DataFrame(X_tsne, index=vocab, columns=['x', 'y'])
        
        plt.rcParams["figure.figsize"] = (30,60)

        fig = plt.figure()
        ax = fig.add_subplot(1, 1, 1)

        ax.scatter(df['x'], df['y'])

        for word, pos in df.iterrows():
            ax.annotate(word, pos)

        plt.show()


# Useful functions

In [None]:
def get_embedding(word):
    """
    Method to obtain the embedding vector of a word.
    :param word: word to obtain embedding.
    :return: word vector <np.array>
    """
    return np.array(w2v_model.wv[word])

In [None]:
get_embedding("good")

In [None]:
def get_sentence_embedding(sentence):
    """
    Get the average vectors of the words in the word2vec vocabulary.
    :param sentence: text to get embedding. (str)
    :return: embedding vector (np.array)
            If no word of the sentence is in vocabulary, return None
    """

    sentence = re.sub("[^A-Za-z]", " ", sentence.lower())
    word_count = 0
    embedding = None
    for word in re.findall(r"[\w']+|[.,!?;]", sentence.strip()):
        if word in w2v_model.wv.vocab:
            word_count += 1
            if embedding is None:
                embedding = get_embedding(word)
            else:
                embedding += get_embedding(word)

    if embedding is None:
        return None

    return embedding / word_count

In [None]:
get_sentence_embedding("I want to go home")
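The averaging logic can be checked by hand on a tiny vocabulary. In this sketch, `toy_vectors` and `average_embedding` are hypothetical stand-ins for the trained model and for `get_sentence_embedding`:

```python
# Hypothetical 2-dimensional "embeddings" for a two-word vocabulary
toy_vectors = {"go": [1.0, 0.0], "home": [0.0, 1.0]}

def average_embedding(sentence, vectors):
    """Average the vectors of the in-vocabulary words; None if there are none."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

print(average_embedding("I want to go home", toy_vectors))  # only "go" and "home" count
print(average_embedding("Mmm donuts", toy_vectors))         # no known word
```

Out-of-vocabulary words are simply skipped, so a sentence made mostly of rare words is represented by very few vectors.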

In [None]:
from scipy import spatial

In [None]:
def get_sentence_similarity(sentence_embedding, query_embedding):
    """
    Get the cosine similarity between a sentence embedding and a query embedding.
    """
    if sentence_embedding is None:
        return -1
    result = 1 - spatial.distance.cosine(query_embedding, sentence_embedding)
    return result

In [None]:
def get_answer(question, context):

        query_embedding = get_sentence_embedding(question)
        

        if query_embedding is None:
            answer, score = None, 0
        else:
            sentences = context
            sentences = [el for el in sentences if isinstance(el, str) and len(el) > 5]

            if len(sentences) == 0:
                return None, 0

            top_sentences = [(sentence, get_sentence_similarity(get_sentence_embedding(sentence), query_embedding))
                             for sentence in sentences]

            answer, score = sorted(top_sentences, key=lambda x: x[1], reverse=True)[0]

        return answer, score

In [None]:
question="Who is Homer?"
context=["I like to eat."," My dog is big.", "Homer is a husband of Marge."," I like the movie Simpsons"]

In [None]:
get_answer(question, context)