# Word2Vec Text Representation using Gutenberg Corpus

**Prerequisites:** Skills in tokenization with nltk, knowledge of Word2Vec Text Representation model.

## Outline

**Main Goal:** To practice how to create Word2Vec models with Gensim and NLTK. Then introduce how to extract information from both text representation, and finally how to measure word similarity.

- Gensim Corpus Inizialization
- Word2Vec model example

## About Gensim

Gensim is a Python library for *topic modelling*, *document indexing*
and *similarity retrieval* with large corpora. Target audience is the
*natural language processing* (NLP) and *information retrieval* (IR)
community. [Gensim Documentation](Gensim Doc)

## About NLTK

Natural Language ToolKit (NLTK) is a comprehensive Python library for natural language
processing and text analytics. Originally designed for teaching, it has been adopted in the
industry for research and development due to its usefulness and breadth of coverage. NLTK
is often used for rapid prototyping of text processing programs and can even be used in
production applications. [(Perkins2014)](#Perkins2014)

## What is Word2Vec?

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space [(Mikolov2013)](#Mikolov2013).

In [None]:
import gensim
import nltk
import os
import re
import time

## Wrangling Data

From txt collection to a list of strings, and from string-list to a list of word-list by sentence-list.

This first method to load the whole text collection is based on "os" module, this is only a code snippet to practice a different ways to do it. NLTK, numpy, and other libraries have it's own methods to do the same process.

In this case a text structure "sentences" with a list of words per sentence per line is generated.

In [1]:
doc_collection = []
file_path = 'gutenberg/'
file_list = list(os.popen('ls '+ file_path).read().split('\n'))
for file in file_list:
    if file:
        with open(os.path.join(file_path,file)) as doc:
            doc_collection.append(doc.read())
            
#Wrangling the data from list of doc-strings -> list of word-list by sentences
sentences = []
for doc in range(len(doc_collection)):
    for sent in nltk.sent_tokenize(doc_collection[doc]):
        sent_words = []
        for word in nltk.word_tokenize(sent):
            sent_words.append(word)
        sentences.append(sent_words)

## Generating the Word2Vec Model

**WARNING**: gensim.models.word2vec: Each 'sentences' item should be a list of words (usually unicode strings).

In [2]:
from gensim.models import Word2Vec

init = time.time()
#first build vocabulary
w2v = Word2Vec(iter=1)
w2v.build_vocab(sentences)

#second train the model / save it / and then load it
w2v = Word2Vec(sentences, min_count=1, size=300)
w2v.save('models/w2v_model')
w2v = gensim.models.Word2Vec.load('gensim_data/w2v_model')

#third train the model with more sentences
w2v.train(sentences,total_words=20000000,epochs=w2v.iter)
end = time.time()-init
print('Total time:', end)

Total time: 58.66724157333374


In [3]:
w2v.most_similar(positive=['Alice'],negative=['man'])

[('Mock', 0.49112191796302795),
 ('Bell', 0.48581546545028687),
 ('Rosamond', 0.4763166904449463),
 ('eagerly', 0.4708961248397827),
 ('cautiously', 0.46099987626075745),
 ('sharply', 0.46045875549316406),
 ('impatiently', 0.4462878704071045),
 ('Billy', 0.4441705346107483),
 ('aloud', 0.44107112288475037),
 ('gravely', 0.43723881244659424)]

In [4]:
w2v.wv['Alice'][:10]

array([ 0.29933944, -0.35046065, -0.76184052, -0.22597112,  0.10589588,
        0.55989736,  0.60162735,  0.14057656, -0.27160674,  0.77934808], dtype=float32)

## Sklearn Word2Vec-Cosine sentence similarity

### Wrangling Data

From string-sentences to "Continue Bag of Word" numerical vectors.

In [5]:
sentence1 = 'the girl run into the hall'
sentence2 = 'Here Alice run to the hall'

sent1 = sentence1.split()
sent2 = sentence2.split()

sent1s = 'girl run hall'
sent2s = 'Alice run hall'

sent1sl = sent1s.split()
sent2sl = sent2s.split()

#If we change the sent1 by a very different meaning sent3
sent3 = ['the','boy','eat','a','red','apple']

In [6]:
import numpy as np

def preproc_data(sentence1, sentence2, model):
    
    w2v_sent1 = []
    w2v_sent2 = []

    for word in sent1:
        try:
            w2v_sent1.append(w2v.wv[word])
        except:
            pass

    for word in sent2:
        try:
            w2v_sent2.append(w2v.wv[word])
        except:
            pass

    w2v_sent1 = sum(np.asarray(w2v_sent1))
    w2v_sent2 = sum(np.asarray(w2v_sent2))
    A = w2v_sent1.reshape(1,-1)
    B = w2v_sent2.reshape(1,-1)
    
    return A,B

In [7]:
w2v_sent1, w2v_sent2 = preproc_data(sentence1,sentence2,w2v)
print(len(w2v_sent1[0]))
w2v_sent2[0][:10]

300


array([-0.6003623 , -0.16434795,  1.7552073 ,  0.27997163, -5.97910786,
       -0.77644032, -5.9618144 , -5.70739174,  3.43792582, -1.5847491 ], dtype=float32)

### Applying Similarity

In [8]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(w2v_sent1,w2v_sent2)[0][0]

0.93066299

In [9]:
#Filtering stopwords
w2v_sent1s, w2v_sent2s = preproc_data(sent1s,sent2s,w2v)
cosine_similarity(w2v_sent1s,w2v_sent2s)[0][0]

0.87735528

## Scipy Cosine Similarity

In [10]:
from scipy.spatial.distance import cosine as cosine_scipy

print(cosine_scipy(w2v_sent1,w2v_sent2))
print(cosine_scipy(w2v_sent1s,w2v_sent2s)) #Filtering stopwords

0.0693369839679
0.122644664832


## Cosine using Gensim w2v of a sentence

In [None]:
vec_sent1 = w2v.wv[sent1]
vec_sent2 = w2v.wv[['corrió','al','hueco']]

#cosine(vec_sent1,vec_sent2)
vec_sent1_ = vec_sent1.sum(axis=0)
vec_sent2_ = vec_sent2.sum(axis=0)

1-cosine_scipy(vec_sent1_,vec_sent2_)

## Gensim w2v.n_similarity

In [11]:
w2v.n_similarity(['the','girl','run','into','the','hall'],['Here','Alice','run','to','the','hall'])

0.72427952777680515

In [12]:
w2v.n_similarity(['girl','run','hall'],['Alice','run','hall'])

0.76351450038282376

In [13]:
w2v.n_similarity(['the','boy','eat','a','red','apple'],
                   ['Here','Alice','run','to','the','hall'])

0.48189778837229824

## Gensim w2v.similarity

A score constructed with this method based on an international article.[John2016](#John2016)

In [21]:
# get similarity between 2 words with word2vec
print('Similarity between Alice and girl:', w2v.similarity('Alice','girl'))

# to get similarity betwee 2 sentences with word2vec create it like John2016, ALPHA=0.25

#To test if sent1 == sent3, change sentence2 by sentence3 in the 2nd loop
#sentence3 = ['the','girl','run','into','the','hall']

def sent_sim_jonh2016(sent1, sent2, model):
    """type sent1,sent2: list of strings"""
    
    sim_vector = []
    ALPHA = 0.25

    for wordA in sent1:
        for wordB in sent2:
            try:
                sim = w2v.similarity(wordA,wordB)
                if sim > ALPHA:
                    sim_vector.append(sim)
            except:
                pass

    return sum(sim_vector)/len(sim_vector)


Similarity between Alice and girl: 0.763514499882


In [15]:
print('Sentence w2v.similarity with stopwords', word_vector_cosine_sim(sent1,sent2, w2v))
print('Sentence w2v.similarity without stopwords', word_vector_cosine_sim(sent1sl,sent2sl, w2v))

Sentence w2v.similarity with stopwords 0.951933983071
Sentence w2v.similarity with stopwords 0.763514464125


## Best Pair Word Overlap

Lets try a different way to compound a sentence similarity, based on WordNet-Augmented-Word-Overlap similarity idea.

$p = {\sum_{w\in\ sent_1}max(df[w][w']) \over len(sent_1)} \ \ \ \forall\ w' \in\ sent_2$

$q = {\sum_{w'\in\ sent_2}max(df[w][w']) \over len(sent_2)} \ \ \ \forall\ w \in\ sent_1$

$sim = \left\{ \begin{array}{rcl} 
0  & if\ p+q = 0\\
{2 p*q \over (p+q)}  & others\\
\end{array}
\right.$

In [2]:
def harmonic_best_pair_word_sim(sent1,sent2, w2v):
    p=0
    for wi in sent1:
        m = 0
        for wc in sent2:
            m = max(m, w2v.similarity(wi,wc))
        p += m
    p = p/len(sent1)

    q=0
    for wc in sent2:
        m = 0
        for wi in sent1:
            m = max(m, w2v.similarity(wi,wc))
        q += m
    q = q/len(sent2)

    sim = 2*p*q/(p+q or 1)
    return sim

harmonic_best_pair_word_sim(sent1,sent2, w2v)

NameError: name 'sent1' is not defined

In [1]:
print('Sentence w2v_harmonic_best_pair_word with stopwords', harmonic_best_pair_word_sim(sent1,sent2,w2v))
print('Sentence w2v_harmonic_best_pair_word without stopwords',harmonic_best_pair_word_sim(sent1,sent2,w2v))
print('Different sentence w2v_harmonic_best_pair_word similarity', harmonic_best_pair_word_sim(sent3,sent2,w2v))

NameError: name 'harmonic_best_pair_word_sim' is not defined

## Gensim TfIdf-Hellinger sentence similarity

In [20]:
from gensim.matutils import kullback_leibler, jaccard, hellinger, cossim

print(hellinger(w2v_sent1,w2v_sent2))
print(kullback_leibler(w2v_sent1, w2v_sent2))

nan
inf


  sim = np.sqrt(0.5 * ((np.sqrt(vec1) - np.sqrt(vec2))**2).sum())


# Conclusions

- Word2vec seems to have a sparcity problem or in this notebook the amount of texts are not enough to have good results.
- The best similarities using this text representation models must be implemented with innovatives ideas.
- The original gensim accuracy test output is different to this one.

# Recomendations

- See the notebooks training Gensim models with Wikipedia dump, review gensim distances and distances trated here.
- Try to test other text representation models like Weigthed Matrix Factorization to study if the problem of sparcity persist.
- Try to train w2v model with more documents and test the Best-Pair word overlap similarity.
- n_symilarity and similarity methods seems to re-train when you execute the related notebook cells, and the similarity values changed util arrive to 0.999. Read and test this with more documents.

<a id='referencias'></a>
# Referencias

<a id='Perkins2014'></a>
[1] *[Perkins2014]* Jacov Perkins. 
Book **Python 3 Text Processing with NLTK 3 Cookbook**. 2014. 
p. 7 **ISBN**: 978-1-78216-785-3

<a id='Mikolov2013'></a>
[2] *[Mikolov2013]* Tomas Mikolov et al. **Efficient Estimation of Word Representations in Vector Space**. Publisher [arXiv](https://arxiv.org/abs/1301.3781), 2013.

<a id='John2016'></a>
[3] *[John2016]* John, Adebayo Kolawole and Caro, Luigi Di and Boella, Guido. **NORMAS at SemEval-2016 Task 1: SEMSIM: A Multi-Feature Approach to Semantic Text Similarity**. Publisher ACM, 2016.