# Word/Sentence Embedding

We start this section by introducing Word2Vec embeddings and showing how to use Gensim to train a Word2Vec model on a list of sentences. As described in the slides, Word2Vec has two training algorithms: skip-gram and CBOW (continuous bag of words). Below, we train a model with each of them. For more information about Word2Vec and Gensim, see [this tutorial](https://rare-technologies.com/word2vec-tutorial/).

In this tutorial, you’ll learn:

- How to use Gensim to train a Word2Vec embedding model
- How to load pretrained models
- How to build BOW and TF-IDF representations

In [18]:
import warnings
warnings.filterwarnings('ignore')

In [19]:
# Python program to generate word vectors using Word2Vec
  
# importing all necessary modules
from nltk.tokenize import sent_tokenize, word_tokenize
  
import gensim
from gensim.models import Word2Vec
import nltk  
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Word2Vec Embedding


In [20]:
sentences = 'the car is owned by my father. the house is owned by my mother.'
  
data = []
  
# iterate through each sentence in the file
for i in sent_tokenize(sentences):
    temp = []
      
    # tokenize the sentence into words
    for j in word_tokenize(i):
        temp.append(j.lower())
  
    data.append(temp)

In [21]:
print(data)

[['the', 'car', 'is', 'owned', 'by', 'my', 'father', '.'], ['the', 'house', 'is', 'owned', 'by', 'my', 'mother', '.']]


## models.word2vec()



```
class gensim.models.word2vec.Word2Vec(sentences=None, corpus_file=None, 
vector_size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, 
sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, 
ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in function hash>, epochs=5, 
null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000, 
compute_loss=False, callbacks=(), comment=None, max_final_vocab=None, 
shrink_windows=True)
```
**sentences** (iterable of iterables, optional) – The sentences iterable can be simply a list of lists of tokens

**vector_size** (int, optional) – Dimensionality of the word vectors

**window** (int, optional) – Maximum distance between the current and predicted word within a sentence

**min_count** (int, optional) – Ignores all words with total frequency lower than this

**sg** ({0, 1}, optional) – Training algorithm: 1 for skip-gram; otherwise CBOW

more information: https://radimrehurek.com/gensim/models/word2vec.html
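
To make the `window` and `sg` parameters concrete, here is a minimal pure-Python sketch of the training examples each algorithm sees for one sentence. The helper `training_pairs` is hypothetical and written only for illustration; it is not part of Gensim, and it ignores details of the real implementation such as random window shrinking (`shrink_windows`), subsampling (`sample`), and negative sampling (`negative`).

```python
def training_pairs(tokens, window=2, sg=0):
    """Enumerate (input, output) training examples for one tokenized sentence.

    sg=1 (skip-gram): the center word predicts each context word separately.
    sg=0 (CBOW): all context words together predict the center word.
    """
    pairs = []
    for i, center in enumerate(tokens):
        # Context = up to `window` words on each side of position i
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if sg == 1:
            pairs.extend((center, c) for c in context)
        else:
            pairs.append((tuple(context), center))
    return pairs


sentence = ['the', 'car', 'is', 'owned', 'by', 'my', 'father']
cbow_pairs = training_pairs(sentence, window=2, sg=0)
skipgram_pairs = training_pairs(sentence, window=2, sg=1)

print(cbow_pairs[3])       # (('car', 'is', 'by', 'my'), 'owned')
print(skipgram_pairs[:2])  # [('the', 'car'), ('the', 'is')]
```

CBOW produces one example per word position, while skip-gram produces one example per (center, context) pair, so skip-gram sees more, smaller examples from the same text.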

In [22]:
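# Create CBOW model (sg=0 is the default)
# Note: Gensim 4+ names the dimensionality parameter vector_size (it was size in Gensim 3.x)
model1 = gensim.models.Word2Vec(data, min_count = 1, 
                              vector_size = 10, window = 2)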




In [23]:
# you can save the model
model1.save("word2vec.model")

# and load it
model1 = Word2Vec.load("word2vec.model")

In [24]:
# look up the vector of a word
model1.wv['house']

array([ 0.00257788,  0.03419786,  0.00961709,  0.03859387, -0.01584929,
        0.01011667, -0.04816683, -0.01648258, -0.01717688,  0.02342607],
      dtype=float32)

In [25]:
model1.wv.most_similar('owned', topn=10)

[('car', 0.29851916432380676),
 ('house', 0.22156289219856262),
 ('is', 0.12342262268066406),
 ('the', 0.10404221713542938),
 ('father', 0.08825638890266418),
 ('by', 0.08316656947135925),
 ('my', -0.001071631908416748),
 ('mother', -0.07891765236854553),
 ('.', -0.48244214057922363)]

In [26]:
# Print results (similarity lives on the KeyedVectors object, model1.wv)
print("Cosine similarity between 'mother' " + 
               "and 'father' - CBOW : ",
    model1.wv.similarity('mother', 'father'))
      
print("Cosine similarity between 'house' " +
                 "and 'car' - CBOW : ",
      model1.wv.similarity('house', 'car'))

Cosine similarity between 'mother' and 'father' - CBOW :  0.33359298
Cosine similarity between 'house' and 'car' - CBOW :  0.23999052
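
The similarity scores printed here are cosine similarities: the dot product of two vectors divided by the product of their lengths. As a minimal sketch, the function below computes this by hand; `cosine_similarity` is a hypothetical helper and the three-dimensional vectors are made up for illustration, not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors: the score depends on direction, not on length
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])     # close to 1
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])         # exactly 0
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])        # close to -1
print(same_direction, orthogonal, opposite)
```

Scores range from -1 to 1. With a corpus of only two sentences, the values above reflect the random initialization more than any learned meaning, so they should not be over-interpreted.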


In [14]:
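# Create Skip Gram model (sg=1)
# Note: Gensim 4+ names the dimensionality parameter vector_size (it was size in Gensim 3.x)
model2 = gensim.models.Word2Vec(data, min_count = 1, vector_size = 10,
                                             window = 3, sg = 1)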
  




In [15]:
# look up the vector of a word
model2.wv['house']

array([ 0.00258364,  0.0342035 ,  0.00961899,  0.03859608, -0.01584661,
        0.0101135 , -0.04817257, -0.0164755 , -0.01717148,  0.02342499],
      dtype=float32)

In [16]:
model2.wv.most_similar('owned', topn=10)

[('car', 0.29851916432380676),
 ('house', 0.221551313996315),
 ('is', 0.12342262268066406),
 ('the', 0.10404221713542938),
 ('father', 0.08824722468852997),
 ('by', 0.08316656947135925),
 ('my', -0.001071631908416748),
 ('mother', -0.07891765236854553),
 ('.', -0.48245689272880554)]

In [17]:
# Print results (similarity lives on the KeyedVectors object, model2.wv)
print("Cosine similarity between 'mother' " +
          "and 'father' - Skip Gram : ",
    model2.wv.similarity('mother', 'father'))
      
print("Cosine similarity between 'house' " +
            "and 'car' - Skip Gram : ",
      model2.wv.similarity('house', 'car'))

Cosine similarity between 'mother' and 'father' - Skip Gram :  0.33357447
Cosine similarity between 'house' and 'car' - Skip Gram :  0.24001113


## Pretrained Models

In [27]:
import gensim.downloader
print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [28]:
glove_vectors = gensim.downloader.load('glove-twitter-25')



In [29]:
glove_vectors.most_similar('twitter')

[('facebook', 0.9480051398277283),
 ('tweet', 0.9403422474861145),
 ('fb', 0.9342358708381653),
 ('instagram', 0.9104823470115662),
 ('chat', 0.8964964747428894),
 ('hashtag', 0.8885936141014099),
 ('tweets', 0.8878157734870911),
 ('tl', 0.8778461813926697),
 ('link', 0.877821147441864),
 ('internet', 0.8753897547721863)]

## Word Embedding with TF-IDF and BOW

In [30]:
# TfidfVectorizer 
# CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
import pandas as pd


In [31]:
# set of documents
train =  sent_tokenize(sentences)
# instantiate the vectorizer object
countvectorizer = CountVectorizer(analyzer= 'word')
tfidfvectorizer = TfidfVectorizer()
# convert the documents into a document-term matrix
count_wm = countvectorizer.fit_transform(train)
tfidf_wm = tfidfvectorizer.fit_transform(train)


In [32]:
# retrieve the terms found in the corpus
# with the same parameters, CountVectorizer and TfidfVectorizer learn the same vocabulary,
# so both calls below return the same feature names
# (scikit-learn 1.0+ uses get_feature_names_out(); the older get_feature_names() was removed in 1.2)
count_tokens = countvectorizer.get_feature_names_out()
tfidf_tokens = tfidfvectorizer.get_feature_names_out()


In [33]:
df_countvect = pd.DataFrame(data = count_wm.toarray(),index = ['Doc1','Doc2'],columns = count_tokens)
df_tfidfvect = pd.DataFrame(data = tfidf_wm.toarray(),index = ['Doc1','Doc2'],columns = tfidf_tokens)


In [34]:
print("Count Vectorizer\n")
print(df_countvect)


Count Vectorizer

      by  car  father  house  is  mother  my  owned  the
Doc1   1    1       1      0   1       0   1      1    1
Doc2   1    0       0      1   1       1   1      1    1


In [None]:
df_tfidfvect.head()

            by       car    father     house        is    mother        my     owned       the
Doc1  0.334251  0.469778  0.469778  0.000000  0.334251  0.000000  0.334251  0.334251  0.334251
Doc2  0.334251  0.000000  0.000000  0.469778  0.334251  0.469778  0.334251  0.334251  0.334251


As you can see, the values produced by scikit-learn's `TfidfVectorizer` differ from the ones we computed manually in the slides. Here is why.

According to the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html), the tf-idf of a term `t` in document `d` is computed as `tf-idf(t, d) = tf(t, d) * idf(t)`. The value of `tf(t, d)` is the same as we defined. However, scikit-learn adds 1 to `idf(t)` so that words that appear in all documents (sentences) are not ignored entirely. With `smooth_idf=True`, the default in `sklearn.feature_extraction.text.TfidfTransformer(*, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)`, the formula is `idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1`; with `smooth_idf=False` it is `idf(t) = log [ n / df(t) ] + 1`. Both differ from the standard textbook definition `idf(t) = log [ n / (df(t) + 1) ]`. Here `n` is the number of documents and `df(t)` is the number of documents that contain `t`. Finally, the vector for each document `d` is normalized using the norm selected by `norm`, which is `l2` by default and can also be `l1`. See the linked documentation for more details.
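
As a check on this explanation, the sketch below recomputes the TF-IDF table by hand in pure Python, assuming scikit-learn's default settings: the smoothed formula `idf(t) = ln[(1 + n) / (1 + df(t))] + 1` with the natural logarithm, followed by L2 normalization of each document vector. The helper `tfidf_row` and the whitespace tokenization are simplifications made for this example, not scikit-learn's own code.

```python
import math

# The two example sentences, lowercased and without punctuation
docs = [
    'the car is owned by my father'.split(),
    'the house is owned by my mother'.split(),
]

vocab = sorted({term for doc in docs for term in doc})
n = len(docs)

# df(t): number of documents that contain term t
df = {t: sum(t in doc for doc in docs) for t in vocab}

# Smoothed idf, as used when smooth_idf=True
idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}


def tfidf_row(doc):
    """Raw tf * idf for every vocabulary term, then L2-normalize the row."""
    raw = [doc.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


rows = [tfidf_row(doc) for doc in docs]
print(vocab)
for name, row in zip(['Doc1', 'Doc2'], rows):
    print(name, [round(x, 6) for x in row])
```

Terms that occur in both sentences get `idf = 1`, and terms that occur in only one get `idf = ln(3/2) + 1 ≈ 1.405`. After L2 normalization these become the 0.334251 and 0.469778 shown in the table.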


# Assignment
Take the data we created in the Data Collection tutorial and train the word and sentence embedding models covered in this tutorial on it.