## Word2vec


Word representation methods from the last lab

- Bag of Words
- TF-IDF

Limitations of these representations

- High-dimensional
- Sparse
- No information about word meaning or about similarity between words
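The limitations listed above can be seen on a tiny example. The sketch below builds bag-of-words count vectors by hand for three made-up sentences (the sentences and helper names are illustrative, not from the previous lab). The vector length equals the vocabulary size, most entries are zero, and `kid` and `child` land in unrelated columns.

```python
from collections import Counter

# Three toy documents, chosen only for illustration.
docs = [
    "the kid studies mathematics",
    "the child studies mathematics",
    "the dog chased the cat",
]

# Vocabulary: every distinct token, in a fixed (sorted) order.
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

def bow(doc):
    """Return the bag-of-words count vector of `doc` over `vocab`."""
    counts = Counter(doc.split())
    return [counts.get(w, 0) for w in vocab]

for d in docs:
    v = bow(d)
    print(f"{d!r}: {v} ({v.count(0)}/{len(v)} entries are zero)")

# 'kid' and 'child' are simply two different columns: nothing in the
# representation says that they mean nearly the same thing.
print("column of 'kid':", index["kid"], "| column of 'child':", index["child"])
```

With a realistic corpus the vocabulary has tens of thousands of entries, so each vector becomes correspondingly longer and sparser.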

Word2vec paper: [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)

Word2Vec is a shallow, two-layer neural network which is trained to reconstruct linguistic contexts of words.

It takes a large text corpus as input and produces a vector space in which each unique word in the corpus is assigned a corresponding vector.


Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.

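Example: **kid** and **child** occur in the same context below, so their vectors should end up close together.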
The **kid** studies mathematics.

The **child** studies mathematics.

![embedding](https://miro.medium.com/max/1400/1*sAJdxEsDjsPMioHyzlN3_A.png)

### Methods for building the Word2vec model

![cbow-skip-gram](https://miro.medium.com/max/1400/1*cuOmGT7NevP9oJFJfVpRKA.png)

### Continuous Bag-of-Words (CBOW)

CBOW predicts a target word from its surrounding context words.

![cbow](https://1.bp.blogspot.com/-nZFc7P6o3Yc/XQo2cYPM_ZI/AAAAAAAABxM/XBqYSa06oyQ_sxQzPcgnUxb5msRwDrJrQCLcBGAs/s1600/image001.png)
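As a rough sketch of what CBOW training data looks like, the snippet below turns a sentence into (context words, target word) pairs. The function name `cbow_pairs` and the `window` parameter are hypothetical names used only for this illustration.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a token list."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after
        context = left + right
        if context:                              # skip one-word sentences
            pairs.append((context, target))
    return pairs

tokens = "the kid studies mathematics".split()
for context, target in cbow_pairs(tokens, window=1):
    print(context, "->", target)
```

Each pair is one training example: the network sees the context words and is asked to predict the target.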

### Skip-gram

Skip-gram predicts the surrounding context words from a target word.

![skip-gram](https://i.stack.imgur.com/fYhXF.png)
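Skip-gram reverses the direction of the prediction, so each training example is a single (target word, context word) pair. A minimal sketch, again with hypothetical names (`skipgram_pairs`, `window`):

```python
def skipgram_pairs(tokens, window=2):
    """Return (target_word, context_word) pairs for a token list."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                           # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the child studies mathematics".split()
for target, context in skipgram_pairs(tokens, window=1):
    print(target, "->", context)
```

One sentence yields several pairs per target word, which is one reason skip-gram produces more training examples than CBOW from the same text.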


## Architecture

The words are fed to the network as one-hot vectors: vectors of the same length as the vocabulary, filled with zeros except for a “1” at the index of the word being represented.

The hidden layer is a standard fully-connected (Dense) layer whose weights are the word embeddings.

The output layer produces a probability for every word in the vocabulary.

The goal of this neural network is to learn the weights for the hidden layer matrix.
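The three layers described above can be sketched as a single forward pass in NumPy. This is a simplified illustration under stated assumptions: the vocabulary, the sizes, and the random weights are made up, and a plain softmax is used at the output, whereas practical Word2vec implementations rely on cheaper approximations such as negative sampling or hierarchical softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "kid", "child", "studies", "mathematics"]
V, N = len(vocab), 3                 # vocabulary size, embedding size (toy values)

W_in = rng.normal(size=(V, N))       # hidden-layer weights: one row per word
W_out = rng.normal(size=(N, V))      # output-layer weights

def softmax(z):
    """Turn a score vector into probabilities that sum to 1."""
    z = z - z.max()                  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

x = np.zeros(V)
x[vocab.index("kid")] = 1.0          # one-hot input for the word "kid"

h = x @ W_in                         # hidden layer (linear, no activation)
p = softmax(h @ W_out)               # one probability per vocabulary word

for word, prob in zip(vocab, p):
    print(f"{word:12s} {prob:.3f}")
```

Training adjusts `W_in` and `W_out` so that the true context words receive high probability; once training is done, only `W_in` is kept as the embedding table.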

![model](https://miro.medium.com/max/1400/1*tmyks7pjdwxODh5-gL3FHQ.png)

High-level illustration of the architecture

![model2](https://i.imgur.com/CBuZay5.png)

The rows of the hidden layer weight matrix are the word vectors (word embeddings).


![hidden-layer](https://i.imgur.com/v6VqHad.png)

The hidden layer operates as a lookup table. The output of the hidden layer is just the “word vector” for the input word.

More concretely, if you multiply a 1 x 10,000 one-hot vector by a 10,000 x 300 matrix, it will effectively just select the matrix row corresponding to the ‘1’.

![vector](https://i.imgur.com/EYhcA5S.png)
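The lookup-table claim can be checked numerically. The sketch below uses a random 10,000 x 300 matrix as a stand-in for trained weights and an arbitrary word index; both are assumptions made only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 300
W = rng.normal(size=(vocab_size, dim)).astype(np.float32)  # stand-in weights

word_index = 42                                  # arbitrary word position
one_hot = np.zeros((1, vocab_size), dtype=np.float32)
one_hot[0, word_index] = 1.0

via_matmul = one_hot @ W                         # shape (1, 300)
via_lookup = W[word_index]                       # shape (300,)

print(via_matmul.shape, np.allclose(via_matmul[0], via_lookup))
```

Because the two results agree, implementations can skip the multiplication and index the matrix directly, which is far cheaper than a product with 10,000 x 300 terms.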

### Semantic and syntactic relationships

If two words appear in similar contexts, Word2Vec should produce similar outputs when they are passed as inputs. To produce similar outputs, the word vectors computed in the hidden layer must themselves be similar, so Word2Vec is pushed to learn similar vectors for words that occur in similar contexts.

Word2Vec is able to capture multiple different degrees of similarity between words, such that semantic and syntactic patterns can be reproduced using vector arithmetic.

![w2vec](https://i.imgur.com/I66L7No.png)

![w2vec2](https://israelg99.github.io/images/2017-03-23-Word2Vec-Explained/linear-relationships.png)

**Skip-gram** works well with small amounts of training data and represents even rare words and phrases well.

**CBOW** is several times faster to train than skip-gram and gives slightly better accuracy for frequent words.

### Word2vec embeddings in Gensim

In [2]:
from gensim.models import Word2Vec
import gensim.downloader

Gensim's downloader provides several families of pretrained word vectors, including word2vec, fastText, and GloVe:

In [None]:
print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


Downloading the pretrained word2vec model (300-dimensional vectors trained on Google News; this is a large download)

In [3]:
word2vec = gensim.downloader.load('word2vec-google-news-300')

In [5]:
word2vec['cat'][:20]

array([ 0.0123291 ,  0.20410156, -0.28515625,  0.21679688,  0.11816406,
        0.08300781,  0.04980469, -0.00952148,  0.22070312, -0.12597656,
        0.08056641, -0.5859375 , -0.00445557, -0.296875  , -0.01312256,
       -0.08349609,  0.05053711,  0.15136719, -0.44921875, -0.0135498 ],
      dtype=float32)

In [6]:
word2vec.similarity('dog', 'house')

0.25689757

In [None]:
word2vec.similarity('dog', 'puppy')

0.81064284
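The scores returned by `similarity` are cosine similarities: the dot product of the two vectors divided by the product of their lengths. A minimal NumPy sketch on made-up 3-dimensional vectors (the helper name `cosine_similarity` is ours, not part of Gensim):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, twice as long
c = np.array([0.0, 0.0, 3.0])   # orthogonal to a

print(cosine_similarity(a, b))  # close to 1: same direction, length is ignored
print(cosine_similarity(a, c))  # 0: no shared direction
```

With the model loaded, `cosine_similarity(word2vec['dog'], word2vec['puppy'])` should agree with `word2vec.similarity('dog', 'puppy')` up to floating-point rounding.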

In [4]:
word2vec.most_similar('cat')

[('cats', 0.8099379539489746),
 ('dog', 0.7609456777572632),
 ('kitten', 0.7464985251426697),
 ('feline', 0.7326233983039856),
 ('beagle', 0.7150583267211914),
 ('puppy', 0.7075453996658325),
 ('pup', 0.6934291124343872),
 ('pet', 0.6891531348228455),
 ('felines', 0.6755931377410889),
 ('chihuahua', 0.6709762215614319)]


The classic analogy: king - man + woman ≈ queen

In [None]:
word2vec.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.7118192911148071),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.518113374710083),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087411999702454)]
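The idea behind an analogy query with `positive` and `negative` words can be sketched as follows: add the unit-length vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. This is a simplified illustration of the idea, not Gensim's actual code, and the 3-dimensional vectors are hand-made toy values.

```python
import numpy as np

# Hand-made toy vectors (axes loosely: royalty, gender, fruit). Not real embeddings.
vectors = {
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "apple": np.array([0.0,  0.0, 1.0]),
}

def unit(v):
    """Scale v to length 1."""
    return v / np.linalg.norm(v)

def analogy(positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = sum(unit(vectors[w]) for w in positive)
    query = query - sum(unit(vectors[w]) for w in negative)
    query = unit(query)
    exclude = set(positive) | set(negative)      # never return the query words
    scores = {w: float(unit(v) @ query)
              for w, v in vectors.items() if w not in exclude}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

print(analogy(positive=["woman", "king"], negative=["man"]))
```

Excluding the query words matters: without that step, the nearest neighbour of the combined vector is often one of the input words themselves.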

In [None]:
word2vec.most_similar(['woman', 'officer'], negative = ['man'])

[('Officer', 0.5694271326065063),
 ('officers', 0.538264274597168),
 ('offi_cer', 0.5283650159835815),
 ('chief', 0.48523107171058655),
 ('deputy', 0.47100305557250977),
 ('patrolwoman', 0.4685642719268799),
 ('policewoman', 0.46202757954597473),
 ('vice_president', 0.461116224527359),
 ('supervisor', 0.4552857577800751),
 ('oficer', 0.4532422721385956)]

## Assignment

Upload your solution here by December 15th: https://forms.gle/KuR71xiA2rR6tukz8

1. Play around with the word2vec model and see if there are any interesting or counterintuitive similarity results using `word2vec.similarity` and `word2vec.most_similar`.

2. Use other embeddings (GloVe, fastText) to encode the data from the sentiment analysis task and train the classification model.

Baseline using the word2vec embeddings from Gensim:

In [None]:
import nltk
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [None]:
from nltk.corpus import twitter_samples

pos_tweets = twitter_samples.strings('positive_tweets.json')
print(len(pos_tweets))

neg_tweets = twitter_samples.strings('negative_tweets.json')
print(len(neg_tweets))

5000
5000


In [None]:
import pandas as pd
pos_df = pd.DataFrame(pos_tweets, columns = ['tweet'])
pos_df['label'] = 1

In [None]:
neg_df = pd.DataFrame(neg_tweets, columns = ['tweet'])
neg_df['label'] = 0

In [None]:
data_df = pd.concat([pos_df, neg_df], ignore_index=True)
# data_df = data_df[:20]

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(data_df, test_size=0.2, shuffle = True)
print(train_df)
print(test_df)

                                                  tweet  label
6745                 @sweetbabecake yea i guess so :(((      0
8859  I fucking hate when I wake up like at this tim...      0
644         .@sajidislam honored to have you here ! :-)      1
6013                    o otp :( http://t.co/EVislmNp5V      0
2283  @effinoreos HAPPY 15th BIRTHDAY VIANEY!!! (a.k...      1
...                                                 ...    ...
4840  @scousebabe888 Nice Holiday Honey!!!!!!!!!!!!!...      1
3976  @planetjedward GoodMorning ! What's coming nex...      1
7151  I feel like I'm a weird person for shipping Be...      0
6902  I met a new kinds of people, new classmate, ne...      0
5326        @hamzaabasiali exactly but unfortunately :(      0

[8000 rows x 2 columns]
                                                  tweet  label
5648  @jenxmish @wittykrushnic you are the only thin...      0
7425                                   Omg no Amber :((      0
255   @AvinPera  follow @jnlaz

In [None]:
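import numpy as np
import tqdm

def compute_embeddings(df):
    """Embed each tweet as the mean of its in-vocabulary word2vec vectors."""
    doc_embeddings = []
    for tweet in tqdm.tqdm(df['tweet'], total=len(df)):
        # Naive whitespace tokenization; out-of-vocabulary tokens are dropped.
        # `word in word2vec` works across gensim versions, whereas the
        # `.vocab` attribute was removed in gensim 4.
        vectors = [word2vec[word] for word in tweet.split(' ') if word in word2vec]

        if not vectors:
            # No known words in this tweet: fall back to an all-zero vector.
            doc_embeddings.append(np.zeros(word2vec.vector_size))
            continue

        doc_embeddings.append(np.mean(vectors, axis=0))
    return np.array(doc_embeddings)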
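The averaging step in `compute_embeddings` can be tried without downloading the pretrained model. The sketch below swaps in a hand-made 3-dimensional embedding table; the table, the `embed` helper, and the example strings are all hypothetical.

```python
import numpy as np

# Toy embedding table standing in for the pretrained model.
toy_vectors = {
    "good": np.array([1.0, 0.0, 0.0]),
    "day":  np.array([0.0, 1.0, 0.0]),
    "bad":  np.array([-1.0, 0.0, 0.0]),
}
DIM = 3

def embed(text, vectors=toy_vectors, dim=DIM):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[tok] for tok in text.split() if tok in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

print(embed("good day"))       # mean of the two known vectors
print(embed("good day :)"))    # ':)' is unknown and silently dropped
print(embed("@user ???"))      # nothing known: all zeros
```

Two consequences are worth keeping in mind for the tweet data: tokens missing from the embedding vocabulary contribute nothing to the document vector, and averaging discards word order.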

In [None]:
X_train_emb = compute_embeddings(train_df)
y_train = train_df['label']

X_test_emb = compute_embeddings(test_df)
y_test = test_df['label']

100%|██████████| 8000/8000 [00:01<00:00, 4282.18it/s]
100%|██████████| 2000/2000 [00:00<00:00, 4114.12it/s]


In [None]:
from sklearn.svm import SVC

svm = SVC(verbose = 2)
svm.fit(X_train_emb, y_train)

[LibSVM]

SVC(verbose=2)

In [None]:
from sklearn.metrics import accuracy_score, precision_score, f1_score 

y_test_pred = svm.predict(X_test_emb)

print('Accuracy', accuracy_score(y_test, y_test_pred))
print('Precision',precision_score(y_test, y_test_pred))
print('F1 score',f1_score(y_test, y_test_pred))

Accuracy 0.8815
Precision 0.9507803121248499
F1 score 0.869851729818781
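To make the three reported numbers easier to interpret, here is a small sketch that computes the same metrics, plus recall, directly from the confusion counts. The function name `binary_metrics` and the toy labels are made up for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
```

Since F1 is the harmonic mean of precision and recall, the precision and F1 printed above imply a recall of roughly 0.80: the classifier is rarely wrong when it predicts "positive", but it misses about one in five positive tweets.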


Notebook adapted from https://israelg99.github.io/2017-03-23-Word2Vec-Explained/

Further Reading

- [Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings](https://arxiv.org/pdf/1607.06520.pdf)
- [Debiaswe: try to make word embeddings less sexist](https://github.com/tolga-b/debiaswe)
- [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf)
- [Fasttext Word vectors for 157 languages](https://fasttext.cc/docs/en/crawl-vectors.html)
- [Illustrated Word2vec](https://jalammar.github.io/illustrated-word2vec/)

## 1. Word2vec similarities

In [5]:
word2vec.most_similar(['king', 'man'], negative = ['queen'])

[('boy', 0.5393532514572144),
 ('guy', 0.47399765253067017),
 ('Alexios_Marakis', 0.4579210579395294),
 ('Man', 0.4575732350349426),
 ('teenager', 0.4346425235271454),
 ('dude', 0.42896467447280884),
 ('woman', 0.42854905128479004),
 ('suspected_purse_snatcher', 0.4247782826423645),
 ('him', 0.42347830533981323),
 ('motorcyclist', 0.4231083393096924)]

In [6]:
word2vec.most_similar('earth')

[('Earth', 0.7105128765106201),
 ('planet', 0.6802847981452942),
 ('meek_inheriting', 0.5625146627426147),
 ('earths', 0.5312458276748657),
 ('cosmos', 0.5272278785705566),
 ('mankind', 0.5163297057151794),
 ('mega_vertebrate', 0.5102849006652832),
 ('shepherded_Tolkien_Middle', 0.5001775026321411),
 ('ERDAS_creates', 0.4907360076904297),
 ('Martian_surface', 0.480654239654541)]

In [12]:
word2vec.most_similar(['werewolf'], negative = ['wolf'])

[('vampire_werewolf', 0.3629043996334076),
 ('love_triangle', 0.35874736309051514),
 ('vampire', 0.3576195240020752),
 ('werewolf_Jacob', 0.3553285598754883),
 ('Harry_Daniel_Radcliffe', 0.3486817479133606),
 ('plotline', 0.3475787937641144),
 ('Bella_Swan', 0.3428325355052948),
 ('Edward_Cullen', 0.3365398049354553),
 ('WEIRD_SCIENCE', 0.33004915714263916),
 ('vampires', 0.32982370257377625)]

In [16]:
word2vec.most_similar(['werewolf', 'man'], negative = ['wolf'])

[('woman', 0.5253201723098755),
 ('teenager', 0.4768238067626953),
 ('teenage_girl', 0.4647809565067291),
 ('boy', 0.4555785655975342),
 ('PARANOID_schizophrenic', 0.44646579027175903),
 ('horribly_horribly_deranged', 0.4370340406894684),
 ('girl', 0.43483102321624756),
 ('transvestite_hooker', 0.42668646574020386),
 ('SUSPECT_SOUGHT', 0.4237041771411896),
 ('nightclubber', 0.4223080575466156)]

## 2. GloVe

In [None]:
import nltk
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


True

In [None]:
from nltk.corpus import twitter_samples

pos_tweets = twitter_samples.strings('positive_tweets.json')
print(len(pos_tweets))

neg_tweets = twitter_samples.strings('negative_tweets.json')
print(len(neg_tweets))

5000
5000


In [None]:
import pandas as pd
pos_df = pd.DataFrame(pos_tweets, columns = ['tweet'])
pos_df['label'] = 1

neg_df = pd.DataFrame(neg_tweets, columns = ['tweet'])
neg_df['label'] = 0

In [None]:
data_df = pd.concat([pos_df, neg_df], ignore_index=True)
# data_df = data_df[:20]

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(data_df, test_size=0.2, shuffle = True)
print(train_df)
print(test_df)

                                                  tweet  label
9132                   7pm on a Friday and I am dead :(      0
875   @TOBMAST3R @inspchin @koeitecmoeurope @TanikoH...      1
1353  i thought they won at mubank.. lol \nbecause d...      1
7916                              @crude18 takfaham. :(      0
6976  No Friday night live @NRL on @GEMChannel in St...      0
...                                                 ...    ...
9759                                 something wrong :(      0
2477  @rusmexuswriters I have seen some of the new f...      1
5839  this place will forever hold a very special pl...      0
9383  @seokielips sorry :( i spend more time on inst...      0
1319  @SilverArrowsHAM Are you seeing the cars drift...      1

[8000 rows x 2 columns]
                                                  tweet  label
6433                                         Iam fat :(      0
9319          @BenJPierce I wish I could see you Ben :(      0
3450  @paulbeaton720 Looking g

In [None]:
import torch
import torchtext.vocab as vocab
glove = vocab.GloVe(name='6B', dim=300)

print('Loaded {} words'.format(len(glove.itos)))

100%|█████████▉| 399999/400000 [00:46<00:00, 8611.16it/s]


Loaded 400000 words


In [None]:
import numpy as np
import tqdm

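def compute_embeddings(df):
    """Embed each tweet as the mean of its in-vocabulary GloVe vectors."""
    doc_embeddings = []
    for tweet in tqdm.tqdm(df['tweet'], total=len(df)):
        # glove.stoi maps token -> row index of glove.vectors.
        # Note: the GloVe 6B vocabulary is uncased, so tokens containing capital
        # letters are treated as unknown here; lowercasing the tweet first
        # would match more words.
        vectors = [np.asarray(glove.vectors[glove.stoi[word]], dtype='float32')
                   for word in tweet.split(' ') if word in glove.stoi]

        if not vectors:
            # No known words in this tweet: fall back to an all-zero vector.
            doc_embeddings.append(np.zeros(300))
            continue

        doc_embeddings.append(np.mean(vectors, axis=0))
    return np.array(doc_embeddings)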

In [None]:
X_train_emb = compute_embeddings(train_df)
y_train = train_df['label']

X_test_emb = compute_embeddings(test_df)
y_test = test_df['label']

100%|██████████| 8000/8000 [00:01<00:00, 4192.58it/s]
100%|██████████| 2000/2000 [00:00<00:00, 4268.96it/s]


In [None]:
from sklearn.svm import SVC

svm = SVC(verbose = 2)
svm.fit(X_train_emb, y_train)

[LibSVM]

SVC(verbose=2)

In [None]:
from sklearn.metrics import accuracy_score, precision_score, f1_score 

y_test_pred = svm.predict(X_test_emb)

print('Accuracy', accuracy_score(y_test, y_test_pred))
print('Precision',precision_score(y_test, y_test_pred))
print('F1 score',f1_score(y_test, y_test_pred))

Accuracy 0.9015
Precision 0.8747609942638623
F1 score 0.9028120374938332
