# Word Embedding - Home Assigment
## Dr. Omri Allouche 2018. YData Deep Learning Course

[Open in Google Colab](https://colab.research.google.com/github/omriallouche/deep_learning_course/blob/master/DL_word_embedding_assignment.ipynb)
    
    
In this exercise, you'll use word vectors trained on a corpus of 380,000 lyrics of songs from MetroLyrics (https://www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics).  
The dataset contains these fields for each song, in CSV format:
1. index
1. song
1. year
1. artist
1. genre
1. lyrics

Before doing this exercise, we recommend that you go over the "Bag of words meets bag of popcorn" tutorial (https://www.kaggle.com/c/word2vec-nlp-tutorial)

Other recommended resources:
- https://rare-technologies.com/word2vec-tutorial/
- https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial

### Train word vectors
Train word vectors using the Skipgram Word2vec algorithm and the gensim package.
Make sure you perform the following:
- Tokenize words
- Lowercase all words
- Remove punctuation marks
- Remove rare words
- Remove stopwords

Use 300 as the dimension of the word vectors. Try different context sizes.

In [7]:
import re
import pandas as pd
from time import time  # To time our operations
from collections import defaultdict  # For word frequency

import logging  # Setting up the loggings to monitor gensim
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)

In [8]:
raw_data = pd.read_csv("380000-lyrics-from-metrolyrics.zip")

In [9]:
raw_data.columns.values

array(['index', 'song', 'year', 'artist', 'genre', 'lyrics'], dtype=object)

In [10]:
raw_data.shape

(362237, 6)

In [11]:
print(raw_data.head(10))

   index                    song  year           artist genre  \
0      0               ego-remix  2009  beyonce-knowles   Pop   
1      1            then-tell-me  2009  beyonce-knowles   Pop   
2      2                 honesty  2009  beyonce-knowles   Pop   
3      3         you-are-my-rock  2009  beyonce-knowles   Pop   
4      4           black-culture  2009  beyonce-knowles   Pop   
5      5  all-i-could-do-was-cry  2009  beyonce-knowles   Pop   
6      6      once-in-a-lifetime  2009  beyonce-knowles   Pop   
7      7                 waiting  2009  beyonce-knowles   Pop   
8      8               slow-love  2009  beyonce-knowles   Pop   
9      9   why-don-t-you-love-me  2009  beyonce-knowles   Pop   

                                              lyrics  
0  Oh baby, how you doing?\nYou know I'm gonna cu...  
1  playin' everything so easy,\nit's like you see...  
2  If you search\nFor tenderness\nIt isn't hard t...  
3  Oh oh oh I, oh oh oh I\n[Verse 1:]\nIf I wrote...  
4  Party 

In [12]:
raw_data['genre'].value_counts()

Rock             131377
Pop               49444
Hip-Hop           33965
Not Available     29814
Metal             28408
Other             23683
Country           17286
Jazz              17147
Electronic        16205
R&B                5935
Indie              5732
Folk               3241
Name: genre, dtype: int64

In [13]:
print(raw_data.shape)
data = raw_data.loc[raw_data["lyrics"].str.len() > 3]
print(data.shape)

(362237, 6)
(266505, 6)


In [14]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/samuelguedj/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [15]:
import re
from nltk.corpus import stopwords

def document_to_words(text, remove_panctuations=True, remove_stopwords=False):
    if remove_panctuations:
      text = re.sub("[^a-zA-Z0-9]"," ", text)
    words = text.lower().split()
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if not w in stops]   
    return words

In [16]:
document_to_words(data["lyrics"][0], remove_stopwords=True)[:10]

['oh',
 'baby',
 'know',
 'gonna',
 'cut',
 'right',
 'chase',
 'women',
 'made',
 'like']

In [17]:
import nltk
nltk.download('punkt')

# Load the punkt tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/samuelguedj/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [18]:
tokenizer.tokenize(data["lyrics"][0].strip())

['Oh baby, how you doing?',
 "You know I'm gonna cut right to the chase\nSome women were made but me, myself\nI like to think that I was created for a special purpose\nYou know, what's more special than you?",
 "You feel me\nIt's on baby, let's get lost\nYou don't need to call into work 'cause you're the boss\nFor real, want you to show me how you feel\nI consider myself lucky, that's a big deal\nWhy?",
 "Well, you got the key to my heart\nBut you ain't gonna need it, I'd rather you open up my body\nAnd show me secrets, you didn't know was inside\nNo need for me to lie\nIt's too big, it's too wide\nIt's too strong, it won't fit\nIt's too much, it's too tough\nHe talk like this 'cause he can back it up\nHe got a big ego, such a huge ego\nI love his big ego, it's too much\nHe walk like this 'cause he can back it up\nUsually I'm humble, right now I don't choose\nYou can leave with me or you could have the blues\nSome call it arrogant, I call it confident\nYou decide when you find on what 

In [19]:
def document_to_sentences(document, tokenizer, remove_stopwords=False):
    sentences = []
    if tokenizer:
        raw_sentences = tokenizer.tokenize(document)
    else:
        raw_sentences = document
    for raw_sentence in raw_sentences:
        if len(raw_sentence) > 0:
            sentences.append(document_to_words(raw_sentence, remove_stopwords))
    return sentences

In [20]:
sentence = document_to_sentences(data["lyrics"][0], tokenizer, remove_stopwords=True)
len(sentence[1])

37

In [25]:
sentences = []  # Initialize an empty list of sentences

print("Parsing sentences from training set")
for song_lyrics in data["lyrics"]:
    sentences += document_to_sentences(song_lyrics, tokenizer, 
                                       remove_stopwords=True)

Parsing sentences from training set


In [22]:
from gensim.models import Word2Vec

In [23]:
from gensim.models import Word2Vec

def word2vec_hat(sentences, num_features=300, min_word_count=40, workers=4, 
             context=10, downsampling=1e-3, save_model=True):

    model = Word2Vec(sentences, workers=workers, size=num_features, 
                   min_count=min_word_count, window=context, sample=downsampling)

    if save_model:
        # If you don't plan to train the model any further, calling 
        # init_sims will make the model much more memory-efficient.
#         model.init_sims(replace=True)
        model.wv.init_sims
        # It can be helpful to create a meaningful model name and 
        # save the model for later use. You can load it later using Word2Vec.load()
        model_name = "{}features_{}minwords_{}context.wv.model".format(num_features, 
                                                              min_word_count, 
                                                              context)
        model.save(model_name)
 
    return model

In [26]:
word2vec_hat(sentences)

INFO - 16:38:52: collecting all words and their counts
INFO - 16:38:52: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO - 16:38:52: PROGRESS: at sentence #10000, processed 547073 words, keeping 24856 word types
INFO - 16:38:52: PROGRESS: at sentence #20000, processed 1002175 words, keeping 36701 word types
INFO - 16:38:52: PROGRESS: at sentence #30000, processed 1577546 words, keeping 45437 word types
INFO - 16:38:53: PROGRESS: at sentence #40000, processed 2132629 words, keeping 55005 word types
INFO - 16:38:53: PROGRESS: at sentence #50000, processed 2565209 words, keeping 60035 word types
INFO - 16:38:53: PROGRESS: at sentence #60000, processed 3125605 words, keeping 70378 word types
INFO - 16:38:53: PROGRESS: at sentence #70000, processed 3650087 words, keeping 80685 word types
INFO - 16:38:53: PROGRESS: at sentence #80000, processed 4152681 words, keeping 88085 word types
INFO - 16:38:53: PROGRESS: at sentence #90000, processed 4712834 words, keeping 92527 w

INFO - 16:39:03: PROGRESS: at sentence #820000, processed 43471094 words, keeping 323912 word types
INFO - 16:39:03: PROGRESS: at sentence #830000, processed 44172564 words, keeping 325593 word types
INFO - 16:39:03: PROGRESS: at sentence #840000, processed 44667279 words, keeping 330357 word types
INFO - 16:39:03: PROGRESS: at sentence #850000, processed 45114992 words, keeping 331606 word types
INFO - 16:39:03: PROGRESS: at sentence #860000, processed 45840735 words, keeping 334894 word types
INFO - 16:39:04: PROGRESS: at sentence #870000, processed 46364336 words, keeping 338759 word types
INFO - 16:39:04: PROGRESS: at sentence #880000, processed 46858701 words, keeping 340849 word types
INFO - 16:39:04: PROGRESS: at sentence #890000, processed 47371857 words, keeping 345358 word types
INFO - 16:39:04: PROGRESS: at sentence #900000, processed 47954498 words, keeping 346951 word types
INFO - 16:39:04: PROGRESS: at sentence #910000, processed 48459183 words, keeping 351406 word types


INFO - 16:39:48: EPOCH 1 - PROGRESS: at 77.49% examples, 925531 words/s, in_qsize 7, out_qsize 0
INFO - 16:39:49: EPOCH 1 - PROGRESS: at 79.94% examples, 927210 words/s, in_qsize 7, out_qsize 0
INFO - 16:39:50: EPOCH 1 - PROGRESS: at 81.69% examples, 924593 words/s, in_qsize 7, out_qsize 0
INFO - 16:39:51: EPOCH 1 - PROGRESS: at 83.07% examples, 917176 words/s, in_qsize 7, out_qsize 0
INFO - 16:39:52: EPOCH 1 - PROGRESS: at 84.95% examples, 915031 words/s, in_qsize 7, out_qsize 0
INFO - 16:39:53: EPOCH 1 - PROGRESS: at 86.65% examples, 915373 words/s, in_qsize 7, out_qsize 0
INFO - 16:39:54: EPOCH 1 - PROGRESS: at 88.61% examples, 917594 words/s, in_qsize 7, out_qsize 0
INFO - 16:39:55: EPOCH 1 - PROGRESS: at 90.94% examples, 919615 words/s, in_qsize 7, out_qsize 0
INFO - 16:39:56: EPOCH 1 - PROGRESS: at 93.70% examples, 921703 words/s, in_qsize 7, out_qsize 0
INFO - 16:39:57: EPOCH 1 - PROGRESS: at 95.73% examples, 923483 words/s, in_qsize 7, out_qsize 0
INFO - 16:39:58: EPOCH 1 - PRO

INFO - 16:41:05: EPOCH 3 - PROGRESS: at 28.29% examples, 776426 words/s, in_qsize 7, out_qsize 0
INFO - 16:41:06: EPOCH 3 - PROGRESS: at 30.05% examples, 782419 words/s, in_qsize 7, out_qsize 0
INFO - 16:41:07: EPOCH 3 - PROGRESS: at 32.33% examples, 796158 words/s, in_qsize 7, out_qsize 0
INFO - 16:41:08: EPOCH 3 - PROGRESS: at 34.62% examples, 806195 words/s, in_qsize 7, out_qsize 0
INFO - 16:41:10: EPOCH 3 - PROGRESS: at 36.84% examples, 816285 words/s, in_qsize 7, out_qsize 0
INFO - 16:41:11: EPOCH 3 - PROGRESS: at 38.78% examples, 824981 words/s, in_qsize 7, out_qsize 0
INFO - 16:41:12: EPOCH 3 - PROGRESS: at 40.94% examples, 833273 words/s, in_qsize 7, out_qsize 0
INFO - 16:41:13: EPOCH 3 - PROGRESS: at 43.28% examples, 842489 words/s, in_qsize 7, out_qsize 0
INFO - 16:41:14: EPOCH 3 - PROGRESS: at 45.70% examples, 849640 words/s, in_qsize 7, out_qsize 0
INFO - 16:41:15: EPOCH 3 - PROGRESS: at 47.98% examples, 853972 words/s, in_qsize 7, out_qsize 0
INFO - 16:41:16: EPOCH 3 - PRO

INFO - 16:42:27: worker thread finished; awaiting finish of 3 more threads
INFO - 16:42:27: worker thread finished; awaiting finish of 2 more threads
INFO - 16:42:27: worker thread finished; awaiting finish of 1 more threads
INFO - 16:42:27: worker thread finished; awaiting finish of 0 more threads
INFO - 16:42:27: EPOCH - 4 : training on 63696185 raw words (45785587 effective words) took 46.2s, 991369 effective words/s
INFO - 16:42:28: EPOCH 5 - PROGRESS: at 1.99% examples, 873972 words/s, in_qsize 7, out_qsize 0
INFO - 16:42:29: EPOCH 5 - PROGRESS: at 3.97% examples, 878252 words/s, in_qsize 7, out_qsize 0
INFO - 16:42:30: EPOCH 5 - PROGRESS: at 6.29% examples, 926430 words/s, in_qsize 7, out_qsize 0
INFO - 16:42:31: EPOCH 5 - PROGRESS: at 8.31% examples, 949677 words/s, in_qsize 7, out_qsize 0
INFO - 16:42:32: EPOCH 5 - PROGRESS: at 10.82% examples, 961521 words/s, in_qsize 7, out_qsize 0
INFO - 16:42:33: EPOCH 5 - PROGRESS: at 13.05% examples, 970227 words/s, in_qsize 7, out_qsize 

<gensim.models.word2vec.Word2Vec at 0x131e01668>

In [27]:
w2v_model = Word2Vec.load("300features_40minwords_10context.wv.model")

INFO - 16:45:37: loading Word2Vec object from 300features_40minwords_10context.wv.model
INFO - 16:45:38: loading wv recursively from 300features_40minwords_10context.wv.model.wv.* with mmap=None
INFO - 16:45:38: setting ignored attribute vectors_norm to None
INFO - 16:45:38: loading vocabulary recursively from 300features_40minwords_10context.wv.model.vocabulary.* with mmap=None
INFO - 16:45:38: loading trainables recursively from 300features_40minwords_10context.wv.model.trainables.* with mmap=None
INFO - 16:45:38: setting ignored attribute cum_table to None
INFO - 16:45:38: loaded 300features_40minwords_10context.wv.model


### Review most similar words
Get initial evaluation of the word vectors by analyzing the most similar words for a few interesting words in the text. 

Choose words yourself, and find the most similar words to them.

In [28]:
w2v_model.wv.most_similar("dog") 

INFO - 16:52:02: precomputing L2-norms of word weight vectors


[('hound', 0.6047170758247375),
 ('puppy', 0.5669090747833252),
 ('cat', 0.5553213357925415),
 ('dogs', 0.5495033860206604),
 ('barking', 0.5416650772094727),
 ('hog', 0.5347200036048889),
 ('frog', 0.5303950905799866),
 ('barkin', 0.5250210165977478),
 ('bark', 0.5211504697799683),
 ('doggy', 0.48853635787963867)]

In [29]:
w2v_model.wv.most_similar("love") 

[('baby', 0.554186224937439),
 ('loving', 0.5354353189468384),
 ('heart', 0.5152585506439209),
 ('oh', 0.5066409111022949),
 ('you', 0.4939931631088257),
 ('true', 0.48410582542419434),
 ('darling', 0.4810052216053009),
 ('lovin', 0.4613601565361023),
 ('me', 0.4605832099914551),
 ('darlin', 0.4548726975917816)]

In [30]:
w2v_model.wv.most_similar("war")

[('battle', 0.6541935801506042),
 ('waging', 0.6304421424865723),
 ('wars', 0.6265519261360168),
 ('waged', 0.5722817778587341),
 ('civil', 0.5561788082122803),
 ('battles', 0.496185839176178),
 ('tug', 0.48518654704093933),
 ('battlefield', 0.4778044819831848),
 ('fighting', 0.477022647857666),
 ('wage', 0.4723787009716034)]

In [31]:
w2v_model.wv.most_similar("blue") 

[('bluest', 0.5335527658462524),
 ('gray', 0.5190421938896179),
 ('grey', 0.47112494707107544),
 ('starry', 0.43493953347206116),
 ('cloudy', 0.4216483235359192),
 ('starlit', 0.4208819270133972),
 ('hue', 0.4200236201286316),
 ('tangerine', 0.4178546667098999),
 ('colour', 0.4062366485595703),
 ('misty', 0.40504148602485657)]

In [42]:
w2v_model.wv.most_similar("paris") 

[('france', 0.6699743866920471),
 ('spain', 0.5838769674301147),
 ('hilton', 0.5828134417533875),
 ('tokyo', 0.5712754726409912),
 ('italy', 0.5608839988708496),
 ('venice', 0.5596826672554016),
 ('london', 0.5530416369438171),
 ('miami', 0.5352116823196411),
 ('brazil', 0.5246003866195679),
 ('rome', 0.5206924676895142)]

In [43]:
w2v_model.wv.most_similar("jew") 

[('jewish', 0.5946222543716431),
 ('buddhist', 0.5576699376106262),
 ('muslim', 0.5428418517112732),
 ('caucasian', 0.5240269303321838),
 ('bishop', 0.5107120275497437),
 ('goat', 0.5064100027084351),
 ('christian', 0.4977492392063141),
 ('haitian', 0.49670469760894775),
 ('reggie', 0.4932796061038971),
 ('italian', 0.486927330493927)]

### Word Vectors Algebra
We've seen in class examples of algebraic games on the word vectors (e.g. man - woman + king = queen ). 

Try a few vector algebra terms, and evaluate how well they work. Try to use the Cosine distance and compare it to the Euclidean distance.

In [44]:
w2v_model.wv.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.5984432697296143),
 ('kings', 0.4317205548286438),
 ('empress', 0.3948866128921509),
 ('princess', 0.39446741342544556),
 ('homecoming', 0.3881065249443054),
 ('throne', 0.3874613642692566),
 ('crowning', 0.3848448097705841),
 ('crowned', 0.38424116373062134),
 ('crown', 0.3822874128818512),
 ('majesty', 0.3679766058921814)]

In [45]:
w2v_model.wv.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])

[('queen', 0.8952284455299377),
 ('empress', 0.7569559812545776),
 ('princess', 0.7516407370567322),
 ('homecoming', 0.7510775923728943),
 ('kings', 0.7475972771644592),
 ('sweetest', 0.744684100151062),
 ('throne', 0.7338636517524719),
 ('crown', 0.7334924936294556),
 ('crowning', 0.7326667904853821),
 ('newborn', 0.7302643656730652)]

In [46]:
vector = w2v_model.wv['king'] - w2v_model.wv['man'] + w2v_model.wv['woman']
w2v_model.wv.similar_by_vector(vector, topn=10, restrict_vocab=None)

[('king', 0.7405559420585632),
 ('queen', 0.6157742738723755),
 ('woman', 0.43306297063827515),
 ('kings', 0.43230652809143066),
 ('princess', 0.4136694669723511),
 ('throne', 0.40156853199005127),
 ('empress', 0.3987256586551666),
 ('homecoming', 0.394988477230072),
 ('crowning', 0.39387229084968567),
 ('crown', 0.39278435707092285)]

In [47]:
# Cosine similarity
cosine_distance = w2v_model.wv.similarity("woman", "girl")
print("Cosine distance = ", cosine_distance)

import numpy as np
euclidean_distance = np.linalg.norm(w2v_model.wv["women"] - w2v_model.wv["girl"])
print("Euclidean distance = ", euclidean_distance)

Cosine distance =  0.55022025
Euclidean distance =  30.819527


## Sentiment Analysis
Estimate sentiment of words using word vectors.  
In this section, we'll use the SemEval-2015 English Twitter Sentiment Lexicon.  
The lexicon was used as an official test set in the SemEval-2015 shared Task #10: Subtask E, and contains a polarity score for words in range -1 (negative) to 1 (positive) - http://saifmohammad.com/WebPages/SCL.html#OPP

Build a classifier for the sentiment of a word given its word vector. Split the data to a train and test sets, and report the model performance on both sets.

Use your trained model from the previous question to predict the sentiment score of words in the lyrics corpus that are not part of the original sentiment dataset. Review the words with the highest positive and negative sentiment. Do the results make sense?

### Visualize Word Vectors
In this section, you'll plot words on a 2D grid based on their inner similarity. We'll use the tSNE transformation to reduce dimensions from 300 to 2. You can get sample code from https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial or other tutorials online.

Perform the following:
- Keep only the 3,000 most frequent words (after removing stopwords)
- For this list, compute for each word its relative abundance in each of the genres
- Compute the ratio between the proportion of each word in each genre and the proportion of the word in the entire corpus (the background distribution)
- Pick the top 50 words for each genre. These words give good indication for that genre. Join the words from all genres into a single list of top significant words. 
- Compute tSNE transformation to 2D for all words, based on their word vectors
- Plot the list of the top significant words in 2D. Next to each word output its text. The color of each point should indicate the genre for which it is most significant.

You might prefer to use a different number of points or a slightly different methodology for improved results.  
Analyze the results.

In [48]:
from sklearn.datasets import fetch_mldata
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

In [49]:
tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
tsne_results = tsne.fit_transform(data_subset)

NameError: name 'data_subset' is not defined

## Text Classification
In this section, you'll build a text classifier, determining the genre of a song based on its lyrics.

### Text classification using Bag-of-Words
Build a Naive Bayes classifier based on the bag of Words.  
You will need to divide your dataset into a train and test sets.

Show the confusion matrix.

Show the classification report - precision, recall, f1 for each class.

### Text classification using Word Vectors
#### Average word vectors
Do the same, using a classifier that averages the word vectors of words in the document.

#### TfIdf Weighting
Do the same, using a classifier that averages the word vectors of words in the document, weighting each word by its TfIdf.


### Text classification using ConvNet
Do the same, using a ConvNet.  
The ConvNet should get as input a 2D matrix where each column is an embedding vector of a single word, and words are in order. Use zero padding so that all matrices have a similar length.  
Some songs might be very long. Trim them so you keep a maximum of 128 words (after cleaning stop words and rare words).  
Initialize the embedding layer using the word vectors that you've trained before, but allow them to change during training.  

Extra: Try training the ConvNet with 2 slight modifications:
1. freezing the the weights trained using Word2vec (preventing it from updating)
1. random initialization of the embedding layer

You are encouraged to try this question on your own.  

You might prefer to get ideas from the paper "Convolutional Neural Networks for Sentence Classification" (Kim 2014, [link](https://arxiv.org/abs/1408.5882)).

There are several implementations of the paper code in PyTorch online (see for example [this repo](https://github.com/prakashpandey9/Text-Classification-Pytorch) for a PyTorch implementation of CNN and other architectures for text classification). If you get stuck, they might provide you with a reference for your own code.