<a href="https://colab.research.google.com/github/SDS-AAU/M2-2019/blob/master/notebooks/training_word2vec_%26_co_on_custom_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training customized word embeddings

Word embeddings became big around 2013 and are linked to [this paper](https://arxiv.org/abs/1301.3781) with the beautiful title 
*Efficient Estimation of Word Representations in Vector Space* by Tomas Mikolov et al., coming out of Google. This was the foundation of Word2Vec.

The idea behind it is most easily summarized by the following quote: 


> *You shall know a word by the company it keeps (Firth, J. R. 1957:11)*

Let me start with a fascinating example of word embeddings in practice. Below, you can see a figure from the paper 
*Dynamic Word Embeddings for Evolving Semantic Discovery*. Here (in simple terms) the researchers estimated word vectors from textual inputs in different time frames. They picked out some terms and persons that obviously changed *their company* over the years. Then they looked at the relative position of these terms compared to terms that did not change much (anchors). If you are interested in this kind of research, check out [this blog](https://blog.acolyer.org/2018/02/22/dynamic-word-embeddings-for-evolving-semantic-discovery/) that describes the paper briefly or the [original paper](https://arxiv.org/abs/1703.00607).

![alt text](https://adriancolyer.files.wordpress.com/2018/02/evolving-word-embeddings-fig-1.jpeg)

Word embeddings allow us to create term representations that "learn" meaning from semantic and syntactic features. These models take a sequence of sentences as input and scan the whole corpus for all individual terms and all their occurrences. Such contextual learning seems to be able to pick up non-trivial conceptual details, and it is this class of models that today enables technologies such as chatbots, machine translation and much more.
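
To make "context" a bit more concrete, here is a minimal sketch in plain Python of how (target, context) pairs could be collected with a symmetric window. This is only an illustration of the idea, not how gensim implements it; the function name `context_pairs` and the toy sentence are assumptions made for this example.

```python
def context_pairs(tokens, window=2):
    """Return (target, context) pairs for every token within a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                 # left edge of the window
        hi = min(len(tokens), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Toy sentence (illustrative only)
sentence = ["you", "shall", "know", "a", "word"]
pairs = context_pairs(sentence, window=1)
print(pairs)
```

A model like Word2Vec learns vectors so that words appearing in similar contexts end up close to each other.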

The early word embedding models were Word2Vec and [GloVe](https://nlp.stanford.edu/projects/glove/).
In December 2017 Facebook presented [fastText](https://fasttext.cc/) (by the way - by 2017 Tomas Mikolov was working for Facebook and is one of the authors of the [paper](https://arxiv.org/abs/1607.04606) that introduces the research behind fastText). This model extends the idea of Word2Vec by enriching the vectors with information from sub-word elements. What does that mean? Words are not only defined by their surrounding words but also by the character n-grams (sub-word pieces) that make up the word. Why should that be a good idea? Well, now words such as *apple* and *apples* get similar vectors not only because they often share context but also because they are composed of the same sub-word elements. This comes in particularly handy when we are dealing with languages that have a rich morphology, such as Turkish or Russian. It is also great when working with web text, which is often messy and misspelt.
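
The sub-word idea can be sketched in a few lines of plain Python. The fastText paper wraps each word in the boundary markers `<` and `>` and extracts character n-grams of length 3 to 6; the helper `char_ngrams` below is a hypothetical name for this illustration and is not part of gensim or fastText.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in the boundary markers < and >."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams


apple = char_ngrams("apple")
apples = char_ngrams("apples")
shared = apple & apples

print(sorted(shared))                      # n-grams both words have in common
print(len(shared) / len(apple | apples))   # share of common n-grams (Jaccard overlap)
```

Because the two words share many n-grams, their vectors (built from the n-gram vectors) start out close to each other even before any context is considered.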

The current state-of-the-art (April 2018!) is ELMo (Embeddings from Language Models), which further tackles the problem of contextuality and particularly polysemy, i.e. the same term meaning something else in a different context. 

You can read more about the ins and outs of the current state of embedding models [here](https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a).

Now the good news: You will find pre-trained vectors from all mentioned models online. They will do great in most cases. However, when working on specific tasks, such as obscure languages and/or specific technical jargon (finance talk), it is nice to know how to train such word vectors yourself.

In this tutorial and on M3 we will not go further than fastText (2017-state-of-the-art should be good enough for us – sorry). You are more than welcome to use other, more sophisticated, embeddings.


In this tutorial we will train two embedding models:

- Word2Vec on the hate speech and toxic comments data
- fastText on the toxic comments data

Once trained, we will store the models.


In [0]:
# import pandas for tabular data
import pandas as pd


# import gensim and the Word2Vec as well as FastText models
import gensim
from gensim.models import Word2Vec, FastText

In [0]:
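# We import logging to get informative outputs from Gensim training

import logging

# Show INFO-level messages with timestamps, which is what Gensim uses to report progress
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)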

## Training Word2Vec on a specific corpus (from memory)

In the following, we will train the model on the toxic-comments data from [this Kaggle challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) as well as the Twitter hate speech data.

In [0]:
# Download the data from GDrive

import gdown
gdown.download('https://drive.google.com/uc?id=1-8hQOHTWZ2JYyqsvcb5tu2pmcBzzCMfo', 'hatespeech.gzip', quiet=False)

In [0]:
all_hate_data = pd.read_pickle('hatespeech.gzip')

#### Bonus material: Identification of bi-grams & co:

In [0]:
from gensim.models.phrases import Phrases, Phraser

In [0]:
# Count up potential phrases and words
phrases = Phrases(all_hate_data, min_count=5, threshold=10)

2019-10-13 19:56:58,459 : INFO : collecting all words and their counts
2019-10-13 19:56:58,462 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2019-10-13 19:56:59,662 : INFO : PROGRESS: at sentence #10000, processed 655155 words and 285005 word types
2019-10-13 19:57:00,980 : INFO : PROGRESS: at sentence #20000, processed 1297718 words and 477817 word types
2019-10-13 19:57:02,181 : INFO : PROGRESS: at sentence #30000, processed 1928609 words and 642497 word types
2019-10-13 19:57:03,518 : INFO : PROGRESS: at sentence #40000, processed 2584346 words and 800472 word types
2019-10-13 19:57:04,730 : INFO : PROGRESS: at sentence #50000, processed 3219612 words and 941861 word types
2019-10-13 19:57:06,006 : INFO : PROGRESS: at sentence #60000, processed 3887527 words and 1080208 word types
2019-10-13 19:57:07,255 : INFO : PROGRESS: at sentence #70000, processed 4534434 words and 1210177 word types
2019-10-13 19:57:08,483 : INFO : PROGRESS: at sentence #80000, processe

In [0]:
# Train bigram_model

bigram = Phraser(phrases)

2019-10-13 19:57:40,365 : INFO : source_vocab length 3825076
2019-10-13 19:58:20,602 : INFO : Phraser built with 27006 phrasegrams


In [0]:
# Apply bigram model to texts
all_hate_data_phrases = all_hate_data.map(lambda t: bigram[t])
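
How does the phrase detection decide that two words belong together? The sketch below follows the scoring idea from the Mikolov et al. phrases paper, on which gensim's default scorer is based: a bigram scores high when it occurs more often than the frequencies of its two words would suggest. It is a simplified illustration in plain Python, not gensim's implementation; `score_bigrams` and the toy sentences are made up for this example.

```python
from collections import Counter


def score_bigrams(sentences, min_count=1, threshold=1.0):
    """Score adjacent word pairs and keep those scoring above the threshold."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    vocab_size = len(unigrams)
    scores = {}
    for (a, b), count_ab in bigrams.items():
        # (count(a b) - min_count) * vocab_size / (count(a) * count(b))
        score = (count_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            scores[(a, b)] = score
    return scores


# Toy corpus (illustrative only)
toy = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "is", "far"],
    ["the", "house", "is", "new"],
]
phrases_found = score_bigrams(toy, min_count=1, threshold=1.0)
print(phrases_found)
```

Raising `threshold` keeps fewer, more strongly associated phrases; raising `min_count` discounts rare pairs.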

### Model training in one line

We can actually instantiate, build vocabulary and train in one line only. Isn't that great?

In [0]:
# We can instantiate and train the model in one line. Just pass the input data (sequence of token-lists),
# specify target dimensionality (size), the window around the target term, minimum count, and number of iterations/epochs.
# workers are optional (for multiprocessing)
# Note: in gensim 4.x the parameters `size` and `iter` are called `vector_size` and `epochs`

model_toxic = Word2Vec(all_hate_data_phrases, size=100, window=5, min_count=5, workers=4, iter=3)

2019-10-13 20:00:58,845 : INFO : collecting all words and their counts
2019-10-13 20:00:58,847 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-10-13 20:00:58,989 : INFO : PROGRESS: at sentence #10000, processed 617585 words, keeping 40430 word types
2019-10-13 20:00:59,136 : INFO : PROGRESS: at sentence #20000, processed 1221867 words, keeping 59673 word types
2019-10-13 20:00:59,293 : INFO : PROGRESS: at sentence #30000, processed 1816057 words, keeping 74435 word types
2019-10-13 20:00:59,459 : INFO : PROGRESS: at sentence #40000, processed 2432691 words, keeping 87198 word types
2019-10-13 20:00:59,603 : INFO : PROGRESS: at sentence #50000, processed 3031098 words, keeping 98119 word types
2019-10-13 20:00:59,758 : INFO : PROGRESS: at sentence #60000, processed 3659194 words, keeping 108398 word types
2019-10-13 20:00:59,907 : INFO : PROGRESS: at sentence #70000, processed 4268370 words, keeping 117781 word types
2019-10-13 20:01:00,054 : INFO : PROGR

In [0]:
# Does it work? The output of this cell makes it clear: Yes, it works pretty well.

model_toxic.wv.most_similar('idiot')

2019-10-13 20:02:31,331 : INFO : precomputing L2-norms of word weight vectors


[('asshole', 0.8137665390968323),
 ('retard', 0.7840962409973145),
 ('moron', 0.7749398946762085),
 ('dude', 0.7731416821479797),
 ('dumbass', 0.7631232738494873),
 ('jerk', 0.7593957185745239),
 ('cunt', 0.7485070824623108),
 ('wanker', 0.7464522123336792),
 ('twat', 0.7412905693054199),
 ('loser', 0.7397902607917786)]

In [0]:
# something less violent?

model_toxic.wv.most_similar('mother')



[('wife', 0.9304474592208862),
 ('father', 0.9074036478996277),
 ('sister', 0.8720070123672485),
 ('brother', 0.8584585189819336),
 ('daughter', 0.8555762767791748),
 ('grandmother', 0.8510181903839111),
 ('husband', 0.8365557193756104),
 ('dad', 0.830119788646698),
 ('uncle', 0.8284080028533936),
 ('mom', 0.8273847699165344)]
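
The scores returned by `most_similar` are cosine similarities between word vectors. As a rough sketch of what happens under the hood, the plain-Python example below ranks a few hand-made 3-dimensional toy vectors; the numbers and the helper names are invented for illustration and have nothing to do with the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to the query word."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hand-made toy vectors (illustrative only)
toy_vectors = {
    "mother": [0.9, 0.1, 0.3],
    "father": [0.8, 0.2, 0.3],
    "idiot": [-0.2, 0.9, 0.1],
}

ranking = most_similar("mother", toy_vectors)
print(ranking)
```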

In [0]:
model_toxic.save('model.m')

2018-11-03 21:08:23,499 : INFO : saving Word2Vec object under model.m, separately None
2018-11-03 21:08:23,501 : INFO : not storing attribute vectors_norm
2018-11-03 21:08:23,504 : INFO : not storing attribute cum_table
2018-11-03 21:08:24,328 : INFO : saved model.m


Now imagine you have trained a model and would like to update it with new data later. 

You can save your model for later re-use just like that:

```
model_toxic.save('model.m')
```

The key vectors can be used independently of the model, and also in R.
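
One common way to hand vectors to other tools is the plain-text word2vec format, which gensim can write with `model_toxic.wv.save_word2vec_format(...)`. The sketch below writes and reads that layout by hand (a header line with vocabulary size and dimensionality, then one word per line followed by its values) to show why it is easy to parse anywhere. The helper functions and the toy vectors are assumptions made for this example.

```python
import io


def write_word2vec_text(vectors, handle):
    """Write vectors as: '<n_words> <dim>' header, then 'word v1 v2 ...' per line."""
    dim = len(next(iter(vectors.values())))
    handle.write(f"{len(vectors)} {dim}\n")
    for word, vec in vectors.items():
        handle.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")


def read_word2vec_text(handle):
    """Read the same layout back into a dict of word -> list of floats."""
    n_words, dim = map(int, handle.readline().split())
    vectors = {}
    for _ in range(n_words):
        parts = handle.readline().rstrip("\n").split(" ")
        vec = [float(x) for x in parts[1:]]
        if len(vec) != dim:
            raise ValueError(f"expected {dim} values for {parts[0]!r}")
        vectors[parts[0]] = vec
    return vectors


# Toy vectors (illustrative only); an in-memory buffer stands in for a file
toy_export = {"mother": [0.5, -0.25], "father": [0.75, 0.125]}
buffer = io.StringIO()
write_word2vec_text(toy_export, buffer)
buffer.seek(0)
restored = read_word2vec_text(buffer)
print(restored)
```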

To re-train, we need to:
- build some additional vocab (perhaps the new data contains some new words)
- train the model using the ```train``` method (surprise)


In [0]:
# Build new vocab from new data, telling the model that it's an update

# model_toxic.build_vocab(other_data['tokenized'], update=True)

In [0]:
# Re-train (you need to specify total_examples) 
# we only have ~25k tweets... so it shouldn't take long

#model_toxic.train(other_data['tokenized'], total_examples = model_toxic.corpus_count, epochs=2)

### Finally, we can train a fastText model (thanks Facebook for that)

This will be ~4 times slower than Word2Vec (since the algorithm also has to account for all the sub-word information).

In [0]:
model_toxic_fasttext = FastText(all_hate_data_phrases, size=100, window=8, min_count=3, workers=4, iter=5)

2019-10-13 20:02:54,900 : INFO : collecting all words and their counts
2019-10-13 20:02:54,902 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-10-13 20:02:55,038 : INFO : PROGRESS: at sentence #10000, processed 617585 words, keeping 40430 word types
2019-10-13 20:02:55,166 : INFO : PROGRESS: at sentence #20000, processed 1221867 words, keeping 59673 word types
2019-10-13 20:02:55,291 : INFO : PROGRESS: at sentence #30000, processed 1816057 words, keeping 74435 word types
2019-10-13 20:02:55,424 : INFO : PROGRESS: at sentence #40000, processed 2432691 words, keeping 87198 word types
2019-10-13 20:02:55,551 : INFO : PROGRESS: at sentence #50000, processed 3031098 words, keeping 98119 word types
2019-10-13 20:02:55,679 : INFO : PROGRESS: at sentence #60000, processed 3659194 words, keeping 108398 word types
2019-10-13 20:02:55,806 : INFO : PROGRESS: at sentence #70000, processed 4268370 words, keeping 117781 word types
2019-10-13 20:02:55,943 : INFO : PROGR

In [0]:
# Let's see what we get

model_toxic_fasttext.wv.most_similar('idiot', topn=20)



[('idiot_idiot', 0.9367774724960327),
 ('iddiot', 0.9146779775619507),
 ('n_idiot', 0.9089648723602295),
 ('idiota', 0.9089310169219971),
 ('idiom', 0.8440301418304443),
 ('goddamn_idiot', 0.8381199240684509),
 ('idiote', 0.8246064186096191),
 ('idioma', 0.8040664196014404),
 ('idio', 0.79912269115448),
 ('idi', 0.7943286895751953),
 ('idiocy', 0.7876805067062378),
 ('idios', 0.7758983373641968),
 ('an_idiot', 0.7641075849533081),
 ('idiots', 0.7624642848968506),
 ('iot', 0.7354482412338257),
 ('idiotic', 0.7004623413085938),
 ('ediot', 0.6806033849716187),
 ('trollish_boob', 0.6773156523704529),
 ('ignorant_fool', 0.6703081130981445),
 ('fool', 0.6701761484146118)]

As expected, fastText gives us a mix of semantically similar and similarly spelled words.

In [0]:
model_toxic_fasttext.save('model_fasttext.m')

2018-11-04 20:29:11,796 : INFO : saving FastText object under model_fasttext.m, separately None
2018-11-04 20:29:11,798 : INFO : storing np array 'vectors_ngrams' to model_fasttext.m.wv.vectors_ngrams.npy
2018-11-04 20:29:12,174 : INFO : not storing attribute vectors_ngrams_norm
2018-11-04 20:29:12,177 : INFO : not storing attribute vectors_norm
2018-11-04 20:29:12,178 : INFO : not storing attribute vectors_vocab_norm
2018-11-04 20:29:12,180 : INFO : not storing attribute buckets_word
2018-11-04 20:29:12,181 : INFO : storing np array 'vectors_ngrams_lockf' to model_fasttext.m.trainables.vectors_ngrams_lockf.npy
2018-11-04 20:29:14,888 : INFO : saved model_fasttext.m
