<a href="https://colab.research.google.com/github/SDS-AAU/M3-2018/blob/master/notebooks/training_word2vec_text8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training customized word embeddings

Word embeddings became big around 2013 and are linked to [this paper](https://arxiv.org/abs/1301.3781) with the beautiful title 
*Efficient Estimation of Word Representations in Vector Space* by Tomas Mikolov et al., which came out of Google. This was the foundation of Word2Vec.

The idea behind it is most easily summarized by the following quote: 


> *You shall know a word by the company it keeps (Firth, J. R. 1957:11)*

Let me start with a fascinating example of word embeddings in practice. Below, you can see a figure from the paper 
*Dynamic Word Embeddings for Evolving Semantic Discovery*. Here (in simple terms) the researchers estimated word vectors from textual inputs in different time frames. They picked out some terms and people that obviously changed *their company* over the years. Then they looked at the position of these terms relative to terms that did not change much (anchors). If you are interested in this kind of research, check out [this blog](https://blog.acolyer.org/2018/02/22/dynamic-word-embeddings-for-evolving-semantic-discovery/) that describes the paper briefly or the [original paper](https://arxiv.org/abs/1703.00607).

![Figure 1 from Dynamic Word Embeddings for Evolving Semantic Discovery](https://adriancolyer.files.wordpress.com/2018/02/evolving-word-embeddings-fig-1.jpeg)

Word embeddings allow us to create term representations that "learn" meaning from semantic and syntactic features. These models take a sequence of sentences as input and scan the whole corpus for every individual term and all of its occurrences, together with the words that surround each occurrence. Such contextual learning seems to be able to pick up non-trivial conceptual details, and it is this class of models that today enables technologies such as chatbots, machine translation and much more.
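
To make "the company a word keeps" a bit more concrete, here is a minimal sketch of how (target, context) pairs can be collected with a sliding window. The function name `context_pairs` and the window size are assumptions made for illustration; this is not Gensim's internal code, just the idea behind it.

```python
def context_pairs(tokens, window=2):
    """Yield (target, context) pairs for each token and its neighbours within `window`."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield target, tokens[j]


sentence = "you shall know a word by the company it keeps".split()
pairs = list(context_pairs(sentence, window=2))

# The first few pairs a model would "see" for this sentence
print(pairs[:4])
```

A model like Word2Vec is trained on a very large number of such pairs, so words that keep showing up with the same neighbours end up with similar vectors.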

The early word embedding models were Word2Vec and [GloVe](https://nlp.stanford.edu/projects/glove/).
In December 2017 Facebook presented [fastText](https://fasttext.cc/) (by the way, by 2017 Tomas Mikolov was working for Facebook and is one of the authors of the [paper](https://arxiv.org/abs/1607.04606) that introduces the research behind fastText). This model extends the idea of Word2Vec by enriching the vectors with information from sub-word elements. What does that mean? Words are defined not only by their surrounding words but also by the character n-grams (short sub-word fragments) that make up the word. Why should that be a good idea? Well, now words such as *apple* and *apples* get similar vectors not only because they often share context but also because they are composed of the same sub-word elements. This comes in particularly handy when we are dealing with languages that have a rich morphology, such as Turkish or Russian. It is also great when working with web text, which is often messy and misspelt.
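
As a rough illustration of the sub-word idea, the sketch below splits a word into character n-grams with `<` and `>` as boundary markers. The helper `char_ngrams` is hypothetical, and the n-gram lengths 3 to 6 are an assumption based on the range described in the fastText paper; the real library additionally hashes the n-grams into buckets, which is left out here.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of `word`, padded with boundary markers."""
    padded = f"<{word}>"
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }


apple = char_ngrams("apple")
apples = char_ngrams("apples")
shared = apple & apples

print(sorted(char_ngrams("apple", 3, 3)))  # only the 3-grams
print(len(shared), "of", len(apple), "n-grams of 'apple' also occur in 'apples'")
```

Because most of the n-grams are shared, the two words start out with largely the same building blocks, even before any context is taken into account.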

The current state of the art (April 2018!) is ELMo (Embeddings from Language Models), which further tackles the problem of contextuality and particularly polysemy, i.e. the same term meaning something else in a different context. 

You can read more about the ins and outs of the current state of embedding models [here](https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a).

Now the good news: you will find pre-trained vectors from all the mentioned models online. They will do great in most cases. However, when working on specific tasks, for example with some obscure languages and/or specific technical jargon (finance talk), it is nice to know how to train such word vectors yourself.

In this tutorial and on M3 we will not go further than fastText (2017-state-of-the-art should be good enough for us – sorry). You are more than welcome to use other, more sophisticated, embeddings.


In this tutorial we will train three embedding models:

- Word2Vec on text8 - a sample of English Wikipedia
- Word2Vec on the hate speech and toxic comments data
- fastText on the toxic comments data

Once trained, we will store the models.


In [0]:
%%capture
# %%capture suppresses lots of output; as a cell magic it has to be the first line of the cell

# Let's start by installing Gensim that we know and love since M2
!pip install gensim

In [0]:
# import pandas for tabular data
import pandas as pd


# import gensim and the Word2Vec as well as FastText models
import gensim
from gensim.models import Word2Vec, FastText

In [0]:
%%capture
# Let's download all needed data (from different sources)

!wget http://mattmahoney.net/dc/text8.zip
!wget https://github.com/t-davidson/hate-speech-and-offensive-language/raw/master/data/labeled_data.csv
!wget http://sds-datacrunch.aau.dk/public/all.zip

In [0]:
%%capture
# Also, we need to unzip text8 and our toxic-comments data
# (-o overwrites existing files without prompting, so the cell can be re-run)

!unzip -o text8.zip
!unzip -o all.zip

In [0]:
# We import logging to get informative outputs from Gensim training

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Training Word2Vec on text8

In the following we will train the model step by step. In the future, you can do things much faster.

- We first use the text8 wrapper to efficiently read the text8 file from disk. 
- Then we instantiate a Word2Vec model (with some parameters)
- We build the vocabulary
- Finally, we train it

Once done, we can play a bit around with it.

In [0]:
# Connect to the text8 file on disk. Gensim has a number of wrappers for other filetypes and sources.
# Obviously, for a many-gigabyte Wikipedia dump, you wouldn't load the whole thing into memory
# but rather read it in piece by piece from disk.

text8 = gensim.models.word2vec.Text8Corpus('text8', max_sentence_length=10000)
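
The text8 file is one long run of words without sentence breaks, so the wrapper has to cut it into pieces itself. Here is a minimal sketch of that idea, assuming a plain whitespace-separated stream. The generator `fixed_length_sentences` and its `read_size` parameter are made up for illustration and are not Gensim's implementation.

```python
import io


def fixed_length_sentences(stream, max_sentence_length=10000, read_size=8192):
    """Yield lists of at most `max_sentence_length` tokens without loading the whole file."""
    sentence, rest = [], ""
    while True:
        chunk = stream.read(read_size)
        if not chunk:
            break
        text = rest + chunk
        words = text.split()
        # The last word may be cut off by the read boundary; carry it over to the next round.
        if words and not text[-1].isspace():
            rest = words.pop()
        else:
            rest = ""
        for word in words:
            sentence.append(word)
            if len(sentence) == max_sentence_length:
                yield sentence
                sentence = []
    if rest:
        sentence.append(rest)
    if sentence:
        yield sentence


# Tiny stand-in for the text8 file: 25 "words", cut into pseudo-sentences of 10
toy_stream = io.StringIO(" ".join(f"w{i}" for i in range(25)))
chunks = list(fixed_length_sentences(toy_stream, max_sentence_length=10, read_size=7))
print([len(c) for c in chunks])

# With 17005207 words and chunks of 10000 we would expect this many "sentences":
expected_sentences = -(-17005207 // 10000)  # ceiling division
print(expected_sentences)
```

The second number is the sentence count you should see in the vocabulary-building log for text8.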

In [0]:
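# We instantiate the model. Words that appear fewer than 5 times are kicked out.
# We will train over 3 iterations.
# Note: this notebook uses the Gensim 3.x argument names; Gensim 4 renamed iter to epochs (and size to vector_size).

model_text8 = Word2Vec(iter=3, min_count=5)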

In [48]:
# Let's build the vocabulary

model_text8.build_vocab(text8)

2018-11-03 20:39:08,094 : INFO : collecting all words and their counts
2018-11-03 20:39:08,100 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-11-03 20:39:13,701 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2018-11-03 20:39:13,702 : INFO : Loading a fresh vocabulary
2018-11-03 20:39:14,447 : INFO : effective_min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2018-11-03 20:39:14,449 : INFO : effective_min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2018-11-03 20:39:14,684 : INFO : deleting the raw counts dictionary of 253854 items
2018-11-03 20:39:14,697 : INFO : sample=0.001 downsamples 38 most-common words
2018-11-03 20:39:14,700 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2018-11-03 20:39:15,013 : INFO : estimated required memory for 71290 words and 100 dimensions: 92677000 bytes
2018-11-03 20:39:15,014 : 
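
The memory estimate in the log can be reproduced with a bit of arithmetic. The sketch below assumes two float32 matrices (the word vectors and the output layer used for negative sampling) plus roughly 500 bytes of bookkeeping per vocabulary entry. The 500-byte figure is an assumption that happens to reproduce the logged number; treat it as a rule of thumb rather than a documented constant.

```python
vocab_size = 71290       # words retained after min_count=5 (from the log)
vector_size = 100        # dimensions (from the log)
bytes_per_float = 4      # float32
vocab_overhead = 500     # assumed bytes of bookkeeping per vocabulary entry

input_vectors = vocab_size * vector_size * bytes_per_float
output_weights = vocab_size * vector_size * bytes_per_float  # negative-sampling layer
bookkeeping = vocab_size * vocab_overhead

total_bytes = input_vectors + output_weights + bookkeeping
print(total_bytes)  # compare with the "estimated required memory" line
```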

In [52]:
# Now we can start the training

model_text8.train(text8, total_examples=model_text8.corpus_count, epochs=3)

2018-11-03 20:41:11,590 : INFO : training model with 3 workers on 71290 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2018-11-03 20:41:12,609 : INFO : EPOCH 1 - PROGRESS: at 4.06% examples, 502655 words/s, in_qsize 5, out_qsize 0
2018-11-03 20:41:13,635 : INFO : EPOCH 1 - PROGRESS: at 8.29% examples, 505036 words/s, in_qsize 4, out_qsize 1
2018-11-03 20:41:14,658 : INFO : EPOCH 1 - PROGRESS: at 12.11% examples, 491411 words/s, in_qsize 5, out_qsize 0
2018-11-03 20:41:15,679 : INFO : EPOCH 1 - PROGRESS: at 15.87% examples, 483510 words/s, in_qsize 4, out_qsize 1
2018-11-03 20:41:16,683 : INFO : EPOCH 1 - PROGRESS: at 19.69% examples, 481828 words/s, in_qsize 5, out_qsize 0
2018-11-03 20:41:17,690 : INFO : EPOCH 1 - PROGRESS: at 23.34% examples, 477614 words/s, in_qsize 5, out_qsize 0
2018-11-03 20:41:18,716 : INFO : EPOCH 1 - PROGRESS: at 27.16% examples, 476481 words/s, in_qsize 5, out_qsize 0
2018-11-03 20:41:19,738 : INFO : EPOCH 1 - PROGRESS: at 30.86

(37514496, 51015621)
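
The tuple printed at the end can be checked against the earlier logs. As far as I know, `train` returns the number of effective words that were actually trained on and the number of raw words it saw; the sketch below only verifies that this reading is consistent with the numbers above.

```python
raw_words_per_epoch = 17005207            # "raw words" from the build_vocab log
estimated_effective_per_epoch = 12506280  # estimate after min_count and downsampling
epochs = 3

reported_effective, reported_raw = 37514496, 51015621  # the tuple returned by train()

raw_total = raw_words_per_epoch * epochs
estimated_effective_total = estimated_effective_per_epoch * epochs
relative_gap = abs(estimated_effective_total - reported_effective) / estimated_effective_total

print(raw_total == reported_raw)
print(f"{relative_gap:.4%}")
```

The raw count matches exactly, while the effective count differs slightly from the estimate because downsampling of frequent words is random.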

In [55]:
# Training is done and we can play around a bit

# ask the model for most similar word to some term
model_text8.wv.most_similar("milk")

2018-11-03 20:43:18,152 : INFO : precomputing L2-norms of word weight vectors


[('meat', 0.9049235582351685),
 ('fruit', 0.904716432094574),
 ('honey', 0.8918293714523315),
 ('beans', 0.8837248086929321),
 ('sugar', 0.8805627822875977),
 ('vegetables', 0.8805302381515503),
 ('beef', 0.8631115555763245),
 ('vegetable', 0.8613918423652649),
 ('drinks', 0.8598768711090088),
 ('chocolate', 0.8590322732925415)]
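
The scores next to each word are cosine similarities between word vectors. Here is a small sketch of that ranking on hand-made three-dimensional vectors. The toy vectors and the helper names are invented for illustration; the real model uses 100 learned dimensions.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


toy_vectors = {
    "milk":  [0.9, 0.1, 0.0],
    "honey": [0.8, 0.2, 0.1],
    "meat":  [0.7, 0.3, 0.0],
    "bus":   [0.0, 0.1, 0.9],
}

ranking = most_similar("milk", toy_vectors)
print(ranking)
```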

In [58]:
# Do some algebra exercises

model_text8.wv.most_similar(positive=['germany', 'paris'], negative=['france'])


[('berlin', 0.8118886351585388),
 ('vienna', 0.7582015991210938),
 ('munich', 0.7417254447937012),
 ('moscow', 0.665905237197876),
 ('leipzig', 0.6623905897140503),
 ('milan', 0.6177572011947632),
 ('hamburg', 0.6133787631988525),
 ('frankfurt', 0.6034635305404663),
 ('bologna', 0.5932918190956116),
 ('bonn', 0.5911003351211548)]
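
The "algebra" works roughly like this: add the (length-normalised) vectors of the positive words, subtract those of the negative words, and look for the nearest remaining word. The sketch below does that on invented toy vectors; the function `analogy` is a simplified stand-in, not Gensim's code.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(positive, negative, vectors, topn=3):
    """Rank words by closeness to sum(positive) - sum(negative), skipping the input words."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy space: axis 0 = "French", axis 1 = "German", axis 2 = "capital city"
toy_vectors = {
    "france":  [1.0, 0.0, 0.0],
    "germany": [0.0, 1.0, 0.0],
    "paris":   [1.0, 0.0, 1.0],
    "berlin":  [0.0, 1.0, 1.0],
    "vienna":  [0.0, 0.3, 1.0],
    "milk":    [0.5, 0.5, 0.0],
}

result = analogy(positive=["germany", "paris"], negative=["france"], vectors=toy_vectors)
print(result)
```

In the toy space the answer is clean by construction. In a trained model the axes are not interpretable like this, which is why near misses such as other large cities show up in the real output.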

In [59]:
model_text8.wv.most_similar(positive=['japan', 'mercedes'], negative=['germany'])


[('honda', 0.7667422294616699),
 ('sony', 0.7596989870071411),
 ('jeep', 0.7313293814659119),
 ('toyota', 0.7283535003662109),
 ('nintendo', 0.7144696712493896),
 ('mitsubishi', 0.7099435329437256),
 ('bmw', 0.7031159996986389),
 ('mazda', 0.6964906454086304),
 ('jaguar', 0.6929983496665955),
 ('motorcycle', 0.6928412914276123)]

In [68]:
# and more

print(model_text8.wv.doesnt_match(['dog','cat','chicken','hamster']))
print()
print(model_text8.wv.doesnt_match(['bus','street','honey','house']))
print()
print(model_text8.wv.doesnt_match(['gin','vodka','beer','whiskey']))

chicken

honey

beer
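
How does the model pick the odd one out? One simple way, and to my understanding roughly what happens under the hood, is to average the normalised vectors of all words and return the word that is least similar to that average. A sketch on invented two-dimensional vectors:

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def doesnt_match(words, vectors):
    """Return the word whose vector is least similar to the mean of all given words."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Toy space: axis 0 = "spirit", axis 1 = "brewed"
toy_vectors = {
    "gin":     [1.0, 0.1],
    "vodka":   [0.9, 0.2],
    "whiskey": [1.0, 0.0],
    "beer":    [0.3, 0.9],
}

odd_one = doesnt_match(["gin", "vodka", "beer", "whiskey"], toy_vectors)
print(odd_one)
```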



## Training Word2Vec on a specific corpus (from memory)

In the following we will first train the model on the toxic-comments data from [this Kaggle challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

Then, we will re-train the model with more data (yes, that's possible, since the Word2Vec model is actually a shallow neural network). Here we will use the hate-speech classification tweets.

In [0]:
# Load up the toxic comments data

toxic1 = pd.read_csv('train.csv')
toxic2 = pd.read_csv('test.csv')

In [0]:
# Load up the hatespeech tweets

hatespeech_tweets = pd.read_csv('labeled_data.csv')

### some light preprocessing

We will use Twitter-style preprocessing to get rid of typical online-text issues.

In [0]:
# Import tweet-tokenizer

from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()

In [0]:
# Preprocess the toxic comments text:
# just tokenizing, lowercasing and removing everything that is not a word.
# Given 2 files, we are doing it 2 times.

toxic1['tokenized'] = toxic1['comment_text'].map(lambda t: tknzr.tokenize(t))
toxic1['tokenized'] = toxic1['tokenized'].map(lambda t: [x.lower() for x in t if x.isalpha()])

toxic2['tokenized'] = toxic2['comment_text'].map(lambda t: tknzr.tokenize(t))
toxic2['tokenized'] = toxic2['tokenized'].map(lambda t: [x.lower() for x in t if x.isalpha()])

In [0]:
# Finally, we can just concatenate the tokenized column of both dataframes as our training data input

toxic_data = pd.concat([toxic1['tokenized'],toxic2['tokenized']])

In [0]:
# Same preprocessing for the hatespeech tweets.

hatespeech_tweets['tokenized'] = hatespeech_tweets['tweet'].map(lambda t: tknzr.tokenize(t))
hatespeech_tweets['tokenized'] = hatespeech_tweets['tokenized'].map(lambda t: [x.strip('#') for x in t])
hatespeech_tweets['tokenized'] = hatespeech_tweets['tokenized'].map(lambda t: [x.lower() for x in t if x.isalpha()])
hatespeech_tweets['tokenized'] = hatespeech_tweets['tokenized'].map(lambda t: [x for x in t if x !='rt'])
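
To see what these filtering steps keep and drop, here is the same chain applied to one hand-written token list. The list merely stands in for tokenizer output and the function name `clean_tokens` is made up for this sketch.

```python
def clean_tokens(tokens):
    """Strip '#', keep purely alphabetic tokens in lowercase, and drop the retweet marker."""
    tokens = [t.strip("#") for t in tokens]
    tokens = [t.lower() for t in tokens if t.isalpha()]
    return [t for t in tokens if t != "rt"]


raw_tokens = ["RT", "@user", ":", "Don't", "feed", "the", "#Trolls", "!!!", "2", "much"]
cleaned = clean_tokens(raw_tokens)
print(cleaned)
```

Note one side effect: tokens with an apostrophe such as "Don't" are not purely alphabetic, so contractions get dropped along with punctuation, numbers and mentions.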

### DONE ####

In [0]:
# Let's bring all of the training data together
all_hate_data = pd.concat([toxic_data, hatespeech_tweets['tokenized']])

#### Bonus material: Identification of bi-grams & co:

In [0]:
from gensim.models.phrases import Phrases, Phraser

In [47]:
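# Learn which neighbouring word pairs form a phrase:
# a pair must occur at least 5 times and reach a score above the threshold of 10
phrases = Phrases(all_hate_data, min_count=5, threshold=10)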

2018-11-07 09:25:38,028 : INFO : collecting all words and their counts
2018-11-07 09:25:38,049 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2018-11-07 09:25:39,616 : INFO : PROGRESS: at sentence #10000, processed 655155 words and 285005 word types
2018-11-07 09:25:41,238 : INFO : PROGRESS: at sentence #20000, processed 1297718 words and 477817 word types
2018-11-07 09:25:42,844 : INFO : PROGRESS: at sentence #30000, processed 1928609 words and 642497 word types
2018-11-07 09:25:44,497 : INFO : PROGRESS: at sentence #40000, processed 2584346 words and 800472 word types
2018-11-07 09:25:46,018 : INFO : PROGRESS: at sentence #50000, processed 3219612 words and 941861 word types
2018-11-07 09:25:47,617 : INFO : PROGRESS: at sentence #60000, processed 3887527 words and 1080208 word types
2018-11-07 09:25:49,179 : INFO : PROGRESS: at sentence #70000, processed 4534434 words and 1210177 word types
2018-11-07 09:25:50,752 : INFO : PROGRESS: at sentence #80000, processe

In [48]:
# Freeze the learned statistics into a lighter Phraser object that we can apply to token lists
bigram = Phraser(phrases)

2018-11-07 09:27:34,632 : INFO : source_vocab length 3825076
2018-11-07 09:28:22,622 : INFO : Phraser built with 27006 phrasegrams
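
What makes a word pair a "phrase"? The sketch below uses the scoring formula from the original Word2Vec phrase paper, which I believe is also the default in Gensim: (count(a b) - min_count) / (count(a) * count(b)) * vocabulary size. The toy corpus, the function name `bigram_scores` and the use of the unigram vocabulary size are assumptions made to keep the example small.

```python
from collections import Counter


def bigram_scores(sentences, min_count=1):
    """Score every adjacent word pair; higher means the pair co-occurs more than chance suggests."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)  # simplification: single words only
    return {
        (a, b): (count - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        for (a, b), count in bigrams.items()
    }


toy_corpus = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "is", "far"],
    ["love", "is", "big"],
]

scores = bigram_scores(toy_corpus, min_count=1)
threshold = 1.2  # hypothetical cut-off for this toy corpus
found = [pair for pair, score in scores.items() if score > threshold]
print(found)
```

Pairs that pass the threshold are the ones that would be glued together with an underscore, like `complain_about` in the output further down.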


In [0]:
all_hate_data_phrases = all_hate_data.map(lambda t: bigram[t])

In [0]:
hatespeech_tweets['tokenized_phrases'] = hatespeech_tweets['tokenized'].map(lambda t: bigram[t])

In [51]:
hatespeech_tweets['tokenized_phrases']

0        [as, a, woman, you, complain_about, cleaning_u...
1        [boy, dats, cold, tyga, dwn, bad, for, cuffin,...
2        [dawg, you, ever, fuck, a, bitch, and, she, st...
3                             [she, look, like, a, tranny]
4        [the, shit, you, hear, about, me, might, be, t...
5        [the, shit, just, blows, me, claim, you, so, f...
6        [i, can, not, just, sit, up, and, hate, on, an...
7        [cause, tired, of, you, big, bitches, coming, ...
8        [you, might, not, get, ya, bitch, back, thats,...
9              [hobbies, include, fighting, mariam, bitch]
10       [keeks, is, a, bitch, she, curves, everyone, l...
11                   [murda, gang, bitch, its, gang, land]
12       [so, hoes, that, smoke, are, losers, yea, go, ...
13       [bad_bitches, is, the, only, thing, that, i, l...
14                               [bitch, get, up, off, me]
15                      [bitch, nigga, miss, me, with, it]
16                                  [bitch, plz, whateve

### Model training in one line

We can actually instantiate, build vocabulary and train in one line only. Isn't that great?

In [26]:
# We can instantiate and train the model in one line. Just pass the input data (sequence of token-lists),
# specify target dimensionality (size), the window around the target term, minimum count, and number of iterations/epochs.
# Workers are optional (for multiprocessing)

model_toxic = Word2Vec(toxic_data, size=100, window=5, min_count=5, workers=4, iter=3)

2018-11-04 21:53:52,371 : INFO : collecting all words and their counts
2018-11-04 21:53:52,496 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-11-04 21:53:52,677 : INFO : PROGRESS: at sentence #10000, processed 655155 words, keeping 33301 word types
2018-11-04 21:53:52,863 : INFO : PROGRESS: at sentence #20000, processed 1297718 words, keeping 49193 word types
2018-11-04 21:53:53,049 : INFO : PROGRESS: at sentence #30000, processed 1928609 words, keeping 61707 word types
2018-11-04 21:53:53,241 : INFO : PROGRESS: at sentence #40000, processed 2584346 words, keeping 72823 word types
2018-11-04 21:53:53,426 : INFO : PROGRESS: at sentence #50000, processed 3219612 words, keeping 82595 word types
2018-11-04 21:53:53,622 : INFO : PROGRESS: at sentence #60000, processed 3887527 words, keeping 91889 word types
2018-11-04 21:53:53,816 : INFO : PROGRESS: at sentence #70000, processed 4534434 words, keeping 100497 word types
2018-11-04 21:53:54,008 : INFO : PROGRE

In [81]:
# Does it work? The output of this cell makes it clear: Yes, it works pretty well.

model_toxic.wv.most_similar('idiot')

2018-11-03 21:06:12,882 : INFO : precomputing L2-norms of word weight vectors


[('asshole', 0.8236082196235657),
 ('moron', 0.7038889527320862),
 ('imbecile', 0.6953639388084412),
 ('fool', 0.6788172721862793),
 ('loser', 0.6326109170913696),
 ('illiterate', 0.6308602094650269),
 ('ignorant', 0.6303929090499878),
 ('hypocrite', 0.6218823194503784),
 ('retard', 0.6214895248413086),
 ('insult', 0.6125708818435669)]

In [82]:
# something less violent?

model_toxic.wv.most_similar('mother')


[('father', 0.8239043354988098),
 ('wife', 0.8019334077835083),
 ('grandmother', 0.7894612550735474),
 ('mom', 0.7853889465332031),
 ('sister', 0.7848141193389893),
 ('dad', 0.7723446488380432),
 ('tongue', 0.7573986649513245),
 ('daughter', 0.7491512298583984),
 ('boyfriend', 0.745460033416748),
 ('parents', 0.7441803812980652)]

In [83]:
model_toxic.save('model.m')

2018-11-03 21:08:23,499 : INFO : saving Word2Vec object under model.m, separately None
2018-11-03 21:08:23,501 : INFO : not storing attribute vectors_norm
2018-11-03 21:08:23,504 : INFO : not storing attribute cum_table
2018-11-03 21:08:24,328 : INFO : saved model.m


Now imagine you have trained a model and would like to update it with new data later. 

You can save your model for later re-use just like that:

```
model_toxic.save('model.m')
```

To re-train, we need to 
- build some additional vocab (perhaps the new data contains some new words)
- train the model using the ```train``` method. (surprise)


In [84]:
# Build new vocab from new data, telling the model that it's an update

model_toxic.build_vocab(hatespeech_tweets['tokenized'], update=True)

2018-11-03 21:11:09,890 : INFO : collecting all words and their counts
2018-11-03 21:11:09,893 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-11-03 21:11:09,927 : INFO : PROGRESS: at sentence #10000, processed 111788 words, keeping 10979 word types
2018-11-03 21:11:09,968 : INFO : PROGRESS: at sentence #20000, processed 247332 words, keeping 16372 word types
2018-11-03 21:11:09,986 : INFO : collected 18336 word types from a corpus of 305428 raw words and 24783 sentences
2018-11-03 21:11:09,988 : INFO : Updating model with new vocabulary
2018-11-03 21:11:10,007 : INFO : New added 4191 unique words (18% of original 22527) and increased the count of 4191 pre-existing words (18% of original 22527)
2018-11-03 21:11:10,038 : INFO : deleting the raw counts dictionary of 18336 items
2018-11-03 21:11:10,040 : INFO : sample=0.001 downsamples 124 most-common words
2018-11-03 21:11:10,041 : INFO : downsampling leaves estimated 408334 word corpus (144.1% of prior 28

In [85]:
# Re-train (you need to specify total_examples) 
# We only have ~25k tweets, so it shouldn't take long

model_toxic.train(hatespeech_tweets['tokenized'], total_examples = model_toxic.corpus_count, epochs=2)

2018-11-03 21:12:38,382 : INFO : training model with 4 workers on 62589 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2018-11-03 21:12:38,792 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-11-03 21:12:38,793 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-11-03 21:12:38,808 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-11-03 21:12:38,815 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-11-03 21:12:38,816 : INFO : EPOCH - 1 : training on 305428 raw words (219843 effective words) took 0.4s, 524751 effective words/s
2018-11-03 21:12:39,207 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-11-03 21:12:39,220 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-11-03 21:12:39,225 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-11-03 21:12:39,227 : INFO : worker thread finished; awaiting finish of 0 

(439659, 610856)

In [86]:
# Any new linguistic insights?

model_toxic.wv.most_similar('idiot')

# Some of the words changed place but other than that...no big changes

2018-11-03 21:13:46,429 : INFO : precomputing L2-norms of word weight vectors


[('asshole', 0.8278401494026184),
 ('moron', 0.7025822401046753),
 ('imbecile', 0.7006790637969971),
 ('fool', 0.6777839660644531),
 ('illiterate', 0.6400813460350037),
 ('loser', 0.6324371099472046),
 ('ignorant', 0.6284880042076111),
 ('retard', 0.628136396408081),
 ('hypocrite', 0.6253594756126404),
 ('insult', 0.6170707941055298)]

### Finally, we can train a fastText model (thanks Facebook for that)

This will be roughly 4 times slower than Word2Vec, since the algorithm also has to learn vectors for all the sub-word n-grams.

In [14]:
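# Same one-line pattern as for Word2Vec, this time on all the data,
# with a wider window (8), a lower minimum count (3) and 5 iterations/epochs
model_toxic_fasttext = FastText(all_hate_data, size=100, window=8, min_count=3, workers=4, iter=5)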

2018-11-04 20:13:37,648 : INFO : collecting all words and their counts
2018-11-04 20:13:37,669 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-11-04 20:13:37,827 : INFO : PROGRESS: at sentence #10000, processed 655155 words, keeping 33301 word types
2018-11-04 20:13:38,002 : INFO : PROGRESS: at sentence #20000, processed 1297718 words, keeping 49193 word types
2018-11-04 20:13:38,171 : INFO : PROGRESS: at sentence #30000, processed 1928609 words, keeping 61707 word types
2018-11-04 20:13:38,355 : INFO : PROGRESS: at sentence #40000, processed 2584346 words, keeping 72823 word types
2018-11-04 20:13:38,537 : INFO : PROGRESS: at sentence #50000, processed 3219612 words, keeping 82595 word types
2018-11-04 20:13:38,730 : INFO : PROGRESS: at sentence #60000, processed 3887527 words, keeping 91889 word types
2018-11-04 20:13:38,915 : INFO : PROGRESS: at sentence #70000, processed 4534434 words, keeping 100497 word types
2018-11-04 20:13:39,098 : INFO : PROGRE

In [15]:
# Let's see what we get

model_toxic_fasttext.wv.most_similar('idiot', topn=20)

2018-11-04 20:27:33,310 : INFO : precomputing L2-norms of word weight vectors
2018-11-04 20:27:33,440 : INFO : precomputing L2-norms of ngram weight vectors


[('iddiot', 0.9258104562759399),
 ('idiota', 0.89719557762146),
 ('idiom', 0.8744869232177734),
 ('idiote', 0.8390666246414185),
 ('idioma', 0.8310518264770508),
 ('idiocy', 0.7922382354736328),
 ('idio', 0.7896774411201477),
 ('idi', 0.7786841988563538),
 ('iot', 0.7639410495758057),
 ('idios', 0.7508178353309631),
 ('asshole', 0.7388148903846741),
 ('idiotic', 0.7024364471435547),
 ('idiots', 0.6875221729278564),
 ('ediot', 0.677729606628418),
 ('assh', 0.6772303581237793),
 ('idm', 0.6770260334014893),
 ('idw', 0.6714885234832764),
 ('idiotarian', 0.6597433090209961),
 ('idoit', 0.6562399864196777),
 ('idc', 0.6494848728179932)]

As expected, fastText gives us a mix of semantically similar words and words that are simply spelled similarly (including plenty of misspellings).
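
Why do the look-alikes rank so high? They share many character n-grams with the query word. The sketch below measures that overlap with a Jaccard score. This is only an illustration: fastText does not compute a Jaccard score, and the helper names and the n-gram range 3 to 6 are assumptions.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of `word`, padded with boundary markers."""
    padded = f"<{word}>"
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }


def ngram_overlap(a, b):
    """Jaccard overlap of the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


candidates = ["iddiot", "idiom", "idiots", "asshole", "moron"]
by_spelling = sorted(candidates, key=lambda w: ngram_overlap("idiot", w), reverse=True)

for word in by_spelling:
    print(word, round(ngram_overlap("idiot", word), 2))
```

Words like *asshole* share no n-grams with *idiot* at all, so they can only end up close through shared context, which is the Word2Vec part of the model.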

In [16]:
model_toxic_fasttext.save('model_fasttext.m')

2018-11-04 20:29:11,796 : INFO : saving FastText object under model_fasttext.m, separately None
2018-11-04 20:29:11,798 : INFO : storing np array 'vectors_ngrams' to model_fasttext.m.wv.vectors_ngrams.npy
2018-11-04 20:29:12,174 : INFO : not storing attribute vectors_ngrams_norm
2018-11-04 20:29:12,177 : INFO : not storing attribute vectors_norm
2018-11-04 20:29:12,178 : INFO : not storing attribute vectors_vocab_norm
2018-11-04 20:29:12,180 : INFO : not storing attribute buckets_word
2018-11-04 20:29:12,181 : INFO : storing np array 'vectors_ngrams_lockf' to model_fasttext.m.trainables.vectors_ngrams_lockf.npy
2018-11-04 20:29:14,888 : INFO : saved model_fasttext.m


In [0]:
hatespeech_tweets.to_pickle('hatespeech.p')

In [25]:
!curl --upload-file hatespeech.p https://transfer.sh/hatespeech.p

https://transfer.sh/oG2hb/hatespeech.p