<a href="https://colab.research.google.com/github/FreemindTrader/nlp-in-practice/blob/master/Chapter_6_fastText_Training_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
-----------------------

This notebook demonstrates training a word2vec model **with subword information** (fastText) on the Wikipedia Attack Comments dataset. We'll look at how the training time and memory requirements compare with plain word2vec, as well as the quality of the resulting vectors.

# Download & Parse Dataset
------------------

We'll use:

* `wget` to download the dataset file.
* `pandas` to parse the dataset `.tsv` file.
* The `gensim` function `gensim.utils.simple_preprocess` for tokenizing the sentences.

We'll need to install wget first.

In [None]:
!pip install wget

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-cp36-none-any.whl size=9682 sha256=767b5d2835f689eb9318018aa4aedc3769d70eebc5740222016a333c89c505af
  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


Now we can download the dataset text file.

In [None]:
import wget
import os

# Create the data subdirectory if not there.
if not os.path.exists('./data/'):
    os.mkdir('./data/')

filename = './data/attack_annotated_comments.tsv'

# Skip the download if we already have it!
if not os.path.exists(filename):

    # URL for the TSV file (~55.4MB) containing the wikipedia comments.
    url = 'https://ndownloader.figshare.com/files/7554634'

    # Download the dataset.
    print('Downloading Wikipedia Attack Comments dataset (~55.4MB)...')
    wget.download(url, filename)

    print('  DONE.')

# We won't use these, but FYI, this is the file containing the labels
# for the comments.
#   url = 'https://ndownloader.figshare.com/files/7554637'
#   filename = './data/attack_annotations.tsv'

Downloading Wikipedia Attack Comments dataset (~55.4MB)...
  DONE.


## Parse the dataset file
--------------------------------
We'll use `pandas` just to help us parse the tab-separated `.tsv` file.


In [None]:
import pandas as pd

print('Parsing the dataset .tsv file...')
comments = pd.read_csv('./data/attack_annotated_comments.tsv', sep = '\t')

print('    Done.')


Parsing the dataset .tsv file...
    Done.


## Tokenize the comments
------------------------------------
This dataset uses the special labels "NEWLINE_TOKEN" and "TAB_TOKEN" to represent the newline and tab characters. We'll replace these with a single space.

In [None]:
# remove newline and tab tokens
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))

Next, we'll use gensim's simple tokenizer to turn each comment into a list of words.
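
To see why the placeholder tokens need to go before tokenizing, here is a rough, standard-library approximation of what `simple_preprocess` does with its default settings (lowercase, keep alphabetic tokens of 2 to 15 characters). The function name `approx_simple_preprocess` and the regex are illustrative assumptions, not gensim's actual code, and edge cases such as accented characters may behave differently.

```python
import re

# Runs of word characters that are not digits. This pattern is an
# assumption made for illustration, not copied from gensim.
_ALPHA = re.compile(r"(?:(?!\d)\w)+", re.UNICODE)

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase, split into alphabetic tokens, drop very short/long ones."""
    tokens = _ALPHA.findall(text.lower())
    return [
        t for t in tokens
        if min_len <= len(t) <= max_len and not t.startswith("_")
    ]

raw = "Thanks. NEWLINE_TOKEN See you at 5!"
cleaned = raw.replace("NEWLINE_TOKEN", " ").replace("TAB_TOKEN", " ")

print(approx_simple_preprocess(raw))      # placeholder survives as a word
print(approx_simple_preprocess(cleaned))  # placeholder is gone
```

In this approximation, skipping the replacement leaves `newline_token` in the token list, where it would be counted as an ordinary vocabulary word.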

In [None]:
%%time

import gensim
import io

print('Tokenizing comments...')

# Track the total number of tokens in the dataset.
num_tokens = 0

# List of sentences to use for training.
sentences = []

# For each comment...
for i, row in comments.iterrows():

    # Report progress.
    if ((i % 20000) == 0):
        print('  Read {:,} comments.'.format(i))

    # Tokenize the comment. This returns a list of words.
    parsed = gensim.utils.simple_preprocess(row.comment)

    # Accumulate the total number of words in the dataset.
    num_tokens += len(parsed)

    # Add the comment to the list.
    sentences.append(parsed)

print('DONE.')
print('')
print('{:>10,} comments'.format(len(sentences)))
print('{:>10,} tokens'.format(num_tokens))
print('{:>10,} avg. tokens / comment'.format(int(num_tokens / len(sentences))))
print('')

Tokenizing comments...
  Read 0 comments.
  Read 20,000 comments.
  Read 40,000 comments.
  Read 60,000 comments.
  Read 80,000 comments.
  Read 100,000 comments.
DONE.

   115,864 comments
 7,651,029 tokens
        66 avg. tokens / comment

CPU times: user 22.9 s, sys: 516 ms, total: 23.5 s
Wall time: 23.8 s


# Training
----------------


Time to train the model!


## Configure Logging
-----------------------------



`gensim` provides some valuable information about the training process using the `logging` module in Python.

To see this log output, we first need to set up logging.


In [None]:
import logging

# Enable logging at the `INFO` level and set a custom format--the
# default log format is pretty wordy.
logging.basicConfig(
    format='%(asctime)s : %(message)s', # Display just time and message.
    datefmt='%H:%M:%S', # Display time, but not the date.
    level=logging.INFO)


The following settings will eliminate some unhelpful warnings from the remainder of this notebook.

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

## Set Model Parameters
----------------------------------

We define all of the parameters for our model upfront. Take a look at the code comments for each parameter below.

Also, for reference, here is the [documentation](https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText.trainables) for the `gensim.models.FastText` constructor, and the [source code](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/fasttext.py#L468) on GitHub.

In [None]:
import gensim

model = gensim.models.FastText(
    sentences=None, # Don't provide the sentences yet, otherwise
                    # it will kick off the training automatically.

    size=100,    # Number of features in each word vector
                 #   (renamed to `vector_size` in gensim 4.x)

    window=10,   # Context window size (in each direction)
                 #   Default is 5

    min_count=2, # Words must appear this many times to be in vocab.
                 #   Default is 5

    workers=10,  # Training thread count

    sg=0,        # 0: CBOW, 1: Skip-gram.
                 #   Default is 0, CBOW

    hs=0,        # 0: Negative Sampling, 1: Hierarchical Softmax
                 #   Default is 0, NS

    negative=5,  # Number of negative samples (default is 5)

    sample=1e-3, # The coefficient for the subsampling of frequent words
                 # equation.

    word_ngrams=1, # Turn on n-grams.
    min_n=3,       # Min n-gram size of 3 characters (default is 3).
    max_n=6,       # Max n-gram size of 6 characters (default is 6).

    bucket=2000000, # Initial number of buckets for the n-gram hash table.
                    # gensim appears to resize the hash table for you, though,
                    # as part of building the vocabulary.

    # Additional parameters and their defaults...

    # seed=1,
    # alpha=0.025,    # Initial learning rate.
    # min_alpha=0.0001,
    # cbow_mean=1,
    # hashfxn=hash,
    # null_word=0,
    # sorted_vocab=1,
    # trim_rule=None,
    # batch_words=MAX_WORDS_IN_BATCH,
    # callbacks=()
)
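
The `bucket` parameter exists because fastText does not keep a separate lookup entry for every distinct n-gram. Instead, each n-gram is hashed to a row of a shared table, so different n-grams can collide. The sketch below illustrates the idea with a plain 32-bit FNV-1a hash. The helper names are hypothetical, and gensim's actual hash function may differ in detail, so the bucket numbers printed here are not the ones gensim would use.

```python
def fnv1a_32(data: bytes) -> int:
    """Plain 32-bit FNV-1a hash of a byte string."""
    h = 2166136261                        # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF   # FNV prime, wrapped to 32 bits
    return h

def ngram_bucket(ngram: str, bucket: int = 2000000) -> int:
    """Map an n-gram to a row index in a table with `bucket` rows."""
    return fnv1a_32(ngram.encode("utf-8")) % bucket

# A few n-grams of "<condescending>" (illustrative only).
for ngram in ["<co", "con", "ond", "ing", "ng>"]:
    print(ngram, "->", ngram_bucket(ngram))
```

A larger `bucket` value means fewer collisions at the cost of a bigger n-gram matrix.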

## Build the Vocabulary
---------------------------------

Before we can train the model, we need to create a vocabulary. The vocabulary contains the full list of words that we will end up learning word vectors for.

* Source code for `build_vocab` is at [base_any2vec.py#L896](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/base_any2vec.py#L896)
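
As a minimal sketch of what the vocabulary step does with `min_count`, the snippet below counts words with `collections.Counter` and drops the rare ones. The toy corpus and the helper name `build_toy_vocab` are made up for illustration; gensim's real `build_vocab` does more than this (for example, it also prepares the subsampling settings and the n-gram table).

```python
from collections import Counter

def build_toy_vocab(sentences, min_count=2):
    """Count every token, then keep only words seen at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

# Hypothetical mini-corpus of already-tokenized comments.
toy_sentences = [
    ["you", "are", "so", "rude"],
    ["you", "are", "wrong"],
    ["so", "rude", "and", "so", "wrong"],
]

vocab = build_toy_vocab(toy_sentences, min_count=2)
total_tokens = sum(len(s) for s in toy_sentences)
kept_tokens = sum(vocab.values())

print(sorted(vocab))
print("kept %d of %d tokens" % (kept_tokens, total_tokens))
```

Rare words account for many of the distinct word types but few of the running tokens, which is the pattern to look for in the `effective_min_count` lines of the log.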

In [None]:
# Build the vocabulary using the comments in "sentences".
model.build_vocab(
    sentences, # Our comments dataset
    progress_per=20000  # Update after this many sentences.
                        # Too many progress updates is annoying!
)

22:41:13 : collecting all words and their counts
22:41:13 : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
22:41:13 : PROGRESS: at sentence #20000, processed 1399843 words, keeping 51238 word types
22:41:14 : PROGRESS: at sentence #40000, processed 2764033 words, keeping 76280 word types
22:41:14 : PROGRESS: at sentence #60000, processed 4091968 words, keeping 94720 word types
22:41:14 : PROGRESS: at sentence #80000, processed 5354741 words, keeping 112164 word types
22:41:15 : PROGRESS: at sentence #100000, processed 6640746 words, keeping 129161 word types
22:41:15 : collected 141062 word types from a corpus of 7651029 raw words and 115864 sentences
22:41:15 : Loading a fresh vocabulary
22:41:15 : effective_min_count=2 retains 71038 unique words (50% of original 141062, drops 70024)
22:41:15 : effective_min_count=2 leaves 7581005 word corpus (99% of original 7651029, drops 70024)
22:41:16 : deleting the raw counts dictionary of 141062 items
22:41:16 : sample=0.001 

-------------------------------
The logging output reports the estimated size of the model as `243616904 bytes` (232.3 MB). The same log line for the word2vec model estimates 88.1 MB, so fastText requires about 2.64x as much memory.

TBD - These estimates are larger than the expected matrix sizes computed below, so perhaps they include the memory for the vocabulary as well?

In [None]:
print('word2vec model is %.1f MB' % (92349400 / 2**20))
print('fastText model is %.1f MB' % (243616904 / 2**20))
print()
print('fastText model is %.2fx larger' % (243616904 / 92349400))
print()
print('Expected word2vec matrix size: %.2f MB' % (71038*100*4 / 2**20))
print('Expected fastText matrix size: %.2f MB' % ((71038 + 336317)*100*4 / 2**20))

word2vec model is 88.1 MB
fastText model is 232.3 MB

fastText model is 2.64x larger

Expected word2vec matrix size: 27.10 MB
Expected fastText matrix size: 155.39 MB
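
One possible answer to the open question above: the word2vec figure can be reproduced exactly by assuming the estimate counts two weight matrices (the word vectors and the negative-sampling output weights) plus roughly 500 bytes of vocabulary overhead per word. The 500-byte constant and the breakdown are assumptions about how gensim arrives at its estimate, not something shown in the log, so treat this as a sketch.

```python
vocab_size = 71038
vector_size = 100
bytes_per_float = 4
num_ngrams = 336317   # n-gram count used in the cell above

matrix_bytes = vocab_size * vector_size * bytes_per_float

# Assumption: ~500 bytes of bookkeeping per vocabulary entry.
vocab_overhead = vocab_size * 500

# Assumption: input vectors + output weights + vocabulary overhead.
w2v_estimate = vocab_overhead + 2 * matrix_bytes

# fastText adds the n-gram matrix on top of that.
ngram_bytes = num_ngrams * vector_size * bytes_per_float
ft_partial = w2v_estimate + ngram_bytes
unexplained = 243616904 - ft_partial

print('word2vec estimate: %d bytes' % w2v_estimate)
print('fastText partial:  %d bytes' % ft_partial)
print('unexplained:       %.1f MB' % (unexplained / 2**20))
```

Under these assumptions the word2vec total comes out to exactly 92,349,400 bytes. The fastText total still leaves about 16 MB unaccounted for, which may be per-word n-gram bookkeeping, but that part is a guess.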


## Train the model
-------------------------

Now that we have a vocabulary built, we're ready to train the model.

The word2vec model took `43s` to train on my desktop, while this fastText model took `268s` (4min 28s), which is about `6.2x` as long. The run recorded below took 10min 23s of wall time.

In [None]:
%%time

print('Training the model...')

model.train(
    sentences,
    total_examples=len(sentences),
    epochs=10,        # How many training passes to take.
    report_delay=10.0 # Report progress every 10 seconds.
)

print('  Done.')
print('')

22:41:36 : training model with 10 workers on 71038 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=10


Training the model...


22:41:37 : EPOCH 1 - PROGRESS: at 1.28% examples, 64976 words/s, in_qsize 19, out_qsize 0
22:41:47 : EPOCH 1 - PROGRESS: at 16.91% examples, 93243 words/s, in_qsize 19, out_qsize 0
22:41:57 : EPOCH 1 - PROGRESS: at 33.10% examples, 94474 words/s, in_qsize 19, out_qsize 0
22:42:07 : EPOCH 1 - PROGRESS: at 49.46% examples, 95526 words/s, in_qsize 19, out_qsize 0
22:42:17 : EPOCH 1 - PROGRESS: at 66.90% examples, 96011 words/s, in_qsize 20, out_qsize 2
22:42:27 : EPOCH 1 - PROGRESS: at 83.84% examples, 96154 words/s, in_qsize 20, out_qsize 2
22:42:36 : worker thread finished; awaiting finish of 9 more threads
22:42:36 : worker thread finished; awaiting finish of 8 more threads
22:42:37 : worker thread finished; awaiting finish of 7 more threads
22:42:37 : worker thread finished; awaiting finish of 6 more threads
22:42:37 : worker thread finished; awaiting finish of 5 more threads
22:42:37 : worker thread finished; awaiting finish of 4 more threads
22:42:37 : worker thread finished; awaiti

  Done.

CPU times: user 20min 22s, sys: 2.77 s, total: 20min 25s
Wall time: 10min 23s


In [None]:
# Write the model out to disk
model.save('./data/wiki_attack_ft.model')

22:51:59 : saving FastText object under ./data/wiki_attack_ft.model, separately None
22:51:59 : storing np array 'vectors_ngrams' to ./data/wiki_attack_ft.model.wv.vectors_ngrams.npy
22:51:59 : not storing attribute vectors_norm
22:51:59 : not storing attribute vectors_vocab_norm
22:51:59 : not storing attribute vectors_ngrams_norm
22:51:59 : not storing attribute buckets_word
22:51:59 : storing np array 'vectors_ngrams_lockf' to ./data/wiki_attack_ft.model.trainables.vectors_ngrams_lockf.npy
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
22:52:01 : saved ./data/wiki_attack_ft.model


## Compare results
---------------------------



Let's load both the word2vec and fasttext models so that we can compare them side-by-side.

I've hosted a copy of the trained word2vec model created by the "Appendix - Full word2vec Training Example.ipynb" Notebook.

In [None]:
import gdown

print('Downloading word2vec model...\n')

# Specify the name to give the file locally.
output = 'wiki_attack_w2v.model'

# Specify the Google Drive ID of the file.
file_id = '1atZ7L6DqT_zZtIkUnPi77KA7Qsv5zm07'

# Download the file.
gdown.download('https://drive.google.com/uc?id=' + file_id, output,
                quiet=False)

print('\nDONE.')

Downloading word2vec model...



Downloading...
From: https://drive.google.com/uc?id=1atZ7L6DqT_zZtIkUnPi77KA7Qsv5zm07
To: /content/wiki_attack_w2v.model
89.9MB [00:00, 234MB/s]



DONE.


Now we can load the word2vec model.

In [None]:
import os
# Distinguish the fasttext model from the plain word2vec.
model_ft = model

# Specify the path to the word2vec model that we trained in Chapter 4.
w2v_filename = 'wiki_attack_w2v.model'

# Load the trained w2v model.
model_w2v = gensim.models.Word2Vec.load(w2v_filename)

23:17:07 : loading Word2Vec object from wiki_attack_w2v.model
23:17:08 : loading wv recursively from wiki_attack_w2v.model.wv.* with mmap=None
23:17:08 : setting ignored attribute vectors_norm to None
23:17:08 : loading vocabulary recursively from wiki_attack_w2v.model.vocabulary.* with mmap=None
23:17:08 : loading trainables recursively from wiki_attack_w2v.model.trainables.* with mmap=None
23:17:08 : setting ignored attribute cum_table to None
23:17:08 : loaded wiki_attack_w2v.model


----------------------------
As a quick, informal test of the quality of the model, let's look at some word comparisons that we think might appear in this dataset.

Let's start by defining a helper function that performs similarity searches using both models and displays the results side by side. That way we can apply it to a number of different words.

In [None]:
import pandas as pd

def compare_results(word):
    '''
    Performs similarity searches using both models and returns
    a table showing them side-by-side.
    '''

    # Report the occurrence count for this word.
    print("Word '%s' has %d samples in training text." % (word, model_ft.wv.vocab[word].count))

    # Find the most similar words using both models.
    print('Running similarity searches...')
    results_ft = model_ft.wv.most_similar(word)
    results_w2v = model_w2v.wv.most_similar(word)

    # Merge the result into one table.
    table_rows = []

    # For each result...
    for i in range(len(results_ft)):

        # Get the words for result 'i'.
        word_ft  =  results_ft[i][0]
        word_w2v = results_w2v[i][0]

        # Look up the occurrence counts in each model's own vocabulary.
        count_ft  = model_ft.wv.vocab[word_ft].count
        count_w2v = model_w2v.wv.vocab[word_w2v].count

        score_ft  =  results_ft[i][1]
        score_w2v = results_w2v[i][1]

        # Combine result `i` from both models into a single row.
        # Format the similarity score to 2 significant digits.
        table_rows.append(
                            (word_ft,  '{:,}'.format(count_ft),  '{:.2}'.format(score_ft),
                             word_w2v, '{:,}'.format(count_w2v), '{:.2}'.format(score_w2v))
                         )

    # Create a pandas dataframe to get a nice table display.
    df = pd.DataFrame(table_rows, columns=['fasttext', 'freq', 'score', 'word2vec', 'freq', 'score'])
    return df


----------------

Let's start with the word 'condescending'. It occurs in our training text 84 times, so it should have a decent word vector.

I think the results here are pretty fascinating!

It's immediately apparent that fastText is able to make reasonable comparisons on words with relatively few samples. In particular, the misspelling 'condecending' and the related forms 'condescendingly' and 'condescended' (*each of which had only three training samples*) are identified as strongly similar. Awesome!

On the other hand, it's giving way too much weight to character overlap between words. The top two results, `descending` and `ascending`, are not good results!
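
The overlap effect is easy to see by listing the character n-grams themselves. The sketch below builds the n-gram set for a word in the usual fastText style (the word wrapped in `<` and `>`, then every substring of 3 to 6 characters, matching `min_n` and `max_n` above) and measures how many n-grams two words share. The helper names are illustrative, and this is not gensim's implementation.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of `<word>`."""
    wrapped = "<" + word + ">"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }

def shared_fraction(word, other):
    """Fraction of `word`'s n-grams that also occur in `other`."""
    grams = char_ngrams(word)
    return len(grams & char_ngrams(other)) / len(grams)

for other in ["descending", "condecending", "sarcastic"]:
    print("%-14s %.2f" % (other, shared_fraction("condescending", other)))
```

Because a fastText word vector is built from the vectors of its n-grams, words that share most of their n-grams tend to land close together whether or not their meanings are related.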

In [None]:
df = compare_results('condescending')
display(df)

23:17:09 : precomputing L2-norms of word weight vectors


Word 'condescending' has 84 samples in training text.
Running similarity searches...


| | fasttext | freq | score | word2vec | freq | score |
|---|---|---|---|---|---|---|
| 0 | descending | 12 | 0.93 | sarcastic | 99 | 0.77 |
| 1 | ascending | 7 | 0.91 | aggressive | 198 | 0.76 |
| 2 | condecending | 3 | 0.9 | rude | 516 | 0.73 |
| 3 | condescendingly | 3 | 0.89 | uncivil | 362 | 0.72 |
| 4 | condescended | 3 | 0.87 | abusive | 328 | 0.67 |
| 5 | condemning | 20 | 0.86 | insulting | 363 | 0.67 |
| 6 | condoning | 13 | 0.86 | immature | 152 | 0.66 |
| 7 | disheartening | 11 | 0.85 | nasty | 215 | 0.66 |
| 8 | condensing | 10 | 0.84 | arrogant | 252 | 0.65 |
| 9 | malingering | 4 | 0.84 | polite | 194 | 0.65 |


---------------------------

Here is a fun one that I came across randomly in the vocabulary.

Using subword information works exceptionally well here! Without subword info, 8 out of the 10 results are garbage.

In [None]:
df = compare_results('hahahahahaha')
display(df)

Word 'hahahahahaha' has 14 samples in training text.
Running similarity searches...


| | fasttext | freq | score | word2vec | freq | score |
|---|---|---|---|---|---|---|
| 0 | hahahahahahaha | 5 | 1.0 | hahahaha | 46 | 0.72 |
| 1 | hahahahaha | 28 | 1.0 | ur | 500 | 0.69 |
| 2 | hahahaha | 46 | 0.99 | noob | 47 | 0.68 |
| 3 | hahahahahah | 3 | 0.99 | xd | 46 | 0.68 |
| 4 | ahahahahaha | 3 | 0.98 | beeblebrox | 23 | 0.68 |
| 5 | bwahahahaha | 2 | 0.98 | gaey | 2 | 0.67 |
| 6 | hahahahah | 4 | 0.98 | fuckers | 37 | 0.67 |
| 7 | ahahaha | 3 | 0.97 | dik | 3 | 0.67 |
| 8 | hahahah | 9 | 0.97 | tute | 6 | 0.67 |
| 9 | hahaha | 124 | 0.96 | nerd | 139 | 0.67 |


-----------------------------
Let's look at two words which should be very similar, and are well represented in the dataset.

In [None]:
print("Similarity between 'stupid' and 'dumb':")
print("  fasttext: %.2f" %  model_ft.wv.similarity('stupid', 'dumb'))
print("  word2vec: %.2f" % model_w2v.wv.similarity('stupid', 'dumb'))


Similarity between 'stupid' and 'dumb':
  fasttext: 0.69
  word2vec: 0.72


-----------------------------------------
Note that another way to measure the "likely quality" of a word vector (besides looking at the training sample count) is to check out the vector's norm. Under-trained vectors tend to have low norms.
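
The norm is worth checking separately because the similarity scores above are cosine similarities, which ignore vector length. A small pure-Python sketch (the toy vectors are made up) shows that shrinking a vector leaves its cosine similarity unchanged while its norm drops:

```python
import math

def norm(v):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))

# Hypothetical 2-d "word vectors" pointing in the same direction.
well_trained = [3.0, 4.0]
under_trained = [0.3, 0.4]   # one tenth the length

print('%.3f' % norm(well_trained))
print('%.3f' % norm(under_trained))
print('%.3f' % cosine(well_trained, under_trained))
```

So a high similarity score by itself says nothing about how well trained a vector is, which is why the norm is a useful second signal.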

In [None]:
import numpy as np

print('%.3f' % np.linalg.norm(model_w2v.wv['stupid']))
print('%.3f' % np.linalg.norm(model_w2v.wv['bwahahahaha']))

16.118
0.603


# Appendix
-------------------

## Locate local gensim code
---------------------------------------

If you want to poke through your own local copy of the gensim functions, the following code can help you quickly locate the files.

In [None]:
import gensim
import os

path = gensim.models.__file__

# On Windows, convert backslashes to forward slashes
if os.name == 'nt':
    path = path.replace('\\', '/')

# Trim off __init__.py
path = path[:-len('__init__.py')]

# Add on the base file.
path = path + 'base_any2vec.py'

print(path)

/usr/local/lib/python3.6/dist-packages/gensim/models/base_any2vec.py
