# **How to train your <del>dragon</del> custom word embeddings**

In the [baseline-keras-lstm-is-all-you-need](https://www.kaggle.com/huikang/baseline-keras-lstm-is-all-you-need) notebook shared by Hui Kang (thanks again!), an LSTM model using generic Global Vectors (GloVe) embeddings achieved pretty solid benchmark results.

After playing around with GloVe, you will quickly find that certain words in your training data are not present in its vocabulary. These are typically replaced with a zero vector of the same shape, which means you are effectively 'sacrificing' the word as an input feature, even though it may be important for a correct prediction. Another way to deal with this is to train your own word embeddings on your training data, so that the semantic relationships in your own corpus are better represented.
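To make the zero-vector fallback concrete, here is a minimal sketch in plain Python. The tiny `pretrained` dictionary, its 3-dimensional vectors, and the `lookup` helper are all hypothetical stand-ins for a real pretrained embedding, not part of the original notebook.

```python
# Hypothetical 3-dimensional "pretrained" vectors; real GloVe vectors are much larger.
pretrained = {
    "phone": [0.1, 0.3, -0.2],
    "case": [0.4, -0.1, 0.0],
}
EMBED_DIM = 3


def lookup(word, vectors, dim):
    """Return the word's vector, or a same-shape zero vector if it is out of vocabulary."""
    return vectors.get(word, [0.0] * dim)


title = ["phone", "case", "xiaomi"]  # 'xiaomi' is missing from the toy vocabulary
embedded = [lookup(word, pretrained, EMBED_DIM) for word in title]
oov_words = [word for word in title if word not in pretrained]

print(embedded[2])  # [0.0, 0.0, 0.0]
print(oov_words)    # ['xiaomi']
```

Every out-of-vocabulary word ends up with the same all-zero input, so the model cannot tell such words apart.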

In this notebook, I will demonstrate how to train your custom word2vec using Gensim.

For those who are new to word embeddings and would like to find out more, you can check out the following articles:
1. [Introduction to Word Embedding and Word2Vec](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
2. [A Beginner's Guide to Word2Vec and Neural Word Embeddings](https://skymind.ai/wiki/word2vec)

In [2]:
import numpy as np
import pandas as pd
import os
import re
import time

from gensim.models import Word2Vec
from tqdm import tqdm

tqdm.pandas()

In [3]:
def preprocessing(titles_array):
    
    """
    Take in an array of titles, and return the processed titles.
    
    (e.g. input: 'i am a boy', output: 'am boy', since words of length 1 are removed)
    
    Feel free to change the preprocessing steps and see how it affects the modelling results!
    """
    
    processed_array = []
    
    for title in tqdm(titles_array):
        
        # delete every character that is not a letter or a space
        # (note: such characters are removed outright, not replaced with a space).
        processed = re.sub('[^a-zA-Z ]', '', title)
        
        words = processed.split()
        
        # keep words that have length of more than 1 (e.g. gb, bb), remove those with length 1.
        processed_array.append(' '.join([word for word in words if len(word) > 1]))
    
    return processed_array
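As a quick check of what these preprocessing steps do, the sketch below applies the same regular expression and length filter to single strings. The `clean_title` helper is a hypothetical single-title variant written for illustration only, without the progress bar.

```python
import re


def clean_title(title):
    # Delete everything except ASCII letters and spaces, then drop one-letter words.
    letters_only = re.sub('[^a-zA-Z ]', '', title)
    return ' '.join(word for word in letters_only.split() if len(word) > 1)


print(clean_title('i am a boy'))                  # am boy
print(clean_title('Anheuser-Busch Inbev Sa/Nv'))  # AnheuserBusch Inbev SaNv
print(clean_title('Apple Inc.'))                  # Apple Inc
```

Because the punctuation is deleted rather than replaced with a space, hyphenated or slash-separated words are merged into one token. Substituting `' '` instead of `''` would keep them apart, which may or may not be what you want.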

## **Something to take note of**
Word2vec is a **self-supervised** method (it needs no manual labels, yet it is not unsupervised in the strict sense, since it derives its own labels from the text; check out this [Quora](https://www.quora.com/Is-Word2vec-a-supervised-unsupervised-learning-algorithm) thread for a more detailed explanation), so we can make full use of the entire dataset (including test data) to obtain a more complete word embedding representation.
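To illustrate what "provides its own labels" means, the following sketch builds skip-gram style (center word, context word) training pairs directly from a tokenised sentence. The `skipgram_pairs` function is a simplified, hypothetical illustration; real implementations add details such as randomised window sizes and negative sampling that are omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["toronto", "dominion", "bank"], window=1)
print(pairs)
# [('toronto', 'dominion'), ('dominion', 'toronto'), ('dominion', 'bank'), ('bank', 'dominion')]
```

No human annotation is involved: the "label" for each center word is simply a word that appears next to it, which is why unlabelled test text can contribute training pairs too.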

In [20]:
df_train = pd.read_csv('SEC-CompanyTicker.csv')
df_test = pd.read_csv('SEC-CompanyTicker.csv')

In [25]:
df_train = df_train.rename(columns={"companyName":"title"}).head(100)
df_test = df_test.rename(columns={"companyName":"title"})
df_train

| Unnamed: 0.1 | Unnamed: 0 | cik_str | ticker | title | processed |
|---|---|---|---|---|---|
| 0 | 0 | 320193 | AAPL | Apple Inc. | Apple Inc |
| 1 | 1 | 789019 | MSFT | Microsoft Corp | Microsoft Corp |
| 2 | 2 | 1652044 | GOOGL | Alphabet Inc. | Alphabet Inc |
| 3 | 3 | 1018724 | AMZN | Amazon Com Inc | Amazon Com Inc |
| 4 | 4 | 1045810 | NVDA | Nvidia Corp | Nvidia Corp |
| ... | ... | ... | ... | ... | ... |
| 95 | 95 | 1075531 | BKNG | Booking Holdings Inc. | Booking Holdings Inc |
| 96 | 96 | 829224 | SBUX | Starbucks Corp | Starbucks Corp |
| 97 | 97 | 1668717 | BUD | Anheuser-Busch Inbev Sa/Nv | AnheuserBusch Inbev SaNv |
| 98 | 98 | 947263 | TD | Toronto Dominion Bank | Toronto Dominion Bank |


In [26]:
df_train['processed'] = preprocessing(df_train['title'])
df_test['processed'] = preprocessing(df_test['title'])

sentences = pd.concat([df_train['processed'], df_test['processed']],axis=0)
train_sentences = list(sentences.progress_apply(str.split).values)

100%|████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 178861.58it/s]
100%|████████████████████████████████████████████████████████████████████████| 10898/10898 [00:00<00:00, 376805.53it/s]
100%|████████████████████████████████████████████████████████████████████████| 10998/10998 [00:00<00:00, 916821.47it/s]


In [35]:
# Parameters reference : https://www.quora.com/How-do-I-determine-Word2Vec-parameters
# Feel free to customise your own embedding

start_time = time.time()

model = Word2Vec(train_sentences,
                 min_count=1, vector_size=100,
                 window=5, sg=1)

print(f'Time taken : {(time.time() - start_time) / 60:.2f} mins')

Time taken : 0.00 mins
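Of the parameters above, `min_count` decides which words make it into the vocabulary at all: words seen fewer than `min_count` times are ignored. The sketch below mimics that filtering step with a plain `Counter`. The `build_vocab` helper and the toy sentences are hypothetical and only illustrate the idea; this is not Gensim's actual implementation.

```python
from collections import Counter

sentences = [
    ["apple", "inc"],
    ["microsoft", "corp"],
    ["nvidia", "corp"],
    ["alphabet", "inc"],
]


def build_vocab(sentences, min_count):
    """Keep only the words that occur at least `min_count` times across all sentences."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return sorted(word for word, n in counts.items() if n >= min_count)


print(build_vocab(sentences, min_count=1))
# ['alphabet', 'apple', 'corp', 'inc', 'microsoft', 'nvidia']
print(build_vocab(sentences, min_count=2))
# ['corp', 'inc']
```

With short titles such as company names, most words are rare, so `min_count = 1` keeps every word at the cost of poorly trained vectors for words seen only once.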


## **Pretty fast, isn't it?**

Let's check out some of the features of the customised word vector.

In [37]:

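# Toy model for the inspection cells below (note: this overwrites the model trained above).
# Each sentence must be a list of tokens, so the three one-word sentences are written out explicitly.
model = Word2Vec([["a"], ["b"], ["c"]],
                 min_count=1, vector_size=100,
                 window=5, sg=1)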

In [41]:
# Total number of words in the vocabulary of our custom word embedding
vocab_len = len(model.wv)
vocab_len

3

In [42]:
model.wv

<gensim.models.keyedvectors.KeyedVectors at 0x283fc8bb0>

In [43]:
# Check out the dimension of each word (we set it to 100 in the above training step)

model.wv.vector_size

100

In [45]:
# Check out how 'a' is represented (an array of 100 numbers)

model.wv.get_vector('a')

array([ 9.4563962e-05,  3.0773198e-03, -6.8126451e-03, -1.3754654e-03,
        7.6685809e-03,  7.3464094e-03, -3.6732971e-03,  2.6427018e-03,
       -8.3171297e-03,  6.2054861e-03, -4.6373224e-03, -3.1641065e-03,
        9.3113566e-03,  8.7338570e-04,  7.4907029e-03, -6.0740625e-03,
        5.1605068e-03,  9.9228229e-03, -8.4573915e-03, -5.1356913e-03,
       -7.0648370e-03, -4.8626517e-03, -3.7785638e-03, -8.5361991e-03,
        7.9556061e-03, -4.8439382e-03,  8.4236134e-03,  5.2625705e-03,
       -6.5500261e-03,  3.9578713e-03,  5.4701497e-03, -7.4265362e-03,
       -7.4057197e-03, -2.4752307e-03, -8.6257253e-03, -1.5815723e-03,
       -4.0343284e-04,  3.2996845e-03,  1.4418805e-03, -8.8142155e-04,
       -5.5940580e-03,  1.7303658e-03, -8.9737179e-04,  6.7936908e-03,
        3.9735902e-03,  4.5294715e-03,  1.4343059e-03, -2.6998555e-03,
       -4.3668128e-03, -1.0320747e-03,  1.4370275e-03, -2.6460087e-03,
       -7.0737829e-03, -7.8053069e-03, -9.1217868e-03, -5.9351693e-03,
      

## Now, why are word embeddings powerful? 

This is because they capture the semantic relationships between words. In other words, words with similar meanings should appear near each other in the vector space of our custom embeddings.

Let's check out an example:

In [47]:
# Find the words most similar to a query word.
# The query must be in the vocabulary; 'k' is not, so this call raises a KeyError.

model.wv.most_similar('k')

KeyError: "Key 'k' not present in vocabulary"

Well, for a word that is in the vocabulary (e.g. 'iphone'), you will see the words most similar to it, sorted by cosine similarity.
Of course, there are also some not so intuitive or relevant ones (e.g. jetblack, cpo, ten). If you would like to tackle this, you can do more thorough pre-processing or try other embedding dimensions.
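As a rough illustration of how such a ranking can be computed, the sketch below scores every other word by cosine similarity to the query, which is the measure Gensim's `most_similar` uses. The 2-dimensional `toy_vectors` and the `rank_by_cosine` helper are hypothetical and chosen only so the result is easy to follow.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_cosine(query, vectors, topn=2):
    """Return the `topn` words closest to `query`, most similar first."""
    if query not in vectors:
        raise KeyError(f"Key '{query}' not present in vocabulary")
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical 2-dimensional vectors, for illustration only.
toy_vectors = {
    "iphone": [1.0, 0.0],
    "samsung": [0.9, 0.1],
    "banana": [0.0, 1.0],
}

ranked = rank_by_cosine("iphone", toy_vectors)
print([word for word, _ in ranked])  # ['samsung', 'banana']
```

Cosine similarity compares directions rather than lengths, so two vectors pointing the same way score close to 1 regardless of their magnitudes.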


## **The most important part!**
Last but not least, save your word embeddings so that you can use them for modelling. You can load the text file next time using Gensim's `KeyedVectors.load_word2vec_format` function.

In [10]:
model.wv.save_word2vec_format('custom_glove_100d.txt')


# How to load:
# from gensim.models import KeyedVectors
# w2v = KeyedVectors.load_word2vec_format('custom_glove_100d.txt')

# How to get vector using loaded model
# w2v.get_vector('iphone')
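The saved file is plain text: a header line with the vocabulary size and vector size, followed by one line per word holding the word and its values separated by spaces. Assuming that layout, the sketch below parses a tiny hand-written sample without Gensim. The `read_word2vec_text` helper and the sample values are hypothetical.

```python
import io

# A tiny hand-written sample in the word2vec text format:
# header "<vocab size> <vector size>", then "<word> <v1> <v2> ..." per line.
sample = io.StringIO("2 3\napple 0.1 0.2 0.3\ninc -0.1 0.0 0.5\n")


def read_word2vec_text(handle):
    """Parse a word2vec text file into a {word: [floats]} dictionary."""
    vocab_size, vector_size = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != vector_size:
            raise ValueError(f"expected {vector_size} values for '{word}'")
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError("header vocabulary size does not match the file contents")
    return vectors


vectors = read_word2vec_text(sample)
print(vectors["apple"])  # [0.1, 0.2, 0.3]
```

Being plain text, the file can be inspected in any editor and loaded by other tools, at the cost of a larger file than a binary format.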


In [34]:
# Vocabulary keys are single tokens, so a multi-word key such as 'Apple Inc' raises a KeyError.
print("Cosine similarity between 'Apple Inc' and 'Apple Inc' - Skip Gram : ",
      model.wv.similarity('Apple Inc', 'Apple Inc'))

KeyError: "Key 'Apple Inc' not present"
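One possible way around this error, sketched below, is to split a multi-word title into tokens, keep only those found in the vocabulary, and average their vectors. This is an assumption about how you might want to represent a whole title, not something the original notebook does. The `toy_vectors` dictionary stands in for `model.wv`, and `title_vector` is a hypothetical helper.

```python
# Hypothetical 2-dimensional vectors standing in for model.wv.
toy_vectors = {"Apple": [0.25, 0.5], "Inc": [0.75, 0.0]}


def title_vector(title, vectors):
    """Average the vectors of the in-vocabulary tokens; return None if there are none."""
    known = [vectors[token] for token in title.split() if token in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


print("Apple Inc" in toy_vectors)                # False
print(title_vector("Apple Inc", toy_vectors))    # [0.5, 0.25]
print(title_vector("Unknown Co", toy_vectors))   # None
```

Checking membership before the lookup avoids the exception, and returning `None` lets the caller decide how to handle titles with no known words.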