# **How to train your custom word embeddings**

In this notebook, you will learn how to train your own custom word2vec embeddings using Gensim.

For those who are new to word embeddings and would like to find out more, you can check out the following articles:
1. [Introduction to Word Embedding and Word2Vec](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
2. [A Beginner's Guide to Word2Vec and Neural Word Embeddings](https://skymind.ai/wiki/word2vec)

In [1]:
import numpy as np
import pandas as pd
import os
import re
import time

from gensim.models import Word2Vec
from tqdm import tqdm

tqdm.pandas()

In [2]:
def split_fslash(x):
    # Keep only the part of the string before the first '/'
    return x.split('/')[0]

In [3]:
def split_underscore(x):
    # Keep only the part of the string before the first '_'
    return x.split('_')[0]

In [4]:
def preprocessing(titles_array, image_path_array):

    # Retrieve the category from image_path (e.g. beauty, fashion, mobile) and append it to the title.
    # Work on the arguments rather than the global df_train, so the function also works for the test set.
    categories = image_path_array.apply(split_fslash).apply(split_underscore)
    combined = titles_array.map(str) + ' ' + categories

    processed_array = []

    for title in tqdm(combined):

        # Delete every character that is not a letter or a space (i.e. keep only alphabets and whitespace).
        processed = re.sub('[^a-zA-Z ]', '', title)

        words = processed.split()

        # Remove words with length <= 1
        processed_array.append(' '.join([word for word in words if len(word) > 1]))

    return processed_array
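
To make the cleaning step concrete, here is a minimal standalone sketch of the same regex and length filter applied to a single string. The sample title and image path are made up for illustration, and `clean_title` is a hypothetical helper, not part of the notebook.

```python
import re

def clean_title(title):
    # Delete every character that is not an ASCII letter or a space
    letters_only = re.sub('[^a-zA-Z ]', '', title)
    # Drop leftover one-letter fragments, such as the 's' left over from '6s'
    return ' '.join(word for word in letters_only.split() if len(word) > 1)

# Hypothetical inputs; the real path layout in the dataset may differ
sample_title = 'iphone 6s 64gb jet-black'
sample_path = 'mobile_image/abc123.jpg'

# Same idea as split_fslash followed by split_underscore
category = sample_path.split('/')[0].split('_')[0]
cleaned = clean_title(sample_title + ' ' + category)

print(category)  # mobile
print(cleaned)   # iphone gb jetblack mobile
```

Because symbols are deleted rather than replaced with a space, hyphenated words are glued together (`jet-black` becomes `jetblack`), which is one reason fused tokens can end up in the vocabulary.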

## **Something to take note of**
Word2vec is a **self-supervised** method: it needs no human-provided labels, because it derives its training targets from the text itself (check out this [Quora](https://www.quora.com/Is-Word2vec-a-supervised-unsupervised-learning-algorithm) thread for a more detailed explanation). We can therefore make full use of the entire dataset, including the test data, to obtain a more complete word embedding representation.
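
As a rough illustration of what "provides its own labels" means, the sketch below generates skip-gram style (target, context) training pairs from a tokenised title. It is a simplified picture, not Gensim's actual implementation (which, among other things, samples the window size); `skipgram_pairs` and the toy sentence are hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word and its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# Toy sentence made up for illustration
pairs = skipgram_pairs(['apple', 'iphone', 'plus', 'gb'], window=1)
for target, context in pairs:
    print(target, '->', context)
```

Every pair comes straight from the raw text, so unlabelled test titles are just as usable for training as the labelled training titles.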

In [6]:
df_train = pd.read_csv('../input/train.csv')
df_test = pd.read_csv('../input/test.csv')

In [7]:
df_train['processed'] = preprocessing(df_train['title'], df_train['image_path'])
df_test['processed'] = preprocessing(df_test['title'], df_test['image_path'])

sentences = pd.concat([df_train['processed'], df_test['processed']],axis=0)
train_sentences = list(sentences.progress_apply(str.split).values)

100%|██████████| 666615/666615 [00:02<00:00, 227778.93it/s]


1261 words replaced


100%|██████████| 172402/172402 [00:00<00:00, 229775.76it/s]
100%|██████████| 839017/839017 [00:03<00:00, 254942.09it/s]


In [8]:
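# Parameters reference : https://www.quora.com/How-do-I-determine-Word2Vec-parameters
# Feel free to customise your own embedding.
# Note: this uses the Gensim 3.x API; in Gensim 4.x, `size` is named `vector_size`.

start_time = time.time()

model = Word2Vec(sentences=train_sentences,
                 sg=1,       # 1 = skip-gram, 0 = CBOW
                 size=300,   # dimensionality of each word vector
                 workers=4)  # number of training threads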

print(f'Time taken : {(time.time() - start_time) / 60:.2f} mins')

Time taken : 1.71 mins


## **Pretty fast, isn't it?**

Let's check out some of the features of the customised word vector.

In [9]:
# Total number of words in the vocabulary of our custom word embedding
# (Gensim 3.x API; in Gensim 4.x, use len(model.wv.key_to_index) instead)

len(model.wv.vocab)

16689
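
Note that the training call above does not set `min_count`, so Gensim's default of 5 applies and any word seen fewer than five times is left out of the vocabulary. A rough sketch of that filtering on a toy corpus (the `build_vocab` helper and the sentences are hypothetical, not Gensim code):

```python
from collections import Counter

def build_vocab(sentences, min_count=5):
    # Count every token across all sentences
    counts = Counter(word for sentence in sentences for word in sentence)
    # Keep only the words that occur at least min_count times
    return {word for word, count in counts.items() if count >= min_count}

# Toy corpus: 'iphone' appears 6 times, 'case' 5 times, 'rarecolour' once
toy_sentences = [['iphone', 'case']] * 5 + [['iphone', 'rarecolour']]
vocab = build_vocab(toy_sentences)

print(sorted(vocab))  # ['case', 'iphone']
```

Lowering `min_count` keeps more rare words, at the cost of noisier vectors for words with very few occurrences.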

In [10]:
# Check out the dimension of each word vector (we set it to 300 in the above training step)

model.wv.vector_size

300

In [11]:
# Check out how 'iphone' is represented (an array of 300 numbers)

model.wv.get_vector('iphone')

array([ 9.71527621e-02,  2.20016912e-01, -1.81544483e-01,  1.29227087e-01,
       -3.59647930e-01,  1.92518815e-01,  1.16994448e-01, -4.39830840e-01,
       -4.36425090e-01,  9.08169299e-02, -9.80512891e-03,  9.03804302e-02,
       -1.96560979e-01,  1.55536113e-02,  6.76497877e-01, -2.19069839e-01,
        1.93168953e-01,  1.57927707e-01,  8.54952186e-02, -3.60724628e-01,
       -4.39989269e-01,  1.73035692e-02, -4.53415483e-01, -4.47129905e-01,
        5.90981185e-01, -5.24166882e-01, -2.68053353e-01,  2.26392657e-01,
       -1.52156409e-02,  2.70230621e-01, -3.76996547e-02, -2.12974511e-02,
       -7.87326694e-01,  1.80330351e-01,  1.19560681e-01,  9.01071951e-02,
        1.70087755e-01,  1.80407032e-01,  2.25447584e-02,  8.44595730e-02,
        1.05461165e-01, -2.79203117e-01, -1.22048885e-01, -8.06665421e-02,
       -3.66776317e-01,  1.57191247e-01, -1.66163549e-01, -7.57468818e-03,
       -7.27388859e-02, -3.52258623e-01, -7.49977827e-02,  5.49288869e-01,
       -3.87300283e-01,  

## Now, why are word embeddings powerful? 

This is because they capture the semantic relationships between words. In other words, words with similar meanings should appear near each other in the vector space of our custom embeddings.

Let's check out an example:

In [12]:
# Find words with similar meaning to 'iphone'

model.wv.most_similar('iphone')

[('iphones', 0.7185158133506775),
 ('originaliphone', 0.6419346332550049),
 ('cpo', 0.6294914484024048),
 ('jetblack', 0.6190686821937561),
 ('mateblack', 0.6137925982475281),
 ('iphoneplus', 0.6127023100852966),
 ('exinternasional', 0.6091649532318115),
 ('apple', 0.6091042757034302),
 ('iph', 0.6079336404800415),
 ('selleriphone', 0.6071897745132446)]

Well, you will see the words most similar to 'iphone', sorted by cosine similarity from highest to lowest.
Of course, there are also some less intuitive or relevant ones (e.g. jetblack, cpo). If you would like to tackle this, you can do more thorough pre-processing or try other embedding dimensions.
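
The scores returned by `most_similar` are cosine similarities: the cosine of the angle between two word vectors, which ignores their lengths. A minimal sketch of that computation in plain Python, using made-up 3-dimensional vectors rather than the real 300-dimensional ones:

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only
iphone = [1.0, 2.0, 0.0]
iphones = [2.0, 4.0, 0.0]  # same direction as 'iphone', twice the length
laptop = [0.0, 0.0, 3.0]   # orthogonal to 'iphone'

print(cosine_similarity(iphone, iphones))  # approximately 1
print(cosine_similarity(iphone, laptop))   # 0.0
```

A score close to 1 means the two vectors point in nearly the same direction, and a score near 0 means they are unrelated in this space.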


## **The most important part!**
Last but not least, save your word embeddings so that you can use them for modelling. You can load the text file next time using Gensim's `KeyedVectors.load_word2vec_format` function.

In [13]:
model.wv.save_word2vec_format('custom_glove_300d.txt')


# How to load:
# from gensim.models import KeyedVectors
# w2v = KeyedVectors.load_word2vec_format('custom_glove_300d.txt')

# How to get vector using loaded model
# w2v.get_vector('iphone')
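
The saved file is plain text in the word2vec format: a header line with the vocabulary size and vector dimension, followed by one line per word containing the word and its vector components separated by spaces. If you ever need to read it without Gensim, a minimal sketch could look like the following; `read_word2vec_text` is a hypothetical helper and the tiny in-memory file stands in for the real one.

```python
import io

def read_word2vec_text(handle):
    # Header line: "<vocab_size> <vector_size>"
    vocab_size, vector_size = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        word, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != vector_size:
            raise ValueError(f'unexpected vector length for {word!r}')
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError('vocabulary size does not match the header')
    return vectors

# Toy file with 2 words and 3 dimensions, made up for illustration
toy_file = io.StringIO('2 3\niphone 0.1 0.2 0.3\napple 0.4 0.5 0.6\n')
vectors = read_word2vec_text(toy_file)

print(vectors['iphone'])  # [0.1, 0.2, 0.3]
```

For real use, `KeyedVectors.load_word2vec_format` is the better choice, since it also gives you similarity queries on the loaded vectors.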
