[Dataset link](https://www.kaggle.com/CooperUnion/cardataset)

[Reference](https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92)

# Introduction to Word2Vec

Word2vec is one of the most popular techniques for learning word embeddings with a two-layer neural network. Its input is a text corpus and its output is a set of vectors, one per word. Representing words as vectors makes natural language computer-readable, so mathematical operations on the vectors can be used to measure how similar words are. A well-trained set of word vectors places similar words close to each other in the vector space. For instance, the words women, men, and human might cluster in one corner, while yellow, red, and blue cluster together in another.

There are two main training algorithms for word2vec: continuous bag of words (CBOW) and skip-gram. The major difference between them is that CBOW uses the surrounding context to predict a target word, while skip-gram uses a word to predict its surrounding context. CBOW generally trains faster, while skip-gram tends to do better on small corpora and on infrequent words. Note that both algorithms learn a single vector per token, so a word with several senses, such as Apple (the company or the fruit), gets one vector that blends those senses. For more details about the word2vec algorithm, see [the original paper](https://arxiv.org/pdf/1301.3781.pdf).
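To make the difference concrete, here is a minimal sketch of how the two algorithms frame their training examples from one sentence. The helper names (`context_window`, `cbow_examples`, `skipgram_examples`) and the toy sentence are made up for illustration, and the sketch uses a fixed window, whereas real implementations add details such as subsampling and negative sampling.

```python
def context_window(tokens, i, window):
    """Return the tokens within `window` positions of index i, excluding i itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    """CBOW framing: the context words are the input, the centre word is the target."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, window):
    """Skip-gram framing: the centre word is the input, each neighbour is a target."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]


# Hypothetical toy sentence, used only to show the shape of the examples.
sentence = ["the", "quick", "brown", "fox"]
cbow = cbow_examples(sentence, window=1)
skipgram = skipgram_examples(sentence, window=1)

print(cbow[1])       # (['the', 'brown'], 'quick')
print(skipgram[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```

CBOW produces one example per position, while skip-gram produces one example per (word, neighbour) pair, which is part of why skip-gram is slower to train.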

# Gensim Python Library Introduction

Gensim is an open-source Python library for natural language processing, developed and maintained by the Czech natural language processing researcher Radim Řehůřek. Gensim lets us develop word embeddings by training our own word2vec models on a custom corpus with either the CBOW or the skip-gram algorithm.

In [1]:
!pip install --upgrade gensim

Collecting gensim
  Downloading gensim-4.2.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
     |████████████████████████████████| 24.1 MB 1.4 MB/s
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.2.0


# Download the data

## Dataset Description

This vehicle dataset includes features such as make, model, year, engine, and other properties of each car. We will use these features to generate a word embedding for each make and model, and then compare the similarities between different make-model pairs.

In [4]:
!wget https://raw.githubusercontent.com/PICT-NLP/BE-NLP-Elective/main/2-Embeddings/data.csv

--2022-05-12 04:57:31--  https://raw.githubusercontent.com/PICT-NLP/BE-NLP-Elective/main/2-Embeddings/data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1475504 (1.4M) [text/plain]
Saving to: ‘data.csv.1’


2022-05-12 04:57:31 (146 MB/s) - ‘data.csv.1’ saved [1475504/1475504]



# Implementation of Word Embedding with Gensim

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('data.csv')
df.head()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500


## Data Preprocessing

Since the purpose of this tutorial is to learn how to generate word embeddings with the Gensim library, we skip EDA and feature selection for the word2vec model for the sake of simplicity.

Gensim's word2vec expects a ‘list of lists’ for training: the corpus is a list of documents, and each document is itself a list of tokens. So we first need to build a ‘list of lists’ for training the make-model word embedding. To be more specific, each row of the dataset becomes one list whose tokens are that car's features together with its make and model.

To achieve this, we need to do the following:

Create a new column for Make Model

In [5]:
df['Maker_Model'] = df['Make'] + " " + df['Model']

Generate a ‘list of lists’ for each Make Model from the following features: Engine Fuel Type, Transmission Type, Driven_Wheels, Market Category, Vehicle Size, Vehicle Style.

In [6]:
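# Keep only the categorical features plus the combined make/model column.
df1 = df[['Engine Fuel Type', 'Transmission Type', 'Driven_Wheels', 'Market Category', 'Vehicle Size', 'Vehicle Style', 'Maker_Model']]

# Join each row's values into one comma-separated string.
df2 = df1.apply(lambda x: ','.join(x.astype(str)), axis=1)

df_clean = pd.DataFrame({'clean': df2})

# Split on commas so each row becomes a list of tokens (multi-valued Market Category entries are split too).
sent = [row.split(',') for row in df_clean['clean']]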
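To see what one entry of `sent` looks like, the sketch below applies the same join-then-split steps to a single row in plain Python, without pandas. The values are copied from row 0 of the preview table above. The variable names `row` and `tokens` are illustrative and not part of the notebook.

```python
# Feature values for row 0 of the preview table, plus the combined Maker_Model.
row = [
    "premium unleaded (required)",            # Engine Fuel Type
    "MANUAL",                                 # Transmission Type
    "rear wheel drive",                       # Driven_Wheels
    "Factory Tuner,Luxury,High-Performance",  # Market Category
    "Compact",                                # Vehicle Size
    "Coupe",                                  # Vehicle Style
    "BMW 1 Series M",                         # Maker_Model
]

# Join the values with commas, then split on commas to get the token list.
tokens = ",".join(row).split(",")
print(tokens)
# ['premium unleaded (required)', 'MANUAL', 'rear wheel drive', 'Factory Tuner',
#  'Luxury', 'High-Performance', 'Compact', 'Coupe', 'BMW 1 Series M']
```

Because Market Category is itself comma-separated, its three labels become three separate tokens, so the row yields nine tokens from seven columns.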

## Gensim word2vec Model Training

We can train the Gensim word2vec model on our own custom corpus as follows:

In [8]:
from gensim.models.word2vec import Word2Vec

Let’s try to understand the hyperparameters of this model.

1. `vector_size`: The number of dimensions of the embeddings and the default is 100.

2. `window`: The maximum distance between a target word and words around the target word. The default window is 5.

3. `min_count`: The minimum count of words to consider when training the model; words with occurrence less than this count will be ignored. The default for min_count is 5.

4. `workers`: The number of worker threads used to train the model. The default is 3.

5. `sg`: The training algorithm, either CBOW (0) or skip-gram (1). The default training algorithm is CBOW.
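As a rough illustration of what `min_count` does, the following sketch prunes a vocabulary by token frequency in plain Python. It only mimics the effect of the parameter and is not Gensim's internal code. The function name `vocabulary` and the toy corpus are made up for this example.

```python
from collections import Counter

# Hypothetical toy corpus in the same 'list of lists' shape as `sent`.
toy_corpus = [
    ["MANUAL", "Compact", "Coupe", "BMW 1 Series"],
    ["MANUAL", "Compact", "Convertible", "BMW 1 Series"],
    ["AUTOMATIC", "Midsize", "Sedan", "Toyota Camry"],
]


def vocabulary(corpus, min_count):
    """Keep only the tokens that occur at least `min_count` times in the corpus."""
    counts = Counter(token for sentence in corpus for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}


vocab_all = vocabulary(toy_corpus, min_count=1)
vocab_pruned = vocabulary(toy_corpus, min_count=2)

print(len(vocab_all))        # 9
print(sorted(vocab_pruned))  # ['BMW 1 Series', 'Compact', 'MANUAL']
```

With `min_count=2`, the token `Toyota Camry` is dropped because it appears only once. This is why a low `min_count` matters when every make and model should end up with a vector, even the rare ones.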

With these hyperparameters in mind, we train a skip-gram model with 50-dimensional vectors on our corpus, keeping every token by setting `min_count=1`.

In [10]:
model = Word2Vec(sent, min_count=1, vector_size=50, workers=3, window=3, sg=1)

Save the model

In [16]:
model.save("word2vec.model")

Load the model

In [17]:
model = Word2Vec.load("word2vec.model")

After training the word2vec model, we can obtain a word embedding directly from the trained model as follows.

In [18]:
model.wv['Toyota Camry']

array([-0.01062947,  0.11961275,  0.02979935, -0.08574718, -0.06292428,
       -0.22584412,  0.01034379,  0.28048238, -0.09596439, -0.10089288,
        0.06185615,  0.04831436,  0.08525883, -0.04098298, -0.02881935,
        0.18914147,  0.14610888,  0.2954876 , -0.13336119, -0.3463727 ,
       -0.02103316, -0.04603435,  0.29187238,  0.07342483,  0.17763135,
        0.00159852, -0.05606321,  0.40010315, -0.04461393, -0.04021346,
        0.02497271,  0.05991193,  0.04810696,  0.01428093,  0.09551513,
       -0.11588804,  0.19436751, -0.00469545,  0.04907812,  0.0961762 ,
        0.11089811, -0.04799738, -0.23201576,  0.09111656,  0.38385868,
        0.06561734, -0.05103017, -0.16176185,  0.03146917,  0.02458125],
      dtype=float32)

In [21]:
sims = model.wv.most_similar('Toyota Camry', topn=10)
sims

[('Suzuki Verona', 0.9866250157356262),
 ('Kia Optima', 0.9836329817771912),
 ('Nissan Altima', 0.9832392930984497),
 ('Mazda 626', 0.9830097556114197),
 ('Nissan Sentra', 0.982242226600647),
 ('Mazda 6', 0.9815133810043335),
 ('Oldsmobile Cutlass Ciera', 0.9799666404724121),
 ('Dodge Stratus', 0.9799144864082336),
 ('Honda Accord Hybrid', 0.9797704219818115),
 ('Subaru Legacy', 0.9795089960098267)]
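The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reproduces that kind of ranking in plain Python on made-up two-dimensional vectors. The names `cosine`, `rank_neighbours`, and `toy_vectors` are hypothetical, and this is a simplified illustration rather than Gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_neighbours(vectors, query, topn):
    """Score every other word against `query` and return the `topn` best matches."""
    scores = [(word, cosine(vectors[query], vec))
              for word, vec in vectors.items() if word != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 2-D vectors, chosen only to make the geometry easy to follow.
toy_vectors = {
    "sedan_a": [1.0, 0.0],
    "sedan_b": [0.9, 0.1],
    "truck_a": [0.0, 1.0],
}

ranking = rank_neighbours(toy_vectors, "sedan_a", topn=2)
print([word for word, _ in ranking])  # ['sedan_b', 'truck_a']
```

Vectors pointing in nearly the same direction score close to 1, which matches the high scores in the output above for cars that share most of their feature tokens with the Toyota Camry.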

# Bonus Task
Calculate similarity between two words