# Word To Vector using Gensim

## About Gensim
Gensim is an open-source library for unsupervised topic modeling, document indexing, retrieval by similarity, and other natural language processing functionalities, using modern statistical machine learning. Gensim is implemented in Python and Cython for performance.

## What is Word2Vector Embedding?
Word2vec is a technique for natural language processing (NLP) published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence.

## Gensim installation and loading the word embedding

In [1]:
#! pip install gensim

Collecting gensim
  Downloading gensim-4.3.2-cp310-cp310-macosx_11_0_arm64.whl.metadata (8.4 kB)
Collecting scipy>=1.7.0 (from gensim)
  Downloading scipy-1.11.3-cp310-cp310-macosx_12_0_arm64.whl.metadata (112 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m112.9/112.9 kB[0m [31m55.9 kB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hCollecting smart-open>=1.8.1 (from gensim)
  Downloading smart_open-6.4.0-py3-none-any.whl.metadata (21 kB)
Downloading gensim-4.3.2-cp310-cp310-macosx_11_0_arm64.whl (24.0 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.0/24.0 MB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m0m
[?25hDownloading scipy-1.11.3-cp310-cp310-macosx_12_0_arm64.whl (29.8 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m29.8/29.8 MB[0m [31m567.5 kB/s[0m eta [36m0:00:00[0m00:01[0m00:02[0m
[?25hDownloading smart_open-6.4.0-py3-none-any.whl (57 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [11]:
#! pip install chardet
import chardet
with open("/Users/aaryasoni/Desktop/DE_assessment/phrases.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))           #pd.read_csv was showing error due to windows encoding in MAC so to reslove this issue i detect the encoding and then specify it in read_csv function below
result

Collecting chardet
  Downloading chardet-5.2.0-py3-none-any.whl.metadata (3.4 kB)
Downloading chardet-5.2.0-py3-none-any.whl (199 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m199.4/199.4 kB[0m [31m972.1 kB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hInstalling collected packages: chardet
Successfully installed chardet-5.2.0


{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}

In [2]:
import pandas as pd                         #importing pandas and numpy
import numpy as np
df=pd.read_csv("phrases.csv",encoding="Windows-1252")

In [3]:
import gensim
from gensim.models import KeyedVectors
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True, limit=1000000)
wv.save_word2vec_format('vectors.csv')

## Assigning each word in each phrase a Word2Vec embedding

In [15]:
df2=pd.read_csv("vectors.csv", on_bad_lines='skip')                 #Feathing word vectors
df2.head(5)

Unnamed: 0,1000000 300
0,</s> 0.0011291504 -0.00089645386 0.00031852722...
1,in 0.0703125 0.08691406 0.087890625 0.0625 0.0...
2,for -0.011779785 -0.04736328 0.044677734 0.063...
3,that -0.01574707 -0.028320312 0.083496094 0.05...
4,is 0.0070495605 -0.07324219 0.171875 0.0225830...


In [16]:
df2.columns                                                         #getting name of columns

Index(['1000000 300'], dtype='object')

In [18]:
c=dict()                                                            # saving raw data into a dictionary with word as key and feature vector as value
for i in df2['1000000 300']:
    s=i.split()
    for j in range(1,len(s)):
        s[j]=float(s[j])
    c[s[0]]=s[1:]


In [20]:
len(c['is'])                                                        #300 features in a feature vector

300

## Calculating Cosine distance of each phrase to all other phrases and storing results.

In [27]:
def avg_sentence_vector(words, model):                              #function to average all words vectors in a given paragraph
    featureVec = np.zeros((300,), dtype="float32")                  #inicialize zero vector
    nwords = 0                                                      #to find number of words

    for word in words:
            if(word in model.keys()):
                nwords = nwords+1                                   #words increased
                featureVec = np.add(featureVec, model[word])        #adding vectors

    if nwords>0:
        featureVec = np.divide(featureVec, nwords)                  #averge vector
    return featureVec

In [21]:
df.columns                                                              #find columns of Phases dataframe

Index(['Phrases'], dtype='object')

In [72]:
from sklearn.metrics.pairwise import cosine_similarity                      #finding cosine similarity of each phase in the phase.csv with each other
c1=dict()                                                                   #dict for storing phase and its cosine similarity
for i in df['Phrases']:                                                 
    l=[]
    for j in df['Phrases']:
        if(i==j):
            continue
        s1=i.split()
        s2=j.split()
        x=avg_sentence_vector(s1, c)                                        #x is average feature vector of the word i
        y=avg_sentence_vector(s2, c)                                        #y is average feature vector of the word y
        l.append(1-cosine_similarity(x.reshape(-1, 1),y.reshape(-1, 1)))    #cosine distance is (1-cosine similarity)
    l=np.array(l)                                                           
    c1[i]=l                                                                 #saving results in dictionary

In [69]:
df3=pd.DataFrame()                                                          #saving the results in a csv file
df3["Phase"]=c1.keys()
df3["cosine_similarity"]=c1.values()
df3.to_csv("cosine_similarity.csv")

## Creating a function that takes any string, e.g. user-input phrase, and finds and return the closest match from phrases in phrases.csv and the distance

In [60]:
import math                                                                #function to find most similar function among phases in phases.csv
def find_closest(s):
    Max=math.inf
    c=0
    ans=""
    for i in df['Phrases']:
        c=wv.wmdistance(s, i)
        if(c<Max):
            ans=i
            Max=c
    return ans

In [67]:
#! pip3 install POT
print(find_closest('how firm differentiates to its employee?'))         #testing the function

What has the capacity movement of airline companies been over the years?


In [5]:
wv.save("w_to_v.model")

In [6]:
wv2=wv.load("w_to_v.model")