<h1 align='center'>Vectorization </h1>

**Corpus**: Concatenation of the text of all the samples<br>
**Vocabulary**: Collection of unique words from the corpus <br>
**Document**: The individual text associated with each sample

***Word Vectorization***<br>
It is a technique used in NLP in which words are represented as real-valued vectors that encode their meaning, so that words with similar meanings lie close together in the vector space.

In [13]:
import pandas as pd
df=pd.read_csv("ExampleData.csv")
df

Unnamed: 0,document,text
0,d1,tough times create strong men
1,d2,strong men create easy times
2,d3,easy times create weak men
3,d4,weak men create tough times


In [18]:
corpus=[]
for i in range(df.shape[0]):
    corpus.extend(df.loc[i].text.split(' '))
vocabulary=list(set(corpus))
print("Corpus : \n",corpus)
print("Vocabulary : \n",vocabulary)

Corpus : 
 ['tough', 'times', 'create', 'strong', 'men', 'strong', 'men', 'create', 'easy', 'times', 'easy', 'times', 'create', 'weak', 'men', 'weak', 'men', 'create', 'tough', 'times']
Vocabulary : 
 ['tough', 'times', 'strong', 'create', 'easy', 'men', 'weak']


<h1 align ='center'> One-hot encoding </h1>

Each word in the vocabulary is represented by 1 if it is present in the document and by 0 otherwise.<br>
Each document is thus transformed into a k-dimensional representation, where k is the total number of unique words in the vocabulary.


In [44]:
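def OHE(df):
    ohe_df=df.copy()
    for i in range(df.shape[0]):
        # Compare whole words, not substrings (e.g. 'men' must not match 'women')
        words=set(df.loc[i].text.split(' '))
        for j in vocabulary:
            # 1 if the vocabulary word occurs in the document, else 0
            ohe_df.loc[i,j]=1 if j in words else 0
    return ohe_df
ohe_df=OHE(df)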

In [45]:
ohe_df

Unnamed: 0,document,text,tough,times,strong,create,easy,men,weak
0,d1,tough times create strong men,1.0,1.0,1.0,1.0,0.0,1.0,0.0
1,d2,strong men create easy times,0.0,1.0,1.0,1.0,1.0,1.0,0.0
2,d3,easy times create weak men,0.0,1.0,0.0,1.0,1.0,1.0,1.0
3,d4,weak men create tough times,1.0,1.0,0.0,1.0,0.0,1.0,1.0


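**Pros**<br>
1. Simple and intuitive
2. Easy to implement

**Cons**<br>
1. Sparsity
2. No fixed size: the vector length equals the vocabulary size, so it changes whenever the vocabulary changes.
3. OOV (out-of-vocabulary) issue: new words cannot be handled.
4. No capture of semantics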

<h1 align ='center'> Bag of Words </h1>

Here each document is represented by a vector whose entries are the frequencies (counts) of the vocabulary words in that document.<br>

In [49]:
df_=pd.read_csv('ExampleData_2.csv')
df_

Unnamed: 0,document,text
0,d1,"tough times create strong men, strong men crea..."
1,d2,"easy times create weak men, weak men create to..."


In [67]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer()
bow=cv.fit_transform(df_['text'])
cv.vocabulary_

{'tough': 5,
 'times': 4,
 'create': 0,
 'strong': 3,
 'men': 2,
 'easy': 1,
 'weak': 6}

In [68]:
print("Text : ", df_.loc[0].text)
print("Encoded text : ", bow[0].toarray())

print("Text : ", df_.loc[1].text)
print("Encoded text : ", bow[1].toarray())

Text :  tough times create strong men, strong men create easy times
Encoded text :  [[2 1 2 2 2 1 0]]
Text :  easy times create weak men, weak men create tough times
Encoded text :  [[2 1 2 0 2 1 2]]


In [77]:
def BOW(df):
    cv=CountVectorizer()
    bow=cv.fit_transform(df['text']).toarray()
    bow_=pd.DataFrame(bow, columns=cv.get_feature_names_out())
    bow_df=pd.concat([df, bow_],axis=1)
    return bow_df
bow_df=BOW(df_)

In [78]:
bow_df

Unnamed: 0,document,text,create,easy,men,strong,times,tough,weak
0,d1,"tough times create strong men, strong men crea...",2,1,2,2,2,1,0
1,d2,"easy times create weak men, weak men create to...",2,1,2,0,2,1,2


**Pros**<br>
1. Simple
2. Intuitive 
3. OOV words do not cause errors; they are simply skipped when new text is transformed.

**Cons**<br>
1. Sparsity
2. Semantic meaning is not captured, although word counts carry more information than OHE.
3. OOV words are ignored, so any importance they carry is not captured.

<h1 align='center'> Bag of Words with n-grams </h1>

In [90]:
def ngramBOW(df,tup):
    cv=CountVectorizer(ngram_range=tup)
    bow=cv.fit_transform(df['text']).toarray()
    bow_=pd.DataFrame(bow, columns=cv.get_feature_names_out())
    bow_df=pd.concat([df, bow_],axis=1)
    return bow_df
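
An n-gram is a sequence of n consecutive words. The sketch below shows how the n-grams of one sentence can be enumerated by hand. `word_ngrams` is a hypothetical helper that uses plain whitespace splitting, which is simpler than the tokenization `CountVectorizer` applies.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in `text`."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "tough times create strong men"
unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

Passing `ngram_range=(1, 2)` to the vectorizer corresponds to using the unigram and bigram lists together as features.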

### Unigram

In [94]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(ngram_range=(1, 1))
bow=cv.fit_transform(df['text'])
print(cv.vocabulary_)

{'tough': 5, 'times': 4, 'create': 0, 'strong': 3, 'men': 2, 'easy': 1, 'weak': 6}


In [95]:
bow_df_unigram=ngramBOW(df,tup=(1,1))
bow_df_unigram

Unnamed: 0,document,text,create,easy,men,strong,times,tough,weak
0,d1,tough times create strong men,1,0,1,1,1,1,0
1,d2,strong men create easy times,1,1,1,1,1,0,0
2,d3,easy times create weak men,1,1,1,0,1,0,1
3,d4,weak men create tough times,1,0,1,0,1,1,1


### Bi-gram

In [96]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(ngram_range=(2, 2))
bow=cv.fit_transform(df['text'])
print(cv.vocabulary_)

{'tough times': 8, 'times create': 7, 'create strong': 1, 'strong men': 6, 'men create': 5, 'create easy': 0, 'easy times': 4, 'create weak': 3, 'weak men': 9, 'create tough': 2}


In [97]:
bow_df_bigram=ngramBOW(df,tup=(2,2))
bow_df_bigram

Unnamed: 0,document,text,create easy,create strong,create tough,create weak,easy times,men create,strong men,times create,tough times,weak men
0,d1,tough times create strong men,0,1,0,0,0,0,1,1,1,0
1,d2,strong men create easy times,1,0,0,0,1,1,1,0,0,0
2,d3,easy times create weak men,0,0,0,1,1,0,0,1,0,1
3,d4,weak men create tough times,0,0,1,0,0,1,0,0,1,1


### Tri-gram

In [83]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(ngram_range=(3, 3))
bow=cv.fit_transform(df['text'])
print(cv.vocabulary_)

{'tough times create': 10, 'times create strong': 8, 'create strong men': 1, 'strong men create': 7, 'men create easy': 5, 'create easy times': 0, 'easy times create': 4, 'times create weak': 9, 'create weak men': 3, 'weak men create': 11, 'men create tough': 6, 'create tough times': 2}


In [98]:
bow_df_trigram=ngramBOW(df,tup=(3,3))
bow_df_trigram

Unnamed: 0,document,text,create easy times,create strong men,create tough times,create weak men,easy times create,men create easy,men create tough,strong men create,times create strong,times create weak,tough times create,weak men create
0,d1,tough times create strong men,0,1,0,0,0,0,0,0,1,0,1,0
1,d2,strong men create easy times,1,0,0,0,0,1,0,1,0,0,0,0
2,d3,easy times create weak men,0,0,0,1,1,0,0,0,0,1,0,0
3,d4,weak men create tough times,0,0,1,0,0,0,1,0,0,0,0,1


### Unigram+Bigram

In [84]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(ngram_range=(1, 2))
bow=cv.fit_transform(df['text'])
print(cv.vocabulary_)

{'tough': 13, 'times': 11, 'create': 0, 'strong': 9, 'men': 7, 'tough times': 14, 'times create': 12, 'create strong': 2, 'strong men': 10, 'easy': 5, 'men create': 8, 'create easy': 1, 'easy times': 6, 'weak': 15, 'create weak': 4, 'weak men': 16, 'create tough': 3}


In [99]:
bow_df_unigram_bigram=ngramBOW(df,tup=(1,2))
bow_df_unigram_bigram

Unnamed: 0,document,text,create,create easy,create strong,create tough,create weak,easy,easy times,men,men create,strong,strong men,times,times create,tough,tough times,weak,weak men
0,d1,tough times create strong men,1,0,1,0,0,0,0,1,0,1,1,1,1,1,1,0,0
1,d2,strong men create easy times,1,1,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0
2,d3,easy times create weak men,1,0,0,0,1,1,1,1,0,0,0,1,1,0,0,1,1
3,d4,weak men create tough times,1,0,0,1,0,0,0,1,1,0,0,1,0,1,1,1,1


### Unigram+Bigram+Trigram

In [85]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(ngram_range=(1, 3))
bow=cv.fit_transform(df['text'])
print(cv.vocabulary_)

{'tough': 23, 'times': 19, 'create': 0, 'strong': 16, 'men': 12, 'tough times': 24, 'times create': 20, 'create strong': 3, 'strong men': 17, 'tough times create': 25, 'times create strong': 21, 'create strong men': 4, 'easy': 9, 'men create': 13, 'create easy': 1, 'easy times': 10, 'strong men create': 18, 'men create easy': 14, 'create easy times': 2, 'weak': 26, 'create weak': 7, 'weak men': 27, 'easy times create': 11, 'times create weak': 22, 'create weak men': 8, 'create tough': 5, 'weak men create': 28, 'men create tough': 15, 'create tough times': 6}


In [100]:
bow_df_n_gram=ngramBOW(df,tup=(1,3))
bow_df_n_gram

Unnamed: 0,document,text,create,create easy,create easy times,create strong,create strong men,create tough,create tough times,create weak,...,times,times create,times create strong,times create weak,tough,tough times,tough times create,weak,weak men,weak men create
0,d1,tough times create strong men,1,0,0,1,1,0,0,0,...,1,1,1,0,1,1,1,0,0,0
1,d2,strong men create easy times,1,1,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,d3,easy times create weak men,1,0,0,0,0,0,0,1,...,1,1,0,1,0,0,0,1,1,0
3,d4,weak men create tough times,1,0,0,0,0,1,1,0,...,1,0,0,0,1,1,0,1,1,1


**Pros**<br>
1. Simple
2. Intuitive 
3. Semantics are captured to a small extent, since word order within each n-gram is preserved.

**Cons**<br>
1. Sparsity
2. Dimension increases as the value of n increases.
3. OOV words are ignored, so any importance they carry is not captured.

<h1 align='center'> TF-IDF </h1>

Intuition: If a word occurs frequently in a given document but rarely in the rest of the corpus, it is important for capturing the essence of that document, so it is given more weight for that document.

***Term Frequency (tf):*** <br>TF measures how often a word occurs in a document. It is the ratio of the number of times the word appears in a document to the total number of words in that document, so it increases as the word occurs more often within the document.<br>
$TF(t, d) = \frac{\text{Number of occurrences of term } t \text{ in document } d}{\text{Total number of terms in document } d}$ <br>

***Inverse Document Frequency (idf):*** IDF measures how rare a word is across all documents in the corpus. Words that occur in few documents have a high IDF score.<br>
$IDF(t) = \log\frac{\text{Total number of documents in the corpus}}{\text{Number of documents containing term } t}$

In [111]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer()
tfidf_data=tfidf.fit_transform(df['text']).toarray()
tfidf.vocabulary_

{'tough': 5,
 'times': 4,
 'create': 0,
 'strong': 3,
 'men': 2,
 'easy': 1,
 'weak': 6}

In [112]:
def TFIDF(df):
    tfidf=TfidfVectorizer()
    tfidf_data=tfidf.fit_transform(df['text']).toarray()
    tfidf_df=pd.DataFrame(tfidf_data, columns=tfidf.get_feature_names_out())
    tfidf_df=pd.concat([df, tfidf_df],axis=1)
    return tfidf_df
tfidf_df = TFIDF(df)

In [113]:
tfidf_df=TFIDF(df)
tfidf_df

Unnamed: 0,document,text,create,easy,men,strong,times,tough,weak
0,d1,tough times create strong men,0.363572,0.0,0.363572,0.549294,0.363572,0.549294,0.0
1,d2,strong men create easy times,0.363572,0.549294,0.363572,0.549294,0.363572,0.0,0.0
2,d3,easy times create weak men,0.363572,0.549294,0.363572,0.0,0.363572,0.0,0.549294
3,d4,weak men create tough times,0.363572,0.0,0.363572,0.0,0.363572,0.549294,0.549294
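
The values in the table above do not follow the plain TF and IDF formulas directly. With its default settings, `TfidfVectorizer` uses the raw term count as tf, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and then L2-normalizes each row. The sketch below reproduces the first row by hand under those assumptions. The names `doc_freq`, `idf`, and `tfidf_row` are illustrative only.

```python
import math

docs = [
    "tough times create strong men",
    "strong men create easy times",
    "easy times create weak men",
    "weak men create tough times",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Smoothed idf, assumed to mirror the scikit-learn default (smooth_idf=True).
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(doc):
    # Raw count times idf, followed by L2 normalization of the row.
    weights = [doc.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

tfidf_rows = [tfidf_row(doc) for doc in tokenized]
print(vocab)
print([round(w, 6) for w in tfidf_rows[0]])
```

Words that appear in all four documents (`create`, `men`, `times`) get the minimum idf of 1, which is why they end up with the lower weight 0.363572, while words found in only two documents get 0.549294.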


**Pros**<br>
1. Simple
2. Intuitive 
3. The importance of words is handled well

**Cons**<br>
1. Sparsity
2. Dimension increases as the size of the vocabulary increases.
3. OOV words are ignored, so any importance they carry is not captured.

<h1 align='center'> Word Embeddings </h1>

In NLP, a word embedding is a representation of a word for text analysis, typically a real-valued vector that encodes the word's meaning in such a way that words that are closer in the vector space are expected to be similar in meaning.

Need for vector embeddings:

1. Methods like OHE, BoW, or n-grams produce a matrix whose width grows with the number of unique words (or n-grams) in the corpus, which can become very large. Embedding models like Word2Vec, OpenAIEmbedding, and all-mpnet-base-v2 have a fixed embedding size, so the dimension does not grow with the vocabulary and computation is typically much faster.

2. The vector representation produced is dense, so the sparsity problem goes away, which tends to make the NLP model less prone to overfitting.

3. Semantic similarity is captured much better. Unlike the other methods, embeddings can identify happy and joy as similar words.
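
"Closer in the vector space" is usually measured with cosine similarity. The sketch below uses made-up 4-dimensional vectors (not output from any real model) to show how related words score higher than unrelated ones. `cosine_similarity` and `toy_vectors` are illustrative names.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-picked toy vectors, for illustration only.
toy_vectors = {
    "happy": np.array([0.9, 0.8, 0.1, 0.0]),
    "joy":   np.array([0.8, 0.9, 0.2, 0.1]),
    "tuna":  np.array([0.0, 0.1, 0.9, 0.8]),
}

sim_happy_joy = cosine_similarity(toy_vectors["happy"], toy_vectors["joy"])
sim_happy_tuna = cosine_similarity(toy_vectors["happy"], toy_vectors["tuna"])

print("happy vs joy :", round(sim_happy_joy, 3))
print("happy vs tuna:", round(sim_happy_tuna, 3))
```

With OHE or BoW vectors, two different words always sit on separate axes, so this kind of graded similarity between words is not available.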

### Technique 1 : Using pre-trained embedding model

**Word2Vec**:<br>

Word2Vec is a neural network model for vectorizing the words in a text corpus. It was created by Google engineers in 2013 and popularized neural word embeddings. The published pre-trained embeddings were generated by training the network on a Google News dataset of about 100 billion words, and they cover a vocabulary of about 3 million words and phrases.<br>

- Paper : https://arxiv.org/pdf/1310.4546.pdf
- Paper : https://arxiv.org/pdf/1309.4168v1.pdf
- Code : https://drive.google.com/file/d/1V4fnxaKMJ5-vz1LddRa-QvRFCKxJk6F6/view?usp=drive_link
- TF Blog : https://www.tensorflow.org/text/tutorials/word2vec

***Intuition:***<br>

The Word2Vec model learns a set of features and assigns each word a value for every feature. We do not get to know what the features mean, because the model does not name them; we only get the values. <br>

![Alt text](image.png) <br>

Suppose the features that the model has learned are greeting, Gender, ... and Number, and that each word is assigned a value for each of these features. From this image, the vector representation of the word Tuna would be [0.01,0.09,0.01,0.99,0.92,0.39,0.00].

There are two Word2Vec architectures: Continuous Bag of Words (CBOW) and skip-gram. In both, we set up a fake (dummy) prediction problem, train a network to solve it, and obtain the vector embeddings (the feature values for each word) as a by-product. As a rule of thumb reported by the Word2Vec authors, skip-gram works well with small amounts of training data and represents rare words well, whereas CBOW trains faster and is slightly more accurate for frequent words.

#### Architecture 1 -  CBOW:

The dummy problem used in CBOW is as follows: given a large text corpus, we slide a window of three words over it and predict the middle word, taking the first and third words as context. So the first and third words are the training data (features) and the middle word is the target. Example:<br>
Text = 'Tough times create strong men'<br>

    features(X1) = [(tough,create)] , target = times
    features(X2) = [(times,strong)] , target = create
    features(X3) = [(create,men)] ,   target = strong

Now, we will perform one-hot encoding to represent these words in vector format.<br>
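
The window construction and the one-hot step can be sketched in a few lines. This is an illustration of how the training examples are built, not of Word2Vec's actual training code; `cbow_pairs` and `one_hot` are hypothetical names, and the vocabulary is simply sorted alphabetically.

```python
import numpy as np

text = "tough times create strong men"
tokens = text.split()

# Slide a window of three words: the outer two form the context,
# the middle word is the target.
cbow_pairs = [((tokens[i - 1], tokens[i + 1]), tokens[i])
              for i in range(1, len(tokens) - 1)]

vocab = sorted(set(tokens))
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Vector of zeros with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

for context, target in cbow_pairs:
    print(context, "->", target)
print(vocab)
print("times =", one_hot("times"))
```

The three pairs printed match X1 to X3 above; each context word and target is then fed to the network in its one-hot form.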

![Alt text](image-1.png)

The above diagram represents a neural network built to solve the fake problem. Here the hidden layer contains 3 neurons, so each word gets a 3-dimensional embedding. The number of hidden neurons is the embedding size and is chosen independently of the window size; in this example both happen to be 3. Each word (tough, times, ..., men) is represented by the collection of weights connecting it to the hidden layer. That is how we get the vector representation of each word.
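
Why the weights act as the word vector can be seen from the matrix product: multiplying a one-hot input by the weight matrix just selects one row. The sketch below assumes the common convention of reading embeddings from the input-to-hidden weights and uses random numbers as a stand-in for trained weights. `W_in` and `embedding_dim` are illustrative names.

```python
import numpy as np

vocab = ["create", "men", "strong", "times", "tough"]
embedding_dim = 3  # number of hidden neurons in this sketch

# Random stand-in for trained input-to-hidden weights (one row per word).
rng = np.random.default_rng(0)
W_in = rng.normal(size=(len(vocab), embedding_dim))

one_hot_times = np.zeros(len(vocab))
one_hot_times[vocab.index("times")] = 1.0

# A one-hot vector times the weight matrix picks out that word's row.
hidden_activation = one_hot_times @ W_in

print(hidden_activation)
print(np.allclose(hidden_activation, W_in[vocab.index("times")]))
```

After training, row `i` of the weight matrix is therefore used directly as the embedding of word `i`.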

#### Architecture 2 - skip-gram:

The dummy problem used in skip-gram is as follows: given a large text corpus, we slide a window of three words over it and predict the context words, taking the middle word as input. So the first and third words are the targets and the middle word is the feature. It is essentially the opposite of CBOW. Example:<br>

![Alt text](image-2.png)
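
For comparison with CBOW, the sketch below builds skip-gram training pairs from the same sentence and the same three-word window. Each middle word yields one (input, target) pair per context word. `skipgram_pairs` is an illustrative name, and this shows only how the examples are formed, not how the network is trained.

```python
text = "tough times create strong men"
tokens = text.split()

# For every middle word, emit one (input, target) pair per neighbour.
skipgram_pairs = []
for i in range(1, len(tokens) - 1):
    center = tokens[i]
    for context in (tokens[i - 1], tokens[i + 1]):
        skipgram_pairs.append((center, context))

for center, context in skipgram_pairs:
    print(center, "->", context)
```

The same three windows that gave three CBOW examples give six skip-gram examples here, because every context word becomes its own prediction target.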

### Technique 2: Training on own data to get embeddings

Here we will train a Word2Vec model on the Game of Thrones text to generate our own embeddings.

In [27]:
import os
import nltk
import gensim
import numpy as np 
import pandas as pd
nltk.download('punkt')
from nltk import sent_tokenize
import plotly.express as px
from sklearn.decomposition import PCA
from gensim.utils import simple_preprocess


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\krish\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [8]:
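story=[]
filename='GOT-Book_1.txt'
# Read the whole book; the context manager closes the file afterwards
with open(filename) as f:
    corpus= f.read()
# Split into sentences, then lowercase and tokenize each sentence
raw_sentence = sent_tokenize(corpus)
for sentence in raw_sentence:
    story.append(simple_preprocess(sentence))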

In [9]:
len(story)

27244

In [11]:
model = gensim.models.Word2Vec(
                            window=10,
                            vector_size=100,  # we will get the embedding size =100
                            min_count=2,
                            workers=4)

model.build_vocab(story)

In [14]:
model.train(story, total_examples=model.corpus_count, 
            epochs=model.epochs)


(1058787, 1423500)

In [19]:
model.wv.most_similar('baratheon')

[('warden', 0.9065202474594116),
 ('stage', 0.8709477186203003),
 ('dragonstone', 0.8660157918930054),
 ('name', 0.8567836284637451),
 ('eldest', 0.8557701706886292),
 ('wife', 0.8550875782966614),
 ('heir', 0.8504862785339355),
 ('council', 0.8501777648925781),
 ('protector', 0.8492279648780823),
 ('stannis', 0.848252534866333)]

In [20]:
model.wv.most_similar('wolf')

[('movement', 0.8486617207527161),
 ('hodor', 0.8439391255378723),
 ('warmth', 0.8419230580329895),
 ('darkness', 0.8397408127784729),
 ('silence', 0.8363795876502991),
 ('ghost', 0.8274149298667908),
 ('crow', 0.8244905471801758),
 ('stiv', 0.8227370977401733),
 ('freeze', 0.8195980191230774),
 ('lurch', 0.8130811452865601)]

In [22]:
model.wv.most_similar('lannister')

[('foster', 0.7443743944168091),
 ('kingslayer', 0.7383334636688232),
 ('household', 0.7175178527832031),
 ('conspired', 0.7165029048919678),
 ('personal', 0.7143693566322327),
 ('steward', 0.7116624712944031),
 ('jaime', 0.706831693649292),
 ('sons', 0.6976523995399475),
 ('host', 0.695743978023529),
 ('nephew', 0.6951910257339478)]

In [42]:
### If the word is not in the model's vocabulary, a KeyError is raised
### (simple_preprocess lowercases every token, so a capitalized key can never match)
model.wv.most_similar('Europe')

KeyError: "Key 'Europe' not present in vocabulary"

In [25]:
model.wv.similarity('arya','sansa')

0.93387604

In [26]:
model.wv['sansa']  ## vector representation of sansa

array([ 1.3777735 , -0.38697323, -0.609149  , -1.1972412 ,  1.3713096 ,
       -0.46447858, -0.64681154,  1.5312755 ,  0.23259486, -0.06502395,
       -0.74165606, -1.2367    ,  0.2597048 ,  0.19739881,  0.96973175,
        0.2833963 , -0.5170741 , -0.14250743,  1.6395868 , -1.6783292 ,
        0.8906768 , -0.6863455 , -0.11743952, -0.82964665, -1.0515867 ,
        1.495001  , -0.6738145 , -0.40878728,  0.21746342,  0.59670055,
        0.12087309, -0.6756956 ,  0.05202612, -1.5229514 ,  0.616399  ,
       -1.1564926 , -0.14506681,  0.62610126, -0.46736622, -0.2184238 ,
       -1.0106217 , -1.3874764 ,  0.7814562 ,  0.8717466 ,  1.486853  ,
       -0.06433678, -0.30219686, -0.994785  , -0.637019  ,  0.1812993 ,
       -0.7001283 ,  0.5464055 ,  1.1131701 , -0.48495632,  0.93450576,
        0.53115976, -0.3626475 , -0.0978311 ,  1.0700375 , -0.0595584 ,
       -0.29068387, -0.59739643,  0.8360576 ,  0.9681423 , -0.6011112 ,
        1.2211233 , -0.39477125,  0.535186  ,  0.04808044,  1.05

In [30]:
pca= PCA(n_components=3)
X= pca.fit_transform(model.wv.get_normed_vectors())
Y = model.wv.index_to_key
X.shape

(7432, 3)

In [41]:
fig= px.scatter_3d(X[:500], x=0,y=1,z=2, color=Y[:500])
fig.show()