<h1 align='center'>Vectorization </h1>

**Corpus**: Concatenation of the text of all the samples<br>
**Vocabulary**: Collection of unique words from the corpus <br>
**Document**: The individual text associated with each sample

***Word Vectorization***<br>
It's a technique used in NLP in which words are represented as real-valued vectors that encode their meaning, so that words with similar meanings lie closer together in the vector space.
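
To make "closer in the vector space" concrete, here is a minimal sketch that compares words with cosine similarity. The 2-D vectors are hand-picked for illustration only (they are an assumption, not learned from any data), and `toy_vectors` and `cosine_similarity` are hypothetical names.

```python
import math

# Hand-picked 2-D vectors, purely illustrative (not learned from any data).
toy_vectors = {
    "strong": [0.9, 0.8],
    "tough": [0.8, 0.9],
    "weak": [-0.7, -0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_strong_tough = cosine_similarity(toy_vectors["strong"], toy_vectors["tough"])
sim_strong_weak = cosine_similarity(toy_vectors["strong"], toy_vectors["weak"])
print("strong vs tough:", round(sim_strong_tough, 3))
print("strong vs weak :", round(sim_strong_weak, 3))
```

With these toy values, "strong" scores higher against "tough" than against "weak", which is the behaviour a good word embedding is expected to show.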

In [13]:
import pandas as pd
df=pd.read_csv("ExampleData.csv")
df

Unnamed: 0,document,text
0,d1,tough times create strong men
1,d2,strong men create easy times
2,d3,easy times create weak men
3,d4,weak men create tough times


In [18]:
corpus=[]
for i in range(df.shape[0]):
    corpus.extend(df.loc[i].text.split(' '))
vocabulary=list(set(corpus))
print("Corpus : \n",corpus)
print("Vocabulary : \n",vocabulary)

Corpus : 
 ['tough', 'times', 'create', 'strong', 'men', 'strong', 'men', 'create', 'easy', 'times', 'easy', 'times', 'create', 'weak', 'men', 'weak', 'men', 'create', 'tough', 'times']
Vocabulary : 
 ['tough', 'times', 'strong', 'create', 'easy', 'men', 'weak']


<h1 align ='center'> One-hot encoding </h1>

Here each word of the vocabulary is represented by 1 if it is present in the document and by 0 otherwise.<br>
Each document is thus transformed into a k-dimensional representation, where k is the total number of unique words in the vocabulary.


In [44]:
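def OHE(df):
    # Add one 0/1 column per vocabulary word (uses the global `vocabulary`).
    ohe_df=df.copy()
    for i in range(df.shape[0]):
        # Compare whole tokens; a substring test would also match e.g. "men" inside "women".
        tokens=set(df.loc[i].text.split(' '))
        for j in vocabulary:
            if j in tokens:
                ohe_df.loc[i,j]=1
            else:
                ohe_df.loc[i,j]=0

    return ohe_df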
ohe_df=OHE(df)

In [45]:
ohe_df

Unnamed: 0,document,text,tough,times,strong,create,easy,men,weak
0,d1,tough times create strong men,1.0,1.0,1.0,1.0,0.0,1.0,0.0
1,d2,strong men create easy times,0.0,1.0,1.0,1.0,1.0,1.0,0.0
2,d3,easy times create weak men,0.0,1.0,0.0,1.0,1.0,1.0,1.0
3,d4,weak men create tough times,1.0,1.0,0.0,1.0,0.0,1.0,1.0


**Pros**<br>
1. Simple and intuitive
2. Easy to implement

**Cons**<br>
1. Sparsity
2. No fixed size.
3. OOV issue: can't handle new words.
4. No capture of semantics
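
One common reading of the "no fixed size" drawback is that, when every word gets its own one-hot vector, a document becomes a matrix with one row per word, so documents of different lengths produce inputs of different shapes. The sketch below illustrates this under that assumption; `one_hot_words` is a hypothetical helper and `"tough times"` is a made-up shorter document.

```python
# Fixed vocabulary order: the sorted unique words of the example corpus.
vocabulary = ["create", "easy", "men", "strong", "times", "tough", "weak"]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot_words(text):
    """Return one k-dimensional one-hot vector per word in `text`."""
    rows = []
    for word in text.split():
        row = [0] * len(vocabulary)
        row[index[word]] = 1  # raises KeyError for an out-of-vocabulary word
        rows.append(row)
    return rows

short_doc = one_hot_words("tough times")                   # hypothetical 2-word document
long_doc = one_hot_words("tough times create strong men")  # document d1
print(len(short_doc), "x", len(vocabulary))
print(len(long_doc), "x", len(vocabulary))
```

The `KeyError` on an unseen word is also a direct view of the OOV issue listed above.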

<h1 align ='center'> Bag of Words </h1>

Here each document is represented by a vector whose entries are the number of times each vocabulary word occurs in that document.<br>

In [49]:
df_=pd.read_csv('ExampleData_2.csv')
df_

Unnamed: 0,document,text
0,d1,"tough times create strong men, strong men crea..."
1,d2,"easy times create weak men, weak men create to..."


In [67]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer()
bow=cv.fit_transform(df_['text'])
cv.vocabulary_

{'tough': 5,
 'times': 4,
 'create': 0,
 'strong': 3,
 'men': 2,
 'easy': 1,
 'weak': 6}

In [68]:
print("Text : ", df_.loc[0].text)
print("Encoded text : ", bow[0].toarray())

print("Text : ", df_.loc[1].text)
print("Encoded text : ", bow[1].toarray())

Text :  tough times create strong men, strong men create easy times
Encoded text :  [[2 1 2 2 2 1 0]]
Text :  easy times create weak men, weak men create tough times
Encoded text :  [[2 1 2 0 2 1 2]]
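
The encoded vectors above can be reproduced by hand, which may help clarify what is being counted. This is only a sketch: `tokenize` and `bow_vector` are hypothetical helpers, and the regular expression is meant to approximate `CountVectorizer`'s default tokenisation (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

docs = [
    "tough times create strong men, strong men create easy times",
    "easy times create weak men, weak men create tough times",
]

def tokenize(text):
    # Lowercase and keep runs of 2+ word characters, which drops the comma.
    return re.findall(r"\b\w\w+\b", text.lower())

# Alphabetically sorted vocabulary, mirroring the indices in cv.vocabulary_.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

def bow_vector(text):
    """Count how often each vocabulary word occurs in `text`."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

print(vocab)
for doc in docs:
    print(bow_vector(doc))
```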


In [77]:
def BOW(df):
    cv=CountVectorizer()
    bow=cv.fit_transform(df['text']).toarray()
    bow_=pd.DataFrame(bow, columns=cv.get_feature_names_out())
    bow_df=pd.concat([df, bow_],axis=1)
    return bow_df
bow_df=BOW(df_)

In [78]:
bow_df

Unnamed: 0,document,text,create,easy,men,strong,times,tough,weak
0,d1,"tough times create strong men, strong men crea...",2,1,2,2,2,1,0
1,d2,"easy times create weak men, weak men create to...",2,1,2,0,2,1,2


**Pros**<br>
1. Simple
2. Intuitive
3. OOV words do not break the encoding: words outside the vocabulary are simply dropped

**Cons**<br>
1. Sparsity
2. Semantic meaning is not captured, although keeping word counts still makes it better than OHE.
3. Because OOV words are ignored, any importance they carry is not captured.
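
The OOV behaviour can be sketched without scikit-learn. The `encode` helper below is hypothetical and only mimics how a fitted count vectorizer treats unseen words at transform time; "women" and "peaceful" are made-up out-of-vocabulary words.

```python
import re
from collections import Counter

# Vocabulary learned from the training documents (sorted, as in the tables above).
vocab = ["create", "easy", "men", "strong", "times", "tough", "weak"]

def encode(text):
    """Count only in-vocabulary tokens; anything else is silently dropped."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter(tok for tok in tokens if tok in vocab)
    return [counts[word] for word in vocab]

# "women" and "peaceful" are hypothetical unseen words.
new_doc = "strong women create peaceful times"
print(encode(new_doc))
```

Only three of the five words contribute to the vector, so whatever "women" and "peaceful" meant for this document is lost.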

<h1 align='center'> Bag of Words with n-grams </h1>

In [90]:
def ngramBOW(df,tup):
    cv=CountVectorizer(ngram_range=tup)
    bow=cv.fit_transform(df['text']).toarray()
    bow_=pd.DataFrame(bow, columns=cv.get_feature_names_out())
    bow_df=pd.concat([df, bow_],axis=1)
    return bow_df
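
To make the `ngram_range` argument concrete: an n-gram is a run of n consecutive words, and `ngram_range=(a, b)` keeps every n-gram with a ≤ n ≤ b. The following is a minimal sketch of how word n-grams are formed; `word_ngrams` is a hypothetical helper that uses plain whitespace splitting rather than scikit-learn's tokeniser.

```python
def word_ngrams(text, n):
    """Return the contiguous n-word sequences in `text`, in order."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "tough times create strong men"
for n in (1, 2, 3):
    print(n, word_ngrams(sentence, n))
```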

### Unigram

In [94]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(ngram_range=(1, 1))
bow=cv.fit_transform(df['text'])
print(cv.vocabulary_)

{'tough': 5, 'times': 4, 'create': 0, 'strong': 3, 'men': 2, 'easy': 1, 'weak': 6}


In [95]:
bow_df_unigram=ngramBOW(df,tup=(1,1))
bow_df_unigram

Unnamed: 0,document,text,create,easy,men,strong,times,tough,weak
0,d1,tough times create strong men,1,0,1,1,1,1,0
1,d2,strong men create easy times,1,1,1,1,1,0,0
2,d3,easy times create weak men,1,1,1,0,1,0,1
3,d4,weak men create tough times,1,0,1,0,1,1,1


### Bi-gram

In [96]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(ngram_range=(2, 2))
bow=cv.fit_transform(df['text'])
print(cv.vocabulary_)

{'tough times': 8, 'times create': 7, 'create strong': 1, 'strong men': 6, 'men create': 5, 'create easy': 0, 'easy times': 4, 'create weak': 3, 'weak men': 9, 'create tough': 2}


In [97]:
bow_df_bigram=ngramBOW(df,tup=(2,2))
bow_df_bigram

Unnamed: 0,document,text,create easy,create strong,create tough,create weak,easy times,men create,strong men,times create,tough times,weak men
0,d1,tough times create strong men,0,1,0,0,0,0,1,1,1,0
1,d2,strong men create easy times,1,0,0,0,1,1,1,0,0,0
2,d3,easy times create weak men,0,0,0,1,1,0,0,1,0,1
3,d4,weak men create tough times,0,0,1,0,0,1,0,0,1,1


### Tri-gram

In [83]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(ngram_range=(3, 3))
bow=cv.fit_transform(df['text'])
print(cv.vocabulary_)

{'tough times create': 10, 'times create strong': 8, 'create strong men': 1, 'strong men create': 7, 'men create easy': 5, 'create easy times': 0, 'easy times create': 4, 'times create weak': 9, 'create weak men': 3, 'weak men create': 11, 'men create tough': 6, 'create tough times': 2}


In [98]:
bow_df_trigram=ngramBOW(df,tup=(3,3))
bow_df_trigram

Unnamed: 0,document,text,create easy times,create strong men,create tough times,create weak men,easy times create,men create easy,men create tough,strong men create,times create strong,times create weak,tough times create,weak men create
0,d1,tough times create strong men,0,1,0,0,0,0,0,0,1,0,1,0
1,d2,strong men create easy times,1,0,0,0,0,1,0,1,0,0,0,0
2,d3,easy times create weak men,0,0,0,1,1,0,0,0,0,1,0,0
3,d4,weak men create tough times,0,0,1,0,0,0,1,0,0,0,0,1


### Unigram+Bigram

In [84]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(ngram_range=(1, 2))
bow=cv.fit_transform(df['text'])
print(cv.vocabulary_)

{'tough': 13, 'times': 11, 'create': 0, 'strong': 9, 'men': 7, 'tough times': 14, 'times create': 12, 'create strong': 2, 'strong men': 10, 'easy': 5, 'men create': 8, 'create easy': 1, 'easy times': 6, 'weak': 15, 'create weak': 4, 'weak men': 16, 'create tough': 3}


In [99]:
bow_df_unigram_bigram=ngramBOW(df,tup=(1,2))
bow_df_unigram_bigram

Unnamed: 0,document,text,create,create easy,create strong,create tough,create weak,easy,easy times,men,men create,strong,strong men,times,times create,tough,tough times,weak,weak men
0,d1,tough times create strong men,1,0,1,0,0,0,0,1,0,1,1,1,1,1,1,0,0
1,d2,strong men create easy times,1,1,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0
2,d3,easy times create weak men,1,0,0,0,1,1,1,1,0,0,0,1,1,0,0,1,1
3,d4,weak men create tough times,1,0,0,1,0,0,0,1,1,0,0,1,0,1,1,1,1


### Unigram+Bigram+Trigram

In [85]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(ngram_range=(1, 3))
bow=cv.fit_transform(df['text'])
print(cv.vocabulary_)

{'tough': 23, 'times': 19, 'create': 0, 'strong': 16, 'men': 12, 'tough times': 24, 'times create': 20, 'create strong': 3, 'strong men': 17, 'tough times create': 25, 'times create strong': 21, 'create strong men': 4, 'easy': 9, 'men create': 13, 'create easy': 1, 'easy times': 10, 'strong men create': 18, 'men create easy': 14, 'create easy times': 2, 'weak': 26, 'create weak': 7, 'weak men': 27, 'easy times create': 11, 'times create weak': 22, 'create weak men': 8, 'create tough': 5, 'weak men create': 28, 'men create tough': 15, 'create tough times': 6}


In [100]:
bow_df_n_gram=ngramBOW(df,tup=(1,3))
bow_df_n_gram

Unnamed: 0,document,text,create,create easy,create easy times,create strong,create strong men,create tough,create tough times,create weak,...,times,times create,times create strong,times create weak,tough,tough times,tough times create,weak,weak men,weak men create
0,d1,tough times create strong men,1,0,0,1,1,0,0,0,...,1,1,1,0,1,1,1,0,0,0
1,d2,strong men create easy times,1,1,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,d3,easy times create weak men,1,0,0,0,0,0,0,1,...,1,1,0,1,0,0,0,1,1,0
3,d4,weak men create tough times,1,0,0,0,0,1,1,0,...,1,0,0,0,1,1,0,1,1,1


**Pros**<br>
1. Simple
2. Intuitive
3. Some semantics are captured, since n-grams preserve local word order.

**Cons**<br>
1. Sparsity
2. Dimension increases as the value of n increases.
3. OOV words are ignored, so any importance they carry is not captured.
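
The growth in dimension can be checked directly on the four example documents. The sketch below counts distinct n-grams for each range used above; `ngram_vocabulary` is a hypothetical helper that splits on whitespace, which is enough for this punctuation-free corpus.

```python
docs = [
    "tough times create strong men",
    "strong men create easy times",
    "easy times create weak men",
    "weak men create tough times",
]

def ngram_vocabulary(docs, min_n, max_n):
    """Collect the distinct word n-grams for every n in [min_n, max_n]."""
    vocab = set()
    for doc in docs:
        tokens = doc.split()
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                vocab.add(" ".join(tokens[i:i + n]))
    return vocab

for ngram_range in [(1, 1), (2, 2), (3, 3), (1, 2), (1, 3)]:
    size = len(ngram_vocabulary(docs, *ngram_range))
    print(ngram_range, "->", size, "features")
```

Even on this tiny corpus the feature count goes from 7 for unigrams to 29 for the (1, 3) range.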

<h1 align='center'> TF-IDF </h1>

Intuition: If a word occurs frequently in a given document but rarely in the rest of the corpus, it is important for capturing the essence of that document, so it is given more weight for that document.

***Term Frequency (tf):*** <br>TF measures how often a word occurs in a document. It is the ratio of the number of times the word appears in a document to the total number of words in that document, so it increases as the word occurs more often within the document.<br>
$TF(t, d) = \frac{\text{Number of occurrences of term } t \text{ in document } d}{\text{Total number of terms in document } d}$ <br>

***Inverse Document Frequency (idf):*** IDF measures how rare a word is across all documents in the corpus. Words that occur in few documents have a high IDF score.<br>
$IDF(t) = \log\frac{\text{Total number of documents in the corpus}}{\text{Number of documents containing term } t}$
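
As a worked example of the two formulas, the sketch below scores "tough" and "create" in document d1 of the four-document example. It assumes the natural logarithm (the formula does not fix a base), and `tf` and `idf` are hypothetical helper names.

```python
import math

docs = {
    "d1": "tough times create strong men",
    "d2": "strong men create easy times",
    "d3": "easy times create weak men",
    "d4": "weak men create tough times",
}

def tf(term, doc):
    """Occurrences of `term` divided by the number of terms in `doc`."""
    tokens = doc.split()
    return tokens.count(term) / len(tokens)

def idf(term, docs):
    """log(N / number of documents containing `term`); natural log assumed."""
    n_containing = sum(1 for doc in docs.values() if term in doc.split())
    return math.log(len(docs) / n_containing)  # fails if the term never occurs

for term in ("tough", "create"):
    score = tf(term, docs["d1"]) * idf(term, docs)
    print(term, round(score, 4))
```

"create" appears in every document, so its IDF is log(4/4) = 0 and its score vanishes, while "tough" appears in only two documents and keeps a positive weight.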

In [111]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer()
tfidf_data=tfidf.fit_transform(df['text']).toarray()
tfidf.vocabulary_

{'tough': 5,
 'times': 4,
 'create': 0,
 'strong': 3,
 'men': 2,
 'easy': 1,
 'weak': 6}

In [112]:
def TFIDF(df):
    tfidf=TfidfVectorizer()
    tfidf_data=tfidf.fit_transform(df['text']).toarray()
    tfidf_df=pd.DataFrame(tfidf_data, columns=tfidf.get_feature_names_out())
    tfidf_df=pd.concat([df, tfidf_df],axis=1)
    return tfidf_df
tfidf_df = TFIDF(df)

In [113]:
tfidf_df=TFIDF(df)
tfidf_df

Unnamed: 0,document,text,create,easy,men,strong,times,tough,weak
0,d1,tough times create strong men,0.363572,0.0,0.363572,0.549294,0.363572,0.549294,0.0
1,d2,strong men create easy times,0.363572,0.549294,0.363572,0.549294,0.363572,0.0,0.0
2,d3,easy times create weak men,0.363572,0.549294,0.363572,0.0,0.363572,0.0,0.549294
3,d4,weak men create tough times,0.363572,0.0,0.363572,0.0,0.363572,0.549294,0.549294
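
The values in this table do not follow directly from the plain TF and IDF formulas given earlier. With its default settings, `TfidfVectorizer` uses raw counts as TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and then scales each row to unit L2 norm. The sketch below reproduces the first row by hand under those defaults; `smoothed_idf` and `tfidf_row` are hypothetical helper names.

```python
import math

docs = [
    "tough times create strong men",
    "strong men create easy times",
    "easy times create weak men",
    "weak men create tough times",
]
vocab = sorted({tok for doc in docs for tok in doc.split()})
n_docs = len(docs)

def smoothed_idf(term):
    """ln((1 + N) / (1 + df)) + 1, where df is the term's document frequency."""
    df = sum(1 for doc in docs if term in doc.split())
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    """Raw count times smoothed IDF, followed by L2 normalisation of the row."""
    tokens = doc.split()
    raw = [tokens.count(term) * smoothed_idf(term) for term in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

print(vocab)
print([round(x, 6) for x in tfidf_row(docs[0])])
```

Words present in all four documents ("create", "men", "times") get the minimum IDF of 1, which is why they share the lower weight in every row.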


**Pros**<br>
1. Simple
2. Intuitive
3. The importance of words is reflected in their weights

**Cons**<br>
1. Sparsity
2. Dimension increases as the size of the vocabulary increases.
3. OOV words are ignored, so any importance they carry is not captured.