# <b><u>PREPROCESSING II: SENTENCE TO VECTOR and WORD TO VECTOR</u></b>

In [1]:
import nltk
import sklearn
import copy
import re
import pandas as pd
import pprint
import numpy as np

pd.set_option('display.max_colwidth', None)
pp = pprint.PrettyPrinter(indent=4)

In [2]:
lemmatizer = nltk.stem.WordNetLemmatizer()
stemmer = nltk.stem.PorterStemmer()

# Requires the NLTK 'stopwords' and 'wordnet' data (e.g. via nltk.download)
STOPWORDS = set(nltk.corpus.stopwords.words('english')) # build the stopword set once instead of once per word

def processing1(sentence):
    sentence = re.sub(r'[^a-zA-Z0-9\s]', '', sentence) # remove punctuation and any other symbols
    sentence = sentence.lower() # convert everything to the same case to avoid redundancy due to mixed cases
    sentence = sentence.split() # list of words
    sentence = [word for word in sentence if word not in STOPWORDS] # remove unimportant stopwords
    # sentence = [stemmer.stem(word) for word in sentence] # stem to base words
    sentence = [lemmatizer.lemmatize(word) for word in sentence] # lemmatize to base words
    sentence = ' '.join(sentence) # back to a sentence from the processed list of words
    return sentence


In [3]:
paragraph = '''
I have three visions for India. 
In 3000 years of our history, people from all over the world have come and invaded us, captured our lands, conquered our minds. 
From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British, the French, the Dutch, all of them came and looted us, took over what was ours. 
Yet we have not done this to any other nation. We have not conquered anyone. 
We have not grabbed their land, their culture, their history and tried to enforce our way of life on them.
Why? Because we respect the freedom of others.
'''

## <b><u>Extracting sentences and Preprocessing I (cleanup): </u></b>

In [4]:
# Extract all sentences from paragraph. Each sentence will be turned into vector having some specific features (unique words).
sentences = nltk.sent_tokenize(paragraph)

# Processing 1
sentences = [processing1(sentence) for sentence in sentences]

df = pd.DataFrame()
df["Preprocessed Sentence"] = sentences

df

Unnamed: 0,Preprocessed Sentence
0,three vision india
1,3000 year history people world come invaded u captured land conquered mind
2,alexander onwards greek turk mogul portuguese british french dutch came looted u took
3,yet done nation
4,conquered anyone
5,grabbed land culture history tried enforce way life
6,
7,respect freedom others


## <b><u>ONE HOT ENCODING (sentence to vector based on EXISTENCE of words)</u></b>

In the one hot encoding (OHE) technique, we **consider WORDS as FEATURES (each unique word present in the entire corpus (collection of sentences) is treated as a unique feature)** <br>
Each sentence is represented by a vector **based on the "EXISTENCE" of each of the unique words (features) in the sentence**: an entry is 1 if the word occurs in the sentence and 0 otherwise.
<br><br>
For Example: <br>
Let `corpus = [ "orange is a fruit of orange colour", "carrot is a vegetable of orange colour" ]` <br>
After cleanup and preprocessing 1, <br>
`corpus = [ "orange fruit orange colour", "carrot vegetable orange colour" ]` <br>
So; **unique words** in sorted order are: `[ "carrot", "colour", "fruit", "orange", "vegetable" ]` <br>
So; after vectorizing using **one hot encoding** technique; the vectors corresponding to the sentences in the preprocessed corpus will be: <br>
`vectors = [ [ 0 1 1 1 0 ], [ 1 1 0 1 1 ] ]` <br><br>

**Advantages:**

- Represents sentences as fixed-length numeric vectors that can be used for training.

**Disadvantages:**

- Neither meaning (semantics), word order, nor any other relationship among the words is taken into account; each word is an independent feature.
- No weighting is applied: every word is given equal importance, regardless of how often it occurs in the sentence or in the corpus.

In [5]:
# Converting sentences to vectors using One Hot Encoding strategy
ohe = sklearn.preprocessing.OneHotEncoder()

vocab = []
for sentence in sentences:
    vocab = vocab + sentence.split()

vocab = list(set(vocab)) # remove duplicates

ohe.fit([ [word] for word in vocab ])

vectors = []
for sentence in sentences:
    vector = np.zeros(len(vocab))
    for word in sentence.split():
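        # element-wise maximum (not a sum) keeps every entry at 0/1 even if a word repeats in the sentence
        vector = np.maximum(vector, ohe.transform([[word]]).toarray()[0])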
    vectors.append(vector.tolist())

print("Features(Words) are:\n")
print(ohe.get_feature_names_out())

df["One Hot Encoded Vector Representation"] = vectors

df

Features(Words) are:

['x0_3000' 'x0_alexander' 'x0_anyone' 'x0_british' 'x0_came' 'x0_captured'
 'x0_come' 'x0_conquered' 'x0_culture' 'x0_done' 'x0_dutch' 'x0_enforce'
 'x0_freedom' 'x0_french' 'x0_grabbed' 'x0_greek' 'x0_history' 'x0_india'
 'x0_invaded' 'x0_land' 'x0_life' 'x0_looted' 'x0_mind' 'x0_mogul'
 'x0_nation' 'x0_onwards' 'x0_others' 'x0_people' 'x0_portuguese'
 'x0_respect' 'x0_three' 'x0_took' 'x0_tried' 'x0_turk' 'x0_u' 'x0_vision'
 'x0_way' 'x0_world' 'x0_year' 'x0_yet']


Unnamed: 0,Preprocessed Sentence,One Hot Encoded Vector Representation
0,three vision india,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]"
1,3000 year history people world come invaded u captured land conquered mind,"[1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]"
2,alexander onwards greek turk mogul portuguese british french dutch came looted u took,"[0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]"
3,yet done nation,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]"
4,conquered anyone,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]"
5,grabbed land culture history tried enforce way life,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]"
6,,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]"
7,respect freedom others,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]"


## <b><u>BAG OF WORDS (sentence to vector based on FREQUENCY of words) (CountVectorizer)</u></b>

In the bag of words (BOW) technique, we **consider WORDS as FEATURES (each unique word present in the entire corpus (collection of sentences) is treated as a unique feature)** <br>
Each sentence is represented by a vector **based on the "FREQUENCY" of each of the unique words (features) in the sentence**
<br><br>
For Example: <br>
Let `corpus = [ "orange is a fruit of orange colour", "carrot is a vegetable of orange colour" ]` <br>
After cleanup and preprocessing 1, <br>
`corpus = [ "orange fruit orange colour", "carrot vegetable orange colour" ]` <br>
So; **unique words** in sorted order are: `[ "carrot", "colour", "fruit", "orange", "vegetable" ]` <br>
So; after vectorizing using **bag of words(frequency based)** technique; the vectors corresponding to the sentences in the preprocessed corpus will be: <br>
`vectors = [ [ 0 1 1 2 0 ], [ 1 1 0 1 1 ] ]` <br><br>
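
To make the counting explicit, here is a minimal pure-Python sketch of the same toy example. The helper `bow_vector` and the `bow_*` names are hypothetical and only illustrate the idea; the notebook itself uses `CountVectorizer`.

```python
from collections import Counter

# Toy corpus after cleanup and preprocessing 1 (taken from the example above)
bow_corpus = ["orange fruit orange colour", "carrot vegetable orange colour"]

# Vocabulary: every unique word in the corpus, in sorted order
bow_vocab = sorted({word for sentence in bow_corpus for word in sentence.split()})

def bow_vector(sentence, vocab):
    """Return how many times each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]

bow_vectors = [bow_vector(sentence, bow_vocab) for sentence in bow_corpus]

print(bow_vocab)
print(bow_vectors)
```

`CountVectorizer` performs the same counting, but its default `token_pattern` only keeps tokens of two or more alphanumeric characters. That is why the single-letter token `u` appears among the one-hot features in this notebook but is missing from the `CountVectorizer` vocabulary.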

**Advantages:**

- Represents sentences as fixed-length numeric vectors that can be used for training.
- Weights words by how often they occur in the sentence, so a repeated word counts for more than a word that occurs once.

**Disadvantages:**

- Neither meaning (semantics), word order, nor any other relationship among the words is taken into account; each word is an independent feature.
- The weight of a word depends only on its frequency within the sentence itself (a kind of local scope). Each sentence is treated independently, with no relation to the other sentences in the corpus, so a word that is common everywhere can still receive a high weight.

In [6]:
df.drop(["One Hot Encoded Vector Representation"], axis=1, inplace=True)

# Converting sentences to vectors using Bag Of Words strategy
vectorizer = sklearn.feature_extraction.text.CountVectorizer()
vectors = vectorizer.fit_transform(sentences).toarray().tolist()

print("Features(Words) are:\n")
pp.pprint(vectorizer.vocabulary_)

df["Bag of words Vector Representation"] = vectors

df

Features(Words) are:

{   '3000': 0,
    'alexander': 1,
    'anyone': 2,
    'british': 3,
    'came': 4,
    'captured': 5,
    'come': 6,
    'conquered': 7,
    'culture': 8,
    'done': 9,
    'dutch': 10,
    'enforce': 11,
    'freedom': 12,
    'french': 13,
    'grabbed': 14,
    'greek': 15,
    'history': 16,
    'india': 17,
    'invaded': 18,
    'land': 19,
    'life': 20,
    'looted': 21,
    'mind': 22,
    'mogul': 23,
    'nation': 24,
    'onwards': 25,
    'others': 26,
    'people': 27,
    'portuguese': 28,
    'respect': 29,
    'three': 30,
    'took': 31,
    'tried': 32,
    'turk': 33,
    'vision': 34,
    'way': 35,
    'world': 36,
    'year': 37,
    'yet': 38}


Unnamed: 0,Preprocessed Sentence,Bag of words Vector Representation
0,three vision india,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]"
1,3000 year history people world come invaded u captured land conquered mind,"[1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0]"
2,alexander onwards greek turk mogul portuguese british french dutch came looted u took,"[0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]"
3,yet done nation,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]"
4,conquered anyone,"[0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
5,grabbed land culture history tried enforce way life,"[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]"
6,,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
7,respect freedom others,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]"


## <b><u>BAG OF N-GRAMS (sentence to vector based on FREQUENCY of N-GRAMS (groups of N CONSECUTIVE words))</u></b>

In the bag of n-grams (BON) technique, we **consider N-GRAMS (groups of CONSECUTIVE words) as FEATURES (each sequence of consecutive words of the specified size, extracted from every sentence in the corpus (collection of sentences), is treated as a unique feature)** <br>
Each sentence is represented by a vector **based on the "FREQUENCY" of each of the unique N-GRAMS (groups of words, treated as features here) in the sentence** <br>
**Uni-grams** are single words, **Bi-grams** are groups of 2 consecutive words, **Tri-grams** are groups of 3 consecutive words, and so on. <br>
<br>
For Example: <br>
Let `corpus = [ "orange is a fruit of orange colour", "carrot is a vegetable of orange colour" ]` <br>
After cleanup and preprocessing 1, <br>
`corpus = [ "orange fruit orange colour", "carrot vegetable orange colour" ]` <br>
So; all **unigrams** are: `[ "carrot", "colour", "fruit", "orange", "vegetable" ]` <br>
So; all **bigrams** are: `[ "orange fruit", "fruit orange", "orange colour", "carrot vegetable", "vegetable orange" ]` <br>
So; all **trigrams** are: `[ "orange fruit orange", "fruit orange colour", "carrot vegetable orange", "vegetable orange colour" ]` <br>
So; all **4-grams** are: `[ "orange fruit orange colour", "carrot vegetable orange colour" ]` <br>
So; after vectorizing using **bag of 1-grams** technique; the vectors corresponding to the sentences in the preprocessed corpus will be: <br>
`vectors = [ [ 0 1 1 2 0 ], [ 1 1 0 1 1 ] ]` <br>
So; after vectorizing using **bag of 2-grams** technique; the vectors corresponding to the sentences in the preprocessed corpus will be: <br>
`vectors = [ [ 1 1 1 0 0 ], [ 0 0 1 1 1 ] ]` <br>
So; after vectorizing using **bag of 3-grams** technique; the vectors corresponding to the sentences in the preprocessed corpus will be: <br>
`vectors = [ [ 1 1 0 0 ], [ 0 0 1 1 ] ]` <br>
So; after vectorizing using **bag of 4-grams** technique; the vectors corresponding to the sentences in the preprocessed corpus will be: <br>
`vectors = [ [ 1 0 ], [ 0 1 ] ]` <br><br>
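
The n-gram extraction above can be sketched in plain Python. The helpers `extract_ngrams` and `bag_of_ngrams` are hypothetical names introduced for illustration; features are kept in order of first appearance so that the columns line up with the lists in the example.

```python
def extract_ngrams(sentence, n):
    """Return all groups of n consecutive words in the sentence."""
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def bag_of_ngrams(corpus, n):
    """Return (features, vectors); features are listed in order of first appearance."""
    features = []
    for sentence in corpus:
        for gram in extract_ngrams(sentence, n):
            if gram not in features:
                features.append(gram)
    # Count how often each n-gram feature occurs in each sentence
    vectors = [[extract_ngrams(sentence, n).count(gram) for gram in features]
               for sentence in corpus]
    return features, vectors

# Toy corpus after cleanup and preprocessing 1 (taken from the example above)
ngram_corpus = ["orange fruit orange colour", "carrot vegetable orange colour"]

bigram_features, bigram_vectors = bag_of_ngrams(ngram_corpus, 2)
trigram_features, trigram_vectors = bag_of_ngrams(ngram_corpus, 3)

print(bigram_features)
print(bigram_vectors)
print(trigram_vectors)
```

`CountVectorizer(ngram_range=(2, 2))` produces the same counts but sorts its features alphabetically, so its column order differs from the order of first appearance used here.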

**Advantages:**

- Captures some local context that BOW misses: words are considered not only independently but also in **consecutive word groups**, which partly preserves word order and relates neighbouring words.

**Disadvantages:**

- The relationships captured are weak: only adjacency within a sentence is represented, not the meaning of the words or any longer-range relation.
- The weight of an n-gram depends only on its frequency within the sentence itself, so each sentence is treated independently, with no relation to the other sentences in the corpus.

In [7]:
df.drop(["Bag of words Vector Representation"], axis=1, inplace=True)

# Converting sentences to vectors using Bag Of N-Grams (here 2-grams, then 3-grams) strategy
vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(2,2))
vectors = vectorizer.fit_transform(sentences).toarray().tolist()

print("Features(2-Grams) are:\n")
pp.pprint(vectorizer.vocabulary_)

df["Bag of 2-GRAMS Vector Representation"] = vectors

vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(3,3))
vectors = vectorizer.fit_transform(sentences).toarray().tolist()

print("\nFeatures(3-Grams) are:\n")
pp.pprint(vectorizer.vocabulary_)

df["Bag of 3-GRAMS Vector Representation"] = vectors

df

Features(2-Grams) are:

{   '3000 year': 0,
    'alexander onwards': 1,
    'british french': 2,
    'came looted': 3,
    'captured land': 4,
    'come invaded': 5,
    'conquered anyone': 6,
    'conquered mind': 7,
    'culture history': 8,
    'done nation': 9,
    'dutch came': 10,
    'enforce way': 11,
    'freedom others': 12,
    'french dutch': 13,
    'grabbed land': 14,
    'greek turk': 15,
    'history people': 16,
    'history tried': 17,
    'invaded captured': 18,
    'land conquered': 19,
    'land culture': 20,
    'looted took': 21,
    'mogul portuguese': 22,
    'onwards greek': 23,
    'people world': 24,
    'portuguese british': 25,
    'respect freedom': 26,
    'three vision': 27,
    'tried enforce': 28,
    'turk mogul': 29,
    'vision india': 30,
    'way life': 31,
    'world come': 32,
    'year history': 33,
    'yet done': 34}

Features(3-Grams) are:

{   '3000 year history': 0,
    'alexander onwards greek': 1,
    'british french dutch': 2,
    'cam

Unnamed: 0,Preprocessed Sentence,Bag of 2-GRAMS Vector Representation,Bag of 3-GRAMS Vector Representation
0,three vision india,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]"
1,3000 year history people world come invaded u captured land conquered mind,"[1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0]","[1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0]"
2,alexander onwards greek turk mogul portuguese british french dutch came looted u took,"[0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]","[0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]"
3,yet done nation,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]"
4,conquered anyone,"[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
5,grabbed land culture history tried enforce way life,"[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]"
6,,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
7,respect freedom others,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]"


## <b><u>Term Frequency - Inverse Document Frequency (TF-IDF) Vectorizer</u></b>

In the TF-IDF technique also, we **consider WORDS as FEATURES (each unique word present in the entire corpus (collection of sentences) is treated as a unique feature)** <br>
Each sentence is represented by a vector **based on the "TF Score * IDF Score" value of each of the unique words (features) present in the sentence** <br>

**Term Frequency (TF) of a word W in a sentence S = (Frequency of W in S / Total number of words in S)** 

**Inverse Document Frequency (IDF) of a word W in a corpus (collection of sentences) C = log(Total number of sentences in C / Number of sentences which contain W at least once)**

The **TF Score * IDF Score** value helps in **giving more weight to the more important words and less weight to the less important words**. <br>

For Example: <br>
Let `corpus = [ "orange is a fruit of orange colour", "carrot is a vegetable of orange colour" ]` <br>
After cleanup and preprocessing 1, <br>
`corpus = [ "orange fruit orange colour", "carrot vegetable orange colour" ]` <br>
So; **unique words** in sorted order are: `[ "carrot", "colour", "fruit", "orange", "vegetable" ]` <br>
So; the **TF vectors** corresponding to the sentences in the preprocessed corpus will be: <br>
`tf_vectors = [ [ 0 1/4 1/4 2/4 0 ], [ 1/4 1/4 0 1/4 1/4 ] ] = [ [ 0 0.25 0.25 0.5 0 ], [ 0.25 0.25 0 0.25 0.25 ] ]` <br>
So; the **IDF vector** corresponding to the entire preprocessed corpus will be: <br>
`idf_vector = [ log(2/1) log(2/2) log(2/1) log(2/2) log(2/1) ] ≈ [ 0.3 0 0.3 0 0.3 ]` (using base-10 logarithms) <br>
So; the **TF-IDF vectors** corresponding to the sentences in the preprocessed corpus will be: <br>
`tf_idf_vectors = [ [ (0*0.3) (0.25*0) (0.25*0.3) (0.5*0) (0*0.3) ], [ (0.25*0.3) (0.25*0) (0*0.3) (0.25*0) (0.25*0.3) ] ] = [ [ 0 0 0.075 0 0 ], [ 0.075 0 0 0 0.075 ] ]`
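
The arithmetic above can be checked with a small pure-Python sketch of the textbook formulas. It assumes base-10 logarithms (which is what gives `log(2/1) ≈ 0.3`); the `tfidf_*` names are illustrative.

```python
import math

# Toy corpus after cleanup and preprocessing 1 (taken from the example above)
tfidf_corpus = ["orange fruit orange colour", "carrot vegetable orange colour"]
tokenized = [sentence.split() for sentence in tfidf_corpus]
tfidf_vocab = sorted({word for words in tokenized for word in words})
n_sentences = len(tokenized)

# IDF(W) = log10(total sentences / sentences containing W at least once)
idf = {
    word: math.log10(n_sentences / sum(word in words for words in tokenized))
    for word in tfidf_vocab
}

# TF(W, S) = frequency of W in S / total number of words in S
tf_idf_vectors = [
    [(words.count(word) / len(words)) * idf[word] for word in tfidf_vocab]
    for words in tokenized
]

for vector in tf_idf_vectors:
    print([round(value, 3) for value in vector])
```

Only the words that occur in exactly one sentence ("fruit", "carrot", "vegetable") end up with a non-zero score, which matches the hand calculation.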

<br>

**Why TF-IDF ?** <br>
If we were to find the words that best distinguish the two sentences above, which ones would they be? <br>
Note that both sentences talk about something which is orange in colour. So; the words "orange" and "colour" are present in both sentences and do not help to tell them apart. So; these words are given less weightage (here, zero). <br>
The sentences do differ in that the first is about a "fruit" while the second is about a "vegetable" named "carrot". So; the words "fruit", "vegetable" and "carrot" are more important and help more in distinguishing the sentences. So; these words are given more weightage (here). <br>
<br>
So; **greater TF-IDF score => more distinct feature (word)** and **lesser TF-IDF score => more common feature (word)** <br><br>

**Advantages:**

- Words are weighted by how informative they are: a word that occurs in many sentences is down-weighted, which bag of words/n-grams cannot do.
- Both the frequency of a word within an individual sentence (TF score) and its distribution across the sentences of the corpus (IDF score) are considered, so the weights reflect the entire corpus.

**Disadvantages:**

- Like bag of words, it does not capture word order or the meaning (semantics) of words; each word is still an independent feature.
- The weights are purely statistical, so the relationships captured between words remain weak (this can be improved, as we see later).

In [8]:
# Converting sentences to vectors using TF-IDF strategy
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer()
vectors = vectorizer.fit_transform(sentences).toarray().tolist()

print("Features(Words) are:\n")
print(vectorizer.get_feature_names_out())

df.drop(["Bag of 2-GRAMS Vector Representation"], axis=1, inplace=True)
df.drop(["Bag of 3-GRAMS Vector Representation"], axis=1, inplace=True)
df["TF-IDF Vector Representation"] = vectors

df

Features(Words) are:

['3000' 'alexander' 'anyone' 'british' 'came' 'captured' 'come'
 'conquered' 'culture' 'done' 'dutch' 'enforce' 'freedom' 'french'
 'grabbed' 'greek' 'history' 'india' 'invaded' 'land' 'life' 'looted'
 'mind' 'mogul' 'nation' 'onwards' 'others' 'people' 'portuguese'
 'respect' 'three' 'took' 'tried' 'turk' 'vision' 'way' 'world' 'year'
 'yet']


Unnamed: 0,Preprocessed Sentence,TF-IDF Vector Representation
0,three vision india,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5773502691896258, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5773502691896258, 0.0, 0.0, 0.0, 0.5773502691896258, 0.0, 0.0, 0.0, 0.0]"
1,3000 year history people world come invaded u captured land conquered mind,"[0.31454746818160906, 0.0, 0.0, 0.0, 0.0, 0.31454746818160906, 0.31454746818160906, 0.2636153271241494, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2636153271241494, 0.0, 0.31454746818160906, 0.2636153271241494, 0.0, 0.0, 0.31454746818160906, 0.0, 0.0, 0.0, 0.0, 0.31454746818160906, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.31454746818160906, 0.31454746818160906, 0.0]"
2,alexander onwards greek turk mogul portuguese british french dutch came looted u took,"[0.0, 0.2886751345948129, 0.0, 0.2886751345948129, 0.2886751345948129, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2886751345948129, 0.0, 0.0, 0.2886751345948129, 0.0, 0.2886751345948129, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2886751345948129, 0.0, 0.2886751345948129, 0.0, 0.2886751345948129, 0.0, 0.0, 0.2886751345948129, 0.0, 0.0, 0.2886751345948129, 0.0, 0.2886751345948129, 0.0, 0.0, 0.0, 0.0, 0.0]"
3,yet done nation,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5773502691896258, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5773502691896258, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5773502691896258]"
4,conquered anyone,"[0.0, 0.0, 0.7664298449085388, 0.0, 0.0, 0.0, 0.0, 0.6423280258820045, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]"
5,grabbed land culture history tried enforce way life,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.36748939521121393, 0.0, 0.0, 0.36748939521121393, 0.0, 0.0, 0.36748939521121393, 0.0, 0.30798479381600735, 0.0, 0.0, 0.30798479381600735, 0.36748939521121393, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.36748939521121393, 0.0, 0.0, 0.36748939521121393, 0.0, 0.0, 0.0]"
6,,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]"
7,respect freedom others,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5773502691896258, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5773502691896258, 0.0, 0.0, 0.5773502691896258, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]"
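
The values in the table above differ from the textbook formula because `TfidfVectorizer` uses different defaults: raw counts as TF, a smoothed IDF of `ln((1 + N) / (1 + df)) + 1`, and L2 normalisation of every row. This is why a three-word sentence whose words all occur in only that sentence gets `1/√3 ≈ 0.577` in each position. The sketch below reproduces those defaults on the toy corpus; the variable names are illustrative, and it assumes no sentence is empty (an empty row would need special handling before the division).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

demo_corpus = ["orange fruit orange colour", "carrot vegetable orange colour"]

# Raw term counts: the TF that TfidfVectorizer uses by default
counts = CountVectorizer().fit_transform(demo_corpus).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of sentences containing each word

# Smoothed IDF with the natural logarithm (default smooth_idf=True)
smoothed_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale every row to unit L2 length (default norm='l2')
weighted = counts * smoothed_idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(demo_corpus).toarray()
print(np.allclose(manual, reference))
```

Because of the `+ 1` in the smoothed IDF, words that occur in every sentence (such as "orange" and "colour" here) keep a small non-zero weight instead of being zeroed out.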


## <b><u>Hash Vectorizer</u></b>

In [9]:
# Converting sentences to vectors using hashing strategy
vectorizer = sklearn.feature_extraction.text.HashingVectorizer()
vectors = vectorizer.fit_transform(sentences).toarray().tolist()

df.drop(["TF-IDF Vector Representation"], axis=1, inplace=True)
df["Hashed Vector Representation"] = vectors

df

Unnamed: 0,Preprocessed Sentence,Hashed Vector Representation
0,three vision india,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...]"
1,3000 year history people world come invaded u captured land conquered mind,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...]"
2,alexander onwards greek turk mogul portuguese british french dutch came looted u took,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...]"
3,yet done nation,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...]"
4,conquered anyone,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...]"
5,grabbed land culture history tried enforce way life,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...]"
6,,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...]"
7,respect freedom others,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...]"
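
`HashingVectorizer` does not learn a vocabulary. It maps each token to a column index with a hash function, so it is stateless and the vector length is fixed in advance by `n_features` (default `2**20`). That default is why the rows above are extremely long, almost entirely zero, and truncated in the display. The sketch below uses a deliberately tiny `n_features=16` so the result is readable; that value, and turning off `alternate_sign` and `norm` to obtain plain counts, are choices made for illustration only.

```python
from sklearn.feature_extraction.text import HashingVectorizer

hash_corpus = ["orange fruit orange colour", "carrot vegetable orange colour"]

# No fit step is needed: the column of a token depends only on its hash
hasher = HashingVectorizer(n_features=16, alternate_sign=False, norm=None)
hashed = hasher.transform(hash_corpus).toarray()

print(hashed.shape)
print(hashed)
```

The trade-off is that different words can collide in the same column, and the mapping cannot be inverted to recover the words behind a column.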


## <b><u>Word Embedding: Finding even more relationships among different words in the vocabulary</u></b>

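Here each word is represented by its own vector, so each sentence becomes a 2D matrix (words × embedding dimensions) and the corpus becomes a 3D array (sentences × words × embedding dimensions).

Common techniques: Word2Vec, fastText, GloVe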
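
The shapes involved can be illustrated with a small NumPy sketch. The embedding table here is filled with random numbers purely as a placeholder for vectors that a model such as Word2Vec would learn, and `embedding_dim = 3` is an arbitrary small value chosen for readability.

```python
import numpy as np

emb_corpus = ["orange fruit orange colour", "carrot vegetable orange colour"]
tokenized = [sentence.split() for sentence in emb_corpus]
emb_vocab = sorted({word for words in tokenized for word in words})
word_to_index = {word: index for index, word in enumerate(emb_vocab)}

embedding_dim = 3  # arbitrary; chosen only to keep the output small
rng = np.random.default_rng(0)
# Placeholder: random vectors stand in for trained word embeddings
embedding_table = rng.normal(size=(len(emb_vocab), embedding_dim))

# Each sentence becomes a 2D matrix: one row (word vector) per word
sentence_matrices = [
    embedding_table[[word_to_index[word] for word in words]]
    for words in tokenized
]

# Stacking the sentences gives a 3D array: (sentences, words, embedding_dim)
corpus_tensor = np.stack(sentence_matrices)

print(sentence_matrices[0].shape)
print(corpus_tensor.shape)
```

Stacking works directly here only because both sentences have four words; sentences of different lengths would first need padding or truncation to a common length.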