# <b><u>PREPROCESSING II: SENTENCE TO VECTOR and WORD TO VECTOR</u></b>

In [1]:
import nltk
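import sklearn.feature_extraction.text  # import the submodule explicitly; a bare `import sklearn` does not guarantee it is loaded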
import copy
import re
import pandas as pd

pd.set_option('display.max_colwidth', None)

In [2]:
# Assumes the required NLTK resources (sentence tokenizer models, 'stopwords', 'wordnet') are already downloaded via nltk.download(...)
lemmatizer = nltk.stem.WordNetLemmatizer()
stop_words = set(nltk.corpus.stopwords.words('english'))  # build the stopword set once, not once per word

def processing1(sentence):
    sentence = re.sub(r'[^a-zA-Z0-9\s]', '', sentence)  # remove punctuation and any other symbols (raw string avoids an invalid-escape warning)
    sentence = sentence.lower()  # lowercase so mixed-case variants map to the same word
    words = sentence.split()  # list of words
    words = [word for word in words if word not in stop_words]  # remove uninformative stopwords
    words = [lemmatizer.lemmatize(word) for word in words]  # lemmatize to base forms (default POS is noun, so e.g. 'us' becomes 'u')
    return ' '.join(words)  # back to a sentence from the processed list of words
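To see what the cleanup steps do to a single sentence without needing any NLTK downloads, here is a minimal standard-library sketch. The `cleanup` function and the tiny `STOP_WORDS` set are hypothetical stand-ins for illustration (the notebook itself uses NLTK's full English stopword list), and lemmatization is left out, so `visions` is not reduced to `vision` here.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# the notebook uses NLTK's full English stopword list instead.
STOP_WORDS = {"i", "have", "for", "we", "not", "the", "of", "why"}

def cleanup(sentence, stop_words=STOP_WORDS):
    sentence = re.sub(r"[^a-zA-Z0-9\s]", "", sentence)  # strip punctuation and symbols
    words = sentence.lower().split()                     # lowercase, then split on whitespace
    words = [w for w in words if w not in stop_words]    # drop stopwords
    return " ".join(words)                               # lemmatization is omitted in this sketch

first = cleanup("I have three visions for India.")
only_stopword = cleanup("Why?")
print(repr(first))          # 'three visions india'
print(repr(only_stopword))  # ''
```

The second call shows how a sentence made up only of stopwords collapses to an empty string after cleanup.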


In [3]:
paragraph = '''
I have three visions for India. 
In 3000 years of our history, people from all over the world have come and invaded us, captured our lands, conquered our minds. 
From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British, the French, the Dutch, all of them came and looted us, took over what was ours. 
Yet we have not done this to any other nation. We have not conquered anyone. 
We have not grabbed their land, their culture, their history and tried to enforce our way of life on them.
Why? Because we respect the freedom of others.
'''

## <b><u>Extracting sentences and Preprocessing I (cleanup): </u></b>

In [4]:
# Extract all sentences from the paragraph. Each sentence will later be turned into a vector whose features are the unique words of the corpus.
sentences = nltk.sent_tokenize(paragraph)

# Processing 1
sentences = [processing1(sentence) for sentence in sentences]

df = pd.DataFrame()
df["Preprocessed Sentence"] = sentences

df

Unnamed: 0,Preprocessed Sentence
0,three vision india
1,3000 year history people world come invaded u captured land conquered mind
2,alexander onwards greek turk mogul portuguese british french dutch came looted u took
3,yet done nation
4,conquered anyone
5,grabbed land culture history tried enforce way life
6,
7,respect freedom others
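Row 6 is empty because the sentence "Why?" contains nothing but a stopword once the punctuation is removed. If such rows are not wanted, one option is to filter them out before vectorizing. This is only a sketch of that idea and not something the notebook does: the outputs further down keep all 8 rows, and the name `non_empty` is introduced here purely for illustration.

```python
# Preprocessed sentences copied from the table above; index 6 is the empty one.
sentences = [
    "three vision india",
    "3000 year history people world come invaded u captured land conquered mind",
    "alexander onwards greek turk mogul portuguese british french dutch came looted u took",
    "yet done nation",
    "conquered anyone",
    "grabbed land culture history tried enforce way life",
    "",
    "respect freedom others",
]

# Keep only sentences that still contain at least one word.
non_empty = [s for s in sentences if s.strip()]
print(len(sentences), len(non_empty))  # 8 7
```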


## <b><u>BAG OF WORDS (sentence to vector)</u></b>

In the bag-of-words technique, we **treat WORDS as FEATURES: each unique word in the entire corpus (the collection of sentences) is one feature**. <br>
Each sentence is represented by a vector **holding the "FREQUENCY" (count) of each unique word (feature) in that sentence**.
<br><br>
For example: <br>
Let `corpus = [ "orange is a fruit of orange colour", "carrot is a vegetable of orange colour" ]` <br>
After cleanup and preprocessing 1, <br>
`corpus = [ "orange fruit orange colour", "carrot vegetable orange colour" ]` <br>
So the **unique words** in sorted order are: `[ "carrot", "colour", "fruit", "orange", "vegetable" ]` <br>
After vectorizing with the **bag-of-words (frequency-based)** technique, the vectors for the sentences in the preprocessed corpus are: <br>
`vectors = [ [ 0 1 1 2 0 ], [ 1 1 0 1 1 ] ]`
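The worked example above can be reproduced by hand with the standard library alone. The sketch below is a minimal from-scratch version, not scikit-learn's implementation; the helper name `bag_of_words` is made up for illustration and it assumes sentences are already preprocessed and tokenized by whitespace.

```python
from collections import Counter

def bag_of_words(corpus):
    """Return (sorted vocabulary, count vectors) for whitespace-tokenized sentences."""
    # Every unique word in the whole corpus becomes one feature.
    vocabulary = sorted({word for sentence in corpus for word in sentence.split()})
    vectors = []
    for sentence in corpus:
        counts = Counter(sentence.split())  # word -> frequency in this sentence
        vectors.append([counts[word] for word in vocabulary])  # missing words count as 0
    return vocabulary, vectors

corpus = ["orange fruit orange colour", "carrot vegetable orange colour"]
vocabulary, vectors = bag_of_words(corpus)
print(vocabulary)  # ['carrot', 'colour', 'fruit', 'orange', 'vegetable']
print(vectors)     # [[0, 1, 1, 2, 0], [1, 1, 0, 1, 1]]
```

One difference to keep in mind: with its default settings, scikit-learn's `CountVectorizer` only keeps tokens of two or more characters, which is why the one-letter token `u` in the preprocessed sentences does not show up among its features.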

In [5]:
# Converting sentences to vectors using Bag Of Words strategy
vectorizer = sklearn.feature_extraction.text.CountVectorizer()
vectors_bow = vectorizer.fit_transform(sentences).toarray().tolist()

print("Features(Words) are:\n")
print(vectorizer.get_feature_names_out())

df["Bag of words Vector Representation"] = vectors_bow

df

Features(Words) are:

['3000' 'alexander' 'anyone' 'british' 'came' 'captured' 'come'
 'conquered' 'culture' 'done' 'dutch' 'enforce' 'freedom' 'french'
 'grabbed' 'greek' 'history' 'india' 'invaded' 'land' 'life' 'looted'
 'mind' 'mogul' 'nation' 'onwards' 'others' 'people' 'portuguese'
 'respect' 'three' 'took' 'tried' 'turk' 'vision' 'way' 'world' 'year'
 'yet']


Unnamed: 0,Preprocessed Sentence,Bag of words Vector Representation
0,three vision india,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]"
1,3000 year history people world come invaded u captured land conquered mind,"[1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0]"
2,alexander onwards greek turk mogul portuguese british french dutch came looted u took,"[0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]"
3,yet done nation,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]"
4,conquered anyone,"[0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
5,grabbed land culture history tried enforce way life,"[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]"
6,,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
7,respect freedom others,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]"


## <b><u>Term Frequency - Inverse Document Frequency (TF-IDF)</u></b>

In the TF-IDF technique, we again **treat WORDS as FEATURES: each unique word in the entire corpus (the collection of sentences) is one feature**. <br>
Each sentence is represented by a vector **holding the "TF score * IDF score" of each unique word (feature) in that sentence**. <br>
<br>
**Term Frequency (TF) of a word W in a sentence S = (Frequency of W in S / Total number of words in S)** <br><br>
**Inverse Document Frequency (IDF) of a word W in a corpus (collection of sentences) C = log(Total number of sentences in C / Number of sentences that contain W at least once)**
<br><br>
The **TF score * IDF score** value **gives more weight to words that are frequent in a sentence but rare across the corpus, and less weight to words that appear in most sentences**; a word that occurs in every sentence gets an IDF of 0. <br>
<br>
For example (using base-10 logarithms, so log(2) ≈ 0.3): <br>
Let `corpus = [ "orange is a fruit of orange colour", "carrot is a vegetable of orange colour" ]` <br>
After cleanup and preprocessing 1, <br>
`corpus = [ "orange fruit orange colour", "carrot vegetable orange colour" ]` <br>
So the **unique words** in sorted order are: `[ "carrot", "colour", "fruit", "orange", "vegetable" ]` <br>
The **TF vectors** for the sentences in the preprocessed corpus are: <br>
`tf_vectors = [ [ 0 1/4 1/4 2/4 0 ], [ 1/4 1/4 0 1/4 1/4 ] ] = [ [ 0 0.25 0.25 0.5 0 ], [ 0.25 0.25 0 0.25 0.25 ] ]` <br>
`idf_vector = [ log(2/1) log(2/2) log(2/1) log(2/2) log(2/1) ] ≈ [ 0.3 0 0.3 0 0.3 ]` <br>
`tf_idf_vectors = [ [ (0*0.3) (0.25*0) (0.25*0.3) (0.5*0) (0*0.3) ], [ (0.25*0.3) (0.25*0) (0*0.3) (0.25*0) (0.25*0.3) ] ] = [ [ 0 0 0.075 0 0 ], [ 0.075 0 0 0 0.075 ] ]` <br>
<br>
Note that scikit-learn's `TfidfVectorizer` uses a variant of these formulas by default (raw counts as TF, a smoothed natural-log IDF with 1 added, and L2 normalization of each vector), so the values it produces differ from this hand calculation.
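The hand calculation above can be checked with a few lines of standard-library Python. This is a minimal sketch of the textbook formulas as stated in this section (base-10 logarithm, no smoothing, no normalization), not of scikit-learn's variant; the helper name `tf_idf` is made up for illustration.

```python
import math

def tf_idf(corpus):
    """Textbook TF-IDF for whitespace-tokenized sentences (log base 10, no smoothing)."""
    docs = [sentence.split() for sentence in corpus]
    vocabulary = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # IDF(W) = log10(number of sentences / number of sentences containing W)
    idf = {
        word: math.log10(n_docs / sum(word in doc for doc in docs))
        for word in vocabulary
    }
    vectors = []
    for doc in docs:
        length = len(doc) or 1  # guard against empty sentences
        # TF(W, S) = count of W in S / number of words in S
        vectors.append([doc.count(word) / length * idf[word] for word in vocabulary])
    return vocabulary, vectors

corpus = ["orange fruit orange colour", "carrot vegetable orange colour"]
vocabulary, vectors = tf_idf(corpus)
rounded = [[round(value, 3) for value in vector] for vector in vectors]
print(vocabulary)  # ['carrot', 'colour', 'fruit', 'orange', 'vegetable']
print(rounded)     # [[0.0, 0.0, 0.075, 0.0, 0.0], [0.075, 0.0, 0.0, 0.0, 0.075]]
```

Note how `orange` and `colour` end up with weight 0 in both sentences: they occur in every sentence, so their IDF is log(2/2) = 0.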

In [6]:
# Converting sentences to vectors using TF-IDF strategy
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer()
vectors_tfidf = vectorizer.fit_transform(sentences).toarray().tolist()

print("Features(Words) are:\n")
print(vectorizer.get_feature_names_out())

df["TF-IDF Vector Representation"] = vectors_tfidf

df

Features(Words) are:

['3000' 'alexander' 'anyone' 'british' 'came' 'captured' 'come'
 'conquered' 'culture' 'done' 'dutch' 'enforce' 'freedom' 'french'
 'grabbed' 'greek' 'history' 'india' 'invaded' 'land' 'life' 'looted'
 'mind' 'mogul' 'nation' 'onwards' 'others' 'people' 'portuguese'
 'respect' 'three' 'took' 'tried' 'turk' 'vision' 'way' 'world' 'year'
 'yet']


Unnamed: 0,Preprocessed Sentence,Bag of words Vector Representation,TF-IDF Vector Representation
0,three vision india,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5773502691896258, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5773502691896258, 0.0, 0.0, 0.0, 0.5773502691896258, 0.0, 0.0, 0.0, 0.0]"
1,3000 year history people world come invaded u captured land conquered mind,"[1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0]","[0.31454746818160906, 0.0, 0.0, 0.0, 0.0, 0.31454746818160906, 0.31454746818160906, 0.2636153271241494, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2636153271241494, 0.0, 0.31454746818160906, 0.2636153271241494, 0.0, 0.0, 0.31454746818160906, 0.0, 0.0, 0.0, 0.0, 0.31454746818160906, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.31454746818160906, 0.31454746818160906, 0.0]"
2,alexander onwards greek turk mogul portuguese british french dutch came looted u took,"[0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]","[0.0, 0.2886751345948129, 0.0, 0.2886751345948129, 0.2886751345948129, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2886751345948129, 0.0, 0.0, 0.2886751345948129, 0.0, 0.2886751345948129, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2886751345948129, 0.0, 0.2886751345948129, 0.0, 0.2886751345948129, 0.0, 0.0, 0.2886751345948129, 0.0, 0.0, 0.2886751345948129, 0.0, 0.2886751345948129, 0.0, 0.0, 0.0, 0.0, 0.0]"
3,yet done nation,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5773502691896258, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5773502691896258, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5773502691896258]"
4,conquered anyone,"[0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0.0, 0.0, 0.7664298449085388, 0.0, 0.0, 0.0, 0.0, 0.6423280258820045, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]"
5,grabbed land culture history tried enforce way life,"[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.36748939521121393, 0.0, 0.0, 0.36748939521121393, 0.0, 0.0, 0.36748939521121393, 0.0, 0.30798479381600735, 0.0, 0.0, 0.30798479381600735, 0.36748939521121393, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.36748939521121393, 0.0, 0.0, 0.36748939521121393, 0.0, 0.0, 0.0]"
6,,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]"
7,respect freedom others,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5773502691896258, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5773502691896258, 0.0, 0.0, 0.5773502691896258, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]"
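The numbers in the TF-IDF column do not follow the plain "TF * IDF" formula given earlier, because `TfidfVectorizer` by default uses raw counts as TF, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 length. The sketch below reproduces row 4 ("conquered anyone") under those defaults using only the standard library. The helper name `sklearn_style_tfidf_row` is hypothetical, and the document frequencies are read off the bag-of-words table: `conquered` occurs in 2 of the 8 sentences, `anyone` in 1.

```python
import math

def sklearn_style_tfidf_row(term_counts, doc_freqs, n_docs):
    """Reproduce one row of TfidfVectorizer's default output.

    term_counts: {term: raw count in this sentence}
    doc_freqs:   {term: number of sentences containing the term}
    n_docs:      total number of sentences (empty ones included)
    """
    # Raw count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freqs[term])) + 1)
        for term, count in term_counts.items()
    }
    # L2-normalize the row so that its Euclidean length is 1
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else weights

row = sklearn_style_tfidf_row(
    term_counts={"conquered": 1, "anyone": 1},
    doc_freqs={"conquered": 2, "anyone": 1},
    n_docs=8,
)
print(row)  # conquered is about 0.6423, anyone about 0.7664, as in row 4 of the table
```

The same reasoning explains the 0.577 entries: a sentence with three words that each occur in only one sentence gets three equal weights, and L2 normalization turns them into 1/sqrt(3) ≈ 0.577.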
