### 1. Bag of Words, Manually
This vectorization technique converts text content into numerical feature vectors. Each document in a corpus becomes one vector with an entry for every word in the vocabulary, and that entry holds the number of times the word occurs in the document. The resulting vectors can then be fed to a machine learning model.

In [1]:
from nltk.tokenize import word_tokenize
from collections import Counter
import pandas as pd

In [2]:
sentence1 = 'The quick brown fox jumps over the lazy dog.'
sentence2 = 'The cat chases the mouse and it squakes loudly'

token = word_tokenize(sentence1.lower()) + word_tokenize(sentence2.lower())

In [3]:
tokens = set(token)

In [4]:
# One row per sentence, one column per unique token; cells start out empty (NaN)
df = pd.DataFrame(index=[1, 2], columns=list(tokens))

In [5]:
df

,brown,dog,and,the,fox,lazy,chases,it,.,jumps,over,mouse,cat,quick,squakes,loudly
1,,,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,,,


In [6]:
token1 = word_tokenize(sentence1.lower())
token2 = word_tokenize(sentence2.lower())

In [7]:
counts1 = [token1.count(x) for x in df.columns]
counts2 = [token2.count(x) for x in df.columns]

In [8]:
# Assign each sentence's counts to exactly one row (iloc[0:] would overwrite every row)
df.iloc[0] = counts1
df.iloc[1] = counts2

In [9]:
df

,brown,dog,and,the,fox,lazy,chases,it,.,jumps,over,mouse,cat,quick,squakes,loudly
1,1,1,0,2,1,1,0,0,1,1,1,0,0,1,0,0
2,0,0,1,2,0,0,1,1,0,0,0,1,1,0,1,1
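
The same table can be built more compactly with `Counter`, which is imported above but not otherwise used. The following is a minimal sketch rather than the notebook's method: `tokenize` and `bag_of_words` are hypothetical helper names, and the regex tokenizer is an assumed stand-in for `word_tokenize` that splits these two sentences into the same tokens shown in the table above.

```python
import re
from collections import Counter

def tokenize(text):
    # Hypothetical stand-in for nltk's word_tokenize: lowercase, then split
    # into runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def bag_of_words(documents):
    # One Counter per document, plus a sorted vocabulary shared by all rows.
    counters = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counters))
    # Counter returns 0 for missing keys, so absent words need no special case.
    matrix = [[c[word] for word in vocabulary] for c in counters]
    return vocabulary, matrix

sentence1 = 'The quick brown fox jumps over the lazy dog.'
sentence2 = 'The cat chases the mouse and it squakes loudly'
vocabulary, matrix = bag_of_words([sentence1, sentence2])

# Print one line per vocabulary word with its count in each sentence.
for word, n1, n2 in zip(vocabulary, *matrix):
    print(f"{word:>8} {n1} {n2}")
```

Sorting the vocabulary keeps the column order stable between runs, which a plain `set` does not guarantee.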


### Vectorization
1. In Natural Language Processing (NLP), vectorization is the process of converting text data into numerical vectors. These numerical vectors are used as input for machine learning models.
2. Models operate on numbers rather than raw strings, so vectorization is the step that makes text data usable by them.

### 1. ***Count Vectorizer***
1. This is one of the simplest ways of doing text vectorization. It creates a document-term matrix with one row per document and one column per term in the corpus vocabulary.
2. `CountVectorizer` learns the vocabulary when it is fitted, then fills each cell with the number of times that term occurs in that document, also known as the term frequency. Passing `binary=True` records only whether the term is present, which turns the columns into dummy variables.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [11]:
sentence1 = 'The quick brown fox jumps over the lazy dog.'
sentence2 = 'The cat chases the mouse and it squakes loudly'

In [12]:
# ngram_range=(1, 2) extracts single words (unigrams) and two-word sequences (bigrams)
cv = CountVectorizer(ngram_range=(1, 2))

x_vect = cv.fit_transform([sentence1, sentence2])

In [13]:
columns = cv.get_feature_names_out()

In [14]:
df = pd.DataFrame(x_vect.toarray(),columns=columns)

df

,and,and it,brown,brown fox,cat,cat chases,chases,chases the,dog,fox,...,over the,quick,quick brown,squakes,squakes loudly,the,the cat,the lazy,the mouse,the quick
0,0,0,1,1,0,0,0,0,1,1,...,1,1,1,0,0,2,0,1,0,1
1,1,1,0,0,1,1,1,1,0,0,...,0,0,0,1,1,2,1,0,1,0
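
To see where columns such as `brown fox` come from, the sketch below generates unigrams and bigrams by hand. It is an illustration, not scikit-learn's implementation: `ngrams` is a hypothetical helper, and the regex assumes the vectorizer's default behaviour of lowercasing the text and keeping only tokens of two or more word characters.

```python
import re

def ngrams(text, ngram_range=(1, 2)):
    # Assumption: mirror CountVectorizer's defaults by lowercasing and keeping
    # only tokens made of two or more word characters (punctuation is dropped).
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the sentence.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

sentence1 = 'The quick brown fox jumps over the lazy dog.'
grams = ngrams(sentence1)
print(grams)
```

Counting how often each string occurs in `grams` gives the first row of the table: `the` appears twice, every other unigram and bigram once.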


### 2. ***TF-IDF Vectorizer***
1. Similar to the count vectorization method, the TF-IDF method generates a document-term matrix in which each column represents an individual unique word.
2. The difference in the TF-IDF method is that each cell doesn't hold the raw term frequency, but a weight that signifies how important a word is for an individual text message or document. The weight grows with the word's frequency in that document and shrinks as the word appears in more documents of the corpus.

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

In [16]:
sentence1 = 'The quick brown fox jumps over the lazy dog.'
sentence2 = 'The cat chases the mouse and it squakes loudly'

In [17]:
# stop_words='english' drops common words such as 'the' and 'and' before weighting
tfidf = TfidfVectorizer(stop_words='english')

x_vectr = tfidf.fit_transform([sentence1, sentence2])

In [18]:
df = pd.DataFrame(x_vectr.toarray(), columns=tfidf.get_feature_names_out())

df

,brown,cat,chases,dog,fox,jumps,lazy,loudly,mouse,quick,squakes
0,0.408248,0.0,0.0,0.408248,0.408248,0.408248,0.408248,0.0,0.0,0.408248,0.0
1,0.0,0.447214,0.447214,0.0,0.0,0.0,0.0,0.447214,0.447214,0.0,0.447214
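
The values 0.408248 and 0.447214 follow from L2 normalization, which `TfidfVectorizer` applies to each row by default. A minimal sketch, assuming (as the output above suggests) that six terms survive stop-word removal in the first sentence and five in the second, each occurring once; `l2_normalize` is a hypothetical helper:

```python
import math

def l2_normalize(row):
    # Divide every weight by the Euclidean length of the row.
    norm = math.sqrt(sum(w * w for w in row))
    return [w / norm for w in row] if norm else row

# Within a row every surviving term has count 1 and the same idf, so the raw
# weights are equal. Any common factor cancels during normalization, which is
# why 1.0 can stand in for the shared tf * idf value.
row1 = l2_normalize([1.0] * 6)  # sentence 1: six terms
row2 = l2_normalize([1.0] * 5)  # sentence 2: five terms

print(round(row1[0], 6), round(row2[0], 6))  # 0.408248 0.447214
```

In other words, each weight is 1/sqrt(6) in the first row and 1/sqrt(5) in the second. With only two documents that share no terms after stop-word removal, TF-IDF cannot distinguish one word from another within a sentence.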


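### 3. TF-IDF Transformer
1. `TfidfTransformer` applies TF-IDF weighting to a count matrix that already exists, so `CountVectorizer` followed by `TfidfTransformer` is equivalent to `TfidfVectorizer`.
2. No stop words are removed in this example, so words such as `the` and `and` keep their columns.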

In [19]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [20]:
sentence1 = 'The quick brown fox jumps over the lazy dog.'
sentence2 = 'The cat chases the mouse and it squakes loudly'

In [21]:
cv = CountVectorizer()
word_vector = cv.fit_transform([sentence1,sentence2])

In [22]:
tfidf = TfidfTransformer()
x = tfidf.fit_transform(word_vector)

In [23]:
df = pd.DataFrame(x.toarray(),columns=cv.get_feature_names_out())
df

,and,brown,cat,chases,dog,fox,it,jumps,lazy,loudly,mouse,over,quick,squakes,the
0,0.0,0.332872,0.0,0.0,0.332872,0.332872,0.0,0.332872,0.332872,0.0,0.0,0.332872,0.332872,0.0,0.473682
1,0.332872,0.0,0.332872,0.332872,0.0,0.0,0.332872,0.0,0.0,0.332872,0.332872,0.0,0.0,0.332872,0.473682
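
The numbers in this table can be reproduced by hand. The sketch below assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), under which the documented weight is the raw count multiplied by idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. `smoothed_idf` is a hypothetical helper name, and the counts are copied from the first sentence.

```python
import math

n_docs = 2

# Raw counts for sentence 1 (only the terms that occur in it).
counts = {"brown": 1, "dog": 1, "fox": 1, "jumps": 1,
          "lazy": 1, "over": 1, "quick": 1, "the": 2}

# Document frequency: "the" occurs in both sentences, the rest in one.
doc_freq = {term: 2 if term == "the" else 1 for term in counts}

def smoothed_idf(df, n):
    # Smoothed idf, assuming scikit-learn's smooth_idf=True default.
    return math.log((1 + n) / (1 + df)) + 1

# tf * idf for each term, then divide by the row's Euclidean length.
raw = {t: c * smoothed_idf(doc_freq[t], n_docs) for t, c in counts.items()}
norm = math.sqrt(sum(w * w for w in raw.values()))
weights = {t: w / norm for t, w in raw.items()}

print(round(weights["brown"], 6), round(weights["the"], 6))  # 0.332872 0.473682
```

`the` has the lowest idf (exactly 1, because it appears in every document) but still ends up with the largest weight in the row because it occurs twice.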
