#### Common Terms
* Corpus (C): The collection of all words across all the reviews (duplicates included)
* Vocabulary (V): The unique words in the corpus
* Document (D): An individual review
* Word (W): An individual word within a review
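
The terms above can be made concrete with a small sketch. It assumes whitespace tokenization and reuses the four toy sentences that appear later in these notes; the variable names are only illustrative.

```python
# Each string is one document (an individual review).
documents = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

# Corpus: every word from every document, duplicates included.
corpus = [word for doc in documents for word in doc.split()]

# Vocabulary: the unique words of the corpus, sorted for a stable order.
vocabulary = sorted(set(corpus))

print(len(corpus))      # 12
print(vocabulary)       # ['campusx', 'comment', 'people', 'watch', 'write']
print(len(vocabulary))  # 5
```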

## One-Hot Encoding

One-hot encoding is the process of turning categorical features into a numerical form that machine learning algorithms can readily process. It represents each category of a feature as a binary vector of 1s and 0s, where the vector’s length equals the number of possible categories and exactly one position is set to 1.

For example:
If we have a feature with three categories (A, B, and C), 
each category can be represented as a binary vector of length three, 
with the vector for category A being [1, 0, 0], 
the vector for category B being [0, 1, 0], and the vector for category C being [0, 0, 1].
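
The A/B/C example can be reproduced with a few lines of plain Python. This is a minimal sketch; the helper `one_hot` is a made-up name for illustration, not a library function.

```python
categories = ["A", "B", "C"]

# Map each category to its position in the vector.
index = {category: i for i, category in enumerate(categories)}

def one_hot(category):
    """Return a binary vector with a single 1 at the category's position."""
    vector = [0] * len(categories)
    vector[index[category]] = 1
    return vector

encoded = {category: one_hot(category) for category in categories}
print(encoded)
# {'A': [1, 0, 0], 'B': [0, 1, 0], 'C': [0, 0, 1]}
```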

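Problems with One-Hot Encoding
- Sparsity: each vector holds a single 1 and zeros everywhere else
- No fixed size: documents with different word counts produce inputs of different shapes
- OOV (Out of Vocabulary): words that were not in the vocabulary cannot be encoded
- No capture of semantic meaning: every pair of distinct words is equally far apart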
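
The first three problems show up as soon as one-hot encoding is applied word by word to text. The sketch below assumes whitespace tokenization and a hand-written five-word vocabulary; `one_hot_document` is a hypothetical helper.

```python
vocabulary = ["campusx", "comment", "people", "watch", "write"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot_document(text):
    """Return one one-hot row per word; raises KeyError for unseen words."""
    rows = []
    for word in text.split():
        row = [0] * len(vocabulary)
        row[word_to_index[word]] = 1
        rows.append(row)
    return rows

short_doc = one_hot_document("people watch")
long_doc = one_hot_document("campusx write comment")

# No fixed size: the number of rows depends on the document length.
print(len(short_doc), len(long_doc))  # 2 3

# Sparsity: each row holds a single 1 among len(vocabulary) entries.
print(sum(short_doc[0]), len(short_doc[0]))  # 1 5

# OOV: a word outside the vocabulary has no index at all.
try:
    one_hot_document("people watch movies")
    oov_failed = False
except KeyError:
    oov_failed = True
print(oov_failed)  # True
```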


## Bag of Words

In [28]:
import numpy as np
import pandas as pd

data = {
    'text' : ['people watch campusx', 'campusx watch campusx', 'people write comment', 'campusx write comment'], 
    'output' : [1,1,0,0]
}

df = pd.DataFrame(data)
df

|   | text | output |
|---|------|--------|
| 0 | people watch campusx | 1 |
| 1 | campusx watch campusx | 1 |
| 2 | people write comment | 0 |
| 3 | campusx write comment | 0 |


In [29]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

# Learn the vocabulary from the text column and build the document-term count matrix
bow = cv.fit_transform(df['text'])
print(cv.vocabulary_)

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}


In [39]:
check = pd.DataFrame(bow.toarray(), columns=cv.get_feature_names_out())
# For a single document: pd.DataFrame(bow[1].toarray(), columns=cv.get_feature_names_out())
check

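|   | campusx | comment | people | watch | write |
|---|---------|---------|--------|-------|-------|
| 0 | 1 | 0 | 1 | 1 | 0 |
| 1 | 2 | 0 | 0 | 1 | 0 |
| 2 | 0 | 1 | 1 | 0 | 1 |
| 3 | 1 | 1 | 0 | 0 | 1 |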


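Problems solved
 - OOV (Out of Vocabulary): words not seen during fitting (such as "and" and "of" in the next cell) are simply ignored at transform time instead of causing an error
 - Fixed size: every document becomes a vector whose length equals the vocabulary size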

In [40]:
testing = cv.transform(["campusx watch and write comment of campusx"]).toarray()
pd.DataFrame(testing, columns=cv.get_feature_names_out())

|   | campusx | comment | people | watch | write |
|---|---------|---------|--------|-------|-------|
| 0 | 2 | 1 | 0 | 1 | 1 |
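
To see which words were dropped as out of vocabulary, the vectorizer's own tokenizer can be compared against the learned vocabulary. This is a self-contained sketch that refits a `CountVectorizer` on the same four sentences; `oov_tokens` is just an illustrative name.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

cv = CountVectorizer()
cv.fit(train_texts)

new_text = "campusx watch and write comment of campusx"

# build_analyzer() returns the same preprocessing + tokenizing function
# that the vectorizer applies internally.
analyzer = cv.build_analyzer()
tokens = analyzer(new_text)

# Tokens that are missing from the learned vocabulary are ignored by transform().
oov_tokens = [token for token in tokens if token not in cv.vocabulary_]
print(oov_tokens)  # ['and', 'of']

# The resulting vector still has one entry per vocabulary word.
vector = cv.transform([new_text]).toarray()[0].tolist()
print(vector)  # [2, 1, 0, 1, 1]
```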


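Important Parameters of CountVectorizer
- encoding='utf-8': encoding used to decode bytes or file input (the default)
- lowercase=True: converts all text to lowercase before tokenizing (the default)
- stop_words='english': removes words from a built-in English stop-word list (the default is None)
- token_pattern=(regular expression): regex that defines what counts as a token
- binary=True: records presence (1) or absence (0) instead of counts; mainly useful for tasks such as sentiment analysis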


In [37]:
cv.get_feature_names_out()

array(['campusx', 'comment', 'people', 'watch', 'write'], dtype=object)

## N-grams

## TF-IDF

## Word2Vec