# Vectorization

## Import all needed libraries

In [1]:
# Data handling
import numpy as np
import pandas as pd

# Data visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Text processing
import re
import string
import emoji
import spacy
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer


In [2]:
df = pd.read_csv("preprocessed_text.csv")

In [3]:
df.head()

Unnamed: 0,reviewId,content,score,sentiment_label,content_cleaned
0,cc1cfcd2-dc8a-4ead-88d1-7f2b2dbb2662,Plsssss stoppppp giving screen limit like when...,2,negative,plsssss stoppppp give screen limit like ur wat...
1,7dfb1f90-f185-4e81-a97f-d38f0128e5a4,Good,5,positive,good
2,3009acc4-8554-41cf-88de-cc5e2f6e45b2,👍👍,5,positive,thumb up thumb up
3,b3d27852-9a3b-4f74-9e16-15434d3ee324,Good,3,neutral,good
4,8be10073-2368-4677-b828-9ff5d06ea0b7,"App is useful to certain phone brand ,,,,it is...",1,negative,app useful certain phone brand except phone tr...


In [5]:
df.isnull().sum()

reviewId            0
content             0
score               0
sentiment_label     0
content_cleaned    60
dtype: int64

In [6]:
# Replace missing values with empty strings, since CountVectorizer expects string input
df.fillna('', inplace=True)

## Bag of Words

This method builds a literal bag of words: it ignores both the semantic meaning of the words and their position in the sentence. First, every input is tokenized. The unique tokens then form a vocabulary, sorted alphabetically. Each input sequence is represented as a vector whose length equals the vocabulary size, and the frequency of each token is stored at that token's index in the vocabulary. In scikit-learn, Bag of Words is implemented by the `CountVectorizer` class.
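To make the steps above concrete, here is a minimal sketch on a three-sentence toy corpus. The corpus and the `toy_*` names are made up for illustration and are not part of the review dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to illustrate the mechanics
toy_corpus = ["good app", "good good game", "bad app"]

toy_vectorizer = CountVectorizer()
toy_bow = toy_vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each token to its column index; sort tokens by that index
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
toy_counts = toy_bow.toarray().tolist()

print(toy_vocab)   # ['app', 'bad', 'game', 'good']
print(toy_counts)  # [[1, 0, 0, 1], [0, 0, 1, 2], [1, 1, 0, 0]]
```

Each row is one sentence and each column is one vocabulary token. The second row shows that "good" appears twice, while the order in which the words occurred is lost.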

In [17]:
# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit the model and transform the data
bow = vectorizer.fit_transform(df['content_cleaned'])

print(len(vectorizer.vocabulary_))
print(bow.shape)

34326
(113292, 34326)


The resulting vocabulary contains 34326 tokens, and the bag-of-words matrix has 113292 rows (one per review), each a vector with the length of the vocabulary.
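A matrix of this shape is only practical because `fit_transform` returns a SciPy sparse matrix, which stores just the non-zero counts. As a rough back-of-the-envelope sketch, assuming the default `int64` dtype of `CountVectorizer`, the following estimates what the same matrix would occupy if stored densely:

```python
# Shape taken from the output printed above
n_docs, vocab_size = 113292, 34326

# Assumption: 8 bytes per entry (CountVectorizer's default dtype is int64)
dense_bytes = n_docs * vocab_size * 8
dense_gib = dense_bytes / 1024**3

print(f"Dense storage would need about {dense_gib:.1f} GiB")
```

This is why calling `bow.toarray()` on the full dataset is best avoided; most entries are zero because each review uses only a handful of the vocabulary tokens.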

Positive: 
- Sequences have a fixed size.

Negative:
- Neither word order nor semantic meaning is preserved.
- Words in a new sequence that are not part of the fitted vocabulary are silently ignored, so any information they carry is lost.
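The handling of unseen words can be checked with a small sketch. The sentences and the `demo_*` names below are hypothetical and chosen only to show how a fitted `CountVectorizer` treats a token it did not encounter during fitting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit on a tiny hypothetical corpus; the vocabulary becomes: app, bad, good
demo_vectorizer = CountVectorizer()
demo_vectorizer.fit(["good app", "bad app"])

# "amazing" was never seen during fitting, so it has no column
demo_new = demo_vectorizer.transform(["amazing app"]).toarray().tolist()

print(demo_new)  # [[1, 0, 0]]
```

No error is raised: only "app" is counted, and "amazing" leaves no trace in the vector. The fixed vector size is kept, at the cost of discarding whatever the unknown word contributed.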

## TF-IDF

