# Vectorization

## Import all needed libraries

In [1]:
# Data handling
import numpy as np
import pandas as pd

# Text processing
import re
import string
import emoji
import spacy
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

In [2]:
df = pd.read_csv("preprocessed_text.csv")

In [3]:
df.head()

Unnamed: 0,Content,Score,Sentiment,Content_cleaned
0,Plsssss stoppppp giving screen limit like when...,2,negative,plsssss stoppppp give screen limit like ur wat...
1,Good,5,positive,good
2,👍👍,5,positive,thumb up thumb up
3,Good,3,neutral,good
4,"App is useful to certain phone brand ,,,,it is...",1,negative,app useful certain phone brand except phone tr...


In [4]:
df.isnull().sum()

Content             0
Score               0
Sentiment           0
Content_cleaned    60
dtype: int64

In [5]:
df.fillna('', inplace=True)

## Bag of Words

This method literally creates a bag of words: it ignores both the semantic meaning of the words and their position in the sentence. First, all the inputs are tokenized. Then, from all the unique tokens, the algorithm builds a vocabulary in alphabetical order. For every input sequence, the algorithm creates a vector whose length equals the vocabulary size, and the frequency of each token is stored at that token's index in the vocabulary. In scikit-learn, Bag of Words is implemented by the CountVectorizer class.
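The three steps can be sketched in plain Python on a toy corpus. This is only an illustration of the idea, not how CountVectorizer is implemented: the corpus is made up, and whitespace splitting stands in for the real tokenizer (which, by default, also lowercases the text and drops single-character tokens).

```python
from collections import Counter

# Toy corpus (illustrative, not taken from the dataset)
corpus = ["good app", "thumb up thumb up", "app not good"]

# 1. Tokenize every input (whitespace split is a simplification)
tokenized = [text.split() for text in corpus]

# 2. Build the vocabulary from the unique tokens, in alphabetical order
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# 3. Represent each input as a count vector of vocabulary length
def to_bow(tokens):
    counts = Counter(tokens)
    return [counts[token] for token in vocabulary]

bow_vectors = [to_bow(tokens) for tokens in tokenized]

print(vocabulary)
for text, vector in zip(corpus, bow_vectors):
    print(vector, text)
```

Every vector has the same length as the vocabulary, and most entries are zero, which is why the real implementation returns a sparse matrix.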

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit the model and transform the data
bow = vectorizer.fit_transform(df['Content_cleaned'])

print(len(vectorizer.vocabulary_))
print(bow.shape)

34326
(113292, 34326)


In [7]:
print(df['Content_cleaned'][2])
print(bow[2])

thumb up thumb up
  (0, 30228)	2
  (0, 31874)	2


In [14]:
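# Column indices follow the alphabetical order of the tokens,
# so sorting the vocabulary keys recovers the index-to-token mapping
sorted_vocab_keys = sorted(vectorizer.vocabulary_.keys())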
print(f"30228 is {sorted_vocab_keys[30228]}.")
print(f"31874 is {sorted_vocab_keys[31874]}.")

30228 is thumb.
31874 is up.


The vocabulary has 34326 entries, and the Bag of Words matrix has 113292 rows (one vector per input sequence), each as long as the vocabulary.

In the example, both "thumb" and "up" get a value of 2, because each word occurs twice in the sequence.

Positive: 
- Sequences have a fixed size.

Negative:
- Very high dimensions.
- Order of words or semantic meaning is not preserved.
- Words that are not part of the fitted vocabulary cannot be represented: when a new sequence contains such words, they are ignored and the information they carry is lost.
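To see what happens with unseen words in practice, here is a minimal sketch on a made-up two-sentence corpus. In scikit-learn, transform does not raise an error for a word missing from the fitted vocabulary; the word simply has no column, so it is silently dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy training corpus (illustrative, not taken from the dataset)
train = ["good app", "thumb up thumb up"]
vectorizer = CountVectorizer()
vectorizer.fit(train)

# "excellent" was never seen during fit, so it has no column
new_vector = vectorizer.transform(["excellent app"]).toarray()[0]

print(sorted(vectorizer.vocabulary_))
print(new_vector)
```

Only "app" is counted in the new sequence. A sequence made entirely of unseen words would become an all-zero vector.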

## TF-IDF


TF-IDF, or Term Frequency-Inverse Document Frequency, is a frequency-based algorithm like Bag of Words, but it also takes word importance into consideration. The idea is that a word that appears in many sentences/sequences is probably not very informative, while a word that appears in only a few is likely to be important. This way, words that are repeated very often don't overpower less frequent but important words. The score of a word in a sentence/sequence is computed as follows:
- TF(x) = (frequency of word 'x' in a sequence)/(total number of words in the sequence).
- IDF(x) = log((total number of sequences)/(number of sequences that contain word 'x')).
- TF-IDF(x) = TF(x) * IDF(x).

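In IDF(x) the document frequency is inverted, so the more common a word is across all documents, the lower its importance for the current document.

Note that scikit-learn's TfidfVectorizer uses a variant of these formulas by default: TF is the raw count, IDF is smoothed as ln((1 + n) / (1 + df(x))) + 1, where n is the number of sequences and df(x) the number of sequences containing 'x', and each resulting vector is normalized to unit Euclidean (L2) length.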
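As a worked example, the textbook formulas can be written directly in Python. This is a sketch on a made-up corpus with whitespace tokenization; the function names are illustrative, and the numbers will not match TfidfVectorizer, which applies smoothing and normalization by default.

```python
import math

# Toy corpus (illustrative, not taken from the dataset)
corpus = [
    "thumb up thumb up",
    "good app thumb up",
    "good up",
    "app not good up",
]
tokenized = [text.split() for text in corpus]

def tf(word, tokens):
    # Frequency of the word in the sequence / number of words in the sequence
    return tokens.count(word) / len(tokens)

def idf(word, documents):
    # log(total sequences / sequences containing the word)
    containing = sum(1 for tokens in documents if word in tokens)
    return math.log(len(documents) / containing)

def tf_idf(word, tokens, documents):
    return tf(word, tokens) * idf(word, documents)

first = tokenized[0]
for word in ("thumb", "up"):
    print(word, round(tf_idf(word, first, tokenized), 4))
```

In the first sequence both words have the same TF (2/4), but "up" occurs in every sequence, so its IDF is log(1) = 0 and its score vanishes, while "thumb" occurs in only half of them and keeps a positive score. The unsmoothed IDF also divides by zero for a word that appears in no sequence, which is one reason practical implementations smooth it.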


In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit the model and transform the data
tfidf = vectorizer.fit_transform(df['Content_cleaned'])

print(len(vectorizer.vocabulary_))
print(tfidf.shape)

34326
(113292, 34326)


In [11]:
print(df['Content_cleaned'][2])
print(tfidf[2])

thumb up thumb up
  (0, 31874)	0.5497003144270964
  (0, 30228)	0.8353619361203571


In [13]:
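# Column indices follow the alphabetical order of the tokens,
# so sorting the vocabulary keys recovers the index-to-token mapping
sorted_vocab_keys = sorted(vectorizer.vocabulary_.keys())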
print(f"30228 is {sorted_vocab_keys[30228]}.")
print(f"31874 is {sorted_vocab_keys[31874]}.")

30228 is thumb.
31874 is up.


We see that, just like with Bag of Words, we have a vocabulary of size 34326 and 113292 vectors, each of that length.

In the example, unlike Bag of Words, where both words got the value 2, the word "thumb" gets a higher value than the word "up", meaning it is considered more important. Both words occur twice in this sequence, so the difference comes from the IDF term: "thumb" must appear in fewer sequences than "up", which makes it more significant.
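The two weights can be checked with a little arithmetic. By default TfidfVectorizer scales every document vector to unit Euclidean length (norm='l2'), and since both words have the same count here, the ratio of their weights equals the ratio of their IDF values. A minimal sketch using the numbers printed above:

```python
import math

# TF-IDF weights printed above for "thumb up thumb up"
w_up = 0.5497003144270964
w_thumb = 0.8353619361203571

# Length (L2 norm) of the document vector
norm = math.sqrt(w_up**2 + w_thumb**2)

# Both words have the same count (2), so after normalization the ratio
# of the weights equals the ratio of their IDF values
idf_ratio = w_thumb / w_up

print(f"L2 norm: {norm:.6f}")
print(f"IDF(thumb) / IDF(up): {idf_ratio:.2f}")
```

The norm comes out as 1, which explains why the values are not simple products of a count and a logarithm, and the IDF of "thumb" is roughly 1.5 times that of "up".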

Positive: 
- Sequences have a fixed size.
- Some word importance is considered, unlike Bag of Words.

Negative:
- Very high dimensions.
- Order of words is still not preserved.
- Again, words that are not part of the fitted vocabulary are ignored when they appear in a new sequence, so their information is lost.

## Word2Vec

Word2Vec is a neural network-based model for learning word embeddings. Unlike the frequency-based vectorization algorithms above, it produces vector representations that capture the context in which words appear. Since every word is represented as an n-dimensional vector, one can imagine all of the words mapped into this n-dimensional space in such a way that words with similar meanings lie close to one another.
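Closeness in the embedding space is commonly measured with cosine similarity. The sketch below uses hand-picked 3-dimensional vectors purely for illustration; they are not learned embeddings, and real Word2Vec vectors typically have a hundred or more dimensions.

```python
import math

# Hand-picked vectors, purely illustrative (not learned by any model)
embeddings = {
    "good": [0.9, 0.8, 0.1],
    "great": [0.85, 0.75, 0.2],
    "phone": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["good"], embeddings["great"]))
print(cosine_similarity(embeddings["good"], embeddings["phone"]))
```

The first pair points in almost the same direction and scores close to 1, while the second pair scores much lower, which is the behavior one would expect from embeddings of similar and unrelated words.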

There are two main architectures for Word2Vec: CBOW (Continuous Bag of Words) and Skip-Gram.
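The difference between the two lies in what is predicted from what: CBOW predicts a target word from its surrounding context words, while Skip-Gram predicts each context word from the target word. The sketch below only builds the training pairs for one sentence; the function name and the window parameter are illustrative assumptions, and no neural network is trained here.

```python
def training_pairs(tokens, window=2):
    """Return (cbow_pairs, skipgram_pairs) for one tokenized sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Words within `window` positions on either side of the target
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # CBOW: the whole context predicts the target word
        cbow.append((context, target))
        # Skip-Gram: the target word predicts each context word separately
        skipgram.extend((target, word) for word in context)
    return cbow, skipgram

tokens = "app useful certain phone brand".split()
cbow, skipgram = training_pairs(tokens, window=1)
print(cbow)
print(skipgram)
```

With a window of 1, the word "useful" yields the CBOW pair (["app", "certain"], "useful") and the two Skip-Gram pairs ("useful", "app") and ("useful", "certain").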
