# Vectorization

## Import all needed libraries

In [1]:
# Data handling
import numpy as np
import pandas as pd

# Text processing
import re
import string
import emoji
import spacy
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

In [2]:
df = pd.read_csv("preprocessed_text.csv")

In [3]:
df.head()

Unnamed: 0,Content,Score,Sentiment,Content_cleaned
0,Plsssss stoppppp giving screen limit like when...,2,negative,plss stopp give screen limit like ur watch thi...
1,Good,5,positive,good
2,👍👍,5,positive,thumb up thumb up
3,Good,3,neutral,good
4,"App is useful to certain phone brand ,,,,it is...",1,negative,app useful certain phone brand except phone tr...


In [4]:
df.isnull().sum()

Content             0
Score               0
Sentiment           0
Content_cleaned    67
dtype: int64

In [5]:
df.fillna('', inplace=True)

## Bag of Words

This method literally creates a bag of words: it ignores both the semantic meaning of the words and their position in the sentence. First, all inputs are tokenized. Then the algorithm builds a vocabulary from all unique tokens, in alphabetical order. For every input sequence it produces a vector whose length equals the vocabulary size, where each index holds the frequency of the corresponding token in that sequence. Stacking these vectors gives a sparse document-term matrix. In scikit-learn, Bag of Words is implemented by the CountVectorizer class.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit the model and transform the data
bow = vectorizer.fit_transform(df['Content_cleaned'])

print(len(vectorizer.vocabulary_))
print(bow.shape)

31451
(113292, 31451)


In [7]:
print(df['Content_cleaned'][2])
print(bow[2])

thumb up thumb up
  (0, 27621)	2
  (0, 29212)	2


In [10]:
sorted_vocab_keys = sorted(vectorizer.vocabulary_.keys())
print(f"27621 is {sorted_vocab_keys[27621]}.")
print(f"29212 is {sorted_vocab_keys[29212]}.")

27621 is thumb.
29212 is up.


The produced vocabulary has 31451 entries, and the bag-of-words matrix has 113292 rows (one per input sequence), each as long as the vocabulary.

In the example, both "thumb" and "up" get a value of 2, because each appears twice in the sequence.

Positive: 
- Sequences have a fixed size.

Negative:
- Very high dimensions.
- Order of words or semantic meaning is not preserved.
- Words that are not in the fitted vocabulary cannot be represented: when a new sequence contains such words, they are silently ignored and their information is lost.
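
To illustrate the alphabetical vocabulary and the unseen-word limitation listed above, here is a minimal sketch on a hypothetical three-document corpus (not the review dataset). The names `train_docs`, `toy_vectorizer` and `new_vec` are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration
train_docs = ["good app", "bad app", "good good movie"]

toy_vectorizer = CountVectorizer()
toy_bow = toy_vectorizer.fit_transform(train_docs)

# Recover the feature names ordered by column index (alphabetical order)
feature_names = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(feature_names)
print(toy_bow.toarray())  # one row per document, one column per vocabulary word

# "series" was never seen during fit, so it has no column and is dropped
new_vec = toy_vectorizer.transform(["good series"]).toarray()
print(new_vec)
```

Only the count for "good" survives in `new_vec`; the unseen word "series" leaves no trace in the vector.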

## TF-IDF


TF-IDF, or Term Frequency-Inverse Document Frequency, is an algorithm that creates a frequency-based representation like Bag of Words, but unlike Bag of Words it takes word importance into consideration. The idea is that a word that appears in many sentences/sequences is probably not very informative, while a word that appears in only a few is probably important. This way, words that are repeated very often don't overpower less frequent but important words. For a word 'x' in a sentence/sequence, the formulas are:
- TF(x) = (frequency of word 'x' in the sequence)/(total number of words in the sequence).
- IDF(x) = log((total number of sequences)/(number of sequences that contain word 'x')).
- TF-IDF(x) = TF(x) * IDF(x).

In IDF(x) the document frequency is inverted, so the more common a word is across all documents, the less important it is for the current document.
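
The formulas above can be computed by hand. The following is a minimal sketch in plain Python on a hypothetical corpus of three tokenized sequences; the function names `tf`, `idf` and `tf_idf` are made up for this example, and `idf` assumes the word occurs in at least one sequence.

```python
import math

# Hypothetical tokenized corpus, used only for illustration
docs = [
    ["thumb", "up", "thumb", "up"],
    ["good", "app"],
    ["give", "up"],
]

def tf(word, doc):
    """Frequency of the word in the sequence divided by the sequence length."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Log of (number of sequences / number of sequences containing the word)."""
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

thumb_score = tf_idf("thumb", docs[0], docs)
up_score = tf_idf("up", docs[0], docs)
print(thumb_score, up_score)
```

Both words have the same term frequency in the first sequence (2/4), but "up" also appears in another sequence, so its IDF is lower and its final score is smaller than the score of "thumb".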


In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit the model and transform the data
tfidf = vectorizer.fit_transform(df['Content_cleaned'])

print(len(vectorizer.vocabulary_))
print(tfidf.shape)

31451
(113292, 31451)


In [12]:
print(df['Content_cleaned'][2])
print(tfidf[2])

thumb up thumb up
  (0, 29212)	0.5498571095671961
  (0, 27621)	0.8352587377923134


In [13]:
sorted_vocab_keys = sorted(vectorizer.vocabulary_.keys())
print(f"27621 is {sorted_vocab_keys[27621]}.")
print(f"29212 is {sorted_vocab_keys[29212]}.")

27621 is thumb.
29212 is up.


Just like with Bag of Words, we have a vocabulary of 31451 entries and 113292 vectors of that size.

Unlike Bag of Words, where both words got the value 2, here the word "thumb" gets a higher value than the word "up", meaning it is considered more important. Both words occur twice in this sequence, so their term frequencies are equal and the difference comes from the IDF term: "thumb" must appear in fewer sequences than "up", which makes it more significant.
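
The values above are not raw TF * IDF products. With its default settings, scikit-learn's TfidfVectorizer uses raw counts as term frequency, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then L2-normalizes every row, which is why the squared values of a row sum to 1. The sketch below reproduces this by hand on a hypothetical toy corpus; `toy_docs`, `smooth_idf` and the other names are made up for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
toy_docs = ["thumb up thumb up", "good app", "give up"]

toy_tfidf_vectorizer = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
toy_tfidf = toy_tfidf_vectorizer.fit_transform(toy_docs).toarray()
toy_vocab = toy_tfidf_vectorizer.vocabulary_

n_docs = len(toy_docs)

def smooth_idf(doc_freq):
    """Smoothed IDF used by scikit-learn when smooth_idf=True."""
    return np.log((1 + n_docs) / (1 + doc_freq)) + 1

# In the first document both words have a raw count of 2;
# "thumb" occurs in 1 document, "up" occurs in 2 documents
raw = np.array([2 * smooth_idf(1), 2 * smooth_idf(2)])
manual = raw / np.linalg.norm(raw)  # L2 normalization of the row

sklearn_row = np.array([toy_tfidf[0, toy_vocab["thumb"]], toy_tfidf[0, toy_vocab["up"]]])
print(manual)
print(sklearn_row)
```

The two printed rows should agree, with "thumb" scoring higher than "up" for the same reason as in the real dataset.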

Positive: 
- Sequences have a fixed size.
- Some word importance is considered, unlike Bag of Words.

Negative:
- Very high dimensions.
- Order of words is still not preserved.
- Again, words in a new sequence that are not part of the fitted vocabulary are ignored, so they cannot be represented.

## Word2Vec

Word2Vec is a neural network-based model for learning word embeddings. Unlike the frequency-based vectorization algorithms, its vector representations are learned from the context in which words appear. Since every word is represented as an n-dimensional vector, one can imagine all words mapped into this n-dimensional space in such a way that words with similar meanings lie close to one another.

There are two main ways to implement Word2Vec, CBoW and Skip-Gram.

### CBoW

In CBoW, or Continuous Bag of Words, a neural network with a single hidden layer is trained. It takes the context (vicinity) words as input and its goal is to predict the current word. For example, given the sentence "the small kid ate a banana" and vicinity=2, one input to the model is (small, kid, a, banana) and the expected output is "ate". We choose a vicinity number m, and for every word in our sequences a training example is prepared, with the m neighboring words on each side as inputs and the word itself as the target. All words are turned into one-hot encodings.

After training, we do not use the network itself; we keep the input-to-hidden weight matrix as the word embedding matrix. The embedding dimension equals the size of the hidden layer, which we define as a hyperparameter. Let's say the vocabulary has 34326 words. If we choose a hidden layer of size 300, the embedding matrix has shape 34326x300, and every word is a one-hot vector of size 1x34326. Multiplying a word's one-hot vector by the embedding matrix yields a vector of size 1x300, which is our final goal.
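
As a rough sketch of the two ideas above, the code below first builds (context, target) training pairs for the example sentence and then shows that multiplying a one-hot vector by an embedding matrix simply selects one row of it. The function `cbow_pairs`, the tiny hidden size of 4 and the random (untrained) weights are assumptions made purely for illustration; no network is trained here.

```python
import numpy as np

sentence = "the small kid ate a banana".split()
window = 2  # the vicinity number m, applied on each side of the target

def cbow_pairs(tokens, window):
    """Return (context_words, target_word) pairs for every position."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

pairs = cbow_pairs(sentence, window)
print(pairs[3])  # the pair whose target is "ate"

# One-hot vector times embedding matrix = row lookup
vocab = sorted(set(sentence))
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
hidden_size = 4  # tiny value for readability; the notebook uses 300
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), hidden_size))  # random, untrained

one_hot = np.zeros(len(vocab))
one_hot[word_to_idx["ate"]] = 1.0
vector = one_hot @ embedding_matrix  # shape (hidden_size,)
print(vector)
```

Because the product only selects a row, libraries store the embeddings as a lookup table and skip the explicit one-hot multiplication.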

### Skip-Gram

Skip-Gram is the mirror image of CBoW: instead of feeding the network the context words and predicting the current word, we feed it the current word and it tries to predict the context (vicinity) words. For example, given the sentence "the small kid ate a banana" and vicinity=2, one input to the model is "ate" and the expected output is (small, kid, a, banana). Again we choose a vicinity number m, and for every word in our sequences a training example is prepared, with the word as input and the m neighboring words on each side as targets. Then a neural network is trained and the input-to-hidden weights are taken as word embeddings. The vectorizing of our dataset is then done the same way as in CBoW.
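
The data preparation for Skip-Gram can be sketched the same way. In the common formulation, each (input word, context word) combination becomes its own training example; the helper `skipgram_pairs` below is a hypothetical name written for this illustration, not part of any library.

```python
sentence = "the small kid ate a banana".split()
window = 2  # the vicinity number m, applied on each side of the input word

def skipgram_pairs(tokens, window):
    """Return (input_word, context_word) pairs for every position."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the input word itself
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(sentence, window)
ate_pairs = [pair for pair in pairs if pair[0] == "ate"]
print(ate_pairs)
```

For the input "ate", the targets are exactly the four context words from the example in the text.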

### Differences

Skip-Gram generally works better when the dataset is small, and it tends to represent rare words well. CBoW generally works better on larger datasets, represents frequent words better and is faster to train.


We can either use pretrained word embeddings or train a new model ourselves. A commonly used pretrained model is the one released by Google, trained on Google News, with 300-dimensional vectors. In this notebook we will try both options and see how they compare, both in vectorizing and later in our models.

In [14]:
from gensim import models

In [15]:
w2v = models.KeyedVectors.load_word2vec_format(
    '../GoogleNews-vectors-negative300.bin', binary=True)

In [16]:
def get_average_word2vec(tokens_list, model, vector_size):
    """
    This function computes the average Word2Vec for a given list of tokens.
    """
    # Filter the tokens that are present in the Word2Vec model
    valid_tokens = [token for token in tokens_list if token in model]
    if not valid_tokens:
        return np.zeros(vector_size)
    
    # Compute the average Word2Vec
    word_vectors = [model[token] for token in valid_tokens]
    average_vector = np.mean(word_vectors, axis=0)
    return average_vector

# Tokenize the text data
df['tokens'] = df['Content_cleaned'].apply(lambda x: x.split())

# Compute the average Word2Vec for each row
vector_size = w2v.vector_size
df['word2vec_pretrained'] = df['tokens'].apply(lambda x: get_average_word2vec(x, w2v, vector_size))

df.head()

Unnamed: 0,Content,Score,Sentiment,Content_cleaned,tokens,word2vec_pretrained
0,Plsssss stoppppp giving screen limit like when...,2,negative,plss stopp give screen limit like ur watch thi...,"[plss, stopp, give, screen, limit, like, ur, w...","[0.08365452, 0.0579847, 0.11433671, -0.0025425..."
1,Good,5,positive,good,[good],"[0.040527344, 0.0625, -0.017456055, 0.07861328..."
2,👍👍,5,positive,thumb up thumb up,"[thumb, up, thumb, up]","[0.08703613, 0.07147217, -0.00390625, 0.005859..."
3,Good,3,neutral,good,[good],"[0.040527344, 0.0625, -0.017456055, 0.07861328..."
4,"App is useful to certain phone brand ,,,,it is...",1,negative,app useful certain phone brand except phone tr...,"[app, useful, certain, phone, brand, except, p...","[0.0644662, -0.0806833, -0.0020926339, 0.02535..."


In [17]:
import multiprocessing

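def get_average_word2vec2(tokens_list, model, vector_size):
    """
    Same as get_average_word2vec, but for a trained gensim Word2Vec model,
    whose word vectors are stored in model.wv.
    """
    # Keep only the tokens that exist in the model's vocabulary
    valid_tokens = [token for token in tokens_list if token in model.wv]
    if not valid_tokens:
        return np.zeros(vector_size)
    word_vectors = [model.wv[token] for token in valid_tokens]
    average_vector = np.mean(word_vectors, axis=0)
    return average_vector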

# Define model parameters
vector_size = 300   # Dimensionality of the word vectors
window_size = 5     # Context window size
min_count = 1       # Minimum word frequency
workers = multiprocessing.cpu_count()  # Number of worker threads to use

# Train the Word2Vec model
cbow = models.Word2Vec(df['tokens'].tolist(), vector_size=vector_size, sg=0, window=window_size, min_count=min_count, workers=workers)

# Save the model
model_path = "cbow.model"
cbow.save(model_path)

print(f"Model saved at {model_path}")

df['word2vec_cbow'] = df['tokens'].apply(lambda x: get_average_word2vec2(x, cbow, vector_size))

# Train the Word2Vec model
skipgram = models.Word2Vec(df['tokens'].tolist(), vector_size=vector_size, sg=1, window=window_size, min_count=min_count, workers=workers)

# Save the model
model_path = "skipgram.model"
skipgram.save(model_path)

print(f"Model saved at {model_path}")

df['word2vec_skipgram'] = df['tokens'].apply(lambda x: get_average_word2vec2(x, skipgram, vector_size))

df.head()

Model saved at cbow.model
Model saved at skipgram.model


Unnamed: 0,Content,Score,Sentiment,Content_cleaned,tokens,word2vec_pretrained,word2vec_cbow,word2vec_skipgram
0,Plsssss stoppppp giving screen limit like when...,2,negative,plss stopp give screen limit like ur watch thi...,"[plss, stopp, give, screen, limit, like, ur, w...","[0.08365452, 0.0579847, 0.11433671, -0.0025425...","[0.12165112, 0.20407064, -0.18353447, -0.23372...","[0.124642536, 0.02039718, -0.061326027, -0.039..."
1,Good,5,positive,good,[good],"[0.040527344, 0.0625, -0.017456055, 0.07861328...","[0.025911188, -0.83119565, -0.03553019, -0.544...","[0.11477529, -0.09182124, -0.14298776, -0.2736..."
2,👍👍,5,positive,thumb up thumb up,"[thumb, up, thumb, up]","[0.08703613, 0.07147217, -0.00390625, 0.005859...","[-0.16716677, -0.093416005, -1.4007342, 0.4197...","[-0.043917134, 0.13976072, -0.39437324, -0.012..."
3,Good,3,neutral,good,[good],"[0.040527344, 0.0625, -0.017456055, 0.07861328...","[0.025911188, -0.83119565, -0.03553019, -0.544...","[0.11477529, -0.09182124, -0.14298776, -0.2736..."
4,"App is useful to certain phone brand ,,,,it is...",1,negative,app useful certain phone brand except phone tr...,"[app, useful, certain, phone, brand, except, p...","[0.0644662, -0.0806833, -0.0020926339, 0.02535...","[-0.16657814, 0.06822517, 0.1050694, -0.113447...","[0.16508076, 0.045392197, -0.108809195, -0.125..."


In [18]:
print(cbow.wv.most_similar("movie"))

[('film', 0.7634493112564087), ('stuff', 0.6408114433288574), ('show', 0.5823018550872803), ('programme', 0.5807750821113586), ('program', 0.5739870071411133), ('genre', 0.5574767589569092), ('category', 0.5541295409202576), ('series', 0.5440436601638794), ('watched', 0.543439507484436), ('anime', 0.5359500646591187)]


In [19]:
print(skipgram.wv.most_similar("movie"))

[('flim', 0.7929681539535522), ('binging', 0.7497050762176514), ('reccomende', 0.7456972002983093), ('flick', 0.7446443438529968), ('kdramas', 0.7385998368263245), ('ahow', 0.7357904314994812), ('oldie', 0.7341920733451843), ('wonderfull', 0.7281920909881592), ('syfy', 0.7268280982971191), ('sitcom', 0.7250053286552429)]


In [20]:
print(w2v.most_similar("movie"))

[('film', 0.8676770329475403), ('movies', 0.8013108372688293), ('films', 0.7363011837005615), ('moive', 0.6830360889434814), ('Movie', 0.6693680286407471), ('horror_flick', 0.6577848792076111), ('sequel', 0.6577793955802917), ('Guy_Ritchie_Revolver', 0.650975227355957), ('romantic_comedy', 0.6413198709487915), ('flick', 0.6321909427642822)]


In the above example for the word "movie" we see which similar words the three different models return. All three embeddings return related words, but CBoW seems to do better than Skip-Gram: CBoW returns more frequent and correct words, while Skip-Gram returns rarer tokens, including misspellings such as "flim" and "ahow". For this reason we will continue with the pretrained version and CBoW.
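
The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of what that measure computes, using small made-up vectors rather than real embeddings (the function name `cosine_similarity` is defined here for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional vectors, used only for illustration
vec_a = np.array([1.0, 2.0, 0.0])
vec_b = np.array([2.0, 4.0, 0.0])  # same direction as vec_a, different length
vec_c = np.array([0.0, 0.0, 3.0])  # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(sim_ab, sim_ac)
```

With the models from this notebook, `cosine_similarity(cbow.wv["movie"], cbow.wv["film"])` should closely match the score reported for "film" in the CBoW output above.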

## GloVe