## GloVe (Global Vectors)


GloVe, or Global Vectors, is an algorithm that, like Word2Vec, learns word embeddings from the contexts in which words appear. The difference is that Word2Vec only considers local information from the window around each word, whereas GloVe captures both local and global corpus statistics.

It relies on word-word co-occurrence probabilities computed with a fixed window size over our dataset. The algorithm starts by creating a word co-occurrence matrix from a large corpus. First, it identifies the vocabulary of the corpus. Then, for each word in the vocabulary and a fixed context window size, it counts how often every other word appears within that window around the given word, across the whole corpus. Finally, the co-occurrence matrix is built, with each element holding the number of times the two corresponding words appear together.

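This word co-occurrence matrix is used to examine ratios of co-occurrence probabilities between words. For two target words and a context word, the ratio of their co-occurrence probabilities with the context word indicates which of the two targets the context word is more closely associated with: a ratio far above or below 1 means the context word relates to one target much more than to the other, while a ratio close to 1 means it relates to both or to neither. GloVe approximates this ratio by the exponential of the dot product between the context word's vector and the difference of the two target words' vectors.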
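The relation described above can be written compactly. The notation below is borrowed from the original GloVe paper and is not defined elsewhere in this notebook, so treat the symbols as assumptions: $X_{ik}$ is the number of times word $k$ appears in the context of word $i$, $P_{ik} = X_{ik} / \sum_m X_{im}$ is the corresponding co-occurrence probability, $w_i$ and $w_j$ are the vectors of the two target words, and $\tilde{w}_k$ is the vector of the context word.

$$
\frac{P_{ik}}{P_{jk}} \approx \exp\left( (w_i - w_j)^\top \tilde{w}_k \right)
$$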

Unlike Word2Vec, GloVe does not train a neural network. Instead, it minimizes a weighted least-squares cost function over word vectors w_i and w_j for pairs of words, such that their dot product plus bias terms approximates the logarithm of the corresponding co-occurrence count in the matrix.
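For reference, the cost function can be written as follows. The notation follows the original GloVe paper and is an assumption with respect to this notebook: $V$ is the vocabulary size, $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are the bias terms, and $f$ is a weighting function that limits the influence of very rare and very frequent pairs.

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

$$
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
$$

The paper reports using $x_{\max} = 100$ and $\alpha = 3/4$. Because $f(0) = 0$, word pairs that never co-occur do not contribute to the sum.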



### Example: Calculating the Co-occurrence Matrix

#### Step 1: Start with 3 Phrases
Let's use the following phrases as our documents, as in the TF-IDF example:

1. "The car is fast"
2. "The car is red"
3. "The fast car is blue"

#### Step 2: Create the Co-occurrence Matrix
First, list all unique words across the phrases: `the`, `car`, `is`, `fast`, `red`, `blue`.

For simplicity, we will use a context window size of 1, so only the words immediately to the left and right of a word count as its context.

Matrix Construction:

Phrase 1: "The car is fast"

The: (car) → 1 

Car: (The, is) → 1, 1

Is: (car, fast) → 1, 1

Fast: (is) → 1

Phrase 2: "The car is red"

The: (car) → 1

Car: (The, is) → 1, 1

Is: (car, red) → 1, 1

Red: (is) → 1

Phrase 3: "The fast car is blue"

The: (fast) → 1

Fast: (The, car) → 1, 1

Car: (fast, is) → 1, 1

Is: (car, blue) → 1, 1

Blue: (is) → 1

| Term | the | car | is | fast | red | blue |
|------|-----|-----|----|------|-----|------|
| the  | 0   | 2   | 0  | 1    | 0   | 0    |
| car  | 2   | 0   | 3  | 1    | 0   | 0    |
| is   | 0   | 3   | 0  | 1    | 1   | 1    |
| fast | 1   | 1   | 1  | 0    | 0   | 0    |
| red  | 0   | 0   | 1  | 0    | 0   | 0    |
| blue | 0   | 0   | 1  | 0    | 0   | 0    |
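The counting procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than the reference GloVe implementation: the function name `cooccurrence_matrix` is hypothetical, and every neighbour inside the window counts as 1 (the GloVe paper additionally down-weights more distant context words, which makes no difference for a window size of 1).

```python
import numpy as np

phrases = ["The car is fast", "The car is red", "The fast car is blue"]
vocab = ["the", "car", "is", "fast", "red", "blue"]
index = {word: i for i, word in enumerate(vocab)}


def cooccurrence_matrix(docs, index, window=1):
    """Count how often each pair of words appears within `window` positions."""
    matrix = np.zeros((len(index), len(index)), dtype=int)
    for doc in docs:
        tokens = doc.lower().split()
        for pos, word in enumerate(tokens):
            # Context positions on both sides of the current word
            start = max(0, pos - window)
            end = min(len(tokens), pos + window + 1)
            for ctx_pos in range(start, end):
                if ctx_pos != pos:
                    matrix[index[word], index[tokens[ctx_pos]]] += 1
    return matrix


X = cooccurrence_matrix(phrases, index)
print(X)
```

The printed matrix should match the table above, and it is symmetric because the window extends equally to the left and to the right.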


Then the cost function is minimized, creating the word vectors.
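To give a feel for this minimization step, here is a rough sketch that fits 2-dimensional vectors to the toy matrix. Several things are assumptions made for illustration: plain gradient descent is used instead of the AdaGrad optimizer of the reference implementation, `x_max=3.0` is chosen to suit this tiny corpus, and the names `weight` and `loss_and_error` are hypothetical.

```python
import numpy as np

# Co-occurrence counts from the table above (rows/columns: the, car, is, fast, red, blue)
X = np.array([
    [0, 2, 0, 1, 0, 0],
    [2, 0, 3, 1, 0, 0],
    [0, 3, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
], dtype=float)


def weight(x, x_max=3.0, alpha=0.75):
    """Weighting f(x): down-weights rare pairs and caps frequent ones at 1.

    x_max=3.0 is an assumption chosen for this tiny corpus.
    """
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)


def loss_and_error(W, W_ctx, b, b_ctx, X):
    """Return the weighted squared error and the per-pair error matrix."""
    observed = X > 0                             # only co-occurring pairs are fitted
    log_x = np.log(np.where(observed, X, 1.0))   # placeholder log(1) = 0 for unobserved pairs
    pred = W @ W_ctx.T + b[:, None] + b_ctx[None, :]
    err = (pred - log_x) * observed
    return float(np.sum(weight(X) * err ** 2)), err


rng = np.random.default_rng(0)
n_words, dim = X.shape[0], 2
W = rng.normal(scale=0.1, size=(n_words, dim))      # word vectors
W_ctx = rng.normal(scale=0.1, size=(n_words, dim))  # context vectors
b = np.zeros(n_words)
b_ctx = np.zeros(n_words)

initial_loss, _ = loss_and_error(W, W_ctx, b, b_ctx, X)
learning_rate = 0.05
for step in range(2000):
    _, err = loss_and_error(W, W_ctx, b, b_ctx, X)
    grad = 2.0 * weight(X) * err                 # derivative of the cost w.r.t. each prediction
    grad_W, grad_W_ctx = grad @ W_ctx, grad.T @ W
    W -= learning_rate * grad_W
    W_ctx -= learning_rate * grad_W_ctx
    b -= learning_rate * grad.sum(axis=1)
    b_ctx -= learning_rate * grad.sum(axis=0)

final_loss, _ = loss_and_error(W, W_ctx, b, b_ctx, X)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")

# The GloVe paper uses the sum of word and context vectors as the final embedding
word_vectors = W + W_ctx
```

With only three short phrases the resulting vectors carry little meaning; the point is only to show how the counts in the matrix drive the optimization.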

In practice, pre-trained GloVe embeddings, which are trained on huge datasets, are usually used. In this notebook we will use **GloVe 6B**, trained on Wikipedia and Gigaword (6 billion tokens, 400K-word vocabulary), and **GloVe Twitter**, trained on 2 billion tweets (27 billion tokens, 1.2 million-word vocabulary). GloVe 6B is distributed in 50, 100, 200, and 300 dimensions, and GloVe Twitter in 25, 50, 100, and 200 dimensions. We will use the 100-dimensional version of each to keep the dimensionality low. We use the first as a general-purpose embedding and the second as a more domain-specific embedding, since tweets and Netflix reviews are similar in that both often contain slang and informal language.

### Implementation in Python

Let's begin by importing the libraries we need.

In [1]:
import pandas as pd
import numpy as np

Next, let's create a sample dataset, just like we did previously.

In [2]:
# Sample DataFrame with Netflix app reviews
df = pd.DataFrame({
    'Content_cleaned': [
        'the app is great new features but crashes often',
        'love the app love the content but it crashes',
        'the app crashes too much it is frustrating',
        'the content is great it is easy to use it is great'
    ]
})

Let's visualize it.

In [3]:
print(df.to_string())

                                      Content_cleaned
0     the app is great new features but crashes often
1        love the app love the content but it crashes
2          the app crashes too much it is frustrating
3  the content is great it is easy to use it is great


We can either use pretrained word embeddings or train a new model ourselves. For GloVe, the pretrained vectors are published by the Stanford NLP group. In this notebook we will try both pretrained sets, GloVe 6B and GloVe Twitter, and see how they compare, both in vectorizing and later in our models.

Here is the code for applying the pretrained GloVe 6B embeddings with vector size 100:

In [5]:
# Path to the GloVe embeddings file
glove_file = '../../glove.6B.100d.txt'

# Load the GloVe embeddings into a dictionary
def load_glove_embeddings(glove_file):
    embeddings_index = {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index

# Load the GloVe embeddings
glove_6b = load_glove_embeddings(glove_file)
print(f"Loaded {len(glove_6b)} word vectors from GloVe.")

# Define a function to get the average GloVe vector for a list of tokens
def get_average_glove(tokens_list, embeddings, embedding_dim):
    valid_tokens = [token for token in tokens_list if token in embeddings]
    if not valid_tokens:
        return np.zeros(embedding_dim)
    word_vectors = [embeddings[token] for token in valid_tokens]
    average_vector = np.mean(word_vectors, axis=0)
    return average_vector

# Define the embedding dimension (e.g., 100 for 'glove.6B.100d.txt')
embedding_dim = 100

# Tokenize the text data
df['tokens'] = df['Content_cleaned'].apply(lambda x: x.split())

# Compute the average GloVe vector for each row
df['glove_6B'] = df['tokens'].apply(lambda x: get_average_glove(x, glove_6b, embedding_dim))

df.head()

Loaded 400000 word vectors from GloVe.


|   | Content_cleaned | tokens | glove_6B |
|---|-----------------|--------|----------|
| 0 | the app is great new features but crashes often | [the, app, is, great, new, features, but, cras... | [-0.3715152, 0.21509555, 0.6008711, -0.2824737... |
| 1 | love the app love the content but it crashes | [love, the, app, love, the, content, but, it, ... | [-0.14057177, 0.06771444, 0.64484, -0.33956334... |
| 2 | the app crashes too much it is frustrating | [the, app, crashes, too, much, it, is, frustra... | [-0.43096676, 0.11243025, 0.7203987, -0.325277... |
| 3 | the content is great it is easy to use it is g... | [the, content, is, great, it, is, easy, to, us... | [-0.30285382, 0.20630115, 0.62620914, -0.18935... |


As in Word2Vec, to compute the vector of a sequence we look up the embedding of every token separately and then average them to get the sequence vector. This way each word contributes to the final vector, which represents the combined meaning of the individual words and captures the overall semantics of the text. Tokens that are not in the GloVe vocabulary are skipped, and a sequence with no known tokens gets an all-zero vector.

As we see in the example below, each sequence has been turned into a vector of size 100.

In [7]:
print(df['glove_6B'][1])

[-0.14057177  0.06771444  0.64484    -0.33956334  0.09621267  0.22005375
 -0.12805289  0.04956245  0.15710913 -0.03636066  0.29230332  0.03594959
  0.22148466 -0.07713312  0.07276111 -0.05160556  0.14087535  0.15393445
 -0.2744702   0.55991334  0.21917889 -0.25078142  0.29296112  0.21898113
  0.2882509   0.27953756 -0.01664556 -0.27808443  0.07061221 -0.18676221
 -0.14077044  0.76046836 -0.02893378  0.02613545  0.1474149   0.25755107
 -0.04283706  0.2562351   0.04777002 -0.22878289 -0.301087   -0.09659548
 -0.08913607 -0.21312311 -0.01634667  0.05641835  0.10402111 -0.27211988
  0.06073888 -0.52196     0.12872776  0.13962911  0.20483     0.8879567
 -0.20441791 -2.230979   -0.04423889 -0.03921788  1.5170234   0.42475557
 -0.05490556  0.7453697  -0.10607912  0.05236555  0.6565278   0.06484122
  0.52867174  0.00745687  0.13892445 -0.28618333  0.05879478 -0.11244889
  0.218283   -0.18381844  0.1074889   0.23128165 -0.03069188 -0.18924624
 -0.8462033  -0.07151488  0.17390333 -0.01163    -0.
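To make the averaging step easier to follow, here is a small sketch on made-up 3-dimensional vectors. The dictionary `toy_embeddings` and the function name `average_vector` are hypothetical and only mirror the logic of `get_average_glove`; the values are not real GloVe vectors.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, made up for illustration only
toy_embeddings = {
    "the": np.array([1.0, 0.0, 0.0], dtype="float32"),
    "app": np.array([0.0, 2.0, 0.0], dtype="float32"),
    "crashes": np.array([0.0, 0.0, 3.0], dtype="float32"),
}


def average_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [embeddings[token] for token in tokens if token in embeddings]
    if not known:
        # No token is in the vocabulary: fall back to an all-zero vector
        return np.zeros(dim, dtype="float32")
    return np.mean(known, axis=0)


all_known = average_vector(["the", "app", "crashes"], toy_embeddings, 3)
one_unknown = average_vector(["the", "app", "thumbs_up"], toy_embeddings, 3)
none_known = average_vector(["thumbs_up"], toy_embeddings, 3)

print(all_known)
print(one_unknown)
print(none_known)
```

Note that the second result is the mean of only two vectors, because the unknown token is dropped before averaging rather than counted as zero.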

We now repeat the same steps with the GloVe Twitter embeddings.

In [6]:
# Path to the GloVe Twitter embeddings file
glove_file = '../../glove.twitter.27B.100d.txt'

# Load the GloVe Twitter embeddings, reusing load_glove_embeddings from the previous cell
glove_twitter = load_glove_embeddings(glove_file)
print(f"Loaded {len(glove_twitter)} word vectors from GloVe.")

# Compute the average GloVe vector for each row,
# reusing get_average_glove and embedding_dim (100) from the previous cell
df['glove_twitter'] = df['tokens'].apply(lambda x: get_average_glove(x, glove_twitter, embedding_dim))

df.head()

Loaded 1193514 word vectors from GloVe.


|   | Content_cleaned | tokens | glove_6B | glove_twitter |
|---|-----------------|--------|----------|---------------|
| 0 | the app is great new features but crashes often | [the, app, is, great, new, features, but, cras... | [-0.3715152, 0.21509555, 0.6008711, -0.2824737... | [0.2957551, 0.019755114, 0.002848328, 0.074190... |
| 1 | love the app love the content but it crashes | [love, the, app, love, the, content, but, it, ... | [-0.14057177, 0.06771444, 0.64484, -0.33956334... | [0.08385044, 0.029493107, 0.011498875, 0.21228... |
| 2 | the app crashes too much it is frustrating | [the, app, crashes, too, much, it, is, frustra... | [-0.43096676, 0.11243025, 0.7203987, -0.325277... | [0.20398799, 0.030513998, 0.2892485, 0.11278, ... |
| 3 | the content is great it is easy to use it is g... | [the, content, is, great, it, is, easy, to, us... | [-0.30285382, 0.20630115, 0.62620914, -0.18935... | [0.2149126, 0.027656665, 0.25671306, 0.2465589... |


Let's now apply what we have learned to the Netflix dataset.

In [9]:
# Read the dataset
df = pd.read_csv('../DATASETS/preprocessed_text.csv')

# Filling empty text that occurred after text preprocessing
df.fillna('', inplace=True)

# Tokenize the text data
df['tokens'] = df['content_cleaned'].apply(lambda x: x.split())

# Path to the GloVe embeddings file
glove_file = '../../glove.6B.100d.txt'

# Load the GloVe embeddings into a dictionary
def load_glove_embeddings(glove_file):
    embeddings_index = {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index

# Load the GloVe embeddings
glove_6b = load_glove_embeddings(glove_file)
print(f"Loaded {len(glove_6b)} word vectors from GloVe.")

# Define a function to get the average GloVe vector for a list of tokens
def get_average_glove(tokens_list, embeddings, embedding_dim):
    valid_tokens = [token for token in tokens_list if token in embeddings]
    if not valid_tokens:
        return np.zeros(embedding_dim)
    word_vectors = [embeddings[token] for token in valid_tokens]
    average_vector = np.mean(word_vectors, axis=0)
    return average_vector

# Define the embedding dimension (e.g., 100 for 'glove.6B.100d.txt')
embedding_dim = 100

# Compute the average GloVe vector for each row
df['glove_6B'] = df['tokens'].apply(lambda x: get_average_glove(x, glove_6b, embedding_dim))

# Path to the GloVe Twitter embeddings file
glove_file = '../../glove.twitter.27B.100d.txt'

# Load the GloVe Twitter embeddings (load_glove_embeddings, get_average_glove
# and embedding_dim are already defined earlier in this cell)
glove_twitter = load_glove_embeddings(glove_file)
print(f"Loaded {len(glove_twitter)} word vectors from GloVe.")

# Compute the average GloVe vector for each row
df['glove_twitter'] = df['tokens'].apply(lambda x: get_average_glove(x, glove_twitter, embedding_dim))

df.head()

Loaded 400000 word vectors from GloVe.
Loaded 1193514 word vectors from GloVe.


|   | content | sentiment | content_cleaned | tokens | glove_6B | glove_twitter |
|---|---------|-----------|-----------------|--------|----------|---------------|
| 0 | Plsssss stoppppp giving screen limit like when... | negative | plss stopp giving screen limit like when you a... | [plss, stopp, giving, screen, limit, like, whe... | [-0.101591855, 0.21243754, 0.45259842, -0.2616... | [0.058894146, 0.18850434, 0.08296321, 0.174713... |
| 1 | Good | positive | good | [good] | [-0.030769, 0.11993, 0.53909, -0.43696, -0.739... | [0.091552, 0.093336, -0.028113, 0.3699, 0.1895... |
| 2 | 👍👍 | positive | thumbs_up | [thumbs_up] | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 3 | Good | neutral | good | [good] | [-0.030769, 0.11993, 0.53909, -0.43696, -0.739... | [0.091552, 0.093336, -0.028113, 0.3699, 0.1895... |
| 4 | App is useful to certain phone brand ,,,,it is... | negative | app is useful to certain phone brand it is not... | [app, is, useful, to, certain, phone, brand, i... | [-0.19991928, 0.11995281, 0.36286283, -0.22692... | [0.23760791, 0.07707109, 0.06094666, 0.2031615... |


Just like in Word2Vec, for a particular word we can find the most similar words in the embedding vocabulary. This is done by computing the cosine distance between our word's vector and every other word vector, and keeping the words with the smallest distance.

In [10]:
from scipy import spatial

def find_closest_embeddings(embeddings_dict, word, top_n=5):
    if word not in embeddings_dict:
        print(f"Word '{word}' not found in the embedding dictionary.")
        return []
    
    embedding = embeddings_dict[word]
    closest_words = sorted(
        embeddings_dict.keys(), 
        key=lambda w: spatial.distance.cosine(embeddings_dict[w], embedding) if len(embeddings_dict[w]) == len(embedding) else float('inf')
    )
    closest_words.remove(word)  # Remove the word itself from the results
    return closest_words[:top_n]


similar_words_6b = find_closest_embeddings(glove_6b, 'movie')
print(f"Words similar to 'movie' in GloVe 6B: {similar_words_6b}")

# Test with GloVe Twitter
similar_words_twitter = find_closest_embeddings(glove_twitter, 'movie')
print(f"Words similar to 'movie' in GloVe Twitter: {similar_words_twitter}")

Words similar to 'movie' in GloVe 6B: ['film', 'movies', 'films', 'hollywood', 'comedy']
Words similar to 'movie' in GloVe Twitter: ['movies', 'episode', 'story', 'trailer', 'watching']


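For the word "movie", the example above shows the most similar words according to each of the two embeddings. Both return relevant words: GloVe 6B leans toward film-related vocabulary, while GloVe Twitter returns more conversational terms such as "episode", "trailer" and "watching".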
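Sorting the whole vocabulary with one `spatial.distance.cosine` call per word can be slow for hundreds of thousands of vectors. As a possible alternative, the sketch below computes all cosine similarities with a single matrix product in NumPy. The function name `most_similar` and the toy vectors are hypothetical, and the sketch assumes that every vector in the dictionary has the same length.

```python
import numpy as np


def most_similar(embeddings, word, top_n=5):
    """Rank all words by cosine similarity to `word` using one matrix product."""
    if word not in embeddings:
        return []
    words = list(embeddings.keys())
    matrix = np.vstack([embeddings[w] for w in words]).astype("float32")
    norms = np.linalg.norm(matrix, axis=1)
    norms[norms == 0] = 1.0              # avoid division by zero for all-zero vectors
    unit = matrix / norms[:, None]       # rows now have unit length
    scores = unit @ unit[words.index(word)]
    order = np.argsort(-scores)          # highest similarity first
    ranked = [(words[i], float(scores[i])) for i in order if words[i] != word]
    return ranked[:top_n]


# Hypothetical 2-dimensional vectors, made up for illustration only
toy = {
    "movie": np.array([1.0, 0.0]),
    "film": np.array([0.9, 0.1]),
    "movies": np.array([0.8, 0.3]),
    "car": np.array([0.0, 1.0]),
}
neighbours = most_similar(toy, "movie", top_n=2)
print(neighbours)
```

With the notebook's variables this could be called as `most_similar(glove_6b, 'movie')`. Building the matrix on every call is wasteful, so for repeated queries it would make sense to build and normalize it once and reuse it.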

## Pros and Cons of GloVe

Positive:
- **Scalability**: Works well with large datasets and can be easily scaled.
- **Low Dimensionality**: The dimensionality is far lower than that of the frequency-based methods.
- **Context taken into account**: Semantics and context, both local and global, form the basis of the vectors.

Negative:
- **Needs a large dataset**: Requires a large amount of training data to produce high-quality embeddings.
- **Out-Of-Vocabulary Issue**: Once again we work with a finite vocabulary, based on the corpus the model was trained on. If a new sequence contains words outside this vocabulary, they are ignored when the embedding is produced, possibly losing important information.
- **Polysemy issue**: Can struggle with words with multiple meanings since it produces one vector per word.
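
One way to gauge how much the out-of-vocabulary issue affects a dataset is to measure the share of tokens that have no embedding. The helper below is a minimal sketch; the name `oov_report` is hypothetical, and the toy vocabulary and token lists are made up and do not reflect the actual contents of the GloVe files.

```python
from collections import Counter


def oov_report(token_lists, vocabulary):
    """Return the share of tokens missing from `vocabulary` and a count per missing token."""
    total = 0
    missing = Counter()
    for tokens in token_lists:
        for token in tokens:
            total += 1
            if token not in vocabulary:
                missing[token] += 1
    oov_rate = sum(missing.values()) / total if total else 0.0
    return oov_rate, missing


# Hypothetical vocabulary and token lists, made up for illustration only
toy_vocabulary = {"good", "giving", "screen"}
toy_tokens = [["good"], ["thumbs_up"], ["plss", "stopp", "giving", "screen"]]

rate, missing = oov_report(toy_tokens, toy_vocabulary)
print(f"OOV rate: {rate:.0%}")
print(missing.most_common(3))
```

With the notebook's variables this could be called as `oov_report(df['tokens'], glove_6b)`, since a dictionary supports the same membership test as a set.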