# GloVe
This notebook presents the methods used to get from cleaned tweets to sentiment predictions. Its objective is to assess the performance of the [GloVe](https://nlp.stanford.edu/projects/glove/) word representation. Other notebooks cover different word representations, so the focus is on evaluating performance under a fixed setup rather than on getting the absolute best performance. We therefore use [Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression), a relatively light and interpretable algorithm that is well suited to a binary classification task. The [scikit-learn](https://scikit-learn.org/stable/) library is used extensively throughout this notebook.

### Importing the needed packages

In [1]:
import numpy as np
import pandas as pd 
import csv
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

### Loading the clean data files

In [12]:
# Load the clean data
train_data = pd.read_csv('../Data/train_small.txt')

# Shuffle the data (frac=1 samples every row exactly once, in random order)
train_data = train_data.sample(frac=1)

### Using GloVe
The main drawbacks of the [Bag-of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) and [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) representations are as follows:

- The dimension of the embedding space depends on the number of distinct words in the corpus.
- The embedding representation of a given document is sparse.
- The embedding representation of a given document does not take word order into account.
- Semantically identical documents that use different sets of words (synonyms) can have a cosine similarity of zero.
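
The last point can be illustrated with a minimal sketch. The two documents and the helper names `bag_of_words` and `cosine_similarity` are made up for this illustration and are not part of the notebook's pipeline. The two documents mean the same thing but share no word, so their bag-of-words vectors are orthogonal.

```python
import math
from collections import Counter


def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.split(' '))
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Two made-up documents with the same meaning but no word in common
doc_a = 'great movie'
doc_b = 'awesome film'

# The vocabulary is the sorted set of distinct words in the toy corpus,
# so the embedding space has one dimension per distinct word
vocabulary = sorted(set(doc_a.split(' ')) | set(doc_b.split(' ')))

vec_a = bag_of_words(doc_a, vocabulary)
vec_b = bag_of_words(doc_b, vocabulary)
similarity = cosine_similarity(vec_a, vec_b)

print(vocabulary)
print(vec_a, vec_b)
print(similarity)
```

Each added distinct word adds one dimension to `vocabulary`, which is also what makes these representations grow with the corpus.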

The [GloVe](https://nlp.stanford.edu/projects/glove/) word representation aims to address these flaws by encoding a word with respect to the contexts in which it appears. A more detailed explanation can be found in the report. The GloVe embeddings used in this notebook were trained on a set of 2 billion tweets and are therefore particularly well suited to our goal.

In [13]:
# Load the pre-trained embedding with a specified dimension, here 200
words_embedding = pd.read_table('../Data/glove.twitter.27B.200d.txt', sep=" ", index_col=0, 
                                header=None, quoting=csv.QUOTE_NONE)
words_embedding = words_embedding[~words_embedding.index.isna()]
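
To make the file layout concrete, here is a minimal sketch of how such a text file can be parsed by hand: each line holds a word followed by its space-separated vector components. The helper `parse_glove_lines` and the 3-dimensional toy vectors are assumptions for illustration only; the notebook itself relies on `pd.read_table`. The toy data includes a quote character as a token, which is the kind of entry that presumably motivates `quoting=csv.QUOTE_NONE` in the cell above.

```python
import io


def parse_glove_lines(handle):
    """Map each word to its vector (a list of floats) from GloVe-style text lines."""
    embeddings = {}
    for line in handle:
        parts = line.rstrip('\n').split(' ')
        word, values = parts[0], parts[1:]
        embeddings[word] = [float(value) for value in values]
    return embeddings


# Made-up 3-dimensional vectors standing in for the real 200-dimensional file
toy_file = io.StringIO(
    'good 0.1 0.2 0.3\n'
    'bad -0.1 -0.2 -0.3\n'
    '" 0.5 0.0 0.5\n'
)

toy_embeddings = parse_glove_lines(toy_file)
print(len(toy_embeddings))
print(toy_embeddings['"'])
```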

Once the pre-trained embedding is loaded, we can remove from the embedding matrix the rows that won't be used.

In [14]:
# Create a vocabulary from the train set of tweets
# (one row per distinct word, stored in the 'tweet' column)
vocabulary = pd.DataFrame(train_data['tweet'].apply(lambda x: x.split(' ')).explode()).drop_duplicates()

# Removing unused words from pretrained embeddings
words_embedding = words_embedding.merge(vocabulary, how='inner', left_on=words_embedding.index, right_on='tweet')
words_embedding = words_embedding.rename(columns={'tweet': 'word'})
words_embedding = words_embedding.set_index('word')

In [15]:
# Make a dictionary out of the dataframe for faster access (word -> embedding vector)
embedding_dict = dict(zip(words_embedding.index, words_embedding.values))

Once the embedding matrix has been restricted to the words of the vocabulary, the tweets should be filtered as well, so that they only contain words that have an embedding.

In [16]:
def filter_tweets(tweet, vocabulary):
    """
    remove any word from the tweet which does not appear in the vocabulary
    @param tweet: single tweet as a string
    @param vocabulary: vocabulary as a set
    @return: single cleaned tweet
    """
    tweet = list(filter(lambda word: word in vocabulary, tweet.split(' ')))
    tweet = ' '.join(tweet)
    return tweet

In [17]:
# Define vocabulary as a set for O(1) access
vocabulary_set = set(embedding_dict.keys())

# Remove non-vocabulary words from tweets 
train_data['tweet'] = train_data['tweet'].apply(lambda tweet: filter_tweets(tweet, vocabulary_set))
train_data = train_data[train_data['tweet'] != '']
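
The effect of this filtering step can be seen on a toy example. This is a minimal sketch: the vocabulary, the tweets, and the helper name `keep_known_words` are made up for illustration. A tweet with no known word becomes an empty string, which is why empty tweets are dropped afterwards.

```python
def keep_known_words(tweet, vocabulary):
    """Keep only the words of the tweet that belong to the vocabulary set."""
    return ' '.join(word for word in tweet.split(' ') if word in vocabulary)


# Hypothetical vocabulary and tweets, for illustration only
toy_vocabulary = {'i', 'love', 'this'}
toy_tweets = ['i love this sooong', 'zzzz qwerty']

filtered = [keep_known_words(tweet, toy_vocabulary) for tweet in toy_tweets]

# Tweets left with no word at all cannot be embedded, so they are removed
non_empty = [tweet for tweet in filtered if tweet != '']

print(filtered)
print(non_empty)
```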

Now that the data and the embedding matrix are ready, we can proceed to the actual encoding of the tweets. Since a typical tweet contains more than one word and the embedding at hand works at the word level, we chose to encode a tweet by averaging the embeddings of its words. Note that the ordering of the words in a tweet is lost by doing so. A finer approach will be used when dealing with neural networks, which can take sequences as input.

In [18]:
def mean_word_embedding(tweet, dictionary):
    """
    computes the embedding of a tweet by averaging the embedding of its words
    @param tweet: single tweet as a string
    @param dictionary: dictionary mapping words to their embedding representation
    @return: mean embedding of the tweet's words
    """
    mean_embedding = np.mean([dictionary[word] for word in tweet.split(' ')], axis=0)
    return mean_embedding

In [19]:
# Use mean_word_embedding
train_data['embedding'] = train_data['tweet'].apply(lambda x: mean_word_embedding(x, embedding_dict))
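
The loss of word order can be checked on a toy example. This is a minimal sketch with made-up 2-dimensional vectors and a hypothetical helper `toy_mean_embedding` that mirrors the averaging above: two tweets made of the same words in a different order receive exactly the same representation.

```python
import numpy as np

# Made-up 2-dimensional embeddings, for illustration only
toy_embeddings = {
    'not': np.array([1.0, 1.0]),
    'good': np.array([0.0, 2.0]),
    'bad': np.array([2.0, 0.0]),
}


def toy_mean_embedding(tweet, dictionary):
    """Average the embeddings of the words of the tweet."""
    return np.mean([dictionary[word] for word in tweet.split(' ')], axis=0)


first = toy_mean_embedding('good not bad', toy_embeddings)
second = toy_mean_embedding('bad not good', toy_embeddings)

print(first, second)
print(np.allclose(first, second))
```

The two toy tweets would arguably carry opposite sentiments, yet the averaged representation cannot tell them apart.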

In [20]:
X_train = train_data['embedding']
X_train = np.stack(X_train.to_numpy())
y_train = train_data['label']

Now that the tweets have a numerical representation, they can be fed to the Logistic Regression algorithm.

In [21]:
logistic = LogisticRegression(solver='lbfgs', max_iter=500)
accuracy = cross_val_score(logistic, X_train, y_train, cv=3, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f)" % (accuracy.mean(), accuracy.std() * 2))

Accuracy: 0.77 (+/- 0.01)


We observe an accuracy slightly lower than with the TF-IDF embedding. Nonetheless, the embedding space of the GloVe representation has 25 times fewer dimensions, which directly translates into roughly 25 times fewer parameters in the model when using Logistic Regression.
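
The arithmetic behind this comparison can be sketched as follows. A binary Logistic Regression with an intercept learns one weight per feature plus one bias term. The TF-IDF dimension used here is not stated in this notebook; it is only inferred from the factor of 25 quoted above and should be read as an assumption.

```python
def logistic_regression_parameter_count(n_features):
    """One weight per feature plus one intercept, for a binary classifier."""
    return n_features + 1


glove_dimensions = 200   # dimension of the GloVe vectors loaded in this notebook
reduction_factor = 25    # ratio quoted in the text above

# Assumption: TF-IDF dimension implied by the quoted ratio, not stated in the notebook
tfidf_dimensions = glove_dimensions * reduction_factor

glove_parameters = logistic_regression_parameter_count(glove_dimensions)
tfidf_parameters = logistic_regression_parameter_count(tfidf_dimensions)
ratio = tfidf_parameters / glove_parameters

print(glove_parameters, tfidf_parameters)
print(round(ratio, 1))
```

Because of the intercept, the ratio of parameter counts is marginally below 25, hence "roughly" in the statement above.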