# TF-IDF
This notebook presents the methods used to go from cleaned tweets to sentiment predictions. Its objective is to assess the performance of the [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) word representation. Other notebooks cover different word representations, so the focus here is on evaluating performance under a fixed setup rather than on obtaining the best possible performance. We therefore use [Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression), a relatively light and interpretable algorithm that is well suited to a binary classification task. The library [scikit-learn](https://scikit-learn.org/stable/) is used extensively throughout this notebook.

### Importing the needed packages

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

### Loading the clean data files

In [2]:
# Load the clean train data
train_data = pd.read_csv('../Data/train_small.txt')

# Shuffle the data
train_data = train_data.sample(train_data.shape[0])

### Using TF-IDF 
TF-IDF improves the [Bag-of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) representation by taking into account the frequency of a given term across the whole corpus. In other words (no pun intended), it gives more weight to terms that appear rarely. A more detailed explanation can be found in the report. One of the main drawbacks of this representation, aside from its sparsity, is that the embedding dimension grows with the number of distinct words in the corpus. This problem is mitigated by artificially fixing the number of features. This trick becomes increasingly important if one decides to represent not only single words but also [n-grams](https://en.wikipedia.org/wiki/N-gram), which can potentially improve the predictive power of a given model.
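To make the weighting concrete, here is a minimal sketch on a hypothetical three-document corpus (the sentences and the names `toy_corpus`, `idf_by_term` and `capped_vectorizer` are illustrative and not part of the tweet dataset). It shows that a term present in every document receives the lowest inverse-document-frequency weight, and that `max_features` caps the number of columns even when bigrams are added.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the tweet dataset
toy_corpus = [
    "good movie",
    "good day",
    "good terrible day",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

# idf_ holds one inverse-document-frequency weight per column;
# vocabulary_ maps each term to its column index
idf_by_term = {
    term: toy_vectorizer.idf_[index]
    for term, index in toy_vectorizer.vocabulary_.items()
}

for term, idf in sorted(idf_by_term.items(), key=lambda item: item[1]):
    print(f"{term:10s} idf = {idf:.3f}")

# With unigrams and bigrams the vocabulary grows, but max_features
# keeps only the most frequent terms, so the width stays fixed
capped_vectorizer = TfidfVectorizer(max_features=3, ngram_range=(1, 2))
capped_matrix = capped_vectorizer.fit_transform(toy_corpus)
print(capped_matrix.shape)
```

Here "good" occurs in all three documents and therefore gets the smallest weight, while "movie" and "terrible" occur in a single document and get the largest.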

In [3]:
# Setup the features extractor
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 1))

# Get embedding representation of the tweets
X_train = vectorizer.fit_transform(train_data['tweet']).toarray()
y_train = train_data['label']

In [4]:
# Verify that the dimension is [#tweets, max_features]
X_train.shape

(149915, 5000)
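As a side note on the sparsity mentioned earlier, `fit_transform` returns a SciPy sparse matrix, and scikit-learn's `LogisticRegression` accepts such matrices directly. The sketch below, on hypothetical toy tweets (`toy_tweets`, `toy_labels` and the other names are illustrative), measures how many entries are actually non-zero and fits the classifier without densifying. Whether skipping `.toarray()` is worthwhile for the real data is an assumption that was not tested in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy tweets and labels, not taken from the dataset
toy_tweets = ["love this", "love it so much", "hate this", "hate it so much"]
toy_labels = [1, 1, 0, 0]

# fit_transform returns a SciPy sparse matrix; no .toarray() call here
X_sparse = TfidfVectorizer().fit_transform(toy_tweets)

# Fraction of entries that are actually stored (non-zero)
n_rows, n_cols = X_sparse.shape
density = X_sparse.nnz / (n_rows * n_cols)
print(f"shape = {X_sparse.shape}, density = {density:.2f}")

# The classifier is fitted directly on the sparse matrix
sparse_model = LogisticRegression(solver='lbfgs', max_iter=200)
sparse_model.fit(X_sparse, toy_labels)
train_predictions = sparse_model.predict(X_sparse)
```

On a real corpus with thousands of features and short tweets, the density is typically far lower than in this toy case, which is where the sparse format saves memory.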

Now that the tweets have a numerical representation, they can be fed to the Logistic Regression algorithm.

In [5]:
logistic = LogisticRegression(solver='lbfgs', max_iter=200)
accuracy = cross_val_score(logistic, X_train, y_train, cv=3, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f)" % (accuracy.mean(), accuracy.std() * 2))

Accuracy: 0.80 (+/- 0.01)
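Since Logistic Regression was chosen partly for its interpretability, the learned coefficients can be mapped back to terms. The following is a minimal sketch on hypothetical toy tweets (all names and data are illustrative and this inspection was not part of the original experiments). It fits the same kind of model and looks up the terms with the most negative and most positive weights.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy tweets and labels (1 = positive, 0 = negative)
toy_tweets = ["love this", "love it so much", "hate this", "hate it so much"]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = TfidfVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_tweets)

toy_model = LogisticRegression(solver='lbfgs', max_iter=200)
toy_model.fit(X_toy, toy_labels)

# Recover the terms in column order from the fitted vocabulary
vocabulary = toy_vectorizer.vocabulary_
terms = np.array(sorted(vocabulary, key=vocabulary.get))

# One coefficient per term; the sign indicates the class it pushes towards
weights = toy_model.coef_[0]
order = np.argsort(weights)
most_negative_term = terms[order[0]]
most_positive_term = terms[order[-1]]
print("most negative:", most_negative_term)
print("most positive:", most_positive_term)
```

The same lookup could be applied to the `vectorizer` and a fitted `logistic` model from the cells above, keeping in mind that `cross_val_score` fits clones and leaves the original estimator unfitted.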


# Comparison with the raw data
We now want to measure the impact of the pre-processing step on the overall accuracy.

In [9]:
# Load the raw data
neg = pd.read_fwf('../Data/train_neg.txt', header=None, names=['tweet'])
pos = pd.read_fwf('../Data/train_pos.txt', header=None, names=['tweet'])

# Add label
neg['label'] = 0
pos['label'] = 1

# Concatenate the pos and neg dataframes
train_data_raw = pd.concat([pos, neg])

# Using the same number of samples as the cleaned data for a fair comparison
train_data_raw = train_data_raw.sample(train_data.shape[0])

In [9]:
# Setup the features extractor
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 1))

# Get embedding representation of the tweets
X_train_raw = vectorizer.fit_transform(train_data_raw['tweet']).toarray()
y_train_raw = train_data_raw['label']

In [10]:
logistic = LogisticRegression(solver='lbfgs', max_iter=200)
accuracy = cross_val_score(logistic, X_train_raw, y_train_raw, cv=3, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f)" % (accuracy.mean(), accuracy.std() * 2))

Accuracy: 0.80 (+/- 0.00)


The accuracy with and without the pre-processing step is the same to two decimal places. Nevertheless, these two numbers are not sufficient to assess the pre-processing. Indeed, the pre-processing also aimed at reducing the number of words in the corpus so that the missing words could later be trained using a [nn.Embedding](https://pytorch.org/docs/master/nn.html#torch.nn.Embedding) layer.

In [15]:
# Comparing the size of the vocabulary
clean_vocabulary = pd.DataFrame(train_data['tweet'].apply(lambda x: x.split(' ')).explode()).drop_duplicates()
raw_vocabulary = pd.DataFrame(train_data_raw['tweet'].apply(lambda x: x.split(' ')).explode()).drop_duplicates()
clean_vocabulary.shape[0] / raw_vocabulary.shape[0]

0.352505582385626

The clean vocabulary is significantly smaller than the raw one (about 35% of its size). As the final model will be trained on the large set of tweets, it is more valuable to measure the vocabulary reduction on that set.

In [22]:
# Load the full raw data
neg_full = pd.read_fwf('../Data/train_neg_full.txt', header=None, names=['tweet'])
pos_full = pd.read_fwf('../Data/train_pos_full.txt', header=None, names=['tweet'])

# Concatenate the data
train_data_raw_full = pd.concat([pos_full, neg_full])

# Load the full clean data
train_data_full = pd.read_csv('../Data/train_full.txt')

In [24]:
raw_vocabulary_full = pd.DataFrame(train_data_raw_full['tweet'].apply(lambda x: x.split(' ')).\
                                   explode()).drop_duplicates()
clean_vocabulary_full = pd.DataFrame(train_data_full['tweet'].apply(lambda x: x.split(' ')).\
                                     explode()).drop_duplicates()
clean_vocabulary_full.shape[0] / raw_vocabulary_full.shape[0]

0.12322306856745761

On the full dataset, the clean vocabulary is about 12% of the size of the raw one, so the overall vocabulary size is significantly reduced.
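The vocabulary ratio used in both comparisons can be reproduced on a small example. The sketch below is an assumption-laden illustration: the helper `vocabulary_size` and the toy Series are hypothetical, and a plain Python `set` replaces the `explode` and `drop_duplicates` chain, which should yield the same count of distinct space-separated tokens.

```python
import pandas as pd


def vocabulary_size(tweets):
    """Count the distinct space-separated tokens in a Series of strings."""
    vocabulary = set()
    for tweet in tweets:
        vocabulary.update(tweet.split(' '))
    return len(vocabulary)


# Hypothetical raw tweets and a hand-cleaned version of them
raw_toy = pd.Series(["I looove this!!!", "i LOVE this", "loving this :)"])
clean_toy = pd.Series(["i love this", "i love this", "love this"])

ratio = vocabulary_size(clean_toy) / vocabulary_size(raw_toy)
print(vocabulary_size(raw_toy), vocabulary_size(clean_toy), ratio)
```

Because splitting on a single space is case- and punctuation-sensitive, variants such as "this" and "this!!!" count as separate raw tokens, which is the kind of inflation the pre-processing removes.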