# TF-IDF
This notebook presents one of the methods used for word vectorization on the pre-processed ("clean") tweets. Its objective is to assess the performance of the [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) word representation. Other notebooks cover different word representations. The focus is on comparing representations under a fixed setup rather than on achieving the best possible performance. We therefore use [Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression), a relatively light and interpretable algorithm that is well suited to binary classification. The [scikit-learn](https://scikit-learn.org/stable/) library is used extensively throughout this notebook.

### Importing the needed packages

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import cross_val_score

### Loading the clean data files

In [2]:
# Load the clean train data
train_data = pd.read_csv('../Data/train_small.txt')

### Using TF-IDF 
TF-IDF improves on the [Bag-of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) representation by taking into account how frequently a given term appears across the whole corpus. In other words, it gives more weight to terms that appear in few documents. A more detailed explanation can be found in the report. One of the main drawbacks of this representation, aside from its sparsity, is that the embedding dimension grows with the number of distinct words in the corpus. This problem is mitigated by capping the number of features with the `max_features` argument, which keeps only the most frequent terms. The cap becomes increasingly important if one decides to represent not only single words but also [n-grams](https://en.wikipedia.org/wiki/N-gram), which can potentially improve the predictive power of a given model.
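
To make the two ideas above concrete, here is a minimal pure-Python sketch. The toy corpus and the helper names (`ngrams`, `vocabulary`, `smoothed_idf`) are hypothetical and not part of the tweet data set or of scikit-learn. The idf formula is the smoothed form `ln((1 + N) / (1 + df)) + 1`, which, as far as we know, matches the default settings of `TfidfVectorizer`. The sketch shows that a term present in every document gets a lower weight than a rare one, and that adding bigrams enlarges the vocabulary.

```python
import math

# Toy corpus (hypothetical, not taken from the tweet data set)
corpus = [
    "good morning everyone",
    "good night everyone",
    "good luck with the exam",
]

def ngrams(tokens, n):
    """Return the n-grams (joined by spaces) found in a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def vocabulary(docs, max_n):
    """Collect every n-gram with 1 <= n <= max_n across the documents."""
    vocab = set()
    for doc in docs:
        tokens = doc.split()
        for n in range(1, max_n + 1):
            vocab.update(ngrams(tokens, n))
    return vocab

def smoothed_idf(term, docs):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    df = sum(term in doc.split() for doc in docs)
    return math.log((1 + len(docs)) / (1 + df)) + 1

idf_common = smoothed_idf("good", corpus)  # "good" appears in every document
idf_rare = smoothed_idf("exam", corpus)    # "exam" appears in a single document

unigram_vocab = vocabulary(corpus, 1)      # single words only
bigram_vocab = vocabulary(corpus, 2)       # single words plus bigrams

print(f"idf('good') = {idf_common:.3f}, idf('exam') = {idf_rare:.3f}")
print(f"unigrams: {len(unigram_vocab)}, unigrams + bigrams: {len(bigram_vocab)}")
```

Even on three short sentences the vocabulary doubles once bigrams are included, which is why a cap on the number of features matters for a corpus of tweets.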

In [3]:
# Set up the feature extractor
vectorizer = TfidfVectorizer(max_features=1500, ngram_range=(1, 1))

# Get embedding representation of the tweets
X_train = vectorizer.fit_transform(train_data['tweet']).toarray()
y_train = train_data['label']

In [4]:
# Verify that the dimension is [#tweets, max_features]
X_train.shape

(149915, 1500)
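
As a rough check on the sparsity drawback mentioned earlier, the sketch below estimates the memory taken by a dense array of the shape reported above. It assumes 8-byte `float64` values, which we believe is the default `dtype` of `TfidfVectorizer`. The sparse estimate is only illustrative: the average of 10 non-zero features per tweet is an assumption rather than a measured value, and the CSR layout with 4-byte indices is likewise assumed.

```python
# Shape reported by X_train.shape above
n_rows, n_cols = 149915, 1500

# Dense storage: every cell holds one float64 (8 bytes)
dense_bytes = n_rows * n_cols * 8

# Sparse (CSR) storage under ASSUMED figures:
# about 10 non-zero features per tweet, 8-byte values, 4-byte indices
assumed_nonzeros_per_row = 10
nnz = n_rows * assumed_nonzeros_per_row
sparse_bytes = nnz * (8 + 4) + (n_rows + 1) * 4

print(f"dense:  {dense_bytes / 1e9:.2f} GB")
print(f"sparse: {sparse_bytes / 1e6:.2f} MB (under the assumptions above)")
```

If memory becomes a constraint, one option is to skip the `.toarray()` call, since scikit-learn's `LogisticRegression` also accepts sparse input.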

Now that the tweets have a numerical representation, they can be fed to the Logistic Regression algorithm.

In [5]:
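# Logistic Regression with the L-BFGS solver; max_iter is raised to help convergence
logistic = LogisticRegression(solver='lbfgs', max_iter=200)
# 3-fold cross-validated accuracy (one score per fold)
accuracy = cross_val_score(logistic, X_train, y_train, cv=3, scoring='accuracy')
# Report the mean accuracy and an interval of two standard deviations
print("Accuracy: %0.2f (+/- %0.2f)" % (accuracy.mean(), accuracy.std() * 2))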

Accuracy: 0.78 (+/- 0.01)
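
In the cell above, the vectorizer is fitted on all tweets before cross-validation, so the idf statistics also see the tweets of each validation fold. One possible variation, sketched here on hypothetical toy data rather than on the actual tweets, wraps both steps in a scikit-learn `Pipeline` so that the vectorizer is refitted on the training part of every fold. This is not the setup used to obtain the result above, and the toy sentences and labels are made up for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for train_data['tweet'] and train_data['label']
tweets = [
    "i love this sunny day", "what a great match", "so happy right now",
    "best coffee ever", "love my new phone", "great news today",
    "i hate this rainy day", "what a terrible match", "so sad right now",
    "worst coffee ever", "hate my old phone", "terrible news today",
]
labels = [1] * 6 + [0] * 6

# The vectorizer and the classifier are fitted together inside each fold
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 1)),
    LogisticRegression(solver="lbfgs", max_iter=200),
)

# One accuracy score per fold, computed on tweets unseen during fitting
scores = cross_val_score(pipeline, tweets, labels, cv=3, scoring="accuracy")
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

On a corpus as large as the one used here the difference between the two setups is likely small, but the pipeline form keeps the evaluation strictly out-of-sample.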
