# Bag of Words
This notebook presents the "Bag of Words" feature extraction method, a comparatively basic approach to word vectorization, applied here to cleaned tweets. The objective of this notebook is to assess the performance of the [Bag of Words](https://en.wikipedia.org/wiki/Bag-of-words_model) word representation. Other notebooks cover different word representations. The focus is on evaluating the performance under a fixed setup rather than on obtaining the absolute best performance. We therefore used [Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression), a relatively light and interpretable algorithm that is well suited to a binary classification task. The library [scikit-learn](https://scikit-learn.org/stable/) was used to implement the Bag of Words method.

### Importing the needed packages

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import cross_val_score

### Loading the clean data files

In [2]:
# Load the clean train data
train_data = pd.read_csv('../Data/train_small.txt')

# Shuffle the data
train_data = train_data.sample(train_data.shape[0])

### Using Bag of Words
When implementing Bag of Words with scikit-learn, the documents are automatically tokenized and a vocabulary is built from the whole corpus. Each document (tweet) is turned into a feature vector by counting the occurrences of each vocabulary word in it; the dimension of this vector is set by the "max_features" parameter. With max_features=2000, for example, only the 2000 most frequent words of the corpus are counted. Usually, beyond a certain threshold (around 1500 in our case), increasing the size of the feature vector no longer improves the accuracy and only adds computational cost.
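The mechanism described above can be illustrated with a plain-Python sketch. This is not scikit-learn's implementation: the toy corpus and the helper names (`tokenize`, `build_vocabulary`, `vectorize`) are made up for illustration, and the token pattern is only assumed to mirror the default one used by `CountVectorizer` (lowercased tokens of two or more word characters).

```python
import re
from collections import Counter

# Toy corpus standing in for the tweets (hypothetical data).
corpus = [
    "good day good mood",
    "bad day",
    "good game bad luck",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def build_vocabulary(documents, max_features):
    # Count every token over the whole corpus and keep the most frequent ones.
    totals = Counter(token for doc in documents for token in tokenize(doc))
    top = [token for token, _ in totals.most_common(max_features)]
    # Map each kept token to a column index (alphabetical order here).
    return {token: index for index, token in enumerate(sorted(top))}

def vectorize(document, vocabulary):
    # One count per vocabulary entry; tokens outside the vocabulary are ignored.
    vector = [0] * len(vocabulary)
    for token in tokenize(document):
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector

vocabulary = build_vocabulary(corpus, max_features=3)
X = [vectorize(doc, vocabulary) for doc in corpus]
print(vocabulary)  # {'bad': 0, 'day': 1, 'good': 2}
print(X)           # [[0, 1, 2], [1, 1, 0], [1, 0, 1]]
```

With `max_features=3`, the rarer words ("mood", "game", "luck") are dropped, so each tweet becomes a vector of length 3 regardless of its own length.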

In [3]:
# Setup the features extractor
vectorizer = CountVectorizer(max_features=1500, ngram_range=(1, 1))

# Get the Bag of Words (count) representation of the tweets
X_train = vectorizer.fit_transform(train_data['tweet']).toarray()
y_train = train_data['label']

In [4]:
# Verify that the dimension is [#tweets, max_features]
X_train.shape

(149915, 1500)
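The `.toarray()` call above turns the sparse count matrix into a dense one that stores every count, zeros included. As a rough estimate of the resulting memory footprint, assuming 64-bit integer counts (the default `dtype` of `CountVectorizer`; this is an assumption about the run above, not something the notebook states):

```python
# Shape reported by X_train.shape above.
n_tweets, n_features = 149915, 1500

# Assumption: each count is stored as a 64-bit integer (8 bytes).
bytes_per_count = 8

dense_bytes = n_tweets * n_features * bytes_per_count
print(f"{dense_bytes / 1e9:.2f} GB")  # 1.80 GB
```

If memory becomes a constraint, one option is to skip `.toarray()`, since scikit-learn's `LogisticRegression` also accepts sparse matrices as input.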

Now that the tweets have a numerical representation, they can be fed to the Logistic Regression algorithm.

In [None]:
logistic = LogisticRegression(solver='lbfgs', max_iter=200)
accuracy = cross_val_score(logistic, X_train, y_train, cv=3, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f)" % (accuracy.mean(), accuracy.std() * 2))
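The printed summary reports the mean accuracy over the folds, followed by twice the standard deviation of the fold scores as a rough indication of their spread. A minimal sketch of that computation, using made-up fold scores that are for illustration only and are not results from this notebook:

```python
import statistics

# Hypothetical per-fold accuracies (cv=3 gives three scores); not real results.
fold_scores = [0.78, 0.80, 0.79]

mean = statistics.mean(fold_scores)
# Population standard deviation, which is what numpy's .std() computes by default.
spread = 2 * statistics.pstdev(fold_scores)

print("Accuracy: %0.2f (+/- %0.2f)" % (mean, spread))
```

A spread that rounds to 0.00 therefore means the three folds scored almost identically.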

# Comparison with the raw data
We are now interested in measuring the impact of the pre-processing step on the overall accuracy.

In [7]:
# Load the raw data
neg = pd.read_fwf('../Data/neg_train.txt', header=None, names=['tweet'])
pos = pd.read_fwf('../Data/pos_train.txt', header=None, names=['tweet'])

# Add label
neg['label'] = 0
pos['label'] = 1

# Concatenate the pos and neg dataframes
train_data_raw = pd.concat([pos, neg])

# Using the same number of samples as the cleaned data for a fair comparison
train_data_raw = train_data_raw.sample(train_data.shape[0])

In [9]:
# Setup the features extractor
vectorizer = CountVectorizer(max_features=1500, ngram_range=(1, 1))

# Get the Bag of Words (count) representation of the tweets
X_train_raw = vectorizer.fit_transform(train_data_raw['tweet']).toarray()
y_train_raw = train_data_raw['label']

In [10]:
logistic = LogisticRegression(solver='lbfgs', max_iter=200)
accuracy = cross_val_score(logistic, X_train_raw, y_train_raw, cv=3, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f)" % (accuracy.mean(), accuracy.std() * 2))

Accuracy: 0.79 (+/- 0.00)


The accuracy with and without the pre-processing step is the same at the reported precision. Nevertheless, these two numbers are not sufficient to assess the usefulness of the pre-processing. Indeed, the pre-processing also aimed at reducing the number of words in the corpus, so that the missing words could later be trained using a [nn.Embedding](https://pytorch.org/docs/master/nn.html#torch.nn.Embedding) layer.

In [11]:
# Comparing the size of the vocabulary
clean_vocabulary = pd.DataFrame(train_data['tweet'].apply(lambda x: x.split(' ')).explode()).drop_duplicates()
raw_vocabulary = pd.DataFrame(train_data_raw['tweet'].apply(lambda x: x.split(' ')).explode()).drop_duplicates()
clean_vocabulary.shape[0] / raw_vocabulary.shape[0]

0.3536073344793457

The number of words in the clean vocabulary is significantly lower than in the raw one: it is about 35% of its size. As the final model will be trained on the large set of tweets, it is more informative to measure the reduction in vocabulary on that set.

In [8]:
# Load the full raw data
neg_full = pd.read_fwf('../Data/train_neg_full.txt', header=None, names=['tweet'])
pos_full = pd.read_fwf('../Data/train_pos_full.txt', header=None, names=['tweet'])

# Concatenate the data
train_data_raw_full = pd.concat([pos_full, neg_full])

# Load the full clean data
train_data_full = pd.read_csv('../Data/train_full.txt')

In [11]:
raw_vocabulary_full = pd.DataFrame(
    train_data_raw_full['tweet'].apply(lambda x: x.split(' ')).explode()
).drop_duplicates()
clean_vocabulary_full = pd.DataFrame(
    train_data_full['tweet'].apply(lambda x: x.split(' ')).explode()
).drop_duplicates()
clean_vocabulary_full.shape[0] / raw_vocabulary_full.shape[0]

AttributeError: 'Series' object has no attribute 'explode'

In [13]:
s = pd.Series([[1, 2, 3], 'foo', [], [3, 4]])
s.explode()

AttributeError: 'Series' object has no attribute 'explode'
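`Series.explode` was added in pandas 0.25, so the `AttributeError` above suggests that this cell was run with an older pandas version. A minimal sketch of an alternative that does not depend on `explode` is shown below; `vocabulary_size` is a hypothetical helper, and the sample tweets are made up for illustration.

```python
def vocabulary_size(tweets):
    # Collect the distinct space-separated tokens over all tweets.
    vocabulary = set()
    for tweet in tweets:
        vocabulary.update(tweet.split(' '))
    return len(vocabulary)

# Hypothetical stand-ins for the raw and cleaned tweet columns.
raw_tweets = ["I looove this !!!", "i love this", "This is baaad"]
clean_tweets = ["i love this", "i love this", "this is bad"]

ratio = vocabulary_size(clean_tweets) / vocabulary_size(raw_tweets)
print(ratio)
```

Because a pandas Series can be iterated over directly, `vocabulary_size(train_data_full['tweet']) / vocabulary_size(train_data_raw_full['tweet'])` should give the intended ratio, assuming every entry in the column is a string.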

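The overall size of the vocabulary is significantly reduced by the pre-processing, although the ratio for the full set is not displayed above because the cell raised an error.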