# Pre-processing
This notebook presents the methods that were used to clean the raw tweets. 

### Importing the needed packages

In [1]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import re
import csv
import itertools
from helpers import *
import nltk 
from nltk.tokenize import sent_tokenize, word_tokenize 
from nltk.tokenize import TweetTokenizer
from nltk.stem.wordnet import WordNetLemmatizer

### Loading the raw data files

In [2]:
# Load raw data
neg = pd.read_fwf('../Data/train_neg.txt', header=None, names=['tweet'])
pos = pd.read_fwf('../Data/train_pos.txt', header=None, names=['tweet'])
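# One tweet per line; the lines are read directly because recent pandas versions reject sep='\n'
with open('../Data/test_data.txt', encoding='utf-8') as f:
    test = pd.DataFrame({'tweet': [line.rstrip('\n') for line in f if line.strip()]})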
test['tweet-id'] = test.tweet.apply(lambda x: x.split(',')[0])
test['tweet'] = test.tweet.apply(lambda x: ' '.join(x.split(',')[1:]))
test = test.set_index('tweet-id')

In [3]:
pos.head(5)

Unnamed: 0,tweet
0,<user> i dunno justin read my mention or not ....
1,"because your logic is so dumb , i won't even c..."
2,""" <user> just put casper in a box ! "" looved t..."
3,<user> <user> thanks sir > > don't trip lil ma...
4,visiting my brother tmr is the bestest birthda...


As can be seen in the tweets above, the raw data is noisy, which can lead to an under-performing algorithm. It is important to remove this noise so that the model does not overfit on it. Cleaning the data also helps reduce the number of words in the vocabulary by merging semantically equivalent forms, for instance 'I'm' and 'Im'. This is desirable when using either [Bag-of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) or [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) as the word representation, since it can significantly decrease the dimension of the embedding space. It is also desirable with the [GloVe](https://nlp.stanford.edu/projects/glove/) representation, as it increases the chance that a given word has a pre-trained representation.
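To illustrate how merging spelling variants shrinks a Bag-of-Words vocabulary, here is a minimal sketch. The toy tweets and the `variants` mapping are made up for illustration; they are not taken from the data set or from `helpers`.

```python
# Toy illustration: mapping spelling variants to one form reduces the vocabulary.
# The tweets and the `variants` mapping below are hypothetical examples.
tweets = ["im so happy", "i'm so tired", "im late again"]

variants = {"im": "i'm"}  # hypothetical mapping from a variant to its canonical form


def vocabulary_of(docs, mapping=None):
    """Return the set of distinct tokens, optionally remapped through `mapping`."""
    mapping = mapping or {}
    vocab = set()
    for doc in docs:
        for token in doc.split():
            vocab.add(mapping.get(token, token))
    return vocab


raw_vocab = vocabulary_of(tweets)
merged_vocab = vocabulary_of(tweets, variants)
print(len(raw_vocab), len(merged_vocab))  # → 7 6
```

Each merged variant removes one dimension from a Bag-of-Words or TF-IDF representation.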

In [4]:
# Load pre-trained embedding
words_embedding = pd.read_table('../Data/glove.twitter.27B.25d.txt', sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)
words_embedding = words_embedding[~words_embedding.index.isna()]

In [5]:
def cleaning(tweet):
    """
    @param tweet: single tweet as a string
    @return: single cleaned tweet as a string
    """
    # Instantiate the tokenizer and the lemmatizer
    tokenizer = TweetTokenizer()
    lemmatizer = WordNetLemmatizer()

    # Removing HTML characters
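    # The parser is named explicitly so that the result does not depend on which parsers are installed
    tweet = BeautifulSoup(tweet, 'html.parser').get_text()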

    # As the character "'" will be part of the contractions it must be kept
    tweet = tweet.replace('\x92', "'")
    tweet = tweet.replace("’", "'")

    # Remove words that start with '#' or '@'
    tweet = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|(#[A-Za-z0-9]+)", " ", tweet).split())

    # Remove web addresses
    tweet = ' '.join(re.sub(r"(\w+://\S+)", " ", tweet).split())

    # Remove punctuation and digits (the apostrophe is kept for the contractions)
    tweet = ' '.join(re.sub(r'[.,!?:;\-=()\d"_<>+@/]', " ", tweet).split())

    # contractions source:
    contractions = load_dict_contractions()
    words = tokenizer.tokenize(tweet)
    words = [lemmatizer.lemmatize(word, 'v') for word in words]
    words = list(filter(lambda word: len(word) > 1, words))
    reformed = [contractions[word] if word in contractions else word for word in words]
    tweet = " ".join(reformed)
    return tweet

A first step of the cleaning consists of removing undesired characters, lemmatizing the words, removing single-letter words and finally correcting some spelling errors through the contractions dictionary.
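The regex-based part of this step can be tried in isolation. The sketch below only covers handle, hashtag, URL, punctuation and single-letter removal; it leaves out the HTML parsing, the lemmatization and the contractions lookup, and it splits on whitespace instead of using `TweetTokenizer`. The function name `strip_noise` and the sample tweet are hypothetical.

```python
import re


def strip_noise(tweet):
    """Apply only the regex-based cleaning steps to a single tweet."""
    # Drop @handles and #hashtags
    tweet = re.sub(r"(@[A-Za-z0-9]+)|(#[A-Za-z0-9]+)", " ", tweet)
    # Drop web addresses such as http://...
    tweet = re.sub(r"\w+://\S+", " ", tweet)
    # Replace punctuation and digits by spaces; the apostrophe is kept
    tweet = re.sub(r'[.,!?:;\-=()\d"_<>+@/]', " ", tweet)
    # Split on whitespace and drop single-character tokens
    return " ".join(token for token in tweet.split() if len(token) > 1)


sample = "@bob i can't wait !!! #excited http://t.co/abc 2 go"
print(strip_noise(sample))  # → can't wait go
```

Note that the single-letter filter also removes the pronoun 'i', which is the behaviour of the `len(word) > 1` filter in `cleaning`.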

In [6]:
# Clean tweets
neg['tweet'] = neg['tweet'].apply(cleaning)
pos['tweet'] = pos['tweet'].apply(cleaning)
test['tweet'] = test['tweet'].apply(cleaning)

Some words appear more often than others. The most frequent words of a language are called stop-words and are often removed from the corpus: they appear in almost every document and thus carry little predictive power. Nonetheless, we chose to keep them, since words such as 'not' or 'no' are considered stop-words but can be helpful in a sentiment analysis task. On the other hand, words that appear very rarely are likely to have a less significant impact and are also less likely to be in a pre-trained vocabulary. A closer look at the words appearing only once shows that some of them could be mapped to more common words by a finer pre-processing step. We chose to remove from the documents any term used fewer times than a given threshold. The threshold should be adapted to the size of the corpus; ideally it would be expressed as a probability of observing a word, which would make it independent of the corpus size.
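The probability-based threshold mentioned above is not what this notebook implements (it uses an absolute count). A minimal sketch of that alternative could look as follows; the function name `frequent_words`, the `min_prob` parameter and the toy corpus are assumptions made for illustration.

```python
from collections import Counter


def frequent_words(tweets, min_prob):
    """Keep the words whose relative frequency in the corpus is at least `min_prob`."""
    counts = Counter(word for tweet in tweets for word in tweet.split())
    total = sum(counts.values())
    return {word for word, n in counts.items() if n / total >= min_prob}


corpus = ["good day", "good night", "bad day", "gooood day"]
kept = frequent_words(corpus, min_prob=0.2)
print(sorted(kept))  # → ['day', 'good']
```

Because the cut-off is a share of all tokens, the same `min_prob` can be reused on the small and on the large set of tweets.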

In [73]:
# Compute the vocabulary, i.e. all words appearing at least once in any of the cleaned files
vocabulary = pd.concat([neg['tweet'].apply(lambda x: x.split(' ')).explode(), 
                        pos['tweet'].apply(lambda x: x.split(' ')).explode(),
                        test['tweet'].apply(lambda x: x.split(' ')).explode()])
vocabulary = pd.DataFrame(vocabulary)

# Count how many times each word appears in the corpus
# (naming the columns explicitly avoids relying on pandas' default names, which differ between versions)
words_frequencies = (vocabulary['tweet'].value_counts()
                     .rename_axis('word')
                     .reset_index(name='count'))

# Keep only the words that appear at least a given number of times (here 2)
vocabulary = words_frequencies.loc[words_frequencies['count'] >= 2, ['word']].reset_index(drop=True)

Once the rare words have been removed from the vocabulary, they must also be removed from the tweets.

In [75]:
def filter_tweets(tweet, vocabulary):
    """
    remove any word from the tweet which does not appear in the dictionary
    @param tweet: single tweet as a string
    @param vocabulary: vocabulary as a set
    @return: single cleaned tweet
    """
    tweet = list(filter(lambda word: word in vocabulary, tweet.split(' ')))
    tweet = ' '.join(tweet)
    return tweet

In [76]:
# Define vocabulary as a set for O(1) access
vocabulary_set = set(vocabulary['word'].to_list())

# Remove non-vocabulary words from tweets 
test['tweet'] = test['tweet'].apply(lambda tweet: filter_tweets(tweet, vocabulary_set))
pos['tweet'] = pos['tweet'].apply(lambda tweet: filter_tweets(tweet, vocabulary_set))
neg['tweet'] = neg['tweet'].apply(lambda tweet: filter_tweets(tweet, vocabulary_set))

In [79]:
# Looking for empty tweets
test[test['tweet'] == ''].count()

tweet    12
dtype: int64

After the previous step some tweets end up empty. In the run above (threshold of two occurrences), 12 tweets of the test set are empty; with the large set of tweets and a threshold of five occurrences, only three are. Empty tweets are handled differently depending on whether they appear in the train set or in the test set:
   - Train set: The empty tweet is simply dropped.
   - Test set: The empty tweet is filled with the tag `<empty>`, which is not in the dictionary and should have a positive/negative sentiment associated with it.
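The two rules can be sketched on plain Python lists as follows. The helper name `handle_empty` is hypothetical, and the reason given in the comment for keeping test tweets (a prediction is presumably needed for every tweet id) is an assumption rather than something stated in this notebook.

```python
EMPTY_TAG = "<empty>"


def handle_empty(tweets, labels=None):
    """Drop empty tweets when labels are given (train set), otherwise tag them (test set)."""
    if labels is not None:
        # Train set: an empty tweet carries no information, so it is dropped with its label
        kept = [(tweet, label) for tweet, label in zip(tweets, labels) if tweet != ""]
        return [tweet for tweet, _ in kept], [label for _, label in kept]
    # Test set: every row is kept (assumed: one prediction per tweet id is required)
    return [tweet if tweet != "" else EMPTY_TAG for tweet in tweets], None


train_tweets, train_labels = handle_empty(["good", "", "bad"], [1, 1, 0])
test_tweets, _ = handle_empty(["", "ok"])
print(train_tweets, train_labels, test_tweets)
```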

In [82]:
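test.loc[test['tweet'] == '', 'tweet'] = '<empty>'
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
vocabulary = pd.concat([vocabulary, pd.DataFrame({'word': ['<empty>']})], ignore_index=True)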

In [83]:
# Add labels
pos['label'] = 1
neg['label'] = 0

When using the GloVe embedding, one can also discard the pre-trained words that do not appear in the corpus, as they won't be of any use.

In [84]:
# Removing unused words from pretrained embeddings
# Join on the embedding index directly so that no extra key column is added to the result
words_embedding = words_embedding.merge(vocabulary, how='inner', left_index=True, right_on='word')
words_embedding = words_embedding.set_index('word')

# Making a dictionary out of the dataframe for faster access
embedding_dict = dict(zip(words_embedding.index, words_embedding[words_embedding.columns].values))

In [105]:
# Checking the fraction of vocabulary words without a pre-trained embedding
len(set(vocabulary['word']) - set(embedding_dict.keys())) / len(set(vocabulary['word']))

0.12879352718636955

At the end of the preprocessing step, about $13 \%$ of the terms in the vocabulary do not have a pre-trained representation. This share is low but not negligible, so we chose to add an entry for each of them in the embedding matrix. The latter can then be fed to an [nn.Embedding](https://pytorch.org/docs/master/nn.html#torch.nn.Embedding) layer, which can be trained further.

In [106]:
# Adding missing entries to the embedding matrix
embedding_matrix, word_indexer = build_embedding_matrix(embedding_dim=25, 
                                                        glove=embedding_dict, 
                                                        vocabulary=vocabulary['word'].to_list())
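`build_embedding_matrix` comes from `helpers` and is not shown in this notebook. The sketch below is one possible way such a helper could work, not the actual implementation: the name `build_embedding_matrix_sketch`, the `seed` parameter and the random initialisation of missing words are all assumptions.

```python
import numpy as np


def build_embedding_matrix_sketch(embedding_dim, glove, vocabulary, seed=0):
    """Return (matrix, word_indexer); row i of the matrix embeds the word mapped to i."""
    rng = np.random.default_rng(seed)
    word_indexer = {word: i for i, word in enumerate(vocabulary)}
    matrix = np.empty((len(vocabulary), embedding_dim), dtype=np.float32)
    for word, i in word_indexer.items():
        if word in glove:
            # Reuse the pre-trained vector when one exists
            matrix[i] = glove[word]
        else:
            # Assumed strategy: small random vector, to be refined during training
            matrix[i] = rng.normal(scale=0.1, size=embedding_dim)
    return matrix, word_indexer


toy_glove = {"good": np.ones(3)}  # hypothetical 3-dimensional embedding
matrix, indexer = build_embedding_matrix_sketch(3, toy_glove, ["good", "gooood"])
print(matrix.shape, indexer)
```

The indexer is what lets a tweet be turned into a sequence of row indices before it is passed to an embedding layer.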

At this stage we can split the data into train, validation and test sets.

In [110]:
# Defining the size of the different sets
train_ratio = 0.75
valid_ratio = 0.125
train_stop = int(train_ratio * neg.shape[0])
valid_stop = int((train_ratio + valid_ratio) * neg.shape[0])
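As a worked example of the arithmetic above, the following sketch computes the slice boundaries for a hypothetical class with 1000 tweets; the function name `split_bounds` and the row count are made up for illustration.

```python
def split_bounds(n_rows, train_ratio, valid_ratio):
    """Return the row indices at which the train and validation slices stop."""
    train_stop = int(train_ratio * n_rows)
    valid_stop = int((train_ratio + valid_ratio) * n_rows)
    return train_stop, valid_stop


n_rows = 1000  # hypothetical number of tweets per class
train_stop, valid_stop = split_bounds(n_rows, train_ratio=0.75, valid_ratio=0.125)
sizes = (train_stop, valid_stop - train_stop, n_rows - valid_stop)
print(train_stop, valid_stop, sizes)  # → 750 875 (750, 125, 125)
```

With these ratios, 75% of each class goes to training and the remaining quarter is shared equally between validation and test.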

In [114]:
# Splitting the data in train, validation and test
train_data = pd.concat([pos[['tweet', 'label']].iloc[:train_stop], 
                        neg[['tweet', 'label']].iloc[:train_stop]], 
                       ignore_index=True)
valid_data = pd.concat([pos[['tweet', 'label']].iloc[train_stop: valid_stop], 
                        neg[['tweet', 'label']].iloc[train_stop: valid_stop]],
                       ignore_index=True)
test_data = pd.concat([pos[['tweet', 'label']].iloc[valid_stop:],
                       neg[['tweet', 'label']].iloc[valid_stop:]], 
                       ignore_index=True)

The empty tweets can now be removed.

In [119]:
train_data = train_data[train_data['tweet'] != '']
valid_data = valid_data[valid_data['tweet'] != '']
test_data = test_data[test_data['tweet'] != '']

Finally, save the clean data set!

In [130]:
train_data.to_csv('../Data/train_small.txt', index=False)
valid_data.to_csv('../Data/valid_small.txt', index=False)
test_data.to_csv('../Data/test_small.txt', index=False)
test.to_csv('../Data/test_online_small.txt')

In [129]:
np.save('../Data/embedding_small', embedding_matrix)
np.save('../Data/embedding_indexer_small', word_indexer)