<h1>Processing of tweets</h1>

In this notebook we preprocess the provided tweet datasets (cleaning, tokenizing, stemming and lemmatizing) so that the prediction methods applied later work on cleaner input.

In [None]:
import nltk, re
import pandas as pd

# Run once to download the NLTK data used below (the stopwords and wordnet corpora).
# nltk.download()

<h3>Tweet extraction</h3>

This function extracts tweet data from a given filename and puts it in a dataframe. We then apply it to the full tweet datasets and the test dataset.
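
As a small illustration of the one-tweet-per-line layout this step relies on, the sketch below reads a temporary file into a single-column dataframe. The helper name `read_lines_to_frame` and the sample file are hypothetical and not part of the notebook; unlike `readlines()`, this variant also strips the trailing newline from each line.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_lines_to_frame(path):
    """Hypothetical helper: one row per line, trailing newline removed."""
    with open(path, "r", encoding="utf8") as f:
        lines = [line.rstrip("\n") for line in f]
    # A list of strings becomes a dataframe with a single column named 0
    return pd.DataFrame(lines)


# Build a tiny sample file so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("first tweet\nsecond tweet\n", encoding="utf8")
    frame = read_lines_to_frame(sample)

print(frame.shape)
print(frame[0].tolist())
```

The single column is named `0`, which is why the later cells access the tweets as `tweets[0]`.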

In [None]:
def get_tweets(filename):
    
    # Read the data file
    with open("twitter-datasets/" + filename, "r", encoding="utf8") as myfile:
        data = myfile.readlines()
        
    # Make a dataframe out of the data
    return pd.DataFrame(data)

In [None]:
negative_DF = get_tweets("train_neg_full.txt")
positive_DF = get_tweets("train_pos_full.txt")
test_DF = get_tweets("test_data.txt")

In [None]:
negative_DF.head(10)

<h3>Tweet formatting</h3>

This function formats the tweet data to suit our needs. It converts everything to lowercase, removes unwanted elements (user tags, URLs, retweet tags and lone characters), replaces runs of non-alphabetical characters with a single space, and collapses any sequence of three or more identical letters into a single occurrence of that letter. We also remove the English stop words provided by the nltk package (words such as "I", "you", etc.). Finally, each string is replaced by a list of words. The function is then applied to each dataset.
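
To make the substitution rules concrete, here is a minimal sketch that applies the same patterns, in the same order, to a single made-up tweet using only the standard `re` module. The names `REPLACEMENTS` and `clean` and the sample text are illustrative assumptions; stop word removal is left out so the example runs without any NLTK data.

```python
import re

# Same patterns as described above, applied in order
REPLACEMENTS = [
    (r"<user>", ""),             # remove user tags
    (r"<url>", ""),              # remove URLs
    (r"\brt\b", ""),             # remove retweet markers
    (r"[^a-z]+", " "),           # non-letters become a single space
    (r"\b\w\b", ""),             # remove lone characters
    (r"([a-z])\1{2,}", r"\1"),   # 3+ repeated letters collapse to one
]


def clean(text):
    """Lowercase, apply every substitution, then split into words."""
    text = text.lower()
    for pattern, repl in REPLACEMENTS:
        text = re.sub(pattern, repl, text)
    return text.split()


tokens = clean("RT <user> I looooove this!!! <url> #happy 4ever")
print(tokens)
```

Because lone characters are removed before repeated letters are collapsed, a run such as "aaa" would end up as a single "a" that is not filtered out; swapping the last two rules would change that.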

In [None]:
def format_tweets(tweets):
    
    # These are stop words that we want to take out from the tweets
    # (stored as a set so that membership tests are fast)
    lang_set = set(nltk.corpus.stopwords.words('english'))
    
    # Put everything in lowercase
    tweets[0] = tweets[0].astype(str).str.lower()
    
    # The replacement instructions for below, which:
    # - remove usertags
    # - remove urls
    # - remove retweets ("rt")
    # - replace anything that is not letters by a space
    # - remove lone characters
    # - replace sequences of 3 or more of the same letter by a single occurrence of that letter
    replacements = [
        ("<user>", ''),
        ("<url>", ''),
        (r'\brt\b', ''),
        (r'[^a-z]+', ' '),
        (r'\b\w\b', ''),
        (r'([a-zA-Z])\1{2,}', r'\1')
    ]
    
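    # Apply the replacement instructions (the patterns are regular expressions,
    # so regex=True is passed explicitly; recent pandas versions default to literal matching)
    for key, value in replacements:
        tweets[0] = tweets[0].str.replace(key, value, regex=True)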
            
    # Tokenize each tweet
    tweets[0] = tweets[0].str.split()
    
    # Remove the stop words
    tweets[0] = tweets[0].apply(lambda tweet: [word for word in tweet if word not in lang_set])
        
    return tweets

In [None]:
negative_DF = format_tweets(negative_DF)
positive_DF = format_tweets(positive_DF)
test_DF = format_tweets(test_DF)

In [None]:
negative_DF.head(10)

<h3>Stemming and lemmatizing</h3>

Now that our tweets are cleaned, we apply stemming (crudely chopping off word endings) and lemmatizing (looking words up in a vocabulary to reduce them to a base form) so that inflected forms such as "loving" and "loved" can be recognized as the same word, which should help the methods applied later.
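
The difference between the two approaches can be seen on a few words. The sketch below is only an illustration with a made-up word list: the Snowball stemmer needs no extra data, while the WordNet lemmatizer requires the `wordnet` corpus, so that part is wrapped in a `try` block in case the corpus has not been downloaded yet.

```python
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer

words = ["loving", "loved", "loves"]

# Stemming: rule-based suffix stripping, no dictionary lookup
stemmer = SnowballStemmer("english")
stems = [stemmer.stem(word) for word in words]
print(stems)

# Lemmatizing: dictionary lookup, treats each word as a noun by default
try:
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(word) for word in words]
    print(lemmas)
except LookupError:
    print("WordNet data not found; run nltk.download('wordnet') first")
```

Since `lemmatize` assumes a noun unless a part of speech is passed, it tends to leave verb forms unchanged, which is one reason to compare the stemmed and lemmatized columns side by side.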

In [None]:
def stem_and_lem(tweets, stemmer, lemmer):
    # Stem each word of the cleaned tweets
    tweets['stemmed'] = tweets[0].apply(lambda tweet: [stemmer.stem(word) for word in tweet])
    # Lemmatize each word, using the lemmatizer passed as argument
    tweets['lemmed'] = tweets[0].apply(lambda tweet: [lemmer.lemmatize(word) for word in tweet])
    # Stem the lemmatized words to combine both treatments
    tweets['both'] = tweets['lemmed'].apply(lambda tweet: [stemmer.stem(word) for word in tweet])
    
    return tweets

In [None]:
# Generating the stemmer and lemmatizer
stemmer = nltk.stem.snowball.SnowballStemmer('english')
lemmatizer = nltk.stem.WordNetLemmatizer()

In [None]:
negative_DF = stem_and_lem(negative_DF, stemmer, lemmatizer)
positive_DF = stem_and_lem(positive_DF, stemmer, lemmatizer)
test_DF = stem_and_lem(test_DF, stemmer, lemmatizer)

In [None]:
negative_DF.head(10)

<h3>Save data</h3>

All that remains is to save our results in text files so that they can be used by our different methods. Only the lemmatized version of each tweet (the `lemmed` column) is written out.
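
As a quick check of the output format, the sketch below writes two made-up token lists to a temporary file and reads them back. The sample data and file name are assumptions for illustration; `header=False` is passed so that the column name is not written as the first line of the file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two made-up tokenized tweets standing in for the 'lemmed' column
tokens = pd.Series([["love", "this"], ["happy", "day"]], name="lemmed")

# Join each list of words back into a single string
lines = tokens.apply(" ".join)

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "sample_proc.txt"
    # One tweet per line, no index column and no header row
    lines.to_csv(out, index=False, header=False)
    written = out.read_text(encoding="utf8").splitlines()

print(written)
```

The file can then be read back line by line, in the same way as the raw datasets.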

In [None]:
def save_tweets(tweets, filename):
    
    # Put the lemmatized tweets back to string form
    data = tweets['lemmed'].apply(lambda x: ' '.join(x))
    
    # Save to file, one tweet per line and without a header row
    with open("twitter-datasets/" + filename, "w", encoding="utf8") as myfile:
        data.to_csv(myfile, index=False, header=False)

In [None]:
save_tweets(negative_DF, "train_neg_proc.txt")
save_tweets(positive_DF, "train_pos_proc.txt")
save_tweets(test_DF, "test_data_proc.txt")