## Cleaning notebook

This notebook provides scripts to clean the data before it is encoded in the next modelling stage. The cleaning process consists of removing or replacing elements that can negatively affect the later transformation steps. The detailed outline of the cleaning process is as follows:
 
- Expanding contractions: Expand contractions (don't, won't, etc.) into their full form (do not, will not, etc.). This helps with the negation handling and stopword removal stages later on. 
- Removing noise texts (emails, hashtags, username mentions, links, words with repeating last letter): Remove noise text from our data. These elements carry little or no meaningful information for the model to learn in the context of sentiment analysis, and should be removed to keep the focus on the main features of the text.
- Removing stopwords: Remove words that appear very frequently but carry little information or sentiment and can draw focus away from the main features of the text. 
- Removing non-alphabets (digits, special characters, punctuation marks...): Remove any character that is not a letter in the English alphabet. There is no need to include every single character in our vocabulary and make it more complex. While this may affect the information conveyed by the text, the effect is quite insignificant, as these characters are unlikely to carry sentiment and can be ignored for the sake of simplicity and efficiency.
- Negation handling: Add a negation indicator ('NEG' tag) as a prefix to the word immediately following a negation word (such as 'not', 'no', etc.). The basic idea of negation handling is to preserve the negation polarity of a sentence, which can be accidentally lost in the cleaning process if we don't pay attention to it. This is an often neglected task, but it is an important aspect to consider, especially when working with a bag-of-words model, where the positions of and relations between words do not matter. 
- Additional tasks (lowercasing and removing redundant spaces): Lowercasing letters gives us a smaller vocabulary to work with, while removing redundant spaces (which can potentially confuse the tokenizer) improves the consistency of sentences. 

Given that the data comes from Twitter, where there is no standard for typing, it can be very messy and needs a thorough cleaning process. At the same time, because we are working with such a large data set, we should also keep optimization in mind so that the computation time stays within what our machine can handle.

### Importing packages

In [1]:
# utility
import pandas as pd
import numpy as np
import re 
import contractions

# nltk
import nltk
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to C:\Users\Hung
[nltk_data]     Nguyen\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Hung
[nltk_data]     Nguyen\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Hung
[nltk_data]     Nguyen\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Data cleaning

Let us begin by loading the Twitter sentiment data set and labelling the columns. 

In [2]:
# Load the data
df = pd.read_csv('../data/twitter-data.csv', encoding = "ISO-8859-1", engine="python")

# Define column names for the data
df.columns = ["label", "id", "date", "query", "user", "text"]
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599999 entries, 0 to 1599998
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   label   1599999 non-null  int64 
 1   id      1599999 non-null  int64 
 2   date    1599999 non-null  object
 3   query   1599999 non-null  object
 4   user    1599999 non-null  object
 5   text    1599999 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


As there are no null values in any column, no null handling is needed for the moment. 

Given our modelling goal, we will only use the two columns `label`, `text` and ignore the rest:

In [3]:
# Select only the needed columns (copy so that adding columns later does not act on a view)
df = df[['label', 'text']].copy()

# View the first 5 rows of the data
df.head()

| | label | text |
|---|---|---|
| 0 | 0 | is upset that he can't update his Facebook by ... |
| 1 | 0 | @Kenichan I dived many times for the ball. Man... |
| 2 | 0 | my whole body feels itchy and like its on fire |
| 3 | 0 | @nationwideclass no, it's not behaving at all.... |
| 4 | 0 | @Kwesidei not the whole crew |


Moving on to the main cleaning process, let us now implement functions to perform the process we discussed at the beginning of the notebook.

**Expand contractions**

This can simply be done using the `contractions` library. It handles various cases that are easily confused if one tries to use a rule-based regex method instead.
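
To illustrate the kind of confusion a hand-written rule set runs into, here is a minimal sketch of a naive regex expander. The rule table `NAIVE_RULES` and the function `naive_expand` are hypothetical and written only for this illustration; they say nothing about how the `contractions` library works internally.

```python
import re

# Hypothetical, deliberately naive rules: each suffix is always expanded the same way
NAIVE_RULES = [
    (r"n't\b", " not"),
    (r"'ll\b", " will"),
    (r"'s\b", " is"),
]

def naive_expand(text: str) -> str:
    # Apply every rule in turn, with no knowledge of context
    for pattern, replacement in NAIVE_RULES:
        text = re.sub(pattern, replacement, text)
    return text

print(naive_expand("he'll go"))       # he will go
print(naive_expand("Lindy's place"))  # Lindy is place (possessive wrongly expanded)
print(naive_expand("won't"))          # wo not (irregular form mishandled)
```

The first call works, but the possessive and the irregular contraction show why a fixed set of suffix rules is not enough.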

In [4]:
def expand_contractions(text: str):
  return contractions.fix(text)

In [5]:
expand_contractions("I dont know where he'll go. Maybe he's going to Lindy's place?")

"I do not know where he will go. Maybe he is going to Lindy's place?"

**Removing noise texts (emails, hashtags, username mentions, links, words with repeating letters)**

Here, to remove or replace the noise texts, we simply define regex patterns that identify them and remove the matches.

In [6]:
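def remove_noise_texts(text: str):
  # Patterns of the expressions we want to remove
  email_re = r'\S+@\S+'
  mention_hashtag_re = r'(@|#)\S+'
  link_re = r'((www\.\S+)|(https?://\S+))|(\S+\.com)'
  # Any character repeated three or more times in a row
  repeating_suff_re = r'(.)\1\1+'

  # Remove emails, mentions, hashtags and links first, so that collapsing
  # repeated characters cannot turn 'www' into 'w' before links are matched
  text = re.sub(email_re, '', text)
  text = re.sub(mention_hashtag_re, '', text)
  text = re.sub(link_re, '', text)
  # Collapse each run of repeated characters into a single character
  text = re.sub(repeating_suff_re, r'\1', text)

  return text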

In [7]:
remove_noise_texts("In twitter.com a sentence can have an email email_123@abcd.nlp.com, it also mentions @someone_on_Twitter and has words with repeating last letter to express my feelingssssssssssss. Lastly #DontForgetTheHashTag")

'In  a sentence can have an email  it also mentions  and has words with repeating last letter to express my feelings. Lastly '
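
The repeated-character pattern deserves a closer look, because it collapses any run of three or more identical characters to a single one, wherever the run occurs in a word. The sketch below compares it with a hypothetical variant, `collapse_to_two`, which keeps two characters instead; this variant is not part of the notebook and is shown only to illustrate the trade-off.

```python
import re

def collapse_to_one(text: str) -> str:
    # Same pattern as in the notebook: a run of 3+ identical characters becomes 1
    return re.sub(r'(.)\1\1+', r'\1', text)

def collapse_to_two(text: str) -> str:
    # Hypothetical variant: a run of 3+ identical characters becomes 2
    return re.sub(r'(.)\1\1+', r'\1\1', text)

print(collapse_to_one("sooooo goood"))  # so god
print(collapse_to_two("sooooo goood"))  # soo good
print(collapse_to_one("good feeling"))  # good feeling (runs of 2 are untouched)
```

Collapsing to one character suits words stretched at the end ("feelingsssss"), while collapsing to two preserves legitimate double letters at the cost of leaving forms such as "soo" in the vocabulary.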

**Stopwords removal**

To remove the stopwords, we first remove every word in the nltk stopword list except the negation indicators 'not', 'no' and 'nor', which are needed for negation handling later. Only after negation handling can these negation words finally be discarded by calling the same function once more. 


In [8]:
# Load the nltk stopword list once, outside the function, so it is not reloaded on every call.
# A set gives fast membership checks, which matters when cleaning over a million tweets.
stop_words = set(stopwords.words("english"))
# The same list without the negation words that negation handling still needs
stop_words_keep_neg = stop_words - {'not', 'no', 'nor'}

def remove_stopwords(text: str, neg_keep: bool = False):
  # Keep 'not', 'no' and 'nor' in the text if neg_keep is True. Otherwise remove every stopword.
  stop_word_list = stop_words_keep_neg if neg_keep else stop_words

  # Keep only the words that are not in the stopword list
  text = ' '.join([word for word in text.split() if word not in stop_word_list])

  return text

In [9]:
remove_stopwords('I do not like the food')

'I like food'

In [10]:
remove_stopwords('I do not like the food', neg_keep=True)

'I not like food'

**Negation handling**

The basic idea of negation handling is to preserve the negation polarity of a sentence, which can be accidentally lost in the cleaning process if we don't pay attention to it. When removing stopwords, we also remove negation indicators such as 'not' and 'no', which completely removes the negated meaning of the sentence. We can clearly see this in the example for the previous function, `remove_stopwords`.

Negation handling can become quite a complicated task if we dig deeper, and doing so would deviate from the main aim of this project. Therefore, let us only perform a surface-level form of negation handling using a simple rule-based method: *adding the prefix 'NEG' to every word that immediately follows a negation indicator*. By doing this, we are essentially creating new words, which ensures that machine learning models do not treat words in negated contexts the same as their non-negated counterparts.

In [11]:
def negation_handle(text: str):
  # Prefix the word right after a negation word with 'NEG'.
  # The [a-z]+ part assumes the text has already been lowercased.
  text = re.sub(r'\b(not|no|never|cannot)\b(\s+)([a-z]+)', r'\1\2NEG\3', text)

  return text

In [12]:
negation_handle('I do not like the food. I like the movie.')

'I do not NEGlike the food. I like the movie.'
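
To make the bag-of-words argument concrete, the sketch below counts tokens with and without negation tagging. It is a self-contained illustration: `TOY_STOPWORDS`, `NEGATIONS` and `bag_of_words` are hypothetical names, and the tiny stopword set stands in for the nltk list used in this notebook.

```python
import re
from collections import Counter

# Hypothetical mini stopword list, for illustration only
TOY_STOPWORDS = {"i", "do", "the", "not", "no"}
NEGATIONS = {"not", "no"}
NON_NEG_STOPWORDS = TOY_STOPWORDS - NEGATIONS

def bag_of_words(text: str, handle_neg: bool) -> Counter:
    if handle_neg:
        # Keep negation words for now, then tag the word that follows them
        kept = [w for w in text.split() if w not in NON_NEG_STOPWORDS]
        text = re.sub(r'\b(not|no)\b(\s+)([a-z]+)', r'\1\2NEG\3', ' '.join(kept))
    # Final pass: drop every stopword, including the negation words
    return Counter(w for w in text.split() if w not in TOY_STOPWORDS)

plain_pos = bag_of_words("i like the food", handle_neg=False)
plain_neg = bag_of_words("i do not like the food", handle_neg=False)
tagged_neg = bag_of_words("i do not like the food", handle_neg=True)

print(plain_pos == plain_neg)   # True: the negation is lost
print(plain_pos == tagged_neg)  # False: 'NEGlike' is a separate token
```

Without tagging, the positive and negative sentences produce identical bags of words; with tagging, the negated verb survives as its own token.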

**Removing non-alphabets (digits, special characters, punctuation marks...)**

We can simply do this with a regex that replaces any character that is not a letter (a-z or A-Z) with a space.

In [13]:
def remove_non_alpha(text: str):
  # Pattern of the expression we want to remove
  non_alpha_digit_re = r'[^a-zA-Z]'

  # Remove patterns in our textual data
  text = re.sub(non_alpha_digit_re, r' ', text)

  return text

In [14]:
remove_non_alpha('Pair of shoes for 100000 pounds????? No sir good bye!')

'Pair of shoes for        pounds      No sir good bye '

**Removing redundant spaces**

We can simply do this with a regex that replaces two or more consecutive whitespace characters with a single space.

In [15]:
def remove_spaces(text: str):
  multi_spaces_re = r'\s\s+'
  text = re.sub(multi_spaces_re, ' ', text)
  return text

In [16]:
remove_spaces("This         has too   many spaces.")

'This has too many spaces.'
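
As a possible alternative, the two steps (removing non-alphabets, then redundant spaces) can be combined using `str.split()` and `' '.join()`, which also trims leading and trailing whitespace. The function name `strip_non_alpha_and_spaces` is hypothetical; this is a sketch, not a replacement for the cells above.

```python
import re

def strip_non_alpha_and_spaces(text: str) -> str:
    # Replace every non-letter with a space, then let split()/join()
    # collapse whitespace runs and trim both ends
    return ' '.join(re.sub(r'[^a-zA-Z]', ' ', text).split())

print(strip_non_alpha_and_spaces('Pair of shoes for 100000 pounds????? No sir good bye!'))
# Pair of shoes for pounds No sir good bye
```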

**Removing short words**

Removing short words (here, words of a single character) helps to reduce the vocabulary size.

In [17]:
def remove_short_words(text: str):
  text = ' '.join(word for word in text.split() if len(word) > 1)
  return text
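
The notebook does not show a sample call for this function, so here is a quick self-contained check. The function body repeats the logic of the cell above, and the input sentence is made up for illustration.

```python
def remove_short_words(text: str) -> str:
    # Same logic as the cell above: keep only words longer than one character
    return ' '.join(word for word in text.split() if len(word) > 1)

print(remove_short_words('i saw a cat at 5 pm'))  # saw cat at pm
```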

**Stemming or lemmatizing (*optional*)**

Stemming and lemmatizing can reduce the complexity of our vocabulary by truncating words to a simpler form or converting them to their root form. However, they are not used in this project. Given that we are dealing with a large corpus containing many poorly formatted and unconventional words, and even typos, overstemming can occur and cause errors in the model's predictions. While lemmatizing can still be useful for converting words to their base forms, it was dropped because it did not significantly improve the performance of my machine learning model while consuming a large amount of computation time.

Nonetheless, a lemmatizer is still implemented here, and it is up to readers whether to use it. It performs POS tagging on the text, then lemmatizes each word using its POS tag as a parameter. POS tagging is essential for lemmatization because it tells the lemmatizer what type of word it is dealing with (verb, adjective, noun, etc.). Skipping this step is a common error in many projects, though an understandable one given its high computational cost.

In [18]:
lemmatizer = WordNetLemmatizer()

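# Map the first letter of a Penn Treebank tag (as returned by pos_tag) to a WordNet pos.
# Adjective tags start with 'J' (JJ, JJR, JJS), while WordNet expects 'a'.
tag_map = {'j': 'a', 'n': 'n', 'v': 'v', 'r': 'r'}

def lemmatize(text: str):
  # pos_tag relies on an nltk tagger resource (e.g. 'averaged_perceptron_tagger') being downloaded
  split_text = pos_tag(text.split())
  result = []
  
  for word, tag in split_text:
    postag = tag_map.get(tag[0].lower())
    if not postag:
      # No WordNet counterpart for this tag: keep the word unchanged
      result.append(word)
    else:
      result.append(lemmatizer.lemmatize(word, pos=postag)) 

  return ' '.join(result)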

In [19]:
lemmatize('I ate hamburgers yesterday.')

'I eat hamburger yesterday.'
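
The tag conversion is the step that is easiest to get wrong: `pos_tag` returns Penn Treebank tags by default, where adjectives start with `J`, whereas WordNet uses `'a'` for adjectives. The helper below, `to_wordnet_pos`, is a hypothetical stand-alone sketch of that mapping and does not require nltk to run.

```python
from typing import Optional

def to_wordnet_pos(penn_tag: str) -> Optional[str]:
    # Hypothetical helper: map a Penn Treebank tag to a WordNet pos code,
    # or return None when there is no counterpart
    if penn_tag.startswith('J'):
        return 'a'  # adjectives: JJ, JJR, JJS
    if penn_tag.startswith('V'):
        return 'v'  # verbs: VB, VBD, VBG, ...
    if penn_tag.startswith('N'):
        return 'n'  # nouns: NN, NNS, NNP, NNPS
    if penn_tag.startswith('RB'):
        return 'r'  # adverbs: RB, RBR, RBS
    return None

for tag in ['JJ', 'NNS', 'VBD', 'RB', 'PRP']:
    print(tag, to_wordnet_pos(tag))
```

Matching on `RB` rather than `R` is a design choice in this sketch: it keeps particles (`RP`) from being treated as adverbs.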

**Final cleaning function**

Now, we can put the functions we created (together with the additional cleaning mentioned earlier) in the following order:

Lowercasing words $\rightarrow$ Expanding contractions $\rightarrow$ Removing noise texts $\rightarrow$ Removing stopwords except negation words $\rightarrow$ Negation handling $\rightarrow$ Removing non-alphabets $\rightarrow$ Removing short words $\rightarrow$ Removing redundant spaces $\rightarrow$ Lemmatizing (optional) $\rightarrow$ Removing all stopwords.

The stopword, negation and lemmatization steps are controlled by the flags `handle_neg`, `remove_all_stopword` and `is_lemmatized`, which are all off by default.

In [20]:
def data_cleaning(text: str, handle_neg=False, remove_all_stopword=False, is_lemmatized=False):
  text = text.lower()
  text = expand_contractions(text)
  text = remove_noise_texts(text)
  if handle_neg:
    text = remove_stopwords(text, neg_keep=True)
    text = negation_handle(text)
  text = remove_non_alpha(text)
  text = remove_short_words(text)
  text = remove_spaces(text)
  if is_lemmatized:
    text = lemmatize(text)
  if remove_all_stopword:
    text = remove_stopwords(text)

  return text.strip()

In [21]:
data_cleaning("Just got an email from prankster@troll.com, I think it's @Jordan's troll. He told me in the mail that he really don't like my dog, but he did like a cat. #IHateJordan")

'just got an email from think it is troll he told me in the mail that he really do not like my dog but he did like cat'

Let us now apply this to the `text` column. Note that the cleaning function may reduce some tweets to empty strings, so we should also remove those rows from our data.

In [22]:
# Apply the cleaning function with negation handling and full stopword removal
df['neg_handled_text'] = df['text'].apply(lambda text: data_cleaning(text, handle_neg=True, remove_all_stopword=True))

# Apply the cleaning function with default settings (no negation handling, stopwords kept)
df['normal_cleaned_text'] = df['text'].apply(data_cleaning)

# Remove rows whose cleaned text is empty
df_neg = df[df['neg_handled_text'] != ''][['neg_handled_text', 'label']]
df_normal = df[df['normal_cleaned_text'] != ''][['normal_cleaned_text', 'label']]

In [23]:
# View the first 5 rows of the cleaned data
df_neg.head()

| | neg_handled_text | label |
|---|---|---|
| 0 | upset cannot NEGupdate facebook texting might ... | 0 |
| 1 | dived many times ball managed save rest go bounds | 0 |
| 2 | whole body feels itchy like fire | 0 |
| 3 | NEGbehaving mad cannot NEGsee | 0 |
| 4 | NEGwhole crew | 0 |


In [24]:
# View the first 5 rows of the cleaned data
df_normal.head()

| | normal_cleaned_text | label |
|---|---|---|
| 0 | is upset that he cannot update his facebook by... | 0 |
| 1 | dived many times for the ball managed to save ... | 0 |
| 2 | my whole body feels itchy and like its on fire | 0 |
| 3 | no it is not behaving at all am mad why am her... | 0 |
| 4 | not the whole crew | 0 |


### Saving the data

Now, we can finally split our data into an array of cleaned texts `X` and an array of labels `y`. They are saved to a folder as `numpy` arrays for later use in the modelling stage, where the texts will be encoded into feature vectors.

In [25]:
X_neg = df_neg['neg_handled_text'].values
y_neg = df_neg['label'].values
X_normal = df_normal['normal_cleaned_text'].values
y_normal = df_normal['label'].values

np.save('../data/preprocessed_data/feature_vectors_normal.npy', X_normal)
np.save('../data/preprocessed_data/target_vectors_normal.npy', y_normal)
np.save('../data/preprocessed_data/feature_vectors_neg.npy', X_neg)
np.save('../data/preprocessed_data/target_vectors_neg.npy', y_neg)
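
One thing to keep in mind for the modelling stage: assuming the text arrays have `object` dtype (which is what pandas string columns typically produce through `.values`), `np.save` pickles them, and `np.load` then needs `allow_pickle=True` to read them back. The sketch below demonstrates this round trip with stand-in data and hypothetical file names in a temporary directory, so it does not touch the real files.

```python
import os
import tempfile

import numpy as np

# Stand-ins for X_neg and y_neg: an object-dtype array of cleaned strings and their labels
X = np.array(['upset cannot NEGupdate facebook', 'whole body feels itchy like fire'], dtype=object)
y = np.array([0, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, 'feature_vectors_demo.npy')  # hypothetical file name
    y_path = os.path.join(tmp_dir, 'target_vectors_demo.npy')   # hypothetical file name
    np.save(x_path, X)
    np.save(y_path, y)

    # Object arrays are pickled on save, so loading them requires allow_pickle=True
    X_loaded = np.load(x_path, allow_pickle=True)
    # Numeric arrays load with the default settings
    y_loaded = np.load(y_path)

print(X_loaded.dtype, list(X_loaded) == list(X))
print(y_loaded.tolist())
```

Only enable `allow_pickle=True` for files you created yourself, since unpickling untrusted data can execute arbitrary code.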