# Natural Language Processing: A Primer On Pre-processing

In the previous post of this series, I wrote about how pre-processing is the first step we should take when analysing text data. This is done so that we may maximise the performance of our algorithms on the task we want to accomplish. Specifically, we want to filter out any aspect of the data that will at worst hinder, and at best not help, in achieving our task.

The following is a non-exhaustive roadmap that I consulted in my sentiment analysis project:

- Remove stopwords (context dependent)
- Stemming and lemmatisation
- Remove hashtags and mentions, and replace emojis with the text they represent (for Twitter)
- Replace contractions with their full forms
- Remove punctuation
- Convert everything to lowercase
- Remove HTML tags if present

Let's take a look, with code, at how these steps can be executed. To better frame this analysis, we will be using the Amazon Product Data for the video games category, made available by Julian McAuley (*Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering, R. He, J. McAuley, WWW, 2016, found at http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Video_Games_5.json.gz*).

This data contains the following useful variables:

- body of the review
- a 'summary' or review title
- an overall product rating out of
- the reviewer's username

## 1) Removing stopwords (or rather, should we remove stopwords?) using spaCy

Words that do not hold a lot of semantic connotation, and thus are not useful for linking input (text data) to output (semantic label) are considered stopwords. These include 'a', 'the', 'who', 'why', pronouns and so on. 

However, we must be very careful in deciding whether or not to pursue this step, especially in the context of sentiment analysis. This is because removing stopwords may subtract from the true meaning of a sentence and thus alter it considerably. Consider the word 'not'. "I do not like this colour" has the opposite sentiment of "I like this colour". So if we remove the word 'not' from a review, we have effectively reversed the sentiment of the review.
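To make the risk concrete, here is a minimal sketch using a tiny hand-written stopword set (`toy_stopwords` is an assumption for illustration, not spaCy's list). Once 'do' and 'not' are filtered out, the two opposite sentences become indistinguishable:

```python
# Hypothetical, hand-picked stopword set for illustration only
toy_stopwords = {"i", "do", "not", "this"}

def drop_stopwords(sentence, stopwords):
    """Lower-case, split on whitespace and drop any token found in stopwords."""
    return [word for word in sentence.lower().split() if word not in stopwords]

negative = drop_stopwords("I do not like this colour", toy_stopwords)
positive = drop_stopwords("I like this colour", toy_stopwords)

print(negative)  # ['like', 'colour']
print(positive)  # ['like', 'colour']
```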

As a result, in the context of sentiment analysis, it is advised to not execute this step. I will still display how to remove stopwords as a matter of demonstration. We can use a very useful NLP module called spaCy to do this.

In [76]:
# The data provided is in JSON format, so we can load it in a pandas dataframe using the following code:
import pandas as pd
data = pd.read_json('/Users/alitaimurshabbir/Desktop/reviews_Video_Games_5.json', lines=True)
data.drop(['asin','unixReviewTime', 'reviewTime'], axis = 1, inplace = True) #drop unnecessary columns
data.head(2) #preview data

| | reviewerID | reviewerName | helpful | reviewText | overall | summary |
|---|---|---|---|---|---|---|
| 0 | A2HD75EMZR8QLN | 123 | [8, 12] | Installing the game was a struggle (because of... | 1 | Pay to unlock content? I don't think so. |
| 1 | A3UR8NLLY1ZHCX | Alejandro Henao "Electronic Junky" | [0, 0] | If you like rally cars get this game you will ... | 4 | Good rally game |


spaCy comes with its own predefined set of stopwords (312 at the time of writing). We can use this list to remove stopwords from our reviews. It is useful to know that we can also add our own stopwords to such a set (https://spacy.io/usage/adding-languages/#stop-words). Let's go ahead and import spaCy and execute this step.

In [77]:
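import spacy
from spacy.lang.en.stop_words import STOP_WORDS #the stopword set ships with spaCy itself

nlp = spacy.load('en_core_web_sm') #load the English language model of spaCy (the 'en' shortcut only works in spaCy 2.x)
stopWords = STOP_WORDS #load stopwords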

#to see a few examples of these stopwords, I can convert the first 10 elements of this set into a list

stopList = list(stopWords)[:10]
print(stopList)

['via', '’d', 'whatever', 'after', 'latter', 'its', 'seems', 'what', 'by', '‘d']


Removing stopwords from our reviews using a list comprehension:

In [80]:
data['reviewTextNoStopwords'] = data['reviewText'].apply(lambda x:' '.join([word for word in x.split() if word not in stopWords]))

You will notice that we use the methods *split* and *join*.

We use them because the stopword check has to be applied to individual words, not to the review as one long string. So *split()* first breaks each review into a list of words. Once the pre-processing step has been performed, *join()* turns the remaining words back into a single sentence/string.
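The comprehension above compares each token exactly as it appears, so a capitalised 'If' or a token with punctuation attached ('game.') will not match a lowercase stopword. A minimal sketch of a more forgiving variant follows. The small `stop_words` set and the `keep` set of negations are stand-ins made up for illustration, not spaCy's list.

```python
import string

# Stand-in stopword set (assumption); in the post this would be spaCy's STOP_WORDS
stop_words = {"the", "a", "was", "if", "you", "this", "not"}
# Negations we may want to keep for sentiment analysis (assumption)
keep = {"not", "no", "never"}

def remove_stopwords(text, stop_words, keep=frozenset()):
    """Split on whitespace, drop stopwords case-insensitively, then re-join."""
    kept_tokens = []
    for token in text.split():
        # Normalise only for the comparison; the original token is what we keep
        normalised = token.lower().strip(string.punctuation)
        if normalised in keep or normalised not in stop_words:
            kept_tokens.append(token)
    return " ".join(kept_tokens)

cleaned = remove_stopwords("If you like rally cars this was not the game.", stop_words, keep)
print(cleaned)  # like rally cars not game.
```

Keeping a short list of negations is one way to soften the sentiment problem described earlier, although which words belong on that list depends on the task.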

Below is a comparison of the review text pre- and post-removal of stopwords.

In [81]:
data.loc[[0, 1, 2], ['reviewText', 'reviewTextNoStopwords']]

| | reviewText | reviewTextNoStopwords |
|---|---|---|
| 0 | Installing the game was a struggle (because of... | Installing game struggle (because games window... |
| 1 | If you like rally cars get this game you will ... | If like rally cars game fun.It oriented &#34;E... |
| 2 | 1st shipment received a book instead of the ga... | 1st shipment received book instead game.2nd sh... |


## 2) Lemmatisation using NLTK

Lemmatisation is the process of reducing inflectional forms and sometimes derivationally related forms of a word to a common base form. The lemma of 'walking', for example, is 'walk'. As with removing stopwords, lemmatisation is intended to improve model performance, although it is again possible that performance actually declines instead. This is because both steps tend to improve *recall* while negatively affecting *precision*. As a result, whether to use either technique depends on the metric one is focused on.
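The recall/precision trade-off can be seen in a toy keyword search. The sketch below uses a small hand-written lookup table (`toy_lemmas`, hypothetical and nothing like a real lemmatiser's coverage) in place of a proper lemmatiser:

```python
# Hypothetical lemma lookup; a real lemmatiser relies on a full dictionary
toy_lemmas = {"walking": "walk", "walked": "walk", "walks": "walk"}

def lemmatise(text, lemmas):
    """Replace each whitespace-separated token with its lemma when one is known."""
    return [lemmas.get(word, word) for word in text.lower().split()]

documents = ["She walked home", "He walks daily", "Walking simulator games"]

# Search for the exact token 'walk' before and after lemmatisation
raw_hits = [doc for doc in documents if "walk" in doc.lower().split()]
lemma_hits = [doc for doc in documents if "walk" in lemmatise(doc, toy_lemmas)]

print(raw_hits)         # []
print(len(lemma_hits))  # 3
```

The third hit is about a game genre rather than the act of walking, which is the kind of match that raises recall while potentially lowering precision.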

As with stopword removal, I will demonstrate how lemmatisation is achieved. We can use another very useful NLP module, called **NLTK**, and its WordNetLemmatizer class.

In [82]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet') #one-time download of the WordNet data the lemmatizer relies on
lemmatizer = WordNetLemmatizer()

In [83]:
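#lemmatize() treats every word as a noun unless a part-of-speech tag is passed, which is why 'was' later shows up as 'wa'
data['reviewText'] = data['reviewText'].apply(lambda x:' '.join([lemmatizer.lemmatize(word) for word in x.split()]))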

## 3) Removing hashtags and mentions (Twitter-specific) using Regex

One of the cool things about NLP is that sources of data can be very specific in terms of the quirks they have. Sometimes we must deal with those quirks and sometimes we can leave them be. Data from Twitter, for example, will contain hashtags (#) and mentions (@username) because those mechanics are inherent to the platform. They are also not particularly useful in analysis, so we can remove them.

Doing so requires familiarity with regular expressions (**regex**), which Python provides through its built-in **re** module. Regular expressions are well suited to the patterns found in social media text, and in Twitter data in particular.

Two such expressions in which we are interested are:

- @[A-Za-z0-9]+ which represents all kinds of mentions
- #[A-Za-z0-9]+ which represents all kinds of hashtags

Since our Amazon data doesn't have many hashtags or mentions, let's create a single string containing these elements and witness how regex works. We will be using *re.sub* together with the *join* and *split* string methods.

In [84]:
import re

In [85]:
someText = '''Inter Milan goalkeeper @Samir Handanovic will not be #going to PSG. his agent @massimo venturella said 
to football italia: "I can confirm that there were negotiations with PSG, which we have broken off. PSG is not an
option. Real Madrid and Liverpool are the other strong rumours. #inter #lfc'''

In [86]:
someText = ' '.join(re.sub( "(@[A-Za-z0-9]+)|(#[A-Za-z0-9]+)", ' ', someText).split())

In [87]:
print(someText)

Inter Milan goalkeeper Handanovic will not be to PSG. his agent venturella said to football italia: "I can confirm that there were negotiations with PSG, which we have broken off. PSG is not an option. Real Madrid and Liverpool are the other strong rumours.
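Note that the pattern above deletes the whole hashtag, so the word 'going' is lost, and a mention stops at the first character outside `[A-Za-z0-9]` (an underscore, for example). If the hashtag text is worth keeping, one possible variant, sketched here on a made-up tweet, drops mentions but strips only the `#` symbol:

```python
import re

def clean_tweet(text):
    """Remove @mentions entirely; keep hashtag words but drop the leading '#'."""
    text = re.sub(r"@\w+", " ", text)       # \w also covers underscores in usernames
    text = re.sub(r"#(\w+)", r"\1", text)   # '#inter' -> 'inter'
    return " ".join(text.split())           # collapse the leftover whitespace

tweet = "Great save by @samir_h tonight! #inter #SerieA"  # made-up example
print(clean_tweet(tweet))  # Great save by tonight! inter SerieA
```

Whether hashtag words help or hurt depends on the data, so this is a choice to test rather than a rule.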


## 4) Remove URLs using Regex

We can remove URLs in the same way we removed mentions and hashtags. This time the expression needed is **\w+:\/\/\S+**, which matches a scheme followed by :// and any run of non-whitespace characters, and therefore covers URLs beginning with http:// or https://.
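A minimal sketch of this step with `re.sub`, on a made-up string. The stricter pattern `https?://\S+` is my own substitution that only matches the http and https schemes; `\w+://\S+` would additionally catch things like ftp://.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")  # http:// or https:// followed by non-whitespace

def remove_urls(text):
    """Delete URLs, then collapse the whitespace they leave behind."""
    return " ".join(URL_PATTERN.sub(" ", text).split())

sample = "Full review at https://example.com/review?id=1 and http://example.org too"
print(remove_urls(sample))  # Full review at and too
```

Because `\S+` runs until the next whitespace, punctuation glued to the end of a URL (a trailing full stop, say) is removed with it.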

## 5) Remove punctuations using Regex

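As before, regex makes this very easy. Rather than listing punctuation marks such as **.,!?:;-=** one by one, we use the expression **[^\w\s]**, which matches any character that is neither a word character nor whitespace.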

In [100]:
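data['reviewText'] = data['reviewText'].str.replace(r'[^\w\s]', '', regex=True) #regex=True is needed in recent pandas versions, where patterns are otherwise treated as literal text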
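The same substitution can be tried on a plain string with the standard library. This sketch also shows `str.translate` with `string.punctuation`, an alternative that only removes ASCII punctuation; the sample sentence is made up.

```python
import re
import string

sample = "Wow!!! It's a great game, isn't it? (9/10)"

# Regex approach: drop everything that is neither a word character nor whitespace
no_punct_regex = re.sub(r"[^\w\s]", "", sample)

# translate approach: delete each character listed in string.punctuation
no_punct_translate = sample.translate(str.maketrans("", "", string.punctuation))

print(no_punct_regex)      # Wow Its a great game isnt it 910
print(no_punct_translate)  # Wow Its a great game isnt it 910
```

Because apostrophes are stripped too, "isn't" becomes "isnt". Expanding contractions before removing punctuation avoids this.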

## 6) Remove HTML tags using Beautiful Soup

This is very simple and self-explanatory. Our someText variable above does not have HTML tags, so the following shows a general way to accomplish this step.

In [88]:
#from bs4 import BeautifulSoup
#textVariable = BeautifulSoup(textVariable, 'html.parser').get_text() #naming the parser explicitly avoids a warning
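If Beautiful Soup is not available, a rough equivalent can be sketched with Python's standard library. The `TagStripper` class below is a hypothetical helper, not part of any library, and it will not handle malformed markup as gracefully as Beautiful Soup. Character references such as `&#34;`, which appears in the review text shown earlier, are decoded along the way.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities like &#34; into characters
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text):
    stripper = TagStripper()
    stripper.feed(text)
    stripper.close()  # flush any text still buffered by the parser
    return "".join(stripper.parts)

sample = "<p>It is oriented to &#34;European&#34; <b>rally</b> fans.</p>"
print(strip_html(sample))  # It is oriented to "European" rally fans.
```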

## 7) Lower-case all text

In [89]:
data['reviewText'] = data['reviewText'].apply(lambda x: ' '.join(word.lower() for word in x.split()))

As seen below, all text has been converted to lowercase.

In [90]:
data['reviewText']

0         installing the game wa a struggle (because of ...
1         if you like rally car get this game you will h...
2         1st shipment received a book instead of the ga...
3         i got this version instead of the ps3 version,...
4         i had dirt 2 on xbox 360 and it wa an okay gam...
                                ...                        
231775    funny people on here are rating seller that ar...
231776    all this is is the deluxe 32gb wii u with mari...
231777    the package should have more red on it and sho...
231778    can get this at newegg for $329.00 and the pac...
231779    this is not real, you can go to any retail sto...
Name: reviewText, Length: 231780, dtype: object

## 8) Replace contractions with their full forms

A contraction is the shortened form of a common phrase. "Isn't" is the contraction of "is not", and there may be some performance benefits to be gained by expanding such contractions. However, this is slightly trickier than the previous few steps as no pre-defined regular expression exists for this purpose.

One solution to this is creating our own dictionary, with the key-value pairs representing the contracted and expanded forms, respectively, of phrases. Here's a short example:

In [93]:
textVariable = 'Sean isn\'t in the barn; we\'ve checked it already'

contractions = {"isn't":"is not", "we've":"we have"}
textVariable  = textVariable.replace("’","'")
words = textVariable.split()
reformed = [contractions[word] if word in contractions else word for word in words]
textVariable = " ".join(reformed)

Printing this out shows us:

In [94]:
textVariable

'Sean is not in the barn; we have checked it already'
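The lookup above only matches tokens that equal a dictionary key exactly, so "Isn't" at the start of a sentence, or "we've," followed by a comma, would be left alone. A minimal sketch of a regex-based variant follows. The `contractions` dictionary is the same tiny illustrative one, and the helper name `expand_contractions` is my own.

```python
import re

contractions = {"isn't": "is not", "we've": "we have"}

# One alternation of all keys, matched on word boundaries and ignoring case
pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in contractions) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction with its expansion, preserving a leading capital."""
    text = text.replace("’", "'")  # normalise curly apostrophes first

    def replace(match):
        found = match.group(0)
        expanded = contractions[found.lower()]
        return expanded.capitalize() if found[0].isupper() else expanded

    return pattern.sub(replace, text)

print(expand_contractions("Isn’t Sean in the barn? No, we've, um, checked."))
# Is not Sean in the barn? No, we have, um, checked.
```

A real project would need a much longer dictionary, and ambiguous forms such as "he's" ("he is" or "he has") cannot be resolved by lookup alone.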

### We've quickly seen some of the more common substeps within pre-processing of text data and how to execute them. In the next post, we will explore the what, how and why of feature representation of text data.