# <font color='#eb3483'> Natural Language Processing </font>

In this notebook, we will be working with a large dataset of movie reviews from the **Internet Movie Database (IMDb)**. The dataset contains 50,000 movie reviews that have been labelled as positive or negative. Positive means that the movie got a rating of more than six stars, while negative means that it got a rating less than five stars.

Our goal is to build a machine learning model to predict whether a reviewer will like or dislike a movie based on his/her written review.

The dataset can be downloaded [here](//drive.google.com/file/d/1mLLHORSCShHdgQO_m1lTSnKC28EYWtnZ/view?usp=drive_link)



In [None]:
import pandas as pd
import numpy as np

import re # for regular expressions
import requests # to read the HTML at a URL into Python
from bs4 import BeautifulSoup # to extract text from HTML
import nltk # natural language toolkit for stop words, stemming and lemmatization (and more!)

In [None]:
#Import from Data Folder in Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
path = "/content/drive/MyDrive/Data Science | Abroad | S1 | Claire/Class Materials/Week 3-6 Special Topics/W4D2 NLP/Classwork/movies.csv"
df = pd.read_csv(path)

In [None]:
#df = pd.read_csv('data/movies.csv')
df

In [None]:
# Count the number of 0s and 1s in 'sentiment'
count_values = df['sentiment'].value_counts()

print(count_values)

## <font color='#eb3483'> Before We Start: List Comprehension in Python </font>

Before we begin our intro to natural language processing, we need to discuss **list comprehension** in Python: an easy way to loop over and select/transform items in a list.

As an example, suppose you have a list:

```python
mylist = [1,2,3,4,5]
```

and you want to return a new list that contains the squared values of each element. How would you do this?

Well, you could use a ```for``` loop:

In [None]:
mylist = [1,2,3,4,5]
newlist = []

for i in range(len(mylist)):
    newlist.append(mylist[i]**2)

newlist

A neater way to do this is to use **list comprehension**, which essentially includes the ``for`` loop within the list:

In [None]:
[x**2 for x in mylist]

We can even include an `if` statement in our list comprehension to select only a subset of the items:

In [None]:
[x**2 for x in mylist if x < 4]
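
The same pattern works on lists of strings, which is how list comprehensions are used later in this notebook. Here's a minimal sketch (the example words are made up) that keeps only the words longer than three letters and converts them to upper case:

```python
words = ["the", "movie", "was", "great"]

# keep words with more than 3 letters, and transform each one to upper case
long_words = [w.upper() for w in words if len(w) > 3]
print(long_words)
```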

## <font color='#eb3483'> Text Cleaning and Preprocessing </font>

As you'll soon see, converting textual data into a format that can be used for machine learning can take a lot of time and effort. Often the techniques that we use are problem specific. Here, we will explore some of these approaches as they apply to our movies dataset. If you ever find yourself analysing text data, you'll probably need to adapt these methods to suit your needs.

###  <font color='#eb3483'> 1. HTML </font>

Have a look at the first and fourth reviews (indexed as 0 and 3):

In [None]:
df.review.iloc[0]

In [None]:
df.review.iloc[3]

You'll notice that they contain expressions like ```<br />```. These are HTML tags - the ``<br>`` tag is used to display a line **br**eak in HTML code. The details are not important, but we probably don't want to keep these in the text.

Let's remove HTML tags using the ``BeautifulSoup`` class from ``bs4``, an extremely useful library for extracting text from HTML and XML files.

In [None]:
BeautifulSoup(df.review.iloc[3], 'html.parser').get_text()

Ah, that looks better! Let's apply this function to all the reviews in our dataset.

In [None]:
# Function to clean HTML tags
def clean_html(text):
    return BeautifulSoup(text, 'html.parser').get_text()

# Apply the function to the entire column
df['review'] = df['review'].apply(clean_html)

df

We can also do this with the regular expressions library, ``re``:

In [None]:
import re

def remove_html(x):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', x)
    return cleantext

df['review'] = df['review'].apply(remove_html)

In [None]:
df.review.iloc[44]

###  <font color='#eb3483'> 2. Regular Expressions </font>

If you take a look at a few of the reviews, you'll notice that people write differently. Some write "I" with a capital letter, while others don't worry about capitalization and simply write "i". Some write "cant" while others correctly write "can't" with the apostrophe. Some put a space after a full stop, others don't. We don't want these arbitrary differences affecting our analyses, so let's try and standardize the text. For example, let's convert all letters to lower case:

In [None]:
myreview = df.review.iloc[3]
myreview = myreview.lower()
myreview

Now let's find and replace all the commas with a space. You're probably used to doing this on your computer in MS Word - you just hit ``ctrl/cmd + F``, type in the letter/word you want to find, and then replace it. Turns out you can also do this programmatically using the ``sub`` (substitute) function from the `re` module:

In [None]:
re.sub(",", " ", myreview)

``re.sub`` is actually way more powerful than this. As its first argument, it accepts a **regular expression** or **regex** which is basically a *search pattern*. We can search for **any** pattern we like. For example, we could search a text string for all phone numbers that have the format +XX-XXX-XXX-XXXX, where X is a number. A regular expression is a way to express this pattern programmatically.
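
To make this concrete, here's a sketch of a regex for the phone number format above. The sample text and the numbers in it are invented for illustration. ``\d{2}`` means "exactly two digits", and ``\+`` is an escaped plus sign, since ``+`` on its own has a special meaning in a regex:

```python
import re

# +XX-XXX-XXX-XXXX, where each X is a digit
phone_pattern = r"\+\d{2}-\d{3}-\d{3}-\d{4}"

sample_text = "Call +27-123-456-7890 or +44-207-946-0958 for tickets."

# findall returns every non-overlapping match as a list of strings
found_numbers = re.findall(phone_pattern, sample_text)
print(found_numbers)
```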

We are not going to have time to cover all the possible ways to express patterns, so we'll just consider the ones that are useful for this dataset. For more information on regular expressions, look at the [```re``` help page](https://docs.python.org/3/library/re.html) or Google around.

We previously replaced all commas with a space. Let's also replace full stops, underscores, question marks and exclamation marks. The regex for this is ``[,._?!]`` which translates into "find all instances of ``,`` or ``.`` or ``_`` or ``?`` or ``!``". Note that the square brackets are simply used to enclose the different symbols that we want to find. They are not included in the search.

In [None]:
re.sub("[,._?!]", " ", myreview)

Let's also get rid of the numbers. Here the regex ``[0-9]`` reads "find any digit from 0 to 9".

In [None]:
re.sub("[0-9]", " ", myreview)

Let's combine our two regexes to simultaneously substitute commas, full stops, underscores, exclamation and question marks, AND all numbers:

In [None]:
re.sub("[,._?!0-9]", " ", myreview)

We also probably want to get rid of the dashes in "sex-life" and "police-officers" so that these are each represented by two separate words across all reviewers. Can we just add a "-" into our regex?

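Well, not as it stands... inside square brackets a dash is normally used to specify a range, such as 0-9. To match a literal dash we can either put it first or last inside the brackets or, as we do here, precede it with a backslash and write ``\-``. We also write the pattern as a raw string (``r"..."``) so that Python passes the backslash on to the regex engine unchanged:

In [None]:
re.sub(r"[,._?!0-9\-]", " ", myreview)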

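Similarly, characters that have a special meaning in a regex need to be preceded by a backslash "\\" when we want to match them literally:

\ ( ) [ ] | . ? * + ^ $

Inside square brackets most of these lose their special meaning (only ] \ ^ and - still need care), but putting a backslash in front of other punctuation such as & " / is harmless, so the patterns in this notebook escape them anyway.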

In [None]:
re.sub(r"[.,_\-!?\(\)\/0-9]", " ", myreview)
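
If you're ever unsure whether a symbol needs a backslash, you can simply test it. The sketch below uses a made-up string and suggests that, inside square brackets, a dash placed last is treated as an ordinary character, and that adding backslashes in front of punctuation doesn't change the result:

```python
import re

sample = "sex-life (1999) & co/op"

# dash placed last inside the brackets: matched literally, no backslash needed
dash_last = re.sub(r"[0-9()&/-]", " ", sample)

# every symbol escaped with a backslash: same set of characters
dash_escaped = re.sub(r"[0-9\(\)\&\/\-]", " ", sample)

print(dash_last == dash_escaped)
```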

You may have noticed that we can end up with a lot of space between some words. A space is itself a character (we just can't see it). We can match a single space with the regex `[ ]`, but what if we want to match 1 or more spaces, so that we can, for instance, replace three consecutive spaces with a single space? Easy peasy! We just use `[ ]+` as our regex:

In [None]:
myreview = re.sub(r"[.,_\-!?\(\)\/0-9]", " ", myreview)
myreview = re.sub("[ ]+", " ", myreview)
myreview

No more multiple spaces! Finally, let's remove the apostrophes from all contractions (e.g. can't, don't, etc) and possessive words (reviewer's --> reviewers), so that this is handled consistently across reviewers. Instead of replacing it with a space, we will replace it with nothing (that is, ``""``) so we effectively remove it:

In [None]:
myreview = re.sub("'", "", myreview)
myreview

That looks good! Let's put everything we've done into a single function and apply it to all reviews:

In [None]:
def reformat_string(x):
    x = x.lower() # change to a lower case
    x = re.sub(r'[.,_\-!?\(\)\/\"\&0-9]', " ", x) # remove certain characters
    x = re.sub("[ ]+", " ", x) # replace multiple spaces with a single one
    x = re.sub("'", "", x) # remove apostrophes
    return x

df['review'] = df.review.apply(reformat_string)

In [None]:
df.review.iloc[0]

In [None]:
df.review.iloc[1]

### <font color='#eb3483'> 3. Stop Words </font>

Let's go back to our movies dataset and have a look at the first review, for example:

In [None]:
df.review.iloc[0]

Many of the words in this sentence are unlikely to be helpful for predicting the reviewer's sentiment. For example, "with", "of", "the", "this" and so on. Such words are referred to as **stop words**, and we would usually like to remove them from the sentence.

The ```nltk``` (**n**atural **l**anguage **t**ool**k**it) library contains a list of English stop words (179 at the time of writing; the exact number depends on your ``nltk`` version). We can use this list to omit stop words from our reviews:

In [None]:
nltk.download('stopwords') # need to run this the first time only

from nltk.corpus import stopwords
stop = stopwords.words('english')
print('List contains', len(stop), 'stopwords')
stop

Note that the stop words list contains contractions such as "you're" and "you've" that include an apostrophe. Since we removed the apostrophes in our reviews, we will begin by removing them in the stop word list, and then exclude all the words in the stop word list from our reviews.

In [None]:
stop = [re.sub("'", "", w) for w in stop] # list comprehension
stop

In [None]:
df.review.iloc[0]

In [None]:
# Split the string in the first element into a list of words.
# Iterate over each word w in the list of words and include w in the resulting list only if it is not in the stop list.
# Join the words back together with a space separating each word
' '.join([w for w in df.review.iloc[0].split() if w not in stop])

Now we're ready to put this into a general function that we can apply to all reviews in our dataset:

In [None]:
def remove_stopwords(x):
    return ' '.join([w for w in x.split() if w not in stop])

df['review'] = df.review.apply(remove_stopwords)
df.head()

### <font color='#eb3483'> 4. Stemming & Lemmatization </font>

Many words with the same meaning can be written in slightly different ways depending, for example, on tense (past, present and future tense) and plurality (singular vs plural). For example, "run", "ran", "runs" and "running" all refer to the same concept of "running" and we would therefore like to represent all of these different words as a single feature in our model for predicting sentiment. We can use stemming or lemmatization to achieve this.

**Stemming** is the process of transforming a word into its root form to allow us to map related words to the same stem. There are many different stemming algorithms; we will use one of the oldest and most widely used, developed by Martin Porter in 1979 and thus known as Porter stemming:

In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [None]:
def stemming(x):
    words = [stemmer.stem(w) for w in x.split()] # do stemming
    return ' '.join(words)

df['review'] = df.review.apply(stemming) # this may take a few minutes if the dataset is large
df.head()
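
To see what the stemmer does to individual words, here's a small sketch that applies it to the "run" example from earlier. The names ``demo_stemmer`` and ``demo_words`` are just for illustration:

```python
from nltk.stem import PorterStemmer

demo_stemmer = PorterStemmer()
demo_words = ["run", "ran", "runs", "running"]

# map each word to its stem
stems = {w: demo_stemmer.stem(w) for w in demo_words}
print(stems)
```

Regular forms such as "runs" and "running" are reduced to "run", but the irregular past tense "ran" is left as it is, because the Porter algorithm only strips suffixes.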

Sometimes, stemming produces non-real words. This is usually not a problem. However, if we want grammatically correct words, we can use a similar process called **lemmatization** that attempts to identify the canonical form of a word:

In [None]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet') # run this the first time you use it

lemmatizer = WordNetLemmatizer()

# Function to lemmatize each word in a string
def lemmatize_text(text):
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

# Apply the function to the 'review' column
df['review'] = df['review'].apply(lemmatize_text)

df.head()

## <font color='#eb3483'> Formatting the Data for Machine Learning </font>

Now that the reviews have been cleaned, we need to convert them into features that can be used in an ML algo.

In a **bag-of-words** model, the frequency of each word in a sentence is regarded as a separate feature, *ignoring its context or adjacent words within a sentence*. In our movies dataset, the idea is that certain words would be more common in negative reviews, while other words would occur more frequently in positive reviews.
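
As a rough illustration of the idea, here's a hand-rolled bag-of-words on two made-up reviews, using only the standard library. We'll let scikit-learn do this for us on the real data:

```python
from collections import Counter

reviews = ["great movie great cast", "boring movie"]

# the vocabulary is the sorted set of all distinct words
vocabulary = sorted({w for r in reviews for w in r.split()})

# one row per review, one column per vocabulary word
bag_of_words = [[Counter(r.split())[w] for w in vocabulary] for r in reviews]

print(vocabulary)
print(bag_of_words)
```

Each row only records how often each word occurs; the order of the words in the review is lost.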

Since feature engineering should always be performed on the training set and **not** on the test data, we begin by dividing our movies data into train and test sets, and make all of our decisions on the training data. In this dataset, I have already ordered the rows, so that the first 25,000 rows are the training set and the last 25,000 rows are the test set:

In [None]:
train = df[:25000]
test = df[25000:]

In [None]:
train.sentiment.value_counts()

In [None]:
test.sentiment.value_counts()

### <font color='#eb3483'> 1. Tokenization </font>

The process of splitting the sentences into words is referred to as **tokenization**. In this case, a word is referred to as a **token**. More generally though, a token could be a word pair, triplet or an even longer string of adjacent words.
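
Here's a minimal sketch of what this looks like for a single made-up sentence. The helper ``make_ngrams`` is a hypothetical function written for this example, not part of any library:

```python
def make_ngrams(text, n):
    """Split text on whitespace and return all runs of n adjacent words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

single_words = make_ngrams("movie was not good", 1)
word_pairs = make_ngrams("movie was not good", 2)

print(single_words)
print(word_pairs)
```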

To construct our features, we need to get the counts of **all the words/tokens** across **all the reviews** in our training data (25,000 reviews). This is going to be a very large number of words! We refer to this as the **vocabulary**.



In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform(train.review.iloc[:5]) # apply just to the first 5 reviews

In [None]:
count_matrix

The output of the ```CountVectorizer``` is a sparse matrix. Most of the counts are zero, so SciPy (which scikit-learn uses under the hood) stores the data in a memory-efficient format that only keeps the non-zero entries. If we want to "see" the full matrix, we can use the ```.toarray()``` method:

In [None]:
count_matrix.toarray()

Because we only applied the ```CountVectorizer``` to the first 5 reviews, this matrix is still small. It has 533 columns (a vocab of size 533). But if we were to apply it to the full training set, the number of columns (words) would be more than 50,000, with 25,000 rows! So it makes sense to store this as efficiently as possible. You can extract the vocabulary from the fitted ```CountVectorizer``` object:

In [None]:
count_vectorizer.vocabulary_

Note that this is a dictionary that maps each word to a column in our ``count_matrix`` array.

In [None]:
count_vectorizer.vocabulary_.keys() # just the vocabulary

In [None]:
# a neater version of the count_matrix:
pd.DataFrame(count_matrix.toarray(), columns=pd.Series(count_vectorizer.vocabulary_).sort_values().index)

As mentioned earlier, we could have considered word pairs as our tokens. Such a token is called a **bigram**. This may be useful to deal with **negation** e.g. the bigram "not good" actually has the opposite meaning to the unigram "good".  More generally, a token comprising $n$ words is referred to as an **$n$-gram**. Here's how we would implement bigram tokenization with ```CountVectorizer```:

In [None]:
bigram_vectorizer = CountVectorizer(ngram_range = (2,2))
bigram_matrix = bigram_vectorizer.fit_transform(train.review.iloc[:5])
bigram_vectorizer.vocabulary_.keys() # just the bigrams in the vocabulary

### <font color='#eb3483'> 2. Term Frequency-Inverse Document Frequency (TF-IDF) </font>

The tokenization process can often produce very many features (i.e. a LARGE vocabulary). Part of the feature engineering process is to try and reduce this very large feature space, which amounts to selecting a subset of the vocabulary that we think is likely to be most relevant for predicting the outcome variable (sentiment, in our example).

The frequency of a term (word) can certainly help here. As we mentioned earlier, if a word tends to occur more frequently among positive reviews than negative ones, then it is likely to be a good predictor. However, terms/words that occur frequently across all documents are not very informative e.g. words like "with" and "is". We therefore want to keep words that occur often within a document, but not often in all documents. The term frequency-inverse document frequency (TF-IDF) attempts to measure this. Higher values indicate that a word is more relevant, and can be used for feature selection.
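
To get a feel for how such a score behaves, here's a hand-rolled sketch on three tiny made-up documents. It uses raw counts for the term frequency and the smoothed inverse document frequency $\ln\frac{1+n}{1+\text{df}} + 1$, which is the default in scikit-learn's ``TfidfVectorizer``. Scikit-learn additionally rescales each document's scores to unit length, which is left out here. The function names ``idf`` and ``tfidf`` are just for illustration:

```python
import math

# three tiny made-up documents, already split into words
docs = [
    ["great", "movie"],
    ["boring", "movie"],
    ["great", "great", "cast"],
]
n_docs = len(docs)

def idf(term):
    # number of documents that contain the term at least once
    doc_freq = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf(term, doc):
    # raw count of the term in this document, weighted by its idf
    return doc.count(term) * idf(term)

# scores for the third document
for term in ["movie", "cast", "great"]:
    print(term, round(tfidf(term, docs[2]), 3))
```

"cast" appears in only one document, so it gets a larger weight per occurrence than "movie", which appears in two.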

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(train.review)

print('The vocabulary has', len(tfidf_vectorizer.vocabulary_.keys()), 'words')

In [None]:
tfidx_df = pd.DataFrame(tfidf_matrix.toarray(), columns=pd.Series(tfidf_vectorizer.vocabulary_).sort_values().index)
tfidx_df.head()

In [None]:
word_relevance = tfidx_df.sum().sort_values(ascending=False)
word_relevance.head(20)

In [None]:
word_relevance.tail(20)
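
Converting the full TF-IDF matrix to a dense array can use a lot of memory when the vocabulary is large. If that becomes a problem, the column sums can be computed directly on the sparse matrix instead. A minimal sketch on a made-up 3 x 3 matrix (``toy_matrix`` and ``toy_terms`` are illustrative names, not part of the notebook above):

```python
import numpy as np
from scipy.sparse import csr_matrix

# a tiny stand-in for a TF-IDF matrix: 3 documents (rows) x 3 terms (columns)
toy_matrix = csr_matrix(np.array([
    [0.0, 0.5, 0.0],
    [0.2, 0.0, 0.0],
    [0.0, 0.3, 0.9],
]))
toy_terms = ["bad", "good", "great"]

# sum each column without converting the matrix to a dense array
column_sums = np.asarray(toy_matrix.sum(axis=0)).ravel()

# rank the terms from highest to lowest summed score
ranked_terms = [toy_terms[i] for i in np.argsort(column_sums)[::-1]]
print(ranked_terms)
```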

In [None]:
import numpy as np
import seaborn as sns
graph = sns.lineplot(x=np.arange(5000),y=word_relevance.iloc[:5000])
graph.axvline(500, c='r')


In the resulting plot, the words are ranked from most to least relevant along the x-axis, and the red vertical line marks rank 500, the cut-off we will use: words to the left of the line are kept as features, and words to the right are dropped.

Let's keep just the top 500 words/features for model building:

In [None]:
vocab = word_relevance[:500].index

tfidf_vect = TfidfVectorizer(vocabulary=vocab)
X_train = tfidf_vect.fit_transform(train.review)
X_test  = tfidf_vect.transform(test.review)

In [None]:
print('Training data shape:', X_train.shape)
print('Test data shape:', X_test.shape)

## <font color='#eb3483'> Model Building </font>

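Now that we have a set of features and a discrete outcome variable, we can go ahead and train whichever classifier we choose... or better still, try a few different classification algos and choose the best one! (Ideally, make that choice using cross-validation or a separate validation set, and keep the test data for a final check of the chosen model.)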

As an example, let's train an out-of-the-box random forest:

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X=X_train, y=train.sentiment) # this may take a few minutes

In [None]:
varimp = pd.Series(rf.feature_importances_, index=vocab).sort_values(ascending=False)

import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
sns.barplot(x=varimp[:20], y=varimp.index[:20])

In [None]:
y_pred = rf.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, RocCurveDisplay
print('Confusion matrix:\n', confusion_matrix(test.sentiment, y_pred), '\n')
print('Test accuracy:', accuracy_score(test.sentiment, y_pred))
RocCurveDisplay.from_estimator(rf, X_test, test.sentiment)
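
One common way to compare classifiers without touching the test set is k-fold cross-validation on the training data. Here's a minimal sketch on synthetic data (``X_demo`` and ``y_demo`` are stand-ins for ``X_train`` and ``train.sentiment``, and the two candidate models and their settings are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the training features and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

candidates = {
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# mean accuracy over 5 folds for each candidate model
cv_accuracy = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_accuracy[name] = scores.mean()

print(cv_accuracy)
```

The model with the best cross-validated score would then be refit on the full training set and evaluated once on the test set.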

## <font color='#eb3483'> Pre-Trained Sentiment Classifier </font>

In the work above, we trained our own sentiment classifier. This is often useful to identify sentiment within a specific field, where certain words have special meanings. For example, if you were to develop a sentiment classifier for financial news headlines, you would want the word "bull" to have positive sentiment and "bear" to have a negative sentiment (prices rise in "bull markets" and plummet in "bear markets"). In most other contexts, the sentiment around bulls and bears would probably be different.

For some domains, pre-trained models are available. Some of these rely on a **lexicon** (a dictionary that maps words to sentiment scores), and the scores are then combined for a given sentence. Others have been trained on a specific dataset (just like we did above) and then made available to us.

**For example, VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon-based sentiment analysis tool that is particularly effective for analyzing social media text.**

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()

``sia.polarity_scores(x)`` calls the ``polarity_scores`` method of the VADER sentiment analysis tool (``sia``) and returns a dictionary of sentiment scores for the input text ``x``.

The dictionary contains four keys: neg, neu, pos, and compound. The first three give the proportions of the text that are negative, neutral and positive, while the compound score is a normalized, weighted composite score between -1 (most negative) and +1 (most positive) that summarizes the overall sentiment of the text.

In [None]:
#This calls the polarity_scores method from the VADER sentiment analysis tool (sia)
sia.polarity_scores("love")

In [None]:
sia.polarity_scores("hate")

In [None]:
sia.polarity_scores("bull")

In [None]:
sia.polarity_scores("I love waffles and ice-cream")

Let's see how well the VADER lexicon captures the sentiment in our movie reviews:

In [None]:
# For each review in df.review, compute VADER's compound sentiment score
# and collect the scores in the list vader_sentiment (a list comprehension)
vader_sentiment = [sia.polarity_scores(x)['compound'] for x in df.review]
pd.crosstab(np.array(vader_sentiment) > 0, df.sentiment)

We see that when a movie review has positive sentiment, the VADER lexicon agrees 20731/(20731+4269) = 83% of the time.

But when the reviewer's sentiment is actually negative, VADER only agrees 11811/(11811+13189) = 47% of the time.
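
The two percentages can also be combined into a single overall agreement rate. A quick sketch of the arithmetic, using the counts quoted above (the variable names are just for illustration):

```python
# counts taken from the two calculations above
pos_agree, pos_disagree = 20731, 4269    # reviews labelled positive
neg_agree, neg_disagree = 11811, 13189   # reviews labelled negative

pos_rate = pos_agree / (pos_agree + pos_disagree)
neg_rate = neg_agree / (neg_agree + neg_disagree)

# share of all reviews where VADER agrees with the label
total_reviews = pos_agree + pos_disagree + neg_agree + neg_disagree
overall_rate = (pos_agree + neg_agree) / total_reviews

print(round(pos_rate, 2), round(neg_rate, 2), round(overall_rate, 2))
```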

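Perhaps there are differences between the way people express sentiment in social media compared to in movie reviews? It is also worth remembering that we gave VADER the cleaned and stemmed reviews, whereas its lexicon contains ordinary unstemmed words and its rules make use of punctuation and capitalization.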