# LAC Data and Research Summit
## Natural Language Processing Lab
### Isabel Oñate - isabel.onate@northwestern.edu
### 8/28/2019

#### <span style="color:#a50e3e;">Description: </span> 

NLP (Natural Language Processing) is a set of techniques for approaching text problems. In this lab we will go over an example of how to use text data to make predictions using a simple bag of words model. 

We will perform <a href="https://en.wikipedia.org/wiki/Sentiment_analysis">sentiment analysis</a> to identify and extract subjective information from movie reviews and categorize them into positive and negative reviews.

#### <span style="color:#a50e3e;">Data: </span> 

The labeled data set consists of 50,000 IMDB movie reviews, selected for sentiment analysis. The file loaded in this lab contains 25,000 of them. The sentiment of reviews is binary: 1 for positive reviews and 0 for negative reviews. No individual movie has more than 30 reviews. The data contains an id for each review ("id"), the sentiment of the review ("sentiment"), and the text of the review ("review").

*_This exercise is based on this <a href="https://www.kaggle.com/c/word2vec-nlp-tutorial/overview">kaggle tutorial</a> and modified to fit the objectives of the session._



## Setup

In [1]:
# import necessary packages
import numpy as np # numpy
import pandas as pd # pandas
from bs4 import BeautifulSoup # package for pulling data out of HTML and XML files
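import re # package for regular expressions
import nltk # Natural Language Toolkit (run nltk.download("stopwords") once if the stop word list is missing)
from nltk.corpus import stopwords # Import the stop word list
from sklearn.feature_extraction.text import CountVectorizer # bag-of-words feature extraction
from sklearn.ensemble import RandomForestClassifier # random forest model
from sklearn.model_selection import train_test_split # split data into train and test sets
import matplotlib.pyplot as plt # plotting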

## Data

In [103]:
# Load training data
data = pd.read_csv('https://www.dropbox.com/s/qk7gc7ek68z5o7k/labeledData.tsv?dl=1', header=0, delimiter="\t", quoting=3)

# "header=0" indicates that the first line of the file contains column names, 
# "delimiter=\t" indicates that the fields are separated by tabs, 
# and "quoting=3" (csv.QUOTE_NONE) tells pandas to treat double quotes as ordinary characters - to avoid errors when reading the file
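
To see what `quoting=3` changes, here is a minimal sketch on a made-up one-row TSV string (the sample text is an assumption for illustration, not taken from the data file). The value 3 is the constant `csv.QUOTE_NONE` from Python's standard `csv` module; with it, pandas keeps quote characters as part of each field.

```python
import csv
import io

import pandas as pd

# Hypothetical miniature TSV with the same three columns as the lab data
sample = 'id\tsentiment\treview\n"1_8"\t1\t"A \\"fun\\" film"\n'

# quoting=csv.QUOTE_NONE is the same setting as quoting=3
df = pd.read_csv(io.StringIO(sample), header=0, delimiter="\t", quoting=csv.QUOTE_NONE)

print(csv.QUOTE_NONE)       # the integer behind the named constant
print(df.loc[0, "id"])      # quotes are kept as part of the value
print(df.loc[0, "review"])
```

This is consistent with the ids and reviews shown by `data.head()`, which still carry their surrounding double quotes.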

In [102]:
# Shape of data frame
data.shape

(25000, 3)

In [48]:
# Columns in data frame
data.columns.values

array(['id', 'sentiment', 'review'], dtype=object)

In [49]:
# First observations in data frame
data.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | `"5814_8"` | 1 | `"With all this stuff going down at the moment ...` |
| 1 | `"2381_9"` | 1 | `"\"The Classic War of the Worlds\" by Timothy ...` |
| 2 | `"7759_3"` | 0 | `"The film starts with a manager (Nicholas Bell...` |
| 3 | `"3630_4"` | 0 | `"It must be assumed that those who praised thi...` |
| 4 | `"9495_8"` | 1 | `"Superbly trashy and wondrously unpretentious ...` |


In [50]:
# Let's look at the first movie review
print(data["review"][0]) 

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

In [51]:
# Split into train and test datasets for testing ML model
train, test = train_test_split(data, test_size=0.2, random_state=58934)
train = train.reset_index() # reset the index of data frame
test = test.reset_index() # reset the index of data frame
print(train.shape)
print(test.shape)

(11, 4)
(3, 4)
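
Why reset the index? `train_test_split` shuffles the rows but keeps their original index labels, and an expression like `train["review"][0]` looks a row up by label, not by position. A minimal sketch with a made-up four-row frame (the data, and the fixed row order standing in for a shuffle, are assumptions):

```python
import pandas as pd

# Hypothetical frame standing in for the review data
df = pd.DataFrame({"review": ["a", "b", "c", "d"]})

# Reorder rows by hand to mimic the shuffle done by train_test_split
shuffled = df.iloc[[2, 0, 3, 1]]

# Label-based lookup: returns the row whose index label is 0, not the first row
print(shuffled["review"][0])

# After resetting, label 0 is the first row again
reset = shuffled.reset_index(drop=True)  # drop=True discards the old labels
print(reset["review"][0])
```

The lab calls `reset_index()` without `drop=True`, which is why the old labels appear as an extra "index" column and each frame has 4 columns instead of 3.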


In [52]:
# First observations in train data frame
train.head()

| | index | id | sentiment | review |
|---|---|---|---|---|
| 0 | 1 | `"2381_9"` | 1 | `"\"The Classic War of the Worlds\" by Timothy ...` |
| 1 | 2 | `"7759_3"` | 0 | `"The film starts with a manager (Nicholas Bell...` |
| 2 | 13 | `"7369_1"` | 0 | `"I had a feeling that after \"Submerged\", thi...` |
| 3 | 12 | `"11744_9"` | 1 | `"\"Mr. Harvey Lights a Candle\" is anchored by...` |
| 4 | 0 | `"5814_8"` | 1 | `"With all this stuff going down at the moment ...` |


## Cleaning

#### <span style="color:#a50e3e;">How do we clean the text data? </span> 

The cleaning depends on the task we are performing on the data. For this exercise we will:

- 1) Remove markup 
- 2) Remove punctuation marks, numbers and other non-letter characters.
- 3) Make everything lower case
- 4) Remove stop words

In some cases punctuation marks like "?" and "!!!" might be important. For this exercise we remove these in the interest of simplicity.
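
If those marks did matter for a task, one possible approach (an assumption for illustration, not what this lab does) is to keep runs of "!" and "?" as their own tokens while still dropping every other non-letter character. The helper name below is hypothetical.

```python
import re

def tokenize_keep_marks(text):
    # Hypothetical helper: keep words plus runs of "!" or "?" as separate tokens
    return re.findall(r"[a-z]+|[!?]+", text.lower())

print(tokenize_keep_marks("Great movie!!! Really?"))
```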

We will also remove stop words. These are words that occur very frequently in a language and do not carry much meaning, for example "the", "a", and "and". There are Python packages that contain lists of English stop words. For this exercise we will use Python's <a href="http://www.nltk.org/">Natural Language Toolkit (NLTK)</a>.

It is not considered a reliable practice to remove markup using regular expressions so we will use the <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">Beautiful Soup package</a>, a library for pulling data out of HTML and XML files.
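
A small sketch of one way a hand-written regular expression falls short. The pattern `<[^>]+>` below is a typical naive tag stripper chosen for illustration, and the sample string is made up. The pattern removes the tags but leaves HTML entities such as `&amp;` untouched, while Beautiful Soup decodes them.

```python
import re
from bs4 import BeautifulSoup

raw = "Tom &amp; Jerry<br /><br />is a classic"

# Naive approach: delete anything that looks like a tag
regex_text = re.sub(r"<[^>]+>", " ", raw)

# Beautiful Soup parses the markup and decodes entities as well
soup_text = BeautifulSoup(raw, "html.parser").get_text(" ")

print(regex_text)
print(soup_text)
```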

There are many other things we could do to the data. For example, <a href="https://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization">stemming and lemmatizing</a> (both available in the NLTK package). The goal of both stemming and lemmatizing is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. This would allow us to group words that have a common base form like "messages", "message", and "messaging". 
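
As a rough illustration of stemming, the sketch below runs NLTK's `PorterStemmer` on the three words mentioned above. It is not part of the lab's pipeline. Lemmatizing is left out here because NLTK's WordNet lemmatizer needs an extra corpus download.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()  # rule-based stemmer; needs no extra NLTK downloads

words = ["messages", "message", "messaging"]
stems = [stemmer.stem(w) for w in words]

# Print the stems and how many distinct stems the three forms produce
print(stems)
print(len(set(stems)))
```

Note that a stem is not always a dictionary word; it only needs to be the same string for related forms.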

### Example

In [53]:
example_raw = "This is an example of text containing numbers (like 2 3 4 5); \
non letters (# $ % ^ & *); some words with capital letters (Isabel Juan Ana); \
some stopwords (the is on a); and markup (<br /><br />)."

In [54]:
# Let's take a look
print(example_raw)

This is an example of text containing numbers (like 2 3 4 5); non letters (# $ % ^ & *); some words with capital letters (Isabel Juan Ana); some stopwords (the is on a); and markup (<br /><br />).


In [55]:
# Initialize the BeautifulSoup object, which parses the raw text as HTML
example = BeautifulSoup(example_raw, "html.parser") # name the parser explicitly to avoid a warning
# Extract text
example_text = example.get_text() # function to get the text without tags or markup
print(example_text) # See how the markup was removed

This is an example of text containing numbers (like 2 3 4 5); non letters (# $ % ^ & *); some words with capital letters (Isabel Juan Ana); some stopwords (the is on a); and markup ().


In [56]:
# Remove non-letters
letters_only = re.sub("[^a-zA-Z]", " ", example_text) 
print(letters_only) # See how numbers and symbols were removed

This is an example of text containing numbers  like           non letters                some words with capital letters  Isabel Juan Ana   some stopwords  the is on a   and markup    


In [57]:
# Make everything lower case
print(letters_only.lower()) # See how everything is lowercase

this is an example of text containing numbers  like           non letters                some words with capital letters  isabel juan ana   some stopwords  the is on a   and markup    


In [58]:
# Split into vector of words
words = letters_only.lower().split() 
# Print the first 5 words in the vector
print(words[0:5])

['this', 'is', 'an', 'example', 'of']


In [59]:
# Let's look at the stop words in the nltk package 
print(stopwords.words("english")[0:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


In [60]:
# Remove stop words
meaning_words = [w for w in words if not w in stopwords.words("english")]
print(meaning_words)

['example', 'text', 'containing', 'numbers', 'like', 'non', 'letters', 'words', 'capital', 'letters', 'isabel', 'juan', 'ana', 'stopwords', 'markup']


### Clean reviews

In [61]:
# Function to prepare text data for analysis
# -input: one observation of raw text data
# -output: one observation of clean text data
def prepare_words(rawtext): 
    # 1. Remove HTML
    text = BeautifulSoup(rawtext, "html.parser").get_text() 
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", text) 
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()                             
    # 4. Convert the stop words to a set for efficiency
    stops = set(stopwords.words("english"))                  
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]   
    # 6. Join the words back into one string separated by spaces
    return " ".join(meaningful_words)

In [62]:
# Let's look at the first review in the train data frame
print(train["review"][0])

"\"The Classic War of the Worlds\" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book. Obviously, everyone looks for different things in a movie. Those who envision themselves as amateur \"critics\" look only to criticize everything they can. Others rate a movie on more important bases,like being entertained, which is why most people never agree with the \"critics\". We enjoyed the effort Mr. Hines put into being faithful to H.G. Wells' classic novel, and we found it to be very entertaining. This made it easy to overlook what the \"critics\" perceive to be its shortcomings."


In [63]:
# Clean first review
clean_review_1 = prepare_words(train["review"][0])
# View clean form
print(clean_review_1)

classic war worlds timothy hines entertaining film obviously goes great effort lengths faithfully recreate h g wells classic book mr hines succeeds watched film appreciated fact standard predictable hollywood fare comes every year e g spielberg version tom cruise slightest resemblance book obviously everyone looks different things movie envision amateur critics look criticize everything others rate movie important bases like entertained people never agree critics enjoyed effort mr hines put faithful h g wells classic novel found entertaining made easy overlook critics perceive shortcomings


In [64]:
# Get the number of reviews in dataframe
num_reviews = train["review"].size

# Initialize an empty list to hold the clean reviews
clean_train_reviews = []

# Loop over each review and clean it
for i in range(0, num_reviews): 
    # Call our function for each one, and add the result to the list of clean reviews
    clean_train_reviews.append(prepare_words(train["review"][i]))

In [65]:
# What type of object is it?
type(clean_train_reviews)

list

## Creating Features

#### <span style="color:#a50e3e;">Going from text to numbers... </span> 

After cleaning the text data, we need to convert it into a numerical representation for the ML algorithm. We will use a <a href="https://en.wikipedia.org/wiki/Bag-of-words_model">Bag of words</a> model, which counts the number of times each word appears in each entry. We will use the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html">feature_extraction</a> module from scikit-learn to create bag-of-words features.

Let's go over an example of how this works. Suppose we are working with 2 text entries:

- Sentence 1: "The cat sat on the hat"
- Sentence 2: "The dog ate the cat and the hat"

From these two sentences, our vocabulary space is as follows:

{ the, cat, sat, on, hat, dog, ate, and }

To get our bags of words, we count the number of times each word occurs in each sentence. In Sentence 1, "the" appears twice, and "cat", "sat", "on", and "hat" each appear once, so the feature vector for Sentence 1 is:

{ 2, 1, 1, 1, 1, 0, 0, 0 }

Similarly, the features for Sentence 2 are: 

{ 3, 1, 0, 0, 1, 1, 1, 1}
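
The two feature vectors above can be reproduced by hand. This is a minimal sketch using only the standard library; the helper name `bag_of_words` is made up for illustration, and the vocabulary order is copied from the text.

```python
from collections import Counter

sentences = ["The cat sat on the hat", "The dog ate the cat and the hat"]

# Vocabulary in the order used in the text above
vocabulary = ["the", "cat", "sat", "on", "hat", "dog", "ate", "and"]

def bag_of_words(sentence, vocabulary):
    # Count lower-cased words, then read the counts off in vocabulary order
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

for s in sentences:
    print(bag_of_words(s, vocabulary))
```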

In the IMDB data, we have a very large number of reviews, which will give us a large vocabulary. To limit the size of the feature vectors, we should choose some maximum vocabulary size. Below, we use the 5000 most frequent words (remembering that stop words have already been removed).
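
To see what a vocabulary limit does, the sketch below applies `CountVectorizer` to the two example sentences with `max_features=3` (a value picked only to keep the output small). The variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The cat sat on the hat", "The dog ate the cat and the hat"]

# Keep only the 3 most frequent words across both sentences
small_vectorizer = CountVectorizer(analyzer="word", max_features=3)
features = small_vectorizer.fit_transform(sentences).toarray()

# vocabulary_ maps each kept word to its column index
kept_words = sorted(small_vectorizer.vocabulary_, key=small_vectorizer.vocabulary_.get)
print(kept_words)
print(features)
```

Words outside the kept vocabulary are simply ignored, so each sentence is described by 3 counts instead of 8.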

In [66]:
# Initialize the "CountVectorizer" object -scikit-learn's bag of words tool.  
vectorizer = CountVectorizer(analyzer = "word",   
                             tokenizer = None,    
                             preprocessor = None, 
                             stop_words = None,   
                             max_features = 5000) # we use the 5000 most common words

# Fit the model and learn the vocabulary; then transform data into feature vectors
train_data_features = vectorizer.fit_transform(clean_train_reviews) # input needs to be a list of strings

# Convert the sparse matrix into a numpy array for the ML model
train_data_features = train_data_features.toarray()

In [67]:
# Type of object
print(type(train_data_features))
# Shape of array
print(train_data_features.shape)

<class 'numpy.ndarray'>
(11, 892)


In [68]:
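# First 10 words in the vocabulary (on scikit-learn 1.0 or later, use get_feature_names_out() instead)
print(vectorizer.get_feature_names()[0:10])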

['action', 'actors', 'acts', 'actual', 'actually', 'addition', 'adds', 'afoul', 'agent', 'agree']





## Supervised Learning Model

We now have a numeric representation of the train data. We can do supervised learning to predict sentiment labels! 

We will start with a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">random forest classifier</a> included in the scikit-learn package. A random forest is a supervised learning algorithm that can be used for both classification and regression problems. 

We will fit the model to the data using our train dataset and then test it on our test subset.
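
Before running this on the reviews, here is a toy sketch of the fit-then-predict pattern on made-up word counts. The two feature columns, their meaning, and all values are assumptions chosen so that the classes are trivially separable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical word-count features: column 0 = count of "great", column 1 = count of "boring"
X_toy = np.array([[2, 0], [3, 0], [1, 0], [0, 2], [0, 3], [0, 1]])
y_toy = np.array([1, 1, 1, 0, 0, 0])  # 1 = positive, 0 = negative

toy_forest = RandomForestClassifier(n_estimators=100, random_state=0)
toy_forest.fit(X_toy, y_toy)

# Predict labels for two unseen count vectors
toy_predictions = toy_forest.predict(np.array([[4, 0], [0, 4]]))
print(toy_predictions)
```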

### Train model

In [69]:
# Initialize a Random Forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators = 100, random_state = 736438) # more trees usually give more stable predictions but take longer to run

# Fit the forest to the training data
forest = forest.fit(train_data_features, train["sentiment"])

### Test model

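We will need to prepare the test data just like we did with the train dataset. Note that we only call `transform` on the test reviews, so they are encoded with the vocabulary learned from the train dataset.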

In [70]:
# Create an empty list and append the clean reviews
num_reviews = len(test["review"])
clean_test_reviews = [] 

for i in range(0,num_reviews):
    clean_test_reviews.append(prepare_words(test["review"][i]))

# Get a bag of words for the test set, and convert to a numpy array
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()

In [71]:
# Use the random forest to make sentiment label predictions
result = forest.predict(test_data_features)

# Look at accuracy of the model at predicting sentiment
correct = np.array(test["sentiment"] == result) # True where the prediction matches the label
accuracy = (np.sum(correct)/test["sentiment"].size)*100 # Percent accurate
print("Accuracy:", round(accuracy,2), "%")

Accuracy: 66.67 %
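
The same figure can be checked by hand. The sketch below hard-codes the sentiment labels and predictions of the three test rows displayed in the final output of this lab, and compares the manual calculation with scikit-learn's `accuracy_score` (an alternative not used in the original code).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Labels and predictions for the three test rows displayed in the final output
y_true = np.array([0, 0, 1])
y_pred = np.array([1, 0, 1])

# Fraction of matching entries, computed two equivalent ways
manual = np.mean(y_true == y_pred) * 100
with_sklearn = accuracy_score(y_true, y_pred) * 100

print("Accuracy:", round(manual, 2), "%")
print("Accuracy:", round(with_sklearn, 2), "%")
```

Two of the three predictions match, which gives the 66.67 % reported above. With so few test rows, this number says very little about how the model would perform on the full data.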


In [72]:
# Copy the results to the test dataframe
test['prediction'] = result

# Export into a comma-separated output file
test.to_csv("Bag_of_Words_model.csv", index=False)

In [73]:
# Final output
test[0:10]

| | index | id | sentiment | review | prediction |
|---|---|---|---|---|---|
| 0 | 6 | `"7166_2"` | 0 | `"This movie could have been very good, but com...` | 1 |
| 1 | 8 | `"319_1"` | 0 | `"A friend of mine bought this film for £1, and...` | 0 |
| 2 | 4 | `"9495_8"` | 1 | `"Superbly trashy and wondrously unpretentious ...` | 1 |


## Plots

In [None]:
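# Total count of each vocabulary word across all training reviews
dist = np.sum(train_data_features, axis=0)
# Vocabulary words, in the same order as the columns of train_data_features
# (on scikit-learn 1.0 or later, use get_feature_names_out() instead)
words = np.array(vectorizer.get_feature_names())
words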

In [None]:
common_words = pd.DataFrame(data={"word":words, "frequency":dist})
common_words = common_words[common_words['frequency']>10000]
common_words.index = range(len(common_words.index)) # redefine index of data frame 
common_words = common_words.sort_values(by=['frequency'])
common_words

In [36]:
# Plot
plt.rcParams["figure.figsize"] = [40,20] # set the figure size before creating the axes
pos = np.arange(len(common_words.frequency))
width = 1.0
ax = plt.axes()
ax.set_xticks(pos) # bars are centered on pos by default
ax.set_xticklabels(common_words.word)
ax.set_ylabel('Frequency')
ax.set_title("Most common words")
plot = plt.bar(pos, common_words.frequency, width, color = "lightblue", edgecolor='black')
plt.xticks(rotation=90)
plt.show()

array(['abandoned', 'abc', 'abilities', ..., 'zombie', 'zombies', 'zone'],
      dtype='<U16')