# Applying Machine Learning to Sentiment Analysis

## Chapter 8 from Sebastian Raschka's Python Machine Learning

Sentiment analysis is a subdiscipline of **Natural Language Processing (NLP)**.

We will be classifying documents based on their polarity: the attitude of the writer.

We will be using a dataset that consists of 25,000 positive movie reviews and 25,000 negative movie reviews for a total of 50,000 reviews that will be split into training and testing sets.

The movie reviews come from the **Internet Movie Database (IMDB)**.  

The purpose is to build a model that can accurately predict the sentiment of a movie review on new data.

The dataset was obtained from http://ai.stanford.edu/~amaas/data/sentiment/

*Some parts of this Jupyter Notebook are copied verbatim from Raschka's book.*

In [1]:
import pandas as pd
import numpy as np

# Here I'm using the same dataset from the Stanford site, previously downloaded and randomly shuffled
df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head()

| | review | sentiment |
|---|---|---|
| 0 | My family and I normally do not watch local mo... | 1 |
| 1 | Believe it or not, this was at one time the wo... | 0 |
| 2 | After some internet surfing, I found the "Home... | 0 |
| 3 | One of the most unheralded great works of anim... | 1 |
| 4 | It was the Sixties, and anyone with long hair ... | 0 |


In [2]:
df.shape

(50000, 2)

In [3]:
df.dtypes

review       object
sentiment     int64
dtype: object

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
review       50000 non-null object
sentiment    50000 non-null int64
dtypes: int64(1), object(1)
memory usage: 781.3+ KB


In [5]:
df.isna().sum()

review       0
sentiment    0
dtype: int64

In [6]:
df['sentiment'].value_counts()

1    25000
0    25000
Name: sentiment, dtype: int64

In `df['sentiment']`, 1 denotes a positive review and 0 denotes a negative review.

## Introducing the bag-of-words model

The **bag-of-words** model allows us to represent textual data as numerical vectors.

Specifically,
1. We create a vocabulary of unique tokens (for example, words) from the entire set of documents.
2. We construct a feature vector for each document that counts how often each vocabulary word occurs in that document.
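
The two steps above can be sketched by hand in plain Python, without scikit-learn. This is only an illustration: the helper names `tokenize` and `vectorize` are hypothetical, and splitting on runs of lowercase letters is an assumed simplification of real tokenization.

```python
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, The weather is sweet, and one and one is two',
]


def tokenize(text):
    # Assumed, simplified tokenizer: lowercase, keep runs of letters
    return re.findall(r'[a-z]+', text.lower())


# Step 1: build the vocabulary (unique tokens across all documents),
# sorted alphabetically so every word gets a stable column index
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})


def vectorize(text):
    # Step 2: count how often each vocabulary word occurs in one document
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


vectors = [vectorize(doc) for doc in docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each inner list has one entry per vocabulary word, so all documents end up with vectors of the same length regardless of how long the original text is.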

### Transforming words into feature vectors

We can use the **CountVectorizer** class from scikit-learn to help us out.  **CountVectorizer** takes an array of text data, sentences or entire documents, and turns it into a bag-of-words model:

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, The weather is sweet, and one and one is two'
])

bag = count.fit_transform(docs)

In [9]:
type(bag)

scipy.sparse.csr.csr_matrix

`bag` is now a SciPy sparse matrix whose rows are the sparse feature vectors of the three documents.

We can print the vocabulary that `count` learned with:

In [10]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


This gives us a dictionary of the words and their integer indices.

To print the actual feature vectors:

In [12]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


Each row represents one of the three sentences, and each column holds the count of one vocabulary word.  For example, index 0 in the dictionary above is the word "and", which occurs 0 times in the first two sentences and twice in the third sentence, giving us the 0, 0, and 2 down the first column.
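
To read the matrix column by column, it can help to invert the vocabulary dictionary so that each column index maps back to its word. The sketch below hard-codes the vocabulary and the count matrix printed above instead of calling scikit-learn; the variable names are illustrative only.

```python
# Vocabulary and counts copied from the outputs printed above
vocabulary = {'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8,
              'sweet': 5, 'and': 0, 'one': 2, 'two': 7}
bag_array = [
    [0, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1],
    [2, 3, 2, 1, 1, 1, 2, 1, 1],
]

# Invert word -> index into index -> word
index_to_word = {index: word for word, index in vocabulary.items()}

# Print each column as "word: [count in doc 1, doc 2, doc 3]"
for index in sorted(index_to_word):
    column = [row[index] for row in bag_array]
    print(f'{index_to_word[index]:>8}: {column}')

# Counts for the third document, keyed by word
third_doc_counts = {index_to_word[i]: n for i, n in enumerate(bag_array[2])}
print(third_doc_counts)
```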

**The importance of sklearn's CountVectorizer is that it gives us a foundation for understanding and analyzing text data.  At the end of the day, all we are doing is counting words, which seems deceptively simple.  The hard part is combining those counts with a judgment about whether the more frequent words actually carry meaning for the document as a whole.**

The values in the feature vectors are called **raw term frequencies**: the number of times a term *t* occurs in a document *d*.

### Assessing word relevancy via term frequency-inverse document frequency

When we are analyzing text data, we often encounter words that occur across multiple documents from both classes.  These frequently occurring words typically don't contain useful or discriminatory information.  This is where the **term frequency-inverse document frequency (tf-idf)** technique can come in handy.  The tf-idf is defined as the product of the term frequency *tf(t,d)* and the inverse document frequency *idf(t,d)*:

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t,d)$$

Here *tf(t,d)* is the term frequency that we talked about above with CountVectorizer.  The inverse document frequency **idf(t,d)** can be calculated as

$$\text{idf}(t,d) = \log\frac{n_d}{1 + \text{df}(t,d)}$$

where $n_d$ is the total number of documents and *df(t,d)* is the number of documents that contain the term *t*.

**The logarithm ensures that low document frequencies are not given too much weight.**
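
A quick numeric sketch of this dampening effect, using the idf formula described above. The total of 50,000 documents matches our dataset, but the document frequencies are made-up values chosen only for illustration, and the natural logarithm is an assumption about which base is meant.

```python
import math

n_docs = 50000  # total number of documents, as in the movie review dataset

rows = []
# Hypothetical document frequencies, for illustration only
for df_t in (1, 10, 100, 1000, 10000):
    raw_ratio = n_docs / (1 + df_t)
    idf = math.log(raw_ratio)  # natural logarithm (assumed base)
    rows.append((df_t, raw_ratio, idf))
    print(f'df={df_t:>5}  n_d/(1+df)={raw_ratio:>9.1f}  idf={idf:.2f}')
```

Without the logarithm, the rarest term here would be weighted thousands of times more heavily than the most common one; with it, the weights differ by less than a factor of ten.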

Now we can use sklearn's **TfidfTransformer**, which takes the raw term frequencies from the CountVectorizer class as input and transforms them into tf-idfs:


In [14]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True,
                         norm='l2',
                         smooth_idf=True)

np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


As we saw in the previous subsection, the word "is" had the largest term frequency in the third document, being the most frequently occurring word.  **However, after transforming the same feature vector into tf-idfs, we see that the word "is" is now associated with a relatively small tf-idf (0.45) in document 3, since it is also contained in documents 1 and 2 and thus is unlikely to contain any useful, discriminatory information.**
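
The value 0.45 can be reproduced by hand. Note that scikit-learn's `TfidfTransformer` with `smooth_idf=True` uses a slightly different formula than the textbook definition above: idf(t) = ln((1 + n_d) / (1 + df(t))) + 1, followed by L2 normalization of each row. The sketch below applies that to the raw counts of the third document, hard-coded from the earlier output; the variable names are illustrative.

```python
import math

n_docs = 3
# Raw term frequencies of document 3, in vocabulary order:
# and, is, one, shining, sun, sweet, the, two, weather
tf_doc3 = [2, 3, 2, 1, 1, 1, 2, 1, 1]
# Number of documents (out of 3) that contain each term
doc_freq = [1, 3, 1, 2, 2, 2, 3, 1, 2]

# Smoothed idf as used by scikit-learn when smooth_idf=True
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

# Unnormalized tf-idf, then L2 normalization of the row
raw = [t * i for t, i in zip(tf_doc3, idf)]
l2_norm = math.sqrt(sum(v * v for v in raw))
tfidf_doc3 = [v / l2_norm for v in raw]

print([round(v, 2) for v in tfidf_doc3])
```

Because "is" appears in all three documents, its idf is exactly 1, so its weight comes only from its raw count of 3 and ends up below the weights of "and" and "one", which occur twice each but only in document 3.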

### Cleaning Text data

The first important step in text analysis, before we build our bag-of-words model, is to clean the text data by stripping it of all unwanted characters.

For simplicity, we will remove HTML markup and all punctuation marks except for emoticons such as :) as they contain useful information.  

For this task we will use Python's **regular expression (regex)** library **re**:

In [19]:
import re


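def preprocessor(text):
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Find emoticons such as :) ;-( =D before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    # Lowercase, replace non-word characters with spaces, then append
    # the emoticons with the "nose" character (-) removed
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text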

With the first regex, `<[^>]*>`, in the preceding code, we removed all of the HTML markup from the movie reviews.  We then used a slightly more complex regex to find emoticons, which we temporarily stored as `emoticons`.  Next, we removed all non-word characters from the text via the regex `[\W]+` and converted the text into lowercase characters.

Finally, we appended the temporarily stored *emoticons* to the end of the processed document string.  We also removed the *nose* character (-) from the emoticons for consistency.
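
To make the individual steps easier to follow, the sketch below runs each regex on its own against a small made-up string. The sample text and the intermediate variable names are illustrative only.

```python
import re

sample = '<br />I LOVED it :-) but the ending... =('

# Step 1: strip HTML markup
no_html = re.sub(r'<[^>]*>', '', sample)

# Step 2: collect emoticons (eyes, optional nose, mouth)
emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_html)

# Step 3: lowercase and collapse runs of non-word characters into one space
words_only = re.sub(r'[\W]+', ' ', no_html.lower())

# Step 4: append the emoticons, with the nose character removed
cleaned = words_only + ' '.join(emoticons).replace('-', '')

print(repr(no_html))
print(emoticons)
print(repr(words_only))
print(repr(cleaned))
```

Note that the emoticons have to be collected in step 2, before step 3 runs, because step 3 deletes every punctuation character they are made of.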

Although appending the emoticon characters to the end of the cleaned document strings may not look like the most elegant approach, note that the order of the words doesn't matter in our bag-of-words model as long as our vocabulary consists only of one-word tokens.
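
This order-independence can be checked with a small sketch: counting one-word tokens gives the same result no matter where the emoticon sits, while counting two-word tokens does not. The example strings and the `bigrams` helper are made up for illustration, and whitespace splitting is an assumed simplification of tokenization.

```python
from collections import Counter

# Same tokens, emoticon in its original position vs. moved to the end
original_order = 'this :) is a test'
emoticon_last = 'this is a test :)'


def bigrams(tokens):
    # Hypothetical helper: pairs of adjacent tokens
    return list(zip(tokens, tokens[1:]))


# One-word tokens: position is ignored, so both bags are identical
unigrams_match = (Counter(original_order.split()) ==
                  Counter(emoticon_last.split()))

# Two-word tokens: adjacency matters, so the bags differ
bigrams_match = (Counter(bigrams(original_order.split())) ==
                 Counter(bigrams(emoticon_last.split())))

print(unigrams_match)
print(bigrams_match)
```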

Let's confirm that our preprocessor works correctly:

In [20]:
preprocessor(df.loc[0, 'review'][-50:])

'to star cinema way to go jericho and claudine '

In [21]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'

Let's now apply our **preprocessor** function to all the movie reviews in our DataFrame:

In [22]:
df['review'] = df['review'].apply(preprocessor)

In [23]:
df['review'].head()

0    my family and i normally do not watch local mo...
1    believe it or not this was at one time the wor...
2    after some internet surfing i found the homefr...
3    one of the most unheralded great works of anim...
4    it was the sixties and anyone with long hair a...
Name: review, dtype: object

### Processing documents into tokens

