# Sentiment Analysis
(This chapter is based on Raschka, chapter 8)

In this chapter, we will classify text data. The example we will use is a data set containing movie reviews.

The machine learning methods we use will be from `scikit-learn`. We will make use of `NLTK` for some of the data processing.

To speed up `scikit-learn`'s calculations, we can use `scikit-learn-intelex`, a package developed by Intel. We first need to install it:

    conda install scikit-learn-intelex
    
and then include the following code before any scikit-learn imports.

    from sklearnex import patch_sklearn
    patch_sklearn()


In [1]:
from sklearnex import patch_sklearn
patch_sklearn()

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


## Obtaining the IMDb movie review dataset

The IMDb movie review dataset can be downloaded from [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/).
After downloading the dataset, decompress the files.

The data is also available in the course material/data folder on Moodle.

## Preprocessing the movie dataset

We will read the data to a pandas DataFrame. This may take a few minutes.

`pyprind` is a module that can show a progress bar while the files are read. The corresponding lines are commented out in the code below; to use the progress bar, install `pyprind` via `pip` and uncomment them.

In [2]:
#import pyprind
import pandas as pd
import os

import warnings
warnings.filterwarnings("ignore", message="The parameter 'token_pattern'")

# change the `basepath` to the directory of the
# unzipped movie dataset

basepath = 'data/aclImdb'

labels = {'pos': 1, 'neg': 0}
#pbar = pyprind.ProgBar(50000)
rows = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file),
                      'r', encoding='utf-8') as infile:
                txt = infile.read()
            # DataFrame.append was removed in pandas 2.0, so we
            # collect the rows and build the DataFrame once at the end
            rows.append([txt, labels[l]])
            #pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

We create a random permutation of the rows in the DataFrame so that positive and negative reviews occur in random order.

In [3]:
import numpy as np

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index)).reset_index(drop=True)
df.head()

| | review | sentiment |
|---|---|---|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |
| 3 | hi for all the people who have seen this wonde... | 1 |
| 4 | I recently bought the DVD, forgetting just how... | 0 |


There are 50,000 reviews in the data set.

In [4]:
df.shape

(50000, 2)

You may save the resulting data in a CSV file, though this isn't strictly necessary.

In [5]:
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

## The bag-of-words model

The bag-of-words model is one of the simplest ways of quantitatively representing text. The process consists of two steps:

* Create a vocabulary consisting of all words contained in any of the text documents.
* Construct a feature vector for each document where each feature corresponds to the number of times a particular word appears in the document.

The number of distinct words can be quite large, but the data are __sparse__, i.e., only a small minority of the words in the vocabulary appear in each document. This allows us to store the features in data structures especially suited for sparse data.
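To make the idea of sparse storage concrete, here is a minimal sketch in plain Python. The 12-word vocabulary and the counts are hypothetical, and the dictionary layout is only an illustration of the principle, not the format `scikit-learn` uses internally.

```python
# A count vector over a hypothetical 12-word vocabulary: mostly zeros.
dense_row = [0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 3, 0]

# Sparse form: store only {column index: count} for the non-zero entries.
sparse_row = {col: count for col, count in enumerate(dense_row) if count != 0}


def to_dense(sparse, n_features):
    """Rebuild the full vector from the sparse dictionary."""
    return [sparse.get(col, 0) for col in range(n_features)]


# Fraction of entries that actually have to be stored.
density = len(sparse_row) / len(dense_row)

print(sparse_row)
print(density)
```

Only three of the twelve entries are stored, and the dense vector can still be recovered with `to_dense` when needed.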

## Transforming documents into feature vectors

We can use the class `CountVectorizer` of `scikit-learn` to create a bag-of-words model. To illustrate what this class does, we take a very small corpus containing only three sentences.

1. The sun is shining
2. The weather is sweet
3. The sun is shining, the weather is sweet, and one and one is two


The `fit_transform` method constructs the vocabulary of the bag-of-words model and transforms the documents into sparse feature vectors.

In [6]:
import numpy as np
np.set_printoptions(precision=2)
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

`bag` is a sparse matrix.

In [7]:
bag

<3x9 sparse matrix of type '<class 'numpy.int64'>'
	with 17 stored elements in Compressed Sparse Row format>

In [8]:
print(bag)

  (0, 6)	1
  (0, 4)	1
  (0, 1)	1
  (0, 3)	1
  (1, 6)	1
  (1, 1)	1
  (1, 8)	1
  (1, 5)	1
  (2, 6)	2
  (2, 4)	1
  (2, 1)	3
  (2, 3)	1
  (2, 8)	1
  (2, 5)	1
  (2, 0)	2
  (2, 2)	2
  (2, 7)	1


The values in the feature vectors are also called the __term frequencies__: `tf(t,d)` is the number of times the term `t` occurs in document `d`.

The `toarray` method of the sparse matrix returns the feature vectors as a dense array. Each column index corresponds to one word in the vocabulary.

In [9]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


The `vocabulary_` attribute of the `CountVectorizer` is a dictionary that maps each unique word in the corpus to its integer index, i.e., its column in the feature vectors (not to its frequency).

In [10]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}
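The two steps of the bag-of-words model can also be sketched without `scikit-learn`. The helper names `tokenize` and `count_vector` are made up for this illustration. The tokenization rule (lowercase, tokens of at least two word characters) and the alphabetical indexing are assumptions chosen to mimic the default behaviour of `CountVectorizer` on this small corpus.

```python
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]


def tokenize(doc):
    # Assumption: lowercase the text and keep tokens of two or more
    # word characters.
    return re.findall(r'\b\w\w+\b', doc.lower())


# Step 1: the vocabulary maps every distinct word to a column index
# (words sorted alphabetically).
words = sorted({word for doc in docs for word in tokenize(doc)})
vocabulary = {word: index for index, word in enumerate(words)}


# Step 2: one vector of term counts per document.
def count_vector(doc):
    counts = Counter(tokenize(doc))  # missing words count as 0
    return [counts[word] for word in words]


bag_rows = [count_vector(doc) for doc in docs]

print(vocabulary)
for row in bag_rows:
    print(row)
```

For this corpus, the sketch yields the same indices and counts as the `CountVectorizer` output shown above.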


## Term frequency-inverse document frequency

Any corpus contains words that occur frequently and in most documents, whereas others do so only rarely and in few documents. The most frequent words often don't contain much useful information when it comes to the classification task because they appear in documents of both classes. We will therefore reduce the weight of such frequent words.

The technique we use is called __term frequency-inverse document frequency__ `tf-idf`. It is defined as the product of the term frequency and the inverse document frequency:

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,d)$$

The inverse document frequency $\text{idf}(t,d)$ is calculated as:

$$\text{idf}(t,d) = \text{log}\frac{n_d}{1+\text{df}(d, t)},$$

where $n_d$ is the number of documents, and $\text{df}(d, t)$ is the number of documents $d$ that contain the term $t$.

The constant $1$ in the denominator ensures that even words appearing in all documents obtain a non-zero value. The logarithm is used so that rarely occurring terms don't get excessively high weights.
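As a small worked example, the formula above can be evaluated for three words of the three-sentence corpus. This is a sketch that assumes the natural logarithm. The document frequencies are read off the corpus: 'is' occurs in all three sentences, 'sun' in two, and 'and' in one.

```python
import math

n_docs = 3
doc_freq = {'is': 3, 'sun': 2, 'and': 1}


def idf(df_t, n_d):
    # idf(t, d) = log(n_d / (1 + df(d, t))), natural logarithm assumed
    return math.log(n_d / (1 + df_t))


idf_values = {term: idf(df_t, n_docs) for term, df_t in doc_freq.items()}

for term, value in idf_values.items():
    print(term, round(value, 3))
```

The rarest word, 'and', receives the largest value, and 'is', which occurs in every document, receives the smallest one.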

We can apply `tf-idf` by using the `TfidfTransformer`. It takes term frequencies and transforms them into tf-idfs.

In [11]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, 
                         norm='l2', 
                         smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


We can see that, while the word 'is' had the largest term frequency in the 3rd document, it has a relatively low weight of 0.45 after the transformation. This is because it also appears in the other documents and therefore is unlikely to be useful in our classification task.

While the usual definition of `tf-idf` is as described above, the `TfidfTransformer` applies a slightly different formula. The equations applied in scikit-learn are the following.

$$\text{idf}(t,d) = \text{log}\frac{1 + n_d}{1 + \text{df}(d, t)}$$

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d)+1)$$

The `TfidfTransformer` additionally normalizes the data by dividing the feature vector of each document by its length, which is, by default, measured by the `L2 norm`.

$$v_{\text{norm}} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_{1}^{2} + v_{2}^{2} + \dots + v_{n}^{2}}} = \frac{v}{\big (\sum_{i=1}^{n} v_{i}^{2}\big)^\frac{1}{2}}$$
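The three equations above can be checked by hand on the small corpus. The following sketch applies them in plain Python to the count vectors printed earlier, assuming the natural logarithm. The helper name `tfidf_row` is made up for this illustration.

```python
import math

# Term counts of the three example sentences (rows = documents).
counts = [
    [0, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1],
    [2, 3, 2, 1, 1, 1, 2, 1, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents containing each term.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf as in the first equation above.
idf = [math.log((1 + n_docs) / (1 + df_j)) for df_j in df]


def tfidf_row(row):
    # tf-idf = tf * (idf + 1), followed by L2 normalization.
    raw = [tf * (idf_j + 1) for tf, idf_j in zip(row, idf)]
    length = math.sqrt(sum(v * v for v in raw))
    return [v / length for v in raw]


tfidf_rows = [tfidf_row(row) for row in counts]
for row in tfidf_rows:
    print([round(v, 2) for v in row])
```

Rounded to two decimals, the rows agree with the `TfidfTransformer` output shown above.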

We will now apply this method to the movie review data set.

## Cleaning text data

The review files contain some HTML code and other unneeded characters.

In [12]:
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

We use regular expressions to clean the data. We retain emoticons because they could be useful when classifying sentiment.

In [13]:
import re
def preprocessor(text):
    # raw strings avoid invalid-escape warnings in newer Python versions
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

In [14]:
preprocessor(df.loc[0, 'review'][-50:])

'is seven title brazil not available'

We are able to retain the emoticons.

In [15]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'
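To see what each regular expression contributes, the same steps can be run one at a time on the test string. This is only a decomposition of the `preprocessor` function into intermediate variables, whose names are chosen for this illustration.

```python
import re

text = "</a>This :) is :( a test :-)!"

# Step 1: remove HTML tags.
no_tags = re.sub(r'<[^>]*>', '', text)

# Step 2: collect the emoticons before punctuation is stripped.
emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_tags)

# Step 3: lowercase the text and replace each run of non-word
# characters by a single space.
words_only = re.sub(r'[\W]+', ' ', no_tags.lower())

# Step 4: append the emoticons, dropping the "nose" character '-'.
cleaned = words_only + ' '.join(emoticons).replace('-', '')

print(repr(no_tags))
print(emoticons)
print(repr(words_only))
print(repr(cleaned))
```

Step 3 alone would discard the emoticons, which is why they are extracted first and appended at the end.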

In [16]:
df['review'] = df['review'].apply(preprocessor)

## Processing documents into tokens
We will deal with stop words when running our algorithm. We import them here.


In [17]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords', quiet=True)  # fetch the corpus on first use
stop = stopwords.words('english')

We next tokenize the data and apply a stemmer from NLTK to obtain the word stems. We don't stem the stop words.

In [18]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()


def tokenizer_porter(text):
    return [word if word in stop else porter.stem(word) for word in text.split()]

## Training a logistic regression model

We will use only a subset of the data here to limit the run time of the algorithm. To use the whole data, we would run the following.

```
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values
```

In [19]:
# .loc slices include the end label, so we stop at 2499
# to keep row 2500 out of the training set
X_train = df.loc[:2499, 'review'].values
y_train = df.loc[:2499, 'sentiment'].values
X_test = df.loc[2500:, 'review'].values
y_test = df.loc[2500:, 'sentiment'].values

We will run the following __penalized logistic regression__ model. We will discuss such models in more detail later in the course.

`TfidfVectorizer` combines `CountVectorizer` and `TfidfTransformer`.

A `Pipeline` combines multiple steps of a machine learning task, here the `TfidfVectorizer` and the `LogisticRegression`.

The `param_grid` contains various parameters that are used to prepare the data using the `TfidfVectorizer` and to train the model using the `LogisticRegression`. Such parameters are called __hyperparameters__. The goal is to find the hyperparameters that yield the best results.

The `GridSearchCV` object uses __cross-validation__, also to be discussed later in the course, to train the model and determine the best hyperparameters.

In [20]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0,
                                                solver='liblinear'))])
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

You can delete parameters from the grid above to reduce the number of models to fit -- for example, by using the following:

    param_grid = [{'vect__ngram_range': [(1, 1)],
                   'vect__stop_words': [stop, None],
                   'vect__tokenizer': [tokenizer],
                   'clf__penalty': ['l1', 'l2'],
                   'clf__C': [1.0, 10.0]},
                  ]

In [21]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 tasks      | elapsed:   11.1s
[Parallel(n_jobs=-1)]: Done 180 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed:  1.6min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=False,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        n
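The numbers in the log line "Fitting 5 folds for each of 48 candidates, totalling 240 fits" follow directly from the grid. The sketch below recomputes them. The strings `'stop'`, `'tokenizer'` and `'tokenizer_porter'` are placeholders for the actual objects, since only the number of options per parameter matters here.

```python
from math import prod

# Same structure as param_grid above, with placeholder values.
param_grid = [
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0]},
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'vect__use_idf': [False],
     'vect__norm': [None],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0]},
]
n_folds = 5

# Each dictionary contributes the product of its option counts.
candidates_per_grid = [prod(len(values) for values in grid.values())
                       for grid in param_grid]
n_candidates = sum(candidates_per_grid)
n_fits = n_candidates * n_folds

print(candidates_per_grid, n_candidates, n_fits)
```

Removing options from any list shrinks the product accordingly, which is why trimming the grid reduces the run time so effectively.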

In [22]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each'

In [23]:
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.856


`gs_lr_tfidf.best_score_` is the average k-fold cross-validation score of the best model. With 5-fold cross-validation, as in the `GridSearchCV` object above, it is the mean accuracy over the five folds for the best hyperparameter combination.
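How such an average comes about can be sketched in plain Python. The labels and the majority-class "model" below are hypothetical stand-ins, and the folds are simple contiguous blocks. `GridSearchCV` chooses its folds differently, so this illustrates only the averaging, not its exact procedure.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0)
             for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def majority_label(labels):
    """Hypothetical stand-in for training: predict the most common label."""
    return max(set(labels), key=labels.count)


y = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]  # hypothetical class labels
folds = k_fold_indices(len(y), k=5)

fold_scores = []
for test_idx in folds:
    # Train on all samples outside the current fold ...
    train_labels = [y[i] for i in range(len(y)) if i not in test_idx]
    prediction = majority_label(train_labels)
    # ... and measure accuracy on the held-out fold.
    accuracy = sum(y[i] == prediction for i in test_idx) / len(test_idx)
    fold_scores.append(accuracy)

# The cross-validation score is the mean of the per-fold accuracies.
mean_cv_score = sum(fold_scores) / len(fold_scores)

print(fold_scores)
print(mean_cv_score)
```

Each sample is used for testing exactly once, and the reported score is the mean of the five fold accuracies.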