### Text Analytics Workshop - Research & Academic Collaboration Program, University of Kelaniya
#### 5th July 2018

###### Ruvan Weerasinghe 


### The libraries/packages we need

- Pandas is for reading data into Data Frames and working with them
- NumPy is a very useful math library
- normalization and utils are two helper modules we have defined

In [9]:
import pandas as pd
import numpy as np
from normalization import normalize_corpus
from utils import build_feature_matrix

- Make sure that Pandas is installed (it comes with Anaconda; otherwise use 'pip' to install it)
- Make sure that our two Python scripts normalization.py and utils.py are on the Python path
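A quick way to confirm the setup before running the notebook is to ask Python whether each module can be found. This is only a sketch; the helper name `missing_modules` is made up for illustration and is not part of the workshop code.

```python
import importlib.util

def missing_modules(names):
    """Return the names that Python cannot locate on its current path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# The four modules imported at the top of this notebook
required = ["pandas", "numpy", "normalization", "utils"]
print("Missing:", missing_modules(required))
```

An empty list means all four imports should succeed. If normalization or utils is listed, start the notebook from the folder that contains the two scripts.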

### Load the data file as csv using Pandas

This is done using the read_csv() function

In [10]:
# Load the cleaned movie reviews dataset
dataset = pd.read_csv(r'movie_reviews.csv')
# Check how big the dataset frame is using len() function
# Print the first few data points - note that data consists of 2 columns named 'review' and 'sentiment'
print(dataset.head())

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


The output should show the first 5 reviews together with their sentiment labels
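The size and first-rows checks mentioned in the code comments can be tried on a tiny in-memory CSV first. The three rows below are invented for illustration; only the column names 'review' and 'sentiment' come from the real file.

```python
import io
import pandas as pd

# A made-up stand-in for movie_reviews.csv with the same two columns
csv_text = """review,sentiment
"Loved it, great acting",positive
"Dull and far too long",negative
"A wonderful little film",positive
"""

toy = pd.read_csv(io.StringIO(csv_text))

print(len(toy))                          # number of rows
print(toy.shape)                         # (rows, columns)
print(toy.head(2))                       # first two rows only
print(toy['sentiment'].value_counts())   # how many reviews per label
```

The same four calls work unchanged on the real `dataset` frame.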

### Preparing the training and testing sets

- We first divide the dataset into training and test sets
- Then we use 4 arrays to store the 'review' and 'sentiment' parts of each set separately

In [11]:
# Divide data into training and testing sets
train_data = dataset[:25000]
test_data = dataset[25000:] 
# Check size (len) and first few elements (head()) of test_data (sub)frame


# Divide the data into the data (review) and the label (sentiment) in both training and testing sets
train_reviews = np.array(train_data['review'])
train_sentiments = np.array(train_data['sentiment'])
test_reviews = np.array(test_data['review'])
test_sentiments = np.array(test_data['sentiment'])


Try examining the length of each array and some elements within it to see if it is what you expected
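On a toy frame, the same positional split and the suggested checks might look like this. The data and the split point of 4 are invented; the notebook itself splits at row 25000. Slicing by position does not shuffle, so it assumes the rows are not sorted by sentiment.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'review': ['r0', 'r1', 'r2', 'r3', 'r4', 'r5'],
    'sentiment': ['positive', 'negative', 'positive',
                  'negative', 'positive', 'negative'],
})

split = 4                    # hypothetical split point for the toy data
toy_train = toy[:split]      # rows 0 to 3
toy_test = toy[split:]       # rows 4 to 5

toy_train_reviews = np.array(toy_train['review'])
toy_test_sentiments = np.array(toy_test['sentiment'])

print(len(toy_train), len(toy_test))          # sizes of the two sub-frames
print(toy_test.head())                        # first rows of the test frame
print(toy_train_reviews[:2], toy_test_sentiments)
```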

### We clean/normalize/wrangle the input reviews the way we want

- Here we simply say we don't need to lemmatize the words (default is to lemmatize)
- And that we are only interested in text characters (so we lose terms such as '007')

In [12]:
# Normalize the training review data using the normalization.py module
norm_train_reviews = normalize_corpus(train_reviews,
                                      lemmatize=False,
                                      only_text_chars=True)

This process may take a few minutes - see the code in normalization.py to understand why
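To get a feel for what this kind of cleaning does, here is a much simplified stand-in written with the standard library only. It is not the code in normalization.py; the function name `simple_normalize` and its three steps (strip HTML tags, lowercase, keep letters only) are assumptions made for illustration.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")          # HTML remnants such as <br />
NON_ALPHA_RE = re.compile(r"[^a-z\s]")   # anything that is not a letter or space

def simple_normalize(text):
    """Strip tags, lowercase, and keep only alphabetic characters."""
    text = TAG_RE.sub(" ", text)
    text = text.lower()
    text = NON_ALPHA_RE.sub(" ", text)
    return " ".join(text.split())        # collapse repeated whitespace

print(simple_normalize('A wonderful little production. <br /><br />The 007 film!'))
# a wonderful little production the film
```

Note how '007' disappears, which is the effect described for only_text_chars above.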

### We can now extract the features that we are interested in

- We extract TF-IDF weights instead of simple counts (frequencies)
- We also stick to unigrams (i.e. individual words) and not bigrams or trigrams
- We want to consider all words - even those that occur only once (possibly misspelt)

In [13]:
# Extract features from these normalized training reviews
# - which features? Try other features using parameters provided in utils.py                                                                           
vectorizer, train_features = build_feature_matrix(documents=norm_train_reviews,
                                                  feature_type='tfidf',
                                                  ngram_range=(1, 1), 
                                                  min_df=0.0, max_df=1.0)                                      

What is vectorizer and what is train_features?
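One way to explore that question is to run scikit-learn's TfidfVectorizer on a toy corpus. Whether build_feature_matrix in utils.py wraps this exact class is an assumption here; check utils.py to confirm. The three documents are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good movie", "bad movie", "good good film"]

# Unigrams only, as in the call above
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 1))
toy_features = toy_vectorizer.fit_transform(docs)

print(sorted(toy_vectorizer.vocabulary_))   # the words the vectorizer learned
print(toy_features.shape)                   # (number of documents, number of words)
print(toy_features.toarray().round(2))      # one row of weights per document
```

In this sketch the vectorizer object holds the learned vocabulary and weights, while the feature matrix holds one row of numbers per document.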

### Train an SVM model using the training data

We use scikit-learn's SGDClassifier class for this; with loss='hinge' it trains a linear SVM using stochastic gradient descent.
NB: scikit-learn has many other Machine Learning algorithms you can try

In [14]:
from sklearn.linear_model import SGDClassifier

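# Build/train an SVM classifier model with the train features extracted from reviews
# NB: n_iter works on the older scikit-learn release used here;
#     newer releases use max_iter instead
svm = SGDClassifier(loss='hinge', n_iter=500)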
svm.fit(train_features, train_sentiments) # We give the features and the correct labels

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=500, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)
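If the call fails on a recent scikit-learn release, the same model can be sketched as follows. The reviews and labels are invented toy data, and random_state is fixed only to make the run repeatable.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

toy_reviews = ["good great fun", "great acting good plot",
               "bad boring dull", "dull plot bad acting"]
toy_labels = np.array(['positive', 'positive', 'negative', 'negative'])

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_reviews)

# Hinge loss gives a linear SVM; max_iter caps the passes over the data
toy_svm = SGDClassifier(loss='hinge', max_iter=500, random_state=0)
toy_svm.fit(toy_X, toy_labels)

toy_pred = toy_svm.predict(toy_vectorizer.transform(["good fun", "boring and dull"]))
print(toy_svm.classes_)   # the labels the model learned
print(toy_pred)           # one predicted label per new review
```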

### Test how good our model is

To test how good our model is, we also need to transform the test set the same way as the training set, reusing the vectorizer that was fitted on the training reviews

In [15]:
# Normalize the test reviews                        
norm_test_reviews = normalize_corpus(test_reviews,
                                     lemmatize=False,
                                     only_text_chars=True)  
# Extract features from the normalized test reviews                                   
test_features = vectorizer.transform(norm_test_reviews)         
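Why transform() and not a fresh fit? The sketch below, on invented toy documents and using scikit-learn's TfidfVectorizer directly (an assumption about what utils.py builds), shows that transform() reuses the training vocabulary, so words never seen in training are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["good movie", "bad movie"]
test_docs = ["good film", "terrible plot"]

toy_vectorizer = TfidfVectorizer()
toy_train = toy_vectorizer.fit_transform(train_docs)   # learns vocabulary and weights
toy_test = toy_vectorizer.transform(test_docs)         # reuses them unchanged

print(toy_train.shape, toy_test.shape)   # same number of columns in both
print(toy_test.toarray())                # second row is all zeros: no known words
```

Keeping the columns identical is what allows the trained model to score the test features.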

### We finally test the output of the trained model on the test data set

- We first send our vectorized test reviews to the model to get the predictions
- Then we use 3 functions we have defined in our utils.py module to output the performance metrics

In [16]:
# Predict the sentiment for test dataset movie reviews
predicted_sentiments = svm.predict(test_features)       

# Evaluate model prediction performance by comparing predicted sentiments and test sentiments
from utils import display_evaluation_metrics, display_confusion_matrix, display_classification_report

# Show performance metrics
display_evaluation_metrics(true_labels=test_sentiments,
                           predicted_labels=predicted_sentiments,
                           positive_class='positive')  

# Show confusion matrix
display_confusion_matrix(true_labels=test_sentiments,
                         predicted_labels=predicted_sentiments,
                         classes=['positive', 'negative'])

# Show detailed per-class classification report
display_classification_report(true_labels=test_sentiments,
                              predicted_labels=predicted_sentiments,
                              classes=['positive', 'negative'])

Accuracy: 0.89
Precision: 0.88
Recall: 0.91
F1 Score: 0.9
                 Predicted:         
                   positive negative
Actual: positive       3712      359
        negative        501     3543
             precision    recall  f1-score   support

   positive       0.88      0.91      0.90      4071
   negative       0.91      0.88      0.89      4044

avg / total       0.89      0.89      0.89      8115
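The four summary figures can be recomputed by hand from the confusion matrix, taking 'positive' as the positive class. This is plain arithmetic with the standard definitions and does not reproduce the code in utils.py.

```python
# Counts read off the confusion matrix above
tp, fn = 3712, 359    # actual positive: predicted positive / predicted negative
fp, tn = 501, 3543    # actual negative: predicted positive / predicted negative

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)       # of the reviews predicted positive, how many were
recall = tp / (tp + fn)          # of the truly positive reviews, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.89 0.88 0.91 0.9
```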



There are many things you can try:
(a) change the way you 'clean' the data (e.g. remove terms that occur less than a minimum number of times?)
(b) change the kind of features you extract (e.g. counts instead of TF-IDF weights? bigrams and trigrams?)
(c) change the learning algorithm from SVM to another supervised algorithm (e.g. Logistic Regression, Naive Bayes, Decision Tree?)
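As a starting point for (b) and (c), the sketch below swaps in raw counts over unigrams and bigrams and a Naive Bayes classifier. It uses scikit-learn classes directly rather than the helpers in utils.py, and the toy reviews are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

toy_reviews = ["good movie", "bad movie", "good fun", "bad plot"]
toy_labels = ['positive', 'negative', 'positive', 'negative']

# (b) counts instead of TF-IDF weights, unigrams plus bigrams
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
toy_X = count_vectorizer.fit_transform(toy_reviews)

# (c) Naive Bayes instead of an SVM
nb = MultinomialNB()
nb.fit(toy_X, toy_labels)

print(sorted(count_vectorizer.vocabulary_))   # note the two-word features
print(nb.predict(count_vectorizer.transform(["good plot"])))
```

For (a), CountVectorizer and TfidfVectorizer both accept a min_df argument that drops terms appearing in too few documents.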