# Classifying IMDB movie reviews

In [1]:
import turicreate as tc
import utils

In [2]:
movies = tc.SFrame('./IMDB_Dataset.csv')
movies

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


review,sentiment
One of the other reviewers has mentioned ...,positive
A wonderful little production. <br /><br ...,positive
I thought this was a wonderful way to spend ...,positive
Basically there's a family where a little ...,negative
"Petter Mattei's ""Love in the Time of Money"" is a ...",positive
"Probably my all-time favorite movie, a story ...",positive
I sure would like to see a resurrection of a up ...,positive
"This show was an amazing, fresh & innovative idea ...",negative
Encouraged by the positive comments about ...,negative
If you like original gut wrenching laughter you ...,positive


### 2a) The dataset has 50,000 rows and 2 columns. Each row is one item, so the dataset contains 50,000 items. Each row holds the text of one review and the sentiment of that review (either positive or negative). The review text is in the column named 'review', and the sentiment is in the column named 'sentiment'.

### 2b) The sentiment of review is labelled as either "positive" or "negative". 

In [3]:
movies['words'] = tc.text_analytics.count_words(movies['review'])
movies

review,sentiment,words
One of the other reviewers has mentioned ...,positive,"{'darker': 1.0, 'touch': 1.0, 'thats': 1.0, ..."
A wonderful little production. <br /><br ...,positive,"{'done': 1.0, 'surface': 1.0, 'every': 1.0, ..."
I thought this was a wonderful way to spend ...,positive,"{'go': 1.0, 'superman': 1.0, 'interesting': 1.0, ..."
Basically there's a family where a little ...,negative,"{'them': 1.0, 'ignore': 1.0, 'dialogs': 1.0, ..."
"Petter Mattei's ""Love in the Time of Money"" is a ...",positive,"{'work': 1.0, 'for': 1.0, 'anxiously': 1.0, ..."
"Probably my all-time favorite movie, a story ...",positive,"{'for': 1.0, 'dozen': 1.0, 'i': 1.0, 'if': ..."
I sure would like to see a resurrection of a up ...,positive,"{'do': 1.0, 'go': 1.0, 's': 1.0, 'a': 6.0, ..."
"This show was an amazing, fresh & innovative idea ...",negative,"{'awful': 1.0, 'just': 1.0, 'huge': 1.0, 'a': ..."
Encouraged by the positive comments about ...,negative,"{'effort': 1.0, 'an': 1.0, 'making': 1.0, ..."
If you like original gut wrenching laughter you ...,positive,"{'camp': 1.0, 'great': 1.0, 'movie': 2.0, ..."


### 2c) tc.text_analytics.count_words() is a built-in Turi Create function from the text_analytics package. It turns the text of each review, which is stored as a string, into a dictionary of word counts: each distinct word is a key of the dictionary, and the number of times that word occurs in the review is its value.
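
To make the dictionary structure concrete, here is a minimal pure-Python sketch of the same idea. It is not Turi Create's implementation: the function name `count_words_sketch` and the regular-expression tokenizer are assumptions made for illustration, and the library's own tokenization rules may differ.

```python
import re
from collections import Counter

def count_words_sketch(text):
    """Return a {word: count} dict for one review (simplified tokenizer)."""
    # Assumption: lowercase the text and treat runs of letters, digits,
    # and apostrophes as words. Turi Create may split text differently.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Store counts as floats to mirror the 1.0, 2.0, ... values in the output above.
    return {word: float(n) for word, n in Counter(tokens).items()}

example = count_words_sketch("A great movie. A really great movie!")
print(example)
```

Each key is a word and each value is how often it occurs, which is the shape of the entries in the words column.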

In [4]:
model = tc.logistic_classifier.create(movies, features=['words'], target='sentiment')

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



### 2d) We created a words column that stores the word counts of each review as a dictionary. The model treats each distinct word as a separate feature, and the number of times a word appears in a review is the value of that feature. The model learns one weight per word and predicts whether a review is positive or negative from the weighted word counts: words with positive weights push the prediction towards positive, and words with negative weights push it towards negative.
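
As a rough illustration of how word counts become a prediction, the sketch below applies the standard logistic-regression formula by hand. The intercept and the three word weights are copied from the coefficient tables later in this notebook; the function name `positive_probability` and the two toy reviews are assumptions, and every other word is treated as having weight 0.

```python
import math

# Values copied from the coefficient tables later in this notebook.
INTERCEPT = 0.1132142138312067
WEIGHTS = {
    "wonderful": 0.9969894541128704,
    "horrible": -0.999877075112364,
    "the": 0.0004862836915785,
}

def positive_probability(word_counts, weights=WEIGHTS, intercept=INTERCEPT):
    """Sigmoid of (intercept + sum of weight * count) over the words in a review."""
    # Assumption for this sketch: words without a listed weight contribute nothing.
    score = intercept + sum(weights.get(w, 0.0) * c for w, c in word_counts.items())
    return 1.0 / (1.0 + math.exp(-score))

# Two hypothetical reviews, already converted to word counts.
p_good = positive_probability({"the": 3.0, "wonderful": 2.0})
p_bad = positive_probability({"the": 3.0, "horrible": 2.0})
print(round(p_good, 3), round(p_bad, 3))
```

The first toy review scores well above 0.5 and the second well below it, even though both contain 'the' three times.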

### 2e) From the table above, we can see that the iterations run from 0 to 9, so the model iterates 10 times; that is, 10 epochs of training are done. This matches the 10 solver iterations reported in the model summary.

### 2f) 5% of the data is used as the validation set. From the table above, the training accuracy for iterations 0, 1, 2, 3, 4, and 9 is 0.917600, 0.936274, 0.966211, 0.975263, 0.983011, and 0.999600 respectively. The validation accuracy for the same iterations is 0.856800, 0.840000, 0.877600, 0.884800, 0.896800, and 0.882400 respectively. The model ended with a training accuracy of 0.999600 and a validation accuracy of 0.882400, so we can expect it to perform with about 88.24% accuracy on new data. The training accuracy is higher, which is expected: the model has already seen the training data, whereas the validation set is held out to check how well the model predicts on data it has not seen before. Therefore, the training accuracy is expected to be higher than the validation accuracy.
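
The gap between the two accuracies can be checked with a few lines of arithmetic. This sketch uses only the accuracies quoted above; iterations 5 to 8 are not quoted, so they are left out, and the variable names are illustrative.

```python
# Accuracies quoted above for the reported iterations only.
iterations = [0, 1, 2, 3, 4, 9]
train_acc = [0.917600, 0.936274, 0.966211, 0.975263, 0.983011, 0.999600]
valid_acc = [0.856800, 0.840000, 0.877600, 0.884800, 0.896800, 0.882400]

# Difference between training and validation accuracy at each reported iteration.
gaps = [round(t - v, 6) for t, v in zip(train_acc, valid_acc)]
for it, gap in zip(iterations, gaps):
    print(f"iteration {it}: training - validation = {gap:.4f}")

# Reported iteration with the highest validation accuracy.
best_valid_iteration = iterations[valid_acc.index(max(valid_acc))]
print("best validation accuracy at iteration", best_valid_iteration)
```

Among the reported iterations, the validation accuracy is highest at iteration 4 and the gap is widest at iteration 9.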

In [5]:
model

Class                          : LogisticClassifier

Schema
------
Number of coefficients         : 101661
Number of examples             : 47500
Number of classes              : 2
Number of feature columns      : 1
Number of unpacked features    : 101660

Hyperparameters
---------------
L1 penalty                     : 0.0
L2 penalty                     : 0.01

Training Summary
----------------
Solver                         : lbfgs
Solver iterations              : 10
Solver status                  : Completed (Iteration limit reached).
Training time (sec)            : 1.1867

Settings
--------
Log-likelihood                 : 472.2715

Highest Positive Coefficients
-----------------------------
words[sendak]                  : 44.8536
words[gic]                     : 44.8536
words[widdecombe]              : 32.0042
words[distrusting]             : 32.0042
words[aleination]              : 28.9013

Lowest Negative Coefficients
----------------------------
words[drosselmeier]           

### 2g) There are 101,661 coefficients in the model (101,660 unpacked word features plus the intercept).

### 2h) The weight of the L1 penalty for regularization is 0 and the weight of the L2 penalty for regularization is 0.01.

In [6]:
weights = model.coefficients
weights

name,index,class,value,stderr
(intercept),,positive,0.1132142138312067,
words,darker,positive,0.4800430927092171,
words,touch,positive,0.2064475389636701,
words,thats,positive,-0.3977478976394822,
words,your,positive,-0.0205702171779583,
words,viewing,positive,0.1254112153646217,
words,their,positive,0.0167929423141323,
words,into,positive,-0.0042486812850553,
words,turned,positive,-0.2446291886740998,
words,being,positive,-0.0179409158715086,


In [7]:
weights.sort('value')

name,index,class,value,stderr
words,drosselmeier,positive,-70.90827141379647,
words,eta,positive,-35.91368312140312,
words,nanites,positive,-24.83955701603244,
words,choreographic,positive,-21.00350980517402,
words,newlwed,positive,-13.42574425543114,
words,sierre,positive,-13.37943303201686,
words,poolguy,positive,-12.984192095535354,
words,bsed,positive,-12.930967704749738,
words,nonsenseful,positive,-12.892688768200786,
words,unfortuntately,positive,-12.863293183865933,


In [8]:
weights.sort('value', ascending=False)

name,index,class,value,stderr
words,sendak,positive,44.8535807070221,
words,gic,positive,44.8535807070221,
words,distrusting,positive,32.00415207623371,
words,widdecombe,positive,32.00415207623371,
words,insititue,positive,28.90125182206651,
words,artisty,positive,28.90125182206651,
words,embroider,positive,28.90125182206651,
words,aleination,positive,28.90125182206651,
words,blain,positive,25.14985973615453,
words,kaleidoscopic,positive,25.14985973615453,


In [28]:
print("2i) The top 10 words that contributed to positive reviews are", weights.sort('value', ascending=False)["index"][:10])

2i) The top 10 words that contributed to positive reviews are ['sendak', 'gic', 'distrusting', 'widdecombe', 'insititue', 'artisty', 'embroider', 'aleination', 'blain', 'kaleidoscopic']


### 2i) The top 10 words that contributed to positive reviews are 'sendak', 'gic', 'distrusting', 'widdecombe', 'insititue', 'artisty', 'embroider', 'aleination', 'blain', 'kaleidoscopic'.

In [30]:
print("2j) The top 10 words that contributed to negative reviews are", weights.sort('value')["index"][:10])

2j) The top 10 words that contributed to negative reviews are ['drosselmeier', 'eta', 'nanites', 'choreographic', 'newlwed', 'sierre', 'poolguy', 'bsed', 'nonsenseful', 'unfortuntately']


### 2j) The top 10 words that contributed to negative reviews are 'drosselmeier', 'eta', 'nanites', 'choreographic', 'newlwed', 'sierre', 'poolguy', 'bsed', 'nonsenseful', 'unfortuntately'.

In [9]:
weights[weights['index']=='wonderful']

name,index,class,value,stderr
words,wonderful,positive,0.9969894541128704,


In [10]:
weights[weights['index']=='horrible']

name,index,class,value,stderr
words,horrible,positive,-0.999877075112364,


In [48]:
weights[weights['index']=='the']

name,index,class,value,stderr
words,the,positive,0.0004862836915785,


### 2k) The weight given to the word 'the' is 0.0004862836915785. No, this is not a sensible value: the word 'the' carries no positive or negative meaning and occurs in abundance in the reviews, so it should not contribute to the prediction at all, yet the model still assigns it a small positive weight.

### 2l) To improve this model, I would remove stop words such as 'the', 'a', 'is', 'if' and so on. The model has 101,661 coefficients, and removing these words reduces that number. Stop words occur in abundance but carry little or no information for sentiment analysis; they contribute no meaning to either positive or negative sentiment. Even though these words have very small weights, they still slightly affect the output of the model, so removing as many of them as possible lets the model focus on the features that matter for prediction. Another way to improve the model is to normalize the terms: some words have the same meaning but appear in different tenses or grammatical forms (noun, verb, adjective, and so on). For example, "mention" and "mentioned", "watch" and "watching", and "movie" and "movies" have the same effect in sentiment analysis.
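
Both ideas can be sketched on a single word-count dictionary. This is an illustrative sketch, not the approach used in this notebook: the stop-word set is a small hand-picked assumption, and `normalize` is a deliberately crude suffix stripper standing in for a real stemmer or lemmatizer.

```python
# Hypothetical, hand-picked stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "if", "and", "of", "to"}

def normalize(word):
    """Crude suffix stripping so that 'watching' and 'watch' share one feature."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three letters remain, to spare short words.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean_counts(word_counts):
    """Drop stop words and merge counts of words that normalize to the same form."""
    cleaned = {}
    for word, count in word_counts.items():
        if word in STOP_WORDS:
            continue
        stem = normalize(word)
        cleaned[stem] = cleaned.get(stem, 0.0) + count
    return cleaned

# Toy word counts in the same shape as the words column.
counts = {"the": 6.0, "movie": 2.0, "movies": 1.0,
          "watching": 1.0, "watch": 1.0, "is": 3.0}
print(clean_counts(counts))
```

In this toy example six features collapse to two. A rule this simple also mangles some words (for example, it turns "this" into "thi"), so a proper stemmer would be the safer choice in practice.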

In [12]:
movies['predictions'] = model.predict(movies, output_type='probability')

In [13]:
movies.sort('predictions', ascending=False)[0]

{'review': "The effects of job related stress and the pressures born of a moral dilemma that pits conscience against the obligations of a family business (albeit a unique one) all brought to a head by-- or perhaps the catalyst of-- a midlife crisis, are examined in the dark and absorbing drama, `Panic,' written and directed by Henry Bromell, and starring William H. Macy and Donald Sutherland. It's a telling look at how indecision and denial can bring about the internal strife and misery that ultimately leads to apathy and that moment of truth when the conflict must, of necessity, at last be resolved.<br /><br />\tAlex (Macy) is tired; he has a loving wife, Martha (Tracey Ullman), a precocious six-year-old son, Sammy (David Dorfman), a mail order business he runs out of the house, as well as his main source of income, the `family' business he shares with his father, Michael (Sutherland), and his mother, Deidre (Barbara Bain). But he's empty; years of plying this particular trade have le

In [14]:
movies.sort('predictions', ascending=True)[0]

{'review': "The Nutcracker has always been a somewhat problematic ballet. It bears little resemblance to ETA Hoffman's original story on which it is based.<br /><br />In the ballet, the story is essentially over by the second-half when Clara (or Marie in this version) travels to the Kingdom of Sweets to watch a series of character dances.<br /><br />There's an infinite variety of stage productions that re-interpret the story in myriad ways (not always successfully) to compensate for the ballet's weak libretto.<br /><br />Balanchine's version doesn't really have any sense of drama or story at all (despite the fact that there is plenty of drama and mystery in Tchaikovsky's wonderful first-act music). The result is a completely forgettable first-half Christmas party where hardly anything happens and where even the dancing (the little that there is of it) isn't particularly memorable.<br /><br />The pantomime over-acting, particularly of Drosselmeier, which might look passable on the stage