# Classifying IMDB movie reviews

In [1]:
import turicreate as tc
import utils

In [2]:
movies = tc.SFrame('./IMDB_Dataset.csv')
movies

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


review,sentiment
One of the other reviewers has mentioned ...,positive
A wonderful little production. <br /><br ...,positive
I thought this was a wonderful way to spend ...,positive
Basically there's a family where a little ...,negative
"Petter Mattei's ""Love in the Time of Money"" is a ...",positive
"Probably my all-time favorite movie, a story ...",positive
I sure would like to see a resurrection of a up ...,positive
"This show was an amazing, fresh & innovative idea ...",negative
Encouraged by the positive comments about ...,negative
If you like original gut wrenching laughter you ...,positive


In [3]:
movies['words'] = tc.text_analytics.count_words(movies['review'])
movies

review,sentiment,words
One of the other reviewers has mentioned ...,positive,"{'darker': 1.0, 'touch': 1.0, 'thats': 1.0, ..."
A wonderful little production. <br /><br ...,positive,"{'done': 1.0, 'surface': 1.0, 'every': 1.0, ..."
I thought this was a wonderful way to spend ...,positive,"{'go': 1.0, 'superman': 1.0, 'interesting': 1.0, ..."
Basically there's a family where a little ...,negative,"{'them': 1.0, 'ignore': 1.0, 'dialogs': 1.0, ..."
"Petter Mattei's ""Love in the Time of Money"" is a ...",positive,"{'work': 1.0, 'for': 1.0, 'anxiously': 1.0, ..."
"Probably my all-time favorite movie, a story ...",positive,"{'for': 1.0, 'dozen': 1.0, 'i': 1.0, 'if': ..."
I sure would like to see a resurrection of a up ...,positive,"{'do': 1.0, 'go': 1.0, 'must': 1.0, 'new': 1.0, ..."
"This show was an amazing, fresh & innovative idea ...",negative,"{'awful': 1.0, 'just': 1.0, 'huge': 1.0, 'a': ..."
Encouraged by the positive comments about ...,negative,"{'effort': 1.0, 'an': 1.0, 'making': 1.0, ..."
If you like original gut wrenching laughter you ...,positive,"{'camp': 1.0, 'great': 1.0, 'br': 2.0, 'movie': ..."


In [4]:
model = tc.logistic_classifier.create(movies, features=['words'], target='sentiment')

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



In [5]:
model

Class                          : LogisticClassifier

Schema
------
Number of coefficients         : 101725
Number of examples             : 47500
Number of classes              : 2
Number of feature columns      : 1
Number of unpacked features    : 101724

Hyperparameters
---------------
L1 penalty                     : 0.0
L2 penalty                     : 0.01

Training Summary
----------------
Solver                         : lbfgs
Solver iterations              : 10
Solver status                  : Completed (Iteration limit reached).
Training time (sec)            : 2.1925

Settings
--------
Log-likelihood                 : 306.8836

Highest Positive Coefficients
-----------------------------
words[queers]                  : 35.8076
words[maced]                   : 35.8076
words[shoring]                 : 35.8076
words[videozone]               : 18.457
words[hrm]                     : 18.2839

Lowest Negative Coefficients
----------------------------
words[undoubtable]             

In [6]:
weights = model.coefficients
weights

name,index,class,value,stderr
(intercept),,positive,0.075673322652526,
words,darker,positive,0.6589136198219784,
words,touch,positive,0.2179820567383624,
words,thats,positive,-0.402304345309385,
words,your,positive,-0.003744247444623,
words,viewing,positive,0.1599699274415722,
words,their,positive,0.0061854348841175,
words,into,positive,-0.0182890669477339,
words,turned,positive,-0.1909567241039285,
words,being,positive,-0.0053232687529241,


In [7]:
weights.sort('value')

name,index,class,value,stderr
words,undoubtable,positive,-28.455286863738227,
words,cannonite,positive,-28.455286863738227,
words,gately,positive,-16.829449857055202,
words,millitary,positive,-16.829449857055202,
words,nessecary,positive,-16.829449857055202,
words,boyzone,positive,-16.829449857055202,
words,klines,positive,-16.829449857055202,
words,cerimonee,positive,-16.829449857055202,
words,bracketts,positive,-16.829449857055202,
words,unfortuntately,positive,-16.261783026801275,


In [8]:
weights.sort('value', ascending=False)

name,index,class,value,stderr
words,queers,positive,35.80760994044994,
words,maced,positive,35.80760994044994,
words,shoring,positive,35.80760994044994,
words,videozone,positive,18.456994202093867,
words,hrm,positive,18.283948686423056,
words,madcat,positive,17.822736499862973,
words,scribble,positive,17.74520459438151,
words,tensionate,positive,17.580388439632372,
words,tocsin,positive,17.560245658765297,
words,squirms,positive,17.393216774315025,


In [9]:
weights[weights['index']=='wonderful']

name,index,class,value,stderr
words,wonderful,positive,1.0211960859307196,


In [10]:
weights[weights['index']=='horrible']

name,index,class,value,stderr
words,horrible,positive,-1.0344055288567766,


In [11]:
weights[weights['index']=='the']

name,index,class,value,stderr
words,the,positive,0.0004241444491121,


In [12]:
movies['predictions'] = model.predict(movies, output_type='probability')
movies

review,sentiment,words,predictions
One of the other reviewers has mentioned ...,positive,"{'darker': 1.0, 'touch': 1.0, 'thats': 1.0, ...",0.9999994640410024
A wonderful little production. <br /><br ...,positive,"{'done': 1.0, 'surface': 1.0, 'every': 1.0, ...",0.999999999831477
I thought this was a wonderful way to spend ...,positive,"{'go': 1.0, 'superman': 1.0, 'interesting': 1.0, ...",0.9929302718333678
Basically there's a family where a little ...,negative,"{'them': 1.0, 'ignore': 1.0, 'dialogs': 1.0, ...",0.0230471391626418
"Petter Mattei's ""Love in the Time of Money"" is a ...",positive,"{'work': 1.0, 'for': 1.0, 'anxiously': 1.0, ...",0.9985597321712852
"Probably my all-time favorite movie, a story ...",positive,"{'for': 1.0, 'dozen': 1.0, 'i': 1.0, 'if': ...",0.9995842353595462
I sure would like to see a resurrection of a up ...,positive,"{'do': 1.0, 'go': 1.0, 'must': 1.0, 'new': 1.0, ...",0.9999011506692724
"This show was an amazing, fresh & innovative idea ...",negative,"{'awful': 1.0, 'just': 1.0, 'huge': 1.0, 'a': ...",0.0080122635999007
Encouraged by the positive comments about ...,negative,"{'effort': 1.0, 'an': 1.0, 'making': 1.0, ...",5.531662102661318e-07
If you like original gut wrenching laughter you ...,positive,"{'camp': 1.0, 'great': 1.0, 'br': 2.0, 'movie': ...",0.9674411283578056


In [28]:
# Most positive review: sort() is ascending by default, so the last row has the highest predicted probability of being positive.
movies.sort('predictions')[-1]

{'review': 'Jackie Chan\'s Police Story is a landmark film for both the Honk Kong action genre and the career of Jackie Chan.<br /><br />Directed/written by Chan, Police Story has a basic plot as did all the films of that era and genre, and like most of the the films of Police Storys\' kind, the script is nothing to be raved about. But the plot of the film is Jackie Chan, who plays a nice guy cop, struggling to convict the local gang lord.<br /><br />The direction of the film is nothing special and by no means the best directing effort that Jackie Chan has given us, that responsibility falls to the underrated masterpiece "Miracles". However the job that Jackie does directing is sufficient and respectable. The standout out directing of the film comes with the fight scenes.<br /><br />The performances in this film also vary with Jackie giving a very solid typical Chan nice guy up against it role, but this is by no means his best acting role, that can been seen in the Sammo Hung directed 

In [29]:
# Most negative review
movies.sort('predictions')[0]

 'sentiment': 'negative',
 'words': {'despair': 1.0,
  'ambiguity': 1.0,
  'admits': 1.0,
  'one': 1.0,
  'endings': 1.0,
  'compare': 1.0,
  'takes': 1.0,
  'could': 1.0,
  'stand': 1.0,
  'as': 3.0,
  'less': 1.0,
  'other': 1.0,
  'whole': 1.0,
  'signs': 1.0,
  'policier': 1.0,
  'every': 1.0,
  'misreads': 1.0,
  'through': 2.0,
  'dominate': 1.0,
  'means': 1.0,
  'killed': 1.0,
  'or': 2.0,
  'popular': 1.0,
  'language': 5.0,
  'foreign': 1.0,
  'movies': 2.0,
  'an': 5.0,
  'adult': 1.0,
  'despite': 2.0,
  'indirection': 1.0,
  'start': 1.0,
  'undermining': 1.0,
  'absolute': 1.0,
  'mistaking': 1.0,
  'reassertion': 1.0,
  'etc': 1.0,
  'lacan': 1.0,
  'regains': 1.0,
  'before': 1.0,
  'style': 3.0,
  'inability': 2.0,
  'york': 1.0,
  'primal': 1.0,
  'must': 2.0,
  'whose': 3.0,
  'him': 2.0,
  'certainty': 1.0,
  'inspector': 1.0,
  'cell': 1.0,
  'relationship': 1.0,
  'new': 3.0,
  'turkey': 1.0,
  'further': 1.0,
  'theory': 1.0,
  'these': 1.0,
  'genre': 1.0,
  'ge

### Assignment 3 starts here:

### 1) Use the Sentiment_Analysis_IMDB.ipynb notebook to answer the following questions. Insert text cells to write out your answers. Be sure to state the question number.

### a) [2 marks] How many items are in the raw dataset? What data is contained in each item, and what is the label?

In [34]:
movies.shape

(50000, 4)

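The raw dataset contains 50000 items. The shape above shows 4 columns only because the words and predictions columns were added earlier in the notebook; the raw file has 2 columns. Each item holds the review text as a string in the first column (review), and the label is the second column (sentiment), which is either positive or negative.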

### b) [2 marks] What does the tc.text_analytics.count_words() function do?

This function transforms each review (a string) into a dictionary that maps every word in the review to the number of times it occurs. Each word then acts as an independent feature of the model.
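To make the idea concrete, here is a minimal pure-Python sketch of a word counter. It is not Turi Create's implementation: the function name count_words_sketch and the tokenization rule (lowercase, then keep runs of letters, digits and apostrophes) are assumptions chosen for illustration, and it returns integer counts where the notebook output shows floats.

```python
import re
from collections import Counter


def count_words_sketch(text):
    """Return a {word: count} dictionary for one review (illustrative only)."""
    # Assumed tokenization: lowercase the text, then treat every run of
    # letters, digits and apostrophes as one word.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return dict(Counter(tokens))


example = "A wonderful little production. A wonderful cast!"
print(count_words_sketch(example))
```

Each key of the resulting dictionary would become one feature, which is why a bag-of-words model ends up with one coefficient per distinct word.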

### c) [2 marks] How much of the data is used for validation? What are the training and validation accuracies, and what does this imply in terms of overfitting/underfitting?

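The data is divided into training and validation sets in the proportion 0.95 to 0.05: by default, Turi Create holds out 5 percent of the training data for validation, which leaves 47500 examples for training.

Training accuracy: 0.999389          | Validation accuracy: 0.875600

The model is overfitting: the training accuracy is much higher than the validation accuracy, so the model fits the training reviews almost perfectly but generalizes less well to reviews it has not seen.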
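The numbers behind this answer can be checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, the 5 percent fraction is taken from the training progress message, and the two accuracies are the values reported above.

```python
total_reviews = 50_000
validation_fraction = 0.05  # 5 percent, as stated in the training progress message

# Expected sizes of the two subsets.
n_validation = round(total_reviews * validation_fraction)
n_training = total_reviews - n_validation

# Accuracies copied from the answer above.
training_accuracy = 0.999389
validation_accuracy = 0.875600
gap = training_accuracy - validation_accuracy

print("training examples:", n_training)
print("validation examples:", n_validation)
print(f"accuracy gap: {gap:.4f}")
```

The training size agrees with the "Number of examples" line in the model summary, and a gap of more than 12 percentage points between the two accuracies is the sign of overfitting.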


### d) [1 mark] How many coefficients are in the model?


In [38]:
# Get the number of coefficients in the model
num_coefficients = len(model.coefficients)

print("Number of coefficients in the logistic regression model:", num_coefficients)

Number of coefficients in the logistic regression model: 101725



### e) [1 mark] What weights are given to the L1 and L2 penalties for regularization?


In [44]:
# Get the L1 and L2 penalty values of the model
l1_penalty = model.l1_penalty
l2_penalty = model.l2_penalty

print("L1 penalty:", l1_penalty)
print("L2 penalty:", l2_penalty)

L1 penalty: 0.0
L2 penalty: 0.01
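As a rough illustration of what these two numbers control, the sketch below computes a textbook regularization term: the L1 penalty times the sum of absolute weights plus the L2 penalty times the sum of squared weights. The function name is hypothetical and the exact scaling Turi Create uses internally may differ, so treat this as the general idea rather than the library's formula.

```python
def regularization_penalty(weights, l1_penalty=0.0, l2_penalty=0.01):
    """Textbook L1 + L2 penalty for a list of weights (illustrative only)."""
    l1_term = l1_penalty * sum(abs(w) for w in weights)
    l2_term = l2_penalty * sum(w * w for w in weights)
    return l1_term + l2_term


# Two weights taken from the coefficient tables in this notebook.
small_weight = 1.0212   # 'wonderful'
large_weight = 35.8076  # 'shoring'

print(regularization_penalty([small_weight]))
print(regularization_penalty([large_weight]))
```

With an L1 penalty of 0.0 only the squared term is active, and with an L2 penalty as small as 0.01 very large weights are only mildly discouraged.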



### f) [2 marks] Compare the top 10 words that contributed to positive reviews with the top 10 words that contributed to negative reviews. What is the difference between them?


In [50]:
negative = weights.sort('value').head()
negative


name,index,class,value,stderr
words,undoubtable,positive,-28.455286863738227,
words,cannonite,positive,-28.455286863738227,
words,gately,positive,-16.829449857055202,
words,millitary,positive,-16.829449857055202,
words,nessecary,positive,-16.829449857055202,
words,boyzone,positive,-16.829449857055202,
words,klines,positive,-16.829449857055202,
words,cerimonee,positive,-16.829449857055202,
words,bracketts,positive,-16.829449857055202,
words,unfortuntately,positive,-16.261783026801275,


In [49]:
positive = weights.sort('value', ascending=False).head()
positive

name,index,class,value,stderr
words,queers,positive,35.80760994044994,
words,maced,positive,35.80760994044994,
words,shoring,positive,35.80760994044994,
words,videozone,positive,18.456994202093867,
words,hrm,positive,18.283948686423056,
words,madcat,positive,17.822736499862973,
words,scribble,positive,17.74520459438151,
words,tensionate,positive,17.580388439632372,
words,tocsin,positive,17.560245658765297,
words,squirms,positive,17.393216774315025,


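The positive coefficients are larger in magnitude (up to about 35.8) than the negative ones (down to about -28.5).
Both tables show positive in the class column because every coefficient is expressed with respect to the positive class: a positive value pushes the prediction towards positive sentiment, and a negative value pushes it towards negative sentiment. In both lists the words look rare or misspelled (for example millitary, nessecary, tensionate) rather than typical sentiment words, which suggests that the model has memorized words that occur in only a few training reviews.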
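The words in both tables look unusual for sentiment words. One way to check whether they are simply rare is to count in how many reviews each word appears (its document frequency). The sketch below does this on a tiny made-up list of word-count dictionaries; the function name and the toy data are hypothetical, and on the real data the dictionaries in the words column would take the place of the toy list.

```python
from collections import Counter


def document_frequency(word_count_dicts):
    """Count in how many reviews each word appears at least once."""
    df = Counter()
    for counts in word_count_dicts:
        # Each review contributes at most 1 per word, however often it repeats.
        df.update(counts.keys())
    return df


# Hypothetical toy data in the same shape as the 'words' column.
reviews = [
    {"wonderful": 2, "movie": 1},
    {"horrible": 1, "movie": 3},
    {"movie": 1, "nessecary": 1},
]

df = document_frequency(reviews)
rare_words = sorted(word for word, n in df.items() if n == 1)
print(df["movie"], rare_words)
```

A word that appears in a single review can be given an extreme weight at almost no cost, because it only ever has to explain that one label.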


### g) [1 mark] What weight is given to the word ‘the’? Is this a sensible value? Why or why not?


In [62]:
the = weights[weights['index']=='the']
print('the value is: ', the['value'])

the value is:  [0.0004241444491121867, ... ]


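The weight of ‘the’ is about 0.0004. It is positive but so close to zero that it has almost no influence on the final result, so the word can be considered neutral. This is a sensible value, because ‘the’ appears in positive and negative reviews alike and carries no sentiment of its own.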
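A small worked example shows how little a weight of this size matters. The sketch below applies the logistic (sigmoid) function to the intercept plus a single word's contribution, using rounded coefficients from the tables in this notebook. It deliberately ignores every other word in a review, so the probabilities are illustrative, not the model's real predictions; the helper names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function: maps a score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Rounded values from the coefficient tables above.
intercept = 0.0757
weights = {"the": 0.000424, "wonderful": 1.0212, "horrible": -1.0344}


def probability_positive(word_counts):
    """Probability of 'positive' using only the words listed in `weights`."""
    score = intercept + sum(weights[w] * c for w, c in word_counts.items())
    return sigmoid(score)


baseline = probability_positive({})
with_the = probability_positive({"the": 10})
with_wonderful = probability_positive({"wonderful": 1})
with_horrible = probability_positive({"horrible": 1})

print(f"no words:        {baseline:.4f}")
print(f"'the' x 10:      {with_the:.4f}")
print(f"'wonderful' x 1: {with_wonderful:.4f}")
print(f"'horrible' x 1:  {with_horrible:.4f}")
```

Even ten occurrences of ‘the’ barely move the probability, while a single ‘wonderful’ or ‘horrible’ shifts it substantially.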


### h) [2 marks] Describe what you would do to improve this model, and why it would help.

Suggestions: 
1) Data cleaning and preprocessing: The provided dataset may contain irrelevant or noisy information that could hinder model performance. Cleaning and preprocessing the dataset can improve model accuracy.

2) Feature engineering: Instead of using a simple bag-of-words representation, we can extract more informative features such as n-grams, word embeddings, or topic models. This can help the model capture more nuanced relationships between words and improve accuracy.

3) Model selection and hyperparameter tuning: We can experiment with different machine learning models such as Support Vector Machines, Random Forests, or Neural Networks to see if they perform better than logistic regression. We can also tune hyperparameters such as the regularization strength or learning rate to improve model performance.

4) Evaluating performance on a test set: The notebook holds out only a 5 percent validation set and then makes predictions on the same data the model was trained on. It is essential to evaluate the model on a separate test set to estimate its generalization performance accurately, for instance by splitting the data into train/validation/test sets in the proportion 80/10/10 %. This can help identify issues such as overfitting or underfitting and allow us to fine-tune the model accordingly.

5) Ensemble learning: We can combine multiple models to improve overall performance. This can be done through techniques such as bagging, boosting, or stacking. The individual models can differ in their configuration, for example in the L1 and L2 penalties, the number of epochs, or the model type.

6) Training at least two models with different hyperparameters and comparing their results.
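Two of the suggestions above, cleaning the text and holding out a separate test set, can be sketched in plain Python. The helper names, the regular expression, and the fixed random seed are assumptions made for illustration; Turi Create has its own tools for splitting an SFrame, which are not used here.

```python
import random
import re


def strip_html_tags(text):
    """Replace HTML tags such as <br /> with a space (suggestion 1)."""
    return re.sub(r"<[^>]+>", " ", text)


def train_valid_test_split(items, seed=0, train=0.8, valid=0.1):
    """Shuffle the items and split them 80/10/10 by default (suggestion 4)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],  # whatever remains is the test set
    )


cleaned = strip_html_tags("A wonderful little production. <br /><br />The filming ...")
print(cleaned)

# Split 100 hypothetical review indices.
train_set, valid_set, test_set = train_valid_test_split(range(100))
print(len(train_set), len(valid_set), len(test_set))
```

Removing the tags keeps tokens like br (which appears as a word in the count_words output above) out of the vocabulary, and keeping the test set untouched until the end gives an unbiased estimate of how the chosen model generalizes.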
