# Project 6: IMDB

This project involves NLP, decision trees, bagging, boosting, and more!

---

## Load packages

You are likely going to need to install the `imdbpie` package:

    > pip install imdbpie

---

In [27]:
import os
import subprocess
import collections
import re
import csv
import json

import pandas as pd
import numpy as np
import scipy

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

import psycopg2
import requests
from imdbpie import Imdb
import nltk

import urllib
from bs4 import BeautifulSoup

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

---

## Part 1: Acquire the Data

You will connect to the IMDB API to query for movies. 

See here for documentation on how to use the package:

https://github.com/richardasaurus/imdb-pie

#### 1. Connect to the IMDB API

In [2]:
imdb = Imdb(anonymize=True)

#### 2. Query the top 250 rated movies in the database

In [3]:
top250 = imdb.top_250()

#### 3. Make a dataframe from the movie data

Keep the fields:

    num_votes
    rating
    tconst
    title
    year
    
And discard the rest

In [12]:
def movie_parser(m):
    return [int(m['num_votes']), float(m['rating']),m['tconst'],m['title'],int(m['year'])]

parsed = [movie_parser(m) for m in top250]

movies = pd.DataFrame(parsed, columns=['num_votes','rating','tconst','title','year'])

In [13]:
movies.head(3)

   num_votes  rating     tconst                     title  year
0    1661358     9.3  tt0111161  The Shawshank Redemption  1994
1    1137094     9.2  tt0068646             The Godfather  1972
2     776658     9.0  tt0071562    The Godfather: Part II  1974


In [14]:
movies.dtypes

num_votes      int64
rating       float64
tconst        object
title         object
year           int64
dtype: object

#### 3. Select only the top 100 movies

In [15]:
movies = movies[0:100]

#### 4. Get the genres and runtime for each movie and add them to the dataframe

There can be multiple genres per movie, so this will need some finessing.

In [17]:
runtimes = []
genres = []

for t in movies.tconst.values:
    title = imdb.get_title_by_id(t)
    print title.title
    runtimes.append(float(title.runtime))
    genres.append(title.genres)


The Shawshank Redemption
The Godfather
The Godfather: Part II
The Dark Knight
Schindler's List
12 Angry Men
Pulp Fiction
The Lord of the Rings: The Return of the King
The Good, the Bad and the Ugly
Fight Club
The Lord of the Rings: The Fellowship of the Ring
Star Wars: Episode V - The Empire Strikes Back
Forrest Gump
Inception
The Lord of the Rings: The Two Towers
One Flew Over the Cuckoo's Nest
Goodfellas
The Matrix
Seven Samurai
Star Wars
City of God
Se7en
The Silence of the Lambs
It's a Wonderful Life
The Usual Suspects
Life Is Beautiful
Léon: The Professional
Once Upon a Time in the West
Spirited Away
Saving Private Ryan
American History X
Interstellar
Casablanca
Psycho
City Lights
Indiana Jones and the Raiders of the Lost Ark
The Intouchables
Rear Window
Modern Times
The Green Mile
Terminator 2: Judgment Day
The Pianist
The Departed
Back to the Future
Whiplash
Gladiator
Memento
Apocalypse Now
The Prestige
The Lion King
Dr. Strangelove or: How I Learned to Stop Worrying and Love th

In [18]:
print runtimes[0:4]
print genres[0:4]

[8520.0, 10500.0, 12120.0, 9120.0]
[[u'Crime', u'Drama'], [u'Crime', u'Drama'], [u'Crime', u'Drama'], [u'Action', u'Adventure', u'Crime', u'Thriller']]


In [19]:
flatten_genres = [
    item
    for sublist in genres
    for item in sublist
]
unique_genres = np.unique(flatten_genres)
print unique_genres

[u'Action' u'Adventure' u'Animation' u'Biography' u'Comedy' u'Crime'
 u'Drama' u'Family' u'Fantasy' u'Film-Noir' u'History' u'Horror' u'Music'
 u'Musical' u'Mystery' u'Romance' u'Sci-Fi' u'Thriller' u'War' u'Western']


In [21]:
genre_dummy_coded = []

for i, tconst in enumerate(movies.tconst.values):
    row = [tconst]
    row.extend([1 if ug in genres[i] else 0 for ug in unique_genres])
    genre_dummy_coded.append(row)
    if i < 2:
        print tconst, genres[i], row
    
    

tt0111161 [u'Crime', u'Drama'] [u'tt0111161', 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
tt0068646 [u'Crime', u'Drama'] [u'tt0068646', 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


#### 4. Write the Results to a csv

In [22]:
movie_genres = pd.DataFrame(genre_dummy_coded, columns=['tconst']+unique_genres.tolist())
print movie_genres.head(3)
print movies.shape, movie_genres.shape

      tconst  Action  Adventure  Animation  Biography  Comedy  Crime  Drama  \
0  tt0111161       0          0          0          0       0      1      1   
1  tt0068646       0          0          0          0       0      1      1   
2  tt0071562       0          0          0          0       0      1      1   

   Family  Fantasy   ...     History  Horror  Music  Musical  Mystery  \
0       0        0   ...           0       0      0        0        0   
1       0        0   ...           0       0      0        0        0   
2       0        0   ...           0       0      0        0        0   

   Romance  Sci-Fi  Thriller  War  Western  
0        0       0         0    0        0  
1        0       0         0    0        0  
2        0       0         0    0        0  

[3 rows x 21 columns]
(100, 5) (100, 21)


In [23]:
movies = movies.merge(movie_genres, on='tconst')
print movies.shape
print movies.columns

(100, 25)
Index([u'num_votes',    u'rating',    u'tconst',     u'title',      u'year',
          u'Action', u'Adventure', u'Animation', u'Biography',    u'Comedy',
           u'Crime',     u'Drama',    u'Family',   u'Fantasy', u'Film-Noir',
         u'History',    u'Horror',     u'Music',   u'Musical',   u'Mystery',
         u'Romance',    u'Sci-Fi',  u'Thriller',       u'War',   u'Western'],
      dtype='object')


In [25]:
movies['runtime'] = runtimes
movies.to_csv('./movies_with_genres.csv', index=False, encoding='utf-8')

---

## Part 2: Wrangle the text data

#### 1. Scrape the reviews for the top 100 movies

*Hint*: Use a loop to scrape all of the pages in one go.

#### 2. Extract the reviews and the rating per review for each movie

*Note*: "soup" from BeautifulSoup is the HTML returned from all 25 pages. You'll need to either address each page individually or break the pages down by element.
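
As a rough illustration of breaking a page down by element, the sketch below parses a small hand-written HTML string. The markup (the `review` and `score` class names) is hypothetical and only stands in for whatever the real review pages contain, so the selectors will need adjusting once you inspect the actual HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real IMDB pages use different tags and classes.
sample_html = """
<div class="review"><span class="score">9</span><p>Brilliant film.</p></div>
<div class="review"><p>No score given here.</p></div>
<div class="review"><span class="score">7</span><p>Good, a bit long.</p></div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

reviews = []
for block in soup.find_all('div', class_='review'):
    score_tag = block.find('span', class_='score')
    text_tag = block.find('p')
    reviews.append({
        # Not every review carries a score, so keep None for the gaps.
        'user_rating': int(score_tag.get_text()) if score_tag else None,
        'review': text_tag.get_text().strip() if text_tag else '',
    })

print(len(reviews))
```

Keeping the rating and the text in the same record makes it straightforward to drop unrated reviews later without the two lists drifting out of alignment.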

#### 3. Remove the non-alphanumeric characters from reviews
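
One possible way to do this is with a regular expression. The helper name `clean_review` and the sample sentence are made up for illustration, and lowercasing is an extra step you may or may not want.

```python
import re

# Anything that is not an ASCII letter, digit, or whitespace becomes a space.
NON_ALNUM = re.compile(r'[^A-Za-z0-9\s]+')
WHITESPACE = re.compile(r'\s+')

def clean_review(text):
    """Strip non-alphanumeric characters, collapse whitespace, lowercase."""
    text = NON_ALNUM.sub(' ', text)
    text = WHITESPACE.sub(' ', text)
    return text.strip().lower()

cleaned = clean_review("A masterpiece!!! 10/10 -- would watch again.")
print(cleaned)
```

Note that this ASCII-only pattern also removes accented letters (the "é" in "Léon", for example). If you want to keep those, a Unicode-aware pattern is needed instead.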

#### 4. Calculate the top 200 ngrams from the user reviews

Use the `TfidfVectorizer` in sklearn.

Recommended parameters:

    ngram_range = (1, 2)
    stop_words = 'english'
    binary = False
    max_features = 200

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
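
A minimal sketch of the recommended settings on three made-up reviews. The toy sentences and the `tfidf_df` name are assumptions for illustration; your real input is the list of cleaned review strings.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy reviews standing in for the scraped text (hypothetical data).
toy_reviews = [
    "great acting and a great story",
    "the story drags but the acting is superb",
    "superb cast and brilliant camera work",
]

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),    # unigrams and bigrams
    stop_words='english',
    binary=False,
    max_features=200,      # keep at most the 200 most frequent ngrams
)
tfidf = vectorizer.fit_transform(toy_reviews)

# vocabulary_ maps each ngram to its column index; sort by index for labels.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
tfidf_df = pd.DataFrame(tfidf.toarray(), columns=columns)
print(tfidf_df.shape)
```

Bigram column names contain a space (such as `great acting`), which is a handy way to tell them apart from unigram columns later on.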

#### 5. Merge the user reviews and ratings

#### 6. Save this merged dataframe as a csv

---

## Part 3: Combine Tables in PostgreSQL

#### 1. Import your two .csv data files into your PostgreSQL database as two different tables

For ease, we can call these `table1` and `table2`.

#### 2. Connect to database and query the joined set

#### 3. Join the two tables 

#### 4. Select the newly joined table and save two copies of it into dataframes
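
The join itself can be sketched without a running server. The example below uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in for PostgreSQL, and the trimmed-down column lists for `table1` and `table2` are hypothetical. The `SELECT ... JOIN` statement is the part that carries over.

```python
import sqlite3

# sqlite3 in-memory database as a stand-in for the PostgreSQL server.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Hypothetical, trimmed-down versions of the two csv files.
cur.execute("CREATE TABLE table1 (tconst TEXT, title TEXT, rating REAL)")
cur.execute("CREATE TABLE table2 (tconst TEXT, user_rating REAL)")
cur.executemany("INSERT INTO table1 VALUES (?, ?, ?)", [
    ('tt0111161', 'The Shawshank Redemption', 9.3),
    ('tt0068646', 'The Godfather', 9.2),
])
cur.executemany("INSERT INTO table2 VALUES (?, ?)", [
    ('tt0111161', 10.0),
    ('tt0111161', 8.0),
    ('tt0068646', 9.0),
])

# One output row per review, with the movie-level fields repeated.
join_query = """
    SELECT t1.tconst, t1.title, t1.rating, t2.user_rating
    FROM table1 AS t1
    JOIN table2 AS t2 ON t1.tconst = t2.tconst
    ORDER BY t1.tconst, t2.user_rating
"""
joined_rows = cur.execute(join_query).fetchall()
conn.close()
print(len(joined_rows))
```

With `psycopg2` the connection comes from `psycopg2.connect(...)` and query placeholders are written `%s` rather than `?`. Passing the query and an open connection to `pd.read_sql_query` is one way to land the result in a dataframe.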

---

## Part 4: Parsing and Exploratory Data Analysis

#### 1. Rename any columns you think should be renamed for clarity

#### 2. Describe anything interesting or suspicious about your data (quality assurance)

#### 3. Make four visualizations of interest to you using the data

---

## Part 5: Decision Tree Classifiers and Regressors

#### 1. What is our target attribute? 

Choose a target variable for the decision tree regressor and the classifier. 

#### 2. Prepare the X and Y matrices and preprocess data as you see fit

In [30]:
movies = pd.read_csv('movies_user.csv')

In [41]:
print movies.columns[0:50]
print movies.head(6)

Index([u'num_votes', u'rating', u'tconst', u'title', u'year_x', u'Action',
       u'Adventure', u'Animation', u'Biography', u'Comedy', u'Crime', u'Drama',
       u'Family', u'Fantasy', u'Film-Noir', u'History', u'Horror', u'Music',
       u'Musical', u'Mystery', u'Romance', u'Sci-Fi', u'Thriller', u'War',
       u'Western', u'runtime', u'acting', u'action', u'actor', u'actors',
       u'actually', u'amazing', u'american', u'audience', u'away', u'bad',
       u'beautiful', u'believe', u'best', u'better', u'big', u'bit', u'black',
       u'book', u'brilliant', u'camera', u'cast', u'chaplin', u'character',
       u'characters'],
      dtype='object')
   num_votes  rating     tconst                     title  year_x  Action  \
0  1661358.0     9.3  tt0111161  The Shawshank Redemption  1994.0       0   
1  1661358.0     9.3  tt0111161  The Shawshank Redemption  1994.0       0   
2  1661358.0     9.3  tt0111161  The Shawshank Redemption  1994.0       0   
3  1661358.0     9.3  tt0111161  The

In [125]:
genres = movies.columns[5:25]
duplicated_cols = ['year_x','num_votes','runtime','user_rating']
ngram2_cols = [c for c in movies.columns if ' ' in c]

remove_cols = genres.tolist() + duplicated_cols + ['tconst','title']+ngram2_cols
Y = movies['Crime'].values

X = movies[[c for c in movies.columns if c not in remove_cols]]
predictors = X.columns
X = X.values

print Y.shape
print X.shape

(7755,)
(7755, 200)


In [89]:
genres

Index([u'Action', u'Adventure', u'Animation', u'Biography', u'Comedy',
       u'Crime', u'Drama', u'Family', u'Fantasy', u'Film-Noir', u'History',
       u'Horror', u'Music', u'Musical', u'Mystery', u'Romance', u'Sci-Fi',
       u'Thriller', u'War', u'Western'],
      dtype='object')

#### 3. Build and cross-validate your decision tree classifier

In [90]:
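from sklearn.tree import DecisionTreeClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; newer versions
# provide cross_val_score in sklearn.model_selection instead.
from sklearn.cross_validation import cross_val_score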

In [91]:
dtc = DecisionTreeClassifier(max_depth=6)

dtc_scores = cross_val_score(dtc, X, Y, cv=5)
print dtc_scores
print np.mean(dtc_scores)

[ 0.69523196  0.69072165  0.68923275  0.70064516  0.70258065]
0.695682433552


In [92]:
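# Y is a 0/1 flag, so its mean is the share of Crime rows, not an accuracy.
# The majority-class baseline to beat is max(p, 1 - p), about 0.708 here.
crime_rate = np.mean(Y)
print crime_rate
dtc.fit(X, Y)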

0.292327530625


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
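
When reading the number above, keep in mind that the mean of a 0/1 target is the positive-class rate. The accuracy a classifier has to beat is the share of the most common class: with a positive rate of about 0.292, always predicting "not Crime" would already score about 0.708. A small sketch of that calculation, using a made-up label list and a hypothetical helper name:

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / float(len(labels))

# Hypothetical labels: 3 positives out of 10, similar in spirit to a
# minority genre flag.
toy_labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
toy_baseline = majority_baseline(toy_labels)
print(toy_baseline)
```

Comparing the cross-validated mean against this figure, rather than against the positive rate, shows whether the tree is learning anything beyond the class balance.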

In [93]:
feat_imp = dtc.feature_importances_
feature_importances = pd.DataFrame({'feat_imp':feat_imp, 'predictor':predictors})
feature_importances.sort_values(['feat_imp'], ascending=False, inplace=True)


In [94]:
feature_importances

     feat_imp    predictor
53   0.166437       family
111  0.141963      michael
186  0.103341         wars
199  0.094071  user_rating
17   0.055946         book
27   0.052185      classic
79   0.047336      history
97   0.028032        lives
92   0.027747        later
26   0.027592         city


#### 4. Gridsearch optimal parameters for your classifier. Does the performance improve?

In [97]:
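# In scikit-learn >= 0.18, GridSearchCV lives in sklearn.model_selection.
from sklearn.grid_search import GridSearchCV

# Note: the target switches here from Crime to Action.
Y = movies.Action.values

dtc_gs_params = {
    'max_depth':[1,2,3,4,5,6],
    # Newer scikit-learn versions require min_samples_split >= 2.
    'min_samples_split':[1,25,50,100]
}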

dtc_gs = GridSearchCV(dtc, dtc_gs_params, cv=5)
dtc_gs.fit(X,Y)


GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'min_samples_split': [1, 25, 50, 100], 'max_depth': [1, 2, 3, 4, 5, 6]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [98]:
print dtc_gs.best_score_
print dtc_gs.best_params_
best_dtc = dtc_gs.best_estimator_

0.823855577047
{'min_samples_split': 1, 'max_depth': 2}


#### 5. Build and cross-validate your decision tree regressor

In [109]:
from sklearn.tree import DecisionTreeRegressor

Y = movies.user_rating.values

remove_cols = duplicated_cols + ['tconst','title'] + ngram2_cols
X = movies[[c for c in movies.columns if c not in remove_cols+['user_rating']]]

predictors = X.columns



dtr = DecisionTreeRegressor(max_depth=2)

dtr_scores = cross_val_score(dtr, X, Y, cv=5)
print dtr_scores, dtr_scores.mean()

dtr.fit(X, Y)



[ 0.00770698 -0.10864206 -0.14369265 -0.0322239  -0.28782947] -0.112936218998


DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

In [110]:
feat_imp = dtr.feature_importances_
feature_importances = pd.DataFrame({'feat_imp':feat_imp, 'predictor':predictors})
feature_importances.sort_values(['feat_imp'], ascending=False, inplace=True)
feature_importances

     feat_imp  predictor
8    0.648081    Fantasy
4    0.317436     Comedy
141  0.034483        old
0    0.000000     Action
151  0.000000      plays
140  0.000000        new
142  0.000000   original
143  0.000000      oscar
144  0.000000     people
145  0.000000    perfect


#### 6. Gridsearch the optimal parameters for your regressor. Does performance improve?

In [112]:
dtr_gs_params = {
    'max_depth':range(1,20),
    'min_samples_split':range(1,100,10),
    'max_features':['auto','sqrt','log2'],
    
}

dtr_gs = GridSearchCV(dtr, dtr_gs_params, cv=5, verbose=1)
dtr_gs.fit(X,Y)

[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    2.1s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:    7.9s
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed:   16.8s
[Parallel(n_jobs=1)]: Done 799 tasks       | elapsed:   40.6s
[Parallel(n_jobs=1)]: Done 1249 tasks       | elapsed:  1.3min
[Parallel(n_jobs=1)]: Done 1799 tasks       | elapsed:  2.0min
[Parallel(n_jobs=1)]: Done 2449 tasks       | elapsed:  3.5min
[Parallel(n_jobs=1)]: Done 2850 out of 2850 | elapsed:  4.3min finished


Fitting 5 folds for each of 570 candidates, totalling 2850 fits


GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'min_samples_split': [1, 11, 21, 31, 41, 51, 61, 71, 81, 91], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)

In [113]:
print dtr_gs.best_score_
print dtr_gs.best_params_

0.00924438888354
{'max_features': 'sqrt', 'min_samples_split': 91, 'max_depth': 2}


---

## Part 6: Elastic Net


#### 1. Gridsearch optimal parameters for an ElasticNet using the regression target and predictors you used for the decision tree regressor.


In [117]:
type(X)


pandas.core.frame.DataFrame

In [130]:
from sklearn.linear_model import ElasticNet

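# Caution: X still contains the movie-level 'rating' column, which is also the
# target below, so a near-perfect score here points to target leakage.
Xn = (X - X.mean()) / X.std()
Y = movies.rating.values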

enet = ElasticNet()

enet_params = {
    'l1_ratio':np.linspace(0.05, 1.0, 5),
    'alpha':np.logspace(-5,1,10)
}

enet_gs = GridSearchCV(enet, enet_params, cv=5, verbose=1, n_jobs=-1)
enet_gs.fit(Xn,Y)

print enet_gs.best_params_
print enet_gs.best_score_


Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    6.4s
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:    8.3s finished


{'alpha': 1.0000000000000001e-05, 'l1_ratio': 0.050000000000000003}
0.999999719149
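
A score this close to 1.0 is worth double-checking: if the target column is also among the predictors, the model can simply read the answer. The sketch below shows one way to set up the search so that the target stays out of `X` and the standardization is refit inside each training fold. It runs on synthetic data, every name in it is made up for illustration, and it assumes scikit-learn 0.18 or later, where `GridSearchCV` lives in `sklearn.model_selection`.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic stand-in data: 200 rows, 5 predictors, target driven by two of them.
X_toy = rng.normal(size=(200, 5))
y_toy = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

# Scaling inside the pipeline is refit on each training fold, and the target
# is never one of the predictor columns.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('enet', ElasticNet(max_iter=10000)),
])
param_grid = {
    'enet__l1_ratio': [0.05, 0.5, 1.0],
    'enet__alpha': np.logspace(-3, 0, 4),
}
toy_gs = GridSearchCV(pipe, param_grid, cv=5)
toy_gs.fit(X_toy, y_toy)
print(toy_gs.best_params_)
```

The `enet__` prefix routes each grid entry to the pipeline step named `enet`. The same pattern works with a dataframe of predictors once the target column has been dropped from it.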


In [127]:
print X

[[ 9.3         0.          0.09280123 ...,  0.          0.          0.        ]
 [ 9.3         0.11055819  0.         ...,  0.          0.          0.06884384]
 [ 9.3         0.12246272  0.         ...,  0.          0.          0.        ]
 ..., 
 [ 8.3         0.          0.         ...,  0.          0.          0.        ]
 [ 8.3         0.          0.14051796 ...,  0.          0.12333582  0.        ]
 [ 8.3         0.          0.         ...,  0.          0.11955692
   0.1480708 ]]


In [124]:
movies.columns[0:50]

Index([u'num_votes', u'rating', u'tconst', u'title', u'year_x', u'Action',
       u'Adventure', u'Animation', u'Biography', u'Comedy', u'Crime', u'Drama',
       u'Family', u'Fantasy', u'Film-Noir', u'History', u'Horror', u'Music',
       u'Musical', u'Mystery', u'Romance', u'Sci-Fi', u'Thriller', u'War',
       u'Western', u'runtime', u'acting', u'action', u'actor', u'actors',
       u'actually', u'amazing', u'american', u'audience', u'away', u'bad',
       u'beautiful', u'believe', u'best', u'better', u'big', u'bit', u'black',
       u'book', u'brilliant', u'camera', u'cast', u'chaplin', u'character',
       u'characters'],
      dtype='object')

In [122]:
enet = enet_gs.best_estimator_
enet.fit(Xn, Y)
coef = enet.coef_
enet_coefs = pd.DataFrame({'abs_coef':abs(coef), 'coef':coef, 'predictor':predictors})
enet_coefs.sort_values(['abs_coef'], ascending=False, inplace=True)
enet_coefs

     abs_coef  coef    predictor
0         0.0  -0.0       Action
150       0.0  -0.0       played
139       0.0   0.0         need
140       0.0   0.0          new
141       0.0  -0.0          old
142       0.0  -0.0     original
143       0.0   0.0        oscar
144       0.0   0.0       people
145       0.0  -0.0      perfect
146       0.0   0.0  performance


#### 2. Is cross-validated performance better or worse than with the decision trees? 

#### 3. Explain why the elastic net may have performed best at that particular l1_ratio and alpha

---

## Part 7: Bagging and Boosting: Random Forests, Extra Trees, and AdaBoost

#### 1. Load the random forest regressor, extra trees regressor, and AdaBoost regressor from sklearn

#### 2. Gridsearch optimal parameters for the three different ensemble methods.

#### 3. Evaluate the performance of the two bagging models and the one boosting model. Which performs best?

#### 4. Extract the feature importances from the Random Forest regressor and make a DataFrame pairing variable names with their variable importances.

#### 5. Plot the ranked feature importances.

#### 6.1 [BONUS] Gridsearch an optimal Lasso model and use it for variable selection (make a new predictor matrix with only the variables not zeroed out by the Lasso). 

#### 6.2 [BONUS] Gridsearch your best performing bagging/boosting model from above with the features retained after the Lasso. Does the score improve?

#### 7.1. [BONUS] Select a threshold for variable importance from your Random Forest regressor and use that to perform feature selection, creating a new subset predictor matrix.

#### 7.2 [BONUS] Using BaggingRegressor with a base estimator of your choice, test a model using the feature-selected dataset you made in 7.1
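
To make steps 1 through 4 concrete, here is a rough sketch on synthetic data. It compares the three ensembles with fixed settings rather than a full gridsearch (each model could be wrapped in `GridSearchCV` for step 2), and it assumes scikit-learn 0.18 or later for `sklearn.model_selection`. The column names and data are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              RandomForestRegressor)
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(42)

# Synthetic data (hypothetical): only 'signal' actually drives the target.
toy_X = pd.DataFrame(rng.normal(size=(300, 4)),
                     columns=['signal', 'noise_a', 'noise_b', 'noise_c'])
toy_y = 3.0 * toy_X['signal'].values + rng.normal(scale=0.2, size=300)

models = {
    'random_forest': RandomForestRegressor(n_estimators=50, random_state=0),
    'extra_trees': ExtraTreesRegressor(n_estimators=50, random_state=0),
    'adaboost': AdaBoostRegressor(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated R^2 for each ensemble.
cv_r2 = {}
for name, model in models.items():
    cv_r2[name] = cross_val_score(model, toy_X, toy_y, cv=5).mean()

# Pair each variable name with its random forest importance, ranked.
rf = models['random_forest'].fit(toy_X, toy_y)
rf_importances = pd.DataFrame({
    'predictor': toy_X.columns,
    'feat_imp': rf.feature_importances_,
}).sort_values('feat_imp', ascending=False)
print(rf_importances)
```

For step 5, something like `rf_importances.plot(kind='barh', x='predictor', y='feat_imp')` gives a quick ranked bar chart. The same ranked frame is also a natural starting point for the threshold-based selection in 7.1.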

---

## [VERY BONUS] Part 8: PCA

#### 1. Perform a PCA on your predictor matrix

#### 2. Examine the variance explained and determine what components you want to keep based on them.

#### 3. Plot the cumulative variance explained by the ordered principal components.

#### 4. Gridsearch an elastic net using the principal components you selected as your predictors. Does this perform better than the elastic net you fit earlier?

#### 5. Gridsearch a bagging ensemble estimator that you fit before, this time using the principal components as predictors. Does this perform better or worse than the original? 

#### 6. Look at the loadings of the original predictor columns on the first 3 principal components. Is there any kind of intuitive meaning here?

Hint: you will probably want to sort by the absolute value of each loading, and only look at the obviously important (larger) ones!
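
A small sketch of steps 1 through 3 on synthetic data, in case it helps to see the moving parts. The data, the 95% cut-off, and the variable names are all assumptions for illustration, not recommendations for your predictor matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)

# Synthetic predictors (hypothetical): 6 columns generated from 2 latent factors.
latent = rng.normal(size=(250, 2))
mixing = rng.normal(size=(2, 6))
toy_X = latent.dot(mixing) + rng.normal(scale=0.05, size=(250, 6))

# Standardize columns so no single scale dominates the components.
toy_Xn = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

pca = PCA()
components = pca.fit_transform(toy_Xn)

# Cumulative share of variance explained by the ordered components.
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching 95% of the variance.
n_keep = int(np.searchsorted(cum_var, 0.95) + 1)
print(n_keep)
```

Plotting `cum_var` against the component number gives the curve asked for in step 3. For step 6, each row of `pca.components_` holds the loadings of the original columns on one component, which you can pair with the column names and sort by absolute value.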

---

## [Extremely Bonus] Part 9: Clustering

![](https://snag.gy/jPSZ6U.jpg)

***Bonus Bonus:***
This extended bonus question asks you to try something we never really covered, building on the assumptions we learned in this week's clustering lesson(s).

#### 1. Import your favorite clustering module

#### 2. Encode categoricals

#### 3. Evaluate cluster metrics solely based on a range of K
If K-Means: SSE/inertia (i.e., the elbow method), average silhouette score, etc.

#### 4. Look at your data based on the subset of your predicted clusters.
Assign the cluster predictions back to your dataframe in order to see them in context. This makes it easy to group by cluster and get a sense of which data clumped together.

#### 5. Describe your findings based on the predicted clusters 
_How well did it do?  What's good or bad?  How would you improve this? Does any of it make sense?_
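
As one possible starting point for step 3, the sketch below fits K-Means over a range of K on synthetic blobs and records both inertia and the average silhouette. The data and names are invented for illustration, and real review data will rarely separate this cleanly.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(7)

# Synthetic data (hypothetical): three well-separated 2-D blobs.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy_X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

inertias = {}
silhouettes = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(toy_X)
    inertias[k] = km.inertia_                      # within-cluster SSE
    silhouettes[k] = silhouette_score(toy_X, labels)

# The k with the highest average silhouette is one reasonable choice.
best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

Plotting `inertias` against K gives the elbow curve. For step 4, the labels from the chosen fit can be assigned to a new dataframe column and used in a `groupby`.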