# Data Modeling and Clustering

A predictive model will be used to determine whether a movie will be rated high or low. Clustering will also be done to find patterns among some of the best-performing movies.

In [1]:
import pandas as pd
import numpy as np

In [2]:
movies = pd.read_csv('./cleandataV2.csv')

In [60]:
np.median(movies['imdbRating'])

6.4000000000000004

In [62]:
np.percentile(movies['imdbRating'],75)

7.0

A cutoff is needed to define what counts as a high or low rating for a movie. I initially considered using the median rating (6.4), but that leaves a lot of room 'in the middle' for average-rated movies. I therefore chose the 75th percentile as the cutoff: it falls on a clean rating of exactly 7.0, so roughly the top quarter of movies count as highly rated. The result is a binary label, similar to how YouTube rates videos (thumbs up or thumbs down).
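To make the difference between the two candidate cutoffs concrete, here is a minimal sketch on a small made-up list of ratings. The values and variable names are illustrative assumptions, not taken from the dataset. It counts how many movies each cutoff would label as high.

```python
import numpy as np

# Hypothetical ratings, for illustration only (not from cleandataV2.csv)
toy_ratings = np.array([5.0, 5.5, 6.0, 6.2, 6.4, 6.6, 6.8, 7.0, 7.5, 8.0])

median_cut = np.median(toy_ratings)        # splits the data roughly in half
q75_cut = np.percentile(toy_ratings, 75)   # keeps roughly the top quarter

# Count how many movies each cutoff would label as "high"
n_high_median = int((toy_ratings >= median_cut).sum())
n_high_q75 = int((toy_ratings >= q75_cut).sum())

print("median cutoff:", median_cut, "-> high:", n_high_median)
print("75th percentile cutoff:", q75_cut, "-> high:", n_high_q75)
```

With the median, half of the toy movies are labelled high; with the 75th percentile, only the clearly better-rated ones are.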

In [63]:
target = np.percentile(movies['imdbRating'],75)

In [49]:
def high_rating(rating, target):
    if rating>= target:
        return 1
    else:
        return 0

In [50]:
# Extra positional arguments for the applied function must go through `args`
movies['HighRating'] = movies['imdbRating'].apply(high_rating, args=(target,))
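The same labelling can also be written as a vectorized comparison, which avoids calling a Python function once per row. The sketch below uses a hypothetical Series and cutoff (both made up for illustration) and restates the `high_rating` rule so the block runs on its own.

```python
import pandas as pd

def high_rating(rating, target):
    """Return 1 if the rating meets the cutoff, else 0."""
    return 1 if rating >= target else 0

# Hypothetical ratings and cutoff, for illustration only
toy = pd.Series([6.1, 7.0, 8.3, 5.4])
toy_target = 7.0

# Row-by-row version: the cutoff is forwarded to the function via `args`
via_apply = toy.apply(high_rating, args=(toy_target,))

# Equivalent vectorized version
via_compare = (toy >= toy_target).astype(int)

print(via_apply.tolist())
print(via_compare.tolist())
```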

In [51]:
y = movies['HighRating']
X = movies.drop(['HighRating','imdbRating','Title'], axis=1)

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
comb_n = StandardScaler().fit_transform(X)
pca = PCA().fit(comb_n)
comb_n.shape

In [66]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
rf = RandomForestClassifier(class_weight='balanced', random_state=1)
rf.fit(X,y)
s = cross_val_score(rf, X, y, cv=3)
print("{} Score:\t{:0.3} ± {:0.3}".format("Random Forest", s.mean(), s.std().round(3)))

Random Forest Score:	0.741 ± 0.025


In [70]:
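# One row of importances, labelled with the feature names
fdf = pd.DataFrame(rf.feature_importances_).T
fdf.columns=X.columns
# Sorted in descending order, so tail() returns the least important features
fdf = fdf.T.sort_values(0, ascending=False).tail(20)
fdf['Importance'] = fdf[0]
fdf.drop(0, axis=1)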

| Feature | Importance |
|---|---|
| Charlie Bass | 0.0 |
| Logan Browning | 0.0 |
| Aseefa Bhutto Zardari | 0.0 |
| Sam Robards | 0.0 |
| Dara Perlmutter | 0.0 |
| David Dorfman | 0.0 |
| Dick Shawn | 0.0 |
| Tatsuya Nakadai | 0.0 |
| Naomie Harris | 0.0 |
| Cassidy Freeman | 0.0 |
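Because the importances are sorted in descending order, `tail` returns the least important features, which is why every row above is 0.0. A hedged sketch of listing the most important features instead, using made-up importances and column names (both are assumptions for illustration, not values from the fitted model):

```python
import pandas as pd

# Hypothetical importances and feature names, for illustration only
toy_importances = [0.05, 0.60, 0.00, 0.35]
toy_columns = ['Runtime', 'Awards', 'Actor A', 'Year']

# A Series keeps each importance attached to its feature name
ranked = (
    pd.Series(toy_importances, index=toy_columns, name='Importance')
    .sort_values(ascending=False)
)

top_features = ranked.head(2)     # most important features
bottom_features = ranked.tail(2)  # least important features

print(top_features)
```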


In [52]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8475 entries, 0 to 8474
Columns: 16586 entries, Awards to Gy Waldron
dtypes: int64(16586)
memory usage: 1.0 GB


In [53]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(class_weight='balanced', n_jobs=1)
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_split=1e-07,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [54]:
y_pred = rf.predict(X_test)

In [55]:
from sklearn.metrics import accuracy_score

In [56]:
accuracy_score(y_test, y_pred)

0.78647267007471489

In [34]:
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier()

In [35]:
gb = gb.fit(X_train, y_train)
gb.get_params  # no parentheses: displays the bound method, whose repr lists the estimator's parameters

<bound method GradientBoostingClassifier.get_params of GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False)>

In [36]:
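# Random forest reusing the gradient boosting classifier's default hyperparameters.
# min_impurity_split is left at its default (it was removed in scikit-learn 1.0).
rf_gb = RandomForestClassifier(max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100)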
rf_gb.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=3, max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [37]:
y_pred_rfgb = rf_gb.predict(X_test)

In [38]:
accuracy_score(y_test, y_pred_rfgb)

0.81046008651199375

In [39]:
rf_gb.feature_importances_

array([ 0.84336666,  0.01497845,  0.03637163, ...,  0.        ,
        0.        ,  0.        ])

In [57]:
from sklearn.metrics import confusion_matrix
cnf_mat = np.array(confusion_matrix(y_test, y_pred, labels=[1,0]))

In [58]:
confusion = pd.DataFrame(cnf_mat, index=['high rating', 'low rating'],
                         columns=['predicted high rating','predicted low rating'])

In [59]:
confusion

| | predicted high rating | predicted low rating |
|---|---|---|
| high rating | 417 | 422 |
| low rating | 121 | 1583 |
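Accuracy alone hides how the errors are split between the two classes. The arithmetic below derives recall and precision for the high-rating class from the counts in the table. It also computes the accuracy of a trivial baseline that always predicts "low rating"; that baseline is an added point of comparison, not part of the original analysis.

```python
# Counts copied from the confusion matrix above
tp, fn = 417, 422    # actual high rating: predicted high, predicted low
fp, tn = 121, 1583   # actual low rating: predicted high, predicted low

total = tp + fn + fp + tn
accuracy = (tp + tn) / total    # share of all predictions that are correct
recall = tp / (tp + fn)         # share of high-rated movies that were found
precision = tp / (tp + fp)      # share of "high" predictions that are right
baseline = (fp + tn) / total    # accuracy of always predicting "low rating"

print("accuracy  {:.3f}".format(accuracy))
print("recall    {:.3f}".format(recall))
print("precision {:.3f}".format(precision))
print("baseline  {:.3f}".format(baseline))
```

On these counts the model recovers about half of the highly rated movies (recall ≈ 0.50) with precision ≈ 0.78, against a majority-class baseline accuracy of ≈ 0.67.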


### Conclusions

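Overall, the random forest did not improve much when given the gradient boosting classifier's default hyperparameters (`max_depth=3`, `n_estimators=100`, `max_features=None`): test accuracy rose from 0.786 to 0.810. The fitted gradient boosting model itself was never scored on the test set, so this compares two random forest configurations rather than random forest against gradient boosting. Still, the scores are reasonable for a first attempt. I think more features could have been engineered from the description and writers.

However, adding those features might also cause overfitting, so further analysis would need to be done.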
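One simple way to start that analysis is to compare accuracy on the training set with accuracy on the held-out set. The sketch below does this on synthetic data from `make_classification`, which is only a stand-in for the movie features; the sizes, seeds, and variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the movie dataset)
X_toy, y_toy = make_classification(n_samples=400, n_features=50,
                                   n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.30,
                                          random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)  # accuracy on data the model has seen
test_acc = model.score(X_te, y_te)   # accuracy on held-out data

# A large positive gap suggests the model is memorizing the training set
gap = train_acc - test_acc
print("train {:.3f}  test {:.3f}  gap {:.3f}".format(train_acc, test_acc, gap))
```

Repeating this comparison before and after adding new features would show whether they help on unseen data or only widen the gap.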