Congrats! You just graduated UVA's BSDS program and got a job working at a movie studio in Hollywood. 

Your boss is the head of the studio and wants to know if they can gain a competitive advantage by predicting new movies that might get high imdb scores (movie rating). 

You would like to be able to explain the model to mere mortals but need a fairly robust and flexible approach so you've chosen to use decision trees to get started. 

In doing so, similar to  great data scientists of the past you remembered the excellent education provided to you at UVA in a undergrad data science course and have outline 20ish steps that will need to be undertaken to complete this task. As always, you will need to make sure to #comment your work heavily. 

 Footnotes: 
-	You can add or combine steps if needed
-	Also, remember to try several methods during evaluation and always be mindful of how the model will be used in practice.
- Make sure all your variables are the correct type (factor, character,numeric, etc.)

In [41]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import graphviz
from sklearn.model_selection import train_test_split,GridSearchCV,RepeatedStratifiedKFold
from sklearn import metrics
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.preprocessing import OrdinalEncoder
# from sklearn.tree import DecisionTreeClassifier, export_graphviz 
from sklearn.tree import plot_tree

In [225]:
#1. Load the data
#Sometimes need to set the working directory back out of a folder that we create a file in

#import os
#os.listdir()
#print(os.getcwd())
#os.chdir('c:\\Users\\Brian Wright\\Documents\\3001Python\\DS-3001')

movie_metadata=pd.read_csv("../data/movie_metadata.csv")
movie_metadata.head()


Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


#2 Ensure all the variables are classified correctly including the target variable and collapse factor variables as needed.

In [226]:
def preprocess(df):
    # Drop Columns with too many unique values
    df = df.drop(columns=['actor_1_name','genres','actor_2_name','color','content_rating','actor_3_name','language', 'plot_keywords','movie_imdb_link'])

    # Collapse imdb score into two classes
    df['imdb_score_cat'] = pd.cut(df['imdb_score'], bins=[0, 7, 10], labels=['Low', 'High'])
    df = df.drop(columns=['imdb_score'])

    # Classify director names mentioned more than 10 times 
    director_counts = df['director_name'].value_counts()
    df['director_name'] = df['director_name'].apply(lambda x: 'popular' if director_counts.get(x, 0) > 10 else 'other')

    # Classify country mentioned more than 10 times 
    country_counts = df['country'].value_counts()
    df['country'] = df['country'].apply(lambda x: 'popular' if director_counts.get(x, 0) > 10 else 'other')

    # Separate genres into different columns
    #genre_dummies = df['genres'].str.get_dummies(sep='|')
    #df = pd.concat([df, genre_dummies], axis=1)
    #df.drop(columns=['genres'], inplace=True)

    # Convert factor variables to categorical
    df['imdb_score_cat'] = df['imdb_score_cat'].astype('category')
    df['director_name'] = df['director_name'].astype('category')
    df['country'] = df['country'].astype('category')
    
    # Encode categorical to become a continuous variable
    df = pd.get_dummies(df, columns=['imdb_score_cat', 'director_name','country'], drop_first=True)
    
    return df

movie_metadata2 = preprocess(movie_metadata)

#3 Check for missing variables and correct as needed. Once you've completed the cleaning again create a function that will do this for you in the future. In the submission, include only the function and the function call.

In [227]:
def fix_missing(df):
    # Drop Columns with too many missing values 
    df = df.dropna(thresh = int((1 - 0.4) * len(movie_metadata)), axis=1)
    
    # Drop missing values
    df = df.dropna()
    
    return df

movie_metadata3 = fix_missing(movie_metadata2)

# Drop movie title
movie_metadata4 = movie_metadata3.drop(columns=['movie_title'])

#4 Guess what, you don't need to scale the data, because DTs don't require this to be done, they make local greedy decisions...keeps getting easier, go to the next step.

#5 Determine the baserate or prevalence for the classifier, what does this number mean?

In [228]:
print(movie_metadata4['imdb_score_cat_High'].value_counts()[1] / movie_metadata4['imdb_score_cat_High'].count())

0.3064982899237043


  print(movie_metadata4['imdb_score_cat_High'].value_counts()[1] / movie_metadata4['imdb_score_cat_High'].count())


This number means...

#6 Split your data into test, tune, and train. (80/10/10)

In [229]:
def split_data(df, target, train_size=0.80, tune_size=0.50, random_state=21):
    # Split independent and dependent variables
    X = df.drop(columns=target)
    y = df[target]
    
    # Split data into training and testing sets
    X_train, X_temp, y_train, y_temp = train_test_split(X, y, train_size=train_size, stratify=y, random_state=random_state)
    
    # Split the temporary set into tuning and testing sets
    X_tune, X_test, y_tune, y_test = train_test_split(X_temp, y_temp, train_size=tune_size, stratify=y_temp, random_state=random_state+28)
    
    return X_train, X_tune, X_test, y_train, y_tune, y_test

X_train, X_tune, X_test, y_train, y_tune, y_test = split_data(movie_metadata4, 'imdb_score_cat_High')

#7 Create the kfold object for cross validation.

In [230]:
kf = RepeatedStratifiedKFold(n_splits=10,n_repeats =5, random_state=42)

#8 Create the scoring metric you will use to evaluate your model and the max depth hyperparameter 

In [231]:
scoring = ['roc_auc','recall','balanced_accuracy'] # scoring metrics
param={"max_depth" : [1,2,3,4,5,6,7,8,9,10,11]} # hyperparameter space up to 11

#9 Build the classifier object 

In [232]:
from sklearn.tree import DecisionTreeClassifier
cl= DecisionTreeClassifier(criterion='gini', random_state=1000)

#10 Use the kfold object and the scoring metric to find the best hyperparameter value for max depth via the grid search method.

In [233]:
search = GridSearchCV(cl, param, scoring=scoring, n_jobs=1, cv=kf,refit='roc_auc', verbose = 3)

#11 Fit the model to the training data.

In [234]:
model = search.fit(X_train, y_train)

Fitting 50 folds for each of 11 candidates, totalling 550 fits
[CV 1/50] END max_depth=1; balanced_accuracy: (test=0.666) recall: (test=0.484) roc_auc: (test=0.666) total time=   0.0s
[CV 2/50] END max_depth=1; balanced_accuracy: (test=0.652) recall: (test=0.452) roc_auc: (test=0.652) total time=   0.0s
[CV 3/50] END max_depth=1; balanced_accuracy: (test=0.701) recall: (test=0.548) roc_auc: (test=0.701) total time=   0.0s
[CV 4/50] END max_depth=1; balanced_accuracy: (test=0.713) recall: (test=0.516) roc_auc: (test=0.713) total time=   0.0s
[CV 5/50] END max_depth=1; balanced_accuracy: (test=0.647) recall: (test=0.366) roc_auc: (test=0.647) total time=   0.0s
[CV 6/50] END max_depth=1; balanced_accuracy: (test=0.622) recall: (test=0.344) roc_auc: (test=0.622) total time=   0.0s
[CV 7/50] END max_depth=1; balanced_accuracy: (test=0.665) recall: (test=0.430) roc_auc: (test=0.665) total time=   0.0s
[CV 8/50] END max_depth=1; balanced_accuracy: (test=0.664) recall: (test=0.484) roc_auc: (

#12 What is the best depth value?

In [235]:
best = model.best_estimator_
print(best)

DecisionTreeClassifier(max_depth=5, random_state=1000)


#13 Print out the model

In [242]:
auc = model.cv_results_['mean_test_roc_auc']
recall= model.cv_results_['mean_test_recall']
bal_acc= model.cv_results_['mean_test_balanced_accuracy']

SDauc = model.cv_results_['std_test_roc_auc']
SDrecall= model.cv_results_['std_test_recall']
SDbal_acc= model.cv_results_['std_test_balanced_accuracy']

# Parameter:
depth= np.unique(model.cv_results_['param_max_depth']).data

# DataFrame:
final_model = pd.DataFrame(list(zip(depth, auc, recall, bal_acc,SDauc,SDrecall,SDbal_acc)),
               columns =['depth','auc','recall','bal_acc','aucSD','recallSD','bal_accSD'])

print(final_model.head())

   depth       auc    recall   bal_acc     aucSD  recallSD  bal_accSD
0      1  0.665356  0.445784  0.665356  0.024946  0.056704   0.024946
1      2  0.710100  0.435461  0.661758  0.023997  0.067053   0.025379
2      3  0.775416  0.404701  0.663928  0.026539  0.070622   0.027415
3      4  0.797080  0.480483  0.692707  0.023821  0.083655   0.031398
4      5  0.803594  0.502313  0.701438  0.026457  0.074536   0.028477


#14 View the results, comment on how the model performed using the metrics you selected.

#15 Which variables appear to be contributing the most (variable importance) 

In [237]:
var_imp = pd.DataFrame(best.feature_importances_, index=X_train.columns, columns=['importance']).sort_values('importance', ascending=False)
print(var_imp)

                           importance
num_voted_users              0.558223
budget                       0.178588
duration                     0.147347
actor_3_facebook_likes       0.050437
gross                        0.040728
facenumber_in_poster         0.009105
movie_facebook_likes         0.007255
title_year                   0.004501
actor_2_facebook_likes       0.003816
actor_1_facebook_likes       0.000000
num_critic_for_reviews       0.000000
director_facebook_likes      0.000000
cast_total_facebook_likes    0.000000
num_user_for_reviews         0.000000
aspect_ratio                 0.000000
director_name_popular        0.000000


#16 Use the predict method on the tune data and print out the results.

In [238]:
pred_prob = model.predict_proba(X_tune)[:,1]
print(pred_prob[:10])
print(pred_prob.shape)

[0.05644302 0.05644302 0.95454545 0.05644302 0.05644302 0.6993007
 0.05644302 0.17107943 0.56730769 0.95454545]
(380,)


#17 How does the model perform on the tune data?

#18 Print out the confusion matrix for the tune data, what does it tell you about the model?

In [239]:
def adjust_thres(x, y, z):
    """
    x=pred_probabilities
    y=threshold
    z=tune_outcome
    """
    thres = pd.DataFrame({'new_preds': [True if i > y else False for i in x]})
    thres.new_preds = thres.new_preds.astype('category')
    con_mat = metrics.confusion_matrix(z, thres)  
    print(con_mat)


adjust_thres(pred_prob, 0.5, y_tune)

[[238  26]
 [ 61  55]]


#19 What are the top 3 movies based on the test set? Which variables are most important in predicting the top 3 movies?

In [240]:
# Top 3 Movies
movie_metadata3['pred_prob'] = model.predict_proba(movie_metadata4.drop(columns='imdb_score_cat_High'))[:,1]
sorted_dataset = movie_metadata3.sort_values(by='pred_prob', ascending=False)
sorted_dataset['movie_title'].head(3)

911    Catch Me If You Can 
836           Forrest Gump 
927                  Shrek 
Name: movie_title, dtype: object

In [241]:
# Most important variables
top_3_movies = sorted_dataset.head(3)
top_3_features = X_train.loc[top_3_movies.index]
var_imp2 = pd.DataFrame(best.feature_importances_, index=top_3_features.columns, columns=['importance']).sort_values('importance', ascending=False)
print(var_imp2)

                           importance
num_voted_users              0.558223
budget                       0.178588
duration                     0.147347
actor_3_facebook_likes       0.050437
gross                        0.040728
facenumber_in_poster         0.009105
movie_facebook_likes         0.007255
title_year                   0.004501
actor_2_facebook_likes       0.003816
actor_1_facebook_likes       0.000000
num_critic_for_reviews       0.000000
director_facebook_likes      0.000000
cast_total_facebook_likes    0.000000
num_user_for_reviews         0.000000
aspect_ratio                 0.000000
director_name_popular        0.000000


#20 Use a different hyperparameter for the grid search function and go through the process above again using the tune set.

#21 Did the model improve with the new hyperparameter search?

#22 Using the better model, predict the test data and print out the results.

In [243]:
param2 = {
    "max_depth": [3, 5, 7, 10],  # Shorten tree depth to avoid overfitting
    "min_samples_split": [2, 5, 10],  # Establishing splits
    "min_samples_leaf": [1, 5, 10],  # Allowing smaller leaves
    "max_features": ["sqrt", "log2"],  # Limiting features 
    "max_leaf_nodes": [None, 50, 100],  # Constrain leaf nodes
    "ccp_alpha": [0.0, 0.0001, 0.001],  # Pruning unnecessary nodes
    "class_weight": ["balanced", None]  # Handle class imbalance
}
cl= DecisionTreeClassifier(criterion='gini', random_state=1000)
search = GridSearchCV(cl, param2, scoring=scoring, n_jobs=1, cv=kf,refit='roc_auc', verbose = 3)
model2 = search.fit(X_tune, y_tune)
best = model2.best_estimator_

Fitting 50 folds for each of 1296 candidates, totalling 64800 fits
[CV 1/50] END ccp_alpha=0.0, class_weight=balanced, max_depth=3, max_features=sqrt, max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2; balanced_accuracy: (test=0.715) recall: (test=0.727) roc_auc: (test=0.734) total time=   0.0s
[CV 2/50] END ccp_alpha=0.0, class_weight=balanced, max_depth=3, max_features=sqrt, max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2; balanced_accuracy: (test=0.616) recall: (test=0.455) roc_auc: (test=0.625) total time=   0.0s
[CV 3/50] END ccp_alpha=0.0, class_weight=balanced, max_depth=3, max_features=sqrt, max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2; balanced_accuracy: (test=0.625) recall: (test=0.545) roc_auc: (test=0.658) total time=   0.0s
[CV 4/50] END ccp_alpha=0.0, class_weight=balanced, max_depth=3, max_features=sqrt, max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2; balanced_accuracy: (test=0.571) recall: (test=0.364) roc_auc: (test

KeyboardInterrupt: 

In [202]:
auc = model2.cv_results_['mean_test_roc_auc']
recall= model2.cv_results_['mean_test_recall']
bal_acc= model2.cv_results_['mean_test_balanced_accuracy']

SDauc = model2.cv_results_['std_test_roc_auc']
SDrecall= model2.cv_results_['std_test_recall']
SDbal_acc= model2.cv_results_['std_test_balanced_accuracy']

# Parameter:
depth= np.unique(model2.cv_results_['param_max_depth']).data

# DataFrame:
final_model = pd.DataFrame(list(zip(depth, auc, recall, bal_acc,SDauc,SDrecall,SDbal_acc)),
               columns =['depth','auc','recall','bal_acc','aucSD','recallSD','bal_accSD'])

print(final_model.head())

   depth  auc  recall  bal_acc  aucSD  recallSD  bal_accSD
0      3  0.5     0.0      0.5    0.0       0.0        0.0
1      5  0.5     0.0      0.5    0.0       0.0        0.0
2      7  0.5     0.0      0.5    0.0       0.0        0.0


#23 Summarize what you learned along the way and make recommendations to your boss on how this could be used moving forward, being careful not to over promise.