# Movie Classification Team 11

# Modeling and Analysis

### Team Members:
Andrew Lund, Nicholas Morgan, Amay Umradia, Charles Webb

The main purpose of this notebook is to systematically evaluate many models and store their results in an easy-to-manipulate format. We begin by creating a function to fit each model and store its results, and then analyze those results.

**This notebook accomplishes two primary tasks:**
1. Systematically evaluate many models and store their results in an easy-to-manipulate format. We achieve this by looping through multiple dictionaries to evaluate on a common function.
2. Analyze the results of the models to identify patterns between data sources, models, and predictors.

In [192]:
#import libraries
import numpy as np
import pandas as pd
from ast import literal_eval
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import average_precision_score
from sklearn.metrics import classification_report, recall_score, precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import make_scorer
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

Since we will be trying a number of different models, we decided to create a function that will store all of the outcomes of interest in a dictionary. The function below does an 80/20 train/test split. We also fit all of our models using a 50/50 split, but had better accuracy when using 80/20.
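The 80/20 split can be sketched without scikit-learn. The snippet below is a minimal illustration, not the `train_test_split` implementation: the helper `split_80_20` and the toy data are hypothetical. It shows why splitting the predictors and the response with the same seed keeps their rows paired.

```python
import random

def split_80_20(rows, seed=9001, test_size=0.2):
    """Shuffle row indices with a fixed seed and hold out 20% of them.

    Hypothetical helper for illustration; scikit-learn's shuffling differs in detail.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)       # same seed -> same permutation
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

# Toy predictors and responses; row i of each describes the same movie.
predictors = ['plot {}'.format(i) for i in range(10)]
response = ['genres {}'.format(i) for i in range(10)]

train_x, test_x = split_80_20(predictors)
train_y, test_y = split_80_20(response)

# Both calls use the same seed, so the held-out rows still line up.
aligned = all(x.split()[1] == y.split()[1] for x, y in zip(test_x, test_y))
print(len(train_x), len(test_x), aligned)
```

If the two calls used different seeds, the predictor rows and response rows would no longer refer to the same movies.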

In [193]:
def evaluate_model(model, predictors, response, cv=False, params=None):
    """
    evaluate_model()
    
    -splits the predictors & response variables into train and test sets. 
    -creates a dictionary of model outcomes that are of interest
    -if specified, this function will use cross-validation to determine the optimal parameters for a given model
    
    inputs:
        -model: a model object to be fitted
        -predictors: an array, series, or dataframe of predictor variable(s)
        -response: an array or series of the response variable
        -cv: whether or not to cross-validate the model's parameters (default=False)
        -params: if cv=True, params are required to indicate what parameters to optimize in the given model (default=None)
        
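    outputs:
        -a results dictionary containing the following:
            -fitted_model: the fitted OneVsRestClassifier
            -train_yhat, test_yhat: predictions on the train and test sets
            -train/test recall and precision scores (weighted averages)
            -train/test classification reports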
    
    """
    results = {}
    #split predictors and response together so that the rows stay aligned
    train_x, test_x, train_y, test_y = train_test_split(predictors, response, test_size=0.2, random_state=9001)
    
    if cv:
        model = GridSearchCV(model, params, scoring=make_scorer(f1_score, average='micro'))
    
    classif = OneVsRestClassifier(model)
    classif.fit(train_x, train_y)
    
    train_yhat = classif.predict(train_x)
    test_yhat = classif.predict(test_x)
    
    results['fitted_model'] = classif
    
    results['train_yhat'] = train_yhat
    results['test_yhat'] = test_yhat
    
    results['train_recall_score'] = recall_score(train_y, train_yhat, average='weighted')
    results['test_recall_score'] = recall_score(test_y, test_yhat, average='weighted')
    
    results['train_precision_score'] = precision_score(train_y, train_yhat,average='weighted')
    results['test_precision_score'] = precision_score(test_y, test_yhat,average='weighted')
    
    results['train_classification_report'] = classification_report(train_y, train_yhat,target_names=target_names)
    results['test_classification_report'] = classification_report(test_y, test_yhat,target_names=target_names)
    
    return results

When we created the [MultiLabel Binarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html#sklearn.preprocessing.MultiLabelBinarizer) array in a previous notebook, it created 18 classes corresponding to the 18 genres found in our dataset. These classes were labeled as class 0 through class 17, sorted in numeric order of the genre id. In order to improve readability of our report, we will set the target_names to be the name of the genre, rather than the id. The target names need to be in the same order as the MultiLabel Binarizer, so we will do two steps:

    1: Sort the keys of the id_to_genre dictionary in ascending numeric order
    2: Use the sorted keys against the id_to_genre dictionary to create the ordered target_names
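The two steps can be sketched as follows. The dictionary below is a toy subset with illustrative ids, not the contents of `data/id_to_genre.json`, and the names `id_to_genre_raw` and `target_names_toy` are hypothetical. It also shows why the keys should be integers before sorting: string keys sort lexicographically.

```python
# Toy subset of an id -> genre mapping (illustrative values only).
# JSON object keys are always strings, so they arrive as str.
id_to_genre_raw = {'10402': 'Music', '12': 'Adventure',
                   '878': 'Science Fiction', '18': 'Drama'}

# Sorting the string keys directly gives lexicographic, not numeric, order.
string_order = sorted(id_to_genre_raw)
print(string_order)

# Convert the keys to int first, then apply the two steps.
id_to_genre_toy = {int(key): value for key, value in id_to_genre_raw.items()}
sorted_ids = sorted(id_to_genre_toy)                              # step 1
target_names_toy = [id_to_genre_toy[i] for i in sorted_ids]       # step 2
print(target_names_toy)
```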

In [194]:
import json
id_to_genre = json.load(open('data/id_to_genre.json'))

id_to_genre = {int(key):value for key, value in id_to_genre.items()} #convert string keys to int keys

In [195]:
target_names = json.load(open('data/target_names.json'))['tmdb']

target_names

['Adventure',
 'Fantasy',
 'Animation',
 'Drama',
 'Horror',
 'Action',
 'Comedy',
 'History',
 'Western',
 'Thriller',
 'Crime',
 'Science Fiction',
 'Mystery',
 'Music',
 'Romance',
 'Family',
 'War',
 'TV Movie']

Next we will load in our arrays to be used as predictor variables.

In [14]:
tmdb_bow = np.load('data/tmdb_bow.npy')
imdb_bow = np.load('data/imdb_bow.npy')
combined_bow = np.load('data/combined_bow.npy')

Our Word2vec and Doc2vec arrays require 2 additional steps in order to be used as predictors:

    1: They need to be converted to an array of lists. They are currently an array of arrays, which is incompatible with the structure of the response variable (an array of lists).
    2: The values need to be scaled to between 0 and 1. When the values were negative, fitting raised an error; the likely cause is that `MultinomialNB` does not accept negative feature values.
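The scaling step is per-column min-max scaling. The sketch below is a plain-Python illustration of that formula, `(x - min) / (max - min)`, applied column by column; the helper name and the toy values are hypothetical, and mapping a constant column to 0.0 is an assumption made here to avoid dividing by zero.

```python
def min_max_scale_columns(rows):
    """Rescale each column of a list of equal-length rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (value - low) / (high - low) if high > low else 0.0   # constant column -> 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return scaled

# Hypothetical embedding values, including negatives.
toy_vectors = [[-1.0, 0.5], [0.0, 1.5], [1.0, 2.5]]
scaled_vectors = min_max_scale_columns(toy_vectors)
print(scaled_vectors)
```

After scaling, every value is non-negative, which is what a count-based model such as Naive-Bayes expects.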

In [15]:
tmdb_w2v_mean = np.load('data/tmdb_w2v_mean.npy')
imdb_w2v_mean = np.load('data/imdb_w2v_mean.npy')
combined_w2v_mean = np.load('data/combined_w2v_mean.npy')

tmdb_doc_vec = np.load('data/tmdb_doc_vec.npy')
imdb_doc_vec = np.load('data/imdb_doc_vec.npy')
combined_doc_vec = np.load('data/combined_doc_vec.npy')

tmdb_w2v_mean = np.apply_along_axis(lambda x: list(x), 0, tmdb_w2v_mean)
imdb_w2v_mean = np.apply_along_axis(lambda x: list(x), 0, imdb_w2v_mean)
combined_w2v_mean = np.apply_along_axis(lambda x: list(x), 0, combined_w2v_mean)

tmdb_doc_vec = np.apply_along_axis(lambda x: list(x), 0, tmdb_doc_vec)
imdb_doc_vec = np.apply_along_axis(lambda x: list(x), 0, imdb_doc_vec)
combined_doc_vec = np.apply_along_axis(lambda x: list(x), 0, combined_doc_vec)

In [16]:
from sklearn.preprocessing import MinMaxScaler
scale = MinMaxScaler()

#word2vec scaling (the scaler is re-fit on each array)
tmdb_w2v_mean = scale.fit_transform(tmdb_w2v_mean)
imdb_w2v_mean = scale.fit_transform(imdb_w2v_mean)
combined_w2v_mean = scale.fit_transform(combined_w2v_mean)

#doc2vec scaling
tmdb_doc_vec = scale.fit_transform(tmdb_doc_vec)
imdb_doc_vec = scale.fit_transform(imdb_doc_vec)
combined_doc_vec = scale.fit_transform(combined_doc_vec)

Next we load the response variable.

In [17]:
binary_tmdb = np.load('data/binary_tmdb.npy')

We now have all of the parameters necessary to run our function. We will create models using 9 different predictors:

    1: Bag of words using the TMDB plot
    2: Bag of words using the IMDB plot
    3: Bag of words using the combined plots from both sources
    4: Word2Vectors using the TMDB plot
    5: Word2Vectors using the IMDB plot
    6: Word2Vectors using the combined plots from both sources
    7: Doc2Vectors using the TMDB plot
    8: Doc2Vectors using the IMDB plot
    9: Doc2Vectors using the combined plots from both sources
    

We will use these predictors to create 3 classification models to predict movie genres:

    1: Naive-Bayes, with a cross-validated smoothing parameter.
    2: A linear classifier trained with Stochastic Gradient Descent, with a cross-validated regularization multiplier.
    3: Support Vector Machines, with a cross-validated penalty parameter of the error term.
    
    
This will result in 27 total models being created. We will store the results of each model in a dictionary, which will allow us to identify the best-performing models.    
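The count of 27 is simply every model paired with every predictor. The short sketch below enumerates the combinations using the same `'{0}-{1}'` key format that the results dictionary uses; the list names are hypothetical and no models are fitted here.

```python
from itertools import product

model_names = ['Naive-Bayes', 'SGD', 'SVC']
predictor_names = ['{0}_{1}'.format(source, kind)
                   for kind in ('bow', 'w2v_mean', 'doc_vec')
                   for source in ('tmdb', 'imdb', 'combined')]

# One key per (model, predictor) pair, e.g. 'Naive-Bayes-tmdb_bow'.
result_keys = ['{0}-{1}'.format(m, p) for m, p in product(model_names, predictor_names)]
print(len(result_keys))
print(result_keys[0])
```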

In [190]:
modelDict = {'Naive-Bayes':{'model':MultinomialNB(),
                           'params':{'alpha':[0.01,0.1,1.0]}},
            
            'SGD':{'model':SGDClassifier(loss='hinge',penalty='l2',max_iter=5,tol=None,random_state=9001), #n_iter was replaced by max_iter in newer scikit-learn
                   'params':{'alpha':[0.01,0.1,1.0]}},
            
            'SVC':{'model':SVC(class_weight='balanced', kernel='linear'),
                   'params':{'C':[0.01,0.1,1.0]}}
           }

predictorDict = {
                 'tmdb_bow':tmdb_bow,
                 'imdb_bow':imdb_bow,
                 'combined_bow':combined_bow,
                 'tmdb_w2v_mean':tmdb_w2v_mean,
                 'imdb_w2v_mean':imdb_w2v_mean,
                 'combined_w2v_mean':combined_w2v_mean,
                 'tmdb_doc_vec':tmdb_doc_vec,
                 'imdb_doc_vec':imdb_doc_vec,
                 'combined_doc_vec':combined_doc_vec
                }

sklearn returns a warning when the function above computes weighted-average metrics and some labels have no predicted samples. The warning does not change how our metrics are computed (the precision for such a label is simply set to 0.0), but it produces repetitive output when run in a loop. The warning, which we are choosing to ignore, is as follows:

```
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
```

In [53]:
resultsDict = {}
import warnings
with warnings.catch_warnings(): #temporarily ignore the warnings described above
    warnings.simplefilter("ignore")
    for model in modelDict:
        for predictor in predictorDict:
            resultsDict['{0}-{1}'.format(model,predictor)] = evaluate_model(model = modelDict[model]['model'],
                                                                            predictors = predictorDict[predictor], 
                                                                            response = binary_tmdb,
                                                                            cv=True,
                                                                            params=modelDict[model]['params'])


This next cell will be removed when we submit our final report. For now it temporarily stores the results of the models, so that we do not have to re-run the cell above.

In [54]:
import pickle #trying to use pickle to see if the model object can be retained while saving, as opposed to json which drops it

with open('data/resultsDict.sav','wb') as f: #close the file so the pickle is fully written
    pickle.dump(resultsDict, f)

In [2]:
#start_here

import pickle

resultsDict = pickle.load(open('data/resultsDict.sav','rb'))

Next we will make a dataframe of the scores for the sake of readability.

In [3]:
scores = ['train_recall_score','test_recall_score',
          'train_precision_score','test_precision_score']

results_df = pd.DataFrame(resultsDict)

# Accuracy Metrics


**Precision Score, Recall Score:**

As shown in the cell above, we are interested in the precision score as well as the recall score. 


The precision score can be defined as `tp / (tp + fp)`, where `tp` represents true-positives and `fp` represents false-positives.

The recall score can be defined as `tp / (tp + fn)`, where `tp` represents true-positives and `fn` represents false-negatives.


In other words, the precision score penalizes false-positives, whereas the recall score penalizes false-negatives. Both scores range from 0 to 1, with 1 being the best possible score.
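The two definitions can be checked on a toy label. The sketch below is a plain-Python illustration for a single binary label, not the scikit-learn implementation; the helper name and the example vectors are hypothetical. It returns 0.0 when a ratio is undefined, which mirrors the behavior described in the warning quoted earlier.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for one binary label; 0.0 when the ratio is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0]
cautious = [1, 0, 0, 0, 0, 0]   # predicts the label rarely: no false-positives, many false-negatives
eager = [1, 1, 1, 1, 1, 0]      # predicts the label often: no false-negatives, some false-positives

print(precision_recall(y_true, cautious))
print(precision_recall(y_true, eager))
```

The cautious predictor has perfect precision but low recall, and the eager predictor has the opposite profile.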


**F1-Score:**

While not included in our results_df dataframe, this outcome is included in the `classification_report`, which is shown a little further below. The F1 score can be defined as:

`F1 = 2 * (precision * recall) / (precision + recall)`
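As a quick check of the formula, the sketch below recomputes one f1-score from a precision and recall pair. The inputs (precision 0.64, recall 1.00) are taken from the Drama row of a classification report printed later in this notebook; the function name is hypothetical, and returning 0.0 when both inputs are 0 is an assumption made to avoid dividing by zero.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)

drama_f1 = f1(0.64, 1.00)
print(round(drama_f1, 2))
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low.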

In [4]:
results_df = results_df.loc[results_df.index.isin(scores)]

In [5]:
results_df = results_df.transpose()
results_df

| Model | test_precision_score | test_recall_score | train_precision_score | train_recall_score |
|---|---|---|---|---|
| Naive-Bayes-combined_bow | 0.725255 | 0.377953 | 0.982521 | 0.926914 |
| Naive-Bayes-combined_doc_vec | 0.166339 | 0.255906 | 0.163924 | 0.254321 |
| Naive-Bayes-combined_w2v_mean | 0.456782 | 0.267717 | 0.583491 | 0.274074 |
| Naive-Bayes-imdb_bow | 0.596821 | 0.340551 | 0.948176 | 0.780741 |
| Naive-Bayes-imdb_doc_vec | 0.166339 | 0.255906 | 0.163924 | 0.254321 |
| Naive-Bayes-imdb_w2v_mean | 0.178442 | 0.253937 | 0.605336 | 0.262716 |
| Naive-Bayes-tmdb_bow | 0.619958 | 0.301181 | 0.970122 | 0.79358 |
| Naive-Bayes-tmdb_doc_vec | 0.166339 | 0.255906 | 0.163924 | 0.254321 |
| Naive-Bayes-tmdb_w2v_mean | 0.316826 | 0.261811 | 0.494266 | 0.265185 |
| SGD-combined_bow | 0.166339 | 0.255906 | 0.163924 | 0.254321 |


You may have noticed that many of the models listed above have identical scores for all outcomes. Let's extract those models specifically:

In [6]:
duplicate_scores = pd.concat(group for _, group in results_df.groupby((scores)) if len(group) > 1)
duplicate_scores

| Model | test_precision_score | test_recall_score | train_precision_score | train_recall_score |
|---|---|---|---|---|
| Naive-Bayes-combined_doc_vec | 0.166339 | 0.255906 | 0.163924 | 0.254321 |
| Naive-Bayes-imdb_doc_vec | 0.166339 | 0.255906 | 0.163924 | 0.254321 |
| Naive-Bayes-tmdb_doc_vec | 0.166339 | 0.255906 | 0.163924 | 0.254321 |
| SGD-combined_bow | 0.166339 | 0.255906 | 0.163924 | 0.254321 |
| SGD-combined_doc_vec | 0.166339 | 0.255906 | 0.163924 | 0.254321 |
| SGD-imdb_bow | 0.166339 | 0.255906 | 0.163924 | 0.254321 |
| SGD-imdb_doc_vec | 0.166339 | 0.255906 | 0.163924 | 0.254321 |
| SGD-tmdb_bow | 0.166339 | 0.255906 | 0.163924 | 0.254321 |
| SGD-tmdb_doc_vec | 0.166339 | 0.255906 | 0.163924 | 0.254321 |
| SVC-combined_doc_vec | 0.186744 | 0.525591 | 0.194323 | 0.532346 |


There are 2 sets of duplicate scores. One is shared by 9 models, and the other by 2 models. We will explain these one at a time, beginning with the group of 9 models.

In [7]:
group_1 = list(duplicate_scores.index.values[:9])
group_2 = list(duplicate_scores.index.values[9:])

The classification report shows the issue occurring in group 1. Rather than printing 2 reports for each of the 9 models, we check in code that all 9 models have identical classification reports. The loop below prints a message only if it finds a mismatch.

In [57]:
for model in range(len(group_1)):
    if resultsDict[group_1[0]]['train_classification_report'] != resultsDict[group_1[model]]['train_classification_report']:
        print('Train classification reports do not match between model 0 and model {}'.format(model))
    
    if resultsDict[group_1[0]]['test_classification_report'] != resultsDict[group_1[model]]['test_classification_report']:
        print('Test classification reports do not match between model 0 and model {}'.format(model))

The loop printed nothing, so all 9 models have identical classification reports. We can therefore print a single set of reports to observe what is happening across the whole group.

In [48]:
print(resultsDict[group_1[0]]['train_classification_report'])
print(resultsDict[group_1[0]]['test_classification_report'])

                 precision    recall  f1-score   support

      Adventure       0.00      0.00      0.00       137
        Fantasy       0.00      0.00      0.00        76
      Animation       0.00      0.00      0.00        82
          Drama       0.64      1.00      0.78       515
         Horror       0.00      0.00      0.00        41
         Action       0.00      0.00      0.00       124
         Comedy       0.00      0.00      0.00       189
        History       0.00      0.00      0.00        54
        Western       0.00      0.00      0.00        25
       Thriller       0.00      0.00      0.00       184
          Crime       0.00      0.00      0.00       143
Science Fiction       0.00      0.00      0.00        81
        Mystery       0.00      0.00      0.00        77
          Music       0.00      0.00      0.00        34
        Romance       0.00      0.00      0.00       124
         Family       0.00      0.00      0.00        92
            War       0.00    

Each of these 9 models has collapsed to predicting a single genre: for both the train and test datasets, each model predicts that 100% of the movies are Drama and never predicts any other genre. This is because of the large number of drama movies in the dataset (this is demonstrated in our EDA notebook). Because the train and test scores are almost identical, this behaves more like underfitting than overfitting. Searching a wider range of penalization parameters may help, but that exploration is outside the scope of this analysis. We will instead focus on the models that do not show this behavior, after first exploring the other group of duplicates.
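The low weighted scores of these models follow from how a support-weighted average treats a model that only ever predicts the most common label. The sketch below uses hypothetical numbers (three labels and made-up supports rather than the 18 genres in our data) and a hypothetical helper name; it is an illustration of the weighting, not the scikit-learn code.

```python
def weighted_average(scores, supports):
    """Average per-label scores, weighting each label by its number of true samples."""
    total = sum(supports)
    return sum(score * n for score, n in zip(scores, supports)) / total

# Hypothetical setup: 800 movies, 1000 labels in total, label 0 is the majority label.
supports = [500, 300, 200]

# A model that predicts label 0 for every movie and nothing else:
recall = [1.0, 0.0, 0.0]            # every true label-0 movie is found, no other label is
precision = [500 / 800, 0.0, 0.0]   # 500 of the 800 label-0 predictions are correct

weighted_recall = weighted_average(recall, supports)
weighted_precision = weighted_average(precision, supports)
print(weighted_recall, weighted_precision)
```

The weighted recall equals the majority label's share of all labels, regardless of which data source or model produced the predictions, which is why such models end up with identical scores.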

Again, let's make sure the classification reports are identical between the two models.

In [58]:
if resultsDict[group_2[0]]['train_classification_report'] != resultsDict[group_2[1]]['train_classification_report']:
    print('Train classification reports do not match between model 0 and model 1')
    
if resultsDict[group_2[0]]['test_classification_report'] != resultsDict[group_2[1]]['test_classification_report']:
    print('Test classification reports do not match between model 0 and model 1')
    

The two reports are identical, so let's take a look.

In [59]:
print(resultsDict[group_2[0]]['train_classification_report'])

print(resultsDict[group_2[0]]['test_classification_report'])

                 precision    recall  f1-score   support

      Adventure       0.29      0.72      0.41       137
        Fantasy       0.10      1.00      0.17        76
      Animation       0.10      1.00      0.19        82
          Drama       0.00      0.00      0.00       515
         Horror       0.05      1.00      0.10        41
         Action       0.32      0.73      0.44       124
         Comedy       0.53      0.16      0.24       189
        History       0.14      0.69      0.23        54
        Western       0.11      0.76      0.19        25
       Thriller       0.23      1.00      0.37       184
          Crime       0.18      1.00      0.30       143
Science Fiction       0.28      0.64      0.39        81
        Mystery       0.16      0.77      0.27        77
          Music       0.09      0.76      0.16        34
        Romance       0.45      0.36      0.40       124
         Family       0.24      0.59      0.34        92
            War       0.07    

In [60]:
group_2

['SVC-combined_doc_vec', 'SVC-tmdb_doc_vec']

# Best Models

In [8]:
best_recall = results_df['test_recall_score'].idxmax()
best_precision = results_df['test_precision_score'].idxmax()

In [9]:
results_df.loc[results_df.index.isin([best_recall, best_precision])]

| Model | test_precision_score | test_recall_score | train_precision_score | train_recall_score |
|---|---|---|---|---|
| Naive-Bayes-combined_bow | 0.725255 | 0.377953 | 0.982521 | 0.926914 |
| SVC-combined_w2v_mean | 0.558305 | 0.688976 | 0.742788 | 0.893827 |


Now that we have identified the top models based on precision as well as recall scores, we can take a closer look through the classification report.

In [10]:
print('Results: ', best_precision)
print(resultsDict[best_precision]['train_classification_report'])
print(resultsDict[best_precision]['test_classification_report'])

Results:  Naive-Bayes-combined_bow
                 precision    recall  f1-score   support

      Adventure       0.97      0.88      0.92       137
        Fantasy       1.00      0.72      0.84        76
      Animation       1.00      1.00      1.00        82
          Drama       0.96      0.99      0.98       515
         Horror       1.00      1.00      1.00        41
         Action       0.98      0.88      0.93       124
         Comedy       1.00      0.94      0.97       189
        History       1.00      1.00      1.00        54
        Western       1.00      1.00      1.00        25
       Thriller       0.98      0.85      0.91       184
          Crime       1.00      0.90      0.95       143
Science Fiction       0.99      0.93      0.96        81
        Mystery       0.99      0.99      0.99        77
          Music       1.00      0.82      0.90        34
        Romance       1.00      0.85      0.92       124
         Family       1.00      0.90      0.95      

**Naive-Bayes-combined_bow analysis**

This model appears to be overfit: the train scores for precision and recall are 0.98 and 0.93, respectively, whereas the test scores (0.73 and 0.38) are much lower. This model has the best test precision of all the models, meaning that when it predicts a genre it is usually correct (few false-positives relative to true-positives). However, its recall is lower than that of many of the other models, indicating that it tends to produce false-negatives.

**Genre-Specific Results for this model:**
Using the test dataset, this model identified the Fantasy, Animation, Horror, and War genres with 100% precision. In other words, every time this model predicted one of those genres, it was correct. However, this model was somewhat "hesitant" to make those predictions, which is what caused the recall score to be so low.

This model is best at predicting Drama movies. Although the precision score is lower than the genres listed above, it is less "hesitant" to make drama predictions, as evidenced by the higher recall score. This can be partially explained by the skewed dataset, which had far more drama movies than any other genre. This allowed the model to be fit on a more diverse set of words when compared to other genres.

This model was unable to successfully predict any History, Western, Mystery, or TV Movie genres in the test dataset. This is likely a result of the model being overfit. This claim is supported by the f1-scores in the training set of these genres, which have scores of [1.00, 1.00, 0.99, and 1.00], respectively.

In [11]:
print('Results: ', best_recall)
print(resultsDict[best_recall]['train_classification_report'])
print(resultsDict[best_recall]['test_classification_report'])

Results:  SVC-combined_w2v_mean
                 precision    recall  f1-score   support

      Adventure       0.58      0.73      0.65       137
        Fantasy       0.38      0.68      0.49        76
      Animation       0.89      1.00      0.94        82
          Drama       0.90      0.86      0.88       515
         Horror       0.93      1.00      0.96        41
         Action       0.56      0.92      0.70       124
         Comedy       0.59      0.91      0.71       189
        History       0.90      1.00      0.95        54
        Western       0.78      1.00      0.88        25
       Thriller       0.61      0.86      0.71       184
          Crime       0.62      0.81      0.70       143
Science Fiction       0.89      1.00      0.94        81
        Mystery       0.73      1.00      0.85        77
          Music       0.92      1.00      0.96        34
        Romance       0.70      0.98      0.82       124
         Family       0.88      1.00      0.94        9

**SVC-combined_w2v_mean Analysis:**

Unlike the **Naive-Bayes-combined_bow** model, this model does not have any perfect precision scores in the test dataset. This is because this model is less "hesitant" to make predictions, leading to a larger number of false-positives and thus a lower precision score. Consequently, because it is less "hesitant" to make predictions, there are fewer false-negatives, which results in the higher recall score. In terms of f1-score, this model outperforms the Naive-Bayes model.

**Genre-Specific Results for this model:**
With regard to precision, this model is best at identifying Drama, Western, War, Science Fiction, and Horror. Like the Naive-Bayes model, this model never predicts the TV Movie genre. This is likely caused by the few occurrences of this genre in the dataset: it occurs only 5 times in the training dataset and once in the test dataset.

**Conclusion:**

Although **Naive-Bayes-combined_bow** has a better precision score, it is overfit, as evidenced by the near-perfect f1-scores in the training dataset alongside the much lower test results. When this model *does* make predictions, the predictions tend to be correct, but this model simply does not make many predictions. This is shown by the low recall score, as well as the 4 genres for which it made no successful predictions.

**SVC-combined_w2v_mean** has a lower precision score, but has much higher recall scores and f1-scores. It also shows less overfitting.


# "Hesitation" to make predictions

The theory described above can be checked quantitatively by looking at the `predict_proba` method of each model. In a multilabel classification model, `predict_proba` returns an array giving the probability that an individual observation belongs to each class. For each probability above 0.5, the model predicts that the observation belongs to the corresponding class. Let's look at a simple example to describe this concept.

In [220]:
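best_precision_predictor = predictorDict['combined_bow'] #assumed: this name is not defined in the cells shown
resultsDict[best_precision]['fitted_model'].predict(best_precision_predictor)[0]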

array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])

For observation 0, the array above represents classes 0 through 17. Class 3 and class 10 have a value of 1, indicating that the model predicts this movie to be these two genres. Each remaining class that has a value of 0 is not predicted as a genre.

Thus, in the cell below, we will see that every class besides 3 and 10 will have a value less than 0.5:

In [216]:
resultsDict[best_precision]['fitted_model'].predict_proba(best_precision_predictor)[0]

array([3.04719005e-02, 1.66590112e-02, 5.92908140e-04, 9.36481209e-01,
       7.46583129e-04, 3.35582768e-02, 2.40284760e-02, 5.54701095e-04,
       1.26084883e-04, 1.50322996e-01, 7.33628097e-01, 3.91758746e-03,
       1.74222489e-03, 1.38138201e-02, 9.17876692e-03, 2.87533281e-02,
       8.28736853e-04, 6.20652724e-05])
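The link between the two outputs can be checked directly. The sketch below applies the 0.5 threshold to the probabilities printed above and compares the result with the predicted array shown earlier; the variable names are hypothetical, and the values are copied from the two outputs.

```python
# Probabilities copied from the predict_proba output for observation 0.
probabilities = [3.04719005e-02, 1.66590112e-02, 5.92908140e-04, 9.36481209e-01,
                 7.46583129e-04, 3.35582768e-02, 2.40284760e-02, 5.54701095e-04,
                 1.26084883e-04, 1.50322996e-01, 7.33628097e-01, 3.91758746e-03,
                 1.74222489e-03, 1.38138201e-02, 9.17876692e-03, 2.87533281e-02,
                 8.28736853e-04, 6.20652724e-05]

# Predicted labels copied from the predict output for observation 0.
predicted = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

# A class is predicted whenever its probability exceeds 0.5.
thresholded = [1 if p > 0.5 else 0 for p in probabilities]
predicted_classes = [i for i, p in enumerate(probabilities) if p > 0.5]

print(predicted_classes)
print(thresholded == predicted)
```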

Now that we have reviewed how this method works, we can quantify our argument that the Naive-Bayes model is more "hesitant". Since predictions are made based on probabilities, we can take the mean of the probabilities over all observations and classes in the test dataset to get an overall measure of how readily the model makes a positive prediction. The function below accomplishes this task.

In [210]:
def mean_probability(model, predictor):
    _, test_x = train_test_split(predictor, test_size=0.2, random_state=9001)
    
    return model.predict_proba(test_x).mean()

Running this function on our Naive-Bayes model:

In [212]:
mean_probability(resultsDict[best_precision]['fitted_model'], best_precision_predictor)

0.10182064002264195

One additional step is required before we can make the comparison to our **SVC-combined_w2v_mean** model. By default, fitted SVC models do not expose `predict_proba`; it is only available when the model is created with `probability=True`. We chose to keep the default when looping through our models because it takes much longer to fit a model when this is enabled.

Now that we have identified our best model, we will re-run it through the function with `probability=True`, so that `predict_proba` is available.

In [196]:
best_recall_rerun = {'model':SVC(class_weight='balanced', kernel='linear',
                                probability=True),
                   'params':{'C':[0.01,0.1,1.0]}}
        
        
best_recall_rerun_results = evaluate_model(model = best_recall_rerun['model'],
                        predictors = predictorDict['combined_w2v_mean'], 
                        response = binary_tmdb,
                        cv=True,
                        params=best_recall_rerun['params'])

Now that we have added this attribute, we can make the comparison to our **Naive-Bayes-combined_bow** model.

In [211]:
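best_recall_predictor = predictorDict['combined_w2v_mean'] #assumed: this name is not defined in the cells shown
mean_probability(best_recall_rerun_results['fitted_model'], best_recall_predictor)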

0.14964159310559547

We see that the mean predicted probability of this model is ~47% higher than that of the **Naive-Bayes-combined_bow** model (0.150 vs. 0.102), which is consistent with it being more willing to make a genre prediction.

# Subsetting

Subsetting on the data source:

In [110]:
data_source_results = {}

for group in ['imdb', 'tmdb', 'combined']:
    subset = results_df.filter(like=group, axis=0)

    
    data_source_results[group] = {'min_test_precision': subset['test_precision_score'].min(),
                                'max_test_precision': subset['test_precision_score'].max(),
                                'mean_test_precision':subset['test_precision_score'].mean(),
                                'min_test_recall': subset['test_recall_score'].min(),
                                'max_test_recall': subset['test_recall_score'].max(),
                                'mean_test_recall':subset['test_recall_score'].mean()}

In [111]:
column_order = ['min_test_precision','max_test_precision','mean_test_precision',
                                            'min_test_recall','max_test_recall','mean_test_recall']

pd.DataFrame(data_source_results).transpose()[column_order]

| Data source | min_test_precision | max_test_precision | mean_test_precision | min_test_recall | max_test_recall | mean_test_recall |
|---|---|---|---|---|---|---|
| combined | 0.166339 | 0.725255 | 0.4021 | 0.255906 | 0.688976 | 0.395888 |
| imdb | 0.153214 | 0.607838 | 0.32814 | 0.253937 | 0.685039 | 0.375984 |
| tmdb | 0.166339 | 0.619958 | 0.345431 | 0.255906 | 0.633858 | 0.374016 |


In the EDA notebook, we discovered that the IMDB plot descriptions tend to have more words than the TMDB plot descriptions. We originally suspected that this would lead to higher scores, since the plots are theoretically more descriptive if they are longer. However, the results above demonstrate that this is not the case: the IMDB and TMDB scores are very similar.

We do see that combining the plot descriptions leads to higher scores. What is particularly interesting is that the recall score does not improve as much as the precision score. In other words, combining the plot descriptions leads to considerably fewer false-positives, but has a lesser effect on preventing false-negatives.

In [113]:
predictor_results = {}
for group in ['bow', 'w2v', 'doc_vec']:
    subset = results_df.filter(like=group, axis=0)

    predictor_results[group] = {'min_test_precision': subset['test_precision_score'].min(),
                                'max_test_precision': subset['test_precision_score'].max(),
                                'mean_test_precision':subset['test_precision_score'].mean(),
                                'min_test_recall': subset['test_recall_score'].min(),
                                'max_test_recall': subset['test_recall_score'].max(),
                                'mean_test_recall':subset['test_recall_score'].mean()}

In [114]:
pd.DataFrame(predictor_results).transpose()[column_order]

| Predictor | min_test_precision | max_test_precision | mean_test_precision | min_test_recall | max_test_recall | mean_test_recall |
|---|---|---|---|---|---|---|
| bow | 0.166339 | 0.725255 | 0.466946 | 0.255906 | 0.551181 | 0.37336 |
| doc_vec | 0.153214 | 0.186744 | 0.169415 | 0.255906 | 0.525591 | 0.325241 |
| w2v | 0.178442 | 0.558305 | 0.43931 | 0.253937 | 0.688976 | 0.447288 |


Bag of words has the best precision (both maximum and mean), while Word2vec has the best recall and follows closely on mean precision. Doc2vec has by far the lowest performance of the three, which is expected considering the difference in granularity: Word2vec analyzes the plots on a word-by-word basis, whereas Doc2vec represents the entire plot as a single vector.
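To make the word-by-word idea concrete, the sketch below averages per-word vectors into one plot-level vector. It assumes the `w2v_mean` arrays were built in this way, which the name suggests but which is done in another notebook; the helper name, the toy two-dimensional vectors, and the handling of unknown words are all hypothetical.

```python
def mean_word_vector(tokens, word_vectors, dim):
    """Average the vectors of the tokens that have an embedding; zeros if none do."""
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(values) / len(known) for values in zip(*known)]

# Toy embeddings; real Word2vec vectors have many more dimensions.
toy_word_vectors = {'space': [1.0, 0.0], 'battle': [0.0, 1.0], 'love': [-1.0, 1.0]}

plot_vector = mean_word_vector(['space', 'battle', 'unknown'], toy_word_vectors, dim=2)
print(plot_vector)
```

Each known word contributes equally to the plot vector, whereas a document-level embedding is learned for the plot as a whole.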