# Project description
This project aims to predict the genres of a movie from its overview text. For example, the overview for *The Matrix* is as follows:
>Set in the 22nd century, The Matrix tells the story of a computer hacker who joins a group of underground insurgents fighting the vast and powerful computers who now rule the earth.

From the above text, we would like to predict that the movie belongs to the "Action" and "Science Fiction" genres.
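
Because a movie can belong to several genres at once, this is a multi-label problem. As a rough illustration, the genre lists can be encoded as one binary indicator row per movie. The sketch below uses scikit-learn's `MultiLabelBinarizer` on made-up genre lists; the names `movie_genres` and `Y_example` are hypothetical and are not part of the project's pipeline.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical genre labels for three movies (illustrative only).
movie_genres = [
    ["Action", "Science Fiction"],
    ["Comedy"],
    ["Action", "Comedy"],
]

binarizer = MultiLabelBinarizer()
# Each row is one movie; each column is one genre (1 = movie has that genre).
Y_example = binarizer.fit_transform(movie_genres)

# Columns follow the sorted order of the genre names.
genre_columns = list(binarizer.classes_)
print(genre_columns)
print(Y_example)
```

Each classifier then predicts one such row per overview, and the non-zero columns are mapped back to genre names.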

## Business objective in context
We are an internet-based movie distribution company, _NetFlux_. For new movies and original content, we want our staff to write overviews that reflect each movie's correct genres. This will make our recommender system work better and ultimately help our users find the movies they want to see.



In [None]:
from IPython.display import Markdown as md
import os
from datetime import datetime
import pickle 

movies_with_overviews_path = '../data/processed/movies_with_overviews.pkl'
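date_refreshed_unix = os.path.getmtime(movies_with_overviews_path)
# Use local time for both timestamps so they are directly comparable
# (datetime.utcfromtimestamp is also deprecated as of Python 3.12).
date_refreshed = datetime.fromtimestamp(date_refreshed_unix).strftime('%Y-%m-%d %H:%M:%S')
now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')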

with open(movies_with_overviews_path, 'rb') as f:
    movies_with_overviews = pickle.load(f)
with open('../data/processed/genre_id_to_name_dict.pkl', 'rb') as f:
    Genre_ID_to_name = pickle.load(f)
# Sorted genre IDs; used later to map label-matrix columns back to genre names.
genre_list = sorted(Genre_ID_to_name.keys())

num_movies = len(movies_with_overviews)

display(md('''# Data
Movie overviews and genres are scraped from TMDB. The dataset was last refreshed at **{date_refreshed}**.

This report was generated at **{now}**.

The dataset contains **{num_movies}** movie overviews.

'''.format(date_refreshed=date_refreshed, num_movies=num_movies, now=now)))


The distribution of the genres in these movies is shown in the chart below:

In [None]:
%matplotlib inline
import pandas as pd
from matplotlib import pyplot as plt
from collections import Counter
mwo = pd.DataFrame(movies_with_overviews)
genre_ids_series = mwo['genre_ids']
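# Flatten the per-movie genre ID lists into one list, then map IDs to names.
flat_genre_ids = [genre_id for row in genre_ids_series for genre_id in row]

flat_genre_names = [Genre_ID_to_name[genre_id] for genre_id in flat_genre_ids]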
genre_counts = Counter(flat_genre_names)
df = pd.DataFrame.from_dict(genre_counts, orient='index')
ax = df.plot(kind='bar')
ax.set_ylabel('Counts of each genre')
ax.legend().set_visible(False)


The top 10 movies in our dataset by popularity are listed below:

In [None]:
# Print the titles of the 10 most popular movies, most popular first.
for title in mwo.sort_values(by='popularity', ascending=False)['original_title'].head(10):
    print(title)

# Models and Features

We currently train the following models on the dataset, each with its own feature engineering:
1. C-SVM
    - The overviews are represented as a **bag of words**, vectorized and weighted with **TF-IDF**.
2. Naive Bayes
    - The overviews are represented as a **bag of words** and vectorized with a **Count Vectorizer**.
3. Simple neural network (not deep)
    - The overviews were tokenized with a **white space tokenizer** and stop words were removed. Each overview was treated as a **bag of words**, with each word converted to a vector using the GoogleNews-vectors-negative300.bin model. The **arithmetic mean** of the word vectors represents the overview. The top 3 predicted genres are taken for each movie.



### C-SVM
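
As a rough illustration of the TF-IDF plus C-SVM setup, the sketch below trains on four made-up overviews. It assumes a one-vs-rest wrapper with one linear `SVC` per genre, which this report does not confirm; the actual classifier is loaded from `classifier_svc.pkl` and its training code is not shown here. All data and variable names in the sketch are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

# Made-up overviews and genres, for illustration only.
overviews = [
    "a hacker fights powerful computers that rule the earth",
    "a soldier fights alien robots in a future war",
    "two friends stumble through a hilarious wedding weekend",
    "a clumsy robot causes hilarious chaos in a future city",
]
genres = [
    ["Action", "Science Fiction"],
    ["Action", "Science Fiction"],
    ["Comedy"],
    ["Comedy", "Science Fiction"],
]

# Bag-of-words features weighted by TF-IDF.
vectorizer = TfidfVectorizer(stop_words="english")
X_toy = vectorizer.fit_transform(overviews)

# One binary column per genre.
binarizer = MultiLabelBinarizer()
Y_toy = binarizer.fit_transform(genres)

# Assumption: one linear C-SVM per genre (one-vs-rest).
svm_model = OneVsRestClassifier(SVC(kernel="linear", C=1.0))
svm_model.fit(X_toy, Y_toy)

# Predict a binary genre row for a new overview.
toy_preds = svm_model.predict(vectorizer.transform(["robots fight a war"]))
print(dict(zip(binarizer.classes_, toy_preds[0])))
```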

#### Metrics for each genre

In [None]:
with open('../models/classifier_svc.pkl','rb') as f:
    classif=pickle.load(f)
with open('../data/processed/X_tfidf.pkl','rb') as f:
    X=pickle.load(f)
with open('../data/processed/Y.pkl','rb') as f:
    Y=pickle.load(f)
    
from src.utils.eval_metrics import *
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import numpy as np

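indices = range(len(movies_with_overviews))
X_train, X_test, Y_train, Y_test, train_movies, test_movies = train_test_split(X, Y, indices, test_size=0.20, random_state=42)
# List genre names in sorted-ID order so they line up with the label columns
# (the same genre_list[j] mapping used by the example-prediction cells).
genre_names = [Genre_ID_to_name[genre_id] for genre_id in genre_list]
predstfidf = classif.predict(X_test)
print(classification_report(Y_test, predstfidf, target_names=genre_names))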

#### Precision and Recall for the overall model
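
The overall figures reported in this section are means of per-movie precision and recall. The helper that computes them lives in `src.utils.eval_metrics` and is not shown in this report, so the sketch below is only one plausible reading of that calculation. The function name `per_movie_precision_recall` and the sample data are hypothetical.

```python
import numpy as np


def per_movie_precision_recall(predicted, actual):
    """Return per-movie precision and recall for lists of genre-name lists."""
    precisions, recalls = [], []
    for pred, act in zip(predicted, actual):
        pred_set, act_set = set(pred), set(act)
        hits = len(pred_set & act_set)
        # Guard against empty predictions or empty ground truth.
        precisions.append(hits / len(pred_set) if pred_set else 0.0)
        recalls.append(hits / len(act_set) if act_set else 0.0)
    return precisions, recalls


# Made-up predictions and ground truth for two movies.
toy_predicted = [["Action", "Comedy"], ["Drama"]]
toy_actual = [["Action", "Science Fiction"], ["Drama", "Romance"]]

toy_precs, toy_recs = per_movie_precision_recall(toy_predicted, toy_actual)
toy_prec_mean = np.mean(toy_precs)
toy_rec_mean = np.mean(toy_recs)
print(toy_prec_mean, toy_rec_mean)
```

Averaging per movie weights every movie equally, whereas the per-genre report in the previous cell weights by genre.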

In [None]:
predictions = generate_predictions(Genre_ID_to_name, X_test, predstfidf)
precs, recs = precsc_recs(test_movies, movies_with_overviews, Genre_ID_to_name, predictions)

prec_mean = np.mean(np.asarray(precs))
rec_mean = np.mean(np.asarray(recs))

md('''Precision: {prec_mean}

Recall: {rec_mean}

'''.format(prec_mean=prec_mean, rec_mean=rec_mean))

#### Example predictions for a small sample

In [None]:
# Map each binary label row back to genre names (column j corresponds to genre_list[j]).
predictions = []
actuals = []
for i in range(X_test.shape[0]):
    predictions.append([Genre_ID_to_name[genre_list[j]]
                        for j, flag in enumerate(predstfidf[i]) if flag != 0])
    actuals.append([Genre_ID_to_name[genre_list[j]]
                    for j, flag in enumerate(Y_test[i]) if flag != 0])
# Print every 50th test movie (skipping index 0) as a spot check.
for i in range(50, X_test.shape[0], 50):
    print('MOVIE: ', movies_with_overviews[test_movies[i]]['title'],
          '\nPREDICTION: ', ','.join(predictions[i]),
          '\nActual: ', ','.join(actuals[i]), '\n')

### Naive Bayes
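
As a rough illustration of the count-based features used here, the sketch below builds raw word counts with `CountVectorizer` and fits one `MultinomialNB` per genre. The one-vs-rest wrapper, the choice of the multinomial variant, and all data are assumptions for the example; the actual model is loaded from `classifier_nb.pkl`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

# Made-up overviews, for illustration only.
overviews = [
    "robots fight robots in a future war",
    "a hilarious wedding goes hilariously wrong",
    "a detective hunts robots through a future city",
    "a hilarious robot ruins a wedding",
]
# Hypothetical label columns: Action, Comedy, Science Fiction.
Y_toy = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
])

# Raw term counts, with no TF-IDF reweighting.
count_vectorizer = CountVectorizer()
X_counts = count_vectorizer.fit_transform(overviews)

# "robots" appears twice in the first overview, so its count is 2.
robots_col = count_vectorizer.vocabulary_["robots"]
robots_count = X_counts[0, robots_col]

# Assumption: one multinomial Naive Bayes model per genre.
nb_model = OneVsRestClassifier(MultinomialNB())
nb_model.fit(X_counts, Y_toy)
nb_preds = nb_model.predict(count_vectorizer.transform(["robots in a future war"]))
print(nb_preds)
```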

#### Metrics for each genre

In [None]:
with open('../models/classifier_nb.pkl','rb') as f:
    classif=pickle.load(f)
with open('../data/processed/X.pkl','rb') as f:
    X=pickle.load(f)
with open('../data/processed/Y.pkl','rb') as f:
    Y=pickle.load(f)
    
from src.utils.eval_metrics import *
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import numpy as np

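indices = range(len(movies_with_overviews))
X_train, X_test, Y_train, Y_test, train_movies, test_movies = train_test_split(X, Y, indices, test_size=0.20, random_state=42)
# List genre names in sorted-ID order so they line up with the label columns
# (the same genre_list[j] mapping used by the example-prediction cells).
genre_names = [Genre_ID_to_name[genre_id] for genre_id in genre_list]
preds = classif.predict(X_test)
print(classification_report(Y_test, preds, target_names=genre_names))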

#### Precision and Recall for the overall model

In [None]:
predictions = generate_predictions(Genre_ID_to_name, X_test, preds)
precs, recs = precsc_recs(test_movies, movies_with_overviews, Genre_ID_to_name, predictions)

prec_mean = np.mean(np.asarray(precs))
rec_mean = np.mean(np.asarray(recs))

md('''Precision: {prec_mean}

Recall: {rec_mean}

'''.format(prec_mean=prec_mean, rec_mean=rec_mean))

#### Example predictions for a small sample

In [None]:
# Map each binary label row back to genre names (column j corresponds to genre_list[j]).
predictions = []
actuals = []
for i in range(X_test.shape[0]):
    predictions.append([Genre_ID_to_name[genre_list[j]]
                        for j, flag in enumerate(preds[i]) if flag != 0])
    actuals.append([Genre_ID_to_name[genre_list[j]]
                    for j, flag in enumerate(Y_test[i]) if flag != 0])
# Print every 50th test movie (skipping index 0) as a spot check.
for i in range(50, X_test.shape[0], 50):
    print('MOVIE: ', movies_with_overviews[test_movies[i]]['title'],
          '\nPREDICTION: ', ','.join(predictions[i]),
          '\nActual: ', ','.join(actuals[i]), '\n')

In [None]:
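# Inspect a single test movie and the Naive Bayes prediction for it.
print(movies_with_overviews[test_movies[100]])
print(Genre_ID_to_name[35])
print(Genre_ID_to_name[10751])
print(test_movies[100])
# Slice rather than index so the input stays 2-D for both sparse and dense X.
classif.predict(X_test[100:101])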

### Simple Neural Network with Word2Vec features
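
The feature construction and the top-3 rule for this model can be sketched as follows. The 4-dimensional vectors, the stop-word set, and the function names `mean_vector` and `top_k_binary` are hypothetical stand-ins; the report itself uses the 300-dimensional GoogleNews Word2Vec vectors, and its preprocessing code is not shown here.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings standing in for the 300-dimensional
# GoogleNews vectors.
toy_vectors = {
    "hacker": np.array([1.0, 0.0, 0.0, 0.0]),
    "computers": np.array([0.0, 1.0, 0.0, 0.0]),
    "earth": np.array([0.0, 0.0, 1.0, 0.0]),
}
toy_stop_words = {"the", "a", "who", "of"}


def mean_vector(overview, vectors, dim=4):
    """Whitespace-tokenize, drop stop words, and average the known word vectors."""
    tokens = [t for t in overview.lower().split() if t not in toy_stop_words]
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No known words: fall back to a zero vector.
        return np.zeros(dim)
    return np.mean(known, axis=0)


def top_k_binary(scores, k=3):
    """Mark the k highest-scoring genres as predicted (1) and the rest as 0."""
    top = np.argsort(scores)[::-1][:k]
    binary = np.zeros(len(scores), dtype=int)
    binary[top] = 1
    return binary


features = mean_vector("The hacker fights the computers", toy_vectors)
labels = top_k_binary(np.array([0.1, 0.7, 0.05, 0.6, 0.3]))
print(features, labels)
```

Note that the top-3 rule always predicts exactly three genres per movie, which tends to raise recall at the cost of precision for movies with fewer than three genres.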

#### Metrics for each genre

In [None]:
from keras.models import load_model
from sklearn.preprocessing import MultiLabelBinarizer
with open('../data/processed/textual_features.pkl','rb') as f:
    (X,Y)=pickle.load(f)
model_textual = load_model('../models/overview_nn.h5')

indices = range(len(movies_with_overviews))
X_train, X_test, Y_train, Y_test, train_movies, test_movies = train_test_split(X, Y, indices, test_size=0.20, random_state=42)
# Assumes the label columns follow sorted genre-ID order, as in the cells above.
genre_names = [Genre_ID_to_name[genre_id] for genre_id in genre_list]
Y_preds = model_textual.predict(X_test)

# Binarize the network's scores: mark the 3 highest-scoring genres per movie as predicted.
Y_preds_binary = []
for row in Y_preds:
    predicted = np.argsort(row)[::-1][:3]
    predicted_genre_Y = [1 if k in predicted else 0 for k in range(len(row))]
    Y_preds_binary.append(predicted_genre_Y)

print(classification_report(Y_test, np.array(Y_preds_binary), target_names=genre_names))

#### Precision and Recall for the overall model

In [None]:
predictions = generate_predictions(Genre_ID_to_name, X_test, Y_preds_binary)
precs, recs = precsc_recs(test_movies, movies_with_overviews, Genre_ID_to_name, predictions)

prec_mean = np.mean(np.asarray(precs))
rec_mean = np.mean(np.asarray(recs))

md('''Precision: {prec_mean}

Recall: {rec_mean}

'''.format(prec_mean=prec_mean, rec_mean=rec_mean))

#### Example predictions for a small sample

In [None]:
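# Rebuild the actual genre names from this section's Y_test instead of reusing
# `actuals` left over from the Naive Bayes cells (assumes column j <-> genre_list[j]).
actuals = [[Genre_ID_to_name[genre_list[j]] for j, flag in enumerate(row) if flag != 0]
           for row in Y_test]
# Print every 50th test movie (skipping index 0) as a spot check.
for i in range(50, X_test.shape[0], 50):
    print('MOVIE: ', movies_with_overviews[test_movies[i]]['title'],
          '\nPREDICTION: ', ','.join(predictions[i]),
          '\nActual: ', ','.join(actuals[i]), '\n')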

In [None]:
from IPython.display import HTML


def css_styling():
    # Load the notebook's custom stylesheet and return it for display.
    with open("../notebooks/static/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()