# <center>Predicting the movie genres</center>
___

## Abstract
> For a long time, movies have been an amazing source of entertainment. They give family members a chance to enjoy time together, friends a chance to socialize, and artists a chance to display their talents, be it actors, directors, cinematographers, dialogue writers and so on. Movies are a visual art that focuses on storytelling, communicating ideas, and stimulating different experiences (like romance, anger, travel, etc.).

> In the early years of filmmaking, films were recorded on celluloid through a photochemical process. Since movies are nothing but a continuous series of pictures, there was no possibility of including sound with the moving frames. As we progressed and learned new techniques to record and display movies, we transitioned to large movie projectors. This gave us a chance to add sound to our motion pictures. And in no time, we transitioned to the era of digital cameras, which eased the effort of recording a movie along with its sound.

> I have long been a fan of movies, as they gave me an opportunity to look at the world from a broader perspective. I am not a detective, but Sherlock Holmes introduced me to the world of detectives. I am not a stockbroker, but The Wolf of Wall Street gave me deep insights into what it takes to be one. I have never studied mythology, but The Ramayana taught me about its long history and its importance.
___

## Introduction
#### 1. Problem Statement
> Movies span a range of genres, from romance to sci-fi to drama to comedy and so on. In this notebook, I will try to **predict the genres** of the movies based on _title, tagline, original_title_ and _overview_. We have a dataset of about 45000 movies with metadata collected from IMDB and compiled on Kaggle (https://www.kaggle.com/rounakbanik/the-movies-dataset).

#### 2. Key Documents
>  Out of the 7 documents, as directed, we will use only **movies_metadata.csv**.

#### 3. Breakdown of this notebook
  > 1. Importing Libraries
  2. Loading the dataset
  3. Removing/filling the NaN values in the dataset
  4. Cleaning the dataset
  5. Classification Analysis:
    1. Decision Tree Classifier
    2. Random Forest Classifier
    3. Multi-layer Perceptron (MLP) Classifier

In [1]:
# Importing the required libraries
import numpy as np
import pandas as pd

In [2]:
# Importing the metadata of movies
df1 = pd.read_csv('movies_metadata.csv')



In [3]:
df1.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


<div class="alert alert-block alert-info"><b> 
    
- According to the problem statement, we need only the following columns:
    1. title
    2. tagline
    3. original_title
    4. overview
    5. genres
    
  Hence, we subset these columns into a new dataframe.
</b></div>

In [4]:
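# .copy() gives an independent DataFrame, so later in-place edits do not act on a slice of df1
df = df1[['title', 'tagline', 'original_title', 'overview', 'genres']].copy()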

In [5]:
# Setting `title` as the index
df.set_index('title',inplace = True)

In [6]:
df.isna().sum()

tagline           25054
original_title        0
overview            954
genres                0
dtype: int64

A movie's genre(s) can usually be inferred from its reviews or description. Here, we have a feature `overview`, and we will use it to predict the genre(s) of the movies.

Hence, if an instance has *no overview*, we *drop* that instance.

In [7]:
df.dropna(subset = ['overview'], inplace = True)



Ref: https://kite.com/python/answers/how-to-drop-empty-rows-from-a-pandas-dataframe-in-python
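
The drop can be illustrated on a tiny frame. This is a minimal sketch with made-up rows (the `toy` frame and its values are hypothetical, not taken from the dataset). Taking an explicit `.copy()` of the column subset is one common way to avoid pandas' "set on a copy of a slice" warning when the subset is modified afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for movies_metadata.csv
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["A heist goes wrong.", np.nan, "Two friends travel."],
    "genres": ["[]", "[]", "[]"],
    "budget": [1, 2, 3],
})

# .copy() makes the column subset an independent frame
subset = toy[["title", "overview", "genres"]].copy()
subset = subset.set_index("title")

# Drop the rows whose overview is missing
subset = subset.dropna(subset=["overview"])
print(subset.index.tolist())
```

Reassigning the result, rather than using `inplace = True`, is a style choice here and behaves the same way for this purpose.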

## Step 1: Extracting the genre names of each film (row) from the list of dictionaries under the **key** `name`.

In [8]:
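from ast import literal_eval

# Parse each stringified list of dicts, then keep only the `name` of every genre
# (list comprehension); anything that is not a list becomes an empty list
df['genres'] = df['genres'].apply(literal_eval).apply(
    lambda x: [i['name'] for i in x] if isinstance(x, list) else [])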



Ref: https://www.kaggle.com/rounakbanik/movie-recommender-systems, In [3]
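
To make the parsing step concrete, here is a minimal sketch on a single cell. The `raw` string is an illustrative value in the same shape as the `genres` column, not an actual row of the dataset.

```python
from ast import literal_eval

# A raw `genres` cell is a string that looks like a Python list of dicts
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

parsed = literal_eval(raw)            # parses the literal without executing code
names = [g["name"] for g in parsed]   # keep only the genre names
print(names)                          # ['Animation', 'Comedy']

# A cell holding '[]' parses to an empty list, so no names are extracted
empty_names = [g["name"] for g in literal_eval("[]")]
```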

The purpose of the following code is to *exclude* any instances where the genre list is absent or empty ('[ ]').

In [9]:
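# Selecting only those rows which have an actual genre.
# Note: after Step 1 each entry is a Python list, so comparing it with the string '[]'
# is True for every row; empty lists are actually removed later via the row sums of `y`.
genre_present = df['genres'] != '[]'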

# Series of the genres present in the movies_metadata
genres = df['genres'][genre_present]
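
The difference between comparing against the string `'[]'` and checking for an empty list can be seen on a small series. This is only a sketch with hypothetical values. If a length-based mask were used in the notebook, the same mask would also have to be applied to the `overview` column so that features and labels stay aligned.

```python
import pandas as pd

# Hypothetical genre lists, already parsed from strings into Python lists
g = pd.Series([["Drama"], [], ["Comedy", "Romance"]])

# A list is never equal to the string '[]', so this mask keeps every row
mask_str = g != "[]"
print(mask_str.tolist())

# Checking the list length is one way to flag the empty rows
mask_len = g.map(len) > 0
print(mask_len.tolist())
```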

## Step 2: Separating and selecting the genres

In [10]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()

labels = mlb.fit_transform(genres)
label_classes = mlb.classes_

Ref: https://datascience.stackexchange.com/questions/11797/split-a-list-of-values-into-columns-of-a-dataframe
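
As a small illustration of what `MultiLabelBinarizer` produces, the sketch below encodes three hypothetical movies. The variable names are made up for the example.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical genre lists for three movies; the last one has no genre
toy_genres = [["Drama", "Romance"], ["Comedy"], []]

mlb_toy = MultiLabelBinarizer()
onehot = mlb_toy.fit_transform(toy_genres)

# Classes are learned from the data and sorted alphabetically
print(list(mlb_toy.classes_))   # ['Comedy', 'Drama', 'Romance']
# One row per movie, one 0/1 column per class
print(onehot.tolist())          # [[0, 1, 1], [1, 0, 0], [0, 0, 0]]
```

A movie without any genre becomes an all-zero row, which is what the later row-sum check relies on.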

In [11]:
label_classes

array(['Action', 'Adventure', 'Animation', 'Aniplex', 'BROSTA TV',
       'Carousel Productions', 'Comedy', 'Crime', 'Documentary', 'Drama',
       'Family', 'Fantasy', 'Foreign', 'GoHands', 'History', 'Horror',
       'Mardock Scramble Production Committee', 'Music', 'Mystery',
       'Odyssey Media', 'Pulser Productions', 'Rogue State', 'Romance',
       'Science Fiction', 'Sentai Filmworks', 'TV Movie',
       'Telescene Film Group Productions', 'The Cartel', 'Thriller',
       'Vision View Entertainment', 'War', 'Western'], dtype=object)

In [12]:
label_data = pd.DataFrame(labels, columns=label_classes)

In [13]:
# Number of movies tagged with each label (count of 1s in each column)
val = {x: int(label_data[x].sum()) for x in label_classes}

Sorting the `genres` according to the number of instances in *descending* order.

In [14]:
sorted_val = sorted(val.items(), key=lambda kv: kv[1], reverse=True)

Ref: https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value
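
The same counting and sorting can also be expressed directly in pandas. The sketch below uses a hypothetical indicator frame with the same layout as `label_data`. It is an alternative formulation, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical 0/1 indicator frame: one row per movie, one column per label
toy_labels = pd.DataFrame({
    "Drama":  [1, 1, 0, 1],
    "Comedy": [0, 1, 1, 0],
    "War":    [0, 0, 0, 1],
})

# Column sums give the number of movies per label; sort from most to least frequent
counts = toy_labels.sum().sort_values(ascending=False)

# Keep the names of the two most frequent labels
top2 = counts.index[:2].tolist()
print(top2)
```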

In [15]:
# Build the frame straight from the sorted list of (genre, count) tuples
val_pd = pd.DataFrame(sorted_val, columns=["Genre", "Count"])

In [16]:
val_pd

| | Genre | Count |
|---|---|---|
| 0 | Drama | 20023 |
| 1 | Comedy | 12806 |
| 2 | Thriller | 7586 |
| 3 | Romance | 6673 |
| 4 | Action | 6565 |
| 5 | Horror | 4660 |
| 6 | Crime | 4269 |
| 7 | Documentary | 3886 |
| 8 | Adventure | 3470 |
| 9 | Science Fiction | 3028 |


Looking at the labels above, the `genres` with a value count of 1 do not seem to be genres. They appear to be names of production houses or TV channels.

Hence, we drop these entries to reduce our search space.
___

Equivalently, we can select the top 20 `genres` from the above data frame for prediction, since these are the actual genres.

In [17]:
dummy_counts = sorted(val.items(), key=lambda kv: kv[1], reverse=True)[0:20] # Selecting the first 20 genres.
dummy_counts

[('Drama', 20023),
 ('Comedy', 12806),
 ('Thriller', 7586),
 ('Romance', 6673),
 ('Action', 6565),
 ('Horror', 4660),
 ('Crime', 4269),
 ('Documentary', 3886),
 ('Adventure', 3470),
 ('Science Fiction', 3028),
 ('Family', 2732),
 ('Mystery', 2451),
 ('Fantasy', 2290),
 ('Animation', 1920),
 ('Foreign', 1599),
 ('Music', 1588),
 ('History', 1379),
 ('War', 1310),
 ('Western', 1035),
 ('TV Movie', 751)]

Here, we extract only the genre names from the list of (genre, count) tuples for the top 20 genres.

In [18]:
# List Comprehension
genre_counts = [i[0] for i in dummy_counts]

In [19]:
genre_counts

['Drama',
 'Comedy',
 'Thriller',
 'Romance',
 'Action',
 'Horror',
 'Crime',
 'Documentary',
 'Adventure',
 'Science Fiction',
 'Family',
 'Mystery',
 'Fantasy',
 'Animation',
 'Foreign',
 'Music',
 'History',
 'War',
 'Western',
 'TV Movie']

Having filtered the list down to the actual genres, we can now proceed with further processing.
___
Binarizing the selected genres.

In [20]:
final_genres = MultiLabelBinarizer(classes = genre_counts) 
# 'genre_counts' is the final list of genres that will be used for further training of the model

top = final_genres.fit(genres)

In [21]:
# Dependent Variable
y = final_genres.transform(genres)



In [22]:
final_genres.classes_

array(['Drama', 'Comedy', 'Thriller', 'Romance', 'Action', 'Horror',
       'Crime', 'Documentary', 'Adventure', 'Science Fiction', 'Family',
       'Mystery', 'Fantasy', 'Animation', 'Foreign', 'Music', 'History',
       'War', 'Western', 'TV Movie'], dtype=object)

<div class="alert alert-block alert-info"><b> 
    
- Genres not included in genre_counts are ignored by MultiLabelBinarizer during the transform.
  
</b></div>
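
The behaviour described in the note above can be checked on toy data. In this sketch the `keep` list and the rows are hypothetical. Labels outside `classes` are dropped during `transform` (scikit-learn emits a warning about unknown classes, silenced here).

```python
import warnings
from sklearn.preprocessing import MultiLabelBinarizer

keep = ["Drama", "Comedy"]   # hypothetical short list of genres to keep
rows = [["Drama", "Aniplex"], ["Aniplex"], ["Comedy", "Drama"]]

binarizer = MultiLabelBinarizer(classes=keep)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # silence the "unknown class(es)" warning
    encoded = binarizer.fit(rows).transform(rows)

# Columns follow the order given in `classes`: Drama, then Comedy
print(encoded.tolist())   # [[1, 0], [0, 0], [1, 1]]
```

The second row only had an unknown label, so it ends up all zeros.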

As discussed earlier, a movie's genre(s) can usually be inferred from its reviews or description, so we use the `overview` feature as the input for predicting the genre(s).

## Step 3: Separating the independent variable

In [23]:
# Independent Variable
X = df['overview']

**ATTENTION**

If an overview is present but no genre is present, the row would mistrain our predictive model. Therefore, for training, we include only those rows which contain actual genres and not '[ ]'.

One of the simplest ways to do this is to check the sum of each row of `y` after applying the MultiLabelBinarizer. If the sum equals 0, the movie has none of the selected genres. Hence, we do not include such rows in training.

In [24]:
# Boolean mask: True for rows that have none of the selected genres
no_label_classes = y.sum(axis = 1) == 0
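
A minimal sketch of how such a mask filters features and labels together, using hypothetical toy arrays:

```python
import numpy as np
import pandas as pd

# Hypothetical binarized labels (rows = movies) and matching overviews
y_toy = np.array([[1, 0], [0, 0], [1, 1]])
X_toy = pd.Series(["a drama", "an unlabelled film", "a dramedy"])

# True where a row has no label at all
no_label = y_toy.sum(axis=1) == 0

# Apply the inverted mask to both so that rows stay aligned
X_kept, y_kept = X_toy[~no_label], y_toy[~no_label]
print(X_kept.tolist(), y_kept.shape)
```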

#### Train-Validation Split

In [25]:
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X[~no_label_classes], y[~no_label_classes],
                                                     test_size = 0.3, random_state = 1234)

In [26]:
X_train.shape, y_train.shape

((29626,), (29626, 20))

In [27]:
X_valid.shape, y_valid.shape

((12698,), (12698, 20))

As mentioned earlier, the genres would be predicted based on the `overview`. 

The steps we follow are:
  1. Convert the overview rows into TF-IDF features using TfidfVectorizer
  2. Train and build a multi-label classification model
  3. Predict the genres of the given overview.

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features = 1000, stop_words = 'english', lowercase = True)

X_train_vec = vectorizer.fit_transform(X_train)
X_valid_vec = vectorizer.transform(X_valid)

In [29]:
X_train_vec

<29626x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 374765 stored elements in Compressed Sparse Row format>

Now that we have _vectorized_ the `overview` column and completed the preprocessing steps, we can build our predictive model.
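
The fit/transform split of the vectorization step can be sketched on a toy corpus. The sentences below are made up for illustration. The point is that the vocabulary and IDF weights are learned on the training text only and then reused for the validation text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical overviews
train_docs = ["A detective solves a murder in London.",
              "Two friends fall in love in Paris."]
valid_docs = ["A detective falls in love."]

vec = TfidfVectorizer(max_features=1000, stop_words="english", lowercase=True)
train_vec = vec.fit_transform(train_docs)   # learn vocabulary and IDF on training text
valid_vec = vec.transform(valid_docs)       # reuse the fitted vocabulary, no refit

# Both matrices share the same columns (one per vocabulary term)
print(train_vec.shape, valid_vec.shape)
print(sorted(vec.vocabulary_))
```

Words unseen during fitting (such as "falls" here) are simply ignored at transform time.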

## Model Building
___
Since this is a classification problem, we build the following models:
1. Decision Tree Classifier
2. Random Forest Classifier

While researching multi-label classification, I came across **MLPClassifier**.

One advantage of MLPClassifier is that this implementation works with data represented as dense NumPy arrays or sparse SciPy arrays of floating point values. Since our training features are a sparse matrix and our labels are a NumPy array, it should, intuitively, be a good fit for this problem.


Ref: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
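
As a quick check of the point above, the sketch below fits an `MLPClassifier` on a sparse feature matrix with a two-column 0/1 label matrix. All data is randomly generated and hypothetical, and the small network size is chosen only to keep the example fast.

```python
import warnings
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)

# Hypothetical sparse features: 30 samples, 8 mostly-zero columns
X_dense = (rng.rand(30, 8) > 0.7).astype(float)
X_sparse = csr_matrix(X_dense)

# Two hypothetical labels per sample, stored as a 0/1 indicator matrix
y_multi = X_dense[:, :2].astype(int)

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=50, random_state=0)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # a short run may not fully converge
    clf.fit(X_sparse, y_multi)

# Predictions come back as a 0/1 indicator matrix with one column per label
pred = clf.predict(X_sparse)
print(pred.shape)
```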

In [30]:
from sklearn.model_selection import GridSearchCV

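def model_building(model, parameters=None, cv=10):
    """Fit `model` (optionally via grid search) and predict on the train and validation sets."""
    if parameters is None:
        # No grid supplied: fit the model as-is (cv is unused in this branch)
        model.fit(X_train_vec, y_train)
        return (model, model.predict(X_train_vec), model.predict(X_valid_vec))
    else:
        # Grid search with `cv`-fold cross-validation, then keep the best estimator
        model_cv = GridSearchCV(estimator=model, param_grid=parameters, cv=cv)
        model_cv.fit(X_train_vec, y_train)
        model = model_cv.best_estimator_

        # Note: this branch returns four values (the search object comes first)
        return (model_cv, model, model.predict(X_train_vec), model.predict(X_valid_vec))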

In [31]:
### Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier

dtr = DecisionTreeClassifier()
model, train_dtr, valid_dtr = model_building(dtr)

In [32]:
from sklearn.metrics import classification_report, accuracy_score
print("Classification Report")
print("Training:\n",classification_report(y_true = y_train, y_pred = train_dtr, target_names = genre_counts))
print("Validation:\n",classification_report(y_true = y_valid, y_pred = valid_dtr, target_names = genre_counts))

print("Accuracy")
train_dtr_acc = accuracy_score(y_true = y_train, y_pred = train_dtr)
valid_dtr_acc = accuracy_score(y_true = y_valid, y_pred = valid_dtr)
print("Training: ", train_dtr_acc)
print("Validation: ",valid_dtr_acc)

Classification Report



Training:
                  precision    recall  f1-score   support

          Drama       1.00      0.99      1.00     14045
         Comedy       1.00      0.99      0.99      8960
       Thriller       1.00      1.00      1.00      5322
        Romance       1.00      1.00      1.00      4646
         Action       1.00      1.00      1.00      4585
         Horror       1.00      1.00      1.00      3314
          Crime       1.00      0.99      1.00      2976
    Documentary       1.00      1.00      1.00      2731
      Adventure       1.00      1.00      1.00      2420
Science Fiction       1.00      1.00      1.00      2122
         Family       1.00      0.99      1.00      1916
        Mystery       1.00      1.00      1.00      1722
        Fantasy       1.00      0.99      1.00      1619
      Animation       1.00      0.99      0.99      1335
        Foreign       1.00      0.99      0.99      1142
          Music       1.00      0.99      1.00      1140
        History    

___

In [33]:
### Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
model, train_rfc, valid_rfc = model_building(rfc)



In [34]:
print("Classification Report")
print("Training:\n",classification_report(y_true = y_train, y_pred = train_rfc, target_names = genre_counts))
print("Validation:\n",classification_report(y_true = y_valid, y_pred = valid_rfc, target_names = genre_counts))

print("Accuracy")
train_rfc_acc = accuracy_score(y_true = y_train, y_pred = train_rfc)
valid_rfc_acc = accuracy_score(y_true = y_valid, y_pred = valid_rfc)
print("Training: ", train_rfc_acc)
print("Validation: ",valid_rfc_acc)

Classification Report



Training:
                  precision    recall  f1-score   support

          Drama       0.99      0.97      0.98     14045
         Comedy       1.00      0.92      0.95      8960
       Thriller       1.00      0.88      0.94      5322
        Romance       1.00      0.88      0.94      4646
         Action       1.00      0.87      0.93      4585
         Horror       1.00      0.88      0.93      3314
          Crime       1.00      0.86      0.92      2976
    Documentary       1.00      0.92      0.96      2731
      Adventure       1.00      0.81      0.89      2420
Science Fiction       1.00      0.85      0.92      2122
         Family       1.00      0.80      0.89      1916
        Mystery       1.00      0.79      0.88      1722
        Fantasy       1.00      0.79      0.88      1619
      Animation       1.00      0.82      0.90      1335
        Foreign       1.00      0.72      0.84      1142
          Music       1.00      0.82      0.90      1140
        History    

In [35]:
### MLP Classifier
from sklearn.neural_network import MLPClassifier
# (100,) is a one-element tuple: a single hidden layer with 100 units
mlp = MLPClassifier(verbose = True, max_iter = 100, hidden_layer_sizes=(100,))

model, train_mlp, valid_mlp = model_building(mlp, cv = 10)

Iteration 1, loss = 8.23082559
Iteration 2, loss = 5.50312092
Iteration 3, loss = 4.96829456
Iteration 4, loss = 4.63997070
Iteration 5, loss = 4.45528650
Iteration 6, loss = 4.34615616
Iteration 7, loss = 4.27404513
Iteration 8, loss = 4.22148260
Iteration 9, loss = 4.18064685
Iteration 10, loss = 4.14656685
Iteration 11, loss = 4.11892843
Iteration 12, loss = 4.09274439
Iteration 13, loss = 4.06829769
Iteration 14, loss = 4.04702925
Iteration 15, loss = 4.02641867
Iteration 16, loss = 4.00678114
Iteration 17, loss = 3.98803531
Iteration 18, loss = 3.96986845
Iteration 19, loss = 3.95239979
Iteration 20, loss = 3.93536530
Iteration 21, loss = 3.91928780
Iteration 22, loss = 3.90432368
Iteration 23, loss = 3.88850654
Iteration 24, loss = 3.87370825
Iteration 25, loss = 3.85927359
Iteration 26, loss = 3.84532451
Iteration 27, loss = 3.83101334
Iteration 28, loss = 3.81759384
Iteration 29, loss = 3.80371635
Iteration 30, loss = 3.79032064
Iteration 31, loss = 3.77720875
Iteration 32, los



In [36]:
print("Classification Report")
print("Training:\n",classification_report(y_true = y_train, y_pred = train_mlp, target_names = genre_counts))
print("Validation:\n",classification_report(y_true = y_valid, y_pred = valid_mlp, target_names = genre_counts))

print("Accuracy")
train_mlp_acc = accuracy_score(y_true = y_train, y_pred = train_mlp)
valid_mlp_acc = accuracy_score(y_true = y_valid, y_pred = valid_mlp)
print("Training: ", train_mlp_acc)
print("Validation: ",valid_mlp_acc)

Classification Report
Training:
                  precision    recall  f1-score   support

          Drama       0.81      0.83      0.82     14045
         Comedy       0.83      0.68      0.75      8960
       Thriller       0.81      0.57      0.67      5322
        Romance       0.81      0.53      0.64      4646
         Action       0.84      0.65      0.73      4585
         Horror       0.86      0.65      0.74      3314
          Crime       0.80      0.56      0.66      2976
    Documentary       0.93      0.78      0.85      2731
      Adventure       0.87      0.44      0.58      2420
Science Fiction       0.86      0.61      0.71      2122
         Family       0.88      0.50      0.63      1916
        Mystery       0.84      0.41      0.55      1722
        Fantasy       0.84      0.42      0.56      1619
      Animation       0.91      0.55      0.68      1335
        Foreign       0.81      0.10      0.17      1142
          Music       0.89      0.59      0.71      11


In [37]:
# Evaluation Metrics' Dataframe
train_acc = [train_dtr_acc, train_rfc_acc, train_mlp_acc]
valid_acc = [valid_dtr_acc, valid_rfc_acc, valid_mlp_acc]
eval_mat = pd.DataFrame([train_acc, valid_acc],  index = ['Training','Validation'],
                        columns = ['Decision Tree Classifier', 'Random Forest Classifier', 'MLP Classifier'])

In [38]:
eval_mat.T

| | Training | Validation |
|---|---|---|
| Decision Tree Classifier | 0.992102 | 0.099149 |
| Random Forest Classifier | 0.836664 | 0.131911 |
| MLP Classifier | 0.359313 | 0.149551 |


Observations:
 - **Decision trees** characteristically tend to *overfit* the training dataset, and performance is measured on the validation dataset. Here the **Decision Tree Classifier** scores about 0.99 accuracy on training but only about 0.10 on validation, so it fails to generalize.
 
 - **Random Forest** is a bagging algorithm and has better control over overfitting. Here, the **Random Forest Classifier** performs better than the decision tree on validation (about 0.13).
 
 - **Multi-layer Perceptron (MLP)** is based on neural networks and is trained with a supervised learning technique called backpropagation. In the above dataframe, the **MLP Classifier** has the highest validation accuracy of the three (about 0.15), with the smallest gap between training and validation.
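
One reason the validation accuracies look low is that, for multi-label targets, `accuracy_score` counts a prediction as correct only when every label of a row matches (subset accuracy). The sketch below, on hypothetical toy labels, compares it with two per-label metrics that scikit-learn also provides. Which metric suits this problem best is left open here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Hypothetical true and predicted label matrices (3 movies, 3 genres)
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

# Exact-match ratio: only the second row matches completely -> 1/3
subset_acc = accuracy_score(y_true, y_pred)

# Fraction of individual labels that are wrong: 2 of 9
ham = hamming_loss(y_true, y_pred)

# F1 pooled over all labels: 3 true positives, 0 false positives, 2 false negatives -> 0.75
micro_f1 = f1_score(y_true, y_pred, average="micro")

print(subset_acc, ham, micro_f1)
```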
 
____



Displaying the predicted genres.

In [39]:
test_preds = final_genres.inverse_transform(valid_mlp)

In [40]:
[list(i) for i in test_preds][0:10]

[['Comedy'],
 ['Drama', 'Thriller', 'Crime'],
 ['Drama', 'Comedy'],
 ['Drama'],
 ['Documentary'],
 ['Drama', 'Comedy'],
 ['Drama', 'Comedy', 'Romance'],
 ['Drama'],
 ['Comedy'],
 ['Comedy', 'Science Fiction']]
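
To view the predictions next to the movies they belong to, the decoded genres can be attached to the validation overviews, which keep `title` as their index. The following is a sketch on hypothetical toy data (titles, overviews and predictions are made up).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

genre_list = ["Drama", "Comedy", "Thriller"]   # hypothetical genre order
binarizer = MultiLabelBinarizer(classes=genre_list).fit([genre_list])

# Hypothetical validation overviews indexed by title, and their predicted labels
X_valid_toy = pd.Series(["overview one", "overview two"],
                        index=pd.Index(["Film A", "Film B"], name="title"))
pred_toy = np.array([[1, 0, 1], [0, 1, 0]])

# Map each 0/1 row back to a tuple of genre names
decoded = binarizer.inverse_transform(pred_toy)

# One row per validation movie: its overview and its predicted genres
result = X_valid_toy.to_frame("overview")
result["predicted_genres"] = [list(t) for t in decoded]
print(result)
```

This works because the prediction rows are in the same order as the validation series.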

___

## Challenges Faced

1. Extracting genres from the list of dictionaries present in the feature genres.
2. Selecting the right method to encode multi-label targets
3. Selecting the right classification algorithm
4. Evaluation metrics
5. Combining the predicted genres with the validation dataset
___

## Future Scope

1. Hyperparameter tuning
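
A possible starting point for the tuning is `GridSearchCV`, which this notebook already imports. The sketch below runs on randomly generated toy data. The parameter grid and its values are assumptions for illustration, not tuned recommendations for this dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Hypothetical features and two labels derived from the first two features
X_toy = rng.rand(40, 5)
y_toy = (X_toy[:, :2] > 0.5).astype(int)

# Hypothetical grid: limit tree depth and leaf size to curb overfitting
param_grid = {"max_depth": [2, 4, None], "min_samples_leaf": [1, 3]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid=param_grid, cv=3)
search.fit(X_toy, y_toy)

# Best parameter combination found by cross-validation
print(search.best_params_)
best_pred = search.best_estimator_.predict(X_toy)
```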


___