# Project 3


# Movie Genre Classification

Classify a movie's genres based on its plot.

<img src="moviegenre.png"
     style="float: left; margin-right: 10px;" />




https://www.kaggle.com/c/miia4201-202019-p3-moviegenreclassification/overview

### Data

Input:
- movie plot

Output:
- Probability that the movie belongs to each genre


### Evaluation

- 20% API
- 30% Report with all the details of the solution, the analysis and the conclusions. The report cannot exceed 10 pages, must be sent in PDF format and must be self-contained.
- 50% Performance in the Kaggle competition. Each group's grade depends on its rank in the competition: the group in first place obtains 5 points, and 0.25 points are subtracted for each position below (first place: 5 points, second: 4.75 points, third: 4.50 points, ..., eleventh: 2.50 points, twelfth: 2.25 points).

- The project must be carried out in the groups assigned for module 4.
- Use clear and rigorous procedures.
- The project is due on July 12, 2020, at 11:59 pm, through Sicua + (upload the API and the report in PDF format).
- No projects will be received after the deadline or by any means other than the one established.
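
The ranking rule for the Kaggle component reduces to a linear formula. The sketch below is only an illustration: the function name `kaggle_grade` is hypothetical, and flooring the grade at zero for very low ranks is an assumption that the rules above do not state.

```python
def kaggle_grade(rank: int) -> float:
    """Points for the Kaggle component: 5.0 for first place, minus 0.25 per position below."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    # Assumption: the grade never drops below zero (not stated in the rules).
    return max(0.0, 5.0 - 0.25 * (rank - 1))


for rank in (1, 2, 3, 11, 12):
    print(rank, kaggle_grade(rank))
```

The printed values should match the examples listed in the grading rule (5, 4.75, 4.50, 2.50 and 2.25 points).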




### Acknowledgements

We thank Professor Fabio Gonzalez, Ph.D. and his student John Arevalo for providing this dataset.

See https://arxiv.org/abs/1702.01992

## Sample Submission

In [190]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier,  ExtraTreesClassifier
from sklearn.metrics import r2_score, roc_auc_score, accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV

In [169]:
dataTraining = pd.read_csv('datasets/dataTraining.csv', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('datasets/dataTesting.csv', encoding='UTF-8', index_col=0)

In [170]:
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfi...","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york , the presi...",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing h...","['Action', 'Crime', 'Thriller']",6.6


In [171]:
dataTesting.head()

Unnamed: 0,year,title,plot
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate ...."
4,1978,Midnight Express,"the true story of billy hayes , an american c..."
5,1996,Primal Fear,martin vail left the chicago da ' s office to ...
6,1950,Crisis,husband and wife americans dr . eugene and mr...
7,1959,The Tingler,the coroner and scientist dr . warren chapin ...


### Create count vectorizer

In [172]:
from nltk.stem import WordNetLemmatizer
import nltk
wordnet_lemmatizer = WordNetLemmatizer()

In [173]:
def split_into_lemmas(text):
    text = text.lower()
    words = text.split()
    return [wordnet_lemmatizer.lemmatize(word) for word in words]

In [174]:
vect = CountVectorizer(lowercase=True, stop_words='english', max_features=5000)
#vect = CountVectorizer(lowercase=True, stop_words='english', analyzer=split_into_lemmas, max_features=10000)
X_dtm = vect.fit_transform(dataTraining['plot'])
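
To make the document-term matrix less abstract, here is a rough pure-Python sketch of the bag-of-words idea behind `CountVectorizer`. It is not scikit-learn's implementation: the helper names `tokenize` and `fit_transform`, the tiny stop-word set and the toy plots are all assumptions made for illustration, and the real class returns a sparse matrix and may break frequency ties differently.

```python
import re
from collections import Counter

# Assumption: a tiny stand-in stop-word set; scikit-learn's 'english' list is much longer.
STOP_WORDS = {"a", "the", "of", "in", "to", "and", "is", "who"}


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def fit_transform(docs, max_features=None):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    tokenized = [tokenize(d) for d in docs]
    totals = Counter(t for tokens in tokenized for t in tokens)
    # Keep the most frequent terms across the corpus, then order them alphabetically.
    vocabulary = sorted(term for term, _ in totals.most_common(max_features))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


plots = [
    "a serial killer decides to teach the secrets of his career",
    "the story of a single father and his son",
    "a killer and a father",
]
vocab, dtm = fit_transform(plots, max_features=5)
print(vocab)
for row in dtm:
    print(row)
```

Each row corresponds to one plot and each column to one vocabulary term, which is the same layout as `X_dtm`, only dense and much smaller.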

### Create y

In [175]:
import ast
# Parse the stringified genre lists safely instead of calling eval on file contents.
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])

In [176]:
y_genres

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 1, 0, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0]])
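
The indicator matrix shown above can be reproduced by hand on a few rows. The following is a minimal sketch of what multi-label binarization computes, assuming the classes are ordered alphabetically; the helper `binarize` is a hypothetical name, not part of scikit-learn.

```python
import ast


def binarize(label_lists):
    """Return (classes, matrix) with one 0/1 column per class, classes sorted alphabetically."""
    classes = sorted({label for labels in label_lists for label in labels})
    column = {label: j for j, label in enumerate(classes)}
    matrix = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1
        matrix.append(row)
    return classes, matrix


# The genres column stores each list as a string, so it is parsed first.
raw = ["['Short', 'Drama']", "['Comedy', 'Crime', 'Horror']", "['Drama']"]
parsed = [ast.literal_eval(s) for s in raw]
classes, y = binarize(parsed)
print(classes)
print(y)
```

A movie with several genres gets several ones in its row, which is why the problem is multi-label rather than plain multi-class.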

In [177]:
# test_size=0.66 holds out 66% of the rows for validation and trains on the remaining 34%.
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.66, random_state=666)

### Train multi-class multi-label model

In [195]:
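# LogisticRegressionCV expects Cs to be an int or a list of floats, so the
# candidate values are passed as one list and searched by its internal CV.
model_to_set = OneVsRestClassifier(LogisticRegressionCV(random_state=666, n_jobs=-1))
parameters = [{
    "estimator__Cs": [[0.01, 0.1, 1, 10, 100]],
    "estimator__max_iter": [100, 1000],
}]
mdl = GridSearchCV(model_to_set, param_grid=parameters, return_train_score=True, cv=10)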
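
A grid search multiplies quickly, so it can help to count the fits before launching one. The sketch below expands a parameter grid into its candidate combinations; the helper `expand_grid` and the grid values are hypothetical and chosen only to illustrate the counting.

```python
from itertools import product


def expand_grid(grid):
    """Yield one dict per combination of the listed parameter values."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical grid, used only to illustrate how the search size grows.
grid = {
    "estimator__C": [0.01, 0.1, 1, 10, 100],
    "estimator__max_iter": [100, 1000],
}
candidates = list(expand_grid(grid))
n_folds = 10

print(len(candidates))            # number of parameter combinations
print(len(candidates) * n_folds)  # model fits during the search, before the final refit
```

With a one-vs-rest wrapper, each of those fits trains one binary classifier per genre, so the total training cost is larger again by a factor equal to the number of genres.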

In [196]:
mdl.fit(X_train, y_train_genres)


In [197]:
clf = mdl.best_estimator_

In [198]:
y_pred_genres = clf.predict_proba(X_test)

In [199]:
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.8302317263894782
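
The macro-averaged score above is the unweighted mean of one ROC AUC per genre column. As a rough illustration, the pure-Python sketch below computes that quantity from the pairwise-ranking definition of AUC; the function names and toy data are hypothetical, and the quadratic pair loop is meant for clarity, not for large datasets.

```python
def binary_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(Y_true, Y_score):
    """Unweighted mean of the per-label (per-column) AUC values."""
    n_labels = len(Y_true[0])
    per_label = [
        binary_auc([row[j] for row in Y_true], [row[j] for row in Y_score])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


# Toy example: four movies, two genres.
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.7, 0.1]]
print(macro_auc(Y_true, Y_score))
```

Because every genre contributes equally to the mean, rare genres weigh as much as common ones in this metric.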

### Predict the testing dataset

In [200]:
X_test_dtm = vect.transform(dataTesting['plot'])

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

y_pred_test_genres = clf.predict_proba(X_test_dtm)


In [201]:
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)

In [202]:
res.head()

Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.054707,0.041649,0.013776,0.042073,0.208056,0.060229,0.004552919,0.605124,0.015267,0.062206,...,0.038444,0.070729,0.000745,0.709057,0.00550074,0.013796,9.637396e-08,0.09456,0.01722,0.002871
4,0.035179,0.040577,0.028104,0.042305,0.073629,0.153748,0.07284552,0.630306,0.023738,0.025198,...,0.038348,0.048802,0.000745,0.19529,0.00498958,0.013808,0.0001446434,0.271891,0.025344,0.004558
5,0.006418,0.002656,0.000212,0.042134,0.001995,0.992986,1.896808e-08,0.973234,0.000112,0.001264,...,0.038042,0.613439,0.000745,0.04681,5.84653e-07,0.013685,1.127695e-11,0.330825,1e-06,2e-06
6,0.034867,0.052215,0.011074,0.042264,0.111003,0.011252,3.101742e-05,0.897593,0.028522,0.010758,...,0.038359,0.05798,0.000745,0.050808,0.04152372,0.013752,1.52704e-05,0.52173,0.003531,0.00028
7,0.000611,0.007558,0.000339,0.041865,0.005083,0.005576,0.0001444482,0.292607,5.5e-05,0.04608,...,0.038088,0.058778,0.000745,0.0321,0.9448982,0.01375,5.181475e-10,0.443673,0.004082,2.3e-05


In [203]:
res.to_csv('pred_genres_text_RF.csv', index_label='ID')
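
Before uploading, it can be worth checking the file layout. The sketch below validates a submission with the standard library only. It assumes the expected layout is exactly the one this notebook writes (an `ID` column followed by the 24 `p_<genre>` columns); the helper `check_submission` is a hypothetical name, and the sample rows are made up.

```python
import csv
import io

EXPECTED_GENRES = [
    "Action", "Adventure", "Animation", "Biography", "Comedy", "Crime",
    "Documentary", "Drama", "Family", "Fantasy", "Film-Noir", "History",
    "Horror", "Music", "Musical", "Mystery", "News", "Romance",
    "Sci-Fi", "Short", "Sport", "Thriller", "War", "Western",
]


def check_submission(handle):
    """Return the number of data rows, raising ValueError on a malformed file."""
    reader = csv.reader(handle)
    header = next(reader)
    expected = ["ID"] + ["p_" + genre for genre in EXPECTED_GENRES]
    if header != expected:
        raise ValueError("unexpected header")
    n_rows = 0
    for row in reader:
        probs = [float(value) for value in row[1:]]
        if len(probs) != len(EXPECTED_GENRES):
            raise ValueError(f"wrong number of columns for ID {row[0]}")
        if not all(0.0 <= p <= 1.0 for p in probs):
            raise ValueError(f"probability out of range for ID {row[0]}")
        n_rows += 1
    return n_rows


# Made-up two-row file standing in for the real submission.
header_line = ",".join(["ID"] + ["p_" + genre for genre in EXPECTED_GENRES])
sample = io.StringIO(
    header_line + "\n"
    + "1," + ",".join(["0.5"] * 24) + "\n"
    + "4," + ",".join(["0.1"] * 24) + "\n"
)
print(check_submission(sample))
```

The check deliberately does not require each row to sum to 1: the one-vs-rest model scores every genre independently, so the probabilities in a row can add up to more or less than 1.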