# Project 3


# Movie Genre Classification

Classify a movie genre based on its plot.

<img src="moviegenre.png"
     style="float: left; margin-right: 10px;" />




https://www.kaggle.com/c/miia4201-202019-p3-moviegenreclassification/overview

### Data

Input:
- movie plot

Output:
- probability that the movie belongs to each genre


### Evaluation

- 20% API
- 30% Report with all the details of the solution, the analysis and the conclusions. The report cannot exceed 10 pages, must be sent in PDF format and must be self-contained.
- 50% Performance in the Kaggle competition. Each group's grade is proportional to its ranking in the competition: the first-place group receives 5 points, and 0.25 points are subtracted for each position below it (first place: 5 points, second: 4.75 points, third: 4.50 points ... eleventh: 2.50 points, twelfth: 2.25 points).
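
The ranking rule above is a simple linear formula. The sketch below reproduces the listed values; the function name `kaggle_grade` is hypothetical, and the brief does not say what happens beyond the twelfth place, so no lower bound is applied here.

```python
def kaggle_grade(position: int) -> float:
    """Points for a leaderboard position: 5 for first place, minus 0.25 per position below."""
    if position < 1:
        raise ValueError("position must be 1 or greater")
    return 5.0 - 0.25 * (position - 1)


# Reproduce the examples given in the evaluation rule
for pos in (1, 2, 3, 11, 12):
    print(pos, kaggle_grade(pos))
```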

• The project must be carried out in the groups assigned for module 4.
• Use clear and rigorous procedures.
• The project is due on July 12, 2020, at 11:59 pm, through Sicua + (upload the API and the report in PDF format).
• No projects will be received after the delivery time or by any other means than the one established. 




### Acknowledgements

We thank Professor Fabio Gonzalez, Ph.D. and his student John Arevalo for providing this dataset.

See https://arxiv.org/abs/1702.01992

## Sample Submission

In [22]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier,  ExtraTreesClassifier
from sklearn.metrics import r2_score, roc_auc_score, accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV

In [23]:
dataTraining = pd.read_csv('datasets/dataTraining.csv', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('datasets/dataTesting.csv', encoding='UTF-8', index_col=0)

In [24]:
dataTraining.head()

|  | year | title | plot | genres | rating |
|---|---|---|---|---|---|
| 3107 | 2003 | Most | most is the story of a single father who takes... | ['Short', 'Drama'] | 8.0 |
| 900 | 2008 | How to Be a Serial Killer | a serial killer decides to teach the secrets o... | ['Comedy', 'Crime', 'Horror'] | 5.6 |
| 6724 | 1941 | A Woman's Face | in sweden , a female blackmailer with a disfi... | ['Drama', 'Film-Noir', 'Thriller'] | 7.2 |
| 4704 | 1954 | Executive Suite | in a friday afternoon in new york , the presi... | ['Drama'] | 7.4 |
| 2582 | 1990 | Narrow Margin | in los angeles , the editor of a publishing h... | ['Action', 'Crime', 'Thriller'] | 6.6 |


In [25]:
dataTesting.head()

|  | year | title | plot |
|---|---|---|---|
| 1 | 1999 | Message in a Bottle | who meets by fate , shall be sealed by fate .... |
| 4 | 1978 | Midnight Express | the true story of billy hayes , an american c... |
| 5 | 1996 | Primal Fear | martin vail left the chicago da ' s office to ... |
| 6 | 1950 | Crisis | husband and wife americans dr . eugene and mr... |
| 7 | 1959 | The Tingler | the coroner and scientist dr . warren chapin ... |


### Create count vectorizer

In [26]:
from nltk.stem import WordNetLemmatizer
import nltk
# The WordNet corpus must be available locally, e.g. via nltk.download('wordnet')
wordnet_lemmatizer = WordNetLemmatizer()

In [27]:
def split_into_lemmas(text):
    text = text.lower()
    words = text.split()
    return [wordnet_lemmatizer.lemmatize(word) for word in words]

In [28]:
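#vect = CountVectorizer(lowercase=True, stop_words='english', max_features=5000)
# Note: with a callable analyzer, CountVectorizer ignores lowercase and stop_words;
# lowercasing happens inside split_into_lemmas, and stop words are not removed.
vect = CountVectorizer(lowercase=True, stop_words='english', analyzer=split_into_lemmas, max_features=38000)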
X_dtm = vect.fit_transform(dataTraining['plot'])

In [29]:
X_dtm.shape

(7895, 34629)
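
Because `CountVectorizer` skips its own preprocessing and stop-word filtering when `analyzer` is a callable, one possible alternative is to pass the custom function as `tokenizer` instead. The sketch below illustrates the difference on two toy sentences. The helper `whitespace_tokens` and the sample documents are hypothetical stand-ins for `split_into_lemmas` and the plot column (lemmatization is left out so the example runs without the WordNet corpus).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]


def whitespace_tokens(text):
    # Hypothetical stand-in for split_into_lemmas (no lemmatization here)
    return text.lower().split()


# Callable analyzer: stop_words is ignored, so "the" and "on" stay in the vocabulary
vect_callable = CountVectorizer(stop_words='english', analyzer=whitespace_tokens)
vect_callable.fit(docs)
vocab_callable = set(vect_callable.vocabulary_)

# Custom tokenizer: the default word analyzer still lowercases and drops stop words
vect_tokenizer = CountVectorizer(stop_words='english', tokenizer=whitespace_tokens,
                                 token_pattern=None)
vect_tokenizer.fit(docs)
vocab_tokenizer = set(vect_tokenizer.vocabulary_)

print(sorted(vocab_callable))
print(sorted(vocab_tokenizer))
```

Switching the notebook to this form would change the vocabulary size, so the shape shown above would no longer apply.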

### Create y

In [30]:
import ast
# Parse the string form of each genre list safely (ast.literal_eval instead of eval)
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])

In [31]:
y_genres

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 1, 0, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0]])
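
To make the shape of `y_genres` concrete, here is a minimal sketch on three genre strings taken from the training preview. It parses each string into a list and binarizes it; the variable names `raw`, `parsed`, `mlb` and `y_demo` are illustrative only.

```python
import ast

from sklearn.preprocessing import MultiLabelBinarizer

# Genre lists are stored as strings in the CSV, so they must be parsed first
raw = ["['Short', 'Drama']", "['Comedy', 'Crime', 'Horror']", "['Drama']"]
parsed = [ast.literal_eval(s) for s in raw]

mlb = MultiLabelBinarizer()
y_demo = mlb.fit_transform(parsed)

# Columns follow mlb.classes_, which is sorted alphabetically
print(list(mlb.classes_))
print(y_demo)
```

Each row is one movie and each column one genre, with a 1 wherever the movie carries that genre.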

In [32]:
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, train_size=0.7, random_state=123)

### Train multi-class multi-label model

In [33]:
# Cs must be an int or a list of floats; Cs=100 matches the fitted estimator printed below
mdl = OneVsRestClassifier(LogisticRegressionCV(Cs=100, cv=20, max_iter=100, n_jobs=-1))

In [34]:
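# Note: fitting on all of X_dtm includes the rows held out in X_test, so the
# ROC AUC reported below is optimistic; fit on X_train, y_train_genres for an
# unbiased hold-out estimate.
mdl.fit(X_dtm, y_genres)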

OneVsRestClassifier(estimator=LogisticRegressionCV(Cs=100, class_weight=None,
                                                   cv=20, dual=False,
                                                   fit_intercept=True,
                                                   intercept_scaling=1.0,
                                                   l1_ratios=None, max_iter=100,
                                                   multi_class='auto',
                                                   n_jobs=-1, penalty='l2',
                                                   random_state=None,
                                                   refit=True, scoring=None,
                                                   solver='lbfgs', tol=0.0001,
                                                   verbose=0),
                    n_jobs=None)

In [35]:
y_pred_genres = mdl.predict_proba(X_test)

In [36]:
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.9739978039571927
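
The score above is computed on rows the model saw during training, since `mdl` was fit on `X_dtm` rather than `X_train`. The sketch below contrasts a hold-out estimate with such a leaky one. It uses synthetic data and a plain `LogisticRegression`, both of which are assumptions made to keep the example small, so the numbers it prints say nothing about the movie dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data: 400 samples, 20 features, 4 labels
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
W = rng.normal(size=(20, 4))
Y = ((X @ W + rng.normal(scale=2.0, size=(400, 4))) > 0).astype(int)

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, train_size=0.7, random_state=123)

# Hold-out estimate: the test rows are never seen during fitting
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_tr, Y_tr)
auc_holdout = roc_auc_score(Y_te, clf.predict_proba(X_te), average='macro')

# Leaky estimate: fitting on all rows, then scoring on a subset of them
clf_full = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf_full.fit(X, Y)
auc_leaky = roc_auc_score(Y_te, clf_full.predict_proba(X_te), average='macro')

print(f"hold-out AUC: {auc_holdout:.3f}  leaky AUC: {auc_leaky:.3f}")
```

A reasonable workflow is to use the hold-out score for model selection and then refit on all training rows before predicting the Kaggle test set.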

### Predict the testing dataset

In [37]:
X_test_dtm = vect.transform(dataTesting['plot'])

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

y_pred_test_genres = mdl.predict_proba(X_test_dtm)


In [38]:
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
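
The column names are hard-coded above, which works only while their order matches the columns of `predict_proba`. One way to avoid a mismatch is to derive them from the binarizer's `classes_`, as in this sketch. The names `all_genres`, `mlb` and `derived_cols` are illustrative, and the genre list is copied from the column names above.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# The 24 genres from the submission columns, deliberately given in reverse order
all_genres = ['Western', 'War', 'Thriller', 'Sport', 'Short', 'Sci-Fi', 'Romance',
              'News', 'Mystery', 'Musical', 'Music', 'Horror', 'History', 'Film-Noir',
              'Fantasy', 'Family', 'Drama', 'Documentary', 'Crime', 'Comedy',
              'Biography', 'Animation', 'Adventure', 'Action']

mlb = MultiLabelBinarizer()
mlb.fit([all_genres])

# classes_ is sorted, and the probability columns follow the same order
derived_cols = ['p_' + genre for genre in mlb.classes_]
print(derived_cols)
```

In the notebook the equivalent expression would use the fitted `le` object in place of `mlb`.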

In [39]:
res.head()

|  | p_Action | p_Adventure | p_Animation | p_Biography | p_Comedy | p_Crime | p_Documentary | p_Drama | p_Family | p_Fantasy | ... | p_Musical | p_Mystery | p_News | p_Romance | p_Sci-Fi | p_Short | p_Sport | p_Thriller | p_War | p_Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.103718 | 0.079395 | 0.005881427 | 0.046429 | 0.420226 | 0.045576 | 1.367812e-09 | 0.45002 | 0.001104878 | 0.095206 | ... | 0.062852 | 0.06017 | 0.000908 | 0.843357 | 0.006027 | 1.102464e-06 | 2.412428e-06 | 0.06714 | 0.0003020208 | 0.0003719033 |
| 4 | 0.127585 | 0.04338 | 0.01830063 | 0.059152 | 0.303661 | 0.246472 | 0.005915488 | 0.693381 | 0.006772167 | 0.022588 | ... | 0.029935 | 0.068263 | 0.000942 | 0.085293 | 0.006128 | 0.0003084874 | 0.0001008019 | 0.261454 | 0.009860388 | 0.0007622277 |
| 5 | 0.000235 | 8.4e-05 | 3.497974e-07 | 0.035845 | 0.003008 | 0.998462 | 4.624273e-14 | 0.900852 | 4.867364e-10 | 0.000304 | ... | 0.001669 | 0.932295 | 0.000769 | 0.009579 | 1.3e-05 | 5.542841e-11 | 4.086982e-08 | 0.738879 | 9.713536e-07 | 9.355901e-09 |
| 6 | 0.029667 | 0.018166 | 4.322753e-05 | 0.03926 | 0.08112 | 0.001279 | 2.627447e-10 | 0.884703 | 0.0001140289 | 0.006066 | ... | 0.003714 | 0.033929 | 0.000851 | 0.036952 | 0.043189 | 2.602558e-07 | 6.494392e-06 | 0.212318 | 0.0008735165 | 0.001369602 |
| 7 | 0.001017 | 0.001646 | 0.000163366 | 0.042559 | 0.09541 | 0.001508 | 1.789657e-10 | 0.138474 | 8.640139e-06 | 0.070976 | ... | 0.000912 | 0.072604 | 0.000844 | 0.074897 | 0.982324 | 2.609939e-06 | 5.85573e-13 | 0.353328 | 5.202543e-08 | 1.038345e-06 |


In [40]:
res.to_csv('logit3.csv', index_label='ID')
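
Before uploading, it can help to confirm that the file has the expected header and contains only valid probabilities. The sketch below writes a tiny two-genre frame to an in-memory buffer instead of `logit3.csv`; the frame `res_demo` and its values are made up for illustration.

```python
import io

import numpy as np
import pandas as pd

# Made-up probabilities for two movies and two genres
probs = np.array([[0.10, 0.90], [0.70, 0.20]])
res_demo = pd.DataFrame(probs, index=[1, 4], columns=['p_Action', 'p_Adventure'])

# Write with the same index_label as the submission, then read it back
buf = io.StringIO()
res_demo.to_csv(buf, index_label='ID')
header = buf.getvalue().splitlines()[0]

buf.seek(0)
loaded = pd.read_csv(buf, index_col='ID')

# Basic sanity checks: every value lies in [0, 1] and nothing is missing
values_ok = bool(((loaded.values >= 0) & (loaded.values <= 1)).all())
no_missing = not loaded.isna().any().any()
print(header, values_ok, no_missing)
```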