#Wikipedia Movie
##Ashley Shang

##Abstract
> The main tasks are to analyze the data and to classify movies by genre based on their plot descriptions. The results can help identify the genre of a movie, and therefore its target audience, and may also inform suggestions for investors.

##Introduction

The movie industry is without doubt a huge area of investment, but its complexity makes it difficult to invest wisely, and big investments are risky. "Absolutely no one can tell you what a movie is going to do in the marketplace," said J. Valenti, the president of the Motion Picture Association of America (MPAA). "Not until that film opens in a darkened theater, and sparks fly up between the screen and the audience, can you say this film is right." With the film industry growing rapidly, the Internet now holds an extensive amount of data, which makes the film industry an interesting area for data analysis.

## Data

### Source
This dataset comes from [Wikipedia Movie Plots](https://www.kaggle.com/jrobischon/wikipedia-movie-plots)

###Content
The dataset contains descriptions of 34,886 movies from around the world. Column descriptions are listed below:

- Release Year - Year in which the movie was released
- Title - Movie title
- Origin/Ethnicity - Origin of movie (e.g. American, Bollywood, Tamil, etc.)
- Director - Director(s)
- Cast - Main actors and actresses
- Genre - Movie Genre(s)
- Wiki Page - URL of the Wikipedia page from which the plot description was scraped
- Plot - Long form description of movie plot (WARNING: May contain spoilers!!!)

In [5]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
from sparkdl.transformers import utils
import tensorflow as tf
from pyspark.ml.feature import *

import matplotlib.pyplot as plt
import seaborn as sns
import itertools

from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go

from wordcloud import WordCloud

In [6]:
df = pd.read_csv("/dbfs/FileStore/tables/wiki_movie_plots_deduped.csv")
df.head()

## Methods

- EDA
- wordcloud
- nltk
- TfidfVectorizer
- Multinomial Naive Bayes
- LinearSVC

##Part 1 - EDA

In [9]:
df.shape # obtain the size of the data

In [10]:
df.describe()

In [11]:
df.info()

In [12]:
df.columns # current features

__Number of movies by country of origin__

In [14]:
# plot.bar creates its own figure when figsize is passed, so no separate plt.figure is needed
ax = df['Origin/Ethnicity'].value_counts().sort_index().plot.bar(
    figsize = (10, 5),
    fontsize = 14)

ax.set_title("Count of Origin/Ethnicity of Movies", fontsize=16)
plt.xlabel('Origin/Ethnicity', fontsize=20)
plt.ylabel('Counts', fontsize=20)
plt.xticks(size = 8, rotation=30)
plt.yticks(size = 12)
sns.despine(bottom=True, left=True)
display(plt.show())

Conclusion: American movies far outnumber movies of any other origin.

__movie count vs release year__

In [17]:
# one line per origin that accounts for more than 3% of all movies

sns.set(style="whitegrid")
plt.figure(figsize=(22,16))
l = len(df["Origin/Ethnicity"])
con = []
for country in df["Origin/Ethnicity"].unique():
    c = df[df["Origin/Ethnicity"]==country]
    if len(c)>l*0.03:
        x = c["Release Year"].value_counts()
        # keyword arguments: recent seaborn versions reject positional x and y
        sns.lineplot(x=x.index, y=x.values)
        con.append(country)
plt.legend(con)
plt.title("Movie Count Per Origin/Ethnicity", fontsize = 20)
display(plt.show())

In [18]:
# all origins combined
year_counts = df['Release Year'].value_counts().sort_index(ascending=True)
sns.lineplot(x=year_counts.index, y=year_counts.values)
plt.title("Movie count per release year", fontsize = 20)
display(plt.show())

There are over 1,000 movies in 2015, after which the count drops suddenly.

__WordCloud of Plot column__

In [21]:
#WordCloud of Plot column
wordcloud = WordCloud(width = 1000, height = 600, max_font_size = 120).generate(" ".join(df.Plot))

plt.subplots(figsize=(18,8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
display(plt.show())

__WordCloud of Title column__

In [23]:

wordcloud1 = WordCloud(width = 1000, height = 600, max_font_size = 120).generate(" ".join(df.Title))

plt.subplots(figsize=(18,8))
plt.imshow(wordcloud1, interpolation='bilinear')
plt.axis('off')
display(plt.show())

__Most frequent Genres__

In [25]:
plt.figure(figsize=(22,16))
plt.title('Most frequent Genre types', fontsize=20)
plt.ylabel('Count', fontsize=16)
plt.xlabel('Genre', fontsize=16)

sns.countplot(x=df.Genre,order=df.Genre.value_counts().iloc[:15].index,palette=sns.color_palette("PuRd", 15))

plt.xticks(size=12,rotation=45)
plt.yticks(size=16)
sns.despine(bottom=True, left=True)
display(plt.show())

In [26]:
# number of movies in each genre (comma-separated genres are counted separately)

genres=list(genre.split(',') for genre in df.Genre)
genres=list(itertools.chain.from_iterable(genres))
genres=pd.Series(genres).value_counts()

print('There are ',len(genres), 'different Genres in the dataset:')
print('-'*50)
print(genres)

__Top cast__

In [28]:
CastCount=df.Cast.value_counts()
top_cast=CastCount[:20]  # the 20 most frequent cast entries

plt.figure(figsize=(22,16))
plt.title('Top Cast',fontsize=20)

sns.barplot(x=top_cast.values,y=top_cast.index,palette=sns.color_palette("cubehelix", 20))

plt.ylabel('Cast',fontsize=16)
plt.xlabel('Number of movies participated',fontsize=16)
plt.xticks(size=16)
plt.yticks(size=16)
display(plt.show())

__Top directors__

In [30]:
plt.figure(figsize=(22,16))
plt.title('Top Directors (based on number of movies directed)',fontsize=20)

sns.countplot(x=df.Director,order=df.Director.value_counts()[:20].index,palette=sns.color_palette("muted", 20))

plt.xlabel('Directors',fontsize=16)
plt.ylabel('Number of movies directed',fontsize=16)
plt.xticks(size=12,rotation=30)
display(plt.show())

##Part 2 - Genre Classification based on plots

In [32]:
%sh pip install nltk

In [33]:
import re

import pickle 
#import mglearn
import time
from nltk.tokenize import TweetTokenizer # doesn't split at apostrophes
import nltk
from nltk import Text
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import word_tokenize  
from nltk.tokenize import sent_tokenize 
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression 
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier


from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

In [34]:
#cleansing the genre feature
df['GenreCorrected']=df['Genre'] 
df['GenreCorrected']=df['GenreCorrected'].str.strip()
df['GenreCorrected']=df['GenreCorrected'].str.replace(' - ', '|')
df['GenreCorrected']=df['GenreCorrected'].str.replace(' / ', '|')
df['GenreCorrected']=df['GenreCorrected'].str.replace('/', '|')
df['GenreCorrected']=df['GenreCorrected'].str.replace(' & ', '|')
df['GenreCorrected']=df['GenreCorrected'].str.replace(', ', '|')
df['GenreCorrected']=df['GenreCorrected'].str.replace('; ', '|')
df['GenreCorrected']=df['GenreCorrected'].str.replace('bio-pic', 'biography')
df['GenreCorrected']=df['GenreCorrected'].str.replace('biopic', 'biography')
df['GenreCorrected']=df['GenreCorrected'].str.replace('biographical', 'biography')
df['GenreCorrected']=df['GenreCorrected'].str.replace('biodrama', 'biography')
df['GenreCorrected']=df['GenreCorrected'].str.replace('bio-drama', 'biography')
df['GenreCorrected']=df['GenreCorrected'].str.replace('biographic', 'biography')
df['GenreCorrected']=df['GenreCorrected'].str.replace(' \(film genre\)', '')
df['GenreCorrected']=df['GenreCorrected'].str.replace('animated','animation')
df['GenreCorrected']=df['GenreCorrected'].str.replace('anime','animation')
df['GenreCorrected']=df['GenreCorrected'].str.replace('children\'s','children')
df['GenreCorrected']=df['GenreCorrected'].str.replace('comedey','comedy')
df['GenreCorrected']=df['GenreCorrected'].str.replace('\[not in citation given\]','')
df['GenreCorrected']=df['GenreCorrected'].str.replace(' set 4,000 years ago in the canadian arctic','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('historical','history')
df['GenreCorrected']=df['GenreCorrected'].str.replace('romantic','romance')
df['GenreCorrected']=df['GenreCorrected'].str.replace('3-d','animation')
df['GenreCorrected']=df['GenreCorrected'].str.replace('3d','animation')
df['GenreCorrected']=df['GenreCorrected'].str.replace('viacom 18 motion pictures','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('sci-fi','science_fiction')
df['GenreCorrected']=df['GenreCorrected'].str.replace('ttriller','thriller')
df['GenreCorrected']=df['GenreCorrected'].str.replace('.','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('based on radio serial','')
df['GenreCorrected']=df['GenreCorrected'].str.replace(' on the early years of hitler','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('sci fi','science_fiction')
df['GenreCorrected']=df['GenreCorrected'].str.replace('science fiction','science_fiction')
df['GenreCorrected']=df['GenreCorrected'].str.replace(' (30min)','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('16 mm film','short')
df['GenreCorrected']=df['GenreCorrected'].str.replace('\[140\]','drama')
df['GenreCorrected']=df['GenreCorrected'].str.replace('\[144\]','')
df['GenreCorrected']=df['GenreCorrected'].str.replace(' for ','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('adventures','adventure')
df['GenreCorrected']=df['GenreCorrected'].str.replace('kung fu','martial_arts')
df['GenreCorrected']=df['GenreCorrected'].str.replace('kung-fu','martial_arts')
df['GenreCorrected']=df['GenreCorrected'].str.replace('martial arts','martial_arts')
df['GenreCorrected']=df['GenreCorrected'].str.replace('world war ii','war')
df['GenreCorrected']=df['GenreCorrected'].str.replace('world war i','war')
df['GenreCorrected']=df['GenreCorrected'].str.replace('biography about montreal canadiens star\|maurice richard','biography')
df['GenreCorrected']=df['GenreCorrected'].str.replace('bholenath movies\|cinekorn entertainment','')
df['GenreCorrected']=df['GenreCorrected'].str.replace(' \(volleyball\)','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('spy film','spy')
df['GenreCorrected']=df['GenreCorrected'].str.replace('anthology film','anthology')
df['GenreCorrected']=df['GenreCorrected'].str.replace('biography fim','biography')
df['GenreCorrected']=df['GenreCorrected'].str.replace('avant-garde','avant_garde')
df['GenreCorrected']=df['GenreCorrected'].str.replace('biker film','biker')
df['GenreCorrected']=df['GenreCorrected'].str.replace('buddy cop','buddy')
df['GenreCorrected']=df['GenreCorrected'].str.replace('buddy film','buddy')
df['GenreCorrected']=df['GenreCorrected'].str.replace('comedy 2-reeler','comedy')
df['GenreCorrected']=df['GenreCorrected'].str.replace('films','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('film','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('biography of pioneering american photographer eadweard muybridge','biography')
df['GenreCorrected']=df['GenreCorrected'].str.replace('british-german co-production','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('bruceploitation','martial_arts')
df['GenreCorrected']=df['GenreCorrected'].str.replace('comedy-drama adaptation of the mordecai richler novel','comedy-drama')
df['GenreCorrected']=df['GenreCorrected'].str.replace('movies by the mob\|knkspl','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('movies','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('movie','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('coming of age','coming_of_age')
df['GenreCorrected']=df['GenreCorrected'].str.replace('coming-of-age','coming_of_age')
df['GenreCorrected']=df['GenreCorrected'].str.replace('drama about child soldiers','drama')
df['GenreCorrected']=df['GenreCorrected'].str.replace('(( based).+)','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('(( co-produced).+)','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('(( adapted).+)','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('(( about).+)','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('musical b','musical')
df['GenreCorrected']=df['GenreCorrected'].str.replace('animationchildren','animation|children')
df['GenreCorrected']=df['GenreCorrected'].str.replace(' period','period')
df['GenreCorrected']=df['GenreCorrected'].str.replace('drama loosely','drama')
df['GenreCorrected']=df['GenreCorrected'].str.replace(' \(aquatics\|swimming\)','')
df['GenreCorrected']=df['GenreCorrected'].str.replace("yogesh dattatraya gosavi's directorial debut \[9\]",'')
df['GenreCorrected']=df['GenreCorrected'].str.replace("war-time","war")
df['GenreCorrected']=df['GenreCorrected'].str.replace("wartime","war")
df['GenreCorrected']=df['GenreCorrected'].str.replace("ww1","war")
df['GenreCorrected']=df['GenreCorrected'].str.replace('unknown','')
df['GenreCorrected']=df['GenreCorrected'].str.replace("wwii","war")
df['GenreCorrected']=df['GenreCorrected'].str.replace('psychological','psycho')
df['GenreCorrected']=df['GenreCorrected'].str.replace('rom-coms','romance')
df['GenreCorrected']=df['GenreCorrected'].str.replace('true crime','crime')
df['GenreCorrected']=df['GenreCorrected'].str.replace('\|007','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('slice of life','slice_of_life')
df['GenreCorrected']=df['GenreCorrected'].str.replace('computer animation','animation')
df['GenreCorrected']=df['GenreCorrected'].str.replace('gun fu','martial_arts')
df['GenreCorrected']=df['GenreCorrected'].str.replace('j-horror','horror')
df['GenreCorrected']=df['GenreCorrected'].str.replace(' \(shogi\|chess\)','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('afghan war drama','war drama')
df['GenreCorrected']=df['GenreCorrected'].str.replace('\|6 separate stories','')
df['GenreCorrected']=df['GenreCorrected'].str.replace(' \(30min\)','')
df['GenreCorrected']=df['GenreCorrected'].str.replace(' (road bicycle racing)','')
df['GenreCorrected']=df['GenreCorrected'].str.replace(' v-cinema','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('tv miniseries','tv_miniseries')
df['GenreCorrected']=df['GenreCorrected'].str.replace('\|docudrama','|documentary|drama')
df['GenreCorrected']=df['GenreCorrected'].str.replace(' in animation','|animation')
df['GenreCorrected']=df['GenreCorrected'].str.replace('((adaptation).+)','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('((adaptated).+)','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('((adapted).+)','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('(( on ).+)','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('american football','sports')
df['GenreCorrected']=df['GenreCorrected'].str.replace('dev\|nusrat jahan','sports')
df['GenreCorrected']=df['GenreCorrected'].str.replace('television miniseries','tv_miniseries')
df['GenreCorrected']=df['GenreCorrected'].str.replace(' \(artistic\)','')
df['GenreCorrected']=df['GenreCorrected'].str.replace(' \|direct-to-dvd','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('history dram','history drama')
df['GenreCorrected']=df['GenreCorrected'].str.replace('martial art','martial_arts')
df['GenreCorrected']=df['GenreCorrected'].str.replace('psycho thriller,','psycho thriller')
df['GenreCorrected']=df['GenreCorrected'].str.replace('\|1 girl\|3 suitors','')
df['GenreCorrected']=df['GenreCorrected'].str.replace(' \(road bicycle racing\)','')
filterE = df['GenreCorrected']=="ero"
df.loc[filterE,'GenreCorrected']="adult"
filterE = df['GenreCorrected']=="music"
df.loc[filterE,'GenreCorrected']="musical"
filterE = df['GenreCorrected']=="-"
df.loc[filterE,'GenreCorrected']=''
filterE = df['GenreCorrected']=="comedy–drama"
df.loc[filterE,'GenreCorrected'] = "comedy|drama"
filterE = df['GenreCorrected']=="comedy–horror"
df.loc[filterE,'GenreCorrected'] = "comedy|horror"
df['GenreCorrected']=df['GenreCorrected'].str.replace(' ','|')
df['GenreCorrected']=df['GenreCorrected'].str.replace(',','|')
df['GenreCorrected']=df['GenreCorrected'].str.replace('-','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('actionadventure','action|adventure')
df['GenreCorrected']=df['GenreCorrected'].str.replace('actioncomedy','action|comedy')
df['GenreCorrected']=df['GenreCorrected'].str.replace('actiondrama','action|drama')
df['GenreCorrected']=df['GenreCorrected'].str.replace('actionlove','action|love')
df['GenreCorrected']=df['GenreCorrected'].str.replace('actionmasala','action|masala')
df['GenreCorrected']=df['GenreCorrected'].str.replace('actionchildren','action|children')
df['GenreCorrected']=df['GenreCorrected'].str.replace('fantasychildren\|','fantasy|children')
df['GenreCorrected']=df['GenreCorrected'].str.replace('fantasycomedy','fantasy|comedy')
df['GenreCorrected']=df['GenreCorrected'].str.replace('fantasyperiod','fantasy|period')
df['GenreCorrected']=df['GenreCorrected'].str.replace('cbctv_miniseries','tv_miniseries')
df['GenreCorrected']=df['GenreCorrected'].str.replace('dramacomedy','drama|comedy')
df['GenreCorrected']=df['GenreCorrected'].str.replace('dramacomedysocial','drama|comedy|social')
df['GenreCorrected']=df['GenreCorrected'].str.replace('dramathriller','drama|thriller')
df['GenreCorrected']=df['GenreCorrected'].str.replace('comedydrama','comedy|drama')
df['GenreCorrected']=df['GenreCorrected'].str.replace('comedyhorror','comedy|horror')
df['GenreCorrected']=df['GenreCorrected'].str.replace('sciencefiction','science_fiction')
df['GenreCorrected']=df['GenreCorrected'].str.replace('adventurecomedy','adventure|comedy')
df['GenreCorrected']=df['GenreCorrected'].str.replace('animationdrama','animation|drama')
df['GenreCorrected']=df['GenreCorrected'].str.replace('\|\|','|')
df['GenreCorrected']=df['GenreCorrected'].str.replace('muslim','religious')
df['GenreCorrected']=df['GenreCorrected'].str.replace('thriler','thriller')
df['GenreCorrected']=df['GenreCorrected'].str.replace('crimethriller','crime|thriller')
df['GenreCorrected']=df['GenreCorrected'].str.replace('fantay','fantasy')
df['GenreCorrected']=df['GenreCorrected'].str.replace('actionthriller','action|thriller')
df['GenreCorrected']=df['GenreCorrected'].str.replace('comedysocial','comedy|social')
df['GenreCorrected']=df['GenreCorrected'].str.replace('martialarts','martial_arts')
df['GenreCorrected']=df['GenreCorrected'].str.replace('\|\(children\|poker\|karuta\)','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('epichistory','epic|history')

df['GenreCorrected']=df['GenreCorrected'].str.replace('erotica','adult')
df['GenreCorrected']=df['GenreCorrected'].str.replace('erotic','adult')

df['GenreCorrected']=df['GenreCorrected'].str.replace('((\|produced\|).+)','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('chanbara','chambara')
df['GenreCorrected']=df['GenreCorrected'].str.replace('comedythriller','comedy|thriller')
df['GenreCorrected']=df['GenreCorrected'].str.replace('biblical','religious')
df['GenreCorrected']=df['GenreCorrected'].str.replace('colour\|yellow\|productions\|eros\|international','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('\|directtodvd','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('liveaction','live|action')
df['GenreCorrected']=df['GenreCorrected'].str.replace('melodrama','drama')
df['GenreCorrected']=df['GenreCorrected'].str.replace('superheroes','superheroe')
df['GenreCorrected']=df['GenreCorrected'].str.replace('gangsterthriller','gangster|thriller')

df['GenreCorrected']=df['GenreCorrected'].str.replace('heistcomedy','comedy')
df['GenreCorrected']=df['GenreCorrected'].str.replace('heist','action')
df['GenreCorrected']=df['GenreCorrected'].str.replace('historic','history')
df['GenreCorrected']=df['GenreCorrected'].str.replace('historydisaster','history|disaster')
df['GenreCorrected']=df['GenreCorrected'].str.replace('warcomedy','war|comedy')
df['GenreCorrected']=df['GenreCorrected'].str.replace('westerncomedy','western|comedy')
df['GenreCorrected']=df['GenreCorrected'].str.replace('ancientcostume','costume')
df['GenreCorrected']=df['GenreCorrected'].str.replace('computeranimation','animation')
df['GenreCorrected']=df['GenreCorrected'].str.replace('dramatic','drama')
df['GenreCorrected']=df['GenreCorrected'].str.replace('familya','family')
df['GenreCorrected']=df['GenreCorrected'].str.replace('dramedy','drama|comedy')
df['GenreCorrected']=df['GenreCorrected'].str.replace('dramaa','drama')
df['GenreCorrected']=df['GenreCorrected'].str.replace('famil\|','family')

df['GenreCorrected']=df['GenreCorrected'].str.replace('superheroe','superhero')
df['GenreCorrected']=df['GenreCorrected'].str.replace('biogtaphy','biography')
df['GenreCorrected']=df['GenreCorrected'].str.replace('devotionalbiography','devotional|biography')
df['GenreCorrected']=df['GenreCorrected'].str.replace('docufiction','documentary|fiction')

df['GenreCorrected']=df['GenreCorrected'].str.replace('familydrama','family|drama')

df['GenreCorrected']=df['GenreCorrected'].str.replace('espionage','spy')
df['GenreCorrected']=df['GenreCorrected'].str.replace('supeheroes','superhero')
df['GenreCorrected']=df['GenreCorrected'].str.replace('romancefiction','romance|fiction')
df['GenreCorrected']=df['GenreCorrected'].str.replace('horrorthriller','horror|thriller')

df['GenreCorrected']=df['GenreCorrected'].str.replace('suspensethriller','suspense|thriller')
df['GenreCorrected']=df['GenreCorrected'].str.replace('musicaliography','musical|biography')
df['GenreCorrected']=df['GenreCorrected'].str.replace('triller','thriller')

df['GenreCorrected']=df['GenreCorrected'].str.replace('\|\(fiction\)','|fiction')

df['GenreCorrected']=df['GenreCorrected'].str.replace('romanceaction','romance|action')
df['GenreCorrected']=df['GenreCorrected'].str.replace('romancecomedy','romance|comedy')
df['GenreCorrected']=df['GenreCorrected'].str.replace('romancehorror','romance|horror')

df['GenreCorrected']=df['GenreCorrected'].str.replace('romcom','romance|comedy')
df['GenreCorrected']=df['GenreCorrected'].str.replace('rom\|com','romance|comedy')
df['GenreCorrected']=df['GenreCorrected'].str.replace('satirical','satire')

df['GenreCorrected']=df['GenreCorrected'].str.replace('science_fictionchildren','science_fiction|children')
df['GenreCorrected']=df['GenreCorrected'].str.replace('homosexual','adult')
df['GenreCorrected']=df['GenreCorrected'].str.replace('sexual','adult')

df['GenreCorrected']=df['GenreCorrected'].str.replace('mockumentary','documentary')
df['GenreCorrected']=df['GenreCorrected'].str.replace('periodic','period')
df['GenreCorrected']=df['GenreCorrected'].str.replace('romanctic','romance')
df['GenreCorrected']=df['GenreCorrected'].str.replace('politics','political')
df['GenreCorrected']=df['GenreCorrected'].str.replace('samurai','martial_arts')
df['GenreCorrected']=df['GenreCorrected'].str.replace('tv_miniseries','series')
df['GenreCorrected']=df['GenreCorrected'].str.replace('serial','series')

filterE = df['GenreCorrected']=="musical–comedy"
df.loc[filterE,'GenreCorrected'] = "musical|comedy"

filterE = df['GenreCorrected']=="roman|porno"
df.loc[filterE,'GenreCorrected'] = "adult"


filterE = df['GenreCorrected']=="action—masala"
df.loc[filterE,'GenreCorrected'] = "action|masala"


filterE = df['GenreCorrected']=="horror–thriller"
df.loc[filterE,'GenreCorrected'] = "horror|thriller"

df['GenreCorrected']=df['GenreCorrected'].str.replace('family','children')
df['GenreCorrected']=df['GenreCorrected'].str.replace('martial_arts','action')
df['GenreCorrected']=df['GenreCorrected'].str.replace('horror','thriller')
df['GenreCorrected']=df['GenreCorrected'].str.replace('war','action')
df['GenreCorrected']=df['GenreCorrected'].str.replace('adventure','action')
df['GenreCorrected']=df['GenreCorrected'].str.replace('science_fiction','action')
df['GenreCorrected']=df['GenreCorrected'].str.replace('western','action')
df['GenreCorrected']=df['GenreCorrected'].str.replace('noir','black')
df['GenreCorrected']=df['GenreCorrected'].str.replace('spy','action')
df['GenreCorrected']=df['GenreCorrected'].str.replace('superhero','action')
df['GenreCorrected']=df['GenreCorrected'].str.replace('social','')
df['GenreCorrected']=df['GenreCorrected'].str.replace('suspense','action')


filterE = df['GenreCorrected']=="drama|romance|adult|children"
df.loc[filterE,'GenreCorrected'] = "drama|romance|adult"

df['GenreCorrected']=df['GenreCorrected'].str.replace('\|–\|','|')
df['GenreCorrected']=df['GenreCorrected'].str.strip(to_strip='\|')
df['GenreCorrected']=df['GenreCorrected'].str.replace('actionner','action')
df['GenreCorrected']=df['GenreCorrected'].str.strip()
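
The cleanup above mixes literal replacements with regular-expression patterns and relies on how `Series.str.replace` interprets its pattern. In pandas 2.0 the default of its `regex` parameter changed to `False`, so the escaped patterns would no longer match unless `regex=True` is passed. A minimal sketch of a table-driven alternative that states the intent of each rule explicitly is shown below; the names `GENRE_RULES` and `normalize_genre` are hypothetical, and only a handful of the rules above are reproduced.

```python
import re

# Hypothetical rule table: (pattern, replacement, is_regex).
# Only a few of the notebook's rules are reproduced for illustration.
GENRE_RULES = [
    (" / ", "|", False),
    (", ", "|", False),
    ("sci-fi", "science_fiction", False),
    ("science fiction", "science_fiction", False),
    (r" \(film genre\)", "", True),
    (r"\[not in citation given\]", "", True),
    (".", "", False),  # a literal dot; as a regex it would match every character
]


def normalize_genre(raw):
    """Apply each rule in order to a single genre string."""
    text = raw.strip()
    for pattern, replacement, is_regex in GENRE_RULES:
        if not is_regex:
            # Literal rules are escaped so that no character is read as regex syntax.
            pattern = re.escape(pattern)
        text = re.sub(pattern, replacement, text)
    return text.strip("|").strip()


print(normalize_genre("sci-fi / comedy"))
print(normalize_genre("drama (film genre), romance."))
```

Under these assumptions the function could be applied with `df['Genre'].map(normalize_genre)`; keeping the rules in one list also makes ordering problems and duplicate rules easier to spot.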

In [35]:
#calculate the genre count before cleansing

df['Count']=1
df[['Genre','Count']].groupby(['Genre'], as_index=False).count().shape[0]

In [36]:
#after processing the genre count
df[['GenreCorrected','Count']].groupby(['GenreCorrected'], as_index=False).count().shape[0]

In [37]:
df[['GenreCorrected','Count']].groupby(['GenreCorrected'],as_index=False).count().sort_values(['Count'], ascending=False).head(15)

In [38]:
df['GenreCorrected'][11]

In [39]:
# split the multi-genre
df['GenreSplit'] = df['GenreCorrected'].str.split('|')
df['GenreSplit'] = df['GenreSplit'].apply(np.sort).apply(np.unique)

In [40]:
df['GenreSplit'][11]

Number of movies for each genre

In [42]:
# flatten the per-movie genre arrays into one array
# (the earlier loop over range(0, df.shape[0]-1) skipped the last movie)
genres_array = np.concatenate(df['GenreSplit'].tolist())

genres_array

In [43]:
genres = pd.DataFrame({'Genre':genres_array})
genres.head(15)

In [44]:
# which genres are most common
genres['Count']=1
genres[['Genre','Count']].groupby(['Genre'], as_index=False).sum().sort_values(['Count'], ascending=False).head(15)
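
The counting step above can also be pictured without pandas. The sketch below uses a small made-up list in place of `df['GenreSplit']` and counts how often each genre occurs across all movies; the data is purely illustrative.

```python
from collections import Counter
from itertools import chain

# Toy stand-in for df['GenreSplit']: one list of genres per movie.
genre_lists = [["action", "comedy"], ["drama"], ["comedy", "drama"], ["drama"]]

# Flatten the per-movie lists and count each genre once per occurrence.
genre_counts = Counter(chain.from_iterable(genre_lists))

for genre, count in genre_counts.most_common():
    print(genre, count)
```

A movie with several genres contributes to several counts, which is why the total number of occurrences can exceed the number of movies.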

Identification of the genres to be selected

In [46]:
genres=genres[['Genre','Count']].groupby(['Genre'], as_index=False).sum().sort_values(['Count'], ascending=False)
genres = genres[genres['Genre']!='']
genres.head(25)

In [47]:
TotalCountGenres=sum(genres['Count'])
TotalCountGenres

In [48]:
genres['Frequency'] = genres['Count']/TotalCountGenres
genres['CumulativeFrequency'] = genres['Frequency'].cumsum()
genres.head(15)

Select the genres with a cumulative frequency of 95.7% or below

In [50]:
np.array(genres[genres['CumulativeFrequency']<=.957]['Genre'])
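
The selection rule can be illustrated on toy numbers: rank the genres by count, accumulate their relative frequencies, and keep those whose running total stays at or below the threshold. The function name `select_main_genres` and the counts are made up for this sketch.

```python
from itertools import accumulate


def select_main_genres(counts, threshold=0.957):
    """Return genres, most frequent first, whose cumulative frequency <= threshold."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(count for _, count in ranked)
    cumulative = accumulate(count / total for _, count in ranked)
    return [genre for (genre, _), cum in zip(ranked, cumulative) if cum <= threshold]


toy_counts = {"drama": 50, "comedy": 30, "action": 15, "noir": 4, "mime": 1}
print(select_main_genres(toy_counts))
```

Here the cumulative frequencies are roughly 0.50, 0.80, 0.95, 0.99 and 1.00, so the two rarest genres fall outside the cut.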

In [51]:
genres[genres['CumulativeFrequency']<=.957][['Genre','Count']].plot(x='Genre', y='Count', kind='bar', 
                                                                    legend=False, grid=True, figsize=(22, 16))
plt.title("Number of movies per genre")
plt.ylabel('# of Occurrences', fontsize=20)
plt.xlabel('Movie genres', fontsize=20)
display(plt.show())

In [52]:
# define mainGenres before it is displayed or used
mainGenres=np.array(genres[genres['CumulativeFrequency']<=.957]['Genre'])
mainGenres

In [53]:
# for each movie, keep only the genres that belong to the main genres
df['GenreSplitMain'] = df['GenreSplit'].apply(lambda x: x[np.isin(x,mainGenres)])

In [54]:
df[['GenreSplitMain','GenreSplit','Genre']][200:220]

In [55]:
# function for cleaning the plots of the movies
def clean_text(text):
    text = text.lower()
    text = re.sub(r"what's", "what is ", text)
    # 'scuse has to be handled before the generic 's rule, which would otherwise consume it
    text = re.sub(r"\'scuse", " excuse ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "can not ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    #text = re.sub('\W', ' ', text)
    #text = re.sub('\s+', ' ', text)
    text = text.strip(' ')
    return text

In [56]:
df['PlotClean'] = df['Plot'].apply(clean_text)
df[['Plot','PlotClean','GenreSplitMain']][7:16]
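
The order of the substitutions in `clean_text` matters: a specific pattern such as `'scuse` has to run before the generic `'s` rule, otherwise it can never match. The sketch below shows the idea on a reduced, hypothetical rule list (`CONTRACTIONS` and `expand_contractions` are illustrative names) and additionally collapses repeated spaces, which `clean_text` leaves commented out.

```python
import re

# Specific contractions come first; the generic "'s" rule is last.
CONTRACTIONS = [
    (r"what's", "what is "),
    (r"'scuse", " excuse "),
    (r"can't", "can not "),
    (r"n't", " not "),
    (r"'s", " "),
]


def expand_contractions(text):
    """Lowercase the text, expand contractions in order, and collapse whitespace."""
    text = text.lower()
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    return re.sub(r"\s+", " ", text).strip()


print(expand_contractions("What's wrong? She can't find John's hat."))
print(expand_contractions("'Scuse me"))
```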

In [57]:
len(df['GenreSplitMain'][6]) # number of main genres assigned to one movie

In [58]:
df['MainGenresCount'] = df['GenreSplitMain'].apply(len)
max(df['MainGenresCount'])

In [59]:
df[df['MainGenresCount']==7]

In [60]:
df['MainGenresCount'].hist()

plt.title("Number of movies by number of genres", fontsize=20)
plt.ylabel('# of movies', fontsize=20)
plt.xlabel('# of genres', fontsize=20)
display(plt.show())

##Classifiers Training

###References on CountVectorizer and TfidfVectorizer:


- https://www.kaggle.com/adamschroeder/countvectorizer-tfidfvectorizer-predict-comments
- https://towardsdatascience.com/multi-label-text-classification-with-scikit-learn-30714b7819c5

###Building the classification algorithms

__Steps to be done:__

1. Build the classes: one dummy variable for each genre. In this final project, there are 20 genres of movies.

2. Split the data into train and test sets.

3. Build the features with TfidfVectorizer.
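
The first step, one dummy variable per genre, can be pictured as turning each movie's genre list into a row of 0/1 flags. This is a minimal sketch on made-up data; `genre_indicators` is a hypothetical helper, not the notebook's own code, which uses `str.get_dummies`.

```python
def genre_indicators(genre_lists, classes):
    """Return one row per movie with a 0/1 flag for each genre in `classes`."""
    return [[int(genre in genres) for genre in classes] for genres in genre_lists]


classes = ["action", "comedy", "drama"]
movies = [["comedy", "drama"], ["action"], []]

for row in genre_indicators(movies, classes):
    print(row)
```

Each column then serves as the binary target for one per-genre classifier, and a movie with no main genre gets a row of zeros.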

In [63]:
# the dummy classes
df = pd.concat([df, df.GenreSplitMain.apply(lambda x: '-'.join(x)).str.get_dummies(sep='-')], axis=1)

# train-test split
# the train and test sets contain only movies with a non-empty corrected genre
MoviesTrain, MoviesTest = train_test_split(df[df.GenreCorrected!=''], random_state=42, test_size=0.30, shuffle=True)

In [64]:
MoviesTrain.columns[14:]

In [65]:
# define the feature extraction algorithm
tfidf = TfidfVectorizer(stop_words ='english', smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word')
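
With `smooth_idf=False`, `sublinear_tf=False` and `norm=None`, scikit-learn documents the weight of a term as its raw count in the document times `ln(n / df) + 1`, where `n` is the number of documents and `df` the number of documents containing the term. The sketch below reproduces that formula on three made-up plots; it simplifies tokenization to whitespace splitting and skips stop-word removal, so it is an illustration of the weighting only.

```python
import math
from collections import Counter


def tfidf_unsmoothed(docs):
    """Raw-count tf times idf = ln(n / df) + 1, without row normalization."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


weights = tfidf_unsmoothed(["robot saves city", "robot falls in love", "city love story"])
print(weights[0])
```

Terms that occur in fewer plots ("saves") receive a larger weight than terms shared across plots ("robot", "city").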

In [66]:
# fit the vocabulary on the training plots only; the test plots are transformed
# with the same fitted vectorizer so that the feature dimensions match
x_train = tfidf.fit_transform(MoviesTrain.PlotClean)
x_test  = tfidf.transform(MoviesTest.PlotClean)

In [67]:
print('The vocabulary is large. It contains {} distinct terms.'.format(x_train.shape[1]))

In [68]:
# building the classes
y_train = MoviesTrain[MoviesTrain.columns[14:]]
y_test = MoviesTest[MoviesTest.columns[14:]]

__Multinomial Naive Bayes Classification__

In [70]:
multinomialNB=OneVsRestClassifier(MultinomialNB(fit_prior=True, class_prior=None))

accuracy_multinomialNB=pd.DataFrame(columns=['Genre', 'accuracy_multinomialNB'])

i = 0
for genre in mainGenres:
    multinomialNB.fit(x_train, y_train[genre])
    prediction = multinomialNB.predict(x_test)
    accuracy_multinomialNB.loc[i,'Genre'] = genre
    accuracy_multinomialNB.loc[i,'accuracy_multinomialNB'] = accuracy_score(y_test[genre], prediction)
    i=i+1
    

    
accuracy_multinomialNB


__Linear Support Vector Classification__

In [72]:
linearSVC=OneVsRestClassifier(LinearSVC(), n_jobs=1)

accuracy_LinearSVC=pd.DataFrame(columns=['Genre', 'accuracy_LinearSVC'])

i = 0
for genre in mainGenres:
    linearSVC.fit(x_train, y_train[genre])
    prediction = linearSVC.predict(x_test)
    accuracy_LinearSVC.loc[i,'Genre'] = genre
    accuracy_LinearSVC.loc[i,'accuracy_LinearSVC'] = accuracy_score(y_test[genre], prediction)
    i=i+1
    

accuracy_LinearSVC

In [73]:
# merging the accuracy tables
accuracy_mnb_svc = pd.merge(accuracy_multinomialNB, accuracy_LinearSVC, on='Genre', how='inner')
accuracy_mnb_svc

The results are quite similar.
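
When a genre applies to only a small share of the movies, per-genre accuracy may look high even for a classifier that never predicts the genre. One way to put the numbers above in context is to compare them with a majority-class baseline. The sketch below uses made-up labels, not results from this notebook, and the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Share of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the more frequent of the two classes."""
    positives = sum(y_true)
    return max(positives, len(y_true) - positives) / len(y_true)


# Made-up example: a rare genre present in 2 of 100 movies.
y_true = [0] * 98 + [1] * 2
y_pred = [0] * 100  # a classifier that never predicts the genre

print(accuracy(y_true, y_pred), majority_baseline_accuracy(y_true))
```

In the notebook, the baseline for a genre could be computed from `y_test[genre]` and set next to the accuracy columns in `accuracy_mnb_svc`.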

##Conclusion

- Both MultinomialNB and LinearSVC achieved high accuracy, even though LinearSVC failed to converge for some genres. The lowest accuracy is around 66% and the best is above 99%.

- Multinomial Naive Bayes classification is much faster than LinearSVC. In addition, it has no convergence issues.