# Text Classification Assessment

This assessment is a text classification project: the goal is to classify the genre of a movie from its characteristics, primarily the text of its plot summary. You will use a training set to identify and build your best predictive model, then use that model to predict the classes of a test set. Your predictions will be compared with your classmates' using the F1 score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
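
The assignment does not say which averaging mode is used for the multiclass F1 score, so the sketch below simply compares the three common `average` options of `f1_score`. The label lists are made up for illustration and are not taken from the dataset.

```python
from sklearn.metrics import f1_score

# Made-up labels for illustration only (not from movie_train.csv).
y_true = ["drama", "drama", "drama", "comedy", "comedy", "horror"]
y_pred = ["drama", "drama", "comedy", "comedy", "drama", "horror"]

# Macro: unweighted mean of the per-class F1 scores, so rare genres
# count as much as common ones.
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Weighted: per-class F1 scores averaged with class frequencies as weights.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

# Micro: computed from global counts; for single-label multiclass data
# this equals plain accuracy.
micro_f1 = f1_score(y_true, y_pred, average="micro")

print(f"macro={macro_f1:.3f} weighted={weighted_f1:.3f} micro={micro_f1:.3f}")
```

If the genres turn out to be imbalanced, macro and weighted F1 can differ noticeably, so it is worth confirming which one will be used for grading.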

The **movie_train.csv** dataset contains information (`Release Year`, `Title`, `Plot`, `Director`, `Cast`) about 10,682 movies, along with the `Genre` label. There are 9 different genres in this dataset, so this is a multiclass problem. You are expected to rely primarily on the `Plot` column, but you can use the additional columns as you see fit.

After you have identified your best-performing model, you will create predictions for the test set. The test set contains 3,561 movies with all of their information except the `Genre`.

Below is a list of tasks that you will definitely want to complete for this challenge, but the list is not exhaustive. It does not include any tasks on handling class imbalance or on testing multiple models and their tuning parameters, but you should still try both to see whether they help you build a better predictive model.
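
As one possible starting point for the class-imbalance check, the sketch below looks at the share of each genre and computes scikit-learn's "balanced" class weights. The label array is a toy stand-in for the `Genre` column, and whether weighting actually helps on the real data is something you would need to test.

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# Toy stand-in for df['Genre']; the real column has 9 genres.
labels = np.array(["drama"] * 6 + ["comedy"] * 3 + ["horror"] * 1)

# Share of each genre in the data.
share = pd.Series(labels).value_counts(normalize=True)
print(share)

# "balanced" weights are n_samples / (n_classes * class_count),
# so rarer genres receive larger weights.
classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weights = dict(zip(classes, weights))
print(class_weights)
```

Many scikit-learn classifiers accept `class_weight="balanced"` directly, which applies the same formula without computing the weights by hand.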


# Good Luck

### Task #1: Perform imports and load the dataset into a pandas DataFrame


In [168]:
import pandas as pd
import numpy as np
import nltk
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import regexp_tokenize, word_tokenize, RegexpTokenizer
import matplotlib.pyplot as plt
import re

In [169]:
df = pd.read_csv('movie_train.csv', index_col=0)

In [170]:
df.head()

Unnamed: 0,Release Year,Title,Plot,Director,Cast,Genre
10281,1984,Silent Madness,A computer error leads to the accidental relea...,Simon Nuchtern,"Belinda Montgomery, Viveca Lindfors",horror
7341,1960,Desire in the Dust,"Lonnie Wilson (Ken Scott), the son of a sharec...",Robert L. Lippert,"Raymond Burr, Martha Hyer, Joan Bennett",drama
10587,1986,On the Edge,"A gaunt, bushy-bearded, 44-year-old Wes Holman...",Rob Nilsson,"Bruce Dern, Pam Grier",drama
25495,1988,Ram-Avtar,Ram and Avtar are both childhood best friends....,Sunil Hingorani,"Sunny Deol, Anil Kapoor, Sridevi",drama
16607,2013,Machete Kills,Machete Cortez (Danny Trejo) and Sartana River...,Robert Rodriguez,"Danny Trejo, Michelle Rodriguez, Sofía Vergara...",action


### Task #2: Check for missing values:

In [171]:
# Check for missing (NaN) values (it's OK if there aren't any!):
df.isna()

Unnamed: 0,Release Year,Title,Plot,Director,Cast,Genre
10281,False,False,False,False,False,False
7341,False,False,False,False,False,False
10587,False,False,False,False,False,False
25495,False,False,False,False,False,False
16607,False,False,False,False,False,False
...,...,...,...,...,...,...
4652,False,False,False,False,False,False
23220,False,False,False,False,False,False
15847,False,False,False,False,False,False
3102,False,False,False,False,False,False


In [172]:
np.where(pd.isnull(df))

(array([   21,   261,   494,   498,   521,   533,   576,   591,   663,
          702,   803,   839,   847,   852,   948,   957,  1029,  1048,
         1177,  1353,  1488,  1497,  1517,  1565,  1607,  1665,  1722,
         1808,  1884,  1934,  1990,  1995,  1997,  2052,  2060,  2082,
         2202,  2251,  2360,  2451,  2540,  2547,  2869,  2906,  2907,
         2995,  3124,  3169,  3237,  3356,  3468,  3562,  3656,  3672,
         3685,  3693,  3698,  3734,  3748,  3796,  3870,  3895,  3960,
         4121,  4155,  4226,  4269,  4315,  4324,  4344,  4407,  4530,
         4762,  4777,  4816,  4817,  4916,  4926,  5031,  5052,  5188,
         5275,  5328,  5401,  5403,  5733,  5846,  5909,  5928,  5973,
         5983,  5986,  6008,  6115,  6152,  6397,  6438,  6463,  6494,
         6496,  6500,  6589,  6862,  6971,  6985,  6994,  6998,  7035,
         7098,  7339,  7372,  7489,  7502,  7621,  7719,  7766,  7770,
         7785,  7990,  8045,  8068,  8171,  8177,  8195,  8205,  8332,
      

In [173]:
df.columns

Index(['Release Year', 'Title', 'Plot', 'Director', 'Cast', 'Genre'], dtype='object')
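
The full boolean frame and the raw index arrays above are hard to read at a glance. A per-column count is usually easier to act on; the sketch below shows the idea on a small made-up frame (the column names mirror the dataset, the values do not).

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; only the column names match the dataset.
toy = pd.DataFrame({
    "Plot": ["A heist goes wrong.", "Two friends reunite.", "A ghost returns."],
    "Cast": ["Actor A, Actor B", np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows that have at least one missing value.
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)
```

Running `df.isna().sum()` on the real DataFrame would show which columns contain the missing entries.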

### Task #3: Remove NaN values:

In [174]:
# Replace missing cast entries with an empty string so no NaN values remain
# and no rows (and their plots) are lost.
df['Cast'] = df['Cast'].fillna('')
df.head()

Unnamed: 0,Release Year,Title,Plot,Director,Cast,Genre
10281,1984,Silent Madness,A computer error leads to the accidental relea...,Simon Nuchtern,"Belinda Montgomery, Viveca Lindfors",horror
7341,1960,Desire in the Dust,"Lonnie Wilson (Ken Scott), the son of a sharec...",Robert L. Lippert,"Raymond Burr, Martha Hyer, Joan Bennett",drama
10587,1986,On the Edge,"A gaunt, bushy-bearded, 44-year-old Wes Holman...",Rob Nilsson,"Bruce Dern, Pam Grier",drama
25495,1988,Ram-Avtar,Ram and Avtar are both childhood best friends....,Sunil Hingorani,"Sunny Deol, Anil Kapoor, Sridevi",drama
16607,2013,Machete Kills,Machete Cortez (Danny Trejo) and Sartana River...,Robert Rodriguez,"Danny Trejo, Michelle Rodriguez, Sofía Vergara...",action


### Task #4: Take a look at the columns and do some EDA to familiarize yourself with the data. This will consist of cleaning up the dataset by doing things like removing stop words, tokenizing, and/or lemmatizing words.

In [175]:
movie_genre = df.copy(deep = True)

In [176]:
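# Drop rows with a missing plot; dropna(inplace=True) on a single column
# would not modify the DataFrame itself.
movie_genre = movie_genre.dropna(subset=['Plot'])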

In [177]:
np.where(pd.isnull(movie_genre))

(array([], dtype=int64), array([], dtype=int64))

In [178]:
custom_sw = stopwords.words('english')
custom_sw.extend(["i'd","say"] )
custom_sw[-10:]

['wasn',
 "wasn't",
 'weren',
 "weren't",
 'won',
 "won't",
 'wouldn',
 "wouldn't",
 "i'd",
 'say']

In [181]:
test = movie_genre['Plot'].iloc[0]

In [183]:
test

'A computer error leads to the accidental release of homicidal patient Howard Johns from a mental institution. The mute murderer returns to the scene of his original crimes.[2]'

In [145]:
movie_genre.shape

(10682, 6)

In [None]:
# Match words, optionally followed by an apostrophe suffix (e.g. "don't").
pattern = r"[a-zA-Z]+(?:[’'][a-z]+)?"
tokenizer = RegexpTokenizer(pattern)

In [187]:
# Tokenize each plot; pass the method itself so apply() calls it once per row.
movie_genre['Clean_Plot'] = movie_genre['Plot'].apply(tokenizer.tokenize)

In [161]:
# Lowercase each plot, tokenize it, and drop stop words, row by row.
movie_genre['Clean_Plot'] = movie_genre['Plot'].apply(
    lambda plot: [token for token in tokenizer.tokenize(plot.lower()) if token not in custom_sw]
)
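
For the EDA part of this task, one simple option is to look at the genre distribution and the most frequent tokens per genre. The sketch below assumes a `Clean_Plot` column that holds a list of tokens per row; the toy rows are invented, and `collections.Counter` is used here in place of NLTK's `FreqDist` only to keep the example dependency-free.

```python
from collections import Counter

import pandas as pd

# Invented rows; 'Clean_Plot' is assumed to hold one token list per movie.
toy = pd.DataFrame({
    "Genre": ["horror", "horror", "drama"],
    "Clean_Plot": [
        ["ghost", "house", "ghost"],
        ["ghost", "night"],
        ["family", "farm"],
    ],
})

# How many movies fall into each genre.
genre_counts = toy["Genre"].value_counts()
print(genre_counts)

# The two most common tokens within each genre.
top_tokens = {}
for genre, group in toy.groupby("Genre"):
    counts = Counter(tok for tokens in group["Clean_Plot"] for tok in tokens)
    top_tokens[genre] = counts.most_common(2)
print(top_tokens)
```

Tokens that dominate every genre are candidates for the custom stop-word list, while tokens that are frequent in only one genre are likely to be useful features.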

### Task #5: Split the data into train & test sets:

Yes, we have a holdout set, but you do not know the genres of that data, so you can't use it to evaluate your models. You must therefore create your own training and test sets from the labeled data.
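
A minimal sketch of such a split with `train_test_split`, using a toy frame in place of the real data. The 25% test size and the fixed `random_state` are arbitrary choices, and stratifying on the label is one common way to keep the genre proportions similar in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the labeled movie data.
toy = pd.DataFrame({
    "Plot": [f"plot text {i}" for i in range(20)],
    "Genre": ["drama"] * 10 + ["comedy"] * 6 + ["horror"] * 4,
})

# stratify keeps the genre proportions roughly equal in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    toy["Plot"],
    toy["Genre"],
    test_size=0.25,
    stratify=toy["Genre"],
    random_state=42,
)

print(len(X_train), len(X_test))
print(y_train.value_counts())
```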

### Task #6: Build a pipeline to vectorize the data, then train and fit your models.
You should train multiple types of models and try different combinations of tuning parameters for each to find the best one. scikit-learn's `GridSearchCV` and `Pipeline` can help automate this process.
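
One way this could look is sketched below: a `Pipeline` that chains a TF-IDF vectorizer with a classifier, searched with `GridSearchCV`. The toy plots, the choice of `LogisticRegression`, the parameter grid, and the `f1_macro` scoring are all illustrative assumptions rather than recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny invented corpus; the real input would be the training plots.
plots = [
    "a ghost haunts the old house",
    "the killer stalks campers at night",
    "a monster rises from the grave",
    "a haunted doll terrorizes a family",
    "two friends fall in love in paris",
    "a wedding planner meets her match",
    "a couple reunites after years apart",
    "a shy baker finds love",
]
genres = ["horror"] * 4 + ["romance"] * 4

# The vectorizer is fitted inside each CV fold, which avoids leaking
# vocabulary from the validation part into training.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameter names follow the "<step name>__<parameter>" convention.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=2)
search.fit(plots, genres)

print(search.best_params_)
print(round(search.best_score_, 3))
```

To compare model families, the same pattern can be repeated with a different final step (for example `MultinomialNB` or `LinearSVC`) and its own grid.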


### Task #7: Run predictions and analyze the results on the test set to identify the best model.  
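
A per-genre breakdown often says more than a single score. The sketch below uses made-up true and predicted labels to show `confusion_matrix` and `classification_report`; in practice `y_test` and `y_pred` would come from your own split and fitted model.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels standing in for a real test split and model output.
y_test = ["drama", "drama", "drama", "comedy", "comedy", "horror"]
y_pred = ["drama", "drama", "comedy", "comedy", "drama", "horror"]
labels = ["comedy", "drama", "horror"]

# Rows are true genres, columns are predicted genres, in `labels` order.
cm = confusion_matrix(y_test, y_pred, labels=labels)
print(cm)

# Precision, recall, and F1 for each genre plus the averaged scores.
report = classification_report(y_test, y_pred, labels=labels, zero_division=0)
print(report)
```

Off-diagonal cells of the matrix show which genres the model confuses with each other, which can guide feature or model changes.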

### Task #8: Refit the model to all of your data and then use that model to predict the holdout set. 
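
A minimal sketch of this step, assuming the chosen model is a scikit-learn pipeline. The `TfidfVectorizer` plus `MultinomialNB` combination is only a placeholder for whatever model performed best, and both frames are toy data.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the full labeled data and the unlabeled holdout set.
labeled = pd.DataFrame({
    "Plot": [
        "a ghost haunts the old house",
        "a killer stalks campers",
        "two friends fall in love",
        "a couple reunites in paris",
    ],
    "Genre": ["horror", "horror", "romance", "romance"],
})
holdout = pd.DataFrame({
    "Plot": ["a ghost stalks a couple", "friends fall in love again"],
})

# Placeholder for the best pipeline found during model selection.
best_model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Refit on all labeled rows (train and test parts together), then predict.
best_model.fit(labeled["Plot"], labeled["Genre"])
holdout_pred = best_model.predict(holdout["Plot"])
print(holdout_pred)
```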

### Task #9: Save your predictions as a CSV file that you will send to the instructional staff for evaluation.
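
The required file layout is not specified here, so the sketch below is only one plausible shape: the holdout index plus a single `Genre` column. The file name, the index values, and the predictions are all hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical holdout index values and predictions for illustration.
holdout_index = [101, 102, 103]
holdout_pred = ["drama", "horror", "comedy"]

# One row per holdout movie, keyed by its original index.
submission = pd.DataFrame({"Genre": holdout_pred}, index=holdout_index)

# Written to a temp directory here; use your own path for the real file.
out_path = os.path.join(tempfile.gettempdir(), "genre_predictions.csv")
submission.to_csv(out_path)

# Read the file back to confirm it round-trips as expected.
check = pd.read_csv(out_path, index_col=0)
print(check)
```

Keeping the original index in the file makes it possible to match each prediction back to its movie, whatever order the rows end up in.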

## Great job!