# Project 2 Part 6: Review Classification

### Load the Data

Load the joblib containing the data frame from Part 5 of the project.

Drop any reviews that do not have a rating.

Use the original review column as your X and the classification target (High/Low Rating Reviews) as your y.

### Build a Machine Learning Model

Build a sklearn modeling pipeline with a text vectorizer and a classification model.

Suggested Models: MultinomialNB, LogisticRegression (you may need to increase max_iter), RandomForestClassifier

Fit and evaluate the model using the machine learning classification models from sklearn.

In a Markdown cell, document your observations from your results. (e.g., how good is the model overall? Is it particularly good/bad at predicting one class?)

## Imports

In [34]:
# packages
import pandas as pd
import joblib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

#### Load Part 5 joblib

In [4]:
# load joblib
df = joblib.load('Data-NLP/df-proj-2-part-5.joblib')

In [5]:
# preview df
df.head()

|   | review_id | movie_id | imdb_id | original_title | review | rating | high_low | review_lowercase | tokens | spacy_lemmas | lemmas_joined | neg | neu | pos | compound |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 64ecc16e83901800af821d50 | 843 | tt0118694 | 花樣年華 | This is a fine piece of cinema from Wong Kar-W... | 7.0 | med | this is a fine piece of cinema from wong kar-w... | [this, is, a, fine, piece, of, cinema, from, w... | [fine, piece, cinema, wong, kar, wai, tell, st... | fine piece cinema wong kar wai tell story peop... | 0.068 | 0.744 | 0.188 | 0.9908 |
| 1 | 57086ff5c3a3681d29001512 | 7443 | tt0120630 | Chicken Run | A guilty pleasure for me personally, as I love... | 9.0 | high | a guilty pleasure for me personally, as i love... | [a, guilty, pleasure, for, me, personally,, as... | [guilty, pleasure, personally, love, great, es... | guilty pleasure personally love great escape w... | 0.053 | 0.587 | 0.36 | 0.945 |
| 2 | 5bb5ac829251410dcb00810c | 7443 | tt0120630 | Chicken Run | Made my roommate who hates stop-motion animati... | 6.0 | med | made my roommate who hates stop-motion animati... | [made, my, roommate, who, hates, stop-motion, ... | [roommate, hate, stop, motion, animation, watc... | roommate hate stop motion animation watch 2018... | 0.071 | 0.776 | 0.153 | 0.7006 |
| 3 | 5f0c53a013a32000357ec505 | 7443 | tt0120630 | Chicken Run | `A very good stop-motion animation!\r\n\r\n<em>...` | 8.0 | med | `a very good stop-motion animation!\r\n\r\n<em>...` | `[a, very, good, stop-motion, animation!, <em>'...` | [good, stop, motion, animation, <, em>'chicken... | good stop motion animation < em>'chicken run'<... | 0.039 | 0.637 | 0.325 | 0.9944 |
| 4 | 64ecc027594c9400ffe77c91 | 7443 | tt0120630 | Chicken Run | Ok, there is an huge temptation to riddle this... | 7.0 | med | ok, there is an huge temptation to riddle this... | [ok,, there, is, an, huge, temptation, to, rid... | [ok, huge, temptation, riddle, review, pun, go... | ok huge temptation riddle review pun go crack ... | 0.056 | 0.699 | 0.245 | 0.9943 |


#### Drop any reviews that do not have a rating.

In [17]:
# drop any rows with na in rating column
df = df.dropna(subset=['rating'])

In [26]:
# keep only the 'high' and 'low' classes by dropping rows labeled 'med'
df = df[df['high_low'] != 'med']

In [27]:
# verify changes
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2419 entries, 1 to 8647
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   review_id         2419 non-null   object 
 1   movie_id          2419 non-null   int64  
 2   imdb_id           2419 non-null   object 
 3   original_title    2419 non-null   object 
 4   review            2419 non-null   object 
 5   rating            2419 non-null   float64
 6   high_low          2419 non-null   object 
 7   review_lowercase  2419 non-null   object 
 8   tokens            2419 non-null   object 
 9   spacy_lemmas      2419 non-null   object 
 10  lemmas_joined     2419 non-null   object 
 11  neg               2419 non-null   float64
 12  neu               2419 non-null   float64
 13  pos               2419 non-null   float64
 14  compound          2419 non-null   float64
dtypes: float64(5), int64(1), object(9)
memory usage: 302.4+ KB


#### Use the original review column as your X and the classification target (High/Low Rating Reviews) as your y.

In [28]:
# assign X and y
X = df['review']
y = df['high_low']
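
Before fitting, it is common to hold out a test set so the evaluation reflects reviews the model has not seen. The cells above do not show a split, so the following is only a sketch under assumptions: a stratified 75/25 split, with `X_demo` and `y_demo` as hypothetical stand-ins for `X` and `y` so the cell runs on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the notebook's X (review text) and y (high/low label)
X_demo = pd.Series([
    "A brilliant, moving film with superb acting",
    "Dull plot and wooden performances throughout",
    "Loved every minute, a true classic",
    "A tedious mess that wastes its cast",
    "Gorgeous visuals and a wonderful score",
    "Painfully slow and utterly forgettable",
    "Charming, funny and beautifully shot",
    "Awful dialogue and a lazy script",
])
y_demo = pd.Series(["high", "low"] * 4)

# stratify keeps the high/low proportions the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

print(y_train.value_counts())
print(y_test.value_counts())
```

In the notebook itself, `X` and `y` would take the place of the demo series; the test size and random seed here are arbitrary choices.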

## Build a Machine Learning Model

### Build a sklearn modeling pipeline with a text vectorizer and a classification model.

Suggested Models: MultinomialNB, LogisticRegression (you may need to increase max_iter), RandomForestClassifier

In [30]:
# check class balance
y.value_counts(normalize=True)

low     0.505994
high    0.494006
Name: high_low, dtype: float64
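
With the classes split roughly 51/49, always predicting the majority class (`low`) would score about 0.51 accuracy on this data, so that is the floor a useful model needs to beat. A minimal sketch of such a baseline using sklearn's `DummyClassifier`; the labels below are hypothetical stand-ins for `y`, not the project data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels standing in for y (5 low, 4 high)
y_demo = np.array(["low"] * 5 + ["high"] * 4)
# DummyClassifier ignores the features, so a placeholder column is enough
X_placeholder = np.zeros((len(y_demo), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_demo)

baseline_accuracy = baseline.score(X_placeholder, y_demo)
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```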

In [35]:
# create a pipeline with a vectorizer and classification model
clf_pipe = Pipeline([('vectorizer', CountVectorizer(stop_words='english')),
                     ('clf', RandomForestClassifier(random_state=42))])
clf_pipe
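
The other suggested models can be swapped into the same pipeline shape. The sketch below is illustrative only: `make_text_pipe` is a hypothetical helper, the tiny corpus is made up so the cell runs on its own, and `max_iter=1000` is an arbitrary value above LogisticRegression's default of 100.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def make_text_pipe(model):
    """Hypothetical helper: pair the same vectorizer with any classifier."""
    return Pipeline([
        ("vectorizer", CountVectorizer(stop_words="english")),
        ("clf", model),
    ])


nb_pipe = make_text_pipe(MultinomialNB())
# max_iter raised above the default, as the assignment notes may be needed
logreg_pipe = make_text_pipe(LogisticRegression(max_iter=1000, random_state=42))

# Tiny made-up corpus, only to show that both pipelines fit and predict
reviews = [
    "A brilliant, moving film with superb acting",
    "Dull plot and wooden performances throughout",
    "Loved every minute, a true classic",
    "A tedious mess that wastes its cast",
]
labels = ["high", "low", "high", "low"]

predictions = {}
for name, pipe in [("nb", nb_pipe), ("logreg", logreg_pipe)]:
    pipe.fit(reviews, labels)
    predictions[name] = pipe.predict(["a wonderful, brilliant film"])[0]

print(predictions)
```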

### Fit and evaluate the model using the machine learning classification models from sklearn.

In a Markdown cell, document your observations from your results. (e.g., how good is the model overall? Is it particularly good/bad at predicting one class?)
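
One way this step could look is sketched below, assuming a held-out test set and sklearn's `classification_report` and `confusion_matrix`. The reviews and labels are hypothetical stand-ins for `X` and `y`, so the numbers printed say nothing about the real model.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical stand-ins for the notebook's X and y
reviews = [
    "A brilliant, moving film with superb acting",
    "Dull plot and wooden performances throughout",
    "Loved every minute, a true classic",
    "A tedious mess that wastes its cast",
    "Gorgeous visuals and a wonderful score",
    "Painfully slow and utterly forgettable",
    "Charming, funny and beautifully shot",
    "Awful dialogue and a lazy script",
    "A superb, brilliant and moving classic",
    "Forgettable, dull and painfully tedious",
    "Wonderful acting and a charming story",
    "Lazy, awful and a waste of time",
]
labels = ["high", "low"] * 6

X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, stratify=labels, random_state=42
)

# Same pipeline shape as the notebook's clf_pipe
clf_pipe = Pipeline([
    ("vectorizer", CountVectorizer(stop_words="english")),
    ("clf", RandomForestClassifier(random_state=42)),
])
clf_pipe.fit(X_train, y_train)

# Evaluate on the held-out reviews only
y_pred = clf_pipe.predict(X_test)
report = classification_report(y_test, y_pred, zero_division=0)
cm = confusion_matrix(y_test, y_pred, labels=["high", "low"])

print(report)
print(cm)
```

The per-class precision and recall in the report speak to whether the model is better at one class than the other. In the confusion matrix, rows are true labels and columns are predicted labels, in the order given by `labels`.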