# Movie Reviews

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("reviews.csv")
df.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


The dataset is made up of positive and negative movie reviews.

## Preprocessing

👇 Remove punctuation and lowercase the text.

In [3]:
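import string

def remove_punct(x):
    # Lowercase once, then strip every ASCII punctuation character
    x = x.lower()
    for p in string.punctuation:
        x = x.replace(p, '')
    return x

# Pass the function directly; wrapping it in a lambda is unnecessary
df["clean_reviews"] = df['reviews'].apply(remove_punct)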

df.head()
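
As an aside, the same cleaning step can be done in a single pass with `str.translate`. This is only a minimal sketch: the helper name `remove_punct_fast` and the sample sentence are illustrative assumptions, not part of the exercise.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_fast(text):
    """Lowercase the text and strip punctuation in one pass."""
    return text.lower().translate(PUNCT_TABLE)

# Hypothetical sample sentence, used only to illustrate the helper.
sample = "Hello, World! It's a GREAT movie."
cleaned = remove_punct_fast(sample)
print(cleaned)
```

On the notebook's data this could be applied as `df['reviews'].apply(remove_punct_fast)`, which should give the same column as the loop-based version.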

## Bag-of-Words modelling

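👇 Using `cross_validate`, score a Multinomial Naive Bayes model trained on a Bag-of-Words representation of the texts. The cells below go one step further and use `GridSearchCV`, which cross-validates the model for two values of the smoothing parameter `alpha`.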

In [27]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

In [28]:
y = df.target
X = df["clean_reviews"]

In [29]:
# Create the pipeline: word counts followed by Multinomial Naive Bayes
pipe = make_pipeline(
    CountVectorizer(),
    MultinomialNB(),
)

In [30]:
pipe

Pipeline(steps=[('countvectorizer', CountVectorizer()),
                ('multinomialnb', MultinomialNB())])

In [31]:
# Hyperparameter grid: values of the smoothing parameter alpha to try
parameters = {
    'multinomialnb__alpha': (0.1, 1),
}
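
The grid key `multinomialnb__alpha` follows scikit-learn's `<step name>__<parameter>` convention, where `make_pipeline` names each step after its lowercased class name. A small sketch, rebuilding the same pipeline on its own, shows how the valid keys could be checked:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Same two-step pipeline as in the notebook, rebuilt so this cell runs standalone.
demo_pipe = make_pipeline(CountVectorizer(), MultinomialNB())

# Step names are generated automatically from the class names.
step_names = [name for name, _ in demo_pipe.steps]
print(step_names)

# Every key returned by get_params() is a valid name for a parameter grid.
alpha_keys = [key for key in demo_pipe.get_params() if key.endswith("__alpha")]
print(alpha_keys)
```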

In [32]:
# Run a 5-fold cross-validated grid search, scored on accuracy
grid_search = GridSearchCV(
    pipe, parameters, n_jobs=-1, verbose=1,
    scoring="accuracy", refit=True, cv=5,
)

grid_search.fit(X, y)

grid_search.best_params_

Fitting 5 folds for each of 2 candidates, totalling 10 fits


{'multinomialnb__alpha': 1}
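
The prompt for this section mentions `cross_validate`. Here is a hedged sketch of that route, scoring a single pipeline without a parameter search. The tiny `texts` and `labels` corpus is a made-up stand-in so the cell runs on its own; in the notebook, `X` and `y` would be passed instead, with `cv=5`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (not the notebook's data): 6 positive, 6 negative.
texts = [
    "a great and moving film",
    "wonderful acting and a great story",
    "i loved this funny movie",
    "a brilliant and touching story",
    "great fun from start to finish",
    "wonderful film with a moving ending",
    "a dull and boring film",
    "terrible acting and a weak story",
    "i hated this boring movie",
    "a stupid and terrible plot",
    "weak jokes and dull characters",
    "boring from start to finish",
]
labels = ["pos"] * 6 + ["neg"] * 6

demo_pipe = make_pipeline(CountVectorizer(), MultinomialNB())

# 3 folds here because the toy corpus is tiny.
cv_results = cross_validate(demo_pipe, texts, labels, cv=3, scoring="accuracy")
mean_accuracy = cv_results["test_score"].mean()
print(mean_accuracy)
```

`cross_validate` returns a dictionary of per-fold arrays (`test_score`, `fit_time`, `score_time`), so the mean of `test_score` is the figure to compare between models.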

## N-gram modelling

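👇 Using `cross_validate`, score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Words representation of the texts. Note that the cells below use `TfidfVectorizer` with `ngram_range=(2, 2)`, so the 2-gram features are TF-IDF weighted rather than raw counts, and the scoring is again done through `GridSearchCV`.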

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [36]:
# Create the pipeline: TF-IDF weighted 2-grams followed by Multinomial Naive Bayes
pipe2 = make_pipeline(
    TfidfVectorizer(ngram_range=(2, 2)),
    MultinomialNB(),
)

In [37]:
pipe2

Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer(ngram_range=(2, 2))),
                ('multinomialnb', MultinomialNB())])

In [38]:
# Hyperparameter grid: values of the smoothing parameter alpha to try
parameters = {
    'multinomialnb__alpha': (0.1, 1),
}

In [39]:
# Run a 5-fold cross-validated grid search, scored on accuracy
grid_search = GridSearchCV(
    pipe2, parameters, n_jobs=-1, verbose=1,
    scoring="accuracy", refit=True, cv=5,
)

grid_search.fit(X, y)

grid_search.best_params_

Fitting 5 folds for each of 2 candidates, totalling 10 fits


{'multinomialnb__alpha': 0.1}

⚠️ Please push the exercise once you are done 🙃

## 🏁 