# Movie Reviews

In [1]:
import pandas as pd

data = pd.read_csv("reviews.csv")

data.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


The dataset is made up of positive and negative movie reviews.

## Preprocessing

👇 Remove punctuation and lowercase the text.

In [2]:
import string 

def remove_punct(df_to_treat):
    dataf = df_to_treat.copy()
    for col in dataf:
        if dataf[col].dtype == 'O':
            for punct in string.punctuation:
                dataf[col] = [text.replace(punct, '') for text in dataf[col]]
    return dataf

def lower_func(df_to_treat):
    dataf = df_to_treat.copy()
    for col in dataf:
        if dataf[col].dtype == 'O':
            # A single pass is enough: lowercasing does not depend on punctuation
            dataf[col] = [text.lower() for text in dataf[col]]
    return dataf
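
Replacing each punctuation character in turn rescans every review once per character. As a possible alternative, `str.translate` can delete all of them in a single pass. This is a minimal stdlib-only sketch, not the notebook's own method; `clean_text` and `_PUNCT_TABLE` are hypothetical names introduced for illustration.

```python
import string

# Translation table that deletes every character found in string.punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(text):
    """Hypothetical helper: strip ASCII punctuation, then lowercase."""
    return text.translate(_PUNCT_TABLE).lower()

sample = "The Happy Bastard's quick movie review!"
print(clean_text(sample))
```

On a DataFrame, the same helper could be applied to one column with `data['reviews'].map(clean_text)`. Note that deleting punctuation outright (rather than replacing it with a space) can glue neighbouring words together when they were separated only by punctuation.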

In [7]:
clean_data = lower_func(remove_punct(data))
clean_data['target'] = [0 if k=='neg' else 1 for k in clean_data['target']]
clean_data

Unnamed: 0,target,reviews
0,0,plot two teen couples go to a church party d...
1,0,the happy bastards quick movie review \ndamn t...
2,0,it is movies like these that make a jaded movi...
3,0,quest for camelot is warner bros first fe...
4,0,synopsis a mentally unstable man undergoing p...
...,...,...
1995,1,wow what a movie \nits everything a movie ca...
1996,1,richard gere can be a commanding actor but he...
1997,1,glorystarring matthew broderick denzel washin...
1998,1,steven spielbergs second epic film on world wa...


## Bag-of-Words modelling

👇 Using `cross_validate`, score a Multinomial Naive Bayes model trained on a Bag-of-Words representation of the texts.

In [14]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

X_bow = vectorizer.fit_transform(clean_data['reviews'])

# MultinomialNB accepts the sparse matrix directly, so no dense conversion is needed

y = clean_data.target

nb_model = MultinomialNB()

cv = cross_val_score(nb_model, X_bow, y)
cv.mean()


0.8145
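
In the cell above, the vectorizer is fitted on all reviews before cross-validation, so the vocabulary is also learned from the validation folds. One possible way to avoid this, and to use `cross_validate` as the instruction asks, is to wrap both steps in a pipeline that is refitted on each training fold. The sketch below assumes a tiny invented corpus (`texts`, `labels`) so that it runs on its own; it is not the notebook's data and its score says nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for illustration only (not taken from reviews.csv).
texts = [
    "a great and moving film", "wonderful acting and a great plot",
    "great fun from start to finish", "a wonderful moving story",
    "a dull and boring film", "terrible acting and a boring plot",
    "dull from start to finish", "a terrible boring story",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is refitted inside each training fold, so the
# validation fold never contributes to the vocabulary.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# cross_validate returns a dict with fit/score times and per-fold test scores.
scores = cross_validate(pipeline, texts, labels, cv=2, scoring="accuracy")
print(scores["test_score"].mean())
```

With the notebook's data, the same call would take `clean_data['reviews']` and `y` in place of the toy lists.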

## N-gram modelling

👇 Using `cross_validate`, score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Words representation of the texts.

In [15]:
# ngram_range=(1, 2) keeps single words (unigrams) and adds word pairs (bigrams)
vectorizer = CountVectorizer(ngram_range=(1, 2))

X_bow2 = vectorizer.fit_transform(clean_data['reviews'])

nb_model2 = MultinomialNB()

cv2 = cross_val_score(nb_model2, X_bow2, y)
cv2.mean()

0.8350000000000002
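
To make concrete what a 1-to-2-gram range adds to the vocabulary, here is a small pure-Python sketch. `word_ngrams` is a hypothetical helper that splits on whitespace; it is a simplification, not `CountVectorizer`'s actual tokenizer (whose default token pattern, for instance, drops single-character tokens such as "a").

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Hypothetical helper: list all word n-grams for n in [n_min, n_max]."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams("not a good movie"))
# ['not', 'a', 'good', 'movie', 'not a', 'a good', 'good movie']
```

Bigrams such as "not a" carry some word-order information that single-word counts lose, which may help explain the higher score with `ngram_range=(1,2)`.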

⚠️ Please push the exercise once you are done 🙃

## 🏁 