# Movie Reviews

In [21]:
import pandas as pd

df = pd.read_pickle("reviews")

df.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


The dataset is made up of positive and negative movie reviews.

## Preprocessing

👇 Remove punctuation and lower case the text.

In [22]:
import string

def punctuation_lower(text):
    text.translate(str.maketrans("", "", string.punctuation))
    text.lower()
    return text

df['clean_reviews'] = df['reviews'].apply(punctuation_lower)
df

Unnamed: 0,target,reviews,clean_reviews
0,neg,"plot : two teen couples go to a church party ,...","plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs...",""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...,synopsis : a mentally unstable man undergoing ...
...,...,...,...
1995,pos,wow ! what a movie . \nit's everything a movie...,wow ! what a movie . \nit's everything a movie...
1996,pos,"richard gere can be a commanding actor , but h...","richard gere can be a commanding actor , but h..."
1997,pos,"glory--starring matthew broderick , denzel was...","glory--starring matthew broderick , denzel was..."
1998,pos,steven spielberg's second epic film on world w...,steven spielberg's second epic film on world w...


## Bag-of-Words modelling

👇 Using `cross_validate`, score a Multinomial Naive Bayes model trained on a Bag-of-Word representation of the texts.

In [24]:
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups

# Load the 20 newsgroups dataset

# Extract feature using CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df["clean_reviews"])
y = df.target

# Train MultinomialNB model using cross_validate
clf = MultinomialNB()
cv_results = cross_validate(clf, X, y, cv=5, scoring='accuracy', return_train_score=False)

# Print mean and standard deviation of test accuracy
print("Mean accuracy:", cv_results['test_score'].mean())
print("Standard deviation:", cv_results['test_score'].std())

Mean accuracy: 0.8145
Standard deviation: 0.014352700094407337


## N-gram modelling

👇 Using `cross_validate`, score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Word representation of the texts.

In [26]:
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Extract feature using CountVectorizer with ngram_range=(2, 2)
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(df['clean_reviews'])
y = df['target']

# Train MultinomialNB model using cross_validate
clf = MultinomialNB()
cv_results = cross_validate(clf, X, y, cv=5, scoring='accuracy', return_train_score=False)

# Print mean and standard deviation of test accuracy
print("Mean accuracy:", cv_results['test_score'].mean())
print("Standard deviation:", cv_results['test_score'].std())


Mean accuracy: 0.843
Standard deviation: 0.01805547008526779


⚠️ Please push the exercise once you are done 🙃

## 🏁 