# Movie Reviews

In [4]:
import pandas as pd

data = pd.read_pickle("reviews")

data.head()

| | target | reviews |
|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... |


The dataset is made up of positive and negative movie reviews.

## Preprocessing

👇 Remove punctuation and lowercase the text.

In [5]:
import string


data['reviews'] = data['reviews'].str.translate(str.maketrans('', '', string.punctuation))
data['reviews'] = data['reviews'].str.lower()

data

| | target | reviews |
|---|---|---|
| 0 | neg | plot two teen couples go to a church party d... |
| 1 | neg | the happy bastards quick movie review \ndamn t... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | quest for camelot is warner bros first fe... |
| 4 | neg | synopsis a mentally unstable man undergoing p... |
| ... | ... | ... |
| 1995 | pos | wow what a movie \nits everything a movie ca... |
| 1996 | pos | richard gere can be a commanding actor but he... |
| 1997 | pos | glorystarring matthew broderick denzel washin... |
| 1998 | pos | steven spielbergs second epic film on world wa... |
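
To see what the translation table does, the same two steps can be applied to a single string. This is a minimal sketch in plain Python; the `clean` helper and the `sample` text are made up for illustration and are not part of the notebook or the dataset.

```python
import string

# Translation table that deletes every ASCII punctuation character.
table = str.maketrans('', '', string.punctuation)

def clean(text):
    """Hypothetical helper: strip ASCII punctuation, then lowercase."""
    return text.translate(table).lower()

# Illustrative string, not taken from the dataset.
sample = "The Happy Reviewer's Quick Movie Review : wow , what a movie !"
print(clean(sample))
```

Removing punctuation leaves the surrounding whitespace untouched and fuses words that were joined by an apostrophe, which is why `bastard's` shows up as `bastards` in row 1 of the output above.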


## Bag-of-Words modelling

👇 Using `cross_val_score` with 10-fold cross-validation, score a Multinomial Naive Bayes model trained on a Bag-of-Words representation of the texts.

In [6]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('clf', MultinomialNB())])

scores = cross_val_score(text_clf, data['reviews'], data['target'], cv=10)

In [7]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.81 (+/- 0.06)
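
For intuition about what the `vect` step hands to the classifier, here is a minimal pure-Python sketch of a bag-of-words count matrix. It is not scikit-learn's implementation: the `bag_of_words` helper and the `docs` list are hypothetical, and the regular expression is chosen to mirror `CountVectorizer`'s default `token_pattern`, which keeps tokens of two or more word characters.

```python
import re
from collections import Counter

# Tokens of 2+ word characters, mirroring CountVectorizer's default token_pattern.
TOKEN = re.compile(r"\b\w\w+\b")

def bag_of_words(docs):
    """Hypothetical helper: return (sorted vocabulary, one row of counts per document)."""
    counts = [Counter(TOKEN.findall(doc.lower())) for doc in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[word] for word in vocab] for c in counts]
    return vocab, rows

# Toy documents, not taken from the dataset.
docs = ["a great great movie", "not a great movie"]
vocab, rows = bag_of_words(docs)
print(vocab)
print(rows)
```

Each review becomes one row of word counts and the word order is discarded; counts of this kind are what `MultinomialNB` is fitted on.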


## N-gram modelling

👇 Using `cross_val_score` with 10-fold cross-validation, score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Words representation of the texts.

In [8]:
text_clf = Pipeline([('vect', CountVectorizer(ngram_range=(2, 2))),
                     ('clf', MultinomialNB())])

scores = cross_val_score(text_clf, data['reviews'], data['target'], cv=10)

In [9]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.83 (+/- 0.05)
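
As a rough illustration of what `ngram_range=(2, 2)` changes, the sketch below builds word bigrams from a single string in plain Python. The `bigrams` helper and the example sentence are hypothetical, and the tokenization only approximates `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

# Tokens of 2+ word characters, mirroring CountVectorizer's default token_pattern.
TOKEN = re.compile(r"\b\w\w+\b")

def bigrams(text):
    """Hypothetical helper: list the pairs of consecutive tokens in `text`."""
    tokens = TOKEN.findall(text.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# Toy sentence, not taken from the dataset.
print(Counter(bigrams("not good at all")))
```

With only bigrams as features, single words no longer appear in the vocabulary, but pairs such as `not good` keep local context that a unigram model cannot represent.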


⚠️ Please push the exercise once you are done 🙃

## 🏁 