# Movie Reviews

In [1]:
import pandas as pd

data = pd.read_csv("reviews.csv")

data.head()

| Unnamed: 0 | target | reviews |
|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... |


The dataset is made up of positive and negative movie reviews.

## Preprocessing

👇 Remove punctuation and lowercase the text.

In [3]:
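import string

def df_text_cleaning(text):
    """Strip, lowercase, and keep only letters, digits, spaces and apostrophes."""
    # The text is lowercased first, so only lowercase letters are needed here.
    # Apostrophes are kept on purpose (e.g. "bastard's" stays intact).
    whitelist = set(string.ascii_lowercase + string.digits + "' ")

    def clean(review):
        review = review.strip().lower()
        review = ''.join(ch for ch in review if ch in whitelist)
        return review.strip()

    # Return a new Series instead of overwriting the input element by element
    return text.apply(clean)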

In [4]:
data['reviews'] = df_text_cleaning(data['reviews'])
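
As a quick sanity check, the same cleaning rule can be tried on a couple of short strings. The sketch below is an alternative formulation using pandas' vectorized `.str` methods and a regular expression. `clean_series` and `sample` are hypothetical names that do not appear in the notebook, and the sample strings are made up rather than taken from `reviews.csv`.

```python
import pandas as pd

def clean_series(text: pd.Series) -> pd.Series:
    """Lowercase, then drop everything except a-z, 0-9, spaces and apostrophes."""
    return (
        text.str.lower()
            .str.replace(r"[^a-z0-9' ]", "", regex=True)
            .str.strip()
    )

# Made-up sample strings (assumption: not part of the dataset)
sample = pd.Series(["  Plot : Two teen couples!  ", "It's GREAT , isn't it ?"])
cleaned = clean_series(sample)
print(cleaned.tolist())
```

Removing a punctuation mark that sits between two spaces leaves a double space behind. `CountVectorizer` splits on word boundaries, so this should not affect the resulting tokens.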

## Bag-of-Words modelling

👇 Using `cross_validate`, score a Multinomial Naive Bayes model trained on a Bag-of-Words representation of the texts.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Vectorisation: one row per review, one column per word
# Note: the vocabulary is learned from all reviews, including those held out below
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data.reviews)
y = data.target

# Hold-out split (a single 80/20 split, not k-fold cross-validation)
# MultinomialNB accepts sparse matrices, so .toarray() is not needed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Predictions
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)

# Accuracy scores, as percentages
train_pred_score = accuracy_score(y_train, y_train_pred)
test_pred_score = accuracy_score(y_test, y_test_pred)
print('Training Set Accuracy Score: ', (100 * train_pred_score))
print('Testing Set Accuracy Score: ', (100 * test_pred_score))

Training Set Accuracy Score:  98.0625
Testing Set Accuracy Score:  81.25
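
The prompt asks for `cross_validate`, while the cell above scores the model on a single hold-out split. A minimal sketch of the cross-validated variant follows, assuming a scikit-learn `Pipeline` so that the vocabulary is learned on each training fold only. The `texts` and `labels` lists are a made-up toy corpus standing in for `data.reviews` and `data.target`, and `cv=2` is chosen only because the toy corpus is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (assumption: stands in for data.reviews / data.target)
texts = [
    "a wonderful and moving film", "great acting and a great script",
    "i loved every minute", "a brilliant funny movie",
    "a dull and boring film", "terrible acting and a weak script",
    "i hated every minute", "a stupid unfunny movie",
]
labels = ["pos"] * 4 + ["neg"] * 4

# Vectorizer and classifier are refit together on every training fold
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# Stratified 2-fold cross-validation, scored by accuracy
cv_results = cross_validate(pipeline, texts, labels, cv=2, scoring="accuracy")
print(cv_results["test_score"].mean())
```

On the notebook's data, the equivalent call would presumably be `cross_validate(pipeline, data.reviews, data.target, cv=5, scoring="accuracy")`. Its mean test score is not expected to match the hold-out figure exactly.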


## N-gram modelling

👇 Using `cross_validate`, score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Words representation of the texts.

In [6]:
# Vectorisation: ngram_range=(2, 2) keeps only pairs of consecutive words
# Note: the vocabulary is learned from all reviews, including those held out below
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(data.reviews)
y = data.target

# Hold-out split (a single 80/20 split, not k-fold cross-validation)
# The 2-gram matrix is very wide, so it is kept sparse rather than calling .toarray()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Predictions
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)

# Accuracy scores, as percentages
train_pred_score = accuracy_score(y_train, y_train_pred)
test_pred_score = accuracy_score(y_test, y_test_pred)
print('Training Set Accuracy Score: ', (100 * train_pred_score))
print('Testing Set Accuracy Score: ', (100 * test_pred_score))

Training Set Accuracy Score:  100.0
Testing Set Accuracy Score:  85.25

