# Vectorizer Tuning

In [4]:
import pandas as pd

df = pd.read_pickle("reviews_3")

df.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


The dataset is made up of positive and negative movie reviews.

## Preprocessing

👇 Remove punctuation and lower case the text.

In [5]:
import string

def punctuation_lower(text):
    text.translate(str.maketrans("", "", string.punctuation))
    text.lower()
    return text

df['clean_reviews'] = df['reviews'].apply(punctuation_lower)
df

Unnamed: 0,target,reviews,clean_reviews
0,neg,"plot : two teen couples go to a church party ,...","plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs...",""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...,synopsis : a mentally unstable man undergoing ...
...,...,...,...
1995,pos,wow ! what a movie . \nit's everything a movie...,wow ! what a movie . \nit's everything a movie...
1996,pos,"richard gere can be a commanding actor , but h...","richard gere can be a commanding actor , but h..."
1997,pos,"glory--starring matthew broderick , denzel was...","glory--starring matthew broderick , denzel was..."
1998,pos,steven spielberg's second epic film on world w...,steven spielberg's second epic film on world w...


## Tuning

👇 Tune a vectorizer of your choice (or try both!) and a MultinomialNB model simultaneously.

In [7]:
# Create Pipeline

# Set parameters to search (model and vectorizer)

# Perform grid search on pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# example data
X = df["reviews"]
y = df["target"]

# create a pipeline of vectorizer and classifier
pipe = Pipeline([('vectorizer', TfidfVectorizer()), ('classifier', MultinomialNB())])

# set the parameters for grid search
param_grid = {
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'vectorizer__stop_words': [None, 'english'],
    'classifier__alpha': [0.1, 1.0, 10.0],
}

# perform grid search with cross-validation
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)

# print the best parameters and score
print("Best parameters: ", grid.best_params_)
print("Best score: ", grid.best_score_)


Best parameters:  {'classifier__alpha': 0.1, 'vectorizer__ngram_range': (1, 2), 'vectorizer__stop_words': None}
Best score:  0.8385


⚠️ Please push the exercise once you are done 🙃

## 🏁 