# Vectorizer Tuning

In [1]:
import pandas as pd

data = pd.read_pickle("reviews_3")

data.head()

| Unnamed: 0 | target | reviews |
|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... |


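The dataset contains 2,000 movie reviews. Each review is labeled as positive or negative in the `target` column, and its raw text is stored in the `reviews` column.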

## Preprocessing

👇 Remove punctuation and lowercase the text.

In [2]:
import string

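def remove_punct_low(review):
    # Iterating over a string yields characters, not words,
    # so this keeps every character that is not ASCII punctuation
    text = "".join(char for char in review if char not in string.punctuation)
    return text.lower()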


data['reviews'] = data['reviews'].apply(remove_punct_low)
data['reviews']

0       plot  two teen couples go to a church party  d...
1       the happy bastards quick movie review \ndamn t...
2       it is movies like these that make a jaded movi...
3         quest for camelot  is warner bros   first fe...
4       synopsis  a mentally unstable man undergoing p...
                              ...                        
1995    wow  what a movie  \nits everything a movie ca...
1996    richard gere can be a commanding actor  but he...
1997    glorystarring matthew broderick  denzel washin...
1998    steven spielbergs second epic film on world wa...
1999    truman   trueman   burbank is the perfect name...
Name: reviews, Length: 2000, dtype: object
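
The function above checks each character against `string.punctuation` one at a time. As a minimal sketch of an alternative, the same cleaning can be done with a translation table built by `str.maketrans`. The names `PUNCT_TABLE` and `remove_punct_low_translate` are hypothetical and not part of the original notebook.

```python
import string

# Translation table that maps every ASCII punctuation character to None (deletion)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_low_translate(review):
    """Delete ASCII punctuation, then lowercase the text (hypothetical helper)."""
    return review.translate(PUNCT_TABLE).lower()

# Small made-up sample; the real notebook would apply this to data['reviews']
sample = "Wow! What a movie, it's everything."
print(remove_punct_low_translate(sample))
```

On the DataFrame, `data['reviews'].apply(remove_punct_low_translate)` should give the same result as the original function. Like the original, it only removes ASCII punctuation and leaves newlines and repeated spaces in place.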

## Tuning

👇 Tune a vectorizer of your choice (or try both!) and a MultinomialNB model simultaneously.

In [5]:
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Create Pipeline

pipe = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

# Set parameters to search (model and vectorizer)

param_grid = {
    'vectorizer__ngram_range': [(1, 1), (2, 2), (3, 3)],
    'classifier__alpha': [0.1, 0.5, 1.0]
}

# Perform grid search on pipeline
grid = GridSearchCV(pipe, param_grid=param_grid, cv=10)
grid.fit(data['reviews'], data['target'])

# Print best parameters and score
print("Best parameters: ", grid.best_params_)
print("Best score: ", grid.best_score_)

Best parameters:  {'classifier__alpha': 0.5, 'vectorizer__ngram_range': (2, 2)}
Best score:  0.8394999999999999
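
Each `ngram_range` in the grid uses a single n-gram size: `(1, 1)` is unigrams only, `(2, 2)` bigrams only, and `(3, 3)` trigrams only. A mixed range such as `(1, 2)`, which combines unigrams and bigrams, is not part of this search. The sketch below illustrates what those ranges mean in plain Python. `word_ngrams` is a hypothetical helper that splits on whitespace, which is simpler than what `CountVectorizer` does (its default token pattern also drops single-character tokens such as "a").

```python
def word_ngrams(text, ngram_range):
    """Return space-joined word n-grams for every n from min_n to max_n inclusive."""
    min_n, max_n = ngram_range
    tokens = text.split()  # plain whitespace split, an assumption for illustration
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the text
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

sentence = "a jaded movie viewer"
for ngram_range in [(1, 1), (2, 2), (3, 3), (1, 2)]:
    print(ngram_range, word_ngrams(sentence, ngram_range))
```

Adding mixed ranges like `(1, 2)` to `param_grid` would be one way to extend the search, at the cost of a larger vocabulary and longer fitting time.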


⚠️ Please push the exercise once you are done 🙃

## 🏁 