# Vectorizer Tuning

In [2]:
import pandas as pd

data = pd.read_pickle("reviews_3")

data.head()

| | target | reviews |
|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... |


The dataset consists of movie reviews, each labeled `neg` (negative) or `pos` (positive) in the `target` column.

## Preprocessing

👇 Remove punctuation and lowercase the text.

In [3]:
import string


data['reviews'] = data['reviews'].str.translate(str.maketrans('', '', string.punctuation))
data['reviews'] = data['reviews'].str.lower()

data

| | target | reviews |
|---|---|---|
| 0 | neg | plot two teen couples go to a church party d... |
| 1 | neg | the happy bastards quick movie review \ndamn t... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | quest for camelot is warner bros first fe... |
| 4 | neg | synopsis a mentally unstable man undergoing p... |
| ... | ... | ... |
| 1995 | pos | wow what a movie \nits everything a movie ca... |
| 1996 | pos | richard gere can be a commanding actor but he... |
| 1997 | pos | glorystarring matthew broderick denzel washin... |
| 1998 | pos | steven spielbergs second epic film on world wa... |
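To see what the preprocessing step does to a single review, here is a minimal sketch in plain Python. The `clean_text` helper and the sample string are hypothetical and only serve as illustration. The whitespace collapsing at the end is an optional extra that the cell above does not perform: deleting a token such as ` : ` leaves a double space behind, and `split`/`join` also turns newlines into single spaces.

```python
import string

# Translation table that deletes every ASCII punctuation character.
punctuation_table = str.maketrans('', '', string.punctuation)


def clean_text(text):
    """Remove ASCII punctuation, lowercase, and collapse whitespace runs."""
    without_punctuation = text.translate(punctuation_table)
    lowered = without_punctuation.lower()
    # Optional extra step (not in the notebook cell): collapse the gaps
    # left behind by removed punctuation, including newlines.
    return " ".join(lowered.split())


sample = "Plot : two teen couples go to a Church party , drink and then drive ."
print(clean_text(sample))
# → plot two teen couples go to a church party drink and then drive
```

Note that `string.punctuation` only covers ASCII characters, so typographic quotes or dashes would survive this step.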


## Tuning

👇 Tune a vectorizer of your choice (or try both!) and a MultinomialNB model simultaneously.

In [16]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Create Pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('nb', MultinomialNB())
])

# Set parameters to search (model and vectorizer).
# Keys follow the '<step name>__<parameter>' convention of Pipeline.
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'vect__min_df': [2, 3],
    'vect__max_df': [0.5, 0.75],
    'nb__alpha': [0.1, 1.0],  # smoothing parameter of MultinomialNB
    'nb__fit_prior': [True, False],
}

# Perform grid search on pipeline
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1,
                           verbose=1, scoring="accuracy", error_score='raise',
                           refit=True, cv=5)

grid_search.fit(data['reviews'], data['target'])

Fitting 5 folds for each of 32 candidates, totalling 160 fits
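Grid keys for a `Pipeline` follow the `<step name>__<parameter>` pattern. A key that is only a step name, such as `'nb'`, replaces the whole step with each listed value, so passing floats there leaves the pipeline with a float where an estimator should be. The sketch below lists the valid keys with `get_params()` and then uses step replacement deliberately to compare two vectorizers, assuming "both" refers to `CountVectorizer` and `TfidfVectorizer`. The toy corpus and the `toy_*` names are invented for illustration and are not the movie review data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('nb', MultinomialNB()),
])

# List the names GridSearchCV accepts for this pipeline.
# 'nb' addresses the whole step; 'nb__alpha' addresses one of its parameters.
valid_keys = sorted(pipeline.get_params().keys())
print([key for key in valid_keys if key.startswith('nb')])

# Toy corpus, invented for illustration only.
toy_texts = [
    "a wonderful and moving film",
    "great acting and a great story",
    "wonderful story with great characters",
    "a moving and beautiful movie",
    "a boring and terrible film",
    "terrible acting and a dull story",
    "boring story with dull characters",
    "a dull and awful movie",
]
toy_targets = ["pos"] * 4 + ["neg"] * 4

toy_parameters = {
    # Replacing the step itself: each candidate must be an estimator.
    'vect': [CountVectorizer(), TfidfVectorizer()],
    'vect__ngram_range': [(1, 1), (1, 2)],
    'nb__alpha': [0.1, 1.0],
}

toy_search = GridSearchCV(pipeline, toy_parameters, scoring="accuracy", cv=2)
toy_search.fit(toy_texts, toy_targets)

print(toy_search.best_params_)
print(toy_search.best_score_)
```

On the real data, the same `best_params_` and `best_score_` attributes of the fitted search object report the winning combination and its mean cross-validated accuracy.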

⚠️ Please push the exercise once you are done 🙃

## 🏁 