# Vectorizer Tuning

In [10]:
import pandas as pd

data = pd.read_csv("reviews.csv")

data

| Unnamed: 0 | target | reviews |
|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... |
| ... | ... | ... |
| 1995 | pos | wow ! what a movie . \nit's everything a movie... |
| 1996 | pos | richard gere can be a commanding actor , but h... |
| 1997 | pos | glory--starring matthew broderick , denzel was... |
| 1998 | pos | steven spielberg's second epic film on world w... |


The dataset is made up of positive and negative movie reviews.
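The `Unnamed: 0` column in the output above is what pandas produces when the first header field of a CSV is empty, which usually means a saved index. The sketch below shows one way to read such a file and check the class balance before tuning on accuracy. It uses a made-up three-row stand-in for `reviews.csv`, and it assumes the real file has the same layout, which is inferred from the output rather than confirmed.

```python
import io

import pandas as pd

# Made-up stand-in for reviews.csv: the first header field is empty,
# which is the layout that makes pandas create an "Unnamed: 0" column.
csv_text = (
    ",target,reviews\n"
    "0,neg,a dull film\n"
    "1,pos,a great film\n"
    "2,pos,a moving story\n"
)

# index_col=0 uses the first column as the row index instead of keeping it as data
data = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(data.columns.tolist())

# Count the reviews per class; accuracy is easiest to interpret when classes are balanced
class_counts = data["target"].value_counts()
print(class_counts)
```

With the real file, the same two calls would be `pd.read_csv("reviews.csv", index_col=0)` and `data["target"].value_counts()`.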

## Preprocessing

👇 Remove punctuation and lowercase the text.

In [11]:
import string

def remove_punctuation(text):
    # string.punctuation holds every ASCII punctuation character;
    # delete each one from the text in turn
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

def lower_case(text):
    return text.lower()

In [12]:
# Strip punctuation first, then lowercase every review
data["reviews"] = data["reviews"].apply(remove_punctuation)
data["reviews"] = data["reviews"].apply(lower_case)

data

| Unnamed: 0 | target | reviews |
|---|---|---|
| 0 | neg | plot two teen couples go to a church party d... |
| 1 | neg | the happy bastards quick movie review \ndamn t... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | quest for camelot is warner bros first fe... |
| 4 | neg | synopsis a mentally unstable man undergoing p... |
| ... | ... | ... |
| 1995 | pos | wow what a movie \nits everything a movie ca... |
| 1996 | pos | richard gere can be a commanding actor but he... |
| 1997 | pos | glorystarring matthew broderick denzel washin... |
| 1998 | pos | steven spielbergs second epic film on world wa... |
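
One side effect of deleting punctuation outright is visible in the output above: `glory--starring` becomes the single token `glorystarring`. The sketch below compares the notebook's approach with a variant that maps punctuation to spaces instead. The helper `punctuation_to_space` and the sample sentence are hypothetical and not part of the exercise.

```python
import string

def remove_punctuation(text):
    # Same approach as the notebook: delete each punctuation character
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

def punctuation_to_space(text):
    # Hypothetical variant: turn each punctuation character into a space,
    # then collapse runs of whitespace into single spaces
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return ' '.join(text.translate(table).split())

sample = "Glory--starring Matthew Broderick , it's great !"

deleted = remove_punctuation(sample).lower()
spaced = punctuation_to_space(sample).lower()

print(deleted.split())  # tokens after deleting punctuation
print(spaced.split())   # tokens after replacing punctuation with spaces
```

Neither choice is free: deleting merges `glory` and `starring`, while replacing with spaces splits `it's` into `it` and `s`. Which one works better for the classifier is an empirical question.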


## Tuning

👇 Tune a vectorizer of your choice (`CountVectorizer` or `TfidfVectorizer`, or try both!) and a `MultinomialNB` model simultaneously, in a single grid search.

In [16]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

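# Chain the vectorizer and the classifier so they are tuned together
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB()),
])

# Parameters to search, addressed as <step name>__<parameter name>
parameters = {
    'tfidf__ngram_range': ((1, 1), (2, 2)),  # unigrams only vs. bigrams only
    'nb__alpha': (0.1, 1),                   # Naive Bayes smoothing strength
}

# 5-fold cross-validated grid search; refit=True retrains the best pipeline on all data
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1,
                           verbose=1, scoring="accuracy",
                           refit=True, cv=5)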

grid_search.fit(data.reviews, data.target)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                       ('nb', MultinomialNB())]),
             n_jobs=-1,
             param_grid={'nb__alpha': (0.1, 1),
                         'tfidf__ngram_range': ((1, 1), (2, 2))},
             scoring='accuracy', verbose=1)
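
The search above tunes `TfidfVectorizer` only. To try both vectorizers in one search, a pipeline step can itself be listed as a parameter. The sketch below shows that pattern on a made-up ten-review toy dataset standing in for `data.reviews` and `data.target`. The step name `vectorizer`, the toy texts, the `cv=2` setting and the `(1, 2)` n-gram range are illustrative assumptions, not the exercise's solution.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy data standing in for data.reviews / data.target
texts = [
    "a wonderful moving film", "great acting and a great story",
    "i loved this brilliant movie", "a wonderful and brilliant story",
    "great film with moving acting",
    "a boring and awful film", "terrible acting and a dull story",
    "i hated this awful movie", "a dull and terrible story",
    "boring film with awful acting",
]
labels = ["pos"] * 5 + ["neg"] * 5

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),  # placeholder, swapped out by the grid
    ('nb', MultinomialNB()),
])

parameters = {
    # The step itself is a parameter: each candidate replaces the whole vectorizer
    'vectorizer': [CountVectorizer(), TfidfVectorizer()],
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'nb__alpha': [0.1, 1],
}

# cv=2 only because the toy dataset is tiny
grid_search = GridSearchCV(pipeline, parameters, scoring="accuracy", cv=2)
grid_search.fit(texts, labels)

print(len(grid_search.cv_results_['params']))  # number of candidates tried
print(grid_search.best_params_)
```

This grid has 2 × 2 × 2 = 8 candidates. On the real dataset the same `parameters` dictionary could be passed with `cv=5` and `n_jobs=-1` as in the cell above.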

In [17]:
grid_search.best_params_

{'nb__alpha': 0.1, 'tfidf__ngram_range': (2, 2)}
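
`best_params_` only names the winning combination. To see how far ahead it is, the fitted search also exposes `best_score_`, `cv_results_` and, because `refit=True`, a retrained `best_estimator_`. The sketch below reads them on a made-up toy dataset. The texts, the `cv=2` setting and the sample sentence passed to `predict` are assumptions for illustration, so the scores it prints say nothing about the real reviews.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy data standing in for data.reviews / data.target
texts = [
    "a wonderful moving film", "great acting and a great story",
    "i loved this brilliant movie", "a wonderful and brilliant story",
    "great film with moving acting",
    "a boring and awful film", "terrible acting and a dull story",
    "i hated this awful movie", "a dull and terrible story",
    "boring film with awful acting",
]
labels = ["pos"] * 5 + ["neg"] * 5

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB()),
])
parameters = {
    'tfidf__ngram_range': ((1, 1), (2, 2)),
    'nb__alpha': (0.1, 1),
}

# cv=2 only because the toy dataset is tiny
grid_search = GridSearchCV(pipeline, parameters, scoring="accuracy",
                           refit=True, cv=2)
grid_search.fit(texts, labels)

# One row per candidate, best candidate first
results = pd.DataFrame(grid_search.cv_results_)
columns = ['param_tfidf__ngram_range', 'param_nb__alpha',
           'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score'))

# Mean cross-validated accuracy of the best candidate
print(grid_search.best_score_)

# best_estimator_ is the whole pipeline refit on all the data,
# so it accepts raw text directly
prediction = grid_search.best_estimator_.predict(["a brilliant and moving story"])
print(prediction)
```

Comparing `mean_test_score` with `std_test_score` across rows helps judge whether the gap between candidates is larger than the fold-to-fold noise.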

⚠️ Please push the exercise once you are done 🙃

## 🏁 