# Vectorizer Tuning

In [1]:
import pandas as pd

df = pd.read_pickle("reviews_3")

df.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


The dataset contains movie reviews, each labeled positive (`pos`) or negative (`neg`) in the `target` column.

## Preprocessing

👇 Remove punctuation and lowercase the text.

In [2]:
import string

# Strip punctuation from a review, then lowercase it
def remove_punctuation(reviews):
    reviews = reviews.translate(str.maketrans("", "", string.punctuation))
    reviews = reviews.lower()

    return reviews

In [3]:
df["clean_text"] = df["reviews"].apply(remove_punctuation)
df

Unnamed: 0,target,reviews,clean_text
0,neg,"plot : two teen couples go to a church party ,...",plot two teen couples go to a church party d...
1,neg,the happy bastard's quick movie review \ndamn ...,the happy bastards quick movie review \ndamn t...
2,neg,it is movies like these that make a jaded movi...,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs...",quest for camelot is warner bros first fe...
4,neg,synopsis : a mentally unstable man undergoing ...,synopsis a mentally unstable man undergoing p...
...,...,...,...
1995,pos,wow ! what a movie . \nit's everything a movie...,wow what a movie \nits everything a movie ca...
1996,pos,"richard gere can be a commanding actor , but h...",richard gere can be a commanding actor but he...
1997,pos,"glory--starring matthew broderick , denzel was...",glorystarring matthew broderick denzel washin...
1998,pos,steven spielberg's second epic film on world w...,steven spielbergs second epic film on world wa...
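The cleaning step can be sanity-checked on a single string before applying it to the whole column. The snippet below is a minimal sketch: it redefines the same helper so that it runs on its own, and the sample sentence is made up for illustration.

```python
import string


def remove_punctuation(text):
    # Drop every character listed in string.punctuation, then lowercase
    return text.translate(str.maketrans("", "", string.punctuation)).lower()


# Hypothetical sample, not taken from the dataset
sample = "The Happy Bastard's quick movie review!"
cleaned = remove_punctuation(sample)
print(cleaned)  # the happy bastards quick movie review
```

Only the ASCII characters in `string.punctuation` are removed. Whitespace, including newline characters such as the `\n` visible in the output above, is left untouched.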


## Tuning

👇 Tune a vectorizer of your choice (or try both!) and a MultinomialNB model simultaneously.
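One possible way to "try both" in a single search is to treat the vectorizer step itself as a hyperparameter. The sketch below assumes that "both" means `CountVectorizer` and `TfidfVectorizer`; the step name `vectorizer`, the toy corpus, and the grid values are illustrative choices, not part of the exercise.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus invented for illustration only (not the reviews dataset)
texts = ["great film", "loved it", "wonderful acting", "brilliant plot",
         "terrible film", "hated it", "awful acting", "boring plot"]
labels = ["pos"] * 4 + ["neg"] * 4

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),  # placeholder, replaced by the grid
    ("nb", MultinomialNB()),
])

param_grid = {
    # Swap the whole step, then tune a parameter shared by both vectorizers
    "vectorizer": [CountVectorizer(), TfidfVectorizer()],
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.5, 1],
}

# cv=2 only because the toy corpus is tiny
search = GridSearchCV(pipeline, param_grid=param_grid, cv=2)
search.fit(texts, labels)

best_vectorizer = type(search.best_params_["vectorizer"]).__name__
print("Best vectorizer:", best_vectorizer)
```

Scores on such a small corpus are meaningless; the point is only the shape of the grid, which evaluates 2 × 2 × 2 = 8 combinations.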

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Define the pipeline for vectorization and modeling
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

# Define the hyperparameters to tune
param_grid = {
    'tfidf__ngram_range': [(1,1), (1,2)],
    'nb__alpha': [0.5, 1, 1.5],
}

# Initialize GridSearchCV with 5-fold cross-validation
# (no scoring given, so the classifier's default metric, accuracy, is used)
grid = GridSearchCV(pipeline, param_grid=param_grid, cv=5)

# Fit the grid search on the data
grid.fit(df["clean_text"], df["target"])

# Print the best parameters and the mean cross-validated accuracy
print("Best parameters:", grid.best_params_)
print("Best score: %0.2f" % grid.best_score_)


Best parameters: {'nb__alpha': 0.5, 'tfidf__ngram_range': (1, 2)}
Best score: 0.83
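`best_params_` and `best_score_` report only the winning combination. To see how the other combinations compared, the fitted search also exposes `cv_results_`, which can be loaded into a DataFrame. The sketch below rebuilds the same pipeline and grid on a toy corpus (invented here, with `cv=2` because it is so small) so that it runs on its own; with the real data, the last three statements would be applied to the fitted `grid` object from the cell above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus invented for illustration only (not the reviews dataset)
texts = ["great film", "loved it", "wonderful acting", "brilliant plot",
         "terrible film", "hated it", "awful acting", "boring plot"]
labels = ["pos"] * 4 + ["neg"] * 4

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.5, 1, 1.5],
}
grid = GridSearchCV(pipeline, param_grid=param_grid, cv=2)
grid.fit(texts, labels)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(grid.cv_results_)
columns = ["param_tfidf__ngram_range", "param_nb__alpha",
           "mean_test_score", "std_test_score", "rank_test_score"]
summary = results[columns].sort_values("rank_test_score")
print(summary)
```

Looking at `std_test_score` alongside `mean_test_score` helps judge whether the gap between the best combination and the runners-up is larger than the fold-to-fold variation.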

