# Vectorizer + NaiveBayes Tuning

🎯 The goal of this challenge is to create a Pipeline combining a Vectorizer + a NaiveBayes algorithm and to fine-tune the pipeline.

✍️ Let's reuse the previous dataset with $2000$ reviews classified either as "positive" or "negative".

In [10]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/movie_reviews.csv")
data.head()

| Unnamed: 0 | target | reviews |
|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... |


In [11]:
from sklearn.preprocessing import LabelEncoder

# Encode the string labels as integers (classes are sorted, so "neg" becomes 0)
le = LabelEncoder()
data["target_encoded"] = le.fit_transform(data["target"])

In [12]:
data.head()

| Unnamed: 0 | target | reviews | target_encoded |
|---|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... | 0 |
| 1 | neg | the happy bastard's quick movie review \ndamn ... | 0 |
| 2 | neg | it is movies like these that make a jaded movi... | 0 |
| 3 | neg | " quest for camelot " is warner bros . ' firs... | 0 |
| 4 | neg | synopsis : a mentally unstable man undergoing ... | 0 |
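
To check which integer stands for which label, the fitted encoder can be inspected. The sketch below runs on a small hypothetical list (`toy_labels`, not the real dataset) and assumes the second class is spelled `"pos"`, which the preview above does not show.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for data["target"]; "pos" is an assumed spelling
toy_labels = ["neg", "pos", "pos", "neg"]

toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_labels)

# classes_ holds the sorted labels; a label's position is its integer code
label_to_code = {str(label): code for code, label in enumerate(toy_encoder.classes_)}

# inverse_transform maps integer codes back to the original strings
decoded = [str(label) for label in toy_encoder.inverse_transform(toy_encoded)]

print(label_to_code)
print(decoded)
```

The same two attributes, `classes_` and `inverse_transform`, are available on the `le` object fitted on the real reviews.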


## Preprocessing

❓ **Question (Cleaning)** ❓

Clean the text of the reviews.

In [13]:
import string
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

# Requires NLTK's tokenizer, stopword and WordNet data (see nltk.download)
# Build these once instead of on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocessing(sentence):
    # Strip surrounding whitespace and lowercase
    sentence = sentence.strip().lower()
    # Remove digits
    sentence = ''.join(char for char in sentence if not char.isdigit())
    # Remove punctuation
    sentence = sentence.translate(str.maketrans("", "", string.punctuation))
    # Tokenize, drop stopwords, then lemmatize
    tokens = word_tokenize(sentence)
    tokens = [token for token in tokens if token not in stop_words]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(tokens)
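
The first steps of the cleaning function can be tried in isolation, without any NLTK data. This is a minimal sketch using only the standard library; `strip_digits_and_punctuation` and the sample sentence are hypothetical and introduced only for illustration.

```python
import string

def strip_digits_and_punctuation(sentence):
    # Same first steps as the cleaning function: trim, lowercase,
    # drop digits, then drop punctuation characters
    sentence = sentence.strip().lower()
    sentence = ''.join(char for char in sentence if not char.isdigit())
    return sentence.translate(str.maketrans("", "", string.punctuation))

# Hypothetical sample sentence
sample = "  The 2 Best movies of 1999, hands-down!  "
cleaned = strip_digits_and_punctuation(sample)

# Splitting on whitespace hides the double spaces left by removed characters
print(cleaned.split())
```

Note that deleting punctuation fuses hyphenated words (`hands-down` becomes `handsdown`), which is why tokens such as `featurelength` appear in the cleaned reviews.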


In [14]:
# Clean reviews
data['cleaned_reviews'] = data['reviews'].apply(preprocessing)
data.head()

| Unnamed: 0 | target | reviews | target_encoded | cleaned_reviews |
|---|---|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... | 0 | plot two teen couple go church party drink dri... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... | 0 | happy bastard quick movie review damn yk bug g... |
| 2 | neg | it is movies like these that make a jaded movi... | 0 | movie like make jaded movie viewer thankful in... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... | 0 | quest camelot warner bros first featurelength ... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... | 0 | synopsis mentally unstable man undergoing psyc... |


## Tuning

❓ **Question (Pipelining a Vectorizer and a NaiveBayes Model)** ❓

* Create a Pipeline that chains a vectorizer of your choice with a NaiveBayes model
* Optimize it
* What is your best estimator?

In [15]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import set_config; set_config(display="diagram")

# Fill missing values with empty strings (assign back instead of using inplace on a column)
data['cleaned_reviews'] = data['cleaned_reviews'].fillna('')

# Create Pipeline
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('naivebayes', MultinomialNB())
])

# Set parameters to search
parameters = {
    'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'vectorizer__stop_words': [None, 'english'],
    'naivebayes__alpha': [0.5, 1.0, 2.0]
}

# Perform grid search on pipeline
grid_search = GridSearchCV(pipeline, parameters, scoring='accuracy', cv=5)
grid_search.fit(data['cleaned_reviews'], data['target_encoded'])
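
Before launching a grid search, it is worth estimating how many models it will train. The sketch below counts the combinations in the grid above using only the standard library; `cv_folds`, `candidates`, `n_candidates` and `n_cv_fits` are illustrative names, not scikit-learn attributes.

```python
from itertools import product

# Same grid as in the cell above
parameters = {
    'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'vectorizer__stop_words': [None, 'english'],
    'naivebayes__alpha': [0.5, 1.0, 2.0],
}
cv_folds = 5

# Each combination of one value per parameter is one candidate pipeline
candidates = [
    dict(zip(parameters, values))
    for values in product(*parameters.values())
]

n_candidates = len(candidates)
n_cv_fits = n_candidates * cv_folds  # one fit per candidate per fold

print(n_candidates, n_cv_fits)
```

With the default `refit=True`, `GridSearchCV` also fits the best candidate once more on the full data, so the total is one more than `n_cv_fits`. Adding a fourth parameter with three values would triple that count.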

In [16]:
# Retrieve the refitted best pipeline and its mean cross-validated accuracy
best_estimator = grid_search.best_estimator_
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

Best Parameters: {'naivebayes__alpha': 1.0, 'vectorizer__ngram_range': (1, 2), 'vectorizer__stop_words': None}
Best Score: 0.8285
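
Because the best estimator is a full pipeline, it accepts raw strings and vectorizes them internally. The sketch below illustrates this on a hypothetical toy corpus (`toy_texts`, `toy_targets`), not on the movie reviews, reusing the best parameters reported above; the toy labels assume 1 means positive and 0 means negative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; 1 = positive, 0 = negative (assumed encoding)
toy_texts = [
    "great wonderful film",
    "wonderful acting great story",
    "terrible boring film",
    "boring plot terrible acting",
]
toy_targets = [1, 1, 0, 0]

# Same steps and best parameters as reported by the grid search above
toy_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 2), stop_words=None)),
    ('naivebayes', MultinomialNB(alpha=1.0)),
])
toy_pipeline.fit(toy_texts, toy_targets)

# The pipeline vectorizes new text with the fitted vocabulary, then classifies
new_reviews = ["great wonderful story", "terrible boring plot"]
predictions = toy_pipeline.predict(new_reviews)
probabilities = toy_pipeline.predict_proba(new_reviews)

print(predictions)
print(probabilities.round(2))
```

On the real data, `best_estimator.predict` works the same way, provided new reviews first go through the same cleaning function used for training.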


🏁 Congratulations! You've managed to chain a Vectorizer and an NLP model and fine-tune the pipeline!

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!