# Vectorizer + NaiveBayes Tuning

🎯 The goal of this challenge is to create a Pipeline combining a Vectorizer + a NaiveBayes algorithm and to fine-tune the pipeline.

✍️ Let's reuse the previous dataset with $2000$ reviews classified either as "positive" or "negative".

In [11]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/movie_reviews.csv")
data.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


In [12]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
data["target_encoded"] =  le.fit_transform(data.target)

In [13]:
data.head()

Unnamed: 0,target,reviews,target_encoded
0,neg,"plot : two teen couples go to a church party ,...",0
1,neg,the happy bastard's quick movie review \ndamn ...,0
2,neg,it is movies like these that make a jaded movi...,0
3,neg,""" quest for camelot "" is warner bros . ' firs...",0
4,neg,synopsis : a mentally unstable man undergoing ...,0
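`LabelEncoder` assigns integers to classes in sorted order, which is why `neg` becomes 0 above. The following is a minimal plain-Python sketch of that mapping, not scikit-learn's implementation: `encode_labels` is a hypothetical helper, the `labels` list is made-up sample data, and the `"pos"` label name is assumed from the `neg` label visible in the output.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, following sorted label order."""
    classes = sorted(set(labels))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[label] for label in labels], classes


# Made-up sample; "pos" is an assumed label name.
labels = ["neg", "pos", "pos", "neg"]
encoded, classes = encode_labels(labels)

print(classes)
print(encoded)
```

This ordering matters for the tuning step: a binary `recall` scorer treats label 1 as the positive class by default.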


## Preprocessing

❓ **Question (Cleaning)** ❓

Clean the review texts.

In [14]:
import string
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
punctuation = string.punctuation

def preprocessing(review):
    review = review.strip()  # remove leading/trailing whitespace
    review = review.lower()  # lowercase characters
    review = "".join(char for char in review if not char.isdigit())  # remove digits
    for punc in punctuation:
        review = review.replace(punc, "")  # remove punctuation
    review_toke = word_tokenize(review)  # tokenize
    lemmatizer = WordNetLemmatizer()  # create once, reuse for every token
    review_lem = [lemmatizer.lemmatize(w, pos='n') for w in review_toke]  # lemmatize as nouns
    review = " ".join(review_lem)  # assemble back into one string
    return review
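As a possible variation, the digit and punctuation steps of the function above can be collapsed into a single `str.translate` call. The sketch below uses only the standard library and deliberately skips tokenizing and lemmatizing, so it is not a drop-in replacement; `clean_basic` is a hypothetical name, and it only removes ASCII digits, whereas `str.isdigit` also matches other Unicode digits.

```python
import string

# Translation table that deletes every ASCII digit and punctuation character.
_DELETE_TABLE = str.maketrans("", "", string.digits + string.punctuation)


def clean_basic(review):
    """Strip, lowercase, drop digits and punctuation, and collapse whitespace."""
    review = review.strip().lower()
    review = review.translate(_DELETE_TABLE)
    # split() with no argument splits on any run of whitespace
    return " ".join(review.split())


print(clean_basic("  Plot : two teen couples go to a church party , 1999 !  "))
```

Building the table once avoids looping over every punctuation character for each review.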

In [15]:
# Clean reviews
data['clean_reviews'] = [preprocessing(text) for text in data.reviews]

In [16]:
data.head()

Unnamed: 0,target,reviews,target_encoded,clean_reviews
0,neg,"plot : two teen couples go to a church party ,...",0,plot two teen couple go to a church party drin...
1,neg,the happy bastard's quick movie review \ndamn ...,0,the happy bastard quick movie review damn that...
2,neg,it is movies like these that make a jaded movi...,0,it is movie like these that make a jaded movie...
3,neg,""" quest for camelot "" is warner bros . ' firs...",0,quest for camelot is warner bros first feature...
4,neg,synopsis : a mentally unstable man undergoing ...,0,synopsis a mentally unstable man undergoing ps...


## Tuning

❓ **Question (Pipelining a Vectorizer and a NaiveBayes Model)** ❓

* Create a Pipeline that chains a vectorizer of your choice with a NaiveBayes model
* Optimize it
* What is your best estimator?

In [38]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import set_config; set_config(display="diagram")

# Create Pipeline
pass  # YOUR CODE HERE

# Set parameters to search
pass  # YOUR CODE HERE

# Perform grid search on pipeline
pass  # YOUR CODE HERE

In [42]:
# Create the pipeline
pipeline = make_pipeline(
    TfidfVectorizer(),
    MultinomialNB(),
)

In [43]:
# Set the hyperparameters to search; make_pipeline names each step after its lowercased class name
param_grid = {
    'tfidfvectorizer__ngram_range': ((1,1), (2,2), (1,2)),
    'multinomialnb__alpha': (0.1,1)
}

In [44]:
# Instantiate the grid search (5-fold cross-validation, scored on recall) and fit it
search = GridSearchCV(
    pipeline,
    param_grid=param_grid,
    scoring = 'recall',
    n_jobs = -1,
    cv=5,
)

search.fit(data.clean_reviews,data.target_encoded)
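To get a feel for the cost of this search, the candidate combinations can be enumerated by hand. The sketch below is plain Python and does not call scikit-learn; it copies the grid and the `cv=5` setting from the cells above, and `cv_folds`, `candidates`, and `total_fits` are names introduced here for illustration.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    "tfidfvectorizer__ngram_range": ((1, 1), (2, 2), (1, 2)),
    "multinomialnb__alpha": (0.1, 1),
}
cv_folds = 5  # matches cv=5 in the GridSearchCV call

# Cartesian product of all value lists: one dict per candidate setting.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds

for candidate in candidates:
    print(candidate)
print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full data.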

In [51]:

print(f'best estimator {search.best_estimator_}')

best estimator Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer(ngram_range=(2, 2))),
                ('multinomialnb', MultinomialNB(alpha=1))])
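The search ranks candidates by recall alone, that is, the share of truly positive reviews that are predicted positive. A model that predicts "positive" for everything reaches perfect recall, so it can be worth checking a second metric alongside it. The sketch below illustrates this with made-up labels; `recall` and `accuracy` are hypothetical plain-Python helpers, not scikit-learn functions, and the numbers say nothing about the estimator selected above.

```python
def recall(y_true, y_pred, positive=1):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
always_positive = [1, 1, 1, 1, 1, 1]
mixed = [1, 1, 0, 0, 0, 1]

for name, y_pred in [("always positive", always_positive), ("mixed", mixed)]:
    print(f"{name}: recall={recall(y_true, y_pred):.2f}, accuracy={accuracy(y_true, y_pred):.2f}")
```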


🏁 Congratulations! You've managed to chain a Vectorizer and an NLP model and fine-tuned it!

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!