# Vectorizer + NaiveBayes Tuning

🎯 The goal of this challenge is to build a Pipeline that chains a Vectorizer with a Naive Bayes classifier, and then to fine-tune that pipeline.

✍️ Let's reuse the previous dataset of $2000$ movie reviews, each labeled as either "positive" or "negative".

In [1]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/movie_reviews.csv")
data.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


In [2]:
from sklearn.preprocessing import LabelEncoder

# Importing the class directly avoids a clash with the `preprocessing` function defined later
le = LabelEncoder()
data["target_encoded"] = le.fit_transform(data.target)
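
To make the encoding explicit, here is a minimal sketch of how `LabelEncoder` assigns integer codes. The short `labels` list is a made-up stand-in for the `target` column, not the real data. The encoder sorts the classes, so `neg` maps to 0 and `pos` maps to 1, which matches the `target_encoded` values shown in the table.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample standing in for data.target
labels = ["neg", "pos", "pos", "neg"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ is sorted, so each class's position is its integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)  # {'neg': 0, 'pos': 1}

# inverse_transform recovers the original string labels from the codes
decoded = le.inverse_transform(encoded)
print(list(decoded))
```

Knowing which class is encoded as 1 matters later: binary metrics such as recall treat label 1 as the positive class by default.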

In [3]:
data.head()

Unnamed: 0,target,reviews,target_encoded
0,neg,"plot : two teen couples go to a church party ,...",0
1,neg,the happy bastard's quick movie review \ndamn ...,0
2,neg,it is movies like these that make a jaded movi...,0
3,neg,""" quest for camelot "" is warner bros . ' firs...",0
4,neg,synopsis : a mentally unstable man undergoing ...,0


## Preprocessing

❓ **Question (Cleaning)** ❓

Clean your texts

In [7]:
import string

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Build these once rather than on every call
# (the NLTK stopwords, tokenizer and WordNet resources must already be downloaded)
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def preprocessing(sentence: str) -> str:
    # Lowercase and trim surrounding whitespace
    sentence = sentence.lower().strip()
    # Drop digits, then punctuation
    sentence = ''.join(char for char in sentence if not char.isdigit())
    sentence = sentence.translate(PUNCTUATION_TABLE)

    # Tokenize and remove English stopwords
    tokens = [w for w in word_tokenize(sentence) if w not in STOP_WORDS]

    # Lemmatize each token as a verb first, then as a noun
    lemmas = [
        LEMMATIZER.lemmatize(LEMMATIZER.lemmatize(w, pos='v'), pos='n')
        for w in tokens
    ]
    return ' '.join(lemmas)

In [9]:
data['clean_reviews'] = data.reviews.map(preprocessing)

In [10]:
data

Unnamed: 0,target,reviews,target_encoded,clean_reviews
0,neg,"plot : two teen couples go to a church party ,...",0,plot two teen couple go church party drink dri...
1,neg,the happy bastard's quick movie review \ndamn ...,0,happy bastard quick movie review damn yk bug g...
2,neg,it is movies like these that make a jaded movi...,0,movie like make jade movie viewer thankful inv...
3,neg,""" quest for camelot "" is warner bros . ' firs...",0,quest camelot warner bros first featurelength ...
4,neg,synopsis : a mentally unstable man undergoing ...,0,synopsis mentally unstable man undergo psychot...
...,...,...,...,...
1995,pos,wow ! what a movie . \nit's everything a movie...,1,wow movie everything movie funny dramatic inte...
1996,pos,"richard gere can be a commanding actor , but h...",1,richard gere command actor he always great fil...
1997,pos,"glory--starring matthew broderick , denzel was...",1,glorystarring matthew broderick denzel washing...
1998,pos,steven spielberg's second epic film on world w...,1,steven spielberg second epic film world war ii...


## Tuning

❓ **Question (Pipelining a Vectorizer and a NaiveBayes Model)** ❓

* Create a Pipeline that chains a vectorizer of your choice with a NaiveBayes model
* Optimize it
* What is your best estimator?
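
As a hedged illustration of the steps listed above, the sketch below tunes a vectorizer hyperparameter and a model hyperparameter together. The tiny corpus, its labels, and the grid values are made up purely so the example runs on its own; they are not the challenge data or a recommended grid. The only library convention relied on is that `make_pipeline` names each step after its lowercased class name, so parameters are addressed as `countvectorizer__...` and `multinomialnb__...`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = positive, 0 = negative
texts = [
    "great funny movie", "wonderful moving story",
    "great acting wonderful cast", "funny clever great script",
    "boring dull movie", "terrible boring plot",
    "dull awful acting", "awful terrible script",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# Keys follow the "<step name>__<parameter>" convention
param_grid = {
    "countvectorizer__ngram_range": [(1, 1), (1, 2)],
    "multinomialnb__alpha": [0.1, 1],
}

# 2 x 2 = 4 candidates, each evaluated with 2-fold cross-validation
search = GridSearchCV(pipeline, param_grid, scoring="recall", cv=2)
search.fit(texts, labels)

print(search.best_params_)     # best combination found on this toy data
print(search.best_estimator_)  # pipeline refitted with those parameters
```

Because the vectorizer sits inside the pipeline, it is refitted on the training folds only, so no vocabulary leaks from the validation fold into the model being scored.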

In [18]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import set_config

# `display` must be passed by keyword; the first positional argument is `assume_finite`
set_config(display="diagram")

# Create Pipeline
pipeline = make_pipeline(
    CountVectorizer(),
    MultinomialNB()
)

# Set parameters to search
# (keys follow the "<step name>__<parameter>" convention of make_pipeline)
params = {
    'multinomialnb__alpha': (0.01, 0.1, 1)
}

# Perform grid search on pipeline
search = GridSearchCV(
    pipeline,
    params,
    scoring='recall',
    cv=5,
    n_jobs=-1,
    verbose=1
)
search.fit(data.clean_reviews,data.target_encoded)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


In [19]:
search.best_params_

{'multinomialnb__alpha': 1}
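
The `alpha` searched here is the additive (Laplace/Lidstone) smoothing parameter of `MultinomialNB`. As a rough sketch of what it does, the pure-Python snippet below computes smoothed word likelihoods for one class as `(count + alpha) / (total + alpha * vocabulary_size)`. The function name, the vocabulary, and the token list are invented for illustration; this is not scikit-learn's implementation.

```python
from collections import Counter

def smoothed_likelihoods(tokens, vocabulary, alpha):
    """Additively smoothed estimate of P(word | class) for every vocabulary word."""
    counts = Counter(tokens)
    total = len(tokens)
    vocab_size = len(vocabulary)
    return {
        word: (counts[word] + alpha) / (total + alpha * vocab_size)
        for word in vocabulary
    }

# Hypothetical vocabulary and tokens observed in "positive" reviews
vocabulary = ["good", "bad", "boring"]
pos_tokens = ["good", "good", "good", "bad"]

for alpha in (0.01, 0.1, 1):
    probs = smoothed_likelihoods(pos_tokens, vocabulary, alpha)
    # "boring" never appears in pos_tokens; alpha keeps its probability above zero
    print(alpha, round(probs["boring"], 4))
```

A larger `alpha` shifts more probability mass to unseen words and flattens the estimates, while a value close to 0 trusts the raw counts almost entirely.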

In [20]:
search.best_score_

0.797
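
Since the search was run with `scoring='recall'`, `best_score_` is the recall of the class encoded as 1 (`pos` here), averaged over the 5 cross-validation folds. As a reminder of what that metric measures, here is a small pure-Python sketch; the `recall` helper and the label lists are made up for illustration.

```python
def recall(y_true, y_pred, positive=1):
    """Share of truly positive samples that were predicted as positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if true_pos + false_neg == 0:
        return 0.0  # no positive samples in y_true
    return true_pos / (true_pos + false_neg)

# Hypothetical labels: 4 positive reviews, 3 of which are correctly detected
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

print(recall(y_true, y_pred))  # 0.75
```

Recall ignores false positives, so a model that labels every review as positive would score 1.0. Comparing it with precision or accuracy gives a fuller picture.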

In [None]:
# The best estimator is the pipeline refitted on all the data with the best parameters
search.best_estimator_

🏁 Congratulations! You've managed to chain a Vectorizer and an NLP model and to fine-tune the resulting pipeline!

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!