# Vectorizer + NaiveBayes Tuning

🎯 The goal of this challenge is to build a Pipeline that combines a vectorizer with a Naive Bayes classifier, and then to fine-tune that pipeline.

✍️ Let's reuse the previous dataset of $2000$ reviews, each classified as either "positive" or "negative".

In [1]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/movie_reviews.csv")
data.head()

| Unnamed: 0 | target | reviews |
|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... |


In [2]:
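# Import the class directly so the name `preprocessing` stays free
# for the text-cleaning function defined later in the notebook
from sklearn.preprocessing import LabelEncoder

# Encode the string labels as integers (classes are sorted, so "neg" maps to 0)
le = LabelEncoder()
data["target_encoded"] = le.fit_transform(data.target)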

In [3]:
data.head()

| Unnamed: 0 | target | reviews | target_encoded |
|---|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... | 0 |
| 1 | neg | the happy bastard's quick movie review \ndamn ... | 0 |
| 2 | neg | it is movies like these that make a jaded movi... | 0 |
| 3 | neg | " quest for camelot " is warner bros . ' firs... | 0 |
| 4 | neg | synopsis : a mentally unstable man undergoing ... | 0 |
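
As a quick illustration of what the encoding step does, here is a plain-Python sketch of the same mapping. It assumes the encoder numbers the sorted unique labels from 0 upward, which matches the `neg` rows encoded as `0` in the preview above. The `labels` list is made up, and the `pos` label is an assumption, since only `neg` appears in the preview.

```python
# Hypothetical labels: only "neg" is visible in the preview, "pos" is assumed
labels = ["neg", "pos", "pos", "neg"]

# Sort the unique labels and number them from 0 upward
classes = sorted(set(labels))
mapping = {label: index for index, label in enumerate(classes)}

# Replace each label by its integer code
encoded = [mapping[label] for label in labels]

print(mapping)
print(encoded)
```

On the fitted encoder itself, the learned label order can be read from `le.classes_`.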


## Preprocessing

❓ **Question (Cleaning)** ❓

Clean your texts.

In [5]:
import string
from nltk.corpus import stopwords 
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()  # instantiate once and reuse for every token

def preprocessing(sentence):
    # Basic normalization: trim whitespace, lowercase, drop digits
    sentence = sentence.strip().lower()
    sentence = ''.join(char for char in sentence if not char.isdigit())

    # Remove punctuation characters
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '')

    # Tokenize, drop English stopwords, then lemmatize the remaining tokens
    tokens = word_tokenize(sentence)
    tokens_cleaned = [w for w in tokens if w not in stop_words]
    lemmatized = [lemmatizer.lemmatize(w) for w in tokens_cleaned]

    return ' '.join(lemmatized)
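
The punctuation loop in `preprocessing` rescans the string once per punctuation character. A possible alternative, sketched below with only the standard library, removes digits and punctuation in a single pass with `str.translate`. The helper name `strip_digits_and_punctuation` is hypothetical and the sample sentence is made up. Note that `isdigit()` in the original also matches non-ASCII digits, which this ASCII-only table does not.

```python
import string

# Translation table that deletes every ASCII digit and punctuation character
_REMOVE_TABLE = str.maketrans('', '', string.digits + string.punctuation)

def strip_digits_and_punctuation(sentence):
    """Trim, lowercase, then delete ASCII digits and punctuation in one pass."""
    return sentence.strip().lower().translate(_REMOVE_TABLE)

sample = "  The 2 teens' party: great, 10/10!  "  # made-up input
cleaned_sample = strip_digits_and_punctuation(sample)
print(repr(cleaned_sample))
```

Leftover double spaces are harmless here, since the later tokenization step splits on whitespace anyway.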

In [7]:
data['clean_reviews'] = data.reviews.apply(preprocessing)
data.head()

| Unnamed: 0 | target | reviews | target_encoded | clean_reviews |
|---|---|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... | 0 | plot two teen couple go church party drink dri... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... | 0 | happy bastard quick movie review damn yk bug g... |
| 2 | neg | it is movies like these that make a jaded movi... | 0 | movie like make jaded movie viewer thankful in... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... | 0 | quest camelot warner bros first featurelength ... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... | 0 | synopsis mentally unstable man undergoing psyc... |


## Tuning

❓ **Question (Pipelining a Vectorizer and a NaiveBayes Model)** ❓

* Create a Pipeline that chains a vectorizer of your choice with a NaiveBayes model
* Optimize it
* What is your best estimator?

In [26]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import set_config; set_config(display="diagram")

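# Create the pipeline: a vectorizer step ('model') followed by a
# Naive Bayes classifier step ('multi')
pipe = Pipeline([
    ('model', TfidfVectorizer()),
    ('multi', MultinomialNB())
])

# Parameters to search: one sub-grid per vectorizer.
# Note: integer max_df/min_df values are read as absolute document counts,
# while floats are read as proportions of documents. The integers 1 and 0
# below are therefore counts, not 100% and 0%.
grid = [{'multi__alpha': (0.1, 1),
         'model': [TfidfVectorizer()],
         'model__ngram_range': ((1, 1), (1, 2), (2, 2)),
         'model__max_df': (0.5, 0.75, 1),
         'model__min_df': (0, 0.1, 0.03)
        },
        {'multi__alpha': (0.1, 1),
         'model': [CountVectorizer()],
         'model__ngram_range': ((1, 1), (1, 2), (2, 2)),
         'model__max_df': (0.5, 0.75, 1),
         'model__min_df': (0, 0.1, 0.03)
        }]

# Perform a 5-fold cross-validated grid search on the pipeline
search = GridSearchCV(pipe, grid, cv=5, scoring='accuracy')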
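
To make the `ngram_range` values in the grid concrete, here is a plain-Python sketch of which word n-grams each setting produces. It is an illustration only, not scikit-learn's implementation; the `word_ngrams` helper and the sample tokens are hypothetical.

```python
def word_ngrams(tokens, ngram_range):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sequence
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams

tokens = ["happy", "quick", "movie", "review"]  # made-up cleaned tokens

for ngram_range in ((1, 1), (1, 2), (2, 2)):
    print(ngram_range, word_ngrams(tokens, ngram_range))
```

With `(1, 2)` the vocabulary holds both single words and bigrams, while `(2, 2)` keeps bigrams only, so the number of features can differ a lot between the three settings.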

In [27]:
search.fit(data.clean_reviews, data.target_encoded)

300 fits failed out of a total of 540.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
90 fits failed with the following error:
Traceback (most recent call last):
  File "/home/raphaelsisso/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/raphaelsisso/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/sklearn/base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/raphaelsisso/.pyenv/versions/3.10.6/envs/lewagon/lib/python3.10/site-packages/sklearn/pipeline.py", line 423, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/home/r
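
The traceback above is cut off, but the count of 300 failed fits is consistent with how the integer values in the grid are interpreted. The sketch below enumerates the grid in plain Python under two assumptions about scikit-learn's vectorizers: an integer `min_df` below 1 is rejected, and a fit fails when `max_df` resolves to fewer documents than `min_df`. It also assumes that each of the 5 folds trains on 1600 of the 2000 reviews.

```python
from itertools import product

N_TRAIN_DOCS = 1600  # assumption: 2000 reviews, 5-fold CV, 4/5 used for fitting
CV_FOLDS = 5

# Same values as the grid in the notebook
vectorizers = ("TfidfVectorizer", "CountVectorizer")
alphas = (0.1, 1)
ngram_ranges = ((1, 1), (1, 2), (2, 2))
max_dfs = (0.5, 0.75, 1)
min_dfs = (0, 0.1, 0.03)

def as_doc_count(value, n_docs):
    """Integers are absolute document counts; floats are proportions."""
    return value if isinstance(value, int) else value * n_docs

def is_invalid(max_df, min_df):
    # Assumption 1: an integer min_df must be at least 1
    if isinstance(min_df, int) and min_df < 1:
        return True
    # Assumption 2: max_df may not resolve to fewer documents than min_df
    return as_doc_count(max_df, N_TRAIN_DOCS) < as_doc_count(min_df, N_TRAIN_DOCS)

candidates = list(product(vectorizers, alphas, ngram_ranges, max_dfs, min_dfs))
invalid = [c for c in candidates if is_invalid(c[3], c[4])]

total_fits = len(candidates) * CV_FOLDS
failed_fits = len(invalid) * CV_FOLDS
print(f"{failed_fits} fits failed out of a total of {total_fits}.")
```

Under these assumptions, writing the bounds as floats (`1.0` instead of `1`, `0.0` instead of `0`) should keep every combination valid. The best parameters and score reported below come from the grid as originally written.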

In [28]:
search.best_params_

{'model': CountVectorizer(),
 'model__max_df': 0.75,
 'model__min_df': 0.03,
 'model__ngram_range': (1, 2),
 'multi__alpha': 0.1}

In [29]:
search.best_score_

0.8244999999999999
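
The question asks for the best estimator, which `GridSearchCV` exposes as `best_estimator_` once fitted (with the default `refit=True`). A minimal sketch on a made-up toy corpus follows; `toy_texts`, `toy_labels`, `toy_pipe`, and `toy_search` are hypothetical stand-ins, and on the real data the same attribute would be read from `search`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for data.clean_reviews / data.target_encoded
toy_texts = [
    "great movie loved acting", "wonderful plot great cast",
    "loved film wonderful story", "great fun loved it",
    "terrible movie hated acting", "awful plot boring cast",
    "hated film awful story", "boring terrible hated it",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Same step names as in the notebook, with a much smaller grid
toy_pipe = Pipeline([("model", CountVectorizer()), ("multi", MultinomialNB())])
toy_search = GridSearchCV(toy_pipe, {"multi__alpha": [0.1, 1.0]}, cv=2, scoring="accuracy")
toy_search.fit(toy_texts, toy_labels)

# The refitted pipeline with the best parameters, ready to predict on raw text
best_pipeline = toy_search.best_estimator_
predictions = best_pipeline.predict(["loved great story", "awful boring film"])
print(best_pipeline)
print(predictions)
```

Because the best estimator is the whole pipeline, new reviews only need the same text cleaning before being passed to `predict`; vectorization happens inside the pipeline.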

🏁 Congratulations! You've managed to chain a Vectorizer and an NLP model and fine-tune the result!

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!