# Vectorizer Tuning

In [1]:
import pandas as pd

data = pd.read_pickle("reviews_3")

data.head()

| | target | reviews |
|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... |


The dataset is made up of positive and negative movie reviews.

In [2]:
# just installing everything to be sure
!pip install nltk
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\utilisateur\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\utilisateur\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\utilisateur\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\utilisateur\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## Preprocessing

👇 Remove punctuation and lowercase the text.

In [3]:
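import string

# Work on a copy so the raw `data` frame stays untouched
clean_data = data.copy()

# Translation table that deletes every punctuation character
punct_table = str.maketrans("", "", string.punctuation)

# Lowercase, strip punctuation, then drop newlines
# ("\n" is not part of string.punctuation, so it needs its own step)
clean_data["reviews"] = (
    clean_data["reviews"]
    .str.lower()
    .str.translate(punct_table)
    .str.replace("\n", "", regex=False)
)

clean_data.head()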

| | target | reviews |
|---|---|---|
| 0 | neg | plot two teen couples go to a church party d... |
| 1 | neg | the happy bastards quick movie review damn tha... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | quest for camelot is warner bros first fe... |
| 4 | neg | synopsis a mentally unstable man undergoing p... |
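
The cleaning step above can also be written as a small standalone function, which makes it easy to check on a single string before applying it to the whole column. This is only a sketch: `clean_text`, `PUNCT_TABLE`, and the sample sentence are hypothetical names and data, not part of the original exercise.

```python
import string

# Translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    """Lowercase, strip punctuation, and drop newline characters."""
    return text.lower().translate(PUNCT_TABLE).replace("\n", "")


# Hypothetical sample, loosely modelled on the reviews shown above
sample = "Damn , that was a GOOD movie !\n"
print(repr(clean_text(sample)))
```

Removing a punctuation mark that sits between two spaces leaves a double space behind. That should be harmless here, assuming the default scikit-learn tokenizer is used later, since it extracts word tokens and ignores the whitespace between them.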


## Tuning

👇 Tune a vectorizer of your choice (or try both!) and a MultinomialNB model simultaneously.

In [27]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Create Pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB()),
])

# Set parameters to search (model and vectorizer)
parameters = {
    'tfidf__ngram_range': ((1, 1), (2, 2)),
    'nb__alpha': (0.1, 1),
}

# Perform grid search on pipeline
grid_search = GridSearchCV(estimator=pipeline, param_grid=parameters, n_jobs=-1,
                           verbose=1, scoring="accuracy",
                           refit=True, cv=5)

y = clean_data.target

grid_search.fit(clean_data.reviews, y)
print(grid_search.best_params_)
print(grid_search.best_score_)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
{'nb__alpha': 0.1, 'tfidf__ngram_range': (2, 2)}
0.8394999999999999
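
`best_params_` and `best_score_` only report the winning candidate. To see how close the other combinations came, the fitted search also exposes `cv_results_`, which can be loaded into a DataFrame. The sketch below is self-contained and runs on a tiny made-up corpus, so its scores mean nothing. `toy_reviews`, `toy_targets`, the extra `(1, 2)` n-gram range, and `cv=2` are all assumptions added for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, only here so the sketch runs on its own
toy_reviews = [
    "a great film with a great cast",
    "wonderful story and great acting",
    "i loved this wonderful movie",
    "great fun and a wonderful ending",
    "a boring film with a terrible plot",
    "terrible acting and a boring story",
    "i hated this boring movie",
    "awful pacing and a terrible ending",
]
toy_targets = ["pos"] * 4 + ["neg"] * 4

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Same grid as above, plus (1, 2) to mix unigrams and bigrams (an assumption)
toy_parameters = {
    "tfidf__ngram_range": ((1, 1), (1, 2), (2, 2)),
    "nb__alpha": (0.1, 1),
}

# cv=2 only because the toy corpus is tiny
toy_search = GridSearchCV(toy_pipeline, toy_parameters, scoring="accuracy", cv=2)
toy_search.fit(toy_reviews, toy_targets)

# One row per candidate, best rank first
columns = [
    "param_tfidf__ngram_range",
    "param_nb__alpha",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
results = pd.DataFrame(toy_search.cv_results_)[columns].sort_values("rank_test_score")
print(results)
```

Applied to the real `grid_search` object, the same `pd.DataFrame(grid_search.cv_results_)` call would show whether the gap between unigrams and bigrams is larger than the spread across folds (`std_test_score`).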


⚠️ Please push the exercise once you are done 🙃

## 🏁 