# Hyperparameters and caching

## This exercise asks you to:

### 1. Evaluate several Naive Bayes (N.B.) models:
- on the dataset located at 'data/emails.csv'
- with the hyperparameters detailed below
- using the training/testing data split detailed below
- splitting the data after shuffling it with numpy.random.shuffle, using the seed detailed below

### 2. Report the cross-validation score for each model.

### 3. Evaluate the best N.B. model:
- using the same training/testing data split as above

### 4. Report the test score for the best model.

## Solution

In [0]:
# These two commands avoid having to reload manually every time a package is modified
%load_ext autoreload
%autoreload 2

First we import all the packages needed for preprocessing, data handling and the classifier.

In [0]:
#General libraries
import numpy as np
import time
import os
import pickle

#Data handling packages
import pandas         as pd
import dask.dataframe as dd

#nltk packages for preprocessing
import nltk
from   nltk.tokenize import TreebankWordTokenizer
from   nltk.stem     import PorterStemmer, WordNetLemmatizer
from   nltk.corpus   import stopwords

#sklearn packages
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection         import train_test_split
from sklearn.model_selection         import cross_val_score
from sklearn.naive_bayes             import MultinomialNB


We define the objects and variables needed to preprocess the data.

In [0]:
nltk.download('wordnet')
nltk.download('stopwords')

tokenizer  = TreebankWordTokenizer()
stemmer    = PorterStemmer()
lemmatizer = WordNetLemmatizer()

random_seed = 0
test_size   = 0.3
cross_sets  = 5

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


We download the dataset and load it into a pandas DataFrame.

In [0]:
! wget 'https://raw.githubusercontent.com/rn-2019-itba/Clase-3---K-folding-TFIDF-Dask-/master/data/emails.csv'
dataset = pd.read_csv('emails.csv')

--2019-08-21 04:41:57--  https://raw.githubusercontent.com/rn-2019-itba/Clase-3---K-folding-TFIDF-Dask-/master/data/emails.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8954755 (8.5M) [text/plain]
Saving to: ‘emails.csv’


2019-08-21 04:41:57 (112 MB/s) - ‘emails.csv’ saved [8954755/8954755]



### Caching
We define the paths where files are stored (with pickle) to build the cache.

In [0]:
caching      = True
dataset_path = 'emails.csv'

def get_nltk_cache_path(hp):
    cache_path = f'cache-{hp["isalpha"]}'
    return cache_path

def get_sklearn_cache_path(hp):
    cache_path = f'cache-{hp["isalpha"]}-{hp["tf_idf"]}-{hp["min_df"]}-{hp["max_df"]}'
    return cache_path
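
The cache pattern used throughout this notebook (check whether the file exists, otherwise compute the result and pickle it) can be sketched as a single helper. This is a minimal sketch rather than the notebook's own code: `load_or_compute`, its `compute` argument and the `expensive` function are hypothetical names introduced only for illustration.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, compute):
    """Return the object pickled at cache_path, computing and storing it on a miss.

    Hypothetical helper: `compute` is any zero-argument function.
    """
    if os.path.isfile(cache_path):
        # Cache hit: reuse the stored result
        with open(cache_path, 'rb') as fp:
            return pickle.load(fp)

    # Cache miss: compute the result and store it for the next call
    result = compute()
    with open(cache_path, 'wb') as fp:
        pickle.dump(result, fp)
    return result


# Usage: the expensive function only runs on the first call
calls = []

def expensive():
    calls.append(1)
    return {'text': ['hello world']}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'cache-True')
    first = load_or_compute(path, expensive)
    second = load_or_compute(path, expensive)

print(first == second, len(calls))
```

Keeping the hit/miss logic in one place would make it harder for the read path and the write path to drift apart, at the cost of one extra level of indirection.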

### Hyperparameters
We define a dictionary so that every combination of hyperparameters can be tried.

In [0]:
#All the possibilities
hyperparameters_specs = {
    'isalpha': [True, False],
    'tf_idf':  [True, False],
    'min_df':  [0.01, 0.05, 0.1, 0.49],
    'max_df':  [0.5, 0.75, 0.99],
    'alpha':   [0.01, 0.1, 1.0, 10.0],
}

#We will store everything in a pandas DataFrame
#(rows are collected in a list first because DataFrame.append was removed in pandas 2.0)
hp_rows = []

for isalpha in hyperparameters_specs['isalpha']:
    for tf_idf in hyperparameters_specs['tf_idf']:
        for min_df in hyperparameters_specs['min_df']:
            for max_df in hyperparameters_specs['max_df']:
                for alpha in hyperparameters_specs['alpha']:
                    hp = {
                        'isalpha': isalpha,
                        'alpha':   alpha,
                        'min_df':  min_df,
                        'max_df':  max_df,
                        'tf_idf':  tf_idf,
                    }
                    hp_rows.append(hp)

hyperparameters = pd.DataFrame(hp_rows)

#Let's see how it turned out
print(hyperparameters.head(5))

   isalpha  alpha  min_df  max_df  tf_idf
0     True   0.01    0.01    0.50    True
1     True   0.10    0.01    0.50    True
2     True   1.00    0.01    0.50    True
3     True  10.00    0.01    0.50    True
4     True   0.01    0.01    0.75    True
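
The same grid can also be produced without nested loops. The following is a sketch under the assumption that only the row order and column order of the table above matter; `names`, `rows`, `columns` and `grid` are names chosen for this example.

```python
import itertools

import pandas as pd

hyperparameters_specs = {
    'isalpha': [True, False],
    'tf_idf':  [True, False],
    'min_df':  [0.01, 0.05, 0.1, 0.49],
    'max_df':  [0.5, 0.75, 0.99],
    'alpha':   [0.01, 0.1, 1.0, 10.0],
}

# product() varies the last key fastest, so 'alpha' changes on every row,
# then 'max_df', and so on, like the innermost-to-outermost loops
names = list(hyperparameters_specs)
rows = [dict(zip(names, values))
        for values in itertools.product(*hyperparameters_specs.values())]

# Reorder the columns to match the printed table
columns = ['isalpha', 'alpha', 'min_df', 'max_df', 'tf_idf']
grid = pd.DataFrame(rows)[columns]

print(len(grid))
print(grid.head(5))
```

The grid has 2 x 2 x 4 x 3 x 4 = 192 rows, one per combination.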


### Preprocessing: NLTK

In [0]:
#Callback for Dask's parallel processing
def nltk_preprocessor_callback(**kwargs):
    #NLTK preprocessing, same as in the previous class
    def preprocessor(datapoint):
        raw_datapoint          = datapoint
        tokenized_datapoint    = tokenizer.tokenize(raw_datapoint)
        lemmatized_datapoint   = [lemmatizer.lemmatize(x,pos='v') for x in tokenized_datapoint]
        nonstop_datapoint      = [x for x in lemmatized_datapoint if x not in stopwords.words('english')]
        stemmed_datapoint      = [stemmer.stem(x) for x in nonstop_datapoint]
        filtered_datapoint     = stemmed_datapoint
        
        #We skip this depending on the isalpha hyperparameter
        if kwargs.setdefault('isalpha', True):
            alphanumeric_datapoint = [x for x in stemmed_datapoint if x.isalpha()]
            filtered_datapoint     = alphanumeric_datapoint
        
        return ' '.join(filtered_datapoint)

    return preprocessor

def run_nltk_preprocessor(hp, dataset=None):
    print('NLTK Preprocessing...')
    to = time.time()
    cache_path = get_nltk_cache_path(hp)
    
    #Check whether preprocessing has already been run for this combination of hyperparameters
    if not (os.path.exists(cache_path) and os.path.isfile(cache_path)):
        print('Cache miss: ', cache_path)

        #Read the dataset
        if caching is True:
            dataset = pd.read_csv(dataset_path)
        else:
            dataset = dataset.copy()
        preprocessor    = nltk_preprocessor_callback(isalpha=hp['isalpha'])
        ddataset        = dd.from_pandas(dataset, npartitions=os.cpu_count())
        dataset['text'] = ddataset['text'].map_partitions(lambda df: df.apply(preprocessor)).compute(scheduler='multiprocessing')

        #Store this run in the cache
        if caching is True:
            cache_path = get_nltk_cache_path(hp)
            with open(cache_path, 'wb') as fp:
                pickle.dump(dataset, fp)
        
    tf = time.time()
    print('finished in', (int(tf-to)), 'seconds.')

We run the NLTK preprocessing for the first combination of hyperparameters.

In [0]:
hyperParam = hyperparameters.iloc[0]
run_nltk_preprocessor(hyperParam)

NLTK Preprocessing...
Cache miss:  cache-True
finished in 200 seconds.
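
The only hyperparameter that affects this stage is `isalpha`, which is why it is the only value in the NLTK cache name. Its effect can be seen in isolation with a small sketch; `filter_tokens` and the token list are made up for illustration and stand in for the output of the stemming step.

```python
def filter_tokens(tokens, isalpha=True):
    """Join tokens into one string, optionally dropping non-alphabetic ones."""
    if isalpha:
        # str.isalpha() is False for tokens containing digits or punctuation
        tokens = [t for t in tokens if t.isalpha()]
    return ' '.join(tokens)


# Made-up tokens standing in for the output of the stemming step
tokens = ['subject', 'win', '1000', 'usd', '!', 'offer2']

with_filter = filter_tokens(tokens, isalpha=True)
without_filter = filter_tokens(tokens, isalpha=False)

print(with_filter)
print(without_filter)
```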


### Preprocessing: sklearn
Here we run either the count vectorizer or the TF-IDF vectorizer, depending on the hyperparameter in use.

In [0]:
def run_sklearn_preprocessor(hp, dataset=None):
    print('sklearn preprocessing...')
    to = time.time()
    cache_path = get_sklearn_cache_path(hp)
    
    #Check whether this combination has already been tried
    if not (os.path.exists(cache_path) and os.path.isfile(cache_path)):    
        print('Cache miss: ', cache_path)   
        
        if caching is True:
            cache_path = get_nltk_cache_path(hp)
            with open (cache_path, 'rb') as fp:
                dataset = pickle.load(fp)
        else:
            dataset = dataset.copy()

        #Run the corresponding vectorizer, same as in the previous class
        V = (TfidfVectorizer if hp['tf_idf'] is True else CountVectorizer)(min_df=hp['min_df'], max_df=hp['max_df'])
        X = V.fit_transform(dataset['text']).toarray()
        Y = np.array([dataset['spam'].values]).T
        D = np.hstack((X, Y))

        np.random.seed(seed=random_seed)
        np.random.shuffle(D)

        if caching is True:
            cache_path = get_sklearn_cache_path(hp)
            with open(cache_path, 'wb') as fp:
                pickle.dump(D, fp)

    tf = time.time()
    print('finished in', (int(tf-to)), 'seconds.')

We run the sklearn preprocessing for the first combination of hyperparameters.

In [0]:
hp2 = hyperparameters.iloc[0]
run_sklearn_preprocessor(hp2)

sklearn preprocessing...
Cache miss:  cache-True-True-0.01-0.5
finished in 1 seconds.
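
To see what `min_df` and `max_df` do to the vocabulary, here is a small sketch on a made-up four-document corpus (not the email data). When given as floats, both thresholds are read as fractions of the number of documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: 'now' appears in every document, 'money' in only one
docs = [
    'free money now',
    'free offer now',
    'meeting now',
    'project meeting now',
]

# Keep terms that appear in at least 50% and at most 75% of the documents
vectorizer = CountVectorizer(min_df=0.5, max_df=0.75)
X = vectorizer.fit_transform(docs)

kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)
print(X.toarray())
```

Here 'now' is dropped for being too frequent and 'money', 'offer' and 'project' for being too rare, leaving only 'free' and 'meeting' as features.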


Now we run both preprocessing steps for every combination of hyperparameters.

In [0]:
#Full preprocessing. WARNING: THIS TAKES A WHILE
print('Preprocessing dataset...')
for index, hp in hyperparameters.iterrows():
    print(hp.to_dict())
    run_nltk_preprocessor(hp)
    run_sklearn_preprocessor(hp)

Preprocessing dataset...
{'isalpha': True, 'alpha': 0.01, 'min_df': 0.01, 'max_df': 0.5, 'tf_idf': True}
NLTK Preprocessing...
finished in 0 seconds.
sklearn preprocessing...
finished in 0 seconds.
{'isalpha': True, 'alpha': 0.1, 'min_df': 0.01, 'max_df': 0.5, 'tf_idf': True}
NLTK Preprocessing...
finished in 0 seconds.
sklearn preprocessing...
finished in 0 seconds.
{'isalpha': True, 'alpha': 1.0, 'min_df': 0.01, 'max_df': 0.5, 'tf_idf': True}
NLTK Preprocessing...
finished in 0 seconds.
sklearn preprocessing...
finished in 0 seconds.
{'isalpha': True, 'alpha': 10.0, 'min_df': 0.01, 'max_df': 0.5, 'tf_idf': True}
NLTK Preprocessing...
finished in 0 seconds.
sklearn preprocessing...
finished in 0 seconds.
{'isalpha': True, 'alpha': 0.01, 'min_df': 0.01, 'max_df': 0.75, 'tf_idf': True}
NLTK Preprocessing...
finished in 0 seconds.
sklearn preprocessing...
Cache miss:  cache-True-True-0.01-0.75
finished in 0 seconds.
{'isalpha': True, 'alpha': 0.1, 'min_df': 0.01, 'max_df': 0.75, 'tf_idf'
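
In the log above, combinations that differ only in `alpha` hit the cache, because `alpha` is used by the classifier and is not part of either cache name. The sketch below counts how many distinct cache files the full grid would need; it redefines the two path functions so that it is self-contained, and `combinations`, `nltk_paths` and `sklearn_paths` are names chosen for this example.

```python
import itertools

hyperparameters_specs = {
    'isalpha': [True, False],
    'tf_idf':  [True, False],
    'min_df':  [0.01, 0.05, 0.1, 0.49],
    'max_df':  [0.5, 0.75, 0.99],
    'alpha':   [0.01, 0.1, 1.0, 10.0],
}

def get_nltk_cache_path(hp):
    return f'cache-{hp["isalpha"]}'

def get_sklearn_cache_path(hp):
    return f'cache-{hp["isalpha"]}-{hp["tf_idf"]}-{hp["min_df"]}-{hp["max_df"]}'

# Every combination of hyperparameters as a dictionary
names = list(hyperparameters_specs)
combinations = [dict(zip(names, values))
                for values in itertools.product(*hyperparameters_specs.values())]

# Sets keep one entry per distinct cache file name
nltk_paths = {get_nltk_cache_path(hp) for hp in combinations}
sklearn_paths = {get_sklearn_cache_path(hp) for hp in combinations}

print(len(combinations), len(nltk_paths), len(sklearn_paths))
```

Under these definitions the 192 combinations share 2 NLTK caches and 48 sklearn caches, so the expensive NLTK step runs only twice.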

### Classifier: evaluating scores

In [0]:
#Callback for Dask's parallel processing
def score_callback(dataset=None):
    def score_classifier(hp):
        print(hp.to_dict())
        
        if caching is True:
            cache_path = get_sklearn_cache_path(hp)
            with open (cache_path, 'rb') as fp:
                D = pickle.load(fp)
        else:
            D = dataset.copy()

        X = D[:,:D.shape[1]-1]
        Y = D[:,D.shape[1]-1:].flatten()

        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, shuffle=False)

        #Define the classifier
        clf = MultinomialNB(alpha=hp['alpha'], class_prior=None, fit_prior=False)
        
        #Compute the cross-validation scores
        scores = cross_val_score(clf, X_train, Y_train, cv=cross_sets)

        hp['score'] = scores.mean()
        
        return hp
    return score_classifier
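
The scoring step can be tried in isolation on synthetic data. This is a sketch with made-up count features generated from a seeded random state, not the email dataset, so the printed accuracy says nothing about the real problem.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Synthetic count features (an assumption for illustration):
# class 1 favours the first two columns, class 0 the last two
rng = np.random.RandomState(0)
X_spam = rng.poisson(lam=[5, 5, 1, 1], size=(20, 4))
X_ham = rng.poisson(lam=[1, 1, 5, 5], size=(20, 4))
X = np.vstack((X_spam, X_ham))
Y = np.array([1] * 20 + [0] * 20)

clf = MultinomialNB(alpha=1.0, class_prior=None, fit_prior=False)

# With an integer cv and a classifier, scikit-learn uses stratified k-fold,
# so each of the 5 folds keeps the class balance
scores = cross_val_score(clf, X, Y, cv=5)

print(scores.shape)
print(scores.mean())
```

`cross_val_score` returns one accuracy per fold; the notebook keeps only their mean as the score of a combination.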

We evaluate the classifier's score for each combination of hyperparameters.

In [0]:
print('Evaluating hyperparameters...')
to = time.time()
    
score_classifier = score_callback(dataset)
#A Dask frame is built here, but the scores below are computed with a plain pandas apply
dhyperparameters = dd.from_pandas(hyperparameters.copy(), npartitions=os.cpu_count())
scores           = hyperparameters.apply(score_classifier, axis=1)

tf = time.time()
print('finished in', (int(tf-to)), 'seconds.')

Evaluating hyperparameters...
{'isalpha': True, 'alpha': 0.01, 'min_df': 0.01, 'max_df': 0.5, 'tf_idf': True}
{'isalpha': True, 'alpha': 0.01, 'min_df': 0.01, 'max_df': 0.5, 'tf_idf': True}
{'isalpha': True, 'alpha': 0.1, 'min_df': 0.01, 'max_df': 0.5, 'tf_idf': True}
{'isalpha': True, 'alpha': 1.0, 'min_df': 0.01, 'max_df': 0.5, 'tf_idf': True}
{'isalpha': True, 'alpha': 10.0, 'min_df': 0.01, 'max_df': 0.5, 'tf_idf': True}
{'isalpha': True, 'alpha': 0.01, 'min_df': 0.01, 'max_df': 0.75, 'tf_idf': True}
{'isalpha': True, 'alpha': 0.1, 'min_df': 0.01, 'max_df': 0.75, 'tf_idf': True}
{'isalpha': True, 'alpha': 1.0, 'min_df': 0.01, 'max_df': 0.75, 'tf_idf': True}
{'isalpha': True, 'alpha': 10.0, 'min_df': 0.01, 'max_df': 0.75, 'tf_idf': True}
{'isalpha': True, 'alpha': 0.01, 'min_df': 0.01, 'max_df': 0.99, 'tf_idf': True}
{'isalpha': True, 'alpha': 0.1, 'min_df': 0.01, 'max_df': 0.99, 'tf_idf': True}
{'isalpha': True, 'alpha': 1.0, 'min_df': 0.01, 'max_df': 0.99, 'tf_idf': True}
{'isalpha

Let's look at the score for each combination.

In [0]:
print(scores)

     isalpha  alpha  min_df  max_df  tf_idf     score
0       True   0.01    0.01    0.50    True  0.980794
1       True   0.10    0.01    0.50    True  0.978801
2       True   1.00    0.01    0.50    True  0.963333
3       True  10.00    0.01    0.50    True  0.967075
4       True   0.01    0.01    0.75    True  0.980794
5       True   0.10    0.01    0.75    True  0.978801
6       True   1.00    0.01    0.75    True  0.963333
7       True  10.00    0.01    0.75    True  0.967075
8       True   0.01    0.01    0.99    True  0.980794
9       True   0.10    0.01    0.99    True  0.978801
10      True   1.00    0.01    0.99    True  0.963333
11      True  10.00    0.01    0.99    True  0.967075
12      True   0.01    0.05    0.50    True  0.961093
13      True   0.10    0.05    0.50    True  0.958350
14      True   1.00    0.05    0.50    True  0.949868
15      True  10.00    0.05    0.50    True  0.921181
16      True   0.01    0.05    0.75    True  0.961093
17      True   0.10    0.05 

### Classifier: training

In [0]:
print('Training model with best hyperparameters...')

#Keep the best combination of hyperparameters.
best_hp = scores.loc[scores['score'].idxmax()].drop(['score'])
print(best_hp.to_dict())

if caching is True:
    cache_path = get_sklearn_cache_path(best_hp)
    with open (cache_path, 'rb') as fp:
        D = pickle.load(fp)
else:
    D = dataset.copy()

X = D[:,:D.shape[1]-1]
Y = D[:,D.shape[1]-1:].flatten()

#Split the dataset into train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, shuffle=False)

#Create the classifier with the best hyperparameters
clf = MultinomialNB(alpha=best_hp['alpha'], class_prior=None, fit_prior=False)

#Train the model
clf.fit(X_train, Y_train)

Training model with best hyperparameters...
{'isalpha': True, 'alpha': 0.01, 'min_df': 0.01, 'max_df': 0.5, 'tf_idf': True}


MultinomialNB(alpha=0.01, class_prior=None, fit_prior=False)
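
Several combinations tie for the best score in the table printed earlier (rows 0, 4 and 8 all show 0.980794), and `idxmax` returns the first of them. The sketch below reproduces that selection on a few rows copied from the table; `tied` is a name added for this example.

```python
import pandas as pd

# A few rows copied from the score table printed earlier
scores = pd.DataFrame({
    'alpha':  [0.01, 0.10, 1.00, 10.00, 0.01],
    'max_df': [0.50, 0.50, 0.50, 0.50, 0.75],
    'score':  [0.980794, 0.978801, 0.963333, 0.967075, 0.980794],
})

# idxmax returns the label of the first row that reaches the maximum
best_index = scores['score'].idxmax()
best_hp = scores.loc[best_index].drop(['score'])

# All rows tied for the best score
tied = scores[scores['score'] == scores['score'].max()]

print(best_index)
print(best_hp.to_dict())
print(len(tied))
```

Inspecting the tied rows before picking one can be useful when the tie-break should follow some other criterion, for example a smaller vocabulary.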

### Classifier: performance
Now we look at the model's final performance on the test set.

In [0]:
print('Evaluating best model...')
    
if caching is True:
    cache_path = get_sklearn_cache_path(best_hp)
    with open (cache_path, 'rb') as fp:
        D = pickle.load(fp)
else:
    D = dataset.copy()

X = D[:,:D.shape[1]-1]
Y = D[:,D.shape[1]-1:].flatten()

#Split the set into train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, shuffle=False)
    
#Compute the model's final score on the test set
score = clf.score(X_test, Y_test)
print("accuracy: {:.4}%".format(score*100))

Evaluating best model...
accuracy: 97.85%
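
The test set used here matches the rows held out during cross-validation because the data is shuffled once with a fixed seed before being cached, and every later split uses `shuffle=False`. The following sketch illustrates that on toy data; the array contents and the `split` helper are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 features, label derived from the first feature
X = np.arange(20).reshape(10, 2)
Y = (X[:, :1] // 2) % 2          # column vector of labels, shape (10, 1)
D = np.hstack((X, Y))

# Seeded in-place shuffle of whole rows: features and label move together
np.random.seed(seed=0)
np.random.shuffle(D)

def split(D, test_size=0.3):
    X = D[:, :-1]
    Y = D[:, -1]
    # shuffle=False keeps the stored order: first rows train, last rows test
    return train_test_split(X, Y, test_size=test_size, shuffle=False)

X_train, X_test, Y_train, Y_test = split(D)
X_train_2, X_test_2, _, _ = split(D)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test_2))
```

Because the split is positional, any two calls on the same shuffled array return the same partition, which is what lets the training and evaluation cells repeat the split independently.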
