# Redes Neuronales - Trabajo Práctico N° 1 - Notebook #2
En esta segunda notebook, se busca definir cuál métrica es más apropiada para analizar la performance del modelo y qué hiper parámetros se van a utilizar para el ajuste del modelo acorde a la validación. Finalmente, estas decisiones se vuelcan en la selección del mejor modelo para el problema de la clasificación de correos electrónicos asociados grupos de noticias.

### Consideraciones para ejecutar notebook
* La sección de preprocesamiento previo dentro de la definición de hiper parámetros sólo debe ejecutarse una vez, si ya se poseen los datasets preprocesados, no es necesario.

### Integrantes del grupo
* Kammann, Lucas Agustín
* Gaytan, Joaquín Oscar

# 1. Métrica
La métrica a utilizar para cuantificar la performance de los modelos, seleccionar los hiperparámetros y validarlos será la **exactitud** o **accuracy**.

## 1.1 Justificación
La problemática a resolver tiene por objetivo asegurar clasificar entre múltiples clases, por ende, el objetivo es acertar la mayor cantidad de predicciones posibles. Para este tipo de problemas, conviene usar la exactitud, pero antes es necesario comprobar que la distribución de clases está balanceada o es uniforme, dado que si no fuera así (o aproximadamente así) entonces habría un sesgo en la estimación. Esto último se debe a que no seríamos capaces de cuantificar realmente lo malo que el modelo es prediciendo aquellas clases minoritarias.

Es decir, si bien es una métrica acorde al problema, cuando las clases no están balanceadas su interpretación numérica no es realista. En esos casos, se puede utilizar el promedio de la sensibilidad de cada clase, porque dicha sensibilidad representa la probabilidad de acertar en la predicción dada cada clase y si luego las promediamos estamos ponderando de igual forma cada clase.

En conclusión, dado lo que se observó en previos análisis **(ver Notebook #1)**, se puede asumir que la distribución de clases es aproximadamente uniforme por lo cual la exactitud es una métrica aceptable. Si se quisiera obtener una cantidad más realista, con el promedio de sensibilidad se cumpliría tal objetivo.

# 2. Hiper parámetros
Se consideran hiper parámetros aquellos que se determinan de manera óptima eligiendo aquel que da mejor resultado en el conjunto de datos de validación, donde muchos modelos con diversos tipos y valores de hiper parámetros compiten por ver cuál obtuvo la mejor medida de performance, es decir, de la métrica. Para ello, consideraremos como hiper parámetros los siguientes aspectos,

* Algoritmo de preprocesamiento empleado: Stemming, Lemmatization, Ninguno
* Filtrado por stop words
* Vectorizer empleado: CountVectorizer, TfIdfVectorizer
* Modelo empleado: MultinomialNaiveBayes, OneVsRestClassifier
* Coeficiente de Laplacian Smoothing
* Coeficiente de mínima frecuencia por documentos
* Coeficiente de máxima frecuencia por documentos

**NOTA**, sólo correr las siguientes celdas de preprocesamiento, si no se poseen los '*.txt' preprocesados.

## 2.1. Preprocesamiento previo
A continuación, se corren los algoritmos de procesamiento directamente sobre el conjunto de entrenamiento para evitar tener que realizar el preprocesamiento cada vez teniendo en cuenta que no existen más opciones a analizar. Es decir, se dejan preparados los procesamientos de texto más costosos computacionalmente, para que luego el proceso de entrenamiento y validación no requiera realizarlo.

### 2.1.1. Descargando los datasets

In [2]:
import numpy as np
import pickle

from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
import nltk

In [10]:
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to C:\Users\Lucas A.
[nltk_data]     Kammann\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Lucas A.
[nltk_data]     Kammann\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [58]:
from sklearn.datasets import fetch_20newsgroups

# Loading datasets
train = fetch_20newsgroups(
    subset='train', 
    shuffle=True, 
    remove=('headers', 'footers')
)

test = fetch_20newsgroups(
    subset='test', 
    shuffle=True, 
    remove=('headers', 'footers')
)

# Casting
train_raw_input = np.array(train.data)
train_output = np.array(train.target)
train_size = len(train_raw_input)

test_raw_input = np.array(test.data)
test_output = np.array(test.target)
test_size = len(test_raw_input)

# Logging useful information
print(f'Dataset Train: {train_size} elements')
print(f'Dataset Test: {test_size} elements')

Dataset Train: 11314 elements
Dataset Train: 7532 elements


### 2.1.2. Guardando los datasets originales

In [59]:
%%time

# Create the structure of the trainning data set for the non-processed case
normal_train = {
    'input': train_raw_input,
    'output': train_output
}

# Save with pickle
with open('tp1_ej1_train_normal.txt', 'wb') as file:
    pickle.dump(normal_train, file)

# Logging
print('The normal trainning dataset has been saved in the local storage system.')

The normal trainning dataset has been saved in the local storage system.
Wall time: 1min 27s


In [None]:
%%time

# Create the structure of the trainning data set for the non-processed case
normal_test = {
    'input': test_raw_input,
    'output': test_output
}

# Save with pickle
with open('tp1_ej1_test_normal.txt', 'wb') as file:
    pickle.dump(normal_test, file)

# Logging
print('The normal trainning dataset has been saved in the local storage system.')

### 2.1.3. Guardando los datasets con stemming

In [11]:
# Instantiate the stemmer instance
stemmer = PorterStemmer()

In [19]:
%%time

# Process and save the stemmed trainning data
stemmed_train_raw_input = []
for document in train_raw_input:
    tokens = word_tokenize(document)
    new_document = " ".join([stemmer.stem(token.lower()) for token in tokens if token.isalpha()])
    stemmed_train_raw_input.append(new_document)

# Logging
print('Stemming pre processing of the tran dataset finished.')

Stemming pre processing of the tran dataset finished.
Wall time: 1min 32s


In [20]:
%%time

# Create the structure of the trainning data set for the stemming case
stemmed_train = {
    'input': stemmed_train_raw_input,
    'output': train_output
}

# Save with pickle
with open('tp1_ej1_train_stemmed.txt', 'wb') as file:
    pickle.dump(stemmed_train, file)

# Logging
print('The stemming trainning dataset has been saved in the local storage system.')

The stemming trainning dataset has been saved in the local storage system.
Wall time: 17.2 ms


In [21]:
%%time

# Process and save the stemmed trainning data
stemmed_test_raw_input = []
for document in train_raw_input:
    tokens = word_tokenize(document)
    new_document = " ".join([stemmer.stem(token.lower()) for token in tokens if token.isalpha()])
    stemmed_test_raw_input.append(new_document)

# Logging
print('Stemming pre processing of the test dataset finished.')

Stemming pre processing of the test dataset finished.
Wall time: 1min 36s


In [22]:
%%time

# Create the structure of the trainning data set for the stemming case
stemmed_test = {
    'input': stemmed_test_raw_input,
    'output': test_output
}

# Save with pickle
with open('tp1_ej1_test_stemmed.txt', 'wb') as file:
    pickle.dump(stemmed_test, file)

# Logging
print('The stemming trainning dataset has been saved in the local storage system.')

The stemming trainning dataset has been saved in the local storage system.
Wall time: 23.3 ms


### 2.1.4. Guardando los datasets con lemmatization

In [23]:
# Instantiate the lemmatizer instance
lemmatizer = WordNetLemmatizer()

In [24]:
%%time

# Process and save the lemmatized trainning data
lemmatized_train_raw_input = []
for document in train_raw_input:
    tokens = word_tokenize(document)
    new_document = " ".join([lemmatizer.lemmatize(token.lower()) for token in tokens if token.isalpha()])
    lemmatized_train_raw_input.append(new_document)

# Logging
print('Lemmatization pre processing of the train dataset finished.')

Lemmatization pre processing of the train dataset finished.
Wall time: 46.6 s


In [25]:
%%time

# Create the structure of the trainning data set for the lemmatization case
lemmatization_train = {
    'input': lemmatized_train_raw_input,
    'output': train_output
}

# Save with pickle
with open('tp1_ej1_train_lemmatization.txt', 'wb') as file:
    pickle.dump(lemmatization_train, file)

# Logging
print('The lemmatization trainning dataset has been saved in the local storage system.')

The lemmatization trainning dataset has been saved in the local storage system.
Wall time: 25.1 ms


In [26]:
%%time

# Process and save the lemmatized trainning data
lemmatized_test_raw_input = []
for document in train_raw_input:
    tokens = word_tokenize(document)
    new_document = " ".join([lemmatizer.lemmatize(token.lower()) for token in tokens if token.isalpha()])
    lemmatized_test_raw_input.append(new_document)

# Logging
print('Lemmatization pre processing of the test dataset finished.')

Lemmatization pre processing of the test dataset finished.
Wall time: 46.5 s


In [28]:
%%time

# Create the structure of the trainning data set for the lemmatization case
lemmatization_test = {
    'input': lemmatized_test_raw_input,
    'output': test_output
}

# Save with pickle
with open('tp1_ej1_test_lemmatization.txt', 'wb') as file:
    pickle.dump(lemmatization_test, file)

# Logging
print('The lemmatization trainning dataset has been saved in the local storage system.')

The lemmatization trainning dataset has been saved in the local storage system.
Wall time: 20 ms


# 3. Preparación de datasets
En este paso, previo al entrenamiento, selección y validación de los modelos. Es necesario, primero, cargar todos los datasets y, luego, separar entre tres subconjuntos definidos como **train**, **valid** y **test**.

## 3.1. Cargando datasets

In [3]:
# For each of the preprocessed datasets (test and train), load all existing
# variants, normal/original, lemmatized and stemmed. The underscore train variables
# is because it's not exactly the train dataset because it will be splitted into
# the actual training set and the validation set.

with open('tp1_ej1_train_normal.txt', 'rb') as file:
    _train_normal = pickle.load(file)
    
with open('tp1_ej1_train_stemmed.txt', 'rb') as file:
    _train_stemmed = pickle.load(file)
    
with open('tp1_ej1_train_lemmatization.txt', 'rb') as file:
    _train_lemmatization = pickle.load(file)

with open('tp1_ej1_test_normal.txt', 'rb') as file:
    test_normal = pickle.load(file)
    
with open('tp1_ej1_test_stemmed.txt', 'rb') as file:
    test_stemmed = pickle.load(file)
    
with open('tp1_ej1_test_lemmatization.txt', 'rb') as file:
    test_lemmatization = pickle.load(file)

## 3.2. Separando datasets

In [4]:
from sklearn.model_selection import train_test_split

# Splitting into train and validation
train_normal_input, valid_normal_input, train_normal_output, valid_normal_output = \
    train_test_split(_train_normal['input'], _train_normal['output'], test_size=0.2, random_state=14)

train_stemmed_input, valid_stemmed_input, train_stemmed_output, valid_stemmed_output = \
    train_test_split(_train_stemmed['input'], _train_stemmed['output'], test_size=0.2, random_state=14)

train_lemmatization_input, valid_lemmatization_input, train_lemmatization_output, valid_lemmatization_output = \
    train_test_split(_train_lemmatization['input'], _train_lemmatization['output'], test_size=0.2, random_state=14)

# Better formatting
_train = {
    'normal': _train_normal,
    'stemmed': _train_stemmed,
    'lemmatization': _train_lemmatization
}

train = {
    'normal': { 'input': train_normal_input, 'output': train_normal_output },
    'stemmed': { 'input': train_stemmed_input, 'output': train_stemmed_output },
    'lemmatization': { 'input': train_lemmatization_input, 'output': train_lemmatization_output}
}

valid = {
    'normal': { 'input': valid_normal_input, 'output': valid_normal_output },
    'stemmed': { 'input': valid_stemmed_input, 'output': valid_stemmed_output },
    'lemmatization': { 'input': valid_lemmatization_input, 'output': valid_lemmatization_output }
}
        
test = {
    'normal': test_normal,
    'stemmed': test_stemmed,
    'lemmatization': test_lemmatization
}

# 4. Entrenamiento y selección de hiper parámetros

In [22]:
%%time

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from src.multinomial_naive_bayes import MultinomialNaiveBayes
from sklearn.metrics import accuracy_score

# Creating two lists, saving the hiperparameters and the score for the model
model_params = []
model_score = []

for use_algorithm in ['normal', 'stemmed', 'lemmatization']:
    for use_vectorizer in ['TfidfVectorizer', 'CountVectorizer']:
        for use_smooth_idf in ([False, True] if use_vectorizer == 'TfidfVectorizer' else [None]):
            for use_sublinear_tf in ([False, True] if use_vectorizer == 'TfidfVectorizer' else [None]):
                for use_stop_words in [None, 'english']:
                    for use_min_df in [1, 2, 0.0001, 0.001, 0.01, 0.1]:
                        for use_max_df in [0.15, 0.2, 0.3, 0.4, 0.5, 1.0]:
                            # Creating the vectorizer
                            if use_vectorizer == 'CountVectorizer':
                                vectorizer = CountVectorizer(
                                    stop_words=use_stop_words, 
                                    min_df=use_min_df, 
                                    max_df=use_max_df
                                )
                            elif use_vectorizer == 'TfidfVectorizer':
                                vectorizer = TfidfVectorizer(
                                    stop_words=use_stop_words, 
                                    min_df=use_min_df,
                                    max_df=use_max_df,
                                    smooth_idf=use_smooth_idf,
                                    sublinear_tf=use_sublinear_tf
                                )

                            # Processing both the training and the validation datasets
                            x_train = vectorizer.fit_transform(train[use_algorithm]['input'])
                            y_train = train[use_algorithm]['output']
                            x_valid = vectorizer.transform(valid[use_algorithm]['input'])
                            y_valid = valid[use_algorithm]['output']

                            # Run training and validation routines
                            for use_alpha in [0, 0.001, 0.0025, 0.005, 0.0075, 0.01, 0.0125, 0.015, 0.0175, 0.1, 1]:
                                # Creating and training the model
                                classifier = MultinomialNaiveBayes(alpha=use_alpha)
                                classifier.fit(x_train, y_train)

                                # Prediciting and measuring the performace
                                y_pred = classifier.predict(x_valid)
                                score = accuracy_score(y_valid, y_pred)
                                params = {
                                    'use_algorithm': use_algorithm,
                                    'use_vectorizer': use_vectorizer,
                                    'use_stop_words': use_stop_words,
                                    'use_alpha': use_alpha,
                                    'use_min_df': use_min_df,
                                    'use_max_df': use_max_df,
                                    'use_smooth_idf': use_smooth_idf,
                                    'use_sublinear_tf': use_sublinear_tf
                                }

                                # Saving the parameters
                                model_params.append(params)
                                model_score.append(score)

Wall time: 1h 27min 45s


# 5. Mejor modelo y entrenamiento completo

In [23]:
# Search the hiper parameters of the best scored model
selected_model_index = np.argmax(model_score)
selected_model_score = model_score[selected_model_index]
selected_model_params = model_params[selected_model_index]

In [24]:
import pprint
pprint.pprint(selected_model_params)
pprint.pprint(selected_model_score)

{'use_algorithm': 'normal',
 'use_alpha': 0.0075,
 'use_max_df': 0.15,
 'use_min_df': 1,
 'use_smooth_idf': False,
 'use_stop_words': None,
 'use_sublinear_tf': False,
 'use_vectorizer': 'TfidfVectorizer'}
0.8855501546619532


In [25]:
# Creating the model
classifier = MultinomialNaiveBayes(alpha=selected_model_params['use_alpha'])

# Creating the vectorizer
if selected_model_params['use_vectorizer'] == 'CountVectorizer':
    vectorizer = CountVectorizer(
        stop_words=selected_model_params['use_stop_words'], 
        min_df=selected_model_params['use_min_df'], 
        max_df=selected_model_params['use_max_df']
    )
elif selected_model_params['use_vectorizer'] == 'TfidfVectorizer':
    vectorizer = TfidfVectorizer(
        stop_words=selected_model_params['use_stop_words'], 
        min_df=selected_model_params['use_min_df'], 
        max_df=selected_model_params['use_max_df']
    )

# Processing the training dataset
x_train = vectorizer.fit_transform(_train[selected_model_params['use_algorithm']]['input'])
y_train = _train[selected_model_params['use_algorithm']]['output']

# Training the model
classifier.fit(x_train, y_train)

# 6. Validación y performance

In [26]:
# Processing the training dataset
x_test = vectorizer.transform(test[selected_model_params['use_algorithm']]['input'])
y_test = test[selected_model_params['use_algorithm']]['output']

# Prediciting with the model
y_pred = classifier.predict(x_test)

# Measuring the score
score = accuracy_score(y_test, y_pred)
print(score)

0.8016463090812533
