<a href="https://colab.research.google.com/github/PedroJuiz/SupervisedLearning/blob/main/Metaclassifiers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install the pinned scikit-learn version (comment out this line if it is already installed)
!pip install -q scikit-learn==0.24.2

<center>
<img src='mioti_logo.jpeg' width='200px'>
<p style="font-size: 18px"><b>Machine learning 3</b><br/>Diego García Morate - diegogm@faculty.mioti.es</p>
</center>
<br/>

# Challenge S3: Playing with metaclassifiers

# Objectives

The goal of this challenge is to train several metaclassifiers that try to predict students' academic performance from variables describing their environment.

## Prerequisites

To run this worksheet you need the following packages installed:

* `numpy`
* `pandas`
* `scikit-learn`
* `scipy`

## Environment setup

In [None]:
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from math import sqrt, exp, pi
import sklearn

import random

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

from sklearn import svm
from sklearn.neural_network import MLPClassifier

from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier

# Dataset

In this challenge we work with a dataset published in the UCI repository called `Student Performance`. It relates the academic performance of Portuguese secondary-school students to a set of environment variables. You can find it at: https://archive.ics.uci.edu/ml/datasets/Student+Performance


The dataset description is shown below:


<p class="small-heading"><b>Data Set Information:</b></p>
<p class="normal">This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features, and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).</p>

<br />

<p class="small-heading"><b>Attribute Information:</b></p>
<p class="normal"># Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
<br>1 school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
<br>2 sex - student's sex (binary: 'F' - female or 'M' - male)
<br>3 age - student's age (numeric: from 15 to 22)
<br>4 address - student's home address type (binary: 'U' - urban or 'R' - rural)
<br>5 famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
<br>6 Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
<br>7 Medu - mother's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
<br>8 Fedu - father's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
<br>9 Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
<br>10 Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
<br>11 reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
<br>12 guardian - student's guardian (nominal: 'mother', 'father' or 'other')
<br>13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
<br>14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
<br>15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
<br>16 schoolsup - extra educational support (binary: yes or no)
<br>17 famsup - family educational support (binary: yes or no)
<br>18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
<br>19 activities - extra-curricular activities (binary: yes or no)
<br>20 nursery - attended nursery school (binary: yes or no)
<br>21 higher - wants to take higher education (binary: yes or no)
<br>22 internet - Internet access at home (binary: yes or no)
<br>23 romantic - with a romantic relationship (binary: yes or no)
<br>24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
<br>25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
<br>26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
<br>27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
<br>28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
<br>29 health - current health status (numeric: from 1 - very bad to 5 - very good)
<br>30 absences - number of school absences (numeric: from 0 to 93)
<br>
<br># these grades are related with the course subject, Math or Portuguese:
<br>31 G1 - first period grade (numeric: from 0 to 20)
<br>32 G2 - second period grade (numeric: from 0 to 20)
<br>33 G3 - final grade (numeric: from 0 to 20, output target)</p>
    

# Preprocessing

As always, we start by carrying out all the necessary preprocessing.

First we load the dataset and handle all the non-numeric columns:

In [None]:
df = pd.read_csv('student-por.csv', sep=';')

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 649 entries, 0 to 648
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      649 non-null    object
 1   sex         649 non-null    object
 2   age         649 non-null    int64 
 3   address     649 non-null    object
 4   famsize     649 non-null    object
 5   Pstatus     649 non-null    object
 6   Medu        649 non-null    int64 
 7   Fedu        649 non-null    int64 
 8   Mjob        649 non-null    object
 9   Fjob        649 non-null    object
 10  reason      649 non-null    object
 11  guardian    649 non-null    object
 12  traveltime  649 non-null    int64 
 13  studytime   649 non-null    int64 
 14  failures    649 non-null    int64 
 15  schoolsup   649 non-null    object
 16  famsup      649 non-null    object
 17  paid        649 non-null    object
 18  activities  649 non-null    object
 19  nursery     649 non-null    object
 20  higher    

In [None]:
df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,4,0,11,11
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,2,9,11,11
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,6,12,13,12
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,0,14,14,14
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,0,11,13,13


As we can see, many of the columns hold enumerated (categorical) values. To train models on this kind of data we generally need to convert those values to numbers, so we build a helper function for the task:

In [None]:
def preprocess_column(df, column_name):
    if 'int' in str(df[column_name].dtype):
        return df

    values = df[column_name].unique()
    if len(values) == 2:
        # 2 values: unambiguous binary mapping
        df[column_name] = df[column_name].map({values[0]: 0, values[1]: 1})
    elif len(values) == 3:
        # 3 values: dummy (one-hot) variables
        df = pd.get_dummies(df, columns=[column_name])
    else:
        # 4 or more values: integer mapping in order of first appearance
        mapping = {value: index for index, value in enumerate(values)}
        df[column_name] = df[column_name].map(mapping)

    return df

In [None]:
for column in df.columns:
    df = preprocess_column(df, column)
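
To make the three encoding branches concrete, the sketch below applies the same rule to a tiny hand-made dataframe. The helper `encode_column` and the `toy` data are hypothetical and exist only for this illustration; they restate the rule without modifying the input in place.

```python
import pandas as pd

def encode_column(df, column_name):
    """Hypothetical restatement of the three-way encoding rule, for illustration."""
    if 'int' in str(df[column_name].dtype):
        return df  # already numeric: leave untouched
    values = df[column_name].unique()
    if len(values) == 2:
        # Binary mapping: first value seen -> 0, second -> 1
        return df.assign(**{column_name: df[column_name].map({values[0]: 0, values[1]: 1})})
    if len(values) == 3:
        # One dummy column per category
        return pd.get_dummies(df, columns=[column_name])
    # 4 or more categories: integer codes in order of first appearance
    mapping = {value: index for index, value in enumerate(values)}
    return df.assign(**{column_name: df[column_name].map(mapping)})

# Made-up rows that exercise every branch
toy = pd.DataFrame({
    'sex': ['F', 'M', 'F', 'M'],
    'guardian': ['mother', 'father', 'other', 'mother'],
    'Mjob': ['at_home', 'health', 'other', 'teacher'],
    'age': [18, 17, 15, 16],
})
for column in list(toy.columns):
    toy = encode_column(toy, column)

print(toy.columns.tolist())
print(toy[['sex', 'Mjob', 'age']])
```

Note that the integer codes assigned to columns with four or more categories impose an arbitrary order on values such as `Mjob`, which distance-based models like SVMs may interpret as meaningful.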

In [None]:
df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,Dalc,Walc,health,absences,G1,G2,G3,guardian_father,guardian_mother,guardian_other
0,0,0,18,0,0,0,4,4,0,0,...,1,1,3,4,0,11,11,0,1,0
1,0,0,17,0,0,1,1,1,0,1,...,1,1,3,2,9,11,11,1,0,0
2,0,0,15,0,1,1,1,1,0,1,...,2,3,3,6,12,13,12,0,1,0
3,0,0,15,0,0,1,4,2,1,2,...,1,1,5,0,14,14,14,0,1,0
4,0,0,16,0,0,1,3,3,2,1,...,1,2,5,0,11,13,13,1,0,0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 649 entries, 0 to 648
Data columns (total 35 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   school           649 non-null    int64
 1   sex              649 non-null    int64
 2   age              649 non-null    int64
 3   address          649 non-null    int64
 4   famsize          649 non-null    int64
 5   Pstatus          649 non-null    int64
 6   Medu             649 non-null    int64
 7   Fedu             649 non-null    int64
 8   Mjob             649 non-null    int64
 9   Fjob             649 non-null    int64
 10  reason           649 non-null    int64
 11  traveltime       649 non-null    int64
 12  studytime        649 non-null    int64
 13  failures         649 non-null    int64
 14  schoolsup        649 non-null    int64
 15  famsup           649 non-null    int64
 16  paid             649 non-null    int64
 17  activities       649 non-null    int64
 18  nursery   

## Matriz de características (X)

We drop the grade columns (`G1`, `G2` and `G3`) from the dataframe to build the feature matrix:

In [None]:
X = np.array(df[df.columns.difference(['G1', 'G2', 'G3'])])

In [None]:
X.shape

(649, 32)

In [None]:
X

array([[1, 4, 0, ..., 0, 2, 2],
       [1, 1, 1, ..., 0, 2, 1],
       [2, 1, 1, ..., 0, 2, 1],
       ...,
       [1, 1, 1, ..., 0, 2, 2],
       [3, 1, 2, ..., 1, 1, 2],
       [3, 2, 1, ..., 1, 1, 3]], dtype=int64)

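Finally, we normalize the data. Note that `normalize` rescales each row (sample) to unit L2 norm; it does not scale each feature column.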

In [None]:
from sklearn.preprocessing import normalize

X = normalize(X)

In [None]:
X

array([[0.04805693, 0.19222773, 0.        , ..., 0.        , 0.09611387,
        0.09611387],
       [0.05263158, 0.05263158, 0.05263158, ..., 0.        , 0.10526316,
        0.05263158],
       [0.11094004, 0.05547002, 0.05547002, ..., 0.        , 0.11094004,
        0.05547002],
       ...,
       [0.04938648, 0.04938648, 0.04938648, ..., 0.        , 0.09877296,
        0.09877296],
       [0.14383899, 0.04794633, 0.09589266, ..., 0.04794633, 0.04794633,
        0.09589266],
       [0.13912167, 0.09274778, 0.04637389, ..., 0.04637389, 0.04637389,
        0.13912167]])
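
As a small illustration of what this step does, the sketch below contrasts row-wise `normalize` with column-wise `StandardScaler` on a made-up matrix. `X_demo` is an assumption for the example and `StandardScaler` is not used in this notebook; it appears only as a point of comparison.

```python
import numpy as np
from sklearn.preprocessing import normalize, StandardScaler

# Made-up data: the second row is exactly twice the first
X_demo = np.array([[3.0, 4.0],
                   [6.0, 8.0],
                   [1.0, 0.0]])

# normalize: each ROW is divided by its own L2 norm
row_normalized = normalize(X_demo)
row_norms = np.linalg.norm(row_normalized, axis=1)

# StandardScaler: each COLUMN is shifted to zero mean and scaled to unit variance
column_scaled = StandardScaler().fit_transform(X_demo)

print(row_normalized)
print(row_norms)
print(column_scaled.mean(axis=0))
```

After row normalization the first two samples become identical, because only the direction of each row survives. Whether that is desirable for this dataset is a modelling choice; per-feature scaling is a common alternative.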

## Variable objetivo (y)

We use the first-period grade, `G1`, as the target variable, discretized into three equal-width bins labeled 0, 1 and 2.

In [None]:
y, bins = pd.cut(df.G1, 3, labels=[0, 1, 2], retbins=True)

In [None]:
y

0      0
1      1
2      1
3      2
4      1
      ..
644    1
645    2
646    1
647    1
648    1
Name: G1, Length: 649, dtype: category
Categories (3, int64): [0 < 1 < 2]

In [None]:
bins

array([-0.019     ,  6.33333333, 12.66666667, 19.        ])
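
The bin edges above follow from how `pd.cut` splits the observed range into equal-width intervals. The sketch below reproduces that behaviour on a few made-up grades spanning 0 to 19; the `grades` series is an assumption for illustration, not data from the file.

```python
import pandas as pd

# Made-up grades covering the same 0..19 range as G1
grades = pd.Series([0, 5, 6, 7, 12, 13, 19])

# Three equal-width, right-closed bins over [min, max]
labels, edges = pd.cut(grades, 3, labels=[0, 1, 2], retbins=True)

print(edges)
print(labels.tolist())
```

The lowest edge is pushed slightly below the minimum so that the smallest value falls inside the first bin. Because the bins have equal width rather than equal counts, the resulting classes can be quite unbalanced; `y.value_counts()` would show the actual distribution.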

In [None]:
y.shape

(649,)

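Finally, we confirm the shape of the target. Although `train_test_split` is imported, no separate training and test sets are created in this notebook: every model below is evaluated with 5-fold cross-validation on the full dataset.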

In [None]:
y.shape

(649,)
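
For readers who prefer a hold-out set in addition to cross-validation, a stratified split could be sketched as follows. The synthetic `X_demo` and `y_demo` arrays, the 80/20 ratio and the class proportions are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y, with deliberately unbalanced classes
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = rng.choice([0, 1, 2], size=300, p=[0.1, 0.7, 0.2])

# stratify keeps the class proportions similar in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

train_props = np.bincount(y_train, minlength=3) / len(y_train)
test_props = np.bincount(y_test, minlength=3) / len(y_test)
print(X_train.shape, X_test.shape)
print(train_props)
print(test_props)
```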

With this, our preprocessing is complete.

# Simple classifier

Train a simple `svm` classifier with stratified 5-fold cross-validation and evaluate its accuracy by computing the mean and variance of the fold scores.

In [None]:
results = cross_val_score(svm.SVC(random_state=42), X, y, scoring='accuracy', cv=5)
print(f"Avg: {np.mean(results)}")
print(f"Var: {np.var(results)}")

Avg: 0.6363625521765057
Var: 0.00034036046280502645


In [None]:
results

array([0.61538462, 0.62307692, 0.66923077, 0.63846154, 0.63565891])

The SVM classifier achieved a mean accuracy of 63.64% across the 5 cross-validation folds, meaning it classified 63.64% of the instances correctly on average.

The variance of the accuracy scores is low (approximately 0.00034), which suggests that the model's performance is fairly consistent across the folds.
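
The call above passes `cv=5` as an integer. For a classifier, scikit-learn interprets that as a stratified split without shuffling, so spelling the splitter out should give the same fold scores. The sketch below checks this on synthetic data; `make_classification` and its parameters are assumptions for the example.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 3-class problem standing in for the student data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42,
)

# Integer cv: stratified folds are chosen implicitly for classifiers
implicit = cross_val_score(svm.SVC(random_state=42), X_demo, y_demo,
                           scoring='accuracy', cv=5)

# The same splitter written out explicitly
explicit = cross_val_score(svm.SVC(random_state=42), X_demo, y_demo,
                           scoring='accuracy', cv=StratifiedKFold(n_splits=5))

print(np.allclose(implicit, explicit))
print(f"Avg: {np.mean(explicit):.4f}  Std: {np.std(explicit):.4f}")
```

Reporting the standard deviation alongside the variance can be easier to read, since it is expressed in the same units as the accuracy.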

# Voting

Build a metaclassifier using `voting` with an SVM and a neural network (`MLPClassifier`), and evaluate it in the same terms as before. Are the results better or worse? Is the result consistent?

In [None]:
classifiers = [
    ('svm', svm.SVC(random_state=42)),
    ('nn', MLPClassifier(random_state=42))
]

metaclassifier = VotingClassifier(estimators = classifiers, voting = 'hard')

In [None]:
results = cross_val_score(metaclassifier, X, y, scoring='accuracy', cv=5)

In [None]:
print(f"Avg: {np.mean(results)}")
print(f"Var: {np.var(results)}")

Avg: 0.6409779367918903
Var: 0.00023192506993314032


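Based on the two metrics being evaluated, the voting metaclassifier performs slightly better than the simple classifier used previously: mean accuracy rises from 63.64% to 64.10%, and the variance falls from about 0.00034 to 0.00023. The gain is small, under half a percentage point.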
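
One point worth keeping in mind when judging whether the result is consistent: with only two voters in `hard` mode, every disagreement is a tie. scikit-learn's documentation states that ties are resolved in favour of the class that comes first in ascending order. The helper `hard_vote` below is a hypothetical sketch of that rule, not the library's own code.

```python
import numpy as np

def hard_vote(predictions):
    """Majority vote over integer class labels; ties go to the smallest label."""
    counts = np.bincount(predictions)  # votes received by each class
    return int(np.argmax(counts))      # argmax returns the first maximum

print(hard_vote([1, 1]))     # both voters agree
print(hard_vote([2, 1]))     # tie between classes 1 and 2
print(hard_vote([0, 2, 2]))  # a third voter breaks the tie
```

Under this rule a two-model hard vote cannot outvote either model on its own merits, which may explain why the improvement over the plain SVM is so small. Adding a third estimator, or using `voting='soft'` with probability outputs, are possible variations.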

# Stacking

Train a stacked model built on neural networks (`MLPClassifier`); train it with 1, 2, 3, 4 and 5 hidden layers and evaluate the results.

In [None]:
def train_stacked_classifier(layers):
    clasificadores = [
        ('nn', MLPClassifier(hidden_layer_sizes=layers, random_state=42, max_iter=1000))
    ]

    metaclasificador_stacked = StackingClassifier(estimators=clasificadores)
    results = cross_val_score(metaclasificador_stacked, X, y, scoring='accuracy', cv=5)

    return np.mean(results), np.var(results)

In [None]:
num_layers = [1, 2, 3, 4, 5]
accuracy_scores = []

for i in num_layers:
    layers = tuple(100 for _ in range(i))
    mean, var = train_stacked_classifier(layers)
    accuracy_scores.append(mean)
    print(f'Accuracy with {i} hidden layer(s): Mean = {mean}, Variance = {var}')

best_index = np.argmax(accuracy_scores)
print(f'\nBest model: {num_layers[best_index]} hidden layer(s) with a mean accuracy of {accuracy_scores[best_index]}')

Accuracy with 1 hidden layer(s): Mean = 0.6271437090041742, Variance = 0.0025750307307573195
Accuracy with 2 hidden layer(s): Mean = 0.6256410256410255, Variance = 0.0028586456278763953
Accuracy with 3 hidden layer(s): Mean = 0.5886821705426357, Variance = 0.000781351541729294
Accuracy with 4 hidden layer(s): Mean = 0.5979129397734049, Variance = 0.00346130200271732
Accuracy with 5 hidden layer(s): Mean = 0.6209898628503279, Variance = 0.002445201823826444

Best model: 1 hidden layer(s) with a mean accuracy of 0.6271437090041742


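In terms of accuracy, the stacked MLP metaclassifier performs worse than the two previous models: its best mean accuracy is 62.71%, obtained with one hidden layer, and adding layers does not help. Its variance is also higher (between about 0.0008 and 0.0035, against 0.00034 and 0.00023 for the previous models), so its results are less consistent across folds. In stacking, the outputs of the base models become the inputs of a final estimator; here there is a single base model, so the final estimator may have little extra information to combine.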
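
To show what the final estimator actually receives, the sketch below stacks two different base models on synthetic data and inspects the meta-features. The data, the small network size and the explicit `LogisticRegression` final estimator are assumptions for illustration; the notebook itself uses a single MLP and leaves the final estimator at its default.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Synthetic 3-class problem standing in for the student data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42,
)

stack = StackingClassifier(
    estimators=[
        ('svm', svm.SVC(random_state=42)),
        ('nn', MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # base-model outputs for training rows come from internal cross-validation
)
stack.fit(X_demo, y_demo)

# Each base model contributes one column per class to the meta-features
meta_features = stack.transform(X_demo)
train_accuracy = stack.score(X_demo, y_demo)
print(meta_features.shape)
print(train_accuracy)
```

Stacking tends to pay off when the base models make different kinds of errors, which is why this sketch mixes an SVM with a neural network rather than stacking one model alone.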

# Bagging


Train an SVM with bagging, varying the number of estimators over 100, 250, 500, 750 and 1000. Comment on the results.

In [None]:
def train_bagging_svm(n_estimators):
    bagging_svm = BaggingClassifier(base_estimator=svm.SVC(), n_estimators=n_estimators, random_state=42)
    results = cross_val_score(bagging_svm, X, y, scoring='accuracy', cv=5)

    return np.mean(results), np.var(results)

In [None]:
num_estimators = [100, 250, 500, 750, 1000]
accuracy_scores = []

for i in num_estimators:
    mean, var = train_bagging_svm(i)
    accuracy_scores.append(mean)
    print(f'Accuracy with {i} estimators: Mean = {mean}, Variance = {var}')

best_index = np.argmax(accuracy_scores)
print(f'\nBest model: {num_estimators[best_index]} estimators with a mean accuracy of {accuracy_scores[best_index]}')

Accuracy with 100 estimators: Mean = 0.6471556350626118, Variance = 0.0002407007146034477
Accuracy with 250 estimators: Mean = 0.6456171735241503, Variance = 0.00015857604142331878
Accuracy with 500 estimators: Mean = 0.6502325581395348, Variance = 0.0002013997651057184
Accuracy with 750 estimators: Mean = 0.6456171735241503, Variance = 0.0002769192366895905
Accuracy with 1000 estimators: Mean = 0.6471556350626118, Variance = 0.0003117066317632109

Best model: 500 estimators with a mean accuracy of 0.6502325581395348


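So far, this metamodel gives the best metrics, with a mean accuracy of 65.02% and a variance of about 0.0002 at 500 estimators. The number of estimators has little effect: mean accuracy stays between 64.56% and 65.02% across all five settings.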
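
As background for these numbers, each estimator in a bagging ensemble is by default trained on a bootstrap sample, that is, rows drawn with replacement. The sketch below draws one such sample with NumPy and measures how many distinct rows it contains; the sample size of 649 mirrors the dataset and is used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 649  # same number of rows as the dataset

# One bootstrap replicate: n_samples row indices drawn with replacement
indices = rng.integers(0, n_samples, size=n_samples)

# Share of the original rows that appear at least once in the replicate
unique_fraction = len(np.unique(indices)) / n_samples
print(f"Distinct rows in one bootstrap sample: {unique_fraction:.3f}")
```

Roughly 63% of the rows are expected to appear in each replicate. Since every SVM sees a large, overlapping portion of the same data, the averaged vote plausibly stabilises well before 100 estimators, which would be consistent with the flat results above.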

# Boosting

Train an AdaBoost model with 100, 250, 500, 750 and 1000 estimators and comment on the results in terms of accuracy and its variance.

In [None]:
def train_adaboost(n_estimators):
    adaboost = AdaBoostClassifier(n_estimators=n_estimators, random_state=42)
    results = cross_val_score(adaboost, X, y, scoring='accuracy', cv=5)

    return np.mean(results), np.var(results)

In [None]:
num_estimators = [100, 250, 500, 750, 1000]
accuracy_scores = []

for i in num_estimators:
    mean, var = train_adaboost(i)
    accuracy_scores.append(mean)
    print(f'Accuracy with {i} estimators: Mean = {mean}, Variance = {var}')

best_index = np.argmax(accuracy_scores)
print(f'\nBest model: {num_estimators[best_index]} estimators with a mean accuracy of {accuracy_scores[best_index]}')

Accuracy with 100 estimators: Mean = 0.6055694692903995, Variance = 0.004606439147055697
Accuracy with 250 estimators: Mean = 0.5886583184257603, Variance = 0.004180972283114814
Accuracy with 500 estimators: Mean = 0.5932737030411449, Variance = 0.004465546385220221
Accuracy with 750 estimators: Mean = 0.6055933214072748, Variance = 0.0044478784665663235
Accuracy with 1000 estimators: Mean = 0.6086821705426357, Variance = 0.003023413832449901

Best model: 1000 estimators with a mean accuracy of 0.6086821705426357


Based on the metrics we are using, this metamodel gives the worst results, with a mean accuracy of 60.87% and a variance of about 0.003 at 1000 estimators.
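
Retraining AdaBoost from scratch for every ensemble size is not strictly necessary: a fitted model can report its accuracy after each boosting round through `staged_score`. The sketch below shows the idea on synthetic data with a hold-out split; the data, the split and the choice of 100 rounds are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem standing in for the student data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

adaboost = AdaBoostClassifier(n_estimators=100, random_state=42)
adaboost.fit(X_train, y_train)

# One accuracy value per boosting round (boosting may stop before 100 rounds)
staged = list(adaboost.staged_score(X_test, y_test))

for n in sorted({min(10, len(staged)), min(50, len(staged)), len(staged)}):
    print(f"{n} estimators: accuracy = {staged[n - 1]:.3f}")
```

A single fit with the largest ensemble size then yields the whole accuracy curve, which makes it easier to see whether additional estimators help at all.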

## Model ranking

1. SVM with bagging (65.02%)

2. SVM and MLP with voting (64.10%)

3. SVM simple classifier (63.64%)

4. MLP with stacking (62.71%)

5. AdaBoost (60.87%)
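
The ranking can be rebuilt directly from the best cross-validation results reported in each section. The sketch below collects those figures in a dataframe and sorts them; the column names and model labels are chosen here for illustration.

```python
import pandas as pd

# Best mean accuracy and its variance, copied from the outputs of each section
results = pd.DataFrame(
    [
        ('SVM (simple classifier)', 0.6363625521765057, 0.00034036046280502645),
        ('SVM + MLP with voting', 0.6409779367918903, 0.00023192506993314032),
        ('MLP with stacking (1 hidden layer)', 0.6271437090041742, 0.0025750307307573195),
        ('SVM with bagging (500 estimators)', 0.6502325581395348, 0.0002013997651057184),
        ('AdaBoost (1000 estimators)', 0.6086821705426357, 0.003023413832449901),
    ],
    columns=['model', 'mean_accuracy', 'variance'],
)

ranking = results.sort_values('mean_accuracy', ascending=False).reset_index(drop=True)
print(ranking)
```

The three best models differ by less than 1.5 percentage points, so with only five folds the order among them should be read with some caution.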