# Perception (PER): lab exam for block 2, group 3CO11, shift 1, 28-5-2024, 8:00-8:45

Read this notebook and complete the proposed activities and exercises.

## Importing the relevant libraries

In [2]:
import warnings; warnings.filterwarnings("ignore")
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

def parse_dni(x):
    # Keep the last three digits of the ID number; the result is used as random_state
    return x % 1000
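
As a quick check of what `parse_dni` returns, the sketch below applies it to the placeholder ID used later in the notebook. The second ID is hypothetical and only illustrates that leading zeros in the last three digits are dropped.

```python
def parse_dni(x):
    # Same helper as above: keep the last three digits of the ID number
    return x % 1000

seed = parse_dni(12345678)            # placeholder ID from the notebook
seed_with_zero = parse_dni(12345012)  # hypothetical ID, for illustration only
print(seed, seed_with_zero)  # 678 12
```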

## Reading and splitting the eye-movements corpus

Run the following code with dni set to your DNI/NIE number without the letter.

In [3]:
dni = 12345678

X, y = fetch_openml("eye_movements", version=9, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=True, random_state=parse_dni(dni))

print(f'N = {X.shape[0]} D = {X.shape[1]} C = {len(np.unique(y))} N_train = {len(X_train)} N_test = {len(X_test)}')

N = 7608 D = 23 C = 2 N_train = 6847 N_test = 761
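
The split sizes printed above can be reproduced by hand. This is a minimal sketch, assuming that `train_test_split` rounds the test fraction up and gives the remainder to the training set, which is consistent with the reported counts.

```python
import math

n_samples = 7608   # N reported above
test_size = 0.1    # fraction passed to train_test_split

# Assumption: the test count is rounded up and training takes the remainder
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 6847 761
```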


Task from the [**tabular data learning benchmark**](https://arxiv.org/abs/2207.08815), based on competition 1 of [eyechallenge2005](http://www.cis.hut.fi/eyechallenge2005). The goal is to predict whether a sentence read in a document is relevant or not (2 classes) from the associated eye-movement trajectory (22 features).

## Reference experiment with GradientBoosting

Run the following code to study the error of GradientBoosting with default parameters.

In [4]:
clf = GradientBoostingClassifier(random_state=parse_dni(dni))

acc = clf.fit(X_train[:1000], y_train[:1000]).score(X_test, y_test)
print(f'Precisión: {acc:.1%}')

Precisión: 59.1%
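
To see what "default parameters" means for this reference, the sketch below lists a few of the defaults through `get_params()`. The selection of names is ours; the values are whatever the installed scikit-learn version reports.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Inspect the hyper-parameters that the reference experiment leaves untouched
defaults = GradientBoostingClassifier().get_params()
for name in ("n_estimators", "learning_rate", "max_depth", "max_features", "subsample"):
    print(f"{name} = {defaults[name]}")
```

These are the same hyper-parameters that the grids in Exercise 1 vary, so the reference corresponds to a single point of that search space.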


## Exercise 1

Modify the reference code to study the error of GradientBoosting with hyper-parameter tuning, using one of the optimization techniques explained in the lab sessions.

In [6]:
%%timeit -n1 -r1

from sklearn.model_selection import GridSearchCV

G = {"max_depth": [1,2,3,5,10],"n_estimators":[10,20,50,100,150,200],"max_features":[10,20,30,40,57]}

GS = GridSearchCV(clf, G, scoring='accuracy', refit=True, cv=5, verbose=10)

acc = GS.fit(X_train, y_train).score(X_test, y_test)
print(f'Precisión: {acc:.1%} con {GS.best_params_}')

Fitting 5 folds for each of 150 candidates, totalling 750 fits
[CV 1/5; 1/150] START max_depth=1, max_features=10, n_estimators=10.............
[CV 1/5; 1/150] END max_depth=1, max_features=10, n_estimators=10;, score=0.561 total time=   0.0s
[CV 2/5; 1/150] START max_depth=1, max_features=10, n_estimators=10.............
[CV 2/5; 1/150] END max_depth=1, max_features=10, n_estimators=10;, score=0.539 total time=   0.0s
[CV 3/5; 1/150] START max_depth=1, max_features=10, n_estimators=10.............
[CV 3/5; 1/150] END max_depth=1, max_features=10, n_estimators=10;, score=0.590 total time=   0.0s
[CV 4/5; 1/150] START max_depth=1, max_features=10, n_estimators=10.............
[CV 4/5; 1/150] END max_depth=1, max_features=10, n_estimators=10;, score=0.608 total time=   0.1s
[CV 5/5; 1/150] START max_depth=1, max_features=10, n_estimators=10.............
[CV 5/5; 1/150] END max_depth=1, max_features=10, n_estimators=10;, score=0.584 total time=   0.1s
[CV 1/5; 2/150] START max_depth=1, ma

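The accuracy improves to 70.4%, but as the log shows, the run time is too high: the grid has 150 candidates, and 5-fold cross-validation requires 750 fits. Note also that the reference experiment trained on only the first 1000 training samples, whereas this search uses the whole training set, so part of the gain may come from the extra data.
To reduce the run time, we use RandomizedSearch.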
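
The cost that motivates this switch can be counted without running anything. The sketch below enumerates the grid with `itertools.product` and compares it with a randomized search, assuming `n_iter=10`, which is the scikit-learn default when `n_iter` is not given.

```python
from itertools import product

G = {"max_depth": [1, 2, 3, 5, 10],
     "n_estimators": [10, 20, 50, 100, 150, 200],
     "max_features": [10, 20, 30, 40, 57]}
cv = 5

# Exhaustive search: one candidate per combination of values
n_candidates = len(list(product(*G.values())))
grid_fits = n_candidates * cv

# Randomized search: only n_iter sampled candidates are cross-validated
n_iter = 10
random_fits = n_iter * cv

print(n_candidates, grid_fits, random_fits)  # 150 750 50
```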

In [8]:
%%timeit -n1 -r1

from sklearn.model_selection import RandomizedSearchCV

G = {"max_depth": [1,2,3,5,10],"n_estimators":[10,20,50,100,150,200],"max_features":[10,20,30,40,57]}

GS = RandomizedSearchCV(clf, G, scoring='accuracy', refit=True, cv=5, verbose=10)

acc = GS.fit(X_train, y_train).score(X_test, y_test)
print(f'Precisión: {acc:.1%} con {GS.best_params_}')

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5; 1/10] START max_depth=5, max_features=10, n_estimators=10..............
[CV 1/5; 1/10] END max_depth=5, max_features=10, n_estimators=10;, score=0.582 total time=   0.3s
[CV 2/5; 1/10] START max_depth=5, max_features=10, n_estimators=10..............
[CV 2/5; 1/10] END max_depth=5, max_features=10, n_estimators=10;, score=0.594 total time=   0.3s
[CV 3/5; 1/10] START max_depth=5, max_features=10, n_estimators=10..............
[CV 3/5; 1/10] END max_depth=5, max_features=10, n_estimators=10;, score=0.608 total time=   0.3s
[CV 4/5; 1/10] START max_depth=5, max_features=10, n_estimators=10..............
[CV 4/5; 1/10] END max_depth=5, max_features=10, n_estimators=10;, score=0.619 total time=   0.3s
[CV 5/5; 1/10] START max_depth=5, max_features=10, n_estimators=10..............
[CV 5/5; 1/10] END max_depth=5, max_features=10, n_estimators=10;, score=0.615 total time=   0.5s
[CV 1/5; 2/10] START max_depth=2, max_featur

In this case the run time is much lower (50 fits instead of 750) and, although the accuracy is worse than with GridSearch, it still improves on the GradientBoosting reference.
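
The trade-off between time and accuracy in a randomized search is controlled by `n_iter`, and the sampled candidates change between runs unless `random_state` is fixed. The following is a minimal sketch on synthetic data (a stand-in for the corpus, with a smaller grid chosen here only to keep it fast), not the configuration used in the exam.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the corpus (assumption: 23 features, 2 classes)
X, y = make_classification(n_samples=300, n_features=23, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

# Illustrative grid, smaller than the one in the exercise
G = {"max_depth": [1, 2, 3], "n_estimators": [10, 20, 50], "max_features": [5, 10, 20]}

RS = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0), G,
    n_iter=5,         # number of sampled candidates: more is slower but explores more
    random_state=0,   # fixes which candidates are sampled, so runs are repeatable
    scoring="accuracy", refit=True, cv=3)

acc = RS.fit(X_tr, y_tr).score(X_te, y_te)
print(f"Accuracy: {acc:.1%} with {RS.best_params_}")
```

Raising `n_iter` moves the search closer to the exhaustive grid at a proportional cost in fits.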

We apply data normalization and PCA before GradientBoosting to try to improve the error and the run time.

In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

scaler = StandardScaler()
pca = PCA()

pipe = Pipeline(steps=[("scaler", scaler), ("pca", pca), ("clf", clf)])

G = {"pca__n_components": [5,25,50], "clf__max_depth": [1,2,3,5,10],"clf__n_estimators":[10,20,50,100,150,200],"clf__max_features":[10,20,30,40,57]}
GS = GridSearchCV(pipe, G, scoring='accuracy', refit=True, cv=3, verbose=10)
acc = GS.fit(X_train, y_train).score(X_test, y_test)
print(f'Precisión: {acc:.1%} con {GS.best_params_}')

Fitting 3 folds for each of 450 candidates, totalling 1350 fits
[CV 1/3; 1/450] START clf__max_depth=1, clf__max_features=10, clf__n_estimators=10, pca__n_components=5
[CV 1/3; 1/450] END clf__max_depth=1, clf__max_features=10, clf__n_estimators=10, pca__n_components=5;, score=0.535 total time=   0.0s
[CV 2/3; 1/450] START clf__max_depth=1, clf__max_features=10, clf__n_estimators=10, pca__n_components=5
[CV 2/3; 1/450] END clf__max_depth=1, clf__max_features=10, clf__n_estimators=10, pca__n_components=5;, score=0.532 total time=   0.0s
[CV 3/3; 1/450] START clf__max_depth=1, clf__max_features=10, clf__n_estimators=10, pca__n_components=5
[CV 3/3; 1/450] END clf__max_depth=1, clf__max_features=10, clf__n_estimators=10, pca__n_components=5;, score=0.536 total time=   0.0s
[CV 1/3; 2/450] START clf__max_depth=1, clf__max_features=10, clf__n_estimators=10, pca__n_components=25
[CV 1/3; 2/450] END clf__max_depth=1, clf__max_features=10, clf__n_estimators=10, pca__n_components=25;, score=nan

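With PCA and dimensionality reduction, the accuracy gets worse. Note that the log reports score=nan for pca__n_components=25: PCA cannot extract more components than the D = 23 input features, so of the three values in the grid only n_components=5 can actually be fitted.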
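
The `nan` scores in the log can be reproduced in isolation. The sketch below uses synthetic data with the same dimensionality as the corpus (D = 23) and a reduced GradientBoosting (10 estimators, chosen here only for speed); it is an illustration of the failure mode, not a rerun of the experiment.

```python
import warnings
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings("ignore")  # hide the fit-failure warnings, as the notebook does

# Synthetic stand-in with D = 23 features
X, y = make_classification(n_samples=200, n_features=23, n_informative=8, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("pca", PCA()),
                 ("clf", GradientBoostingClassifier(n_estimators=10, random_state=0))])

# 25 components cannot be extracted from 23 features, so that candidate fails
GS = GridSearchCV(pipe, {"pca__n_components": [5, 25]}, scoring="accuracy", cv=3)
GS.fit(X, y)

# Map each n_components value to its mean cross-validation score
scores = {p["pca__n_components"]: s
          for p, s in zip(GS.cv_results_["params"], GS.cv_results_["mean_test_score"])}
print(scores, GS.best_params_)
```

Passing `error_score="raise"` to `GridSearchCV` would turn such silent `nan` scores into an exception, which makes an invalid grid easier to spot.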

## Exercise 2

As in the previous exercise, study the error of KNeighbors with hyper-parameter tuning, using one of the optimization techniques explained in the lab sessions.

With default parameters:

In [6]:
from sklearn.neighbors import KNeighborsClassifier

kv = KNeighborsClassifier()
acc=kv.fit(X_train,y_train).score(X_test,y_test)

print(f'Precisión: {acc:.1%}')

Precisión: 54.7%


Search for the best parameters:

In [7]:
import warnings; warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split, GridSearchCV
G = {'n_neighbors': [1,2,5,10], 'weights':['uniform', 'distance'], 'leaf_size': [1,2,5,10,20,30], 'p': [1,2]}
GS = GridSearchCV(kv, G, scoring='accuracy', refit=True, cv=5, verbose=10)
acc = GS.fit(X_train, y_train).score(X_test, y_test)
print(f'Precisión: {acc:.1%} con {GS.best_params_}')

Fitting 5 folds for each of 96 candidates, totalling 480 fits
[CV 1/5; 1/96] START leaf_size=1, n_neighbors=1, p=1, weights=uniform...........
[CV 1/5; 1/96] END leaf_size=1, n_neighbors=1, p=1, weights=uniform;, score=0.597 total time=   0.0s
[CV 2/5; 1/96] START leaf_size=1, n_neighbors=1, p=1, weights=uniform...........
[CV 2/5; 1/96] END leaf_size=1, n_neighbors=1, p=1, weights=uniform;, score=0.597 total time=   0.0s
[CV 3/5; 1/96] START leaf_size=1, n_neighbors=1, p=1, weights=uniform...........
[CV 3/5; 1/96] END leaf_size=1, n_neighbors=1, p=1, weights=uniform;, score=0.606 total time=   0.0s
[CV 4/5; 1/96] START leaf_size=1, n_neighbors=1, p=1, weights=uniform...........
[CV 4/5; 1/96] END leaf_size=1, n_neighbors=1, p=1, weights=uniform;, score=0.588 total time=   0.0s
[CV 5/5; 1/96] START leaf_size=1, n_neighbors=1, p=1, weights=uniform...........
[CV 5/5; 1/96] END leaf_size=1, n_neighbors=1, p=1, weights=uniform;, score=0.621 total time=   0.0s
[CV 1/5; 2/96] START leaf_si

In this case, the search for the best parameters slightly improves the error.
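
One detail of the grid above is worth checking: as far as the scikit-learn documentation describes it, `leaf_size` affects the speed and memory of the tree-based neighbor search, not which neighbors are returned. If that holds, the six `leaf_size` values multiply the number of fits without changing the accuracy. The sketch below tests this on synthetic data, with `algorithm="kd_tree"` set explicitly (an assumption made here so that `leaf_size` is actually used).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the corpus (assumption: 23 features, 2 classes)
X, y = make_classification(n_samples=400, n_features=23, n_informative=8, random_state=0)
X_tr, y_tr, X_te = X[:300], y[:300], X[300:]

# Fit the same classifier with two very different leaf sizes
preds = {}
for leaf_size in (1, 30):
    knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree", leaf_size=leaf_size)
    preds[leaf_size] = knn.fit(X_tr, y_tr).predict(X_te)

same = np.array_equal(preds[1], preds[30])
print("Identical predictions:", same)
```

Under that assumption, dropping `leaf_size` from the grid would cut the search from 96 to 16 candidates.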

Data normalization and PCA can also be applied before KNeighbors to check whether the error and the run time improve.

In [8]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

scaler = StandardScaler()
pca = PCA()

pipe = Pipeline(steps=[("scaler", scaler), ("pca", pca), ("kv", kv)])

G = {"pca__n_components": [5,10,20,50,100], "kv__n_neighbors": [1,3,5,10],'kv__weights':['uniform', 'distance'], 'kv__leaf_size': [1,2,5,10,20,30], 'kv__p': [1,2]}
GS = GridSearchCV(pipe, G, scoring='accuracy', refit=True, cv=5, verbose=10)
acc = GS.fit(X_train, y_train).score(X_test, y_test)
print(f'Precisión: {acc:.1%} con {GS.best_params_}')

Fitting 5 folds for each of 480 candidates, totalling 2400 fits
[CV 1/5; 1/480] START kv__leaf_size=1, kv__n_neighbors=1, kv__p=1, kv__weights=uniform, pca__n_components=5
[CV 1/5; 1/480] END kv__leaf_size=1, kv__n_neighbors=1, kv__p=1, kv__weights=uniform, pca__n_components=5;, score=0.507 total time=   0.0s
[CV 2/5; 1/480] START kv__leaf_size=1, kv__n_neighbors=1, kv__p=1, kv__weights=uniform, pca__n_components=5
[CV 2/5; 1/480] END kv__leaf_size=1, kv__n_neighbors=1, kv__p=1, kv__weights=uniform, pca__n_components=5;, score=0.510 total time=   0.0s
[CV 3/5; 1/480] START kv__leaf_size=1, kv__n_neighbors=1, kv__p=1, kv__weights=uniform, pca__n_components=5
[CV 3/5; 1/480] END kv__leaf_size=1, kv__n_neighbors=1, kv__p=1, kv__weights=uniform, pca__n_components=5;, score=0.521 total time=   0.0s
[CV 4/5; 1/480] START kv__leaf_size=1, kv__n_neighbors=1, kv__p=1, kv__weights=uniform, pca__n_components=5
[CV 4/5; 1/480] END kv__leaf_size=1, kv__n_neighbors=1, kv__p=1, kv__weights=uniform, p

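In this case, using PCA does improve the accuracy. Of the values in the grid, only 5, 10 and 20 can be fitted, since PCA cannot extract more components than the D = 23 input features.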
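
Part of the gain in this pipeline may come from `StandardScaler` rather than from PCA itself, because KNeighbors is distance-based and features with large ranges dominate the distance. The sketch below illustrates the effect on a hypothetical two-feature dataset built for this purpose (one informative feature, one large-scale noise feature); it says nothing about how strong the effect is on the eye-movements corpus.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
informative = rng.normal(loc=4.0 * y - 2.0, scale=1.0)  # class means at -2 and +2
noise = rng.normal(loc=0.0, scale=1000.0, size=n)        # huge range, no class information
X = np.column_stack([informative, noise])

X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]

# Same classifier, without and with standardization
raw = KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te)
scaled = Pipeline([("scaler", StandardScaler()),
                   ("kv", KNeighborsClassifier())]).fit(X_tr, y_tr).score(X_te, y_te)
print(f"raw: {raw:.1%}  scaled: {scaled:.1%}")
```

Running the pipeline of this exercise once with the scaler alone and once with scaler and PCA would separate the two contributions.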