# K-Nearest Neighbors on a spam detection dataset

We propose using a distance-based classifier on the OpenML spam detection dataset (id=44). It contains 4601 samples with 57 features.

In [None]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Download the Spam dataset
X, y = fetch_openml(data_id=44, as_frame=False, cache=True, return_X_y=True)
print(X.shape)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=23)


## The nearest-neighbors classifier

In [20]:
from sklearn.neighbors import KNeighborsClassifier

kv = KNeighborsClassifier()
# Fit on the training split and measure accuracy on the test split
acc = kv.fit(X_train, y_train).score(X_test, y_test)

print(f'Accuracy: {acc:.1%}')

Accuracy: 79.5%
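
To make the decision rule explicit, here is a minimal sketch of a k-nearest-neighbors prediction written with the standard library only. It is not how scikit-learn implements `KNeighborsClassifier`. The function name `knn_predict` and the toy points are made up for illustration. The default `k=5` mirrors the default `n_neighbors` of `KNeighborsClassifier`.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training sample
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k closest training samples
    nearest = sorted(range(len(train_points)), key=distances.__getitem__)[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated groups in 2D
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

prediction = knn_predict(points, labels, query=(0.8, 0.9), k=3)
print(prediction)
```

The query lies next to the second group, so its three nearest neighbors all carry the label `spam`.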


**Exercise:** Explore the main KNN parameter (`n_neighbors`) and search over it using one of the optimization techniques seen in the previous lab session.

In [19]:
# Solution

from sklearn.model_selection import GridSearchCV

G = {"n_neighbors":[1,3,4,5,10]}

GS = GridSearchCV(KNeighborsClassifier(), G, scoring='accuracy', refit=True, cv=5)

acc = GS.fit(X_train, y_train).score(X_test, y_test)
print(f'Accuracy: {acc:.1%} with {GS.best_params_}')

Accuracy: 82.1% with {'n_neighbors': 1}


## Improvements

The distance function used by default is the Euclidean distance. It requires preprocessing the samples so that all features have a similar scale; otherwise the features with the largest numeric range dominate the distance. KNN could also benefit from a PCA projection to reduce the dimensionality.
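
As a rough illustration of why scale matters, the sketch below compares the nearest neighbor of one sample before and after standardizing each feature to zero mean and unit variance, which is the transformation `StandardScaler` applies. The sample values and the helper names `standardize` and `nearest_to_first` are hypothetical and not taken from the Spam dataset.

```python
import math
import statistics

# Hypothetical samples: (a small-range feature, a large-range feature)
samples = [(0.10, 100.0), (0.90, 110.0), (0.12, 150.0), (0.50, 400.0)]


def standardize(rows):
    """Scale each column to zero mean and unit variance (population std)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]


def nearest_to_first(rows):
    """Index of the sample closest to rows[0], excluding rows[0] itself."""
    return min(range(1, len(rows)), key=lambda i: math.dist(rows[0], rows[i]))


raw_nn = nearest_to_first(samples)                  # driven by the large-range feature
scaled_nn = nearest_to_first(standardize(samples))  # both features now contribute
print(raw_nn, scaled_nn)
```

On the raw values the second feature decides almost everything, so sample 1 is the nearest neighbor even though its first feature is very different. After standardization sample 2, which is similar in both features, becomes the nearest.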

**Exercise:** Implement a pipeline with data normalization and PCA, followed by KNN. Search for the best parameters. An accuracy above 90% should be achievable.


In [23]:
# Solution

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

scaler = StandardScaler()
pca = PCA()
knn = KNeighborsClassifier()

pipe = Pipeline(steps=[("scaler", scaler), ("pca", pca), ("knn", knn)])

G = {"pca__n_components": [5,10,15,20,25,50], "knn__n_neighbors": [1,3,4,5]}
GS = GridSearchCV(pipe, G, scoring='accuracy', refit=True, cv=5)
acc = GS.fit(X_train, y_train).score(X_test, y_test)
print(f'Accuracy: {acc:.1%} with {GS.best_params_}')

Accuracy: 89.0% with {'knn__n_neighbors': 3, 'pca__n_components': 50}
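
A grid search evaluates every combination of the listed values, so its cost grows multiplicatively with each added parameter. The following sketch enumerates the grid from the cell above with `itertools.product` and counts the cross-validation fits. The variable names are illustrative, and the count leaves out the final refit on the whole training set.

```python
from itertools import product

# Same grid and number of folds as in the cell above
grid = {"pca__n_components": [5, 10, 15, 20, 25, 50], "knn__n_neighbors": [1, 3, 4, 5]}
cv_folds = 5

# Every combination of the listed values is one candidate
names = list(grid)
candidates = [dict(zip(names, values)) for values in product(*grid.values())]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds  # cross-validation fits, not counting the final refit
print(n_candidates, n_fits)
```

With 6 values for `pca__n_components` and 4 for `knn__n_neighbors` this gives 24 candidates and 120 fits.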


We could also try different distance functions ([sklearn distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.distance_metrics.html#sklearn.metrics.pairwise.distance_metrics)) through the `metric` parameter. Likewise, we could explore the `weights` parameter, which weights each neighbor's vote differently depending on the chosen option.
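
The effect of `weights` can be seen on a tiny example. The sketch below is a simplified stand-in for the two built-in options, not scikit-learn's code: with `"uniform"` every neighbor counts as one vote, with `"distance"` each neighbor counts as the inverse of its distance. The function `weighted_vote` and the 1D points are hypothetical.

```python
import math
from collections import defaultdict


def weighted_vote(train_points, train_labels, query, k, weights="uniform"):
    """Return the winning label among the k nearest neighbors of `query`."""
    distances = [math.dist(p, query) for p in train_points]
    nearest = sorted(range(len(train_points)), key=distances.__getitem__)[:k]
    totals = defaultdict(float)
    for i in nearest:
        if weights == "distance":
            # An exact match would divide by zero; this sketch simply returns its label
            if distances[i] == 0:
                return train_labels[i]
            totals[train_labels[i]] += 1.0 / distances[i]
        else:
            totals[train_labels[i]] += 1.0
    return max(totals, key=totals.get)


# One close neighbor of class 0, two farther neighbors of class 1
points = [(0.0,), (2.0,), (2.1,)]
labels = [0, 1, 1]
query = (0.5,)

uniform_pred = weighted_vote(points, labels, query, k=3, weights="uniform")
distance_pred = weighted_vote(points, labels, query, k=3, weights="distance")
print(uniform_pred, distance_pred)
```

With uniform weights class 1 wins by two votes to one. With distance weights the single close neighbor contributes 1/0.5 = 2.0, more than the 1/1.5 + 1/1.6 ≈ 1.29 of the two distant ones, so class 0 wins.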

**Exercise:** Also try different metrics and `weights` values together with everything above. Use the Bayesian optimization seen in the previous lab session.

In [22]:
# Solution
!pip install scikit-optimize




In [24]:
# Solution
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

# Try only 20 parameter combinations (n_iter=20)
scaler = StandardScaler()
pca = PCA()
knn = KNeighborsClassifier()

pipe = Pipeline(steps=[("scaler", scaler), ("pca", pca), ("knn", knn)])

G = {"pca__n_components": Integer(1, 57),
     "knn__n_neighbors": Integer(1, 20),
     "knn__metric": Categorical(["l1", "l2"]),
     "knn__weights": Categorical(["uniform", "distance"])}

BS = BayesSearchCV(pipe, G, scoring='accuracy', n_iter=20, refit=True, cv=5)

acc = BS.fit(X_train, y_train).score(X_test, y_test)
print(f'Accuracy: {acc:.1%} with {BS.best_params_}')

Accuracy: 92.4% with OrderedDict([('knn__metric', 'l2'), ('knn__n_neighbors', 13), ('knn__weights', 'distance'), ('pca__n_components', 29)])


## Olivetti Faces

Now try the KNN classifier, together with whatever parameters and preprocessing steps you consider appropriate, on the Olivetti face recognition dataset.

In [29]:
# Solution

from sklearn.datasets import fetch_olivetti_faces
X, y = fetch_olivetti_faces(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=23)


# Try only 20 parameter combinations (n_iter=20)
scaler = StandardScaler()
pca = PCA()
knn = KNeighborsClassifier()

pipe = Pipeline(steps=[("scaler", scaler), ("pca", pca), ("knn", knn)])

G = {"pca__n_components": Integer(10, 200),
     "knn__n_neighbors": Integer(1, 10),
     "knn__metric": Categorical(["l1", "l2"]),
     "knn__weights": Categorical(["uniform", "distance"])}

BS = BayesSearchCV(pipe, G, scoring='accuracy', n_iter=20, refit=True, cv=5, verbose=10)

acc = BS.fit(X_train, y_train).score(X_test, y_test)
print(f'Accuracy: {acc:.1%} with {BS.best_params_}')

Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 1/5; 1/1] START knn__metric=l1, knn__n_neighbors=8, knn__weights=uniform, pca__n_components=134
[CV 1/5; 1/1] END knn__metric=l1, knn__n_neighbors=8, knn__weights=uniform, pca__n_components=134;, score=0.688 total time=   0.7s
[CV 2/5; 1/1] START knn__metric=l1, knn__n_neighbors=8, knn__weights=uniform, pca__n_components=134
[CV 2/5; 1/1] END knn__metric=l1, knn__n_neighbors=8, knn__weights=uniform, pca__n_components=134;, score=0.750 total time=   0.5s
[CV 3/5; 1/1] START knn__metric=l1, knn__n_neighbors=8, knn__weights=uniform, pca__n_components=134
[CV 3/5; 1/1] END knn__metric=l1, knn__n_neighbors=8, knn__weights=uniform, pca__n_components=134;, score=0.719 total time=   0.5s
[CV 4/5; 1/1] START knn__metric=l1, knn__n_neighbors=8, knn__weights=uniform, pca__n_components=134
[CV 4/5; 1/1] END knn__metric=l1, knn__n_neighbors=8, knn__weights=uniform, pca__n_components=134;, score=0.594 total time=   0.5s
[CV 5/5; 1/1] STA



Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 1/5; 1/1] START knn__metric=l2, knn__n_neighbors=8, knn__weights=uniform, pca__n_components=53
[CV 1/5; 1/1] END knn__metric=l2, knn__n_neighbors=8, knn__weights=uniform, pca__n_components=53;, score=0.688 total time=   0.2s
[CV 2/5; 1/1] START knn__metric=l2, knn__n_neighbors=8, knn__weights=uniform, pca__n_components=53
[CV 2/5; 1/1] END knn__metric=l2, knn__n_neighbors=8, knn__weights=uniform, pca__n_components=53;, score=0.812 total time=   0.2s
[CV 3/5; 1/1] START knn__metric=l2, knn__n_neighbors=8, knn__weights=uniform, pca__n_components=53
[CV 3/5; 1/1] END knn__metric=l2, knn__n_neighbors=8, knn__weights=uniform, pca__n_components=53;, score=0.641 total time=   0.2s
[CV 4/5; 1/1] START knn__metric=l2, knn__n_neighbors=8, knn__weights=uniform, pca__n_components=53
[CV 4/5; 1/1] END knn__metric=l2, knn__n_neighbors=8, knn__weights=uniform, pca__n_components=53;, score=0.594 total time=   0.2s
[CV 5/5; 1/1] START knn__



Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 1/5; 1/1] START knn__metric=l1, knn__n_neighbors=10, knn__weights=uniform, pca__n_components=162
[CV 1/5; 1/1] END knn__metric=l1, knn__n_neighbors=10, knn__weights=uniform, pca__n_components=162;, score=0.578 total time=   0.4s
[CV 2/5; 1/1] START knn__metric=l1, knn__n_neighbors=10, knn__weights=uniform, pca__n_components=162
[CV 2/5; 1/1] END knn__metric=l1, knn__n_neighbors=10, knn__weights=uniform, pca__n_components=162;, score=0.672 total time=   0.4s
[CV 3/5; 1/1] START knn__metric=l1, knn__n_neighbors=10, knn__weights=uniform, pca__n_components=162
[CV 3/5; 1/1] END knn__metric=l1, knn__n_neighbors=10, knn__weights=uniform, pca__n_components=162;, score=0.625 total time=   0.4s
[CV 4/5; 1/1] START knn__metric=l1, knn__n_neighbors=10, knn__weights=uniform, pca__n_components=162
[CV 4/5; 1/1] END knn__metric=l1, knn__n_neighbors=10, knn__weights=uniform, pca__n_components=162;, score=0.531 total time=   0.4s
[CV 5/5; 



Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 1/5; 1/1] START knn__metric=l2, knn__n_neighbors=2, knn__weights=distance, pca__n_components=173
[CV 1/5; 1/1] END knn__metric=l2, knn__n_neighbors=2, knn__weights=distance, pca__n_components=173;, score=0.875 total time=   0.7s
[CV 2/5; 1/1] START knn__metric=l2, knn__n_neighbors=2, knn__weights=distance, pca__n_components=173
[CV 2/5; 1/1] END knn__metric=l2, knn__n_neighbors=2, knn__weights=distance, pca__n_components=173;, score=0.953 total time=   0.7s
[CV 3/5; 1/1] START knn__metric=l2, knn__n_neighbors=2, knn__weights=distance, pca__n_components=173
[CV 3/5; 1/1] END knn__metric=l2, knn__n_neighbors=2, knn__weights=distance, pca__n_components=173;, score=0.922 total time=   0.8s
[CV 4/5; 1/1] START knn__metric=l2, knn__n_neighbors=2, knn__weights=distance, pca__n_components=173
[CV 4/5; 1/1] END knn__metric=l2, knn__n_neighbors=2, knn__weights=distance, pca__n_components=173;, score=0.875 total time=   0.7s
[CV 5/5; 



Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 1/5; 1/1] START knn__metric=l2, knn__n_neighbors=9, knn__weights=uniform, pca__n_components=139
[CV 1/5; 1/1] END knn__metric=l2, knn__n_neighbors=9, knn__weights=uniform, pca__n_components=139;, score=0.672 total time=   0.4s
[CV 2/5; 1/1] START knn__metric=l2, knn__n_neighbors=9, knn__weights=uniform, pca__n_components=139
[CV 2/5; 1/1] END knn__metric=l2, knn__n_neighbors=9, knn__weights=uniform, pca__n_components=139;, score=0.781 total time=   0.3s
[CV 3/5; 1/1] START knn__metric=l2, knn__n_neighbors=9, knn__weights=uniform, pca__n_components=139
[CV 3/5; 1/1] END knn__metric=l2, knn__n_neighbors=9, knn__weights=uniform, pca__n_components=139;, score=0.609 total time=   0.4s
[CV 4/5; 1/1] START knn__metric=l2, knn__n_neighbors=9, knn__weights=uniform, pca__n_components=139
[CV 4/5; 1/1] END knn__metric=l2, knn__n_neighbors=9, knn__weights=uniform, pca__n_components=139;, score=0.516 total time=   0.3s
[CV 5/5; 1/1] STA



Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 1/5; 1/1] START knn__metric=l1, knn__n_neighbors=8, knn__weights=distance, pca__n_components=54
[CV 1/5; 1/1] END knn__metric=l1, knn__n_neighbors=8, knn__weights=distance, pca__n_components=54;, score=0.875 total time=   0.2s
[CV 2/5; 1/1] START knn__metric=l1, knn__n_neighbors=8, knn__weights=distance, pca__n_components=54
[CV 2/5; 1/1] END knn__metric=l1, knn__n_neighbors=8, knn__weights=distance, pca__n_components=54;, score=0.906 total time=   0.2s
[CV 3/5; 1/1] START knn__metric=l1, knn__n_neighbors=8, knn__weights=distance, pca__n_components=54
[CV 3/5; 1/1] END knn__metric=l1, knn__n_neighbors=8, knn__weights=distance, pca__n_components=54;, score=0.891 total time=   0.2s
[CV 4/5; 1/1] START knn__metric=l1, knn__n_neighbors=8, knn__weights=distance, pca__n_components=54
[CV 4/5; 1/1] END knn__metric=l1, knn__n_neighbors=8, knn__weights=distance, pca__n_components=54;, score=0.797 total time=   0.2s
[CV 5/5; 1/1] STA