### 1. 

In the "logistic_regression_digits" file we saw a multiclass example. Remove every image and label whose label value is not 1 or 9. That is, discard all the other digits and keep only the digits 1 and 9.

Now train a logistic regression model on the new data:

- Does the accuracy of the algorithm improve with two classes? Why?

LogisticRegression() is a class with several input parameters. Investigate (modify, test) its arguments and comment on whether changing any of them improves the accuracy on this problem (try at least 2 different ones).

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [40]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.datasets import load_digits

We load the data into a DataFrame and filter it according to the requirements of the exercise.

In [41]:
data = load_digits()

In [42]:
df = pd.concat([pd.DataFrame(data['data']), pd.DataFrame(data['target'])], axis=1, ignore_index=True)

In [43]:
df = df[(df[64] == 1) | (df[64] == 9)]
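The same filter can also be written with `isin`, which stays readable if more digits ever need to be kept. This is a minimal sketch, assuming the label sits in column 64 as in the cell above; the names `keep` and `df_two` are illustrative and do not appear in the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_digits

data = load_digits()
# Pixel features end up in columns 0..63 and the label in column 64
df = pd.concat(
    [pd.DataFrame(data['data']), pd.DataFrame(data['target'])],
    axis=1, ignore_index=True,
)

keep = [1, 9]                   # illustrative name: digits to keep
df_two = df[df[64].isin(keep)]  # rows whose label is 1 or 9

# Number of remaining samples per class
print(df_two[64].value_counts())
```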

In [44]:
df.head()

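|    | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 |
|----|---|---|---|---|---|---|---|---|---|---|-----|----|----|----|----|----|----|----|----|----|----|
| 1  | 0.0 | 0.0 | 0.0 | 12.0 | 13.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 16.0 | 10.0 | 0.0 | 0.0 | 1 |
| 9  | 0.0 | 0.0 | 11.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 9.0 | 12.0 | 13.0 | 3.0 | 0.0 | 0.0 | 9 |
| 11 | 0.0 | 0.0 | 0.0 | 0.0 | 14.0 | 13.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 13.0 | 16.0 | 1.0 | 0.0 | 1 |
| 19 | 0.0 | 0.0 | 6.0 | 14.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 2.0 | 0.0 | 0.0 | 7.0 | 16.0 | 16.0 | 13.0 | 11.0 | 1.0 | 9 |
| 21 | 0.0 | 0.0 | 0.0 | 2.0 | 16.0 | 16.0 | 2.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 12.0 | 15.0 | 4.0 | 0.0 | 1 |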


We set a seed for the whole notebook.

In [45]:
seed = 42

We create a model and split the data into a training set and a test set.

In [46]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(64, axis=1), df.loc[:, 64], test_size=0.2, random_state=seed)

In [47]:
model = LogisticRegression(max_iter=200)

We run a cross-validation and look at the results.

In [48]:
cv = KFold(n_splits=5, shuffle=True, random_state=seed)
scores = cross_val_score(estimator=model, X=X_train, y=y_train, scoring='accuracy', cv=cv, n_jobs=-1)

In [49]:
scores

array([1.        , 0.96551724, 0.98275862, 1.        , 1.        ])
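Five fold scores are easier to compare across experiments when reduced to a mean and a spread. The sketch below only summarizes the fold accuracies copied from the output above; the variable names `mean_acc` and `std_acc` are illustrative.

```python
import numpy as np

# Fold accuracies copied from the cross-validation output above
scores = np.array([1.0, 0.96551724, 0.98275862, 1.0, 1.0])

mean_acc = scores.mean()  # average accuracy over the 5 folds
std_acc = scores.std()    # population standard deviation across folds

print(f"accuracy: {mean_acc:.4f} +/- {std_acc:.4f}")
```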

In [50]:
model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=200,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [51]:
model.score(X_train, y_train)

1.0

In [52]:
model.score(X_test, y_test)

0.9863013698630136
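To address the question of whether two classes are easier than ten, one option is to train the same kind of model on the full dataset and on the 1-vs-9 subset and compare the held-out accuracy. The following is a rough sketch, not the notebook's original procedure: the helper `holdout_accuracy` is hypothetical, and `max_iter=1000` is an assumption chosen to give the ten-class fit more room to converge.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

seed = 42
X, y = load_digits(return_X_y=True)


def holdout_accuracy(X, y):
    """Train on 80% of the data and return accuracy on the remaining 20%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # Assumption: a higher iteration cap than the notebook's 200
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te)


acc_all = holdout_accuracy(X, y)              # all ten digits
mask = np.isin(y, [1, 9])                     # keep only digits 1 and 9
acc_two = holdout_accuracy(X[mask], y[mask])  # binary problem

print(f"10 classes: {acc_all:.4f}")
print(f" 2 classes: {acc_two:.4f}")
```

A single split on such a small test set is noisy, so a difference of one or two samples should not be over-interpreted. In general a binary problem has only one decision boundary to learn and fewer pairs of similar-looking digits to confuse.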

We build the confusion matrix.

In [53]:
pd.DataFrame(confusion_matrix(y_true=y_test, y_pred=model.predict(X_test)))

|   | 0  | 1  |
|---|----|----|
| 0 | 37 | 0  |
| 1 | 1  | 35 |
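The positional indices 0 and 1 in the confusion matrix stand for the digits 1 and 9, which is easy to misread. A possible variant, sketched here under the same split and model settings as the cells above, passes `labels` explicitly and names the rows (true class) and columns (predicted class); `cm_df` is an illustrative name.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
mask = np.isin(y, [1, 9])  # keep only digits 1 and 9
X_train, X_test, y_train, y_test = train_test_split(
    X[mask], y[mask], test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=200).fit(X_train, y_train)

labels = [1, 9]
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true=y_test, y_pred=model.predict(X_test), labels=labels)
cm_df = pd.DataFrame(
    cm,
    index=[f"true {d}" for d in labels],
    columns=[f"pred {d}" for d in labels],
)
print(cm_df)
```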


Let's try changing the 'tol' parameter, which is the tolerance of the stopping criterion used to end the iterations. Its default value is 0.0001.

In [54]:
model = LogisticRegression(max_iter=200, tol=1e-10)
model.fit(X_train, y_train)
model.score(X_test, y_test)
# The result stays the same

0.9863013698630136
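Since 'tol' only decides when the optimizer stops, its effect is easier to see in the number of iterations than in the accuracy. The sketch below is an illustration under the same split as above; it reads the fitted attribute `n_iter_`, and the dictionary `results` is an illustrative name. A very strict tolerance may hit the `max_iter` cap and raise a convergence warning.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
mask = np.isin(y, [1, 9])  # keep only digits 1 and 9
X_train, X_test, y_train, y_test = train_test_split(
    X[mask], y[mask], test_size=0.2, random_state=42
)

results = {}
for tol in (1e-1, 1e-4, 1e-10):
    clf = LogisticRegression(max_iter=200, tol=tol).fit(X_train, y_train)
    # n_iter_ holds the iterations actually run by the solver
    results[tol] = (int(clf.n_iter_[0]), clf.score(X_test, y_test))

for tol, (n_iter, acc) in results.items():
    print(f"tol={tol:g}: iterations={n_iter}, test accuracy={acc:.4f}")
```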

Now we try changing the 'C' parameter, which is the inverse of the regularization strength. The lower its value, the stronger the regularization. Its default value is 1.

In [55]:
model = LogisticRegression(max_iter=200, C=0.5)
model.fit(X_train, y_train)
model.score(X_test, y_test)
# The result stays the same

0.9863013698630136
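Even when the accuracy does not move, 'C' still changes the fitted model: stronger regularization shrinks the coefficients. A small sketch of this, assuming the same split as above, with `max_iter=1000` as an assumption to give the weakly regularized fits more room to converge; `norms` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
mask = np.isin(y, [1, 9])  # keep only digits 1 and 9
X_train, X_test, y_train, y_test = train_test_split(
    X[mask], y[mask], test_size=0.2, random_state=42
)

norms = {}
for C in (0.01, 1.0, 100.0):
    # Assumption: higher iteration cap than the notebook's 200
    clf = LogisticRegression(max_iter=1000, C=C).fit(X_train, y_train)
    norms[C] = float(np.linalg.norm(clf.coef_))  # L2 norm of the weight vector
    print(f"C={C:g}: ||w||={norms[C]:.4f}, test accuracy={clf.score(X_test, y_test):.4f}")
```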

We switch the optimization algorithm to 'liblinear' and the penalty type to 'l1'.

In [56]:
model = LogisticRegression(max_iter=100, penalty='l1', solver='liblinear')
model.fit(X_train, y_train)
model.score(X_test, y_test)
# The result stays the same

0.9863013698630136
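Trying parameters one at a time against the test set can be replaced by a cross-validated grid search on the training set. The following is a sketch of that idea rather than the notebook's method: the grid values are arbitrary choices for illustration, and `search` is an illustrative name. The 'liblinear' solver is used because it supports both 'l1' and 'l2' penalties.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

seed = 42
X, y = load_digits(return_X_y=True)
mask = np.isin(y, [1, 9])  # keep only digits 1 and 9
X_train, X_test, y_train, y_test = train_test_split(
    X[mask], y[mask], test_size=0.2, random_state=seed
)

# Illustrative grid: values chosen arbitrarily for this sketch
param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}
search = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=200),
    param_grid,
    scoring="accuracy",
    cv=KFold(n_splits=5, shuffle=True, random_state=seed),
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.4f}")
print(f"test accuracy: {search.score(X_test, y_test):.4f}")
```

With accuracies already this close to 1 and only 73 test samples, several configurations are likely to tie, so the selected parameters should be read with caution.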

Finally, we train the model on all the data. Note that `model` is the last configuration tried ('liblinear' with the 'l1' penalty).

In [57]:
model.fit(df.drop(64, axis=1), df.loc[:, 64])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l1',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)