## Exercise: sklearn breast cancer

1. Load the [breast_cancer dataset from `sklearn`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html).
2. Clean the data if necessary.
3. Use plotly to draw whatever charts you need to understand the data.
4. Use the classification methods covered so far to classify the target. Which one gives the best results?
5. Try to beat your own score by changing the algorithms' hyperparameters.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import pickle

import plotly.offline as py
import plotly.graph_objs as go
import plotly.express as px

from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedKFold

In [4]:
data = datasets.load_breast_cancer()
features = pd.DataFrame(data['data'], columns=data['feature_names'])
labels = pd.DataFrame(data['target'], columns=['target'])
df = pd.concat([features, labels], axis=1)
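
Before reading any plot or score, it may help to confirm how the labels are encoded. The sketch below reads the `target_names` attribute of the loaded bunch and counts the samples per class. The names `label_of` and `counts` are illustrative and not part of the original notebook.

```python
from sklearn import datasets

data = datasets.load_breast_cancer()

# target_names is indexed by the integer label, so position 0 names class 0.
label_of = {i: str(name) for i, name in enumerate(data.target_names)}
print(label_of)

# Count how many samples fall in each class.
counts = {label_of[i]: int((data.target == i).sum()) for i in label_of}
print(counts)
```

In scikit-learn's copy of this dataset, label 0 is malignant and label 1 is benign, which is the opposite of what one might guess.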

In [50]:
df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


**2. Clean the data if necessary**

In [51]:
df.isna().mean()  # No NaN values

mean radius                0.0
mean texture               0.0
mean perimeter             0.0
mean area                  0.0
mean smoothness            0.0
mean compactness           0.0
mean concavity             0.0
mean concave points        0.0
mean symmetry              0.0
mean fractal dimension     0.0
radius error               0.0
texture error              0.0
perimeter error            0.0
area error                 0.0
smoothness error           0.0
compactness error          0.0
concavity error            0.0
concave points error       0.0
symmetry error             0.0
fractal dimension error    0.0
worst radius               0.0
worst texture              0.0
worst perimeter            0.0
worst area                 0.0
worst smoothness           0.0
worst compactness          0.0
worst concavity            0.0
worst concave points       0.0
worst symmetry             0.0
worst fractal dimension    0.0
target                     0.0
dtype: float64
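
Missing values are only one thing to look for. A slightly broader sketch, assuming the same data but rebuilding the frame so the snippet is self-contained, also checks column types, fully duplicated rows, and non-finite values. Which of these checks matter is a judgment call, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn import datasets

data = datasets.load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

n_missing = int(df.isna().sum().sum())      # total NaN cells
n_duplicates = int(df.duplicated().sum())   # fully repeated rows
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes)
all_finite = bool(np.isfinite(df.to_numpy(dtype=float)).all())

print(f"missing={n_missing}, duplicates={n_duplicates}, "
      f"numeric={all_numeric}, finite={all_finite}")
```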

**3. Use plotly to draw whatever charts you need to understand the data.**

In [10]:
px.histogram(df, x='mean radius', color='target')
# 0: Malignant, 1: Benign (see data['target_names'])

In [71]:
# Class counts; in this dataset 0 is malignant and 1 is benign
px.histogram(df['target'].map({0: 'Malignant', 1: 'Benign'}).to_frame(), x='target')

**4. Use the classification methods covered so far to classify the target. Which one gives the best results?**

In [53]:
seed = 42

In [54]:
X = df.iloc[:, :-1].to_numpy()
target = df['target'].to_numpy()

In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.2, random_state=seed)

Logistic Regression

In [56]:
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.956140350877193
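
Accuracy alone hides which class the errors fall on, and for a diagnostic task a missed malignant case is arguably worse than a false alarm. A minimal sketch of a per-class breakdown follows, assuming the same split and model settings as the cell above. It uses `confusion_matrix` and `classification_report` from `sklearn.metrics`.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
# (0 = malignant, 1 = benign).
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred, target_names=data.target_names))
```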

KNeighbors

In [57]:
model = KNeighborsClassifier()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.956140350877193

SVC

In [58]:
model = SVC()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.9473684210526315

Decision Tree

In [59]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.9473684210526315
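
The four scores above come from a single 114-sample test split, so a difference of one or two misclassified samples is enough to reorder them. One way to get a steadier comparison is cross-validation, sketched below. The choices here are assumptions rather than the notebook's method: five stratified folds, a `StandardScaler` in front of the distance- and margin-based models, and a fixed `random_state` for the tree.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_breast_cancer(return_X_y=True)

# Scaling lives inside each pipeline so it is refit on every training fold.
models = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "svc": make_pipeline(StandardScaler(), SVC()),
    "tree": DecisionTreeClassifier(random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv)  # accuracy by default
    results[name] = (scores.mean(), scores.std())
    print(f"{name:10s} {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the standard deviation next to the mean makes it easier to judge whether two models really differ.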

**5. Try to beat your own score by changing the algorithms' hyperparameters.**

Logistic Regression

In [None]:
log_reg_param = {
    'C': [10, 1, 0.1, 0.01, 0.001],
    'solver': ['newton-cg', 'lbfgs', 'liblinear'],
    'max_iter': [100, 1000, 10000, 100000]
}
model = LogisticRegression()
grid = GridSearchCV(model, log_reg_param)
grid.fit(X_train, y_train)

In [61]:
grid.score(X_test, y_test)

0.956140350877193

KNeighbors

In [62]:
kneighbors_param = {
    'n_neighbors': [i for i in range(1, 21)]
}
model = KNeighborsClassifier()
grid = GridSearchCV(model, kneighbors_param)
grid.fit(X_train, y_train)

GridSearchCV(estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                         13, 14, 15, 16, 17, 18, 19, 20]})

In [63]:
grid.score(X_test, y_test)

0.956140350877193

SVC

In [64]:
svc_param = {
    'C': [1, 0.1, 0.01, 0.001],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
}
model = SVC()
grid = GridSearchCV(model, svc_param)
grid.fit(X_train, y_train)

GridSearchCV(estimator=SVC(),
             param_grid={'C': [1, 0.1, 0.01, 0.001],
                         'kernel': ['linear', 'poly', 'rbf', 'sigmoid']})

In [65]:
grid.score(X_test, y_test)

0.956140350877193

Decision Tree

In [66]:
tree_param = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random']
}
model = DecisionTreeClassifier()
grid = GridSearchCV(model, tree_param)
grid.fit(X_train, y_train)

GridSearchCV(estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'splitter': ['best', 'random']})

In [67]:
grid.score(X_test, y_test)

0.9035087719298246
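
The searches above tune each estimator on raw features and report only the test score. A possible refinement, sketched here under assumptions that are not in the original notebook (a reduced SVC grid, five-fold CV, a `StandardScaler` step), is to search over a `Pipeline` and inspect `best_params_` and `best_score_` alongside the held-out score.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(f"CV accuracy:   {grid.best_score_:.3f}")
print(f"Test accuracy: {grid.score(X_test, y_test):.3f}")
```

Comparing the cross-validated score with the test score gives a rough sense of whether the chosen hyperparameters generalise or merely fit the validation folds.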