<a href="https://colab.research.google.com/github/C0SS10/AI4ENG-II/blob/main/Breast-Cancer-Wisconsin-Modelado.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **📦 Modules, files, and packages required to run the notebook**

In [None]:
# General-purpose libraries
import numpy as np
import pandas as pd

# Plotting libraries
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_validate, StratifiedKFold, LeaveOneOut
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, make_scorer, confusion_matrix, roc_curve, auc
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

## Loading the dataset

We download the CSV file hosted on Google Drive with an HTTP request, then wrap the response text in an in-memory buffer so that pandas can parse it.

In [None]:
import requests
from io import StringIO

# URL of the CSV file
url = 'https://drive.google.com/uc?export=download&id=1iMiM-j44duS2TrxH4gQL2FStoimnIlT1'

response = requests.get(url)
response.raise_for_status()  # Check if the request was successful
data_csv = StringIO(response.text)

The file has no header row, so we assign the column names ourselves.

In [None]:
# Columns are described at https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic > "Additional Variable Information"
# Columns 3 to 32 hold ten measurements, each reported in three forms:
#   - Mean (mean)
#   - Standard error (se)
#   - Worst (worst)

column_names = [
    "ID", "Diagnosis",
    "Radius_mean", "Texture_mean", "Perimeter_mean", "Area_mean", "Smoothness_mean", "Compactness_mean", "Concavity_mean", "Concave_points_mean", "Symmetry_mean", "Fractal_dimension_mean",
    "Radius_se", "Texture_se", "Perimeter_se", "Area_se", "Smoothness_se", "Compactness_se", "Concavity_se", "Concave_points_se", "Symmetry_se", "Fractal_dimension_se",
    "Radius_worst", "Texture_worst", "Perimeter_worst", "Area_worst", "Smoothness_worst", "Compactness_worst", "Concavity_worst", "Concave_points_worst", "Symmetry_worst", "Fractal_dimension_worst"
]
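The naming pattern described in the comments above (ten measurements, each in three forms) can also be generated programmatically. The following is a minimal sketch; `base_measurements`, `suffixes`, and `generated_names` are illustrative names that do not appear in the original notebook.

```python
# Hypothetical helper names; the resulting list is meant to match column_names above.
base_measurements = [
    "Radius", "Texture", "Perimeter", "Area", "Smoothness",
    "Compactness", "Concavity", "Concave_points", "Symmetry", "Fractal_dimension",
]
suffixes = ["mean", "se", "worst"]

# All ten "_mean" columns come first, then the "_se" block, then the "_worst" block.
generated_names = ["ID", "Diagnosis"] + [
    f"{name}_{suffix}" for suffix in suffixes for name in base_measurements
]

print(len(generated_names))  # 32
```

Building the list this way makes it harder to misspell or misorder one of the 30 feature names by hand.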

In [None]:
data = pd.read_csv(data_csv, sep = ",", header=None, names=column_names)

In [None]:
data['Diagnosis'] = data['Diagnosis'].map({'M': 1, 'B': 0})

data.head()

,ID,Diagnosis,Radius_mean,Texture_mean,Perimeter_mean,Area_mean,Smoothness_mean,Compactness_mean,Concavity_mean,Concave_points_mean,...,Radius_worst,Texture_worst,Perimeter_worst,Area_worst,Smoothness_worst,Compactness_worst,Concavity_worst,Concave_points_worst,Symmetry_worst,Fractal_dimension_worst
0,842302,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


### PyCaret

We will use PyCaret to rank candidate models and see which ones perform best on our classification problem.

First, we split the dataset into training and test sets.

In [None]:
train_data, test_data = train_test_split(data, test_size=0.2, random_state=123)
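The split above is purely random. If the class proportions should be the same in both subsets, `train_test_split` accepts a `stratify` argument. The sketch below illustrates the idea on synthetic labels; the class counts (357 benign, 212 malignant) are those published for this dataset, and the `labels` list is only a stand-in for `data['Diagnosis']`.

```python
from sklearn.model_selection import train_test_split

# Synthetic labels standing in for data['Diagnosis'] (assumption: 0 = benign, 1 = malignant).
labels = [0] * 357 + [1] * 212

# stratify=labels keeps the malignant share roughly equal in both subsets.
train_labels, test_labels = train_test_split(
    labels, test_size=0.2, random_state=123, stratify=labels
)

train_rate = sum(train_labels) / len(train_labels)
test_rate = sum(test_labels) / len(test_labels)
print(f"malignant share: train={train_rate:.3f}, test={test_rate:.3f}")
```

In the notebook itself this would correspond to passing `stratify=data['Diagnosis']`, which would change which rows land in each subset and therefore the numbers reported further down.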

Next, we install PyCaret.

In [None]:
!pip install pycaret



We run the PyCaret setup.

In [None]:
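# Import the PyCaret classification functions used in this section
from pycaret.classification import setup, compare_models, tune_model, evaluate_model, finalize_model, predict_model

clf = setup(data=train_data, target='Diagnosis', session_id=123)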

,Description,Value
0,Session id,123
1,Target,Diagnosis
2,Target type,Binary
3,Original data shape,"(455, 32)"
4,Transformed data shape,"(455, 32)"
5,Transformed train set shape,"(318, 32)"
6,Transformed test set shape,"(137, 32)"
7,Numeric features,31
8,Preprocess,True
9,Imputation type,simple
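The summary above reports 31 numeric features for a 32-column frame, which suggests that the `ID` column is being treated as a predictor alongside the 30 measurements. If that is not intended, one option is to drop the identifier before calling `setup`. A minimal sketch on a toy frame follows; `toy_train` and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for train_data; the values are illustrative only.
toy_train = pd.DataFrame({
    "ID": [101, 102, 103],
    "Diagnosis": [1, 0, 1],
    "Radius_mean": [17.99, 11.42, 20.57],
})

# Drop the identifier so that only the target and the real measurements remain.
toy_features = toy_train.drop(columns=["ID"])
print(list(toy_features.columns))
```

PyCaret's `setup` also exposes an `ignore_features` parameter that may serve the same purpose without modifying the frame. Either change would alter the results shown in this section.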


In [None]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
ada,Ada Boost Classifier,0.9593,0.9841,0.925,0.9673,0.9437,0.9119,0.9146,0.163
gbc,Gradient Boosting Classifier,0.9593,0.9924,0.9417,0.9548,0.9458,0.9133,0.9162,0.405
et,Extra Trees Classifier,0.9593,0.9919,0.9333,0.962,0.9446,0.9126,0.9159,0.188
xgboost,Extreme Gradient Boosting,0.956,0.9912,0.9167,0.9664,0.9382,0.9043,0.908,0.104
lightgbm,Light Gradient Boosting Machine,0.953,0.9895,0.925,0.9509,0.9361,0.8991,0.9012,1.39
qda,Quadratic Discriminant Analysis,0.9468,0.9824,0.9417,0.9246,0.9312,0.8879,0.8904,0.034
lda,Linear Discriminant Analysis,0.9436,0.9938,0.8667,0.9798,0.9176,0.8755,0.8813,0.058
rf,Random Forest Classifier,0.9435,0.9845,0.925,0.9284,0.9244,0.8795,0.8822,0.225
ridge,Ridge Classifier,0.9405,0.9937,0.8667,0.9715,0.9127,0.8684,0.8747,0.034
dt,Decision Tree Classifier,0.9216,0.9158,0.8917,0.9039,0.8942,0.8321,0.8362,0.046



In [None]:
tuned_model = tune_model(best_model)

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.9688,0.9958,0.9167,1.0,0.9565,0.9322,0.9344
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,0.9688,1.0,0.9167,1.0,0.9565,0.9322,0.9344
3,0.9688,0.9792,0.9167,1.0,0.9565,0.9322,0.9344
4,0.9688,0.9917,0.9167,1.0,0.9565,0.9322,0.9344
5,1.0,1.0,1.0,1.0,1.0,1.0,1.0
6,0.9375,0.9792,0.8333,1.0,0.9091,0.8621,0.8704
7,0.9688,1.0,1.0,0.9231,0.96,0.9344,0.9364
8,0.9677,0.9956,0.9167,1.0,0.9565,0.931,0.9332
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0



Fitting 10 folds for each of 10 candidates, totalling 100 fits


In [None]:
evaluate_model(tuned_model)


In [None]:
final_model = finalize_model(tuned_model)

In [None]:
predictions = predict_model(final_model, data=test_data)

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Ada Boost Classifier,0.9737,0.9906,0.9512,0.975,0.963,0.9426,0.9427


## **Analysis and Results** 📊

In [None]:
X = data.iloc[:, 2:]
y = data['Diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### **Logistic Regression** 📈

In [None]:
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

In [None]:
# Predict the labels for the test set
y_pred = model.predict(X_test)

# Compute the evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("Confusion Matrix:\n", conf_matrix)

Accuracy: 0.956140350877193
Precision: 0.975
Recall: 0.9069767441860465
Confusion Matrix:
 [[70  1]
 [ 4 39]]
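
The reported metrics follow directly from the confusion matrix above, where rows are true classes and columns are predicted classes (scikit-learn's convention). As a worked check, with variable names chosen only for this illustration:

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
tn, fp = 70, 1   # true benign: 70 predicted benign, 1 predicted malignant
fn, tp = 4, 39   # true malignant: 4 predicted benign, 39 predicted malignant

accuracy_check = (tp + tn) / (tp + tn + fp + fn)   # 109 / 114
precision_check = tp / (tp + fp)                   # 39 / 40
recall_check = tp / (tp + fn)                      # 39 / 43

print(round(accuracy_check, 4), round(precision_check, 4), round(recall_check, 4))
# prints: 0.9561 0.975 0.907
```

The 4 false negatives are malignant cases predicted as benign, which is why recall is lower than precision for this model.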


### **🏠 KNN | K-Nearest Neighbors 🏠**

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
# Create an instance of the KNN model
knn = KNeighborsClassifier(n_neighbors=3)  # Cross-validation showed that K = 3 already gives good accuracy; we consider it good enough not to try larger values of K

# Train the model
knn.fit(X_train, y_train)
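
The comment above refers to a cross-validation run that is not shown in the notebook. The sketch below illustrates how such a comparison of K values could be made; it is not the authors' original procedure. The candidate values of K, the 5-fold setup, and the use of scikit-learn's bundled copy of the dataset (`load_breast_cancer`) are all assumptions. The bundled copy encodes the classes the other way round (0 = malignant), which does not affect accuracy.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for the notebook's X_train / y_train.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
mean_accuracy_by_k = {}
for k in (1, 3, 5, 7, 9):
    # Scaling lives inside the pipeline, so each fold is scaled using only its own training part.
    pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")
    mean_accuracy_by_k[k] = scores.mean()

for k, score in mean_accuracy_by_k.items():
    print(f"K={k}: mean accuracy {score:.4f}")
```

Wrapping the scaler and the classifier in one pipeline also avoids overwriting `X_train` and `X_test` with their scaled versions, which in this notebook affects every model trained afterwards.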

In [None]:
# Predict the labels for the test set
y_pred = knn.predict(X_test)

# Compute the evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("Confusion Matrix:\n", conf_matrix)

Accuracy: 0.9473684210526315
Precision: 0.9302325581395349
Recall: 0.9302325581395349
Confusion Matrix:
 [[68  3]
 [ 3 40]]


### **Decision Trees** 🌳

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Create an instance of the decision tree model
tree = DecisionTreeClassifier(random_state=42)

# Train the model
tree.fit(X_train, y_train)

In [None]:
y_pred = tree.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("Confusion Matrix:\n", conf_matrix)

Accuracy: 0.9473684210526315
Precision: 0.9302325581395349
Recall: 0.9302325581395349
Confusion Matrix:
 [[68  3]
 [ 3 40]]
