<a href="https://colab.research.google.com/github/TuMyXx93/Analisis_de_datos_etapa5/blob/main/Etapa5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data analysis and supervised learning for Stage 5

We use the Breast Cancer Wisconsin dataset from [sklearn](https://scikit-learn.org/stable/), a standard dataset for binary classification problems. Its features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe characteristics of the cell nuclei present in the image.
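As a quick orientation, the sketch below loads the dataset and prints its shape and class balance. It uses only `load_breast_cancer` from scikit-learn; the variable names are illustrative and not part of the original notebook.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Number of samples (rows) and numeric features (columns)
n_samples, n_features = data.data.shape
print(n_samples, n_features)

# Target encoding: index 0 is the first class name, index 1 the second
class_names = [str(name) for name in data.target_names]
print(class_names)

# Count how many samples fall into each class
class_counts = Counter(class_names[label] for label in data.target)
print(dict(class_counts))
```

The class with label 1 is the majority, which matches the mean of about 0.63 for the `target` column in the summary statistics.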

In [1]:
# 1. Import the required libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [2]:
# 2. Load the dataset into a DataFrame and attach the target column
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

In [3]:
# 3. Explore the data: first rows, column types, and summary statistics
df.head()
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 |
| mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 | 0.627417 |
| std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.07972 | 0.038803 | 0.027414 | 0.00706 | ... | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 | 0.483918 |
| min | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.106 | 0.04996 | ... | 12.02 | 50.41 | 185.2 | 0.07117 | 0.02729 | 0.0 | 0.0 | 0.1565 | 0.05504 | 0.0 |
| 25% | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.1619 | 0.0577 | ... | 21.08 | 84.11 | 515.3 | 0.1166 | 0.1472 | 0.1145 | 0.06493 | 0.2504 | 0.07146 | 0.0 |
| 50% | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 0.09263 | 0.06154 | 0.0335 | 0.1792 | 0.06154 | ... | 25.41 | 97.66 | 686.5 | 0.1313 | 0.2119 | 0.2267 | 0.09993 | 0.2822 | 0.08004 | 1.0 |
| 75% | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | 0.06612 | ... | 29.72 | 125.4 | 1084.0 | 0.146 | 0.3391 | 0.3829 | 0.1614 | 0.3179 | 0.09208 | 1.0 |
| max | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | 0.09744 | ... | 49.54 | 251.2 | 4254.0 | 0.2226 | 1.058 | 1.252 | 0.291 | 0.6638 | 0.2075 | 1.0 |


In [4]:
# 4. Split the data into training and test sets
X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
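To make the effect of `test_size=0.2` concrete, the short calculation below reproduces the split sizes for the 569 rows. It assumes scikit-learn's rule of rounding the test set size up and assigning the remaining rows to training; the variable names are illustrative.

```python
import math

n_samples = 569   # rows in the dataset
test_size = 0.2   # fraction passed to train_test_split

# The test set size is rounded up; the remaining rows go to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

The test size agrees with the total support of 114 shown in the classification reports.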

In [5]:
# 5. Scale the features (fit the scaler on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
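The scaler is fitted on the training set and then reused on the test set, so no information from the test data leaks into preprocessing. The pure-Python sketch below illustrates that idea on a single hypothetical feature. The helpers `fit_standardizer` and `standardize` and the sample values are made up for illustration; like `StandardScaler`, the sketch uses the population standard deviation.

```python
from statistics import fmean, pstdev


def fit_standardizer(values):
    """Return the mean and population standard deviation of the given values."""
    return fmean(values), pstdev(values)


def standardize(values, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(x - mean) / std for x in values]


# Hypothetical single-feature data, for illustration only
train = [10.0, 12.0, 14.0, 16.0]
test = [13.0, 20.0]

mean, std = fit_standardizer(train)          # learned from the training data only
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)   # reuses the training statistics

print(mean, std)
print(test_scaled)
```

A test value equal to the training mean maps to 0, while values outside the training range map to large positive or negative scores.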

In [6]:
# 6. Train the model
model = LogisticRegression()
model.fit(X_train, y_train)

In [7]:
# 7. Make predictions and evaluate model performance
predictions = model.predict(X_test)

print("Accuracy: ", accuracy_score(y_test, predictions))
print("\nConfusion Matrix: \n", confusion_matrix(y_test, predictions))
print("\nClassification Report: \n", classification_report(y_test, predictions))

Accuracy:  0.9736842105263158

Confusion Matrix: 
 [[41  2]
 [ 1 70]]

Classification Report: 
               precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
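The figures in the report above can be recomputed by hand from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. A minimal sketch, with `precision` and `recall` as illustrative helper names:

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = [[41, 2],
      [1, 70]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total


def precision(cm, k):
    """Correct predictions of class k divided by all predictions of class k."""
    predicted_k = sum(row[k] for row in cm)
    return cm[k][k] / predicted_k


def recall(cm, k):
    """Correct predictions of class k divided by all true members of class k."""
    return cm[k][k] / sum(cm[k])


print(round(accuracy, 4))
for k in (0, 1):
    print(k, round(precision(cm, k), 2), round(recall(cm, k), 2))
```

In total, 111 of the 114 test samples are classified correctly.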



### This is a general outline of a data analysis and supervised learning workflow in Python using the Breast Cancer Wisconsin dataset.

# Optimizing the model

We use grid search to tune the hyperparameters of the logistic regression model. Specifically, we tune the regularization parameter 'C' (the inverse of the regularization strength) and the regularization type 'penalty'.

In [8]:
# 8. Import the class needed for the hyperparameter search
from sklearn.model_selection import GridSearchCV

In [9]:
# 9. Define the hyperparameters to tune and the values to try
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    'penalty': ['l1', 'l2'],
}
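To see how large this search is, the sketch below enumerates the candidates the grid defines, using only the standard library. The fold count of 5 mirrors the `cv=5` setting used in this notebook; the variable names are illustrative.

```python
from itertools import product

param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    'penalty': ['l1', 'l2'],
}

# Every combination of one value per hyperparameter is a candidate model
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5  # matches cv=5 in this notebook
print(len(candidates))            # number of candidate settings
print(len(candidates) * n_folds)  # number of cross-validation fits
print(candidates[0])
```

With its default `refit=True`, GridSearchCV adds one final fit of the best setting on the full training set.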

In [10]:
# 10. Create a GridSearchCV object and fit it to the training data
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
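The warning above comes from the default `lbfgs` solver stopping at its iteration limit (`max_iter` defaults to 100) before converging. Because the features are already standardized, the remaining option the message suggests is a larger `max_iter`. A minimal sketch, assuming a limit of 5000 is sufficient for these values of `C`; that number is an assumption, not a setting from the notebook, and the grid is restricted to `C` to keep the focus on convergence.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# max_iter=5000 is an assumed value; the default limit of 100 is what triggers the warning
estimator = LogisticRegression(max_iter=5000)
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

grid_search = GridSearchCV(estimator, param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
# Iterations actually used by the refitted best model
print(grid_search.best_estimator_.n_iter_)
```

If `n_iter_` stays below the limit, the solver stopped because it converged rather than because it ran out of iterations.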

In [11]:
# 11. Print the best hyperparameters and the score of the best model
print("Best Hyperparameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

Best Hyperparameters:  {'C': 10, 'penalty': 'l2'}
Best Score:  0.9758241758241759
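`best_score_` is the mean validation accuracy of the best candidate across the 5 cross-validation folds, computed on the training set only, so it need not equal the test accuracy. The sketch below shows how 5 folds partition the 455 training rows. It is a simplified, unshuffled k-fold written for illustration; for classifiers, GridSearchCV uses a stratified variant that also preserves class proportions. The helper `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each fold in turn."""
    # Spread any remainder over the first folds so sizes differ by at most one
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(455, 5))
print([len(validation) for _, validation in folds])
print([len(train) for train, _ in folds])
```

Each row is used for validation exactly once and for training in the other four folds.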


In [12]:
# 12. Evaluate the optimized model on the test set
best_model = grid_search.best_estimator_
predictions = best_model.predict(X_test)

print("Accuracy: ", accuracy_score(y_test, predictions))
print("\nConfusion Matrix: \n", confusion_matrix(y_test, predictions))
print("\nClassification Report: \n", classification_report(y_test, predictions))

Accuracy:  0.9736842105263158

Confusion Matrix: 
 [[42  1]
 [ 2 69]]

Classification Report: 
               precision    recall  f1-score   support

           0       0.95      0.98      0.97        43
           1       0.99      0.97      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



### Keep in mind that grid search can be computationally expensive, especially when the number of hyperparameters or the number of values tried for each one is large. In addition, not every combination of hyperparameters is valid. For example, when 'penalty' is 'l1', the 'solver' parameter of LogisticRegression must be set to 'liblinear' or 'saga'; the default solver does not support it, and fitting such a model directly raises an error. Inside GridSearchCV, those failed fits are by default scored as NaN with a warning, so the 'l1' candidates never compete for the best score. We did not set 'solver' in this grid search to keep the example simple, but in practice you should account for it.
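One way to keep every candidate valid is to choose a solver that supports both penalties. A minimal sketch, assuming `solver='liblinear'` is acceptable for this dataset. The shorter `C` grid is chosen only to keep the example brief, and the results will not necessarily match the numbers reported above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# liblinear supports both 'l1' and 'l2', so no candidate fit fails
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
}
grid_search = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# A NaN here would mean that some candidate failed to fit
scores = grid_search.cv_results_['mean_test_score']
print(np.isnan(scores).any())
print(grid_search.best_params_)
```

An alternative is to pass `param_grid` as a list of dictionaries, one per solver, so that each solver is paired only with the penalties it supports.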

# Author: Wilson Tumiña Tumiña
### Course: Data Analysis
### Date: May 2023
### Contact:

**GitHub** - [TuMyXx93](https://github.com/TuMyXx93)

**Twitter** - [Tumix19](https://twitter.com/Tumix19)

**LinkedIn** - [Wilson Tumi](https://www.linkedin.com/in/wilson-tumi-9982011a9/)

**Platzi** - [Wilson Tumi](https://platzi.com/p/wtumi/)