## Task 1 - Polynomial logistic regression

### Instructions:

You will use Python in a Jupyter Notebook to complete this exercise. Remember to use comments to describe what
you are doing at each step of the process. You will work with the dataset provided in the portal. When you finish,
remember to upload to the portal a link to your repository from which your notebook can be run using
https://mybinder.org/.

The provided dataset comes from the Kaggle platform and records physical and contextual conditions for more than
4000 heart disease patients. The dataset associates each patient with a label (1 = had a cardiac arrest,
0 = did not have a cardiac arrest).

Below is a brief description of the variables included:

● Demographic:

○ Sex: male or female (Nominal)

○ Age: age of the patient (Continuous - although the recorded ages have been truncated to whole
numbers, the concept of age is continuous)

● Behavioral:

○ Current Smoker: whether or not the patient is a current smoker (Nominal)

○ Cigs Per Day: the average number of cigarettes the person smoked per day (can be
considered continuous, since one can smoke any number of cigarettes, even half a cigarette)

● Medical (history):

○ BP Meds: whether or not the patient was on blood pressure medication (Nominal)

○ Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal)

○ Prevalent Hyp: whether or not the patient was hypertensive (Nominal)

○ Diabetes: whether or not the patient had diabetes (Nominal)


● Medical (current):

○ Tot Chol: total cholesterol level (Continuous)

○ Sys BP: systolic blood pressure (Continuous)

○ Dia BP: diastolic blood pressure (Continuous)

○ BMI: Body Mass Index (Continuous)

○ Heart Rate: heart rate (Continuous - in medical research, variables such as heart rate, though in fact
discrete, are considered continuous because of the large number of possible values)

○ Glucose: glucose level (Continuous)

For this exercise, you are asked to provide a polynomial logistic regression model that accurately predicts whether
a patient will suffer a cardiac arrest.


### Task 1.1: Read the provided CSV file and store it in an np.array

In [None]:
import pandas as pd
import numpy as np

# Read the CSV file into a DataFrame
data = pd.read_csv("datos_pacientes.csv")

# Store the data in a NumPy array
data_array = np.array(data)
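
Before converting to an array, it can help to check for missing values, since scikit-learn's `LogisticRegression` rejects inputs that contain NaN. The sketch below uses a small made-up DataFrame (the column names `age`, `glucose`, and `label` are hypothetical stand-ins, not the real file's headers) and simply drops incomplete rows. Whether the provided file has gaps at all, and whether dropping or imputing suits it, are assumptions to verify against the actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the provided CSV
toy = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "glucose": [77.0, np.nan, 70.0, 103.0],
    "label": [0, 0, 0, 1],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop rows that contain any missing value, then convert to a NumPy array
toy_clean = toy.dropna()
toy_array = toy_clean.to_numpy()
print(toy_array.shape)
```

Dropping rows is the simplest option; imputing a column statistic is an alternative when too many rows would be lost.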



### Task 1.2: Fit a polynomial logistic model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

# Select the independent variables (all columns but the last) and the dependent variable (last column)
X = data_array[:, :-1]
y = data_array[:, -1]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Expand the features into polynomial terms
poly = PolynomialFeatures(degree=2)  # Change the degree as needed
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Fit a logistic regression on the polynomial features
logreg = LogisticRegression(solver="lbfgs", max_iter=10000)
logreg.fit(X_train_poly, y_train)
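
The same steps can also be chained in a scikit-learn `Pipeline`, which keeps the transformations and the classifier together. The sketch below is one possible arrangement, not the method the assignment requires. The `StandardScaler` step is an addition that is not in the original code, and the data from `make_classification` is a synthetic stand-in for the patient dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the patient data (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scale the features, expand them to degree-2 polynomial terms, then fit the classifier.
# include_bias=False because LogisticRegression already fits its own intercept.
model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(max_iter=10000),
)
model.fit(X_tr, y_tr)

# Mean accuracy on the held-out split
test_accuracy = model.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Fitting the scaler inside the pipeline means it only ever sees the training rows, so the test split stays untouched until scoring.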



### Task 1.3: Use the vectorized implementation of the logistic regression algorithm

In [None]:
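def sigmoid(z):
    # Clip z so that np.exp does not overflow for large-magnitude inputs
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def gradient_descent(X, y, learning_rate, iterations):
    # Batch gradient descent on the mean log-loss, starting from zero weights
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    for _ in range(iterations):
        linear_model = np.dot(X, weights)
        y_predicted = sigmoid(linear_model)
        # Gradient of the mean log-loss with respect to the weights
        gradient = np.dot(X.T, (y_predicted - y)) / n_samples
        weights -= learning_rate * gradient
    return weights

learning_rate = 0.1
iterations = 1000

# Train on the polynomial features and classify the test set with a 0.5 threshold
weights = gradient_descent(X_train_poly, y_train, learning_rate, iterations)
y_predicted = sigmoid(np.dot(X_test_poly, weights)) > 0.5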
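
For reference, the update that `gradient_descent` performs can be written out. The symbols are introduced here for illustration: $X$ is the $n \times d$ feature matrix, $y$ the label vector, $w$ the weights, $\alpha$ the learning rate, and $\sigma$ the sigmoid. The gradient in the code matches that of the mean log-loss $J(w)$:

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \hat{y} = \sigma(Xw)

J(w) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]

\nabla_w J(w) = \frac{1}{n} X^{\top} (\hat{y} - y)

w \leftarrow w - \alpha \, \nabla_w J(w)
```

Because the polynomial features are not rescaled here, a fixed learning rate may converge slowly or diverge; that is worth checking on the actual data.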



### Performance metrics

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

# Predicted probabilities for the test set; ROC AUC is computed from scores, not thresholded labels
y_scores = sigmoid(np.dot(X_test_poly, weights))

# Compute the performance metrics
accuracy = accuracy_score(y_test, y_predicted)
precision = precision_score(y_test, y_predicted)
recall = recall_score(y_test, y_predicted)
f1 = f1_score(y_test, y_predicted)
roc_auc = roc_auc_score(y_test, y_scores)
conf_mat = confusion_matrix(y_test, y_predicted)

print("Performance metrics:")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC AUC Score: {roc_auc:.2f}")
print("Confusion Matrix:")
print(conf_mat)
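
ROC AUC measures how well the scores rank positives above negatives, so it is usually computed from predicted probabilities rather than from 0/1 labels. The sketch below illustrates the difference on four made-up samples; `y_true` and `y_prob` are hypothetical values, not results from the patient dataset.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.3, 0.2, 0.4])

# AUC from the continuous scores: uses the full ranking of the samples
auc_from_scores = roc_auc_score(y_true, y_prob)

# AUC from labels thresholded at 0.5: every sample becomes 0 here,
# so all ranking information is lost
auc_from_labels = roc_auc_score(y_true, (y_prob > 0.5).astype(int))

print(auc_from_scores, auc_from_labels)
```

In this toy case no probability exceeds 0.5, so the thresholded labels carry no ranking information even though the scores do.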



### Task 1.4: Use cross-validation to determine the polynomial degree

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

def find_best_degree(X, y, degrees, learning_rate, iterations):
    # Return the degree with the highest mean 5-fold cross-validated accuracy
    best_degree = 0
    best_accuracy = 0
    # Shuffle so that the folds do not depend on the row order of the file
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    for degree in degrees:
        poly = PolynomialFeatures(degree=degree)
        X_poly = poly.fit_transform(X)
        accuracies = []
        for train_index, test_index in kf.split(X_poly):
            X_train, X_test = X_poly[train_index], X_poly[test_index]
            y_train, y_test = y[train_index], y[test_index]
            # Train on the training folds and evaluate on the held-out fold
            weights = gradient_descent(X_train, y_train, learning_rate, iterations)
            y_predicted = sigmoid(np.dot(X_test, weights)) > 0.5
            accuracy = accuracy_score(y_test, y_predicted)
            accuracies.append(accuracy)
        mean_accuracy = np.mean(accuracies)
        # Keep the degree with the best mean accuracy so far
        if mean_accuracy > best_accuracy:
            best_accuracy = mean_accuracy
            best_degree = degree
    return best_degree

degrees = range(1, 6)
best_degree = find_best_degree(X, y, degrees, learning_rate, iterations)
print("The best polynomial degree is:", best_degree)
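
The degree search can also be sketched with scikit-learn's own cross-validation helpers. This is an alternative arrangement under stated assumptions, not the approach the assignment prescribes. It uses `StratifiedKFold` and the F1 score, which may be preferable if the labels turn out to be imbalanced, along with a `StandardScaler` step and synthetic data from `make_classification` in place of the patient dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic, imbalanced stand-in for the patient data (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.85, 0.15], random_state=0,
)

# Stratified folds keep the class ratio roughly constant in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

mean_scores = {}
for degree in range(1, 4):
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree, include_bias=False),
        LogisticRegression(max_iter=10000),
    )
    # F1 is used here because accuracy can look high on imbalanced labels
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1")
    mean_scores[degree] = scores.mean()

# Pick the degree with the highest mean cross-validated F1
best_degree_demo = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_degree_demo)
```

Passing the whole pipeline to `cross_val_score` refits the scaler and the polynomial expansion inside each fold, so no statistics leak from the validation rows.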



### Task 1.5: Analysis