## Heart Disease Risk Prediction: Logistic Regression 
## Step 1: Load and prepare the Dataset

Heart disease is one of the most important problems in health care, which creates the need to search for solutions and treatments. For that reason, Kaggle hosts data for predicting and analyzing heart disease. In this lab, I explain that data and build an analysis with logistic regression.

This first section focuses on preparing the dataset and on exploratory data analysis.

## Import required libraries 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)

## Load the Dataset

In [None]:
datos = pd.read_csv("heath.csv")
datos.head()

## Dataset Overview

In [None]:
datos.shape

In [None]:
datos.info()

In [None]:
datos.describe()

## Target Variable Analysis 

In [None]:
datos['target'].value_counts()

In [None]:
datos['target'].value_counts(normalize=True)

In [None]:
plt.figure()
datos['target'].value_counts().plot(kind='bar')
plt.title("Class Distribution (Heart Disease Presence)")
plt.xlabel("Target")
plt.ylabel("Count")
plt.show()

## Missing Values Check

In [None]:
datos.isnull().sum()

## Feature Selection 

In [None]:
features = ['age', 'chol', 'trestbps', 'thalach', 'oldpeak', 'ca']
X = datos[features].values
y = datos['target'].values

## Feature Normalization

In [None]:
promedio_X = X.mean(axis=0)
desviacion_X = X.std(axis=0)

X_norm = (X - promedio_X) / desviacion_X

Logistic regression converges faster when features are normalized.
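As a quick sanity check on the z-score normalization above, the following sketch uses a small made-up matrix (`X_demo` is hypothetical data, not taken from the dataset) and confirms that each column ends up with mean near 0 and standard deviation near 1. The guard against a zero standard deviation is an assumption added here and is not part of the original cell.

```python
import numpy as np

# Hypothetical toy matrix: 4 samples, 2 features on different scales
X_demo = np.array([[50.0, 200.0],
                   [60.0, 240.0],
                   [40.0, 180.0],
                   [70.0, 300.0]])

mean_demo = X_demo.mean(axis=0)
std_demo = X_demo.std(axis=0)

# Assumption: replace a zero std with 1 so constant columns do not divide by zero
std_safe = np.where(std_demo == 0, 1.0, std_demo)

X_demo_norm = (X_demo - mean_demo) / std_safe

print("Column means:", X_demo_norm.mean(axis=0))
print("Column stds: ", X_demo_norm.std(axis=0))
```

After this transformation every feature contributes on a comparable scale, so a single learning rate can work for all weights.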

## Train Test Split 70/30 

In [None]:
total_muestras = len(y)
indices = np.arange(total_muestras)
np.random.shuffle(indices)

tam_entrenamiento = int(0.7 * total_muestras)

idx_train = indices[:tam_entrenamiento]
idx_test = indices[tam_entrenamiento:]

X_train = X_norm[idx_train]
y_train = y[idx_train]

X_test = X_norm[idx_test]
y_test = y[idx_test]

Verify that the disease rate is similar in the training and test sets.

In [None]:
print("Train disease rate:", y_train.mean())
print("Test disease rate:", y_test.mean())

If the two rates are close, the random split preserved the class balance reasonably well.
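A plain random shuffle does not guarantee similar rates. One possible alternative, sketched here under my own assumptions, is a stratified split that shuffles each class separately. The helper `stratified_split` and the labels `y_demo` are hypothetical and are not part of the original notebook.

```python
import numpy as np

def stratified_split(y, train_fraction=0.7, seed=42):
    """Return train/test index arrays that roughly keep the class ratio of y."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        # Indices of the samples that belong to this class
        label_idx = np.flatnonzero(y == label)
        rng.shuffle(label_idx)
        cut = int(train_fraction * len(label_idx))
        train_idx.append(label_idx[:cut])
        test_idx.append(label_idx[cut:])
    return np.concatenate(train_idx), np.concatenate(test_idx)

# Hypothetical labels: 60 positives and 40 negatives
y_demo = np.array([1] * 60 + [0] * 40)
idx_train_demo, idx_test_demo = stratified_split(y_demo)

print("Train disease rate:", y_demo[idx_train_demo].mean())
print("Test disease rate:", y_demo[idx_test_demo].mean())
```

The returned index arrays group samples by class, so they would need one more shuffle before being used with an order-sensitive method such as mini-batch training.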

### Step 1 Summary

- Dataset downloaded from Kaggle (Heart Disease Dataset)
- 303 patient records with clinical features
- Target variable indicates presence (1) or absence (0) of heart disease
- Selected 6 clinically relevant features
- No missing values detected
- Features normalized for gradient descent optimization
- Data split into 70% training and 30% testing sets

The dataset is now ready for implementing logistic regression from scratch.

## Step 2: Implement Basic Logistic Regression
## Objective of model
Predict the probability of heart disease for a patient from a set of clinical values using logistic regression, without high-level ML libraries.



## Sigmoid Function

In [None]:
import numpy as np

def sigmoid(z):
    """
    Sigmoid activation function
    """
    return 1 / (1 + np.exp(-z))

The output of the model is a probability between 0 and 1, which we can read as an estimated risk of heart disease.
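For inputs with a very large magnitude, `np.exp(-z)` can overflow and raise a runtime warning. With normalized features this is unlikely, but a numerically safer variant can be sketched as follows. The name `sigmoid_stable` is hypothetical, and the function always returns an array.

```python
import numpy as np

def sigmoid_stable(z):
    """Sigmoid that avoids overflow in np.exp for large |z| (hypothetical helper)."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1, so it cannot overflow
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z))
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

print(sigmoid_stable(np.array([-1000.0, 0.0, 1000.0])))
```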

## Binary Cross-Entropy Loss

We use binary cross-entropy, the standard loss for binary classification.

In [None]:
def compute_cost(X, y, pesos, sesgo):
    """
    Binary cross-entropy loss
    """
    num_ejemplos = X.shape[0]
    z = X @ pesos + sesgo
    predicciones = sigmoid(z)
    
    epsilon = 1e-8  # numerical stability
    costo = -(1/num_ejemplos) * np.sum(
        y * np.log(predicciones + epsilon) +
        (1 - y) * np.log(1 - predicciones + epsilon)
    )
    return costo

This function heavily penalizes confident but incorrect predictions:
- A patient with heart disease who receives a probability close to 0 (risk not detected)
- A healthy patient who receives a probability close to 1
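To make the size of the penalty concrete, the sketch below evaluates the loss of a single example for a few hypothetical probabilities. The helper `bce_single` is an illustration only and omits the `epsilon` term used in `compute_cost`.

```python
import numpy as np

def bce_single(y_true, p):
    """Cross-entropy contribution of one example with predicted probability p."""
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Hypothetical predictions for a patient who has heart disease (y = 1)
print("Sick patient, p=0.99:", bce_single(1, 0.99))
print("Sick patient, p=0.60:", bce_single(1, 0.60))
print("Sick patient, p=0.01:", bce_single(1, 0.01))

# Hypothetical prediction for a healthy patient (y = 0)
print("Healthy patient, p=0.99:", bce_single(0, 0.99))
```

The loss is roughly 0.01 for the confident correct prediction, about 0.51 for the hesitant one, and about 4.6 for each confident mistake.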

## Gradient Computation

In [None]:
def compute_gradients(X, y, pesos, sesgo):
    """
    Compute gradients of the cost w.r.t. the weights (pesos) and bias (sesgo)
    """
    num_ejemplos = X.shape[0]
    predicciones = sigmoid(X @ pesos + sesgo)
    
    grad_pesos = (1/num_ejemplos) * (X.T @ (predicciones - y))
    grad_sesgo = (1/num_ejemplos) * np.sum(predicciones - y)
    
    return grad_pesos, grad_sesgo
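One way to gain confidence in these gradient formulas is to compare them with finite differences. The sketch below is self-contained and uses synthetic data. The function names (`cost`, `analytic_gradients`, `numeric_gradients`) and the step size `h` are my own choices for the illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(X, y, w, b):
    # Same binary cross-entropy as compute_cost, written with np.mean
    p = sigmoid(X @ w + b)
    eps = 1e-8
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def analytic_gradients(X, y, w, b):
    # Same formulas as compute_gradients
    p = sigmoid(X @ w + b)
    return X.T @ (p - y) / len(y), np.mean(p - y)

def numeric_gradients(X, y, w, b, h=1e-5):
    # Central differences, one parameter at a time
    grad_w = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = h
        grad_w[j] = (cost(X, y, w + step, b) - cost(X, y, w - step, b)) / (2 * h)
    grad_b = (cost(X, y, w, b + h) - cost(X, y, w, b - h)) / (2 * h)
    return grad_w, grad_b

# Synthetic data, not related to the heart disease dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
y_demo = (rng.random(20) > 0.5).astype(float)
w_demo = rng.normal(size=3) * 0.1
b_demo = 0.05

gw_a, gb_a = analytic_gradients(X_demo, y_demo, w_demo, b_demo)
gw_n, gb_n = numeric_gradients(X_demo, y_demo, w_demo, b_demo)
print("Max weight-gradient difference:", np.max(np.abs(gw_a - gw_n)))
print("Bias-gradient difference:", abs(gb_a - gb_n))
```

Both differences should be tiny. A large gap would point to a mistake in the analytic gradient.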



## Gradient Descent Training Loop

In [None]:
def gradient_descent(X, y, tasa_aprendizaje=0.01, iteraciones=2000):
    """
    Train logistic regression using gradient descent
    """
    num_ejemplos, num_features = X.shape
    pesos = np.zeros(num_features)
    sesgo = 0.0
    
    historial_costo = []
    
    for i in range(iteraciones):
        grad_pesos, grad_sesgo = compute_gradients(X, y, pesos, sesgo)
        
        pesos -= tasa_aprendizaje * grad_pesos
        sesgo -= tasa_aprendizaje * grad_sesgo
        
        if i % 50 == 0:
            costo = compute_cost(X, y, pesos, sesgo)
            historial_costo.append(costo)
    
    return pesos, sesgo, historial_costo

- A learning rate (alpha) around 0.01 is a reasonable starting point.
- Between 1000 and 3000 iterations are usually enough for the cost to stabilize with normalized features.
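The effect of the learning rate can be illustrated by training with several values for the same number of iterations and comparing the final cost. The sketch below uses synthetic data and a compact training function, `train`, written only for this comparison. It is not the notebook's `gradient_descent`.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, y, lr, iters):
    """Run plain gradient descent and return the final cross-entropy cost."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    # Clip probabilities so the logarithms stay finite
    p = np.clip(sigmoid(X @ w + b), 1e-8, 1 - 1e-8)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic, roughly standardized data with a noisy linear boundary
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(float)

final_costs = {lr: train(X_demo, y_demo, lr, 1000) for lr in (0.001, 0.01, 0.1)}
for lr, c in final_costs.items():
    print(f"alpha={lr}: final cost {c:.4f}")
```

On data like this, a smaller rate leaves the cost further from its minimum after the same number of iterations, which is why a very small alpha needs more iterations.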

## Training the Model

In [None]:
pesos, sesgo, historial_costo = gradient_descent(
    X_train, y_train,
    tasa_aprendizaje=0.01,
    iteraciones=2500
)

print("Final weights:", pesos)
print("Final bias:", sesgo)

## Convergence Visualization

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(6,4))
plt.plot(historial_costo)
plt.xlabel("Iterations (x50)")
plt.ylabel("Cost (Binary Cross-Entropy)")
plt.title("Training Loss Convergence")
plt.grid(True)
plt.show()

- A smooth, steady decrease in cost indicates that gradient descent is stable.
- The absence of oscillations suggests that the learning rate is not too large.
- When the curve flattens, the model has converged and has likely learned a reasonable decision boundary.
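These visual checks can also be done numerically on a recorded cost curve. The helper `describe_convergence`, its tolerance `tol`, and the two example curves below are hypothetical illustrations and are not results from the model.

```python
import numpy as np

def describe_convergence(cost_history, tol=1e-4):
    """Summarize a recorded cost curve (hypothetical helper)."""
    costs = np.asarray(cost_history, dtype=float)
    diffs = np.diff(costs)  # change between consecutive recorded costs
    return {
        "monotone_decreasing": bool(np.all(diffs <= 0)),
        "num_increases": int(np.sum(diffs > 0)),
        "flattened": bool(abs(diffs[-1]) < tol),
    }

# Made-up curves: one smooth, one oscillating
smooth = [0.69, 0.55, 0.48, 0.45, 0.44, 0.43996]
oscillating = [0.69, 0.80, 0.60, 0.95, 0.70, 1.10]

print(describe_convergence(smooth))
print(describe_convergence(oscillating))
```

The same function could be applied to `historial_costo` after training. Several increases would suggest lowering the learning rate.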

## Prediction Function

In [None]:
def predict(X, pesos, sesgo, umbral=0.5):
    """
    Predict class labels using learned parameters
    """
    probabilidades = sigmoid(X @ pesos + sesgo)
    return (probabilidades >= umbral).astype(int)

## Model Evaluation

In [None]:
def classification_metrics(y_real, y_pred):
    exactitud = np.mean(y_real == y_pred)
    
    tp = np.sum((y_real == 1) & (y_pred == 1))
    fp = np.sum((y_real == 0) & (y_pred == 1))
    fn = np.sum((y_real == 1) & (y_pred == 0))
    
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    
    return exactitud, precision, recall, f1
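The metric definitions can be checked by hand on a tiny example. The labels below are made up: of 4 sick patients the hypothetical model detects 3, and it raises 1 false alarm among 6 healthy patients.

```python
import numpy as np

# Hypothetical labels and predictions for 10 patients
y_real_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_real_demo == 1) & (y_pred_demo == 1))  # detected sick patients
fp = np.sum((y_real_demo == 0) & (y_pred_demo == 1))  # false alarms
fn = np.sum((y_real_demo == 1) & (y_pred_demo == 0))  # missed sick patients
tn = np.sum((y_real_demo == 0) & (y_pred_demo == 0))  # correctly cleared patients

accuracy = (tp + tn) / len(y_real_demo)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("Accuracy, Precision, Recall, F1:", accuracy, precision, recall, f1)
```

Here TP = 3, FP = 1, FN = 1 and TN = 5, so accuracy is 0.8 while precision, recall and F1 are all 0.75.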

In [None]:
y_train_pred = predict(X_train, pesos, sesgo)
y_test_pred = predict(X_test, pesos, sesgo)

train_metrics = classification_metrics(y_train, y_train_pred)
test_metrics = classification_metrics(y_test, y_test_pred)

print("Train metrics (Acc, Prec, Recall, F1):", train_metrics)
print("Test metrics  (Acc, Prec, Recall, F1):", test_metrics)

## Interpretation

Clinical interpretation of results

- Accuracy: general correctness

- Recall: ability to detect patients with heart disease (critical)

- Precision: reliability of positive predictions

- F1: harmonic mean of precision and recall, balancing false positives and false negatives

In medical risk prediction, recall is often more important than accuracy, as missing a high-risk patient is more costly than a false alarm.
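If recall is the priority, one option is to lower the decision threshold (the `umbral` parameter of `predict`) below 0.5. The sketch below shows the trade-off on made-up probabilities. The arrays and the helper `recall_precision` are hypothetical and do not come from the trained model.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities for 8 patients
y_real_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0])
prob_demo = np.array([0.9, 0.7, 0.45, 0.35, 0.6, 0.4, 0.2, 0.1])

def recall_precision(y_real, prob, threshold):
    """Recall and precision when predicting 1 for prob >= threshold."""
    y_pred = (prob >= threshold).astype(int)
    tp = np.sum((y_real == 1) & (y_pred == 1))
    fp = np.sum((y_real == 0) & (y_pred == 1))
    fn = np.sum((y_real == 1) & (y_pred == 0))
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return recall, precision

for threshold in (0.5, 0.4, 0.3):
    rec, prec = recall_precision(y_real_demo, prob_demo, threshold)
    print(f"threshold={threshold}: recall={rec:.2f}, precision={prec:.2f}")
```

In this toy example recall rises from 0.50 to 1.00 as the threshold drops from 0.5 to 0.3, at the cost of more false alarms than at 0.5. The right threshold for real patients would have to be chosen on validation data.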