# Logistic Regression Analysis

**Logistic Regression** is a classification algorithm that models the probability that an observation belongs to a given class. In this notebook, we use it to estimate the probability that a patient is at high risk of a heart attack, based on clinical parameters.

### 🔬 Mathematical Approach:
Unlike linear regression, which outputs an unbounded value, logistic regression passes the linear score $z$ through the **Sigmoid Function** $\sigma(z) = 1 / (1 + e^{-z})$ to map it to a probability between 0 and 1.
We rely on `sklearn`'s `LogisticRegression` to find the best-fitting model parameters.
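
As a minimal sketch of the mapping described above, the snippet below turns a linear score into a probability and then into a class label. The weights `w`, intercept `b`, and sample `x` are made-up values for illustration, not parameters fitted in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, intercept and one standardized sample (illustration only)
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([1.2, 0.4])

z = np.dot(w, x) + b      # linear score, the same form as linear regression
p = sigmoid(z)            # probability of the positive class
label = int(p >= 0.5)     # default decision threshold of 0.5

print(f"score={z:.2f}, probability={p:.3f}, label={label}")
```

A score of 0 maps to a probability of exactly 0.5, so thresholding the probability at 0.5 is the same as checking the sign of the score.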

In [1]:
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
import matplotlib.pyplot as plt
%matplotlib inline 

# Specific tools for Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, jaccard_score

print("Scientific environment for Logistic Regression is ready.")

Scientific environment for Logistic Regression is ready.


## 📂 Data Loading & Structure Inspection

In this step, we reload the dataset for the **Logistic Regression** pipeline. Before training, we inspect the summary statistics of each feature to see how their ranges differ and to confirm that the two target classes are reasonably balanced.

Key focuses:
* **Scale of values**: Identifying the range for each clinical metric.
* **Target balance**: Confirming the distribution of the `output` variable.

In [4]:
# Load the heart dataset
df = pd.read_csv("../data/heart.csv")

# 1. Peek at the first few rows
print("--- Dataset Preview ---")
display(df.head())

# 2. Statistical breakdown
print("\n--- Descriptive Statistics ---")
display(df.describe())

# 3. Quick Column check for Feature Selection
print(f"\nTotal Features available: {len(df.columns)}")
print(f"Columns: {list(df.columns)}")

# Optional: check for missing values before modeling
missing = df.isnull().sum().sum()
if missing == 0:
    print("\n Data Integrity Check: No missing values found.")
else:
    print(f"\n Data Integrity Check: {missing} missing values found.")

--- Dataset Preview ---


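| | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |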



--- Descriptive Statistics ---


| | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.366337 | 0.683168 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 0.326733 | 1.039604 | 1.39934 | 0.729373 | 2.313531 | 0.544554 |
| std | 9.082101 | 0.466011 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.52586 | 22.905161 | 0.469794 | 1.161075 | 0.616226 | 1.022606 | 0.612277 | 0.498835 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 47.5 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 55.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 274.5 | 0.0 | 1.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |



Total Features available: 14
Columns: ['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh', 'exng', 'oldpeak', 'slp', 'caa', 'thall', 'output']

 Data Integrity Check: No missing values found.
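
The target balance can also be read off directly rather than inferred from `describe()`. In the statistics above, the mean of `output` is 0.544554, which for a 0/1 column equals the share of class 1 (165 of 303 rows). The sketch below shows the same check on a toy stand-in column; the `toy` DataFrame is hypothetical and only illustrates the calls.

```python
import pandas as pd

# Toy stand-in for the `output` column (illustration only, not the real data)
toy = pd.DataFrame({"output": [1, 1, 0, 1, 0, 1, 0, 1]})

# Absolute number of rows per class
counts = toy["output"].value_counts().sort_index()

# Share of rows per class; for a 0/1 target the class-1 share equals the mean
shares = toy["output"].value_counts(normalize=True).sort_index()

print(counts.to_dict())
print(shares.round(3).to_dict())
print(f"Mean of target: {toy['output'].mean():.3f}")
```

On the real data, the same two calls on `df["output"]` would report the class counts and proportions.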


## 🛠 Data Structuring & Type Casting

To ensure compatibility with high-performance numerical libraries like **NumPy** and **SciPy**, we perform the following:
1. **Feature Selection**: Explicitly defining our clinical predictors.
2. **Type Casting**: Ensuring the target variable `output` is in integer format for binary classification.
3. **Matrix Conversion**: Converting DataFrames into NumPy arrays ($X$ and $y$) for efficient mathematical operations during the optimization of the Logistic function.

In [5]:
# Define the clinical predictors once so the list is not duplicated
feature_cols = ['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh',
                'exng', 'oldpeak', 'slp', 'caa', 'thall']

# Re-order and select the relevant columns; .copy() guards against
# pandas' SettingWithCopyWarning on the assignment below
df = df[feature_cols + ['output']].copy()

# Ensure the target is of integer type
df["output"] = df["output"].astype(int)

# Extract features into a NumPy array
X = np.asarray(df[feature_cols])

# Extract target into a NumPy array
y = np.asarray(df["output"])

# Quick verification of the conversion
print(f"Feature matrix X shape: {X.shape}")
print(f"Target vector y shape: {y.shape}")
print("\nFirst 5 entries of the target (y):", y[0:5])

Feature matrix X shape: (303, 13)
Target vector y shape: (303,)

First 5 entries of the target (y): [1 1 1 1 1]


## ⚖️ Feature Standardization

For **Logistic Regression**, standardization is strongly recommended: it helps the optimization algorithm (like `liblinear` or `lbfgs`) converge efficiently, and it matters for regularization, because the penalty acts on coefficient magnitudes and would otherwise treat features on different scales unevenly.

By applying `StandardScaler`, we rescale every feature to zero mean and unit variance, so the model coefficients become comparable across features. This also keeps features with large magnitudes (such as `chol`, which ranges from 126 to 564) from producing extreme inputs to the **Sigmoid function**.

In [7]:
from sklearn import preprocessing

# Initialize and fit the scaler
scaler = preprocessing.StandardScaler().fit(X)

# Transform the feature matrix
X = scaler.transform(X)

# Display the first 5 rows of scaled features
print("Features scaled successfully.")
print("Sample of standardized data (First 5 rows):")
print(X[0:5])

Features scaled successfully.
Sample of standardized data (First 5 rows):
[[ 0.9521966   0.68100522  1.97312292  0.76395577 -0.25633371  2.394438
  -1.00583187  0.01544279 -0.69663055  1.08733806 -2.27457861 -0.71442887
  -2.14887271]
 [-1.91531289  0.68100522  1.00257707 -0.09273778  0.07219949 -0.41763453
   0.89896224  1.63347147 -0.69663055  2.12257273 -2.27457861 -0.71442887
  -0.51292188]
 [-1.47415758 -1.46841752  0.03203122 -0.09273778 -0.81677269 -0.41763453
  -1.00583187  0.97751389 -0.69663055  0.31091206  0.97635214 -0.71442887
  -0.51292188]
 [ 0.18017482  0.68100522  0.03203122 -0.66386682 -0.19835726 -0.41763453
   0.89896224  1.23989692 -0.69663055 -0.20670527  0.97635214 -0.71442887
  -0.51292188]
 [ 0.29046364 -1.46841752 -0.93851463 -0.66386682  2.08204965 -0.41763453
   0.89896224  0.58393935  1.43548113 -0.37924438  0.97635214 -0.71442887
  -0.51292188]]
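
To make the transformation concrete, here is a small sketch showing that `StandardScaler` computes $(x - \mu) / \sigma$ per column, using the population standard deviation (`ddof=0`). The matrix `X_toy` is made up for illustration. The same arithmetic reproduces the first value printed above: `describe()` reports the sample standard deviation of `age` (9.082), the population version is about 9.067, and $(63 - 54.366) / 9.067 \approx 0.952$.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: two columns on very different scales (illustration only)
X_toy = np.array([[29.0, 126.0],
                  [55.0, 240.0],
                  [77.0, 564.0]])

# Manual z-score; NumPy's default ddof=0 matches what StandardScaler uses
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Same transformation via scikit-learn
scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled))  # True
```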


## 🧪 Train/Test Split

To estimate how well our **Logistic Regression** model generalizes to new, unseen patients, we split our standardized dataset:
* **Training Set (80%)**: Used to optimize the model coefficients.
* **Testing Set (20%)**: Used to evaluate the model's predictive power with metrics such as accuracy, Jaccard score, precision, and recall.

We fix `random_state=42` so that the split, and therefore the results, can be reproduced by other researchers.

In [8]:
from sklearn.model_selection import train_test_split

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Reporting the split sizes
print("Dataset Split Summary:")
print(f"Training set: {X_train.shape[0]} samples")
print(f"Testing set:  {X_test.shape[0]} samples")

Dataset Split Summary:
Training set: 242 samples
Testing set:  61 samples
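
In this notebook the scaler is fitted on all 303 rows before the split, so the test rows contribute to the means and variances used for scaling. A common alternative is to split first and let a pipeline fit the scaler on the training rows only. The sketch below illustrates that ordering on synthetic data from `make_classification`. The data, the `stratify` argument, and the variable names are assumptions for illustration, so the printed accuracy says nothing about the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the heart data (illustration only)
X_raw, y_raw = make_classification(n_samples=303, n_features=13, random_state=42)

# Split first; stratify keeps the class ratio similar in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw
)

# The pipeline fits the scaler on the training rows only, then the classifier
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=0.1, solver="liblinear"))
pipe.fit(X_tr, y_tr)

print(f"Train rows: {X_tr.shape[0]}, test rows: {X_te.shape[0]}")
print(f"Test accuracy on synthetic data: {pipe.score(X_te, y_te):.3f}")
```

Calling `pipe.predict` or `pipe.score` on the test rows reuses the training-set mean and variance, so no test-set statistics reach the model.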


## 🤖 Logistic Regression Implementation

We initialize the model using the **Liblinear** solver, which is a good choice for binary classification on small datasets such as this one (303 rows).

* **Regularization (C=0.1)**: `C` is the inverse of regularization strength, so a value below the default of 1.0 strengthens the L2 penalty. This shrinks the coefficients, which can help the model generalize and keeps it from relying too heavily on any single feature.
* **Probability Estimates**: Beyond the binary label (0 or 1), we calculate the probability of each class with `predict_proba`. This is essential for clinical risk assessment, where the confidence of a prediction matters.

In [9]:
from sklearn.linear_model import LogisticRegression

# 1. Initialize and Train
# C is the inverse of regularization strength; smaller values specify stronger regularization.
lr_model = LogisticRegression(C=0.1, solver='liblinear').fit(X_train, y_train)

# 2. Predict Class Labels
y_hat = lr_model.predict(X_test)

# 3. Predict Probabilities
# Returns [Prob of Class 0, Prob of Class 1]
y_hat_prob = lr_model.predict_proba(X_test)

# --- Inspecting Results ---
print("Model Training Complete.")
print(f"\nFirst 10 Predicted Labels: {y_hat[0:10]}")
print(f"First 10 Actual Labels:    {y_test[0:10]}")  # y_test is already 1-D

print("\n--- Probability Breakdown (First 5 patients) ---")
# Displaying probability of 'High Risk' (Class 1)
for i in range(5):
    print(f"Patient {i+1}: Risk Probability = {y_hat_prob[i, 1]:.2%}")

Model Training Complete.

First 10 Predicted Labels: [0 1 1 0 1 1 1 0 0 0]
First 10 Actual Labels:    [0 0 1 0 1 1 1 0 0 1]

--- Probability Breakdown (First 5 patients) ---
Patient 1: Risk Probability = 15.29%
Patient 2: Risk Probability = 64.53%
Patient 3: Risk Probability = 77.43%
Patient 4: Risk Probability = 6.65%
Patient 5: Risk Probability = 90.42%
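
The two ideas in this section, regularization strength and probability estimates, can be checked in a short sketch. It uses synthetic data from `make_classification` rather than the heart dataset, and the names `weak` and `strong` are hypothetical. It illustrates that a smaller `C` shrinks the coefficients, that the class-1 probability is the sigmoid of the linear score, and that `predict` is equivalent to thresholding that probability at 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (illustration only, not the heart dataset)
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

# Large C = weak penalty, small C = strong penalty
weak = LogisticRegression(C=100.0, solver="liblinear").fit(X_syn, y_syn)
strong = LogisticRegression(C=0.01, solver="liblinear").fit(X_syn, y_syn)

# Stronger regularization shrinks the coefficient vector
print(f"||coef|| at C=100:  {np.linalg.norm(weak.coef_):.3f}")
print(f"||coef|| at C=0.01: {np.linalg.norm(strong.coef_):.3f}")

# The class-1 probability is the sigmoid of the linear score
scores = strong.decision_function(X_syn)
probs = strong.predict_proba(X_syn)[:, 1]
print(np.allclose(probs, 1.0 / (1.0 + np.exp(-scores))))

# predict() matches thresholding the class-1 probability at 0.5
print(np.array_equal(strong.predict(X_syn), (probs > 0.5).astype(int)))
```

In a clinical setting the 0.5 threshold is a default rather than a requirement; lowering it trades precision for recall.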


## 📊 Model Evaluation: Beyond Simple Accuracy

To understand how our **Logistic Regression** model performs beyond a single accuracy figure, we report overall accuracy alongside three further metrics:
1. **Jaccard Score**: For a given class, the number of samples correctly predicted as that class divided by the number of samples that are either predicted as or truly in that class, i.e. $TP / (TP + FP + FN)$. We calculate this for both 'Low Risk' (0) and 'High Risk' (1) classes.
2. **Precision & Recall**: Crucial for medical diagnosis.
    * *Precision*: When we predict high risk, how often are we right?
    * *Recall*: Out of all patients who are actually high risk, how many did we successfully catch?
3. **F1-Score**: The harmonic mean of Precision and Recall, providing a balanced view of the model's performance.

In [11]:
from sklearn.metrics import jaccard_score, classification_report, accuracy_score

# 1. Jaccard Similarity Score
j_score_0 = jaccard_score(y_test, y_hat, pos_label=0)
j_score_1 = jaccard_score(y_test, y_hat, pos_label=1)

print("Jaccard Similarity Scores:")
print(f"   - Class 0 (Low Risk):  {j_score_0:.4f}")
print(f"   - Class 1 (High Risk): {j_score_1:.4f}")
print("-" * 35)

# 2. Accuracy
acc = accuracy_score(y_test, y_hat)
print(f"Overall Accuracy: {acc:.4f}")
print("-" * 35)

# 3. Classification Report
print("Detailed Classification Report:")
print(classification_report(y_test, y_hat, target_names=['Low Risk', 'High Risk']))


Jaccard Similarity Scores:
   - Class 0 (Low Risk):  0.7576
   - Class 1 (High Risk): 0.7778
-----------------------------------
Overall Accuracy: 0.8689
-----------------------------------
Detailed Classification Report:
              precision    recall  f1-score   support

    Low Risk       0.86      0.86      0.86        29
   High Risk       0.88      0.88      0.88        32

    accuracy                           0.87        61
   macro avg       0.87      0.87      0.87        61
weighted avg       0.87      0.87      0.87        61
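
As a worked check, the reported figures can be reproduced from four confusion-matrix counts. The counts below are not printed anywhere in the notebook; they are inferred from the supports (29 and 32), the per-class recall, and the accuracy in the report above, so treat them as an assumption.

```python
# Counts for the 'High Risk' class, inferred from the report above (assumption)
tp, fp, fn, tn = 28, 4, 4, 25

precision = tp / (tp + fp)                           # correct among predicted high risk
recall = tp / (tp + fn)                              # caught among actual high risk
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
jaccard_high = tp / (tp + fp + fn)                   # intersection over union, class 1
jaccard_low = tn / (tn + fp + fn)                    # same, with class 0 as positive
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# precision=0.8750 recall=0.8750 f1=0.8750
print(f"jaccard_high={jaccard_high:.4f} jaccard_low={jaccard_low:.4f} accuracy={accuracy:.4f}")
# jaccard_high=0.7778 jaccard_low=0.7576 accuracy=0.8689
```

Under these counts the model misses 4 of the 32 high-risk patients in the test set, which is the figure recall is designed to expose.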

