# Session 7: Logistic Regression with Heart Disease Dataset

## Course: Data Science with Python
### Date: August 29, 2025

---

## Learning Objectives
By the end of this session, you will be able to:
1. Understand the fundamentals of logistic regression
2. Implement logistic regression on real-world data
3. Evaluate classification model performance
4. Interpret model results and feature importance

## What is Logistic Regression?
Logistic regression is a statistical method used for **binary classification** problems. Unlike linear regression, which predicts an unbounded continuous value, logistic regression predicts the probability that an instance belongs to a particular category.

### Key Differences from Linear Regression:
- **Output**: Probabilities (0 to 1) instead of unbounded continuous values
- **Function**: Applies the sigmoid (logistic) function to a linear combination of the features instead of using the linear combination directly
- **Purpose**: Classification instead of regression

### The Sigmoid Function:
The logistic function maps any real number to a value between 0 and 1:
$$P(y=1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}}$$
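
To make the formula concrete, here is a minimal sketch that evaluates it with NumPy. The intercept, coefficients, and feature values are made-up numbers for illustration, not values fitted to the heart disease data.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to the open interval (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept and coefficients for two features
beta_0 = -1.0
beta = np.array([0.8, -0.5])
x = np.array([2.0, 1.0])

z = beta_0 + beta @ x        # linear score, i.e. the log-odds
p = sigmoid(z)               # P(y=1 | x)
print(f"log-odds = {z:.2f}, probability = {p:.3f}")
```

A score of 0 maps to a probability of exactly 0.5, which is why 0.5 is the usual default decision threshold.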

## 1. Import Required Libraries
Let's start by importing all the necessary libraries for our logistic regression analysis.

In [None]:
# Import essential libraries for data manipulation and analysis
import pandas as pd
import numpy as np

# Import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import roc_auc_score, roc_curve

# Import warnings to suppress unnecessary warnings
import warnings
warnings.filterwarnings('ignore')

# Set up matplotlib for better plots
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (10, 6)

print("All libraries imported successfully!")

## 2. Load and Explore Heart Disease Dataset

### About the Dataset
We'll be working with the **Heart Disease Dataset**, which contains medical data that can help predict whether a patient has heart disease or not. This is a classic binary classification problem perfect for logistic regression.

**Objective**: Predict whether a patient has heart disease (1) or not (0) based on various medical attributes.

In [None]:
# Load the heart disease dataset
df = pd.read_csv('dataset.csv')

print("Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

# Display first few rows
print("\nFirst 5 rows of the dataset:")
df.head()

In [None]:
# Get basic information about the dataset
print("Dataset Information:")
print("=" * 50)
df.info()

print("\nDataset Description:")
print("=" * 50)
df.describe()

## 3. Introduction to Data Variables

Understanding each variable in our dataset is crucial for building an effective model. Let's examine each feature:

### Feature Variables (Input Features):

| Variable | Description | Type | Values |
|----------|-------------|------|--------|
| **age** | Age of the patient | Continuous | 29-77 years |
| **sex** | Gender of the patient | Categorical | 1 = Male, 0 = Female |
| **cp** | Chest pain type | Categorical | 0 = Typical angina<br>1 = Atypical angina<br>2 = Non-anginal pain<br>3 = Asymptomatic |
| **trestbps** | Resting blood pressure | Continuous | mm Hg |
| **chol** | Serum cholesterol | Continuous | mg/dl |
| **fbs** | Fasting blood sugar > 120 mg/dl | Categorical | 1 = True, 0 = False |
| **restecg** | Resting electrocardiographic results | Categorical | 0 = Normal<br>1 = ST-T wave abnormality<br>2 = Left ventricular hypertrophy |
| **thalach** | Maximum heart rate achieved | Continuous | beats per minute |
| **exang** | Exercise induced angina | Categorical | 1 = Yes, 0 = No |
| **oldpeak** | ST depression induced by exercise | Continuous | 0-6.2 |
| **slope** | Slope of peak exercise ST segment | Categorical | 0 = Upsloping<br>1 = Flat<br>2 = Downsloping |
| **ca** | Number of major vessels colored by fluoroscopy | Discrete | 0-4 |
| **thal** | Thalassemia | Categorical | 1 = Normal<br>2 = Fixed defect<br>3 = Reversible defect |

### Target Variable (Output):
| Variable | Description | Type | Values |
|----------|-------------|------|--------|
| **target** | Heart disease diagnosis | Binary | 1 = Heart disease present<br>0 = No heart disease |

In [None]:
# Check the distribution of our target variable
print("Target Variable Distribution:")
print("=" * 40)
target_counts = df['target'].value_counts()
print(target_counts)
print(f"\nPercentage distribution:")
print(f"No Heart Disease (0): {target_counts[0]/len(df)*100:.1f}%")
print(f"Heart Disease (1): {target_counts[1]/len(df)*100:.1f}%")

# Visualize the target distribution
# sort_index() orders the classes as 0, 1 so the labels below match the data
# (value_counts() alone sorts by frequency, which can swap the bars)
class_counts = df['target'].value_counts().sort_index()
class_labels = ['No Disease', 'Disease']

plt.figure(figsize=(8, 5))
plt.subplot(1, 2, 1)
class_counts.plot(kind='bar', color=['lightcoral', 'lightblue'])
plt.title('Heart Disease Distribution')
plt.xlabel('Target')
plt.ylabel('Count')
plt.xticks([0, 1], class_labels, rotation=0)

plt.subplot(1, 2, 2)
class_counts.plot(kind='pie', labels=class_labels, autopct='%1.1f%%', colors=['lightcoral', 'lightblue'])
plt.title('Heart Disease Percentage')
plt.ylabel('')

plt.tight_layout()
plt.show()

## 4. Data Preprocessing

Before building our logistic regression model, we need to prepare our data:
1. Check for missing values
2. Handle any data quality issues
3. Prepare features and target variables
4. Scale features if necessary

In [None]:
# Check for missing values
print("Missing Values Check:")
print("=" * 30)
missing_values = df.isnull().sum()
print(missing_values)

if missing_values.sum() == 0:
    print("\n✓ Great! No missing values found in the dataset.")
else:
    print(f"\n⚠ Found {missing_values.sum()} missing values that need to be handled.")

# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"\nDuplicate rows: {duplicates}")

# Separate features and target
X = df.drop('target', axis=1)  # Features (all columns except target)
y = df['target']               # Target variable

print(f"\nFeatures shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature columns: {list(X.columns)}")
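
The cell above only checks for missing values and duplicates. If the checks do find problems, one possible way to handle them is sketched below on a small made-up frame. The column values and the choice of median imputation are assumptions for illustration, not steps the dataset is known to need.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "age":    [63, 63, 45, 52, 58],
    "chol":   [233, 233, np.nan, 212, 250],
    "target": [1, 1, 0, 1, 0],
})

# 1. Drop exact duplicate rows (rows 0 and 1 are identical)
clean = toy.drop_duplicates().reset_index(drop=True)

# 2. Fill missing numeric values with the column median
clean["chol"] = clean["chol"].fillna(clean["chol"].median())

print(clean)
```

In a full workflow the imputation value would normally be computed from the training split only, for the same reason the scaler is fitted on the training data.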

## 5. Split Data into Training and Test Sets

We'll split our data into training and testing sets to evaluate our model's performance on unseen data.

**Why do we split the data?**
- **Training set**: Used to train the model (learn patterns)
- **Test set**: Used to evaluate model performance (unseen data)
- This helps us detect overfitting and get realistic performance estimates

In [None]:
# Split the data into training and testing sets
# 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,        # 20% for testing
    random_state=42,      # For reproducible results
    stratify=y            # Maintain same proportion of target classes in both sets
)

print("Data Split Summary:")
print("=" * 30)
print(f"Total samples: {len(df)}")
print(f"Training samples: {len(X_train)} ({len(X_train)/len(df)*100:.1f}%)")
print(f"Testing samples: {len(X_test)} ({len(X_test)/len(df)*100:.1f}%)")

print(f"\nTraining set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")

# Check target distribution in both sets
print(f"\nTarget distribution in training set:")
print(y_train.value_counts(normalize=True))
print(f"\nTarget distribution in testing set:")
print(y_test.value_counts(normalize=True))

In [None]:
# Feature Scaling
# Logistic regression can benefit from feature scaling, especially when features have different scales
scaler = StandardScaler()

# Fit the scaler on training data and transform both training and test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Feature Scaling Applied:")
print("=" * 30)
print("✓ Features have been standardized (mean=0, std=1)")
print(f"Original training data range: {X_train.min().min():.2f} to {X_train.max().max():.2f}")
print(f"Scaled training data range: {X_train_scaled.min():.2f} to {X_train_scaled.max():.2f}")

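# Convert back to DataFrame for easier handling (optional)
# Keeping the original row index lets the scaled frames line up with y_train and y_test
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns, index=X_test.index)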
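
To see why the scaler is fitted on the training data only, here is a small sketch on synthetic numbers. The two columns are random stand-ins for features on different scales, and `demo_scaler` is a name introduced just for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for two features on very different scales
train = rng.normal(loc=[50.0, 200.0], scale=[10.0, 40.0], size=(200, 2))
test = rng.normal(loc=[50.0, 200.0], scale=[10.0, 40.0], size=(50, 2))

demo_scaler = StandardScaler().fit(train)      # statistics come from train only
train_scaled = demo_scaler.transform(train)
test_scaled = demo_scaler.transform(test)      # reuses the training mean and std

print("train mean:", train_scaled.mean(axis=0).round(3))
print("train std: ", train_scaled.std(axis=0).round(3))
print("test mean: ", test_scaled.mean(axis=0).round(3))
```

The training columns come out with mean 0 and standard deviation 1 by construction. The test columns are only approximately standardized, which is expected: no information from the test set has leaked into the transformation.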

## 6. Implement Logistic Regression Model

Now we'll create and train our logistic regression model. Scikit-learn makes this process straightforward.

### Key Parameters of LogisticRegression:
- **random_state**: For reproducible results
- **max_iter**: Maximum number of iterations for optimization
- **solver**: Algorithm for optimization ('liblinear' works well for small datasets)

In [None]:
# Create and train the logistic regression model
lr_model = LogisticRegression(
    random_state=42,      # For reproducible results
    max_iter=1000,        # Increase if convergence issues occur
    solver='liblinear'    # Good solver for smaller datasets
)

# Train the model
print("Training Logistic Regression Model...")
lr_model.fit(X_train_scaled, y_train)
print("✓ Model training completed!")

# Make predictions
y_train_pred = lr_model.predict(X_train_scaled)
y_test_pred = lr_model.predict(X_test_scaled)

# Get prediction probabilities
y_train_proba = lr_model.predict_proba(X_train_scaled)[:, 1]  # Probability of class 1
y_test_proba = lr_model.predict_proba(X_test_scaled)[:, 1]

print("\nModel Predictions Generated:")
print(f"Training predictions shape: {y_train_pred.shape}")
print(f"Test predictions shape: {y_test_pred.shape}")

In [None]:
# Display model coefficients
print("Model Coefficients Analysis:")
print("=" * 40)

# Create a DataFrame for better visualization
coefficients_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': lr_model.coef_[0],
    'Abs_Coefficient': np.abs(lr_model.coef_[0])
}).sort_values('Abs_Coefficient', ascending=False)

print("Feature Importance (based on coefficient magnitude):")
print(coefficients_df)

print(f"\nIntercept (bias term): {lr_model.intercept_[0]:.4f}")

# Interpretation note
print("\nCoefficient Interpretation:")
print("• Positive coefficient: Increases probability of heart disease")
print("• Negative coefficient: Decreases probability of heart disease")
print("• Larger absolute value: Stronger effect on the log-odds (comparable across features because they are standardized)")
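
Coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below uses hypothetical coefficient values and placeholder feature names, not the fitted values from `lr_model`.

```python
import numpy as np

def odds_ratio(coef: float) -> float:
    """Convert a log-odds coefficient into an odds ratio."""
    return float(np.exp(coef))

# Hypothetical coefficients on standardized features (not fitted values)
example_coefs = {"feature_a": 0.85, "feature_b": -0.60, "feature_c": 0.0}

for name, coef in example_coefs.items():
    print(f"{name}: coefficient = {coef:+.2f}, odds ratio = {odds_ratio(coef):.2f}")
```

Because the features were standardized, an odds ratio here describes the change in odds for a one-standard-deviation increase in that feature with the others held fixed. A ratio above 1 raises the odds, below 1 lowers them, and exactly 1 means no effect.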

## 7. Evaluate Model Performance

Model evaluation is crucial to understand how well our logistic regression model performs. We'll use several metrics:

### Classification Metrics:
- **Accuracy**: Overall correct predictions
- **Precision**: True positives / (True positives + False positives)
- **Recall (Sensitivity)**: True positives / (True positives + False negatives)
- **F1-Score**: Harmonic mean of precision and recall
- **ROC-AUC**: Area under the ROC curve (measures model's ability to distinguish classes)

In [None]:
# Calculate accuracy scores
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print("Model Performance Summary:")
print("=" * 50)
print(f"Training Accuracy: {train_accuracy:.4f} ({train_accuracy*100:.2f}%)")
print(f"Testing Accuracy:  {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")

# Calculate ROC-AUC scores
train_auc = roc_auc_score(y_train, y_train_proba)
test_auc = roc_auc_score(y_test, y_test_proba)

print(f"\nTraining ROC-AUC: {train_auc:.4f}")
print(f"Testing ROC-AUC:  {test_auc:.4f}")

# Check for overfitting
accuracy_diff = train_accuracy - test_accuracy
if accuracy_diff > 0.05:
    print(f"\n⚠ Possible overfitting detected (difference: {accuracy_diff:.4f})")
else:
    print(f"\n✓ Good generalization (difference: {accuracy_diff:.4f})")
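
A single train/test split gives one noisy estimate of generalization. As a rough sketch of a more robust check, the example below runs 5-fold stratified cross-validation with the scaler inside a pipeline. It uses synthetic data from `make_classification` as a stand-in, and it uses the default solver rather than the one configured above, so the numbers say nothing about the heart disease model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; in the notebook you would pass X and y instead
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)

# Scaling lives inside the pipeline, so each fold fits its own scaler
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

The spread across folds gives a sense of how much a 0.05 train/test gap could be due to chance alone.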

In [None]:
# Detailed classification report
print("\nDetailed Classification Report (Test Set):")
print("=" * 60)
print(classification_report(y_test, y_test_pred, 
                          target_names=['No Heart Disease', 'Heart Disease']))

# Confusion Matrix
cm = confusion_matrix(y_test, y_test_pred)
print("\nConfusion Matrix (Test Set):")
print("=" * 40)
print(f"True Negatives (TN):  {cm[0,0]}")
print(f"False Positives (FP): {cm[0,1]}")
print(f"False Negatives (FN): {cm[1,0]}")
print(f"True Positives (TP):  {cm[1,1]}")

# Calculate additional metrics manually
precision = cm[1,1] / (cm[1,1] + cm[0,1])
recall = cm[1,1] / (cm[1,1] + cm[1,0])
specificity = cm[0,0] / (cm[0,0] + cm[0,1])

print(f"\nManual Calculation Verification:")
print(f"Precision (PPV): {precision:.4f}")
print(f"Recall (Sensitivity): {recall:.4f}")
print(f"Specificity: {specificity:.4f}")
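
The manual calculations above divide by zero if the model never predicts a class. A slightly more defensive sketch, which also covers the F1-score from the metric list, might look like this. The counts are hypothetical and are not results from this notebook.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 as the harmonic mean of precision and recall, with zero guards."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, not results from this notebook
print(f"F1 = {f1_from_counts(tp=40, fp=10, fn=10):.3f}")
```

With these counts precision and recall are both 0.8, so their harmonic mean is also 0.8.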

## 8. Visualize Results

Visual representations help us better understand our model's performance and the relationships in our data.

In [None]:
# Visualize Confusion Matrix
plt.figure(figsize=(12, 5))

# Confusion Matrix Heatmap
plt.subplot(1, 2, 1)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['No Disease', 'Disease'],
            yticklabels=['No Disease', 'Disease'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

# Feature Importance (Coefficient Magnitude)
plt.subplot(1, 2, 2)
top_features = coefficients_df.head(8)  # Top 8 features
plt.barh(range(len(top_features)), top_features['Abs_Coefficient'])
plt.yticks(range(len(top_features)), top_features['Feature'])
plt.xlabel('Absolute Coefficient Value')
plt.title('Top 8 Most Important Features')
plt.gca().invert_yaxis()

plt.tight_layout()
plt.show()

In [None]:
# ROC Curve Visualization
fpr, tpr, thresholds = roc_curve(y_test, y_test_proba)

plt.figure(figsize=(10, 6))

# ROC Curve
plt.subplot(1, 2, 1)
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC Curve (AUC = {test_auc:.3f})')
plt.plot([0, 1], [0, 1], color='red', lw=2, linestyle='--', label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.grid(True)

# Prediction Probability Distribution
plt.subplot(1, 2, 2)
plt.hist(y_test_proba[y_test == 0], bins=20, alpha=0.7, label='No Disease', color='lightcoral')
plt.hist(y_test_proba[y_test == 1], bins=20, alpha=0.7, label='Disease', color='lightblue')
plt.xlabel('Predicted Probability')
plt.ylabel('Frequency')
plt.title('Distribution of Predicted Probabilities')
plt.axvline(x=0.5, color='black', linestyle='--', label='Decision Threshold')
plt.legend()  # called after axvline so the threshold line appears in the legend

plt.tight_layout()
plt.show()

print(f"ROC-AUC Interpretation:")
print(f"• AUC = {test_auc:.3f}")
if test_auc > 0.9:
    print("• Excellent model performance")
elif test_auc > 0.8:
    print("• Good model performance")
elif test_auc > 0.7:
    print("• Fair model performance")
else:
    print("• Poor model performance")
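
The 0.5 threshold is a default, not a requirement. One common heuristic for choosing another threshold from the ROC curve is Youden's J statistic (TPR minus FPR). The sketch below applies it to a handful of made-up labels and scores; it is one option among several, and the names ending in `_demo` are introduced only for this example.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and predicted probabilities
y_true = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.75, 0.80, 0.95])

fpr_demo, tpr_demo, thr_demo = roc_curve(y_true, y_score)

# Youden's J = TPR - FPR; pick the threshold that maximizes it
j_scores = tpr_demo - fpr_demo
best = int(np.argmax(j_scores))
best_threshold = thr_demo[best]

# Apply the chosen threshold instead of the default 0.5
y_pred_custom = (y_score >= best_threshold).astype(int)
print(f"Best threshold: {best_threshold:.2f} (J = {j_scores[best]:.2f})")
```

Picking the threshold on the test set makes the reported metrics optimistic, so in practice it is better chosen on a validation split or inside cross-validation.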

## 9. Practical Application: Making Predictions

Let's see how we can use our trained model to make predictions for new patients.

In [None]:
# Example: Predict for a new patient
# Let's take a few examples from our test set
sample_patients = X_test.iloc[:3].copy()

print("Sample Patients for Prediction:")
print("=" * 50)
for i, (idx, patient) in enumerate(sample_patients.iterrows()):
    print(f"\nPatient {i+1} (Index {idx}):")
    print(f"Age: {patient['age']}, Sex: {'Male' if patient['sex']==1 else 'Female'}")
    print(f"Chest Pain Type: {patient['cp']}, Max Heart Rate: {patient['thalach']}")
    print(f"Cholesterol: {patient['chol']}, Resting BP: {patient['trestbps']}")
    
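    # Scale the patient data
    # Keep it as a one-row DataFrame so the feature names match the fitted scaler and model
    patient_df = sample_patients.loc[[idx]]
    patient_scaled = pd.DataFrame(scaler.transform(patient_df), columns=X.columns)
    
    # Make prediction
    prediction = lr_model.predict(patient_scaled)[0]
    probability = lr_model.predict_proba(patient_scaled)[0, 1]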
    
    # Get actual result
    actual = y_test.loc[idx]
    
    print(f"Predicted: {'Heart Disease' if prediction==1 else 'No Heart Disease'}")
    print(f"Probability of Heart Disease: {probability:.3f}")
    print(f"Actual: {'Heart Disease' if actual==1 else 'No Heart Disease'}")
    print(f"Prediction: {'✓ Correct' if prediction==actual else '✗ Incorrect'}")
    print("-" * 40)
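
The patients above come from the test set. For a genuinely new patient entered by hand, the main risk is passing features in the wrong order. The helper below is a hypothetical sketch (the name `patient_to_frame` and the sample values are made up) that builds a one-row DataFrame in a given column order. The order shown follows the variable table in Section 3 and is an assumption; in the notebook you would pass `list(X.columns)`.

```python
import pandas as pd

def patient_to_frame(patient: dict, feature_names: list) -> pd.DataFrame:
    """Build a one-row DataFrame whose columns follow the training order."""
    missing = [name for name in feature_names if name not in patient]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    return pd.DataFrame([patient])[feature_names]

# Assumed column order; use list(X.columns) in the notebook
feature_order = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                 "thalach", "exang", "oldpeak", "slope", "ca", "thal"]

# Hypothetical patient; the values are made up for illustration
new_patient = {"sex": 1, "age": 54, "cp": 0, "trestbps": 130, "chol": 246,
               "fbs": 0, "restecg": 1, "thalach": 150, "exang": 0,
               "oldpeak": 1.0, "slope": 1, "ca": 0, "thal": 2}

row = patient_to_frame(new_patient, feature_order)
print(row)
```

The resulting `row` can then go through `scaler.transform` and `lr_model.predict_proba` in the same way as the test-set patients.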

### 9.1 Save the Trained Model

Let's save our trained model and scaler so we can use them in a web application later.

In [None]:
# Import joblib for model saving
import joblib
import os

# Create a models directory
os.makedirs('models', exist_ok=True)

# Save the trained model
model_filename = 'models/heart_disease_logistic_model.pkl'
joblib.dump(lr_model, model_filename)
print(f"✓ Model saved as: {model_filename}")

# Save the scaler
scaler_filename = 'models/heart_disease_scaler.pkl'
joblib.dump(scaler, scaler_filename)
print(f"✓ Scaler saved as: {scaler_filename}")

# Save feature names for reference
feature_names = list(X.columns)
feature_filename = 'models/feature_names.pkl'
joblib.dump(feature_names, feature_filename)
print(f"✓ Feature names saved as: {feature_filename}")

# Save model metadata
model_metadata = {
    'model_type': 'Logistic Regression',
    'features': feature_names,
    'target': 'Heart Disease (0: No, 1: Yes)',
    'test_accuracy': test_accuracy,
    'test_auc': test_auc,
    'training_date': '2025-08-29',
    'feature_descriptions': {
        'age': 'Age of patient (years)',
        'sex': 'Gender (1: Male, 0: Female)',
        'cp': 'Chest pain type (0-3)',
        'trestbps': 'Resting blood pressure (mm Hg)',
        'chol': 'Serum cholesterol (mg/dl)',
        'fbs': 'Fasting blood sugar > 120 mg/dl (1: True, 0: False)',
        'restecg': 'Resting ECG results (0-2)',
        'thalach': 'Maximum heart rate achieved',
        'exang': 'Exercise induced angina (1: Yes, 0: No)',
        'oldpeak': 'ST depression induced by exercise',
        'slope': 'Slope of peak exercise ST segment (0-2)',
        'ca': 'Number of major vessels (0-4)',
        'thal': 'Thalassemia (1-3)'
    }
}

metadata_filename = 'models/model_metadata.pkl'
joblib.dump(model_metadata, metadata_filename)
print(f"✓ Model metadata saved as: {metadata_filename}")

print(f"\nModel Performance Summary:")
print(f"• Test Accuracy: {test_accuracy:.3f}")
print(f"• Test ROC-AUC: {test_auc:.3f}")
print(f"\nAll files saved successfully! Ready for deployment.")
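
Saving is only half of the round trip. Here is a minimal sketch of loading the artifacts back and confirming they reproduce the original predictions. It trains a small stand-in model on synthetic data and writes to a temporary directory, so it does not touch the `models/` files created above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Train a small stand-in model on synthetic data
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)
demo_scaler = StandardScaler().fit(X_demo)
demo_model = LogisticRegression(max_iter=1000).fit(demo_scaler.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    scaler_path = os.path.join(tmp_dir, "scaler.pkl")
    joblib.dump(demo_model, model_path)
    joblib.dump(demo_scaler, scaler_path)

    # Reload both artifacts, as a serving process would
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# The reloaded pair should reproduce the original probabilities
original = demo_model.predict_proba(demo_scaler.transform(X_demo))[:, 1]
restored = loaded_model.predict_proba(loaded_scaler.transform(X_demo))[:, 1]
print("Round trip matches:", np.allclose(original, restored))
```

The model and scaler must always be loaded together, since the model only makes sense on inputs scaled with the same statistics. Files written by `joblib` should only be loaded from trusted sources, ideally with the same scikit-learn version that saved them.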

## 10. Summary and Key Learnings

### What We Accomplished:
1. ✅ **Loaded and explored** a real-world heart disease dataset
2. ✅ **Understood all data variables** and their medical significance  
3. ✅ **Preprocessed the data** (checked for missing values, scaled features)
4. ✅ **Split data** into training and testing sets
5. ✅ **Implemented logistic regression** using scikit-learn
6. ✅ **Evaluated model performance** using multiple metrics
7. ✅ **Visualized results** with confusion matrix and ROC curve
8. ✅ **Made practical predictions** for new patients

### Key Insights from Our Model:
- **Model Performance**: Judge the model by the test accuracy and ROC-AUC printed in Section 7 rather than by the training scores
- **Important Features**: The coefficient table in Section 6 ranks features by absolute coefficient; expect chest pain type, maximum heart rate, and other cardiac indicators near the top
- **Generalization**: A train-test accuracy gap below 0.05 (the check in Section 7) suggests limited overfitting, although a single split is a noisy estimate

### When to Use Logistic Regression:
✅ **Good for:**
- Binary classification problems (Yes/No, True/False)
- When you need interpretable results
- Linear relationships between features and log-odds
- Baseline model for comparison
- When you need probability estimates

❌ **Not ideal for:**
- Complex non-linear relationships
- Image or text classification (usually)
- When high accuracy is critical and interpretability is not

### Next Steps:
1. **Feature Engineering**: Create new features or transform existing ones
2. **Hyperparameter Tuning**: Optimize model parameters
3. **Try Other Algorithms**: Compare with Random Forest, SVM, etc.
4. **Cross-Validation**: Use k-fold CV for more robust evaluation
5. **Handle Class Imbalance**: If needed, use techniques like SMOTE

### Medical Context Note:
⚠️ **Important**: This model is for educational purposes only. Real medical diagnosis requires professional medical evaluation and should never rely solely on machine learning predictions.

## 11. Practice Exercises

### Exercise 1: Model Improvement
Try the following modifications and compare results:
1. Use different train-test split ratios (70-30, 90-10)
2. Try different solvers ('lbfgs', 'newton-cg', 'sag')
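3. Vary the regularization strength by adjusting the `C` parameter (L2 regularization is already applied by default with `C=1.0`; smaller values of `C` mean stronger regularization)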

### Exercise 2: Feature Analysis
1. Create correlation heatmap of all features
2. Identify which features are most correlated with the target
3. Try building a model with only the top 5 most important features

### Exercise 3: Threshold Optimization
1. Plot precision-recall curve
2. Find the optimal threshold for classification (instead of 0.5)
3. Calculate metrics using the optimal threshold

### Exercise 4: Real-world Application
Create a simple function that takes patient data as input and returns:
1. Risk level (Low, Medium, High)
2. Probability percentage
3. Key risk factors for that patient

---

**End of Session 7: Logistic Regression**

*Next Session Preview: We'll explore more advanced classification algorithms like Random Forest and Support Vector Machines, and learn about ensemble methods.*