# Heart Disease Diagnosis using Neural Networks

This notebook demonstrates:
1. Loading and exploring medical tabular data
2. Data preprocessing and feature scaling
3. **Data Augmentation using SMOTE** (for handling class imbalance)
4. Building a Neural Network for classification
5. Model evaluation and predictions

## Task 1: Import Required Libraries

In [None]:
# Import libraries for data manipulation
import pandas as pd  # for working with tabular data
import numpy as np   # for numerical operations

# Import libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import libraries for preprocessing and modeling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Import SMOTE for data augmentation
from imblearn.over_sampling import SMOTE

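# Import Keras for neural network
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout

# Set random seeds for reproducibility
# np.random.seed alone does not seed Keras weight initialization or dropout;
# keras.utils.set_random_seed seeds Python, NumPy, and the Keras backend together
np.random.seed(42)
keras.utils.set_random_seed(42)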

## Task 2: Load the Dataset

In [None]:
# Load the heart disease dataset
df = pd.read_csv('heart_disease_data.csv')

# Display first few rows
print("First 5 rows of the dataset:")
df.head()

**Feature Descriptions:**
- `age`: Age in years
- `sex`: Sex (1 = male, 0 = female)
- `cp`: Chest pain type (0-3)
- `trestbps`: Resting blood pressure (mm Hg)
- `chol`: Serum cholesterol (mg/dl)
- `fbs`: Fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
- `restecg`: Resting electrocardiographic results (0-2)
- `thalach`: Maximum heart rate achieved
- `exang`: Exercise induced angina (1 = yes, 0 = no)
- `oldpeak`: ST depression induced by exercise
- `slope`: Slope of the peak exercise ST segment (0-2)
- `ca`: Number of major vessels (0-3) colored by fluoroscopy
- `thal`: Thalassemia (1-3)
- `target`: Diagnosis of heart disease (1 = disease, 0 = no disease)

## Task 3: Exploratory Data Analysis

In [None]:
# Check dataset shape and info
print(f"Dataset Shape: {df.shape}")
print(f"\nDataset Info:")
df.info()

In [None]:
# Check for missing values
print("Missing Values:")
df.isnull().sum()

In [None]:
# Statistical summary
print("Statistical Summary:")
df.describe()

## Task 4: Check Class Distribution (Before Data Augmentation)

In [None]:
# Check class distribution
class_counts = df['target'].value_counts()
print("Class Distribution (Before Augmentation):")
print(class_counts)
print(f"\nClass 0 (No Disease): {class_counts[0]}")
print(f"Class 1 (Disease): {class_counts[1]}")
print(f"\nImbalance Ratio (majority:minority): {class_counts.max() / class_counts.min():.2f}:1")

In [None]:
# Visualize class distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='target', data=df)
plt.title('Class Distribution Before Data Augmentation')
plt.xlabel('Target (0 = No Disease, 1 = Disease)')
plt.ylabel('Count')
plt.show()
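
The imbalance check above can be reproduced on a small, made-up label vector. The sketch below uses hypothetical labels (not taken from the heart disease dataset) and a hypothetical helper, `imbalance_ratio`, that reports the majority-to-minority ratio so the result is always at least 1.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Return (class counts, majority/minority ratio) for a sequence of labels."""
    counts = Counter(labels)
    majority = max(counts.values())
    minority = min(counts.values())
    return counts, majority / minority

# Hypothetical toy labels: 6 negatives, 4 positives
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
counts, ratio = imbalance_ratio(toy_labels)

print(counts)
print(f"Imbalance ratio: {ratio:.2f}:1")
```

A ratio close to 1 indicates nearly balanced classes; larger values indicate a stronger imbalance.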

## Task 5: Prepare Features and Target

In [None]:
# Separate features (X) and target (y)
X = df.drop('target', axis=1)  # All columns except target
y = df['target']               # Only target column

print(f"Features Shape: {X.shape}")
print(f"Target Shape: {y.shape}")
print(f"\nFeature Columns: {list(X.columns)}")

## Task 6: Split Data into Train and Test Sets

In [None]:
# Split the data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42,
    stratify=y  # Preserve the class proportions in both splits
)

print(f"Training Set Size: {X_train.shape[0]}")
print(f"Test Set Size: {X_test.shape[0]}")
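
To illustrate what stratification does, here is a simplified NumPy sketch. It is not scikit-learn's implementation; `stratified_split_indices` and the toy labels are hypothetical. Each class contributes roughly the same fraction of its samples to the test set, so both splits keep the original class proportions.

```python
import numpy as np

def stratified_split_indices(y, test_size=0.2, seed=42):
    """Split indices so each class sends about `test_size` of its samples to the test set."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        # Shuffle the indices belonging to this class only
        cls_idx = rng.permutation(np.flatnonzero(y == cls))
        n_test = int(round(test_size * len(cls_idx)))
        test_idx.extend(cls_idx[:n_test])
        train_idx.extend(cls_idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Hypothetical toy labels: 60 negatives, 40 positives
y_toy = np.array([0] * 60 + [1] * 40)
train_idx, test_idx = stratified_split_indices(y_toy)

print("Train positive rate:", y_toy[train_idx].mean())
print("Test positive rate: ", y_toy[test_idx].mean())
```

Both rates match the 40% positive rate of the full toy label vector, which is the property `stratify=y` is meant to provide.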

## Task 7: Feature Scaling (Standardization)

In [None]:
# Scale features to have zero mean and unit variance
# This is important for neural networks
scaler = StandardScaler()

# Fit on training data and transform both train and test
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Scaled Training Data Shape: {X_train_scaled.shape}")
print(f"Scaled Test Data Shape: {X_test_scaled.shape}")
print(f"\nFirst row of scaled data (first 5 features):")
print(X_train_scaled[0, :5])
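
The fit/transform pattern above can be written out by hand. The following sketch uses hypothetical toy values (not rows from the dataset) to show that the mean and standard deviation come from the training data only and are then reused for the test data.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales
X_train_toy = np.array([[40.0, 200.0], [50.0, 240.0], [60.0, 280.0]])
X_test_toy = np.array([[55.0, 220.0]])

# "Fit": compute statistics from the training data only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "Transform": apply the same training statistics to both sets
X_train_std = (X_train_toy - mean) / std
X_test_std = (X_test_toy - mean) / std

print("Train mean per feature:", X_train_std.mean(axis=0))
print("Train std per feature: ", X_train_std.std(axis=0))
print("Scaled test row:       ", X_test_std)
```

Reusing the training statistics keeps information about the test set out of the preprocessing step.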

## Task 8: DATA AUGMENTATION using SMOTE

**SMOTE (Synthetic Minority Over-sampling Technique)** plays a role for tabular data that is loosely analogous to data augmentation for images.

- **Image Augmentation**: Rotate, flip, zoom images to create new examples
- **SMOTE**: Creates synthetic examples by interpolating between existing minority class samples

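Balancing the training set this way keeps the model from favoring the majority class. SMOTE is applied to the training data only, so the test set still reflects the original class distribution.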
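
The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the imbalanced-learn implementation; the function `smote_like_samples`, its parameters, and the toy points are hypothetical.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=2, seed=42):
    """Create n_new synthetic rows by interpolating toward one of the k nearest neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))           # random minority sample
        neighbour = neighbours[base, rng.integers(k)]  # one of its k nearest neighbours
        lam = rng.random()                             # interpolation weight in [0, 1)
        synthetic[i] = X_minority[base] + lam * (X_minority[neighbour] - X_minority[base])
    return synthetic

# Hypothetical minority-class points described by two scaled features
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_like_samples(X_min, n_new=3)
print(new_rows)
```

Every synthetic row lies on the line segment between two existing minority samples, so the new points stay inside the region the minority class already occupies.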

In [None]:
# Apply SMOTE to training data only
print("Applying SMOTE Data Augmentation...")
print("BEFORE SMOTE:")
print(f"Class 0 samples: {sum(y_train == 0)}")
print(f"Class 1 samples: {sum(y_train == 1)}")

# Create SMOTE object
# The default sampling_strategy='auto' oversamples every class except the majority
# until all classes have the same number of samples
smote = SMOTE(random_state=42)

# Apply SMOTE to training data
X_train_augmented, y_train_augmented = smote.fit_resample(X_train_scaled, y_train)

print("\nAFTER SMOTE:")
print(f"Class 0 samples: {sum(y_train_augmented == 0)}")
print(f"Class 1 samples: {sum(y_train_augmented == 1)}")
print(f"\nNew Training Set Size: {X_train_augmented.shape[0]}")
print(f"Samples Added by SMOTE: {X_train_augmented.shape[0] - X_train_scaled.shape[0]}")

In [None]:
# Visualize class distribution after augmentation
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.countplot(x=y_train)
plt.title('Before SMOTE')
plt.xlabel('Target')
plt.ylabel('Count')

plt.subplot(1, 2, 2)
sns.countplot(x=y_train_augmented)
plt.title('After SMOTE (Augmented)')
plt.xlabel('Target')
plt.ylabel('Count')

plt.tight_layout()
plt.show()

## Task 9: Build the Neural Network Model

In [None]:
# Create a sequential neural network
model = Sequential()

# Input layer and first hidden layer
# We have 13 input features
model.add(Dense(units=64, activation='relu', input_dim=13))
model.add(Dropout(0.3))  # Dropout helps prevent overfitting

# Second hidden layer
model.add(Dense(units=32, activation='relu'))
model.add(Dropout(0.2))

# Third hidden layer
model.add(Dense(units=16, activation='relu'))

# Output layer (binary classification)
# Sigmoid activation outputs probability between 0 and 1
model.add(Dense(units=1, activation='sigmoid'))

# Compile the model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Display model architecture
model.summary()
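
The parameter counts reported by `model.summary()` can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit, and Dropout layers add no parameters. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

# Input features followed by the widths of the four Dense layers
layer_sizes = [13, 64, 32, 16, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]

print(per_layer)       # [896, 2080, 528, 17]
print(sum(per_layer))  # 3521
```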

**Model Architecture Explanation:**

1. **Input Layer**: 13 neurons (one for each feature)
2. **Hidden Layer 1**: 64 neurons with ReLU activation
3. **Dropout**: Randomly zeroes 30% of the previous layer's outputs during training only (helps reduce overfitting)
4. **Hidden Layer 2**: 32 neurons with ReLU activation
5. **Dropout**: Randomly zeroes 20% of the previous layer's outputs during training only
6. **Hidden Layer 3**: 16 neurons with ReLU activation
7. **Output Layer**: 1 neuron with Sigmoid (outputs probability)
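
To make the dropout steps concrete, here is a minimal NumPy sketch of inverted dropout, the variant in which surviving activations are rescaled during training. The function name and the toy activations are hypothetical; Keras handles all of this internally.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a random fraction `rate` of units and rescale the rest."""
    if not training:
        return activations  # inference: values pass through unchanged
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(42)
hidden = np.ones((1000, 64))  # hypothetical output of the first hidden layer

train_out = dropout(hidden, rate=0.3, training=True, rng=rng)
test_out = dropout(hidden, rate=0.3, training=False, rng=rng)

print("Fraction zeroed in training:", (train_out == 0).mean())
print("Mean activation in training:", train_out.mean())
print("Unchanged at inference:", np.array_equal(test_out, hidden))
```

Rescaling by `1 / (1 - rate)` keeps the expected activation the same in training and inference, which is why no adjustment is needed at prediction time.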

## Task 10: Train the Model

In [None]:
# Train the model on augmented data
history = model.fit(
    X_train_augmented, 
    y_train_augmented,
    batch_size=32,
    epochs=100,
    validation_data=(X_test_scaled, y_test),  # used for monitoring only; do not tune on the test set
    verbose=1
)

## Task 11: Visualize Training Performance

In [None]:
# Plot training history
plt.figure(figsize=(14, 5))

# Plot accuracy
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy Over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

# Plot loss
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss Over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.tight_layout()
plt.show()

## Task 12: Evaluate the Model

In [None]:
# Make predictions on test set
y_pred_prob = model.predict(X_test_scaled)
y_pred = (y_pred_prob > 0.5).astype(int).ravel()  # Convert probabilities to a 1-D array of class labels

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy * 100:.2f}%")

# Display classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['No Disease', 'Disease']))

In [None]:
# Display confusion matrix
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.xticks([0.5, 1.5], ['No Disease', 'Disease'])
plt.yticks([0.5, 1.5], ['No Disease', 'Disease'])
plt.show()

print(f"True Negatives (Correctly predicted No Disease): {cm[0,0]}")
print(f"False Positives (Incorrectly predicted Disease): {cm[0,1]}")
print(f"False Negatives (Incorrectly predicted No Disease): {cm[1,0]}")
print(f"True Positives (Correctly predicted Disease): {cm[1,1]}")
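
The four confusion-matrix cells are enough to derive the metrics shown in the classification report. The sketch below uses hypothetical counts (not results from this notebook) and a hypothetical helper, `binary_metrics`.

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive common metrics from the four cells of a binary confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # also called sensitivity
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts for illustration only
metrics = binary_metrics(tn=40, fp=10, fn=5, tp=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a diagnostic setting, recall deserves particular attention because each false negative is a patient with disease who was predicted healthy.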

## Task 13: Make Predictions on New Data

In [None]:
# Example 1: Predict for a new patient
# Features: [age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal]
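# A DataFrame with the training column names avoids scikit-learn's feature-name warning
new_patient_1 = pd.DataFrame(
    [[52, 1, 0, 125, 212, 0, 1, 168, 0, 1.0, 2, 2, 3]],
    columns=X.columns
)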

# Scale the new data using the same scaler
new_patient_1_scaled = scaler.transform(new_patient_1)

# Make prediction
prediction_1 = model.predict(new_patient_1_scaled)
print(f"Patient 1 Risk Probability: {prediction_1[0][0] * 100:.2f}%")
if prediction_1[0][0] > 0.5:
    print("Diagnosis: HIGH RISK of Heart Disease")
else:
    print("Diagnosis: LOW RISK of Heart Disease")

In [None]:
# Example 2: Another patient
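# A DataFrame with the training column names avoids scikit-learn's feature-name warning
new_patient_2 = pd.DataFrame(
    [[45, 0, 1, 130, 233, 0, 1, 198, 0, 0.2, 1, 0, 2]],
    columns=X.columns
)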

# Scale and predict
new_patient_2_scaled = scaler.transform(new_patient_2)
prediction_2 = model.predict(new_patient_2_scaled)

print(f"Patient 2 Risk Probability: {prediction_2[0][0] * 100:.2f}%")
if prediction_2[0][0] > 0.5:
    print("Diagnosis: HIGH RISK of Heart Disease")
else:
    print("Diagnosis: LOW RISK of Heart Disease")
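
Both examples repeat the same thresholding logic, which could be factored into a small helper. The function `risk_label` and the probabilities below are hypothetical and serve only to show how the decision threshold affects the label.

```python
def risk_label(probability, threshold=0.5):
    """Map a predicted probability to the notebook's HIGH/LOW risk label."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return "HIGH RISK" if probability > threshold else "LOW RISK"

# Hypothetical probabilities, not model outputs
for p in (0.12, 0.50, 0.87):
    print(f"{p:.2f} -> {risk_label(p)}")

# A lower threshold flags more patients: fewer missed cases, more false alarms
print(risk_label(0.40, threshold=0.3))
```

With a trained model, the call might look like `risk_label(float(prediction_1[0][0]))`.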

## Task 14: Compare With vs Without Data Augmentation

In [None]:
# Train a model WITHOUT SMOTE for comparison
print("Training model WITHOUT data augmentation...")

model_no_smote = Sequential()
model_no_smote.add(Dense(units=64, activation='relu', input_dim=13))
model_no_smote.add(Dropout(0.3))
model_no_smote.add(Dense(units=32, activation='relu'))
model_no_smote.add(Dropout(0.2))
model_no_smote.add(Dense(units=16, activation='relu'))
model_no_smote.add(Dense(units=1, activation='sigmoid'))
model_no_smote.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

history_no_smote = model_no_smote.fit(
    X_train_scaled, 
    y_train,
    batch_size=32,
    epochs=100,
    validation_data=(X_test_scaled, y_test),
    verbose=0
)

# Compare accuracies
accuracy_with_smote = model.evaluate(X_test_scaled, y_test, verbose=0)[1]
accuracy_without_smote = model_no_smote.evaluate(X_test_scaled, y_test, verbose=0)[1]

print("\n" + "="*50)
print("COMPARISON: With vs Without SMOTE")
print("="*50)
print(f"WITH SMOTE:     {accuracy_with_smote * 100:.2f}% accuracy")
print(f"WITHOUT SMOTE:  {accuracy_without_smote * 100:.2f}% accuracy")
print(f"Difference:     {(accuracy_with_smote - accuracy_without_smote) * 100:+.2f} percentage points")

## Summary

In this notebook, we:

1. ✅ Loaded and explored medical tabular data
2. ✅ Preprocessed features using StandardScaler
3. ✅ **Applied SMOTE data augmentation** to handle class imbalance
4. ✅ Built a neural network with dropout for regularization
5. ✅ Trained and evaluated the model
6. ✅ Made predictions on new patient data
7. ✅ Compared performance with and without data augmentation

**Key Takeaways:**
- **SMOTE** creates synthetic minority-class samples by interpolation to balance the classes (loosely analogous to image augmentation)
- **Feature scaling** is crucial for neural network performance
- **Dropout** helps prevent overfitting
- **Data augmentation** can improve model performance on imbalanced datasets, but the effect should be checked with per-class precision and recall, not accuracy alone