In [1]:
import pandas as pd

# Load the CSV file
df = pd.read_csv("heart.csv")

# Get and print column names
column_names = df.columns
print(column_names)


Index(['id', 'age', 'sex', 'dataset', 'cp', 'trestbps', 'chol', 'fbs',
       'restecg', 'thalch', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num'],
      dtype='object')


## Heart Disease Prediction Model

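In this project, I developed a heart disease prediction model using **Logistic Regression**. Logistic Regression is a statistical method for classification that predicts the probability that an instance belongs to a particular category. It is most often introduced for binary problems, but scikit-learn's `LogisticRegression` also handles multi-class targets such as the one used here.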

### Steps Involved:

1. **Data Loading**: The dataset was loaded using the Pandas library. The dataset used is `heart.csv`, which contains various features related to heart health.

2. **Preprocessing**:
   - **Label Encoding**: Categorical variables were converted to numerical format using `LabelEncoder` to facilitate the model training.
   - **Splitting the Dataset**: The dataset was divided into features (`X`) and the target variable (`y`). The target variable `num` takes integer values from 0 to 4, where 0 indicates the absence of heart disease and 1 to 4 indicate its presence.
   - **Train-Test Split**: The dataset was split into training (80%) and testing (20%) sets using `train_test_split`.

3. **Handling Missing Values**: The `SimpleImputer` was used to replace missing values in the feature matrix with the mean of the respective columns.

4. **Feature Scaling**: Standardization of features was performed using `StandardScaler`, which scales the data to have a mean of 0 and a standard deviation of 1, ensuring that the model is not biased towards features with larger ranges.

5. **Model Training**: The Logistic Regression model was trained on the standardized training data.

6. **Model Evaluation**:
   - Predictions were made on the test set.
   - The accuracy of the model was calculated using `accuracy_score`.
   - A classification report was generated, which includes precision, recall, and F1-score for each class.
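
The standardization in step 4 can be sketched by hand for a single column. This is a minimal illustration, not the notebook's code: the values are hypothetical, and the helper names `fit_standardizer` and `standardize` are made up for this example. It uses the population standard deviation, which is what `StandardScaler` computes, and reuses the training statistics for the test values.

```python
from statistics import fmean, pstdev

def fit_standardizer(column):
    """Learn the mean and population standard deviation from a training column."""
    return fmean(column), pstdev(column)

def standardize(column, mean, std):
    """Apply z = (x - mean) / std with statistics learned from the training data."""
    return [(x - mean) / std for x in column]

# Hypothetical cholesterol-like values, for illustration only.
train_chol = [200.0, 240.0, 180.0, 260.0]
test_chol = [220.0, 300.0]

mean, std = fit_standardizer(train_chol)
train_scaled = standardize(train_chol, mean, std)
test_scaled = standardize(test_chol, mean, std)  # test data reuses the training mean and std

print(train_scaled)
print(test_scaled)
```

Fitting on the training column only mirrors the `fit_transform` / `transform` pattern used for the imputer and scaler in the code cell.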

### Results

- The model achieved an accuracy of approximately 53%.
- The classification report provides detailed insights into the performance of the model on the test data.
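
To make the summary rows of the classification report concrete, the sketch below recomputes the macro and weighted F1 averages from the per-class values in the report printed by this section's code cell. The dictionaries are copied from that report; the variable names are only illustrative.

```python
# Per-class F1 scores and supports copied from the Logistic Regression report.
f1_by_class = {0: 0.76, 1: 0.45, 2: 0.06, 3: 0.14, 4: 0.00}
support = {0: 75, 1: 54, 2: 25, 3: 26, 4: 4}

total = sum(support.values())

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

# Weighted average: each class contributes in proportion to its support.
weighted_f1 = sum(f1_by_class[c] * support[c] for c in support) / total

print(f"macro F1:    {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
```

The gap between the two averages reflects how poorly the small classes (2, 3 and 4) are predicted compared with class 0.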

### Conclusion

Logistic Regression is a simple baseline for predicting heart disease, but an accuracy of about 53% on this five-class target suggests it is not the most effective model: it is only modestly better than always guessing the most common class.
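
As a rough reference point for the 53% figure, the following sketch computes the accuracy of a classifier that always predicts the most frequent class, using the test-set supports from the classification report. It is a back-of-the-envelope check, not part of the original pipeline.

```python
# Test-set class counts copied from the Logistic Regression classification report.
support = {0: 75, 1: 54, 2: 25, 3: 26, 4: 4}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy obtained by always predicting the majority class.
baseline_accuracy = support[majority_class] / total

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.2f}")
```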


In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load the dataset
data = pd.read_csv('heart.csv')

# Identify categorical columns and apply label encoding
categorical_cols = data.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_cols:
    data[column] = label_encoder.fit_transform(data[column])

# Split data into features and target variable
X = data.drop(columns=['num'])
y = data['num']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Impute missing values in the feature matrix
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions and evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy and classification report
print("Accuracy:", accuracy)
print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.5271739130434783
Classification Report:
               precision    recall  f1-score   support

           0       0.66      0.91      0.76        75
           1       0.45      0.46      0.45        54
           2       0.14      0.04      0.06        25
           3       0.17      0.12      0.14        26
           4       0.00      0.00      0.00         4

    accuracy                           0.53       184
   macro avg       0.28      0.31      0.28       184
weighted avg       0.44      0.53      0.47       184





## Heart Disease Prediction Model with Neural Network

In this section, I implemented a heart disease prediction model using a **Neural Network** built with TensorFlow and Keras. The model aims to classify individuals based on their health features and predict the heart disease class (`num`, 0 to 4).

### Model Architecture

1. **Input Layer**: Accepts the features from the dataset.
2. **Hidden Layers**:
   - The first hidden layer has 64 neurons, followed by hidden layers with 32 and 16 neurons. All three use ReLU activation.
   - Dropout layers (30% dropout rate) follow each of the first two hidden layers to reduce overfitting.
3. **Output Layer**: Uses softmax activation to predict the class probabilities for the target variable.
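
The softmax step in the output layer can be illustrated with a few lines of plain Python. This is a minimal sketch with hypothetical raw scores (logits), not the Keras implementation.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the five classes 0..4.
logits = [2.0, 1.0, 0.1, -1.0, -2.0]
probs = softmax(logits)

# The predicted class is the index with the highest probability (argmax).
predicted_class = probs.index(max(probs))

print([round(p, 3) for p in probs])
print("predicted class:", predicted_class)
```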

### Training Process

- The model was compiled with the **Adam** optimizer and trained using **categorical crossentropy** as the loss function, suitable for multi-class classification.
- The model was trained for 100 epochs with a batch size of 16, and a validation split of 20% was used to monitor performance during training.
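
The loss described above can be sketched for a single sample. The helpers `one_hot` and `categorical_crossentropy` below are hand-written for illustration, and the small `eps` clip is an assumption added to avoid `log(0)`; the predicted probabilities are hypothetical.

```python
import math

def one_hot(label, num_classes):
    """Encode an integer label as a one-hot vector."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Cross-entropy for one sample: minus the log-probability of the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_prob))

y_true = one_hot(2, 5)                       # true class is 2, out of classes 0..4
confident = [0.05, 0.05, 0.80, 0.05, 0.05]   # hypothetical prediction favouring class 2
unsure = [0.20, 0.20, 0.20, 0.20, 0.20]      # hypothetical uniform prediction

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)

print(round(loss_confident, 4), round(loss_unsure, 4))
```

A uniform prediction over five classes gives a loss of ln 5, about 1.61, which is consistent with the training loss of roughly 1.60 reported for the first epoch in the training log.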

### Results

- The neural network model achieved a **test accuracy of 54%**, about one percentage point above the Logistic Regression model (53%).

Test Accuracy: 0.54

### Conclusion
The neural network model achieves slightly higher accuracy than the Logistic Regression model (54% versus 53%). The improvement is modest, but it suggests that more complex models may be worth exploring for this classification task. Future work may include hyperparameter tuning, exploring different architectures, and using additional data to enhance model performance.



In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam

# Load the dataset
data = pd.read_csv('heart.csv')

# Identify categorical columns and apply label encoding
categorical_cols = data.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_cols:
    data[column] = label_encoder.fit_transform(data[column])

# Split data into features and target variable
X = data.drop(columns=['num'])
y = data['num']

# One-hot encode the target variable
y = to_categorical(y)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Impute missing values in the feature matrix
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define the neural network model
model = Sequential()
model.add(Dense(64, input_shape=(X_train.shape[1],), activation='relu'))
model.add(Dropout(0.3))  # Dropout for regularization
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(16, activation='relu'))
model.add(Dense(y.shape[1], activation='softmax'))  # Output layer with softmax activation

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=100, batch_size=16, validation_split=0.2, verbose=1)

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print("Test Accuracy:", accuracy)

# Predict and generate classification report
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)
y_true_classes = np.argmax(y_test, axis=1)

from sklearn.metrics import classification_report
print("Classification Report:\n", classification_report(y_true_classes, y_pred_classes))




Epoch 1/100
37/37 ━━━━━━━━━━━━━━━━━━━━ 6s 21ms/step - accuracy: 0.2318 - loss: 1.6024 - val_accuracy: 0.5203 - val_loss: 1.3971
Epoch 2/100
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.4435 - loss: 1.3792 - val_accuracy: 0.5676 - val_loss: 1.2379
Epoch 3/100
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.5035 - loss: 1.2523 - val_accuracy: 0.6081 - val_loss: 1.1053
Epoch 4/100
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.5416 - loss: 1.1271 - val_accuracy: 0.6351 - val_loss: 1.0303
Epoch 5/100
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.5524 - loss: 1.1185 - val_accuracy: 0.6419 - val_loss: 1.0030
Epoch 6/100
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.5573 - loss: 1.0712 - val_accuracy: 0.6554 - val_loss: 0.9710
Epoch 7/100
37/37 ━━



## Heart Disease Prediction Model with Random Forest

In this section, I implemented a heart disease prediction model using a **Random Forest Classifier**. Random Forest is an ensemble learning method that operates by constructing multiple decision trees during training and outputs the mode of the classes for classification.
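
The "mode of the classes" idea can be sketched as a simple majority vote over the predictions of individual trees. The votes below are hypothetical, and `majority_vote` is a hand-written helper for illustration only. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by counting hard votes, so this is a conceptual sketch rather than a description of that implementation.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical class predictions from seven trees for one patient.
votes = [0, 1, 1, 3, 1, 0, 1]
forest_prediction = majority_vote(votes)

print("forest prediction:", forest_prediction)
```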

### Data Preparation and Preprocessing

1. **Missing Values Handling**: 
   - Checked for missing values in the dataset and imputed them. Categorical columns were filled with their mode, while numerical columns were filled with their mean.

2. **Label Encoding**: 
   - Categorical columns were converted into numeric format using `LabelEncoder` to facilitate model training.

3. **Feature and Target Preparation**: 
   - The dataset was split into features (`X`) and the target variable (`y`). The target `num` takes values from 0 to 4, where 0 indicates the absence of heart disease and 1 to 4 indicate its presence.

4. **Handling Class Imbalance**: 
   - Applied **SMOTE (Synthetic Minority Over-sampling Technique)** to address class imbalance in the dataset. This technique generates synthetic samples for the minority classes, leading to a more balanced dataset.

5. **Train-Test Split**: 
   - The resampled dataset was split into training and testing sets (80% training, 20% testing).

6. **Feature Standardization**: 
   - Standardized the features using `StandardScaler` to ensure all input features are on the same scale, which can improve model performance.
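
The way SMOTE creates a synthetic sample (step 4) can be sketched as interpolation between a minority-class point and one of its neighbours. This is a simplified illustration: the helper `smote_like_sample` and the two feature vectors are hypothetical, and the nearest-neighbour search that the real algorithm performs is omitted.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Create a point on the line segment between x and one of its neighbours."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)

# Hypothetical (age, trestbps) values for two samples of the same minority class.
minority_point = [63.0, 145.0]
nearest_neighbor = [67.0, 160.0]

synthetic = smote_like_sample(minority_point, nearest_neighbor, rng)
print(synthetic)
```

Because each synthetic point lies between two real ones, resampling before the train-test split can place near-copies of training samples in the test set. One common precaution is to split first and resample only the training portion.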

### Model Training

- A Random Forest model was trained using 100 decision trees (`n_estimators=100`).

### Results

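- The Random Forest model achieved a **test accuracy of 87%**, well above the earlier Logistic Regression (53%) and Neural Network (54%) results. Because SMOTE was applied before the train-test split, this test set contains synthetic samples (411 test rows versus 184 earlier), so the figure is not directly comparable with the earlier models, which were evaluated on original data only.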

```plaintext
Test Accuracy: 0.87
```


In [11]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler

# Load the data
data = pd.read_csv('heart.csv')

# Check for missing values
print("Missing values in each column:\n", data.isnull().sum())

# Impute missing values (numerical columns with the mean, categorical columns with the mode)
for column in data.columns:
    if data[column].dtype == 'object':
        # Fill categorical columns with mode (assignment avoids the chained inplace call)
        data[column] = data[column].fillna(data[column].mode()[0])
    else:
        # Fill numerical columns with mean
        data[column] = data[column].fillna(data[column].mean())

# Convert categorical columns to numeric using Label Encoding
categorical_cols = data.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_cols:
    data[column] = label_encoder.fit_transform(data[column])

# Prepare features and target
X = data.drop(columns=['num'])
y = data['num']

# Perform SMOTE to handle class imbalance
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Split resampled data
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train a Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Evaluate model performance
y_pred = rf_model.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred))


Missing values in each column:
 id            0
age           0
sex           0
dataset       0
cp            0
trestbps     59
chol         30
fbs          90
restecg       2
thalch       55
exang        55
oldpeak      62
slope       309
ca          611
thal        486
num           0
dtype: int64




Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.91      0.86        85
           1       0.90      0.67      0.77        81
           2       0.83      0.88      0.85        72
           3       0.85      0.89      0.87        84
           4       0.96      0.99      0.97        89

    accuracy                           0.87       411
   macro avg       0.87      0.87      0.86       411
weighted avg       0.87      0.87      0.87       411



## Hyperparameter Tuning for Random Forest Classifier Using Grid Search

In this section, I employed **Grid Search** to optimize the hyperparameters of the **Random Forest Classifier**. Hyperparameter tuning is a crucial step in improving model performance, as it helps in finding the best combination of parameters that yield the highest predictive accuracy.

### Model Definition

The initial Random Forest model was defined with a fixed random state for reproducibility:

```python
rf_model = RandomForestClassifier(random_state=42)
```

### Parameter Grid Definition

To explore different hyperparameter combinations, I defined a parameter grid that included:

- `n_estimators`: Number of trees in the forest. Tested values: 100, 200.
- `max_depth`: Maximum depth of each tree. Tested values: None, 10, 20, 30.
- `min_samples_split`: Minimum number of samples required to split an internal node. Tested values: 2, 5, 10.
- `class_weight`: Weights associated with classes. Tested values: 'balanced', None.

### Grid Search Implementation
A GridSearchCV object was created with the Random Forest model and the defined parameter grid. The grid search utilized 5-fold cross-validation and was optimized for the weighted F1 score:



```python
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid,
                           scoring='f1_weighted', cv=5, verbose=1)
```

The grid search was then fitted to the training data:

```python
grid_search.fit(X_train, y_train)
```

### Best Parameters
After completing the grid search, the best hyperparameters were identified:

```plaintext
Best parameters: {'class_weight': 'balanced', 'max_depth': None, 'min_samples_split': 2, 'n_estimators': 200}
```

### Model Evaluation
The best model from the grid search was evaluated on the test set, and the results were summarized in the classification report:

```python
best_rf_model = grid_search.best_estimator_
y_pred = best_rf_model.predict(X_test)
print("Best Random Forest Classification Report:\n", classification_report(y_test, y_pred))
```

In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Define the model
rf_model = RandomForestClassifier(random_state=42)

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'class_weight': ['balanced', None]
}

# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, 
                           scoring='f1_weighted', cv=5, verbose=1)

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

# Best parameters
print("Best parameters:", grid_search.best_params_)

# Evaluate the best model on the test set
best_rf_model = grid_search.best_estimator_
y_pred = best_rf_model.predict(X_test)
print("Best Random Forest Classification Report:\n", classification_report(y_test, y_pred))


Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'class_weight': 'balanced', 'max_depth': None, 'min_samples_split': 2, 'n_estimators': 200}
Best Random Forest Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.87      0.84        85
           1       0.87      0.65      0.75        81
           2       0.81      0.89      0.85        72
           3       0.86      0.87      0.86        84
           4       0.94      0.99      0.96        89

    accuracy                           0.86       411
   macro avg       0.86      0.85      0.85       411
weighted avg       0.86      0.86      0.85       411
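
The log line "Fitting 5 folds for each of 48 candidates, totalling 240 fits" follows directly from the size of the parameter grid. The sketch below enumerates the combinations with `itertools.product`; it only counts candidates and does not train any model.

```python
from itertools import product

# Same grid as in the grid search cell.
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'class_weight': ['balanced', None],
}

# Every combination of one value per hyperparameter is one candidate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

cv_folds = 5
total_fits = len(candidates) * cv_folds  # each candidate is fitted once per fold

print(len(candidates), "candidates,", total_fits, "fits")
```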



In [15]:
pip install optuna


In [16]:
## Optuna Integration

import optuna
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def objective(trial):
    # Hyperparameter suggestions
    n_estimators = trial.suggest_int("n_estimators", 50, 300)
    max_depth = trial.suggest_int("max_depth", 5, 50)
    min_samples_split = trial.suggest_int("min_samples_split", 2, 10)
    class_weight = trial.suggest_categorical("class_weight", ["balanced", None])

    # Create the model
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        class_weight=class_weight,
        random_state=42
    )

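    # Train-test split
    # Note: X and y here are the imputed, label-encoded data from the Random Forest cell,
    # before SMOTE resampling and standardization
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)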

    # Fit the model
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Return the F1 score as the objective metric to maximize
    return f1_score(y_test, y_pred, average='weighted')


In [17]:
# Create a study object
study = optuna.create_study(direction="maximize")

# Optimize the objective function
study.optimize(objective, n_trials=100)

# Print the best trial
best_trial = study.best_trial
print("Best trial:")
print(f"  F1 Score: {best_trial.value}")
print("  Params:")
for key, value in best_trial.params.items():
    print(f"    {key}: {value}")


[I 2024-10-29 22:12:56,791] A new study created in memory with name: no-name-118ec5ca-db92-440a-8e25-7375976f3a45
[I 2024-10-29 22:12:57,472] Trial 0 finished with value: 0.6504695776620535 and parameters: {'n_estimators': 173, 'max_depth': 42, 'min_samples_split': 9, 'class_weight': 'balanced'}. Best is trial 0 with value: 0.6504695776620535.
[I 2024-10-29 22:12:58,391] Trial 1 finished with value: 0.6451905658068154 and parameters: {'n_estimators': 241, 'max_depth': 35, 'min_samples_split': 9, 'class_weight': 'balanced'}. Best is trial 0 with value: 0.6504695776620535.
[I 2024-10-29 22:12:59,261] Trial 2 finished with value: 0.5732056888770635 and parameters: {'n_estimators': 208, 'max_depth': 46, 'min_samples_split': 4, 'class_weight': None}. Best is trial 0 with value: 0.6504695776620535.
[I 2024-10-29 22:12:59,701] Trial 3 finished with value: 0.54549127343245 and parameters: {'n_estimators': 114, 'max_depth': 37, 'min_samples_split': 8, 'class_weight': None}. Best is trial 0 with

Best trial:
  F1 Score: 0.6514970732883731
  Params:
    n_estimators: 127
    max_depth: 43
    min_samples_split: 9
    class_weight: balanced


In [18]:
# Train the best model with the best hyperparameters
best_params = best_trial.params
best_model = RandomForestClassifier(
    n_estimators=best_params["n_estimators"],
    max_depth=best_params["max_depth"],
    min_samples_split=best_params["min_samples_split"],
    class_weight=best_params["class_weight"],
    random_state=42
)

# Fit the model on the training split (the global X_train and y_train: SMOTE-resampled and standardized)
best_model.fit(X_train, y_train)

# Evaluate the model
y_pred_final = best_model.predict(X_test)
print("Final Classification Report:\n", classification_report(y_test, y_pred_final))


Final Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.88      0.84        85
           1       0.87      0.59      0.71        81
           2       0.75      0.88      0.81        72
           3       0.85      0.86      0.85        84
           4       0.91      0.97      0.94        89

    accuracy                           0.84       411
   macro avg       0.84      0.83      0.83       411
weighted avg       0.84      0.84      0.83       411

