## Define Parameter Grids

### Subtask:
Define parameter grids for GridSearchCV and RandomizedSearchCV for each of the models (Logistic Regression, Decision Tree, Random Forest, SVM).

**Reasoning**:
Define dictionaries containing the hyperparameters to tune for each model. These grids will be used by GridSearchCV and RandomizedSearchCV to explore different combinations of hyperparameters.

In [None]:
# Define parameter grid for Logistic Regression
param_grid_lr = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear', 'lbfgs']
}

# Define parameter grid for Decision Tree
param_grid_dt = {
    'max_depth': [None, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Define parameter grid for Random Forest
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Define parameter grid for SVM
param_grid_svm = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto', 0.1, 1]  # only used by 'rbf'; ignored when kernel='linear'
}

print("✅ Parameter grids defined.")

✅ Parameter grids defined.
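
Before running a search, it helps to know how many candidates each grid contains. The following is a minimal sketch using only the standard library. The grids are copied from the cell above; `grid_size` and `param_grids` are illustrative names, not part of the notebook.

```python
from math import prod

# Grids copied from the cell above, collected in one dictionary for convenience.
param_grids = {
    "Logistic Regression": {
        "C": [0.001, 0.01, 0.1, 1, 10, 100],
        "solver": ["liblinear", "lbfgs"],
    },
    "Decision Tree": {
        "max_depth": [None, 5, 10, 15, 20],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    },
    "Random Forest": {
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 5, 10, 15],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    },
    "SVM": {
        "C": [0.1, 1, 10, 100],
        "kernel": ["linear", "rbf"],
        "gamma": ["scale", "auto", 0.1, 1],
    },
}


def grid_size(grid):
    """Number of combinations in a grid: the product of the list lengths."""
    return prod(len(values) for values in grid.values())


cv_folds = 5  # assumed to match the cv=5 setting used later in the notebook
for name, grid in param_grids.items():
    n = grid_size(grid)
    print(f"{name}: {n} combinations, {n * cv_folds} cross-validation fits")

# The linear kernel ignores gamma, so the SVM grid has fewer distinct models:
# one linear setting plus one rbf setting per gamma value, for each C.
svm = param_grids["SVM"]
distinct_svm = len(svm["C"]) * (1 + len(svm["gamma"]))
print(f"Distinct SVM models: {distinct_svm}")
```

The counts matter later: a randomized search that samples at least as many candidates as a grid contains has nothing left to skip.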


## Apply GridSearchCV

### Subtask:
Apply GridSearchCV to each model with its respective parameter grid to find the best hyperparameters.

**Reasoning**:
Initialize GridSearchCV for each model using the defined parameter grids, fit the GridSearchCV object to the training data (`X_train`, `y_train`), and store the best estimators. This process performs a comprehensive search over the specified parameter values using cross-validation to find the optimal combination for each model.

In [None]:
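from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC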

# Dictionary to store best models from GridSearchCV
grid_search_models = {}

# Logistic Regression GridSearchCV
print("Running GridSearchCV for Logistic Regression...")
grid_search_lr = GridSearchCV(estimator=LogisticRegression(random_state=42, max_iter=1000),
                              param_grid=param_grid_lr,
                              cv=5, # Using 5-fold cross-validation
                              scoring='accuracy', # Using accuracy as the scoring metric
                              n_jobs=-1) # Use all available cores
grid_search_lr.fit(X_train, y_train)
grid_search_models['Logistic Regression'] = grid_search_lr.best_estimator_
print(f"Best parameters for Logistic Regression: {grid_search_lr.best_params_}")
print(f"Best cross-validation score for Logistic Regression: {grid_search_lr.best_score_:.4f}")
print("-" * 30)

# Decision Tree GridSearchCV
print("Running GridSearchCV for Decision Tree...")
grid_search_dt = GridSearchCV(estimator=DecisionTreeClassifier(random_state=42),
                              param_grid=param_grid_dt,
                              cv=5,
                              scoring='accuracy',
                              n_jobs=-1)
grid_search_dt.fit(X_train, y_train)
grid_search_models['Decision Tree'] = grid_search_dt.best_estimator_
print(f"Best parameters for Decision Tree: {grid_search_dt.best_params_}")
print(f"Best cross-validation score for Decision Tree: {grid_search_dt.best_score_:.4f}")
print("-" * 30)

# Random Forest GridSearchCV
print("Running GridSearchCV for Random Forest...")
grid_search_rf = GridSearchCV(estimator=RandomForestClassifier(random_state=42),
                              param_grid=param_grid_rf,
                              cv=5,
                              scoring='accuracy',
                              n_jobs=-1)
grid_search_rf.fit(X_train, y_train)
grid_search_models['Random Forest'] = grid_search_rf.best_estimator_
print(f"Best parameters for Random Forest: {grid_search_rf.best_params_}")
print(f"Best cross-validation score for Random Forest: {grid_search_rf.best_score_:.4f}")
print("-" * 30)


# SVM GridSearchCV
print("Running GridSearchCV for SVM...")
grid_search_svm = GridSearchCV(estimator=SVC(probability=True, random_state=42), # probability=True needed for ROC AUC
                               param_grid=param_grid_svm,
                               cv=5,
                               scoring='accuracy',
                               n_jobs=-1)
grid_search_svm.fit(X_train, y_train)
grid_search_models['SVM'] = grid_search_svm.best_estimator_
print(f"Best parameters for SVM: {grid_search_svm.best_params_}")
print(f"Best cross-validation score for SVM: {grid_search_svm.best_score_:.4f}")
print("-" * 30)


print("\n✅ GridSearchCV completed for all models.")

Running GridSearchCV for Logistic Regression...
Best parameters for Logistic Regression: {'C': 1, 'solver': 'lbfgs'}
Best cross-validation score for Logistic Regression: 0.6117
------------------------------
Running GridSearchCV for Decision Tree...
Best parameters for Decision Tree: {'max_depth': None, 'min_samples_leaf': 4, 'min_samples_split': 2}
Best cross-validation score for Decision Tree: 0.5497
------------------------------
Running GridSearchCV for Random Forest...
Best parameters for Random Forest: {'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 200}
Best cross-validation score for Random Forest: 0.6114
------------------------------
Running GridSearchCV for SVM...
Best parameters for SVM: {'C': 100, 'gamma': 'scale', 'kernel': 'linear'}
Best cross-validation score for SVM: 0.6075
------------------------------

✅ GridSearchCV completed for all models.
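
To make the search procedure concrete, here is a minimal from-scratch sketch of an exhaustive search with k-fold cross-validation. It is not scikit-learn's implementation: the folds are contiguous and unstratified, and `toy_score` is a hypothetical scorer that stands in for fitting a model on the training fold and scoring it on the validation fold.

```python
from itertools import product
from statistics import mean


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous (train, validation) index pairs."""
    # The first n_samples % k folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        stop = start + size
        val = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, val))
        start = stop
    return folds


def exhaustive_search(score_fn, grid, n_samples, k=5):
    """Score every combination in the grid by its mean validation score."""
    names = list(grid)
    folds = k_fold_indices(n_samples, k)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = mean(score_fn(params, train, val) for train, val in folds)
        if score > best_score:  # strict '>' keeps the first combination on ties
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params, train_idx, val_idx):
    """Hypothetical scorer: ignores the data and peaks at C=1, depth=3."""
    return 1.0 - abs(params["C"] - 1) / 100 - abs(params["depth"] - 3) / 10


toy_grid = {"C": [0.1, 1, 10], "depth": [1, 3, 5]}
best_params, best_score = exhaustive_search(toy_score, toy_grid, n_samples=23)
print(best_params, best_score)
```

The cost is the number of combinations times the number of folds, which is why the grid sizes are worth checking before launching a search.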


## Evaluate GridSearchCV Models

### Subtask:
Evaluate the performance of the models with the best hyperparameters found by GridSearchCV.

**Reasoning**:
Use the best estimators obtained from GridSearchCV to predict on the test set (`X_test`) and calculate evaluation metrics such as Accuracy, Precision, Recall, F1-score, and ROC AUC.

In [None]:
# Dictionary to store evaluation metrics for GridSearchCV optimized models
grid_search_evaluation_metrics = {}

# Evaluate each optimized model from GridSearchCV
for name, model in grid_search_models.items():
    y_pred = model.predict(X_test)

    # Calculate basic metrics
    accuracy = accuracy_score(y_test, y_pred)
    # Use weighted average for multi-class
    precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)

    # Store metrics
    grid_search_evaluation_metrics[name] = {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-score': f1
    }

    # Calculate ROC AUC score (multi-class using 'ovr') if predict_proba is available
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_test)
        try:
            # Binarize y_test using the model's class order so its columns line up with predict_proba
            from sklearn.preprocessing import label_binarize
            y_test_binarized = label_binarize(y_test, classes=model.classes_)

            # Calculate macro-averaged one-vs-rest ROC AUC from the binarized labels
            roc_auc = roc_auc_score(y_test_binarized, y_prob, multi_class='ovr')
            grid_search_evaluation_metrics[name]['ROC AUC (OvR)'] = roc_auc

        except ValueError as e:
            print(f"Could not calculate ROC AUC for {name}: {e}\n")
    else:
        print(f"Model {name} does not support predict_proba for ROC AUC calculation.\n")


# Display the evaluation metrics in a DataFrame
grid_search_evaluation_df = pd.DataFrame(grid_search_evaluation_metrics).T
print("\nSummary of Evaluation Metrics for GridSearchCV Optimized Models:")
display(grid_search_evaluation_df)


Summary of Evaluation Metrics for GridSearchCV Optimized Models:


| Model | Accuracy | Precision | Recall | F1-score | ROC AUC (OvR) |
| --- | --- | --- | --- | --- | --- |
| Logistic Regression | 0.52459 | 0.463056 | 0.52459 | 0.488559 | 0.804064 |
| Decision Tree | 0.47541 | 0.441125 | 0.47541 | 0.457068 | 0.697961 |
| Random Forest | 0.52459 | 0.445433 | 0.52459 | 0.430212 | 0.801439 |
| SVM | 0.508197 | 0.43604 | 0.508197 | 0.467418 | 0.810624 |
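
The Recall column matches the Accuracy column for every model. This follows from the `average='weighted'` setting: weighting each class's recall by its share of the samples gives the overall fraction of correct predictions. A minimal sketch with made-up labels (not the notebook's data) shows the identity.

```python
from collections import Counter
from math import isclose


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall, weighted by each class's share of the true labels."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        true_positives = sum(
            1 for t, p in zip(y_true, y_pred) if t == label and p == label
        )
        total += (n / len(y_true)) * (true_positives / n)
    return total


# Made-up five-class labels, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 3, 4]
y_pred = [0, 0, 1, 0, 1, 0, 2, 0, 0, 0]

print(round(accuracy(y_true, y_pred), 4))
print(round(weighted_recall(y_true, y_pred), 4))
print(isclose(accuracy(y_true, y_pred), weighted_recall(y_true, y_pred)))
```

Because the two columns carry the same information, weighted Precision and F1-score are the more informative additions to Accuracy here.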


## Apply RandomizedSearchCV

### Subtask:
Apply RandomizedSearchCV to each model with its respective parameter distribution to find the best hyperparameters.

**Reasoning**:
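Initialize RandomizedSearchCV for each model, reusing the parameter grids defined earlier as `param_distributions`. Because every entry in those grids is a list, candidates are sampled uniformly without replacement from the grid. Fit each RandomizedSearchCV object to the training data (`X_train`, `y_train`) and store the best estimators. RandomizedSearchCV evaluates a fixed number of parameter settings (`n_iter`), which can be more efficient than GridSearchCV for large search spaces. With `n_iter = 32`, the Logistic Regression grid (12 combinations) and the SVM grid (32 combinations) are covered completely, so only the Decision Tree (45) and Random Forest (108) searches are actually subsampled.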

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Dictionary to store best models from RandomizedSearchCV
randomized_search_models = {}

# Define the number of parameter settings that will be sampled for each model
n_iter = 32  # Candidates sampled per model; adjust as needed (grids smaller than this are searched in full)

# Logistic Regression RandomizedSearchCV
print(f"Running RandomizedSearchCV for Logistic Regression ({n_iter} iterations)...")
randomized_search_lr = RandomizedSearchCV(estimator=LogisticRegression(random_state=42, max_iter=1000),
                                          param_distributions=param_grid_lr, # Use the same grid as distribution
                                          n_iter=n_iter,
                                          cv=5,
                                          scoring='accuracy',
                                          random_state=42,
                                          n_jobs=-1)
randomized_search_lr.fit(X_train, y_train)
randomized_search_models['Logistic Regression'] = randomized_search_lr.best_estimator_
print(f"Best parameters for Logistic Regression: {randomized_search_lr.best_params_}")
print(f"Best cross-validation score for Logistic Regression: {randomized_search_lr.best_score_:.4f}")
print("-" * 30)

# Decision Tree RandomizedSearchCV
print(f"Running RandomizedSearchCV for Decision Tree ({n_iter} iterations)...")
randomized_search_dt = RandomizedSearchCV(estimator=DecisionTreeClassifier(random_state=42),
                                          param_distributions=param_grid_dt,
                                          n_iter=n_iter,
                                          cv=5,
                                          scoring='accuracy',
                                          random_state=42,
                                          n_jobs=-1)
randomized_search_dt.fit(X_train, y_train)
randomized_search_models['Decision Tree'] = randomized_search_dt.best_estimator_
print(f"Best parameters for Decision Tree: {randomized_search_dt.best_params_}")
print(f"Best cross-validation score for Decision Tree: {randomized_search_dt.best_score_:.4f}")
print("-" * 30)

# Random Forest RandomizedSearchCV
print(f"Running RandomizedSearchCV for Random Forest ({n_iter} iterations)...")
randomized_search_rf = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=42),
                                          param_distributions=param_grid_rf,
                                          n_iter=n_iter,
                                          cv=5,
                                          scoring='accuracy',
                                          random_state=42,
                                          n_jobs=-1)
randomized_search_rf.fit(X_train, y_train)
randomized_search_models['Random Forest'] = randomized_search_rf.best_estimator_
print(f"Best parameters for Random Forest: {randomized_search_rf.best_params_}")
print(f"Best cross-validation score for Random Forest: {randomized_search_rf.best_score_:.4f}")
print("-" * 30)

# SVM RandomizedSearchCV
print(f"Running RandomizedSearchCV for SVM ({n_iter} iterations)...")
randomized_search_svm = RandomizedSearchCV(estimator=SVC(probability=True, random_state=42), # probability=True needed for ROC AUC
                                           param_distributions=param_grid_svm,
                                           n_iter=n_iter,
                                           cv=5,
                                           scoring='accuracy',
                                           random_state=42,
                                           n_jobs=-1)
randomized_search_svm.fit(X_train, y_train)
randomized_search_models['SVM'] = randomized_search_svm.best_estimator_
print(f"Best parameters for SVM: {randomized_search_svm.best_params_}")
print(f"Best cross-validation score for SVM: {randomized_search_svm.best_score_:.4f}")
print("-" * 30)

print("\n✅ RandomizedSearchCV completed for all models.")

Running RandomizedSearchCV for Logistic Regression (32 iterations)...
Best parameters for Logistic Regression: {'solver': 'lbfgs', 'C': 1}
Best cross-validation score for Logistic Regression: 0.6117
------------------------------
Running RandomizedSearchCV for Decision Tree (32 iterations)...
Best parameters for Decision Tree: {'min_samples_split': 5, 'min_samples_leaf': 4, 'max_depth': 10}
Best cross-validation score for Decision Tree: 0.5497
------------------------------
Running RandomizedSearchCV for Random Forest (32 iterations)...
Best parameters for Random Forest: {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_depth': 5}
Best cross-validation score for Random Forest: 0.6114
------------------------------
Running RandomizedSearchCV for SVM (32 iterations)...
Best parameters for SVM: {'kernel': 'linear', 'gamma': 'scale', 'C': 100}
Best cross-validation score for SVM: 0.6075
------------------------------

✅ RandomizedSearchCV completed for all models.
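
The best cross-validation scores above are identical to the GridSearchCV scores. A minimal sketch of sampling without replacement from a list-valued grid shows why that is unsurprising for two of the four models. `sample_grid` is an illustrative helper, not scikit-learn's sampler, and it returns candidates in grid order when nothing needs to be sampled.

```python
import random
from itertools import product


def sample_grid(grid, n_iter, seed=42):
    """Draw up to n_iter distinct combinations from a grid of lists."""
    names = list(grid)
    combos = [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]
    if n_iter >= len(combos):
        # Nothing left to skip: the candidates are the full grid.
        return combos
    return random.Random(seed).sample(combos, n_iter)


# Grids copied from the parameter-grid cell earlier in the notebook.
grids = {
    "Logistic Regression": {
        "C": [0.001, 0.01, 0.1, 1, 10, 100],
        "solver": ["liblinear", "lbfgs"],
    },
    "Decision Tree": {
        "max_depth": [None, 5, 10, 15, 20],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    },
    "Random Forest": {
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 5, 10, 15],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    },
    "SVM": {
        "C": [0.1, 1, 10, 100],
        "kernel": ["linear", "rbf"],
        "gamma": ["scale", "auto", 0.1, 1],
    },
}

n_iter = 32
sampled_counts = {name: len(sample_grid(grid, n_iter)) for name, grid in grids.items()}
for name, count in sampled_counts.items():
    print(f"{name}: {count} candidates evaluated")
```

For Logistic Regression and SVM the randomized search sees every candidate the grid search saw. To compare the two methods meaningfully, `n_iter` would need to be well below the grid size, or the distributions would need to be continuous.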


## Evaluate RandomizedSearchCV Models

### Subtask:
Evaluate the performance of the models with the best hyperparameters found by RandomizedSearchCV.

**Reasoning**:
Use the best estimators obtained from RandomizedSearchCV to predict on the test set (`X_test`) and calculate evaluation metrics such as Accuracy, Precision, Recall, F1-score, and ROC AUC.

In [None]:
# Dictionary to store evaluation metrics for RandomizedSearchCV optimized models
randomized_search_evaluation_metrics = {}

# Evaluate each optimized model from RandomizedSearchCV
for name, model in randomized_search_models.items():
    y_pred = model.predict(X_test)

    # Calculate basic metrics
    accuracy = accuracy_score(y_test, y_pred)
    # Use weighted average for multi-class
    precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)

    # Store metrics
    randomized_search_evaluation_metrics[name] = {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-score': f1
    }

    # Calculate ROC AUC score (multi-class using 'ovr') if predict_proba is available
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_test)
        try:
            # Binarize y_test using the model's class order so its columns line up with predict_proba
            from sklearn.preprocessing import label_binarize
            y_test_binarized = label_binarize(y_test, classes=model.classes_)

            # Calculate macro-averaged one-vs-rest ROC AUC from the binarized labels
            roc_auc = roc_auc_score(y_test_binarized, y_prob, multi_class='ovr')
            randomized_search_evaluation_metrics[name]['ROC AUC (OvR)'] = roc_auc

        except ValueError as e:
            print(f"Could not calculate ROC AUC for {name}: {e}\n")
    else:
        print(f"Model {name} does not support predict_proba for ROC AUC calculation.\n")


# Display the evaluation metrics in a DataFrame
randomized_search_evaluation_df = pd.DataFrame(randomized_search_evaluation_metrics).T
print("\nSummary of Evaluation Metrics for RandomizedSearchCV Optimized Models:")
display(randomized_search_evaluation_df)


Summary of Evaluation Metrics for RandomizedSearchCV Optimized Models:


| Model | Accuracy | Precision | Recall | F1-score | ROC AUC (OvR) |
| --- | --- | --- | --- | --- | --- |
| Logistic Regression | 0.52459 | 0.463056 | 0.52459 | 0.488559 | 0.804064 |
| Decision Tree | 0.47541 | 0.441125 | 0.47541 | 0.457068 | 0.697961 |
| Random Forest | 0.52459 | 0.445433 | 0.52459 | 0.430212 | 0.801439 |
| SVM | 0.508197 | 0.43604 | 0.508197 | 0.467418 | 0.810624 |
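
The ROC AUC (OvR) column is a macro average: each class is scored against all the others, and the per-class AUCs are averaged with equal weight. The following from-scratch sketch illustrates the idea with made-up probabilities for three classes. It uses the pairwise-ranking definition of AUC and is meant as an explanation, not as a replacement for `roc_auc_score`.

```python
def binary_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def ovr_macro_auc(y_true, y_prob, classes):
    """Average the one-vs-rest AUC over classes, each class weighted equally."""
    aucs = []
    for column, cls in enumerate(classes):
        labels = [1 if y == cls else 0 for y in y_true]
        scores = [row[column] for row in y_prob]
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / len(aucs)


# Made-up labels and predicted probabilities (columns ordered as classes 0, 1, 2).
y_true = [0, 1, 2, 0, 1, 2]
y_prob = [
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
    [0.3, 0.4, 0.3],
    [0.3, 0.3, 0.4],
]

print(ovr_macro_auc(y_true, y_prob, classes=[0, 1, 2]))
```

Because the metric depends only on how the probabilities rank the samples, a model can have a high ROC AUC and still a modest accuracy, as in the table above.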


## Compare Optimized Models and Identify Best Performing Model

### Subtask:
Compare the performance of the models optimized with GridSearchCV and RandomizedSearchCV and identify the overall best performing model.

**Reasoning**:
Compare the evaluation metrics from both GridSearchCV and RandomizedSearchCV results. Identify which hyperparameter tuning method worked best for each model and then determine the single best performing model across all models and tuning methods based on a chosen metric (e.g., Accuracy, ROC AUC, F1-score).

In [None]:
print("Comparison of Model Performance (GridSearchCV vs. RandomizedSearchCV):")
print("-" * 70)

# Compare GridSearchCV and RandomizedSearchCV results for each model
for name in grid_search_evaluation_df.index:
    print(f"\nModel: {name}")
    print("  GridSearchCV:")
    display(grid_search_evaluation_df.loc[[name]])
    print("  RandomizedSearchCV:")
    display(randomized_search_evaluation_df.loc[[name]])
    print("-" * 30)

print("\nOverall Best Performing Model:")
print("-" * 50)

# Combine both evaluation DataFrames for overall comparison
combined_evaluation_df = pd.concat([grid_search_evaluation_df, randomized_search_evaluation_df],
                                   keys=['GridSearchCV', 'RandomizedSearchCV'],
                                   names=['Tuning Method', 'Model'])

# Find the best model based on a key metric, e.g., ROC AUC (OvR).
# Note: idxmax() returns the first row holding the maximum, so on a tie the
# GridSearchCV row is reported simply because it comes first in the concatenation.
if 'ROC AUC (OvR)' in combined_evaluation_df.columns:
    best_model_roc_auc = combined_evaluation_df.loc[combined_evaluation_df['ROC AUC (OvR)'].idxmax()]
    print("Best model based on ROC AUC (OvR):")
    display(best_model_roc_auc)
elif 'Accuracy' in combined_evaluation_df.columns:
    best_model_accuracy = combined_evaluation_df.loc[combined_evaluation_df['Accuracy'].idxmax()]
    print("Best model based on Accuracy:")
    display(best_model_accuracy)
else:
    print("Could not identify best model based on available metrics.")


# You can also compare to the baseline models if you have stored their metrics in a similar DataFrame
# print("\nComparison with Baseline Models:")
# display(evaluation_df) # Assuming evaluation_df holds baseline model metrics

print("\nDiscussion:")
print("Review the comparison above to see which hyperparameter tuning method resulted in better performance for each model.")
print("The 'Overall Best Performing Model' is identified based on the highest ROC AUC (OvR) score (or Accuracy if ROC AUC is not available or suitable).")
print("Consider the trade-offs between different metrics and the specific goals of your project when choosing the final model.")

Comparison of Model Performance (GridSearchCV vs. RandomizedSearchCV):
----------------------------------------------------------------------

Model: Logistic Regression
  GridSearchCV:

| Model | Accuracy | Precision | Recall | F1-score | ROC AUC (OvR) |
| --- | --- | --- | --- | --- | --- |
| Logistic Regression | 0.52459 | 0.463056 | 0.52459 | 0.488559 | 0.804064 |

  RandomizedSearchCV:

| Model | Accuracy | Precision | Recall | F1-score | ROC AUC (OvR) |
| --- | --- | --- | --- | --- | --- |
| Logistic Regression | 0.52459 | 0.463056 | 0.52459 | 0.488559 | 0.804064 |

------------------------------

Model: Decision Tree
  GridSearchCV:

| Model | Accuracy | Precision | Recall | F1-score | ROC AUC (OvR) |
| --- | --- | --- | --- | --- | --- |
| Decision Tree | 0.47541 | 0.441125 | 0.47541 | 0.457068 | 0.697961 |

  RandomizedSearchCV:

| Model | Accuracy | Precision | Recall | F1-score | ROC AUC (OvR) |
| --- | --- | --- | --- | --- | --- |
| Decision Tree | 0.47541 | 0.441125 | 0.47541 | 0.457068 | 0.697961 |

------------------------------

Model: Random Forest
  GridSearchCV:

| Model | Accuracy | Precision | Recall | F1-score | ROC AUC (OvR) |
| --- | --- | --- | --- | --- | --- |
| Random Forest | 0.52459 | 0.445433 | 0.52459 | 0.430212 | 0.801439 |

  RandomizedSearchCV:

| Model | Accuracy | Precision | Recall | F1-score | ROC AUC (OvR) |
| --- | --- | --- | --- | --- | --- |
| Random Forest | 0.52459 | 0.445433 | 0.52459 | 0.430212 | 0.801439 |

------------------------------

Model: SVM
  GridSearchCV:

| Model | Accuracy | Precision | Recall | F1-score | ROC AUC (OvR) |
| --- | --- | --- | --- | --- | --- |
| SVM | 0.508197 | 0.43604 | 0.508197 | 0.467418 | 0.810624 |

  RandomizedSearchCV:

| Model | Accuracy | Precision | Recall | F1-score | ROC AUC (OvR) |
| --- | --- | --- | --- | --- | --- |
| SVM | 0.508197 | 0.43604 | 0.508197 | 0.467418 | 0.810624 |

------------------------------

Overall Best Performing Model:
--------------------------------------------------
Best model based on ROC AUC (OvR):


| Metric | (GridSearchCV, SVM) |
| --- | --- |
| Accuracy | 0.508197 |
| Precision | 0.43604 |
| Recall | 0.508197 |
| F1-score | 0.467418 |
| ROC AUC (OvR) | 0.810624 |



Discussion:
Review the comparison above to see which hyperparameter tuning method resulted in better performance for each model.
The 'Overall Best Performing Model' is identified based on the highest ROC AUC (OvR) score (or Accuracy if ROC AUC is not available or suitable).
Consider the trade-offs between different metrics and the specific goals of your project when choosing the final model.
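
Since both tuning methods produced the same test metrics, the "best" row reported above is one of two tied rows. A minimal sketch that lists every row sharing the top score makes the tie visible. The ROC AUC values are copied from the tables above, and the dictionary layout is an illustrative stand-in for `combined_evaluation_df`.

```python
# ROC AUC (OvR) values copied from the evaluation tables above,
# keyed by (tuning method, model) in the same order as the combined DataFrame.
roc_auc = {
    ("GridSearchCV", "Logistic Regression"): 0.804064,
    ("GridSearchCV", "Decision Tree"): 0.697961,
    ("GridSearchCV", "Random Forest"): 0.801439,
    ("GridSearchCV", "SVM"): 0.810624,
    ("RandomizedSearchCV", "Logistic Regression"): 0.804064,
    ("RandomizedSearchCV", "Decision Tree"): 0.697961,
    ("RandomizedSearchCV", "Random Forest"): 0.801439,
    ("RandomizedSearchCV", "SVM"): 0.810624,
}

best_score = max(roc_auc.values())

# Taking only the first maximum hides the tie; collecting all of them does not.
first_best = max(roc_auc, key=roc_auc.get)
tied_best = [key for key, value in roc_auc.items() if value == best_score]

print("First maximum:", first_best)
print("All rows with the top score:", tied_best)
```

Reporting all tied rows avoids attributing the result to one tuning method when the data do not support that distinction.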


## Summary: Hyperparameter Tuning and Best Model

### Data Analysis Key Findings

*   Parameter grids for Logistic Regression, Decision Tree, Random Forest, and SVM were defined for hyperparameter tuning.
*   GridSearchCV was applied to each model, and the models with the best hyperparameters were evaluated on the test set.
*   RandomizedSearchCV was applied to each model, and the models with the best hyperparameters were evaluated on the test set.
*   The performance of the models optimized with GridSearchCV and RandomizedSearchCV was compared based on Accuracy, weighted Precision, weighted Recall, weighted F1-score, and multi-class ROC AUC (OvR).
*   Based on the ROC AUC (OvR) metric, the **SVM model** achieved the highest score (0.8106). GridSearchCV and RandomizedSearchCV produced identical test metrics for every model, so the comparison reports the GridSearchCV row only because it appears first. The score suggests that SVM separates the heart disease classes best among the tested models, although Logistic Regression (0.8041) and Random Forest (0.8014) are close behind.
*   Among the tuned models, Logistic Regression had the highest weighted Precision (0.4631) and weighted F1-score (0.4886), and it tied with Random Forest for the highest Accuracy (0.5246). Decision Tree had the highest weighted F1-score and Precision among the baseline models, but after tuning it had the lowest ROC AUC (0.6980) and the lowest Accuracy (0.4754).

### Insights or Next Steps

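*   Hyperparameter tuning improved the ROC AUC of SVM compared to the baseline SVM model. Neither tuning method showed an advantage over the other: with `n_iter = 32`, RandomizedSearchCV covered the full Logistic Regression (12 combinations) and SVM (32 combinations) grids, and it reached the same best cross-validation score as GridSearchCV on the Decision Tree and Random Forest grids.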
*   The choice of the best model depends on the specific evaluation metric prioritized for the application (e.g., ROC AUC for overall discrimination, F1-score for balance between precision and recall).
*   Further analysis could involve exploring more extensive parameter grids, using different cross-validation strategies, or ensemble methods to potentially further improve the performance of the best performing model.
*   The insights from the feature selection process could be used in conjunction with the best performing model for a more in-depth analysis of the most influential factors in predicting heart disease.
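
One cross-validation strategy mentioned above is to repeat a stratified split with different shuffles, which reduces the dependence of the score on a single partition. The following standard-library sketch shows how stratified fold assignment can work. `stratified_fold_ids` is a hypothetical helper written for illustration, and the labels are made up to mimic an imbalanced five-fold setting.

```python
import random
from collections import Counter, defaultdict


def stratified_fold_ids(y, k, seed):
    """Assign each sample a fold id in 0..k-1, keeping class proportions similar.

    Samples of each class are shuffled, then dealt to folds round-robin.
    The rotation continues across classes so fold sizes stay balanced.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)

    fold_ids = [None] * len(y)
    next_fold = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for index in indices:
            fold_ids[index] = next_fold
            next_fold = (next_fold + 1) % k
    return fold_ids


# Made-up imbalanced labels: 10, 5, 3, and 2 samples of classes 0 to 3.
y = [0] * 10 + [1] * 5 + [2] * 3 + [3] * 2

# Repeating the split with different seeds gives different partitions
# with the same per-fold class balance.
for seed in (0, 1, 2):
    fold_ids = stratified_fold_ids(y, k=5, seed=seed)
    fold_sizes = sorted(Counter(fold_ids).items())
    print(f"seed={seed}: fold sizes {fold_sizes}")
```

Averaging the validation scores over several such repeats gives a steadier estimate when the dataset is small, at the cost of proportionally more model fits.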