<a href="https://colab.research.google.com/github/AgVicCodes/DHVA1/blob/main/tutorial_4_AutoML_networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Tutorial week 11: Auto ML and networks**

This tutorial allows you to extend the end-to-end case study we did in the [tutorial from week 3], by including the following additional steps:

* Step 1: **Data imputation**

* Step 2: **Auto ML model comparison**

* Step 3: **Networks**

It is modified from a Colab Notebook originally written by Rob Yates. Follow the instructions provided below, modifying/adding code where necessary to complete each step. You can use code snippets presented in Lecture 4 where necessary.

In [None]:
# Import standard packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing

## **Step 0: Loading the data**

First, mount your Google Drive and load the heart.csv file into a Pandas DataFrame, as we did in the [tutorial from week 10](https://colab.research.google.com/drive/1fnDL9lYMdbuP8XnIFVMsMZKmvi4zdvxY?usp=sharing). Then, execute the subsequent cells to encode the non-numerical features.

In [None]:
#[Write code to mount your Google Drive and load the Heart Disease data set into a Pandas DataFrame called "df"]


In [None]:
# Ordinal encoding

le = preprocessing.LabelEncoder()
label_columns = ['Sex', 'ExerciseAngina']

for label in label_columns:
    df[label] = le.fit_transform(df[label])
df.head()

In [None]:
# One-hot encoding

df = pd.get_dummies(df, columns = ['ChestPainType', 'RestingECG', 'ST_Slope'])
df.head()

## **Step 1: Data imputation**

As discussed in the third tutorial and in Lecture 4, the 'Cholesterol' feature in this dataset has a large number of 0 values. These are likely to be "missing data". Check this by plotting the histogram below, then implement the **MICE imputation technique** discussed in Lectures 2 and 4 to replace these 0s with realistic values.

N.B. MICE imputation works by replacing NaNs, so you'll need to convert the 0s to NaNs first.
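
The two steps above (0s to NaNs, then imputation) can be sketched on a small toy table. This is only an illustration under stated assumptions: it uses scikit-learn's `IterativeImputer`, which is a MICE-inspired imputer and may differ from the implementation shown in the lectures, and the values in `toy` are made up rather than taken from heart.csv.

```python
import numpy as np
import pandas as pd
# IterativeImputer is still experimental in scikit-learn, so this import must come first
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data (hypothetical values, not taken from heart.csv)
toy = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54, 39, 45, 58],
    "RestingBP": [140, 160, 130, 138, 150, 120, 130, 136],
    "Cholesterol": [289, 180, 0, 214, 0, 339, 237, 164],
})

# Step 1: treat 0 as "missing" by converting it to NaN
toy["Cholesterol"] = toy["Cholesterol"].replace(0, np.nan)

# Step 2: model each incomplete column from the other columns, iteratively
imputer = IterativeImputer(max_iter=10, random_state=0)
toy_imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(toy_imputed.round(1))
```

Only the NaN entries are filled in; the observed values are passed through unchanged.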


In [None]:
# Histogram of 'Cholesterol' feature before imputation
plt.figure(figsize=(10, 6))
sns.histplot(x='Cholesterol', data=df, hue='HeartDisease')
plt.title('Cholesterol distribution before data imputation')
plt.show()

In [None]:
# [In this cell, use the information in Lecture 2 and Lecture 4 on MICE imputation to replace the 0s in the 'Cholesterol' feature with realistic values]
# [Remember to replace the 0s in the 'Cholesterol' feature with NaNs first.]


In [None]:
# Histogram of 'Cholesterol' feature after imputation
plt.figure(figsize=(10, 6))
sns.histplot(x='Cholesterol', data=df_imputed, hue='HeartDisease')
plt.title('Cholesterol distribution after data imputation')
plt.show()

## **Step 2: Auto ML**

Now that the data is pre-processed, we can feed it into ML models. We could do this manually, but it can be cumbersome and inefficient to train & compare many models this way.

Instead, **Auto ML packages allow many ML models to be efficiently trained & compared to each other.** They also provide insight into the dataset through XAI techniques.

Below, we will apply the **H2O AutoML** package to the heart disease dataset.
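
For contrast, a manual comparison of the kind mentioned above might look like the following sketch. It uses scikit-learn with synthetic data from `make_classification` as a stand-in for the heart disease features. The two models and the 5-fold cross-validation are illustrative assumptions, not part of this tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (a stand-in, not the heart disease dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# Each model has to be listed and configured by hand
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

# Mean 5-fold cross-validated AUC for each model
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}

# Print a small "leaderboard", best model first
for name, auc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean AUC = {auc:.3f}")
```

Every extra model, hyperparameter setting or metric adds more hand-written code, which is the overhead that Auto ML packages aim to remove.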


In [None]:
# Install the h2o package, which provides H2O AutoML (only need to do this once)
!pip install h2o

# Import packages
import h2o
from h2o.automl import H2OAutoML

# Start the H2O cluster (locally)
h2o.init()

In [None]:
# Define the target variable (i.e. "response") and features (i.e. "predictors") for H2O AutoML
response = 'HeartDisease'
predictors = list(df_imputed.drop('HeartDisease', axis=1).columns)

# Convert from a pandas DataFrame to an H2OFrame
hf = h2o.H2OFrame(df_imputed)

# Convert target column to categorical, as this is a binary classification problem
hf['HeartDisease'] = hf['HeartDisease'].asfactor()

# Split the data into training and testing sets using the H2O split_frame() method
train, test = hf.split_frame(seed=1, ratios=[0.75])

The cell below sets up and trains a series of different ML models on our dataset. Before running the cell, use the [H2O AutoML online documentation](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html) to answer the following questions:

*   **Below, we have chosen three ML algorithms ("GLM", "DRF", "XGBoost"). What do these acronyms stand for?**
*   **We have also chosen to limit the training time to 120 secs, via the "max_runtime_secs" argument. What other stopping parameter options are available in H2O AutoML?**

In [None]:
# Set up and train the models
aml = H2OAutoML(max_runtime_secs=120, seed=1, include_algos=["GLM", "DRF", "XGBoost"])
aml.train(x=predictors, y=response, training_frame=train)

# Turn off default printed output
h2o.display.toggle_user_tips()

The cell below plots a "leaderboard" of the 20 best-performing models for our dataset, including a number of different performance metrics. Run the cell and then answer the following questions:

*   **Which of the three model types we considered above ("GLM", "DRF", "XGBoost") performs best overall?**
*   **For binary classification tasks, H2O AutoML ranks models by their "AUC" score. What is an "AUC" score?**
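
As a starting point for the AUC question, the sketch below computes the area under the ROC curve for a tiny set of made-up labels and scores. It relies on the fact that AUC equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score. The helper `pairwise_auc` is a hypothetical name written for this illustration and is not part of H2O.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]  # scores given to the positive class
    neg = y_score[y_true == 0]  # scores given to the negative class
    diff = pos[:, None] - neg[None, :]  # every positive compared with every negative
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Made-up labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, y_score)
print(auc)
```

Here three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.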



In [None]:
# Show leaderboard & performance information
aml.explain(test, include_explanations=["leaderboard"]);

The cell below plots three different XAI methods for the various models trained by H2O AutoML. Run the cell, and then answer the following questions:

* **Explain the main results that these XAI plots provide**.
* Use the [H2O AutoML online documentation](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html) to **plot other XAI methods for a wider range of attributes**. Can you discover any other interesting trends?



In [None]:
# Plot XAI information
aml.explain(test, include_explanations=["varimp_heatmap", "shap_summary", "pdp"],
            columns=["ST_Slope_Up", "ChestPainType_ASY", "Cholesterol"]);

## **Step 3: Networks**

Finally, we will explore the similarities between the different instances in our dataset by creating a tree graph using the **NetworkX** package. Complete the cells below, then answer the following questions:

*  What does the **"minimum spanning tree (MST)"** created below tell us about our dataset? Are patients with/without heart disease well separated? Are there sub-groups containing mostly patients with or without heart disease?
* Use the [NetworkX online documentation](https://networkx.org/documentation/stable/) to try creating **different types of trees and tree algorithms**. Do they return very different results?
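
To see what a minimum spanning tree does before applying it to the full dataset, the sketch below builds one from five made-up 2-D points and compares two of the algorithms offered by `nx.minimum_spanning_tree`. The points are hypothetical and are not patients from heart.csv.

```python
import networkx as nx
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Five hypothetical 2-D points forming two loose clusters
points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])

# Pairwise Manhattan distances become the edge weights of a complete graph
dist = squareform(pdist(points, metric="cityblock"))
graph = nx.from_numpy_array(dist)

# Build the MST with two different algorithms
mst_kruskal = nx.minimum_spanning_tree(graph, algorithm="kruskal")
mst_prim = nx.minimum_spanning_tree(graph, algorithm="prim")

# A spanning tree over n nodes always has n - 1 edges
print(mst_kruskal.number_of_edges(), mst_prim.number_of_edges())

# The algorithms may pick different edges when weights tie,
# but the total weight of a minimum spanning tree is the same
print(mst_kruskal.size(weight="weight"), mst_prim.size(weight="weight"))
```

The tree keeps the short edges inside each cluster and a single longer edge joining the two clusters, which is why an MST can reveal sub-groups of similar instances.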

In [None]:
# Install NetworkX (only need to do this once)
!pip install networkx

# Import package
import networkx as nx

In [None]:
from scipy.spatial.distance import pdist, squareform

df_imputed["ID"] = df_imputed.index

# Compute pairwise Manhattan ("cityblock") distances between the feature vectors of all instances
distance_vector = pdist(df_imputed.drop(columns=["HeartDisease", "ID"]), metric="cityblock")

# Compute square distance matrix
distance_matrix = squareform(distance_vector)

# Convert to NetworkX graph
undirected_graph = nx.from_numpy_array(distance_matrix)

In [None]:
# Create minimum spanning tree (MST)
mst_graph = nx.minimum_spanning_tree(undirected_graph, algorithm="prim")

In [None]:
# Set-up info for plot
node_list = list(mst_graph.nodes())
node_color = df.loc[node_list, "HeartDisease"].map({0: "red", 1: "green"})
pos = nx.kamada_kawai_layout(mst_graph)

In [None]:
# Plot the graph
plt.figure(figsize=(10, 8))
nx.draw_networkx(mst_graph, pos, node_color=node_color, node_size=100, with_labels=False, edge_color="grey")
plt.title("Heart Disease Minimum Spanning Tree")
plt.axis("off")
plt.show()