# Machine Learning Assignment

For this assignment you will use the "Cardiovascular Disease Risk Dataset" provided in the assignment folder. You will perform k-means clustering on this dataset to group individuals by their BMI and Abdominal Circumference, and you will see how the number of clusters changes the resulting groups.

Run the cells in order, from top to bottom (see the run button above). Running them out of order can cause errors, and so can leaving a gap unfilled where you were asked to fill something in. Read the instructions as you go, and fill in any gaps before running a cell when instructed.

With the introduction out of the way, let's start the assignment! First, we import the libraries we need for the analysis.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import ipywidgets as widgets

num_clusters = widgets.IntSlider(value=3, min=1, max=5, step=1, description='Number of Clusters:')  # Slider for choosing k later on; k-means needs at least 1 cluster

Then we will load the dataset and check the first few rows to see what the data looks like. 

In [None]:
df = pd.read_csv("Cardiovascular_Disease_Risk_Dataset.csv")

df.head()

We only want to use BMI and Abdominal Circumference for our analysis. We will convert these columns to numeric, drop any rows that have missing values, and standardize both columns so that they are on a comparable scale before clustering.

In [None]:
axis_1 = 'BMI'
axis_2 = 'ABDOMINAL CIRCUMFERENCE'

df[axis_1] = pd.to_numeric(df[axis_1], errors='coerce')
df[axis_2] = pd.to_numeric(df[axis_2], errors='coerce')
df.dropna(subset=[axis_1, axis_2], inplace=True)

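# Keep only the two features used for clustering
X = df[[axis_1, axis_2]]

# Standardize both features to mean 0 and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Show the cleaned data (a cell only displays its last expression)
df.head()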
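
To make the scaling step less of a black box, here is a minimal sketch of what standardization does to a single column, written with the Python standard library only. The function name `standardize` and the sample values are made up for illustration and are not taken from the dataset. With its default settings, `StandardScaler` applies the same idea to every column: subtract the column mean and divide by the population standard deviation.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]


# Made-up example values, for illustration only
bmi = [18.5, 22.0, 27.5, 31.0, 36.0]
waist_cm = [70.0, 82.0, 95.0, 104.0, 118.0]

bmi_z = standardize(bmi)
waist_z = standardize(waist_cm)

# After standardization both columns have mean 0 and standard deviation 1,
# so neither one dominates the distances that k-means computes.
print([round(z, 2) for z in bmi_z])
print([round(z, 2) for z in waist_z])
```

Without this step, the column with the larger numeric range (here the circumference in centimetres) would contribute more to each distance than BMI.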

Now we will choose the number of clusters we want to use.

In [None]:
num_clusters

You can move the slider above and then re-run the cell below to see how the clusters change.

In [None]:
# Fit k-means using the number of clusters currently selected on the slider
kmeans = KMeans(n_clusters=num_clusters.value, random_state=100)
kmeans.fit(X_scaled)

# Get the cluster labels for each data point
labels = kmeans.labels_

# Add the cluster labels to the DataFrame
df['Cluster'] = labels

# Visualize the clusters
plt.figure(figsize=(8, 6))
sns.scatterplot(x=axis_1, y=axis_2, hue='Cluster', data=df, palette='viridis')
plt.title(f'K-means Clustering of {axis_1} and {axis_2}')
plt.show()
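
If you would like a number to go with the visual impression, one common heuristic is the elbow method: fit k-means for several values of k and look at where the inertia (the within-cluster sum of squared distances) stops dropping sharply. The sketch below is only an illustration under stated assumptions. It uses synthetic data from `make_blobs` rather than the assignment dataset, and the names `X_demo`, `X_demo_scaled` and `inertias` are made up for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two standardized features (NOT the assignment data)
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=100)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit k-means for k = 1..5 and record the inertia of each fit
k_values = list(range(1, 6))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=100).fit(X_demo_scaled)
    inertias.append(model.inertia_)

# Inertia shrinks as k grows; the "elbow" is where the improvement levels off
for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

To try this on the assignment data, you could replace `X_demo_scaled` with `X_scaled` from the earlier cell. The elbow is a rule of thumb, so treat it as one piece of evidence alongside the scatter plot.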

**Exercise 1.** Which number of clusters gives the best separation between the data points? Why?

In [None]:
# YOUR ANSWER HERE
# 
# 
# 
# 

**Exercise 2.** What do you think the clusters represent in this dataset? How would you label them?
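
One way to approach this question is to look at the cluster centres in the original units rather than in standardized units. The sketch below shows the idea on a handful of made-up (BMI, abdominal circumference) pairs, so the numbers it prints say nothing about the real dataset. The array `X_toy` and the other variable names are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up (BMI, abdominal circumference in cm) pairs, for illustration only
X_toy = np.array([
    [19.0, 70.0], [20.5, 74.0], [21.0, 76.0],
    [26.0, 92.0], [27.5, 95.0], [28.0, 97.0],
    [34.0, 112.0], [35.5, 116.0], [37.0, 120.0],
])

toy_scaler = StandardScaler()
X_toy_scaled = toy_scaler.fit_transform(X_toy)
toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=100).fit(X_toy_scaled)

# Centres are stored in standardized units; map them back to the original units
centers = toy_scaler.inverse_transform(toy_kmeans.cluster_centers_)
for cluster_id, (bmi, waist) in enumerate(centers):
    print(f"Cluster {cluster_id}: mean BMI {bmi:.1f}, mean abdominal circumference {waist:.1f} cm")
```

In the notebook itself, the same two calls would be `scaler.inverse_transform(kmeans.cluster_centers_)`, using the objects fitted in the earlier cells.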

In [None]:
# YOUR ANSWER HERE
# 
# 
# 
# 

Congratulations! You finished the assignment.