<a href="https://colab.research.google.com/github/Laib/Machine_Learning/blob/main/Unsupervised_learning_KMeans.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](https://www.cu-aflou.dz/img/logoUnivAflou.jpg "Logo Title Text 1")

[University Center -Cherif Bouchoucha- Aflou](https://www.cu-aflou.dz)

Institute of Sciences

Department of Computer Science

Machine learning

Dr. SELLAM Abdellah

## Loading the Heart Disease Dataset
The [Kaggle](https://www.kaggle.com/) version of the [Heart](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset) database consists of **1025** patient records.

Each record consists of **13** attributes (features) and a single **label** (target) that indicates whether the patient has heart disease.

The dataset is in **CSV** (Comma Separated Values) file format.

### Exercise 1
1. Download the dataset from [this link](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset).
2. Extract the downloaded zip file (archive.zip).
3. Upload the file heart.csv to the virtual drive of your Google Colab runtime.

### Reading the heart.csv file
In order to load the dataset, we need to read the contents of the **heart.csv** file and convert them into a matrix.

Doing this from scratch is tedious. Fortunately, the Python library [pandas](https://pandas.pydata.org/) can do it for us.

To read our **heart.csv** file using **pandas**:
1. Import the **pandas** library.
2. Pass the **path** to the **csv** file as an argument to the **read_csv** function.

In [None]:
import pandas as pd

df = pd.read_csv('heart.csv')

The **read_csv** function returns a data structure defined by **pandas** called a **DataFrame**.

The returned data frame in the previous code cell is assigned to a variable called **df**.
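
As a small self-contained illustration of what **read_csv** returns, the sketch below parses two records of the dataset from an in-memory string instead of a file. The use of `io.StringIO` and the name `sample_df` are assumptions made here so that the example runs without **heart.csv**.

```python
import io
import pandas as pd

# Two records copied from the heart dataset, kept in memory for illustration only.
csv_text = """age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
"""

# read_csv accepts a path or any file-like object and returns a DataFrame.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(type(sample_df).__name__)   # class name of the returned object
print(sample_df.shape)            # (number of rows, number of columns)
print(list(sample_df.columns))    # column names read from the header line
```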

To display the first rows of the dataset, we can use the **head** method of the **DataFrame** class.

In [None]:
df.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0


Next, we have to convert the dataset into a matrix.

Python has a library for manipulating matrices and n-dimensional arrays called [numpy](https://numpy.org/).

The **DataFrame** class has a method called **to_numpy** that converts its data into a **numpy** array and returns it.

In [None]:
data = df.to_numpy()

The returned **numpy** array is assigned to a variable called **data**.

To display the size of the array (matrix), we can **print** its **shape** attribute.

In [None]:
print(data.shape)

(1025, 14)


The **$14^{th}$ column** of the matrix is the **label** (target). Since we are doing **unsupervised learning**, we don't need the **label** column.

To remove the **label** column, we keep only the first **13** columns and assign them back to the variable **data**.

In [None]:
data = data[:, :13]
print(data.shape)

(1025, 13)
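
Slicing by position works because the label happens to be the last column. As an alternative sketch, the label can also be dropped by name with the **drop** method of the **DataFrame** class before converting to **numpy**. The small data frame `toy_df` below is a stand-in built for illustration (three records, two features), not the full dataset.

```python
import pandas as pd

# Tiny stand-in for the heart data: two feature columns plus the label column.
toy_df = pd.DataFrame({
    "age": [52, 53, 70],
    "chol": [212, 203, 174],
    "target": [0, 0, 0],
})

# Drop the label column by name, then convert the remaining features to a matrix.
features = toy_df.drop(columns=["target"]).to_numpy()
print(features.shape)
```

Dropping by name keeps working even if the column order of the file changes.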


## Applying K-Means clustering to the loaded dataset
To apply the k-means clustering algorithm to the dataset:
1. Create a k-means object from the python library [scikit-learn](https://scikit-learn.org/).
2. Launch the clustering process using the **fit** method.
3. Get the clusters of our dataset records using the **predict** method.

### Creating a scikit-learn K-Means object
To create a scikit-learn K-Means object:
1. Load the KMeans class from the module **cluster** of the **scikit-learn** library. 
2. Use the **KMeans** class to create an instance:
    a. Set the **init** argument to **random**.
    b. Set the **n_clusters** argument to the desired number of clusters **K**.

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(
    init="random",
    n_clusters=2)
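
Because **init** is set to **random**, the starting centers are drawn at random and two runs may give different clusters. The sketch below shows one way to make runs repeatable, assuming the **random_state** and **n_init** arguments of **KMeans**. The data is synthetic (two groups of 2-D points), and the names `toy_data` and `make_kmeans` are made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: two groups of 2-D points (not the heart dataset).
rng = np.random.default_rng(0)
toy_data = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(20, 2)),
    rng.normal(loc=10.0, scale=1.0, size=(30, 2)),
])

def make_kmeans():
    # n_init: number of random restarts; random_state: seed for the initialization.
    return KMeans(init="random", n_clusters=2, n_init=10, random_state=42)

first = make_kmeans().fit(toy_data)
second = make_kmeans().fit(toy_data)

# With the same seed, both runs end with the same centers.
print(np.allclose(first.cluster_centers_, second.cluster_centers_))
```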

### Launching the K-Means clustering
To start the K-Means clustering procedure, we use the **fit** method of the **kmeans** object on the **data** variable we prepared previously.

In [None]:
kmeans.fit(data)

KMeans(init='random', n_clusters=2)

### Getting the clusters of our dataset
To retrieve the clusters (labels) returned by the **kmeans** object, we use the **predict** method on the **data** variable.

In [None]:
clusters = kmeans.predict(data)
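
A quick sanity check on the result is to count how many records fall in each cluster. The sketch below does this with **np.bincount** on synthetic data (20 points in one group, 30 in another). It also compares the output of **predict** with the **labels_** attribute, which stores the assignments computed during **fit**. All variable names here are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: 20 points near (0, 0) and 30 points near (10, 10).
rng = np.random.default_rng(0)
toy_data = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(20, 2)),
    rng.normal(loc=10.0, scale=1.0, size=(30, 2)),
])

toy_kmeans = KMeans(init="random", n_clusters=2, n_init=10, random_state=0).fit(toy_data)
toy_clusters = toy_kmeans.predict(toy_data)

# labels_ holds the cluster index assigned to each training record during fit.
same = np.array_equal(toy_clusters, toy_kmeans.labels_)
print("predict matches labels_:", same)

# Number of records assigned to each cluster index.
counts = np.bincount(toy_clusters, minlength=2)
print("records per cluster:", counts)
```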

## Understanding the returned Clusters

Now that we have clustered our data, we have to analyze the discovered clusters.

### Displaying cluster centers
One way to understand the clusters is to check their centroids. With the **scikit-learn KMeans** class, they are available in the **cluster_centers_** attribute.

In [None]:
kmeans.cluster_centers_

array([[5.68162162e+01, 5.91891892e-01, 7.89189189e-01, 1.34983784e+02,
        2.99175676e+02, 1.43243243e-01, 4.45945946e-01, 1.45040541e+02,
        4.10810811e-01, 1.20513514e+00, 1.35675676e+00, 8.72972973e-01,
        2.41621622e+00],
       [5.30885496e+01, 7.54198473e-01, 1.02900763e+00, 1.29706870e+02,
        2.15961832e+02, 1.52671756e-01, 5.77099237e-01, 1.51415267e+02,
        2.94656489e-01, 9.96030534e-01, 1.40152672e+00, 6.87022901e-01,
        2.27175573e+00]])

We can see that we obtained a matrix of **K** rows and **13** columns.

The **K** rows correspond to the **K** cluster centers.

The **13** columns correspond to the **13** features we have in our **data** matrix.

However, this output is hard to read. We can display the cluster centers more elegantly by showing the names of the **features** for each cluster center separately.

To achieve this, we can use the **df** data frame of **pandas** we loaded earlier. It has an attribute called **columns** that holds the names of our **features**.

In [None]:
for c in range(kmeans.cluster_centers_.shape[0]):
    print('Cluster #%d'%(c))
    for i in range(data.shape[1]):
        print('\t%s:'%(df.columns[i]), kmeans.cluster_centers_[c][i])

Cluster #0
	age: 56.81621621621622
	sex: 0.5918918918918921
	cp: 0.7891891891891895
	trestbps: 134.9837837837838
	chol: 299.1756756756757
	fbs: 0.14324324324324322
	restecg: 0.4459459459459461
	thalach: 145.04054054054055
	exang: 0.41081081081081083
	oldpeak: 1.2051351351351354
	slope: 1.3567567567567569
	ca: 0.8729729729729732
	thal: 2.416216216216216
Cluster #1
	age: 53.08854961832061
	sex: 0.7541984732824428
	cp: 1.0290076335877862
	trestbps: 129.70687022900762
	chol: 215.9618320610687
	fbs: 0.1526717557251908
	restecg: 0.5770992366412214
	thalach: 151.41526717557252
	exang: 0.2946564885496184
	oldpeak: 0.9960305343511451
	slope: 1.401526717557252
	ca: 0.6870229007633589
	thal: 2.2717557251908396
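
Another way to get a labeled view is to wrap the centers in a **DataFrame**, so that **pandas** prints them as a table. This is only a sketch on synthetic data: `feature_names` reuses three of the dataset's column names, and the numbers are generated, not taken from **heart.csv**.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Three of the dataset's feature names, used here only as column labels.
feature_names = ["age", "chol", "thalach"]

# Synthetic stand-in data: two groups that differ mostly in the second feature.
rng = np.random.default_rng(0)
toy_data = np.vstack([
    rng.normal(loc=[55, 300, 145], scale=5, size=(20, 3)),
    rng.normal(loc=[53, 215, 150], scale=5, size=(30, 3)),
])

toy_kmeans = KMeans(init="random", n_clusters=2, n_init=10, random_state=0).fit(toy_data)

# One row per cluster center, one named column per feature.
centers_df = pd.DataFrame(toy_kmeans.cluster_centers_, columns=feature_names)
centers_df.index.name = "cluster"
print(centers_df.round(2))
```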


### Displaying value ranges in each cluster

We can also write a function that displays the ranges of values in each cluster using the **numpy** library.

First, we use the **where** function of the **numpy** library to find the records that belong to a certain cluster.

Then, we use the **min** and **max** methods on those records to retrieve the range of each feature.

Finally, we use the **columns** attribute of the variable **df** to display the names of the features.

In [None]:
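import numpy as np

def cluster_summary(data, columns, clusters, cluster):
    # Indices of the records assigned to the requested cluster
    idxs = np.where(clusters == cluster)
    rows = data[idxs]
    # Print the function's own argument rather than the global loop variable
    print('Cluster #%d' % cluster)
    for i in range(data.shape[1]):
        # Smallest and largest value of feature i within this cluster
        print('\t%s: %g..%g' % (columns[i], rows[:, i].min(), rows[:, i].max()))

for c in range(clusters.max() + 1):
    cluster_summary(data, df.columns, clusters, c)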

Cluster #0
	age: 35..77
	sex: 0..1
	cp: 0..3
	trestbps: 100..200
	chol: 258..564
	fbs: 0..1
	restecg: 0..2
	thalach: 88..195
	exang: 0..1
	oldpeak: 0..4.4
	slope: 0..2
	ca: 0..3
	thal: 1..3
Cluster #1
	age: 29..76
	sex: 0..1
	cp: 0..3
	trestbps: 94..178
	chol: 126..260
	fbs: 0..1
	restecg: 0..2
	thalach: 71..202
	exang: 0..1
	oldpeak: 0..6.2
	slope: 0..2
	ca: 0..4
	thal: 0..3
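
The same ranges can also be obtained with **pandas** alone, by grouping the records by cluster and aggregating with **min** and **max**. The sketch below uses the age and chol values of the five records displayed by **head** earlier. The cluster assignments in `toy_clusters` are chosen by hand for illustration and are not the output of K-Means.

```python
import numpy as np
import pandas as pd

# age and chol of the first five records of the dataset.
toy_data = np.array([
    [52, 212],
    [53, 203],
    [70, 174],
    [61, 203],
    [62, 294],
])
# Hypothetical cluster assignments, written by hand for this example.
toy_clusters = np.array([1, 1, 1, 1, 0])

frame = pd.DataFrame(toy_data, columns=["age", "chol"])
frame["cluster"] = toy_clusters

# For every cluster, compute the minimum and maximum of each feature.
summary = frame.groupby("cluster").agg(["min", "max"])
print(summary)
```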


## Exercise 2
Test the K-Means algorithm using different values of **K**.
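
A possible starting point, sketched on synthetic data rather than on the heart dataset: loop over several values of **K** and print the **inertia_** attribute of each fitted model (the sum of squared distances of the records to their closest cluster center). The range of **K** values and the variable names are arbitrary choices for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: two groups of 2-D points.
rng = np.random.default_rng(0)
toy_data = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(20, 2)),
    rng.normal(loc=10.0, scale=1.0, size=(30, 2)),
])

inertias = {}
for k in range(1, 5):
    model = KMeans(init="random", n_clusters=k, n_init=10, random_state=0).fit(toy_data)
    # Lower inertia means the records lie closer to their cluster centers.
    inertias[k] = model.inertia_
    print("K=%d  inertia=%.1f" % (k, model.inertia_))
```

To adapt it to the exercise, replace `toy_data` with the **data** matrix prepared earlier.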

## Exercise 3
Read the **scikit-learn** manual entry for [K-Means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) and try