# Data Mining / Prospecção de Dados

## Sara C. Madeira and André Falcão, 2019/20

# Project 2 - Clustering

## Logistics

**In a "normal" scenario, students should work in teams of 2 people. Due to the social distancing imposed by the current public health situation, students were allowed to work in groups of 1 and 3. In this context, the amount of work was adapted to the number of students in each group, as described below.**

* Tasks **1 to 5** should be done by **all** groups
* Task **6** should be done only by **groups of 2 and 3** students
* Task **7** should be done only by **groups of 3** students

The quality of the project will then dictate its grade.

**The project's solution should be uploaded to Moodle before the end of May 17th, 2020 (23:59).**

**It is mandatory to produce a Jupyter notebook containing code and text/images/tables/etc. describing the solution and the results. Projects not delivered in this format will not be graded. Note that you can use `PD_201920_Project2.ipynb` as a template.**

Students should **upload a `.zip` file** containing all the files necessary for project evaluation. 

**Decisions should be justified and results should be critically discussed.**

## Dataset and Tools

In this project you should use [Python 3](https://www.python.org), [Jupyter Notebook](http://jupyter.org) and **[Scikit-learn](http://scikit-learn.org/stable/)**.

The dataset to be analysed is **`AML_ALL_PATIENTS_GENES_EXTENDED.csv`**. This is an extended version of the widely studied **Leukemia dataset**, originally published by Golub et al. (1999) ["Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"](http://archive.broadinstitute.org/mpr/publications/projects/Leukemia/Golub_et_al_1999.pdf).

**This dataset studies patients with leukemia. At disease onset, clinicians diagnosed each of them with one of two types of leukemia: acute myeloid leukemia (AML) or acute lymphoblastic leukemia (ALL).** Some of these diagnoses were later confirmed; others turned out to be wrong. The data analyzed here contains the expression levels of 5147 human genes (features/columns) measured in 110 patients (rows): 70 ALL and 40 AML.
Each row identifies a patient: the first column, `ID`, contains the patient's ID; the second column, `DIAGNOSIS`, contains the initial diagnosis as performed by clinicians (ground truth); and the remaining 5147 columns contain the expression levels of the 5147 genes analysed.

**The goal is to cluster patients and (ideally) find AML groups and ALL groups.**


<img src="AML_ALL_PATIENTS_GENES_EXTENDED.jpg" alt="AML_ALL_PATIENTS_GENES_EXTENDED.csv" style="width: 1000px;"/>

## 1. Load and Preprocess Dataset

At the end of this step you should have:
* a 110 rows × 5147 columns matrix, **X**, containing the values of the 5147 features for each of the 110 patients.
* a vector, **y**, with the 110 diagnoses, which you can use later to evaluate clustering quality.

In [1]:
# Imports
import pandas as pd
import os

from functions import *

# Get data_path
path = get_path()

Path to Data: C:\Users\peped\Documents\Repo\PD_2_Clustering\dataset


In [2]:
# Load Data
df = pd.read_csv(os.path.join(path,"AML_ALL_PATIENTS_GENES_EXTENDED.csv"))
df.head()

Unnamed: 0,ID,DIAGNOSIS,AFFX-BioC-5_at,hum_alu_at,AFFX-DapX-M_at,AFFX-LysX-5_at,AFFX-HUMISGF3A/M97935_MA_at,AFFX-HUMISGF3A/M97935_MB_at,AFFX-HUMISGF3A/M97935_3_at,AFFX-HUMRGE/M10098_5_at,...,M93143_at,U29175_at,U48730_at,U58516_at,X06956_at,X83863_at,Z17240_at,L49218_f_at,M71243_f_at,Z78285_f_at
0,1,ALL,-0.912181,-0.93628,1.330679,0.045416,0.236442,0.196788,0.244435,2.561725,...,0.42894,1.018678,0.103263,-0.518375,0.01914,0.027771,0.122186,0.593119,-0.391378,-0.161117
1,2,ALL,0.842926,-1.311323,-0.067011,-0.964423,-0.224387,-0.200579,-0.3716,-0.417306,...,-0.425219,-0.474402,-0.067901,0.555431,0.160143,0.003223,-0.043618,0.032103,-0.57699,0.415146
2,3,ALL,1.076941,-0.788133,1.859748,1.15143,0.498175,1.244393,1.390191,0.664072,...,2.304741,-0.27335,1.493964,1.747818,-0.568816,0.79768,2.306901,0.705322,-0.331659,-0.261337
3,4,ALL,-1.596222,-0.874097,0.991127,0.574379,-0.065558,-0.04404,-0.402063,0.488835,...,-0.00093,-0.288936,0.691636,0.548844,-0.552853,-0.342677,-0.653193,-1.336776,-0.496289,-1.514082
4,5,ALL,-0.192137,-0.655253,-0.193356,-0.651854,0.142487,-0.176496,-0.28698,-0.274163,...,-0.721104,-0.471285,-0.206971,-0.063819,0.32509,-1.183998,0.049037,0.099424,-0.609271,0.139542


In [3]:
# Preprocess Data
try:
    df, X, y = validate_format(df, rows=110, columns=5147, target='DIAGNOSIS', drop=['ID'], col_print=5)
except TypeError:
    pass

df.head()

Valid Format

First 5 Columns: ['AFFX-BioC-5_at', 'hum_alu_at', 'AFFX-DapX-M_at', 'AFFX-LysX-5_at', 'AFFX-HUMISGF3A/M97935_MA_at']

Targets: {'AML', 'ALL'}


Unnamed: 0,AFFX-BioC-5_at,hum_alu_at,AFFX-DapX-M_at,AFFX-LysX-5_at,AFFX-HUMISGF3A/M97935_MA_at,AFFX-HUMISGF3A/M97935_MB_at,AFFX-HUMISGF3A/M97935_3_at,AFFX-HUMRGE/M10098_5_at,AFFX-HUMRGE/M10098_M_at,AFFX-HUMRGE/M10098_3_at,...,U29175_at,U48730_at,U58516_at,X06956_at,X83863_at,Z17240_at,L49218_f_at,M71243_f_at,Z78285_f_at,target
0,-0.912181,-0.93628,1.330679,0.045416,0.236442,0.196788,0.244435,2.561725,2.489963,1.79452,...,1.018678,0.103263,-0.518375,0.01914,0.027771,0.122186,0.593119,-0.391378,-0.161117,ALL
1,0.842926,-1.311323,-0.067011,-0.964423,-0.224387,-0.200579,-0.3716,-0.417306,-0.408754,-0.002931,...,-0.474402,-0.067901,0.555431,0.160143,0.003223,-0.043618,0.032103,-0.57699,0.415146,ALL
2,1.076941,-0.788133,1.859748,1.15143,0.498175,1.244393,1.390191,0.664072,0.542223,0.548277,...,-0.27335,1.493964,1.747818,-0.568816,0.79768,2.306901,0.705322,-0.331659,-0.261337,ALL
3,-1.596222,-0.874097,0.991127,0.574379,-0.065558,-0.04404,-0.402063,0.488835,0.247321,0.26652,...,-0.288936,0.691636,0.548844,-0.552853,-0.342677,-0.653193,-1.336776,-0.496289,-1.514082,ALL
4,-0.192137,-0.655253,-0.193356,-0.651854,0.142487,-0.176496,-0.28698,-0.274163,0.379259,-0.311094,...,-0.471285,-0.206971,-0.063819,0.32509,-1.183998,0.049037,0.099424,-0.609271,0.139542,ALL


## 2. Dimensionality Reduction

As you have already noticed, the number of features (genes) is extremely high when compared to the number of objects to cluster (patients). In this context, you should perform dimensionality reduction, that is, reduce the number of features, in two ways:

* [**Removing features with low variance**](http://scikit-learn.org/stable/modules/feature_selection.html)

* [**Using Principal Component Analysis**](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)

At the end of this step you should have two new matrices with the same number of rows, each with a different number of columns (features): **X_variance** and **X_PCA**. 

**Don't change X; you will need it!**
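
As a rough illustration of the two reductions, the sketch below builds `X_variance` and `X_PCA` without modifying `X`. It runs on a synthetic stand-in for `X`, and both the variance threshold of `1.0` and the 90% explained-variance target are assumed values that would need tuning on the real data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

# Synthetic stand-in for X: 110 patients x 200 genes (the real X has 5147 columns).
rng = np.random.default_rng(0)
scales = rng.uniform(0.1, 2.0, size=200)
X = rng.normal(size=(110, 200)) * scales

# 1) Remove features whose variance is below an assumed threshold.
selector = VarianceThreshold(threshold=1.0)
X_variance = selector.fit_transform(X)  # returns a new array, X is untouched

# 2) PCA keeping enough components to explain an assumed 90% of the variance.
pca = PCA(n_components=0.90, svd_solver="full")
X_PCA = pca.fit_transform(X)

print("X:", X.shape, "X_variance:", X_variance.shape, "X_PCA:", X_PCA.shape)
print("Explained variance kept by PCA:", pca.explained_variance_ratio_.sum())
```

Passing a float to `n_components` lets PCA pick the number of components from the explained variance, which avoids hard-coding a dimension; a fixed integer is an equally valid choice.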

Write text in cells like this ...

## 3. Clustering Patients using Partitional Clustering

Use **`K`-means** to cluster the patients:

* Cluster the original data (5147 features): **X**.
    * Use different values of `K`.
    * For each value of `K`, present the clustering by specifying how many ALL and AML patients are in each cluster.  
    For instance, `{0: {'ALL': 70, 'AML': 0}, 1: {'ALL': 0, 'AML': 40}}` is the ideal clustering that we would aim to obtain with K-means when `K=2`, where the first cluster has 70 ALL patients and 0 AML patients and the second cluster has 0 ALL patients and 40 AML patients.  
    You can choose how to output this information.  
    * What is the best value of `K`? Justify using the clustering results and the [Silhouette score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html).

* Cluster the data obtained after removing features with low variance: **X_variance**.
    * Study different values of `K` as above.

* Cluster the data obtained after applying PCA: **X_PCA**.
    * Study different values of `K` as above.

* Compare the results obtained in the three datasets above for the best `K`. Discuss.
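
One possible way to organise the study of `K` is sketched below. It uses synthetic, well-separated data in place of **X** and **y**; the helper `cluster_composition`, the range of `K` and the `n_init`/`random_state` settings are illustrative assumptions, not part of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in: 70 "ALL" and 40 "AML" patients in two separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(70, 50)),
               rng.normal(3.0, 1.0, size=(40, 50))])
y = np.array(["ALL"] * 70 + ["AML"] * 40)


def cluster_composition(labels, y):
    """Hypothetical helper: return {cluster: {diagnosis: count}}."""
    table = pd.crosstab(labels, y)
    return {int(c): {d: int(n) for d, n in row.items()}
            for c, row in table.iterrows()}


results = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    results[k] = (cluster_composition(labels, y), silhouette_score(X, labels))
    print(f"K={k} silhouette={results[k][1]:.3f} composition={results[k][0]}")

# Pick the K with the highest silhouette score.
best_k = max(results, key=lambda k: results[k][1])
print("Best K by silhouette:", best_k)
```

The same loop can be run on `X_variance` and `X_PCA` by swapping the input matrix, which makes the three results directly comparable.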

In [None]:
# Write code in cells like this
# ....

Write text in cells like this ...

## 4. Clustering Patients using Hierarchical Clustering

Use a **Hierarchical Clustering Algorithm (HCA)** to cluster the patients: 

* Cluster the data in **X_variance**.
    * Use **different linkage metrics**.
    * Use different values of `K`.
    * For each linkage metric and value of `K`, present the clustering by specifying how many ALL and AML patients are in each cluster, as you did before.
    * What is the best linkage metric and the best value of `K`? Justify using the clustering results and the [Silhouette score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html).

* Cluster the data in **X_PCA**.
    * Study different linkage metrics and different values of `K` as above.

* Compare the results obtained in the two datasets above for the best linkage metric and the best `K`. Discuss.
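
A minimal sketch of such a comparison, assuming scikit-learn's `AgglomerativeClustering` is the chosen HCA implementation, is shown below. The data is a synthetic stand-in for **X_variance**, and the linkage list and values of `K` are only examples.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_variance and y.
rng = np.random.default_rng(0)
X_variance = np.vstack([rng.normal(0.0, 1.0, size=(70, 50)),
                        rng.normal(3.0, 1.0, size=(40, 50))])
y = np.array(["ALL"] * 70 + ["AML"] * 40)

scores = {}
for linkage in ("ward", "complete", "average", "single"):
    for k in (2, 3, 4):
        model = AgglomerativeClustering(n_clusters=k, linkage=linkage)
        labels = model.fit_predict(X_variance)
        scores[(linkage, k)] = silhouette_score(X_variance, labels)
        # Rows are clusters, columns are diagnoses.
        composition = pd.crosstab(labels, y).to_dict("index")
        print(linkage, k, round(scores[(linkage, k)], 3), composition)

# Best (linkage, K) pair according to the silhouette score.
best_linkage, best_k = max(scores, key=scores.get)
print("Best:", best_linkage, best_k)
```

Note that `"ward"` linkage only works with Euclidean distances, while the other linkages can be combined with other distance metrics.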

In [None]:
# Write code in cells like this
# ....

Write text in cells like this ...

## 5. Evaluating Clustering Results

In this task you should compare the best results obtained using `K`-means and HCA:
1. **Without using ground truth**
2. **Using ground truth (`DIAGNOSIS`)**.

## 5.1. Without Using Ground Truth

**Choose one adequate measure** from those available in Scikit-learn (https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation) to evaluate the different clusterings.

Discuss the results.
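
As an illustration of internal (label-free) evaluation, the sketch below computes the Calinski-Harabasz index (higher is better) and the Davies-Bouldin index (lower is better) for a K-means and an HCA clustering. The data is synthetic and the two candidate clusterings are assumptions standing in for your own best results; only one measure is required.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

# Synthetic stand-in for the feature matrix.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(70, 50)),
               rng.normal(3.0, 1.0, size=(40, 50))])

# Assumed "best" clusterings from the previous tasks.
candidates = {
    "kmeans_k2": KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X),
    "hca_ward_k2": AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X),
}

internal = {
    name: {
        "calinski_harabasz": calinski_harabasz_score(X, labels),
        "davies_bouldin": davies_bouldin_score(X, labels),
    }
    for name, labels in candidates.items()
}

for name, metrics in internal.items():
    print(name, metrics)
```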

In [None]:
# Write code in cells like this
# ....

Write text in cells like this ...

## 5.2. Using Ground Truth

**Choose one adequate measure** from those available in Scikit-learn (https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation) to evaluate the different clusterings.

Discuss the results.
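
For evaluation against the diagnoses, one candidate measure is the Adjusted Rand Index (ARI), which is 1.0 for a perfect match and close to 0 for random labellings. The sketch below is a minimal example on synthetic data; the candidate clusterings are assumptions standing in for your own best results.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the feature matrix and the DIAGNOSIS column.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(70, 50)),
               rng.normal(3.0, 1.0, size=(40, 50))])
y = np.array(["ALL"] * 70 + ["AML"] * 40)

candidates = {
    "kmeans_k2": KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X),
    "hca_ward_k2": AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X),
}

# ARI does not depend on how clusters are numbered, so string labels
# in y can be compared directly with integer cluster ids.
ari = {name: adjusted_rand_score(y, labels) for name, labels in candidates.items()}
print(ari)
```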

In [None]:
# Write code in cells like this
# ....

Write text in cells like this ...

## 6. Clustering Patients using Density-based Clustering

Use DBSCAN (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) or OPTICS (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.OPTICS.html) to cluster the patients.

Compare the results with those of K-means and HCA.
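
A minimal DBSCAN sketch follows, run on synthetic low-dimensional data. The values of `eps` and `min_samples` are assumptions chosen for this toy data; on the real matrices they would need to be tuned, for example by inspecting nearest-neighbour distances.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Synthetic stand-in: two dense groups in 10 dimensions.
rng = np.random.default_rng(0)
X_reduced = np.vstack([rng.normal(0.0, 1.0, size=(70, 10)),
                       rng.normal(6.0, 1.0, size=(40, 10))])
y = np.array(["ALL"] * 70 + ["AML"] * 40)

# eps and min_samples are assumed values for this toy data.
labels = DBSCAN(eps=4.0, min_samples=5).fit_predict(X_reduced)

# DBSCAN marks noise points with the label -1.
n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print("clusters:", n_clusters, "noise points:", n_noise)
print(pd.crosstab(labels, y))
```

Unlike `K`-means and HCA, DBSCAN does not take the number of clusters as input, so the comparison should account for noise points and for a possibly different number of clusters.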

In [None]:
# Write code in cells like this
# ....

Write text in cells like this ...

## 7. Choose a Different Clustering Algorithm to Group the Patients

Choose **a clustering algorithm** besides `K`-means, HCA and DBSCAN/OPTICS to cluster the patients. 

Justify your choice and compare the results with those of `K`-means, HCA and DBSCAN/OPTICS.
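
One candidate, shown purely as an example, is a Gaussian Mixture Model, which gives soft (probabilistic) cluster assignments. The sketch below fits it on synthetic low-dimensional data standing in for **X_PCA**; the number of components and the diagonal covariance are assumptions.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for X_PCA and y.
rng = np.random.default_rng(0)
X_PCA = np.vstack([rng.normal(0.0, 1.0, size=(70, 10)),
                   rng.normal(6.0, 1.0, size=(40, 10))])
y = np.array(["ALL"] * 70 + ["AML"] * 40)

# Diagonal covariances keep the number of parameters small,
# which matters when there are few patients per cluster.
gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
labels = gmm.fit_predict(X_PCA)

# Soft assignments: probability of each patient belonging to each component.
probabilities = gmm.predict_proba(X_PCA)

gmm_ari = adjusted_rand_score(y, labels)
print("ARI:", gmm_ari)
print("First patient membership probabilities:", probabilities[0])
```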

In [None]:
# Write code in cells like this
# ....

Write text in cells like this ...