### MEDC0106: Bioinformatics in Applied Biomedical Science

<p align="center">
  <img src="../../resources/static/Banner.png" alt="MEDC0106 Banner" width="90%"/>
  <br>
</p>

---------------------------------------------------------------

# 08 - Data analysis - Exercises

*Written by:* Oliver Scott

**This notebook contains exercises based around the material provided in session two**

Try to complete the exercises before the next session. Feel free to refer back to the session's content to help you complete the tasks.

You may also benefit from looking back at content from session one.

You should work through the tasks consecutively.

Save your results so we can go through the answers in the next session.

----

## Contents

1. [Task 1](#Task-1) - Basic Data Analysis
2. [Task 2](#Task-2) - Searching for Correlations
3. [Task 3](#Task-3) - Model Building

----

#### Imports

Some imports you may or may not need to complete the tasks (run this cell before you attempt the exercises).


In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.metrics import RocCurveDisplay, roc_curve, auc
from sklearn.model_selection import train_test_split

## Task 1

#### Basic Data Analysis  

We have provided you with a dataset from the National Institute of Diabetes and Digestive and Kidney Diseases. The dataset contains multiple diagnostic measurements and whether or not a patient has diabetes. It is a subset of a larger database; in this subset, all patients are female, at least 21 years of age, and of Pima Indian heritage.

#### Reference
[Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261--265). IEEE Computer Society Press.]


This particular task centers around data analysis and consists of multiple subtasks.

#### Subtask 1

Your first task is to read the CSV file using pandas and identify which diagnostic measurements were taken from the patients, what data type each measurement has, and how many patients were assessed. The path to the data has been provided for you:

In [None]:
data_path = './data/diabetes.csv'
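
As a minimal sketch of the inspection step, the snippet below reads a tiny inline CSV with `pd.read_csv` and checks the column names, data types and row count. The three columns and their values are made up for illustration and only stand in for the real file at `data_path`.

```python
import io

import pandas as pd

# Toy CSV standing in for the real file; columns and values are hypothetical.
toy_csv = io.StringIO(
    "Glucose,BMI,Outcome\n"
    "148,33.6,1\n"
    "85,26.6,0\n"
    "183,23.3,1\n"
    "89,28.1,0\n"
)

df = pd.read_csv(toy_csv)  # for the exercise: pd.read_csv(data_path)

print(df.columns.tolist())        # which measurements were recorded
print(df.dtypes)                  # data type of each column
n_patients, n_columns = df.shape  # one row per patient
print(n_patients, n_columns)
```

`df.info()` and `df.describe()` give a similar overview in a single call and are worth trying on the real data.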

#### Subtask 2

The keen-eyed may have noticed that although there are no null values, there are some strange values!

- Take a look at some of the measurements and work out whether these values could be possible.
- Once you have identified unreliable measurements, remove these patients from the dataset!

<details>
<summary>Click here for a hint!</summary>
<em>Some subjects have Glucose == 0! This cannot be correct (do any other measurements have this issue?)</em>
</details>

<details>
<summary>Click here for a hint!</summary>
<em>Remember conditional selections! `df = df[df['somecol'] != 0]` keeps only the rows where 'somecol' is non-zero</em>
</details>

In [None]:
# Write your solution here. Add more cells if you wish!
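
One way to drop rows with implausible zeros is a boolean mask built across several columns at once. The sketch below uses a toy DataFrame. The column names, the values, and the choice of which columns must be non-zero are assumptions for illustration, so decide for yourself which measurements can legitimately be zero in the real data.

```python
import pandas as pd

# Toy data; column names and values are hypothetical.
df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89, 137],
    "BMI": [33.6, 26.6, 0.0, 28.1, 43.1],
    "Pregnancies": [6, 1, 8, 0, 0],  # zero is a plausible value here
    "Outcome": [1, 0, 1, 0, 1],
})

# Columns where a zero is treated as a faulty measurement (assumption).
must_be_positive = ["Glucose", "BMI"]

# True for rows in which every listed column is non-zero.
valid = (df[must_be_positive] != 0).all(axis=1)

# Keep only the valid rows and renumber the index.
clean = df[valid].reset_index(drop=True)
print(f"Removed {len(df) - len(clean)} of {len(df)} rows")
```

Restricting the check to chosen columns matters: a blanket `(df != 0).all(axis=1)` would also discard every patient whose outcome or pregnancy count is zero.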

#### Subtask 3

Now that you have a clean dataset, count the number of patients with and without a diabetes diagnosis. Make a nice plot to visualise the result. Is the data balanced or not?

In [None]:
# Write your solution here. Add more cells if you wish!
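
Counting classes and plotting them can be sketched with `value_counts` and a pandas bar plot. The `Outcome` column name, its 0/1 coding, and the values below are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy labels; assumed coding: 0 = no diabetes, 1 = diabetes.
df = pd.DataFrame({"Outcome": [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]})

# Absolute counts and relative fractions per class.
counts = df["Outcome"].value_counts().sort_index()
fractions = df["Outcome"].value_counts(normalize=True).sort_index()
print(counts)
print(fractions)

# Bar plot of the class counts.
ax = counts.plot(kind="bar")
ax.set_xlabel("Outcome (0 = no diabetes, 1 = diabetes)")
ax.set_ylabel("Number of patients")
ax.set_title("Class balance")
```

The fractions make the balance question easy to answer: values far from 0.5 indicate an imbalanced dataset.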

#### Subtask 4

Plot some histograms of the diagnostic measurements. Take note of the shape of the distributions and the units of the measurement.

In [None]:
# Write your solution here. Add more cells if you wish!
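
A quick way to get one histogram per numeric column is `DataFrame.hist`. The sketch below uses made-up values for two hypothetical columns, and also prints the skewness as a rough numeric summary of each distribution's shape.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy measurements; column names and values are hypothetical.
df = pd.DataFrame({
    "Glucose": [90, 110, 130, 150, 170, 95, 120, 140, 160, 100],
    "BMI": [22.0, 30.0, 25.0, 35.0, 28.0, 24.0, 31.0, 27.0, 40.0, 23.0],
})

# One subplot per column; returns an array of matplotlib Axes.
axes = df.hist(bins=5, figsize=(8, 3))
plt.tight_layout()

# Positive skew means a longer right tail, negative a longer left tail.
skewness = df.skew()
print(skewness)
```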

## Task 2

#### Searching for Correlations

In the session we saw how useful it is to find correlations between features, and between features and the target. We would like you to answer the following questions by performing some analysis on the dataset:

1. Is there a correlation between 'BMI' and 'Glucose'? (Make a plot.)
2. Which two features have the highest correlation?
3. Which feature correlates the most with the outcome (diabetes)?
4. Is the problem suitable as a classification or a regression problem?

In [None]:
# Write your solution here. Add more cells if you wish!
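
The first three questions can be approached with `DataFrame.corr`, which computes pairwise Pearson correlations by default. The sketch below runs on toy data; the `Age` and `Outcome` column names and all values are assumptions for illustration, so the results say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data; column names and values are hypothetical.
df = pd.DataFrame({
    "Glucose": [90, 110, 130, 150, 170],
    "BMI": [22.0, 30.0, 25.0, 35.0, 28.0],
    "Age": [30, 22, 45, 28, 35],
    "Outcome": [0, 0, 1, 1, 1],
})

corr = df.corr()  # Pearson correlation matrix

# Question 1: scatter plot of two features.
ax = df.plot.scatter(x="BMI", y="Glucose")
print(corr.loc["BMI", "Glucose"])

# Question 2: strongest feature-feature pair. Drop the target, then keep
# only the upper triangle so each pair is counted once and the diagonal
# (always 1.0) is ignored.
features = corr.drop(index="Outcome", columns="Outcome")
upper = np.triu(np.ones(features.shape, dtype=bool), k=1)
top_pair = features.abs().where(upper).stack().idxmax()

# Question 3: feature with the largest absolute correlation to the target.
best_feature = corr["Outcome"].drop("Outcome").abs().idxmax()

print(top_pair, best_feature)
```

Taking the absolute value treats strong negative correlations as equally interesting; drop `.abs()` if only positive relationships matter. `sns.heatmap(corr, annot=True)` is a convenient way to view the whole matrix at once.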

## Task 3

#### Model Building

Now that you have performed some data analysis, we would like you to build a model to predict the outcome (diabetes) based on the various diagnostic measurements. In particular, we would like you to build a simple [K-nearest neighbours model](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm). Complete the subtasks to guide you through the process:

#### Subtask 1

First you should split the data into train and test sets. We would like you to use 80% of the data for training and 20% for testing.

*You may need to separate the features and the target into separate DataFrames*

In [None]:
# Write your solution here. Add more cells if you wish!
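
A minimal sketch of an 80/20 split with `train_test_split` follows. The toy DataFrame, its column names, and the `random_state` value are assumptions; `stratify=y` is an optional extra that keeps the class proportions similar in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data; column names and values are hypothetical.
df = pd.DataFrame({
    "Glucose": [90, 110, 130, 150, 170, 95, 120, 140, 160, 100],
    "BMI": [22.0, 30.0, 25.0, 35.0, 28.0, 24.0, 31.0, 27.0, 40.0, 23.0],
    "Outcome": [0, 0, 1, 1, 1, 0, 0, 1, 1, 0],
})

# Separate the features (X) from the target (y).
X = df.drop(columns="Outcome")
y = df["Outcome"]

# test_size=0.2 gives the 80/20 split; random_state makes it reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```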

#### Subtask 2

Next you should scale the data, since KNN is particularly sensitive to the scale of the features.

- Scale the training features
- Scale the testing features

In [None]:
# Write your solution here. Add more cells if you wish!
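
One common option for scaling is `StandardScaler`, which is used here as an assumption since the exercise does not name a scaler (`MinMaxScaler` is another). The key point of the sketch is that the scaler is fitted on the training features only and then reused on the test features, so no information from the test set leaks into training. The arrays are toy values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature arrays; values are hypothetical.
X_train = np.array([[90.0, 22.0], [110.0, 30.0], [130.0, 25.0], [150.0, 35.0]])
X_test = np.array([[120.0, 28.0]])

scaler = StandardScaler()

# Learn the mean and standard deviation from the training data, then scale it.
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training statistics on the test data (transform only, no fit).
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

After scaling, each training column has mean 0 and standard deviation 1. The test columns generally will not, because they are scaled with the training statistics.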

#### Subtask 3

Train a k-nearest neighbour classifier for this task using your split and scaled data. We would like you to try different values for `n_neighbors` and find the best performing value in terms of accuracy. Parameters of the classifier can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). Once you have your best model, calculate the following metrics:

- Accuracy
- Confusion Matrix
- ROC-AUC

How well does your model perform? What do you think you can do to make your model better?

***BONUS***:

If you wish, try training a couple of different models to see if you can achieve better results:

- [Support Vector Machine](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
- [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=randomforest#sklearn.ensemble.RandomForestClassifier)

In [None]:
# Write your solution here. Add more cells if you wish!
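
The overall workflow for this subtask could look roughly like the sketch below. It uses synthetic data from `make_classification` in place of the diabetes table, and the range of `k` values, the `random_state` values, and the use of `StandardScaler` are all assumptions, so the numbers it prints are not meaningful for the real exercise.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic binary data standing in for the cleaned dataset (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Try several odd values of k and record the test accuracy for each.
accuracy_by_k = {}
for k in range(1, 16, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    accuracy_by_k[k] = accuracy_score(y_test, knn.predict(X_test_s))

# Refit with the k that scored highest (ties go to the smallest k).
best_k = max(accuracy_by_k, key=accuracy_by_k.get)
best_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train_s, y_train)

y_pred = best_model.predict(X_test_s)
y_score = best_model.predict_proba(X_test_s)[:, 1]  # probability of class 1

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)    # rows: true class, columns: predicted
roc_auc = roc_auc_score(y_test, y_score)
print(best_k, accuracy, roc_auc)
print(cm)
```

Choosing `k` by test accuracy, as done here for simplicity, tends to give an optimistic estimate; a separate validation split or cross-validation on the training data is a more careful alternative. The bonus models follow the same `fit`/`predict` pattern. Note that `SVC` only provides `predict_proba` when created with `probability=True`, though its `decision_function` output can also be passed to `roc_auc_score`.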

----

Remember to save your answers! 