### MEDC0106: Bioinformatics in Applied Biomedical Science

<p align="center">
  <img src="../../resources/static/Banner.png" alt="MEDC0106 Banner" width="90%"/>
  <br>
</p>

---------------------------------------------------------------

# 13 - Supplementary exercises (Session 3)

*Written by:* Oliver Scott

**This notebook contains exercises to help you understand the concepts introduced in the supplementary notebooks for Session 3 of the Python workshop. The exercises are designed to give you practical experience in applying these tools to bioinformatics tasks.**

Feel free to refer back to the content in the previous notebooks to help you complete the tasks.

You should work through the tasks consecutively.

Remember to save your changes.

----

## Contents

1. [Task 2](#Task-2) - Searching for correlations
2. [Task 3](#Task-3) - Model building

----

#### Imports

Some imports you may or may not need to complete the tasks.

In [None]:
# Run this cell before you attempt the exercises
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay, accuracy_score, auc, confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler  # for standardising features in Task 3, Part 2
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

## Task 2  

#### Searching for correlations

In this task, you will analyse the dataset from the **National Institute of Diabetes and Digestive and Kidney Diseases** to answer the following questions:

1. Is there a correlation between 'BMI' and 'Glucose'? (Create a plot to illustrate this.)
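
As a hint, here is a minimal sketch of measuring and plotting the correlation between two columns. It uses synthetic data rather than the diabetes dataset, and the column names `feature_a` and `feature_b` are hypothetical placeholders.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data; the column names are hypothetical
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "feature_a": x,
    "feature_b": 0.5 * x + rng.normal(size=200),  # loosely follows feature_a
})

# Pearson correlation coefficient between the two columns
r = toy["feature_a"].corr(toy["feature_b"])

# Scatter plot with the coefficient shown in the title
fig, ax = plt.subplots()
ax.scatter(toy["feature_a"], toy["feature_b"], alpha=0.5)
ax.set_xlabel("feature_a")
ax.set_ylabel("feature_b")
ax.set_title(f"Pearson r = {r:.2f}")
```

The same pattern should carry over to a DataFrame loaded with `pd.read_csv(data_path)`, swapping in the real column names.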

In [None]:
data_path = 'https://raw.githubusercontent.com/MEDC0106/PythonWorkshop/main/workshop/session_3/data/diabetes.csv'
# Write your solution here and add more cells if you wish

2. Which two features have the strongest correlation with each other?
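
One possible approach, sketched on synthetic data with hypothetical columns `a`, `b` and `c`, is to compute the full correlation matrix and then rank the off-diagonal entries by absolute value.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the column names are hypothetical
rng = np.random.default_rng(1)
base = rng.normal(size=300)
toy = pd.DataFrame({
    "a": base,
    "b": base + 0.1 * rng.normal(size=300),  # built to track "a" closely
    "c": rng.normal(size=300),               # independent noise
})

corr = toy.corr()  # pairwise Pearson correlations

# Keep only the upper triangle; k=1 also drops the diagonal of self-correlations
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank by absolute value so that strong negative correlations count too
strongest_pair = pairs.abs().idxmax()
print(strongest_pair, round(pairs.loc[strongest_pair], 3))
```

Masking the upper triangle avoids reporting each pair twice and excludes the trivial correlation of a column with itself.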

In [None]:
# Write your solution here, adding more cells if you wish!

3. Which feature has the highest correlation with the outcome ("diabetes" diagnosis)?
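
A sketch of ranking features by their correlation with a target column is shown here. The data are synthetic, and the names `informative`, `noise` and `label` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the column names are hypothetical
rng = np.random.default_rng(2)
n = 400
label = rng.integers(0, 2, size=n)  # binary 0/1 target
toy = pd.DataFrame({
    "informative": label + 0.5 * rng.normal(size=n),  # depends on the target
    "noise": rng.normal(size=n),                      # unrelated to the target
    "label": label,
})

# Correlate every feature column with the target column
features = toy.drop(columns="label")
target_corr = features.corrwith(toy["label"])

# Sort by absolute value, strongest first
target_corr = target_corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(target_corr)
```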

In [None]:
# Write your solution here, adding more cells if you wish!

4. Based on the data, would this problem be better suited as a classification or regression task?
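
Inspecting how many distinct values the target takes can help with this decision. The sketch below uses a small hand-made column called `label`, which is a hypothetical stand-in for the real target.

```python
import pandas as pd

# Hand-made stand-in for a target column; the name "label" is hypothetical
toy = pd.DataFrame({"label": [0, 1, 0, 0, 1, 1, 0, 0]})

n_unique = toy["label"].nunique()     # number of distinct target values
counts = toy["label"].value_counts()  # how often each value occurs
print(n_unique)
print(counts)
```

A target with a handful of discrete values usually points towards classification, while a continuous target points towards regression.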

In [None]:
# Write your solution here, adding more cells if you wish!

## Task 3

#### Model building

Now that you've performed some data analysis, the next step is to build a model to predict the likelihood of diabetes based on diagnostic measurements. Specifically, you'll implement a simple [*k*-nearest neighbours (KNN) model](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm). Follow the steps below to guide you through the process:

##### Part 1

Split the dataset into training and testing sets with an 80/20 ratio. Keep the features and the target in separate objects (for example, a DataFrame of features and a Series for the target) before splitting.

*Hint: You may need to separate the features (X) and the target variable (y) before proceeding with the split.*
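
A minimal sketch of an 80/20 split with `train_test_split` follows. The data are synthetic, the column names `f1`, `f2` and `label` are hypothetical, and the `random_state` and `stratify` settings are optional choices rather than requirements of the task.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the column names are hypothetical
rng = np.random.default_rng(3)
toy = pd.DataFrame({
    "f1": rng.normal(size=100),
    "f2": rng.normal(size=100),
    "label": rng.integers(0, 2, size=100),
})

X = toy.drop(columns="label")  # features
y = toy["label"]               # target

# test_size=0.2 gives the 80/20 split; stratify=y keeps class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible between runs.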

In [None]:
# Write your solution here, adding more cells if you wish!

##### Part 2

KNN makes predictions from the distances between samples, so it is particularly sensitive to the scale of the features. You should therefore standardise the features.

- Scale the training features.
- Apply the same scaling to the testing features. 
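
The two steps above can be sketched with scikit-learn's `StandardScaler`, which is one common choice and is assumed here rather than prescribed by the exercise. The arrays are synthetic placeholders for your own training and testing features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and testing features
rng = np.random.default_rng(4)
X_train = rng.normal(loc=50.0, scale=10.0, size=(80, 3))
X_test = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from the training set only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on the test set
```

Fitting the scaler on the training set alone keeps information from the test set out of the preprocessing step.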

In [None]:
# Write your solution here, adding more cells if you wish!

##### Part 3

Train a KNN classifier using your split and scaled data. Experiment with different values for `n_neighbors` to find the best-performing setting in terms of accuracy. You can explore KNN parameters [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).

After identifying the optimal value for `n_neighbors`, evaluate your model with the following metrics:

- Accuracy
- Confusion matrix
- ROC-AUC
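
The following sketch shows one way to compare several values of `n_neighbors` and then compute the three metrics. It runs on synthetic data from `make_classification`, and the range of `k` values is an arbitrary assumption. Note that picking `k` by test-set accuracy, as done here for brevity, gives an optimistic estimate; cross-validation on the training set would be a more careful alternative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data as a stand-in for the real dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Record the accuracy for a few odd values of k (the range is arbitrary)
scores = {}
for k in range(1, 22, 2):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    scores[k] = accuracy_score(y_test, model.predict(X_test_s))
best_k = max(scores, key=scores.get)

# Refit with the chosen k and compute the three metrics
best = KNeighborsClassifier(n_neighbors=best_k).fit(X_train_s, y_train)
y_pred = best.predict(X_test_s)
y_prob = best.predict_proba(X_test_s)[:, 1]  # predicted probability of class 1

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
auc_score = roc_auc_score(y_test, y_prob)
print(best_k, accuracy, auc_score)
print(cm)
```

`ConfusionMatrixDisplay` and `RocCurveDisplay` from the imports cell can be used to plot the confusion matrix and ROC curve instead of printing them.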

In [None]:
# Write your solution here, adding more cells if you wish!

##### Part 4

If you wish, try training a couple of different models to see if you can achieve better results.

- [Support Vector Machine](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
- [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=randomforest#sklearn.ensemble.RandomForestClassifier)
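
Because scikit-learn estimators share the same `fit` and `score` interface, several models can be compared in a short loop. This sketch uses synthetic data and mostly default hyperparameters, both of which are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data as a stand-in for the real dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Mostly default hyperparameters; tuning them is left as an experiment
models = {
    "SVC": SVC(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# score() returns the mean accuracy on the data it is given
results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    results[name] = model.score(X_test_s, y_test)
print(results)
```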

In [None]:
# Write your solution here, adding more cells if you wish!