### MEDC0106: Bioinformatics in Applied Biomedical Science

<p align="center">
  <img src="../../resources/static/Banner.png" alt="MEDC0106 Banner" width="90%"/>
  <br>
</p>

---------------------------------------------------------------

# 10 - Exercises (Session 3)

*Written by:* Oliver Scott

**This notebook contains exercises to help you understand the concepts introduced during Session 3 of the Python workshop. The exercises are designed to give you practical experience in applying these tools to bioinformatics tasks.**

Feel free to refer back to the content in the previous notebooks to help you complete the tasks.

You should work through the tasks consecutively.

Remember to save your changes.

----

## Contents

1. [Task 1](#Task-1) - Basic data analysis

----

#### Imports

Some imports you may or may not need to complete the tasks (run this cell before you attempt the exercises).


In [None]:
# Run this cell before you attempt the exercises
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.metrics import RocCurveDisplay, roc_curve, auc
from sklearn.model_selection import train_test_split

## Task 1

#### Basic data analysis  

We have provided a dataset from the **National Institute of Diabetes and Digestive and Kidney Diseases**. This dataset includes various diagnostic measurements and indicates whether each patient has diabetes. All patients in this subset are female, at least 21 years old, and of Pima Indian heritage.

This task focuses on fundamental data analysis and is divided into multiple subtasks.

##### Reference
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). *Using the ADAP learning algorithm to forecast the onset of diabetes mellitus.* In *Proceedings of the Symposium on Computer Applications and Medical Care* (pp. 261–265). IEEE Computer Society Press.

##### Part 1

For this first part, read the CSV file using pandas and identify the following:
- The diagnostic measurements recorded for each patient.
- The data type of each measurement.
- The total number of patients assessed.

The file path to the data has been provided.

<details>
<summary>Example solution</summary>
<pre>
data_path = 'https://raw.githubusercontent.com/MEDC0106/PythonWorkshop/main/workshop/session_3/data/diabetes.csv'
df = pd.read_csv(data_path)
df.info()
</pre>
</details> 
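
<details>
<summary>Optional sketch on toy data</summary>

The three checks listed above can also be made one at a time. The sketch below is only an illustration on a small made-up table: the `toy` DataFrame, its values, and the `BMI` column name are assumptions, not the workshop data.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values and columns)
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "Outcome": [1, 0, 1, 0],
})

measurements = list(toy.columns)  # names of the recorded columns
dtypes = toy.dtypes               # data type of each column
n_patients = len(toy)             # one row per patient

print(measurements)
print(dtypes)
print(n_patients)
```

`df.info()` reports the same information in a single call; the separate attributes are handy when you want to reuse the values later.

</details>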

In [None]:
data_path = 'https://raw.githubusercontent.com/MEDC0106/PythonWorkshop/main/workshop/session_3/data/diabetes.csv'
# Write your solution here and add more cells if you wish

##### Part 2

Although there are no null values, some entries contain unusual values.

a) Review the measurements and consider whether these values are plausible.<br>
b) After identifying unreliable measurements, remove these entries from the dataset.

<details>
<summary>Click here for a hint!</summary>
<em>Some subjects have Glucose == 0! This likely isn’t correct (do other measurements have similar issues?)</em>
</details>

<details>
<summary>Click here for a hint!</summary>
<em>Remember conditional selections! For example:</em> <pre>df = df[df['somecol'] != 0]</pre>
</details>

<details>
<summary>Example solution</summary>
<pre>
print("Number of records before glucose verification:", len(df))
df = df[df.Glucose != 0]
print("Number of records after glucose verification:", len(df))
</pre>
</details>
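
<details>
<summary>Optional sketch: checking several columns at once</summary>

If more than one measurement turns out to contain implausible zeros, the same conditional selection can be applied across several columns. This is a minimal sketch on made-up data: the `toy` DataFrame, the `BMI` column name, and the choice of `check_cols` are assumptions, and deciding which columns cannot plausibly be zero is part of the exercise.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values and columns)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# Columns where a zero is assumed to be implausible.
# Outcome is left out because 0 is a legitimate value there.
check_cols = ["Glucose", "BMI"]

# Count the zeros in each checked column
zero_counts = (toy[check_cols] == 0).sum()
print(zero_counts)

# Keep only rows where every checked column is non-zero
cleaned = toy[(toy[check_cols] != 0).all(axis=1)]
print(len(toy), "->", len(cleaned))
```

</details>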

In [None]:
# Write your solution here and add more cells if you wish

##### Part 3

Now that you have a clean dataset, count the number of patients with and without diabetes, and plot the result. Is the dataset balanced?

<details>
<summary>Example solution</summary>
<pre>
outcome_counts = df.Outcome.value_counts()
print(outcome_counts)
outcome_counts.plot(kind="bar", title="Diabetes diagnosis")
</pre>
</details>
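
<details>
<summary>Optional sketch: class proportions</summary>

Balance is often easier to judge from proportions than from raw counts. The sketch below uses a made-up `outcomes` Series (hypothetical values, and it assumes 1 marks a diabetes diagnosis) rather than the workshop data.

```python
import pandas as pd

# Hypothetical outcome labels: six patients without diabetes, two with
outcomes = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="Outcome")

# normalize=True returns the fraction of rows in each class
proportions = outcomes.value_counts(normalize=True)
print(proportions)
```

On the real data, the equivalent call would be `df.Outcome.value_counts(normalize=True)`.

</details>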

In [None]:
# Write your solution here and add more cells if you wish

##### Part 4

Plot histograms for the diagnostic measurements. Observe the shape of each distribution and note the units for each measurement.

<details>
<summary>Example solution</summary>
<pre>
df.hist(figsize=[10, 10])
</pre>
</details>

In [None]:
# Write your solution here and add more cells if you wish