**[Do not edit the contents of this cell]**

# MSc in Bioinformatics and Theoretical Systems Biology - Maths and Stats Assignment 2020/21

This assignment is to be completed in Python, R or Julia and returned as a clean Jupyter notebook cleared of its output. This is important; otherwise Turnitin will reject the submission.

There are 4 types of cells used in this notebook:
1. Cells containing tasks and instructions to be completed. Do not edit these. These are clearly labelled.
2. Cells in which you are meant to provide an answer in Markdown format.
3. Cells containing code that defines e.g. which packages to load, but can also contain routines and snippets of code that you should use.
4. Cells that contain the Python/R/Julia code that you write to solve the problems set.

Each of these cells will contain explicit comments at the top telling you whether or not to edit the cell. In Code cells, comments are specified by the "#" character. In the Markdown Answer Cells, replace the xxx with your answer, whenever these are present. You will have to execute all code and Markdown cells in order to (i) make use of the provided code, and (ii) format the markdown appropriately.

There are four problems to be tackled:
1. Data exploration [40%]
2. Hypothesis testing [20%]
3. Regression [20%]
4. Classification [20%]

For each question there are several parts of differing difficulty. Where appropriate, further reading will be given at the start of each question.

You will have to specify which language (and version) you used and all packages needed in order to run all Code cells. Please add this information in the next two cells.

**[Provide your answer here]**
- The kernel for this Jupyter notebook is xxx, version xxx, with the following packages: xxx, xxx

In [None]:
# [Write your code in this cell]
# #import all libraries or packages needed
# #for instance, in R:
# library(coda)
# library(ggplot2)
# #in python:
# import numpy as np
# import matplotlib.pyplot as plt
# #or in Julia:
# using Plots
# using Distributions
# ...

**[Do not edit the contents of this cell]**

## Problem 1: data exploration

We consider a subset of data coming from a putative association study where researchers collected various metrics and phenotypes to find associations with a putative generic cardiovascular disease.
All recruited subjects are adults.

For each subject several __predictor variables__ are recorded: sex (1 for female, 0 for male), height (in cm), mass (in kg), whether he/she is a smoker (1) or not (0), whether he/she is a drinker (1) or not (0), and levels of 5 different metabolites (labelled A-E).
**Each subject has a unique ID number**. 

For each subject a disease score, which is the __response__ variable, measuring the severity of the disease phenotype in arbitrary units, is provided.
The data is provided in the file `association_data.csv`.

### Part 1

Load the dataset `association_data.csv`.
How many unique records of subjects do we have? How many unique predictor variables?
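
As a minimal sketch of the counting step, the snippet below reads a small in-memory stand-in for the file with Python's `csv` module. The column names (`subject`, `sex`, `height`, `mass`) and the duplicated row are assumptions made for illustration, not the contents of `association_data.csv`.

```python
import csv
import io

# Toy stand-in for association_data.csv; column names and values are assumptions.
toy_csv = io.StringIO(
    "subject,sex,height,mass\n"
    "ID1,1,165.0,60.2\n"
    "ID2,0,180.5,82.1\n"
    "ID2,0,180.5,82.1\n"  # duplicated record
    "ID3,1,158.3,55.0\n"
)

rows = list(csv.DictReader(toy_csv))

# Count subjects by unique ID rather than by row, so duplicates are not double-counted.
unique_ids = {row["subject"] for row in rows}

# In this toy table every column except the ID is a predictor.
predictors = [name for name in rows[0] if name != "subject"]

n_subjects = len(unique_ids)
n_predictors = len(predictors)
print(n_subjects, n_predictors)
```

In the real file the response column would also need to be excluded when counting predictors.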

In [None]:
# [Write your code in this cell]


**[Provide your answer here]**
- The number of unique records of subjects in the dataset is: xxx
- The number of unique predictor variables in the dataset is: xxx

**[Do not edit the contents of this cell]**

### Part 2

Produce a plot to illustrate the distribution of variables `smoker`, `mass`, `metabolite_A`. Choose the most appropriate visualisation depending on the type of each variable.

In [None]:
# [Write your code in this cell]


In [None]:
# [Write your code in this cell]


In [None]:
# [Write your code in this cell]


**[Do not edit the contents of this cell]**

### Part 3

Write a function that returns the Body Mass Index (BMI). Learn from a literature search how to calculate BMI.
Calculate BMI for each subject and add it as a new __predictor variable__ in the data set.
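
One possible shape for such a function is sketched below, using the standard definition of BMI as mass in kilograms divided by the square of height in metres. The function and parameter names are illustrative. Because the dataset records height in cm, the sketch converts to metres first.

```python
def bmi(mass_kg: float, height_cm: float) -> float:
    """Return the Body Mass Index in kg/m^2 from mass in kg and height in cm."""
    if height_cm <= 0:
        raise ValueError("height must be positive")
    height_m = height_cm / 100.0  # the dataset records height in cm
    return mass_kg / height_m ** 2


# An 80 kg subject who is 200 cm tall: 80 / 2.0**2
print(bmi(80.0, 200.0))  # 20.0
```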

In [None]:
# [Write your code in this cell]


**[Do not edit the contents of this cell]**

### Part 4

Calculate the correlation matrix between __quantitative__ (numerical) predictors. Use this information to impute any missing values, if possible.
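
The sketch below shows one way to compute a Pearson correlation matrix when some values are missing, using only pairs in which both values are present (pairwise deletion). The column values are invented and missing entries are represented as `None`, which is an assumption about how the data are loaded.

```python
import math


def pearson(xs, ys):
    """Pearson correlation using only pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


# Invented toy columns; None marks a missing value.
columns = {
    "height": [150.0, 160.0, 170.0, None, 190.0],
    "mass": [50.0, 60.0, 70.0, 80.0, None],
    "metabolite_A": [1.0, 0.5, 2.0, 1.5, 0.7],
}

# Nested dict acting as the correlation matrix.
corr = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}
print(corr["height"]["mass"])
```

A pair of predictors with a correlation close to 1 or -1 is a candidate for imputation: a missing value in one can be predicted from the other, for example with a simple linear fit.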

In [None]:
# [Write your code in this cell]


In [None]:
# [Write your code in this cell]


**[Do not edit the contents of this cell]**

### Part 5

Assuming that a subject is classed as diseased when the disease score is greater than 1, add a new __response variable__ to the dataset defining the disease `status` of each subject, either non-diseased (0) or diseased (1).

In [None]:
# [Write your code in this cell]


**[Do not edit the contents of this cell]**

## Problem 2: hypothesis testing

Starting from the same dataset in Problem 1, provide answers for the following questions.

### Part 1

Given this sample space of subjects, what is the probability that a given subject is diagnosed as diseased? What is the probability that a subject is diagnosed as diseased given that he/she is not a smoker?
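
Both quantities are relative frequencies: the first over all subjects, the second over the subset of non-smokers only. The sketch below illustrates this on invented `(disease_score, smoker)` pairs, with the disease status derived as in Problem 1, Part 5.

```python
# Invented (disease_score, smoker) pairs, for illustration only.
records = [(0.4, 0), (1.3, 0), (2.1, 1), (0.9, 1), (1.0, 0), (1.7, 1)]

# Diseased means a score strictly greater than 1, so a score of exactly 1.0 is non-diseased.
status = [1 if score > 1 else 0 for score, _ in records]

# P(diseased): fraction of all subjects that are diseased.
p_diseased = sum(status) / len(status)

# P(diseased | non-smoker): restrict the sample space to non-smokers first.
non_smoker_status = [s for s, (_, smoker) in zip(status, records) if smoker == 0]
p_diseased_given_non_smoker = sum(non_smoker_status) / len(non_smoker_status)

print(p_diseased, p_diseased_given_non_smoker)
```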

In [None]:
# [Write your code in this cell]


**[Provide your answer here]**
- The probability that a subject is diseased is: xxx
- The probability that a subject is diseased given that she/he is not a smoker is: xxx

**[Do not edit the contents of this cell]**

### Part 2

Assuming that height and mass each follow a Normal distribution $N(\mu, \sigma^2)$, provide estimates of $\mu$ and $\sigma^2$ for the distributions of height and mass, separately for males and females.
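
For a Normal sample, the sample mean estimates $\mu$, and $\sigma^2$ can be estimated either by the maximum-likelihood estimator (dividing by $n$) or by the unbiased estimator (dividing by $n-1$). The sketch below computes both per sex on invented heights using the standard-library `statistics` module.

```python
import statistics

# Invented (sex, height_cm) pairs; sex is coded 1 for female, 0 for male.
toy = [(1, 160.0), (1, 170.0), (1, 180.0), (0, 170.0), (0, 180.0), (0, 190.0)]

estimates = {}
for sex in (0, 1):
    values = [height for s, height in toy if s == sex]
    estimates[sex] = {
        "mu": statistics.fmean(values),
        "var_mle": statistics.pvariance(values),      # divides by n
        "var_unbiased": statistics.variance(values),  # divides by n - 1
    }

print(estimates)
```

Whichever estimator is used, it is worth stating the choice in the answer, since the two differ noticeably for small groups.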

In [None]:
# [Write your code in this cell]


**[Do not edit the contents of this cell]**

### Part 3

Test whether height is different between males and females. Perform the same test on the mass variable. Define (in words) your null and alternative hypotheses and your significance threshold. Finally, discuss (in words) any conclusions you can draw from the results of your statistical tests.
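
One common choice for comparing two group means is Welch's t-test, which does not assume equal variances; the assignment does not prescribe a particular test, so this is only one option. The sketch below computes the Welch statistic and its approximate degrees of freedom on invented samples.

```python
import math
import statistics


def welch_t(a, b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard error contributed by each group.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Invented samples, for illustration only.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(group_a, group_b)
print(t_stat, dof)
```

The two-sided p-value is the tail probability of a t distribution with `dof` degrees of freedom beyond `abs(t_stat)`. If SciPy is available, `scipy.stats.ttest_ind(group_a, group_b, equal_var=False)` returns both the statistic and the p-value in one call.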

In [None]:
# [Write your code in this cell]


**[Provide your answer here]**

- xxx


**[Do not edit the contents of this cell]**

### Part 4

Repeat the statistical test in Part 3 for all numerical predictor variables in the dataset. How many tests are significant with $\alpha=0.05$?
Calculate corrected p-values for multiple tests using a Bonferroni correction. 
Learn what the Bonferroni correction implies from a literature search. 
How many tests are significant after correcting for multiple tests? 
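
The Bonferroni correction multiplies each p-value by the number of tests $m$ (capping the result at 1), which is equivalent to comparing the raw p-values against $\alpha / m$. A minimal sketch on invented p-values:

```python
def bonferroni(p_values):
    """Return Bonferroni-adjusted p-values: each p times the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Invented raw p-values, for illustration only.
raw = [0.01, 0.04, 0.03, 0.5]
adjusted = bonferroni(raw)

alpha = 0.05
n_sig_raw = sum(p < alpha for p in raw)
n_sig_adjusted = sum(p < alpha for p in adjusted)
print(n_sig_raw, n_sig_adjusted)  # 3 1
```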

In [None]:
# [Write your code in this cell]


**[Do not edit the contents of this cell]**

## Problem 3: regression

Assume you have been provided with a new __predictor variable__ indicating a generic glucose index level. We know that this glucose index is related to the variable `metabolite_D` given in the previous dataset.

### Part 1

Load the new dataset called `glucose_data.csv`, which gives the measure of a generic glucose level (`glucose_index`) for each tested subject. Perform a regression where `metabolite_D` is the predictor and `glucose_index` is the response variable. Be sure to merge the two data sets by matching subject IDs.
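
The sketch below illustrates the two steps on invented values: an inner join on subject ID, so that only subjects present in both tables are used, followed by an ordinary least-squares fit of a straight line. Simple linear regression is assumed here; the IDs and numbers are made up.

```python
# Invented per-subject values keyed by subject ID; the two tables do not fully overlap.
metabolite_d = {"ID1": 1.0, "ID2": 2.0, "ID3": 3.0, "ID4": 4.0}
glucose_index = {"ID2": 5.0, "ID3": 7.0, "ID4": 9.0, "ID5": 11.0}

# Inner join: keep only IDs present in both tables, in a fixed order.
shared_ids = sorted(metabolite_d.keys() & glucose_index.keys())
x = [metabolite_d[i] for i in shared_ids]
y = [glucose_index[i] for i in shared_ids]

# Ordinary least squares for y = intercept + slope * x.
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x
print(shared_ids, slope, intercept)
```

Matching on ID rather than on row position matters because the two files need not list subjects in the same order or contain the same subjects.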

In [None]:
# [Write your code in this cell]


**[Do not edit the contents of this cell]**

## Problem 4: classification

### Part 1

Implement an algorithm to predict the disease status of a subject given all predictor variables provided in the dataset. You are free to choose whichever statistical tool you consider appropriate. Assess the performance of your classifier.
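
Whatever classifier is chosen, its performance can be summarised from the true and predicted labels on data not used for training. The sketch below computes confusion-matrix counts, accuracy, sensitivity and specificity on invented labels. The choice of metrics is an assumption, not a requirement of the assignment.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


# Invented held-out labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)  # fraction of diseased subjects correctly identified
specificity = tn / (tn + fp)  # fraction of non-diseased subjects correctly identified
print(accuracy, sensitivity, specificity)
```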

In [None]:
# [Write your code in this cell]


**[Do not edit the contents of this cell]**

### Part 2

Given the classifier you devised in Part 1, predict the disease status of the following subject:
- subject: ID1986
- sex: 1
- height: 180.2
- mass: 70.1
- smoker: 1
- drinker: 0
- metabolite_A: 0.5
- metabolite_B: 1.2
- metabolite_C: 0.5
- metabolite_D: 8.5
- metabolite_E: 10.2

In [None]:
# [Write your code in this cell]


**[Provide your answer here]**

- xxx