<a href="https://colab.research.google.com/github/D34dP0oL/4216_Biomedical_DS_and_AI/blob/main/Sheet3/Assignment3_Solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [89]:
import numpy as np
import math
from scipy.stats import poisson
from scipy.stats import sem
import pandas as pd
from sklearn import linear_model 

import matplotlib.pyplot as plt

## Biomedical Data Science & AI

## Assignment 3

#### Group members:  Fabrice Beaumont, Fatemeh Salehi, Genivika Mann, Helia Salimi, Jonah

---
### Exercise 1 - Probability
The number of wine bottles sold in a shop follows a Poisson distribution with a mean of *180*
bottles per week (6 days). If $C$ is the **random variable for bottles sold per day**, what is:

#### 1.1. The probability that the shop will only sell 20 bottles per day?

The probability distribution of a Poisson random variable $X$ (here amount of sold wine bottles over time) is given by
$$ \mathbb{P}[X=x] = \frac{e^{-\mu} \mu^x}{x!} $$


The given setting gives the following means: $\mu_{\text{week}} = 180 \ \implies \ \mu_{\text{day}} = 180/6 = 30$. 

Thus the probability to sell $20$ bottles per day is:

$\begin{aligned} \mathbb{P}[X=20|\ \mu_{\text{day}}=30] &= \frac{e^{-\mu_{\text{day}}} \mu_{\text{day}}^{20}}{20!}\\
&= \frac{e^{-30} 30^{20}}{20!} \qquad \approx 1.34 \%
\end{aligned}$


In [28]:
tmp1 = poisson.pmf(20, 30)
print( f"P[X=20] = {tmp1} ~ {tmp1*100:.2f}%" )
print(f"\n(Direct computation yields almost the same result: {((30**20)*np.e**(-30))/math.factorial(20)})")

P[X=20] = 0.013411150012837837 ~ 1.34%

(Direct computation yields almost the same result: 0.01341115001283781)


In [15]:
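### QUESTION: Why does this computation lead to a different solution?
### ANSWER: np.power(30, 20) works on fixed-width (typically 64-bit) integers; 30**20 (about 3.5e29)
### overflows and wraps to a negative number. A float base, np.power(30.0, 20), avoids this.
# print(f"{np.exp(-30)* np.power(30, 20) / math.factorial(20)}")
# -1.6755778325549218e-13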
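
A plausible explanation for the discrepancy in the commented-out line is integer overflow: $30^{20} \approx 3.5 \cdot 10^{29}$ does not fit into a signed 64-bit integer. A minimal sketch, using only the standard library, that avoids large intermediate values by evaluating the pmf in log space (the helper name `poisson_pmf_log` is ours and not part of any library):

```python
import math

def poisson_pmf_log(k, mu):
    """Poisson pmf computed as exp(log-pmf), so no huge integers appear."""
    log_pmf = -mu + k * math.log(mu) - math.lgamma(k + 1)
    return math.exp(log_pmf)

p_20 = poisson_pmf_log(20, 30)
print(f"P[X=20] = {p_20:.6f}")

# 30**20 is larger than the biggest signed 64-bit integer,
# so fixed-width integer arithmetic cannot represent it.
print(30**20 > 2**63 - 1)
```

Python's built-in integers have arbitrary precision, which is why the direct computation with `30**20` and `math.factorial(20)` is unaffected.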

#### 1.2. The probability that the demand is more than average for a particular day?

The task does not define "demand"; we assume it equals the number of bottles sold. Since $\mu_{\text{day}}=30$ is the expected (average) number of bottles sold per day, the probability of selling more than this is:

$\begin{aligned} \mathbb{P}[X> \mu_{\text{day}}] = 1-\mathbb{P}[X\le \mu_{\text{day}}] &= 1-\sum_{i=0}^{\mu_{\text{day}}} \frac{e^{-\mu_{\text{day}}} \mu_{\text{day}}^{i}}{i!}\\
&= 1-\sum_{i=0}^{30} \frac{e^{-30} 30^{i}}{i!}\\
&\approx 45.16 \%
\end{aligned}$

We used the cumulative distribution function (cdf) to compute $\mathbb{P}[X\le \mu_{\text{day}}]$:

In [18]:
tmp2 = 1 - poisson.cdf(30, 30)
print( f"P[X>30] = {tmp2} ~ {tmp2*100:.2f}%" )

P[X>30] = 0.45164848742208863 ~ 45.16%
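
As a cross-check of the sum in the formula above, the same probability can be obtained without SciPy by adding up the pmf terms directly. This is only a sketch using the standard library; the variable names are ours:

```python
import math

mu_day = 30

# P[X <= 30]: sum the Poisson pmf for i = 0, ..., 30
cdf_30 = sum(
    math.exp(-mu_day) * mu_day**i / math.factorial(i)
    for i in range(mu_day + 1)
)
p_more = 1 - cdf_30
print(f"P[X>30] = {p_more:.4f}")
```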


#### 1.3. The expected number of units per day $\mathbb{E}[C]$?

The task does not define "units"; we again assume this refers to the number of bottles sold. The expected value (average) was already needed and computed in task 1.1:
$$ \mathbb{E}[C] = \mu_{\text{day}} = 30 $$

#### 1.4. What is $\text{Var}[C]$?

For a Poisson distribution, the variance equals the mean:
$$ \text{Var}[C] = \mathbb{E}[C] = \mu_{\text{day}} = 30 $$

#### 1.5. The standard deviation of $C$?

The standard deviation is the square root of the variance:
$$ \sigma(C) = \sqrt{\text{Var}[C]} = \sqrt{\mu_{\text{day}}} = \sqrt{30} \approx 5.48 $$

In [17]:
tmp3 = np.sqrt(30)
print( f"sigma(C) = {tmp3} ~ {round(tmp3,2)}" )

sigma(C) = 5.477225575051661 ~ 5.48
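
The statements in 1.3 to 1.5 (mean, variance and standard deviation of a Poisson variable) can be checked empirically by simulation. The following is a sketch under our own assumptions: the seed and the sample size are arbitrary choices, and the draws are simulated, not shop data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)        # fixed seed, chosen arbitrarily
draws = rng.poisson(lam=30, size=200_000)  # simulated daily sales

sample_mean = draws.mean()
sample_var = draws.var(ddof=1)
sample_std = draws.std(ddof=1)

print(f"sample mean     = {sample_mean:.2f}")
print(f"sample variance = {sample_var:.2f}")
print(f"sample std.dev. = {sample_std:.2f}")
```

Both the sample mean and the sample variance should land close to $30$, and the sample standard deviation close to $\sqrt{30}$.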


---
### Exercise 2 - Hypothesis testing
This exercise works with a gene expression data set whose values are assumed to be normally distributed.
Consider the gene expression data of the Golub dataset. Load the file `golub.csv`. 
It contains gene expression data of 3051 genes from 38 tumor mRNA samples. The expression data is organized in a matrix where rows correspond to genes and columns to samples. The tumor class of the columns is given in the file `golub.cl`. The names of the genes (rows) are given in `golub.gnames`.

In [31]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [40]:
!ls "/content/drive/MyDrive/Colab Notebooks/MAINF4216_Databases"

Fish.csv  golub.cl.csv	golub.csv  golub.gnames.csv


In [63]:
file_path = "/content/drive/MyDrive/Colab Notebooks/MAINF4216_Databases/"
db_golub_tumorClass = pd.read_csv(file_path + "golub.cl.csv", index_col='Unnamed: 0')
db_golub_genes = pd.read_csv(file_path + "golub.csv", index_col='Unnamed: 0')
db_golub_gnames = pd.read_csv(file_path + "golub.gnames.csv", index_col='Unnamed: 0')

In [64]:
db_golub_tumorClass.head(4)

Unnamed: 0,x
1,0
2,0
3,0
4,0


In [65]:
db_golub_genes.head(4)

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,V29,V30,V31,V32,V33,V34,V35,V36,V37,V38
1,-1.45769,-1.3942,-1.42779,-1.40715,-1.42668,-1.21719,-1.37386,-1.36832,-1.47649,-1.21583,-1.28137,-1.03209,-1.36149,-1.39979,0.17628,-1.40095,-1.56783,-1.20466,-1.24482,-1.60767,-1.06221,-1.12665,-1.20963,-1.48332,-1.25268,-1.27619,-1.23051,-1.43337,-1.08902,-1.29865,-1.26183,-1.44434,1.10147,-1.34158,-1.22961,-0.75919,0.84905,-0.66465
2,-0.75161,-1.26278,-0.09052,-0.99596,-1.24245,-0.69242,-1.37386,-0.50803,-1.04533,-0.81257,-1.28137,-1.03209,-0.74005,-0.83161,0.412,-1.27669,-0.7437,-1.20466,-1.0238,-0.38779,-1.06221,-1.12665,-1.20963,-1.12185,-0.65264,-1.27619,-1.23051,-1.18065,-1.08902,-1.05094,-1.26183,-1.25918,0.97813,-0.79357,-1.22961,-0.71792,0.45127,-0.45804
3,0.45695,-0.09654,0.90325,-0.07194,0.03232,0.09713,-0.11978,0.23381,0.23987,0.44201,-0.3956,-0.62533,0.45181,1.09519,1.09318,0.343,0.2001,0.38992,0.00641,1.10932,0.21952,-0.72267,0.5169,0.28577,0.61937,0.20085,0.29278,0.26624,-0.43377,-0.10823,-0.29385,0.05067,1.6943,-0.12472,0.04609,0.24347,0.90774,0.46509
4,3.13533,0.21415,2.08754,2.23467,0.93811,2.24089,3.36576,1.97859,2.66468,-1.21583,0.5911,3.2605,-1.36149,0.6418,2.32621,-1.40095,-1.56783,0.83502,-1.24482,-1.60767,-1.06221,3.69445,3.70837,-1.48332,2.36698,-1.27619,2.89604,0.7199,0.29598,-1.29865,2.76869,2.0896,0.70003,0.13854,1.75908,0.06151,1.30297,0.58186


In [66]:
db_golub_gnames.head(4)

Unnamed: 0,V1,V2,V3
1,36,AFFX-HUMISGF3A/M97935_MA_at (endogenous control),AFFX-HUMISGF3A/M97935_MA_at
2,37,AFFX-HUMISGF3A/M97935_MB_at (endogenous control),AFFX-HUMISGF3A/M97935_MB_at
3,38,AFFX-HUMISGF3A/M97935_3_at (endogenous control),AFFX-HUMISGF3A/M97935_3_at
4,39,AFFX-HUMRGE/M10098_5_at (endogenous control),AFFX-HUMRGE/M10098_5_at




#### 2.1. Calculate the sample mean of all genes $\beta$ in the pooled expression matrix. Use these means to determine the overall mean $\beta_0$ by just taking the average.

In [72]:
### Rows correspond to genes and columns to samples, so the mean of each gene
### is taken across the columns of its row. This is done by setting 'axis=1'.
beta = db_golub_genes.mean(axis=1)
beta_0 = beta.mean()
print("Sample mean of all genes:\n", beta)
print("Mean of all sample mean of all genes:\n", beta_0)

Sample mean of all genes:
 1      -1.129013
2      -0.846746
3       0.260806
4       0.949458
5       0.475348
          ...   
3047    0.422469
3048   -0.353091
3049    0.484367
3050   -0.366183
3051   -0.360164
Length: 3051, dtype: float64
Mean of all sample mean of all genes:
 -6.813986781078161e-09


#### 2.2. Based on the $t$-statistic defined below obtain the $100$ most significant genes.
$$ t_{\hat{\beta}} = \frac{ \hat{\beta}-\beta_0}{\text{s.e.}(\hat{\beta})} \qquad\qquad = \frac{ \hat{\beta}-\beta_0}{\text{std.dev}(\hat{\beta})}\sqrt{n}$$
[Hint: $\hat{\beta}$ is the sample mean of a particular gene, and $n$ is the number of samples - here it is always $38$.]

In [79]:
### 's.e.' denotes the standard error of the mean: the sample standard deviation
### divided by sqrt(n). The 'sem()' function from SciPy Stats computes the same quantity.
def t_test(samples):
    beta = samples.mean()
    n = len(samples)  # number of samples per gene (38 here)
    return np.sqrt(n) * (beta - beta_0) / np.std(samples, ddof=1)
    # equivalently: return (beta - beta_0) / sem(samples)

In [83]:
### Apply the t-test function row-wise
db_golub_genes_tstats = db_golub_genes.apply(t_test, axis=1)
### Return the genes with the 100 largest scores
db_golub_genes_tstats.nlargest(100).round(2)

2030    60.11
272     56.68
2847    53.87
1984    51.65
1014    50.13
        ...  
1783    28.90
1890    28.70
2833    28.64
158     28.49
3009    28.43
Length: 100, dtype: float64
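
The $t$-statistic from 2.2 can be cross-checked on made-up data. The sketch below compares the formula written out by hand, the variant based on `scipy.stats.sem`, and SciPy's one-sample test `ttest_1samp`. The simulated `samples` vector and the value `beta_0 = 0.0` are assumptions for illustration only, not values from the Golub data.

```python
import numpy as np
from scipy.stats import sem, ttest_1samp

rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.5, scale=1.0, size=38)  # one synthetic "gene"
beta_0 = 0.0                                       # hypothetical reference mean

n = samples.size
# (mean - beta_0) / (sample std.dev / sqrt(n)), with ddof=1 for the sample std.dev
t_manual = (samples.mean() - beta_0) / (samples.std(ddof=1) / np.sqrt(n))
# sem() returns the standard error of the mean (ddof=1 by default)
t_sem = (samples.mean() - beta_0) / sem(samples)
# the one-sample t-test computes the same statistic
t_scipy = ttest_1samp(samples, popmean=beta_0)[0]

print(t_manual, t_sem, t_scipy)
```

All three values should agree up to floating-point rounding.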

#### 2.3. Perform two-sample Student's $t$-tests for all genes comparing the distributions for ALL and AML.

#### 2.4. Based on the $p$-values obtained in 2.3., obtain the top $10$ genes with the lowest $p$-values.
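
A possible approach to 2.3 and 2.4, sketched on synthetic data rather than on the Golub matrix: `scipy.stats.ttest_ind` with `axis=1` runs one two-sample test per row (gene), and `nsmallest(10)` picks the ten lowest $p$-values. The `labels` vector is a made-up stand-in for `golub.cl`; the coding 0 = ALL, 1 = AML is an assumption, and the class difference is planted artificially in the first ten rows.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=0)

# Synthetic stand-in: 200 genes (rows) x 38 samples (columns)
labels = np.array([0] * 27 + [1] * 11)   # assumed coding: 0 = ALL, 1 = AML
expr = rng.normal(size=(200, 38))
expr[:10, labels == 1] += 5.0            # plant a class difference in genes 0..9

# One Student's t-test per gene (equal variances assumed by default)
t_stat, p_val = ttest_ind(expr[:, labels == 0], expr[:, labels == 1], axis=1)

p_values = pd.Series(p_val, name="p_value")
top10 = p_values.nsmallest(10)
print(top10)
```

Passing `equal_var=False` would switch to Welch's test, which does not assume equal variances in the two classes.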


#### 2.5. Shapiro-Wilk test is used to test if a random variable follows a Normal distribution (Null-hypothesis). Using this test identify the top $100$ genes which deviate significantly from normal.
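
One way to approach 2.5 is to run `scipy.stats.shapiro` on every row and rank the genes by $p$-value: the smaller the $p$-value, the stronger the evidence against normality. The sketch below uses synthetic data in which the last 20 rows are drawn from an exponential distribution (our assumption, purely for illustration). For the real data, `nsmallest(100)` would take the place of `nsmallest(10)`.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

rng = np.random.default_rng(seed=1)

normal_genes = rng.normal(size=(80, 38))       # rows 0..79: normal
skewed_genes = rng.exponential(size=(20, 38))  # rows 80..99: clearly non-normal
expr = np.vstack([normal_genes, skewed_genes])

# shapiro() returns (statistic, p-value); keep the p-value per gene
p_values = pd.Series([shapiro(row)[1] for row in expr], name="shapiro_p")

top_non_normal = p_values.nsmallest(10)
print(top_non_normal)
```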

#### 2.6. Out of the $100$ genes obtained in 2.5., use an appropriate statistical test to obtain the $10$ most differentiating genes between ALL and AML classes.
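
Since the genes selected in 2.5 deviate from normality, a rank-based test such as the Mann-Whitney U test (Wilcoxon rank-sum test) is one reasonable choice; it does not assume normally distributed values. The sketch below applies `scipy.stats.mannwhitneyu` gene by gene to synthetic, skewed data. The `labels` vector and the planted shift in the first ten rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(seed=2)

labels = np.array([0] * 27 + [1] * 11)   # assumed coding: 0 = ALL, 1 = AML
expr = rng.exponential(size=(100, 38))   # skewed, non-normal values
expr[:10, labels == 1] += 5.0            # plant a class difference in genes 0..9

# One two-sided Mann-Whitney U test per gene; index [1] is the p-value
p_values = pd.Series(
    [
        mannwhitneyu(row[labels == 0], row[labels == 1], alternative="two-sided")[1]
        for row in expr
    ],
    name="mwu_p",
)

top10 = p_values.nsmallest(10)
print(top10)
```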

#### 2.7. Inform yourself about the multiple testing problem. Apply one appropriate method to deal with it and explain how it works.
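
When thousands of genes are tested at once, some $p$-values fall below $0.05$ by chance alone. One common remedy is the Benjamini-Hochberg procedure, which controls the false discovery rate: sort the $m$ $p$-values in ascending order, scale the $k$-th smallest by $m/k$, and enforce monotonicity by taking a running minimum from the largest rank downwards. A minimal pure-Python sketch (the function name and the example $p$-values are ours):

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value (rank m) down to the smallest (rank 1)
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

example_p = [0.01, 0.04, 0.03, 0.005, 0.2]  # made-up p-values
adjusted_p = benjamini_hochberg(example_p)
print(adjusted_p)
```

A gene is then called significant if its adjusted $p$-value is below the chosen false discovery rate, for example $0.05$.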

---
### Exercise 3 - Linear regression
Using the `Fish.csv` dataset, generate a multiple linear regression model with `Weight` as the response variable and `Length1`, `Length2`, `Length3`, `Height`, and `Width` as the predictors. Answer the following questions based on the regression model.

In [88]:
file_path = "/content/drive/MyDrive/Colab Notebooks/MAINF4216_Databases/"
db_fish = pd.read_csv(file_path + "Fish.csv")
db_fish.head(4)

Unnamed: 0,Species,Weight,Length1,Length2,Length3,Height,Width
0,Bream,242.0,23.2,25.4,30.0,11.52,4.02
1,Bream,290.0,24.0,26.3,31.2,12.48,4.3056
2,Bream,340.0,23.9,26.5,31.1,12.3778,4.6961
3,Bream,363.0,26.3,29.0,33.5,12.73,4.4555


In [92]:
### We use linear_model from sklearn
### See: https://www.w3schools.com/python/python_ml_multiple_regression.asp for reference
predictors = db_fish[['Length1', 'Length2', 'Length3', 'Height', 'Width']]
response = db_fish['Weight']
regr = linear_model.LinearRegression()
regr.fit(predictors, response) 

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

#### 3.1. How large are the coefficients of the predictors?

In [96]:
print("The coefficient")
for coef, pred in zip(regr.coef_, predictors):
    print(f"of {pred} is\t{coef}")

The coefficient
of Length1 is	62.35521443246453
of Length2 is	-6.526752492044295
of Length3 is	-29.026218612693484
of Height is	28.297351322276644
of Width is	22.4733066522373


#### 3.2. What is the value of the adjusted $R$-squared? What does this tell us?
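
The adjusted $R^2$ corrects the ordinary $R^2$ for the number of predictors $p$ given $n$ observations:
$$ R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1} $$
A small helper illustrating the formula follows. The function name and the example numbers are hypothetical and are not results for the fish data; for the fitted model, `regr.score(predictors, response)` would supply $R^2$.

```python
def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R^2 from the ordinary R^2, sample size and predictor count."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical numbers, for illustration only
example = adjusted_r_squared(r2=0.90, n_samples=100, n_predictors=5)
print(f"{example:.4f}")
```

Unlike $R^2$, the adjusted value can decrease when a predictor is added that contributes little to the fit.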

#### 3.3. Which predictors lead to the increase in the weight of the fish, and which have a negative effect?
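
The signs of the coefficients printed in 3.1 answer this directly. The sketch below copies those values and splits the predictors by sign. The three length measurements are likely strongly correlated, so the individual signs should be read with caution.

```python
# Coefficients as printed in 3.1
coefficients = {
    "Length1": 62.35521443246453,
    "Length2": -6.526752492044295,
    "Length3": -29.026218612693484,
    "Height": 28.297351322276644,
    "Width": 22.4733066522373,
}

# A positive coefficient increases the predicted weight, a negative one decreases it
increase = [name for name, coef in coefficients.items() if coef > 0]
decrease = [name for name, coef in coefficients.items() if coef < 0]

print("Positive effect on Weight:", increase)
print("Negative effect on Weight:", decrease)
```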

#### 3.4. Using a simple regression model, predict the weight of a fish with `length1` of $15$?

In [110]:
predictor_3_4 = db_fish['Length1'].values.reshape(-1, 1)
regr_3_4 = linear_model.LinearRegression()
regr_3_4.fit(predictor_3_4, response) 
print(f"{regr_3_4.predict([[15]])[0]:.2f} g")

29.51 g


#### 3.5. Predict the weight of a fish with `length1`, `height`, and `width` of $20.0$, $7.3$, and $5.3$ using a multiple linear regression model.

In [112]:
predictor_3_5 = db_fish[['Length1', 'Height', 'Width']]
regr_3_5 = linear_model.LinearRegression()
regr_3_5.fit(predictor_3_5, response) 
print(f"{regr_3_5.predict([[20.0, 7.3, 5.3]])[0]:.2f} g")

272.67 g


#### 3.6. What are the associated 95% confidence intervals for the predictors, as well as for intercept?
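
`sklearn`'s `LinearRegression` does not report confidence intervals, but they follow from the ordinary least squares formulas: with design matrix $X$ (including a column of ones), $\hat{\beta} = (X^TX)^{-1}X^Ty$, $\hat{\sigma}^2 = \text{RSS}/(n-p-1)$, and the interval $\hat{\beta}_j \pm t_{0.975,\,n-p-1} \cdot \text{s.e.}(\hat{\beta}_j)$. The sketch below applies this to synthetic data; all names and numbers are assumptions for illustration, not results for `Fish.csv`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Synthetic regression problem: intercept 4.0 and three predictors
n, p = 200, 3
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, -1.0, 0.5])
y = 4.0 + X @ true_beta + rng.normal(scale=1.0, size=n)

X_design = np.column_stack([np.ones(n), X])   # prepend the intercept column
xtx_inv = np.linalg.inv(X_design.T @ X_design)
beta_hat = xtx_inv @ X_design.T @ y           # OLS estimates

residuals = y - X_design @ beta_hat
dof = n - p - 1                               # residual degrees of freedom
sigma2 = residuals @ residuals / dof          # estimated error variance
std_err = np.sqrt(sigma2 * np.diag(xtx_inv))  # standard errors of the estimates

t_crit = stats.t.ppf(0.975, dof)              # two-sided 95% critical value
lower = beta_hat - t_crit * std_err
upper = beta_hat + t_crit * std_err

for name, est, low, high in zip(["Intercept", "x1", "x2", "x3"], beta_hat, lower, upper):
    print(f"{name}: {est:.3f}  [{low:.3f}, {high:.3f}]")
```

The `statsmodels` package offers the same intervals through the `conf_int()` method of a fitted OLS result, which may be more convenient for the actual data.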