# Statistical analysis
Learn statistical tests and estimation with SciPy, machine learning with Scikit-Learn, and how to apply both to genome analysis.

## Contents

### Introduction
* [Contents of this course](#0.1)

### Basic statistical tests & estimates by Scipy
1. [Loading Dataset](#1.1)
1. [Calculate basic statistic](#1.2)
1. [Try basic statistical tests](#1.3)
1. [How to use statistical method to genome data](#1.4)

### Regression analysis by Scikit-Learn(sklearn)
1. [Regression analysis](#1.5)
1. [Regression analysis for genome data](#1.6)

## Introduction

So far, we have studied basic Python grammar, how to load a dataset, and how to handle a dataset with pandas.

From here on, using these methods, we are going to learn about statistical prediction for genomic data.<br>

In particular, we are going to learn about
* Basic statistical processing using [Scipy](https://docs.scipy.org/doc/scipy/reference/) library
* Basic Machine Learning methods for prediction using [Scikit-Learn](https://scikit-learn.org/stable/) library

### Today's Contents<a name="0.1"></a>

* Implement basic statistical tests using [Scipy](https://docs.scipy.org/doc/scipy/reference/) library
* Implement Regression analysis using [Scikit-Learn](https://scikit-learn.org/stable/) library
* Apply the above methods to genome data 

#### Contents of the next and later training sessions
* What is the difference between statistics and machine learning
* The basic flow of machine learning
* The weak points of these methods
* Implementing some machine learning methods
* An introduction to deep learning, etc.

## Basic statistical tests & estimates by Scipy
### Sample data

First of all, the Iris dataset is used as a sample for implementing basic statistical processing.<br>
This dataset contains data on three iris species: "setosa", "versicolor", and "virginica".<br>
Each record consists of the width and length of the sepal and the petal.<br>

There are 50 samples for each of the three species (150 samples in total).

### 1. Load Iris dataset<a name="1.1"></a>

In [None]:
"""
*** IMPORTANT ***
Run this cell before this practice.
You can download a sample file.
"""

#--- import library  ---
import pandas as pd
import numpy as np
import math
import scipy.stats
import matplotlib.pyplot as plt

#--- loading iris dataset ---
#--- header=None, names=[...] => assign col names
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', 
                 header=None, names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])
#--- check dataset
iris

In [None]:
# Extract sepal_width data of each species
setosa = iris[iris['class'] == 'Iris-setosa'].sepal_width         # Iris setosa
virginica = iris[iris['class'] == 'Iris-virginica'].sepal_width   # Iris virginica
versicolor = iris[iris['class'] == 'Iris-versicolor'].sepal_width # Iris versicolor
setosa

# show sepal_width data of setosa using a histogram
# plt.hist(setosa)
# plt.show()

### 2. Calculate basic statistic<a name="1.2"></a>

In [None]:
# Calculate basic statistics such as the mean, variance, and standard deviation

##### check the dataset to use #####
setosa

##### maximum value of sepal_width of setosa #####
# max_value = setosa.max()
# print("sepal_width_max:", max_value)

##### minimum value of sepal_width of setosa #####
# min_value = setosa.min()
# print("sepal_width_min:", min_value)

##### mean value of sepal_width of setosa #####
# mean_value = setosa.mean()
# print("sepal_width_mean:", mean_value)

##### unbiased variance of sepal_width of setosa #####
# var_value = setosa.var()
# print("sepal_width_var:", var_value)

##### sample variance of sepal_width of setosa #####
# sample_var = setosa.var(ddof=0)  # ddof=0 divides by n instead of n - 1
# print("sepal_width_sample_var:", sample_var)

##### unbiased standard deviation of sepal_width of setosa #####
# std_value = setosa.std()
# print("sepal_width_std:", std_value)

##### sample standard deviation of sepal_width of setosa #####
# sample_std = setosa.std(ddof=0)  # ddof=0 divides by n instead of n - 1
# print("sepal_width_sample_std:", sample_std)
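
The difference between the unbiased and the sample variance comes down to the divisor. The following is a minimal sketch on a small hypothetical array (the values are not taken from the Iris data). Note that pandas `Series.var()` uses `ddof=1` by default, while NumPy's `np.var()` uses `ddof=0`.

```python
import numpy as np

# Hypothetical measurements (not taken from the Iris data)
x = np.array([3.0, 3.4, 3.1, 3.6, 3.9])
n = len(x)
mean = x.mean()

# Sum of squared deviations from the mean
ss = ((x - mean) ** 2).sum()

sample_var = ss / n          # divides by n     (ddof=0)
unbiased_var = ss / (n - 1)  # divides by n - 1 (ddof=1)

print("sample variance  :", sample_var, np.var(x, ddof=0))
print("unbiased variance:", unbiased_var, np.var(x, ddof=1))
```

The unbiased variance is always the larger of the two, and the gap shrinks as `n` grows.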

#### (Note) Confidence interval of the mean
\begin{align}
\bar{X}-t_{\frac{\alpha}{2}}(n-1)\sqrt{\frac{s^2}{n}} < \mu < \bar{X}+t_{\frac{\alpha}{2}}(n-1)\sqrt{\frac{s^2}{n}}
\end{align}
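
The interval above can be evaluated term by term. The sketch below does so for a small hypothetical sample (the values are assumptions, not the Iris data), using `scipy.stats.t.ppf` for the critical value.

```python
import math
import numpy as np
import scipy.stats

# Hypothetical sample (not taken from the Iris data)
x = np.array([3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4])
n = len(x)
mean = x.mean()
s2 = x.var(ddof=1)        # unbiased variance s^2
se = math.sqrt(s2 / n)    # standard error of the mean

alpha = 0.05
# Upper alpha/2 point of the t distribution with n - 1 degrees of freedom
t_crit = scipy.stats.t.ppf(1 - alpha / 2, df=n - 1)

lower = mean - t_crit * se
upper = mean + t_crit * se
print("95% confidence interval of the mean:", (lower, upper))
```

The result should agree with `scipy.stats.t.interval` when it is given the same degrees of freedom, `loc`, and `scale`.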

In [None]:
##### (Note) calculate confidence interval of the mean (t distribution)

n = len(setosa) # the number of samples
# The confidence level is passed positionally because its keyword name
# differs between SciPy versions (alpha in older releases, confidence in newer ones)
interval = scipy.stats.t.interval(0.95,                              # Confidence level
                                  df=n-1,                            # Degrees of freedom
                                  loc=setosa.mean(),                 # mean
                                  scale=math.sqrt(setosa.var() / n)) # standard error of the mean
print("Confidence interval of mean value(95%):", interval)

##### (Note) mean ± interval, ppf:Percent point function.
# mean_value = setosa.mean()
# under = scipy.stats.t.ppf(0.025, df=n-1) * math.sqrt(setosa.var() / n)
# upper = scipy.stats.t.ppf(0.975, df=n-1) * math.sqrt(setosa.var() / n)
# print("confidence interval of mean value(95%):", mean_value, "±", upper)

##### (Note) calculate confidence interval of the mean (normal distribution)
##### (※ As n increases, the t distribution approaches the normal distribution)
# interval = scipy.stats.norm.interval(0.95,                              # Confidence level
#                                      loc=setosa.mean(),                 # mean
#                                      scale=math.sqrt(setosa.var() / n)) # standard error of the mean
# print("Confidence interval of mean value(95%, when the number of samples is large):", interval)

In [None]:
##### Summarize statistics on sepal_width of each iris species
data = {
        "species":["setosa", "virginica", "versicolor"],
        "max": [setosa.max(), virginica.max(), versicolor.max()],
        "min": [setosa.min(), virginica.min(), versicolor.min()],
        "mean": [setosa.mean(), virginica.mean(), versicolor.mean()],
        "unbiased variance":[setosa.var(), virginica.var(), versicolor.var()],
        "standard deviation":[setosa.std(), virginica.std(), versicolor.std()],
       }
pd.DataFrame(data)

### 3. Try basic statistical tests<a name="1.3"></a>

* Can we assume that the population mean of setosa is 3.5? → Hypothesis test of a population mean

\begin{align}
H_0:\mu = \mu_0 = 3.5 \quad H_1:\mu \neq \mu_0 \quad t_0 = \frac{\bar{X}-\mu_0}{\sqrt{\frac{s^2}{n}}} \quad |t_0| > t(df, \alpha=0.05): \text{reject } H_0
\end{align}
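
As a check on the formula for $t_0$, the statistic and its two-sided p value can be computed by hand and compared with `scipy.stats.ttest_1samp`. This is a minimal sketch on a hypothetical sample; the values are assumptions chosen for illustration.

```python
import math
import numpy as np
import scipy.stats

# Hypothetical sample (not taken from the Iris data)
x = np.array([3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4])
mu_0 = 3.5
n = len(x)

# t_0 = (mean - mu_0) / sqrt(s^2 / n), with the unbiased variance s^2
t_0 = (x.mean() - mu_0) / math.sqrt(x.var(ddof=1) / n)

# Two-sided p value: probability of |T| >= |t_0| under H0
p_value = 2 * scipy.stats.t.sf(abs(t_0), df=n - 1)

result = scipy.stats.ttest_1samp(x, popmean=mu_0)
print("by hand:", t_0, p_value)
print("scipy  :", result.statistic, result.pvalue)
```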

* Can we assume that the population means of versicolor and virginica are different? → Hypothesis test of the difference between two population means
(※ When the variances are not equal, we use $df = \upsilon$.)

\begin{align}
H_0:\mu_X - \mu_Y = 0 \quad H_1:\mu_X - \mu_Y \neq 0
\end{align}

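\begin{align}
\text{equal variances:} \quad & s^2 = \frac{(n_X - 1)s^2_X+(n_Y - 1)s^2_Y}{n_X+n_Y-2} \quad t_0 = \frac{\bar{X}-\bar{Y}}{s\sqrt{\frac{1}{n_X} + \frac{1}{n_Y}}} \quad df = n_X+n_Y-2 \\
\text{unequal variances:} \quad & t_0 = \frac{\bar{X}-\bar{Y}}{\sqrt{\frac{s_X^2}{n_X}+\frac{s_Y^2}{n_Y}}} \quad df = \upsilon=\frac{(\frac{s_X^2}{n_X}+\frac{s_Y^2}{n_Y})^2}{\frac{(\frac{s_X^2}{n_X})^2}{n_X-1}+\frac{(\frac{s_Y^2}{n_Y})^2}{n_Y-1}} \\
& |t_0| > t(df, \alpha=0.05): \text{reject } H_0
\end{align}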
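
For the unequal-variance case (Welch's t-test), the statistic divides the difference of the means by $\sqrt{s_X^2/n_X + s_Y^2/n_Y}$ and uses the approximate degrees of freedom $\upsilon$. The sketch below works this out on two hypothetical groups (assumed values) and compares it with `scipy.stats.ttest_ind(..., equal_var=False)`.

```python
import numpy as np
import scipy.stats

# Two hypothetical groups (not taken from the Iris data)
x = np.array([2.8, 3.0, 2.5, 2.9, 3.1, 2.7])
y = np.array([3.3, 2.9, 3.6, 3.0, 3.4, 3.8, 3.1])

vx = x.var(ddof=1) / len(x)   # s_X^2 / n_X
vy = y.var(ddof=1) / len(y)   # s_Y^2 / n_Y

# Welch's t statistic
t_0 = (x.mean() - y.mean()) / np.sqrt(vx + vy)

# Approximate degrees of freedom (upsilon)
df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))

# Two-sided p value
p_value = 2 * scipy.stats.t.sf(abs(t_0), df)

result = scipy.stats.ttest_ind(x, y, equal_var=False)
print("by hand:", t_0, df, p_value)
print("scipy  :", result.statistic, result.pvalue)
```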

In [None]:
##### Hypothesis test of a population mean #####
mu_0 = 3.5
scipy.stats.ttest_1samp(setosa, popmean=mu_0) 

### display (t statistic, p value)
### p > 0.05 -> we can't reject H0


##### Hypothesis test of the difference between two population means  ~if equal variance~ #####
# scipy.stats.ttest_ind(versicolor, virginica)

### display (t statistic, p value)
### p < 0.05 -> reject H0 : μ_X - μ_Y = 0


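##### F-test of equality of variances #####
# f = versicolor.var() / virginica.var() 
# dfx = len(versicolor) - 1
# dfy = len(virginica) - 1
# cdf = scipy.stats.f.cdf(f, dfx, dfy)   # lower-tail probability, not yet a p value
# 2 * min(cdf, 1 - cdf)                  # two-sided p value

### p > 0.05 -> we can't reject H0 : σ_X = σ_Y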


##### (Note) Hypothesis test of the difference between two population means  ~if not equal variance (Welch's t-test)~ #####
# scipy.stats.ttest_ind(versicolor, virginica, equal_var=False)

### display (t statistic, p value)
### p < 0.05 -> reject H0 : μ_X - μ_Y = 0

#### (Note) SciPy also provides various other statistical test functions, such as the paired t-test, the goodness of fit test, and the Chi-Square Test of Independence.

* Example of a goodness of fit test

Does each face of a die appear with the hypothesized probability of 1/6? → Chi-Square goodness of fit test

|dice|1|2|3|4|5|6|total|
|--------|---|-|-|-|-|-|-|
|Observation value:f |10|7|8|11|6|8|50|
|Expected value |50/6|50/6|50/6|50/6|50/6|50/6|50|

\begin{align}
H_0: p_i=1/6 \quad(i=1,\dots,6) \quad H_1: p_i \neq 1/6 \text{ for some } i \quad \chi^2 = \sum_{i} \frac{(f_i - np_i)^2}{np_i} \quad \chi^2 > \chi^2(df, \alpha=0.05): \text{reject } H_0
\end{align}
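
The statistic can be worked out directly from the table of observed and expected counts. The sketch below uses the dice counts from the table above, and takes the p value from the upper tail of the chi-square distribution with $6 - 1 = 5$ degrees of freedom.

```python
import numpy as np
import scipy.stats

observed = np.array([10, 7, 8, 11, 6, 8])   # observed counts from the table
n = observed.sum()                          # 50 throws in total
expected = np.full(len(observed), n / 6)    # n * p_i with p_i = 1/6

# chi^2 = sum (f_i - n p_i)^2 / (n p_i)
chi2_stat = ((observed - expected) ** 2 / expected).sum()

df = len(observed) - 1
p_value = scipy.stats.chi2.sf(chi2_stat, df)   # upper-tail probability
print(chi2_stat, p_value)
```

For these counts the statistic works out to 52/25 = 2.08, which is small relative to 5 degrees of freedom, so the hypothesis of a fair die is not rejected.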

In [None]:
scipy.stats.chisquare([10, 7, 8, 11, 6, 8], f_exp=[50/6, 50/6, 50/6, 50/6, 50/6, 50/6])
### display (chi-square value, p value)
### p > 0.05, we can't reject H0

* Example of Chi-Square Test of Independence 

Is there a connection between smoking and the onset of a specific disease? → Chi-Square Test of Independence

|&nbsp;|non-smoker|smoker|total|
|--------|---|-|-|
|disease|117|54|171|
|healthy|954|148|1102|
|**total**|1071|202|1273|

\begin{align}
\chi^2=\sum_i \sum_j \frac{(nf_{ij}-f_{i \cdot}f_{\cdot j})^2}{nf_{i \cdot}f_{\cdot j}} \quad \chi^2 > \chi^2(df, \alpha=0.05): \text{reject } H_0
\end{align}
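
Under independence, the expected count of each cell is (row total × column total) / n, and the formula above is the sum of (observed − expected)² / expected over all cells. The sketch below computes this for the smoking table. Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction to 2×2 tables by default, so `correction=False` is passed here to match the uncorrected formula.

```python
import numpy as np
import scipy.stats

table = np.array([[117, 54],     # disease: non-smoker, smoker
                  [954, 148]])   # healthy: non-smoker, smoker
n = table.sum()
row_tot = table.sum(axis=1, keepdims=True)   # f_i.
col_tot = table.sum(axis=0, keepdims=True)   # f_.j

# Expected counts if smoking and disease were independent
expected = row_tot * col_tot / n

chi2_stat = ((table - expected) ** 2 / expected).sum()
df = (table.shape[0] - 1) * (table.shape[1] - 1)
p_value = scipy.stats.chi2.sf(chi2_stat, df)

# Same test with SciPy, without the continuity correction
res = scipy.stats.chi2_contingency(table, correction=False)
print("by hand:", chi2_stat, p_value)
print("scipy  :", res[0], res[1])
```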

In [None]:
# make dataset for test
smoking_df = pd.DataFrame({
    'non-smoker': [117, 954],
    'smoker': [54, 148]
}, index=['disease', 'healthy'])

print(smoking_df)

# scipy.stats.chi2_contingency(smoking_df)

### display (chi-square value, p value, degrees of freedom, expected values)
### p < 0.05 -> reject H0: smoking and disease are independent

### 4. How to use statistical method to genome data<a name="1.4"></a>

#### 4_1 sample data

Sample data:
* 200 samples
* Leaf width, leaf length, presence or absence of disease, and the bases at 10 positions on chromosome 10.

In [None]:
###### sequence & phenotype
gene_data = pd.read_csv("data/gene_data.csv", index_col=0)
gene_data

#### 4_2 The effect of a base change (G or A) on leaf length

Does changing the base of chr10_1 from G to A affect leaf length?
<img src="data/compare.png" alt="compare" width="50%" height="50%">

In [None]:
###### Leaf length data of chr10_1 = G & Leaf length data of chr10_1 = A #####
G_data = gene_data[gene_data["chr10_1"] == "G"].LeafLength
A_data = gene_data[gene_data["chr10_1"] == "A"].LeafLength

###### display boxplot ######
plt.title("Leaf Length")
plt.boxplot((G_data, A_data))
plt.xticks([1, 2], ["G", "A"])
plt.xlabel("chr10_1")
plt.show()

###### Hypothesis Testing of the Difference Between Two Population Means ######
scipy.stats.ttest_ind(G_data, A_data, equal_var=False)

### Gene located in chr10_1 "may" affect leaf length.

#### 4_3 The effect of a base change (G or A) on disease resistance

Does changing the base of chr10_2 from G to A affect disease resistance?
<img src="data/compare2.png" alt="compare2" width="50%" height="50%">

In [None]:
###### Get the number of samples of each condition of chr10_2 & disease
G_True = gene_data[(gene_data["chr10_2"] == "G") & (gene_data["disease"] == True)].shape[0]
G_False = gene_data[(gene_data["chr10_2"] == "G") & (gene_data["disease"] == False)].shape[0]
A_True = gene_data[(gene_data["chr10_2"] == "A") & (gene_data["disease"] == True)].shape[0]
A_False = gene_data[(gene_data["chr10_2"] == "A") & (gene_data["disease"] == False)].shape[0]

disease_df = pd.DataFrame({
    'G': [G_True, G_False],
    'A': [A_True, A_False]
}, index=['True', 'False'])

print(disease_df)

scipy.stats.chi2_contingency(disease_df)

### Gene located in chr10_2 "may" affect disease resistance.
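
As an alternative to counting the four combinations one by one, the same kind of contingency table can be built with `pd.crosstab`. The sketch below uses a tiny hypothetical DataFrame as a stand-in for `gene_data`; only the column names are borrowed from the cell above.

```python
import pandas as pd

# Hypothetical miniature stand-in for gene_data (values are made up)
toy = pd.DataFrame({
    "chr10_2": ["G", "G", "A", "A", "G", "A"],
    "disease": [True, False, True, True, False, False],
})

# Rows: disease (False / True), columns: base (A / G)
counts = pd.crosstab(toy["disease"], toy["chr10_2"])
print(counts)
```

A table built this way can be passed to `scipy.stats.chi2_contingency` in the same way as `disease_df`.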

### The limitation & weakness of statistical tests for genomic data

* In reality, there are a great many genes (e.g., *Oryza sativa*: 30,000 or more), so a phenotype is rarely affected by only one gene.

→ It is necessary to consider the influence of multiple genes.<br>
→ The t-test and the test of independence cannot handle this; problems such as their prerequisites (assumptions) and multiple testing remain.

* Not only genes but also the environment and other conditions affect the phenotype. 

→ An analysis model that can consider other factors besides genes is necessary.
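
To make the multiple testing problem concrete: if many bases are tested one by one at the 5% level, some will fall below 0.05 by chance alone. One common remedy is the Bonferroni correction, which compares each p value with α divided by the number of tests. It is not part of this course material and is shown here only as an illustrative assumption, with made-up p values.

```python
import numpy as np

# Hypothetical p values from testing 5 bases one by one (made-up numbers)
p_values = np.array([0.004, 0.03, 0.20, 0.012, 0.60])
alpha = 0.05
m = len(p_values)   # number of tests

# Without correction: compare each p value with alpha
naive = p_values < alpha

# Bonferroni correction: compare each p value with alpha / m
significant = p_values < alpha / m

print("uncorrected:", naive)
print("Bonferroni :", significant)
```

Here three of the five tests pass the uncorrected threshold, but only the first one remains after the correction.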

## Regression analysis by Scikit-Learn(sklearn)
### 1. Regression analysis<a name="1.5"></a>

Linear regression is one of the simplest ways to analyze data while considering multiple factors.<br>
This model predicts the value of the target variable from the values of the explanatory variables.<br>
<img src="data/regression_base_en.png" alt="reg_base" width="50%" height="50%">
First, we explain single (simple) regression analysis, which considers only one factor. Then, we explain multiple regression analysis, which takes multiple factors into account.<br>

#### Single (simple) regression analysis
As an example, we examine how much of the sepal_length of iris setosa can be explained by its sepal_width.<br>

$Y(sepal\_length) = \beta * X(sepal\_width) + e \quad$ : calculate $\beta$ and $e$

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
##### loading iris data set ######
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', 
                 header=None, names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])
setosa = iris[iris['class'] == 'Iris-setosa']
setosa.head()

$ Y(sepal\_length) = \beta * X(sepal\_width) + e$

In [None]:
# loading library
from sklearn import linear_model

# select model
clf = linear_model.LinearRegression()

# Explanatory variable:"sepal_width"
X = setosa[["sepal_width"]]

# Target Variable:"sepal_length"
Y = setosa[["sepal_length"]]

# Fit the model to the data
clf.fit(X, Y)
 
# display Coefficient
print(clf.coef_)
 
# display intercept (e)
print(clf.intercept_)
 
# display coefficient of determination
print(clf.score(X, Y))

Then, the regression equation is expressed as follows.<br>
$ Y(sepal\_length) = 0.6908544 * X(sepal\_width) + 2.64465968$

The coefficient of determination is calculated by the following equation, where $f_i$ is the predicted value and $\bar{y}$ is the mean of the actual values.<br>
It is an indicator of how well the predicted values match the actual values of the target variable.<br>
\begin{align}
R^{2}= 1-\frac{\sum_i(y_i-f_i)^2}{\sum_i(y_i-\bar{y})^2} \quad 0 \leqq R^2 \leqq 1
\end{align}
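
The equation for $R^2$ can be evaluated by hand for a fitted line. The sketch below uses hypothetical data (assumed values) and `np.polyfit` as a stand-in for the least-squares fit. For a straight-line fit with an intercept, $R^2$ on the fitting data equals the squared correlation coefficient between $x$ and $y$.

```python
import numpy as np

# Hypothetical data (not taken from the Iris data)
x = np.array([3.0, 3.2, 3.4, 3.6, 3.8, 4.0])
y = np.array([4.6, 4.9, 5.0, 5.2, 5.1, 5.6])

# Least-squares line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)
f = slope * x + intercept              # predicted values f_i

ss_res = ((y - f) ** 2).sum()          # sum of (y_i - f_i)^2
ss_tot = ((y - y.mean()) ** 2).sum()   # sum of (y_i - y_mean)^2
r2 = 1 - ss_res / ss_tot
print("R^2:", r2)
```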

In [None]:
# display result
plt.scatter(X, Y, color="black")
 
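# make px = (X.min, X.min+0.1, X.min+0.2, ... up to X.max)
# (kept as a DataFrame with the same column name that the model was fitted on)
px = pd.DataFrame({"sepal_width": np.arange(X["sepal_width"].min(), X["sepal_width"].max(), 0.1)})
# predict values of py (= β*px+e)
py = clf.predict(px)
 
# draw regression line
plt.plot(px["sepal_width"], py, color="b")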
 
plt.title("Relationship Width & Length")
plt.xlabel("Sepal Width")
plt.ylabel("Sepal Length")
plt.show()

# compare the predicted value & the actual value
predict_value = clf.predict(X)
plt.plot(range(len(Y)), Y, color="b")
plt.plot(range(len(Y)), predict_value, color="r")
 
plt.title("Measured value vs Prediction")
plt.xlabel("n")
plt.ylabel("Sepal Length")
plt.show()

Next, multiple regression analysis.<br>

Here we examine how much of the sepal_length of all irises can be explained by their sepal_width, petal_length, and petal_width.

$Y(sepal\_length) = \beta_1*X_1(sepal\_width) + \beta_2*X_2(petal\_length) + \beta_3*X_3(petal\_width) + e$

In [None]:
# select model
clf = linear_model.LinearRegression()

# Explanatory variable:"sepal_width", "petal_length", "petal_width"
X = iris[["sepal_width", "petal_length", "petal_width"]]

# Target variable:"sepal_length"
Y = iris[["sepal_length"]]

# Fit the model to the data
clf.fit(X, Y)
 
# Display Coefficient
print(clf.coef_)
 
# Display Intercept
print(clf.intercept_)
 
# Display coefficient of determination
print(clf.score(X, Y))

Then, the regression equation is expressed as follows.<br>
$Y(sepal\_length) = 0.65486424 * X_1(sepal\_width) + 0.71106291 * X_2(petal\_length) - 0.56256786 * X_3(petal\_width) + 1.8450608$

In [None]:
# compare the predicted value & the actual value
predict_value = clf.predict(X)
plt.figure(figsize=(15, 8), dpi=50)

plt.plot(range(len(Y)), Y, color="b")
plt.plot(range(len(Y)), predict_value, color="r")

plt.title("Measured value vs Prediction")
plt.xlabel("n")
plt.ylabel("Sepal Length")
plt.show()

##### (Note) The least squares solution of regression analysis

\begin{align}
y = \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \dots + \beta_kx_k + e
\end{align}

is the regression equation. In matrix form, this equation is written as

\begin{align}
\boldsymbol{y} = \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{e}
\end{align}

\begin{align}
\boldsymbol{y} = \left[\begin{array}{c}
            y_1 \\
            y_2 \\
            \vdots \\
            y_n \\
        \end{array}\right] \quad
\boldsymbol{X} = \left[\begin{array}{ccccc}
            x_{11} & x_{21} & x_{31} & \cdots & x_{k1} \\
            x_{12} & x_{22} & x_{32} & \cdots & x_{k2} \\
            \vdots & \vdots & \vdots & \ddots & \vdots \\
            x_{1n} & x_{2n} & x_{3n} & \cdots & x_{kn} \\
        \end{array}\right] \quad
\boldsymbol{\beta} = \left[\begin{array}{c}
            \beta_1 \\
            \beta_2 \\
            \vdots \\
            \beta_k \\
        \end{array}\right] \quad
\boldsymbol{e} = \left[\begin{array}{c}
            e_1 \\
            e_2 \\
            \vdots \\
            e_n \\
        \end{array}\right]
\end{align}

In this case, $\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta} = \boldsymbol{e}$. We only have to find the $\boldsymbol{\beta}$ that minimizes the sum of squares of $\boldsymbol{e}=(e_1,e_2,\dots,e_n)$, that is, $\sum{e_i^2}$.<br><br>
The predicted value of $y_i$ is $\hat{y_i} = \boldsymbol{X_i} \boldsymbol{\beta}$, where $\boldsymbol{X_i}$ is the $i$-th row of $\boldsymbol{X}$, so $E = \sum{e_i^2} = \sum{(\hat{y_i} - y_i)^2}$. Taking the partial derivative of $E$ with respect to $\boldsymbol{\beta}$ and setting it to zero,<br><br>
the $\boldsymbol{\beta}$ that minimizes $E$ is obtained by solving $\boldsymbol{X'}\boldsymbol{X}\boldsymbol{\beta} = \boldsymbol{X'}\boldsymbol{y}$.
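
The equation $\boldsymbol{X'}\boldsymbol{X}\boldsymbol{\beta} = \boldsymbol{X'}\boldsymbol{y}$ can be solved numerically with NumPy. The sketch below uses simulated data with assumed coefficients (`true_beta` is hypothetical) and, like the matrix form above, has no separate intercept term. To include an intercept, a column of ones would be added to $\boldsymbol{X}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n samples, 3 explanatory variables (all values are made up)
n = 30
X = rng.normal(size=(n, 3))
true_beta = np.array([0.5, -1.0, 2.0])             # hypothetical coefficients
y = X @ true_beta + rng.normal(scale=0.1, size=n)  # y = X beta + e

# Solve X'X beta = X'y for beta
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference: NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equation:", beta_hat)
print("lstsq          :", beta_lstsq)
```

Both results should agree and lie close to the assumed coefficients, since the simulated noise is small.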

### 2. Multiple regression analysis for genome data.<a name="1.6"></a>
If you use genome data with multiple regression analysis, you have to be careful about how the data is represented.

\begin{align}
y = \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \dots + \beta_kx_k + e 
\end{align}

We want to express this equation as follows.

\begin{align}
y(phenotype) = gene1\_effect * x_1 + gene2\_effect * x_2 + gene3\_effect * x_3 + ...
\end{align}

However, genome data is not quantitative; it is qualitative (like A, T, G, C).<br>
This kind of data is called "categorical data", and it must be converted to numbers before it can be used in the equation.
<img src="data/categorical_en.png" alt="categorical" width="25%" height="25%">

##### Example of converting categorical data

For example, let us try to express the effect of one base. 
<img src="data/categorical2_en.png" alt="categorical2" width="75%" height="75%">

Such a conversion can be done with the pandas `get_dummies()` function.
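
A minimal sketch of what `get_dummies()` does, on a tiny hypothetical column (the values are made up; only the column name mimics `gene_data`). With `drop_first=True`, the first category in sorted order (here `A`) is dropped and becomes the baseline, so a single 0/1 column remains.

```python
import pandas as pd

# Hypothetical base calls at one position
toy = pd.DataFrame({"chr10_1": ["G", "A", "G", "A"]})

# One indicator column per category
all_dummies = pd.get_dummies(toy)
print(all_dummies)

# Drop the first category ("A") to avoid a redundant column
dummies = pd.get_dummies(toy, drop_first=True)
print(dummies)
```

In the resulting `chr10_1_G` column, a true (1) value means the base is G and a false (0) value means it is the baseline A.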

In [None]:
###### sequence and phenotype
gene_data = pd.read_csv("data/gene_data.csv", index_col=0)
gene_data.head()

In [None]:
# convert categorical data to quantitative data
convert_gene_data = pd.get_dummies(gene_data, drop_first=True)
convert_gene_data.head()

In [None]:
# select model
clf = linear_model.LinearRegression()

# Explanatory variables: base data of chr10 (from the fourth column onward)
X = convert_gene_data.iloc[:, 3:]

# Target variable: "LeafWidth"
Y = convert_gene_data.loc[:, "LeafWidth"]

# Fit the model to the data
clf.fit(X, Y)
 
# change display setting(Number of digits)
%precision 20

# display coefficient
print(clf.coef_)
 
# display intercept
print(clf.intercept_)
 
# display coefficient of determination
print(clf.score(X, Y))

In [None]:
# compare the predicted value & the actual value
predict_value = clf.predict(X)
plt.figure(figsize=(15, 8), dpi=50)

plt.plot(range(len(Y)), Y, color="b")
plt.plot(range(len(Y)), predict_value, color="r")

plt.title("Measured value vs Prediction")
plt.xlabel("n")
plt.ylabel("Leaf Width")
plt.show()

##### The result of Multiple regression analysis...

The regression equation is<br>

$y = 0.016 * x_1 + 0.002 * x_2 - 0.493 * x_3 + ... + 2.96$

The coefficient of determination is very high (0.978...), so it looks as if this model can explain LeafWidth.

Such a very high coefficient of determination is often seen when linear regression is used with a large number of variables, as with genome data (this is called high-dimensional data).
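
This tendency can be reproduced with random numbers alone. The sketch below is an illustration under assumed settings (50 samples, a target that is pure noise, and a hypothetical helper `r_squared`). It fits ordinary least squares with 2 and with 45 random explanatory variables and compares the coefficient of determination on the fitting data.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit with intercept, on the fitting data."""
    A = np.column_stack([np.ones(len(y)), X])      # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(0)
n = 50
y = rng.normal(size=n)              # target: pure noise
X_few = rng.normal(size=(n, 2))     # 2 random explanatory variables
X_many = rng.normal(size=(n, 45))   # 45 random explanatory variables

r2_few = r_squared(X_few, y)
r2_many = r_squared(X_many, y)
print("R^2 with  2 variables:", r2_few)
print("R^2 with 45 variables:", r2_many)
```

Although none of the variables carries any information about `y`, the fit with 45 variables reports a much higher coefficient of determination than the fit with 2.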

Does this equation reflect reality?

Actually, there is a big problem. The next lecture starts from this problem.