# Statistical analysis(2)
This lecture covers statistical tests and estimation with SciPy, machine learning with scikit-learn, and how to apply them to genome analysis.

## Contents

### "The difference between Machine Learning & Statistics" & "How to apply these methods to genome analysis"
1. [Slide](#0.1)

Reference: [Machine Learning vs. Statistics](https://www.svds.com/machine-learning-vs-statistics)

### Previous review & supplement
1. [Multiple Regression](#1.1)
1. [Problem](#1.2)
1. [Solution](#1.3)

### Today's contents
1. [Basic flow of Machine Learning](#2.1)
1. [Example of variable selection](#2.2)
1. [Decision Tree](#2.3)
1. [Data preparation](#2.4)

## Previous review & supplement

### 1. Multiple Regression<a name="1.1"></a>

This model predicts the value of the target variable from the values of the explanatory variables.

<img src="https://github.com/CropEvol/lecture/blob/master/textbook_2018/09_statistics/data/regression_base_en.png?raw=true" alt="reg_base" width="50%" height="50%">

Each $x_k$ is 0 or 1. For example, a sample either has gene_1 (1) or does not have it (0).
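The 0/1 coding above can be sketched on a tiny table. The data below are made up for illustration (they are not taken from `gene_data.csv`); only `pd.get_dummies` and its `drop_first` option are the same as in the cells that follow.

```python
import pandas as pd

# Toy data (hypothetical values): one phenotype column and one genotype column.
toy = pd.DataFrame({
    "LeafWidth": [1.2, 0.8, 1.5, 1.1],
    "gene_1": ["A", "G", "A", "T"],
})

# Each base becomes its own 0/1 column.
# drop_first=True drops the first category ("A"), which becomes the baseline.
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded.astype(float))
```

A sample with base "A" has 0 in both `gene_1_G` and `gene_1_T`, so the coefficients of these columns are read as effects relative to "A".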

\begin{align}
\boldsymbol{y} = \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{e}
\end{align}

\begin{align}
\boldsymbol{y} = \left[\begin{array}{c}
            y_1 \\
            y_2 \\
            ... \\
            y_n \\
        \end{array}\right] \quad
\boldsymbol{X} = \left[\begin{array}{ccccc}
            x_{11} & x_{21} & x_{31} & ... & x_{k1} \\
            x_{12} & x_{22} & x_{32} & ... & x_{k2} \\
            ... & ... & ... & ... & ...\\
            x_{1n} & x_{2n} & x_{3n} & ... & x_{kn} \\
        \end{array}\right] \quad
\boldsymbol{\beta} = \left[\begin{array}{c}
            \beta_1 \\
            \beta_2 \\
            ... \\
            \beta_k \\
        \end{array}\right] \quad
\boldsymbol{e} = \left[\begin{array}{c}
            e_1 \\
            e_2 \\
            ... \\
            e_n \\
        \end{array}\right]
\end{align}

With $\hat{y_i} = \boldsymbol{X_i} \boldsymbol{\beta}$, minimizing $\sum{e_i^2} = \sum{(\hat{y_i} - y_i)^2}$ leads to the normal equation $\boldsymbol{X'}\boldsymbol{X}\boldsymbol{\beta} = \boldsymbol{X'}\boldsymbol{y}$, which is solved for $\boldsymbol{\beta}$:<br><br>
→ $\boldsymbol{\hat{\beta}} = (\boldsymbol{X'}\boldsymbol{X})^{-1}\boldsymbol{X'}\boldsymbol{y}$
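As a small check of the formula, the normal equation can be solved directly with NumPy. This is only a sketch on synthetic 0/1 data (the sample size, the true coefficients and the noise level are arbitrary assumptions), and it has no intercept term, like the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3

# Synthetic 0/1 explanatory variables (presence/absence of each gene).
X = rng.integers(0, 2, size=(n, k)).astype(float)
beta_true = np.array([2.0, -1.0, 0.5])           # assumed "true" effects
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Solve X'X beta = X'y (numerically safer than forming the inverse explicitly).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# NumPy's least-squares solver should give the same estimate.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equation:", beta_hat)
print("lstsq          :", beta_lstsq)
```

`LinearRegression` in scikit-learn also estimates an intercept by default, so its coefficients are not expected to match this no-intercept sketch exactly.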

In [None]:
# loading library
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model 

# loading dataset & convert categorical variables
gene_data = pd.read_csv("https://github.com/CropEvol/lecture/blob/master/textbook_2018/09_statistics/data/gene_data.csv?raw=true", index_col=0)
convert_gene_data = pd.get_dummies(gene_data, drop_first=True)
convert_gene_data.head()

In [None]:
# select model
clf = linear_model.LinearRegression()

# Explanatory variables: base data of chr10 (fourth column onward)
X = convert_gene_data.iloc[:, 3:]

# Target variable: "LeafWidth"
Y = convert_gene_data.loc[:, "LeafWidth"]

# Fit the model to the data
clf.fit(X, Y)

# display coefficient
print(clf.coef_)
 
# display intercept
print(clf.intercept_)
 
# display coefficient of determination
print(clf.score(X, Y))

In [None]:
# plot the estimated effect (coefficient) of every base
plt.figure(figsize=(25, 5), dpi=50)
col_names = convert_gene_data.iloc[:, 3:].columns.values
col_num = len(col_names)
plt.scatter(range(col_num), clf.coef_)
plt.xticks(range(col_num), col_names)
plt.show()

# the effects of the bases on chr_3, 4 and 5 are noticeably large

### 2. Problem<a name="1.2"></a>

#### Fundamental problem
* Only the bases on chr3, 4 and 5 actually have an effect. → However, adding bases that have no effect still increases the coefficient of determination.
* In other words, the more variables you add, the higher the coefficient of determination becomes, even if the added information is meaningless in practice.


#### Biological problem
* It is unnatural for every base to affect one specific phenotype.
* The number of genes is tremendously large, so if you add the information of all genes, the fundamental problem above arises.
* This model does not consider interactions between genes.
* ...etc

→ This model can explain the samples used to build it almost perfectly, but it cannot adapt to a new dataset (overfitting).<br>

#### Mathematical problem
* The effect of each gene is estimated by $\boldsymbol{\hat{\beta}} = (\boldsymbol{X'}\boldsymbol{X})^{-1}\boldsymbol{X'}\boldsymbol{y}$ → you need the inverse matrix of $\boldsymbol{X'}\boldsymbol{X}$.
* When the number of variables (genes) is large, the dataset may contain duplicate or linearly dependent columns. In that case $\boldsymbol{X'}\boldsymbol{X}$ becomes rank-deficient, and the inverse matrix may not be obtained correctly.
* Linear regression on genomic data is especially prone to this problem because of linkage: linked loci tend to show the same genotype pattern across samples.
* To solve this kind of problem, we use a method called "regularization", for example ridge regression: $\boldsymbol{\hat{\beta}} = (\boldsymbol{X'}\boldsymbol{X}+\lambda\boldsymbol{I_{k}})^{-1}\boldsymbol{X'}\boldsymbol{y}$ with $\lambda > 0$
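The rank problem and the effect of adding $\lambda$ to the diagonal can be sketched with NumPy. The data are synthetic, the third column is an exact copy of the first (an extreme case of complete linkage), and `lam = 1.0` is an arbitrary value chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x1 = rng.integers(0, 2, size=n).astype(float)
x2 = rng.integers(0, 2, size=n).astype(float)

# The third column duplicates the first one.
X = np.column_stack([x1, x2, x1])
y = 1.0 * x1 + 0.5 * x2 + rng.normal(scale=0.1, size=n)

XtX = X.T @ X
# X'X is 3 x 3 but its rank is lower, so it has no ordinary inverse.
print("rank of X'X:", np.linalg.matrix_rank(XtX))

# Adding lam * I makes the matrix invertible (ridge-type regularization).
lam = 1.0  # hypothetical value
beta_ridge = np.linalg.solve(XtX + lam * np.eye(X.shape[1]), X.T @ y)
print("regularized estimate:", beta_ridge)
```

With duplicated columns, the regularized solution splits the effect evenly between the two copies, whereas the unregularized problem has infinitely many solutions.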

##### Example) the more meaningless information you add, the higher the coefficient of determination becomes

<img src="https://github.com/CropEvol/lecture/blob/master/textbook_2018/11_statistics/data/variables_en.png?raw=true" alt="reg_base" width="50%" height="50%">

In [None]:
##### c.f. how to add a new column to a pandas DataFrame
df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                   'B': ['B1', 'B2', 'B3'],
                   'C': ['C1', 'C2', 'C3']},
                  index=['ONE', 'TWO', 'THREE'])
df

# df["D"] = ["D1", "D2", "D3"]
# df

In [None]:
# add random base data to test_data
import random

test_data = pd.read_csv("https://github.com/CropEvol/lecture/blob/master/textbook_2018/09_statistics/data/gene_data.csv?raw=true", index_col=0)

# a list of 200 bases drawn at random from A, T, G, C
random.choices(["A", "T", "G", "C"], k=200)

### try writing code that adds random base data to test_data ###







In [None]:
##### Example) the more meaningless information you add, the higher the coefficient of determination becomes
import random

# reload the original dataset (phenotypes and real base data)
test_data = pd.read_csv("https://github.com/CropEvol/lecture/blob/master/textbook_2018/09_statistics/data/gene_data.csv?raw=true", index_col=0)

# add 20 columns of random, meaningless base information
for i in range(20):
    test_data["gene_test_{}".format(i)] = random.choices(["A", "T", "G", "C"], k=200)

# convert categorical data
convert_test_data = pd.get_dummies(test_data, drop_first=True)

# select model
clf = linear_model.LinearRegression()
    
X = convert_test_data.iloc[:, 2:]
Y = convert_test_data.iloc[:, 1]

# fit the model
clf.fit(X, Y)

# display coefficient of determination
print(clf.score(X, Y))

# show the effects of bases
plt.title("the effect of each base")
plt.scatter(range(len(clf.coef_)), clf.coef_)
plt.show()

##### Example) if the dataset includes duplicate columns, the estimated coefficients are unreliable

In [None]:
##### Example) if the dataset includes duplicate columns, the estimated coefficients are unreliable

# append a copy of the gene columns, so every gene column appears twice
test_data = convert_gene_data.iloc[:, 3:]
test_data = pd.concat([convert_gene_data, test_data], axis=1)

# select model
clf = linear_model.LinearRegression()
    
X = test_data.iloc[:, 3:]
Y = test_data.iloc[:, 1]

# fit the model
clf.fit(X, Y)

# display coefficient
print(clf.coef_)

# display coefficient of determination
print(clf.score(X, Y))

# show the effects of bases
plt.title("the effect of each gene")
plt.scatter(range(len(clf.coef_)), clf.coef_)
plt.show()

### 3. Solution Example<a name="1.3"></a>

#### Increase the number of samples
* With the same set of variables, a regression based on 1,000,000 samples gives more reliable estimates than one based on 200 samples.

#### Variable selection
If you can select only the explanatory variables that are related to the phenotype ($y$), the regression equation becomes more meaningful.

* biological selection
    * select meaningful variables by GWAS (Genome-Wide Association Study)
    * etc
    

* selection by analysis method
    * Use an indicator of model validity other than the coefficient of determination.
    * Some methods perform variable selection as part of model fitting.


* ...etc
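One simple indicator other than the plain coefficient of determination is the adjusted $R^2$, $1 - (1 - R^2)\frac{n-1}{n-p-1}$, which penalizes the number of variables $p$. It is not used elsewhere in this lecture and is shown only as an illustration on synthetic data (the sample size, the effects and the 60 noise columns are arbitrary assumptions).

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Fit OLS with an intercept and return (R^2, adjusted R^2)."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])            # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)               # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)               # total sum of squares
    r2 = 1.0 - rss / tss
    adjusted = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adjusted

rng = np.random.default_rng(0)
n = 200
X_real = rng.integers(0, 2, size=(n, 3)).astype(float)    # variables with an effect
y = X_real @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)
X_noise = rng.integers(0, 2, size=(n, 60)).astype(float)  # meaningless variables
X_big = np.column_stack([X_real, X_noise])

r2_small, adj_small = r2_and_adjusted(X_real, y)
r2_big, adj_big = r2_and_adjusted(X_big, y)
print(f"3 variables : R^2={r2_small:.3f}, adjusted R^2={adj_small:.3f}")
print(f"63 variables: R^2={r2_big:.3f}, adjusted R^2={adj_big:.3f}")
```

The plain $R^2$ cannot go down when columns are added, while the adjusted value usually does not reward meaningless columns.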

There is no perfect method, because this is prediction. Choose a model according to what you want (model validity? prediction accuracy? ...etc).

#### When we add new samples or use variable selection, we want to know how much the model has improved.
#### But... how do we judge the goodness of a model?
---

## Today's contents

### 1. Basic flow of Machine Learning<a name="2.1"></a>

First, how can we measure how realistic (meaningful) a regression equation / model is?<br>

##### Regression analysis of last lecture
<img src="https://github.com/CropEvol/lecture/blob/master/textbook_2018/10_statistics/data/data_split1_en.png?raw=true" alt="reg_base" width="70%" height="70%">
<br>
In this process, even if the coefficient of determination is high, the model may have weak explanatory power for new data.<br>
Example)
<img src="https://github.com/CropEvol/lecture/blob/master/textbook_2018/10_statistics/data/overfit_en.png?raw=true" alt="reg_base" width="60%" height="60%">

→ It is difficult to know the prediction accuracy for new data. How can we find out?<br>

→ Split the dataset and intentionally set aside a part to treat as new data.<br>

##### To measure the prediction accuracy (example of regression analysis)
<img src="https://github.com/CropEvol/lecture/blob/master/textbook_2018/10_statistics/data/data_split2_en.png?raw=true" alt="reg_base" width="65%" height="65%">

In [None]:
# loading library
from sklearn.model_selection import train_test_split

# Explanatory variables: base data of chr10 (fourth column onward)
X = convert_gene_data.iloc[:, 3:]

# Target variable: "LeafWidth"
Y = convert_gene_data.loc[:, "LeafWidth"]

# split into Training Data and Test Data
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0, test_size=0.1)

# check the number of each dataset
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

##### First, build the model from the Training Data

In [None]:
# select model
clf = linear_model.LinearRegression()

# Fit the model on the Training Data (fill in)
# clf.

# display coefficient
print(clf.coef_)

# display intercept
print(clf.intercept_)

# display coefficient of determination in Training Dataset
print(clf.score(X_train, y_train))

In [None]:
# predict y of Test Data
#clf.

# display coefficient of determination in Test Dataset
#clf.

##### Summary

In [None]:
# select model
clf = linear_model.LinearRegression()

# Fit the model on the Training Data
clf.fit(X_train, y_train)

# display coefficient
print(clf.coef_)

# display intercept
print(clf.intercept_)

# display coefficient of determination in Training Dataset
print(clf.score(X_train, y_train))

# predict y of Test Data
print(clf.predict(X_test))

# display coefficient of determination in Test Dataset
print(clf.score(X_test, y_test))

# compare the predicted value & the actual value in Test Dataset
predict_value = clf.predict(X_test)

plt.plot(range(len(y_test)), y_test, color="b")
plt.plot(range(len(y_test)), predict_value, color="r")

plt.title("Measured value vs Prediction")
plt.xlabel("n")
plt.ylabel("LeafWidth")
plt.show()

### 2. Example of variable selection<a name="2.2"></a>

In genome analysis especially, as in the previous multiple regression examples, it is common that only some of the variables actually have an effect.

It would be best to know in advance which variables (genes) are important, but in many cases that is impossible. You therefore need to select variables by some criterion.

The following are some methods for selecting variables.

#### 2.1. Significance test for each variable

$y=\beta_1x_1+\beta_2x_2+\beta_3x_3...+e$<br>

Test whether each variable has an effect, for example $H_0:\beta_1=0 \quad H_1:\beta_1\neq0$, and select the variables whose effect is significant.<br>

<img src="https://github.com/CropEvol/lecture/blob/master/textbook_2018/10_statistics/data/yuui_en.png?raw=true" alt="reg_base" width="50%" height="50%">

The `statsmodels` library is useful for this analysis.<br>

In [None]:
import statsmodels.api as sm

# Explanatory variables
X = convert_gene_data.iloc[:, 3:]

# Target variable
Y = convert_gene_data.loc[:, "LeafWidth"]

# add intercept
X = sm.add_constant(X)

# analysis
results = sm.OLS(endog=Y, exog=X).fit()
results.summary()

The p-values of the bases on chr_3, chr_4 and chr_5 are small, so their effects are significant.
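To see where such p-values come from, the t-test for each coefficient can be written out with NumPy and SciPy. This is a sketch on synthetic data in which only the first variable has a real effect; the 0.05 threshold and all data values are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
X = rng.integers(0, 2, size=(n, 5)).astype(float)
y = 2.0 * X[:, 0] + rng.normal(size=n)        # only variable 0 has an effect

A = np.column_stack([np.ones(n), X])          # add intercept column
XtX_inv = np.linalg.inv(A.T @ A)
beta = XtX_inv @ A.T @ y                      # least-squares estimate

resid = y - A @ beta
dof = n - A.shape[1]                          # residual degrees of freedom
sigma2 = resid @ resid / dof                  # estimated error variance
se = np.sqrt(sigma2 * np.diag(XtX_inv))       # standard error of each coefficient

t_stat = beta / se                            # test statistic for H0: beta_j = 0
p_values = 2 * stats.t.sf(np.abs(t_stat), dof)  # two-sided p-value

# p_values[0] belongs to the intercept, so variable j is at position j + 1.
selected = [j for j in range(X.shape[1]) if p_values[j + 1] < 0.05]
print("p-values:", np.round(p_values, 4))
print("selected variables:", selected)
```

When many variables are tested at once, a few meaningless ones are expected to fall below 0.05 by chance, so the threshold should be chosen with the number of tests in mind.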

For studying statistical modeling, the book [データ解析のための統計モデリング入門](http://goo.gl/Ufq2) is well known (available in Japanese only).
___

#### 2.2 Select model by AIC, BIC

Indicators for measuring the goodness of a model:<br>
* AIC (Akaike Information Criterion): favors the model with the best predictive ability<br>
* BIC (Bayesian Information Criterion): favors the model that is most likely to be the true model<br>

→ Create several candidate models and select the one with the smallest AIC (or BIC).

ex)<br>
　$y = {gene_1}\_effect + {gene_2}\_effect + {gene_3}\_effect + e$...model(1), AIC=243.8<br>
　$y = {gene_1}\_effect + {gene_2}\_effect + e$...model(2), AIC=198.3<br>

　AIC of model(2) is smaller → model(2) is better in terms of predictive ability

`statsmodels` is useful for calculating AIC and BIC.
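For a linear model with normally distributed errors, both indicators can also be computed by hand from the residual sum of squares: $AIC = 2k - 2\ln L$ and $BIC = k\ln n - 2\ln L$, where $k$ is the number of estimated coefficients. The sketch below uses synthetic data and counts only the regression coefficients (including the intercept) in $k$; that counting convention is an assumption of this example.

```python
import numpy as np

def ols_aic_bic(X, y):
    """Fit OLS with an intercept and return (AIC, BIC) under Gaussian errors."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    k = A.shape[1]                                  # coefficients incl. intercept
    # Maximized Gaussian log-likelihood, with the error variance set to rss / n.
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    aic = 2 * k - 2 * log_lik
    bic = k * np.log(n) - 2 * log_lik
    return aic, bic

rng = np.random.default_rng(0)
n = 200
X_real = rng.integers(0, 2, size=(n, 3)).astype(float)    # variables with an effect
X_noise = rng.integers(0, 2, size=(n, 20)).astype(float)  # meaningless variables
y = X_real @ np.array([1.0, 0.8, -0.6]) + rng.normal(scale=0.5, size=n)

aic_small, bic_small = ols_aic_bic(X_real, y)
aic_full, bic_full = ols_aic_bic(np.column_stack([X_real, X_noise]), y)
print(f"3 variables : AIC={aic_small:.1f}, BIC={bic_small:.1f}")
print(f"23 variables: AIC={aic_full:.1f}, BIC={bic_full:.1f}")
```

Because $\ln n > 2$ for this sample size, BIC penalizes each additional variable more heavily than AIC does.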

In [None]:
# model based on all bases
X = convert_gene_data.iloc[:, 3:]
Y = convert_gene_data.loc[:, "LeafWidth"]
X = sm.add_constant(X)
results = sm.OLS(endog=Y, exog=X).fit()
print("all base model's AIC:", results.aic)
print("all base model's BIC:", results.bic)

# model based on only bases in chr_3,4,5
X = convert_gene_data.iloc[:, 5:14]
Y = convert_gene_data.loc[:, "LeafWidth"]
X = sm.add_constant(X)
results = sm.OLS(endog=Y, exog=X).fit()
print("chr_3,4,5 base model's AIC:", results.aic)
print("chr_3,4,5 base model's BIC:", results.bic)

##### The chr_3,4,5 base model has the smaller AIC (and BIC), so it is the better model.

___
#### 2.3 Method with variable selection

One method that performs variable selection is Lasso.<br>

Lasso (Least absolute shrinkage and selection operator) [[Tibshirani, 1996]](http://statweb.stanford.edu/~tibs/lasso/lasso.pdf)<br>

In methods 2.1 and 2.2, model estimation and variable selection were done separately.<br>

In Lasso, the $\beta$ of some variables is estimated to be exactly 0, so the model can be estimated and the variables selected at the same time.


In [None]:
# Select model (changed part: LinearRegression() → Lasso())
# the larger alpha is, the fewer variables are selected
clf = linear_model.Lasso(alpha=0.01)

# Fit the model on the Training Data
clf.fit(X_train, y_train)

# display coefficient
print(clf.coef_)

# display intercept
print(clf.intercept_)

# display coefficient of determination in Training Data
print(clf.score(X_train, y_train))

# display coefficient of determination in Test Data
print(clf.score(X_test, y_test))

##### (Reference) L1 regularization

* In the case of the least squares method of multiple regression analysis<br>

　　$\boldsymbol{y} = \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{e}$<br>

　　Estimate $\boldsymbol{\beta}$ values to minimize $\sum{e_i^2} = \sum{(\boldsymbol{X_i} \boldsymbol{\beta} - y_i)^2}$→　 $\boldsymbol{\hat{\beta}} = (\boldsymbol{X'}\boldsymbol{X})^{-1}\boldsymbol{X'}\boldsymbol{y}$
<br><br>

* In the case of Lasso

　　Estimate $\boldsymbol{\beta}$ values to minimize $\sum{(\boldsymbol{X_i} \boldsymbol{\beta} - y_i)^2} + \lambda||\boldsymbol{\beta}||_1 \quad (||\boldsymbol{\beta}||_1=\sum{|\beta_i|})$<br><br>
　　→　$||\boldsymbol{\beta}||_1$ is not differentiable at $\beta_i = 0$, so the problem cannot be solved analytically in general, and the estimate is obtained by numerical optimization.<br><br>
　　[Coordinate Descent](https://core.ac.uk/download/pdf/6287975.pdf?repositoryId=153) and [LARS](http://statweb.stanford.edu/~imj/WEBLIST/2004/LarsAnnStat04.pdf) are well-known algorithms for solving this problem.
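The idea of coordinate descent can be sketched in a few lines: update one $\beta_j$ at a time while the others are fixed, using a "soft-threshold" step that can set the coefficient to exactly 0. This is a minimal illustration for the objective written above, on synthetic data, and not the implementation used inside scikit-learn. The function names, `lam=100.0` and the fixed number of sweeps are assumptions of this example.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward 0 by t; values within [-t, t] become exactly 0."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize sum((X @ beta - y)**2) + lam * sum(|beta|) by cyclic coordinate descent.

    Assumes no column of X is entirely zero.
    """
    n, k = X.shape
    beta = np.zeros(k)
    col_sq = (X ** 2).sum(axis=0)                  # X_j' X_j for each column
    for _ in range(n_iter):
        for j in range(k):
            # Residual computed without the contribution of variable j.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            # One-dimensional minimizer of the objective in beta_j.
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
n, k = 100, 6
X = rng.normal(size=(n, k))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])   # only two real effects
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_lasso = lasso_cd(X, y, lam=100.0)
print(np.round(beta_lasso, 3))
```

The non-zero coefficients are shrunk toward 0 compared with the true values; this shrinkage is the price paid for the automatic selection. Note that scikit-learn's `Lasso` scales the squared-error term differently, so its `alpha` is not directly comparable with `lam` here.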

### 3. Decision Tree<a name="2.3"></a>

Among machine learning methods, the decision tree is one of the easiest to understand.<br>

It captures the features of the data in an easily readable form, so it is useful not only for genome analysis but also for interpreting some wet experiments.<br>

Some of the latest machine learning methods are also based on decision trees, so it is good to know.<br>

##### Example of decision tree

<img src="https://github.com/CropEvol/lecture/blob/master/textbook_2018/10_statistics/data/tree.png?raw=true" alt="tree" width="50%" height="50%">

A decision tree repeatedly splits the dataset into subsets so that the data becomes easier to understand.
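How one split is chosen can be sketched as follows: for a regression tree, try each 0/1 variable and keep the one whose two groups have the smallest remaining variance of $y$. The data are made up, and the helper `best_binary_split` is a hypothetical name for this illustration; real implementations repeat this step recursively on each group.

```python
import numpy as np

def best_binary_split(X, y):
    """Return (feature index, weighted child variance) of the best 0/1 split."""
    n, k = X.shape
    best_j, best_score = None, np.inf
    for j in range(k):
        left, right = y[X[:, j] == 0], y[X[:, j] == 1]
        if len(left) == 0 or len(right) == 0:
            continue                                # this variable cannot split the data
        # Variance of y left in the two groups, weighted by group size.
        score = (len(left) * left.var() + len(right) * right.var()) / n
        if score < best_score:
            best_j, best_score = j, score
    return best_j, best_score

# Toy data: the phenotype clearly depends on variable 0, not on variable 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]])
y = np.array([1.0, 1.2, 3.0, 3.1, 0.9, 2.9])

j, score = best_binary_split(X, y)
print("split on variable", j, "remaining variance", round(score, 4))
print("variance before split", round(y.var(), 4))
```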

In [None]:
from sklearn import tree

clf = tree.DecisionTreeRegressor(max_depth=3)

clf = clf.fit(X_train, y_train)

predicted = clf.predict(X_train)

print("Score on Training Data", clf.score(X_train, y_train))
print("Score on Test Data", clf.score(X_test, y_test))

# display decision tree (requires the Graphviz `dot` command; not an elegant approach)
from IPython.display import Image
with open("tree.dot", 'w') as f:
    tree.export_graphviz(clf, out_file=f, feature_names=convert_gene_data.iloc[:, 3:].columns.values, filled=True, rounded=True, impurity=False, proportion=False)
!dot -T png tree.dot > tree.png
Image('tree.png')

The decision tree has a weak point: it tends to overfit the samples.<br>

As max_depth increases, the score on the Training Data increases while the score on the Test Data tends to decrease (that is, the prediction accuracy for new data gets worse).<br>
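The effect of `max_depth` can be checked by fitting trees of several depths and comparing the two scores. The sketch below uses synthetic continuous data instead of the lecture dataset, so the numbers are only illustrative; the depths tried and the noise level are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two variables affect y; the rest is noise.
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for depth in [1, 2, 3, 5, 8, None]:               # None: grow until leaves are pure
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"max_depth={depth}: train R^2={scores[depth][0]:.3f}, "
          f"test R^2={scores[depth][1]:.3f}")
```

A fully grown tree reproduces the training data almost perfectly, so the training score alone says little about how the tree behaves on new data.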

### 4. Data Preparation<a name="2.4"></a>

There are many methods in machine learning. (Reference) [scikit-learn algorithm cheat sheet](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)<br>

Almost all of these methods are powerful and useful, but the quality of the data is also very important. This is the concept of "Garbage in, garbage out".

<img src="https://github.com/CropEvol/lecture/blob/master/textbook_2018/10_statistics/data/gigo.png?raw=true" alt="gigo" width="50%" height="50%">

It means that no matter how powerful the model is, if the quality of the input data is bad, the results will also be bad, like garbage.<br>

For this reason, it is essential to think about the quality of the data itself before proceeding with the analysis.<br>

In agriculture and biology, it is also possible to obtain more interpretable results by controlling the data well.<br>

##### (ex) RIL

One example is RILs (Recombinant Inbred Lines).<br>

Through many generations of selfing, the genome of each line in a RIL population becomes almost entirely homozygous for one parental allele or the other.<br>

This population is very useful for genome analysis because we do not need to consider heterozygous genotypes.
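Why selfing removes heterozygous genotypes can be sketched with simple arithmetic: at a locus where the two parents differ, the F1 is heterozygous, and each generation of selfing is expected to halve the heterozygous fraction. The function below is a hypothetical helper for illustration; it assumes Mendelian segregation without selection.

```python
def expected_heterozygosity(generations_of_selfing):
    """Expected fraction of heterozygous loci after selfing an F1.

    Counts only loci where the two parents differ (all heterozygous in the F1).
    """
    return 0.5 ** generations_of_selfing

# F1 is generation 0 of selfing, F2 is generation 1, and so on.
for g in range(0, 9):
    print(f"F{g + 1}: {expected_heterozygosity(g):.4f}")
```

After seven generations of selfing (F8), less than 1% of such loci are expected to remain heterozygous.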

<img src="https://github.com/CropEvol/lecture/blob/master/textbook_2018/10_statistics/data/ril.png?raw=true" alt="gigo" width="50%" height="50%">

In agriculture, which does not target humans, it is possible to control the growing environment and the genetic structure of the target crops and organisms to some extent.<br>

Therefore, it is very important to think carefully about what kind of dataset to create and at what scale.