<img src="https://github.com/slt666666/FAO_lecture/blob/main/title.png?raw=true" alt="title" height="300px">


# Genomic Prediction - example -

## Contents of this notebook

* Review of Genomic Prediction

* Building a genomic prediction model using a sample dataset

  * We assume a rice population.

# Main contents

# Review of Genomic Prediction

Genomic prediction builds a prediction model that explains phenotype from genotype.

The process of building a genomic prediction model is:

1. Prepare a mating population such as recombinant inbred lines (RILs) or a nested association mapping (NAM) population.

2. Genotype the population by NGS and phenotype it.

3. Model the relationship between genotype and phenotype.

4. Check the performance of the generated model.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/genomic_prediction.png?raw=true" alt="colab" height="300px">

After building a good genomic prediction model, we can apply it to improve the breeding strategy.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/apply_model.png?raw=true" alt="colab" height="600px">



# Experience Genomic Prediction
In this notebook, we build a genomic prediction model and apply it to genomic breeding.

In [None]:
# Prepare modules & packages
!wget -O genomic_prediction.py https://github.com/slt666666/FAO_lecture/blob/main/genomic_prediction.py?raw=true

from genomic_prediction import load_dataset
from genomic_prediction import split_dataset
from genomic_prediction import make_genomic_prediction_model
from genomic_prediction import predict_phenotype
from genomic_prediction import check_accuracy

## Materials

We generated a NAM population by crossing rice cultivar A with 5 other cultivars (B to F).

<img src="https://github.com/slt666666/FAO_lecture/blob/main/nam.png?raw=true" alt="colab" height="300px">

Then, we performed sequencing and phenotyping for this population.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/genopheno.png?raw=true" alt="colab" height="300px">

## Load dataset

In this notebook, we use the genotype and phenotype datasets of the NAM population.

In [None]:
genotype, phenotype = load_dataset()

The dataset includes almost 1000 lines, with:

- SNP genotype
- Grain number & Leaf width phenotypes

In [None]:
display(genotype)
display(phenotype)

## Methods

The process of building a genomic prediction model and checking its performance is:

1. Split all data into 80% (training data) and 20% (test data).

2. Build a prediction model using the training data (we use an ElasticNet regression model in this notebook).

3. Predict the phenotype of the test data **from genotype** using the generated model.

4. Compare the predicted and observed phenotypes to check the performance of the model.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/gpmethod.png?raw=true" alt="colab" height="400px">

## Separate dataset

We split the dataset into training data and test data.

In [None]:
test_genotype, test_phenotype, train_genotype, train_phenotype = split_dataset(genotype, phenotype, "LW_mean", test=0.2)

The above code splits the whole dataset into 20% (test data) and 80% (training data).

<img src="https://github.com/slt666666/FAO_lecture/blob/main/simulation14.png?raw=true" alt="colab" height="200px">

You can check the split data using the code below.

In [None]:
# show test data
display(test_phenotype)
# show training data
display(train_phenotype)
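
The internals of `split_dataset` are not shown in this notebook. As a rough sketch of what an 80/20 split does, the example below uses only NumPy and a small made-up dataset. All names (`toy_genotype`, `toy_phenotype`, and so on) and the 0/1 SNP coding are assumptions for illustration, not the format of the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (made up): 100 lines x 20 SNPs coded 0/1, and one phenotype value per line.
toy_genotype = rng.integers(0, 2, size=(100, 20))
toy_phenotype = rng.normal(size=100)

# Shuffle the line indices, then hold out the first 20% as test data.
n_lines = toy_genotype.shape[0]
n_test = int(round(n_lines * 0.2))
shuffled = rng.permutation(n_lines)
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

# Use the same indices for genotype and phenotype so rows stay paired.
toy_test_genotype, toy_train_genotype = toy_genotype[test_idx], toy_genotype[train_idx]
toy_test_phenotype, toy_train_phenotype = toy_phenotype[test_idx], toy_phenotype[train_idx]

print(toy_train_genotype.shape, toy_test_genotype.shape)
```

Shuffling before splitting avoids holding out only lines from one end of the table, which could matter if the lines are ordered by family.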

## Make prediction model

After splitting the dataset, we use the training dataset (80%) to build a genomic prediction model that explains phenotype from genotype.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/simulation15.png?raw=true" alt="colab" height="200px">

The code below builds the prediction model.

(We skip the details of the model in this lecture.)

In [None]:
LW_prediction_model = make_genomic_prediction_model(train_genotype, train_phenotype, "LW_mean")
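
This notebook states that the model is an ElasticNet regression but does not show how `make_genomic_prediction_model` fits it. The following is a minimal sketch, assuming scikit-learn's `ElasticNet` and made-up data. The hyperparameters (`alpha=0.01`, `l1_ratio=0.5`) and the 0/1 SNP coding are arbitrary choices for illustration and may differ from what the helper uses.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)

# Toy training data (made up): 200 lines x 50 SNPs coded 0/1.
X_train = rng.integers(0, 2, size=(200, 50)).astype(float)

# Only the first 5 SNPs truly affect the toy phenotype; the rest have no effect.
true_effects = np.zeros(50)
true_effects[:5] = [2.0, -1.5, 1.0, 0.8, -0.5]
y_train = X_train @ true_effects + rng.normal(scale=0.5, size=200)

# ElasticNet mixes L1 and L2 penalties; alpha and l1_ratio here are arbitrary choices.
toy_model = ElasticNet(alpha=0.01, l1_ratio=0.5)
toy_model.fit(X_train, y_train)

# Each fitted coefficient is the estimated effect of one SNP.
print(np.round(toy_model.coef_[:5], 2))
```

The penalties shrink the estimated effects toward zero, which helps when there are many more SNPs than lines.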

## Predict phenotype of test data

After building the prediction model, we check its performance (accuracy).

To check accuracy, we use the test data, which was not used to build the model.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/simulation16.png?raw=true" alt="colab" height="200px">

First, we predict phenotype values from genotype using the generated model.

Then, we compare the predicted values with the observed values.

If the predicted and observed values are similar, the model generalizes well to lines it has not seen.

In [None]:
predicted_test_phenotype = predict_phenotype(test_genotype, LW_prediction_model)

The above code predicts phenotypes from genotypes using the generated model.

You can check the predicted values with the code below.

In [None]:
predicted_test_phenotype

## Compare predicted phenotype & observed phenotype

After predicting the phenotype values of the test data, we compare them with the observed phenotype values.

Here, we use the correlation coefficient between them as the measure of accuracy.

In [None]:
check_accuracy(predicted_test_phenotype, test_phenotype, "LW_mean")

The above code calculates the correlation coefficient and generates a scatter plot of predicted and observed values.

The correlation coefficient is over 0.85, so the generated model looks good.
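
The implementation of `check_accuracy` is not shown here. As a sketch of how a correlation coefficient between predicted and observed values can be computed, the example below calculates the Pearson correlation by hand and with `np.corrcoef`. The six values are made up, and whether the helper uses the Pearson correlation specifically is an assumption.

```python
import numpy as np

# Made-up observed and predicted phenotype values for six test lines.
observed = np.array([10.2, 11.5, 9.8, 12.1, 10.9, 11.0])
predicted = np.array([10.0, 11.2, 10.1, 11.8, 10.5, 11.3])

# Pearson correlation: covariance divided by the product of standard deviations.
obs_centered = observed - observed.mean()
pred_centered = predicted - predicted.mean()
r_manual = (obs_centered * pred_centered).sum() / np.sqrt(
    (obs_centered ** 2).sum() * (pred_centered ** 2).sum()
)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r_numpy = np.corrcoef(observed, predicted)[0, 1]

print(round(r_manual, 3), round(r_numpy, 3))
```

A value close to 1 means that lines predicted to have high phenotype values also tend to have high observed values.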

## For Grain number

Try to build a genomic prediction model for grain number.


# Applying genomic prediction model to breeding strategy

In this section, we apply the generated models to a breeding strategy.

We'll predict the phenotypes of progenies generated by crossing 2 cultivars.

We'll also construct an ideal genotype for the trait.
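
The simulation code for this section is not included in the notebook. The sketch below shows one simple way to think about predicting progeny phenotypes from a cross. It assumes fully inbred parents, SNPs coded 0/1, each SNP inherited from either parent with equal probability and independently of its neighbors (linkage is ignored, which is a strong simplification), and a linear model with one effect per SNP. All names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_snps, n_progeny = 30, 500

# Hypothetical inbred parents, SNPs coded 0/1, and made-up SNP effects.
parent_1 = rng.integers(0, 2, size=n_snps)
parent_2 = rng.integers(0, 2, size=n_snps)
snp_effects = rng.normal(scale=0.3, size=n_snps)
intercept = 10.0

# For every progeny line and SNP, pick the allele of parent 1 or parent 2
# with probability 0.5 each (linkage between neighboring SNPs is ignored).
from_parent_1 = rng.random((n_progeny, n_snps)) < 0.5
progeny = np.where(from_parent_1, parent_1, parent_2)

# Linear prediction: intercept plus the sum of effects of the alleles carried.
progeny_pred = intercept + progeny @ snp_effects
parent_pred = intercept + np.array([parent_1, parent_2]) @ snp_effects

print("parents:", np.round(parent_pred, 2))
print("progeny min/mean/max:", np.round(
    [progeny_pred.min(), progeny_pred.mean(), progeny_pred.max()], 2))
```

Repeating this for each candidate pair of parents and comparing the predicted progeny distributions is one way to compare crosses.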

## Consider which combination of cultivars is best to cross to obtain progenies with high phenotype values

### make simulation data

### predict phenotypes of simulation data

### check distribution of simulated data

## Consider best genotype for traits

### check estimated SNP effect

In [None]:
show_estimated_SNP_effect(prediction_model)

### make customized genotype

In [None]:
make_customized_genotype()

### predict phenotypes of customized genotype

In [None]:
predict_phenotype(customized_genotype, prediction_model)
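
The helpers `show_estimated_SNP_effect` and `make_customized_genotype` are not defined in this notebook, so the following is only a sketch of the underlying idea. It assumes a linear model with one estimated effect per SNP and SNPs coded 0/1. Under that assumption, the genotype that maximizes the predicted trait carries allele 1 wherever its effect is positive. The effects and the existing line are made up.

```python
import numpy as np

# Made-up estimated SNP effects (effect of carrying allele 1 instead of allele 0).
snp_effects = np.array([0.8, -0.3, 0.0, 1.2, -0.6, 0.4])
intercept = 10.0

# To maximize a linear prediction, carry allele 1 only where its effect is positive.
ideal_genotype = (snp_effects > 0).astype(int)

# Compare with an arbitrary existing line.
existing_line = np.array([1, 1, 0, 0, 1, 1])

def predict(genotype):
    """Linear prediction: intercept plus the summed effects of alleles carried."""
    return intercept + genotype @ snp_effects

print("ideal genotype:", ideal_genotype)
print("predicted (existing):", round(predict(existing_line), 2))
print("predicted (ideal):", round(predict(ideal_genotype), 2))
```

For a trait that should be minimized, the rule would be reversed: carry allele 1 only where its effect is negative.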

---
## Summary

In this notebook, we demonstrate **MutMap** analysis using simulation data & published sample data.

This is a small dataset (only chromosome 10, just 272 SNPs).

For bigger data (for example, a few million lines), you can use the pipeline that our lab developed:
(https://github.com/YuSugihara/MutMap)
   
Tomorrow, we'll demonstrate **QTL-seq** analysis & **Sliding window** analysis for MutMap & QTL-seq to identify causative genes.

