<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/title.png?raw=true" alt="title" height="250px">

# Genomic Prediction - practice session -

In this notebook, we will perform a Genomic Prediction analysis using a real (but unpublished!) dataset.

We will also explore what genomic prediction models can do for breeding.

## The contents in this notebook ...

* Make a genomic prediction model using a real dataset

  * We use a rice population and its grain number phenotype.

* Application of the genomic prediction model
  * We will use the prediction model to look for genotypes that give a high grain number
  * We will search for good parental lines to generate progeny with high phenotype values

# Main contents

## Mut-Map, QTL-seq, GWAS analysis

We studied methods for identifying genetic variants associated with phenotypes, such as Mut-map and QTL-seq.

These approaches focus on **the identification of QTLs/genetic variants**.

So, these approaches are very useful for **marker-assisted selection** in breeding.

Several crops have been improved by introducing QTLs related to yield, immunity, etc.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_1.png?raw=true" alt="title" height="150px">

However, most agronomic traits are likely controlled by multiple genes rather than a single gene.

In addition, it is difficult to estimate the effect of each gene with identification methods such as Mut-map, QTL-seq, and GWAS.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_2.png?raw=true" alt="title" height="300px">

If we can estimate the effects of genetic variants, we can incorporate them into a breeding strategy more efficiently.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_3.png?raw=true" alt="title" height="140px">

**Genomic Prediction** is one approach to estimating the genetic effects of variants on phenotypes and predicting phenotype values from genotype information.

In this session, we'll introduce **what a Genomic Prediction model is** and **what we can do with a Genomic Prediction model**.



# Review of Genomic Prediction

A genomic prediction model tries to predict phenotype values from genotype information.

Usually, we use SNP genotypes as the input genotype data.

In other words, a genomic prediction model predicts phenotype values based on SNP genotypes.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/genomic_prediction.png?raw=true" alt="colab" height="300px">

To build a genomic prediction model, we have to understand the relationship between genotype and phenotype.

So, we need information about various genotypes and their phenotypes.

To capture this relationship, we need a large segregating population together with its genotype and phenotype data.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_19.png?raw=true" alt="title" height="250px">


Then, we apply statistical methods to build a model that explains the relationship between genotype and phenotype.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_4.png?raw=true" alt="title" height="170px">

That's the process of making a genomic prediction model in brief.

Let's go through the process of making a genomic prediction model step by step.

# Experience Genomic Prediction
In this section, we will try to make a genomic prediction model for rice.

We will make a prediction model that can predict the **grain number** phenotype from genotype.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_21.png?raw=true" alt="title" height="150px">



## **Genetically diverse population**

First, we have to generate a genetically diverse population to capture the relationship between genotype and phenotype.

To make the relationship between phenotypes and genotypes easier to understand, we often use a **designed breeding population** as the genetically diverse population for genomic prediction.

Examples of designed breeding populations are **RIL, NAM, and MAGIC populations**.


In this practice, we'll use a NAM population of rice.

A NAM (Nested Association Mapping) population consists of multiple sets of recombinant inbred lines (RILs).

A NAM population is derived by crossing diverse parents with a common reference parent.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_5.png?raw=true" alt="title" height="300px">

You can download the phenotype and genotype data with the command below.



In [None]:
%%bash
wget -q -O genotype.csv https://raw.githubusercontent.com/slt666666/FAO_lecture/main/FAO_2024/data/genotype.csv
wget -q -O phenotype.csv https://raw.githubusercontent.com/slt666666/FAO_lecture/main/FAO_2024/data/phenotype.csv
wget -q -O modules.py https://raw.githubusercontent.com/slt666666/FAO_lecture/main/FAO_2024/data/modules2.py

After the downloads finish, you will have the phenotype and genotype datasets.

The phenotype data file is `phenotype.csv` and the genotype data file is `genotype.csv`.

### Phenotype data

Our NAM population contains **747 lines**, and we use **grain number** as the phenotype in this practice.

We will use the mean value of 5 replicates as the phenotype value of each line.
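As a small illustration of how per-line means could be computed from replicate measurements, here is a minimal sketch with pandas. The raw values and the column name `Line` are made up for this example; `phenotype.csv` already contains the averaged values, so this step is not needed in the practice.

```python
import pandas as pd

# Hypothetical raw measurements: 2 lines x 5 replicates (all values are made up).
raw = pd.DataFrame({
    "Line": ["NAM_001"] * 5 + ["NAM_002"] * 5,
    "Grain_number": [100, 110, 105, 95, 90, 150, 160, 155, 145, 140],
})

# One phenotype value per line: the mean over its replicates.
mean_phenotype = raw.groupby("Line", as_index=False)["Grain_number"].mean()
print(mean_phenotype)
```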

You can check part of the phenotype data using the code below.

(This is unpublished data, so each line is simply named NAM_XXX.)

In [None]:
import pandas as pd
phenotype = pd.read_csv("phenotype.csv")
phenotype

### Genotype data

To generate genotype data, we have to perform sequencing, alignment, and SNP identification, as we experienced in the Mut-map and QTL-seq analyses.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_6.png?raw=true" alt="title" height="300px">

We skip this process in this practice and use genotype data that we have already organized into a table.

SNP genotypes were converted into **numbers**.

- Homozygous mutated genotype = 2
- Heterozygous mutated genotype = 1
- Original (Not mutated) genotype = 0
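
To make the coding concrete, the sketch below converts genotype calls into 0/1/2 by counting the alleles that differ from the original (reference) allele. The call format (`"A/C"`), the `ref` column, and all values are assumptions for illustration; the genotype table used in this practice is already numeric.

```python
import pandas as pd

# Hypothetical genotype calls for three SNPs in two lines (all values are made up).
# "ref" holds the allele of the original (non-mutated) genome at each SNP.
calls = pd.DataFrame({
    "ref": ["A", "G", "T"],
    "NAM_001": ["A/A", "G/T", "C/C"],
    "NAM_002": ["C/C", "G/G", "T/C"],
})

def encode(call: str, ref: str) -> int:
    """Count the alleles in a call such as "A/C" that differ from the reference."""
    return sum(allele != ref for allele in call.split("/"))

# 0 = original homozygous, 1 = heterozygous, 2 = homozygous mutated.
encoded = pd.DataFrame({
    line: [encode(c, r) for c, r in zip(calls[line], calls["ref"])]
    for line in ["NAM_001", "NAM_002"]
})
print(encoded)
```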

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_7.png?raw=true" alt="title" height="250px">

You can check part of the genotype data using the code below.

In [None]:
import pandas as pd
genotype = pd.read_csv("genotype.csv")
genotype
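
Later cells pass `genotype.iloc[:, 2:].T` to the model. This suggests that the table stores one row per SNP, with the first two columns describing the SNP and the remaining columns holding one line each. The sketch below shows what that expression does on a made-up table; the column names `chr` and `pos` are assumptions, not necessarily those of `genotype.csv`.

```python
import pandas as pd

# Made-up table in the assumed layout: one row per SNP, two descriptive columns
# ("chr" and "pos" are hypothetical names), then one column per line.
toy_genotype = pd.DataFrame({
    "chr": ["chr01", "chr01", "chr02"],
    "pos": [1200, 58000, 3100],
    "NAM_001": [0, 2, 0],
    "NAM_002": [2, 2, 0],
})

# Drop the first two columns and transpose: rows become lines, columns become SNPs.
X = toy_genotype.iloc[:, 2:].T
print(X.shape)
print(X)
```

Statistical libraries such as scikit-learn expect this orientation: one row per sample (line) and one column per feature (SNP).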

Now we have the genotype and phenotype datasets.

The next step is to make a genomic prediction model that explains the relationship between the genotype and phenotype data.

## **Make Genomic prediction model**

From the genotype and phenotype datasets, we make a statistical model that explains the relationship between genotype and phenotype.

ex) a very simple example

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_8.png?raw=true" alt="title" height="300px">

Recently, statistical analysis and machine learning methods have made remarkable progress.

Many tools are available for building various kinds of models.

Therefore, we can build models very flexibly depending on the purpose, such as a model that assumes gene-gene interactions or a model that takes environmental factors into account.

<br>

In this practice, we will use one of the linear regression models as a simple example.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_9.png?raw=true" alt="title" height="70px">

A linear regression model can be fitted with the [scikit-learn library](https://scikit-learn.org/stable/), which contains many data analysis tools.

```
※ In this practice, we will not explain the details of the statistics, the data analysis, or how to write the code.
If you are interested, many lectures are available online, for example: https://pll.harvard.edu/subject/data-science
```
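
The helper `linear_model` used in this notebook comes from `modules.py` and its implementation is not shown here, so the exact method it uses is unknown. As a rough illustration of fitting a linear model to SNP data with scikit-learn, the sketch below uses `Ridge` regression (a penalized linear regression, chosen here only as a stand-in) on simulated data. The data, the effect sizes, and the name `toy_model` are all made up.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Simulated data (not the rice dataset): 200 lines x 50 SNPs coded 0 or 2.
X = rng.choice([0, 2], size=(200, 50))
true_effects = np.zeros(50)
true_effects[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]   # only five SNPs affect the trait
y = 100 + X @ true_effects + rng.normal(0, 1.0, size=200)

# Ridge regression: least squares plus a penalty that shrinks SNP effects.
# It stands in for "a linear regression model"; modules.linear_model may differ.
toy_model = Ridge(alpha=1.0).fit(X, y)

print(toy_model.coef_[:5].round(2))        # estimated effects of the five causal SNPs
print(toy_model.predict(X[:3]).round(1))   # predicted phenotypes of the first three lines
```

The fitted coefficients play the role of estimated SNP effects: the prediction for a line is the intercept plus the sum of its genotype values multiplied by these effects.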

In [None]:
from modules import linear_model
model = linear_model(genotype.iloc[:, 2:].T, phenotype)

We have now made a genomic prediction model.

So we can predict grain number from SNP genotypes using this model.

However, we cannot yet judge whether this model is good or bad.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_10.png?raw=true" alt="title" height="120px">

So, we need a way to check the accuracy of the model.

To do so, we usually split the dataset into training data and test data before making the prediction model.

1. Based on the training data, we generate the prediction model.

1. Then, we check the accuracy of the generated model using the rest of the dataset (test data).

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_11.png?raw=true" alt="title" height="320px">

In our case, we make the genomic prediction model using 597 lines (training data).

Then, we check the accuracy of the generated model using the remaining 150 lines (test data).
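
These numbers follow from holding out 20% of the 747 lines. The sketch below checks the arithmetic with scikit-learn's `train_test_split` on a stand-in array of line indices (`line_ids` is a made-up name; only the number of samples matters here).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 747 NAM lines: only the number of samples matters here.
line_ids = np.arange(747)

# Hold out 20% of the lines; the test size is rounded up to a whole number of lines.
train_ids, test_ids = train_test_split(line_ids, test_size=0.2, random_state=1024)
print(len(train_ids), len(test_ids))
```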

Let's check how accurate a model we can generate.

### **1. Split dataset**

First, we split the dataset into training data and test data.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_14.png?raw=true" alt="title" height="300px">

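The [scikit-learn library](https://scikit-learn.org/stable/) has a function, `train_test_split`, that splits a dataset at random. With a random split of this size, the phenotype distributions of the training and test data should look similar, which we check with the histograms below.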

In [None]:
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Hold out 20% of the lines as test data (fixed seed so the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(genotype.iloc[:, 2:].T, phenotype, test_size=0.2, random_state=1024)
# Compare the phenotype distributions of the two subsets
plt.figure(figsize=[4,4])
plt.hist(y_train.Grain_number, label="Training data")
plt.hist(y_test.Grain_number, label="Test data")
plt.xlabel("Grain number"); plt.ylabel("Lines"); plt.legend(); plt.show()

We will use the training data to build the prediction model and the test data to check its accuracy.
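
If a purely random split happens to give noticeably different phenotype distributions, one option is to stratify the split on binned phenotype values. This is not done in this practice; the sketch below only illustrates the idea on simulated data, using `pandas.qcut` and the `stratify` argument of `train_test_split`. The names `toy_pheno`, `train_part`, and `test_part` are made up.

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Simulated phenotype for 200 lines (not the rice data).
toy_pheno = pd.DataFrame({"Grain_number": rng.normal(150, 30, size=200)})

# Cut the phenotype into 4 equally sized bins and keep the bin proportions in both subsets.
bins = pd.qcut(toy_pheno["Grain_number"], q=4, labels=False)
train_part, test_part = train_test_split(
    toy_pheno, test_size=0.2, random_state=0, stratify=bins
)

# The two means should be close because every bin is represented in both subsets.
print(round(float(train_part["Grain_number"].mean()), 1),
      round(float(test_part["Grain_number"].mean()), 1))
```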

### **2. Make prediction model**

Then, we make a genomic prediction model based on the training dataset.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_15.png?raw=true" alt="title" height="300px">


We fit a linear regression model that explains the relationship between genotype and phenotype in the training dataset, using the [scikit-learn library](https://scikit-learn.org/stable/).


In [None]:
from modules import linear_model
model = linear_model(X_train, y_train)

We obtained a linear regression model that explains the relationship between genotype and phenotype in the training dataset.

### **3. Check accuracy of the model**

To check accuracy, we apply the model generated in step 2 to the test dataset.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_16.png?raw=true" alt="title" height="300px">

We predict phenotype values from the genotype data of the test data using the generated prediction model.

Then we compare the observed phenotype values with the predicted phenotype values.

If these values are very similar, the model can predict phenotype values accurately.

The code below predicts phenotype values from genotype information using the generated prediction model.

In [None]:
from modules import check_accuracy
y_test["Predicted_Grain_number"] = model.predict(X_test)
display(y_test)
check_accuracy(model, X_test, y_test)

The scatterplot in the output shows the relationship between observed and predicted phenotype values.

Samples with low observed phenotype values tend to have low predicted values.

Samples with high observed phenotype values tend to have high predicted values.

The correlation coefficient between predicted and observed phenotypes is the most commonly used indicator of prediction accuracy.

In this case, the correlation coefficient is about 0.85. The prediction is not perfect, but the model seems to capture the trend.

Therefore, we succeeded in generating a good genomic prediction model, and we can now predict the grain number phenotype from genotypes with high accuracy.
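
For reference, a correlation coefficient of this kind can be computed directly with NumPy. The sketch below uses made-up observed and predicted values for six lines; how `check_accuracy` computes its value internally is not shown in this notebook, so this is only an illustration of the Pearson correlation itself.

```python
import numpy as np

# Made-up observed and predicted grain numbers for six lines.
observed = np.array([120.0, 135.0, 150.0, 160.0, 175.0, 190.0])
predicted = np.array([128.0, 130.0, 155.0, 150.0, 170.0, 180.0])

# Pearson correlation coefficient: off-diagonal entry of the 2 x 2 correlation matrix.
r = np.corrcoef(observed, predicted)[0, 1]
print(round(float(r), 3))
```

A value close to 1 means that lines with high observed values also tend to have high predicted values, even if the individual predictions are not exact.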

<br>

We omitted many steps, but this is the general flow for making a genomic prediction model and checking its accuracy.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_11.png?raw=true" alt="title" height="250px">


# Application of genomic prediction model

Now that we have a genomic prediction model, let's consider how to apply it in breeding.

## **Consider good genotypes for traits**

A genomic prediction model can predict phenotype values from genotype information alone.

So, we can consider which genotype is ideal for a trait.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_12.png?raw=true" alt="title" height="120px">


In this session, let's try to design a good genotype that improves grain number using the genomic prediction model.

### Predict the phenotype of a customized genotype

We can predict the phenotype value of a hypothetical genotype.

For example, if we introduced the genotype of another cultivar into parts of chromosome 1 and chromosome 3, how would the phenotype change?

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_13.png?raw=true" alt="title" height="360px">

We can predict the phenotype value of this genotype with the genomic prediction model.

The code below predicts the phenotype values of hypothetical genotypes.

Please run it and check the prediction results.

In [None]:
from modules import predict_customized_genotype
regions = [['chr01', 10000000, 20000000], ['chr03', 10000000, 20000000]]
predict_customized_genotype(genotype, regions, model)
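
The internals of `predict_customized_genotype` are not shown in this notebook. As a rough idea of what such a prediction involves, the sketch below builds a hypothetical genotype by setting every SNP inside the chosen regions to 2 (homozygous for the other cultivar) and all others to 0, then applies a linear model by hand. The SNP table, the column names `chr` and `pos`, the effects, and the intercept are all made up.

```python
import numpy as np
import pandas as pd

# Made-up SNP table with hypothetical column names and a made-up additive effect
# per SNP, such as a fitted linear model might provide.
snps = pd.DataFrame({
    "chr": ["chr01", "chr01", "chr01", "chr03", "chr03"],
    "pos": [5_000_000, 12_000_000, 18_000_000, 15_000_000, 25_000_000],
    "effect": [1.0, 2.0, -0.5, 3.0, 4.0],
})
intercept = 100.0  # made-up baseline phenotype

def customized_genotype(snps, regions):
    """Return a 0/2 genotype vector: 2 for SNPs inside any region, 0 elsewhere."""
    dosage = np.zeros(len(snps), dtype=int)
    for chrom, start, end in regions:
        inside = (snps["chr"] == chrom) & snps["pos"].between(start, end)
        dosage[inside.to_numpy()] = 2
    return dosage

regions = [["chr01", 10_000_000, 20_000_000], ["chr03", 10_000_000, 20_000_000]]
dosage = customized_genotype(snps, regions)

# Linear model prediction: intercept + sum of (genotype value x SNP effect).
predicted = intercept + dosage @ snps["effect"].to_numpy()
print(dosage, predicted)
```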

Now we can predict the phenotype values of new genotypes.

So, we can investigate what kind of genotype generates high phenotype values.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_20.png?raw=true" alt="title" height="360px">

### **Play with genomic prediction**

Let's try to find genotypes with a high grain number phenotype!

The code below predicts the phenotype value of your customized genotype.

You can edit the second line to specify the genomic regions into which you want to introduce mutations (another cultivar's genotype), as in the example above:

`regions = [['chr01', 10000000, 20000000], ['chr03', 10000000, 20000000]]`

You can specify more than 2 regions like this:

`regions = [ ['chrXX', YY, ZZZ], ['chrXX', YY, ZZZ], ['chrXX', YY, ZZZ], ['chrXX', YY, ZZZ] ]`

For example, if you edit it to:

 `regions = [['chr01', 12000000, 20000000], ['chr05', 0, 5000000], ['chr12', 10000000, 15000000]]`

It means that another cultivar's genotype is introduced into the following genomic regions:
* chr01: 12 Mbp to 20 Mbp
* chr05: 0 to 5 Mbp
* chr12: 10 Mbp to 15 Mbp

```
Common mistakes:
※ Forgetting the quotes ' ' around 'chr01', or the 0 in 'chr01'
※ Forgetting a comma
※ Mismatched numbers of [ and ]
```
Let's try to find a genotype that produces a high grain number, while being careful to avoid the common mistakes mentioned above!


※ Chromosome lengths are around 20,000,000 bp to 40,000,000 bp.
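
If you want to catch such mistakes before running the prediction, a small check like the one below may help. `check_regions` is a hypothetical helper, not part of `modules.py`. It assumes chromosome names of the form `chrNN` and uses 40,000,000 bp as a rough upper bound, so a region near the end of a shorter chromosome can still pass the check.

```python
import re

def check_regions(regions, max_length=40_000_000):
    """Return a list of problems found in `regions`; an empty list means none were found."""
    problems = []
    for i, region in enumerate(regions):
        if not isinstance(region, (list, tuple)) or len(region) != 3:
            problems.append(f"region {i}: expected [chromosome, start, end]")
            continue
        chrom, start, end = region
        # Chromosome names are assumed to look like 'chr01' (quoted, two digits).
        if not (isinstance(chrom, str) and re.fullmatch(r"chr\d{2}", chrom)):
            problems.append(f"region {i}: chromosome should look like 'chr01', got {chrom!r}")
        if not (isinstance(start, int) and isinstance(end, int)):
            problems.append(f"region {i}: start and end should be integers")
        elif not (0 <= start < end <= max_length):
            problems.append(f"region {i}: need 0 <= start < end <= {max_length}")
    return problems

# A correct specification, then one with 'chr1' and a missing coordinate.
print(check_regions([["chr01", 12000000, 20000000], ["chr05", 0, 5000000]]))
print(check_regions([["chr1", 12000000, 20000000], ["chr05", 5000000]]))
```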


In [None]:
from modules import predict_customized_genotype
regions = [['chr01', 10000000, 20000000], ['chr03', 10000000, 20000000]]
predict_customized_genotype(genotype, regions, model)

Based on this information, we can decide the direction for new cultivars to be created through breeding.

## Genomic Selection: Consider good parental lines to generate progenies with high phenotypes

If we have genotype information for breeding lines, we can simulate the genotypes of the next generation and predict their phenotype values.

So, we may be able to identify the best parental pair for generating progeny with high phenotype values, a pair that would have been missed by phenotypic selection.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_17.png?raw=true" alt="title" height="400px">

<br>

In our case, we have the 747 lines of the NAM population.

<br>

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_5.png?raw=true" alt="title" height="300px">

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/GP_18.png?raw=true" alt="title" height="240px">

A genomic prediction model is useful for selecting a good combination of parental lines from this population.

Using the code below, we can simulate the genotypes of an F2 population derived from two parental lines and predict the phenotype values of that F2 population.


`You can change "NAM_001" & "NAM_002" to your favorite NAM lines (NAM_001 ~ NAM_747).`

In [None]:
from modules import predict_progeny_phenotype
predict_progeny_phenotype("NAM_001", "NAM_002", 100, phenotype, genotype, model)
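
How `predict_progeny_phenotype` simulates progeny is not shown in this notebook. The sketch below illustrates the general idea under a strong simplification: every SNP segregates independently (no linkage between neighbouring SNPs), which a real simulation would not assume. The parents, SNP effects, intercept, and the function name `simulate_f2` are all made up.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_f2(parent_a, parent_b, n_progeny, rng):
    """Simulate F2 genotypes (0/1/2) from two inbred parents coded 0 or 2.

    Simplification: every SNP segregates independently (no linkage).
    """
    allele_a = np.asarray(parent_a) // 2   # allele carried by parent A (0 or 1)
    allele_b = np.asarray(parent_b) // 2   # allele carried by parent B (0 or 1)

    def f1_gametes():
        # At each SNP, an F1 gamete carries the allele of parent A or B with equal chance.
        from_a = rng.random((n_progeny, allele_a.size)) < 0.5
        return np.where(from_a, allele_a, allele_b)

    # An F2 plant is the sum of two independent F1 gametes.
    return f1_gametes() + f1_gametes()

parent_a = np.array([0, 2, 2, 0])
parent_b = np.array([0, 2, 0, 2])
f2 = simulate_f2(parent_a, parent_b, n_progeny=1000, rng=rng)

effects = np.array([1.0, -1.0, 2.0, 3.0])   # made-up SNP effects
predicted = 100.0 + f2 @ effects             # made-up intercept of 100
print(f2[:3])
print(round(float(predicted.mean()), 1), predicted.max())
```

SNPs at which both parents carry the same genotype stay fixed in the progeny, while SNPs at which they differ segregate into 0, 1, and 2. The spread of the predicted values shows how much improvement a cross could offer.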

From such results, we can select the best combination of parental lines for making new cultivars.

---
## Summary

In this notebook, we played with **Genomic Prediction** analysis using unpublished data.

A genomic prediction model lets us predict phenotypes from genotype information.

Thus, you can find the ideal genotype for a trait and good combinations of parental lines.

If you plan to generate a **designed breeding population** with high genetic variety, genomic prediction is one approach that can help you develop high-yield cultivars.

