<img src="https://github.com/slt666666/FAO_lecture/blob/main/title.png?raw=true" alt="title" height="300px">


# QTL-seq - practice -

In this practice session, we will run a QTL-seq analysis on simulation data, then perform a sliding window analysis on yesterday's result.

It should help you understand ...
* the process of QTL-seq analysis
* how to interpret the results of QTL-seq analysis
* what data is required for QTL-seq
* what sliding window analysis is & why it is required

## The contents in this notebook ... 

* Review of QTL-seq

* Practice - QTL-seq analysis for simulation data -

  * We assume a very simple organism & conduct QTL-seq analysis to understand the process of QTL-seq.

* Introduction to the QTL-seq pipeline (GitHub)

* Sliding window analysis
  * What is sliding window analysis for MutMap & QTL-seq?
  * We use published data from the MutMap paper [(Abe et al., 2012)](https://www.nature.com/articles/nbt.2095) to perform sliding window analysis.

# Main contents

# Review of QTL-seq

QTL-seq analysis is one of the methods for identifying genomic regions associated with quantitative traits (like QTL mapping).

* MutMap: Qualitative trait

* QTL-seq: Quantitative trait

The QTL-seq process, in brief:

1. Cross 2 cultivars that show different phenotypes to generate an F2 population, and survey the phenotypes of the F2 progeny.

1. If the phenotype segregates, select 20 or more samples from each end of the distribution (high and low).

1. Collect bulked DNA from the selected samples for each phenotype (low bulk & high bulk), and perform next generation sequencing (bulk sequencing).

1. Align the sequence reads to the reference genome of one parent. Then calculate the SNP-index across the whole genome based on the alignment results.

1. Finally, to compare the SNP-index between the high and low bulks, calculate the ΔSNP-index, which is the difference in SNP-index between the two bulks. As a result, genomic regions where the allele frequency differs greatly between the high and low bulks are identified as regions with a high ΔSNP-index (> 0) or a low ΔSNP-index (< 0).

Such a genomic region is a candidate for the causative region that affects the phenotype.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/QTLseq.png?raw=true" alt="title" height="750px">

In this practice lecture, we will walk through each step using simulation data to understand the process of QTL-seq.

# Practice: QTL-seq analysis using very simple simulation data

<img src="https://github.com/slt666666/FAO_lecture/blob/main/intro2.png?raw=true" alt="title" height="200px">

!! Please run the code below. It downloads the programs used in this notebook. !!

In [None]:
# Prepare modules & packages
!wget -O module_qtlseq.py https://github.com/slt666666/FAO_lecture/blob/main/module_qtlseq.py?raw=true

from module_qtlseq import make_2_cultivars
from module_qtlseq import make_F2_progeny
from module_qtlseq import check_distribution
from module_qtlseq import high_and_low_bulk_sequencing
from module_qtlseq import alignment
from module_qtlseq import calculate_SNP_index
from module_qtlseq import visualize_SNP_index
from module_qtlseq import calculate_delta_SNP_index
from module_qtlseq import visualize_delta_SNP_index
from module_qtlseq import check_results
from module_qtlseq import qtl_seq_simulation
from module_qtlseq import get_yesterday_SNP_index
from module_qtlseq import visualize_SNP_index2
from module_qtlseq import sliding_window

## Simulation setting:

In this practice, we assume...

* **The plant has only 1 chromosome & its genome size is 100 bp.**

* **There are 2 cultivars. Cultivar A is taller than cultivar B.**

* **Cultivar B has 40 SNPs whose genotypes differ from those of cultivar A.**

* **1 SNP has a big effect on plant height. The other 39 SNPs have small effects.**

    **But we don't know which SNP has the big effect.**

<img src="https://github.com/slt666666/FAO_lecture/blob/main/simulation5.png?raw=true" alt="title" height="200px">

Given these cultivars, we want to identify the SNP that has a big effect on plant height.

Let's try QTL-seq analysis !!

## 1. Cross parent lines and generate F2 population

As the first step of QTL-seq analysis, we generate an F2 population from cultivar A & cultivar B.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/simulation6.png?raw=true" alt="title" height="300px">

The phenotypes of the F2 progeny are distributed according to their genotypes, with some random variation.

<br>

Please run the code below to generate an F2 population from cultivars A and B.

The code generates 200 F2 progeny & checks the distribution of their phenotype values.

In [None]:
cultivar_A, cultivar_B = make_2_cultivars(length=100, snp=40)
f2_progeny = make_F2_progeny(cultivar_A, cultivar_B, progeny=200)
check_distribution(f2_progeny)

```
※ Memo
The simulation program generates the genotype of each progeny as a random mixture of cultivar A and cultivar B.
Then, the program decides the phenotype values based on the simulated effect of each SNP genotype (it also adds error variance).
The program saves the genotype data of the simulated population & the effect of each SNP.
Therefore, we can check whether the QTL-seq result is correct by comparing it with the simulated data.
```
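The memo above can be illustrated with a minimal sketch. This is not the code in `module_qtlseq.py`; the function names, the single crossover per gamete, the additive effect per A allele, and the position of the large-effect SNP are all assumptions made for illustration.

```python
import random

def make_gamete(n_snps, rng):
    """One gamete from the F1 plant: 0 = cultivar A allele, 1 = cultivar B allele.
    Assumption: exactly one crossover point per gamete."""
    crossover = rng.randint(0, n_snps)   # index where the haplotype switches
    first = rng.randint(0, 1)            # parental haplotype that starts the gamete
    return [first if i < crossover else 1 - first for i in range(n_snps)]

def make_f2_individual(effects, rng, noise_sd=1.0):
    """Return (genotype, phenotype); genotype[i] is the number of B alleles (0, 1 or 2)."""
    n_snps = len(effects)
    g1, g2 = make_gamete(n_snps, rng), make_gamete(n_snps, rng)
    genotype = [a + b for a, b in zip(g1, g2)]
    # Assumption: each A allele (2 - B count) adds the SNP effect to plant height,
    # and a normally distributed error term is added on top.
    phenotype = sum(e * (2 - g) for e, g in zip(effects, genotype))
    phenotype += rng.gauss(0, noise_sd)
    return genotype, phenotype

rng = random.Random(0)
effects = [0.1] * 40      # 39 small-effect SNPs (hypothetical effect size)
effects[12] = 10.0        # hypothetical index of the large-effect SNP
f2 = [make_f2_individual(effects, rng) for _ in range(200)]
print(len(f2), "F2 individuals simulated")
```

Because the effects and genotypes are stored, the true causal SNP is known, which is what makes it possible to check a QTL-seq result against the simulation.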

## 2. Bulk sequencing of high phenotype samples & low phenotype samples

The 2nd step is bulk sequencing of the high phenotype samples & the low phenotype samples.

If there is a SNP allele that only the high (or low) phenotype samples carry, this SNP may have a big effect.

In this practice, we collect the 20 samples with the highest phenotype values and the 20 samples with the lowest.
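Selecting the two bulks amounts to ranking the F2 samples by phenotype and taking both tails. The sketch below uses a hypothetical helper, `select_bulks`, and toy phenotype values; it is not how `high_and_low_bulk_sequencing` is implemented.

```python
def select_bulks(phenotypes, top=20, bottom=20):
    """Return (high_ids, low_ids): sample indices with the highest and lowest phenotypes."""
    # Sort sample indices from the smallest to the largest phenotype value.
    order = sorted(range(len(phenotypes)), key=lambda i: phenotypes[i])
    low_ids = order[:bottom]
    high_ids = order[len(order) - top:]
    return high_ids, low_ids

# Toy data: sample i has plant height i, so the ranking is easy to check by eye.
phenotypes = [float(i) for i in range(200)]
high_ids, low_ids = select_bulks(phenotypes, top=20, bottom=20)
print("high bulk:", high_ids[0], "to", high_ids[-1])
print("low bulk:", low_ids[0], "to", low_ids[-1])
```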

<img src="https://github.com/slt666666/FAO_lecture/blob/main/simulation7.png?raw=true" alt="title" height="400px">


Please run the code below to perform bulk sequencing for each sample set!!

The code generates a sequence read file for each bulk.

In [None]:
high_reads, low_reads = high_and_low_bulk_sequencing(f2_progeny, top=20, bottom=20, reads=200)

The above program sequences the bulked DNA & saves the results (fastq files) on the Colab server.

You can check the sequence results (fastq files) using the file system ↓

```
How to check files in your Google Colab server space.
1. Click the file icon in the upper left.
2. The file list shows the files in your server space.
(3. If "high_bulked_sequences.fastq" and "low_bulked_sequences.fastq" are not listed, please click the third icon from the right.)
```
<img src="https://github.com/slt666666/FAO_lecture/blob/main/filesystem.png?raw=true" alt="title" height="250px">

```
※Memo
If you want to know more about fastq format, please check here↓
https://en.wikipedia.org/wiki/FASTQ_format
```
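For orientation, a FASTQ record usually spans four lines: a header starting with `@`, the sequence, a `+` separator, and one quality character per base. The sketch below parses two made-up records. It assumes four lines per record and Phred+33 quality encoding, and the helper name `parse_fastq` is hypothetical.

```python
def parse_fastq(lines):
    """Parse FASTQ text lines into a list of records (assumes 4 lines per record)."""
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of 4 lines")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record at line {i + 1}")
        records.append({
            "name": header[1:],
            "sequence": sequence,
            # Assumption: Phred+33 encoding, so the score is the ASCII code minus 33.
            "qualities": [ord(ch) - 33 for ch in quality],
        })
    return records

# Two made-up reads, for illustration only.
example = """@read_1
ACGTACGTAC
+
IIIIIIIIII
@read_2
TTGACCGTAA
+
IIIIIHHHGG
""".splitlines()

records = parse_fastq(example)
for rec in records:
    print(rec["name"], rec["sequence"], min(rec["qualities"]))
```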

## 3. Align reads to the cultivar A genome

After sequencing each bulked DNA, we align these sequence reads to the cultivar_A genome to identify the SNP positions in each bulk.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/simulation9.png?raw=true" alt="title" height="400px">

Please run the code below to align the bulk sequence reads to the cultivar A genome !!


In [None]:
high_bulk_alignment_result, low_bulk_alignment_result = alignment(high_reads, low_reads, cultivar_A)


```
※ Memo
In this notebook, we align the sequence reads with our own program.
In practice, alignment is usually done with a mapping tool such as BWA (http://bio-bwa.sourceforge.net/).
So, if you conduct QTL-seq analysis on your own data, you may need to use a mapping tool.
However, the QTL-seq pipeline that we introduce later includes mapping tools, so you don't need to worry about it.
```

## 4. Calculate SNP-index based on alignment results for low bulk & high bulk

After the alignment, we can calculate the SNP-index for each bulk.

In QTL-seq, the SNP-index is the proportion of reads at a SNP position whose genotype differs from cultivar_A.

If the SNP-index is close to 1, most reads at that SNP position show a genotype different from cultivar_A.
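As a worked example of this definition, the sketch below computes the SNP-index at one position from the bases of the reads aligned there. The helper name `snp_index` and the toy reads are assumptions for illustration, not part of `module_qtlseq.py`.

```python
def snp_index(reference_base, aligned_bases):
    """Proportion of aligned reads whose base differs from the reference (cultivar A)."""
    depth = len(aligned_bases)
    if depth == 0:
        return None  # no reads cover this position, so the SNP-index is undefined
    different = sum(1 for base in aligned_bases if base != reference_base)
    return different / depth

# 10 reads cover the position; 8 of them carry G where cultivar A has A.
print(snp_index("A", "GGGGGGGGAA"))
# All 10 reads match cultivar A.
print(snp_index("A", "AAAAAAAAAA"))
```

The first call gives 8 / 10 and the second gives 0 / 10.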

<img src="https://github.com/slt666666/FAO_lecture/blob/main/simulation12.png?raw=true" alt="title" height="250px">

Please run the code below to calculate the SNP-index for each bulk !!

In [None]:
high_bulk_SNP_index, low_bulk_SNP_index = calculate_SNP_index(high_bulk_alignment_result, low_bulk_alignment_result, cultivar_A, cultivar_B)
print("High bulk")
display(high_bulk_SNP_index)
print("Low bulk")
display(low_bulk_SNP_index)

## 5. Calculate ΔSNP-index

After we get the SNP-index data for each bulk (high & low), we should compare them.

e.g. SNP positions that show a high SNP-index in both the high and low bulks are unlikely to be candidates.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/simulation11.png?raw=true" alt="title" height="600px">

So, we should exclude such positions and identify the positions where only one of the bulks (high or low) has a high SNP-index.

To do that, we calculate ΔSNP-index = SNP-index(high bulk) - SNP-index(low bulk).

Please run the code below to calculate the ΔSNP-index.
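The formula can be applied position by position. The sketch below uses made-up SNP-index values and a hypothetical helper, `delta_snp_index`; it ranks positions by the absolute value of the ΔSNP-index, since the sign only depends on which bulk is enriched for the non-reference allele.

```python
def delta_snp_index(high, low):
    """ΔSNP-index = SNP-index(high bulk) - SNP-index(low bulk), per shared position."""
    shared = sorted(set(high) & set(low))
    return {pos: high[pos] - low[pos] for pos in shared}

# Made-up SNP-index values, keyed by SNP position.
high = {10: 0.90, 35: 0.50, 60: 1.00}
low = {10: 0.85, 35: 0.50, 60: 0.05}

delta = delta_snp_index(high, low)
# Position 10 is high in BOTH bulks, so its ΔSNP-index stays near 0.
candidate = max(delta, key=lambda pos: abs(delta[pos]))
print(candidate, round(delta[candidate], 2))
```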

In [None]:
delta_SNP_index = calculate_delta_SNP_index(high_bulk_SNP_index, low_bulk_SNP_index)
delta_SNP_index

## 6. Visualize ΔSNP-index plot

After calculating the ΔSNP-index, visualize it to look for the causative position.

In [None]:
visualize_delta_SNP_index(delta_SNP_index)

The SNP position that shows a high ΔSNP-index may have a big effect on plant height.

## Check analysis results & real genotype

QTL-seq analysis points to a causative SNP whose ΔSNP-index is almost 1.

The code below shows the effect of each SNP & the calculated ΔSNP-index (sorted by SNP effect).

Check whether the QTL-seq result is correct !!

In [None]:
check_results(delta_SNP_index)

The setting of this simulation was ... 1 SNP has a big effect on the phenotype (+10).

QTL-seq analysis succeeded in identifying the position using only the reference FASTA & the high and low bulk sequencing data!!

# Play with simulation!

You can specify
- the length of the reference genome (genome size)
- the number of SNP mutations
- the number of F2 progeny
- the number of sequence reads

Please try different settings & perform QTL-seq analysis using the code below.

In [None]:
qtl_seq_simulation(length=100, snp=40, progeny=200, reads=500)

# Introduction of QTL-seq pipeline

Our laboratory developed a very simple pipeline to conduct QTL-seq.

<img src="https://github.com/YuSugihara/QTL-seq/blob/master/images/1_logo.png?raw=true" alt="title" height="100px">

(https://github.com/YuSugihara/QTL-seq)

This pipeline is very simple to use.

The required input data is ...
* reference fasta
* sequence reads of the parent cultivar (reference.fastq in the command below)
* High bulk DNA sequence reads
* Low bulk DNA sequence reads

Then you just install the pipeline & run the command below. That's it !

```
# command
qtlseq -r reference.fasta -p reference.fastq -b1 high_bulked_sequences.fastq -b2 low_bulked_sequences.fastq -n1 20 -n2 20 -o output_name
```

Then you get a data file that contains the SNP positions with their SNP-index & ΔSNP-index values, & graphs that visualize them.

# Sliding window analysis

In this part, we use yesterday's MutMap result from the 2nd practice to try sliding window analysis.


## Review of Sliding Window analysis

Unlike a simulation study, real (experimental) data does not always show the theoretically expected values.

Real data usually contains errors, outliers, human error, etc., which generate false positives.

In MutMap (QTL-seq) analysis, the SNP-index can deviate from the expected value because of these errors.

e.g. When the read depth is not sufficient,

an unimportant SNP position can sometimes show an extreme outlier such as SNP-index = 0 (or SNP-index = 1).
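To see why low read depth produces such outliers, consider a SNP whose true allele frequency in the bulk is 0.5. The sketch below computes the chance that every read at that position happens to carry the same allele, so that the SNP-index comes out as exactly 0 or 1. It assumes reads sample the two alleles independently (a simple binomial model), and the function name is hypothetical.

```python
def prob_extreme_index(depth, alt_freq=0.5):
    """Probability that all reads carry the same allele (SNP-index exactly 0 or 1).
    Assumption: each read independently carries the alternative allele with alt_freq."""
    return alt_freq ** depth + (1 - alt_freq) ** depth

for depth in (4, 10, 50):
    print(depth, prob_extreme_index(depth))
```

At depth 4 the probability is 0.125, so roughly 1 in 8 such SNPs would look extreme by chance alone, while at depth 50 it is negligible.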

### Sliding window

One way to reduce the impact of these errors is to average the SNP-indexes of nearby positions.

This diminishes the influence of individual errors and gives more reliable SNP-index results.

<br>

The process of sliding window analysis is...

1. Decide the size of the interval (window size) and the distance to move to the next interval (step size).
1. Extract the data (SNP-indexes) in the interval.
1. Calculate their average value.
1. Move the interval by the step size.
1. Repeat steps 2-4 until the whole region is covered.
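The steps above can be sketched in a few lines. This is an illustrative version with a hypothetical function name and toy data, not the `sliding_window` function of `module_qtlseq.py`; here, windows that contain no SNP are simply skipped.

```python
def sliding_window_mean(positions, values, region_length, window_size, step_size):
    """Return a list of (window_center, mean SNP-index) for each window with data."""
    results = []
    start = 0
    while start < region_length:
        end = start + window_size
        # Step 2: extract the SNP-indexes that fall inside [start, end).
        in_window = [v for p, v in zip(positions, values) if start <= p < end]
        # Step 3: average them (windows without any SNP are skipped).
        if in_window:
            results.append((start + window_size // 2, sum(in_window) / len(in_window)))
        # Step 4: move the window by the step size.
        start += step_size
    return results

# Toy data: 4 SNPs on a 40 bp region, window size 20 bp, step size 10 bp.
positions = [5, 15, 25, 35]
values = [0.2, 1.0, 0.8, 0.9]
for center, mean in sliding_window_mean(positions, values, 40, 20, 10):
    print(center, round(mean, 2))
```

The isolated low value at position 5 is pulled toward its neighbour in the first window, which is the smoothing effect described above.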

<img src="https://github.com/slt666666/FAO_lecture/blob/main/simulation13.png?raw=true" alt="title" height="400px">

## Yesterday's result

We use yesterday's SNP-index results.

**※Materials & Methods are the same as yesterday.**

<img src="https://github.com/slt666666/FAO_lecture/blob/main/material.png?raw=true" alt="title" height="400px">

### SNP-index

Please run the code below to download yesterday's results (alignment, SNP calling & SNP-index) !!

In [None]:
!wget -q -O mutmap_dataset.txt https://raw.githubusercontent.com/CropEvol/lecture/master/data/mutmap_chr10.txt
SNP_index = get_yesterday_SNP_index()
display(SNP_index)
visualize_SNP_index2(SNP_index)

Red circles show positions with a high SNP-index (>0.8).

Multiple positions show a high SNP-index.

To extract more reliable regions from these results, we perform sliding window analysis.

## Sliding window analysis & visualization

The code below performs sliding window analysis.

You can set the window size & step size used for the analysis.

```
First, we set window size = 1 Mbp & step size = 200 kbp
(considering that the length of chromosome 10 is 23207287 bp.)
```
```

In [None]:
sliding_window(SNP_index, window_size=1000000, step_size = 200000)

The red line shows the average SNP-index for each interval.

The plot shows a high average SNP-index at 22~23 Mbp on chromosome 10.

This result suggests that a gene related to leaf color lies at 22~23 Mbp on chromosome 10.

In this region, there is a gene, "OsCAO1", that encodes chlorophyllide a oxygenase.

The knock-out line of OsCAO1 is known to show light green leaves (Abe et al., 2012).

<br>

But, of course, even after applying sliding window analysis, the results may still contain false positives.

In that case, we have to consider other approaches to select candidate SNPs, such as biological information about the SNP position (exon, important gene region, etc.).


## Our MutMap/QTL-seq analysis pipelines include sliding window analysis

You can specify window size & step size in these pipelines.

- MutMap
https://github.com/YuSugihara/MutMap#Usage

- QTL-seq
https://github.com/YuSugihara/QTL-seq#usage

---
## Summary

In this notebook, we demonstrated **QTL-seq** analysis using simulation data

and **sliding window** analysis using published data.

MutMap & QTL-seq analysis are actually very simple & easy methods.

<br>

The important point is how to interpret the results (SNP-index values).

We should guard against false positives, errors, etc. by applying sliding window analysis, checking the read depth of the SNPs of interest, and so on.

<br>

Tomorrow, we plan to demonstrate **genomic prediction**.