<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/title.png?raw=true" alt="title" height="250px">


# QTL-seq - practice session -

In this session, we will perform **QTL-seq analysis** using simulated data.

We will also perform **sliding window analysis** using real (published) data.

This session should help you understand:
* The process of QTL-seq analysis
* How to interpret the results of QTL-seq analysis
* What kind of data is required for QTL-seq
* What sliding window analysis is and why it is needed

## Contents of this notebook

* QTL-seq analysis
  * Limitations of MutMap analysis
  * QTL-seq analysis using simulated data (step by step)
  * Review of QTL-seq

* Sliding window analysis
  * Limitations in interpreting the results of MutMap & QTL-seq analysis
  * Sliding window analysis using published data

# Main contents

## **Limitations of MutMap analysis**

In the previous practice session, we performed MutMap analysis to identify a genetic variant associated with leaf color.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/Mutmap_8.png?raw=true" alt="title" height="200px">


<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/Mutmap_12.png?raw=true" alt="title" height="250px"><br>


<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/Mutmap_13.png?raw=true" alt="title" height="150px">

This approach looks as if it could be applied to a wide variety of traits.

However, MutMap analysis can be applied only to **qualitative traits**.

If the trait is a **quantitative trait**, MutMap cannot be applied.

This is because, for quantitative traits, a segregating population (e.g., an F2 population) often shows a wide, continuous spectrum of phenotype values.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/QTLseq_1.png?raw=true" alt="title" height="500px">


Many agronomically important traits are quantitative:
 - grain number
 - panicle number
 - root length
 - ...etc

Also, a quantitative trait is often controlled not by a single gene but by several QTLs.

To cope with these problems, we developed QTL-seq analysis.

QTL-seq is very similar to MutMap analysis, but it focuses on quantitative traits.


**So, let's perform QTL-seq analysis step by step!**

### **Practice of QTL-seq analysis**

In this section, we will practice QTL-seq analysis in the following situation.

- The reference genome of Cultivar B is available.
- The mutated cultivar is taller than the original cultivar.
- We generated 100 F2 progeny by crossing Cultivar B and the mutated cultivar.

**Purpose: Identify genetic variants that are associated with plant height using QTL-seq.**

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/QTLseq_2.png?raw=true" alt="title" height="500px">

Plant height is a quantitative trait, so the F2 population shows a wide range of plant heights.

Therefore, we **can't apply MutMap analysis.**

Let's check the distribution of phenotype values in this F2 population.

The code below downloads the phenotype data of the F2 population and visualizes it.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
!wget -O F2_phenotype.csv https://raw.githubusercontent.com/slt666666/FAO_lecture/main/FAO_2024/data/F2_phenotype.csv 2>/dev/null
phenotype = pd.read_csv("F2_phenotype.csv")
fig, ax = plt.subplots(figsize=(4, 3))
sns.histplot(phenotype["plant height"], bins=20, ax=ax)
plt.show()

Our F2 population shows a wide distribution of plant height.

In general, quantitative traits are controlled by multiple genes.

Therefore, phenotype values vary depending on the combination of genetic variants.
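As a toy illustration of how a combination of variants can shape a phenotype, the sketch below assumes a purely additive model. The base height, the three SNPs, and their per-allele effects are hypothetical values chosen for illustration; they are not taken from the simulated dataset.

```python
# Hypothetical additive model: each mutant allele adds a fixed effect (cm).
BASE_HEIGHT = 80.0           # assumed height with no mutant alleles
EFFECTS = [2.0, 10.0, 3.0]   # assumed per-allele effects of 3 SNPs


def plant_height(genotype, effects=EFFECTS, base=BASE_HEIGHT):
    """genotype: number of mutant alleles (0, 1 or 2) at each SNP."""
    return base + sum(n * e for n, e in zip(genotype, effects))


# A plant homozygous for the strong 2nd SNP is taller than one
# carrying both weak SNPs but lacking the strong one.
strong_only = plant_height((0, 2, 0))
weak_only = plant_height((2, 0, 2))
print(strong_only, weak_only)
```

Under this assumption, the tallest plants are mostly those that carry the large-effect SNP, which is the bias that QTL-seq exploits.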

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/QTLseq_3.png?raw=true" alt="title" height="500px">

In this case, F2 progeny with taller plant height are likely to carry genetic variants with a strong positive effect (e.g., the 2nd SNP in the figure above).

Conversely, F2 progeny with shorter plant height are likely **not** to carry genetic variants with a strong positive effect.

So, genetic variants with a strong positive effect are enriched in F2 progeny with tall plant height.

QTL-seq analysis uses this biased distribution of genetic variants. (The idea is similar to MutMap analysis.)

In other words, QTL-seq identifies candidate causal SNPs by looking for genetic variants that are **shared among F2 progeny with high plant height** and **absent from F2 progeny with low plant height**.

#### **Bulked DNA sequencing**

In QTL-seq analysis, we want to identify SNPs that are mutated in most F2 progeny with taller plant height and not mutated in most F2 progeny with shorter plant height.

So, in QTL-seq analysis, we perform 2 bulked sequencing runs.

First, we extract DNA from the 10 tallest F2 progeny, pool it, and sequence the pool (high bulk).

We also extract DNA from the 10 shortest F2 progeny, pool it, and sequence the pool (low bulk).
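Choosing which plants go into each bulk is just a ranking of the phenotype values. A minimal sketch, using made-up heights for 12 plants and a bulk size of 3 to keep the output short (the sample names and values are hypothetical; the session itself uses 10 plants per bulk):

```python
# Hypothetical plant heights (cm); not the values in F2_phenotype.csv.
heights = {
    "F2_001": 82.1, "F2_002": 95.4, "F2_003": 70.3, "F2_004": 101.2,
    "F2_005": 88.0, "F2_006": 76.5, "F2_007": 110.7, "F2_008": 91.3,
    "F2_009": 68.9, "F2_010": 99.8, "F2_011": 85.2, "F2_012": 73.4,
}
bulk_size = 3

# Rank sample names from shortest to tallest plant.
ranked = sorted(heights, key=heights.get)

low_bulk = ranked[:bulk_size]     # shortest plants
high_bulk = ranked[-bulk_size:]   # tallest plants

print("Low bulk: ", low_bulk)
print("High bulk:", high_bulk)
```

With the phenotype table loaded earlier as a pandas DataFrame, `phenotype.nlargest(10, "plant height")` and `phenotype.nsmallest(10, "plant height")` would select the two bulks in the same way.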

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/QTLseq_4.png?raw=true" alt="title" height="450px">

The sequencing data and bioinformatics tools can be downloaded with the commands below.

In [None]:
%%bash
# download sequence data
wget https://raw.githubusercontent.com/slt666666/FAO_lecture/main/FAO_2024/data/get_data2.sh 2>/dev/null
bash get_data2.sh

# install bwa, samtools, bcftools
apt-get install -q bwa
apt-get install -q samtools
apt-get install -q bcftools

## igv-notebook-0.3.1
pip install -q igv-notebook==0.3.1
wget -q -O modules.py https://raw.githubusercontent.com/slt666666/FAO_lecture/main/FAO_2024/data/modules.py

# In this practice, the installation commands are tailored to Google Colab.
# To run this on your own PC, please check the installation instructions on each tool's website.

The bulked sequencing data are `reads/high_bulked_read1/2.fastq` and `reads/low_bulked_read1/2.fastq`.

These sequencing data are derived from multiple F2 progeny with high plant height and low plant height, respectively.

So, if we align these reads to the reference genome, we can identify causal mutations that are shared among F2 progeny with high plant height.

#### **Alignment**

The next step is alignment with `bwa`.

To identify genetic variants that are shared among high-phenotype F2 progeny and absent from low-phenotype F2 progeny, we will compare the alignment results of the high and low bulks.

Let's see what kind of alignment results we get when we use **2** bulked samples.

In [None]:
%%bash
# align the high-bulk reads to the reference genome
bwa index genome/CultivarB.fa
bwa aln genome/CultivarB.fa reads/high_bulked_read1.fastq > high_bulked_read1.sai 2>/dev/null
bwa aln genome/CultivarB.fa reads/high_bulked_read2.fastq > high_bulked_read2.sai 2>/dev/null
bwa sampe genome/CultivarB.fa high_bulked_read1.sai high_bulked_read2.sai reads/high_bulked_read1.fastq reads/high_bulked_read2.fastq > high_bulked.sam; rm -fR high_bulked_read1.sai high_bulked_read2.sai
# convert sam to bam
samtools sort -O bam high_bulked.sam > high_bulked.bam
samtools index high_bulked.bam

# align the low-bulk reads to the reference genome (the bwa index built above is reused)
bwa aln genome/CultivarB.fa reads/low_bulked_read1.fastq > low_bulked_read1.sai 2>/dev/null
bwa aln genome/CultivarB.fa reads/low_bulked_read2.fastq > low_bulked_read2.sai 2>/dev/null
bwa sampe genome/CultivarB.fa low_bulked_read1.sai low_bulked_read2.sai reads/low_bulked_read1.fastq reads/low_bulked_read2.fastq > low_bulked.sam; rm -fR low_bulked_read1.sai low_bulked_read2.sai
# convert sam to bam
samtools sort -O bam low_bulked.sam > low_bulked.bam
samtools index low_bulked.bam

# generate vcf file (sample columns follow the BAM order: low bulk first, then high bulk)
bcftools mpileup -Ob -o bulked.bcf -f genome/CultivarB.fa --annotate FORMAT/AD low_bulked.bam high_bulked.bam
bcftools call -vmO v -o bulked.vcf bulked.bcf; rm bulked.bcf

After the alignment finishes, you will have `high_bulked.bam`, `low_bulked.bam`, and `bulked.vcf` as the results.

Next, let's visualize the alignment results for the high and low bulks.

In [None]:
## Visualize alignment results
import igv_notebook
from modules import RefTrack, AnnotationTrack, BamTrack
igv_notebook.init()
# show reference genome
ref = RefTrack({ "fastaPath":"genome/CultivarB.fa", "indexPath":"genome/CultivarB.fa.fai", "id":"CultivarB" })
# show aligned high bulked reads
B = BamTrack({ "name":"high_bulked", "path":"high_bulked.bam", "indexPath":"high_bulked.bam.bai", "viewAsPairs":True })
# show aligned low bulked reads
B2 = BamTrack({ "name":"low_bulked", "path":"low_bulked.bam", "indexPath":"low_bulked.bam.bai", "viewAsPairs":True })
# visualize
b = igv_notebook.Browser(ref)
b.load_track(B)
b.load_track(B2)

Here we can check the genetic variants shared among high (or low) phenotype F2 progeny.

#### **SNP index**

As a result, we obtained SNPs from each bulk.

From these SNPs, we want to identify those that are shared among high-phenotype F2 progeny.

Such SNPs can be identified with the SNP index, which was also used in MutMap analysis.

For example ↓

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/QTLseq_5.png?raw=true" alt="title" height="300px">

All high bulk samples have the mutated genotype at this SNP.

But we can't yet say that this SNP is the causal genetic variant.

We have to consider the **low bulk samples**.

Even if the SNP index is very high in the high bulk, the SNP can't be considered a causative variant if its **SNP index is also high in the low bulk**

(like the left SNP in the figure).

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/QTLseq_6.png?raw=true" alt="title" height="450px">

Therefore, we have to identify SNPs with a high SNP index in the high bulk but a low SNP index in the low bulk

(like the right SNP in the figure).

#### **ΔSNP index**

To identify such SNPs, we'll use the **ΔSNP index**.


The **ΔSNP index** is calculated as `SNP index(high bulk) - SNP index(low bulk)`.

**A high ΔSNP index means that most high bulk F2 progeny have the mutated genotype, while few or no low bulk F2 progeny have it.**
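As a small worked example, the sketch below computes the SNP index as the fraction of reads carrying the mutant allele, then takes the difference between bulks. The read counts are hypothetical numbers chosen to mimic a SNP that is common in both bulks and a SNP that is specific to the high bulk.

```python
def snp_index(ref_reads, alt_reads):
    """Fraction of aligned reads that carry the mutant (ALT) allele."""
    depth = ref_reads + alt_reads
    if depth == 0:
        raise ValueError("no reads cover this position")
    return alt_reads / depth


def delta_snp_index(high_counts, low_counts):
    """SNP index(high bulk) - SNP index(low bulk); counts are (ref, alt)."""
    return snp_index(*high_counts) - snp_index(*low_counts)


# Hypothetical (ref, alt) read counts at two SNPs.
shared_snp = delta_snp_index(high_counts=(0, 20), low_counts=(1, 19))
high_only_snp = delta_snp_index(high_counts=(0, 20), low_counts=(18, 2))

print(round(shared_snp, 2), round(high_only_snp, 2))
```

Both SNPs have a SNP index of 1.0 in the high bulk, but only the second one keeps a large ΔSNP index once the low bulk is taken into account.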

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/QTLseq_7.png?raw=true" alt="title" height="450px">

In summary, we have to identify **SNPs with a high ΔSNP index.**

Such a SNP is shared only among high-phenotype F2 progeny, so it could be a causative variant.

Let's calculate the ΔSNP index from the vcf file (the variant calling results).

You can calculate ΔSNP index values with my script using the code below.


In [None]:
from modules import calc_delta_SNP_index
delta_SNP_index = calc_delta_SNP_index("bulked.vcf")
delta_SNP_index
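
The internals of `calc_delta_SNP_index` are not shown in this notebook. As a rough idea of what such a calculation can look like, the sketch below reads the per-sample `AD` (allelic depth) values from one VCF record. The record itself is a hypothetical example, and the function name and column defaults are assumptions; the defaults follow the order in which the BAM files were passed to `bcftools mpileup` (low bulk, then high bulk).

```python
def delta_snp_index_from_vcf_line(line, low_col=9, high_col=10):
    """Return (chrom, pos, low index, high index, delta) for one VCF record."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos = fields[0], int(fields[1])
    ad_slot = fields[8].split(":").index("AD")  # position of AD in FORMAT

    def snp_index(sample_field):
        # AD is "ref_count,alt_count[,...]"; only the first ALT allele is used.
        counts = [int(c) for c in sample_field.split(":")[ad_slot].split(",")]
        return counts[1] / sum(counts)

    low = snp_index(fields[low_col])
    high = snp_index(fields[high_col])
    return chrom, pos, low, high, high - low


# Hypothetical record: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT low high
record = ("chr01\t5800\t.\tA\tG\t225\t.\tDP=40\tGT:PL:AD"
          "\t0/1:30,0,200:18,2\t1/1:255,60,0:0,20")
chrom, pos, low, high, delta = delta_snp_index_from_vcf_line(record)
print(chrom, pos, round(low, 2), round(high, 2), round(delta, 2))
```

This sketch assumes every sample has a numeric `AD` value with non-zero depth; real VCF files can contain missing values, header lines, and multi-allelic sites that need extra handling.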

It's not so difficult to check the ΔSNP index manually if there are only a few hundred SNPs.

But if you find thousands to millions of SNPs, it is better to visualize them in a graph.

Let's visualize the SNP index and ΔSNP index using the code below.

In [None]:
from modules import visualize_delta_SNP_index
visualize_delta_SNP_index(delta_SNP_index)

You can see the SNP index of the high/low bulks and the ΔSNP index across the whole genome.

In an F2 population, genetic variants are shuffled randomly by recombination and segregation.

So, for a SNP that is **not** causal, the SNP index of both the high bulk and the low bulk is around 0.5, and the ΔSNP index is around 0 (red line).

On the other hand, the causal SNP genotype is found almost exclusively in high bulk F2 progeny.

So, the ΔSNP index of a causal SNP is close to 1. In this case, **a SNP located around 5,800 bp might be a causal genetic variant**.

We can also see several SNPs with a ΔSNP index > 0. **These SNPs might also have a positive effect on the phenotype**

(because most, though not all, high-phenotype samples have them).
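To see why a SNP unrelated to the trait is expected to sit near a SNP index of 0.5 and a ΔSNP index of 0, here is a small simulation sketch. It assumes bulks of 10 F2 plants, SNPs unlinked to any QTL, and equal DNA contribution from each plant, and it ignores read sampling and sequencing error. The function name and parameters are illustrative only.

```python
import random


def bulk_snp_index(rng, n_plants=10):
    """Mutant-allele fraction in a bulk of F2 plants at one unlinked SNP."""
    # Each F2 plant carries 2 alleles; each is mutant with probability 0.5.
    mutant_alleles = sum(rng.random() < 0.5 for _ in range(2 * n_plants))
    return mutant_alleles / (2 * n_plants)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_snps = 2000
high = [bulk_snp_index(rng) for _ in range(n_snps)]
low = [bulk_snp_index(rng) for _ in range(n_snps)]
delta = [h - l for h, l in zip(high, low)]

mean_high = sum(high) / n_snps
mean_delta = sum(delta) / n_snps
largest_delta = max(abs(d) for d in delta)
print(round(mean_high, 3), round(mean_delta, 3), round(largest_delta, 3))
```

On average the values stay close to 0.5 and 0, but individual SNPs can deviate noticeably just by chance when only 10 plants are pooled, which is one reason single SNPs should be interpreted with care.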

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/QTLseq_8.png?raw=true" alt="title" height="450px">

<br>

**This is the process of QTL-seq analysis!**

# Review of QTL-seq

Let's take a look at the whole process of QTL-seq analysis.

QTL-seq analysis is one of the methods for identifying genomic regions associated with quantitative traits (like QTL mapping).

* MutMap: Qualitative trait

* QTL-seq: Quantitative trait

To perform QTL-seq analysis...
1. Generate a segregating population, such as an F2 population, by crossing the mutated cultivar and the original cultivar.
  - By creating an F2 population, we obtain a large number of individuals with shuffled genomes.
  - Each mutant allele is expected at a frequency of about 50% in the F2 population.

1. If the quantitative trait segregates, select 10 or more samples from each phenotypic extreme (high and low).

  - The F2 individuals with the high phenotype should share the causal genetic variants.
  - Individuals with the low phenotype should **not** have these causal variants.

1. Collect bulked DNA from the selected samples for each phenotype (low bulk & high bulk) and sequence it (bulk sequencing).

1. Align the bulked sequences to the reference genome. This gives you information such as
  - The positions of the mutations
  - The number of reference and mutant bases in the high and low bulks, respectively

1. Finally, to compare the SNP-index between the high bulk and the low bulk, calculate the ΔSNP-index, which is the difference in SNP-index between the high and low bulks.
  - SNPs whose allele frequency differs greatly between the high and low bulks show a high ΔSNP-index (> 0) or a low ΔSNP-index (< 0).

Such a genetic variant is a candidate causative variant affecting the phenotype.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/QTLseq_9.png?raw=true" alt="title" height="750px">


# Introduction to the QTL-seq pipeline

In this practice session, we performed QTL-seq analysis step by step using bioinformatics tools such as bwa, samtools, and bcftools, together with our own scripts.

Our laboratory has developed a simple pipeline for conducting QTL-seq.

<img src="https://github.com/YuSugihara/QTL-seq/blob/master/images/1_logo.png?raw=true" alt="title" height="100px">

(https://github.com/YuSugihara/QTL-seq)

This pipeline is very simple to use. The only required input is sequencing data (FASTA and FASTQ files).

Just install the pipeline and run the command below. That's it!

```
# command
qtlseq -r reference.fasta -p reference.fastq -b1 high_bulked_sequences.fastq -b2 low_bulked_sequences.fastq -n1 20 -n2 20 -o output_name
```

You will then get data files containing SNP positions and SNP-index values, along with graphs that visualize them.

So, if you are interested in performing QTL-seq analysis, this pipeline is one option.

# **Sliding window analysis**

Finally, we introduce **sliding window analysis** for MutMap & QTL-seq.

In this part, we use the published MutMap results from yesterday's session to try sliding window analysis.


## **Review of Sliding Window analysis**

Unlike simulated data, real (experimental) data do not always show the theoretically expected values.

Real data usually contain **sequencing errors, outliers, human errors, etc.**

These generate **false positives.**

So, in MutMap and QTL-seq analysis **using real datasets**, the SNP index will deviate from the expected values because of these errors.

This happens especially when the amount of sequencing data or the population size is insufficient.

Sometimes it produces a big outlier, such as SNP index = 1 at an unimportant SNP position.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/Slidingwindow_1.png?raw=true" alt="title" height="250px">

### **Sliding window**

To reduce the impact of these errors, one approach is to average the SNP indexes of nearby SNPs.

Because of linkage, nearby SNPs should have similar SNP index values.

So, a sliding window diminishes the influence of errors and gives more reliable SNP index results.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/Slidingwindow_4.png?raw=true" alt="title" height="300px">

<br>

The process of sliding window analysis is...

1. Extract the data (SNP-indexes) within an interval (of the window size).
1. Calculate their average.
1. Move the interval by the step size.
1. Repeat steps 1-3 until the whole region is covered.
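
The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the implementation used later in this notebook; the function name, its arguments, and the toy positions and SNP-index values are all hypothetical.

```python
def sliding_window_mean(positions, values, window_size, step_size, region_end):
    """Average `values` whose position falls in [start, start + window_size)."""
    results = []
    start = 0
    while start < region_end:
        end = start + window_size
        # Step 1: extract the SNP indexes inside the current window.
        in_window = [v for p, v in zip(positions, values) if start <= p < end]
        # Step 2: average them (windows without any SNP are skipped).
        if in_window:
            results.append((start, end, sum(in_window) / len(in_window)))
        # Step 3: move the window by the step size, then repeat.
        start += step_size
    return results


# Hypothetical SNP positions (bp) and SNP-index values.
positions = [100, 400, 900, 1300, 1700, 2600]
snp_index = [0.5, 1.0, 0.4, 0.6, 0.5, 0.9]

windows = sliding_window_mean(positions, snp_index,
                              window_size=1000, step_size=500, region_end=3000)
for start, end, mean in windows:
    print(f"{start}-{end}: {mean:.2f}")
```

In this toy data, the isolated SNP index of 1.0 at 400 bp is pulled down by its neighbours once the values are averaged, which is the smoothing effect described above.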

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/Slidingwindow_2.png?raw=true" alt="title" height="400px">

Let's see how much it differs from the original SNP index, **after sliding window analysis.**

## Example

In this section, we perform sliding window analysis using published data.

As an example, we use the published SNP index results of a MutMap analysis [(Abe et al., 2012)](https://www.nature.com/articles/nbt.2095).

This paper aimed to identify genetic variants associated with leaf color (green to light green).

<img src="https://github.com/slt666666/FAO_lecture/blob/main/material.png?raw=true" alt="title" height="400px">

### SNP index

First, we visualize the original SNP index values using the code below.

In [None]:
import pandas as pd
from modules import visualize_SNP_index

!wget -O published_Mutmap.csv https://raw.githubusercontent.com/slt666666/FAO_lecture/main/FAO_2024/data/published_Mutmap.csv 2>/dev/null
SNP_index = pd.read_csv("published_Mutmap.csv")
visualize_SNP_index(SNP_index)

Multiple SNPs show a high SNP index.

From this plot, it is difficult to distinguish true positives from false positives.

So, to extract more reliable regions, we apply sliding window analysis to these results.

## Sliding window analysis & visualization

We perform sliding window analysis with our script.

```
The window size & step size for sliding window analysis can be set.
At first, we set window size = 1 Mbp & step size = 200 kbp
(considering that the length of chromosome 10 is 23,207,287 bp.)
```

In [None]:
from modules import sliding_window
average_SNP_index = sliding_window(SNP_index, window_size=1000000, step_size = 200000)

The red line shows the average SNP-index for each interval.

The plot shows that the right end of chromosome 10 (22~23 Mbp) has a high average SNP index.

This result suggests that a gene related to leaf color is located at 22~23 Mbp on chromosome 10.

This region contains the gene *OsCAO1*, which encodes chlorophyllide a oxygenase.

A knock-out line of *OsCAO1* is known to show light green leaves (Abe et al., 2012).

<br>

Of course, even after applying sliding window analysis, the results may still contain false positives.

In that case, we have to consider other ways to select candidate SNPs, such as using biological information about the SNP position (exon, important gene region, etc.) or increasing the population size or the amount of sequence data.


## Our MutMap/QTL-seq analysis pipelines include sliding window analysis

In this practice, we performed sliding window analysis with our own scripts.

But the pipelines introduced earlier also include sliding window analysis.

You can specify the window size & step size in these pipelines.

- MutMap
https://github.com/YuSugihara/MutMap#Usage

- QTL-seq
https://github.com/YuSugihara/QTL-seq#usage

## **Example: QTL-seq analysis of published data**

Sliding window analysis is also important for QTL-seq analysis.

For example, [Takagi et al., 2013](https://doi.org/10.1111/tpj.12105) identified genetic variants associated with disease resistance.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/FAO_2024/Slidingwindow_3.png?raw=true" alt="title" height="650px">

The original ΔSNP index plot shows that multiple SNPs have a high ΔSNP index (blue dots).

But after sliding window analysis, the causal genomic regions were successfully identified (red line).


---
## Summary

In this notebook, we demonstrated **QTL-seq** analysis using simulated data to understand its process.

We also introduced **sliding window** analysis, which is required to identify reliable causal genetic variants.

<br>

The important point is how to interpret the results (SNP-index values).

We should guard against false positives, errors, etc. by applying sliding window analysis, checking the read depth of SNPs of interest, and so on.

<br>

If you want to conduct QTL-seq analysis, the pipeline developed by our group is available (https://github.com/YuSugihara/QTL-seq).

In the next session, we'll introduce **Genomic Prediction**.