<img src="https://github.com/slt666666/FAO_lecture/blob/main/title.png?raw=true" alt="title" height="300px">


# MutMap - practice -

In this practice session, we will perform MutMap analysis on simulated data and on published data.

It should help you understand:
* The process of MutMap analysis
* How to interpret the results of MutMap
* What data is required for MutMap

## The contents in this notebook ... 

* What is Google Colab (What is this platform)
  * How to use Google Colab
* Review of MutMap

* 1st practice - MutMap analysis for simulation data -
  * We assume a very simple organism and perform MutMap analysis to understand the process of MutMap.
* Introduction of MutMap pipeline (GitHub)
* 2nd practice - MutMap analysis for published sample data -
  * We use published data from the MutMap paper [(Abe et al., 2012)](https://www.nature.com/articles/nbt.2095) and perform MutMap analysis.

# Main contents

# What is Google Colab (What is this platform)

You are now accessing the Google Colaboratory (Google Colab) server through a web browser such as Google Chrome, Firefox, or Safari.

Google Colab is an online service that lets you view, edit, and run programs on a Google server (the Colab server).

So, you don't need to use your local PC to run the program!

<img src="https://github.com/slt666666/FAO_lecture/blob/main/colab.png?raw=true" alt="colab" height="200px">

In Google Colab, this page (file) is called a "Notebook". We use this Notebook to edit, view, and run programs.

In addition, you can access and use notebooks that other people have made.

So, with Google Colab, you can experience programming without setting up your own PC and without writing a program yourself!

In our lecture, we will use our notebooks to experience MutMap analysis, QTL-seq analysis (tomorrow), and genomic analysis (the day after tomorrow).

**We keep these notebooks on the server, so you can access them at any time, even after this lecture.**

## How to use Google Colab

In the notebook, there are 2 types of cells.

A cell that shows text and figures is a text cell.

A cell that shows code (like the cells below) is a code cell.

In a Google Colab notebook, you can write code (a program) in a code cell and run it there.

```
How to run the program

  There are several ways to run the program:
  - Click the "▶︎" button at the top left of the code cell.
  - Select the code cell and press the "Shift + Enter" keys.
  - [for Windows] Select the code cell and press the "Ctrl + Enter" keys.
  - [for Mac] Select the code cell and press the "control + Enter" keys.
```

Please try to run (and edit) the code below.

In [None]:
# sample code

11 + 100

In [None]:
# sample code2

for i in range(10):
  print("{}x{}x{} = ".format(i,i,i), i*i*i)

Google Colab has many other useful functions, such as saving and exporting notebooks and providing environments for machine learning.

https://colab.research.google.com/

Unfortunately, our time is limited, so we cannot cover these other functions or teach programming itself.

Our focus is on genomic analysis such as MutMap and QTL-seq.

So, in this lecture, you will mainly run the programs that we developed, through our notebooks, to experience MutMap, QTL-seq, and other analyses, and you will edit some parts of them.

# Review of MutMap

  MutMap analysis is one of the methods to identify the genomic region that is associated with a phenotype (like QTL mapping).

  The brief process of MutMap is as follows:
1. Create a mutant line by introducing mutations in the original line.
  - Several mutations are introduced into the genome of the mutant line, one of which is the causative mutation that alters the trait.
1. Cross the mutant line with the original line, then self the F1 to create a second generation (F2 population).
  - By creating an F2 population, we obtain a large number of individuals with shuffled genomes.
  - At a mutation that is not linked to the trait, about 50% of the alleles in the F2 population are expected to be the mutant type.
  - The F2 individuals with the mutant trait should share certain mutations in the genome.
1. Collect DNA from the F2 individuals with the mutant trait (bulk DNA) and perform next-generation sequencing. The original line is sequenced at the same time.
1. Compare the original line and the bulk DNA sequence to obtain the following information:
  - The position where the mutation was introduced
  - The type of base in the original line (reference base) and the base of the mutation
  - The number of reference and mutant bases in the bulk DNA sequence
1. Calculate the percentage of mutant bases in the bulk DNA sequence (**SNP-index**) across the genome to identify genomic regions where only mutant bases are found in the bulk DNA (causal genomic regions).
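
Written as a formula, the SNP-index of the last step at a single position is the fraction of reads that carry the mutant base. The symbols $n_{\mathrm{ref}}$ and $n_{\mathrm{mut}}$ are notation introduced here for the numbers of reference and mutant bases counted in the bulk DNA at that position:

$$
\text{SNP-index} = \frac{n_{\mathrm{mut}}}{n_{\mathrm{ref}} + n_{\mathrm{mut}}}
$$

For example, a position covered by 2 reads with the reference base and 8 reads with the mutant base has SNP-index $= 8 / (2 + 8) = 0.8$. A position where only mutant bases are found has SNP-index $= 1$.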

<img src="https://github.com/slt666666/FAO_lecture/blob/main/MutMap.png?raw=true" alt="title" height="600px">

In this practice lecture, we will go through each step using simulation data to understand the process of MutMap.

Then, we will try MutMap analysis using some of the published data.

# 1st practice: MutMap analysis using very simple simulation data

!! Please run the code below. It downloads the programs and other files used in this notebook !!


In [None]:
# Prepare modules & packages
!wget -O module_mutmap.py https://github.com/slt666666/FAO_lecture/blob/main/module_mutmap.py?raw=true
import pandas as pd
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 200)
from module_mutmap import make_reference_and_mutant
from module_mutmap import cross_reference_and_mutant
from module_mutmap import bulk_sequencing
from module_mutmap import alignment
from module_mutmap import calculate_SNP_index
from module_mutmap import visualize_SNP_index
from module_mutmap import check_results
from module_mutmap import MutMap_simulation
from module_mutmap import load_data
from module_mutmap import calculate_SNP_index2
from module_mutmap import visualize_SNP_index2

## Simulation setting:

In this practice, we assume...

* **A plant that has only 1 chromosome of 100 bp.**

* **We introduce 40 SNPs into this plant by EMS mutagenesis to make a mutant line.**

* **A mutation at 1 of the 40 SNPs changes the leaf color from green to light green.**

    (But we don't know which SNP is causative.)

<img src="https://github.com/slt666666/FAO_lecture/blob/main/simulation2.png?raw=true" alt="title" height="250px">

Given this mutant line, how can we identify the causative SNP that alters leaf color?

Let's try to perform MutMap analysis !!

## 1. Cross reference and mutant line.

As the first step of MutMap analysis, we'll generate F2 progeny from the reference line and the mutant line.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/simulation3.png?raw=true" alt="title" height="300px">

Please run the code below to make the F2 population !

The code below generates an F2 population of 200 individuals from the reference and mutant lines.


In [None]:
reference, mutant = make_reference_and_mutant(length=100, mutation=40)
progeny = cross_reference_and_mutant(reference, mutant, progeny=200)


```
※ Memo 
The above program generates the genotype of each progeny as a random mixture of the reference and mutant lines.
Then, the program decides the phenotype based on this genotype.
The program stores the genotype and phenotype data of the simulated population in the background.
Therefore, we can check whether the MutMap result is correct by comparing it with the simulated data.

```
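
As a rough illustration of what "randomly mixed" can mean, the sketch below builds F2 individuals from F1 gametes. It is not the code inside `module_mutmap`: the names `make_gamete` and `make_f2`, the single-crossover model, the causative position 30, and the recessive trait are all assumptions made for this example.

```python
import random

def make_gamete(length, rng):
    """Return one F1 gamete as a list of origins: 0 = reference, 1 = mutant.

    Assumption: exactly one crossover per gamete at a random position,
    which is a strong simplification of real meiosis.
    """
    crossover = rng.randint(0, length)   # position where the haplotype switches
    first = rng.choice([0, 1])           # which parental haplotype comes first
    return [first if pos < crossover else 1 - first for pos in range(length)]

def make_f2(length, rng):
    """An F2 individual receives two independent gametes from the selfed F1."""
    return make_gamete(length, rng), make_gamete(length, rng)

rng = random.Random(0)
causal_pos = 30                          # hypothetical causative position
population = [make_f2(100, rng) for _ in range(200)]

# Assumed recessive trait: the mutant phenotype needs the mutant allele on both copies.
mutants = [ind for ind in population
           if ind[0][causal_pos] == 1 and ind[1][causal_pos] == 1]
print(len(mutants), "of", len(population), "F2 individuals show the mutant phenotype")
```

Under these assumptions, roughly a quarter of the F2 individuals are expected to show the mutant phenotype, and every one of them carries only mutant alleles at the causative position.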

## 2. Bulk sequencing of mutant-phenotype lines

The 2nd step is bulk sequencing of the F2 lines that show the mutant phenotype.

Samples with the mutant phenotype (light green leaves) should carry the causative SNP.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/simulation4.png?raw=true" alt="title" height="200px">

Please run the code below to do the sequencing !

The code below generates 20 bp reads of the bulk DNA (the default is 200 reads).

```
※ Memo
The program below generates 20 bp reads because this is a simulation.
Real sequence reads are usually 150 bp to 300 bp long.
```

In [None]:
reads = bulk_sequencing(progeny, read=200)

The above program performs sequencing of the bulk DNA, and the sequencing result (a fastq file) is saved on the Colab server.

You can check this fastq file using the file browser in Google Colab ↓

```
How to check files in your Google Colab server space.
1. Click the file icon in the upper left.
2. The file list shows the files in your server space.
3. (If "bulked_sequences.fastq" is not listed, please click the third icon from the right.)
```
<img src="https://github.com/slt666666/FAO_lecture/blob/main/filesystem.png?raw=true" alt="title" height="250px">
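
A FASTQ file stores each read as four lines: an identifier starting with `@`, the bases, a `+` separator, and one quality character per base. The sketch below parses that layout. The function name `parse_fastq` and the example record are made up for illustration; to look at the simulated reads in Colab, you could pass `open("bulked_sequences.fastq")` instead of the in-memory example, assuming the file follows this plain four-line layout.

```python
import io

def parse_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:                       # empty line or end of file: stop
            break
        sequence = handle.readline().strip()
        handle.readline()                    # '+' separator line, ignored
        quality = handle.readline().strip()
        yield header[1:], sequence, quality  # drop the leading '@'

# Made-up 20 bp record, the same length as the simulated reads.
example_text = "@read_1\n" + "ACGT" * 5 + "\n+\n" + "I" * 20 + "\n"
records = list(parse_fastq(io.StringIO(example_text)))
print(records[0])
```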

## 3. Align sequence reads to the reference genome.

After sequencing the bulked DNA, we align the sequence reads to the reference genome to identify SNP positions.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/alignment.png?raw=true" alt="title" height="250px">

Please run the code below to perform the alignment !

The code below aligns the reads to the reference and shows the results.

In [None]:
alignment_result = alignment(reads, reference)
alignment_result


```
※ Memo
In this notebook, we align the reads with our own program.
In practice, alignment is usually done with a mapping tool such as BWA(http://bio-bwa.sourceforge.net/).
So, if you conduct MutMap analysis on your own data, you may need to use a mapping tool.
However, the MutMap pipeline that we will introduce later includes the mapping tools, so you don't need to worry about this.
```
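
To give a feeling for what alignment does, here is a minimal sketch that places a read at the reference position with the fewest mismatches. It is a toy approach for short sequences, not the algorithm used by BWA or by `alignment()` in this notebook; the function name `align_read` and the sequences are made up for the example.

```python
def align_read(read, reference):
    """Return (start, mismatches) of the best ungapped placement of read on reference."""
    best_start, best_mismatches = 0, len(read) + 1
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches < best_mismatches:     # keep the first placement with the fewest mismatches
            best_start, best_mismatches = start, mismatches
    return best_start, best_mismatches

reference = "ACGTTGCAAGGCTTAACCGGATCGATTACA"         # made-up 30 bp reference
# Copy of reference[8:18] with the base at position 14 changed from A to T (a "SNP").
read = reference[8:14] + "T" + reference[15:18]
start, mismatches = align_read(read, reference)
print(start, mismatches)
```

In this example the read is placed at position 8 with one mismatch, and that mismatch is the candidate SNP. Real mapping tools use indexed searches and also handle gaps, which this sketch does not.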

## 4. Calculate SNP-index based on alignment results

After the alignment, the next step is to calculate the SNP-index.

The SNP-index is the proportion of reads that carry the mutant base at each genomic position.

If the SNP-index is close to 1, almost all lines with the mutant phenotype have a genotype different from the reference at this position (= this SNP might be the causative SNP).

<img src="https://github.com/slt666666/FAO_lecture/blob/main/SNP_index.png?raw=true" alt="title" height="300px">

Please run the code below to calculate the SNP-index based on the alignment results !!

In [None]:
SNP_index = calculate_SNP_index(alignment_result, reference, mutant)
SNP_index
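
The calculation in this step can be sketched for a single position as follows. This is a simplified illustration, not the code of `calculate_SNP_index`: the function name `snp_index_at`, the `(start, sequence)` read format, and the example sequences are assumptions, and every non-reference base is counted as a mutant base.

```python
from collections import Counter

def snp_index_at(position, reference, aligned_reads):
    """SNP-index = reads with a non-reference base / all reads covering the position.

    aligned_reads is a list of (start, sequence) pairs, 0-based and without gaps.
    """
    counts = Counter()
    for start, seq in aligned_reads:
        if start <= position < start + len(seq):
            counts[seq[position - start]] += 1   # base this read shows at the position
    depth = sum(counts.values())
    if depth == 0:
        return None                              # position not covered by any read
    return (depth - counts[reference[position]]) / depth

reference = "ACGTACGTAC"                         # made-up reference, base 'A' at position 4
aligned_reads = [(0, "ACGTTC"), (2, "GTTCGT"), (3, "TTCGTA"), (4, "ACGTAC")]
print(snp_index_at(4, reference, aligned_reads))
```

Here four reads cover position 4: three show `T` and one shows the reference base `A`, so the SNP-index is 3 / 4 = 0.75.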

## 5. Visualize SNP index plot

After calculating the SNP index, we visualize it to search for the causative position.

Please run the code below to visualize the SNP index!


In [None]:
visualize_SNP_index(SNP_index)

# Check analysis results & real genotype

MutMap analysis shows the candidate position of the causative SNP (the position where the SNP index is 1).

The code below shows the genotypes of some of the F2 progeny that the simulation program generated.

Try to check whether the MutMap result is correct !!

```
※ This is a simulation, so the genotypes of all progeny are saved in the background.
```

In [None]:
check_results(reference, SNP_index)

In this simulation, the "Light Green" phenotype is controlled by a single SNP mutation.

MutMap analysis succeeded in identifying this position using only the reference FASTA and the bulked FASTQ!!

<img src="https://github.com/slt666666/FAO_lecture/blob/main/causative.png?raw=true" alt="title" height="200px">

# Play with simulation!

You can specify
- the length of reference genome
- the number of mutations
- the number of progeny
- the number of reads

Please set up different situations and perform MutMap analysis using the code below.

```
※ If you specify a large length (such as 10000 or more), the analysis may take a long time.
```

In [None]:
MutMap_simulation(length=100, mutation=20, progeny=200, read=10)
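
When you change these settings, it helps to estimate the average sequencing depth, that is, how many reads cover each base. The sketch below assumes that `MutMap_simulation` also uses the 20 bp reads mentioned in step 2; the function name `expected_depth` is made up for this example.

```python
def expected_depth(n_reads, read_length, genome_length):
    """Average number of reads covering each base (edge effects ignored)."""
    return n_reads * read_length / genome_length

# Assumed read length: 20 bp, as in the simulated sequencing of step 2.
for n_reads in (10, 50, 200):
    print(n_reads, "reads ->", expected_depth(n_reads, 20, 100), "x average depth")
```

With `length=100` and `read=10`, the average depth is only 2. At a position covered by 2 reads the SNP-index can only be 0, 0.5, or 1, so SNPs that are not causative can easily reach an SNP-index of 1 by chance.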

# Introduction of MutMap pipeline

Our laboratory and team developed a very simple pipeline to perform MutMap.

<img src="https://github.com/YuSugihara/MutMap/blob/master/images/1_logo.png?raw=true" alt="title" height="100px">

(https://github.com/YuSugihara/MutMap)

This pipeline is very simple to use.

The required input data are:
* Reference fasta file
* Reference sequence reads fastq file
* Bulk DNA sequence reads fastq file

You just install the pipeline and run the command below. That's it !

```
# command
mutmap -r reference_sequences.fasta -c reference.fastq -b bulked_sequences.fastq -n 20 -o output_name
```

<br>

```
e.g. The publicly available rice reference genome is from the cultivar Nipponbare.
If you want to use a different cultivar, such as Hitomebore, as the reference, its sequence reads (reference.fastq) are required.
```


# 2nd practice: MutMap analysis using published data

In this part, we use published data (Abe et al., 2012) to see a real SNP index result.


## Materials

We used the rice cultivar "Hitomebore" as the reference.

The mutant line "Hit1917-pl1" was generated by mutagenesis and showed a light green leaf color (low chlorophyll content).

Hit1917-pl1 has almost 1,500 SNPs that differ from the reference genotype.


## Methods

To identify which SNP is causative for the change in leaf color, we conducted MutMap analysis.

We generated over 200 F2 progeny by crossing the reference and mutant lines.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/material.png?raw=true" alt="title" height="600px">

## Bulked sequencing & alignment of mutant phenotype F2 lines

Then, we conducted bulked sequencing of the progeny that showed the mutant phenotype.

(For reference, the sequence data are available under accession [DRA000499](https://www.ncbi.nlm.nih.gov/sra?term=DRA000499).)

After that, we aligned these sequences and calculated the SNP index.

In this notebook, we use [table data](https://raw.githubusercontent.com/CropEvol/lecture/master/data/mutmap_chr10.txt) that contains the alignment results (the SNP information) for chromosome 10.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/sample_data.png?raw=true" alt="title" height="450px">



Please run the code below to download the table data (alignment & SNP calling results) !!

In [None]:
# download and load alignment results of chr10
!wget -O mutmap_dataset.txt https://raw.githubusercontent.com/CropEvol/lecture/master/data/mutmap_chr10.txt
alignment_results = load_data()

print("Data looks like ...")
alignment_results.head()

## Calculate SNP-index

Based on the information in this table, we calculate the SNP-index.

<img src="https://github.com/slt666666/FAO_lecture/blob/main/calculation.png?raw=true" alt="mutmap_analysis_calc_snpindex" height="200px">

Please run the code below to calculate the SNP-index !!

In [None]:
SNP_index = calculate_SNP_index2(alignment_results)
SNP_index

## Visualize SNP index plot

We visualize the calculated SNP index to find the causative position on chromosome 10.

Please run the code below to visualize the SNP-index on chromosome 10 !!

In [None]:
visualize_SNP_index2(SNP_index)

The red circles show SNPs with an SNP-index over 0.8; these might be causative positions.

However, there are several regions that show SNP-index > 0.8.

**Does this mean that several genes affect leaf color? Probably not.**

Some of them may be caused by low depth, sequencing errors, and so on (false positives).
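
The effect of low depth can be estimated with a small calculation. The sketch below assumes that, at a SNP unlinked to the trait, each read independently carries the mutant base with probability 0.5, and it ignores sequencing errors; the function name `prob_index_above` is made up for this example.

```python
from math import comb

def prob_index_above(depth, threshold=0.8, p_mutant=0.5):
    """P(SNP-index > threshold) at an unlinked SNP covered by `depth` reads.

    Assumes each read independently shows the mutant base with probability p_mutant.
    """
    return sum(
        comb(depth, k) * p_mutant**k * (1 - p_mutant)**(depth - k)
        for k in range(depth + 1)            # k = number of reads with the mutant base
        if k / depth > threshold
    )

for depth in (4, 10, 30):
    print(depth, round(prob_index_above(depth), 4))
```

At depth 4, all four reads must be mutant to exceed 0.8, which happens with probability 0.5^4 = 0.0625. With about 1,500 SNPs in the genome, even a chance of a few percent per SNP produces many false positives, and the chance drops quickly as depth increases.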

So, we need to reduce false positives.

**Sliding window** analysis is one of the approaches to reduce these false positives.

We will try sliding window analysis in tomorrow's practice !
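
As a preview, the idea of a sliding window can be sketched as follows: instead of looking at single SNPs, we average the SNP-index of all SNPs inside a window and move the window along the chromosome. The function name `sliding_window_mean`, the window size, the step, and the data are made up for illustration; they are not the settings used tomorrow or in the MutMap pipeline.

```python
def sliding_window_mean(positions, snp_indices, window_size, step):
    """Average SNP-index of all SNPs inside each window [start, start + window_size)."""
    results = []
    start = 0
    last = max(positions)
    while start <= last:
        inside = [s for p, s in zip(positions, snp_indices)
                  if start <= p < start + window_size]
        if inside:                           # skip windows that contain no SNP
            results.append((start, sum(inside) / len(inside)))
        start += step
    return results

# Made-up data: one isolated high value (position 150) and a cluster around 600-700.
positions   = [100, 150, 200, 600, 650, 700]
snp_indices = [0.4, 1.0, 0.5, 0.9, 1.0, 1.0]
for start, mean in sliding_window_mean(positions, snp_indices, window_size=200, step=100):
    print(start, round(mean, 2))
```

In this toy data, the isolated SNP with SNP-index 1.0 is averaged down to 0.7 or less by its neighbors, while the windows over the cluster stay at 0.95 or higher.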

---
## Summary

In this notebook, we demonstrated **MutMap** analysis using simulation data to understand the process of MutMap analysis.

We also checked the MutMap result for published data to learn how to interpret the results of MutMap analysis.
   
If you want to conduct MutMap analysis yourself, the pipeline that our group developed is available.
(https://github.com/YuSugihara/MutMap)
   
Tomorrow, we'll demonstrate **QTL-seq** analysis and **Sliding window** analysis for MutMap & QTL-seq.
