# Introduction to Large Data Analysis (3)
Simulation and evaluation of the observed data

## Contents
### Introduction
- [About previous practice](#0.1)
- [About sample data](#0.2)
- [About this practice](#0.3)

### Practice
1. [Generating one SNP-index by simulation](#1.1)
1. [Generating 10 SNP-indexes by simulation](#1.2)
1. [Generating 10,000 SNP-indexes by simulation](#1.3)
1. [Is the observed SNP-index frequently obtained under null hypothesis?](#1.4)
1. [Identification of genomic region including causal gene](#1.5)

---
## About previous practice<a name="0.1"></a>

Looking back on [the previous practice](../07_large_data_analysis/02_large_data_analysis_en.ipynb):
- Loading a tab-delimited text file as a pandas dataframe
    - ```python
        import pandas as pd
        df = pd.read_csv(<File name>, sep='\t', header=<header line No.>, names=[List of column names])
        ```
- Accessing arbitrary data
    - `df[column name]` => Extracting the data of one column
    - `df.loc[row names, column names]` => Extracting rows and/or columns by label
    - `df.iloc[row No., column No.]` => Extracting rows and/or columns by integer position
- Calculating element by element between two columns and adding the result to the dataframe as a new column
    - `df[column C] = df[column A] + df[column B]`
- Extracting a subset of the dataframe by condition
    - `df[ (condition) ]`
    - AND condition (`&`), OR condition (`|`)
    - ex. `df[ (df['snp_index'] >= 0.7) & (df['snp_index'] < 0.9) ]`
- Drawing a scatter plot
    - Drawing the graph by using the "Matplotlib" library
    - The graph is composed of several layers.  
    - ```python
        %matplotlib inline
        import matplotlib.pyplot as plt  # preparing matplotlib
        
        fig = plt.figure()  # graph field
        plt.scatter([data of x-axis], [data of y-axis])  # scatter plot
        plt.title('title of the graph')
        plt.xlabel('label of x-axis')
        plt.ylabel('label of y-axis')
        ```
- Sliding window analysis
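
The column calculation and the conditional extraction recapped above can be tried on a small made-up table. This is only a sketch: the dataframe `toy` and its values are invented for illustration and are not taken from the MutMap data.

```python
import pandas as pd

# Toy data (hypothetical values, not from the MutMap file)
toy = pd.DataFrame({
    'pos':   [100, 200, 300, 400],
    'ref_N': [6, 2, 1, 5],
    'alt_N': [3, 8, 9, 5],
})

# New column calculated element by element from two existing columns
toy['snp_index'] = toy['alt_N'] / (toy['ref_N'] + toy['alt_N'])

# Subset by condition (AND of two conditions)
subset = toy[(toy['snp_index'] >= 0.7) & (toy['snp_index'] < 0.9)]
print(subset)
```

Only the row with `pos` 200 (SNP-index 0.8) satisfies both conditions, because the upper bound `< 0.9` excludes the row whose SNP-index is exactly 0.9.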

## About sample data<a name="0.2"></a>

We use the MutMap data file (Abe et al., 2012), the same one as in the previous practice.

Please run the cell below once. It will:
- Download the sample data file
- Load the file as a pandas dataframe
- Calculate the SNP-index and add it as a new column

In [None]:
"""
*** IMPORTANT ***
Run this cell before this practice.
"""
!wget -q https://raw.githubusercontent.com/CropEvol/lecture/master/data/mutmap_bulk.txt -O mutmap_bulk.txt
    
#--- Preparing pandas  ---
import pandas as pd

#--- Loading file ---
dataset = 'mutmap_bulk.txt'
# The file has no header line, so header=None
df = pd.read_csv(dataset, sep="\t", header=None, names=['chr', 'pos', 'ref_nucl', 'alt_nucl', 'ref_N', 'alt_N'])

#--- Calculating SNP-index & Adding new column---
df['snp_index'] = df['alt_N'] / (df['ref_N'] + df['alt_N'])

#--- Show (only the first 10 rows) ---
df.head(10)

## About this practice<a name="0.3"></a>

To identify the genomic region containing the causal gene for the mutant phenotype (pale green leaf), we simulate SNP-indexes under the null hypothesis and evaluate the observed SNP-indexes against the simulated data.

<div style="margin-bottom: 5px;"><img src="https://github.com/CropEvol/lecture/blob/master/images/08/simulation01_en.jpg?raw=true" alt="simulation"></div>

---
## Practice
We use only one data point (the first row) in Practices 1-4.

First row => `chr10	51406	G	A	6	3	0.333333`

In [None]:
# Extracting first row (0th row)
onedata = df.iloc[0,:]

# Storing the SNP-index of this row in the variable "mysnpindex"
# (This variable is used in Practice 4)
mysnpindex = onedata['snp_index']

# Show
onedata

In the sample data, nine nucleotides from the bulked NGS reads of F2 individuals cover position 51,406 bp on chromosome 10.
Of these nine nucleotides, six are `G`, derived from the original line, and three are `A`, derived from the mutant line.

<div style="margin-bottom: 5px;"><img src="https://github.com/CropEvol/lecture/blob/master/images/08/simulation02_en.jpg?raw=true" alt="simulation"></div>

__Question: What is the probability of obtaining six "G" and three "A" from nine nucleotides?__  
(Condition: The two nucleotides `G` and `A` are each obtained with a 50% chance.)

### 1. Generating one SNP-index by simulation<a name="1.1"></a>

In [None]:
import numpy as np  # NumPy library

### Total number of alleles
total_allele_num = 9

### Simulating how many mutant alleles are selected.
### The number of selected mutant alleles follows the binomial distribution. 

simu_mut_num = np.random.binomial(n=total_allele_num, p=0.5)
print(simu_mut_num)


### Calculating SNP-index from the number of mutant alleles obtained by simulation.
### Formula: number of mutant alleles / total number of alleles

#simu_snp_index = simu_mut_num / total_allele_num
#print(simu_snp_index)
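
Because the simulation is random, the printed number changes every time the cell is run. When a repeatable result is wanted, one option is NumPy's `Generator` interface with a fixed seed. The sketch below is an optional variation, not part of the original practice; the seed value `0` and the variable name `rng` are arbitrary choices.

```python
import numpy as np

total_allele_num = 9

# Seeded random number generator (the seed 0 is an arbitrary choice)
rng = np.random.default_rng(0)

# Number of mutant alleles among 9, each selected with probability 0.5
simu_mut_num = rng.binomial(n=total_allele_num, p=0.5)

# SNP-index = number of mutant alleles / total number of alleles
simu_snp_index = simu_mut_num / total_allele_num
print(simu_mut_num, simu_snp_index)
```

Creating the generator again with the same seed reproduces the same draw, which helps when comparing results between runs.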

### 2. Generating 10 SNP-indexes by simulation<a name="1.2"></a>

In [None]:
"""
Method 1: using "for" (Bad method)
"""

import numpy as np  # NumPy library

### Total number of alleles
total_allele_num = 9

### Preparing list for simulated SNP-indexes and number of mutant alleles
simu_mut_nums = []
simu_snp_indexes = []

### Repeating 10 times
for i in range(10):
    
    ### Simulating how many mutant alleles are selected.
    simu_mut_num = np.random.binomial(n=total_allele_num, p=0.5)
    
    ### Calculating SNP-index
    simu_snp_index = simu_mut_num / total_allele_num
    
    ### Adding value to list
    simu_mut_nums.append(simu_mut_num)
    simu_snp_indexes.append(simu_snp_index)

### Show
print(simu_mut_nums)
print(simu_snp_indexes)

In [None]:
"""
Method 2: using the "size" argument of numpy.random.binomial (Good method)
"""

import numpy as np  # NumPy library

### Total number of alleles
total_allele_num = 9

### Simulating how many mutant alleles are selected. (10 times)
simu_mut_num = np.random.binomial(n=total_allele_num, p=0.5, size=10)
print(simu_mut_num) # Show

### Calculating SNP-index
simu_snp_index = simu_mut_num / total_allele_num
print(simu_snp_index) # Show

#### Appendix: Data types "list" and "NumPy array"
- Arithmetic between two NumPy arrays is applied element by element.
- Arithmetic operators on lists do not work element by element (`+` joins lists, `*` repeats a list).

In [None]:
# Check data type & Difference between LIST and  NumPy array

import numpy as np

a = [1,2,3,4]
b = np.array([1,2,3,4])

# Checking data type by using the function "type()"
print('a:', a, '  data-type: ', type(a))
print('b:', b, '  data-type: ', type(b))

print('---')
print(a + a) # joining the two lists
print(b + b) # adding element by element

print('---')
print(a * 5) # repeating the list five times
print(b * 5) # multiplying each value by 5

### 3. Generating 10,000 SNP-indexes by simulation<a name="1.3"></a>

To generate 10,000 SNP-indexes, please modify the program below.

In [None]:
import numpy as np  # NumPy library

### Total number of alleles
total_allele_num = 9

### Simulating how many mutant alleles are selected. 
simu_mut_num = np.random.binomial(n=total_allele_num, p=0.5, size=10)
print(simu_mut_num) # Show

### Calculating SNP-index
simu_snp_index = simu_mut_num / total_allele_num
print(simu_snp_index) # Show

Let's draw a histogram of the 10,000 simulated SNP-indexes. 

In [None]:
# Preparing matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# Graph field
fig = plt.figure(figsize=[9,4])    

# Drawing histogram
plt.hist(simu_snp_index, bins=10, range=[0.0,1.0], rwidth=0.8)
        # bins: Specifying the number of bars
        # range: Specifying the min and max values of the x-axis
        # rwidth: Specifying the relative width of each bar

# title, x-label, y-label
plt.title('Histogram of 10,000 simulated SNP-indexes', fontsize=16)
plt.xlabel('SNP-index', fontsize=16)  
plt.ylabel('Counts', fontsize=16)

How many values are in the bar that includes SNP-index=0.333?

In [None]:
# Extracting values in the bar of  0.3-0.4 range
bar_vals = simu_snp_index[ (simu_snp_index >= 0.3) & (simu_snp_index < 0.4)]
print(bar_vals)

# Counting values and calculating the frequency
bar_counts = len(bar_vals)
print(bar_counts)
print(bar_counts / len(simu_snp_index))  # frequency among all simulated values

Question: What is the probability of obtaining six "G" and three "A" from nine nucleotides?

Answer: The probability is approximately equal to the frequency calculated in the cell above.
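
With nine nucleotides, the only SNP-index that falls in the 0.3-0.4 bar is 3/9, so the same frequency can also be counted directly from the simulated numbers of mutant alleles. The following is a minimal sketch of that idea; the seeded generator `rng` and the names `n_hits` and `frequency` are illustrative assumptions.

```python
import numpy as np

total_allele_num = 9
trials = 10000

rng = np.random.default_rng(1)  # arbitrary seed, for repeatability only
simu_mut_num = rng.binomial(n=total_allele_num, p=0.5, size=trials)

# Counting the trials in which exactly three mutant alleles were selected
n_hits = np.count_nonzero(simu_mut_num == 3)
frequency = n_hits / trials
print(n_hits, frequency)
```

Counting on the integer values avoids comparing floating-point SNP-indexes against the bar boundaries.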

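#### Appendix: Obtaining the probability from the binomial distribution formula
We can also obtain the exact probability from the probability mass function of the binomial distribution, where $n$ is the number of nucleotides, $k$ is the number of mutant alleles, and $p$ is the chance of selecting a mutant allele (here $n=9$, $k=3$, $p=0.5$).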

$$
P[X=k] = {}_n\mathrm{C}_k p^k (1-p)^{n-k} = \dfrac {n!}{k! (n-k)!} p^k (1-p)^{n-k} 
$$


In [None]:
import math

# n = 9 nucleotides, k = 3 mutant alleles, p = 0.5
P = math.factorial(9) / (math.factorial(3) * math.factorial(6)) * 0.5**3 * 0.5**6

P
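
The same formula can be wrapped in a small function to list the probability of every possible SNP-index at this locus. This is a sketch only: `binomial_pmf` is a hypothetical helper name, and `math.comb` assumes Python 3.8 or later.

```python
import math

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k mutant alleles among n (hypothetical helper)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n = 9
pmf = [binomial_pmf(k, n) for k in range(n + 1)]

# One line per possible number of mutant alleles
for k, prob in enumerate(pmf):
    print(f'k={k}  SNP-index={k / n:.3f}  P={prob:.4f}')

print('sum:', sum(pmf))
```

For k = 3 the value is 84/512, about 0.164, and the ten probabilities sum to 1.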

### 4. Is the observed SNP-index frequently obtained under null hypothesis?<a name="1.4"></a>

In this simulation, we generate SNP-indexes under the null hypothesis.  
```text
The null hypothesis is that the SNP locus is not linked to the mutant phenotype.
```

If an observed SNP-index is rarely obtained under the null hypothesis, we can reject the null hypothesis.
In that case, the SNP locus is probably linked to the mutant phenotype.

In Practice 3, we found that `SNP-index=0.333` is generated with a probability of about 16%.

Here, we check whether the observed SNP-index falls within the top 5% of the simulated distribution.

#### Procedure of this analysis
1. Sort the 10,000 simulated values.
1. Get the value at the 95% position (the 95th percentile).
1. Compare this value with the observed value.
1. If the observed value is greater than the value at the 95% position, we reject the null hypothesis.  
    However, if the observed value is NOT greater than the value at the 95% position, we retain (do not reject) the null hypothesis.  
    
<div style="margin-bottom: 5px;"><img src="https://github.com/CropEvol/lecture/blob/master/images/08/simulation03_en.jpg?raw=true" alt="simulation"></div>
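
The procedure can also be written out by hand, which shows what "the value at the 95% position" means. The sketch below is an illustration under stated assumptions: the seeded generator, the variable names, and the choice to round the position down to a whole index are all made for this example.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
simu_snp_index = rng.binomial(n=9, p=0.5, size=10000) / 9

# 1) Sorting the simulated values in ascending order
sorted_vals = np.sort(simu_snp_index)

# 2) Taking the value at the 95% position (position rounded down)
idx = int(np.floor(0.95 * (len(sorted_vals) - 1)))
threshold = sorted_vals[idx]

# 3) Comparing with the observed value (3 mutant alleles out of 9)
observed = 3 / 9
print('threshold:', threshold)
print('observed exceeds threshold:', observed > threshold)
```

Because the observed value does not exceed the threshold here, the null hypothesis is not rejected for this locus.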

In [None]:
import numpy as np

# Show the observed SNP-index
print('Observed SNP-index: ', mysnpindex)

# 1) Sorting 10,000 simulated values AND
# 2) Getting the value at the 95% position
# (NumPy 1.22 and later use "method"; older versions used "interpolation")
get_snp_index  = np.percentile(simu_snp_index, 95, method='lower')

# Show the SNP-index at the 95% position
print('SNP-index of threshold: ', get_snp_index)

# 3) Comparing the value and the observed value
if get_snp_index < mysnpindex:
    # 4.1) If the observed value is more than the value at the 95% position
    print('Rejecting the null hypothesis') 
else:
    # 4.2) If the observed value is NOT more than the value at the 95% position
    print('Adopting the null hypothesis')
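
A related way to express how unusual the observed value is, rather than only comparing it with a threshold, is the fraction of simulated values that are at least as large as the observed one. The sketch below is an optional addition, not part of the original practice; the name `p_value` and the small tolerance are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
total_allele_num = 9
simu_snp_index = rng.binomial(n=total_allele_num, p=0.5, size=10000) / total_allele_num

observed = 3 / 9  # SNP-index of the first row

# Fraction of simulated values at least as large as the observed value
# (the small tolerance guards against floating-point rounding)
p_value = np.mean(simu_snp_index >= observed - 1e-12)
print('empirical one-sided p-value:', p_value)
```

A large fraction, as obtained here, means the observed SNP-index is common under the null hypothesis, in line with the threshold comparison.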

### 5. Identification of genomic region including causal gene<a name="1.5"></a>

Next, we run the simulation at all loci.

In [None]:
import numpy as np

#=== Function for simulation of SNP-index ===
def snp_index_simulation(df, ref_colname, alt_colname, trials=10000):
    """
    Function for simulation of SNP-index
    
    Arguments:
    - df : Dataframe (required)
    - ref_colname: Column name of original alleles (required)
    - alt_colname: Column name of mutant alleles (required)
    - trials : Number of SNP-indexes generated at one locus (default: 10000)
    
    Return: Dataframe with a column of simulated values
    """
    
    # Function for simulation in one locus
    def one_locus_simulation(allele_num):
        # Selecting alleles
        simu_allele_nums  = np.random.binomial(n=allele_num, p=0.5, size=trials)
        # Calculating SNP-index
        simu_snp_index = simu_allele_nums / allele_num
        # Value at the 95% position (Return)
        # (NumPy 1.22 and later use "method"; older versions used "interpolation")
        get_val  = np.percentile(simu_snp_index, 95, method='lower')
        return get_val
    
    # Total number of alleles
    total = df[ref_colname] + df[alt_colname]
    
    # Doing simulation
    df['simu95'] = total.map(one_locus_simulation)
    
    # Return
    return df


#=== Main ===
df2 = snp_index_simulation(df, ref_colname='ref_N', alt_colname='alt_N', trials=10000)

# Show
df2
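
In the function above, the simulated threshold of a locus depends only on its total number of alleles, so loci with the same total repeat the same simulation. One possible shortcut is to simulate once per distinct total and look the result up for every row. The sketch below illustrates the idea on an invented table; `simulate_threshold`, `toy`, and the seed are hypothetical, and totals of zero are assumed not to occur.

```python
import numpy as np
import pandas as pd

def simulate_threshold(depth, trials=10000, rng=None):
    """Value at the 95% position for one total allele number (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    simulated = rng.binomial(n=depth, p=0.5, size=trials) / depth
    return np.sort(simulated)[int(np.floor(0.95 * (trials - 1)))]

# Toy table (invented read counts, not the MutMap data)
toy = pd.DataFrame({'ref_N': [6, 2, 10, 6], 'alt_N': [3, 8, 10, 3]})
total = toy['ref_N'] + toy['alt_N']

# Simulating once per distinct total, then looking the result up for every row
rng = np.random.default_rng(4)  # arbitrary seed
thresholds = {d: simulate_threshold(d, rng=rng) for d in total.unique()}
toy['simu95'] = total.map(thresholds)
print(toy)
```

With this approach, rows that share a total also share exactly the same threshold, whereas separate simulations per row can differ slightly by chance.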

Let's run a sliding window analysis on the observed SNP-indexes and on the simulated values at the 95% position, and draw the graph.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

#=== Function for Sliding window analysis ===
def sliding_window(df, pos_colname, val_colname, winsize=1000000, stepsize=200000, chromsize=None):
    """
    Function for Sliding window analysis
    
    Arguments:
    - df : Dataframe (required)
    - pos_colname: Column name of chromosome position (required) => values of x-axis
    - val_colname: Column name of values you want to analyze (required) => values of y-axis
    - winsize :  Window size (default: 1Mb)
    - stepsize : Step size (default: 0.2Mb)
    - chromsize : Chromosome size (required)
    
    Returns: 
    - List of midpoints of each window 
    - List of averages of the values in each window
    """   
    
    win_position  = []  # List of window midpoints on the chromosome
    win_snpindex = []  # List of averages of SNP-indexes

    n = 0 # repeat No.
    while True:
        #--- start and end position of window ---
        start = stepsize * n 
        end   = start + winsize
        #--- median position of window ---
        p = (start + end) / 2
        win_position.append(p)
        #--- Extracting dataset in window ---
        sub = df[(df[pos_colname] >= start) & (df[pos_colname] < end)]
        #--- Calculating average of SNP-index ---
        i = sub[val_colname].mean()
        win_snpindex.append(i)
        #--- repeat No. +1 ---
        n += 1
        
        #--- stop and exit the loop ---
        if end > chromsize:
            break
    
    return win_position, win_snpindex


#=== Main ===
CHROM_SIZE = 23207287  # Length of Chromosome 10
WIN_SIZE = 1000000 # 1Mb
STEP_SIZE = 200000  # 0.2Mb

# Sliding window analysis for the observed SNP-indexes
win_pos, win_snpidx = sliding_window(
    df2, 
    pos_colname='pos', 
    val_colname='snp_index', 
    winsize=WIN_SIZE, 
    stepsize=STEP_SIZE, 
    chromsize=CHROM_SIZE
)

# Sliding window analysis for the simulated SNP-indexes
win_pos, win_simu95 = sliding_window(
    df2, 
    pos_colname='pos', 
    val_colname='simu95', 
    winsize=WIN_SIZE, 
    stepsize=STEP_SIZE, 
    chromsize=CHROM_SIZE
)

#--- Drawing graph: scatter plot  ---
fig = plt.figure(figsize=[16,9])
plt.scatter(df2['pos'], df2['snp_index'], color='gray')      # scatter plot
plt.xlabel('Position (x 10 Mb)', fontsize=16)  # x-label
plt.ylabel('SNP-index', fontsize=16)                # y-label

#--- Drawing graph: Sliding window analysis ---
plt.plot(win_pos, win_snpidx, color='blue')  # the observed SNP-indexes
plt.plot(win_pos, win_simu95, color='orange')  # the simulated SNP-indexes 

In [None]:
### Getting information of the genomic region including causal gene  

w_pos   = np.array(win_pos)        # midpoints of each window
w_idx    = np.array(win_snpidx)   # averages of the observed SNP-indexes
w_simu = np.array(win_simu95) # averages of the simulated SNP-indexes

# Extracting region
w_pos[w_idx > w_simu]

Considering that each window's position is its midpoint, we found that the candidate region lies between 21,000,000 and 22,600,000 bp on chromosome 10 in rice.

The region includes the _Chlorophyllide a oxygenase_ (Os10t0567400) gene.  
This gene is a candidate gene for the mutant phenotype, pale green leaf (Abe et al., 2012).
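
The step from window midpoints to region boundaries is simple arithmetic: each window extends half a window size on either side of its midpoint. The sketch below shows the calculation; the midpoints in `selected_mid` are hypothetical values chosen for illustration, not the actual output of the cell above.

```python
import numpy as np

WIN_SIZE = 1000000  # 1 Mb, the window size used in this practice

# Hypothetical midpoints of windows whose observed average exceeded the threshold
selected_mid = np.array([21500000, 21700000, 21900000, 22100000])

# Each window spans half a window size on both sides of its midpoint
region_start = int(selected_mid.min() - WIN_SIZE / 2)
region_end = int(selected_mid.max() + WIN_SIZE / 2)
print(region_start, region_end)
```

For these assumed midpoints, the region runs from 21,000,000 to 22,600,000 bp.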

---

### Summary
We have learned the basics of large data analysis across three lectures.
- Loading a file and writing to a file
- Calculating values between columns
- Drawing the graph
- Sliding window analysis
- Simulation and evaluation of the observed data