In [None]:
"""
*** IMPORTANT ***
Run this cell before this practice.
You can download a sample file.
"""
!wget https://raw.githubusercontent.com/CropEvol/lecture/master/data/mutmap_bulk.txt -O mutmap_bulk.txt

# Introduction to Large Data Analysis (2)
Data analysis using pandas & Drawing figures

## Contents

### Introduction
- [About previous and this practice](#0.1)
- [About Sample data](#0.2)

### Practice
1. [Loading a data-file as pandas dataframe](#1.1)
1. [Accessing arbitrary data](#1.2)
1. [Calculating SNP-index](#1.3)
1. [Selecting data by condition](#1.4)
1. [Writing into a file](#1.5)
1. [Drawing graph](#1.6)
1. [Sliding window analysis](#1.7)

## Introduction

### About previous and this practice<a name="0.1"></a>

In the previous practice, we learned the basics of text-data analysis.  
=> [the previous page](../06_large_data_analysis/01_large_data_analysis_en.ipynb) 

Using a `for` loop, we read the text one line at a time, split each line, and processed all the data.   
This approach works not only for text data but for any kind of data. However, it has disadvantages: slower processing and longer code.

If the data is written in table form, we can use the [pandas](https://pandas.pydata.org/) library, which makes data processing faster and easier.

In this practice,
1. we process the text data using the pandas library.
1. we draw graphs using the [Matplotlib](https://matplotlib.org/) library.

### About sample data<a name="0.2"></a>

We use the MutMap data file (Abe et al., 2012), the same one as in the previous practice.  
=> [the previous page / About sample data](../06_large_data_analysis/01_large_data_analysis_ja.ipynb#0.3)

The file is a table-form text file whose columns are separated by the tab character `\t`.

<div style="margin-bottom: 5px;"><img src="https://github.com/CropEvol/lecture/blob/master/images/07/tab-sep_text_en.jpg?raw=true" alt="matplotlib_graph"></div>

Check the file => [mutmap_bulk.txt](mutmap_bulk.txt)

## Practice

#### Contents of this practice
1. [Loading a data-file as pandas dataframe](#1.1)
1. [Accessing arbitrary data](#1.2)
1. [Calculating SNP-index](#1.3)
1. [Selecting data by condition](#1.4)
1. [Writing into a file](#1.5)
1. [Drawing graph](#1.6)
1. [Sliding window analysis](#1.7)


Most of the programs have already been written, but they are not complete.  
You have to add some lines of code.  

The places where code must be added are marked as below.

```python
# !!! Add code !!!
```

### 1. Loading a data-file as pandas dataframe<a name="1.1"></a>

To load data from a file, we use the `read_csv` function of the pandas library.

=== Basic syntax ===
```python
import pandas
df = pandas.read_csv("<File name>", sep="<Separator>", header=<Row No.>)
```
OR
```python
import pandas as pd
df = pd.read_csv("<File name>", sep="<Separator>", header=<Row No.>)
```

=== Descriptions ===

#### About `import pandas`
- Loads pandas so that it can be used in the program.
- With `import pandas`, pandas functions are called as `pandas.FUNCTION()`.
- With `import pandas as pd`, pandas functions are called as `pd.FUNCTION()` (`pd` is the usual abbreviation).


#### About `df = pandas.read_csv("<File name>", sep="<Separator>", header=<Row No.>)`

- First argument: the file name.
- `sep=`: Specifies the separator (delimiter) used in the input file.
    * Comma-separated: `,`
    * Tab-separated: `\t` (backslash + t)
- `header=`: Specifies the header (the row of column names) of the input file.
    * With `header=0`, the first row is the header.
    * With `header=None`, the file has no header row. (Current pandas versions reject negative values such as `-1`.)
- The loaded data table is called a "data frame".
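
As a minimal sketch of how `sep`, `header`, and `names` work together, the cell below reads a small in-memory table instead of `mutmap_bulk.txt`. The two rows are made-up values used only for illustration, and `toy` is a hypothetical variable name.

```python
import io
import pandas as pd

# Hypothetical tab-separated text with two data rows and no header line
text = "chr10\t100\tA\tG\t3\t7\nchr10\t250\tC\tT\t5\t5\n"

# header=None: there is no header row, so the column names come from names=
toy = pd.read_csv(io.StringIO(text), sep="\t", header=None,
                  names=["chr", "pos", "ref_nucl", "alt_nucl", "ref_N", "alt_N"])

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # the names passed via names=
```

Because `header=None` is given, the first line of the text is treated as data rather than as column names.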

In [None]:
#--- Import library ---
import pandas as pd

#--- Loading data from file ---
dataset = 'mutmap_bulk.txt'        # input-file name
df = pd.read_csv(dataset, sep='\t', header=None, names=['chr', 'pos', 'ref_nucl', 'alt_nucl', 'ref_N', 'alt_N'])  # header=None: the file has no header row

df  # show

### Supplementary explanation 1
#### About data frames
- A data frame is a table, like an Excel sheet. All values in one column have the same format.
- The first row holds the column names.
- The first column holds the indexes (the names of the rows).

### Supplementary explanation 2
#### Why are the results displayed without using `print()`?
This is a feature of Jupyter Notebook.

Jupyter Notebook shows the value of the last expression in a cell without `print()`.

In [None]:
a = 1
b = 2
c = 3

a  # Not Displayed
b  # Not Displayed
c  # Displayed

### 2. Accessing arbitrary data<a name="1.2"></a>

The code has already been written in the following cell (commented out).

Remove each `#` (hash) and confirm the results. 

In [None]:
###### show dataset ######
df


###### Extract one column  ######
#df['ref_nucl']
#df.loc[:, 'ref_nucl']
#df.iloc[:, 2]


###### Extract multi column ######
#df.loc[:, ['ref_nucl','alt_nucl']]
#df.iloc[:, 2:4]


###### Extract one row ######
#df.loc[10,:]
#df.iloc[10,:]


###### Extract multi rows ######
#df.loc[10:15, :]
#df.iloc[10:15, :]


###### Extract one data-cell ######
#df.loc[10, 'ref_nucl']
#df.iloc[10, 2]


###### Extract multi data-cells ######
#df.loc[10:15, ['ref_nucl', 'alt_nucl']]
#df.iloc[10:15, 2:4]
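
One point worth checking when comparing the `loc` and `iloc` lines is how slices behave: `loc` selects by label and includes the end label, while `iloc` selects by position and excludes the end position. The sketch below shows this on a made-up table (`toy` and its values are hypothetical).

```python
import pandas as pd

# Hypothetical toy table; the default index labels are 0, 1, 2, ...
toy = pd.DataFrame({"pos": [10, 20, 30, 40, 50],
                    "ref_nucl": list("ACGTA")})

by_label = toy.loc[1:3, :]      # labels 1, 2 and 3 (end label included)
by_position = toy.iloc[1:3, :]  # positions 1 and 2 (end position excluded)

print(len(by_label), len(by_position))
```

With the default index, `df.loc[10:15, :]` therefore returns one more row than `df.iloc[10:15, :]`.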


### 3. Calculating SNP-index<a name="1.3"></a>

In the previous practice, we used `for` and `split()` to calculate SNP-indexes.  
In this practice, we will calculate SNP-indexes without using `for` and `split()`.

#### The formula of SNP-index
SNP-index = alt_N / (ref_N + alt_N)
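
The formula can be applied to whole columns at once, because arithmetic between pandas columns is carried out row by row. The following is a small sketch on made-up read counts (the `toy` table is hypothetical and is not the sample data).

```python
import pandas as pd

# Hypothetical read counts for three SNP positions
toy = pd.DataFrame({"ref_N": [3, 5, 0], "alt_N": [7, 5, 10]})

# Column arithmetic works element-wise, so no for loop or split() is needed
toy["snp_index"] = toy["alt_N"] / (toy["ref_N"] + toy["alt_N"])

print(toy)
```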

In [None]:
###### Calculating SNP-index ######

# !!! Add code !!!
#df['snp_index'] = 

#--- Show ---
df

The completed program is [here](./02_large_data_analysis_en_complete_version.ipynb#1.3)

### 4. Selecting data by condition<a name="1.4"></a>

=== Basic syntax ===

```python
df[ (Condition) ]
```

In [None]:
###### Selecting data by condition ######

#--- Single condition ---
# df['ref_nucl']=='A'  

# df[ df['ref_nucl']=='A' ] 

#--- Multi conditions ---
# df[ (df['ref_nucl']=='A' ) & (df['alt_nucl']=='G' ) ]    # AND

# df[ (df['ref_nucl']=='A' ) | (df['alt_nucl']=='G' ) ]    # OR


#--- Only data with SNP-index >= 0.9 ---
# !!! Add code !!!



The completed program is [here](./02_large_data_analysis_en_complete_version.ipynb#1.4)

### Supplementary explanation 3
#### Difference between `df['ref_nucl']=='A'` and `df[ df['ref_nucl']=='A' ]`

`df['ref_nucl']=='A'`: Returns a series of `True`/`False` values (`True` means the row matches; `False` means it does not).

`df[ (True/False series) ]`: Returns a data frame that keeps only the rows matching the condition (the `True` rows).

<div style="margin-bottom: 5px;"><img src="https://github.com/CropEvol/lecture/blob/master/images/07/pandas_filtering_en.jpg?raw=true" alt="pandas_filtering"></div>
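
The two steps can be tried separately, as in this sketch on a made-up table (`toy`, `mask`, `selected`, and `both` are hypothetical names used only here).

```python
import pandas as pd

toy = pd.DataFrame({"ref_nucl": ["A", "C", "A"],
                    "alt_nucl": ["G", "T", "T"]})

# Step 1: the comparison gives one True/False value per row
mask = toy["ref_nucl"] == "A"

# Step 2: indexing with the mask keeps only the rows where it is True
selected = toy[mask]

# Conditions are combined with & (AND) or | (OR); each needs parentheses
both = toy[(toy["ref_nucl"] == "A") & (toy["alt_nucl"] == "G")]

print(mask.tolist())
print(selected.index.tolist())
print(len(both))
```

Note that the selected rows keep their original index labels.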

### 5. Writing into a file<a name="1.5"></a>

To write a data frame into a file, we use the `to_csv` function of the pandas library.

=== Basic syntax ===

```python
df.to_csv("<File name>", sep="<Separator>", header=<True/False>, index=<True/False>)
```

=== Descriptions ===
- First argument: the file name.
- `sep=`: Specifies the separator (delimiter) used in the output file.
    * Comma-separated: `,`
    * Tab-separated: `\t` (backslash + t)
- With `header=True`, the header line (the column names) is written to the output file.
- With `index=True`, the indexes (the row names) are written to the output file.
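
To see the effect of `header=` and `index=`, the sketch below writes a made-up table to a temporary file and reads the first line back. The file name `toy_snpindex.txt` and the table values are hypothetical.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({"pos": [100, 250], "snp_index": [0.7, 0.5]})

# Write into a temporary directory so that no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), "toy_snpindex.txt")
toy.to_csv(path, sep="\t", header=True, index=False)

# The first line holds the column names; there is no index column
with open(path) as f:
    first_line = f.readline().rstrip("\n")
print(first_line)

# The file can be loaded again with read_csv (header=0: first row is the header)
back = pd.read_csv(path, sep="\t", header=0)
print(back.shape)
```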

In [None]:
###### write new table into the output-file ######
#outdata = 'mutmap_snpindex.txt'        # output-file name
#df.to_csv(outdata, sep='\t', header=True, index=False)

Check the output file => [File list](./)

### 6. Drawing graph<a name="1.6"></a>

In this section, we use the [Matplotlib](https://matplotlib.org/) library, which is widely used for drawing graphs.

A Matplotlib graph is composed of several layers.  
For example, there is a layer for the graph field, a layer for a line plot, and a layer for labels.

<div style="margin-bottom: 5px;"><img src="https://github.com/CropEvol/lecture/blob/master/images/07/07_drawing_graph.png?raw=true" alt="matplotlib_graph"></div>

In this practice, we will draw a scatter plot with chromosome position on the x-axis and SNP-index on the y-axis.

We will also show the points with SNP-index >= 0.9 as red dots.
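
The layering idea can be sketched on a made-up table before working with the real data: each `plt.scatter` call adds one layer to the same graph field. The `toy` and `high` names and all values are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical positions and SNP-index values
toy = pd.DataFrame({"pos": [1, 2, 3, 4],
                    "snp_index": [0.2, 0.95, 0.5, 1.0]})
high = toy[toy["snp_index"] >= 0.9]   # rows to highlight

fig = plt.figure(figsize=[8, 4])                          # graph field
plt.scatter(toy["pos"], toy["snp_index"], color="gray")   # layer 1: all points
plt.scatter(high["pos"], high["snp_index"], color="red")  # layer 2: drawn on top
plt.xlabel("Position")                                    # label of x-axis
plt.ylabel("SNP-index")                                   # label of y-axis
```

The red layer is drawn after the gray one, so the highlighted points cover the gray points at the same coordinates.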

In [None]:
"""
The line below is needed to display the graph in Jupyter Notebook.
It is not Python code; it is a "magic command" of Jupyter Notebook.
"""
%matplotlib inline


"""
The Python program starts here
"""

#--- Import library ---
import matplotlib.pyplot as plt

#--- x-values, y-values ---
df['snp_index'] = df['alt_N'] / (df['ref_N'] + df['alt_N'])
x = df['pos']
y = df['snp_index']

#--- Drawing all data ---
fig = plt.figure(figsize=[16,9])    # graph field
plt.scatter(x, y, color='gray')      # scatter plot
plt.title('SNP-index on chromosome 10', fontsize=24)  # title of this graph
plt.xlabel('Position (x 10 Mb)', fontsize=16)  # label of x-axis
plt.ylabel('SNP-index', fontsize=16)                # label of y-axis


# Drawing the data of "SNP-index >= 0.9"
# !!! Add code !!!


The completed program is [here](./02_large_data_analysis_en_complete_version.ipynb#1.6)

### 7. Sliding window analysis<a name="1.7"></a>
In MutMap, a sliding window analysis is performed to detect the genomic region containing the causal gene for the mutant phenotype.

<div style="margin-bottom: 5px;"><img src="https://github.com/CropEvol/lecture/blob/master/images/07/sliding_window_en.jpg?raw=true" alt="matplotlib_graph"></div>


In this practice, we will see how the SNP-index changes across chromosome 10 in rice.  

#### How to write the program for sliding window analysis
1. Decide the window size and the step size.
1. Prepare two lists: one for the genomic position of each window (its midpoint is used here) and one for the average SNP-index of each window.
1. Scan all windows using `while`.
    1. Extract the data in one window.
    1. Calculate the average SNP-index and the midpoint of the window.
    1. Append these values to the lists.
    1. When all windows have been scanned, exit the `while` loop.
1. Draw the graph: the x-axis is the genomic position of the window and the y-axis is the average SNP-index.
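
The steps above can be sketched on a tiny made-up table before running them on the whole chromosome. The function name `sliding_window_mean`, the `toy` table, and the sizes used here are hypothetical and chosen only to keep the numbers easy to check by hand.

```python
import pandas as pd

def sliding_window_mean(data, chrom_size, win_size, step_size):
    """Return (window midpoints, average SNP-index of each window)."""
    positions, means = [], []
    n = 0
    while True:
        start = step_size * n
        end = start + win_size
        # Rows whose position falls inside [start, end)
        sub = data[(data["pos"] >= start) & (data["pos"] < end)]
        positions.append((start + end) / 2)
        means.append(sub["snp_index"].mean())  # NaN if the window is empty
        n += 1
        # Stop once the window has passed the end of the chromosome
        if end > chrom_size:
            break
    return positions, means

toy = pd.DataFrame({"pos": [5, 15, 25, 35],
                    "snp_index": [0.25, 0.75, 0.5, 1.0]})
mid, avg = sliding_window_mean(toy, chrom_size=40, win_size=20, step_size=10)
print(mid)
print(avg)
```

Because the step size is smaller than the window size, neighboring windows overlap and each SNP can contribute to more than one average.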

In [None]:
###### Sliding window analysis ######
#---  Import library ---
import numpy as np

#--- Chromosome size, Window size, step size ---
CHROM_SIZE = 23207287           # Length of chromosome 10 (bp)
WIN_SIZE   = 1 * 1000 * 1000    # Window size: 1 Mb = 1000 kb = 1,000,000 bp
STEP_SIZE  = 200 * 1000         # Step size: 0.2 Mb = 200 kb = 200,000 bp

#--- Prepare lists for the positions and averages of SNP-index in each region ---
win_position  = []  # list for positions
win_snpindex = []  # list for averages of SNP-index

#--- Search all regions---
"""
/// start and end position of each region ///
start, end
0, 0+1000 (kb)
200, 200+1000
400, 400+1000
  .
  .
  .

/// expressed using WIN_SIZE and STEP_SIZE ///
Repeats: n = 0, 1, 2, ...

start = STEP_SIZE * n  
end = start + WIN_SIZE


If "end > CHROM_SIZE", stop and exit the loop.
"""

n = 0 # Repeats
while True:
    
    #--- Start & end position ---
    start = STEP_SIZE * n 
    end   = start + WIN_SIZE
    
    #--- Midpoint of window ---
    p = (start + end) / 2
    win_position.append(p)
    
    #--- Extract data in region ---
    sub = df[(df['pos'] >= start) & (df['pos'] < end)]
    
    #--- Average of SNP-indexes ---
    i = sub['snp_index'].mean()
    win_snpindex.append(i)
        
    #--- Repeats Num +1 ---
    n += 1
    
    #--- Stop and exit the loop ---
    if end > CHROM_SIZE:
        break

#--- Scatter plot of all data  ---
fig = plt.figure(figsize=[16,9])
plt.scatter(x, y, color='gray')      # all data
plt.title('SNP-index on chromosome 10', fontsize=24)  # title
plt.xlabel('Position (x 10 Mb)', fontsize=16)  # label of x-axis
plt.ylabel('SNP-index', fontsize=16)                # label of y-axis

#--- Scatter plot of SNP-index>=0.9 ---
df_ext = df[ df['snp_index'] >= 0.9 ]
x1 = df_ext['pos']
y1 = df_ext['snp_index']
plt.scatter(x1, y1, color='red')

#--- Line plot of sliding window ---
plt.plot(win_position, win_snpindex, color='blue')      

In [None]:
### Extract the windows whose average SNP-index is 0.9 or more ###

# Dataframe of Sliding window
W = pd.DataFrame({ 'pos': win_position, 'snp_index': win_snpindex})

# start position  & end position of window
W['start'] = W['pos'] - WIN_SIZE / 2 
W['end'] = W['pos'] + WIN_SIZE / 2

# windows with average SNP-index >= 0.9
W[W['snp_index'] >= 0.9]