In [2]:
"""
*** IMPORTANT ***
Run this cell before starting the practice below.
It downloads the sample file used in this practice.
"""
!wget https://raw.githubusercontent.com/CropEvol/lecture/master/data/mutmap_bulk.txt -O mutmap_bulk.txt

--2018-11-05 19:53:09--  https://raw.githubusercontent.com/CropEvol/lecture/master/data/mutmap_bulk.txt
Resolving raw.githubusercontent.com... 151.101.108.133
Connecting to raw.githubusercontent.com|151.101.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6154 (6.0K) [text/plain]
Saving to: ‘mutmap_bulk.txt’


2018-11-05 19:53:09 (96.2 MB/s) - ‘mutmap_bulk.txt’ saved [6154/6154]



# Introduction to Large Data Analysis (1)

## Contents

### Introduction
1. [Genomic analysis and text data](#0.1)
1. [Basics of text data processing](#0.2)
1. [About sample data](#0.3)

--- 

### Practice
1. [Open and close a text-file](#1.1)
1. [Read a file line-by-line](#1.2)
1. [Remove line feed code](#1.3)
1. [Split one line](#1.4)
1. [Calculate SNP-index](#1.5)
1. [Write into a file](#1.6)

---
## Introduction

### Genomic analysis and text data analysis<a name="0.1"></a>

Big data, including genomic data, is essentially just text files (although very large ones).  

For example, genome sequence data is a string made up of the characters "A" (adenine), "C" (cytosine), "G" (guanine) and "T" (thymine). 

So it is not an exaggeration to say that genomic analysis means finding the important information in a very long text. 

In this practice, we learn how to handle large data in a text file using Python.

### Basics of text data processing<a name="0.2"></a>

The most important point is ___"Process the lines one by one, starting from the top line"___

Overall flow of text data processing:
- Open the file
- Extract one line
- Split the line into data items
- Calculate and/or process the data in the line
- Move to the next line
- Close the file
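
The flow above can be sketched in a few lines of Python. This is only a minimal illustration: it first writes a tiny two-line file (the name `sample.txt` and the temporary directory are assumptions made so that the example runs anywhere), then processes it line by line and collects the positions from the second column.

```python
import os
import tempfile

# Create a tiny tab-delimited file in the same layout as the sample data.
# The file name and location are hypothetical, used only for this sketch.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
f_tmp = open(sample_path, "w")
f_tmp.write("chr10\t51406\tG\tA\t6\t3\nchr10\t59101\tA\tT\t6\t3\n")
f_tmp.close()

positions = []
f_in = open(sample_path, "r")        # Open the file
for line in f_in:                    # Extract one line, then move to the next
    line = line.rstrip()             # Remove the line feed code
    items = line.split("\t")         # Split the line into data items
    positions.append(int(items[1]))  # Process the data in the line
f_in.close()                         # Close the file

print(positions)
```

Each step is explained one at a time in the practice part.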

### Sample data => [Link](mutmap_bulk.txt)<a name="0.3"></a>

In this practice, we use the MutMap data for the "pale green" leaf phenotype in rice (Abe et al., 2012). 
<div style="margin-bottom: 5px;"><img src="https://github.com/CropEvol/lecture/blob/master/images/06/mutmap04_en.jpg?raw=true" alt="mutmap"></div>

Abe, A., Kosugi, S., Yoshida, K., Natsume, S., Takagi, H., Kanzaki, H., Matsumura, H., Yoshida, K., Mitsuoka, C., Tamiru, M., Innan, H., Cano, L., Kamoun, S., Terauchi, R. (2012). [Genome sequencing reveals agronomically important loci in rice using MutMap.](https://www.nature.com/articles/nbt.2095) _Nature biotechnology_, 30(2), 174.

MutMap is a gene mapping method.  
Outline of MutMap:
1. Cross the original line (here, the rice cultivar "Hitomebore") with its mutant line.
1. Develop F2 individuals, and pool the DNA of the F2 individuals that show the mutant phenotype.
1. Obtain two NGS data sets: one from the original parent line and one from the pooled F2 individuals.
1. Compare the two data sets to find the positions of mutations, and which nucleotides are observed at each position and how many times.
1. Because all F2 individuals with the mutant phenotype should share the mutation that causes the phenotype, you can expect the mutant allele to be fixed at the causal locus.
1. Search the whole genome for the locus where the SNP-index (the frequency of the mutant allele) is fixed.
1. Draw a map of the SNP-index (x-axis => chromosome position, y-axis => SNP-index).
1. Identify the candidate locus or site.

<div style="margin-bottom: 5px;"><img src="https://github.com/CropEvol/lecture/blob/master/images/06/mutmap01_en.jpg?raw=true" alt="mutmap"></div>

<div style="margin-bottom: 5px;"><img src="https://github.com/CropEvol/lecture/blob/master/images/06/mutmap02.png?raw=true" alt="mutmap"></div>

Check the sample file => [Link](mutmap_bulk.txt)

Information in each column:
- Chromosome
- Position
- "Hitomebore" allele
- Mutant allele
- Depth of "Hitomebore" allele: Count of the "Hitomebore" allele in the pooled sequences of 20 individuals with the mutant phenotype 
- Depth of mutant allele: Count of the mutant allele in the pooled sequences of 20 individuals with the mutant phenotype 

<div style="margin-bottom: 5px;"><img src="https://github.com/CropEvol/lecture/blob/master/images/06/mutmap03_en.jpg?raw=true" alt="mutmap"></div>

Let's find the locus where the SNP-index is fixed (1.0) in order to locate the mutation that causes the "pale green" leaf in rice.
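
As a rough preview of that goal, the sketch below computes the SNP-index for a few sites and keeps only those where it equals 1.0. The first two rows are taken from the sample file; the third row (`chr_example`) is made up purely to show what a fixed site looks like.

```python
# Each tuple: (chromosome, position, depth of "Hitomebore" allele, depth of mutant allele)
rows = [
    ("chr10", 51406, 6, 3),
    ("chr10", 2757277, 4, 35),
    ("chr_example", 100, 0, 12),  # Hypothetical row, not from the sample file
]

fixed_sites = []
for chrom, pos, ref_depth, alt_depth in rows:
    snp_index = alt_depth / (ref_depth + alt_depth)  # Frequency of the mutant allele
    if snp_index == 1.0:                             # All reads carry the mutant allele
        fixed_sites.append((chrom, pos))

print(fixed_sites)
```

With real data, you may prefer a threshold slightly below 1.0 to allow for sequencing errors; that choice is outside the scope of this practice.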

---
## Practice
In this practice, we will build the program "Calculate SNP-index and write the result into a file" by adding code step by step.

### 1. Open and close a text-file<a name="1.1"></a>
Open the sample file, and then close it.

=== Basic syntax ===
```python
f_in = open('<File Name>', 'r')
f_in.close()
```

=== Description ===
- open()
    - 1st argument: Name of the file that you want to open
    - 2nd argument: 
        * 'r': Open the file in read-only mode  
        * 'w': Open the file in write mode (an existing file is overwritten)
        * 'a': Open the file in append mode
    - The opened file is assigned to the variable `f_in`.
        (This variable is called a __file object__.)
- close()
    - Usage: FILE_OBJECT.close()
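
The three modes can be compared with a small sketch. The file `example.txt` is created in a temporary directory only for this illustration; it is not part of the practice data.

```python
import os
import tempfile

# Hypothetical file, created only so that the example can run anywhere
path = os.path.join(tempfile.mkdtemp(), "example.txt")

f_out = open(path, "w")   # 'w': create the file (or overwrite it) for writing
f_out.write("first line\n")
f_out.close()

f_add = open(path, "a")   # 'a': add to the end of the existing file
f_add.write("second line\n")
f_add.close()

f_in = open(path, "r")    # 'r': read-only
content = f_in.read()     # Read the whole file as one string
f_in.close()

print(content)
print(f_in.closed)        # True after close() has been called
```

Python also offers the form `with open(path, 'r') as f_in:`, which closes the file automatically at the end of the block. This practice uses explicit `open()` and `close()` to make each step visible.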

In [3]:
### Add code below ###

dataset = 'mutmap_bulk.txt' # File name




### 2. Read a file line-by-line<a name="1.2"></a>
Read the opened file line by line, and print each line.

=== Basic syntax ===
```python
for line in f_in:
    print(line)
```

=== Description ===  
You can get one line at a time by using `for`.  
The file object `f_in` behaves like a list of lines. => [1st line, 2nd line, 3rd line, ..., last line]  
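
To see exactly what each `line` contains, the following sketch prints it with `repr()`, which makes the tab code and the line feed code visible. The two-line file is a stand-in created only for this example.

```python
import os
import tempfile

# Hypothetical two-line file in the same layout as the sample data
path = os.path.join(tempfile.mkdtemp(), "two_lines.txt")
f_tmp = open(path, "w")
f_tmp.write("chr10\t51406\tG\tA\t6\t3\nchr10\t59101\tA\tT\t6\t3\n")
f_tmp.close()

seen = []
f_in = open(path, "r")
for line in f_in:
    seen.append(line)
    print(repr(line))   # repr() shows \t and \n instead of rendering them
f_in.close()
```

Note that the file object is not a real list: lines are read one at a time as the loop advances, so even a very large file does not have to fit in memory at once.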

In [4]:
### Add code below ###


dataset = 'mutmap_bulk.txt' # Input file name

f_in = open(dataset, 'r')  # Open the file

f_in.close()  # Close the file

### 3. Remove line feed code<a name="1.3"></a>

In the output of the code above, you can see a blank line after each line.  
This is because each line in the text file ends with the line feed code (`\n`), and `print()` adds another one.  
In text data processing, it is better to remove the line feed code.

=== Basic syntax ===  
```python
line = line.rstrip()
```

=== Description ===
- `rstrip()`
    - Usage: STRING_OBJECT.rstrip()
    - right-strip: removes whitespace characters (spaces, tabs, line feed codes) from the end of a string. If you pass characters as an argument, those characters are removed instead.
    - Here, we use it to remove the line feed code at the end of each line.
    
- `lstrip()`
    - Usage: STRING_OBJECT.lstrip()
    - left-strip: removes whitespace characters from the beginning of a string.
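
A short sketch of the difference between these calls. The example line, which ends with an empty last field, is hypothetical and chosen to show that `rstrip()` without arguments also removes trailing tabs.

```python
line = "chr10\t51406\t\n"   # Hypothetical line whose last field is empty

stripped_all = line.rstrip()       # Removes the trailing tab AND the line feed code
stripped_lf = line.rstrip("\n")    # Removes only the line feed code
left_only = "  abc\n".lstrip()     # Removes leading spaces, keeps the line feed code

print(repr(stripped_all))
print(repr(stripped_lf))
print(repr(left_only))
```

For the sample file in this practice, no line ends with an empty field, so plain `rstrip()` is sufficient.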
    

In [5]:
### Add code below ###

dataset = 'mutmap_bulk.txt' #  Input file name

f_in   = open(dataset, 'r')  # Open file with read-only mode

# Read a file line-by-line
for line in f_in:
    print(line)
    
f_in.close()  # Close file

chr10	51406	G	A	6	3

chr10	59101	A	T	6	3

chr10	59112	A	C	7	2

chr10	61001	A	T	11	3

chr10	161375	A	G	13	5

chr10	161561	A	C	6	5

chr10	161562	T	A	6	5

chr10	393574	A	G	8	2

chr10	465981	A	C	7	2

chr10	1076409	G	A	3	3

chr10	1076415	T	C	3	3

chr10	1330441	A	T	9	2

chr10	1435253	C	T	4	3

chr10	1435288	T	C	3	3

chr10	1544709	T	C	11	4

chr10	1567026	G	A	4	6

chr10	1613426	T	C	3	7

chr10	1613434	T	C	3	9

chr10	1613481	C	T	6	11

chr10	1982349	T	A	2	2

chr10	1982350	G	A	2	2

chr10	1982351	A	G	2	2

chr10	2010618	C	T	5	4

chr10	2246198	G	A	15	6

chr10	2443102	C	A	5	4

chr10	2470700	C	T	11	5

chr10	2470703	A	T	11	5

chr10	2755964	T	C	4	6

chr10	2757274	A	C	4	34

chr10	2757277	A	G	4	35

chr10	2797570	G	A	9	6

chr10	2939265	G	A	5	3

chr10	2979632	A	G	10	4

chr10	2979769	A	T	10	4

chr10	2979865	G	A	7	4

chr10	2979994	T	C	5	4

chr10	2981435	C	T	13	4

chr10	3094017	G	A	6	5

chr10	3120054	C	T	5	3

chr10	3165573	G	A	6	5

chr10	3165694	T	A	6	4

chr10	3209687	G	A	5	2

chr10	3248252	C	T	4	5

chr10	339046

### 4. Split one line<a name="1.4"></a>
To access the individual data items in a line, split the line.  
The sample file is a tab-delimited file.   
You can split the line by specifying the tab code `\t`.

=== Basic syntax ===  
```python
items = line.split('\t')
```

=== Description ===
- `split()`
    - Usage: STRING_OBJECT.split("delimiter")
        * Commonly used delimiters
            - `\t`: tab
            - `,`: comma
    - The string is split into a list of strings.
        * ex.) The first line `chr10	51406	G	A	6	3` is converted to the list `['chr10', '51406', 'G', 'A', '6', '3']`.
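
The same conversion can be tried directly on a string, without opening a file. The comma-separated string below is a made-up variant of the first line, included only to show a second delimiter.

```python
line = "chr10\t51406\tG\tA\t6\t3"   # First line of the sample file, without \n

items = line.split("\t")            # Split at every tab code
print(items)
print(items[0])                     # Items are accessed by 0-based index
print(len(items))

# Hypothetical comma-separated version of the same data
csv_items = "chr10,51406,G,A,6,3".split(",")
print(csv_items == items)
```

Every item in the resulting list is a string, including the numeric columns.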

In [9]:
### Add code below ###

dataset = 'mutmap_bulk.txt' #  Input file name

f_in   = open(dataset, 'r')  # Open file with read-only mode

# Read a file line-by-line
for line in f_in:
    
    line = line.rstrip()  # Remove line feed code
    print(line)
    
f_in.close()  # Close file

chr10	51406	G	A	6	3
chr10	59101	A	T	6	3
chr10	59112	A	C	7	2
chr10	61001	A	T	11	3
chr10	161375	A	G	13	5
chr10	161561	A	C	6	5
chr10	161562	T	A	6	5
chr10	393574	A	G	8	2
chr10	465981	A	C	7	2
chr10	1076409	G	A	3	3
chr10	1076415	T	C	3	3
chr10	1330441	A	T	9	2
chr10	1435253	C	T	4	3
chr10	1435288	T	C	3	3
chr10	1544709	T	C	11	4
chr10	1567026	G	A	4	6
chr10	1613426	T	C	3	7
chr10	1613434	T	C	3	9
chr10	1613481	C	T	6	11
chr10	1982349	T	A	2	2
chr10	1982350	G	A	2	2
chr10	1982351	A	G	2	2
chr10	2010618	C	T	5	4
chr10	2246198	G	A	15	6
chr10	2443102	C	A	5	4
chr10	2470700	C	T	11	5
chr10	2470703	A	T	11	5
chr10	2755964	T	C	4	6
chr10	2757274	A	C	4	34
chr10	2757277	A	G	4	35
chr10	2797570	G	A	9	6
chr10	2939265	G	A	5	3
chr10	2979632	A	G	10	4
chr10	2979769	A	T	10	4
chr10	2979865	G	A	7	4
chr10	2979994	T	C	5	4
chr10	2981435	C	T	13	4
chr10	3094017	G	A	6	5
chr10	3120054	C	T	5	3
chr10	3165573	G	A	6	5
chr10	3165694	T	A	6	4
chr10	3209687	G	A	5	2
chr10	3248252	C	T	4	5
chr10	3390460	C	T	11	3
chr10	3390538	T	A	4	3
chr10	3391

### 5. Calculate SNP-index<a name="1.5"></a>
Calculate the SNP-index with the following formula.

```python
# SNP-index = Count of ALT allele / (Count of REF allele + Count of ALT allele)
#
# "Count of REF allele" is items[4] (the 5th column).
# "Count of ALT allele" is items[5] (the 6th column).
# (ATTENTION: Python uses 0-based indexing.)
```
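
As a worked example, here is the calculation for the first line of the sample file (REF count 6, ALT count 3). The variable names `ref_count` and `alt_count` are chosen for this sketch only. Because the items are strings, they are converted with `int()` before the arithmetic.

```python
items = ['chr10', '51406', 'G', 'A', '6', '3']   # First line after split('\t')

ref_count = int(items[4])   # Count of REF ("Hitomebore") allele
alt_count = int(items[5])   # Count of ALT (mutant) allele

snp_index = alt_count / (ref_count + alt_count)   # 3 / (6 + 3)
print(snp_index)   # 0.3333333333333333
```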

In [10]:
### Add code below ###

dataset = 'mutmap_bulk.txt' #  Input file name

f_in   = open(dataset, 'r')  # Open file with read-only mode

# Read a file line-by-line
for line in f_in:
    
    line = line.rstrip()  # Remove line feed code
    items = line.split('\t')  # Split one line
    
    #print(line)
    print(items)
    
f_in.close()  # Close file

['chr10', '51406', 'G', 'A', '6', '3']
['chr10', '59101', 'A', 'T', '6', '3']
['chr10', '59112', 'A', 'C', '7', '2']
['chr10', '61001', 'A', 'T', '11', '3']
['chr10', '161375', 'A', 'G', '13', '5']
['chr10', '161561', 'A', 'C', '6', '5']
['chr10', '161562', 'T', 'A', '6', '5']
['chr10', '393574', 'A', 'G', '8', '2']
['chr10', '465981', 'A', 'C', '7', '2']
['chr10', '1076409', 'G', 'A', '3', '3']
['chr10', '1076415', 'T', 'C', '3', '3']
['chr10', '1330441', 'A', 'T', '9', '2']
['chr10', '1435253', 'C', 'T', '4', '3']
['chr10', '1435288', 'T', 'C', '3', '3']
['chr10', '1544709', 'T', 'C', '11', '4']
['chr10', '1567026', 'G', 'A', '4', '6']
['chr10', '1613426', 'T', 'C', '3', '7']
['chr10', '1613434', 'T', 'C', '3', '9']
['chr10', '1613481', 'C', 'T', '6', '11']
['chr10', '1982349', 'T', 'A', '2', '2']
['chr10', '1982350', 'G', 'A', '2', '2']
['chr10', '1982351', 'A', 'G', '2', '2']
['chr10', '2010618', 'C', 'T', '5', '4']
['chr10', '2246198', 'G', 'A', '15', '6']
['chr10', '2443102', 'C'

#### Supplementary: Converting data type
"Counts of REF allele" and "Counts of ALT allele" are extracted from `items` as strings.
So, you need to convert them from strings to integers before calculating the SNP-index.   

=== Basic syntax ===  
```python
int(String or float number)
```

=== Description ===
- `int()`
    - Usage: int(String or float number)
    - Converts the value to an integer.
- `float()`
    - Converts the value to a float number.
- `str()`
    - Converts the value to a string.
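
The following sketch shows why the conversion matters: `+` joins strings but adds integers. All values here are small examples chosen for illustration.

```python
ref = '6'   # Values read from a text file are strings
alt = '3'

as_text = ref + alt               # String concatenation: '63'
as_number = int(ref) + int(alt)   # Integer addition: 9

truncated = int(3.9)     # A float is truncated toward zero: 3
ratio = float('0.25')    # String to float: 0.25
label = str(0.5)         # Float to string: '0.5'

print(as_text, as_number, truncated, ratio, label)
```

Note that `int()` accepts only strings that look like integers; a call such as `int('3.9')` raises a `ValueError`.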


### 6. Write into a file<a name="1.6"></a>

Add the calculated SNP-index to the end of each line,
and write the line to a file.

＊ATTENTION＊  
- To append the SNP-index to a string, you need to convert its value from a float number to a string. 
- To output a tab-delimited line, you need to add the tab code `\t` between the existing line and the SNP-index.
- Because the line feed code `\n` was removed in STEP 3, you need to add a new `\n` in `write()`.

=== Basic syntax ===
```python
f_out = open('<file name>', 'w')
f_out.write(STRINGS)
f_out.close()
```
=== Description ===
- write()
    - Usage: FILE_OBJECT.write("any string")
    - Unlike `print()`, `write()` does not add a line feed code automatically.
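
A minimal sketch of writing one line with the SNP-index appended. The output path `snpindex_example.txt` in a temporary directory is an assumption made so that the example does not touch the practice files.

```python
import os
import tempfile

# Hypothetical output file, used only for this sketch
out_path = os.path.join(tempfile.mkdtemp(), "snpindex_example.txt")

line = "chr10\t51406\tG\tA\t6\t3"   # First line of the sample file, without \n
snpindex = 3 / (6 + 3)

f_out = open(out_path, "w")
f_out.write(line + "\t" + str(snpindex) + "\n")   # Tab, value as string, then line feed
f_out.close()

# Read the file back to check what was written
f_check = open(out_path, "r")
written = f_check.read()
f_check.close()
print(repr(written))
```

If the `"\n"` were left out, every record would be written onto a single long line.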

In [12]:
### Add code below ###

dataset = 'mutmap_bulk.txt' #  Input file name

f_in   = open(dataset, 'r')  # Open file with read-only mode

# Read a file line-by-line
for line in f_in:
    
    line = line.rstrip()  # Remove line feed code
    items = line.split('\t')  # Split one line
    snpindex = int(items[5]) / (int(items[4]) + int(items[5]))  # Calculate SNP-index
    
    #print(line)
    #print(items)
    #print(items[1])
    print(snpindex)
    
f_in.close()  # Close file

0.3333333333333333
0.3333333333333333
0.2222222222222222
0.21428571428571427
0.2777777777777778
0.45454545454545453
0.45454545454545453
0.2
0.2222222222222222
0.5
0.5
0.18181818181818182
0.42857142857142855
0.5
0.26666666666666666
0.6
0.7
0.75
0.6470588235294118
0.5
0.5
0.5
0.4444444444444444
0.2857142857142857
0.4444444444444444
0.3125
0.3125
0.6
0.8947368421052632
0.8974358974358975
0.4
0.375
0.2857142857142857
0.2857142857142857
0.36363636363636365
0.4444444444444444
0.23529411764705882
0.45454545454545453
0.375
0.45454545454545453
0.4
0.2857142857142857
0.5555555555555556
0.21428571428571427
0.42857142857142855
0.2857142857142857
0.3333333333333333
0.38461538461538464
0.4166666666666667
0.35714285714285715
0.3125
0.3333333333333333
0.3
0.3333333333333333
0.3333333333333333
0.4
0.375
0.3333333333333333
0.9333333333333333
0.9565217391304348
0.9565217391304348
0.8636363636363636
0.8245614035087719
0.6888888888888889
0.6363636363636364
0.4
0.6
0.3333333333333333
0.3333333333333333
0.33

#### Supplementary: Embedding the values of variables into printed strings

If you want to embed the values of variables into a printed string, you can use the following code.

```python
x = 5.3
y = 3.4

print('%d + %d = %d' % (x, y, x + y))
# Print out => 5 + 3 = 8

print('%f + %.3f = %.2f' % (x, y, x + y))
# Print out => 5.300000 + 3.400 = 8.70

print('%s + %s = %s' % (x, y, x + y))
# Print out => 5.3 + 3.4 = 8.7
```

The argument of `print()` has two parts,
e.g. `'%d + %d = %d'` and `(x, y, x + y)`.


The former part `'%d + %d = %d'` is the template of the string to be printed.

The template contains three `%d`.  
Each `%d` in the former part is replaced by the corresponding value of `(x, y, x + y)` in the latter part.  
- The first `%d` is replaced by the value of `x`.  
- The second `%d` is replaced by the value of `y`.  
- The third `%d` is replaced by the value of `x + y`.  

You can also specify the data type of the substituted value with `%d`, `%f` or `%s`.
- `%d`: Integer (a float number is truncated to an integer)
- `%f`: Float number (Usage: %.3f => 0.123; %.8f => 0.12345678)
- `%s`: String
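
Applied to this practice, the same technique can build an output line with the SNP-index rounded to three decimal places. This is only one possible formatting choice; the completed code in this notebook writes the full value with `str()`.

```python
line = "chr10\t51406\tG\tA\t6\t3"   # First line of the sample file, without \n
snpindex = 3 / (6 + 3)

# %s inserts the line as it is; %.3f inserts the float with 3 decimal places
out_line = '%s\t%.3f' % (line, snpindex)
print(out_line)

# The same rounding with an f-string (available in Python 3.6 and later)
rounded = f'{snpindex:.3f}'
print(rounded)   # 0.333
```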

In [13]:
x = 5.3
y = 3.4

print('%d + %d = %d' % (x, y, x + y))
# Print out => 5 + 3 = 8

print('%f + %.3f = %.2f' % (x, y, x + y))
# Print out => 5.300000 + 3.400 = 8.70

print('%s + %s = %s' % (x, y, x + y))
# Print out => 5.3 + 3.4 = 8.7

5 + 3 = 8
5.300000 + 3.400 = 8.70
5.3 + 3.4 = 8.7


In [14]:
### Completed code ###

dataset = 'mutmap_bulk.txt' #  Input file name
outdata = 'mutmap_snpindex.txt' # Output file name

f_in   = open(dataset, 'r')  # Open file with read-only mode
f_out = open(outdata, 'w')  # Open file with writing mode

# Read a file line-by-line
for line in f_in:
    
    line = line.rstrip()  # Remove line feed code
    items = line.split('\t')  # Split one line
    snpindex = int(items[5]) / (int(items[4]) + int(items[5]))  # Calculate SNP-index
    
    out_line = line + '\t' + str(snpindex)
        # Add SNP-index at the end of the line
        # Join the line and SNP-index with a tab code because the output is a tab-delimited text file
    
    #print(line)
    #print(items)
    #print(items[1])
    #print(snpindex)
    print(out_line)
    
    f_out.write("%s\n" % out_line)  # Write into the "f_out" file
    
f_in.close()  # Close file
f_out.close()

chr10	51406	G	A	6	3	0.3333333333333333
chr10	59101	A	T	6	3	0.3333333333333333
chr10	59112	A	C	7	2	0.2222222222222222
chr10	61001	A	T	11	3	0.21428571428571427
chr10	161375	A	G	13	5	0.2777777777777778
chr10	161561	A	C	6	5	0.45454545454545453
chr10	161562	T	A	6	5	0.45454545454545453
chr10	393574	A	G	8	2	0.2
chr10	465981	A	C	7	2	0.2222222222222222
chr10	1076409	G	A	3	3	0.5
chr10	1076415	T	C	3	3	0.5
chr10	1330441	A	T	9	2	0.18181818181818182
chr10	1435253	C	T	4	3	0.42857142857142855
chr10	1435288	T	C	3	3	0.5
chr10	1544709	T	C	11	4	0.26666666666666666
chr10	1567026	G	A	4	6	0.6
chr10	1613426	T	C	3	7	0.7
chr10	1613434	T	C	3	9	0.75
chr10	1613481	C	T	6	11	0.6470588235294118
chr10	1982349	T	A	2	2	0.5
chr10	1982350	G	A	2	2	0.5
chr10	1982351	A	G	2	2	0.5
chr10	2010618	C	T	5	4	0.4444444444444444
chr10	2246198	G	A	15	6	0.2857142857142857
chr10	2443102	C	A	5	4	0.4444444444444444
chr10	2470700	C	T	11	5	0.3125
chr10	2470703	A	T	11	5	0.3125
chr10	2755964	T	C	4	6	0.6
chr10	2757274	A	C	4	34	0.894736842105263