# Advanced random IR
Week2: 13.03-19.03  
Stage: model Zero - a simple model  
This is a tool to generate random integration regions on a genome of choice.
#### Authors: 
Elvira Mingazova, Saira Afzal, Raffaele Fronza 2017

## Week 2 plan:
1. Generalization of the genome dimension file
2. Checking the input file
3. Testing the script with three different genomes: human, mouse and honey bee
4. Finding out the differences between:
    + hg 38 and hg19  
    + hg 38 versions 
5. Plotting and describing the empirical distribution function on chr1 with a sufficient number of samples

### 1. Generalize genome dimension file  
In the previous notebook file getRegionW1.ipynb the lengths of the chromosomes were **hard coded**. Here we change that by adding a new input argument -i. We provide Python with a text file in which the first column holds the chromosome names and the second column their lengths in bp. The two columns are separated by a tab or a space.  
Below is an example of the text file that should be provided.

In [3]:
#To use bash commands inside the notebook, put "!" before the command
!cat ../scripts/hg38.txt| head -n5

chr1	224999719
chr2	237712649
chr3	194704827
chr4	187297063
chr5	177702766


Now we can modify the script in a way that it processes the text file and generates random integration regions on randomly chosen chromosomes.

In [6]:
#Create an empty dictionary to hold each line of the .txt file as a key:value pair
dimension_dict = {}
with open('/Users/elming/Advanced-random-IR/scripts/hg38.txt') as dimension_file:
    for line in dimension_file:
        key, value = line.split()
        #fill the dictionary and make sure the length is of type int
        dimension_dict[key.strip()]=int(value.strip())
#the with statement closes the file, so no explicit close() is needed
print dimension_dict

{'chr13': 95559980, 'chr12': 130303534, 'chr11': 131130853, 'chr10': 131624737, 'chr17': 77800220, 'chr16': 78884754, 'chr15': 81341915, 'chr14': 88290585, 'chr19': 55785651, 'chr18': 74656155, 'chr24': 57741652, 'chr22': 34893953, 'chr23': 151058754, 'chr20': 59505254, 'chr21': 34171998, 'chr7': 154952424, 'chr6': 167273993, 'chr5': 177702766, 'chr4': 187297063, 'chr3': 194704827, 'chr2': 237712649, 'chr1': 224999719, 'chr9': 120312298, 'chr8': 142612826}
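
The same parsing can be wrapped in a function so that the path is not hard coded. This is only a minimal sketch, not the project's code: the name `load_dimensions` is hypothetical, and the function-style `print(...)` is an assumption made so that the snippet also runs under Python 3.

```python
import os
import tempfile


def load_dimensions(path):
    """Read 'name length' pairs (tab or space separated) into a dict."""
    dimensions = {}
    with open(path) as dimension_file:
        for line_number, line in enumerate(dimension_file, 1):
            fields = line.split()
            if not fields:
                # skip empty lines, e.g. a blank line at the end of the file
                continue
            if len(fields) != 2:
                raise ValueError(
                    "line {}: expected 2 columns, got {}".format(line_number, len(fields)))
            name, length = fields
            dimensions[name] = int(length)
    return dimensions


# Demonstration on a temporary file in the hg38.txt layout:
# one line is tab separated, the other space separated
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("chr1\t224999719\nchr2 237712649\n")
    demo_path = handle.name

demo_dimensions = load_dimensions(demo_path)
os.remove(demo_path)
print(demo_dimensions["chr1"])
```

Raising a `ValueError` with the line number is one possible design choice; it makes a malformed file fail early instead of producing a partly filled dictionary.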


Now we are ready to generate the integration regions

In [11]:
import random
#choose random chromosome
chrom = random.choice(dimension_dict.keys())
print chrom

chr22


In [12]:
#select a site on that chromosome
start = random.randint(1,dimension_dict[chrom])
print start

15092629
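
The two cells above rely on Python 2, where `dict.keys()` returns a list. Under Python 3 `random.choice` needs a sequence, so the keys have to be converted first. The sketch below shows one way to do both steps in a single function; the name `random_site` and the fixed seed are assumptions for illustration. As in the cells above, every chromosome is chosen with equal probability, regardless of its length.

```python
import random


def random_site(dimension_dict, rng=random):
    """Pick a chromosome uniformly, then a 1-based position on it."""
    # sorted() returns a list in both Python 2 and 3 and makes seeded runs repeatable
    chrom = rng.choice(sorted(dimension_dict))
    start = rng.randint(1, dimension_dict[chrom])
    return chrom, start


# toy lengths taken from the hg38.txt listing above
lengths = {"chr1": 224999719, "chr2": 237712649}
rng = random.Random(42)  # fixed seed (assumption) so the draw can be reproduced
chrom, start = random_site(lengths, rng)
print(chrom, start)
```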


We can now combine the code from getRegionW1.py with the code from this notebook and write the result to getRegionW2.py.

In [13]:
%%writefile ../scripts/getRegionW2.py
#parser setup
import sys
import argparse
parser = argparse.ArgumentParser()
parser.prog = 'getRegionW2.py'
parser.description = 'You can provide the program with four parameters through the terminal'
parser.add_argument("-n", type=int, help='Number of integration regions generated')
parser.add_argument("-r", type=int, help='Range-value: defines an interval where the IR is located')
parser.add_argument("-d", type=int, help='Delta-value: expands the range with the value provided by user')
#add new input argument
parser.add_argument("-i", help='Input file (.txt) with names of chromosomes and their lengths in bp, organized into two columns separated by a tab or space')
namespace = parser.parse_args((sys.argv[1:]))
dimension_dict = {}
with open(namespace.i) as dimension_file:
    for line in dimension_file:
        #split on any whitespace so that both tab- and space-separated files work
        key, value = line.split()
        dimension_dict[key.strip()]=int(value.strip())
#loop on n random IR
import random
count = 1
print "chr#", '\t', 'Start', '\t\t', 'End', '\t\trnd#'
for i in range(namespace.n):
    #select a chromosome
    chrom = random.choice(dimension_dict.keys())

    #select a site on that chromosome
    start = random.randint(1,dimension_dict[chrom])

    #select a random region
    end = start + random.randint(0,namespace.r) + namespace.d
    if len(str(chrom))==1:
        print chrom,'',"\t",start,"\t",end,"\trnd{}".format(count)
    else:
        print chrom,"\t",start,"\t",end,"\trnd{}".format(count)
    count +=1


Overwriting ../scripts/getRegionW2.py


We can run the script with the new input file provided on the command line. Any of the parameters can be changed.

In [14]:
!python2 ../scripts/getRegionW2.py -n5 -r0 -d1 -i "../scripts/hg38.txt"

chr# 	Start 		End 		rnd#
chr18 	62928698 	62928699 	rnd1
chr4 	37775302 	37775303 	rnd2
chr10 	82930932 	82930933 	rnd3
chr24 	2854206 	2854207 	rnd4
chr15 	10830799 	10830800 	rnd5
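
In the script the end of a region is computed as `start + random.randint(0, r) + d`, so with `-r0 -d1` every region ends exactly one base after its start, as the output above shows. The sketch below isolates that step. The name `make_region` and the optional `chrom_length` clamp are hypothetical: getRegionW2.py itself does not clamp, so a region that starts close to the end of a chromosome can extend past it.

```python
import random


def make_region(start, range_value, delta, chrom_length=None, rng=random):
    """Return (start, end) with end = start + randint(0, range_value) + delta."""
    end = start + rng.randint(0, range_value) + delta
    if chrom_length is not None:
        # optional clamp (not part of getRegionW2.py): keep the end on the chromosome
        end = min(end, chrom_length)
    return start, end


# with range 0 and delta 1 the region always ends one base after its start
region = make_region(62928698, 0, 1)
print(region)

# near the chromosome end the clamp keeps the region inside the chromosome
clamped = make_region(100, 0, 50, chrom_length=120)
print(clamped)
```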


### 2. Check the input file  
In this part we want to add some testing options to the script so that it can check the input file. The file has to fulfill the following conditions:
- the file is well formatted (2 columns)
- the first column contains a string "chrXY" where X and Y are digits
- the second column contains an integer
- the integer is not negative

To test the file we can define "checking" functions.

In [90]:
#a function that checks whether the input file contains two columns
def checkFormatting():
    check=True
    with open('../scripts/hg38.txt') as dimension_file:
        #go through all of the lines and check whether exactly two columns are present
        for line in dimension_file:
            result=line.split()
            if len(result)!=2:
                print 'The text file has to contain two columns'
                check=False
                break
    #the with statement closes the file, so no explicit close() is needed
    if check:
        return "Checking result is positive: the file is well formatted"
    
        
checkFormatting()

'Checking result is positive: the file is well formatted'

In [89]:
#a function that checks whether the first column contains a string "chrXY" where X and Y are digits
def checkColumn1():
    check = True
    #use regEx
    import re
    #save a regEx pattern matching "chr" followed by one or two digits and nothing else
    #(case-insensitive); a character class such as [chr\d*] would also accept e.g. "rrrr"
    pattern = re.compile(r'chr\d{1,2}$',re.IGNORECASE)
    with open('../scripts/hg38wrong.txt') as dimension_file:
        column1=[]
        lines=dimension_file.readlines()
        for line in lines:
            column1.append(line.split('\t')[0].strip())
        for word in column1:
            if re.match(pattern,word):
                continue
            else:
                print "{} in the first column is not a chromosome name".format(word)
                check = False
                break
        if check == True:
            print "Checking result is positive: the first column contains names of chromosomes"
    
checkColumn1()

group1 in the first column is not a chromosome name


The following function is not ready yet

In [99]:
#a function that checks whether the second column contains an integer and that it is not negative
def checkColumn2():
    check = True
    with open('../scripts/hg38wrong.txt') as dimension_file:
        column2=[]
        lines=dimension_file.readlines()
        for line in lines:
            column2.append(int(line.split()[1].strip()))
        #here I have to catch a ValueError: invalid literal for int()
        print column2
        
            
    dimension_file.close()

checkColumn2()
    

ValueError: invalid literal for int() with base 10: '18729706i'
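
One possible way to finish this check is to wrap the `int()` conversion in `try`/`except ValueError` and report the offending line instead of stopping with a traceback. The sketch below is only an illustration: the name `check_column2_lines` is hypothetical, it takes a list of lines rather than a file path so it can be tried without hg38wrong.txt, and the sample lines (apart from the malformed value quoted in the error above) are made up.

```python
def check_column2_lines(lines):
    """Return a list of problems found in the second column (empty if none)."""
    problems = []
    for line_number, line in enumerate(lines, 1):
        fields = line.split()
        if len(fields) < 2:
            problems.append("line {}: second column is missing".format(line_number))
            continue
        try:
            length = int(fields[1])
        except ValueError:
            # int() raises ValueError for values such as '18729706i'
            problems.append("line {}: '{}' is not an integer".format(line_number, fields[1]))
            continue
        if length < 0:
            problems.append("line {}: {} is negative".format(line_number, length))
    return problems


# illustrative lines: one valid, one with the malformed value, one negative
sample = ["chr3\t194704827", "chr4\t18729706i", "chr5\t-177702766"]
problems = check_column2_lines(sample)
for problem in problems:
    print(problem)
```

Collecting all problems before reporting, rather than breaking at the first one, is a design choice; it lets the user fix the whole file in one pass.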

In [91]:
a=-3
type(a)

int

In [84]:
a = "Chr1"
b = "chr20"
c = "group1"
import re
pattern=re.compile('[chr\d*]{4,5}',re.IGNORECASE)
re.IGNORECASE

#print re.match(pattern, a)

2

In [86]:
print re.match(pattern,b).group()

chr20


In [57]:
if re.match(pattern,b):
    print "True"

True
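
The pattern `[chr\d*]{4,5}` used in the cells above is a character class: it accepts any 4 to 5 characters drawn from `c`, `h`, `r`, the digits and `*`, in any order, and `re.match` only anchors the match at the start of the string. The sketch below compares it with an anchored alternative. The anchored pattern assumes that valid names are "chr" followed by one or two digits, and the test names are made up for illustration.

```python
import re

# pattern from the cells above: a character class repeated 4 to 5 times
loose = re.compile(r'[chr\d*]{4,5}', re.IGNORECASE)
# anchored alternative (assumption: names are "chr" followed by one or two digits)
strict = re.compile(r'chr\d{1,2}$', re.IGNORECASE)

names = ["Chr1", "chr20", "group1", "rrrr", "chr1234"]
loose_hits = [name for name in names if loose.match(name)]
strict_hits = [name for name in names if strict.match(name)]
print(loose_hits)
print(strict_hits)
```

Both patterns reject "group1", but only the anchored one also rejects "rrrr" and "chr1234".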


Chromosome lengths:

* Mouse: http://genome.ucsc.edu/cgi-bin/hgTracks?db=mm10&chromInfoPage=
* Honeybee (A. mellifera): http://genome.ucsc.edu/cgi-bin/hgTracks?db=apiMel2&chromInfoPage=
