# Advanced random IR
Week2: 13.03-19.03  
Stage: model Zero - a simple model  
This is a tool to generate random integration regions (IR) on a genome of choice.
#### Authors: 
Elvira Mingazova, Saira Afzal, Raffaele Fronza 2017

## Week 2 plan:
1. Generalization of the genome dimension file
2. Checking the input file
3. Testing the script with three different genomes: human, mouse and honey bee
4. Finding out the differences between:
    + hg38 and hg19  
    + hg38 versions 
5. Plotting and describing the empirical distribution function on chr1 with a sufficient number of samples

### 1. Generalize genome dimension file  
In the previous notebook, getRegionW1.ipynb, the lengths of the chromosomes were **hard coded**. Here we change that by adding a new input argument, -i, which takes a text file with the chromosome names in the first column and their lengths in bp in the second. The two columns are separated by a tab or a space.  
Below is an example of the text file that should be provided.

In [92]:
#To run a bash command inside the notebook, prefix it with "!"
!cat ../scripts/hg38.txt| head -n5

chr1	224999719
chr2	237712649
chr3	194704827
chr4	187297063
chr5	177702766


Now we can modify the script in a way that it processes the text file and generates random integration regions on randomly chosen chromosomes.

In [93]:
#Create an empty dictionary to fill in the contents of each line in the .txt file as key:value pairs
dimension_dict = {}
with open('/Users/elming/Advanced-random-IR/scripts/hg38.txt') as dimension_file:
    for line in dimension_file:
        
        key, value = line.split()
        #fill the dictionary and make sure the length is of a type int
        dimension_dict[key.strip()]=int(value.strip())
#the with-block closes the file automatically, so no explicit close() is needed
print dimension_dict

{'chr13': 95559980, 'chr12': 130303534, 'chr11': 131130853, 'chr10': 131624737, 'chr17': 77800220, 'chr16': 78884754, 'chr15': 81341915, 'chr14': 88290585, 'chr19': 55785651, 'chr18': 74656155, 'chr24': 57741652, 'chr22': 34893953, 'chr23': 151058754, 'chr20': 59505254, 'chr21': 34171998, 'chr7': 154952424, 'chr6': 167273993, 'chr5': 177702766, 'chr4': 187297063, 'chr3': 194704827, 'chr2': 237712649, 'chr1': 224999719, 'chr9': 120312298, 'chr8': 142612826}


Now we are ready to generate the integration regions

In [94]:
import random
#choose random chromosome
chrom = random.choice(dimension_dict.keys())
print chrom

chr11


In [95]:
#select a site on that chromosome
start = random.randint(1,dimension_dict[chrom])
print start

27837706
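
The two cells above can be folded into one helper. The following is a minimal sketch, not the project's code: the name `random_site` and the toy dictionary are made up for illustration. It passes a sorted list to `choice` because `dict.keys()` returns a list only in Python 2; in Python 3 it is a view that `random.choice` cannot index. The sketch is written to run on both versions.

```python
import random

def random_site(dimensions, rng=random):
    """Pick a chromosome uniformly at random, then a 1-based position on it."""
    # sorted() returns a list on Python 2 and 3 and gives a stable order for seeding
    chrom = rng.choice(sorted(dimensions))
    # randint includes both endpoints, so 1 <= start <= chromosome length
    start = rng.randint(1, dimensions[chrom])
    return chrom, start

# toy lengths, hypothetical values for illustration only
toy = {'chr1': 1000, 'chr2': 500}
rng = random.Random(42)  # seeded generator so that runs are repeatable
chrom, start = random_site(toy, rng)
print("{}\t{}".format(chrom, start))
```

Because the chromosome is drawn first with equal probability, a given base on a short chromosome is more likely to be hit than a given base on a long one.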


Now let's combine the code from getRegionW1.py and this notebook and write it to getRegionW2.py.

In [96]:
%%writefile ../scripts/getRegionW2.py
#parser setup
import sys
import argparse
parser = argparse.ArgumentParser()
parser.prog = 'getRegionW2.py'
parser.description = 'You can provide the program with four parameters through the terminal'
parser.add_argument("-n", type=int, help='Number of integration regions generated')
parser.add_argument("-r", type=int, help='Range-value: defines an interval where the IR is located')
parser.add_argument("-d", type=int, help='Delta-value: expands the range with the value provided by user')
#add new input argument
parser.add_argument("-i", help='Input-file .txt with names of chromosomes and their lengths in bp organized into two columns and separated by \t')
namespace = parser.parse_args((sys.argv[1:]))
dimension_dict = {}
with open(namespace.i) as dimension_file:
    for line in dimension_file:
        #split on any whitespace so that tab- and space-separated files both work
        key, value = line.split()
        dimension_dict[key.strip()]=int(value.strip())
#loop on n random IR
import random
count = 1
print "chr#", '\t', 'Start', '\t\t', 'End', '\t\trnd#'
for i in range(namespace.n):
    #select a chromosome
    chrom = random.choice(dimension_dict.keys())

    #select a site on that chromosome
    start = random.randint(1,dimension_dict[chrom])

    #select a random region
    end = start + random.randint(0,namespace.r) + namespace.d
    if len(str(chrom))==1:
        print chrom,'',"\t",start,"\t",end,"\trnd{}".format(count)
    else:
        print chrom,"\t",start,"\t",end,"\trnd{}".format(count)
    count +=1


Overwriting ../scripts/getRegionW2.py


We can now run the script with the input file given on the command line. Any of the parameters can be changed.

In [97]:
!python2 ../scripts/getRegionW2.py -n5 -r0 -d1 -i "../scripts/hg38.txt"

chr# 	Start 		End 		rnd#
chr14 	1944671 	1944672 	rnd1
chr8 	79732127 	79732128 	rnd2
chr4 	22139988 	22139989 	rnd3
chr9 	14315167 	14315168 	rnd4
chr2 	196919476 	196919477 	rnd5
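
The relation between the Start and End columns follows from the line `end = start + random.randint(0,namespace.r) + namespace.d` in the script. A small sketch of that arithmetic, using a hypothetical helper `make_region` that is not part of getRegionW2.py:

```python
import random

def make_region(start, r, d, rng=random):
    """Return (start, end) with end = start + (random integer in 0..r) + d."""
    end = start + rng.randint(0, r) + d
    return start, end

# With -r0 -d1 the random term is always 0, so end is exactly start + 1
print(make_region(1944671, 0, 1))
```

This matches the output above, where every End is Start + 1. Note that the script does not compare `end` with the chromosome length, so a region that starts near the end of a chromosome can extend past it.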


### 2. Check the script  
In this part we add checks to the script to validate the input file. It has to fulfill the following conditions:
- the file is well formatted (2 columns)
- the first column contains a string "chrXY" where X and Y are digits (one or two digits are accepted)
- the second column contains an integer
- the integer is not negative

To test the file we can define "checking" functions.
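
As a compact illustration of the four conditions, they can be applied to a single line at a time. This is only a sketch under assumptions: `validate_line` and `CHROM_NAME` are hypothetical names, and the notebook itself uses three separate functions that each read the whole file.

```python
import re

# "chr" plus one or two digits, case-insensitive, nothing after the digits
CHROM_NAME = re.compile(r'chr\d{1,2}\Z', re.IGNORECASE)

def validate_line(line):
    """Return None if the line is valid, otherwise a short error message."""
    fields = line.split()  # splits on tabs and spaces alike
    if len(fields) != 2:
        return 'expected 2 columns, found {}'.format(len(fields))
    name, length = fields
    if not CHROM_NAME.match(name):
        return '{} is not a chromosome name'.format(name)
    try:
        value = int(length)
    except ValueError:
        return 'could not convert {} to an integer'.format(length)
    if value < 0:
        return 'negative chromosome length: {}'.format(value)
    return None

# made-up sample lines, one per failure mode plus one valid line
for sample in ['chr1\t224999719', 'group1 100', 'chr15 81341915i', 'chr2 -5', 'chr3']:
    print('{!r} -> {}'.format(sample, validate_line(sample)))
```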

In [101]:
#a function that checks whether the input file contains two columns
def checkFormatting():
    check=True
    with open('../scripts/hg38.txt') as dimension_file:
        #go through all of the lines and check whether exactly two columns are present
        for line in dimension_file:
            result=line.split()
            if len(result)==2:
                continue
            else:
                print 'Checking the format: the file has to contain two columns'
                check=False
                break
    if check:
        print "Checking the format: the file is well formatted"
        
checkFormatting()


Checking the format: the file is well formatted


In [102]:
#a function that checks whether the first column contains a string "chrXY" where X and Y are digits
def checkColumn1():
    check = True
    #use regEx
    import re
    #save a regEx pattern matching "chr" followed by one or two digits
    #\d{1,2} covers both alternatives chr\d and chr\d\d; the raw string keeps the backslashes intact
    pattern = re.compile(r'chr\d{1,2}\Z',re.IGNORECASE)
    with open('../scripts/hg38wrong2.txt') as dimension_file:
        column1=[]
        lines=dimension_file.readlines()
        for line in lines:
            column1.append(line.split()[0].strip())
        for word in column1:
            if re.match(pattern,word):
                continue
            else:
                print "Checking the first column: {} is not a chromosome name".format(word)
                check = False
                break
        if check:
            print "Checking the first column: the names of chromosomes are correct"
    
checkColumn1()

Checking the first column: group1 is not a chromosome name


In the file hg38wrong2.txt one of the chromosome names in the first column is wrong ("group1"), so it does not match the regex and a help message is printed. In the following cell you can try the regex on different words.

In [103]:
a = "Chr1"
b = "chr20"
c = "group1"
import re
pattern=re.compile(r'chr\d{1,2}\Z',re.IGNORECASE)
print re.match(pattern,a).group()

Chr1


And one more demonstration:

In [104]:
if re.match(pattern,c):
    print "True"
else:
    print "False"

False
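
To see where the pattern draws the line, it helps to try a few more names. The sketch below uses the compact form `chr\d{1,2}\Z`, which is assumed here to be equivalent to the two-alternative form `(chr\d\Z|chr\d\d\Z)`; the list of names is made up for the demonstration.

```python
import re

# compact form: "chr" followed by one or two digits, then end of string
pattern = re.compile(r'chr\d{1,2}\Z', re.IGNORECASE)

names = ['Chr1', 'chr20', 'chr123', 'chrX', 'group1', 'xchr1']
for name in names:
    # re.match anchors at the start of the string, \Z anchors at the end
    print('{}\t{}'.format(name, bool(pattern.match(name))))
```

Only `Chr1` and `chr20` match. Names such as `chrX` are rejected, so sex chromosomes have to be given as numbers in the input file (hg38.txt lists `chr23` and `chr24`).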


In [105]:
#a function that checks whether the second column contains a non-negative integer
def checkColumn2():
    check = True
    with open('../scripts/hg38wrong3.txt') as dimension_file:
        column2=[]
        lines=dimension_file.readlines()
        for line in lines:
            try:
                #extract the second column to a list and try to convert each element into an integer
                column2.append(int(line.split()[1].strip()))
            #catch a ValueError: invalid literal for int()
            except ValueError:
                check = False
                print "Checking the second column: could not convert {} to an integer".format(line.split()[1].strip())
    if check:
        for length in column2:
            if length<0:
                print "Checking the second column: you provided a negative chromosome length: {}".format(length)
                check = False
    if check:
        print "Checking the second column: the lengths of chromosomes are correct"

checkColumn2()

Checking the second column: could not convert 81341915i to an integer


Now we can add the checking functions to the original script and verify that it works.

In [106]:
%%writefile ../scripts/getRegionW2.py

#parser setup
import sys
import argparse
parser = argparse.ArgumentParser()
parser.prog = 'getRegionW2.py'
parser.description = 'You can provide the program with four parameters through the terminal'
parser.add_argument("-n", type=int, help='Number of integration regions generated')
parser.add_argument("-r", type=int, help='Range-value: defines an interval where the IR is located')
parser.add_argument("-d", type=int, help='Delta-value: expands the range with the value provided by user')
#add new input argument
parser.add_argument("-i", help='Input-file .txt with names of chromosomes and their lengths in bp organized into two columns and separated by \t')
namespace = parser.parse_args((sys.argv[1:]))

#a function that checks whether the input file contains two columns
def checkFormatting():
    check=True
    with open(namespace.i) as dimension_file:
        #go through all of the lines and check whether exactly two columns are present
        for line in dimension_file:
            result=line.split()
            if len(result)==2:
                continue
            else:
                print 'Checking format: the text file has to contain two columns'
                check=False
                break
    if check:
        print "Checking format: the file is well formatted"
    return check

#a function that checks whether the first column contains a string "chrXY" where X and Y are digits
def checkColumn1():
    check = True
    #use regEx
    import re
    #save a regEx pattern matching "chr" followed by one or two digits
    pattern = re.compile(r'chr\d{1,2}\Z',re.IGNORECASE)
    with open(namespace.i) as dimension_file:
        column1=[]
        lines=dimension_file.readlines()
        for line in lines:
            column1.append(line.split()[0].strip())
        for word in column1:
            if re.match(pattern,word):
                continue
            else:
                print "Checking the first column: {} is not a chromosome name".format(word)
                check = False
                break
        if check:
            print "Checking the first column: the names of chromosomes are correct"
    return check
    
#a function that checks whether the second column contains a non-negative integer
def checkColumn2():
    check = True
    with open(namespace.i) as dimension_file:
        column2=[]
        lines=dimension_file.readlines()
        for line in lines:
            try:
                #extract the second column to a list and try to convert each element into an integer
                column2.append(int(line.split()[1].strip()))
            #catch a ValueError: invalid literal for int()
            except ValueError:
                check = False
                print "Checking the second column: could not convert {} to an integer".format(line.split()[1].strip())
    if check:
        for length in column2:
            if length<0:
                print "Checking the second column: you provided a negative chromosome length: {}".format(length)
                check = False
    if check:
        print "Checking the second column: the lengths of chromosomes are correct"
    return check

#check the input file    
check=checkFormatting()
if check==True:
    check=checkColumn1()
    if check==True:
        check=checkColumn2()
        if check == True:
            
    
            dimension_dict = {}
            with open(namespace.i) as dimension_file:
                for line in dimension_file:
                    #split on any whitespace, consistent with the checking functions
                    key, value = line.split()
                    dimension_dict[key.strip()]=int(value.strip())
            #loop on n random IR
            import random
            count = 1
            print "chr#", '\t', 'Start', '\t\t', 'End', '\t\trnd#'
            for i in range(namespace.n):
                #select a chromosome
                chrom = random.choice(dimension_dict.keys())

                #select a site on that chromosome
                start = random.randint(1,dimension_dict[chrom])

                #select a random region
                end = start + random.randint(0,namespace.r) + namespace.d
                if len(str(chrom))==1:
                    print chrom,'',"\t",start,"\t",end,"\trnd{}".format(count)
                else:
                    print chrom,"\t",start,"\t",end,"\trnd{}".format(count)
                count +=1

Overwriting ../scripts/getRegionW2.py


In [107]:
!python2 ../scripts/getRegionW2.py -n5 -r0 -d1 -i "../scripts/hg38.txt"

Checking format: the file is well formatted
Checking the first column: the names of chromosomes are correct
Checking the second column: the lengths of chromosomes are correct
chr# 	Start 		End 		rnd#
chr22 	21845635 	21845636 	rnd1
chr12 	93704051 	93704052 	rnd2
chr21 	8587751 	8587752 	rnd3
chr18 	65563405 	65563406 	rnd4
chr7 	106310384 	106310385 	rnd5


The input file above passed all the checking functions and the program gave the expected output. If we run the program with a malformed input file, we get one of the help messages:

In [108]:
!python2 ../scripts/getRegionW2.py -n5 -r0 -d1 -i "../scripts/hg38wrong1.txt"

Checking format: the text file has to contain two columns


In [109]:
!python2 ../scripts/getRegionW2.py -n5 -r0 -d1 -i "../scripts/hg38wrong2.txt"

Checking format: the file is well formatted
Checking the first column: group1 is not a chromosome name


I provide the program with different malformed files to test different kinds of errors. In hg38wrong3.txt I replaced "group1" with "chr16"; now something else is wrong:

In [110]:
!python2 ../scripts/getRegionW2.py -n5 -r0 -d1 -i "../scripts/hg38wrong3.txt"

Checking format: the file is well formatted
Checking the first column: the names of chromosomes are correct
Checking the second column: could not convert 81341915i to an integer


### 3. Test the script with three different genomes: human, mouse and honey bee  
The text files containing chromosome lengths are stored in the scripts folder. The data was derived from the UCSC database:
* Human:
 - http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&chromInfoPage=
 - http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&chromInfoPage=
* Mouse:  
 - http://genome.ucsc.edu/cgi-bin/hgTracks?db=mm10&chromInfoPage=
* Honeybee A. mellifera:
 - http://genome.ucsc.edu/cgi-bin/hgTracks?db=apiMel2&chromInfoPage=




In [111]:
!python2 ../scripts/getRegionW2.py -n5 -r0 -d1 -i "../scripts/hg19.txt"

Checking format: the file is well formatted
Checking the first column: the names of chromosomes are correct
Checking the second column: the lengths of chromosomes are correct
chr# 	Start 		End 		rnd#
chr22 	16985086 	16985087 	rnd1
chr6 	139903941 	139903942 	rnd2
chr16 	61006902 	61006903 	rnd3
chr7 	114179228 	114179229 	rnd4
chr22 	41018440 	41018441 	rnd5


In [86]:
!python2 ../scripts/getRegionW2.py -n5 -r0 -d1 -i "../scripts/mm10.txt"

Checking format: the file is well formatted
Checking the first column: chromosome names are correct
Checking the second column: the lengths of chromosomes are correct
chr# 	Start 		End 		rnd#
chr1 	146311823 	146311824 	rnd1
chr12 	49185545 	49185546 	rnd2
chr6 	82452061 	82452062 	rnd3
chr20 	165263502 	165263503 	rnd4
chr6 	91291194 	91291195 	rnd5


In [84]:
!python2 ../scripts/getRegionW2.py -n5 -r0 -d1 -i "../scripts/apiMel2.txt"

Checking format: the file is well formatted
Checking the first column: chromosome names are correct
Checking the second column: the lengths of chromosomes are correct
chr# 	Start 		End 		rnd#
chr11 	3579349 	3579350 	rnd1
chr10 	3785841 	3785842 	rnd2
chr15 	1017642 	1017643 	rnd3
chr8 	3315049 	3315050 	rnd4
chr12 	7042320 	7042321 	rnd5


The human hg38 file was already tested in section 2 above.

### 4. Differences between versions of the human genome:
 - *hg38 and hg19*