# Genome plotter

This IPython notebook creates a visual representation of the human genome that reflects its chemical composition (GC content), the distribution of protein-coding genes, and the location of known genome-wide association signals.

### Requirements:

* Previously downloaded and processed genomic data (prepared by `prepare_data.sh`): chunked chromosomes in BED format, the GWAS catalog and the GENCODE file.
* Non-standard Python libraries: [pandas](http://pandas.pydata.org/), [numpy](http://www.numpy.org/), [cairosvg](http://cairosvg.org/), [pybedtools](https://pythonhosted.org/pybedtools/)
* Just for the record: once cairo has been installed on OS X using Homebrew, Python won't be able to link to the libraries by default. Therefore the path pointing to the `cairo/lib` directory has to be exported:

```bash
export DYLD_FALLBACK_LIBRARY_PATH=~/homebrew/Cellar/cairo/1.14.12/lib/
```


### About the data:

In this work I used the GRCh38 build of the human genome (Ensembl release 83), GENCODE version , and the GWAS catalog downloaded on 2015.06.16 (genomic coordinates also in GRCh37). The downloadable GWAS catalog was extended with an in-house maintained positive control collection.


### Further readings and references:

* Human genome:

* GWAS:

### Data sources:

* [Ensembl](https://www.ensembl.org): the European genome database, where known genomic information is aggregated, including genes, transcripts, variations, regulatory elements and many more. They provide access to the human genome via an FTP server.

* [GWAS catalog](https://www.ebi.ac.uk/gwas/): manually curated collection of variations in the human genome with known phenotypic associations, including [breast size](https://www.ebi.ac.uk/gwas/search?query=Breast%20size) and [political ideology](https://www.ebi.ac.uk/gwas/search?query=Political%20ideology).

* [GENCODE](http://www.gencodegenes.org/): a reference resource of annotated genetic elements, for example genes, transcripts and exons.

In [1]:
import time
print ("Last modified:", (time.strftime("%d/%m/%Y")))

Last modified: 04/01/2018


## Stage 1.

In the first stage we further process the sequence data, and integrate GENCODE and GWAS data into a single dataframe and save it for plotting.

### Steps:

1. Read the bed files with the chunk numbers and the GC content.
2. Run bedtools to find out which chunks overlap with protein-coding genes and GWAS hits.
3. Assign a color to each chunk based on its GC content and its overlap with genes.
4. Construct a data frame from the above-mentioned data.
5. Save the data frame in a binary file (pickle).
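
Step 2 is delegated to bedtools, but the underlying test is simple. As a rough illustration only (plain Python, not the pybedtools call used in this notebook; the function name `overlaps` and the sample coordinates are made up for this sketch), two half-open BED intervals on the same chromosome overlap when each one starts before the other ends:

```python
def overlaps(a_start, a_end, b_start, b_end):
    """Return True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


# Hypothetical 480 bp chunks and one hypothetical gene interval:
chunks = [(10020, 10500), (10500, 10980), (10980, 11460)]
gene = (10900, 11000)

# 1 if the chunk overlaps the gene, 0 otherwise (like the "Genes" column):
flags = [int(overlaps(start, end, gene[0], gene[1])) for start, end in chunks]
print(flags)
```

Intervals that merely touch (one ends where the other starts) do not count as overlapping under the half-open convention.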

In [2]:
'''
Importing libraries:
'''

import gzip # For reading data
import pybedtools # For finding overlap between our chunks and genes and GWAS signals
import pandas # data handling
import numpy as np # Working with large arrays
import colorsys # Generate color gradient.
import cairosvg # Converting svg to image
import pickle # Saving dataframes to disk
import os.path # Checking if the data files are already there.

In [3]:
'''
Files used for the analysis. Let's check if they are really there:
'''

workingDir = os.getcwd()
Chromosome_file_loc = workingDir + "/data/Processed_chr%s.bed.gz"
GWAS_file_loc = workingDir + "/data/processed_GWAS.bed.gz"
GENCODE_file_loc = workingDir + "/data/GENCODE.merged.bed.gz"
outputDir = workingDir + "/processed_data/dataframe_chr%s.pkl"

if not os.path.isfile(GWAS_file_loc):
    print ("GWAS file is missing. Run `prepare_data.sh` first!")
if not os.path.isfile(GENCODE_file_loc):
    print ("GENCODE file is missing. Run `prepare_data.sh` first!")
if not os.path.isfile(Chromosome_file_loc % 11):
    print ("Processed chromosome files are missing! Run `prepare_data.sh` first!")

In [4]:
'''
Functions for generating nice colors and gradients
'''
def linear_gradient(start_hex, finish_hex="#FFFFFF", n=10):
    '''
    Returns a gradient list of (n) colors between
    two hex colors. start_hex and finish_hex
    should be the full six-digit color string,
    including the number sign (e.g. "#FFFFFF").
    '''
    # Starting and ending colors in RGB form
    s = hex_to_RGB(start_hex)
    f = hex_to_RGB(finish_hex)
    
    # Initialize a list of the output colors with the starting color
    RGB_list = [start_hex]
    
    # Calculate a color at each evenly spaced value of t from 1 to n-1
    for t in range(1, n):
    
        # Interpolate RGB vector for color at the current value of t
        curr_vector = [
            int(s[j] + (float(t)/(n-1))*(f[j]-s[j]))
            for j in range(3)
        ]

        # Add it to our list of output colors
        RGB_list.append(RGB_to_hex(curr_vector))

    return RGB_list

def hex_to_RGB(hex):
    ''' "#FFFFFF" -> [255,255,255] '''
    # Pass 16 to the integer function for change of base
    return [int(hex[i:i+2], 16) for i in range(1,6,2)]


def RGB_to_hex(RGB):
    ''' [255,255,255] -> "#FFFFFF" '''
    # Components need to be integers for hex to make sense
    RGB = [int(x) for x in RGB]
    # Two-digit, zero-padded hexadecimal for each component:
    return "#" + "".join("{0:02x}".format(v) for v in RGB)

def get_color(row, colors):
    '''
    Based on the values in the submitted row, this
    function picks a color from the color dictionary, and returns it.
    '''
    
    # Built-in parameters:
    threshold = 0.7 
    max_diff_value = 0.15

    # At first we get the index (clamped to the last step of the 20-step gradient):
    try:
        index = min(int(float(row["GC_content"])*20), 19)
    except (TypeError, ValueError):
        index = 0
    
    # Extracting color based on the gene and the index:
    try:
        color = colors[row["Genes"]][index]
    except (KeyError, IndexError):
        color = colors[0][index]
        
    # Ok, we have the color, now based on the column we make it a bit darker:
    if row["Column_frac"] > threshold:
        diff = (row["Column_frac"] - threshold)/(1 - threshold)
        factor = 1 - max_diff_value*diff 
        
        # Get rgb code of the hexa code:
        rgb_code = hex_to_RGB(color)
        
        # Get the hls code of the rgb:
        hls_code = colorsys.rgb_to_hls(rgb_code[0]/float(255), 
                                       rgb_code[1]/float(255), 
                                       rgb_code[2]/float(255))
        
        # Get the modifed rgb code:
        new_rgb = colorsys.hls_to_rgb(hls_code[0], 
                                      hls_code[1]*factor, 
                                      hls_code[2])
        
        # Get the modifed hexacode:
        color = RGB_to_hex([new_rgb[0] * 255,
                           new_rgb[1] * 255,
                           new_rgb[2] * 255])
        
    return color  

def def_chromosome_colors(chromosome):
    '''
    This function returns a dictionary of colors based on the chromosome number.
    input: chromosome number from 1 to 24 (23 and 24 stand for X and Y)
    output: {0: [twenty colors for intergenic], 1: [twenty colors for genic region]}
    
    (Twenty is the step number of the gradient showing GC content in the given chunk.)
    '''

    # The colors will be picked from the following array:
    color_set = [
        ["#b74242", "#b79e42"],
        ["#b75f42", "#b3b742"],
        ["#b77c42", "#96b742"],
        ["#b79a42", "#78b742"],
        ["#b7b742", "#5bb742"],
        ["#9ab742", "#42b746"],
        ["#7cb742", "#42b763"],
        ["#5fb742", "#42b780"],
        ["#42b742", "#42b79e"],
        ["#42b75f", "#42b3b7"],
        ["#42b77c", "#4296b7"],
        ["#42b79a", "#4278b7"],
        ["#42b7b7", "#425bb7"],
        ["#429ab7", "#4642b7"],
        ["#427cb7", "#6342b7"],
        ["#425fb7", "#8042b7"],
        ["#4242b7", "#9e42b7"],
        ["#5f42b7", "#b742b3"],
        ["#7c42b7", "#b74296"],
        ["#9a42b7", "#b74278"],
        ["#b742b7", "#b7425b"],
        ["#b7429a", "#b74642"],
        ["#b7427c", "#b76342"],
        ["#b7425f", "#b78042"]
    ]

    # Build one 20-step gradient for each of the two region types:
    colors = {}
    colors[0] = linear_gradient(color_set[chromosome - 1][0], n=20) # For non genes.
    colors[1] = linear_gradient(color_set[chromosome - 1][1], n=20) # For regions overlapping with genes

    return colors
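
The gradient lookup relies on `int(GC * 20)`. With a 20-step gradient the valid indices are 0 to 19, so a chunk with a GC content of exactly 1.0 has to be clamped into the last bin. A minimal sketch of such a lookup, assuming that unparseable values should fall back to the first bin (`gc_to_index` is a hypothetical helper, not part of the notebook):

```python
def gc_to_index(gc_content, n_steps=20):
    """Map a GC fraction (0..1) to a gradient index in 0..n_steps-1."""
    try:
        index = int(float(gc_content) * n_steps)
    except (TypeError, ValueError):
        # Unparseable values fall back to the first bin.
        return 0
    # Clamp so that GC == 1.0 (or slightly out-of-range input) stays valid.
    return max(0, min(index, n_steps - 1))


for gc in ("0.0", "0.6125", "1.0", "NA"):
    print(gc, gc_to_index(gc))
```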

In [35]:
'''Functions for processing and plotting the data'''

def ProcessChromosome(chromosome, Max_rows = 600):
    '''
    This function reads processed genomic data saved in bed format, compressed with gzip.
    
    Input is the chromosome number and the number of rows into which we want to
    slice the chromosome (this parameter is optional, default value is 600, and
    it is not used at this stage).
    
    Output is a dataframe with the following columns:
        "Chunk_no": chunk number as read from the bed file,
        "GC_content": GC fraction (0..1) as read from the bed file,
        "Genes": integer 1 or 0 -> 1 indicates the chunk overlaps with a gene
        "GWAS": integer 1 or 0 -> 1 indicates overlap with a GWAS signal
    The color of each chunk is added later by `color_chromosome`.
    '''
    
    # The name of the chromosome is X and Y for number 23 and 24 respectively:
    chr_name = chromosome
    if chr_name == 23: chr_name = "X"
    if chr_name == 24: chr_name = "Y"

    # Importing global variables:
    global Chromosome_file_loc
    global GWAS_file_loc
    global GENCODE_file_loc

    # Opening bedfile, read GC contents and chunk number
    # (text mode, so that lines are str rather than bytes under Python 3):
    chunk_no = []
    CG_content = []
    with gzip.open(Chromosome_file_loc % chr_name,'rt') as bedfile:
        for line in bedfile:
            line = line.strip()
            (chrom, start, end, GC, chunk) = line.split("\t")

            chunk_no.append(chunk)
            CG_content.append(GC)
            
    # As soon as the GC contents are read, we check 
    # if the chunks are overlapping with genes or GWAS hits:

    # Run intersectBed query:
    data_file = pybedtools.BedTool((Chromosome_file_loc % chr_name))
    GENCODE_file = pybedtools.BedTool(GENCODE_file_loc)
    GWAS_file = pybedtools.BedTool(GWAS_file_loc)

    # Get intersecting genes:
    GencodeIntersect = data_file.intersect(GENCODE_file, wa = True)

    # Get intersecting GWAS signals:
    GWASIntersect = data_file.intersect(GWAS_file, wa = True)
    
    # Now we have to loop through all intersecting values:
    # (the np.int alias has been removed from recent NumPy releases; plain int is equivalent)
    Genes = np.zeros(len(CG_content), dtype=int)
    GWAS = np.zeros(len(CG_content), dtype=int)
    chunks_in_genes = [hit.fields[4] for hit in GencodeIntersect]
    chunks_in_gwas  = [hit.fields[4] for hit in GWASIntersect]
    
    # elements of the arrays corresponding to overlapping chunks
    # will be updated to 1 from 0:
    for index in chunks_in_genes:
        Genes[int(index)] = 1

    for index in chunks_in_gwas:
        GWAS[int(index)] = 1

    # Data organized into dataframe:
    df = pandas.DataFrame({
            "Chunk_no": chunk_no,
            "GC_content": CG_content,
            "Genes": Genes,
            "GWAS": GWAS
        })    
        
    return df


def color_chromosome(df, chromosome, Max_rows = 600):
    '''
    Once we have the dataframe of the chunks we can assign colors based on the 
    chromosome number, CG content, gene annotation.
    '''
    # Based on the maximum number of rows, we assign a row and a column
    # location for each element.
    # Number of chunks plotted in one row (integer division):
    Max_columns = len(df) // Max_rows

    # Calculating the coordinates for each chunk in the final plot, and add to the df:
    rows = [int(x) // Max_columns for x in df.Chunk_no.tolist()]
    column = [int(x) % Max_columns for x in df.Chunk_no.tolist()]
    df["Row_no"] = pandas.Series(rows)
    df["Column_no"] = pandas.Series(column)
    df["Column_frac"] = [float(x) / Max_columns for x in column]

    # Picking colors for the chromosome
    colors = def_chromosome_colors(chromosome)

    # Assigning colors from the above defined dictionary:
    df["color"] = df.apply(get_color, axis=1, args=(colors,))    

    return df

def plot_chromosome(df, chromosome, filename, rows = 600):
    '''
    Genomic data processed, colors have been assigned to each chunk.
    Now create an svg image, then render it in a png.
    '''
    width = int(len(df)/rows)
    chromosome_name = chromosome
    # Although I decided to focus on the autosomal chromosomes, the script is ready to deal with
    # the sex chromosomes.
    if chromosome_name == 23: chromosome_name = "X"
    if chromosome_name == 24: chromosome_name = "Y"        

    # The size of box which represent a single chunk of the genome:
    box_width = 2
    box_height = 2

    # svg file is saved:
    #f = open('chromosome1.html', 'w')
    plot = '<svg width="%s" height="%s">\n' % (width*1.5*box_width, rows*1.5*box_height)

    # Plotting genome based on GC content and gene annotation:
    for index, color in enumerate(df.color):

        # Get x,y coordinates based on index (integer division gives the row):
        y = (index // width) * (box_height + 1)
        x = (index % width)*(box_width+1)

        # Generate svg line:
        plot +='<rect x="%s" y="%s" width="%s" height="%s" style="stroke-width:1;stroke:%s; fill: %s" />\n' % (x, y, box_width + 1 , box_height + 1, color, color)

    # looking through gwas annotation:
    for index, gwas in enumerate(df.GWAS):
        if gwas == 1:
            # Based on the index of the given line, we get the x,y coordinates:
            y = (index // width) * (box_height + 1) + 1.5
            x = (index % width)*(box_width+1) + 1.5

            # Drawing circle around the defined point:
            plot += '<circle cx="%s" cy="%s" r="3" stroke="%s" stroke-width="1" fill="%s" />\n' % (x, y, "black", "black")

    ###
    ### Drawing legend for the figure.
    ### A box with the name of the chromosome explanation of the colors
    ### Scale bars etc.
    ###

    # Adding a bigger rectangle, for legend:
    legend_x = 15
    legend_y = 15
    legend_width = 250
    legend_height = 220
    plot += '<g>\n'
    plot += '<rect x="%s" y="%s" rx="20" ry="20" width="%s" height="%s" style="fill:white;stroke:black;stroke-width:2" />\n' % (
            legend_x, legend_y, legend_width, legend_height)

    # Adding chromosome number:
    plot += '<text x="140" y="46" text-anchor="middle" font-family="Verdana" font-size="30">chr%s</text>\n' % (chromosome_name)

    # Adding text: (GC content)
    plot += '<text x="95" y="80" text-anchor="middle" font-family="Verdana" font-size="13">GC content</text>\n'

    # For the color section of the legend, we have to generate a color gradient:
    colors = def_chromosome_colors(chromosome)

    # Adding gene colors:
    y = 85
    x = 20
    height = 15
    width = 7
    for index, col in enumerate (reversed(colors[1])):
        x_coord = x+width*int(index)
        plot += '<rect x="%s" y="%s" width="%s" height="%s" style="stroke-width:1;stroke:%s; fill: %s" />\n' %(x_coord, y, width, height, col, col)
    plot += '<text x="%s" y="%s" font-family="Verdana" font-size="17">%s</text>\n' %(x_coord + 15, y + 12, "Genes")

    # Adding background colors:
    y = 105
    for index, col in enumerate (reversed(colors[0])):
        x_coord = x+width*int(index)
        plot += '<rect x="%s" y="%s" width="%s" height="%s" style="stroke-width:1;stroke:%s; fill: %s" />\n' %(x_coord, y, width, height, col, col)
    plot += '<text x="%s" y="%s" font-family="Verdana" font-size="17">%s</text>\n' %(x_coord + 15, y + 12, "Intergenic")

    # Adding scale bars:
    length = 63
    x = 40
    y = 160
    tick = 5

    plot += '<text x="%s" y="%s" font-family="Verdana" text-anchor="middle"  font-size="15">15kbp</text>\n' % (length/2 + x, y - 10)
    plot += '<line x1="%s" y1="%s" x2="%s" y2="%s" style="stroke:black;stroke-width:3" />\n' %(x,y,x + length,y)
    plot += '<line x1="%s" y1="%s" x2="%s" y2="%s" style="stroke:black;stroke-width:3" />\n' %(x,y-tick,x,y+tick)
    plot += '<line x1="%s" y1="%s" x2="%s" y2="%s" style="stroke:black;stroke-width:3" />\n' %(x + length,y-tick,x + length,y+tick)

    # Adding vertical units: 42 units for 500kbp
    # Get the length of one row: 
    length_row = len(df[df.Row_no == 3]) * 480
    length_bar = 50
    bar_size_genome = round(length_bar / 2 * length_row/1e6, 2)

    x = 140
    y = 160
    plot += '<text x="%s" y="%s" font-family="Verdana" font-size="15">%sMbp</text>\n' % (
        x + 15, y, bar_size_genome)
    plot += '<line x1="%s" y1="%s" x2="%s" y2="%s" style="stroke:black;stroke-width:3" />\n' % (
        x, y-length_bar/2, x, y + length_bar/2)
    plot += '<line x1="%s" y1="%s" x2="%s" y2="%s" style="stroke:black;stroke-width:3" />\n' % (
        x-tick,y-length_bar/2,x+tick,y-length_bar/2)
    plot += '<line x1="%s" y1="%s" x2="%s" y2="%s" style="stroke:black;stroke-width:3" />\n' % (
        x-tick,y + length_bar/2,x+tick,y + length_bar/2)

    # Adding a circle to the bottom of the legend:
    circle_r = 5
    circle_x = 40 
    circle_y = 210
    plot += '<circle cx="%s" cy="%s" r="%s" stroke="black" stroke-width="1" fill="black" />\n' % (
            circle_x, circle_y, circle_r)
    plot += '<text x="%s" y="%s" font-family="Verdana" font-size="15">%s</text>\n' % (
            circle_x + 20, circle_y + 5, "Known GWAS signals")

    # Closing legend
    plot += '</g>'

    #f.write('</svg>\n</body>\n</html>\n')
    plot += '</svg>\n'
    #f.close()

    # PNG data is binary, so the output file is opened in 'wb' mode:
    with open('%s_chr%s.png' % (filename, chromosome),'wb') as fout:
        cairosvg.svg2png(bytestring=plot.encode('utf-8'), write_to=fout)
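
The plotting loop places chunk number `index` at column `index % width` and row `index // width`. The following self-contained sketch shows the same layout on a toy input. `grid_svg` and the three sample colors are illustrative only, and the SVG namespace attribute is an addition made here so that the result can be opened directly in a browser:

```python
def grid_svg(colors, width, box=3):
    """Build an SVG string with one box x box square per color, `width` squares per row."""
    n_rows = -(-len(colors) // width)  # ceiling division
    parts = ['<svg xmlns="http://www.w3.org/2000/svg" width="%s" height="%s">'
             % (width * box, n_rows * box)]
    for index, color in enumerate(colors):
        row, col = divmod(index, width)  # integer row and column of this chunk
        parts.append('<rect x="%s" y="%s" width="%s" height="%s" fill="%s" />'
                     % (col * box, row * box, box, box, color))
    parts.append('</svg>')
    return "\n".join(parts)


svg = grid_svg(["#b74242", "#b79e42", "#42b742"], width=2)
print(svg)
```

Collecting the fragments in a list and joining them once tends to scale better than repeated `+=` on a long string, which may matter for chromosomes with several hundred thousand chunks.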

In [36]:
'''
Now we loop through chromosomes 1 to 21 and generate a plot for each
(note that range(1, 22) stops at 21; range(1, 23) would also include chromosome 22).
In the meantime, the dataframes are also saved, so in a future step
they can be read back without calculating them from scratch.
'''
for chromosome in range(1, 22):
    print("Processing chromosome %s" % (chromosome))

    # Process chromosome data, combine into dataframe:
    df = ProcessChromosome(chromosome)

    # Pickling dataframes:
    filename = outputDir % chromosome
    with open(filename, 'wb') as output:
        pickle.dump(df, output)
    
    # If we just want to read the df back from file:
    # df = pandas.read_pickle(filename)

    # Add colors to dataframe:
    df = color_chromosome(df, chromosome)
    plot_chromosome(df, chromosome, "plot", 600)


Processing chromosome 1
Processing chromosome 2
Processing chromosome 3
Processing chromosome 4
Processing chromosome 5
Processing chromosome 6
Processing chromosome 7
Processing chromosome 8
Processing chromosome 9
Processing chromosome 10
Processing chromosome 11
Processing chromosome 12
Processing chromosome 13
Processing chromosome 14
Processing chromosome 15
Processing chromosome 16
Processing chromosome 17
Processing chromosome 18
Processing chromosome 19
Processing chromosome 20
Processing chromosome 21


In [5]:
import pandas as pd
import numpy as np

In [222]:
# Reading datafile:
chr_name = 2
dataFile = Chromosome_file_loc % chr_name
chr_dataf = pd.read_csv(dataFile, compression='gzip', header=None, sep='\t', quotechar='"')

# Reading GENCODE file:
GENCODE_bed = pybedtools.BedTool(GENCODE_file_loc)

# Run intersectbed:
chr_data_bed = pybedtools.BedTool(dataFile)

# Get intersecting GENCODE features:
GencodeIntersect = chr_data_bed.intersect(GENCODE_bed, wa = True, wb = True)
GC_INT = pybedtools.bedtool.BedTool.to_dataframe(GencodeIntersect)

# Assign features to each chunk:
GENCODE_chunks = GC_INT.groupby('score').apply(lambda x: 'exon' if 'exon' in x.itemRgb.unique() else 'gene' )
GENCODE_chunks.name = "GENCODE"
chr_dataf['GENCODE'] = 'intergenic' # By default, all GENCODE values are intergenic
chr_dataf.GENCODE.update(GENCODE_chunks) # This value will be overwritten if overlaps with exon or gene

# Assigning column number:
row_count = 800
col_count = int(chr_dataf.shape[0] / row_count)
chr_dataf['col_frac'] = chr_dataf[4] % col_count / float(col_count)
chr_dataf.head()

# Get colors based on GENCODE feature:
colors_GENCODE = {
    'intergenic': linear_gradient('#42b79a', finish_hex='#42b79a', n=20), # teal (start and finish are the same, so no gradient)
    'exon': linear_gradient('#CDCD00', n=20), # yellow
    'gene': linear_gradient('#4278b7', n=20)} # blue
chr_dataf['color'] = chr_dataf.apply(lambda x: colors_GENCODE[x['GENCODE']][int(x[3]*20)], axis = 1)

# Get the colors darker based on column number:
chr_dataf['color'] = chr_dataf.apply(color_darkener, axis = 1)

#plot_chromosome(chr_dataf, rows=row_count)
print('done')



AttributeError: 'DataFrame' object has no attribute 'itemRgb'

In [219]:
def plot_chromosome(df,rows = 600):
    '''
    Genomic data processed, colors have been assigned to each chunk.
    Now create an svg image, then render it in a png.
    '''
    # Use the dataframe passed in as an argument rather than the global one:
    width = int(df.shape[0] / rows)

    #chromosome_name = chromosome
    # Although I decided to focus on the autosomal chromosomes, the script is ready to deal with
    # the sex chromosomes.
    #if chromosome_name == 23: chromosome_name = "X"
    #if chromosome_name == 24: chromosome_name = "Y"        

    # The size of box which represent a single chunk of the genome:
    box_width = 2
    box_height = 2

    # svg initiated:
    plot = '<svg width="%s" height="%s" viewBox="0 0 %s %s">\n' \
        % (width*(box_width+1), rows*(box_height+1), width*(box_width+1), rows*(box_height+1))

    # Plotting genome based on GC content and gene annotation:
    for index, color in enumerate(df.color):
        # Get x,y coordinates based on index:
        y = int(index / width) * (box_height + 1)
        x = (index - int(index / width)*width)*(box_width+1)

        # Generate svg line:
        plot +='<rect x="%s" y="%s" width="%s" height="%s" style="stroke-width:1;stroke:%s; fill: %s" />\n' % (x, y, box_width + 1 , box_height + 1, color, color)
        
    # looking through gwas annotation:
    #for index, gwas in enumerate(df.GWAS):
    #    if gwas == 1:
    #        # based on the index of the given line, we get the x,y coordinates:
    #        y = (index / width) * (box_height + 1) + 1.5
    #        x = (index - (index / width)*width)*(box_width+1) + 1.5

    #        # Drawing circle around the defined point:
    #        plot += '<circle cx="%s" cy="%s" r="3" stroke="%s" stroke-width="1" fill="%s" />\n' % (x, y, "black", "black")

    plot += '</svg>\n' # Terminate svg

    # Saving svg into file:
    with open('cicaful.svg','w') as fout:
        fout.write(plot)
    #cairosvg.svg2png(bytestring=plot,write_to='cicaful.png')
    
print(chr_dataf.head())
#plot_chromosome(df, chromosome, filename, rows = 600):


   0      1      2         3  4     GENCODE  col_frac    color
0  2  10020  10500  0.612500  1  intergenic  0.001597  #b9e4d9
1  2  10500  10980  0.662500  2  intergenic  0.003195  #c3e8df
2  2  10980  11460  0.787500  3  intergenic  0.004792  #d7efe9
3  2  11460  11940  0.645833  4  intergenic  0.006390  #b9e4d9
4  2  11940  12420  0.339583  5  intergenic  0.007987  #7dcdb9


![test](cicaful.png)

In [125]:
def hex_to_RGB(hex):
    ''' "#FFFFFF" -> [255,255,255] '''
    # Pass 16 to the integer function for change of base
    return [int(hex[i:i+2], 16) for i in range(1,6,2)]


def RGB_to_hex(RGB):
    ''' [255,255,255] -> "#FFFFFF" '''
    # Components need to be integers for hex to make sense
    RGB = [int(x) for x in RGB]
    return "#"+"".join(["0{0:x}".format(v) if v < 16 else
            "{0:x}".format(v) for v in RGB])

def linear_gradient(start_hex, finish_hex="#FFFFFF", n=10):
    '''
    Returns a gradient list of (n) colors between
    two hex colors. start_hex and finish_hex
    should be the full six-digit color string,
    including the number sign (e.g. "#FFFFFF").
    '''
    # Starting and ending colors in RGB form
    s = hex_to_RGB(start_hex)
    f = hex_to_RGB(finish_hex)
    
    # Initialize a list of the output colors with the starting color
    RGB_list = [start_hex]
    
    # Calculate a color at each evenly spaced value of t from 1 to n-1
    for t in range(1, n):
    
        # Interpolate RGB vector for color at the current value of t
        curr_vector = [
            int(s[j] + (float(t)/(n-1))*(f[j]-s[j]))
            for j in range(3)
        ]

        # Add it to our list of output colors
        RGB_list.append(RGB_to_hex(curr_vector))

    return RGB_list

def color_darkener(row):
    '''
    Once the colors are assigned, we make the colors darker for those columns 
    that are at the end of the plot.
    '''
    
    color = row['color']
    col_frac = row['col_frac']
    
    # Built in parameters:
    threshold = 0.7 # at which column the darkening starts
    max_diff_value = 0.15 # The max value of darkening
    
    # Ok, we have the color, now based on the column we make it a bit darker:
    if col_frac > threshold:
        diff = (col_frac - threshold)/(1 - threshold)
        factor = 1 - max_diff_value*diff 

        # Get rgb code of the hexa code:
        rgb_code = hex_to_RGB(color)

        # Get the hls code of the rgb:
        hls_code = colorsys.rgb_to_hls(rgb_code[0]/float(255), 
                                       rgb_code[1]/float(255), 
                                       rgb_code[2]/float(255))

        # Get the modifed rgb code:
        new_rgb = colorsys.hls_to_rgb(hls_code[0], 
                                      hls_code[1]*factor, 
                                      hls_code[2])

        # Get the modifed hexacode:
        color = RGB_to_hex([new_rgb[0] * 255,
                           new_rgb[1] * 255,
                           new_rgb[2] * 255])
    return(color)
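
For reference, the darkening applied by `color_darkener` scales the HLS lightness by a factor that falls linearly from 1.0 at the threshold column to `1 - max_diff_value` at the last column. A small standalone sketch of that calculation using only the standard library (`darken` is a hypothetical name; the defaults mirror the 0.7 and 0.15 used in the notebook):

```python
import colorsys


def darken(hex_color, col_frac, threshold=0.7, max_diff_value=0.15):
    """Return hex_color with reduced lightness for columns beyond the threshold."""
    if col_frac <= threshold:
        return hex_color
    # Linear ramp: 1.0 at the threshold, 1 - max_diff_value at col_frac == 1.
    factor = 1 - max_diff_value * (col_frac - threshold) / (1 - threshold)
    r, g, b = (int(hex_color[i:i + 2], 16) / 255.0 for i in (1, 3, 5))
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    r, g, b = colorsys.hls_to_rgb(h, l * factor, s)
    return "#%02x%02x%02x" % (int(r * 255), int(g * 255), int(b * 255))


print(darken("#42b79a", 0.5))  # left of the threshold: unchanged
print(darken("#42b79a", 1.0))  # last column: darkest
```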


220    #e8ecec
221    #e1e7e7
222    #e4e9e9
223    #e4e9e9
224    #e4e9e9
225    #dd99ff
226    #ffd775
227    #d685ff
228    #dd99ff
229    #ffdb80
dtype: object

In [225]:
GWAS_df[GWAS_df.chrom == str(chr_name)].start.tolist()



TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'

In [244]:
with open('cicaful.test.svg','r') as fin:
    plot = fin.read()

# This value comes from the sh script:
window = 480
pixel = 3
row_count = 800

#GWAS_file_loc
GWAS_file = pybedtools.BedTool(GWAS_file_loc)
GWAS_df = pybedtools.bedtool.BedTool.to_dataframe(GWAS_file)
GWAS_df[GWAS_df.chrom == str(chr_name)].start

# Looping through all the GWAS hits:
for pos in GWAS_df[GWAS_df.chrom == str(chr_name)].start.unique().tolist():
    # Based on the position, calculate chunk:
    chunk = int(int(pos)/window)
    
    # Based on the chunk no and the pixel size and row count, let's get the coordinates:
    y = int(chunk / row_count) * (pixel)
    x = (chunk - int(chunk / row_count)*row_count)*(pixel)
    
    # Draw point:
    plot += '<circle cx="%s" cy="%s" r="3" stroke="%s" stroke-width="1" fill="%s" />\n' % \
        (x, y, "black", "black")
        
plot += '</svg>\n' # Terminate svg

# Saving svg into file:
with open('cicaful_GWAS.svg','w') as fout:
    fout.write(plot)
#cairosvg.svg2png(bytestring=plot,write_to='cicaful_GWAS.png')

In [240]:
pos = 11802720
window = 450 # chunk size in bp; this is the value that reproduces the output shown below
chunk = int(int(pos)/window)
print("Chunk: %s" % chunk)
y = int(chunk / row_count) * (pixel)
x = (chunk - int(chunk / row_count)*row_count)*(pixel)
print("X: %s, Y: %s" % (x,y))

Chunk: 26228
X: 1884, Y: 96
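
The printed values can be checked by hand: with 800 chunks per row and 3 pixels per chunk, chunk 26228 sits in row 26228 // 800 = 32 and column 26228 % 800 = 628, which gives Y = 32 × 3 = 96 and X = 628 × 3 = 1884. The same arithmetic as a small function (`chunk_to_pixel` is a hypothetical helper name):

```python
def chunk_to_pixel(chunk, chunks_per_row, pixel):
    """Convert a chunk number into the (x, y) pixel position of its box."""
    row, col = divmod(chunk, chunks_per_row)
    return col * pixel, row * pixel


x, y = chunk_to_pixel(26228, chunks_per_row=800, pixel=3)
print("X: %s, Y: %s" % (x, y))  # X: 1884, Y: 96
```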


In [110]:
def generate_xy(pos, min_pos, chunk_size, width):
    '''
    Generating x,y coordinates from genomic position and 
    the provided plot width and chunk size.

    pos - genomic position
    min_pos - the smallest genomic position in the data (plotted at 0,0)
    chunk_size - the number of basepairs pooled together in one chunk.
    width - number of chunks in one row
    '''
    pos = pos - min_pos
    chunk = int(int(pos)/chunk_size)
    print("Chunk: %s" % chunk)
    y = int(chunk / width)
    x = (chunk - int(chunk / width)*width)
    print("X: %s, Y: %s" % (x,y))

chr_dataf.head()

   chr    start      end  GC_ratio  x  y     GENCODE
0  2.0  10000.0  10500.0     0.612  0  0  intergenic
1  2.0  10500.0  11000.0     0.668  1  0  intergenic
2  2.0  11000.0  11500.0     0.770  2  0  intergenic
3  2.0  11500.0  12000.0     0.632  3  0  intergenic
4  2.0  12000.0  12500.0     0.348  4  0  intergenic


# Cleaning up the code


* input parameters:
    * chromosome name
    * chunk size
    * axis
    * dimension of the plot
* Open input files
* Assign x,y columns to each chunk
* Assign GENCODE feature to each chunk.
* Assign color to each chunk
* Darken color if necessary
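
One possible way to expose these input parameters once the notebook is wrapped into a script is `argparse` from the standard library. The flag names and defaults below are assumptions made for this sketch, not an existing interface:

```python
import argparse


def parse_arguments(argv=None):
    """Parse the plot parameters listed above (hypothetical flag names)."""
    parser = argparse.ArgumentParser(
        description="Plot one chromosome as a grid of colored chunks.")
    parser.add_argument("--chromosome", required=True,
                        help="chromosome name, e.g. 2 or X")
    parser.add_argument("--chunk-size", type=int, default=480,
                        help="number of base pairs pooled into one chunk (assumed default)")
    parser.add_argument("--axis", type=int, choices=(0, 1), default=1,
                        help="0: fixed number of rows, 1: fixed number of columns")
    parser.add_argument("--dimension", type=int, default=400,
                        help="number of rows or columns along the fixed axis")
    return parser.parse_args(argv)


# Passing an explicit list makes the parser easy to try out inside a notebook:
args = parse_arguments(["--chromosome", "2", "--dimension", "400"])
print(args.chromosome, args.chunk_size, args.axis, args.dimension)
```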

In [99]:
'''
Importing libraries:
'''

import gzip # For reading data
import pybedtools # For finding overlap between our chunks and genes and GWAS signals
import pandas # data handling
import numpy as np # Working with large arrays
import colorsys # Generate color gradient.
import cairosvg # Converting svg to image
import pickle # Saving dataframes to disk
import os.path # Checking if the data files are already there.

'''
Input files and folder + checking if they are present
'''

workingDir = os.getcwd()
Chromosome_file_loc = workingDir + "/data/Processed_chr%s.bed.gz"
GWAS_file_loc = workingDir + "/data/processed_GWAS.bed.gz"
GENCODE_file_loc = workingDir + "/data/GENCODE.merged.bed.gz"
outputDir = workingDir + "/processed_data/dataframe_chr%s.pkl"

if not os.path.isfile(GWAS_file_loc):
    print ("GWAS file is missing. Run `prepare_data.sh` first!")
if not os.path.isfile(GENCODE_file_loc):
    print ("GENCODE file is missing. Run `prepare_data.sh` first!")
if not os.path.isfile(Chromosome_file_loc % 11):
    print ("Processed chromosome files are missing! Run `prepare_data.sh` first!")

# These values are read from the command line once the script is wrapped:
axis = 1
dimension = 400 
chromosome = 2 

# Reading datafile:
dataFile = Chromosome_file_loc % chromosome
chr_dataf = pandas.read_csv(dataFile, compression='gzip', sep='\t', quotechar='"')

# Based on the difference between the start and end positions of the first chunk, we get the chunk size:
chunk_size = int(chr_dataf['end'].iloc[0] - chr_dataf['start'].iloc[0])
min_pos  = chr_dataf.start.min()

# Get the width (number of chunks plotted in one row)
width = dimension # We are dealing with fixed column numbers
if axis == 0:  width = int(chr_dataf.shape[0]/dimension) # We are dealing with fixed number of rows.

# Adding plot coordinates to the dataframe:
chr_dataf = chr_dataf.apply(generate_xy, args = (min_pos, chunk_size, width), axis =1)

# Reading GENCODE file:
GENCODE_bed = pybedtools.BedTool(GENCODE_file_loc)

# Run intersectbed:
chr_data_bed = pybedtools.BedTool(dataFile)

# Get intersecting GENCODE features:
GencodeIntersect = chr_data_bed.intersect(GENCODE_bed, wa = True, wb = True)
GC_INT = pybedtools.bedtool.BedTool.to_dataframe(GencodeIntersect)

# Assign features to each chunk:
#GENCODE_chunks = GC_INT.groupby('score').apply(lambda x: 'exon' if 'exon' in x.itemRgb.unique() else 'gene' )
#GENCODE_chunks.name = "GENCODE"
#chr_dataf['GENCODE'] = 'intergenic' # By default, all GENCODE values are intergenic
#chr_dataf.GENCODE.update(GENCODE_chunks) # This value will be overwritten if overlaps with exon or gene
#chr_dataf.head()

# Assigning column number:
#row_count = 800
#col_count = int(chr_dataf.shape[0] / rows)
#chr_dataf['col_frac'] = chr_dataf[4] % col_count / float(col_count)
#chr_dataf.head()
GC_INT.head()

  chrom  start    end   name  score  strand  thickStart thickEnd
0     2  38500  39000  0.340      2   38814       41627     exon
1     2  38500  39000  0.340      2   38814       46870     gene
2     2  39000  39500  0.416      2   38814       41627     exon
3     2  39000  39500  0.416      2   38814       46870     gene
4     2  39500  40000  0.378      2   38814       41627     exon


In [101]:
GENCODE_chunks = GC_INT.groupby('start').apply(lambda x: 'exon' if 'exon' in x.thickEnd.unique() else 'gene' )

In [104]:
GENCODE_chunks.name = "GENCODE"
chr_dataf['GENCODE'] = 'intergenic'
chr_dataf.GENCODE.update(GENCODE_chunks)
chr_dataf.GENCODE.unique()

array(['intergenic', 'exon', 'gene'], dtype=object)
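
The grouping rule used here is: a chunk is labelled `exon` if any overlapping feature is an exon, `gene` if it only overlaps gene bodies, and `intergenic` if it has no overlap at all. A dependency-free sketch of that rule on made-up records (the start positions and feature lists are illustrative, not taken from the real intersect output):

```python
from collections import defaultdict

# (chunk start, feature type) pairs, as an intersect result might list them:
hits = [(38500, "exon"), (38500, "gene"), (39000, "gene"), (39500, "gene")]
chunk_starts = [38000, 38500, 39000, 39500]

# Collect the set of feature types seen for each chunk start:
features = defaultdict(set)
for start, feature in hits:
    features[start].add(feature)


def classify(start):
    """Exon wins over gene; chunks without any hit are intergenic."""
    if "exon" in features.get(start, ()):
        return "exon"
    if start in features:
        return "gene"
    return "intergenic"


labels = {start: classify(start) for start in chunk_starts}
print(labels)
```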

In [115]:
def generate_xy(df, min_pos, chunk_size, width):
    '''
    Generating x,y coordinates from genomic position and
    the provided plot width and chunk size.

    df - one row of the dataframe (the function is applied row by row)
    min_pos - the smallest start position in the data
    chunk_size - the number of basepairs pooled together in one chunk.
    width - number of chunks in one row
    '''
    pos = df['start'] - min_pos
    chunk = int(int(pos)/chunk_size)
    df["x"] = (chunk - int(chunk / width)*width)
    df["y"] = int(chunk / width)
    return (df)

test_df = chr_dataf.loc[1:10000] # .ix has been removed from pandas; .loc is the label-based equivalent
test_df = test_df.apply(generate_xy, args = (min_pos, chunk_size, width), axis =1)
test_df.head()

Unnamed: 0,chr,start,end,GC_ratio,x,y,GENCODE
1,2.0,10500.0,11000.0,0.668,1,0,intergenic
2,2.0,11000.0,11500.0,0.77,2,0,intergenic
3,2.0,11500.0,12000.0,0.632,3,0,intergenic
4,2.0,12000.0,12500.0,0.348,4,0,intergenic
5,2.0,12500.0,13000.0,0.422,5,0,intergenic
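
The mapping in `generate_xy` is a row-major wrap of the chunk index: the chunk number is divided by the row width, the quotient is the row and the remainder is the column. A minimal sketch of the same arithmetic with `divmod`, outside pandas. The helper name `chunk_to_xy` is hypothetical and not part of the notebook, and the width of 400 is an assumed value for the example:

```python
def chunk_to_xy(start, min_pos, chunk_size, width):
    """Map a genomic start position to (x, y) plot coordinates.

    Hypothetical helper: same arithmetic as generate_xy, but for one
    position at a time and without touching a dataframe row.
    """
    chunk = int(start - min_pos) // int(chunk_size)  # index of the chunk along the chromosome
    y, x = divmod(chunk, width)                      # row-major wrap: y is the row, x the column
    return x, y


# 500 bp chunks starting at position 10,000, as in the test output above.
print(chunk_to_xy(10500.0, 10000.0, 500, 400))   # second chunk of the first row
print(chunk_to_xy(210000.0, 10000.0, 500, 400))  # first chunk of the second row
```

Because the computation only needs the start position, it could also be applied to a whole column at once instead of row by row, which may be faster on full chromosomes.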


In [118]:
#print("[Info %s] Assigning colors to each chunk." % (get_now()))
print(chr_dataf.head())
colors_GENCODE = {
    'intergenic': linear_gradient('#42b79a', n=20), # teal green
    'exon': linear_gradient('#CDCD00', n=20), # yellow
    'gene': linear_gradient('#4278b7', n=20)} # blue

   chr    start      end  GC_ratio  x  y     GENCODE
0  2.0  10000.0  10500.0     0.612  0  0  intergenic
1  2.0  10500.0  11000.0     0.668  1  0  intergenic
2  2.0  11000.0  11500.0     0.770  2  0  intergenic
3  2.0  11500.0  12000.0     0.632  3  0  intergenic
4  2.0  12000.0  12500.0     0.348  4  0  intergenic


In [120]:
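# The GC ratio selects one of the 20 shades; a ratio of exactly 1.0 is clamped to the last one:
test_df['color'] = test_df.apply(lambda x: colors_GENCODE[x['GENCODE']][min(int(x['GC_ratio']*20), 19)], axis = 1)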
test_df.head()

Unnamed: 0,chr,start,end,GC_ratio,x,y,GENCODE,color
1,2.0,10500.0,11000.0,0.668,1,0,intergenic,#c3e8df
2,2.0,11000.0,11500.0,0.77,2,0,intergenic,#d7efe9
3,2.0,11500.0,12000.0,0.632,3,0,intergenic,#b9e4d9
4,2.0,12000.0,12500.0,0.348,4,0,intergenic,#7dcdb9
5,2.0,12500.0,13000.0,0.422,5,0,intergenic,#91d5c4
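
`linear_gradient` is defined earlier in the notebook and is not shown in this part. As a rough illustration of how a GC ratio picks a shade, here is a stand-in written under the assumption that the gradient interpolates linearly from the base color towards white. The names `hex_to_rgb`, `gradient_to_white` and `gc_to_shade` are hypothetical, and the real function may differ in its details:

```python
def hex_to_rgb(hex_code):
    """'#42b79a' -> [66, 183, 154]"""
    return [int(hex_code[i:i + 2], 16) for i in (1, 3, 5)]


def gradient_to_white(start_hex, n=20):
    """Return n hex colors running from start_hex to white (assumed behaviour)."""
    start = hex_to_rgb(start_hex)
    shades = []
    for step in range(n):
        frac = step / (n - 1)
        rgb = [int(s + frac * (255 - s)) for s in start]
        shades.append('#%02x%02x%02x' % tuple(rgb))
    return shades


def gc_to_shade(gc_ratio, shades):
    """Pick the shade for a GC ratio in [0, 1]; a ratio of 1.0 is clamped to the last bin."""
    return shades[min(int(gc_ratio * len(shades)), len(shades) - 1)]


shades = gradient_to_white('#42b79a')
print(gc_to_shade(0.668, shades))  # '#c3e8df', the color of the first row in the table above
```

Under this assumption, chunks with higher GC content get lighter shades of the base color of their category.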


In [124]:
def color_darkener(row, width, threshold, max_diff_value):
    '''
    Darken the colors of the chunks that sit near the right edge of the plot.

    row - dataframe row with 'x' (column number) and 'color' (hex code)
    width - how many chunks we have in one line.
    threshold - fraction of the row width (0-1) at which the darkening starts
    max_diff_value - the maximum fraction by which the lightness is reduced
    '''

    color = row['color']
    col_frac = row['x'] / width

    # Ok, we have the color, now based on the column we make it a bit darker:
    if col_frac > threshold:
        diff = (col_frac - threshold)/(1 - threshold)
        factor = 1 - max_diff_value*diff

        # Get the rgb code of the hex code:
        rgb_code = hex_to_RGB(color)

        # Get the hls code of the rgb:
        hls_code = colorsys.rgb_to_hls(rgb_code[0]/float(255),
                                       rgb_code[1]/float(255),
                                       rgb_code[2]/float(255))

        # Get the modified rgb code (only the lightness is scaled):
        new_rgb = colorsys.hls_to_rgb(hls_code[0],
                                      hls_code[1]*factor,
                                      hls_code[2])

        # Get the modified hex code:
        color = RGB_to_hex([new_rgb[0] * 255,
                           new_rgb[1] * 255,
                           new_rgb[2] * 255])
    return(color)

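start = 0.75      # passed as `threshold`: darkening begins at 75% of the row width
threshold = 0.15  # passed as `max_diff_value`: lightness is reduced by at most 15%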
test_df['color'] = test_df.apply(color_darkener, axis = 1, args=(width, start, threshold))
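
To make the effect of the two numbers above concrete: up to 75% of the row width the color is unchanged, and from there the lightness multiplier falls linearly to 0.85 at the right edge. A small standalone sketch of that ramp using only `colorsys`. The helper names `darkening_factor` and `darken_hex` are hypothetical, and the rounding used here may differ from the notebook's `RGB_to_hex`:

```python
import colorsys


def darken_hex(hex_code, factor):
    """Scale the HLS lightness of a '#rrggbb' color by factor (0..1)."""
    r, g, b = (int(hex_code[i:i + 2], 16) / 255.0 for i in (1, 3, 5))
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    r, g, b = colorsys.hls_to_rgb(h, l * factor, s)
    return '#%02x%02x%02x' % (round(r * 255), round(g * 255), round(b * 255))


def darkening_factor(x, width, threshold=0.75, max_diff_value=0.15):
    """Lightness multiplier for column x: 1.0 up to the threshold, then a linear ramp."""
    col_frac = x / width
    if col_frac <= threshold:
        return 1.0
    return 1 - max_diff_value * (col_frac - threshold) / (1 - threshold)


# Print the multiplier and the resulting color for a few columns of a 400-chunk row.
for x in (0, 300, 350, 400):
    f = darkening_factor(x, width=400)
    print(x, round(f, 3), darken_hex('#c3e8df', f))
```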




In [155]:
def generate_xy(df, min_pos, chunk_size, width):
    '''
    Generate x,y plot coordinates for one chunk (one dataframe row)
    from its genomic start position.

    df - dataframe row with a 'start' column (genomic position)
    min_pos - smallest start position on the chromosome
    chunk_size - the number of basepairs pooled together in one chunk.
    width - number of chunks in one row
    '''
    pos = int(df['start']) - min_pos
    chunk = int(int(pos)/chunk_size)
    df["x"] = (chunk - int(chunk / width)*width)
    df["y"] = int(chunk / width)
    return (df)

GWAS_file = pybedtools.BedTool(GWAS_file_loc)
full_GWAS = pybedtools.bedtool.BedTool.to_dataframe(GWAS_file)
GWAS_df = full_GWAS[full_GWAS.chrom == str(chromosome)]

min_pos = chr_dataf.start.min()
chunk_size = 500
width = 400
GWAS_df = GWAS_df.apply(generate_xy, args = (min_pos, chunk_size, width), axis =1)

In [227]:
import os
import sys
from time import gmtime, strftime

def get_now():
    return strftime("%H:%M:%S", gmtime())


workingDir = os.getcwd()
Chromosome_file_loc = workingDir + "/data/Processed_chr%s.bed.gz"
GWAS_file_loc = workingDir + "/data/processed_GWAS.bed.gz"
GENCODE_file_loc = workingDir + "/data/GENCODE.merged.bed.gz"
outputDir = workingDir + "/processed_data/dataframe_chr%s.pkl"

if not os.path.isfile(GWAS_file_loc):
    print ("GWAS file is missing. Run `prepare_data.sh` first!")
if not os.path.isfile(GENCODE_file_loc):
    print ("GENCODE file is missing. Run `prepare_data.sh` first!")
if not os.path.isfile(Chromosome_file_loc % 11):
    print ("Processed chromosome files are missing! Run `prepare_data.sh` first!")

# These values are read from the command line once the script is wrapped:
axis = 1
dimension = 300 
chromosome = 22

# Reading datafile:
dataFile = Chromosome_file_loc % chromosome
print("[Info %s] reading file %s... " % (get_now(), dataFile))
chr_dataf = pd.read_csv(dataFile, compression='gzip', sep='\t', quotechar='"')

# If we want to do test, we can restrict the dataframe to n lines:
if len(sys.argv) == 6:
    chr_dataf = chr_dataf.loc[0:int(sys.argv[5])]
    print("[Info %s] Test mode is on. The first %s rows will be kept." %(get_now(), sys.argv[5]))

# the start position will be the new index:
chr_dataf = chr_dataf.set_index('start', drop = False)

chunk_size = chr_dataf.head(1).apply(lambda x: x['end'] - x['start'], axis = 1).tolist()[0]
min_pos  = chr_dataf.start.min()
print("[Info %s] Number of chunks read: %s" %(get_now(), chr_dataf.shape[0]))

# Get the width (number of chunks plotted in one row)
width = dimension # We are dealing with fixed column numbers
if axis == 2:  width = int(chr_dataf.shape[0]/dimension) # We are dealing with fixed number of rows.

# Adding plot coordinates to the dataframe:
print("[Info %s] Chunk size: %s bp, plot width: %s" %(get_now(), chunk_size, width))
print("[Info %s] Calculating plot coordinates for each chunk." %(get_now()))
chr_dataf = chr_dataf.apply(generate_xy, args = (min_pos, chunk_size, width), axis =1)
print(chr_dataf.head())

[Info 01:08:04] reading file /Users/ds26/Projects/GenomePlotter/data/Processed_chr22.bed.gz... 
[Info 01:08:04] Number of chunks read: 112929
[Info 01:08:04] Chunk size: 450.0 bp, plot width: 300
[Info 01:08:04] Calculating plot coordinates for each chunk.
        chr   start     end  GC_ratio  x  y
start                                      
0      22.0     0.0   450.0       NaN  0  0
450    22.0   450.0   900.0       NaN  1  0
900    22.0   900.0  1350.0       NaN  2  0
1350   22.0  1350.0  1800.0       NaN  3  0
1800   22.0  1800.0  2250.0       NaN  4  0
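
The comment in the cell above says that `axis`, `dimension` and `chromosome` will be read from the command line once the script is wrapped. One way this could look with `argparse` is sketched below. The option names are assumptions made for the example, not the interface of the final script:

```python
import argparse


def parse_args(argv=None):
    """Parse plotting options; all option names here are illustrative."""
    parser = argparse.ArgumentParser(description='Genome plotter (sketch)')
    parser.add_argument('--chromosome', default='22', help='chromosome name, e.g. 22 or X')
    parser.add_argument('--dimension', type=int, default=300,
                        help='number of columns (axis 1) or rows (axis 2)')
    parser.add_argument('--axis', type=int, choices=(1, 2), default=1,
                        help='1: fixed number of columns, 2: fixed number of rows')
    parser.add_argument('--test-rows', type=int, default=None,
                        help='keep only the first N chunks for a quick test run')
    return parser.parse_args(argv)


def plot_width(n_chunks, dimension, axis):
    """Number of chunks per row, following the rule used in the cell above."""
    return dimension if axis == 1 else int(n_chunks / dimension)


args = parse_args(['--chromosome', '22', '--dimension', '300'])
print(args.chromosome, plot_width(112929, args.dimension, args.axis))
```

Named options would also replace the positional `sys.argv[5]` check for the test mode with something easier to read.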


In [263]:
# Reading GENCODE file:
print("[Info %s] Reading GENCODE file (%s)..." % (get_now(), GENCODE_file_loc))
GENCODE_bed = pybedtools.BedTool(GENCODE_file_loc)

# Run intersectbed:
chr_data_bed = pybedtools.BedTool(dataFile)

print("[Info %s] Selecting intersecting features." %(get_now()))
GencodeIntersect = chr_data_bed.intersect(GENCODE_bed, wa = True, wb = True)
GC_INT = pybedtools.bedtool.BedTool.to_dataframe(GencodeIntersect)

# Assign features to each chunk:
print("[Info %s] Assign intersecting features to each chunk." % (get_now()))
GENCODE_chunks = GC_INT.groupby('start').apply(lambda x: 'exon' if 'exon' in x.thickEnd.unique() else 'gene' )
GENCODE_chunks.name = "GENCODE"

chr_dataf['GENCODE'] = 'intergenic' # By default, all GENCODE values are intergenic
chr_dataf.GENCODE.update(GENCODE_chunks) # This value will be overwritten if overlaps with exon or gene
chr_dataf.loc[chr_dataf.GC_ratio.isnull(), 'GENCODE'] = 'heterochromatin'
print(chr_dataf.head())

# Reading cytoband file:
cytoband_file = workingDir + '/data/cytoBand.GRCh38.bed.bgz'
cyb_df = pd.read_csv(cytoband_file, compression='gzip', sep='\t')
centromer_loc = cyb_df.loc[(cyb_df.chr == str(chromosome)) & (cyb_df.type == 'acen'),['start', 'end']]
centromer_loc = (int(centromer_loc.start.min()), 
                 int(centromer_loc.end.max()))

# Updating centromere:
chr_dataf.loc[(chr_dataf.end > centromer_loc[0]) & (chr_dataf.start < centromer_loc[1]) , 'GENCODE'] = 'heterochromatin'

# Assigning colors:
print("[Info %s] Assigning colors to each chunk." % (get_now()))
colors_GENCODE = {
    'heterochromatin' : linear_gradient('#ffc6af', finish_hex='#ffc6af', n=20), # light salmon, same shade for every bin
    'intergenic': linear_gradient('#42b79a', n=20), # teal green
    'exon': linear_gradient('#CDCD00', n=20), # yellow
    'gene': linear_gradient('#4278b7', n=20)} # blue
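# Chunks without a GC ratio (NaN) get the heterochromatin color; a GC ratio of 1.0 is clamped to the last shade:
chr_dataf['color'] = chr_dataf.apply(lambda x: colors_GENCODE[x['GENCODE']][min(int(x['GC_ratio']*20), 19)] if not np.isnan(x['GC_ratio']) else colors_GENCODE['heterochromatin'][0], axis = 1)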

# Get the colors darker based on column number:
print("[Info %s] Apply darkness filter." % (get_now()))
chr_dataf['color'] = chr_dataf.apply(color_darkener, axis = 1, args=(width, start, threshold))
print(chr_dataf.head())

[Info 02:56:03] Reading GENCODE file (/Users/ds26/Projects/GenomePlotter/data/GENCODE.merged.bed.gz)...
[Info 02:56:03] Selecting intersecting features.
[Info 02:56:04] Assign intersecting features to each chunk.
        chr   start     end  GC_ratio  x  y          GENCODE    color
start                                                                
0      22.0     0.0   450.0       NaN  0  0  heterochromatin  #FFC1C1
450    22.0   450.0   900.0       NaN  1  0  heterochromatin  #FFC1C1
900    22.0   900.0  1350.0       NaN  2  0  heterochromatin  #FFC1C1
1350   22.0  1350.0  1800.0       NaN  3  0  heterochromatin  #FFC1C1
1800   22.0  1800.0  2250.0       NaN  4  0  heterochromatin  #FFC1C1
[Info 02:56:07] Assigning colors to each chunk.
[Info 02:56:11] Apply darkness filter.
        chr   start     end  GC_ratio  x  y          GENCODE    color
start                                                                
0      22.0     0.0   450.0       NaN  0  0  heterochromatin  #ffc6af
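
The labelling rule used in the cell above is a simple priority: a chunk that overlaps at least one exon is an `exon` chunk, a chunk that overlaps only genes is a `gene` chunk, and everything else stays `intergenic`. The sketch below restates that rule in plain Python on toy input. The function name `classify_chunks` and the positions in the example are made up for illustration:

```python
def classify_chunks(chunk_starts, overlaps):
    """Label every chunk as 'exon', 'gene' or 'intergenic'.

    chunk_starts - iterable of chunk start positions
    overlaps     - iterable of (chunk_start, feature) pairs, one per
                   chunk/feature overlap, feature being 'exon' or 'gene'
    """
    labels = {start: 'intergenic' for start in chunk_starts}
    for start, feature in overlaps:
        # An exon overlap always wins; a gene overlap never downgrades an exon.
        if feature == 'exon' or labels[start] != 'exon':
            labels[start] = feature
    return labels


# Toy input: one chunk with no overlap, one with exon + gene, one with gene only.
toy_overlaps = [(38500, 'exon'), (38500, 'gene'), (41000, 'gene')]
print(classify_chunks([38000, 38500, 41000], toy_overlaps))
```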


In [234]:
print("[Info %s] Opening GWAS file.." %(get_now()))
GWAS_file = pybedtools.BedTool(GWAS_file_loc)
full_GWAS = pybedtools.bedtool.BedTool.to_dataframe(GWAS_file)
GWAS_df = full_GWAS[full_GWAS.chrom == str(chromosome)]
print("[Info %s] Calculating plot coordinates for overlapping GWAS signals." %(get_now()))
GWAS_df = GWAS_df.apply(generate_xy, args = (min_pos, chunk_size, width), axis =1)

[Info 01:14:21] Opening GWAS file..
[Info 01:14:21] Calculating plot coordinates for overlapping GWAS signals.


In [264]:
import cairosvg


class SVG_plot:
    '''
    This class contains all the methods and data to create the genome plot in svg format.
    Once the process is done, the user can choose to save the svg or render it using cairosvg.
    Good luck boy.
    '''

    def __init__(self, width, height, pixel, centr_start, centr_end):
        self.width = width
        self.height = height
        self.pixel = pixel
        self.centr_start = centr_start
        self.centr_end = centr_end
        self.plot =  '<svg width="%s" height="%s">\n' % (width*pixel, height*pixel)

    # Adding square:
    def draw_chunk(self, row):
        '''
        This function expects a row of a dataframe in which there must be the
        following three columns:
            x - column number
            y - row number
            color - color of the field in hexadecimal code.
        '''
        x = row['x']
        y = row['y']
        color = row['color']
        self.plot += ('<rect x="%s" y="%s" width="%s" height="%s" style="stroke-width:1;stroke:%s; fill: %s" />\n' %
            (x*self.pixel, y*self.pixel, self.pixel , self.pixel, color, color))

    # Adding dot:
    def draw_GWAS(self, row):
        x = row['x']
        y = row['y']
        self.plot += ('<circle cx="%s" cy="%s" r="2" stroke="%s" stroke-width="1" fill="%s" />\n' %
            (x*self.pixel, y*self.pixel, "black", "black"))

    def mark_centromere(self):
        # Calculate centromere dimensions:
        centr = (
            int(self.centr_start / self.width * self.pixel),
            int(self.centr_end / self.width * self.pixel),
            int((self.centr_end + self.centr_start) / (self.width * 2) * self.pixel),
            int((self.centr_end - self.centr_start) / (self.width * 2) * self.pixel)
        )

        # Marking centromere on the left:
        self.plot += ('<path d="M %s %s C %s %s, %s %s, %s %s C %s %s, %s %s, %s %s Z" fill="white"/>' %
                 ( 0, centr[0],
                   0, centr[2], 
                   centr[3], centr[2], 
                   centr[3]*2, centr[2],
                   centr[3], centr[2], 
                   0, centr[2], 
                   0, centr[1]))
        
        # Marking centromere on the right:
        self.plot += ('<path d="M %s %s C %s %s, %s %s, %s %s C %s %s, %s %s, %s %s Z" fill="white"/>' %
                 ( self.width * self.pixel, centr[0],
                   self.width * self.pixel, centr[2], 
                   self.width * self.pixel - centr[3], centr[2], 
                   self.width * self.pixel - centr[3]*2, centr[2],
                   self.width * self.pixel - centr[3], centr[2], 
                   self.width * self.pixel, centr[2], 
                   self.width * self.pixel, centr[1]))        
        
    # Adding legend?
    # Adding centromere?
    # Adding frame?
    # Adding custom annotation.


    # Close svg document:
    def __close(self):
        self.plot += '</svg>\n'

    # Save svg file:
    def save_svg(self, file_name):
        self.__close() # Closing the svg tag.
        with open(file_name, 'w') as f:
            f.write(self.plot)

    # Render png file:
    def save_png(self, file_name):
        cairosvg.svg2png(bytestring=self.plot,write_to=file_name)


In [265]:
pixel = 3
outputFileName = ("chr%s.w.test.%s" %(chromosome, width))
plot = SVG_plot(chr_dataf.x.max(), chr_dataf.y.max(), pixel, 
                int(centromer_loc[0]/chunk_size), int(centromer_loc[1]/chunk_size))

print("[Info %s] Drawing svg... might take a while to complete." % (get_now()))
chr_dataf.apply(plot.draw_chunk, axis = 1)

print("[Info %s] Adding GWAS hits." %(get_now()))
GWAS_df.apply(plot.draw_GWAS, axis = 1 )

print("[Info %s] Marking centromere." %(get_now()))
plot.mark_centromere()

print("[Info %s] Saving svg file.." % (get_now()))
plot.save_svg(outputFileName + ".svg")

print("[Info %s] Saving png file.." % (get_now()))
plot.save_png(outputFileName + ".png")

[Info 02:56:50] Drawing svg... might take a while to complete.
[Info 03:00:32] Adding GWAS hits.
[Info 03:00:36] Marking centromere.
[Info 03:00:36] Saving svg file..
[Info 03:00:36] Saving png file..
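
According to the log above, drawing the chunks took several minutes. Part of that time may come from growing one large string with `+=` for every chunk. A possible alternative, sketched below, collects the fragments in a list and joins them once at the end. The functions `rect` and `build_svg` are hypothetical and only reproduce the `<rect>` markup of `draw_chunk`:

```python
def rect(x, y, pixel, color):
    """One chunk as an SVG <rect>, same markup as draw_chunk."""
    return ('<rect x="%s" y="%s" width="%s" height="%s" '
            'style="stroke-width:1;stroke:%s; fill: %s" />' %
            (x * pixel, y * pixel, pixel, pixel, color, color))


def build_svg(chunks, width, height, pixel):
    """Assemble the whole document from an iterable of (x, y, color) tuples."""
    parts = ['<svg width="%s" height="%s">' % (width * pixel, height * pixel)]
    parts.extend(rect(x, y, pixel, color) for x, y, color in chunks)
    parts.append('</svg>')
    return '\n'.join(parts) + '\n'


svg = build_svg([(0, 0, '#ffc6af'), (1, 0, '#c3e8df')], width=2, height=1, pixel=3)
print(svg)
```

With the dataframe from the notebook, the tuples could be produced with `zip(chr_dataf.x, chr_dataf.y, chr_dataf.color)`, which also avoids the row-wise `apply`.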


In [78]:
class SVG_plot:
    '''
    This class contains all the methods and data to create the genome plot in svg format.
    Once the process is done, the user can choose to save the svg or render it using cairosvg.

    margins = array of four integers corresponding to the left, upper, right and bottom margin.
    '''

    def __init__(self, width, height, pixel, margins = (0,0,0,0)):
        self.width = width
        self.height = height
        self.pixel = pixel
        self.margins = margins

        # This is a scaled star for plotting custom annotation:
        self.star = [e for ts in zip([self.__rotate(0, pixel, math.radians(alpha*36)) for alpha in range(0,9,2)],
                                    [self.__rotate(0, pixel, math.radians(alpha*36), 1.7) for alpha in range(1,10,2)],) for e in ts]

        self.plot =  '<svg width="%s" height="%s">\n' % (width*pixel + margins[0] + margins[2], height*pixel + margins[1] + margins[3])

    def __adjust_coord(self, x, y):
        ''' This simple function just shifts the coordinates by the margins and corrects for the pixel size.'''
        return (int(x*self.pixel + self.margins[0]),
                int(y*self.pixel + self.margins[1]))

    def add_assoc(self, row):
        ''' Add a star to the plot; the rsID and trait labels will be added later.'''
        (x,y) = self.__adjust_coord(row['x'], row['y'])

        # Offsetting the star:
        self.plot += ('<polygon stroke="black" fill="#A569BD" stroke-width="2" points="%s" />\n' %
                      " ".join([",".join([str(x + a[0]),str(y+a[1])]) for a in self.star]))

        # Adding annotation:
        # to be implemented.... not yet there...


    # Adding square:
    def draw_chunk(self, row):
        '''
        This function expects a row of a dataframe in which there must be the
        following three columns:
            x - column number
            y - row number
            color - color of the field in hexadecimal code.
        '''
        (x,y) = self.__adjust_coord(row['x'],row['y'])
        color = row['color']
        self.plot += ('<rect x="%s" y="%s" width="%s" height="%s" style="stroke-width:1;stroke:%s; fill: %s" />\n' %
            (x, y, self.pixel , self.pixel, color, color))

    # Adding dot:
    def draw_GWAS(self, row):
        (x,y) = self.__adjust_coord(row['x'],row['y'])
        self.plot += ('<circle cx="%s" cy="%s" r="%s" stroke="%s" stroke-width="1" fill="%s" />\n' %
            (x, y, self.pixel * 0.75, "black", "black"))

    def mark_centromere(self, centr_start, centr_end):

        # calculating the y coordinates of the centromeres based on the chunk count:
        start_y = int(centr_start / self.width)
        end_y = int(centr_end / self.width)

        # Marking centromere on the left:
        self.plot += ('<path d="M %s %s C %s %s, %s %s, %s %s C %s %s, %s %s, %s %s Z" fill="white"/>\n' %
                  ( self.__adjust_coord(0, start_y) +
                    self.__adjust_coord(0, (start_y+end_y)/2) +
                    self.__adjust_coord((end_y-start_y), (start_y+end_y)/2) +
                    self.__adjust_coord((end_y-start_y)*2, (start_y+end_y)/2) +
                    self.__adjust_coord((end_y-start_y), (start_y+end_y)/2) +
                    self.__adjust_coord(0, (start_y+end_y)/2) +
                    self.__adjust_coord(0, end_y )))

        # Marking centromere on the right:
        self.plot += ('<path d="M %s %s C %s %s, %s %s, %s %s C %s %s, %s %s, %s %s Z" fill="white"/>\n' %
                  ( self.__adjust_coord(self.width+1, start_y) +
                    self.__adjust_coord(self.width+1, (start_y+end_y)/2) +
                    self.__adjust_coord(self.width+1-(end_y-start_y), (start_y+end_y)/2) +
                    self.__adjust_coord(self.width+1-(end_y-start_y)*2, (start_y+end_y)/2) +
                    self.__adjust_coord(self.width+1-(end_y-start_y), (start_y+end_y)/2) +
                    self.__adjust_coord(self.width+1, (start_y+end_y)/2) +
                    self.__adjust_coord(self.width+1, end_y)))

    # Adding legend?
    # Adding centromere? DONE
    # Adding frame?
    # Adding custom annotation.
    # Adding cytobands.
    # Rotating and scaling a vector:
    def __rotate(self, x, y, angle, scale=1):
        return ((x * math.cos(angle) - y * math.sin(angle))*scale,
                (x * math.sin(angle) + y * math.cos(angle))*scale)

    # Close svg document:
    def __close(self):
        self.plot += '</svg>\n'

    # Save svg file:
    def save_svg(self, file_name):
        self.__close() # Closing the svg tag.
        with open(file_name, 'w') as f:
            f.write(self.plot)

    # Render png file:
    def save_png(self, file_name):
        cairosvg.svg2png(bytestring=self.plot,write_to=file_name)

    # R
plot = SVG_plot(200, 40000/200, 9, margins = [10, 10, 10, 10])
row = { 'x' : 2.5,
       'y' : 4.1 }
plot.add_assoc(row)
print(plot.plot)

<svg width="1820" height="1820.0">
<polygon stroke="black" fill="#A569BD" stroke-width="2" points="32.0,55.0 23.00688563992516,58.37796001393669 23.440491353343617,48.78115294937453 17.448835300684152,41.272039986063305 26.70993272936774,38.71884705062547 31.999999999999996,30.700000000000003 37.29006727063226,38.71884705062547 46.55116469931585,41.272039986063305 40.55950864665638,48.78115294937452 40.99311436007484,58.37796001393669" />
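
The polygon printed above is a five-pointed star: ten vertices, 36 degrees apart, alternating between the base radius and a radius scaled by 1.7. The standalone sketch below generates the same offsets. The function names `star_offsets` and `polygon_points` are hypothetical, and the coordinates are rounded to one decimal to keep the markup short:

```python
import math


def star_offsets(radius, outer_scale=1.7, points=5):
    """Offsets of a star outline around (0, 0), alternating inner and outer vertices."""
    step = 180.0 / points  # 36 degrees between neighbouring vertices for a 5-pointed star
    offsets = []
    for k in range(2 * points):
        angle = math.radians(k * step)
        scale = outer_scale if k % 2 else 1.0
        # Rotate the vector (0, radius) by `angle`, then scale it.
        offsets.append((-radius * math.sin(angle) * scale,
                        radius * math.cos(angle) * scale))
    return offsets


def polygon_points(cx, cy, offsets):
    """Format the SVG 'points' attribute for a star centred on (cx, cy)."""
    return ' '.join('%.1f,%.1f' % (cx + dx, cy + dy) for dx, dy in offsets)


# Same centre (32, 46) and pixel size (9) as in the test call above.
print(polygon_points(32, 46, star_offsets(9)))
```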



In [16]:
centromer_coord = (
    int(centromer_loc[0] / (chunk_size * width) * pixel),
    int(centromer_loc[1] / (chunk_size * width) * pixel),
    int((centromer_loc[1] + centromer_loc[0]) / (chunk_size * width * 2) * pixel),
    int((centromer_loc[1] - centromer_loc[0]) / (chunk_size * width * 2) * pixel)
)

# Drawing centromere:
centr_svg = ('<path d="M %s %s C %s %s, %s %s, %s %s C %s %s, %s %s, %s %s Z" fill="white"/>\n' %
         ( 0, centromer_coord[0],
           0, centromer_coord[2], centromer_coord[3], centromer_coord[2], centromer_coord[3]*2, centromer_coord[2],
           centromer_coord[3], centromer_coord[2], 0, centromer_coord[2], 0, centromer_coord[1]
           , ))
centr_svg

NameError: name 'centromer_loc' is not defined

<svg width="2000" height="2000.0">
<path d="M 100 1189 C 100 1189.0, 100 1189.0, 100 1189.0 C 100 1189.0, 100 1189.0, 100 1189 Z" fill="white"/>
<path d="M 1900 1189 C 1900 1189.0, 1900 1189.0, 1900 1189.0 C 1900 1189.0, 1900 1189.0, 1900 1189 Z" fill="white"/>
<path d="M 100 1189 C 100 1189.0, 100 1189.0, 100 1189.0 C 100 1189.0, 100 1189.0, 100 1189 Z" fill="white"/>
<path d="M 1900 1189 C 1900 1189.0, 1900 1189.0, 1900 1189.0 C 1900 1189.0, 1900 1189.0, 1900 1189 Z" fill="white"/>



In [21]:
40000/200*9

1800.0

In [49]:
import math

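def rotate(x,y,angle, scale=1):
    # Rotate the vector (x, y) by `angle` (in radians) and scale it.
    return ((x * math.cos(angle) - y * math.sin(angle))*scale,
            (x * math.sin(angle) + y * math.cos(angle))*scale)
pixel = 1
# NB: alpha*36 is a value in degrees passed without math.radians(); the output below was produced this way.
star = [e for ts in zip([rotate(0, pixel, alpha*36,2) for alpha in range(1,10,2)], [rotate(0, pixel, alpha*36) for alpha in range(0,9,2)]) for e in ts]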


In [55]:
x = 12
y = 20
" ".join([",".join([str(x + a[0]),str(y+a[1])]) for a in star])

'13.98355770688623,19.74407262074519 12.0,21.0 10.14636298916443,20.751019195534024 11.746176637237964,19.03274941172612 13.60230527146766,18.803079861884285 12.49102159389847,20.871147401032342 10.753975577992694,21.564424219884543 11.303941511655088,19.282014916030285 12.808130438912722,18.17053964412925 12.85550437075082,20.517795588650813'