# Read depth and UMI errors

This notebook explores the connection between read depth and edit distance. It finishes by producing figure 1c of the manuscript.

Preliminaries:

In [1]:
import os
from CGAT.Database import fetch_DataFrame, fetch
import pandas as pd
import re
%load_ext rpy2.ipython
%R library(ggplot2)
%R library(Hmisc)
%R library(scales)
# Now run from pipeline directory, no need to chdir
#os.chdir("/ifs/projects/ians/umisdeduping/iCLIP_deduping/SR_iCLIP_test/")





The edit_distance data is stored in the database in a series of tables, one for each combination of sample and dedup method. As well as the edit_distance distribution after deduplication with that method, each table contains the edit_distance distribution obtained using only unique UMIs and the null distribution generated by randomly sampling UMIs. It is the unique-UMI distribution that we will use to calculate, among all positions with more than one UMI, the fraction where the average edit distance between UMIs is 1.

The total number of reads for each sample is stored in the read_counts table. We will use this data to calculate the duplication rate.
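
To make the calculation concrete, here is a minimal sketch on a made-up edit-distance table. The counts are invented for illustration and do not come from the database; only the `Single_UMI` label and the column names follow the notebook. It drops the `Single_UMI` row, converts counts to frequencies, and reads off the frequency at edit distance 1.

```python
import pandas as pd

# Hypothetical counts of positions by average edit distance between UMIs.
# The numbers are invented purely for illustration.
toy = pd.DataFrame({
    "edit_distance": ["Single_UMI", 1, 2, 3, 4],
    "count": [500, 40, 20, 60, 80],
})

# Keep only positions with more than one UMI, as the SQL WHERE clause does.
multi = toy[toy["edit_distance"] != "Single_UMI"].copy()
multi["edit_distance"] = multi["edit_distance"].astype(int)

# Convert counts to frequencies and index by edit distance.
multi["freq"] = multi["count"] / multi["count"].sum()
multi = multi.set_index("edit_distance")

frac_single_dist = multi.loc[1, "freq"]
print(frac_single_dist)  # 40 / 200 = 0.2
```

Positions with a single UMI are excluded from the denominator, so the fraction describes only positions where an edit distance can be measured at all.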

In [2]:
def get_data(table):
    '''Return the track name, the fraction of multi-UMI positions with an
    average edit distance of 1, and the total read count for one table.'''
    track = re.match("(.+_.+_R[0-9])_dedup_cluster_edit_distance", table).groups()[0]
    # Edit distance distribution using unique UMIs, excluding single-UMI positions
    distances = fetch_DataFrame('''SELECT edit_distance, _unique as count 
                               FROM %(track)s_dedup_cluster_edit_distance 
                               WHERE edit_distance != 'Single_UMI' ''' % locals(), "csvdb")
    distances["freq"] = distances["count"]/distances["count"].sum()
    distances = distances.set_index("edit_distance")
    # Total reads before deduplication; read_counts uses "-" rather than "_" in track names
    reads = fetch_DataFrame('''SELECT count 
                           FROM read_counts 
                           WHERE track='%s' AND method='none' ''' % re.sub("_","-", track),
                        "csvdb").iloc[0][0]

    return pd.Series({"track":track, "frac_single_dist": distances.loc[1].freq, "no_reads": reads})
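
The track-name handling in get_data can be tried in isolation. The sketch below reuses the same regular expression and underscore-to-hyphen substitution. The helper name `table_to_track` and the example table name are hypothetical, and the explicit error for a non-matching name is an addition that the notebook function does not have.

```python
import re

def table_to_track(table):
    """Return (track, read_counts key) for an edit-distance table name."""
    match = re.match(r"(.+_.+_R[0-9])_dedup_cluster_edit_distance", table)
    if match is None:
        # Assumption: failing loudly is preferable to an AttributeError on None.
        raise ValueError("unexpected table name: %s" % table)
    track = match.group(1)
    # The read_counts lookup in the notebook swaps "_" for "-" in the track name.
    return track, re.sub("_", "-", track)

# Hypothetical table name, invented for illustration.
track, key = table_to_track("SampleA_FLAG_R1_dedup_cluster_edit_distance")
print(track, key)  # SampleA_FLAG_R1 SampleA-FLAG-R1
```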


We now need a list of all the tables we would like to use. Because we are using only the unique-UMI data, which every table contains regardless of its dedup method, we need the tables for just one dedup method. We find all tables matching that method, then run the get_data function above to collect the data for each sample.
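
Before running this against the real database, the table lookup can be tried on a throwaway in-memory SQLite database. This is only a sketch: the table names below are hypothetical, and it uses the standard-library `sqlite3` module directly rather than `fetch_DataFrame`.

```python
import sqlite3

# Throwaway in-memory database with hypothetical table names.
con = sqlite3.connect(":memory:")
for name in ("SampleA_FLAG_R1_dedup_cluster_edit_distance",
             "SampleA_FLAG_R1_dedup_adjacency_edit_distance",
             "read_counts"):
    con.execute('CREATE TABLE "%s" (x)' % name)

# Same filter as the notebook query: table names ending with the suffix.
rows = con.execute("""SELECT name
                      FROM sqlite_master
                      WHERE type='table'
                      AND name LIKE '%dedup_cluster_edit_distance' """).fetchall()
tables = [r[0] for r in rows]
con.close()

print(tables)  # ['SampleA_FLAG_R1_dedup_cluster_edit_distance']
```

Note that `_` is itself a single-character wildcard in SQL `LIKE`, so the pattern is slightly more permissive than a literal suffix match; with table names like these that makes no practical difference.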

In [3]:
data  = fetch_DataFrame('''SELECT name 
                   FROM sqlite_master 
                   WHERE type='table' 
                   AND name LIKE '%dedup_cluster_edit_distance' ''',
                "csvdb")["name"].apply(get_data)

The first relationship we will look at is read depth against the fraction of positions where the average edit distance is 1.

In [None]:
%%R -i data
library(ggplot2)

ggplot(data) + aes(x=no_reads, y=frac_single_dist) + geom_point()

We don't see much of a relationship here. Although one might assume that higher read depth means more oversampling, that isn't strictly true: the more deeply sequenced samples might simply be more complex. Instead, let's look at the percentage of reads left after deduplication, which is a direct measure of how oversampled the sample is.

In [None]:
def get_deduped_count(track):
    return fetch_DataFrame('''SELECT count
                              FROM read_counts
                              WHERE track='%s' AND method='unique' ''' % re.sub("_","-",track)
                           , "csvdb").iloc[0][0]

In [None]:
data["deduped"] = data["track"].apply(get_deduped_count)
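
As a small illustration of why the fraction of reads remaining is a better measure than raw depth, consider two made-up samples. All numbers and sample names here are invented, and the `frac_remaining` and `dup_rate` column names are hypothetical; only `no_reads` and `deduped` follow the notebook.

```python
import pandas as pd

# Hypothetical read counts before and after deduplication (invented numbers).
toy = pd.DataFrame({
    "track": ["SampleA_FLAG_R1", "SampleB_FLAG_R1"],
    "no_reads": [1000000, 4000000],
    "deduped": [500000, 3200000],
})

# Fraction of reads surviving deduplication, and its complement.
toy["frac_remaining"] = toy["deduped"] / toy["no_reads"]
toy["dup_rate"] = 1 - toy["frac_remaining"]
print(toy[["track", "no_reads", "frac_remaining", "dup_rate"]])
```

In this toy case the sample with four times as many reads has the lower duplication rate, which is the situation described above: depth alone does not tell us how oversampled a library is.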

In [None]:
%%R -i data -w 63 -h 63 -u mm -r 300

ggplot(data) + aes(x=deduped/no_reads, y=frac_single_dist) + geom_point() 

We can see that there is definitely a relationship between these. But it is unclear what fraction of single-edit-distance positions is acceptable. Let's normalise to the expected value, that is, the fraction of single-edit-distance positions in the randomly sampled null.

In [None]:
def get_null_enrichment(track):
    return fetch_DataFrame('''SELECT
                             (_unique+0.0)/_unique_null as enrichment
                               FROM %(track)s_dedup_cluster_edit_distance as ed
                               WHERE edit_distance =1 ''' % locals(), "csvdb").iloc[0][0]

data["enrichment"] = data["track"].apply(get_null_enrichment)
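
The `+0.0` in the query matters if the two columns are stored as integers, which the query suggests but which we have not checked here: SQLite truncates integer division. The sketch below shows the difference on invented counts, using the standard-library `sqlite3` module rather than `fetch_DataFrame`, and then takes the log2 that is plotted in the next cell.

```python
import math
import sqlite3

# Hypothetical counts of edit-distance-1 positions (invented numbers),
# analogous to the _unique and _unique_null columns.
observed, expected = 7, 2

con = sqlite3.connect(":memory:")
# First column: plain integer division. Second: float division via +0.0.
int_ratio, enrichment = con.execute(
    "SELECT ?/?, (?+0.0)/?", (observed, expected, observed, expected)
).fetchone()
con.close()

print(int_ratio)   # 3
print(enrichment)  # 3.5
log2_enrichment = math.log2(enrichment)
print(log2_enrichment)
```

An enrichment of 1 (log2 of 0) would mean that edit-distance-1 positions are exactly as common as expected from randomly sampled UMIs.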

Plot the enrichment of edit-distance-1 positions against the duplication rate (1 minus the fraction of reads remaining after dedup using unique UMIs) and save as figure 1c.

In [None]:
%%R -i data -w 58 -h 58 -u mm -r 300
library(scales)
library(Hmisc)
rsquared = with(data, sprintf("%.2f",cor(1-(deduped/no_reads),log2(enrichment))^2))
g <- ggplot(data) + aes(x=1-(deduped/no_reads), y=log2(enrichment)) + 
               geom_point(size=1) + 
               theme_bw(base_size=8) +
               xlab("Duplication Rate") +
               ylab(expression(paste(Log[2]," enrichment of 1-edit positions"))) +
               annotate("text", label=paste("R^2==",rsquared), parse = T, x=0.55, y=5.75, size=3) +
               scale_x_continuous(labels=percent)  + 
               geom_line(aes(group=1), stat="smooth",method="lm", se=F, alpha = 0.3, col="blue") +
               theme(legend.position="bottom",
                     legend.key.size=unit(0.45,"line"))

ggsave("plots/figure1c.svg",g)
print(g)

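Test the significance of the relationship. The model uses the fraction of reads remaining, rather than the duplication rate, as the predictor; this flips the sign of the slope but leaves R² and the p-value unchanged.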

In [None]:
%%R
model = lm(log2(enrichment) ~ I(deduped/no_reads), data=data)
print(summary(model))
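
The plot uses the duplication rate on the x axis while the model uses the fraction of reads remaining. The following sketch checks on invented data that the two give the same R² and slopes of opposite sign. It uses NumPy rather than R, the values are hypothetical, and the helper `fit` is not part of the notebook.

```python
import numpy as np

# Hypothetical per-sample values, invented for illustration.
frac_remaining = np.array([0.9, 0.7, 0.5, 0.3, 0.2])
log2_enrichment = np.array([0.5, 1.4, 2.6, 3.3, 4.1])
dup_rate = 1 - frac_remaining

def fit(x, y):
    """Least-squares slope, intercept and R^2 for a straight-line fit."""
    slope, intercept = np.polyfit(x, y, 1)
    r_squared = np.corrcoef(x, y)[0, 1] ** 2
    return slope, intercept, r_squared

slope_dup, _, r2_dup = fit(dup_rate, log2_enrichment)
slope_rem, _, r2_rem = fit(frac_remaining, log2_enrichment)

print(slope_dup, slope_rem)  # equal magnitude, opposite sign
print(r2_dup, r2_rem)        # identical up to rounding
```

Because the duplication rate is a linear transformation of the fraction remaining, the strength and significance of the linear relationship are the same whichever is used as the predictor.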