## dbSNP (UCSC snp151, hg38) tabix indexing and tabix indexing of WT and OXP REDItools tables
Here the hg38 dbSNP table from UCSC (snp151: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/snp151.txt.gz) is converted to GTF, sorted, compressed with bgzip and indexed with tabix, as in the Lo Giudice et al., 2020 protocol. In parallel, the REDItools output tables for the Illumina samples are also bgzipped and indexed with tabix.

In [1]:
# visualize the head of the downloaded (and uncompressed) snp151 table
!cat /lustre/bio_running/refs/snp151.hg38.txt | head

585	chr1	10019	10020	rs775809821	0	+	A	A	-/A	genomic	deletion	unknown	0	0	near-gene-5	exact	1		1	SSMP,	0				
585	chr1	10038	10039	rs978760828	0	+	A	A	A/C	genomic	single	unknown	0	0	near-gene-5	exact	1		1	USC_VALOUEV,	0				
585	chr1	10042	10043	rs1008829651	0	+	T	T	A/T	genomic	single	unknown	0	0	near-gene-5	exact	1		1	USC_VALOUEV,	0				
585	chr1	10050	10051	rs1052373574	0	+	A	A	A/G	genomic	single	unknown	0	0	near-gene-5	exact	1		1	USC_VALOUEV,	0				
585	chr1	10051	10051	rs1326880612	0	+	-	-	-/C	genomic	insertion	unknown	0	0	near-gene-5	between	1		1	TOPMED,	0				
585	chr1	10054	10055	rs892501864	0	+	T	T	A/T	genomic	single	unknown	0	0	near-gene-5	exact	1		1	USC_VALOUEV,	0				
585	chr1	10055	10055	rs768019142	0	+	-	-	-/A	genomic	insertion	unknown	0	0	near-gene-5	between	1		1	SSMP,	0				
585	chr1	10062	10063	rs1010989343	0	+	A	A	A/C	genomic	single	unknown	0	0	near-gene-5	exact	1		1	USC_VALOUEV,	0				
585	chr1	10067	10067	rs1489251879	0	+	-	-	lengthTooLong	genomic	in-del	unknown	0	0	

In [2]:
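# create a GTF keeping only genomic single-nucleotide variants ($11=="genomic", $12=="single");
# chromEnd ($4) is used as both start and end, since it is the 1-based position of a single-base SNP.
# note: the source field is labelled "ucsc_snp153_hg38" although the input table is snp151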
!awk 'OFS="\t"{if ($11=="genomic" && $12=="single") print $2,"ucsc_snp153_hg38","snp",$4,$4,".",$7,".","gene_id \""$5"\"; transcript_id \""$5"\";"}' /lustre/bio_running/refs/snp151.hg38.txt > /lustre/bio_running/refs/snp151.hg38.gtf
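For readers less familiar with awk, the same row-to-GTF logic can be sketched in Python. This is only an illustration of what the one-liner above does, not part of the original protocol. The function name `snp_row_to_gtf` is made up for the example, and the field names `molType` and `class` are taken from the UCSC snp151 table schema.

```python
def snp_row_to_gtf(line, source="ucsc_snp153_hg38"):
    """Convert one UCSC snp151 row to a GTF line; return None if the row is filtered out.

    Hypothetical helper mirroring the awk one-liner: keeps only rows with
    molType == "genomic" and class == "single".
    """
    fields = line.rstrip("\n").split("\t")
    # awk columns are 1-based: $2 chrom, $4 chromEnd, $5 name, $7 strand,
    # $11 molType, $12 class. Python lists are 0-based, hence the -1 shift.
    if len(fields) < 12:
        return None
    chrom, chrom_end, name, strand = fields[1], fields[3], fields[4], fields[6]
    mol_type, snp_class = fields[10], fields[11]
    if mol_type != "genomic" or snp_class != "single":
        return None
    # UCSC coordinates are 0-based, half-open: for a single-base SNP the
    # chromEnd value is the 1-based position, so it serves as GTF start and end.
    attributes = f'gene_id "{name}"; transcript_id "{name}";'
    return "\t".join(
        [chrom, source, "snp", chrom_end, chrom_end, ".", strand, ".", attributes]
    )
```

Applied to the second row shown in the table head (rs978760828), this should give the same line that appears first in the GTF; insertions and deletions are dropped because their class is not `single`.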

In [3]:
# visualize some rows
!cat /lustre/bio_running/refs/snp151.hg38.gtf | head

chr1	ucsc_snp153_hg38	snp	10039	10039	.	+	.	gene_id "rs978760828"; transcript_id "rs978760828";
chr1	ucsc_snp153_hg38	snp	10043	10043	.	+	.	gene_id "rs1008829651"; transcript_id "rs1008829651";
chr1	ucsc_snp153_hg38	snp	10051	10051	.	+	.	gene_id "rs1052373574"; transcript_id "rs1052373574";
chr1	ucsc_snp153_hg38	snp	10055	10055	.	+	.	gene_id "rs892501864"; transcript_id "rs892501864";
chr1	ucsc_snp153_hg38	snp	10063	10063	.	+	.	gene_id "rs1010989343"; transcript_id "rs1010989343";
chr1	ucsc_snp153_hg38	snp	10077	10077	.	+	.	gene_id "rs1022805358"; transcript_id "rs1022805358";
chr1	ucsc_snp153_hg38	snp	10108	10108	.	+	.	gene_id "rs62651026"; transcript_id "rs62651026";
chr1	ucsc_snp153_hg38	snp	10109	10109	.	+	.	gene_id "rs376007522"; transcript_id "rs376007522";
chr1	ucsc_snp153_hg38	snp	10120	10120	.	+	.	gene_id "rs1390810297"; transcript_id "rs1390810297";
chr1	ucsc_snp153_hg38	snp	10132	10132	.	+	.	gene_id "rs1436069773"; transcript_id "rs1436069773";
cat: write error: Br

In [4]:
# sort by chromosome name, then numerically by start position (tabix needs position-sorted input)
!sort -k1,1 -k4,4n /lustre/bio_running/refs/snp151.hg38.gtf > /lustre/bio_running/refs/snp151.hg38.sorted.gtf
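Since tabix expects its input to be sorted by sequence name and position, a quick sanity check on the sorted file can save a failed indexing run. The sketch below is an optional extra and not part of the original notebook. The function name `is_position_sorted` and its default column indices are assumptions for the example (0-based columns 0 and 3 correspond to GTF columns 1 and 4).

```python
def is_position_sorted(lines, seq_col=0, start_col=3):
    """Return True if rows are grouped by sequence name and sorted by start within each group.

    Hypothetical helper; seq_col and start_col are 0-based column indices.
    """
    seen = set()        # sequence names already encountered
    current = None      # sequence name of the block being read
    last_start = None   # last start position seen in the current block
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        seq, start = fields[seq_col], int(fields[start_col])
        if seq != current:
            if seq in seen:
                return False  # a sequence reappears after another one: blocks not contiguous
            seen.add(seq)
            current, last_start = seq, start
        elif start < last_start:
            return False      # start position goes backwards within a sequence
        else:
            last_start = start
    return True
```

It could be called on an open file handle, for example `is_position_sorted(open(path))`, before running bgzip and tabix.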

In [6]:
# compressing with bgzip
!bgzip /lustre/bio_running/refs/snp151.hg38.sorted.gtf

In [8]:
# index as gff by tabix the sorted and compressed file
!tabix -p gff /lustre/bio_running/refs/snp151.hg38.sorted.gtf.gz

In [11]:
# let's try to retrieve some snps from the indexed gtf file
!tabix /lustre/bio_running/refs/snp151.hg38.sorted.gtf.gz chr1:10039-10039

chr1	ucsc_snp153_hg38	snp	10039	10039	.	+	.	gene_id "rs978760828"; transcript_id "rs978760828";


In [12]:
# remove the unsorted, uncompressed GTF (no longer needed)
!rm /lustre/bio_running/refs/snp151.hg38.gtf

Now it's time to compress the REDItools output tables for the Illumina samples with bgzip.

In [1]:
!bgzip -c /lustre/bio_running/conticello/illumina/oxp1/DnaRna_470872555/outTable_470872555 > /lustre/bio_running/conticello/illumina/oxp1/DnaRna_470872555/outTable_470872555.gz

In [2]:
!bgzip -c /lustre/bio_running/conticello/illumina/oxp2/DnaRna_73346045/outTable_73346045 > /lustre/bio_running/conticello/illumina/oxp2/DnaRna_73346045/outTable_73346045.gz

In [3]:
!bgzip -c /lustre/bio_running/conticello/illumina/oxp3/DnaRna_808842865/outTable_808842865 > /lustre/bio_running/conticello/illumina/oxp3/DnaRna_808842865/outTable_808842865.gz

In [5]:
!bgzip -c /lustre/bio_running/conticello/illumina/wt1/DnaRna_505821894/outTable_505821894 > /lustre/bio_running/conticello/illumina/wt1/DnaRna_505821894/outTable_505821894.gz

In [6]:
!bgzip -c /lustre/bio_running/conticello/illumina/wt2/DnaRna_83292749/outTable_83292749 > /lustre/bio_running/conticello/illumina/wt2/DnaRna_83292749/outTable_83292749.gz

In [7]:
!bgzip -c /lustre/bio_running/conticello/illumina/wt3/DnaRna_296402424/outTable_296402424 > /lustre/bio_running/conticello/illumina/wt3/DnaRna_296402424/outTable_296402424.gz

The compressed REDItools tables are then indexed with tabix. First, let's look at the head of one table to identify the columns to index.

In [33]:
!zcat /lustre/bio_running/conticello/illumina/oxp1/DnaRna_*/outTable_*.gz | head

Region	Position	Reference	Strand	Coverage-q30	MeanQ	BaseCount[A,C,G,T]	AllSubs	Frequency	gCoverage-q0	gMeanQ	gBaseCount[A,C,G,T]	gAllSubs	gFrequency
chrY	2667118	A	1	1	37.00	[1, 0, 0, 0]	-	0.00	-	-	-	-	-
chrY	2667119	T	1	1	37.00	[0, 0, 0, 1]	-	0.00	-	-	-	-	-
chrY	2667120	T	1	1	37.00	[0, 0, 0, 1]	-	0.00	-	-	-	-	-
chrY	2667121	T	1	1	37.00	[0, 0, 0, 1]	-	0.00	-	-	-	-	-
chrY	2667122	A	1	1	37.00	[1, 0, 0, 0]	-	0.00	-	-	-	-	-
chrY	2667123	A	1	1	37.00	[1, 0, 0, 0]	-	0.00	-	-	-	-	-
chrY	2667124	C	1	1	37.00	[0, 1, 0, 0]	-	0.00	-	-	-	-	-
chrY	2667125	C	1	1	37.00	[0, 1, 0, 0]	-	0.00	-	-	-	-	-
chrY	2667126	T	1	1	37.00	[0, 0, 0, 1]	-	0.00	-	-	-	-	-

gzip: stdout: Broken pipe


In [36]:
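# index the oxp1 table first: sequence name in column 1 (-s 1), position in column 2 (-b 2 -e 2),
# skip the header line starting with "R" (-c R), overwrite any existing index (-f)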
!tabix -s 1 -b 2 -e 2 -c R -f /lustre/bio_running/conticello/illumina/oxp1/DnaRna_470872555/outTable_470872555.gz

In [37]:
!tabix /lustre/bio_running/conticello/illumina/oxp1/DnaRna_470872555/outTable_470872555.gz chrY:2667120-2667124

chrY	2667120	T	1	1	37.00	[0, 0, 0, 1]	-	0.00	-	-	-	-	-
chrY	2667121	T	1	1	37.00	[0, 0, 0, 1]	-	0.00	-	-	-	-	-
chrY	2667122	A	1	1	37.00	[1, 0, 0, 0]	-	0.00	-	-	-	-	-
chrY	2667123	A	1	1	37.00	[1, 0, 0, 0]	-	0.00	-	-	-	-	-
chrY	2667124	C	1	1	37.00	[0, 1, 0, 0]	-	0.00	-	-	-	-	-
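Rows retrieved this way are plain tab-separated text, so downstream code has to turn them into typed values. Below is a minimal sketch of how one row could be parsed, assuming the 14-column header shown above. The names `COLUMNS` and `parse_reditools_row` are hypothetical, and the genomic (`g*`) columns are left as strings because they can hold `-`.

```python
import ast

# Column names copied from the header line of the REDItools table.
COLUMNS = [
    "Region", "Position", "Reference", "Strand", "Coverage-q30", "MeanQ",
    "BaseCount[A,C,G,T]", "AllSubs", "Frequency", "gCoverage-q0", "gMeanQ",
    "gBaseCount[A,C,G,T]", "gAllSubs", "gFrequency",
]


def parse_reditools_row(line):
    """Parse one tab-separated REDItools row into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    row = dict(zip(COLUMNS, fields))
    row["Position"] = int(row["Position"])
    row["Coverage-q30"] = int(row["Coverage-q30"])
    row["MeanQ"] = float(row["MeanQ"])
    row["Frequency"] = float(row["Frequency"])
    # "[0, 0, 0, 1]" is a list literal: map its counts onto the A, C, G, T bases.
    counts = ast.literal_eval(row["BaseCount[A,C,G,T]"])
    row["BaseCount[A,C,G,T]"] = dict(zip("ACGT", counts))
    return row
```

For the first retrieved row (chrY:2667120) this would give position 2667120, a coverage of 1 and a single T in the base counts.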


In [13]:
# continue with the other compressed files
!tabix -s 1 -b 2 -e 2 -c R -f /lustre/bio_running/conticello/illumina/oxp2/DnaRna_*/outTable_*.gz

In [14]:
!tabix -s 1 -b 2 -e 2 -c R -f /lustre/bio_running/conticello/illumina/oxp3/DnaRna_*/outTable_*.gz

In [15]:
!tabix -s 1 -b 2 -e 2 -c R -f /lustre/bio_running/conticello/illumina/wt1/DnaRna_*/outTable_*.gz

In [16]:
!tabix -s 1 -b 2 -e 2 -c R -f /lustre/bio_running/conticello/illumina/wt2/DnaRna_*/outTable_*.gz

In [17]:
!tabix -s 1 -b 2 -e 2 -c R -f /lustre/bio_running/conticello/illumina/wt3/DnaRna_*/outTable_*.gz
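The compression and indexing commands above are repeated once per sample; the same work could be driven by a small loop. The following is only a sketch under stated assumptions: the helper names (`tabix_cmd`, `compress_and_index`, `find_tables`) are invented for the example, the directory layout is the one used in the cells above, and `bgzip` and `tabix` are assumed to be on the `PATH`.

```python
import glob
import subprocess

BASE = "/lustre/bio_running/conticello/illumina"
SAMPLES = ["oxp1", "oxp2", "oxp3", "wt1", "wt2", "wt3"]


def tabix_cmd(gz_path):
    """Build the tabix command used above for a REDItools table."""
    # -s 1: sequence name column; -b 2 -e 2: position column;
    # -c R: skip the header line (it starts with "Region"); -f: overwrite the index.
    return ["tabix", "-s", "1", "-b", "2", "-e", "2", "-c", "R", "-f", gz_path]


def compress_and_index(table_path, run=subprocess.run):
    """bgzip a table to <table_path>.gz (keeping the original), then index it."""
    gz_path = table_path + ".gz"
    with open(gz_path, "wb") as out:
        run(["bgzip", "-c", table_path], stdout=out, check=True)
    run(tabix_cmd(gz_path), check=True)
    return gz_path


def find_tables(base=BASE, samples=SAMPLES):
    """Return the uncompressed outTable_* paths of every sample."""
    paths = []
    for sample in samples:
        for path in sorted(glob.glob(f"{base}/{sample}/DnaRna_*/outTable_*")):
            # The glob also matches existing .gz and .gz.tbi files: skip them.
            if not path.endswith((".gz", ".tbi")):
                paths.append(path)
    return paths
```

Running `for table in find_tables(): compress_and_index(table)` would then process all six samples in one go. The `run` parameter is there only so the helper can be exercised without the real tools installed.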

These indexed files will be used in the notebook named *"Analysis_merged_EdSites_merged_wt_oxp"* to filter the sites common to the wt samples and, separately, those common to the oxp samples. The aim is to assess whether the sites remaining after the correction are, as expected, more frequently known SNPs in wt than in oxp.
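To make the intended comparison concrete, here is a toy sketch of the quantity involved: the fraction of candidate sites that coincide with a known SNP position. It is not the code of the analysis notebook, which is not shown here. The function name `snp_fraction` is hypothetical, and sites are assumed to be represented as `(chromosome, position)` tuples.

```python
def snp_fraction(sites, snp_positions):
    """Fraction of (chrom, pos) sites that coincide with a known SNP position.

    Hypothetical helper for illustration; returns 0.0 for an empty site list.
    """
    sites = set(sites)
    if not sites:
        return 0.0
    return len(sites & set(snp_positions)) / len(sites)
```

Computing this value once for the wt sites and once for the oxp sites would allow the two groups to be compared directly. In practice the SNP positions would come from tabix queries on the indexed GTF rather than from an in-memory set.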