# Working with BED files

## Beginning

One effective way to process BED files is **BEDTools**, which offers tools for a wide range of genomics analysis tasks.
It can be installed as a pre-compiled binary from its website and used to process BED files in the terminal.
(https://bedtools.readthedocs.io/en/latest/)

To use the BEDTools commands in Python, the **pybedtools** package wraps the terminal commands as Python functions, so that more complex pipelines and scripts can be built. (https://daler.github.io/pybedtools/index.html)

In [2]:
import pybedtools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [7]:
path_01 = "/sybig/home/jme/Bachelorarbeit/Collect_Data/Test_BED_files/ARNT.bed"
path_02 = "/sybig/home/jme/Bachelorarbeit/Collect_Data/Test_BED_files/ASCL1.bed"

a = pybedtools.BedTool(path_01)
b = pybedtools.BedTool(path_02)

a_and_b = a.intersect(b)

print("\nARNT.bed")
a.head(2)

print("\nASCL1.bed")
b.head(2)

print("\nIntersect")
a_and_b.head(2)




ARNT.bed
chr1	827391	827499	.	0	+
 chr1	827504	827610	.	0	+
 
ASCL1.bed
chr1	633900	634013	.	0	+
 chr1	858778	858888	.	0	+
 
Intersect
chr1	910768	910824	.	0	+
 chr1	913807	913855	.	0	+
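With its default options, `intersect` reports the part of each interval in `a` that overlaps an interval in `b`. The sketch below reproduces that idea for a single pair of intervals in plain Python, assuming standard BED coordinates (0-based start, exclusive end). The `Interval` class, the `overlap` function, and the coordinates are hypothetical and only serve as an illustration; they are not part of pybedtools.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def overlap(a: Interval, b: Interval) -> Optional[Interval]:
    """Return the shared part of two intervals, or None if they do not overlap."""
    if a.chrom != b.chrom:
        return None
    start = max(a.start, b.start)
    end = min(a.end, b.end)
    # Half-open coordinates: intervals that only touch (end == start) share no base.
    if start >= end:
        return None
    return Interval(a.chrom, start, end)


# Made-up coordinates, chosen only for illustration.
x = Interval("chr1", 100, 200)
y = Interval("chr1", 150, 260)
print(overlap(x, y))  # Interval(chrom='chr1', start=150, end=200)
```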
 

In [16]:
df = a.to_dataframe()
df["chrom"].unique()

array(['chr1', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14',
       'chr14_GL000225v1_random', 'chr14_KI270722v1_random', 'chr15',
       'chr16', 'chr17', 'chr17_GL000205v2_random',
       'chr17_KI270729v1_random', 'chr18', 'chr19',
       'chr1_KI270713v1_random', 'chr2', 'chr20', 'chr21', 'chr22',
       'chr22_KI270732v1_random', 'chr22_KI270733v1_random',
       'chr22_KI270736v1_random', 'chr3', 'chr4',
       'chr4_GL000008v2_random', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9',
       'chr9_KI270718v1_random', 'chrM', 'chrUn_GL000195v1',
       'chrUn_GL000214v1', 'chrUn_GL000216v2', 'chrUn_GL000219v1',
       'chrUn_GL000220v1', 'chrUn_GL000224v1', 'chrUn_KI270303v1',
       'chrUn_KI270304v1', 'chrUn_KI270411v1', 'chrUn_KI270435v1',
       'chrUn_KI270438v1', 'chrUn_KI270442v1', 'chrUn_KI270465v1',
       'chrUn_KI270467v1', 'chrUn_KI270507v1', 'chrUn_KI270510v1',
       'chrUn_KI270515v1', 'chrUn_KI270538v1', 'chrUn_KI270744v1',
       'chrUn_KI270746v1', 'chrUn_KI270751v1', 'chr

In [19]:
test = pybedtools.BedTool("/sybig/home/jme/Bachelorarbeit/Collect_Data/Test_BED_files/ARNT_test.bed")
test.count()

97146

In [13]:
test = "ha"
array = np.array(["ha", "hi"])
np.isin(test, array)

array(True)

In [14]:
chr_ordered = np.array(["chr1", "chr2", "chr3", "chr4","chr5","chr6","chr7","chr8", "chr9","chr10","chr11","chr12","chr13","chr14","chr15","chr16","chr17","chr18","chr19", "chr20","chr21","chr22","chrX","chrY"])
c = a.filter(lambda x: np.isin(x.chrom, chr_ordered)).saveas("ARNT_filtered.bed")
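The chromosome list above can also be generated instead of typed out, and membership can be tested with a plain Python set. This is a minimal sketch; the names `CANONICAL` and `is_canonical` are made up for this example. A predicate of the same shape, for instance `lambda x: x.chrom in CANONICAL`, should be usable with `a.filter(...)` in place of the `np.isin` call, although that was not tested here.

```python
# Canonical human chromosomes: chr1..chr22 plus chrX and chrY.
CANONICAL = frozenset([f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"])


def is_canonical(chrom: str) -> bool:
    """True for chr1-chr22, chrX and chrY; False for chrM, chrUn_* and *_random."""
    return chrom in CANONICAL


names = ["chr1", "chr14_GL000225v1_random", "chrUn_GL000195v1", "chrM", "chrY"]
print([n for n in names if is_canonical(n)])  # ['chr1', 'chrY']
```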


In [119]:
ARNT_filter_path = "/sybig/home/jme/Bachelorarbeit/Collect_Data/ARNT_filtered.bed"
c.to_dataframe()

Unnamed: 0,chrom,start,end,name,score,strand
0,chr1,827391,827499,.,0,+
1,chr1,827504,827610,.,0,+
2,chr1,865784,865892,.,0,+
3,chr1,869795,869901,.,0,+
4,chr1,904640,904746,.,0,+
...,...,...,...,...,...,...
97141,chrY,16646113,16646219,.,0,-
97142,chrY,16831575,16831683,.,0,-
97143,chrY,56697203,56697309,.,0,-
97144,chrY,56707185,56707291,.,0,-


## Processing UniBind Data

The genomic locations of all direct TFBSs in the human genome from the robust collection are downloaded from the UniBind database.

In [47]:
import pybedtools

raw_UniBind_path = "/sybig/projects/GeneRegulation/data/jme/Bachelorarbeit/raw_data/hg38_UniBind_allTFBSs.bed"

raw_UniBind = pybedtools.BedTool(raw_UniBind_path)

# Counting the entries of the raw data takes about 8 minutes --> 97492844
#raw_UniBind.count()


97492844

This data contains 97,492,844 genomic locations. Besides chr1-chr22, chrX, and chrY, it unfortunately also includes many locations that could not be assigned to a chromosome; these are labeled chrUn_* or chrN_*_random. The raw data is therefore filtered and saved to a new BED file.

In [82]:
# Filter the raw data and only save chr1-22 and chrX and chrY in BED file

chr_ordered = np.array(["chr1", "chr2", "chr3", "chr4","chr5","chr6","chr7","chr8", "chr9","chr10","chr11","chr12","chr13","chr14","chr15","chr16","chr17","chr18","chr19", "chr20","chr21","chr22","chrX","chrY"])
output = "/sybig/projects/GeneRegulation/data/jme/Bachelorarbeit/raw_data/UniBind_allTFBS_filtered.bed"

data_unibind = raw_UniBind.filter(lambda x: np.isin(x.chrom, chr_ordered)).saveas(output)
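Since the raw file has roughly 97 million lines, a plain line-by-line filter may be worth considering as an alternative to the call above. The sketch below assumes a tab-separated BED file and only looks at the first column. The function name `filter_bed` and the tiny demo file are made up for this example, and whether this is actually faster than the pybedtools filter was not measured.

```python
import os
import tempfile

KEEP = frozenset([f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"])


def filter_bed(src_path: str, dst_path: str, keep=KEEP) -> int:
    """Copy lines whose first column is in `keep`; return the number of lines kept."""
    kept = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            # Skip blank lines and track/browser/comment header lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chrom = line.split("\t", 1)[0]
            if chrom in keep:
                dst.write(line)
                kept += 1
    return kept


# Tiny made-up input, written to a temporary directory for the demonstration.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "raw.bed")
    dst = os.path.join(tmp, "filtered.bed")
    with open(src, "w") as fh:
        fh.write("chr1\t10\t20\nchrUn_GL000195v1\t5\t9\nchrX\t1\t4\n")
    n_kept = filter_bed(src, dst)
    with open(dst) as fh:
        filtered_lines = fh.read().splitlines()

print(n_kept, filtered_lines)
```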

In [107]:
path = "/sybig/projects/GeneRegulation/data/jme/Bachelorarbeit/raw_data/UniBind_TFBSs.bed"
data = pybedtools.BedTool(path)


In [118]:
tfbs = data[4]
print(tfbs.fields)

['chr1', '16243', '16262', 'EXP000597_HeLa-S3--cervical-adenocarcinoma-_CTCF_MA0139.1', '0', '+', '16243', '16262', '3,22,250']
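The nine fields printed above match the standard BED9 columns (chrom, start, end, name, score, strand, thickStart, thickEnd, itemRgb). The sketch below labels them and splits the name into its parts. The layout `<experiment>_<cell type>_<TF>_<matrix ID>` is an assumption read off this single record, not a documented format, and the helper `parse_tfbs` is hypothetical.

```python
fields = ['chr1', '16243', '16262',
          'EXP000597_HeLa-S3--cervical-adenocarcinoma-_CTCF_MA0139.1',
          '0', '+', '16243', '16262', '3,22,250']

BED9_COLUMNS = ["chrom", "start", "end", "name", "score", "strand",
                "thickStart", "thickEnd", "itemRgb"]


def parse_tfbs(fields):
    """Turn a BED9 field list into a dict with typed coordinates and name parts."""
    record = dict(zip(BED9_COLUMNS, fields))
    for key in ("start", "end", "thickStart", "thickEnd"):
        record[key] = int(record[key])
    # Assumption: name = <experiment>_<cell type>_<TF>_<matrix ID>, where only
    # the cell-type part may itself contain underscores.
    experiment, rest = record["name"].split("_", 1)
    cell_type, tf, matrix_id = rest.rsplit("_", 2)
    record.update(experiment=experiment, cell_type=cell_type,
                  tf=tf, matrix_id=matrix_id)
    return record


site = parse_tfbs(fields)
print(site["tf"], site["matrix_id"], site["end"] - site["start"])  # CTCF MA0139.1 19
```

If transcription factor names can contain underscores in other records, the `rsplit` step would need to be adapted.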


## Processing RefSeq Data

The RefSeq database includes genomic regions for all known genes in the human genome.
These files are in GFF format rather than BED format, so they use different annotations (e.g. the RefSeq accession NC_000001.11 instead of chr1).

BEDTools can also work with GFF files, but this may make later steps more complicated.

In [78]:
refseq_path = "/sybig/projects/GeneRegulation/data/jme/Bachelorarbeit/raw_data/ncbi_dataset/data/GCF_000001405.40/genomic.gff"
refseq = pybedtools.BedTool(refseq_path)

print(refseq[2])

NC_000001.11	BestRefSeq	transcript	11874	14409	.	+	.	ID=rna-NR_046018.2;Parent=gene-DDX11L1;Dbxref=GeneID:100287102,GenBank:NR_046018.2,HGNC:HGNC:37102;Name=NR_046018.2;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
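If the RefSeq GFF file is used after all, the sequence names have to be translated before they can be compared with the chr-style names in the BED files. The sketch below assumes the human RefSeq numbering (NC_000001 to NC_000022 for chromosomes 1-22, NC_000023 for X, NC_000024 for Y, NC_012920 for the mitochondrial genome) and ignores the version suffix. The function name `refseq_to_ucsc` is made up for this example.

```python
import re

# NC_000001..NC_000022 are chromosomes 1-22; NC_000023 is X, NC_000024 is Y.
_ACCESSION = re.compile(r"^NC_0000(\d{2})\.\d+$")


def refseq_to_ucsc(accession: str):
    """Map e.g. 'NC_000001.11' to 'chr1'; return None for anything else."""
    if accession.startswith("NC_012920."):
        return "chrM"  # mitochondrial genome
    match = _ACCESSION.match(accession)
    if match is None:
        return None  # unplaced scaffolds, patches, alternate loci, ...
    number = int(match.group(1))
    if 1 <= number <= 22:
        return f"chr{number}"
    return {23: "chrX", 24: "chrY"}.get(number)


print(refseq_to_ucsc("NC_000001.11"))  # chr1
```

Accessions that return `None` would be dropped, which matches the restriction to chr1-22, chrX, and chrY used for the BED files (apart from `chrM`, which would still need to be removed).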



It may be simpler to use the UCSC refGene files, which are stored in GTF format and contain only the curated genes.

In [114]:
ucsc_refseq_path = "/sybig/projects/GeneRegulation/data/jme/Bachelorarbeit/raw_data/hg38.refGene.gtf"
ucsc_refseq = pybedtools.BedTool(ucsc_refseq_path)

interval = ucsc_refseq[0]
print(interval.fields)


['chr1', 'refGene', 'transcript', '11874', '14409', '.', '+', '.', 'gene_id "DDX11L1"; transcript_id "NR_046018";  gene_name "DDX11L1";']
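The last field holds the GTF attributes as `key "value";` pairs. A small regular expression is enough to turn the string shown above into a dictionary. This is a sketch under two assumptions: all values are quoted, and no key occurs twice (a repeated key would overwrite the earlier value). The helper name `parse_gtf_attributes` is made up for this example.

```python
import re

# One attribute: a key, whitespace, a quoted value, then a semicolon.
_ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_attributes(text: str) -> dict:
    """Parse 'key "value"; key "value";' into a dict."""
    return dict(_ATTRIBUTE.findall(text))


attrs = parse_gtf_attributes(
    'gene_id "DDX11L1"; transcript_id "NR_046018";  gene_name "DDX11L1";'
)
print(attrs["gene_id"], attrs["transcript_id"])  # DDX11L1 NR_046018
```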


To generate promoters for each gene, the exons are not needed, so the dataset can be reduced to the transcript entries.

In [88]:
df_genes = ucsc_refseq.to_dataframe()
df_transcripts = df_genes[df_genes["feature"]=="transcript"]


In [93]:
df_transcripts

Unnamed: 0,seqname,source,feature,start,end,score,strand,frame,attributes
0,chr1,refGene,transcript,11874,14409,.,+,.,"gene_id ""DDX11L1""; transcript_id ""NR_046018""; ..."
4,chr1,refGene,transcript,14362,29370,.,-,.,"gene_id ""WASH7P""; transcript_id ""NR_024540""; ..."
16,chr1,refGene,transcript,17369,17436,.,-,.,"gene_id ""MIR6859-1""; transcript_id ""NR_106918""..."
18,chr1,refGene,transcript,17369,17436,.,-,.,"gene_id ""MIR6859-2""; transcript_id ""NR_107062""..."
20,chr1,refGene,transcript,17369,17436,.,-,.,"gene_id ""MIR6859-3""; transcript_id ""NR_107063""..."
...,...,...,...,...,...,...,...,...,...
1893688,chr6,refGene,transcript,158536640,158635429,.,+,.,"gene_id ""TMEM181""; transcript_id ""NM_020823""; ..."
1893727,chr6,refGene,transcript,158560091,158635429,.,+,.,"gene_id ""TMEM181""; transcript_id ""NM_001376850..."
1893766,chr6,refGene,transcript,158560091,158635429,.,+,.,"gene_id ""TMEM181""; transcript_id ""NM_001376852..."
1893805,chr6,refGene,transcript,158560091,158635429,.,+,.,"gene_id ""TMEM181""; transcript_id ""NM_001376854..."


Goal: Filter the file for all transcripts, either via a DataFrame or, better, via .fields[2]. Possibly define the promoter at the same time, initially as 200 bp upstream. Then save these in a new BED or GTF file.
Important attributes that must remain assignable are the strand of the gene and the gene_id.
It may even be possible to keep exactly the same data as above (in attributes) and store it in the "name" field of the BED file.
Alternatively, leave them as a GFF file, with the gene name in .name and the strand in .strand, which should be enough for further analyses.
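The plan above could be sketched as follows, working directly on GTF lines. It keeps only `transcript` entries, takes the promoter as the 200 bp upstream of the transcription start site with respect to the strand, and writes a BED6 line with the gene_id in the name column. It assumes GTF coordinates are 1-based and inclusive and BED coordinates are 0-based with an exclusive end. The function name `promoter_bed_line` is hypothetical, and the interval is not clipped at the chromosome end because that would require chromosome sizes.

```python
import re


def promoter_bed_line(gtf_line: str, upstream: int = 200):
    """Return a BED6 line for the promoter of a GTF transcript line, else None."""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "transcript":
        return None
    chrom, strand = fields[0], fields[6]
    start, end = int(fields[3]), int(fields[4])
    match = re.search(r'gene_id "([^"]*)"', fields[8])
    name = match.group(1) if match else "."
    if strand == "+":
        # TSS is the 1-based start; upstream bases are start-upstream .. start-1.
        bed_start, bed_end = start - 1 - upstream, start - 1
    elif strand == "-":
        # TSS is the 1-based end; upstream bases are end+1 .. end+upstream.
        bed_start, bed_end = end, end + upstream
    else:
        return None
    bed_start = max(bed_start, 0)  # do not run past the chromosome start
    if bed_end <= bed_start:
        return None
    return "\t".join([chrom, str(bed_start), str(bed_end), name, "0", strand])


# First two transcript rows from the table above (attributes shortened).
plus = 'chr1\trefGene\ttranscript\t11874\t14409\t.\t+\t.\tgene_id "DDX11L1"; transcript_id "NR_046018";'
minus = 'chr1\trefGene\ttranscript\t14362\t29370\t.\t-\t.\tgene_id "WASH7P"; transcript_id "NR_024540";'
print(promoter_bed_line(plus))   # chr1  11673  11873  DDX11L1  0  +  (tab-separated)
print(promoter_bed_line(minus))  # chr1  29370  29570  WASH7P   0  -  (tab-separated)
```

Because strand and gene_id end up in the BED columns, both stay available for later steps.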