Skip to content
/ python Public

A collection of Python scripts for handling various data files.

Notifications You must be signed in to change notification settings

EvoEpi/python

Repository files navigation

python

A collection of Python scripts for handling various data files.

Epigen[etic|omic]

methTable.py. More often than not you want to categorize loci based on their methylation enrichment. methTable.py will do just this. As input methTable.py requires an allc file (a vcf-like file generated from methylpy) and a bed file of attribute coordinates and ID. The bed file can be easily created from an GFF[3] file. For example, here is how to create a bed file of CDS coordinates from the Arabidopsis thaliana annotation file:

#peak at the CDS features in the annotation file to know what to modify
grep 'CDS' Athaliana_447_Araport11.gene_exons.gff3 | head
Chr1	phytozomev12	CDS	3760	3913	.	+	0	ID=AT1G01010.1.Araport11.447.CDS.1;Parent=AT1G01010.1.Araport11.447;pacid=37401853
Chr1	phytozomev12	CDS	3996	4276	.	+	2	ID=AT1G01010.1.Araport11.447.CDS.2;Parent=AT1G01010.1.Araport11.447;pacid=37401853
Chr1	phytozomev12	CDS	4486	4605	.	+	0	ID=AT1G01010.1.Araport11.447.CDS.3;Parent=AT1G01010.1.Araport11.447;pacid=37401853
Chr1	phytozomev12	CDS	4706	5095	.	+	0	ID=AT1G01010.1.Araport11.447.CDS.4;Parent=AT1G01010.1.Araport11.447;pacid=37401853
Chr1	phytozomev12	CDS	5174	5326	.	+	0	ID=AT1G01010.1.Araport11.447.CDS.5;Parent=AT1G01010.1.Araport11.447;pacid=37401853
Chr1	phytozomev12	CDS	5439	5630	.	+	0	ID=AT1G01010.1.Araport11.447.CDS.6;Parent=AT1G01010.1.Araport11.447;pacid=37401853
Chr1	phytozomev12	CDS	8571	8666	.	-	0	ID=AT1G01020.1.Araport11.447.CDS.1;Parent=AT1G01020.1.Araport11.447;pacid=37399351
Chr1	phytozomev12	CDS	8417	8464	.	-	0	ID=AT1G01020.1.Araport11.447.CDS.2;Parent=AT1G01020.1.Araport11.447;pacid=37399351
Chr1	phytozomev12	CDS	8236	8325	.	-	0	ID=AT1G01020.1.Araport11.447.CDS.3;Parent=AT1G01020.1.Araport11.447;pacid=37399351
Chr1	phytozomev12	CDS	7942	7987	.	-	0	ID=AT1G01020.1.Araport11.447.CDS.4;Parent=AT1G01020.1.Araport11.447;pacid=37399351

#generate bed file of gene (attribute) for CDS (feature)
awk -F"\t" '{if($3 == "CDS") print $1 "\t" $4 "\t" $5 "\t" $9}' Athaliana_447_Araport11.gene_exons.gff3 | \
sed 's/ID=//g;s/.Araport11.*//g' \
> Athaliana_447_Araport11.gene_exons.CDS.bed

head Athaliana_447_Araport11.gene_exons.CDS.bed
Chr1	3760	3913	AT1G01010.1
Chr1	3996	4276	AT1G01010.1
Chr1	4486	4605	AT1G01010.1
Chr1	4706	5095	AT1G01010.1
Chr1	5174	5326	AT1G01010.1
Chr1	5439	5630	AT1G01010.1
Chr1	8571	8666	AT1G01020.1
Chr1	8417	8464	AT1G01020.1
Chr1	8236	8325	AT1G01020.1
Chr1	7942	7987	AT1G01020.1

You will notice that the ID column is shared across multiple CDS coordinates. In other words, each line is the coordinate of a single CDS of a particular gene (in this example).

Other required inputs are a taxonomic group for sequence contexts to parse (animal: CG and CH, or plant: CG, CHG, and CHH), the minumum number of sequence context sites per locus, minumum coverage of sequence context per locus, Q value threshold, and the name of an outfile. The statistical enrichment test with False Discovery Rate (FDR) correction performed for each locus follows a very similar method to that described in Takuno and Gaut (2012).

Usage:

python methTable.py \
--allc=<allc file> \
--bed=<bed file of attributes> \
--taxa=<animal|plant> \
--sites=<int> \
--coverage=<int> \
--qvalue=<float> \
--outfile=<outfile>

perSiteMeth.py. The per-site DNA methylation level is the distribution of DNA methylation levels at individual methylated sites and indicates within a population of cells, the proportion that are methylated. perSiteMeth.py will parse an allc file generated by methylpy and calculate weighted methylation Schultz et al. 2012 for each methylated cytosine of a sequence context specified by taxonomic group (animal or plant). Weighted methylation, along with chromosome, position and strand, will be written to separate files specified by sequence context and identifier.

Usage:

python perSiteMeth.py \
--allc=<allc file> \
--coverage=<int> \
--taxa=<animal|plant> \
--id=<unique sample identifier>

metaplot.py. Using an allc file generated by methylpy and a General Feature Format (GFF) file (along with some user specified parameters), this script will generate a tab separated file of weighted methylation levels per window along and flanking a genomic feature. The outfile is used to generate what is commonly referred to as a "metaplot". A couple warnings you might encounter with this script:

WARNING: Interval Chr#:i-j is smaller than the number of windows requested. Skipping.

and

FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.

No need to worry about these. The first warning is because the flanking region specified runs off the contig/scaffold/chromosome. The second warning is most likely because pybedtools has not updated the pandas function read_table to read_csv.

Usage:

python metaplot.py \
--allc=<filename> \
--gff=<filename \
--taxa=<animal|plant> \
--genome=<filename> \
--feature=<str> \
--flanking=<int> \
--windows=<int> \
--outfile=<outfile>

Gen[etic|omic]

alignmentConverstion.py. Converts a multi-sequence alignment (MSA) file of one format to a different format (e.g., fasta to phyip).

Usage:

python alignmentConversion.py \
--infile=<msa> \
--in_format=<format e.g., fasta> \
--outfile=<output filename> \
--out_format=<format e.g., phylip-relaxed>

createIntronGff.py. Creates a GFF3 file of intron features for each gene given the exon coordinates. Caution: intron are not correctly numbered for genes on the - strand.

Usage:

python createIntronGff.py \
--in_gff=<gff[3] infile> \
--out_gff=<intron gff3 outfile>

extractSeqFromFasta.py. A very straightforward script to extract a set of sequence(s) from a fasta file based on id.

Usage:

python extractSeqFromFasta.py \
--ids=<list of ids> \
--fasta=<in fasta file> \
--outfile=<out fasta file>

motifSearch.py. Searches for motif sequence and reverse complement of motif from a fasta file in a fasta file and records first and last base pair positions.

Usage:

python motifSearch.py \
--motif=<motif fasta> \
--search=<fasta to search> \
--outfile=<outfile (bed format)>

About

A collection of Python scripts for handling various data files.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages