A collection of Python scripts for handling various data files.
methTable.py
. More often than not you want to categorize loci based on their methylation enrichment. methTable.py
will do just this. As input methTable.py
requires an allc file (a vcf-like file generated from methylpy) and a bed file of attribute coordinates and ID. The bed file can be easily created from an GFF[3] file. For example, here is how to create a bed file of CDS coordinates from the Arabidopsis thaliana annotation file:
#peak at the CDS features in the annotation file to know what to modify
grep 'CDS' Athaliana_447_Araport11.gene_exons.gff3 | head
Chr1 phytozomev12 CDS 3760 3913 . + 0 ID=AT1G01010.1.Araport11.447.CDS.1;Parent=AT1G01010.1.Araport11.447;pacid=37401853
Chr1 phytozomev12 CDS 3996 4276 . + 2 ID=AT1G01010.1.Araport11.447.CDS.2;Parent=AT1G01010.1.Araport11.447;pacid=37401853
Chr1 phytozomev12 CDS 4486 4605 . + 0 ID=AT1G01010.1.Araport11.447.CDS.3;Parent=AT1G01010.1.Araport11.447;pacid=37401853
Chr1 phytozomev12 CDS 4706 5095 . + 0 ID=AT1G01010.1.Araport11.447.CDS.4;Parent=AT1G01010.1.Araport11.447;pacid=37401853
Chr1 phytozomev12 CDS 5174 5326 . + 0 ID=AT1G01010.1.Araport11.447.CDS.5;Parent=AT1G01010.1.Araport11.447;pacid=37401853
Chr1 phytozomev12 CDS 5439 5630 . + 0 ID=AT1G01010.1.Araport11.447.CDS.6;Parent=AT1G01010.1.Araport11.447;pacid=37401853
Chr1 phytozomev12 CDS 8571 8666 . - 0 ID=AT1G01020.1.Araport11.447.CDS.1;Parent=AT1G01020.1.Araport11.447;pacid=37399351
Chr1 phytozomev12 CDS 8417 8464 . - 0 ID=AT1G01020.1.Araport11.447.CDS.2;Parent=AT1G01020.1.Araport11.447;pacid=37399351
Chr1 phytozomev12 CDS 8236 8325 . - 0 ID=AT1G01020.1.Araport11.447.CDS.3;Parent=AT1G01020.1.Araport11.447;pacid=37399351
Chr1 phytozomev12 CDS 7942 7987 . - 0 ID=AT1G01020.1.Araport11.447.CDS.4;Parent=AT1G01020.1.Araport11.447;pacid=37399351
#generate bed file of gene (attribute) for CDS (feature)
awk -F"\t" '{if($3 == "CDS") print $1 "\t" $4 "\t" $5 "\t" $9}' Athaliana_447_Araport11.gene_exons.gff3 | \
sed 's/ID=//g;s/.Araport11.*//g' \
> Athaliana_447_Araport11.gene_exons.CDS.bed
head Athaliana_447_Araport11.gene_exons.CDS.bed
Chr1 3760 3913 AT1G01010.1
Chr1 3996 4276 AT1G01010.1
Chr1 4486 4605 AT1G01010.1
Chr1 4706 5095 AT1G01010.1
Chr1 5174 5326 AT1G01010.1
Chr1 5439 5630 AT1G01010.1
Chr1 8571 8666 AT1G01020.1
Chr1 8417 8464 AT1G01020.1
Chr1 8236 8325 AT1G01020.1
Chr1 7942 7987 AT1G01020.1
You will notice that the ID column is shared across multiple CDS coordinates. In other words, each line is the coordinate of a single CDS of a particular gene (in this example).
Other required inputs are a taxonomic group for sequence contexts to parse (animal
: CG and CH, or plant
: CG, CHG, and CHH), the minumum number of sequence context sites per locus, minumum coverage of sequence context per locus, Q value threshold, and the name of an outfile
. The statistical enrichment test with False Discovery Rate (FDR) correction performed for each locus follows a very similar method to that described in Takuno and Gaut (2012).
Usage:
python methTable.py \
--allc=<allc file> \
--bed=<bed file of attributes> \
--taxa=<animal|plant> \
--sites=<int> \
--coverage=<int> \
--qvalue=<float> \
--outfile=<outfile>
perSiteMeth.py
. The per-site DNA methylation level is the distribution of DNA methylation levels at individual methylated sites and indicates within a population of cells, the proportion that are methylated. perSiteMeth.py
will parse an allc file generated by methylpy
and calculate weighted methylation Schultz et al. 2012 for each methylated cytosine of a sequence context specified by taxonomic group (animal or plant). Weighted methylation, along with chromosome, position and strand, will be written to separate files specified by sequence context and identifier.
Usage:
python perSiteMeth.py \
--allc=<allc file> \
--coverage=<int> \
--taxa=<animal|plant> \
--id=<unique sample identifier>
metaplot.py
. Using an allc file generated by methylpy
and a General Feature Format (GFF) file (along with some user specified parameters), this script will generate a tab separated file of weighted methylation levels per window along and flanking a genomic feature. The outfile is used to generate what is commonly referred to as a "metaplot". A couple warnings you might encounter with this script:
WARNING: Interval Chr#:i-j is smaller than the number of windows requested. Skipping.
and
FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
No need to worry about these. The first warning is because the flanking region specified runs off the contig/scaffold/chromosome. The second warning is most likely because pybedtools has not updated the pandas function read_table
to read_csv
.
Usage:
python metaplot.py \
--allc=<filename> \
--gff=<filename \
--taxa=<animal|plant> \
--genome=<filename> \
--feature=<str> \
--flanking=<int> \
--windows=<int> \
--outfile=<outfile>
alignmentConverstion.py
. Converts a multi-sequence alignment (MSA) file of one format to a different format (e.g., fasta to phyip).
Usage:
python alignmentConversion.py \
--infile=<msa> \
--in_format=<format e.g., fasta> \
--outfile=<output filename> \
--out_format=<format e.g., phylip-relaxed>
createIntronGff.py
. Creates a GFF3 file of intron features for each gene given the exon coordinates. Caution: intron are not correctly numbered for genes on the - strand.
Usage:
python createIntronGff.py \
--in_gff=<gff[3] infile> \
--out_gff=<intron gff3 outfile>
extractSeqFromFasta.py
. A very straightforward script to extract a set of sequence(s) from a fasta file based on id.
Usage:
python extractSeqFromFasta.py \
--ids=<list of ids> \
--fasta=<in fasta file> \
--outfile=<out fasta file>
motifSearch.py
. Searches for motif sequence and reverse complement of motif from a fasta file in a fasta file and records first and last base pair positions.
Usage:
python motifSearch.py \
--motif=<motif fasta> \
--search=<fasta to search> \
--outfile=<outfile (bed format)>