MASED

Name Origin

MASED - Methylation Analysis of SEgmental Duplications.

Purpose

Analyse the history and methylation of segmental duplications of Arabidopsis thaliana.

Which segments are duplicated from others?
When did their duplication happen?
How does the methylation of these duplications change during evolution and function acquisition？

Directory organization

mcscanx: Scan multiple genomes or subgenomes to identify putative homologous chromosomal regions, then align these regions using genes as anchors.
colline2yml: Get the collinearity regions of the genomes to YAML files.
sort_sd: Sort the segmental duplications by different time of occurrence.
find_promoter: Determine the promoter region of genomes.
analy_meth: Analyse the methylation of segments.

Preparations

Software

Here we should firstly install Sra-toolkit, MCScanX and Bismark.

MCScanX

cd MASED
unzip resource/MCScanX.zip
cd MCScanX
make
cd ~

Bismark

cd MASED
unzip resource/bismark.zip
cd ~

Put into `PATH`

echo '# Bismark' >> .bashrc
echo 'export PATH="$HOME/MASED/bismark:$PATH"' >> .bashrc
echo >> .bashrc
echo '# MCScanX' >> .bashrc
echo 'export PATH="$HOME/MASED/MCScanX:$PATH"' >> .bashrc
echo >> .bashrc
source .bashrc

Data

Collinear

We choose 5 species to build up a time line.

To download them, you should go to JGI website with login. We also provide some data in our folder data/. Others like Theobroma cacao and Brassica rapa can be downloaded from other database.

Specie Name	Database	Version
Arabidopsis thaliana	Araport	V11
Arabidopsis lyrata	JGI	V2.1
Capsella rubella	JGI	V1.1
Brassica rapa	BRAD	V3.0
Theobroma cacao	Cocoa Genome Hub	V2

Take Theobroma cacao as an example:
(ATTENTION: If you want to try this for yourself, make sure you have removed the corresponding files in the folder!)

wget https://cocoa-genome-hub.southgreen.fr/sites/cocoa-genome-hub.southgreen.fr/files/download/Theobroma_cacaoV2_annot_annoted_clean.gff3.tar.gz
tar -zvxf Theobroma_cacaoV2_annot_annoted_clean.gff3.tar.gz
mv Theobroma_cacaoV2_annot_annoted_clean.gff3 ~/MASED/data/Tcac.gff3
rm Theobroma_cacaoV2_annot_annoted_clean.gff3.tar.gz

wget https://cocoa-genome-hub.southgreen.fr/sites/cocoa-genome-hub.southgreen.fr/files/download/Theobroma_cacaoV2_annot_protein.faa.tar.gz
tar -zvxf Theobroma_cacaoV2_annot_protein.faa.tar.gz
mv Theobroma_cacaoV2_annot_protein.faa ~/MASED/data/Tcac.pep
rm Theobroma_cacaoV2_annot_protein.faa.tar.gz

Segment Duplication

We have a reference genome fasta of Arabidopsis thaliana in our folder data/. And then we prepare the genome with App::Egaz.

BS-Seq

We choose SRX2871291 and SRX2871292 as the BS-Seq data of the Columbia wild type Arabidopsis thaliana.

prefetch SRR5631389
prefetch SRR5631390
prefetch SRR5631391
prefetch SRR5631392

The data is now in ~/ncbi/public/sra/

mv ~/ncbi/public/sra/SRR56313* ~/MASED/data/.
cd ~/MASED/data/
fastq-dump --split-3 SRR56313*
rm SRR56313*.sra

Analysis

Pretreatment for MCScanX

cd ~/MASED/data

awk '$3 == "gene" {print $1 "\t" $4 "\t" $5 "\t" $9 "\t" $7}' Atha/Atha.gff3 > Atha.gene.gff
awk '$3 == "gene" {print $1 "\t" $4 "\t" $5 "\t" $9 "\t" $7}' Alyr/Alyr.gff3 > Alyr.gene.gff
awk '$3 == "gene" {print $1 "\t" $4 "\t" $5 "\t" $9 "\t" $7}' Crub/Crub.gff3 > Crub.gene.gff
awk '$3 == "gene" {print $1 "\t" $4 "\t" $5 "\t" $9 "\t" $7}' Brap/Brap.gff3 > Brap.gene.gff
awk '$3 == "gene" {print $1 "\t" $4 "\t" $5 "\t" $9 "\t" $7}' Tcac/Tcac.gff3 > Tcac.gene.gff

perl ../gff_pep_Atha.pl Atha.gene.gff AT.gff Atha/Atha.pep > AT.pep
perl ../gff_pep_Alyr.pl Alyr.gene.gff AL.gff Alyr/Alyr.pep > AL.pep
perl ../gff_pep_Crub.pl Crub.gene.gff CR.gff Crub/Crub.pep > CR.pep
perl ../gff_pep_Brap.pl Brap.gene.gff BR.gff Brap/Brap.pep > BR.pep
perl ../gff_pep_Tcac.pl Tcac.gene.gff TC.gff Tcac/Tcac.pep > TC.pep

makeblastdb -in AT.pep -dbtype prot -parse_seqids -out ATdb

nohup blastp -query AL.pep -db ATdb -out AT_AL.blast -evalue 1e-10 -num_threads 4 -outfmt 6 -num_alignments 5 &
nohup blastp -query CR.pep -db ATdb -out AT_CR.blast -evalue 1e-10 -num_threads 4 -outfmt 6 -num_alignments 5 &
nohup blastp -query BR.pep -db ATdb -out AT_BR.blast -evalue 1e-10 -num_threads 4 -outfmt 6 -num_alignments 5 &
nohup blastp -query TC.pep -db ATdb -out AT_TC.blast -evalue 1e-10 -num_threads 4 -outfmt 6 -num_alignments 5 &

cat AT.gff AL.gff > AT_AL.gff
cat AT.gff CR.gff > AT_CR.gff
cat AT.gff BR.gff > AT_BR.gff
cat AT.gff TC.gff > AT_TC.gff

Run MCScanX

mcscanx -s 3 -m 2 AT_AL
mcscanx -s 3 -m 2 AT_CR
mcscanx -s 3 -m 2 AT_BR
mcscanx -s 3 -m 2 AT_TC

# extract ranges of segment duplications
perl ../colline2yml.pl AT.gff AT_AT.collinearity > AT_AT.yml
perl ../colline2yml_AL.pl AT.gff AT_AL.collinearity > AT_AL.yml
perl ../colline2yml.pl AT.gff AT_CR.collinearity > AT_CR.yml
perl ../colline2yml.pl AT.gff AT_BR.collinearity > AT_BR.yml
perl ../colline2yml.pl AT.gff AT_TC.collinearity > T_TC.yml

Find Segment Duplication

cd ~/MASED/data/
mkdir SAAT
gzip -dc Atha/Atha.fa.gz > Atha/Atha.fa

faops filter -N -s Atha/Atha.fa stdout | faops split-name stdin .
egaz repeatmasker ./*.fa -o SAAT/. --gff --parallel 4
faops size ./*.fa > SAAT/chr.sizes
faToTwoBit ./*.fa SAAT/chr.2bit
cat ./*.fa | faops filter -ßßU stdin SAAT/chr.fasta
mv Atha/Atha.gff3 SAAT/chr.gff

# create anno.yml
runlist gff --tag CDS --remove chr.gff -o cds.yml
runlist gff --remove ./*.rm.gff -o repeat.yml
runlist merge repeat.yml cds.yml -o anno.yml

rm repeat.yml cds.yml ./*.rm.gff /*.rm.out

egaz template \
    . \
    --self -o AT_FSD \
    --taxon ./ensembl_taxon.csv \
    --circos --aligndb --parallel 4 -v

bash AT_FSD/1_self.sh
bash AT_FSD/3_proc.sh
bash AT_FSD/4_circos.sh
bash AT_FSD/6_chr_length.sh
bash AT_FSD/7_self_aligndb.sh
bash AT_FSD/9_pack_up.sh

Positions

Examples in S288c.txt

I:1-100
I(+):90-150
S288c.I(-):190-200
II:21294-22075
II:23537-24097

Simple rules:

chromosome and start are required
species, strand and end are optional
. to separate species and chromosome
strand is one of + and - and surround by round brackets
: to separate names and digits
- to separate start and end
names should be alphanumeric and without spaces

species.chromosome(strand):start-end
--------^^^^^^^^^^--------^^^^^^----

Runlists in YAML

App::RL

jrunlist

Blocked fasta files

Examples in example.fas

>S288c.I(+):13267-13287|species=S288c
TCGTCAGTTGGTTGACCATTA
>YJM789.gi_151941327(-):5668-5688|species=YJM789
TCGTCAGTTGGTTGACCATTA
>RM11.gi_61385832(-):5590-5610|species=RM11
TCGTCAGTTGGTTGACCATTA
>Spar.gi_29362400(+):2477-2497|species=Spar
TCATCAGTTGGCAAACCGTTA

App::Fasops

Ranges and links of ranges

App::Rangeops

jrange

Author

Ivan Woo <wuyifanwd@hotmail.com>

Copyright and license

This is free software; you can redistribute it and/or modify it under the same terms.

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
data		data
doc		doc
resource		resource
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
arg_meth_link.pl		arg_meth_link.pl
classify_link.pl		classify_link.pl
colline2yml.pl		colline2yml.pl
colline2yml_AL.pl		colline2yml_AL.pl
count_dhl.pl		count_dhl.pl
dif_meth.pl		dif_meth.pl
gff_pep_Alyr.pl		gff_pep_Alyr.pl
gff_pep_Atha.pl		gff_pep_Atha.pl
gff_pep_Brap.pl		gff_pep_Brap.pl
gff_pep_Crub.pl		gff_pep_Crub.pl
gff_pep_Tcac.pl		gff_pep_Tcac.pl
promotor_intsec.pl		promotor_intsec.pl

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MASED

Name Origin

Purpose

Directory organization

Preparations

Software

MCScanX

Bismark

Put into `PATH`

Data

Collinear

Segment Duplication

BS-Seq

Analysis

Pretreatment for MCScanX

Run MCScanX

Find Segment Duplication

Positions

Runlists in YAML

Blocked fasta files

Ranges and links of ranges

Author

Copyright and license

About

Releases

Packages

Languages

IvanWoo22/MASED

Folders and files

Latest commit

History

Repository files navigation

MASED

Name Origin

Purpose

Directory organization

Preparations

Software

MCScanX

Bismark

Put into PATH

Data

Collinear

Segment Duplication

BS-Seq

Analysis

Pretreatment for MCScanX

Run MCScanX

Find Segment Duplication

Positions

Runlists in YAML

Blocked fasta files

Ranges and links of ranges

Author

Copyright and license

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Put into `PATH`

Packages