Repository for "The metazoan landscape of mitochondrial DNA gene order and content is shaped by selection and affects mitochondrial transcription" by Noam Shtolz and Dan Mishmar
This code contains 4 main parts:
- 01 Database construction
- 02 Gene content analysis
- 03 Gene order analysis
- 03.01 Gene order-based distance clustering
- 03.02 Gene order variance analysis (AR analysis)
- 03.03 Gene neighbor prevalence calculations
- 04 Poly-cistrone detection using transcriptomic data (RNA-seq and PRO-seq)
- 04.01 Next generation sequencing (NGS) data collection and preparation
- 04.02 RNA-seq data analysis
- 04.03 Precision run-on sequencing (PRO-seq) and global run-on sequencing (GRO-seq) analyses
This repository contains all the code used to calculate and generate the results presented in the paper. In the paper, we studied and compared the mitochondrial gene order and content of over 8,000 available metazoan organisms, downloaded from the NCBI Organelle database. We then analyzed available RNA-seq data of 56 different metazoans to discover the polycistronic organization of a variety of mitochondrial DNAs (mtDNAs) based on intergenic junction expression patterns.
- Python: listed in `requirements.txt`
- R
- tidyverse
- ComplexHeatmap
- circlize
- Cairo
- gmodels
- svglite
- data.table
- Linux
- Clone this repository into any folder.
- Use `pip install -r requirements.txt` to install all Python dependencies.
- Install tRNAscan-SE if you wish to run step 01_Database_construction. This can be done using conda: `conda install -c bioconda trnascan-se`
- The code should be run in the order listed: 01 > 02 > 03 > 04. In general, `.py` files contain functions and should not be run directly; the functions within the `.py` files are invoked by running the `.ipynb` files.
- Before running, install tRNAscan-SE.
- Run `Create_database.py` on Linux with `python Create_database.py` (long runtime).
- The output will be a file named `final.csv`, which contains information on the mtDNA of all organisms listed in `organelles.csv` and will be the basis for all further analyses.
- This section has three main files that need to be run; the rest contain imported functions.
- The files are `database_statistics_and_content_analysis.py`, `database_statistics_and_content_analysis.R` and `get_anticodon`.
- Please go over the files line by line; the comments on each line/section indicate its function.
- This subsection uses a dataset generated by running CREx1 via the command line on all pairwise combinations of mtDNA gene orders (`Distance_new.npy` and `for_dmatrix.csv`) for all organisms in `final.csv` (only the output is provided here).
- Run `UMAP and tSNE clusters.ipynb` line by line to generate the results.
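CREx models full rearrangement scenarios between gene orders; as a loose, self-contained illustration of what a gene-order distance is, here is a toy breakpoint-distance sketch (the function name `breakpoint_distance` and the example orders are hypothetical, and this is not the measure CREx computes):

```python
# Toy illustration only: counts adjacencies present in one gene order but not
# in the other. CREx instead reconstructs rearrangement events (inversions,
# transpositions, TDRLs); use its output (Distance_new.npy) for real analyses.
def breakpoint_distance(order_a, order_b):
    """Number of adjacent gene pairs of order_a that are not adjacent in order_b."""
    adj_b = {frozenset(p) for p in zip(order_b, order_b[1:])}
    return sum(frozenset(p) not in adj_b for p in zip(order_a, order_a[1:]))

a = ["COX1", "COX2", "ATP8", "ATP6", "COX3"]
b = ["COX1", "ATP8", "COX2", "ATP6", "COX3"]
print(breakpoint_distance(a, a))  # → 0
print(breakpoint_distance(a, b))  # → 2
```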
- AR calculations and figures are generated by running `AR_calc.R`.
- To generate the figures, run `interesting_clusters.ipynb`, `Making_cluster_graph.ipynb` and `prob_matrix.R`.
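As a rough sketch of what a neighbor-prevalence tally looks like (the actual calculation is done by the files above; the function name `neighbor_prevalence` and the toy gene orders here are hypothetical):

```python
# Hedged sketch: tally how often each unordered gene pair is immediately
# adjacent across a set of gene orders, treating each order as circular
# (as mtDNA is). Not the repository's implementation.
from collections import Counter

def neighbor_prevalence(gene_orders):
    """Return {(gene1, gene2): fraction of orders in which the pair is adjacent}."""
    counts = Counter()
    for order in gene_orders:
        # wrap around so the last gene neighbors the first
        counts.update({frozenset(p) for p in zip(order, order[1:] + order[:1])})
    n = len(gene_orders)
    return {tuple(sorted(pair)): c / n for pair, c in counts.items()}

orders = [["ND1", "ND2", "COX1", "COX2"],
          ["ND1", "COX1", "ND2", "COX2"]]
prev = neighbor_prevalence(orders)
print(prev[("COX2", "ND1")])  # → 1.0 (adjacent in both orders)
print(prev[("ND1", "ND2")])   # → 0.5 (adjacent in one of two)
```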
All samples analysed are listed within the `data` directories of 04.01 and 04.02. The pipelines themselves are not included in this repository; below are descriptions of the general steps of the RNA-seq and PRO-seq sample analyses.
RNA-seq
- Trimming using TrimGalore 0.6.7: `trim_galore -o out_folder -q 20 --paired f1.fastq.gz f2.fastq.gz`
- QC report generation using FastQC 0.11.9: `fastqc --outdir out_folder f1.fq.gz` and `fastqc --outdir out_folder f2.fq.gz`
- First alignment against a reference mtDNA genome using STAR 2.7.10a: `STAR --genomeDir mtdna_ref_folder --readFilesCommand gunzip -c --limitBAMsortRAM 1200000000 --outFileNamePrefix out_folder --outSAMtype BAM SortedByCoordinate --readFilesIn f1.fq.gz f2.fq.gz --runThreadN 4`
- Generation of a pileup file based on the alignment using samtools 1.7: `samtools mpileup -f mtdna_ref_folder -u sample.bam > sample.pileup`
- Generation of a consensus VCF file based on the pileup using bcftools 1.7: `bcftools call -c sample.pileup > sample.vcf`
- Conversion of the VCF output to a FASTQ using vcfutils.pl: `vcfutils.pl vcf2fq sample.vcf > sample.fastq`
- Conversion of the FASTQ output to a FASTA using `fastq2fasta.py` (within the 04_01_NGS_data_collection folder): `python3.6 fastq2fasta.py --fasta_ref mtdna_ref --fastq sample.fastq --output sample.fasta`
- Creation of a new STAR index based on the sample-specific mtDNA reference using STAR 2.7.10a: `STAR --genomeFastaFiles sample_ref.fasta --genomeDir ref_dir --genomeSAindexNbases 7 --runThreadN 4 --runMode genomeGenerate`
- Unique alignment of reads to the new, sample-specific reference using STAR 2.7.10a: `STAR --limitBAMsortRAM 1200000000 --outFilterMultimapNmax 1 --genomeDir ref_dir --outFileNamePrefix out_folder --outSAMtype BAM SortedByCoordinate --readFilesCommand gunzip -c --readFilesIn f1.fq.gz f2.fq.gz --runThreadN 4`
- Generation of a pileup report based on the alignment using samtools 1.7: `samtools mpileup -Q 30 -f sample_ref.fasta sample.bam > sample.pileup`
- Counting of reads in genes using htseq-count 0.11.2: `htseq-count --additional-attr gene -t gene -i ID --nonunique none -f bam -s no sample.bam annotations.gff3 > sample.txt`
- Processing of pileup files into annotated CSV files using `process_pileup.py` (within the 04_01_NGS_data_collection folder): `python3.6 process_pileup.py --counts sample.txt --org org_name --pileup sample.pileup --flip-strands --output sample.csv --wd "."`
- Normalization of htseq-count counts into TPM and RPKM, output as a CSV file, using `normalize_from_htseq.py` (within the 04_01_NGS_data_collection folder): `python3.6 normalize_from_htseq.py --counts sample.txt --mode both --gtf annotations.gff3 --output sample.csv`
- Creation of custom GFF files covering only the intergenic junctions using `custom_gff.py` (within the 04_01_NGS_data_collection folder): `python3.6 custom_gff.py --org Trichogramma_japonicum --output junc_annot.gff3 --junc sample.final.out --annotated-pileup sample.csv`
- Counting of junction expression using htseq-count 0.11.2: `htseq-count --additional-attr gene -t gene -i ID --nonunique none -f bam --mode intersection-strict -s no sample.bam junc_annot.gff3 > sample.txt`
- Normalization of htseq-count junction counts into TPM and RPKM, output as a CSV file, using `normalize_from_htseq.py` (within the 04_01_NGS_data_collection folder): `python3.6 normalize_from_htseq.py --counts sample.txt --gtf junc_annot.gff3 --mode both --output sample.csv`
- Aggregation of all junction information into a main file using `make_junctions.py` (within the 04_01_NGS_data_collection folder): `python3.6 make_junctions.py --junc 300 --no-trna --org Trichogramma_japonicum --wd "." --annotated-pileup annotated_pileup.csv --coding --junc-reads junc_reads.csv --output sample.csv --tpm-from-htseq norm_counts.csv`
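Several of the steps above call `normalize_from_htseq.py` to turn raw htseq-count counts into TPM and RPKM. As a hedged sketch of the standard formulas only (not the repository's script; the function name `normalize` and the toy counts are ours):

```python
# Standard definitions, assuming gene lengths in base pairs:
#   RPKM_g = counts_g * 1e9 / (total_counts * length_g)
#   TPM_g  = rpk_g * 1e6 / sum(rpk),  where rpk_g = counts_g / (length_g / 1e3)
def normalize(counts, lengths):
    """counts and lengths are dicts keyed by gene; returns (tpm, rpkm) dicts."""
    total = sum(counts.values())
    rpk = {g: counts[g] / (lengths[g] / 1e3) for g in counts}
    scale = sum(rpk.values()) / 1e6
    tpm = {g: rpk[g] / scale for g in rpk}
    rpkm = {g: counts[g] * 1e9 / (total * lengths[g]) for g in counts}
    return tpm, rpkm

tpm, rpkm = normalize({"ND1": 100, "COX1": 300}, {"ND1": 1000, "COX1": 1500})
print(round(tpm["ND1"]), rpkm["ND1"])  # → 333333 250000.0
```

Unlike RPKM, TPM values always sum to one million within a sample, which makes junction expression comparable across samples.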