Repository for "The metazoan landscape of mitochondrial DNA gene order and content is shaped by selection and affects mitochondrial transcription" by Noam Shtolz and Dan Mishmar
This code contains 4 main parts:
- 01 Database construction
- 02 Gene content analysis
- 03 Gene order analysis
- 03.01 Gene order-based distance clustering
- 03.02 Gene order variance analysis (AR analysis)
- 03.03 Gene neighbor prevalence calculations
- 04 Poly-cistrone detection using transcriptomic data (RNA-seq and PRO-seq)
- 04.01 Next generation sequencing (NGS) data collection and preparation
- 04.02 RNA-seq data analysis
- 04.03 Precision run-on sequencing (PRO-seq) and global run-on sequencing (GRO-seq) analyses
This repository contains all the code used to calculate and generate the results presented in the paper. In the paper, we studied and compared the mitochondrial gene order and content of over 8,000 available metazoan organisms, downloaded from the NCBI Organelle database. We then analyzed available RNA-seq data of 56 different metazoans to discover the polycistronic organization of a variety of mitochondrial DNAs (mtDNAs) based on intergenic junction expression patterns.
- Python: listed in `requirements.txt`
- R
- tidyverse
- ComplexHeatmap
- circlize
- Cairo
- gmodels
- svglite
- data.table
- Linux
- Clone this repository into any folder.
- Use `pip install -r requirements.txt` to install all Python dependencies.
- Install tRNAscan-SE if you wish to run step 01_Database_construction. This can be done using conda: `conda install -c bioconda trnascan-se`
- The code should be run in the order listed: 01 > 02 > 03 > 04. In general, `.py` files contain functions and should not be run directly; the functions within the `.py` files are invoked by running the `.ipynb` files.
- Before running, install tRNAscan-SE.
- Run `Create_database.py` on Linux with `python Create_database.py` (long runtime).
- The output will be a file named `final.csv`, which contains information on the mtDNA of all organisms listed in `organelles.csv` and will be the basis for all further analyses.
- This section has three main files that need to be run; the rest contain imported functions.
- The files are `database_statistics_and_content_analysis.py`, `database_statistics_and_content_analysis.R` and `get_anticodon`.
- Please go over the files line by line; the comments on each line/section indicate its function.
- This subsection uses a dataset generated by running CREx1 via the command line on all pairwise combinations of mtDNA gene orders (`Distance_new.npy` and `for_dmatrix.csv`) for all organisms in `final.csv` (only the output is provided here).
- Run `UMAP and tSNE clusters.ipynb` line by line to generate the results.
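CREx models full rearrangement scenarios between gene orders; as a loose, self-contained illustration of what a gene-order distance is, here is a toy breakpoint-distance sketch (the function name `breakpoint_distance` and the example orders are hypothetical, and this is not the measure CREx computes):

```python
# Toy illustration only: counts adjacencies present in one gene order but not
# in the other. CREx instead reconstructs rearrangement events (inversions,
# transpositions, TDRLs); use its output (Distance_new.npy) for real analyses.
def breakpoint_distance(order_a, order_b):
    """Number of adjacent gene pairs of order_a that are not adjacent in order_b."""
    adj_b = {frozenset(p) for p in zip(order_b, order_b[1:])}
    return sum(frozenset(p) not in adj_b for p in zip(order_a, order_a[1:]))

a = ["COX1", "COX2", "ATP8", "ATP6", "COX3"]
b = ["COX1", "ATP8", "COX2", "ATP6", "COX3"]
print(breakpoint_distance(a, a))  # → 0
print(breakpoint_distance(a, b))  # → 2
```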
- AR calculations and figures are generated by running `AR_calc.R`.
- To generate the figures, run `interesting_clusters.ipynb`, `Making_cluster_graph.ipynb` and `prob_matrix.R`.
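As a rough sketch of what a neighbor-prevalence tally looks like (the actual calculation is done by the files above; the function name `neighbor_prevalence` and the toy gene orders here are hypothetical):

```python
# Hedged sketch: tally how often each unordered gene pair is immediately
# adjacent across a set of gene orders, treating each order as circular
# (as mtDNA is). Not the repository's implementation.
from collections import Counter

def neighbor_prevalence(gene_orders):
    """Return {(gene1, gene2): fraction of orders in which the pair is adjacent}."""
    counts = Counter()
    for order in gene_orders:
        # wrap around so the last gene neighbors the first
        counts.update({frozenset(p) for p in zip(order, order[1:] + order[:1])})
    n = len(gene_orders)
    return {tuple(sorted(pair)): c / n for pair, c in counts.items()}

orders = [["ND1", "ND2", "COX1", "COX2"],
          ["ND1", "COX1", "ND2", "COX2"]]
prev = neighbor_prevalence(orders)
print(prev[("COX2", "ND1")])  # → 1.0 (adjacent in both orders)
print(prev[("ND1", "ND2")])   # → 0.5 (adjacent in one of two)
```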
All samples analysed are listed within the `data` directories of 04.01 and 04.02. The pipelines themselves are not included in this repository; below are descriptions of the general steps of the RNA-seq and PRO-seq sample analyses.
RNA-seq
- Trimming using TrimGalore 0.6.7: `trim_galore -o out_folder -q 20 --paired f1.fastq.gz f2.fastq.gz`
- QC report generation using FastQC 0.11.9: `fastqc --outdir out_folder f1.fq.gz` and `fastqc --outdir out_folder f2.fq.gz`
- First alignment against a reference mtDNA genome using STAR 2.7.10a: `STAR --genomeDir mtdna_ref_folder --readFilesCommand gunzip -c --limitBAMsortRAM 1200000000 --outFileNamePrefix out_folder --outSAMtype BAM SortedByCoordinate --readFilesIn f1.fq.gz f2.fq.gz --runThreadN 4`
- Generation of a pileup file based on the alignment using samtools 1.7: `samtools mpileup -f mtdna_ref_folder -u sample.bam > sample.pileup`
- Generation of a consensus VCF file based on the pileup using bcftools 1.7: `bcftools call -c sample.pileup > sample.vcf`
- Conversion of the VCF output to a FASTQ using vcfutils.pl: `vcfutils.pl vcf2fq sample.vcf > sample.fastq`
- Conversion of the FASTQ output to a FASTA using `fastq2fasta.py` (within the 04_01_NGS_data_collection folder): `python3.6 fastq2fasta.py --fasta_ref mtdna_ref --fastq sample.fastq --output sample.fasta`
- Creation of a new STAR index based on the sample-specific mtDNA reference using STAR 2.7.10a: `STAR --genomeFastaFiles sample_ref.fasta --genomeDir ref_dir --genomeSAindexNbases 7 --runThreadN 4 --runMode genomeGenerate`
- Unique alignment of reads to the new, sample-specific reference using STAR 2.7.10a: `STAR --limitBAMsortRAM 1200000000 --outFilterMultimapNmax 1 --genomeDir ref_dir --outFileNamePrefix out_folder --outSAMtype BAM SortedByCoordinate --readFilesCommand gunzip -c --readFilesIn f1.fq.gz f2.fq.gz --runThreadN 4`
- Generation of a pileup report based on the alignment using samtools 1.7: `samtools mpileup -Q 30 -f sample_ref.fasta sample.bam > sample.pileup`
- Counting of reads in genes using htseq-count 0.11.2: `htseq-count --additional-attr gene -t gene -i ID --nonunique none -f bam -s no sample.bam annotations.gff3 > sample.txt`
- Processing of pileup files into annotated CSV files using `process_pileup.py` (within the 04_01_NGS_data_collection folder): `python3.6 process_pileup.py --counts sample.txt --org org_name --pileup sample.pileup --flip-strands --output sample.csv --wd "."`
- Normalization of htseq-count counts into TPM and RPKM, output as a CSV file, using `normalize_from_htseq.py` (within the 04_01_NGS_data_collection folder): `python3.6 normalize_from_htseq.py --counts sample.txt --mode both --gtf annotations.gff3 --output sample.csv`
- Creation of custom GFF files covering only the intergenic junctions using `custom_gff.py` (within the 04_01_NGS_data_collection folder): `python3.6 custom_gff.py --org Trichogramma_japonicum --output junc_annot.gff3 --junc sample.final.out --annotated-pileup sample.csv`
- Counting of junction expression using htseq-count 0.11.2: `htseq-count --additional-attr gene -t gene -i ID --nonunique none -f bam --mode intersection-strict -s no sample.bam junc_annot.gff3 > sample.txt`
- Normalization of htseq-count junction counts into TPM and RPKM, output as a CSV file, using `normalize_from_htseq.py` (within the 04_01_NGS_data_collection folder): `python3.6 normalize_from_htseq.py --counts sample.txt --gtf junc_annot.gff3 --mode both --output sample.csv`
- Aggregation of all junction information into a main file using `make_junctions.py` (within the 04_01_NGS_data_collection folder): `python3.6 make_junctions.py --junc 300 --no-trna --org Trichogramma_japonicum --wd "." --annotated-pileup annotated_pileup.csv --coding --junc-reads junc_reads.csv --output sample.csv --tpm-from-htseq norm_counts.csv`
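Several of the steps above call `normalize_from_htseq.py` to turn raw htseq-count counts into TPM and RPKM. As a hedged sketch of the standard formulas only (not the repository's script; the function name `normalize` and the toy counts are ours):

```python
# Standard definitions, assuming gene lengths in base pairs:
#   RPKM_g = counts_g * 1e9 / (total_counts * length_g)
#   TPM_g  = rpk_g * 1e6 / sum(rpk),  where rpk_g = counts_g / (length_g / 1e3)
def normalize(counts, lengths):
    """counts and lengths are dicts keyed by gene; returns (tpm, rpkm) dicts."""
    total = sum(counts.values())
    rpk = {g: counts[g] / (lengths[g] / 1e3) for g in counts}
    scale = sum(rpk.values()) / 1e6
    tpm = {g: rpk[g] / scale for g in rpk}
    rpkm = {g: counts[g] * 1e9 / (total * lengths[g]) for g in counts}
    return tpm, rpkm

tpm, rpkm = normalize({"ND1": 100, "COX1": 300}, {"ND1": 1000, "COX1": 1500})
print(round(tpm["ND1"]), rpkm["ND1"])  # → 333333 250000.0
```

Unlike RPKM, TPM values always sum to one million within a sample, which makes junction expression comparable across samples.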