Skip to content

Noam-St/2022_Metazoan_mtDNA_Project

Repository files navigation

Repository for "The metazoan landscape of mitochondrial DNA gene order and content is shaped by selection and affects mitochondrial transcription" by Noam Shtolz and Dan Mishmar

This code contains 4 main parts:

  • 01 Database construction
  • 02 Gene content analysis
  • 03 Gene order analysis
    • 03.01 Gene order-based distance clustering
    • 03.02 Gene order variance analysis (AR analysis)
    • 03.03 Gene neighbor prevalence calculations
  • 04 Poly-cistrone detection using transcriptomic data (RNA-seq and PRO-seq)
    • 04.01 Next generation sequencing (NGS) data collection and preparation
    • 04.02 RNA-seq data analysis
    • 04.03 Precision run-on sequencing (PRO-seq) and global run-on sequencing (GRO-seq) analyses

Background

This repository contains all the code used to calculate and generate the results presented in the paper. In the paper, we study and compare the mitochondrial gene order and contents of over 8,000 available metazoan oragnisms, downloaded from NCBI Organelle database. We then analyzed available RNA-seq data of 56 different metazoans to disover the polycistronic organization of a variaty of mitochondrial DNAs (mtDNAs) based on intergenic junction expression patterns.

Requirements

  • Python:
    • listed in requirements.txt
  • R
    • tidyverse
    • ComplexHeatmap
    • circilize
    • Cairo
    • gmodels
    • svglite
    • data.table
  • Linux

Installation

  • Clone this repository into any folder.
  • Use pip install -r requirements.txt to install all dependencies used.
  • Install tRNAscan-SE if you wish to run step 01_Database_construction. This can be done using conda conda install -c bioconda trnascanse
  • The code should be run in the same order listed: 01 > 02 > 03 > 04. In general, .py files contain functions and should not be run directly, the functions within .py files can be run by running the .ipynb files.

01_Database_construction (LINUX ONLY)

  • Before running, install tRNAscan-SE.
  • run Create_database.py through linux python Create_database.py (long runtime).
  • The output will be a file named final.csv which contains information on the mtDNA of all organisms listed in organelles.csv and will be the basis for all further analyses.

02_Gene_content

  • This section has three main files that need to be run, the rest are just imported functions.
  • The files are database_statistics_and_content_analysis.py database_statistics_and_content_analysis.R and get_anticodon.
  • Please go over the files line by line, the comments on each line/section indicate its function.

03_Gene_order

03_01_Distance_clustering

  • This subsection uses a dataset generated by running CREx1 via the command line on all pairwise-combinations of mtDNA gene orders (Distance_new.npy and for_dmatrix.csv), for all organisms on final.csv (only output provided in here).
  • Run UMAP and tSNE clusters.ipynb line by line to generate the results.

03_02_AR_calculations

  • AR calculations and figures are generated by running AR_calc.R.

03_03_Gene_neighbors

  • To generate the figures, run interesting_clusters.ipynb, Making_cluster_graph.ipynb and prob_matrix.R

04_Rearrangements_and_transcriptions

All samples analysed are within the data directories of 04.01 and 04.02. The pipelines are not included in this GitHub, below are descriptions of the general steps for RNA-seq and PRO-seq samples analysis.

RNA-seq

  • trimming using TrimGalore 0.6.7

    trim_galore -o out_folder -q 20 --paired f1.fastq.gz f2.fastq.gz
  • fastqc to generate a QC report using fastqc 0.11.9

    fastqc --outdir out_folder f1.fq.gz fastqc --outdir out_folder f2.fq.gz
  • First alignment against a reference mtDNA genome using STAR 2.7.10a

    STAR --genomeDir mtdna_ref_folder --readFilesCommand gunzip -c --limitBAMsortRAM 1200000000 --outFileNamePrefix out_folder --outSAMtype BAM SortedByCoordinate --readFilesIn f1.fq.gz f2.fq.gz --runThreadN 4
  • Generation of pileup file based on alignment using samtools 1.7

    samtools mpileup -f mtdna_ref_folder -u sample.bam > sample.pileup
  • Generation of consensus vcf file based on pileup using bcftools 1.7

    bcftools call -c sample.pileup > sample.vcf
  • Conversion of vcf output to a fastq using vcfutils.pl

    vcfutils.pl vcf2fq sample.vcf > sample.fastq
  • Conversion of fastq output to a fasta using fastq2fasta.py (within 04_01_NGS_data_collection folder)

    python3.6 fastq2fasta.py --fasta_ref mtdna_ref --fastq sample.fastq --output sample.fasta
  • Creation of new STAR index based on sample-specific mtDNA reference using STAR 2.7.10a

    STAR --genomeFastaFiles sample_ref.fasta --genomeDir ref_dir --genomeSAindexNbases 7 --runThreadN 4 --runMode genomeGenerate
  • Uniquely align reads to the new, sample-specific, reference using STAR 2.7.10a

    STAR --limitBAMsortRAM 1200000000 --outFilterMultimapNmax 1 --genomeDir ref_dir --outFileNamePrefix out_folder --outSAMtype BAM SortedByCoordinate --readFilesCommand gunzip -c --readFilesIn f1.fq.gz f2.fq.gz --runThreadN 4
  • Generate a pileup report based on the alignment using samtools 1.7

    samtools mpileup -Q 30 -f sample_ref.fasta sample.bam > sample.pileup
  • Count the reads in genes using htseq-count 0.11.2

    htseq-count --additional-attr gene -t gene -i ID --nonunique none -f bam -s no sample.bam annotations.gff3 > sample.txt
  • Process pileup files into csv files with annotations using process_pileup.py (within 04_01_NGS_data_collection folder)

    python3.6 process_pileup.py --counts sample.txt --org org_name --pileup sample.pileup --flip-strands --output sample.csv --wd "."
  • Normalize htseq-count counts into TPM and RPKM and output a csv file using normalize_from_htseq.py (within 04_01_NGS_data_collection folder)

    python3.6 normalize_from_htseq.py --counts sample.txt --mode both --gtf annotations.gff3 --output sample.csv
  • Create custom gff files only for the intergenic junctions using custom_gff.py (within 04_01_NGS_data_collection folder)

    python3.6 custom_gff.py --org Trichogramma_japonicum --output junc_annot.gff3 --junc sample.final.out --annotated-pileup sample.csv
  • Count junction expression using htseq-count 0.11.2

    htseq-count --additional-attr gene -t gene -i ID --nonunique none -f bam --mode intersection-strict -s no sample.bam junc_annot.gff3 > sample.txt
  • Normalize htseq-count junction counts into TPM and RPKM and output a csv file using normalize_from_htseq.py (within 04_01_NGS_data_collection folder)

    python3.6 normalize_from_htseq.py --counts sample.txt --gtf junc_annot.gff3 --mode both --output sample.csv
  • Aggregate all junction information into a main file using make_junctions.py (within 04_01_NGS_data_collection folder)

    python3.6 make_junctions.py --junc 300 --no-trna --org Trichogramma_japonicum --wd "." --annotated-pileup annotated_pileup.csv --coding --junc-reads junc_reads.csv --output sample.csv --tpm-from-htseq norm_counts.csv

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published