
Utility Descriptions (with Examples)

Kinfe Bankole edited this page Oct 23, 2023 · 40 revisions

Below is a description of the various utilities found in TranD/utilities that can be used independently. Each utility either uses part of the TranD output data or creates files that are useful in conjunction with TranD analysis.

Important Disclaimer on Utility Usage:

  • It is not necessary to clone the entire repository to use a single utility, but TranD must be installed in your Python environment for the utilities below to work properly.

For information on how to install and run TranD, see the User Guide.

Utility List

GTF Alteration


Unique Junction Chains (UJCs)


TranD Output Analysis


Plotting


add_prefix_gtf.py

Overview: A utility designed to add a prefix to either the gene_ids or transcript_ids of a GTF file.

Typical Usage:

python add_prefix_gtf.py \
    -g /path/to/input_gtf.gtf \        # Input GTF 
    -t transcript_id \                 # OR gene_id 
    -p tr \                            # Prefix to be added 
    -o /path/to/output_gtf.gtf         # Name and path of desired output file

Notes:

  • The first two attributes of the GTF file must be transcript_id and gene_id, in any order.
  • The default -t (id type) is transcript_id.

Examples: Sample Input | Sample Output

Sample output generated using the following command:

python /TranD/utilities/add_prefix_gtf.py \
    -g /TranD/utility_input/dmel-all-r6.17.gtf \
    -t transcript_id \
    -p pre \
    -o TranD/utility_output/GTF_ALTERATION/add_prefix_example.gtf 

Known Error Triggers:

  • None Currently Known

correct_gffcompare_GTF_gene_id.py

Overview: A utility designed to correct the gene_ids of a GTF file using output from GFFCompare (v12.2 or higher). GFFCompare will find genes on the same strand with overlapping exons and create GTF-like output. Use this utility to correct the original GTF using that output. This utility is especially useful for correcting GTFs mapped onto a different species' genome, allowing for comparison between two species via TranD (which requires equal gene_ids).

Typical Usage:

python correct_gffcompare_GTF_gene_id.py \
    -a path/to/input.annotated.gtf \  # Annotated GTF-like file (GFFCompare (>= v12.2) Output)
    -g path/to/input.gtf \            # Original file input into GFFCompare
    -o path/to/output.gtf \           # Desired output path
    --key path/to/outkey.csv           # Output path for key file

Notes:

  • If multiple GTF files need gene_id values assigned, provide gffcompare with a single concatenated GTF in which each individual file has a sampleID in the 'source' column (column 2).

  • The name of the -g input file MUST match what was given to gffcompare so the correct transcripts can be identified in the output if multiple files were provided to gffcompare.

  • The --key file provides information on what gene_ids were changed and what they were changed to; contains the columns: transcript_id, input_gene_id, output_gene_id

  • The output GTF and output key MUST each be a path to the desired output file; the utility will not create the output directory.
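
As an illustration of how the key file might be inspected after a run, the sketch below lists the transcripts whose gene_id actually changed. It assumes the key is a comma-separated file with a header row holding the three columns named above; the function name `changed_gene_ids` and the sample rows are hypothetical and not part of TranD.

```python
import csv
import io


def changed_gene_ids(key_handle):
    """Return (transcript_id, input_gene_id, output_gene_id) for rows whose gene_id changed."""
    reader = csv.DictReader(key_handle)
    return [
        (row["transcript_id"], row["input_gene_id"], row["output_gene_id"])
        for row in reader
        if row["input_gene_id"] != row["output_gene_id"]
    ]


# Hypothetical key content, for illustration only (not real utility output).
sample_key = io.StringIO(
    "transcript_id,input_gene_id,output_gene_id\n"
    "tx1,geneA,FBgn0000001\n"
    "tx2,FBgn0000002,FBgn0000002\n"
)
changed = changed_gene_ids(sample_key)
print(changed)
```

For a real run, pass an open file handle for the path given to --key in place of `sample_key`.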

Additional Arguments:

  • Add -k (--keep-names) to keep the original GTF's gene_id if a reference gene_id does not associate with it. Default: if there is no reference gene_id for a gene, the gene_id will be GFFCompare's XLOC values.
  • -n (--use-gene_name): If a reference gene_id is not present in the GFFCompare output (for class codes not in [s, x, y, i, p, r, u]), the utility will check for ref_gene_name or gene_name to use instead. The default is to use ref_gene_id only and replace it with XLOC values if ref_gene_id is not present (unless -k is used).

Examples: Sample Input: Mapped GTF (to be corrected) | Reference GTF (Dmel)

Sample Output: Corrected GTF | Key File

Sample output generated using the following command:

gffcompare -r dmel-all-r6.17.gtf -o mel2mel mel2mel_minimap2.gtf      # Run GFFCompare to get the .annotated.gtf file

python /TranD/utilities/correct_gffcompare_GTF_gene_id.py \
    -a TranD/utility_input/mel2mel.annotated.gtf \
    -g TranD/utility_input/mel2mel_minimap2.gtf \
    -o TranD/utility_output/GTF_ALTERATION/mel2mel_corrected_associated_gene.gtf \
    --key TranD/utility_output/GTF_ALTERATION/mel2mel_corrected_key.csv

Known Error Triggers:

  • None Currently Known

subset_gtf.py

Overview: A utility designed to subset a GTF file based on a list of: transcript_ids (default), gene_ids, or chromosomes.

Typical Usage:

python subset_gtf.py \
    -g /path/to/input_gtf.gtf \        # Input GTF 
    -t listType \                      # What is in the list 
    -i /path/to/include_list.txt \     # List of transcripts/genes/chromosomes to include (or exclude if using -e)
    -o /path/to/output_gtf.gtf         # Path to output file

Notes:

  • The options for -t are:
    • transcript_id (default)
    • gene_id
    • chr
  • No header on the include/exclude list.
  • Note that -o must refer to an output file, rather than a directory.
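
A minimal sketch of preparing an include/exclude list with no header row. It assumes the list holds one ID per line, which the notes above do not state explicitly; the helper name `write_id_list` and the IDs are placeholders.

```python
import io


def write_id_list(ids, handle):
    """Write IDs one per line with no header, dropping duplicates but keeping order."""
    seen = set()
    for identifier in ids:
        if identifier not in seen:
            seen.add(identifier)
            handle.write(f"{identifier}\n")


# Placeholder gene_ids; an in-memory buffer stands in for the list file.
buffer = io.StringIO()
write_id_list(["FBgn0000003", "FBgn0000008", "FBgn0000003"], buffer)
list_text = buffer.getvalue()
print(list_text, end="")
```

To write an actual file, open the path intended for -i (or -e) in write mode and pass that handle instead of `buffer`.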

Additional Arguments:

  • -e is used instead of -i if excluding those in the list from the subset GTF. Either -e or -i must be used, but not both.

Examples:

Sample Input: GTF | Include List

Sample Output

Sample output generated using the following command:

python /TranD/utilities/subset_gtf.py \
    -g /TranD/utility_input/dmel-all-r6.17.gtf \
    -t gene_id \
    -i /TranD/utility_input/subset_gtf_include_list.txt \
    -o /TranD/utility_output/GTF_ALTERATION/subset_gtf_example.gtf

Known Error Triggers:

  • None Currently Known

id_ujc.py

Overview: A "unique junction chain" (UJC) is the series of junctions within a transcript. Some transcripts may have identical UJCs and be grouped together under a "ujc_id". This utility is designed to identify the UJCs within a GTF file and output a csv file with the following columns:

gene_id, transcript_id, ujc_id, junction_string

which lists the transcripts, which UJC they 'belong' to, and the chain of junctions within the transcript.

The format for the junction string is a list of junctions separated by '|', and the junctions have the following format:

seqname:start:end:strand

where the start of the junction is the end of the last exon before it, and the end of the junction is the start of the next exon after it.

All transcripts with the same junction chain will be grouped under one representative transcript from the group, which is the ujc_id.
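
The junction string format and the grouping described above can be sketched as follows. This is an illustration under stated assumptions, not the code in id_ujc.py: exons are given as (start, end) tuples, the function names are hypothetical, and details such as how the utility treats single-exon transcripts or picks the representative transcript are not covered.

```python
from collections import defaultdict


def junction_string(seqname, strand, exons):
    """Build 'seqname:start:end:strand' junctions joined by '|' from (start, end) exons."""
    ordered = sorted(exons)
    junctions = [
        # A junction starts at the end of one exon and ends at the start of the next.
        f"{seqname}:{prev_end}:{next_start}:{strand}"
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    return "|".join(junctions)


def group_by_chain(transcripts):
    """Map each junction string to the list of transcript_ids that share it."""
    groups = defaultdict(list)
    for transcript_id, (seqname, strand, exons) in transcripts.items():
        groups[junction_string(seqname, strand, exons)].append(transcript_id)
    return dict(groups)


# Made-up coordinates: txA and txB differ only at their outer ends,
# so they share a junction chain; txC skips the middle exon.
transcripts = {
    "txA": ("2L", "+", [(100, 200), (300, 400), (500, 600)]),
    "txB": ("2L", "+", [(150, 200), (300, 400), (500, 650)]),
    "txC": ("2L", "+", [(100, 200), (500, 600)]),
}
chains = group_by_chain(transcripts)
print(chains)
```

Note that transcripts with different 5' and 3' ends still fall into the same group as long as every junction matches.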

The utility also outputs a GTF file where each UJC is represented as a transcript. Additionally, there is an option (-c) to output another csv file with information on the number of transcripts represented by each UJC.

Typical Usage:

python id_ujc.py \
    -g /path/to/input.gtf \         # A GTF file for identifying UJCs
    -x prefix \                     # Prefix for transcripts
    -o /path/to/outputdirectory \   # Output directory, must already exist
    -p filePrefix                   # A prefix for the output file, required

Additional Arguments:

  • Adding -s will skip the output of the representative GTF file.

  • Use -x (prefix) to add a prefix to each ujc_id.

  • Adding -c will output the 'counting' file with the following columns:

    gene_id, ujc_id, num_xscripts, junction_string

Examples: Sample Input | Sample Output

Sample output generated using the following command:

python /TranD/utilities/id_ujc.py \
    -g /TranD/utility_input/dmel-all-r6.17.gtf \
    -x tr \
    -o /TranD/utility_output/UJC_OUTPUT/ID_UJC/ \
    -p sample

Known Error Triggers:

  • The output directory must exist before trying to output to it using the utility.
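
One way to avoid this trigger is to create the directory before calling the utility. The sketch below uses only the Python standard library; the helper name `ensure_output_dir` is hypothetical, and a temporary directory stands in for a real output location.

```python
import tempfile
from pathlib import Path


def ensure_output_dir(path):
    """Create the directory (and any missing parents) if it does not exist yet."""
    out_dir = Path(path)
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


# Demonstrated inside a scratch directory so nothing is left behind.
with tempfile.TemporaryDirectory() as scratch:
    out_dir = ensure_output_dir(Path(scratch) / "UJC_OUTPUT" / "ID_UJC")
    created = out_dir.is_dir()
    # Calling it again on an existing directory is harmless.
    ensure_output_dir(out_dir)
print(created)
```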

subset_trand_pairwise_transcript_distance.py

Overview: A utility designed to subset a TranD Pairwise Distance File based on a list of: transcript_ids or gene_ids (default).

Typical Usage:

python subset_trand_pairwise_transcript_distance.py \
    -p /path/to/input_pd.csv \         # Input Pairwise Distance File 
    -t listType \                      # What is in the list 
    -i /path/to/include_list.txt \     # List of transcripts/genes to include (or exclude if using -e)
    -o /path/to/output_pd.csv          # Path to output file

Notes:

  • The options for -t are:
    • transcript_id
    • gene_id (default)
  • If transcripts are selected: if one list is provided, the utility checks for transcripts in both transcript_1 and transcript_2. If two lists are provided (use -i/-e twice), the first list will be used for transcript_1 and the second will be used for transcript_2.
    • -e and -i cannot be used together.
  • Note that -o must refer to an output file, rather than a directory.

Additional Arguments:

  • -n1 name1 and -n2 name2 append the name entered to the transcript_1 and transcript_2 values, respectively.
    • NOTE: if these arguments are used, ensure they match any suffix attached to the transcripts in the PD file. If they do not, the output of this utility will be empty.

Examples:

Sample Input: PD File | Include List 1 | Include List 2

Sample Output

Sample output generated using the following command:

python TranD/utilities/subset_trand_pairwise_transcript_distance.py \
    -p /TranD/utility_input/dmel_pairwise_transcript_distance.csv \
    -t transcript_id \
    -i /TranD/utility_input/subset_pd_include_list1.txt \
    -i /TranD/utility_input/subset_pd_include_list2.txt \
    -o /TranD/utility_output/GTF_ALTERATION/subset_pd_example.csv

Known Error Triggers:

  • None Currently Known

id_ERG.py

Overview: A utility designed to organize transcripts found within a TranD pairwise transcript distance output file into Exon Region Groups (ERGs). ERGs are groups of transcripts that share 100% of their exon regions. The output is the following 3-4 files:

  • A csv unique on transcript_id, with the following columns:
gene_id, ERG_id, number of exon regions, a flag for non-overlap, and (if it occurs) transcripts in its group that it does not overlap with.

(Non-overlap is possible if, for example: transcript A overlaps with B and B overlaps with C, but A does not overlap with C. The three transcripts will still be in the same group due to mutual overlap with B.)

  • A csv unique on ERG, with the following columns:
ERG_id, size (number of transcripts), gene_id, a list of the transcripts in the group, a flag for non-overlap, a flag for intron retention (IR) within the group, the number of transcripts with IR, the proportion of transcripts with IR, the number of exon regions, and the min, max, mean and median number/proportion of nucleotides different between the transcripts in the group
  • A csv unique on gene_id, with the following columns:
number of ERGs per gene and the min, max, mean, and median for: the number of exon regions, the size of the ERGs that belong to the gene
  • An optional GTF file with representative transcripts for each ERG. Each exon in the representative GTF file represents the earliest start and the latest end of the exon region by comparing all transcripts within the group.
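
The non-overlap case noted above (A overlaps B, B overlaps C, but A does not overlap C) amounts to grouping by connected components. The sketch below shows that idea with a small union-find. It is an assumption about the general technique, not the implementation in id_ERG.py; the function names are hypothetical, and each pair in the input stands for two transcripts that share all of their exon regions.

```python
def group_transcripts(overlapping_pairs):
    """Group transcripts that are linked, directly or indirectly, by overlapping pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in overlapping_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for member in parent:
        groups.setdefault(find(member), set()).add(member)
    return sorted(sorted(group) for group in groups.values())


def non_overlapping_members(group, overlapping_pairs):
    """List pairs inside one group that are not directly linked (the non-overlap flag)."""
    linked = {frozenset(pair) for pair in overlapping_pairs}
    return [
        (a, b)
        for i, a in enumerate(group)
        for b in group[i + 1:]
        if frozenset((a, b)) not in linked
    ]


pairs = [("A", "B"), ("B", "C"), ("D", "E")]
groups = group_transcripts(pairs)
flagged = non_overlapping_members(groups[0], pairs)
print(groups, flagged)
```

Here A, B, and C end up in one group because of their shared link through B, and the A/C pair is what a non-overlap flag would report.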

Note on using this utility for 2 GTF output:

  • The ERG file also contains a column ("contains_which_gtf") that describes which GTF the group belongs to (1 = GTF1, 2 = GTF2, 3 = Both).
  • Transcripts in the first column of the input csv belong to GTF1, transcripts in the second column belong to GTF2.

Typical Usage:

python id_ERG.py \
    -i /path/to/input.csv \       # TranD output, pairwise transcript distance
    -ir Y \                       # OR N, to exclude transcripts from ERGs based on IR
    -o /path/to/outputdirectory   # Output directory, must already exist

Notes:

  • The -ir argument controls whether transcripts are included in or excluded from ERGs on the basis of intron retention:
    • If -ir is Y, transcripts with intron retention events will be included in groups.
    • If -ir is N, they (specifically, transcripts with the intron retained) will be excluded from groups on that basis.

Additional Arguments:

  • Adding -g will cause the output of the optional representative GTF
  • It is recommended to use -p (prefix) to add a prefix for the output files, otherwise the utility will default to the original file name when writing files.
  • For 1GTF data, add -w to add the which_gtf column to the xscript output. This is useful if merging the ERG data with other 2GTF output (as seen in Transcript Model Maps).

Examples: This example uses a 1GTF file; however, the utility can be used for both 1GTF and 2GTF TranD output.

Sample Input | Sample Output

Sample output generated using the following command:

python id_ERG.py \
    -i /TranD/utility_input/dmel_pairwise_transcript_distance.csv \
    -ir Y \
    -p dmel617 \
    -g \
    -o /TranD/utility_output/ID_ERG/

Known Error Triggers:

  • The output directory must exist before trying to output to it using the utility.

pair_classification.py

Overview: A utility designed to analyze the pairs of transcripts in a minimum pairwise distance file generated from TranD 2GTF output and classify them into 4-5 possible groups:

  • FSM: The transcripts are a complete match.
  • ERS_wIR: The two transcripts are in the same exon region group (all of their exon regions overlap). There is an intron retention event within at least one transcript.
  • ERS_noIR: The two transcripts are in the same exon region group and there is no intron retention.
    • Large: There is a nucleotide number difference between the two transcripts greater than the threshold entered (less similar).
    • Small: There is a nucleotide number difference between the two transcripts less than the threshold entered (more similar).
  • NRM: The transcript pair is a non-reciprocal match.

It is a key utility for creating a Transcript Model Map.

Typical Usage:

python pair_classification.py \
    -d /path/to/input.csv \     # TranD 2GTF output, minimum pairwise transcript distance
    -s 15 \                     # Nucleotide threshold
    -o /path/to/output/file.csv # Output file

Notes:

  • The threshold -s is not required but allows for the distinction between a large and small transcript distance on the basis of number of nucleotides different between the two transcripts.
    • We recommend calculating it by multiplying the number of nucleotides per codon (3) by the mean number of exons per gene for that species.
    • The mean number of exons per gene can be found in TranD output in the file prefix_transcriptome_complexity_counts.csv under the column mean_exonPerGene.
  • Note that -o must refer to an output file, rather than a directory.
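
The recommended threshold calculation can be sketched as below. The column name mean_exonPerGene comes from the note above; everything else is an assumption for illustration: that the counts file has a header and a single data row, the function names, the sample value, and the choice to treat a difference exactly equal to the threshold as Small (the descriptions above only define greater than and less than).

```python
import csv
import io

NUCLEOTIDES_PER_CODON = 3


def recommended_threshold(counts_handle):
    """Return 3 * mean exons per gene, read from the first data row of the counts file."""
    row = next(csv.DictReader(counts_handle))
    return NUCLEOTIDES_PER_CODON * float(row["mean_exonPerGene"])


def size_class(num_nt_different, threshold):
    """Label a nucleotide difference as Large or Small relative to the threshold."""
    return "Large" if num_nt_different > threshold else "Small"


# Hypothetical counts content; a real file would contain more columns.
sample_counts = io.StringIO("mean_exonPerGene\n5.0\n")
threshold = recommended_threshold(sample_counts)
print(threshold, size_class(20, threshold), size_class(9, threshold))
```

With a mean of 5 exons per gene this gives 15, the value passed to -s in the usage example; round the result as needed before passing it on the command line.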

Examples: Sample Input | Sample Output

Sample output is generated using the following command:

python pair_classification.py \
    -d /TranD/utility_input/flair_vs_isoseq.csv \
    -s 15 \
    -o TranD/utility_output/PAIR_CLASSIFICATION/sample_pair_classification.csv

Known Error Triggers:

  • None Currently Known

plot_extra_min_pair_upset.py

Overview: A utility designed to plot two minimum distance UpSet plots for extras in dataset 1 and 2.

Typical Usage:

python plot_extra_min_pair_upset.py \
    -i /path/to/minimum_pairwise_distance.csv \ # Input must be a minimum pairwise distance file created from TranD 2GTF output
    -1 name \                                   # Name of dataset/GTF 1
    -2 name \                                   # Name of dataset/GTF 2
    -o /path/output/directory                   # Path to desired output directory

Notes:

  • The name of each dataset must match the suffix added to the column headers of the PD file.

Examples: Sample Input | Sample Output

Sample output is generated using the following command:

python plot_extra_min_pair_upset.py \
    -i /TranD/utility_input/flair_vs_isoseq.csv \
    -1 sim_FLAIR \
    -2 sim_isoseq3cluster \
    -o /TranD/utility_output/PLOTTING/PLOT_MIN

Known Error Triggers:

  • None Currently Known

plot_trand_from_output_files.py

Overview: A utility designed to create plots from TranD output. This utility is useful if plots were skipped when generating the output with TranD. It takes in all types of TranD output (1 GTF gene, 1 GTF pairwise, 2 GTF pairwise) and creates the plots that would have been created during the original TranD run.

Specific Arguments:

  • For plotting from 1 GTF gene output, the following arguments are required:

    • -er: the location of the event_analysis_er.csv file
    • -ef: the location of the event_analysis_ef.csv file - Note: if one of the above arguments is used, the other must also be used
    • -ir: the location of the ir_transcripts.csv file
    • -ue: the location of the uniq_exons_per_gene.csv file
  • For plotting from 1 GTF pairwise output, the following arguments are required:

    • -p1: the location of the pairwise_transcript_distance.csv file
    • Optional -t: a density threshold
      • This will result in the output of an additional plot combining alternative splicing and nucleotide variability. This KDE density threshold will be used as the maximum of the Y-axis for the gene density plot.
  • For plotting from 2 GTF pairwise output, the following arguments are required:

    • -p2: the location of the pairwise_transcript_distance.csv file or minimum_pairwise_transcript_distance.csv file
    • -n1: the name for the first dataset/GTF
    • -n2: the name for the second dataset/GTF
    • Optional -g1: the location of the gtf1_only.gtf file
    • Optional -g2: the location of the gtf2_only.gtf file
      • These two arguments are not necessarily required. However, if not provided, the pie charts of gene counts will show 0 genes exclusive to one dataset.
    • Optional ns: whether or not suffixes are included in the 2GTF data
      • If there are no suffixes attached to the transcripts in the 2GTF data, include this argument

Typical Usage:

python plot_trand_from_output_files.py \
    -p1 /path/to/1GTF/pairwise_transcript_distance.csv \ # Path to 1GTF pairwise data
    -x prefix \                                        # Output prefix for plots, required
    -o /path/to/output_directory                       # Location of desired output directory

Additional Arguments:

  • -f is necessary if the desired output directory already exists. --force overwrites existing output directory and files within.

Examples: Sample Input | Sample Output

Sample output generated using the following command:

python plot_trand_from_output_files.py \
    -p1 /TranD/utility_input/dmel_pairwise_transcript_distance.csv \
    -x sample \
    -o /TranD/utility_output/PLOTTING/PLOT_TRAND \
    -f

Known Error Triggers:

  • None Currently Known

ignore_AS_type.py

Overview: A utility designed to subset a TranD output UpSet plot using the pairwise transcript distance file by ignoring specific types of alternative splicing in the plot. It is useful for analysis focusing on only a certain type of alternative splicing. All 'ignored' pairs will be counted under the first column (the pairs will be treated as though they have no alternative splicing). For pairs with multiple types of alternative splicing, an example:

  • Transcript pair A/B have alternate donor/acceptor (AD), intron retention (IR), and alternate exon (AE).
  • If IR is ignored, pair A/B will be counted under the AD+AE column. If both IR and AE are ignored, pair A/B will be counted under the AD column only. If all three are ignored, A/B is treated as though it has no alternative splicing (first column with no dots).
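
The example above boils down to a set difference: remove the ignored types from the pair's alternative splicing types and count the pair under whatever remains. A minimal sketch, assuming each pair's types are available as a set of short labels; the function name `plot_column` and the "no AS" label are hypothetical and are not the names used in the utility's output.

```python
def plot_column(as_types, ignored):
    """Return the plot column a pair falls under once the ignored AS types are removed."""
    remaining = sorted(set(as_types) - set(ignored))
    return "+".join(remaining) if remaining else "no AS"


# Pair A/B from the example above.
pair_ab = {"AD", "IR", "AE"}
only_ir_ignored = plot_column(pair_ab, {"IR"})
ir_and_ae_ignored = plot_column(pair_ab, {"IR", "AE"})
all_ignored = plot_column(pair_ab, {"AD", "IR", "AE"})
print(only_ir_ignored, ir_and_ae_ignored, all_ignored)
```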

Typical Usage:

python ignore_AS_type.py \
    -i /path/to/input.csv \      # TranD output (pairwise transcript distance)
    {ignoreoptions} \            # More details below
    -o /path/to/outputdirectory  # Output directory, must already exist

Additional Arguments:

  • The ignore options for customizing the plot:
    • -i3: ignore 3' variation
    • -i5: ignore 5' variation
    • -iAD: ignore alternate donor/acceptor
    • -iAE: ignore alternate exon
    • -iIR: ignore intron retention
    • -iNSNT: ignore no shared nucleotides
  • Optional: Add -x to set a prefix for the output files; defaults to the original file name.

Notes:

  • The utility also outputs an RTF file that acts as a legend.

Examples: Sample Input | Sample Output

Output generated using the following commands:

  • Unchanged Original Plot - custom_dmel617.png:
python ignore_AS_type.py \
    -i /TranD/utility_input/dmel_pairwise_transcript_distance.csv \
    -o /TranD/utility_output/PLOTTING/IGNORE_AS \
    -x dmel617 
  • IR ignored - custom_dmel617_ignore_Intron_Retention.png:
python ignore_AS_type.py \
    -i /TranD/utility_input/dmel_pairwise_transcript_distance.csv \
    -o /TranD/utility_output/PLOTTING/IGNORE_AS \
    -x dmel617 \
    -iIR
  • 3' and 5' variation ignored - custom_dmel617_ignore_3_var_ignore_5_var.png:
python ignore_AS_type.py \
    -i /TranD/utility_input/dmel_pairwise_transcript_distance.csv \
    -o /TranD/utility_output/PLOTTING/IGNORE_AS \
    -x dmel617 \
    -i3 -i5

Known Error Triggers:

  • The output directory must exist before trying to output to it using the utility.