# gtfParsingForDecapping.ipynb
## Marcus Viscardi, April 15, 2024

Working in `decappingQuantification.ipynb`, I realized that we need to be able to calculate distances in transcriptome space from the genomic coordinates of our 5TERA libraries.

To do this, we'll want to be able to "walk" along the GTF file as we move through the genomic space, then subtract out the intronic regions to get the distance along the transcript.
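A minimal sketch of that "subtract the introns" idea, under some assumptions that are not from the real data: exons are given as 0-based, half-open `(start, end)` intervals (BED-style; GTF coordinates are 1-based inclusive and would need a shift), and the function names `genomic_to_transcript_pos` and `transcript_distance` are hypothetical helpers made up for illustration.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # assumed 0-based, half-open (start, end)


def genomic_to_transcript_pos(pos: int, exons: List[Interval], strand: str = "+") -> Optional[int]:
    """Return the 0-based offset of `pos` from the transcript 5' end,
    counting exonic bases only. Returns None if `pos` is not in any exon."""
    ordered = sorted(exons)
    if strand == "-":
        # On the minus strand the 5' end is the highest genomic coordinate
        ordered = ordered[::-1]
    offset = 0  # exonic bases already walked past
    for start, end in ordered:
        if start <= pos < end:
            if strand == "+":
                return offset + (pos - start)
            return offset + (end - 1 - pos)
        offset += end - start
    return None


def transcript_distance(pos_a: int, pos_b: int, exons: List[Interval], strand: str = "+") -> Optional[int]:
    """Distance between two genomic positions along the spliced transcript."""
    a = genomic_to_transcript_pos(pos_a, exons, strand)
    b = genomic_to_transcript_pos(pos_b, exons, strand)
    if a is None or b is None:
        return None
    return abs(a - b)


# Invented toy exons: three 100 bp exons separated by 100 bp introns
toy_exons = [(100, 200), (300, 400), (500, 600)]
print(transcript_distance(150, 350, toy_exons))  # genomic distance is 200, one intron is skipped
```

Positions that land in an intron come back as `None` here; whether those should instead be snapped to the nearest exon edge is a separate decision.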

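A big problem with this is gonna be the fact that most genes have multiple transcripts, so the same pair of genomic coordinates can give different transcript distances depending on which isoform we use...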

For our NMD guyz, we have previously binned them into target, non-target, and ambiguous. With this, we can figure out which transcripts in the GTF are the NMD targets and which are the non-targets, then work from there? Feels crazy... And it doesn't handle genes with multiple non-target transcripts...

***

Here, we want to see if we can parse the GTF file into something that will be useful for this purpose!
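One possible shape for "something useful" is a lookup from each transcript to its sorted exon intervals. This is only a sketch: the column names (`transcript_id`, `feature`, `start`, `end`) mirror the ones used further down in this notebook, but the toy table, the transcript names `toy.1`/`toy.2`, and the helpers `exons_by_transcript` and `spliced_length` are all invented for illustration. The length calculation assumes half-open intervals.

```python
import pandas as pd


def exons_by_transcript(df: pd.DataFrame) -> dict:
    """Map each transcript_id to a sorted list of (start, end) exon intervals."""
    exon_rows = df[df["feature"] == "exon"]
    exon_map = {}
    for transcript_id, group in exon_rows.groupby("transcript_id"):
        # Cast to plain ints so the intervals are easy to compare and print
        intervals = [(int(s), int(e)) for s, e in zip(group["start"], group["end"])]
        exon_map[transcript_id] = sorted(intervals)
    return exon_map


def spliced_length(intervals) -> int:
    """Total exonic length, assuming half-open (start, end) intervals."""
    return sum(end - start for start, end in intervals)


# Invented toy annotation: two isoforms of one gene, rows deliberately unsorted
toy_df = pd.DataFrame({
    "transcript_id": ["toy.1", "toy.1", "toy.2", "toy.2", "toy.2", "toy.1"],
    "feature": ["exon", "exon", "exon", "exon", "exon", "transcript"],
    "start": [300, 100, 100, 300, 500, 100],
    "end": [400, 200, 200, 400, 600, 400],
})

toy_exons = exons_by_transcript(toy_df)
for transcript_id, intervals in toy_exons.items():
    print(transcript_id, intervals, spliced_length(intervals))
```

Keeping the intervals per transcript (rather than per gene) means the multiple-isoform question gets pushed to whoever picks the `transcript_id`, which seems like the right place for it.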

In [None]:
import nanoporePipelineCommon as npCommon

import numpy as np
import pandas as pd

import seaborn as sea
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

from tqdm.auto import tqdm

from pprint import pprint

from pathlib import Path

from icecream import ic
from datetime import datetime

pd.set_option('display.width', 200)
pd.set_option('display.max_columns', None)

def __time_formatter__():
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    return f"ic: {now} | > "
ic.configureOutput(prefix=__time_formatter__)

_ = ic("Imports done!")

In [None]:
gtf_parquet_path = Path("/data16/marcus/genomes/plus_cerENO2_elegansRelease100/230327_allChrs_plus-cerENO2.gtf.parquet")

gtf_df = pd.read_parquet(gtf_parquet_path)
gtf_df.query("gene_name == 'ubl-1'")

In [None]:
bed_path = Path("/data16/marcus/genomes/plus_cerENO2_elegansRelease100/230327_allChrs_plus-cerENO2.bed")

# Each line has 9 fixed BED-style columns, plus a 10th column holding the GTF-style attributes
bed_dict = {}
with open(bed_path, 'r') as bed_file:
    lines = bed_file.readlines()
    for i, line in tqdm(enumerate(lines), total=len(lines)):
        line = line.strip().split("\t")
        line_dict = {
            "chr": line[0],
            "start": int(line[1]),
            "end": int(line[2]),
            "gene_id": line[3].strip('"'),
            "score": line[4],
            "strand": line[5],
            "source": line[6],
            "feature": line[7],
            "item_rgb": line[8],
        }
        # Attribute items are expected as `key=value` or `key "value"`, separated by "; "
        for item in line[9].strip(";").split("; "):
            try:
                if "=" in item:
                    key, value = item.split("=", 1)
                else:
                    key, value = item.split(" ", 1)
            except ValueError:
                # Show the offending line before bailing so it's easy to track down
                pprint(line)
                print(f"Error on line {i}: {item}")
                raise ValueError(f"Error on line {i}: {item}")
            line_dict[key] = value.strip('"')
        bed_dict[i] = line_dict
# One row per line of the file, indexed by line number
bed_df = pd.DataFrame.from_dict(bed_dict, orient="index")

In [None]:
target_gene = 'ubl-1'
gene_df = bed_df.query("gene_name == @target_gene")

# Print the exon table for each transcript of the target gene
for trans_id in gene_df.transcript_id.unique():
    print('\n', trans_id)
    print(gene_df.query("transcript_id == @trans_id and feature == 'exon'").set_index('exon_id')[['start', 'end', 'feature']])
# Then all exons for the gene, across every transcript
print(gene_df.query("feature == 'exon'").set_index('exon_id')[['start', 'end', 'feature']])