# lnc mRNA Filtering

One of the main features of transcription is **alternative splicing (AS)**. This process allows a single gene to produce different mRNA products through regulation of the splicing machinery. Each unique alternative transcript is called an **isoform**, and every gene can have one or more isoforms with different characteristics. Different isoforms can code for different proteins, but alternative isoforms commonly carry **premature termination codons (PTCs)** that funnel the transcripts to the nonsense-mediated decay (**NMD**) degradation machinery. In this work we propose a **new function for mRNAs**: some isoforms never reach the ribosomes for translation and are also not degraded, which makes them stable molecules that accumulate in the cell. We believe these isoforms act as lnc-mRNAs, mainly in the nucleus. This script filters isoforms according to all the characteristics mentioned above, leaving a small group of candidate lnc-mRNA isoforms for further study.

lnc-mRNAs are selected by two main features: they are stable (not degraded) and they are not translated. Four datasets are used in this script:
- Polyribosomes: https://doi.org/10.1111/tpj.12502
- NMD deficient (cycloheximide treatment and upf1 & upf3 mutants): https://doi.org/10.1105/tpc.113.115485
- Own ONT data (total, nuclear, cytoplasmic)
- Own Illumina data (total, nuclear, cytoplasmic)

**Objectives:**
- Define a universe of isoforms (total isoforms present in the model studied)
- Remove the isoforms that are being translated (polyribosomes)
- Remove the isoforms that are being degraded (NMD deficient)
    - To achieve this, isoforms whose abundance increases in NMD-deficient samples are considered NMD sensitive
- Keep the nuclear isoforms (and discard the cytoplasmic ones)
- Assess light/dark changes
- Quantify the nuclear/cytoplasmic proportion of these isoforms
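
The NMD-sensitivity criterion above can be sketched as a log2 fold-change test between each NMD-deficient sample and its control. This is only an illustration, not the notebook's actual filter: the pairing of conditions (UPF1UPF3 against WT, CHX against MOCK), the pseudocount, the 2-fold cutoff, and the names `log2fc`, `PSEUDOCOUNT` and `MIN_LOG2FC` are all assumptions. The TPM values are the AT3G61860 Illumina values shown later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy TPM table: AT3G61860 isoforms, Illumina quantification (values from this notebook).
tpm = pd.DataFrame(
    {
        "TPM_WT": [23.847747, 8.399215, 6.825241, 2.186952],
        "TPM_UPF1UPF3": [31.148346, 11.322559, 3.684376, 47.315357],
        "TPM_MOCK": [21.839957, 15.876204, 1.398329, 11.541558],
        "TPM_CHX": [7.526846, 3.061675, 0.090624, 12.37566],
    },
    index=["AT3G61860_ID2", "AT3G61860_P1", "AT3G61860_s1", "AT3G61860_s2"],
)

PSEUDOCOUNT = 1.0  # assumption: avoids division by zero and log2(0)
MIN_LOG2FC = 1.0   # assumption: require at least a 2-fold increase


def log2fc(treated, control, pseudocount=PSEUDOCOUNT):
    """Hypothetical helper: log2 ratio of treated over control TPM."""
    return np.log2((treated + pseudocount) / (control + pseudocount))


calls = pd.DataFrame(
    {
        "log2fc_upf1upf3": log2fc(tpm["TPM_UPF1UPF3"], tpm["TPM_WT"]),
        "log2fc_chx": log2fc(tpm["TPM_CHX"], tpm["TPM_MOCK"]),
    }
)
# Assumption: an increase in either NMD-deficient condition is enough to flag the isoform.
calls["nmd_sensitive"] = (calls["log2fc_upf1upf3"] >= MIN_LOG2FC) | (
    calls["log2fc_chx"] >= MIN_LOG2FC
)
print(calls)
```

With these assumed thresholds only `AT3G61860_s2` is flagged, driven by its increase in the upf1 upf3 double mutant. Requiring both comparisons to agree (`&` instead of `|`) would be a stricter alternative.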

*For this analysis I used a modified version of "AtRTD2_QUASI", called "AtRTD2.1_QUASI". This version was built by filtering out lowly expressed isoforms across 155 datasets quantified with Salmon against the "AtRTD2_QUASI" transcriptome (these datasets include Illumina and ONT data).*


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from utils import concatenate_files_exclude

## Load data

In [2]:
file_dir = "/home/lucas-elytron/Documents/paper_doct/lnc_mrna/mapping/salmon/AtRTDv2_1_QUASI"
transcriptome = pd.read_parquet("/home/lucas-elytron/Documents/paper_doct/lnc_mrna/mapping/reference/mod_refs/AtRTDv2_1_QUASI.LS.parquet")

In [3]:
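# Load the Salmon quantifications, excluding "ONT_data" so only Illumina samples remain.
illumina_data, illumina_files = concatenate_files_exclude(file_dir, transcriptome, "ONT_data")
illumina_data = illumina_data.rename(columns={"Name": "isoform"})

# Reshape to one row per (gene, isoform) and one TPM column per sample.
pivot_df = illumina_data.pivot_table(
    index=["gene", "isoform"],
    columns="df_name",
    values="TPM",
    fill_value=float("nan"),  # Use NaN for missing values
)
# Drop the first 4 and last 6 characters of each sample name and prefix it with "TPM_".
pivot_df.columns = ["TPM_" + col[4:-6] for col in pivot_df.columns]

result_df = pivot_df
result_df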

| gene | isoform | TPM_CHX | TPM_MOCK | TPM_UPF1UPF3 | TPM_UPF1 | TPM_UPF3 | TPM_WT |
|---|---|---|---|---|---|---|---|
| AT1G01010 | AT1G01010.1 | 3.626436 | 21.971265 | 24.489552 | 7.725926 | 8.610666 | 7.638263 |
| AT1G01020 | AT1G01020_P1 | 10.880994 | 25.057372 | 16.052646 | 21.630980 | 16.291703 | 16.261399 |
| AT1G01020 | AT1G01020_P2 | 0.348537 | 2.279401 | 0.358054 | 1.265387 | 0.601279 | 1.381167 |
| AT1G01020 | AT1G01020_P3 | 0.000000 | 3.719321 | 2.606379 | 5.729312 | 0.000000 | 8.076229 |
| AT1G01020 | AT1G01020_P4 | 0.945659 | 1.320172 | 2.640832 | 0.000000 | 3.046311 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| ATMG01330 | ATMG01330.1 | 0.188437 | 0.203131 | 0.090114 | 0.357031 | 0.309601 | 0.101733 |
| ATMG01350 | ATMG01350.1 | 0.198521 | 0.260449 | 0.661466 | 0.000000 | 0.187362 | 0.205945 |
| ATMG01370 | ATMG01370.1 | 5.622655 | 13.540405 | 10.619525 | 15.571162 | 6.142237 | 9.117486 |
| ATMG01380 | ATMG01380.1 | 32.287186 | 72.567447 | 31.927099 | 117.233290 | 6.731542 | 48.946398 |


In [4]:
result_df.loc[["AT3G61860",],:]

| gene | isoform | TPM_CHX | TPM_MOCK | TPM_UPF1UPF3 | TPM_UPF1 | TPM_UPF3 | TPM_WT |
|---|---|---|---|---|---|---|---|
| AT3G61860 | AT3G61860_ID2 | 7.526846 | 21.839957 | 31.148346 | 35.59151 | 32.378725 | 23.847747 |
| AT3G61860 | AT3G61860_P1 | 3.061675 | 15.876204 | 11.322559 | 7.75798 | 6.43216 | 8.399215 |
| AT3G61860 | AT3G61860_s1 | 0.090624 | 1.398329 | 3.684376 | 5.838495 | 4.212439 | 6.825241 |
| AT3G61860 | AT3G61860_s2 | 12.37566 | 11.541558 | 47.315357 | 5.688878 | 4.989811 | 2.186952 |


In [5]:
# Load the Salmon quantifications, excluding "PRJNA176940" so only the ONT samples remain.
ONT_data, ONT_files = concatenate_files_exclude(file_dir, transcriptome, "PRJNA176940")

# Split the long table into one DataFrame per sample (groupby sorts the sample names).
grouped = ONT_data.groupby('df_name')
ont_dfs = {x: grouped.get_group(x) for x in grouped.groups}
ont_dfs_names = list(grouped.groups)
# (condition, replicate) labels, assigned positionally to the sorted sample names;
# the order below must therefore match the order of ont_dfs_names.
ont_barcode_keys = [("Nuc L","TPM_1"),("Nuc L","TPM_2"),("Nuc L","TPM_3"),
                    ("Nuc D","TPM_1"),("Nuc D","TPM_2"),("Nuc D","TPM_3"),
                    ("Cit L","TPM_1"),("Cit L","TPM_2"),("Cit L","TPM_3"),
                    ("Cit D","TPM_1"),("Cit D","TPM_2"),("Cit D","TPM_3"),]
# One TPM column per sample, with two-level (condition, sample) column labels.
ONT_nuc_cito = pd.concat([ont_dfs[x].TPM for x in ont_dfs_names], axis=1, keys=ont_barcode_keys)
#ONT_total = pd.concat([ont_dfs[x].TPM for x in ont_dfs_names[-2:]], axis=1, keys=[("Total D","TPM_1"),("Total L","TPM_1")])
#Check
#ONT_total.columns.set_names(["condition", "sample"], inplace=True)
ONT_nuc_cito.columns.set_names(["condition", "sample"], inplace=True)
ONT_nuc_cito.head()

| gene | isoform | Nuc L TPM_1 | Nuc L TPM_2 | Nuc L TPM_3 | Nuc D TPM_1 | Nuc D TPM_2 | Nuc D TPM_3 | Cit L TPM_1 | Cit L TPM_2 | Cit L TPM_3 | Cit D TPM_1 | Cit D TPM_2 | Cit D TPM_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AT1G01010 | AT1G01010.1 | 2.759793 | 5.941394 | 0.0 | 5.101902 | 8.20283 | 6.580232 | 7.203409 | 7.886016 | 8.593391 | 5.974763 | 12.130097 | 10.128777 |
| AT1G01020 | AT1G01020_P1 | 31.251582 | 55.162492 | 0.0 | 26.586747 | 38.920034 | 35.217447 | 31.214771 | 0.0 | 19.335129 | 9.957938 | 22.957437 | 10.128777 |
| AT1G01020 | AT1G01020_P2 | 9.33035 | 16.04977 | 12.796301 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| AT1G01020 | AT1G01020_P3 | 69.809793 | 47.61562 | 82.457146 | 78.85256 | 58.342096 | 92.000365 | 0.0 | 0.0 | 0.0 | 0.0 | 8.234241 | 0.0 |
| AT1G01020 | AT1G01020_P4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
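
Because the columns carry two levels (`condition`, `sample`), the three replicates of each condition can be collapsed into a single mean column. The sketch below rebuilds two rows of the table above as a toy frame; the names `toy` and `condition_mean` are illustrative and not part of the notebook. Transposing before grouping is one way to group columns by a level; it is a design choice here, not the only option.

```python
import pandas as pd

# Two-level column labels matching the layout of ONT_nuc_cito.
columns = pd.MultiIndex.from_product(
    [["Nuc L", "Nuc D", "Cit L", "Cit D"], ["TPM_1", "TPM_2", "TPM_3"]],
    names=["condition", "sample"],
)
index = pd.MultiIndex.from_tuples(
    [("AT1G01010", "AT1G01010.1"), ("AT1G01020", "AT1G01020_P2")],
    names=["gene", "isoform"],
)
toy = pd.DataFrame(
    [
        [2.759793, 5.941394, 0.0, 5.101902, 8.20283, 6.580232,
         7.203409, 7.886016, 8.593391, 5.974763, 12.130097, 10.128777],
        [9.33035, 16.04977, 12.796301, 0.0, 0.0, 0.0,
         0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    ],
    index=index,
    columns=columns,
)

# Transpose so conditions become row labels, average the replicates of each
# condition, then transpose back. sort=False keeps the original condition order.
condition_mean = toy.T.groupby(level="condition", sort=False).mean().T
print(condition_mean)
```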


In [6]:
ONT_nuc_cito.loc[["AT3G61860",],:]

| gene | isoform | Nuc L TPM_1 | Nuc L TPM_2 | Nuc L TPM_3 | Nuc D TPM_1 | Nuc D TPM_2 | Nuc D TPM_3 | Cit L TPM_1 | Cit L TPM_2 | Cit L TPM_3 | Cit D TPM_1 | Cit D TPM_2 | Cit D TPM_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AT3G61860 | AT3G61860_ID2 | 136.466411 | 135.012686 | 311.250844 | 557.108229 | 550.914695 | 558.654675 | 2.707569 | 0.0 | 0.0 | 12.336501 | 0.0 | 11.070926 |
| AT3G61860 | AT3G61860_P1 | 43.749715 | 52.408766 | 53.185616 | 43.947466 | 31.514748 | 51.05177 | 74.12879 | 104.001051 | 84.234626 | 41.436362 | 46.787517 | 59.830515 |
| AT3G61860 | AT3G61860_s1 | 19.868876 | 26.468734 | 9.250141 | 51.98776 | 61.765164 | 52.703543 | 0.0 | 2.460159 | 5.995978 | 0.0 | 0.0 | 0.0 |
| AT3G61860 | AT3G61860_s2 | 0.0 | 0.0 | 0.0 | 0.0 | 1.485313 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
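
Two of the objectives, the nuclear/cytoplasmic proportion and the light/dark change, can be illustrated on these AT3G61860 values. This is a rough sketch under stated assumptions: the nuclear fraction is defined here as mean nuclear TPM over the sum of mean nuclear and mean cytoplasmic TPM, the dark/light change is a log2 ratio of nuclear means with a pseudocount of 1, and the names `ont`, `nuclear_fraction` and `nuc_dark_vs_light` are illustrative. Since TPM is normalized within each library, the fraction is a relative index rather than an absolute molecule count.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["Nuc L", "Nuc D", "Cit L", "Cit D"], ["TPM_1", "TPM_2", "TPM_3"]],
    names=["condition", "sample"],
)
ont = pd.DataFrame(
    [
        [136.466411, 135.012686, 311.250844, 557.108229, 550.914695, 558.654675,
         2.707569, 0.0, 0.0, 12.336501, 0.0, 11.070926],
        [43.749715, 52.408766, 53.185616, 43.947466, 31.514748, 51.05177,
         74.12879, 104.001051, 84.234626, 41.436362, 46.787517, 59.830515],
        [19.868876, 26.468734, 9.250141, 51.98776, 61.765164, 52.703543,
         0.0, 2.460159, 5.995978, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 1.485313, 0.0,
         0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    ],
    index=["AT3G61860_ID2", "AT3G61860_P1", "AT3G61860_s1", "AT3G61860_s2"],
    columns=columns,
)

# Mean TPM per compartment, pooling light and dark replicates.
nuc = ont[["Nuc L", "Nuc D"]].mean(axis=1)
cit = ont[["Cit L", "Cit D"]].mean(axis=1)

# Assumed definition: share of the signal found in the nuclear libraries.
# Isoforms with no reads in either compartment would give NaN (0 / 0).
nuclear_fraction = nuc / (nuc + cit)

# Assumed definition: log2 ratio of dark over light nuclear means, pseudocount 1.
nuc_dark_vs_light = np.log2(
    (ont["Nuc D"].mean(axis=1) + 1.0) / (ont["Nuc L"].mean(axis=1) + 1.0)
)

summary = pd.DataFrame(
    {"nuclear_fraction": nuclear_fraction, "log2_nucD_vs_nucL": nuc_dark_vs_light}
)
print(summary.round(3))
```

Under these definitions `AT3G61860_ID2` is almost entirely nuclear and increases in the dark, while `AT3G61860_P1` is found mostly in the cytoplasm. Very lowly expressed isoforms such as `AT3G61860_s2` produce unstable fractions, so a minimum expression filter would probably be needed before interpreting them.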
