# Relative percentage of mRNA abundance for the genes of interest

We adopt the mRNA abundance results from:
>Jingyi Jessica Li, Guo-Liang Chew, Mark D. Biggin, Quantitating translational control: mRNA abundance-dependent and independent contributions and the mRNA sequences that specify them, Nucleic Acids Research, Volume 45, Issue 20, 16 November 2017, Pages 11821–11836, https://doi.org/10.1093/nar/gkx898

We use the Csardi et al. data in `File017.xlsx`.
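For reference, the relative percentage reported in this notebook can be written as below. The symbols are introduced here only for illustration and do not come from the source paper: $a_g$ is the mRNA abundance of gene $g$, and $\mathcal{O}$ is the set of ORFs kept after filtering.

```latex
p_g = 100 \times \frac{a_g}{\sum_{i \in \mathcal{O}} a_i}
```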

In [1]:
file_path = "/data2/2024_Yeast_GS/my_current_code/rdme_ode/rdme_ode_mRNA_abundance/gkx898_supp/nar-00812-a-2017-File017.xlsx"

interested_species = ['GAL1', 'GAL2', 'GAL3', 'GAL4', 'GAL80', 'Grep']

mapping_orf = {
    'Gal1': 'YBR020W',
    'Gal2': 'YLR081W',
    'Gal3': 'YDR009W',
    'Gal4': 'YPL248C',
    'Gal80': 'YML051W',
    'reporter': 'YBR020W'
}
# The total number of protein-coding genes is 5616 according to BioNumbers: https://bionumbers.hms.harvard.edu/bionumber.aspx?id=105444&ver=9
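
Because the lookup later in this notebook reports an ORF that is absent from a table as an abundance of 0, it may be worth checking the mapping against each table first. The sketch below is illustrative only: `find_missing_orfs`, the toy table, and its values are hypothetical and not part of the original analysis.

```python
import pandas as pd


def find_missing_orfs(df, mapping, orf_column="ORF"):
    """Return {gene: orf} for mapped ORFs that do not appear in df[orf_column]."""
    present = set(df[orf_column].dropna())
    return {gene: orf for gene, orf in mapping.items() if orf not in present}


# Toy table standing in for one abundance table (the values are made up).
toy = pd.DataFrame({"ORF": ["YBR020W", "YML051W"], "mRNA": [0.5, 1.0]})
toy_mapping = {"Gal1": "YBR020W", "Gal4": "YPL248C", "Gal80": "YML051W"}

missing = find_missing_orfs(toy, toy_mapping)
print(missing)  # {'Gal4': 'YPL248C'}
```

Running the same check on a real table would separate genes that were measured at zero from genes that are simply not listed.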



In [7]:
import pandas as pd
import numpy as np

# Read the Excel file, skipping the first 15 rows
df = pd.read_excel(file_path, skiprows=15)

# For Csardi data (original calculation)
df_csardi = df.iloc[:, [0, 1]].copy()  # Get first two columns (A and B)
df_csardi.columns = ['ORF', 'mRNA_csardi']  # Rename columns
df_csardi['mRNA_csardi'] = pd.to_numeric(df_csardi['mRNA_csardi'], errors='coerce')
df_csardi = df_csardi.dropna()  # Drop rows with a missing ORF or a non-numeric abundance
# Exclude ORFs whose systematic name starts with 'Q' (mitochondrially encoded genes)
df_csardi = df_csardi[~df_csardi['ORF'].str.startswith('Q', na=False)]

# For RPKM data
df_rpkm = df.iloc[:, [8, 9]].copy()  # Get columns I and J
df_rpkm.columns = ['ORF', 'mRNA_rpkm']  # Rename columns
df_rpkm['mRNA_rpkm'] = pd.to_numeric(df_rpkm['mRNA_rpkm'], errors='coerce')
df_rpkm = df_rpkm.dropna()  # Drop rows with a missing ORF or a non-numeric abundance
# Exclude ORFs whose systematic name starts with 'Q' (mitochondrially encoded genes)
df_rpkm = df_rpkm[~df_rpkm['ORF'].str.startswith('Q', na=False)]

# Calculate abundances and percentages for both datasets
def calculate_abundances(df, mrna_column):
    """Return (abundances, percentages) for the genes in mapping_orf.

    Percentages are relative to the summed abundance of all ORFs in df.
    An ORF that is absent from df is reported with an abundance of 0.
    """
    total_abundance = df[mrna_column].sum()
    abundances = {}
    for gene, orf in mapping_orf.items():
        abundance = df.loc[df['ORF'] == orf, mrna_column].values
        abundances[gene] = abundance[0] if len(abundance) > 0 else 0
    percentages = {gene: (abundance / total_abundance) * 100
                   for gene, abundance in abundances.items()}
    return abundances, percentages

# Calculate for both datasets
csardi_abundances, csardi_percentages = calculate_abundances(df_csardi, 'mRNA_csardi')
rpkm_abundances, rpkm_percentages = calculate_abundances(df_rpkm, 'mRNA_rpkm')

# Print results for both datasets
print("=== Results using Csardi data ===")
print(f"Total number of ORFs: {len(df_csardi)}")
print("\nAbsolute mRNA abundance (Csardi):")
for gene, abundance in csardi_abundances.items():
    print(f"{gene}: {abundance:.2f}")

print("\nPercentage of total mRNA abundance (Csardi):")
total_csardi_percentage = sum(csardi_percentages.values())
for gene, percentage in csardi_percentages.items():
    print(f"{gene}: {percentage:.4f}%")
print(f"\nTotal percentage of all GAL genes (Csardi): {total_csardi_percentage:.4f}%")

print("\n=== Results using RPKM data ===")
print(f"Total number of ORFs: {len(df_rpkm)}")
print("\nAbsolute mRNA abundance (RPKM):")
for gene, abundance in rpkm_abundances.items():
    print(f"{gene}: {abundance:.2f}")

print("\nPercentage of total mRNA abundance (RPKM):")
total_rpkm_percentage = sum(rpkm_percentages.values())
for gene, percentage in rpkm_percentages.items():
    print(f"{gene}: {percentage:.4f}%")
print(f"\nTotal percentage of all GAL genes (RPKM): {total_rpkm_percentage:.4f}%")

# Calculate ribosomes for both methods.
# The totals are percentages, so convert them to fractions before scaling by the ribosome count.
ribosomes = 180000
ribosome_gal_csardi = total_csardi_percentage / 100 * ribosomes
ribosome_gal_rpkm = total_rpkm_percentage / 100 * ribosomes

print(f"\nRibosomes in galactose switch system (Csardi): {ribosome_gal_csardi:.2f}")
print(f"Ribosomes in galactose switch system (RPKM): {ribosome_gal_rpkm:.2f}")

=== Results using Csardi data ===
Total number of ORFs: 5483

Absolute mRNA abundance (Csardi):
Gal1: 0.14
Gal2: 0.02
Gal3: 0.17
Gal4: 0.07
Gal80: 1.26
reporter: 0.14

Percentage of total mRNA abundance (Csardi):
Gal1: 0.0009%
Gal2: 0.0001%
Gal3: 0.0011%
Gal4: 0.0004%
Gal80: 0.0082%
reporter: 0.0009%

Total percentage of all GAL genes (Csardi): 0.0117%

=== Results using RPKM data ===
Total number of ORFs: 4839

Absolute mRNA abundance (RPKM):
Gal1: 0.00
Gal2: 0.00
Gal3: 3.50
Gal4: 4.19
Gal80: 81.12
reporter: 0.00

Percentage of total mRNA abundance (RPKM):
Gal1: 0.0000%
Gal2: 0.0000%
Gal3: 0.0004%
Gal4: 0.0005%
Gal80: 0.0091%
reporter: 0.0000%

Total percentage of all GAL genes (RPKM): 0.0100%

Ribosomes in galactose switch system (Csardi): 21.03
Ribosomes in galactose switch system (RPKM): 17.94
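
The last step turns a share expressed in percent into a ribosome count. A minimal sketch of that conversion follows, assuming ribosomes are allocated in proportion to mRNA share. The helper name `ribosomes_from_percent` is hypothetical, and the input is a round illustrative number rather than a value from the dataset.

```python
def ribosomes_from_percent(percent_share, total_ribosomes):
    """Convert a share given in percent into a ribosome count."""
    # A percentage must be turned into a fraction before scaling.
    return percent_share / 100.0 * total_ribosomes


# Illustrative input only: 0.01% of 180000 ribosomes.
example = ribosomes_from_percent(0.01, 180000)
print(f"{example:.2f}")  # 18.00
```

Keeping the division by 100 inside one helper avoids mixing percentages and fractions when the same share is reused elsewhere.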
