# Postprocessing of the ExTRI2 pipeline results to create the ExTRI2 resource

This notebook was used to determine the rules for renormalising and discarding sentences. Sentences were extracted from the ExTRI2 resource and checked manually to decide how to handle each category.

The **Run postprocessing.py** section is self-contained: it creates the post-processed ExTRI2 file from the file produced by the pipeline.

In [30]:
__import__('sys').path.append('../common/'); __import__('notebook_utils').table_of_contents('postprocessing_checkings.ipynb')

<h3>Table of contents</h3>


[Postprocessing of the ExTRI2 pipeline results to create the ExTRI2 resource](#Postprocessing-of-the-ExTRI2-pipeline-results-to-create-the-ExTRI2-resource)
- [Run postprocessing.py](#Run-postprocessing.py)
- [Setup](#Setup)
- [Postprocessing](#Postprocessing)
  - [AP1 & NFKB](#AP1-&-NFKB)
  - [Initial exploration](#Initial-exploration)
  - [Create sets of sentences to check](#Create-sets-of-sentences-to-check)

This notebook will now only be used for the normalisation of the results.

## Run postprocessing.py
Self-contained cell to run postprocessing.py

In [1]:
import sys
sys.path.append('../common')
sys.path.append('../../')
from scripts.postprocessing.postprocessing import *
main()

### POSTPROCESSING TRI_df
We got 6706 different TFs and 26196 different TGs from sentences labelled as TRI
Retrieving from Entrez...

4967 sentences are dropped as their TG is not normalised

38287 rows (4.23%) will have its TF renormalized to NFKB
6329 rows (0.70%) will be dropped as the TG corresponds to NFKB
9003 rows (1.00%) will have its TF renormalized to AP1
1858 rows (0.21%) will be dropped as the TG corresponds to AP1
Breakdown by NCBI Symbol saved in ../../data/postprocessing/tables/AP1_NFKB_breakdown.tsv
Number of renormalized sentences and normalization:
4827	0.54%	p21 is normalized to CDKN1A
1922	0.21%	p53-ps is normalized to its respective p53 symbol

Number of discarded sentences and percentage from total (896330 sentences) and reasoning:
2556	0.29%	Their TF contains -AS[1-3]
673	0.08%	Their TF are circRNAs
952	0.11%	Their TF (NLRP3) is followed by inflammasome but normalised to NLRP3
1875	0.21%	Their TG (NLRP3) is followed by inflammasome but normalised to NLRP3
1088	0.

## Setup

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import itertools
import re

## Custom functions
import sys

sys.path.append('../common')
sys.path.append('../../')

from notebook_utils import table_of_contents, table_from_dict, h3, h4, h5, md, bold
from renormalisations import *
from postprocessing import *
pd.set_option('display.max_colwidth', 20)

In [2]:
# Checkings on the processed final TRI df
config = load_config()
final_TRI_df = load_df(config['final_ExTRI2_p'])
orthologs_df = load_df(config['orthologs_final_p'])

## Postprocessing

### HGNC orthologs

What I did:
- Date: 12/10/2025
- Went to page: https://www.ensembl.org/biomart/martview/

For the orthologs:
- **Dataset:** Human genes (Human genes (GRCh38.p14)), from Ensembl Genes 115
- **Attributes:**
  - Gene stable ID
  - Transcript stable ID
  - Norway rat - BN/NHsdMcwi gene stable ID
  - Mouse gene stable ID

For mapping Ensembl to NCBI:
- Ensembl Genes 115
  - **Dataset:** Human genes (GRCh38.p14) | Mouse genes (GRCm39) | Norway rat - BN/NHsdMcwi genes (GRCr8)
  - **Attributes** > Features > External > NCBI gene ID
  - For human: HGNC ID
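
As a rough illustration of how these exports fit together, the sketch below chains them on toy data: the ortholog export links human and mouse Ensembl gene IDs, and the per-species exports link Ensembl gene IDs to NCBI Gene IDs. Plain dictionaries stand in for the downloaded tables, and every ID is an invented placeholder rather than a real gene.

```python
# Toy stand-ins for the BioMart exports; every ID below is an invented placeholder.
# Ortholog export: human Ensembl gene ID -> mouse Ensembl gene ID (None if no ortholog)
human_to_mouse_ensembl = {'ENSG_A': 'ENSMUSG_A', 'ENSG_B': None}
# Ensembl-to-NCBI exports, one per species
human_ensembl_to_ncbi = {'ENSG_A': '1001', 'ENSG_B': '1002'}
mouse_ensembl_to_ncbi = {'ENSMUSG_A': '2001'}

# Chain the tables: mouse NCBI Gene ID -> human NCBI Gene ID
mouse_to_human_ncbi = {}
for human_ens, mouse_ens in human_to_mouse_ensembl.items():
    human_ncbi = human_ensembl_to_ncbi.get(human_ens)
    mouse_ncbi = mouse_ensembl_to_ncbi.get(mouse_ens)
    # Keep only pairs where both sides have an NCBI Gene ID
    if human_ncbi is not None and mouse_ncbi is not None:
        mouse_to_human_ncbi[mouse_ncbi] = human_ncbi

print(mouse_to_human_ncbi)  # {'2001': '1001'}
```

The same chaining applies to rat by swapping in the rat columns.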

In [40]:
def add_human_orthologs(TRI_df: pd.DataFrame, ensembl_folder: str) -> pd.DataFrame:
    '''
    Add human orthologs, using Ensembl Genes 115 release.
    Returns: 
        TRI_df with four new columns: TF_human_Id, TG_human_Id, TF_HGNC_ID, TG_HGNC_ID
    '''

    # Load Ensembl orthologs (Ensembl Genes 115 version)
    ensembl_orthologs = pd.read_csv(ensembl_folder + 'mart_export_human_orthologs.txt', sep='\t', dtype='str')
    ensembl_orthologs.rename(columns={'Norway rat - BN/NHsdMcwi gene stable ID': 'Norway rat gene stable ID'}, inplace=True)

    # Only keep relevant rows and columns
    m = ensembl_orthologs['Norway rat gene stable ID'].isna() & ensembl_orthologs['Mouse gene stable ID'].isna()
    ensembl_orthologs = ensembl_orthologs[~m][['Gene stable ID', 'Norway rat gene stable ID', 'Mouse gene stable ID']]

    # Load the mappings from Ensembl Gene IDs to NCBI Gene IDs
    ensembl_human_df = pd.read_csv(ensembl_folder + 'mart_export_homo_sapiens_Ensembl_to_NCBI.txt', sep='\t', dtype='str')
    ensembl_mouse_df = pd.read_csv(ensembl_folder + 'mart_export_mus_musculus_Ensembl_to_NCBI.txt', sep='\t', dtype='str')
    ensembl_rat_df = pd.read_csv(ensembl_folder + 'mart_export_rattus_norvegicus_Ensembl_to_NCBI.txt', sep='\t', dtype='str')

    # Build mapping dictionaries from Ensembl ID to NCBI Gene ID
    ensembl_human_map = dict(zip(ensembl_human_df['Gene stable ID'], ensembl_human_df['NCBI gene (formerly Entrezgene) ID']))
    ensembl_mouse_map = dict(zip(ensembl_mouse_df['Gene stable ID'], ensembl_mouse_df['NCBI gene (formerly Entrezgene) ID']))
    ensembl_rat_map   = dict(zip(ensembl_rat_df['Gene stable ID'],   ensembl_rat_df['NCBI gene (formerly Entrezgene) ID']))

    # Add the NCBI Gene ID to the ortholog dataframe
    ensembl_orthologs['Human NCBI Gene ID'] = ensembl_orthologs['Gene stable ID'].map(ensembl_human_map)
    ensembl_orthologs['Mouse NCBI Gene ID'] = ensembl_orthologs['Mouse gene stable ID'].map(ensembl_mouse_map)
    ensembl_orthologs['Rat NCBI Gene ID']   = ensembl_orthologs['Norway rat gene stable ID'].map(ensembl_rat_map)

    # Build mapping from NCBI Gene ID to Human NCBI Gene ID
    mouse_map = dict(zip(ensembl_orthologs['Mouse NCBI Gene ID'], ensembl_orthologs['Human NCBI Gene ID']))
    rat_map   = dict(zip(ensembl_orthologs['Rat NCBI Gene ID'],   ensembl_orthologs['Human NCBI Gene ID']))
    orthologs_map = {k: v for k, v in (mouse_map | rat_map).items() if pd.notna(k) and pd.notna(v)} | {'Complex:NFKB': 'Complex:NFKB', 'Complex:AP1': 'Complex:AP1'}

    # Build mapping from NCBI Gene ID to HGNC ID
    hgnc_map = dict(zip(ensembl_human_df['NCBI gene (formerly Entrezgene) ID'], ensembl_human_df['HGNC ID']))
    hgnc_map = {k: v for k, v in hgnc_map.items() if pd.notna(k) and pd.notna(v)} | {'Complex:NFKB': 'Complex:NFKB', 'Complex:AP1': 'Complex:AP1'}

    def map_NCBI_ID_to_human_NCBI_ID(orthologs_map, gene_ids, taxids):
        '''Map gene IDs (human, mouse, rat) to human NCBI Gene IDs, using Ensembl 115 release orthologs.'''
        human_ids = []
        for id, TaxID in zip(gene_ids.split(';'), taxids.split(';')):
            if TaxID == '9606':
                human_ids.append(id)
            else:
                human_ids.append(orthologs_map.get(id, '-'))
        return ";".join(human_ids)

    def map_NCBI_ID_to_HGNC_ID(hgnc_map, gene_ids):
        '''Map human gene ID to HGNC ID.'''
        human_ids = []
        for id in gene_ids.split(';'):
            human_ids.append(hgnc_map.get(id, '-'))
        return ";".join(human_ids)

    # Add Human NCBI ID into the final dataframe
    TRI_df['TF_human_Id'] = TRI_df.apply(lambda row: map_NCBI_ID_to_human_NCBI_ID(orthologs_map, row['TF Id'], row['TF TaxID']), axis=1)
    TRI_df['TG_human_Id'] = TRI_df.apply(lambda row: map_NCBI_ID_to_human_NCBI_ID(orthologs_map, row['TG Id'], row['TG TaxID']), axis=1)

    # Add HGNC ID into the final dataframe
    TRI_df['TF_HGNC_ID'] = TRI_df.apply(lambda row: map_NCBI_ID_to_HGNC_ID(hgnc_map, row['TF_human_Id']), axis=1)
    TRI_df['TG_HGNC_ID'] = TRI_df.apply(lambda row: map_NCBI_ID_to_HGNC_ID(hgnc_map, row['TG_human_Id']), axis=1)

    return TRI_df

# External folder where I have downloaded the data
ensembl_folder = '../../data/external/ensembl_Release_115_orthologs/'

final_TRI_df = add_human_orthologs(final_TRI_df, ensembl_folder)

In [42]:
m = final_TRI_df['TF_HGNC_ID'].str.contains('-') | final_TRI_df['TG_HGNC_ID'].str.contains('-')
m.sum() / len(final_TRI_df)

0.021792150039132527

In [32]:
display(final_TRI_df[final_TRI_df['TF_human_Id'].str.contains('-')][['TF Id', 'TF Symbol', 'TF TaxID', 'TF_human_Id']].value_counts())
display(final_TRI_df[final_TRI_df['TG_human_Id'].str.contains('-')][['TG Id', 'TG Symbol', 'TG TaxID', 'TG_human_Id']].value_counts())
print(f"Contains '-' {(final_TRI_df['TF_human_Id'].str.contains('-') | final_TRI_df['TG_human_Id'].str.contains('-')).sum() / len(final_TRI_df):.2%}")
print(f"No orthologs found {((final_TRI_df['TF_human_Id'] == '-') | (final_TRI_df['TG_human_Id'] == '-')).sum() / len(final_TRI_df):.2%}")


TF Id                     TF Symbol                       TF TaxID                 TF_human_Id      
19015                     Ppard                           10090                    -                    483
25493                     Nfkbia                          10116                    -                    178
15361                     Hmga1                           10090                    -                    116
108058                    Camk2d                          10090                    -                    103
13039                     Ctsl                            10090                    -                     91
                                                                                                       ... 
311807                    Bmyc                            10116                    -                      1
19015;19013               Ppard;Ppara                     10090;10090              -;5465                 1
19164;19165;387518;67866  Psen1;Pse

TG Id        TG Symbol        TG TaxID     TG_human_Id
19211        Pten             10090        -              370
13088        Cyp2b10          10090        -              289
20309        Cxcl15           10090        -              285
26198        COX2             10116        -              268
20249        Scd1             10090        -              258
                                                         ... 
11668;26358  Aldh1a1;Aldh1a7  10090;10090  -;216            1
293152       Art2b            10116        -                1
29298        Cyp2c7           10116        -                1
292872       Klk1c3           10116        -                1
19202        Rhox6            10090        -                1
Name: count, Length: 1160, dtype: int64

Contains '-' 1.55%
No orthologs found 1.49%


In [32]:
(final_TRI_df['TF_human_Id'].isna() | final_TRI_df['TG_human_Id'].isna()).sum() / len(final_TRI_df)

0.09227138586585801

In [36]:
## CHECK THE DISTRIBUTION OF TF TYPES PER HUMAN MAPPING
# Check whether any TF_human_Id has multiple TF types from different rodent orthologs
agg_funcs = {
    'TF_type': lambda x: '|'.join(x.unique()),
    'TF TaxID': lambda x: '|'.join(x.unique())
}

final_TRI_grouped = final_TRI_df.groupby('TF_human_Id', as_index=False).agg(agg_funcs)
m = final_TRI_grouped['TF_human_Id'].str.contains(';')
final_TRI_grouped[~m]['TF_type'].value_counts()
# TF types joined by '|' in the output mark human IDs with more than one TF type; these need to be reviewed.

TF_type
coTF                 1426
dbTF                 1322
ll_coTF               171
dbTF|coTF               5
ll_coTF|dbTF|coTF       1
ll_coTF|coTF            1
coTF|dbTF               1
Name: count, dtype: int64
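
To list which human IDs carry more than one TF type, rather than only counting them, one option is to count the distinct types per group. This is a sketch on toy data: the column names `TF_human_Id` and `TF_type` are taken from the cell above, while the values are invented.

```python
import pandas as pd

# Toy data: the human ID '1001' is reached with two different TF types
toy_df = pd.DataFrame({
    'TF_human_Id': ['1001', '1001', '1002', '1003'],
    'TF_type':     ['dbTF', 'coTF', 'coTF', 'dbTF'],
})

# Number of distinct TF types per human ID
n_types = toy_df.groupby('TF_human_Id')['TF_type'].nunique()

# Human IDs with conflicting TF types
conflicting_ids = n_types[n_types > 1].index.tolist()
print(conflicting_ids)  # ['1001']
```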

In [None]:
# TODO - If we use Ensembl, then the way we obtained the TF types is no longer valid (as it used the HGNC ones). Find where that is and update it too

In [None]:
# TODO - If we use Ensembl, the cells below are no longer needed. Delete them
# TODO a
# TODO - Update this function to the NCBI mapping (it simplifies it a tonne  :) )
import re

def add_HGNC_symbols(ExTRI2_df: pd.DataFrame, orthologs_path: str) -> tuple[pd.DataFrame, pd.DataFrame]:
    '''
    use ortholog dicts in orthologs_path (downloaded from HGNC) to get HGNC orthologs for mouse, rat, & human HGNC IDs
    '''

    ## HELPER FUNCTIONS
    def get_unique_HGNC_symbol_index(row):
        '''
        Helper function to get a unique HGNC symbol for each Entrez ID, when there are multiple options.
        1) Return exact lowercase match, if any
        2) First gene family member (e.g. ACSM2A for ACSM2; if multiple, take the one with the smallest numeric suffix)
        3) If no match, return NaN
        '''
        # Get rodent symbol & human symbols
        rodent_symbol = row['symbol'].lower()
        human_symbols = row['human_symbol'].lower().split(";")

        # If there's only one symbol, return index 0
        if len(human_symbols) == 1:
            return 0

        # 1) Return exact lowercase match, if any
        for i, c in enumerate(human_symbols):
            if c == rodent_symbol:
                return i

        # 2) Apply first gene family member rule

        # Strip the numbers from the end of rodent_symbol
        family_stem_match = re.match(r'^(.*?)(\d+)?$', rodent_symbol)
        family_stem = family_stem_match.group(1) if family_stem_match else None

        # Get human symbols that start with the same stem
        matches = [(i, c) for i, c in enumerate(human_symbols) if c.startswith(family_stem)] if family_stem else []

        if matches:
            # Sort by numeric suffix (so ACSM2 < ACSM11)
            def numeric_suffix(symbol):
                m = re.search(r'(\d+)$', symbol)
                return int(m.group(1)) if m else float('inf')

            matches.sort(key=lambda x: numeric_suffix(x[1]))
            return matches[0][0]  # return index of first (smallest numeric suffix)

        # No matches found — leave blank
        return None
    
    def assign_unique_fields(row):
        '''Helper function to assign unique fields based on index'''
        idx = get_unique_HGNC_symbol_index(row)

        if idx is None:
            return pd.Series({
                "unique_human_symbol":      "-",
                "unique_hgnc_id":           "-",
                "unique_human_entrez_gene": "-",
            })

        return pd.Series({
            "unique_human_symbol":      row["human_symbol"].split(";")[idx],
            "unique_hgnc_id":           row["hgnc_id"].split(";")[idx],
            "unique_human_entrez_gene": row["human_entrez_gene"].split(";")[idx],
        })

    def fill_ortholog_column(id, column):
        '''Helper function to fill ortholog columns'''
        result = []
        for entrez_gene in id.split(";"):
            # Append "-" when the ID has no ortholog entry, so positions stay aligned with the input IDs
            result.append(orthologs_map[entrez_gene][column] if entrez_gene in orthologs_map else "-")
        return ";".join(result)

    # Create empty df to store orthologs
    orthologs = pd.DataFrame()

    # Get mouse & rat orthologs
    for rodent in ['mouse', 'rat']:
        # Load as string
        hgnc_df = load_df(orthologs_path + f"human_{rodent}_hcop_fifteen_column.txt")
        hgnc_df = hgnc_df.rename(columns={f"{rodent}_entrez_gene": "entrez_gene", f"{rodent}_symbol": "symbol"})
        hgnc_df['TaxID'] = '10090' if rodent == 'mouse' else '10116'
        orthologs = pd.concat([orthologs, hgnc_df])

    # Get human HGNC symbols
    h_hgnc_df = load_df(orthologs_path + "hgnc_human.tsv")
    h_hgnc_df = h_hgnc_df.rename(columns={"HGNC ID": "hgnc_id", "NCBI Gene ID": "entrez_gene", "Approved symbol": "symbol"})
    h_hgnc_df['human_entrez_gene'] = h_hgnc_df['entrez_gene']
    h_hgnc_df['human_symbol'] = h_hgnc_df['symbol']
    h_hgnc_df['TaxID'] = '9606'
    orthologs = pd.concat([orthologs, h_hgnc_df])

    # Keep only IDs present in the ExTRI2_df
    TF_ids = {j for id in ExTRI2_df['TF Id'].unique() for j in id.split(';')}
    TG_ids = {j for id in ExTRI2_df['TG Id'].unique() for j in id.split(';')}
    TF_TG_ids = TF_ids | TG_ids
    orthologs = orthologs[orthologs['entrez_gene'].isin(TF_TG_ids)]

    # Remove all rows that don't have a symbol, human entrez ID, or hgnc ID
    m = (orthologs['symbol'] != '-') & (orthologs['human_entrez_gene'] != '-') & (orthologs['hgnc_id'] != '-')
    orthologs = orthologs[m]

    # === Resolve cases where entrez_gene has a 1-to-many mapping with symbol. Can be fixed by eliminating all "LOCXXXX" & "GmXXXX" ones. ===
    # Get entrez IDs with multiple symbols
    entrezIDs_with_multiple_symbols = (
        orthologs[['entrez_gene', 'symbol']]
        .drop_duplicates()
        .groupby('entrez_gene')
        .filter(lambda g: len(g) > 1)['entrez_gene']
        .unique()
    )
    # Discard all rows with 'LOCXXXX' or 'GmXXXX' symbols for these entrez IDs (& assert this solves all cases)
    m_to_discard = orthologs['entrez_gene'].isin(entrezIDs_with_multiple_symbols) & (orthologs['symbol'].str.contains(r'^(?:LOC|Gm\d+)', regex=True))
    print(f"Discarding {m_to_discard.sum()} rows with 'LOCXXXX' or 'GmXXXX' symbols: {', '.join(orthologs[m_to_discard]['symbol'].unique())}\n")
    orthologs = orthologs[~m_to_discard]
    assert (orthologs[['entrez_gene', 'symbol']].drop_duplicates()['entrez_gene'].duplicated().sum() == 0), "There are still entrez IDs with multiple symbols after discarding 'LOCXXXX' and 'GmXXXX' ones"


    # Join with ';' when an EntrezID has more than 1 human ortholog
    agg_funcs = {
        "symbol": lambda x: ';'.join(x.unique()),
        "TaxID": lambda x: ';'.join(x.unique()),
        "human_entrez_gene": lambda x: ';'.join(x),
        "hgnc_id": lambda x: ';'.join(x),
        "human_symbol": lambda x: ';'.join(x),
    }
    orthologs = orthologs.groupby(['entrez_gene']).agg(agg_funcs).reset_index()

    # Add NFKB & AP1 orthologs
    for dimer in ['NFKB', 'AP1']:
        orthologs = pd.concat([orthologs, pd.DataFrame([{
            'entrez_gene': f'Complex:{dimer}',
            'symbol': dimer,
            'TaxID': '9606',
            'human_entrez_gene': f'Complex:{dimer}',
            'hgnc_id': f'Complex:{dimer}',
            'human_symbol': dimer,
        }])], ignore_index=True)

    # Add columns with 1-to-1 mapping (unique_human_symbol, unique_hgnc_id, unique_human_entrez_gene), by using either exact match or closest match of symbol wrt human_symbol
    orthologs = orthologs.join(orthologs.apply(assign_unique_fields, axis=1))

    # Show how many orthologs we get
    print(f"We get ortholog info for {len(orthologs)}/{len(TF_TG_ids)} Gene IDs\n")

    # Fill in ortholog columns for TFs and TGs using the "unique" fields for a 1-to-1 mapping
    orthologs_map = orthologs.set_index('entrez_gene').to_dict(orient='index')
    for T in ('TF', 'TG'):
        ExTRI2_df[f"{T}_human_entrez_gene"] = ExTRI2_df[f'{T} Id'].apply(lambda id: fill_ortholog_column(id, "unique_human_entrez_gene"))
        ExTRI2_df[f"{T}_hgnc_id"]           = ExTRI2_df[f'{T} Id'].apply(lambda id: fill_ortholog_column(id, "unique_hgnc_id"))
        ExTRI2_df[f"{T}_human_symbol"]      = ExTRI2_df[f'{T} Id'].apply(lambda id: fill_ortholog_column(id, "unique_human_symbol"))

    return ExTRI2_df, orthologs


trial_TRI_df, orthologs_df = add_HGNC_symbols(final_TRI_df, config['orthologs_p'])

Discarding 11 rows with 'LOCXXXX' or 'GmXXXX' symbols: Gm49339, Gm38393, LOC100910792, LOC103693384, LOC102551901, LOC103689968, LOC100911372

We get ortholog info for 24968/25608 Gene IDs
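
The tie-breaking rule in `get_unique_HGNC_symbol_index` can be tried in isolation. The standalone sketch below re-implements the same three steps (single candidate, exact case-insensitive match, smallest numeric suffix within the same family stem) for a pair of strings. The function name `pick_unique_index` and the `Abc`/`Xyz` symbols are hypothetical and used only for illustration.

```python
import re

def pick_unique_index(rodent_symbol: str, human_symbols: str):
    """Index of the human symbol chosen for a rodent symbol, or None if nothing matches."""
    rodent = rodent_symbol.lower()
    candidates = human_symbols.lower().split(';')
    if len(candidates) == 1:
        return 0
    # 1) Exact case-insensitive match
    if rodent in candidates:
        return candidates.index(rodent)
    # 2) Same family stem (symbol without trailing digits), smallest numeric suffix first
    stem = re.sub(r'\d+$', '', rodent)
    family = [(i, c) for i, c in enumerate(candidates) if stem and c.startswith(stem)]
    if not family:
        return None  # 3) No match

    def numeric_suffix(symbol):
        m = re.search(r'(\d+)$', symbol)
        return int(m.group(1)) if m else float('inf')

    # min() keeps the first candidate on ties, like a stable sort
    return min(family, key=lambda ic: numeric_suffix(ic[1]))[0]

print(pick_unique_index('Abc3', 'ABC1;ABC3'))       # 1 (exact match)
print(pick_unique_index('Abc3', 'ABC11;ABC2'))      # 1 (ABC2 has the smallest suffix)
print(pick_unique_index('Acsm2', 'ACSM2A;ACSM2B'))  # 0 (no numeric suffix: first in the list)
print(pick_unique_index('Xyz1', 'ABC1;ABC2'))       # None
```

Note that human symbols ending in a letter, such as ACSM2A, have no numeric suffix, so among them the order in the input string decides.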



In [None]:
## SAVE ORTHOLOGS WITH AUTOMATIC UNIQUE MAPPING TO CHECK MANUALLY

# Find orthologs that have multiple human symbols, and whose chosen unique human symbol is not an exact string match of the rodent symbol
# These were automatically mapped using approximate string matching, and must be checked manually
m = (orthologs_df['human_symbol'].str.contains(';')) & (orthologs_df['symbol'].str.lower() != orthologs_df['unique_human_symbol'].str.lower())
orthologs_df[m][['symbol', 'unique_human_symbol', 'TaxID']]

# Precompute counts in final_TRI_df
tf_counts = final_TRI_df['TF Symbol'].value_counts()
tg_counts = final_TRI_df['TG Symbol'].value_counts()

# Map counts
orthologs_df['TF_count'] = orthologs_df['symbol'].map(tf_counts).fillna(0).astype(int)
orthologs_df['TG_count'] = orthologs_df['symbol'].map(tg_counts).fillna(0).astype(int)

# Save a dataframe of orthologs to check the correctness of the automatic unique mapping
cols_first = ['TaxID', 'entrez_gene', 'symbol', 'unique_human_symbol', 'human_symbol', 'human_entrez_gene', 'hgnc_id']
orthologs_to_check = orthologs_df[m][cols_first + [c for c in orthologs_df[m].columns if c not in cols_first]].sort_values(by=['symbol', 'TaxID']).reset_index(drop=True)
print(f"Number of orthologs to check: {len(orthologs_to_check)}")

orthologs_to_check.to_csv(config['data_p'] + 'validation/orthologs_to_check.tsv', sep='\t', index=False)

Number of orthologs to check: 356


In [None]:
## CHECK THE DISTRIBUTION OF TF TYPES PER HUMAN MAPPING
# Check whether any TF_human_symbol has multiple TF types from different rodent orthologs
agg_funcs = {
    'TF_type': lambda x: ';'.join(x.unique()),
    'TF TaxID': lambda x: ';'.join(x.unique())
}

final_TRI_grouped = final_TRI_df.groupby('TF_human_symbol', as_index=False).agg(agg_funcs)
m = final_TRI_grouped['TF_type'].str.contains(';')
final_TRI_grouped[~m]['TF_type'].value_counts()
# No TF_human_symbol has multiple TF types, nothing to correct.

TF_type
coTF       1415
dbTF       1318
ll_coTF     169
Name: count, dtype: int64

### AP1 & NFKB

AP1 and NFKB are dimers, and as such have neither an NCBI Entrez ID nor an HGNC symbol. PubTator normalizes them to one of their monomers. Therefore, in `postprocessing.py`, we:
* Find all dimers incorrectly normalized to monomers using regex
* Change the TF metadata to AP1/NFKB. Delete the TG instances (a TG can't be a dimer)
* Save a summary of the results and affected sentences in `data/postprocessing/tables`
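
The first two steps can be sketched as follows. This is a minimal illustration on toy rows, not the code in `postprocessing.py`: the regular expressions, the `dimer_for_mention` helper, and the row layout are all assumptions made for the example.

```python
import re

# Hypothetical patterns: the real ones live in postprocessing.py and may differ
DIMER_PATTERNS = {
    'NFKB': re.compile(r'nf[\s-]?(?:kappa|k|κ)[\s-]?b', re.IGNORECASE),
    'AP1':  re.compile(r'\bap[\s-]?1\b', re.IGNORECASE),
}

def dimer_for_mention(mention: str):
    """Return 'NFKB' or 'AP1' if the mention text names the dimer, else None."""
    for dimer, pattern in DIMER_PATTERNS.items():
        if pattern.search(mention):
            return dimer
    return None

# Toy rows: mention text as written in the sentence, plus the role of the entity
rows = [
    {'role': 'TF', 'mention': 'NF-kappaB', 'symbol': 'NFKB1'},
    {'role': 'TF', 'mention': 'AP-1',      'symbol': 'JUN'},
    {'role': 'TG', 'mention': 'NF-κB',     'symbol': 'RELA'},
    {'role': 'TF', 'mention': 'c-Jun',     'symbol': 'JUN'},
]

kept = []
for row in rows:
    dimer = dimer_for_mention(row['mention'])
    if dimer and row['role'] == 'TG':
        continue                        # a TG can't be a dimer: drop the row
    if dimer:
        row = {**row, 'symbol': dimer}  # renormalise the TF to the dimer
    kept.append(row)

print([r['symbol'] for r in kept])  # ['NFKB', 'AP1', 'JUN']
```

Matching on the mention text rather than on the normalised symbol is what separates a true monomer mention such as "c-Jun" from a dimer mention that was normalised to JUN.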

### Initial exploration

In [4]:
# POSTPROCESSING BEFORE RENORMALISATION & DISCARDING WERE IMPLEMENTED
def half_postprocess(ExTRI2_df: pd.DataFrame, TRI_sents: bool, config: dict) -> pd.DataFrame:
    '''same as postprocess but before the renormalisation & discarding'''

    df_type = 'TRI' if TRI_sents else 'nonTRI'
    print(f'### POSTPROCESSING {df_type}_df')

    # Retrieve Symbol & TaxID from Entrez
    save_Symbol_TaxID_dict(ExTRI2_df, config[f'EntrezID_to_Symbol_{df_type}_p'])

    # Filter & add metadata
    if TRI_sents:
        remove_duplicates(ExTRI2_df)
    ExTRI2_df = add_symbols_TaxID(ExTRI2_df, config[f'EntrezID_to_Symbol_{df_type}_p'])
    add_TF_type(ExTRI2_df, config)
    ExTRI2_df = drop_GTFs(ExTRI2_df)
    ExTRI2_df = remove_other_species(ExTRI2_df, TaxID)

    # Fix AP1 & NFKB normalisations
    ExTRI2_df = fix_NFKB_AP1(ExTRI2_df, config)

    return ExTRI2_df

config = load_config()

# Load raw dataframe
TRI_df = load_preprocess_df(config['raw_TRI_p'])

# Postprocess (without renormalisation/discarding)
TRI_df = half_postprocess(TRI_df, TRI_sents=True,  config=config)

### POSTPROCESSING TRI_df
We got 6706 different TFs and 26196 different TGs from sentences labelled as TRI
Retrieving from Entrez...

4967 sentences are dropped as their TG is not normalised

38287 rows (4.23%) will have its TF renormalized to NFKB
6327 rows (0.70%) will be dropped as the TG corresponds to NFKB
9003 rows (1.00%) will have its TF renormalized to AP1
1858 rows (0.21%) will be dropped as the TG corresponds to AP1
Breakdown by NCBI Symbol saved in ../../data/postprocessing/tables/AP1_NFKB_breakdown.tsv


In [5]:
# GET NUMBER OF UNIQUE ENTITY NAMES
ExTRI2_df = TRI_df

m_AP1 = ExTRI2_df['TF Symbol'].str.contains('|'.join(('FOS', 'JUN')), case=False)

NFKB_symbols = {'NFKB1', 'NFKB2', 'RELA', 'RELB'}
m_NFKB = ExTRI2_df['TF Symbol'].str.upper().isin(NFKB_symbols)

print("AP1 TF Unique Entities:", len(ExTRI2_df[m_AP1]['TF'].unique()))
# print(ExTRI2_df[m_AP1]['TF'].unique())
print("NFKB TF Unique Entities:", len(ExTRI2_df[m_NFKB]['TF'].unique()))
# print(ExTRI2_df[m_NFKB]['TF'].unique())

AP1 TF Unique Entities: 155
NFKB TF Unique Entities: 115


In [6]:
# POST-PROCESSING FUNCTIONS
def print_symbol_counts_side_by_side(m_template, max_counts = 10):
    'Print symbol counts of m_template for TF & TG'
    results = []
    for T in ('TF', 'TG'):
        m = eval(m_template.replace('{T}', T))
        T_lines = TRI_df[m][f'{T} Symbol'].value_counts()[:max_counts].to_string().split('\n')
        results.append(T_lines)

    # Print the two tables side by side
    for tf_line, tg_line in zip(*results):
        print(f"{tf_line:<35} {tg_line}")

def print_dubious_pairs_TFTGcounts_side_by_side(dubious_pairs):
    '''Print counts of TF&TG of 3 different symbols side by side'''
    all_tables = []
    for p in dubious_pairs:
        results = []
        for T in ('TF', 'TG'):
            m = TRI_df[f'{T} Symbol'].isin([';'.join((p[0], p[1])), ';'.join((p[1], p[0]))])
            T_counts = TRI_df[m][f'{T}'].value_counts().rename(f'{T} count')[:10]
            results.append(T_counts)

        # Merge the TF and TG counts on the same index
        merged_df = pd.concat(results, axis=1).fillna(0).astype(int)
        all_tables.append(merged_df)

    # Convert each table to a string and split by lines
    table_strings = [table.to_string().split('\n') for table in all_tables]

    # Use itertools.zip_longest to handle tables with different lengths
    for lines in itertools.zip_longest(*table_strings, fillvalue=''):
        # Print each line of the three tables side by side
        print(f"{lines[0]:<40} {lines[1]:<40} {lines[2]}")

def print_TF_TG_counts_side_by_side(title, m_template, sep=40):
    bold(title)
    counts = []
    for T in ('TF', 'TG'):
        m = eval(m_template)
        T_lines = TRI_df[m][[f'{T}', f'{T} Symbol']].value_counts().to_string().split('\n')
        counts.append(T_lines)
    
    for tf_line, tg_line in itertools.zip_longest(*counts, fillvalue=''):
        print(f"{tf_line:<{sep}} {tg_line}")
    print()

In [7]:
### ENTITIES NORMALIZED TO +1 ID
bold("Entities normalized to +1 ID")
m = (TRI_df['TF Symbol'].str.upper().str.contains(';')) | TRI_df['TG Symbol'].str.upper().str.contains(';')
md(f"{m.sum()} ({m.sum() / len(m):.2%}) entities are normalized to more than 1 ID.<br>We revise those that appear more than 100 times further:")
m_template = "TRI_df['{T} Symbol'].str.contains(';')"
print_symbol_counts_side_by_side(m_template, max_counts=15)

dubious_pairs = [('ABL1', 'BCR'), ('FLI1','EWSR1'), ('MMP2','MMP9')]

md(f'From those, 3 seem suspicious and are investigated further: {", ".join((";".join(p) for p in dubious_pairs))}')

print_dubious_pairs_TFTGcounts_side_by_side(dubious_pairs)
md('''\
Through further manual investigation of the sentences, we have determined that:
* FLI1;EWSR1 & ABL1;BCR are fusion genes. They are correct TFs but must be discarded as TGs.
* TG = MMP9;MMP2 entities indicate that the TF regulates both genes.
   
One TG sentence example for each case (first two will be discarded)
''')

for p in dubious_pairs:
    pairs = [';'.join(p) for p in itertools.permutations(p)]
    print(f"{pairs[0]}:\t", TRI_df[TRI_df['TG Symbol'].isin(pairs)].sample(1)['Sentence'].values[0])

<b>Entities normalized to +1 ID</b>

25644 (2.86%) entities are normalized to more than 1 ID.<br>We revise those that appear more than 100 times further:

TF Symbol                           TG Symbol
MAPK3;MAPK1            5347         MAPK3;MAPK1       1425
Mapk3;Mapk1            2798         Mapk3;Mapk1        618
MAP2K1;MAP2K2           717         MMP2;MMP9          395
SMAD2;SMAD3             481         SMAD2;SMAD3        253
MAPK8;MAPK9             311         Smad2;Smad3        171
Map2k1;Map2k2           303         CDK4;CDK6           96
Smad2;Smad3             222         Mmp2;Mmp9           89
EWSR1;FLI1              216         HSD11B1;RNU1-1      66
ABL1;BCR                169         MIR143;MIR145       66
BCR;ABL1                152         CASP3;CASP7         63
CREBBP;EP300            147         MAP2K1;MAP2K2       57
OIP5-AS1;OIP5;PTGDR     144         NKX2-5;NKX3-1       54
SMAD1;SMAD5;SMAD9       128         CASP3;CASP9         42
MAPK1;MAPK3             111         Ifna;Ifnb1          38
HDAC1;HDAC2             104         EWSR1;FLI1          36


From those, 3 seem suspicious and are investigated further: ABL1;BCR, FLI1;EWSR1, MMP2;MMP9

           TF count  TG count                         TF count  TG count                                                   TF count  TG count
BCR/ABL         126         9            EWS-FLI1          142        16          MMP-2/9                                         0        99
BCR-ABL1        108        18            EWS/FLI1           43         6          MMP-2/-9                                        0        68
BCR/ABL1         34         6            EWS/FLI-1          12        11          MMP2/9                                          0        65
Bcr/Abl          26         4            EWS::FLI1           9         2          MMP-2 and -9                                    0        40
BCR::ABL1        10         3            EWSR1-FLI1          9         0          matrix metalloproteinase-2 and -9               0        16
bcr/abl           8         6            EWSR1::FLI1         7         1          matrix metalloproteinase-2/9                    0        11
BCR::A

Through further manual investigation of the sentences, we have determined that:
* FLI1;EWSR1 & ABL1;BCR are fusion genes. They are correct TFs but must be discarded as TGs.
* TG = MMP9;MMP2 entities indicate that the TF regulates both genes.
   
One TG sentence example for each case (first two will be discarded)


ABL1;BCR:	 In this report, we show evidence that [TF] transcription factor stringently controls the expression of [TG], which can strategically be targeted by our novel RUNX inhibitor, Chb-M'.
FLI1;EWSR1:	 [TF] knockdown showed a reduced cell growth and transcriptional activity of [TG].
MMP2;MMP9:	 Finally, nimbolide suppressed the nuclear translocation of p65/p50 and DNA binding of [TF], which is an important transcription factor for controlling [TG] and VEGF gene expression.
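
One way to apply this decision is to drop rows whose TG is a fusion gene while leaving the TF column untouched. The sketch below assumes the `TF Symbol` and `TG Symbol` column names used in this notebook; the `is_fusion` helper, the `FUSION_GENES` list, and the placeholder symbols `GENE_A`, `GENE_B`, `GENE_C`, and `GENE_X` are hypothetical.

```python
import pandas as pd

# Toy rows; GENE_* symbols are placeholders, not real findings
toy_df = pd.DataFrame({
    'TF Symbol': ['ABL1;BCR', 'GENE_A', 'GENE_B', 'GENE_C'],
    'TG Symbol': ['GENE_X', 'BCR;ABL1', 'Ewsr1;Fli1', 'MMP2;MMP9'],
})

# Fusion genes, stored as order-independent sets of symbols
FUSION_GENES = [{'ABL1', 'BCR'}, {'EWSR1', 'FLI1'}]

def is_fusion(symbols: str) -> bool:
    """True if a ';'-joined symbol string is one of the fusion genes, in any order or case."""
    return set(symbols.upper().split(';')) in FUSION_GENES

# Drop rows whose TG is a fusion gene; fusion genes remain valid as TFs
filtered_df = toy_df[~toy_df['TG Symbol'].apply(is_fusion)]
print(filtered_df['TG Symbol'].tolist())  # ['GENE_X', 'MMP2;MMP9']
```

Comparing sets of symbols covers both `ABL1;BCR` and `BCR;ABL1` without listing every permutation.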


### Create sets of sentences to check

In [8]:
sents_to_check_path = config['data_p'] + 'validation/sents_to_check.tsv'
sents_to_check_2_path = config['data_p'] + 'validation/sents_to_check_2.tsv'
sents_to_check_of_path = config['data_p'] + 'validation/sents_to_check_of.tsv'
sents_to_check_CDX_path = config['data_p'] + 'validation/sents_to_check_CDX.tsv'

In [9]:
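# PREPARE SENTENCES TO CHECK
def add_to_sents_to_check(sents_to_check: list, m_template: str, issue: str) -> None:
    '''Append to sents_to_check (in place) the rows whose TF or TG matches m_template, tagged with issue.'''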
    T = 'TF'
    m = eval(m_template)
    T = 'TG'
    m |= eval(m_template)
    df_m = TRI_df[m].copy()
    df_m['issue'] = issue
    sents_to_check.append(df_m)

sents_to_check = []

t = "Sentences with 'p21' not normalised to 'CDKN1A':"
m_template = "(TRI_df[f'{T}'] == 'p21') & (TRI_df[f'{T} Symbol'].str.upper() != 'CDKN1A')"
print_TF_TG_counts_side_by_side(t, m_template)
add_to_sents_to_check(sents_to_check, m_template, 'p21-CDKN1A')


t = "Sentences with 'p53' not normalised to 'TP53':"
m_template = "(TRI_df[f'{T}'] == 'p53') & (TRI_df[f'{T} Symbol'].str.upper() != 'TP53')"
print_TF_TG_counts_side_by_side(t, m_template)
add_to_sents_to_check(sents_to_check, m_template, 'p53-TP53')

bold(f"Sentences with MDM2-TP53 pairs must be removed: they're always a PPI.")
m = TRI_df['TF Symbol'].str.upper().str.contains('MDM2')
m &= TRI_df['TG Symbol'].str.upper().str.contains('TP53')
print(TRI_df[m][['TF Symbol', 'TG Symbol']].value_counts().to_string(), '\n')


t = "Sentences with 'MET' not normalised to 'MET':"
m_template = "(TRI_df[f'{T}'] == 'MET') & (TRI_df[f'{T} Symbol'].str.upper() != 'MET')"
print_TF_TG_counts_side_by_side(t, m_template)
add_to_sents_to_check(sents_to_check, m_template, 'MET')


# Raw strings keep '\d' as a literal backslash-d and avoid invalid-escape warnings
t = r"Sentences with 'CD\d' :"
m_template = r"TRI_df[f'{T}'].str.upper().str.contains(r'^CD(?:4|8A|8B|74|34)(?!\d)')"
print_TF_TG_counts_side_by_side(t, m_template)
add_to_sents_to_check(sents_to_check, m_template, 'CD*')

# Joined NCBI IDs to check
t = "Entities normalised to +1 IDs: ABL1;BCR"
m_template = "TRI_df[f'{T} Symbol'] == 'ABL1;BCR'"
print_TF_TG_counts_side_by_side(t, m_template)
add_to_sents_to_check(sents_to_check, m_template, 'ABL1;BCR')


bold(f"\nAutoregulation:")
m = TRI_df['TF Symbol'].str.upper() == TRI_df['TG Symbol'].str.upper()
md(f"{m.sum() / len(TRI_df):.1%} of sentences show autoregulation: TF & TG are the same. Most popular:")
print(TRI_df[m][['TF Symbol', 'TG Symbol']].value_counts()[:4].to_string())

# Autoregulation sentences are suspected to be frequently wrong. Prepare a set of 300 sentences for validation purposes
df_m = TRI_df[m].sample(n=300)
df_m['issue'] = 'Autoregulation'
sents_to_check.append(df_m)

bold('\nTranslation instead of gene expression')
m = TRI_df['Sentence'].str.lower().str.contains('translat')
print(f"{m.sum()} sentences contain 'translat' in them and should be checked")
df_m = TRI_df[m].sample(n=100)
df_m['issue'] = 'Translate'
sents_to_check.append(df_m)

<b>Sentences with 'p21' not normalised to 'CDKN1A':</b>

TF   TF Symbol                           TG   TG Symbol
p21  Tceal1       14                     p21  H3P16        4542
     TCEAL1        1                          Kras          232
                                              TCEAL1         29
                                              Tceal1         25
                                              Tpt1            1



<b>Sentences with 'p53' not normalised to 'TP53':</b>

TF   TF Symbol                           TG   TG Symbol
p53  Trp53        195                    p53  Trp53-ps     1615
                                              p53-ps        240
                                              Trp53          90



<b>Sentences with MDM2-TP53 pairs must be removed: they're always a PPI.</b>

TF Symbol  TG Symbol
MDM2       TP53         1609
Mdm2       TP53           23
MDM2;MDM4  TP53            3
MDM2       TP53BP2         1 



<b>Sentences with 'MET' not normalised to 'MET':</b>

TF   TF Symbol                           TG   TG Symbol
MET  SLTM         494                    MET  SLTM         392



<b>Sentences with 'CD\d' :</b>

TF            TF Symbol                  TG             TG Symbol
CD34          CD34         1704          CD34           CD34         206
              Cd34          113          CD4            CD4          162
CD34(         Cd34            5                         Cd4           82
              CD34            2          Cd4            CD4           46
CD4CD25FoxP3  FOXP3           2          CD34           Cd34          40
cd34          CD34            2          CD74           CD74          27
CD34+         CD34            1          CD8alpha       Cd8a          17
CD34Exo       CD34            1          Cd4            Cd4           12
CD34LC        CD34            1          Cd74           Cd74           9
CD34brCD38    CD34;CD38       1          CD74           Cd74           8
CD4.Ezh2      Cd4;Ezh2        1          Cd8a           CD8A           5
                                         CD8a           Cd8a           5
                                         CD8alpha       CD

<b>Entities normalised to +1 IDs: ABL1;BCR</b>

TF        TF Symbol                      TG       TG Symbol
BCR/ABL   ABL1;BCR     126               BCR/ABL  ABL1;BCR     9
Bcr/Abl   ABL1;BCR      26               bcr/abl  ABL1;BCR     6
bcr/abl   ABL1;BCR       8               Bcr/Abl  ABL1;BCR     4
BCR::ABL  ABL1;BCR       5               
Bcr/abl   ABL1;BCR       3               
BCR/abl   ABL1;BCR       1               



<b>
Autoregulation:</b>

4.8% of sentences show autoregulation: TF & TG are the same. Most popular:

TF Symbol  TG Symbol
TP53       TP53         1978
VEGFA      VEGFA        1108
EGFR       EGFR          852
MYC        MYC           719


<b>
Translation instead of gene expression</b>

7479 sentences contain 'translat' in them and should be checked
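
The masks in the cell above come from a string template that is evaluated once with `T = 'TF'` and once with `T = 'TG'`, then combined with a logical OR. The same idea can be sketched without `eval` by passing a function instead of a string. This is only an illustration on a made-up frame that mimics the column naming (`TF`, `TF Symbol`, `TG`, `TG Symbol`); `either_side_mask`, `p21_not_cdkn1a` and `toy_df` are hypothetical names, not part of the pipeline.

```python
import pandas as pd

def either_side_mask(df: pd.DataFrame, condition) -> pd.Series:
    """OR together the masks that `condition` yields for the TF side and the TG side."""
    mask = pd.Series(False, index=df.index)
    for side in ("TF", "TG"):
        mask |= condition(df, side)
    return mask

def p21_not_cdkn1a(df: pd.DataFrame, side: str) -> pd.Series:
    """Mention is 'p21' but the normalised symbol is not CDKN1A (case-insensitive)."""
    return (df[side] == "p21") & (df[f"{side} Symbol"].str.upper() != "CDKN1A")

# Made-up rows, for illustration only
toy_df = pd.DataFrame({
    "TF": ["p21", "p53", "MYC"],
    "TF Symbol": ["Tceal1", "TP53", "MYC"],
    "TG": ["MYC", "p21", "p21"],
    "TG Symbol": ["MYC", "CDKN1A", "Kras"],
})

mask = either_side_mask(toy_df, p21_not_cdkn1a)
flagged = toy_df[mask].assign(issue="p21-CDKN1A")
print(flagged)
```

Passing a callable keeps the "apply to both sides" behaviour while letting linters and tracebacks see the condition as ordinary code.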


In [10]:
# CREATE EXCEL
def extract_context(sentence, token='[TF]', window=4, how='both'):
    '''Return up to `window` words on each side of the first word containing `token`.

    how='left' keeps only the preceding words, how='right' only the following ones.
    '''
    # Split the sentence by spaces
    words = sentence.split()

    # Index of the first word containing the token (raises IndexError if there is none)
    index = [i for i, word in enumerate(words) if token in word][0]

    # Slice bounds around the token, clipped to the sentence boundaries
    start = index if how == 'right' else max(0, index - window)
    end = index + 1 if how == 'left' else min(len(words), index + window + 1)

    # Join the extracted context words back into a string
    return ' '.join(['...'] + words[start:end] + ['...'])

sents_to_check = pd.concat(sents_to_check)

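# 'issue' is listed only once: repeating it would duplicate the column in the output
cols_to_keep = ['issue', 'TF', 'TF Symbol', 'TG', 'TG Symbol', 'Sentence', '#SentenceID',
                'TF Id', 'TG Id', 'MoR', 'TF TaxID', 'TG TaxID', 'TF_type']

# Copy so that the context columns added below are not written into a view
sents_to_check = sents_to_check[cols_to_keep].copy()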

for T in ('TF', 'TG'):
    sents_to_check[f'{T}_context'] = sents_to_check['Sentence'].apply(lambda x: extract_context(x, token=f'{T}'))
    sents_to_check[f'{T}_left_context'] = sents_to_check['Sentence'].apply(lambda x: extract_context(x, token=f'{T}', how='left'))
    sents_to_check[f'{T}_right_context'] = sents_to_check['Sentence'].apply(lambda x: extract_context(x, token=f'{T}', how='right'))

sents_to_check.to_csv(sents_to_check_path, sep='\t', index=False)
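
As a worked illustration of the context columns, the sketch below applies the same windowing logic to a made-up sentence. It assumes that entity mentions are marked as `[TF]` and `[TG]` in the sentence text; `context_window` is a hypothetical stand-in written for this example, not the notebook's own function.

```python
def context_window(sentence: str, token: str, window: int = 4, how: str = "both") -> str:
    """Return up to `window` words around the first word that contains `token`."""
    words = sentence.split()
    # First word containing the token; StopIteration is raised if there is none
    index = next(i for i, word in enumerate(words) if token in word)
    start = index if how == "right" else max(0, index - window)
    end = index + 1 if how == "left" else min(len(words), index + window + 1)
    return " ".join(["..."] + words[start:end] + ["..."])

# Made-up sentence, for illustration only
sentence = "We show that [TF] binds the promoter of [TG] in HeLa cells"

both = context_window(sentence, "[TF]", window=2)
left = context_window(sentence, "[TF]", window=2, how="left")
right = context_window(sentence, "[TG]", window=2, how="right")

print(both)   # ... show that [TF] binds the ...
print(left)   # ... show that [TF] ...
print(right)  # ... [TG] in HeLa ...
```

Note that the token is matched as a substring of a word, so a short token such as `TG` would also match a word like `TGF`; matching on the bracketed form avoids that.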

In [11]:
# PREPARE 2nd SET OF SENTENCES TO CHECK
bold(f"\ndbTF Autoregulation:")

# Get a set of 300 dbTF autoregulation sentences to check.
# The previous set contained many coTF sentences; here we want the number of false positives in dbTF-specific autoregulation.
m = TRI_df['TF Symbol'].str.upper() == TRI_df['TG Symbol'].str.upper()
m &= TRI_df['TF_type'] == 'dbTF'
md(f"{m.sum() / len(TRI_df):.1%} of sentences show autoregulation: TF & TG are the same. Most popular:")
print(TRI_df[m][['TF Symbol', 'TG Symbol']].value_counts()[:4].to_string())
df_m = TRI_df[m].sample(n=300)
df_m['issue'] = 'dbTF_autoregulation'

# We will only check dbTF autoregulation
sents_to_check_2 = df_m
tab_cols = ("issue	#SentenceID	TF	TF Symbol	TG	TG Symbol	Sentence	TF Id	TG Id	TF offset	Gene offset	TRI score	Valid	MoR scores	MoR	PMID	PMID+Sent+TRI_Id	TF TaxID	TG TaxID	TF_type")
sents_to_check_2 = sents_to_check_2[tab_cols.split("\t")]

sents_to_check_2.to_csv(sents_to_check_2_path, sep='\t', index=False)
print(f"results saved in {sents_to_check_2_path}")

<b>
dbTF Autoregulation:</b>

2.1% of sentences show autoregulation: TF & TG are the same. Most popular:

TF Symbol  TG Symbol
TP53       TP53         1978
MYC        MYC           719
HIF1A      HIF1A         478
ESR1       ESR1          434
results saved in ../../data/validation/sents_to_check_2.tsv
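
The validation sets above are drawn with `sample(n=...)`, which returns a different subset on each run and raises an error when fewer than `n` rows match. A minimal sketch of a seeded, size-guarded draw follows; `sample_for_validation`, the seed value and the toy frame are assumptions made for this example and are not what the notebook ran.

```python
import pandas as pd

def sample_for_validation(df: pd.DataFrame, n: int, issue: str, seed: int = 0) -> pd.DataFrame:
    """Draw at most `n` rows reproducibly and tag them with `issue`."""
    n = min(n, len(df))  # avoid the ValueError raised when n exceeds the population
    return df.sample(n=n, random_state=seed).assign(issue=issue)

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "TF Symbol": ["TP53", "MYC", "HIF1A"],
    "TG Symbol": ["TP53", "MYC", "HIF1A"],
})

first = sample_for_validation(toy, 2, "dbTF_autoregulation")
second = sample_for_validation(toy, 2, "dbTF_autoregulation")
capped = sample_for_validation(toy, 300, "dbTF_autoregulation")

print(first.equals(second))  # same seed, same rows
print(len(capped))           # capped at the number of available rows
```

Fixing the seed makes the exported TSV stable across reruns, so manual annotations stay aligned with the same sentences.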


In [12]:
# PREPARE 3rd SET OF SENTENCES TO CHECK
m = final_TRI_df['TG'] == 'of'
m |= final_TRI_df['TF'] == 'of'
sents_to_check_of = final_TRI_df[m].sort_values(by=['TF', 'TG'])
sents_to_check_of.to_csv(sents_to_check_of_path, sep='\t', index=False)

m = final_TRI_df['TG'].str.upper().str.contains("^CD[0-9]")
m &= final_TRI_df['Sentence'].str.contains(r"\[TG\] ?")  # raw string: "\[" is an invalid escape otherwise

display(final_TRI_df[m][['TG', 'TG Symbol']].value_counts())
m.sum()

TG        TG Symbol
CD44      CD44         944
CD133     PROM1        350
CD40      CD40         306
CD86      CD86         232
          Cd86         180
                      ... 
CD40L.    Cd40lg         1
CD41b     ITGA2B         1
CD42a     GP9            1
CD44High  Cd44           1
cd59      CD59           1
Name: count, Length: 491, dtype: int64

9319

In [13]:
m = final_TRI_df['TG'].str.upper().str.contains("^CD[0-9]")
final_TRI_df[m].to_csv(sents_to_check_CDX_path, sep='\t', index=False)
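
To make the two CD patterns used in this notebook concrete, the sketch below runs them on a handful of made-up mentions. The broad pattern `^CD[0-9]` keeps any mention that starts with `CD` followed by a digit once upper-cased, while the narrower pattern from the earlier cell uses a negative lookahead `(?!\d)` so that `CD4` does not also match `CD44` or `CD40`. The example series are assumptions for illustration only.

```python
import pandas as pd

# Made-up mentions, for illustration only
mentions = pd.Series(["CD44", "cd59", "CD8alpha", "CDK1", "CDX2", "FOXP3"])

# Broad pattern: 'CD' followed by any digit, case-insensitive via upper()
broad = mentions.str.upper().str.contains(r"^CD[0-9]")
print(broad.tolist())

# Narrow pattern: only CD4, CD8A, CD8B, CD74, CD34, not followed by another digit
subset = pd.Series(["CD4", "CD44", "CD8alpha", "CD34+", "CD40", "Cd74"])
narrow = subset.str.upper().str.contains(r"^CD(?:4|8A|8B|74|34)(?!\d)")
print(narrow.tolist())
```

The non-capturing group `(?:...)` is used so that pandas does not warn about match groups in `str.contains`.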