# Postprocessing of the ExTRI2 pipeline results to create the ExTRI2 resource

This notebook was used to determine the rules for renormalising and discarding sentences. Sentences were extracted from the ExTRI2 resource and checked manually to decide how to handle each category.

The **Run postprocessing.py** section is self-contained: it creates the post-processed ExTRI2 file from the file produced by the pipeline.

In [1]:
__import__('sys').path.append('../common/'); __import__('notebook_utils').table_of_contents('postprocessing_checkings.ipynb')

<h3>Table of contents</h3>


[Postprocessing of the ExTRI2 pipeline results to create the ExTRI2 resource](#Postprocessing-of-the-ExTRI2-pipeline-results-to-create-the-ExTRI2-resource)
- [Run postprocessing.py](#Run-postprocessing.py)
- [Setup](#Setup)
- [Postprocessing](#Postprocessing)
  - [HGNC orthologs](#HGNC-orthologs)
  - [AP1 & NFKB](#AP1-&-NFKB)
  - [Initial exploration](#Initial-exploration)
  - [Create sets of sentences to check](#Create-sets-of-sentences-to-check)

This notebook will now only be used for the normalisation of the results.

## Run postprocessing.py
Self-contained cell to run postprocessing.py

In [2]:
import sys
sys.path.append('../common')
sys.path.append('../../')
from scripts.postprocessing.postprocessing import *
main()

### POSTPROCESSING TRI_df
We got 6706 different TFs and 26196 different TGs from sentences labelled as TRI
Retrieving from Entrez...

4967 sentences are dropped as their TG is not normalised

38287 rows (4.23%) will have its TF renormalized to NFKB
6329 rows (0.70%) will be dropped as the TG corresponds to NFKB
9003 rows (1.00%) will have its TF renormalized to AP1
1858 rows (0.21%) will be dropped as the TG corresponds to AP1
Breakdown by NCBI Symbol saved in ../../data/postprocessing/tables/AP1_NFKB_breakdown.tsv
Number of renormalized sentences and normalization:
4827	0.54%	p21 is normalized to CDKN1A
1922	0.21%	p53-ps is normalized to its respective p53 symbol

Number of discarded sentences and percentage from total (896330 sentences) and reasoning:
2556	0.29%	Their TF contains -AS[1-3]
673	0.08%	Their TF are circRNAs
952	0.11%	Their TF (NLRP3) is followed by inflammasome but normalised to NLRP3
1875	0.21%	Their TG (NLRP3) is followed by inflammasome but normalised to NLRP3
1088	0.

## Setup

In [3]:
import pandas as pd
import numpy as np
import json
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import itertools
import re

## Custom functions
import sys

sys.path.append('../common')
sys.path.append('../../')

from notebook_utils import table_of_contents, table_from_dict, h3, h4, h5, md, bold
from renormalisations import *
from postprocessing import *
pd.set_option('display.max_colwidth', 20)

In [4]:
# Checkings on the processed final TRI df
config = load_config()
final_TRI_df = load_df(config['final_ExTRI2_p'])
orthologs_df = load_df(config['orthologs_final_p'])

display(final_TRI_df[:2])
display(orthologs_df[:2])

Unnamed: 0,#SentenceID,Sentence,TF,TG,TF Id,TG Id,TF offset,Gene offset,TRI score,Valid,...,TG Symbol,TG TaxID,TF_type,renormalisation,TF_human_Id,TF_human_symbol,TF_HGNC_Id,TG_human_Id,TG_human_symbol,TG_HGNC_Id
0,PMID:35388756:pu...,Chip-IP results ...,CHCHD2,GNPTG,51142,84572,1366,1458,0.992724397527042,Valid,...,GNPTG,9606,coTF,,51142,CHCHD2,HGNC:21645,84572,GNPTG,HGNC:23026
1,PMID:26808438:pu...,Among multiple C...,ChREBP,"Mid1ip1,Txnip",51085,58526;10628,919,1050,0.9927056483604216,Valid,...,MID1IP1;TXNIP,9606,dbTF,,51085,MLXIPL,HGNC:12744,58526;10628,MID1IP1;TXNIP,HGNC:20715;HGNC:...


Unnamed: 0,Gene_ID,human_gene_ID,TaxID,gene_symbol,human_gene_symbol,unique_human_gene_ID,unique_human_gene_symbol,HGNC_ID,unique_HGNC_ID
0,100009600,100125288.0,10090,Zglp1,ZGLP1,100125288.0,ZGLP1,HGNC:37245,HGNC:37245
1,100033459,,10090,Ifi208,,,,,


## Postprocessing

### Orthologs

In [5]:
# CHECKINGS ON ORTHOLOGS_DF
ExTRI2_df = final_TRI_df.copy()
display(orthologs_df[:2])

# Unique human mappings
m = orthologs_df['human_gene_ID'].str.contains(';')
m1 = m & (orthologs_df['unique_human_gene_ID'] != 'None')
m2 = m1 & (orthologs_df['unique_human_gene_symbol'].str.upper() == orthologs_df['gene_symbol'].str.upper())

print(f"""\
All gene IDs in ExTRI2: {len(orthologs_df)}    
Missing gene symbols: {orthologs_df['gene_symbol'].str.contains('None').sum()}

One-to-one human gene mapping: {(~m & (orthologs_df['human_gene_ID'] != 'None')).sum()}
One-to-many human gene mapping: {m.sum()}
- With a unique_human_gene_ID assigned: {m1.sum()} ({(m & ~m1).sum()} unassigned)
- Assigned through exact match: {m2.sum()} ({(m1 & ~m2).sum()} through 1st family member rule or manual correction)
Missing human gene mapping: {(orthologs_df['human_gene_ID'] == 'None').sum()}
Missing HGNC IDs: {(orthologs_df['HGNC_ID'] == 'None').sum()}
""")

Unnamed: 0,Gene_ID,human_gene_ID,TaxID,gene_symbol,human_gene_symbol,unique_human_gene_ID,unique_human_gene_symbol,HGNC_ID,unique_HGNC_ID
0,100009600,100125288.0,10090,Zglp1,ZGLP1,100125288.0,ZGLP1,HGNC:37245,HGNC:37245
1,100033459,,10090,Ifi208,,,,,


All gene IDs in ExTRI2: 24574    
Missing gene symbols: 0

One-to-one human gene mapping: 23723
One-to-many human gene mapping: 270
- With a unique_human_gene_ID assigned: 247 (23 unassigned)
- Assigned through exact match: 189 (58 through 1st family member rule or manual correction)
Missing human gene mapping: 581
Missing HGNC IDs: 619



In [6]:
# MANUALLY CHECK ORTHOLOGS WITH UNIQUE HUMAN ID EITHER UNASSIGNED OR ASSIGNED THROUGH 1ST FAMILY MEMBER RULE

# --- PREPARING TABLE OF ORTHOLOGS TO CHECK ---  
def get_unique_human_symbol_index(row):
    '''
    Get the index of the unique human gene symbol following these rules:
    1) Return exact case-insensitive match, if any
    2) First gene family member (e.g. ACSM2A for ACSM2; if multiple, take the one with the smallest numeric suffix)
    3) If no match, return None
    '''
    # Get rodent symbol & human symbols
    rodent_symbol = row['gene_symbol'].upper()
    human_symbols = row['human_gene_symbol'].upper().split(";")

    # If there's only one symbol, return index 0
    if len(human_symbols) == 1:
        return 0

    # 1) Exact match (case-insensitive)
    for i, c in enumerate(human_symbols):
        if c == rodent_symbol:
            return i

    # 2) Apply first gene family member rule:
    # Assumption: gene names are formed by "[A-Z]+[0-9]*[A-Z]?"
    # Extract family stem and number (e.g. ADH1 -> stem=ADH, number=1)
    m = re.match(r'^([A-Z]+)(\d+)?([A-Z]?)(\d+)?$', rodent_symbol)
    stem = m.group(1) if m else rodent_symbol
    num = m.group(2) if (m and m.group(2)) else None
    letter = m.group(3) if (m and m.group(3)) else None
    num2 = m.group(4) if (m and m.group(4)) else None

    # Get human symbols that start with the same stem
    candidates = [(i, hs) for i, hs in enumerate(human_symbols) if hs.startswith(stem)]
    if not candidates:
        return None
    
    # Prefer a candidate that is exactly the stem
    for i, hs in candidates:
        if hs == stem:
            return i
        
    # If there is a number, prefer same numeric family (e.g. ADH1A over ADH2A)
    if num:
        same_family = [(i, hs) for i, hs in candidates if hs.startswith(stem + num)]

        if same_family:
            # Prefer exact stem+num (if present), else smallest suffix
            exact = [i for i, hs in same_family if hs == stem + num]
            if exact:
                return exact[0]
            same_family.sort(key=lambda x: x[1])
            return same_family[0][0]
        
    # Otherwise, just pick smallest lexicographic suffix (first family member)
    candidates.sort(key=lambda x: x[1])
    return candidates[0][0]

def assign_unique_human_fields(row):
    '''Helper function to assign unique fields based on index'''
    idx = get_unique_human_symbol_index(row)

    if idx is None:
        return pd.Series({
            'unique_human_gene_ID': 'None',
            'unique_human_gene_symbol':  'None',
        })

    return pd.Series({
        'unique_human_gene_ID': row['human_gene_ID'].split(';')[idx],
        'unique_human_gene_symbol': row['human_gene_symbol'].split(';')[idx],
    })

orthologs_df = orthologs_df.drop(columns=['unique_human_gene_ID', 'unique_human_gene_symbol'])
orthologs_df = orthologs_df.join(orthologs_df.apply(assign_unique_human_fields, axis=1))

# Precompute counts in ExTRI2_df
tf_counts = ExTRI2_df['TF Symbol'].value_counts()
tg_counts = ExTRI2_df['TG Symbol'].value_counts()

# Map counts
orthologs_df['TF_count'] = orthologs_df['gene_symbol'].map(tf_counts).fillna(0).astype(int)
orthologs_df['TG_count'] = orthologs_df['gene_symbol'].map(tg_counts).fillna(0).astype(int)

# Save orthologs with unique human ID either unassigned or assigned through 1st family member rule
m = (orthologs_df['human_gene_ID'].str.contains(';')) & (orthologs_df['unique_human_gene_symbol'].str.upper() != orthologs_df['gene_symbol'].str.upper())

cols_first = ['TaxID', 'Gene_ID', 'gene_symbol', 'unique_human_gene_symbol', 'human_gene_symbol', 'human_gene_ID', 'HGNC_ID']
orthologs_to_check = orthologs_df[m][cols_first + [c for c in orthologs_df.columns if c not in cols_first]].sort_values(by=['gene_symbol', 'TaxID']).reset_index(drop=True)
print(f"Number of orthologs to check: {len(orthologs_to_check)}")

orthologs_to_check.to_csv(config['data_p'] + 'validation/orthologs_to_check.tsv', sep='\t', index=False)

# --- GETTING MANUALLY-CORRECTED ORTHOLOGS ---
# Orthologs have been checked and corrections are in:
orthologs_checked = pd.read_csv(config['data_p'] + 'validation/orthologs_to_check_w_comments.tsv', sep='\t')
# Correct ortholog is in one of these 3 columns:
comment_cols = [
    "complies with the 'first-gene-family'-rule", # We have decided to only take this column into consideration
    # 'cited as human orthologue in NCBI Gene "Summary" for the mouse/rat gene',
    # 'human gene has rodent symbol as alias'
]

# Get the subset of manually corrected orthologs
m = (orthologs_checked["complies with the 'first-gene-family'-rule"].notna())
manually_corrected = orthologs_checked[m].copy().set_index("Gene_ID")

# Assign unique human gene symbol and comment based on the comments columns
manually_corrected["unique_human_gene_symbol"] = manually_corrected[comment_cols].bfill(axis=1).iloc[:, 0]
manually_corrected["comment"] = manually_corrected[comment_cols].apply(
    lambda row: next(
        (col for col in comment_cols if pd.notna(row[col]) and row[col] is not None),
        np.nan), axis=1
)
# Display the manually corrected orthologs
print(f"Manually corrected orthologs: {len(manually_corrected)}")
display(manually_corrected[['gene_symbol', 'unique_human_gene_symbol', "comment"] + comment_cols].head())

# Create a dictionary
manual_orthologs_dict = manually_corrected[["unique_human_gene_symbol", "comment"]].to_dict(orient="index")
# Save as json
with open(config['data_p'] + 'postprocessing/manual_orthologs_corrections.json', 'w') as f:
    json.dump(manual_orthologs_dict, f, indent=4)

Number of orthologs to check: 81
Manually corrected orthologs: 5


Gene_ID,gene_symbol,unique_human_gene_symbol,comment,complies with the 'first-gene-family'-rule
25642,Cyp3a23-3a1,CYP3A4,complies with th...,CYP3A4
53973,Cyp3a41a,CYP3A4,complies with th...,CYP3A4
387151,Mir133a-1,MIR133A1,complies with th...,MIR133A1
68774,Ms4a6d,MS4A6A,complies with th...,MS4A6A
394432,Ugt1a7c,UGT1A7,complies with th...,UGT1A7
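
The first-gene-family-member rule described in the docstring of `get_unique_human_symbol_index` can be illustrated with a much smaller sketch. This is a simplified version written for illustration only: `pick_unique_human_symbol` is a hypothetical helper, it omits the numeric-family preference of the real function, and the symbol lists are toy inputs rather than real ortholog records.

```python
import re

def pick_unique_human_symbol(rodent_symbol, human_symbols):
    """Return the index of the preferred human symbol, or None if none shares the stem."""
    rodent = rodent_symbol.upper()
    humans = [s.upper() for s in human_symbols]

    # 1) Exact case-insensitive match
    if rodent in humans:
        return humans.index(rodent)

    # 2) First family member: keep symbols that share the leading alphabetic stem,
    #    then take the lexicographically smallest one
    stem = re.match(r'[A-Z]+', rodent)
    if stem is None:
        return None
    candidates = [(s, i) for i, s in enumerate(humans) if s.startswith(stem.group(0))]
    if not candidates:
        return None
    return min(candidates)[1]

# Toy inputs (hypothetical candidate lists)
print(pick_unique_human_symbol('Acsm2', ['ACSM2B', 'ACSM2A']))  # no exact match, first family member
print(pick_unique_human_symbol('Smad2', ['SMAD3', 'SMAD2']))    # exact match
print(pick_unique_human_symbol('Abc1', ['XYZ1', 'XYZ2']))       # no shared stem
```

Returning an index rather than a symbol mirrors the notebook's approach, since the same position is then used to pick both the human gene ID and the human gene symbol from their `;`-separated lists.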


In [8]:
## CHECK THE DISTRIBUTION OF TF TYPES PER HUMAN MAPPING
# Check whether any TF_human_symbol has multiple TF types from different rodent orthologs 
agg_funcs = {
    'TF Symbol': lambda x: '|'.join(x.unique()),
    'TF_type': lambda x: '|'.join(x.unique()),
    'TF TaxID': lambda x: '|'.join(x.unique()),
    'TF_human_symbol': lambda x: '|'.join(x.unique()),
}

final_TRI_grouped = ExTRI2_df.groupby('TF_human_Id', as_index=False).agg(agg_funcs)
m = final_TRI_grouped['TF_human_Id'].str.contains(';') | final_TRI_grouped['TF_type'].isin(['coTF', 'dbTF', 'll_coTF']) | (final_TRI_grouped['TF_human_Id'] == 'None')
display(final_TRI_grouped[~m][['TF_type']].value_counts())
display(final_TRI_grouped[~m])
print(f"Rows in ExTRI2 affected: {ExTRI2_df['TF_human_Id'].isin(final_TRI_grouped[~m]['TF_human_Id']).sum()}")
# Clashes in a few sentences. We will exclude them from CollecTRI.

TF_type     
dbTF|coTF       5
coTF|dbTF       1
ll_coTF|coTF    1
Name: count, dtype: int64

Unnamed: 0,TF_human_Id,TF Symbol,TF_type,TF TaxID,TF_human_symbol
53,10196,PRMT3|Prmt3,dbTF|coTF,9606|10090|10116,PRMT3
72,10270,AKAP8|Akap8,dbTF|coTF,9606|10116,AKAP8
608,1810,DR1|Dr1,dbTF|coTF,9606|10090|10116,DR1
1765,434,a|ASIP,ll_coTF|coTF,10090|9606,ASIP
1903,4855,NOTCH4|Notch4,dbTF|coTF,9606|10116,NOTCH4
2380,56252,YLPM1|Ylpm1,coTF|dbTF,9606|10090,YLPM1
2649,63925,Zfp335|ZNF335,dbTF|coTF,10090|9606|10116,ZNF335


Rows in ExTRI2 affected: 322
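
As an alternative way to see the same clashes, the sketch below counts the distinct TF types per human ID on a toy dataframe. The column names follow the ones used in the cell above; the rows and their TF types are invented for illustration and are not taken from ExTRI2.

```python
import pandas as pd

# Toy rows (invented); column names follow the notebook
toy = pd.DataFrame({
    'TF_human_Id': ['10196', '10196', '1810', '7157'],
    'TF Symbol':   ['PRMT3', 'Prmt3', 'DR1', 'TP53'],
    'TF_type':     ['dbTF', 'coTF', 'dbTF', 'dbTF'],
})

# Count distinct TF types per human ID; more than one means a clash
n_types = toy.groupby('TF_human_Id')['TF_type'].nunique()
clashing_ids = set(n_types[n_types > 1].index)

# Rows that belong to a clashing human ID
clash_mask = toy['TF_human_Id'].isin(clashing_ids)
print(sorted(clashing_ids))
print(int(clash_mask.sum()))
```

Unlike the string-joined aggregation above, `nunique` does not depend on the order in which the types appear, so `dbTF|coTF` and `coTF|dbTF` are treated as the same kind of clash.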


### AP1 & NFKB

AP1 and NFKB are dimers, and as such have neither an NCBI Entrez ID nor an HGNC symbol. PubTator normalizes them to one of their monomers. Therefore, in `postprocessing.py`, we
* Find all dimers incorrectly normalized to monomers using regex
* Change the TF metadata to AP1/NFKB. Delete the TG instances (a TG can't be a dimer)
* Save a summary of the results and affected sentences in `data/postprocessing/tables`
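
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `postprocessing.py`: the regular expressions, the `dimer_name` helper and the toy rows are all assumptions, and only the `TF Symbol` column is relabelled here, whereas the real script updates the full TF metadata.

```python
import re
import pandas as pd

# Hypothetical patterns for dimer mentions; the regexes in postprocessing.py may differ
DIMER_PATTERNS = {
    'NFKB': re.compile(r'^NF[-\s]?(?:KAPPA|K)[-\s]?B$', re.IGNORECASE),
    'AP1': re.compile(r'^AP[-\s]?1$', re.IGNORECASE),
}

def dimer_name(mention):
    """Return 'NFKB' or 'AP1' if the mention looks like a dimer name, else None."""
    for name, pattern in DIMER_PATTERNS.items():
        if pattern.match(mention):
            return name
    return None

# Toy rows (invented): mention text plus the monomer symbol it was normalized to
toy = pd.DataFrame({
    'TF': ['NF-kappaB', 'c-Jun', 'AP-1', 'p53'],
    'TF Symbol': ['RELA', 'JUN', 'JUN', 'TP53'],
    'TG': ['IL6', 'AP-1', 'MMP9', 'NF-kB'],
    'TG Symbol': ['IL6', 'FOS', 'MMP9', 'NFKB1'],
})

tf_dimer = toy['TF'].map(dimer_name)
tg_dimer = toy['TG'].map(dimer_name)

# Drop rows whose TG is a dimer, then relabel TFs whose mention is a dimer
fixed = toy[tg_dimer.isna()].copy()
fixed['TF Symbol'] = tf_dimer[fixed.index].fillna(fixed['TF Symbol'])
print(fixed[['TF', 'TF Symbol', 'TG']])
```

Matching on the mention text rather than on the normalized symbol is what distinguishes a true monomer mention such as `c-Jun` from a dimer mention such as `AP-1` that was normalized to `JUN`.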

### Initial exploration

In [None]:
# POSTPROCESSING BEFORE RENORMALISATION & DISCARDING WERE IMPLEMENTED
def half_postprocess(ExTRI2_df: pd.DataFrame, TRI_sents: bool, config: dict) -> pd.DataFrame:
    '''same as postprocess but before the renormalisation & discarding'''

    df_type = 'TRI' if TRI_sents else 'nonTRI'
    print(f'### POSTPROCESSING {df_type}_df')

    # Retrieve Symbol & TaxID from Entrez
    save_Symbol_TaxID_dict(ExTRI2_df, config[f'EntrezID_to_Symbol_{df_type}_p'])

    # Filter & add metadata
    if TRI_sents:
        remove_duplicates(ExTRI2_df)
    ExTRI2_df = add_symbols_TaxID(ExTRI2_df, config[f'EntrezID_to_Symbol_{df_type}_p'])
    add_TF_type(ExTRI2_df, config)
    ExTRI2_df = drop_GTFs(ExTRI2_df)
    ExTRI2_df = remove_other_species(ExTRI2_df, TaxID)

    # Fix AP1 & NFKB normalisations
    ExTRI2_df = fix_NFKB_AP1(ExTRI2_df, config)

    return ExTRI2_df

config = load_config()

# Load raw dataframe
TRI_df = load_preprocess_df(config['raw_TRI_p'])

# Postprocess (without renormalisation/discarding)
TRI_df = half_postprocess(TRI_df, TRI_sents=True,  config=config)

### POSTPROCESSING TRI_df
We got 6706 different TFs and 26196 different TGs from sentences labelled as TRI
Retrieving from Entrez...

4967 sentences are dropped as their TG is not normalised

38287 rows (4.23%) will have its TF renormalized to NFKB
6327 rows (0.70%) will be dropped as the TG corresponds to NFKB
9003 rows (1.00%) will have its TF renormalized to AP1
1858 rows (0.21%) will be dropped as the TG corresponds to AP1
Breakdown by NCBI Symbol saved in ../../data/postprocessing/tables/AP1_NFKB_breakdown.tsv


In [None]:
# GET NUMBER OF UNIQUE ENTITY NAMES
ExTRI2_df = TRI_df

m_AP1 = ExTRI2_df['TF Symbol'].str.contains('|'.join(('FOS', 'JUN')), case=False)

NFKB_symbols = {'NFKB1', 'NFKB2', 'RELA', 'RELB'}
m_NFKB = ExTRI2_df['TF Symbol'].str.upper().isin(NFKB_symbols)

print("AP1 TF Unique Entities:", len(ExTRI2_df[m_AP1]['TF'].unique()))
# print(ExTRI2_df[m_AP1]['TF'].unique())
print("NFKB TF Unique Entities:", len(ExTRI2_df[m_NFKB]['TF'].unique()))
# print(ExTRI2_df[m_NFKB]['TF'].unique())

AP1 TF Unique Entities: 155
NFKB TF Unique Entities: 115


In [None]:
# POST-PROCESSING FUNCTIONS
def print_symbol_counts_side_by_side(m_template, max_counts = 10):
    'Print symbol counts of m_template for TF & TG'
    results = []
    for T in ('TF', 'TG'):
        m = eval(m_template.replace('{T}', T))
        T_lines = TRI_df[m][f'{T} Symbol'].value_counts()[:max_counts].to_string().split('\n')
        results.append(T_lines)

    # Print the two tables side by side
    for tf_line, tg_line in zip(*results):
        print(f"{tf_line:<35} {tg_line}")

def print_dubious_pairs_TFTGcounts_side_by_side(dubious_pairs):
    '''Print counts of TF&TG of 3 different symbols side by side'''
    all_tables = []
    for p in dubious_pairs:
        results = []
        for T in ('TF', 'TG'):
            m = TRI_df[f'{T} Symbol'].isin([';'.join((p[0], p[1])), ';'.join((p[1], p[0]))])
            T_counts = TRI_df[m][f'{T}'].value_counts().rename(f'{T} count')[:10]
            results.append(T_counts)

        # Merge the TF and TG counts on the same index
        merged_df = pd.concat(results, axis=1).fillna(0).astype(int)
        all_tables.append(merged_df)

    # Convert each table to a string and split by lines
    table_strings = [table.to_string().split('\n') for table in all_tables]

    # Use itertools.zip_longest to handle tables with different lengths
    for lines in itertools.zip_longest(*table_strings, fillvalue=''):
        # Print each line of the three tables side by side
        print(f"{lines[0]:<40} {lines[1]:<40} {lines[2]}")

def print_TF_TG_counts_side_by_side(title, m_template, sep=40):
    bold(title)
    counts = []
    for T in ('TF', 'TG'):
        m = eval(m_template)
        T_lines = TRI_df[m][[f'{T}', f'{T} Symbol']].value_counts().to_string().split('\n')
        counts.append(T_lines)
    
    for tf_line, tg_line in itertools.zip_longest(*counts, fillvalue=''):
        print(f"{tf_line:<{sep}} {tg_line}")
    print()

In [None]:
### ENTITIES NORMALIZED TO +1 ID
bold("Entities normalized to +1 ID")
m = (TRI_df['TF Symbol'].str.upper().str.contains(';')) | TRI_df['TG Symbol'].str.upper().str.contains(';')
md(f"{m.sum()} ({m.sum() / len(m):.2%}) entities are normalized to more than 1 ID.<br>We revise those that appear more than 100 times further:")
m_template = "TRI_df['{T} Symbol'].str.contains(';')"
print_symbol_counts_side_by_side(m_template, max_counts=15)

dubious_pairs = [('ABL1', 'BCR'), ('FLI1','EWSR1'), ('MMP2','MMP9')]

md(f'From those, 3 seem suspicious and are investigated further: {", ".join((";".join(p) for p in dubious_pairs))}')

print_dubious_pairs_TFTGcounts_side_by_side(dubious_pairs)
md('''\
Through further manual investigation of the sentences, we have determined that:
* FLI1;EWSR1 & ABL1;BCR are fusion genes. They are correct TFs but must be discarded as TGs.
* TG = MMP9;MMP2 entities indicate that the TF regulates both genes.
   
One TG sentence example for each case (first two will be discarded)
''')

for p in dubious_pairs:
    pairs = [';'.join(p) for p in itertools.permutations(p)]
    print(f"{pairs[0]}:\t", TRI_df[TRI_df['TG Symbol'].isin(pairs)].sample(1)['Sentence'].values[0])

<b>Entities normalized to +1 ID</b>

25644 (2.86%) entities are normalized to more than 1 ID.<br>We revise those that appear more than 100 times further:

TF Symbol                           TG Symbol
MAPK3;MAPK1            5347         MAPK3;MAPK1       1425
Mapk3;Mapk1            2798         Mapk3;Mapk1        618
MAP2K1;MAP2K2           717         MMP2;MMP9          395
SMAD2;SMAD3             481         SMAD2;SMAD3        253
MAPK8;MAPK9             311         Smad2;Smad3        171
Map2k1;Map2k2           303         CDK4;CDK6           96
Smad2;Smad3             222         Mmp2;Mmp9           89
EWSR1;FLI1              216         HSD11B1;RNU1-1      66
ABL1;BCR                169         MIR143;MIR145       66
BCR;ABL1                152         CASP3;CASP7         63
CREBBP;EP300            147         MAP2K1;MAP2K2       57
OIP5-AS1;OIP5;PTGDR     144         NKX2-5;NKX3-1       54
SMAD1;SMAD5;SMAD9       128         CASP3;CASP9         42
MAPK1;MAPK3             111         Ifna;Ifnb1          38
HDAC1;HDAC2             104         EWSR1;FLI1          36


From those, 3 seem suspicious and are investigated further: ABL1;BCR, FLI1;EWSR1, MMP2;MMP9

           TF count  TG count                         TF count  TG count                                                   TF count  TG count
BCR/ABL         126         9            EWS-FLI1          142        16          MMP-2/9                                         0        99
BCR-ABL1        108        18            EWS/FLI1           43         6          MMP-2/-9                                        0        68
BCR/ABL1         34         6            EWS/FLI-1          12        11          MMP2/9                                          0        65
Bcr/Abl          26         4            EWS::FLI1           9         2          MMP-2 and -9                                    0        40
BCR::ABL1        10         3            EWSR1-FLI1          9         0          matrix metalloproteinase-2 and -9               0        16
bcr/abl           8         6            EWSR1::FLI1         7         1          matrix metalloproteinase-2/9                    0        11
BCR::A

Through further manual investigation of the sentences, we have determined that:
* FLI1;EWSR1 & ABL1;BCR are fusion genes. They are correct TFs but must be discarded as TGs.
* TG = MMP9;MMP2 entities indicate that the TF regulates both genes.
   
One TG sentence example for each case (first two will be discarded)


ABL1;BCR:	 In this report, we show evidence that [TF] transcription factor stringently controls the expression of [TG], which can strategically be targeted by our novel RUNX inhibitor, Chb-M'.
FLI1;EWSR1:	 [TF] knockdown showed a reduced cell growth and transcriptional activity of [TG].
MMP2;MMP9:	 Finally, nimbolide suppressed the nuclear translocation of p65/p50 and DNA binding of [TF], which is an important transcription factor for controlling [TG] and VEGF gene expression.
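
The fusion-gene rule stated above (keep as TF, discard as TG) could be expressed as a simple mask. The sketch below is an assumption about how this might look, not the implementation used in the pipeline; the rows are toy data and only the two fusion pairs named above are handled.

```python
import pandas as pd

# Fusion-gene symbol pairs named in the analysis above; both orders occur in the data
FUSION_PAIRS = [('ABL1', 'BCR'), ('FLI1', 'EWSR1')]
fusion_symbols = {';'.join(p) for pair in FUSION_PAIRS for p in (pair, pair[::-1])}

# Toy rows (invented, not real ExTRI2 sentences)
toy = pd.DataFrame({
    'TF Symbol': ['ABL1;BCR', 'RUNX1', 'EWSR1;FLI1', 'RELA'],
    'TG Symbol': ['MYC', 'BCR;ABL1', 'MYC', 'MMP2;MMP9'],
})

# Keep fusion genes as TFs; discard rows where the TG is a fusion gene
is_fusion_tg = toy['TG Symbol'].str.upper().isin(fusion_symbols)
kept = toy[~is_fusion_tg]
print(kept)
```

The `MMP2;MMP9` row is kept because, as noted above, that TG indicates regulation of both genes rather than a fusion.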


### Create sets of sentences to check

In [None]:
sents_to_check_path = config['data_p'] + 'validation/sents_to_check.tsv'
sents_to_check_2_path = config['data_p'] + 'validation/sents_to_check_2.tsv'
sents_to_check_of_path = config['data_p'] + 'validation/sents_to_check_of.tsv'
sents_to_check_CDX_path = config['data_p'] + 'validation/sents_to_check_CDX.tsv'

In [None]:
# PREPARE SENTENCES TO CHECK
def add_to_sents_to_check(sents_to_check: list, m_template: str, issue: str) -> None:
    T = 'TF'
    m = eval(m_template)
    T = 'TG'
    m |= eval(m_template)
    df_m = TRI_df[m].copy()
    df_m['issue'] = issue
    sents_to_check.append(df_m)

sents_to_check = []

t = "Sentences with 'p21' not normalised to 'CDKN1A':"
m_template = "(TRI_df[f'{T}'] == 'p21') & (TRI_df[f'{T} Symbol'].str.upper() != 'CDKN1A')"
print_TF_TG_counts_side_by_side(t, m_template)
add_to_sents_to_check(sents_to_check, m_template, 'p21-CDKN1A')


t = "Sentences with 'p53' not normalised to 'TP53':"
m_template = "(TRI_df[f'{T}'] == 'p53') & (TRI_df[f'{T} Symbol'].str.upper() != 'TP53')"
print_TF_TG_counts_side_by_side(t, m_template)
add_to_sents_to_check(sents_to_check, m_template, 'p53-TP53')

bold(f"Sentences with MDM2-TP53 pairs must be removed: they're always a PPI.")
m = TRI_df['TF Symbol'].str.upper().str.contains('MDM2')
m &= TRI_df['TG Symbol'].str.upper().str.contains('TP53')
print(TRI_df[m][['TF Symbol', 'TG Symbol']].value_counts().to_string(), '\n')


t = "Sentences with 'MET' not normalised to 'MET':"
m_template = "(TRI_df[f'{T}'] == 'MET') & (TRI_df[f'{T} Symbol'].str.upper() != 'MET')"
print_TF_TG_counts_side_by_side(t, m_template)
add_to_sents_to_check(sents_to_check, m_template, 'MET')


t = r"Sentences with 'CD\d' :"
m_template = r"TRI_df[f'{T}'].str.upper().str.contains(r'^CD(?:4|8A|8B|74|34)(?!\d)')"
print_TF_TG_counts_side_by_side(t, m_template)
add_to_sents_to_check(sents_to_check, m_template, 'CD*')

# Joined NCBI IDs to check
t = "Entities normalised to +1 IDs: ABL1;BCR"
m_template = "TRI_df[f'{T} Symbol'] == 'ABL1;BCR'"
print_TF_TG_counts_side_by_side(t, m_template)
add_to_sents_to_check(sents_to_check, m_template, 'ABL1;BCR')


bold(f"\nAutoregulation:")
m = TRI_df['TF Symbol'].str.upper() == TRI_df['TG Symbol'].str.upper()
md(f"{m.sum() / len(TRI_df):.1%} of sentences show autoregulation: TF & TG are the same. Most popular:")
print(TRI_df[m][['TF Symbol', 'TG Symbol']].value_counts()[:4].to_string())

# Autoregulation sentences are likely to contain many false positives. Prepare a set of 300 sentences for validation purposes
df_m = TRI_df[m].sample(n=300)
df_m['issue'] = 'Autoregulation'
sents_to_check.append(df_m)

bold('\nTranslation instead of gene expression')
m = TRI_df['Sentence'].str.lower().str.contains('translat')
print(f"{m.sum()} sentences contain 'translat' in them and should be checked")
df_m = TRI_df[m].sample(n=100)
df_m['issue'] = 'Translate'
sents_to_check.append(df_m)

<b>Sentences with 'p21' not normalised to 'CDKN1A':</b>

TF   TF Symbol                           TG   TG Symbol
p21  Tceal1       14                     p21  H3P16        4542
     TCEAL1        1                          Kras          232
                                              TCEAL1         29
                                              Tceal1         25
                                              Tpt1            1



<b>Sentences with 'p53' not normalised to 'TP53':</b>

TF   TF Symbol                           TG   TG Symbol
p53  Trp53        195                    p53  Trp53-ps     1615
                                              p53-ps        240
                                              Trp53          90



<b>Sentences with MDM2-TP53 pairs must be removed: they're always a PPI.</b>

TF Symbol  TG Symbol
MDM2       TP53         1609
Mdm2       TP53           23
MDM2;MDM4  TP53            3
MDM2       TP53BP2         1 



<b>Sentences with 'MET' not normalised to 'MET':</b>

TF   TF Symbol                           TG   TG Symbol
MET  SLTM         494                    MET  SLTM         392



<b>Sentences with 'CD\d' :</b>

TF            TF Symbol                  TG             TG Symbol
CD34          CD34         1704          CD34           CD34         206
              Cd34          113          CD4            CD4          162
CD34(         Cd34            5                         Cd4           82
              CD34            2          Cd4            CD4           46
CD4CD25FoxP3  FOXP3           2          CD34           Cd34          40
cd34          CD34            2          CD74           CD74          27
CD34+         CD34            1          CD8alpha       Cd8a          17
CD34Exo       CD34            1          Cd4            Cd4           12
CD34LC        CD34            1          Cd74           Cd74           9
CD34brCD38    CD34;CD38       1          CD74           Cd74           8
CD4.Ezh2      Cd4;Ezh2        1          Cd8a           CD8A           5
                                         CD8a           Cd8a           5
                                         CD8alpha       CD

<b>Entities normalised to +1 IDs: ABL1;BCR</b>

TF        TF Symbol                      TG       TG Symbol
BCR/ABL   ABL1;BCR     126               BCR/ABL  ABL1;BCR     9
Bcr/Abl   ABL1;BCR      26               bcr/abl  ABL1;BCR     6
bcr/abl   ABL1;BCR       8               Bcr/Abl  ABL1;BCR     4
BCR::ABL  ABL1;BCR       5               
Bcr/abl   ABL1;BCR       3               
BCR/abl   ABL1;BCR       1               



<b>
Autoregulation:</b>

4.8% of sentences show autoregulation: TF & TG are the same. Most popular:

TF Symbol  TG Symbol
TP53       TP53         1978
VEGFA      VEGFA        1108
EGFR       EGFR          852
MYC        MYC           719


<b>
Translation instead of gene expression</b>

7479 sentences contain 'translat' in them and should be checked
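
The output above states that MDM2-TP53 pairs must be removed. One way this could be done is sketched below on toy rows that reuse the symbol combinations listed in that output. It is an assumption rather than the rule implemented in `postprocessing.py`: in particular, it matches whole symbols within `;`-separated lists, so `TP53BP2` is not treated as `TP53`, whereas the exploratory mask in the cell above uses substring matching.

```python
import pandas as pd

# Toy rows reusing the symbol combinations printed above, plus one reversed pair
toy = pd.DataFrame({
    'TF Symbol': ['MDM2', 'Mdm2', 'MDM2;MDM4', 'MDM2', 'TP53'],
    'TG Symbol': ['TP53', 'TP53', 'TP53', 'TP53BP2', 'MDM2'],
})

def has_symbol(series, symbol):
    """True where `symbol` is one of the ';'-separated symbols (case-insensitive)."""
    return series.str.upper().str.split(';').apply(lambda parts: symbol in parts)

# Rows with MDM2 as TF and TP53 as TG are treated as PPIs and removed
is_mdm2_tp53 = has_symbol(toy['TF Symbol'], 'MDM2') & has_symbol(toy['TG Symbol'], 'TP53')
kept = toy[~is_mdm2_tp53]
print(kept)
```

Whether rodent symbols such as `Trp53` should also be covered is not settled by the output above and would need to be decided separately.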


In [None]:
# CREATE TABLE OF SENTENCES TO CHECK (saved as TSV)
def extract_context(sentence, token='[TF]', window=4, how='both'):
    '''Get the `window` words before and/or after the token (how = 'both', 'left' or 'right')'''
    # Split the sentence by spaces
    words = sentence.split()

    # Find the index of the word that contains '[TF]' or its variations
    index = [i for i, word in enumerate(words) if token in word][0]

    # Extract the 4 words before and after the token, handling boundaries
    start = max(0, index) if how=='right' else max(0, index - window)
    end = min(len(words), index + 1) if how=='left' else min(len(words), index + window + 1)


    # Join the extracted context words back into a string
    return ' '.join(['...'] + words[start:end] + ['...'])

sents_to_check = pd.concat(sents_to_check)

# 'issue' is listed only once to avoid a duplicated column
cols_to_keep = ['issue', 'TF', 'TF Symbol', 'TG', 'TG Symbol', 'Sentence',  '#SentenceID', 
                'TF Id', 'TG Id', 'MoR', 'TF TaxID',  'TG TaxID', 'TF_type']

# Copy so that the new context columns are not added to a view
sents_to_check = sents_to_check[cols_to_keep].copy()

for T in ('TF', 'TG'):
    # Match the bracketed placeholder (e.g. '[TF]') so that other words containing 'TF' are not picked up
    sents_to_check[f'{T}_context'] = sents_to_check['Sentence'].apply(lambda x: extract_context(x, token=f'[{T}]'))
    sents_to_check[f'{T}_left_context'] = sents_to_check['Sentence'].apply(lambda x: extract_context(x, token=f'[{T}]', how='left'))
    sents_to_check[f'{T}_right_context'] = sents_to_check['Sentence'].apply(lambda x: extract_context(x, token=f'[{T}]', how='right'))

sents_to_check.to_csv(sents_to_check_path, sep='\t', index=False)
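
To show what the context columns contain, here is a stripped-down sketch of the windowing idea applied to one of the example sentences shown earlier in this notebook. `context_window` is a hypothetical helper written for this illustration: unlike `extract_context`, it returns `None` when the token is absent and does not add the surrounding `...` markers.

```python
def context_window(sentence, token='[TF]', window=4):
    """Return up to `window` words on each side of the first word containing `token`."""
    words = sentence.split()
    hits = [i for i, word in enumerate(words) if token in word]
    if not hits:
        return None  # token absent: nothing to extract
    i = hits[0]
    return ' '.join(words[max(0, i - window): i + window + 1])

example = "[TF] knockdown showed a reduced cell growth and transcriptional activity of [TG]."
print(context_window(example, token='[TF]'))            # [TF] knockdown showed a reduced
print(context_window(example, token='[TG]', window=2))  # activity of [TG].
```

Because the match is on substrings of whitespace-separated words, a placeholder followed by punctuation (such as `[TG].`) is still found.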

In [None]:
# PREPARE 2nd SET OF SENTENCES TO CHECK
bold(f"\ndbTF Autoregulation:")

# Get a set of 300 sentences to check with autoregulation (dbTF)
# Previous set contained a lot of coTF sentences. We want to check the number of false positives in dbTF-specific autoregulation.
m = TRI_df['TF Symbol'].str.upper() == TRI_df['TG Symbol'].str.upper()
m &= TRI_df['TF_type'] == 'dbTF'
md(f"{m.sum() / len(TRI_df):.1%} of sentences show autoregulation: TF & TG are the same. Most popular:")
print(TRI_df[m][['TF Symbol', 'TG Symbol']].value_counts()[:4].to_string())
df_m = TRI_df[m].sample(n=300)
df_m['issue'] = 'dbTF_autoregulation'

# We will only check dbTF autoregulation
sents_to_check_2 = df_m
tab_cols = ("issue	#SentenceID	TF	TF Symbol	TG	TG Symbol	Sentence	TF Id	TG Id	TF offset	Gene offset	TRI score	Valid	MoR scores	MoR	PMID	PMID+Sent+TRI_Id	TF TaxID	TG TaxID	TF_type")
sents_to_check_2 = sents_to_check_2[tab_cols.split("\t")]

sents_to_check_2.to_csv(sents_to_check_2_path, sep='\t', index=False)
print(f"results saved in {sents_to_check_2_path}")

<b>
dbTF Autoregulation:</b>

2.1% of sentences show autoregulation: TF & TG are the same. Most popular:

TF Symbol  TG Symbol
TP53       TP53         1978
MYC        MYC           719
HIF1A      HIF1A         478
ESR1       ESR1          434
results saved in ../../data/validation/sents_to_check_2.tsv


In [None]:
# PREPARE 3rd SET OF SENTENCES TO CHECK
m = final_TRI_df['TG'] == 'of'
m |= final_TRI_df['TF'] == 'of'
sents_to_check_of = final_TRI_df[m].sort_values(by=['TF', 'TG'])
sents_to_check_of.to_csv(sents_to_check_of_path, sep='\t', index=False)

m = final_TRI_df['TG'].str.upper().str.contains(r"^CD[0-9]")
m &= final_TRI_df['Sentence'].str.contains(r"\[TG\] ?")

display(final_TRI_df[m][['TG', 'TG Symbol']].value_counts())
m.sum()

TG        TG Symbol
CD44      CD44         944
CD133     PROM1        350
CD40      CD40         306
CD86      CD86         232
          Cd86         180
                      ... 
CD40L.    Cd40lg         1
CD41b     ITGA2B         1
CD42a     GP9            1
CD44High  Cd44           1
cd59      CD59           1
Name: count, Length: 491, dtype: int64

9319

In [None]:
m = final_TRI_df['TG'].str.upper().str.contains("^CD[0-9]")
final_TRI_df[m].to_csv(sents_to_check_CDX_path, sep='\t', index=False)