# Postprocessing of the ExTRI2 pipeline results to create the ExTRI2 resource

This notebook was used to determine the rules for renormalising and discarding sentences. Sentences were extracted from the ExTRI2 resource and checked manually to decide how to handle each category.

In [None]:
__import__('sys').path.append('../common/'); __import__('notebook_utils').table_of_contents('postprocessing_checkings.ipynb')

<h3>Table of contents</h3>


[Postprocessing of the ExTRI2 pipeline results to create the ExTRI2 resource](#Postprocessing-of-the-ExTRI2-pipeline-results-to-create-the-ExTRI2-resource)
- [Run Main](#Run-Main)
- [Setup](#Setup)
- [Postprocessing](#Postprocessing)
  - [AP1 & NFKB](#AP1-&-NFKB)
  - [Initial exploration](#Initial-exploration)
  - [Sentences to check](#Sentences-to-check)

This notebook will now only be used for the normalisation of the results.

## Run postprocessing.py
Self-contained cell to run postprocessing.py

In [1]:
import sys
sys.path.append('../common')
sys.path.append('../../')
from scripts.postprocessing.postprocessing import *
main()

### POSTPROCESSING valid_df
We got 6706 different TFs and 26196 different TGs from our valid sentences
Retrieving from Entrez...

4967 sentences are dropped as their TG is not normalised

38287 rows (4.23%) will have its TF renormalized to NFKB
6330 rows (0.70%) will be dropped as the TG corresponds to NFKB
9003 rows (1.00%) will have its TF renormalized to AP1
1858 rows (0.21%) will be dropped as the TG corresponds to AP1
Breakdown by NCBI Symbol saved in ../../data/postprocessing/tables/AP1_NFKB_breakdown.tsv
Number of renormalized sentences and normalization:
4829	0.54%	p21 is normalized to CDKN1A
1921	0.21%	p53-ps is normalized to its respective p53 symbol

Number of discarded sentences and percentage from total (896863 sentences) and reasoning:
2581	0.29%	Their TF contains -AS[1-3]
673	0.08%	Their TF are circRNAs
952	0.11%	Their TF (NLRP3) is followed by inflammasome but normalised to NLRP3
1876	0.21%	Their TG (NLRP3) is followed by inflammasome but normalised to NLRP3
82	0.01%	Th

## Setup

In [6]:
import pandas as pd
import numpy as np
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import itertools
import re

## Custom functions
import sys

sys.path.append('../common')
sys.path.append('../../')

from notebook_utils import table_of_contents, table_from_dict, h3, h4, h5, md, bold
from renormalisations import *
from postprocessing import *
pd.set_option('display.max_colwidth', 20)

In [16]:
# Checks on the processed final valid df
config = load_config()
f_valid_df = load_df(config['final_ExTRI2_p'])

## Postprocessing

### AP1 & NFKB

AP1 and NFKB are dimers and therefore have neither an NCBI EntrezID nor an HGNC symbol. PubTator normalizes them to one of their monomers. Therefore, in `postprocessing.py`, we
* Find all dimers incorrectly normalized to monomers using regex
* Change the TF metadata to AP1/NFKB. Delete the TG instances (a TG can't be a dimer)
* Save a summary of the results and affected sentences in `data/postprocessing/tables`
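
The steps above can be sketched as follows. This is a minimal illustration, not the code in `postprocessing.py`: the regex patterns, the helper name `fix_dimers`, and the toy rows are assumptions. Only the column names `TF`, `TF Symbol`, `TG` and `TG Symbol` follow the dataframes used in this notebook.

```python
import pandas as pd

# Hypothetical patterns for dimer mentions (the real ones live in postprocessing.py)
DIMER_PATTERNS = {
    'NFKB': r'^NF[- ]?(?:KAPPA|K)[- ]?B$',
    'AP1': r'^AP[- ]?1$',
}

def fix_dimers(df: pd.DataFrame) -> pd.DataFrame:
    '''Renormalise dimer TFs and drop rows whose TG is a dimer (illustrative only).'''
    df = df.copy()
    for dimer, pattern in DIMER_PATTERNS.items():
        # A TG can't be a dimer: drop those rows
        tg_is_dimer = df['TG'].str.match(pattern, case=False, na=False)
        df = df[~tg_is_dimer].copy()
        # Replace the monomer symbol with the dimer name
        tf_is_dimer = df['TF'].str.match(pattern, case=False, na=False)
        df.loc[tf_is_dimer, 'TF Symbol'] = dimer
    return df

# Toy rows, made up for the example
toy = pd.DataFrame({
    'TF': ['NF-kappaB', 'AP-1', 'p53', 'STAT3'],
    'TF Symbol': ['NFKB1', 'JUN', 'TP53', 'STAT3'],
    'TG': ['IL6', 'MMP9', 'NF-kB', 'MYC'],
    'TG Symbol': ['IL6', 'MMP9', 'RELA', 'MYC'],
})
fixed = fix_dimers(toy)
print(fixed[['TF', 'TF Symbol', 'TG']])
```

In this toy case the row whose TG is `NF-kB` is dropped, and the two dimer TFs get the symbols `NFKB` and `AP1`.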

### Initial exploration

In [3]:
# POSTPROCESSING BEFORE RENORMALISATION & DISCARDING WERE IMPLEMENTED
def half_postprocess(ExTRI2_df: pd.DataFrame, valid_sents: bool, config: dict) -> pd.DataFrame:
    '''same as postprocess but before the renormalisation & discarding'''

    df_type = 'valid' if valid_sents else 'nonvalid'
    print(f'### POSTPROCESSING {df_type}_df')

    # Retrieve Symbol & TaxID from Entrez
    save_Symbol_TaxID_dict(ExTRI2_df, config[f'EntrezID_to_Symbol_{df_type}_p'])

    # Filter & add metadata
    if valid_sents:
        remove_duplicates(ExTRI2_df)
    ExTRI2_df = add_symbols_TaxID(ExTRI2_df, config[f'EntrezID_to_Symbol_{df_type}_p'])
    add_TF_type(ExTRI2_df, config)
    ExTRI2_df = drop_GTFs(ExTRI2_df)
    ExTRI2_df = remove_other_species(ExTRI2_df, TaxID)

    # Fix AP1 & NFKB normalisations
    ExTRI2_df = fix_NFKB_AP1(ExTRI2_df, config)

    return ExTRI2_df

config = load_config()

# Load raw dataframe
valid_df = load_preprocess_df(config['raw_valid_p'])

# Postprocess (without renormalisation/discarding)
valid_df = half_postprocess(valid_df, valid_sents=True,  config=config)

### POSTPROCESSING valid_df
We got 6706 different TFs and 26196 different TGs from our valid sentences
Retrieving from Entrez...

4967 sentences are dropped as their TG is not normalised

38287 rows (4.23%) will have its TF renormalized to NFKB
6330 rows (0.70%) will be dropped as the TG corresponds to NFKB
9003 rows (1.00%) will have its TF renormalized to AP1
1858 rows (0.21%) will be dropped as the TG corresponds to AP1
Breakdown by NCBI Symbol saved in ../../data/postprocessing/tables/AP1_NFKB_breakdown.tsv


In [4]:
# POST-PROCESSING FUNCTIONS
def print_symbol_counts_side_by_side(m_template, max_counts = 10):
    'Print symbol counts of m_template for TF & TG'
    results = []
    for T in ('TF', 'TG'):
        m = eval(m_template.replace('{T}', T))
        T_lines = valid_df[m][f'{T} Symbol'].value_counts()[:max_counts].to_string().split('\n')
        results.append(T_lines)

    # Print the two tables side by side
    for tf_line, tg_line in zip(*results):
        print(f"{tf_line:<35} {tg_line}")

def print_dubious_pairs_TFTGcounts_side_by_side(dubious_pairs):
    '''Print counts of TF&TG of 3 different symbols side by side'''
    all_tables = []
    for p in dubious_pairs:
        results = []
        for T in ('TF', 'TG'):
            m = valid_df[f'{T} Symbol'].isin([';'.join((p[0], p[1])), ';'.join((p[1], p[0]))])
            T_counts = valid_df[m][f'{T}'].value_counts().rename(f'{T} count')[:10]
            results.append(T_counts)

        # Merge the TF and TG counts on the same index
        merged_df = pd.concat(results, axis=1).fillna(0).astype(int)
        all_tables.append(merged_df)

    # Convert each table to a string and split by lines
    table_strings = [table.to_string().split('\n') for table in all_tables]

    # Use itertools.zip_longest to handle tables with different lengths
    for lines in itertools.zip_longest(*table_strings, fillvalue=''):
        # Print each line of the three tables side by side
        print(f"{lines[0]:<40} {lines[1]:<40} {lines[2]}")

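def print_TF_TG_counts_side_by_side(title, m_template, sep=40):
    '''Print (mention, symbol) counts for TF & TG side by side. m_template is eval'd once per T, so it must reference T.'''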
    bold(title)
    counts = []
    for T in ('TF', 'TG'):
        m = eval(m_template)
        T_lines = valid_df[m][[f'{T}', f'{T} Symbol']].value_counts().to_string().split('\n')
        counts.append(T_lines)
    
    for tf_line, tg_line in itertools.zip_longest(*counts, fillvalue=''):
        print(f"{tf_line:<{sep}} {tg_line}")
    print()

In [7]:
### ENTITIES NORMALIZED TO +1 ID
bold("Entities normalized to +1 ID")
m = (valid_df['TF Symbol'].str.upper().str.contains(';')) | valid_df['TG Symbol'].str.upper().str.contains(';')
md(f"{m.sum()} ({m.sum() / len(m):.0%}) entities are normalized to more than 1 ID.<br>We revise those that appear more than 100 times further:")
m_template = "valid_df['{T} Symbol'].str.contains(';')"
print_symbol_counts_side_by_side(m_template, max_counts=15)

dubious_pairs = [('ABL1', 'BCR'), ('FLI1','EWSR1'), ('MMP2','MMP9')]

md(f'From those, 3 seem suspicious and are investigated further: {", ".join((";".join(p) for p in dubious_pairs))}')

print_dubious_pairs_TFTGcounts_side_by_side(dubious_pairs)
md('''\
Through further manual investigation of the sentences, we have determined that:
* FLI1;EWSR1 & ABL1;BCR are fusion genes. They are correct TFs but must be discarded as TGs.
* TG = MMP9;MMP2 entities indicate that the TF regulates both genes.
   
One TG sentence example for each case (first two will be discarded)
''')

for p in dubious_pairs:
    pairs = [';'.join(p) for p in itertools.permutations(p)]
    print(f"{pairs[0]}:\t", valid_df[valid_df['TG Symbol'].isin(pairs)].sample(1)['Sentence'].values[0])

<b>Entities normalized to +1 ID</b>

25645 (3%) entities are normalized to more than 1 ID.<br>We revise those that appear more than 100 times further:

TF Symbol                           TG Symbol
MAPK3;MAPK1            5347         MAPK3;MAPK1       1425
Mapk3;Mapk1            2798         Mapk3;Mapk1        618
MAP2K1;MAP2K2           717         MMP2;MMP9          395
SMAD2;SMAD3             481         SMAD2;SMAD3        253
MAPK8;MAPK9             311         Smad2;Smad3        171
Map2k1;Map2k2           303         CDK4;CDK6           96
Smad2;Smad3             222         Mmp2;Mmp9           89
EWSR1;FLI1              216         HSD11B1;RNU1-1      66
ABL1;BCR                169         MIR143;MIR145       66
BCR;ABL1                152         CASP3;CASP7         63
CREBBP;EP300            147         MAP2K1;MAP2K2       57
OIP5-AS1;OIP5;PTGDR     144         NKX2-5;NKX3-1       54
SMAD1;SMAD5;SMAD9       128         CASP3;CASP9         42
MAPK1;MAPK3             111         Ifna;Ifnb1          38
HDAC1;HDAC2             104         EWSR1;FLI1          36


From those, 3 seem suspicious and are investigated further: ABL1;BCR, FLI1;EWSR1, MMP2;MMP9

           TF count  TG count                         TF count  TG count                                                   TF count  TG count
BCR/ABL         126         9            EWS-FLI1          142        16          MMP-2/9                                         0        99
BCR-ABL1        108        18            EWS/FLI1           43         6          MMP-2/-9                                        0        68
BCR/ABL1         34         6            EWS/FLI-1          12        11          MMP2/9                                          0        65
Bcr/Abl          26         4            EWS::FLI1           9         2          MMP-2 and -9                                    0        40
BCR::ABL1        10         3            EWSR1-FLI1          9         0          matrix metalloproteinase-2 and -9               0        16
bcr/abl           8         6            EWSR1::FLI1         7         1          matrix metalloproteinase-2/9                    0        11
BCR::A

Through further manual investigation of the sentences, we have determined that:
* FLI1;EWSR1 & ABL1;BCR are fusion genes. They are correct TFs but must be discarded as TGs.
* TG = MMP9;MMP2 entities indicate that the TF regulates both genes.
   
One TG sentence example for each case (first two will be discarded)


ABL1;BCR:	 These findings suggest that interfere PI3K/[TF] signal pathway via down-regulating the expression of [TG] mRNA is implicated in the effect of 2-ME2 on K562 cells.
FLI1;EWSR1:	 Disrupted splicing of the [TF] transcript alters [TG] protein expression and EWS-FLI1-driven expression.
MMP2;MMP9:	 Taken together, our data suggest that SHQA inhibit TNFalpha-induced [TG] expression and age-related inflammation by suppressing AP-1 and NF-kappaB pathway via [TF].
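
A minimal sketch of how the fusion-gene rule could be applied. The constant `FUSION_SYMBOLS`, the helpers `is_fusion` and `drop_fusion_TGs`, and the toy rows are hypothetical. Symbol order is ignored because both `ABL1;BCR` and `BCR;ABL1` occur in the counts above.

```python
import pandas as pd

# Fusion genes identified above, stored order-independently (hypothetical constant)
FUSION_SYMBOLS = {frozenset(('ABL1', 'BCR')), frozenset(('EWSR1', 'FLI1'))}

def is_fusion(symbol: str) -> bool:
    '''True if a ';'-joined symbol matches a known fusion gene, in any order.'''
    return frozenset(symbol.upper().split(';')) in FUSION_SYMBOLS

def drop_fusion_TGs(df: pd.DataFrame) -> pd.DataFrame:
    '''Keep fusion genes as TFs but discard rows where they are the TG.'''
    return df[~df['TG Symbol'].map(is_fusion)].copy()

# Toy rows, made up for the example
toy = pd.DataFrame({
    'TF Symbol': ['BCR;ABL1', 'STAT5A', 'SP1', 'EWSR1;FLI1'],
    'TG Symbol': ['MYC', 'ABL1;BCR', 'MMP2;MMP9', 'NR0B1'],
})
kept = drop_fusion_TGs(toy)
print(kept)
```

Rows with a fusion gene as TF are kept, as is the `MMP2;MMP9` TG, which denotes two separately regulated genes rather than a fusion.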


### Sentences to check

In [8]:
sents_to_check_path = config['data_p'] + 'validation/sents_to_check.tsv'
sents_to_check_2_path = config['data_p'] + 'validation/sents_to_check_2.tsv'
sents_to_check_of_path = config['data_p'] + 'validation/sents_to_check_of.tsv'
sents_to_check_CDX_path = config['data_p'] + 'validation/sents_to_check_CDX.tsv'

In [11]:
# PREPARE SENTENCES TO CHECK
def add_to_sents_to_check(sents_to_check: list, m_template: str, issue: str) -> None:
    '''Append to sents_to_check (in place) the rows where m_template holds for the TF or the TG'''
    T = 'TF'
    m = eval(m_template)
    T = 'TG'
    m |= eval(m_template)
    df_m = valid_df[m].copy()
    df_m['issue'] = issue
    sents_to_check.append(df_m)

sents_to_check = []

t = "Sentences with 'p21' not normalised to 'CDKN1A':"
m_template = "(valid_df[f'{T}'] == 'p21') & (valid_df[f'{T} Symbol'].str.upper() != 'CDKN1A')"
print_TF_TG_counts_side_by_side(t, m_template)
add_to_sents_to_check(sents_to_check, m_template, 'p21-CDKN1A')


t = "Sentences with 'p53' not normalised to 'TP53':"
m_template = "(valid_df[f'{T}'] == 'p53') & (valid_df[f'{T} Symbol'].str.upper() != 'TP53')"
print_TF_TG_counts_side_by_side(t, m_template)
add_to_sents_to_check(sents_to_check, m_template, 'p53-TP53')

bold(f"Sentences with MDM2-TP53 pairs must be removed: they're always a PPI.")
m = valid_df['TF Symbol'].str.upper().str.contains('MDM2')
m &= valid_df['TG Symbol'].str.upper().str.contains('TP53')
print(valid_df[m][['TF Symbol', 'TG Symbol']].value_counts().to_string(), '\n')


t = "Sentences with 'MET' not normalised to 'MET':"
m_template = "(valid_df[f'{T}'] == 'MET') & (valid_df[f'{T} Symbol'].str.upper() != 'MET')"
print_TF_TG_counts_side_by_side(t, m_template)
add_to_sents_to_check(sents_to_check, m_template, 'MET')


t = r"Sentences with 'CD\d' :"
m_template = r"valid_df[f'{T}'].str.upper().str.contains(r'^CD(?:4|8A|8B|74|34)(?!\d)')"
print_TF_TG_counts_side_by_side(t, m_template)
add_to_sents_to_check(sents_to_check, m_template, 'CD*')

# Joined NCBI IDs to check
t = "Entities normalised to +1 IDs: ABL1;BCR"
m_template = "valid_df[f'{T} Symbol'] == 'ABL1;BCR'"
print_TF_TG_counts_side_by_side(t, m_template)
add_to_sents_to_check(sents_to_check, m_template, 'ABL1;BCR')


bold(f"\nAutoregulation:")
m = valid_df['TF Symbol'].str.upper() == valid_df['TG Symbol'].str.upper()
md(f"{m.sum() / len(valid_df):.1%} of sentences show autoregulation: TF & TG are the same. Most popular:")
print(valid_df[m][['TF Symbol', 'TG Symbol']].value_counts()[:4].to_string())

# Those are potentially commonly wrong. Prepare a set of 300 sentences for validation purposes
df_m = valid_df[m].sample(n=300)
df_m['issue'] = 'Autoregulation'
sents_to_check.append(df_m)

bold('\nTranslation instead of gene expression')
m = valid_df['Sentence'].str.lower().str.contains('translat')
print(f"{m.sum()} sentences contain 'translat' in them and should be checked")
df_m = valid_df[m].sample(n=100)
df_m['issue'] = 'Translate'
sents_to_check.append(df_m)

<b>Sentences with 'p21' not normalised to 'CDKN1A':</b>

TF   TF Symbol                           TG   TG Symbol
p21  Tceal1       14                     p21  H3P16        4542
     TCEAL1        1                          Kras          232
                                              TCEAL1         29
                                              Tceal1         25
                                              Tpt1            1



<b>Sentences with 'p53' not normalised to 'TP53':</b>

TF   TF Symbol                           TG   TG Symbol
p53  Trp53        195                    p53  Trp53-ps     1615
                                              p53-ps        240
                                              Trp53          90



<b>Sentences with MDM2-TP53 pairs must be removed: they're always a PPI.</b>

TF Symbol  TG Symbol
MDM2       TP53         1609
Mdm2       TP53           23
MDM2;MDM4  TP53            3
MDM2       TP53BP2         1 



<b>Sentences with 'MET' not normalised to 'MET':</b>

TF   TF Symbol                           TG   TG Symbol
MET  SLTM         494                    MET  SLTM         392



<b>Sentences with 'CD\d' :</b>

TF            TF Symbol                  TG             TG Symbol
CD34          CD34         1704          CD34           CD34         206
              Cd34          113          CD4            CD4          162
CD34(         Cd34            5                         Cd4           82
              CD34            2          Cd4            CD4           46
cd34          CD34            2          CD34           Cd34          40
CD4CD25FoxP3  FOXP3           2          CD74           CD74          27
CD34+         CD34            1          CD8alpha       Cd8a          17
CD34LC        CD34            1          Cd4            Cd4           12
CD34Exo       CD34            1          Cd74           Cd74           9
CD4.Ezh2      Cd4;Ezh2        1          CD74           Cd74           8
CD34brCD38    CD34;CD38       1          Cd8a           CD8A           5
                                         CD8a           Cd8a           5
                                         CD8alpha       CD

<b>Entities normalised to +1 IDs: ABL1;BCR</b>

TF        TF Symbol                      TG       TG Symbol
BCR/ABL   ABL1;BCR     126               BCR/ABL  ABL1;BCR     9
Bcr/Abl   ABL1;BCR      26               bcr/abl  ABL1;BCR     6
bcr/abl   ABL1;BCR       8               Bcr/Abl  ABL1;BCR     4
BCR::ABL  ABL1;BCR       5               
Bcr/abl   ABL1;BCR       3               
BCR/abl   ABL1;BCR       1               



<b>
Autoregulation:</b>

4.8% of sentences show autoregulation: TF & TG are the same. Most popular:

TF Symbol  TG Symbol
TP53       TP53         1978
VEGFA      VEGFA        1108
EGFR       EGFR          852
MYC        MYC           719


<b>
Translation instead of gene expression</b>

7479 sentences contain 'translat' in them and should be checked
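
The MDM2-TP53 table above includes one `TP53BP2` row, because `str.contains('TP53')` matches any symbol containing that substring. If only exact symbols are wanted, one option is to split multi-ID symbols on `;` and compare whole tokens. The sketch below assumes that stricter behaviour is desired. The helper `has_symbol` and the toy rows are hypothetical.

```python
import pandas as pd

def has_symbol(symbols: pd.Series, target: str) -> pd.Series:
    '''True where `target` is one of the ';'-separated symbols (case-insensitive).'''
    parts = symbols.str.upper().str.split(';')
    return parts.map(lambda p: target in p).astype(bool)

# Toy rows mirroring the symbol combinations in the table above
toy = pd.DataFrame({
    'TF Symbol': ['MDM2', 'Mdm2', 'MDM2;MDM4', 'MDM2', 'TP53'],
    'TG Symbol': ['TP53', 'TP53', 'TP53', 'TP53BP2', 'MDM2'],
})

# Rows to discard: MDM2 as TF and TP53 as TG
ppi = has_symbol(toy['TF Symbol'], 'MDM2') & has_symbol(toy['TG Symbol'], 'TP53')
print(toy[~ppi])
```

With whole-token matching, the `TP53BP2` row is no longer flagged, while `Mdm2` and `MDM2;MDM4` still are.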


In [12]:
# CREATE EXCEL
def extract_context(sentence, token='[TF]', window=4, how='both'):
    '''Return up to `window` words before and/or after the first word containing `token` (how = 'both', 'left' or 'right')'''
    # Split the sentence by spaces
    words = sentence.split()

    # Find the index of the word that contains '[TF]' or its variations
    index = [i for i, word in enumerate(words) if token in word][0]

    # Extract the 4 words before and after the token, handling boundaries
    start = max(0, index) if how=='right' else max(0, index - window)
    end = min(len(words), index + 1) if how=='left' else min(len(words), index + window + 1)


    # Join the extracted context words back into a string
    return ' '.join(['...'] + words[start:end] + ['...'])

sents_to_check = pd.concat(sents_to_check)

cols_to_keep = ['issue', 'TF', 'TF Symbol', 'TG', 'TG Symbol', 'Sentence',  '#SentenceID',
                'TF Id', 'TG Id', 'MoR', 'TF TaxID',  'TG TaxID', 'TF_type']

sents_to_check = sents_to_check[cols_to_keep]

for T in ('TF', 'TG'):
    # Search for the masked token '[TF]' / '[TG]', not for any word containing 'TF' / 'TG'
    sents_to_check[f'{T}_context'] = sents_to_check['Sentence'].apply(lambda x: extract_context(x, token=f'[{T}]'))
    sents_to_check[f'{T}_left_context'] = sents_to_check['Sentence'].apply(lambda x: extract_context(x, token=f'[{T}]', how='left'))
    sents_to_check[f'{T}_right_context'] = sents_to_check['Sentence'].apply(lambda x: extract_context(x, token=f'[{T}]', how='right'))

sents_to_check.to_csv(sents_to_check_path, sep='\t', index=False)
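
To illustrate what the context columns contain, here is a self-contained variant of the helper applied to one of the example sentences shown earlier. It is a sketch rather than the notebook's function: the name `context_window` is hypothetical, and returning an empty string when the token is absent (instead of raising) is an assumed choice.

```python
def context_window(sentence: str, token: str = '[TF]', window: int = 4, how: str = 'both') -> str:
    '''Return the words around the first word containing `token`.'''
    words = sentence.split()
    hits = [i for i, word in enumerate(words) if token in word]
    if not hits:
        return ''  # assumed behaviour for sentences without the token
    index = hits[0]
    # 'right' starts at the token, 'left' ends at it, 'both' spans both sides
    start = index if how == 'right' else max(0, index - window)
    end = index + 1 if how == 'left' else min(len(words), index + window + 1)
    return ' '.join(['...'] + words[start:end] + ['...'])

sentence = ('Disrupted splicing of the [TF] transcript alters [TG] protein '
            'expression and EWS-FLI1-driven expression.')
both = context_window(sentence, token='[TF]')
left = context_window(sentence, token='[TG]', how='left')
print(both)  # ... Disrupted splicing of the [TF] transcript alters [TG] protein ...
print(left)  # ... the [TF] transcript alters [TG] ...
```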

In [19]:
# PREPARE 2nd SET OF SENTENCES TO CHECK
bold(f"\ndbTF Autoregulation:")

# Get a set of 300 sentences to check with autoregulation (dbTF)
# Previous set contained a lot of coTF sentences. We want to check the number of false positives in dbTF-specific autoregulation.
m = valid_df['TF Symbol'].str.upper() == valid_df['TG Symbol'].str.upper()
m &= valid_df['TF_type'] == 'dbTF'
md(f"{m.sum() / len(valid_df):.1%} of sentences show autoregulation: TF & TG are the same. Most popular:")
print(valid_df[m][['TF Symbol', 'TG Symbol']].value_counts()[:4].to_string())
df_m = valid_df[m].sample(n=300)
df_m['issue'] = 'dbTF_autoregulation'

# We will only check dbTF autoregulation
sents_to_check_2 = df_m
tab_cols = ("issue	#SentenceID	TF	TF Symbol	TG	TG Symbol	Sentence	TF Id	TG Id	TF offset	Gene offset	Mutated Genes	Mutation offsets	Valid score	Valid	MoR scores	MoR	PMID	PMID+Sent+TRI_Id	Mutated_TF	TF TaxID	TG TaxID	TF_type")
sents_to_check_2 = sents_to_check_2[tab_cols.split("\t")]

sents_to_check_2.to_csv(sents_to_check_2_path, sep='\t', index=False)
print(f"results saved in {sents_to_check_2_path}")

<b>
dbTF Autoregulation:</b>

2.2% of sentences show autoregulation: TF & TG are the same. Most popular:

TF Symbol  TG Symbol
TP53       TP53         1978
MYC        MYC           719
HIF1A      HIF1A         478
ESR1       ESR1          434
results saved in ../../data/validation/sents_to_check_2.tsv


In [20]:
# PREPARE 3rd SET OF SENTENCES TO CHECK
m = f_valid_df['TG'] == 'of'
m |= f_valid_df['TF'] == 'of'
sents_to_check_of = f_valid_df[m].sort_values(by=['TF', 'TG'])
sents_to_check_of.to_csv(sents_to_check_of_path, sep='\t', index=False)

m = f_valid_df['TG'].str.upper().str.contains("^CD[0-9]")
m &= f_valid_df['Sentence'].str.contains(r"\[TG\] ?")

display(f_valid_df[m][['TG', 'TG Symbol']].value_counts())
m.sum()

TG     TG Symbol
CD44   CD44         944
CD133  PROM1        350
CD40   CD40         306
CD86   CD86         232
       Cd86         180
                   ... 
cd22   CD22           1
cd274  Cd274          1
CD1    Cd1d1          1
cd5    CD5            1
cd59   CD59           1
Name: count, Length: 491, dtype: int64

np.int64(9321)

In [21]:
m = f_valid_df['TG'].str.upper().str.contains("^CD[0-9]")
f_valid_df[m].to_csv(sents_to_check_CDX_path, sep='\t', index=False)