# Update raw data and NTNU dataset

This notebook modifies `original_tri_sentences.tsv` to obtain `tri_sentences.tsv` with the updated annotations, including:
* The re-annotations
* Negation sentences from the extended NTNU dataset

After running this notebook, `make_train_data.ipynb` must also be run, which converts the created `tri_sentences.tsv` into the datasets used as input for training the TRI and MoR classifiers.

In [142]:
__import__('sys').path.append('../common/'); __import__('notebook_utils').table_of_contents('update_tri_sentences.ipynb')

<h3>Table of contents</h3>


[Update raw data and NTNU dataset](#Update-raw-data-and-NTNU-dataset)
- [Setup](#Setup)
- [Clean the dataset](#Clean-the-dataset)
  - [Modify joined sentences](#Modify-joined-sentences)
- [Reannotations](#Reannotations)
- [Save the dataset](#Save-the-dataset)
- [Deprecated - Add negative sentences](#Deprecated---Add-negative-sentences)
  - [Enhance with negations from the extended NTNU dataset](#Enhance-with-negations-from-the-extended-NTNU-dataset)

## Setup

In [143]:
# Imports
from IPython.display import display, HTML, display_html
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold
import matplotlib.pyplot as plt
import spacy
from tqdm import tqdm

# My functions
import sys
sys.path.append('../common/') 
from analysis_utils import prettify_plots
from notebook_utils import table_from_dict, md, h3, h4, highlight_words

prettify_plots()

# %pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_md-0.5.4.tar.gz

In [144]:
# PATHS  & LOAD DATASETS
# Inputs
data_path = '../../data/external/'
original_data_path = data_path + 'original_tri_sentences.tsv'
NTNU_extended_path = data_path + 'NTNU_extended.tsv'

dataset_improvement_path = '../../data/dataset_improvement/'
iter_1_2_path      = dataset_improvement_path + 'to_reannotate/iter1_iter2_worst_predictions_validated.txt'
iter_3_path        = dataset_improvement_path + 'to_reannotate/iter3_worst_preds_AL_ver2.txt'
negations_to_add_path = dataset_improvement_path + 'negations_NTNU.tsv'

# Load datasets
original_data   = pd.read_csv(original_data_path, sep='\t', header=1, index_col=None, dtype='str')
iter_1_2        = pd.read_csv(iter_1_2_path, sep='\t', header=0, index_col=None, dtype='str')
iter_3          = pd.read_csv(iter_3_path, sep='\t', header=0, index_col=None, dtype='str')
NTNU_extended   = pd.read_csv(NTNU_extended_path, sep='\t', header=1, index_col=None, dtype='str')
negations_to_add   = pd.read_csv(negations_to_add_path, sep='\t', header=0, index_col=None, dtype='str')


# Outputs
splitted_sentences_path = dataset_improvement_path + 'to_reannotate/splitted_sentences_ids.txt'

classifiers_data_path = '../../classifiers_training/data/'
training_data_path  = classifiers_data_path + 'tri_sentences.tsv'
NTNU_dataset_path = '../../results/NTNU_dataset.tsv'

## Clean the dataset

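There are a few duplicated sentences: same PMID and sentence number, but different TF or TG ID. Some also differ in their TRI/MoR annotations. Duplicates with identical annotations are reduced to their first occurrence; duplicates with conflicting annotations are all removed.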

In [145]:
# SHOW NUMBER OF DUPLICATE ROWS
unique_rows = original_data['Sentence'].unique()
duplicated_rows = original_data[original_data['Sentence'].duplicated(keep=False)]
unique_duplicated_rows = duplicated_rows.drop_duplicates(subset='Sentence')
duplicates_with_diff_label_or_type = original_data.groupby('Sentence').filter(lambda x: x['Label'].nunique() > 1 or x['Type'].nunique() > 1)

table_from_dict('Information on duplicated rows', {
    'Total number of sentences': len(original_data),
    'Unique sentences': len(unique_rows),
    'Duplicated sentences': f'{len(duplicated_rows)} ({round(len(duplicated_rows)/len(original_data)*100,1)}%)',
    '.. where TRI or MoR differs': f'{len(duplicates_with_diff_label_or_type)} ({round(len(duplicates_with_diff_label_or_type)/len(original_data)*100,1)}%)',
    'Unique duplicated sentences': len(unique_duplicated_rows),
    '.. where TRI or MOR differs': len(duplicates_with_diff_label_or_type['Sentence'].unique()),
}, heading='h4')

md('Two sets of duplicated sentences where either Label or Type differ:')
display(duplicates_with_diff_label_or_type.sort_values(by='Sentence').head(4))

0,1
Total number of sentences,22135
Unique sentences,21763
Duplicated sentences,735 (3.3%)
.. where TRI or MoR differs,278 (1.3%)
Unique duplicated sentences,363
.. where TRI or MOR differs,137


Two sets of duplicated sentences where either Label or Type differ:

Unnamed: 0,#TRI ID,Sentence,TF,TG,Label,Type
20620,21056029:3:NFKB:CCND1,"-catenin activated transcription from the [TG] promoter, while co-expression of NF-B [TF] reduced -catenin-induced transcription.",p65,cyclin D1,False,
10925,21056029:3:RELA:CCND1,"-catenin activated transcription from the [TG] promoter, while co-expression of NF-B [TF] reduced -catenin-induced transcription.",p65,cyclin D1,True,REPRESSION
12854,21720709:0:TCF7L2:AKT1,-catenin/[TF] complex transcriptionally regulates [TG] in glioma.,Tcf-4,AKT1,True,UNDEFINED
19541,21720709:0:TCF4:AKT1,-catenin/[TF] complex transcriptionally regulates [TG] in glioma.,Tcf-4,AKT1,False,


In [146]:
# DROP DUPLICATES
# Drop duplicates in which Sentence, TRI and MoR are the same
m1 = original_data.duplicated(subset=['Sentence', 'Label', 'Type'], keep=False)
m2 = original_data.duplicated(subset=['Sentence', 'Label', 'Type'], keep='first')
md(f"{m1.sum()} sentences are duplicated but have the same Label & MoR. Only the first is kept ({m2.sum()} are dropped).")
original_data = original_data.drop_duplicates(subset=['Sentence', 'Label', 'Type'], keep='first')

# Drop rows where Sentence is duplicated, but Label and/or Type differ
m = original_data.duplicated(subset=['Sentence'], keep=False)
md(f"{m.sum()} sentences are duplicated and differ on their Label or MoR. They're all dropped.")
original_data = original_data[~m]

463 sentences are duplicated but have the same Label & MoR. Only the first is kept (234 are dropped).

275 sentences are duplicated and differ on their Label or MoR. They're all dropped.
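The two-step duplicate removal above can be illustrated on a toy list of rows. This is a minimal, pandas-free sketch: the function name `drop_duplicates_two_step` and the toy sentences are hypothetical and only mirror the logic of the `drop_duplicates` and `duplicated(keep=False)` calls in the cell above.

```python
from collections import Counter


def drop_duplicates_two_step(rows):
    """Toy version of the two-step cleaning (rows are dicts, not a DataFrame)."""
    # Step 1: identical Sentence, Label and Type -> keep only the first occurrence
    seen = set()
    step1 = []
    for row in rows:
        key = (row['Sentence'], row['Label'], row['Type'])
        if key not in seen:
            seen.add(key)
            step1.append(row)

    # Step 2: a sentence that still appears more than once has conflicting
    # annotations -> drop every copy of it
    counts = Counter(row['Sentence'] for row in step1)
    return [row for row in step1 if counts[row['Sentence']] == 1]


# Hypothetical toy data
toy_rows = [
    {'Sentence': 'A', 'Label': 'true',  'Type': 'ACTIVATION'},
    {'Sentence': 'A', 'Label': 'true',  'Type': 'ACTIVATION'},  # exact duplicate
    {'Sentence': 'B', 'Label': 'true',  'Type': 'REPRESSION'},
    {'Sentence': 'B', 'Label': 'false', 'Type': None},          # conflicting annotation
    {'Sentence': 'C', 'Label': 'false', 'Type': None},
]
kept = drop_duplicates_two_step(toy_rows)
print([row['Sentence'] for row in kept])  # ['A', 'C']
```

Sentence `A` survives once, `B` is removed entirely because its two copies disagree, and `C` is untouched.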

### Modify joined sentences

The sentence splitter used in the pipeline is `en_core_sci_md`. We now make sure that the sentences in the training dataset are split the way this splitter would split them.

As explained in `scripts/preprocessing/spacy_splitter_analysis.ipynb`, the spaCy splitter makes some mistakes: it sometimes splits a sentence in two when a "." is followed by a lowercase letter, or where there is no "." at all. To avoid these cases, we use a rule-based method to re-merge those sentences, defined in the `merge_sentences()` function.

We pass all sentences in the dataset through the splitter and find that about 0.3% of them were incorrectly split. They can be divided into 2 groups:
* **[TF] and [TG] are in the same sentence:** Those sentences have been revised, and are included in the `iter_1_2` dataset. They will be updated.
* **[TF] and [TG] are in separate sentences:** Those sentences will be dropped, as they are no longer valid.
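The re-merging rule described above can be sketched without loading the spaCy model. The sketch below works on plain `(text, start_char, end_char)` tuples instead of spaCy spans; `merge_spans`, `make_spans` and the example sentence are hypothetical names and data introduced only for illustration.

```python
def make_spans(pieces):
    """Build (text, start_char, end_char) tuples, assuming one space between pieces."""
    spans, pos = [], 0
    for piece in pieces:
        spans.append((piece, pos, pos + len(piece)))
        pos += len(piece) + 1
    return spans


def merge_spans(spans):
    """Re-merge pieces unless the boundary between them looks like a real sentence end."""
    merged, buffer = [], []
    for i, (chunk, _start, end) in enumerate(spans):
        chunk = chunk.strip()
        buffer.append(chunk)

        is_last = i == len(spans) - 1
        boundary_ok = False
        if not is_last:
            next_chunk, next_start, _ = spans[i + 1]
            boundary_ok = (
                chunk.endswith(('.', '!', '?'))          # current piece ends a sentence
                and next_chunk.strip()[:1].isupper()     # next piece starts with uppercase
                and next_start == end + 1                # exactly one character in between
            )
        if is_last or boundary_ok:
            merged.append(' '.join(buffer))
            buffer = []
    return merged


# Hypothetical splitter output: the first boundary (after "approx.") is a mistake
pieces = ["The [TF] protein binds approx.", "the [TG] promoter.", "It is induced."]
merged = merge_spans(make_spans(pieces))
print(merged)
```

Here the first two pieces are merged back because the second starts with a lowercase letter, so `[TF]` and `[TG]` end up in the same sentence again.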

In [147]:
nlp = spacy.load("en_core_sci_md")

def merge_sentences(doc):
    merged_sentences = []
    temp_sentence = ""

    sentences = [i for i in doc.sents]  # Convert generator to a list for easier handling
    for i, sentence in enumerate(sentences):
        current_text = sentence.text.strip()
        temp_sentence += current_text + " "

        # Close the current sentence if:
        #     it is the last sentence, or
        #     it ends with ".", "!" or "?", the next sentence starts with an uppercase
        #     letter, and exactly one character separates the two sentences
        if i == len(sentences) - 1 or ( current_text.endswith(('.', '!', '?')) 
                                       and sentences[i + 1].text.strip()[0].isupper()
                                       and sentences[i + 1].start_char == sentence.end_char + 1
                                      ):
            merged_sentences.append(temp_sentence.strip())
            temp_sentence = ""
    return merged_sentences


# Use spaCy's pipe for efficient batch processing
sentences = original_data['Sentence'].tolist()  # Convert column to list for efficient processing
%time docs = list(nlp.pipe(sentences))

# Get list of IDs to keep and to discard:
discard_ids = []
keep_ids = {}
for doc, id in zip(docs, original_data['#TRI ID']):
    merged_sents = merge_sentences(doc)
    if len(merged_sents) > 1:
        for sentence in merged_sents:
            if '[TF]' in sentence:
                if '[TG]' in sentence:
                    # TF and TG are together. Keep the sentence
                    keep_ids[id] = sentence
                    pass
                else:
                    # TF and TG are in separate sentences. Discard.
                    discard_ids.append(id)

# Show examples of discarded rows:
md(f"{len(discard_ids)} rows ({len(discard_ids) / len(original_data):.2%}) have [TF] and [TG] in separate sentences. They are discarded.<br>Example:")
m = original_data['#TRI ID'].isin(discard_ids)
df = original_data[m][['#TRI ID', 'Sentence', 'Label']]
df['Sentence'] = df['Sentence'].apply(lambda x: highlight_words(x, ['[TF]', '[TG]']))
display(HTML(f'{df[:2].to_html(escape=False)}'))                                    

md(f'{len(keep_ids)} rows ({len(keep_ids) / len(original_data):.2%}) have [TF] and [TG] in the same sentence. The correct sentence is kept.<br>Example:')
m = original_data['#TRI ID'].isin(keep_ids)
df = original_data[m][['#TRI ID', 'Sentence', 'Label']]
df['Sentence'] = df['Sentence'].apply(lambda x: highlight_words(x, ['[TF]', '[TG]']))
display(HTML(f'{df[:2].to_html(escape=False)}'))      

# Save the IDs of the split sentences that are kept, so they can be reannotated:
with open(splitted_sentences_path, 'w') as f:
    f.writelines([id+'\n' for id in keep_ids.keys()])

CPU times: user 1min 14s, sys: 1.59 s, total: 1min 15s
Wall time: 1min 15s


28 rows (0.13%) have [TF] and [TG] in separate sentences. They are discarded.<br>Example:

Unnamed: 0,#TRI ID,Sentence,Label
4132,12525489:0:activator protein 1:tissue inhibitors of metalloproteinases,"The comparative role of [TF] and Smad factors in the regulation of Timp-1 and MMP-1 gene expression by transforming growth factor-beta 1.  The balance between matrix metalloproteinases (MMPs) and their inhibitors, the [TG] (TIMPs), is pivotal in the remodeling of extracellular matrix.",False
4133,12525489:0:activator protein 1:TIMPs,"The comparative role of [TF] and Smad factors in the regulation of Timp-1 and MMP-1 gene expression by transforming growth factor-beta 1.  The balance between matrix metalloproteinases (MMPs) and their inhibitors, the tissue inhibitors of metalloproteinases ([TG]), is pivotal in the remodeling of extracellular matrix.",False


26 rows (0.12%) have [TF] and [TG] in the same sentence. The correct sentence is kept.<br>Example:

Unnamed: 0,#TRI ID,Sentence,Label
390,1655749:0:ATF-1:cAMP-dependent protein kinase A,The cAMP-regulated enhancer-binding protein [TF] activates transcription in response to [TG].  Many promoters respond transcriptionally to elevated levels of cAMP through the cAMP-responsive enhancer (CRE).,False
4128,12525489:0:activator protein 1:Smad,"The comparative role of [TF] and [TG] factors in the regulation of Timp-1 and MMP-1 gene expression by transforming growth factor-beta 1.  The balance between matrix metalloproteinases (MMPs) and their inhibitors, the tissue inhibitors of metalloproteinases (TIMPs), is pivotal in the remodeling of extracellular matrix.",False


In [148]:
# DROP SPLIT SENTENCES
print(f"We discard {len(discard_ids) + len(keep_ids)} rows because they have been split")
original_data = original_data[~original_data['#TRI ID'].isin(discard_ids)]

original_data = original_data[~original_data['#TRI ID'].isin(keep_ids.keys())]

We discard 54 rows because they have been split


## Reannotations

In [149]:
# PREPROCESS ITER DATASETS

# Remove empty rows & columns
iter_1_2.dropna(how='all', inplace=True)
iter_3.dropna(subset = iter_3.columns.difference(['Problem with:']), how='all', inplace=True)
iter_3.drop(columns=['Unnamed: 17'], inplace=True)

## Fix columns/cells
# Rows with no 'Iter' value (neither 1 nor 2) come from the splitter check.
iter_1_2.loc[iter_1_2['Iter'].isna(), 'Iter'] = 'Splitter'
iter_3['Iter'] = '3'
# These specific cells were incorrectly annotated
m = iter_1_2['#TRI ID'] == '24276245:6:E2F1:ABCG2'
iter_1_2.loc[m, 'true_MoR'] = 'UNDEFINED'
m = iter_1_2['#TRI ID'] == '21813016:5:CTCF:CDKN1A'
iter_1_2.loc[m, 'true_label'] = 'TRUE'

## Join the 'Problem with' columns into a single 'Problem_with' column
columns_to_join = ['Problem with:', 'Unnamed: 10']
iter_1_2['Problem_with'] = iter_1_2.apply(lambda row: '|'.join(sorted(row[col] for col in columns_to_join if not pd.isna(row[col]))), axis=1)
iter_1_2.drop(columns=columns_to_join, inplace=True)

columns_to_join = ['Problem with:', 'Unnamed: 11', 'Suboptimal?']
iter_3['Problem_with'] = iter_3.apply(lambda row: '|'.join(sorted([row[col] for col in columns_to_join if not pd.isna(row[col])])), axis=1)
iter_3.drop(columns=columns_to_join, inplace=True)

## Remove duplicates (annotated both in Iteration 1 and 2)
iter_1_2.drop_duplicates(subset=['#TRI ID', 'Sentence', 'Label', 'MoR'], keep='first', inplace=True)

## Convert 'both activation & repression' into UNDEFINED
m = iter_1_2['Explanation'] == 'BOTH activation AND repression'
h4('ACTIVATION & REPRESSION')
md(f'''There are {m.sum()} sentences that have been noted as <b>both ACTIVATION and REPRESSION</b><br>
Ideally, they should have a BOTH label, but as we don't have enough data, we'll convert them into UNDEFINED.<br>
Resulting rows:''')
iter_1_2.loc[m, 'Valid?'] = iter_1_2.loc[m, 'MoR'].apply(lambda MoR: 'T' if MoR == 'UNDEFINED' else 'F')
iter_1_2.loc[m, 'true_label'] = 'TRUE'
iter_1_2.loc[m, 'true_MoR'] = 'UNDEFINED'
display(iter_1_2[m][['#TRI ID', 'Sentence', 'Label', 'MoR', 'Valid?', 'true_label', 'true_MoR']])

# Join iter 1,2,3
validated_sents = pd.concat([iter_1_2, iter_3], axis=0)
# Drop duplicates between iters 2 & 3 (keep Iteration 3)
validated_sents = validated_sents.sort_values(by='Iter').drop_duplicates(subset=['#TRI ID', 'Sentence', 'Label', 'MoR'], keep='last')

# Remove validated rows that have been split
validated_sents = validated_sents[~validated_sents['#TRI ID'].isin(keep_ids.keys())]

# SUBOPTIMAL SENTENCES
h4("Suboptimal")
# Suboptimal sentences are only modified for the training dataset, not the NTNU dataset.
# Create a copy of the validated sentences for the NTNU
validated_sents_for_NTNU = validated_sents.copy()

# For the training dataset, modify the suboptimal sentences to "Label" = "False"
m_suboptimal = validated_sents['Problem_with'].str.contains('Suboptimal')
md(f"There are {m_suboptimal.sum()} sentences that are suboptimal and will be converted to 'False' for training")

validated_sents.loc[m_suboptimal, 'Valid?'] = np.where(validated_sents.loc[m_suboptimal, 'Label'] == 'FALSE', 'T', 'F')
validated_sents.loc[m_suboptimal, 'true_label'] = 'FALSE'
validated_sents.loc[m_suboptimal, 'true_MoR'] = np.nan


## Assertions
assert validated_sents.duplicated(subset=['#TRI ID']).sum() == 0
assert all(iter_3.columns == iter_1_2.columns), 'Iter 1_2 and Iter_3 have different columns'
assert validated_sents['Valid?'].isna().sum() == 0, 'Some rows have no Valid? value. Check the data.'
assert ((validated_sents['Valid?'] == 'F') & (validated_sents['true_label'].isna())).sum() == 0
assert ((validated_sents['true_label'] == 'TRUE') & (validated_sents['true_MoR'].isna())).sum() == 0
assert ((validated_sents['true_label'] == 'FALSE') & ~(validated_sents['true_MoR'].isna())).sum() == 0

<h4>ACTIVATION & REPRESSION</h4>

There are 5 sentences that have been noted as <b>both ACTIVATION and REPRESSION</b><br>
Ideally, they should have a BOTH label, but as we don't have enough data, we'll convert them into UNDEFINED.<br>
Resulting rows:

Unnamed: 0,#TRI ID,Sentence,Label,MoR,Valid?,true_label,true_MoR
1578,8994186:6:NR4A1:POMC,"Finally, we provide evidence that the nurr1/[TF] response sequence is pivotal to both nurr1/nur77-dependent positive regulation and glucocorticoid receptor-dependent negative regulation of the [TG] gene.",True,ACTIVATION,F,True,UNDEFINED
1581,9514889:7:NFKB:PTGS2,These results suggest that the [TF] site is involved in both the LPS-induced expression of the [TG] gene and its suppression by DEX and herbimycin A in a differentiation-dependent manner.,True,ACTIVATION,F,True,UNDEFINED
1915,9256061:7:TFEC:HMOX1,"By transient coexpression assays, we showed that [TF] is able to activate or inhibit transcription of a reporter gene linked to either the tyrosinase or the [TG] gene promoter, depending on cell types.",True,UNDEFINED,T,True,UNDEFINED
1918,9819390:0:TLX1:ALDH1A1,The T-cell oncogenic protein [TF] activates [TG] expression in NIH 3T3 cells but represses its expression in mouse spleen development.,True,UNDEFINED,T,True,UNDEFINED
1941,15660126:4:NFKB:SLC1A2,"Herein, we demonstrate that both TNFalpha-mediated repression and EGF-mediated activation of [TG] expression require [TF].",True,UNDEFINED,T,True,UNDEFINED


<h4>Suboptimal</h4>

There are 160 sentences that are suboptimal and will be converted to 'False' for training

In [150]:
# UPDATE TRAINING DATASET

def update_training_dataset(original_data, validated_sents):
    '''
    Update the training dataset with validated sentences.
    '''

    updated_data = original_data.copy()

    # Get data to update in the same format as the training dataset
    m = validated_sents['Valid?'] == 'F'
    to_update = validated_sents[m].copy()  # copy, so validated_sents itself is not modified
    to_update['true_label'] = to_update['true_label'].replace({'FALSE': 'false', 'TRUE': 'true'})

    # Create mappings
    label_map = to_update.set_index('#TRI ID')['true_label']
    MoR_map   = to_update.set_index('#TRI ID')['true_MoR']

    # Apply mappings to update 'Label' and 'Type' where '#TRI ID' matches
    mask = updated_data['#TRI ID'].isin(to_update['#TRI ID'])
    updated_data.loc[mask, 'Label'] = updated_data.loc[mask, '#TRI ID'].map(label_map)
    updated_data.loc[mask, 'Type']  = updated_data.loc[mask, '#TRI ID'].map(MoR_map)

    # Create column to keep track of validated rows
    updated_data['validated?'] = updated_data['#TRI ID'].isin(validated_sents['#TRI ID'])

    return updated_data

# Update training data
training_data = update_training_dataset(original_data, validated_sents)

# Update the NTNU dataset
NTNU_dataset = update_training_dataset(original_data, validated_sents_for_NTNU)


In [151]:
# CREATE THE NTNU DATASET

# Return to the original sentence
NTNU_dataset['Sentence'] = NTNU_dataset.apply(lambda row: row['Sentence'].replace('[TF]', row['TF']), axis=1)
NTNU_dataset['Sentence'] = NTNU_dataset.apply(lambda row: row['Sentence'].replace('[TG]', row['TG']), axis=1)

# When the mask is removed, there are some duplicated sentences. They are handled here:

# 1) Discard complete duplicates
m1 = NTNU_dataset[['Sentence', 'TF', 'TG', 'Label', 'Type']].duplicated()
NTNU_dataset = NTNU_dataset[~m1]

# 2) If the duplicates are both validated or both unvalidated, discard all of them
m2 = NTNU_dataset[['Sentence', 'TF', 'TG', 'validated?']].duplicated(keep=False)
NTNU_dataset = NTNU_dataset[~m2]


# 3) If one duplicate is validated and the other is not, keep the validated version
m3 = NTNU_dataset[['Sentence', 'TF', 'TG']].duplicated(keep=False) & ~NTNU_dataset['validated?']
NTNU_dataset = NTNU_dataset[~m3]

# Discard 'validated?' column
NTNU_dataset = NTNU_dataset.drop(columns=['validated?'])

print(f"We discard {m1.sum()} sentences as duplicates, {m2.sum()} sentences where labels differ, and {m3.sum()} unvalidated sentences where the validated version has different labels.")
NTNU_dataset.head()

We discard 43 sentences as duplicates, 14 sentences where labels differ, and 14 unvalidated sentences where the validated version has different labels.


Unnamed: 0,#TRI ID,Sentence,TF,TG,Label,Type
0,16373364:0:Ets1:GATA-3,"A role for Ets1, synergizing with AP-1 and GATA-3 in the regulation of IL-5 transcription in mouse Th2 lymphocytes.",Ets1,GATA-3,False,
1,16373364:0:Ets1:IL-5,"A role for Ets1, synergizing with AP-1 and GATA-3 in the regulation of IL-5 transcription in mouse Th2 lymphocytes.",Ets1,IL-5,True,UNDEFINED
2,16373364:0:Ets1:AP-1,"A role for Ets1, synergizing with AP-1 and GATA-3 in the regulation of IL-5 transcription in mouse Th2 lymphocytes.",Ets1,AP-1,False,
3,16373364:0:GATA-3:Ets1,"A role for Ets1, synergizing with AP-1 and GATA-3 in the regulation of IL-5 transcription in mouse Th2 lymphocytes.",GATA-3,Ets1,False,
4,16373364:0:GATA-3:IL-5,"A role for Ets1, synergizing with AP-1 and GATA-3 in the regulation of IL-5 transcription in mouse Th2 lymphocytes.",GATA-3,IL-5,True,UNDEFINED
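The three duplicate-handling steps applied after unmasking can be summarised on toy rows. This is a pandas-free sketch under simplifying assumptions: `resolve_unmasked_duplicates`, the `validated` key and the toy sentences `s1` to `s3` are hypothetical and only mirror the masks `m1`, `m2` and `m3` above.

```python
from collections import Counter


def resolve_unmasked_duplicates(rows):
    """Toy version of steps 1-3 (rows are dicts with a boolean 'validated' key)."""
    # 1) Discard complete duplicates, keeping the first occurrence
    seen, step1 = set(), []
    for r in rows:
        key = (r['Sentence'], r['TF'], r['TG'], r['Label'], r['Type'])
        if key not in seen:
            seen.add(key)
            step1.append(r)

    # 2) Same sentence/TF/TG and same validation status -> discard all copies
    counts = Counter((r['Sentence'], r['TF'], r['TG'], r['validated']) for r in step1)
    step2 = [r for r in step1
             if counts[(r['Sentence'], r['TF'], r['TG'], r['validated'])] == 1]

    # 3) Remaining duplicates differ in validation status -> keep the validated one
    counts = Counter((r['Sentence'], r['TF'], r['TG']) for r in step2)
    return [r for r in step2
            if counts[(r['Sentence'], r['TF'], r['TG'])] == 1 or r['validated']]


def row(sentence, tf, tg, label, mor, validated):
    return {'Sentence': sentence, 'TF': tf, 'TG': tg,
            'Label': label, 'Type': mor, 'validated': validated}


# Hypothetical toy data
toy = [
    row('s1', 'p65', 'cyclin D1', 'true', 'REPRESSION', False),
    row('s1', 'p65', 'cyclin D1', 'true', 'REPRESSION', False),  # exact duplicate
    row('s2', 'Tcf-4', 'AKT1', 'true', 'UNDEFINED', False),
    row('s2', 'Tcf-4', 'AKT1', 'false', None, False),            # conflict, none validated
    row('s3', 'Ets1', 'IL-5', 'false', None, False),
    row('s3', 'Ets1', 'IL-5', 'true', 'UNDEFINED', True),        # validated version wins
]
resolved = resolve_unmasked_duplicates(toy)
print([(r['Sentence'], r['Label']) for r in resolved])  # [('s1', 'true'), ('s3', 'true')]
```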


In [152]:
# SHOW STATISTICS

# Sentences reannotated
m = validated_sents['Valid?'] == 'F'
md(f'{m.sum()}/{len(m)} ({m.sum() / len(m):.0%}) reannotated sentences have changed their labels')

# TRI - Non-TRI issues
md("Issues due to TRI label")
m = validated_sents['Valid?'] == 'F'
display(validated_sents[m][["Valid?", "Label", "true_label"]].value_counts(dropna=False))

md("Erroneous sentences due to suboptimal TRI")
m = validated_sents['Problem_with'].str.contains('Suboptimal') & (validated_sents['Valid?'] == 'F')
display(validated_sents[m][["Valid?", "Label", "true_label"]].value_counts(dropna=False))


md('Other problems encountered:')
problems = set(p for unique in validated_sents['Problem_with'].unique() for p in unique.split('|'))
problems.discard('')
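# NOTE: `m` is still the mask defined above (suboptimal AND not valid), so the
# 'Non valid rows' column only counts problems among those rows.
print(f"{'':<11} {'All data':<15} {'Non valid rows':<15}")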
for problem in problems:
    in_all_data  = validated_sents['Problem_with'].str.contains(problem).sum()
    in_incorrect = validated_sents[m]['Problem_with'].str.contains(problem).sum()
    print(f"{problem:<11} {in_all_data:<15} {in_incorrect:<15}")

2024/3086 (66%) reannotated sentences have changed their labels

Issues due to TRI label

Valid?  Label  true_label
F       FALSE  TRUE          1067
        TRUE   TRUE           485
               FALSE          472
Name: count, dtype: int64

Erroneous sentences due to suboptimal TRI

Valid?  Label  true_label
F       TRUE   FALSE         149
Name: count, dtype: int64

Other problems encountered:

            All data        Non valid rows 
Splitter    11              0              
PPI         19              0              
dir-gene    27              0              
dir-syntax  3050            149            
Suboptimal  160             149            
negation    60              0              


## Save the dataset

In [154]:
# Ensure data has expected values
assert set(training_data['Label']).issubset({'false', 'true'})
assert set(training_data['Type'].dropna()).issubset({'UNDEFINED', 'ACTIVATION', 'REPRESSION'})
# Labels are still lowercase strings here, so compare against 'false' (not 'False')
assert ((training_data['Label'] == 'false') & ~(training_data['Type'].isna())).sum() == 0

# Convert it to False and True
training_data['Label'] = training_data['Label'].map({'false': False, 'true': True})
NTNU_dataset['Label'] = NTNU_dataset['Label'].map({'false': 'Not TRI', 'true': 'TRI'})

# Save the datasets
training_data.to_csv(training_data_path, sep='\t')
NTNU_dataset.to_csv(NTNU_dataset_path, sep='\t')

In [155]:
md(f"NTNU dataset, {len(NTNU_dataset)} sentences")
display(NTNU_dataset[['Label', 'Type']].value_counts(dropna=False))

md(f"Training data, {len(training_data)} sentences")
display(training_data[['Label', 'Type']].value_counts(dropna=False))

NTNU dataset, 21501 sentences

Label    Type      
Not TRI  NaN           9980
TRI      ACTIVATION    5395
         UNDEFINED     4073
         REPRESSION    2053
Name: count, dtype: int64

Training data, 21572 sentences

Label  Type      
False  NaN           10169
True   ACTIVATION     5344
       UNDEFINED      4044
       REPRESSION     2015
Name: count, dtype: int64

## Deprecated - Add negative sentences
We tried adding negative sentences to the dataset so that the classifier would learn to flag them as negative. It did not give good results, so we abandoned the idea.

In [None]:
# NEGATIONS - Only mentioned, kept as positive
h4('Negations')
m_negations =  validated_sents['Problem_with'].str.contains('negation')
md(f"There are {m_negations.sum()} sentences that are negations, considered as 'True'")

<h4>Negations</h4>

There are 60 sentences that are negations, considered as 'True'

### Enhance with negations from the extended NTNU dataset

`original_tri_sentences.tsv` is a refined subset of the NTNU dataset, a set of 40K validated sentences from ExTRI. Among other things, it does not include sentences marked as 'negations'. Without those, the model does not learn to reject negated statements, so sentences such as:
> TF has been observed to have no effect on TG

are marked as 'Valid' by the model.

Thus, some of the negation sentences from the original NTNU dataset are recovered to enhance training data with negations.

In [None]:
# TODO - I will remove it from here, but there's some NTNU analysis in the one in ExTRI_classifiers
NTNU_extended[:2]

Unnamed: 0,#TRI,Valid,Sign,Negation,QC,Comment,Source file,Sentence
0,18728219:2:ARNTL:CRY1,True,,,True,"---{""transcription factor (associated gene name)"":[""ARNTL""],""target gene (associated gene name)"":[""CRY1""],""valid"":[""TRUE""],""sign"":[],""final_negation_220621"":[],""qc"":[""TRUE""],""sentence"":[""CLOCK and ARNTL are transcriptional activators that regulate Per and Cry gene expression.""],""Valid"":[true],""Sign"":[null],""Negation"":[null],""QC"":[true]}---",AGS_24h_transcriptome_QCed_MANUAL_curation_sept21.tsv,
1,19605937:1:ARNTL:CRY1,True,+,,True,"---{""transcription factor (associated gene name)"":[""ARNTL""],""target gene (associated gene name)"":[""CRY1""],""valid"":[""TRUE""],""sign"":[""a""],""final_negation_220621"":[],""qc"":[""TRUE""],""sentence"":[""\""In the molecular oscillatory mechanism governing circadian rhythms, positive regulators, including CLOCK and BMAL1, transactivate Per and Cry genes through E-box elements, and translated PER and CRY proteins negatively regulate their own transactivation.\""""],""Valid"":[true],""Sign"":[""+""],""Negation"":[null],""QC"":[true]}---",AGS_24h_transcriptome_QCed_MANUAL_curation_sept21.tsv,


The NTNU extended dataset did not keep the original TF/TG mentions in the text (only their normalizations). Only sentences that are marked as negations, and in which the normalized TF and TG names both appear verbatim in the text, are considered.

In [None]:
# Extract negations from NTNU
m = (NTNU_extended['Negation'] == 'true') & ~NTNU_extended['Sentence'].isna()
NTNU_negations = NTNU_extended[m].copy()
NTNU_negations['TF'] = NTNU_negations['#TRI'].str.split(':').str[2]
NTNU_negations['TG'] = NTNU_negations['#TRI'].str.split(':').str[3]

m = NTNU_negations.apply(lambda row: (row['TF'] in row['Sentence']) & (row['TG'] in row['Sentence']), axis=1)

print(f"There are {len(NTNU_negations)} sentences marked as negations.")
print(f"From those, both normalized TF&TG are present in {m.sum()} rows.")
#NTNU_negations[m].to_csv('negations.tsv', sep='\t')

There are 496 sentences marked as negations.
From those, both normalized TF&TG are present in 28 rows.


In most cases, the sentence contains more than one mention of the TF or TG, making the process of swapping the TF/TG for their `[TF]` and `[TG]` tokens difficult to automate. It has been done manually, and the result is saved in `negations_NTNU.tsv`.
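To illustrate why multiple mentions make the swap ambiguous, the sketch below masks a sentence only when the TF and the TG each occur exactly once, and gives up otherwise. It is not the procedure used to build `negations_NTNU.tsv` (that was done manually); `mask_if_unambiguous` and the example sentences are hypothetical.

```python
def mask_if_unambiguous(sentence, tf, tg):
    """Return the sentence with [TF]/[TG] tokens, or None when the swap is ambiguous."""
    # More than one (or zero) mention of either entity: cannot decide automatically
    if sentence.count(tf) != 1 or sentence.count(tg) != 1:
        return None

    tf_start, tg_start = sentence.find(tf), sentence.find(tg)
    tf_end, tg_end = tf_start + len(tf), tg_start + len(tg)
    # Overlapping mentions (e.g. one name contained in the other) are also ambiguous
    if tf_start < tg_end and tg_start < tf_end:
        return None

    # Replace the right-most mention first so the earlier offsets stay valid
    edits = sorted([(tf_start, tf_end, '[TF]'), (tg_start, tg_end, '[TG]')], reverse=True)
    for start, end, token in edits:
        sentence = sentence[:start] + token + sentence[end:]
    return sentence


# Hypothetical example sentences
masked = mask_if_unambiguous("ARNTL does not regulate CRY1 expression.", "ARNTL", "CRY1")
ambiguous = mask_if_unambiguous("CRY1 levels do not affect CRY1-dependent ARNTL binding.", "ARNTL", "CRY1")
print(masked)     # [TF] does not regulate [TG] expression.
print(ambiguous)  # None
```

Sentences for which the function returns `None` are the ones that would still need manual masking.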

In [None]:
# Prepare the negations in the format of the updated data
negations_to_add = negations_to_add[['#TRI', 'Sentence', 'TF', 'TG']].copy()  # copy to avoid SettingWithCopyWarning
negations_to_add = negations_to_add.rename(columns={"#TRI": "#TRI ID"})
negations_to_add['Label'] = 'false'
negations_to_add['Type'] = np.nan
negations_to_add['validated?'] = True

# Combine it with the updated data
training_data = pd.concat([training_data, negations_to_add], axis=0)