# Create training data

This notebook creates the datasets used to train the TRI and MoR classifiers. It must be run after `update_tri_sentences.ipynb`, which updates the original dataset with the reannotated sentences.

Using the raw dataset `tri_sentences.tsv`, 4 datasets are created:
* `TRI_data`
* `TRI_span_data`
* `MoR_data`
* `MoR_span_data`

Where:
* In `TRI_data` and `MoR_data`, the TF and TG entities are replaced by the `[TF]` and `[TG]` mask tokens.
* The `_span` suffix indicates that the TF and TG mentions are enclosed in `<TF></TF>` and `<TG></TG>` tags.
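
As a rough illustration of the two formats, the sketch below turns a masked sentence into its span counterpart with plain string replacement. The sentence and entity names come from the data preview shown later in this notebook; the helper name `to_span_format` is hypothetical and only used here.

```python
def to_span_format(masked_text, tf, tg):
    """Replace the [TF]/[TG] mask tokens with tagged entity names."""
    return (masked_text
            .replace('[TF]', f'<TF>{tf}</TF>')
            .replace('[TG]', f'<TG>{tg}</TG>'))

# Masked format, as stored in TRI_data / MoR_data
masked = '[TF] is up-regulated in response to Wnt1 and activates the expression of [TG].'

# Span format, as stored in TRI_span_data / MoR_span_data
span = to_span_format(masked, tf='PEA3', tg='cyclooxygenase-2')
print(span)
```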

## Imports and general functions

In [27]:
# Imports
from IPython.display import display, HTML, display_html
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
import matplotlib.pyplot as plt
import re
from tqdm import tqdm

# My functions
import sys
sys.path.append('../common/') 
from analysis_utils import prettify_plots
from notebook_utils import table_of_contents, table_from_dict

prettify_plots()

In [28]:
# Define general functions
def plot_dataset_data(data, ax, title, bins):
    """Plot overlaid histograms of text length for each class on `ax`.

    Expects `data` to have a 'Label' column (0/1) and a 'Text' column.
    `bins` is a pair: bins[0] is used for negatives, bins[1] for positives.
    """
    positive_lengths = data[data['Label'] == 1]['Text'].apply(len)
    negative_lengths = data[data['Label'] == 0]['Text'].apply(len)
    ax.hist(negative_lengths, bins=bins[0], edgecolor='k', color='red', alpha=0.5)
    ax.hist(positive_lengths, bins=bins[1], edgecolor='k', color='green', alpha=0.5)
    #ax.hist([list0, list1], bins=bins[0], edgecolor='k', color=['red', 'green'], alpha=1, label=['DISCARDED', 'ACCEPTED'], stacked=False)

    ax.set_xlabel('Sequence Length')
    ax.set_ylabel('Frequency')
    # Subtitle with the number and percentage of positive examples
    n_positives = len(positive_lengths)
    suptitle = f"{n_positives} positives ({n_positives/len(data)*100:.2f}%)"
    ax.set_title(f"{title}\n{suptitle}")

In [29]:
table_of_contents('make_train_data.ipynb')

<h3>Table of contents</h3>

[Create training data](#Create-training-data)
- [Imports and general functions](#Imports-and-general-functions)
- [Paths & Load data](#Paths-&-Load-data)
- [Create training datasets](#Create-training-datasets)
  - [TRI & MoR spans](#TRI-&-MoR-spans)

## Paths & Load data

In [30]:
# PATHS
# Input paths
classifiers_data_path = '../../classifiers_training/data/'
TRI_data_path  = classifiers_data_path + 'tri_sentences.tsv'

# Output paths
output_path = classifiers_data_path

In [31]:
# Load data
TRI_data = pd.read_csv(TRI_data_path,sep='\t', header=0)

# Show data
pd.set_option('display.max_colwidth', 200)
display(HTML('TRI dataset'))
TRI_data[51:53]

| Unnamed: 0.1 | Unnamed: 0 | #TRI ID | Sentence | TF | TG | Label | Type | validated? |
|---|---|---|---|---|---|---|---|---|
| 51 | 51 | 11274170:0:PEA3:Wnt1 | Therefore, we speculate that [TF] factors may contribute to the up-regulation of COX-2 expression resulting from both APC mutation and [TG] expression. | PEA3 | Wnt1 | False | | False |
| 52 | 52 | 11274170:0:PEA3:cyclooxygenase-2 | [TF] is up-regulated in response to Wnt1 and activates the expression of [TG]. | PEA3 | cyclooxygenase-2 | True | ACTIVATION | False |


## Create training datasets

Both the TRI sentences and the MoR data come from the same source: `raw_data/tri_sentences.tsv` (modified from the original `raw_data/original_tri_sentences.tsv` in the notebook `update_raw_data.ipynb`).

For the **TRI classifier**, two datasets are created: a masked one, where the TF and TG are replaced by the tokens `[TF]` and `[TG]`, and a span one, where they are enclosed in `<TF></TF>` and `<TG></TG>` tags.

For the **MoR classifier**, the dataset has 3 labels: `UNDEFINED`, `ACTIVATION` and `REPRESSION`. These labels are replaced by the integers 0, 1 and 2, respectively. The mapping is passed to the model as follows:

- `id2label = {0: "UNDEFINED", 1: "ACTIVATION", 2: "REPRESSION"}`
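
A minimal sketch of how this mapping can be used in both directions is shown below. The inverse dictionary `label2id` and the example list `types` are assumptions added for illustration; they do not appear in the notebook.

```python
# Mapping given in the text above
id2label = {0: "UNDEFINED", 1: "ACTIVATION", 2: "REPRESSION"}

# Hypothetical inverse mapping, built from id2label so the two stay in sync
label2id = {label: idx for idx, label in id2label.items()}

# Made-up example values for the 'Type' column
types = ["ACTIVATION", "UNDEFINED", "REPRESSION"]

encoded = [label2id[t] for t in types]    # strings -> integer labels
decoded = [id2label[i] for i in encoded]  # integer labels -> strings
print(encoded, decoded)
```

Deriving one dictionary from the other avoids the two mappings drifting apart if a label is ever added or renamed.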

In [32]:
# PREPROCESSING
# Change labels from booleans to 0 and 1
TRI_data['Label'] = TRI_data['Label'].astype(int)
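# Get the MoR data (only positive TRI sentences have a MoR type).
# Copy so that the label changes below do not act on a view of TRI_data.
MoR_data = TRI_data[TRI_data['Label'] == 1].copy()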

# Show some statistics of the data
print(f'''Stats for the TRI sentences:
    total:\t\t{len(TRI_data)}
    positives:\t\t{len(TRI_data[TRI_data['Label'] == 1])}
    negatives:\t\t{len(TRI_data[TRI_data['Label'] == 0])}

For the positive sentences, number of MoR of each category:''')
for MoR_type in MoR_data['Type'].unique():
    print(f'{MoR_type}\t{len(MoR_data[MoR_data["Type"] == MoR_type])}')

# Change the labels for numbers
MoR_data.loc[MoR_data['Type'] == 'UNDEFINED',  'Label'] = 0
MoR_data.loc[MoR_data['Type'] == 'ACTIVATION', 'Label'] = 1
MoR_data.loc[MoR_data['Type'] == 'REPRESSION', 'Label'] = 2

# Change to names expected by pytorch
TRI_data = TRI_data.rename(columns={'Label': 'labels', 'Sentence': 'texts'})
MoR_data = MoR_data.rename(columns={'Label': 'labels', 'Sentence': 'texts'})

Stats for the TRI sentences:
    total:		22160
    positives:		11695
    negatives:		10465

For the positive sentences, number of MoR of each category:
UNDEFINED	4149
ACTIVATION	5489
REPRESSION	2057


In [33]:
# Save the datasets
TRI_data.to_csv(output_path + 'TRI_data.tsv', sep='\t')
MoR_data.to_csv(output_path + 'MoR_data.tsv', sep='\t')
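
The `Unnamed: 0` style columns in the preview of `tri_sentences.tsv` are what pandas produces when a file saved with its index is read back without declaring an index column. The sketch below reproduces this on a tiny made-up frame, using in-memory buffers instead of real files. Whether to pass `index=False` when saving depends on whether downstream code relies on that column, which this notebook does not state.

```python
import io
import pandas as pd

# Tiny made-up frame standing in for TRI_data
df = pd.DataFrame({'texts': ['[TF] activates [TG].'], 'labels': [1]})

# Default behaviour: the index is written as an extra, unnamed first column
with_index = io.StringIO()
df.to_csv(with_index, sep='\t')
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index, sep='\t')

# index=False: only the real columns are written
without_index = io.StringIO()
df.to_csv(without_index, sep='\t', index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index, sep='\t')

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```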

### TRI & MoR spans

Using `[TF]` and `[TG]`, the model loses all information about which TF and TG are involved. This could hinder the model: in the (frequent) cases where the TF or TG is mentioned more than once in the sentence, the model cannot tell that the masked token and the other mention refer to the same entity.

Therefore, we will also train the model with datasets that contain spans `<TF></TF>` and `<TG></TG>` instead of `[TF]` and `[TG]`.

In [34]:
# Replace [TF] and [TG] with <TF>...</TF> and <TG>...</TG> in both TRI and MoR
def add_spans(row):
    """Replace the mask tokens with the tagged TF and TG names of the row."""
    text = row['texts'].replace('[TF]', f"<TF>{row['TF']}</TF>")
    return text.replace('[TG]', f"<TG>{row['TG']}</TG>")

TRI_span_data = TRI_data.copy()
MoR_span_data = MoR_data.copy()

TRI_span_data['texts'] = TRI_span_data.apply(add_spans, axis=1)
MoR_span_data['texts'] = MoR_span_data.apply(add_spans, axis=1)

# Save the datasets
TRI_span_data.to_csv(output_path + 'TRI_span_data.tsv', sep='\t')
MoR_span_data.to_csv(output_path + 'MoR_span_data.tsv', sep='\t')
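
One possible sanity check on the span datasets is sketched below. It assumes that every masked sentence contains exactly one `[TF]` and one `[TG]` token, which the notebook does not state. The helper name `has_single_spans` is hypothetical and the second example sentence is made up.

```python
import re

def has_single_spans(text):
    """True if the text has exactly one <TF>...</TF> and one <TG>...</TG> span."""
    n_tf = len(re.findall(r'<TF>.+?</TF>', text))
    n_tg = len(re.findall(r'<TG>.+?</TG>', text))
    return n_tf == 1 and n_tg == 1

examples = [
    # Sentence from the data preview, in span format
    '<TF>PEA3</TF> is up-regulated in response to Wnt1 and activates the expression of <TG>cyclooxygenase-2</TG>.',
    # Made-up sentence where the [TF] mask was left unreplaced
    '[TF] activates <TG>Wnt1</TG>.',
]
checks = [has_single_spans(t) for t in examples]
print(checks)
```

On the real data, the same function could be applied to the `texts` column, for example with `TRI_span_data['texts'].apply(has_single_spans).all()`.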