# Finding Causal Relations With Question Answering Models
In this notebook we train three models.  

1.   The first model finds the causal marker given the sentence.
2.   The second model finds the cause given the marker and the sentence.
3.   The third model finds the effect given the marker and the sentence.
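
The way the three models fit together can be sketched as follows. This is only an illustration: `extract_relation` and the toy stand-in functions are hypothetical names that do not appear in this notebook, and each stand-in takes a question and a sentence the way a QA model would.

```python
def extract_relation(sentence, marker_question, find_marker, find_cause, find_effect):
    """Chain three QA-style callables; each takes (question, context) and returns a span."""
    marker = find_marker(marker_question, sentence)
    if not marker:
        # No causal marker found: there is nothing to ask the other two models.
        return {"marker": "", "cause": "", "effect": ""}
    return {
        "marker": marker,
        # The marker found in step 1 serves as the question for steps 2 and 3.
        "cause": find_cause(marker, sentence),
        "effect": find_effect(marker, sentence),
    }


# Toy stand-ins (hypothetical) that mimic the models with simple string rules.
def toy_marker(question, context):
    return "because" if "because" in context else ""


def toy_cause(marker, context):
    return context.split(marker, 1)[1].strip(" .")


def toy_effect(marker, context):
    return context.split(marker, 1)[0].strip()


relation = extract_relation(
    "The road flooded because it rained.",
    "because - since - therefore",
    toy_marker, toy_cause, toy_effect,
)
print(relation)
```

In the notebook itself, the three callables correspond to the fine-tuned QA models, which are trained one at a time.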

First, install the dependencies.


In [2]:
!pip install datasets | grep -v 'already satisfied'
!pip install transformers | grep -v 'already satisfied'

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import json
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
from datasets import Dataset
import datasets
import numpy as np
import torch
from transformers import AutoTokenizer, AutoConfig, AutoModelForQuestionAnswering, TrainingArguments, Trainer
from pathlib import Path
from tools import get_p_and_r, run_model

## Data & Model
In this section the data is read from a JSON file and converted to a dataset. The model and tokenizer are also initialized here.

In [None]:
model_name = "HooshvareLab/bert-fa-base-uncased"
tokenizer, config = AutoTokenizer.from_pretrained(model_name), AutoConfig.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Read the annotated examples; the context manager closes the file afterwards.
with open('data_effect.json', 'r', encoding='utf-8') as f:
    file = json.load(f)
df = pd.json_normalize(file['data']).sample(frac=1, random_state=10) # 3080
dataset = ds.dataset(pa.Table.from_pandas(df).to_batches())
train_data = Dataset(pa.Table.from_pandas(df.iloc[:2400]))
validation_data = Dataset(pa.Table.from_pandas(df.iloc[2400: ]))
data = datasets.DatasetDict({"train":train_data,"validation": validation_data})
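
The `preprocess` function in the next section reads `question`, `context`, and `answers` from each record, and indexes `answers` as `answers[0]['text'][0]` and `answers[0]['answer_start'][0]`. A record of the following shape would satisfy that access pattern. The sentence is made up for illustration, the helper `answer_is_aligned` is hypothetical, and the actual contents of `data_effect.json` are not shown in this notebook.

```python
# Hypothetical record in the layout implied by how the notebook indexes its data.
record = {
    "question": "because",  # assumed: for the cause/effect models the question is the marker
    "context": "The road flooded because it rained.",
    "answers": [{"text": ["The road flooded"], "answer_start": [0]}],
}


def answer_is_aligned(rec):
    """Check that answer_start really points at the answer text inside the context."""
    answer = rec["answers"][0]
    text, start = answer["text"][0], answer["answer_start"][0]
    # An empty answer text means "no answer"; preprocess maps it to the [CLS] token.
    return text == "" or rec["context"][start:start + len(text)] == text


print(answer_is_aligned(record))
```

A check like this can catch character offsets that drifted during annotation before they turn into wrong token labels.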


## Preprocess
This function converts our dataset to the format expected by QA models: tokenized question/context pairs, each with the start and end token positions of its answer.

In [None]:
def preprocess(examples):
    """
    Prepare the data to be fed into the QA model.

    :param examples: A batch of examples containing question, context and answers
    :return: The tokenized examples with start_positions and end_positions added
    """

    tokenized_examples = tokenizer(examples["question"], examples["context"], return_offsets_mapping=True)
    tokenized_examples['start_positions'], tokenized_examples['end_positions'] = [], []

    cls_index = 0
    for i, offset in enumerate(tokenized_examples['offset_mapping']):
        answer = examples['answers'][i][0]

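        # Sequence ids: 0 for question tokens, 1 for context tokens, None for special tokens.
        # Special tokens are mapped to 0 so that they are penalized like question tokens below.
        types = np.array(tokenized_examples.sequence_ids(i))
        types[types == None] = 0
        types = types.astype(int)  # astype returns a new array, so the result must be assigned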

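        if len(answer['text'][0]) == 0:
            # No answer: point both positions at the [CLS] token.
            s, e = cls_index, cls_index

        else:
            # Start token: the context token whose start offset is closest to the answer's first character.
            # Adding 100 to every non-context token keeps the argmin inside the context.
            s_diff = np.abs(np.array([offset[idx][0] - answer['answer_start'][0] for idx in range(len(offset))]))
            s = np.argmin([s_diff[idx] + 100 * (1 - types[idx]) for idx in range(len(s_diff))])

            # End token: the context token whose end offset is closest to the end of the answer.
            e_diff = np.abs(np.array(
                [offset[idx][1] - answer['answer_start'][0] - len(answer['text'][0]) for idx in range(len(offset))]))
            e = np.argmin([e_diff[idx] + 100 * (1 - types[idx]) for idx in range(len(e_diff))])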

        tokenized_examples['start_positions'].append(s)
        tokenized_examples['end_positions'].append(e)

    tokenized_examples.pop('offset_mapping')
    return tokenized_examples
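
The span-selection step inside `preprocess` can be followed on a toy example without a tokenizer. The sketch below re-implements only that step in plain Python. The function name `locate_span`, the hand-written offsets, and the English sentence are all assumptions made for illustration.

```python
def locate_span(offsets, sequence_ids, answer_start, answer_text, penalty=100):
    """Return (start_token, end_token): the context tokens closest to the answer's character span."""
    answer_end = answer_start + len(answer_text)

    def cost(idx, target, side):
        # side 0 compares token start offsets, side 1 compares token end offsets.
        in_context = 1 if sequence_ids[idx] == 1 else 0
        # Question and special tokens get a large penalty so they are never selected.
        return abs(offsets[idx][side] - target) + penalty * (1 - in_context)

    indices = range(len(offsets))
    start = min(indices, key=lambda i: cost(i, answer_start, 0))
    end = min(indices, key=lambda i: cost(i, answer_end, 1))
    return start, end


# Tokens: [CLS] why [SEP] rain caused floods [SEP]
# Offsets are character positions within the question or the context, respectively.
offsets = [(0, 0), (0, 3), (0, 0), (0, 4), (5, 11), (12, 18), (0, 0)]
sequence_ids = [None, 0, None, 1, 1, 1, None]

print(locate_span(offsets, sequence_ids, 12, "floods"))
print(locate_span(offsets, sequence_ids, 5, "caused floods"))
```

Without the penalty, the question token `why` (offset 0) would compete with context tokens for answers that start near character 0, because question and context offsets both start counting from zero.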

## Train
In this section the data is preprocessed and then fed into the model for training.

In [None]:
tokenized_ds = data.map(preprocess, batched=True, remove_columns=data["train"].column_names)

args = TrainingArguments(
    "result",  # output directory
    evaluation_strategy = "steps", # or "epoch"
    eval_steps = 12,
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0) 

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['validation'],
    tokenizer=tokenizer)

trainer.train()


PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 679
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 43
  Number of trainable parameters = 162252290
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 12 | No log | 0.456468 |
| 24 | No log | 0.359837 |
| 36 | No log | 0.278296 |


***** Running Evaluation *****
  Num examples = 679
  Batch size = 16
***** Running Evaluation *****
  Num examples = 679
  Batch size = 16
***** Running Evaluation *****
  Num examples = 679
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=43, training_loss=0.5638043159662292, metrics={'train_runtime': 694.9065, 'train_samples_per_second': 0.977, 'train_steps_per_second': 0.062, 'total_flos': 12892851494964.0, 'train_loss': 0.5638043159662292, 'epoch': 1.0})
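
The step counts in the log above follow from the number of examples and the batch size. As a quick check, here is a small sketch of the arithmetic, assuming a single device, no gradient accumulation, and a final partial batch that is kept (all consistent with the log). The function `training_schedule` is a hypothetical helper, not part of `transformers`.

```python
import math


def training_schedule(num_examples, batch_size, epochs, eval_steps):
    """Return the total number of optimization steps and the steps at which evaluation runs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch may be smaller
    total_steps = steps_per_epoch * epochs
    eval_at = list(range(eval_steps, total_steps + 1, eval_steps))
    return total_steps, eval_at


total_steps, eval_at = training_schedule(num_examples=679, batch_size=16, epochs=1, eval_steps=12)
print(total_steps, eval_at)
```

This reproduces the 43 optimization steps and the three evaluations at steps 12, 24, and 36. The `No log` entries in the Training Loss column are most likely due to the default `logging_steps` of 500, which is larger than the 43 steps of this run.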

## Connect to Google Drive
Here you can connect to Google Drive to save and load models.

In [4]:
from google.colab import drive
drive.mount('/content/gdrive')
#trainer.save_model('/content/gdrive/My Drive/effect')

Mounted at /content/gdrive


## Test
The trained models can be loaded and evaluated on a test set of 300 sentences.  
Nine metrics are returned: precision, recall, and SequenceMatcher ratio for each of marker, cause, and effect.

In [None]:
gdp = '/content/gdrive/My Drive/'
paths = [gdp + 'marker', gdp + 'cause', gdp + 'effect']
tokenizers = [AutoTokenizer.from_pretrained(paths[i]) for i in range(3)]
models = [AutoModelForQuestionAnswering.from_pretrained(paths[i]) for i in range(3)]

precisions, recalls, scores = [[], [], []], [[], [], []], [[], [], []]

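with open('test.txt', mode='r', encoding='utf-8') as f:
  lines = f.readlines()
# Strip the annotation characters (& marker, * cause, + effect) to get the raw input text.
texts = [s.replace('*', '').replace('+', '').replace('&', '') for s in lines]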

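for i, text in enumerate(texts):
  # The marker question is a fixed Persian prompt listing typical causal markers
  # (roughly: "because - result - cause - since - inference - in case that").
  mark = run_model(models[0], tokenizers[0], text, 'به دلیل این که - نتیجه - علت - زیرا - استنتاج - درصورتی که')
  # The predicted marker is then used as the question for the cause and effect models.
  caus = run_model(models[1], tokenizers[1], text, mark)
  effe = run_model(models[2], tokenizers[2], text, mark)
  # A predicted marker of '[CLS]' is treated as "no causal relation": all three answers stay empty.
  answer = [mark, caus, effe] if mark != '[CLS]' else ['', '', '']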

  print(i)
  print(lines[i], end='')
  parts = ['marker: ', 'cause: ', 'effect: ']

  for j, tchar in enumerate(['&', '*', '+']):
    p, r, s = get_p_and_r(lines[i], tchar, answer[j])
    precisions[j].append(p)
    recalls[j].append(r)
    scores[j].append(s)

    print(f"{parts[j]}{answer[j]}    precision: {p}    recall: {r}    score: {s}")
  
  print()
  

# Report the mean of each metric over all test sentences.
for j, part in enumerate(['marker', 'cause', 'effect']):
  print(part)
  print('mean precision:', np.mean(precisions[j]))
  print('mean recall:', np.mean(recalls[j]))
  print('mean score:', np.mean(scores[j]))
  print()
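
The helper `get_p_and_r` comes from the local `tools` module, which is not shown in this notebook. As a rough idea of what metrics of this kind could look like, here is a minimal sketch. It assumes that each gold span in a test line is enclosed by a pair of its tag character, and that precision and recall are computed over whitespace tokens. Both are assumptions, and `gold_span` and `span_metrics` are hypothetical names; the real helper may differ.

```python
from collections import Counter
from difflib import SequenceMatcher


def gold_span(line, tag):
    """Return the text between the first pair of `tag` characters, or '' if there is none."""
    parts = line.split(tag)
    return parts[1].strip() if len(parts) >= 3 else ''


def span_metrics(line, tag, predicted):
    """Token-level precision and recall plus the character-level SequenceMatcher ratio."""
    gold = gold_span(line, tag)
    gold_tokens, pred_tokens = gold.split(), predicted.split()
    if not gold_tokens and not pred_tokens:
        # Choice made in this sketch: two empty spans count as a perfect match.
        return 1.0, 1.0, 1.0
    # Multiset intersection counts each shared token at most as often as it occurs in both spans.
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(gold_tokens) if gold_tokens else 0.0
    ratio = SequenceMatcher(None, gold, predicted).ratio()
    return precision, recall, ratio


line = "&because& *it rained* +the road flooded+"
p, r, s = span_metrics(line, '*', "it rained heavily")
print(p, r, s)
```

In this example the prediction contains one extra token, so recall is perfect while precision and the SequenceMatcher ratio drop below 1.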