
# QA as sentence classification

## Task 1:

* Familiarize yourself with the whole notebook and what is going on there
* At the very bottom there is a place where you can simply run a trained model
* Print some interesting correct and incorrect classifications (e.g. cases where the model failed to answer "yes" - false negatives, or cases where it overpredicted a "yes" - false positives)
* Do these make sense? Would you do better?

## Task 2:

* Try to run the training yourself on the dataset
* Also add precision and recall to the reported metrics, so you can better see what happens to the positive class
* Make sure you save the trained models

## Task 3:

* You will notice that the basic model is heavily skewed towards predicting the (more common) negative class
* The correct way to address this is by introducing class weights into training
* Try, for example, training a model with a weight of 0.9 on the positive class and 0.1 on the negative class. Does anything change?
* Then turn the weights around and give the positive class a weight of 0.1. Will the model ever predict the positive class?
* Google will help you with how to add class weights into the training process (it involves overriding the compute_loss() method of the trainer)

## Task 4:

* If you want, once you can train your own model, you can repeat the exercise on Finnish
* The Finnish QA data is pointed to below

In [None]:
!pip3 install evaluate transformers datasets

In [None]:
import transformers
import datasets
import evaluate

# SQuAD v2

* First, we will load and explore the SQuAD v2 dataset
* It is also quite important to read through the SQuAD v2 paper to understand the finer points of the data
* `squad_v2` is the original English dataset
* `TurkuNLP/squad_v2_fi` is a machine-translated version produced by TurkuNLP

In [None]:
#squad_dataset_full=datasets.load_dataset("TurkuNLP/squad_v2_fi")
squad_dataset_full=datasets.load_dataset("squad_v2")
squad_dataset_full=squad_dataset_full.shuffle()

Found cached dataset squad_v2 (/users/ginter/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d)



In [None]:
for item in squad_dataset_full["train"].select(range(5)):
    print(item)
    print()


{'id': '57321f24e17f3d14004226b0', 'title': 'Party_leaders_of_the_United_States_House_of_Representatives', 'context': 'In brief, there is disagreement among historical analysts as to the exact time period when the minority leadership emerged officially as a party position. Nonetheless, it seems safe to conclude that the position emerged during the latter part of the 19th century, a period of strong party organization and professional politicians. This era was "marked by strong partisan attachments, resilient patronage-based party organizations, and...high levels of party voting in Congress." Plainly, these were conditions conducive to the establishment of a more highly differentiated House leadership structure.', 'question': 'What party characteristics emerged in the house in late 19th century?', 'answers': {'text': ['strong party organization and professional politicians'], 'answer_start': [276]}}

{'id': '572fb50004bcaa1900d76c1f', 'title': 'Armenia', 'context': "Gorbachev's inabilit

# QA as a classification task

* As a warm-up problem, let us cast QA as a classification problem
* The task is: does the string at hand contain the answer to the question?
* This is not how you would do QA but it is a good start, and an interesting problem in its own right
* Maybe we can give ourselves some slack, and go sentence-by-sentence
* So, we need to split the context into sentences, and then assemble a yes/no classification dataset
* But how do we split the text into sentences?
* UDPipe is an old but trusty library for this
* Trained models: https://ufal.mff.cuni.cz/udpipe/1/models#universal_dependencies_25_models
* Go familiarize yourself with that tool and models!
* For Finnish, make sure you use the TDT model (Made in Turku), which is the best
* For English, the EWT (English Web Treebank) model is a good choice

In [None]:
!pip3 install ufal.udpipe



In [None]:
!wget -O english-ewt.udpipe 'https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3131/english-ewt-ud-2.5-191206.udpipe?sequence=17&isAllowed=y'

--2023-01-31 11:32:18--  https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3131/english-ewt-ud-2.5-191206.udpipe?sequence=17&isAllowed=y
Resolving lindat.mff.cuni.cz (lindat.mff.cuni.cz)... 195.113.20.140
Connecting to lindat.mff.cuni.cz (lindat.mff.cuni.cz)|195.113.20.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16309608 (16M) [application/octet-stream]
Saving to: 'english-ewt.udpipe'


2023-01-31 11:32:19 (30.4 MB/s) - 'english-ewt.udpipe' saved [16309608/16309608]



# Running UDPipe

* This is surprisingly easy, once you figure it out
* The documentation (from which I figured it out) is here:
  * https://ufal.mff.cuni.cz/udpipe/1/users-manual#run_udpipe_tokenizer
  * https://ufal.mff.cuni.cz/udpipe/1/api-reference#pipeline

In [None]:
import ufal.udpipe
udpipemodel=ufal.udpipe.Model.load("english-ewt.udpipe")
tokenizer=ufal.udpipe.Pipeline(udpipemodel,"tokenizer=ranges","none","none","conllu")

In [None]:
print(tokenizer.process("I have a dog. The dog is cute. And brown at that."))

# newdoc
# newpar
# sent_id = 1
# text = I have a dog.
1	I	_	_	_	_	_	_	_	TokenRange=0:1
2	have	_	_	_	_	_	_	_	TokenRange=2:6
3	a	_	_	_	_	_	_	_	TokenRange=7:8
4	dog	_	_	_	_	_	_	_	SpaceAfter=No|TokenRange=9:12
5	.	_	_	_	_	_	_	_	TokenRange=12:13

# sent_id = 2
# text = The dog is cute.
1	The	_	_	_	_	_	_	_	TokenRange=14:17
2	dog	_	_	_	_	_	_	_	TokenRange=18:21
3	is	_	_	_	_	_	_	_	TokenRange=22:24
4	cute	_	_	_	_	_	_	_	SpaceAfter=No|TokenRange=25:29
5	.	_	_	_	_	_	_	_	TokenRange=29:30

# sent_id = 3
# text = And brown at that.
1	And	_	_	_	_	_	_	_	TokenRange=31:34
2	brown	_	_	_	_	_	_	_	TokenRange=35:40
3	at	_	_	_	_	_	_	_	TokenRange=41:43
4	that	_	_	_	_	_	_	_	SpaceAfter=No|TokenRange=44:48
5	.	_	_	_	_	_	_	_	SpacesAfter=\n|TokenRange=48:49




# Find the sentence with the answer
* Let's take a shortcut and simply use the `text=` field to find sentences which have the answer
* Of course this might lead to spurious hits where the string is repeated in several sentences, but acts as the correct answer only in one
* The right solution would be to use the `TokenRange` data to find the correct sentence that has the right answer
* I leave that to you as an extra exercise if you wish
* All you need is to parse the output of UDPipe, which takes one extra for-loop, really
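
As a starting point for that extra exercise, here is a minimal sketch of the `TokenRange` approach. It assumes the pipeline was created with `tokenizer=ranges` as above, and that the ranges are character offsets into the text passed to `process()`, so they can be compared directly with SQuAD's `answer_start`. The function names `sentence_spans` and `sentence_with_answer` are made up for this illustration, and the sample string is hand-written in the shape of the output shown earlier.

```python
import re

def sentence_spans(conllu):
    """Collect (text, start, end) for each sentence in CoNLL-U output
    whose last (MISC) column carries TokenRange=start:end."""
    spans = []
    text, start, end = None, None, None
    for line in conllu.split("\n"):
        if line.startswith("# text = "):
            text = line[len("# text = "):]
        elif line and not line.startswith("#"):
            # token line: the first token gives the sentence start,
            # the last token seen so far gives the sentence end
            m = re.search(r"TokenRange=(\d+):(\d+)", line.split("\t")[-1])
            if m:
                if start is None:
                    start = int(m.group(1))
                end = int(m.group(2))
        elif not line and text is not None:
            # a blank line closes the current sentence
            spans.append((text, start, end))
            text, start, end = None, None, None
    if text is not None:  # input did not end with a blank line
        spans.append((text, start, end))
    return spans

def sentence_with_answer(conllu, answer_start):
    """Return the text of the sentence whose range covers answer_start, or None."""
    for text, start, end in sentence_spans(conllu):
        if start is not None and start <= answer_start < end:
            return text
    return None

# Hand-written sample in the same shape as the UDPipe output shown earlier
sample = "\n".join([
    "# sent_id = 1",
    "# text = I have a dog.",
    "1\tI\t_\t_\t_\t_\t_\t_\t_\tTokenRange=0:1",
    "2\thave\t_\t_\t_\t_\t_\t_\t_\tTokenRange=2:6",
    "3\ta\t_\t_\t_\t_\t_\t_\t_\tTokenRange=7:8",
    "4\tdog\t_\t_\t_\t_\t_\t_\t_\tSpaceAfter=No|TokenRange=9:12",
    "5\t.\t_\t_\t_\t_\t_\t_\t_\tTokenRange=12:13",
    "",
    "# sent_id = 2",
    "# text = The dog is cute.",
    "1\tThe\t_\t_\t_\t_\t_\t_\t_\tTokenRange=14:17",
    "2\tdog\t_\t_\t_\t_\t_\t_\t_\tTokenRange=18:21",
    "3\tis\t_\t_\t_\t_\t_\t_\t_\tTokenRange=22:24",
    "4\tcute\t_\t_\t_\t_\t_\t_\t_\tSpaceAfter=No|TokenRange=25:29",
    "5\t.\t_\t_\t_\t_\t_\t_\t_\tTokenRange=29:30",
    "",
])

# "dog" occurs in both sentences; only the offset tells them apart
print(sentence_with_answer(sample, 18))  # The dog is cute.
```

With plain string matching, the answer "dog" would mark both sentences as positive; matching on the offset marks only the sentence the annotation points to.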

In [None]:
import tqdm
def get_sentences(parsed):
    # gather the text= lines with the sentences
    sents=[line.replace("# text = ","") for line in parsed.split("\n") if line.startswith("# text = ")]
    return sents

def qa_to_binary(example,tokenizer):
    context=example["context"]
    question=example["question"]
    sentences=get_sentences(tokenizer.process(context))
    result=[]
    #compare every sentence with every answer, when you find an answer, you have an example
    for sent in sentences:
        for answer in example["answers"]["text"]:
            if answer in sent:
                result.append({"question":question,"context":sent,"label":1,"answer":answer})
                break
        else: #we found no answers, so this sentence is then 0
            result.append({"question":question,"context":sent,"label":0,"answer":None})
    return result

train_processed=[]
for item in tqdm.tqdm(squad_dataset_full["train"].select(range(50000))):
    train_processed.extend(qa_to_binary(item,tokenizer))
test_processed=[]
for item in tqdm.tqdm(squad_dataset_full["validation"].select(range(2000))):
    test_processed.extend(qa_to_binary(item,tokenizer))

                      

        

# Save the data

* Now the usual trick of saving the data in a nice, processed form as JSON (a single JSON list, written with `json.dump`)
* And then load as a HF Dataset

In [None]:
import json
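# ensure_ascii=False writes non-ASCII characters as-is, so fix the file encoding explicitly
with open("squad_2_binarized_train.json","wt",encoding="utf-8") as f:
    json.dump(train_processed,f,ensure_ascii=False,indent=2)
with open("squad_2_binarized_test.json","wt",encoding="utf-8") as f:
    json.dump(test_processed,f,ensure_ascii=False,indent=2)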

In [None]:
#here you have a version you can download:
dset=datasets.load_dataset("json",data_files={"train":"http://dl.turkunlp.org/TKO_8964_2023/squad_2_binarized_train.json","test":"http://dl.turkunlp.org/TKO_8964_2023/squad_2_binarized_test.json"})
#dset=datasets.load_dataset("json",data_files={"train":"squad_2_binarized_train.json","test":"squad_2_binarized_test.json"},download_mode="force_redownload")

Using custom data configuration default-5ff78b1b6d565ae6


Downloading and preparing dataset json/default (download: 74.41 MiB, generated: 58.22 MiB, post-processed: Unknown size, total: 132.62 MiB) to /users/ginter/.cache/huggingface/datasets/json/default-5ff78b1b6d565ae6/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab...



Generating train split:   0%|          | 0/259266 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/10703 [00:00<?, ? examples/s]

Dataset json downloaded and prepared to /users/ginter/.cache/huggingface/datasets/json/default-5ff78b1b6d565ae6/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab. Subsequent calls will reuse this data.



In [None]:
dset=dset.shuffle()

# Train the classifier



In [None]:
MODEL = 'bert-base-cased'
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL)

* The tokenizer now needs a pair of texts (the question and the context)
* For BERT, the tokenizer then sets the token type IDs so that the model can tell the two texts apart
* For Finnish you can grab the Finnish BERT model from huggingface (FinBERT)
* The two texts are passed as `text` and `text_pair` (the naming is a bit confusing)
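
To make the pair layout concrete, here is a toy sketch of what a BERT-style pair encoding looks like. It is not the real tokenizer: a whitespace split stands in for WordPiece, and `encode_pair` is a made-up helper. Only the layout (`[CLS] text [SEP] text_pair [SEP]`, with token type 0 for the first segment and 1 for the second) reflects what BERT expects.

```python
def encode_pair(question, context):
    """Toy illustration of the BERT pair layout.
    Whitespace splitting stands in for the real WordPiece tokenizer."""
    q_tokens = question.split()
    c_tokens = context.split()
    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + c_tokens + ["[SEP]"]
    # type 0 covers [CLS], the question and the first [SEP];
    # type 1 covers the context and the final [SEP]
    token_type_ids = [0] * (len(q_tokens) + 2) + [1] * (len(c_tokens) + 1)
    return tokens, token_type_ids

tokens, type_ids = encode_pair("Who has a dog?", "I have a dog.")
for tok, tid in zip(tokens, type_ids):
    print(tid, tok)
```

Running the real tokenizer on one example and inspecting `input_ids` and `token_type_ids` should show the same pattern, just with subword pieces instead of whole words.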

In [None]:
def tokenize(example):
    return tokenizer(text=example["question"],text_pair=example["context"], truncation=True)

dataset = dset.map(tokenize)

Loading cached processed dataset at shuffled_binarized_squad2_dataset/train/cache-35d9c69123ad365c.arrow
Loading cached processed dataset at shuffled_binarized_squad2_dataset/test/cache-94d56838e5efe70a.arrow


In [None]:
trainer_args = transformers.TrainingArguments(
    output_dir='checkpoints',
    evaluation_strategy='steps',
    logging_strategy='steps',
    load_best_model_at_end=True,
    eval_steps=250,
    logging_steps=250,
    learning_rate=0.000015,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    max_steps=5000,
    save_steps=250,
    save_total_limit=6 #I will have an early stopping with patience 5, so max 6 checkpoints sounds reasonable
)


# How to evaluate?

* The data is imbalanced
* The positive class is much rarer than the negative class
* High accuracy does not tell the whole story
* It is good to also report precision and recall, so we can gauge what happens to the positive class
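
A minimal sketch of such a metric function, assuming the same `(outputs, labels)` pair that the Trainer hands to `compute_metrics`, and that label 1 is the positive class. The name `compute_metrics_prf` is made up; the precision and recall are computed by hand with NumPy here to keep the example self-contained, though the `evaluate` library could be used instead.

```python
import numpy as np

def compute_metrics_prf(outputs_and_labels):
    """Accuracy plus precision and recall of the positive class (label 1)."""
    outputs, labels = outputs_and_labels
    predictions = np.asarray(outputs).argmax(axis=-1)  # index of the "winning" label
    labels = np.asarray(labels)
    tp = int(((predictions == 1) & (labels == 1)).sum())
    fp = int(((predictions == 1) & (labels == 0)).sum())
    fn = int(((predictions == 0) & (labels == 1)).sum())
    # guard against division by zero when the positive class is never predicted / never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = float((predictions == labels).mean())
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Made-up logits for a model that always predicts the negative class
always_negative = np.array([[3.0, -1.0]] * 5)
gold = np.array([0, 0, 0, 0, 1])
print(compute_metrics_prf((always_negative, gold)))
```

In this made-up example the accuracy is 0.8 while precision and recall are both 0.0, which is exactly the situation accuracy alone would hide.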

In [None]:
accuracy = evaluate.load('accuracy')


def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = outputs.argmax(axis=-1) # pick index of "winning" label
    acc=accuracy.compute(predictions=predictions, references=labels)
    return acc


* Let's also add EarlyStopping

In [None]:
estopping_cback=transformers.EarlyStoppingCallback(early_stopping_patience=5)

# Class imbalance

* One way to deal with imbalanced data is to give the minority class higher weight during training
* I.e. the loss is weighted by class-dependent weights
* A way to implement this can be copied straight from the HF documentation
* You can invent some weights, or maybe you can use the formula/code from e.g. here https://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html
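
To see what the weighting does numerically, here is a small NumPy sketch. It assumes the "balanced" formula from the scikit-learn page linked above, `n_samples / (n_classes * bincount(y))`, and a weighted-mean cross-entropy (sum of weighted per-example losses divided by the sum of the weights). The function names are made up for illustration, and this is not the Trainer integration itself: inside an overridden `compute_loss()` the same idea would be expressed with a weighted loss from the deep learning framework.

```python
import numpy as np

def balanced_class_weights(labels, num_classes=2):
    """Weights by the 'balanced' formula: n_samples / (n_classes * class_count).
    Assumes every class occurs at least once in labels."""
    counts = np.bincount(np.asarray(labels), minlength=num_classes)
    return len(labels) / (num_classes * counts)

def weighted_cross_entropy(logits, labels, class_weights):
    """Cross-entropy where each example is weighted by the weight of its gold class."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    # numerically stable log-softmax
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    per_example = -log_probs[np.arange(len(labels)), labels]
    w = np.asarray(class_weights, dtype=float)[labels]
    return float((w * per_example).sum() / w.sum())

# Made-up data: three negatives and one positive
labels = [0, 0, 0, 1]
weights = balanced_class_weights(labels)
print(weights)  # the rare positive class gets the larger weight

# A model that leans towards the negative class on every example
logits = [[2.0, 0.0]] * 4
print(weighted_cross_entropy(logits, labels, [1.0, 1.0]))  # unweighted
print(weighted_cross_entropy(logits, labels, weights))     # missed positive costs more
```

With the weights in place, always predicting the negative class is no longer a cheap way to lower the loss, which is the effect Task 3 asks you to observe.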


In [None]:
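# A little housekeeping
# to help with various memory leaks

import gc
import torch  # needed for torch.cuda.empty_cache() below
# drop references left over from a previous run, if there are any
try:
    del trainer
except NameError:
    pass
try:
    del model
except NameError:
    pass
gc.collect()
torch.cuda.empty_cache()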


In [None]:
#This is something I needed on CSC's puhti
#why, I have no idea
try:
    import mlflow
    mlflow.end_run()
    mlflow.start_run()
except:
    pass

model = transformers.AutoModelForSequenceClassification.from_pretrained(
    MODEL,
    num_labels=2
 )

#So, somewhere here would come the code which does
#the class weighting

trainer = transformers.Trainer(
    model=model,
    args=trainer_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    compute_metrics=compute_accuracy,
    tokenizer=tokenizer,
    callbacks=[estopping_cback], 
)

trainer.train()

In [None]:
trainer.save_model("english-binarized.model")

# Trained models

* I left the trained models for you here, in case you want them
* http://dl.turkunlp.org/TKO_8964_2023/

# Test the model

In [None]:
#try also the -weighted model from the url above
model=transformers.AutoModelForSequenceClassification.from_pretrained("english-binarized.model")
tokenizer=transformers.AutoTokenizer.from_pretrained("english-binarized.model",truncation=True,padding=True)

In [None]:
dset=datasets.load_dataset("json",data_files={"train":"http://dl.turkunlp.org/TKO_8964_2023/squad_2_binarized_train.json","test":"http://dl.turkunlp.org/TKO_8964_2023/squad_2_binarized_test.json"})

In [None]:
import torch
pipe=transformers.TextClassificationPipeline(model=model,tokenizer=tokenizer,device=0 if torch.cuda.is_available() else -1) #GPU 0 if there is one, otherwise CPU

In [None]:
# Hmm, this is not the right way to go about it, perhaps, but the KeyPairDataset
# I could not get working, even though that would be the right way to do it
# {"text":[listoftexts],"text_pair":[listoftexts]} should have worked according to docs,
# but did not in reality :D
res=pipe([{"text":e["question"],"text_pair":e["context"]} for e in dset["test"].select(range(100))])

In [None]:
# Let's print out some interesting misclassifications

for prediction,datapoint in zip(res,dset["test"].select(range(100))):
    if prediction["label"]=="LABEL_0" and datapoint["label"]==1:
        print(datapoint["question"])
        print(datapoint["context"])
        print()

What type of hypersensitivity is associated with allergies?
Type IV reactions are involved in many autoimmune and infectious diseases, but may also involve contact dermatitis (poison ivy).

Who first described dynamic equilibrium?
Simple experiments showed that Galileo's understanding of the equivalence of constant velocity and rest were correct.

