# Natural Language Processing Project
## NLP Course @ Politecnico di Milano 2024/2025 - Prof. Mark Carman
### Topic 8: Medical Question Answering
Dataset:
* PubMedQA [Link](https://huggingface.co/datasets/qiaojin/PubMedQA)

Reference paper:
* PubMedQA: A Dataset for Biomedical Research Question Answering [Link](https://arxiv.org/pdf/1909.06146)

## Group members:

* Ketrin Mehmeti
* Giulia Ghiazza
* Leonardo Giorgio Franco
* Edoardo Franco Mattei
* Alessandro Epifania

## Introduction
 
PubMedQA is a question answering (QA) dataset for the biomedical domain, built from the abstracts of scientific articles available on PubMed. Its main purpose is to assess how well intelligent systems reason and draw inferences over natural language, particularly biomedical research texts, which often require processing quantitative content.

A typical instance in PubMedQA consists of the following components:

* A question, which can either be the original title of a research paper or derived from it. For example: "Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?"

* A context, which is the abstract corresponding to the question, excluding its conclusion.

* A long answer, represented by the conclusion of the abstract, which is expected to answer the research question.

* A short answer ("yes", "no", or "maybe") that summarizes the conclusion. For the example question above, the long answer is "(Conclusion) Our study indicated that preoperative statin therapy seems to reduce AF development after CABG", and the short answer is "yes".
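
As a rough illustration, one instance can be pictured as a Python dictionary. The field names follow the Hugging Face release of the dataset; the context text is a placeholder rather than the real abstract, and `model_input` and `target` are names introduced only for this sketch.

```python
# Sketch of one PubMedQA instance; the context text is a placeholder.
instance = {
    "question": (
        "Do preoperative statins reduce atrial fibrillation "
        "after coronary artery bypass grafting?"
    ),
    "context": {"contexts": ["<abstract sections, conclusion excluded>"]},
    "long_answer": (
        "(Conclusion) Our study indicated that preoperative statin therapy "
        "seems to reduce AF development after CABG."
    ),
    "final_decision": "yes",
}

# In the reasoning-required setting, a model sees the question and the context
# (not the long answer) and must predict the short answer.
model_input = instance["question"] + " " + " ".join(instance["context"]["contexts"])
target = instance["final_decision"]
print(target)
```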

The PubMedQA dataset is divided into three subsets:

* PQA-L (Labeled): Contains 1k manually annotated instances with yes/no/maybe answers. These annotations were made in two modes: "reasoning-free," where the annotator had access to the long answer, and "reasoning-required," where the annotator could only rely on the context.

* PQA-U (Unlabeled): Consists of 61.2k unlabeled instances, made up of PubMed articles with question-form titles and structured abstracts.

* PQA-A (Artificial): Includes 211.3k artificially generated instances, where article titles in statement form are converted into questions, and yes/no answers are automatically assigned based on the presence or absence of negations in the original title.
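
The negation-based labeling of PQA-A can be pictured with a toy heuristic. This is only a sketch of the idea: the cue list `NEGATION_CUES`, the function `heuristic_label`, and the two example titles are assumptions made for illustration, not the procedure or data used by the PubMedQA authors.

```python
import re

# Illustrative cue list (an assumption, not the authors' actual list).
NEGATION_CUES = {"no", "not", "never", "without", "neither", "nor", "cannot"}


def heuristic_label(statement_title: str) -> str:
    """Return "no" if the title contains a negation cue, otherwise "yes"."""
    tokens = re.findall(r"[a-z]+", statement_title.lower())
    return "no" if any(tok in NEGATION_CUES for tok in tokens) else "yes"


# Hypothetical statement-form titles, invented for this example.
print(heuristic_label("Preoperative statins reduce atrial fibrillation after surgery"))
print(heuristic_label("Vitamin supplementation does not improve recovery time"))
```

A heuristic of this kind is cheap to apply at scale, which is why the artificial subset is so much larger than the labeled one, but it produces noisier labels than manual annotation and never yields "maybe".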

A key feature of PubMedQA is that each context was written to directly answer its question, and both were written by the same authors, which ensures a strong relationship between question and context. Because answering often requires reasoning over the quantitative content of the abstract, PubMedQA is a suitable benchmark for testing the scientific reasoning capabilities of machine reading comprehension models.

## Preliminary initialization

Here we list everything that needs to be installed to run the notebook without problems.

If there are files to download, we will also add the link to the repository.

<mark style="background-color: white; color: black;">
pip install datasets  <br>
pip install pandas pyarrow <br>
pip install transformers
</mark>


## Libraries

In [44]:
from datasets import load_dataset
from transformers import AutoTokenizer

### Loading the dataset

Each object returned by `load_dataset` is a `DatasetDict`, which can contain several splits such as "train", "validation", and "test". For PubMedQA, each subset exposes only a "train" split.

In [22]:
# Load the three PubMedQA subsets: labeled, unlabeled, and artificial
dataset_labeled = load_dataset("qiaojin/PubMedQA", "pqa_labeled")
dataset_unlabeled = load_dataset("qiaojin/PubMedQA", "pqa_unlabeled")
dataset_artificial = load_dataset("qiaojin/PubMedQA", "pqa_artificial")

In [None]:
print("Labeled dataset:", dataset_labeled)
print("Artificial dataset:", dataset_artificial)
print("Unlabeled dataset:", dataset_unlabeled)

# Notice that the feature final_decision is missing in the Unlabeled dataset, reflecting the fact 
# that these examples do not have a definitive yes/no/maybe label.

Labeled dataset: DatasetDict({
    train: Dataset({
        features: ['pubid', 'question', 'context', 'long_answer', 'final_decision'],
        num_rows: 1000
    })
})
Artificial dataset: DatasetDict({
    train: Dataset({
        features: ['pubid', 'question', 'context', 'long_answer', 'final_decision'],
        num_rows: 211269
    })
})
Unlabeled dataset: DatasetDict({
    train: Dataset({
        features: ['pubid', 'question', 'context', 'long_answer'],
        num_rows: 61249
    })
})
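
The schema difference between the subsets can also be checked programmatically. The sketch below hard-codes the feature lists printed above so that it runs on its own; with the datasets loaded, the same lists should be available as `dataset_labeled["train"].column_names` and `dataset_unlabeled["train"].column_names`.

```python
# Feature lists copied from the printed DatasetDict summaries.
labeled_features = ["pubid", "question", "context", "long_answer", "final_decision"]
unlabeled_features = ["pubid", "question", "context", "long_answer"]

# Columns present in the labeled subset but absent from the unlabeled one.
missing_in_unlabeled = set(labeled_features) - set(unlabeled_features)
print(missing_in_unlabeled)  # {'final_decision'}
```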


In [43]:
print(dataset_labeled.keys())
print(dataset_artificial.keys())
print(dataset_unlabeled.keys())

dict_keys(['train'])
dict_keys(['train'])
dict_keys(['train'])


## Preliminary analysis

In [None]:
# Check that every element is a string; otherwise the tokenizer does not work

print(type( dataset_labeled['train']['question'] ))

for i, x in enumerate(dataset_labeled['train']['question']):
    if not isinstance(x, str):
        print(f"Non-string element at position {i}: {x} (type {type(x)})")
        break

print(type( dataset_labeled['train']['context'] ))

for i, x in enumerate(dataset_labeled['train']['context']):
    if not isinstance(x, str):
        print(f"Non-string element at position {i}: {x} (type {type(x)})")
        break

print(type( dataset_labeled['train']['long_answer'] ))

for i, x in enumerate(dataset_labeled['train']['long_answer']):
    if not isinstance(x, str):
        print(f"Non-string element at position {i}: {x} (type {type(x)})")
        break

<class 'list'>
<class 'list'>
Non-string element at position 0: {'contexts': ['Programmed cell death (PCD) is the regulated death of cells within an organism. The lace plant (Aponogeton madagascariensis) produces perforations in its leaves through PCD. The leaves of the plant consist of a latticework of longitudinal and transverse veins enclosing areoles. PCD occurs in the cells at the center of these areoles and progresses outwards, stopping approximately five cells from the vasculature. The role of mitochondria during PCD has been recognized in animals; however, it has been less studied during PCD in plants.', 'The following paper elucidates the role of mitochondrial dynamics during developmentally regulated PCD in vivo in A. madagascariensis. A single areole within a window stage leaf (PCD is occurring) was divided into three areas based on the progression of PCD; cells that will not undergo PCD (NPCD), cells in early stages of PCD (EPCD), and cells in late stages of PCD (LPCD). 
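
The output above shows that each `context` entry is a dictionary whose `contexts` key holds a list of abstract sections rather than a single string. A minimal sketch of one way to flatten it follows; the helper name `flatten_context` and the shortened `example_context` are introduced here only for illustration.

```python
def flatten_context(context: dict) -> str:
    """Join the abstract sections stored under "contexts" into one string."""
    return " ".join(context["contexts"])


# Shortened stand-in for one record's context field.
example_context = {
    "contexts": [
        "Programmed cell death (PCD) is the regulated death of cells within an organism.",
        "The following paper elucidates the role of mitochondrial dynamics "
        "during developmentally regulated PCD in vivo in A. madagascariensis.",
    ]
}

flat = flatten_context(example_context)
print(type(flat).__name__, len(flat))
```

Joining with a single space loses the section boundaries; if those matter for a later step, a different separator could be used instead.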

In [None]:
# Tokenization

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Extract the questions, contexts, and long answers from the datasets.
# list(...) ensures the tokenizer receives a plain Python list of strings.
questions_labeled = list(dataset_labeled['train']['question'])
questions_unlabeled = list(dataset_unlabeled['train']['question'])
questions_artificial = list(dataset_artificial['train']['question'])

# Each context is a dict whose "contexts" entry is a list of abstract sections,
# not a single string. Passing those nested lists to the tokenizer raised
# "TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]",
# so the sections are joined into one string per example.
context_labeled = [" ".join(x["contexts"]) for x in dataset_labeled['train']['context']]
context_unlabeled = [" ".join(x["contexts"]) for x in dataset_unlabeled['train']['context']]
context_artificial = [" ".join(x["contexts"]) for x in dataset_artificial['train']['context']]

long_answers_labeled = list(dataset_labeled['train']['long_answer'])
long_answers_unlabeled = list(dataset_unlabeled['train']['long_answer'])
long_answers_artificial = list(dataset_artificial['train']['long_answer'])

# Tokenize the questions, contexts, and long answers
# Note: We are not adding special tokens, padding, or truncating the sequences here because we want to keep the original lengths.
tokenized_questions_labeled = tokenizer(questions_labeled, add_special_tokens=False, padding=False, truncation=False)["input_ids"]
tokenized_questions_unlabeled = tokenizer(questions_unlabeled, add_special_tokens=False, padding=False, truncation=False)["input_ids"]
tokenized_questions_artificial = tokenizer(questions_artificial, add_special_tokens=False, padding=False, truncation=False)["input_ids"]

tokenized_contexts_labeled = tokenizer(context_labeled, add_special_tokens=False, padding=False, truncation=False)["input_ids"]
tokenized_contexts_unlabeled = tokenizer(context_unlabeled, add_special_tokens=False, padding=False, truncation=False)["input_ids"]
tokenized_contexts_artificial = tokenizer(context_artificial, add_special_tokens=False, padding=False, truncation=False)["input_ids"]

tokenized_long_answers_labeled = tokenizer(long_answers_labeled, add_special_tokens=False, padding=False, truncation=False)["input_ids"]
tokenized_long_answers_unlabeled = tokenizer(long_answers_unlabeled, add_special_tokens=False, padding=False, truncation=False)["input_ids"]
tokenized_long_answers_artificial = tokenizer(long_answers_artificial, add_special_tokens=False, padding=False, truncation=False)["input_ids"]
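
Since the sequences are tokenized without padding or truncation in order to keep their original lengths, a natural next step is to summarize those lengths. The sketch below assumes the tokenizer output is a list of token-ID lists, one per text; `length_stats` and `toy_input_ids` are hypothetical names and values used only so the example runs on its own.

```python
from statistics import mean, median


def length_stats(tokenized_texts, limit=512):
    """Summarize token counts for a list of token-ID lists.

    `limit` defaults to 512, the maximum input length of bert-base-uncased.
    """
    lengths = [len(ids) for ids in tokenized_texts]
    return {
        "min": min(lengths),
        "median": median(lengths),
        "mean": mean(lengths),
        "max": max(lengths),
        "over_limit": sum(length > limit for length in lengths),
    }


# Toy stand-in for tokenizer(...)["input_ids"]; the IDs are arbitrary.
toy_input_ids = [[101, 102, 103], [104, 105], [106, 107, 108, 109, 110]]
stats = length_stats(toy_input_ids)
print(stats)
```

The `over_limit` count indicates how many texts would need truncation or chunking before being passed to a BERT-style model.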

