### Fundamentals of Natural Language Processing
# Negation and Uncertainty Detection using a Machine-Learning Based Approach

*Authors:*

> *Anna Blanco, Agustina Lazzati, Stanislav Bultaskii, Queralt Salvadó*

*Aims:*
> Our goal is to train several machine-learning models for each of two sub-tasks: detecting negation and uncertainty cues, and detecting the scopes of those cues. To do so, we follow the method described by *Enger, Velldal, and Øvrelid (2017)*, which takes a maximum-margin approach to negation detection.

*References:* 
<br>
> Enger, M., Velldal, E., & Øvrelid, L. (2017). *An open-source tool for negation detection: A maximum-margin approach*. Proceedings of the Workshop on Computational Semantics Beyond Events and Roles (SemBEaR), 64–69.

---

In [24]:
# Import necessary libraries and functions
import json
import pandas as pd
import spacy
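from langdetect import DetectorFactory, detect
from utils import preprocess_text

# langdetect is non-deterministic by default; fix the seed so language detection is reproducible
DetectorFactory.seed = 0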

## 1. Preprocessing Pipeline


First, we load all the data as a dictionary, remove unnecessary characters such as `*`, normalize whitespace, and convert everything into a format we can work with. To do this, we separate the texts from the prediction information.

In [25]:
def load_data(path):
    """Load JSON data and return raw dict"""
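    # Read as UTF-8 explicitly so accented characters load the same on every platform
    with open(path, encoding='utf-8') as f: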
        return json.load(f)

def preprocess_data(data):
    """Extract texts and predictions from loaded data"""
    texts = []
    predictions = []

    # Each row holds a 'data' dict (with the raw text) and its matching 'predictions' entry
    for record, prediction in zip(data['data'], data['predictions']):
        texts.append(preprocess_text(record['text'], keep_case=False))
        predictions.append(prediction)

    return texts, predictions

# Load train and test data
train_data = load_data('negacio_train_v2024.json')
test_data = load_data('negacio_test_v2024.json')

# Convert data into a DataFrame for simplicity
train_df = pd.DataFrame(train_data)
test_df = pd.DataFrame(test_data)

# Preprocess it and obtain both the texts and the predictions
train_texts, train_preds = preprocess_data(train_df)
test_texts, test_preds = preprocess_data(test_df)
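
The `preprocess_text` helper is imported from `utils` and is not shown in this notebook. The following is only a minimal sketch of what such a helper might look like, assuming it does nothing more than the steps described above (dropping `*`, normalizing whitespace, optional lowercasing). The name `preprocess_text_sketch` is hypothetical.

```python
import re

def preprocess_text_sketch(text, keep_case=True):
    """Hypothetical stand-in for utils.preprocess_text (the real helper is not shown)."""
    text = text.replace('*', '')              # drop asterisk markers
    text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
    return text if keep_case else text.lower()

print(preprocess_text_sketch('  Paciente  *NO*\n fumador ', keep_case=False))
```

Cleaning of this kind can change the length of a text, so character offsets in the annotations only stay valid if they were computed on the cleaned text.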

Next, we define a function that extracts only the annotations from the predictions. These annotations will later be aligned with the token indices:

In [26]:
def extract_annotations_grouped(preds):
    """Extracts annotations from the predictions format, grouped per text"""
    all_annotations = []

    for pred in preds:
        text_annotations = []
        for result_entry in pred:
            for result in result_entry['result']:
                value = result['value']
                text_annotations.append({
                    'start': value['start'],
                    'end': value['end'],
                    'labels': value['labels']
                })
        all_annotations.append(text_annotations)

    return all_annotations

train_annotations = extract_annotations_grouped(train_preds)
test_annotations = extract_annotations_grouped(test_preds)

# Show an example of a single annotation information
print(train_annotations[0][0])

{'start': 449, 'end': 452, 'labels': ['NEG']}
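
To make the nesting that `extract_annotations_grouped` walks through easier to see, here is a small sketch on a toy record. The key names (`result`, `value`, `start`, `end`, `labels`) come from the code above; the values and the helper name `flatten_prediction` are invented for illustration.

```python
# Toy prediction for a single text; the values are made up for illustration
toy_pred = [
    {'result': [
        {'value': {'start': 0, 'end': 2, 'labels': ['NEG']}},
        {'value': {'start': 3, 'end': 10, 'labels': ['NSCO']}},
    ]}
]

def flatten_prediction(pred):
    """Collect the start, end, and labels of every result in one text's prediction."""
    return [
        {key: result['value'][key] for key in ('start', 'end', 'labels')}
        for entry in pred
        for result in entry['result']
    ]

toy_annotations = flatten_prediction(toy_pred)
print(toy_annotations)
```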


We will use `spacy` for tokenization and sentence splitting:

In [27]:
#nlp = spacy.load('xx_ent_wiki_sm') # supports both spanish and catalan

nlp_es = spacy.load("es_core_news_sm") # Spanish
nlp_ca = spacy.load("ca_core_news_sm") # Catalan

def tokenize_sentences(data):
    """Splits texts into tokens using spaCy"""
    # List to store resulting tokens and docs
    tokenized_data = []
    docs = []

    for text in data:
        # detect text language
        lang = detect(text)
        # use the Catalan model for Catalan texts and fall back to Spanish otherwise
        nlp = nlp_ca if lang == 'ca' else nlp_es
        doc = nlp(text)

        # split data into tokens, avoiding white space tokens
        tokens = [token.text for token in doc if not token.is_space]
        # add resulting list to the tokens list
        tokenized_data.append(tokens)
        docs.append(doc)

    return tokenized_data, docs

train_tokens, train_docs = tokenize_sentences(train_texts)
test_tokens, test_docs = tokenize_sentences(test_texts)

# Show an example of the tokenized text
print(train_tokens[0])

['nº', 'historia', 'clinica', ':', 'REDACTED', 'REDACTED', 'REDACTED', 'nºepisodi', ':', 'REDACTED', 'sexe', ':', 'home', 'data', 'de', 'naixement', ':', '16.05.1936', 'edat', ':', '82', 'anys', 'procedencia', 'cex', 'mateix', 'hosp', 'servei', 'urologia', 'data', "d'ingres", '24.07.2018', 'data', "d'alta", '25.07.2018', '08:54:04', 'ates', 'per', 'REDACTED', ',', 'REDACTED', ';', 'REDACTED', ',', 'REDACTED', 'informe', "d'alta", "d'hospitalitzacio", 'motiu', "d'ingres", 'paciente', 'que', 'ingresa', 'de', 'forma', 'programada', 'para', 'realizacion', 'de', 'uretrotomia', 'interna', '.', 'antecedents', 'alergia', 'a', 'penicilina', 'y', 'cloramfenicol', '.', 'no', 'habitos', 'toxicos', '.', 'antecedentes', 'medicos', ':', 'bloqueo', 'auriculoventricular', 'de', 'primer', 'grado', 'hipertension', 'arterial', '.', 'diverticulosis', 'extensa', 'insuficiencia', 'renal', 'cronica', 'colelitiasis', 'antecedentes', 'quirurgicos', ':', 'exeresis', 'de', 'lesiones', 'cutaneas', 'con', 'anestesi

Next, we map the character-level annotations (e.g. `'start': 347, 'end': 350`) to token indices:

In [None]:
def char_to_token_indices(docs, annotations_list):
    """Maps character-level annotation spans to token indices in the parsed spaCy docs"""
    all_token_annotations = []

    # Reuse the docs from tokenize_sentences so the token indices match them exactly
    # (this also avoids running language detection and parsing a second time)
    for doc, annotations in zip(docs, annotations_list):
        token_annotations = []

        for ann in annotations:
            start, end = ann['start'], ann['end']
            # Map character span to token span; 'contract' keeps only fully covered tokens
            span = doc.char_span(start, end, alignment_mode='contract')
            # Spans that cover no complete token are skipped
            if span:
                token_annotations.append((span.start, span.end, ann['labels'][0]))

        all_token_annotations.append(token_annotations)

    return all_token_annotations

train_token_annotations = char_to_token_indices(train_docs, train_annotations)
test_token_annotations = char_to_token_indices(test_docs, test_annotations)

print(train_token_annotations[0][0])

(68, 71, 'NSCO')
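
The effect of `alignment_mode='contract'` can be illustrated without spaCy. The sketch below uses a plain whitespace tokenizer, which is a simplification of spaCy's tokenization, and keeps only the tokens that lie entirely inside the character span. The helper names `token_offsets` and `contract_span` are hypothetical.

```python
import re

def token_offsets(text):
    """Whitespace tokenizer returning (start_char, end_char) for each token."""
    return [(m.start(), m.end()) for m in re.finditer(r'\S+', text)]

def contract_span(offsets, start, end):
    """Return (first_token, last_token_exclusive) for tokens fully inside [start, end), or None."""
    inside = [i for i, (s, e) in enumerate(offsets) if s >= start and e <= end]
    if not inside:
        return None
    return inside[0], inside[-1] + 1

text = 'no habitos toxicos'
offsets = token_offsets(text)
print(contract_span(offsets, 0, 2))    # the span matches the token 'no' exactly
print(contract_span(offsets, 3, 15))   # 'toxicos' is only partly covered, so it is left out
print(contract_span(offsets, 4, 8))    # no token fits entirely inside the span
```

The last case shows why some annotations can disappear during the mapping: when no token fits inside the character span, there is nothing to return, and the function above skips that annotation.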


*The indices in `train_token_annotations` and `test_token_annotations` now refer to tokens rather than characters (the end index is exclusive), and the label tells us whether the span is a negation or uncertainty cue or scope.*
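
One possible way to use these token-level spans is to turn them into one label per token. The sketch below assumes `(start, end, label)` tuples with an exclusive end index, as produced above. The function name `spans_to_token_labels`, the default label `'O'`, and the toy data are assumptions made for illustration, not part of the original pipeline.

```python
def spans_to_token_labels(n_tokens, token_annotations, default='O'):
    """Assign each token the label of the span covering it (end index exclusive)."""
    labels = [default] * n_tokens
    for start, end, label in token_annotations:
        # Later spans overwrite earlier ones where they overlap
        for i in range(start, min(end, n_tokens)):
            labels[i] = label
    return labels

toy_tokens = ['no', 'habitos', 'toxicos', '.']
toy_spans = [(0, 1, 'NEG'), (1, 3, 'NSCO')]
toy_labels = spans_to_token_labels(len(toy_tokens), toy_spans)
print(list(zip(toy_tokens, toy_labels)))
```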

Next, we extract the part-of-speech (PoS) tag, the dependency relation, and the syntactic head of each token.

In [None]:
def extract_token_info(docs):
    """Extracts text, PoS tag, dependency label, and head text for each token in a list of spaCy docs"""
    # List to store token information
    all_token_info = []

    # Iterate through each doc
    for doc in docs:
        token_info = []
        # Iterate through every token
        for token in doc:
            # add token info if the token is not a whitespace
            if not token.is_space:
                token_info.append([
                    token.text,
                    token.pos_,
                    token.dep_,
                    token.head.text
                ])
                
        all_token_info.append(token_info)
    
    return all_token_info

train_token_info = extract_token_info(train_docs)
test_token_info = extract_token_info(test_docs)

# Show an example
print(train_token_info[0])

[['nº', 'NOUN', 'det', 'historia'], ['historia', 'NOUN', 'ROOT', 'historia'], ['clinica', 'ADJ', 'amod', 'historia'], [':', 'PUNCT', 'punct', 'REDACTED'], ['REDACTED', 'PROPN', 'appos', 'historia'], ['REDACTED', 'PROPN', 'flat', 'REDACTED'], ['REDACTED', 'PROPN', 'flat', 'REDACTED'], ['nºepisodi', 'ADJ', 'amod', 'REDACTED'], [':', 'PUNCT', 'punct', 'REDACTED'], ['REDACTED', 'PROPN', 'appos', 'REDACTED'], ['sexe', 'PROPN', 'amod', 'REDACTED'], [':', 'PUNCT', 'punct', 'home'], ['home', 'PROPN', 'acl', 'REDACTED'], ['data', 'PROPN', 'flat', 'home'], ['de', 'ADP', 'case', 'naixement'], ['naixement', 'PROPN', 'flat', 'data'], [':', 'PUNCT', 'punct', 'edat'], ['16.05.1936', 'NUM', 'amod', 'edat'], ['edat', 'NOUN', 'obj', 'home'], [':', 'PUNCT', 'punct', 'data'], ['82', 'NUM', 'nummod', 'anys'], ['anys', 'PROPN', 'nsubj', 'data'], ['procedencia', 'PROPN', 'flat', 'anys'], ['cex', 'NOUN', 'flat', 'anys'], ['mateix', 'NOUN', 'amod', 'anys'], ['hosp', 'PROPN', 'flat', 'anys'], ['servei', 'PROPN'
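
The entries above store each head by its text, which becomes ambiguous when the same word occurs more than once. If head indices are kept instead, a dependency path from a token up to the root can be derived by following the head links. The sketch below assumes the spaCy convention that the root token is its own head; the toy parse is hand-written, not model output, and `path_to_root` is a hypothetical helper.

```python
def path_to_root(heads, deps, index):
    """Follow head links from a token up to the root, collecting dependency labels."""
    path = []
    seen = set()
    # Stop at the root (its own head) or if a cycle is detected
    while heads[index] != index and index not in seen:
        seen.add(index)
        path.append(deps[index])
        index = heads[index]
    path.append(deps[index])  # label of the token where the walk ended
    return path

# Hand-written toy parse of 'no presenta fiebre'
toy_heads = [1, 1, 1]                 # index of each token's head
toy_deps = ['advmod', 'ROOT', 'obj']  # dependency label of each token
print(path_to_root(toy_heads, toy_deps, 0))
print(path_to_root(toy_heads, toy_deps, 2))
```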

We then save the tokens, PoS tags, and dependencies in a **CoNLL-U style format**:

In [30]:
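def save_conllu(docs, file_path):
    """Saves a list of spaCy docs to a CoNLL-U style file, one block per sentence"""
    with open(file_path, 'w', encoding='utf-8') as f:
        for doc in docs:
            for sent in doc.sents:
                for token in sent:
                    # IDs and heads are 1-based and relative to the sentence; the root gets head 0
                    token_id = token.i - sent.start + 1
                    head_id = 0 if token.head == token else token.head.i - sent.start + 1
                    f.write(f"{token_id}\t{token.text}\t{token.lemma_}\t{token.pos_}\t"
                            f"{token.tag_}\t_\t{head_id}\t{token.dep_}\t_\t_\n")
                f.write("\n")  # Blank line between sentences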


# Save train and test docs to CoNLL-U format
save_conllu(train_docs, 'train_data.conllu')
save_conllu(test_docs, 'test_data.conllu')
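
To check the saved files, the ten tab-separated columns can be read back with plain Python. The sketch below assumes the column order written by `save_conllu` above; the function name `read_conllu` and the sample lines are made up for illustration.

```python
def read_conllu(lines):
    """Parse CoNLL-U style lines into blocks, each a list of column dicts."""
    columns = ['id', 'form', 'lemma', 'upos', 'xpos',
               'feats', 'head', 'deprel', 'deps', 'misc']
    sentences, current = [], []
    for line in lines:
        line = line.rstrip('\n')
        if not line:                      # a blank line closes the current block
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith('#'):    # skip comment lines
            current.append(dict(zip(columns, line.split('\t'))))
    if current:
        sentences.append(current)
    return sentences

# Hand-written sample in the same column layout
sample = ("1\tno\tno\tADV\tADV\t_\t2\tadvmod\t_\t_\n"
          "2\tpresenta\tpresentar\tVERB\tVERB\t_\t0\tROOT\t_\t_\n"
          "\n")
parsed = read_conllu(sample.splitlines(keepends=True))
print(parsed[0][1]['form'], parsed[0][1]['head'])
```

The same function can be applied to an open file handle, since iterating over a file yields its lines.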

### 1.2 Feature Extraction for Cue Detection

### 1.3 Feature Extraction for Scope Resolution