### Fundamentals of Natural Language Processing
# Negation and Uncertainty Detection using a Machine-Learning Based Approach

*Authors:*

> *Anna Blanco, Agustina Lazzati, Stanislav Bultaskii, Queralt Salvadó*

*Aims:*
> Our goal is to train several machine-learning models for each of the two sub-tasks: detecting negation and uncertainty signals (cues), and detecting the corresponding negation and uncertainty scopes.

---

In [3]:
# Import necessary libraries and functions
import json
import pandas as pd
import spacy
from utils import preprocess_text

## Data Preprocessing


First, we load the data as a dictionary, clean each text (removing noise such as '*' and normalizing whitespace), and convert it into a format we can work with. To do this, we separate the texts from the prediction information.

In [4]:
import json
import pandas as pd

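def load_data(path):
    """Load a JSON file and return its contents as Python objects."""
    # Explicit encoding: the reports contain non-ASCII characters (e.g. 'nº', 'al·lergies')
    with open(path, encoding='utf-8') as f:
        return json.load(f)

def preprocess_data(data):
    """Return the cleaned texts and their raw predictions as two parallel lists."""
    texts = []
    predictions = []

    # Each row pairs a record (holding the raw text) with its predictions
    for record, prediction in zip(data['data'], data['predictions']):
        texts.append(preprocess_text(record['text'], keep_case=False))
        predictions.append(prediction)

    return texts, predictions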

# Load train and test data
train_data = load_data('negacio_train_v2024.json')
test_data = load_data('negacio_test_v2024.json')

# Convert data into a DataFrame for simplicity
train_df = pd.DataFrame(train_data)
test_df = pd.DataFrame(test_data)

# Preprocess it and obtain both the texts and the predictions
train_texts, train_preds = preprocess_data(train_df)
test_texts, test_preds = preprocess_data(test_df)

# Visualize the first text and its predictions
print(train_texts[0])
print(train_preds[0])

 nº historia clinica: REDACTED REDACTED REDACTED nºepisodi: REDACTED sexe: home data de naixement: 16.05.1936 edat: 82 anys procedencia cex mateix hosp servei urologia data d'ingres 24.07.2018 data d'alta 25.07.2018 08:54:04 ates per REDACTED, REDACTED; REDACTED, REDACTED informe d'alta d'hospitalitzacio motiu d'ingres paciente que ingresa de forma programada para realizacion de uretrotomia interna . antecedents alergia a penicilina y cloramfenicol . no habitos toxicos. antecedentes medicos: bloqueo auriculoventricular de primer grado hipertension arterial. diverticulosis extensa insuficiencia renal cronica colelitiasis antecedentes quirurgicos: exeresis de lesiones cutaneas con anestesia local protesis total de cadera cordectomia herniorrafia inguinal proces actual varon de 81a que a raiz de episodio de hematuria macroscopica se realiza cistoscopia que es negativa para lesiones malignas pero se objetiva estenosis de uretra . se intentan dilataciones progresivas en el gabinete de urolo
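
The `preprocess_text` helper is imported from `utils` and is not shown in this notebook. As a rough illustration of the cleaning steps described above, here is a minimal sketch of what such a helper could look like. The name `preprocess_text_sketch` and its exact behaviour are assumptions, not the actual implementation.

```python
import re

def preprocess_text_sketch(text, keep_case=True):
    """Hypothetical stand-in for utils.preprocess_text (the real helper is not shown)."""
    # Drop asterisks, which are treated as noise
    text = text.replace('*', '')
    # Collapse any run of whitespace (spaces, tabs, newlines) into a single space
    text = re.sub(r'\s+', ' ', text)
    # Lowercase unless the caller asks to keep the original casing
    return text if keep_case else text.lower()

raw = "Paciente ***  NO   presenta\nfiebre"
clean = preprocess_text_sketch(raw, keep_case=False)

print(clean)
# Position of the negation word before and after cleaning
print(raw.find("NO"), clean.find("no"))
```

In this toy example the word moves from character 14 to character 9 after cleaning. Any cleaning that deletes characters shifts offsets in this way, so it is worth checking that the `start`/`end` values in the annotations refer to the same version of the text that is later tokenized.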

Next, we define a function that extracts only the annotations from the predictions. These annotations will later be aligned with the token indices:

In [10]:
def extract_annotations_grouped(preds):
    """Extracts annotations from the predictions format, grouped per text"""
    all_annotations = []

    for pred in preds:
        text_annotations = []
        for result_entry in pred:
            for result in result_entry['result']:
                value = result['value']
                text_annotations.append({
                    'start': value['start'],
                    'end': value['end'],
                    'labels': value['labels']
                })
        all_annotations.append(text_annotations)

    return all_annotations

train_annotations = extract_annotations_grouped(train_preds)
test_annotations = extract_annotations_grouped(test_preds)

# Show an example
print(train_annotations[0])

[{'start': 416, 'end': 419, 'labels': ['NEG']}, {'start': 771, 'end': 774, 'labels': ['NEG']}, {'start': 1067, 'end': 1071, 'labels': ['NEG']}, {'start': 1445, 'end': 1448, 'labels': ['NEG']}, {'start': 1498, 'end': 1501, 'labels': ['NEG']}, {'start': 1501, 'end': 1546, 'labels': ['NSCO']}, {'start': 1890, 'end': 1899, 'labels': ['UNC']}, {'start': 1899, 'end': 1948, 'labels': ['USCO']}, {'start': 2097, 'end': 2100, 'labels': ['NEG']}, {'start': 2837, 'end': 2841, 'labels': ['NEG']}, {'start': 2911, 'end': 2926, 'labels': ['UNC']}, {'start': 2926, 'end': 2971, 'labels': ['USCO']}, {'start': 3176, 'end': 3179, 'labels': ['NEG']}, {'start': 3399, 'end': 3402, 'labels': ['NEG']}, {'start': 3485, 'end': 3493, 'labels': ['NEG']}, {'start': 3736, 'end': 3740, 'labels': ['NEG']}, {'start': 3774, 'end': 3781, 'labels': ['UNC']}, {'start': 3781, 'end': 3853, 'labels': ['USCO']}, {'start': 3854, 'end': 3857, 'labels': ['NEG']}, {'start': 4112, 'end': 4116, 'labels': ['NEG']}, {'start': 4228, 'en
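
A quick sanity check at this point is to count how often each label occurs. The sketch below does so on a small, made-up list that has the same shape as the output above; the function name `count_labels` and the toy values are illustrative only.

```python
from collections import Counter

# Toy annotations in the same shape as the grouped output (values are illustrative)
toy_annotations = [
    [{'start': 416, 'end': 419, 'labels': ['NEG']},
     {'start': 1501, 'end': 1546, 'labels': ['NSCO']}],
    [{'start': 10, 'end': 19, 'labels': ['UNC']},
     {'start': 19, 'end': 40, 'labels': ['USCO']},
     {'start': 50, 'end': 52, 'labels': ['NEG']}],
]

def count_labels(annotations_grouped):
    """Count how often each label occurs across all texts."""
    counts = Counter()
    for text_annotations in annotations_grouped:
        for ann in text_annotations:
            # 'labels' is a list, so update() counts each entry in it
            counts.update(ann['labels'])
    return counts

label_counts = count_labels(toy_annotations)
print(label_counts)
```

Calling `count_labels(train_annotations)` in the same way would give the label distribution of the training set.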

We will use `spacy` for tokenization and sentence splitting:

In [11]:
nlp = spacy.load('xx_ent_wiki_sm') # multilingual model, used here for both Spanish and Catalan text

def tokenize_sentences(data, nlp):
    """Splits texts into tokens using spaCy"""
    # List to store resulting tokens
    tokenized_data = []

    for text in data:
        # tokenize sentence using spacy model
        doc = nlp(text)
        # split data into tokens, avoiding white space tokens
        tokens = [token.text for token in doc if not token.is_space]
        # add resulting list to the tokens list
        tokenized_data.append(tokens)
    
    return tokenized_data

train_tokens = tokenize_sentences(train_texts, nlp)
test_tokens = tokenize_sentences(test_texts, nlp)

# Show an example of a tokenized text
print(train_tokens[1])

['nº', 'historia', 'clinica', ':', 'REDACTED', 'REDACTED', 'REDACTED', 'nºepisodi', ':', 'REDACTED', 'sexe', ':', 'dona', 'data', 'de', 'naixement', ':', '04.08.2000', 'edat', ':', '19', 'anys', 'procedencia', 'domicil', '/', 'res.soc', 'servei', 'obstetricia', 'data', "d'ingres", '04.10.2019', 'data', "d'alta", '06.10.2019', '13:02:36', 'ates', 'per', 'REDACTED', ',', 'REDACTED', ';', 'REDACTED', ',', 'REDACTED', 'informe', "d'alta", "d'hospitalitzacio", 'motiu', "d'ingres", 'treball', 'de', 'part', 'antecedents', 'no', 'al·lergies', 'medicamentoses', 'conegudes', '.', 'no', 'intervencions', 'quirurgiques', 'ni', 'altres', 'antecedents', 'patologics', '.', 'nega', 'habits', 'toxics', '.', 'no', 'medicacio', 'habitual', '.', 'evolucio', 'clinica', 'evolucion', 'parto', 'finaliza', 'por', 'parto', 'eutocico', 'el', 'dia', '04/10', 'a', 'las', '9:34h', 'obtencion', 'de', 'rn', ',', 'sexo', ':', 'masculino', ',', 'peso', ':', '2820', 'apgar', ':', '9/10', 'gs', ':', 'ab+', 'el', 'pueperio

Next, we map the character-level annotations (e.g. `'start': 347, 'end': 350`) to token indices:
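
To make the idea concrete before the spaCy version, here is a simplified, plain-Python sketch of the same mapping. It uses a whitespace tokenizer as a stand-in for spaCy's tokenizer and mimics the "contract" behaviour, where only tokens lying entirely inside the character span are kept. The helper names are hypothetical.

```python
import re

def tokens_with_offsets(text):
    """Whitespace tokenizer returning (token, start_char, end_char) triples."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r'\S+', text)]

def char_span_to_token_span(tokens, start, end):
    """Return (first_token, last_token + 1) for tokens fully inside [start, end), or None."""
    inside = [i for i, (_, s, e) in enumerate(tokens) if s >= start and e <= end]
    if not inside:
        return None
    return inside[0], inside[-1] + 1

text = "no presenta fiebre ni tos"
tokens = tokens_with_offsets(text)

print(char_span_to_token_span(tokens, 0, 2))   # covers the cue "no"
print(char_span_to_token_span(tokens, 3, 18))  # covers the scope "presenta fiebre"
print(char_span_to_token_span(tokens, 4, 9))   # cuts through "presenta", so no whole token fits
```

The last call returns `None`: a character span that does not contain at least one complete token cannot be mapped, so an annotation like that would be lost.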

In [15]:
def char_to_token_indices(texts, annotations_list, nlp):
    """Maps character-level annotation spans to token indices using spaCy."""
    all_token_annotations = []

    for text, annotations in zip(texts, annotations_list):
        doc = nlp(text)
        token_annotations = []
        for ann in annotations:
            start, end = ann['start'], ann['end']
            # 'contract' keeps only the tokens that lie entirely inside [start, end)
            span = doc.char_span(start, end, alignment_mode='contract')
            # char_span returns None when no whole token fits; such annotations are skipped
            if span:
                # span.start / span.end index into doc, which still contains any
                # whitespace tokens that tokenize_sentences filters out
                token_annotations.append((span.start, span.end, ann['labels'][0]))
        all_token_annotations.append(token_annotations)

    return all_token_annotations

train_token_annotations = char_to_token_indices(train_texts, train_annotations, nlp)
test_token_annotations = char_to_token_indices(test_texts, test_annotations, nlp)

print(train_token_annotations)

[[(68, 71, 'NSCO'), (124, 125, 'NEG'), (126, 128, 'NSCO'), (145, 146, 'NSCO'), (236, 241, 'NSCO'), (251, 253, 'UNC'), (254, 267, 'USCO'), (288, 289, 'UNC'), (289, 295, 'USCO'), (305, 311, 'NSCO'), (339, 343, 'NSCO'), (375, 376, 'NEG')], [(54, 55, 'NEG'), (59, 60, 'NEG'), (71, 72, 'NEG'), (55, 57, 'NSCO'), (60, 65, 'NSCO'), (67, 68, 'NEG'), (68, 69, 'NSCO'), (72, 73, 'NSCO')], [(145, 146, 'NEG'), (147, 149, 'NSCO'), (156, 157, 'NSCO'), (164, 166, 'NEG'), (188, 192, 'NSCO'), (193, 197, 'NSCO'), (254, 257, 'NSCO'), (397, 400, 'NSCO'), (453, 454, 'NEG'), (454, 456, 'NSCO'), (61, 64, 'NSCO')], [(263, 271, 'NSCO'), (332, 339, 'USCO'), (522, 523, 'NEG'), (537, 538, 'UNC'), (539, 543, 'USCO'), (640, 641, 'NEG'), (652, 654, 'NEG'), (699, 700, 'UNC'), (700, 711, 'USCO'), (711, 712, 'NEG'), (980, 981, 'UNC'), (981, 985, 'USCO'), (989, 997, 'NSCO'), (1240, 1241, 'USCO'), (1242, 1243, 'NEG'), (1376, 1377, 'UNC'), (1377, 1384, 'USCO'), (1399, 1403, 'USCO'), (1455, 1457, 'UNC'), (68, 70, 'NSCO'), (13

*The indices shown now refer to tokens rather than characters, and each label indicates whether the span is a negation or uncertainty cue (`NEG`, `UNC`) or a negation or uncertainty scope (`NSCO`, `USCO`).*
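
One possible next step, not shown in this section, is to expand these `(start, end, label)` token spans into one label per token, which is the input format most sequence-labelling models expect. The sketch below assumes an `'O'` label for tokens outside every span; the function name and the toy sentence are illustrative.

```python
def spans_to_token_labels(n_tokens, token_spans, default='O'):
    """Turn (start, end, label) token spans into a list with one label per token."""
    labels = [default] * n_tokens
    for start, end, label in token_spans:
        # end is exclusive; clamp it so a span can never run past the last token
        for i in range(start, min(end, n_tokens)):
            labels[i] = label
    return labels

tokens = ['no', 'presenta', 'fiebre', '.', 'posible', 'neumonia']
spans = [(0, 1, 'NEG'), (1, 3, 'NSCO'), (4, 5, 'UNC'), (5, 6, 'USCO')]

token_labels = spans_to_token_labels(len(tokens), spans)
print(list(zip(tokens, token_labels)))
```

If two spans overlap, the later one in the list overwrites the earlier one, so nested cues and scopes would need a richer scheme than this single-label sketch.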