### Fundamentals of Natural Language Processing
# Negation and Uncertainty Detection using a Machine-Learning Based Approach

*Authors:*

> *Anna Blanco, Agustina Lazzati, Stanislav Bultaskii, Queralt Salvadó*

*Aims:*
> Our goal is to train several machine-learning models for each of the two sub-tasks: detecting negation and uncertainty signals (cues), and detecting the negation and uncertainty scopes.

---

In [1]:
# Import necessary libraries and functions
import json
import pandas as pd
import spacy
from utils import preprocess_text

## Data Preprocessing


First, we load all the data, remove unnecessary characters such as '*', normalize whitespace, and convert the result into a format we can work with. To do this, we separate the texts from the prediction information.

In [2]:
import json
import pandas as pd

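def load_data(path):
    """Load a JSON file and return the parsed object"""
    # Explicit UTF-8 so accented Spanish/Catalan characters decode consistently
    with open(path, encoding='utf-8') as f:
        return json.load(f)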

def preprocess_data(data):
    """Return the cleaned texts and their predictions as two parallel lists"""
    texts = []
    predictions = []

    # Walk the 'data' and 'predictions' columns in parallel, one record at a time
    for record, prediction in zip(data['data'], data['predictions']):
        texts.append(preprocess_text(record['text'], keep_case=False))
        predictions.append(prediction)

    return texts, predictions

# Load train and test data
train_data = load_data('negacio_train_v2024.json')
test_data = load_data('negacio_test_v2024.json')

# Convert data into a DataFrame for simplicity
train_df = pd.DataFrame(train_data)
test_df = pd.DataFrame(test_data)

# Preprocess it and obtain both the texts and the predictions
train_texts, train_preds = preprocess_data(train_df)
test_texts, test_preds = preprocess_data(test_df)
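`preprocess_text` is imported from the project's `utils` module and is not shown in this notebook. As a rough illustration of the cleaning steps described above, a minimal sketch might look like the following. The function name `preprocess_text_sketch` and its exact behaviour are assumptions, not the actual implementation.

```python
import re

def preprocess_text_sketch(text, keep_case=False):
    """Hypothetical stand-in for utils.preprocess_text (the real one is not shown)."""
    # Drop asterisks, which the description above treats as noise
    text = text.replace('*', '')
    # Collapse any run of whitespace (spaces, tabs, newlines) into a single space
    text = re.sub(r'\s+', ' ', text).strip()
    # Lowercase unless the caller asks to keep the original casing
    return text if keep_case else text.lower()

sample = "Paciente  *** NO presenta\n fiebre. "
cleaned = preprocess_text_sketch(sample)
print(cleaned)
```

Note that a cleaning step like this one changes the string length, so character offsets computed on the raw text only line up with the cleaned text if the offsets were produced after the same cleaning.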

Next, we define a function that extracts only the annotations from the predictions. These annotations will later be aligned with the token indices:

In [3]:
def extract_annotations_grouped(preds):
    """Extracts annotations from the predictions format, grouped per text"""
    all_annotations = []

    for pred in preds:
        text_annotations = []
        for result_entry in pred:
            for result in result_entry['result']:
                value = result['value']
                text_annotations.append({
                    'start': value['start'],
                    'end': value['end'],
                    'labels': value['labels']
                })
        all_annotations.append(text_annotations)

    return all_annotations

train_annotations = extract_annotations_grouped(train_preds)
test_annotations = extract_annotations_grouped(test_preds)
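To make the nesting concrete, here is a small self-contained sketch of the same flattening applied to a toy input. The structure mirrors the keys read by the loop above (`result`, `value`, `start`, `end`, `labels`), but the offsets and the label names `NEG` and `NSCO` are made up for illustration, and `flatten_annotations` is a hypothetical compact equivalent rather than the function used in this notebook.

```python
# Toy predictions: one list of result entries per text (values are illustrative)
toy_preds = [
    [  # first text: one cue and one scope
        {'result': [
            {'value': {'start': 9, 'end': 11, 'labels': ['NEG']}},
            {'value': {'start': 12, 'end': 27, 'labels': ['NSCO']}},
        ]}
    ],
    [  # second text: no annotations at all
        {'result': []}
    ],
]

def flatten_annotations(preds):
    """Keep only start, end and labels for every result, grouped per text."""
    return [
        [
            {'start': r['value']['start'],
             'end': r['value']['end'],
             'labels': r['value']['labels']}
            for entry in pred
            for r in entry['result']
        ]
        for pred in preds
    ]

flat = flatten_annotations(toy_preds)
print(flat)
```

The output keeps one inner list per text, so a text without annotations yields an empty list and the result stays aligned with the list of texts.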

We will use `spacy` to tokenize each text:

In [4]:
nlp = spacy.load('xx_ent_wiki_sm')  # multilingual pipeline, used here for both Spanish and Catalan

def tokenize_sentences(data, nlp):
    """Splits texts into tokens using spaCy"""
    # List to store resulting tokens
    tokenized_data = []

    for text in data:
        # tokenize sentence using spacy model
        doc = nlp(text)
        # split data into tokens, avoiding white space tokens
        tokens = [token.text for token in doc if not token.is_space]
        # add resulting list to the tokens list
        tokenized_data.append(tokens)
    
    return tokenized_data

train_tokens = tokenize_sentences(train_texts, nlp)
test_tokens = tokenize_sentences(test_texts, nlp)

Next, we map the character-level annotations (e.g. `'start': 347, 'end': 350`) to token indices:

In [5]:
def char_to_token_indices(texts, annotations_list, nlp):
    """Maps character-level annotation spans to token indices using spaCy."""
    all_token_annotations = []

    for text, annotations in zip(texts, annotations_list):
        doc = nlp(text)
        token_annotations = []
        for ann in annotations:
            start, end = ann['start'], ann['end']
            span = doc.char_span(start, end, alignment_mode='contract')
            # char_span returns None when no whole token fits inside the span;
            # such annotations are skipped
            if span:
                token_annotations.append((span.start, span.end, ann['labels'][0]))
        all_token_annotations.append(token_annotations)

    return all_token_annotations

train_token_annotations = char_to_token_indices(train_texts, train_annotations, nlp)
test_token_annotations = char_to_token_indices(test_texts, test_annotations, nlp)

*The indices now refer to tokens rather than characters, and the label tells us whether each span is a negation or uncertainty cue or scope.*
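The idea behind `alignment_mode='contract'` can be illustrated without spaCy. The sketch below is not spaCy's implementation: it assumes a plain whitespace tokenizer, and the helper names `token_offsets` and `char_span_to_tokens` are hypothetical. It keeps only the tokens that lie entirely inside the character span and returns a half-open token range, in the same spirit as `span.start` and `span.end`.

```python
import re

def token_offsets(text):
    """Return (start, end) character offsets of whitespace-delimited tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r'\S+', text)]

def char_span_to_tokens(offsets, start, end):
    """Map a character span to a half-open token range (first, last + 1).

    Only tokens lying entirely inside [start, end) are kept, which is the
    idea behind 'contract' alignment. Returns None if no whole token fits.
    """
    inside = [i for i, (s, e) in enumerate(offsets) if s >= start and e <= end]
    if not inside:
        return None
    return inside[0], inside[-1] + 1

text = "paciente no presenta fiebre"
offsets = token_offsets(text)

cue = char_span_to_tokens(offsets, 9, 11)       # covers exactly "no"
scope = char_span_to_tokens(offsets, 12, 27)    # covers "presenta fiebre"
partial = char_span_to_tokens(offsets, 9, 15)   # "no" plus part of "presenta"
too_small = char_span_to_tokens(offsets, 10, 11)  # cuts through "no"
print(cue, scope, partial, too_small)
```

A span that ends in the middle of a token is shrunk to the last whole token it contains, and a span that contains no whole token is dropped, which is why some annotations may disappear during alignment.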