### Fundamentals of Natural Language Processing
# Negation and Uncertainty Detection using a Machine-Learning Based Approach

*Authors:*

> *Anna Blanco, Agustina Lazzati, Stanislav Bultaskii, Queralt Salvadó*

*Aims:*
> Our goal is to train machine-learning models for each of the two sub-tasks: detecting negation and uncertainty cues, and detecting negation and uncertainty scopes. To do so, we follow the implementation described by *Enger, Velldal, and Øvrelid (2017)*, which uses a maximum-margin approach to negation detection. For our application, we extend it to cover uncertainty cues and scopes as well.

*References:* 
<br>
> Enger, M., Velldal, E., & Øvrelid, L. (2017). *An open-source tool for negation detection: A maximum-margin approach*. Proceedings of the Workshop on Computational Semantics Beyond Events and Roles (SemBEaR), 64–69.

---

**Setup note:** this notebook has to be run in the `nlp_project` Python environment that Queralt created, because it contains the specific libraries we need. The preprocessing step worked without extra commands, but for this notebook the spaCy commands in the cells below were needed. If your environment already has the model installed, skip them.

In [18]:
import spacy

# Check installed models
print(spacy.util.get_installed_models())


['es_core_news_sm']


In [19]:
#!python -m spacy download es_core_news_sm


In [20]:
# Import necessary libraries and functions
import json
import spacy
from collections import defaultdict
import re
import pandas as pd
# Train/test DataFrames for each model (SVM, CRF) and each task (negation, uncertainty)
from preprocessing import (
    df_svm_neg_train, df_svm_neg_test, df_svm_unc_train, df_svm_unc_test,
    df_crf_neg_train, df_crf_neg_test, df_crf_unc_train, df_crf_unc_test,
)

## CUE DETECTION USING SVM

First of all, we'll need to vectorize:
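
As a rough illustration of what vectorizing means here, the sketch below turns per-token feature dictionaries into numeric rows that an SVM could consume. It is a dependency-free sketch under stated assumptions, not the notebook's actual implementation: the feature names and values are hypothetical (the real columns come from `preprocessing`), and `fit_vocabulary` and `transform` are illustrative helper names. In practice, scikit-learn's `DictVectorizer` performs this kind of encoding.

```python
# Hypothetical token features; the real columns come from the preprocessing step.
tokens = [
    {"word": "no", "pos": "ADV", "is_punct": False},
    {"word": "tiene", "pos": "VERB", "is_punct": False},
    {"word": ".", "pos": "PUNCT", "is_punct": True},
]

def column_name(key, value):
    # String features become one column per distinct value ("pos=ADV");
    # numeric and boolean features keep a single column named after the key.
    return f"{key}={value}" if isinstance(value, str) else key

def fit_vocabulary(feature_dicts):
    """Assign a column index to every feature name seen in the data."""
    names = {column_name(k, v) for feats in feature_dicts for k, v in feats.items()}
    return {name: idx for idx, name in enumerate(sorted(names))}

def transform(feature_dicts, vocabulary):
    """Turn each feature dict into a dense numeric row; unseen features are ignored."""
    rows = []
    for feats in feature_dicts:
        row = [0.0] * len(vocabulary)
        for key, value in feats.items():
            name = column_name(key, value)
            if name in vocabulary:
                row[vocabulary[name]] = 1.0 if isinstance(value, str) else float(value)
        rows.append(row)
    return rows

vocabulary = fit_vocabulary(tokens)   # fit on training data only
X = transform(tokens, vocabulary)     # reuse the same vocabulary for test data
```

Fitting the vocabulary on the training split and reusing it for the test split keeps the columns aligned between the two matrices.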

### SVM for negation cue detection

### SVM for uncertainty cue detection

## SCOPE DETECTION USING CRF

In [21]:
# pip install sklearn-crfsuite

We'll use CRF BIO tagging:

**BIO tagging** is a way to label each word in a sentence to show if it is part of a scope (like negation or uncertainty). The labels are:

* **B** for the **Beginning** of the scope
* **I** for **Inside** the scope
* **O** for **Outside** the scope

We use BIO tagging to help machine learning models, like **CRFs (Conditional Random Fields)**, learn where a scope starts and ends. For example, in a negated sentence like “No tiene fiebre”, “No” is tagged as the beginning of the scope (**B-NEG_SCOPE**) and “tiene fiebre” as inside it (**I-NEG_SCOPE**), while any word that is not part of the scope is tagged **O**.

Using BIO makes it easier for the model to learn patterns and detect complete scopes correctly, not just single words. 
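
To make the tagging scheme concrete, here is a small sketch that reads a BIO sequence back into complete scopes. The sentence and its tags are a made-up example, and `bio_to_spans` is a hypothetical helper that is not part of this notebook; the `NEG_SCOPE` suffix follows the label names used in the cells below.

```python
def bio_to_spans(tags):
    """Return (start, end) token index pairs (end exclusive) for each scope in a BIO sequence."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:       # a new scope begins right after another one
                spans.append((start, i))
            start = i
        elif tag.startswith("I-"):
            if start is None:           # tolerate an I- tag with no preceding B-
                start = i
        else:                           # "O" closes any open scope
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:               # scope runs to the end of the sentence
        spans.append((start, len(tags)))
    return spans

# Made-up example sentence with its BIO tags
tokens = ["El", "paciente", "no", "tiene", "fiebre", "."]
tags = ["O", "O", "B-NEG_SCOPE", "I-NEG_SCOPE", "I-NEG_SCOPE", "O"]

scopes = [" ".join(tokens[s:e]) for s, e in bio_to_spans(tags)]
print(scopes)  # ['no tiene fiebre']
```

The B tag is what lets two adjacent scopes be told apart: without it, consecutive scopes would merge into one run of inside tags.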


In [22]:
from sklearn_crfsuite import CRF
from sklearn_crfsuite import metrics

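def to_bio_labels(labels, label_type="NEG"):
    # Convert per-sentence lists of binary labels (0/1) into BIO tags for scopes.
    # label_type is "NEG" or "UNC", which gives the prefix NEG_SCOPE or UNC_SCOPE.
    bio_labels = []
    prefix = label_type.upper() + '_SCOPE'  # e.g., NEG_SCOPE or UNC_SCOPE
    for sent in labels:
        bio = []
        prev = 0  # binary label of the previous token; 0 means outside a scope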
        for i, tag in enumerate(sent):
            if tag == 1:
                if i == 0 or prev == 0:
                    bio.append(f'B-{prefix}')
                else:
                    bio.append(f'I-{prefix}')
            else:
                bio.append('O')
            prev = tag
        bio_labels.append(bio)
    return bio_labels

def df_to_crf_format(df):
    # Convert a DataFrame into a list of feature dictionaries per sentence for CRF input
    sentences = []
    grouped = df.groupby("sentence_id")
    for _, group in grouped:
        sentence = []
        for _, row in group.iterrows():
            features = {
                'word.lower()': row['word'].lower(),
                'word.isupper()': row['word'].isupper(),
                'word.istitle()': row['word'].istitle(),
                'pos': row['pos'],
                'prefix': row['prefix'],
                'suffix': row['suffix'],
                'is_punct': row['is_punct'],
                'in_single_word_cues': row['in_single_word_cues'],
                'in_affixal_cues': row['in_affixal_cues'],
                'ends_with_ment': row['ends_with_ment']
            }
            sentence.append(features)
        sentences.append(sentence)
    return sentences

def df_to_labels(df, label_col):
    # Extracts label sequences from the DataFrame, grouped by sentence
    label_sequences = []
    grouped = df.groupby("sentence_id")
    for _, group in grouped:
        label_list = group[label_col].tolist()
        label_sequences.append(label_list)
    return label_sequences


In [23]:
def train_and_evaluate_crf(df_train, df_test, label_col):
    # Trains and evaluates a CRF model for BIO tagging using specified label column (e.g., 'neg_scope_label')
    scope_type = "NEG" if "neg" in label_col.lower() else "UNC"

    X_train = df_to_crf_format(df_train)
    y_train_raw = df_to_labels(df_train, label_col)
    y_train = to_bio_labels(y_train_raw, label_type=scope_type)

    X_test = df_to_crf_format(df_test)
    y_test_raw = df_to_labels(df_test, label_col)
    y_test = to_bio_labels(y_test_raw, label_type=scope_type)

    crf = CRF(algorithm='lbfgs', max_iterations=100)
    crf.fit(X_train, y_train)
    y_pred = crf.predict(X_test)

    print(f"CRF Evaluation for: {label_col}")
    print(metrics.flat_classification_report(y_test, y_pred))   
    
    return X_test, y_test, y_pred  # Return these variables for further use
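
Because almost every token is labelled `O`, accuracy and the weighted average are dominated by that class. The sketch below shows one possible way to score only the scope labels, using micro-averaged token-level precision, recall and F1. `scope_only_scores` is a hypothetical helper, not part of `sklearn_crfsuite`, and the toy sequences are invented for illustration. As far as we know, `metrics.flat_classification_report` also accepts a `labels` argument that can restrict the report in a similar way.

```python
def scope_only_scores(y_true, y_pred, outside="O"):
    """Micro-averaged token-level precision, recall and F1 over all labels except `outside`."""
    tp = fp = fn = 0
    for true_sent, pred_sent in zip(y_true, y_pred):
        for true, pred in zip(true_sent, pred_sent):
            if pred != outside and pred == true:
                tp += 1                 # correct scope label
            else:
                if pred != outside:
                    fp += 1             # predicted a scope label that is wrong
                if true != outside:
                    fn += 1             # missed or mislabelled a true scope token
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented toy sequences: one correct B tag, one missed I tag, one spurious I tag
toy_true = [["O", "B-NEG_SCOPE", "I-NEG_SCOPE", "O"]]
toy_pred = [["O", "B-NEG_SCOPE", "O", "I-NEG_SCOPE"]]
print(scope_only_scores(toy_true, toy_pred))  # (0.5, 0.5, 0.5)
```

The same function can be called on the `y_test` and `y_pred` lists returned by `train_and_evaluate_crf`.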



To see how well the model does on real sentences, the helper below prints each word of a sentence next to its true and predicted BIO labels.

In [None]:
def print_crf_predictions(df, X, y_true, y_pred, sentence_idx=0):
    """
    Print words, true BIO labels, and predicted BIO labels for a given sentence index.
    """
    grouped = df.groupby("sentence_id")
    sentence_ids = list(grouped.groups.keys())

    if sentence_idx >= len(sentence_ids):
        print(f"Sentence index {sentence_idx} is out of range.")
        return

    sentence_id = sentence_ids[sentence_idx]
    sentence_df = grouped.get_group(sentence_id)

    print(f"\nSentence {sentence_idx} (ID: {sentence_id})\n{'-'*50}")
    print("{:<15} {:<15} {:<15}".format("Word", "True Label", "Predicted"))
    print("-" * 50)
    for word, true, pred in zip(sentence_df["word"], y_true[sentence_idx], y_pred[sentence_idx]):
        print("{:<15} {:<15} {:<15}".format(word, true, pred))


### CRF for negation scope detection

In [29]:

# CRF BIO tagging evaluation for NEGATION scopes
X_test, y_test, y_pred = train_and_evaluate_crf(df_crf_neg_train, df_crf_neg_test, "neg_scope_label")
for i in range(5):
    print_crf_predictions(df_crf_neg_test, X_test, y_test, y_pred, sentence_idx=i)

CRF Evaluation for: neg_scope_label
              precision    recall  f1-score   support

 B-NEG_SCOPE       0.87      0.46      0.60      1071
 I-NEG_SCOPE       0.74      0.56      0.64      2522
           O       0.97      0.99      0.98     61938

    accuracy                           0.97     65531
   macro avg       0.86      0.67      0.74     65531
weighted avg       0.96      0.97      0.96     65531


Sentence 0 (ID: 0)
--------------------------------------------------
Word            True Label      Predicted      
--------------------------------------------------
                O               O              

Sentence 1 (ID: 1)
--------------------------------------------------
Word            True Label      Predicted      
--------------------------------------------------
nº              O               O              
historia        O               O              
clinica         O               O              
:               O               O              
*  

### CRF for uncertainty scope detection

In [32]:
# CRF BIO tagging evaluation for UNCERTAINTY scopes
X_test, y_test, y_pred = train_and_evaluate_crf(df_crf_unc_train, df_crf_unc_test, "unc_scope_label")
for i in range(100,106):
    print_crf_predictions(df_crf_unc_test, X_test, y_test, y_pred, sentence_idx=i)

CRF Evaluation for: unc_scope_label
              precision    recall  f1-score   support

 B-UNC_SCOPE       0.33      0.04      0.07       129
 I-UNC_SCOPE       0.19      0.06      0.09       437
           O       0.99      1.00      0.99     64965

    accuracy                           0.99     65531
   macro avg       0.51      0.37      0.39     65531
weighted avg       0.99      0.99      0.99     65531


Sentence 100 (ID: 100)
--------------------------------------------------
Word            True Label      Predicted      
--------------------------------------------------
cardiovascular  O               O              
:               O               O              
auscultacion    O               O              
cardiaca        O               O              
con             O               O              
tonos           O               O              
ritmicos        O               O              
y               O               O              
sin             O        