### Fundamentals of Natural Language Processing
# Negation and Uncertainty Detection using a Machine-Learning Based Approach

*Authors:*

> *Anna Blanco, Agustina Lazzati, Stanislav Bultaskii, Queralt Salvadó*

*Aims:*
> Our goal is to train various Machine Learning based models for each of the two sub-tasks (detection of negation and uncertainty signals, and detection of the negation and uncertainty scopes). In order to do so, we followed the implementation method described by *Enger, Velldal, and Øvrelid (2017)*, which employs a maximum-margin approach for negation detection. However, for our particular application, we also included uncertainty cues and scope detection.

*References:* 
<br>
> Enger, M., Velldal, E., & Øvrelid, L. (2017). *An open-source tool for negation detection: A maximum-margin approach*. Proceedings of the Workshop on Computational Semantics Beyond Events and Roles (SemBEaR), 64–69.

---

**Note on the environment:** this notebook must be run in the `nlp_project` Python environment that Queralt set up, because it depends on specific libraries. The preprocessing notebook worked in that environment without changes, but for this notebook the spaCy setup cells that follow (checking the installed models and, if needed, downloading `es_core_news_sm`) had to be run first. Skip them if the model is already installed; this note can be removed if the team prefers.

In [1]:
import spacy

# Check installed models
print(spacy.util.get_installed_models())


['es_core_news_sm']


In [2]:
#!python -m spacy download es_core_news_sm


In [3]:
# Import necessary libraries and functions
import json
import spacy
from collections import defaultdict
import re
import pandas as pd
from preprocessing import df_svm_neg_train, df_svm_neg_test, df_svm_unc_train, df_svm_unc_test, df_crf_neg_train, df_crf_neg_test, df_crf_unc_train, df_crf_unc_test

   sentence_id  token_id      word     lemma    pos prefix suffix  is_punct  \
0            0         0                      SPACE                       0   
1            1         0        nº        nº   NOUN     nº     nº         0   
2            1         1  historia  historia   NOUN    his    ria         0   
3            1         2   clinica   clinico    ADJ    cli    ica         0   
4            1         3         :         :  PUNCT      :      :         1   

   is_redacted    dep head_pos  in_single_word_cues  in_affixal_cues  \
0            0    dep    SPACE                    0                0   
1            0    det     NOUN                    0                0   
2            0   ROOT     NOUN                    0                0   
3            0   amod     NOUN                    0                0   
4            0  punct     NOUN                    0                0   

   ends_with_ment  neg_cue_label  
0               0              0  
1               0     

## CUE DETECTION USING SVM

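First, each token's features must be converted into a numeric vector. `DictVectorizer` one-hot encodes the string-valued features and passes numeric ones through unchanged: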

In [4]:
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from joblib import dump
from sklearn.feature_extraction import DictVectorizer

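def prepare_dataframe_for_svm(df, label_col):
    """Fit a DictVectorizer on df's feature columns and return (X, y, vectorizer)."""
    # Identifier columns and the label itself must not be used as features
    drop_cols = ["sentence_id", "token_id", label_col]
    feature_dicts = df.drop(columns=drop_cols).to_dict(orient="records")
    y = df[label_col].tolist()

    # String features are one-hot encoded; numeric features are kept as-is
    vectorizer = DictVectorizer(sparse=True)
    X = vectorizer.fit_transform(feature_dicts)

    return X, y, vectorizer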


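def train_and_evaluate_svm(df_train, df_test, label_col, model_name):
    """Train a LinearSVC on df_train, report metrics on df_test, and save the model."""
    X_train, y_train, vec = prepare_dataframe_for_svm(df_train, label_col)
    # Reuse the vectorizer fitted on the training data; it is not refitted on the test set
    X_test = vec.transform(df_test.drop(columns=["sentence_id", "token_id", label_col]).to_dict(orient="records"))
    y_test = df_test[label_col].tolist()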

    pipeline = Pipeline([
        ("svm", LinearSVC(class_weight="balanced", max_iter=5000))
    ])

    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)

    print(f"\n--- Evaluation for {model_name} ---")
    print(classification_report(y_test, y_pred, digits=3))

    # Save both the model and the vectorizer
    dump(pipeline, f"{model_name}.joblib")
    dump(vec, f"{model_name}_vectorizer.joblib")
    return pipeline
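
To illustrate what the vectorization step in `prepare_dataframe_for_svm` does, here is a minimal sketch on made-up token features. The toy dictionaries below are assumptions for illustration and are not taken from the corpus; only `DictVectorizer` itself comes from the notebook.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy token features (made up for illustration; not taken from the corpus)
train_tokens = [
    {"word": "no", "pos": "ADV", "is_punct": 0},
    {"word": "fiebre", "pos": "NOUN", "is_punct": 0},
    {"word": ".", "pos": "PUNCT", "is_punct": 1},
]
test_tokens = [{"word": "sin", "pos": "ADP", "is_punct": 0}]

vectorizer = DictVectorizer(sparse=True)
X_train = vectorizer.fit_transform(train_tokens)  # learns the feature vocabulary
X_test = vectorizer.transform(test_tokens)        # reuses that vocabulary

print(vectorizer.feature_names_)  # one column per string value, one per numeric feature
print(X_train.toarray())
print(X_test.toarray())           # values unseen in training (word=sin, pos=ADP) get no column
```

Each string feature becomes one binary column per observed value, so a test token whose word and POS tag never appeared in training is encoded as all zeros. This is why the vectorizer must be fitted on the training data only and then reused for the test data.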

### SVM for negation cue detection

In [5]:
neg_cue_model = train_and_evaluate_svm(
    df_train=df_svm_neg_train,
    df_test=df_svm_neg_test,
    label_col="neg_cue_label",
    model_name="svm_negation_cue"
)




--- Evaluation for svm_negation_cue ---
              precision    recall  f1-score   support

           0      1.000     0.998     0.999     64399
           1      0.916     0.996     0.954      1132

    accuracy                          0.998     65531
   macro avg      0.958     0.997     0.977     65531
weighted avg      0.998     0.998     0.998     65531



### SVM for uncertainty cue detection

In [6]:
unc_cue_model = train_and_evaluate_svm(
    df_train=df_svm_unc_train,
    df_test=df_svm_unc_test,
    label_col="unc_cue_label",
    model_name="svm_uncertainty_cue"
)




--- Evaluation for svm_uncertainty_cue ---
              precision    recall  f1-score   support

           0      1.000     0.931     0.964    251284
           1      0.038     0.987     0.072       686

    accuracy                          0.931    251970
   macro avg      0.519     0.959     0.518    251970
weighted avg      0.997     0.931     0.962    251970
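
To read the class-1 row of this report, the short sketch below recomputes F1 from the printed precision and recall and estimates how many false positives they imply. The inputs are the rounded figures from the report, so the results are approximations rather than exact counts.

```python
# Figures copied from the classification report for class 1 (rounded to 3 decimals)
precision, recall, support = 0.038, 0.987, 686

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

true_positives = recall * support                 # cues the model found
predicted_positives = true_positives / precision  # tokens the model flagged as cues
false_positives = predicted_positives - true_positives

print(f"F1 ~ {f1:.3f}")
print(f"true positives ~ {true_positives:.0f}")
print(f"false positives ~ {false_positives:.0f}")
```

The estimate is on the order of 17,000 false positives, which agrees with the class-0 row: a recall of 0.931 over 251,284 negative tokens leaves roughly the same number misclassified. The model finds almost every uncertainty cue, but it flags far too many ordinary tokens as cues.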



In [7]:
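def balance_training_data(df, label_col, neg_ratio=4, seed=42):
    """Downsample the negative class to neg_ratio negatives per positive, then shuffle."""
    positives = df[df[label_col] == 1]
    # Assumes df contains at least len(positives) * neg_ratio negative rows
    negatives = df[df[label_col] == 0].sample(n=len(positives) * neg_ratio, random_state=seed)
    df_balanced = pd.concat([positives, negatives]).sample(frac=1, random_state=seed).reset_index(drop=True)
    return df_balanced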

df_balanced_unc_train = balance_training_data(df_svm_unc_train, label_col="unc_cue_label", neg_ratio=4)
print(df_balanced_unc_train["unc_cue_label"].value_counts())

0    2744
1     686
Name: unc_cue_label, dtype: int64


In [8]:
svm_unc_balanced_model = train_and_evaluate_svm(
    df_train=df_balanced_unc_train,
    df_test=df_svm_unc_test,
    label_col="unc_cue_label",
    model_name="svm_uncertainty_cue_balanced"
)




--- Evaluation for svm_uncertainty_cue_balanced ---
              precision    recall  f1-score   support

           0      1.000     0.929     0.963    251284
           1      0.036     0.987     0.070       686

    accuracy                          0.929    251970
   macro avg      0.518     0.958     0.517    251970
weighted avg      0.997     0.929     0.961    251970



## SCOPE DETECTION USING CRF

In [9]:
# pip install sklearn-crfsuite

We frame scope detection as BIO sequence tagging with a CRF:

**BIO tagging** is a way to label each word in a sentence to show if it is part of a scope (like negation or uncertainty). The labels are:

* **B** for the **Beginning** of the scope
* **I** for **Inside** the scope
* **O** for **Outside** the scope

We use BIO tagging so that sequence models such as **CRFs (Conditional Random Fields)** can learn where a scope starts and ends. For example, in the negated sentence “No tiene fiebre”, “No” is tagged as the beginning of the scope (**B-NEG_SCOPE**) and “tiene fiebre” as inside it (**I-NEG_SCOPE**); any word that is not part of the scope would be tagged **O**.

Because the tags encode span boundaries, the model learns to detect complete scopes rather than isolated words.
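
The conversion from per-token 0/1 scope flags to BIO tags can be sketched as follows. The function name `binary_to_bio` and the example annotation are hypothetical and chosen only for illustration.

```python
def binary_to_bio(tags, scope_name="NEG_SCOPE"):
    """Convert one sentence's 0/1 scope flags into BIO tags."""
    bio = []
    prev = 0
    for tag in tags:
        if tag == 1:
            # A scope token opens a new span unless the previous token was also in scope
            bio.append(f"B-{scope_name}" if prev == 0 else f"I-{scope_name}")
        else:
            bio.append("O")
        prev = tag
    return bio


words = ["No", "tiene", "fiebre", "."]
flags = [1, 1, 1, 0]  # hypothetical annotation: the final period lies outside the scope

for word, tag in zip(words, binary_to_bio(flags)):
    print(f"{word:<8} {tag}")
```

One limitation of starting from binary flags is that two scopes with no gap between them cannot be told apart, so they are tagged as a single span.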


In [10]:
from sklearn_crfsuite import CRF
from sklearn_crfsuite import metrics

def to_bio_labels(labels, label_type="NEG"):
    """Convert per-sentence lists of binary labels (0/1) into BIO scope tags."""
    bio_labels = []
    prefix = label_type.upper() + '_SCOPE'  # e.g., NEG_SCOPE or UNC_SCOPE
    for sent in labels:
        bio = []
        prev = 0
        for i, tag in enumerate(sent):
            if tag == 1:
                if i == 0 or prev == 0:
                    bio.append(f'B-{prefix}')
                else:
                    bio.append(f'I-{prefix}')
            else:
                bio.append('O')
            prev = tag
        bio_labels.append(bio)
    return bio_labels

def df_to_crf_format(df):
    """
    Convert a DataFrame into a list of feature dictionaries per sentence for CRF input.
    Includes original features + contextual features + lexicon-based features.
    
    Parameters:
        df (pd.DataFrame): must contain columns like 'word', 'pos', 'prefix', 'suffix', etc.

    Returns:
        List of list of feature dicts (one per token, grouped by sentence)
    """
    sentences = []
    grouped = df.groupby("sentence_id")

    for _, group in grouped:
        sentence = []
        group = group.reset_index(drop=True)  # Reset index so we can use idx in loop

        for idx, row in group.iterrows():
            word_lower = row['word'].lower()

            features = {
                'word.lower()': word_lower,
                'word.isupper()': row['word'].isupper(),
                'word.istitle()': row['word'].istitle(),
                'pos': row['pos'],
                'pos_prefix': row['pos'][:2] if isinstance(row['pos'], str) else 'NA',
                'prefix': row['prefix'],
                'suffix': row['suffix'],
                'is_punct': row['is_punct'],
                'in_single_word_cues': row['in_single_word_cues'],
                'in_affixal_cues': row['in_affixal_cues'],
                'ends_with_ment': row['ends_with_ment'],
                # NOTE: the affix and modal lists below are English, while the corpus is Spanish
                'has_neg_prefix': word_lower.startswith(('un', 'in', 'non', 'dis')),
                'has_neg_suffix': word_lower.endswith(('less', "n't")),
                'is_modal': word_lower in ['might', 'may', 'could', 'would', 'should']
            }

            # dependency features
            if 'dep' in row and 'head_word' in row and 'head_pos' in row:
                features.update({
                    'dep_label': row['dep'],
                    'head_word': str(row['head_word']).lower(),
                    'head_pos': row['head_pos']
                })

            # Contextual features: previous and next token
            if idx > 0:
                prev_row = group.iloc[idx - 1]
                features.update({
                    '-1:word.lower()': prev_row['word'].lower(),
                    '-1:pos': prev_row['pos']
                })
            else:
                features['BOS'] = True  # Beginning of sentence

            if idx < len(group) - 1:
                next_row = group.iloc[idx + 1]
                features.update({
                    '+1:word.lower()': next_row['word'].lower(),
                    '+1:pos': next_row['pos']
                })
            else:
                features['EOS'] = True  # End of sentence

            sentence.append(features)
        sentences.append(sentence)

    return sentences


def df_to_labels(df, label_col):
    # Extracts label sequences from the DataFrame, grouped by sentence
    label_sequences = []
    grouped = df.groupby("sentence_id")
    for _, group in grouped:
        label_list = group[label_col].tolist()
        label_sequences.append(label_list)
    return label_sequences
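
To make the CRF input format concrete, here is a reduced sketch of the contextual part of `df_to_crf_format`, working on plain `(word, pos)` pairs instead of DataFrame rows. The helper name `window_features` and the hand-picked POS tags are assumptions for illustration.

```python
def window_features(tokens):
    """Build one feature dict per token from the (word, pos) pairs of a single sentence."""
    features = []
    for i, (word, pos) in enumerate(tokens):
        feats = {"word.lower()": word.lower(), "pos": pos}
        if i > 0:
            prev_word, prev_pos = tokens[i - 1]
            feats["-1:word.lower()"] = prev_word.lower()
            feats["-1:pos"] = prev_pos
        else:
            feats["BOS"] = True  # first token: no left neighbour
        if i < len(tokens) - 1:
            next_word, next_pos = tokens[i + 1]
            feats["+1:word.lower()"] = next_word.lower()
            feats["+1:pos"] = next_pos
        else:
            feats["EOS"] = True  # last token: no right neighbour
        features.append(feats)
    return features


# Hypothetical POS tags, chosen by hand for illustration
sentence = [("No", "ADV"), ("tiene", "VERB"), ("fiebre", "NOUN")]
sentence_features = window_features(sentence)
for feats in sentence_features:
    print(feats)
```

The CRF receives one such list of dictionaries per sentence, paired with a label sequence of the same length.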


In [11]:
# Train + evaluate CRF model
def train_and_evaluate_crf(df_train, df_test, label_col):
    # Trains and evaluates a CRF model for BIO tagging using specified label column (e.g., 'neg_scope_label')
    scope_type = "NEG" if "neg" in label_col.lower() else "UNC"

    X_train = df_to_crf_format(df_train)
    y_train_raw = df_to_labels(df_train, label_col)
    y_train = to_bio_labels(y_train_raw, label_type=scope_type)

    X_test = df_to_crf_format(df_test)
    y_test_raw = df_to_labels(df_test, label_col)
    y_test = to_bio_labels(y_test_raw, label_type=scope_type)

    crf = CRF(algorithm='lbfgs', max_iterations=100, all_possible_transitions=True)
    crf.fit(X_train, y_train)
    y_pred = crf.predict(X_test)

    print(f"CRF Evaluation for: {label_col.upper()}")
    print(metrics.flat_classification_report(y_test, y_pred))   
    
    return X_test, y_test, y_pred  # Return these variables for further use



To see how well the model does on real sentences, the helper below prints each word of a sentence next to its true and predicted BIO labels.

In [12]:
def print_crf_predictions(df, X, y_true, y_pred, sentence_idx=0):
    """
    Print words, true BIO labels, and predicted BIO labels for a given sentence index.
    """
    grouped = df.groupby("sentence_id")
    sentence_ids = list(grouped.groups.keys())

    if sentence_idx >= len(sentence_ids):
        print(f"Invalid sentence index {sentence_idx}. Max allowed: {len(sentence_ids) - 1}")
        return

    sentence_id = sentence_ids[sentence_idx]
    sentence_df = grouped.get_group(sentence_id)

    print(f"\n--- Sentence {sentence_idx} (ID {sentence_id}) ---")
    print(f"{'WORD':<15} {'TRUE':<15} {'PRED':<15}")
    print(f"{'-'*45}")
    # Labels are aligned by position within the sentence, in the same order as the rows
    for pos, (_, row) in enumerate(sentence_df.iterrows()):
        word = row['word']
        true_label = y_true[sentence_idx][pos]
        pred_label = y_pred[sentence_idx][pos]
        print(f"{word:<15} {true_label:<15} {pred_label:<15}")



### CRF for negation scope detection

In [13]:

# CRF BIO tagging evaluation for NEGATION scopes
X_test, y_test, y_pred = train_and_evaluate_crf(df_crf_neg_train, df_crf_neg_test, "neg_scope_label")
for i in range(5):
    print_crf_predictions(df_crf_neg_test, X_test, y_test, y_pred, sentence_idx=i)

CRF Evaluation for: NEG_SCOPE_LABEL
              precision    recall  f1-score   support

 B-NEG_SCOPE       0.97      0.89      0.93      1071
 I-NEG_SCOPE       0.90      0.79      0.84      2522
           O       0.99      1.00      0.99     61938

    accuracy                           0.99     65531
   macro avg       0.96      0.89      0.92     65531
weighted avg       0.99      0.99      0.99     65531


--- Sentence 0 (ID 0) ---
WORD            TRUE            PRED           
---------------------------------------------
                O               O              

--- Sentence 1 (ID 1) ---
WORD            TRUE            PRED           
---------------------------------------------
nº              O               O              
historia        O               O              
clinica         O               O              
:               O               O              
*               O               O              
*               O               O              
*    

### CRF for uncertainty scope detection

In [14]:
# CRF BIO tagging evaluation for UNCERTAINTY scopes
X_test, y_test, y_pred = train_and_evaluate_crf(df_crf_unc_train, df_crf_unc_test, "unc_scope_label")
for i in range(100,106):
    print_crf_predictions(df_crf_unc_test, X_test, y_test, y_pred, sentence_idx=i)

CRF Evaluation for: UNC_SCOPE_LABEL
              precision    recall  f1-score   support

 B-UNC_SCOPE       0.89      0.24      0.38       129
 I-UNC_SCOPE       0.74      0.31      0.44       437
           O       0.99      1.00      1.00     64965

    accuracy                           0.99     65531
   macro avg       0.87      0.52      0.61     65531
weighted avg       0.99      0.99      0.99     65531


--- Sentence 100 (ID 100) ---
WORD            TRUE            PRED           
---------------------------------------------
cardiovascular  O               O              
:               O               O              
auscultacion    O               O              
cardiaca        O               O              
con             O               O              
tonos           O               O              
ritmicos        O               O              
y               O               O              
sin             O               O              
soplos          O         