### Fundamentals of Natural Language Processing
# Negation and Uncertainty Detection using a Machine-Learning Based Approach

*Authors:*

> *Anna Blanco, Agustina Lazzati, Stanislav Bultaskii, Queralt Salvadó*

*Aims:*
> Our goal is to train several machine-learning models for each of the two sub-tasks: detecting negation and uncertainty signals (cues), and detecting the negation and uncertainty scopes. To do so, we follow the implementation method described by *Enger, Velldal, and Øvrelid (2017)*, which employs a maximum-margin approach for negation detection. For our particular application, we also include uncertainty cues and scope detection.

*References:* 
<br>
> Enger, M., Velldal, E., & Øvrelid, L. (2017). *An open-source tool for negation detection: A maximum-margin approach*. Proceedings of the Workshop on Computational Semantics Beyond Events and Roles (SemBEaR), 64–69.

---

**Environment note:** this notebook needs to run in the `nlp_project` Python environment that Queralt set up, because it depends on specific libraries.

The preprocessing step worked in that environment as is, but for this notebook the spaCy model also had to be installed with the commands below. If the model is already installed, these cells can be skipped (this note can be removed if it is not needed).

In [11]:
import spacy

# Check installed models
print(spacy.util.get_installed_models())


[]


In [13]:
!python -m spacy download es_core_news_sm


Collecting es-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.8.0/es_core_news_sm-3.8.0-py3-none-any.whl (12.9 MB)
Installing collected packages: es-core-news-sm
Successfully installed es-core-news-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')


In [14]:
# Import necessary libraries and functions
import json
import spacy
from collections import defaultdict
import re
import pandas as pd
from preprocessing import df_svm_neg_train, df_svm_neg_test, df_svm_unc_train, df_svm_unc_test, df_crf_neg_train, df_crf_neg_test, df_crf_unc_train, df_crf_unc_test

   sentence_id  token_id      word     lemma    pos prefix suffix  is_punct  \
0            0         0                      SPACE                       0   
1            1         0        nº        nº   NOUN     nº     nº         0   
2            1         1  historia  historia   NOUN    his    ria         0   
3            1         2   clinica   clinico    ADJ    cli    ica         0   
4            1         3         :         :  PUNCT      :      :         1   

   is_redacted    dep head_pos  in_single_word_cues  in_affixal_cues  \
0            0    dep    SPACE                    0                0   
1            0    det     NOUN                    0                0   
2            0   ROOT     NOUN                    0                0   
3            0   amod     NOUN                    0                0   
4            0  punct     NOUN                    0                0   

   ends_with_ment  neg_cue_label  
0               0              0  
1               0     

## CUE DETECTION USING SVM

Before training the SVMs, the per-token features must be vectorized, that is, converted into numeric feature vectors.
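
A minimal sketch of what vectorizing means here, assuming each token is described by a dictionary of features like the columns printed above. The helper names (`feature_name`, `fit_vocabulary`, `vectorize`) and the toy rows are hypothetical. In practice, scikit-learn's `DictVectorizer` performs the same kind of mapping (one indicator column per string value, numeric values passed through) and returns a matrix that an SVM can consume.

```python
# Toy token rows (invented for illustration) using a subset of the feature columns.
rows = [
    {"lemma": "no", "pos": "ADV", "is_punct": 0, "in_single_word_cues": 1},
    {"lemma": "tener", "pos": "VERB", "is_punct": 0, "in_single_word_cues": 0},
    {"lemma": ".", "pos": "PUNCT", "is_punct": 1, "in_single_word_cues": 0},
]


def feature_name(key, value):
    # String-valued features become one indicator column per observed value.
    return f"{key}={value}" if isinstance(value, str) else key


def fit_vocabulary(rows):
    # Map every feature name seen in the training rows to a fixed column index.
    names = sorted({feature_name(k, v) for row in rows for k, v in row.items()})
    return {name: idx for idx, name in enumerate(names)}


def vectorize(row, vocabulary):
    # Build a dense vector; features not seen while fitting are ignored.
    vector = [0.0] * len(vocabulary)
    for key, value in row.items():
        name = feature_name(key, value)
        if name in vocabulary:
            vector[vocabulary[name]] = 1.0 if isinstance(value, str) else float(value)
    return vector


vocabulary = fit_vocabulary(rows)
X = [vectorize(row, vocabulary) for row in rows]
print(sorted(vocabulary, key=vocabulary.get))
print(X[0])
```

The vocabulary should be fitted on the training split only and then reused unchanged for the test split, so that both share the same columns.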

### SVM for negation cue detection

### SVM for uncertainty cue detection

## SCOPE DETECTION USING CRF

In [None]:
# pip install sklearn-crfsuite

We'll use CRF BIO tagging:

**BIO tagging** is a way to label each word in a sentence to show if it is part of a scope (like negation or uncertainty). The labels are:

* **B** for the **Beginning** of the scope
* **I** for **Inside** the scope
* **O** for **Outside** the scope

We use BIO tagging to help machine learning models, like **CRFs (Conditional Random Fields)**, understand where a scope starts and ends. For example, if a sentence has a negation like “No tiene fiebre”, BIO tagging shows that “No” is the beginning (**B-SCOPE**) and “tiene fiebre” is inside the scope (**I-SCOPE**), while other words would be labeled **O** if they are not part of it.

Using BIO makes it easier for the model to learn patterns and detect complete scopes correctly, not just single words. 
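
To make the link between tags and complete scopes concrete, here is a small sketch that reads a BIO sequence back into scope spans. The helper `bio_to_spans` is hypothetical (it is not part of the notebook or of `sklearn-crfsuite`), and the final "." token is added to the example only to show an **O** tag.

```python
def bio_to_spans(tags):
    """Return (start, end) token index pairs, end exclusive, one per scope."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B-SCOPE":
            # A new B- tag closes any scope that is still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "I-SCOPE":
            # Tolerate an I- tag with no preceding B- by opening a scope here.
            if start is None:
                start = i
        else:
            # An O tag closes the open scope, if any.
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


tokens = ["No", "tiene", "fiebre", "."]
tags = ["B-SCOPE", "I-SCOPE", "I-SCOPE", "O"]
for start, end in bio_to_spans(tags):
    print(" ".join(tokens[start:end]))
```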


In [None]:
from sklearn_crfsuite import CRF
from sklearn_crfsuite import metrics

def to_bio_labels(labels):
    # Convert lists of binary labels (0/1) into BIO tagging format for scopes

    bio_labels = []
    for sent in labels:
        bio = []
        prev = 0  # binary label of the previous token (0 = outside any scope)
        for i, tag in enumerate(sent):
            if tag == 1:
                if i == 0 or prev == 0:
                    bio.append('B-SCOPE')
                else:
                    bio.append('I-SCOPE')
            else:
                bio.append('O')
            prev = tag
        bio_labels.append(bio)
    return bio_labels

def df_to_crf_format(df):
    # Convert a DataFrame into a list of feature dictionaries per sentence for CRF input
    sentences = []
    grouped = df.groupby("sentence_id")
    for _, group in grouped:
        sentence = []
        for _, row in group.iterrows():
            features = {
                'word.lower()': row['word'].lower(),
                'word.isupper()': row['word'].isupper(),
                'word.istitle()': row['word'].istitle(),
                'pos': row['pos'],
                'prefix': row['prefix'],
                'suffix': row['suffix'],
                'is_punct': row['is_punct'],
                'in_single_word_cues': row['in_single_word_cues'],
                'in_affixal_cues': row['in_affixal_cues'],
                'ends_with_ment': row['ends_with_ment']
            }
            sentence.append(features)
        sentences.append(sentence)
    return sentences

def df_to_labels(df, label_col):
    # Extracts label sequences from the DataFrame, grouped by sentence
    label_sequences = []
    grouped = df.groupby("sentence_id")
    for _, group in grouped:
        label_list = group[label_col].tolist()
        label_sequences.append(label_list)
    return label_sequences
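
A quick sanity check of the binary-to-BIO conversion on a toy label sequence. The per-sentence logic is restated in compact form as `to_bio` (a hypothetical name) so that the snippet runs on its own; the input labels are invented.

```python
def to_bio(sentence_labels):
    # Compact restatement for one sentence: a 1 that follows a 0
    # (or starts the sentence) opens a scope, later 1s continue it.
    bio, prev = [], 0
    for tag in sentence_labels:
        if tag == 1:
            bio.append("B-SCOPE" if prev == 0 else "I-SCOPE")
        else:
            bio.append("O")
        prev = tag
    return bio


print(to_bio([0, 1, 1, 0, 1]))
```

One consequence of starting from binary labels: two scopes that touch each other with no gap end up in a single B/I run, because a 0 is the only boundary signal available.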


In [21]:
def train_and_evaluate_crf(df_train, df_test, label_col):
    # Trains and evaluates a CRF model for BIO tagging using specified label column (e.g., 'neg_scope_label')
    X_train = df_to_crf_format(df_train)
    y_train_raw = df_to_labels(df_train, label_col)
    y_train = to_bio_labels(y_train_raw)

    X_test = df_to_crf_format(df_test)
    y_test_raw = df_to_labels(df_test, label_col)
    y_test = to_bio_labels(y_test_raw)

    crf = CRF(algorithm='lbfgs', max_iterations=100)
    crf.fit(X_train, y_train)
    y_pred = crf.predict(X_test)

    print(f"CRF Evaluation for: {label_col}")
    print(metrics.flat_classification_report(y_test, y_pred))
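
`metrics.flat_classification_report` flattens the per-sentence label sequences into one token-level list before scoring. The sketch below reproduces that idea in plain Python for a single label. `flat_prf` and the toy sequences are hypothetical, not part of `sklearn-crfsuite`.

```python
from itertools import chain


def flat_prf(y_true, y_pred, label):
    # Flatten the sentence-level sequences into one token-level list each.
    gold = list(chain.from_iterable(y_true))
    pred = list(chain.from_iterable(y_pred))
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented gold and predicted tags for two short sentences.
y_true = [["B-SCOPE", "I-SCOPE", "O"], ["O", "B-SCOPE", "I-SCOPE"]]
y_pred = [["B-SCOPE", "O", "O"], ["O", "B-SCOPE", "I-SCOPE"]]
print(flat_prf(y_true, y_pred, "I-SCOPE"))
```

Because most tokens lie outside any scope, the per-label scores for `B-SCOPE` and `I-SCOPE` say more about scope detection than the overall accuracy does.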

### CRF for negation scope detection

In [20]:

# CRF BIO tagging evaluation for NEGATION scopes
train_and_evaluate_crf(df_crf_neg_train, df_crf_neg_test, "neg_scope_label")

CRF Evaluation for: neg_scope_label
              precision    recall  f1-score   support

     B-SCOPE       0.87      0.46      0.60      1071
     I-SCOPE       0.74      0.56      0.64      2522
           O       0.97      0.99      0.98     61938

    accuracy                           0.97     65531
   macro avg       0.86      0.67      0.74     65531
weighted avg       0.96      0.97      0.96     65531



A useful next step is to print the model's predictions for individual sentences, so that its behaviour can be inspected on real examples.
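
One possible way to do this, sketched under assumptions: a hypothetical helper `format_predictions` lines up each token with its gold and predicted tag and flags the mismatches. It assumes that the token list and the two tag sequences for a sentence are already available (for example, the words grouped by `sentence_id` together with the matching entries of `y_test` and `y_pred`, which `train_and_evaluate_crf` would have to return). The example sentence and tags are invented.

```python
def format_predictions(tokens, gold, pred):
    # One line per token: word, gold tag, predicted tag, and a marker on errors.
    width = max((len(tok) for tok in tokens), default=0)
    lines = []
    for tok, g, p in zip(tokens, gold, pred):
        marker = "" if g == p else "  <-- mismatch"
        lines.append(f"{tok:<{width}}  {g:<8}  {p:<8}{marker}".rstrip())
    return "\n".join(lines)


tokens = ["No", "tiene", "fiebre", "."]
gold = ["B-SCOPE", "I-SCOPE", "I-SCOPE", "O"]
pred = ["B-SCOPE", "I-SCOPE", "O", "O"]
print(format_predictions(tokens, gold, pred))
```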

### CRF for uncertainty scope detection