## BERT&GPT-NER-Token Classification-Anxiety

## Automated BIO Tagging for Anxiety Symptoms Using spaCy

This Python script utilizes the `spaCy` library to perform Named Entity Recognition (NER) for identifying mentions of anxiety in a corpus of text. The goal is to tag these mentions using the BIO (Begin, Inside, Outside) tagging format.

### Key Components

- **spaCy Model Loading**: Loads the English language model from spaCy.
- **Data Loading**: Reads a CSV file containing the corpus into a pandas DataFrame.
- **Phrase Matching**: Utilizes spaCy's `PhraseMatcher` to find sequences of words that match terms indicative of anxiety.
- **BIO Tagging Function**: A custom function `bio_tagging_spacy` that applies BIO tagging to the identified terms within the text.
- **DataFrame Creation**: Constructs a new DataFrame from the tagged data, which pairs each token with its corresponding BIO tag.
- **Output**: Saves the tagged data to a new CSV file for further analysis or training machine learning models.

### Output

The script outputs a CSV file named `df_tokens_Anxiety.csv`. This file contains three columns: `goldID`, `token`, and `tag`, where each token from the corpus is annotated with its respective BIO tag indicating whether it is a part of an anxiety mention.

By automating the tagging process, this script facilitates more efficient preparation of text data for complex NLP tasks such as training models for sentiment analysis or more detailed psychological state assessments.


In [1]:
import spacy
from spacy.matcher import PhraseMatcher
import pandas as pd

# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

# Load your data
df = pd.read_csv('new_corpus_14_symptoms_counted.csv')
df_Anxiety = df[["Unnamed: 0", "goldID", "text", "Anxiety"]]
df_dic = pd.read_csv('Anxiety_dic.csv', encoding='latin1')

# Existing terms from your dictionary
Anxiety_terms = df_dic['Anxiety'].tolist()

patterns = [nlp.make_doc(term.lower()) for term in Anxiety_terms]

# Create the PhraseMatcher object
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("AnxietyPattern", None, *patterns)

# Function to apply BIO tagging using spaCy's PhraseMatcher
def bio_tagging_spacy(text):
    doc = nlp(text)
    matches = matcher(doc)
    tokens = [token.text for token in doc]
    labels = ['O'] * len(tokens)  # Default labels

    for match_id, start, end in matches:
        labels[start] = 'B-Anxiety'  # Begin tag, you could use 'B-Vomit' if specific tagging is required
        for i in range(start + 1, end):
            labels[i] = 'I-Anxiety'  # Inside tag, similarly 'I-Vomit'

    return list(zip(tokens, labels))

# Apply BIO tagging and create a new DataFrame
token_data = []
for _, row in df_Anxiety.iterrows():
    result = bio_tagging_spacy(row['text'])
    for token, tag in result:
        token_data.append({'goldID': row['goldID'], 'token': token, 'tag': tag})

df_tokens = pd.DataFrame(token_data)

# Save and display the results
df_tokens.to_csv('df_tokens_Anxiety.csv', index=False)





   goldID        token tag
0     356  authorizing   O
1     356     provider   O
2     356       younke   O
3     356       denise   O
4     356            l   O


## BERT-Based Named Entity Recognition for Anxiety Symptoms

This section of the Python script is designed to set up a BERT-based named entity recognition (NER) system for identifying anxiety symptoms in text. It uses libraries such as PyTorch, Hugging Face's transformers, and spaCy to perform text processing and entity tagging.

### Overview

The code performs the following functions:

1. **Imports and Logger Setup**: The necessary libraries are imported, and logging is configured for better debugging and tracking.
2. **Model Configuration**: Defines model names and configurations for different versions of BERT models tailored for clinical text (e.g., Bio-ClinicalBERT).
3. **Label Mapping**: Sets up the mapping between labels and their corresponding IDs for classification.
4. **Data Preparation**: Functions to load and preprocess the data are defined, ensuring the data fits the model requirements.
5. **Dataset and DataLoader**: Implements custom PyTorch dataset and dataloader to handle the tokenization and encoding of text data.
6. **Model Training and Evaluation**: Functions to train and evaluate the model using the provided datasets, calculating metrics such as precision, recall, and F1-score.

### Key Components

- **Config Class**: Holds configuration constants like batch size, learning rate, and device (CPU/GPU).
- **SentenceDataset Class**: Custom dataset class for handling sentence tokenization and encoding.
- **Training and Evaluation Functions**: Include detailed logging and performance metrics tracking to monitor the training process and evaluate the model's effectiveness.

### Output

The script outputs training and test loss, along with precision, recall, and F1 scores for the test dataset. Additionally, it saves these metrics to a CSV file in a results directory for further analysis.

### Execution

The main function orchestrates the loading of data, model training, and evaluation. It also ensures the results are saved and properly logged. This setup allows for a robust evaluation of different BERT models on the specific task of anxiety symptom recognition in clinical texts.

By structuring and training models in this way, researchers and practitioners can develop more effective tools for automatic symptom detection, aiding in faster and more accurate clinical assessments.


## Step By Step

## Setup for BERT-Based NER Model

This section of the Python script sets up the foundation for a Named Entity Recognition (NER) system specifically tailored to identify anxiety symptoms in text. Utilizing PyTorch and the transformers library, it configures logging, model parameters, and data handling mechanisms necessary for the NER task.

### Dependencies

The script imports necessary libraries such as `pandas` for data manipulation, `torch` and its utilities for deep learning operations, and `transformers` for accessing pre-trained BERT models and tokenizers. Additionally, metrics from `seqeval` are used to evaluate model performance.

### Configuration and Logging

- **Logging**: Configured to provide timestamped logs at the INFO level, helping in debugging and tracking model training and evaluation.
- **Model Names**: A dictionary mapping model identifiers to their descriptive names, facilitating easy switches between different BERT models.
- **Label Mapping**: Defines a mapping of NER labels to numerical IDs, essential for model training and inference.

### System Configuration Class

A `Config` class contains all relevant system and model settings:
- **Maximum Sequence Length**: Defines the cutoff length for tokenization.
- **Batch Sizes**: Specifies different batch sizes for training and validation phases.
- **Training Parameters**: Includes settings such as the number of epochs, learning rate, and gradient clipping norm.
- **Device Setup**: Automatically assigns the model to run on GPU if available, otherwise uses CPU.
- **Data and Results Handling**: Designates paths for the input data and directory for saving results.

### Purpose

This setup ensures that all components of the NER system are configured properly before training begins. It enables the model to efficiently process text data, learn from it, and store the results for further analysis or deployment in clinical settings.

By centralizing configuration settings and logging mechanisms, the script maintains high modularity and ease of maintenance, making adjustments and upgrades straightforward as new data or models become available.


In [None]:
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForTokenClassification
from seqeval.metrics import classification_report, precision_score, recall_score, f1_score
from torch import cuda
import logging
import os

#Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Define model names and their friendly names
model_names = {
    "bert-base-uncased": "BERT",
    "emilyalsentzer/Bio_ClinicalBERT": "Bio-ClinicalBERT",
    "New_Bio-Clinical_BERT_finetuned": "Symptom_BERT"
}

# Define label mapping
label2id = {'O': 0, 'B-Anxiety': 1, 'I-Anxiety': 2}  # Adjust as necessary
id2label = {v: k for k, v in label2id.items()}  # Reverse mapping from ID to label

class Config:
    MAX_LEN = 512
    TRAIN_BATCH_SIZE = 4
    VALID_BATCH_SIZE = 2
    EPOCHS = 5
    LEARNING_RATE = 3e-05
    MAX_GRAD_NORM = 10
    TRAIN_SIZE = 0.8
    DEVICE = 'cuda' if cuda.is_available() else 'cpu'
    DATA_FILE = "df_tokens_Anxiety.csv"
    RESULTS_DIR = './results_Anxiety_NER_BERT&GPT'  # Directory to save results

## Data Loading and Preprocessing Function

The `load_data` function is designed to handle the initial loading and preprocessing of the dataset necessary for training the NER model. This function is crucial for ensuring the data is in the correct format for tokenization and sequence labeling.

### Function Details

- **Input**: Accepts a file path to the CSV file containing the annotated tokens.
- **Processing Steps**:
  1. **Reading the Data**: The data is loaded using `pandas` with `unicode_escape` encoding to handle any special characters.
  2. **Handling Missing Values**: Uses forward fill (`ffill`) to handle any missing values in the data, ensuring no gaps in token sequences.
  3. **Sentence Reconstruction**: Aggregates tokens by their corresponding `goldID` to reconstruct the original sentences.
  4. **Label Aggregation**: Similarly, aggregates the BIO tags to form the corresponding label sequences for each sentence.
  5. **Deduplication**: Removes duplicate entries to ensure the uniqueness of each training example.
  
- **Output**: Returns a DataFrame with two columns: `sentence` and `word_labels`, which contain the reconstructed sentences and their respective label sequences.

### Error Handling

- The function includes robust error handling to log any issues encountered during the data loading process. If an error occurs, it raises an exception to halt further execution, ensuring that no faulty data compromises the training process.

### Purpose

This preprocessing step is vital for converting raw tokenized data into a structured format that the BERT tokenizer can effectively process. By preparing the data meticulously, this function helps in maintaining the integrity and quality of the training data, leading to more reliable NER model performance.

This structured approach not only facilitates error tracking and debugging but also enhances the model's ability to learn from accurately labeled data, improving its efficacy in real-world applications.


In [None]:
def load_data(file_path):
    try:
        data = pd.read_csv(file_path, encoding='unicode_escape')
        data = data.fillna(method='ffill')
        data['sentence'] = data.groupby(['goldID'])['token'].transform(lambda x: ' '.join(x))
        data['word_labels'] = data.groupby(['goldID'])['tag'].transform(lambda x: ','.join(x))
        data = data[["sentence", "word_labels"]].drop_duplicates().reset_index(drop=True)
        return data
    except Exception as e:
        logging.error(f"Failed to load data: {e}")
        raise

## Dataset Preparation Function

The `prepare_datasets` function is pivotal for splitting the preprocessed data into training and testing datasets, essential for training and evaluating the NER model's performance.

### Function Overview

- **Input**: Accepts a `pandas` DataFrame that contains the entire dataset.
- **Functionality**:
  1. **Sampling Training Data**: Randomly samples a specified fraction (`Config.TRAIN_SIZE`) of the data for training. The sampling is reproducible due to the fixed `random_state`.
  2. **Creating Test Data**: Identifies and separates the remainder of the data to form the test dataset.
  
- **Output**: Returns two DataFrames:
  - `train_data`: The dataset intended for training the model.
  - `test_data`: The dataset reserved for validating the model's performance.

### Configuration Dependency

- The function leverages the `TRAIN_SIZE` parameter from the `Config` class to determine the proportion of data used for training. This parameter can be adjusted depending on the dataset size or specific training requirements.

### Purpose

Splitting the data into training and testing sets is crucial for machine learning models to:
- **Learn Patterns** (training phase): The model learns to recognize patterns and associations within the labeled training data.
- **Evaluate Accuracy** (testing phase): The model's predictions are compared against actual labels in the test dataset to evaluate its accuracy and generalizability.

By properly preparing these datasets, this function ensures that the NER model is trained in a controlled environment and evaluated objectively, facilitating the development of a robust and effective entity recognition system.


In [None]:
# Prepare datasets
def prepare_datasets(data):
    train_data = data.sample(frac=Config.TRAIN_SIZE, random_state=200)
    test_data = data.drop(train_data.index).reset_index(drop=True)
    return train_data.reset_index(drop=True), test_data

## Custom Dataset Class for NER Model

The `SentenceDataset` class is a custom implementation extending PyTorch's `Dataset` class. It is specifically tailored to process text data for Named Entity Recognition (NER) tasks using BERT models.

### Class Structure

- **Initialization (`__init__`)**: Configures the dataset with the necessary components.
  - `dataframe`: A DataFrame containing the sentences and their corresponding labels.
  - `tokenizer`: A BERT tokenizer to convert text into tokens that the model can understand.
  - `max_len`: The maximum length of the sequences to be processed, ensuring consistency in input size.

- **Length Determination (`__len__`)**: Returns the number of items in the dataset, allowing PyTorch's `DataLoader` to plan batching and shuffling operations.

- **Item Access (`__getitem__`)**: Retrieves a single processed item from the dataset by index.
  - **Sentence and Label Extraction**: Extracts the sentence and its labels from the DataFrame.
  - **Tokenization and Label Preservation**: Tokenizes the sentence and adjusts the labels to align with the tokenized output.
  - **Encoding**: Converts tokens and labels into the format required by BERT, including adding special tokens, converting to IDs, and applying padding.

### Detailed Functionality

- **Tokenization (`tokenize_and_preserve_labels`)**: Splits the sentence into tokens/subwords and duplicates the labels accordingly to match the length of the generated tokens, preserving the label alignment.
  
- **Encoding (`encode_plus`)**: Prepares the final tokenized input for the model. This includes:
  - Adding special tokens (`[CLS]`, `[SEP]`) necessary for BERT.
  - Creating attention masks to differentiate real tokens from padding.
  - Converting tokens to their corresponding IDs.
  - Adjusting label IDs for compatibility with loss functions, using a default ID for padding tokens.

### Purpose

This class is crucial for transforming raw text data into a structured format that can be directly fed into a BERT model for training or inference. By ensuring that each text entry is tokenized and encoded correctly, the `SentenceDataset` class supports effective model training and contributes to higher model performance on NER tasks.

The design of this class also facilitates easy integration and usage within PyTorch training frameworks, making it a versatile component for various NLP tasks involving BERT.


In [None]:

# Custom Dataset class
class SentenceDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        sentence = self.data.iloc[index]['sentence']
        word_labels = self.data.iloc[index]['word_labels']
        tokenized_sentence, labels = self.tokenize_and_preserve_labels(sentence, word_labels)
        return self.encode_plus(tokenized_sentence, labels)

    def tokenize_and_preserve_labels(self, sentence, text_labels):
        tokenized_sentence, labels = [], []
        for word, label in zip(sentence.split(), text_labels.split(',')):
            subwords = self.tokenizer.tokenize(word)
            tokenized_sentence.extend(subwords)
            labels.extend([label] * len(subwords))
        return tokenized_sentence, labels
   
    def encode_plus(self, tokenized_sentence, labels):
        tokenized_sentence = ['[CLS]'] + tokenized_sentence[:self.max_len-2] + ['[SEP]']
        labels = ['O'] + labels[:self.max_len-2] + ['O']
        attention_mask = [1] * len(tokenized_sentence) + [0] * (self.max_len - len(tokenized_sentence))
        input_ids = self.tokenizer.convert_tokens_to_ids(tokenized_sentence)
        label_ids = [label2id.get(label, -100) for label in labels]
        padding_length = self.max_len - len(input_ids)
        input_ids += [self.tokenizer.pad_token_id] * padding_length
        label_ids += [-100] * padding_length
        return {'input_ids': torch.tensor(input_ids, dtype=torch.long),
                'attention_mask': torch.tensor(attention_mask, dtype=torch.long),
                'labels': torch.tensor(label_ids, dtype=torch.long)}

## Training Function for NER Model

The `train` function is designed to handle the training process of a BERT-based NER model using PyTorch. It encapsulates the steps necessary to iterate over batches of data, compute the loss, and update the model's weights.

### Function Overview

- **Inputs**:
  - `model`: The NER model to be trained, pre-initialized with the necessary architecture and weights.
  - `loader`: A DataLoader that provides batches of the dataset.
  - `optimizer`: The optimization algorithm used to update the weights based on the computed gradients.

### Core Operations

1. **Mode Setting**: Sets the model to training mode, enabling features like dropout layers that are essential during the training phase but not during evaluation.
2. **Loss Initialization**: Initializes the total loss to zero. This will accumulate the loss over all batches in the dataset.
3. **Batch Processing**:
   - For each batch, the function extracts `input_ids`, `attention_mask`, and `labels`, and moves them to the appropriate device (CPU or GPU).
   - Clears old gradients from the last step (if existing).
4. **Forward Pass**:
   - The model performs a forward pass to compute the logits from the input data.
   - Computes the loss between the logits and the ground-truth labels.
5. **Backward Pass**:
   - Backpropagation is used to compute the gradient of the loss function with respect to the model parameters.
   - Gradient clipping is applied to prevent exploding gradients, which can destabilize the training process.
6. **Optimization Step**:
   - Updates the model parameters based on the gradients computed during backpropagation.

### Output

- Returns the average loss computed over all batches. This metric is essential for monitoring the training process and diagnosing issues with model convergence.

### Purpose

This training function is critical for optimizing the NER model, allowing it to learn from the training data effectively. By iteratively adjusting the model's weights, the function seeks to minimize the loss, thus enhancing the model's ability to accurately predict entity labels.

Structured and well-documented, this function is vital for achieving high performance in NER tasks, facilitating robust model training through systematic iteration and optimization.


In [None]:
def train(model, loader, optimizer):
    model.train()
    total_loss = 0
    for batch in loader:
        inputs, masks, labels = batch['input_ids'].to(Config.DEVICE), batch['attention_mask'].to(Config.DEVICE), batch['labels'].to(Config.DEVICE)
        model.zero_grad()
        outputs = model(input_ids=inputs, attention_mask=masks, labels=labels)
        loss = outputs.loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), Config.MAX_GRAD_NORM)
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

## Evaluation Function for NER Model

The `evaluate` function is crucial for assessing the performance of a trained NER model using a validation or test dataset. It measures the model's ability to predict entity labels accurately against the known labels.

### Function Overview

- **Inputs**:
  - `model`: The NER model to be evaluated.
  - `loader`: A DataLoader that supplies batches of the dataset for evaluation.

### Core Operations

1. **Mode Setting**: Puts the model in evaluation mode, which deactivates training-specific features like dropout to stabilize the model’s predictions.
2. **Loss and Prediction Tracking**: Initializes variables to track the total loss and predictions across all batches.
3. **Batch Processing**:
   - Processes each batch in a no-gradient context to prevent updates to model parameters and reduce memory consumption.
   - For each batch, extracts inputs and their corresponding labels and transfers them to the appropriate device.
4. **Model Prediction**:
   - Conducts a forward pass to generate logits from the input data.
   - Computes the loss to gauge the discrepancy between predicted and actual labels.
   - Accumulates the total loss for later averaging.
5. **Label Prediction Extraction**:
   - Converts logits to actual label predictions using the `argmax` function, which selects the most likely label for each token.
   - Extracts labels from both the predictions and the ground truth where the attention mask indicates actual token presence (ignoring padded areas).
6. **Metrics Computation**:
   - Calculates precision, recall, and F1-score to evaluate the model's performance on identifying correct labels accurately.

### Outputs

- Returns the average loss over all batches and the precision, recall, and F1 scores, providing a comprehensive view of the model's performance.

### Purpose

Evaluating the model using these metrics is essential for understanding its effectiveness in real-world scenarios and identifying areas for improvement. This function provides a systematic approach to quantify the model's accuracy and generalizability, ensuring robust performance in NER tasks.

Structured to facilitate thorough performance assessment, this function is key to validating the model’s capability and ensuring it meets the expected standards of accuracy and efficiency in entity recognition.


In [None]:
def evaluate(model, loader):
    model.eval()
    total_loss = 0
    predictions, labels = [], []
    with torch.no_grad():
        for batch in loader:
            inputs, masks, targets = batch['input_ids'].to(Config.DEVICE), batch['attention_mask'].to(Config.DEVICE), batch['labels'].to(Config.DEVICE)
            outputs = model(input_ids=inputs, attention_mask=masks, labels=targets)
            loss = outputs.loss
            total_loss += loss.item()

            logits = outputs.logits
            predictions_batch = torch.argmax(logits, axis=2)
            for i, mask in enumerate(masks):
                temp_1 = []
                temp_2 = []
                for j, m in enumerate(mask):
                    if m and targets[i, j] != torch.tensor(-100):
                        temp_1.append(id2label[targets[i, j].item()])
                        temp_2.append(id2label[predictions_batch[i, j].item()])
                labels.append(temp_1)
                predictions.append(temp_2)

    precision = precision_score(labels, predictions)
    recall = recall_score(labels, predictions)
    f1 = f1_score(labels, predictions)
    return total_loss / len(loader), precision, recall, f1


## Main Execution Flow for Training and Evaluating NER Models

The `main` function orchestrates the overall process of training and evaluating named entity recognition (NER) models. It sets up the environment, initializes data, configures models, and conducts training and evaluation cycles.

### Process Overview

1. **Data Loading and Preparation**:
   - Loads the dataset from a specified file using the `load_data` function.
   - Prepares the dataset into training and testing sets using the `prepare_datasets` function.

2. **Environment Setup**:
   - Checks and creates a results directory if it doesn't exist, ensuring a place to save the output results.

3. **Model Training Setup**:
   - Iterates through predefined model configurations.
   - For each configuration:
     - Initializes the tokenizer and model with the specified BERT variant.
     - Transfers the model to the appropriate compute device (CPU or GPU).
     - Prepares data loaders for both training and testing datasets with specified batch sizes and shuffle configurations.
     - Initializes the optimizer with a defined learning rate.

4. **Training Loop**:
   - Conducts multiple epochs of training:
     - Each epoch involves running the `train` function to process the training data through the model and update the weights.
     - Logs the loss at the end of each epoch for monitoring.

5. **Final Epoch Evaluation**:
   - In the last epoch, evaluates the model on the test dataset using the `evaluate` function.
   - Extracts and logs performance metrics such as test loss, precision, recall, and F1 score.

6. **Results Handling**:
   - Prints the collected test metrics for quick reference.
   - Saves the metrics to a CSV file in the predefined results directory.

### Purpose

The main function is designed to provide a comprehensive and systematic approach to training and evaluating deep learning models for NER tasks. It integrates various components and functions, ensuring a streamlined workflow from data preparation to performance evaluation.

This approach not only facilitates efficient training cycles but also ensures that the models are rigorously tested and their performance metrics are accurately captured and reported, crucial for iterative model development and refinement.


In [22]:
# Main Execution
def main():
    data = load_data(Config.DATA_FILE)
    train_data, test_data = prepare_datasets(data)
    
    if not os.path.exists(Config.RESULTS_DIR):
        os.makedirs(Config.RESULTS_DIR)
        
    for model_key in model_names:
        logging.info(f"Training and evaluating model: {model_names[model_key]}")
        tokenizer = BertTokenizer.from_pretrained(model_key)
        model = BertForTokenClassification.from_pretrained(model_key, num_labels=len(id2label))
        model.to(Config.DEVICE)
        train_loader = DataLoader(SentenceDataset(train_data, tokenizer, Config.MAX_LEN), batch_size=Config.TRAIN_BATCH_SIZE, shuffle=True)
        test_loader = DataLoader(SentenceDataset(test_data, tokenizer, Config.MAX_LEN), batch_size=Config.VALID_BATCH_SIZE, shuffle=False)
        optimizer = torch.optim.Adam(model.parameters(), lr=Config.LEARNING_RATE)

    for epoch in range(Config.EPOCHS):
        train_loss = train(model, train_loader, optimizer)
        logging.info(f'Epoch {epoch+1}, Train Loss: {train_loss}')
        if epoch == Config.EPOCHS - 1:  # Only evaluate and save in the last epoch
            test_loss, precision, recall, f1 = evaluate(model, test_loader)
            test_metrics = {'Test Loss': test_loss, 'Precision': precision, 'Recall': recall, 'F1 Score': f1}


         # Now outside the loop, only for the final epoch
    print(f"Test metrics: {test_metrics}")
    print("*********************************************************************************************")
        
    # Save metrics
    metrics_filename = os.path.join(Config.RESULTS_DIR, f"{model_names[model_key]}_test_metrics.csv")
    pd.DataFrame([test_metrics]).to_csv(metrics_filename, index=False)
    print(f"Saved test metrics to {metrics_filename}")
        

if __name__ == "__main__":
    main()


2024-05-17 11:15:30,890 - INFO - Training and evaluating model: BERT
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2024-05-17 11:15:31,599 - INFO - Training and evaluating model: Bio-ClinicalBERT
Some weights of BertForTokenClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2024-05-17 11:16:21,227 - INFO - Epoch 1, Train Loss: 0.03271314332635294
2024-05-17 11:17:10,789 - INFO - Epoch 2, Train Loss: 0.008752769470041574
2024-05-17 11:18:00,889 - INFO - Epoch 3, Train Loss: 0.0026288037556210036
2024-05-17 11:18:51,482 - INFO 

Test metrics: {'Test Loss': 0.00427703428660472, 'Precision': 0.9210526315789473, 'Recall': 0.813953488372093, 'F1 Score': 0.8641975308641974}
*********************************************************************************************
Saved test metrics to ./results_Anxiety_NER_BERT&GPT/Bio-ClinicalBERT_test_metrics.csv


## GPT-NER-Token Classification-Anxiety

## Python Setup for GPT-2 Based Token Classification

This script sets the foundation for a token classification task using variants of the GPT-2 model, tailored for Named Entity Recognition (NER). It configures the necessary environment, including data handling, model selection, and logging.

### Imports and Dependencies

- **Libraries**: Utilizes `pandas` for data handling, `torch` for constructing deep learning models, and `transformers` for accessing pretrained GPT-2 models and tokenizers.
- **Metrics**: Employs `seqeval` metrics to evaluate the classification performance, including precision, recall, and F1 score.

### Configuration and Logging

- **Logging Setup**: Configures basic logging to track and log runtime events and model performance metrics. (Note: This line is commented out in the snippet provided.)
- **Model Selection**: Specifies a dictionary of GPT-2 model variants. Each key is a model identifier with its respective name. (Note: Most entries are commented out, with only the base 'gpt2' model active for use.)

### Label Mapping

- **Label Definitions**: Maps entity labels to numeric IDs, essential for model training and prediction. This setup uses an example mapping for anxiety-related labels (`B-Anxiety`, `I-Anxiety`).
- **Reverse Mapping**: Provides a mechanism to convert numeric IDs back to their corresponding textual labels.

### System Configuration Class

Defines several crucial constants and parameters for the model training and evaluation:
- **`MAX_LEN`**: Maximum sequence length to which inputs will be padded or truncated.
- **`TRAIN_BATCH_SIZE`** and **`VALID_BATCH_SIZE`**: Specifies the sizes of data batches for training and validation.
- **`EPOCHS`** and **`LEARNING_RATE`**: Defines the number of training cycles and the step size for updating model weights.
- **`MAX_GRAD_NORM`**: Sets a limit for gradient norm clipping to prevent exploding gradients during backpropagation.
- **`TRAIN_SIZE`**: Fraction of data used for training, with the remainder used for validation.
- **`DEVICE`**: Automatically selects CUDA if available (for GPU acceleration) or defaults to CPU.
- **`DATA_FILE`** and **`RESULTS_DIR`**: Paths for loading the dataset and saving results, respectively.

### Purpose

This setup script is designed to initialize the environment and configurations for training and evaluating a GPT-2 based NER model. It ensures all components are in place for effective model operation, from data preprocessing to model training and performance evaluation. By organizing these settings at the start, the script facilitates a smooth and efficient workflow for complex NER tasks involving deep learning.

### Usage

To run NER tasks with the configured settings, further scripts would utilize these configurations to prepare data, train models, and evaluate results. This setup ensures that all components are optimized for the specific demands of token classification using advanced language models like GPT-2.


In [None]:
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer, GPT2ForTokenClassification
from seqeval.metrics import classification_report, precision_score, recall_score, f1_score
from torch import cuda
import logging
import os

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

model_names = {
    'gpt2': 'gpt2',
    #'biogpt': 'microsoft/BioGPT',
    #'bioMedLM': 'stanford-crfm/BioMedLM',
    'symptom-GPT': './symptom-BioGPT-1 Million',
    #'symptom-GPT-Neo': 'EleutherAI/gpt-neo-1.3B'
}
# Define label mapping
label2id = {'O': 0, 'B-Anxiety': 1, 'I-Anxiety': 2}  # Example, adjust according to your actual labels
id2label = {v: k for k, v in label2id.items()}  # Reverse mapping from ID to label

# Constants and Configurations
class Config:
    MAX_LEN = 512
    TRAIN_BATCH_SIZE = 4
    VALID_BATCH_SIZE = 2
    EPOCHS = 5
    LEARNING_RATE = 3e-05
    MAX_GRAD_NORM = 10
    TRAIN_SIZE = 0.8
    DEVICE = 'cuda' if cuda.is_available() else 'cpu'
    DATA_FILE = "df_tokens_Anxiety.csv"
    RESULTS_DIR = './results_Anxiety_NER_BERT&GPT'  # Directory to save results

## Data Loading and Preprocessing Function

The `load_data` function is crucial for the initial steps of loading and preprocessing the dataset required for training the token classification model. This function ensures the data is in an appropriate format for further processing and analysis.

### Function Overview

- **Input**: Accepts the file path to a CSV file containing the dataset.
- **Exception Handling**: Implements robust error handling to manage and log issues that occur during data loading.

### Core Operations

1. **CSV Data Loading**: 
   - Loads the dataset from the specified CSV file using `pandas`, with `unicode_escape` encoding to correctly handle special characters.
   
2. **Missing Value Handling**: 
   - Applies forward fill (`ffill`) to fill any missing values in the data. This method ensures that all data points have complete information by propagating the last valid observation forward.

3. **Sentence and Label Aggregation**: 
   - Groups tokens by `goldID` and concatenates them to reconstruct full sentences, facilitating easier processing in NLP tasks.
   - Similarly, concatenates corresponding labels for each token within a sentence, preserving the sequence of tags for NER.

4. **Data Deduplication**: 
   - Removes duplicate rows based on the `sentence` and `word_labels` columns, ensuring the uniqueness of data entries in the dataset.
   - Resets the DataFrame index for easy data handling in subsequent operations.

### Output

- **Returns**: A `pandas` DataFrame containing two columns: `sentence` and `word_labels`. Each row represents a unique sentence and its associated sequence of labels, prepared for NER training.

### Error Handling

- If an exception is encountered during the loading process, the function logs the error and re-raises the exception to halt further execution, ensuring issues are addressed promptly.

### Purpose

This function is essential for transforming raw CSV data into a structured format suitable for NER tasks. By preprocessing and cleaning the data at this stage, the function facilitates smoother integration and efficiency in subsequent modeling steps, ensuring the data is ready for tokenization and model training.


In [None]:
# Load and preprocess data
def load_data(file_path):
    try:
        data = pd.read_csv(file_path, encoding='unicode_escape')
        data = data.fillna(method='ffill')
        data['sentence'] = data.groupby(['goldID'])['token'].transform(lambda x: ' '.join(x))
        data['word_labels'] = data.groupby(['goldID'])['tag'].transform(lambda x: ','.join(x))
        data = data[["sentence", "word_labels"]].drop_duplicates().reset_index(drop=True)
        return data
    except Exception as e:
        logging.error(f"Failed to load data: {e}")
        raise

## Data Loading and Preprocessing Function

The `load_data` function is crucial for the initial steps of loading and preprocessing the dataset required for training the token classification model. This function ensures the data is in an appropriate format for further processing and analysis.

### Function Overview

- **Input**: Accepts the file path to a CSV file containing the dataset.
- **Exception Handling**: Implements robust error handling to manage and log issues that occur during data loading.

### Core Operations

1. **CSV Data Loading**: 
   - Loads the dataset from the specified CSV file using `pandas`, with `unicode_escape` encoding to correctly handle special characters.
   
2. **Missing Value Handling**: 
   - Applies forward fill (`ffill`) to fill any missing values in the data. This method ensures that all data points have complete information by propagating the last valid observation forward.

3. **Sentence and Label Aggregation**: 
   - Groups tokens by `goldID` and concatenates them to reconstruct full sentences, facilitating easier processing in NLP tasks.
   - Similarly, concatenates corresponding labels for each token within a sentence, preserving the sequence of tags for NER.

4. **Data Deduplication**: 
   - Removes duplicate rows based on the `sentence` and `word_labels` columns, ensuring the uniqueness of data entries in the dataset.
   - Resets the DataFrame index for easy data handling in subsequent operations.

### Output

- **Returns**: A `pandas` DataFrame containing two columns: `sentence` and `word_labels`. Each row represents a unique sentence and its associated sequence of labels, prepared for NER training.

### Error Handling

- If an exception is encountered during the loading process, the function logs the error and re-raises the exception to halt further execution, ensuring issues are addressed promptly.

### Purpose

This function is essential for transforming raw CSV data into a structured format suitable for NER tasks. By preprocessing and cleaning the data at this stage, the function facilitates smoother integration and efficiency in subsequent modeling steps, ensuring the data is ready for tokenization and model training.


In [None]:
# Prepare datasets
def prepare_datasets(data):
    train_data = data.sample(frac=Config.TRAIN_SIZE, random_state=200)
    test_data = data.drop(train_data.index).reset_index(drop=True)
    return train_data.reset_index(drop=True), test_data

## Custom Dataset Class for NER Model

The `SentenceDataset` class is an essential part of the token classification pipeline, designed to interface with PyTorch’s `DataLoader`. This custom class prepares and provides data in a format suitable for training and evaluating NER models based on transformers.

### Class Definition

- **Inheritance**: Extends PyTorch's `Dataset` class, ensuring compatibility with PyTorch workflows and functionalities such as batching and parallel data processing.

### Constructor

- **Parameters**:
  - `dataframe`: A pandas DataFrame containing sentences and their corresponding labels.
  - `tokenizer`: A tokenizer object from the transformers library, configured for a specific model.
  - `max_len`: The maximum sequence length, ensuring uniformity in token length across data entries.

### Methods

1. **`__len__`**:
   - Returns the number of entries in the dataset, allowing the `DataLoader` to determine the size of each batch and the number of iterations per epoch.

2. **`__getitem__`**:
   - Retrieves a single data point from the dataset by index.
   - **Operations**:
     - Extracts the sentence and its labels.
     - Tokenizes the sentence while preserving label alignment with tokens, handling cases where a word is split into multiple subwords.
     - Encodes the tokenized data into the format required by the model, including converting tokens to IDs, creating attention masks, and applying necessary padding.

3. **`tokenize_and_preserve_labels`**:
   - Tokenizes each word in the sentence and replicates each label to match the number of subwords generated from the corresponding word, ensuring that the label sequence remains aligned with the token sequence.

4. **`encode_plus`**:
   - Formats the tokenized input for the model by adding special tokens, converting tokens to input IDs, creating attention masks, and ensuring that all sequences are padded to a uniform length.
   - **Padding Strategy**: Uses the model’s end-of-sequence token ID for padding if the standard padding token ID is not set, which is crucial for models like GPT-2.

### Outputs

- Each call to `__getitem__` produces a dictionary with keys:
  - `input_ids`: Token IDs suitable for model input.
  - `attention_mask`: Mask specifying which tokens should be attended to, ignoring padding.
  - `labels`: Numeric labels for each token, aligned with the `input_ids`.

### Purpose

This class is designed to streamline the preparation of data for NER models, transforming raw text data into structured inputs that are optimized for deep learning models. By handling complex preprocessing tasks such as tokenization, label alignment, and sequence padding, `SentenceDataset` ensures that the model can focus on learning from accurately prepared inputs.
### Example Usage

To integrate this class into a training or evaluation pipeline, instantiate it with the appropriate parameters and pass it to a PyTorch `DataLoader`:


In [None]:
# Custom Dataset class
class SentenceDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        sentence = self.data.iloc[index]['sentence']
        word_labels = self.data.iloc[index]['word_labels']
        tokenized_sentence, labels = self.tokenize_and_preserve_labels(sentence, word_labels)
        return self.encode_plus(tokenized_sentence, labels)

    def tokenize_and_preserve_labels(self, sentence, text_labels):
        tokenized_sentence, labels = [], []
        for word, label in zip(sentence.split(), text_labels.split(',')):
            subwords = self.tokenizer.tokenize(word)
            tokenized_sentence.extend(subwords)
            labels.extend([label] * len(subwords))
        return tokenized_sentence, labels

    def encode_plus(self, tokenized_sentence, labels):
        tokenized_sentence = ['<|endoftext|>'] + tokenized_sentence[:self.max_len-2] + ['<|endoftext|>']
        labels = ['O'] + labels[:self.max_len-2] + ['O']
        input_ids = self.tokenizer.convert_tokens_to_ids(tokenized_sentence)
        attention_mask = [1] * len(input_ids) + [0] * (self.max_len - len(input_ids))

        # Padding
        padding_length = self.max_len - len(input_ids)
        input_ids += [self.tokenizer.eos_token_id] * padding_length  # Use eos_token_id for padding if pad_token_id is None
        label_ids = [label2id.get(label, -100) for label in labels]
        label_ids += [-100] * padding_length

        return {'input_ids': torch.tensor(input_ids, dtype=torch.long),
                'attention_mask': torch.tensor(attention_mask, dtype=torch.long),
                'labels': torch.tensor(label_ids, dtype=torch.long)}

## Training Function for NER Model

The `train` function orchestrates the training process for a named entity recognition (NER) model, utilizing PyTorch's functionalities. This function is crucial for fitting the model to the training data, adjusting model weights through backpropagation based on the loss calculated from predictions.

### Function Overview

- **Parameters**:
  - `model`: The NER model to be trained.
  - `loader`: A DataLoader that batches the dataset for efficient training.
  - `optimizer`: The optimization algorithm used to update model weights.

### Core Operations

1. **Mode Setting**: 
   - Sets the model to training mode, which enables certain functionalities like dropout layers that are essential during training but not during evaluation.

2. **Loss Initialization**:
   - Initializes a counter for total loss to track the loss across all batches, providing a measure of training performance per epoch.

3. **Batch Processing**:
   - Iterates over each batch provided by the DataLoader. For each batch, performs the following steps:
     - **Data Loading**: Moves input IDs, attention masks, and labels to the designated computing device (CPU or GPU).
     - **Forward Pass**: Processes the inputs through the model, generating predictions and computing the loss.
     - **Backward Pass**: Performs backpropagation to calculate gradients with respect to the loss.
     - **Gradient Clipping**: Applies gradient norm clipping to prevent exploding gradients, a common issue in training deep neural networks.
     - **Optimization Step**: Updates the model weights using the optimizer.

4. **Logging**:
   - Logs the loss for each batch to monitor training progress and diagnose training stability.

### Output

- **Returns**: The average loss for the epoch, computed by dividing the total loss by the number of batches processed. This metric helps in evaluating the training process across epochs.

### Purpose

This function is designed to effectively train NER models by systematically adjusting their weights to minimize the loss on training data. The structured approach ensures that each step of the model training is carried out optimally, from data handling to weight updates, fostering robust learning.

### Usage

The `train` function is typically called within a training loop across multiple epochs, allowing the model to iteratively learn from the training data:

```python
for epoch in range(total_epochs):
    epoch_loss = train(model, data_loader, optimizer)
    print(f'Epoch {epoch}: Loss = {epoch_loss}')


In [None]:
# Training and Evaluation Functions
def train(model, loader, optimizer):
    model.train()
    total_loss = 0
    for batch in loader:
        inputs, masks, labels = batch['input_ids'].to(Config.DEVICE), batch['attention_mask'].to(Config.DEVICE), batch['labels'].to(Config.DEVICE)
        model.zero_grad()
        outputs = model(input_ids=inputs, attention_mask=masks, labels=labels)
        loss = outputs.loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), Config.MAX_GRAD_NORM)
        optimizer.step()
        total_loss += loss.item()
        logging.info(f'Batch loss: {loss.item()}')  # Log loss for each batch
    return total_loss / len(loader)

## Evaluation Function for NER Model

The `evaluate` function is designed to assess the performance of a trained named entity recognition (NER) model using a test or validation dataset. It calculates metrics such as precision, recall, and F1-score, which are essential for understanding the model's accuracy.

### Function Overview

- **Parameters**:
  - `model`: The NER model to be evaluated.
  - `loader`: A DataLoader that batches the dataset for efficient evaluation.

### Core Operations

1. **Mode Setting**:
   - Sets the model to evaluation mode, which disables training-specific operations like dropout to ensure consistent behavior for predictions.

2. **Loss and Metric Initialization**:
   - Initializes a counter for total loss and lists to store predictions and actual labels for later metric calculation.

3. **Batch Processing**:
   - Iterates over each batch from the DataLoader within a no-gradient context to save memory and prevent model updates.
   - For each batch, performs the following steps:
     - **Data Loading**: Transfers input IDs, attention masks, and labels to the designated compute device.
     - **Forward Pass**: Computes the outputs and loss without updating model parameters.
     - **Loss Accumulation**: Adds up the loss for each batch to calculate the total loss after all batches are processed.

4. **Prediction Extraction and Label Mapping**:
   - Extracts predictions using `argmax` on the logits to determine the most likely labels.
   - Aligns predictions with their corresponding labels based on the attention mask, ensuring only non-padded elements are considered.

5. **Metric Calculation**:
   - Calculates precision, recall, and F1-score based on the gathered predictions and actual labels, providing a quantitative measure of the model’s performance.

6. **Logging**:
   - Logs calculated metrics for analysis and debugging purposes.

### Output

- **Returns**: A tuple containing the average loss per batch, precision, recall, and F1-score. These metrics help to quantify the model's effectiveness in identifying correct entities.

### Purpose

This function provides a comprehensive evaluation of the NER model, allowing developers and researchers to measure how well the model performs on unseen data. It highlights areas where the model excels and where it may need improvement, guiding further model tuning or deployment decisions.

### Usage

The `evaluate` function is typically called after the training process to assess the model's generalization capabilities:

```python
test_loss, precision, recall, f1 = evaluate(model, test_loader)
print(f"Test Loss: {test_loss}, Precision: {precision}, Recall: {recall}, F1 Score: {f1}")


In [None]:
def evaluate(model, loader):
    model.eval()
    total_loss = 0
    predictions, labels = [], []
    with torch.no_grad():
        for batch in loader:
            inputs, masks, targets = batch['input_ids'].to(Config.DEVICE), batch['attention_mask'].to(Config.DEVICE), batch['labels'].to(Config.DEVICE)
            outputs = model(input_ids=inputs, attention_mask=masks, labels=targets)
            loss = outputs.loss
            total_loss += loss.item()

            logits = outputs.logits
            predictions_batch = torch.argmax(logits, axis=2)
            for i, mask in enumerate(masks):
                temp_1 = []
                temp_2 = []
                for j, m in enumerate(mask):
                    if m and targets[i, j] != torch.tensor(-100):
                        temp_1.append(id2label[targets[i, j].item()])
                        temp_2.append(id2label[predictions_batch[i, j].item()])
                labels.append(temp_1)
                predictions.append(temp_2)

    precision = precision_score(labels, predictions)
    recall = recall_score(labels, predictions)
    f1 = f1_score(labels, predictions)

    logging.info(f"Precision: {precision}, Recall: {recall}, F1-Score: {f1}")
    return total_loss / len(loader), precision, recall, f1

## Main Execution Flow for NER Model Training and Evaluation

The `main` function serves as the entry point for the script, coordinating the data loading, model training, and evaluation phases. This function ensures that each component of the model training pipeline is executed in sequence and managed properly.

### Process Overview

1. **Data Loading**:
   - Calls the `load_data` function to load and preprocess the dataset from a specified CSV file.

2. **Dataset Preparation**:
   - Uses the `prepare_datasets` function to split the loaded data into training and testing datasets.

3. **Model Setup**:
   - Iterates over predefined model configurations stored in `model_names`.
   - Initializes the tokenizer and model for each specified GPT-2 variant.
   - Configures the model to run on the appropriate device (CPU or GPU).

4. **DataLoader Configuration**:
   - Sets up PyTorch `DataLoader`s for both training and testing datasets, specifying batch sizes and shuffle settings.

5. **Optimizer Initialization**:
   - Configures the Adam optimizer with a predefined learning rate, ready to update model weights during training.

6. **Training Loop**:
   - Executes a training loop over a set number of epochs, using the `train` function to process the training data.
   - Logs the training loss after each epoch to monitor progress.

7. **Final Evaluation**:
   - In the last epoch, evaluates the model on the test dataset using the `evaluate` function.
   - Captures and logs key performance metrics: precision, recall, and F1 score.

8. **Results Handling**:
   - Prints and logs the final test metrics for quick reference and analysis.
   - Saves the evaluation metrics to a CSV file in a specified results directory for further analysis or reporting.

### Purpose

This main function is critical for executing the full lifecycle of a machine learning project—from data handling through to model training and performance evaluation. It encapsulates all necessary steps in a clear and logical sequence, ensuring that the model is trained and evaluated effectively.

### Example Usage

To run this script, ensure it is executed as the main module:

```bash
python script_name.py


In [1]:
# Main Execution
def main():
    data = load_data(Config.DATA_FILE)
    train_data, test_data = prepare_datasets(data)
    for model_key in model_names:
        logging.info(f"Training and evaluating model: {model_names[model_key]}")
        tokenizer = GPT2Tokenizer.from_pretrained(model_key)
        model = GPT2ForTokenClassification.from_pretrained(model_key, num_labels=len(id2label))
        model.to(Config.DEVICE)
        train_loader = DataLoader(SentenceDataset(train_data, tokenizer, Config.MAX_LEN), batch_size=Config.TRAIN_BATCH_SIZE, shuffle=True)
        test_loader = DataLoader(SentenceDataset(test_data, tokenizer, Config.MAX_LEN), batch_size=Config.VALID_BATCH_SIZE, shuffle=False)
        optimizer = torch.optim.Adam(model.parameters(), lr=Config.LEARNING_RATE)

    for epoch in range(Config.EPOCHS):
            train_loss = train(model, train_loader, optimizer)
            logging.info(f'Epoch {epoch+1}, Train Loss: {train_loss}')
            if epoch == Config.EPOCHS - 1:  # Only evaluate and save in the last epoch
                test_loss, precision, recall, f1 = evaluate(model, test_loader)
                test_metrics = {'Test Loss': test_loss, 'Precision': precision, 'Recall': recall, 'F1 Score': f1}


         # Now outside the loop, only for the final epoch
    print(f"Test metrics: {test_metrics}")
    print("*********************************************************************************************")
        
    # Save metrics
    metrics_filename = os.path.join(Config.RESULTS_DIR, f"{model_names[model_key]}_test_metrics.csv")
    pd.DataFrame([test_metrics]).to_csv(metrics_filename, index=False)
    print(f"Saved test metrics to {metrics_filename}")
    
if __name__ == "__main__":
    main()

Some weights of GPT2ForTokenClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Test metrics: {'Test Loss': 0.01575318283615464, 'Precision': 0.8421052631578947, 'Recall': 0.7529411764705882, 'F1 Score': 0.7950310559006211}
*********************************************************************************************
Saved test metrics to ./results_Anxiety_NER_BERT&GPT/gpt2_test_metrics.csv
