In [5]:
#! pip install -r requirements.txt

## BERT & GPT NER-Token Classification-Nausea/Vomiting

## Nausea and Vomiting Symptom Extraction Using spaCy

This script uses the `spaCy` library to identify and tag mentions of nausea and vomiting symptoms in text data. It loads the text, finds symptom mentions with dictionary-based phrase matching, and assigns a BIO tag to every token, producing labels in the format used for Named Entity Recognition (NER) training.

### Overview of Steps

1. **spaCy Model Loading**:
   - Loads the English language model from spaCy, which will be used to process text and create document objects.

2. **Data Loading**:
   - Imports a CSV file containing text data into a pandas DataFrame.
   - Loads a dictionary of terms related to nausea and vomiting from another CSV file.

3. **Term Preparation**:
   - Extracts nausea-related terms from the loaded dictionary.
   - Manually adds additional terms that describe vomiting to ensure comprehensive coverage.
   - Combines all terms and converts them into `Doc` objects for phrase matching.

4. **Phrase Matching Setup**:
   - Initializes a `PhraseMatcher` object with the intent to match text data against the list of prepared terms.
   - Configures the matcher to recognize phrases based on lowercased words to ensure case insensitivity.

5. **BIO Tagging Function**:
   - Defines a function to tag words in text with BIO labels (Beginning, Inside, Outside) using matches found by the `PhraseMatcher`.
   - Matched tokens are labeled 'B-Nausea' (first token of a mention) or 'I-Nausea' (subsequent tokens of the same mention); every other token is labeled 'O'. The same label pair covers both nausea and vomiting mentions.

6. **Tagging Application and DataFrame Creation**:
   - Applies the BIO tagging function to each text entry in the DataFrame.
   - Collects the results, including the text's original gold ID, the tokenized words, and their BIO tags into a new DataFrame.

7. **Results Handling**:
   - Saves the tagged data to a CSV file for further analysis or use.
   - Displays the head of the resulting DataFrame to provide a quick preview of the tagged text.
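
The BIO tagging in step 5 can be sketched without spaCy. The snippet below is a minimal illustration, assuming the matcher has already produced half-open `(start, end)` token spans; the helper name `spans_to_bio` and the longest-span-first handling of overlapping matches are additions for illustration and are not part of the original script.

```python
from typing import List, Tuple


def spans_to_bio(n_tokens: int, spans: List[Tuple[int, int]], label: str = "Nausea") -> List[str]:
    """Convert half-open token spans (start, end) into one BIO tag per token."""
    tags = ["O"] * n_tokens
    # Process the longest spans first so that a short match nested inside
    # a longer one cannot overwrite part of it.
    for start, end in sorted(spans, key=lambda s: s[1] - s[0], reverse=True):
        if any(tag != "O" for tag in tags[start:end]):
            continue  # overlaps a span that has already been tagged
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


tokens = ["patient", "keeps", "throwing", "up", "after", "meals"]
spans = [(2, 4)]  # hypothetical matcher output for the phrase "throwing up"
tags = spans_to_bio(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Skipping overlapping spans is one possible design choice; the script below instead writes every match in the order the matcher returns it.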

### Example Usage

This script is useful in healthcare data processing, where symptom mentions must be identified accurately for patient care and data analysis. Automating the tagging speeds up data annotation, and the tagged output can serve as training data for NLP models or as input for symptom frequency analyses.

### Output

The output of this script is a CSV file named `df_tokens_Nausea_Vomiting.csv`, which contains columns for the original identifiers (`goldID`), the tokenized text (`token`), and the corresponding BIO tags (`tag`). This file can be used directly for training machine learning models or for further text analysis tasks.
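
Because the file stores one token per row, downstream code has to regroup the rows into per-document sequences. The following is a minimal sketch using only the standard library; the sample rows are invented for illustration, and it assumes that all rows sharing a `goldID` are consecutive, which is how the script writes them.

```python
import csv
import io
from itertools import groupby

# Stand-in for df_tokens_Nausea_Vomiting.csv; these rows are invented for illustration.
sample_csv = """goldID,token,tag
1,patient,O
1,reports,O
1,throwing,B-Nausea
1,up,I-Nausea
2,had,O
2,emesis,B-Nausea
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# groupby only merges consecutive rows, so this relies on the rows of
# each document being written one after another.
documents = {}
for gold_id, group in groupby(rows, key=lambda r: r["goldID"]):
    group = list(group)
    documents[gold_id] = {
        "tokens": [r["token"] for r in group],
        "tags": [r["tag"] for r in group],
    }

for gold_id, doc in documents.items():
    print(gold_id, list(zip(doc["tokens"], doc["tags"])))
```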

```python
print(df_tokens.head())
```


In [42]:
import spacy
from spacy.matcher import PhraseMatcher
import pandas as pd

# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

# Load your data
df = pd.read_csv('new_corpus_14_symptoms_counted.csv')
df_nausea = df[["Unnamed: 0", "goldID", "text", "Nausea"]]
df_dic = pd.read_csv('Nausea_dic.csv', encoding='latin1')

# Existing terms from your dictionary
nausea_terms = df_dic['Nausea_Vomiting'].tolist()

# Add additional vomiting-related terms manually
vomiting_terms = [
    "vomiting", "emesis", "vomit", "regurgitation", "throws up", "throwing up",
    "retching", "puke", "puking", "upchuck", "heave", "barfing"
]

# Combine all terms and create Doc objects
all_terms = nausea_terms + vomiting_terms
patterns = [nlp.make_doc(term.lower()) for term in all_terms]

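# Create the PhraseMatcher object; attr="LOWER" makes matching case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
# spaCy v3 signature: the patterns are passed as a single list
matcher.add("NauseaVomitingPatterns", patterns)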

# Function to apply BIO tagging using spaCy's PhraseMatcher
def bio_tagging_spacy(text):
    doc = nlp(text)
    matches = matcher(doc)
    tokens = [token.text for token in doc]
    labels = ['O'] * len(tokens)  # Default labels

    for match_id, start, end in matches:
        labels[start] = 'B-Nausea'  # Begin tag, you could use 'B-Vomit' if specific tagging is required
        for i in range(start + 1, end):
            labels[i] = 'I-Nausea'  # Inside tag, similarly 'I-Vomit'

    return list(zip(tokens, labels))

# Apply BIO tagging and create a new DataFrame
token_data = []
for _, row in df_nausea.iterrows():
    result = bio_tagging_spacy(row['text'])
    for token, tag in result:
        token_data.append({'goldID': row['goldID'], 'token': token, 'tag': tag})

df_tokens = pd.DataFrame(token_data)

# Save and display the results
df_tokens.to_csv('df_tokens_Nausea_Vomiting.csv', index=False)
print(df_tokens.head())




   goldID        token tag
0     356  authorizing   O
1     356     provider   O
2     356       younke   O
3     356       denise   O
4     356            l   O


## BERT-NER-Token classification-Nausea

## Configuration for BERT-Based NER Task

This Python script sets up the necessary configurations for training a BERT-based Named Entity Recognition (NER) model to identify and classify mentions of nausea symptoms in textual data.

### Key Components

#### Importing Libraries
- **`os` and `pandas`**: Used for file operations and data manipulation.
- **`torch` and `transformers`**: Core libraries for building and training neural network models using PyTorch and accessing pre-trained BERT models.
- **`seqeval`**: Provides entity-level precision, recall, and F1 metrics for evaluating sequence labeling.
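
To make "entity-level" concrete, here is a simplified, standard-library sketch of the idea behind such metrics: a predicted entity counts as correct only if its type and both boundaries match the gold entity. This is an illustration, not `seqeval`'s implementation, and its handling of edge cases (for example an `I-` tag with no preceding `B-`, which is ignored here) may differ from the library's.

```python
from typing import List, Set, Tuple


def extract_entities(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Return (type, start, end) for each entity in one BIO sequence (end is exclusive)."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing entity
        inside = tag.startswith("I-") and current == tag[2:]
        if current is not None and not inside:
            entities.add((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return entities


def prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over exact entity matches."""
    tp = n_pred = n_gold = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = extract_entities(g_tags), extract_entities(p_tags)
        tp += len(g & p)
        n_pred += len(p)
        n_gold += len(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["O", "B-Nausea", "I-Nausea", "O"], ["B-Nausea", "O"]]
pred = [["O", "B-Nausea", "O", "O"], ["B-Nausea", "O"]]
precision, recall, f1 = prf(gold, pred)
print(precision, recall, f1)
```

In the first sentence the prediction covers only part of the gold span, so it counts as a miss even though three of the four token tags are correct.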

#### Model and Label Setup
- **Model Names**: Defines a dictionary mapping model identifiers to their descriptive names, including standard BERT and domain-specific variants like Bio_ClinicalBERT.
- **Label Mapping**: Establishes a mapping between textual labels ('O', 'B-Nausea', 'I-Nausea') and numeric IDs, crucial for model training and prediction.
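
The label mapping works in both directions: labels are encoded to IDs for training, and predicted IDs are decoded back to labels for evaluation. The sketch below illustrates this with the same mapping; the helper names `encode` and `decode` and the `B-Fever` label are hypothetical. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying it do not contribute to the loss.

```python
label2id = {"O": 0, "B-Nausea": 1, "I-Nausea": 2}
id2label = {idx: label for label, idx in label2id.items()}
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def encode(labels):
    """Map label strings to IDs; labels missing from the mapping become IGNORE_INDEX."""
    return [label2id.get(label, IGNORE_INDEX) for label in labels]


def decode(ids):
    """Map IDs back to label strings, skipping ignored positions."""
    return [id2label[i] for i in ids if i != IGNORE_INDEX]


encoded = encode(["O", "B-Nausea", "I-Nausea", "B-Fever"])  # "B-Fever" is not in the mapping
print(encoded)          # [0, 1, 2, -100]
print(decode(encoded))  # ['O', 'B-Nausea', 'I-Nausea']
```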

#### Logging Configuration
- Sets up logging to record information at the INFO level, including timestamps and log levels, facilitating debugging and tracking model performance.

#### Configuration Class
- **`Config`**:
  - `MAX_LEN`: Maximum sequence length for tokenizing text data.
  - `TRAIN_BATCH_SIZE` and `VALID_BATCH_SIZE`: Batch sizes for training and validation phases.
  - `EPOCHS`, `LEARNING_RATE`, and `MAX_GRAD_NORM`: Training parameters such as the number of epochs, learning rate, and maximum gradient norm for gradient clipping.
  - `TRAIN_SIZE`: Proportion of the dataset used for training.
  - `DEVICE`: Specifies the computing device (CPU or GPU) based on CUDA availability.
  - `DATA_FILE`: Path to the dataset file.
  - `RESULTS_DIR`: Directory where results (e.g., trained models and metrics) will be stored.
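
`MAX_LEN` counts subword tokens rather than words, and two of its positions are reserved for the special tokens. The sketch below shows how a sequence is truncated and padded to a fixed length and how the attention mask marks the real tokens. `toy_wordpiece` is a hypothetical stand-in for a real subword tokenizer (it cuts words into fixed-size chunks), and the `"PAD"` label marks the positions that the script maps to -100.

```python
def toy_wordpiece(word, piece_len=4):
    """Hypothetical stand-in for a subword tokenizer: fixed-size chunks of a non-empty word."""
    pieces = [word[i:i + piece_len] for i in range(0, len(word), piece_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def build_example(words, labels, max_len):
    tokens, token_labels = [], []
    for word, label in zip(words, labels):
        pieces = toy_wordpiece(word)
        tokens.extend(pieces)
        token_labels.extend([label] * len(pieces))  # every piece inherits the word's label

    # Reserve two positions for the special tokens, then truncate.
    tokens = ["[CLS]"] + tokens[:max_len - 2] + ["[SEP]"]
    token_labels = ["O"] + token_labels[:max_len - 2] + ["O"]

    # Pad to max_len; the attention mask is 1 for real tokens and 0 for padding.
    pad = max_len - len(tokens)
    attention_mask = [1] * len(tokens) + [0] * pad
    tokens += ["[PAD]"] * pad
    token_labels += ["PAD"] * pad
    return tokens, token_labels, attention_mask


tokens, token_labels, mask = build_example(["severe", "vomiting"], ["O", "B-Nausea"], max_len=8)
for row in zip(tokens, token_labels, mask):
    print(row)
```

With `max_len=4` the same input is cut to `[CLS]`, two subwords, and `[SEP]`, so the symptom mention is lost entirely; long documents are affected in the same way at `MAX_LEN = 512`.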

### Purpose

The script prepares the environment for training NER models by:
- Setting computational parameters and paths, ensuring that the models are trained with consistent settings.
- Specifying model details and label mappings, which are essential for the models to interpret and learn from the data correctly.
- Providing a structured way to log and monitor the training process, helping in the maintenance and optimization of the models.

### Usage

These settings form the first part of the cell below, which then loads the data, trains each model, and evaluates it. Defining them up front ensures that every subsequent step runs with the same configuration and resources.


In [44]:
import os
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForTokenClassification
from seqeval.metrics import classification_report, precision_score, recall_score, f1_score
from torch import cuda
import logging

# Define model names and their friendly names
model_names = {
    "bert-base-uncased": "BERT",
    "emilyalsentzer/Bio_ClinicalBERT": "Bio-ClinicalBERT",
    "New_Bio-Clinical_BERT_finetuned": "Symptom_BERT"
}

# Define label mapping
label2id = {'O': 0, 'B-Nausea': 1, 'I-Nausea': 2}  # Adjust as necessary
id2label = {v: k for k, v in label2id.items()}  # Reverse mapping from ID to label

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Constants and Configurations
class Config:
    MAX_LEN = 512
    TRAIN_BATCH_SIZE = 4
    VALID_BATCH_SIZE = 2
    EPOCHS = 5
    LEARNING_RATE = 3e-05
    MAX_GRAD_NORM = 10
    TRAIN_SIZE = 0.8
    DEVICE = 'cuda' if cuda.is_available() else 'cpu'
    DATA_FILE = "df_tokens_Nausea_Vomiting.csv"
    RESULTS_DIR = './results_Nausea_NER_BERT&GPT'  # Directory to save results

# Load and preprocess data
def load_data(file_path):
    try:
        data = pd.read_csv(file_path, encoding='unicode_escape')
        data = data.ffill()  # forward-fill missing values; fillna(method='ffill') is deprecated in recent pandas
        data['sentence'] = data.groupby(['goldID'])['token'].transform(lambda x: ' '.join(x))
        data['word_labels'] = data.groupby(['goldID'])['tag'].transform(lambda x: ','.join(x))
        data = data[["sentence", "word_labels"]].drop_duplicates().reset_index(drop=True)
        return data
    except Exception as e:
        logging.error(f"Failed to load data: {e}")
        raise

# Prepare datasets
def prepare_datasets(data):
    train_data = data.sample(frac=Config.TRAIN_SIZE, random_state=200)
    test_data = data.drop(train_data.index).reset_index(drop=True)
    return train_data.reset_index(drop=True), test_data

# Custom Dataset class
class SentenceDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        sentence = self.data.iloc[index]['sentence']
        word_labels = self.data.iloc[index]['word_labels']
        tokenized_sentence, labels = self.tokenize_and_preserve_labels(sentence, word_labels)
        return self.encode_plus(tokenized_sentence, labels)

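    def tokenize_and_preserve_labels(self, sentence, text_labels):
        # Split each word into subwords and repeat the word's label for every subword.
        # Assumes one label per whitespace-separated word: a token that is itself
        # whitespace would vanish in split() and shift the word/label alignment.
        tokenized_sentence, labels = [], []
        for word, label in zip(sentence.split(), text_labels.split(',')):
            subwords = self.tokenizer.tokenize(word)
            tokenized_sentence.extend(subwords)
            labels.extend([label] * len(subwords))
        return tokenized_sentence, labels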
   
    def encode_plus(self, tokenized_sentence, labels):
        tokenized_sentence = ['[CLS]'] + tokenized_sentence[:self.max_len-2] + ['[SEP]']
        labels = ['O'] + labels[:self.max_len-2] + ['O']
        attention_mask = [1] * len(tokenized_sentence) + [0] * (self.max_len - len(tokenized_sentence))
        input_ids = self.tokenizer.convert_tokens_to_ids(tokenized_sentence)
        label_ids = [label2id.get(label, -100) for label in labels]
        padding_length = self.max_len - len(input_ids)
        input_ids += [self.tokenizer.pad_token_id] * padding_length
        label_ids += [-100] * padding_length
        return {'input_ids': torch.tensor(input_ids, dtype=torch.long),
                'attention_mask': torch.tensor(attention_mask, dtype=torch.long),
                'labels': torch.tensor(label_ids, dtype=torch.long)}
    

def train(model, loader, optimizer):
    model.train()
    total_loss = 0
    for batch in loader:
        inputs, masks, labels = batch['input_ids'].to(Config.DEVICE), batch['attention_mask'].to(Config.DEVICE), batch['labels'].to(Config.DEVICE)
        model.zero_grad()
        outputs = model(input_ids=inputs, attention_mask=masks, labels=labels)
        loss = outputs.loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), Config.MAX_GRAD_NORM)
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

def evaluate(model, loader):
    model.eval()
    total_loss = 0
    predictions, labels = [], []
    with torch.no_grad():
        for batch in loader:
            inputs, masks, targets = batch['input_ids'].to(Config.DEVICE), batch['attention_mask'].to(Config.DEVICE), batch['labels'].to(Config.DEVICE)
            outputs = model(input_ids=inputs, attention_mask=masks, labels=targets)
            loss = outputs.loss
            total_loss += loss.item()

            logits = outputs.logits
            predictions_batch = torch.argmax(logits, axis=2)
            for i, mask in enumerate(masks):
                temp_1 = []
                temp_2 = []
                for j, m in enumerate(mask):
                    # Skip padding and positions labelled -100 (ignored by the loss)
                    if m and targets[i, j].item() != -100:
                        temp_1.append(id2label[targets[i, j].item()])
                        temp_2.append(id2label[predictions_batch[i, j].item()])
                labels.append(temp_1)
                predictions.append(temp_2)

    precision = precision_score(labels, predictions)
    recall = recall_score(labels, predictions)
    f1 = f1_score(labels, predictions)

    logging.info(f"Precision: {precision}, Recall: {recall}, F1-Score: {f1}")
    return total_loss / len(loader), precision, recall, f1


# Main Execution
def main():
    data = load_data(Config.DATA_FILE)
    train_data, test_data = prepare_datasets(data)
    
    os.makedirs(Config.RESULTS_DIR, exist_ok=True)
        
    for model_key in model_names:
        logging.info(f"Training and evaluating model: {model_names[model_key]}")
        tokenizer = BertTokenizer.from_pretrained(model_key)
        model = BertForTokenClassification.from_pretrained(model_key, num_labels=len(id2label))
        model.to(Config.DEVICE)
        train_loader = DataLoader(SentenceDataset(train_data, tokenizer, Config.MAX_LEN), batch_size=Config.TRAIN_BATCH_SIZE, shuffle=True)
        test_loader = DataLoader(SentenceDataset(test_data, tokenizer, Config.MAX_LEN), batch_size=Config.VALID_BATCH_SIZE, shuffle=False)
        optimizer = torch.optim.Adam(model.parameters(), lr=Config.LEARNING_RATE)

        for epoch in range(Config.EPOCHS):
            train_loss = train(model, train_loader, optimizer)
            logging.info(f'Epoch {epoch+1}, Train Loss: {train_loss}')
            if epoch == Config.EPOCHS - 1:  # Only evaluate and save in the last epoch
                test_loss, precision, recall, f1 = evaluate(model, test_loader)
                test_metrics = {'Test Loss': test_loss, 'Precision': precision, 'Recall': recall, 'F1 Score': f1}


        # Now outside the epoch loop: report the metrics from the final epoch
        print(f"Test metrics: {test_metrics}")
        print("*********************************************************************************************")
        
        # Save metrics
        metrics_filename = os.path.join(Config.RESULTS_DIR, f"{model_names[model_key]}_test_metrics.csv")
        pd.DataFrame([test_metrics]).to_csv(metrics_filename, index=False)
        print(f"Saved test metrics to {metrics_filename}")
        
        # Define dynamic save directory based on model name
        save_directory = f'./fine_tuned_models/{model_names[model_key]}'

        # Ensure the save directory exists
        os.makedirs(save_directory, exist_ok=True)
        
        # Save the model's weights, configuration, and tokenizer
        model.save_pretrained(save_directory)
        tokenizer.save_pretrained(save_directory)
        print(f"Model and tokenizer saved to {save_directory}")
        

if __name__ == "__main__":
    main()


2024-09-04 10:48:46,907 - INFO - Training and evaluating model: BERT
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2024-09-04 10:49:35,515 - INFO - Epoch 1, Train Loss: 0.02596419217433174
2024-09-04 10:50:23,659 - INFO - Epoch 2, Train Loss: 0.0030944848477131104
2024-09-04 10:51:12,236 - INFO - Epoch 3, Train Loss: 0.0014244473060094859
2024-09-04 10:52:01,141 - INFO - Epoch 4, Train Loss: 0.0013555966694818993
2024-09-04 10:52:50,323 - INFO - Epoch 5, Train Loss: 0.0003255846109146491
2024-09-04 10:52:55,198 - INFO - Precision: 0.96, Recall: 0.9795918367346939, F1-Score: 0.9696969696969697


Test metrics: {'Test Loss': 0.004780127523224143, 'Precision': 0.96, 'Recall': 0.9795918367346939, 'F1 Score': 0.9696969696969697}
*********************************************************************************************
Saved test metrics to ./results_Nausea_NER_BERT&GPT/BERT_test_metrics.csv


2024-09-04 10:52:57,811 - INFO - Training and evaluating model: Bio-ClinicalBERT


Model and tokenizer saved to ./fine_tuned_models/BERT


Some weights of BertForTokenClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2024-09-04 10:53:48,378 - INFO - Epoch 1, Train Loss: 0.0453207800910689
2024-09-04 10:54:37,851 - INFO - Epoch 2, Train Loss: 0.0035026436751819387
2024-09-04 10:55:27,285 - INFO - Epoch 3, Train Loss: 0.0022781480582252274
2024-09-04 10:56:16,841 - INFO - Epoch 4, Train Loss: 0.0006674245420706643
2024-09-04 10:57:06,462 - INFO - Epoch 5, Train Loss: 9.091104142033484e-05
2024-09-04 10:57:11,384 - INFO - Precision: 0.9705882352941176, Recall: 0.9850746268656716, F1-Score: 0.9777777777777777


Test metrics: {'Test Loss': 0.00821915688959086, 'Precision': 0.9705882352941176, 'Recall': 0.9850746268656716, 'F1 Score': 0.9777777777777777}
*********************************************************************************************
Saved test metrics to ./results_Nausea_NER_BERT&GPT/Bio-ClinicalBERT_test_metrics.csv


2024-09-04 10:57:13,931 - INFO - Training and evaluating model: Symptom_BERT
Some weights of BertForTokenClassification were not initialized from the model checkpoint at New_Bio-Clinical_BERT_finetuned and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model and tokenizer saved to ./fine_tuned_models/Bio-ClinicalBERT


2024-09-04 10:58:03,619 - INFO - Epoch 1, Train Loss: 0.05336024356453658
2024-09-04 10:58:53,183 - INFO - Epoch 2, Train Loss: 0.002878304777685878
2024-09-04 10:59:42,828 - INFO - Epoch 3, Train Loss: 0.0015797028866959303
2024-09-04 11:00:32,504 - INFO - Epoch 4, Train Loss: 0.0010596444197797737
2024-09-04 11:01:22,173 - INFO - Epoch 5, Train Loss: 0.00014637028993050525
2024-09-04 11:01:27,135 - INFO - Precision: 0.9848484848484849, Recall: 0.9701492537313433, F1-Score: 0.9774436090225564


Test metrics: {'Test Loss': 0.0025464292570076294, 'Precision': 0.9848484848484849, 'Recall': 0.9701492537313433, 'F1 Score': 0.9774436090225564}
*********************************************************************************************
Saved test metrics to ./results_Nausea_NER_BERT&GPT/Symptom_BERT_test_metrics.csv
Model and tokenizer saved to ./fine_tuned_models/Symptom_BERT


In [64]:
import torch
from transformers import BertTokenizer, BertForTokenClassification
from bertviz import head_view, model_view

# Load the fine-tuned BERT token-classification (NER) model and its tokenizer
model_name = 'fine_tuned_models/Symptom_BERT'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name, num_labels=3, output_attentions=True)  # num_labels = 3 for "O", "B-Nausea", and "I-Nausea"

# Define label mappings
label2id = {'O': 0, 'B-Nausea': 1, 'I-Nausea': 2}
id2label = {v: k for k, v in label2id.items()}

# Sample input text with an entity to predict (e.g., Nausea)
text = "The patient reported feeling nauseous and vomiting multiple times."

# Tokenize the input text
inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)

# Run the model and get outputs (attention scores and token predictions)
with torch.no_grad():
    outputs = model(**inputs)

# Get attention scores and token predictions from the model output
attention = outputs.attentions
logits = outputs.logits
predicted_ids = torch.argmax(logits, dim=2)

# Convert predicted label IDs to label names
predicted_labels = [id2label[label_id.item()] for label_id in predicted_ids[0]]
token_ids = inputs['input_ids'][0]
tokens = tokenizer.convert_ids_to_tokens(token_ids)

# Visualize the attention using head_view
head_view(attention, tokens)



## GPT-NER-Token Classification-Nausea

In [65]:
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer, GPT2ForTokenClassification
from seqeval.metrics import classification_report, precision_score, recall_score, f1_score
from torch import cuda
import logging
import os

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

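# Keys are the checkpoint names or paths passed to from_pretrained();
# values are the names used in log messages and output paths.
model_names = {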
    #'gpt2': 'gpt2',
    #'biogpt': 'microsoft/BioGPT',
    #'bioMedLM': 'stanford-crfm/BioMedLM',
    'symptom-GPT': './symptom-BioGPT-1 Million',
    # 'symptom-GPT-Neo': 'EleutherAI/gpt-neo-1.3B'
}

# Constants and Configurations
class Config:
    MAX_LEN = 512
    TRAIN_BATCH_SIZE = 4
    VALID_BATCH_SIZE = 2
    EPOCHS = 5
    LEARNING_RATE = 3e-05
    MAX_GRAD_NORM = 10
    TRAIN_SIZE = 0.8
    DEVICE = 'cuda' if cuda.is_available() else 'cpu'
    DATA_FILE = "df_tokens_Nausea_Vomiting.csv"
    RESULTS_DIR = './results_Nausea_NER_BERT&GPT'  # Directory to save results

# Define label mapping
label2id = {'O': 0, 'B-Nausea': 1, 'I-Nausea': 2}  # Example, adjust according to your actual labels
id2label = {v: k for k, v in label2id.items()}  # Reverse mapping, used in evaluate() and main()

# Load and preprocess data
def load_data(file_path):
    try:
        data = pd.read_csv(file_path, encoding='unicode_escape')
        data = data.fillna(method='ffill')
        data['sentence'] = data.groupby(['goldID'])['token'].transform(lambda x: ' '.join(x))
        data['word_labels'] = data.groupby(['goldID'])['tag'].transform(lambda x: ','.join(x))
        data = data[["sentence", "word_labels"]].drop_duplicates().reset_index(drop=True)
        return data
    except Exception as e:
        logging.error(f"Failed to load data: {e}")
        raise

# Prepare datasets
def prepare_datasets(data):
    train_data = data.sample(frac=Config.TRAIN_SIZE, random_state=200)
    test_data = data.drop(train_data.index).reset_index(drop=True)
    return train_data.reset_index(drop=True), test_data

# Custom Dataset class
class SentenceDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        sentence = self.data.iloc[index]['sentence']
        word_labels = self.data.iloc[index]['word_labels']
        tokenized_sentence, labels = self.tokenize_and_preserve_labels(sentence, word_labels)
        return self.encode_plus(tokenized_sentence, labels)

    def tokenize_and_preserve_labels(self, sentence, text_labels):
        tokenized_sentence, labels = [], []
        for word, label in zip(sentence.split(), text_labels.split(',')):
            subwords = self.tokenizer.tokenize(word)
            tokenized_sentence.extend(subwords)
            labels.extend([label] * len(subwords))
        return tokenized_sentence, labels

    def encode_plus(self, tokenized_sentence, labels):
        tokenized_sentence = ['<|endoftext|>'] + tokenized_sentence[:self.max_len-2] + ['<|endoftext|>']
        labels = ['O'] + labels[:self.max_len-2] + ['O']
    
        input_ids = self.tokenizer.convert_tokens_to_ids(tokenized_sentence)
        attention_mask = [1] * len(input_ids) + [0] * (self.max_len - len(input_ids))

        # Padding
        padding_length = self.max_len - len(input_ids)
        input_ids += [self.tokenizer.eos_token_id] * padding_length  # Use eos_token_id for padding if pad_token_id is None
        label_ids = [label2id.get(label, -100) for label in labels]
        label_ids += [-100] * padding_length

        return {
            'input_ids': torch.tensor(input_ids, dtype=torch.long),
            'attention_mask': torch.tensor(attention_mask, dtype=torch.long),
            'labels': torch.tensor(label_ids, dtype=torch.long)
        }


# Training and Evaluation Functions
def train(model, loader, optimizer):
    model.train()
    total_loss = 0
    for batch in loader:
        inputs, masks, labels = batch['input_ids'].to(Config.DEVICE), batch['attention_mask'].to(Config.DEVICE), batch['labels'].to(Config.DEVICE)
        model.zero_grad()
        outputs = model(input_ids=inputs, attention_mask=masks, labels=labels)
        loss = outputs.loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), Config.MAX_GRAD_NORM)
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

def evaluate(model, loader):
    model.eval()
    total_loss = 0
    predictions, labels = [], []
    with torch.no_grad():
        for batch in loader:
            inputs, masks, targets = batch['input_ids'].to(Config.DEVICE), batch['attention_mask'].to(Config.DEVICE), batch['labels'].to(Config.DEVICE)
            outputs = model(input_ids=inputs, attention_mask=masks, labels=targets)
            loss = outputs.loss
            total_loss += loss.item()

            logits = outputs.logits
            predictions_batch = torch.argmax(logits, axis=2)
            for i, mask in enumerate(masks):
                temp_1 = []
                temp_2 = []
                for j, m in enumerate(mask):
                    # Skip padding and positions labelled -100 (ignored by the loss)
                    if m and targets[i, j].item() != -100:
                        temp_1.append(id2label[targets[i, j].item()])
                        temp_2.append(id2label[predictions_batch[i, j].item()])
                labels.append(temp_1)
                predictions.append(temp_2)

    precision = precision_score(labels, predictions)
    recall = recall_score(labels, predictions)
    f1 = f1_score(labels, predictions)

    logging.info(f"Precision: {precision}, Recall: {recall}, F1-Score: {f1}")
    return total_loss / len(loader), precision, recall, f1

# Main Execution
def main():
    data = load_data(Config.DATA_FILE)
    train_data, test_data = prepare_datasets(data)
    os.makedirs(Config.RESULTS_DIR, exist_ok=True)  # the metrics CSV is written here
    for model_key in model_names:
        logging.info(f"Training and evaluating model: {model_names[model_key]}")
        tokenizer = GPT2Tokenizer.from_pretrained(model_key)
        model = GPT2ForTokenClassification.from_pretrained(model_key, num_labels=len(id2label))
        model.to(Config.DEVICE)
        train_loader = DataLoader(SentenceDataset(train_data, tokenizer, Config.MAX_LEN), batch_size=Config.TRAIN_BATCH_SIZE, shuffle=True)
        test_loader = DataLoader(SentenceDataset(test_data, tokenizer, Config.MAX_LEN), batch_size=Config.VALID_BATCH_SIZE, shuffle=False)
        optimizer = torch.optim.Adam(model.parameters(), lr=Config.LEARNING_RATE)

        for epoch in range(Config.EPOCHS):
            train_loss = train(model, train_loader, optimizer)
            logging.info(f'Epoch {epoch+1}, Train Loss: {train_loss}')
            if epoch == Config.EPOCHS - 1:  # Only evaluate and save in the last epoch
                test_loss, precision, recall, f1 = evaluate(model, test_loader)
                test_metrics = {'Test Loss': test_loss, 'Precision': precision, 'Recall': recall, 'F1 Score': f1}


        # Now outside the epoch loop: report the metrics from the final epoch
        print(f"Test metrics: {test_metrics}")
        print("*********************************************************************************************")
        
        # Save metrics
        metrics_filename = os.path.join(Config.RESULTS_DIR, f"{model_names[model_key]}_test_metrics.csv")
        pd.DataFrame([test_metrics]).to_csv(metrics_filename, index=False)
        print(f"Saved test metrics to {metrics_filename}")
        # Define dynamic save directory based on model name
        save_directory = f'./fine_tuned_models/{model_names[model_key]}'

        # Ensure the save directory exists
        os.makedirs(save_directory, exist_ok=True)
        
        # Save the model's weights, configuration, and tokenizer
        model.save_pretrained(save_directory)
        tokenizer.save_pretrained(save_directory)
        print(f"Model and tokenizer saved to {save_directory}")
    
if __name__ == "__main__":
    main()

2024-09-04 11:54:23,459 - INFO - Training and evaluating model: ./symptom-BioGPT-1 Million
Some weights of GPT2ForTokenClassification were not initialized from the model checkpoint at symptom-GPT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2024-09-04 11:55:20,354 - INFO - Epoch 1, Train Loss: 0.05315941390053499
2024-09-04 11:56:17,356 - INFO - Epoch 2, Train Loss: 0.010801426028943795
2024-09-04 11:57:14,798 - INFO - Epoch 3, Train Loss: 0.007511967869961451
2024-09-04 11:58:12,517 - INFO - Epoch 4, Train Loss: 0.006626972464335606
2024-09-04 11:59:10,454 - INFO - Epoch 5, Train Loss: 0.004443956468838773
2024-09-04 11:59:15,519 - INFO - Precision: 0.9465648854961832, Recall: 0.8732394366197183, F1-Score: 0.9084249084249083


Test metrics: {'Test Loss': 0.0074904073987530255, 'Precision': 0.9465648854961832, 'Recall': 0.8732394366197183, 'F1 Score': 0.9084249084249083}
*********************************************************************************************
Saved test metrics to ./results_Nausea_NER_BERT&GPT/./symptom-BioGPT-1 Million_test_metrics.csv
Model and tokenizer saved to ./fine_tuned_models/./symptom-BioGPT-1 Million


In [67]:
import torch
from transformers import GPT2Tokenizer, GPT2ForTokenClassification
from bertviz import head_view

# Load the fine-tuned GPT-2 token-classification model and tokenizer.
# GPT2ForTokenClassification restores the classifier head trained above;
# a freshly created torch.nn.Linear head would have random weights and
# therefore produce meaningless label predictions.
model_name = 'fine_tuned_models/symptom-BioGPT-1 Million'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2ForTokenClassification.from_pretrained(model_name, output_attentions=True)

# Define label mappings
label2id = {'O': 0, 'B-Nausea': 1, 'I-Nausea': 2}
id2label = {v: k for k, v in label2id.items()}

# Sample input text with an entity to predict (e.g., Nausea)
text = "The patient reported feeling nauseous and vomiting multiple times."

# Tokenize the input text
inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)

# Run the model and get token logits and attention weights
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # Per-token scores from the trained classifier head
    attentions = outputs.attentions  # Attention weights from GPT-2

# Pick the highest-scoring label for each token
predicted_ids = torch.argmax(logits, dim=2)

# Convert predicted label IDs to label names
predicted_labels = [id2label[label_id.item()] for label_id in predicted_ids[0]]
token_ids = inputs['input_ids'][0]
tokens = tokenizer.convert_ids_to_tokens(token_ids)

# Visualize the attention using head_view
head_view(attentions, tokens)

