## Pre-trained Symptom-BERT

This notebook further pre-trains Bio_ClinicalBERT on clinical text with the masked-language-modeling objective. It covers data loading and cleaning, tokenization, training, and saving the resulting model.
The hyperparameters used for this further pre-training are set in the `TrainingArguments` call in the training cell.

In [3]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel, logging
import random
from transformers import BertForMaskedLM, DataCollatorForLanguageModeling, Trainer, TrainingArguments
from torch.utils.data import Dataset

### Text Data Loading and Preprocessing

This cell loads the text data from a CSV file into a pandas DataFrame and prepares it for tokenization:

1. **Load Data:**
   - The CSV file `1_million_cleaned_data.csv` is read into a pandas DataFrame `df`. As the file name suggests, the data were cleaned in an earlier step.

2. **Remove Missing Values:**
   - Rows with NaN in the 'SentText' column are dropped, so every remaining row has a text entry to train on.

3. **Data Type Standardization:**
   - Every entry in the 'SentText' column is converted to a string. Non-string values (such as numbers) become their string representations, because the tokenizer accepts only strings.

4. **Extract Text Data:**
   - The cleaned 'SentText' column is converted into a Python list `texts`, the form in which it is later passed to the tokenizer.

### Summary

The cell below loads the CSV, drops rows without text, and converts every remaining entry to a string. The result is a list `texts` in which every element is a non-missing string, ready for tokenization.
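As a minimal sketch of the cleaning steps above, the snippet below applies them to a small in-memory DataFrame. The sample rows and the `clean_texts` helper are hypothetical; only the `SentText` column name comes from the notebook.

```python
import pandas as pd


def clean_texts(df, column="SentText"):
    """Drop rows with a missing text entry and return the rest as strings."""
    df = df.dropna(subset=[column])
    return df[column].astype(str).tolist()


# Hypothetical sample: one missing entry and one numeric entry.
sample = pd.DataFrame(
    {"SentText": ["patient reports chest pain", None, 120, "denies fever or chills"]}
)

texts = clean_texts(sample)
print(texts)
```

The missing entry is dropped and the numeric entry is kept as a string, so every element of `texts` can be passed to a tokenizer.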


In [7]:
# Read the CSV file into a DataFrame
df = pd.read_csv("1_million_cleaned_data.csv")
# Drop rows with NaN values in the 'SentText' column
df = df.dropna(subset=['SentText'])
# Convert non-string values to strings (e.g., numbers to their string representations)
df['SentText'] = df['SentText'].astype(str)
# Get the cleaned list of texts
texts = df['SentText'].tolist()

In [8]:
df.shape

(1000000, 3)

### Clinical Text Data Processing and Model Training with Bio_ClinicalBERT

This cell continues masked-language-model (MLM) pre-training of Bio_ClinicalBERT on the clinical sentences loaded above, using Hugging Face's `transformers` library.

1. **Load Model and Tokenizer:**
   - **Model Name:** `emilyalsentzer/Bio_ClinicalBERT`, a BERT model initialized from BioBERT and further pre-trained on MIMIC-III clinical notes, is loaded as `BertForMaskedLM` so that it has a masked-language-modeling head.
   - **Tokenizer:** The matching `BertTokenizer` converts each text into token IDs and an attention mask.

2. **Tokenize Text Data:**
   - All texts are tokenized in a single call. `padding=True` pads every sequence to the longest one in that call, `truncation=True` cuts longer sequences, and `return_tensors="pt"` returns PyTorch tensors.
   - The maximum length is set to 512 tokens, the longest input BERT supports. A smaller value reduces memory use.

3. **Create PyTorch Dataset:**
   - A custom `TextDataset` class is defined to handle the tokenized text data, making it compatible with PyTorch training routines.
   - The dataset supports indexing and length retrieval, crucial for efficient data loading during training.

4. **Setup Data Collation:**
   - `DataCollatorForLanguageModeling` selects 15% of the input tokens at random each time a batch is built (`mlm_probability=0.15`). The model is trained to predict the original tokens at those positions, which is the masked-language-modeling objective used to pre-train BERT.

5. **Configure Training:**
   - Training parameters are set using `TrainingArguments`: output directory `./bio_clinical_bert_lm`, 3 epochs, a per-device batch size of 4, a learning rate of 1e-4, and 500 warmup steps.
   - The loss is logged every 100 steps. A checkpoint is saved every 10,000 steps and only the two most recent are kept, so an interrupted run can resume from a checkpoint.

6. **Initialize and Run Training:**
   - The `Trainer` class from Hugging Face's library is initialized with the model, training arguments, data collator, and dataset. No evaluation set is configured.
   - Training is run with `trainer.train()`, during which the model learns to predict masked tokens from their context in the clinical text.

7. **Save Model and Tokenizer:**
   - After training, the further pre-trained model and its tokenizer are saved to the `New_Bio-Clinical_BERT_finetuned` directory for use in downstream tasks.
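The masking in step 4 follows the standard BERT rule, which can be sketched in plain Python. Each non-special token is selected with probability `mlm_probability`. A selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. Unselected positions get the label -100, which the loss ignores. This is an illustration of the rule, not the library's implementation; the `mask_tokens` helper, the token IDs, and the vocabulary size are hypothetical placeholders.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips


def mask_tokens(input_ids, special_ids, mask_id, vocab_size,
                mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) for one sequence under the BERT MLM rule."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token in enumerate(input_ids):
        # Special tokens ([CLS], [SEP], [PAD]) are never selected.
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return masked, labels


# Hypothetical IDs: 101 = [CLS], 102 = [SEP], 0 = [PAD], 103 = [MASK].
example_ids = [101] + list(range(200, 300)) + [102] + [0] * 5
masked_ids, labels = mask_tokens(
    example_ids, special_ids={0, 101, 102}, mask_id=103,
    vocab_size=1000, rng=random.Random(0),
)
print(sum(label != IGNORE_INDEX for label in labels), "positions selected")
```

Because the selection is redrawn every time a batch is built, the same sentence is masked differently in each epoch.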

### Summary

The cell below tokenizes the clinical sentences, continues masked-language-model pre-training of Bio_ClinicalBERT on them for three epochs, and saves the resulting model and tokenizer. The aim is a model better adapted to the language of this clinical text than the original checkpoint.
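As a rough worked example of what the training configuration implies, the sketch below counts optimizer steps and evaluates the learning-rate schedule. It assumes a single device, no gradient accumulation, and the `Trainer` default linear schedule (linear warmup followed by linear decay to zero). The numbers are taken from the `TrainingArguments` call and the `df.shape` output in this notebook; the helper names are hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Optimizer steps on one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_lr(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = total_training_steps(num_examples=1_000_000, batch_size=4, epochs=3)
checkpoints_written = total_steps // 10_000  # save_steps=10_000

print("total steps:", total_steps)
print("checkpoints written:", checkpoints_written)
for step in (0, 250, 500, 375_000, total_steps):
    lr = linear_lr(step, peak_lr=1e-4, warmup_steps=500, total_steps=total_steps)
    print(step, lr)
```

Under these assumptions the 500 warmup steps cover only a small fraction of the run, and `save_total_limit=2` keeps just the last two of the checkpoints written.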


In [None]:
# Load the Bio_ClinicalBERT tokenizer and model
model_name = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

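# Tokenize the whole corpus in one call. padding=True pads every row to the
# longest sequence in the call; truncation=True caps rows at max_length tokens.
tokenized_texts = tokenizer(
    texts,
    padding=True,
    truncation=True,
    return_tensors="pt",
    max_length=512,  # BERT's maximum input length; lower it to save memory
)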

# Create a PyTorch dataset for training
class TextDataset(Dataset):
    def __init__(self, tokenized_texts, tokenizer):
        self.tokenized_texts = tokenized_texts
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.tokenized_texts["input_ids"])

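    def __getitem__(self, idx):
        # The MLM data collator builds its own labels from the masked
        # input_ids, so the "labels" copy below is overwritten at batch time.
        return {
            "input_ids": self.tokenized_texts["input_ids"][idx],
            "attention_mask": self.tokenized_texts["attention_mask"][idx],
            "labels": self.tokenized_texts["input_ids"][idx].clone(),
        }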

dataset = TextDataset(tokenized_texts, tokenizer)

# Set up data collation
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm_probability=0.15,  # You can adjust this probability
)

# Define training arguments without evaluation
training_args = TrainingArguments(
    output_dir="./bio_clinical_bert_lm",
    overwrite_output_dir=True,
    num_train_epochs=3,  # Adjust the number of training epochs
    per_device_train_batch_size=4,  # Adjust batch size as needed
    save_steps=10_000,  # Save model checkpoint every n steps
    save_total_limit=2,  # Limit the number of saved checkpoints
    logging_dir="./logs",
    logging_steps=100,  # Log every n steps
    learning_rate=1e-4,  # Adjust learning rate as needed
    warmup_steps=500,  # Adjust warmup steps as needed
)

# Create a Trainer and start training
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

# Start training
trainer.train()

# Save the finetuned model
model.save_pretrained("New_Bio-Clinical_BERT_finetuned")

# Save the tokenizer
tokenizer.save_pretrained("New_Bio-Clinical_BERT_finetuned")


Some weights of the model checkpoint at microsoft/deberta-base were not used when initializing DebertaForMaskedLM: ['lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.LayerNorm.bias', 'deberta.embeddings.position_embeddings.weight', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.dense.weight']
- This IS expected if you are initializing DebertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaForMaskedLM were not initialized from the model checkpoint at microsoft/deberta-base and are newly initialized: ['cls.predictions.transform.LayerNorm.

| Step | Training Loss |
|------|---------------|
| 100  | 9.1888 |
| 200  | 6.8144 |
| 300  | 5.5608 |
| 400  | 4.7072 |
| 500  | 4.0684 |
| 600  | 3.7107 |
| 700  | 3.4888 |
| 800  | 3.2842 |
| 900  | 3.1883 |
| 1000 | 3.0566 |
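One way to read the logged values: the masked-language-modeling loss is a cross-entropy in nats over the masked positions, so its exponential gives a perplexity over masked tokens. The sketch below applies that conversion to the losses logged above. It treats each logged value as a plain mean cross-entropy, which is an assumption about how the numbers were averaged.

```python
import math

# Training loss by step, copied from the log above.
logged_loss = {
    100: 9.1888, 200: 6.8144, 300: 5.5608, 400: 4.7072, 500: 4.0684,
    600: 3.7107, 700: 3.4888, 800: 3.2842, 900: 3.1883, 1000: 3.0566,
}

# Perplexity over masked tokens: exp(mean cross-entropy in nats).
perplexity = {step: math.exp(loss) for step, loss in logged_loss.items()}

for step, value in perplexity.items():
    print(f"step {step:>4}: loss {logged_loss[step]:.4f}, perplexity {value:.1f}")
```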


