# Further Pre-Training Spritual-BERT Model

This code continues pre-training a BERT model (`bert-base-uncased`) with masked language modeling (MLM) on a spiritual care text dataset. Here's a detailed explanation:

1. **Importing Libraries**: The code uses PyTorch for deep learning and the Hugging Face `transformers` library for pre-trained models and tokenizers. Pandas is used for data manipulation.

2. **Loading and Sampling the Dataset**: The dataset `1 million sentense note.csv` is loaded, and a random sample of 1 million entries is taken from the `SentText` column. The text data is stored in a list named `texts`.

3. **Preparing the BERT Model**: A tokenizer and masked language model (`BertForMaskedLM`) are loaded using the pre-trained model `bert-base-uncased`. The tokenizer processes text into numerical formats suitable for the model.

4. **Tokenizing Text Data**: The text is tokenized in a single call. Sequences longer than 512 tokens are truncated, and shorter ones are padded to the length of the longest sequence (`padding=True`). The output is returned as PyTorch tensors.

5. **Creating a Custom Dataset**: A PyTorch `Dataset` is defined to return `input_ids`, `attention_mask`, and `labels` for each text. The labels start as a copy of the input IDs; the data collator in the next step overwrites them with the MLM targets.

6. **Setting Up a Data Collator**: A `DataCollatorForLanguageModeling` is used to apply dynamic masking during training, with a 15% probability of masking tokens.

7. **Configuring Training Arguments**: Training configurations include:
   - Batch size: 4
   - Epochs: 3
   - Learning rate: 0.0001
   - Checkpoints: Saved every 10,000 steps, with a maximum of 2 retained.
   - Logging: Metrics logged every 100 steps.

8. **Training the Model**: The Hugging Face `Trainer` class is used to streamline training. It combines the model, training arguments, dataset, and data collator. The training process begins with `trainer.train()`.

9. **Saving the Fine-Tuned Model**: The trained model and tokenizer are saved to a directory named `Spritual_BERT`, making them available for later use.

This workflow prepares a BERT model fine-tuned for domain-specific language modeling, enabling it to better understand and process text related to spiritual care.
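
The dynamic masking described in step 6 can be illustrated without any deep learning library. The sketch below follows the standard BERT recipe that `DataCollatorForLanguageModeling` applies by default: each non-special token is selected with probability 0.15, and a selected token becomes `[MASK]` 80% of the time, a random token 10% of the time, and stays unchanged otherwise. Unselected positions receive the label `-100`, which the loss ignores. The function name `mask_tokens` and the example token IDs are illustrative assumptions, not part of the original code.

```python
import random

IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def mask_tokens(input_ids, special_ids, mask_id, vocab_size,
                mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) for one sequence, BERT-style."""
    rng = rng or random.Random()
    masked_ids = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)

    for pos, token_id in enumerate(input_ids):
        # Special tokens ([CLS], [SEP], [PAD]) are never selected.
        if token_id in special_ids or rng.random() >= mlm_probability:
            continue

        labels[pos] = token_id  # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            masked_ids[pos] = mask_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            masked_ids[pos] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token in the input

    return masked_ids, labels


# Hypothetical sequence: [CLS] four word pieces [SEP] [PAD]
ids = [101, 2057, 2766, 1996, 5776, 102, 0]
masked, labels = mask_tokens(
    ids,
    special_ids={0, 101, 102},  # [PAD], [CLS], [SEP] in bert-base-uncased
    mask_id=103,                # [MASK] in bert-base-uncased
    vocab_size=30522,
    rng=random.Random(0),       # fixed seed only to make the demo repeatable
)
print(masked)
print(labels)
```

Because the masking is drawn again every time a batch is built, the same sentence can contribute different prediction targets in each of the three epochs.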


### Importing Libraries with Comments
```python
import torch  # PyTorch, a deep learning framework
import torch.nn as nn  # Neural network module from PyTorch
from transformers import BertTokenizer, BertModel, logging  # Components from the Hugging Face Transformers library
import random  # Random number generation
import pandas as pd  # Data manipulation and analysis
```

### Explanation of the Code

- The code reads a CSV file named `1 million sentense note.csv` into a pandas DataFrame using `pd.read_csv()`, enabling structured data manipulation and analysis.
- `df.sample(n=1000000, random_state=1)` draws 1 million rows at random without replacement, and `random_state=1` makes the draw reproducible. This shrinks the dataset only if the file holds more than 1 million rows. If it holds exactly 1 million, the call merely shuffles them, and if it holds fewer, pandas raises a `ValueError`.
- The `SentText` column, which contains the text data, is extracted from the sampled DataFrame and converted into a Python list named `texts` using the `tolist()` method. The tokenizer accepts this list directly.


```python
# Read the CSV file into a DataFrame
df = pd.read_csv("1 million sentense note.csv")
# Randomly sample 1 million rows (reproducible via random_state)
df = df.sample(n=1000000, random_state=1)
# Get the sentence text as a list
texts = df['SentText'].tolist()
```
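
The sampling call above fails if the file has fewer rows than requested, and any missing `SentText` values would reach the tokenizer as `NaN`. A minimal sketch of a more defensive version follows. The helper name `sample_texts` and the tiny in-memory DataFrame are hypothetical; only the column name `SentText` and `random_state=1` come from the original code.

```python
import pandas as pd


def sample_texts(df, column="SentText", n=1_000_000, seed=1):
    """Return up to n non-missing strings from df[column], sampled reproducibly."""
    clean = df.dropna(subset=[column])  # drop rows whose text is missing
    n = min(n, len(clean))              # df.sample raises ValueError if n > len(df)
    sampled = clean.sample(n=n, random_state=seed)
    return sampled[column].astype(str).tolist()


# Hypothetical stand-in for the real CSV, used only to exercise the helper.
demo = pd.DataFrame(
    {"SentText": ["Patient requested a chaplain visit.", None, "Family prayed together."]}
)
texts_demo = sample_texts(demo, n=5)
print(len(texts_demo), "texts sampled")
```

Whether silently capping the sample size is acceptable depends on the project; raising an explicit error may be preferable when exactly 1 million sentences are required.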

- **Import Libraries**: The code uses Hugging Face's `transformers` library for NLP tasks and PyTorch's `Dataset` class for creating custom datasets. These libraries enable efficient tokenization, model training, and data handling.

- **Load Pre-trained Model and Tokenizer**:
  - The BERT tokenizer and model (`bert-base-uncased`) are loaded using `BertTokenizer` and `BertForMaskedLM`. The checkpoint was pre-trained on a general language corpus, and `BertForMaskedLM` loads it with its masked language modeling (MLM) head so that training can continue on the new data.

- **Tokenize Text Data**:
  - The text data (`texts`) is tokenized using the BERT tokenizer with padding to the longest sequence, truncation, and conversion to PyTorch tensors. `max_length=512` caps every input at the longest sequence BERT accepts.

- **Custom PyTorch Dataset**:
  - A `TextDataset` class is defined to provide tokenized input IDs, attention masks, and labels for each data entry.
  - The `__len__` method returns the number of samples, and the `__getitem__` method fetches the data for a given index.

- **Data Collation**:
  - The `DataCollatorForLanguageModeling` masks tokens dynamically as each batch is built, selecting tokens with a probability of `0.15`, so the same sentence can be masked differently in each epoch.

- **Define Training Arguments**:
  - The training configuration includes:
    - Output directory: `./bio_clinical_bert_lm`.
    - Number of epochs: 3.
    - Batch size: 4.
    - Learning rate: 0.0001.
    - Checkpointing: Save every 10,000 steps, limiting to 2 checkpoints.
    - Logging: Record metrics every 100 steps.
    - Warm-up steps: 500.

- **Create a Trainer**:
  - The `Trainer` class simplifies model training by integrating the model, data, and training arguments.

- **Start Training**:
  - The model is trained using `trainer.train()` on the prepared dataset.

- **Save the Fine-Tuned Model**:
  - The fine-tuned model and tokenizer are saved in the `Spritual_BERT` directory for future use.
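
The training arguments listed above imply a concrete schedule, which is worth working out before launching a long run. The sketch below assumes a single device, no gradient accumulation, and the `Trainer` default of linear warm-up followed by linear decay; none of these is stated explicitly in the original. The function names are hypothetical.

```python
import math


def total_steps(num_samples, batch_size, epochs):
    """Optimizer steps for a full run (one device, no gradient accumulation)."""
    return math.ceil(num_samples / batch_size) * epochs


def linear_warmup_lr(step, base_lr, warmup_steps, num_training_steps):
    """Learning rate under linear warm-up followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - warmup_steps)


steps = total_steps(num_samples=1_000_000, batch_size=4, epochs=3)
print("optimizer steps:", steps)                # 750000
print("checkpoints written:", steps // 10_000)  # 75, of which only 2 are kept
print("log events:", steps // 100)              # 7500

# Learning rate at a few points of the run
for s in (0, 250, 500, steps // 2, steps):
    print(s, linear_warmup_lr(s, base_lr=1e-4, warmup_steps=500, num_training_steps=steps))
```

Under these assumptions the 500 warm-up steps cover well under 0.1% of the run, so the learning rate spends almost all of training on the decay side of the schedule.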


```python
from transformers import BertTokenizer, BertForMaskedLM, DataCollatorForLanguageModeling, Trainer, TrainingArguments
from torch.utils.data import Dataset

# Load the bert-base-uncased tokenizer and masked-LM model
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# Tokenize the whole corpus in one call
tokenized_texts = tokenizer(
    texts,
    padding=True,     # Pad every sequence to the longest one in `texts`
    truncation=True,  # Cut sequences longer than max_length
    return_tensors="pt",
    max_length=512,  # You can adjust the max length as needed
)

# Create a PyTorch dataset for training
class TextDataset(Dataset):
    def __init__(self, tokenized_texts, tokenizer):
        self.tokenized_texts = tokenized_texts
        self.tokenizer = tokenizer

    def __len__(self):
        # Number of tokenized sentences
        return len(self.tokenized_texts["input_ids"])

    def __getitem__(self, idx):
        return {
            "input_ids": self.tokenized_texts["input_ids"][idx],
            "attention_mask": self.tokenized_texts["attention_mask"][idx],
            # Placeholder labels; the data collator replaces them with MLM targets
            "labels": self.tokenized_texts["input_ids"][idx].clone(),
        }

dataset = TextDataset(tokenized_texts, tokenizer)

# Set up data collation with dynamic masking
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm_probability=0.15,  # You can adjust this probability
)

# Define training arguments without evaluation
training_args = TrainingArguments(
    output_dir="./bio_clinical_bert_lm",  # Checkpoints are written here
    overwrite_output_dir=True,
    num_train_epochs=3,  # Adjust the number of training epochs
    per_device_train_batch_size=4,  # Adjust batch size as needed
    save_steps=10_000,  # Save model checkpoint every n steps
    save_total_limit=2,  # Limit the number of saved checkpoints
    logging_dir="./logs",
    logging_steps=100,  # Log every n steps
    learning_rate=1e-4,  # Adjust learning rate as needed
    warmup_steps=500,  # Adjust warmup steps as needed
)

# Create a Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

# Start training
trainer.train()

# Save the further pre-trained model
model.save_pretrained("Spritual_BERT")

# Save the tokenizer alongside it
tokenizer.save_pretrained("Spritual_BERT")
```
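
Because the whole corpus is tokenized in one call with `padding=True`, every sentence is padded to the length of the longest one (at most 512 after truncation), and all resulting tensors are held in memory at once. The sketch below gives a rough size estimate. It assumes the three int64 tensors that `BertTokenizer` returns by default (`input_ids`, `token_type_ids`, `attention_mask`); the function name and the example lengths are hypothetical, since the actual longest sentence in the dataset is not given.

```python
def padded_tensor_bytes(num_texts, padded_length, num_tensors=3, bytes_per_value=8):
    """Bytes held by fully padded int64 tensors of shape (num_texts, padded_length)."""
    return num_texts * padded_length * num_tensors * bytes_per_value


# Hypothetical padded lengths; 512 is the worst case allowed by max_length.
for length in (64, 128, 512):
    gib = padded_tensor_bytes(1_000_000, length) / 2**30
    print(f"padded to {length:>3} tokens: {gib:.1f} GiB")
```

If the estimate exceeds the available memory, common alternatives are tokenizing lazily inside `__getitem__` or padding per batch in the collator. Neither is what the code above does.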
