<a href="https://colab.research.google.com/github/Ramjeet-Dixit/IITM-AIML-Rdixit/blob/main/HuggingFace_Transformers_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Step 1: Load the Hugging Face libraries

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
import torch

In [None]:
from google.colab import drive
#drive.mount('/content/drive')

## Step 2: Check and select the device

In [None]:
# Determine device (GPU if available, else CPU)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## Step 3: Load the pre-trained BioBERT model

In [None]:

# Load pre-trained BioBERT tokenizer and model for sequence classification with 5 output labels
model_name = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

# Move model to the appropriate device
model = model.to(device)





Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dmis-lab/biobert-base-cased-v1.1 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



## Step 4: Load the dataset

In [None]:
# Load your dataset: assumes CSV with 'text' and 'label' columns and splits named 'train' and 'validation'
dataset = load_dataset("csv", data_files={"train": "/content/sample_data/medical_cases_train.csv",
                                         "validation": "/content/sample_data/medical_cases_val.csv"})


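The loader above assumes each CSV has a `text` column and a `label` column. Because the model is created with `num_labels=5`, the labels are expected to be integer class IDs from 0 to 4. A minimal sketch of a pre-flight check using only the standard library is shown below; the helper name `check_rows` and the sample rows are hypothetical and are not part of the original notebook.

```python
import csv
import io

NUM_LABELS = 5  # matches num_labels=5 passed to the model


def check_rows(csv_file, num_labels=NUM_LABELS):
    """Return the number of data rows, raising ValueError on a malformed row."""
    reader = csv.DictReader(csv_file)
    missing = {"text", "label"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    count = 0
    for row_no, row in enumerate(reader, start=1):
        # int("") raises ValueError, so empty or non-numeric labels are rejected
        label = int(row["label"] or "")
        if not 0 <= label < num_labels:
            raise ValueError(f"row {row_no}: label {label} out of range")
        if not (row["text"] or "").strip():
            raise ValueError(f"row {row_no}: empty text")
        count += 1
    return count


# Hypothetical sample rows, for illustration only
sample = io.StringIO(
    "text,label\n"
    "Patient reports chest pain on exertion.,0\n"
    "Recurrent seizures since childhood.,1\n"
)
print(check_rows(sample))
```

To check a real file, the same function could be called on an open file handle, for example `check_rows(open(path, newline=""))`.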

## Step 5: Tokenize the dataset

In [None]:

# Define the tokenize function
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

# Tokenize the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Set format for PyTorch (optional but recommended)
tokenized_datasets.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

Map: 33 examples (train)

Map: 8 examples (validation)
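With `padding="max_length"` and `max_length=512`, every example is padded to 512 tokens regardless of its real length. The sketch below compares that cost with padding each batch only to its longest sequence (dynamic padding, which `transformers` offers through `DataCollatorWithPadding`). The token counts are hypothetical and chosen only to illustrate the arithmetic.

```python
# Hypothetical token counts for a small batch of short clinical notes
token_lengths = [18, 42, 27, 35]
MAX_LENGTH = 512


def padded_tokens(lengths, target):
    """Total tokens a batch occupies when every sequence is padded to `target`."""
    return len(lengths) * target


# Fixed padding: every sequence becomes MAX_LENGTH tokens long
fixed = padded_tokens(token_lengths, MAX_LENGTH)

# Dynamic padding: every sequence is padded to the longest one in the batch
dynamic = padded_tokens(token_lengths, max(token_lengths))

print(f"fixed padding:   {fixed} tokens")
print(f"dynamic padding: {dynamic} tokens")
```

For short texts, dynamic padding can reduce the number of tokens the model processes per batch; whether that matters depends on the real length distribution of the dataset.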

## Step 6: Set the training arguments and train

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./medical_classifier",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

# Training
trainer.train()



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 1.546232        |
| 2     | No log        | 1.533426        |
| 3     | No log        | 1.526128        |
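The evaluation table reports only validation loss. A minimal sketch of a `compute_metrics` function that adds accuracy is shown below. It is written in plain Python so it runs without extra dependencies and should also accept the NumPy arrays that `Trainer` passes in. The example logits and labels are hypothetical.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    # Index of the largest logit in each row is the predicted class
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p) == int(y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for three examples over five classes
example_logits = [
    [2.0, 0.1, 0.3, 0.0, 0.2],
    [0.1, 1.5, 0.2, 0.0, 0.3],
    [0.4, 0.2, 0.1, 0.0, 0.9],
]
example_labels = [0, 1, 2]
print(compute_metrics((example_logits, example_labels)))
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` would add the metric to each evaluation pass.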


TrainOutput(global_step=9, training_loss=1.5860907236735027, metrics={'train_runtime': 56.6345, 'train_samples_per_second': 1.748, 'train_steps_per_second': 0.159, 'total_flos': 26048696103936.0, 'train_loss': 1.5860907236735027, 'epoch': 3.0})
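The numbers in this output can be sanity-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation. It reproduces `global_step=9` from the dataset size and batch size, and computes the cross-entropy of a uniform guess over 5 classes for comparison with the reported losses.

```python
import math

# Values taken from the notebook: 33 training examples (Map output),
# batch size 16, 3 epochs, 5 output labels
num_train_examples = 33
batch_size = 16
num_epochs = 3
num_labels = 5

# Assumes one device and no gradient accumulation; the last partial batch is kept
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Cross-entropy of a uniform prediction over num_labels classes
chance_loss = math.log(num_labels)

print(f"steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
print(f"chance-level cross-entropy: {chance_loss:.4f}")
```

The training loss of about 1.586 and the validation loss of about 1.53 are both close to the chance level of about 1.609, which suggests the classifier head has learned very little from 9 steps on 33 examples. The `No log` entries are expected here: the default `logging_steps` in `TrainingArguments` is 500, which is more than the 9 steps taken.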

## Step 7: Define an inference function and predict

In [None]:
# Inference function
def classify_medical_text(text):
    # Tokenize input
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

    # Move inputs to the same device as the model
    inputs = {key: value.to(device) for key, value in inputs.items()}

    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(probabilities, dim=-1).item()
    # Class names; the order must match the integer label IDs used in the training CSV
    labels = ["Cardiology", "Neurology", "Oncology", "Pediatrics", "Other"]
    return labels[predicted_class]

# Example usage
text = "Patient experiences persistent headaches and vision changes."
result = classify_medical_text(text)
print(f"Predicted specialty: {result}")

Predicted specialty: Cardiology
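The function above returns only the top label, which hides how confident the model is. After 9 optimizer steps on 33 examples the probabilities may be nearly uniform, so a label such as Cardiology for a headache description is not surprising. The sketch below ranks all classes by softmax probability using plain Python. The helper name `rank_labels` and the logits are hypothetical; they are not values produced by the trained model.

```python
import math

LABELS = ["Cardiology", "Neurology", "Oncology", "Pediatrics", "Other"]


def rank_labels(logits, labels=LABELS):
    """Return (label, probability) pairs sorted from most to least likely."""
    # Subtract the maximum logit for numerical stability before exponentiating
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Hypothetical, nearly flat logits of the kind a barely trained head might emit
hypothetical_logits = [0.12, 0.05, -0.03, -0.08, 0.01]
for label, prob in rank_labels(hypothetical_logits):
    print(f"{label}: {prob:.3f}")
```

In the notebook's own function, the same information is available from the `probabilities` tensor, for example via `probabilities.squeeze(0).tolist()`.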
