### Domain Adaptation and Transfer Learning Challenges

## Understanding Domain Adaptation and Handling Domain-Specific Data

### What is Domain Adaptation?

Domain adaptation is the process of adapting a model trained on one data distribution (the **source domain**) so that it performs well on a different one (the **target domain**).

**Example:**
- **Source domain:** General news articles
- **Target domain:** Medical text

### Why Domain Adaptation?

- Many real-world tasks involve domain-specific data.
- Models pre-trained on general-purpose datasets often underperform on specialized text, where vocabulary and writing style differ, unless they are adapted.
- Domain adaptation helps leverage existing models and data, reducing the need for large labeled datasets in the target domain.

### Steps in Domain Adaptation

1. **Fine-tune the pre-trained model** on a domain-specific dataset.
2. **Incorporate domain-specific embeddings or vocabulary** to better capture the nuances of the target domain.
3. **Apply additional pre-training** on unlabeled domain text if necessary, to further align the model with the target domain's characteristics.

### Challenges in Transfer Learning

#### Data Mismatch
- The source domain may not represent the target domain adequately.
- **Example:** General text (Wikipedia) vs. technical medical jargon.
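
One rough way to quantify this mismatch is to measure how many target-domain tokens never occur in the source data. The sketch below is only an illustration: the `tokenize` and `oov_rate` helpers and the three sample sentences are hypothetical, and the regex tokenizer is far cruder than what a real pipeline would use.

```python
import re

def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def oov_rate(source_texts, target_texts):
    """Fraction of target tokens whose word type never appears in the source texts."""
    source_vocab = {tok for text in source_texts for tok in tokenize(text)}
    target_tokens = [tok for text in target_texts for tok in tokenize(text)]
    if not target_tokens:
        return 0.0
    unseen = sum(1 for tok in target_tokens if tok not in source_vocab)
    return unseen / len(target_tokens)

# Made-up sentences standing in for a general corpus and a medical corpus
source = [
    "The government announced a new policy on Monday.",
    "The patient was taken to the hospital after the accident.",
]
target = ["The patient received adjuvant chemotherapy after the resection."]

print(oov_rate(source, target))  # 4 of the 8 target tokens are unseen -> 0.5
```

A high out-of-vocabulary rate is only a coarse signal; two domains can share most of their vocabulary and still differ in how the words are used.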

#### Catastrophic Forgetting
- During fine-tuning, the model may forget the knowledge learned from the source domain, leading to reduced generalization.
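
One common mitigation, not covered elsewhere in this section, is to penalize the fine-tuned weights for drifting away from their pre-trained values. The following is a minimal sketch of that idea on plain Python lists; the function names and the `strength` value are illustrative assumptions, and a real training loop would apply the same penalty to the model's parameter tensors.

```python
def l2_to_pretrained_penalty(current, pretrained, strength):
    """Return strength * sum((w - w0)^2) over paired parameters."""
    if len(current) != len(pretrained):
        raise ValueError("parameter lists must have the same length")
    return strength * sum((w - w0) ** 2 for w, w0 in zip(current, pretrained))

def regularized_loss(task_loss, current, pretrained, strength=0.01):
    """Task loss plus the drift penalty; a larger strength keeps weights closer to pre-training."""
    return task_loss + l2_to_pretrained_penalty(current, pretrained, strength)

pretrained_weights = [1.0, 0.0]   # hypothetical values before fine-tuning
finetuned_weights = [1.0, 2.0]    # hypothetical values after some updates

# The second weight moved by 2.0, so the penalty is 0.5 * 2.0**2 = 2.0
print(regularized_loss(1.0, finetuned_weights, pretrained_weights, strength=0.5))  # 3.0
```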

#### Computational Constraints
- Fine-tuning large pre-trained models requires significant computational resources, which may not always be available.
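
To get a feel for the cost, the back-of-the-envelope estimate below counts the memory needed for weights, gradients, and Adam optimizer states in 32-bit precision. It is a sketch under stated assumptions: the 110M parameter count is the approximate size of a BERT-base encoder, the helper name is hypothetical, and activation memory (which often dominates at long sequence lengths) is ignored.

```python
def finetune_memory_gb(total_params, trainable_params=None,
                       bytes_per_value=4, optimizer_states=2):
    """Rough training memory in GB (1e9 bytes), ignoring activations.

    Every parameter stores a weight; only trainable parameters also store
    a gradient and `optimizer_states` extra values (2 for Adam's moments).
    """
    if trainable_params is None:
        trainable_params = total_params  # full fine-tuning
    weight_bytes = total_params * bytes_per_value
    training_bytes = trainable_params * bytes_per_value * (1 + optimizer_states)
    return (weight_bytes + training_bytes) / 1e9

BERT_BASE_PARAMS = 110_000_000  # approximate size of a BERT-base encoder
HEAD_PARAMS = 768 * 5 + 5       # linear classifier: hidden size 768, 5 labels

full = finetune_memory_gb(BERT_BASE_PARAMS)
head_only = finetune_memory_gb(BERT_BASE_PARAMS, trainable_params=HEAD_PARAMS)
print(f"full fine-tuning: {full:.2f} GB, head only: {head_only:.2f} GB")
# full fine-tuning: 1.76 GB, head only: 0.44 GB
```

Freezing the encoder and training only the classification head removes most of the gradient and optimizer memory, usually at some cost in accuracy.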

### Strategies to Address Challenges

#### Transfer Learning from Related Domains
- Fine-tune on an intermediate domain dataset before adapting to the target domain.
- **Example:** Adapt from general news to scientific articles, then to medical text.

#### Data Augmentation
- Generate synthetic domain-specific data to augment the target dataset.
- **Example:** Use paraphrasing techniques or back-translation to increase dataset diversity.

#### Domain-Specific Embeddings
- Use pre-trained embeddings tailored to the target domain (e.g., BioWordVec for biomedical text).
- Incorporate specialized tokenizers or vocabularies to better handle domain-specific terminology.
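
The effect of a domain-specific vocabulary can be illustrated with a toy version of greedy longest-match-first WordPiece splitting, the scheme BERT-style tokenizers use. This is a simplified sketch: the `wordpiece_tokenize` helper and both vocabularies are made up for illustration and are not taken from any real checkpoint.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabularies: the domain one adds a single medical term
general_vocab = {"im", "##mun", "##oth", "##era", "##py"}
domain_vocab = general_vocab | {"immunotherapy"}

print(wordpiece_tokenize("immunotherapy", general_vocab))
# ['im', '##mun', '##oth', '##era', '##py']
print(wordpiece_tokenize("immunotherapy", domain_vocab))
# ['immunotherapy']
```

With the general vocabulary the term is broken into five fragments that carry little meaning on their own, while the domain vocabulary keeps it as a single token.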

---

**Summary:**  
Domain adaptation is essential for applying machine learning models to specialized domains. By understanding the challenges and employing effective strategies, we can improve model performance and generalization in domain-specific tasks.

from datasets import load_dataset

# Load the PubMed 20k RCT dataset
dataset = load_dataset("pubmed_rct", "20k_rct")
# Inspect one training example to confirm the column names used below
print(dataset["train"][0])

from transformers import AutoTokenizer

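# Load the tokenizer that matches the BioBERT checkpoint fine-tuned below;
# the bert-base-uncased tokenizer uses a different vocabulary and casing
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")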

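def preprocess_data(examples):
    # Assumes the sentence column is named "text", as in the augmentation step;
    # check this against the example printed after loading the dataset
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)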

tokenized_dataset = dataset.map(preprocess_data, batched=True)
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")
tokenized_dataset.set_format("torch")

from transformers import AutoModelForSequenceClassification

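# General-domain baseline (the tokenizer should come from the same checkpoint as the model):
# model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=5)

# Domain-specific checkpoint: BioBERT, with a 5-way classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "dmis-lab/biobert-base-cased-v1.1", num_labels=5
)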

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # evaluate after every epoch; recent transformers releases name this eval_strategy
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
)

trainer.train()

# Evaluate on the test split (reports loss and runtime statistics, since no compute_metrics is set)
results = trainer.evaluate()
print(results)

import random

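def augment_text(text):
    # Toy synonym table; matches are exact, case-sensitive, whole words
    synonyms = {"cancer": ["tumour", "tumours"], "therapy": ["treatment", "treatments"]}
    words = text.split()
    # Replace each word that has an entry with a randomly chosen synonym
    new_words = [random.choice(synonyms[word]) if word in synonyms else word for word in words]
    return " ".join(new_words)

# Build one augmented sentence per training example
augmented_data = [augment_text(sample["text"]) for sample in dataset["train"]]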

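# Keep the augmented sentences in their own column next to the originals
augmented_dataset = dataset["train"].add_column("augmented_text", augmented_data)

def preprocess_augmented(examples):
    # Tokenize the augmented column so that training actually sees the augmented text
    return tokenizer(examples["augmented_text"], truncation=True, padding="max_length", max_length=512)

augmented_tokenized_dataset = augmented_dataset.map(preprocess_augmented, batched=True)
augmented_tokenized_dataset = augmented_tokenized_dataset.rename_column("label", "labels")
augmented_tokenized_dataset.set_format("torch")

# augmented_tokenized_dataset is a single split (a Dataset, not a DatasetDict),
# so it is passed directly; evaluation reuses the original, unaugmented test split.
# Note: `model` was already fine-tuned above, so this run continues from those weights.
augmented_trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=augmented_tokenized_dataset,
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
)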

augmented_trainer.train()

augmented_results = augmented_trainer.evaluate()
print(augmented_results)