<a href="https://colab.research.google.com/github/Mohamed-Taha-Essa/Generative-AI/blob/main/Finetuning_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Task description:**

You are working on a health-care mobile app. Users enter a brief description of their case, and based on it a specialist from the relevant department contacts them, either with the proper treatment or with the contact details of a specific doctor.

You are responsible for developing the ML model that classifies the entered text into one of the following departments, so that the designated person can respond:

**['symptom' 'disease' 'treatment' 'specialty']**

You have a relatively small dataset to train your model, so leveraging one of the pretrained models looks like a good choice. Explore this idea with the aid of the HuggingFace ecosystem.

*There is some boilerplate code to assist you; there is no need to change it.*

Main Ideas:
* Zero-shot Classification
* Why could we need finetuning?
* The classifier layer(s): added automatically or built with PyTorch



Explore the data

In [2]:
import pandas as pd

df = pd.read_csv("./data.csv")

In [None]:
df.head()

Unnamed: 0,text,label
0,أعاني من صداع شديد وألم خلف العينين,symptom
1,مرض السكري من النوع الثاني يتطلب إدارة دقيقة ل...,disease
2,ينصح الأطباء باستخدام الباراسيتامول لتخفيف الألم,treatment
3,طب الأطفال يتعامل مع صحة الأطفال والمراهقين,specialty
4,الحمى وارتفاع درجة الحرارة من الأعراض الشائعة ...,symptom


In [3]:
print("total examples are", len(df))
print("labels are", df.label.unique())
print("total labels count is", df.label.nunique())


total examples are 52
labels are ['symptom' 'disease' 'treatment' 'specialty']
total labels count is 4


In [4]:
# Create a LabelEncoder and fit it to your labels
from sklearn.preprocessing import LabelEncoder

labels = df['label']
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)  # Converts labels to integers

In [5]:
label_encoder.classes_

array(['disease', 'specialty', 'symptom', 'treatment'], dtype=object)

In [6]:
list(labels)[0:5]

['symptom', 'disease', 'treatment', 'specialty', 'symptom']

In [7]:
encoded_labels[0:5]

array([2, 0, 3, 1, 2])
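
`LabelEncoder` assigns ids following the sorted order of the class names, which is why `symptom` becomes 2 rather than 0 even though it appears first in the data. The following is a minimal pure-Python sketch of that mapping; the names `label2id` and `id2label` are illustrative helpers and not part of the notebook.

```python
# Pure-Python sketch of what LabelEncoder does for string labels:
# ids follow the sorted order of the class names, not the order of appearance.
labels = ["symptom", "disease", "treatment", "specialty", "symptom"]

classes = sorted(set(labels))  # ['disease', 'specialty', 'symptom', 'treatment']
label2id = {name: i for i, name in enumerate(classes)}  # illustrative helper
id2label = {i: name for name, i in label2id.items()}    # illustrative helper

encoded = [label2id[name] for name in labels]
decoded = [id2label[i] for i in encoded]

print(encoded)            # [2, 0, 3, 1, 2]
print(decoded == labels)  # True
```

Keeping the inverse mapping around is what later lets a predicted id be turned back into a department name.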

In [8]:
df['label'] = encoded_labels
df.head()

Unnamed: 0,text,label
0,أعاني من صداع شديد وألم خلف العينين,2
1,مرض السكري من النوع الثاني يتطلب إدارة دقيقة ل...,0
2,ينصح الأطباء باستخدام الباراسيتامول لتخفيف الألم,3
3,طب الأطفال يتعامل مع صحة الأطفال والمراهقين,1
4,الحمى وارتفاع درجة الحرارة من الأعراض الشائعة ...,2


Preprocess the data so it is ready for the models, with the following steps:

1. Split the text into tokens (words or subwords).
2. Map those tokens to their corresponding ids (from the tokenizer's vocabulary, e.g. vocab.txt).
3. To get the proper tokenizer and the correct vocabulary, load the tokenizer with the same model name you will use for the model itself.
4. Make sure the vocabulary contains Arabic tokens, i.e. pick a model that supports Arabic.
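
The steps above can be sketched with a toy tokenizer. This is only an illustration under simplifying assumptions: it splits on whitespace, whereas real tokenizers use subword algorithms, and both vocabularies are invented for the example. It shows why a vocabulary without Arabic entries is a problem: every token collapses to the unknown id.

```python
# Toy illustration only: real tokenizers use subword algorithms
# (WordPiece, SentencePiece, BPE), not whitespace splitting.
UNK = "[UNK]"

def encode(text, vocab):
    """Split on whitespace and map each token to its id, or to the UNK id."""
    return [vocab.get(token, vocab[UNK]) for token in text.split()]

# Hypothetical vocabularies, invented for this example.
english_vocab = {UNK: 0, "severe": 1, "headache": 2}
arabic_vocab = {UNK: 0, "صداع": 1, "شديد": 2}

text = "صداع شديد"  # "severe headache"
print(encode(text, english_vocab))  # [0, 0]  every token is unknown
print(encode(text, arabic_vocab))   # [1, 2]
```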

In [9]:
pip install datasets

In [10]:
from datasets import Dataset
from sklearn.model_selection import train_test_split

def preprocess_data(data): #takes a pandas dataframe
    """Splits the dataset into train and test sets."""
    tr, te = train_test_split(data, test_size=0.2, stratify=data['label'], random_state=42)
    tr = Dataset.from_pandas(tr)
    te = Dataset.from_pandas(te)
    return tr, te

train_data, test_data = preprocess_data(df)

In [11]:
train_data

Dataset({
    features: ['text', 'label', '__index_level_0__'],
    num_rows: 41
})

In [12]:
test_data

Dataset({
    features: ['text', 'label', '__index_level_0__'],
    num_rows: 11
})

In [13]:
train_data[0]['text']

'الشعور بالخمول والنوم المفرط من أعراض الاكتئاب'

In [14]:
train_data[0]['label']

2

In [15]:
label_encoder.classes_[train_data[0]['label']]

'symptom'

Let's test what we can do without fine-tuning:

- Can we try a pipeline?
- Can we try zero-shot classification?

In [16]:
from transformers import pipeline

classifier = pipeline("text-classification")

classifier("مرض التهاب الأمعاء يتطلب تغييرات كبيرة في الحمية")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

[{'label': 'NEGATIVE', 'score': 0.5130485892295837}]

What is the problem with this approach?

In [17]:
# Try passing candidate labels to the same text-classification pipeline (this raises a TypeError)
classifier("مرض التهاب الأمعاء يتطلب تغييرات كبيرة في الحمية", candidate_labels=["symptom", "disease", "treatment", "specialty"])

TypeError: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'candidate_labels'

Running the previous line raises an error: got an unexpected keyword argument 'candidate_labels'.

This happens because a text-classification pipeline can only predict the labels its model was trained on; it does not accept new candidate labels at inference time.

Explore the zero-shot classification task

In [18]:
#TODO
from transformers import pipeline

classifier = pipeline("zero-shot-classification") # changed pipeline type

classifier(
    "مرض التهاب الأمعاء يتطلب تغييرات كبيرة في الحمية",
    candidate_labels=["symptom", "disease", "treatment", "specialty"]
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'مرض التهاب الأمعاء يتطلب تغييرات كبيرة في الحمية',
 'labels': ['specialty', 'symptom', 'treatment', 'disease'],
 'scores': [0.42990320920944214,
  0.33342093229293823,
  0.12693731486797333,
  0.10973860323429108]}

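Did it work? If yes, can you spot any issues?

It runs, but the prediction is not reliable. The top label is 'specialty' with a score of only about 0.43, while 'disease', the label one would expect for a sentence describing a disease, is ranked last at about 0.11. The default model, facebook/bart-large-mnli, was trained on English NLI data, so it should not be expected to handle Arabic text well. The model needs to be trained on our data first.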
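
For context, the zero-shot pipeline frames classification as natural language inference: each candidate label is turned into a hypothesis (the default template is "This example is {}."), the model scores how strongly the text entails each hypothesis, and in the single-label case those entailment scores are normalised across the labels. The sketch below reproduces only that last step; the logits are invented for illustration and are not model output.

```python
import math

def zero_shot_scores(entailment_logits):
    """Softmax over per-label entailment logits, returned best-first."""
    peak = max(entailment_logits.values())
    exps = {label: math.exp(v - peak) for label, v in entailment_logits.items()}
    total = sum(exps.values())
    ranked = sorted(exps.items(), key=lambda kv: kv[1], reverse=True)
    return [(label, value / total) for label, value in ranked]

candidate_labels = ["symptom", "disease", "treatment", "specialty"]
hypotheses = [f"This example is {label}." for label in candidate_labels]

# Hypothetical entailment logits, invented for illustration (not model output).
fake_logits = {"symptom": 0.2, "disease": 1.5, "treatment": -0.3, "specialty": 0.1}

for label, score in zero_shot_scores(fake_logits):
    print(f"{label}: {score:.3f}")
```

Because the scores always sum to 1 across the candidate labels, a low top score mostly signals that the model cannot separate the labels, which is what the Arabic input above shows.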


**I think we have to finetune**

In [23]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

In [82]:
# Pick a suitable model for the task from the Hugging Face model hub
# model_name = "MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7"
model_name = "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli"

In [83]:
# Tokenize the data

tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True, max_length=384)

train_data = train_data.map(tokenize_function, batched=True)
test_data = test_data.map(tokenize_function, batched=True)

# Prepare for PyTorch
train_data.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
test_data.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
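
To make `truncation`, `padding` and the resulting `attention_mask` concrete, here is a minimal pure-Python sketch. It is a simplification: the token ids are invented, and a real tokenizer also inserts special tokens and keeps them when truncating.

```python
def pad_batch(batch_ids, pad_id=0, max_length=None):
    """Truncate to max_length, then pad every sequence to the longest one."""
    if max_length is not None:
        batch_ids = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch_ids]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids, invented for this example.
batch = pad_batch([[5, 8, 13], [7, 2], [9, 4, 6, 11, 3]], max_length=4)
print(batch["input_ids"])       # [[5, 8, 13, 0], [7, 2, 0, 0], [9, 4, 6, 11]]
print(batch["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
```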

In [84]:
num_labels = df['label'].nunique()
print(num_labels)

4


In [103]:
# Load the pre-trained model for sequence classification.
# ignore_mismatched_sizes=True lets the checkpoint's 3-way NLI head be replaced by a new 4-way head.
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    ignore_mismatched_sizes=True,
)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at MoritzLaurer/mDeBERTa-v3-base-mnli-xnli and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([3]) in the checkpoint and torch.Size([4]) in the model instantiated
- classifier.weight: found shape torch.Size([3, 768]) in the checkpoint and torch.Size([4, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
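
The warning above is expected: the checkpoint ships a 3-way NLI head, and a freshly initialised 4-way head takes its place. As an optional variation, the department names can be stored in the model config by passing `id2label` and `label2id` when loading. The sketch below assumes the class order of `label_encoder.classes_`; `build_label_maps` and `load_model` are hypothetical helper names, not part of the notebook.

```python
def build_label_maps(classes):
    """Return (id2label, label2id) for a list of class names in id order."""
    id2label = {i: name for i, name in enumerate(classes)}
    label2id = {name: i for i, name in id2label.items()}
    return id2label, label2id

def load_model(model_name, classes):
    """Load the checkpoint with a fresh classification head sized to `classes`."""
    # Imported lazily so the label-map helper can be used without transformers.
    from transformers import AutoModelForSequenceClassification

    id2label, label2id = build_label_maps(classes)
    return AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(classes),
        id2label=id2label,
        label2id=label2id,
        ignore_mismatched_sizes=True,  # the checkpoint's 3-way NLI head is discarded
    )

# Same order as label_encoder.classes_ in this notebook.
classes = ["disease", "specialty", "symptom", "treatment"]
id2label, label2id = build_label_maps(classes)
print(id2label[2])  # symptom
```

In the notebook this would be called as `load_model(model_name, list(label_encoder.classes_))`.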


In [104]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",          # Output directory to save the model
    evaluation_strategy="epoch",     # Evaluate after each epoch
    learning_rate=5e-5,              # Learning rate
    per_device_train_batch_size=8,   # Batch size for training
    per_device_eval_batch_size=8,    # Batch size for evaluation
    num_train_epochs=30,              # Number of epochs
    weight_decay=0.01,               # Weight decay for regularization
    logging_dir="./logs",            # Directory for logs
    logging_steps=100,               # How often to log the training process
    report_to="none",                # Disable reporting to external loggers (e.g. TensorBoard, W&B)
)



In [105]:
import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    """Compute accuracy from the (logits, labels) pair passed in by the Trainer."""
    logits, labels = eval_pred                 # both arrive as numpy arrays
    predictions = np.argmax(logits, axis=-1)   # predicted class id per example
    return {"accuracy": accuracy_score(labels, predictions)}


Better results?

In [106]:
# Initialize the Trainer
trainer = Trainer(
    model=model,                         # Model to train
    args=training_args,                  # Training arguments
    train_dataset=train_data,         # Training dataset
    eval_dataset=test_data,           # Evaluation dataset
    compute_metrics=compute_metrics,     # Metric for evaluation
)

In [107]:
# Fine-tune the model
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,1.026257,0.818182
2,No log,0.729188,0.818182
3,No log,0.582999,0.818182
4,No log,0.43262,0.909091
5,No log,0.342708,0.909091
6,No log,0.168486,1.0
7,No log,0.085415,1.0
8,No log,0.10857,1.0
9,No log,0.0368,1.0
10,No log,0.016688,1.0


TrainOutput(global_step=180, training_loss=0.14750919424825246, metrics={'train_runtime': 68.7059, 'train_samples_per_second': 17.902, 'train_steps_per_second': 2.62, 'total_flos': 16434753023520.0, 'train_loss': 0.14750919424825246, 'epoch': 30.0})
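
The step count and the "No log" entries in the table follow from the training arguments. A short worked calculation, assuming a single device and no gradient accumulation:

```python
import math

train_examples = 41   # rows in train_data
batch_size = 8        # per_device_train_batch_size
epochs = 30           # num_train_epochs
logging_steps = 100   # logging_steps

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
# The training loss is first logged at step 100, which falls in this epoch.
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(steps_per_epoch)     # 6
print(total_steps)         # 180
print(first_logged_epoch)  # 17
```

With only 6 steps per epoch, a smaller `logging_steps` value would be needed to see the training loss during the early epochs.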

In [108]:
# Evaluate the model
results = trainer.evaluate()

# Print the evaluation results
print(f"Evaluation results: {results}")

Evaluation results: {'eval_loss': 0.0053993877954781055, 'eval_accuracy': 1.0, 'eval_runtime': 0.0744, 'eval_samples_per_second': 147.761, 'eval_steps_per_second': 26.866, 'epoch': 30.0}
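
When reading these numbers, keep in mind how coarse accuracy is on a test set of 11 examples: it can only move in steps of 1/11, so the values 0.818182 and 0.909091 in the training table correspond to 9 and 10 correct predictions.

```python
test_examples = 11  # rows in test_data

# Every accuracy the test set can produce, rounded as in the training table.
possible = [round(correct / test_examples, 6) for correct in range(test_examples + 1)]

print(possible[9])   # 0.818182
print(possible[10])  # 0.909091
print(possible[11])  # 1.0
```

A single example therefore shifts accuracy by about 9 percentage points, so a score of 1.0 on this split is encouraging but not strong evidence of how the model will behave on new text.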


In [110]:

trainer.save_model("./model")

In [111]:
# Save the fine-tuned model and its tokenizer
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")

('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/spm.model',
 './fine_tuned_model/added_tokens.json',
 './fine_tuned_model/tokenizer.json')
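
A possible next step is to reload the saved directory for inference. The sketch below is an assumption about how that could look, not part of the exercise: `load_classifier` and `decode_label` are hypothetical helpers. Since the model above was loaded without `id2label`, the pipeline is expected to return generic names such as `LABEL_2`, which `decode_label` maps back to department names.

```python
def decode_label(raw_label, classes):
    """Map a default pipeline label such as 'LABEL_2' to its class name."""
    if not raw_label.startswith("LABEL_"):
        return raw_label  # already a readable name
    index = int(raw_label.rsplit("_", 1)[1])
    return classes[index]

def load_classifier(model_dir="./fine_tuned_model"):
    """Build a text-classification pipeline from the saved model directory."""
    # Imported lazily; requires transformers and the directory saved above.
    from transformers import pipeline
    return pipeline("text-classification", model=model_dir, tokenizer=model_dir)

# Same order as label_encoder.classes_ in this notebook.
classes = ["disease", "specialty", "symptom", "treatment"]
print(decode_label("LABEL_2", classes))  # symptom
```

In the notebook, usage would look like `decode_label(load_classifier()(text)[0]["label"], classes)` for some input string `text`.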