# Fine-tune DistilBERT for text classification

This notebook trains `text-classification` models by fine-tuning the `distilbert-base-uncased` base model with [Transformers 🤗🤗](https://huggingface.co/docs/transformers/index).

### Considerations
- The dataset must have a "text" column that contains all the input questions
- An `S3 Instance` is required to store the model
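
The first consideration can be checked before any training starts. The following is a minimal sketch that uses only the standard library; the helper name `missing_columns` and the in-memory sample are hypothetical and stand in for the real dataset file.

```python
import csv
import io

# Column required by the considerations above
REQUIRED_COLUMNS = ("text",)


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = set(reader.fieldnames or [])
    return sorted(set(required) - header)


# Hypothetical in-memory sample standing in for the real dataset file
sample = io.StringIO("text,intent\nwhere is my order?,order_status\n")
print(missing_columns(sample))
```

An empty list means the header contains every required column; any names returned are the ones to add before training.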

In [2]:
!pip install pydantic

Collecting pydantic
  Downloading pydantic-2.10.2-py3-none-any.whl.metadata (170 kB)
Collecting annotated-types>=0.6.0 (from pydantic)
  Using cached annotated_types-0.7.0-py3-none-any.whl.metadata (15 kB)
Collecting pydantic-core==2.27.1 (from pydantic)
  Downloading pydantic_core-2.27.1-cp312-cp312-macosx_11_0_arm64.whl.metadata (6.6 kB)
Downloading pydantic-2.10.2-py3-none-any.whl (456 kB)
Downloading pydantic_core-2.27.1-cp312-cp312-macosx_11_0_arm64.whl (1.8 MB)
Using cached annotated_types-0.7.0-py3-none-any.whl (13 kB)
Installing collected packages: pydantic-core, annotated-types, pydantic
Successfully installed annotated-types-0.7.0 pydantic-2.10.2 pydantic-core-2.27.1

In [31]:
from pydantic import BaseModel
from typing import Optional

class User(BaseModel):
    name: Optional[str] = None
    email: Optional[str] = None

# Create a Pydantic model instance
user = User(name="Aloja", email="aaaa")

# Convert to a dictionary
user_dict = user.model_dump()
# Merge in any non-None overrides (the empty dict makes this a no-op here)
user_dict.update({k: v for k, v in {}.items() if v is not None})
print(user_dict)


{'name': 'Aloja', 'email': 'aaaa'}


#### Install required libs   📥📥

In [1]:
# Quote the extras so the shell (e.g. zsh) does not treat the square brackets as a glob pattern
!pip install transformers datasets evaluate accelerate mlflow tf-keras seaborn "optimum[openvino,nncf,exporters]" psutil pynvml -q


## Dataset manipulation & env preparation

In [1]:
import sys
from pathlib import Path

notebook_dir = Path().resolve()
sys.path.append(str(notebook_dir.parents[1]))

In [4]:
from nlp import AlquimiaTrainer

Matplotlib is building the font cache; this may take a moment.


In [None]:
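import json
import os

import pandas as pd

input_column_name = "text"
# Assumption: the original cell never defines output_column_name. "intent" is a
# placeholder; set it to the column in your CSV that holds the string labels.
output_column_name = "intent"
labeled_dataset = "datasets/dataset.csv"
df = pd.read_csv(labeled_dataset)

# Load the label mappings, closing each file once it has been read
with open("datasets/label2id.json") as file_label2id:
    label2id = json.load(file_label2id)
with open("datasets/id2label.json") as file_id2label:
    id2label = json.load(file_id2label)

# Encode the string labels as integer ids
df["label"] = df[output_column_name].replace(label2id)
print(f"The label2id json loaded correctly: {label2id}")
print(f"The id2label json loaded correctly: {id2label}")
df.head(3)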
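
The contents of `label2id.json` and `id2label.json` are not shown in this notebook. The sketch below assumes they are inverse mappings between label names and integer ids, and uses hypothetical label names to illustrate one detail worth keeping in mind: JSON object keys are always strings, so the ids in an `id2label` mapping come back as `"0"`, `"1"`, and so on after a round trip through JSON.

```python
import json

# Hypothetical label names, for illustration only
labels = ["greeting", "order_status", "goodbye"]

# Forward and inverse mappings between label names and integer ids
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for label, idx in label2id.items()}

# Round trip through JSON: the integer keys are serialized as strings
restored = json.loads(json.dumps(id2label))
print(list(restored.keys()))

# Convert the keys back to int before indexing with integer class ids
id2label_int = {int(idx): label for idx, label in restored.items()}
print(id2label_int == id2label)
```

Whether the keys need converting back depends on how the consuming code looks up ids, so treat the last step as optional.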

## Give a name to your model and version  🧙‍♂️🧙‍♂️

This step matters because a `text-classification` model can serve many different purposes, so a clear name and version make it easy to tell what each model was trained for.

In [None]:
model_name = "intents-copa"
MLFLOW_EXPERIMENT = "showcases"
base_model = 'distilbert-base-uncased'
MLFLOW_RUN_NAME = "V1 Intents for copa model"

In [None]:
trainer = AlquimiaTrainer(model_name, MLFLOW_EXPERIMENT)

In [None]:
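### Fine-tune the model
# The dataset is passed positionally: Python does not allow positional arguments after a keyword argument
fine_tune = trainer.text_classification(df, label2id, id2label, base_model, MLFLOW_RUN_NAME)
fine_tune.train()

## Save the model binary
fine_tune.save_model(model_name)

## Log the model in MLflow
fine_tune.log_model(MLFLOW_RUN_NAME)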

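### Batches per epoch

With a batch size of 20, for example, the number of batches in one epoch is

n = total_dataset / batch_size

where `total_dataset` is the number of training examples. If the division is not exact, round `n` up, because the last batch is smaller than the others.

### How many training steps will my model run?

n * epochs

Assuming one update per batch (no gradient accumulation), the model weights are updated `n * epochs` times in total.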
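
The arithmetic above can be sketched in a few lines. The function names and the example numbers are hypothetical, and the sketch assumes the last partial batch is kept and that every batch triggers exactly one update.

```python
import math


def batches_per_epoch(total_examples: int, batch_size: int) -> int:
    """Number of batches in one pass over the dataset (last partial batch kept)."""
    return math.ceil(total_examples / batch_size)


def total_training_steps(total_examples: int, batch_size: int, epochs: int) -> int:
    """Total number of updates, assuming one update per batch."""
    return batches_per_epoch(total_examples, batch_size) * epochs


# Hypothetical numbers: 1000 examples, batch size 20, 3 epochs
print(batches_per_epoch(1000, 20))        # 50
print(total_training_steps(1000, 20, 3))  # 150
```

With gradient accumulation or a dropped last batch the counts would differ, so treat these values as an upper-level estimate rather than what a given trainer reports.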

In [None]:
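## Delete the local artifacts created by this notebook
import os
import shutil

# Remove the local model directory
shutil.rmtree(model_name)
# Assumption: the original referenced an undefined `run_name`;
# MLFLOW_RUN_NAME is the only run name defined in this notebook
shutil.rmtree(MLFLOW_RUN_NAME)
os.remove(labeled_dataset)
shutil.rmtree(f"{model_name}_onnx")
os.remove("datasets/label2id.json")
os.remove("datasets/id2label.json")
os.remove("./confusion_matrix.png")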

---