# Text Classification Using Hugging Face (Fine-Tuning)
* Reference: https://huggingface.co/docs/transformers/v4.18.0/en/tasks/sequence_classification
* Similar article: https://medium.com/@sandeep.ai/text-classification-using-hugging-face-fine-tuning-43c7416b049b

In [None]:
!pip install transformers
!pip install accelerate -U
!pip install datasets

## Loading the Model
* Loading a ready-made sentiment classifier from the Hugging Face Hub
* Source: https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english


In [2]:
from transformers import pipeline

# load model
classifier = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english')

In [3]:
# test
result = classifier("I really enjoyed the movie!")
print(result)

[{'label': 'POSITIVE', 'score': 0.9998713731765747}]
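
The `score` in the output above is a probability. For a two-label model, a text-classification pipeline of this kind typically applies a softmax to the model's raw logits and reports the most likely label. The following is a minimal sketch of that step, not the library's actual code; the logits are made-up values chosen only for illustration, and `softmax` is a helper defined here.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one sentence, ordered as (NEGATIVE, POSITIVE).
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-4.3, 4.6]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
prediction = {"label": id2label[best], "score": probs[best]}
print(prediction)
```

A large gap between the two logits is what produces a score close to 1.0, as in the test sentence above.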


## Dataset
* Source: https://huggingface.co/datasets/stanfordnlp/imdb

In [5]:
from datasets import load_dataset

imdb = load_dataset("imdb")

In [6]:
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

## Tokenize

In [7]:
from transformers import AutoTokenizer

model_ckpt = "distilbert-base-uncased"  # or a path to a model in local storage
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [8]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
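
With `truncation=True`, reviews longer than the model's maximum input length are cut off so they fit. The sketch below illustrates the idea on plain lists of token ids. It assumes a maximum length of 512 tokens for DistilBERT and the `[CLS]`/`[SEP]` ids of the BERT uncased vocabulary; `truncate_ids` is a hypothetical helper, not a tokenizer method, and the input ids are placeholders.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] ids in the BERT uncased vocabulary
MAX_LENGTH = 512           # assumed maximum input length for DistilBERT

def truncate_ids(token_ids, max_length=MAX_LENGTH):
    """Keep the leading tokens so that [CLS] + tokens + [SEP] fits in max_length."""
    budget = max_length - 2  # reserve two positions for the special tokens
    return [CLS_ID] + list(token_ids[:budget]) + [SEP_ID]

short_review = list(range(1000, 1010))  # 10 placeholder word-piece ids
long_review = list(range(1000, 1700))   # 700 placeholder word-piece ids

print(len(truncate_ids(short_review)))  # 12: nothing is dropped
print(len(truncate_ids(long_review)))   # 512: the tail of the review is dropped
```

Note that no padding happens at this stage; short reviews stay short until the data collator batches them.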

In [9]:
tokenized_imdb = imdb.map(preprocess_function, batched=True)

In [10]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
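
`DataCollatorWithPadding` pads each batch to the length of its longest example instead of padding the whole dataset to one global length. A rough sketch of that behaviour on plain Python lists follows. `pad_batch` is a hypothetical helper, the pad id 0 is the `[PAD]` id of the BERT uncased vocabulary, and the token ids in the batch are illustrative only.

```python
PAD_ID = 0  # [PAD] id in the BERT uncased vocabulary

def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        padding = longest - len(ids)
        input_ids.append(list(ids) + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[101, 2023, 3185, 102], [101, 2307, 102]]  # illustrative token ids
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch keeps batches of short reviews small, which saves computation compared with padding everything to 512 tokens.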

## Pretrained Model
* Other text-classification models: https://huggingface.co/models?pipeline_tag=text-classification

In [11]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
# Define your training hyperparameters in TrainingArguments.
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
)

# Pass the training arguments to Trainer along with the model, dataset, tokenizer, and data collator.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
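
The `Trainer` above receives an `eval_dataset` but no metric function, so an evaluation run would report only the loss. One option is to pass a `compute_metrics` callable. The sketch below assumes the callable receives a pair of logits and labels, and it is checked here on made-up toy values rather than on real model output.

```python
def compute_metrics(eval_pred):
    """Return accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    # Plain-Python argmax per row; works on nested lists and NumPy arrays alike.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Toy check with hypothetical logits for four reviews.
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]
print(compute_metrics((toy_logits, toy_labels)))
```

To use it, the function could be passed as `compute_metrics=compute_metrics` when constructing the `Trainer`, after which `trainer.evaluate()` would include the accuracy in its results.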


In [13]:
# Call train() to fine-tune your model.
trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 0.3183 |
| 1000 | 0.2548 |
| 1500 | 0.2309 |
| 2000 | 0.1675 |
| 2500 | 0.1488 |
| 3000 | 0.1583 |
| 3500 | 0.1029 |
| 4000 | 0.0912 |
| 4500 | 0.0942 |
| 5000 | 0.0701 |


TrainOutput(global_step=7815, training_loss=0.11887818102796949, metrics={'train_runtime': 6155.0369, 'train_samples_per_second': 20.309, 'train_steps_per_second': 1.27, 'total_flos': 1.6382133492223008e+16, 'train_loss': 0.11887818102796949, 'epoch': 5.0})
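
The `global_step=7815` and `train_samples_per_second` values in the output above can be reproduced from the dataset size and the hyperparameters. This worked calculation assumes a single device and no gradient accumulation, which this notebook does not state explicitly.

```python
import math

train_examples = 25_000    # rows in the IMDB train split
batch_size = 16            # per_device_train_batch_size
epochs = 5                 # num_train_epochs
train_runtime = 6155.0369  # seconds, from the TrainOutput above

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_examples * epochs / train_runtime

print(steps_per_epoch, total_steps, round(samples_per_second, 3))
```

This gives 1563 steps per epoch and 7815 steps in total, matching the reported `global_step`, and about 20.309 samples per second, matching `train_samples_per_second`.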