# Text classification using DistilBERT on clothing data

**This guide is adapted from the [Hugging Face text classification guide](https://huggingface.co/docs/transformers/en/tasks/sequence_classification).**

This guide will show you how to:

1. Finetune [DistilBERT](https://huggingface.co/distilbert-base-uncased) on a custom dataset.
2. Use your finetuned model for inference.


Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install transformers datasets evaluate pandas torch
```

## Load the dataset from a CSV file


In [2]:
from datasets import Dataset
import pandas as pd


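# Read the product CSV; adjust the path to wherever your copy of the dataset lives.
df = pd.read_csv(
    r'C:\Users\ISS-User1\Desktop\eugene\Glowing-Torch\Datasets\dataset.csv', encoding='latin1')
hf_dataset = Dataset.from_pandas(df)
# Note: test_size=0.8 holds out 80% of the rows for testing and trains on the remaining 20%.
hf_dataset = hf_dataset.train_test_split(
    test_size=0.8, shuffle=True)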



Collect the set of classification labels that appear in either split of the dataset:

In [3]:
x_train = hf_dataset['train']['name']
x_test = hf_dataset['test']['name']
y_train = hf_dataset['train']['category']
y_test = hf_dataset['test']['category']
labels = set()
for label in y_test:
    labels.add(label)
for label in y_train:
    labels.add(label)


We now assign each label an integer id and build two mappings over all labels in the dataset: label to id, and id to label.

In [4]:
id2label = {}
label2id = {}
for i, label in enumerate(labels):
    id2label[i] = label
    label2id[label] = i
label_dict = {"id2label": id2label, "label2id": label2id}
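
One caveat with the mapping above: a Python set of strings has no stable iteration order across interpreter runs, so the id assigned to each category can change every time the notebook is executed. Below is a minimal sketch of a reproducible variant, assuming that sorting the labels alphabetically is acceptable. The category names and the `toy_` variables are made up for illustration.

```python
# Hypothetical category names; the real ones come from the dataset's `category` column.
toy_labels = {"tops", "trousers", "dresses", "outerwear"}

# Sorting fixes the order, so every run assigns the same id to each label.
toy_id2label = {i: label for i, label in enumerate(sorted(toy_labels))}
toy_label2id = {label: i for i, label in toy_id2label.items()}

print(toy_id2label)
print(toy_label2id)
```

Stable ids matter mostly when you compare runs or reuse a saved label mapping with a model trained in a different session.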

With the label mappings in place, we replace each text label with its integer id.

In [5]:
from datasets import DatasetDict

# Build new lists instead of assigning in place, so this also works
# when the column is not a plain (mutable) Python list.
y_train = [label2id[label] for label in y_train]
y_test = [label2id[label] for label in y_test]

We will now create a new Hugging Face `DatasetDict` that pairs the mapped label ids with their respective product names.

In [6]:
x_test = hf_dataset['test']['name']
x_train = hf_dataset['train']['name']
dataset = DatasetDict({'train': Dataset.from_dict({'label': y_train, 'text': x_train}),
                       'test': Dataset.from_dict({'label': y_test, 'text': x_test})})

We can inspect the dataset below.

In [7]:
print(dataset)
print(dataset["train"][0])

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 6338
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 25353
    })
})
{'label': 6, 'text': 'Twill chinos'}
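
The split sizes printed above follow from `test_size=0.8`. Here is a quick check of the arithmetic, assuming the library rounds the test share up and gives the remaining rows to the training split. The total row count is simply the sum of the two `num_rows` values.

```python
import math

total_rows = 6338 + 25353   # sum of the two splits printed above
test_size = 0.8             # fraction passed to train_test_split

# Assumption: the test split gets ceil(test_size * n) rows, the rest go to train.
n_test = math.ceil(test_size * total_rows)
n_train = total_rows - n_test

print(n_train, n_test)
```

Only about a fifth of the rows end up in the training split here. If the goal is to train on most of the data, `test_size=0.2` would be the more common choice.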


There are two fields in this dataset:

- `text`: the clothing product name.
- `label`: an integer from `0` to `len(labels) - 1` that represents a clothing category.

## Preprocess

The next step is to load a DistilBERT tokenizer to preprocess the `text` field:

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Create a preprocessing function to tokenize `text` and truncate sequences to be no longer than DistilBERT's maximum input length:

In [9]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

To apply the preprocessing function over the entire dataset, use 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once:

In [10]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)

Map: 100%|██████████| 6338/6338 [00:00<00:00, 41770.96 examples/s]
Map: 100%|██████████| 25353/25353 [00:00<00:00, 36118.19 examples/s]


Now create a batch of examples using [DataCollatorWithPadding](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorWithPadding). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [11]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
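
To make dynamic padding concrete, here is a dependency-free sketch of what a padding collator conceptually does. The helper `pad_batch` and the token ids are hypothetical; the real `DataCollatorWithPadding` also handles tensors, labels, and tokenizer-specific settings. The default pad id of `0` is the `[PAD]` token id in the `distilbert-base-uncased` vocabulary.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for three product names of different lengths.
toy_batch = [[101, 2001, 102], [101, 2002, 2003, 2004, 102], [101, 102]]
padded = pad_batch(toy_batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because product names are short, padding each batch only to its own longest name wastes far fewer positions than padding everything to the model's maximum input length.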

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

In [12]:
import evaluate

accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate the accuracy:

In [13]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
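
For intuition, the sketch below reproduces what `compute_metrics` calculates, using plain Python lists instead of NumPy and the Evaluate library. The logits and labels are invented, and `toy_accuracy` is a hypothetical helper, not part of either library.

```python
def toy_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    # argmax per row: index of the largest logit
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Four invented examples over three classes.
toy_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [0.4, 0.3, 0.2]]
toy_true = [1, 0, 2, 1]
print(toy_accuracy(toy_logits, toy_true))
```

Three of the four rows have their largest logit at the true label's index, so the sketch reports an accuracy of 0.75.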

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.

## Train

<Tip>

If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! Load DistilBERT with [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForSequenceClassification) along with the number of expected labels, and the label mappings:

In [14]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(labels), id2label=id2label, label2id=label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Now we move the model to the GPU if one is available; otherwise it stays on the CPU.
The cell output (truncated here) also shows the architecture of the model.

In [15]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

We will use a small subset of the dataset to reduce computation time.

In [16]:
small_train_dataset = tokenized_dataset["train"].shuffle(seed=42).select(range(100))
small_eval_dataset = tokenized_dataset["test"].shuffle(seed=42).select(range(100))

At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). At the end of each epoch, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) will evaluate the accuracy and save the training checkpoint.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [17]:
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # newer transformers releases name this argument eval_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

print(training_args.device)  # show which device the model will be trained on

trainer.train()

cpu


{'eval_loss': 2.1764607429504395, 'eval_accuracy': 0.35, 'eval_runtime': 0.8548, 'eval_samples_per_second': 116.993, 'eval_steps_per_second': 8.19, 'epoch': 1.0}


100%|██████████| 7/7 [00:10<00:00,  1.55s/it]

{'train_runtime': 10.8156, 'train_samples_per_second': 9.246, 'train_steps_per_second': 0.647, 'train_loss': 2.241724286760603, 'epoch': 1.0}





TrainOutput(global_step=7, training_loss=2.241724286760603, metrics={'train_runtime': 10.8156, 'train_samples_per_second': 9.246, 'train_steps_per_second': 0.647, 'train_loss': 2.241724286760603, 'epoch': 1.0})
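
Two numbers in this log can be sanity-checked by hand. The sketch below assumes a single device and that the final partial batch is kept, so the step count is rounded up. It also assumes 10 classes, inferred from the 10 logits printed in the inference section of this guide.

```python
import math

num_examples = 100   # size of small_train_dataset
batch_size = 16      # per_device_train_batch_size
num_classes = 10     # assumption: inferred from the 10 logits shown at inference time

# Optimizer steps in one epoch, keeping the final partial batch.
steps_per_epoch = math.ceil(num_examples / batch_size)

# Cross-entropy of a model that assigns equal probability to every class.
uniform_loss = math.log(num_classes)

print(steps_per_epoch)
print(uniform_loss)
```

Seven steps matches `global_step=7` in the output. The reported losses (about 2.24 for training and 2.18 for evaluation) sit only slightly below the uniform-guess baseline of ln(10), which suggests that one epoch on 100 examples has barely moved the model; expect better accuracy with more data and more epochs.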

<Tip>

[Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) applies dynamic padding by default when you pass `tokenizer` to it. In this case, you don't need to specify a data collator explicitly.

</Tip>


<Tip>

For a more in-depth example of how to finetune a model for text classification, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb).

</Tip>

## Inference

We will now make a prediction with our fine-tuned DistilBERT model.
We pick a sample clothing name whose category we want to predict and use the tokenizer to encode it.

In [18]:
model.eval()
name = "Cotton short-sleeve shirt"
inputs = tokenizer.encode(name, return_tensors="pt").to(device)

We will now make a prediction with the tokenized input.
The logits are the unnormalized scores output by the network; applying a softmax would convert them into probabilities.

In [19]:
logits = model(inputs).logits
print(logits)

tensor([[-0.0676, -0.1646, -0.0449, -0.1894,  0.1228, -0.0996,  0.1313,  0.4119,
         -0.0779, -0.0822]], grad_fn=<AddmmBackward0>)


We can use `torch.max()` to obtain the index of the largest logit, which is also the label with the highest probability.
Using that index and the `id2label` mapping built earlier, we can get the category predicted for the product.

In [20]:
predictions = torch.max(logits, 1).indices
prediction: str = id2label[predictions.tolist()[0]]
print(f"Predicted: {prediction} for {name}")

Predicted: tops for Cotton short-sleeve shirt
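
The logits printed earlier can be turned into probabilities with a softmax. Here is a small dependency-free sketch using the values copied from that output; inside the notebook, `torch.softmax(logits, dim=1)` would do the same on the tensor directly.

```python
import math

# Logits copied from the printed tensor above.
printed_logits = [-0.0676, -0.1646, -0.0449, -0.1894, 0.1228,
                  -0.0996, 0.1313, 0.4119, -0.0779, -0.0822]

# Softmax: exponentiate (shifted by the max for numerical stability), then normalize.
shift = max(printed_logits)
exps = [math.exp(x - shift) for x in printed_logits]
total = sum(exps)
probs = [e / total for e in exps]

# Index of the most probable class, equivalent to the argmax of the logits.
best = max(range(len(probs)), key=probs.__getitem__)
print(best, probs[best])
```

The winning class holds only around 15% of the probability mass, not far above the 10% of a uniform guess over 10 classes, so this prediction is low-confidence. That is consistent with the very short training run on 100 examples.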
