# Transformers for Classification
In this notebook we will use transformers to classify our disaster tweets.

Make sure this notebook is running on a GPU as this will speed up training. Click "Runtime" -> "Change Runtime Type" and make sure a GPU is selected as the Hardware accelerator.
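If you would rather confirm the GPU from code than from the menu, a minimal sketch is shown here. It assumes PyTorch is the backend (it is installed by default on Colab), and the helper name `gpu_available` is made up for this example.

```python
import importlib.util


def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    # Avoid an ImportError on machines where PyTorch is not installed.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


print("GPU available:", gpu_available())
```

If this prints `False` on Colab, change the runtime type as described above and re-run the cell.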

First, let's make sure we have the right libraries installed:

In [1]:
!pip install simpletransformers

Collecting simpletransformers
  Downloading simpletransformers-0.70.5-py3-none-any.whl.metadata (43 kB)
Collecting seqeval (from simpletransformers)
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting tensorboardx (from simpletransformers)
  Downloading tensorboardx-2.6.4-py3-none-any.whl.metadata (6.2 kB)
Collecting streamlit (from simpletransformers)
  Downloading streamlit-1.51.0-py3-none-any.whl.metadata (9.5 kB)
Collecting pydeck<1,>=0.8.0b4 (from streamlit->simpletransformers)
  Downloading pydeck-0.9.1-py2.py3-none-any.whl.metadata (4.1 kB)
Downloading simpletransformers-

## Simple Transformers

There are lots of libraries you can use for transformer models, but the [Huggingface Transformers](https://github.com/huggingface/transformers/) library is the most widely used.

It is a little complicated, however, so if all you want to do is train a simple model on a dataset, you can use [SimpleTransformers](https://github.com/ThilinaRajapakse/simpletransformers), which hides away a lot of the complexity.

We'll look at the main transformers library later, but here's an example of how you can train a basic model using simpletransformers.

First, we need to load some data. The simple transformers library expects data in a [pandas dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), so let's use pandas to load a CSV file.

The following cell downloads a zip archive containing the CSV files from the web and unpacks it into the `data` folder:

In [2]:
!rm -rf data
!mkdir -p data
!wget https://github.com/ghomasHudson/text-mining-demos-workshop/raw/main/disaster_tweets.zip -O data/data.zip
!unzip -j data/data.zip -d data
!rm data/data.zip

--2025-11-25 09:50:53--  https://github.com/ghomasHudson/text-mining-demos-workshop/raw/main/disaster_tweets.zip
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ghomasHudson/text-mining-demos-workshop/main/disaster_tweets.zip [following]
--2025-11-25 09:50:54--  https://raw.githubusercontent.com/ghomasHudson/text-mining-demos-workshop/main/disaster_tweets.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 410737 (401K) [application/zip]
Saving to: ‘data/data.zip’


2025-11-25 09:50:54 (16.6 MB/s) - ‘data/data.zip’ saved [410737/410737]

Archive:  data/data.zip
  inflating: data/README.md       

Now let's load it into a pandas dataframe:

In [3]:
import pandas as pd
train_df = pd.read_csv("data/train.csv")
eval_df = pd.read_csv("data/test.csv")

train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


Simple transformers requires the dataframe to have columns named "text" and "labels". This is how simpletransformers knows which column holds the input and which holds the class to predict. The [Simple Transformers Documentation](https://simpletransformers.ai/docs/classification-data-formats/) has a page which describes the format you need.

We already have the tweets in the "text" column, so we simply rename "target" to "labels":

In [4]:
train_df = train_df.rename(columns={"target": "labels"})
eval_df = eval_df.rename(columns={"target": "labels"})

train_df.head()

Unnamed: 0,id,keyword,location,text,labels
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
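The rename can also be wrapped in a small helper that checks the result and drops the columns the model does not need. This is only a sketch on a toy dataframe: the helper name `to_simpletransformers_format` and the second toy tweet are invented for illustration and are not part of the library or the dataset.

```python
import pandas as pd

REQUIRED_COLUMNS = ["text", "labels"]


def to_simpletransformers_format(df: pd.DataFrame, label_column: str = "target") -> pd.DataFrame:
    """Rename the label column to "labels" and keep only "text" and "labels"."""
    renamed = df.rename(columns={label_column: "labels"})
    missing = [col for col in REQUIRED_COLUMNS if col not in renamed.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return renamed[REQUIRED_COLUMNS].copy()


# Toy data with the same column layout as the disaster tweets file.
toy_df = pd.DataFrame(
    {
        "id": [1, 4],
        "keyword": [None, None],
        "text": ["Forest fire near La Ronge Sask. Canada", "What a lovely day"],
        "target": [1, 0],
    }
)

clean_df = to_simpletransformers_format(toy_df)
print(clean_df.columns.tolist())  # ['text', 'labels']
```

Failing early with a clear error is usually easier to debug than a problem that only shows up once training has started.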


Now that the data is in the correct format, we can train a model using simple transformers:

In [5]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs
from sklearn.metrics import f1_score, accuracy_score

model_args = ClassificationArgs()
model_args.train_batch_size = 128
model_args.eval_batch_size = 64
model_args.num_train_epochs = 3
model_args.overwrite_output_dir = True


model = ClassificationModel("roberta", "roberta-base", args=model_args)
model.train_model(train_df, eval_df=eval_df, f1=f1_score, accuracy=accuracy_score)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]



  0%|          | 0/15 [00:00<?, ?it/s]

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]



Running Epoch 1 of 3:   0%|          | 0/59 [00:00<?, ?it/s]



Running Epoch 2 of 3:   0%|          | 0/59 [00:00<?, ?it/s]

Running Epoch 3 of 3:   0%|          | 0/59 [00:00<?, ?it/s]

(177, 0.394963606350166)
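The tuple printed at the end of training appears to be the total number of optimisation steps followed by the average training loss (this reading of the return value is an assumption, so check the simpletransformers documentation). The step count can be checked against the progress bars above with a little arithmetic. The function name `steps_per_epoch` is just for this sketch.

```python
import math


def steps_per_epoch(num_rows: int, batch_size: int) -> int:
    """Number of batches needed to see every row once."""
    return math.ceil(num_rows / batch_size)


batch_size = 128      # model_args.train_batch_size
num_epochs = 3        # model_args.num_train_epochs
observed_steps = 59   # from the "Running Epoch" progress bars

# Three epochs of 59 batches each gives the 177 in the output tuple.
total_steps = observed_steps * num_epochs
print(total_steps)  # 177

# Range of training set sizes that would produce exactly 59 batches of 128.
min_rows = (observed_steps - 1) * batch_size + 1
max_rows = observed_steps * batch_size
print(min_rows, max_rows)  # 7425 7552
```

Halving the batch size would roughly double the number of steps per epoch, which is worth remembering when comparing runs with different settings.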

The running loss should go down over time. Let's evaluate the final model:

In [6]:
result, model_outputs, wrong_predictions = model.eval_model(eval_df, f1=f1_score, accuracy=accuracy_score)
result

0it [00:00, ?it/s]

Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]



{'mcc': np.float64(0.9561771561771562),
 'accuracy': 0.9787234042553191,
 'f1_score': 0.9780885780885781,
 'tp': np.int64(54),
 'tn': np.int64(38),
 'fp': np.int64(1),
 'fn': np.int64(1),
 'auroc': np.float64(0.9916083916083915),
 'auprc': np.float64(0.9953764921946741),
 'f1': 0.9818181818181818,
 'eval_loss': 0.08956431224942207}
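Most of these numbers can be recomputed by hand from the four confusion-matrix counts, which is a good way to see what each metric means. The sketch below uses only the standard library. The reading that `f1` is the F1 of the positive (disaster) class while `f1_score` is the macro average over both classes is inferred from the fact that the numbers match, not taken from the documentation.

```python
import math

# Confusion-matrix counts from the evaluation output above.
tp, tn, fp, fn = 54, 38, 1, 1

# Fraction of all predictions that were correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# F1 for each class, treating that class as the "positive" one.
f1_positive = 2 * tp / (2 * tp + fp + fn)
f1_negative = 2 * tn / (2 * tn + fp + fn)

# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = (f1_positive + f1_negative) / 2

# Matthews correlation coefficient.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"accuracy    = {accuracy:.4f}")     # 0.9787
print(f"f1_positive = {f1_positive:.4f}")  # 0.9818
print(f"macro_f1    = {macro_f1:.4f}")     # 0.9781
print(f"mcc         = {mcc:.4f}")          # 0.9562
```

Note that the evaluation set has only 94 tweets, so a single extra mistake moves the accuracy by about one percentage point.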

Let's try out our trained model. Try changing the sentence below to see how the model performs on different inputs:

In [8]:
predictions, _ = model.predict(["My house is burning down!"])

"Disaster" if predictions[0] else "Not Disaster"

0it [00:00, ?it/s]

Predicting:   0%|          | 0/1 [00:00<?, ?it/s]

'Disaster'
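The second value returned by `model.predict` (discarded as `_` in the cell above) holds the raw model outputs. Assuming these are one row of unnormalised scores (logits) per input, a softmax turns them into rough confidence values. The logits below are invented for illustration and are not real model outputs, and `describe` is a hypothetical helper.

```python
import math

LABEL_NAMES = ["Not Disaster", "Disaster"]


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum keeps exp() from overflowing.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe(logits):
    """Return the predicted label name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_NAMES[best], probs[best]


# Hypothetical raw outputs for two sentences (made-up numbers).
raw_outputs = [[-2.1, 2.4], [1.7, -1.3]]
for row in raw_outputs:
    name, confidence = describe(row)
    print(f"{name}: {confidence:.2f}")
```

Treat these probabilities as a rough guide only, since fine-tuned transformers are often overconfident.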

# Additional Exercises

1. Experiment with different training parameter values. Can you improve the performance? See the simpletransformers documentation for a [list of the possible parameters](https://simpletransformers.ai/docs/usage/#configuring-a-simple-transformers-model).
2. Try using a pretrained model: [hkayesh/twitter-disaster-nlp](https://huggingface.co/hkayesh/twitter-disaster-nlp?text=The+woods+are+on+fire)
3. Try other datasets. Either your own or one from the [Huggingface Hub](https://huggingface.co/datasets?task_categories=task_categories:text-classification&sort=trending).
  A good choice is the [Twitter financial news sentiment](https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment) dataset. Work out how to download it and convert to the format required for simple transformers.
4. Experiment with other transformer models. We used roberta in the example above but you should be able to swap it for any in [this list](https://simpletransformers.ai/docs/classification-specifics/#supported-model-types).

If you want a dataset to experiment with, you might want to try the Yelp review polarity dataset:

In [None]:
!wget https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz -O data/data.tgz
!tar -xvzf data/data.tgz -C data/
!mv data/yelp_review_polarity_csv/* data/
!rm -r data/yelp_review_polarity_csv/
!rm data/data.tgz

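Be careful with the labels for this dataset. The CSV files have no header row, so we load them with `header=None`, which leaves the columns numbered 0 (the label) and 1 (the review text):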

In [None]:
train_df = pd.read_csv("data/train.csv", header=None)
eval_df = pd.read_csv("data/test.csv", header=None)

train_df.head()

First, notice that the label column has values 1 and 2. We need them to be 0 and 1 for binary classification. Let's fix that:

In [None]:
train_df[0] = (train_df[0] == 2).astype(int)
eval_df[0] = (eval_df[0] == 2).astype(int)

train_df.head()

*Now* let's rename the columns to "text" and "labels", replacing line breaks with spaces along the way:

In [None]:
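# Replace line breaks with spaces. The pattern covers real newlines and
# literal backslash-n sequences, in case the file stores breaks in escaped form.
newline_pattern = r"\\n|\n"

train_df = pd.DataFrame(
    {"text": train_df[1].replace(newline_pattern, " ", regex=True), "labels": train_df[0]}
)

eval_df = pd.DataFrame(
    {"text": eval_df[1].replace(newline_pattern, " ", regex=True), "labels": eval_df[0]}
)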
train_df.head()

It will take a long time for the model to train with such a large dataset so you might want to cut it down a little:

In [None]:
train_df = train_df.head(4000)
eval_df = eval_df.head(100)
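Taking the first rows with `head` is quick, but if a file happens to be sorted by label the subset could be unbalanced. A random sample with a fixed seed is a common alternative. This is a sketch on toy data, and the helper name `subsample` is invented for the example.

```python
import pandas as pd


def subsample(df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Return up to n randomly chosen rows, reproducibly."""
    n = min(n, len(df))  # sample() raises an error if n exceeds the row count
    return df.sample(n=n, random_state=seed).reset_index(drop=True)


# Toy dataframe sorted by label, the worst case for head().
toy_df = pd.DataFrame(
    {
        "text": [f"review {i}" for i in range(10)],
        "labels": [0] * 5 + [1] * 5,
    }
)

small_df = subsample(toy_df, 4)
print(small_df["labels"].value_counts().to_dict())
```

Fixing `random_state` means you get the same subset every time the cell runs, which keeps experiments comparable.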

Now continue adding the training code...