# Twitter Disaster Tweet Classifier with DistilBERT

The purpose of this model is to determine whether a given Tweet is about a real disaster (war, flood, famine, etc.) or is benign. For example, the Tweet "the sky looks beautifully ablaze tonight" likely does not refer to a real fire.

### Load and Format

In [27]:
import pandas as pd

# Read from CSV
dataset = pd.read_csv("./data/train.csv")

# Drop (potentially) unnecessary columns. These may be useful, but I'm not quite ready to work with missing data.
dataset = dataset.drop(["id", "keyword", "location"], axis=1)
dataset.head(20)

| | text | target |
|---|---|---|
| 0 | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | Just got sent this photo from Ruby #Alaska as ... | 1 |
| 5 | #RockyFire Update => California Hwy. 20 closed... | 1 |
| 6 | #flood #disaster Heavy rain causes flash flood... | 1 |
| 7 | I'm on top of the hill and I can see a fire in... | 1 |
| 8 | There's an emergency evacuation happening now ... | 1 |
| 9 | I'm afraid that the tornado is coming to our a... | 1 |
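
The comment in the cell above defers dealing with missing data. As a minimal sketch of how one might check how sparse `keyword` and `location` are before dropping them, the snippet below runs standard pandas calls on a tiny hand-made DataFrame. The rows are hypothetical and are not taken from `train.csv`; only the column names match the notebook.

```python
import pandas as pd

# Hypothetical stand-in for train.csv (made-up rows, same column names).
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "keyword": [None, "wildfire", None, "flood"],
        "location": [None, None, "Canada", None],
        "text": ["tweet a", "tweet b", "tweet c", "tweet d"],
        "target": [1, 1, 0, 1],
    }
)

# Count missing values per column to see which columns are affected.
missing_counts = toy.isna().sum()
print(missing_counts)

# One option (an assumption, not what this notebook does): keep the columns
# and replace missing entries with an explicit placeholder string.
filled = toy.fillna({"keyword": "unknown", "location": "unknown"})
print(filled)
```

Running `dataset.isna().sum()` on the real data before the `drop` call would show whether these columns are sparse enough to justify discarding them.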


## Fine-Tune Pretrained Model for Inference 

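Below, I use the Hugging Face `transformers` library to fine-tune DistilBERT on the tweets dataset. The starting checkpoint, `distilbert-base-uncased-finetuned-sst-2-english`, is a DistilBERT model that has already been fine-tuned on the SST-2 sentiment task.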

### Load Imports

In [28]:
from transformers import AutoModel, PreTrainedModel, TrainingArguments, Trainer
from transformers.modeling_outputs import SequenceClassifierOutput
from transformers import AutoTokenizer, DataCollatorWithPadding
from datasets import Dataset
from torch import nn
import torch

### Create Dataset

In [29]:
# Init tokenizer for converting text to numbers
model_path = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Pad dynamically at the batch level rather than the dataset level: the data collator
# pads each batch to its own longest sequence instead of to the longest sequence in the
# entire dataset, which saves computation.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Convert the DataFrame to a Dataset, tokenize the text, and split into train and test sets
raw_dataset = Dataset.from_pandas(dataset)
raw_dataset = raw_dataset.rename_column("target", "labels")
raw_dataset = raw_dataset.map(lambda example: tokenizer(example["text"]), batched=True)
raw_dataset = raw_dataset.with_format("torch")
# Hold out 20% of the rows as the test set
formatted_datasets = raw_dataset.train_test_split(test_size=0.2)

# Show Output
formatted_datasets

loading configuration file https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english/resolve/main/config.json from cache at /home/kdobrien/.cache/huggingface/transformers/4e60bb8efad3d4b7dc9969bf204947c185166a0a3cf37ddb6f481a876a3777b5.9f8326d0b7697c7fd57366cdde57032f46bc10e37ae81cb7eb564d66d23ec96b
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "ti

DatasetDict({
    train: Dataset({
        features: ['text', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 6090
    })
    test: Dataset({
        features: ['text', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 1523
    })
})
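
To make the dynamic-padding comment above concrete, here is a rough, library-free sketch of the idea: each batch is padded only to its own longest sequence. The token ids are made up (not real tokenizer output), and `pad_to` and `dynamic_padding` are hypothetical helper names. `DataCollatorWithPadding` does this work inside the `Trainer`, so this is an illustration rather than its actual implementation.

```python
# Hypothetical token-id sequences of different lengths (not real tokenizer output).
sequences = [
    [101, 7, 102],
    [101, 7, 8, 9, 102],
    [101, 102],
    [101, 7, 8, 9, 10, 11, 102],
]
PAD_ID = 0  # matches "pad_token_id": 0 in the DistilBERT config printed above


def pad_to(batch, length, pad_id=PAD_ID):
    """Right-pad every sequence in `batch` to `length` and build attention masks."""
    input_ids = [seq + [pad_id] * (length - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (length - len(seq)) for seq in batch]
    return input_ids, attention_mask


def dynamic_padding(seqs, batch_size):
    """Pad each batch only to the longest sequence inside that batch."""
    batches = []
    for start in range(0, len(seqs), batch_size):
        batch = seqs[start:start + batch_size]
        batches.append(pad_to(batch, max(len(s) for s in batch)))
    return batches


batches = dynamic_padding(sequences, batch_size=2)

# Compare the number of token slots processed under each strategy.
dynamic_tokens = sum(len(row) for ids, _ in batches for row in ids)
static_tokens = max(len(s) for s in sequences) * len(sequences)
print(dynamic_tokens, static_tokens)  # 24 28
```

In this toy case batch-level padding processes 24 token slots instead of 28. The saving grows when a dataset contains a few unusually long sequences.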

In [30]:
class TraitDetectionModel(PreTrainedModel):
    """Pretrained encoder with a linear classification head on the first ([CLS]) token."""

    def __init__(self, encoding_model, num_labels=2):
        super().__init__(config=encoding_model.config)
        self.num_labels = num_labels
        self.encoder = encoding_model
        # Width of the encoder's hidden states ("dim": 768 in the DistilBERT config)
        input_dimension = encoding_model.config.hidden_size
        self.classifier = nn.Linear(input_dimension, num_labels)

    # Note: position_ids is accepted but not passed on to the encoder
    def forward(self, input_ids, attention_mask=None, position_ids=None, head_mask=None, labels=None):
        encoding = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            head_mask=head_mask
        )
        # encoding[0] is the last hidden state with shape (batch, sequence, hidden);
        # take the first token ([CLS]) as the representation of the whole Tweet
        cls_tensor = encoding[0][:, 0, :]
        logits = self.classifier(cls_tensor)

        # Compute the loss only when labels are supplied
        loss = None
        if labels is not None:
            loss_function = nn.CrossEntropyLoss()
            loss = loss_function(logits.view(-1, self.num_labels), labels.view(-1))

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=encoding.hidden_states,
            attentions=encoding.attentions
        )


device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model = TraitDetectionModel(AutoModel.from_pretrained(model_path))
model.to(device)

print(f"Running on Device Type: {device.type}")


Running on Device Type: cpu
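
The `forward` method above maps the first token's hidden state to two logits and, when labels are present, scores them with `nn.CrossEntropyLoss`. As a rough illustration of what that loss computes for a single example (a softmax followed by the negative log-probability of the true class), here is a plain-Python sketch. The logit values are made up, the helper name `cross_entropy` is hypothetical, and label 1 is assumed to mean a real disaster, as the sample rows suggest. PyTorch additionally averages the per-example values over the batch by default.

```python
import math


def cross_entropy(logits, label):
    """Negative log-softmax probability of the true class for one example."""
    # Subtract the max logit for numerical stability; the result is unchanged.
    shifted = [x - max(logits) for x in logits]
    log_sum_exp = math.log(sum(math.exp(x) for x in shifted))
    return log_sum_exp - shifted[label]


# Hypothetical logits ordered as [not disaster, disaster], true label = 1.
uncertain = cross_entropy([0.0, 0.0], label=1)   # equals ln(2): a 50/50 guess
confident = cross_entropy([-2.0, 2.0], label=1)  # small loss: confident and correct
wrong = cross_entropy([2.0, -2.0], label=1)      # large loss: confident but wrong
print(uncertain, confident, wrong)
```

A model that always outputs equal logits scores ln(2), roughly 0.693, per example, which gives a useful reference point when reading training-loss values for a two-class problem.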


In [35]:
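# Write checkpoints to ./test-trainer and train for one epoch; all other settings use the defaults
training_arguments = TrainingArguments("test-trainer", num_train_epochs=1)
trainer = Trainer(
    model,
    training_arguments,
    train_dataset=formatted_datasets["train"],
    eval_dataset=formatted_datasets["test"],
    data_collator=data_collator,  # pads each batch dynamically
    tokenizer=tokenizer
)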

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
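
The `Trainer` above is given an `eval_dataset` but no metric function, so an evaluation run would report mainly the loss. As a sketch of one way to add accuracy, `Trainer` accepts a `compute_metrics` callable that receives the predictions and label ids, which can be unpacked as a pair. The choice of accuracy and the made-up arrays below are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Turn (logits, labels) into an accuracy score."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # pick the higher-scoring class
    return {"accuracy": float((predictions == labels).mean())}


# Stand-alone check with made-up logits and labels (no model required).
fake_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.0, 0.0]])
fake_labels = np.array([0, 1, 0, 0])
metrics = compute_metrics((fake_logits, fake_labels))
print(metrics)  # {'accuracy': 0.75}
```

It could be wired in by passing `compute_metrics=compute_metrics` to `Trainer(...)` and calling `trainer.evaluate()` after training.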


In [36]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `TraitDetectionModel.forward` and have been ignored: text. If text are not expected by `TraitDetectionModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 6090
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 762


Step,Training Loss
500,0.401


Saving model checkpoint to test-trainer/checkpoint-500
Configuration saved in test-trainer/checkpoint-500/config.json
Model weights saved in test-trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-500/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=762, training_loss=0.4089442275640533, metrics={'train_runtime': 1279.482, 'train_samples_per_second': 4.76, 'train_steps_per_second': 0.596, 'total_flos': 77973872171832.0, 'train_loss': 0.4089442275640533, 'epoch': 1.0})
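
The numbers in the training log can be cross-checked with a little arithmetic. The sketch below recomputes the step count and throughput from the values printed above. It assumes a single device and no gradient accumulation, as the log reports.

```python
import math

num_examples = 6090        # "Num examples" in the log
batch_size = 8             # "Instantaneous batch size per device"
num_epochs = 1             # "Num Epochs"
train_runtime = 1279.482   # seconds, from the TrainOutput metrics

# The last batch of the epoch is partial, so round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = round(num_examples * num_epochs / train_runtime, 2)
steps_per_second = round(total_steps / train_runtime, 3)
print(total_steps, samples_per_second, steps_per_second)  # 762 4.76 0.596
```

With 762 steps in total, only one multiple of 500 is reached, which is consistent with the single logged loss value and the single `checkpoint-500` directory under the default 500-step logging and saving interval of `TrainingArguments`.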