<a href="https://colab.research.google.com/github/Cezarrr9/NLP-with-Disaster-Tweets/blob/main/NLP_with_Disaster_Tweets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This version of code is meant to be run outside of kaggle. However, you can find the Kaggle version of the code here: https://www.kaggle.com/code/cezarr/nlp-disaster-tweets-with-huggingface-transformers

# Get the Data

The simplest way to access Kaggle datasets is through the Kaggle API. This can be installed using pip by running the following cell:

In [1]:
!pip install kaggle



Then, in order to use Kaggle API, we need an API key. You can find this by clicking on your profile picture on Kaggle → Account → Create New Token. A json file will be saved in your PC. You need to access it, copy its contents and then paste them in the following cell (between the quotes):

In [2]:
creds = '{"username":"cezarr","key":"d6ae599aaf03fe6cc1b2e1ecb2905cd1"}'

You can run the following cells that will download the datasets to the specified path and then extract them:

In [3]:
import os
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

In [4]:
from pathlib import Path

cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)

In [5]:
path = Path('nlp-getting-started')

In [6]:
if not iskaggle and not path.exists():
    import zipfile,kaggle
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)

In [7]:
!pip install datasets evaluate transformers[sentencepiece]

Collecting datasets
  Using cached datasets-2.14.5-py3-none-any.whl (519 kB)
Collecting evaluate
  Using cached evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting transformers[sentencepiece]
  Using cached transformers-4.33.2-py3-none-any.whl (7.6 MB)
Installing collected packages: transformers, datasets, evaluate
Successfully installed datasets-2.14.5 evaluate-0.4.0 transformers-4.33.2


In [8]:
import pandas as pd
import numpy as np

# We read the data
train_data = pd.read_csv(path/'train.csv')
test_data = pd.read_csv(path/'test.csv')

In [9]:
# Get some basic info
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [10]:
train_data.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [11]:
train_data.describe(include = 'object')

Unnamed: 0,keyword,location,text
count,7552,5080,7613
unique,221,3341,7503
top,fatalities,USA,11-Year-Old Boy Charged With Manslaughter of T...
freq,45,104,10


# Prepare the data

Transformers use a *Dataset* object to store a dataset (really obvious, I know):

In [12]:
from datasets import Dataset

raw_dataset = Dataset.from_pandas(train_data)

In [13]:
raw_dataset

Dataset({
    features: ['id', 'keyword', 'location', 'text', 'target'],
    num_rows: 7613
})

Now, we have a problem. A deep learning model works only with numbers, not with sentences. Thus, we have to tokenize the sentences, meaning that we split them into words and then convert each word into a specific number.

The *AutoTokenizer* will create an appropriate tokenizer for the given model. If you want to experiment with a different one, you can check up HuggingFace Model Hub: https://huggingface.co/models

I also created the function 'tokenize_function', which tokenizes our inputs. For this process to run quickly in parallel on every row in our dataset, I used the 'map' function.

Because we want to store the samples together inside a batch, we have to use a *collate function*. To batch the inputs, our sentences have to be of the same size. That's when padding comes into play. In short, padding is the process that involves bringing the input data to the exact size and shape. We need to define a collate function that can apply the correct amount of padding to the items we want to put in each batch. Fortunately, the HuggingFace Transformers library provides one like this via 'DataCollatorWithPadding'.

In [14]:
from transformers import AutoTokenizer, DataCollatorWithPadding

checkpoint = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
  return tokenizer(example["text"], truncation = True)

tokenized_dataset = raw_dataset.map(tokenize_function, batched = True)
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)

Downloading (…)okenizer_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/579 [00:00<?, ?B/s]

Downloading spm.model:   0%|          | 0.00/2.46M [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Map:   0%|          | 0/7613 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [15]:
tokenized_dataset

Dataset({
    features: ['id', 'keyword', 'location', 'text', 'target', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 7613
})

In [16]:
# We create a test and a training dataset
tokenized_datasets = tokenized_dataset.train_test_split(0.2, seed = 42)

In [17]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'target', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 6090
    })
    test: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'target', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1523
    })
})

We need to define some additional objects before creating our training loop. The first ones are the dataloaders, which will iterate over the batches. However, before that, we have to take care of a few things first:
- We need to remove columns that contain values the model doesn't recognize (like location and keyword columns)
- The column 'target' needs to be renamed into 'labels' (because the model expects the target argument to be named *labels*)
- The datasets must also be formatted so that they return PyTorch tensors instead of lists

In [18]:
tokenized_datasets = tokenized_datasets.remove_columns(["text", "id", "location", "keyword"])
tokenized_datasets = tokenized_datasets.rename_column("target", "labels")
tokenized_datasets.set_format("torch")

Now, our dataloaders can be defined:

In [19]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle = True, batch_size = 8, collate_fn = data_collator
)

eval_dataloader = DataLoader(
    tokenized_datasets["test"], batch_size = 8, collate_fn = data_collator
)

# Model

It's time to instantiate our model:

In [20]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels = 2)

Downloading pytorch_model.bin:   0%|          | 0.00/371M [00:00<?, ?B/s]

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'pooler.dense.bias', 'classifier.weight', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In order to finish our training loop, we need two more things: an optimizer and a learning rate scheduler.

We will use AdamW optimizer, which is the same as Adam but with a little twist for weight decay regularization:

In [21]:
from transformers import AdamW

optimizer = AdamW(model.parameters(), lr = 5e-5)



I implemented a learning rate scheduler identical with the one in the *Trainer* class. The default one is just a linear decay from 5e-5 to 0. The number of epochs used in *Trainer* is three, so we'll keep that.

In [22]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer = optimizer,
    num_warmup_steps = 0,
    num_training_steps = num_training_steps,
)

# Training loop

We want to use a GPU if it is available because the training on a CPU can last a few hours instead of a few minutes. For this purpose, we define a device to put our model and batches on:

In [23]:
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')

Ready for training!

In [24]:
model.train()
for epoch in range(num_epochs):
  for batch in train_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()

    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()

You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


# Evaluation Loop

We will use the f1 score as the metric for our model because this is the one that the competition uses for classifying the predictions.

As we go over the prediction loop, metrics will accumulate batches using the method add_batch(). By accumulating all the batches, we can get the final result using metric.compute():

In [25]:
import evaluate

metric = evaluate.load("f1")

for batch in eval_dataloader:
  batch = {k: v.to(device) for k, v in batch.items()}
  with torch.no_grad():
    outputs = model(**batch)

  logits = outputs.logits
  predictions = torch.argmax(logits, dim = -1)
  metric.add_batch(predictions = predictions, references = batch["labels"])

metric.compute()

Downloading builder script:   0%|          | 0.00/6.77k [00:00<?, ?B/s]

{'f1': 0.7666126418152351}

# Predictions

To make predictions on the test set, we have to apply the same transformations we used on the train set:

In [26]:
test_ds = Dataset.from_pandas(test_data).map(tokenize_function, batched = True)
test_ds = test_ds.remove_columns(["location", "keyword", "id", "text"])

Map:   0%|          | 0/3263 [00:00<?, ? examples/s]

In [27]:
test_ds

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 3263
})

In [28]:
test_dataloader = DataLoader(
    test_ds, batch_size = 8, collate_fn = data_collator
)

We apply the same procedure as for evaluating the model, but now using the unlabeled data (and no metric is involved, of course):

In [29]:
preds = []
for batch in test_dataloader:
  batch = {k: v.to(device) for k, v in batch.items()}
  with torch.no_grad():
    outputs = model(**batch)

  logits = outputs.logits
  predictions = torch.argmax(logits, dim = -1)
  predictions_list = predictions.tolist()
  for element in predictions_list:
    preds.append(element)


In [30]:
submission = pd.read_csv(path/'sample_submission.csv')

In [31]:
submission["target"] = preds

In [32]:
submission.head()

Unnamed: 0,id,target
0,0,0
1,2,1
2,3,1
3,9,1
4,11,1


In [33]:
submission.to_csv("submission.csv", index = False)