# BERT-based Disaster Tweet classifier

In [1]:
# This cell assumes a project structure of: project-root/src/experiments/this_notebook.ipynb
# We append the parent directory to the system path, so now we can import modules from src
# We also create a variable named path which points to the project root.

import sys
from pathlib import Path

sys.path.append("../")  # make modules in the parent directory (src) importable
path = str(Path().resolve().parent.parent)  # project root

print(path)

/data2/Kaggle-Knowledge-Competitions


Train the model, then generate predictions with the trained model. Training is logged with TensorBoard; you can view the logs by running `tensorboard --logdir logs` from the project root.

In [2]:
import pytorch_lightning as pl
from models.tweet_classifier import TweetClassifierModule
from datasets.kaggle_tweets import KaggleTweetsDataModule
from pytorch_lightning.loggers import TensorBoardLogger

logger = TensorBoardLogger(save_dir=path+"/logs", name="lightning-tweet-classifier")

model = TweetClassifierModule(learning_rate=1e-6)

# might have to change batch size and num_workers depending on your hardware
data = KaggleTweetsDataModule(data_dir=path+"/data/kaggle_tweets",
                            batch_size=32,
                            num_workers=4)

trainer = pl.Trainer(default_root_dir=path,
                    max_epochs=5,
                    gpus=1,
                    precision=16,
                    logger=logger,
                    checkpoint_callback=False,
                    log_every_n_steps=50,
                    )
trainer.fit(model, data)

data.setup()
raw_test_predictions = trainer.predict(model, data.test_dataloader())

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Using native 16bit precision.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting: 191it [00:00, ?it/s]

The raw output of `trainer.predict` is a list with one tensor of predictions per batch. We want to concatenate these into a single tensor whose length equals the size of the test set.

In [9]:
import torch

def unpack_predictions(predictions):
    """Concatenate the output of trainer.predict into a single 1D tensor
    of predictions covering the whole data set.
    """
    # predictions is a list of length num_batches; each element is a
    # 1D tensor of length batch_size (the last batch may be shorter).
    # Concatenating along dim 0 gives one tensor the length of the val/test set.
    return torch.cat(predictions, dim=0)

predictions = unpack_predictions(raw_test_predictions)
predictions = predictions.int().detach().cpu().numpy()
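
As a quick sanity check on the unpacking step above, the toy sketch below concatenates a few made-up batch outputs and confirms that the result has one entry per example. The batch sizes and values are hypothetical and only stand in for what `trainer.predict` might return.

```python
import torch

# Hypothetical per-batch outputs: two full batches of 32 and a final partial batch of 7.
toy_batches = [
    torch.ones(32, dtype=torch.long),
    torch.zeros(32, dtype=torch.long),
    torch.ones(7, dtype=torch.long),
]

# Concatenate along dim 0 to get one prediction per example.
toy_flat = torch.cat(toy_batches, dim=0)

# The flattened length equals the sum of the batch lengths.
assert toy_flat.shape[0] == sum(len(b) for b in toy_batches)
print(toy_flat.shape)
```

The same length check can be applied to the real `predictions` array against the number of rows in the test set before writing the submission file.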

Finally, save the predictions in the format Kaggle expects. Once the cell below has written `preds.csv`, you can submit it by running the following command from the project root:
```bash
# submits preds.csv to the nlp-getting-started (disaster tweets) competition
kaggle competitions submit -c nlp-getting-started -f data/kaggle_tweets/preds.csv --message first_submission_with_api
```

In [19]:
import pandas as pd
# reuse the id column from the sample submission; predictions must be in the same row order
df = pd.read_csv(path+"/data/kaggle_tweets/sample_submission.csv")
df["target"] = predictions
df.to_csv(path+"/data/kaggle_tweets/preds.csv", index=False)
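
Before submitting, it can help to check the written file. The sketch below is one possible check, not part of the original pipeline: `check_submission` is a hypothetical helper, and it assumes the competition expects exactly the columns `id` and `target` with `target` equal to 0 or 1. The demo files contain made-up rows, not real predictions.

```python
import csv
import tempfile
from pathlib import Path


def check_submission(csv_path, expected_rows=None):
    """Return a list of problems found; an empty list means the file looks fine.

    Assumes the expected columns are exactly `id` and `target`,
    and that `target` must be 0 or 1.
    """
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames
        rows = list(reader)

    problems = []
    if header != ["id", "target"]:
        problems.append(f"unexpected columns: {header}")
        return problems  # the remaining checks need both columns
    if expected_rows is not None and len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    bad_ids = [row["id"] for row in rows if row["target"] not in {"0", "1"}]
    if bad_ids:
        problems.append(f"non-binary target for ids: {bad_ids}")
    return problems


# Toy demonstration on throwaway files (hypothetical data).
demo_dir = Path(tempfile.mkdtemp())
good_file = demo_dir / "good.csv"
good_file.write_text("id,target\n0,1\n2,0\n3,1\n")
bad_file = demo_dir / "bad.csv"
bad_file.write_text("id,target\n0,1\n2,0.7\n")

good_report = check_submission(good_file, expected_rows=3)
bad_report = check_submission(bad_file, expected_rows=3)
print(good_report)
print(bad_report)
```

In this notebook the helper could be called as `check_submission(path + "/data/kaggle_tweets/preds.csv", expected_rows=len(df))`.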