# Introduction
The goal of this competition is to predict whether a given tweet is announcing a disaster. To do this, we will use an ensemble of the DistilBERT and XLNet models.

# Importing dependencies

In [13]:
import numpy as np
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification, XLNetTokenizer, TFXLNetForSequenceClassification

# Loading the data

In [14]:
train_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test_df = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

In [15]:
train_df

index,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...,...,...,...
7608,10869,,,Two giant cranes holding a bridge collapse int...,1
7609,10870,,,@aria_ahrary @TheTawniest The out of control w...,1
7610,10871,,,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
7611,10872,,,Police investigating after an e-bike collided ...,1


# Loading the tokenizers and models for DistilBERT and XLNet

In [16]:
tokenizer_B = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
max_length = 140

def encode_text(df, tokenizer, max_length):
    return tokenizer.batch_encode_plus(
        df['text'].tolist(),
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='tf'
    )

train_data_B = encode_text(train_df, tokenizer_B, max_length)
test_data_B = encode_text(test_df, tokenizer_B, max_length)

In [17]:
tokenizer_X = XLNetTokenizer.from_pretrained('xlnet-base-cased')
train_data_X = encode_text(train_df, tokenizer_X, max_length)
test_data_X = encode_text(test_df, tokenizer_X, max_length)
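
The value `max_length = 140` above pads every tweet to 140 tokens, which may be more than the data needs. One way to check is to look at the distribution of token counts before encoding. The helper below is a minimal sketch, not part of the original notebook: `suggest_max_length`, its `percentile` argument, and the `special_tokens=2` allowance are all hypothetical, and `tokenize` is assumed to be any callable that turns a string into a list of tokens (for example `tokenizer_B.tokenize`).

```python
import numpy as np

def suggest_max_length(texts, tokenize, percentile=99.0, special_tokens=2):
    """Suggest a padding length that covers `percentile` percent of the texts.

    `tokenize` is any callable mapping a string to a list of tokens.
    `special_tokens` is an assumed allowance for markers such as [CLS]/[SEP].
    """
    # Number of tokens in each text, before special tokens are added.
    lengths = np.array([len(tokenize(text)) for text in texts])
    # Round the chosen percentile up so the suggested length is never too short.
    return int(np.ceil(np.percentile(lengths, percentile))) + special_tokens

# Toy usage with whitespace splitting standing in for a real tokenizer.
sample_texts = [
    "Forest fire near La Ronge Sask. Canada",
    "Just got sent this photo",
]
print(suggest_max_length(sample_texts, str.split, percentile=100))
```

With a real subword tokenizer the counts will be higher than with whitespace splitting, so the suggestion should be computed with the same tokenizer that is used for encoding.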

In [18]:
model_B = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)
model_X = TFXLNetForSequenceClassification.from_pretrained('xlnet-base-cased', num_labels=2)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

In [19]:
batch_size_B = 10
epochs_B = 5
batch_size_X = 32
epochs_X = 3

# Training the models

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    {'input_ids': train_data_B['input_ids'], 'attention_mask': train_data_B['attention_mask']},
    train_df['target'].values
)).shuffle(len(train_df)).batch(batch_size_B)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
model_B.compile(optimizer=optimizer, loss=loss_fn, metrics=['accuracy'])

model_B.fit(train_dataset, epochs=epochs_B)

Epoch 1/5
Epoch 2/5
 38/762 [>.............................] - ETA: 50:46 - loss: 0.2933 - accuracy: 0.8895

In [None]:
train_dataset_X = tf.data.Dataset.from_tensor_slices((
    {'input_ids': train_data_X['input_ids'], 'attention_mask': train_data_X['attention_mask']},
    train_df['target'].values
)).shuffle(len(train_df)).batch(batch_size_X)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
model_X.compile(optimizer=optimizer, loss=loss_fn, metrics=['accuracy'])

model_X.fit(train_dataset_X, epochs=epochs_X)

In [None]:
test_dataset_B = tf.data.Dataset.from_tensor_slices((
    {'input_ids': test_data_B['input_ids'], 'attention_mask': test_data_B['attention_mask']}
)).batch(batch_size_B)

logits1 = model_B.predict(test_dataset_B).logits

# Convert logits to class predictions
predictions1 = tf.argmax(logits1, axis=1).numpy()

In [None]:
test_dataset_X = tf.data.Dataset.from_tensor_slices((
    {'input_ids': test_data_X['input_ids'], 'attention_mask': test_data_X['attention_mask']}
)).batch(batch_size_X)

# Get the prediction logits from the XLNet model
logits2 = model_X.predict(test_dataset_X).logits

# Convert logits to class predictions
predictions2 = tf.argmax(logits2, axis=1).numpy()

# Making the final predictions and then generating the submission file

In [None]:
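# Hard vote: predict disaster (1) only when both models predict it.
# The sum is 2 only when both models output 1; a sum of 0 or 1
# (including any disagreement between the models) maps to 0 (not disaster).
final_predictions = (predictions1 + predictions2) > 1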
final_predictions = final_predictions.astype(int)

submission = pd.DataFrame({'id': test_df['id'], 'target': final_predictions})
submission.to_csv('submission.csv', index=False)
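
The hard vote above labels a tweet as a disaster only when both models agree, so any disagreement falls back to 0. A common alternative is soft voting, which averages the class probabilities of the two models and lets a confident model outweigh an uncertain one. The sketch below is an assumption about how that could look, not what this notebook does; `softmax`, `soft_vote`, and `weight_a` are hypothetical names, and the toy logits stand in for `logits1` and `logits2`.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def soft_vote(logits_a, logits_b, weight_a=0.5):
    """Average the two models' class probabilities and pick the likelier class."""
    probs = weight_a * softmax(logits_a) + (1.0 - weight_a) * softmax(logits_b)
    return probs.argmax(axis=1)

# Toy logits: in the first row model A is confident in class 1 while
# model B leans slightly towards class 0.
toy_logits_a = np.array([[0.2, 3.0], [1.0, 0.9]])
toy_logits_b = np.array([[0.6, 0.4], [2.5, -1.0]])
print(soft_vote(toy_logits_a, toy_logits_b))
```

For the first toy row the hard vote would give 0 because the models disagree, while the soft vote gives 1 because model A is far more confident than model B. Whether this helps on the real test set would need to be checked on held-out data.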

A few things I would like to try in the future:
* Use a larger model. I chose both models here mostly because they are relatively small, and because I wanted to experiment with transformer models on an interesting task.
* Train each model for longer to improve accuracy on the training data. Across the batch sizes I tried, training accuracy never got above roughly 96%. I think longer training would fix this, since the accuracy improved noticeably with each epoch.
* Build an architecture from scratch. This ensemble performed much better than the baseline linear regression model I implemented previously, but I am curious how well a custom model could do without relying on the nearly 100 million parameters in these pre-trained models.
* Work on this problem in languages other than English, and apply data augmentation after collecting tweets in those languages. All of the tweets in this data set are in English, but there are pre-trained models trained on large corpora of non-English text too, and it would be interesting to see whether the model does better on some languages than others!
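
The accuracy figures in the list above are measured on the training data, so they cannot show whether longer training actually generalizes. A held-out validation split would make that visible. The notebook already imports `train_test_split` but never uses it; the sketch below shows one way it could be applied. The helper name `make_holdout_indices`, the 20% validation fraction, and the seed are assumptions, and the toy labels stand in for `train_df['target'].values`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def make_holdout_indices(labels, val_fraction=0.2, seed=42):
    """Split row indices into train/validation sets, preserving class balance."""
    indices = np.arange(len(labels))
    train_idx, val_idx = train_test_split(
        indices,
        test_size=val_fraction,
        stratify=labels,      # keep the disaster / non-disaster ratio similar
        random_state=seed,    # fixed seed so the split is reproducible
    )
    return train_idx, val_idx

# Toy labels standing in for the real target column.
toy_labels = np.array([0] * 6 + [1] * 4)
train_idx, val_idx = make_holdout_indices(toy_labels)
print(len(train_idx), len(val_idx))
```

The resulting index arrays could then be used to select rows from the encoded tensors (for example with `tf.gather`) and the validation portion passed to `fit` through its `validation_data` argument, so that each epoch reports held-out accuracy alongside training accuracy.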