# NLP Disaster Tweets Competition

This notebook reviews the code I used to produce a submission that scored ```0.83450```. The underlying task is text classification. The competition page is here: [Natural Language Processing with Disaster Tweets](https://www.kaggle.com/competitions/nlp-getting-started)

Other resources I used:
 - [Getting Started with Sentiment Analysis](https://huggingface.co/blog/sentiment-analysis-python)
 - [Does BERT need clean data?](https://towardsdatascience.com/part-1-data-cleaning-does-bert-need-clean-data-6a50c9c6e9fd)

## Installing the ```accelerate```, ```transformers``` and ```datasets``` packages


In [None]:
! pip install -U accelerate
! pip install -U transformers
! pip install -U datasets

Collecting accelerate
  Downloading accelerate-0.31.0-py3-none-any.whl (309 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 309.4/309.4 kB 2.9 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.w

## Importing the modules

In [None]:
import numpy as np
import pandas as pd
import regex as re
import torch
import transformers
# load_metric is deprecated in favour of the `evaluate` library and is absent from recent `datasets` releases
from datasets import load_metric
from torch.utils.data import Dataset
from transformers import (
    DataCollatorWithPadding,
    DistilBertTokenizer,
    Trainer,
    TrainingArguments,
)

## Reading the Files

The data files (```train.csv``` and ```test.csv```) can be downloaded from the Kaggle competition page.

In [None]:
train_file = pd.read_csv("train.csv")
test_file = pd.read_csv("test.csv")
train_file.head(10)

,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
5,8,,,#RockyFire Update => California Hwy. 20 closed...,1
6,10,,,#flood #disaster Heavy rain causes flash flood...,1
7,13,,,I'm on top of the hill and I can see a fire in...,1
8,14,,,There's an emergency evacuation happening now ...,1
9,15,,,I'm afraid that the tornado is coming to our a...,1


## Preprocessing the data

In [None]:
train_file_id = train_file['id']
test_file_id = test_file['id']
train_file.drop('id', axis=1, inplace=True)
test_file.drop('id', axis=1, inplace=True)

In [None]:
train_file.drop('location', axis=1, inplace=True)
test_file.drop('location', axis=1, inplace=True)
train_file.head(5)

,keyword,text,target
0,,Our Deeds are the Reason of this #earthquake M...,1
1,,Forest fire near La Ronge Sask. Canada,1
2,,All residents asked to 'shelter in place' are ...,1
3,,"13,000 people receive #wildfires evacuation or...",1
4,,Just got sent this photo from Ruby #Alaska as ...,1


In [None]:
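# count the distinct keyword/target values for each unique text
tf_nondupes = train_file.groupby(['text']).nunique().sort_values(by='target', ascending=False)
# texts that appear with more than one distinct target have conflicting labels
df_dupes = tf_nondupes[tf_nondupes['target'] > 1]
df_dupes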

text,keyword,target
Caution: breathing may be hazardous to your health.,1,2
wowo--=== 12000 Nigerian refugees repatriated from Cameroon,1,2
He came to a land which was engulfed in tribal war and turned it into a land of peace i.e. Madinah. #ProphetMuhammad #islam,1,2
#foodscare #offers2go #NestleIndia slips into loss after #Magginoodle #ban unsafe and hazardous for #humanconsumption,1,2
The Prophet (peace be upon him) said 'Save yourself from Hellfire even if it is by giving half a date in charity.',1,2
To fight bioterrorism sir.,1,2
In #islam saving a person is equal in reward to saving all humans! Islam is the opposite of terrorism!,1,2
#Allah describes piling up #wealth thinking it would last #forever as the description of the people of #Hellfire in Surah Humaza. #Reflect,1,2
RT NotExplained: The only known image of infamous hijacker D.B. Cooper. http://t.co/JlzK2HdeTG,1,2
Hellfire is surrounded by desires so be careful and donÛªt let your desires control you! #Afterlife,1,2
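
To make the check above concrete, here is a minimal sketch on a made-up three-row frame (the ```toy``` DataFrame and its contents are hypothetical, not taken from the competition data). Grouping by ```text``` and counting the distinct ```target``` values flags any tweet that appears with more than one label.

```python
import pandas as pd

# Hypothetical toy data: the same text appears with two different labels.
toy = pd.DataFrame({
    "text": ["fire downtown", "fire downtown", "nice weather"],
    "target": [1, 0, 0],
})

# Number of distinct labels seen for each unique text.
label_counts = toy.groupby("text")["target"].nunique()

# A text with more than one distinct label was annotated inconsistently.
conflicting = label_counts[label_counts > 1].index.tolist()
print(conflicting)
```

Only the texts returned here need a manual decision; duplicates whose labels agree can simply be collapsed to one row.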


In [None]:
# the index of df_dupes holds the conflicting texts themselves
dupe_text_list = list(df_dupes.index)
# manually chosen labels, one per text, in the same order as dupe_text_list
right_labels = [0,0,0,1,0,0,1,0,1,1,1,0,1,1,1,0,0,0]
assert len(dupe_text_list) == len(right_labels)
# keep a single row per text (this also drops duplicates whose labels agree)
train_file = train_file.drop_duplicates(subset=['text'], keep='last').reset_index(drop=True)
# overwrite the label of each previously conflicting text
for text, label in zip(dupe_text_list, right_labels):
    train_file.loc[train_file['text'] == text, 'target'] = label

In [None]:
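def text_clean(x):
    x = x.lower()
    # drop non-ASCII characters (emoji, mis-encoded bytes)
    x = x.encode('ascii', 'ignore').decode()
    # remove URLs
    x = re.sub(r'https?\S+', ' ', x)
    # remove apostrophe suffixes such as 's or 're
    x = re.sub(r'\'\w+', '', x)
    # remove tokens that contain digits
    x = re.sub(r'\w*\d+\w*', '', x)
    # replace isolated punctuation with a space so neighbouring words stay separate
    x = re.sub(r'\s[^\w\s]\s', ' ', x)
    # collapse runs of whitespace
    x = re.sub(r'\s{2,}', ' ', x)

    return x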

In [None]:
train_file['cleaned_text'] = train_file.text.apply(text_clean)
test_file['cleaned_text'] = test_file.text.apply(text_clean)
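
It helps to watch the cleaning act on one string. The sketch below applies regexes modelled on those in ```text_clean``` to a made-up sample tweet, one step at a time; the sample text and the ```steps``` list are illustrative assumptions, and the standard-library ```re``` module is used so the snippet runs on its own.

```python
import re

# Made-up sample, not a row from the competition data.
sample = "13,000 People evacuated! http://t.co/abc123 It's BAD"

# (description, pattern, replacement) triples, applied in order.
steps = [
    ("remove URLs", r"https?\S+", " "),
    ("remove apostrophe suffixes", r"'\w+", ""),
    ("remove tokens containing digits", r"\w*\d+\w*", ""),
    ("collapse whitespace", r"\s{2,}", " "),
]

cleaned = sample.lower()
for description, pattern, replacement in steps:
    cleaned = re.sub(pattern, replacement, cleaned)
    print(f"{description:32} -> {cleaned!r}")
```

Note the side effect on ```13,000```: both digit groups are removed but the comma between them survives, so the cleaned string starts with a stray ```,```. Whether that matters for a subword tokenizer is an open question; the second resource listed at the top discusses how much cleaning BERT-style models need.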

## Preparing the dataset and Training the Model

In [None]:
PRE_TRAINED_MODEL_NAME = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)


In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [None]:
model = transformers.AutoModelForSequenceClassification.from_pretrained(PRE_TRAINED_MODEL_NAME, num_labels=2)
model.to(device)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [None]:
class MyDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

    def __len__(self):
        return len(self.labels)

train_encodings = tokenizer(train_file['cleaned_text'].tolist(), truncation=True, padding=True)
train_labels = train_file['target'].tolist()
train_dataset = MyDataset(train_encodings, train_labels)
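
The training cell in this notebook does not pass an evaluation set to ```Trainer```, so the ```compute_metrics``` function defined further down is never invoked. One possible way to get validation numbers is to hold out part of the training frame first. The sketch below shows only the split, on a hypothetical ```toy``` frame; the 20% fraction and the seed are arbitrary assumptions.

```python
import pandas as pd

# Hypothetical stand-in for train_file with the two columns used for training.
toy = pd.DataFrame({
    "cleaned_text": [f"tweet {i}" for i in range(10)],
    "target": [0, 1] * 5,
})

# Hold out 20% of the rows for validation; the rest is used for training.
val_df = toy.sample(frac=0.2, random_state=42)
train_df = toy.drop(val_df.index)

print(len(train_df), len(val_df))
```

Each part would then be tokenized and wrapped in ```MyDataset``` separately, and the validation dataset passed to ```Trainer``` through its ```eval_dataset``` argument.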

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
def compute_metrics(eval_pred):
    load_accuracy = load_metric("accuracy")
    load_f1 = load_metric("f1")

    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}
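
For reference, both metrics can be computed directly from the logits with NumPy. This is a sketch of what the two numbers measure, not the ```datasets``` implementation; ```accuracy_and_f1``` is a hypothetical helper, the example logits are made up, and F1 is taken with class 1 as the positive class.

```python
import numpy as np

def accuracy_and_f1(logits, labels):
    """Accuracy and binary F1 (positive class = 1) from raw logits."""
    preds = np.argmax(logits, axis=-1)
    labels = np.asarray(labels)
    accuracy = float((preds == labels).mean())
    tp = int(((preds == 1) & (labels == 1)).sum())  # true positives
    fp = int(((preds == 1) & (labels == 0)).sum())  # false positives
    fn = int(((preds == 0) & (labels == 1)).sum())  # false negatives
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}

# Made-up logits for four examples, two columns (class 0, class 1).
logits = np.array([[0.2, 0.8], [0.9, 0.1], [0.3, 0.7], [0.6, 0.4]])
labels = [1, 0, 0, 1]
result = accuracy_and_f1(logits, labels)
print(result)
```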

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="mymodel",
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="epoch"
)

# No eval_dataset is passed, so compute_metrics is not called during this run.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()
model.save_pretrained("mymodel")
tokenizer.save_pretrained("mymodel")

Step,Training Loss
500,0.4471
1000,0.3517


('mymodel/tokenizer_config.json',
 'mymodel/special_tokens_map.json',
 'mymodel/vocab.txt',
 'mymodel/added_tokens.json')

## Getting the predictions and Exporting the file

In [None]:
test_encodings = tokenizer(test_file['cleaned_text'].tolist(), truncation=True, padding=True, return_tensors="pt")
inputs = {key: value.to(device) for key, value in test_encodings.items()}
model.eval()  # make sure dropout is disabled for inference
with torch.no_grad():
    outputs = model(**inputs)
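
The cell above sends the whole test set through the model in a single forward pass, which can exhaust GPU memory on smaller cards. A minimal sketch of the usual alternative, processing the texts in fixed-size batches, is shown here in plain Python. ```predict_in_batches```, ```predict_fn``` and ```dummy_predict``` are hypothetical names, and the batch size of 64 is an arbitrary assumption.

```python
from typing import Callable, List, Sequence

def predict_in_batches(
    texts: Sequence[str],
    predict_fn: Callable[[Sequence[str]], List[int]],
    batch_size: int = 64,
) -> List[int]:
    """Apply predict_fn to consecutive slices of texts and concatenate the results."""
    predictions: List[int] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        predictions.extend(predict_fn(batch))
    return predictions

# Stand-in predictor for illustration only: flags texts that contain "fire".
def dummy_predict(batch: Sequence[str]) -> List[int]:
    return [int("fire" in text) for text in batch]

demo = predict_in_batches(["forest fire", "sunny day", "fire alarm"], dummy_predict, batch_size=2)
print(demo)
```

In this notebook, ```predict_fn``` would tokenize its batch, move the tensors to ```device```, run the model under ```torch.no_grad()``` and return the argmax of the logits as a Python list.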

In [None]:
logits = outputs.logits
# softmax turns the two logits of each row into class probabilities that sum to 1
probabilities = torch.softmax(logits, dim=-1)

predicted_class_id = torch.argmax(probabilities, dim=-1)
predicted_class_id

tensor([1, 1, 1,  ..., 1, 1, 1], device='cuda:0')
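
A note on turning logits into probabilities: with a two-logit classification head, softmax gives a proper distribution over the two classes, whereas an element-wise sigmoid does not. The predicted class is the same either way, because both functions preserve the ordering of the logits within a row. The NumPy sketch below illustrates this on made-up logits; the ```softmax``` helper is written out here for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up logits for two examples, two columns (class 0, class 1).
logits = np.array([[1.5, -0.5], [-2.0, 0.3]])
probs = softmax(logits)
sigmoid = 1.0 / (1.0 + np.exp(-logits))

# Softmax rows sum to 1; element-wise sigmoid rows generally do not.
print(probs.sum(axis=-1), sigmoid.sum(axis=-1))

# The argmax is identical for logits, softmax and sigmoid.
print(np.argmax(logits, axis=-1), np.argmax(probs, axis=-1), np.argmax(sigmoid, axis=-1))
```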

In [None]:
test_file['target'] = predicted_class_id.cpu().numpy()
# build the frame from named columns; pd.DataFrame(a, b) would treat b as the index
sub_df = pd.DataFrame({'id': test_file_id, 'target': test_file['target']})
sub_df.to_csv("submission.csv", index=False)

In [None]:
# export the file in the format Kaggle expects: two columns, id and target

test_df = pd.read_csv("test.csv")
pred_df = pd.read_csv("submission.csv")

# pair the ids from test.csv with the predicted targets, row by row
final_df = pd.DataFrame({
    'id': test_df['id'],
    'target': pred_df['target'].astype(int)
})

# write the final submission without the pandas index
final_df.to_csv("finalSubmission.csv", index=False)
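
Before uploading, it can be worth checking that the file has the shape the competition expects: an ```id``` column and a ```target``` column holding only 0 or 1. The sketch below is one way to do that; ```check_submission``` is a hypothetical helper and the example rows are made up, not real predictions.

```python
import pandas as pd

def check_submission(df: pd.DataFrame, expected_ids) -> None:
    """Raise AssertionError if df does not look like a valid submission."""
    assert list(df.columns) == ["id", "target"], "columns must be exactly id, target"
    assert df["id"].tolist() == list(expected_ids), "ids must match test.csv, in order"
    assert df["target"].isin([0, 1]).all(), "targets must be 0 or 1"

# Made-up example rows.
example = pd.DataFrame({"id": [0, 2, 3], "target": [1, 0, 1]})
check_submission(example, expected_ids=[0, 2, 3])
print("submission looks valid")
```

In the notebook this would be called on the frame read back from ```finalSubmission.csv```, with the ```id``` column of ```test.csv``` as ```expected_ids```.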