# Natural Language Processing with Disaster Tweets

In this competition, you’re challenged to build a machine learning model that predicts which tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand-classified. If this is your first time working on an NLP problem, we've created a quick tutorial to get you up and running.

Things to do differently compared to main.ipynb:
- Use all the columns
- Processing pipeline (lowercasing, stopword removal, punctuation removal, lemmatization, tokenization, and padding)
- Use ML classification algorithms
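
As a rough illustration of the processing pipeline listed above, the sketch below covers lowercasing, URL and punctuation removal, whitespace tokenization, stopword removal and padding, using only the standard library. Lemmatization is left out because it needs a language model such as spaCy's. The names `clean_tweet`, `pad_tokens` and the tiny `STOPWORDS` set are hypothetical and are not taken from main.ipynb.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "and"}

URL_RE = re.compile(r"https?://\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text):
    """Lowercase, drop URLs and punctuation, split on whitespace, remove stopwords."""
    text = URL_RE.sub(" ", text.lower())
    text = text.translate(PUNCT_TABLE)
    return [tok for tok in text.split() if tok not in STOPWORDS]


def pad_tokens(tokens, length, pad_token="<pad>"):
    """Truncate or right-pad a token list to exactly `length` items."""
    return tokens[:length] + [pad_token] * max(0, length - len(tokens))


tokens = clean_tweet("Forest fire near La Ronge Sask. Canada http://t.co/abc")
print(tokens)
print(pad_tokens(tokens, 10))
```

Note that stripping punctuation also removes `#` and `@`, which may carry signal in tweets, so this step is worth comparing with and without.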

In [1]:
import os
import re

import numpy as np
import pandas as pd
import spacy

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.optimizers import Adam

from transformers import BertTokenizer
from transformers import TFBertForSequenceClassification

from datetime import datetime

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toggle between Kaggle input paths (True) and local paths (False)
kaggle_run = False

In [2]:
if kaggle_run:
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(os.path.join(dirname, filename))
    train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
    test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
else:
    train = pd.read_csv('data/train.csv')
    test = pd.read_csv('data/test.csv')
    submission = pd.read_csv('data/sample_submission.csv')

# Exploratory data analysis

## Preprocessing

In [3]:
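def preprocessing(df):
    """Merge keyword, location and text into one 'combined_text' column."""
    # Replace missing values with empty strings so string concatenation works
    df = df.fillna('')

    # Concatenate all text columns into a single input string
    df['combined_text'] = df['keyword'] + ' ' + df['location'] + ' ' + df['text']

    # Drop the columns that are no longer needed; 'target' (if present) is kept
    df = df.drop(['id', 'keyword', 'location', 'text'], axis=1)
    return df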
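
Many rows have no `keyword` or `location`, so the concatenation above produces strings with leading spaces. The BERT tokenizer ignores that whitespace, but other models may not. One possible variation, shown here as a sketch with a hypothetical helper `combine_fields`, joins only the non-empty parts:

```python
def combine_fields(keyword, location, text):
    """Join the non-empty fields with single spaces."""
    parts = (keyword, location, text)
    return " ".join(p.strip() for p in parts if p and p.strip())


# Row without keyword or location: no leading spaces in the result
print(repr(combine_fields("", "", "Just happened a terrible car crash")))

# Row with all three fields present
print(repr(combine_fields("ablaze", "London", "Building on fire")))
```

In the notebook this could be applied row by row, for example with `df.apply(..., axis=1)`, at the cost of being slower than the vectorized string concatenation.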

In [4]:
train = preprocessing(train)

In [5]:
# ----------- Tokenize the data -----------

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the data
tokenized_data = tokenizer(
    train['combined_text'].tolist(),
    padding=True,
    truncation=True,
    return_tensors='tf'
)

# Extract the labels as a plain Python list
labels = train['target'].tolist()

In [10]:
# ----------- Convert the data to a TensorFlow dataset -----------

# Convert tokenized data to a TensorFlow dataset
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(tokenized_data),  # Tokenized input
    labels  # Corresponding labels
))

# Shuffle the whole dataset, then split it into batches of 16
train_dataset = train_dataset.shuffle(len(train)).batch(16)
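
The dataset above uses every labelled row for training, so nothing is held out to check the model before uploading to Kaggle. `train_test_split` is imported but never used; in the notebook, the usual route would probably be `train_test_split(..., stratify=labels)` before tokenizing. The sketch below only illustrates the idea of a stratified hold-out with the standard library. The function name `stratified_split` and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_idx, val_idx) with roughly equal class ratios in both parts."""
    # Group row indices by their class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut off the validation share
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy labels: ten non-disaster tweets (0) and five disaster tweets (1)
toy_labels = [0] * 10 + [1] * 5
train_idx, val_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx))
```

The validation indices could then be used to build a second `tf.data.Dataset` and passed to `model.fit` through its `validation_data` argument.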

In [8]:
# Load BERT model for sequence classification
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Build model

In [14]:
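# Compile the model with a small learning rate; Adam's default (1e-3) is usually too high for fine-tuning BERT
model.compile(
    optimizer=Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)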

# Train the model
model.fit(train_dataset, epochs=10)

Epoch 1/10
 10/476 [..............................] - ETA: 19:19 - loss: 0.7246 - accuracy: 0.4688

KeyboardInterrupt: 

## Prediction on new data 

In [12]:
test_df = preprocessing(test)

# Tokenize new data
test_encodings = tokenizer(
    test_df['combined_text'].tolist(),
    padding=True,
    truncation=True,
    return_tensors='tf'
)

# ----------- Convert the data to a TensorFlow dataset -----------

# Convert tokenized data to a TensorFlow dataset
test_dataset = tf.data.Dataset.from_tensor_slices(dict(test_encodings))

# Batch the dataset
test_dataset = test_dataset.batch(16)

# Make predictions
predictions = model.predict(test_dataset)

# Convert logits to probabilities
probabilities = tf.nn.softmax(predictions.logits, axis=-1)

# Get the predicted class (0 or 1)
predicted_classes = tf.argmax(probabilities, axis=1).numpy()

print(predicted_classes)

[1 1 1 ... 1 1 1]
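
A printout that shows only class 1 may mean the classifier has collapsed to a single class, so it seems worth counting the predicted classes before uploading. The sketch below reproduces the softmax and argmax steps in plain Python on made-up logits (`toy_logits` is invented for the example) and counts the result. In the notebook, the same count could be obtained with `np.bincount(predicted_classes)`.

```python
import math
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits for three tweets, one row per tweet: [class 0, class 1]
toy_logits = [[-0.3, 1.2], [0.1, 0.4], [-1.0, 2.0]]
toy_classes = [argmax(softmax(row)) for row in toy_logits]

# A count with a single key means every tweet got the same label
print(Counter(toy_classes))
```

Softmax does not change which entry is largest, so taking the argmax of the raw logits gives the same classes. The probabilities are only needed if a threshold other than 0.5 is going to be tried.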


## Prepare upload

In [13]:
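chosen_model_name = 'bert_e1'
chosen_model_predictions = predicted_classes

# Timestamp keeps successive submissions from overwriting each other
date_time_str = datetime.now().strftime("%Y%m%d_%H%M%S")

# Re-read the ids because preprocessing() dropped the 'id' column
submission = pd.DataFrame({
    'id': pd.read_csv('data/test.csv')['id'],
    'target': chosen_model_predictions
})

# Make sure the output directory exists before writing
os.makedirs('output', exist_ok=True)
submission.to_csv(f'output/submission_{chosen_model_name}_{date_time_str}.csv', index=False)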

# Conclusion

- Best result so far: 0.75881 on the Kaggle upload. My impression is that the more I preprocess the text, the lower the accuracy gets.
- To do: watch videos about NLP.
- I don't know whether deep learning is the best approach here; it's the only one I know how to do.
- Explore other ML models.
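
To compare models locally, it helps to use the same metric as the leaderboard. Assuming the Kaggle score is the F1 score, as the competition's evaluation page states, the sketch below computes precision, recall and F1 by hand on made-up labels. `binary_f1` is a hypothetical helper; in the notebook, the already imported `f1_score`, `precision_score` and `recall_score` from scikit-learn would do the same job.

```python
def binary_f1(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up ground truth and two made-up prediction vectors
y_true = [1, 0, 1, 1, 0, 0]
mixed_pred = [1, 1, 1, 0, 0, 0]
all_ones_pred = [1, 1, 1, 1, 1, 1]

print(binary_f1(y_true, mixed_pred))
print(binary_f1(y_true, all_ones_pred))
```

On this toy data, predicting 1 for every tweet gets perfect recall but only 0.5 precision, so F1 alone can hide a model that never predicts class 0.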