# Natural Language Processing with Disaster Tweets

## Competition Description
Twitter has become an important communication channel in times of emergency.
The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.

But it’s not always clear whether a person’s words are actually announcing a disaster. For example, an author may use the word “ABLAZE” but mean it metaphorically. This is clear to a human right away, especially with the visual aid, but it’s less clear to a machine.

The goal of this competition is to build a machine learning model that predicts which tweets are about real disasters and which are not. For this purpose, the competition provides a data set of 10,000 tweets that were classified manually.

## Data preparation and initial analysis

In [15]:
import pandas as pd
import re
from transformers import RobertaTokenizer, RobertaModel
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
import nltk
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
import torch
from sklearn.linear_model import LogisticRegression

In [16]:
train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')

In [17]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [18]:
tain_keywords = train.query('keyword.notna()')
tain_keywords['keyword'].unique()

array(['ablaze', 'accident', 'aftershock', 'airplane%20accident',
       'ambulance', 'annihilated', 'annihilation', 'apocalypse',
       'armageddon', 'army', 'arson', 'arsonist', 'attack', 'attacked',
       'avalanche', 'battle', 'bioterror', 'bioterrorism', 'blaze',
       'blazing', 'bleeding', 'blew%20up', 'blight', 'blizzard', 'blood',
       'bloody', 'blown%20up', 'body%20bag', 'body%20bagging',
       'body%20bags', 'bomb', 'bombed', 'bombing', 'bridge%20collapse',
       'buildings%20burning', 'buildings%20on%20fire', 'burned',
       'burning', 'burning%20buildings', 'bush%20fires', 'casualties',
       'casualty', 'catastrophe', 'catastrophic', 'chemical%20emergency',
       'cliff%20fall', 'collapse', 'collapsed', 'collide', 'collided',
       'collision', 'crash', 'crashed', 'crush', 'crushed', 'curfew',
       'cyclone', 'damage', 'danger', 'dead', 'death', 'deaths', 'debris',
       'deluge', 'deluged', 'demolish', 'demolished', 'demolition',
       'derail', 'derailed

In [19]:
tain_keywords.query('keyword=="tsunami"')

Unnamed: 0,id,keyword,location,text,target
6943,9958,tsunami,,I feel so lucky rn,0
6944,9960,tsunami,,So did we have a hurricane tornado tsunami? So...,1
6945,9961,tsunami,in the Word of God,@helene_yancey GodsLove &amp; #thankU my siste...,1
6946,9963,tsunami,in the Word of God,@freefromwolves GodsLove &amp; #thankU brother...,0
6947,9965,tsunami,"Washington, DC",I'm at Baan Thai / Tsunami Sushi in Washington...,0
6948,9967,tsunami,,she keep it wet like tsunami.,0
6949,9971,tsunami,"Louavul, KY",#BBShelli seems pretty sure she's the one that...,0
6950,9972,tsunami,,Crptotech tsunami and banks.\n http://t.co/KHz...,1
6951,9973,tsunami,,#sing #tsunami Beginners #computer tutorial.: ...,0
6952,9974,tsunami,IG : Sincerely_TSUNAMI,It's my senior year I just wanna go all out,0


Most of the missing values in the keyword and location columns cannot be replaced with specific values, for two reasons:

1. The data was filled in manually, so the gaps are most likely not accidental.
2. For some tweets, it is not possible to clearly determine the location or the keyword.
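According to the train.info() output above, keyword is missing in 61 of 7613 rows (about 0.8%) and location in 2533 rows (about 33%). The sketch below shows one way to measure such gaps and keep them explicit rather than guessing values. The toy frame and the "unknown" placeholder are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for train; the values are made up for illustration.
toy = pd.DataFrame({
    "keyword": ["tsunami", None, "ablaze", "crash"],
    "location": [None, None, "Washington, DC", None],
    "text": ["a", "b", "c", "d"],
})

# Share of missing values per column.
missing_share = toy[["keyword", "location"]].isna().mean()
print(missing_share)

# Keep the gaps explicit with a placeholder instead of imputing a guess.
# "unknown" is a hypothetical placeholder value.
filled = toy.fillna({"keyword": "unknown", "location": "unknown"})
print(filled)
```

A placeholder keeps "no value given" available as a signal of its own, which fits the assumption that the gaps are not accidental.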

In [20]:
train['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64

The classes are roughly balanced: about 57% of the tweets are labelled 0 and about 43% are labelled 1.
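As a rough reference point for the F1 scores reported later, the class shares and the F1 of a trivial "always predict 1" baseline can be computed directly from the counts above. This is a small worked example, not part of the original notebook; the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 4342, 1: 3271}, name="target")
shares = counts / counts.sum()
print(shares.round(3))

# Baseline that labels every tweet as a disaster (class 1):
# recall is 1 and precision equals the share of class 1.
precision = shares.loc[1]
recall = 1.0
baseline_f1 = 2 * precision * recall / (precision + recall)
print(round(baseline_f1, 3))
```

Any model worth keeping should score clearly above this baseline on the positive class.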

## Preparation and training of the RoBERTa model

In [21]:
# Load pre-trained RoBERTa tokenizer and model
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
nltk.data.path.append('/kaggle/working/nltk_data/')

nltk.download('wordnet', download_dir='/kaggle/working/nltk_data/')
nltk.download('averaged_perceptron_tagger', download_dir='/kaggle/working/nltk_data/')
nltk.download('stopwords', download_dir='/kaggle/working/nltk_data/')
nltk.download('omw-1.4', download_dir='/kaggle/working/nltk_data/')
nltk.download('punkt', download_dir='/kaggle/working/nltk_data/')
nltk.download('wordnet2022')


! cp -rf /usr/share/nltk_data/corpora/wordnet2022 /usr/share/nltk_data/corpora/wordnet # temp fix for lookup error.

[nltk_data] Downloading package wordnet to
[nltk_data]     /kaggle/working/nltk_data/...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /kaggle/working/nltk_data/...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /kaggle/working/nltk_data/...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /kaggle/working/nltk_data/...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /kaggle/working/nltk_data/...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet2022 to /usr/share/nltk_data...
[nltk_data]   Package wordnet2022 is already up-to-date!


In [23]:
# Functions for lemmatization and text cleaning

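def lemmatize(text):
    """Lemmatize each token of the text using its part-of-speech tag."""
    m = WordNetLemmatizer()
    word_list = word_tokenize(text)
    tagged_words = nltk.pos_tag(word_list)
    lemmatized_words = []

    # Map Penn Treebank tag prefixes to WordNet parts of speech
    for word, tag in tagged_words:
        if tag.startswith('NN'):    # noun
            lemmatized_words.append(m.lemmatize(word, pos='n'))
        elif tag.startswith('VB'):  # verb
            lemmatized_words.append(m.lemmatize(word, pos='v'))
        elif tag.startswith('JJ'):  # adjective
            lemmatized_words.append(m.lemmatize(word, pos='a'))
        elif tag.startswith('R'):   # adverb (also matches the particle tag RP)
            lemmatized_words.append(m.lemmatize(word, pos='r'))
        else:
            # Fall back to the lemmatizer's default part of speech (noun)
            lemmatized_words.append(m.lemmatize(word))

    return ' '.join(lemmatized_words)

def clear_text(text):
    """Keep only Latin letters and apostrophes, then collapse whitespace."""
    text = re.sub(r"[^a-zA-Z']", ' ', text)
    return " ".join(text.split())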

In [24]:
train['text'] = train['text'].apply(clear_text)
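Because clear_text replaces every character other than letters and apostrophes with a space, URLs and @mentions are broken into letter fragments instead of being removed. The sketch below shows that effect on a made-up tweet, together with a hypothetical variant that strips URLs and mentions first. The name clear_text_no_urls and both regular expressions are assumptions for illustration; the original notebook does not use them.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")

def clear_text(text):
    # Same rule as in the notebook: keep letters and apostrophes only.
    text = re.sub(r"[^a-zA-Z']", ' ', text)
    return " ".join(text.split())

def clear_text_no_urls(text):
    # Hypothetical variant: drop URLs and @mentions before the letter filter.
    text = URL_RE.sub(' ', text)
    text = MENTION_RE.sub(' ', text)
    return clear_text(text)

# Made-up example tweet.
sample = "@user Forest fire near La Ronge! http://t.co/abc123 #wildfire"

print(clear_text(sample))
print(clear_text_no_urls(sample))
```

Whether removing these fragments helps the downstream model is an empirical question that would need to be checked with cross-validation.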

In [25]:
train['lemmatize_text'] = train['text'].apply(lemmatize)
train = train.drop(['text'], axis=1)
train.head()

Unnamed: 0,id,keyword,location,target,lemmatize_text
0,1,,,1,Our Deeds be the Reason of this earthquake May...
1,4,,,1,Forest fire near La Ronge Sask Canada
2,5,,,1,All resident ask to 'shelter in place ' be be ...
3,6,,,1,people receive wildfire evacuation order in Ca...
4,7,,,1,Just get send this photo from Ruby Alaska a sm...


In [26]:
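corpus = train['lemmatize_text']
# Apply the same cleaning and lemmatization to the test tweets as to the training tweets
test_corpus = test['text'].apply(clear_text).apply(lemmatize)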

target_train = train['target']

features_train = []
features_test = []

In [27]:
# Convert text to RoBERTa embeddings
for text in corpus:
    tokens = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        embeddings = model(**tokens).last_hidden_state.mean(dim=1)
        features_train.append(embeddings)

features_train = torch.cat(features_train, dim=0)
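The loop above encodes one tweet at a time, so the plain mean over the token dimension contains no padding. If the tweets were encoded in batches to speed things up, padded positions would be averaged in as well. The following is a minimal sketch of mean pooling that respects the attention mask, shown on made-up tensors so that it runs without downloading the model. The function name masked_mean_pool is hypothetical.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padded positions."""
    # (batch, seq_len) -> (batch, seq_len, 1), cast to match the hidden states
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Number of real tokens per sequence; clamp guards against division by zero
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts

# Toy batch: 2 sequences, 3 token positions, hidden size 2 (made-up numbers).
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],
                       [[2.0, 4.0], [4.0, 6.0], [6.0, 8.0]]])
mask = torch.tensor([[1, 1, 0],   # last position is padding
                     [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
naive = hidden.mean(dim=1)  # includes the padded position
print(pooled)
print(naive)
```

In a batched version of the notebook's loop, the inputs would presumably be model(**tokens).last_hidden_state and tokens['attention_mask'].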

In [28]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)

In [29]:
logreg_model = LogisticRegression(max_iter=200, C=5, class_weight='balanced')
logreg_scores = []

for train_index, val_index in kf.split(features_train):
    train_features, val_features = features_train[train_index], features_train[val_index]
    train_target, val_target = target_train[train_index], target_train[val_index]

    logreg_model.fit(train_features, train_target)
    val_predictions = logreg_model.predict(val_features)
    val_f1 = f1_score(val_target, val_predictions)
    logreg_scores.append(val_f1)

logreg_f1 = sum(logreg_scores) / len(logreg_scores)
print('Logistic regression, F1-measure on cross-validation:', logreg_f1)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Logistic regression, F1-measure on cross-validation: 0.774748675773278
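The "TOTAL NO. of ITERATIONS REACHED LIMIT" message means the solver stopped at max_iter=200 before converging, and the message itself suggests scaling the data or raising max_iter. One possible way to do both is sketched below: the scaler is placed inside a pipeline so that it is fitted only on the training part of each fold. This is an illustration on synthetic data from make_classification, not the configuration used in this notebook, and max_iter=1000 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the embedding matrix and labels.
X, y = make_classification(n_samples=500, n_features=32,
                           n_informative=8, random_state=42)

# Scaling happens inside the pipeline, so each fold is scaled
# using statistics from its own training split only.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, C=5, class_weight='balanced'),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring='f1')
print('Mean F1 across folds:', scores.mean())
```

With the notebook's data, X and y would be replaced by the embedding matrix and target_train. Whether this changes the reported F1 is not known from the source.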



## Predicting the results of a target feature on a test sample

In [16]:
for text in test_corpus:
    tokens = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        embeddings = model(**tokens).last_hidden_state.mean(dim=1)
        features_test.append(embeddings)

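features_test = torch.cat(features_test, dim=0)
# After the cross-validation loop the model holds only the last fold's fit,
# so refit it on the full training set before predicting
logreg_model.fit(features_train, target_train)
test_predictions = logreg_model.predict(features_test)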

In [17]:
# Create a DataFrame with the predicted results
submission = pd.DataFrame({'id': test['id'], 'target': test_predictions})

# Save the predictions to a CSV file
submission.to_csv('submission.csv', index=False)

In [18]:
submission

Unnamed: 0,id,target
0,0,1
1,2,1
2,3,1
3,9,1
4,11,1
...,...,...
3258,10861,1
3259,10865,1
3260,10868,1
3261,10874,1
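The preview above shows only the first and last rows, all of which are predicted as class 1, so it may be worth checking the full distribution of predictions before submitting. The following is a small sketch of such a check on a made-up frame. The helper check_submission and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the notebook's submission frame (values are made up).
toy_submission = pd.DataFrame({
    "id": [0, 2, 3, 9, 11],
    "target": [1, 0, 1, 1, 0],
})

def check_submission(df):
    """Run basic format checks and return the share of each predicted class."""
    assert list(df.columns) == ["id", "target"], "unexpected columns"
    assert df["id"].is_unique, "duplicate ids"
    assert df["target"].isin([0, 1]).all(), "targets must be 0 or 1"
    return df["target"].value_counts(normalize=True).sort_index()

pred_shares = check_submission(toy_submission)
print(pred_shares)
```

If the predicted share of class 1 were far from the roughly 43% seen in the training labels, that would be a hint to revisit the preprocessing or the classifier.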
