<a href="https://www.kaggle.com/code/mateotfuentes/xgboost-and-lstm-for-disaster-tweets?scriptVersionId=108596121" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/xgboost2/XGBoostScore.png
/kaggle/input/nlp-getting-started/sample_submission.csv
/kaggle/input/nlp-getting-started/train.csv
/kaggle/input/nlp-getting-started/test.csv


# Importing data 

In this case we have access to a training set, with outputs, and a test set, without outputs, which is used for grading in the competition

In [2]:
train_data = pd.read_csv('../input/nlp-getting-started/train.csv')
train_data = train_data.set_index('id')
test_data  = pd.read_csv('../input/nlp-getting-started/test.csv')
test_data  = test_data.set_index('id') 

# Getting the data ready

Now we can take a look at both the training data and the test data

In [3]:
train_data.head()

id,keyword,location,text,target
1,,,Our Deeds are the Reason of this #earthquake M...,1
4,,,Forest fire near La Ronge Sask. Canada,1
5,,,All residents asked to 'shelter in place' are ...,1
6,,,"13,000 people receive #wildfires evacuation or...",1
7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [4]:
test_data.head()

id,keyword,location,text
0,,,Just happened a terrible car crash
2,,,"Heard about #earthquake is different cities, s..."
3,,,"there is a forest fire at spot pond, geese are..."
9,,,Apocalypse lighting. #Spokane #wildfires
11,,,Typhoon Soudelor kills 28 in China and Taiwan


We can see that there are some missing values in both location and keyword. Let's check how many values are missing in the training data and the test data

In [5]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7613 entries, 1 to 10873
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   keyword   7552 non-null   object
 1   location  5080 non-null   object
 2   text      7613 non-null   object
 3   target    7613 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 297.4+ KB


As we can see, there are no missing values in text or target. The number of missing keyword values is very low (61 out of 7613), so we can fill them with the mode

In [6]:
train_data.keyword = train_data.keyword.fillna(train_data.keyword.mode()[0])
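To make the mode-filling step concrete, here is a minimal sketch on a toy column. The `toy` DataFrame and its values are made up for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy column with two missing entries (illustrative values only)
toy = pd.DataFrame({"keyword": ["fire", "flood", None, "fire", None]})

# mode() returns a Series because ties are possible; take its first entry
most_common = toy["keyword"].mode()[0]

# Replace every missing value with the most frequent keyword
toy["keyword"] = toy["keyword"].fillna(most_common)

print(most_common)
print(toy["keyword"].tolist())
```

With so few missing keywords, this kind of imputation should barely change the keyword distribution.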

However, location has many more missing values (2533 out of 7613, over 33%), so it might be better to just drop this column. First, let's see how many unique values there are

In [7]:
len(train_data.location.unique())

3342

The number is very high: 3342 unique values (one of which is NaN) against 5080 non-missing entries, so about 65% of the filled-in locations are distinct. The only way for this column to provide useful information would be to use a library capable of grouping these locations by region. To keep things simple, we are just going to drop the column

In [8]:
train_data = train_data.drop("location", axis = 1) 
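The ratios quoted for the location column can be re-derived from the counts printed by `info()` and `unique()` above. This is a small arithmetic sketch; the variable names are illustrative, and it relies on the fact that `unique()` reports NaN as one of its values.

```python
# Counts taken from the notebook output above
total_rows = 7613
non_null_locations = 5080
unique_including_nan = 3342   # Series.unique() counts NaN as one value

# Share of rows with no location at all
missing = total_rows - non_null_locations
missing_share = missing / total_rows

# Share of non-missing entries that are distinct from each other
distinct_locations = unique_including_nan - 1
distinct_share = distinct_locations / non_null_locations

print(missing, round(missing_share, 3))
print(distinct_locations, round(distinct_share, 3))
```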

Now we can do the same for the test data

In [9]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3263 entries, 0 to 10875
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   keyword   3237 non-null   object
 1   location  2158 non-null   object
 2   text      3263 non-null   object
dtypes: object(3)
memory usage: 102.0+ KB


Again, the proportion of missing keyword values is very low while the proportion of missing location values is high, so we proceed in the same way

In [10]:
test_data.keyword = test_data.keyword.fillna(test_data.keyword.mode()[0])

In [11]:
test_data = test_data.drop("location", axis = 1) 

***Cleaning text data***

Let's examine the keyword unique values

In [12]:
print("Training data values:")
print(train_data.keyword.unique())

print("Test data values:")
print(test_data.keyword.unique())

Training data values:
['fatalities' 'ablaze' 'accident' 'aftershock' 'airplane%20accident'
 'ambulance' 'annihilated' 'annihilation' 'apocalypse' 'armageddon' 'army'
 'arson' 'arsonist' 'attack' 'attacked' 'avalanche' 'battle' 'bioterror'
 'bioterrorism' 'blaze' 'blazing' 'bleeding' 'blew%20up' 'blight'
 'blizzard' 'blood' 'bloody' 'blown%20up' 'body%20bag' 'body%20bagging'
 'body%20bags' 'bomb' 'bombed' 'bombing' 'bridge%20collapse'
 'buildings%20burning' 'buildings%20on%20fire' 'burned' 'burning'
 'burning%20buildings' 'bush%20fires' 'casualties' 'casualty'
 'catastrophe' 'catastrophic' 'chemical%20emergency' 'cliff%20fall'
 'collapse' 'collapsed' 'collide' 'collided' 'collision' 'crash' 'crashed'
 'crush' 'crushed' 'curfew' 'cyclone' 'damage' 'danger' 'dead' 'death'
 'deaths' 'debris' 'deluge' 'deluged' 'demolish' 'demolished' 'demolition'
 'derail' 'derailed' 'derailment' 'desolate' 'desolation' 'destroy'
 'destroyed' 'destruction' 'detonate' 'detonation' 'devastated'
 'devastation

The "%20" is the URL encoding of a space, so we are going to replace it with a real space. This lets the model relate the individual words of a multi-word keyword: if we turn "airplane%20accident" into "airplane accident", the word "accident" can be recognized on its own

In [13]:
train_data.keyword = [doc.replace ("%20", " ") for doc in train_data.keyword]
test_data.keyword  = [doc.replace ("%20", " ") for doc in test_data.keyword]
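Since "%20" is standard URL percent-encoding, an alternative to a literal string replacement is the standard library's `urllib.parse.unquote`, which would also decode any other escaped characters. The sketch below is not what the notebook does; the `keywords` list is a hand-picked sample from the output above.

```python
from urllib.parse import unquote

# A few keywords copied from the printed training values
keywords = ["airplane%20accident", "buildings%20on%20fire", "ablaze"]

# unquote decodes every percent-escape, not only %20
decoded = [unquote(k) for k in keywords]

print(decoded)
```

For this dataset the two approaches should give the same result, as long as "%20" is the only escape sequence present.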

Let's check whether it has worked

In [14]:
print("Train data values:")
print(train_data.keyword.unique())

print("Test data values:")
print(test_data.keyword.unique())

Train data values:
['fatalities' 'ablaze' 'accident' 'aftershock' 'airplane accident'
 'ambulance' 'annihilated' 'annihilation' 'apocalypse' 'armageddon' 'army'
 'arson' 'arsonist' 'attack' 'attacked' 'avalanche' 'battle' 'bioterror'
 'bioterrorism' 'blaze' 'blazing' 'bleeding' 'blew up' 'blight' 'blizzard'
 'blood' 'bloody' 'blown up' 'body bag' 'body bagging' 'body bags' 'bomb'
 'bombed' 'bombing' 'bridge collapse' 'buildings burning'
 'buildings on fire' 'burned' 'burning' 'burning buildings' 'bush fires'
 'casualties' 'casualty' 'catastrophe' 'catastrophic' 'chemical emergency'
 'cliff fall' 'collapse' 'collapsed' 'collide' 'collided' 'collision'
 'crash' 'crashed' 'crush' 'crushed' 'curfew' 'cyclone' 'damage' 'danger'
 'dead' 'death' 'deaths' 'debris' 'deluge' 'deluged' 'demolish'
 'demolished' 'demolition' 'derail' 'derailed' 'derailment' 'desolate'
 'desolation' 'destroy' 'destroyed' 'destruction' 'detonate' 'detonation'
 'devastated' 'devastation' 'disaster' 'displaced' 'drough

If we examine some of the entries in the "text" column, we can check that they do not have the same problem

In [15]:
for i in range(10): 
    print(train_data.iloc[i,1])

Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
Forest fire near La Ronge Sask. Canada
All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
13,000 people receive #wildfires evacuation orders in California 
Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school 
#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires
#flood #disaster Heavy rain causes flash flooding of streets in Manitou, Colorado Springs areas
I'm on top of the hill and I can see a fire in the woods...
There's an emergency evacuation happening now in the building across the street
I'm afraid that the tornado is coming to our area...


In [16]:
for i in range(10): 
    print(test_data.iloc[i,1])

Just happened a terrible car crash
Heard about #earthquake is different cities, stay safe everyone.
there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all
Apocalypse lighting. #Spokane #wildfires
Typhoon Soudelor kills 28 in China and Taiwan
We're shaking...It's an earthquake
They'd probably still show more life than Arsenal did yesterday, eh? EH?
Hey! How are you?
What a nice hat?
Fuck off!


There does not seem to be any "%20" standing in for spaces, but we can get rid of the hashes, dots, question marks and exclamation marks

In [17]:
def changeSpace (a): 
    train_data.text = [doc.replace (a, " ") for doc in train_data.text]
    test_data.text  = [doc.replace (a, " ") for doc in test_data.text]

changeSpace("#")
changeSpace(".")
changeSpace("?")
changeSpace("!")
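The four replacements can also be done in a single pass over each string with `str.translate`. The following is a sketch of that alternative; `replace_punctuation` is a hypothetical helper name, and the sample tweet is copied from the test output above.

```python
# Translation table mapping each unwanted character to a space
table = str.maketrans({ch: " " for ch in "#.?!"})

def replace_punctuation(text):
    """Replace '#', '.', '?' and '!' with spaces in one pass."""
    return text.translate(table)

sample = "Apocalypse lighting. #Spokane #wildfires"
cleaned = replace_punctuation(sample)

# Optionally collapse the runs of spaces left behind
collapsed = " ".join(cleaned.split())

print(repr(cleaned))
print(repr(collapsed))
```

Note that the default tokenizer of scikit-learn's `CountVectorizer` already ignores punctuation, so this cleaning mostly matters for models that see the raw string.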



Let's now repeat the 

***Lowercase***

In [18]:
train_data.text = [doc.lower() for doc in train_data.text]
test_data.text  = [doc.lower() for doc in test_data.text]

Again, we check that what we have done has worked

In [19]:
train_data.iloc[:10,1]

id
1     our deeds are the reason of this  earthquake m...
4                forest fire near la ronge sask  canada
5     all residents asked to 'shelter in place' are ...
6     13,000 people receive  wildfires evacuation or...
7     just got sent this photo from ruby  alaska as ...
8      rockyfire update => california hwy  20 closed...
10     flood  disaster heavy rain causes flash flood...
13    i'm on top of the hill and i can see a fire in...
14    there's an emergency evacuation happening now ...
15    i'm afraid that the tornado is coming to our a...
Name: text, dtype: object

In [20]:
test_data.iloc[:10, 1]

id
0                    just happened a terrible car crash
2     heard about  earthquake is different cities, s...
3     there is a forest fire at spot pond, geese are...
9              apocalypse lighting   spokane  wildfires
11        typhoon soudelor kills 28 in china and taiwan
12                   we're shaking   it's an earthquake
21    they'd probably still show more life than arse...
22                                    hey  how are you 
27                                     what a nice hat 
29                                            fuck off 
Name: text, dtype: object

# Implementing bag of words on "text"

For our first models we'll only use the variable "text" and leave "keyword" for later

To implement bag of words we'll use CountVectorizer from scikit-learn

In [21]:
from sklearn.feature_extraction.text import CountVectorizer 
vect = CountVectorizer() 

Now, we extract the texts from both training and test data

In [22]:
train_texts = train_data.text
test_texts = test_data.text

We concatenate them into a single array, total_texts, on which we'll fit the bag-of-words vocabulary

In [23]:
total_texts = np.concatenate([train_texts, test_texts])

In [24]:
vect.fit(total_texts)

CountVectorizer()

And now, we can apply bag of words on the texts from the training data and the test data

In [25]:
train_bow = vect.transform(train_data.text) 
test_bow  = vect.transform(test_data.text) 
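To see what the bag-of-words matrix looks like, here is a minimal sketch on two made-up sentences. The names `docs`, `toy_vect` and `toy_bow` are illustrative and separate from the notebook's variables.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents (not from the dataset)
docs = ["forest fire near the lake", "fire fire everywhere"]

toy_vect = CountVectorizer()
toy_bow = toy_vect.fit_transform(docs)   # sparse matrix: one row per document

# Columns follow the alphabetically sorted vocabulary
vocab = sorted(toy_vect.vocabulary_)
print(vocab)
print(toy_bow.toarray())
```

Each row counts how often every vocabulary word occurs in that document; word order is discarded.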

And now we are ready to define the target (y) and the data we'll use for making predictions (X, the bag-of-words representation of the texts)

In [26]:
X = train_bow
y = train_data.target 

To train and evaluate our models we split the data into training and validation sets:

In [27]:
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size = 0.2) 
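The split above is random, so the validation scores will differ slightly from run to run. As a sketch of one possible refinement (an assumption, not what the notebook does), `random_state` fixes the shuffle and `stratify` keeps the class balance the same in both parts. The toy arrays below are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 12 of class 0 and 8 of class 1
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 12 + [1] * 8)

# random_state makes the split repeatable; stratify preserves the 60/40 class ratio
tr_X, va_X, tr_y, va_y = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

print(len(tr_y), len(va_y), int(va_y.sum()))
```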

# Tree-based model: XGBoost

The first model we are going to use is XGBoost, and we can jump straight in. We train the XGBoost model with several learning rates and compare the results: the code below first trains XGBoost with a learning rate of 0.01, then keeps increasing the learning rate and compares each model's validation error with the best one so far

In [28]:
import xgboost as xgb 
import sklearn.metrics

def geterror(real, pred): 
    # log_loss takes the true labels first and the predictions second
    return sklearn.metrics.log_loss(real, pred) 

# Baseline: smallest learning rate
xgb_p = xgb.XGBClassifier(n_estimators = 4000, early_stopping_rounds = 50, learning_rate = 0.01)
xgb_p.fit(train_X, train_y, 
         eval_set = [(train_X, train_y), (val_X,val_y)], 
         verbose = 100) 
pprediction = xgb_p.predict(val_X)   # hard 0/1 labels, not probabilities
blr = 0.01                           # best learning rate found so far
lrates = [0.02, 0.05, 0.1, 0.2, 0.25, 0.5]
t_error = geterror(val_y, pprediction) 

# Try larger learning rates and keep the one with the lowest validation error
for rate in lrates: 
    print("New model begins:")
    xgb_p = xgb.XGBClassifier(n_estimators = 4000, early_stopping_rounds = 50, learning_rate = rate) 
    xgb_p.fit(train_X, train_y, 
             eval_set = [(train_X, train_y), (val_X, val_y)], 
             verbose = 100)
    pprediction = xgb_p.predict(val_X)
    error = geterror(val_y, pprediction) 
    if error < t_error: 
        t_error = error 
        blr = rate




[0]	validation_0-logloss:0.69115	validation_1-logloss:0.69152
[100]	validation_0-logloss:0.58939	validation_1-logloss:0.61858
[200]	validation_0-logloss:0.55130	validation_1-logloss:0.59081
[300]	validation_0-logloss:0.52796	validation_1-logloss:0.57339
[400]	validation_0-logloss:0.51218	validation_1-logloss:0.56164
[500]	validation_0-logloss:0.49954	validation_1-logloss:0.55230
[600]	validation_0-logloss:0.48867	validation_1-logloss:0.54462
[700]	validation_0-logloss:0.47911	validation_1-logloss:0.53767
[800]	validation_0-logloss:0.47065	validation_1-logloss:0.53141
[900]	validation_0-logloss:0.46286	validation_1-logloss:0.52584
[1000]	validation_0-logloss:0.45567	validation_1-logloss:0.52112
[1100]	validation_0-logloss:0.44901	validation_1-logloss:0.51652
[1200]	validation_0-logloss:0.44265	validation_1-logloss:0.51248
[1300]	validation_0-logloss:0.43667	validation_1-logloss:0.50894
[1400]	validation_0-logloss:0.43117	validation_1-logloss:0.50557
[1500]	validation_0-logloss:0.42599	v
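Because the comparison above scores hard 0/1 predictions with log loss, it may help to see how that metric behaves on labels versus probabilities. The following is a minimal pure-Python sketch; `binary_log_loss`, the clipping constant and the toy numbers are illustrative assumptions rather than scikit-learn's exact implementation.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy, clipping predictions away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4]                  # probabilistic predictions
hard = [1 if p > 0.5 else 0 for p in probs]   # the same predictions as labels

loss_probs = binary_log_loss(y_true, probs)
loss_hard = binary_log_loss(y_true, hard)                  # every label is correct
loss_one_wrong = binary_log_loss(y_true, [1, 0, 0, 0])     # one label is wrong

print(loss_probs, loss_hard, loss_one_wrong)
```

With hard labels every mistake costs the same large penalty, so the score effectively ranks models by misclassification rate. Passing probabilities (for example from `predict_proba`) would instead reward well-calibrated confidence.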

Now we check the best learning rate

In [29]:
blr

0.1

And train the model again with that learning rate

In [30]:
xgb = xgb.XGBClassifier(n_estimators = 4000, early_stopping_rounds = 50, learning_rate = blr) 
xgb.fit(train_X, train_y, 
             eval_set = [(train_X, train_y), (val_X, val_y)], 
             verbose = 100)

[0]	validation_0-logloss:0.67405	validation_1-logloss:0.67770
[100]	validation_0-logloss:0.45328	validation_1-logloss:0.52073
[200]	validation_0-logloss:0.40136	validation_1-logloss:0.49063
[300]	validation_0-logloss:0.36709	validation_1-logloss:0.47464
[400]	validation_0-logloss:0.34159	validation_1-logloss:0.46807
[500]	validation_0-logloss:0.32189	validation_1-logloss:0.46340
[600]	validation_0-logloss:0.30548	validation_1-logloss:0.46023
[682]	validation_0-logloss:0.29318	validation_1-logloss:0.45971


XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=50, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.1, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=4000,
              n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)

Finally, we can apply that model to make predictions on the test data set

In [31]:
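# Use the model retrained with the best learning rate (stored in `xgb` above),
# not `xgb_p`, which still holds the last model from the search loop
prediction = xgb.predict(test_bow)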

To submit our prediction we load the sample submission and introduce our predictions

In [32]:
sample_submission = pd.read_csv('../input/nlp-getting-started/sample_submission.csv')

In [33]:
sample_submission.target = prediction

In [34]:
sample_submission.to_csv("submission.csv", index = False) 

Uploading this file to the competition gives us a score of 0.79, measured with the F1 score. (It would have been great to use F1 as the training objective for XGBoost, but since F1 is not differentiable, that option was not possible)
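For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it by hand on made-up labels; `f1_from_labels` is a hypothetical helper, and for binary labels `sklearn.metrics.f1_score` should return the same value.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0   # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels: 2 true positives, 1 false positive, 1 false negative
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

score = f1_from_labels(y_true, y_pred)
print(score)
```

Because the score only depends on counts of hard decisions, small changes in the model's outputs usually leave it unchanged, which is why it cannot be used directly as a gradient-based training loss.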

# LSTM

The next model we are going to use is an LSTM network

In [35]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [36]:
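voc_size = 5000
# one_hot hashes each word to an integer index below voc_size (it does not build one-hot vectors)
one_hot_rep = [one_hot(words, voc_size) for words in train_texts]
print(one_hot_rep[1])
one_hot_rep_Test = [one_hot(words, voc_size) for words in test_texts]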

[2072, 3533, 4006, 1290, 2708, 1824, 978]
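Despite its name, this encoding is a hashing trick: each word is mapped to an integer, and different words can collide on the same index. The sketch below illustrates the idea with a stable MD5 hash; `hashed_indices` is a simplified stand-in written for illustration, not Keras's implementation, so the indices will not match the output above.

```python
import hashlib

def hashed_indices(text, voc_size):
    """Map each whitespace-separated word to an index in [1, voc_size - 1]."""
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is kept free so it can be used for padding later
        indices.append(int(digest, 16) % (voc_size - 1) + 1)
    return indices

ids = hashed_indices("forest fire near forest", 5000)
print(ids)
```

The same word always receives the same index, which is all the embedding layer needs; the price is that a small `voc_size` increases the chance of collisions.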


In [37]:
sentence_length=25    # here we are specifying a sentence length so that every sentence is the same length and our neural network can handle all the data

embedded_docs = pad_sequences(one_hot_rep,padding='pre',maxlen=sentence_length)
embedded_docs_test=pad_sequences(one_hot_rep_Test,padding='pre',maxlen=sentence_length)
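With `padding='pre'` and the default pre-truncation, short sequences are left-padded with zeros and long ones keep only their last `maxlen` entries. A pure-Python sketch of that behaviour follows; `pad_pre` is a hypothetical helper, not the Keras function.

```python
def pad_pre(seq, maxlen):
    """Keep the last `maxlen` items of `seq` and left-pad with zeros."""
    trimmed = seq[-maxlen:]
    return [0] * (maxlen - len(trimmed)) + trimmed

short = pad_pre([5, 8, 2], 5)                   # shorter than maxlen: zeros in front
long_seq = pad_pre([1, 2, 3, 4, 5, 6, 7], 5)    # longer than maxlen: front is cut off

print(short)
print(long_seq)
```

Padding at the front places the real words at the end of the sequence, right before the LSTM produces its final state.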

Again, we check how this works

In [38]:
embedded_docs[:5]

array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0, 3051,  363, 4311, 1085, 3940, 1571,  445, 2793, 4452, 2216,
        1670, 2981,  814],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0, 2072, 3533, 4006, 1290,
        2708, 1824,  978],
       [   0,    0,    0,  814, 1177, 2436, 1995,  261, 4966, 1610, 4311,
        3244, 3370, 4271, 3570, 4681, 2477, 4334, 1821, 4070, 4966, 4332,
         438, 4311,  406],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0, 4354,  778,  846, 4796,  741, 4334,
         438, 4966, 4505],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0, 3592,  992,
        1422,  445,  577, 1235, 4104, 3007, 1723, 4576, 1235,  741,   78,
        1105, 2667, 1166]], dtype=int32)

Now we import all the needed tensorflow layers

In [39]:
import tensorflow as tf
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout

And create our model: 

In [40]:
embedding_vector_features=40
model=Sequential()
model.add(Embedding(voc_size,embedding_vector_features,input_length=sentence_length))
model.add(Dropout(0.3))
model.add(LSTM(200))
model.add(Dropout(0.3))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()

2022-10-19 23:41:44.508805: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 25, 40)            200000    
_________________________________________________________________
dropout (Dropout)            (None, 25, 40)            0         
_________________________________________________________________
lstm (LSTM)                  (None, 200)               192800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense (Dense)                (None, 1)                 201       
Total params: 393,001
Trainable params: 393,001
Non-trainable params: 0
_________________________________________________________________
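The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer; the variable names are illustrative.

```python
voc_size = 5000
embedding_dim = 40
lstm_units = 200

# One embedding vector per vocabulary index
embedding_params = voc_size * embedding_dim

# Four gates, each with input weights, recurrent weights and a bias vector
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# One output unit: a weight per LSTM unit plus a bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```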


Now we convert our inputs and outputs to NumPy arrays so that Keras receives plain arrays for training

In [41]:
z = np.array(embedded_docs) #input
y = np.array(y)             #output

We now split into train and validation data


In [42]:
x_train, x_val, y_train, y_val = train_test_split(z, y, test_size=0.2, random_state=42)

In [43]:
res = model.fit(x_train,y_train,validation_data=(x_val,y_val),epochs=20,batch_size=64)

2022-10-19 23:41:45.292876: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [44]:
x_test=np.array(embedded_docs_test)

In [45]:
prediction = model.predict(x_test)

In [46]:
prediction

array([[0.03017414],
       [0.9104457 ],
       [1.        ],
       ...,
       [0.99999964],
       [0.0273563 ],
       [0.9751467 ]], dtype=float32)

In [47]:
def to_one (x):
    if x>0.5:
        return 1
    else: 
        return 0

In [48]:
sample_submission = pd.read_csv("../input/nlp-getting-started/sample_submission.csv")

In [49]:
sample_submission.target = prediction.ravel()  # flatten the (n, 1) array of probabilities to 1-D

In [50]:
sample_submission

,id,target
0,0,0.030174
1,2,0.910446
2,3,1.000000
3,9,0.876992
4,11,0.999994
...,...,...
3258,10861,0.000041
3259,10865,0.406094
3260,10868,1.000000
3261,10874,0.027356


Now we want the target to be only 1's and 0's

In [51]:
sample_submission.target = sample_submission.target.map(lambda x : to_one(x))
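The same thresholding can be done on the whole array at once with NumPy instead of mapping a function over each row. This is a sketch with made-up probabilities; `probs` and `labels` are illustrative names.

```python
import numpy as np

# Toy sigmoid outputs with the same (n, 1) shape that model.predict returns
probs = np.array([[0.03], [0.91], [0.5], [0.75]])

# Flatten to 1-D, compare against the threshold, and cast booleans to 0/1
labels = (probs.ravel() > 0.5).astype(int)

print(labels)
```

As in `to_one`, a probability of exactly 0.5 is mapped to 0 because the comparison is strict.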

In [52]:
sample_submission.to_csv("mysubmission.csv", index = False)