This notebook is my take on the Kaggle Natural Language Processing (NLP) competition to predict which tweets are about real disasters and which ones aren't. I'm joining this competition while following a Coursera course on Deep Learning, so I am restricted to using a Recurrent Neural Network as the classifier.

First I will explore the available data so that I can form a master plan for achieving a high score in the competition.

In [2]:
import pandas as pd

df_train = pd.read_csv('input/train.csv')
df_train.sample(10)


Unnamed: 0,id,keyword,location,text,target
3397,4864,explode,|IG: imaginedragoner,If Ryan doesn't release new music soon I might...,0
5558,7933,rainstorm,,Robot_Rainstorm: We have two vacancies on the ...,0
824,1199,blizzard,,peanut butter cookie dough blizzard is ???????...,0
248,353,annihilation,Subconscious LA,World Annihilation vs Self Transformation http...,0
39,57,ablaze,Paranaque City,Ablaze for you Lord :D,0
2876,4133,drought,,U.S. in record hurricane drought: The United S...,1
516,745,attacked,"Oslo, Norway",Christian Attacked by Muslims at the Temple Mo...,1
4910,6990,massacre,London,@MartynWaites It's a well-known fact that the ...,1
2007,2883,damage,Somewhere in the Canada,Nine inmates charged with causing damage in Ca...,1
904,1307,bloody,Glasgow,I'm awful at painting.. why did I agree to do ...,0


In [3]:
df_test = pd.read_csv('input/test.csv')
df_test.sample(10)

Unnamed: 0,id,keyword,location,text
56,186,aftershock,United Kingdom,Bo2 had by far the best competitive maps imo h...
1368,4502,emergency,,11000 SEEDS 30 VEGETABLE FRUIT VARIETY GARDEN ...
1465,4862,explode,,my damn head feel like it's gone explode ??
2617,8749,siren,London,@ryan_defreitas for me it's Revs Per Minute bu...
207,673,attack,,Suspect in latest US theatre attack had psycho...
123,395,annihilation,,U.S National Park Services Tonto National Fore...
979,3238,deluged,,Businesses are deluged with ivoices. Make your...
1566,5277,fear,Windsor ON Canada,...@jeremycorbyn must be willing to fight and ...
1763,5967,hazard,,@fplhints hazard depay ozil Ritchie .\nShould ...
2560,8544,screams,,-mom screams from kitchen- \n'WHERES MY AVOCAD...


Great, the training data contains 5 columns: id, keyword, location, text and target. The test file (needed for the final submission) lacks only the target column, as expected. Let's explore the columns to see which ones we can use.

In [4]:
df_train.isna().sum()

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

In [5]:
df_test.isna().sum()

id             0
keyword       26
location    1105
text           0
dtype: int64

In [6]:
df_train.shape

(7613, 5)

In [7]:
df_test.shape

(3263, 4)

So, roughly 1/3 of the entries do not contain a location (2533 of 7613 in the training set, 1105 of 3263 in the test set), which makes this column unsuitable as input for a classifier. The keyword is rarely missing (61 training rows and 26 test rows), but we have to see whether it is meaningful enough to keep as a feature (and drop the rows with missing keywords).
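The fractions quoted above can also be computed directly instead of being read off the raw counts. The following is a minimal sketch, assuming a helper named `missing_fraction` (hypothetical, not part of the original notebook) and a toy frame standing in for `df_train`:

```python
import pandas as pd

def missing_fraction(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column (0.0 to 1.0)."""
    # isna() gives a boolean frame; the mean of booleans is the share of True values
    return df.isna().mean()

# Toy data for illustration only; in the notebook this would be df_train
toy = pd.DataFrame({
    "keyword": ["ablaze", None, "wreck"],
    "location": [None, None, "London"],
    "text": ["a", "b", "c"],
})

fractions = missing_fraction(toy)
print(fractions)
```

Calling `missing_fraction(df_train)` and `missing_fraction(df_test)` would put both datasets on the same scale, which makes them easier to compare than the absolute counts.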

In [8]:
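# Note: unique() counts NaN as a value, so the keyword and location counts each include one NaN entry
print('Total number of entries={}, Unique ids={}, unique keywords={}, unique locations={}'.format(
    len(df_train), len(df_train.id.unique()), len(df_train.keyword.unique()), len(df_train.location.unique())))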

Total number of entries=7613, Unique ids=7613, unique keywords=222, unique locations=3342


In [9]:
df_train[['keyword','target']].groupby('keyword').value_counts()

keyword     target
ablaze      0         23
            1         13
accident    1         24
            0         11
aftershock  0         34
                      ..
wreck       0         30
            1          7
wreckage    1         39
wrecked     0         36
            1          3
Length: 438, dtype: int64

As we can see, using the keywords alone would not lead to accurate results: many keywords, such as ablaze and wreck, occur with both target values. Therefore, we'll inspect the text field further to see how we can use its information.
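To quantify how ambiguous each keyword is, the counts can be condensed into a single disaster rate per keyword. This is a rough sketch under the assumption that `target == 1` marks a disaster tweet; the function name `keyword_disaster_rate` and the toy data are made up for illustration:

```python
import pandas as pd

def keyword_disaster_rate(df: pd.DataFrame) -> pd.DataFrame:
    """Per keyword: number of tweets and share of tweets with target == 1."""
    stats = df.groupby("keyword")["target"].agg(count="count", disaster_rate="mean")
    # Keywords with a rate close to 0 or 1 are informative, rates near 0.5 are not
    return stats.sort_values("disaster_rate", ascending=False)

# Toy data for illustration only; in the notebook this would be df_train
toy = pd.DataFrame({
    "keyword": ["wreckage", "wreckage", "ablaze", "ablaze", "ablaze", "ablaze"],
    "target": [1, 1, 0, 0, 1, 0],
})

rates = keyword_disaster_rate(toy)
print(rates)
```

A keyword that only ever occurs with one target value (like wreckage in the output above) gets a rate of exactly 0 or 1, while mixed keywords such as ablaze end up somewhere in between.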

In [10]:
df_train['text'].sample(50)

7260    Weather forecast for Thailand  A Whirlwind is ...
5875    @CTAZtrophe31 Everything must be OK because sh...
1827             Pak Army Helicopter crashed in Mansehra.
5520    Hm MT @Ebolatrends: Alabama Home Quarantined O...
3051    Contruction upgrading ferries to earthquake st...
5248    Refugio oil spill may have been costlier bigge...
7147    The Architect Behind Kanye WestÛªs Volcano ht...
2056    investigate why Robert mueller didn't respond ...
6276    New item: Pillow Covers ANY SIZE Pillow Cover ...
2038    Permits for bear hunting in danger of outnumbe...
5616    This just-married Turkish couple gave 4000 Syr...
4394    #hot  Funtenna: hijacking computers to send da...
3059    @DArchambau THX for your great encouragement a...
5414    My dad is panicking as my weight loss means he...
3521    How ÛÏLittle BoyÛ Affected the People In Hi...
6580    Violators of the new improved Reddit will be s...
6874    What happens to us as sexual trauma #survivors...
3478    ITS A 

From the text we can derive a few steps that would make it more suitable as input for classification:
- Remove (or extract) mentions like @abysmaljoiner
- Do the same for hashtags, which we could also treat as keywords
- Remove special characters like "ÛÏÛ"

Let's start with extracting the mentions and hashtags to see if it can help with classification.

In [16]:
import re
import numpy as np

mention_str = '@([^ ]+)'
hashtag_str = '#([^ ]+)'
special_chars = '[^a-zA-Z]'

def extract_matches(sentence,expr) :
    
    return re.findall(expr,sentence)

df_train['mentions'] = df_train['text'].apply(lambda string: extract_matches(string, mention_str)).apply(lambda y: np.nan if len(y)==0 else y)

df_train['hashtags'] = df_train['text'].apply(lambda string: extract_matches(string, hashtag_str)).apply(lambda y: np.nan if len(y)==0 else y)
df_train.sample(50)

Unnamed: 0,id,keyword,location,text,target,mentions,hashtags
6227,8888,smoke,Indonesia,@TeamAtoWinner no.. i mean when is mino said t...,0,[TeamAtoWinner],
540,786,avalanche,Buy Give Me My Money,@funkflex yo flex im here https://t.co/2AZxdLCXgA,0,[funkflex],
4891,6963,massacre,,@eileenmfl are you serious?,0,[eileenmfl],
519,751,avalanche,guaravitas,we'll crash down like an avalanche,0,,
2090,3004,death,ATL??AL??,I Hate To Talking Otp With My Grandma... I Mea...,0,,
3456,4944,exploded,Jamaica,@ItsNasB now I have to go replace my sarcasm m...,0,[ItsNasB],
3294,4722,evacuate,17-Feb,Okay I need all of you to evacuate the house s...,0,,
2338,3364,demolition,"Murray Hill, New Jersey",Remaining Sections Of Greystone Psychiatric Ho...,0,,
6976,10006,tsunami,,@Eric_Tsunami worry about yourself,0,[Eric_Tsunami],
2295,3292,demolish,,I have completed the quest 'Demolish 5 Murlo.....,0,,"[Android, androidgames, gameinsight]"


In [17]:
df_train.isna().sum()

id             0
keyword       61
location    2533
text           0
target         0
mentions    5595
hashtags    5857
dtype: int64

In [32]:
df_train[df_train['keyword'].isna()]['hashtags'].notna().sum()

21

Unfortunately, most entries have no mentions or hashtags (5595 and 5857 of 7613 are NaN, respectively), and hashtags could supply a keyword for only 21 of the 61 entries that lack one, which I don't deem high enough to risk added bias. Therefore, instead of using them as features, we will remove the mentions and replace the special characters in the text.

In [36]:

def replace_matches(sentence,expr,replacement) :
    
    return re.sub(expr,replacement,sentence)

# remove mentions
df_train['cleantext'] = df_train['text'].apply(lambda string: replace_matches(string, mention_str, ' ')).apply(lambda y: np.nan if len(y)==0 else y)

# replace special characters
df_train['cleantext'] = df_train['cleantext'].apply(lambda string: replace_matches(string, special_chars, ' ')).apply(lambda y: np.nan if len(y)==0 else y)
df_train[['text','cleantext']].sample(50)

Unnamed: 0,text,cleantext
976,#handbag #fashion #style http://t.co/iPXpI3me1...,handbag fashion style http t co iPXpI me ...
6848,@RaabChar_28 @DrPhil @MorganLawGrp How do you ...,How do you self inflict a wound to your ...
3128,Elsa is gonna end up getting electrocuted. She...,Elsa is gonna end up getting electrocuted She...
6636,'The Terrorist Tried to Get Out of the Car; I ...,The Terrorist Tried to Get Out of the Car I ...
5399,I hear the mumbling i hear the cackling i got ...,I hear the mumbling i hear the cackling i got ...
3254,@godsfirstson1 and she wrapped his coat around...,and she wrapped his coat around herself It ...
425,Video Captures Man Removing American Flag From...,Video Captures Man Removing American Flag From...
5714,#RoddyPiperAutos Fears over missing migrants i...,RoddyPiperAutos Fears over missing migrants i...
594,FedEx no longer will ship potential bioterror ...,FedEx no longer will ship potential bioterror ...
3411,Kendall Jenner and Nick Jonas Are Dating and t...,Kendall Jenner and Nick Jonas Are Dating and t...


This cleaning may be too rigorous: URLs are broken up into fragments such as 'http t co', which introduces new, non-existent words. Let's therefore replace URLs with the word 'url'.
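One possible way to do this is sketched below. The pattern `url_str` and the helper `replace_urls` are assumptions for illustration, and the example tweet uses a placeholder URL rather than one from the dataset. The replacement has to run before the special characters are stripped, otherwise the URL is already split into fragments.

```python
import re

# Assumed pattern: 'http://' or 'https://' followed by everything up to the next whitespace
url_str = r'https?://\S+'

def replace_urls(sentence, replacement=' url '):
    """Replace every URL in the sentence by a placeholder token."""
    return re.sub(url_str, replacement, sentence)

# Made-up example tweet with a placeholder URL
example = "#handbag #fashion #style https://example.com/abc now"
cleaned = replace_urls(example)
print(cleaned)
```

Padding the replacement with spaces keeps the token separate from neighbouring words; the extra whitespace disappears once the text is split into tokens.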