# Predict disaster tweets with machine learning

**Table of Contents**
<ul>
    <li><a href ="#intro">Introduction</a></li>
    <li><a href ="#wrangle">Wrangling</a></li>
</ul>

<a id ="intro"></a>
## Introduction
Twitter has become an important communication channel in times of emergency.
The ubiquity of smartphones lets people announce an emergency they're observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.

But it's not always clear whether a person's words are actually announcing a disaster.

In [18]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import neattext.functions as nfx

warnings.filterwarnings('ignore')

In [6]:
df_train = pd.read_csv('../data/train.csv')
df_test = pd.read_csv('../data/test.csv')

In [10]:
print('test data ', df_test.shape)
print('train data ', df_train.shape)
df_test.head()

test data  (3263, 4)
train data  (7613, 5)


|   | id | keyword | location | text |
|---|----|---------|----------|------|
| 0 | 0 |  |  | Just happened a terrible car crash |
| 1 | 2 |  |  | Heard about #earthquake is different cities, s... |
| 2 | 3 |  |  | there is a forest fire at spot pond, geese are... |
| 3 | 9 |  |  | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 |  |  | Typhoon Soudelor kills 28 in China and Taiwan |


In [8]:
df = df_train.copy()
df.head()

|   | id | keyword | location | text | target |
|---|----|---------|----------|------|--------|
| 0 | 1 |  |  | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 |  |  | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 |  |  | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 |  |  | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 |  |  | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [11]:
df.isnull().any().sum()

2

In [15]:
df.columns[df.isnull().any()].to_list()

['keyword', 'location']
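
The two checks above say which columns contain nulls, but not how many. A minimal sketch of a per-column summary follows. It uses a small made-up frame (`toy`) rather than the real `df`, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Toy frame standing in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "id": [1, 4, 5, 6],
    "keyword": [None, "fire", None, "flood"],
    "location": [None, None, None, "Canada"],
    "text": ["a", "b", "c", "d"],
})

# Count missing values per column and express them as a share of all rows.
missing = toy.isnull().sum()
summary = pd.DataFrame({
    "n_missing": missing,
    "share": missing / len(toy),
})

# Keep only the columns that actually have gaps.
summary = summary[summary["n_missing"] > 0]
print(summary)
```

Running the same three lines on `df` instead of `toy` would give the real counts for `keyword` and `location`.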

In [17]:
df.duplicated().sum()

0

<a id ="wrangle"></a>
## Wrangling

In [19]:
dir(nfx)

['BTC_ADDRESS_REGEX',
 'CURRENCY_REGEX',
 'CURRENCY_SYMB_REGEX',
 'Counter',
 'DATE_REGEX',
 'EMAIL_REGEX',
 'EMOJI_REGEX',
 'HASTAG_REGEX',
 'MASTERCard_REGEX',
 'MD5_SHA_REGEX',
 'MOST_COMMON_PUNCT_REGEX',
 'NUMBERS_REGEX',
 'PHONE_REGEX',
 'PoBOX_REGEX',
 'SPECIAL_CHARACTERS_REGEX',
 'STOPWORDS',
 'STOPWORDS_de',
 'STOPWORDS_en',
 'STOPWORDS_es',
 'STOPWORDS_fr',
 'STOPWORDS_ru',
 'STOPWORDS_yo',
 'STREET_ADDRESS_REGEX',
 'TextFrame',
 'URL_PATTERN',
 'USER_HANDLES_REGEX',
 'VISACard_REGEX',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__generate_text',
 '__loader__',
 '__name__',
 '__numbers_dict',
 '__package__',
 '__spec__',
 '_lex_richness_herdan',
 '_lex_richness_maas_ttr',
 'clean_text',
 'defaultdict',
 'digit2words',
 'extract_btc_address',
 'extract_currencies',
 'extract_currency_symbols',
 'extract_dates',
 'extract_emails',
 'extract_emojis',
 'extract_hashtags',
 'extract_html_tags',
 'extract_mastercard_addr',
 'extract_md5sha',
 'extract_numbers',
 'extr

In [20]:
df.columns

Index(['id', 'keyword', 'location', 'text', 'target'], dtype='object')

In [22]:
df['keyword'].value_counts()

fatalities               45
deluge                   42
armageddon               42
sinking                  41
damage                   41
                         ..
forest%20fire            19
epicentre                12
threat                   11
inundation               10
radiation%20emergency     9
Name: keyword, Length: 221, dtype: int64
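
Some keywords in the output above, such as `forest%20fire`, are percent-encoded. One possible way to decode them is the standard library's `urllib.parse.unquote`. The sketch below runs on a few values copied from that output, not on the full column.

```python
from urllib.parse import unquote

# Sample keyword values copied from the value_counts() output above.
keywords = ["fatalities", "forest%20fire", "radiation%20emergency"]

# unquote() turns percent-escapes such as %20 back into the characters they encode.
decoded = [unquote(k) for k in keywords]
print(decoded)  # ['fatalities', 'forest fire', 'radiation emergency']
```

On the DataFrame itself, something like `df['keyword'].map(unquote, na_action='ignore')` should do the same while skipping the missing values, since `unquote` only accepts strings.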

In [23]:
df['location'].value_counts()

USA                    104
New York                71
United States           50
London                  45
Canada                  29
                      ... 
MontrÌ©al, QuÌ©bec       1
Montreal                 1
ÌÏT: 6.4682,3.18287      1
Live4Heed??              1
Lincoln                  1
Name: location, Length: 3341, dtype: int64

In [26]:
df['text'].iloc[7600:7609]

7600    Evacuation order lifted for town of Roosevelt:...
7601    #breaking #LA Refugio oil spill may have been ...
7602    a siren just went off and it wasn't the Forney...
7603    Officials say a quarantine is in place at an A...
7604    #WorldNews Fallen powerlines on G:link tram: U...
7605    on the flip side I'm at Walmart and there is a...
7606    Suicide bomber kills 15 in Saudi security site...
7607    #stormchase Violent Record Breaking EF-5 El Re...
7608    Two giant cranes holding a bridge collapse int...
Name: text, dtype: object

#### Clear Noise
- Remove mentions/user handles
- Remove hashtags
- Remove URLs
- Remove emojis
- Remove special characters
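
To make the steps above concrete, here is a rough standard-library sketch of what each one does. The regular expressions and the `clean_tweet` helper are assumptions written for illustration. They are not neattext's patterns, and the notebook itself uses `nfx` functions for this.

```python
import re

# Hypothetical patterns for illustration; neattext's own regexes may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HANDLE_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# A few common emoji blocks; not an exhaustive list.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
SPECIAL_RE = re.compile(r"[^\w\s]")
SPACES_RE = re.compile(r"\s+")

def clean_tweet(text):
    """Remove URLs, handles, hashtags, emojis and special characters."""
    # URLs go first so that later steps do not break them into fragments.
    for pattern in (URL_RE, HANDLE_RE, HASHTAG_RE, EMOJI_RE, SPECIAL_RE):
        text = pattern.sub("", text)
    # Collapse the gaps left behind by the removals.
    return SPACES_RE.sub(" ", text).strip()

# Made-up sample tweet; \U0001F525 is the fire emoji.
sample = "@user Forest fire near the lake! #wildfire \U0001F525 http://t.co/abc123"
print(clean_tweet(sample))  # Forest fire near the lake
```

Order matters in a pipeline like this: stripping punctuation before URLs would leave fragments such as `httptcoabc123` that the URL pattern no longer matches.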

In [29]:
# Strip each kind of noise in turn, writing the result to a new column
df['cleaned_text'] = df['text'].apply(nfx.remove_hashtags)
df['cleaned_text'] = df['cleaned_text'].apply(nfx.remove_emojis)
df['cleaned_text'] = df['cleaned_text'].apply(nfx.remove_userhandles)
# Collapse runs of spaces left by the removals above
df['cleaned_text'] = df['cleaned_text'].apply(nfx.remove_multiple_spaces)
df['cleaned_text'] = df['cleaned_text'].apply(nfx.remove_urls)
df['cleaned_text'] = df['cleaned_text'].apply(nfx.remove_puncts)
df.head()

|   | id | keyword | location | text | target | hashtags | cleaned_text |
|---|----|---------|----------|------|--------|----------|--------------|
| 0 | 1 |  |  | Our Deeds are the Reason of this #earthquake M... | 1 | [#earthquake] | Our Deeds are the Reason of this May ALLAH For... |
| 1 | 4 |  |  | Forest fire near La Ronge Sask. Canada | 1 | [] | Forest fire near La Ronge Sask Canada |
| 2 | 5 |  |  | All residents asked to 'shelter in place' are ... | 1 | [] | All residents asked to shelter in place are be... |
| 3 | 6 |  |  | 13,000 people receive #wildfires evacuation or... | 1 | [#wildfires] | 13000 people receive evacuation orders in Cali... |
| 4 | 7 |  |  | Just got sent this photo from Ruby #Alaska as ... | 1 | [#Alaska, #wildfires] | Just got sent this photo from Ruby as smoke fr... |
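
The output above includes a `hashtags` column that none of the cells shown here creates. It was presumably built from the raw `text` before the hashtags were stripped. The `dir(nfx)` listing includes `extract_hashtags`, which may be what was used, but that is a guess. The sketch below shows the idea with a plain regular expression instead; `find_hashtags` and its pattern are hypothetical.

```python
import re

# Assumed pattern for illustration; not taken from neattext.
HASHTAG_RE = re.compile(r"#\w+")

def find_hashtags(text):
    """Return every hashtag in the text, in order of appearance."""
    return HASHTAG_RE.findall(text)

# Made-up sample tweets, not rows from the dataset.
print(find_hashtags("Smoke from #wildfires near Ruby #Alaska"))  # ['#wildfires', '#Alaska']
print(find_hashtags("Forest fire near the lake"))                # []
```

Whatever function is used, it has to run on the original `text` column, because `cleaned_text` no longer contains any hashtags to extract.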


In [30]:
df[['text','cleaned_text']]

|   | text | cleaned_text |
|---|------|--------------|
| 0 | Our Deeds are the Reason of this #earthquake M... | Our Deeds are the Reason of this May ALLAH For... |
| 1 | Forest fire near La Ronge Sask. Canada | Forest fire near La Ronge Sask Canada |
| 2 | All residents asked to 'shelter in place' are ... | All residents asked to shelter in place are be... |
| 3 | 13,000 people receive #wildfires evacuation or... | 13000 people receive evacuation orders in Cali... |
| 4 | Just got sent this photo from Ruby #Alaska as ... | Just got sent this photo from Ruby as smoke fr... |
| ... | ... | ... |
| 7608 | Two giant cranes holding a bridge collapse int... | Two giant cranes holding a bridge collapse int... |
| 7609 | @aria_ahrary @TheTawniest The out of control w... | The out of control wild fires in California e... |
| 7610 | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt... | M194 [01:04 UTC]5km S of Volcano Hawaii |
| 7611 | Police investigating after an e-bike collided ... | Police investigating after an ebike collided w... |
