# Disaster Tweets EDA

Summary: Goal is to determine which tweets are about real disasters or not
Dataset: https://www.kaggle.com/competitions/nlp-getting-started/leaderboard

The goal of this initial workbook is to complete the EDA to further investigate the data and potential features within the dataset.

## Lib Imports

In [88]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [89]:
from nltk.corpus import stopwords
import preprocessor as p

ENGLISH_STOPWORDS = set(stopwords.words('english'))

## Train / Test Dataset Loading and Analysis

**Train Dataset Analysis**

In [90]:
train_df = pd.read_csv('train.csv')
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [91]:
train_df.isna().sum() * 100 / len(train_df)

id           0.000000
keyword      0.801261
location    33.272035
text         0.000000
target       0.000000
dtype: float64

Seem to be missing alot of values from keywords and location within the training dataset.

In [92]:
test_df = pd.read_csv('test.csv')
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


Now I am going to investigate whether there are any differences in distribution of target class between the missing and non missing location subsets of the training dataframe.

In [93]:
train_df[train_df['location'].isna()]['target'].value_counts(normalize=True)

target
0    0.575602
1    0.424398
Name: proportion, dtype: float64

In [94]:
train_df[~train_df['location'].isna()]['target'].value_counts(normalize=True)

target
0    0.567717
1    0.432283
Name: proportion, dtype: float64

It doesn't look like there are any substantial differences, which is quite positive.

**Test Dataframe Analysis**

In [95]:
test_df.isna().sum() * 100 / len(test_df)

id           0.000000
keyword      0.796813
location    33.864542
text         0.000000
dtype: float64

The testing dataset has similar missing data, whereby over 33% of locations are missing from the dataset.

In [96]:
train_df['target'].value_counts(normalize=True)

target
0    0.57034
1    0.42966
Name: proportion, dtype: float64

### Keyword Analysis

Initial observation is that keyword represents a single word or 'category' for the tweet.

In [97]:
missing_keyword_mask = (train_df['keyword'].isna())

In [98]:
keyword_train = train_df[~missing_keyword_mask]
keyword_train

Unnamed: 0,id,keyword,location,text,target
31,48,ablaze,Birmingham,@bbcmtd Wholesale Markets ablaze http://t.co/l...,1
32,49,ablaze,Est. September 2012 - Bristol,We always try to bring the heavy. #metal #RT h...,0
33,50,ablaze,AFRICA,#AFRICANBAZE: Breaking news:Nigeria flag set a...,1
34,52,ablaze,"Philadelphia, PA",Crying out for more! Set me ablaze,0
35,53,ablaze,"London, UK",On plus side LOOK AT THE SKY LAST NIGHT IT WAS...,0
...,...,...,...,...,...
7578,10830,wrecked,,@jt_ruff23 @cameronhacker and I wrecked you both,0
7579,10831,wrecked,"Vancouver, Canada",Three days off from work and they've pretty mu...,0
7580,10832,wrecked,London,#FX #forex #trading Cramer: Iger's 3 words tha...,0
7581,10833,wrecked,Lincoln,@engineshed Great atmosphere at the British Li...,0


In [99]:
keyword_train['keyword'].value_counts()

keyword
fatalities               45
deluge                   42
armageddon               42
damage                   41
body%20bags              41
                         ..
forest%20fire            19
epicentre                12
threat                   11
inundation               10
radiation%20emergency     9
Name: count, Length: 221, dtype: int64

### Location Analysis

In [100]:
missing_location_mask = (train_df['location'].isna())

In [101]:
location_train = train_df[~missing_location_mask]
location_train.head()

Unnamed: 0,id,keyword,location,text,target
31,48,ablaze,Birmingham,@bbcmtd Wholesale Markets ablaze http://t.co/l...,1
32,49,ablaze,Est. September 2012 - Bristol,We always try to bring the heavy. #metal #RT h...,0
33,50,ablaze,AFRICA,#AFRICANBAZE: Breaking news:Nigeria flag set a...,1
34,52,ablaze,"Philadelphia, PA",Crying out for more! Set me ablaze,0
35,53,ablaze,"London, UK",On plus side LOOK AT THE SKY LAST NIGHT IT WAS...,0


In [102]:
location_train['location'].value_counts()

location
USA                   104
New York               71
United States          50
London                 45
Canada                 29
                     ... 
Silesia, Poland         1
Hickville, USA          1
New York NYC            1
Valle Del Sol           1
todaysbigstock.com      1
Name: count, Length: 3341, dtype: int64

There seems to be quite a mixture of different types of text within the location column, with some entries just being the country, some containing the city and country, and others not being a real location.

In [103]:
contains_comma_mask = (location_train['location'].str.find(',') != -1)

In [104]:
location_train[contains_comma_mask]['location'].value_counts()

location
Los Angeles, CA                   26
Washington, DC                    21
Chicago, IL                       18
California, USA                   15
New York, NY                      15
                                  ..
Arlington, VA and DC               1
Durban, South Africa               1
Kalamazoo, Michigan                1
Washington DC / Nantes, France     1
ÌÏT: 33.209923,-87.545328          1
Name: count, Length: 1305, dtype: int64

I'm going to need to find a better way of analysing the makeup of this location column....

### Text Analysis

After some investigation and review of other tweets, it is apparent the the text can contains any of the below features:

* 1. tags: any text starting with '#' will contain tags
* 2. user_tags: any text starting with @ represents a user tag
* 3. urls: text starting with https://, http:// etc contain urls
* 4. retweets: contains t.co?


In [120]:
def preprocess_text(df):
    df['text'] = df['text'].str.lower()
    df['text'] = df['text'].str.replace('[^\w\s]', ' ')
    df['text'] = df['text'].str.replace('\s\s+', ' ')
    df['text'] = df['text'].apply(p.clean)
    df['text'] = df['text'].apply(lambda x : ' '.join([w for w in x.split(' ') if w not in ENGLISH_STOPWORDS]))


In [109]:
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,our deeds are the reason of this #earthquake m...,1
1,4,,,forest fire near la ronge sask. canada,1
2,5,,,all residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,just got sent this photo from ruby #alaska as ...,1


In [110]:
preprocess_text(train_df)

In [121]:
create_tokens(train_df)

In [122]:
train_df.head()

Unnamed: 0,id,keyword,location,text,target,tokens
0,1,,,our deeds are the reason of this may allah for...,1,deeds reason may allah forgive us
1,4,,,forest fire near la ronge sask. canada,1,forest fire near la ronge sask. canada
2,5,,,all residents asked to 'shelter in place' are ...,1,residents asked 'shelter place' notified offic...
3,6,,,people receive evacuation orders in california,1,people receive evacuation orders california
4,7,,,just got sent this photo from ruby as smoke fr...,1,got sent photo ruby smoke pours school
