Hello! This is my notebook for the Twitter Disaster Classification competition on Kaggle. Link to the competition, where you can also find all the files: https://www.kaggle.com/c/nlp-getting-started

This competition is a good way to get into NLP (Natural Language Processing).
My workflow:
    1. Question (task) of this project
    2. Get and Clean Data
    3. Perform Exploratory Data Analysis
    4. Apply some NLP techniques
    5. Share Insights

## Question (task) of this project

This is one of the 'Getting Started' projects on Kaggle. Twitter is one of the most popular social media platforms, where people post opinions, news, and articles.
Many words that describe disasters are also used metaphorically.
The task is to detect which tweets are about a real disaster and which only use such words as a metaphor.

## Get and Clean Data

There are 3 files in this competition: train.csv, used for exploration and model tuning; test.csv, used to produce the predictions submitted to Kaggle; and sample_submission.csv, which shows the format the final submission should have.

In [1]:
import pandas as pd

train = pd.read_csv('train.csv')

test = pd.read_csv('test.csv')

sample = pd.read_csv('sample_submission.csv')

In [2]:
train.head()

# As we can see, the target variable is the target column (1 for a real disaster, 0 otherwise)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [3]:
print(f'Train file shape: {train.shape}, test file shape : {test.shape}')

Train file shape: (7613, 5), test file shape : (3263, 4)


In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [5]:
train['keyword'].unique()[:10]

array([nan, 'ablaze', 'accident', 'aftershock', 'airplane%20accident',
       'ambulance', 'annihilated', 'annihilation', 'apocalypse',
       'armageddon'], dtype=object)

In [6]:
train['location'].unique()[:10]

array([nan, 'Birmingham', 'Est. September 2012 - Bristol', 'AFRICA',
       'Philadelphia, PA', 'London, UK', 'Pretoria', 'World Wide!!',
       'Paranaque City', 'Live On Webcam'], dtype=object)

What we have here are 4 attributes: id (an identifier that keeps the rows in order), keyword (a main word describing the tweet), location (very unstructured free text), and text (the text of the tweet).
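
The missing-value counts that `train.info()` reports can also be read off directly. Below is a minimal sketch on a small stand-in frame; the three rows in `demo` are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv; the values are illustrative only
demo = pd.DataFrame({
    'id': [1, 2, 3],
    'keyword': [np.nan, 'ablaze', 'accident'],
    'location': [np.nan, np.nan, 'London, UK'],
    'text': ['first tweet', 'second tweet', 'third tweet'],
})

# Number of missing values per column
missing_counts = demo.isnull().sum()

# Share of missing values per column, as a fraction between 0 and 1
missing_share = demo.isnull().mean()

print(missing_counts)
print(missing_share.round(2))
```

On the real training set, the `train.info()` output above implies 61 missing keywords (7613 - 7552) and 2533 missing locations (7613 - 5080).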

### Fill missing values in Location and Keyword

In [7]:
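# Both train and test contain NA values, so let's fill them

def fillna_column(column, imputer):
    """Return the column with missing values replaced by `imputer`."""
    # fillna leaves a column without missing values unchanged,
    # so no explicit null check is needed
    return column.fillna(imputer)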
    
train['location'] = fillna_column(train['location'], 'Unknown')
train['keyword'] = fillna_column(train['keyword'], 'no')

In [8]:
# For test also
test['location'] = fillna_column(test['location'], 'Unknown')
test['keyword'] = fillna_column(test['keyword'], 'no')

In [9]:
# I will drop the location column, as it is hard to structure and to extract information from

train.drop('location', axis = 1, inplace = True)
test.drop('location', axis = 1, inplace = True)

In [10]:
train.loc[69, 'text']

'Accident center lane blocked in #SantaClara on US-101 NB before Great America Pkwy #BayArea #Traffic http://t.co/pmlOhZuRWR'

### Clean Data

Steps:
    1. Lowercase all words
    2. Remove punctuation
    3. Remove numbers
    4. Remove links
    5. Remove meaningless filler words (like, etc., then, by...)
    6. Tokenize words
    7. Remove stop words or most common words
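
Steps 5 to 7 are not part of the first cleaning round; in this notebook, stop-word removal and tokenization are left to CountVectorizer further down. As a rough illustration of what those steps do, here is a sketch with a deliberately tiny stop-word set. Both `DEMO_STOP_WORDS` and `tokenize_and_filter` are hypothetical names introduced only for this example.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only
DEMO_STOP_WORDS = {'a', 'the', 'in', 'on', 'of', 'is', 'are', 'this'}

def tokenize_and_filter(text, stop_words=DEMO_STOP_WORDS):
    """Split already-cleaned text on whitespace and drop stop words."""
    tokens = text.split()
    return [tok for tok in tokens if tok not in stop_words]

tokens = tokenize_and_filter('our deeds are the reason of this earthquake')
print(tokens)
```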
    

Round 1: Cleaning the data

In [11]:
import re
import string

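def clean_text_round1(text):
    """Lowercase the text and strip '@user:' prefixes, links, tokens with digits and special characters."""
    text = text.lower()  # lowercase the text

    # remove '@user:' prefixes (a mention followed by a colon and one more character);
    # other mentions only lose their '@' in the special-character step below
    text = re.sub(r'@\w*:.', '', text)

    text = re.sub(r'http\S+', '', text)  # remove links

    # remove tokens that contain digits
    text = ' '.join(s for s in text.split() if not any(c.isdigit() for c in s))

    text = re.sub(r'[^A-Za-z0-9 ]+', '', text)  # remove special characters

    text = re.sub(r' +', ' ', text)  # collapse runs of spaces into one

    text = text.strip()  # remove spaces at the beginning and end of the text

    return text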
    



In [12]:
train['text'] = train['text'].apply(clean_text_round1)

test['text'] = test['text'].apply(clean_text_round1)
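
To see what the first round does, the steps can be traced on the tweet from row 69 shown earlier. The function `clean_demo` below is a condensed version of the same steps, written so that the snippet runs on its own; it is a sketch for inspection, not a replacement for `clean_text_round1`.

```python
import re

def clean_demo(text):
    text = text.lower()                                   # lowercase
    text = re.sub(r'@\w*:.', '', text)                    # drop '@user:' prefixes
    text = re.sub(r'http\S+', '', text)                   # drop links
    text = ' '.join(s for s in text.split()
                    if not any(c.isdigit() for c in s))   # drop tokens with digits
    text = re.sub(r'[^A-Za-z0-9 ]+', '', text)            # drop punctuation and symbols
    text = re.sub(r' +', ' ', text)                       # collapse repeated spaces
    return text.strip()

tweet = ('Accident center lane blocked in #SantaClara on US-101 NB before '
         'Great America Pkwy #BayArea #Traffic http://t.co/pmlOhZuRWR')

cleaned = clean_demo(tweet)
print(cleaned)
# accident center lane blocked in santaclara on nb before great america pkwy bayarea traffic
```

The token `us-101` disappears entirely because it contains digits, while hashtags only lose their `#`.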

In [13]:
# Let's clean the keyword column: we see %20 instead of a space, so let's replace it

def remove_spec_chars(example):

    return re.sub('[^A-Za-z ]+', ' ', example)



In [14]:
train['keyword'] = train['keyword'].apply(remove_spec_chars)
test['keyword'] = test['keyword'].apply(remove_spec_chars)
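
The `%20` in the keywords looks like URL percent-encoding of a space. Assuming that is the only kind of escape present, the standard library's `urllib.parse.unquote` gives the same result as the regex on this example; the sketch below compares the two on one keyword.

```python
import re
from urllib.parse import unquote

raw_keyword = 'airplane%20accident'

# Approach used in this notebook: replace every run of characters that are
# neither letters nor spaces with a single space
via_regex = re.sub(r'[^A-Za-z ]+', ' ', raw_keyword)

# Alternative: decode percent-escapes (%20 -> space) with the standard library
via_unquote = unquote(raw_keyword)

print(via_regex)    # airplane accident
print(via_unquote)  # airplane accident
```

The regex is the more aggressive of the two: it would also replace digits or punctuation inside a keyword, whereas `unquote` only decodes escapes.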

In [15]:
#train['keyword'].unique()

#### Next, I want to divide the keywords into classes and categorize them

In [16]:
human_related = ['airplane accident', 'ambulance', 'army', 'arson', 'arsonist', 'attack', 'attacked', 'battle', 'bioterror', 'bioterrorism', 'bleeding', 
                 'blood', 'bloody', 'blown up', 'body bag', 'body bagging', 'body bags', 'bomb', 'bombed', 'bombing', 'bridge collapse',
                 'buildings burning', 'buildings on fire', 'burned', 'burning', 'burning buildings', 'casualties', 'casualty', 'chemical emergency', 'crash', 'crashed', 'crush', 'crushed', 
                 'curfew', 'dead', 'death', 'deaths', 'debris', 'demolish', 'demolished', 'demolition', 'derail',
                 'derailed', 'derailment', 'desolate', 'desolation', 'destroy',
                 'destroyed', 'destruction', 'detonate', 'detonation', 'devastated', 'devastation', 'drown',
                 'drowned', 'drowning', 'electrocute', 'electrocuted', 'emergency', 'emergency plan', 'emergency services',
                 'engulfed', 'explode', 'exploded', 'explosion',  'eyewitness', 'famine', 'fatal', 'fatalities', 'fatality', 'fear',
                 'fire truck', 'first responders', 'flames', 'flattened', 'harm', 'hijack', 'hijacker', 'hijacking', 'hostage',
                 'hostages', 'injured', 'injuries', 'injury',  'mass murder', 'mass murderer', 'massacre', 'mayhem', 'military',
                 'nuclear disaster', 'nuclear reactor', 'oil spill', 'outbreak', 'panic',
                 'panicking', 'police', 'quarantine', 'quarantined', 'radiation emergency', 'razed', 'refugees', 'rescue',
                 'rescued', 'rescuers', 'riot', 'rioting', 'ruin', 'screamed', 'screaming', 'screams', 'sinking', 'siren', 'sirens', 'smoke',
                 'stretcher', 'structural failure', 'suicide bomb',
                 'suicide bomber', 'suicide bombing', 'sunk', 'survive', 'survived', 'survivors', 'terrorism', 'terrorist', 'threat',
                 'trapped', 'trauma', 'traumatised', 'trouble', 'upheaval', 'war zone', 'weapon', 'weapons',  'wounded', 'wounds',  'wreck', 'wreckage', 'wrecked',
                ]
all_categories = train['keyword'].unique()

nature_related = list(set(all_categories) - set(human_related) )


In [17]:
# Now let's replace the values with one of two labels: 'human' or 'nature'
train['keyword'] = train['keyword'].apply(lambda x: 'human' if x in human_related else 'nature')

test['keyword'] = test['keyword'].apply(lambda x: 'human' if x in human_related else 'nature')
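
The same mapping can be written in vectorized form and checked with `value_counts()`. This is a minimal sketch on made-up keywords; `human_subset` is a hypothetical three-word excerpt of the `human_related` list, used only to keep the example short.

```python
import numpy as np
import pandas as pd

# Small excerpt of the human_related list, for illustration only
human_subset = {'arson', 'bombing', 'hijack'}

keywords = pd.Series(['arson', 'earthquake', 'no', 'hijack', 'flood'])

# Vectorized equivalent of the lambda: 'human' if listed, otherwise 'nature'
labels = pd.Series(np.where(keywords.isin(human_subset), 'human', 'nature'),
                   index=keywords.index)

print(labels.tolist())
print(labels.value_counts())
```

Note that the `'no'` placeholder used for missing keywords is not in the human-related list, so those rows are labelled `'nature'` as well.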

In [18]:
import numpy as np
# Now let's use one-hot encoding for this column

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown = 'ignore')

keyword = train['keyword'].values

keyword = keyword.reshape(-1, 1)

keyword = enc.fit_transform(keyword)

In [19]:
train = pd.concat([train, pd.DataFrame(keyword.toarray())], axis = 1)

train = train.rename(columns = {0: 'human', 1: 'nature'})

train.drop('keyword', axis = 1, inplace = True)

train

Unnamed: 0,id,text,target,human,nature
0,1,our deeds are the reason of this earthquake ma...,1,0.0,1.0
1,4,forest fire near la ronge sask canada,1,0.0,1.0
2,5,all residents asked to shelter in place are be...,1,0.0,1.0
3,6,people receive wildfires evacuation orders in ...,1,0.0,1.0
4,7,just got sent this photo from ruby alaska as s...,1,0.0,1.0
...,...,...,...,...,...
7608,10869,two giant cranes holding a bridge collapse int...,1,0.0,1.0
7609,10870,ariaahrary thetawniest the out of control wild...,1,0.0,1.0
7610,10871,s of volcano hawaii,1,0.0,1.0
7611,10872,police investigating after an ebike collided w...,1,0.0,1.0


In [20]:
# Same for test dataframe
test = pd.concat([test, pd.DataFrame(enc.transform(test['keyword'].values.reshape(-1, 1)).toarray())], axis = 1)

test = test.rename(columns = {0: 'human', 1: 'nature'})

test.drop('keyword', axis = 1, inplace = True)
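
Renaming columns `0` and `1` to `human` and `nature` relies on the order in which the encoder stores its categories. A small sketch of how that order could be checked through the fitted encoder's `categories_` attribute, using a made-up three-row input (`demo_keywords` and `demo_enc` are names chosen for this example):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

demo_keywords = np.array(['nature', 'human', 'nature']).reshape(-1, 1)

demo_enc = OneHotEncoder(handle_unknown='ignore')
encoded = demo_enc.fit_transform(demo_keywords).toarray()

# categories_ holds the learned categories per input column, in sorted order
column_names = list(demo_enc.categories_[0])
print(column_names)
print(encoded)
```

Passing `column_names` as the DataFrame columns would avoid hard-coding the rename.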

### Organizing Data

I need two forms of each dataset:
1) Corpus - our collection of texts, in order
2) Document-term matrix - counts of how often each word appears in each row

In [21]:
# We have already created the corpus, so let's save the datasets for other notebooks

train.to_pickle('train_corpus.pkl')

test.to_pickle('test_corpus.pkl')

#### Document-Term Matrix

For many of the techniques we'll be using in future notebooks, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.

In addition, with CountVectorizer, we can remove stop words. Stop words are common words that add no additional meaning to text such as 'a', 'the', etc.

In [25]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words = 'english')

train_cv = cv.fit_transform(train['text'])

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
train_dtm = pd.DataFrame(train_cv.toarray(), columns = cv.get_feature_names_out())

train_dtm.index = train.index

train_dtm['id'] = train['id']


In [26]:
# same goes for test

test_cv = cv.transform(test['text'])

test_dtm = pd.DataFrame(test_cv.toarray(), columns = cv.get_feature_names_out())

test_dtm.index = test.index

test_dtm['id'] = test['id']


test_dtm

Unnamed: 0,aa,aaaa,aaaaaaallll,aaaaaand,aaarrrgghhh,aaceorg,aampb,aampw,aan,aannnnd,...,zonesthank,zoom,zouma,zourryart,zrnf,zss,zumiez,zurich,zxathetis,zzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3258,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3259,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3260,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3261,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [27]:
# Let's save these datasets
test_dtm.to_pickle('test_dtm.pkl')

train_dtm.to_pickle('train_dtm.pkl')
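
In a later notebook the pickled frames can be loaded back with `pd.read_pickle`. A minimal round-trip sketch on a made-up frame, written to a temporary directory (the file name `demo_corpus.pkl` is hypothetical):

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for the saved corpus
demo = pd.DataFrame({'id': [1, 4], 'text': ['first tweet', 'second tweet']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_corpus.pkl')   # hypothetical file name
    demo.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored frame should match the original exactly
print(restored.equals(demo))
```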


The cleaning step is finished.