<a href="https://colab.research.google.com/github/FanusArefaine/Natural-Language-Processing/blob/main/NLP_Text_Classification_Disaster_or_Not.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
from google.colab import drive 
drive.mount('/content/drive')

Mounted at /content/drive


In [140]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 

### **Importing Training and Testing Dataset**

In [141]:
train_df = pd.read_csv('train.csv')
train_df.head()

,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [142]:
test_df = pd.read_csv('test.csv')
test_df.head()

,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


#### **Feature Exploration**

In [143]:
print("Dimensions of training dataset: ", train_df.shape)
print("Dimensions of testing dataset: ", test_df.shape)

Dimensions of training dataset:  (7613, 5)
Dimensions of testing dataset:  (3263, 4)


In [144]:
# Class distribution of the target variable in the training dataset

# As the output below shows, the classes are moderately balanced (about 43% disaster vs. 57% not disaster)

disaster_tweets = train_df[train_df['target']==1].shape[0]
not_disaster_tweets = train_df.shape[0] - disaster_tweets

print(f'Percentage of disaster tweets: {round(((disaster_tweets/train_df.shape[0])*100),2)}%')
print(f'Percentage of not disaster tweets: {round(((not_disaster_tweets/train_df.shape[0])*100),2)}%')

Percentage of disaster tweets: 42.97%
Percentage of not disaster tweets: 57.03%
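
The same class shares can be read off in one step with `value_counts(normalize=True)`. The sketch below is a minimal illustration on a toy DataFrame; `toy_df` and `class_percentages` are hypothetical names that do not appear in the notebook, and the toy labels are made up for the example.

```python
import pandas as pd

# Toy stand-in for train_df (assumption: the real data comes from train.csv).
toy_df = pd.DataFrame({"target": [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]})


def class_percentages(df, label_col="target"):
    """Return the share of each label as a percentage, rounded to 2 decimals."""
    return (df[label_col].value_counts(normalize=True) * 100).round(2)


percentages = class_percentages(toy_df)
print(percentages)
```

Calling `class_percentages(train_df)` on the real training data should give the same two percentages as the manual computation, without the intermediate count variables.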


In [145]:
# Checking the training dataset for null values 

# 'location' has a significant number of missing values, and 'keyword' has a few, as shown below

train_df.isnull().sum()

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

In [146]:
# Checking the testing dataset for null values 

# Like the training dataset, the testing dataset has a significant number of missing 'location' values and a few missing 'keyword' values

test_df.isnull().sum()

id             0
keyword       26
location    1105
text           0
dtype: int64

**Due to missing values . . .**


     The 'location' feature can be dropped because about a third of its values are missing (2533 of 7613 training rows and 1105 of 3263 testing rows).

     The 'keyword' feature will be analyzed further to check whether it can help classify the targets.
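
The decision above rests on the share of missing values per column rather than the raw counts. A minimal sketch of that calculation follows, assuming a toy DataFrame with the same column names; `toy_df` and `missing_percentages` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Toy stand-in for train_df with the same column names (values are made up).
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "keyword": ["fire", None, "flood", "crash"],
    "location": [None, None, "Canada", None],
    "text": ["a", "b", "c", "d"],
})


def missing_percentages(df):
    """Return the percentage of missing values in each column."""
    # isnull() gives a boolean frame; the column-wise mean is the missing fraction.
    return (df.isnull().mean() * 100).round(2)


missing = missing_percentages(toy_df)
print(missing)
```

Applied to the real `train_df` and `test_df` before any columns are dropped, this puts 'location' and 'keyword' on a comparable scale across the two datasets.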

  

In [147]:
# Dropping the 'location' feature from both the training and testing datasets

train_df.drop(['location'], axis=1, inplace=True)
test_df.drop(['location'], axis=1, inplace=True)

In [151]:
# Disaster tweets' keywords

train_df[train_df['target']==1].groupby('keyword').count()

keyword,id,text,target
ablaze,13,13,13
accident,24,24,24
airplane%20accident,30,30,30
ambulance,20,20,20
annihilated,11,11,11
...,...,...,...
wounded,26,26,26
wounds,10,10,10
wreck,7,7,7
wreckage,39,39,39


In [152]:
# Extracting disaster and not disaster keywords

not_disaster_keys = train_df[train_df['target']==0]['keyword'].tolist()
disaster_keys = train_df[train_df['target']==1]['keyword'].tolist()

In [153]:
from collections import Counter 

# Keyword frequencies in disaster and not disaster tweets 

not_disaster_keys_counts_sorted = (Counter(not_disaster_keys)).most_common()
disaster_keys_counts_sorted = (Counter(disaster_keys)).most_common()


In [154]:
# Top 20 most common keywords in not disaster tweets 

not_disaster_keys_counts_sorted[:20]

[('body%20bags', 40),
 ('armageddon', 37),
 ('harm', 37),
 ('deluge', 36),
 ('ruin', 36),
 ('wrecked', 36),
 ('explode', 35),
 ('fear', 35),
 ('siren', 35),
 ('twister', 35),
 ('aftershock', 34),
 ('panic', 34),
 ('screaming', 34),
 ('blaze', 33),
 ('blazing', 33),
 ('blizzard', 33),
 ('crush', 33),
 ('sinking', 33),
 ('traumatised', 33),
 ('bloody', 32)]

In [155]:
# Top 20 most common keywords in disaster tweets 

disaster_keys_counts_sorted[:20]

[(nan, 42),
 ('derailment', 39),
 ('outbreak', 39),
 ('wreckage', 39),
 ('debris', 37),
 ('oil%20spill', 37),
 ('typhoon', 37),
 ('evacuated', 32),
 ('rescuers', 32),
 ('suicide%20bomb', 32),
 ('suicide%20bombing', 32),
 ('nuclear%20disaster', 31),
 ('razed', 31),
 ('airplane%20accident', 30),
 ('earthquake', 30),
 ('suicide%20bomber', 30),
 ('bridge%20collapse', 29),
 ('collision', 29),
 ('wildfire', 29),
 ('buildings%20on%20fire', 28)]

In [157]:
# Common keywords in disaster and not disaster tweets 

common_keys = list(set(not_disaster_keys).intersection(set(disaster_keys)))
print(f'Number of common words in disaster and not disaster tweets: {len(common_keys)}\n')
common_keys

Number of common words in disaster and not disaster tweets: 218



[nan,
 'collapse',
 'drowning',
 'panic',
 'collision',
 'blew%20up',
 'crashed',
 'death',
 'cliff%20fall',
 'obliteration',
 'thunderstorm',
 'body%20bag',
 'collide',
 'burning%20buildings',
 'weapon',
 'quarantine',
 'exploded',
 'crush',
 'curfew',
 'devastated',
 'desolation',
 'emergency',
 'survived',
 'fire',
 'terrorist',
 'volcano',
 'wild%20fires',
 'survive',
 'detonate',
 'electrocute',
 'earthquake',
 'war%20zone',
 'engulfed',
 'ambulance',
 'danger',
 'flames',
 'demolished',
 'police',
 'collided',
 'explode',
 'oil%20spill',
 'landslide',
 'survivors',
 'razed',
 'injury',
 'inundation',
 'rainstorm',
 'typhoon',
 'wreck',
 'destroyed',
 'thunder',
 'chemical%20emergency',
 'flooding',
 'disaster',
 'displaced',
 'mudslide',
 'burning',
 'rubble',
 'avalanche',
 'screams',
 'hail',
 'airplane%20accident',
 'wounded',
 'annihilated',
 'flattened',
 'destroy',
 'emergency%20plan',
 'injuries',
 'hazard',
 'deaths',
 'suicide%20bombing',
 'riot',
 'trapped',
 'accident'

In [158]:
# Checking how many times the above common keywords appear in not disaster tweets

# Count one for every not disaster keyword that is also in common_keys
not_disaster_common_keys = sum(1 for word in not_disaster_keys if word in common_keys)
print(f'There are {not_disaster_common_keys} occurrences of common keywords out of a total of {len(not_disaster_keys)} not disaster keywords.')

There are 4308 occurrences of common keywords out of a total of 4342 not disaster keywords.


In [159]:
# Checking how many times the above common keywords appear in disaster tweets

# Count one for every disaster keyword that is also in common_keys
disaster_common_keys = sum(1 for word in disaster_keys if word in common_keys)
print(f'There are {disaster_common_keys} occurrences of common keywords out of a total of {len(disaster_keys)} disaster keywords.')

There are 3156 occurrences of common keywords out of a total of 3271 disaster keywords.
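
The two counts above can also be produced in a single pass with pandas. The sketch below is one possible vectorized version on toy data; `toy_df` and `shared_keyword_counts` are hypothetical names. Unlike the list-based cells, it drops rows with a missing keyword first, so it is an assumption-laden variant and its totals on the real data would differ slightly.

```python
import pandas as pd

# Toy stand-in for train_df (keywords and labels are made up).
toy_df = pd.DataFrame({
    "keyword": ["fire", "fire", "panic", "panic", "typhoon", "harm", None, None],
    "target":  [1, 0, 1, 0, 1, 0, 1, 1],
})


def shared_keyword_counts(df):
    """Per class: rows whose keyword occurs in both classes, and total rows."""
    # Assumption: rows without a keyword are excluded from the comparison.
    labeled = df.dropna(subset=["keyword"])
    disaster = set(labeled.loc[labeled["target"] == 1, "keyword"])
    not_disaster = set(labeled.loc[labeled["target"] == 0, "keyword"])
    shared = disaster & not_disaster

    is_shared = labeled["keyword"].isin(shared)
    return pd.DataFrame({
        "shared": is_shared.groupby(labeled["target"]).sum(),
        "total": labeled.groupby("target").size(),
    })


counts = shared_keyword_counts(toy_df)
print(counts)
```

Each row of the result corresponds to one value of `target`, which keeps the shared count and the class total side by side.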


#### **As shown above . . .**


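      Disaster and not disaster tweets share most of their keywords: 4308 of 4342 not disaster keyword entries and 3156 of 3271 disaster keyword entries are keywords that occur in both classes.

      Keeping the 'keyword' feature might therefore mislead the classification algorithm.

      Hence, the 'keyword' feature is dropped from both datasets.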
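
One way to put a number on how ambiguous each keyword is would be to compute, per keyword, the fraction of its tweets labeled as disasters. The sketch below is only an illustration on toy data and is not part of the original analysis; `toy_df` and `keyword_disaster_rate` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for train_df before the 'keyword' column is dropped (made-up rows).
toy_df = pd.DataFrame({
    "keyword": ["wreckage", "wreckage", "panic", "panic", "panic", "fire", "fire", None],
    "target":  [1, 1, 0, 0, 1, 1, 0, 1],
})


def keyword_disaster_rate(df):
    """Per keyword: number of tweets and share of them labeled as disasters."""
    stats = (
        df.dropna(subset=["keyword"])
          .groupby("keyword")["target"]
          .agg(tweets="count", disaster_rate="mean")
    )
    return stats.sort_values("disaster_rate", ascending=False)


rates = keyword_disaster_rate(toy_df)
print(rates)
```

A keyword with a `disaster_rate` near 0.5 appears about equally often in both classes, which is the situation the conclusion above is concerned with.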



In [161]:
# Dropping the 'keyword' feature from both the training and testing datasets

train_df.drop(['keyword'], axis=1, inplace=True)
test_df.drop(['keyword'], axis=1, inplace=True)

In [162]:
train_df.head()

,id,text,target
0,1,Our Deeds are the Reason of this #earthquake M...,1
1,4,Forest fire near La Ronge Sask. Canada,1
2,5,All residents asked to 'shelter in place' are ...,1
3,6,"13,000 people receive #wildfires evacuation or...",1
4,7,Just got sent this photo from Ruby #Alaska as ...,1


In [163]:
test_df.head()

,id,text
0,0,Just happened a terrible car crash
1,2,"Heard about #earthquake is different cities, s..."
2,3,"there is a forest fire at spot pond, geese are..."
3,9,Apocalypse lighting. #Spokane #wildfires
4,11,Typhoon Soudelor kills 28 in China and Taiwan
