# Data Cleaning

## Introduction

* This notebook goes through the necessary step of cleaning the data before it is used for exploratory data analysis.
* The input of this notebook is a training dataset in CSV format sourced from Kaggle.
* The output of this notebook is a term frequency-inverse document frequency (TF-IDF) matrix.
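
As a rough illustration of what a TF-IDF matrix contains, here is a minimal sketch using only the standard library. The function name `tfidf_matrix` and the three sample strings are assumptions made for this example, and the unsmoothed formula `tf * log(N / df)` is only one common variant; libraries such as scikit-learn apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        row = []
        for term in vocab:
            tf = counts[term] / len(toks) if toks else 0.0  # term frequency
            idf = math.log(n_docs / doc_freq[term])         # inverse document frequency
            row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix

# Shortened, already-cleaned sample strings (hypothetical mini corpus)
docs = ["forest fire near la ronge", "fire in the forest", "heavy rain causes flash flood"]
vocab, matrix = tfidf_matrix(docs)
```

Terms that occur in few documents (such as `rain` here) get a higher weight than terms shared by several documents (such as `fire`).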

## Reading and Understanding the dataset

In [397]:
# Importing libraries
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re           # regular expressions
import string       # string handling
import random       # for selecting random rows
import nltk         # Natural Language Toolkit


Reading the training data into a dataframe using pandas and viewing the top 20 rows of data

In [398]:
train_df = pd.read_csv("train.csv")
train_df.head(20)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
5,8,,,#RockyFire Update => California Hwy. 20 closed...,1
6,10,,,#flood #disaster Heavy rain causes flash flood...,1
7,13,,,I'm on top of the hill and I can see a fire in...,1
8,14,,,There's an emergency evacuation happening now ...,1
9,15,,,I'm afraid that the tornado is coming to our a...,1


The first 20 rows of the 'keyword' and 'location' columns are NaN. Next, check whether the other columns of the dataframe contain nulls.

In [399]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
id          7613 non-null int64
keyword     7552 non-null object
location    5080 non-null object
text        7613 non-null object
target      7613 non-null int64
dtypes: int64(2), object(3)
memory usage: 297.5+ KB
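
The non-null counts reported by `info()` can be turned into missing-value counts with simple arithmetic. The sketch below copies the numbers from the output above into a plain dictionary (the variable names are assumptions for this example); on the dataframe itself, `train_df.isna().sum()` should give the same counts directly.

```python
# Non-null counts copied from the train_df.info() output
total_rows = 7613
non_null = {"id": 7613, "keyword": 7552, "location": 5080, "text": 7613, "target": 7613}

# Missing values per column and their share of all rows, in percent
missing = {col: total_rows - n for col, n in non_null.items()}
share = {col: round(100 * m / total_rows, 2) for col, m in missing.items()}

for col in non_null:
    print(f"{col:10s} missing={missing[col]:5d} ({share[col]}%)")
```

Only 'keyword' (61 missing) and 'location' (2533 missing, roughly a third of the rows) contain nulls.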


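Checking the number of unique values in 'keyword' and 'location'. Since both columns contain nulls and `unique()` treats NaN as a value, each count below includes one entry for the missing values.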

In [400]:
len(train_df.keyword.unique())

222

In [401]:
len(train_df.location.unique())

3342

A glimpse at 50 randomly selected unique values from the 'location' column


In [402]:
random.sample(list((train_df.location.unique())),k=50)


['Milton keynes',
 '#UNITE THE BLUE  ',
 'germany',
 'Madisonville TN',
 'South, England',
 'Burbank,CA',
 'Reston, VA, USA',
 'Frankfort, KY',
 'Orbost, Victoria, Australia',
 'port matilda pa',
 'on a catwalk somewhere',
 'Holland MI via Houston, CLE',
 'East Lansing, MI',
 'Based out of Portland, Oregon',
 'Ogba, Lagos, Nigeria',
 'Garrett',
 'ÌÏT: -26.695807,27.837865',
 'The Universe',
 'Terlingua, Texas',
 'The ?? below ???',
 'The Netherlands',
 'America | New Zealand ',
 'North Dartmouth, Massachusetts',
 'Oklahoma City',
 'Arvada, CO',
 'MIchigan',
 'Glendale, CA',
 'Winnipeg, Manitoba',
 'Ideally under a big tree',
 'The Orwellion police-state',
 'Cape Cod, Massachusetts USA',
 'Temporary Towers',
 'All around the world!',
 'Alvin, TX',
 'Washington State',
 'Bellville, Ohio',
 'International Action',
 'Enniscrone & Aughris, Sligo ',
 'Indianapolis, IN',
 'Thane',
 'ÌÏT: 10.614817868480726,12.195582811791382',
 'Stay Tuned ;) ',
 'Kingswinford',
 'Mooseknuckle, Maine',
 'Tros
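
Because `random.sample` draws a different subset on every run, the glimpse above is not reproducible. A minimal sketch of one way to make it repeatable is to sample through a seeded `random.Random` instance; the seed value and the short `locations` list below are assumptions for illustration only.

```python
import random

# A few location values taken from the sample above
locations = ["Milton keynes", "germany", "Burbank,CA", "Reston, VA, USA",
             "Frankfort, KY", "The Universe"]

# Two generators created with the same seed produce the same sample
first = random.Random(42).sample(locations, k=3)
second = random.Random(42).sample(locations, k=3)

print(first == second)
```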

A glimpse at 50 randomly selected unique values from the 'keyword' column

In [403]:
random.sample(list((train_df.keyword.unique())),k=50)


['sinking',
 'sirens',
 'destroyed',
 'survivors',
 'battle',
 'curfew',
 'twister',
 'meltdown',
 'blazing',
 'hostage',
 'refugees',
 'blizzard',
 'structural%20failure',
 'mass%20murder',
 'hailstorm',
 'blaze',
 'burning',
 'wounded',
 'evacuation',
 'weapons',
 'collided',
 'drown',
 nan,
 'earthquake',
 'flames',
 'crush',
 'eyewitness',
 'desolation',
 'wreck',
 'screamed',
 'hijack',
 'buildings%20on%20fire',
 'flooding',
 'radiation%20emergency',
 'fatality',
 'drowned',
 'thunder',
 'destroy',
 'floods',
 'devastation',
 'aftershock',
 'army',
 'avalanche',
 'hostages',
 'danger',
 'drought',
 'disaster',
 'collide',
 'mudslide',
 'displaced']

A glimpse at 30 randomly selected rows from the 'text' column

In [404]:
# Selecting 30 rows from the text column at random
random.sample(list((train_df.text)),k=30)

['@Jude_Mugabi not that all abortions get you traumatised. At times you are okay with the decision due to reasons like rape',
 "Ever since my Facebook #Mets meltdown after the Padres fiasco- mets are 6-0. You're welcome",
 'OH MY GOD RYANS IN TROUBLE http://t.co/ADIp0UnXHU',
 'Windstorm lastingness perquisite - acquiesce in a twister retreat: ZiUW http://t.co/iRt4kkgsJx',
 '@wocowae Police Officer Wounded Suspect Dead After Exchanging Shots http://t.co/oiOeCbsh1f ushed',
 '#Flood in Bago Myanmar #We arrived Bago',
 'saving babies from burning buildings soaking cake in a shit tonne of alcohol mat is a man after my own heart ?? #GBBO',
 'http://t.co/FhI4qBpwFH @FredOlsenCruise Please take the #FaroeIslands off your itinerary until the mass murder of dolphins &amp; whales stops.',
 'U.S. record hurricane drought. http://t.co/fE9hIVfMxq',
 'I liked a @YouTube video http://t.co/V57NUgmGKT US CANADA RADIATION UPDATE EMERGENCY FISHING CLOSURES',
 'My take away: preservation parks r an imposit

## Observation

Some data inconsistencies or redundant information found in the dataset are as follows:

* Upper case and lower case at unexpected locations
* Punctuation
* Numbers in text
* Use of city, state and country names (granularity problem)
* Special characters such as \x89ÛÒ and \n
* Hyperlinks
* Tags in tweets
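
Some of the observations above can be checked with a quick pattern count before any cleaning is done. The sketch below is only illustrative: the pattern names, the simplified regular expressions and the helper `count_patterns` are assumptions for this example, and the four tweets are taken from the random sample shown earlier.

```python
import re

samples = [
    "OH MY GOD RYANS IN TROUBLE http://t.co/ADIp0UnXHU",
    "#Flood in Bago Myanmar #We arrived Bago",
    "U.S. record hurricane drought. http://t.co/fE9hIVfMxq",
    "I liked a @YouTube video http://t.co/V57NUgmGKT US CANADA RADIATION UPDATE EMERGENCY FISHING CLOSURES",
]

# Simplified patterns, good enough for a rough count
PATTERNS = {
    "link": re.compile(r"https?://\S+"),
    "mention": re.compile(r"@\w+"),
    "hashtag": re.compile(r"#\w+"),
}

def count_patterns(texts):
    """Count how many texts contain at least one match for each pattern."""
    return {name: sum(1 for t in texts if pat.search(t))
            for name, pat in PATTERNS.items()}

counts = count_patterns(samples)
print(counts)
```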




## Cleaning the data

The method below does the following to clean anomalies in the 'location' column:

* Changes all the text to lower case
* Removes punctuation
* Removes text containing numbers
* Removes city names if country/state names are mentioned, by keeping only the last comma-separated part (high-level granularity is maintained)

In [405]:
# Method to change text to lower case, keep the last comma-separated part, and remove punctuation and numbers
def cleaning_location(text):
    text = str(text)
    text = text.lower()    # lower case
    text = text.split(',')[-1].strip() # removing city names when country/state name is present
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # removing punctuation
    text = re.sub(r'\w*\d\w*', '', text) # removing text with numbers
    return text

# Applying the method to location column
train_df.location = train_df.location.apply(cleaning_location)
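
One side effect worth knowing: `str()` turns a missing value into the literal string `'nan'`, so missing locations end up as the text "nan" after this step. The sketch below shows a hypothetical NaN-aware variant (the name `clean_location` and the choice to return an empty string are assumptions, not part of the notebook) together with a few values from the sample above.

```python
import math
import re
import string

def clean_location(value):
    """Hypothetical variant of cleaning_location that maps missing values to ''."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    text = str(value).lower()
    text = text.split(",")[-1].strip()                                # keep the last comma-separated part
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # remove punctuation
    text = re.sub(r"\w*\d\w*", "", text)                              # remove tokens containing digits
    return text.strip()

examples = ["Reston, VA, USA", "Burbank,CA", "Milton keynes", float("nan")]
cleaned = [clean_location(v) for v in examples]
print(cleaned)
```

Whether missing locations should become an empty string, stay NaN, or get a placeholder depends on how the column is used later.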

The method below does the following to clean anomalies in 'keyword' column:

* Changes all the text to lower case
* Removes punctuation
* Replaces numbers with a space (as %20, the URL encoding of a space, was found between two words)

In [408]:
def cleaning_keyword(text):
    text = str(text)
    text = text.lower()    # lower case
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # removing punctuation
    text = re.sub(r'\d+', ' ', text)   # replacing numbers with a space
    return text

# Applying the method to keyword column
train_df.keyword = train_df.keyword.apply(cleaning_keyword)
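
Since `%20` is percent-encoding for a space, an alternative to stripping punctuation and digits is to decode the keyword. This is a sketch of that alternative, not the approach used in this notebook; `decode_keyword` is a hypothetical helper name.

```python
from urllib.parse import unquote

def decode_keyword(value):
    """Decode percent-encoded characters (e.g. %20) and lower-case the keyword."""
    return unquote(str(value)).lower().strip()

keywords = ["structural%20failure", "mass%20murder", "hailstorm"]
decoded = [decode_keyword(k) for k in keywords]
print(decoded)
```

Decoding leaves keywords that legitimately contain digits or punctuation intact, whereas the regex approach removes them.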

The method below does the following to clean anomalies in the 'text' column:

* Changes all the text to lower case
* Removes words starting with @ to remove the tags and mentions, for example @barackobama
* Adds a column with hashtag values
* Removes links
* Removes punctuation
* Removes words with numbers
* Removes special characters, for example \x89û, \x89ûó
* Removes '\n'
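
The order of these steps matters for the hashtag column: hashtags have to be collected from the raw tweet, because punctuation removal also strips the `#` marker. A minimal sketch, assuming a hypothetical helper `extract_hashtags` and using one tweet from the sample above:

```python
import re
import string

# '#' followed by a letter and at least one more letter, digit, '_' or '-'
HASHTAG = re.compile(r"#([A-Za-z][A-Za-z0-9_-]+)")

def extract_hashtags(text):
    """Return the hashtags in `text`, without '#', joined by spaces."""
    return " ".join(HASHTAG.findall(text))

raw = "#Flood in Bago Myanmar #We arrived Bago"

# Extracting from the raw tweet finds both hashtags
tags_before = extract_hashtags(raw)

# After lower-casing and punctuation removal there is nothing left to find
stripped = re.sub("[%s]" % re.escape(string.punctuation), "", raw.lower())
tags_after = extract_hashtags(stripped)

print(repr(tags_before), repr(tags_after))
```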


In [410]:
train_df["Hasthags"] = ''

def retreive_hashtags(text):
    # Retrieving hashtags and dropping the leading '#'
    hash_tag = re.sub('#', '', ' '.join(re.findall(r'(#[A-Za-z]+[A-Za-z0-9_-]+)', text)))
    return hash_tag

def cleaning_text(text):
    text = str(text)
    text = text.lower()    # lower case
    text = re.sub(r'@[A-Za-z]+[A-Za-z0-9_-]+', '', text)  # removing any word starting with @
    text = re.sub(r'(?:https?://|www\.)\S+', '', text)  # removing links
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # removing punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # removing words with numbers
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # removing non-ASCII special characters
    text = text.replace("\n", "")  # str.replace returns a new string, so the result is reassigned
    return text

# Extracting hashtags before the text is cleaned
train_df.Hasthags = train_df.text.apply(retreive_hashtags)

# Applying the method to the text column in the dataframe
train_df.text = train_df.text.apply(cleaning_text)
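
Two Python details are easy to get wrong in a cleaning function like this one. First, `|` has the lowest precedence in a regular expression, so a pattern such as `https|www|http\S+` removes only the bare prefix `https` or `www` and leaves the rest of the link behind. Second, `str.replace` returns a new string rather than modifying the original. The sketch below demonstrates both; the sample strings and the `example.com` address are made up for illustration.

```python
import re

text = "see https://t.co/abc or www.example.com now"

# Ungrouped alternation: only 'https' and 'www' themselves are removed
loose = re.sub(r"https|www|http\S+", "", text)

# Grouped alternation: the prefix plus the rest of the link is removed
strict = re.sub(r"(?:https?://|www\.)\S+", "", text)

print(repr(loose))
print(repr(strict))

# str.replace does not change the string in place
line = "first\nsecond"
line.replace("\n", " ")            # result discarded, `line` is unchanged
unchanged = line
fixed = line.replace("\n", " ")    # keep the returned string instead
print(repr(unchanged), repr(fixed))
```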

In [412]:
train_df

Unnamed: 0,id,keyword,location,text,target,Hasthags
0,1,,,our deeds are the reason of this earthquake ma...,1,earthquake
1,4,,,forest fire near la ronge sask canada,1,
2,5,,,all residents asked to shelter in place are be...,1,
3,6,,,people receive wildfires evacuation orders in...,1,wildfires
4,7,,,just got sent this photo from ruby alaska as s...,1,Alaska wildfires
...,...,...,...,...,...,...
7608,10869,,,two giant cranes holding a bridge collapse int...,1,
7609,10870,,,the out of control wild fires in california ...,1,
7610,10871,,,s of volcano hawaii,1,
7611,10872,,,police investigating after an ebike collided w...,1,


A glimpse of the cleaned 'text' column

In [413]:
train_df.text

0       our deeds are the reason of this earthquake ma...
1                   forest fire near la ronge sask canada
2       all residents asked to shelter in place are be...
3        people receive wildfires evacuation orders in...
4       just got sent this photo from ruby alaska as s...
                              ...                        
7608    two giant cranes holding a bridge collapse int...
7609      the out of control wild fires in california ...
7610                                 s of volcano hawaii 
7611    police investigating after an ebike collided w...
7612    the latest more homes razed by northern califo...
Name: text, Length: 7613, dtype: object

Downloading the 'punkt' tokenizer models, which `nltk.word_tokenize` needs

In [370]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/shashanksharma/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [371]:
from nltk.stem import WordNetLemmatizer

In [372]:
wordnet_lemmatizer = WordNetLemmatizer()

Tokenizing the first cleaned tweet

In [373]:
nltk.word_tokenize(train_df.text[0])

['our',
 'deeds',
 'are',
 'the',
 'reason',
 'of',
 'this',
 'earthquake',
 'may',
 'allah',
 'forgive',
 'us',
 'all']
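
Because the cleaned text consists only of lower-case words separated by whitespace, the tokens shown for the first row can also be reproduced with plain `str.split`. This is a sketch for checking the result without the 'punkt' download, not a general replacement for `nltk.word_tokenize`; the string below is rebuilt from the token list in the output.

```python
# First cleaned tweet, reconstructed from the token list in the output
cleaned = "our deeds are the reason of this earthquake may allah forgive us all"

# Whitespace tokenization is enough once punctuation has been removed
tokens = cleaned.split()

print(len(tokens), tokens[:4])
```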