# Data Cleaning

In [82]:
# Importing libraries
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import re            # regular expressions
import string        # string handling (string.punctuation)
import random        # selecting random rows

Reading the training data into a dataframe using pandas and viewing the first 20 rows of data.

In [163]:
train_df = pd.read_csv("train.csv")
#train_df.head(20)

The first 20 rows of the 'keyword' and 'location' columns are NaN. Next, checking whether the other columns of the dataframe contain nulls.

In [84]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
id          7613 non-null int64
keyword     7552 non-null object
location    5080 non-null object
text        7613 non-null object
target      7613 non-null int64
dtypes: int64(2), object(3)
memory usage: 297.5+ KB
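
Reading the `info()` output above, 7613 - 7552 = 61 values are missing in 'keyword' and 7613 - 5080 = 2533 in 'location', while the other columns are complete. A more direct way to get such counts is sketched below. The small dataframe `sample_df` is a hypothetical stand-in for `train.csv`, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of train.csv; the values are made up for illustration.
sample_df = pd.DataFrame({
    "id": [1, 2, 3],
    "keyword": [None, "ablaze", "accident"],
    "location": [None, None, "Fairfax, VA"],
    "text": ["tweet one", "tweet two", "tweet three"],
    "target": [1, 0, 1],
})

# isna() flags missing cells; sum() then counts them per column.
null_counts = sample_df.isna().sum()
print(null_counts)
```

On the real data, `train_df.isna().sum()` should give the same per-column view without subtracting by hand.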


Checking the number of unique values in 'keyword' and 'location'.

In [85]:
len(train_df.keyword.unique())

222

In [86]:
len(train_df.location.unique())

3342
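
Note that `unique()` treats a missing value as a value of its own, so counts obtained with `len(... .unique())` on columns that contain nulls include one extra entry. The sketch below shows the difference on a hypothetical four-element series (`keywords` is not part of the notebook); `nunique()` drops missing values by default.

```python
import pandas as pd

# Hypothetical keyword column with one missing entry.
keywords = pd.Series(["ablaze", None, "ablaze", "accident"])

# unique() keeps the missing value as a distinct entry.
count_with_missing = len(keywords.unique())

# nunique() ignores missing values unless dropna=False is passed.
count_without_missing = keywords.nunique()

print(count_with_missing, count_without_missing)  # 3 2
```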

A glimpse at 50 randomly selected unique values from the 'location' column


In [87]:
random.sample(list(train_df.location.unique()), k=50)


['This Is Paradise. Relax. ',
 'Nairobi, Kenya',
 'back in japan ??????????',
 '956',
 'Fairfax, VA',
 'Winnipeg, Manitoba',
 'Somewhere in the Canada',
 'Lyallpur, Pakistan',
 'Suburban Detroit, Michigan',
 'United States',
 'Oregon, USA',
 'Watch Those Videos -',
 'Loughborough, England',
 'ayr',
 'Nashua NH',
 'i got 1/13 menpa replies, omg',
 'Ontario, Canada',
 '60th St (SS)',
 'Daruka (near Tamworth) NSW',
 'South, USA',
 "Dil's Campsite",
 'Brazil ',
 'Los Angeles, CA',
 'International ',
 'nap central',
 'Denver, Colorado',
 '??????????????',
 'Reston, VA, USA',
 'Global-NoLocation',
 'The Circle of Life',
 'PSN: Pipbois ',
 'Proudly Canadian!',
 'El Paso, TX',
 'Flipadelphia',
 '?????? ??? ?????? ????????',
 'California, USA',
 '302???? 815',
 'Crayford, London',
 'Pacific Northwest',
 '9/1/13',
 'Aztec NM',
 'm3, k, a, d',
 'Roanoke VA',
 'shoujo hell ',
 'South West, England',
 'Davis, California',
 'To The Right of You!',
 '11th dimension, los angeles',
 'Harlingen, TX',
 '

In [120]:
train_df.keyword.unique()[:30]

array(['nan', 'ablaze', 'accident', 'aftershock', 'airplane accident',
       'ambulance', 'annihilated', 'annihilation', 'apocalypse',
       'armageddon', 'army', 'arson', 'arsonist', 'attack', 'attacked',
       'avalanche', 'battle', 'bioterror', 'bioterrorism', 'blaze',
       'blazing', 'bleeding', 'blew up', 'blight', 'blizzard', 'blood',
       'bloody', 'blown up', 'body bag', 'body bagging'], dtype=object)

Some data inconsistencies found in the 'location' and 'keyword' columns are as follows:

* Mixed upper and lower case
* Punctuation
* Numbers in text
* Mixed use of city, state and country names



The following method cleans anomalies in the 'location' column. It does the following:

* Changes all the text to lower case
* Removes punctuation
* Removes words that contain numbers

In [176]:
# Method to change text to lower case, remove punctuation and drop words containing digits
def cleaning_location(text):
    text = str(text)       # note: a missing value (NaN) becomes the string 'nan'
    text = text.lower()    # lower case
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # removing punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # removing words that contain a digit
    return text

# Applying the method to location column
train_df.location = train_df.location.apply(cleaning_location)
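
To make the effect of these steps concrete, here is a small worked example on a few values taken from the random sample above, plus a NaN. The function is repeated so the snippet runs on its own; the `samples` list is an illustrative assumption, not a cell from the notebook.

```python
import re
import string

def cleaning_location(text):
    text = str(text)
    text = text.lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

# A few raw location values, plus a missing value represented as NaN.
samples = ['Fairfax, VA', '60th St (SS)', '9/1/13', float('nan')]
cleaned = [cleaning_location(s) for s in samples]
print(cleaned)  # ['fairfax va', ' st ss', '', 'nan']
```

Two side effects are worth keeping in mind: purely numeric locations collapse to an empty string, and missing values turn into the literal string 'nan' because of the `str(text)` conversion.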

The following method cleans anomalies in the 'keyword' column. It does the following:

* Changes all the text to lower case
* Removes punctuation
* Replaces numbers with a space (%20 was found between two words; once the % is removed as punctuation, the remaining 20 becomes the space)

In [179]:
def cleaning_keyword(text):
    text = str(text)       # note: a missing value (NaN) becomes the string 'nan'
    text = text.lower()    # lower case
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # removing punctuation
    text = re.sub(r'\d+', ' ', text)   # replacing numbers with a space
    return text

# Applying the method to keyword column
train_df.keyword = train_df.keyword.apply(cleaning_keyword)
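
Since %20 is the URL encoding of a space, an alternative is to decode the keyword before cleaning it rather than replacing digits afterwards. The sketch below compares the two; `decode_keyword` is a hypothetical helper that is not part of the notebook, and it assumes the keywords are percent-encoded strings.

```python
import re
import string
from urllib.parse import unquote

def cleaning_keyword(text):
    # Same steps as the notebook's method, repeated so the snippet is self-contained.
    text = str(text).lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\d+', ' ', text)
    return text

def decode_keyword(text):
    # Hypothetical alternative: decode percent-escapes first, then clean.
    text = unquote(str(text)).lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    return text

raw = 'airplane%20accident'
print(cleaning_keyword(raw))  # airplane accident
print(decode_keyword(raw))    # airplane accident
```

Both give the same result here. The decoding variant would additionally keep digits that are genuinely part of a keyword, should any exist.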

In [180]:
# Selecting 30 random rows from the text column
random.sample(list(train_df.text), k=30)

['american weapons and support are fueling a bloody air war in yemen ',
 ' strange loud impact bang noises under train to epsom about to arrive wimbledon',
 ' thank you i survived ',
 'video were picking up bodies from water rescuers are searching for hundreds of migrants in the mediterran ',
 '  well i think that sounds like a fine plan where little derailment is possible so i applaud you ',
 'people who try to jwalk while an ambulance is passing i hate you',
 'check these out     nsfw',
 'more homes razed by northern calif wildfire  sandiego ',
 ' when u do a fatality and like the corpse is still jittering',
 'the latest more homes razed by northerncalifornia wildfire  zippednews ',
 'ahmazing story of the power animal rescuers have a starving homeless dog with no future was rescued by a person ',
 ' i hope that mountain dew erodes your throat and floods your lungs leaving you to drown to death',
 '  do anything to fix that of the few people he had every trusted in his life charles w

The following method cleans anomalies in the 'text' column. It does the following:

* Changes all the text to lower case
* Removes words starting with @ (tags and mentions, for example @barackobama)
* Removes links
* Removes punctuation
* Removes words that contain numbers


In [170]:
def cleaning_text(text):
    text = str(text)
    text = text.lower()    # lower case
    text = re.sub(r'@\w+', '', text)     # removing any word starting with @
    text = re.sub(r'http\S+', '', text)  # removing any word starting with http
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # removing punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # removing words that contain a digit
    return text

# Applying the method to the text column in the dataframe
train_df.text = train_df.text.apply(cleaning_text)
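
The order of the steps matters: mentions and links have to be removed before punctuation, otherwise the @ and the :// that identify them are already gone. The worked example below runs the method on a made-up tweet (`tweet` is an assumption, not a row from the data). It also shows an optional whitespace clean-up, `normalize_whitespace`, a hypothetical helper that the notebook does not apply.

```python
import re
import string

def cleaning_text(text):
    text = str(text)
    text = text.lower()
    text = re.sub(r'@\w+', '', text)      # mentions first, while '@' is still present
    text = re.sub(r'http\S+', '', text)   # links next, while '://' is still present
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

def normalize_whitespace(text):
    # Hypothetical extra step: collapse runs of whitespace and trim the ends.
    return ' '.join(text.split())

tweet = '@BarackObama Fire at 5th Ave! http://t.co/abc123 #wildfire'
cleaned = cleaning_text(tweet)
print(repr(cleaned))                        # ' fire at  ave  wildfire'
print(repr(normalize_whitespace(cleaned)))  # 'fire at ave wildfire'
```

The leading and doubled spaces in the first result match what the random samples of cleaned text look like; collapsing them is optional and depends on how the text is tokenized later.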

In [172]:
random.sample(list(train_df.text), k=30)

[' whao  nigerian refugees repatriated from cameroon ',
 ' wk of rainier diet and my street seward park ave is inundated w bypass traffic so  whats your plan ',
 'and even if the stars and moon collide \x89ûó oh oh i never want you back to my life you can take your words and all ',
 'hollywood movie about trapped miners released in chile ',
 ' your tweet was quoted by   ',
 'i did another one i did another one you still aint done shit about the other one nigga body bagging meek',
 ' \n\npakistan says army helicopter has crashed in countrys restive northwest   fox news',
 '  pandemonium in aba as woman delivers baby without face  ',
 'latestnews police officer wounded suspect dead after exchanging shots richmond va ap \x89ûó a richmond pol ',
 'brooke just face timed me at the concert and just screamed for  minutes straight',
 'lets fraction the vital need for our fatalities  how would you break it down in education econom ',
 'lets not forget our wounded female veterans ',
 'have you r