# Data Cleaning

## Introduction

* This notebook covers a necessary step: cleaning the data before it is used for exploratory data analysis.
* The input of this notebook is a training dataset in CSV format sourced from Kaggle: https://www.kaggle.com/c/nlp-getting-started
* The output of this notebook is a CSV file with clean and lemmatized text data.

## Reading and Understanding the dataset

In [482]:
# Importing libraries
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re           # regular expressions
import string       # string constants (e.g. string.punctuation)
import random       # for selecting random rows
import nltk         # Natural Language Toolkit
from nltk.stem import WordNetLemmatizer  # used for lemmatizing the text
from nltk.corpus import wordnet          # WordNet POS constants used when lemmatizing
from nltk.corpus import stopwords        # stopwords to be removed from the text


Reading the training data into a pandas dataframe and viewing the first 20 rows of data.

In [483]:
train_df = pd.read_csv("train.csv")
train_df.head(20)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |
| 5 | 8 | | | #RockyFire Update => California Hwy. 20 closed... | 1 |
| 6 | 10 | | | #flood #disaster Heavy rain causes flash flood... | 1 |
| 7 | 13 | | | I'm on top of the hill and I can see a fire in... | 1 |
| 8 | 14 | | | There's an emergency evacuation happening now ... | 1 |
| 9 | 15 | | | I'm afraid that the tornado is coming to our a... | 1 |


The first 20 rows of the 'keyword' and 'location' columns are NaN. Next, checking whether the other columns of the dataframe contain nulls.

In [484]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
id          7613 non-null int64
keyword     7552 non-null object
location    5080 non-null object
text        7613 non-null object
target      7613 non-null int64
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


Checking the number of unique values in the 'keyword' and 'location' columns.

In [485]:
len(train_df.keyword.unique())

222

In [486]:
len(train_df.location.unique())

3342
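Both columns contain nulls, and `Series.unique()` keeps NaN as one of its values, so the two counts above each include one entry for the missing value. A small sketch on made-up data (not the Kaggle file) contrasting this with `nunique()`, which skips missing values by default:

```python
import pandas as pd

# Toy series standing in for a column with missing values (illustrative data only)
s = pd.Series(["flood", None, "flood", "fire", None])

with_nan = len(s.unique())               # unique() keeps the missing value as one entry
without_nan = s.nunique()                # nunique() drops missing values by default
also_with_nan = s.nunique(dropna=False)  # opt in to counting the missing value

print(with_nan, without_nan, also_with_nan)
```

Which count is appropriate depends on whether "missing" should be treated as a category of its own in later analysis.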

A glimpse at 50 randomly selected unique values from the 'location' column


In [487]:
random.sample(list((train_df.location.unique())),k=50)


['Konoha Village',
 'Rhyme Or Reason?',
 'kano',
 'va',
 'Inside the Beltway (DC Area)',
 'Hogwarts',
 'Delhi ',
 'Richmond Heights, OH',
 'Oregon, USA',
 'Durham N.C ',
 'D.C. - Baltimore - Annapolis',
 'NYC,US - Cali, Colombia',
 'Glenview to Knoxville ',
 'DC',
 'Lima, Peru',
 'Terlingua, Texas',
 'Madrid, Comunidad de Madrid',
 'Nevada, USA',
 'SEATTLE, WA USA',
 'the insane asylum. ',
 'Harlingen, TX',
 'Chicagoland',
 'Jackson TN',
 'Detroit/Windsor',
 'somewhere too cold for me',
 'Free State, South Africa',
 '[@blackparavde is my frankie]',
 'District 12 - Orange County',
 'WorldWide',
 "The Sun's Corona",
 'Passamaquoddy',
 'Tulalip, Washington',
 'yorkshire\n',
 'Alliston Ontario',
 'Washington, DC NATIVE',
 'atx',
 'Amman,Jordan',
 'Montgomery, AL',
 'Laguna Beach, Calif. ',
 'in the Word of God',
 'The Multiverse',
 'Desde Republica Argentina',
 'DFW, Texas',
 'EspaÌ±a, Spain',
 'Chester ',
 'New Your',
 'middle eastern palace',
 'The Desert of the Real',
 'Peterborough, On
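Because `random.sample` draws a fresh sample on every run, the values listed here will differ between executions. A minimal sketch of a reproducible draw, assuming a fixed seed is acceptable; `sample_values` and the short list are illustrative names and data, not part of the notebook:

```python
import random

# Hypothetical stand-in for list(train_df.location.unique())
locations = ["Hogwarts", "Delhi ", "Oregon, USA", "DC", "Lima, Peru", "atx"]

def sample_values(values, k, seed=0):
    """Return k values drawn without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, leaves the global random state untouched
    return rng.sample(values, k)

first = sample_values(locations, k=3, seed=42)
second = sample_values(locations, k=3, seed=42)
print(first == second)  # the same seed gives the same draw
```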

A glimpse at 50 randomly selected unique values from the 'keyword' column

In [488]:
random.sample(list((train_df.keyword.unique())),k=50)


['panicking',
 'emergency%20plan',
 'drowning',
 'flood',
 'arson',
 'accident',
 'seismic',
 'blood',
 'bloody',
 'ambulance',
 'crush',
 'refugees',
 'wildfire',
 'wrecked',
 'engulfed',
 'hostages',
 'twister',
 'dust%20storm',
 'nuclear%20disaster',
 'injuries',
 'catastrophe',
 'curfew',
 'trauma',
 'earthquake',
 'crash',
 'upheaval',
 'violent%20storm',
 'epicentre',
 'fire%20truck',
 'armageddon',
 'emergency',
 'collapsed',
 'siren',
 'electrocute',
 'airplane%20accident',
 'deluged',
 'blown%20up',
 'burning%20buildings',
 'lightning',
 'bombed',
 'destroy',
 'hailstorm',
 'rioting',
 'thunder',
 'explosion',
 'terrorist',
 'cyclone',
 'deluge',
 'suicide%20bombing',
 'screams']

A glimpse at 30 randomly selected rows from the 'text' column

In [489]:
# Selecting 30 random rows from the text column
random.sample(list((train_df.text)),k=30)

["#Nevada's \x89Û÷exceptional\x89Ûª #drought steady at ~11%; ~ 95% of #NV in drought: http://t.co/Nyo1xueBFA @DroughtGov http://t.co/w0a1MJOrHY",
 'Wow Crackdown 3 uses multiple servers in multiplayer?!?! U can destroy whole buildings?!?! #copped',
 'WWI WWII JAPANESE ARMY NAVY MILITARY JAPAN LEATHER WATCH WAR MIDO WW1 2 - Full read by eBay http://t.co/QUmcE7W2tY http://t.co/KTKG2sDhHl',
 'illegal alien released by Obama/DHS 4 times Charged With Rape &amp; Murder of Santa Maria CA Woman Had Prior Offenses  http://t.co/MqP4hF9GpO',
 'Japan marks 70th anniversary of Hiroshima atomic bombing http://t.co/a2SS7pr4gW',
 'If the Taken movies took place in India 2 (Vine by @JusReign) https://t.co/hxM8C8e33D',
 'Last Second Ebay Bid RT? http://t.co/oEKUcq4ZL0 Shaolin Rescuers (dvd 2010) Shen Chan Nan Chiang Five Venoms Kung Fu ?Please Favori',
 'USFS an acronym for United States Fire Service. http://t.co/8NAdrGr4xC',
 "The worst  voice I can ever hear is the 'Nikki your in trouble' voice from m

## Observation

Some data inconsistencies and redundant information found in the dataset are as follows:

* Upper and lower case used inconsistently
* Punctuation
* Numbers in text
* Mixed use of city, state and country names in 'location' (granularity problem)
* Special characters such as \x89ÛÒ and \n
* Hyperlinks
* Tags in tweets
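
To get a rough sense of how common some of these issues are, the tweets can be scanned with a few regular expressions. The sketch below is a heuristic under simple assumptions (links start with `http://` or `https://`, mentions with `@`, hashtags with `#`); `PATTERNS` and `count_matches` are illustrative names, and the three tweets are adapted from the samples shown earlier:

```python
import re

# Patterns for three of the issues listed above (simple heuristics, not exhaustive)
PATTERNS = {
    "link": re.compile(r"https?://\S+"),
    "mention": re.compile(r"@\w+"),
    "hashtag": re.compile(r"#\w+"),
}

def count_matches(tweets):
    """Count how many tweets contain at least one match for each pattern."""
    return {
        name: sum(1 for tweet in tweets if rx.search(tweet))
        for name, rx in PATTERNS.items()
    }

tweets = [
    "Japan marks 70th anniversary of Hiroshima atomic bombing http://t.co/a2SS7pr4gW",
    "If the Taken movies took place in India 2 (Vine by @JusReign) https://t.co/hxM8C8e33D",
    "#flood #disaster Heavy rain causes flash flood",
]
counts = count_matches(tweets)
print(counts)
```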




## Cleaning the data

The method below does the following to clean anomalies in the 'location' column:

* Changes all the text to lower case
* Keeps only the part after the last comma, which drops the city name when a state/country name follows it (the coarser level of granularity is kept)
* Removes punctuation
* Removes words that contain digits

In [490]:
# Method to change text to lower case and remove punctuation
def cleaning_location(text):
    text = str(text)
    text = text.lower()    # lower case
    text = text.split(',')[-1].strip()  # keeping only the part after the last comma (state/country)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # removing punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # removing words that contain digits
    return text

# Applying the method to location column
train_df.location = train_df.location.apply(cleaning_location)
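
As a quick check of what this cleaning does, the sketch below applies the same steps to a few values from the random sample shown earlier, plus a missing value. The function is repeated here only so that the example runs on its own:

```python
import re
import string

def cleaning_location(text):
    # Same steps as the notebook's method, repeated so the example is self-contained
    text = str(text).lower()
    text = text.split(',')[-1].strip()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

samples = ['Richmond Heights, OH', 'Lima, Peru', 'Amman,Jordan', 'Hogwarts', float('nan')]
cleaned = [cleaning_location(s) for s in samples]
print(cleaned)
```

The last value shows a side effect worth keeping in mind: `str()` turns a missing value into the literal string `'nan'`, so missing locations are no longer NaN after this step.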

The method below does the following to clean anomalies in the 'keyword' column:

* Changes all the text to lower case
* Removes punctuation
* Replaces numbers with a space (once the '%' is stripped, the '20' left over from '%20' sits between two words)

In [491]:
def cleaning_keyword(text):
    text = str(text)
    text = text.lower()    # lower case
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # removing punctuation
    text = re.sub(r'(\d+)', ' ', text)   # replacing numbers with a space
    return text

# Applying the method to keyword column
train_df.keyword = train_df.keyword.apply(cleaning_keyword)
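
The `%20` in keywords such as 'dust%20storm' is URL percent-encoding for a space. As an alternative sketch (not what the notebook does), the standard library's `urllib.parse.unquote` decodes it directly; `decode_keyword` is a hypothetical helper name:

```python
from urllib.parse import unquote

def decode_keyword(text):
    """Decode percent-encoding (e.g. %20 becomes a space) and lower-case the keyword."""
    return unquote(str(text)).lower()

keywords = ['dust%20storm', 'emergency%20plan', 'flood']
decoded = [decode_keyword(k) for k in keywords]
print(decoded)
```

Unlike the digit-replacement approach, this would leave any genuine digits in a keyword untouched.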

The following method extracts the hashtags from the tweets. Hashtags often carry useful context for interpreting a tweet.

In [492]:
def retreive_hashtags(text):
    # finding every hashtag, joining them with spaces, then stripping the '#'
    hash_tag = re.sub('#', '', ' '.join(re.findall(r'(#[A-Za-z]+[A-Za-z0-9-_]+)', text)))
    return hash_tag
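
A short illustration of what this method returns, with the function repeated so the example runs on its own. The first input is adapted from one of the tweets shown earlier:

```python
import re

def retreive_hashtags(text):
    # Same pattern as the notebook's method, repeated so the example is self-contained
    return re.sub('#', '', ' '.join(re.findall(r'(#[A-Za-z]+[A-Za-z0-9-_]+)', text)))

tagged = retreive_hashtags('#flood #disaster Heavy rain causes flash flood')
untagged = retreive_hashtags('Forest fire near La Ronge Sask. Canada')
skipped = retreive_hashtags('#a #2day')

print(repr(tagged), repr(untagged), repr(skipped))
```

The pattern requires a letter right after the `#` and at least two characters in total, so single-letter tags and tags that start with a digit are not captured.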

The following method cleans anomalies in the 'text' column. Together with the hashtag extraction applied in the same cell, it does the following:

* Changes all the text to lower case
* Removes words starting with @ to remove the tags and mentions, for example @barackobama
* Adds a column with hashtag values
* Removes links
* Removes punctuation
* Removes words that contain digits
* Removes special characters, for example \x89û and \x89ûó
* Removes '\n'


In [493]:
def cleaning_text(text):
    text = str(text)
    text = text.lower()    # lower case
    text = re.sub(r'@[A-Za-z]+[A-Za-z0-9-_]+', '', text)  # removing mentions such as @barackobama
    text = re.sub(r'(?:https?://|www\.)\S+', '', text)  # removing whole links (http://, https://, www.)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # removing punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # removing words that contain digits
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # removing non-ASCII special characters
    text = text.replace("\n", "")  # str.replace returns a new string, so the result must be assigned
    return text


# Applying the method to retrieve hashtags and add them to the hashtag column
train_df["Hasthags"] = ''
train_df.Hasthags= train_df.text.apply(retreive_hashtags)

# Applying the method to the text column in the dataframe
train_df.text= train_df.text.apply(cleaning_text)
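
Link removal is the step where regex alternation is easiest to get wrong, because `|` has the lowest precedence in a pattern and each alternative is matched on its own. A standalone sketch, assuming links start with `http://`, `https://` or `www.` and run to the next whitespace; `URL_PATTERN` and `strip_links` are illustrative names:

```python
import re

# Assumption: a link starts with http://, https:// or www. and ends at the next whitespace
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+')

def strip_links(text):
    """Remove links and collapse the whitespace they leave behind."""
    without_links = URL_PATTERN.sub('', text)
    return ' '.join(without_links.split())

tweet = 'If the Taken movies took place in India 2 (Vine by @JusReign) https://t.co/hxM8C8e33D'
stripped = strip_links(tweet)
print(stripped)
```

Grouping the prefixes with `(?:...)` makes the trailing `\S+` apply to every alternative, so the whole link is removed rather than just its scheme.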

The following method converts NLTK part-of-speech tags to WordNet tags, which the lemmatization method further down uses to lemmatize each word.

In [494]:
def wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None
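
The same prefix-based mapping can also be written as a dictionary lookup. The sketch below is an equivalent formulation, not the notebook's code; it spells out the WordNet POS codes as literals (`'a'`, `'v'`, `'n'`, `'r'`, the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`) so that it runs without any NLTK data, and `wordnet_tag_lookup` is a hypothetical name:

```python
# WordNet POS codes written as literals so the example needs no NLTK corpus download
PREFIX_TO_WORDNET = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def wordnet_tag_lookup(nltk_tag):
    """Map a Penn Treebank tag to a WordNet POS code, or None if it has no counterpart."""
    return PREFIX_TO_WORDNET.get(nltk_tag[:1])

tags = ['JJ', 'VBD', 'NNS', 'RB', 'DT']
mapped = [wordnet_tag_lookup(tag) for tag in tags]
print(mapped)
```

Tags without a WordNet counterpart, such as the determiner tag `DT`, map to `None`, which is the case where the token is kept as is.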

In [495]:
lemmatizing = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # built once instead of once per token

def lemmatize_sentence(sentence):
    # tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

    # pairs of (token, wordnet_tag)
    wordnet_tagged = [(word, wordnet_tag(tag)) for word, tag in nltk_tagged]

    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if word not in stop_words:
            if tag is None:
                # if there is no available tag, append the token as is
                lemmatized_sentence.append(word)
            else:
                # else use the tag to lemmatize the token
                lemmatized_sentence.append(lemmatizing.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)



In [496]:
# Lemmatizing the hashtag column
train_df.Hasthags= train_df.Hasthags.apply(lemmatize_sentence)

# Applying the method to the text column in the dataframe
train_df.text= train_df.text.apply(lemmatize_sentence)

In [497]:
train_df.



| | id | keyword | location | text | target | Hasthags |
|---|---|---|---|---|---|---|
| 5805 | 8286 | rioting | the weird part of wonderland | people riot everywhere think id one usamisan | 1 | |
| 7187 | 10297 | weapon | massachusetts | upload video asap guy get see new weapon type ... | 0 | |
| 1858 | 2671 | crush | | que crushmtvhottest justin bieber | 0 | MTVHottest |
| 4134 | 5881 | hailstorm | heaven | grow calgary avoids bad city wicked weather | 1 | |
| 7380 | 10563 | windstorm | ab | precious olive tree lose battleanother crazy w... | 1 | yyc |
| 857 | 1239 | blood | egypt | people tattoo u allow donate blood receive blo... | 1 | tattoo |
| 3026 | 4345 | dust storm | dutchenglishgerman | new mad max screenshots show lovely dust storm... | 0 | |
| 3836 | 5459 | first responders | tn near nashville | shoot event theater give free coffee first res... | 1 | TN |
| 979 | 1418 | body bag | new york | genuine leather man bag messenger fit ipad min... | 0 | |
| 4902 | 6978 | massacre | stay tuned | rise coates charleston massacre walter scott b... | 1 | |


In [None]:
train
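
The introduction states that the output of this notebook is a CSV file. A minimal sketch of that step on a toy frame, writing to an in-memory buffer so that it runs anywhere; the rows are illustrative, and a real run would pass a file path (for example a hypothetical "train_clean.csv") instead of the buffer:

```python
import io
import pandas as pd

# Toy frame standing in for the cleaned train_df (illustrative rows only)
clean_df = pd.DataFrame({
    "id": [1, 4],
    "text": ["deed reason earthquake", "forest fire near la ronge sask canada"],
    "target": [1, 1],
})

# index=False keeps the row index out of the file, so reading it back
# does not produce an extra "Unnamed: 0" column
buffer = io.StringIO()  # stands in for a file path
clean_df.to_csv(buffer, index=False)

buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))
```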