# Data Cleaning

## Introduction

* This notebook covers a necessary step: cleaning the data before it is used for exploratory data analysis (EDA).
* The input of this notebook is a training dataset in CSV format sourced from Kaggle: https://www.kaggle.com/c/nlp-getting-started
* The output of this notebook is a CSV file with clean, lemmatized text data.

## Reading and Understanding the dataset

In [24]:
# Importing libraries
# Note: the NLTK tokenizer, POS tagger, WordNet and stopwords data must be downloaded once with nltk.download(...)
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re           # regular expressions
import string       # string handling
import random       # for selecting random rows
import nltk         # Natural Language Toolkit
from nltk.stem import WordNetLemmatizer  # used for lemmatizing the text
from nltk.corpus import wordnet          # WordNet POS constants used when lemmatizing
from nltk.corpus import stopwords        # stopwords to be removed from text


Read the training data into a pandas dataframe and view the first 20 rows.

In [25]:
train_df = pd.read_csv("train.csv")
train_df.head(20)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
5,8,,,#RockyFire Update => California Hwy. 20 closed...,1
6,10,,,#flood #disaster Heavy rain causes flash flood...,1
7,13,,,I'm on top of the hill and I can see a fire in...,1
8,14,,,There's an emergency evacuation happening now ...,1
9,15,,,I'm afraid that the tornado is coming to our a...,1


The first 20 rows of the 'keyword' and 'location' columns are NaN. Next, check whether the other columns of the dataframe contain nulls.

In [26]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
id          7613 non-null int64
keyword     7552 non-null object
location    5080 non-null object
text        7613 non-null object
target      7613 non-null int64
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


Checking the number of unique values in 'keyword' and 'location'.

In [27]:
len(train_df.keyword.unique())

222

In [28]:
len(train_df.location.unique())

3342
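
One caveat about these counts: `Series.unique()` returns NaN as one of its values, so `len(...unique())` is one higher than the number of distinct non-null entries whenever a column has missing values, as both columns do here. A minimal sketch on a made-up series (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical column with one missing value and one repeated value.
s = pd.Series(["fire", "flood", None, "fire"])

n_with_missing = len(s.unique())   # the missing value counts as a distinct entry
n_without_missing = s.nunique()    # dropna=True by default, so it is excluded

print(n_with_missing)     # 3
print(n_without_missing)  # 2
```

Which of the two numbers is wanted depends on whether "missing" should be treated as a category of its own.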

A glimpse at 30 randomly selected unique values from the 'location' column


In [29]:
random.sample(list((train_df.location.unique())),k=30)


['middle eastern palace',
 'Breaking News',
 'Dover, DE',
 'World',
 'Pakistan, Islamabad',
 'Heinz Field ',
 'Concord, NH ',
 'Heaven',
 'Chiswick, London',
 'highlands&slands scotland',
 'Pluto',
 'watford',
 'Frostburg',
 'The land of New Jersey. ',
 'Use #TMW in tweets get #RT',
 'Auburn, AL',
 'Pune, mostly ',
 'POFFIN',
 'too far',
 'teh internets',
 'Nairobi , Kenya',
 'An eight-sided polygon',
 'Soufside',
 'Nashua NH',
 'Liberty Lake, WA',
 'Ecuador',
 'Bremerton, WA',
 '#BlackLivesMatter',
 'Philadelphia',
 'Vancouver, Colombie-Britannique']

A glimpse at 30 randomly selected unique values from the 'keyword' column

In [30]:
random.sample(list((train_df.keyword.unique())),k=30)


['crushed',
 'desolate',
 'structural%20failure',
 'landslide',
 'storm',
 'hostage',
 'rioting',
 'electrocute',
 'blaze',
 'natural%20disaster',
 'screamed',
 'flattened',
 'quarantined',
 'blazing',
 'bombing',
 'explode',
 'sirens',
 'devastated',
 'wounds',
 'annihilated',
 'exploded',
 'blizzard',
 'deaths',
 'war%20zone',
 'emergency%20plan',
 'fatal',
 'fire',
 'lightning',
 'destroyed',
 'displaced']

A glimpse at 30 randomly selected rows from the 'text' column

In [31]:
# Selecting 30 random rows from the text column
random.sample(list((train_df.text)),k=30)

["+ DID YOU SAY TO HIM!!?!?!?!' and phil actually collapsed on the gravel sobbing endlessly with a crowd watching him confused angry mad+",
 '320 [IR] ICEMOON [AFTERSHOCK] | http://t.co/e14EPzhotH | @djicemoon | #Dubstep #TrapMusic #DnB #EDM #Dance #Ices\x89Û_ http://t.co/22a9D5DO6q',
 'My sis can now sit on a cam w/o panicking https://t.co/GiYaaD7dcc',
 "@fa07af174a71408 I have lived &amp; my family have lived in countries where looters were shot on sight where rioting wasn't tolerated. Why here",
 "@schelbertgeorg Thanks. I'm teaching an online class &amp; asking my students lots of questions like this. Sorry for the deluge of Ren. art!",
 "If you fill your mind with encouragement and positivity then it won't take you hostage. Be careful of your content",
 "People really still be having curfew even when they're 18 &amp; graduated high school ??",
 'HE CALLED IT A MUDSLIDE AW',
 'they say bad things happen for a reason\nbut no wise words gonna stop te bleeding',
 '@fadelurker @dalinth

## Observation

Some data inconsistencies and redundant information found in the dataset are as follows:

* Upper case and lower case at unexpected locations
* Punctuation
* Numbers in text
* Mixed use of city, state and country names (granularity problem)
* Special characters such as \x89ÛÒ and \n
* Hyperlinks
* Tags in tweets




## Cleaning the data

The method below does the following to clean anomalies in the 'location' column:

* Changes all the text to lower case
* Removes punctuation
* Removes words containing numbers
* Keeps only the part after the last comma, which drops the city name when a state or country name follows it (the coarser level of granularity is kept)

In [32]:
# Method to change text to lower case, keep the last comma-separated part and remove punctuation
def cleaning_location(text):
    text = str(text)
    text = text.lower()    # lower case
    text = text.split(',')[-1].strip()  # keeping the part after the last comma (drops the city when a state/country follows)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # removing punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # removing words containing numbers
    return text

# Applying the method to location column
train_df.location = train_df.location.apply(cleaning_location)
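
To make the effect of these steps concrete, here is a small sketch that repeats the same logic on a few locations from the random sample shown earlier, plus a missing value. The function body is copied so the block runs on its own; `samples` and `cleaned` are illustrative names.

```python
import re
import string

def cleaning_location(text):
    """Same steps as the notebook's method, repeated so the sketch is self-contained."""
    text = str(text).lower()
    text = text.split(',')[-1].strip()  # keep only the part after the last comma
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

samples = ['Dover, DE', 'Nairobi , Kenya', 'Pakistan, Islamabad', float('nan')]
cleaned = [cleaning_location(s) for s in samples]
print(cleaned)  # ['de', 'kenya', 'islamabad', 'nan']
```

Two limitations are visible here. The rule assumes the broader region comes last, so 'Pakistan, Islamabad' keeps the city rather than the country. And because of `str(text)`, a missing value becomes the literal string 'nan', which is consistent with the later `info()` output reporting no nulls in 'location'.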

The method below does the following to clean anomalies in the 'keyword' column:

* Changes all the text to lower case
* Removes punctuation
* Replaces numbers with a space (as %20 was found between two words)

In [33]:
def cleaning_keyword(text):
    text = str(text)
    text = text.lower()    # lower case
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # removing punctuation
    text = re.sub(r'(\d+)', ' ', text)   # replacing numbers with a space
    return text

# Applying the method to keyword column
train_df.keyword = train_df.keyword.apply(cleaning_keyword)
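
The '%20' in the keywords is a URL-encoded space. The sketch below traces how the two regex steps turn it back into a space, and compares the result with decoding it directly using `urllib.parse.unquote` from the standard library. The `unquote` variant is an alternative shown for comparison, not what this notebook uses.

```python
import re
import string
from urllib.parse import unquote

def cleaning_keyword(text):
    """Same steps as the notebook's method, repeated so the sketch is self-contained."""
    text = str(text).lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # '%' is dropped here
    text = re.sub(r'\d+', ' ', text)  # the leftover '20' becomes a space
    return text

raw = 'natural%20disaster'
via_regex = cleaning_keyword(raw)
via_unquote = unquote(raw).lower()

print(via_regex)    # natural disaster
print(via_unquote)  # natural disaster
```

Both give the same result for these keywords. The regex route would also strip any genuine digits from a keyword, while `unquote` only decodes percent-escapes.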

The method below does the following to clean anomalies in the 'text' column:

* Changes all the text to lower case
* Removes words starting with @ to remove tags and mentions, for example @barackobama
* Strips the '#' sign from hashtags as part of punctuation removal, keeping the hashtag word itself
* Removes links
* Removes punctuation
* Removes words containing numbers
* Removes special characters, for example \x89û, \x89ûó etc.
* Removes newline characters ('\n')


In [34]:
def cleaning_text(text):
    
    text = str(text)
    text = text.lower()    # lower case
    
    text = re.sub(r'@\w+', '', text)  # removing mentions, i.e. any word starting with @
    text = re.sub(r'https?\S+|www\S+', '', text)  # removing links (words starting with http, https or www)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # removing punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # removing words containing numbers
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # removing non-ASCII special characters
    text = text.replace("\n", " ")  # str.replace returns a new string, so the result must be assigned
    return text
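
Two details in a cleaning function like this are easy to get wrong: how alternation binds in a link pattern, and the fact that Python string methods return a new string instead of modifying in place. The sketch below illustrates both on a made-up tweet; the variable names are illustrative.

```python
import re

tweet = 'photo https://t.co/abc\nmore text'

# Ungrouped alternation: 'https' on its own is a complete alternative, so only
# those five characters are removed and the rest of the link survives.
loose = re.sub(r'https|www|http\S+', '', tweet)

# Here each alternative carries its own \S+, so the whole link is removed.
strict = re.sub(r'https?\S+|www\S+', '', tweet)

# str.replace returns a new string; calling it without assignment changes nothing.
strict.replace('\n', ' ')
unchanged = strict
assigned = strict.replace('\n', ' ')

print(repr(loose))      # 'photo ://t.co/abc\nmore text'
print(repr(unchanged))  # 'photo \nmore text'
print(repr(assigned))   # 'photo  more text'
```

Replacing the newline with a space rather than an empty string keeps the words on either side from being fused together.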

The following method converts NLTK POS tags to WordNet tags, which the lemmatizing method further below uses to lemmatize each word.

In [35]:
def wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None
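
The tags produced by `nltk.pos_tag` follow the Penn Treebank scheme, where the first letter identifies the broad word class (for example 'VBD' is a past-tense verb and 'NNS' a plural noun). As a sketch, the same mapping can be written as a dictionary lookup on that first letter. The plain strings below stand in for `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV` so the block runs without NLTK; `wordnet_tag_lookup` is a hypothetical name.

```python
# WordNet POS constants written as plain strings; in NLTK these are
# wordnet.ADJ ('a'), wordnet.VERB ('v'), wordnet.NOUN ('n') and wordnet.ADV ('r').
TAG_PREFIX_TO_WORDNET = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def wordnet_tag_lookup(nltk_tag):
    """Hypothetical dictionary-based variant of wordnet_tag."""
    return TAG_PREFIX_TO_WORDNET.get(nltk_tag[:1])  # None for any other tag

tags = ['JJ', 'VBD', 'NNS', 'RB', 'DT']
mapped = [wordnet_tag_lookup(t) for t in tags]
print(mapped)  # ['a', 'v', 'n', 'r', None]
```

Tags that map to `None`, such as the determiner 'DT', are the ones left unlemmatized by the method that follows.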

The following method lemmatizes a sentence in these steps:

* Cleans the sentence by calling the cleaning_text method on it.
* Tokenizes the cleaned sentence and generates an NLTK POS tag for each word.
* Converts each NLTK POS tag to a WordNet POS tag by calling the wordnet_tag method.
* Removes the stopwords.
* Lemmatizes the remaining tokens using their POS tags and joins them into a sentence of lemmatized words.


In [36]:
lemmatizing = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # built once instead of once per word

def lemmatize_sentence(sentence):
    sentence = cleaning_text(sentence)  # cleaning the sentence
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  # tokenizing and tagging each word
    wordnet_tagged = [(word, wordnet_tag(tag)) for word, tag in nltk_tagged]  # converting the NLTK tags to WordNet tags
    
    # Lemmatizing the tagged tokens
    lemmatized_sentence = []  # list of lemmatized words
    for word, tag in wordnet_tagged:
        if word not in stop_words:  # removing stopwords
            if tag is None:
                lemmatized_sentence.append(word)  # keeping the word as it is if it has no WordNet POS tag
            else:
                lemmatized_sentence.append(lemmatizing.lemmatize(word, tag))  # lemmatizing the token using the POS tag
    return " ".join(lemmatized_sentence)



In [37]:
# Applying the lemmatizing method to text column 
train_df.text= train_df.text.apply(lemmatize_sentence)

# Applying the lemmatizing method to keyword column
train_df.keyword = train_df.keyword.apply(lemmatize_sentence)

Let's have a look at the cleaned dataframe

In [38]:
train_df.sample(n=20)

Unnamed: 0,id,keyword,location,text,target
3550,5074,famine,universe,export food wont solve problem african end fam...,1
4800,6832,loud bang,kenya,break news unconfirmed heard loud bang nearby ...,0
5146,7338,nuclear reactor,,finnish minister fennovoima nuclear reactor go...,0
2523,3626,desolation,on twitter,yeah lamb god rock ring introdesolation hd via,0
3394,4858,evacuation,queensland,evacuation drill work fire door wouldnt open g...,0
7112,10191,violent storm,,dramatic video show plane land violent storm,1
1509,2177,catastrophic,ny,learn destructive volcanic event us history th...,1
5171,7374,obliterate,,ever want obliterate entire specie face earth ...,0
5200,7425,obliterate,tennessee,wacko like michelebachman predict world soon o...,1
1828,2628,crashed,ne,bug almost crash euro,1


Checking if text from any row was completely removed due to cleaning

In [39]:
train_df.loc[train_df.text == '']

Unnamed: 0,id,keyword,location,text,target
5115,7295,nuclear reactor,,,0


There is one row with id 7295 where the whole text was removed. 

The original train data is read again to see the content of the row with id 7295

In [40]:
df_crosscheck = pd.read_csv('train.csv')
df_crosscheck.loc[df_crosscheck.id == 7295]

Unnamed: 0,id,keyword,location,text,target
5115,7295,nuclear%20reactor,,Err:509,0


The text in that row is an error message ('Err:509') rather than a tweet, so the cleaning step was right to remove it.

The row is removed from the dataframe and the dataframe is saved as a csv file to be used for EDA.

In [41]:
train_df = train_df[train_df.text != '']
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7612 entries, 0 to 7612
Data columns (total 5 columns):
id          7612 non-null int64
keyword     7612 non-null object
location    7612 non-null object
text        7612 non-null object
target      7612 non-null int64
dtypes: int64(2), object(3)
memory usage: 356.8+ KB


In [42]:
train_df.to_csv("Clean_train_data.csv")
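
By default `to_csv` also writes the dataframe index as an unnamed first column, which comes back as a column called 'Unnamed: 0' when the file is read again for EDA. If that column is not wanted, `index=False` leaves it out. A minimal sketch using an in-memory buffer and a made-up two-row dataframe (neither is part of this notebook):

```python
import io
import pandas as pd

# Hypothetical stand-in for the cleaned dataframe.
df = pd.DataFrame({"id": [1, 4], "text": ["deed reason earthquake", "forest fire near la ronge"]})

# Default behaviour: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the real columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = list(pd.read_csv(without_index).columns)

print(columns_with_index)     # ['Unnamed: 0', 'id', 'text']
print(columns_without_index)  # ['id', 'text']
```

Keeping the index can still be useful here, since it preserves the original row positions after the empty row was dropped.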

Cleaning test data using the predefined methods.

In [43]:
test_df = pd.read_csv('test.csv')

In [44]:
# Applying the method to location column
test_df.location = test_df.location.apply(cleaning_location)
# Applying the method to keyword column
test_df.keyword = test_df.keyword.apply(cleaning_keyword)

# Applying the lemmatizing method to text column 
test_df.text= test_df.text.apply(lemmatize_sentence)

# Applying the lemmatizing method to keyword column 
test_df.keyword= test_df.keyword.apply(lemmatize_sentence)

In [45]:
test_df.to_csv('Clean_test_data.csv')