# Data Cleaning

This notebook covers the cleaning of the augmented dataset. We remove the redundant rows and columns, then export the cleaned datasets to .csv format at the end of the notebook.

### Setting up the Environment
The required libraries are imported first, and then the datasets are read into the environment.

In [1]:
# Import required libraries
import pandas as pd
import numpy as np

In [2]:
# Read in Data
Train = pd.read_csv('Augmented/Train.csv', index_col = [0])
Test = pd.read_csv('Augmented/Test.csv', index_col = [0])
Val = pd.read_csv('Augmented/Val.csv', index_col = [0])

### Looking at the Data

We print the first few rows of the training data and look at the columns present.

In [3]:
# View data
Train.head(5)

| | ID | date | labeled | message | original | language | genre | related | PII | request | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2010-01-01 | 0 | With the cooperation of First Hawaiian Bank, t... | | en | news | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 2 | 2010-01-01 | 1 | PEWODEN FIFTH SECTION OF THE DEPARTEMEN OF L'A... | Pewoden 5em Seksyon Depatman Atibonit ap fe no... | ht | direct | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 3 | 2010-01-01 | 1 | Today on a call with Dr. Chan, Director Genera... | | en | news | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 2010-01-01 | 0 | YANGON, Jul 08, 2008 (Xinhua via COMTEX News N... | | en | news | 1 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 2010-01-01 | 1 | Throughout the year there were growing signs o... | | en | news | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [4]:
# List of columns and types
Train.dtypes

ID                         int64
date                      object
labeled                    int64
message                   object
original                  object
language                  object
genre                     object
related                    int64
PII                        int64
request                    int64
offer                      int64
aid_related                int64
medical_help               int64
medical_products           int64
search_and_rescue          int64
security                   int64
military                   int64
child_alone                int64
water                      int64
food                       int64
shelter                    int64
clothing                   int64
money                      int64
missing_people             int64
refugees                   int64
death                      int64
other_aid                  int64
infrastructure_related     int64
transport                  int64
buildings                  int64
electricit

In [5]:
#Looking at the number of values for categorical features - Training data
Categorical_vals = ['language', 'genre',
       'related', 'PII', 'request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
       'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report']

for features in Categorical_vals:
    print('Values for column {}: '.format(features), set(Train[features]),', Length: ',len(set(Train[features])),'\n')

Values for column language:  {'pl', 'lb', 'sl', 'sw', 'lt', 'xh', 'la', 'eu', 'de', 'et', 'eo', 'vi', 'qu', 'sq', 'fr', 'en', 'an', 'id', 'fi', 'cs', 'ms', 'ro', 'fo', 'nb', 'nl', 'da', 'jv', 'oc', 'lv', 'mg', 'tr', 'vo', 'it', 'ku', 'rw', 'af', 'zu', 'mt', 'br', 'wa', 'pt', 'ca', 'es', 'sk', 'ga', 'nn', 'sv', 'cy', 'bs', 'no', 'tl', 'hr', 'ht'} , Length:  53 

Values for column genre:  {'direct', 'social', 'news'} , Length:  3 

Values for column related:  {0, 1, 2} , Length:  3 

Values for column PII:  {0} , Length:  1 

Values for column request:  {0, 1} , Length:  2 

Values for column offer:  {0} , Length:  1 

Values for column aid_related:  {0, 1} , Length:  2 

Values for column medical_help:  {0, 1} , Length:  2 

Values for column medical_products:  {0, 1} , Length:  2 

Values for column search_and_rescue:  {0, 1} , Length:  2 

Values for column security:  {0, 1} , Length:  2 

Values for column military:  {0, 1} , Length:  2 

Values for column child_alone:  {0} , Length:

Below, we look at the unique values of each categorical feature in the test and validation datasets.

In [6]:
#Looking at the number of values for categorical features - Testing data
Categorical_vals = ['language', 'genre',
       'related', 'PII', 'request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
       'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report']

for features in Categorical_vals:
    print('Values for column {}: '.format(features), set(Test[features]),', Length: ',len(set(Test[features])),'\n')

Values for column language:  {'pl', 'lb', 'sl', 'lt', 'sw', 'xh', 'eu', 'de', 'eo', 'fr', 'en', 'hu', 'id', 'fi', 'ms', 'nb', 'nl', 'da', 'jv', 'lv', 'tr', 'it', 'rw', 'af', 'zu', 'mt', 'br', 'pt', 'wa', 'es', 'nn', 'sv', 'cy', 'tl', 'hr', 'ht'} , Length:  36 

Values for column genre:  {'direct', 'social', 'news'} , Length:  3 

Values for column related:  {0, 1, 2} , Length:  3 

Values for column PII:  {0} , Length:  1 

Values for column request:  {0, 1} , Length:  2 

Values for column offer:  {0, 1} , Length:  2 

Values for column aid_related:  {0, 1} , Length:  2 

Values for column medical_help:  {0, 1} , Length:  2 

Values for column medical_products:  {0, 1} , Length:  2 

Values for column search_and_rescue:  {0, 1} , Length:  2 

Values for column security:  {0, 1} , Length:  2 

Values for column military:  {0, 1} , Length:  2 

Values for column child_alone:  {0} , Length:  1 

Values for column water:  {0, 1} , Length:  2 

Values for column food:  {0, 1} , Length:  2 

In [7]:
#Looking at the number of values for categorical features - Validation data
Categorical_vals = ['language', 'genre',
       'related', 'PII', 'request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
       'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report']

for features in Categorical_vals:
    print('Values for column {}: '.format(features), set(Val[features]),', Length: ',len(set(Val[features])),'\n')

Values for column language:  {'pl', 'sl', 'sw', 'xh', 'eu', 'de', 'eo', 'fr', 'en', 'an', 'fi', 'id', 'ro', 'ms', 'gl', 'nb', 'nl', 'da', 'oc', 'jv', 'it', 'rw', 'af', 'br', 'pt', 'wa', 'es', 'cy', 'sv', 'bs', 'no', 'tl', 'hr', 'ht'} , Length:  34 

Values for column genre:  {'direct', 'social', 'news'} , Length:  3 

Values for column related:  {0, 1, 2} , Length:  3 

Values for column PII:  {0} , Length:  1 

Values for column request:  {0, 1} , Length:  2 

Values for column offer:  {0} , Length:  1 

Values for column aid_related:  {0, 1} , Length:  2 

Values for column medical_help:  {0, 1} , Length:  2 

Values for column medical_products:  {0, 1} , Length:  2 

Values for column search_and_rescue:  {0, 1} , Length:  2 

Values for column security:  {0, 1} , Length:  2 

Values for column military:  {0, 1} , Length:  2 

Values for column child_alone:  {0} , Length:  1 

Values for column water:  {0, 1} , Length:  2 

Values for column food:  {0, 1} , Length:  2 

Values for co

We can see that the feature 'related' takes 3 values. According to the source of the data, 0 represents no, 1 represents yes, and 2 represents maybe. We will drop the rows where 'related' equals 2, since an ambiguous label is of little use for training. We will also drop the columns that have only 1 possible value in the training dataset, namely 'PII', 'offer', and 'child_alone', since a constant column carries no information for training. The function also drops the 'genre' column.
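
As a minimal sketch, the single-valued columns and the ambiguous rows could also be located programmatically rather than by reading the printed sets. The frame `toy_train` and its values are hypothetical stand-ins, not the real training data.

```python
import pandas as pd

# Toy stand-in for the training split (hypothetical values, not the real data)
toy_train = pd.DataFrame({
    'message': ['a', 'b', 'c', 'd'],
    'related': [0, 1, 2, 1],
    'PII': [0, 0, 0, 0],
    'offer': [0, 0, 0, 0],
    'request': [0, 1, 0, 1],
})

# Columns with a single distinct value carry no signal for training
n_unique = toy_train.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()

# Number of rows whose 'related' value is the ambiguous 2
n_maybe = int((toy_train['related'] == 2).sum())

print(constant_cols)  # ['PII', 'offer']
print(n_maybe)        # 1
```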

We define the function to perform these tasks below.

In [8]:
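# Function to clean data
def clean_data(df):
    # Drop the columns that are single-valued in the training data, along with 'genre'
    df_ = df.drop(['PII', 'offer', 'child_alone', 'genre'], axis = 1)
    # Remove rows where 'related' is 2; reset_index() keeps the old index as an 'index' column
    clean_df = df_[df_.related != 2].reset_index()

    return clean_df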

In [9]:
# Run function
Train_cleaned = clean_data(Train)
Test_cleaned = clean_data(Test)
Val_cleaned = clean_data(Val)
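
To illustrate what the function does, here is a small sketch that applies the same logic to a toy frame. The function body is repeated so the example is self-contained, and `toy` with its values is hypothetical. Note that `reset_index()` without `drop=True` keeps the old row labels in a new leading `index` column, which shifts the position of every other column by one.

```python
import pandas as pd

def clean_data(df):
    # Same logic as the notebook's function, repeated here for a self-contained sketch
    df_ = df.drop(['PII', 'offer', 'child_alone', 'genre'], axis=1)
    return df_[df_.related != 2].reset_index()

# Hypothetical toy data with one ambiguous row (related == 2)
toy = pd.DataFrame({
    'message': ['a', 'b', 'c'],
    'genre': ['news', 'direct', 'social'],
    'related': [1, 2, 0],
    'PII': [0, 0, 0],
    'offer': [0, 0, 0],
    'child_alone': [0, 0, 0],
    'request': [1, 0, 0],
})

toy_cleaned = clean_data(toy)

print(toy_cleaned.columns.tolist())   # ['index', 'message', 'related', 'request']
print(toy_cleaned['index'].tolist())  # [0, 2]
```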

### Exporting to .csv

After running the function, we now have our cleaned dataset, which we will export to .csv format, to be used for further analysis.

In [10]:
# Exporting the cleaned datasets in csv format
Train_cleaned.to_csv('Cleaned/Train.csv')
Test_cleaned.to_csv('Cleaned/Test.csv')
Val_cleaned.to_csv('Cleaned/Val.csv')
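
A hedged sketch of the CSV round trip: `to_csv` writes the row index as an unnamed first column by default, and `read_csv` with `index_col = [0]` reads it back as the index, so the data columns are unchanged on reload. The in-memory buffer and the frame `toy_cleaned` are assumptions for illustration; the notebook writes to files instead.

```python
import io
import pandas as pd

# Hypothetical cleaned frame, including the 'index' column left by reset_index()
toy_cleaned = pd.DataFrame({'index': [0, 2], 'related': [1, 0]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_cleaned.to_csv(buffer)  # the row index is written as the first column
buffer.seek(0)

# Read it back, using the first column as the index again
restored = pd.read_csv(buffer, index_col=[0])

print(restored.columns.tolist())     # ['index', 'related']
print(restored['related'].tolist())  # [1, 0]
```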

### (Additional) Preparing the New Test Data

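In addition, for the new test data, we set all the category labels to 0 wherever a message is unlabeled (that is, where 'labeled' equals 0). This simulates a real-world scenario in which labels are not available for such messages. The new test data is read from the cleaned validation file.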

In [11]:
Test_new = pd.read_csv('Cleaned/Val.csv', index_col = [0])

In [12]:
# Set every label column (position 7 onwards) to 0 for the unlabeled messages
unlabeled = (Test_new['labeled'] == 0).to_numpy()
Test_new.iloc[unlabeled, 7:] = 0

In [13]:
Test_new.to_csv('Production/Val.csv')
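
The masking step above selects the label columns by position. As a sketch of an alternative that does not depend on column order, the same step could select them by name. The frame `toy_test` and the list `label_cols` are hypothetical and much shorter than the notebook's 34 label columns.

```python
import pandas as pd

# Hypothetical toy frame; the second message is unlabeled
toy_test = pd.DataFrame({
    'message': ['a', 'b', 'c'],
    'labeled': [1, 0, 1],
    'related': [1, 1, 0],
    'request': [1, 1, 0],
})

# Hypothetical label list for the toy frame
label_cols = ['related', 'request']

# Boolean mask of unlabeled rows, then zero out their labels in one assignment
unlabeled = toy_test['labeled'] == 0
toy_test.loc[unlabeled, label_cols] = 0

print(toy_test[label_cols].values.tolist())  # [[1, 1], [0, 0], [0, 0]]
```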