# Data Augmentation

This notebook covers the augmentation of the initial dataset, which is available [here](https://www.kaggle.com/datasets/landlord/multilingual-disaster-response-messages?resource=download&select=disaster_response_messages_training.csv). To support this project's use case, the initial dataset was augmented with the following features and identifiers:
- Date
- Categorised label
- Ticket ID
- Language

The augmented datasets are exported in CSV format at the end of this notebook.

### Assumptions:
- There are approximately 2000 tickets in a single year.
- Around 30% of the tickets received are categorised, whereas the others aren't.

### Setting Up the Environment

Required libraries are first imported, and then the datasets are read into the environment.

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import random
import datetime
import langid

In [2]:
# Read in the training, testing, and validation datasets
Train_data = pd.read_csv('disaster_response_messages_training.csv')
Test_data = pd.read_csv('disaster_response_messages_test.csv')
Val_data = pd.read_csv('disaster_response_messages_validation.csv')


### Generating Random Dates

First, a function is set up to generate random dates. This function is then applied to the training, testing, and validation data. We assume that the app receives approximately 2000 tickets a year, so the training dates are generated over a period of 10 years (2010 to 2019), while the testing and validation dates are each generated over a period of 1 year (2020 and 2021, respectively).

In [3]:
# Define function to generate a random date between two dates
def random_date_generator(start, end):
    # Number of whole days between the two dates
    days = (end - start).days
    # randrange(days) returns 0..days-1, so `end` itself is never drawn
    rand = random.randrange(days)
    return start + datetime.timedelta(days=rand)

In [4]:
# Generate random dates for training data
n1 = len(Train_data)
Date = []
start_date = datetime.date(2010, 1, 1)
end_date = datetime.date(2019, 12, 31)
for i in range(n1):
    d_ = random_date_generator(start_date, end_date)
    Date.append(d_)

Train_data['date'] = Date

In [5]:
# Generate random dates for testing data
n2 = len(Test_data)
Date = []
start_date = datetime.date(2020, 1, 1)
end_date = datetime.date(2020, 12, 31)
for i in range(n2):
    d_ = random_date_generator(start_date, end_date)
    Date.append(d_)

Test_data['date'] = Date

In [6]:
# Generate random dates for validation data
n3 = len(Val_data)
Date = []
start_date = datetime.date(2021, 1, 1)
end_date = datetime.date(2021, 12, 31)
for i in range(n3):
    d_ = random_date_generator(start_date, end_date)
    Date.append(d_)

Val_data['date'] = Date
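
The three cells above repeat the same loop with different date ranges. As a minimal sketch, the loop could be wrapped in a single helper; `random_dates` and its `seed` parameter are hypothetical names that are not part of the original notebook. Unlike `random_date_generator`, this sketch includes the end date in the range and uses a local seeded generator so that the dates can be reproduced.

```python
import datetime
import random


def random_dates(n, start, end, seed=None):
    """Return n random dates drawn uniformly from [start, end], both inclusive."""
    # A local generator keeps the seed from affecting other random calls
    rng = random.Random(seed)
    span_days = (end - start).days
    # randint is inclusive on both ends, so `end` can be drawn
    return [start + datetime.timedelta(days=rng.randint(0, span_days))
            for _ in range(n)]


# Example: five dates in 2020, reproducible through the seed
dates = random_dates(5, datetime.date(2020, 1, 1), datetime.date(2020, 12, 31), seed=42)
```

Under these assumptions, each dataset would need only one call, for example `Test_data['date'] = random_dates(n2, start_date, end_date)`.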

### Generating Labels for Categorised Tickets

While submitting a ticket, a user can optionally categorise it (e.g. medical aid, food). Here, we assume that 30% of the tickets are categorised, while the others aren't. This section starts by defining a function that generates the labels identifying categorised tickets, where 1 means the ticket was categorised and 0 means it was not.

In [7]:
# Define function to generate labels - 1 (Categorised) or 0 (Not Categorised)
def label_generator(N, thres=0.3):
    labels = []
    for _ in range(N):
        # Label the ticket as categorised with probability `thres`
        labels.append(1 if np.random.uniform() < thres else 0)
    return labels

In [8]:
# Generate labels for training, testing, and validation data
Train_data['labeled'] = label_generator(n1)
Test_data['labeled'] = label_generator(n2)
Val_data['labeled'] = label_generator(n3)
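
The same labels can be drawn in a single vectorised step. The sketch below is an alternative, not the notebook's own method: `label_generator_vectorised` and its `seed` parameter are hypothetical, and it uses NumPy's `default_rng` generator instead of `np.random.uniform`.

```python
import numpy as np


def label_generator_vectorised(n, thres=0.3, seed=None):
    """Return n labels: 1 (categorised) with probability thres, else 0."""
    rng = np.random.default_rng(seed)
    # rng.random(n) draws n uniform values in [0, 1); compare them all at once
    return (rng.random(n) < thres).astype(int)


# Example: the share of categorised tickets should be close to 30%
labels = label_generator_vectorised(10_000, thres=0.3, seed=0)
share = labels.mean()
```

Because each label is drawn independently, the share of categorised tickets in any one dataset is only approximately 30%, not exactly 30%.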

### Generating Ticket ID

The ticket ID is generated in this section. Tickets submitted earlier receive smaller IDs. IDs start from 1 in the training data and continue without gaps through the testing and validation data.

In [9]:
# Sort data based on date (in ascending order)
Train_sorted = Train_data.sort_values('date')
Test_sorted = Test_data.sort_values('date')
Val_sorted = Val_data.sort_values('date')

In [10]:
# Generating ticket ID, with earlier tickets having smaller ID (starting from 1)
Train_sorted['ID'] = list(np.arange(1,n1+1))
Test_sorted['ID'] = list(np.arange(n1+1,n1+n2+1))
Val_sorted['ID'] = list(np.arange(n1+n2+1,n1+n2+n3+1))

In [11]:
# Reset index so that rows can be accessed by position in date order
# (the previous index is kept as an 'index' column)
Train_sorted.reset_index(inplace = True)
Test_sorted.reset_index(inplace = True)
Val_sorted.reset_index(inplace = True)
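
To illustrate the sort-then-number logic on a small scale, the sketch below applies it to toy data. `assign_ticket_ids`, `train_toy`, and `test_toy` are hypothetical names and values, not part of the original notebook, and the sketch uses a stable sort so that tickets sharing a date keep their original relative order.

```python
import datetime

import pandas as pd


def assign_ticket_ids(df, first_id):
    """Sort by date and number the rows consecutively, starting at first_id."""
    # mergesort is stable: rows with equal dates keep their original order
    out = df.sort_values('date', kind='mergesort').reset_index(drop=True)
    out['ID'] = list(range(first_id, first_id + len(out)))
    return out


# Toy data (made up for illustration)
train_toy = pd.DataFrame({
    'date': [datetime.date(2010, 3, 1), datetime.date(2010, 1, 15), datetime.date(2010, 2, 10)],
    'message': ['c', 'a', 'b'],
})
test_toy = pd.DataFrame({
    'date': [datetime.date(2020, 6, 1), datetime.date(2020, 2, 1)],
    'message': ['e', 'd'],
})

train_ids = assign_ticket_ids(train_toy, first_id=1)
# The next dataset continues where the previous one stopped
test_ids = assign_ticket_ids(test_toy, first_id=len(train_ids) + 1)
```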

### Detecting Language of Original Message

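Here, a column is added to identify the language of each ticket's original message, as detected by `langid`. Tickets with a missing original message are labelled as English (`'en'`).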

In [12]:
# Detect the language in each ticket for the training data
lang_train = []
for i in range(n1):
    msg = Train_sorted['original'][i]
    if pd.isna(msg):
        # Missing original message: default to English
        x = 'en'
    else:
        x = langid.classify(str(msg).lower())[0]
    lang_train.append(x)

Train_sorted['language'] = lang_train

In [13]:
# Detect the language in each ticket for the testing data
lang_test = []
for i in range(n2):
    msg = Test_sorted['original'][i]
    if pd.isna(msg):
        # Missing original message: default to English
        x = 'en'
    else:
        x = langid.classify(str(msg).lower())[0]
    lang_test.append(x)

Test_sorted['language'] = lang_test

In [14]:
# Detect the language in each ticket for the validation data
lang_val = []
for i in range(n3):
    msg = Val_sorted['original'][i]
    if pd.isna(msg):
        # Missing original message: default to English
        x = 'en'
    else:
        x = langid.classify(str(msg).lower())[0]
    lang_val.append(x)

Val_sorted['language'] = lang_val
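
Since the same detection logic runs on all three datasets, it could be factored into one helper. The sketch below is one possible shape, under stated assumptions: `detect_language` is a hypothetical name, and `toy_classify` is a stand-in classifier invented so that the example runs without `langid`. In the notebook, `langid.classify`, which returns a `(language, score)` tuple, would be passed in its place.

```python
import math


def detect_language(msg, classify, default='en'):
    """Return the language code for msg, or default when msg is missing."""
    # Missing values appear as None or as a float NaN
    if msg is None or (isinstance(msg, float) and math.isnan(msg)):
        return default
    # classify is expected to return a (language, score) tuple
    return classify(str(msg).lower())[0]


def toy_classify(text):
    """Stand-in classifier for illustration only; not a real language detector."""
    return ('fr', -1.0) if 'bonjour' in text else ('en', -1.0)


print(detect_language(float('nan'), toy_classify))
print(detect_language('Bonjour tout le monde', toy_classify))
```

With such a helper, each dataset would need a single line, for example `Train_sorted['language'] = [detect_language(m, langid.classify) for m in Train_sorted['original']]`.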

### Finishing Up

Finally, the relevant columns are selected, and the final augmented datasets are exported in CSV format.

In [15]:
# Selecting relevant columns (the same set for all three datasets)
columns = ['ID', 'date', 'labeled', 'message', 'original', 'language', 'genre', 'related', 'PII',
           'request', 'offer', 'aid_related', 'medical_help', 'medical_products',
           'search_and_rescue', 'security', 'military', 'child_alone', 'water',
           'food', 'shelter', 'clothing', 'money', 'missing_people', 'refugees',
           'death', 'other_aid', 'infrastructure_related', 'transport', 'buildings',
           'electricity', 'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
           'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
           'other_weather', 'direct_report']
Train_augmented = Train_sorted[columns]
Test_augmented = Test_sorted[columns]
Val_augmented = Val_sorted[columns]

In [16]:
# Exporting the augmented datasets in csv format
Train_augmented.to_csv('Augmented/Train.csv')
Test_augmented.to_csv('Augmented/Test.csv')
Val_augmented.to_csv('Augmented/Val.csv')
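
The export above expects the `Augmented` directory to exist already, and `to_csv` writes the DataFrame index as an extra unnamed column by default. A minimal sketch of a more defensive export follows; `export_csv` is a hypothetical helper, and dropping the index with `index=False` is an assumption about the desired output, not the notebook's original behaviour.

```python
from pathlib import Path

import pandas as pd


def export_csv(df, path):
    """Write df to path as CSV, creating the parent directory if needed."""
    path = Path(path)
    # Create e.g. 'Augmented/' when it does not exist yet
    path.parent.mkdir(parents=True, exist_ok=True)
    # index=False leaves out the row index, so only the selected columns are written
    df.to_csv(path, index=False)
    return path
```

Under these assumptions, the training export would become `export_csv(Train_augmented, 'Augmented/Train.csv')`.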