# Functions for preprocessing data
Below are the helper functions I wrote to preprocess the data before feeding it into the neural network.

## Sub functions

### `csv_to_dataframe`
Takes a variable number of CSV file paths and returns a list of pandas DataFrames (one per file) and a list holding the unique bookingIDs found in each file, which later functions use for identification.

In [0]:
import pandas as pd
import numpy as np

def csv_to_dataframe(*args):
    unique_IDs = []
    dataframes = []
    for i in args:
        temp_df = pd.read_csv(i, delimiter = ",")
        temp_df = temp_df.sort_values(by=["bookingID","second"])  # sort_values returns a new frame, so reassign

        temp_df = temp_df.drop(columns =["Accuracy", "Bearing", "Speed", "gyro_x", "gyro_y", "gyro_z"])
        # Rationale: might be able to tell with just accelerometer data. 

        temp_unique = temp_df.bookingID.unique()
        unique_IDs.append(temp_unique)
        dataframes.append(temp_df)
    return dataframes, unique_IDs
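
`DataFrame.sort_values` returns a sorted copy by default rather than sorting in place, so its result needs to be assigned back for the sort to take effect. A minimal sketch with made-up readings; the column names `bookingID` and `second` come from the code above, the values are hypothetical:

```python
import pandas as pd

# Hypothetical readings, deliberately out of order.
df = pd.DataFrame({
    "bookingID": [2, 1, 1],
    "second": [0.0, 1.0, 0.0],
})

df.sort_values(by=["bookingID", "second"])  # result discarded, df is unchanged
unsorted_ids = df["bookingID"].tolist()

sorted_df = df.sort_values(by=["bookingID", "second"])  # keep the returned copy
sorted_ids = sorted_df["bookingID"].tolist()
sorted_seconds = sorted_df["second"].tolist()
```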

### `preprocess_split_df`
Takes a single pandas DataFrame and the array of unique bookingIDs found in it, and returns a list in which each element is a DataFrame holding the rows for one bookingID.

In [0]:
def preprocess_split_df(dataframe, unique_ids): 
    data = []
    for id in unique_ids:
        temp = dataframe.loc[dataframe.bookingID == id]
        data.append(temp)
    return data
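
The boolean-mask loop above scans the whole DataFrame once per bookingID. As a possible alternative, `groupby` can produce the same per-booking frames in a single pass. This is a sketch rather than the notebook's method; `split_by_booking` is a hypothetical name, and the `reading` column and its values are made up:

```python
import pandas as pd

def split_by_booking(dataframe):
    # sort=False keeps groups in order of first appearance, like unique().
    return [group for _, group in dataframe.groupby("bookingID", sort=False)]

df = pd.DataFrame({
    "bookingID": [7, 7, 3, 3, 3],
    "reading": [0.1, 0.2, 0.3, 0.4, 0.5],  # hypothetical sensor column
})
parts = split_by_booking(df)
part_sizes = [len(p) for p in parts]
part_ids = [int(p["bookingID"].iloc[0]) for p in parts]
```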

### `preprocess_match_label`
Takes a list of per-booking DataFrames and the labels DataFrame, and pairs each DataFrame with its label (0 or 1) by bookingID. The bookingID column is dropped from each DataFrame once the pairing is made.

In [0]:
def preprocess_match_label(dataframes, dataframe_labels):
    output = []
    for dataframe in dataframes:
    
        # start matching against dataframe
        bookingID = dataframe["bookingID"].iloc[0]
        dataframe = dataframe.drop(columns = "bookingID")
    
        for labels in dataframe_labels.values:
            if(labels[0] == bookingID):
                output.append([dataframe, labels[1]])

    return output
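
The inner loop above rescans every label row for each booking. A dictionary lookup is one way to avoid that. The sketch below assumes the labels file has columns named `bookingID` and `label` (the original code only indexes them by position), and `match_labels` is a hypothetical name. Unlike the nested loop, it yields at most one pair per DataFrame even if a bookingID appears more than once in the labels:

```python
import pandas as pd

def match_labels(dataframes, dataframe_labels):
    # Map bookingID -> label once; a later duplicate overwrites an earlier one.
    lookup = dict(zip(dataframe_labels["bookingID"], dataframe_labels["label"]))
    output = []
    for dataframe in dataframes:
        booking_id = dataframe["bookingID"].iloc[0]
        if booking_id in lookup:
            output.append([dataframe.drop(columns="bookingID"), lookup[booking_id]])
    return output

labels = pd.DataFrame({"bookingID": [1, 2], "label": [0, 1]})  # assumed column names
frames = [
    pd.DataFrame({"bookingID": [2, 2], "second": [0.0, 1.0]}),
    pd.DataFrame({"bookingID": [9], "second": [0.0]}),  # no label, so it is skipped
]
pairs = match_labels(frames, labels)
```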

### `preprocess_convert_df_to_list`
Takes a list of `[dataframe, label]` pairs and converts it to a list of `[numpy_array, label]` pairs.

In [0]:
def preprocess_convert_df_to_list(data):
    output = []
    for dataframe, target in data:
        output.append([np.asarray(dataframe.values), target]) # .values is already a NumPy array, so np.asarray does not copy it
  
    return output

### `preprocess_pass_or_fail`
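Balances the number of dangerous and non-dangerous driving cases so that training is not biased towards either class. The larger class is truncated to the size of the smaller one, and the combined data is then shuffled.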

In [0]:
import random

def preprocess_pass_or_fail(data):

    not_dangerous = []
    dangerous = []
  
    for sequence, label in data:
        if label == 0:
            not_dangerous.append([sequence, label])
        elif label == 1:
            dangerous.append([sequence, label])

    # balance data
    lower = min(len(not_dangerous), len(dangerous))
  
    not_dangerous = not_dangerous[:lower]
    dangerous = dangerous[:lower]
  
    print(f'number of dangerous driving cases: {len(dangerous)}\nnumber of safe driving cases: {len(not_dangerous)}\n')
  
    balanced_data = not_dangerous + dangerous
    random.shuffle(balanced_data)
    
    x = [] # sequential data
    y = [] # labels
  
    for sequence, labels in balanced_data:
        x.append(sequence)
        y.append(labels)
    
    return np.array(x, dtype=object), y  # sequences differ in length, so store them as an object array
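
Slicing with `[:lower]` keeps whichever cases happen to come first in the larger class. If a random subset is preferred, the larger class can be sampled instead. A minimal sketch with plain Python lists; `balance_classes` and its `seed` parameter are hypothetical additions, not part of the notebook:

```python
import random

def balance_classes(data, seed=0):
    rng = random.Random(seed)  # local generator so results are repeatable
    safe = [pair for pair in data if pair[1] == 0]
    dangerous = [pair for pair in data if pair[1] == 1]
    lower = min(len(safe), len(dangerous))
    # sample() draws without replacement, so no case is duplicated.
    balanced = rng.sample(safe, lower) + rng.sample(dangerous, lower)
    rng.shuffle(balanced)
    return balanced

toy = [("a", 0), ("b", 0), ("c", 0), ("d", 1)]  # made-up (sequence, label) pairs
balanced = balance_classes(toy)
label_counts = [sum(1 for _, y in balanced if y == k) for k in (0, 1)]
```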

## Import csv files
**Note to reviewer:** ensure the csv files are in the appropriate locations OR alter the filepath for the labels in the body of the function `preprocess_data_for_training` and the filepaths given in its parameters.

In [6]:
# importing data from google cloud storage (GCS)
from google.colab import auth
auth.authenticate_user()

project_id='My First Project'
!gsutil cp -r gs://grab-ai-for-sea/features /content/
!gsutil cp -r gs://grab-ai-for-sea/labels /content/

Copying gs://grab-ai-for-sea/features/.DS_Store...
Copying gs://grab-ai-for-sea/features/part-00000-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv...
Copying gs://grab-ai-for-sea/features/part-00001-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv...
Copying gs://grab-ai-for-sea/features/part-00002-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv...
| [4 files][567.2 MiB/567.2 MiB]                                                
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying gs://grab-ai-for-sea/features/part-00003-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv...
Copying gs://grab-ai-for-sea/features/part-00004-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv...
Copying gs://grab-ai-for-sea/features/part-00005-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv...
Copying gs://grab-ai-for-sea

## Main function using sub functions
Runs all the sub functions above in sequence. Returns a NumPy array in which each element is a padded sequence of readings from a single bookingID, and a second array holding the matching labels.

**Note:** Change the location for the labels if necessary

In [7]:
from keras.preprocessing.sequence import pad_sequences

def preprocess_data_for_training(*args):
    print("converting csv to dataframes.....", end = '')
    dataframe_array, unique_id_array = csv_to_dataframe(*args)
    arg_count = len(args)
    print("done.")
  
    print("splitting dataframes.....", end = '')
    datas = []
    for i in range(arg_count):
        temp = preprocess_split_df(dataframe_array[i], unique_id_array[i])
        datas.append(temp)
    print("done.")

    print("matching dataframes and labels.....", end = '')
    # matching labels
    df_labels = pd.read_csv("/content/labels/part-00000-e9445087-aa0a-433b-a7f6-7f4c19d78ad6-c000.csv")
    df_and_labels = []
    for i in range(arg_count):
        temp = preprocess_match_label(datas[i], df_labels)
        df_and_labels.append(temp)
    print("done.")
  
    print("converting dataframes to lists.....", end = '')
    list_and_labels = []
    for i in range(arg_count):
        temp = preprocess_convert_df_to_list(df_and_labels[i])
        list_and_labels.append(temp)
    print("done.")
  
  
    print("balancing safe and dangerous datasets.....\n", end = '')
    x_array = []
    y_array = []

    for i in range(arg_count):
        temp_x, temp_y = preprocess_pass_or_fail(list_and_labels[i])
        x_array.append(temp_x)
        y_array.append(temp_y)
    print("done.")
    
    print("concatenating and padding return values.....", end = '')
    # join and pad data
    concat_values = np.concatenate(x_array)  
    concat_values = pad_sequences(concat_values, dtype='float32')

    concat_labels = np.concatenate(y_array)
    print("done.")

    return concat_values, concat_labels

Using TensorFlow backend.
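
To make the padding step concrete: with its default arguments, Keras's `pad_sequences` left-pads shorter sequences with zeros up to the longest length in the batch (`padding='pre'`, `value=0.0`). The NumPy sketch below imitates that default behaviour for 2-D sequences only; `pad_pre` is a hypothetical helper, not a Keras function:

```python
import numpy as np

def pad_pre(sequences, dtype="float32"):
    # The longest sequence decides the padded length.
    maxlen = max(len(seq) for seq in sequences)
    n_features = np.asarray(sequences[0]).shape[1]
    out = np.zeros((len(sequences), maxlen, n_features), dtype=dtype)
    for i, seq in enumerate(sequences):
        seq = np.asarray(seq, dtype=dtype)
        if len(seq):
            out[i, -len(seq):] = seq  # real rows go at the end, zeros in front
    return out

shorter = np.ones((2, 4))  # 2 time steps, 4 columns
longer = np.ones((3, 4))   # 3 time steps, 4 columns
padded = pad_pre([shorter, longer])
```

This is why the padded length depends on the files passed in: it is set by the longest sequence in that particular call.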


# Run the preprocessing function
Run the main function on all the available CSV files to compile their data into an array of sequences and an array of labels.

In [8]:
x_values, y_values = preprocess_data_for_training("/content/features/part-00000-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv",
                                                  "/content/features/part-00001-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv",
                                                  "/content/features/part-00002-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv",
                                                  "/content/features/part-00003-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv",
                                                  "/content/features/part-00004-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv",
                                                  "/content/features/part-00005-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv",
                                                  "/content/features/part-00006-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv",
                                                  "/content/features/part-00007-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv",
                                                  "/content/features/part-00008-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv",
                                                  "/content/features/part-00009-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv")

converting csv to dataframes.....done.
splitting dataframes.....done.
matching dataframes and labels.....done.
converting dataframes to lists.....done.
balancing safe and dangerous datasets.....
number of dangerous driving cases: 5001
number of safe driving cases: 5001

number of dangerous driving cases: 5001
number of safe driving cases: 5001

number of dangerous driving cases: 5001
number of safe driving cases: 5001

number of dangerous driving cases: 5001
number of safe driving cases: 5001

number of dangerous driving cases: 5001
number of safe driving cases: 5001

number of dangerous driving cases: 5001
number of safe driving cases: 5001

number of dangerous driving cases: 5001
number of safe driving cases: 5001

number of dangerous driving cases: 5001
number of safe driving cases: 5001

number of dangerous driving cases: 5001
number of safe driving cases: 5001

number of dangerous driving cases: 5001
number of safe driving cases: 5001

done.
concatenating and padding return values

In [9]:
# double check the shape of the data.
print(x_values.shape) # indicates that we have 100020 sequences which are each 778 rows long with 4 columns per row.
print(y_values.shape) # labels only, which is why it is one-dimensional.

(100020, 778, 4)
(100020,)
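
The first dimension follows from the balancing output above: each of the 10 files contributed 5001 dangerous and 5001 safe cases. A quick check:

```python
files = 10
per_file = 5001 + 5001  # dangerous + safe cases kept per file
total_sequences = files * per_file
print(total_sequences)  # 100020
```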


## Saving processed data into pickles

In [0]:
import pickle

# context managers close the files even if dumping fails
with open("preprocessedAccel_Sequences.pickle", "wb") as pickle_out:
    pickle.dump(x_values, pickle_out)

with open("preprocessedAccel_Labels.pickle", "wb") as pickle_out_y:
    pickle.dump(y_values, pickle_out_y)
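
For the other notebooks, loading the pickles back mirrors the dump calls. The sketch below round-trips a small stand-in object through a temporary file so that it runs anywhere; with the real data, the path would be one of the pickle files written above:

```python
import os
import pickle
import tempfile

stand_in = {"shape": (2, 3, 4), "labels": [0, 1]}  # placeholder for the real arrays

path = os.path.join(tempfile.mkdtemp(), "example.pickle")  # hypothetical file name
with open(path, "wb") as f:
    pickle.dump(stand_in, f)

# The context manager closes the file even if loading raises.
with open(path, "rb") as f:
    restored = pickle.load(f)
```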

### Saving pickles to GCS
**Note to reviewer:** use whichever method you prefer to store the pickles. Just make sure they are imported appropriately in the other jupyter notebooks.

In [11]:
!gsutil cp /content/preprocessedAccel_Sequences.pickle gs://grab-ai-for-sea
!gsutil cp /content/preprocessedAccel_Labels.pickle gs://grab-ai-for-sea

Copying file:///content/preprocessedAccel_Sequences.pickle [Content-Type=application/octet-stream]...
/ [0 files][    0.0 B/  1.2 GiB]                                                ==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

\ [1 files][  1.2 GiB/  1.2 GiB]  100.3 MiB/s                                   
Operation completed over 1 objects/1.2 GiB.                 

# Preprocessing test data for prediction

In [20]:
test_x, test_y = preprocess_data_for_training("/content/features/part-00000-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv")

converting csv to dataframes.....done.
splitting dataframes.....done.
matching dataframes and labels.....done.
converting dataframes to lists.....done.
balancing safe and dangerous datasets.....
number of dangerous driving cases: 5001
number of safe driving cases: 5001

done.
concatenating and padding return values.....done.


In [21]:
print(test_x.shape) # check the shape; the padded length can differ from the training data's (778, 4)

(10002, 750, 4)
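
Because padding is computed per call, the test sequences come out 750 steps long while the training sequences are 778. If the model expects a fixed length, one option (an assumption about the downstream model, not something this notebook does) is to pad the time axis with zeros at the front, matching the pre-padding default of `pad_sequences`. A small-scale sketch with made-up shapes:

```python
import numpy as np

train_len = 6                                     # stands in for 778
test_batch = np.ones((2, 4, 3), dtype="float32")  # stands in for (10002, 750, 4)

extra = train_len - test_batch.shape[1]
# Pad only the time axis (axis 1), and only at its start.
aligned = np.pad(test_batch, ((0, 0), (extra, 0), (0, 0)), mode="constant")
```

Passing `maxlen` to `pad_sequences` when the test data is first padded is another route to the same result.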


Save the test sequences and labels as pickles so they can be imported when testing the model.

In [22]:
with open("testData_Sequences.pickle", "wb") as pickle_out:
    pickle.dump(test_x, pickle_out)

with open("testData_Labels.pickle", "wb") as pickle_out_y:
    pickle.dump(test_y, pickle_out_y)

# save the test sequences for later
!gsutil cp /content/testData_Sequences.pickle gs://grab-ai-for-sea
!gsutil cp /content/testData_Labels.pickle gs://grab-ai-for-sea

Copying file:///content/testData_Sequences.pickle [Content-Type=application/octet-stream]...
- [1 files][114.5 MiB/114.5 MiB]                                                
Operation completed over 1 objects/114.5 MiB.                                    
Copying file:///content/testData_Labels.pickle [Content-Type=application/octet-stream]...
/ [1 files][ 78.3 KiB/ 78.3 KiB]                                                
Operation completed over 1 objects/78.3 KiB.                                     
