# Data wrangling

This notebook captures the workflow of combining the data from the Gremlin runs done on Summit.  The resulting CSV file will then be added to our (not) FOGA git repo.  I will have a separate notebook there for doing analytics and visualizations.

This is a clone of the Gremlin data wrangling notebook.  It combines two Gremlin runs: the regular async run, and a *new* async run whose parent selection strategy also allows selecting from individuals that are still being evaluated.

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

# First process the asynchronous data

In [2]:
cd /Users/may/Projects/data/AV/gremlin/runs/2021_UKCI/1051001_async/

/Users/may/Projects/data/AV/gremlin/runs/2021_UKCI/1051001_async


In [3]:
def add_columns(job_id, run_type, df):
    """ A convenience for adding the Summit job ID and run_type (sync vs. async)"""
    df['job_id'] = job_id
    df['run_type'] = run_type
    return df

In [4]:
# Could have used a list comprehension, but I wanted to echo the files for sanity checking
async_data = []
for csv_file in Path('.').glob('*ind*csv'):
    print(f'reading {csv_file}')
    async_data.append(pd.read_csv(str(csv_file)))

reading 3_1051001_issue_45_individuals.csv
reading 0_1051001_issue_45_individuals.csv
reading 4_1051001_issue_45_individuals.csv
reading 2_1051001_issue_45_individuals.csv
reading 1_1051001_issue_45_individuals.csv


In [5]:
async_data = [add_columns('1051001', 'async', x) for x in async_data]

In [6]:
async_df = pd.concat(async_data)
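The read, tag, and concatenate steps above could also be folded into one function that takes the run directory as an argument, which avoids depending on `cd`. This is only a sketch: `load_run`, its `pattern` default, and the demo file names are hypothetical and not part of the original workflow.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_run(run_dir, job_id, run_type, pattern='*individuals*csv'):
    """Read every matching CSV in run_dir and tag its rows with job_id and run_type."""
    frames = []
    # sorted() gives a stable read order; glob() alone returns files in arbitrary order
    for csv_file in sorted(Path(run_dir).glob(pattern)):
        print(f'reading {csv_file.name}')  # echo the files for sanity checking
        df = pd.read_csv(csv_file)
        df['job_id'] = job_id
        df['run_type'] = run_type
        frames.append(df)
    return pd.concat(frames)


# Tiny demonstration with made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for run in range(2):
        pd.DataFrame({'run': [run, run], 'fitness': [1.0, 2.0]}).to_csv(
            Path(tmp) / f'{run}_demo_individuals.csv', index=False)
    demo_df = load_run(tmp, '1051001', 'async')
```

Note that `pd.concat` raises a `ValueError` when given an empty list, so a pattern that matches no files fails loudly instead of silently producing an empty frame.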

In [7]:
# This data will later be merged with the by-generation data, which has a 'generation' column,
# so add that column here too, filled with NaN to indicate that it does not apply to async runs.
async_df['generation'] = np.nan
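To illustrate what the placeholder column does when frames are combined, here is a minimal sketch with made-up values. The frame names and the `'sync'` row are assumptions for illustration only, not data from the runs.

```python
import numpy as np
import pandas as pd

# Async rows: 'generation' does not apply, so it is filled with NaN.
async_demo = pd.DataFrame({'fitness': [18.4, 63.5],
                           'run_type': ['async', 'async']})
async_demo['generation'] = np.nan

# A hypothetical by-generation row; the value is stored as a float so it can
# share a column with NaN.
by_gen_demo = pd.DataFrame({'fitness': [47.8],
                            'run_type': ['sync'],
                            'generation': [3.0]})

# Both frames have the same columns, so the rows simply stack.
combined = pd.concat([async_demo, by_gen_demo], ignore_index=True)
print(combined)
```

`pd.concat` would also fill a column that is missing from one frame with NaN on its own, so adding the column explicitly mainly documents the intent and guarantees the column exists even when only async frames are combined.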

In [8]:
async_df.head() # sanity check

Unnamed: 0,run,hostname,pid,uuid,birth_id,scenario,cloudiness,wetness,precipitation,precipitation_deposits,...,fog_density,fog_distance,sun_azimuth_angle,sun_altitude_angle,start_eval_time,stop_eval_time,fitness,job_id,run_type,generation
0,3,d36n08,105620,eaa6d360-751a-4536-9ee7-6ecceffaf759,26,14,8,49,50,50,...,12,1.56,123,10,1622667000.0,1622667000.0,18.441713,1051001,async,
1,3,d36n01,24482,3403baac-6418-4c4a-a063-b478c5f98fd9,42,13,89,16,25,75,...,37,719.57594,306,-42,1622667000.0,1622667000.0,63.508626,1051001,async,
2,3,d36n07,170330,b01e7cac-eb6c-4d37-9805-c44298aa5706,37,36,49,20,50,75,...,21,15.777216,77,17,1622667000.0,1622667000.0,99.302382,1051001,async,
3,3,d36n01,24466,102e48ba-c110-4f6e-bdbf-3cd8c563c8c5,43,25,60,46,50,0,...,38,9.48576,318,72,1622667000.0,1622667000.0,47.844071,1051001,async,
4,3,d36n02,45849,8bb41025-7261-43c7-8b56-37bf88afb26d,8,38,51,34,50,0,...,46,1843.674407,336,43,1622667000.0,1622667000.0,77.942178,1051001,async,


In [20]:
async_df.to_csv('all_async.csv')

# Now do the same thing for the new async (eval selection) data

In [9]:
cd /Users/may/Projects/data/AV/gremlin/runs/2021_UKCI/1078364_async_eval_selection

/Users/may/Projects/data/AV/gremlin/runs/2021_UKCI/1078364_async_eval_selection


In [23]:
new_async_data = []
for csv_file in Path('.').glob('*individuals*csv'):
    print(f'reading {csv_file}')
    new_async_data.append(pd.read_csv(str(csv_file)))

In [24]:
new_async_data = [add_columns('1078364', 'eval_select', x) for x in new_async_data]

In [25]:
new_async_df = pd.concat(new_async_data)

In [26]:
new_async_df['generation'] = np.nan

In [27]:
all_dfs = pd.concat([async_df, new_async_df])

In [20]:
cd ..


/Users/may/Projects/data/AV/gremlin/runs/2021_UKCI


In [28]:
all_dfs.to_csv('gremlin_async_and_new_async.csv', na_rep='NA', index=False)
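The `na_rep='NA'` argument above writes missing values as the literal string `NA`, which `pd.read_csv` treats as missing by default when the file is read back. A small round-trip check, using made-up rows and an in-memory buffer in place of the real file, might look like this:

```python
import io

import numpy as np
import pandas as pd

# Made-up rows: one missing fitness value and an all-NaN 'generation' column.
demo = pd.DataFrame({
    'fitness': [99.06, np.nan],
    'run_type': ['async', 'eval_select'],
    'generation': [np.nan, np.nan],
})

# Write with the same options as the cell above, but to an in-memory buffer.
buffer = io.StringIO()
demo.to_csv(buffer, na_rep='NA', index=False)
csv_text = buffer.getvalue()

# Read it back; 'NA' is one of read_csv's default missing-value markers.
buffer.seek(0)
restored = pd.read_csv(buffer)
counts = restored['run_type'].value_counts()
print(csv_text)
print(counts)
```

Running the same `value_counts()` on the real combined file is a quick way to confirm that both runs contributed the expected number of rows.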

In [29]:
all_dfs.head()

Unnamed: 0,run,hostname,pid,uuid,birth_id,scenario,cloudiness,wetness,precipitation,precipitation_deposits,...,fog_density,fog_distance,sun_azimuth_angle,sun_altitude_angle,start_eval_time,stop_eval_time,fitness,job_id,run_type,generation
0,3,d36n08,105620,eaa6d360-751a-4536-9ee7-6ecceffaf759,26,14,8,49,50,50,...,12,1.56,123,10,1622667000.0,1622667000.0,18.441713,1051001,async,
1,3,d36n01,24482,3403baac-6418-4c4a-a063-b478c5f98fd9,42,13,89,16,25,75,...,37,719.57594,306,-42,1622667000.0,1622667000.0,63.508626,1051001,async,
2,3,d36n07,170330,b01e7cac-eb6c-4d37-9805-c44298aa5706,37,36,49,20,50,75,...,21,15.777216,77,17,1622667000.0,1622667000.0,99.302382,1051001,async,
3,3,d36n01,24466,102e48ba-c110-4f6e-bdbf-3cd8c563c8c5,43,25,60,46,50,0,...,38,9.48576,318,72,1622667000.0,1622667000.0,47.844071,1051001,async,
4,3,d36n02,45849,8bb41025-7261-43c7-8b56-37bf88afb26d,8,38,51,34,50,0,...,46,1843.674407,336,43,1622667000.0,1622667000.0,77.942178,1051001,async,


In [30]:
all_dfs.tail()

Unnamed: 0,run,hostname,pid,uuid,birth_id,scenario,cloudiness,wetness,precipitation,precipitation_deposits,...,fog_density,fog_distance,sun_azimuth_angle,sun_altitude_angle,start_eval_time,stop_eval_time,fitness,job_id,run_type,generation
595,3,g31n11,91443,d9837cf0-a4b2-4702-bb8a-a2f3cdfbf4c4,592,50,20,37,25,100,...,77,449.359963,226,-72,1623613000.0,1623613000.0,99.06299,1078364,eval_select,
596,3,g31n14,125087,3c5d2fcc-ed8c-4647-aa41-ea644e919501,593,50,20,36,50,100,...,75,449.359963,230,-72,1623613000.0,1623613000.0,99.173575,1078364,eval_select,
597,3,g31n18,95965,546d921c-52c5-4eba-8068-97f579cc1775,573,53,84,3,75,50,...,30,1843.674407,161,-62,1623613000.0,1623613000.0,94.89358,1078364,eval_select,
598,3,g31n13,149516,a8883138-62ab-4947-b02c-8a2bfb2bfd71,598,50,20,36,0,100,...,75,449.359963,230,-72,1623613000.0,1623613000.0,98.325259,1078364,eval_select,
599,3,g31n15,140613,c8f1d8ac-8b67-46c8-b038-0d957093ab9e,50,3,68,82,75,0,...,86,0.6,121,65,1623612000.0,1623614000.0,,1078364,eval_select,


At this point, the data has been written out to a CSV file that was committed to our shared repository.  So, the purpose of this specific notebook is done.