# Initial Data Exploration
-------------------------------------

Messing around with a day's worth of PBS accounting data to see if we can figure out anything useful.

In [2]:
import pandas as pd
import numpy as np

In [10]:
path = '../csv_output/summary.csv'
df = pd.read_csv(path)

df.head()

Unnamed: 0,id,date,r_type,resource,value,key
0,8,20180212,Q,queue,economy,
1,9,20180212,Q,queue,economy,
2,10,20180212,Q,queue,economy,
3,11,20180212,Q,queue,economy,
4,12,20180212,Q,queue,economy,


First we need to know what we're working with, so let's take a look at each column and see what it contains (not an exhaustive list):

- **id**: The unique identification code for each job
- **date**: The date the job was run
- **r_type**: Record Type
    - Q : Job entered a queue (not recorded for subjobs)
    - S : Job execution has started
    - E : Job or subjob ended
    - A : Job was aborted
- **resource**: Resources the job used, or other information about the job
    - user : Who submitted the job
    - Resource_List.<resource> : Resources requested by the job (ncpus, mem, ...)
    - resources_used.<resource> : Resources the job actually used
    - start : Time when job execution started (seconds since epoch)
    - qtime : Time when job entered queue (seconds since epoch)
    - Exit_status : Exit status of the job (keep in mind interactive jobs always return 0)
- **value**: Value for each resource
- **key**: Select statement values
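
A quick way to check the column descriptions above against the data is to count rows per record type and per resource. This is a minimal sketch on a few made-up rows that only mimic the layout of `summary.csv`; the `sample` frame and the user name in it are hypothetical, and with the real data you would run the same two calls on `df`.

```python
import pandas as pd

# Hypothetical rows in the same long layout as summary.csv (values are made up).
sample = pd.DataFrame({
    "id": [8, 8, 15, 15, 17, 17],
    "r_type": ["Q", "Q", "S", "S", "E", "E"],
    "resource": ["queue", "user", "qtime", "ctime", "qtime", "Exit_status"],
    "value": ["economy", "alice", "1518447436", "1518447436", "1518447436", "0"],
})

# How many rows belong to each record type (Q, S, E, A)
type_counts = sample["r_type"].value_counts()

# How many distinct ids report each resource
ids_per_resource = sample.groupby("resource")["id"].nunique()

print(type_counts)
print(ids_per_resource)
```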

Since how long a job takes to run is pretty important, I thought we could investigate this aspect first.

In [11]:
# Super basic: make a df out of every resource that has 'time' in its name

df = df.dropna(axis=0, subset=['resource'])

dfTime = df[df['resource'].str.contains('time')]
dfTime.head(100)

Unnamed: 0,id,date,r_type,resource,value,key
12,15,20180212,S,ctime,1518447436,
15,15,20180212,S,qtime,1518447436,
25,15,20180212,S,Resource_List.walltime,01:00:00,
28,15,20180212,S,etime,1518447436,
31,17,20180212,E,qtime,1518447436,
36,17,20180212,E,etime,1518447436,
46,17,20180212,E,Resource_List.walltime,01:00:00,
53,17,20180212,E,resources_used.walltime,00:35:36,
56,17,20180212,E,ctime,1518447436,
66,18,20180212,S,ctime,1518447436,


- ctime : time at which the job was created
- etime : time at which the job became eligible to run (in execution queue)
- qtime : time at which the job was queued
- Resource_List.walltime : requested walltime
- resources_used.walltime : walltime the job actually used
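
These values come in two formats, so they need different conversions before any arithmetic. Below is a minimal sketch using the rows for id 17 from the output above. It assumes that `ctime`, `etime` and `qtime` are all seconds since the epoch (as documented for `start` and `qtime`) and that walltimes are `HH:MM:SS` strings; the variable names are my own.

```python
import pandas as pd

# Rows copied from id 17 in the output above, in the same long layout as dfTime.
times = pd.DataFrame({
    "resource": ["qtime", "etime", "Resource_List.walltime", "resources_used.walltime"],
    "value": ["1518447436", "1518447436", "01:00:00", "00:35:36"],
})

is_walltime = times["resource"].str.endswith("walltime")

# Epoch seconds -> timezone-aware timestamps (assumed to be UTC-based epoch seconds)
timestamps = pd.to_datetime(
    pd.to_numeric(times.loc[~is_walltime, "value"]), unit="s", utc=True
)

# HH:MM:SS strings -> Timedelta, so durations can be compared and subtracted
durations = pd.to_timedelta(times.loc[is_walltime, "value"])

print(timestamps)
print(durations)
```

Once converted, differences such as requested minus used walltime are ordinary `Timedelta` subtractions.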

From a quick glance at the data, it seems a lot of jobs are submitted with a requested walltime of an hour. Let's double check...

In [16]:
# Requested walltime, taken from the end ('E') records only
walltime_mask = (dfTime['resource'] == 'Resource_List.walltime') & (dfTime['r_type'] == 'E')
print("Walltime: {}".format(dfTime.loc[walltime_mask, 'value']))

Walltime: 46      01:00:00
99      01:00:00
175     01:00:00
228     01:00:00
281     01:00:00
311     01:00:00
365     01:00:00
432     01:00:00
485     01:00:00
561     01:00:00
614     01:00:00
644     01:00:00
697     01:00:00
773     01:01:01
823     01:01:01
873     01:01:01
923     01:01:01
973     01:01:01
1023    01:01:01
1073    01:01:01
1123    01:01:01
1173    01:01:01
1223    01:01:01
1273    01:01:01
1323    01:01:01
1373    01:01:01
1423    01:01:01
1473    01:01:01
1523    01:01:01
1573    01:01:01
1625    01:01:01
1673    01:01:01
1724    01:00:00
Name: value, dtype: object
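
A printed Series gets hard to read as it grows, so counting the distinct values is a more compact check. This is a minimal sketch that rebuilds the 33 values from the output above rather than re-reading the CSV; on the real data the same `value_counts()` call would go on the selection that is printed above.

```python
import pandas as pd

# The 33 requested walltimes printed above: 14 x 01:00:00 and 19 x 01:01:01
requested = pd.Series(["01:00:00"] * 14 + ["01:01:01"] * 19, name="value")

# Count how often each requested walltime occurs, most common first
walltime_counts = requested.value_counts()

print(walltime_counts)
```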


I just looked at the tutorial on the NCAR webpage (see "Submitting jobs with PBS"), and its example specifies a walltime of an hour. This might explain why a lot of users stick to that specific walltime. However, it also means requested walltime varies too little to be a great variable to explore.

In fact, we can go one of two ways here. We could investigate ctime, etime and qtime to see how PBS is handling jobs and look for outliers or patterns there, or we could go back to the main df and look for variables that correlate strongly with actual walltime.
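
The second option needs one row per job, so the long table has to be pivoted first. Here is a rough sketch of how that could look, on made-up end records; the `long_df` values and the derived column names (`ncpus`, `requested_s`, `used_s`) are hypothetical, and I'm assuming each id carries at most one value per resource.

```python
import pandas as pd

# Made-up 'E' records in the same long layout as df (not real accounting data).
long_df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "r_type": ["E"] * 9,
    "resource": ["Resource_List.ncpus", "Resource_List.walltime",
                 "resources_used.walltime"] * 3,
    "value": ["36", "01:00:00", "00:10:00",
              "72", "01:00:00", "00:20:00",
              "144", "02:00:00", "00:40:00"],
})

# Keep end records and pivot to one row per id, one column per resource
ended = long_df[long_df["r_type"] == "E"]
wide = ended.pivot_table(index="id", columns="resource",
                         values="value", aggfunc="first")

# Convert the string values to numbers so they can be correlated
features = pd.DataFrame({
    "ncpus": pd.to_numeric(wide["Resource_List.ncpus"]),
    "requested_s": pd.to_timedelta(wide["Resource_List.walltime"]).dt.total_seconds(),
    "used_s": pd.to_timedelta(wide["resources_used.walltime"]).dt.total_seconds(),
})

# Pearson correlation of every other column with the walltime actually used
corr_with_used = features.corr()["used_s"].drop("used_s")
print(corr_with_used)
```

On real data the interesting part is which columns end up in `features`; anything numeric in the wide table is a candidate.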

In [19]:
df.loc[df['id']==15]

Unnamed: 0,id,date,r_type,resource,value,key
8,15,20180212,S,queue,economy,
9,15,20180212,S,account,"""SSSG0001""",
10,15,20180212,S,Resource_List.place,scatter:exclhost,
11,15,20180212,S,group,ncar,
12,15,20180212,S,ctime,1518447436,
13,15,20180212,S,resource_assigned.ncpus,4608,
14,15,20180212,S,Resource_List.ncpus,2304,
15,15,20180212,S,qtime,1518447436,
16,15,20180212,S,Resource_List.mpiprocs,128,
17,15,20180212,S,exec_vnode,(r1i0n0:ncpus=36)+(r1i0n1:ncpus=36)+(r1i0n2:nc...,
