# Sample Data Extraction
This notebook shows how the sample data was extracted.

In [None]:
#@title Please input your project id
import numpy as np
from google.cloud import bigquery
from google.colab import auth
from google.cloud.bigquery import magics

auth.authenticate_user()
print('Authenticated')
project_id = 'cluster-scheduling-437114' #@param {type: "string"}
# Set the default project id for %bigquery magic
magics.context.project = project_id

# Use the client to run queries constructed from a more complicated function.
client = bigquery.Client(project=project_id)

Authenticated


## Data extraction and preprocessing

The complete database description is in [Google cluster-usage traces v3](https://drive.google.com/file/d/10r6cnJ5cJ89fPWCgj7j4LtLBqYN9RiI9/view).

For the critical path length (CPLen) from the [Graphene](https://www.usenix.org/system/files/conference/osdi16/osdi16-grandl-graphene.pdf) paper, we need the `collection_events` table.
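
As a rough illustration of what CPLen measures, the sketch below computes the longest duration-weighted path through a small dependency graph. The collection names, the durations, and the `critical_path_length` helper are all hypothetical. In the trace, the dependencies would come from `start_after_collection_ids`, and the durations would have to be derived from event timestamps, which this sketch does not attempt. It also assumes the dependency graph is acyclic.

```python
from functools import lru_cache

# Hypothetical toy data: run time per collection (arbitrary units) and the
# collections each one must wait for (the role played by
# start_after_collection_ids in the trace).
durations = {"a": 3, "b": 2, "c": 4, "d": 1}
start_after = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}


def critical_path_length(durations, start_after):
    """Return the longest duration-weighted path through an acyclic dependency graph."""

    @lru_cache(maxsize=None)
    def finish_time(node):
        # A node can start only after all of its predecessors have finished.
        earliest_start = max(
            (finish_time(p) for p in start_after.get(node, [])), default=0
        )
        return earliest_start + durations[node]

    return max(finish_time(node) for node in durations)


cp_len = critical_path_length(durations, start_after)
print(cp_len)  # longest chain is a -> c -> d: 3 + 4 + 1 = 8
```

The memoised recursion visits each collection once, so it stays cheap for the small per-user samples extracted in this notebook. A very deep dependency chain would call for an iterative topological sort instead.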



This query excludes every collection that has an event with timestamp 0. As the documentation notes, `A time of 0 represents events that occurred before the beginning of the trace window.` The exact start time of such a collection is therefore unknown.

We sample only the collections of a single user here.

In [None]:
# This query excludes every collection that has an event with timestamp 0.
sql_query = ('''
SELECT
        time,
        collection_id,
        type,
        user,
        parent_collection_id,
        start_after_collection_ids
FROM `google.com:google-cluster-data`.clusterdata_2019_a.collection_events
WHERE collection_id NOT IN (
    SELECT collection_id
    FROM `google.com:google-cluster-data`.clusterdata_2019_a.collection_events
    WHERE time = 0
) AND user = '4+xJwHj7e8nRzTPp13wnDUjqOYGJ9nnXrMHjtTo3Zt4=';
''')

# Convert the query result to a DataFrame
df = client.query(sql_query).to_dataframe()
df.head()

df.to_csv('time_duration_SAMPLE.csv', index=False)
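
The exclusion logic of the query above can be checked offline on a toy DataFrame. This is a minimal sketch with made-up rows, not trace data. It shows that the `NOT IN` subquery drops whole collections rather than individual rows: collection 1 has one event at time 0, so its later event at time 50 is removed as well.

```python
import pandas as pd

# Hypothetical toy events: collection 1 has an event at time 0, so all of its
# rows are dropped; collections 2 and 3 are kept.
events = pd.DataFrame({
    "time": [0, 50, 10, 20, 30],
    "collection_id": [1, 1, 2, 2, 3],
})

# Equivalent of the NOT IN subquery: find every collection with a time-0 event
pre_window_ids = events.loc[events["time"] == 0, "collection_id"].unique()
# Keep only the rows that belong to the remaining collections
kept = events[~events["collection_id"].isin(pre_window_ids)]

print(kept["collection_id"].tolist())  # [2, 2, 3]
```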

### Sample data for TWork
The cell below queries the resource usage of the corresponding collections, matched by `collection_id`.

In [None]:
# Collect the unique, non-null collection IDs from the events DataFrame `df`
unique_collection_ids = df['collection_id'].dropna().unique()
# Render the IDs as a comma-separated list for the SQL IN clause. Formatting
# them explicitly avoids the trailing comma of a one-element tuple and NumPy
# scalar reprs such as np.int64(...), both of which would break the query.
collection_ids_sql = ', '.join(str(int(cid)) for cid in unique_collection_ids)

sql_usage = f'''SELECT
                start_time,
                end_time,
                collection_id,
                machine_id,
                collection_type,
                average_usage.cpus AS cpu_usage,
                average_usage.memory AS memory_usage,
                assigned_memory
                FROM `google.com:google-cluster-data`.clusterdata_2019_a.instance_usage
                WHERE collection_id IN ({collection_ids_sql})
                '''

df_usage = client.query(sql_usage).to_dataframe()
df_usage.head()
df_usage.to_csv('instance_usage_SAMPLE.csv')

| Unnamed: 0 | start_time | end_time | collection_id | machine_id | collection_type | cpu_usage | memory_usage | assigned_memory |
|---|---|---|---|---|---|---|---|---|
| 0 | 2174700000000 | 2175000000000 | 396897001817 | 9579310242 | 0 | 9.7e-05 | 0.004799 | 0.013672 |
| 1 | 168721000000 | 168723000000 | 375826195480 | 160127742918 | 0 | 0.000374 | 2.3e-05 | 0.023438 |
| 2 | 273328000000 | 273345000000 | 378641079226 | 102893323408 | 0 | 0.003197 | 0.000602 | 0.007332 |
| 3 | 1723500000000 | 1723501000000 | 385475899088 | 2060696851 | 0 | 0.0 | 0.000125 | 0.004684 |
| 4 | 1375800000000 | 1376100000000 | 384088965239 | 1377377732 | 0 | 9.7e-05 | 0.001106 | 0.010406 |


In [None]:
# Collect the unique, non-null collection IDs from the events DataFrame `df`
unique_collection_ids = df['collection_id'].dropna().unique()
# Render the IDs as a comma-separated list for the SQL IN clause. Formatting
# them explicitly avoids the trailing comma of a one-element tuple and NumPy
# scalar reprs such as np.int64(...), both of which would break the query.
collection_ids_sql = ', '.join(str(int(cid)) for cid in unique_collection_ids)

# Average CPU and memory usage per collection
sql_usage = f'''
SELECT
    collection_id,
    AVG(average_usage.cpus) AS avg_cpu_usage,
    AVG(average_usage.memory) AS avg_memory_usage
FROM `google.com:google-cluster-data`.clusterdata_2019_a.instance_usage
WHERE collection_id IN ({collection_ids_sql})
GROUP BY collection_id
'''

df_usage = client.query(sql_usage).to_dataframe()
df_usage.head()
# Note: this reuses the file name from the previous cell, so it overwrites that file.
df_usage.to_csv('instance_usage_SAMPLE.csv', index=False)
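
One possible way to turn the per-window usage rows into a work figure per collection is to weight each window's average CPU usage by its length and sum the result. The sketch below does this for three rows copied from the sample output earlier in this section. It assumes the timestamps are in microseconds, and treating "duration times average CPU usage" as the work measure is itself an assumption; it may differ from the TWork definition used in the Graphene paper.

```python
import pandas as pd

# Three rows copied from the sample output shown earlier (subset of columns)
usage = pd.DataFrame({
    "start_time": [2174700000000, 168721000000, 273328000000],
    "end_time": [2175000000000, 168723000000, 273345000000],
    "collection_id": [396897001817, 375826195480, 378641079226],
    "cpu_usage": [9.7e-05, 0.000374, 0.003197],
})

# Window length in seconds (assumes microsecond timestamps)
usage["duration_s"] = (usage["end_time"] - usage["start_time"]) / 1e6
# CPU work contributed by each window: duration x average CPU usage
usage["cpu_work"] = usage["duration_s"] * usage["cpu_usage"]

# Total CPU work per collection
work_per_collection = usage.groupby("collection_id")["cpu_work"].sum()
print(work_per_collection)
```

Unlike the plain `AVG` in the query above, this weights long windows more heavily than short ones, which matters because the usage windows in the sample range from 2 to 300 seconds.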