# Training Set Preparation

In this notebook, the already pre-processed dataset `preprocessed_worker.csv` is transformed further so that it can be used for training in the notebook `lstm_utilization_prediction.ipynb`.

## Import Modules

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

## Load the Dataset

In [None]:
DF_PATH: str = 'preprocessed_worker.csv'

df = pd.read_csv(DF_PATH, index_col='start_date', parse_dates=True)
df.head(15)

## Sort the Dataset

The dataset is sorted first by `machine` and then, within each machine, by the `start_date` index.

As a result, the rows of each machine are in ascending chronological order.

In [None]:
df = df.sort_values(['machine', 'start_date'])

In [None]:
df['job_name'].value_counts()
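
The sort order can be checked on a small example. The sketch below uses an invented toy dataframe (not the real dataset) and relies on `sort_values` accepting the index name `start_date` alongside column names:

```python
import pandas as pd

# Toy stand-in for the real dataset; all values are invented for illustration.
toy = pd.DataFrame(
    {"machine": ["m2", "m1", "m2", "m1"], "cpu_usage": [40.0, 10.0, 30.0, 20.0]},
    index=pd.DatetimeIndex(
        ["2020-01-02", "2020-01-03", "2020-01-01", "2020-01-01"], name="start_date"
    ),
)

# Sort by the 'machine' column first, then by the 'start_date' index level.
toy_sorted = toy.sort_values(["machine", "start_date"])

# Within each machine, the timestamps should now be non-decreasing.
is_ascending = all(
    group.index.is_monotonic_increasing
    for _, group in toy_sorted.groupby("machine")
)
print(toy_sorted)
print("ascending per machine:", is_ascending)
```

Note that the index of the whole dataframe is not globally sorted afterwards; only the rows inside each machine group are.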

## Drop Unneeded Columns

In [None]:
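df.drop(columns=['Unnamed: 0', 'inst_name', 'inst_id', 'worker_name', 'gpu_name', 'workload', 'start_time_t', 'end_time_t'], inplace=True)
# Drop every column that contains at least one NaN value.
# dropna returns a new dataframe, so the result is assigned back to df.
df = df.dropna(axis=1)
df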

In [None]:
term_val = 'Terminated'
df.query('status_t == @term_val', inplace=True)

In [None]:
df['task_name'].value_counts()
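
The status filter above uses `query` with an `@`-prefixed name to refer to a Python variable. As a minimal sketch on invented toy data (the status `Failed` is an assumption made for illustration), the same selection can also be written as a boolean mask:

```python
import pandas as pd

# Toy data; the 'Failed' status is a made-up example value.
toy = pd.DataFrame(
    {
        "task_name": ["worker", "ps", "worker"],
        "status_t": ["Terminated", "Failed", "Terminated"],
    }
)

term_val = "Terminated"

# '@term_val' refers to the local Python variable inside the query string.
via_query = toy.query("status_t == @term_val")

# Equivalent boolean-mask selection.
via_mask = toy[toy["status_t"] == term_val]

print(via_query)
```

Both variants keep only the rows whose `status_t` equals `Terminated`.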

## Create Dataframe with Columns used for Machine Learning

Next, we create the dataframe that holds all columns used for machine learning.

In [None]:
training_columns: list = ['machine', 'job_name', 'task_name', 'inst_num', 'cpu_usage',
                          'gpu_wrk_util', 'avg_mem', 'max_mem', 'avg_gpu_wrk_mem',
                          'max_gpu_wrk_mem', 'gpu_type', 'runtime']

machine_df = df[training_columns]

## Save Dataframe to Disk

To reuse the dataframe later, it can be stored on disk. The two cells below are commented out; uncomment them to write and reload the file.

In [None]:
# machine_df.to_csv('training_machine_sorted_df.csv')

In [None]:
# machine_df = pd.read_csv('training_machine_sorted_df.csv', index_col='start_date', parse_dates=True)
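
A round trip through CSV turns the `start_date` index into plain text unless `parse_dates=True` is passed when reading, as in the initial load of this notebook. The following is a minimal sketch that uses an in-memory buffer instead of a file; the toy values are invented:

```python
import io

import pandas as pd

# Toy dataframe with a datetime index named like the one in this notebook.
toy = pd.DataFrame(
    {"machine": ["m1", "m1"], "runtime": [12.0, 7.5]},
    index=pd.DatetimeIndex(["2020-01-01", "2020-01-03"], name="start_date"),
)

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading without parse_dates leaves the index as text.
buffer.seek(0)
plain = pd.read_csv(buffer, index_col="start_date")

# Reading with parse_dates=True restores a DatetimeIndex.
buffer.seek(0)
parsed = pd.read_csv(buffer, index_col="start_date", parse_dates=True)

print(type(plain.index).__name__, type(parsed.index).__name__)
```

Whether the datetime index is needed depends on what the training notebook does with it; if it only uses the column values, either variant works.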

In [None]:
machine_df

In [None]:
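dummies = pd.get_dummies(machine_df['task_name'])
# Append the one-hot columns and keep the result.
# concat places the two identically indexed frames side by side, whereas
# join would multiply rows that share the same start_date.
machine_df = pd.concat([machine_df, dummies], axis=1)
machine_df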
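
Two details of the one-hot encoding step are easy to miss: `join` and `concat` both return a new dataframe rather than modifying the original, and `join` matches rows by index label. The sketch below uses invented toy data in which two rows share a timestamp, an assumption about what the real `start_date` index might look like:

```python
import pandas as pd

# Toy data; two rows deliberately share the same start_date.
toy = pd.DataFrame(
    {"task_name": ["worker", "ps", "worker"]},
    index=pd.DatetimeIndex(
        ["2020-01-01", "2020-01-01", "2020-01-02"], name="start_date"
    ),
)

# One indicator column per distinct task name.
dummies = pd.get_dummies(toy["task_name"])

# join matches on index labels, so duplicated timestamps are paired many-to-many.
joined = toy.join(dummies)

# concat with axis=1 keeps the rows of two identically indexed frames one-to-one.
combined = pd.concat([toy, dummies], axis=1)

print(len(toy), len(joined), len(combined))
```

In both cases `toy` itself is left unchanged, so the result has to be assigned to a name to be used later.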