# Notebook Description

This Notebook preprocesses previously generated data dictionaries.<br>
The goal is to merge the data of a directory and bring it into a form that can be fed directly to our Neural Network.<br>

For every key, i.e. the 2-tuple (number of remaining Jobs, number of remaining Machines), we will:
   1. load the lists of data samples assigned to this key from all dictionaries within a directory
   2. for every such list, select some of these samples according to a balancing rule
   3. transform the numpy arrays into an LSTM-compatible form
   4. concatenate them into a final list
    
Finally, we will store each of these final lists as a pickle file in a sub-directory.

This Notebook therefore has to be run on every newly created directory of data dictionaries.<br>

# Code

In [1]:
import pickle
import os
import time
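import random
#numpy is used below as np, so import it explicitly instead of relying on the star imports
import numpy as np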
import import_ipynb
from Jobs_and_Machines import *
from States_and_Policies import *

importing Jupyter notebook from Jobs_and_Machines.ipynb
importing Jupyter notebook from States_and_Policies.ipynb
importing Jupyter notebook from Global_Variables.ipynb


### Select Hyperparameters and Data Sets

In [2]:
#set key limits
n = 8 #max job number
m = 4 #max machine number
n_min = 3 #min job number
m_min = 2 #min machine number

Each directory corresponds to one data set, so choose the data set to be transformed. To use estimated data for higher Job numbers, which result from applying further Deep Reinforcement Learning techniques, answer <i>0</i> at the following input prompt.

In [3]:
#The number of the data set has to be given with two digits, so for example "1" has to be given as "01", while "98" stays the same
DS = input("Which Data-Set do you want to work on?\n"+"Type 0 if you want to work on estimated Data-Sets.\n")

Which Data-Set do you want to work on?
Type 0 if you want to work on estimated Data-Sets.
01


If the data set is estimated for a higher number of Jobs, this increased number has to be entered.

In [4]:
if DS == "0":
    n = int(input("What is the new number of Jobs?\n"))
    n_min = n

### Helper Functions

In [5]:
#helper function to merge dictionaries containing data
def merge_dicts(data_dict, sample_dict):
    for key in data_dict:
        if key in sample_dict:
            data_dict[key][0] += sample_dict[key][0] #add inputs
            data_dict[key][1] += sample_dict[key][1] #add targets
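
To illustrate the merge step, here is a minimal sketch on toy data. The keys and the string placeholders are hypothetical stand-ins for real states and target vectors; the function body is copied from the cell above.

```python
#same helper as above, repeated so that the example is self-contained
def merge_dicts(data_dict, sample_dict):
    for key in data_dict:
        if key in sample_dict:
            data_dict[key][0] += sample_dict[key][0] #add inputs
            data_dict[key][1] += sample_dict[key][1] #add targets

#toy final dictionary with two keys (n_state, m_state)
data_dict = {(3, 2): [[], []], (4, 2): [[], []]}
#toy sample dictionary: one matching key and one key outside the chosen limits
sample_dict = {(3, 2): [["state_a"], ["target_a"]],
               (9, 9): [["ignored"], ["ignored"]]}

merge_dicts(data_dict, sample_dict)
print(data_dict)
```

The function mutates `data_dict` in place and returns nothing. Keys that only occur in `sample_dict`, such as `(9, 9)` here, are skipped, which is how the limits `n`, `m`, `n_min` and `m_min` restrict the final data.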

In [6]:
#bring the data into a format that can be read directly by the Neural Network, whose base layer is an LSTM
def data_into_LSTM_format(data_dict):
    #iterate over keys
    for key in data_dict:
        #get inputs and targets
        inputs, targets = data_dict[key] #list of list of inputs and list of targets
        #inputs is a list with one entry per state, each entry being a list itself
        #these inner lists consist of two entries: the Job-data and the Machine-data of a state
        #every machine-data consists of 3 entries, so create the indices 1..m_state, repeating every index 3 times
        idxs = [ind+1 for ind in range(key[1]) for _ in range(3)]
        
        #inputs are now Jobs. Each Job is a sequence of processing times, each followed by the respective machine information.
        #the last 2 entries are the job's earliness and weight
        seq_inputs = [np.insert(inp[0],idxs,inp[1].flatten(), axis=1) for inp in inputs]
        
        #merge samples to numpy array
        data_dict[key][0] = [np.stack(seq_inputs)]
        data_dict[key][1] = [np.stack(targets)]
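
The interleaving done by `np.insert` is easier to see on a small example. The sketch below uses hypothetical numbers and assumes that the Job-data of a state has one row per Job with `m_state` processing times followed by earliness and weight, and that the Machine-data has one row of 3 entries per Machine. The array layout is inferred from the comments above, not confirmed elsewhere.

```python
import numpy as np

m_state = 2 #number of remaining Machines in this toy state

#hypothetical Job-data: one row per Job, m_state processing times, then earliness and weight
job_data = np.array([[5., 7., 1., 2.],
                     [3., 4., 0., 1.]])
#hypothetical Machine-data: one row per Machine with 3 entries each
machine_data = np.array([[10., 11., 12.],
                         [20., 21., 22.]])

#same index construction as in data_into_LSTM_format: [1, 1, 1, 2, 2, 2]
idxs = [ind+1 for ind in range(m_state) for _ in range(3)]

#the flattened Machine-data is broadcast to every Job row and inserted before the given columns
seq_input = np.insert(job_data, idxs, machine_data.flatten(), axis=1)
print(seq_input.shape)
print(seq_input[0])
```

Every row now has `4*m_state + 2 = 10` entries. The first row reads 5, 10, 11, 12, 7, 20, 21, 22, 1, 2: each processing time is followed by the 3 entries of its Machine, and earliness and weight stay at the end.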

### Create Final Data

In [7]:
#change working path to Data directory
work_path = input("What is the working path to the data directory?\n")
os.chdir(work_path)

What is the working path to the data directory?
D:\\Job-Scheduling-Files\Data


We will now merge all data dictionaries of a directory by their keys into one final dictionary.

In [8]:
#merged dictionary
data_dict = dict(((n_state,m_state),[[],[]]) 
                           for n_state in range(n_min,n+1) for m_state in range(m_min,m+1))

#measure starting time
st = time.time()

#check if we are using estimated data set
if DS == "0":
    data_path = f'EstimData/{n}_Jobs/estim_data_{n}_Jobs'
    data_indices = [str(i+1) for i in range(800)]
#else get path to folder of data sets
else:
    #Directory of data. We used 10 of them for the training data
    data_path = f'DataSet_{DS}/data_{DS}'
    #every file consists of the dictionary of one Job Scheduling Problem
    data_indices = [str(i).zfill(4) for i in range((int(DS)-1)*10000, int(DS)*10000)] #10000 files per data set, indices zero-padded to 4 digits

#loop over dictionaries
for data_ind in data_indices:
    #open dictionary
    with open(f'{data_path}_{data_ind}.pickle', 'rb') as f:
        #load sample dictionary
        sample_dict = pickle.load(f)
        #loop over its keys
        for key in sample_dict:
            #down_sample will contain the selected data
            down_sample = [[],[]]
            n_state = key[0]
            #98 denotes the validation data, 99 the test data
            if DS in ["98", "99"]:
                #only one state-data per n-m-combination for every Job Scheduling Problem
                data_length = 1
                #take first data instance
                down_sample[0] = sample_dict[key][0][:data_length]
                down_sample[1] = sample_dict[key][1][:data_length]
                #update current sample dictionary
                sample_dict[key] = down_sample
            #if training data dictionary
            elif int(DS)-1 in range(10):
                #target index "i" corresponds to "i" being the optimal action in a selected state
                #we only select the data of one state for every such "i" from every dictionary (= Job Scheduling Problem)
                target_indices = [0]*(n_state+1)
                #loop over target-vectors
                for i, row in enumerate(sample_dict[key][1]):
                    #check if we already saved the data of a state whose optimal action is equal to the one of this target vector
                    if target_indices[n_state - np.argmax(row[::-1])] == 0:
                        #add input data
                        down_sample[0].append(sample_dict[key][0][i])
                        #add target values
                        down_sample[1].append(row)
                        #update that we already have one state with optimal action "i" for this dictionary
                        target_indices[n_state - np.argmax(row[::-1])] += 1
                    #break as soon as we have a state for every such "i"
                    if 0 not in target_indices:
                        break
                #we now have (at most) n_state+1 states, each having a different index as optimal action
                options = len(down_sample[1])
                #we randomly sample over them to balance the training data set with regards to the optimal actions
                choice = random.choice(range(options))
                #add data to sample dictionary
                sample_dict[key] = [[down_sample[0][choice]], [down_sample[1][choice]]]
                
        #add the selected data of the sample dictionary to the final dictionary
        merge_dicts(data_dict,sample_dict)
        
#print how much time this process took        
et = time.time()
print(round(et-st), "seconds to merge sample data into final dictionary")

328 seconds to merge sample data into final dictionary
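
The balancing rule for the training data can also be read in isolation. The following is a minimal sketch, not the cell above: `optimal_action` and `pick_balanced_sample` are hypothetical helper names, the target vectors are assumed to have `n_state+1` entries, and the empty-list guard is an addition (the original `random.choice` would raise an `IndexError` on an empty list).

```python
import random
import numpy as np

def optimal_action(row, n_state):
    #index of the last maximal entry of the target vector, as computed in the cell above
    return n_state - int(np.argmax(row[::-1]))

def pick_balanced_sample(inputs, targets, n_state, rng=random):
    #remember the first state seen for every optimal action
    first_seen = {}
    for i, row in enumerate(targets):
        action = optimal_action(row, n_state)
        if action not in first_seen:
            first_seen[action] = i
        #stop as soon as every action is covered
        if len(first_seen) == n_state + 1:
            break
    #assumption: return an empty selection if there is nothing to choose from
    if not first_seen:
        return [[], []]
    #choose uniformly among the covered actions, not among all states
    choice = rng.choice(list(first_seen.values()))
    return [[inputs[choice]], [targets[choice]]]

#toy data: two states with optimal action 0, one state with optimal action 2
targets = [np.array([1., 0., 0.]), np.array([1., 0., 0.]), np.array([0., 0., 1.])]
inputs = ["state_a", "state_b", "state_c"]
sample = pick_balanced_sample(inputs, targets, n_state=2, rng=random.Random(0))
print(sample)
```

Because the random choice is made over one representative per optimal action, a frequent action such as 0 in the toy data is not favoured over a rare one within a single Job Scheduling Problem.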


We now check how well the final data is balanced with regard to the optimal actions.

In [9]:
#print distribution of optimal actions for every key
for key in data_dict:
    n_state = key[0]
    target_indices = [0]*(n_state+1)
    for row in data_dict[key][1]:
        target_indices[n_state - np.argmax(row[::-1])] += 1
    print(n_state, key[1], target_indices)

3 2 [2538, 2548, 2488, 2426]
3 3 [2468, 2517, 2514, 2501]
3 4 [2417, 2469, 2524, 2590]
4 2 [2067, 2070, 2058, 1995, 1810]
4 3 [1973, 2056, 2005, 1935, 2031]
4 4 [2104, 1984, 2061, 1831, 2020]
5 2 [1863, 1915, 1965, 1896, 1754, 607]
5 3 [1727, 1686, 1702, 1739, 1555, 1591]
5 4 [1896, 2011, 1924, 1486, 955, 1728]
6 2 [1960, 1982, 1971, 1687, 1383, 940, 77]
6 3 [2002, 1999, 1898, 1619, 1206, 659, 617]
6 4 [2466, 2286, 1870, 1392, 734, 370, 882]
7 2 [3155, 2554, 1806, 1153, 680, 427, 223, 2]
7 3 [3106, 2588, 1786, 1137, 692, 381, 180, 130]
7 4 [3189, 2537, 1708, 1059, 636, 367, 168, 336]
8 2 [4376, 2448, 1446, 803, 463, 274, 116, 74, 0]
8 3 [3635, 2506, 1708, 958, 590, 321, 167, 81, 34]
8 4 [2832, 2512, 1890, 1235, 697, 397, 238, 94, 105]


Lastly, we will transform the data into a form compatible with our Neural Network. Each list of samples is assigned to one key in the final dictionary. These keys are the 2-tuples of the numbers of remaining Jobs and Machines of the associated states. Every list will be saved separately.

In [10]:
#transform the final dictionary into a data format compatible with a Keras LSTM
st = time.time()
data_into_LSTM_format(data_dict)
et = time.time()
print(round(et-st), "seconds to transform data")

11 seconds to transform data


The merged data will be saved in a sub-directory.

In [None]:
#create folder to save merged LSTM data
if DS == "0":
    LSTM_data_path = f'EstimData/{n}_Jobs/LSTM_EstimData_RR/'
    
else:
    LSTM_data_path = f'DataSet_{DS}/LSTM_Data_RR_{DS}/'
    
#create the folder if it does not exist yet
os.makedirs(LSTM_data_path, exist_ok=True)

In [None]:
#save every merged n-m-combination as a pickle file 
for key in data_dict:
    n_state, m_state = key
    file_path = f'{LSTM_data_path}{n_state}-jobs-{m_state}-machines'
    if DS != "0":
        file_path += f'_{DS}'
    with open(f'{file_path}.pickle', 'wb') as f:
        pickle.dump(data_dict[key], f, pickle.HIGHEST_PROTOCOL)
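
For reference, a saved file can be read back as sketched below. The example writes a hypothetical stand-in entry to a temporary directory first, so it does not touch real data. The array shapes are assumptions based on `data_into_LSTM_format`: one stacked input array of shape (samples, Jobs, 4*Machines+2) and one stacked target array, each wrapped in a one-element list.

```python
import os
import pickle
import tempfile
import numpy as np

#hypothetical stand-in for data_dict[(3, 2)] after the transformation
inputs = np.zeros((5, 3, 10)) #5 samples, 3 Jobs, 4*2+2 entries per Job
targets = np.zeros((5, 4))    #5 target vectors with n_state+1 entries
entry = [[inputs], [targets]]

with tempfile.TemporaryDirectory() as tmp:
    #same naming scheme as above, here for data set "01"
    file_path = os.path.join(tmp, "3-jobs-2-machines_01.pickle")
    with open(file_path, 'wb') as f:
        pickle.dump(entry, f, pickle.HIGHEST_PROTOCOL)
    with open(file_path, 'rb') as f:
        loaded_inputs, loaded_targets = pickle.load(f)

#unwrap the one-element lists to get the arrays
X = loaded_inputs[0]
y = loaded_targets[0]
print(X.shape, y.shape)
```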