Hi all!
In this script, I try to use deep reinforcement learning (deep Q-learning) to solve the knapsack problem.
This is not meant to be an exciting project, just a warm-up (for me at least) to get hands-on experience with RL, $\epsilon$-greedy exploration, embeddings and such.

Concretely, a network estimates a Q-value for each object that can still be picked, and objects are added to the knapsack one at a time.

What is the knapsack problem (KP)?
The KP is a problem where you are given a knapsack (a bag) with limited capacity and a set of objects.
Each object has two attributes, prize and weight. Your goal is to select a subset of objects that maximizes the total prize of the chosen objects without exceeding the capacity of the knapsack.

For example, given a knapsack of capacity 1 (also in the following, we assume weights to be normalized, so capacity knapsack = 1) and objects [prize, weight] = {[3,0.8],[1,0.25],[1,0.25],[1,0.25],[1,0.25]} the best solution would be to NOT pick up $obj_{0}$ (even if it has the greatest prize) while picking up all the other objects.
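
To double-check the example, here is a small brute-force sketch that simply tries every subset. The function name `brute_force_knapsack` is made up for illustration and is not used elsewhere in this notebook; it is only practical for a handful of objects.

```python
from itertools import combinations

# Objects from the example above as (prize, weight) pairs; capacity is normalized to 1.
objects = [(3, 0.8), (1, 0.25), (1, 0.25), (1, 0.25), (1, 0.25)]
capacity = 1.0

def brute_force_knapsack(objects, capacity):
    """Try every subset and return (best_prize, best_subset_indices)."""
    best_prize, best_subset = 0, ()
    indices = range(len(objects))
    for size in range(len(objects) + 1):
        for subset in combinations(indices, size):
            weight = sum(objects[i][1] for i in subset)
            prize = sum(objects[i][0] for i in subset)
            # small tolerance guards against floating-point rounding
            if weight <= capacity + 1e-9 and prize > best_prize:
                best_prize, best_subset = prize, subset
    return best_prize, best_subset

best_prize, best_subset = brute_force_knapsack(objects, capacity)
print(best_prize, best_subset)
```

The best subset leaves out object 0 and takes the four light objects, for a total prize of 4.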

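Why did I choose the KP? Because the KP is often considered one of the easiest NP-hard problems: no polynomial-time algorithm is known for finding the optimal solution, but it can be solved exactly in pseudo-polynomial time with dynamic programming, and simple heuristics already give good solutions.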
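
As a side note, here is a minimal sketch of the classic dynamic-programming solution for the 0/1 knapsack. It assumes the weights are non-negative integers (for instance weights expressed in hundredths), which is an assumption of this sketch and not how the rest of the notebook represents them. The name `knapsack_dp` is made up for illustration.

```python
def knapsack_dp(prizes, weights, capacity):
    """Exact 0/1 knapsack for non-negative integer weights.

    best[c] holds the best total prize achievable with total weight <= c.
    Runs in O(len(prizes) * capacity) time.
    """
    best = [0.0] * (capacity + 1)
    for prize, weight in zip(prizes, weights):
        # iterate capacities downwards so each object is used at most once
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + prize)
    return best[capacity]

# Same toy instance as above, with weights expressed in hundredths.
prizes = [3, 1, 1, 1, 1]
weights = [80, 25, 25, 25, 25]
print(knapsack_dp(prizes, weights, 100))
```

The running time grows with the numeric value of the capacity, not only with the number of objects, which is why this does not contradict NP-hardness.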

As always, we start by importing useful libraries

In [1]:
import matplotlib.pyplot as plt

import numpy as np 
import pandas as pd
import torch 
import random
import itertools
import copy
try:
    import cPickle as pickle
except ModuleNotFoundError:
    import pickle

%matplotlib inline

random.seed(1234)

if torch.cuda.is_available():
    print('GPU')

Then we define the object class

In [2]:
class ObjectCass:
    
    Prize = None
    Weight = None
    
    def __init__(self, reward, weight):
        
        self.Prize = reward
        self.Weight = weight

Now we define the environment.
The environment generates num_objs (input) objects.
It can either generate a random instance (by default) or a fake one for which we know the optimal solution.

The environment stores the objects and their target, which is computed via a greedy heuristic (objects are sorted by prize/weight ratio and added as long as they fit).
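
To illustrate the kind of heuristic meant here, below is a simplified standalone sketch working on plain (prize, weight) tuples. The helper name `greedy_by_ratio` and the toy instance are made up for illustration, and unlike the notebook's `Heuristic` it considers every object.

```python
def greedy_by_ratio(objects, capacity=1.0):
    """objects: list of (prize, weight) pairs. Returns the total prize collected."""
    total_prize, total_weight = 0.0, 0.0
    # best prize-per-unit-weight first
    for prize, weight in sorted(objects, key=lambda o: o[0] / o[1], reverse=True):
        if total_weight + weight > capacity:
            continue  # does not fit, try the next object
        total_prize += prize
        total_weight += weight
    return total_prize

toy = [(0.7, 0.6), (0.5, 0.5), (0.5, 0.5)]
print(greedy_by_ratio(toy))
```

On this toy instance the greedy rule takes the 0.7 object first and then nothing else fits, while taking the two 0.5 objects would give 1.0. So the target is a baseline, not necessarily the optimum.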

For the fake instance, the even-indexed objects have prize 1 and weight (2/num_objs) - epsilon (epsilon being a small number, 0.001 in the code), while the odd-indexed ones have prize 0.1 and weight 0.99. Since the capacity is normalized, the best solution is to choose all and only the even-indexed objects (i.e. total weight = num_objs/2 * ((2/num_objs) - epsilon) < 1 and reward = num_objs/2, assuming num_objs is even).
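
A quick sanity check of this arithmetic, following the convention of the `Environment` code in the next cell (even indices get prize 1). The variable names are introduced here just for the check.

```python
num_objs = 10
eps = 0.001  # the small constant subtracted from the weight of the light objects

# (prize, weight) pairs: even indices are light and valuable, odd ones heavy and cheap
fake_objects = [(1, 2 / num_objs - eps) if i % 2 == 0 else (0.1, 0.99)
                for i in range(num_objs)]

even = [obj for i, obj in enumerate(fake_objects) if i % 2 == 0]
total_weight = sum(w for _, w in even)
total_prize = sum(p for p, _ in even)
print(total_weight, total_prize)
```

All the even-indexed objects fit together, and the leftover capacity is far too small for any of the heavy ones.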

In [3]:
def Heuristic(Objects, scaling_factor=1 ):
    # Greedy baseline: sort by prize/weight ratio and add every object that still fits.
    
    # the last object is set aside and not considered by the heuristic
    special_object = Objects[-1]
    Objects.remove(special_object)
    
    Objects_sorted = sorted(Objects, key=lambda obj: -float(obj.Prize)/obj.Weight)
    weight_total = 0
    price_total = 0
    for obj in Objects_sorted:
        # the capacity is normalized to 1
        if weight_total+obj.Weight>1:
            continue
        price_total+=obj.Prize
        weight_total+=obj.Weight
    # restore the original list
    Objects.append(special_object)
    
    target = torch.tensor(price_total).float()
    target = scaling_factor*target # scaled target to have positive reward for good actions
    target.requires_grad = False
    
    return target

class StateClass:
    
    ObjectsFeatures = None
    res_capacity = None
    Chosen = None
    price = None
    Final = None
    Selectable = None
    SelectableObjectsFeatures = None
    
    def __init__(self, ObjectsFeatures, res_capacity, Chosen, price, env):
        
        self.ObjectsFeatures = ObjectsFeatures
        self.res_capacity = res_capacity
        self.Chosen = Chosen
        self.price = price
        self.Final = 0
        self.Selectable = self.StateMaskFunction(env)
        if len(self.Selectable)==0:
            self.Final=1
        self.SelectableObjectsFeatures = self.SelectableObjectsFeaturesFunction()
    
        return

    def StateMaskFunction(self, env):
        
        indexes = [i for i in range(len(env.Objects)) 
            if i not in self.Chosen and 
            not (self.res_capacity - env.Objects[i].Weight < 0)]
    
        return indexes 
    
    def SelectableObjectsFeaturesFunction(self):
        
        SelectableObjectsFeatures = []
        for i in range(len(self.ObjectsFeatures)):
            if i in self.Selectable:
                SelectableObjectsFeatures.append(self.ObjectsFeatures[i])
        
        if len(SelectableObjectsFeatures)>0:
            SelectableObjectsFeatures = torch.stack(SelectableObjectsFeatures)
        
        return SelectableObjectsFeatures

class Environment:
    
    ObjectsFeatures = None
    Objects = None
    capacity = None
    target = None
    
    def __init__(self, num_objs, FakeBool = False, prize_min=0.01, prize_max=1, weight_min=0.01, weight_max=1, scaling_factor = 1, shuffle = False):
        
        if FakeBool:
            # fake instance whose optimal solution is known
            # best solution (before shuffling) = [1,0,1,0,1,0,1,0,1,0,...]
            Objects = []
            for i in range(num_objs):
                if i%2==1:
                    prize = 0.1
                    weight = 0.99
                    obj = ObjectCass(prize,weight)
                    Objects.append(obj)
                else:
                    prize = 1
                    weight = (2/float(num_objs))-0.001
                    obj = ObjectCass(prize,weight)
                    Objects.append(obj)
            if shuffle:
                random.shuffle(Objects)
            target = torch.tensor(int(num_objs/2)).float() # real target
            target = scaling_factor*target # scaled target to have positive reward for good actions
            target.requires_grad = False
            ObjectsFeatures = torch.tensor([[obj.Prize, obj.Weight] for obj in Objects])
            ObjectsFeatures.requires_grad = False            

        else:
            Objects = []
            for i in range(num_objs):
                prize = round(random.uniform(prize_min, prize_max), 2)
                weight = round(random.uniform(weight_min, weight_max), 2)
                obj = ObjectCass(prize,weight)
                Objects.append(obj)
            target = Heuristic(Objects, scaling_factor)
            ObjectsFeatures = torch.tensor([[obj.Prize, obj.Weight] for obj in Objects])
            ObjectsFeatures.requires_grad = False

        capacity = torch.tensor([1]).float() #
        capacity.requires_grad = False

        self.ObjectsFeatures = ObjectsFeatures
        self.Objects = Objects
        self.capacity = capacity
        self.target = target
            
        return
    
    def CreateInitialState(self):
        
        price = torch.tensor(0).float() #
        price.requires_grad = False
        ObjectsFeatures = []
        for obj in self.ObjectsFeatures:
            ObjectsFeatures.append(torch.cat([obj, self.capacity]))
        ObjectsFeatures = torch.stack(ObjectsFeatures)
        
        return StateClass(ObjectsFeatures, self.capacity, [], price, self)

    def step(self, state_old, obj_chosen):

        capacity = state_old.res_capacity-state_old.ObjectsFeatures[obj_chosen][1].item()
        price = state_old.price+state_old.ObjectsFeatures[obj_chosen][0].item()
        New_Chosen = [old_chosen for old_chosen in state_old.Chosen]
        New_Chosen.append(obj_chosen)
        ObjectsFeatures = state_old.ObjectsFeatures.clone()
        for i in range(len(ObjectsFeatures)):
            ObjectsFeatures[i][-1] = capacity
        state_new = StateClass(ObjectsFeatures, capacity, New_Chosen, price, self)
        
        # new state, reward (use self instead of relying on a global env)
        return state_new, self.Objects[obj_chosen].Prize
    
    def RandomPicker(self):
        
        Chosen = []
        res_cap = 0 # note: this is the weight used so far, not the residual capacity
        # <= so that an exact fit is allowed, as in StateMaskFunction
        possible_objs = [obj for obj in self.Objects if obj.Weight + res_cap <= self.capacity and obj not in Chosen]
        price = 0
        while len(possible_objs)>0:
            obj = random.choice(possible_objs)
            res_cap+=obj.Weight
            price+=obj.Prize
            Chosen.append(obj)
            possible_objs = [obj for obj in self.Objects if obj.Weight + res_cap <= self.capacity and obj not in Chosen]
        
        return price

env = Environment(2)
s_0 = env.CreateInitialState()
s_1, reward = env.step(s_0,0)
print('s_0 cap', s_0.res_capacity)
print('s_1 cap', s_1.res_capacity)
print('s_0 price', s_0.price)
print('s_1 price', s_1.price)
print('s_0 obj features', s_0.ObjectsFeatures)
print('s_1 obj features', s_1.ObjectsFeatures)
print('reward', reward)

('s_0 cap', tensor([1.]))
('s_1 cap', tensor([0.5500]))
('s_0 price', tensor(0.))
('s_1 price', tensor(0.9700))
('s_0 obj features', tensor([[0.9700, 0.4500, 1.0000],
        [0.0200, 0.9100, 1.0000]]))
('s_1 obj features', tensor([[0.9700, 0.4500, 0.5500],
        [0.0200, 0.9100, 0.5500]]))
('reward', 0.97)


Implementation based on: https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f

Let's also define a NN which uses an attention mechanism.

In [4]:
class Net_Attention(torch.nn.Module):
    
    def __init__(self, num_input_features,h_model, num_head ,num_layers, dim_feedforward, p_dropout, num_outputs):    
        
        super(Net_Attention, self).__init__()
        
        self.emb = torch.nn.Linear(num_input_features,  h_model)
        encoder_layer = torch.nn.TransformerEncoderLayer(d_model=h_model, nhead=num_head, 
                                                   dim_feedforward=dim_feedforward, dropout=p_dropout)
        self.transformer_encoder = torch.nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
                
        # FINAL LAYER
        self.final_layer_0 = torch.nn.Linear(h_model,h_model)
        self.final_layer_1 = torch.nn.Linear(h_model,1)

    def forward(self, States, requires_grad = True):
        
        ObjectsFeatures = [state.SelectableObjectsFeatures for state in States]
        if requires_grad == False:
            for obj in ObjectsFeatures:
                if len(obj)==0:
                    raise ValueError('state with no selectable objects passed to the network')
        
        ObjectsFeatures = torch.stack(ObjectsFeatures)
        ObjectsFeatures.requires_grad = requires_grad # in case you don't need to compute the gradient
        
        # define relu operation
        ReLU = torch.nn.ReLU()
        
        E = self.emb(torch.transpose(ObjectsFeatures,0,1))
        # where 
        # S is the source sequence length,  
        # N is the batch size,  
        # E is the feature number        
        # you want 
        # E.size() = S,N,E = num_obj x batch x h_model
        
        #c = E
        c = self.transformer_encoder(E)
        c = self.final_layer_0(c)
        c = ReLU(c)
        Qvals = self.final_layer_1(c)
        Qvals = torch.squeeze(Qvals, dim = 2)
        Qvals = torch.transpose(Qvals,0,1)
        
        return Qvals

Now we have a function that, given a state, returns the Q values of the selectable objects. Next we have to define the loss.
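
For reference, the loss computed in `Learning_step` in the next cell is the usual squared temporal-difference error, averaged over the minibatch $B$. The symbols $\theta$ (weights of `Net_predict`), $\theta^-$ (weights of `Net_target`) and $y$ are notation introduced here for illustration:

$$
y =
\begin{cases}
r & \text{if } s' \text{ is final} \\
r + \gamma \max_{a'} Q_{\theta^-}(s', a') & \text{otherwise}
\end{cases}
\qquad
L(\theta) = \frac{1}{|B|} \sum_{(s, a, r, s') \in B} \left( y - Q_{\theta}(s, a) \right)^2
$$

Here $a'$ ranges over the objects that are still selectable in $s'$, and $\gamma = 1$ by default.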

In [5]:
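def PrintMaxQvals(Qvals_print,  stringa = ''):
    
    print('printing just about 100 points for readability reasons')
    # convert (possibly grad-tracking) tensors to plain floats before plotting
    values = [float(q) for q in Qvals_print]
    plt.figure()
    plt.plot(values, label='Qmax '+ stringa)
    plt.ylabel('Qvals ' + stringa)
    plt.legend()
    return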

def MaskFunctionArray(States, Environments, Qvalss):
    
    QI = [MaskFunction(States[i], Environments[i], Qvalss[i]) for i in range(len(States))]
    Qvalss_mask = [q[0] for q in QI]
    indexess = [q[1] for q in QI]
    
    #Qvalss_mask = []
    #indexess = []
    #for i in range(len(States)):
    #    q,i = MaskFunction(States[i], Environments[i], Qvalss[i])
    #    Qvalss_mask.append(q)
    #    indexess.append(i)
        
    return Qvalss_mask, indexess

def MaskFunction(state, env, Qvals):
        
    indexes = [i for i in range(len(env.Objects)) 
               if i not in state.Chosen and 
               not (state.res_capacity - env.Objects[i].Weight < 0)]
    Qvals_out = [Qvals[i] for i in indexes]
    if len(Qvals_out)!=0:
        Qvals_out = torch.stack(Qvals_out)
    
    return Qvals_out, indexes

class BufferClass:
    
    replay_lenght = None
    minibatch_size = None
    Buffer = None
    
    def __init__(self,replay_lenght, minibatch_size):
        
        self.replay_lenght = replay_lenght
        self.minibatch_size = minibatch_size
        self.Buffer = []
        
    def Add(self, Transitions):
                
        while len(self.Buffer)+len(Transitions)>=self.replay_lenght:
            self.Buffer.remove(random.choice(self.Buffer))
        for t in Transitions:
            self.Buffer.append(t)
        
        return
        
    def Minibatch(self, new_transactions = 0):
        
        if len(self.Buffer) <= self.minibatch_size:
            Elements = list(self.Buffer)
        else:
            new_transactions = min(new_transactions, self.minibatch_size)
            # note: Buffer[-0:] would return the whole buffer, so handle 0 explicitly
            last_transactions = self.Buffer[-new_transactions:] if new_transactions>0 else []
            Elements = random.sample(self.Buffer, self.minibatch_size - new_transactions)
            # the sample may contain some of the newest transitions again; not a big deal but good to know
            Elements = last_transactions + Elements
        # defined in both branches so that the return below always works
        Minibatch = Elements

        #States = [experience[0] for experience in Elements]
        #Actions = [experience[1] for experience in Elements]
        #Rewards = [experience[2] for experience in Elements]
        #New_States = [experience[3] for experience in Elements]
        #
        #Minibatch = [States, Actions, Rewards, New_States]
        
        return Minibatch

def Learning_step(Net_predict,Net_target,optimizer, Minibatch, gamma = 1):
    
    optimizer.zero_grad()   # zero the gradient buffers
    Losses = []
    for experience in Minibatch:
        state, action, reward, new_state = experience
        action_index = state.Selectable.index(action)
        if not new_state.Final==1:
            # the target network only provides a fixed target, so detach it from the graph
            target = reward + gamma*torch.max(Net_target([new_state], requires_grad = False).squeeze(dim = 0)).detach()
            loss = (target - Net_predict([state]).squeeze(dim = 0)[action_index])**2
        else:
            loss = (reward - Net_predict([state]).squeeze(dim = 0)[action_index])**2
        Losses.append(loss)
    Losses = torch.stack(Losses)
    
    #States = []
    #Actions = []
    #Rewards = []
    #New_States = []
    #for experience in Minibatch:
    #    state, action, reward, new_state = experience
    #    States.append(state)
    #    Actions.append(action)
    #    Rewards.append(reward)
    #    New_States.append(new_state)    
    #New_States_Target = [New_States[i] for i in range(len(New_States)) if New_States[i].Final==0]
    #Max_New_States_Target, _ = torch.max(Net_target(New_States_Target, requires_grad = False), dim = 1)
    #Max_New_States = []
    #j = 0
    #for i in range(len(New_States)):
    #    if New_States[i].Final==0: # not final
    #        Max_New_States.append(Max_New_States_Target[j])
    #        j+=1
    #    else:
    #        Max_New_States.append(torch.tensor([0]))
    #Rewards = torch.tensor(Rewards)
    #Auxiliary = Net_predict(States)
    #Predictions_Actions = [Auxiliary[i][Actions[i]] for i in range(len(Auxiliary))]
    #Predictions_Actions = torch.stack(Predictions_Actions)
    #stoptocheck
    # note that some Max_New_States are zero.
    #Losses = (Rewards + gamma*Max_New_States - Predictions_Actions)**2 
    
    loss = Losses.mean()
    loss.backward()
    
    # gradient clipping (after backward, before the optimizer step)
    clip_value = 1
    torch.nn.utils.clip_grad_value_(Net_predict.parameters(), clip_value)
    
    # apply gradient
    optimizer.step()
    
    return loss.tolist()

def Evaluate(num_epoch, Net_predict, num_objs, FakeBool = False, shuffle = False, times = 1, images = True, examples = 0):
    
        
    better = 0
    for time_counter in range(1,times+1):
        print('evaluating model..('+str(time_counter)+' out of '+str(times)+')')
        # set to evaluation mode (disables dropout)
        Net_predict.eval()
        # define the number of epochs
        T = []
        R = []
        R_random = []
        for e_counter in range(num_epoch):
            env = Environment(num_objs, FakeBool = FakeBool, shuffle = shuffle)
            state = env.CreateInitialState()
            while True:
                Qvals = Net_predict([state])
                Qvals = torch.squeeze(Qvals, dim = 0)
                action = torch.argmax(Qvals)
                action_index = state.Selectable[action]
                new_state, reward = env.step(state, action_index)
                state = new_state
                # break if no more possible actions
                if state.Final==1:
                    break

            T.append(env.target)
            R.append(new_state.price)
            R_random.append(env.RandomPicker())

        R_norm = [R[i]/T[i] for i in range(len(R))]
        R_random_norm = [R_random[i]/T[i] for i in range(len(R_random))]
        T_norm = [1 for i in range(len(R))]
        
        if images:
            plt.figure()
            plt.plot(R, label='Rewards')
            plt.plot(R_random, label='Rewards (random picker)')
            plt.plot(T, label='baseline')
            plt.ylabel(' Rewards ' )
            plt.legend()
            plt.ylim(bottom=0)
            plt.figure()
            plt.plot(R_norm, label='Rewards (normalized)')
            plt.plot(R_random_norm, label='Rewards (random picker) normalized')
            plt.plot(T_norm, label='baseline (normalized)')
            plt.ylabel(' Rewards (normalized)' )
            plt.legend()
            plt.ylim(bottom=0)

        print('average norm reward (net)   : ', np.mean(R_norm))
        print('average norm reward (random): ', np.mean(R_random_norm))
        if np.mean(R_norm)>np.mean(R_random_norm):
            better+=1
    print('\n\n\n')
    print('the NN was better than the random picker '+str(better)+' times out of '+str(times))            
    print('\n\n\nExample:')
    if examples>0:
        for example_counter in range(examples):
            reward_example = 0
            env = Environment(num_objs, FakeBool = FakeBool, shuffle = shuffle)
            print('Objects: ')
            for i in range(len(env.Objects)):
                print('object num '+str(i)+' price: '+str(env.Objects[i].Prize)+' weight: '+str(env.Objects[i].Weight))
            print('heuristic target: ', env.target)
            state = env.CreateInitialState()
            while True:
                Qvals = Net_predict([state])
                Qvals = torch.squeeze(Qvals, dim = 0)
                action = torch.argmax(Qvals)
                action_index = state.Selectable[action]
                print('NN selects: ', action_index)
                new_state, reward = env.step(state, action_index)
                state = new_state
                reward_example+= env.Objects[action_index].Prize
                # break if no more possible actions
                if state.Final==1:
                    break
            print('NN reward ', reward_example)


Now let's try the whole RL framework.

In [6]:
num_input_features = 3 # prize, weight, res_capacity 
num_objs = 10
num_outputs = num_objs
h_model = 4
num_head = 2
num_layers = 2
dim_feedforward = 4
p_dropout = 0.1

Net_predict = Net_Attention(num_input_features,h_model, num_head ,num_layers, dim_feedforward, p_dropout, num_outputs)
Net_target = Net_Attention(num_input_features,h_model, num_head ,num_layers, dim_feedforward, p_dropout, num_outputs)

In [7]:
# Fake Instance?
FakeBool = False
shuffleBool = True
# define the number of epochs
num_epoch = 2500
# epsilon
epsilon_max = 1
epsilon_min = 0.05
#
minibatch_size = 16
# C (how often update Net_target)
C = max(int(num_epoch/100),100)
# unit (for printing)
unit = max(int(num_epoch/100),1)
PATH_predict = '/home/big_bamboo/Downloads/Net_Predict'
PATH_target = '/home/big_bamboo/Downloads/Net_Target'

# is it a new test?
LoadTest = False
if LoadTest:
    Net_predict.load_state_dict(torch.load(PATH_predict))
    Net_target.load_state_dict(torch.load(PATH_target))
    with open('RLKP-num_epoch_start-Save.pkl', 'rb') as input:
        num_epoch_start = pickle.load(input)
    with open('RLKP-Replay-buffer-Save.pkl', 'rb') as input:
        ReplayBuffer = pickle.load(input)
    with open('RLKP-Qvals_print_init-Save.pkl', 'rb') as input:
        Qvals_print_init = pickle.load(input)
    with open('RLKP-Qvals_print_final-Save.pkl', 'rb') as input:
        Qvals_print_final = pickle.load(input)
    with open('RLKP-optimizer-Save.pkl', 'rb') as input:
        optimizer = pickle.load(input)
else:
    # create optimizer
    optimizer = torch.optim.SGD(Net_predict.parameters(), lr=1e-3, momentum = 0.9)
    # replay buffer
    replay_lenght = 1e6
    ReplayBuffer = BufferClass(replay_lenght,minibatch_size)
    Qvals_print_init = []
    Qvals_print_final = []
    num_epoch_start = 0

def SaveModel(Net_predict,Net_target,ReplayBuffer, e_counter, Qvals_print_init, Qvals_print_final):
    
    torch.save(Net_predict.state_dict(), PATH_predict)
    torch.save(Net_target.state_dict(), PATH_target)
    with open('RLKP-num_epoch_start-Save.pkl', 'wb') as output:
        pickle.dump(e_counter, output, pickle.HIGHEST_PROTOCOL)
    with open('RLKP-Replay-buffer-Save.pkl', 'wb') as output:
        pickle.dump(ReplayBuffer, output, pickle.HIGHEST_PROTOCOL)
    with open('RLKP-Qvals_print_init-Save.pkl', 'wb') as output:
        pickle.dump(Qvals_print_init, output, pickle.HIGHEST_PROTOCOL)
    with open('RLKP-Qvals_print_final-Save.pkl', 'wb') as output:
        pickle.dump(Qvals_print_final, output, pickle.HIGHEST_PROTOCOL)
    with open('RLKP-optimizer-Save.pkl', 'wb') as output:
        pickle.dump(optimizer, output, pickle.HIGHEST_PROTOCOL)
        
# optimizer scheduler
lmbda = lambda epoch: 1
scheduler = torch.optim.lr_scheduler.MultiplicativeLR(optimizer, lr_lambda=lmbda)

In [None]:
def EpsGreedy(Qvals,indexes, e_counter):
    
    # select action
    epsilon = (epsilon_max-epsilon_min)*(num_epoch-e_counter)/num_epoch + epsilon_min
    if random.random() < epsilon:
        action_index = random.choice(indexes)
    else:
        action = torch.argmax(Qvals)
        action_index = indexes[action]
    
    return action_index
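
To get a feel for the schedule used in `EpsGreedy`, here is a minimal standalone sketch of the same linear decay. The function name `linear_epsilon` is introduced just for illustration; the default values mirror `epsilon_max`, `epsilon_min` and `num_epoch` from the cells above.

```python
def linear_epsilon(e_counter, num_epoch, epsilon_max=1.0, epsilon_min=0.05):
    """Linearly interpolate from epsilon_max (epoch 0) to epsilon_min (last epoch)."""
    return (epsilon_max - epsilon_min) * (num_epoch - e_counter) / num_epoch + epsilon_min

# exploration probability at the start, in the middle and at the end of training
for e in (0, 1250, 2500):
    print(e, linear_epsilon(e, 2500))
```

So the agent starts by acting almost completely at random and ends up following the network about 95% of the time.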

In [None]:
# set to train
Net_predict.train()
Q_init_print = []
Q_final_print = []

for e_counter in range(num_epoch_start, num_epoch):
    if (e_counter)%100==0:
        print('epoch: ',(e_counter))
        # save model
        #SaveModel(Net_predict,Net_target,ReplayBuffer, e_counter, Qvals_print_init, Qvals_print_final)
    # creating instance
    env = Environment(num_objs, FakeBool = FakeBool, shuffle = shuffleBool)
    state = env.CreateInitialState()
    Transitions = []
    first = True
    while True:
        # Compute Qvals
        Qvals = Net_predict([state])
        Qvals = Qvals.squeeze() # reduce dimension
        # just for printing 
        if first:
            first = False
            if e_counter%max(int((num_epoch+minibatch_size)/100),1)==0:
                Q_init_print.append(torch.max(Qvals))
        # select action
        action_index = EpsGreedy(Qvals,state.Selectable, e_counter)
        # environment step
        new_state, reward = env.step(state, action_index)
        Transitions.append([state, action_index, reward, new_state])
        state = new_state
        # break if no more possible actions
        if state.Final==1:
            if e_counter%max(int((num_epoch+minibatch_size)/100),1)==0:
                Q_final_print.append(torch.max(Qvals))
            break
    # replay buffer
    ReplayBuffer.Add(Transitions)    
    if (e_counter- minibatch_size)>= minibatch_size:
        l = Learning_step(Net_predict,Net_target,optimizer, ReplayBuffer.Minibatch(new_transactions = len(Transitions)))
    if (e_counter)%C == 0:
        Net_target.load_state_dict(Net_predict.state_dict())
    if (e_counter)%int(num_epoch/5)==0 and e_counter>minibatch_size:
        scheduler.step()
        for param_group in optimizer.param_groups:
            print('learning rate ', param_group['lr'])
            if param_group['lr']<1e-10:
                raise RuntimeError('the lr dropped too low, something is wrong')

PrintMaxQvals(Q_init_print,  stringa = 'initial Q')
PrintMaxQvals(Q_final_print,  stringa = 'final Q')
Evaluate(100, Net_predict, num_objs, FakeBool = FakeBool, shuffle = shuffleBool)

('epoch: ', 0)
('epoch: ', 100)
('epoch: ', 200)
('epoch: ', 300)
('epoch: ', 400)
('epoch: ', 500)
('learning rate ', 0.001)
('epoch: ', 600)
('epoch: ', 700)
('epoch: ', 800)
('epoch: ', 900)
('epoch: ', 1000)
('learning rate ', 0.001)


In [None]:
Evaluate(100, Net_predict, num_objs, FakeBool = False, shuffle = shuffleBool, times = 10, images = False, examples = 1)

Do you want to try your own instance? Here you go!

In [None]:
# set to evaluation mode
Net_predict.eval()
Net_target.eval()
env = Environment(num_objs)
# first prize, second weight
Objects_numbers = [
    [0.98, 0.23], # obj 0
    [0.76, 0.45], # obj 1
    [0.54, 0.67], # obj 2
    [0.32, 0.89], # obj 3
    [0.10, 0.11], # obj 4
    [0.12, 0.33], # obj 5
    [0.34, 0.55], # obj 6
    [0.56, 0.77], # obj 7
    [0.78, 0.99], # obj 8
    [0.99, 0.22] # obj 9
]

Objects = []
for i in range(len(Objects_numbers)):
    prize = Objects_numbers[i][0]
    weight = Objects_numbers[i][1]
    obj = ObjectCass(prize,weight)
    Objects.append(obj)
env.Objects = Objects
env.target = Heuristic(Objects, 1)
env.ObjectsFeatures = torch.tensor([[obj.Prize, obj.Weight] for obj in Objects])
env.ObjectsFeatures.requires_grad = False
print('heuristic target: ', env.target)
state = env.CreateInitialState()
total_reward = 0
while True:
    # Compute Qvals
    Qvals = Net_predict([state])
    Qvals = Qvals.squeeze() # reduce dimension
    # select action
    action = torch.argmax(Qvals)
    action_index = state.Selectable[action]
    print('NN chooses obj: ', action_index)
    # environment step
    new_state, reward = env.step(state, action_index)
    total_reward+=reward
    state = new_state
    # break if no more possible actions
    if state.Final==1:
        break
print('NN reward: ', total_reward)

TO DO:
- Training with FakeBool = True and then with FakeBool = False (250 iterations each) gives very good solutions. Why? (Most likely because the same NN was reused and the fake instances are very informative.)

STILL TO DO:
- investigate why training is still so slow (compare it with the evaluation)
- find a way to use minibatch_size = 1024 without making everything crash? -- Are we sure about this?
- compute losses in (mini)batches -> this does not work if you feed only the relevant objects, because states with different numbers of selectable objects cannot be stacked
- create environments in batches (same issue as the previous point)
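
One possible way around the stacking problem, which I have not tried in this notebook, is to pad every state to a common length and keep a boolean mask of the padded positions. The sketch below uses plain Python lists, and the helper names `pad_and_mask` and `masked_max` are hypothetical. In PyTorch, such a mask could presumably be passed as `src_key_padding_mask` to the transformer encoder so that padded objects are ignored by the attention.

```python
def pad_and_mask(feature_lists, pad_value=0.0):
    """Pad variable-length lists of feature vectors to a common length.

    Returns (padded, mask) where mask[b][i] is True for padding positions.
    """
    max_len = max(len(feats) for feats in feature_lists)
    num_features = len(feature_lists[0][0])
    padded, mask = [], []
    for feats in feature_lists:
        n_pad = max_len - len(feats)
        padding = [[pad_value] * num_features for _ in range(n_pad)]
        padded.append([list(f) for f in feats] + padding)
        mask.append([False] * len(feats) + [True] * n_pad)
    return padded, mask

def masked_max(q_values, mask):
    """Max of each row of Q-values, ignoring padded positions."""
    return [max(q for q, m in zip(row, row_mask) if not m)
            for row, row_mask in zip(q_values, mask)]

# two states with 3 and 1 selectable objects; features = [prize, weight, res_capacity]
batch = [
    [[0.9, 0.2, 1.0], [0.5, 0.4, 1.0], [0.1, 0.7, 1.0]],
    [[0.3, 0.1, 0.2]],
]
padded, mask = pad_and_mask(batch)
print(mask)
# made-up Q-values: the padded positions hold large values that must be ignored
print(masked_max([[0.2, 0.8, 0.1], [0.4, 9.0, 9.0]], mask))
```

With this layout every state in the batch has the same shape, at the cost of some wasted computation on the padded positions.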

DONE:
- make sure that the newest transitions are in the minibatch # DONE
- put res_capacity as a repeated object feature in the state (previously it was disregarded) # DONE
- feed the NN just the relevant objects # DONE, but this prevents batching the creation of new environments