Training on multiple csvs #1

Closed · armheb opened this issue Jul 9, 2020 · 9 comments

armheb commented Jul 9, 2020

Hi, thanks for sharing your awesome work. I want to train the transformer model on multiple CSVs that cover the same time span. Should I just concatenate them into one big dataframe and train the model on that?
Thanks

JanSchm (Owner) commented Jul 9, 2020

Hi armheb,

Yes, concatenation is one possible solution.
However, if all the files combined are too large to keep in memory, you can use a DataGenerator to feed the data to the model incrementally. Here is an example:

```python
import numpy as np
import tensorflow as tf


class DataGenerator(tf.keras.utils.Sequence):

    def __init__(self, paths, seq_len, batch_size):
        self.paths = paths            # list of CSV file paths
        self.seq_len = seq_len        # length of each input sequence
        self.batch_size = batch_size  # number of files per generator step

    def __len__(self):
        # Number of file batches per epoch
        return int(np.ceil(len(self.paths) / float(self.batch_size)))

    def __getitem__(self, idx):
        path_batch = self.paths[idx * self.batch_size:(idx + 1) * self.batch_size]

        train_input_ohlcv, y = list(), list()
        for path in path_batch:
            seq = preprocess_data(path).values  # user-defined preprocessing

            # Slide a window of length seq_len over the rows of this file
            for i in range(self.seq_len, len(seq)):
                in_seq, out_seq = seq[i - self.seq_len:i], seq[i]

                train_input_ohlcv.append(in_seq[-self.seq_len:])
                y.append(out_seq[3])  # target: column 3 of the OHLCV row

        train_input_ohlcv = np.asarray(train_input_ohlcv, dtype=np.float32)
        y = np.asarray(y, dtype=np.float32)
        return train_input_ohlcv, y


train_gen = DataGenerator(train_seq_paths, seq_len, batch_size)
val_gen = DataGenerator(val_seq_paths, seq_len, batch_size)
```

Then, when training the model, the fit call looks as follows:

```python
# Note: batch_size is not passed to fit() here, since the Sequence
# already yields complete batches.
history = model.fit(train_gen,
                    steps_per_epoch=len(train_seq_paths) // batch_size,
                    validation_data=val_gen,
                    validation_steps=len(val_seq_paths) // batch_size,
                    callbacks=[callbacks],
                    epochs=35,
                    shuffle=True,
                    max_queue_size=2,
                    verbose=1)
```

I hope this helps.

armheb (Author) commented Jul 10, 2020

Thank you so much for the great explanation. I will try that and share the results here.

JanSchm (Owner) commented Jul 13, 2020

I'm looking forward to the results. If you have any additional questions, just let me know.

JanSchm closed this as completed Jul 13, 2020

armheb (Author) commented Jul 15, 2020

Hi, thanks, you've been very helpful. I modified your DataGenerator code a bit and got it to start training as you said, but now, at the end of the first epoch, I get out-of-memory errors on the GPU. Here is the error:
```
OOM when allocating tensor with shape[3736448,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[node model/transformer_encoder/multi_attention/single_attention_6/dense_20/Tensordot/MatMul (defined at :25) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_train_function_38226]

Function call stack:
train_function
```

I'm training on a Titan XP with 12 GB of memory, and I also decreased the batch size and seq_len, but I'm still getting the same error.

armheb (Author) commented Jul 15, 2020

This is my DataGenerator class:

```python
class DataGenerator(tf.keras.utils.Sequence):

    def __init__(self, paths, seq_len, batch_size):
        self.paths = paths
        self.seq_len = seq_len
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.paths) / float(self.batch_size)))

    def __getitem__(self, idx):
        path_batch = self.paths[idx * self.batch_size:(idx + 1) * self.batch_size]

        train_input_ohlcv, y = list(), list()
        for path in path_batch:
            seq = preprocess_data(path).values

            for i in range(self.seq_len, len(seq)):
                in_seq, out_seq = seq[i - self.seq_len:i], seq[i]

                train_input_ohlcv.append(in_seq[-self.seq_len:])
                y.append(out_seq[1])

        train_input_ohlcv = np.array(train_input_ohlcv)
        y = np.array(y)
        return train_input_ohlcv, y
```

JanSchm (Owner) commented Jul 15, 2020

If you are not shuffling your files during training, it looks like the last files that go into the generator have a lot of entries. What I can derive from shape[3736448,256] is that you are passing 3736448 sequences with a length of 256 into the model.

The 3736448 is the aggregated batch size of that file batch.

Just check whether you have a very large file in your dataset and potentially exclude it for now.
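
A quick way to spot it is to print the shape of each batch the generator produces; this is just a sketch, assuming the `DataGenerator` and `train_gen` defined above:

```python
# Sketch (assumes the DataGenerator and `train_gen` from above): print the
# array shapes of every file batch so an oversized one shows up before it
# ever reaches the GPU.
for idx in range(len(train_gen)):
    x_batch, y_batch = train_gen[idx]
    print(f"file batch {idx}: x {x_batch.shape}, y {y_batch.shape}")
```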

armheb (Author) commented Jul 15, 2020

Thanks for your answer. I don't have a large file in the dataset; in the preprocess_data function I return the preprocessed dataframe with the same shape. Do you think there is anything wrong in the second loop of the generator class? That was the part I changed.

armheb (Author) commented Jul 15, 2020

By putting all the sequences together, the model can train, although it takes about 4.5 hours per epoch!
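
For reference, the everything-in-memory route discussed at the start of the thread might look roughly like the sketch below (assuming the same `preprocess_data` helper, the `train_seq_paths` list, and `seq_len`; the target column is an assumption):

```python
import numpy as np
import pandas as pd

# Rough sketch: preprocess every CSV, concatenate the results into one
# dataframe, then build all windows in memory instead of per file batch.
frames = [preprocess_data(path) for path in train_seq_paths]
seq = pd.concat(frames, ignore_index=True).values

X, y = [], []
for i in range(seq_len, len(seq)):
    X.append(seq[i - seq_len:i])
    y.append(seq[i, 3])  # assumed target column, as in the generator above

X = np.asarray(X, dtype=np.float32)
y = np.asarray(y, dtype=np.float32)
```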

armheb (Author) commented Aug 25, 2020

Hi, I trained the model for about a week, but unfortunately the final result was a straight line through the middle. Do you have plans to update the repo? Could you please share your weights so I can fine-tune the model based on them?
I really appreciate your work. Thanks.
