Training on multiple csvs #1

Closed · armheb opened this issue Jul 9, 2020 · 9 comments

armheb commented Jul 9, 2020

Hi, thanks for sharing your awesome work. I want to train the transformer model on multiple CSVs that cover the same time span. Should I just concatenate them into one big dataframe and train the model on that?
Thanks

JanSchm (Owner) commented Jul 9, 2020

Hi armheb,

Yes, concatenation is one possible solution.
However, if all the files combined are too large to keep in memory, you can use a DataGenerator to feed the data to the model incrementally. Here is an example:

```python
import numpy as np
import tensorflow as tf


class DataGenerator(tf.keras.utils.Sequence):

    def __init__(self, paths, seq_len, batch_size):
        self.paths = paths            # list of CSV file paths
        self.seq_len = seq_len        # length of each input sequence
        self.batch_size = batch_size  # number of files per generator step

    def __len__(self):
        # Number of file batches per epoch
        return int(np.ceil(len(self.paths) / float(self.batch_size)))

    def __getitem__(self, idx):
        path_batch = self.paths[idx * self.batch_size:(idx + 1) * self.batch_size]

        train_input_ohlcv, y = list(), list()
        for path in path_batch:
            seq = preprocess_data(path).values  # user-defined preprocessing

            # Slide a window of length seq_len over the rows of this file
            for i in range(self.seq_len, len(seq)):
                in_seq, out_seq = seq[i - self.seq_len:i], seq[i]

                train_input_ohlcv.append(in_seq[-self.seq_len:])
                y.append(out_seq[3])  # target: column 3 of the OHLCV row

        train_input_ohlcv = np.asarray(train_input_ohlcv, dtype=np.float32)
        y = np.asarray(y, dtype=np.float32)
        return train_input_ohlcv, y


train_gen = DataGenerator(train_seq_paths, seq_len, batch_size)
val_gen = DataGenerator(val_seq_paths, seq_len, batch_size)
```

Then, when training the model, the fit call looks as follows:

```python
# Note: batch_size is not passed to fit() here, since the Sequence
# already yields complete batches.
history = model.fit(train_gen,
                    steps_per_epoch=len(train_seq_paths) // batch_size,
                    validation_data=val_gen,
                    validation_steps=len(val_seq_paths) // batch_size,
                    callbacks=[callbacks],
                    epochs=35,
                    shuffle=True,
                    max_queue_size=2,
                    verbose=1)
```

I hope this helps.

armheb (Author) commented Jul 10, 2020

Thank you so much for the great explanation. I will try that and share the results here.

JanSchm (Owner) commented Jul 13, 2020

I'm looking forward to the results. If you have any additional questions, just let me know.

JanSchm closed this as completed Jul 13, 2020

armheb (Author) commented Jul 15, 2020

Hi, thanks, you've been very helpful. I modified your DataGenerator code a bit and got it to start training as you said, but now, at the end of the first epoch, I get out-of-memory errors on the GPU. Here is the error:
```
OOM when allocating tensor with shape[3736448,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[node model/transformer_encoder/multi_attention/single_attention_6/dense_20/Tensordot/MatMul (defined at :25) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_train_function_38226]

Function call stack:
train_function
```

I'm training on a Titan XP with 12 GB of memory, and I also decreased the batch size and seq_len, but I'm still getting the same error.

armheb (Author) commented Jul 15, 2020

This is my DataGenerator class:

```python
class DataGenerator(tf.keras.utils.Sequence):

    def __init__(self, paths, seq_len, batch_size):
        self.paths = paths
        self.seq_len = seq_len
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.paths) / float(self.batch_size)))

    def __getitem__(self, idx):
        path_batch = self.paths[idx * self.batch_size:(idx + 1) * self.batch_size]

        train_input_ohlcv, y = list(), list()
        for path in path_batch:
            seq = preprocess_data(path).values

            for i in range(self.seq_len, len(seq)):
                in_seq, out_seq = seq[i - self.seq_len:i], seq[i]

                train_input_ohlcv.append(in_seq[-self.seq_len:])
                y.append(out_seq[1])

        train_input_ohlcv = np.array(train_input_ohlcv)
        y = np.array(y)
        return train_input_ohlcv, y
```

JanSchm (Owner) commented Jul 15, 2020

If you are not shuffling your files during training, it looks like the last files that go into the generator have a lot of entries. What I can derive from shape[3736448,256] is that you are passing 3736448 sequences with a length of 256 into the model.

The 3736448 is the aggregated batch size of that file batch.

Just check whether you have a very large file in your dataset and potentially exclude it for now.
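
A quick way to spot it is to print the shape of each batch the generator produces; this is just a sketch, assuming the `DataGenerator` and `train_gen` defined above:

```python
# Sketch (assumes the DataGenerator and `train_gen` from above): print the
# array shapes of every file batch so an oversized one shows up before it
# ever reaches the GPU.
for idx in range(len(train_gen)):
    x_batch, y_batch = train_gen[idx]
    print(f"file batch {idx}: x {x_batch.shape}, y {y_batch.shape}")
```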

armheb (Author) commented Jul 15, 2020

Thanks for your answer. I don't have a large file in the dataset; in the preprocess_data function I return the preprocessed dataframe with the same shape. Do you think there is anything wrong in the second loop of the generator class? That was the part I changed.

armheb (Author) commented Jul 15, 2020

By putting all the sequences together, the model can train, although it takes about 4.5 hours per epoch!
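
For reference, the everything-in-memory route discussed at the start of the thread might look roughly like the sketch below (assuming the same `preprocess_data` helper, the `train_seq_paths` list, and `seq_len`; the target column is an assumption):

```python
import numpy as np
import pandas as pd

# Rough sketch: preprocess every CSV, concatenate the results into one
# dataframe, then build all windows in memory instead of per file batch.
frames = [preprocess_data(path) for path in train_seq_paths]
seq = pd.concat(frames, ignore_index=True).values

X, y = [], []
for i in range(seq_len, len(seq)):
    X.append(seq[i - seq_len:i])
    y.append(seq[i, 3])  # assumed target column, as in the generator above

X = np.asarray(X, dtype=np.float32)
y = np.asarray(y, dtype=np.float32)
```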

armheb (Author) commented Aug 25, 2020

Hi, I trained the model for about a week, but unfortunately the final result was a straight line through the middle. Do you have plans to update the repo? Could you please share your weights so I can fine-tune the model based on them?
I really appreciate your work. Thanks.
