
How to pass 2-dimensional sequence to LSTM? #90

Closed
ganji15 opened this issue Sep 18, 2016 · 14 comments

ganji15 commented Sep 18, 2016

I found that the given examples can process 1-dimensional sequence data with an RNN by passing the input sequence through an embedding layer, which transforms it into a 2-dimensional data sequence (a word-vector matrix, I guess).
However, sometimes the input sequence itself is 2-dimensional, such as (time_step, 1-dimensional feature_vector) <=> [ [f, f, f, f], [f, f, f, f], ..., [f, f, f, f]]. So how can I feed a 2-dimensional sequence directly into an RNN/LSTM?

I tried the following approach, but it failed.

DataProvider.py

import cPickle
import random

from paddle.trainer.PyDataProvider2 import *

def hook(settings, input_dim, num_class, is_train, **kwargs):
    settings.input_types = [
        dense_vector_sequence(int(input_dim)),
        integer_value_sequence(int(num_class))]
    settings.is_train = is_train

@provider(init_hook=hook)
def processData(settings, file_name):
    seqs, labels = cPickle.load(open(file_name, 'rb'))
    indexs = list(range(len(labels)))
    if settings.is_train:
        random.shuffle(indexs)
    for i in indexs:
        seq = seqs[i]      # sequence of fixed-length 1-dim vectors
        label = labels[i]  # variable-length integer sequence
        yield seq, label

Error Information

I0919 00:57:43.780038 592 Util.cpp:138] commandline: /usr/local/bin/../opt/paddle/bin/paddle_trainer --config=OcrRecognition.py --dot_period=10 --log_period=10 --test_all_data_in_one_period=1 --use_gpu=1 --gpu_id=0 --trainer_count=1 --num_passes=100 --save_dir=./model
I0919 00:57:44.160080 592 Util.cpp:113] Calling runInitFunctions
I0919 00:57:44.160356 592 Util.cpp:126] Call runInitFunctions done.
[WARNING 2016-09-19 00:57:44,193 default_decorators.py:40] please use keyword arguments in paddle config.
[WARNING 2016-09-19 00:57:44,194 default_decorators.py:40] please use keyword arguments in paddle config.
[INFO 2016-09-19 00:57:44,195 networks.py:1122] The input order is [ocr_seq, label]
[INFO 2016-09-19 00:57:44,195 networks.py:1129] The output order is [ctc]
I0919 00:57:44.197438 592 Trainer.cpp:169] trainer mode: Normal
I0919 00:57:44.198976 592 PyDataProvider2.cpp:219] loading dataprovider dataprovider::processData
I0919 00:57:44.201690 592 PyDataProvider2.cpp:219] loading dataprovider dataprovider::processData
I0919 00:57:44.201730 592 GradientMachine.cpp:134] Initing parameters..
I0919 00:57:44.205186 592 GradientMachine.cpp:141] Init parameters done.
*** Aborted at 1474217866 (unix time) try "date -d @1474217866" if you are using GNU date ***
PC: @ 0x7fd4325a5767 (unknown)
*** SIGSEGV (@0x10) received by PID 592 (TID 0x7fd433536840) from PID 16; stack trace: ***
@ 0x7fd432e28330 (unknown)
@ 0x7fd4325a5767 (unknown)
@ 0x7fd43256c444 (unknown)
@ 0x7fd432644370 (unknown)
@ 0x7fd4325cf193 (unknown)
@ 0x7fd43261b3b7 (unknown)
@ 0x7fd4107d6da4 array_str
@ 0x7fd4325d258a (unknown)
@ 0x7fd4325d277a (unknown)
@ 0x7fd4108ab3dc gentype_repr
@ 0x7fd432624da0 (unknown)
@ 0x82abf9 paddle::py::repr()
@ 0x569eb1 paddle::IndexScanner::fill()
@ 0x56a2c1 paddle::SequenceScanner::fill()
@ 0x56d3fc paddle::PyDataProvider2::getNextBatchInternal()
@ 0x563982 paddle::DataProvider::getNextBatch()
@ 0x69b437 paddle::Trainer::trainOnePass()
@ 0x69ecc7 paddle::Trainer::train()
@ 0x53bf73 main
@ 0x7fd43144cf45 (unknown)
@ 0x5475b5 (unknown)
@ 0x0 (unknown)

/usr/local/bin/paddle: line 81: 592 Segmentation fault (core dumped) ${DEBUGGER} $MYDIR/../opt/paddle/bin/paddle_trainer ${@:2}


Can anyone provide any suggestions or examples? Thank you!

qingqing01 (Contributor) commented Sep 19, 2016

I think the yield format is not correct.

processData should yield one sample at a time. The format should be [[f, ...], [f, ...], ...] for dense_vector_sequence, in which each [f, ...] is one time step, and [i, i, ...] for integer_value_sequence. You can refer to the PyDataProvider2 documentation.

I'm not quite sure about your pickled data format, so I'm sorry I can't give you an example processData.
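The per-sample shapes described above can be sketched in plain Python (the data here is fabricated for illustration, not the author's pickled set):

```python
def iter_samples(seqs, labels):
    """Yield one (dense_vector_sequence, integer_value_sequence) sample at a time."""
    for seq, label in zip(seqs, labels):
        # seq:   [[f, ...], [f, ...], ...] -- one inner float list per time step
        # label: [i, i, ...]               -- flat list of plain Python ints
        yield seq, label

# Two fabricated samples with different numbers of time steps.
seqs = [[[0.1, 0.2], [0.3, 0.4]],
        [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]]
labels = [[1, 2], [3, 4, 5]]

samples = list(iter_samples(seqs, labels))
```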

reyoung (Collaborator) commented Sep 19, 2016

It seems that the label vector is not a list of Python ints, and this line is invoked.

ganji15 (Author) commented Sep 19, 2016

@qingqing01
Here is the code that pickles my data:


def gen_dataset(name, idxs):
    vals = []
    labels = []
    for idx in idxs:
        # data['x'][idx].transpose() is a numpy array with shape (time_step, 11),
        # and each one has a different time_step.
        # tolist() transforms the matrix into a list of lists => [[f, ...], ..., [f, ...]]
        vals.append(data['x'][idx].transpose().tolist())
        # Similarly, list(data['y'][idx]) is a 1-dimensional integer vector
        # [i, ...] of varying length.
        labels.append(list(data['y'][idx]))
    cPickle.dump((vals, labels), open(name, 'wb'))

So I'm not sure how to process my data to match the 'dense_vector_sequence' and 'integer_value_sequence' types.

ganji15 (Author) commented Sep 19, 2016

@reyoung I think you are right, but why does this error occur?

The following is my code:

dataprovider.py

def hook(settings, input_dim, num_class, is_train, **kwargs):
    settings.input_types = [
                dense_vector_sequence(int(input_dim)),
                integer_value_sequence(int(num_class))]
    settings.is_train = is_train

@provider(init_hook=hook)
def processData(settings, file_name):
    seqs, labels = cPickle.load(open(file_name, 'rb'))
    indexs = list(range(len(labels)))
    if settings.is_train:
        random.shuffle(indexs)
    for i in indexs:
        seq = seqs[i]    
        label = labels[i]  
        yield seq, label

my pickled data format


def gen_dataset(name, idxs):
    vals = []
    labels = []
    for idx in idxs:
        # data['x'][idx].transpose() is a numpy array with shape (time_step, 11),
        # and each one has a different time_step.
        # tolist() transforms the matrix into a list of lists => [[f, ...], ..., [f, ...]]
        vals.append(data['x'][idx].transpose().tolist())
        # Similarly, list(data['y'][idx]) is a 1-dimensional integer vector
        # [i, ...] of varying length.
        labels.append(list(data['y'][idx]))
    cPickle.dump((vals, labels), open(name, 'wb'))

reyoung (Collaborator) commented Sep 19, 2016

Please print the types of labels, label, and label[0] in your dataprovider:

print type(labels), type(label), type(label[0])

ganji15 (Author) commented Sep 19, 2016

@reyoung

In [3]: seqs, lbs = cPickle.load(open('data/ocr_train.pkl', 'rb'))

In [4]: type(seqs)
Out[4]: list

In [5]: type(seqs[0])
Out[5]: list

In [6]: type(seqs[0][0])
Out[6]: list

In [7]: type(seqs[0][0][0])
Out[7]: float

In [8]: type(lbs[0][0])
Out[8]: numpy.int32

In [9]: type(lbs[0])
Out[9]: list

In [10]: lbs[0]
Out[10]: [70, 75, 4, 9, 31]

reyoung (Collaborator) commented Sep 19, 2016

@ganji15 numpy.int32 is not a Python int object. Please cast it to int:

map(int, lbs[0])
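The fix can be sketched with plain numpy, outside of Paddle (using the label values from the session above):

```python
import numpy as np

# numpy integer scalars print like ints but are a different type,
# which the integer_value_sequence scanner rejects.
lbs0 = list(np.array([70, 75, 4, 9, 31], dtype=np.int32))
casted = list(map(int, lbs0))  # plain Python ints
```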

ganji15 (Author) commented Sep 19, 2016

@reyoung
It works! Thank you very much!

reyoung (Collaborator) commented Sep 19, 2016

@ganji15 numpy will be supported in a few days.

Thanks for your attention.

alvations (Contributor) commented Nov 7, 2016

I have managed to feed numpy objects into Paddle by using something like np.ndarray.tolist():

from paddle.trainer.PyDataProvider2 import *

import numpy as np

UNK_IDX = 2
START = "<s>"
END = "<e>"

def _get_ids(s, dictionary):
    words = s.strip().split()
    return [dictionary[START]] + \
           [dictionary.get(w, UNK_IDX) for w in words] + \
           [dictionary[END]]

def hook(settings, src_dict, trg_dict, file_list, **kwargs):
    # Some code ...
    # A numpy matrix whose rows correspond to the src vocabulary and
    # whose columns correspond to the target vocabulary
    settings.thematrix = np.random.rand(len(src_dict), len(trg_dict))
    # ...
    settings.slots = [integer_value_sequence(len(settings.src_dict)),
                      dense_vector_sequence(len(settings.src_dict)),
                      integer_value_sequence(len(settings.trg_dict))]
    # ...

@provider(init_hook=hook, pool_size=50000)
def process(settings, file_name):
    # ...
    for line in f:
        src_seq, trg_seq = line.strip().split('\t')
        src_ids = _get_ids(src_seq, settings.src_dict)
        trg_ids = [settings.trg_dict.get(w, UNK_IDX)
                   for w in trg_seq.split()]
        trg_ids = [settings.trg_dict[START]] + trg_ids
        yield src_ids, settings.thematrix[src_ids].tolist(), trg_ids

Somehow the vectors can't seem to get past the first batch, and Paddle throws this error:

~/Paddle/demo/rowrow$ bash train.sh 
I1104 18:59:42.636052 18632 Util.cpp:151] commandline: /home/ltan/Paddle/binary/bin/../opt/paddle/bin/paddle_trainer --config=train.conf --save_dir=/home/ltan/Paddle/demo/rowrow/model --use_gpu=true --num_passes=100 --show_parameter_stats_period=1000 --trainer_count=4 --log_period=10 --dot_period=5 
I1104 18:59:46.503566 18632 Util.cpp:126] Calling runInitFunctions
I1104 18:59:46.503810 18632 Util.cpp:139] Call runInitFunctions done.
[WARNING 2016-11-04 18:59:46,847 default_decorators.py:40] please use keyword arguments in paddle config.
[INFO 2016-11-04 18:59:46,856 networks.py:1125] The input order is [source_language_word, target_language_word, target_language_next_word]
[INFO 2016-11-04 18:59:46,857 networks.py:1132] The output order is [__cost_0__]
I1104 18:59:46.871026 18632 Trainer.cpp:170] trainer mode: Normal
I1104 18:59:46.871906 18632 MultiGradientMachine.cpp:108] numLogicalDevices=1 numThreads=4 numDevices=4
I1104 18:59:46.988584 18632 PyDataProvider2.cpp:247] loading dataprovider dataprovider::process
[INFO 2016-11-04 18:59:46,990 dataprovider.py:15] src dict len : 45661
[INFO 2016-11-04 18:59:47,316 dataprovider.py:26] trg dict len : 422
I1104 18:59:47.347944 18632 PyDataProvider2.cpp:247] loading dataprovider dataprovider::process
[INFO 2016-11-04 18:59:47,348 dataprovider.py:15] src dict len : 45661
[INFO 2016-11-04 18:59:47,657 dataprovider.py:26] trg dict len : 422
I1104 18:59:47.658279 18632 GradientMachine.cpp:134] Initing parameters..
I1104 18:59:49.244287 18632 GradientMachine.cpp:141] Init parameters done.
F1104 18:59:50.485621 18632 PythonUtil.h:213] Check failed: PySequence_Check(seq_) 
*** Check failure stack trace: ***
    @     0x7f71f521adaa  (unknown)
    @     0x7f71f521ace4  (unknown)
    @     0x7f71f521a6e6  (unknown)
    @     0x7f71f521d687  (unknown)
    @           0x54dac9  paddle::DenseScanner::fill()
    @           0x54f1d1  paddle::SequenceScanner::fill()
    @           0x5543cc  paddle::PyDataProvider2::getNextBatchInternal()
    @           0x5779b2  paddle::DataProvider::getNextBatch()
    @           0x6a01f7  paddle::Trainer::trainOnePass()
    @           0x6a3b57  paddle::Trainer::train()
    @           0x53a2b3  main
    @     0x7f71f4426f45  (unknown)
    @           0x545ae5  (unknown)
    @              (nil)  (unknown)
/home/ltan/Paddle/binary/bin/paddle: line 81: 18632 Aborted                 (core dumped) ${DEBUGGER} $MYDIR/../opt/paddle/bin/paddle_trainer ${@:2}

More details at http://stackoverflow.com/questions/40421248/why-is-paddle-throwing-errors-when-feeding-in-a-dense-vector-sequence-to-a-seqto, and the data+code I use to run train.sh is at https://github.com/alvations/rowrow .

Is it just numpy vectors that are not supported yet? Or does Paddle not yet support any dense vector sequence (even a list of lists of floats) as native Python objects?

reyoung (Collaborator) commented Nov 7, 2016

1. numpy is supported.
2. dense_vector_sequence is a sequence of dense_vectors; its data should look like [[f, f, f], [f, f, f]].
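As a plain-data sketch of the two shapes (illustrative values only, not Paddle API calls):

```python
# dense_vector: one flat float vector for the whole sample.
dense_vector_sample = [0.1, 0.2, 0.3]

# dense_vector_sequence: one float vector per time step.
dense_vector_sequence_sample = [[0.1, 0.2, 0.3],
                                [0.4, 0.5, 0.6]]
```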

reyoung (Collaborator) commented Nov 7, 2016

@alvations Please open another issue for new question.

reyoung (Collaborator) commented Nov 7, 2016

It seems that you should use dense_vector instead of dense_vector_sequence, because settings.thematrix[src_ids] is just a vector of floats.

alvations (Contributor) commented Nov 7, 2016

Sorry for the inconvenience. I've created a new issue #369.

settings.thematrix[src_ids] should return a matrix (a vector of vectors) that fits the dense_vector_sequence [ [f,f,f], [f,f,f], ...] structure, right?

>>> import numpy as np
>>> x = np.random.rand(10,5) # 10 rows, 5 columns
>>> x
array([[ 0.71414965,  0.45273671,  0.37954461,  0.04298937,  0.65297758],
       [ 0.71330836,  0.93355837,  0.91250145,  0.73036384,  0.00237625],
       [ 0.27265885,  0.01207583,  0.10584876,  0.64541483,  0.42509224],
       [ 0.15477619,  0.5713811 ,  0.71976755,  0.00669505,  0.7747009 ],
       [ 0.07513192,  0.20092001,  0.30176491,  0.98289236,  0.60552273],
       [ 0.4454395 ,  0.19612705,  0.47249998,  0.81235983,  0.35272056],
       [ 0.48687432,  0.91080766,  0.77938878,  0.45750021,  0.98119178],
       [ 0.70029773,  0.00784268,  0.56423129,  0.40237047,  0.86712586],
       [ 0.31193082,  0.60600517,  0.18091819,  0.3627252 ,  0.85459444],
       [ 0.32658941,  0.51335506,  0.29290611,  0.74307929,  0.87390234]])
>>> x[[0,2,5,8]] # get 4 rows
array([[ 0.71414965,  0.45273671,  0.37954461,  0.04298937,  0.65297758],
       [ 0.27265885,  0.01207583,  0.10584876,  0.64541483,  0.42509224],
       [ 0.4454395 ,  0.19612705,  0.47249998,  0.81235983,  0.35272056],
       [ 0.31193082,  0.60600517,  0.18091819,  0.3627252 ,  0.85459444]])
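The row-indexing plus .tolist() round trip above can also be checked directly (a small sketch, independent of Paddle):

```python
import numpy as np

x = np.random.rand(10, 5)   # 10 rows, 5 columns
rows = x[[0, 2, 5, 8]]      # fancy-indexing rows -> 2-D array (4, 5)
nested = rows.tolist()      # [[f, f, f, f, f], ...] of plain Python floats
```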
