Speed data reader for IMDB dataset. #7002

Merged on Dec 26, 2017 (4 commits)

@qingqing01 qingqing01 commented Dec 25, 2017

Fix #7001

@@ -76,45 +75,19 @@ def build_dict(pattern, cutoff):

def reader_creator(pos_pattern, neg_pattern, word_idx, buffer_size):
Contributor

buffer_size is now unused.

Contributor

buffer_size is never used. Even in the previous experiment, people only set the shuffle buffer.
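
For readers unfamiliar with the term, a shuffle buffer can be sketched as a wrapper around a reader (a function that returns a generator, as in the code in this thread). This is a minimal illustration only: the names shuffle_buffered and seed are hypothetical, and this is not claimed to be PaddlePaddle's own implementation.

```python
import random


def shuffle_buffered(reader, buf_size, seed=None):
    """Wrap `reader` so samples are shuffled within chunks of `buf_size`.

    Hypothetical helper for illustration; not part of the PR.
    """
    rng = random.Random(seed)

    def shuffled_reader():
        buf = []
        for sample in reader():
            buf.append(sample)
            # Once the buffer is full, shuffle it and emit its contents.
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        # Emit whatever is left in a final, possibly smaller, chunk.
        rng.shuffle(buf)
        for item in buf:
            yield item

    return shuffled_reader
```

Because shuffling happens chunk by chunk, only buf_size samples are held in memory at a time, at the cost of samples never moving outside their own chunk.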

Contributor Author

Done. I removed the buffer size. I also timed the loading with and without two threads:

  • Without two threads: 16.65757s
  • With two threads: 25 - 27s. I'm not sure why this is slower; the code is as follows:
import random
import threading
import time


def reader_creator(pos_pattern, neg_pattern, word_idx, buffer_size):
    # tokenize() is defined elsewhere in the same module.
    start_time = time.time()
    UNK = word_idx['<unk>']

    POS = []
    NEG = []

    def load(pattern, out, label):
        # Map every word to its id, falling back to UNK for unknown words.
        for doc in tokenize(pattern):
            out.append(([word_idx.get(w, UNK) for w in doc], label))

    # Create two threads that load the positive and negative samples
    # into the POS and NEG lists.
    t0 = threading.Thread(target=load, args=(pos_pattern, POS, 0))
    t0.daemon = True
    t0.start()

    t1 = threading.Thread(target=load, args=(neg_pattern, NEG, 1))
    t1.daemon = True
    t1.start()

    # Wait for both loaders to finish before merging.
    t0.join()
    t1.join()

    INS = POS + NEG
    random.shuffle(INS)
    duration = time.time() - start_time
    print('\nTotal time: %.5f ' % (duration))

    def reader():
        for doc, label in INS:
            yield doc, label

    return reader
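
The thread leaves the slowdown unexplained. One plausible cause, offered here as an assumption rather than something measured in this PR, is CPython's global interpreter lock: the per-word lookups in load are CPU-bound Python code, so two threads take turns holding the lock instead of running in parallel. The sketch below compares the two strategies on synthetic data so the effect can be checked locally. All helper names (encode_docs, run_sequential, run_threaded, timed, make_docs) are hypothetical, and no timings are claimed.

```python
import threading
import time


def encode_docs(docs, word_idx, unk, label, out):
    """CPU-bound stand-in for `load`: map each word to its id."""
    for doc in docs:
        out.append(([word_idx.get(w, unk) for w in doc], label))


def run_sequential(pos_docs, neg_docs, word_idx, unk):
    """Encode positive then negative documents in the calling thread."""
    out = []
    encode_docs(pos_docs, word_idx, unk, 0, out)
    encode_docs(neg_docs, word_idx, unk, 1, out)
    return out


def run_threaded(pos_docs, neg_docs, word_idx, unk):
    """Encode the two document sets in two threads, as in the experiment."""
    pos, neg = [], []
    threads = [
        threading.Thread(target=encode_docs,
                         args=(pos_docs, word_idx, unk, 0, pos)),
        threading.Thread(target=encode_docs,
                         args=(neg_docs, word_idx, unk, 1, neg)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return pos + neg


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call of `fn`."""
    start = time.time()
    result = fn(*args)
    return result, time.time() - start


def make_docs(n_docs, doc_len):
    """Build synthetic documents; 'unseen' is deliberately out of vocabulary."""
    vocab = ['good', 'bad', 'movie', 'plot', 'unseen']
    return [[vocab[(i + j) % len(vocab)] for j in range(doc_len)]
            for i in range(n_docs)]


if __name__ == '__main__':
    word_idx = {'good': 0, 'bad': 1, 'movie': 2, 'plot': 3, '<unk>': 4}
    docs = make_docs(2000, 200)
    _, t_seq = timed(run_sequential, docs, docs, word_idx, word_idx['<unk>'])
    _, t_thr = timed(run_threaded, docs, docs, word_idx, word_idx['<unk>'])
    print('sequential: %.5fs, threaded: %.5fs' % (t_seq, t_thr))
```

Both functions produce the same list, so any difference in the printed times comes from the threading strategy alone. If the real bottleneck were file I/O rather than tokenization, the comparison could come out differently.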

Contributor

@dzhwinter dzhwinter left a comment

Great enhancement.

Collaborator

@reyoung reyoung left a comment

Excellent job. Thanks.

@qingqing01 qingqing01 merged commit c3fd2c2 into PaddlePaddle:develop Dec 26, 2017
@qingqing01 qingqing01 deleted the imdb_data branch November 14, 2019 05:26