Speed data reader for IMDB dataset. #7002
python/paddle/v2/dataset/imdb.py
Outdated
@@ -76,45 +75,19 @@ def build_dict(pattern, cutoff):

def reader_creator(pos_pattern, neg_pattern, word_idx, buffer_size):
buffer_size becomes useless.
buffer_size is never used. Even in the previous experiments, people only set the shuffle buffer.
Done. I removed the buffer size. I also timed the reader with and without two threads:
- Without two threads: 16.65757s
- With two threads: 25 - 27s. I'm not sure why this is slower; the code is as follows:
import random
import threading
import time

# tokenize(pattern) is defined earlier in imdb.py.
def reader_creator(pos_pattern, neg_pattern, word_idx, buffer_size):
    start_time = time.time()
    UNK = word_idx['<unk>']
    POS = []
    NEG = []

    def load(pattern, out, label):
        # Map each word to its index, falling back to UNK for unknown words.
        for doc in tokenize(pattern):
            out.append(([word_idx.get(w, UNK) for w in doc], label))

    # Create two threads that load positive and negative samples
    # into POS and NEG.
    t0 = threading.Thread(
        target=load, args=(
            pos_pattern,
            POS, 0, ))
    t0.daemon = True
    t0.start()

    t1 = threading.Thread(
        target=load, args=(
            neg_pattern,
            NEG, 1, ))
    t1.daemon = True
    t1.start()

    # Wait for both loaders to finish before merging and shuffling.
    t0.join()
    t1.join()

    INS = POS + NEG
    random.shuffle(INS)
    duration = time.time() - start_time
    print('\nTotal time: %.5f ' % (duration))

    def reader():
        for doc, label in INS:
            yield doc, label

    return reader
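One possible explanation, not confirmed anywhere in this thread: in CPython the global interpreter lock lets only one thread execute Python bytecode at a time, so two threads doing CPU-bound tokenizing and dictionary lookups cannot run in parallel and only add switching overhead. Below is a minimal sketch for checking this on synthetic data. It makes several assumptions: tokenize here is a stand-in that splits in-memory strings instead of reading files by pattern, and load_sequential, load_threaded, and timed are hypothetical helper names that do not exist in imdb.py.

```python
import threading
import time


def tokenize(docs):
    # Stand-in for the real tokenize(pattern): lower-case and split each text.
    for text in docs:
        yield text.lower().split()


def load(docs, word_idx, out, label):
    # Same conversion as in the comment above: word -> index, UNK fallback.
    unk = word_idx['<unk>']
    for doc in tokenize(docs):
        out.append(([word_idx.get(w, unk) for w in doc], label))


def load_sequential(pos_docs, neg_docs, word_idx):
    # Load positive then negative samples in the calling thread.
    samples = []
    load(pos_docs, word_idx, samples, 0)
    load(neg_docs, word_idx, samples, 1)
    return samples


def load_threaded(pos_docs, neg_docs, word_idx):
    # Load positive and negative samples in two threads, then merge.
    pos, neg = [], []
    threads = [
        threading.Thread(target=load, args=(pos_docs, word_idx, pos, 0)),
        threading.Thread(target=load, args=(neg_docs, word_idx, neg, 1)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return pos + neg


def timed(fn, *args):
    # Return the function result together with its wall-clock duration.
    start = time.time()
    result = fn(*args)
    return result, time.time() - start


if __name__ == '__main__':
    word_idx = {'<unk>': 0, 'good': 1, 'bad': 2, 'movie': 3}
    pos_docs = ['Good movie overall'] * 20000
    neg_docs = ['Bad movie overall'] * 20000
    seq, seq_time = timed(load_sequential, pos_docs, neg_docs, word_idx)
    thr, thr_time = timed(load_threaded, pos_docs, neg_docs, word_idx)
    assert seq == thr  # both paths must produce the same samples
    print('sequential: %.5fs, threaded: %.5fs' % (seq_time, thr_time))
```

If the threaded path is no faster on the real data either, loading both patterns in a single thread, as in the 16.65757s run, is the simpler choice.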
Great enhancement.
Excellent job. Thanks!
Fix #7001
Test Env:
Total Time: