WikiCorpus processes lock-up when run from command line #1320
@jli05 May I kindly ask what the resolution of this issue was, in case it comes up again in the future?
There remains a problem with WikiCorpus multi-processing: if we run with `partition_size` greater than `limit`, it still locks up. Could we fix this?
The code:

```python
''' Dump Bag-of-Words from gensim WikiCorpus '''
import gzip
from argparse import ArgumentParser

from gensim.corpora.dictionary import Dictionary
from gensim.corpora.wikicorpus import WikiCorpus


def dump_bow(corpus, partition_size=50, limit=200, output_prefix='dump'):
    ''' Dump Bag-of-Words from gensim WikiCorpus

    Iterate through the documents in the wiki dump and dump the
    Bag-of-Words of the documents into a series of .txt.gz files.
    Each line in the uncompressed file represents one document, with
    only lower-case words separated by spaces.

    PARAMETERS
    ----------
    corpus: gensim.corpora.WikiCorpus
        The Wikidump corpus.
    partition_size: int
        Number of documents in each .txt.gz dump file.
    limit: int or None
        The total number of documents to dump, or None for all
        the documents in the corpus.
    output_prefix: str
        Prefix of the dump files.
    '''
    def write_buffer(buf, output_prefix, partition_id):
        ''' Dump the current buffer of Bag-of-Words '''
        fname = '{}-{:06d}.txt.gz'.format(output_prefix, partition_id)
        with gzip.open(fname, 'wt') as partition_file:
            partition_file.write(buf)

    if limit is not None:
        print('Processing {} documents in the corpus...'.format(limit))
    else:
        print('Processing all the documents in the corpus...')

    assert partition_size >= 1
    assert limit is None or limit >= 1
    # gensim 2.0 requires this, otherwise the multi-processing locks up
    # assert limit is None or partition_size <= limit

    count_documents = 0
    partition_id = 0
    buf = ''
    for bow in corpus.get_texts():
        text = ' '.join([byte_array.decode('utf-8') for byte_array in bow])
        buf += text + '\n'
        count_documents += 1

        if count_documents % 200 == 0:
            print('Processed {} documents.'.format(count_documents))
        if count_documents % partition_size == 0:
            # A partition is full: flush it and start a new one
            write_buffer(buf, output_prefix, partition_id)
            buf = ''
            partition_id += 1
        if limit is not None and count_documents >= limit:
            break

    if buf:
        # Write the last, possibly partial, partition
        write_buffer(buf, output_prefix, partition_id)
    print('Dumped {} documents.'.format(count_documents))


def main():
    ''' Parse arguments and run '''
    parser = ArgumentParser(description='Dump bag-of-words in .txt.gz files')
    parser.add_argument('wikidump', type=str,
                        help='xxx-pages-articles.xml.bz2 wiki dump file')
    parser.add_argument('dictionary', type=str,
                        help='gensim dictionary .txt file')
    parser.add_argument('-j', '--jobs', type=int, default=2,
                        help='Number of parallel jobs, default: 2')
    parser.add_argument('-p', '--partition-size', type=int, default=50,
                        help='Number of documents in each .txt.gz file, '
                             'default: 50')
    parser.add_argument('-l', '--limit', type=int,
                        help=('Total number of documents to dump, '
                              'or all documents when not specified'))
    parser.add_argument('-o', '--output-prefix', type=str, default='dump',
                        help='Prefix of dump .txt.gz files, default: dump')
    args = parser.parse_args()

    wiki_dictionary = Dictionary.load_from_text(args.dictionary)
    wiki = WikiCorpus(args.wikidump, processes=args.jobs,
                      dictionary=wiki_dictionary)
    dump_bow(wiki, args.partition_size, args.limit,
             output_prefix=args.output_prefix)


if __name__ == '__main__':
    main()
```
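For context, calling the function interactively (the mode that reportedly works) might look like the sketch below; the file names are placeholders, as is the assumption that the script above is saved as `dump_bow.py` when invoked with `python3 dump_bow.py ...` from the command line:

```python
from gensim.corpora.dictionary import Dictionary
from gensim.corpora.wikicorpus import WikiCorpus
# with dump_bow defined as in the script above

# Placeholder file names; substitute a real dump and dictionary.
wiki_dictionary = Dictionary.load_from_text('wiki_dict.txt')
wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2', processes=2,
                  dictionary=wiki_dictionary)

# Dump at most 200 documents, 50 per .txt.gz partition.
dump_bow(wiki, partition_size=50, limit=200, output_prefix='dump')
```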
Tried to reproduce on the current develop branch with Python 3.6 and couldn't.

@jli05 can you do something like this and see if it fixes the hanging?
It was too long ago and I'd have to re-download the data to do what you ask, so I trust what you say.
I could reproduce it on an older version. Then I checked again: the lock-up occurs before commit e06c7c3#diff-eece52d95c280dabe57c803c95d6bb96, but not after it. So, this is already fixed now. What do you think, @menshikh-iv?
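For anyone hitting a similar hang: the sketch below is a guess at the general failure mode, not gensim's actual code (the `square` worker and all numbers are made up). A `multiprocessing.Pool` consumed through `imap` and abandoned mid-iteration can leave worker processes alive unless the pool is explicitly terminated:

```python
import multiprocessing


def square(x):
    ''' Trivial stand-in for per-document parsing work '''
    return x * x


def texts(processes=2, limit=5):
    ''' Yield a few pool results, then stop early, like a `limit` option '''
    pool = multiprocessing.Pool(processes)
    try:
        for i, value in enumerate(pool.imap(square, range(10000))):
            if i >= limit:
                break      # abandon the iteration early
            yield value
    finally:
        # Without this the workers can linger, and a plain
        # `python3 script.py` run may never exit.
        pool.terminate()


if __name__ == '__main__':
    print(list(texts()))
```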
@xelez thanks for the investigation!
When we run the following function from within iPython, it runs successfully with multi-processing enabled via the `processes` argument when constructing `WikiCorpus`. However, when we put this function in a `.py` file and use `python3 xxx.py` to invoke it, it never processes more than 200 documents; when we press Ctrl-C, there seems to be a semaphore lock-up caused by the multiple processes.
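A minimal reproduction of the reported setup might look like this sketch (the dump file name is a placeholder and `processes=2` is an arbitrary choice):

```python
from gensim.corpora.wikicorpus import WikiCorpus

if __name__ == '__main__':
    # Placeholder path; substitute a real Wikipedia dump. An empty
    # dictionary skips the expensive vocabulary-building pass, which
    # get_texts() does not need.
    wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2',
                      processes=2, dictionary={})
    for i, bow in enumerate(wiki.get_texts()):
        if i >= 300:
            break  # stop a little past the first 200 documents
    # On affected versions this line is never reached under `python3`.
    print('Done.')
```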