
Add segment-wiki script #1483

Merged: 10 commits from add-wikiscript into develop on Oct 27, 2017

Conversation

menshikh-iv (Contributor):

Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump and extract sections of pages from it.

CC @piskvorky
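For illustration, a minimal sketch of how a consumer might read the script's output, assuming the tab-separated line format quoted later in this review (article_title<tab>section_heading<tab>section_content<tab>...); note that a later commit switches the output to json-lines. The function name iter_articles is hypothetical:

import io

def iter_articles(path):
    """Yield (article_title, [(section_heading, section_content), ...]) per output line."""
    with io.open(path, encoding='utf-8') as fin:
        for line in fin:
            fields = line.rstrip(u'\n').split(u'\t')
            title, rest = fields[0], fields[1:]
            # The remaining fields alternate: heading, content, heading, content, ...
            sections = list(zip(rest[::2], rest[1::2]))
            yield title, sections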

os.write(sys.stdout.fileno(), u"\t".join(printed_components).encode('utf-8') + b"\n")


# noinspection PyUnresolvedReferences
piskvorky (Owner):

Noise, please remove (here and elsewhere).

piskvorky (Owner):

@menshikh-iv the original script was Python3 only -- has this been tested on Python2?

We're aiming at dual compatibility (e.g. using six), like the rest of gensim.

menshikh-iv (Contributor, Author):

@piskvorky Of course; I need to check all the other wiki scripts first (as suggested in #1584), and after that I'll add Python 2 compatibility here.
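For illustration, a minimal sketch of the six-based dual compatibility discussed above, applied to the utf-8 stdout writing from the quoted hunk (illustrative only, not the PR's final code):

import sys
from six import PY2

def write_tsv_line(components):
    """Write tab-separated unicode components to stdout as utf-8 bytes,
    working on both Python 2 and Python 3."""
    data = u"\t".join(components).encode('utf-8') + b"\n"
    if PY2:
        sys.stdout.write(data)           # Python 2: stdout accepts raw bytes
    else:
        sys.stdout.buffer.write(data)    # Python 3: write to the binary buffer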

piskvorky (Owner) left a comment:

Minor code-style comments; great script!

How was this tested for scale/stability? I remember some issues with multiprocessing.

article_title<tab>section_heading<tab>section_content<tab>section_heading<tab>section_content

"""
with open(output_file, 'wb') as outfile:
piskvorky (Owner):

Use smart_open here instead of the built-in open.

menshikh-iv (Contributor, Author):

done
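For illustration, a sketch of the smart_open-based writer suggested above; output_file and printed_components stand in for the script's real variables, and smart_open.smart_open is the 2017-era entry point (newer releases expose smart_open.open):

from smart_open import smart_open

output_file = 'enwiki-sections.txt.gz'  # smart_open transparently handles .gz/.bz2/S3/... paths
printed_components = [u'Title', u'Heading 1', u'Section text 1']  # placeholder values

with smart_open(output_file, 'wb') as outfile:
    outfile.write(u"\t".join(printed_components).encode('utf-8') + b"\n")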

The documents are extracted on-the-fly, so that the whole (massive) dump
can stay compressed on disk.

>>> wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2') # create word->word_id mapping, takes almost 8h
piskvorky (Owner):

Docstring out of date (different class).

menshikh-iv (Contributor, Author):

Done (updated all docstrings, converted to numpy-style, removed outdated things).

Parse the content inside a page tag, returning its content as a list of tokens
(utf8-encoded strings).

Returns a 2-tuple (str, list) -
piskvorky (Owner):

Neither google nor numpy docstring format.

menshikh-iv (Contributor, Author):

Done
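For reference, a numpy-style version of the docstring quoted above might look like the following (the function name segment and the exact wording are assumptions):

def segment(page_xml):
    """Parse the content inside a page tag.

    Parameters
    ----------
    page_xml : str
        Content of a single <page> element from the wiki dump.

    Returns
    -------
    (str, list of (str, str))
        Article title and a list of (section heading, section content) pairs.

    """
    raise NotImplementedError  # body omitted; only the docstring format is illustrated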


"""
elem = cElementTree.fromstring(page_xml)
filter_namespaces = ('0',)
piskvorky (Owner):

Deserves a comment -- what is this?
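For context: '0' is MediaWiki's main (article) namespace, so this filter keeps regular articles and drops Talk, User, Template and similar pages. A comment along those lines would answer the question, e.g.:

elem = cElementTree.fromstring(page_xml)
# Keep only pages from namespace '0' (MediaWiki's main/article namespace);
# Talk, User, Template, ... pages are skipped.
filter_namespaces = ('0',)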

menshikh-iv (Contributor, Author):

About performance: approximately 5 minutes for 100,000 articles in the output file (SSD, i7-6700HQ, ruwiki). There are sometimes problems with Ctrl+C (if you want to interrupt), but that's non-critical.

if self.lemmatize:
    num_total_tokens += len(utils.lemmatize(section_content))
else:
    num_total_tokens += len(tokenize(section_content))
piskvorky (Owner) commented on Oct 7, 2017:

Btw I think for the purposes of gensim-data, we shouldn't do any tokenization or normalization. We should present the sections "as they are", so people can use their own sentence detection / token detection etc. Only remove newlines and tabs just before printing, because of the output format.

It's easy to go from raw section_content => tokenize, but impossible to go from tokenize => raw. @menshikh-iv

menshikh-iv (Contributor, Author):

This is only used for filtering out very short articles; all content is provided "as is".

piskvorky (Owner):

Ah, OK, thanks.
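A minimal sketch of the "remove newlines and tabs just before printing" idea from this exchange (illustrative; the helper name to_output_line is hypothetical):

def to_output_line(article_title, sections):
    """Flatten one article into a single tab-separated line, leaving the text raw
    except for tabs/newlines, which would break the line-oriented output format."""
    def clean(text):
        return text.replace(u'\t', u' ').replace(u'\n', u' ')

    components = [clean(article_title)]
    for heading, content in sections:
        components.extend([clean(heading), clean(content)])
    return u"\t".join(components) + u"\n"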

anotherbugmaster (Contributor) left a comment:

Sorry for the misguidance; here's the right way.

The documents are extracted on-the-fly, so that the whole (massive) dump can stay compressed on disk.

"""
def __init__(self, fileobj, processes=None, lemmatize=utils.has_pattern(), filter_namespaces=('0',)):
anotherbugmaster (Contributor):

Don't mix __init__ and class annotations. I propose to annotate __init__ from now on.

http://www.sphinx-doc.org/en/stable/ext/example_numpy.html
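For illustration, annotating __init__ in numpy style for the signature quoted above could look like this (the class name and all parameter descriptions except lemmatize, which follows the quoted hunk, are assumptions):

from gensim import utils


class WikiSectionsCorpus(object):  # class name assumed for illustration
    def __init__(self, fileobj, processes=None, lemmatize=utils.has_pattern(), filter_namespaces=('0',)):
        """
        Parameters
        ----------
        fileobj : file
            File-like handle to the wiki dump.
        processes : int, optional
            Number of worker processes (None picks a default).
        lemmatize : bool
            If `pattern` package is installed, use fancier shallow parsing to get token lemmas.
            Otherwise, use simple regexp tokenization.
        filter_namespaces : tuple of str
            Page namespaces to keep; '0' is the main article namespace.

        """
        self.fileobj = fileobj
        self.processes = processes
        self.lemmatize = lemmatize
        self.filter_namespaces = filter_namespaces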

lemmatize : bool
If `pattern` package is installed, use fancier shallow parsing to get token lemmas.
Otherwise, use simple regexp tokenization.
filter_namespaces : tuple(int)
anotherbugmaster (Contributor):

tuple of int


Yields
------
tuple(str, list of tuple(str, str))
anotherbugmaster (Contributor):

(str, list of (str, str))


Returns
-------
tuple(str, list of tuple(str, str))
anotherbugmaster (Contributor):

(str, list of (str, str))


Yields
------
tuple(str, list of tuple(str, str))
anotherbugmaster (Contributor):

(str, list of (str, str))

menshikh-iv merged commit 300ce8c into develop on Oct 27, 2017.
menshikh-iv deleted the add-wikiscript branch on October 27, 2017 at 13:28.
horpto pushed a commit to horpto/gensim that referenced this pull request on Oct 28, 2017:
* add segment wiki script

* fix indentation error

* Add output file and logging + small fixes

* add smart_open

* Add numpy-style docstrings & fix .rst

* Fix types

* Fix docstrings + output file format (json-lines)

* Upd .rst
menshikh-iv (Contributor, Author):

Continued in #1694
