
Add segment-wiki script #1483

Merged: 10 commits from add-wikiscript into develop on Oct 27, 2017

Conversation

menshikh-iv (Contributor):

Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump and extract sections of pages from it.

CC @piskvorky
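For illustration, a minimal sketch of how a consumer might read the script's output, assuming the tab-separated line format quoted later in this review (article_title<tab>section_heading<tab>section_content<tab>...); note that a later commit switches the output to json-lines. The function name iter_articles is hypothetical:

import io

def iter_articles(path):
    """Yield (article_title, [(section_heading, section_content), ...]) per output line."""
    with io.open(path, encoding='utf-8') as fin:
        for line in fin:
            fields = line.rstrip(u'\n').split(u'\t')
            title, rest = fields[0], fields[1:]
            # The remaining fields alternate: heading, content, heading, content, ...
            sections = list(zip(rest[::2], rest[1::2]))
            yield title, sections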

os.write(sys.stdout.fileno(), u"\t".join(printed_components).encode('utf-8') + b"\n")


# noinspection PyUnresolvedReferences
piskvorky (Owner):

Noise, please remove (here and elsewhere).

piskvorky (Owner):

@menshikh-iv the original script was Python3 only -- has this been tested on Python2?

We're aiming at dual compatibility (e.g. using six), like the rest of gensim.

menshikh-iv (Contributor, Author):

@piskvorky Of course; I need to check all the other wiki scripts first (as suggested in #1584), and after that I'll add Python 2 compatibility here.
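For illustration, a minimal sketch of the six-based dual compatibility discussed above, applied to the utf-8 stdout writing from the quoted hunk (illustrative only, not the PR's final code):

import sys
from six import PY2

def write_tsv_line(components):
    """Write tab-separated unicode components to stdout as utf-8 bytes,
    working on both Python 2 and Python 3."""
    data = u"\t".join(components).encode('utf-8') + b"\n"
    if PY2:
        sys.stdout.write(data)           # Python 2: stdout accepts raw bytes
    else:
        sys.stdout.buffer.write(data)    # Python 3: write to the binary buffer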

piskvorky (Owner) left a comment:

Minor code-style comments; great script!

How was this tested for scale/stability? I remember some issues with multiprocessing.

article_title<tab>section_heading<tab>section_content<tab>section_heading<tab>section_content

"""
with open(output_file, 'wb') as outfile:
piskvorky (Owner):

Use smart_open here instead of the built-in open.

menshikh-iv (Contributor, Author):

done
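For illustration, a sketch of the smart_open-based writer suggested above; output_file and printed_components stand in for the script's real variables, and smart_open.smart_open is the 2017-era entry point (newer releases expose smart_open.open):

from smart_open import smart_open

output_file = 'enwiki-sections.txt.gz'  # smart_open transparently handles .gz/.bz2/S3/... paths
printed_components = [u'Title', u'Heading 1', u'Section text 1']  # placeholder values

with smart_open(output_file, 'wb') as outfile:
    outfile.write(u"\t".join(printed_components).encode('utf-8') + b"\n")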

The documents are extracted on-the-fly, so that the whole (massive) dump
can stay compressed on disk.

>>> wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2') # create word->word_id mapping, takes almost 8h
piskvorky (Owner):

Docstring out of date (different class).

menshikh-iv (Contributor, Author):

Done (updated all docstrings, converted to numpy-style, removed outdated things).

Parse the content inside a page tag, returning its content as a list of tokens
(utf8-encoded strings).

Returns a 2-tuple (str, list) -
piskvorky (Owner):

Neither google nor numpy docstring format.

menshikh-iv (Contributor, Author):

Done
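For reference, a numpy-style version of the docstring quoted above might look like the following (the function name segment and the exact wording are assumptions):

def segment(page_xml):
    """Parse the content inside a page tag.

    Parameters
    ----------
    page_xml : str
        Content of a single <page> element from the wiki dump.

    Returns
    -------
    (str, list of (str, str))
        Article title and a list of (section heading, section content) pairs.

    """
    raise NotImplementedError  # body omitted; only the docstring format is illustrated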


"""
elem = cElementTree.fromstring(page_xml)
filter_namespaces = ('0',)
piskvorky (Owner):

Deserves a comment -- what is this?
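For context: '0' is MediaWiki's main (article) namespace, so this filter keeps regular articles and drops Talk, User, Template and similar pages. A comment along those lines would answer the question, e.g.:

elem = cElementTree.fromstring(page_xml)
# Keep only pages from namespace '0' (MediaWiki's main/article namespace);
# Talk, User, Template, ... pages are skipped.
filter_namespaces = ('0',)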

menshikh-iv (Contributor, Author):

About performance: approximately 5 minutes for 100,000 articles in the output file (SSD, i7-6700HQ, ruwiki). There are sometimes problems with Ctrl+C (if you want to interrupt), but that's non-critical.

if self.lemmatize:
    num_total_tokens += len(utils.lemmatize(section_content))
else:
    num_total_tokens += len(tokenize(section_content))
piskvorky (Owner) commented on Oct 7, 2017:

Btw I think for the purposes of gensim-data, we shouldn't do any tokenization or normalization. We should present the sections "as they are", so people can use their own sentence detection / token detection etc. Only remove newlines and tabs just before printing, because of the output format.

It's easy to go from raw section_content => tokenize, but impossible to go from tokenize => raw. @menshikh-iv

menshikh-iv (Contributor, Author):

This is only used for filtering out very short articles; all content is provided "as is".

piskvorky (Owner):

Ah, OK, thanks.
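A minimal sketch of the "remove newlines and tabs just before printing" idea from this exchange (illustrative; the helper name to_output_line is hypothetical):

def to_output_line(article_title, sections):
    """Flatten one article into a single tab-separated line, leaving the text raw
    except for tabs/newlines, which would break the line-oriented output format."""
    def clean(text):
        return text.replace(u'\t', u' ').replace(u'\n', u' ')

    components = [clean(article_title)]
    for heading, content in sections:
        components.extend([clean(heading), clean(content)])
    return u"\t".join(components) + u"\n"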

anotherbugmaster (Contributor) left a comment:

Sorry for the misguidance; here's the right way.

The documents are extracted on-the-fly, so that the whole (massive) dump can stay compressed on disk.

"""
def __init__(self, fileobj, processes=None, lemmatize=utils.has_pattern(), filter_namespaces=('0',)):
anotherbugmaster (Contributor):

Don't mix __init__ and class annotations. I propose to annotate __init__ from now on.

http://www.sphinx-doc.org/en/stable/ext/example_numpy.html
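For illustration, annotating __init__ in numpy style for the signature quoted above could look like this (the class name and all parameter descriptions except lemmatize, which follows the quoted hunk, are assumptions):

from gensim import utils


class WikiSectionsCorpus(object):  # class name assumed for illustration
    def __init__(self, fileobj, processes=None, lemmatize=utils.has_pattern(), filter_namespaces=('0',)):
        """
        Parameters
        ----------
        fileobj : file
            File-like handle to the wiki dump.
        processes : int, optional
            Number of worker processes (None picks a default).
        lemmatize : bool
            If `pattern` package is installed, use fancier shallow parsing to get token lemmas.
            Otherwise, use simple regexp tokenization.
        filter_namespaces : tuple of str
            Page namespaces to keep; '0' is the main article namespace.

        """
        self.fileobj = fileobj
        self.processes = processes
        self.lemmatize = lemmatize
        self.filter_namespaces = filter_namespaces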

lemmatize : bool
If `pattern` package is installed, use fancier shallow parsing to get token lemmas.
Otherwise, use simple regexp tokenization.
filter_namespaces : tuple(int)
anotherbugmaster (Contributor):

tuple of int


Yields
------
tuple(str, list of tuple(str, str))
anotherbugmaster (Contributor):

(str, list of (str, str))


Returns
-------
tuple(str, list of tuple(str, str))
anotherbugmaster (Contributor):

(str, list of (str, str))


Yields
------
tuple(str, list of tuple(str, str))
anotherbugmaster (Contributor):

(str, list of (str, str))

menshikh-iv merged commit 300ce8c into develop on Oct 27, 2017.
menshikh-iv deleted the add-wikiscript branch on October 27, 2017 at 13:28.
horpto pushed a commit to horpto/gensim that referenced this pull request on Oct 28, 2017:
* add segment wiki script

* fix indentation error

* Add output file and logging + small fixes

* add smart_open

* Add numpy-style docstrings & fix .rst

* Fix types

* Fix docstrings + output file format (json-lines)

* Upd .rst
menshikh-iv (Contributor, Author):

Continued in #1694
