Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor code with PEP8 and additional limitations. Fix #1521 #1569

Merged
merged 20 commits into from
Sep 8, 2017

Conversation

menshikh-iv
Copy link
Contributor

No description provided.

@menshikh-iv menshikh-iv added this to In progress in Code cleanup Sep 5, 2017
doc = [(id_, weight)
for id_, weight in enumerate(map(float, parts))
if abs(weight) > eps]
doc = [(id_, weight) for id_, weight in enumerate(float(x) for x in parts) if abs(weight) > eps]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv
line 308:

doc = [(int(id_), float(weight))
           for id_, weight in zip(*[iter(parts)] * 2)
           if abs(float(weight)) > eps]

line 310 :

doc = [(id_, float(weight))
           for id_, weight in enumerate(parts)
           if abs(float(weight)) > eps]

How about above code?
Above code is less loop

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, accepted.

Copy link
Owner

@piskvorky piskvorky Sep 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zsef123 Both examples above are incorrect. The correct formatting using hanging indent:

doc = [
   (id_, float(weight))
   for id_, weight in enumerate(parts)
   if abs(float(weight)) > eps
]

Actually, this line not too long, so simple:

doc = [(id_, float(weight)) for id_, weight in enumerate(parts) if abs(float(weight)) > eps]

would work too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

already done as your last variant @piskvorky

@@ -363,7 +363,7 @@ def testPipeline(self):
data = cache
test_data = data.data[0:2]
test_target = data.target[0:2]
id2word = Dictionary(map(lambda x: x.split(), test_data))
id2word = Dictionary([x.split() for x in data.data])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data.data need change to test_data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed.

@@ -436,7 +436,7 @@ def testPasses(self):
for test_rhot in test_rhots:
model.update(corpus, author2doc)

msg = ", ".join(map(str, [passes, model.num_updates, model.state.numdocs]))
msg = ", ".join(str(x) for x in [passes, model.num_updates, model.state.numdocs])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about msg = "%d, %d, %d" % (passes, model.num_updates, model.state.numdocs)?
This is more simple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, fixed

Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Epic! I reviewed the diff and left some minor comments inside. I will look at the non-diffed (unchanged) parts next.

self.dfs = dict((tokenid, freq)
for tokenid, freq in iteritems(self.dfs)
if tokenid not in bad_ids)
self.token2id = dict((token, tokenid) for token, tokenid in iteritems(self.token2id) if tokenid not in bad_ids)
Copy link
Owner

@piskvorky piskvorky Sep 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We no longer support py2.6, so this can be a dict comprehension: {token: tokenid for ...}. Here and elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

logger.info(
"keeping %i tokens which were in no less than %i and no more than %i (=%.1f%%) documents",
len(good_ids), no_below, no_above_abs, 100.0 * no_above)
logger.info("keeping %i tokens which were in no less than %i and no more than %i (=%.1f%%) documents", len(good_ids), no_below, no_above_abs, 100.0 * no_above)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line too long, the args should be on their own separate line. Here and elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@@ -333,7 +322,7 @@ def merge_with(self, other):
old2new[other_id] = new_id
try:
self.dfs[new_id] += other.dfs[other_id]
except Exception:
except AttributeError:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change? What will happen if other.dfs exists, but is not a dictionary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@@ -254,20 +252,14 @@ def init_shards(self, output_prefix, corpus, shardsize=4096, dtype=_default_dtyp
"""Initialize shards from the corpus."""

if not gensim.utils.is_corpus(corpus):
raise ValueError('Cannot initialize shards without a corpus to read'
' from! (Got corpus type: {0})'.format(type(corpus)))
raise ValueError('Cannot initialize shards without a corpus to read from! (Got corpus type: {0})'.format(type(corpus)))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to use one type of formatting consistently (%s instead of .format). Here and elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my opinion both variants is good, for this reason we shouldn't use only one

@@ -484,8 +456,7 @@ def resize_shards(self, shardsize):
# If something happens when we're in this stage, we're screwed.
except Exception as e:
print(e)
raise RuntimeError('Resizing completely failed for some reason.'
' Sorry, dataset is probably ruined...')
raise RuntimeError('Resizing completely failed for some reason. Sorry, dataset is probably ruined...')
Copy link
Owner

@piskvorky piskvorky Sep 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use logger.exception to log the exception, and then simply re-raise the original error, instead of printing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

raise TypeError('Couldn\'t find number of features, '
'refusing to guess (dimension set to {0},'
'type of corpus: {1}).'.format(self.dim, type(corpus)))
raise TypeError('Couldn\'t find number of features, refusing to guess (dimension set to {0}, type of corpus: {1}).'.format(self.dim, type(corpus)))
Copy link
Owner

@piskvorky piskvorky Sep 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code style: Better use " to delimit the string, to avoid the \' inside, to improve readability. Here and elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed (\' occurs only in this file several times)

self.dictionary = dictionary.Dictionary()
numPositions = 0
for docNo, (sourceId, docUri) in enumerate(self.documents):
if docNo % 1000 == 0:
logger.info("PROGRESS: at document #%i/%i (%s, %s)" %
(docNo, len(self.documents), sourceId, docUri))
logger.info("PROGRESS: at document #%i/%i (%s, %s)", docNo, len(self.documents), sourceId, docUri)
Copy link
Owner

@piskvorky piskvorky Sep 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docNo and resultFile is Java-style. I think this file (package) should be simply removed, instead of updated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely agree, will removed it soon (but not in this PR)

LOG.warning('no word id mapping provided; initializing from '
'corpus, assuming identity')
raise ValueError('at least one of corpus/id2word must be specified, to establish input space dimensionality')
LOG.warning('no word id mapping provided; initializing from corpus, assuming identity')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this LOG? We use logger in gensim, by convention.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@menshikh-iv menshikh-iv merged commit db9e230 into develop Sep 8, 2017
@menshikh-iv menshikh-iv moved this from In progress to Done in Code cleanup Sep 8, 2017
@menshikh-iv menshikh-iv deleted the pep8-continue branch October 6, 2017 10:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

3 participants