Refactor documentation for gensim.similarities.docsim. #1910

Merged: 18 commits, Feb 23, 2018


CLearERR (Contributor)

@menshikh-iv menshikh-iv added the incubator project PR is RaRe incubator project label Feb 16, 2018
>>>
>>> corpus = CorpusMiislita(datapath('head500.noblanks.cor.bz2'))
>>> corpus.get_texts()
<generator object get_texts at 0x7fa932f397d0>

Bad output. Can you show a concrete line of the dataset instead, e.g. next(iter(corpus.get_texts()))?
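A minimal sketch of what this comment asks for. `TinyCorpus` and its in-memory documents are hypothetical stand-ins for `CorpusMiislita` and the `head500.noblanks.cor.bz2` file, not gensim code. Showing the generator object only gives a memory address, while pulling the first item shows an actual tokenized document.

```python
class TinyCorpus:
    """Hypothetical stand-in for the corpus class under review."""

    stoplist = frozenset({"a", "the", "of"})

    def __init__(self, docs):
        self.docs = docs

    def get_texts(self):
        # Yield one list of lowercased, stoplist-filtered tokens per document.
        for doc in self.docs:
            yield [word for word in doc.lower().split() if word not in self.stoplist]


corpus = TinyCorpus(["The survey of user opinion", "A graph of trees"])

# Take the first document from the generator instead of showing its repr.
first_text = next(iter(corpus.get_texts()))
print(first_text)  # ['survey', 'user', 'opinion']
```

In a docstring example, the printed token list is stable across runs, unlike the address in `<generator object get_texts at 0x...>`.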

>>> if word not in CorpusMiislita.stoplist]
>>>
>>> def __len__(self):
>>> if 'length' not in self.__dict__:

No need to write anything with a logger here; this should be a simple, small example.

>>>
>>> def get_texts(self):
>>> for doc in self.getstream():
>>> yield [word for word in utils.to_unicode(doc).lower().split()

some issues with formatting

>>> corpus = CorpusMiislita(datapath('head500.noblanks.cor.bz2'))
>>> corpus.get_texts()
<generator object get_texts at 0x7fa932f397d0>
>>> corpus.__len__()

Please use len(corpus) instead of this one; calling a "magic" method directly is a bad pattern (justified only in specific cases).
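A small sketch of the pattern suggested here. `CountedCorpus` is a hypothetical class written for illustration, not the code from this PR. Defining `__len__` is what makes the built-in `len()` work, so callers never need to invoke the dunder method themselves.

```python
class CountedCorpus:
    """Hypothetical corpus whose length is computed once, on first use."""

    def __init__(self, docs):
        self.docs = docs
        self.length = None  # filled in lazily by __len__

    def __iter__(self):
        return iter(self.docs)

    def __len__(self):
        # Count the documents on the first call and cache the result.
        if self.length is None:
            self.length = sum(1 for _ in self)
        return self.length


corpus = CountedCorpus(["first doc", "second doc", "third doc"])

n_docs = len(corpus)  # preferred over corpus.__len__()
print(n_docs)  # 3
```

Both spellings return the same value; `len(corpus)` is simply the idiomatic one.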


Return
------
{:class: `~scipy.sparse.csr_matrix`, :class: `~numpy.array`}

numpy.array -> numpy.ndarray here and everywhere. Also, the link shouldn't be rendered in this case, so don't use ~ for numpy/scipy.
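One possible reading of this suggestion, sketched on a hypothetical function (`get_similarities_stub` is not part of gensim). `numpy.array` is the factory function, while `numpy.ndarray` is the type of the object it returns, so the latter is what belongs in a type annotation. The exact role markup the reviewer had in mind is an assumption here; only the two stated changes (the `ndarray` name and no `~` prefix) are applied.

```python
def get_similarities_stub(query):
    """Documentation-only sketch; the function body is intentionally missing.

    Parameters
    ----------
    query : object
        Hypothetical query placeholder.

    Returns
    -------
    {:class:`scipy.sparse.csr_matrix`, :class:`numpy.ndarray`}
        Similarity matrix, sparse or dense.

    """
    # Assumption: only the docstring matters for this illustration.
    raise NotImplementedError("documentation sketch only")


doc = get_similarities_stub.__doc__
print("numpy.ndarray" in doc)
```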

Size of shards should be chosen so that a `shardsize x chunksize` matrix of floats fits comfortably into
main memory.
norm : str, optional
Normalization to use. Accepted values: {l1, l2}.

In this case, instead of

norm : str, optional
    Normalization to use. Accepted values: {l1, l2}.

should be

norm : {'l1', 'l2'}, optional
    Normalization to use.


This is a better notation when there are several predefined string values.
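A sketch of how the set notation might look in a complete docstring, with matching validation in the body. `normalize_vector` is a hypothetical helper written for this illustration, not the gensim function under review.

```python
def normalize_vector(vec, norm='l2'):
    """Scale `vec` to unit length.

    Parameters
    ----------
    vec : list of float
        Input vector.
    norm : {'l1', 'l2'}, optional
        Normalization to use.

    Returns
    -------
    list of float
        Normalized copy of `vec`.

    """
    if norm == 'l1':
        length = sum(abs(x) for x in vec)
    elif norm == 'l2':
        length = sum(x * x for x in vec) ** 0.5
    else:
        # Reject anything outside the documented set of values.
        raise ValueError("norm must be 'l1' or 'l2', got %r" % (norm,))
    if length == 0:
        return list(vec)  # an all-zero vector cannot be scaled
    return [x / length for x in vec]


l1_result = normalize_vector([1.0, 3.0], norm='l1')
l2_result = normalize_vector([3.0, 4.0])
print(l1_result, l2_result)
```

Listing the accepted values in the type slot keeps them visible in the rendered signature, so the description line only has to say what the parameter does.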

@menshikh-iv menshikh-iv merged commit 5355c06 into piskvorky:develop Feb 23, 2018