[WIP] Data/model storage. Fix 1453 #1632

chaitaliSaini · 2017-10-17T08:54:32Z

API for dataset/model storage (old PR #1492).

menshikh-iv · 2017-10-26T12:51:50Z

gensim/downloader.py

+
+
+def _create_base_dir():
+    r"""Create the gensim-data directory in home directory, if it has not been already created.


Is the r prefix really needed on every docstring? What's the reason?
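On the r question: the prefix only changes anything when the docstring contains a backslash. A minimal sketch of the difference (both function names are hypothetical and not part of the PR):

```python
def with_raw():
    r"""Write '\r' to move the cursor back to the start of the line."""


def without_raw():
    """Write '\r' to move the cursor back to the start of the line."""


# With the r prefix the docstring keeps the two characters backslash + r.
raw_doc = with_raw.__doc__
# Without it, Python turns the escape into a real carriage return character.
plain_doc = without_raw.__doc__

print(repr(raw_doc))
print(repr(plain_doc))
```

For docstrings without backslashes the two forms are identical, so the prefix is optional there.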

menshikh-iv · 2017-10-26T12:54:47Z

gensim/downloader.py

+    sys.stdout.flush()
+
+
+def _create_base_dir():


Maybe it would be better to use __ instead of _ (to hide it from import), here and everywhere?
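For context on the naming question, a small sketch of how star-imports treat underscore-prefixed names. The module name `demo_downloader` and the function `load` are made up for the demonstration; only `_create_base_dir` comes from the diff.

```python
import sys
import types

# Build a throwaway module with a public, a single-underscore
# and a double-underscore function.
demo = types.ModuleType("demo_downloader")
exec(
    "def load(): pass\n"
    "def _create_base_dir(): pass\n"
    "def __create_base_dir(): pass\n",
    demo.__dict__,
)
sys.modules["demo_downloader"] = demo

# Star-import into an empty namespace and see which names arrive.
namespace = {}
exec("from demo_downloader import *", namespace)
imported = sorted(key for key in namespace if key != "__builtins__")
print(imported)
```

When a module defines no `__all__`, a star-import skips every name that starts with an underscore, so in this sketch both prefixed variants are left out in the same way.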

menshikh-iv · 2017-10-26T12:57:47Z

gensim/downloader.py

+
+if __name__ == '__main__':
+    logging.basicConfig(format='%(asctime)s :%(name)s :%(levelname)s :%(message)s', stream=sys.stdout, level=logging.INFO)
+    parser = argparse.ArgumentParser(description="Gensim console API", usage="python -m gensim.api.downloader  [-h] [-d data__name | -i data__name | -c]")


No need to pass a custom "usage" string here (argparse will generate it automatically).

menshikh-iv · 2017-10-26T12:58:12Z

gensim/downloader.py

+    logging.basicConfig(format='%(asctime)s :%(name)s :%(levelname)s :%(message)s', stream=sys.stdout, level=logging.INFO)
+    parser = argparse.ArgumentParser(description="Gensim console API", usage="python -m gensim.api.downloader  [-h] [-d data__name | -i data__name | -c]")
+    group = parser.add_mutually_exclusive_group()
+    group.add_argument("-d", "--download", metavar="data__name", nargs=1, help="To download a corpus/model : python -m gensim.downloader -d corpus/model name")


Strange names for metavar. Why is metavar needed here?
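Both argparse remarks can be checked with a short sketch. The short options `-d`, `-i` and `-c` follow the diff; the long names `--info` and `--catalogue`, the `prog` value and the shortened help texts are assumptions made for illustration.

```python
import argparse

# Sketch only: no custom usage string and no metavar are passed.
parser = argparse.ArgumentParser(prog="gensim.downloader", description="Gensim console API")
group = parser.add_mutually_exclusive_group()
group.add_argument("-d", "--download", help="Download a corpus/model by name.")
group.add_argument("-i", "--info", help="Show information about a corpus/model.")
group.add_argument("-c", "--catalogue", action="store_true", help="List available corpora/models.")

# argparse derives the usage line from the declared arguments and
# uses the upper-cased destination name as the default metavar.
usage = parser.format_usage()
print(usage)
```

The generated line lists the mutually exclusive options on its own, with `DOWNLOAD` and `INFO` as placeholder names.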

menshikh-iv · 2017-10-26T13:01:07Z

gensim/downloader.py

+        logger.info("%s downloaded", name)
+    else:
+        rmtree(tmp_dir)
+        raise Exception("There was a problem in downloading the data. We recommend you to re-try.")


Add details about the checksum check to the error (concrete filename, expected checksum, actual checksum, expected size, actual size).
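One way the requested error details could look, as a sketch under assumptions: the helpers `md5_of_file` and `verify_download` and their signatures are hypothetical and are not the functions from the PR.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, block_size=65536):
    """Return the hex MD5 digest of the file at `path`, read block by block."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_download(path, expected_checksum, expected_size):
    """Raise ValueError naming the file, both checksums and both sizes on mismatch."""
    actual_checksum = md5_of_file(path)
    actual_size = os.path.getsize(path)
    if actual_checksum != expected_checksum or actual_size != expected_size:
        raise ValueError(
            "Verification failed for {}: expected checksum {}, got {}; "
            "expected size {} bytes, got {} bytes.".format(
                path, expected_checksum, actual_checksum, expected_size, actual_size
            )
        )


# Demonstration on a throwaway 11-byte file with a deliberately wrong checksum.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"gensim-data")
    demo_path = tmp.name

try:
    verify_download(demo_path, expected_checksum="0" * 32, expected_size=11)
    error_message = None
except ValueError as err:
    error_message = str(err)
finally:
    os.remove(demo_path)

print(error_message)
```

With all five values in the message, a failed download can be told apart from a stale checksum in the catalogue without rerunning anything.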

menshikh-iv · 2017-10-26T13:04:37Z

Great job @chaitaliSaini, your code is now more readable and clearer (and works reliably) 🔥 👍 @anotherbugmaster will review your docstrings today.

anotherbugmaster

Good job, thank you! Fix the minor issues and check out this styleguide (in case you haven't yet), it will help you write consistent documentation:

https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt#docstring-standard

anotherbugmaster · 2017-10-26T15:16:17Z

gensim/downloader.py

+
+
+def progress(chunks_downloaded, chunk_size, total_size):
+    r"""Create and update the progress bar.


Why is r necessary?

anotherbugmaster · 2017-10-26T15:20:04Z

gensim/downloader.py

+    filled_len = int(math.floor((bar_len * size_downloaded) / total_size))
+    percent_downloaded = round((size_downloaded * 100) / total_size, 1)
+    bar = '=' * filled_len + '-' * (bar_len - filled_len)
+    sys.stdout.write('[%s] %s%s %s/%sMB downloaded\r' % (bar, percent_downloaded, "%", round(size_downloaded / (1024 * 1024), 1), round(float(total_size) / (1024 * 1024), 1)))


anotherbugmaster · 2017-10-26T15:21:22Z

gensim/downloader.py

+
+
+def _calculate_md5_checksum(tar_file):
+    r"""Calculate the checksum of the given tar.gz file.


anotherbugmaster · 2017-10-26T15:33:05Z

gensim/downloader.py

+def info(name=None):
+    r"""Return the information related to model/dataset.
+
+    If name is supplied, then information related to the given dataset/model will be returned. Otherwise detailed information of all model/datasets will be returned.


Too long, split it

anotherbugmaster · 2017-10-26T15:33:19Z

gensim/downloader.py

+    Returns
+    -------
+    dict
+        Return detailed information about all models/datasets if name is not provided. Otherwise return detailed information of the specific model/dataset


anotherbugmaster · 2017-10-26T15:36:21Z

gensim/downloader.py

+    data:
+        load model to memory
+    data_dir: str
+        return path of dataset/model.


No new line after last section

anotherbugmaster · 2017-10-26T15:41:07Z

gensim/downloader.py

+
+    Parameters
+    ----------
+    name : {None, data name}, optional


name : str or None, optional is the right way. Also try to write a description after every parameter.
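A sketch of a docstring in the form suggested here, with a description under every entry. The `CATALOGUE` dictionary and its contents are hypothetical stand-ins, and the function body is only an illustration, not the implementation from the PR.

```python
# Hypothetical in-memory catalogue standing in for the real list of datasets/models.
CATALOGUE = {
    "corpora": {"example-corpus": {"parts": 1}},
    "models": {"example-model": {"parts": 1}},
}


def info(name=None):
    """Return information about a dataset/model.

    Parameters
    ----------
    name : str or None, optional
        Name of the dataset/model. If None, information about all of them is returned.

    Returns
    -------
    dict
        Information about `name`, or about all datasets/models if `name` is None.

    """
    if name is None:
        return CATALOGUE
    # Look the name up in each section and return the entry from that same section.
    for section in ("corpora", "models"):
        if name in CATALOGUE[section]:
            return CATALOGUE[section][name]
    raise ValueError("Unknown dataset/model name: {}".format(name))


model_info = info("example-model")
print(model_info)
```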

anotherbugmaster · 2017-10-26T15:42:22Z

gensim/downloader.py

+    Parameters
+    ----------
+    name: str
+        dataset/model name


Capital letters

anotherbugmaster · 2017-10-26T15:42:37Z

gensim/downloader.py

+    Parameters
+    ----------
+    name: str
+        dataset/model name which has to be downloaded


Also capital letters

menshikh-iv · 2017-11-07T12:00:22Z

gensim/test/test_api.py

+import numpy as np
+
+
+class TestApi(unittest.TestCase):


Need to add a test for multipart downloads.

menshikh-iv · 2017-11-07T12:02:19Z

gensim/downloader.py

+import math
+import shutil
+import tempfile
+try:


One try/except is enough here.
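A single block can cover all version-dependent imports. A minimal sketch, assuming the module needs `urlretrieve` and `urlopen`; the exact set of names imported in the PR is not visible in the diff excerpt.

```python
try:
    # Python 3 locations.
    from urllib.request import urlopen, urlretrieve
except ImportError:
    # Python 2 fallback: the same names live in different modules.
    from urllib import urlretrieve
    from urllib2 import urlopen

# After the block, both names are available regardless of the interpreter version.
print(urlretrieve.__name__, urlopen.__name__)
```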

menshikh-iv · 2017-11-07T12:10:30Z

gensim/downloader.py

+    Parameters
+    ----------
+    chunks_downloaded : int
+        Number of chunks of data that have been downloaded


Put a . at the end of each sentence (here and everywhere).

menshikh-iv · 2017-11-07T12:10:56Z

gensim/downloader.py

+
+def _create_base_dir():
+    """Create the gensim-data directory in home directory, if it has not been already created.
+    Raises


Missing blank line before the section title.

menshikh-iv · 2017-11-07T12:38:58Z

gensim/downloader.py

+    """Create the gensim-data directory in home directory, if it has not been already created.
+    Raises
+    ------
+    File Exists Error


    Raises
    ------
    Exception
        Two possible reasons: ...

menshikh-iv · 2017-11-07T12:41:57Z

gensim/downloader.py

+            return data['models'][name]["checksum"]
+    else:
+        if name in corpora:
+            return data['corpora'][name]["checksum-" + str(part)]


"checksum-{}".format(part) instead

menshikh-iv · 2017-11-07T12:44:02Z

gensim/downloader.py

+    tmp_dir = tempfile.mkdtemp()
+    tmp_load_file_path = os.path.join(tmp_dir, "__init__.py")
+    urllib.urlretrieve(url_load_file, tmp_load_file_path)
+    no_parts = int(_get_parts(name))


Store it as an int; don't cast.

menshikh-iv · 2017-11-07T12:47:35Z

gensim/downloader.py

+            compressed_folder_name = "{f}.tar.gz_a{p}".format(f=name, p=chr(96 + part))
+            tmp_data_file_dir = os.path.join(tmp_dir, compressed_folder_name)
+            logger.info("Downloading Part %s/%s", part, no_parts)
+            urllib.urlretrieve(url_data, tmp_data_file_dir, reporthook=_progress)


Show the part number on the progress bar.

menshikh-iv · 2017-11-07T12:54:53Z

gensim/downloader.py

+        concatenated_folder_dir = os.path.join(tmp_dir, concatenated_folder_name)
+        for part in range(1, no_parts + 1):
+            url_data = "https://github.com/chaitaliSaini/gensim-data/releases/download/{f}/{f}.tar.gz_a{p}".format(f=name, p=chr(96 + part))
+            compressed_folder_name = "{f}.tar.gz_a{p}".format(f=name, p=chr(96 + part))


Use numeric suffixes
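A sketch of what numeric suffixes could look like. The zero-padded two-digit format and the helper name `part_filename` are assumptions; the comment does not prescribe an exact scheme.

```python
def part_filename(name, part):
    """Hypothetical naming scheme: '<name>.tar.gz_00', '<name>.tar.gz_01', ..."""
    return "{f}.tar.gz_{p:02d}".format(f=name, p=part)


names = [part_filename("example-corpus", part) for part in range(3)]
print(names)

# The letter scheme from the diff, chr(96 + part), runs out of letters after part 26.
after_z = chr(96 + 27)
print(after_z)
```

Numeric suffixes keep the part number readable in the file name and do not depend on the alphabet length.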

menshikh-iv · 2017-11-07T12:56:27Z

gensim/downloader.py

+        os.remove(concatenated_folder_dir)
+        os.rename(tmp_dir, data_folder_dir)
+    else:
+        url_data = "https://github.com/chaitaliSaini/gensim-data/releases/download/{f}/{f}.tar.gz".format(f=name)


Make this a separate function.

menshikh-iv · 2017-11-08T16:22:16Z

gensim/downloader.py

+            logger.info("%s \n", json.dumps(data['corpora'][name], indent=4))
+            return data['corpora'][name]
+        elif name in models:
+            logger.info("%s \n", json.dumps(data['corpora'][name], indent=4))


Bug: data['corpora'][name] -> data['models'][name]

menshikh-iv · 2017-11-10T04:35:03Z

Finished in #1705

chaitaliSaini added 16 commits July 30, 2017 04:59

added download and catalogue functions

ec8c016

added link and info

636bfff

modeified link and info functions

fffe203

Updated download function

f567dee

Added logging

61ba3d6

Added load function

d8257a3

Removed unused imports

5571469

added check for installed models

cabf173

updated download function

5d509fc

Improved help for terminal

551f54e

load returns model path

ff5509f

added jupyter notebook and merged code

e654070

alternate names for load

b0d1110

corrected formatting

498b32b

added checksum after download

03649b0

refactored code

7fbf228

menshikh-iv changed the title ~~[WIP]Data/model storage~~ [WIP] Data/model storage Oct 17, 2017

menshikh-iv changed the title ~~[WIP] Data/model storage~~ [WIP] Data/model storage. Fix 1453 Oct 17, 2017

menshikh-iv added the incubator project label (PR is RaRe incubator project) Oct 17, 2017

chaitaliSaini added 3 commits October 17, 2017 15:16

removed log file code

d0311d1

added progressbar

7e00e2d

fixed pep8

f38670d

menshikh-iv suggested changes Oct 26, 2017

menshikh-iv requested review from menshikh-iv and removed request for menshikh-iv October 26, 2017 13:05

anotherbugmaster suggested changes Oct 26, 2017

chaitaliSaini added 2 commits October 31, 2017 12:59

added tests

4cadfa2

added download for >2gb data

e844e01

menshikh-iv suggested changes Nov 7, 2017

chaitaliSaini added 2 commits November 8, 2017 17:45

add test for multipart

580a93a

fixed pep8

e899f88

menshikh-iv suggested changes Nov 8, 2017

fixed bug

8eeec54

menshikh-iv closed this Nov 10, 2017
