DH Code Review #1
Conversation
I am unfamiliar with the Leiden algorithm and with most of your algorithmic approach. Since this is a relatively short code review process, I didn't spend time learning them. My comments are not on the actual calculations, but on the code that performs them.
All in all, if this were the level of code usually written in labs, I would be out of a job...
    :type limitRefLength: bool or int
    """
    for path in [cociteOutpath, timeclusterOutpath, reportsOutpath]:
        os.makedirs(path)
Usability: You should add exist_ok=True, so that the program doesn't fail in case the output directories already exist. If you want them to be empty, or want to create new ones, check for their emptiness (or non-existence) before you start, and output an error message telling the user to empty the output folders first. Otherwise they'll just get an exception.
The idea was that the pipeline acts as a convenience function to group all steps of the processing. Since it is easy to run, it's also easy to accidentally overwrite needed data. That's why an existing folder might not be OK. But I would catch the resulting exception and tell the user to check whether the given paths are really the right ones, and delete old data if needed.
    :param inputFilepath: Path to corpora input data
    :type text: str
    :param cociteOutpath: Output path for cocitation networks
Cosmetic: Why three output directories, instead of one output directory? You can create subfolders in it by default, for each type of output.
This is true, it makes more sense to have one main output folder and then create the subfolders automatically. Thanks!
    for path in [cociteOutpath, timeclusterOutpath, reportsOutpath]:
        os.makedirs(path)
    starttime = time.time()
    cocites = Cocitations(
Usability: In case of a lengthy pipeline, I find it better to break the pipeline into discrete steps that can run individually. You already have three discrete steps; an argument for running just one of them (any one of them, really) can be very useful to users. Of course, the argument for each step should be its input folder.
Makes sense: the pipeline as an entry point to the details of each subprocess. Will add this.
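One way to sketch this: register each step in a mapping so the pipeline can run all of them or any single one. The step names and function bodies below are placeholders, not the reviewed code:

```python
# Hypothetical step registry: each step takes an input path and an output
# path, so any single step can be re-run on its own with its own input folder.
def runCocitations(inPath, outPath):
    return f"cocitations: {inPath} -> {outPath}"

def runTimeClusters(inPath, outPath):
    return f"timeclusters: {inPath} -> {outPath}"

def runReports(inPath, outPath):
    return f"reports: {inPath} -> {outPath}"

STEPS = {
    "cocite": runCocitations,
    "timecluster": runTimeClusters,
    "report": runReports,
}

def runPipeline(inPath, outPath, only=None):
    """Run all steps in order, or just the one named by `only`."""
    selected = [STEPS[only]] if only else list(STEPS.values())
    return [step(inPath, outPath) for step in selected]
```

Each real step would read from its own input folder, so a failed or changed late step never forces a re-run of the expensive early ones.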
    )
    clusterreports.gatherClusterMetadata()
    clusterreports.writeReports()
    print(f'Done after {time.time() - starttime} seconds.')
I may have missed something - it seems as if there is no way to run this pipeline from the command line. It's much more convenient to add command line arguments that allow running from the shell, instead of from a Python prompt.
Take a look at https://github.com/swansonk14/typed-argument-parser for a cool type-aware argument parser.
Again, this is correct. I will have to change this, especially since some of the processes can be extremely long-running...
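The linked typed-argument-parser is one option; the same idea can also be sketched with the standard library's argparse. The argument names below are illustrative stand-ins for the pipeline's parameters:

```python
import argparse

def buildParser() -> argparse.ArgumentParser:
    # Argument names here mirror the pipeline's parameters, but are a sketch.
    parser = argparse.ArgumentParser(
        description="Run the cocitation/time-cluster pipeline from the shell."
    )
    parser.add_argument("inputFilepath", help="Path to corpora input data")
    parser.add_argument("outputPath", help="Main output folder")
    parser.add_argument("--numberProc", type=int, default=None,
                        help="Number of CPUs to use (default: all)")
    parser.add_argument("--step",
                        choices=["cocite", "timecluster", "report"],
                        default=None,
                        help="Run only this pipeline step")
    return parser
```

With an entry point that calls `buildParser().parse_args()`, a long-running job could then be launched as something like `python -m pipeline corpus/ out/ --step cocite` straight from the shell.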
    :type columnName: str
    :param numberProc: Number of CPUs the package is allowed to use (default=all)
    :type numberProc: int
    :param limitRefLength: Either False or integer giving the maximum number of references a considered publication is allowed to contain
Why not use None instead of False?
Both are possible, but I chose False as the answer to the question posed by the parameter :-) Limit reference length? False.
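For comparison, the None convention would look like the following sketch; the helper name is hypothetical, not the reviewed code:

```python
from typing import Optional

def trimReferences(references: list, limitRefLength: Optional[int] = None) -> list:
    """Illustrative helper: None means "no limit", any int caps the list.

    Using None instead of False gives the parameter a clean Optional[int]
    type, while a truthiness check still reads like the original question.
    """
    if limitRefLength is None:
        return references          # no limit requested
    return references[:limitRefLength]
```

A practical edge case worth noting: with the `bool or int` convention, `limitRefLength=0` and `limitRefLength=False` are truthiness-equal, which `None` avoids.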
    if os.path.isdir(self.outpath):
        raise OSError(f'Output folder {self.outpath} exists. Aborting.')
    else:
        os.mkdir(self.outpath)
Same for exist_ok - I commented on creating directories elsewhere.
    titles = dataframe[self.textcolumn].values
    for title in tqdm(titles, leave=False):
        try:
            # text pre-processing
A short comment on the pre-processing could be useful. It seems like you're removing some punctuation, some escape symbols, and whitespace, but I'm not sure. You're also removing digits?
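As an illustration of the kind of per-step commenting suggested here, a hypothetical cleaning function might read as follows; the actual rules in the reviewed code may well differ:

```python
import re
import string

def preprocessTitle(title: str) -> str:
    """Illustrative pre-processing, one comment per step; the reviewed
    code's actual cleaning rules may differ."""
    text = title.lower()                      # normalize case
    text = re.sub(r"\\n|\\t", " ", text)      # drop literal escape sequences
    text = re.sub(r"\d+", "", text)           # remove digits
    text = text.translate(                    # strip punctuation
        str.maketrans("", "", string.punctuation)
    )
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
```

Even when each operation is short, a comment per line spares the next reader from reverse-engineering the regexes.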
    )
    except ValueError:
        raise
    dfCluster = pd.concat(clusterdf, ignore_index=True)
The variables clusterdf and dfCluster confused me. They're both dataframes, one named in Hungarian notation, but I'm not sure why there are two of them or what the difference between them is.
This is bad naming on my side. clusterdf is actually a list of dataframes; will change.
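A plural name for the list would make the concat self-explanatory; sketched here with toy data (the variable name `clusterFrames` is a suggestion, not the code's current name):

```python
import pandas as pd

# Illustrative renaming: a plural name for the list of per-cluster frames
# makes the concatenation into one combined frame obvious at a glance.
clusterFrames = [
    pd.DataFrame({"cluster": [1], "node": ["a"]}),
    pd.DataFrame({"cluster": [2], "node": ["b"]}),
]
dfCluster = pd.concat(clusterFrames, ignore_index=True)  # one combined frame
```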
    inputnodes = set(basedf.node.values)
    notFound = inputnodes.difference(set(dfCluster[self.publicationIDcolumn].values))
    topAuthors = Counter(
        [x for y in [x.split(';') for x in dfCluster[self.authorColumnName].fillna('').values] for x in y]
Double list comprehensions always confuse me. I'm never sure which for runs first, and what the result is.
    for x in topAuthors:
        if x[0] != '':
            authortext += f'\t{x[0]}: {x[1]}\n'
    topAffils = Counter(
Also, since this non-trivial Counter expression appears twice, I'd move it into a nested function that creates the counter.
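The shared helper could look like this sketch; the function name is hypothetical, and the same call would then serve both the author and affiliation columns:

```python
from collections import Counter

def countDelimitedValues(values, delimiter=";"):
    """Hypothetical helper: flatten delimiter-separated cells into one
    Counter, so the expression is written once instead of twice."""
    return Counter(
        item
        for cell in values
        for item in (cell or "").split(delimiter)
    )

# Usage for both repeated sites, with toy stand-ins for the two columns:
topAuthors = countDelimitedValues(["A;B", "A", None])
topAffils = countDelimitedValues(["X", "X;Y"])
```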
@zmbq this is great, thanks so much! What do you think the author should definitely address to get a code review completed badge?
Thanks @zmbq for this truly in-depth review! Lots of details that I will address in the upstream version, which lives here: https://gitlab.gwdg.de/modelsen/semanticlayertools. Once done, I'll let you know and also update the GitHub mirror.
I wasn't aware there was a badge involved! I think @maltevogl deserves it.
The ticket for this code review is: DHCodeReview/DHCodeReview#1
This repository is ready for review. Please start with reviewing src/semanticlayertools/linkage/cocitation.py. If time allows, you are welcome to review src/semanticlayertools/pipelines/cocitetimeclusters.py and the other files in the pull request.