
DH Code Review #1

Open · wants to merge 54 commits into base: empty

Conversation

@jdamerow commented May 27, 2022

The ticket for this code review is: DHCodeReview/DHCodeReview#1

This repository is ready for review. Please start with reviewing src/semanticlayertools/linkage/cocitation.py. If time allows, you are welcome to review src/semanticlayertools/pipelines/cocitetimeclusters.py and the other files in the pull request.

@zmbq left a comment:

I am unfamiliar with the Leiden algorithm and with most of your algorithmic approach. Since this is a relatively short code review process, I didn't spend time learning those. My comments are not on the actual calculations but on the code that performs them.

All in all, if this were the level of code usually written in labs, I would be out of a job...

:type limitRefLength: bool or int
"""
for path in [cociteOutpath, timeclusterOutpath, reportsOutpath]:
    os.makedirs(path)
@zmbq:

Usability: You should add exist_ok=True, so that the program doesn't fail in case the output directories exist. If you want them to be empty, or want to create new ones, check for their emptiness (or non-existence) before you start, and output an error message that tells the user they need to empty the output folders first. Otherwise they'll just get an exception.
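
For illustration, the suggested guard might look roughly like this (a sketch only, reusing the pipeline's existing path parameters):

    for path in [cociteOutpath, timeclusterOutpath, reportsOutpath]:
        os.makedirs(path, exist_ok=True)
        # refuse to continue if earlier results are already present
        if os.listdir(path):
            raise SystemExit(f'Output folder {path} is not empty. Please empty the output folders first.')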

@maltevogl (Member):

The idea was that the pipeline acts as a convenience function grouping all steps of the processing. Since it is easy to run, it is also easy to accidentally overwrite needed data. That's why an already existing folder might not be OK. But I would capture the resulting exception and tell the user to check whether the given paths are really the right ones and to delete old data if needed.
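
That could look something like the following (a sketch of the described behaviour, not the actual fix):

    try:
        for path in [cociteOutpath, timeclusterOutpath, reportsOutpath]:
            os.makedirs(path)
    except FileExistsError:
        raise OSError(
            'An output folder already exists. Check that the given paths are '
            'really the right ones and delete old data if needed.'
        )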


:param inputFilepath: Path to corpora input data
:type text: str
:param cociteOutpath: Output path for cocitation networks
@zmbq:

Cosmetic: Why three output directories instead of one? You can create subfolders in it by default, one for each type of output.
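
A minimal sketch of that layout (the subfolder names here are invented):

    for subfolder in ('cocitations', 'timeclusters', 'reports'):
        os.makedirs(os.path.join(outputPath, subfolder), exist_ok=True)

where a single outputPath would replace the three separate path parameters.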

@maltevogl (Member):

This is true. It makes more sense to have one main output folder and then create the subfolders automatically, thanks!

for path in [cociteOutpath, timeclusterOutpath, reportsOutpath]:
    os.makedirs(path)
starttime = time.time()
cocites = Cocitations(
@zmbq:

Usability: In case of a lengthy pipeline, I find it better to break the pipeline into discrete steps that can run individually. You already have three discrete steps; an argument for running just one of them (any one of them, really) can be very useful to users. Of course, the argument for each step should be its input folder.
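
One hypothetical shape for such a switch (names invented for illustration):

    def run(inputPath, outputPath, steps=('cocite', 'cluster', 'report')):
        if 'cocite' in steps:
            ...  # build cocitation networks from inputPath
        if 'cluster' in steps:
            ...  # detect time clusters in the networks from the previous step
        if 'report' in steps:
            ...  # gather metadata and write cluster reports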

@maltevogl (Member):

Makes sense: the pipeline as an entry point to the details of each subprocess. Will add this.

)
clusterreports.gatherClusterMetadata()
clusterreports.writeReports()
print(f'Done after {time.time() - starttime} seconds.')
@zmbq:

I may have missed something - it seems as if there is no way to run this pipeline from the command line. It's much more convenient to add command line arguments that allow running from the shell, instead of from a Python prompt.

Take a look at https://github.com/swansonk14/typed-argument-parser for a cool type-aware argument parser.
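
A minimal sketch of what that could look like with the linked library (assuming its Tap class; the parameter names are borrowed from the pipeline):

    from tap import Tap

    class PipelineArgs(Tap):
        inputFilepath: str   # Path to corpora input data
        outputPath: str      # Main output folder
        numberProc: int = 4  # Number of CPUs the package is allowed to use

    args = PipelineArgs().parse_args()

Plain argparse from the standard library would work just as well.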

@maltevogl (Member):

Again, this is correct. Will have to change this, especially since some of the processes can be extremely long-running...

:type columnName: str
:param numberProc: Number of CPUs the package is allowed to use (default=all)
:type numberProc: int
:param limitRefLength: Either False or integer giving the maximum number of references a considered publication is allowed to contain
@zmbq:

Why not use None instead of False?

@maltevogl (Member):

Both are possible, but I chose False as the answer to the question the parameter asks :-) Limit reference length? False.
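
Either sentinel boils down to the same kind of check (illustration only; refs stands in for a publication's reference list):

    # with False as the sentinel (current choice)
    if limitRefLength is not False:
        refs = refs[:limitRefLength]

    # with None as the sentinel (the suggestion)
    if limitRefLength is not None:
        refs = refs[:limitRefLength]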

if os.path.isdir(self.outpath):
    raise OSError(f'Output folder {self.outpath} exists. Aborting.')
else:
    os.mkdir(self.outpath)
@zmbq:

Same for exist_ok - I commented on creating directories elsewhere.

titles = dataframe[self.textcolumn].values
for title in tqdm(titles, leave=False):
    try:
        # text pre-processing
@zmbq:

A short comment on the pre-processing could be useful - seems like you're removing some punctuation, some escape symbols and whitespace, but I'm not sure. You're also removing digits?

)
except ValueError:
    raise
dfCluster = pd.concat(clusterdf, ignore_index=True)
@zmbq:

The variables clusterdf and dfCluster confused me. They're both dataframes, one in Hungarian notation, but I'm not sure why there are two of them or what the difference between them is.

@maltevogl (Member):

This is bad naming on my side. clusterdf is actually a list of dataframes. Will change.

inputnodes = set(basedf.node.values)
notFound = inputnodes.difference(set(dfCluster[self.publicationIDcolumn].values))
topAuthors = Counter(
    [x for y in [x.split(';') for x in dfCluster[self.authorColumnName].fillna('').values] for x in y]
@zmbq:

Double list comprehension always confuses me a lot. I'm never sure which for runs first, and what the result is.
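
Spelled out as plain loops, the comprehension above would read roughly like this (the same flattening, just more explicit):

    authors = []
    for authorString in dfCluster[self.authorColumnName].fillna('').values:
        for author in authorString.split(';'):
            authors.append(author)
    topAuthors = Counter(authors)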

for x in topAuthors:
    if x[0] != '':
        authortext += f'\t{x[0]}: {x[1]}\n'
topAffils = Counter(
@zmbq:

Also, since this non-trivial Counter expression appears twice, I'd move it to a nested function that creates the counter.
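
Something along these lines, perhaps (a sketch; the helper name and the affiliation column attribute are invented):

    def countSplitValues(df, columnName, sep=';'):
        # count individual entries in a column of separator-joined strings
        values = []
        for joined in df[columnName].fillna('').values:
            values.extend(joined.split(sep))
        return Counter(values)

    topAuthors = countSplitValues(dfCluster, self.authorColumnName)
    topAffils = countSplitValues(dfCluster, self.affiliationColumnName)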

@jdamerow (Author):

@zmbq this is great, thanks so much! What do you think the author should definitely address to get a code review completed badge?

@maltevogl (Member):

Thanks @zmbq for this truly in-depth review! Lots of details that I will address in the upstream version, which lives here: https://gitlab.gwdg.de/modelsen/semanticlayertools. Once done, I'll let you know and also update the GitHub mirror.

@zmbq commented Jun 24, 2022

I wasn't aware there was a badge involved! I think @maltevogl deserves it.
