DH Code Review #1
Conversation
I am unfamiliar with the Leiden algorithm and with most of your algorithmic approach. Since this is a relatively short code review process, I didn't spend time learning them. My comments are not on the actual calculations, but on the code that performs them.
All in all, if this were the level of code usually written in labs, I would be out of a job...
    :type limitRefLength: bool or int
    """
    for path in [cociteOutpath, timeclusterOutpath, reportsOutpath]:
        os.makedirs(path)
Usability: You should add exist_ok=True, so that the program doesn't fail in case the output directories already exist. If you want them to be empty, or want to create new ones, check for their emptiness (or non-existence) before you start, and output an error message telling the user to empty the output folders first. Otherwise they'll just get an exception.
The idea was that the pipeline acts as a convenience function to group all steps of the processing. Since it is easy to run, it's also easy to accidentally overwrite needed data. That's why an existing folder might not be OK. But I would catch the resulting exception and tell the user to check whether the given paths are really the right ones, and delete old data if needed.
    :param inputFilepath: Path to corpora input data
    :type text: str
    :param cociteOutpath: Output path for cocitation networks
Cosmetic: Why three output directories, instead of one output directory? You can create subfolders in it by default, for each type of output.
This is true, it makes more sense to have one main output folder and then create the subfolders automatically. Thanks!
    for path in [cociteOutpath, timeclusterOutpath, reportsOutpath]:
        os.makedirs(path)
    starttime = time.time()
    cocites = Cocitations(
Usability: In case of a lengthy pipeline, I find it better to break the pipeline into discrete steps that can run individually. You already have three discrete steps; an argument for running just one of them (any one of them, really) can be very useful to users. Of course, the argument for each step should be its input folder.
Makes sense: the pipeline as an entry point to the details of each subprocess. Will add this.
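One way to sketch this: register each step in a mapping so the pipeline can run all of them or any single one. The step names and function bodies below are placeholders, not the reviewed code:

```python
# Hypothetical step registry: each step takes an input path and an output
# path, so any single step can be re-run on its own with its own input folder.
def runCocitations(inPath, outPath):
    return f"cocitations: {inPath} -> {outPath}"

def runTimeClusters(inPath, outPath):
    return f"timeclusters: {inPath} -> {outPath}"

def runReports(inPath, outPath):
    return f"reports: {inPath} -> {outPath}"

STEPS = {
    "cocite": runCocitations,
    "timecluster": runTimeClusters,
    "report": runReports,
}

def runPipeline(inPath, outPath, only=None):
    """Run all steps in order, or just the one named by `only`."""
    selected = [STEPS[only]] if only else list(STEPS.values())
    return [step(inPath, outPath) for step in selected]
```

Each real step would read from its own input folder, so a failed or changed late step never forces a re-run of the expensive early ones.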
    )
    clusterreports.gatherClusterMetadata()
    clusterreports.writeReports()
    print(f'Done after {time.time() - starttime} seconds.')
I may have missed something - it seems as if there is no way to run this pipeline from the command line. It's much more convenient to add command line arguments that allow running from the shell, instead of from a Python prompt.
Take a look at https://github.com/swansonk14/typed-argument-parser for a cool type-aware argument parser.
Again, this is correct. I will have to change this, especially since some of the processes can be extremely long-running...
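The linked typed-argument-parser is one option; the same idea can also be sketched with the standard library's argparse. The argument names below are illustrative stand-ins for the pipeline's parameters:

```python
import argparse

def buildParser() -> argparse.ArgumentParser:
    # Argument names here mirror the pipeline's parameters, but are a sketch.
    parser = argparse.ArgumentParser(
        description="Run the cocitation/time-cluster pipeline from the shell."
    )
    parser.add_argument("inputFilepath", help="Path to corpora input data")
    parser.add_argument("outputPath", help="Main output folder")
    parser.add_argument("--numberProc", type=int, default=None,
                        help="Number of CPUs to use (default: all)")
    parser.add_argument("--step",
                        choices=["cocite", "timecluster", "report"],
                        default=None,
                        help="Run only this pipeline step")
    return parser
```

With an entry point that calls `buildParser().parse_args()`, a long-running job could then be launched as something like `python -m pipeline corpus/ out/ --step cocite` straight from the shell.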
    :type columnName: str
    :param numberProc: Number of CPUs the package is allowed to use (default=all)
    :type numberProc: int
    :param limitRefLength: Either False or integer giving the maximum number of references a considered publication is allowed to contain
Why not use None instead of False?
Both are possible, but I chose False as the answer to the question posed by the parameter :-) Limit reference length? False.
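For comparison, the None convention would look like the following sketch; the helper name is hypothetical, not the reviewed code:

```python
from typing import Optional

def trimReferences(references: list, limitRefLength: Optional[int] = None) -> list:
    """Illustrative helper: None means "no limit", any int caps the list.

    Using None instead of False gives the parameter a clean Optional[int]
    type, while a truthiness check still reads like the original question.
    """
    if limitRefLength is None:
        return references          # no limit requested
    return references[:limitRefLength]
```

A practical edge case worth noting: with the `bool or int` convention, `limitRefLength=0` and `limitRefLength=False` are truthiness-equal, which `None` avoids.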
    if os.path.isdir(self.outpath):
        raise OSError(f'Output folder {self.outpath} exists. Aborting.')
    else:
        os.mkdir(self.outpath)
Same for exist_ok - I commented on creating directories elsewhere.
    titles = dataframe[self.textcolumn].values
    for title in tqdm(titles, leave=False):
        try:
            # text pre-processing
A short comment on the pre-processing could be useful. It seems like you're removing some punctuation, some escape symbols, and whitespace, but I'm not sure. You're also removing digits?
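As an illustration of the kind of per-step commenting suggested here, a hypothetical cleaning function might read as follows; the actual rules in the reviewed code may well differ:

```python
import re
import string

def preprocessTitle(title: str) -> str:
    """Illustrative pre-processing, one comment per step; the reviewed
    code's actual cleaning rules may differ."""
    text = title.lower()                      # normalize case
    text = re.sub(r"\\n|\\t", " ", text)      # drop literal escape sequences
    text = re.sub(r"\d+", "", text)           # remove digits
    text = text.translate(                    # strip punctuation
        str.maketrans("", "", string.punctuation)
    )
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
```

Even when each operation is short, a comment per line spares the next reader from reverse-engineering the regexes.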
    )
    except ValueError:
        raise
    dfCluster = pd.concat(clusterdf, ignore_index=True)
The variables clusterdf and dfCluster confused me. They're both dataframes, one named in Hungarian notation, but I'm not sure why there are two of them or what the difference between them is.
This is bad naming on my side. clusterdf is actually a list of dataframes; will change.
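A plural name for the list would make the concat self-explanatory; sketched here with toy data (the variable name `clusterFrames` is a suggestion, not the code's current name):

```python
import pandas as pd

# Illustrative renaming: a plural name for the list of per-cluster frames
# makes the concatenation into one combined frame obvious at a glance.
clusterFrames = [
    pd.DataFrame({"cluster": [1], "node": ["a"]}),
    pd.DataFrame({"cluster": [2], "node": ["b"]}),
]
dfCluster = pd.concat(clusterFrames, ignore_index=True)  # one combined frame
```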
    inputnodes = set(basedf.node.values)
    notFound = inputnodes.difference(set(dfCluster[self.publicationIDcolumn].values))
    topAuthors = Counter(
        [x for y in [x.split(';') for x in dfCluster[self.authorColumnName].fillna('').values] for x in y]
Double list comprehensions always confuse me. I'm never sure which for runs first, and what the result is.
    for x in topAuthors:
        if x[0] != '':
            authortext += f'\t{x[0]}: {x[1]}\n'
    topAffils = Counter(
Also, since this non-trivial Counter expression appears twice, I'd move it into a nested function that creates the counter.
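The shared helper could look like this sketch; the function name is hypothetical, and the same call would then serve both the author and affiliation columns:

```python
from collections import Counter

def countDelimitedValues(values, delimiter=";"):
    """Hypothetical helper: flatten delimiter-separated cells into one
    Counter, so the expression is written once instead of twice."""
    return Counter(
        item
        for cell in values
        for item in (cell or "").split(delimiter)
    )

# Usage for both repeated sites, with toy stand-ins for the two columns:
topAuthors = countDelimitedValues(["A;B", "A", None])
topAffils = countDelimitedValues(["X", "X;Y"])
```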
@zmbq this is great, thanks so much! What do you think the author should definitely address to get a code review completed badge?
Thanks @zmbq for this truly in-depth review! Lots of details that I will address in the upstream version, which lives here: https://gitlab.gwdg.de/modelsen/semanticlayertools. Once done, I'll let you know and also update the GitHub mirror.
I wasn't aware there was a badge involved! I think @maltevogl deserves it.
The ticket for this code review is: DHCodeReview/DHCodeReview#1
This repository is ready for review. Please start with reviewing src/semanticlayertools/linkage/cocitation.py. If time allows, you are welcome to review src/semanticlayertools/pipelines/cocitetimeclusters.py and the other files in the pull request.