-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature/https-or-http-func #176
feature/https-or-http-func #176
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will review tomorrow!!
Feel free to take a look though it may not be in its final form, I forgot to comment that I wanted to get a PR open to confirm some test failures that I was seeing locally, looks like they are so I'll need to look into that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing this issue! I had a few comments, mostly regarding some edge cases where the url does not start with http
or https
. I think this could be improved by removing that assumption, and passing in the resource_exists
callable that also has google_credentials_file
as a kwarg so we can account for gs://
cases
if not is_secure_uri(resource_uri): | ||
# Attempt to find secure version of resource and simply swap | ||
# otherwise we will have to host | ||
if session.video_uri.startswith("http://"): | ||
secure_uri = session.video_uri.replace("http://", "https://") | ||
if resource_uri.startswith("http://"): | ||
secure_uri = resource_uri.replace("http://", "https://") | ||
if resource_exists(secure_uri): | ||
log.info( | ||
f"Found secure version of {session.video_uri}, " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section seems a little redundant since we're already doing logic to check whether the url is https
or http
in try_url
. I think we could tweak this to account for that
cdp_backend/utils/string_utils.py
Outdated
if not url.startswith("http://") and not url.startswith("https://"): | ||
raise ValueError("url must be a valid http or https url") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can say this is necessarily true in al cases. Certain url's pointing to GCP resources could start with gs://
, and if so we handle those differently
So I wanna say we can delete this conditional statement
cdp_backend/utils/string_utils.py
Outdated
if url.startswith("http://"): | ||
url = url.replace("http://", "https://") | ||
|
||
if resolve_func(url): | ||
return url | ||
|
||
url = url.replace("https://", "http://") | ||
if resolve_func(url): | ||
return url |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having two replace operations, could we make this something like:
secure_url = url.replace("http://", "https://")
if resolve_func(secure_url):
return secure_url
if resolve_func(url):
return url
cdp_backend/utils/string_utils.py
Outdated
return url | ||
|
||
# raise LookupError("the resource {} does not exist with either http / https".format(url)) | ||
return "" # if we got this far, the resource does not exist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little hesitant to return an empty string here instead of raising an error. We'd have to handle the empty string wherever this method is used, and if we miss that then it could lead to us accidentally processing data with an empty string instead of catching an error
cdp_backend/utils/string_utils.py
Outdated
if resolve_func(url): | ||
return url | ||
|
||
# raise LookupError("the resource {} does not exist with either http / https".format(url)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't necessarily true if the url is gs://
|
||
|
||
@pytest.mark.parameterize( | ||
"url, expected"[("https://exists", True), ("https://not-exists", False)], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think you missed a comma after "url, expected"
Thanks for the feedback! Most of these comments should be addressed in some local changes I have which I'll push up soon. I tried to make |
Yea circular dependencies are a no go in Python. I would be fine if you moved this function and associated tests from |
Made some changes, let me know what you all think. Validators made the most sense to me just given the name for putting this logic in, but will happily defer to you two and will change it to file_utils if that makes the most sense. Tests are passing locally but I'm seeing some weirdness with my setup (probably just missing python versions) so I don't 100% believe that yet. |
Codecov Report
@@ Coverage Diff @@
## main #176 +/- ##
==========================================
+ Coverage 94.54% 94.62% +0.07%
==========================================
Files 50 50
Lines 2587 2605 +18
==========================================
+ Hits 2446 2465 +19
+ Misses 141 140 -1
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! Thanks for taking this one. I have two minor nitpicks but would be happy to merge now or after you answer one question!
- One nitpick is on using
fstrings
instead ofstring.format
- Can you lint the code real quick? Running
tox -e lint
should help fixup any of those issues. Related... GitHub is showing you deleted a test resource and added a test resource?
Question! This function is meant to help us archive "EventMinutesItemFile" and "MatterFile" data better.
If you take a quick look at the database schema: https://councildataproject.org/cdp-backend/database_schema.html
In the pipeline, we just ignore failing URI checks for matter file and such: https://github.com/CouncilDataProject/cdp-backend/blob/main/cdp_backend/pipeline/event_gather_pipeline.py#L1444
and skip to the next task in the pipeline. It would be great if we used this function to try the URL before even creating and uploading the model. Basically, the pseudocode would be something like: try_url
-> returned string -> create and upload model -> catch and log any random failure?? || try_url
-> raised LookupError
-> log
file details as unable to uplaod like we are currently doing
So, with that all said, want to include that work in this PR or in a second PR?
Also last note: how was the self-onboarding process? places we can improve?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just had a question about the conditional logic + some nitpicks, and I think there were some extra files/imports that weren't being used.
Also for PR's we use a lint checker which seems to be failing in this case. It's usually pretty descriptive about what it wants you to reformat, and the main checks we run are isort
, black
, mypy
, and flake
Thanks again for the feedback! I believe I addressed the two nitpicks you mentioned. I'm happy to work on the archiving improvements as a part of this PR if this functionality not existing isn't blocking anything else. May take me a bit before I can fully address this next phase of the PR. As for the onboarding, I think you all have an excellent onboarding process. My struggles are just not being that well versed with the Python ecosystem, mainly two things:
I think the code base is excellent and I enjoyed working on this, so if you will have me on further things, I would be happy to receive any recommendations for a better Python dev setup. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, I had one question regarding mixed content for Jackson in a comment, but once that's resolved I think this can be merged in
Going to merge this now simply because it is a nice unit of work to PR in. But yes please do work on the archiving improvements whenever you get a chance! Really really quick and neat work of this!
Yea, we may want to change the CONTRIBUTING doc to mention that you can run individual versions of the build system with the environment specifier... As far as integration with IDE... yea, Python just doesn't have the build systems as Go or Rust do. Pretty good tho. VS Code can format on save and reorder imports and such on save but wont handle the lint and typing for you :/ |
Link to Relevant Issue
This pull request resolves #148
Description of Changes
This adds a func
try_url
instring_utils.py
that will do a request to the https / http version of a given url, giving preference to https. If https is found first, it returns that url immediately and doesn't attempt to find the http version.I'm passing in the optional resolve_func to allow for mock testing in unit tests, as I don't want to rely on external URL's for the tests to pass. The default resole_func is the already existing
resource_exists
func.