feature/https-or-http-func #176

Merged

liamphmurphy
Contributor

Link to Relevant Issue

This pull request resolves #148

Description of Changes

This adds a function, try_url, in string_utils.py that requests the https and http versions of a given URL, giving preference to https. If the https version resolves, it returns that URL immediately and doesn't attempt to find the http version.

I'm passing in the optional resolve_func to allow for mock testing in unit tests, as I don't want to rely on external URLs for the tests to pass. The default resolve_func is the already existing resource_exists func.
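
A minimal sketch of the behaviour described above, not the code from this PR. Only the names try_url and resolve_func come from the description; the function body, the make_fake_resolver helper, the example hosts, and the choice to raise LookupError when nothing resolves are assumptions for illustration.

```python
from typing import Callable, List, Set, Tuple


def try_url(url: str, resolve_func: Callable[[str], bool]) -> str:
    """Return the https form of url if it resolves, otherwise the http form.

    Illustrative sketch only: assumes url starts with http:// or https://.
    """
    if url.startswith("https://"):
        rest = url[len("https://"):]
    elif url.startswith("http://"):
        rest = url[len("http://"):]
    else:
        raise ValueError("url must be a valid http or https url")

    # https is tried first, so the http form is never requested
    # when the https form resolves.
    for candidate in ("https://" + rest, "http://" + rest):
        if resolve_func(candidate):
            return candidate
    raise LookupError(f"{url} does not exist over https or http")


def make_fake_resolver(existing: Set[str]) -> Tuple[Callable[[str], bool], List[str]]:
    """Hypothetical test helper: resolves only URLs in `existing`, no network."""
    calls: List[str] = []

    def fake_resolver(url: str) -> bool:
        calls.append(url)
        return url in existing

    return fake_resolver, calls


resolver, calls = make_fake_resolver(
    {"https://example.com/a", "http://example.com/b"}
)
found_secure = try_url("http://example.com/a", resolver)
found_insecure = try_url("https://example.com/b", resolver)
```

Because the resolver is injected, the test can also inspect `calls` to confirm that the http form was never requested once the https form resolved.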

Member

@evamaxfield evamaxfield left a comment

Will review tomorrow!!

@evamaxfield evamaxfield added the enhancement New feature or request label Mar 20, 2022
@liamphmurphy
Contributor Author

> Will review tomorrow!!

Feel free to take a look, though it may not be in its final form. I forgot to mention that I wanted to get a PR open to confirm some test failures I was seeing locally. It looks like they fail here too, so I'll need to look into that.

Collaborator

@isaacna isaacna left a comment

Thanks for addressing this issue! I had a few comments, mostly regarding some edge cases where the url does not start with http or https. I think this could be improved by removing that assumption, and passing in the resource_exists callable that also has google_credentials_file as a kwarg so we can account for gs:// cases
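
One way to pass a resolver that already carries a keyword argument is functools.partial. The sketch below is only illustrative: the resource_exists defined here is a hypothetical stand-in with made-up behaviour (the real function lives in cdp_backend and is not reproduced), and the credentials path and bucket name are placeholders.

```python
from functools import partial
from typing import Optional


def resource_exists(uri: str, google_credentials_file: Optional[str] = None) -> bool:
    """Hypothetical stand-in, NOT the cdp_backend implementation.

    Pretends that gs:// lookups succeed only when credentials are supplied
    and that every https:// URI exists.
    """
    if uri.startswith("gs://"):
        return google_credentials_file is not None
    return uri.startswith("https://")


# Bind the keyword once; the result is a one-argument callable that can be
# handed to anything expecting resolve_func(url) -> bool.
resolve_with_creds = partial(
    resource_exists, google_credentials_file="path/to/credentials.json"
)

without_creds = resource_exists("gs://example-bucket/minutes.pdf")
with_creds = resolve_with_creds("gs://example-bucket/minutes.pdf")
```

The caller of the URL check never needs to know about the credentials; it just calls the resolver with a URI.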

Comment on lines 344 to 351
if not is_secure_uri(resource_uri):
# Attempt to find secure version of resource and simply swap
# otherwise we will have to host
if session.video_uri.startswith("http://"):
secure_uri = session.video_uri.replace("http://", "https://")
if resource_uri.startswith("http://"):
secure_uri = resource_uri.replace("http://", "https://")
if resource_exists(secure_uri):
log.info(
f"Found secure version of {session.video_uri}, "
Collaborator

This section seems a little redundant since we're already doing logic to check whether the url is https or http in try_url. I think we could tweak this to account for that

Comment on lines 171 to 172
    if not url.startswith("http://") and not url.startswith("https://"):
        raise ValueError("url must be a valid http or https url")
Collaborator

I don't think we can say this is necessarily true in all cases. Certain URLs pointing to GCP resources could start with gs://, and if so we handle those differently.

So I wanna say we can delete this conditional statement
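
A rough sketch of what dropping the scheme check could look like: only http:// URLs get an https attempt, and anything else (for example gs://) is checked as given. The body, the error message, and the example URIs are assumptions, not the merged code.

```python
from typing import Callable
from urllib.parse import urlsplit


def try_url(url: str, resolve_func: Callable[[str], bool]) -> str:
    """Sketch: prefer https for http URLs, pass other schemes through as-is."""
    candidates = [url]
    if urlsplit(url).scheme == "http":
        # Try the secure form first, then fall back to the original URL.
        candidates.insert(0, "https://" + url[len("http://"):])

    for candidate in candidates:
        if resolve_func(candidate):
            return candidate
    raise LookupError(f"the resource {url} could not be found")


# Hypothetical resolver that only knows about two resources.
known = {"gs://example-bucket/minutes.pdf", "https://example.com/agenda.pdf"}


def exists(uri: str) -> bool:
    return uri in known


gs_result = try_url("gs://example-bucket/minutes.pdf", exists)
upgraded = try_url("http://example.com/agenda.pdf", exists)
```

With this shape the gs:// handling stays entirely inside the resolver, which is where credentials would be applied.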

Comment on lines 174 to 182
    if url.startswith("http://"):
        url = url.replace("http://", "https://")

    if resolve_func(url):
        return url

    url = url.replace("https://", "http://")
    if resolve_func(url):
        return url
Collaborator

Instead of having two replace operations, could we make this something like:

    secure_url = url.replace("http://", "https://")

    if resolve_func(secure_url):
        return secure_url
    if resolve_func(url):
        return url

return url

# raise LookupError("the resource {} does not exist with either http / https".format(url))
return "" # if we got this far, the resource does not exist
Collaborator

I'm a little hesitant to return an empty string here instead of raising an error. We'd have to handle the empty string wherever this method is used, and if we miss that then it could lead to us accidentally processing data with an empty string instead of catching an error
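
To illustrate the concern, here is a small comparison of the two failure styles. Both helper functions and the record structure are hypothetical; the point is only that a forgotten check lets "" flow into downstream data, while an exception cannot be ignored silently.

```python
from typing import Callable, Dict, List


def find_or_empty(url: str, resolve_func: Callable[[str], bool]) -> str:
    """Sentinel style: returns "" when the resource is missing."""
    return url if resolve_func(url) else ""


def find_or_raise(url: str, resolve_func: Callable[[str], bool]) -> str:
    """Exception style: raises when the resource is missing."""
    if resolve_func(url):
        return url
    raise LookupError(f"the resource {url} does not exist")


def nothing_resolves(url: str) -> bool:
    return False


records: List[Dict[str, str]] = []

# The caller forgot to check for "", so an empty URI is stored.
records.append(
    {"uri": find_or_empty("https://example.com/missing.pdf", nothing_resolves)}
)

# The same mistake is impossible here: nothing is stored unless the
# lookup succeeded, and the failure is available for logging.
failure = None
try:
    records.append(
        {"uri": find_or_raise("https://example.com/missing.pdf", nothing_resolves)}
    )
except LookupError as exc:
    failure = str(exc)
```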

if resolve_func(url):
return url

# raise LookupError("the resource {} does not exist with either http / https".format(url))
Collaborator

This isn't necessarily true if the url is gs://



@pytest.mark.parameterize(
"url, expected"[("https://exists", True), ("https://not-exists", False)],
Collaborator

Think you missed a comma after "url, expected"
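
For context, a short demonstration of why the missing comma matters: a string followed directly by square brackets is a subscript expression, so Python tries to index the string with the list instead of passing two arguments. The snippet avoids importing pytest so it runs anywhere; the decorator form is shown as a comment. Note also that pytest spells the marker `parametrize`.

```python
cases = [("https://exists", True), ("https://not-exists", False)]

error_message = None
try:
    # Same shape as the flagged line: "url, expected"[...] indexes the
    # string with a list, which is a TypeError at runtime.
    "url, expected"[cases]
except TypeError as exc:
    error_message = str(exc)

# With the comma, the argument names and the cases are two separate
# positional arguments, which is what the decorator expects:
#
#     @pytest.mark.parametrize("url, expected", cases)
#     def test_something(url, expected): ...
decorator_args = ("url, expected", cases)
```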

@liamphmurphy
Contributor Author

liamphmurphy commented Mar 20, 2022

> Thanks for addressing this issue! I had a few comments, mostly regarding some edge cases where the url does not start with http or https. I think this could be improved by removing that assumption, and passing in the resource_exists callable that also has google_credentials_file as a kwarg so we can account for gs:// cases

Thanks for the feedback! Most of these comments should be addressed in some local changes I have which I'll push up soon.

I tried to make resource_exists the default callable but was getting import errors; I shall try that again. On second look, it may be a circular dependency problem: cdp_backend.database.validators imports from string_utils, and importing resource_exists would require importing from cdp_backend.database.validators. I'm not the strongest with Python, but I assume circular dependencies are considered a problem?

@evamaxfield
Member

> I tried to make resource_exists the default callable but was getting import errors; I shall try that again. On second look, it may be a circular dependency problem: cdp_backend.database.validators imports from string_utils, and importing resource_exists would require importing from cdp_backend.database.validators. I'm not the strongest with Python, but I assume circular dependencies are considered a problem?

Yea circular dependencies are a no go in Python. I would be fine if you moved this function and associated tests from string_utils to file_utils... I don't think there is a circular dep there. If there is.... I guess just move it straight to validators for now?
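
For readers less familiar with Python imports, the failure mode can be reproduced with two throwaway modules. The module names demo_validators and demo_string_utils are made up for this sketch and only mirror the shape of the dependency described above. Strictly speaking Python does not forbid import cycles, but a `from module import name` against a module that is still being initialised fails like this.

```python
import importlib
import sys
import tempfile
from pathlib import Path

error = None

with tempfile.TemporaryDirectory() as tmp:
    # demo_validators needs something from demo_string_utils ...
    Path(tmp, "demo_validators.py").write_text(
        "from demo_string_utils import clean\n"
        "\n"
        "def resource_exists(uri):\n"
        "    return True\n"
    )
    # ... and demo_string_utils needs something from demo_validators.
    Path(tmp, "demo_string_utils.py").write_text(
        "from demo_validators import resource_exists\n"
        "\n"
        "def clean(text):\n"
        "    return text.strip()\n"
    )

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        importlib.import_module("demo_validators")
    except ImportError as exc:
        # resource_exists is not defined yet when demo_string_utils asks
        # for it, because demo_validators is only partially initialised.
        error = str(exc)
    finally:
        sys.path.remove(tmp)
        sys.modules.pop("demo_validators", None)
        sys.modules.pop("demo_string_utils", None)
```

Moving the function next to the thing it depends on removes the cycle, since only one module then imports from the other.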

@liamphmurphy
Contributor Author

Made some changes, let me know what you all think. Validators made the most sense to me just given the name for putting this logic in, but will happily defer to you two and will change it to file_utils if that makes the most sense.

Tests are passing locally but I'm seeing some weirdness with my setup (probably just missing python versions) so I don't 100% believe that yet.

@codecov

codecov bot commented Mar 20, 2022

Codecov Report

Merging #176 (f00cf73) into main (5270305) will increase coverage by 0.07%.
The diff coverage is 96.00%.

@@            Coverage Diff             @@
##             main     #176      +/-   ##
==========================================
+ Coverage   94.54%   94.62%   +0.07%     
==========================================
  Files          50       50              
  Lines        2587     2605      +18     
==========================================
+ Hits         2446     2465      +19     
+ Misses        141      140       -1     
| Impacted Files | Coverage Δ |
|---|---|
| cdp_backend/database/validators.py | 88.05% <88.88%> (+3.57%) ⬆️ |
| cdp_backend/pipeline/event_gather_pipeline.py | 85.78% <100.00%> (+0.03%) ⬆️ |
| cdp_backend/tests/database/test_validators.py | 100.00% <100.00%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@evamaxfield evamaxfield left a comment

This is great! Thanks for taking this one. I have two minor nitpicks but would be happy to merge now or after you answer one question!

  1. One nitpick is on using f-strings instead of str.format
  2. Can you lint the code real quick? Running tox -e lint should help fix up any of those issues. Related: GitHub is showing that you deleted a test resource and added a test resource?
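
For reference, the first nitpick amounts to the following change in style. The message text is borrowed from the commented-out line in the diff above; the URL is a placeholder.

```python
url = "http://example.com/minutes.pdf"

# str.format style, as in the original diff.
with_format = "the resource {} does not exist with either http / https".format(url)

# Equivalent f-string, which interpolates the variable in place.
with_fstring = f"the resource {url} does not exist with either http / https"
```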

Question! This function is meant to help us archive "EventMinutesItemFile" and "MatterFile" data better.

If you take a quick look at the database schema: https://councildataproject.org/cdp-backend/database_schema.html

In the pipeline, we just ignore failing URI checks for matter file and such: https://github.com/CouncilDataProject/cdp-backend/blob/main/cdp_backend/pipeline/event_gather_pipeline.py#L1444

and skip to the next task in the pipeline. It would be great if we used this function to try the URL before even creating and uploading the model. Basically, the pseudocode would be something like: try_url -> returned string -> create and upload model -> catch and log any random failure?? || try_url -> raised LookupError -> log file details as unable to upload like we are currently doing
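
The pseudocode above could be sketched roughly as follows. This is an illustration, not pipeline code: archive_files, the upload callable, and the try_url stand-in are all hypothetical, and the real pipeline creates database models rather than appending to a list.

```python
import logging
from typing import Callable, Iterable, List

log = logging.getLogger(__name__)


def try_url(url: str, resolve_func: Callable[[str], bool]) -> str:
    """Stand-in for the function in this PR: prefer https, raise if missing."""
    candidates = [url]
    if url.startswith("http://"):
        candidates.insert(0, "https://" + url[len("http://"):])
    for candidate in candidates:
        if resolve_func(candidate):
            return candidate
    raise LookupError(f"the resource {url} could not be found")


def archive_files(
    uris: Iterable[str],
    resolve_func: Callable[[str], bool],
    upload: Callable[[str], None],
) -> List[str]:
    """Check each URI first; only create and upload for ones that resolve."""
    uploaded: List[str] = []
    for uri in uris:
        try:
            resolved = try_url(uri, resolve_func)
        except LookupError:
            # try_url raised: log the file as unable to upload and move on.
            log.warning("Unable to upload file, URI did not resolve: %s", uri)
            continue
        try:
            upload(resolved)
        except Exception:
            # Any other failure while creating or uploading is logged too.
            log.exception("Upload failed for %s", resolved)
            continue
        uploaded.append(resolved)
    return uploaded


existing = {"https://example.com/minutes.pdf"}
uploaded_store: List[str] = []
archived = archive_files(
    ["http://example.com/minutes.pdf", "http://example.com/missing.pdf"],
    resolve_func=lambda uri: uri in existing,
    upload=uploaded_store.append,
)
```

Checking the URI before building the model means a missing file costs one log line instead of a failed task later in the pipeline.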

So, with that all said, want to include that work in this PR or in a second PR?


Also last note: how was the self-onboarding process? places we can improve?

@evamaxfield evamaxfield changed the title tweak/https-or-http-func feature/https-or-http-func Mar 20, 2022
Collaborator

@isaacna isaacna left a comment

Just had a question about the conditional logic + some nitpicks, and I think there were some extra files/imports that weren't being used.

Also, for PRs we use a lint checker, which seems to be failing in this case. It's usually pretty descriptive about what it wants you to reformat, and the main checks we run are isort, black, mypy, and flake8.

@liamphmurphy
Contributor Author

liamphmurphy commented Mar 20, 2022

Thanks again for the feedback! I believe I addressed the two nitpicks you mentioned.

I'm happy to work on the archiving improvements as a part of this PR if this functionality not existing isn't blocking anything else. May take me a bit before I can fully address this next phase of the PR.

As for the onboarding, I think you all have an excellent onboarding process. My struggles are just not being that well versed with the Python ecosystem, mainly two things:

  1. I'm coming from a very different (and sometimes opinionated) set of tools in the Go world. For example, I did the "{}".format(str) because that is how I'm used to doing it in Go with their (somewhat) equivalent fmt.Sprintf("%s", str).
  2. I've never used tox before so I didn't have the linter setup to run on each save, which is how I have it done with my Go setup. So I've gotten very comfortable with Go's linter automatically removing unused imports among other things. To my eyes, the tox output from make build is pretty lengthy and verbose so I wasn't adequately spotting some of the errors it was notifying me about. Hopefully I'm a bit better keyed into that now.

I think the code base is excellent and I enjoyed working on this, so if you will have me on further things, I would be happy to receive any recommendations for a better Python dev setup.

Collaborator

@isaacna isaacna left a comment

Looks good to me, I had one question regarding mixed content for Jackson in a comment, but once that's resolved I think this can be merged in

@evamaxfield
Member

Going to merge this now simply because it is a nice unit of work to PR in. But yes please do work on the archiving improvements whenever you get a chance! Really really quick and neat work of this!

> I've never used tox before so I didn't have the linter setup to run on each save, which is how I have it done with my Go setup. So I've gotten very comfortable with Go's linter automatically removing unused imports among other things. To my eyes, the tox output from make build is pretty lengthy and verbose so I wasn't adequately spotting some of the errors it was notifying me about. Hopefully I'm a bit better keyed into that now.

Yea, we may want to change the CONTRIBUTING doc to mention that you can run individual versions of the build system with the environment specifier... tox -e py38 will run python 3.8, tox -e lint will just run the linter etc.

As far as integration with IDE... yea, Python just doesn't have the same build systems that Go or Rust do. Pretty good tho. VS Code can format on save and reorder imports and such on save but won't handle the lint and typing for you :/

@evamaxfield evamaxfield merged commit b30324e into CouncilDataProject:main Mar 21, 2022
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Add a function for replacing all http URIs with https URIs when possible
3 participants