New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a load_api_data management command #1734

Merged
merged 17 commits into from Apr 12, 2018

Conversation

Projects
None yet
2 participants
@toolness
Contributor

toolness commented Apr 9, 2018

This adds a load_api_data management command, as discussed in #1028.

Here's the command-line help, with the universal options (e.g. verbosity) removed:

usage: manage.py load_api_data [-s START_PAGE]
                               [-e END_PAGE] [--append] [-u URL]

Load rate data from the API of another CALC instance.

optional arguments:
  -s START_PAGE, --start-page START_PAGE
                        start page (default is 1)
  -e END_PAGE, --end-page END_PAGE
                        end page (default is to read all pages)
  --append              append to existing rates (instead of deleting them
                        first)
  -u URL, --url URL     URL of CALC API (default is
                        https://api.data.gov/gsa/calc/rates/)

Example usage

This will load the first 5 pages of API data (given that our pagination size is currently hard-coded to 200, it will retrieve 1000 rates):

python manage.py load_api_data -e 5

This will append the next five pages to the database, without wiping out the previously loaded data:

python manage.py load_api_data -s 6 -e 10 --append

Notes

  • Right now I don't think it's possible to ask the CALC API to change its pagination size. Enabling this is pretty easy and could make this management command run quite a bit faster, for when developers want to transfer the whole database.

  • The command isn't wrapped in an atomic transaction. Since the command is intended primarily for development, I figured it'd be OK if devs just ran the command with no arguments and then pressed Ctrl + C once they got tired of waiting. This would result in some data in the database, at least. But if this is unexpected, and we really want the database to be rolled back, then we can do that. Alternatively, if we really want, we could prompt the user, asking them if they want to commit or rollback, but this would require adding additional tests.

  • Right now one disadvantage of this command in comparison to load_data is that because the data is remote, it's not quite as easy to re-populate it if one has to wipe their DB. I guess one could use the built-in dumpdata command to dump their data and re-load it later, though. Hmm.

To do

  • Add tests.
  • Do we want to rename the command? I wasn't sure if load_api_data, load_from_api, loadapi, or something else was best.
def create(self, validated_data):
return Contract.objects.bulk_create([
Contract(**item) for item in validated_data
])

This comment has been minimized.

@toolness

toolness Apr 9, 2018

Contributor

By default, ListSerializer doesn't bulk create, so this subclass speeds things up a lot.

This comment has been minimized.

@jseppi

jseppi Apr 12, 2018

Contributor

That's some good pythoning right there

while True:
if end_page is not None and page > end_page:
break
res = requests.get(f"{url}?page={page}").json()

This comment has been minimized.

@toolness

toolness Apr 9, 2018

Contributor

Ideally we would just use res['next'] for this value, but alas for #1692. It also means that devs couldn't add to the querystring in the --url argument to the command, e.g. to load their CALC with the results of a certain search query, though I don't even know if that's a desired use case. But still, just pointing it out.

@toolness

This comment has been minimized.

Contributor

toolness commented Apr 9, 2018

@jseppi @hbillings what do you think about this so far? If it looks 👍 then I will add tests but otherwise we can talk about how to improve it.

@jseppi

This comment has been minimized.

Contributor

jseppi commented Apr 9, 2018

Looks pretty good to me!

@toolness

This comment has been minimized.

Contributor

toolness commented Apr 9, 2018

@jseppi ok cool! so the plan is...

  1. Add tests for this, review and merge this PR.
  2. Update developer docs to have developers run this new command instead of load_data and load_s70.
  3. Remove load_data and load_s70 from the repository.

Sounds reasonable?

toolness added some commits Apr 11, 2018

@toolness

This comment has been minimized.

Contributor

toolness commented Apr 11, 2018

uggh selenium

@toolness

This comment has been minimized.

Contributor

toolness commented Apr 11, 2018

@jseppi do you think we should move this new load_api_data command from contracts to api? I don't think contracts currently has any dependencies on API, but putting this command in it will make that the case...

Anyways, aside from that I think this is finally ready for review.

@toolness toolness changed the title from [WIP] Add a load_api_data management command to Add a load_api_data management command Apr 11, 2018

@toolness toolness requested a review from jseppi Apr 11, 2018

@jseppi

This comment has been minimized.

Contributor

jseppi commented on .circleci/config.yml in 191f578 Apr 11, 2018

😆 👍

@jseppi

jseppi approved these changes Apr 12, 2018

Great! LGTM!

@@ -142,7 +142,8 @@ jobs:
name: Run python tests
command: |
# Note that this MUST come after building the front end assets
py.test --cov=. --cov-report xml:./coverage/python/coverage.xml
# For details on why we're ignoring selenium tests, see https://github.com/18F/calc/issues/330

This comment has been minimized.

@jseppi

jseppi Apr 12, 2018

Contributor

😭

def create(self, validated_data):
return Contract.objects.bulk_create([
Contract(**item) for item in validated_data
])

This comment has been minimized.

@jseppi

jseppi Apr 12, 2018

Contributor

That's some good pythoning right there

class ContractSerializer(serializers.ModelSerializer):
# TODO: Figure out whether we actually support a null value for

This comment has been minimized.

@jseppi

jseppi Apr 12, 2018

Contributor

Want to make a separate issue for this so we don't have a dangling TODO?

try:
for rates, total_pages in pages:
if pbar is None:
pbar = tqdm(total=total_pages)

This comment has been minimized.

@jseppi

jseppi Apr 12, 2018

Contributor

These two names together are kinda funny. maybe progress_bar instead?

)
# These tests are a bit funky because we run the command against its

This comment has been minimized.

@jseppi

jseppi Apr 12, 2018

Contributor

APICEPTION!

@jseppi

This comment has been minimized.

Contributor

jseppi commented Apr 12, 2018

It kinda makes sense to go in the api app, but I don't feel strongly about it.

@toolness

This comment has been minimized.

Contributor

toolness commented Apr 12, 2018

That's some good pythoning right there

Hah, it was actually copypasta'd from the DRF docs. :)

Ok I think your suggestions are good, I will implement them and then merge this!

@toolness toolness merged commit cb5070d into develop Apr 12, 2018

3 checks passed

ci/circleci: build Your tests passed on CircleCI!
Details
codeclimate All good!
Details
codeclimate/total-coverage 92% (0.1% change)
Details

@toolness toolness deleted the load-api-data branch Apr 12, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment