enhance data export scripts #7

Closed
3 tasks
mnaydan opened this issue Jun 15, 2023 · 5 comments
@mnaydan

mnaydan commented Jun 15, 2023

  • revise dwellings.py script to fit members.py script
  • write unit tests for export dwellings.py
  • open pull request on s&co repository

From Rebecca: The data export scripts need code updates before we can export and validate the datasets. We could export and publish new datasets with the existing code, but they wouldn't include the updates to dates and addresses that they've been working on, which are needed for the geographic analysis.

@quadrismegistus
Collaborator

As a note, here's the quick script I ran to export the 'dwellings' data from a running mep-django installation. I will write this as a PR to mep-django as an export_dwellings.py command.

# export_dwellings.py

# Configure Django so the models can be used from a standalone script
import datetime as dt
import os
import pickle

import django

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mep.settings")
os.environ["DJANGO_ALLOW_ASYNC_UNSAFE"] = "true"
django.setup()

# Model imports must come after django.setup()
from mep.people.models import Person
from mep.common.utils import absolutize_url


def export_dwellings():
    """Collect one dict per person/account/address and pickle the list."""
    rows = []
    for person in Person.objects.all():
        print(person)
        for account in person.account_set.all():
            for addr in account.address_set.all():
                loc = addr.location
                rows.append(dict(
                    # Member
                    member_uri=absolutize_url(person.get_absolute_url()),
                    # IDs
                    person_id=person.id,
                    account_id=account.id,
                    address_id=addr.id,
                    location_id=loc.id,
                    # Address data
                    start_date=addr.start_date,
                    end_date=addr.end_date,
                    start_date_precision=addr.start_date_precision,
                    end_date_precision=addr.end_date_precision,
                    care_of_person_id=addr.care_of_person_id,
                    # Location data
                    street_address=loc.street_address,
                    city=loc.city,
                    postal_code=loc.postal_code,
                    latitude=loc.latitude,
                    longitude=loc.longitude,
                    country_id=loc.country_id,
                    arrondissement=loc.arrondissement(),
                ))
    # Date-stamped output file, e.g. dwellings.2023-06-15.pkl
    now = dt.datetime.now()
    ofn = f"dwellings.{now:%Y-%m-%d}.pkl"
    with open(ofn, "wb") as of:
        pickle.dump(rows, of)


if __name__ == "__main__":
    export_dwellings()

@quadrismegistus
Collaborator

Note: The above code needs to be adapted on the model of export_members.py

@rlskoeser
Contributor

@quadrismegistus @jkotin some comments on the address data export enhancements:

Here's the relevant GitHub issue in the mep-django codebase:
Princeton-CDH/mep-django#612
There isn't much detail there, just a suggestion of how we might structure it (which looks similar enough to what you've done here).

I don't think we should introduce a new term ("dwellings"); I strongly recommend we continue to call these addresses for consistency with previous versions of the datasets. My plan is for exported address data to be packaged with the members data for a new version of the members dataset. We'll have to update the datapackage validation and dataset readme to document the fields in the addresses export and how the two files relate. We'll also need to clearly document this in the dataset change log. There will be redundancy with the main member export data, but I think we should keep that for backwards compatibility.

I suggest the new export filename should be member_addresses.csv or possibly account_addresses.csv. We should generate both csv and json formats, and our existing export script code should make that easy. I think the easiest way to implement is a new manage command that extends our existing BaseExport class. It might be nice to create a convenience omnibus export manage command that uses call_command to run all the exports.
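To make the "both csv and json" point concrete, here is a minimal stand-alone sketch of writing the same rows in both formats. It uses only the standard library and is not the project's BaseExport class; the field list, the sample row, and the function names are hypothetical placeholders.

```python
import csv
import io
import json

# Hypothetical field list; the real export would define its own.
FIELDS = ["member_uri", "start_date", "end_date", "street_address", "city"]


def write_csv(rows, fileobj):
    """Write rows as CSV with a header row, ignoring keys not in FIELDS."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


def write_json(rows, fileobj):
    """Write the same rows as a JSON array, restricted to FIELDS."""
    json.dump([{f: row.get(f) for f in FIELDS} for row in rows], fileobj, indent=2)


# Made-up sample row, for illustration only
rows = [{
    "member_uri": "https://example.org/members/doe/",
    "start_date": "1921-05",
    "end_date": None,
    "street_address": "1 rue Exemple",
    "city": "Paris",
}]
csv_out, json_out = io.StringIO(), io.StringIO()
write_csv(rows, csv_out)
write_json(rows, json_out)
```

Sharing one field list between the two writers keeps the column order and the JSON keys from drifting apart.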

I still think it would be incredibly valuable to have a GeoJSON export of this data, because that would make the data usable with so many tools. That would require additional work, but there must be Python packages that would help with this (maybe something in GeoDjango would be useful).
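As a rough idea of the shape of such an export, the sketch below builds a GeoJSON FeatureCollection with the standard library alone. It assumes each row is a dict with `latitude` and `longitude` keys (as in the script earlier in this thread); the function name is hypothetical, and a real implementation might well use a dedicated package instead.

```python
import json


def addresses_to_geojson(rows):
    """Build a GeoJSON FeatureCollection from address rows.

    Rows without coordinates are skipped; all other keys become properties.
    """
    features = []
    for row in rows:
        lat, lon = row.get("latitude"), row.get("longitude")
        if lat is None or lon is None:
            continue
        props = {k: v for k, v in row.items() if k not in ("latitude", "longitude")}
        features.append({
            "type": "Feature",
            # GeoJSON positions are ordered [longitude, latitude]
            "geometry": {"type": "Point", "coordinates": [float(lon), float(lat)]},
            "properties": props,
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up sample rows, for illustration only
sample = [
    {"city": "Paris", "latitude": 48.85, "longitude": 2.34},
    {"city": "Unknown", "latitude": None, "longitude": None},
]
geojson_text = json.dumps(addresses_to_geojson(sample), indent=2)
```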

field-specific comments:

  • do not include any numeric database ids
  • technically addresses are associated with accounts, not members; we should include member URIs the same way that they are exported in the events dataset, and we should consider including member names and sort names as well (as we do there) so that the file can be used to some extent on its own
  • start and end dates should each be exported as a single field using partial_start_date and partial_end_date, which will compile the date information and precision information into the ISO format we're using across all the data exports
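To illustrate the idea behind the single-field dates, here is a small stand-alone sketch of combining a date with precision flags into one partial ISO-style string. It is not the project's partial_start_date implementation: the `Precision` flag names and values, the treatment of a missing precision as "fully known", and the `--MM-DD` form for an unknown year are all assumptions made for this example.

```python
import datetime
from enum import IntFlag


class Precision(IntFlag):
    """Hypothetical bit flags marking which parts of a date are known."""
    YEAR = 1
    MONTH = 2
    DAY = 4


def partial_iso(date, precision=None):
    """Format a date as an ISO-style string using only its known parts.

    A precision of None is treated as fully known (an assumption of this sketch).
    """
    if date is None:
        return ""
    if precision is None:
        precision = Precision.YEAR | Precision.MONTH | Precision.DAY
    precision = Precision(precision)
    if not precision:
        return ""
    # An unknown year is written as "--", e.g. --05-03 (assumed convention)
    parts = [f"{date.year:04d}" if Precision.YEAR in precision else "-"]
    if Precision.MONTH in precision:
        parts.append(f"{date.month:02d}")
        if Precision.DAY in precision:
            parts.append(f"{date.day:02d}")
    return "-".join(parts)


d = datetime.date(1921, 5, 3)
full = partial_iso(d)
year_month = partial_iso(d, Precision.YEAR | Precision.MONTH)
```

Exporting one string per date means consumers of the file do not need to interpret a separate precision column.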

@jkotin
Collaborator

jkotin commented Aug 11, 2023

Just a brief note: this all sounds good to me. I agree about "addresses" over "dwellings," and having GeoJSON exports. I'm happy to write the readme files once the specific formats are determined. I can already picture the enhanced books dataset (with author nationalities and genders) but not the enhanced members with the addresses. What will the columns look like to show addresses at particular times?

@rlskoeser rlskoeser added the 🖇️ duplicate This issue or pull request already exists label Feb 29, 2024
@rlskoeser
Contributor

I'm going to close this as a duplicate, since the work is being tracked on Princeton-CDH/mep-django#791 and Princeton-CDH/mep-django#792
