Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get initial version of aggregation working #1

Merged
merged 3 commits into from May 9, 2020
Merged

Get initial version of aggregation working #1

merged 3 commits into from May 9, 2020

Conversation

itsjon
Copy link
Contributor

@itsjon itsjon commented May 4, 2020

I believe this is working as expected, pending manual checks.

Near-term improvements:

  • Improve facility name matching to reduce the need to manually add more aliases, while not trying to be too clever and accidentally mismatching names. See helpers.data.Mapper. A few options:
    • Ignore punctuation when matching
    • Ensure any whitespace between words in names is treated as a single space, so excess spaces, newlines, etc. won't cause matching to miss an otherwise valid match
    • Assume "CC" == "Correctional Center", and similarly for other common acronyms
    • Ignore acronyms if the acronym is simply an acronym of the preceding words ("California Correctional Institution (CCI)" == "California Correctional Institution")
  • Consider updating the merge logic. Currently, when merging multiple rows for the same date for the same facility, each numeric column will receive the largest value for that column among all the merged rows, which means columns may not sum as expected if the merged row draws from multiple source rows. See helpers.merging.merge_rows.
  • Add County column, drawn either from the master mapping file or the source datasets
  • Decide what, if anything, to do with the Inmates Pending column in the covidprisondata.com dataset

@itsjon itsjon requested a review from jessex May 4, 2020 05:46
Copy link
Member

@jessex jessex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a ton for getting this off the ground, @itsjon! Is the idea here that the user will manually go grab the latest files from a given source, add them to data/{source}/ and then run the output script? If so, let's add some general operating instructions to the readme that basically say, "Get the latest files for a given source, place them in the right location, generate the output, then commit the latest files and the output to this repository to keep it publicly up-to-date."

output/.gitignore Outdated Show resolved Hide resolved
src/aggregate.py Show resolved Hide resolved
src/constants.py Outdated Show resolved Hide resolved
src/constants.py Outdated Show resolved Hide resolved
src/constants.py Show resolved Hide resolved
src/helpers/merging.py Show resolved Hide resolved
src/helpers/data.py Outdated Show resolved Hide resolved
src/helpers/test_data/name_mappings.csv Outdated Show resolved Hide resolved

for column in TEXT_COLUMNS:
merged_row[column] = ','.join([row[column] for row, source_id in rows_with_sources
if row.get(column) and source_id in sources_used])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this conditional (if source_id in sources_used) depends on sources_used having been populated by the previous for loop, right? I think that's somewhat non-intuitive. Why would it be that the presence of a source in the numeric columns would dictate its presence in the text columns?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least for the existing text columns, Source and Notes, it doesn't seem to make sense to include that information from an input row in an output row if the numeric values from that input row weren't ultimately used in the output row. If that doesn't make sense to you, let's file an issue and circle up with C.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I buy that. Let's proceed as is for now!

if row:
sources_used.add(source)

return get_value(row, column) if row else ''
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do change get_value to return None instead of an empty string, we should ensure that this particular call returns an empty string if the function returns None or have L45 below translate from None to empty string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, any row that doesn't have a value for the column at hand is filtered out before determining the row with the max value, so if get_value is invoked here, it will definitely have some value. Else, if no rows with a value for the column were provided, the empty string is returned.

Copy link
Member

@jessex jessex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job, @itsjon! Thanks for the headstart on this.

@itsjon
Copy link
Contributor Author

itsjon commented May 9, 2020

@jessex sorry I missed your reply. Thank you, and you're welcome!

@itsjon itsjon merged commit 2c92a24 into master May 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants