Get initial version of aggregation working #1

itsjon · 2020-05-04T05:44:19Z

I believe this is working as expected, pending manual checks.

Near-term improvements:

- Improve facility name matching to reduce the need to manually add more aliases, while not trying to be too clever and accidentally mismatching names. See helpers.data.Mapper. A few options:
  - Ignore punctuation when matching
  - Ensure any whitespace between words in names is treated as a single space, so excess spaces, newlines, etc. won't cause matching to miss an otherwise valid match
  - Assume "CC" == "Correctional Center", and similarly for other common acronyms
  - Ignore acronyms if the acronym is simply an acronym of the preceding words ("California Correctional Institution (CCI)" == "California Correctional Institution")
- Consider updating the merge logic. Currently, when merging multiple rows for the same date for the same facility, each numeric column will receive the largest value for that column among all the merged rows, which means columns may not sum as expected if the merged row draws from multiple source rows. See helpers.merging.merge_rows.
- Add County column, drawn either from the master mapping file or the source datasets
- Decide what, if anything, to do with the Inmates Pending column in the covidprisondata.com dataset
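
For discussion, the name-matching options listed above could be combined into a single normalization step applied to both sides before comparison. This is only a sketch: normalize_name and the ABBREVIATIONS table are hypothetical names, not part of helpers.data.Mapper, and the abbreviation entries are illustrative.

```python
import re
import string

# Hypothetical abbreviation table; real entries would need to be curated.
ABBREVIATIONS = {
    "cc": "correctional center",
    "ci": "correctional institution",
}


def normalize_name(name: str) -> str:
    """Return a canonical form of a facility name, for comparison only."""
    # Drop a trailing parenthetical when it is just the initials of the
    # words that precede it, e.g. "... Institution (CCI)".
    match = re.match(r"^(.*?)\s*\(([A-Za-z]+)\)\s*$", name, flags=re.DOTALL)
    if match:
        words, acronym = match.group(1), match.group(2)
        initials = "".join(word[0] for word in words.split())
        if initials.lower() == acronym.lower():
            name = words

    # Delete punctuation, lowercase, and split on any run of whitespace
    # (spaces, tabs, newlines), which collapses it to single spaces on join.
    name = name.translate(str.maketrans("", "", string.punctuation))
    tokens = name.lower().split()

    # Expand known abbreviations one token at a time.
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


a = normalize_name("California Correctional Institution (CCI)")
b = normalize_name("California  Correctional\n Institution")
```

Here a and b compare equal. One trade-off to be aware of: deleting punctuation joins hyphenated words, so whether to delete or replace with a space is a choice that would need checking against the real facility names.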

jessex

Thanks a ton for getting this off the ground, @itsjon! Is the idea here that the user will manually go grab the latest files from a given source, add them to data/{source}/ and then run the output script? If so, let's add some general operating instructions to the readme that basically say, "Get the latest files for a given source, place them in the right location, generate the output, then commit the latest files and the output to this repository to keep it publicly up-to-date."

output/.gitignore

src/aggregate.py

src/constants.py

src/helpers/merging.py

src/helpers/data.py

src/helpers/test_data/name_mappings.csv

jessex · 2020-05-05T05:35:05Z

src/helpers/merging.py

+
+  for column in TEXT_COLUMNS:
+    merged_row[column] = ','.join([row[column] for row, source_id in rows_with_sources
+                                   if row.get(column) and source_id in sources_used])


So this conditional (if source_id in sources_used) depends on sources_used having been populated by the previous for loop, right? I think that's somewhat non-intuitive. Why would it be that the presence of a source in the numeric columns would dictate its presence in the text columns?

itsjon · 2020-05-07

At least for the existing text columns, Source and Notes, it doesn't seem to make sense to include that information from an input row in an output row if the numeric values from that input row weren't ultimately used in the output row. If that doesn't make sense to you, let's file an issue and circle up with C.

jessex · 2020-05-08

Okay, I buy that. Let's proceed as is for now!
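
To make the behavior agreed on in this thread concrete, here is a small standalone sketch of the text-column join from the diff. The wrapper function merge_text_columns and the sample rows are hypothetical; only the join expression and the Source and Notes column names come from the discussion above.

```python
TEXT_COLUMNS = ["Source", "Notes"]


def merge_text_columns(rows_with_sources, sources_used):
    """Join text cells, keeping only rows whose source supplied a numeric value."""
    merged_row = {}
    for column in TEXT_COLUMNS:
        merged_row[column] = ",".join(
            row[column]
            for row, source_id in rows_with_sources
            if row.get(column) and source_id in sources_used
        )
    return merged_row


# Illustrative input: two rows for the same facility and date.
rows_with_sources = [
    ({"Source": "site-a", "Notes": "partial count"}, "a"),
    ({"Source": "site-b", "Notes": ""}, "b"),
]

# Only source "a" contributed a numeric value, so "site-b" is left out.
only_a = merge_text_columns(rows_with_sources, {"a"})
# Both sources contributed, so both Source cells are joined.
both = merge_text_columns(rows_with_sources, {"a", "b"})
```

In this sketch only_a["Source"] contains just the first row's source, while both["Source"] contains the two joined with a comma. Empty Notes cells are skipped in either case.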

jessex · 2020-05-05T05:36:41Z

src/helpers/merging.py

+  if row:
+    sources_used.add(source)
+
+  return get_value(row, column) if row else ''


If we do change get_value to return None instead of an empty string, we should ensure that this particular call still returns an empty string when the function returns None, or have L45 below translate None to an empty string.

itsjon · 2020-05-07

Yeah, any row that doesn't have a value for the column at hand is filtered out before determining the row with the max value, so if get_value is invoked here, it will definitely have some value. Otherwise, if no rows with a value for the column were provided, the empty string is returned.
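
The filter-then-max behavior described in this reply can be sketched as follows. This is an illustration under assumptions, not the code in helpers/merging.py: the get_value body, the function name max_value_for_column, and the column names are stand-ins.

```python
def get_value(row, column):
    """Stand-in for the real helper: parse a numeric cell."""
    return int(row[column])


def max_value_for_column(rows_with_sources, column, sources_used):
    """Return the largest value for a column, or '' if no row has one."""
    # Rows without a value are dropped before the max is computed,
    # so get_value is never called on an empty cell.
    candidates = [
        (row, source)
        for row, source in rows_with_sources
        if row.get(column) not in (None, "")
    ]
    if not candidates:
        return ""
    row, source = max(candidates, key=lambda pair: get_value(pair[0], column))
    sources_used.add(source)  # remember which source supplied the value
    return get_value(row, column)


# Illustrative rows; column names are hypothetical.
rows = [
    ({"Tested": "10", "Positive": "4", "Negative": "6"}, "a"),
    ({"Tested": "12", "Positive": "3", "Negative": "9"}, "b"),
]
sources_used = set()
merged = {
    column: max_value_for_column(rows, column, sources_used)
    for column in ("Tested", "Positive", "Negative")
}
```

With these rows the merged Positive and Negative values come from different sources and add up to 13 while Tested is 12, which is the "columns may not sum as expected" caveat noted in the PR description.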

jessex

Great job, @itsjon! Thanks for the headstart on this.

itsjon · 2020-05-09T19:03:12Z

@jessex sorry I missed your reply. Thank you, and you're welcome!

itsjon added 2 commits May 4, 2020 00:28

Get initial version of aggregation working

03c6034

Add data to repo

1f08905

itsjon requested a review from jessex May 4, 2020 05:46

jessex requested changes May 5, 2020

Address feedback, including more docstrings and details in the README on how to perform the daily run

4269920

itsjon mentioned this pull request May 7, 2020

Use enums rather than raw strings in the input-to-output header mappings #3

Open

itsjon requested a review from jessex May 7, 2020 23:03

jessex approved these changes May 8, 2020

itsjon merged commit 2c92a24 into master May 9, 2020
