Get initial version of aggregation working #1
Thanks a ton for getting this off the ground, @itsjon! Is the idea here that the user will manually go grab the latest files from a given source, add them to `data/{source}/`, and then run the output script? If so, let's add some general operating instructions to the readme that basically say, "Get the latest files for a given source, place them in the right location, generate the output, then commit the latest files and the output to this repository to keep it publicly up-to-date."
    for column in TEXT_COLUMNS:
        merged_row[column] = ','.join([row[column] for row, source_id in rows_with_sources
                                       if row.get(column) and source_id in sources_used])
So this conditional (`if source_id in sources_used`) depends on `sources_used` having been populated by the previous for loop, right? I think that's somewhat non-intuitive. Why would it be that the presence of a source in the numeric columns would dictate its presence in the text columns?
At least for the existing text columns, `Source` and `Notes`, it doesn't seem to make sense to include that information from an input row in an output row if the numeric values from that input row weren't ultimately used in the output row. If that doesn't make sense to you, let's file an issue and circle up with C.
Okay I buy that. Let's proceed as is for now!
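For anyone reading this thread later, the dependency between the two loops can be sketched roughly as below. This is an illustration under assumptions, not the code in this PR: `merge_rows_sketch`, `NUMERIC_COLUMNS`, the `Confirmed` and `Deaths` column names, and the sample rows are all hypothetical. Only `TEXT_COLUMNS`, `rows_with_sources`, `sources_used`, `merged_row`, and the `Source` and `Notes` columns come from the discussion above.

```python
# Hypothetical column sets; the real lists live in the PR's code.
NUMERIC_COLUMNS = ["Confirmed", "Deaths"]
TEXT_COLUMNS = ["Source", "Notes"]


def merge_rows_sketch(rows_with_sources):
    """Merge (row, source_id) pairs into one output row (illustrative only)."""
    merged_row = {}
    sources_used = set()

    # Pass 1: for each numeric column, keep the largest value and
    # remember which source supplied it.
    for column in NUMERIC_COLUMNS:
        candidates = [(row, source_id) for row, source_id in rows_with_sources
                      if row.get(column)]
        if not candidates:
            merged_row[column] = ''
            continue
        row, source_id = max(candidates, key=lambda pair: int(pair[0][column]))
        merged_row[column] = row[column]
        sources_used.add(source_id)

    # Pass 2: text columns only draw on sources recorded in pass 1.
    for column in TEXT_COLUMNS:
        merged_row[column] = ','.join(
            [row[column] for row, source_id in rows_with_sources
             if row.get(column) and source_id in sources_used])
    return merged_row


# Two made-up sources reporting the same facility.
rows = [
    ({"Confirmed": "10", "Deaths": "", "Source": "a.example", "Notes": "n1"}, "A"),
    ({"Confirmed": "4", "Deaths": "", "Source": "b.example", "Notes": "n2"}, "B"),
]
merged = merge_rows_sketch(rows)
```

In this sketch source `B` never wins a numeric column, so its `Source` and `Notes` text is left out of the merged row, which is the behavior agreed on above.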
    if row:
        sources_used.add(source)

    return get_value(row, column) if row else ''
If we do change `get_value` to return None instead of an empty string, we should ensure that this particular call returns an empty string if the function returns None, or have L45 below translate from None to empty string.
Yeah, any row that doesn't have a value for the column at hand is filtered out before determining the row with the max value, so if `get_value` is invoked here, it will definitely have some value. Else, if no rows with a value for the column were provided, the empty string is returned.
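A rough sketch of that filter-then-max flow, in case `get_value` does switch to returning None. Everything here is an assumption for illustration: this `get_value` body, the `max_value_for_column` helper, and the integer comparison are hypothetical and may not match the PR's actual code.

```python
def get_value(row, column):
    """Hypothetical variant that returns None (not '') for a missing value."""
    value = row.get(column)
    return value if value not in (None, '') else None


def max_value_for_column(rows, column):
    # Drop rows with no value before comparing, so max() never sees None.
    rows_with_value = [row for row in rows if get_value(row, column) is not None]
    row = max(rows_with_value, key=lambda r: int(get_value(r, column)), default=None)
    # `row` is None only when no input row had a value for this column,
    # so get_value() is never called on a row that lacks one.
    return get_value(row, column) if row else ''


# Made-up rows: one source has no value for the column.
largest = max_value_for_column([{"x": "3"}, {"x": ""}, {"x": "12"}], "x")
nothing = max_value_for_column([{"x": ""}, {}], "x")
```

Because the filtering happens first, the final call still returns a string either way: a real value when some row supplied one, or the empty string when none did.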
… on how to perform the daily run
Great job, @itsjon! Thanks for the headstart on this.
@jessex sorry I missed your reply. Thank you, and you're welcome!
I believe this is working as expected, pending manual checks.
Near-term improvements:
- `helpers.data.Mapper`. A few options:
- `helpers.merging.merge_rows`.
- `County` column, drawn either from the master mapping file or the source datasets
- `Inmates Pending` column in the covidprisondata.com dataset