
Looking for Volunteers! We need a data aggregator #160

Closed · 1 task
yo-mike opened this issue Jun 3, 2020 · 56 comments

Comments

@yo-mike (Collaborator) commented Jun 3, 2020

Task: Aggregate 2020PB items into a single CSV or Excel Spreadsheet

  • Go to the link below and populate each row with the entries from each 2020PB Pull Request

https://docs.google.com/spreadsheets/d/1zEChPuDj0eTeB9cOXrJNHK3rW-aW1zohUVMm6LrN1IU/edit?usp=sharing

Perhaps multiple contributors can divide the work easily.

Assigned to: ???

If you can help, let one of the contributors know and we'll get you access to the Google Sheet.

Ref Project Board:
https://github.com/949mac/police-brutality/projects/1

We're petitioning for this effort to become part of this repo, although we have not received a response from the original repo owner yet!

@ChelseaHannan

I would love to help. I'd be happy to create json files as well. It would be cool if someone could pull badge numbers from the videos with the end goal of actual repercussions for these officers.

@yo-mike (Collaborator, Author) commented Jun 3, 2020

That would be great! I would prefer CSV/Excel for ease of import into MySQL/Postgres.

Are you able to copy the template I made? You can begin using that. We'll make an API to spit out the JSON.

To that point, any additional fields added would be terrific, as long as they comply with the original guidelines.

Edit: I saw you requested access. I just added you! Feel free to use the base template. Thank you very much!

@ChelseaHannan commented Jun 3, 2020 via email

@ChelseaHannan commented Jun 3, 2020 via email

@yo-mike (Collaborator, Author) commented Jun 3, 2020

@ChelseaHannan,

No problem! It's just a way to audit the list against what has been approved here. Next to Issues, look for Pull Requests:

[screenshot]

Then click on Closed
[screenshot]

Then look at the PR# under the title
[screenshot]

@ChelseaHannan commented Jun 3, 2020 via email

@ChelseaHannan commented Jun 3, 2020 via email

@adzialocha (Contributor)

I'm just rewriting the Python script from #110 by @ubershmekel, which is a helpful tool for extracting the data. I can copy the results into the Google Spreadsheet in a few minutes.
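
For illustration, here's a minimal sketch of the kind of parser being described. It assumes each report under reports/ is a Markdown file where every incident starts with a "### " heading and links appear as bare URLs; the real report format and the script in #110 may differ.

```python
# Illustrative only -- not the actual script from #110.
import csv
import glob
import re

URL_RE = re.compile(r"https?://\S+")

def parse_report(path):
    """Yield (title, links) for each '### ' section in one Markdown report (assumed format)."""
    title, links = None, []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.startswith("### "):
                if title is not None:
                    yield title, links
                title, links = line[4:].strip(), []
            else:
                links.extend(URL_RE.findall(line))
    if title is not None:
        yield title, links

with open("incidents.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["source_file", "title", "links"])
    for path in sorted(glob.glob("reports/*.md")):
        for title, links in parse_report(path):
            writer.writerow([path, title, " ".join(links)])
```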

@yo-mike (Collaborator, Author) commented Jun 3, 2020

@ChelseaHannan, the order doesn't matter. Thank you.

@adzialocha, that would be terrific. Then perhaps @ChelseaHannan can review the list and add any extra details, such as badge numbers, as she sees them?

@yo-mike (Collaborator, Author) commented Jun 3, 2020

Btw, @adzialocha, Chelsea has started -- so let us know as soon as you can how well the script works.

@ChelseaHannan commented Jun 3, 2020 via email

@yo-mike (Collaborator, Author) commented Jun 3, 2020

@ChelseaHannan, it sounds like a scraper that will quickly iterate through each PR and scrape the content into a spreadsheet. Like a robot doing the first pass.

@ChelseaHannan commented Jun 3, 2020 via email

@adzialocha (Contributor) commented Jun 3, 2020

This is my update on that script: ubershmekel#1

Here are the example output files:

I'm not sure of the smartest way to integrate this now; maybe in your fork?

@yo-mike (Collaborator, Author) commented Jun 3, 2020

Awesome. Thank you! @ChelseaHannan, want to import the CSV into the sheet, review it, and add supplemental data as needed?

@adzialocha (Contributor)

Not all links were parsed correctly, but I've updated the script (and sendspace files).

@mnlmaier commented Jun 3, 2020

One interesting thing to consider is adding geolocation (lat/long) to the cities. @949mac @adzialocha, what do you think? It would be better to do that directly on the backend instead of parsing on the frontend.

@yo-mike (Collaborator, Author) commented Jun 3, 2020

@mnlmaier
I agree. I can add a Job that will automatically add it based on address.

So far, it looks like we have city/state. But if there are specific addresses, we can get those lat/lons too.
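
As a rough sketch of what such a job could look like, here's one way to geocode city/state pairs. The third-party geopy package and its Nominatim geocoder are assumptions (any geocoding service would do), and the example rows are placeholders.

```python
# Rough sketch only; geopy/Nominatim is an assumption, not the job actually used.
import time

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="police-brutality-data-aggregator")

def geocode(city, state):
    """Return (lat, lon) for a city/state pair, or (None, None) if nothing is found."""
    location = geolocator.geocode(f"{city}, {state}, USA")
    if location is None:
        return None, None
    return location.latitude, location.longitude

# Example rows; in practice these would come from the spreadsheet's city/state columns.
for city, state in [("Minneapolis", "Minnesota"), ("Columbus", "Ohio")]:
    print(city, state, geocode(city, state))
    time.sleep(1)  # respect Nominatim's rate limit (roughly 1 request/second)
```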

@mnlmaier commented Jun 3, 2020

@949mac I'm working around that with a frontend package right now, until it's included in the response.

@yo-mike (Collaborator, Author) commented Jun 3, 2020

@adzialocha Any way to add the PR# to make the list easier to audit when updates are made?

@adzialocha (Contributor) commented Jun 3, 2020

Right now it's scraping through the Markdown files in reports on the master branch; no PR checking is involved. I guess this could be implemented with the help of the GitHub API (https://developer.github.com/v3/pulls/#list-pull-requests-files).

What is the idea behind using the PRs?
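
For reference, a minimal sketch of how PR numbers could be attached using that GitHub API endpoint. It assumes unauthenticated access, which is rate-limited to 60 requests/hour, so a token would be needed for a full pass over the repo's history.

```python
# Sketch of mapping closed PRs to the files they touched via the GitHub REST API.
import requests

REPO = "2020PB/police-brutality"
API = f"https://api.github.com/repos/{REPO}"

def closed_prs_with_files(per_page=5):
    """Yield (pr_number, changed_file_paths) for recently closed pull requests."""
    prs = requests.get(f"{API}/pulls", params={"state": "closed", "per_page": per_page})
    prs.raise_for_status()
    for pr in prs.json():
        files = requests.get(f"{API}/pulls/{pr['number']}/files")
        files.raise_for_status()
        yield pr["number"], [f["filename"] for f in files.json()]

for number, paths in closed_prs_with_files():
    print(f"PR #{number}: {paths}")
```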

@ubershmekel (Collaborator) commented Jun 3, 2020

Great work. Let's get this list auto-generating in a github action asap. I've contacted @2020PB and hope to get an action in this repo crunching the data. My plan was to commit the generated files into a branch on this repo. That would make the files available for others to poll through the github api.

Does anybody have a different idea or request for where to put the data after it's parsed into files?

Also, note this issue is similar to #141 and perhaps even a duplicate.

@mnlmaier commented Jun 3, 2020

how quickly do you guys think we can get lat/long coordinates in there? i've set up a map which will place markers for an array of locations, just waiting for a response to be thrown in there 😬

@yo-mike (Collaborator, Author) commented Jun 3, 2020

@adzialocha -- The PR #s will be useful for matching entries to IDs. The db has internal IDs, and this is useful when performing bulk updates. For example: @ChelseaHannan was going to add supplemental information.

@yo-mike (Collaborator, Author) commented Jun 3, 2020

@mnlmaier - Lat/Long is in!

[screenshot]

@elctrc commented Jun 3, 2020

Is the Google Sheets doc above going to be the ultimate landing spot for the data? If so, depending on how it's shared, we could hook it up to a Colab notebook in the short term to begin analysis if that seems valuable.

@yo-mike (Collaborator, Author) commented Jun 3, 2020

@elctrc - That's a good question. We have an initial data import done. But for consistency, it would be nice to determine the formal structure of the data.

I'd say we're open to ideas at this point!

@elctrc commented Jun 3, 2020

Ok so it sounds like you've got a working script to import the data - is it doing any cleaning as well? And then the plan is to have @ChelseaHannan do a pass of adding in more metadata?

@mnlmaier commented Jun 3, 2020

https://frontend-1750f.web.app

thanks for the great data work guys!

@yo-mike (Collaborator, Author) commented Jun 3, 2020

Yes -- so far the collaborators here are not the original repo owners, so we're working somewhat independently in order to make this data more publicly accessible.

With that being said, there is a lot of value in streamlining the submissions. I'm not sure whether the repo owners have given this much further thought.

@yo-mike (Collaborator, Author) commented Jun 3, 2020

https://frontend-1750f.web.app

thanks for the great data work guys!

That's beautiful! It brings tears to my eyes. Way to go everyone.

@elctrc commented Jun 3, 2020

Wow. Nice work!

@elctrc commented Jun 3, 2020

Is there an endpoint that can be used to pull down the raw data?

@ChelseaHannan

Looks great! Forgive me, but I'm not very experienced with working with data; I don't know how to use the script or add metadata at this point.

@mnlmaier commented Jun 3, 2020

great work guys! now we can focus on design and user interactions.

@ChelseaHannan https://846policebrutality.b-cdn.net/api/incidents
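
As a quick illustration, one way to pull from that endpoint with requests. Whether the payload is a bare list or wrapped in a `data` key is an assumption, as are any field names; check the actual response before relying on them.

```python
# Quick look at the endpoint; the payload shape and field names are assumptions.
import requests

resp = requests.get("https://846policebrutality.b-cdn.net/api/incidents")
resp.raise_for_status()
payload = resp.json()
# Some APIs wrap the list in a "data" key; handle both shapes.
incidents = payload.get("data", payload) if isinstance(payload, dict) else payload

print(f"{len(incidents)} incidents fetched")
print(incidents[0])  # inspect one record to see which fields are available
```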

@yo-mike (Collaborator, Author) commented Jun 3, 2020

Looks great! Forgive me, but I'm not very experienced with working with data; I don't know how to use the script or add metadata at this point.

@ChelseaHannan , don't worry about the script. Let me update the sheet with the data from the script. Then in terms of improvement and metadata, you are welcome to update the sheet with any complementary and supplemental information.

@mnlmaier - Is there anything specific you would like Chelsea to look for?

@yo-mike (Collaborator, Author) commented Jun 3, 2020

@ChelseaHannan - The data is in at https://docs.google.com/spreadsheets/d/1zEChPuDj0eTeB9cOXrJNHK3rW-aW1zohUVMm6LrN1IU/edit#gid=0

This was created using an automated script, so it's possible that it's missing information. Anything you can add or clean up is greatly appreciated!

@mnlmaier commented Jun 3, 2020

It would make sense to use ISO or UNIX timestamps, maybe? That would be something to parse on the backend.

There are still some weird things in there; some incidents are dated back to 1900 😬
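
A small sketch of what that normalization could look like, flagging the 1900-01-01 placeholders and converting recognizable dates to ISO 8601. The input formats listed are assumptions about what the scraper produced.

```python
# Sketch: normalize assorted date strings to ISO 8601 and drop 1900-01-01 placeholders.
from datetime import datetime

PLACEHOLDER = "1900-01-01"
INPUT_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y"]  # assumed scraper output formats

def to_iso(raw):
    """Return an ISO 8601 date string, or None for placeholder/unparseable values."""
    for fmt in INPUT_FORMATS:
        try:
            iso = datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
        return None if iso == PLACEHOLDER else iso
    return None

for raw in ["2020-06-01", "06/03/2020", "1900-01-01", "unknown"]:
    print(raw, "->", to_iso(raw))
```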

@yo-mike (Collaborator, Author) commented Jun 3, 2020

The incidents with 1900-01-01 didn't have a valid date.

As far as timestamps go, we could do that -- however, I'm not sure there is a time component on the Incident yet.

It would be terrific if the reddit army could try to narrow some of this stuff down to improve the data.

@ChelseaHannan commented Jun 4, 2020 via email

@yo-mike (Collaborator, Author) commented Jun 4, 2020

Thank you, @ChelseaHannan!

@ubershmekel (Collaborator)

The data build is live. Every commit to master will regenerate these files in the branch data_build.
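
For anyone wanting to consume the build, a minimal sketch of polling one of the generated files, assuming all-locations.csv lives on the data_build branch (that file is referenced later in this thread; other generated filenames may exist).

```python
# Sketch of polling the generated CSV straight from the data_build branch.
import csv
import io

import requests

RAW_URL = (
    "https://raw.githubusercontent.com/2020PB/police-brutality/"
    "data_build/all-locations.csv"
)

resp = requests.get(RAW_URL)
resp.raise_for_status()
rows = list(csv.DictReader(io.StringIO(resp.text)))
print(f"{len(rows)} rows in the generated CSV")
```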

@elimisteve

Here are several examples of recent police brutality from award-winning journalist Barrett Brown: https://medium.com/@barrettbrown/need-a-reason-to-smash-a-cop-9613f739149e

@ChelseaHannan commented Jun 4, 2020

It looks like the descriptions didn't populate. There were dates inside the description column (G), so I erased all the values in that column and left it blank for now. The title column has a pretty decent description of each event.

I also cleaned up the links and made sure they were all in the same field. I went through a few of them to make sure they corresponded to the correct incidents.

I filled in as many of the blank cities as I could, confirming them by watching the linked videos, and fixed a couple that were labeled with the wrong city.

I will continue to go through and make sure everything corresponds and is accurate.

@ubershmekel (Collaborator)

@949mac @ChelseaHannan I see your spreadsheet has 118 entries while https://github.com/2020PB/police-brutality/blob/data_build/all-locations.csv has 196. Did you folks happen to check what's in one versus the other? Is there data that's duplicated in the GitHub CSV, or missing from it?
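
A rough sketch of how such a duplicate/missing check could be done, assuming the Google Sheet has been exported as sheet.csv and that both files have a column of incident links; the file and column names here are guesses.

```python
# Sketch of a duplicate/missing check; file and column names are assumptions.
import csv

def link_set(path, column):
    """Collect the non-empty values of one column from a CSV file."""
    with open(path, newline="", encoding="utf-8") as fh:
        return {row[column].strip() for row in csv.DictReader(fh) if row.get(column)}

sheet_links = link_set("sheet.csv", "Links")          # assumed export of the Google Sheet
build_links = link_set("all-locations.csv", "links")  # downloaded from the data_build branch

print("only in the sheet:", len(sheet_links - build_links))
print("only in the data_build csv:", len(build_links - sheet_links))
```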

@ChelseaHannan

Why are we building another csv here if they already have one?

@elctrc commented Jun 4, 2020

(editing) - I think the question posed by @ChelseaHannan is important and I didn't want it to get buried. I'm assuming the CSV referenced above is the raw translated output of the JSON from the API endpoint (so prior to your editing, Chelsea).

Secondly, is the purpose to take this data after it is manually cleaned/edited and populate more data on the map? Or is this for a different front-end application? I ask because this is not necessarily sustainable at scale; as you begin to receive more and more data, the task of manually updating will become impossible. Is there a process in mind for joining the revised and edited set of data with the continual flow of new information?

@yo-mike (Collaborator, Author) commented Jun 4, 2020

This original request was made before the data feed described in the readme was ready.

At this point, we can decide whether it makes sense to gather volunteers to look at the content for anything else that wasn't automatically parsed.

@ChelseaHannan was willing to look for supplemental information like police badge numbers, license plates, etc.

@elctrc in terms of scale, I get it. Let’s take a step back and see what’s needed at this point.

@mnlmaier - do you need anything else for front end content?

Otherwise, we can close this until a need arises.

Thoughts?

@elctrc commented Jun 4, 2020

For sure. I am more than happy to help on the data analysis side if this is ever something desired (or if you decide you want there to be a way for users to parse the data). But that may not relate to this issue and I understand! :)

@mnlmaier commented Jun 4, 2020

That depends on which features we decide on in the future. As a first step, we should make sure that all the content is complete and that nothing weird is happening (some locations are off; you can see them when zooming out on the map).

@ChelseaHannan

@mnlmaier I really like what you've done with the map. I think you should consider creating your own separate repository for your project. I don't believe the csv in the main repo has any location coordinates. Hopefully someone can keep that updated for you, because being able to see the map is very helpful for people.

@idiosyncronaut commented Jun 5, 2020

I would like to propose a Twitter bot that, when mentioned in a conversation, submits the tweet and its assets to a data store, as a way to leverage Twitter for easily submitting abuse data. I haven't checked out the stack, but it could be a simple AWS Lambda or something.

Filtering that and ensuring it turns into meaningful data would be a later step.

But right now, I think the priority should be getting as much of the data stored as possible, while social media is at its most active.

Get the data first.

@nickatnight commented Jun 5, 2020

The Python API is finally live, @949mac @mnlmaier, with full CI/CD

repo: https://github.com/nickatnight/policebrutality.io
api: http://api.policebrutality.io/v1/videos

@mnlmaier I could probably plug in your front end if we dockerize it

Edit:
link

@ubershmekel (Collaborator)

@idiosyncronaut please open a new issue for new ideas and requests. Also, this dataset is highly edited for objectivity and evidence. It's not an attempt to make a dump of everything. If you see something you'd like to contribute, you should make a PR to the markdown files on master.

@ChelseaHannan I'm sorry if there was a misunderstanding here. This github repo is managed in the markdown files on the master branch. It's the best way we can collaborate with 50+ contributors and verify every change as a pull request. It should be easy to make downstream datasets now, though I would recommend using the json file over the csv file.

With regards to having coordinates in our dataset - feel free to open up a separate issue if you think that's valuable.

I'm closing this mega-issue. Please feel free to comment if you think we should re-open.

@mnlmaier commented Jun 5, 2020

One last comment, sorry about that:

@nickatnight @949mac there's a Discord server; I just asked, and I'm allowed to send you an invite :) Any way to get in touch without posting the link publicly?

@ubershmekel (Collaborator)

@mnlmaier I would recommend asking them to DM you on reddit then you can check post history.
