
Looking for Volunteers! We need a data aggregator #160

Closed · 1 task
yo-mike opened this issue Jun 3, 2020 · 56 comments

Comments

@yo-mike (Collaborator) commented Jun 3, 2020

Task: Aggregate 2020PB items into a single CSV or Excel Spreadsheet

  • Go to the link below and populate each row with the entries from each 2020PB Pull Request

https://docs.google.com/spreadsheets/d/1zEChPuDj0eTeB9cOXrJNHK3rW-aW1zohUVMm6LrN1IU/edit?usp=sharing

Perhaps multiple contributors can divide the work easily.

Assigned to: ???

If you can help, let one of the contributors know and we'll get you access to the Google Sheet.

Ref Project Board:
https://github.com/949mac/police-brutality/projects/1

We're petitioning for this effort to become part of this repo, although we have not received a response from the original repo owner yet!

@ChelseaHannan

I would love to help. I'd be happy to create json files as well. It would be cool if someone could pull badge numbers from the videos with the end goal of actual repercussions for these officers.

@yo-mike (Collaborator, Author) commented Jun 3, 2020

That would be great! I would prefer CSV/Excel for ease of import into MySQL/Postgres.

Are you able to copy the template I made? You can begin using that. We'll make an API to spit out the JSON.

To that point, any additional fields added would be terrific, as long as they comply with the original guidelines.

Edit: I saw you requested access. I just added you! Feel free to use the base template. Thank you very much!

@ChelseaHannan commented Jun 3, 2020 via email

@ChelseaHannan commented Jun 3, 2020 via email

@yo-mike (Collaborator, Author) commented Jun 3, 2020

@ChelseaHannan,

No problem! It's just a way to audit the list against what has been approved here. Next to Issues, look for Pull Requests:

[screenshot]

Then click on Closed
[screenshot]

Then look at the PR# under the title
[screenshot]

@ChelseaHannan commented Jun 3, 2020 via email

@ChelseaHannan commented Jun 3, 2020 via email

@adzialocha (Contributor)

I'm just rewriting the Python script from #110 by @ubershmekel, which is a helpful tool for extracting the data. I can copy the results into the Google Spreadsheet in a few minutes.
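
For illustration, here's a minimal sketch of the kind of parser being described. It assumes each report under reports/ is a Markdown file where every incident starts with a "### " heading and links appear as bare URLs; the real report format and the script in #110 may differ.

```python
# Illustrative only -- not the actual script from #110.
import csv
import glob
import re

URL_RE = re.compile(r"https?://\S+")

def parse_report(path):
    """Yield (title, links) for each '### ' section in one Markdown report (assumed format)."""
    title, links = None, []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.startswith("### "):
                if title is not None:
                    yield title, links
                title, links = line[4:].strip(), []
            else:
                links.extend(URL_RE.findall(line))
    if title is not None:
        yield title, links

with open("incidents.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["source_file", "title", "links"])
    for path in sorted(glob.glob("reports/*.md")):
        for title, links in parse_report(path):
            writer.writerow([path, title, " ".join(links)])
```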

@yo-mike (Collaborator, Author) commented Jun 3, 2020

@ChelseaHannan, the order doesn't matter. Thank you.

@adzialocha, that would be terrific. Then perhaps @ChelseaHannan can review the list and add any extra details, such as badge numbers, as she sees them?

@yo-mike (Collaborator, Author) commented Jun 3, 2020

Btw, @adzialocha, Chelsea has started -- so let us know as soon as you can how well the script works.

@ChelseaHannan commented Jun 3, 2020 via email

@yo-mike (Collaborator, Author) commented Jun 3, 2020

@ChelseaHannan, it sounds like a scraper that will quickly iterate through each PR and scrape the content into a spreadsheet. Like a robot doing the first pass.

@ChelseaHannan commented Jun 3, 2020 via email

@adzialocha (Contributor) commented Jun 3, 2020

This is my update on that script: ubershmekel#1

Here are the example output files:

I'm not sure of the smartest way to integrate this now; maybe in your fork?

@yo-mike (Collaborator, Author) commented Jun 3, 2020

Awesome. Thank you! @ChelseaHannan, want to import the CSV into the sheet, review it, and add supplemental data as needed?

@adzialocha (Contributor)

Not all links were parsed correctly, but I've updated the script (and sendspace files).

@mnlmaier commented Jun 3, 2020

One interesting thing to consider is adding geolocation (lat/long) to the cities. @949mac @adzialocha, what do you think? It would be better to do that directly on the backend instead of parsing on the frontend.

@yo-mike (Collaborator, Author) commented Jun 3, 2020

@mnlmaier
I agree. I can add a Job that will automatically add it based on address.

So far, it looks like we have city/state. But if there are specific addresses, we can get those lat/lons too.
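
As a rough sketch of what such a job could look like, here's one way to geocode city/state pairs. The third-party geopy package and its Nominatim geocoder are assumptions (any geocoding service would do), and the example rows are placeholders.

```python
# Rough sketch only; geopy/Nominatim is an assumption, not the job actually used.
import time

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="police-brutality-data-aggregator")

def geocode(city, state):
    """Return (lat, lon) for a city/state pair, or (None, None) if nothing is found."""
    location = geolocator.geocode(f"{city}, {state}, USA")
    if location is None:
        return None, None
    return location.latitude, location.longitude

# Example rows; in practice these would come from the spreadsheet's city/state columns.
for city, state in [("Minneapolis", "Minnesota"), ("Columbus", "Ohio")]:
    print(city, state, geocode(city, state))
    time.sleep(1)  # respect Nominatim's rate limit (roughly 1 request/second)
```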

@mnlmaier commented Jun 3, 2020

@949mac I'm working around that with a frontend package right now, until it's included in the response.

@yo-mike (Collaborator, Author) commented Jun 3, 2020

@adzialocha Any way to add the PR# to make the list easier to audit when updates are made?

@adzialocha (Contributor) commented Jun 3, 2020

Right now it's scraping through the Markdown files in reports on the master branch; no PR checking is involved. I guess this could be implemented with the help of the GitHub API (https://developer.github.com/v3/pulls/#list-pull-requests-files).

What is the idea behind using the PRs?
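
For reference, a minimal sketch of how PR numbers could be attached using that GitHub API endpoint. It assumes unauthenticated access, which is rate-limited to 60 requests/hour, so a token would be needed for a full pass over the repo's history.

```python
# Sketch of mapping closed PRs to the files they touched via the GitHub REST API.
import requests

REPO = "2020PB/police-brutality"
API = f"https://api.github.com/repos/{REPO}"

def closed_prs_with_files(per_page=5):
    """Yield (pr_number, changed_file_paths) for recently closed pull requests."""
    prs = requests.get(f"{API}/pulls", params={"state": "closed", "per_page": per_page})
    prs.raise_for_status()
    for pr in prs.json():
        files = requests.get(f"{API}/pulls/{pr['number']}/files")
        files.raise_for_status()
        yield pr["number"], [f["filename"] for f in files.json()]

for number, paths in closed_prs_with_files():
    print(f"PR #{number}: {paths}")
```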

@ubershmekel (Collaborator) commented Jun 3, 2020

Great work. Let's get this list auto-generating in a github action asap. I've contacted @2020PB and hope to get an action in this repo crunching the data. My plan was to commit the generated files into a branch on this repo. That would make the files available for others to poll through the github api.

Does anybody have a different idea or request for where to put the data after it's parsed into files?

Also, note this issue is similar to #141 and perhaps even a duplicate.

@mnlmaier commented Jun 3, 2020

how quickly do you guys think we can get lat/long coordinates in there? i've set up a map which will place markers for an array of locations, just waiting for a response to be thrown in there 😬

@yo-mike (Collaborator, Author) commented Jun 3, 2020

@adzialocha -- The PR #s will be useful for matching entries to IDs. The db has internal IDs, and this is useful when performing bulk updates. For example: @ChelseaHannan was going to add supplemental information.

@yo-mike (Collaborator, Author) commented Jun 3, 2020

@mnlmaier - Lat/Long is in!

[screenshot]

@elctrc commented Jun 3, 2020

Is the Google Sheets doc above going to be the ultimate landing spot for the data? If so, depending on how it's shared, we could hook it up to a Colab notebook in the short term to begin analysis if that seems valuable.

@yo-mike (Collaborator, Author) commented Jun 3, 2020

@elctrc - That's a good question. We have an initial data import done. But for consistency, it would be nice to determine the formal structure of the data.

I'd say we're open to ideas at this point!

@elctrc commented Jun 3, 2020

Ok so it sounds like you've got a working script to import the data - is it doing any cleaning as well? And then the plan is to have @ChelseaHannan do a pass of adding in more metadata?

@mnlmaier commented Jun 3, 2020

https://frontend-1750f.web.app

thanks for the great data work guys!

@yo-mike (Collaborator, Author) commented Jun 3, 2020

Yes -- so far the collaborators here are not the original repo owners, so we're working somewhat independently in order to make this data more publicly accessible.

With that being said, there is a lot of value in streamlining the submissions. I'm not sure whether the repo owners have given this much further thought.

@yo-mike (Collaborator, Author) commented Jun 3, 2020

https://frontend-1750f.web.app

thanks for the great data work guys!

That's beautiful! It brings tears to my eyes. Way to go everyone.

@elctrc commented Jun 3, 2020

Wow. Nice work!

@elctrc commented Jun 3, 2020

Is there an endpoint that can be used to pull down the raw data?

@ChelseaHannan

Looks great! Forgive me, but I'm not very experienced with working with data; I don't know how to use the script or add metadata at this point.

@mnlmaier commented Jun 3, 2020

great work guys! now we can focus on design and user interactions.

@ChelseaHannan https://846policebrutality.b-cdn.net/api/incidents
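
As a quick illustration, one way to pull from that endpoint with requests. Whether the payload is a bare list or wrapped in a `data` key is an assumption, as are any field names; check the actual response before relying on them.

```python
# Quick look at the endpoint; the payload shape and field names are assumptions.
import requests

resp = requests.get("https://846policebrutality.b-cdn.net/api/incidents")
resp.raise_for_status()
payload = resp.json()
# Some APIs wrap the list in a "data" key; handle both shapes.
incidents = payload.get("data", payload) if isinstance(payload, dict) else payload

print(f"{len(incidents)} incidents fetched")
print(incidents[0])  # inspect one record to see which fields are available
```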

@yo-mike (Collaborator, Author) commented Jun 3, 2020

Looks great! Forgive me, but I'm not very experienced with working with data; I don't know how to use the script or add metadata at this point.

@ChelseaHannan , don't worry about the script. Let me update the sheet with the data from the script. Then in terms of improvement and metadata, you are welcome to update the sheet with any complementary and supplemental information.

@mnlmaier - Is there anything specific you would like Chelsea to look for?

@yo-mike (Collaborator, Author) commented Jun 3, 2020

@ChelseaHannan - The data is in at https://docs.google.com/spreadsheets/d/1zEChPuDj0eTeB9cOXrJNHK3rW-aW1zohUVMm6LrN1IU/edit#gid=0

This was created using an automated script, so it's possible that it's missing information. Anything you can add or clean up is greatly appreciated!

@mnlmaier commented Jun 3, 2020

It would make sense to use ISO or UNIX timestamps, maybe? That would be something to parse on the backend.

There are still some weird things in there; some incidents are dated back to 1900 😬
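
A small sketch of what that normalization could look like, flagging the 1900-01-01 placeholders and converting recognizable dates to ISO 8601. The input formats listed are assumptions about what the scraper produced.

```python
# Sketch: normalize assorted date strings to ISO 8601 and drop 1900-01-01 placeholders.
from datetime import datetime

PLACEHOLDER = "1900-01-01"
INPUT_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y"]  # assumed scraper output formats

def to_iso(raw):
    """Return an ISO 8601 date string, or None for placeholder/unparseable values."""
    for fmt in INPUT_FORMATS:
        try:
            iso = datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
        return None if iso == PLACEHOLDER else iso
    return None

for raw in ["2020-06-01", "06/03/2020", "1900-01-01", "unknown"]:
    print(raw, "->", to_iso(raw))
```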

@yo-mike (Collaborator, Author) commented Jun 3, 2020

The incidents with 1900-01-01 didn't have a valid date.

As far as timestamps go, we could do that -- however, I'm not sure there is a time component on the Incident yet.

It would be terrific if the reddit army could try to narrow some of this stuff down to improve the data.

@ChelseaHannan commented Jun 4, 2020 via email

@yo-mike (Collaborator, Author) commented Jun 4, 2020

Thank you, @ChelseaHannan!

@ubershmekel (Collaborator)

The data build is live. Every commit to master will regenerate these files in the branch data_build.
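
For anyone wanting to consume the build, a minimal sketch of polling one of the generated files, assuming all-locations.csv lives on the data_build branch (that file is referenced later in this thread; other generated filenames may exist).

```python
# Sketch of polling the generated CSV straight from the data_build branch.
import csv
import io

import requests

RAW_URL = (
    "https://raw.githubusercontent.com/2020PB/police-brutality/"
    "data_build/all-locations.csv"
)

resp = requests.get(RAW_URL)
resp.raise_for_status()
rows = list(csv.DictReader(io.StringIO(resp.text)))
print(f"{len(rows)} rows in the generated CSV")
```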

@elimisteve

Here are several examples of recent police brutality from award-winning journalist Barrett Brown: https://medium.com/@barrettbrown/need-a-reason-to-smash-a-cop-9613f739149e

@ChelseaHannan commented Jun 4, 2020

It looks like the descriptions didn't populate. There were dates inside the description column (G), so I erased all the values in that column and left it blank for now. The title column has a pretty decent description of each event.

I also cleaned up the links and made sure they were all in the same field. I went through a few of them to make sure they corresponded to the correct incidents.

I filled in as many of the blank cities as I could, confirming them by watching the linked videos, and fixed a couple that were labeled with the wrong city.

I will continue to go through and make sure everything corresponds and is accurate.

@ubershmekel (Collaborator)

@949mac @ChelseaHannan I see your spreadsheet has 118 entries while https://github.com/2020PB/police-brutality/blob/data_build/all-locations.csv has 196. Did you folks happen to check what's in one versus the other? Is there data that's duplicated in the GitHub CSV, or missing from it?
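
A rough sketch of how such a duplicate/missing check could be done, assuming the Google Sheet has been exported as sheet.csv and that both files have a column of incident links; the file and column names here are guesses.

```python
# Sketch of a duplicate/missing check; file and column names are assumptions.
import csv

def link_set(path, column):
    """Collect the non-empty values of one column from a CSV file."""
    with open(path, newline="", encoding="utf-8") as fh:
        return {row[column].strip() for row in csv.DictReader(fh) if row.get(column)}

sheet_links = link_set("sheet.csv", "Links")          # assumed export of the Google Sheet
build_links = link_set("all-locations.csv", "links")  # downloaded from the data_build branch

print("only in the sheet:", len(sheet_links - build_links))
print("only in the data_build csv:", len(build_links - sheet_links))
```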

@ChelseaHannan

Why are we building another csv here if they already have one?

@elctrc commented Jun 4, 2020

(editing) - I think the question posed by @ChelseaHannan is important and I didn't want it to get buried. I'm assuming the CSV referenced above is the raw translated output of the JSON from the API endpoint (so prior to your editing, Chelsea).

Secondly, is the purpose to take this data after it is manually cleaned/edited and populate more data on the map? Or is this for a different front-end application? I ask because this is not necessarily sustainable at scale; as you begin to receive more and more data, the task of manually updating will become impossible. Is there a process in mind for joining the revised and edited set of data with the continual flow of new information?

@yo-mike (Collaborator, Author) commented Jun 4, 2020

This original request was made before the data feed described in the readme was ready.

At this point, we can decide whether it makes sense to gather volunteers to look at the content for anything else that wasn't automatically parsed.

@ChelseaHannan was willing to look for supplemental information like police badge numbers, license plates, etc.

@elctrc in terms of scale, I get it. Let’s take a step back and see what’s needed at this point.

@mnlmaier - do you need anything else for front end content?

Otherwise, we can close this until a need arises.

Thoughts?

@elctrc commented Jun 4, 2020

For sure. I am more than happy to help on the data analysis side if this is ever something desired (or if you decide you want there to be a way for users to parse the data). But that may not relate to this issue and I understand! :)

@mnlmaier commented Jun 4, 2020

That depends on which features we decide on in the future. As a first step, we should make sure that all the content is complete and that nothing weird is happening (some locations are off; you can see them when zooming out on the map).

@ChelseaHannan

@mnlmaier I really like what you've done with the map. I think you should consider creating your own separate repository for your project. I don't believe the csv in the main repo has any location coordinates. Hopefully someone can keep that updated for you, because being able to see the map is very helpful for people.

@idiosyncronaut commented Jun 5, 2020

I would like to propose a Twitter bot that, when mentioned in a conversation, submits the tweet and its assets to a data store, as a way to leverage Twitter for easily submitting abuse data. I haven't checked out the stack, but it could be a simple AWS Lambda or something.

Filtering that and ensuring it turns into meaningful data would be a later step.

But right now, I think the priority should be getting as much of the data stored as possible, while social media is at its most active.

Get the data first.

@nickatnight commented Jun 5, 2020

The Python API is finally live, @949mac @mnlmaier, with full CI/CD

repo: https://github.com/nickatnight/policebrutality.io
api: http://api.policebrutality.io/v1/videos

@mnlmaier I could probably plug in your front end if we dockerize it

Edit:
link

@ubershmekel (Collaborator)

@idiosyncronaut please open a new issue for new ideas and requests. Also, this dataset is highly edited for objectivity and evidence. It's not an attempt to make a dump of everything. If you see something you'd like to contribute, you should make a PR to the markdown files on master.

@ChelseaHannan I'm sorry if there was a misunderstanding here. This github repo is managed in the markdown files on the master branch. It's the best way we can collaborate with 50+ contributors and verify every change as a pull request. It should be easy to make downstream datasets now, though I would recommend using the json file over the csv file.

With regards to having coordinates in our dataset - feel free to open up a separate issue if you think that's valuable.

I'm closing this mega-issue. Please feel free to comment if you think we should re-open.

@mnlmaier commented Jun 5, 2020

One last comment, sorry about that:

@nickatnight @949mac there's a Discord server; I just asked, and I'm allowed to send you an invite :) Any way to get in touch without posting the link publicly?

@ubershmekel (Collaborator)

@mnlmaier I would recommend asking them to DM you on reddit then you can check post history.
