In [1]:
import pandas as pd
import numpy as np

### Information about the datasets

#### Washington Post dataset
In 2015, The Post began tracking more than a dozen details about each killing — including the race of the deceased, the circumstances of the shooting, whether the person was armed and whether the person was experiencing a mental-health crisis — by culling local news reports, law enforcement websites and social media, and by monitoring independent databases such as Killed by Police and Fatal Encounters. The Post conducted additional reporting in many cases.


#### Mapping Police Violence
This information has been meticulously sourced from the three largest, most comprehensive and impartial crowdsourced databases on police killings in the country: FatalEncounters.org, the U.S. Police Shootings Database and KilledbyPolice.net. We've also done extensive original research to further improve the quality and completeness of the data; searching social media, obituaries, criminal records databases, police reports and other sources to identify the race of 90 percent of all victims in the database.


#### Important notes
The Washington Post also sources its data from the Mapping Police Violence dataset, but cleans it. As a result, the Mapping Police Violence dataset has more entries, partly because it keeps rows with NaN values, which could still be of interest to us.
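Since the note above turns on how many NaN values each dataset carries, a per-column count of missing values is a quick way to check it. The sketch below is one possible approach, not part of the original analysis: `missing_summary` is a hypothetical helper, and `toy` is a made-up frame used only so the snippet runs on its own. In the notebook it would be called on `df_wsp` and `df_mpv` once they are loaded.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return NaN count and NaN share per column, most incomplete column first."""
    summary = pd.DataFrame({
        "missing": df.isna().sum(),          # number of NaN cells per column
        "share": df.isna().mean().round(3),  # fraction of rows that are NaN
    })
    return summary.sort_values("missing", ascending=False)


# Toy frame for illustration only (hypothetical values, not real records)
toy = pd.DataFrame({
    "name": ["A", None, "C", "D"],
    "race": [None, None, "W", "H"],
    "state": ["WA", "OR", "KS", "CA"],
})
print(missing_summary(toy))
```

Running the same helper on both frames, for example `missing_summary(df_mpv).head(10)`, would show which columns account for the extra incomplete rows.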

In [5]:
# Data dir path from root of project
data_dir = "./data"

# https://github.com/washingtonpost/data-police-shootings
df_wsp = pd.read_csv(f'{data_dir}/fatal-police-shootings-data-wsp.csv')  
display(df_wsp.head())

# https://mappingpoliceviolence.org/
df_mpv = pd.read_csv(f'{data_dir}/mapping-police-violence-24oct2020.csv')  
display(df_mpv.head())

Unnamed: 0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera,longitude,latitude,is_geocoding_exact
0,3,Tim Elliot,2015-01-02,shot,gun,53.0,M,A,Shelton,WA,True,attack,Not fleeing,False,-123.122,47.247,True
1,4,Lewis Lee Lembke,2015-01-02,shot,gun,47.0,M,W,Aloha,OR,False,attack,Not fleeing,False,-122.892,45.487,True
2,5,John Paul Quintero,2015-01-03,shot and Tasered,unarmed,23.0,M,H,Wichita,KS,False,other,Not fleeing,False,-97.281,37.695,True
3,8,Matthew Hoffman,2015-01-04,shot,toy weapon,32.0,M,W,San Francisco,CA,True,attack,Not fleeing,False,-122.422,37.763,True
4,9,Michael Rodriguez,2015-01-04,shot,nail gun,39.0,M,H,Evans,CO,False,attack,Not fleeing,False,-104.692,40.384,True


Unnamed: 0,Victim's name,Victim's age,Victim's gender,Victim's race,URL of image of victim,Date of Incident (month/day/year),Street Address of Incident,City,State,Zipcode,...,Unarmed/Did Not Have an Actual Weapon,Alleged Weapon (Source: WaPo and Review of Cases Not Included in WaPo Database),Alleged Threat Level (Source: WaPo),Fleeing (Source: WaPo),Body Camera (Source: WaPo),WaPo ID (If included in WaPo database),Off-Duty Killing?,Geography (via Trulia methodology based on zipcode population density: http://jedkolko.com/wp-content/uploads/2015/05/full-ZCTA-urban-suburban-rural-classification.xlsx ),MPV ID,Fatal Encounters ID
0,Name withheld by police,,Male,,,10/14/2020,,Cookson,OK,,...,Allegedly Armed,spear,attack,Not fleeing,No,6232.0,,,,
1,Name withheld by police,,Male,,,10/14/2020,,South Los Angeles,CA,,...,Allegedly Armed,gun,attack,Not fleeing,No,6231.0,,,,
2,Name withheld by police,,Male,White,,10/14/2020,,Chico,CA,,...,Allegedly Armed,knife,attack,foot,No,6230.0,,,,
3,Marcos Ramirez,27.0,Male,Hispanic,,10/13/2020,,Bakersfield,CA,,...,Allegedly Armed,knife,attack,foot,Yes,6228.0,,,,
4,Anthony Jones,24.0,Male,,,10/12/2020,,Bethel Springs,TN,,...,Unarmed/Did Not Have an Actual Weapon,no object,other,car,No,6229.0,,,,


In [6]:
# WSP: per-state non-null counts for every column (count() skips NaN values)
per_state_counts = df_wsp.groupby(["state"]).count()
display(per_state_counts.head())
# display(per_state_counts)
print(len(per_state_counts))

state,id,name,date,manner_of_death,armed,age,gender,race,city,signs_of_mental_illness,threat_level,flee,body_camera,longitude,latitude,is_geocoding_exact
AK,40,39,40,40,40,39,40,36,40,40,40,37,40,33,33,40
AL,105,102,105,105,100,100,105,95,105,105,105,103,105,98,98,105
AR,84,84,84,84,81,80,84,73,84,84,84,78,84,73,73,84
AZ,262,251,262,262,249,251,262,226,262,262,262,251,262,244,244,262
CA,853,783,853,853,824,785,853,733,853,853,853,809,853,828,828,853


51
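Because `count()` reports non-null values per column, the table above shows a different number in each column for the same state (for example, fewer values under `race` than under `id`). If a single incident count per state is wanted, `groupby(...).size()` counts rows regardless of NaN values. The following is a minimal sketch under that assumption: `incidents_per_state` is a hypothetical helper name, and `toy` is made-up data so the snippet is self-contained.

```python
import pandas as pd


def incidents_per_state(df: pd.DataFrame, state_col: str = "state") -> pd.Series:
    """Count rows per state (NaN in other columns does not matter), largest first."""
    return df.groupby(state_col).size().sort_values(ascending=False)


# Toy frame for illustration only (hypothetical values, not real records)
toy = pd.DataFrame({
    "state": ["CA", "CA", "WA", "CA", "WA", "OR"],
    "race": ["W", None, "H", None, "B", None],
})
counts = incidents_per_state(toy)
print(counts)
print(len(counts))  # number of distinct states in the toy frame
```

In the notebook this would be called as `incidents_per_state(df_wsp)`.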


In [8]:
# Output the number of incidents in each dataset
wsp_len = len(df_wsp)
mpv_len = len(df_mpv)


print(f"Washington Post data-police-shootings has {wsp_len} entries")
print(f"Mapping Police Violence has {mpv_len} entries")


Washington Post data-police-shootings has 5716 entries
Mapping Police Violence has 8507 entries
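The gap between the two totals can be explored through the `WaPo ID (If included in WaPo database)` column that appears in the Mapping Police Violence header above. The sketch below is one way to do this, not a result from the original notebook: `split_by_wapo_id` is a hypothetical helper, it assumes the WaPo ID column corresponds to the `id` column of the Washington Post data, and the two toy frames are made up so the snippet runs on its own.

```python
import pandas as pd

# Column name as shown in the Mapping Police Violence header
WAPO_ID_COL = "WaPo ID (If included in WaPo database)"


def split_by_wapo_id(df_mpv: pd.DataFrame, df_wsp: pd.DataFrame) -> dict:
    """Count MPV rows with and without a WaPo ID, and how many IDs exist in WSP."""
    has_id = df_mpv[WAPO_ID_COL].notna()
    # Assumption: the WaPo ID refers to the WSP "id" column; compare as floats
    found = df_mpv.loc[has_id, WAPO_ID_COL].isin(df_wsp["id"].astype(float))
    return {
        "mpv_with_wapo_id": int(has_id.sum()),
        "mpv_without_wapo_id": int((~has_id).sum()),
        "mpv_ids_found_in_wsp": int(found.sum()),
    }


# Toy frames for illustration only (hypothetical values, not real records)
toy_mpv = pd.DataFrame({WAPO_ID_COL: [3.0, None, 5.0, 99.0]})
toy_wsp = pd.DataFrame({"id": [3, 4, 5]})
print(split_by_wapo_id(toy_mpv, toy_wsp))
```

Calling `split_by_wapo_id(df_mpv, df_wsp)` in the notebook would indicate how many Mapping Police Violence entries have no counterpart in the Washington Post data.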
