# Comparing Names of People Killed Across Different Datasets

The first part of the analysis checks whether the people killed are recorded consistently across all three datasets (Mapping Police Violence, Killed By Police, and the Washington Post). To do that, we count how many names match between them.

In [1]:
import pandas as pd
import numpy as np
from difflib import SequenceMatcher
import Levenshtein as lev

In [59]:

mpv = pd.read_csv("data/MPVDataset.csv")


In [4]:
mpv.tail()

Unnamed: 0,Victim's name,Victim's age,Victim's gender,Victim's race,URL of image of victim,Date of Incident (month/day/year),Street Address of Incident,City,State,Zipcode,...,Criminal Charges?,Link to news article or photo of official document,Symptoms of mental illness?,Unarmed,Alleged Weapon (Source: WaPo),Alleged Threat Level (Source: WaPo),Fleeing (Source: WaPo),Body Camera (Source: WaPo),WaPo ID (If included in WaPo database),Unnamed: 24
6314,Andrew L. Closson,21,Male,White,http://www.superiortelegram.com/sites/default/...,1/1/13,U.S. Highway 53,Gordon,WI,54838.0,...,No Known Charges,http://www.superiortelegram.com/content/deputy...,Drug or alcohol use,Allegedly Armed,,,,,,
6315,Mark Chavez,49,Male,Hispanic,http://www.tricitytribuneusa.com/wp-content/up...,1/1/13,912 Loma Linda Ave.,Farmington,NM,87401.0,...,No Known Charges,http://www.daily-times.com/farmington-news/ci_...,No,Allegedly Armed,,,,,,
6316,Andrew Layton,26,Male,White,http://bloximages.chicago2.vip.townnews.com/ma...,1/1/13,410 S Riverfront Drive,Mankato,MN,56001.0,...,No Known Charges,http://www.tmcnet.com/usubmit/2013/02/21/69388...,No,Allegedly Armed,,,,,,
6317,Tyree Bell,31,Male,Black,http://content.omaha.com/media/maps/ps/2013/ja...,1/1/13,3727 N. 42nd St.,Omaha,NE,68111.0,...,No Known Charges,http://www.ketv.com/news/Police-chief-details-...,Yes,Allegedly Armed,,,,,,
6318,Christopher Tavares,21,Male,Hispanic,http://www.krdo.com/image/view/-/17980228/medR...,1/1/13,Highway 50 and North Elizabeth Street,Pueblo,CO,81008.0,...,No Known Charges,http://www.krdo.com/news/Pueblo-Police-shoot-k...,No,Allegedly Armed,,,,,,


## Different choices of comparing names

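A simple substring test with `in` is not enough: it fails as soon as one record includes a middle name or suffix that the other lacks. String similarity scores handle these cases better.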

In [7]:
"Joseph Johnson" in "Joseph Walden Johnson Jr."

False

In [18]:
lev.ratio("Joseph Walden Johnson Jr.", "Joseph Johnson Jr.")

0.8372093023255814

In [9]:
SequenceMatcher(None, "Joseph Walden Johnson Jr.", "Joseph Johnson").ratio()

0.717948717948718
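Both scores above are pulled down by the middle name and the suffix. One possible refinement, sketched here as an assumption rather than the method this notebook goes on to use, is to normalize each name before comparing it. The helper names (`normalize_name`, `name_similarity`, `tokens_contained`) and the suffix list are hypothetical, and the sketch relies only on `difflib` from the standard library.

```python
from difflib import SequenceMatcher

# Hypothetical list of generational suffixes to ignore; extend it to fit the data.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def normalize_name(name):
    """Lowercase, strip periods and commas, and drop generational suffixes."""
    tokens = name.lower().replace(".", "").replace(",", "").split()
    return [t for t in tokens if t not in SUFFIXES]


def name_similarity(a, b):
    """SequenceMatcher ratio (0.0 to 1.0) computed on the normalized names."""
    norm_a = " ".join(normalize_name(a))
    norm_b = " ".join(normalize_name(b))
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def tokens_contained(short, long):
    """True if every token of the first name also appears in the second."""
    return set(normalize_name(short)) <= set(normalize_name(long))


# The plain `in` test fails on this pair, but the token check succeeds.
print(tokens_contained("Joseph Johnson", "Joseph Walden Johnson Jr."))
print(name_similarity("Joseph Walden Johnson Jr.", "Joseph Johnson"))
```

The token check is strict about spelling but tolerant of extra middle names, while the ratio tolerates small spelling differences; which one fits better depends on how the names were entered in each dataset.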

## Dates From 2015 Onward

Checking the head and tail of each dataset shows that the Mapping Police Violence and Killed By Police data go back to 2013, while the Washington Post data starts in 2015. We therefore restrict the comparison to 2015 onward, the period that all three datasets cover.

In [32]:
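# Split the "month/day/yy" strings into integer month and four-digit year columns
mpv_dates = mpv["Date of Incident (month/day/year)"].values
mpv_months = [int(date[: date.find("/")]) for date in mpv_dates]
# Years are stored as two digits (e.g. "13"), so prefix "20" to get 2013
mpv_years = [int("20" + date[date.rfind("/") + 1 :]) for date in mpv_dates]
mpv["month"] = mpv_months
mpv["year"] = mpv_years
# Keep only incidents from 2015 onward, when all three datasets have data
past2015_mpv = mpv[mpv["year"] >= 2015]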
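The string slicing above assumes every date has exactly the `month/day/yy` shape. As an alternative sketch, the standard library's `datetime.strptime` can parse the same format and raises an error on malformed values. The function name `parse_incident_date` is hypothetical, and apart from `1/1/13`, which appears in the table above, the sample dates are made up for illustration.

```python
from datetime import datetime


def parse_incident_date(text):
    """Parse a 'month/day/two-digit-year' string such as '1/1/13'."""
    # %y maps two-digit years 00-68 to 2000-2068, so "13" becomes 2013.
    return datetime.strptime(text.strip(), "%m/%d/%y")


# Illustrative values only; "1/1/13" is the format seen in the MPV data.
sample_dates = ["1/1/13", "12/31/14", "1/1/15", "6/15/16"]
parsed = [parse_incident_date(d) for d in sample_dates]

# Keep dates from 2015 onward, mirroring the year filter used in this notebook.
from_2015 = [d for d in parsed if d.year >= 2015]
print([d.strftime("%Y-%m-%d") for d in from_2015])
```

Parsing into real date objects also makes later comparisons by month or exact day straightforward, at the cost of failing loudly on any entry that does not match the expected format.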

In [35]:

mpv_names = past2015_mpv["Victim's name"].values


In [37]:
# Drop placeholder entries where police withheld the victim's name
mpv_names = [name for name in mpv_names if name != "Name withheld by police"]


In [40]:
print("Number of names in the Mapping Police Violence dataset:", "\t", len(mpv_names))

Number of names in the Washington Post dataset: 	 3491
Number of names in the Mapping Police Violence dataset: 	 4001
Number of names in the Killed By Police dataset: 	 3524
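With the name lists in hand, counting matches between two datasets can be sketched as follows. This is a minimal illustration under stated assumptions, not necessarily the approach taken in the rest of the analysis: the functions `best_match` and `count_matches` and the `0.85` threshold are hypothetical, and the two short lists are toy input built from names shown earlier, not real records from a second dataset.

```python
from difflib import SequenceMatcher


def best_match(name, candidates):
    """Return (candidate, score) for the most similar candidate, or (None, 0.0)."""
    best, best_score = None, 0.0
    for cand in candidates:
        score = SequenceMatcher(None, name.lower(), cand.lower()).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best, best_score


def count_matches(names_a, names_b, threshold=0.85):
    """Count names in names_a whose best match in names_b meets the threshold."""
    return sum(1 for n in names_a if best_match(n, names_b)[1] >= threshold)


# Toy input for illustration only (assumed, not taken from a second dataset).
list_a = ["Mark Chavez", "Tyree Bell", "Andrew Layton"]
list_b = ["Mark Chavez", "Tyree L. Bell", "Christopher Tavares"]
print(count_matches(list_a, list_b))
```

Every name is compared against every candidate, so the cost grows with the product of the two list lengths. With a few thousand names per dataset that is still manageable, but grouping by year or state first would cut the work considerably.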
