```{attention}

The Chicago Missing Persons data story seeks to represent the technical work and findings of the investigation by Invisible Institute ("II"), City Bureau ("CB"), and the Human Rights Data Analysis Group ("HRDAG").

One of HRDAG's goals as an organization is to explain why data collection methods and possible analyses are necessarily and deeply intertwined, and make it known when certain statements are not possible or reasonable to make with a given dataset. Relatedly, an important part of this story involves reviewing a few of the widely-publicized cases from the time period and comparing the ground truth as reported by the community and media to what is prescribed by policy and presented in the city’s data, which realizes certain shortcomings highlighted by the HRDAG team. Grounding the technical work with examples like these is key to our ability to place the data firmly in the real-world context in which it lives and to depict the significance of the findings through the lens of lived experience.

Mentions of these cases may include the names of missing and murdered humans, comments from surviving family/friends, and reports from the County Medical Examiner’s office including manner and method of death. While we do not wish to sensationalize or center any gruesome detail, we recommend discretion in reading this notebook as the information discussed in it may be upsetting or triggering. The heaviest section will be the final one on Record Linkage.

```

# Background

### [ABOUT THE PROJECT](https://www.btsurface.com/)

Invisible Institute and the Human Rights Data Analysis Group have collaborated on analyzing missing persons cases since 2019 – initially scraping and analyzing open missing persons cases from 2000 - 2016 received through public records requests.

Through [Beneath the Surface](https://www.btsurface.com/), a project that uses machine learning to parse through narrative text of police misconduct records, our team was able to identify over 50 complaints related to missing persons cases. Alleged misconduct ranged from officers denying reports all the way to an officer closing a case before finding the young person. Our team then embarked on a deep dive into Chicago Police Missing Persons Data and the legislation which influenced the data systems where these cases live; analyzing every digital case record from 2000-2022 prepared and provided by the Chicago Police Department. We paired with City Bureau in 2021 on the investigation.

Since reporting in the [Chicago Reader](https://chicagoreader.com/news-politics/chicago-police-missing-persons/) and publishing the full version of the story as a [microsite](https://chicagomissingpersons.com), the reporters involved in this project have presented to the Illinois Task Force on Missing and Murdered Chicago Women, in addition to city representatives. The Task Force's role includes examining and reporting the systemic causes behind violence that Chicago women and girls experience and reviewing the existing and potential new methods for tracking and collecting data on violence against Chicago women and girls. Pursuant to that goal, they have attempted to review the missing persons data collected by the city's primary contact agency, the Chicago Police Department.

We hope these data notebooks can support elected officials and local community members in better understanding the missing persons pipeline, that the public has access to the details of the gaps unveiled by our findings, and that our work can operate as a model for utilizing data science and narrative justice to interrogate police records and the data-driven claims made from them.

### [Missing Persons](https://chicagomissingpersons.com/)

According to the Illinois Intergovernmental Missing Child Recovery Act of 1984, the primary contact agency (in the case of Chicago, the Chicago Police Department) is to enter every missing persons report into the Law Enforcement Agencies Data System (LEADS) for the purpose of effecting an immediate law enforcement response to reports of missing children, as well as to compile and retain information regarding missing children. This historic data repository was written into law so that it could provide a factual and statistical base for research which would enable a deeper understanding and addressing of the problem of missing children in Illinois. Also in support of this research goal, law enforcement are required to update cases to reflect and include information relating to the final disposition of each case, but Chicago police do not enforce this practice in a way that is transparent or reproducible by experts at HRDAG.

Through Beneath the Surface, an initial public records request for missing persons report data was sent to **Chicago Police Department (CPD)**. As the investigation continued for two more years, additional requests were made for the updated dataset from **CPD**, as well as new requests for 911 call data related to missing persons reports from the **Office of Emergency Management and Communications (OEMC)** and death data from the **Cook County Medical Examiner (ME)**.

We hoped to use the CPD data to reproduce its previous public claims that suggested 99.9% of missing persons reports were for runaways who had been located, as well as compare the ground truth events and data narrative of several widely-reported missing persons cases.

### Exploratory analysis found discrepancies in the data

Our team almost immediately identified two cases which were classified as closed non-criminal even though a murder investigation was opened for the missing individual. After two years and reviewing all available case records for several dozen widely-reported on missing persons from 2000-2022, we were able to identify a total of 11 cases which were classified as "Closed Non-criminal" even though a homicide case had been opened for the missing, including 1 where the CPD Homicide case is explicitly mentioned in the request to reclassify the missing person report as non-criminal. We were also able to identify 4 cases in which the Chicago Police closed the missing persons case before the person was found – including two cases where the Black teenage girl was subsequently found murdered.

We’re not the only group to review CPD's record-keeping and encounter problems. CPD has previously been investigated by the Department of Justice (DOJ). In a section of the [2017 DOJ report](https://www.justice.gov/d9/chicago_police_department_findings.pdf) from that investigation, titled “CPD Does Not Provide Officers With Sufficient Direction, Supervision, or Support to Ensure Lawful and Effective Policing”, the DOJ deemed CPD's data collection and public transparency methods "insufficient". The section also described the poor usability and interoperability of **CPD's** software systems, which contributed to our understanding of the internal data generation process in place for the first two-thirds of the record-keeping. During our investigation, II also received information that CPD personnel influenced the design of some of their data entry systems to discourage the use of structured fields and things like drop-down menus, at a cost to the ease of both data entry and producing summary statistics from the data. This design choice explains how some of the data entry errors we found were possible, like the more than two dozen age values between 133 and 985 years old, or the year of birth recorded as "1776".
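Free-text entry errors of this kind can be surfaced with a simple plausibility check. The following is a minimal sketch on made-up rows, not the check used in the investigation: the record numbers are invented, and the two cutoffs (`MAX_PLAUSIBLE_AGE`, `MIN_PLAUSIBLE_BIRTH_YEAR`) are assumptions chosen only for illustration.

```python
import pandas as pd

# Made-up rows for illustration; `age` and `year_of_birth` mirror field
# names used later in this notebook, the values are invented.
toy = pd.DataFrame({
    "rd_no": ["X1", "X2", "X3", "X4"],
    "age": [17, 133, 985, 42],
    "year_of_birth": [2003, 1776, 1990, 1978],
})

# Assumed cutoffs, not taken from any CPD documentation.
MAX_PLAUSIBLE_AGE = 120
MIN_PLAUSIBLE_BIRTH_YEAR = 1880

# Flag rows where either value falls outside the plausible range.
implausible = toy.loc[
    (toy["age"] > MAX_PLAUSIBLE_AGE)
    | (toy["year_of_birth"] < MIN_PLAUSIBLE_BIRTH_YEAR)
]
print(implausible)
```

Flagged rows are candidates for review, not automatic deletions: an implausible age may sit next to otherwise accurate information about a real report.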

In our attempt to reproduce CPD's prior statistic that 99.9% of those reported missing were located, we were unable to identify any structured field maintaining this information. After following up with the responding FOIA officer, we were told to use the field that indicates the status of the investigation as Open or Closed and Criminal or Non-criminal. However, as we have already stated and will discuss further with examples in the Record Linkage section, the labels in the status field (particularly "Closed Non-criminal") can be a misleading characterization of the ground truth and have been applied to cases where the missing person faced a gruesome outcome rather than a safe return home. We hope this notebook will demonstrate the flaw in using the status field to fill in information about the outcomes of missing persons.

In this notebook we will introduce the datasets, our initial goals and questions, and the discrepancies we found between the data and ground truth outcomes. The next chapter will be on the events surrounding a missing persons event, including 911 calls, officer arrival, case closure, the UCR code, and a sneak peek of analysis involving missing person 911 calls and ShotSpotter dispatches. The next two will be on who is represented in the data and described in most reports, and then who is not represented in the data and the real-world circumstances that lead to someone being fully excluded from this kind of data collection.

### Questions we want to answer

Given a true collection of the unique IDs of everyone who went missing in the city of Chicago between 1 January 2000 and 31 December 2022 and information about the timeline and outcome for each individual, we would be able to respond precisely to questions like,

- how many humans went missing in Chicago in this time period?
- how many humans went missing in Chicago _more than once_ in this time period?
- how many humans who went missing in Chicago in this time period _were located in good health_?
- how many humans who went missing in Chicago in this time period _were located and had been the victim of a crime_?
- does the speed of officer arrival differ for humans _who have been missing before_ compared to _those who have not_?
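To make the first two questions concrete, here is a minimal sketch of how they could be answered if a person-level identifier existed. The `person_id` column is hypothetical (no such field appears in the data we received), and all rows are invented.

```python
import pandas as pd

# Invented events; `person_id` is a hypothetical unique identifier that
# the released data does NOT contain.
events = pd.DataFrame({
    "person_id": ["p1", "p1", "p2", "p3", "p3", "p3"],
    "date_occurred": pd.to_datetime([
        "2001-03-01", "2004-07-15", "2010-01-20",
        "2015-05-05", "2016-02-11", "2019-09-30",
    ]),
})

# How many humans went missing in this period?
n_people = events["person_id"].nunique()

# How many humans went missing more than once?
reports_per_person = events.groupby("person_id").size()
n_repeat = int((reports_per_person > 1).sum())

print(n_people, n_repeat)
```

With an identifier like this, both counts reduce to a single `groupby`. Without one, every such count depends on how reports are assumed to map onto people.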

If we had **OEMC** and **ME** records that included external agency IDs for posterity, we could trace reports of missing humans from the initial contact with police to the moment an officer arrived on scene, and follow reports that ended in the missing person being found deceased.

However, the data provided does not include unique identifiers, so we can't give precise numbers for how many people went missing or distinguish who went missing more than once.

We can explore instead the data generation process and record linkage between a small sample of identified reports.

---

# Introduce the data
### CPD Missing Persons (MP)
- Source: Chicago Police Department
- CPD's digitized MP report database from 1 January 2000 to 31 December 2022
- 6 FOIA-requested versions between December 2020 and April 2023 (.xlsx)
- an additional sample of CPD's report database including reports originating as MP that were reclassified to another type of event (.xlsx)
- a couple dozen FOIA-requested original and supplementary incident reports (.pdf)

To start a discussion of the data story behind missing persons reports, it behooves us to define our 'unit of analysis', the basic unit of the thing we're interested in studying. At HRDAG, we typically design a dataset around the question, "Who did what to whom?", with information like the name of the injured or killed person and the location of the incident.

These are the core features of the missing persons data as collected/recorded by CPD:
- `rd` or Records Division number
- date last seen (AKA `date_occurred`)
- location last seen (AKA `address`)
- some info about the missing person (this will be covered as its own topic)

So, in this context, our data is more appropriately structured around, "Who was last seen where?" We think of this as the event of a person going missing. Each row in the data represents 1 unique `rd` number and person reported missing.

---

### CPD Homicide
- Source: Chicago Police Department
- CPD's digitized Homicide report database from 1 January 2000 to 31 January 2023
- 1 FOIA-requested version from 2022 (.xlsx)

The CPD Homicide data contains the unredacted full name, sex, and age of the deceased, along with status and timeline information about the case. Each row represents 1 unique `rd` number and person reported as the victim of a homicide.

---

### Office of Emergency Management & Communication (OEMC)
- Source: Office of Emergency Management & Communication
- 1 FOIA-requested sample of 2 files, covering April 2018 to April 2022
- 1 file of all dispatches, including initial event type, `init_type`, and priority, `init_priority`
- 1 file of locations, only for `init_type` of "MISSING PERSON"

The OEMC call data contains the timeline, event type, and priority of each request for dispatch. Each row represents 1 unique `eventnumber` and a 911 call or other emergency dispatch (ShotSpotter "calls" are not true 911 calls but go through OEMC the same way for dispatch purposes).

---

### Cook County Medical Examiner (ME)
- Source: Cook County Medical Examiner
- 1 FOIA-requested version covering 1 January 2000 - 1 August 2014
- 2 FOIA-requested versions covering the partial timeline or with partial case details

The ME death data contains the timeline, manner of death, and personally identifying information of each person found deceased in the county. Each row represents 1 unique `casenumber` and person found deceased.

---

Many other sources and supplementary documents were included in the investigation, including policy and directive orders, and dozens of interviews with the family and friends of missing humans, staff at group homes, former and current CPD personnel, and experts who study missing and murdered humans. Where possible, other public resources like the [Cook County Sheriff's Office](https://www.cookcountysheriffil.gov/person/), [Illinois Missing](https://illinoismissing.org/missing/), and [NamUs](https://namus.nij.ojp.gov/) were referenced. With the **OEMC** and **ME** datasets, we hoped to perform record linkage and identify missing persons reports that began as 911 calls and ended with the **ME**.

# What is an event?

### data structure + features
According to the Intergovernmental Missing Child Recovery Act of 1984, a missing child is anyone under the age of 21 whose parent or guardian does not know where they are. Although the legislation does not mandate that adults are entered into the LEADS terminal, these reports are still taken and may be associated with a LEADS/NCIC number. Any missing person may be classified as "high risk" if they are of tender age (0-9 years old), are disabled, have certain mental health challenges, etc. Regardless of a person's age, ability, time missing, or relationship to the complainant, it is against the law for CPD to refuse to accept the report.

> Our focus: In the data and by CPD directives, the term "FOUND person" is used to refer to someone who is found and not cognizant of his or her whereabouts and cannot make contact with a responsible person having a concern for his or her welfare. This is distinct from a "LOCATED person", which according to directives is the term used for a missing person who has been located. Although a missing person and a found person may be connected, they are not the same type of case. For the purpose of this project, we focused on every case which initially began as a missing person.

### missing data + limitations
Given the immediate concerns related to the quality of the outcome data, our team leaned heavily on media reports and interviews with family members in order to identify several missing persons in the data for a case study of ground truth outcomes. Due to privacy laws, street addresses have been censored as block addresses. Additionally, there aren’t names connected to the cases, but according to directives each individual missing person is given a unique case number, a Records Division or `rd` number. (You can see our test of whether `rd`s are unique to an individual when we examine the dataset later in the "person identifiers & the `rd` field" section of this notebook.)

As these data are all the result of public records requests and data generation processes that are at least in part data entry, we don't assume these are a complete picture of the thing we are trying to quantify, either. We explore the conditions for someone to be missing from the missing persons data in a later notebook, as well.

#### CPD
In the case of **CPD's** data, we know going into our research that missing persons reports are one of the few remaining incident reports that have not been digitized and that not all missing persons reports are shared with or accepted by police, so _we do not regard this collection as the ground truth representation of everyone who went missing in the city of Chicago in this time period._

Instead, it's more accurate to think of this as the collection of _accepted_ missing persons reports made to CPD in this time period that were digitized and included in at least one set of public records released to **II**, **CB**, and/or **HRDAG** in the last two years.

That might not sound like a very interesting dataset, but we can still learn a lot by asking practical questions about the data and **CPD's** record-keeping.

***What are the basic procedures for opening and closing a missing person report?***

In the beginning, a missing person report is considered a "zero-report." It is not considered to be related to a crime until it becomes clear to CPD that a crime is associated with the report / underlying events.

CPD does not have clear case closure procedures in their missing persons directives, which has led to varied methods and reasons for closing cases.

***What happens when a case which was initially classified as a missing person is to be reclassified as related to a crime?***

Based on legislation, detectives are required to reclassify the missing persons case as a crime if they have knowledge that a crime occurred. A case can be reclassified in two ways: through a supplementary report or a change in UCR when reclassification rationale is supported by information in the case report.

By directives and in practice, CPD doesn't collect structured data indicating whether or not a person was located -- rather they track if the case was criminally associated or not. The status of the case indicates if an offender was identified or charged and the current IUCR column provides the crime associated with the incident. In this way, "Closed Non-criminal" means that no crime and no offender were identified by the police.

***What would a missing person being located safely look like in this data?***

Since the Chicago Police Department doesn't track whether or not a person was located, it cannot distinguish between events that were determined to be criminal

Here are a few cases that were officially reclassified, indicating that the reclassification rationale was supported by information in the case report. They include two adult women and one missing child.
- Daisy Hayes, who has never been found but her case was reclassified as a homicide when evidence was discovered that implicated her boyfriend in her disappearance. He was charged with her murder and her case status reflects cleared closed.
- Joanna Wright, who has never been found, but witness and vehicle evidence indicated she was kidnapped. Her case has been suspended and was reclassified as a kidnapping.
- Marlen Ochoa Lopez, a missing teen, whose body was found in a garbage can after her disappearance. Her baby had been cut out of her body, but died shortly after birth. Two women were charged in her murder. Her case is cleared closed.

***What is the `status` field?***

FOIA Dictionaries provided by public records officers informed us of the following definitions for case `status`:

| code |  label | description |
| :--- | :---: | -: |
| 0 | Open Assigned | Assigned to a detective for investigation |
| 0 | Open Original | District Review Pending, not yet given to detectives |
| 0 | Open Unassigned | Reviewed by District, not yet assigned to detectives |
| 1 | Suspended | All investigative avenues fully pursued, case can not proceed further at this time |
| 3 | Cleared Closed | All offenders have been arrested and charged |
| 4 | Cleared Open | One or more offenders arrested and charged. One or more offenders still wanted |
| 5 | EX Cleared Closed | All offenders identified, complainant refuses to prosecute or offenders identified but whereabouts unknown |
| 6 | EX Cleared Open |  |
| 7 | Closed Non-criminal | Incident not criminal in nature |

This dictionary and the `status` data represented by it provide a useful top-level classification of CPD's investigation. However, these descriptions do not elaborate on the outcome of the missing human, instead only describing procedural events. Despite the lack of outcome information in these descriptions, the FOIA officer instructed us to infer that "Closed Non-criminal" means the person was "likely found."

> "Note1: If an incident is closed non-criminal than the person was likely found."

Using a phrase such as "likely found" when pressed for precise information demonstrates a lack of confidence and precision with their own data. What does "Closed Non-criminal" actually say about police case handling?
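To see what the FOIA officer's guidance amounts to in practice, here is a minimal sketch on invented rows. It computes the share of reports labeled "Closed Non-criminal", which is the number one would get by treating that label as a stand-in for "located". The counts are made up and say nothing about the real data.

```python
import pandas as pd

# Invented status values using labels from the FOIA dictionary above.
toy = pd.DataFrame({
    "status": [
        "Closed Non-criminal", "Closed Non-criminal", "Closed Non-criminal",
        "Suspended", "Cleared Closed",
    ]
})

# Share of reports per status label.
status_share = toy["status"].value_counts(normalize=True)
closed_nc_share = status_share["Closed Non-criminal"]

# This share describes how CPD classified the investigation; it does not
# by itself establish that anyone was located or returned home safely.
print(status_share)
```

Any "located" rate built this way inherits every misclassification in the `status` field, which is why the rest of this notebook treats the label as a statement about the investigation and not about the person.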

***What is the `current_iucr` field?***

The Illinois Uniform Crime Reporting ("IUCR") codes are the result of Illinois' participation in the FBI's 1930 Uniform Crime Reporting Program, although the FBI began phasing out UCR codes in 2021 in favor of the [National Incident Based Reporting System ("NIBRS")](https://www.fbi.gov/how-we-can-help-you/more-fbi-services-and-information/ucr/nibrs).

In one of our attempts to press for outcome data, the responding FOIA officer referred us to the `current_iucr` field:

> "Provided are the RD numbers and current IUCR classifications of incidents originally classified under IUCR code 6050 (MISSING PERSON). A case originally classified as MISSING PERSON may have its IUCR code updated to reflect new information: for example, the incident's IUCR code may be updated to 6055 (NON-CRIMINAL: FOUND PERSON), 5084 (NON-CRIMINAL: DEATH), or a criminal code such as 1790 (CHILD ABDUCTION), 4220 (KIDNAPPING), or 0110 (HOMICIDE: FIRST DEGREE MURDER). More specific information on missing person recoveries would be captured in individual case report narratives."

[CPD's Incident Reporting Guide](https://directives.chicagopolice.org/forms/CPD-63.451_Table.pdf) provides more information about how officers were meant to identify and classify different types of events using the Illinois Uniform Crime Reporting codes. The [city's data portal](https://data.cityofchicago.org/widgets/c7ck-438e) provides an index of many more of the UCR codes. However, we were not able to identify a code for a "LOCATED" person as the UCR system was meant to encode different types of criminal activity, and a LOCATED person has not necessarily been the victim or offender of a crime.
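The reclassification logic described in the FOIA note can be sketched as a comparison between the original and current codes. This is an illustration on invented rows: the label lookup contains only the six codes named in the FOIA officer's note, and the codes are kept as strings on the assumption that leading zeros (as in `0110`) need to be preserved.

```python
import pandas as pd

# Only the codes named in the FOIA officer's note; not a full IUCR index.
IUCR_LABELS = {
    "6050": "MISSING PERSON",
    "6055": "NON-CRIMINAL: FOUND PERSON",
    "5084": "NON-CRIMINAL: DEATH",
    "1790": "CHILD ABDUCTION",
    "4220": "KIDNAPPING",
    "0110": "HOMICIDE: FIRST DEGREE MURDER",
}

# Invented reports, all originally classified as MISSING PERSON.
toy = pd.DataFrame({
    "rd_no": ["X1", "X2", "X3", "X4"],
    "original_iucr": ["6050", "6050", "6050", "6050"],
    "current_iucr": ["6050", "6055", "0110", "6050"],
})

# A report counts as reclassified when its current code differs from the original.
toy["reclassified"] = toy["original_iucr"] != toy["current_iucr"]
toy["current_label"] = toy["current_iucr"].map(IUCR_LABELS)

# None of the codes named in the note describes a LOCATED person.
has_located_code = any("LOCATED" in label for label in IUCR_LABELS.values())
print(toy, has_located_code)
```

In this sketch, a report whose code never changes from 6050 looks identical whether the person returned home or was never seen again.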

This note about UCRs from the FOIA officer tells us several things about the limitations of **CPD's** data:

So, if we are meant to use UCR codes as a stand-in for case outcome, we do not have a representative value for LOCATED persons. There are directives confirming that more specific information about the circumstances is kept in case report narratives, though.

> "Department members who are notified that a missing person has been located will complete a Supplementary Report (CPD-11.411-A) to the original Missing/Found Person Case Report in all cases of located persons, including any additional information regarding pertinent circumstances in the narrative." (Special Order S04-05, Section VIII, item A6)

From this information, we understand two things:
1) **CPD** uses the status of an investigation and the UCR code system to encode information about the outcome of missing persons reports, although the UCR codes do not have a value for located persons who were not involved in a crime and the status descriptions do not refer to specific outcomes.
2) **CPD** maintains a more complete narrative of the outcome in the unstructured case files, specifically Supplementary Reports, that need to be requested in small batches and cannot be reviewed as a large collection.
    - Unless CPD has a structured database available in other internal software which is not subject to FOIA requests, it's likely the true number of located persons is buried in a collection of tens or hundreds of thousands of paper and digital Supp reports that would need to be read individually to determine whether they describe a missing person who was later located.

Going back to our big question, **What would a missing person returning home safely look like in this data?**

#### Supplementary data
- Although 911 call data is initially generated through a digital phone system, it could be possible for a call to be declined or lost, or for dispatch to be unable to deploy an officer to the scene to investigate. In these cases, we may have incomplete OEMC data or no data at all.
- Theoretically, it would be possible for the County ME to be unable to respond to a particular call, perhaps because of challenges with availability and/or call volume. In these cases, we imagine another county ME or similar official would have been deployed to a scene within Cook County. 

Now that we have a sense for the data we have and its limitations, let's look more at the actual data.

# setup

In [2]:
# dependencies
import yaml
import re
import pandas as pd
import geopandas as gpd

In [3]:
# support methods
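def read_yaml(fname):
    """Return the top-level keys of a YAML mapping as a list."""
    with open(fname, "r") as f:
        data = yaml.safe_load(f)
    assert isinstance(data, dict), f"expected a mapping in {fname}"
    return list(data.keys())


def prep_geodata(f):
    """Read a geospatial file and reproject it to Web Mercator (EPSG:3857)."""
    geo_df = gpd.read_file(f)
    return geo_df.to_crs(epsg=3857)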

In [4]:
# primary dataset
mp = pd.read_parquet("../../export/output/mp.parquet")
mp = mp.loc[mp.year_occurred < 2023].copy()

# primary by version
first = pd.read_parquet("../../join/output/versions/first.parquet")
second = pd.read_parquet("../../join/output/versions/second.parquet")
third = pd.read_parquet("../../join/output/versions/third.parquet")
fourth = pd.read_parquet("../../join/output/versions/fourth.parquet")
fifth = pd.read_parquet("../../join/output/versions/fifth.parquet")
sixth = pd.read_parquet("../../join/output/versions/sixth.parquet")

# supplemental datasets
mult_miss = read_yaml("../../clean/hand/rdno_fixes.yml")
extras = pd.read_parquet("../../join/output/extras.parquet")
oemc_disp = pd.read_parquet("../../../OEMC/import/output/oemc_dispatch.parquet")
oemc_loc = pd.read_parquet("../../../OEMC/import/output/oemc_location.parquet")
me = pd.read_parquet("../../../ME_death/filter/output/deaths.parquet")
geo_beats = prep_geodata("../frozen/beats.geojson")

In [5]:
# we don't want to publish these in their entirety - they have names of deceased humans
# want to publish a subset that includes the examples we use
raw_me_early = pd.read_excel("../../../ME_death/import/input/FOIA_All_ME_Cases_R.xlsx",
                       sheet_name='Prior-Aug 2014')
raw_me_recent = pd.read_excel("../../../ME_death/import/input/FOIA_All_ME_Cases_R.xlsx",
                       sheet_name='Aug 2014 onwards')
cpd_hom = pd.read_excel("../../../ME_death/import/input/19773-P824106-Homicides-Database.xlsx",
                        sheet_name='HOMICIDE VICS 2000-31JAN2023')

In [6]:
cpd_hom

Unnamed: 0,RD NO,VICTIM FIRST NAME,VICTIM LAST NAME,CRIME ADDRESS,VICTIM RACE,VICTIM SEX,VICTIM AGE,DATE OFFICER ARRIVED,HOMICIDE DATE/TIME,DATE CASE CLEARED-CLOSED,CLEARED EXCEPTIONALLY,NOTIFICATION DATE/TIME,DET FIRST NAME,DET LAST NAME
0,F000110,DEBORAH,WOODARD,64XX S LANGLEY AVE,BLACK,F,38.0,2000-01-01 01:40:00,2000-01-01 01:00:00,2000-01-03,,NaT,,
1,F002792,GIANNA,SOTELO,33XX W CUYLER AVE,WHITE HISPANIC,F,1.0,,2000-01-02 10:52:00,2000-01-10,,NaT,ROBERT,RUTHERFORD
2,F003404,ANDRE,THOMAS,55XX W MADISON ST,BLACK,M,24.0,2000-01-02 17:00:00,2000-01-02 16:41:00,2000-02-22,BAR TO PROSECUTE,NaT,RAYMOND,SCHALK
3,F004985,CARL,COOK,119XX S EGGLESTON AVE,BLACK,M,23.0,2000-01-03 12:35:00,2000-01-03 12:30:00,2000-01-06,,NaT,MICHAEL,FRAZIER
4,F005147,PRESTON,STOFER,10XX W 112TH PL,BLACK,M,67.0,2000-01-03 13:35:00,2000-01-03 13:30:00,2000-01-10,,NaT,PHILIP,GRAZIANO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14224,JG129690,ERIC,DIGGINS,101XX S WINSTON AVE,BLACK,M,40.0,2023-01-26 00:10:00,2023-01-25 23:45:00,NaT,,2023-01-26 01:15:00,CURTISINE,GILMORE
14225,JG131679,ANITRA,WHITE,95XX S BENNETT AVE,BLACK,F,47.0,2023-01-27 16:16:00,2023-01-27 14:06:00,2023-02-07,DEATH OF OFFENDER,2023-01-26 16:41:00,ROBERT,SMITH
14226,JG131679,ANITRA,WHITE,95XX S BENNETT AVE,BLACK,F,47.0,2023-01-27 16:16:00,2023-01-27 14:06:00,2023-02-07,DEATH OF OFFENDER,2023-01-27 17:08:00,ROBERT,SMITH
14227,JG134346,SYDNEY,JORDAN,16XX S HOMAN AVE,BLACK,M,30.0,2023-01-30 06:35:00,2023-01-30 06:03:00,NaT,,NaT,BROOK,GLYNN


In [5]:
# basic features to review
basic = ['rd_no', 'race', 'sex', 'age', 'age_group', 'address',
         'date_occurred', 
         'date_officer_arrived', 'notification_time', 'closed_date',
         'status', 'primary', 'description',
         'original_iucr', 'current_iucr',]

In [32]:
raw_me_early.head().T

Unnamed: 0,0,1,2,3,4
CASENUMBER,14-AUG-0163,14-AUG-0162,14-AUG-0161,14-AUG-0160,14-AUG-0159
INCIDENT_DATE,2014-08-11 05:19:00,2014-08-10 18:35:00,2014-08-09 17:49:00,2014-08-10 20:30:00,2014-08-10 19:05:00
DEATHDATE,2014-08-11 00:00:00,2014-08-11 00:00:00,2014-08-11 00:00:00,2014-08-10 00:00:00,2014-08-10 00:00:00
DECEDENT_FIRST_NAME,JOHN,JOLYN,CHRISTOPHER,JABARI,VERONICA
DECEDENT_MIDDLE_NAME,,,,,
DECEDENT_LAST_NAME,STEWART,JOHNSON,CORRIGAN,SCURLOCK,GAZZILLO
DECEDENT_AGE,68,63,50,16,39
GENDER,Male,Female,Male,Male,Female
RESIDENCE_CITY,Elgin,Hoffman Estates,Chicago,Chicago,Evanston
INCIDENT_CITY,Elgin,Hoffman Estates,Chicago,Chicago,Evanston


In [30]:
raw_me_recent.head().T

Unnamed: 0,0,1,2,3,4
CASENUMBER,ME2022-07866,ME2022-07867,ME2022-07868,ME2022-07865,ME2022-07864
INCIDENT_DATE,2022-09-08 20:30:00,2022-09-08 20:00:00,2022-08-31 16:00:00,2022-08-26 00:00:00,2022-09-08 11:40:00
DEATH_DATE,2022-09-08 21:17:00,2022-09-08 20:56:00,2022-09-08 20:20:00,2022-09-08 18:28:00,2022-09-08 17:57:00
DECEDENT_FIRST_NAME,JACKIE,BERTULFO,STANISLAVA,BYRON,ESGAR
DECEDENT_MIDDLE_NAME,,,,,LEDEZMA
DECEDENT_LAST_NAME,WRIGHT,VARGAS MENA,BIRINCIKIENE,VINSON,PARALES
DECEDENT_AGE,54,51,74,54,29
GENDER,Male,Male,Female,Male,Male
MARITALSTATUSNAME,,U,U,,U
INCIDENT_CITY,Chicago,Chicago,,Chicago,


In [6]:
mp.sample(5)[['rd_no', 'date_occurred', 'address']]

Unnamed: 0,rd_no,date_occurred,address
62445,JD395519,2020-10-10 12:00:00,"XX E 61ST ST, CHICAGO"
269699,HL634347,2005-09-24 21:00:00,"68XX S HERMITAGE AVE, CHICAGO"
14468,HR631687,2009-11-07 18:00:00,"44XX S GREENWOOD AVE, CHICAGO"
291952,JA112382,2017-01-11 12:00:00,"48XX N FRANCISCO AVE, CHICAGO"
275987,HR666169,2009-11-29 14:30:00,"41XX W MONROE ST, CHICAGO"


# examining the dataset
- [ ] introduce the primary features, source, assumed/instructed interpretation
- [ ] `rd`
- [ ] `date_occurred`
- [ ] `address`

### person identifiers & the `rd` field
When the Chicago Police Department is made aware of a report of a missing person, policy dictates that an officer goes to the scene (or takes a report in person at the station) and does a preliminary investigation to verify that the report is a "Bona Fide" missing persons case. If it is, the officer is meant to generate a Missing Person Incident Report for each individual involved.

Importantly, each Incident Report created by CPD has its own `rd` or record number, which can be treated as a unique identifier for each unique person in the event - _if it's true that only one person was described in the Incident Report._ As we go through the records shared by CPD, we observe a couple dozen reports between 1 January 2000 and 31 December 2021 for which it appears this is not the case.

In [7]:
# how many unique `rd` numbers do we observe?
len(mp.rd_no.unique())

352911

For example, `rd_no` 'JA172886' has the same last seen date and last seen location in several rows of data, but the person details describe two distinct individuals:

* Person A is a 32 year old person born in 1985, no race documented
* Person B is a 22 year old Black person born in 1995

If we were to look only at this data and speculate about possible explanations, we could imagine that an officer got some poor or incomplete information initially, and later received new information about the case and updated the incident to describe the correct individual. Maybe the initial filing had typos in the age and birth year, and those were updated accurately. In either case, one of those persons described is not a real MP, and we should disregard those records.

Alternatively, if we do not assume that policy is always followed, then we can believe that this is an example of a record number being assigned to more than one missing person. But we want to be more scientific about this process than a simple guess, so to follow up on our theory, we submitted FOIA requests for a sample of the `rd_no`s where we believed more than one human was described in the same Incident Report, including JA172886. 
In the case of JA172886, there is a handwritten Incident Report including a narrative field, as well as a digital Supplementary Incident Report. In the narrative and digital records, two distinct humans are described as the subjects of the investigation. One is a 32 year old female with no documented race and the other is a 22 year old Black female. 
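
One way to screen for record numbers that may describe more than one person is to count the distinct values of each demographic field per `rd_no`. The sketch below is illustrative only: the toy rows and the names `rows`, `n_distinct`, and `candidates` are assumptions made for the example, not the notebook's actual pipeline or real CPD data.

```python
import pandas as pd

# Toy rows, NOT real CPD data: a hypothetical table with one row per
# (report, person description) before any collapsing.
rows = pd.DataFrame({
    "rd_no": ["X000001", "X000001", "X000002", "X000002", "X000003"],
    "age": [32, 22, 15, 15, 40],
    "year_of_birth": [1985, 1995, 2004, 2004, None],
})

# Count distinct non-null values of each field within every rd_no.
n_distinct = rows.groupby("rd_no")[["age", "year_of_birth"]].nunique()

# A report is a candidate for review if any field has more than one value.
candidates = n_distinct[(n_distinct > 1).any(axis=1)]
print(candidates)
```

A flag from a screen like this is only a starting point: as the JA172886 example shows, the underlying case files are what distinguish a corrected typo from two distinct people.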

In [8]:
# setup for query of shared rds
base_demo = ['rd_no', 'date_occurred', 'race', 'sex', 
             'age', 'age_group', 'year_of_birth',]
vc_cols = [col for col in extras.columns if ('vc_' in col) & (any(extras[col]))]
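# "|" is a regex metacharacter that matches every string, so match it
# literally with regex=False; na=False keeps missing values out of the mask
shared_rd = extras.loc[(extras.race.str.contains("|", regex=False, na=False)) | 
                       (extras.sex.str.contains("|", regex=False, na=False)) | 
                       (extras.age.str.contains("|", regex=False, na=False)) | 
                       (extras.age_group.str.contains("|", regex=False, na=False)),
                       base_demo]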

##### how many unique `rd` numbers are candidates for being shared by multiple missing humans?

In [9]:
shared_rd.shape[0]

94

In [10]:
# preview sample of the candidate records
shared_rd.sample(10)

Unnamed: 0,rd_no,date_occurred,race,sex,age,age_group,year_of_birth
80088,JA127750,,,,14.0|24.0,,1992|2002
178079,HN291179,,,,14.0|15.0,,
217488,F138939,,,,17.0|18.0,,
182555,F107845,,,,14.0|15.0,,1985|1986
117207,JC140559,,BLACK|WHITE,,,,
63420,F035945,,,,38.0|39.0,,
234930,HL580193,,,,16.0|17.0,10-16|17-30,
240169,F109997,,,,15.0|16.0,,
173357,HP246729,,,,48.0|58.0|59.0,,1948|1949|1959
35873,HL457115,,,,15.0|19.0,,


In [11]:
# if we ignore `rd`, how many records would be flagged as sharing `date_occurred` and `address`?
mp.loc[mp.address.str.strip() != "CHICAGO"].shape[0] - \
    mp.loc[mp.address.str.strip() != "CHICAGO", 
           ['date_occurred', 'address']
          ].drop_duplicates().shape[0]

17174

- one possible explanation is that records sharing a last seen date and address represent events where more than one person was reported missing, and CPD correctly assigned distinct `rd` numbers for each individual. If true, these records will be handled correctly by our data processing pipeline and each human will be represented in our final table.
- we know that several of the most common block addresses are areas with a group or foster home, and it's possible more than one youth was reported missing from one of these homes on the same day as another.
- we have also been told by staff at foster and group homes (who are often the ones making these reports for the youth in their care) that when they call for an update on a previous report and provide the `rd` they were originally given, a representative sometimes tells them "that is no longer the `rd` number" for that case and gives them a new `rd` that is supposedly assigned to the original report. If true, this overrepresents certain cases, along with their associated demographics and features like the area of occurrence, which can distort our later summaries.
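
To inspect the records behind a count like the one above, the groups that share a last seen date and address can be pulled out directly. This is a minimal sketch on toy rows; the addresses, `rd_no` values, and the names `toy`, `clusters`, `sizes`, and `extra_rows` are hypothetical.

```python
import pandas as pd

# Toy rows, NOT real CPD data.
toy = pd.DataFrame({
    "rd_no": ["X1", "X2", "X3", "X4"],
    "date_occurred": pd.to_datetime([
        "2010-01-01 12:00", "2010-01-01 12:00",
        "2010-01-01 12:00", "2011-05-05 09:00",
    ]),
    "address": ["10XX W EXAMPLE ST, CHICAGO"] * 3
               + ["20XX N SAMPLE AVE, CHICAGO"],
})

key = ["date_occurred", "address"]

# Same arithmetic as the cell above: total rows minus distinct (date, address) pairs.
extra_rows = len(toy) - len(toy[key].drop_duplicates())

# keep=False marks every member of a duplicated group, not just the repeats.
clusters = toy[toy.duplicated(subset=key, keep=False)]

# Number of distinct rd numbers attached to each shared date and address.
sizes = clusters.groupby(key)["rd_no"].nunique()
print(extra_rows)
print(sizes)
```

The grouping alone cannot say which explanation applies to a cluster; it only narrows down which reports would need to be checked against case files or with the reporting staff.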

# Record Linkage

### Introduction

Record linkage refers to the process of programmatically linking co-referent records within a collection and de-duplicating them, so that there's one row per unique identifier.

If the identifier is associated with a unique incident, then this sets up the data for incident-level analysis.

Where possible, we try to organize the data so that there's one row per unique _person_. We take a hash value of the content that refers to the same person and use the value as an "entity" id to represent the group of co-referent records. This preserves the option to inspect the grouped, person-level data later on. However, person-level analysis is not always feasible, especially with anonymized datasets like the CPD MP records. There, we will only be able to link to an MP report via an `rd`.
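
As a rough illustration of the "entity" id idea, the sketch below hashes the identifiers of a group of co-referent records into one stable value. The function name `entity_id`, the record id format, and the choice of SHA-1 truncated to 12 characters are all assumptions for the example, not a description of HRDAG's actual implementation.

```python
import hashlib

def entity_id(record_ids):
    """Hash a group of co-referent record ids into one stable entity id."""
    # Sort first so the id does not depend on the order the records arrived in.
    canonical = "|".join(sorted(record_ids))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:12]

# Hypothetical record ids from two sources that refer to the same person.
group_a = ["src1:0042", "src2:0917"]
group_b = ["src2:0917", "src1:0042"]   # same records, different order
other = ["src1:0043"]

print(entity_id(group_a) == entity_id(group_b))  # same group, same id
print(entity_id(group_a) == entity_id(other))    # different group, different id
```

Because the id is derived from the group's members, the grouped records can be recovered and inspected later by filtering on it.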

In our work at HRDAG, Entity Resolution usually involves combining prepared lists we receive from different sources that contain names of humans who have been killed in a violent conflict. We use the structured information about overlaps between sources with other statistical techniques like [multiple systems estimation, or MSE](https://hrdag.org/2013/03/11/mse-the-basics/). We hope to enable MSE because of the data limitations stated above; we know that some people are more likely to be documented than others and we want to use all available tools to estimate how many people are missing from the data.
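
To give a sense of how overlaps between lists carry information about who is missing from all of them, here is the simplest two-list case (the Lincoln-Petersen estimator). The counts are made up, and the sketch assumes the two lists are independent, which is rarely true in practice; real MSE work uses three or more lists and models the dependence between them.

```python
def two_list_estimate(n_a: int, n_b: int, n_both: int) -> float:
    """Lincoln-Petersen estimate of the total population from two lists."""
    if n_both == 0:
        raise ValueError("no overlap: the two-list estimate is undefined")
    return n_a * n_b / n_both

# Hypothetical counts: list A has 100 people, list B has 80, 40 appear on both.
n_a, n_b, n_both = 100, 80, 40

total = two_list_estimate(n_a, n_b, n_both)
documented = n_a + n_b - n_both        # people on at least one list
undocumented = total - documented      # estimated people on neither list

print(total, documented, undocumented)
```

The overlap count `n_both` is exactly what record linkage produces, which is why linkage errors feed directly into the estimate.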

If the OEMC and CPD records were not anonymous, then we could do Entity Resolution across these databases and perform person-level analysis of outcomes where the missing person was ultimately found deceased. Where possible, we could trace reports that began as a 911 call to OEMC to their CPD report, and later ME record. Or, focus on the CPD and ME overlaps to study the rates of missing persons later found deceased and how these reports are classified by CPD. However, as already stated, we won't be able to do that kind of record linkage here.

Data generated by humans, and data on human rights abuses in particular, is often imprecise, so we develop and evaluate our record linkage techniques to make sure they are robust, flexible, and reliable. We won't go into much detail on it in this notebook, but for more information, here is a [blogpost series that dives into part of this framework](https://hrdag.org/2016/01/08/a-geeky-deep-dive-database-deduplication-to-identify-victims-of-human-rights-violations/) and a [demonstrative workflow](https://github.com/HRDAG/training-docs/tree/main/demo-tasks/record-linkage).

### Linking MP Reports

All CPD databases share an `rd` field that can be used to identify records relating to the same unique and original report. In fact, it's outlined in a policy from the 1980s that original reports should be re-classified and updated with new information about the underlying events to support research and analysis later on.

It's likely that the Illinois Task Force for Missing and Murdered Chicago Women expected to use the City's data and reclassifications accordingly to analyze the outcomes of reports as promised by law. However, virtually all of the MP records have their original classification, so there would be an extremely limited sample of outcomes to study on the surface. For example, there are only 10 reported links between the MP and Homicide data, out of more than 350,000 reports made over more than 20 years.

Consider the unique id fields in the datasets here. CPD uses an `rd`, OEMC uses an `eventnumber`, the ME uses a `casenumber`. We expect co-referent records across the datasets, but these id fields are distinct and won't get us those linkages. We could try another approach using names but those are not included in the CPD MP data, so we might only have productive matches from the CPD Homicide and ME databases, which we know to have dependent data generation processes.

In the beginning of the project, there didn't appear to be a productive avenue for analyzing report outcomes. But as II and CB conducted interviews and reviewed widely publicized cases, CPD `rd`s and ME `casenumber`s were revealed for a sample of cases originating as missing persons reports with known present-day outcomes. This enabled investigators to review the unstructured data in CPD case files, along with the structured MP data and, when the missing person had been found deceased, the Homicide and ME data. The National Missing and Unidentified Persons System (NamUs) was also checked for certain instances where the person is still considered missing. 

After analyzing the sample of a little over a dozen cases, we found discrepancies between the "NON-CRIMINAL" classification and ground truth events. Here are 4 of the 11 cases we identified. These cases give us deeper insight into the missing person pipeline and some underlying issues related to missing children in Chicago.

### Jahmeshia Conner

(2009) Jahmeshia Conner was a 12-year-old girl who was [last seen](https://www.documentcloud.org/documents/24399391-jahmeshiaconner_hr646333) Nov. 16, 2009 at a bus stop in Englewood. Her mother reported her missing the next day when she didn't come home from school. She was found raped and strangled on November 30, two weeks later. **Police said there were several sightings of Jahmeshia in the two weeks she was missing from home. Supt. Jody Weis subsequently announced that the department would review the way it releases alerts about missing persons, and the number of alerts sent to the media rose sharply.**

In [7]:
jahmeshia_rd = "HR646333"

#### CPD's MP data

In [10]:
mp.loc[mp.rd_no == jahmeshia_rd, basic].T

Unnamed: 0,114512
rd_no,HR646333
race,BLACK
sex,F
age,12.0
age_group,youth (10-20)
address,"65XX S MARSHFIELD AVE, CHICAGO"
date_occurred,2009-11-15 20:00:00
date_officer_arrived,NaT
notification_time,2009-11-16 18:11:00
closed_date,2009-12-02 00:00:00


#### CPD's Homicide data

In [13]:
cpd_hom.loc[(cpd_hom['VICTIM LAST NAME'] == 'CONNER') &
            (cpd_hom['VICTIM FIRST NAME'].str.contains("JAHMESHIA"))].T

Unnamed: 0,5351
RD NO,HR666816
VICTIM FIRST NAME,JAHMESHIA
VICTIM LAST NAME,CONNER
CRIME ADDRESS,64XX S MARSHFIELD AVE
VICTIM RACE,BLACK
VICTIM SEX,F
VICTIM AGE,12.0
DATE OFFICER ARRIVED,2009-11-30 06:50:00
HOMICIDE DATE/TIME,2009-11-30 06:50:00
DATE CASE CLEARED-CLOSED,2014-05-06 00:00:00


#### County ME data

In [15]:
raw_me_early.loc[(raw_me_early.DECEDENT_LAST_NAME == 'CONNER') &
           (raw_me_early.DECEDENT_FIRST_NAME == 'JAHMESHIA')].T.dropna()

Unnamed: 0,28197
CASENUMBER,09-NOV-0473
INCIDENT_DATE,2009-11-30 06:50:00
DEATHDATE,2009-11-30 00:00:00
DECEDENT_FIRST_NAME,JAHMESHIA
DECEDENT_LAST_NAME,CONNER
DECEDENT_AGE,12
GENDER,Female
RESIDENCE_CITY,Chicago
INCIDENT_CITY,Chicago
INCIDENT_COUNTY,Cook County


### Takaylah Tribbit

(2019) Takaylah Tribbitt was a 14-year-old girl who was described as a habitual runaway. She was [last seen](https://www.documentcloud.org/documents/24399394-takaylah_tribbjc416670) when she went missing from a DCFS group home on September 1, 2019. Interviews with her family and underlying records show that she had an unstable living situation, was struggling with bipolar disorder, and was groomed by an older man. There were several sightings of Takaylah during the 2 weeks before her murder. Police reports show that when detectives reached out to Takaylah's friends, they learned that a man began beating her the day after she ran away from the group home and that she wanted to return. **On September 16, her body was found in a Gary, Indiana alley that was known for sex work; she had been sexually assaulted, executed, and had her hands bound. When she was found, she was a Jane Doe. Law enforcement in Chicago put out a missing person alert for Takaylah on September 25, over a week after her body had been found.**

In [16]:
takaylah_rd = "JC416670"

#### CPD's MP data

In [17]:
mp.loc[mp.rd_no == takaylah_rd, basic].T

Unnamed: 0,274209
rd_no,JC416670
race,BLACK
sex,F
age,14.0
age_group,youth (10-20)
address,"11XX N NOBLE ST, CHICAGO"
date_occurred,2019-09-01 22:00:00
date_officer_arrived,2019-09-02 00:00:00
notification_time,2019-09-02 00:44:00
closed_date,2020-07-30 00:00:00


#### CPD's Homicide data

Takaylah's case is not reflected in the Chicago Police Homicide Data or Cook County Medical Examiner's data. Her body was found in Gary, Indiana, which is a different jurisdiction. Her missing persons case is recorded as closed non-criminal in the data, but underlying documents show she was a homicide victim.

### Desiree Robinson

(2016) Desiree Robinson was a 16-year-old girl who was reported missing by her grandfather. She was [last seen](https://www.documentcloud.org/documents/24399379-desiree_robinson_hz542552) on November 29, 2016. Reports show that she had been in touch with a friend via Facebook Messenger during the approximately 3.5 weeks that she was missing. [Media](https://www.chicagotribune.com/news/breaking/ct-met-sex-trafficking-trial-20190301-story.html) reports say that nine days before she was killed she told her friend, “I came here for a party. ... He told me he was going to take me home, but now he won’t let me leave.” **On December 19th, at 6:37pm [**Detective Barbara Glimco**](https://cpdp.co/officer/10095/barbara-glimco/) submitted a report noting that her grandfather, the complainant, related she had been located and was not a victim or offender of a crime. The report was approved by [**Sgt. Monique Washington**](https://cpdp.co/officer/30090/monique-washington/). However, family members on both her grandfather's and her mother’s side say they had not seen her since she went missing in late November.**

In [19]:
desiree_rd = "HZ542552"

#### CPD's MP data

In [20]:
mp.loc[mp.rd_no == desiree_rd, basic].T

Unnamed: 0,300992
rd_no,HZ542552
race,BLACK
sex,F
age,16.0
age_group,youth (10-20)
address,"101XX S YATES AVE, CHICAGO"
date_occurred,2016-11-29 12:00:00
date_officer_arrived,2016-12-06 18:30:00
notification_time,2016-12-06 19:11:00
closed_date,2016-12-20 00:00:00


#### CPD's Homicide data

Desiree's homicide was investigated in Markham, IL.

#### County ME data

Four days after her missing persons case was closed, Desiree's body was found in a garage. She had been trafficked by [Joseph Hazley](https://abc7chicago.com/joseph-hazley-desiree-robinson-murder-markham-garage/5330837/) through a classified advertising site called [Backpage](https://www.cbsnews.com/chicago/news/backpage-shut-down-by-the-government/) and was murdered by Antonio Rosales for refusing to do a sex act for free. She had been beaten and strangled, and her throat had been slit.

In [22]:
raw_me_recent.loc[(raw_me_recent.DECEDENT_LAST_NAME == 'ROBINSON') &
           (raw_me_recent.DECEDENT_FIRST_NAME.str.contains("DESI"))].T

Unnamed: 0,55248
CASENUMBER,ME2016-06179
INCIDENT_DATE,2016-12-24 09:20:00
DEATH_DATE,2016-12-24 09:41:00
DECEDENT_FIRST_NAME,DESIREE
DECEDENT_MIDDLE_NAME,NAUTICA TATYANNA
DECEDENT_LAST_NAME,ROBINSON
DECEDENT_AGE,16
GENDER,Female
MARITALSTATUSNAME,Never Married
INCIDENT_CITY,Markham


### Chavanna Prather

(2002) Chavanna Prather was a 17-year-old basketball player who worked at a local McDonald's; she was three months pregnant at the time she went missing. She had also been reported missing a couple of weeks earlier, and those reports noted that she was having trouble getting along with her mother’s house rules. She was [last seen](https://www.documentcloud.org/documents/24399380-chavannaprather_hh320286) on April 19, 2002, leaving work with her manager, Tony Fountain, a man nearly twice her age with whom she had been intimate. According to [court records](https://www.illinoiscourts.gov/Resources/4c607a0b-7df9-4124-af3b-0a95b64c6d5c/1103312_R23.pdf), she was murdered that night. The same records reveal that while Fountain had Chavanna in the trunk of his car, he texted a friend, “I shot that bitch. She wouldn't die.” **On April 26, 2002, at 2:06pm [**Detective Elroy Baker**](https://cpdp.co/officer/1111/elroy-baker/) submitted a closure report noting that her mom, the complainant, related that Chavanna had been located at a friend’s house and there was no indication she was a victim or offender of a crime. This report was approved by [**Lt. Carol McLaurin**](https://cpdp.co/officer/18379/carol-mc-laurin/) at 3:17pm that day.**

In [25]:
chavanna_rd = "HH320286"

#### CPD's MP data

In [26]:
mp.loc[mp.rd_no == chavanna_rd, basic].T

Unnamed: 0,149909
rd_no,HH320286
race,BLACK
sex,F
age,17.0
age_group,youth (10-20)
address,"84XX S CRANDON AV, CHICAGO"
date_occurred,2002-04-19 11:00:00
date_officer_arrived,2002-04-26 14:00:00
notification_time,2002-04-20 19:12:00
closed_date,2002-04-26 00:00:00


#### CPD's Homicide data

Less than a couple of hours after law enforcement closed her missing person case, Chavanna’s body was found in a marsh area on 117th and Torrence Ave.

Moss had begun to grow on her body, indicating she had been there for days.

In [27]:
cpd_hom.loc[(cpd_hom['VICTIM LAST NAME'] == 'PRATHER') &
            (cpd_hom['VICTIM FIRST NAME'].str.contains("CHAV"))].T

Unnamed: 0,1462
RD NO,HH332622
VICTIM FIRST NAME,CHAVANA
VICTIM LAST NAME,PRATHER
CRIME ADDRESS,117XX S TORRANCE AVE
VICTIM RACE,BLACK
VICTIM SEX,F
VICTIM AGE,17.0
DATE OFFICER ARRIVED,2002-04-26 16:20:00
HOMICIDE DATE/TIME,2002-04-26 17:00:00
DATE CASE CLEARED-CLOSED,2002-04-30 00:00:00


- [Lamarr Minor's cpdp record](https://cpdp.co/officer/19207/lamarr-minor/)

#### County ME

Autopsy reports show that she was beaten, strangled, and shot multiple times.

Although it was widely reported on, there is no pregnancy recorded in the Cook County Medical Examiner data for Prather. Nevertheless, court records indicate Fountain was convicted of "the first-degree murder of Chavana Prather and the intentional homicide of her unborn child."

In [28]:
raw_me.loc[(raw_me.DECEDENT_LAST_NAME == 'PRATHER') &
           (raw_me.DECEDENT_FIRST_NAME.str.contains("CHAV"))].T

Unnamed: 0,74605
CASENUMBER,02-APR-0478
INCIDENT_DATE,2002-04-26 16:00:00
DEATHDATE,2002-04-26 00:00:00
DECEDENT_FIRST_NAME,CHAVNA
DECEDENT_MIDDLE_NAME,
DECEDENT_LAST_NAME,PRATHER
DECEDENT_AGE,17
GENDER,Female
RESIDENCE_CITY,Chicago
INCIDENT_CITY,Chicago


### Record linkage summary

Misspelled names plague the city datasets. Chavanna's name is spelled 3 different ways across media, CPD, and County ME sources. Sadaria Davis's CPD Homicide record has her first name as 'Sandra'. Data entry errors like these are why we adapt our record linkage processes to include possible pairs between databases that don't match on full name, but do match on a combination of other features like age and last seen year/month. However, it's easy to imagine a busy officer or reviewer checking for a case by exact name in one or several of these databases. In that context, a hasty data entry error combined with a hasty search would lead to the false belief that the missing person they were looking for had not been found deceased.
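
The idea of pairing records on other features and then scoring names, instead of requiring an exact name match, can be sketched as follows. The two Prather records use the spellings, age, and month shown in the CPD Homicide and ME outputs above; the "JANE DOE" row, the 0.8 threshold, and the helper names are hypothetical, and `difflib` stands in for whatever similarity measure a real pipeline would use.

```python
from difflib import SequenceMatcher

homicide = [
    {"first": "CHAVANA", "last": "PRATHER", "age": 17, "year_month": "2002-04"},
]
me = [
    {"first": "CHAVNA", "last": "PRATHER", "age": 17, "year_month": "2002-04"},
    # Hypothetical unrelated record that falls in the same block.
    {"first": "JANE", "last": "DOE", "age": 17, "year_month": "2002-04"},
]

def full_name(rec):
    return f"{rec['first']} {rec['last']}"

def name_similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()

def candidate_pairs(left, right, threshold=0.8):
    """Block on (age, year_month), then keep pairs with similar names."""
    pairs = []
    for l in left:
        for r in right:
            if (l["age"], l["year_month"]) != (r["age"], r["year_month"]):
                continue  # different block: never compared
            score = name_similarity(full_name(l), full_name(r))
            if score >= threshold:
                pairs.append((full_name(l), full_name(r), score))
    return pairs

pairs = candidate_pairs(homicide, me)
exact = [p for p in pairs if p[0] == p[1]]
print(pairs)   # the misspelled pair is kept
print(exact)   # an exact-name search finds nothing
```

Pairs that survive a step like this are candidates for review, not confirmed matches.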

All 4 of the underage Black girls whose stories we examined closely above have "CLOSED NON-CRIMINAL" as their missing persons case `status`, demonstrating that this field is not synonymous with a missing person being located and has in fact been applied to cases where the missing person was found murdered.

- In the final supplementary report where an officer requests to close Jahmeshia Conner's missing person case and mark it "Non-criminal", he states that the missing had been located in an alley and positively identified by the Medical Examiner, and even includes the `rd` number for the CPD Homicide case. In reading this report, **there is no room to claim that CPD did not have knowledge that Jahmeshia had been murdered at the time of closing her missing person case and classifying it as "Non-criminal".** The request was approved just 8 minutes after being submitted to a supervisor.
- Takaylah Tribbit's missing persons case was closed as "Non-criminal" on July 30, 2020, almost [9 months after she had been identified as a gunshot victim](https://www.nwitimes.com/news/local/crime-courts/teen-found-dead-in-gary-alley-was-reported-missing-by/article_50ae1147-b37f-501b-935d-d8c23071737e.html) in Indiana. In articles published when she was identified, Chicago police are reported as acknowledging that she had been missing since September. So, **how could CPD lack knowledge in July 2020 that she had been the victim of a crime?**
- While her friend called around to find her, the police closed Desiree's missing persons case two weeks after the recorded officer arrival, citing a conversation in which the complainant said she had been located. **Desiree's relatives have contradicted the police narrative that she returned home between when she went missing and when her body was found.**
- Police records claim an officer spoke with the complainant days after the initial missing persons report for Chavanna and they related that she had been located at a friend's house. **How is that possible if court records show she was killed the night she went missing?**
    - Here is the timeline of April 26th, 2002, the day Chavanna's body was found and her missing persons case closed:
        - At 2:06pm an officer submits a case supp report to close her missing persons case, citing a conversation with complainant.
        - At 4pm, the Medical Examiner creates the death report.
        - At 4:20pm, CPD reports an officer arriving to the report of her homicide. The homicide case has a different `rd` number and there is nothing in her missing persons report to connect it to the homicide case, other than her misspelled name and presumably the case details in the unstructured reports.
        - The time of homicide in CPD's records is 5pm.

Hindsight includes criminal convictions for the persons responsible for Desiree and Chavanna's murders, and these are surely the kind of cases that the 1984 Missing Child Recovery Act was intended to enable the study and prevention of.

However, when "CLOSED NON-CRIMINAL" is used in cases like these that did in fact involve criminal events, the label loses its meaning _and_ functionally severs the connection to those related events and outcomes that would otherwise be included in the study. As a result, using the `status` field to respond to questions about human outcomes would be inconsistent with the ground truth for at least these cases. The true prevalence of this misclassification in other cases can only be discovered through systematic review of all CPD case materials and, most crucially, interviews with complainants and other stakeholders who may have different information than what is presented in police narratives.

### Conclusion

Law enforcement is often the agency with the most contact with vulnerable populations who are likely to go missing, whether people are dealing with mental health issues, substance use disorder, domestic violence, or human trafficking. They generate the data that legislation and researchers lean on to come to conclusions about violence prevention and addressing crime.

Through a comparison of law enforcement operating manuals, Illinois legislation, and CPD data, we have seen the disconnect that can come between the creation of a policy and its implementation. What’s most dangerous about this is the use of the data by officials higher in the chain who may not realize the disconnect between the data workers, policy, and law enforcement, and take its presentation at face value. Lack of awareness about how the data is created and what challenges officers face trying to manage and update cases affects the reliability of the data and statements made from it in ways that must be examined critically.

When the state initially crafted a policy around missing persons cases, it was hoping to provide a factual and statistical base for research that would address the problem of missing people in Illinois. However, the information we can learn today about two decades of missing persons cases in Chicago is extremely limited given that the core `status` field is not reliable and no structured fields contain outcome information. The state Task Force for Missing and Murdered Chicago Women would be able to more easily examine and report underlying patterns related to missing persons and make effective suggestions for violence prevention, as it was created to do, if Chicago Police reclassified and closed cases in a reliable and transparent way that reflects the ground truth of what happened. Instead, that information about the outcomes of the more than 350,000 reports accepted by CPD since January 1, 2000, is lost.

This legislative body can still work to examine the records, but will need to keep data quality issues in mind during its review. Although we can’t learn about the outcome of all the cases en masse, we can provide stakeholders with the limited information we can gather from the datasets, such as rates of officer arrival, case closure, distribution of cases by race/age/sex of the missing person, the implications of ShotSpotter on case priority, and a few other things we will present in subsequent notebooks.

### Discussion

Systematically identifying still-missing humans and other types of misclassification for an audit of the MP data would involve a review of underlying case materials, as well as crucial interviews and statements from complainants. Without funding and community participation to conduct that expansive review, the information about those case outcomes is lost as these instances can't be distinguished from reports that were closed appropriately using the structured data alone. With labeled training data that represents the true outcome of say, 1,000 cases in the database, we may be able to develop a machine learning model that can classify the remainder of the case data. This would involve a robust review process to evaluate and improve the classification accuracy, and may be most useful as a tool for suggesting which cases are most likely to have been misclassified and should be reviewed further by a human.

That being said, we could analyze the case outcomes where the reported missing person was ultimately found deceased within the county because there should be a record in the ME data in such cases. To do this, we need the unredacted MP data that includes full names of the missing humans to make identifications between databases possible. Connections between the two databases do not strictly imply that a missing person became deceased while missing, but other databases such as NamUs, as well as media reports and family interviews, can help fill in the timeline and those details.

Regardless of what analysis the Task Force and other community stakeholders are interested in, the discrepancies in CPD's records presented in this notebook will continue to be possible without intervention, and could affect any of the, on average, dozens of new reports received by CPD on a daily basis.