# COGS 108 - Project Proposal

## Authors

- Timothy Carraher: Conceptualization, Writing – original draft
- Mihir Vad: Background research, Data curation, Writing – original draft
- Matthew Do: Background research, Writing – original draft
- Leah Maltezos: Background research, Writing – original draft

## Research Question

Are NYPD officers of certain races more likely to receive civilian complaints
than expected based on their representation within the police force?

Using NYPD personnel data and civilian complaint records, we will compare the proportion of complaints received by officers of each race to their proportion within the police officer population. This is a statistical inference task: we will assess whether the observed distribution of complaints across officer racial groups differs significantly from the distribution expected if complaints were proportional to each group's share of the force. To evaluate disproportionality, we will use statistical tests such as a chi-square test, along with simple regression models of complaint rates.
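One way to make this comparison concrete is a chi-square goodness-of-fit test. The sketch below is only an illustration under stated assumptions: the counts and workforce shares are made-up placeholders, not values from the CCRB or NYPD data, and the helper names `chi2_sf` and `goodness_of_fit` are hypothetical. It uses only the Python standard library.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with df degrees of freedom."""
    z = x / 2.0
    # Closed-form starting points: Q(1, z) for even df, Q(1/2, z) for odd df.
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    # Recurrence: Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

def goodness_of_fit(observed, shares):
    """Chi-square goodness-of-fit of complaint counts against workforce shares."""
    total = sum(observed.values())
    expected = {g: total * shares[g] for g in observed}
    stat = sum((observed[g] - expected[g]) ** 2 / expected[g] for g in observed)
    df = len(observed) - 1
    return stat, df, chi2_sf(stat, df), expected

# Placeholder numbers for illustration only, NOT values from the CCRB or NYPD data.
complaints_by_race = {"White": 480, "Black": 170, "Hispanic": 270, "Asian": 80}
workforce_share = {"White": 0.46, "Black": 0.15, "Hispanic": 0.30, "Asian": 0.09}

stat, df, p_value, expected = goodness_of_fit(complaints_by_race, workforce_share)
print(f"chi-square = {stat:.2f}, df = {df}, p = {p_value:.3f}")
for group, count in complaints_by_race.items():
    # A ratio above 1 means more complaints than the workforce share predicts.
    print(f"{group}: observed/expected = {count / expected[group]:.2f}")
```

In a notebook where SciPy is installed, `scipy.stats.chisquare(f_obs, f_exp)` should give the same statistic and p-value. The per-group ratios show which groups drive any overall difference.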

## Background and Prior Work

Racial disparities in policing are well documented in the United States, with many reports on
how race influences police behavior and how civilians perceive law enforcement. Multiple
studies have shown that communities of color, such as Black and Latino populations,
experience disproportionate rates of police stops, use of force, and arrests.<a
href="#fn1"><sup>1</sup></a> However, less attention has been paid to whether officers of
different racial backgrounds receive higher or lower rates of civilian complaints, and whether complaint
patterns might reflect biases in how the public perceives a police officer.

The New York City Civilian Complaint Review Board keeps one of the largest databases of
police misconduct complaints in the United States, which contains detailed allegations against
NYPD officers. Prior studies have found that officers with previous complaint histories are more
likely to receive future complaints than those without a complaint history, suggesting that a small
proportion of officers account for a disproportionate share of police misconduct.<a
href="#fn2"><sup>2</sup></a>

Whether an officer's race influences complaint rates can reveal complex dynamics of
implicit bias and favoritism. Psychological research has shown that people of all races can have
implicit biases that affect their judgments.<a href="#fn3"><sup>3</sup></a> Some believe that
racial compatibility between officers and community members can improve trust and
cooperation, while others believe that training may be more determinative of officer behavior
than demographic characteristics.

Another study finds that many major U.S. police forces do not share the racial composition
of the communities that they patrol. The NYPD, despite being one of the more diverse police
departments nationally, still shows disparities between officer demographics and the city's
demographics.<a href="#fn4"><sup>4</sup></a> Understanding whether officers of certain
racial backgrounds receive complaints at higher rates than others can shed light on
potential biases in civilian complaints.

References:

<a name="fn1">1.</a> Lofstrom, M., Hayes, J., Martin, B., & Premkumar, D. (Oct. 2021). Racial
Disparities in Law Enforcement Stops. Public Policy Institute of California.
https://www.ppic.org/publication/racial-disparities-in-law-enforcement-stops/

<a name="fn2">2.</a> Anyaso, H. (1 Aug. 2019). Police Officers’ Exposure to Peers Accused of
Misconduct Shapes Their Subsequent Behavior.
https://www.ipr.northwestern.edu/news/2019/papachristos-police-misconduct-study.html

<a name="fn3">3.</a> Carbado, D. (May 2018). The Black Police: Policing Our Own.
https://harvardlawreview.org/print/vol-131/the-black-police-policing-our-own/

<a name="fn4">4.</a> Smith, G. (3 Sept. 2019). Despite Diversity Gains, Top NYPD Ranks Fall
Short of Reflecting Communities.
https://www.thecity.nyc/2019/09/03/despite-diversity-gains-top-nypd-ranks-fall-short-of-reflecting-communities/

## Hypothesis


We hypothesize that Black and Hispanic NYPD officers will receive civilian complaints at higher
rates than expected based on their share of the force, while white officers will receive complaints at rates lower than
expected. This hypothesis rests on the possibility that officers of color are
disproportionately assigned to higher-crime areas, which generally involve more civilian
interactions. More interactions could create greater community tension, and implicit biases could
lead civilians to file more complaints against officers who are racial minorities.

## Data

The ideal dataset to answer our research question would include officer records (rank, race, tenure, precinct, and number of complaints) and civilian complaint data (race, time, area, and complaint type). There would need to be tens of thousands of observations across many years, spread among racial groups, to make any meaningful comparisons. This data would be collected from NYPD complaint records and publicly available NYC data sources. It would be organized by NYPD officer badge number, with each complaint as one observation, in a tidy format.

### Datasets

#### CCRB Complaints Dataset
Source: Data Store Archive — ProPublica: https://projects.propublica.org/datastore/#civilian-complaints-against-new-york-city-police-officers  
Records: Approximately 33,358 closed complaints  
Variables: Officer race, allegation type, incident date, disposition, and outcomes  
Purpose: Analyze complaint patterns against NYPD officers by officer race  
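
A practical detail when tallying complaints by officer race: if the file turns out to store one row per allegation rather than one per complaint (an assumption to verify when loading the data), a single complaint could be counted several times. The sketch below shows one way to count each complaint once per officer racial group. The rows are made up, and the column names `complaint_id`, `officer_race`, and `allegation` are hypothetical rather than the dataset's confirmed schema.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration. Column names are assumptions,
# not the confirmed schema of the CCRB complaints file.
SAMPLE_CSV = """complaint_id,officer_race,allegation
1001,White,Force
1001,White,Discourtesy
1002,Black,Abuse of Authority
1003,White,Force
1003,Hispanic,Force
"""

def complaints_per_race(rows):
    """Count each complaint once per officer racial group named in it."""
    seen = set()
    counts = Counter()
    for row in rows:
        key = (row["complaint_id"], row["officer_race"])
        if key in seen:
            continue  # another allegation from a complaint already counted
        seen.add(key)
        counts[row["officer_race"]] += 1
    return counts

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
counts = complaints_per_race(rows)
print(dict(counts))
```

Deduplicating on complaint and race avoids relying on any officer identifier. The trade-off is that two officers of the same race named in one complaint are counted as a single complaint for that group.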

#### NYPD Personnel Demographics
Source: NYC Open Data: https://data.cityofnewyork.us/Public-Safety/NYPD-Personnel-Demographics/5vr7-5fki/about_data  
Content: Racial composition of NYPD officers  
Purpose: Establish expected proportions for officer complaint analysis  

#### NYPD Arrests Data (Historic)
Source: NYC Open Data:  https://data.cityofnewyork.us/Public-Safety/NYPD-Arrests-Data-Historic-/8h9b-rp9u/about_data  
Records: 5,986,025 arrest records  
Variables: Perpetrator race (PERP RACE), age group, arrest date, location (precinct, borough), offense description, charge level (felony/misdemeanor)  
Time Range: Historical data spanning multiple years  
Purpose: Analyze arrest patterns across racial groups  

#### NYC Population Demographics (ACS)
Source: American Community Survey (ACS) 2012–2016: https://data.cityofnewyork.us/City-Government/Demographic-and-Housing-Profiles-by-Borough/cu9u-3r5e/about_data   
Content: Population race proportions across NYC  
Proportions: White (32.1%), Black (24.3%), Asian (13.9%), Hispanic (28.9%), American Indian (0.4%), Pacific Islander (0.1%), Two or More (1.5%)  
Total Population: Approximately 8.5 million residents  
Purpose: Establish expected proportions for arrest analysis


  

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection 
- [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

Our project uses publicly available government and research datasets from sources such as NYC Open Data rather than collecting data directly from individuals. Because we are not interacting with individuals or collecting personal responses, traditional informed consent does not apply. However, we will still use the data responsibly and follow any data use guidelines provided by the original data sources. We will also be careful to present results in ways that do not harm individuals or communities.

- [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

There may be bias in civilian complaint data because not all misconduct gets reported. For example, some people may not file complaints if they don't trust the police, don't know how to report, or think it won't make a difference. Reporting may also vary between communities. Because of this, complaint numbers don't perfectly reflect officer behavior. In our project, we will keep this limitation in mind and avoid treating complaint counts as direct proof of misconduct.

- [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

We will only use data that does not include personally identifying information or that is aggregated so individual officers cannot be identified. For example, we will not use officer names, badge numbers, or other identifying details. All results will be reported in group or summary form instead of focusing on individuals. This will help protect privacy and reduce the risk of someone being singled out or misidentified. We want to make sure our analysis focuses on overall trends rather than specific people.
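
As a rough sketch of this approach, identifying fields can be dropped as soon as the data is loaded, so that only group-level counts move forward into the analysis. The field names and records below are hypothetical and made up for illustration. The real columns would need to be checked against each dataset.

```python
from collections import Counter

# Hypothetical field names; the real columns must be checked against the data.
PII_FIELDS = {"first_name", "last_name", "shield_no"}

def strip_pii(record, pii_fields=PII_FIELDS):
    """Return a copy of the record without identifying fields."""
    return {k: v for k, v in record.items() if k not in pii_fields}

def summarize_by_group(records, group_field):
    """Report only group-level counts, never individual rows."""
    return Counter(r[group_field] for r in records)

# Made-up records for illustration only.
raw = [
    {"first_name": "Pat", "last_name": "Doe", "shield_no": "0000", "officer_race": "White"},
    {"first_name": "Sam", "last_name": "Roe", "shield_no": "0001", "officer_race": "Black"},
    {"first_name": "Lee", "last_name": "Poe", "shield_no": "0002", "officer_race": "White"},
]
clean = [strip_pii(r) for r in raw]
summary = summarize_by_group(clean, "officer_race")
print(dict(summary))
```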
       
- [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

Because this project studies racial differences, it is important to interpret results carefully and avoid reinforcing stereotypes. We will frame findings as structural or systemic patterns rather than individual blame, emphasize the limitations of the data when presenting results, and explain our findings so they are not taken out of context. Our goal is to help people understand patterns in the data, not make assumptions about individuals or groups.

### B. Data Storage
- [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

All the datasets we use are publicly available and will be stored securely in password-protected university environments such as Datahub. Access will be limited to our project group. Even though the data is public, we will still handle it responsibly and avoid storing unnecessary copies. Project files will only be shared within our group.

- [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
- [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

Data will only be stored for the duration of the course project. After the project is completed, local copies will be deleted and only final analysis outputs will remain in the project repository.

### C. Analysis
- [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

Our analysis focuses on complaint data and officer demographics, which does not capture the full social and historical context of policing. We will be sure to acknowledge that complaint data does not fully represent all community experiences and will avoid making broad claims about behavior or intent.

- [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

Complaint data may reflect reporting bias, systemic bias, and uneven enforcement patterns. Not all misconduct is reported, and reporting rates may vary across different communities or neighborhoods. Because of this, complaint numbers may not fully represent actual behavior or incidents. We will test distributions carefully and clearly separate correlation from causation when interpreting results. We will also acknowledge these limitations when presenting our findings.
       
- [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

All visualizations and statistics will be presented clearly and without exaggeration. We will avoid making graphs that could be misleading or confusing to readers. We will not cherry-pick by showing only the data that supports our expectations. We will also explain our results carefully and avoid drawing broad conclusions that the data does not support.

- [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

No personally identifiable information will be shown in visualizations or shared outputs. Results will only be shown as group data. This means we will focus on trends across groups rather than individual officers. This helps protect privacy and reduces the risk of misidentifying or targeting individuals.

- [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

All code, data cleaning steps, and analysis methods will be documented in the notebook to ensure reproducibility. This allows others to understand how results were created and verify our work if needed. Keeping clear records also helps prevent mistakes and improves transparency in our research process.

### D. Modeling
- [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

We will check whether any variables might indirectly reflect race or other sensitive demographic information. If one does, we will reconsider using it or explain why it is necessary. Our goal is to avoid creating unfair or misleading results.

- [ ] **D.2 Fairness across groups**  
- [ ] **D.3 Metric selection**  
- [ ] **D.4 Explainability**  
- [X] **D.5 Communicate limitations**

We will clearly explain the limits of complaint data and avoid presenting results as proof of individual officer behavior or intent. Complaint data shows reported incidents, not confirmed misconduct. We will also explain that just because two things are related does not mean one causes the other.

### E. Deployment
- [ ] **E.1 Monitoring and evaluation**  
- [ ] **E.2 Redress**  
- [ ] **E.3 Roll back**  
- [ ] **E.4 Unintended use**


## Team Expectations 

* *Team Expectation 1*  Respectful and timely communication via text group chat
* *Team Expectation 2*  Clear delegation of roles, with accountability if roles are not completed well or on time
* *Team Expectation 3*  Planning ahead: all team members read the assignment, understand what is being asked, and raise questions early rather than waiting until the last minute
* *Team Expectation 4*  If conflict arises, deal with it in a democratic and respectful manner
* *Team Expectation 5*  Feedback should be constructive

## Project Timeline Proposal

| Meeting Date | Meeting Time | Tasks | Discuss at Meeting |
|-------------|--------------|-------|-------------------|
| 2/4 | 7 PM | Finalize and turn in project proposal | Finish Tasks |
| 2/11 | 6 PM | Comb through datasets; data wrangle; choose plan; discuss analysis | Finish Tasks |
| 2/18 | 12 PM | Finalize and turn in Data Checkpoint | Finish Tasks |
| 2/24 | 6 PM | Work on analysis and visualizations; prepare EDA | Finish Tasks |
| 3/4 | 12 PM | Finalize and turn in EDA Checkpoint | Finish Tasks |
| 3/10 | 6 PM | Discuss final roles and visualization changes | Finish Tasks |
| 3/13 | 6 PM | Final touches and problem fixing | Finish Tasks |
| 3/18 | Before 11:59 PM | Turn in Final Project & Surveys | Finish Tasks |