Enhancing Public Safety: A Comprehensive Analysis of Geospatial Shifts in Los Angeles and Helping Field Police Forces

Our team, Treeo, consists of three members:
Ashley Rauch will act as the POC for the group - ashrauch4
Madeyln Forster - mgforste
Brealin Redecker - brealinredecker

Overview:
Our dataset covers all crime incidents reported to the LAPD going back to 2020. The core problem is straightforward: the LAPD has a finite number of patrol units and needs to figure out where to put them. That's a harder question than it sounds, and historical hotspot maps alone aren't enough to answer it.

Most crime prediction models look at static relationships, such as linking demographic data to crime rates, but that approach tends to reinforce over-policing in areas that already get heavy surveillance. We're trying to do something different. Instead of just predicting where crime will happen, we're framing this as a deployment question: where should the LAPD actually field its forces? That shift matters because it forces us to deal with a problem that pure prediction models usually ignore: reported crime isn't the same as actual crime. If an area has less police presence, fewer crimes get reported there, so a model trained only on reported data will keep sending resources to the same places and neglecting everywhere else.
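The feedback loop described above can be illustrated with a toy simulation. This is a minimal sketch under invented assumptions, not a model of the LAPD: two areas have identical true crime, the chance that a crime is reported rises linearly with an area's share of patrols, and a hypothetical "hotspot" policy sends most patrols to whichever area has the most reports so far. The function name and every parameter value are made up for illustration.

```python
import numpy as np

def simulate_hotspot_feedback(true_crime, initial_share, base_report=0.3,
                              patrol_effect=0.6, focus=0.8, periods=5):
    """Toy, deterministic feedback loop between patrol share and reported crime."""
    true_crime = np.asarray(true_crime, dtype=float)
    share = np.asarray(initial_share, dtype=float)
    cumulative_reports = np.zeros_like(true_crime)
    for _ in range(periods):
        # Assumption: reporting probability grows linearly with patrol share.
        report_prob = np.clip(base_report + patrol_effect * share, 0.0, 1.0)
        cumulative_reports += true_crime * report_prob
        # Hypothetical hotspot policy: the area with the most reports so far
        # gets `focus` of the patrols; the rest is split evenly among the others.
        top = int(np.argmax(cumulative_reports))
        share = np.full(len(true_crime), (1.0 - focus) / (len(true_crime) - 1))
        share[top] = focus
    return share, cumulative_reports

# Both areas have the same true crime; area 0 merely starts with more patrols.
final_share, reports = simulate_hotspot_feedback([100, 100], [0.6, 0.4])
print("final patrol share:", final_share.round(2))
print("cumulative reported crime:", reports.round(1))
```

Even though true crime is equal in both areas, the area that started with more patrols ends up with more reported crime and keeps the larger patrol share. That is the lock-in our reporting adjustment is meant to counter.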

To account for this, we're building in an adjustment that estimates how police presence affects the likelihood that a crime actually gets reported. We're also separating chronic high-crime areas — places where crime has been consistently high and is well understood — from emerging high-volatility areas where patterns are actively shifting. The goal is to make sure resources don't just get locked into historical patterns. We'll also build in checks against over-concentration, like flagging neighborhoods where reported crime looks suspiciously low given what we'd expect, and setting minimum coverage thresholds so no district gets completely ignored.
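As a rough illustration of two of these ideas, the sketch below labels areas as chronic or emerging and then allocates patrol units with a minimum coverage floor. It is only a sketch under stated assumptions: the monthly counts are synthetic, the thresholds (`high_mean`, `high_cv`), the function names, and the unit totals are hypothetical, and the weights are raw reported counts rather than the reporting-adjusted estimates we plan to use.

```python
import pandas as pd

# Synthetic monthly reported-crime counts per area (illustrative only).
monthly = pd.DataFrame({
    "Area A": [90, 95, 92, 94, 91, 93],   # consistently high
    "Area B": [20, 25, 40, 60, 85, 110],  # rising quickly
    "Area C": [10, 12, 9, 11, 10, 8],     # consistently low
})

def classify_areas(counts, high_mean=50.0, high_cv=0.4):
    """Label each column by its mean level and its coefficient of variation."""
    mean = counts.mean()
    cv = counts.std() / mean
    labels = pd.Series("stable-low", index=counts.columns)
    labels[mean >= high_mean] = "chronic"
    labels[cv >= high_cv] = "emerging"  # volatility takes precedence over level
    return labels

def allocate_units(weights, total_units, min_units=2):
    """Give every area `min_units`, then split the rest in proportion to `weights`."""
    remaining = total_units - min_units * len(weights)
    if remaining < 0:
        raise ValueError("not enough units to meet the minimum coverage floor")
    raw = weights / weights.sum() * remaining
    units = raw.astype(int)  # round down first
    # Hand the leftover units to the areas with the largest fractional parts.
    leftover = remaining - units.sum()
    order = (raw - units).sort_values(ascending=False).index[:leftover]
    units.loc[order] += 1
    return units + min_units

labels = classify_areas(monthly)
units = allocate_units(monthly.mean(), total_units=20, min_units=2)
print(labels)
print(units)
```

The floor guarantees that a low-count area such as Area C still receives coverage, which is the property the minimum coverage threshold is meant to enforce.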

Link to Data: "https://catalog.data.gov/dataset/crime-data-from-2020-to-present"
Our crime reporting dataset covers crime incidents in the City of Los Angeles dating back to 2020, as recorded by the Los Angeles Police Department (LAPD). It has 28 columns, a mix of string, integer, and float features, and just over one million rows (1,004,991). Because it is published by the LAPD itself, it is the authoritative record of reported incidents, which makes it a reasonable basis for modeling, although official data can still contain entry errors and, as discussed above, it reflects reported rather than actual crime. Police departments are also legally obligated to disclose information, including crime data and records of misconduct, under public records laws (for a city agency like the LAPD this is the California Public Records Act; the federal Freedom of Information Act applies to federal agencies). The dataset's metadata states that it was published by the LAPD, that it is publicly available, and that it was last updated on January 2, 2026.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

In [4]:
#Loading in the data
data = pd.read_csv("Crime_Data_from_2020_to_Present.csv")

In [5]:
#This gives summary stats for the numeric columns in our dataset (describe() skips the string columns by default).
data.describe()

statistic,DR_NO,TIME OCC,AREA,Rpt Dist No,Part 1-2,Crm Cd,Vict Age,Premis Cd,Weapon Used Cd,Crm Cd 1,Crm Cd 2,Crm Cd 3,Crm Cd 4,LAT,LON
count,1004991.0,1004991.0,1004991.0,1004991.0,1004991.0,1004991.0,1004991.0,1004975.0,327247.0,1004980.0,69160.0,2314.0,64.0,1004991.0,1004991.0
mean,220221500.0,1339.9,10.69174,1115.633,1.400348,500.1568,28.91706,305.6201,363.9553,499.9174,958.101258,984.01599,991.21875,33.99821,-118.0909
std,13197180.0,651.0613,6.110255,611.1605,0.4899691,205.2731,21.99272,219.3021,123.734528,205.0736,110.354348,52.350982,27.06985,1.610713,5.582386
min,817.0,1.0,1.0,101.0,1.0,110.0,-4.0,101.0,101.0,110.0,210.0,310.0,821.0,0.0,-118.6676
25%,210616900.0,900.0,5.0,587.0,1.0,331.0,0.0,101.0,311.0,331.0,998.0,998.0,998.0,34.0147,-118.4305
50%,220915900.0,1420.0,11.0,1139.0,1.0,442.0,30.0,203.0,400.0,442.0,998.0,998.0,998.0,34.0589,-118.3225
75%,231110300.0,1900.0,16.0,1613.0,2.0,626.0,44.0,501.0,400.0,626.0,998.0,998.0,998.0,34.1649,-118.2739
max,252104100.0,2359.0,21.0,2199.0,2.0,956.0,120.0,976.0,516.0,956.0,999.0,999.0,999.0,34.3343,0.0
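The summary above shows a minimum LAT of 0.0, a maximum LON of 0.0, a minimum Vict Age of -4, and a 25th-percentile Vict Age of 0. None of these are plausible for incidents in Los Angeles, so one reading is that they are placeholders for missing values. The sketch below, run on a small made-up frame that reuses the real column names, shows one way to turn them into explicit NaNs. Treating ages of zero or below as unknown is our assumption, and `mask_placeholders` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the suspicious values seen in describe().
sample = pd.DataFrame({
    "LAT": [34.0147, 0.0, 34.1649],
    "LON": [-118.4305, 0.0, -118.2739],
    "Vict Age": [30, 0, -4],
})

def mask_placeholders(df):
    """Return a copy with placeholder coordinates and ages set to NaN."""
    out = df.copy()
    # (0, 0) is not a location in Los Angeles, so treat it as missing.
    no_coords = (out["LAT"] == 0) & (out["LON"] == 0)
    out.loc[no_coords, ["LAT", "LON"]] = np.nan
    # Assumption: ages of zero or below mean "unknown", not an infant victim.
    out["Vict Age"] = out["Vict Age"].where(out["Vict Age"] > 0)
    return out

cleaned = mask_placeholders(sample)
print(cleaned)
```

Making these values explicit NaNs keeps them from distorting any clustering on LAT and LON, where a single (0, 0) point would sit far from every real incident.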


In [6]:
#Quick examination of the data. 
#This gives information on the whole dataset - column names, column count, row count, null count and the type of data in each column.
data.info()

<class 'pandas.DataFrame'>
RangeIndex: 1004991 entries, 0 to 1004990
Data columns (total 28 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   DR_NO           1004991 non-null  int64  
 1   Date Rptd       1004991 non-null  str    
 2   DATE OCC        1004991 non-null  str    
 3   TIME OCC        1004991 non-null  int64  
 4   AREA            1004991 non-null  int64  
 5   AREA NAME       1004991 non-null  str    
 6   Rpt Dist No     1004991 non-null  int64  
 7   Part 1-2        1004991 non-null  int64  
 8   Crm Cd          1004991 non-null  int64  
 9   Crm Cd Desc     1004991 non-null  str    
 10  Mocodes         853372 non-null   str    
 11  Vict Age        1004991 non-null  int64  
 12  Vict Sex        860347 non-null   str    
 13  Vict Descent    860335 non-null   str    
 14  Premis Cd       1004975 non-null  float64
 15  Premis Desc     1004403 non-null  str    
 16  Weapon Used Cd  327247 non-null   float64
 17  
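The listing above shows that DATE OCC is stored as a string and TIME OCC as an integer, and the earlier summary shows TIME OCC ranging from 1 to 2359, which suggests 24-hour HHMM values. Since our analysis depends on how patterns shift over time, a single timestamp column would be useful. The sketch below builds one on made-up rows. The date format string is an assumption about how the CSV stores dates, and `occurred_at` is a hypothetical column name.

```python
import pandas as pd

# Made-up rows; the date format below is an assumption about the CSV.
sample = pd.DataFrame({
    "DATE OCC": ["03/01/2020 12:00:00 AM", "11/15/2023 12:00:00 AM"],
    "TIME OCC": [1, 2359],  # HHMM as an integer, e.g. 2359 means 23:59
})

# Keep only the calendar date from DATE OCC.
date_part = pd.to_datetime(
    sample["DATE OCC"], format="%m/%d/%Y %I:%M:%S %p"
).dt.normalize()

# Split HHMM into hours and minutes and convert both to timedeltas.
hhmm = sample["TIME OCC"]
time_part = (
    pd.to_timedelta(hhmm // 100, unit="h")
    + pd.to_timedelta(hhmm % 100, unit="min")
)

sample["occurred_at"] = date_part + time_part
print(sample)
```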

In [7]:
#We noticed our data had a lot of null values, so we looked at the percentage of nulls in each column.
#"Crm Cd 2", "Crm Cd 3" and "Crm Cd 4" each have a null percentage above 90%.
    #These hold secondary offense codes; the primary offense is already captured in "Crm Cd" and "Crm Cd 1", which round to 0% null.
    #Because of this, we are okay with dropping these columns, which also keeps the dataset lighter to work with.
#"Cross Street" also has a high null percentage of 84.65%.
    #Other columns capture the location, so, as with the extra Crm Cds, we are okay with dropping this column as well.
#"Weapon Used Cd" and "Weapon Desc" each have a high null percentage of 67.44%.
    #Because no other column records the weapon used, we are keeping these columns, but we won't rely on them heavily.
    #Our project focuses on force deployment and reporting rates, so a high null percentage in the weapon columns is acceptable.
null_pct = (data.isnull().sum() / len(data) * 100).round(2)
print(null_pct)

DR_NO              0.00
Date Rptd          0.00
DATE OCC           0.00
TIME OCC           0.00
AREA               0.00
AREA NAME          0.00
Rpt Dist No        0.00
Part 1-2           0.00
Crm Cd             0.00
Crm Cd Desc        0.00
Mocodes           15.09
Vict Age           0.00
Vict Sex          14.39
Vict Descent      14.39
Premis Cd          0.00
Premis Desc        0.06
Weapon Used Cd    67.44
Weapon Desc       67.44
Status             0.00
Status Desc        0.00
Crm Cd 1           0.00
Crm Cd 2          93.12
Crm Cd 3          99.77
Crm Cd 4          99.99
LOCATION           0.00
Cross Street      84.65
LAT                0.00
LON                0.00
dtype: float64


In [8]:
#Dropping the columns discussed in the previous cell. 
threshold = 80
columns_to_drop = (data.isnull().sum() / len(data) * 100)[lambda x: x > threshold].index
data_cleaned = data.drop(columns=columns_to_drop)
print(f"Dropped columns: {list(columns_to_drop)}")
print(f"Remaining columns count: {data_cleaned.shape[1]}")

Dropped columns: ['Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4', 'Cross Street']
Remaining columns count: 24


In [9]:
#What's remaining
print("Columns with remaining nulls:")
print(data_cleaned.isnull().sum()[data_cleaned.isnull().sum() > 0])

print("\n" + "#"*60 + "\n")
print("Percentage of nulls:")
print((data_cleaned.isnull().sum() / len(data_cleaned) * 100)[lambda x: x > 0].round(2))

Columns with remaining nulls:
Mocodes           151619
Vict Sex          144644
Vict Descent      144656
Premis Cd             16
Premis Desc          588
Weapon Used Cd    677744
Weapon Desc       677744
Status                 1
Crm Cd 1              11
dtype: int64

############################################################

Percentage of nulls:
Mocodes           15.09
Vict Sex          14.39
Vict Descent      14.39
Premis Cd          0.00
Premis Desc        0.06
Weapon Used Cd    67.44
Weapon Desc       67.44
Status             0.00
Crm Cd 1           0.00
dtype: float64
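Most of the remaining nulls sit in the victim and weapon columns. One option is to make the missingness explicit instead of dropping rows. The sketch below does this on made-up rows, under two assumptions that the dataset does not confirm: a null weapon code means no weapon was recorded, and a null Vict Sex can be treated as its own "Unknown" category. `weapon_recorded` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Made-up rows that reuse the real column names.
sample = pd.DataFrame({
    "Vict Sex": ["F", np.nan, "M"],
    "Weapon Used Cd": [400.0, np.nan, np.nan],
})

prepared = sample.copy()
# Assumption: a missing victim sex is treated as its own category.
prepared["Vict Sex"] = prepared["Vict Sex"].fillna("Unknown")
# Assumption: a missing weapon code means no weapon was recorded.
prepared["weapon_recorded"] = prepared["Weapon Used Cd"].notna()
print(prepared)
```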


In [13]:
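#Converting our variables into dummy variables
#OneHotEncoder returns a SciPy sparse matrix by default. Passing that matrix straight to pd.DataFrame raised
#"ValueError: Shape of passed values is (1004991, 1), indices imply (1004991, 6)", so we convert it to a dense array first.
encoder = OneHotEncoder()
variables_one_hot = encoder.fit_transform(data_cleaned[['Vict Sex']]).toarray()
df_one_hot = pd.DataFrame(
    variables_one_hot,
    columns=encoder.get_feature_names_out(['Vict Sex']),
    index=data_cleaned.index,  #keep rows aligned with data_cleaned
)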
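For quick exploration, `pd.get_dummies` is an alternative to scikit-learn's encoder that returns a DataFrame directly, so the output shape matches the column names without any conversion step. The sketch below uses made-up values and is not the encoding we have committed to. `dummy_na=True` adds a separate indicator column for missing values.

```python
import pandas as pd

# Made-up values; None stands in for a missing Vict Sex entry.
sample = pd.DataFrame({"Vict Sex": ["F", "M", None, "X", "M"]})

# One indicator column per category, plus one for missing values.
dummies = pd.get_dummies(
    sample["Vict Sex"], prefix="Vict Sex", dummy_na=True, dtype=int
)
print(dummies)
```

The scikit-learn encoder is still the better fit once the encoding has to be reused on new data, because a fitted encoder remembers its categories.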