<a href="https://colab.research.google.com/github/Ashliz1/NYC-Crime-Data---Group-10-/blob/main/term_project_starter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analysis of New York Crime

### Authors: (Group 10) Ashley Gomez, Sung Ik Park
### Date: December 8th, 2025

## Executive Summary

This project analyzes crime patterns in New York using the SpotCrime dataset. The goal is to identify which areas appear relatively safer than others based on crime type, location, and time of day. Using Python (pandas, NumPy, Matplotlib, seaborn, and Plotly), we review, summarize, and visualize the data to reveal trends and patterns across New York. The findings provide a basic overview of where crime clusters occur, offering insight for residents, policymakers, and anyone evaluating safety within New York.

## Table of Contents

1. Introduction
2. Problem Statement / Research Question
3. Data Description
4. Setup and Environment
5. Data Loading
6. Data Preparation
7. Model Planning
8. Model Building / Analysis
9. Discussion & Interpretation
10. Conclusion
11. References
12. Appendix

## Introduction

This project analyzes crime data in New York to identify where incidents occur most often and how crime is spread across the state's cities. The analysis focuses on three main aspects of the data: location, crime type, and time of day. We use the SpotCrime dataset, which includes fields such as location, timestamp, crime type, longitude, and latitude. Our first steps are to clean the data, create summaries, and organize the results with Python. Overall, the goal is to understand which areas are safer than others by providing an overview of crime patterns across New York.

## Problem Statement / Research Question

This project aims to determine which areas in New York appear safer or less safe based on recorded crime counts in the dataset. Understanding the overall distribution of crime can help people decide where to live and support decision-makers seeking a clearer view of safety conditions. The analysis also considers which crime types are most frequent and whether certain times of day show higher activity. It is expected that some areas will have noticeably higher incident counts than others and that specific crime types may appear more frequently depending on location and time. The approach uses simple descriptive methods, including grouping and counting incidents and creating visualizations, to highlight where crime is most concentrated across New York.

## Data Description

The dataset contains individual crime incident reports from locations across New York. Each row represents a single reported incident, with its crime type, date and time, and geographic information such as county, city, ZIP code, address, and latitude/longitude. The dataset is provided in CSV format and contains 61,824 rows and 19 columns. The timestamp column is converted to a proper datetime format during preparation, and the data may include missing values or duplicates. Although the dataset includes several variables, the analysis focuses on location, time, and crime type, which provide enough detail to identify patterns and trends in crime activity across New York.

## Setup and Environment

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

## Data Loading

In [None]:
data_url = "https://raw.githubusercontent.com/Ashliz1/NYC-Crime-Data---Group-10-/refs/heads/main/spotcrime.crime.ny.csv"
df = pd.read_csv(data_url)
df.head()

In [None]:
df.shape

(61824, 19)
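
Before preparing the data, it can help to check how many values are missing and how many rows are exact duplicates. The cell below is a minimal sketch on a small made-up table (`sample`, `missing_per_column`, `n_duplicates`, and `deduplicated` are illustrative names, not part of the project code). It treats a row as a duplicate only when every column matches, which is an assumption: two identical-looking rows in the real data could still be separate incidents.

```python
import pandas as pd

# Made-up rows that mimic three of the dataset's columns (illustration only)
sample = pd.DataFrame({
    "City": ["Corona", "Corona", None, "Corona"],
    "CrimeType": ["Theft", "Theft", "Assault", "Theft"],
    "CrimeTime": [
        "2025-03-17 01:00:00+00:00",
        "2025-03-17 01:00:00+00:00",
        "2025-03-16 20:00:00+00:00",
        "2025-03-16 20:00:00+00:00",
    ],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Number of rows that repeat an earlier row in every column
n_duplicates = int(sample.duplicated().sum())

# Copy of the table with the repeated rows removed
deduplicated = sample.drop_duplicates()

print(missing_per_column)
print("duplicate rows:", n_duplicates)
```

Running the same three calls on `df` would show whether the real dataset needs de-duplication before counting incidents.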

In [20]:
df[['City','CrimeType','CrimeTime']].head()

,City,CrimeType,CrimeTime
0,Corona,Theft,2025-03-17 01:00:00+00:00
1,Corona,Theft,2025-03-16 23:00:00+00:00
2,Corona,Assault,2025-03-16 20:00:00+00:00
3,Corona,Theft,2025-03-16 20:00:00+00:00
4,Corona,Theft,2025-03-16 20:00:00+00:00


In [29]:
df[["City","CrimeType","CrimeTime","Latitude","Longitude"]].head()

,City,CrimeType,CrimeTime,Latitude,Longitude
0,Corona,Theft,2025-03-17 01:00:00+00:00,40.7468,-73.8605
1,Corona,Theft,2025-03-16 23:00:00+00:00,40.7468,-73.8605
2,Corona,Assault,2025-03-16 20:00:00+00:00,40.7468,-73.8605
3,Corona,Theft,2025-03-16 20:00:00+00:00,40.7468,-73.8605
4,Corona,Theft,2025-03-16 20:00:00+00:00,40.7468,-73.8605


## Data Preparation

In [30]:
# Convert CrimeTime to datetime format (the raw values carry a +00:00 offset)
df["CrimeTime"] = pd.to_datetime(df["CrimeTime"])
# Extract the hour of the day from the timestamp (in the timestamp's own time zone)
df["hour"] = df["CrimeTime"].dt.hour
# Drop rows where key fields are missing
df = df.dropna(subset=["City", "CrimeType", "CrimeTime", "Latitude", "Longitude"])
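
The timestamps shown earlier end in `+00:00`, so the extracted hour follows that offset rather than New York clock time. If the offset really does mark UTC (an assumption we have not verified against the SpotCrime documentation), the local hour could be derived as in the sketch below. The names `sample`, `hour_utc`, and `hour_local` are illustrative only.

```python
import pandas as pd

# Two timestamps copied from the CrimeTime preview above
sample = pd.DataFrame({
    "CrimeTime": ["2025-03-17 01:00:00+00:00", "2025-03-16 23:00:00+00:00"],
})

# Parse as timezone-aware UTC datetimes
sample["CrimeTime"] = pd.to_datetime(sample["CrimeTime"], utc=True)

# Hour as stored (UTC)
sample["hour_utc"] = sample["CrimeTime"].dt.hour

# Hour on the New York clock, assuming the stored values are true UTC
sample["hour_local"] = (
    sample["CrimeTime"].dt.tz_convert("America/New_York").dt.hour
)

print(sample)
```

For these two rows the local hour is four hours behind the stored hour, because daylight saving time was in effect in mid-March 2025. A shift of this size would move a "1 a.m." peak to the previous evening, so the choice of time zone matters when interpreting peak hours.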


## Model Planning

We will apply simple descriptive analysis and visualization to address our research questions. The analysis has three main steps:

1. Identify the top three cities with the highest number of crimes.
2. For each of these top cities, find the three most frequent crime types.
3. For each city and crime type combination, determine the hour when crimes occur most often.



### Functions

In [76]:
def analyze_crime_by_city(df):
  # Find the top 3 cities with the highest crime counts
  top_cities = df["City"].value_counts().head(3)

  # Keep only rows from those top 3 cities so the later groupbys do less work
  df_top = df[df["City"].isin(top_cities.index)]

  # top_types_per_city:
  #   1. group by city + crime type
  #   2. count the number of incidents
  #   3. reset the index so the group keys become normal columns
  #   4. sort by city, then by count in descending order
  #   5. group again by city and keep the top 3 rows per city
  # peak_hours:
  #   same steps, grouped by city + crime type + hour, keeping the single
  #   most frequent hour for every crime type in the top cities
  top_types_per_city = (
      df_top.groupby(["City","CrimeType"])
      .size()
      .reset_index(name="Count")
      .sort_values(["City","Count"], ascending=[True,False])
      .groupby("City")
      .head(3)
  )
  peak_hours = (
      df_top.groupby(["City","CrimeType","hour"])
      .size()
      .reset_index(name="Count")
      .sort_values(["City","CrimeType","Count"], ascending=[True,True,False])
      .groupby(["City","CrimeType"])
      .head(1)
  )

  return top_cities, top_types_per_city, peak_hours

## Model Building / Analysis

In [77]:
top_cities, top_types, peak_hours = analyze_crime_by_city(df)

In [78]:
top_cities

City,count
New York,6533
Brooklyn,2026
Buffalo,1761
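
Since the plan calls for visualization, the counts above could be drawn as a bar chart. The cell below is one possible sketch using Matplotlib. It rebuilds `top_cities` from the three values printed above so that it runs on its own. The figure size, color, and labels are our own choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Counts copied from the top_cities output above
top_cities = pd.Series(
    [6533, 2026, 1761],
    index=pd.Index(["New York", "Brooklyn", "Buffalo"], name="City"),
    name="count",
)

# One bar per city, in descending order of incident count
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(list(top_cities.index), top_cities.values, color="steelblue")
ax.set_xlabel("City")
ax.set_ylabel("Reported incidents")
ax.set_title("Top 3 cities by reported incidents")
fig.tight_layout()
```

In the notebook, the same plotting lines can be applied directly to the `top_cities` Series returned by `analyze_crime_by_city`.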


In [79]:
top_types

,City,CrimeType,Count
6,Brooklyn,Theft,1155
1,Brooklyn,Assault,585
4,Brooklyn,Robbery,118
15,Buffalo,Theft,970
10,Buffalo,Assault,313
12,Buffalo,Other,220
24,New York,Theft,3539
19,New York,Assault,1750
21,New York,Other,371


In [80]:
peak_hours

,City,CrimeType,hour,Count
3,Brooklyn,Arrest,5,4
32,Brooklyn,Assault,22,51
50,Brooklyn,Burglary,18,18
57,Brooklyn,Other,3,5
77,Brooklyn,Robbery,2,13
101,Brooklyn,Shooting,19,6
104,Brooklyn,Theft,1,90
132,Brooklyn,Vandalism,17,2
139,Buffalo,Arrest,7,4
150,Buffalo,Arson,3,1
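
The `peak_hours` table lists a peak hour for every crime type in the top cities, while step 3 of Model Planning targets only the three most frequent crime types per city. One way to narrow it down is an inner merge with `top_types`. The sketch below uses small made-up stand-ins for both tables (the values are not real results), and `peak_hours_top` and `PeakHourCount` are hypothetical names.

```python
import pandas as pd

# Made-up stand-in for the top crime types per city
top_types = pd.DataFrame({
    "City": ["A", "A", "B"],
    "CrimeType": ["Theft", "Assault", "Theft"],
    "Count": [10, 5, 7],
})

# Made-up stand-in for the peak hour of every crime type per city
peak_hours = pd.DataFrame({
    "City": ["A", "A", "A", "B", "B"],
    "CrimeType": ["Theft", "Assault", "Arson", "Theft", "Arrest"],
    "hour": [1, 22, 3, 14, 7],
    "Count": [4, 2, 1, 3, 1],
})

# Keep only the (City, CrimeType) pairs that appear in top_types
peak_hours_top = peak_hours.merge(
    top_types[["City", "CrimeType"]],
    on=["City", "CrimeType"],
    how="inner",
).rename(columns={"Count": "PeakHourCount"})

print(peak_hours_top)
```

With the real tables, this would give at most nine rows (three cities times three crime types), one peak hour for each.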


## Discussion & Interpretation

## Conclusion

## References

## Appendix