# Gun Violence EDA — Incidents (2014–2024)

This notebook loads all cleaned incident CSVs from the `Incidents` folder, combines them,
runs quick structure checks, summarizes numeric columns, explores incident characteristics,
and saves a combined gzipped CSV for later analysis.

In [9]:
from pathlib import Path
import pandas as pd

INCIDENTS_DIR = Path("/Users/johnnybae/Documents/Academia/Chaminade/DS495 - Research/Incidents")

In [10]:
# read all cleaned files and tag each row with its source file ---
cleaned_files = sorted(INCIDENTS_DIR.glob("*_clean.csv"))
if not cleaned_files:
    raise FileNotFoundError("No cleaned incident files found in the folder")

dfs = []
for file in cleaned_files:
    df = pd.read_csv(file, low_memory=False)
    df["Source_File"] = file.name
    dfs.append(df)

inc = pd.concat(dfs, ignore_index=True)
print(f"[OK] Combined {len(cleaned_files)} files → {len(inc):,} total rows\n")

[OK] Combined 11 files → 449,386 total rows
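
A per-file tally can show how the combined rows break down by input file and whether any single file contributes most of the blank rows. The sketch below runs on a toy frame. The file names (`2014_clean.csv`, `2015_clean.csv`) and the names `inc_demo` and `per_file` are assumptions for illustration, since the real file names are not shown in this notebook.

```python
import pandas as pd

# Toy stand-in for the combined frame; file names and values are hypothetical.
inc_demo = pd.DataFrame({
    "ID": [1.0, 2.0, None, 4.0],
    "Source_File": [
        "2014_clean.csv", "2014_clean.csv",
        "2015_clean.csv", "2015_clean.csv",
    ],
})

# "size" counts every row in the group; "count" skips rows where ID is missing.
per_file = inc_demo.groupby("Source_File").agg(
    rows=("ID", "size"),
    rows_with_id=("ID", "count"),
)
print(per_file)
```

On the real data the same `groupby` would be applied to `inc` directly.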



In [11]:
# structure checks: columns, dtypes, missing values, sample rows ---
print("Columns:", list(inc.columns))
print("\nBasic info:")
inc.info()  # info() prints its own report and returns None, so no print() is needed

print("\nMissing value counts:")
print(inc.isna().sum())

print("\nSample rows:")
print(inc.sample(5, random_state=42))

Columns: ['ID', 'Vic-Killed', 'Vic-Injured', 'Sus-Killed', 'Sus-Injured', 'Sus-Unharmed', 'Sus-Arrested', 'Characteristics', 'Source_File']

Basic info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 449386 entries, 0 to 449385
Data columns (total 9 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   ID               428678 non-null  float64
 1   Vic-Killed       428678 non-null  float64
 2   Vic-Injured      428678 non-null  float64
 3   Sus-Killed       428678 non-null  float64
 4   Sus-Injured      428678 non-null  float64
 5   Sus-Unharmed     428678 non-null  float64
 6   Sus-Arrested     428678 non-null  float64
 7   Characteristics  428678 non-null  object 
 8   Source_File      449386 non-null  object 
dtypes: float64(7), object(2)
memory usage: 30.9+ MB

Missing value counts:
ID                 20708
Vic-Killed         20708
Vic-Injured        20708
Sus-Killed         20708
Sus-Injured        20708
Sus-Unharmed  
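
Every column except `Source_File` reports the same 428,678 non-null count. That is consistent with the same 20,708 rows being blank in all data columns, though it does not prove it. Missing values would also explain why `ID` and the count columns load as `float64`. Below is a minimal sketch, on toy data with hypothetical values, of confirming that pattern, dropping the blank rows, and casting to pandas' nullable `Int64`. The names `demo`, `data_cols`, `empty_mask`, and `cleaned` are illustrative.

```python
import pandas as pd

# Toy stand-in for `inc`; values and file names are hypothetical.
demo = pd.DataFrame({
    "ID": [101.0, None, 103.0],
    "Vic-Killed": [0.0, None, 1.0],
    "Characteristics": ["Shot - Wounded/Injured", None, "Shot - Wounded/Injured"],
    "Source_File": ["a_clean.csv", "a_clean.csv", "b_clean.csv"],
})

# Rows where every column other than Source_File is missing.
data_cols = demo.columns.drop("Source_File")
empty_mask = demo[data_cols].isna().all(axis=1)
print("fully blank rows:", int(empty_mask.sum()))

# Drop them, then store IDs and counts as nullable integers instead of floats.
cleaned = demo.loc[~empty_mask].copy()
for col in ["ID", "Vic-Killed"]:
    cleaned[col] = cleaned[col].astype("Int64")
print(cleaned.dtypes)
```

If the blank-row count on the real data is lower than 20,708, the missing values are spread across different rows and dropping needs more care.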

In [12]:
# numeric summaries ---
num_cols = ["Vic-Killed", "Vic-Injured", "Sus-Killed", "Sus-Injured", "Sus-Unharmed", "Sus-Arrested"]
for c in num_cols:
    inc[c] = pd.to_numeric(inc[c], errors="coerce")

print("\nSummary statistics for numeric columns:")
print(inc[num_cols].describe())


Summary statistics for numeric columns:
          Vic-Killed    Vic-Injured     Sus-Killed    Sus-Injured  \
count  428678.000000  428678.000000  428678.000000  428678.000000   
mean        0.371129       0.779210       0.063453       0.048631   
std         0.562085       1.040925       0.246527       0.227625   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       1.000000       0.000000       0.000000   
75%         1.000000       1.000000       0.000000       0.000000   
max        60.000000     439.000000       4.000000       5.000000   

        Sus-Unharmed   Sus-Arrested  
count  428678.000000  428678.000000  
mean        0.471281       0.351058  
std         0.759079       0.681743  
min         0.000000       0.000000  
25%         0.000000       0.000000  
50%         0.000000       0.000000  
75%         1.000000       1.000000  
max        15.000000      15.000000  
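
The maxima for `Vic-Killed` (60) and `Vic-Injured` (439) sit far above the 75th percentile of 1, so it may be worth reviewing the largest rows by hand before any modeling. Here is a small sketch on toy data. The values are invented, and `THRESHOLD` is an arbitrary cutoff chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for `inc`; the values are invented for illustration.
demo = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Vic-Killed": [0, 1, 0, 3, 0],
    "Vic-Injured": [1, 0, 2, 40, 1],
})

# The rows with the largest injury counts, for manual review.
top = demo.nlargest(2, "Vic-Injured")
print(top)

# How many rows exceed an arbitrary review threshold (assumed; tune as needed).
THRESHOLD = 10
n_extreme = int((demo["Vic-Injured"] > THRESHOLD).sum())
print("rows above threshold:", n_extreme)
```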

In [13]:
# look at Characteristics; note that splitting on "," also cuts labels that contain commas ---
if "Characteristics" in inc.columns:
    chars = (
        inc["Characteristics"]
        .replace("N/A", pd.NA)
        .dropna()
        .str.split(",")
        .explode()
        .str.strip()
        .value_counts()
        .head(15)
    )
    print("\nTop 15 Incident Characteristics:")
    print(chars)


Top 15 Incident Characteristics:
Characteristics
accidental                                                                                                         167627
Shot - Wounded/Injured                                                                                             163871
Shot - Dead (murder                                                                                                140316
or suicide)                                                                                                         84995
Shot - Wounded/Injured\nShot - Dead (murder                                                                         27311
car to car)                                                                                                         22779
Shot - Wounded/Injured\nDrive-by (car to street                                                                     21097
Shot - Wounded/Injured\nArmed robbery with injury/death and/or evidence of DGU found            
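
The fragments in this output (`or suicide)`, `car to car)`, and entries containing `\n`) suggest that individual characteristics are separated by newlines and that some labels contain commas themselves. The counts point the same way: 140,316 + 27,311 = 167,627, exactly the count for `accidental`, so `accidental` looks like a piece of the "Shot - Dead (murder, ..." label. Below is a hedged sketch of splitting on newlines instead. The toy strings are pieced together from the fragments above and may not match the real labels exactly. Because pandas escapes newlines when printing, the sketch handles both a real newline and a literal backslash-n.

```python
import pandas as pd

# Toy stand-in for inc["Characteristics"]; labels are reconstructed guesses.
raw = pd.Series([
    "Shot - Wounded/Injured\nDrive-by (car to street, car to car)",
    "Shot - Dead (murder, accidental, suicide)",
    "Shot - Wounded/Injured\\nShot - Dead (murder, accidental, suicide)",
    "N/A",
])

labels = (
    raw.replace("N/A", pd.NA)
    .dropna()
    .str.replace("\\n", "\n", regex=False)  # literal backslash-n becomes a newline
    .str.split("\n")                        # one list entry per characteristic
    .explode()
    .str.strip()
)
labels = labels[labels != ""]               # drop empties left by stray newlines

counts = labels.value_counts()
print(counts.head(15))
```

With this split, labels such as "Drive-by (car to street, car to car)" stay intact and each is counted once per incident.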

In [14]:
# optional: save the combined dataset for later analysis ---
out_path = INCIDENTS_DIR / "incidents_all_years.csv.gz"
inc.to_csv(out_path, index=False, compression="gzip")
print(f"\n[Saved combined file] {out_path}")



[Saved combined file] /Users/johnnybae/Documents/Academia/Chaminade/DS495 - Research/Incidents/incidents_all_years.csv.gz
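
A quick read-back can confirm that the gzipped file loads cleanly, including fields that contain embedded newlines. The sketch below uses a temporary directory and a toy frame, so the path and data are assumptions rather than the notebook's real output.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with an embedded newline, similar in shape to Characteristics.
demo = pd.DataFrame({"ID": [1, 2], "Characteristics": ["a\nb", "c"]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "demo.csv.gz"      # hypothetical file name
    demo.to_csv(out_path, index=False, compression="gzip")
    back = pd.read_csv(out_path)              # compression inferred from ".gz"

print(back.shape == demo.shape)
print(list(back["Characteristics"]) == list(demo["Characteristics"]))
```

On the real data, comparing `len(inc)` with the row count of the reloaded file is a cheap sanity check.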
