## Stratified Sample

The objective of this notebook is to draw a stratified sample of Los Angeles crime records from 2020 to 2023. The sample is stratified by year of occurrence: the same fraction (one third) of records is drawn from each year, so every year is represented in proportion to its share of the full dataset and no single year dominates an analysis of crime patterns over time. Crime category is not used as a stratum; categories appear in the sample at whatever rate random selection within each year produces. The original data was obtained from the City of Los Angeles Public Data Catalog, dataset [`Crime_Data_from_2020_to_Present_20240905.csv`](../data/raw/Crime_Data_from_2020_to_Present_20240905.csv).
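
As a minimal sketch of what stratifying by year means, the toy example below applies the same `groupby(...).sample(...)` call to a small made-up table. The `toy` DataFrame and its `record_id` column are hypothetical and exist only for illustration; they are not part of the crime dataset.

```python
import pandas as pd

# Hypothetical toy data: 6 records from 2020 and 3 records from 2021
toy = pd.DataFrame({
    "year": [2020] * 6 + [2021] * 3,
    "record_id": range(9),
})

# Draw one third of the rows from each year separately
sample = toy.groupby("year").sample(frac=1/3, random_state=2021)

# Count the sampled rows per year: 2 from 2020 and 1 from 2021
counts = sample["year"].value_counts().sort_index()
print(counts)
```

Because the fraction is applied within each year, the 2:1 ratio between the two years in the toy table is preserved in the sample.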


In [None]:
import pandas as pd

csv_path = "../data/raw/Crime_Data_from_2020_to_Present_20240905.csv"

# Read the CSV file and parse dates
crimes = pd.read_csv(
    csv_path, 
    parse_dates=['Date Rptd','DATE OCC'],
    dtype={"TIME OCC": int}
)

crimes['year'] = crimes['DATE OCC'].dt.year


# Exclude 2024, which was still in progress when the data was downloaded
crimes = crimes[crimes['year'] != 2024]

# Perform stratified sampling by year: draw one third of the records from each year
crimes = crimes.groupby("year").sample(frac=1/3, random_state=2021)

# Columns to keep in the output file
# (named `columns_to_keep` to avoid shadowing the built-in `filter`)
columns_to_keep = [
    'DR_NO', 'Date Rptd', 'DATE OCC', 'year', 'TIME OCC', 'AREA NAME',
    'Crm Cd Desc', 'Vict Age', 'Vict Sex', 'Vict Descent',
    'Weapon Desc', 'Status Desc'
]

# Save the selected columns of the DataFrame to a CSV file
crimes.to_csv(
    "../data/raw/Crime_2020_2023.csv",
    index=False,
    columns=columns_to_keep
)

print("CSV file saved successfully.")
