### Import required libraries

In [1]:
# import common libraries
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# set the seaborn run command for font parameters
# sns.set(rc={'font.weight': 'bold'}, font_scale=1.3)

# suppress Matplotlib deprecation warnings that clutter the notebook
# (the older matplotlib.cbook.mplDeprecation alias is not available in recent Matplotlib releases)
warnings.filterwarnings("ignore", category=matplotlib.MatplotlibDeprecationWarning)

## Table of Contents

1. [Business Understanding](#business-understanding)
2. [Data Understanding](#data-understanding)
3. [Data Preparation](#data-preparation)
4. [Modeling](#modeling)
5. [Evaluation](#evaluation)
6. [Deployment](#deployment)

## Business Understanding <a name="business-understanding"></a>

For this study, we will analyze serious crimes committed in Los Angeles from 2020 through the first quarter of 2024.

There are two objectives for this study:
1) Analyze serious crimes committed in Los Angeles, according to the National Crime Index
2) Predict the clearance rate of serious crimes in Los Angeles to improve law enforcement outcomes and increase public safety

Low crime clearance rates are a major concern for the police and the public. The clearance rate is the proportion of crimes that result in an arrest.
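
As a minimal sketch of this definition, the clearance rate can be computed as the share of incidents whose status indicates an arrest. The toy rows below are invented for illustration, and every status label other than `Adult Arrest` (which appears in the crime data output in this notebook) is an assumption:

```python
import pandas as pd

# Toy incidents; labels other than "Adult Arrest" are assumed for illustration
toy = pd.DataFrame({
    "DR_NO": [1, 2, 3, 4, 5],
    "Status_Desc": [
        "Adult Arrest", "Invest Cont", "Invest Cont", "Juv Arrest", "Invest Cont",
    ],
})

# Treat a crime as cleared when its status description mentions an arrest
cleared = toy["Status_Desc"].str.contains("Arrest", case=False)

# Clearance rate = cleared incidents / all incidents
clearance_rate = cleared.mean()
print(f"Clearance rate: {clearance_rate:.0%}")
```

Taking the mean of a boolean column gives the proportion of `True` values, which keeps the calculation to a single line once the "cleared" rule is fixed.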

Assess the current situation:
- What are the trends in serious crimes being committed in Los Angeles today?
- What is the current clearance rate for serious crimes in Los Angeles?

Determining Data Mining Goals:
- Ensure that the crime incidents are appropriately labeled according to the national crime index
- Analyze the impact on citizens by demographics, time of day, and location

Project Plan for this study:
- Collect the data from the LAPD public data sharing portal
- Preprocess the data to ensure that it is clean and ready for analysis
- Analyze the data to understand the trends in serious crimes in Los Angeles
- Clean the data to improve data quality, analysis insights, and model performance
- Create machine learning pipelines to pre-process the data and train the model
- Evaluate the model to ensure it is accurate and interpretable


## Data Understanding <a name="data-understanding"></a>

The data we collected from the LAPD includes the following observations:
- 932,140 crimes reported
- 21 different regional police districts
- 120 unique labels classifying different crimes reported
- 79 unique labels classifying different weapons used in the crimes reported
- 306 unique labels classifying different locations where crimes were committed
- 6 unique labels for crime status (arrest made, investigation continues)
- Time Period - Full year data for 2020 - 2023 and Q1 for 2024
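
Counts like the ones above can be reproduced with `len` and `Series.nunique`. The sketch below runs on a small invented table rather than the real data; the column names follow the crime data shown in this notebook, while the values (including the `IC` status code) and the mapping of `AREA_NAME` to police district are assumptions:

```python
import pandas as pd

# Toy stand-in for the crime table; values are invented for illustration
toy = pd.DataFrame({
    "DR_NO": [101, 102, 103, 104],
    "AREA_NAME": ["Van Nuys", "Central", "Van Nuys", "Hollywood"],
    "Crm_Cd_Desc": ["BURGLARY", "ROBBERY", "BURGLARY", "ARSON"],
    "Status": ["AA", "IC", "IC", "IC"],
})

# Row count plus the number of distinct labels per column of interest
summary = {
    "crimes_reported": len(toy),
    "police_districts": toy["AREA_NAME"].nunique(),
    "crime_labels": toy["Crm_Cd_Desc"].nunique(),
    "status_labels": toy["Status"].nunique(),
}
print(summary)
```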

The data includes 27 features (omitting the crime ID field) covering the following topics:
<br>
- Date, Time, Location and Description of each crime committed
- Victim demographics (Age, Sex, Descent)
- Type of location where crimes are committed (Apartment, Gas Station, etc)
- Weapon used in the crime (Firearm, Knife, etc)
- Status of the crime (Arrest made or continued investigation)
- Geolocation

### Data Quality
The data quality is acceptable overall. The critical features appear to be in good shape, with a negligible number of missing values. There is an issue with the `Part1-2` label. Part 1 crimes are serious crimes (according to the national crime index), while Part 2 crimes are less serious. We will need a strategy to relabel the data so that our analysis can differentiate between the two.


## Data Preparation <a name="data-preparation"></a>
We will take the following steps to create a high-quality dataset:

1. Drop sparse columns - Columns `CrmCd2`, `CrmCd3`, `CrmCd4` have between 95% and 99% missing values. We will drop these columns as they do not provide any valuable insight.<br><br>
2. Temporal data - We will need to manually format date and time columns, including `DateRptd`, `DATEOCC`, `TIMEOCC`<br><br>
3. Text and Categorical Data - We will create machine learning pipelines using TF-IDF to vectorize the text data and one-hot encoding to transform the categorical data
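
The first two steps might look roughly like the sketch below, which runs on invented data. It uses the underscore column names that appear in the loaded crime data (`Crm_Cd_2`, `DATE_OCC`, `TIME_OCC`). The 0.95 threshold is taken from the missing-value range quoted above, `occurred_at` is a hypothetical column name, and reading `TIME_OCC` as an `HHMM` integer is an assumption based on the sample value `1537`:

```python
import numpy as np
import pandas as pd

n = 40
toy = pd.DataFrame({
    "DATE_OCC": ["01/06/2020 12:00:00 AM"] * n,
    "TIME_OCC": [1537] * n,
    "Crm_Cd_1": [442.0] * n,
    "Crm_Cd_2": [998.0] + [np.nan] * (n - 1),  # 97.5% missing
    "Crm_Cd_3": [np.nan] * n,                  # 100% missing
})

# Step 1: drop columns whose share of missing values meets the threshold
missing_share = toy.isna().mean()
sparse_cols = missing_share[missing_share >= 0.95].index.tolist()
prepared = toy.drop(columns=sparse_cols)

# Step 2: parse the date string, then add the time stored as an HHMM integer
date = pd.to_datetime(prepared["DATE_OCC"], format="%m/%d/%Y %I:%M:%S %p")
hhmm = prepared["TIME_OCC"].astype(int)
prepared["occurred_at"] = (
    date
    + pd.to_timedelta(hhmm // 100, unit="h")
    + pd.to_timedelta(hhmm % 100, unit="m")
)

print(sparse_cols)
print(prepared["occurred_at"].iloc[0])
```

Passing an explicit `format` to `pd.to_datetime` avoids ambiguity between month-first and day-first dates.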

### Feature Engineering
We will add the following features to improve data quality for more granular analysis:
- `Year` - Extract year from `DATEOCC`
- `Month` - Extract month from `DATEOCC`
- `Day` - Extract day from `DATEOCC`
- `Part of Day` - Discretize `TIMEOCC` into 4 intervals (Morning, Afternoon, Evening, Night)
- `Target` - Create a target variable indicating if the suspect was arrested or if the investigation is ongoing
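
A sketch of how these features could be derived with pandas is shown below, on invented rows. The bin edges for `Part of Day` (Night 00:00-05:59, Morning 06:00-11:59, Afternoon 12:00-17:59, Evening 18:00-23:59) are an assumption, as are the rule that any status mentioning "Arrest" counts as an arrest and the status labels other than `Adult Arrest`:

```python
import pandas as pd

toy = pd.DataFrame({
    "DATE_OCC": pd.to_datetime(["2020-01-06", "2021-03-15", "2022-07-04", "2023-11-30"]),
    "TIME_OCC": [1537, 5, 2230, 830],  # HHMM stored as an integer
    "Status_Desc": ["Adult Arrest", "Invest Cont", "Invest Cont", "Juv Arrest"],
})

# Calendar features from the occurrence date
toy["Year"] = toy["DATE_OCC"].dt.year
toy["Month"] = toy["DATE_OCC"].dt.month
toy["Day"] = toy["DATE_OCC"].dt.day

# Discretize the hour into four left-closed intervals (assumed bin edges)
hour = toy["TIME_OCC"] // 100
toy["Part of Day"] = pd.cut(
    hour,
    bins=[0, 6, 12, 18, 24],
    labels=["Night", "Morning", "Afternoon", "Evening"],
    right=False,
)

# Binary target: 1 if the status mentions an arrest, 0 otherwise
toy["Target"] = toy["Status_Desc"].str.contains("Arrest", case=False).astype(int)

print(toy[["Year", "Month", "Day", "Part of Day", "Target"]])
```

Using `right=False` makes each bin include its lower edge, so an incident at midnight (hour 0) falls into the first interval instead of being left unassigned.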

### Data Visualization
We will create an interactive map which shows the concentration of crime activity per region, with some written analysis to observe patterns in crime concentration.

## Modeling <a name="modeling"></a>

We will use an XGBoost model to predict the clearance rate of serious crimes in Los Angeles. We will use the following steps to create the model:

1. Create development, training, and test datasets to train the model
2. Create a machine learning pipeline to preprocess the data
3. Create a parameter grid to choose ideal hyperparameters to optimize the model
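
The steps above can be sketched with scikit-learn as follows. This is one possible reading, not the notebook's final pipeline: it assumes the test set is held out first and the remaining development set is then split into training and validation data, it uses invented rows, and it stops at preprocessing so that it runs without XGBoost installed. The grid values are placeholders; the parameter names (`n_estimators`, `max_depth`, `learning_rate`) are ones XGBoost's scikit-learn wrapper accepts.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Toy data; values are invented for illustration
toy = pd.DataFrame({
    "Crm_Cd_Desc": [
        "BURGLARY FROM VEHICLE", "ROBBERY", "SHOPLIFTING PETTY THEFT", "ARSON",
        "BURGLARY", "ROBBERY", "VEHICLE STOLEN", "ARSON", "BURGLARY", "ROBBERY",
    ],
    "AREA_NAME": [
        "Van Nuys", "Central", "Van Nuys", "Hollywood", "Central",
        "Hollywood", "Van Nuys", "Central", "Hollywood", "Van Nuys",
    ],
    "Target": [1, 0, 1, 0, 0, 1, 0, 0, 1, 0],
})
X = toy[["Crm_Cd_Desc", "AREA_NAME"]]
y = toy["Target"]

# Step 1: hold out a test set, then split the development set into train/validation
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=42
)

# Step 2: TF-IDF for the text column, one-hot encoding for the categorical column
preprocess = ColumnTransformer([
    ("text", TfidfVectorizer(), "Crm_Cd_Desc"),  # a single column name yields 1-D text input
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["AREA_NAME"]),
])
X_train_t = preprocess.fit_transform(X_train)  # fit on training data only
X_val_t = preprocess.transform(X_val)

# Step 3: candidate hyperparameters for a later grid search (placeholder values)
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 6],
    "learning_rate": [0.05, 0.1],
}

print(X_train_t.shape, X_val_t.shape)
```

Fitting the preprocessing on the training split alone keeps vocabulary and category information from the validation and test sets out of the model.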


## Evaluation <a name="evaluation"></a>

Evaluate the performance of the models.

## Deployment <a name="deployment"></a>

Deploy the models into production.


In [2]:
# load in arrest data
arrest = pd.read_csv('~/Downloads/arrest_data_2020_to_present.csv')
# remove spaces from arrest
arrest.columns = arrest.columns.str.replace(' ', '_')

# load in crime data
crime = pd.read_csv('../data/crime_data.csv')
# remove spaces from crime
crime.columns = crime.columns.str.replace(' ', '_')

In [3]:
arrest_id = arrest.loc[:,['Report_ID']]

arrest_id

Unnamed: 0,Report_ID
0,6636966
1,6637119
2,6624479
3,6636128
4,6636650
...,...
281690,6778830
281691,6789376
281692,6790740
281693,6786791


In [4]:
arrest.query('Report_ID == 200904235')

Unnamed: 0,Report_ID,Report_Type,Arrest_Date,Time,Area_ID,Area_Name,Reporting_District,Age,Sex_Code,Descent_Code,...,Disposition_Description,Address,Cross_Street,LAT,LON,Location,Booking_Date,Booking_Time,Booking_Location,Booking_Location_Code
116414,200904235,RFC,01/06/2020 12:00:00 AM,1537.0,9,Van Nuys,964,27,F,H,...,MISDEMEANOR COMPLAINT FILED,14000 RIVERSIDE DR,,34.1576,-118.438,POINT (-118.438 34.1576),,,,


In [5]:
arrest.Report_Type.value_counts()

Report_Type
BOOKING    221335
RFC         60360
Name: count, dtype: int64

In [6]:
crime.query('DR_NO == 200904235')

Unnamed: 0,DR_NO,Date_Rptd,DATE_OCC,TIME_OCC,AREA,AREA_NAME,Rpt_Dist_No,Part_1-2,Crm_Cd,Crm_Cd_Desc,...,Status,Status_Desc,Crm_Cd_1,Crm_Cd_2,Crm_Cd_3,Crm_Cd_4,LOCATION,Cross_Street,LAT,LON
42392,200904235,01/06/2020 12:00:00 AM,01/06/2020 12:00:00 AM,1537,9,Van Nuys,964,1,442,SHOPLIFTING - PETTY THEFT ($950 & UNDER),...,AA,Adult Arrest,442.0,,,,14000 RIVERSIDE DR,,34.1576,-118.438


In [7]:
crime_id = crime.loc[:,['DR_NO']].rename(columns={'DR_NO':'Report_ID'})

crime_id

Unnamed: 0,Report_ID
0,190326475
1,200106753
2,200320258
3,200907217
4,220614831
...,...
932135,241605270
932136,241604405
932137,242106032
932138,242004546


In [8]:
# intersection of arrest and crime data
intersection = pd.merge(arrest_id, crime_id, on='Report_ID', how='inner')

intersection.shape

(56, 1)

In [None]:
intersection.sample(5)