# Feature Engineering

This notebook covers three aspects of feature engineering for our arrest prediction task:

- Temporal features
- Spatial (location type) features
- Crime type mapping for the Chicago and NIBRS datasets

In [1]:
import pandas as pd

## Mapping Standard for Location Type

This section defines a mapping standard for the location type feature in the Chicago and NIBRS datasets. The goal is a common set of categories that both datasets share, so that location types are consistent and comparable across them.

| Common Category         | Chicago Dataset (location_description) | NIBRS Dataset (location_name) |
|-------------------------|--------------------------------|--------------------------|
| **Residence**          | RESIDENCE, HOUSE, APARTMENT, RESIDENCE - YARD (FRONT / BACK), RESIDENCE - GARAGE, RESIDENCE - PORCH / HALLWAY, DRIVEWAY - RESIDENTIAL | Residence/Home |
| **Street/Outdoor**     | ALLEY, STREET, SIDEWALK, VACANT LOT / LAND, PARK PROPERTY, BRIDGE, HIGHWAY / EXPRESSWAY, LAKEFRONT / WATERFRONT / RIVERBANK | Highway/Road/Alley/Street/Sidewalk, Lake/Waterway/Beach, Field/Woods |
| **Transportation Hub**  | CTA PLATFORM, CTA BUS, CTA TRAIN, CTA STATION, AIRPORT TERMINAL LOWER LEVEL - SECURE AREA, AIRPORT PARKING LOT, AIRPORT TERMINAL UPPER LEVEL - SECURE AREA | Air/Bus/Train Terminal |
| **Retail/Commercial**  | SMALL RETAIL STORE, GROCERY FOOD STORE, DEPARTMENT STORE, DRUG STORE, CONVENIENCE STORE, CURRENCY EXCHANGE, BANK, PAWN SHOP, LIQUOR STORE | Grocery/Supermarket, Convenience Store, Specialty Store, Bank/Savings and Loan |
| **Entertainment**      | BAR OR TAVERN, TAVERN / LIQUOR STORE, MOVIE HOUSE / THEATER, SPORTS ARENA / STADIUM, BOWLING ALLEY | Bar/Nightclub, Arena/Stadium/Fairgrounds/Coliseum, Amusement Park |
| **Government/Public**  | GOVERNMENT BUILDING / PROPERTY, POLICE FACILITY, FIRE STATION, SCHOOL - PUBLIC BUILDING, LIBRARY | Government/Public Building, School-Elementary/Secondary |
| **Medical Facility**   | HOSPITAL BUILDING / GROUNDS, MEDICAL / DENTAL OFFICE, NURSING / RETIREMENT HOME, ANIMAL HOSPITAL | Drug Store/Doctor's Office/Hospital |
| **Workplace/Office**   | COMMERCIAL / BUSINESS OFFICE, FACTORY / MANUFACTURING BUILDING | Commercial/Office Building, Industrial Site |
| **Parking Lot**        | PARKING LOT / GARAGE (NON RESIDENTIAL), CHA PARKING LOT / GROUNDS | Parking/Drop Lot/Garage |
| **Unknown/Other**      | OTHER (SPECIFY), ATM (AUTOMATIC TELLER MACHINE), COIN OPERATED MACHINE | Other/Unknown |
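
One way to keep the table and the code in sync is to store the standard as "category to raw values" and flatten it into a lookup dictionary. The sketch below is illustrative only: `chicago_standard` and `invert_standard` are hypothetical names, and just a few rows of the table are reproduced. It also raises an error if one raw value is assigned to two categories, which a flat lookup cannot represent.

```python
# Hypothetical excerpt of the standard: unified category -> raw Chicago values.
# Only a few rows of the table are reproduced here for brevity.
chicago_standard = {
    "Residence": ["RESIDENCE", "HOUSE", "APARTMENT"],
    "Street/Outdoor": ["ALLEY", "STREET", "SIDEWALK"],
    "Parking Lot": [
        "PARKING LOT / GARAGE (NON RESIDENTIAL)",
        "CHA PARKING LOT / GROUNDS",
    ],
}


def invert_standard(standard):
    """Flatten {category: [raw values]} into {raw value: category}."""
    lookup = {}
    for category, raw_values in standard.items():
        for raw in raw_values:
            # A raw value listed under two categories is ambiguous.
            if raw in lookup and lookup[raw] != category:
                raise ValueError(
                    f"{raw!r} is assigned to both {lookup[raw]!r} and {category!r}"
                )
            lookup[raw] = category
    return lookup


chicago_lookup = invert_standard(chicago_standard)
print(chicago_lookup["HOUSE"])  # Residence
```

The resulting dictionary has the same shape as the mapping dictionaries used in the cells that follow, so it could be passed to `Series.map` directly.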


## Location mapping for Chicago Crime Dataset
Each `key` is a value from the "location_description" column of the dataset. Each `value` is the corresponding location type. The mapping is based on the location descriptions in the dataset and common-sense knowledge about the types of locations.

In [None]:
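# Create a mapping dictionary covering the Chicago values listed in the standard table.
# Values not listed here (e.g. ATM, COIN OPERATED MACHINE) fall back to "Unknown/Other" below.
mapping1 = {
    "RESIDENCE": "Residence",
    "HOUSE": "Residence",
    "APARTMENT": "Residence",
    "RESIDENCE - YARD (FRONT / BACK)": "Residence",
    "RESIDENCE - GARAGE": "Residence",
    "RESIDENCE - PORCH / HALLWAY": "Residence",
    "DRIVEWAY - RESIDENTIAL": "Residence",
    "STREET": "Street/Outdoor",
    "ALLEY": "Street/Outdoor",
    "SIDEWALK": "Street/Outdoor",
    "VACANT LOT / LAND": "Street/Outdoor",
    "PARK PROPERTY": "Street/Outdoor",
    "BRIDGE": "Street/Outdoor",
    "HIGHWAY / EXPRESSWAY": "Street/Outdoor",
    "LAKEFRONT / WATERFRONT / RIVERBANK": "Street/Outdoor",
    "CTA PLATFORM": "Transportation Hub",
    "CTA BUS": "Transportation Hub",
    "CTA TRAIN": "Transportation Hub",
    "CTA STATION": "Transportation Hub",
    "AIRPORT TERMINAL LOWER LEVEL - SECURE AREA": "Transportation Hub",
    "AIRPORT TERMINAL UPPER LEVEL - SECURE AREA": "Transportation Hub",
    "AIRPORT PARKING LOT": "Transportation Hub",
    "SMALL RETAIL STORE": "Retail/Commercial",
    "GROCERY FOOD STORE": "Retail/Commercial",
    "DRUG STORE": "Retail/Commercial",
    "DEPARTMENT STORE": "Retail/Commercial",
    "CONVENIENCE STORE": "Retail/Commercial",
    "CURRENCY EXCHANGE": "Retail/Commercial",
    "BANK": "Retail/Commercial",
    "PAWN SHOP": "Retail/Commercial",
    "LIQUOR STORE": "Retail/Commercial",
    "BAR OR TAVERN": "Entertainment",
    "TAVERN / LIQUOR STORE": "Entertainment",
    "MOVIE HOUSE / THEATER": "Entertainment",
    "SPORTS ARENA / STADIUM": "Entertainment",
    "BOWLING ALLEY": "Entertainment",
    "GOVERNMENT BUILDING / PROPERTY": "Government/Public",
    "POLICE FACILITY": "Government/Public",
    "FIRE STATION": "Government/Public",
    "SCHOOL - PUBLIC BUILDING": "Government/Public",
    "LIBRARY": "Government/Public",
    "HOSPITAL BUILDING / GROUNDS": "Medical Facility",
    "MEDICAL / DENTAL OFFICE": "Medical Facility",
    "NURSING / RETIREMENT HOME": "Medical Facility",
    "ANIMAL HOSPITAL": "Medical Facility",
    "COMMERCIAL / BUSINESS OFFICE": "Workplace/Office",
    "FACTORY / MANUFACTURING BUILDING": "Workplace/Office",
    "PARKING LOT / GARAGE (NON RESIDENTIAL)": "Parking Lot",
    "CHA PARKING LOT / GROUNDS": "Parking Lot",
    "OTHER (SPECIFY)": "Unknown/Other"
}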

# Read the dataset
df1 = pd.read_csv("dataset1.csv")  # Replace with the path to Chicago dataset 

# Apply the mapping
df1["unified_location_category"] = df1["location_description"].map(mapping1).fillna("Unknown/Other")

# Save the processed data
# df1.to_csv("dataset1_mapped.csv", index=False)

# Display the first few rows
df1.head()
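
Because `.fillna("Unknown/Other")` is applied right after `.map(...)`, any raw value that is missing from the dictionary is silently folded into "Unknown/Other". A quick audit of which values take that fallback can reveal gaps in the mapping. The sketch below uses a small made-up frame and a deliberately incomplete `toy_mapping` (both hypothetical) rather than the real Chicago file.

```python
import pandas as pd

# Toy stand-in for the Chicago data; the real notebook reads it from CSV.
toy = pd.DataFrame(
    {"location_description": ["STREET", "RESIDENCE", "RESIDENCE", "GAS STATION", None]}
)
toy_mapping = {"STREET": "Street/Outdoor"}  # deliberately incomplete

# Map WITHOUT fillna so that unmapped values stay visible as NaN.
mapped = toy["location_description"].map(toy_mapping)

# Rows where the raw value is present but has no entry in the mapping.
unmapped_mask = mapped.isna() & toy["location_description"].notna()
unmapped_counts = toy.loc[unmapped_mask, "location_description"].value_counts()
print(unmapped_counts)
```

Frequent values in `unmapped_counts` are candidates for new dictionary entries; genuinely rare or ambiguous ones can stay in the fallback category.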

## Location mapping for NIBRS test dataset
Each `key` is a value from the "location_name" column of the dataset. Each `value` is the corresponding location type. The mapping is based on the location descriptions in the dataset and common-sense knowledge about the types of locations.

In [2]:
# Create a mapping dictionary
mapping2 = {
    "Residence/Home": "Residence",
    "Highway/Road/Alley/Street/Sidewalk": "Street/Outdoor",
    "Lake/Waterway/Beach": "Street/Outdoor",
    "Field/Woods": "Street/Outdoor",
    "Air/Bus/Train Terminal": "Transportation Hub",
    "Grocery/Supermarket": "Retail/Commercial",
    "Drug Store/Doctor's Office/Hospital": "Medical Facility",
    "Convenience Store": "Retail/Commercial",
    "Specialty Store": "Retail/Commercial",
    "Bank/Savings and Loan": "Retail/Commercial",
    "Bar/Nightclub": "Entertainment",
    "Arena/Stadium/Fairgrounds/Coliseum": "Entertainment",
    "Amusement Park": "Entertainment",
    "Government/Public Building": "Government/Public",
    "School-Elementary/Secondary": "Government/Public",
    "Commercial/Office Building": "Workplace/Office",
    "Industrial Site": "Workplace/Office",
    "Parking/Drop Lot/Garage": "Parking Lot",
    "Other/Unknown": "Unknown/Other"
}

# Read the dataset
df2 = pd.read_csv("merged_2023_no_redundant.csv")  # Replace with the path to NIBRS test dataset

# Apply the mapping
df2["unified_location_category"] = df2["location_name"].map(mapping2).fillna("Unknown/Other")

# Save the processed data
# df2.to_csv("merged_2023_no_redundant_mapped.csv", index=False)

# Display the first few rows
df2.head()



| Unnamed: 0 | agency_id | incident_id | nibrs_month_id | cargo_theft_flag | incident_date | incident_hour | incident_status | arrestee_id | arrest_date | multiple_indicator | ... | location_id | num_premises_entered | method_entry_code | offense_name | crime_against | offense_category_name | offense_group | location_code | location_name | unified_location_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4675 | 169368274 | 44493177 | f | 2023-01-19 00:00:00 | 21.0 | ACCEPTED | 51253221.0 | 2023-01-19 00:00:00 | N | ... | 35 |  |  | Simple Assault | Person | Assault Offenses | A | 20 | Residence/Home | Residence |
| 1 | 4675 | 169368295 | 44493179 | f | 2023-02-08 00:00:00 | 18.0 | ACCEPTED | 51253236.0 | 2023-02-08 00:00:00 | N | ... | 35 |  |  | Destruction/Damage/Vandalism of Property | Property | Destruction/Damage/Vandalism of Property | A | 20 | Residence/Home | Residence |
| 2 | 4675 | 172820689 | 44525191 | f | 2023-04-13 00:00:00 | 13.0 | ACCEPTED | 52245171.0 | 2023-04-13 00:00:00 | N | ... | 35 |  |  | Intimidation | Person | Assault Offenses | A | 20 | Residence/Home | Residence |
| 3 | 4675 | 172820694 | 44525191 | f | 2023-04-13 00:00:00 | 21.0 | ACCEPTED | 52245172.0 | 2023-04-13 00:00:00 | N | ... | 25 |  |  | Drug/Narcotic Violations | Society | Drug/Narcotic Offenses | A | 13 | Highway/Road/Alley/Street/Sidewalk | Street/Outdoor |
| 4 | 4675 | 187560794 | 44525192 | f | 2023-05-20 00:00:00 | 15.0 | ACCEPTED | 57077143.0 | 2023-05-20 00:00:00 | N | ... | 35 |  |  | Simple Assault | Person | Assault Offenses | A | 20 | Residence/Home | Residence |
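
Once both datasets carry a `unified_location_category` column, their category shares can be compared side by side, which is the comparability the mapping standard aims for. The following is a minimal sketch with made-up values; the series `chicago` and `nibrs` are hypothetical stand-ins for `df1["unified_location_category"]` and `df2["unified_location_category"]`.

```python
import pandas as pd

# Toy stand-ins for the two mapped columns (values are illustrative only).
chicago = pd.Series(["Residence", "Street/Outdoor", "Street/Outdoor", "Parking Lot"])
nibrs = pd.Series(["Residence", "Residence", "Street/Outdoor", "Unknown/Other"])

# One column of relative frequencies per dataset, aligned on the category label.
shares = pd.concat(
    {
        "chicago": chicago.value_counts(normalize=True),
        "nibrs": nibrs.value_counts(normalize=True),
    },
    axis=1,
).fillna(0.0)  # a category absent from one dataset has share 0
print(shares)
```

A large "Unknown/Other" share in one dataset but not the other may indicate raw values that the mapping does not yet cover.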


## Create unified location type encoding number for both datasets
Each `key` is a unified location type produced by the previous processing step. Each `value` is the corresponding integer code.


| **common_category**    | **UNIFIED_LOCATION_CODE** |
|-----------------------|------------------------|
| Residence            | 1  |
| Street/Outdoor      | 2  |
| Transportation Hub  | 3  |
| Retail/Commercial   | 4  |
| Entertainment       | 5  |
| Government/Public   | 6  |
| Medical Facility    | 7  |
| Workplace/Office    | 8  |
| Parking Lot        | 9  |
| Unknown/Other      | 10 |


In [None]:
# Define UNIFIED_LOCATION_CODE mapping
unified_location_mapping = {
    "Residence": 1,
    "Street/Outdoor": 2,
    "Transportation Hub": 3,
    "Retail/Commercial": 4,
    "Entertainment": 5,
    "Government/Public": 6,
    "Medical Facility": 7,
    "Workplace/Office": 8,
    "Parking Lot": 9,
    "Unknown/Other": 10
}
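
The integer codes are labels, not magnitudes: code 10 is not "larger" than code 1 in any meaningful sense. Some models treat integer inputs as ordered, so a one-hot representation is a common alternative. The sketch below is not part of the original pipeline. It uses a three-entry excerpt of the code table (`codes`, a hypothetical name) to show the integer encoding, a reverse lookup, and a one-hot variant built with `pd.get_dummies`.

```python
import pandas as pd

codes = {"Residence": 1, "Street/Outdoor": 2, "Unknown/Other": 10}  # excerpt of the table

# Codes must be unique, otherwise two categories become indistinguishable.
assert len(set(codes.values())) == len(codes)

categories = pd.Series(["Residence", "Unknown/Other", "Street/Outdoor", "Residence"])
as_codes = categories.map(codes)

# Reverse lookup for turning codes back into readable labels.
code_to_label = {code: label for label, code in codes.items()}
round_trip = as_codes.map(code_to_label)

# One-hot alternative: one indicator column per category, no implied order.
one_hot = pd.get_dummies(categories, prefix="loc")
print(as_codes.tolist())
print(one_hot.columns.tolist())
```

Whether integer codes or indicator columns work better depends on the downstream model, so this is a design choice to evaluate rather than a fixed recommendation.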

In [None]:
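# Read and process dataset1
# Note: dataset1_mapped.csv exists only if the to_csv line in the Chicago mapping cell was run
df1 = pd.read_csv("dataset1_mapped.csv")  # Replace with the path to your dataset
# The category column created earlier is "unified_location_category"
df1["UNIFIED_LOCATION_CODE"] = df1["unified_location_category"].map(unified_location_mapping)
df1.to_csv("dataset1_final.csv", index=False)
print(df1.head())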
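
A wrong column name raises a `KeyError` during encoding, while a category that has no code silently becomes `NaN`. A small guard can make both failure modes explicit. The helper below, `add_location_code`, is a hypothetical name used for illustration only, demonstrated on a two-row toy frame.

```python
import pandas as pd


def add_location_code(df, category_col, code_map, out_col="UNIFIED_LOCATION_CODE"):
    """Return a copy of df with an integer code column; fail loudly on bad input."""
    if category_col not in df.columns:
        raise KeyError(
            f"column {category_col!r} not found; available: {list(df.columns)}"
        )
    codes = df[category_col].map(code_map)
    if codes.isna().any():
        # Collect the category labels that have no code assigned.
        missing = sorted(df.loc[codes.isna(), category_col].dropna().unique())
        raise ValueError(f"categories without a code: {missing}")
    out = df.copy()
    out[out_col] = codes.astype(int)
    return out


demo = pd.DataFrame({"unified_location_category": ["Residence", "Parking Lot"]})
coded = add_location_code(
    demo, "unified_location_category", {"Residence": 1, "Parking Lot": 9}
)
print(coded["UNIFIED_LOCATION_CODE"].tolist())  # [1, 9]
```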

In [5]:
# Read and process dataset2
# df2 = pd.read_csv("dataset2_mapped.csv")  # Replace with the path to your dataset
df2["UNIFIED_LOCATION_CODE"] = df2["unified_location_category"].map(unified_location_mapping)
print(df2.head())

   agency_id  incident_id  nibrs_month_id cargo_theft_flag  \
0       4675    169368274        44493177                f   
1       4675    169368295        44493179                f   
2       4675    172820689        44525191                f   
3       4675    172820694        44525191                f   
4       4675    187560794        44525192                f   

         incident_date  incident_hour incident_status  arrestee_id  \
0  2023-01-19 00:00:00           21.0        ACCEPTED   51253221.0   
1  2023-02-08 00:00:00           18.0        ACCEPTED   51253236.0   
2  2023-04-13 00:00:00           13.0        ACCEPTED   52245171.0   
3  2023-04-13 00:00:00           21.0        ACCEPTED   52245172.0   
4  2023-05-20 00:00:00           15.0        ACCEPTED   57077143.0   

           arrest_date multiple_indicator  ... num_premises_entered  \
0  2023-01-19 00:00:00                  N  ...                  NaN   
1  2023-02-08 00:00:00                  N  ...                  

In [6]:
# Note: this overwrites the input file that df2 was read from
df2.to_csv("merged_2023_no_redundant.csv", index=False)