# Feature Engineering

This notebook covers three aspects of feature engineering for our arrest prediction task:

- Temporal features
- Spatial (location type) features
- Crime type mapping for the Chicago and NIBRS datasets

In [11]:
import pandas as pd
import holidays

## Load Dataset

In [12]:
# Replace with the path to Chicago dataset
chicago_filepath = "Chicago_data.csv"
# Replace with the path to NIBRS test dataset
NIBRS_filepath = "merged_2023_no_redundant.csv"

# Read the dataset
chicago_df = pd.read_csv(chicago_filepath)
NIBRS_df = pd.read_csv(NIBRS_filepath)  


In [13]:
print("Chicago dataset shape: ", chicago_df.shape)
print("NIBRS dataset shape: ", NIBRS_df.shape)

Chicago dataset shape:  (1100000, 22)
NIBRS dataset shape:  (572438, 26)


## Spatial Feature Engineering

### Mapping Standard for Location Type

This section defines a mapping standard for the location type feature. The goal is a common set of categories that applies to both datasets, so that the feature is consistent and comparable across them.

| Common Category         | Chicago Dataset (location_description) | NIBRS Dataset (location_name) |
|-------------------------|--------------------------------|--------------------------|
| **Residence**          | RESIDENCE, HOUSE, APARTMENT, RESIDENCE - YARD (FRONT / BACK), RESIDENCE - GARAGE, RESIDENCE - PORCH / HALLWAY, DRIVEWAY - RESIDENTIAL | Residence/Home |
| **Street/Outdoor**     | ALLEY, STREET, SIDEWALK, VACANT LOT / LAND, PARK PROPERTY, BRIDGE, HIGHWAY / EXPRESSWAY, LAKEFRONT / WATERFRONT / RIVERBANK | Highway/Road/Alley/Street/Sidewalk, Lake/Waterway/Beach, Field/Woods |
| **Transportation Hub**  | CTA PLATFORM, CTA BUS, CTA TRAIN, CTA STATION, AIRPORT TERMINAL LOWER LEVEL - SECURE AREA, AIRPORT PARKING LOT, AIRPORT TERMINAL UPPER LEVEL - SECURE AREA | Air/Bus/Train Terminal |
| **Retail/Commercial**  | SMALL RETAIL STORE, GROCERY FOOD STORE, DEPARTMENT STORE, DRUG STORE, CONVENIENCE STORE, CURRENCY EXCHANGE, BANK, PAWN SHOP, LIQUOR STORE | Grocery/Supermarket, Convenience Store, Specialty Store, Bank/Savings and Loan |
| **Entertainment**      | BAR OR TAVERN, TAVERN / LIQUOR STORE, MOVIE HOUSE / THEATER, SPORTS ARENA / STADIUM, BOWLING ALLEY | Bar/Nightclub, Arena/Stadium/Fairgrounds/Coliseum, Amusement Park |
| **Government/Public**  | GOVERNMENT BUILDING / PROPERTY, POLICE FACILITY, FIRE STATION, SCHOOL - PUBLIC BUILDING, LIBRARY | Government/Public Building, School-Elementary/Secondary |
| **Medical Facility**   | HOSPITAL BUILDING / GROUNDS, MEDICAL / DENTAL OFFICE, NURSING / RETIREMENT HOME, ANIMAL HOSPITAL | Drug Store/Doctor's Office/Hospital |
| **Workplace/Office**   | COMMERCIAL / BUSINESS OFFICE, FACTORY / MANUFACTURING BUILDING | Commercial/Office Building, Industrial Site |
| **Parking Lot**        | PARKING LOT / GARAGE (NON RESIDENTIAL), CHA PARKING LOT / GROUNDS | Parking/Drop Lot/Garage |
| **Unknown/Other**      | OTHER (SPECIFY), ATM (AUTOMATIC TELLER MACHINE), COIN OPERATED MACHINE | Other/Unknown |


### Location mapping for Chicago Crime Dataset
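Each `key` is a value from the `location_description` column; each `value` is the corresponding unified location type. The mapping follows the location descriptions in the dataset and common-sense knowledge about location types. Descriptions that are not listed in the dictionary fall back to `Unknown/Other`. This currently includes some values from the table above, such as `RESIDENCE` and `LIBRARY`, as the output below shows.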

In [14]:
# Create a mapping dictionary
mapping1 = {
    "RESIDENCE - YARD (FRONT / BACK)": "Residence",
    "APARTMENT": "Residence",
    "HOUSE": "Residence",
    "STREET": "Street/Outdoor",
    "ALLEY": "Street/Outdoor",
    "SIDEWALK": "Street/Outdoor",
    "VACANT LOT / LAND": "Street/Outdoor",
    "PARK PROPERTY": "Street/Outdoor",
    "CTA PLATFORM": "Transportation Hub",
    "CTA BUS": "Transportation Hub",
    "CTA TRAIN": "Transportation Hub",
    "AIRPORT TERMINAL LOWER LEVEL - SECURE AREA": "Transportation Hub",
    "GROCERY FOOD STORE": "Retail/Commercial",
    "DRUG STORE": "Retail/Commercial",
    "DEPARTMENT STORE": "Retail/Commercial",
    "BANK": "Retail/Commercial",
    "BAR OR TAVERN": "Entertainment",
    "MOVIE HOUSE / THEATER": "Entertainment",
    "SPORTS ARENA / STADIUM": "Entertainment",
    "GOVERNMENT BUILDING / PROPERTY": "Government/Public",
    "POLICE FACILITY": "Government/Public",
    "HOSPITAL BUILDING / GROUNDS": "Medical Facility",
    "COMMERCIAL / BUSINESS OFFICE": "Workplace/Office",
    "FACTORY / MANUFACTURING BUILDING": "Workplace/Office",
    "PARKING LOT / GARAGE (NON RESIDENTIAL)": "Parking Lot",
    "OTHER (SPECIFY)": "Unknown/Other"
}

# Apply the mapping
chicago_df["unified_location_category"] = chicago_df["location_description"].map(mapping1).fillna("Unknown/Other")

# Save the processed data
# chicago_df.to_csv("dataset1_mapped.csv", index=False)

# Display the first few rows
chicago_df.head()

Unnamed: 0,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,...,community_area,fbi_code,x_coordinate,y_coordinate,year,updated_on,latitude,longitude,location,unified_location_category
0,13777896,JJ183487,2025-03-16T03:00:00.000,040XX N KEYSTONE AVE,486,BATTERY,DOMESTIC BATTERY SIMPLE,APARTMENT,False,True,...,16.0,08B,1148566.0,1926623.0,2025,2025-03-19T15:41:08.000,41.954594,-87.729245,"\n, \n(41.954593897, -87.729244692)",Residence
1,13776543,JJ182816,2025-03-12T00:00:00.000,037XX W NORTH AVE,910,MOTOR VEHICLE THEFT,AUTOMOBILE,STREET,False,False,...,23.0,07,1151340.0,1910377.0,2025,2025-03-19T15:42:01.000,41.909959,-87.719475,"\n, \n(41.909959416, -87.719474573)",Street/Outdoor
2,13772937,JJ178623,2025-03-12T00:00:00.000,076XX S EAST END AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,...,43.0,11,1188860.0,1854711.0,2025,2025-03-19T15:42:01.000,41.756389,-87.583428,"\n, \n(41.756389436, -87.583428355)",Unknown/Other
3,13774108,JJ179898,2025-03-12T00:00:00.000,097XX S MERRILL AVE,910,MOTOR VEHICLE THEFT,AUTOMOBILE,STREET,False,False,...,51.0,07,1192381.0,1840724.0,2025,2025-03-19T15:42:01.000,41.717923,-87.570979,"\n, \n(41.717922891, -87.570978883)",Street/Outdoor
4,13772980,JJ178262,2025-03-12T00:00:00.000,095XX S HALSTED ST,560,ASSAULT,SIMPLE,LIBRARY,False,False,...,73.0,08A,1172656.0,1841600.0,2025,2025-03-19T15:42:01.000,41.720783,-87.643198,"\n, \n(41.720783347, -87.643197739)",Unknown/Other
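
Because `.fillna("Unknown/Other")` hides every description that the dictionary misses, it can help to audit coverage before applying the fallback. The following is a minimal sketch on toy data; the `descriptions` series and `mapping_excerpt` dictionary are illustrative stand-ins, not the notebook's full objects.

```python
import pandas as pd

# Toy stand-in for chicago_df["location_description"]; values are illustrative.
descriptions = pd.Series(
    ["APARTMENT", "RESIDENCE", "STREET", "LIBRARY", "RESIDENCE", None]
)

# A small excerpt of the mapping dictionary used above.
mapping_excerpt = {
    "APARTMENT": "Residence",
    "STREET": "Street/Outdoor",
    "OTHER (SPECIFY)": "Unknown/Other",
}

# Map WITHOUT fillna so that unmapped descriptions stay NaN and can be counted.
mapped = descriptions.map(mapping_excerpt)

# Descriptions that are present in the data but absent from the dictionary.
unmapped_counts = descriptions[mapped.isna() & descriptions.notna()].value_counts()
print(unmapped_counts)
```

Running the same check on the real column would list the most frequent unmapped descriptions, which are the natural candidates to add to the dictionary.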


### Location mapping for NIBRS test dataset
Each `key` is a value from the `location_name` column; each `value` is the corresponding unified location type. The mapping follows the location names in the dataset and common-sense knowledge about location types. Names that are not listed in the dictionary (for example `Field/Woods`) fall back to `Unknown/Other`.

In [15]:
# Create a mapping dictionary
mapping2 = {
    "Residence/Home": "Residence",
    "Highway/Road/Alley/Street/Sidewalk": "Street/Outdoor",
    "Lake/Waterway/Beach": "Street/Outdoor",
    "Air/Bus/Train Terminal": "Transportation Hub",
    "Grocery/Supermarket": "Retail/Commercial",
    "Drug Store/Doctor's Office/Hospital": "Medical Facility",
    "Convenience Store": "Retail/Commercial",
    "Specialty Store": "Retail/Commercial",
    "Bank/Savings and Loan": "Retail/Commercial",
    "Bar/Nightclub": "Entertainment",
    "Arena/Stadium/Fairgrounds/Coliseum": "Entertainment",
    "Amusement Park": "Entertainment",
    "Government/Public Building": "Government/Public",
    "School-Elementary/Secondary": "Government/Public",
    "Commercial/Office Building": "Workplace/Office",
    "Industrial Site": "Workplace/Office",
    "Parking/Drop Lot/Garage": "Parking Lot",
    "Other/Unknown": "Unknown/Other"
}

# Apply the mapping
NIBRS_df["unified_location_category"] = NIBRS_df["location_name"].map(mapping2).fillna("Unknown/Other")

# Save the processed data
# NIBRS_df.to_csv("merged_2023_no_redundant_mapped.csv", index=False)

# Display the first few rows
NIBRS_df.head()

Unnamed: 0,agency_id,incident_id,nibrs_month_id,cargo_theft_flag,incident_date,incident_hour,incident_status,arrestee_id,arrest_date,multiple_indicator,...,location_id,num_premises_entered,method_entry_code,offense_name,crime_against,offense_category_name,offense_group,location_code,location_name,unified_location_category
0,4675,169368274,44493177,f,2023-01-19 00:00:00,21.0,ACCEPTED,51253221.0,2023-01-19 00:00:00,N,...,35,,,Simple Assault,Person,Assault Offenses,A,20,Residence/Home,Residence
1,4675,169368295,44493179,f,2023-02-08 00:00:00,18.0,ACCEPTED,51253236.0,2023-02-08 00:00:00,N,...,35,,,Destruction/Damage/Vandalism of Property,Property,Destruction/Damage/Vandalism of Property,A,20,Residence/Home,Residence
2,4675,172820689,44525191,f,2023-04-13 00:00:00,13.0,ACCEPTED,52245171.0,2023-04-13 00:00:00,N,...,35,,,Intimidation,Person,Assault Offenses,A,20,Residence/Home,Residence
3,4675,172820694,44525191,f,2023-04-13 00:00:00,21.0,ACCEPTED,52245172.0,2023-04-13 00:00:00,N,...,25,,,Drug/Narcotic Violations,Society,Drug/Narcotic Offenses,A,13,Highway/Road/Alley/Street/Sidewalk,Street/Outdoor
4,4675,187560794,44525192,f,2023-05-20 00:00:00,15.0,ACCEPTED,57077143.0,2023-05-20 00:00:00,N,...,35,,,Simple Assault,Person,Assault Offenses,A,20,Residence/Home,Residence


### Create unified location type encoding number for both datasets
Each `key` is a unified location category from the previous step; each `value` is its integer code.


| **common_category**    | **UNIFIED_LOCATION_CODE** |
|-----------------------|------------------------|
| Residence            | 1  |
| Street/Outdoor      | 2  |
| Transportation Hub  | 3  |
| Retail/Commercial   | 4  |
| Entertainment       | 5  |
| Government/Public   | 6  |
| Medical Facility    | 7  |
| Workplace/Office    | 8  |
| Parking Lot        | 9  |
| Unknown/Other      | 10 |


In [16]:
# Define UNIFIED_LOCATION_CODE mapping
unified_location_mapping = {
    "Residence": 1,
    "Street/Outdoor": 2,
    "Transportation Hub": 3,
    "Retail/Commercial": 4,
    "Entertainment": 5,
    "Government/Public": 6,
    "Medical Facility": 7,
    "Workplace/Office": 8,
    "Parking Lot": 9,
    "Unknown/Other": 10
}

In [17]:
# Encode the unified location category for the Chicago dataset
chicago_df["UNIFIED_LOCATION_CODE"] = chicago_df["unified_location_category"].map(unified_location_mapping)
# chicago_df.to_csv("dataset1_final.csv", index=False)
chicago_df.head()

Unnamed: 0,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,...,fbi_code,x_coordinate,y_coordinate,year,updated_on,latitude,longitude,location,unified_location_category,UNIFIED_LOCATION_CODE
0,13777896,JJ183487,2025-03-16T03:00:00.000,040XX N KEYSTONE AVE,486,BATTERY,DOMESTIC BATTERY SIMPLE,APARTMENT,False,True,...,08B,1148566.0,1926623.0,2025,2025-03-19T15:41:08.000,41.954594,-87.729245,"\n, \n(41.954593897, -87.729244692)",Residence,1
1,13776543,JJ182816,2025-03-12T00:00:00.000,037XX W NORTH AVE,910,MOTOR VEHICLE THEFT,AUTOMOBILE,STREET,False,False,...,07,1151340.0,1910377.0,2025,2025-03-19T15:42:01.000,41.909959,-87.719475,"\n, \n(41.909959416, -87.719474573)",Street/Outdoor,2
2,13772937,JJ178623,2025-03-12T00:00:00.000,076XX S EAST END AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,...,11,1188860.0,1854711.0,2025,2025-03-19T15:42:01.000,41.756389,-87.583428,"\n, \n(41.756389436, -87.583428355)",Unknown/Other,10
3,13774108,JJ179898,2025-03-12T00:00:00.000,097XX S MERRILL AVE,910,MOTOR VEHICLE THEFT,AUTOMOBILE,STREET,False,False,...,07,1192381.0,1840724.0,2025,2025-03-19T15:42:01.000,41.717923,-87.570979,"\n, \n(41.717922891, -87.570978883)",Street/Outdoor,2
4,13772980,JJ178262,2025-03-12T00:00:00.000,095XX S HALSTED ST,560,ASSAULT,SIMPLE,LIBRARY,False,False,...,08A,1172656.0,1841600.0,2025,2025-03-19T15:42:01.000,41.720783,-87.643198,"\n, \n(41.720783347, -87.643197739)",Unknown/Other,10


In [18]:
# Encode the unified location category for the NIBRS dataset
NIBRS_df["UNIFIED_LOCATION_CODE"] = NIBRS_df["unified_location_category"].map(unified_location_mapping)
NIBRS_df.head()

Unnamed: 0,agency_id,incident_id,nibrs_month_id,cargo_theft_flag,incident_date,incident_hour,incident_status,arrestee_id,arrest_date,multiple_indicator,...,num_premises_entered,method_entry_code,offense_name,crime_against,offense_category_name,offense_group,location_code,location_name,unified_location_category,UNIFIED_LOCATION_CODE
0,4675,169368274,44493177,f,2023-01-19 00:00:00,21.0,ACCEPTED,51253221.0,2023-01-19 00:00:00,N,...,,,Simple Assault,Person,Assault Offenses,A,20,Residence/Home,Residence,1
1,4675,169368295,44493179,f,2023-02-08 00:00:00,18.0,ACCEPTED,51253236.0,2023-02-08 00:00:00,N,...,,,Destruction/Damage/Vandalism of Property,Property,Destruction/Damage/Vandalism of Property,A,20,Residence/Home,Residence,1
2,4675,172820689,44525191,f,2023-04-13 00:00:00,13.0,ACCEPTED,52245171.0,2023-04-13 00:00:00,N,...,,,Intimidation,Person,Assault Offenses,A,20,Residence/Home,Residence,1
3,4675,172820694,44525191,f,2023-04-13 00:00:00,21.0,ACCEPTED,52245172.0,2023-04-13 00:00:00,N,...,,,Drug/Narcotic Violations,Society,Drug/Narcotic Offenses,A,13,Highway/Road/Alley/Street/Sidewalk,Street/Outdoor,2
4,4675,187560794,44525192,f,2023-05-20 00:00:00,15.0,ACCEPTED,57077143.0,2023-05-20 00:00:00,N,...,,,Simple Assault,Person,Assault Offenses,A,20,Residence/Home,Residence,1


## Temporal Feature Engineering

For the NIBRS dataset, we first need to create a unified date by combining `incident_hour` and `incident_date` to match the date format used in the Chicago dataset.

In [20]:
# Drop rows that lack a date or an hour
NIBRS_df = NIBRS_df.dropna(subset=["incident_date", "incident_hour"]).copy()

# Keep only valid hours (0-23); .copy() avoids assigning into a view of the filtered frame
NIBRS_df["incident_hour"] = NIBRS_df["incident_hour"].astype(int)
NIBRS_df = NIBRS_df[(NIBRS_df["incident_hour"] >= 0) & (NIBRS_df["incident_hour"] <= 23)].copy()

NIBRS_df["incident_date"] = pd.to_datetime(NIBRS_df["incident_date"], errors='coerce')

# combine date & hour to match date in Chicago dataset
NIBRS_df["date"] = NIBRS_df.apply(
    lambda row: row["incident_date"].strftime("%Y-%m-%d") + f"T{row['incident_hour']:02d}:00:00.000",
    axis=1
)
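
The row-wise `apply` above builds one string per row, which is then parsed back into a datetime later. A vectorized alternative, sketched here on a toy frame (the `toy` name and its values are illustrative), parses the date once and adds the hour as a timedelta:

```python
import pandas as pd

# Toy stand-in for the two NIBRS columns; values are illustrative.
toy = pd.DataFrame({
    "incident_date": ["2023-01-19 00:00:00", "2023-05-20 00:00:00"],
    "incident_hour": [21.0, 15.0],
})

# Parse the date once, then add the hour as a timedelta (no row-wise apply).
toy["incident_date"] = pd.to_datetime(toy["incident_date"], errors="coerce")
toy["date"] = toy["incident_date"] + pd.to_timedelta(toy["incident_hour"], unit="h")
print(toy["date"])
```

The resulting column is already of datetime type, so it needs no string round trip, and it should scale better on several hundred thousand rows.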

In [22]:
chicago_df['date'] = pd.to_datetime(chicago_df['date'], errors='coerce')
NIBRS_df['date'] = pd.to_datetime(NIBRS_df['date'], errors='coerce')

### Basic Temporal Feature

- month – the month of the incident
- day – the day in month of the incident
- hour – the hour of the incident
- weekday – the day of the week
- is_weekend – flag for Saturday/Sunday

In [23]:
def extractBasicTimeFeature(df):
    df['month'] = df['date'].dt.month
    df['day'] = df['date'].dt.day
    df['hour'] = df['date'].dt.hour
    df['weekday'] = df['date'].dt.weekday  # 0=Monday, 6=Sunday
    df['is_weekend'] = df['weekday'].isin([5, 6]).astype(int) # if weekend

    return df
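
As a quick check of the weekday convention, the sketch below applies the same two expressions to two timestamps taken from the Chicago rows shown in this notebook. The `toy` frame is illustrative only.

```python
import pandas as pd

# Two timestamps from the Chicago sample rows: a Sunday and a Wednesday.
toy = pd.DataFrame(
    {"date": pd.to_datetime(["2025-03-16 03:00:00", "2025-03-12 00:00:00"])}
)

toy["weekday"] = toy["date"].dt.weekday  # 0=Monday, 6=Sunday
toy["is_weekend"] = toy["weekday"].isin([5, 6]).astype(int)
print(toy)
```

The Sunday row gets `weekday` 6 and `is_weekend` 1, and the Wednesday row gets `weekday` 2 and `is_weekend` 0.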

In [24]:
chicago_df = extractBasicTimeFeature(chicago_df)
chicago_df.head()

Unnamed: 0,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,...,latitude,longitude,location,unified_location_category,UNIFIED_LOCATION_CODE,month,day,hour,weekday,is_weekend
0,13777896,JJ183487,2025-03-16 03:00:00,040XX N KEYSTONE AVE,486,BATTERY,DOMESTIC BATTERY SIMPLE,APARTMENT,False,True,...,41.954594,-87.729245,"\n, \n(41.954593897, -87.729244692)",Residence,1,3,16,3,6,1
1,13776543,JJ182816,2025-03-12 00:00:00,037XX W NORTH AVE,910,MOTOR VEHICLE THEFT,AUTOMOBILE,STREET,False,False,...,41.909959,-87.719475,"\n, \n(41.909959416, -87.719474573)",Street/Outdoor,2,3,12,0,2,0
2,13772937,JJ178623,2025-03-12 00:00:00,076XX S EAST END AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,...,41.756389,-87.583428,"\n, \n(41.756389436, -87.583428355)",Unknown/Other,10,3,12,0,2,0
3,13774108,JJ179898,2025-03-12 00:00:00,097XX S MERRILL AVE,910,MOTOR VEHICLE THEFT,AUTOMOBILE,STREET,False,False,...,41.717923,-87.570979,"\n, \n(41.717922891, -87.570978883)",Street/Outdoor,2,3,12,0,2,0
4,13772980,JJ178262,2025-03-12 00:00:00,095XX S HALSTED ST,560,ASSAULT,SIMPLE,LIBRARY,False,False,...,41.720783,-87.643198,"\n, \n(41.720783347, -87.643197739)",Unknown/Other,10,3,12,0,2,0


In [25]:
NIBRS_df = extractBasicTimeFeature(NIBRS_df)
NIBRS_df.head()

Unnamed: 0,agency_id,incident_id,nibrs_month_id,cargo_theft_flag,incident_date,incident_hour,incident_status,arrestee_id,arrest_date,multiple_indicator,...,location_code,location_name,unified_location_category,UNIFIED_LOCATION_CODE,date,month,day,hour,weekday,is_weekend
0,4675,169368274,44493177,f,2023-01-19,21,ACCEPTED,51253221.0,2023-01-19 00:00:00,N,...,20,Residence/Home,Residence,1,2023-01-19 21:00:00,1,19,21,3,0
1,4675,169368295,44493179,f,2023-02-08,18,ACCEPTED,51253236.0,2023-02-08 00:00:00,N,...,20,Residence/Home,Residence,1,2023-02-08 18:00:00,2,8,18,2,0
2,4675,172820689,44525191,f,2023-04-13,13,ACCEPTED,52245171.0,2023-04-13 00:00:00,N,...,20,Residence/Home,Residence,1,2023-04-13 13:00:00,4,13,13,3,0
3,4675,172820694,44525191,f,2023-04-13,21,ACCEPTED,52245172.0,2023-04-13 00:00:00,N,...,13,Highway/Road/Alley/Street/Sidewalk,Street/Outdoor,2,2023-04-13 21:00:00,4,13,21,3,0
4,4675,187560794,44525192,f,2023-05-20,15,ACCEPTED,57077143.0,2023-05-20 00:00:00,N,...,20,Residence/Home,Residence,1,2023-05-20 15:00:00,5,20,15,5,1


### Holiday Feature Extraction

Add a binary flag, `is_holiday`, indicating whether the incident occurred on a US holiday as defined by the `holidays` package.

In [26]:
def extractHolidayFeature(df):
    us_holidays = holidays.US()
    df['is_holiday'] = df['date'].dt.date.apply(lambda x: 1 if x in us_holidays else 0)

    return df
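
The per-row membership test above can be slow on large frames. One vectorized option, shown here as a sketch, swaps in pandas' built-in `USFederalHolidayCalendar` for the `holidays` package. This is a different holiday source, so its set of dates may not match `holidays.US()` exactly. The function name `flag_federal_holidays` and the toy dates are hypothetical.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar


def flag_federal_holidays(dates: pd.Series) -> pd.Series:
    """Return 1 where the calendar day is a US federal holiday, else 0."""
    days = dates.dt.normalize()  # drop the time of day
    calendar = USFederalHolidayCalendar()
    # Holidays between the earliest and latest day in the data.
    holiday_days = calendar.holidays(start=days.min(), end=days.max())
    return days.isin(holiday_days).astype(int)


toy_dates = pd.Series(
    pd.to_datetime(["2023-07-03 10:00:00", "2023-07-04 21:00:00", "2023-07-05 09:00:00"])
)
flags = flag_federal_holidays(toy_dates)
print(flags.tolist())
```

Only the Independence Day row is flagged in this toy input. Whether the two holiday sources agree on every date in the real data would need to be checked before replacing the original function.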

In [27]:
chicago_df = extractHolidayFeature(chicago_df)
chicago_df.head()

Unnamed: 0,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,...,longitude,location,unified_location_category,UNIFIED_LOCATION_CODE,month,day,hour,weekday,is_weekend,is_holiday
0,13777896,JJ183487,2025-03-16 03:00:00,040XX N KEYSTONE AVE,486,BATTERY,DOMESTIC BATTERY SIMPLE,APARTMENT,False,True,...,-87.729245,"\n, \n(41.954593897, -87.729244692)",Residence,1,3,16,3,6,1,0
1,13776543,JJ182816,2025-03-12 00:00:00,037XX W NORTH AVE,910,MOTOR VEHICLE THEFT,AUTOMOBILE,STREET,False,False,...,-87.719475,"\n, \n(41.909959416, -87.719474573)",Street/Outdoor,2,3,12,0,2,0,0
2,13772937,JJ178623,2025-03-12 00:00:00,076XX S EAST END AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,...,-87.583428,"\n, \n(41.756389436, -87.583428355)",Unknown/Other,10,3,12,0,2,0,0
3,13774108,JJ179898,2025-03-12 00:00:00,097XX S MERRILL AVE,910,MOTOR VEHICLE THEFT,AUTOMOBILE,STREET,False,False,...,-87.570979,"\n, \n(41.717922891, -87.570978883)",Street/Outdoor,2,3,12,0,2,0,0
4,13772980,JJ178262,2025-03-12 00:00:00,095XX S HALSTED ST,560,ASSAULT,SIMPLE,LIBRARY,False,False,...,-87.643198,"\n, \n(41.720783347, -87.643197739)",Unknown/Other,10,3,12,0,2,0,0


In [28]:
NIBRS_df = extractHolidayFeature(NIBRS_df)
NIBRS_df.head()

Unnamed: 0,agency_id,incident_id,nibrs_month_id,cargo_theft_flag,incident_date,incident_hour,incident_status,arrestee_id,arrest_date,multiple_indicator,...,location_name,unified_location_category,UNIFIED_LOCATION_CODE,date,month,day,hour,weekday,is_weekend,is_holiday
0,4675,169368274,44493177,f,2023-01-19,21,ACCEPTED,51253221.0,2023-01-19 00:00:00,N,...,Residence/Home,Residence,1,2023-01-19 21:00:00,1,19,21,3,0,0
1,4675,169368295,44493179,f,2023-02-08,18,ACCEPTED,51253236.0,2023-02-08 00:00:00,N,...,Residence/Home,Residence,1,2023-02-08 18:00:00,2,8,18,2,0,0
2,4675,172820689,44525191,f,2023-04-13,13,ACCEPTED,52245171.0,2023-04-13 00:00:00,N,...,Residence/Home,Residence,1,2023-04-13 13:00:00,4,13,13,3,0,0
3,4675,172820694,44525191,f,2023-04-13,21,ACCEPTED,52245172.0,2023-04-13 00:00:00,N,...,Highway/Road/Alley/Street/Sidewalk,Street/Outdoor,2,2023-04-13 21:00:00,4,13,21,3,0,0
4,4675,187560794,44525192,f,2023-05-20,15,ACCEPTED,57077143.0,2023-05-20 00:00:00,N,...,Residence/Home,Residence,1,2023-05-20 15:00:00,5,20,15,5,1,0


## Crime Type Mapping for Chicago & NIBRS dataset 

In [30]:
# NIBRS: fold selected offense categories into broader categories
mapping = {
    'All Other Offenses': 'Other Offenses',
    'Animal Cruelty': 'Other Offenses',
    'Bad Checks': 'Fraud Offenses',
    'Counterfeiting/Forgery': 'Fraud Offenses',
    'Embezzlement': 'Fraud Offenses',
    'Extortion/Blackmail': 'Fraud Offenses',
    'Motor Vehicle Theft': 'Larceny/Theft Offenses',
    'Peeping Tom': 'Sex Offenses',
    'Sex Offenses, Non-forcible': 'Sex Offenses',
    'Driving Under the Influence': 'Liquor Law Violations',
    'Drunkenness': 'Liquor Law Violations',
    'Family Offenses, Nonviolent': 'Assault Offenses'
}

# Apply mapping; if not in dictionary, keep original value
NIBRS_df['New_offense_category_name'] = NIBRS_df['offense_category_name'].map(mapping).fillna(NIBRS_df['offense_category_name'])
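
For a dictionary of exact string matches like this one, `Series.replace` gives the same result as `map(...).fillna(...)` in a single call. The sketch below compares the two on toy data; `mapping_excerpt` and `categories` are illustrative names.

```python
import pandas as pd

# Excerpt of the NIBRS mapping above and a toy offense_category_name column.
mapping_excerpt = {
    "Motor Vehicle Theft": "Larceny/Theft Offenses",
    "Bad Checks": "Fraud Offenses",
}
categories = pd.Series(["Motor Vehicle Theft", "Robbery", "Bad Checks"])

# Idiom used above: map, then fall back to the original value where unmapped.
via_map = categories.map(mapping_excerpt).fillna(categories)

# Equivalent here: replace only touches exact matches and leaves the rest alone.
via_replace = categories.replace(mapping_excerpt)

print(via_map.equals(via_replace))
```

Either form keeps categories such as `Robbery` unchanged, so the choice is mostly a matter of readability.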


In [32]:
NIBRS_df['New_offense_category_name'].value_counts()

New_offense_category_name
Larceny/Theft Offenses                      173759
Assault Offenses                            155334
Destruction/Damage/Vandalism of Property     93396
Fraud Offenses                               50478
Drug/Narcotic Offenses                       26162
Burglary/Breaking & Entering                 25556
Robbery                                      13541
Weapon Law Violations                        12073
Sex Offenses                                  6824
Pornography/Obscene Material                  1388
Kidnapping/Abduction                          1155
Arson                                         1036
Stolen Property Offenses                       937
Homicide Offenses                              805
Other Offenses                                 309
Prostitution Offenses                          179
Human Trafficking                               24
Bribery                                         16
Gambling Offenses                                4
Name:

In [33]:
crime_mapping = {
    'THEFT': ('Property', 'Larceny/Theft Offenses'),
    'BATTERY': ('Person', 'Assault Offenses'),
    'CRIMINAL DAMAGE': ('Property', 'Destruction/Damage/Vandalism of Property'),
    'ASSAULT': ('Person', 'Assault Offenses'),
    'MOTOR VEHICLE THEFT': ('Property', 'Larceny/Theft Offenses'),
    'DECEPTIVE PRACTICE': ('Property', 'Fraud Offenses'),
    'OTHER OFFENSE': ('Society', 'Other Offenses'),
    'ROBBERY': ('Property', 'Robbery'),
    'WEAPONS VIOLATION': ('Society', 'Weapon Law Violations'),
    'BURGLARY': ('Property', 'Burglary/Breaking & Entering'),
    'NARCOTICS': ('Society', 'Drug/Narcotic Offenses'),
    'CRIMINAL TRESPASS': ('Society', 'Trespass of Real Property'),
    'OFFENSE INVOLVING CHILDREN': ('Person', 'Other Offenses'),
    'CRIMINAL SEXUAL ASSAULT': ('Person', 'Sex Offenses'),
    'SEX OFFENSE': ('Person', 'Sex Offenses'),
    'PUBLIC PEACE VIOLATION': ('Society', 'Curfew/Loitering/Vagrancy Violations'),
    'HOMICIDE': ('Person', 'Homicide Offenses'),
    'INTERFERENCE WITH PUBLIC OFFICER': ('Society', 'Bribery'),
    'ARSON': ('Property', 'Arson'),
    'STALKING': ('Person', 'Other Offenses'),
    'PROSTITUTION': ('Society', 'Prostitution Offenses'),
    'LIQUOR LAW VIOLATION': ('Society', 'Liquor Law Violations'),
    'CONCEALED CARRY LICENSE VIOLATION': ('Society', 'Weapon Law Violations'),
    'INTIMIDATION': ('Person', 'Assault Offenses'),
    'KIDNAPPING': ('Person', 'Kidnapping/Abduction'),
    'OBSCENITY': ('Society', 'Pornography/Obscene Material'),
    'GAMBLING': ('Society', 'Gambling Offenses'),
    'HUMAN TRAFFICKING': ('Person', 'Human Trafficking'),
    'PUBLIC INDECENCY': ('Society', 'Disorderly Conduct'),
    'OTHER NARCOTIC VIOLATION': ('Society', 'Drug/Narcotic Offenses'),
    'NON-CRIMINAL': ('Not a Crime', 'Other Offenses')
}

# Apply mapping: each primary_type yields a (crime_against, offense_category_name) pair
chicago_df[['crime_against', 'offense_category_name']] = chicago_df['primary_type'].map(crime_mapping).apply(pd.Series)
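
Expanding tuples with `.apply(pd.Series)` creates one Series per row, which can be slow on a frame of this size. A possible alternative, sketched on toy data, turns the dictionary into a small lookup table and joins it on `primary_type`. The names `crime_mapping_excerpt`, `lookup`, and `toy`, and the unmapped type string, are hypothetical.

```python
import pandas as pd

# Excerpt of the crime_mapping dictionary above, plus a toy primary_type column.
crime_mapping_excerpt = {
    "THEFT": ("Property", "Larceny/Theft Offenses"),
    "BATTERY": ("Person", "Assault Offenses"),
}
toy = pd.DataFrame({"primary_type": ["BATTERY", "THEFT", "SOME UNMAPPED TYPE"]})

# Turn the dict of tuples into a lookup table indexed by primary_type.
lookup = pd.DataFrame.from_dict(
    crime_mapping_excerpt,
    orient="index",
    columns=["crime_against", "offense_category_name"],
)

# Left join on primary_type; types missing from the lookup get NaN in both columns.
toy = toy.join(lookup, on="primary_type")
print(toy)
```

As in the original cell, any `primary_type` that is missing from the dictionary ends up with missing values in both new columns, so counting those rows afterwards is a cheap way to spot gaps in the mapping.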


In [34]:
chicago_df.head()

Unnamed: 0,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,...,unified_location_category,UNIFIED_LOCATION_CODE,month,day,hour,weekday,is_weekend,is_holiday,crime_against,offense_category_name
0,13777896,JJ183487,2025-03-16 03:00:00,040XX N KEYSTONE AVE,486,BATTERY,DOMESTIC BATTERY SIMPLE,APARTMENT,False,True,...,Residence,1,3,16,3,6,1,0,Person,Assault Offenses
1,13776543,JJ182816,2025-03-12 00:00:00,037XX W NORTH AVE,910,MOTOR VEHICLE THEFT,AUTOMOBILE,STREET,False,False,...,Street/Outdoor,2,3,12,0,2,0,0,Property,Larceny/Theft Offenses
2,13772937,JJ178623,2025-03-12 00:00:00,076XX S EAST END AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,...,Unknown/Other,10,3,12,0,2,0,0,Property,Fraud Offenses
3,13774108,JJ179898,2025-03-12 00:00:00,097XX S MERRILL AVE,910,MOTOR VEHICLE THEFT,AUTOMOBILE,STREET,False,False,...,Street/Outdoor,2,3,12,0,2,0,0,Property,Larceny/Theft Offenses
4,13772980,JJ178262,2025-03-12 00:00:00,095XX S HALSTED ST,560,ASSAULT,SIMPLE,LIBRARY,False,False,...,Unknown/Other,10,3,12,0,2,0,0,Person,Assault Offenses


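### Add arrest flag for NIBRS dataset

An incident is flagged as an arrest when its `arrestee_id` is present, which matches the boolean `arrest` column in the Chicago dataset.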

In [36]:
NIBRS_df['arrest'] = NIBRS_df['arrestee_id'].notna()
NIBRS_df.head()

Unnamed: 0,agency_id,incident_id,nibrs_month_id,cargo_theft_flag,incident_date,incident_hour,incident_status,arrestee_id,arrest_date,multiple_indicator,...,UNIFIED_LOCATION_CODE,date,month,day,hour,weekday,is_weekend,is_holiday,New_offense_category_name,arrest
0,4675,169368274,44493177,f,2023-01-19,21,ACCEPTED,51253221.0,2023-01-19 00:00:00,N,...,1,2023-01-19 21:00:00,1,19,21,3,0,0,Assault Offenses,True
1,4675,169368295,44493179,f,2023-02-08,18,ACCEPTED,51253236.0,2023-02-08 00:00:00,N,...,1,2023-02-08 18:00:00,2,8,18,2,0,0,Destruction/Damage/Vandalism of Property,True
2,4675,172820689,44525191,f,2023-04-13,13,ACCEPTED,52245171.0,2023-04-13 00:00:00,N,...,1,2023-04-13 13:00:00,4,13,13,3,0,0,Assault Offenses,True
3,4675,172820694,44525191,f,2023-04-13,21,ACCEPTED,52245172.0,2023-04-13 00:00:00,N,...,2,2023-04-13 21:00:00,4,13,21,3,0,0,Drug/Narcotic Offenses,True
4,4675,187560794,44525192,f,2023-05-20,15,ACCEPTED,57077143.0,2023-05-20 00:00:00,N,...,1,2023-05-20 15:00:00,5,20,15,5,1,0,Assault Offenses,True


In [37]:
print("Chicago dataset shape: ", chicago_df.shape)
print("NIBRS dataset shape: ", NIBRS_df.shape)

Chicago dataset shape:  (1100000, 32)
NIBRS dataset shape:  (562976, 37)


In [None]:
# chicago_df.to_csv("Chicago_after_FE.csv", index=False)
# NIBRS_df.to_csv("NIBRS_after_FE.csv", index=False)