Project Overview: This notebook is designed to analyze crime data from Los Angeles, covering the period from 2020 to the present day. By leveraging exploratory data analysis (EDA) and machine learning techniques, we aim to discover trends, identify patterns, and develop a predictive model to enhance our understanding of crime occurrences in the area.

Data Description
DR_NO: Unique identifier for each record.
Date Rptd: Date the crime was reported.
DATE OCC: Date the crime occurred.
TIME OCC: Time the crime occurred.
AREA: Area code or ID where the crime took place.
AREA NAME: Name of the area where the crime took place.
Rpt Dist No: Reporting district number.
Part 1-2: Type of crime classification.
Crm Cd: Crime code, a numeric code representing a specific crime.
Crm Cd Desc: Description of the crime code.
Mocodes: Modus operandi codes, which might describe crime methods or patterns.
Vict Age: Age of the victim.
Vict Sex: Gender of the victim.
Vict Descent: Descent or ethnicity of the victim.
Premis Cd: Code for the type of premises where the crime occurred.
Premis Desc: Description of the type of premises.
Weapon Used Cd: Code for the weapon used in the crime.
Weapon Desc: Description of the weapon used.
Status: Status code, possibly indicating the case status.
Status Desc: Description of the status code.
Crm Cd 1 to Crm Cd 4: Additional crime codes, potentially for incidents involving multiple crimes.
LOCATION: General location information, possibly an address or landmark.
Cross Street: Cross street or intersection near the incident.
LAT: Latitude of the crime location.
LON: Longitude of the crime location.

In [None]:
'''Installing and Importing necessary libraries
'''

In [None]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
import time

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# to split the data into train and test
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# To build model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
)



import tensorflow as tf #An end-to-end open source machine learning platform
from tensorflow import keras  # High-level neural networks API for deep learning.
from keras import backend   # Abstraction layer for neural network backend engines.
from keras.models import Sequential  # Model for building NN sequentially.
from keras.layers import Dense

# to suppress warnings
import warnings
warnings.filterwarnings("ignore")

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Loading the dataset

In [None]:
data = pd.read_csv("/content/drive/My Drive/Machine_Learning/Crime_Data_from_2020_to_Present.csv")

Data Overview [Showing some samples of data]

data.head(20).T

In [None]:
data.tail()

Checking the shape of the data

In [None]:
data.shape

There are 990293 rows and 28 columns. [Also satisfying the project criteria]

Checking 10 random rows of the dataset

In [None]:
# let's view a sample of the data
data.sample(n=10, random_state=1)

In [None]:
# let's create a copy of the data to avoid any changes to original data
df = data.copy()

Checking the data types of the columns for the dataset

In [None]:
df.info()

The dataset consists of 28 columns with varying data types, capturing a broad array of information about crime incidents. The data types range from integers and floats to objects (strings).

Checking for duplicate values

In [None]:
# checking for duplicate values
df.duplicated().sum()

Checking for missing values

In [None]:
df.isna().sum()

Columns such as Mocodes, Vict Sex, and Cross Street have many missing values, so they must be preprocessed before any predictive model can be built.
We will drop the additional crime codes (Crm Cd 1 to Crm Cd 4) because they are not used in this analysis. We will also drop Cross Street: it has many null values, and LAT and LON already capture the location.
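
When deciding which columns to drop or impute, the share of missing values per column is often easier to compare than raw counts. The following is a minimal sketch on a toy frame; the values are made up for illustration and `missing_share` is a hypothetical helper, not part of this notebook.

```python
import pandas as pd

def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    return (frame.isna().mean() * 100).sort_values(ascending=False)

# Toy data (illustrative values only, not taken from the real dataset)
toy = pd.DataFrame({
    "Cross Street": [None, None, None, "MAIN ST"],
    "Vict Sex": ["F", None, "M", "M"],
    "LAT": [34.05, 34.10, 33.99, 34.20],
})

share = missing_share(toy)
print(share)
```

On the real data, the same helper could be called as `missing_share(df)` to rank columns by how incomplete they are.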

Checking the statistical summary of the dataset

In [None]:
df.describe().T

DR_NO ranges from 817 to 2,499,289; TIME OCC from 1 to 2359; AREA from 1 to 21; Rpt Dist No from 101 to 2199; Vict Age from -40 to 120; Premis Cd from 0 to 976; Weapon Used Cd from 101 to 516; LAT from 33.4 to 34.3; LON from -118.3 to -118.2.

Exploratory Data Analysis (EDA) Summary

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Defining the figure size
plt.figure(figsize=(22, 14))

# Selecting only numerical columns for plotting
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns

# Plotting the histogram for each numerical feature
for i, feature in enumerate(numerical_features):
    plt.subplot((len(numerical_features) + 2) // 3, 3, i + 1)  # Adjusting subplot grid dynamically
    sns.histplot(data=df, x=feature, kde=True)  # Plotting the histogram with KDE for better visualization
    plt.title(f'Histogram of {feature}')  # Adding title to each subplot

plt.tight_layout()  # Adding spacing between plots
plt.show()


From the histograms, we can observe the following:

TIME OCC = There are noticeable peaks at certain times (e.g., around 400, 1200, and between 1800-2300), suggesting that crimes occur more frequently at these times and that activity is higher during the late evening and night hours. The sharp peak close to zero might represent crimes occurring just after midnight, or it could indicate a data-entry issue if 0 is not a valid time.
Overall, this histogram indicates that crimes are more likely to occur in the late afternoon, evening, and night, which could inform decisions on allocating resources during high-risk times.
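
TIME OCC appears to be stored as an integer in 24-hour HHMM form (the preprocessing step later makes the same assumption), so a value such as 1830 would mean 18:30. A small sketch of how such values split into hour and minute; `split_hhmm` is a hypothetical helper and the sample values are illustrative.

```python
import pandas as pd

def split_hhmm(times: pd.Series) -> pd.DataFrame:
    """Split integer HHMM values (e.g. 1830) into hour and minute columns."""
    return pd.DataFrame({"hour": times // 100, "minute": times % 100})

toy_times = pd.Series([1, 400, 1200, 1830, 2359])  # illustrative values
parts = split_hhmm(toy_times)
print(parts)
```

Reading the histogram in hours rather than raw HHMM values avoids the artificial gaps between, for example, 1259 and 1300.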


Area = There are distinct peaks at certain area codes (e.g., around 2, 5, 12), indicating these areas have a higher frequency of reported crimes compared to others. This histogram helps identify high-activity crime zones, allowing for better resource allocation and focused crime prevention efforts in specific areas.

Rpt Dist No  = The distribution of Rpt Dist No shows varying crime frequencies across districts. Some districts have noticeable peaks, indicating higher crime reports, while others show relatively low counts. This suggests specific districts have more reported incidents, which may require targeted attention.

Part 1-2 = The Part 1-2 column shows a bimodal distribution with values clustered at 1 and 2, indicating two main crime classifications (likely "Part 1" and "Part 2" crimes). The frequency of "Part 1" crimes is significantly higher, suggesting these are more commonly reported incidents.

Crm CD = Certain crime codes, such as those around 400 and 600, appear much more frequently, indicating these types of crimes are reported at a higher rate. Other codes have significantly lower frequencies, suggesting they represent less common crimes. This distribution highlights the prevalence of specific crime types in the dataset.

Vict Age: The histogram for Vict Age shows a high concentration of values around the younger age group, with a few outliers at higher ages. There's an unusual peak at 0, which could represent missing or unknown age data.

Premis Cd: The Premis Cd distribution is heavily skewed, with specific codes occurring far more frequently than others, indicating that certain premises are more common locations for incidents.

Weapon Used Cd: Similar to Premis Cd, this histogram shows that a few weapon codes appear frequently, suggesting that certain types of weapons are more commonly involved in crimes.

Crm Cd 1: The histogram for Crm Cd 1 shows multiple peaks, indicating a variety of primary crime codes with some codes significantly more frequent, suggesting common types of primary crimes.

Crm Cd 2: The distribution of Crm Cd 2 is skewed towards higher values, with a sharp increase near 1000. This could indicate secondary crime codes being used selectively or for specific types of incidents.

Crm Cd 3: Crm Cd 3 is similar to Crm Cd 2, with fewer data points but showing an increase near 1000, indicating that this code might only be applicable to specific cases.

Crm Cd 4: The Crm Cd 4 histogram shows very few entries, with values concentrated near 1000, implying that this column is rarely used and may be relevant only for specific multi-offense incidents.

LAT (Latitude): The latitude values are tightly clustered around a narrow range, which is expected as the data likely represents incidents within a specific geographical area.

LON (Longitude): Similar to LAT, longitude values are concentrated within a limited range, reflecting the confined geographic scope of the dataset.



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Defining the figure size
plt.figure(figsize=(16, 14))

# Selecting only numerical columns for plotting
features = df.select_dtypes(include=['number']).columns.tolist()

# Creating the boxplots
for i, feature in enumerate(features):
    plt.subplot((len(features) + 2) // 3, 3, i + 1)  # Adjusting subplot grid dynamically
    sns.boxplot(data=df, x=feature)  # Plotting the boxplot
    plt.title(f'Boxplot of {feature}')  # Adding title to each subplot

plt.tight_layout()  # Adding spacing between plots
plt.show()


DR_NO: A few outliers on the lower end, though this column likely represents a unique identifier, so outliers here may not be relevant.

TIME OCC: No visible outliers, as most times appear to fall within a consistent range.

AREA: No significant outliers, as the data is evenly distributed within the range.

Rpt Dist No: No clear outliers; the values are within a reasonable spread.

Part 1-2: No outliers, with data concentrated at two distinct values, likely indicating two main categories.

Crm Cd: No major outliers; the distribution of crime codes is within a standard range.

Vict Age: Outliers are present on both the lower end (0) and higher end (above 100), which may represent errors or unusual cases.

Premis Cd: No significant outliers, with values mostly within a reasonable range.

Weapon Used Cd: Outliers appear on the lower end, likely due to specific weapon codes that occur less frequently.

Crm Cd 1, Crm Cd 2, Crm Cd 3: Numerous outliers across these columns, especially in Crm Cd 2 and Crm Cd 3, where certain codes appear far outside the main cluster. These could represent unique crime codes that are less common.

Crm Cd 4: Few outliers on the lower end, likely due to rare multi-crime codes.

LAT and LON: Outliers are present on both latitude (LAT) and longitude (LON), indicating a few locations outside the main geographical area of the dataset.
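
The points a boxplot draws beyond its whiskers are, by default, the values outside 1.5 times the interquartile range. The sketch below applies the same rule numerically so outliers can be counted rather than eyeballed; `iqr_outlier_mask` is a hypothetical helper and the ages are toy values.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy ages (illustrative only), including invalid and extreme values
toy_age = pd.Series([-2, 0, 25, 28, 30, 31, 35, 40, 42, 120])
mask = iqr_outlier_mask(toy_age)
print(toy_age[mask].tolist())
```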

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Sample 10% of the data for faster plotting
sample_df = df.sample(frac=0.1, random_state=42)

# Select only categorical columns with less than 50 unique values
categorical_features = [col for col in sample_df.select_dtypes(include=['object']).columns if sample_df[col].nunique() < 50]

# Cap rare categories to "Other" for features with more than 10 categories
for feature in categorical_features:
    top_categories = sample_df[feature].value_counts().nlargest(10).index  # Top 10 categories
    sample_df[feature] = sample_df[feature].apply(lambda x: x if x in top_categories else 'Other')

# Plotting bar plots for the selected categorical features
plt.figure(figsize=(16, 14))

for i, feature in enumerate(categorical_features):
    plt.subplot((len(categorical_features) + 2) // 3, 3, i + 1)  # Adjusting subplot grid dynamically
    sns.countplot(data=sample_df, x=feature)  # Plotting the bar plot
    plt.title(f'Bar Plot of {feature}')  # Adding title to each subplot
    plt.xticks(rotation=45)  # Rotate x-axis labels for better readability

plt.tight_layout()  # Adding spacing between plots
plt.show()


AREA NAME: The category "Other" has the highest count, but this is an artifact of the plotting code, which keeps only the 10 most frequent areas and groups all remaining areas into "Other". The individual areas shown have smaller, relatively even counts.

Vict Sex: Females (F) and Males (M) have the highest counts, with a small portion in "Other" and unknown categories, showing a typical gender distribution.

Vict Descent: The majority of victims are in categories H and W, with a few other descent categories also appearing frequently, indicating a demographic pattern.

Status: The IC status code dominates, suggesting that a majority of cases fall under this category, with other status codes being significantly less common.

Status Desc: The majority of cases are categorized as "Invest Cont," with fewer cases in other status descriptions, reflecting the primary status of reported incidents.

Bivariate Analysis

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Selecting only numerical columns for correlation
numeric_df = df.select_dtypes(include=['number'])

# Defining the size of the plot
plt.figure(figsize=(12, 7))

# Plotting the heatmap for correlation
sns.heatmap(
    numeric_df.corr(),  # Use only numeric columns for correlation
    annot=True,  # Annotate each cell with the numeric value
    vmin=-1, vmax=1,  # Set color scale limits to show full correlation range
    fmt=".2f",  # Format the annotations to two decimal places
    cmap="Spectral"  # Color map for visual effect
)

plt.title("Correlation Heatmap")  # Adding a title for clarity
plt.show()


The heatmap reveals a strong correlation between Crm Cd and Crm Cd 1 (0.70), indicating a close relationship between primary crime codes. Weapon Used Cd and Crm Cd also show a moderate correlation (0.37), suggesting specific weapons are common in certain crime types. Latitude (LAT) and Longitude (LON) show a perfect negative correlation (-1.00). This is not an inherent property of geographic coordinates; it is more likely driven by the few outlying coordinates seen in the boxplots, so it should be rechecked after out-of-range locations are removed. Most other correlations are weak or near zero, showing minimal relationships between the remaining features.
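
A correlation of -1.00 between two coordinates is unusual. One possible explanation, which is an assumption here and not confirmed by the notebook, is that a handful of rows carry placeholder coordinates such as (0, 0). The sketch below uses synthetic data to show how a few such points can dominate the Pearson correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, independent coordinates near Los Angeles (illustrative only)
lat = rng.normal(34.05, 0.1, size=1000)
lon = rng.normal(-118.30, 0.1, size=1000)
corr_clean = np.corrcoef(lat, lon)[0, 1]

# Append a few hypothetical placeholder points at (0, 0)
lat_bad = np.append(lat, np.zeros(5))
lon_bad = np.append(lon, np.zeros(5))
corr_with_placeholders = np.corrcoef(lat_bad, lon_bad)[0, 1]

print(f"clean: {corr_clean:.2f}, with placeholders: {corr_with_placeholders:.2f}")
```

If something similar is happening in the real data, recomputing the heatmap after filtering coordinates to a plausible range should bring the LAT/LON correlation back toward a moderate value.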

In [None]:
import seaborn as sns

# Sample a fraction of the data to speed up the plotting
sample_df = df.sample(frac=0.1, random_state=42)  # Adjust fraction as needed (e.g., 0.05 for larger datasets)

# Selecting a subset of columns for the pairplot
selected_columns = ['TIME OCC', 'AREA', 'Vict Age', 'Premis Cd', 'Weapon Used Cd', 'LAT', 'LON']
sample_df = sample_df[selected_columns]

# Plotting the pairplot
sns.pairplot(sample_df)


The pairplot shows the relationships between selected numerical variables in the dataset. TIME OCC appears uniformly distributed, with no clear patterns between crime times and area or premises types, indicating crimes occur across various times and locations. Vict Age shows a concentration of younger victims, distributed across a range of premises, suggesting that incidents involving younger individuals happen in diverse settings. LAT and LON values are tightly clustered, reflecting the dataset’s localized geographic scope with limited spread. Overall, the plot reveals minimal strong linear correlations among most variables.


In [None]:
#Data Preprocessing

In [None]:
import numpy as np

# Drop Cross Street (mostly missing) and the additional crime code columns Crm Cd 1 to Crm Cd 4 in a single step
df = df.drop(columns=['Cross Street','Crm Cd 1', 'Crm Cd 2','Crm Cd 3','Crm Cd 4'])

# Impute missing values in categorical columns with 'Unknown' or 'Other' using a single operation for each column
categorical_cols = ['Vict Sex', 'Vict Descent', 'Weapon Desc', 'Premis Desc']
df[categorical_cols] = df[categorical_cols].fillna('Unknown')

# Replace out-of-range ages (<= 0 or > 100) with the median of the valid ages
valid_age = (df['Vict Age'] > 0) & (df['Vict Age'] <= 100)
df['Vict Age'] = np.where(valid_age, df['Vict Age'], df.loc[valid_age, 'Vict Age'].median())

# Use median imputation for other numerical columns
# (assign the result back instead of using inplace=True on a column selection)
numerical_cols = ['Weapon Used Cd', 'Premis Cd']
for col in numerical_cols:
    df[col] = df[col].fillna(df[col].median())


In [None]:
# Fill missing values in 'Mocodes' with 'Unknown' as it might contain categorical information
df['Mocodes'] = df['Mocodes'].fillna('Unknown')


In [None]:
df.isna().sum()

In [None]:
# Drop rows where 'Status' has null values
df = df.dropna(subset=['Status'])

# Verify if there are any remaining null values
print(df.isna().sum())


In [None]:
# For LAT and LON, remove rows with coordinates outside the expected range
# Assuming the dataset is for a specific geographic area, e.g., Los Angeles
df = df[(df['LAT'] > 33.5) & (df['LAT'] < 34.5) & (df['LON'] < -118) & (df['LON'] > -119)]

In [None]:
# One-hot encode AREA NAME, Vict Sex, Vict Descent, Status Desc, and Crm Cd Desc
df = pd.get_dummies(df, columns=['AREA NAME', 'Vict Sex', 'Vict Descent', 'Status Desc', 'Crm Cd Desc'], drop_first=True)


In [None]:
# Extracting hour from TIME OCC (assuming it's in a 24-hour format)
df['Hour'] = df['TIME OCC'] // 100

# Convert DATE OCC to datetime format and extract year, month, and day features
df['DATE OCC'] = pd.to_datetime(df['DATE OCC'], errors='coerce')
df['Year'] = df['DATE OCC'].dt.year
df['Month'] = df['DATE OCC'].dt.month
df['Day'] = df['DATE OCC'].dt.day

# Drop the original date and time columns if not needed
df = df.drop(columns=['DATE OCC', 'TIME OCC', 'DR_NO'])


In [None]:
from sklearn.preprocessing import StandardScaler

# Define numerical columns to scale
# 'Hour' is left unscaled so that the hourly group-bys and plots below keep readable 0-23 values
numerical_cols = ['Vict Age', 'Rpt Dist No', 'Premis Cd', 'Weapon Used Cd', 'LAT', 'LON']

# Apply standard scaling
scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])


In [None]:
# Check final data structure
print(df.info())
print(df.describe())


The DataFrame has 987,743 rows, 209 columns, and various data types, using 357.2 MB of memory. Key fields include area codes, district codes, age data with outliers, codes for premises, weapons, and crimes, as well as geographic and time-related data. Some columns may need encoding, and outliers may require cleaning.

In [None]:
df.info()

In [None]:
df.head(500)

In [None]:
print(df.columns)


In [None]:
df.info()

In [None]:
import pandas as pd
import numpy as np

# Assuming your data is in df

# Step 1: Check for remaining missing values
print("Missing Values Summary:")
print(df.isnull().sum()[df.isnull().sum() > 0])

# Step 2: Check if any one-hot encoded columns can be simplified
# Example: Remove one-hot columns for 'AREA NAME' if not needed
one_hot_area_cols = [col for col in df.columns if 'AREA NAME_' in col]
df.drop(columns=one_hot_area_cols, inplace=True)

# Step 3: Recheck outliers in important columns
from scipy.stats import zscore

# Outlier detection for 'Vict Age'
if 'Vict Age' in df.columns:
    df = df[(np.abs(zscore(df['Vict Age'])) < 3)]  # within 3 standard deviations

# Outlier detection for 'Hour'
if 'Hour' in df.columns:
    df = df[(np.abs(zscore(df['Hour'])) < 3)]

# Step 4: Check the final summary of the DataFrame
print("Final DataFrame Summary:")
print(df.info())
print(df.describe())

# Display a few rows to verify the cleaned data
print(df.head())


In [None]:
df.describe()

In [None]:
# Display all column names without truncation
for column in df.columns:
    print(column)


In [None]:
df.isna().sum()

In [None]:
# Identify the one-hot encoded columns related to crime description
crime_columns = [col for col in df.columns if col.startswith("Crm Cd Desc_")]

# Build the target column from the original labels in `data`, matched on the row index.
# Using idxmax on the one-hot columns would mislabel the category removed by drop_first=True,
# because its rows are all zeros and idxmax then returns the first remaining column.
df['Crime_Type'] = data.loc[df.index, 'Crm Cd Desc']

# Remove the "Crm Cd Desc_*" one-hot encoded columns
df.drop(columns=crime_columns, inplace=True)

# Now 'Crime_Type' is the target column, which contains the crime type for each row


In [None]:
df.columns.tolist()
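
Recovering a label from one-hot columns with `idxmax` has a pitfall when the columns were created with `drop_first=True`: rows of the dropped category are all zeros, and `idxmax` then returns the first remaining column. The toy example below illustrates this; the crime names and the `crime_` prefix are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({"crime": ["ARSON", "BURGLARY", "ROBBERY", "ARSON"]})

# drop_first=True removes the column for the first category ("ARSON")
encoded = pd.get_dummies(toy, columns=["crime"], drop_first=True).astype(int)

# Rows that were "ARSON" are now all zeros; idxmax falls back to the first column
recovered = encoded.idxmax(axis=1).str.replace("crime_", "", regex=False)

print(encoded.columns.tolist())
print(recovered.tolist())
```

Both "ARSON" rows come back labeled "BURGLARY", which is why keeping or re-reading the original label column is safer than reverse-engineering it from the dummies.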

In [None]:

# Group by Year and Crime_Type to see trends over time
crime_trends_yearly = df.groupby(['Year', 'Crime_Type']).size().unstack(fill_value=0)


Question 1. What are the most common types of crimes reported over time?

In [None]:
#visualize the frequency of each crime type over the years.

In [None]:
# Get the top 10 most common crime types
top_crimes = df['Crime_Type'].value_counts().head(10).index
filtered_data = crime_trends_yearly[top_crimes]

# Plot trends for the top crimes only
plt.figure(figsize=(14, 8))
for crime_type in filtered_data.columns:
    plt.plot(filtered_data.index, filtered_data[crime_type], label=crime_type)

plt.xlabel("Year")
plt.ylabel("Number of Incidents")
plt.title("Trends Over Time for Most Common Crime Types")
plt.legend(loc='upper right', bbox_to_anchor=(1.2, 1))
plt.xticks(rotation=45)
plt.show()


The line plot illustrates trends in the ten most common crime types over recent years. "Vehicle - Stolen" consistently ranks as the most frequent crime, while other types like "Theft Plain - Petty" and "Battery - Simple Assault" show moderate frequency. Notably, "Theft of Identity" had a significant spike around 2021 but declined thereafter. Overall, the graph reveals varying patterns for different crimes, with some showing steady trends and others displaying fluctuations over time.

In [None]:
# Heatmap of crime frequency over time

In [None]:
import seaborn as sns

plt.figure(figsize=(14, 10))
sns.heatmap(crime_trends_yearly[top_crimes].T, cmap="YlGnBu", annot=False, cbar=True)
plt.title("Heatmap of Most Common Crime Types Over Time")
plt.xlabel("Year")
plt.ylabel("Crime Type")
plt.show()


The heatmap provides a visual representation of the frequency of common crime types over time. Darker shades indicate higher incidences, while lighter shades show lower ones. "Vehicle - Stolen" consistently appears as a highly frequent crime across the years. Notably, "Theft of Identity" had a peak around 2021, shown by the darker blue. This heatmap effectively highlights fluctuations in crime occurrences, helping to identify trends for each crime type over different years.

Question 1: What are the most common types of crimes reported over time?
Answer:
Vehicle - Stolen: Consistently reported at high levels each year.
Battery - Simple Assault: Another frequently reported crime type.
Burglary from Vehicle: Maintains a steady frequency over the years.
Theft of Identity: Shows a significant spike around 2021.
Vandalism - Felony ($400 & Over, All Church Vandalisms): Regularly reported, though with some fluctuations.
These crime types remain prevalent throughout the years, with "Vehicle - Stolen" being particularly common.
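
The yearly table behind these plots comes from a group-by followed by `unstack`. The sketch below repeats that step on toy data and adds a within-year share, which can make years with different totals easier to compare; the counts are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2021, 2021],
    "Crime_Type": ["VEHICLE - STOLEN", "VEHICLE - STOLEN", "ARSON",
                   "VEHICLE - STOLEN", "ARSON"],
})

# Rows are years, columns are crime types, cells are incident counts
yearly = toy.groupby(["Year", "Crime_Type"]).size().unstack(fill_value=0)

# Share of each crime type within its year (each row sums to 1)
yearly_share = yearly.div(yearly.sum(axis=1), axis=0)

print(yearly)
print(yearly_share.round(2))
```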

Question 2: Which areas have the highest crime rates, and how do they change over time?

In [None]:
# Group by AREA and Year to see the frequency of crimes in each area annually
area_trends_yearly = df.groupby(['Year', 'AREA']).size().unstack(fill_value=0)


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(14, 10))
# Transpose so areas are rows (y-axis) and years are columns (x-axis), matching the axis labels
sns.heatmap(area_trends_yearly.T, cmap="YlOrRd", annot=False, cbar=True)
plt.title("Heatmap of Crime Frequency by Area Over Time")
plt.xlabel("Year")
plt.ylabel("Area")
plt.show()


The heatmap displays crime frequency across different areas over several years. Darker shades indicate higher crime rates, with areas 1, 12, and 13 consistently showing higher frequencies over time. These areas experienced particularly elevated crime rates around 2022 and 2023, while other areas generally show lighter shades, indicating lower crime occurrences.

In [None]:
# area_trends_yearly already has years as rows and areas as columns,
# so plotting it directly puts Year on the x-axis with one stacked band per area
area_trends_yearly.plot(kind='area', stacked=True, figsize=(14, 8), colormap='tab20')
plt.title("Crime Frequency in Different Areas Over Time")
plt.xlabel("Year")
plt.ylabel("Number of Crimes")
plt.legend(title="Area", loc='upper left', bbox_to_anchor=(1.0, 1.0))
plt.show()


The area chart shows crime frequency across areas over time. Each color represents an area, with peaks indicating higher crime rates. Some areas consistently have more crimes, contributing significantly to total crime rates, while others fluctuate. Peaks and dips highlight changes in crime trends over the years.

Question 2: Which areas have the highest crime rates, and how do they change over time?
Answer: The areas with the highest crime rates, as shown in both the heatmap and area chart, include Area 1, Area 12, and Area 13. These areas consistently report higher crime frequencies over the years. Crime rates in these areas show fluctuations rather than a consistent increase, with some years experiencing peaks and others showing declines.
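
To describe how an area changes over time more precisely than "fluctuates", the year-over-year percentage change can be computed from the yearly counts. A minimal sketch with made-up numbers; the area codes and counts below are illustrative, not results from the dataset.

```python
import pandas as pd

# Toy yearly counts per area (illustrative numbers only)
area_counts = pd.DataFrame(
    {1: [100, 120, 90], 12: [200, 210, 231]},
    index=pd.Index([2020, 2021, 2022], name="Year"),
)

# Percent change relative to the previous year; the first year has no predecessor (NaN)
yoy_change = area_counts.pct_change() * 100
print(yoy_change.round(1))
```

The same call on a year-by-area table of real counts would show which areas rose or fell in each year.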

Question 3: Is there a correlation between time of day (hour) and specific types of crimes?

In [None]:
# Group by Hour and Crime_Type to find the frequency of each crime type at each hour
crime_by_hour = df.groupby(['Hour', 'Crime_Type']).size().unstack(fill_value=0)


In [None]:
# Select the top 10 most frequent crime types
top_crime_types = df['Crime_Type'].value_counts().head(10).index
crime_by_hour_top = crime_by_hour[top_crime_types]

plt.figure(figsize=(14, 8))
sns.heatmap(crime_by_hour_top, cmap="coolwarm", cbar=True)
plt.title("Frequency of Top Crime Types by Hour of the Day")
plt.xlabel("Crime Type")
plt.ylabel("Hour")
plt.show()


The heatmap shows the distribution of the top 10 most frequent crime types by hour of the day. "Vehicle - Stolen" has the highest frequency across all hours, represented by the darkest shade on the left. Other crimes like "Battery - Simple Assault" and "Burglary from Vehicle" show moderate frequencies, with lighter shades indicating lower occurrences. This visualization highlights that certain crimes are more prevalent than others but does not indicate strong variation by hour for these top crime types.

Question 3: Is there a correlation between time of day (hour) and specific types of crimes?
Answer: There’s no strong correlation between specific hours and top crime types. Crimes like "Vehicle - Stolen" occur consistently across all hours, with no significant hourly fluctuations among the top types.
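
A heatmap of raw counts is dominated by how common each crime is overall, which can hide differences in timing. One way to check for hourly patterns is to normalize each crime type by its own total, so that every column shows a distribution over hours. A sketch on toy counts (illustrative numbers only):

```python
import pandas as pd

# Toy counts: rows are hours, columns are crime types (illustrative numbers)
counts = pd.DataFrame(
    {"VEHICLE - STOLEN": [50, 100, 250], "ARSON": [1, 1, 8]},
    index=pd.Index([6, 12, 22], name="Hour"),
)

# Divide each column by its total so every crime type sums to 1 across hours
hourly_profile = counts.div(counts.sum(axis=0), axis=1)
print(hourly_profile.round(3))
```

In this toy table the rarer crime is almost invisible in raw counts, yet its normalized profile shows a stronger concentration in the late hour than the common one.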

Question 4: How does crime distribution vary by demographic factors (like victim’s sex or age)?


In [None]:
# Select the one-hot encoded columns for victim sex that are present after encoding
# (drop_first=True removes one category, so the exact set is not hard-coded)
victim_sex_columns = [col for col in df.columns if col.startswith('Vict Sex_')]

# Group by Crime_Type and sum the counts for each sex
crime_by_sex = df.groupby('Crime_Type')[victim_sex_columns].sum()

# Focus on the top 10 crime types for readability
top_crimes = df['Crime_Type'].value_counts().head(10).index
filtered_crime_by_sex = crime_by_sex.loc[top_crimes]

# Plotting the distribution by sex
filtered_crime_by_sex.plot(kind='bar', stacked=True, figsize=(14, 8), width=0.8)
plt.title("Crime Distribution by Victim's Sex (Top 10 Crimes)")
plt.xlabel("Crime Type")
plt.ylabel("Number of Incidents")
plt.legend(title="Victim's Sex")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


The chart shows "Vehicle - Stolen" as the most common crime, affecting both male and female victims significantly. Other frequent crimes like "Battery - Simple Assault" and "Burglary from Vehicle" also impact both sexes, but at lower rates.

Question 4: How does crime distribution vary by demographic factors (like victim’s sex or age)?
Answer: "Vehicle - Stolen" is the most frequent crime, impacting both sexes. Crimes like "Battery - Simple Assault" and "Burglary from Vehicle" also affect both genders. The 19-35 age group reports the most incidents, with crime rates generally lower in older age groups.
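
Age-group statements like the one above can be checked by binning ages with `pd.cut`. The sketch below uses hypothetical bin edges and toy ages; since this notebook standardizes Vict Age during preprocessing, bins like these would need to be applied to the unscaled ages.

```python
import pandas as pd

ages = pd.Series([17, 19, 24, 35, 36, 50, 70])  # illustrative ages

# Hypothetical bin edges; intervals are right-closed, so 35 falls in "19-35"
bins = [0, 18, 35, 55, 120]
labels = ["0-18", "19-35", "36-55", "56+"]
age_group = pd.cut(ages, bins=bins, labels=labels)

group_counts = age_group.value_counts().sort_index()
print(group_counts)
```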

Question 5: Are certain types of crimes more frequent on specific days (e.g., weekends vs. weekdays)?

In [None]:
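# Derive the day of the week from Date Rptd (the report date);
# DATE OCC was dropped during preprocessing, so these are reporting days, not occurrence days
df['DayOfWeek'] = pd.to_datetime(df['Date Rptd']).dt.day_name()

# Group by Crime_Type and DayOfWeek, with the columns in calendar order
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
crime_by_day = df.groupby(['Crime_Type', 'DayOfWeek']).size().unstack(fill_value=0)
crime_by_day = crime_by_day.reindex(columns=day_order, fill_value=0)

# Focus on top 10 crimes for readability
top_crimes = df['Crime_Type'].value_counts().head(10).index
filtered_crime_by_day = crime_by_day.loc[top_crimes]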

# Plot the data
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(14, 8))
sns.heatmap(filtered_crime_by_day, cmap="YlGnBu", annot=True, fmt="d")
plt.title("Crime Frequency by Day of the Week for Top 10 Crimes")
plt.xlabel("Day of the Week")
plt.ylabel("Crime Type")
plt.tight_layout()
plt.show()


The heatmap, which is based on the day each crime was reported, shows that "Vehicle - Stolen" and "Battery - Simple Assault" are more frequent on weekdays, especially Fridays, than weekends. In contrast, crimes like "Theft Plain - Petty ($950 & Under)" and "Intimate Partner - Simple Assault" are steady across the week with minor fluctuations. This suggests reports of certain crimes peak on weekdays, likely tied to daily activity patterns.

Question 5: Are certain types of crimes more frequent on specific days (e.g., weekends vs. weekdays)?
Answer: Yes, certain crimes are more frequent on specific days, measured by the day they were reported. For example, "Vehicle - Stolen" and "Battery - Simple Assault" are more common on weekdays, particularly on Fridays, than on weekends. Other crimes, like "Theft Plain - Petty ($950 & Under)" and "Intimate Partner - Simple Assault," remain relatively steady throughout the week. This suggests that reports of some crimes peak on weekdays, likely influenced by daily routines and activity patterns.
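
Because there are five weekdays and only two weekend days, comparing totals can overstate weekday activity; comparing the average count per day of the week is fairer. A minimal sketch on toy report dates (illustrative only):

```python
import pandas as pd

# Toy report dates: 2023-01-02 is a Monday, 2023-01-07 and 2023-01-08 are a weekend
dates = pd.to_datetime(
    ["2023-01-02", "2023-01-03", "2023-01-06", "2023-01-07", "2023-01-08", "2023-01-08"]
)
toy = pd.DataFrame({"Date Rptd": dates})
toy["DayOfWeek"] = toy["Date Rptd"].dt.day_name()

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
per_day = toy["DayOfWeek"].value_counts().reindex(day_order, fill_value=0)

# Average incidents per day of the week, separately for weekdays and weekend days
weekday_avg = per_day[day_order[:5]].mean()
weekend_avg = per_day[day_order[5:]].mean()
print(per_day)
print(f"weekday average: {weekday_avg}, weekend average: {weekend_avg}")
```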

Question 6: Is there a seasonal trend in crime occurrence (monthly or yearly)?

In [None]:
# Ensure 'Date Rptd' is in datetime format
df['Date Rptd'] = pd.to_datetime(df['Date Rptd'], errors='coerce')  # Convert to datetime, handling errors

# Now you can use the .dt accessor
df['Year-Month'] = df['Date Rptd'].dt.to_period('M')

# Group by the new 'Year-Month' column to get monthly totals
monthly_crime_trends = df.groupby('Year-Month').size()

# Convert index to datetime for plotting
monthly_crime_trends.index = monthly_crime_trends.index.to_timestamp()

# Plot monthly crime trends over time
plt.figure(figsize=(14, 8))
plt.plot(monthly_crime_trends.index, monthly_crime_trends)
plt.title("Monthly Crime Trends Over Time")
plt.xlabel("Year-Month")
plt.ylabel("Number of Incidents")
plt.grid(True)
plt.tight_layout()
plt.show()


The line plot illustrates monthly crime trends over time, showing a generally stable level of incidents from 2020 through early 2023, with monthly incidents hovering around 15,000 to 20,000. However, there’s a noticeable decline in crime frequency starting in 2023, dropping sharply toward 2024.

Question 6: Is there a seasonal trend in crime occurrence (monthly or yearly)?
Answer: The monthly totals do not show a clear repeating seasonal cycle. From 2020 to early 2023, crime incidents remain relatively stable, with minor fluctuations each month. However, a sharp decline in reported incidents begins in 2023 and continues into 2024. Because this drop does not repeat from year to year, it is better explained by external factors, such as policy changes or enforcement measures, than by seasonality.
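
A seasonal pattern means the same calendar months are high or low year after year, which a single line over time does not show directly. One way to test for it is to average each calendar month across years. The sketch below does this on toy monthly totals (illustrative numbers with a built-in summer peak):

```python
import pandas as pd

# Toy monthly totals over two years (illustrative numbers only)
months = pd.period_range("2021-01", "2022-12", freq="M")
totals = pd.Series(
    [10, 10, 12, 12, 14, 16, 18, 18, 14, 12, 10, 10] * 2,
    index=months,
)

# Average each calendar month across years; a repeating peak suggests seasonality
seasonal_profile = totals.groupby(totals.index.month).mean()
peak_month = int(seasonal_profile.idxmax())

print(seasonal_profile)
print("Peak month:", peak_month)
```

Applied to the real monthly series, a flat profile would indicate little seasonality, while a long-run decline would need to be assessed separately from this month-of-year view.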

Question 7: Which features (e.g., location, time, type of weapon) are most predictive of specific crime types?

In [None]:
# Assuming your main dataframe is named `df`
# Step 1: Select relevant columns including 'Crime_Type' as the target variable
selected_columns = ['AREA', 'Hour', 'Premis Desc', 'Weapon Desc', 'Crime_Type']  # Adjust this list as needed
df_selected = df[selected_columns].copy()

# Step 2: Perform one-hot encoding on categorical columns if needed
df_selected = pd.get_dummies(df_selected, columns=['Premis Desc', 'Weapon Desc'], drop_first=True)

# Step 3: Sample the dataset to reduce memory load
df_sampled = df_selected.sample(frac=0.3, random_state=42)  # Use 30% of data

# Step 4: Define X and y
X = df_sampled.drop(columns=['Crime_Type'])
y = df_sampled['Crime_Type']


To identify which features (e.g., location, time, type of weapon) are most predictive of specific crime types, I selected key columns (AREA, Hour, Premis Desc, Weapon Desc, Crime_Type) for analysis. Categorical variables like Premis Desc and Weapon Desc were one-hot encoded to convert them into numerical format, and 30% of the data was sampled to reduce memory load while maintaining representative data. Finally, the dataset was split into X (features) and y (target variable, Crime_Type) for further predictive modeling.

In [None]:
from sklearn.model_selection import train_test_split

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# Use an even smaller sample, for example, 10% of the data
df_sampled = df_selected.sample(frac=0.1, random_state=42)  # Use 10% of data
X = df_sampled.drop(columns=['Crime_Type'])
y = df_sampled['Crime_Type']


In [None]:
from sklearn.ensemble import RandomForestClassifier

# Use a smaller sample of the data
df_sampled = df_selected.sample(frac=0.1, random_state=42)  # Use 10% of data
X = df_sampled.drop(columns=['Crime_Type'])
y = df_sampled['Crime_Type']

# Re-split so that training and testing actually use the smaller sample
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Random Forest model
model = RandomForestClassifier(n_estimators=50, random_state=42)  # Reduce number of trees
model.fit(X_train, y_train)


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Fit a Logistic Regression baseline so the comparison uses two distinct models
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)

# Predict on the test set
y_pred_rf = model.predict(X_test)     # Random Forest
y_pred_lr = lr_model.predict(X_test)  # Logistic Regression

# Random Forest Performance
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))

# Logistic Regression Performance
print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_lr))
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))


In this step, a Random Forest Classifier was implemented to predict crime types. Despite using balanced class weights and an oversampled dataset, the model's performance varied significantly across classes, with lower precision and recall for less frequent crime categories. This indicates that class imbalance in the dataset, even after oversampling, likely influenced the model's ability to generalize for underrepresented classes. Additionally, the large number of classes and overlapping features may have contributed to reduced accuracy for certain crime types. Further optimization, such as advanced feature selection or alternative resampling techniques, could improve performance.

Both models have low overall performance, suggesting that predicting crime types based on the selected features is challenging.

In [None]:
# Simplify the 'Crime_Type' categories by mapping them to broader classes
crime_type_mapping = {
    'Crm Cd Desc_VEHICLE - STOLEN': 'Theft',
    'Crm Cd Desc_BATTERY - SIMPLE ASSAULT': 'Assault',
    # Add more mappings based on observed crime types, grouping similar ones
}

df['Crime_Type'] = df['Crime_Type'].replace(crime_type_mapping)


In [None]:
# Display the counts of each crime type to verify distribution
print(df['Crime_Type'].value_counts())


The Crime_Type variable was remapped into broader categories to address the high granularity of the original labels, which contributed to data sparsity and poor model performance. Grouping similar or rare crime types into fewer, broader categories (e.g., Theft, Assault) makes the classes less sparse and the target easier to model.
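
When there are many distinct descriptions, writing the mapping by hand becomes tedious. A keyword-based rule list is one possible alternative; the sketch below is only an illustration, and the rule list, the function name to_broad_category, and the sample descriptions are all hypothetical and would need to be reviewed against the actual labels.

```python
import pandas as pd

# Hypothetical keyword rules, checked in order; the first match wins
KEYWORD_TO_CATEGORY = [
    ("STOLEN", "Theft"),
    ("THEFT", "Theft"),
    ("BURGLARY", "Burglary"),
    ("ASSAULT", "Assault"),
    ("BATTERY", "Assault"),
]

def to_broad_category(description):
    """Return the first matching broad category, or 'Other' if none matches."""
    text = str(description).upper()
    for keyword, category in KEYWORD_TO_CATEGORY:
        if keyword in text:
            return category
    return "Other"

# Illustrative descriptions only
sample = pd.Series([
    "VEHICLE - STOLEN",
    "BATTERY - SIMPLE ASSAULT",
    "BURGLARY FROM VEHICLE",
    "VANDALISM",
])
broad = sample.map(to_broad_category)
print(broad.tolist())
```

Because the rules are checked in order, more specific keywords should be listed before more general ones.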

In [None]:
# Define a threshold for rare crimes (e.g., crimes with fewer than 100 occurrences)
rare_crime_threshold = 100

# Identify rare crimes (compute the counts once and reuse them)
crime_counts = df['Crime_Type'].value_counts()
rare_crimes = crime_counts[crime_counts < rare_crime_threshold].index

# Replace rare crimes with 'Other'
df['Crime_Type'] = df['Crime_Type'].where(~df['Crime_Type'].isin(rare_crimes), 'Other')

# Check the new distribution of 'Crime_Type' after grouping
print(df['Crime_Type'].value_counts())


Rare classes with a frequency below the defined threshold (100 occurrences) were grouped into a new Other category. This step further reduced the imbalance, so that very small categories did not disproportionately affect the model's ability to generalize. These changes aimed to improve the classifier's performance by providing more representative and balanced training data for the redefined Crime_Type variable.

In [None]:
# Assuming 'Hour' and 'Day' columns exist in your data
def get_time_of_day(hour):
    if 5 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 17:
        return 'Afternoon'
    elif 17 <= hour < 21:
        return 'Evening'
    else:
        return 'Night'

df['Time_of_Day'] = df['Hour'].apply(get_time_of_day)
df['Is_Weekend'] = df['Day'].apply(lambda x: 1 if x in ['Saturday', 'Sunday'] else 0)


New features were created to enhance prediction: Time_of_Day categorized Hour into periods like Morning and Night, capturing temporal patterns, while Is_Weekend flagged weekends (1) vs. weekdays (0) to identify weekend crime trends. These additions improve the model's ability to capture temporal and contextual influences.
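
For a large dataframe, the same hour-to-period rule can also be applied in a vectorized way instead of row by row. This is a sketch on toy hour values using np.select with the same boundaries as get_time_of_day; the variable names are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

hours = pd.Series([0, 5, 11, 12, 16, 17, 20, 21, 23])  # toy values for 'Hour'

# Same boundaries as get_time_of_day; between() is inclusive on both ends
conditions = [
    hours.between(5, 11),   # Morning
    hours.between(12, 16),  # Afternoon
    hours.between(17, 20),  # Evening
]
labels = ["Morning", "Afternoon", "Evening"]

# Any hour that matches no condition falls through to 'Night'
time_of_day = pd.Series(
    np.select(conditions, labels, default="Night"), index=hours.index
)
print(time_of_day.tolist())
```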

In [None]:
df = pd.get_dummies(df, columns=['Time_of_Day', 'Premis Desc', 'Weapon Desc'], drop_first=True)


In [None]:
df_sampled = df.sample(frac=0.1, random_state=42)  # Reduce to 10% of the data
X = df_sampled.drop(columns=['Crime_Type'])
y = df_sampled['Crime_Type']


In [None]:
X_numeric = X.select_dtypes(include=['int64', 'float64'])  # Select only numeric columns

In [None]:
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)  # Adjust threshold as needed
X_reduced = selector.fit_transform(X_numeric)


In [None]:
kept_columns = X_numeric.columns[selector.get_support()]
X_reduced = pd.DataFrame(X_reduced, columns=kept_columns)


In [None]:
print("Reduced dataset shape:", X_reduced.shape)
print(X_reduced.head())

In [None]:
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE

# Filter numeric columns
X_reduced = X_reduced.select_dtypes(include=[np.number])

# Reset indices to ensure alignment
X_reduced = X_reduced.reset_index(drop=True)
y = y.reset_index(drop=True)

# Apply SMOTE for balancing
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_reduced, y)

# Check shapes
print(f"X_resampled shape: {X_resampled.shape}")
print(f"y_resampled shape: {y_resampled.shape}")


The dataset was reduced to 10 features by keeping only the numeric columns and dropping those whose variance fell below the 0.01 threshold. This dimensionality reduction simplifies the model and improves computational efficiency while retaining features such as AREA, Hour, LAT, and LON. A variance filter only removes near-constant columns; it does not measure how relevant a feature is to the target.
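
To make the effect of the variance filter concrete, the sketch below runs VarianceThreshold on a toy frame. The column constant_flag is a made-up example of a column with zero variance; the values are invented.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy frame with invented values; 'constant_flag' is a hypothetical column
toy = pd.DataFrame({
    "AREA": [1, 5, 12, 21, 3, 7],
    "Hour": [0, 6, 12, 18, 23, 9],
    "constant_flag": [1, 1, 1, 1, 1, 1],  # zero variance
})

selector = VarianceThreshold(threshold=0.01)
selector.fit(toy)

# get_support() is a boolean mask over the input columns
kept = toy.columns[selector.get_support()].tolist()
dropped = toy.columns[~selector.get_support()].tolist()
print("kept:", kept)
print("dropped:", dropped)
```

The selector never looks at the target, so a kept column is not necessarily a useful one.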

In [None]:
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)



In [None]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_reduced, y)


SMOTE (Synthetic Minority Oversampling Technique) was applied to address class imbalance in the dataset. It generates synthetic samples for underrepresented classes, ensuring a balanced distribution of target labels. This step enhances the model's ability to learn patterns from all classes effectively, improving prediction performance for minority classes.
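
Oversampling is usually applied to the training portion only, after the split, so that the test set keeps the original class balance and contains no synthetic or duplicated points. The sketch below illustrates that ordering on toy data. To stay self-contained it uses plain random oversampling via sklearn.utils.resample as a stand-in for SMOTE; the ordering of the steps is the point, not the resampling method.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data (invented): 180 'common' rows and 20 'rare' rows
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "x": rng.normal(size=200),
    "label": ["common"] * 180 + ["rare"] * 20,
})

# 1) Split first, stratified so both parts keep the original class ratio
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["label"], random_state=42
)

# 2) Oversample each class in the training set only, up to the majority count
target_n = int(train["label"].value_counts().max())
train_balanced = pd.concat([
    resample(group, replace=True, n_samples=target_n, random_state=42)
    for _, group in train.groupby("label")
])

print(train_balanced["label"].value_counts())
print(test["label"].value_counts())
```

With SMOTE the same pattern would mean calling fit_resample on the training features and labels only.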

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)


In [None]:
rf_model = RandomForestClassifier(n_estimators=50, random_state=42)
rf_model.fit(X_train, y_train)


In [None]:
# Make predictions on the test set
y_pred = rf_model.predict(X_test)


Predictions were generated on the test set using the trained Random Forest model. This step evaluates the model's performance by comparing the predicted labels (y_pred) against the actual labels in the test data, enabling the calculation of evaluation metrics like accuracy, precision, and recall.

In [None]:
from sklearn.metrics import classification_report, accuracy_score

# Print evaluation metrics
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))


The classification report evaluates the Random Forest model's performance across the crime categories, giving precision, recall, and F1-score for each class along with the overall accuracy.

The Random Forest model achieved an accuracy of 1.0, with precision, recall, and F1-scores of 1.0 for all crime types, meaning every crime type in the test set was predicted correctly. A perfect score on a task like this should be read with caution rather than as proof of excellent feature selection and balancing. Two aspects of the pipeline can inflate it: the feature matrix still contains Crm Cd, the numeric code that corresponds to the crime description behind Crime_Type, and SMOTE was applied before the train/test split, so synthetic points derived from a sample can end up on the other side of the split from it.
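
A quick way to test whether a single feature gives the answer away is to check whether each of its values maps to exactly one target class. The sketch below does this on a toy frame; the helper name determines_target and all values (including the code numbers) are invented for illustration.

```python
import pandas as pd

# Toy frame with invented codes and labels
toy = pd.DataFrame({
    "Crm Cd": [101, 101, 202, 202, 303, 303],
    "Hour": [1, 14, 14, 22, 1, 22],
    "Crime_Type": ["Theft", "Theft", "Assault", "Assault", "Burglary", "Burglary"],
})

def determines_target(frame, feature, target):
    """True if every value of `feature` maps to exactly one value of `target`."""
    classes_per_value = frame.groupby(feature)[target].nunique()
    return bool(classes_per_value.max() == 1)

print(determines_target(toy, "Crm Cd", "Crime_Type"))  # code fixes the label
print(determines_target(toy, "Hour", "Crime_Type"))    # hour does not
```

If such a check comes back true for a feature in the real data, dropping that column from X before training gives a more honest estimate of what the remaining features can predict.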

In [None]:
# Get feature importance
import matplotlib.pyplot as plt
import numpy as np

feature_importance = rf_model.feature_importances_
features = X_train.columns

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.barh(features, feature_importance)
plt.xlabel("Feature Importance")
plt.ylabel("Features")
plt.title("Feature Importance for Crime Type Prediction")
plt.show()


The feature importance plot shows that Crm Cd is the most significant predictor of crime type, contributing over 50% of the importance. Because Crm Cd is itself the code for the crime, this dominance is more likely a sign of target leakage than of genuine predictive signal. Other features like Vict Age, Premis Cd, and Weapon Used Cd also play notable roles, while location-related features (LAT, LON) and time (Hour) have lesser influence.

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Select the top 20 most frequent crime types
top_classes = y_test.value_counts().head(20).index  # Adjust the number based on clarity
y_test_filtered = y_test[y_test.isin(top_classes)]
y_pred_filtered = y_pred[y_test.isin(top_classes)]

# Compute confusion matrix for the filtered classes
cm_filtered = confusion_matrix(y_test_filtered, y_pred_filtered, labels=top_classes)
disp_filtered = ConfusionMatrixDisplay(confusion_matrix=cm_filtered, display_labels=top_classes)

# Plot with enhanced clarity
fig, ax = plt.subplots(figsize=(12, 12))  # Adjust figure size
disp_filtered.plot(cmap='viridis', xticks_rotation=45, ax=ax)
plt.title("Confusion Matrix for Top 20 Crime Types")
plt.xticks(fontsize=10)  # Adjust font size for x-ticks
plt.yticks(fontsize=10)  # Adjust font size for y-ticks
plt.show()


The confusion matrix shows the model performs well for the top 20 crime types, with most predictions correct along the diagonal. Misclassifications are minimal, highlighting good accuracy for these crimes.

# Deep Learning Model

In [None]:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

# Initialize label encoder
label_encoder = LabelEncoder()

# Fit and transform the target labels
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Convert to one-hot encoded format
y_train_one_hot = to_categorical(y_train_encoded)
y_test_one_hot = to_categorical(y_test_encoded)

# Check the shapes
print(f"y_train shape after encoding: {y_train_one_hot.shape}")
print(f"y_test shape after encoding: {y_test_one_hot.shape}")


Target labels were encoded using LabelEncoder to convert them into numerical format, followed by one-hot encoding for compatibility with the deep learning model. This step ensures the labels are properly formatted for multi-class classification.
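
The two encoding steps can be illustrated without Keras. The sketch below uses LabelEncoder and a NumPy identity matrix to build the same kind of one-hot matrix on toy labels, then maps predictions back to label names; the labels are invented.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["Theft", "Assault", "Other", "Theft"])  # toy labels

# LabelEncoder sorts classes alphabetically: Assault=0, Other=1, Theft=2
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(len(encoder.classes_), dtype="float32")[encoded]

# argmax undoes the one-hot step; inverse_transform restores the names
decoded = encoder.inverse_transform(one_hot.argmax(axis=1))

print(encoded)
print(one_hot)
print(decoded)
```

Keeping the fitted encoder around is what allows integer predictions from the network to be reported as crime type names.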

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Initialize the Sequential model
model = Sequential()

# Input layer
model.add(Dense(128, input_dim=X_train.shape[1], activation='relu'))  # one input per feature column
model.add(Dropout(0.3))

# Hidden layers
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))

# Output layer
num_classes = y_train_one_hot.shape[1]  # Number of unique classes
model.add(Dense(num_classes, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Summary
model.summary()


A sequential deep learning model was designed with an input layer matching the feature dimensions, two hidden layers using ReLU activation for non-linearity, and a softmax output layer for multi-class classification. Dropout layers were added to prevent overfitting, and the model was compiled with categorical cross-entropy loss and accuracy as the evaluation metric.

In [None]:
# Train the model
history = model.fit(
    X_train, y_train_one_hot,  # Features and one-hot encoded labels
    validation_data=(X_test, y_test_one_hot),  # Validation data
    epochs=20,  # Adjust as needed
    batch_size=32,  # Adjust based on dataset size
    verbose=1  # Shows training progress
)


The model was trained over 20 epochs using the training dataset, with validation data to monitor performance. Despite completing the training process, the accuracy and loss metrics remained constant across all epochs, indicating the model failed to learn effectively. Likely causes include unscaled input features (the columns span very different numeric ranges), poor feature representation, or an inappropriate model architecture. Further analysis and adjustments are needed to improve learning outcomes.

In [None]:
# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test_one_hot, verbose=0)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy * 100:.2f}%")


The model evaluation on the test dataset yielded a loss of 4.5752 and a test accuracy of 1.00%. The extremely low accuracy indicates the model failed to generalize or learn meaningful patterns, likely due to issues such as an inappropriate model structure, weak feature representation, or data-related challenges like imbalance or noise. Further refinement of the model and dataset preprocessing is required.
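
These two numbers can be checked against a chance-level baseline. A softmax classifier that outputs nearly uniform probabilities over K classes has a categorical cross-entropy of about ln(K) and, with balanced classes, an accuracy of about 1/K. The sketch below inverts that relation for the reported loss; the actual number of classes is not stated in this section, so this is a consistency check and not a measurement.

```python
import math

reported_loss = 4.5752  # test loss reported above

# If loss = ln(K) for a uniform prediction, then K = exp(loss)
implied_classes = math.exp(reported_loss)

# Accuracy expected from guessing among that many balanced classes
chance_accuracy = 1 / implied_classes

print(f"Implied number of classes: {implied_classes:.1f}")
print(f"Chance-level accuracy: {chance_accuracy * 100:.2f}%")
```

The implied chance accuracy is close to the reported 1.00%, which is consistent with a network that is effectively guessing.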

In [None]:
import matplotlib.pyplot as plt

# Plot training and validation accuracy
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.title('Model Accuracy')
plt.legend()
plt.show()

# Plot training and validation loss
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Model Loss')
plt.legend()
plt.show()


The plots of model accuracy and loss over 20 epochs reveal that the model failed to improve its performance. Both training and validation accuracy remained nearly constant at a very low value, indicating a lack of learning. Similarly, the loss metrics show no significant reduction, suggesting issues such as insufficient model complexity, data preprocessing errors, or ineffective feature representation. These results emphasize the need for revising the model architecture and preprocessing steps to enable effective learning.

In [None]:
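import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Build the confusion matrix from integer class indices
y_true_classes = y_test_encoded
y_pred_classes = np.argmax(model.predict(X_test), axis=1)
conf_matrix = confusion_matrix(y_true_classes, y_pred_classes, labels=np.arange(num_classes))

# Keep the first n_top classes (n_top = 10 is an assumed value; a separate
# name avoids clashing with the top_classes index defined earlier)
n_top = 10
conf_matrix_limited = conf_matrix[:n_top, :n_top]

plt.figure(figsize=(12, 10))
sns.heatmap(conf_matrix_limited, annot=True, fmt='d', cmap='Blues', cbar=True)

plt.title('Confusion Matrix for Top Classes')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')

# Add class names from the fitted label encoder
class_labels = label_encoder.classes_[:n_top]
plt.xticks(ticks=np.arange(n_top) + 0.5, labels=class_labels, rotation=45)
plt.yticks(ticks=np.arange(n_top) + 0.5, labels=class_labels, rotation=45)

plt.show()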


The confusion matrix presented is ineffective, as it contains only zeros, indicating no valid predictions or classifications were captured. This could result from mismatched or incorrect y_true and y_pred inputs, improper slicing, or errors in the prediction or visualization process. To address this, it is crucial to verify the alignment and correctness of the true and predicted labels, recalibrate the confusion matrix, and ensure meaningful data is processed before visualizing the results.

In [None]:
from sklearn.utils.class_weight import compute_class_weight

# Compute class weights
class_weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train_encoded),
    y=y_train_encoded
)

class_weights_dict = {i: weight for i, weight in enumerate(class_weights)}
print("Class Weights:", class_weights_dict)


Class weights were computed using compute_class_weight to address class imbalance in the dataset. These weights assign higher importance to underrepresented classes and lower importance to overrepresented ones, helping the model learn more effectively from imbalanced data. The resulting dictionary (class_weights_dict) was used during model training to improve performance across all classes.
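
The 'balanced' option weights each class by n_samples / (n_classes * count_of_class). The sketch below checks that formula against compute_class_weight on a small toy label vector; the labels are invented.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_toy = np.array([0, 0, 0, 1])  # three samples of class 0, one of class 1
classes = np.unique(y_toy)

# Weights as computed by scikit-learn
sk_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same weights from the formula n_samples / (n_classes * class_count)
manual_weights = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(sk_weights)
print(manual_weights)
```

The rarer class receives the larger weight, so its errors count more in the loss.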

In [None]:
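model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Class weights are passed to fit(), not compile();
# weighted_metrics expects a list of metrics, not a weight dictionary
history = model.fit(
    X_train, y_train_one_hot,
    validation_data=(X_test, y_test_one_hot),
    epochs=20,
    batch_size=32,
    class_weight=class_weights_dict
)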


In [None]:
import matplotlib.pyplot as plt

# Plot training and validation accuracy
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()


The training and validation accuracy plot shows significant fluctuations without meaningful improvement over epochs. This indicates that the model struggles to learn patterns from the data, likely due to factors like an ineffective model architecture, insufficient feature representation, or noisy data. The lack of convergence suggests the need for deeper analysis and adjustments in the model or data preprocessing.

In [None]:
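y_true_classes = y_test_encoded  # integer class indices of the true labels
y_pred_classes = np.argmax(model.predict(X_test), axis=1)  # most probable class per sample
for i in range(10):  # Display 10 test samples
    print(f"True Label: {y_true_classes[i]}, Predicted Label: {y_pred_classes[i]}")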


In [None]:
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"Class distribution after SMOTE: {np.bincount(y_train_encoded)}")


In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


In [None]:
model = Sequential()
model.add(Dense(256, input_dim=X_train.shape[1], activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax'))  # Ensure output matches number of classes

model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)


In this section, the predicted test labels revealed a single dominant class (76), indicating poor model generalization. To address this, SMOTE was used to balance the class distribution, ensuring equal representation across all classes. Features were standardized using StandardScaler for improved training stability. A new Sequential model was implemented with additional layers, dropout for regularization, and a softmax output layer, aiming to improve multi-class classification performance.
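
The scaling step matters because gradient-based training is sensitive to features on very different numeric ranges. The sketch below shows the fit-on-train, transform-on-test pattern on toy data; the two columns are invented stand-ins for a latitude-like and a code-like feature.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features on very different scales (invented values)
rng = np.random.default_rng(0)
X_train_toy = np.column_stack([
    rng.uniform(33.7, 34.3, 100),   # latitude-like column
    rng.integers(100, 1000, 100),   # code-like column
])
X_test_toy = np.column_stack([
    rng.uniform(33.7, 34.3, 20),
    rng.integers(100, 1000, 20),
])

# Learn mean and standard deviation from the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)

# Reuse the training statistics for the test data
X_test_scaled = scaler.transform(X_test_toy)

print(X_train_scaled.mean(axis=0).round(3))
print(X_train_scaled.std(axis=0).round(3))
```

The test columns will not have exactly zero mean and unit variance, which is expected because they are scaled with the training statistics.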

In [None]:
from tensorflow.keras.optimizers import Adam
optimizer = Adam(learning_rate=0.0001)  # Smaller learning rate
model.compile(
    optimizer=optimizer,
    loss='categorical_crossentropy',
    metrics=['accuracy']
)


In [None]:
history = model.fit(
    X_train, y_train_one_hot,
    validation_data=(X_test, y_test_one_hot),
    epochs=50,
    batch_size=32
)


The model was trained for 50 epochs, showing consistent improvement in both training and validation accuracy, with the validation accuracy reaching around 83.9%. The loss values steadily decreased, indicating effective learning and convergence. This reflects that the new architecture and preprocessing steps, including SMOTE and scaling, contributed to improved model performance.

In [None]:
model.save('trained_model.keras')


In [None]:
from tensorflow.keras.models import load_model

loaded_model = load_model('trained_model.keras')


In [None]:
# Evaluate the loaded model
loss, accuracy = loaded_model.evaluate(X_test, y_test_one_hot, verbose=0)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy * 100:.2f}%")


In [None]:
# Predict using the loaded model
y_pred = loaded_model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)
y_true_classes = y_test_encoded  # integer class indices of the true labels

# Display some predictions
for i in range(5):
    print(f"Sample {i+1}: True Label = {y_true_classes[i]}, Predicted Label = {y_pred_classes[i]}")


Predictions were generated using the trained model on the test set. A sample of predictions shows that the model successfully predicted the correct labels for several test samples, indicating that the model has learned meaningful patterns and improved in its classification performance compared to earlier iterations.

In [None]:
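import numpy as np
from sklearn.metrics import confusion_matrix

# Confusion matrix over all classes for the retrained model
conf_matrix = confusion_matrix(y_test_encoded, y_pred_classes, labels=np.arange(num_classes))

# Find top N classes with the highest support
top_classes = 10
class_support = np.sum(conf_matrix, axis=1)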
top_class_indices = np.argsort(class_support)[-top_classes:]

# Slice the confusion matrix
conf_matrix_limited = conf_matrix[np.ix_(top_class_indices, top_class_indices)]

# Plot the confusion matrix for top classes
plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix_limited, annot=True, fmt='d', cmap='Blues', cbar=True)

# Add class labels
plt.title(f'Confusion Matrix (Top {top_classes} Classes)')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.xticks(ticks=np.arange(top_classes) + 0.5, labels=top_class_indices, rotation=45)
plt.yticks(ticks=np.arange(top_classes) + 0.5, labels=top_class_indices, rotation=45)
plt.show()


A confusion matrix was generated for the top 10 most frequent classes to visualize the model's performance. The diagonal dominance indicates that the model accurately predicted the majority of samples for these classes. This focused analysis highlights the model's effectiveness in handling the most common classes while providing insights into potential misclassifications for further refinement.

# Insights for LAPD from Crime Prediction Model