# **Project Name**    -  **FBI Time Series Forecasting**




##### **Project Type**    - EDA/Regression
##### **Contribution**    - Individual

# **Project Summary - FBI Time Series Forecasting**

**Overview:-**
The FBI Time Series Forecasting project focuses on analyzing historical crime data to predict future crime trends in the United States. Time series forecasting is a crucial area of data science that applies statistical and machine learning methods to temporal data, enabling insights and anticipatory decision-making. This project aims to forecast crime counts based on historical trends, seasonality, and other temporal patterns to assist law enforcement agencies, policymakers, and analysts in strategic planning and resource allocation.

**Project Objective :-**
The primary goal of this project is to build a predictive model that can accurately forecast future values of crime data based on historical FBI crime statistics. This involves understanding past trends, detecting seasonal fluctuations, and identifying anomalies or changes in crime patterns. The project does not classify data into categories but rather predicts continuous numerical values, making it a supervised regression problem with a time series structure.

**Problem Understanding :-**
FBI crime data often exhibits time-dependent characteristics such as long-term trends, recurring seasonal behaviors, and occasional outliers due to events like policy changes, pandemics, or social unrest. Therefore, a deep understanding of these patterns is vital. Time series data is inherently sequential, meaning the order and intervals between data points play a significant role. The forecasting model must account for this temporal dependency to provide reliable future predictions.

**Methodology :-**
The project follows a structured approach beginning with Exploratory Data Analysis (EDA) to understand the underlying structure of the dataset. EDA includes plotting time series data, analyzing seasonality and trend components, and checking for stationarity using statistical tests such as the Augmented Dickey-Fuller (ADF) test.

Following this, various forecasting models are considered:

- **Statistical Models:** ARIMA and SARIMA are evaluated for their ability to handle trend and seasonality. These models are well-suited for univariate time series forecasting where interpretability and performance are balanced.

- **Machine Learning Models:** Algorithms like Random Forest and XGBoost are used with engineered features such as lag values and rolling averages. These models are helpful when there is a need to incorporate additional covariates or when data patterns are non-linear.

- **Deep Learning Models:** LSTM (Long Short-Term Memory networks) and Transformer-based models are explored for capturing long-term dependencies and complex sequences in the data.
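
The lag values and rolling averages mentioned for the machine learning models could be built roughly as below. The function name `add_lag_features`, the chosen lags, and the toy series are illustrative assumptions. The important detail is that the rolling mean is computed on shifted values, so each row only sees past months.

```python
import pandas as pd


def add_lag_features(monthly, lags=(1, 2, 12), window=3):
    """Build lag and rolling-mean features from a monthly count Series."""
    feats = pd.DataFrame({"y": monthly})
    for lag in lags:
        feats[f"lag_{lag}"] = monthly.shift(lag)
    # shift(1) first so the rolling mean uses past months only (no leakage)
    feats[f"roll_mean_{window}"] = monthly.shift(1).rolling(window).mean()
    # Rows without a full set of lags cannot be used for training
    return feats.dropna()


# Toy series: 24 months with values 0..23 (illustrative only)
idx = pd.date_range("2020-01-01", periods=24, freq="MS")
monthly = pd.Series(range(24), index=idx, dtype=float)
features = add_lag_features(monthly)
print(features.head())
```

With a 12-month lag, the first year of data is consumed by feature construction, which matters when the history is short.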

Each model is trained using historical data and validated using time-based cross-validation to ensure that the temporal integrity of the data is preserved.
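
One way to picture time-based cross-validation is an expanding window, where every test fold lies strictly after its training data. The sketch below is a hand-rolled illustration; `expanding_window_splits` and its parameters are hypothetical names, and scikit-learn's `TimeSeriesSplit` provides comparable behaviour.

```python
import numpy as np


def expanding_window_splits(n_samples, n_splits=3, test_size=6):
    """Yield (train_idx, test_idx) pairs where training always precedes testing."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("Not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = np.arange(0, test_start)
        test_idx = np.arange(test_start, test_start + test_size)
        yield train_idx, test_idx


# 36 monthly observations, three folds of 6 months each
splits = list(expanding_window_splits(n_samples=36, n_splits=3, test_size=6))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  ->  test {test_idx[0]}..{test_idx[-1]}")
```

Unlike shuffled k-fold, no fold ever trains on observations that come after the months it is asked to predict.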

**Evaluation Metrics**
To measure the performance of the forecasting models, standard regression metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE) are used. These metrics provide insight into the accuracy and reliability of each model's predictions.
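
For reference, the three metrics can be computed with plain NumPy as in the sketch below. The function name `forecast_metrics` and the sample numbers are illustrative. Skipping zero actuals in MAPE is one possible convention (an assumption here), since the percentage error is undefined for them.

```python
import numpy as np


def forecast_metrics(y_true, y_pred):
    """Return MAE, RMSE and MAPE (in percent) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    # MAPE is undefined where the actual value is zero, so skip those points
    nonzero = y_true != 0
    if nonzero.any():
        mape = np.mean(np.abs(err[nonzero] / y_true[nonzero])) * 100
    else:
        mape = np.nan
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape}


metrics = forecast_metrics([100, 200, 400], [110, 190, 380])
print(metrics)
```

RMSE penalizes large misses more heavily than MAE, while MAPE is scale-free but can be unstable for crime types with very low monthly counts.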

**Expected Outcomes**
The final output of the project is a trained forecasting model capable of predicting future crime rates based on past data. This model can assist in proactive planning, resource optimization, and policy development. In addition, visualizations generated during EDA and forecasting will help stakeholders better understand crime trends over time.

**Significance**
The FBI Time Series Forecasting project demonstrates how data science can be applied to public safety and governance. Accurate forecasting enables decision-makers to anticipate changes in crime levels and respond effectively. By leveraging historical data, this project contributes to smarter, data-driven crime prevention strategies.

# **GitHub Link - https://github.com/Kanishk-30**

[Kanishk's Github](https://github.com/Kanishk-30)

# **Problem Statement**


**To develop an accurate and reliable time series forecasting model using historical FBI crime data to predict future crime occurrences, enabling better planning, policy formulation, and resource allocation by identifying temporal patterns such as trends and seasonality in crime rates.**




#### **Define Your Business Objective?**

To enhance strategic decision-making and operational efficiency within law enforcement agencies by leveraging predictive analytics to forecast future crime trends, allowing for timely intervention, optimal resource allocation, and data-driven policy development.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart follows the format below and answers the listed questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Basic Libraries
import numpy as np                # for numerical operations
import pandas as pd               # for data manipulation and analysis

# Visualization
import matplotlib.pyplot as plt   # for plotting basic graphs
import seaborn as sns             # for advanced visualizations
import plotly.express as px       # for interactive plots (optional but helpful)
import matplotlib.dates as mdates # for date formatting in time series plots

# Time Series-Specific Analysis
from statsmodels.tsa.seasonal import seasonal_decompose   # for decomposing trends & seasonality
from statsmodels.tsa.stattools import adfuller            # for stationarity test (ADF test)
from pandas.plotting import lag_plot, autocorrelation_plot # for checking dependencies

# Warnings Handling
import warnings
warnings.filterwarnings('ignore')


### Dataset Loading

In [None]:
# Load Dataset
# Load training dataset (event-level crime data)
train_df = pd.read_excel('Train.xlsx', parse_dates=['Date'])

# Load test dataset (monthly aggregate template)
test_df = pd.read_csv('Test.csv')


### Dataset First View

In [None]:
# Dataset First Look
print("Training Data Preview:")
print(train_df.head())

print("\nTest Data Preview:")
print(test_df.head())


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f"Training Data - Rows: {train_df.shape[0]}, Columns: {train_df.shape[1]}")
print(f"Test Data - Rows: {test_df.shape[0]}, Columns: {test_df.shape[1]}")


### Dataset Information

In [None]:
# Dataset Info
print("Training Dataset Info:")
train_df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_rows = train_df.duplicated().sum()
print(f"Duplicate Rows in Training Data: {duplicate_rows}")

# Optionally drop duplicates
train_df = train_df.drop_duplicates()


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Missing Values in Training Data:")
print(train_df.isnull().sum())

# Visualize missing data
plt.figure(figsize=(12, 6))
sns.heatmap(train_df.isnull(), cbar=False, cmap="Reds")
plt.title("Missing Values in Training Data")
plt.show()


### What did you know about your dataset?

The dataset contains detailed records of crime incidents reported over time. Each entry in the training dataset includes information such as the type of crime (TYPE), location details (HUNDRED_BLOCK, NEIGHBOURHOOD, Latitude, Longitude), time (HOUR, MINUTE), and date (Date, YEAR, MONTH, DAY). This makes it a rich event-level time series dataset, ideal for forecasting trends in crime.

The test dataset, by contrast, is aggregated by YEAR, MONTH, and TYPE, and includes the column Incident_Counts which we aim to predict. This means our training data needs to be grouped accordingly to match the structure of the test data.

Exploratory Data Analysis revealed some missing values in fields like Latitude, Longitude, and Minute, and a small number of duplicate rows. These issues need to be addressed through imputation or removal. The dataset includes both categorical and numerical features, and the presence of the Date column allows for extracting time-based patterns such as seasonality or trends.

In summary, the data is rich, temporal, and location-aware, providing a strong foundation for time series forecasting models aimed at predicting monthly crime counts by type.
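
The grouping step needed to match the test structure might look like the sketch below. The tiny `events` frame is made-up data standing in for the training set. The column names `YEAR`, `MONTH`, `TYPE`, and `Incident_Counts` are the ones described above.

```python
import pandas as pd

# Tiny stand-in for the event-level training data (values are made up)
events = pd.DataFrame({
    "TYPE": ["Theft", "Theft", "Mischief", "Theft", "Mischief"],
    "YEAR": [2012, 2012, 2012, 2012, 2012],
    "MONTH": [1, 1, 1, 2, 2],
})

# One row per (YEAR, MONTH, TYPE) with the number of incidents in that month
monthly_counts = (
    events.groupby(["YEAR", "MONTH", "TYPE"])
    .size()
    .reset_index(name="Incident_Counts")
)
print(monthly_counts)
```

Note that month and type combinations with no incidents do not appear in this output, so they would need to be filled with zero explicitly if the test template expects them.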



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Train Dataset Columns:")
print(train_df.columns.tolist())


In [None]:
# Dataset Describe
print("Statistical Summary of Numerical Features:")
print(train_df.describe())


### Variables Description


| Column Name       | Description                                      |
|-------------------|--------------------------------------------------|
| TYPE              | Category of crime (e.g., Theft, Assault).        |
| HUNDRED_BLOCK     | Street-level location of the incident.           |
| NEIGHBOURHOOD     | Neighborhood where the crime occurred.           |
| X, Y              | Raw spatial coordinates of the location.         |
| Latitude, Longitude | Geographic coordinates for mapping.           |
| HOUR              | Hour of the day when the incident occurred.      |
| MINUTE            | Minute of the hour (0–59).                       |
| YEAR              | Year when the incident happened.                 |
| MONTH             | Month of the incident (1–12).                    |
| DAY               | Day of the month (1–31).                         |
| Date              | Full date of the incident (datetime format).     |

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("Unique value count per column in train_df:\n")
for col in train_df.columns:
    unique_values = train_df[col].nunique()
    print(f"{col}: {unique_values} unique values")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Make a copy of the original data
df = train_df.copy()

# 1. Drop irrelevant columns (if not useful for forecasting)
df = df.drop(['X', 'Y'], axis=1)  # optional, if they are not meaningful
# You can also drop 'HUNDRED_BLOCK' if it's too noisy or high-cardinality

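# 2. Handle missing values
# Assign the result back to the column: calling fillna(..., inplace=True) on a
# column selection is chained assignment and may not update df in recent pandas
# Example: Fill missing minutes with 0 (or mode)
df['MINUTE'] = df['MINUTE'].fillna(0)

# Fill Latitude and Longitude with mean (or drop rows if preferred)
df['Latitude'] = df['Latitude'].fillna(df['Latitude'].mean())
df['Longitude'] = df['Longitude'].fillna(df['Longitude'].mean())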

# 3. Convert 'Date' to datetime (if not already done)
df['Date'] = pd.to_datetime(df['Date'])

# 4. Set Date as Index (important for time series analysis)
df.set_index('Date', inplace=True)

# 5. Sort by datetime index (good practice for time series)
df.sort_index(inplace=True)

# 6. Create new time-based features
df['DayOfWeek'] = df.index.day_name()
df['Week'] = df.index.isocalendar().week
df['Quarter'] = df.index.quarter
df['IsWeekend'] = df.index.dayofweek >= 5

# 7. Check data types and confirm clean-up
df.info()


### What all manipulations have you done and insights you found?

###  Data Manipulations Done

1. **Dropped Irrelevant Columns**
   - Removed `X` and `Y` as they are raw spatial coordinates and not directly useful.
   - Optionally considered dropping `HUNDRED_BLOCK` due to high cardinality.

2. **Handled Missing Values**
   - Filled missing `MINUTE` values with `0`.
   - Imputed missing `Latitude` and `Longitude` using column mean values.

3. **Datetime Handling**
   - Converted the `Date` column to datetime format.
   - Set `Date` as the index for time-based operations.
   - Sorted the dataset by date to maintain temporal order.

4. **Feature Engineering**
   - Extracted new time-based features:
     - `DayOfWeek`: Name of the weekday.
     - `Week`: ISO calendar week number.
     - `Quarter`: Calendar quarter (1–4).
     - `IsWeekend`: Boolean flag for weekends.

5. **Checked and Removed Duplicates**
   - Identified and removed duplicate rows to ensure data integrity.

6. **Validated Data Types**
   - Ensured all columns have the correct data types using `.info()`.

---

###  Insights Gained

1. **Time Series Structure**
   - Data is time-stamped and well-suited for time series forecasting.

2. **High Cardinality in Location Columns**
   - `HUNDRED_BLOCK` has too many unique values, limiting its use in grouping or modeling.

3. **Minor Missing Data**
   - Missing values were minimal and handled effectively through imputation.

4. **Event-Level Granularity**
   - Each row is an individual crime report; data needs to be aggregated for monthly forecasts.

5. **Feature Engineering Opportunities**
   - Variables like `HOUR` and `DayOfWeek` can help identify patterns in crime timing.
   - `TYPE` is the most important column for grouping and forecasting.



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1. Crime count over time

In [None]:
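# Chart - 1 visualization code
# 'MS' (month start) bins by calendar month without relying on the deprecated 'M' alias
df.resample('MS').size().plot(figsize=(12, 5), title='Monthly Crime Count')
plt.show()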


##### 1. Why did you pick the specific chart?

Answer 1. A line chart was chosen because it best represents trends and fluctuations in time series data. Monthly resampling helps visualize long-term and seasonal crime patterns.


##### 2. What is/are the insight(s) found from the chart?

Answer 2. The chart reveals key insights like overall trends (rising/falling), seasonal spikes (e.g., during holidays), and potential anomalies (e.g., sudden drops). These insights inform both the EDA and the ML forecasting phase.
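
A rough way to check for seasonal spikes numerically is to average the counts by calendar month and compare each month with the overall mean. The series below is synthetic with a deliberately built-in July peak, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic monthly counts with a built-in July peak (illustrative, not real data)
idx = pd.date_range("2015-01-01", periods=48, freq="MS")
month_num = idx.month.to_numpy()
base = 1000 + 200 * np.cos(2 * np.pi * (month_num - 7) / 12)
monthly = pd.Series(base, index=idx)

# Average count for each calendar month across all years
month_profile = monthly.groupby(monthly.index.month).mean()
# Express each month relative to the overall mean (1.0 = average month)
seasonal_index = month_profile / monthly.mean()
peak_month = int(seasonal_index.idxmax())
print(seasonal_index.round(2))
print("Peak month:", peak_month)
```

On real data, the same profile could be computed from `df.resample('MS').size()`. Values well above 1.0 would point to months that are consistently busier than average.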


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer 3. These insights directly support strategic planning. They help optimize police resource allocation, identify high-risk periods, and improve model accuracy. However, a rising trend or unmanaged seasonal spikes can reflect negative growth and indicate the need for urgent policy or operational changes.


#### Chart - 2 Crime count by type


In [None]:
# Chart - 2 visualization code
df['TYPE'].value_counts().plot(kind='bar', figsize=(10, 5), title='Total Crimes by Type')


##### 1. Why did you pick the specific chart?

Answer 1. A bar chart was selected to compare the frequency of different crime types. It's the most effective visualization for categorical distribution.


##### 2. What is/are the insight(s) found from the chart?

Answer 2. The chart shows which crime types are most common (e.g., Theft, Mischief) and which are rare. This helps prioritize attention for prevention and forecasting.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer 3. These insights can guide police departments in resource allocation and policy-making. High volumes in violent crime categories may indicate a negative social trend, requiring immediate intervention.


#### Chart - 3. Crime count by hour of day

In [None]:
# Chart - 3 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Reset index to avoid reindex errors
df_plot = df.reset_index()

# Step 2: Ensure 'HOUR' column is numeric (in case of import issues)
df_plot['HOUR'] = pd.to_numeric(df_plot['HOUR'], errors='coerce')

# Step 3: Drop any rows with missing or invalid 'HOUR'
df_plot = df_plot.dropna(subset=['HOUR'])

# Step 4: Convert hour to int (optional, for sorting)
df_plot['HOUR'] = df_plot['HOUR'].astype(int)

# Step 5: Sort hours for correct axis order
hour_order = list(range(24))

# Step 6: Plot the count of crimes by hour
plt.figure(figsize=(12, 6))
sns.countplot(x='HOUR', data=df_plot, order=hour_order, palette='viridis')
plt.title('Crime Distribution by Hour of Day')
plt.xlabel('Hour (0 = Midnight, 23 = 11 PM)')
plt.ylabel('Crime Count')
plt.grid(axis='y')
plt.show()



##### 1. Why did you pick the specific chart?

Answer 1. A countplot was chosen to show how crime frequency varies by hour of the day — ideal for visualizing discrete time-based patterns.


##### 2. What is/are the insight(s) found from the chart?

Answer 2. The plot highlights peak crime hours, showing whether incidents are concentrated during late nights, early mornings, or business hours.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer 3. This insight supports time-based resource allocation (e.g., night patrols) and enhances model features. Consistently high late-night crime may indicate areas of negative growth requiring targeted intervention.


#### Chart - 4. Crime by day of week

In [None]:
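# Chart - 4 visualization code
# Reset the index first: the Date index has duplicate labels, which can cause
# reindex errors in seaborn (the same precaution is taken in Chart 3)
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
sns.countplot(x='DayOfWeek', data=df.reset_index(), order=day_order)
plt.title('Crime Count by Day of the Week')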


##### 1. Why did you pick the specific chart?

Answer 1. A countplot was used to compare crime frequency across weekdays — best for identifying behavioral or routine-based crime patterns.


##### 2. What is/are the insight(s) found from the chart?

Answer 2. The plot reveals which days have higher or lower crime activity. Weekends may show spikes due to nightlife, while weekdays may reflect work-related incidents.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer 3. These insights help optimize staffing and patrol timing. Regular weekend spikes suggest a need for targeted planning. If the observed patterns match expectations, they do not by themselves indicate a negative trend.


#### Chart - 5.Heatmap - Crimes by hour vs day

In [None]:
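# Chart - 5 visualization code
heatmap_data = df.pivot_table(index='HOUR', columns='DayOfWeek', aggfunc='size', fill_value=0)
# Put weekdays in calendar order (the pivot sorts the day names alphabetically)
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
heatmap_data = heatmap_data.reindex(columns=day_order)
sns.heatmap(heatmap_data, cmap='Reds')
plt.title('Heatmap of Crimes by Hour and Day of Week')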


##### 1. Why did you pick the specific chart?

Answer 1. A heatmap was chosen to show how crime frequency varies by hour and weekday — best for spotting time-day interaction patterns.


##### 2. What is/are the insight(s) found from the chart?

Answer 2. It reveals high-crime windows (e.g., Sunday nights, weekday evenings). The darkest cells indicate the hour and weekday combinations in which crimes cluster.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer 3. These insights enable targeted time-slot policing. If unaddressed, repeated late-night crime peaks could negatively impact safety perception.


#### Chart - 6.Monthly trend for each crime type

In [None]:
# Chart - 6 visualization code
monthly_type = df.groupby([df.index.to_period('M'), 'TYPE']).size().unstack()
monthly_type.plot(figsize=(15, 6), title='Monthly Crime Trends by Type')


##### 1. Why did you pick the specific chart?

Answer 1. A multi-line time series plot was used to track monthly trends for each crime type — ideal for comparing category-specific patterns over time.


##### 2. What is/are the insight(s) found from the chart?

Answer 2. The chart highlights seasonal surges and long-term shifts in different crime types, showing which crimes rise or fall.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Answer 3. This helps prioritize prevention strategies per crime type. A consistently rising type may require urgent attention to prevent long-term negative impact.


#### Chart - 7. Box Plot:Hourly crime spread by type

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(14, 6))
sns.boxplot(x='TYPE', y='HOUR', data=df.reset_index())
plt.xticks(rotation=90)
plt.title('Hour Distribution by Crime Type')
plt.xlabel('Crime Type')
plt.ylabel('Hour of the Day')
plt.grid(axis='y')
plt.show()



##### 1. Why did you pick the specific chart?

Answer 1. A boxplot was used to show how crime hours are distributed for each type — ideal for comparing spread and central tendency across categories.


##### 2. What is/are the insight(s) found from the chart?

Answer 2. It reveals which crimes cluster around specific hours (e.g., night vs day), and which have wider or more unpredictable time ranges.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer 3. These insights guide time-targeted prevention per crime type. Broad hour spreads may indicate unpredictable patterns, needing flexible patrol strategies.


#### Chart - 8 Crimes By neighbourhood(top 10)

In [None]:
# Chart - 8 visualization code
df['NEIGHBOURHOOD'].value_counts().head(10).plot(kind='bar', title='Top 10 Neighbourhoods by Crime')


##### 1. Why did you pick the specific chart?

Answer 1. A bar chart was used to highlight the top 10 neighbourhoods by crime count — best for quick comparison of spatial hotspots.


##### 2. What is/are the insight(s) found from the chart?

Answer 2. It identifies areas with the highest crime density, pointing to potential hotspots needing targeted interventions.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer 3. These insights help focus security resources on high-risk areas. Persistent concentration in certain zones may signal deeper social or infrastructure issues needing attention.


#### Chart - 9. Heatmap of missing values


In [None]:
# Chart - 9 visualization code
sns.heatmap(df.isnull(), cbar=False, cmap='Reds')
plt.title('Missing Values Heatmap')


##### 1. Why did you pick the specific chart?

Answer 1. A heatmap was used to detect missing data patterns across the dataset — helpful for spotting data quality issues.


##### 2. What is/are the insight(s) found from the chart?

Answer 2. It visually highlights which columns have missing values and whether they follow any row-wise or column-wise patterns.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer 3. These insights support decisions on imputation or column removal. If missingness is high in key features, it could hinder model performance or require alternate strategies.


#### Chart - 10. Rolling mean of crime counts

In [None]:
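# Chart - 10 visualization code
# 'MS' (month start) bins by calendar month without relying on the deprecated 'M' alias
df.resample('MS').size().rolling(window=6).mean().plot(figsize=(12, 5), title='6-Month Rolling Average of Crime')
plt.show()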


##### 1. Why did you pick the specific chart?

Answer 1. A line chart with a 6-month rolling average was used to smooth monthly crime trends — ideal for revealing long-term patterns.


##### 2. What is/are the insight(s) found from the chart?

Answer 2. The graph reduces noise and makes it easier to detect rising or falling crime rates over time.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer 3. These trends guide strategic forecasting and resource planning. A sustained rise in the rolling average signals potential negative social conditions needing proactive intervention.
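
A small worked example shows what the 6-month window does at the start of the series. The values are toy numbers chosen for easy arithmetic. `min_periods=1` is shown as an optional alternative, not something the chart above uses.

```python
import pandas as pd

idx = pd.date_range("2020-01-01", periods=8, freq="MS")
monthly = pd.Series([10, 20, 30, 40, 50, 60, 70, 80], index=idx, dtype=float)

# Default behaviour: the first window-1 values are NaN
strict = monthly.rolling(window=6).mean()
# min_periods=1 averages whatever history is available so far
relaxed = monthly.rolling(window=6, min_periods=1).mean()

print(pd.DataFrame({"count": monthly, "strict": strict, "relaxed": relaxed}))
```

The leading NaN values explain why the rolling-average line starts five months later than the raw monthly line.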


#### Chart - 11. Pie chart: Crime proportion by type

In [None]:
# Chart - 11 visualization code
df['TYPE'].value_counts().plot.pie(autopct='%1.1f%%', figsize=(8, 8), title='Crime Type Proportion')


##### 1. Why did you pick the specific chart?

Answer 1. A pie chart was used to show the proportional distribution of crime types — effective for understanding category dominance.


##### 2. What is/are the insight(s) found from the chart?

Answer 2. It reveals which crime types make up the largest share, highlighting the most common threats.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer 3. These proportions help prioritize resources and model focus. Overdependence on one dominant crime type might skew attention, leading to oversight of emerging risks.



#### Chart - 12. Crime density plot(Hour)

In [None]:
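# Chart - 12 visualization code
# `fill=True` replaces the deprecated `shade=True`; missing hours are dropped before plotting
sns.kdeplot(x=df['HOUR'].dropna().to_numpy(), fill=True)
plt.title('Crime Density over Hours')
plt.xlabel('Hour of Day')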


##### 1. Why did you pick the specific chart?

Answer 1. A KDE plot was used to show the smoothed distribution of crimes by hour — ideal for identifying peak activity times.


##### 2. What is/are the insight(s) found from the chart?

Answer 2. It highlights the most crime-active periods (e.g., night hours or early mornings), showing underlying frequency trends.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer 3. This aids in time-targeted deployment and scheduling. Ignoring consistent late-night peaks could lead to gaps in law enforcement coverage.


#### Chart - 13. Correlation Heatmap

In [None]:
# Chart - 13 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Set up a larger figure size
plt.figure(figsize=(12, 8))  # Adjust size as needed (width, height)

# Compute correlation matrix for numeric columns
corr_matrix = df.select_dtypes(include='number').corr()

# Plot the heatmap with better layout
sns.heatmap(corr_matrix,
            annot=True,
            fmt=".2f",         # Limit to 2 decimal places
            cmap='coolwarm',
            square=True,       # Make cells square
            cbar_kws={"shrink": 0.75},  # Shrink color bar
            annot_kws={"size": 10})     # Smaller annotation text

plt.title('Correlation Between Numeric Features', fontsize=14)
plt.xticks(rotation=45, ha='right')   # Rotate x-labels for readability
plt.yticks(rotation=0)
plt.tight_layout()  # Prevent clipping of labels
plt.show()



##### 1. Why did you pick the specific chart?

Answer 1. A correlation heatmap was used to explore relationships between numeric features — best for identifying multicollinearity or linear dependencies.


##### 2. What is/are the insight(s) found from the chart?

Answer 2. It reveals how features like `HOUR`, `MINUTE`, and the location coordinates (`Latitude`/`Longitude`) relate to one another. Coefficients close to +1 or -1 indicate a strong linear relationship or redundancy between two features.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer 3. These insights support better feature engineering. Highly correlated variables may introduce noise or overfitting in models if not handled properly.
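
To turn the heatmap into an actionable list, highly correlated pairs can be extracted programmatically. The helper `high_corr_pairs`, the 0.9 threshold, and the toy frame below are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, corr) for numeric column pairs with |corr| >= threshold."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep the upper triangle only so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, float(v)) for (a, b), v in pairs.items()]


# Toy frame: 'b' is an exact linear function of 'a'; 'c' is only weakly related
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10], "c": [5, 1, 4, 2, 3]})
result = high_corr_pairs(toy)
print(result)
```

On the crime data, the same call could be made with `df` in place of `toy`. Any pair it returns is a candidate for dropping one of the two columns.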


#### Chart - 14 - Yearly Crime count comparison

In [None]:

df.groupby('YEAR').size().plot(kind='bar', title='Total Crimes Per Year')


##### 1. Why did you pick the specific chart?

Answer 1. A bar chart was used to visualize annual crime totals — ideal for identifying year-over-year trends in activity.


##### 2. What is/are the insight(s) found from the chart?

Answer 2. It reveals whether crime is increasing, decreasing, or fluctuating across years. Peaks or drops may relate to events, policies, or external factors.


#### Chart - 15 - WeekDay vs WeekEnd

In [None]:

sns.countplot(x='IsWeekend', data=df)
plt.xticks([0, 1], ['Weekday', 'Weekend'])
plt.title('Crime Distribution: Weekday vs Weekend')


##### 1. Why did you pick the specific chart?

Answer 1. A count plot was chosen to compare crime frequency on weekdays vs weekends — simple and effective for binary categorical data.


##### 2. What is/are the insight(s) found from the chart?

Answer 2. The plot shows whether crimes are more likely to occur on weekends or weekdays, highlighting behavioral or social patterns.


#### Chart-16. Histogram of Latitude and Longitude

In [None]:
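# Chart - 16 visualization code
# Latitude and Longitude usually sit in very different value ranges,
# so each gets its own axis instead of sharing one histogram
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
df['Latitude'].plot.hist(bins=50, ax=axes[0], title='Latitude Distribution')
df['Longitude'].plot.hist(bins=50, ax=axes[1], title='Longitude Distribution')
plt.tight_layout()
plt.show()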


#### Chart-17. Top crime types by neighbourhood

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Get top 5 neighbourhoods with most crimes
top_neigh = df['NEIGHBOURHOOD'].value_counts().head(5).index

# Step 2: Filter and plot
plt.figure(figsize=(14, 7))  # Wider and taller for better visibility

sns.countplot(
    data=df[df['NEIGHBOURHOOD'].isin(top_neigh)],
    x='NEIGHBOURHOOD',
    hue='TYPE',
    palette='Set2'
)

plt.title('Top 5 Neighbourhoods by Crime Type', fontsize=16)
plt.xlabel('Neighbourhood', fontsize=12)
plt.ylabel('Crime Count', fontsize=12)
plt.xticks(rotation=45)
plt.legend(title='Crime Type', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.grid(axis='y')
plt.show()



#### Chart-18. Swarm Plot:Type vs Hour

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Reset index to avoid duplicate index issue
df_plot = df.reset_index()

# Step 2: Optional - remove missing or weird values
df_plot = df_plot.dropna(subset=['TYPE', 'HOUR'])  # make sure these columns are valid

# Step 3: Ensure HOUR is numeric and integer
df_plot['HOUR'] = pd.to_numeric(df_plot['HOUR'], errors='coerce').astype(int)

# Step 4: Take a random sample (Swarmplot is slow on large datasets)
df_sample = df_plot.sample(1000, random_state=42)

# Step 5: Plot
plt.figure(figsize=(14, 6))
sns.swarmplot(x='TYPE', y='HOUR', data=df_sample)
plt.xticks(rotation=90)
plt.title('Crime Type vs Hour of Day')
plt.xlabel('Crime Type')
plt.ylabel('Hour (0–23)')
plt.grid(axis='y')
plt.tight_layout()
plt.show()



#### Chart-19. Lag Plot: Crime Series Dependence

In [None]:
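from pandas.plotting import lag_plot
# 'MS' (month start) bins by calendar month without relying on the deprecated 'M' alias
lag_plot(df.resample('MS').size())
plt.title('Lag Plot of Monthly Crime Counts')
plt.show()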


#### Chart-20. Autocorrelation plot

In [None]:
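from pandas.plotting import autocorrelation_plot
# 'MS' (month start) bins by calendar month without relying on the deprecated 'M' alias
autocorrelation_plot(df.resample('MS').size())
plt.title('Autocorrelation of Monthly Crime')
plt.show()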
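
The lag and autocorrelation plots can be complemented with a few explicit numbers from `Series.autocorr`. The series below is synthetic with an exact 12-month cycle, so the values only illustrate what a strongly seasonal series looks like.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with a 12-month cycle (illustrative only)
idx = pd.date_range("2012-01-01", periods=60, freq="MS")
month_num = np.arange(60)
monthly = pd.Series(500 + 100 * np.sin(2 * np.pi * month_num / 12), index=idx)

# Pearson correlation between the series and a copy shifted by `lag` months
acf = {lag: monthly.autocorr(lag=lag) for lag in (1, 6, 12)}
for lag, value in acf.items():
    print(f"lag {lag:>2}: {value:+.3f}")
```

A high value at lag 12 alongside a negative value at lag 6 is the typical signature of yearly seasonality, which would support trying a seasonal model such as SARIMA.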



## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer: To help the client achieve their business objective of improving public safety and operational efficiency, it is recommended to implement a data-driven strategy using the insights gained from exploratory data analysis and forecasting. By leveraging time series models to predict monthly crime trends by type, the client can proactively plan police deployment and allocate resources more effectively. Identifying peak crime periods (specific hours, days, or months) as well as high-risk neighborhoods allows for targeted interventions and smarter patrol scheduling. Additionally, integrating these predictions into a real-time monitoring system can enable timely alerts and preventive action. These insights can also guide public awareness campaigns, informing citizens about safety during high-risk periods. By continuously updating the system with new crime data, the model remains adaptive and relevant, ensuring long-term impact and alignment with changing crime patterns. This approach ultimately enhances safety, optimizes resource usage, and builds community trust.
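
As a starting point for the forecasting recommended above, a seasonal-naive baseline is a common reference that any more complex model should beat. The sketch below is an illustrative assumption rather than the method used in this notebook. `seasonal_naive_forecast` is a hypothetical helper and the history is toy data.

```python
import numpy as np
import pandas as pd


def seasonal_naive_forecast(history, horizon, season_length=12):
    """Forecast each future month as the value observed one season earlier."""
    if len(history) < season_length:
        raise ValueError("Need at least one full season of history")
    last_season = history.iloc[-season_length:].to_numpy()
    # Repeat the last season as many times as needed to cover the horizon
    reps = int(np.ceil(horizon / season_length))
    values = np.tile(last_season, reps)[:horizon]
    future_index = pd.date_range(
        history.index[-1] + pd.offsets.MonthBegin(1), periods=horizon, freq="MS"
    )
    return pd.Series(values, index=future_index)


# Toy history: two years where each month's count equals 100 + month number
idx = pd.date_range("2010-01-01", periods=24, freq="MS")
history = pd.Series(100 + idx.month.to_numpy(), index=idx, dtype=float)
forecast = seasonal_naive_forecast(history, horizon=14)
print(forecast)
```

In practice this would be applied per crime type to the monthly aggregated counts, and its MAE, RMSE, and MAPE would serve as the bar for ARIMA, tree-based, or LSTM models.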

# **Conclusion**

In conclusion, this analysis has provided valuable insights into crime patterns based on time, type, and location. By using data visualization and predictive modeling, we can identify high-risk periods and areas, helping law enforcement agencies plan more effectively. With the support of this data-driven approach, the client can make informed decisions, improve public safety, and allocate resources efficiently. Continuous monitoring and updating of the model will ensure it remains accurate and relevant over time, contributing to long-term safety and strategic planning.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***