# **Project Name**    - FBI Time Series Forecasting




##### **Project Type**    - EDA and Supervised Regression
##### **Contribution**    - Individual- Gyanvir Singh


# **Project Summary -**

We built a machine-learning pipeline to forecast monthly crime volumes by type, helping law enforcement allocate resources more effectively. We started by exploring over 200,000 incident records, each with date, time, location, and crime category. Through univariate, bivariate, and multivariate charts, we uncovered strong seasonal patterns (median monthly incidents are highest in October-December), daily rhythms (early-evening surges around hours 17 and 18), and spatial hotspots in the city center and along major transit corridors.

Next, we cleaned the data—treating zero coordinates as missing, filling unknown neighborhoods, and capping extreme latitude/longitude values at their 1st/99th percentiles to avoid skew. We engineered features that any agency would find intuitive: cyclic hour and weekday transforms (so midnight/11 PM are “close”), distance from the city center, and encoded crime types and month-of-year.
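The cleaning and feature steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (`Date`, `HOUR`, `X`, `Y`, `NEIGHBOURHOOD`) are assumptions, and the city center is approximated by the median of the valid coordinates because the source does not say how it was defined.

```python
import numpy as np
import pandas as pd


def clean_and_engineer(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning and feature steps described above (column names are assumed)."""
    out = df.copy()

    # Zero coordinates are placeholders, so treat them as missing.
    out[["X", "Y"]] = out[["X", "Y"]].replace(0, np.nan)

    # Fill unknown neighbourhoods with an explicit label.
    out["NEIGHBOURHOOD"] = out["NEIGHBOURHOOD"].fillna("Unknown")

    # Cap extreme coordinates at their 1st/99th percentiles.
    for col in ["X", "Y"]:
        lower, upper = out[col].quantile([0.01, 0.99])
        out[col] = out[col].clip(lower=lower, upper=upper)

    # Cyclic encodings: hour 23 and hour 0 end up next to each other on the circle.
    out["hour_sin"] = np.sin(2 * np.pi * out["HOUR"] / 24)
    out["hour_cos"] = np.cos(2 * np.pi * out["HOUR"] / 24)
    weekday = out["Date"].dt.dayofweek  # Monday=0 ... Sunday=6
    out["dow_sin"] = np.sin(2 * np.pi * weekday / 7)
    out["dow_cos"] = np.cos(2 * np.pi * weekday / 7)

    # Distance from the city center (assumption: center = median valid coordinate).
    center_x, center_y = out["X"].median(), out["Y"].median()
    out["dist_center"] = np.hypot(out["X"] - center_x, out["Y"] - center_y)
    return out
```

With this encoding, the Euclidean distance between hour 23 and hour 0 in (`hour_sin`, `hour_cos`) space is about 0.26, while hour 0 and hour 12 are 2.0 apart, which is the "closeness" the cyclic transform is meant to preserve.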

We then aggregated incidents into monthly counts for each crime type, the format required for deployment, and split the data into train and validation sets. We trained two tree-based models: Random Forest and XGBoost. Each underwent cross-validation and hyperparameter tuning (grid search for Random Forest, randomized search for XGBoost). On validation, Random Forest achieved a mean absolute error (MAE) of about 14.9 incidents per month and a mean absolute percentage error (MAPE) of 5.13%, outperforming XGBoost (MAE 17.29, MAPE 6.15%) by a small but meaningful margin.
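As a rough sketch of the aggregation and split described above: the helper names, the `Date` and `TYPE` columns, and the chronological cutoff are all assumptions for illustration. The source does not state how the train/validation split was made.

```python
import pandas as pd


def monthly_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count incidents per crime type per calendar month."""
    month_start = df["Date"].dt.to_period("M").dt.to_timestamp()
    return (
        df.assign(month_start=month_start)
        .groupby(["TYPE", "month_start"])
        .size()
        .rename("incident_count")
        .reset_index()
    )


def time_split(monthly: pd.DataFrame, cutoff: str):
    """Hypothetical chronological split: months before `cutoff` train, the rest validate."""
    cutoff_ts = pd.Timestamp(cutoff)
    train = monthly[monthly["month_start"] < cutoff_ts]
    valid = monthly[monthly["month_start"] >= cutoff_ts]
    return train, valid
```

Note that a type with no incidents in a given month simply has no row in this output; if zero-count months matter for the forecast, they would need to be filled in explicitly.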

To explain our “black-box” model, we plotted feature importances: crime type (encoded) was the top driver, followed by seasonality features, with year (trend) ranking third. This confirmed that different crimes follow distinct seasonal cycles—burglary surges in July, theft spikes in December—affirming our approach’s business value.
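For readers who want to reproduce this kind of ranking, here is a minimal sketch using scikit-learn's `feature_importances_` attribute on a fitted tree ensemble. The helper name `ranked_importances` is hypothetical, and the feature names are placeholders rather than the notebook's actual columns.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def ranked_importances(model, feature_names) -> pd.Series:
    """Return impurity-based importances of a fitted tree ensemble, largest first."""
    return pd.Series(model.feature_importances_, index=feature_names).sort_values(
        ascending=False
    )


def fit_and_rank(X, y, feature_names, random_state: int = 42) -> pd.Series:
    """Fit a small Random Forest and rank its features (illustrative settings only)."""
    model = RandomForestRegressor(n_estimators=50, random_state=random_state)
    model.fit(X, y)
    return ranked_importances(model, feature_names)
```

Impurity-based importances sum to 1 and tend to favour features with many distinct values, so they are best read as a rough ordering rather than a precise attribution.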

Finally, we packaged the pipeline to retrain on all historical data and produce next year's forecasts in a single executable notebook, ready for deployment. The output file `final_submission.csv` matches the given test schema, so integrating it into an operations dashboard is straightforward.

**Key impacts:**

- Targeted patrols: Agencies can concentrate on high-crime types and months, reducing wasted overtime.
- Seasonal staffing: With clear seasonal patterns, budgets can flex up or down.
- Data-driven planning: Spatial insights enable camera and lighting investments where they matter most.

# **GitHub Link -**

https://github.com/Gyanvir/FBI_Time_Series_Forecasting

# **Problem Statement**


Predict monthly incident counts for each crime type using historical date-time and location data, so police departments can optimize resource allocation and preemptively target high-risk periods and areas.



# **Chart Descriptions**


**Chart 1: Crime Type Distribution**

*Why this chart?*
To see which crime types dominate the dataset and ensure our model has enough examples per class.

*Insights:*
Theft from Vehicle accounts for the majority of incidents.

*Business Impact:*

*Positive*: We know where our model will be strong (common crimes).

*Negative*: Rare crime types may have higher forecast error—agencies should interpret those with caution.

 **Chart 2: Overall Total Crime Trend**

*Why this chart?*
To understand seasonality, trends, and any abrupt changes in overall crime volume.

*Insights:*

A clear peak in the earlier years, 1999-2001.

A dip around 2005-2009.

*Business Impact:*

*Positive:* Agencies can plan surge patrols.

*Negative*: Resource pull-back could miss out on anomalies—monitor closely.


 

 **Chart 3: Overall Temporal Crime Trends by Type**

*Why this chart?*
To see if different crime types exhibit distinct seasonal patterns or trends.

*Insights:*

Theft from Vehicle peaks higher than any other crime type over the period.

*Business Impact:*

*Positive:* Tailored resource planning per crime type.

*Negative*: Ignoring type-specific patterns could lead to poor allocation.


 

 **Chart 4: Hourly Crime Pattern**

*Why this chart?*
Time-of-day patterns inform shift scheduling and peak-hour interventions.

*Insights:*

Most crimes occur in the evening, particularly during hours 17 and 18 of the day.

*Business Impact:*

*Positive:* Schedule more officers between 4 PM and 8 PM.

*Negative*: Understaffing at low-crime hours may leave gaps if anomalies occur.


 

 **Chart 5: Distribution of Crime by Day of Month**

*Why this chart?*
Understand whether certain calendar days see systematically more or fewer crimes.

*Insights:*

Incidents are higher during the first 15 days of the month and dip afterwards.

*Business Impact:*

*Positive:*  Schedule extra patrols around month‐start when incidents rise.

*Negative*: Ignoring mid‐month lulls could leave gaps if anomalous spikes occur.


 

 **Chart 6: Crime by Day of Week**

*Why this chart?*
Day‐of‐week patterns inform weekly shift scheduling.

*Insights:*

Weekends (Friday-Sunday) see more crimes than mid‐week.

*Business Impact:*

*Positive:* Boost weekend patrols.

*Negative*: Under‐staffing on other weekdays risks slower response.

 

 **Chart 7: Top 10 Neighbourhoods by Total Incidents**

*Why this chart?*
Identify geographic hotspots for targeted interventions.

*Insights:*

Central Business District and West End account for the majority of incidents.

*Business Impact:*

*Positive:* Allocate resources (patrol units, community policing) to these hotspots.

*Negative*: Neglecting them would allow crime to concentrate further.


 

 **Chart 8:  Top 10 Street Blocks (HUNDRED_BLOCK)**

*Why this chart?*
Pinpoint precise areas (blocks) needing extra surveillance.

*Insights:*

Certain blocks on Granville Street appear repeatedly.

*Business Impact:*

*Positive:* Install cameras and lighting on these blocks.

*Negative*: Ignoring block‐level patterns dilutes impact of city‐wide policies.


 

 **Chart 9: Spatial Density of Incidents (X vs. Y Coordinates)**

*Why this chart?*
To visualize the geographic clustering of crime using the raw X and Y coordinates, filtering out invalid zeros and sampling for performance.

*Insights:*

Clearly visible high-density “hot cells” correspond to the city center and major transit corridors.

*Business Impact:*

*Positive:* Enables precinct commanders to pinpoint exact micro-zones for increased patrols.

*Negative*: Over-focus on these hotspots could lead to under-patrolling emerging areas that are currently low-density.
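One way to turn raw coordinates into "hot cells" is a 2D histogram. The sketch below uses NumPy's `histogram2d`; the function names and the bin count are assumptions for illustration, and the chart itself may have been produced differently.

```python
import numpy as np


def density_grid(x, y, bins: int = 50):
    """Bin valid coordinates into a grid of incident counts."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Drop invalid points: zeros are placeholders, NaNs are missing values.
    valid = (x != 0) & (y != 0) & ~np.isnan(x) & ~np.isnan(y)
    counts, x_edges, y_edges = np.histogram2d(x[valid], y[valid], bins=bins)
    return counts, x_edges, y_edges


def hottest_cell(counts, x_edges, y_edges):
    """Return the x-range, y-range, and count of the densest grid cell."""
    i, j = np.unravel_index(np.argmax(counts), counts.shape)
    return (x_edges[i], x_edges[i + 1]), (y_edges[j], y_edges[j + 1]), int(counts[i, j])
```

The first axis of `counts` follows `x` and the second follows `y`, which is why the cell lookup indexes `x_edges` with `i` and `y_edges` with `j`.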

 

 **Chart 10: Median Monthly Crime Incidents by Month (Across Years)**

*Why this chart?*
It provides a robust seasonal profile without the complexity of full distribution plots.

*Insights:*

The final quarter of the year (October-December) has the highest median incidents, indicating a strong annual peak.

*Business Impact:*

*Positive:* Facilitates high-level seasonal staffing plans.

*Negative*: Relying solely on medians may mask year-specific surges.
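The seasonal profile behind this chart can be sketched in a couple of lines, assuming the monthly totals are held in a pandas Series indexed by month-start dates (an assumption about the data layout, not something the source specifies).

```python
import pandas as pd


def median_by_calendar_month(monthly_totals: pd.Series) -> pd.Series:
    """Median of monthly totals for each calendar month (1-12), pooled across years."""
    return monthly_totals.groupby(monthly_totals.index.month).median()
```

Calling `median_by_calendar_month(series).idxmax()` then gives the calendar month with the highest median, which is the kind of peak the insight above refers to.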


 

 **Chart 11: Boxplot of Monthly Crime by Type**

*Why this chart?*
Compare variability and scale of monthly counts across types.

*Insights:*

Theft from Vehicle has the widest spread.

*Business Impact:*

*Positive:* High‐variance types may need more conservative forecasting intervals.

*Negative*: Uniform resource rules per type will misallocate capacity.


 

 **Chart 12: Heatmap of Avg Incidents (Type vs. Month)**

*Why this chart?*
Quickly spot which months are highest for each crime type.

*Insights:*

Mischief and Theft from Vehicle were highest around October.

*Business Impact:*

*Positive:* Type‐specific seasonal resource allocation.

*Negative*: Non‐seasonal allocation wastes resources off‐peak.

 

 **Chart 13: Seasonal Decomposition of Total Crimes**

*Why this chart?*
Separate trend, seasonality, and residuals for overall crime.

*Insights:*

Trend gently declining over years; strong annual seasonality; small residual spikes from 2004 to 2006.

*Business Impact:*

*Positive:* Trend insights inform long‐term staffing; seasonality drives month‐ahead rosters.

*Negative*: Ignoring residual anomalies misses one-off surges.
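The source does not say which tool produced this decomposition. As an illustration of the idea, here is a hand-rolled classical additive decomposition (centered moving-average trend, per-month seasonal means, remainder as residual). It assumes an even period such as 12 and at least two full cycles of data; the function name is hypothetical.

```python
import numpy as np
import pandas as pd


def additive_decompose(series: pd.Series, period: int = 12) -> pd.DataFrame:
    """Split a series into trend + seasonal + residual (classical additive model)."""
    values = series.to_numpy(dtype=float)
    n, half = len(values), period // 2

    # Centered moving average for an even period: half weight on the two end points.
    weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    trend = np.full(n, np.nan)
    trend[half:n - half] = np.convolve(values, weights, mode="valid")

    # Seasonal component: mean detrended value at each position in the cycle,
    # shifted so that the seasonal effects sum to zero over one period.
    detrended = values - trend
    position = np.arange(n) % period
    seasonal_means = np.array(
        [np.nanmean(detrended[position == p]) for p in range(period)]
    )
    seasonal_means -= seasonal_means.mean()
    seasonal = seasonal_means[position]

    return pd.DataFrame(
        {
            "observed": values,
            "trend": trend,
            "seasonal": seasonal,
            "residual": values - trend - seasonal,
        },
        index=series.index,
    )
```

The trend and residual are undefined (NaN) for the first and last half-period, so residual spikes near the edges of the series cannot be assessed with this method.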

 

 **Chart 14: Heatmap of Hour vs. Day of Week**

*Why this chart?*
Detect combined daily & hourly patterns across types.

*Insights:*

Evening crimes cluster on weekends; morning and afternoon crimes remain almost uniform.

*Business Impact:*

*Positive:* Dynamic shift patterns around weekend evenings.

*Negative*: Static 9-5 schedules are inadequate for late-evening spikes.
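The table behind a heatmap like this can be built with a cross-tabulation. The sketch below assumes a datetime column `Date` and an integer column `HOUR`, which are guesses at the dataset's schema; the plotting step is left out.

```python
import pandas as pd

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def hour_by_weekday_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count incidents for every (weekday, hour) pair, including empty cells."""
    table = pd.crosstab(df["Date"].dt.day_name(), df["HOUR"])
    # Reindex so all 7 days and 24 hours appear, in calendar order, with zeros where empty.
    return table.reindex(index=DAY_ORDER, columns=list(range(24)), fill_value=0)
```

The resulting 7 x 24 table can be passed directly to a heatmap function, with rows as days and columns as hours.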


 

 **Chart 15: Time Series for Top 5 Neighbourhoods**

*Why this chart?*
Track hotspot stability/trends over time for leading neighbourhoods.

*Insights:*

Some neighbourhoods show rising trends (gentrification zones), others stable high.

*Business Impact:*

*Positive:* Long-term policy (community outreach) for rising hotspots.

*Negative*: One-off patrols won’t curb steadily climbing areas.


 

# **Model Selections**


## Random Forest

**Model Explanation:**  
Random Forest builds many decision trees on bootstrapped samples and averages their predictions, reducing variance.

**Performance (Validation Scores):**  
- MAE: 14.9 incidents  
- RMSE: 20.34 incidents  
- MAPE: 5.13%


**Cross-Validation & Hyperparameter Tuning:**  
- Performed 5-fold GridSearchCV over `n_estimators` [50,100], `max_depth` [5,10,None], `min_samples_leaf` [1,3,5].  
- Best params: `n_estimators=100, max_depth=10, min_samples_leaf=1`.  
- CV RMSE improved from 71 to 20.
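A grid search over the parameter values listed above might look like the following. This is a sketch, not the notebook's code: the function name, the RMSE scoring choice, and the `random_state` are assumptions.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Parameter values taken from the list above: 2 * 3 * 3 = 18 combinations.
PARAM_GRID = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 3, 5],
}


def tune_random_forest(X, y, cv=5, random_state: int = 42) -> GridSearchCV:
    """Exhaustively evaluate every combination in PARAM_GRID with cross-validation."""
    search = GridSearchCV(
        RandomForestRegressor(random_state=random_state),
        PARAM_GRID,
        cv=cv,
        scoring="neg_root_mean_squared_error",  # scikit-learn maximises, so RMSE is negated
    )
    search.fit(X, y)
    return search
```

After fitting, `search.best_params_` holds the winning combination and `-search.best_score_` is its mean cross-validated RMSE. The `cv` argument also accepts a splitter object such as `TimeSeriesSplit` if chronological folds are preferred.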

**Business Interpretation of Metrics:**  
- MAE=15 means on average we miss by 15 incidents—acceptable given monthly volumes of ~600.  
- MAPE=5.13% means forecasts deviate from actual counts by about 5% on average, which is enough precision for resource planning with a 10% contingency.
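For reference, the three metrics quoted above can be computed as follows. This is a plain NumPy sketch with a hypothetical function name; the example numbers are made up and are not the project's data.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Compute MAE, RMSE, and MAPE (in percent) for a set of forecasts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    error = y_pred - y_true
    return {
        "mae": float(np.mean(np.abs(error))),
        "rmse": float(np.sqrt(np.mean(error ** 2))),
        # MAPE is undefined when an actual value is zero.
        "mape": float(np.mean(np.abs(error / y_true)) * 100),
    }


# Illustrative values only: two months with actual counts 100 and 200.
example = regression_metrics([100, 200], [110, 180])
```

In this toy example the MAE is 15 and the MAPE is 10%, while the RMSE (about 15.8) is slightly larger than the MAE because it weights the bigger miss more heavily.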


## XGBoost

**Model Explanation:**  
XGBoost iteratively builds trees that correct errors of previous ones, often yielding stronger performance on structured data.

**Performance (Validation Scores):**  
- MAE: 17.29 incidents  
- RMSE: 23.63 incidents  
- MAPE: 6.15%

**Cross-Validation & Hyperparameter Tuning:**  
- Ran a 20-iteration RandomizedSearchCV over `n_estimators` [50,100,200], `learning_rate` [0.01,0.1,0.2], `max_depth` [3,5,7], `subsample` [0.6,0.8,1.0].  
- Best params: `learning_rate=0.1, max_depth=5, n_estimators=100, subsample=1.0`.  
- CV RMSE improved from 78 to 23 compared to default settings.
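To give a sense of how much of the search space 20 iterations cover, the sketch below enumerates the grid and draws 20 random candidates using scikit-learn's `ParameterSampler`, the same utility `RandomizedSearchCV` uses to pick candidates. It does not train XGBoost; the function name and `random_state` are assumptions.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Parameter values taken from the list above: 3 * 3 * 3 * 3 = 81 combinations.
PARAM_SPACE = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 5, 7],
    "subsample": [0.6, 0.8, 1.0],
}


def sample_candidates(n_iter: int = 20, random_state: int = 42) -> list:
    """Draw n_iter distinct parameter combinations from PARAM_SPACE."""
    return list(ParameterSampler(PARAM_SPACE, n_iter=n_iter, random_state=random_state))


total_combinations = len(ParameterGrid(PARAM_SPACE))
candidates = sample_candidates()
```

Because every parameter is given as a list, the sampler draws without replacement, so the 20 candidates are distinct and cover roughly a quarter of the 81 possible combinations.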

**Business Interpretation of Metrics:**  
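- MAE of 17.29 is about 16% higher than Random Forest's 14.9, and RMSE of 23.63 is about 16% higher than 20.34. Across a year of monthly forecasts, that gap could translate into noticeable over- or under-staffing.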

## **Model Chosen:** Random Forest Regressor

**Why:** On our validation set, Random Forest delivered a lower Mean Absolute Error (≈14.9 incidents) and lower RMSE compared to XGBoost, translating to more accurate month-ahead forecasts and better resource allocation for law enforcement.
