# **Project Name**    - FBI Time Series Forecasting



##### **Project Type**    - Time Series Forecasting
##### **Contribution**    - Individual

# **Project Summary -**


FBI crime data analysis involves studying trends and patterns in reported crime statistics collected by the Federal Bureau of Investigation (FBI).

*   **Crime Trends:** Identifying how crime rates are changing over time, both nationally and regionally. This can help understand whether certain types of crime are increasing or decreasing.
*   **Geographical Distribution:** Examining where crime is more prevalent and understanding potential contributing factors in those areas.
*   **Seasonal Patterns:** Recognizing if certain crimes are more common during specific times of the year.
*   **Relationship with Socioeconomic Factors:** Investigating potential correlations between crime rates and factors like poverty, unemployment, education levels, and population density.
*   **Effectiveness of Crime Prevention Strategies:** While correlation doesn't equal causation, analyzing data before and after the implementation of specific programs can provide some insights into their potential impact.
*   **Resource Allocation:** Helping law enforcement agencies and policymakers make informed decisions about where to allocate resources to combat crime effectively.
*   **Forecasting:** Using historical data to predict future crime trends, which can aid in proactive policing and resource planning.

This project aims to perform `time series forecasting` on FBI crime data to predict future crime trends. This can help law enforcement agencies, policymakers, and researchers anticipate future challenges and allocate resources effectively. The project will likely involve:

1.  **Data Acquisition and Cleaning:** Obtaining relevant FBI crime data and preparing it for analysis (handling missing values, inconsistencies, etc.).
2.  **Exploratory Data Analysis (EDA):** Analyzing the data to identify trends, seasonality, and other patterns.
3.  **Time Series Model Selection:** Choosing appropriate time series forecasting models (e.g., ARIMA, Exponential Smoothing, Prophet, or machine learning models).
4.  **Model Training and Evaluation:** Training the selected models on historical data and evaluating their performance using appropriate metrics.
5.  **Forecasting:** Using the trained model to predict future crime rates.
6.  **Interpretation and Reporting:** Interpreting the forecasting results and presenting the findings in a clear and actionable manner.

# **GitHub Link -**

# **Problem Statement**


- Analyze the given FBI Crime Dataset (Training and Testing) and perform Time Series Forecasting on it.
- Perform Data Visualization, Data Wrangling, Feature Engineering, Data Splitting, Model Fitting, and Hyperparameter Tuning.
- Make predictions on unseen data: predict `Incident_Counts` based on `YEAR`, `MONTH`, and `TYPE` (the type of crime committed, e.g. Theft).

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
!pip install datasets
!pip install category_encoders
!pip install scikit-learn
!pip install statsmodels

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import datetime
import warnings
import itertools
import category_encoders as ce
from datasets import load_dataset
from scipy import stats
from scipy.stats import pearsonr
from scipy.stats import chi2
from scipy.stats.mstats import winsorize
from scipy.linalg import inv
from scipy.spatial.distance import mahalanobis
from pandas.plotting import scatter_matrix
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.formula.api import rlm
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import RFE
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

warnings.filterwarnings('ignore')


### Dataset Loading

In [None]:
# Load Dataset
# Load the training data
try:
    df_train = pd.read_csv("https://huggingface.co/datasets/Abdullah4747/FBI_Train/resolve/main/Train.csv")
    print("Training data loaded successfully.")
except FileNotFoundError:
    print("Error: 'Train.csv' not found.")
    df_train = pd.DataFrame()  # Initialize an empty DataFrame to avoid further errors

except Exception as e:
    print(f"An error occurred while loading 'Train.csv': {e}")
    df_train = pd.DataFrame()

# Load the test data
try:
    df_test = pd.read_csv("https://huggingface.co/datasets/Abdullah4747/FBI_Train/resolve/main/Test%20(2).csv")
    print("Test data loaded successfully.")
except FileNotFoundError:
    print("Error: 'Test (2).csv' not found.")
    df_test = pd.DataFrame()
except Exception as e:
    print(f"An error occurred while loading 'Test (2).csv': {e}")
    df_test = pd.DataFrame()


### Dataset First View

In [None]:
# Dataset First Look
# Display and verify the loaded data
if not df_train.empty:
    display(df_train.head())

if not df_test.empty:
    display(df_test.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
if not df_train.empty:
    print(f"Shape of df_train: {df_train.shape}")
if not df_test.empty:
    print(f"Shape of df_test: {df_test.shape}")

### Dataset Information

In [None]:
# Dataset Info
if not df_train.empty:
    print("Information about df_train:")
    df_train.info()
if not df_test.empty:
    print("\nInformation about df_test:")
    df_test.info()

#### Duplicate Values and Unique Values

In [None]:
# Dataset Duplicate Value Count
if not df_train.empty:
    print(f"Number of duplicate values in df_train: {df_train.duplicated().sum()}")
    print(f"Number of unique values in df_train: {df_train.nunique()}")
# Display duplicate rows if any
if df_train.duplicated().sum() > 0:
    print("\nDuplicate rows in df_train:")
    display(df_train[df_train.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
if not df_train.empty:
    print("Missing values in df_train:")
    print(df_train.isnull().sum())

In [None]:
# Visualizing the missing values
if not df_train.empty:
    plt.figure(figsize=(10, 6))
    sns.heatmap(df_train.isnull(), cbar=False, cmap='viridis').set_title("Missing Values Heatmap")

### What did you know about your dataset?

## General Information

### Training Data (`df_train`)
- **Size**: 474,565 incidents × 13 features
- **Time Span**: 13 years (1999-2011)
- **Key Features**:
  - Crime categories (`TYPE`: 9 unique)
  - Geographic details (`NEIGHBOURHOOD`: 24 areas)
  - Daily timestamps (`Date`: 4,748 unique dates)
  - Spatial coordinates (fully populated)
- **Data Types**:
  - Float64: 6 columns (`X`, `Y`, `Latitude`, `Longitude`, `HOUR`, `MINUTE`)
  - Integer: 3 date components
  - Object: 4 categorical features
- **Memory Usage**: 47.1+ MB

### Test Data (`df_test`)
- **Size**: 162 forecast points × 4 columns
- **Forecast Task**:
  - Predict **monthly crime counts** for 9 crime types
  - Horizon: **18 months** (Jan 2012 - Jun 2013)
- **Structure**:
  - Features: `YEAR`, `MONTH`, `TYPE`
  - Target: `Incident_Counts` (to be predicted)

---

## Duplicate Values
1. Total **44,618** duplicate rows (9.4% of data)
2. **Implications**:
   - Indicates possible data entry errors or genuine recurring incidents
   - May skew forecast if not addressed
3. **Action Required**: Remove duplicates before analysis
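
Whether these rows are entry errors or genuinely separate incidents is worth checking before they are dropped. A minimal sketch on an invented toy frame (the three columns are a hypothetical subset of the real schema, and the values are made up for illustration):

```python
import pandas as pd

# Toy incident records (invented values); rows 0 and 1 are exact duplicates.
toy = pd.DataFrame({
    "TYPE": ["Mischief", "Mischief", "Other Theft", "Mischief"],
    "Date": ["01/01/1999", "01/01/1999", "01/01/1999", "02/01/1999"],
    "HOUR": [14.0, 14.0, 9.0, 14.0],
})

# keep=False marks every member of a duplicate group, so whole groups can be inspected.
dup_groups = toy[toy.duplicated(keep=False)]

# With the default keep='first', duplicated() flags only the extra copies.
n_extra = int(toy.duplicated().sum())
share = n_extra / len(toy)

# drop_duplicates() keeps the first row of each group.
deduped = toy.drop_duplicates()
print(f"{n_extra} extra copy(ies), {share:.1%} of rows; {len(deduped)} rows remain")
```

If the duplicate groups turn out to be separate incidents recorded at the same time and place, dropping them would undercount the monthly totals, so the choice deserves a short justification.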

---

## Unique Values
| Column          | Unique Values | Significance                          |
|-----------------|---------------|---------------------------------------|
| `TYPE`          | 9             | Valid crime categories                |
| `NEIGHBOURHOOD` | 24            | Expected neighborhood divisions       |
| `Date`          | 4,748         | Confirms ≈13 years of daily data      |
| `HOUR`/`MINUTE` | 24/60         | Proper time granularity               |

**Key Takeaway**: Unique counts validate data integrity with no unexpected categories/ranges

---

## Missing Values
1. **Significant Missingness** (>10%):
   - `NEIGHBOURHOOD`: 51,491 missing (10.8%)
   - `HOUR`/`MINUTE`: 49,365 missing each (10.4%)
2. **Minor Missingness**:
   - `HUNDRED_BLOCK`: 13 missing (0.003%)
3. **Key Observations**:
   - Missing `HOUR`/`MINUTE` share same rows (time unrecorded)
   - Location/time gaps limit granular analysis
   - **Critical forecasting columns** (`YEAR`, `MONTH`, `TYPE`, `Date`) fully intact
4. **Forecasting Impact**:
   - No effect on monthly aggregated forecasts
   - Neighborhood-level analysis would require imputation
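
The percentages above, and the observation that `HOUR` and `MINUTE` are missing on the same rows, can be checked along these lines. This is only a sketch on invented toy data, not output from the real dataset:

```python
import numpy as np
import pandas as pd

# Toy frame (invented values) with the same kind of gaps as described above.
toy = pd.DataFrame({
    "HOUR": [10.0, np.nan, 22.0, np.nan],
    "MINUTE": [30.0, np.nan, 0.0, np.nan],
    "NEIGHBOURHOOD": ["A", None, "B", "C"],
    "TYPE": ["Mischief", "Other Theft", "Mischief", "Other Theft"],
})

# Percentage of missing values per column.
missing_pct = toy.isnull().mean().mul(100).round(1)

# True when HOUR and MINUTE are missing on exactly the same rows.
same_rows = bool((toy["HOUR"].isnull() == toy["MINUTE"].isnull()).all())

# Monthly aggregation only needs the grouping keys to be complete.
keys_complete = bool(toy[["TYPE"]].notnull().all().all())

print(missing_pct)
print("HOUR/MINUTE missing together:", same_rows)
print("Grouping keys complete:", keys_complete)
```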

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
if not df_train.empty:
    print("Columns in df_train:")
    print(df_train.columns)
if not df_test.empty:
    print("\nColumns in df_test:")
    print(df_test.columns)

In [None]:
# Dataset Describe
if not df_train.empty:
    print("Summary statistics for df_train:")
    display(df_train.describe())
if not df_test.empty:
    print("\nSummary statistics for df_test:")
    display(df_test.describe())

### Variables Description

## Summary Statistics

### Training Data (`df_train`)

| Statistic | X             | Y               | Latitude    | Longitude   | HOUR     | MINUTE    | YEAR     | MONTH    | DAY      |
|-----------|---------------|-----------------|-------------|-------------|----------|-----------|----------|----------|----------|
| Count     | 474,565       | 474,565         | 474,565     | 474,565     | 425,200  | 425,200   | 474,565  | 474,565  | 474,565  |
| Mean      | 441,028.02    | 4,889,023.00    | 44.14       | -110.30     | 13.72    | 16.74     | 2004.36  | 6.56     | 15.44    |
| Std Dev   | 150,295.32    | 1,665,850.00    | 15.04       | 37.58       | 6.79     | 18.35     | 3.85     | 3.41     | 8.76     |
| Min       | **0.00**      | **0.00**        | **0.00**    | -124.55     | 0.00     | 0.00      | 1999     | 1        | 1        |
| 25%       | 489,916.53    | 5,453,572.00    | 49.23       | -123.13     | 9.00     | 0.00      | 2001     | 4        | 8        |
| 50%       | 491,477.85    | 5,456,820.00    | 49.26       | -123.11     | 15.00    | 10.00     | 2004     | 7        | 15       |
| 75%       | 493,610.19    | 5,458,622.00    | 49.28       | -123.07     | 19.00    | 30.00     | 2008     | 9        | 23       |
| Max       | 511,303.00    | 5,512,579.00    | 49.76       | **0.00**    | 23.00    | 59.00     | 2011     | 12       | 31       |

**Key Observations:**
1. ⚠️ **Spatial Data Issues**:
   - `Latitude` minimum (0.00) and `Longitude` maximum (0.00) are invalid for crime locations
   - Extreme minimums in `X` and `Y` (0.00) suggest measurement errors
2. ⏰ **Temporal Patterns**:
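   - The median incident hour is 15 (3 PM); this is a measure of centre, not necessarily the peak hour
   - Incidents are spread across the whole month (median DAY=15)
   - Incidents are spread across the whole year (median MONTH=7, mean 6.56); a median alone does not establish a summer peak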
3. 📅 **Date Range**: 1999-2011 (13 years)

---

### Test Data (`df_test`)

| Statistic | YEAR    | MONTH   | Incident_Counts |
|-----------|---------|---------|-----------------|
| Count     | 162     | 162     | 0               |
| Mean      | 2012.33 | 5.50    | NaN             |
| Std Dev   | 0.47    | 3.31    | NaN             |
| Min       | 2012    | 1       | NaN             |
| 25%       | 2012    | 3       | NaN             |
| 50%       | 2012    | 5       | NaN             |
| 75%       | 2013    | 8       | NaN             |
| Max       | 2013    | 12      | NaN             |

**Forecast Task Specifications:**
- **Prediction Period**: January 2012 - June 2013 (18 months)
- **Monthly Distribution**:
  - January to June appear twice (2012 and 2013) and July to December once, hence median = 5 and mean = 5.5
  - One full calendar cycle (2012) plus a half-year (2013) is covered
- **Target**: `Incident_Counts` to be predicted for all 162 records

---

### Critical Data Issues

| Issue Type               | Location         | Impact  | Recommended Action                     |
|--------------------------|------------------|---------|----------------------------------------|
| Invalid Coordinates      | Lat=0, Lon=0     | High    | Remove or impute spatial outliers      |
| Extreme Values           | X/Y min=0        | Medium  | Investigate measurement errors         |
| Train-Test Gap           | 2011 → 2012      | Critical| Validate on 2009-2011 data             |
| Missing Temporal Data    | HOUR/MINUTE 10.4%| Low     | Ignore for monthly aggregation         |
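
The recommended validation on 2009-2011 amounts to a chronological holdout. A minimal sketch, assuming a monthly frame with a `date` column (the toy counts are invented and the `cutoff` name is illustrative):

```python
import pandas as pd

# Toy monthly series for one crime type, Jan 1999 to Dec 2011 (invented counts).
dates = pd.date_range("1999-01-01", "2011-12-01", freq="MS")
toy = pd.DataFrame({"date": dates, "Incident_Counts": range(len(dates))})

# Chronological split: a time series should not be shuffled before splitting,
# otherwise the model is validated on months it has effectively already seen.
cutoff = pd.Timestamp("2009-01-01")
train_part = toy[toy["date"] < cutoff]    # 1999-2008
valid_part = toy[toy["date"] >= cutoff]   # 2009-2011

print(len(train_part), "training months,", len(valid_part), "validation months")
```

A 36-month validation window is longer than the 18-month test horizon, so it can also be scored on its first 18 months only to mimic the real task.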

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
def preprocess_data(train_df, test_df):
    """
    Preprocess FBI crime datasets for time series forecasting
    Handles duplicates, missing values, invalid data, and prepares for monthly aggregation
    """
    # 1. Handle duplicates
    train_df = train_df.drop_duplicates()

    # 2. Handle missing values
    # Critical columns are intact, so we'll keep missing temporal/spatial data for monthly aggregation
    # Drop HUNDRED_BLOCK with minimal missingness
    train_df = train_df.drop(columns=['HUNDRED_BLOCK'], errors='ignore')

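    # 3. Fix invalid spatial data
    # NOTE: this removes whole incidents, so it also lowers the monthly counts built in step 6
    spatial_mask = ((train_df['Latitude'] == 0) | (train_df['Longitude'] == 0) |
                   (train_df['X'] == 0) | (train_df['Y'] == 0))
    # .copy() so the column assignments below modify an independent DataFrame, not a view
    train_df = train_df[~spatial_mask].copy()

    # 4. Convert to datetime and extract features
    train_df['Date'] = pd.to_datetime(train_df['Date'], format='%d/%m/%Y')
    train_df['Year'] = train_df['Date'].dt.year
    train_df['Month'] = train_df['Date'].dt.month

    # 5. Verify date ranges
    print(f"Train date range: {train_df['Date'].min().date()} to {train_df['Date'].max().date()}")
    # Sort (YEAR, MONTH) pairs together; taking min/max of each column separately
    # would pair the last year with month 12 even if that year ends earlier
    test_periods = test_df[['YEAR', 'MONTH']].drop_duplicates().sort_values(['YEAR', 'MONTH'])
    first_period, last_period = test_periods.iloc[0], test_periods.iloc[-1]
    print(f"Test date range: {int(first_period['YEAR'])}-{int(first_period['MONTH'])} "
          f"to {int(last_period['YEAR'])}-{int(last_period['MONTH'])}")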

    # 6. Aggregate to monthly crime counts
    monthly_counts = train_df.groupby(['Year', 'Month', 'TYPE']).size().reset_index(name='Incident_Counts')

    # 7. Prepare test set structure
    test_df = test_df.rename(columns={'MONTH': 'Month', 'YEAR': 'Year'})

    return monthly_counts, test_df

# Execute preprocessing
train_monthly, test_df = preprocess_data(df_train, df_test)

# Display processed data
print("\nProcessed Training Data:")
print(train_monthly.head())
print(f"\nTraining shape: {train_monthly.shape}")

print("\nTest Data:")
print(test_df.head())
print(f"\nTest shape: {test_df.shape}")


### What all manipulations have you done and insights you found?

## Data Wrangling Summary

### Key Manipulations Performed

#### 1. **Data Quality Cleaning**

- **Duplicate Removal**: Eliminated duplicate records from training data

- **Invalid Spatial Data**: Removed records with invalid coordinates (Lat=0, Lon=0, X=0, Y=0)

- **Column Cleanup**: Dropped `HUNDRED_BLOCK` column due to minimal impact and missingness

#### 2. **Temporal Processing**

- **Date Standardization**: Converted date strings (`DD/MM/YYYY`) to proper datetime format

- **Feature Extraction**: Created `Year` and `Month` columns from date field

- **Date Range Verification**: Confirmed training spans 1999-2011, test covers 2012-2013

#### 3. **Data Aggregation**

- **Monthly Summarization**: Transformed individual incident records into monthly crime counts by type

- **Grouping Strategy**: Aggregated by `[Year, Month, TYPE]` to create time series structure

- **Target Creation**: Generated `Incident_Counts` as the prediction target

#### 4. **Test Set Preparation**

- **Column Standardization**: Renamed `YEAR`→`Year`, `MONTH`→`Month` for consistency

- **Structure Alignment**: Maintained same schema as processed training data

---

### Key Insights Obtained

#### 📊 **Data Transformation Impact**

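- **Scale Reduction**: From 474,565 individual incidents → 1,248 monthly aggregates. Note that 1,248 = 156 months × 8, fewer than the 156 × 9 = 1,404 rows a complete grid of 9 crime types would give, so some type-month combinations are absent after cleaning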

- **Dimensionality**: Focused analysis on temporal patterns rather than spatial/individual incident details

- **Data Quality**: Eliminated ~10% of records due to invalid spatial coordinates

#### 🕒 **Temporal Structure**

- **Training Period**: 13 years (1999-2011) of historical crime data

- **Prediction Gap**: None; the test period starts (Jan 2012) in the month immediately after training ends (Dec 2011)

- **Forecast Horizon**: 18 months of predictions required (Jan 2012 - Jun 2013)

#### 🔍 **Crime Type Distribution**

- **Dominant Categories**:

  - Break and Enter (Commercial & Residential)

  - Theft from Vehicle (highest single category: 1,438 incidents in Jan 1999)

  - Mischief and Other Theft

- **Monthly Variability**: Each crime type shows different seasonal patterns

#### ⚠️ **Modeling Considerations**

- **Missing Targets**: All 162 test records have `NaN` incident counts (expected for prediction task)

- **Time Series Setup**: Monthly aggregation enables seasonal trend analysis

- **Multi-series Forecasting**: Need to predict multiple crime types simultaneously

This preprocessing successfully transformed raw incident data into a clean, time-series ready format suitable for monthly crime forecasting across different offense categories.
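
One thing worth verifying after aggregation is that every crime type has a row for every month, because `groupby(...).size()` emits no row for a combination with zero incidents. A sketch on invented toy data, with hypothetical type names `A` and `B`:

```python
import pandas as pd

# Toy aggregated counts (invented): type "B" has no row for month 2.
toy_monthly = pd.DataFrame({
    "Year": [1999, 1999, 1999],
    "Month": [1, 2, 1],
    "TYPE": ["A", "A", "B"],
    "Incident_Counts": [5, 7, 3],
})

# Build the full (Year, Month) x TYPE grid from the observed periods and types.
periods = toy_monthly[["Year", "Month"]].drop_duplicates()
types = toy_monthly[["TYPE"]].drop_duplicates()
full_grid = periods.merge(types, how="cross")

# Left-join the observed counts; combinations with no incidents become 0.
complete = (
    full_grid.merge(toy_monthly, on=["Year", "Month", "TYPE"], how="left")
    .fillna({"Incident_Counts": 0})
    .astype({"Incident_Counts": int})
)
print(complete)
```

On the real data, comparing `set(test_df['TYPE'])` with `set(train_monthly['TYPE'])` in the same spirit would show whether any crime type expected in the test set is missing from the aggregated training data.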

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 - Univariate Line Plot of Total Incident Counts Over Time

In [None]:
# Chart - 1 visualization code
# 1. Create a datetime index and aggregate total incidents by date
train_monthly['date'] = pd.to_datetime(train_monthly[['Year', 'Month']].assign(DAY=1))
df_agg = train_monthly.groupby('date')['Incident_Counts'].sum().reset_index()

# 2. Plot the time series
plt.figure(figsize=(10, 4))
plt.plot(df_agg['date'], df_agg['Incident_Counts'])
plt.title('Total Incident Counts (Monthly) from Jan 1999 to Dec 2011')
plt.xlabel('Date')
plt.ylabel('Incident Counts')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A line plot provides a clear view of the overall trend and seasonality in the monthly totals.

##### 2. What is/are the insight(s) found from the chart?

Trend: The line slopes generally downward, so total crime incidents fell over the training period. The decline runs from roughly 2000 to 2009 and is followed by an upward trend from 2009 to the end of the training data in 2011, which suggests changing underlying risk factors.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

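Positive: The overall downward trend is consistent with existing intervention programs (such as community policing) being effective, which supports continued investment, although the chart alone cannot establish causation.

Negative: The upturn after 2009 points to a possible reversal that warrants monitoring.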
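
The visual impression of a downward slope can be quantified with a simple least-squares line. A sketch on an invented, strictly declining series; on the real data `counts` would be taken from `df_agg['Incident_Counts']`, and a straight line is only a rough summary of a series that changes direction:

```python
import numpy as np
import pandas as pd

# Invented monthly totals that fall by exactly 25 incidents per month.
dates = pd.date_range("1999-01-01", periods=24, freq="MS")
counts = pd.Series(4000 - 25 * np.arange(24), index=dates)

# Fit counts = slope * t + intercept, with t in months since the first observation.
t = np.arange(len(counts))
slope, intercept = np.polyfit(t, counts.to_numpy(dtype=float), deg=1)

print(f"Average change: {slope:.1f} incidents per month")
```

Fitting the same line separately before and after a suspected turning point gives a quick check on whether the trend really reversed.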

#### Chart - 2 - Seasonal Box Plot of Monthly Incident Counts

In [None]:
# Chart - 2 visualization code
# 1. Reuse df_agg from Chart 1, extract month
df_agg['month'] = df_agg['date'].dt.month

# 2. Prepare data as list of series for each month (Jan=1 … Dec=12)
monthly_groups = [df_agg.loc[df_agg['month'] == m, 'Incident_Counts'] for m in range(1, 13)]

# 3. Plot box plot
plt.figure(figsize=(10, 4))
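plt.boxplot(monthly_groups)
# Boxes are drawn at x = 1..12; label them with xticks, which works across Matplotlib versions
# (the boxplot `labels` keyword is deprecated in newer releases)
plt.xticks(ticks=range(1, 13), labels=[
    'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
    'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'
])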
plt.title('Distribution of Total Incidents by Month (1999–2011)')
plt.xlabel('Month')
plt.ylabel('Incident Counts')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Examines seasonality; shows spread, outliers, medians for each calendar month.

##### 2. What is/are the insight(s) found from the chart?

- July, August, and September have higher medians and upper quartiles, meaning summer is riskier.
- April and May show outliers/spikes (these could be one-time events).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Supports scheduling extra policing in summer and flags the spring outlier months for deeper review.
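
The seasonality read off the box plot can be summarised as a seasonal index, i.e. each calendar month's mean divided by the overall mean. A sketch on invented data with a deliberate July/August bump (the numbers are not from the real dataset):

```python
import pandas as pd

# Two years of invented monthly totals: 150 in July and August, 100 otherwise.
dates = pd.date_range("1999-01-01", periods=24, freq="MS")
values = [150 if d.month in (7, 8) else 100 for d in dates]
series = pd.Series(values, index=dates)

# Seasonal index: mean for each calendar month divided by the overall mean.
seasonal_index = series.groupby(series.index.month).mean() / series.mean()

# Months with an index above 1 are busier than average.
peak_months = seasonal_index[seasonal_index > 1].index.tolist()
print(seasonal_index.round(2))
print("Above-average months:", peak_months)
```

Because the real series also trends downward, computing the index on detrended data (for example after dividing by a 12-month rolling mean) would likely give a cleaner picture.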

#### Chart - 3 - Bar Chart of Overall Incident Counts by TYPE

In [None]:
# Chart - 3 visualization code
# 1. Aggregate total incidents by TYPE across all years
counts_by_type = train_monthly.groupby('TYPE')['Incident_Counts'].sum().sort_values(ascending=False)

# 2. Plot bar chart
plt.figure(figsize=(8, 5))
plt.bar(counts_by_type.index, counts_by_type.values)
plt.title('Total Incident Counts by Crime TYPE (1999–2011)')
plt.xlabel('Crime TYPE')
plt.ylabel('Total Incident Counts')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Highlights which crime types have highest cumulative volume.

##### 2. What is/are the insight(s) found from the chart?

- "Theft from Vehicle" is dominant, followed by Mischief and B&E.
- The last two types (Bicycle Theft, Collision with Injury) are relatively rare.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Positive: Justifies allocating resources to theft from vehicles and break-ins.

#### Chart - 4 - Univariate Heatmap of Year vs. Month

In [None]:
# Chart - 4 visualization code
# 1. Pivot table: index=Year, columns=Month, values=sum of Incident_Counts
pivot_year_month = train_monthly.pivot_table(
    index='Year',
    columns='Month',
    values='Incident_Counts',
    aggfunc='sum'
)

# 2. Plot heatmap using imshow
plt.figure(figsize=(8, 6))
plt.imshow(pivot_year_month, aspect='auto', origin='lower')
plt.colorbar(label='Incident Counts')
plt.title('Heatmap of Incident Counts by Year (1999–2011) and Month')
plt.xlabel('Month')
plt.ylabel('Year')
plt.xticks(ticks=range(0, 12), labels=[
    'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
    'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'
])
plt.yticks(ticks=range(len(pivot_year_month.index)), labels=pivot_year_month.index)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

- Compactly shows both seasonality and historical evolution in one chart.

##### 2. What is/are the insight(s) found from the chart?

- 1999–2002 were consistently higher, especially May–Oct; after 2007 every month is cooler.
- 2011 sees a “hot” winter compared to previous years.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Positive: Directs trend- and season-based forecasting; flags years (e.g. 1999–2002) needing review.

#### Chart - 5 - Rolling Mean & Rolling Standard Deviation Plot (12-Month Window)

In [None]:
# Chart - 5 visualization code
# 1. Compute 12-month rolling mean and std on the aggregated series
df_agg.set_index('date', inplace=True)
rolling_mean = df_agg['Incident_Counts'].rolling(window=12).mean()
rolling_std = df_agg['Incident_Counts'].rolling(window=12).std()

# 2. Plot both on the same figure
plt.figure(figsize=(10, 4))
plt.plot(df_agg.index, df_agg['Incident_Counts'], label='Original', alpha=0.4)
plt.plot(rolling_mean.index, rolling_mean, label='12-Month Rolling Mean')
plt.plot(rolling_std.index, rolling_std, label='12-Month Rolling Std', linestyle='--')
plt.title('Rolling Mean & Std Dev of Total Incidents (12-Month Window)')
plt.xlabel('Date')
plt.ylabel('Incident Counts / Std Dev')
plt.legend()
plt.tight_layout()
plt.show()

# Restore df_agg index if needed later:
df_agg.reset_index(inplace=True)


##### 1. Why did you pick the specific chart?

- Measures the moving average (trend) and volatility; gives a visual check for stationarity.

##### 2. What is/are the insight(s) found from the chart?

- The standard deviation remained relatively stable; the mean dropped until 2008 and then rebounded mildly after 2010.
- Volatility did not shrink as much as the mean, so relative variability (std divided by mean) is higher at low crime levels.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Positive: Highlights periods of low volatility, when demand is more predictable and resources can be saved.
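
The point about volatility not shrinking with the mean can be made explicit with a rolling coefficient of variation (rolling std divided by rolling mean). A sketch on an invented series whose level falls while its month-to-month swing stays the same size:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("1999-01-01", periods=36, freq="MS")

# Invented series: the level falls from 1000 to 500 while the swing stays at +/-50.
level = np.linspace(1000, 500, 36)
swing = np.where(np.arange(36) % 2 == 0, 50, -50)
series = pd.Series(level + swing, index=dates)

# 12-month rolling statistics; the first 11 values are NaN by construction.
rolling_mean = series.rolling(window=12).mean()
rolling_std = series.rolling(window=12).std()

# Coefficient of variation: volatility relative to the current level.
rolling_cv = rolling_std / rolling_mean
print(rolling_cv.dropna().iloc[[0, -1]].round(3))
```

A rising coefficient of variation means forecasts at low crime levels carry proportionally larger errors, which matters when choosing between absolute and percentage error metrics.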

#### Chart - 6 - Univariate Autocorrelation Function (ACF) Plot of Incident Counts

In [None]:
# Chart - 6 visualization code
from pandas.plotting import autocorrelation_plot

# 1. Ensure df_agg has date index
df_agg.set_index('date', inplace=True)

# 2. Plot ACF
plt.figure(figsize=(6, 4))
autocorrelation_plot(df_agg['Incident_Counts'])
plt.title('Autocorrelation Plot of Total Incident Counts')
plt.tight_layout()
plt.show()

# Restore df_agg index if needed later:
df_agg.reset_index(inplace=True)


##### 1. Why did you pick the specific chart?

- Quantifies persistence/memory and seasonality; essential for ARIMA modeling.

##### 2. What is/are the insight(s) found from the chart?

- Strong autocorrelation that decays slowly, suggesting momentum; evidence of yearly periodicity.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Positive: Enables robust forecasting; identifies the previous month and the same month of the previous year as strong predictors.
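
The two lags singled out here (1 month and 12 months) can be checked numerically, and the 12-month lag also yields a seasonal-naive baseline against which any fitted model can be compared. A sketch on an invented series with an exact 12-month cycle:

```python
import numpy as np
import pandas as pd

# Invented series: a pure 12-month sine cycle around 1000 incidents.
dates = pd.date_range("1999-01-01", periods=48, freq="MS")
series = pd.Series(1000 + 200 * np.sin(2 * np.pi * np.arange(48) / 12), index=dates)

# Autocorrelation at the lags highlighted above.
lag1 = series.autocorr(lag=1)
lag12 = series.autocorr(lag=12)

# Seasonal-naive baseline: forecast each month with the value from 12 months earlier.
seasonal_naive = series.shift(12)
mae = (series - seasonal_naive).abs().mean()  # NaN rows (first year) are skipped

print(f"lag-1: {lag1:.2f}, lag-12: {lag12:.2f}, seasonal-naive MAE: {mae:.2f}")
```

On the real data the seasonal-naive error will not be near zero; it simply sets the bar that SARIMAX or a tree-based model should clear.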

#### Chart - 7 - Bivariate Scatter Plot: “Theft from Vehicle” vs. “Other Theft” (Monthly)

In [None]:
# Chart - 7 visualization code
# 1. Filter only the two TYPEs, aggregate monthly
types_of_interest = ['Theft from Vehicle', 'Other Theft']
df_two_types = train_monthly[train_monthly['TYPE'].isin(types_of_interest)].copy()
df_two_types['date'] = pd.to_datetime(df_two_types[['Year', 'Month']].assign(DAY=1))
df_pivot_two = df_two_types.pivot_table(
    index='date',
    columns='TYPE',
    values='Incident_Counts',
    aggfunc='sum'
).dropna()

# 2. Scatter plot
plt.figure(figsize=(5, 5))
plt.scatter(
    df_pivot_two['Theft from Vehicle'],
    df_pivot_two['Other Theft'],
    alpha=0.6
)
plt.title('"Theft from Vehicle" vs. "Other Theft" (Monthly)')
plt.xlabel('Monthly Theft from Vehicle')
plt.ylabel('Monthly Other Theft')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

- Explores the bivariate correlation of two key types to assess co-movement.

##### 2. What is/are the insight(s) found from the chart?

- Evidence of a (likely weak to moderate) positive correlation over part of the range, but possibly nonlinear (clusters).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Positive: Suggests resource allocation could jointly target these crimes.
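
The "weak to moderate, possibly nonlinear" reading of the scatter plot can be backed with numbers: Pearson's r for linear association and Spearman's rho (Pearson on ranks) for monotonic association. A sketch with invented monthly counts; on the real data the two columns would come from `df_pivot_two`:

```python
import pandas as pd

# Invented monthly counts for the two crime types.
pivot = pd.DataFrame({
    "Theft from Vehicle": [1400, 1300, 1250, 1100, 900, 800],
    "Other Theft": [300, 290, 310, 260, 250, 240],
})

x, y = pivot["Theft from Vehicle"], pivot["Other Theft"]

pearson_r = x.corr(y)                 # linear association
spearman_r = x.rank().corr(y.rank())  # monotonic association (Pearson on ranks)

print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_r:.2f}")
```

A Spearman value clearly above the Pearson value would hint at a monotonic but nonlinear relationship, which is what the clustering in the scatter plot suggests might be present.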

#### Chart - 8 - Correlation Heatmap Among Top 5 Crime Types

In [None]:
# Chart - 8 visualization code
# 1. Identify top 5 TYPEs by total count
top_5_types = counts_by_type.head(5).index.tolist()

# 2. Pivot monthly counts for those TYPEs
df_top5 = train_monthly[train_monthly['TYPE'].isin(top_5_types)].copy()
df_top5['date'] = pd.to_datetime(df_top5[['Year', 'Month']].assign(DAY=1))
pivot_top5 = df_top5.pivot_table(
    index='date',
    columns='TYPE',
    values='Incident_Counts',
    aggfunc='sum'
).fillna(0)

# 3. Compute correlation matrix
corr_top5 = pivot_top5.corr()

# 4. Plot correlation heatmap
plt.figure(figsize=(6, 5))
plt.imshow(corr_top5, vmin=-1, vmax=1, cmap='RdBu', origin='lower')
plt.colorbar(label='Pearson Correlation')
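# Use the matrix's own column order for the tick labels: pivot_table sorts columns
# alphabetically, which can differ from the count-based order in top_5_types
corr_labels = corr_top5.columns.tolist()
plt.xticks(ticks=range(len(corr_labels)), labels=corr_labels, rotation=45, ha='right')
plt.yticks(ticks=range(len(corr_labels)), labels=corr_labels)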
plt.title('Correlation Matrix of Top 5 Crime Types')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

- Summarizes all mutual correlations at a glance.

##### 2. What is/are the insight(s) found from the chart?

- "Theft from Vehicle" and "Theft of Vehicle" are highly related; "Break and Enter" is more independent.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Positive: Supports clustering crimes for joint interventions.
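
One caveat when reading correlations between trending series: two crime types that both decline over the years will correlate strongly even if their month-to-month movements are unrelated. Correlating the monthly changes instead of the levels is one common way to check this. A sketch on invented data, with hypothetical names `Type A` and `Type B`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 60
trend = np.linspace(1000, 400, n)

# Two invented series that share a downward trend but have independent noise.
pivot = pd.DataFrame({
    "Type A": trend + rng.normal(0, 20, n),
    "Type B": trend + rng.normal(0, 20, n),
})

# Correlation of the levels is dominated by the shared trend.
corr_levels = pivot.corr().loc["Type A", "Type B"]

# Correlation of month-over-month changes removes the shared linear trend.
corr_changes = pivot.diff().dropna().corr().loc["Type A", "Type B"]

print(f"levels: {corr_levels:.2f}, monthly changes: {corr_changes:.2f}")
```

If a pair of crime types stays correlated after differencing, the case for joint interventions is stronger than a level correlation alone can show.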

#### Chart - 9 - Multivariate Stacked Area Chart of Incident Counts by TYPE Over Time

In [None]:
# Chart - 9 visualization code
# 1. Pivot a DataFrame: index=date, columns=TYPE, values=monthly Incident_Counts
df_all_types = train_monthly.copy()
df_all_types['date'] = pd.to_datetime(df_all_types[['Year', 'Month']].assign(DAY=1))
pivot_all = df_all_types.pivot_table(
    index='date',
    columns='TYPE',
    values='Incident_Counts',
    aggfunc='sum'
).fillna(0)

# 2. Sort TYPE columns by total count (descending)
type_order = pivot_all.sum().sort_values(ascending=False).index
pivot_all = pivot_all[type_order]

# 3. Plot stacked area
plt.figure(figsize=(10, 5))
plt.stackplot(pivot_all.index, pivot_all.T, labels=type_order)
plt.legend(loc='upper left', fontsize='small')
plt.title('Stacked Area Chart: Monthly Incident Counts by TYPE (1999–2011)')
plt.xlabel('Date')
plt.ylabel('Incident Counts')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?


*   Shows composition and trends of all crime types simultaneously over time
* Effective for visualizing both individual category performance and total volume
* Good for identifying dominant crime categories and seasonal patterns



##### 2. What is/are the insight(s) found from the chart?

* Overall crime declined significantly, from ~4,000 incidents/month (1999-2000) to ~2,000-2,500 (2005-2011)
* "Theft from Vehicle" (blue) dominates, representing ~40-50% of all incidents
* Clear seasonal patterns with peaks in summer months
* "Break and Enter Residential/Other" shows steady decline over time

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: The crime reduction trend is consistent with effective policing strategies, although the chart alone does not establish the cause.

#### Chart - 10 - Bivariate Time Series Overlay (Dual Axes): “Theft from Vehicle” & “Mischief”

In [None]:
# Chart - 10 visualization code
# 1. Select and pivot the two series
series_two = train_monthly[train_monthly['TYPE'].isin(['Theft from Vehicle', 'Mischief'])].copy()
series_two['date'] = pd.to_datetime(series_two[['Year', 'Month']].assign(DAY=1))
pivot_two_ts = series_two.pivot_table(
    index='date',
    columns='TYPE',
    values='Incident_Counts',
    aggfunc='sum'
).fillna(0)

# 2. Plot on dual axes
fig, ax1 = plt.subplots(figsize=(10, 4))
ax1.plot(pivot_two_ts.index, pivot_two_ts['Theft from Vehicle'], color='blue', label='Theft from Vehicle')
ax1.set_xlabel('Date')
ax1.set_ylabel('Theft from Vehicle', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')

# Each Axes restarts the default color cycle, so set the line colors explicitly to match the axis labels
ax2 = ax1.twinx()
ax2.plot(pivot_two_ts.index, pivot_two_ts['Mischief'], color='orange', label='Mischief', linestyle='--')
ax2.set_ylabel('Mischief', color='orange')
ax2.tick_params(axis='y', labelcolor='orange')

plt.title('Monthly "Theft from Vehicle" vs. "Mischief"')
fig.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

- Compares two major crime categories with dual y-axes
- Shows relationship between related crime types

##### 2. What is/are the insight(s) found from the chart?

- Theft from Vehicle peaked around 1999-2000 (~1,800/month) and declined to ~600-800/month
- Both crimes show seasonal patterns but different magnitudes
- Mischief remains relatively stable (~300-500/month) with less dramatic decline

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Positive: Significant reduction in vehicle theft suggests successful prevention programs
- Negative: Mischief crimes haven't declined as dramatically, requiring focused intervention

#### Chart - 11 - Box Plot of Incident Counts by Year (All Types Combined)

In [None]:
# Chart - 11 visualization code
# 1. Use df_agg from Chart 1 (aggregate per month), extract year
df_agg_box = df_agg.copy()
df_agg_box['year'] = df_agg_box['date'].dt.year

# 2. Prepare data: list of series for each year
years = sorted(df_agg_box['year'].unique())
data_by_year = [df_agg_box.loc[df_agg_box['year'] == y, 'Incident_Counts'] for y in years]

# 3. Plot box plot
plt.figure(figsize=(10, 4))
plt.boxplot(data_by_year, labels=years)
plt.title('Monthly Incident Counts Distribution by Year (1999–2011)')
plt.xlabel('Year')
plt.ylabel('Incident Counts')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

- Shows distribution, median, and outliers for each year
- Excellent for identifying variability and unusual periods

##### 2. What is/are the insight(s) found from the chart?

- Median incident counts dropped from ~3,800 (1999) to ~2,200 (2009)
- High variability in early years (1999-2002) with wide boxes and outliers
- More stable, predictable crime patterns in recent years (smaller boxes)

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: More predictable crime patterns allow better resource planning

#### Chart - 12 - Violin Plot of Incident Counts by Quarter

In [None]:
# Chart - 12 visualization code
# 1. Add quarter label to df_agg
df_agg_quarter = df_agg.copy()
df_agg_quarter['quarter'] = df_agg_quarter['date'].dt.quarter

# 2. Group incident counts by quarter
quarters = [1, 2, 3, 4]
data_by_quarter = [df_agg_quarter.loc[df_agg_quarter['quarter'] == q, 'Incident_Counts'] for q in quarters]

# 3. Plot violin plot
plt.figure(figsize=(6, 4))
plt.violinplot(data_by_quarter, showmeans=True)
plt.xticks(ticks=[1, 2, 3, 4], labels=['Q1', 'Q2', 'Q3', 'Q4'])
plt.title('Violin Plot of Total Incidents by Quarter (1999–2011)')
plt.xlabel('Quarter')
plt.ylabel('Incident Counts')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

- Shows seasonal distribution patterns across all years
- Better than box plots for showing distribution shape

##### 2. What is/are the insight(s) found from the chart?

- Q2 (April to June) shows highest incident peaks with widest distribution
- Q1 has lowest median but high variability
- Seasonal crime patterns are consistent across the 13-year period
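
As a quick reference for reading the quarterly plot, the sketch below maps calendar months to the quarters that pandas assigns. June falls in Q2, while July and August fall in Q3, so the summer months used in the seasonal hypothesis straddle two quarters. The variable names are illustrative only.

```python
import pandas as pd

# One timestamp per calendar month (the year is arbitrary)
months = pd.date_range("2011-01-01", periods=12, freq="MS")

# Quarter number for each month, indexed by month name
quarter_by_month = pd.Series(months.quarter, index=months.month_name())
print(quarter_by_month)
```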

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Resource Planning: Allocate more officers during Q2 (April to June)

#### Chart - 13 - Facet Grid of Time Series (One Subplot per TYPE)

In [None]:
# Chart - 13 visualization code
# 1. Get unique TYPEs and set up subplots (e.g., 3 columns, n rows as needed)
unique_types = df_train['TYPE'].unique()
n_types = len(unique_types)
n_cols = 3
n_rows = (n_types + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, 4 * n_rows), sharex=True)
axes = axes.flatten()

# 2. For each TYPE, plot its monthly series
for idx, crime_type in enumerate(unique_types):
    df_type = train_monthly[train_monthly['TYPE'] == crime_type].copy()
    df_type['date'] = pd.to_datetime(df_type[['Year', 'Month']].assign(DAY=1))
    series_type = df_type.groupby('date')['Incident_Counts'].sum().reset_index()

    axes[idx].plot(series_type['date'], series_type['Incident_Counts'])
    axes[idx].set_title(crime_type)
    axes[idx].set_ylabel('Count')
    axes[idx].tick_params(axis='x', rotation=45)

# 3. Hide any unused subplots
for j in range(idx + 1, len(axes)):
    axes[j].axis('off')

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

- Shows detailed trends for each crime category
- Enables comparison of different trajectory patterns

##### 2. What is/are the insight(s) found from the chart?

- Theft from Vehicle shows dramatic decline (~1,800 to ~600 per month)
- Other Theft shows upward trend (concerning)
- Break and Enter categories both declining
- Theft of Bicycle remains relatively stable but low volume

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Mixed Results: Success in some categories, concern in others
- Resource Reallocation: Shift focus from declining categories to emerging problems

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Correlation Heatmap for Top 5 Crime Types

# 1. Identify top 5 TYPEs by total count
counts_by_type = train_monthly.groupby('TYPE')['Incident_Counts'].sum().sort_values(ascending=False)
top_5_types = counts_by_type.head(5).index.tolist()

# 2. Pivot monthly counts for those TYPEs
df_top5 = train_monthly[train_monthly['TYPE'].isin(top_5_types)].copy()
df_top5['date'] = pd.to_datetime(df_top5[['Year', 'Month']].assign(DAY=1))
pivot_top5 = df_top5.pivot_table(
    index='date',
    columns='TYPE',
    values='Incident_Counts',
    aggfunc='sum'
).fillna(0)

# 3. Compute correlation matrix
corr_top5 = pivot_top5.corr()

# 4. Plot correlation heatmap
plt.figure(figsize=(6, 5))
plt.imshow(corr_top5.values, vmin=-1, vmax=1, origin='lower')
plt.colorbar(label='Pearson Correlation')
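# Label ticks from the matrix itself: pivot columns are sorted alphabetically, not in top_5_types order
plt.xticks(ticks=range(len(corr_top5.columns)), labels=corr_top5.columns, rotation=45, ha='right')
plt.yticks(ticks=range(len(corr_top5.index)), labels=corr_top5.index)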
plt.title('Correlation Matrix of Top 5 Crime Types')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

- Identifies relationships between different crime types
- Helps understand if crimes are related or independent

##### 2. What is/are the insight(s) found from the chart?

- Strong positive correlation between "Theft from Vehicle" and "Other Theft"
- Negative correlation between "Break and Enter" and other categories
- Some crime types move independently (near-zero correlations)

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Pair Plot (Scatter Matrix) for Top 5 Crime Types

# 1. Reuse pivot_top5 from above (monthly counts for top 5 types)
#    If pivot_top5 is not already in memory, recreate it as shown in the heatmap code.

# 2. Use scatter_matrix to create pairwise scatter plots + histograms on the diagonal
#    (scatter_matrix creates its own figure, so no separate plt.figure call is needed)
scatter_matrix(
    pivot_top5,
    diagonal='hist',
    alpha=0.5,
    figsize=(8, 8)
)
plt.suptitle('Pair Plot: Monthly Counts for Top 5 Crime Types', y=0.92)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

- Comprehensive view of relationships between all crime pairs
- Shows both distributions and correlations

##### 2. What is/are the insight(s) found from the chart?

- Confirms correlation findings from matrix
- Shows clustering patterns in crime relationships
- Identifies outlier months with unusual crime combinations

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

## Hypothesis 1: Seasonal Crime Pattern
Statement: "Summer months (June, July, August) have significantly higher total crime incidents compared to winter months (December, January, February)."

---
## Hypothesis 2: Crime Trend Structural Break
Statement: "There was a significant structural break in the crime trend around 2009, with crime rates declining from 1999-2009 and then reversing to an increasing trend from 2010-2011."

---
## Hypothesis 3: Vehicle Crime Correlation
Statement: "Theft from Vehicle and Theft of Vehicle incidents are significantly positively correlated, with changes in one crime type predicting changes in the other."

---

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀): There is no significant difference in mean monthly crime incidents between summer months (June, July, August) and winter months (December, January, February).\
μ_summer = μ_winter

Alternative Hypothesis (H₁): Summer months have significantly higher mean monthly crime incidents than winter months.\
μ_summer > μ_winter (one-tailed test)

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value


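# Extract incident counts for summer months (June=6, July=7, August=8)
# Note: train_monthly has one row per (Year, Month, TYPE), so these are per-type
# monthly counts rather than citywide monthly totals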
summer_incidents = train_monthly[train_monthly['Month'].isin([6, 7, 8])]['Incident_Counts']

# Extract incident counts for winter months (December=12, January=1, February=2)
winter_incidents = train_monthly[train_monthly['Month'].isin([12, 1, 2])]['Incident_Counts']

# Perform an independent two-sample t-test (one-tailed)
# The alternative='greater' specifies a one-tailed test where we check if the mean of the first sample (summer)
# is greater than the mean of the second sample (winter).
t_statistic, p_value = stats.ttest_ind(summer_incidents, winter_incidents, alternative='greater')

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

##### Which statistical test have you done to obtain P-Value?

Independent two-sample t-test (one-tailed)
Specifically, a one-tailed test with alternative='greater' to test if summer incidents > winter incidents.

##### Why did you choose the specific statistical test?

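The t-test compares the means of two independent groups of continuous (or approximately continuous) data, and it is reasonably robust when the sample size in each group is moderate to large.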
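
The relationship between the one-tailed and two-tailed p-values can be checked on synthetic data. This is a minimal sketch: `summer_demo` and `winter_demo` are randomly generated placeholders, not the project's series, and the Welch variant (`equal_var=False`) is shown only as an option for groups with unequal variances.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
summer_demo = rng.normal(loc=520, scale=60, size=39)  # synthetic values
winter_demo = rng.normal(loc=500, scale=60, size=39)  # synthetic values

# Two-tailed test (default) and one-tailed test (summer > winter)
t_stat, p_two = stats.ttest_ind(summer_demo, winter_demo)
_, p_greater = stats.ttest_ind(summer_demo, winter_demo, alternative="greater")

# Welch's t-test does not assume equal variances in the two groups
_, p_welch = stats.ttest_ind(
    summer_demo, winter_demo, equal_var=False, alternative="greater"
)

# For a positive t-statistic the one-tailed p-value is half the two-tailed one
expected = p_two / 2 if t_stat > 0 else 1 - p_two / 2
print(f"t={t_stat:.3f}, one-tailed p={p_greater:.4f}, Welch p={p_welch:.4f}")
```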

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀): There is no structural break in the crime trend at 2009; the trend remains consistent throughout 1999-2011.\
β_before2009 = β_after2009

Alternative Hypothesis (H₁): There is a significant structural break in 2009, with different trend coefficients before and after this point.\
β_before2009 ≠ β_after2009

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# 1. Prepare data for regression
# Ensure df_agg has date column as datetime and extract year and month
# The 'date' column already exists from previous steps, ensure it's datetime
df_agg['date'] = pd.to_datetime(df_agg['date'])

# Extract year and month from the existing 'date' column
df_agg['year'] = df_agg['date'].dt.year
df_agg['month'] = df_agg['date'].dt.month

# Create a continuous time variable (days since the first observation)
df_agg['time'] = (df_agg['date'] - df_agg['date'].min()).dt.days

# Create a dummy variable for the period from January 2009 onward
df_agg['after_2008'] = (df_agg['year'] >= 2009).astype(int)

# Create an interaction term: time * after_2008
df_agg['time_after_2008'] = df_agg['time'] * df_agg['after_2008']

# Define the dependent variable (Incident Counts) and independent variables
Y = df_agg['Incident_Counts']
X = df_agg[['time', 'after_2008', 'time_after_2008']]

# Add a constant for the intercept
X = sm.add_constant(X)

# 2. Fit the linear regression model
model = sm.OLS(Y, X).fit()

# 3. Obtain the p-value for the interaction term (time_after_2008)
# The p-value for this term tests the null hypothesis that the trend coefficient
# is the same before and after 2009 (i.e., the difference in slopes is zero).
p_value = model.pvalues['time_after_2008']

print(model.summary())
print(f"P-value for structural break in 2009: {p_value}")

##### Which statistical test have you done to obtain P-Value?

- Linear regression with an interaction term (t-test on a coefficient): specifically, a t-test on the `time_after_2008` interaction coefficient in an OLS regression model.
- This is a common way to test for a structural break at a known breakpoint. It is related to the Chow test, which jointly tests the change in intercept and slope, whereas the test used here isolates the change in slope.

##### Why did you choose the specific statistical test?

- It allows for the inclusion of a specific breakpoint (2009) in the time series model.
- The interaction term directly tests for a significant difference in the trend slope before and after the breakpoint, which aligns with the alternative hypothesis (β_before2009 ≠ β_after2009).
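
For comparison, a Chow-type test checks the break dummy and the interaction term jointly with an F-test. The sketch below computes that F-statistic by hand on a synthetic series; the break position (month 120), the coefficients, and the noise level are all assumptions chosen for illustration, not values from the crime data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 156                                   # 13 years of monthly observations
t = np.arange(n, dtype=float)
after = (t >= 120).astype(float)          # hypothetical break at month 120

# Synthetic series: downward trend that turns upward after the break
y = 4000 - 12 * t + after * 20 * (t - 120) + rng.normal(0, 80, n)

def rss(X, y):
    """Residual sum of squares of an ordinary least squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

# Restricted model: one trend line. Full model: separate intercept and slope after the break.
X_restricted = np.column_stack([np.ones(n), t])
X_full = np.column_stack([np.ones(n), t, after, t * after])

rss_r, rss_u = rss(X_restricted, y), rss(X_full, y)
q = X_full.shape[1] - X_restricted.shape[1]      # number of restrictions (2)
dof = n - X_full.shape[1]                        # residual degrees of freedom
f_stat = ((rss_r - rss_u) / q) / (rss_u / dof)
p_value_joint = stats.f.sf(f_stat, q, dof)

print(f"F = {f_stat:.2f}, joint p-value = {p_value_joint:.3g}")
```

A small joint p-value indicates that the intercept, the slope, or both differ across the breakpoint, which is a broader statement than the slope-only t-test.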

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀): There is no significant correlation between "Theft from Vehicle" and "Theft of Vehicle" incidents.\
ρ = 0

Alternative Hypothesis (H₁): There is a significant positive correlation between "Theft from Vehicle" and "Theft of Vehicle" incidents.\
ρ > 0 (one-tailed test)

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# 1. Extract the monthly incident counts for the two crime types
#    Build a dedicated pivot instead of reusing pivot_top5 from Chart 14, which
#    only contains the top 5 types by total count and may not include both columns
types_for_corr = ['Theft from Vehicle', 'Theft of Vehicle']
df_corr = train_monthly[train_monthly['TYPE'].isin(types_for_corr)].copy()
df_corr['date'] = pd.to_datetime(df_corr[['Year', 'Month']].assign(DAY=1))
pivot_corr = df_corr.pivot_table(
    index='date',
    columns='TYPE',
    values='Incident_Counts',
    aggfunc='sum'
).fillna(0)

theft_from_vehicle = pivot_corr['Theft from Vehicle']
theft_of_vehicle = pivot_corr['Theft of Vehicle']

# 2. Perform the Pearson correlation test
# pearsonr returns the Pearson correlation coefficient and a two-tailed p-value
correlation_coefficient, two_tailed_p_value = pearsonr(theft_from_vehicle, theft_of_vehicle)

# 3. Convert the two-tailed p-value to a one-tailed p-value
# Since we are testing if correlation > 0 (one-tailed alternative),
# and pearsonr gives a two-tailed p-value, we divide by 2.
# This is valid if the calculated correlation coefficient is positive,
# which we expect based on the H1. If it were negative, the one-tailed p-value
# would be 1 - (two_tailed_p_value / 2).
if correlation_coefficient > 0:
    one_tailed_p_value = two_tailed_p_value / 2
else:
    # This case is unlikely given the previous charts, but included for completeness
    one_tailed_p_value = 1 - (two_tailed_p_value / 2)


print(f"Pearson Correlation Coefficient: {correlation_coefficient}")
print(f"One-tailed P-value: {one_tailed_p_value}")

##### Which statistical test have you done to obtain P-Value?

 I have performed a Pearson correlation test to obtain the correlation coefficient and its corresponding p-value. Then, I converted the obtained two-tailed p-value to a one-tailed p-value to match the alternative hypothesis.

##### Why did you choose the specific statistical test?

- It is the standard statistical test for measuring the linear relationship (correlation) between two continuous variables.
- It is appropriate for this data, which consists of monthly counts of incidents, treated as continuous variables for the purpose of this test.
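
The halving of the two-tailed p-value can be cross-checked against SciPy's own one-sided option. This sketch assumes SciPy 1.9 or later, where `pearsonr` accepts an `alternative` argument; the data are synthetic and the names `x` and `y` are placeholders.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
x = rng.normal(size=60)                 # synthetic series
y = 0.5 * x + rng.normal(size=60)       # positively related to x by construction

# Default: two-sided p-value
r, p_two_sided = pearsonr(x, y)

# One-sided test of rho > 0 (assumes SciPy >= 1.9)
r_alt, p_one_sided = pearsonr(x, y, alternative="greater")

print(f"r = {r:.3f}, two-sided p = {p_two_sided:.3g}, one-sided p = {p_one_sided:.3g}")
```

When the sample correlation is positive, the one-sided p-value equals half the two-sided value, which matches the manual conversion.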

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# No missing values in the training data

#### What all missing value imputation techniques have you used and why did you use those techniques?

There are no missing values in the training data after cleaning and aggregation, so no imputation was needed.

### 2. Handling Outliers

In [None]:
#1 Winsorization

# Assuming summer_incidents and winter_incidents are defined elsewhere
# based on your seasonal analysis.
# For example:
# summer_months = [6, 7, 8]
# winter_months = [12, 1, 2]
# df_agg['month'] = df_agg['date'].dt.month # Make sure month is available
# summer_incidents = df_agg[df_agg['month'].isin(summer_months)]['Incident_Counts']
# winter_incidents = df_agg[df_agg['month'].isin(winter_months)]['Incident_Counts']

# Check if summer_incidents and winter_incidents are available and not empty before winsorizing
if 'summer_incidents' in globals() and 'winter_incidents' in globals() and not summer_incidents.empty and not winter_incidents.empty:
    # Winsorize top 5% of values in both groups
    summer_winsorized = winsorize(summer_incidents, limits=[0, 0.05])
    winter_winsorized = winsorize(winter_incidents, limits=[0, 0.05])
    print("Winsorization applied.")
else:
    print("Could not perform Winsorization: summer_incidents or winter_incidents not defined or empty.")


#2 Robust regression with Huber loss
# Assuming df_agg is defined and contains 'Incident_Counts', 'time', and 'after_2008'
# For example, if time is a numerical representation of date and after_2008 is a dummy variable:
# df_agg['time'] = np.arange(len(df_agg))
# df_agg['after_2008'] = (df_agg['date'].dt.year > 2008).astype(int)

# Check if df_agg is available and contains necessary columns before fitting RLM
if 'df_agg' in globals() and not df_agg.empty and all(col in df_agg.columns for col in ['Incident_Counts', 'time', 'after_2008']):
    robust_model = rlm('Incident_Counts ~ time * after_2008', data=df_agg).fit()
    print("Robust regression model fitted.")
else:
     print("Could not fit Robust Regression: df_agg not defined or missing columns 'Incident_Counts', 'time', 'after_2008'.")


#3 Mahalanobis Distance

# Filter for the specific crime types and pivot to get them as columns
vehicle_crime_types = ['Theft from Vehicle', 'Theft of Vehicle']
vehicle_data_filtered = train_monthly[train_monthly['TYPE'].isin(vehicle_crime_types)].copy()

# Pivot to have 'Theft from Vehicle' and 'Theft of Vehicle' as columns
# We need to make sure that for each month/year combination, we have both types,
# otherwise mahalanobis distance calculation will fail on different row counts or missing values.
# If a type is missing for a given month/year, we fill it with 0.
vehicle_data_pivot = vehicle_data_filtered.pivot_table(
    index=['Year', 'Month'],
    columns='TYPE',
    values='Incident_Counts',
    aggfunc='sum'
).fillna(0) # Fill missing combinations with 0 incident counts


# Select the columns for Mahalanobis distance calculation
# Ensure the column names match the pivoted dataframe columns
if 'Theft from Vehicle' in vehicle_data_pivot.columns and 'Theft of Vehicle' in vehicle_data_pivot.columns:
    vehicle_data_for_maha = vehicle_data_pivot[['Theft from Vehicle', 'Theft of Vehicle']]

    # Calculate covariance matrix and its inverse
    cov = np.cov(vehicle_data_for_maha.values.T)

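    # Check if covariance matrix is (numerically) singular; an exact == 0 test on a
    # floating-point determinant is unreliable
    if np.linalg.matrix_rank(cov) < cov.shape[0]: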
        print("Covariance matrix is singular. Cannot compute Mahalanobis distance.")
    else:
        inv_cov = inv(cov)
        mean = vehicle_data_for_maha.mean().values

        # Calculate Mahalanobis distance for each row
        mahalanobis_distances = [mahalanobis(row, mean, inv_cov)
                              for row in vehicle_data_for_maha.values]

        # Add Mahalanobis distance to the pivoted dataframe
        vehicle_data_pivot['mahalanobis'] = mahalanobis_distances

        # Remove outliers (χ² cutoff, α=0.01)
        # The degrees of freedom for chi2.ppf should be the number of variables used in Mahalanobis distance (2 in this case)
        cutoff = np.sqrt(chi2.ppf(0.99, df=2))
        cleaned_vehicle_data = vehicle_data_pivot[vehicle_data_pivot['mahalanobis'] <= cutoff]

        print("\nMahalanobis Distance Calculation:")
        print(f"Number of data points before outlier removal: {len(vehicle_data_pivot)}")
        print(f"Mahalanobis distance cutoff (sqrt(chi2.ppf(0.99, 2))): {cutoff:.4f}")
        print(f"Number of data points after outlier removal: {len(cleaned_vehicle_data)}")
        print("Mahalanobis distance calculated and outliers removed.")
        print(cleaned_vehicle_data.head())


else:
    print("Could not perform Mahalanobis Distance calculation: Pivoted dataframe does not contain expected columns ('Theft from Vehicle', 'Theft of Vehicle').")

##### What all outlier treatment techniques have you used and why did you use those techniques?

All of the outlier treatments are done based upon the results of hypothesis testing.
### Hypothesis 1: Seasonal Crime Pattern
- Original Result: Failed to reject H₀ (p=0.074)

- Outlier Treatment: `Winsorization (5% upper tail)`

Why Used:

- Borderline non-significance suggested possible outlier influence

- Winsorization preserves sample size while reducing extreme value impact

- Focused on upper tail (crime spikes) as these disproportionately affect means

### Hypothesis 2: Crime Trend Structural Break
- Original Result: Rejected H₀ (p=4.24e-19)

- Outlier Treatment: `Robust Regression (Huber Loss)`

Why Used:

- Extreme significance but model showed large condition number (88,400)

- Protects against influential observations without removing data points

- Maintains time-series continuity while downweighting outliers

### Hypothesis 3: Vehicle Crime Correlation
- Original Result: Rejected H₀ (p=4.70e-59)

- Outlier Treatment: `Mahalanobis Distance` (α=0.01)

Why Used:

- Detects multivariate outliers in correlated crime types

- Distance cutoff of 3.0349 (the square root of the χ² 0.99 quantile with 2 degrees of freedom) ensures only extreme joint deviations are removed

- Conservative: Removed only 3/156 points (1.9% of data)
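
The cutoff can be reproduced directly, since squared Mahalanobis distances of bivariate normal data follow a χ² distribution with 2 degrees of freedom. The sketch below uses randomly generated points; the mean vector and covariance matrix are arbitrary assumptions, not estimates from the crime data.

```python
import numpy as np
from scipy.stats import chi2
from scipy.spatial.distance import mahalanobis

# Cutoff on the distance scale: square root of the chi-square 0.99 quantile (df=2)
cutoff = np.sqrt(chi2.ppf(0.99, df=2))
print(round(cutoff, 4))  # 3.0349

# Synthetic bivariate data (illustrative mean and covariance)
rng = np.random.default_rng(7)
points = rng.multivariate_normal(
    mean=[1000, 300], cov=[[40000, 9000], [9000, 4900]], size=156
)

center = points.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(points.T))
distances = np.array([mahalanobis(p, center, inv_cov) for p in points])

# Flag points beyond the cutoff; comparing squared distances to the
# chi-square quantile is the equivalent formulation
flagged = distances > cutoff
flagged_sq = distances**2 > chi2.ppf(0.99, df=2)
print(f"Flagged {flagged.sum()} of {len(points)} points")
```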

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

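Label encoding (`LabelEncoder`) is applied to the `TYPE` column in the Data Transformation step, producing `TYPE_encoded` so that the crime type can be used as a numeric feature in `X_train` and `X_val`.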

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Feature Manipulation
# Create datetime index and temporal features
train_monthly['date'] = pd.to_datetime(train_monthly[['Year', 'Month']].assign(day=1))
train_monthly.set_index('date', inplace=True)

# Create structural break features (post-2008 indicator)
train_monthly['after_2008'] = (train_monthly.index.year >= 2009).astype(int)
train_monthly['time'] = (train_monthly.index - train_monthly.index.min()).days
train_monthly['time_after_2008'] = train_monthly['time'] * train_monthly['after_2008']

# Feature Selection - Retain only essential columns
selected_features = ['Incident_Counts', 'TYPE', 'time', 'after_2008', 'time_after_2008']
train_monthly = train_monthly[selected_features]

#### 2. Feature Selection

In [None]:
#Feature selection is done in the feature manipulation section
#and data wrangling section (Dropping columns and aggregation).

##### What all feature selection methods have you used  and why?

1. Manual Column Dropping: The `HUNDRED_BLOCK` column is explicitly dropped during the data preprocessing step because it is not a critical feature for the forecasting task.

2. Aggregation: The individual incident records are aggregated into monthly counts by `Year`, `Month`, and `TYPE`. This converts the granular incident data into time-series format, with these three columns as the grouping keys for the target variable `Incident_Counts`.

3. Implicit Selection: By aggregating to monthly counts, features like `X`, `Y`, `Latitude`, `Longitude`, `HOUR`, and `MINUTE` are implicitly excluded, because they are not directly relevant to a monthly crime count forecast.
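
The aggregation step described above can be sketched on a tiny made-up table. The `incidents` frame below is hypothetical and only mirrors the column names mentioned in this section.

```python
import pandas as pd

# Hypothetical incident-level records (one row per reported incident)
incidents = pd.DataFrame(
    {
        "TYPE": ["Mischief", "Mischief", "Theft of Bicycle", "Mischief"],
        "Year": [2010, 2010, 2010, 2010],
        "Month": [1, 1, 1, 2],
        "HOUR": [3, 14, 9, 22],
    }
)

# Count incidents per (Year, Month, TYPE); columns outside the grouping keys,
# such as HOUR, drop out of the result
monthly_demo = (
    incidents.groupby(["Year", "Month", "TYPE"])
    .size()
    .reset_index(name="Incident_Counts")
)
print(monthly_demo)
```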

##### Which all features you found important and why?

1. Year and Month: These are the `fundamental time components` that define the time series.

2. TYPE: The problem requires forecasting monthly counts for `each crime type`.

3. Incident_Counts (as the target variable): While not a predictor feature, the historical `Incident_Counts` values are the most important information for time series forecasting.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:

# Encode crime types
le = LabelEncoder()
train_monthly['TYPE_encoded'] = le.fit_transform(train_monthly['TYPE'])

# Log transformation for variance stabilization
train_monthly['log_incidents'] = np.log1p(train_monthly['Incident_Counts'])

# Create differenced features for stationarity
train_monthly['diff_1'] = train_monthly.groupby('TYPE')['Incident_Counts'].diff(1)
train_monthly['diff_12'] = train_monthly.groupby('TYPE')['Incident_Counts'].diff(12)

# Drop initial rows with NaNs from differencing
train_monthly = train_monthly.dropna(subset=['diff_1', 'diff_12'])

- Yes, the data needed to be transformed: the raw data contained detail that is unnecessary for a monthly forecasting task (`exact time`, `location`), and forecasting models require data in `time-series` format.
- Ways in which the data is transformed:
- `Aggregation` to Monthly Counts: The goal is to forecast monthly crime counts. The original data is at the individual incident level, which is too granular for this forecasting task.
- `Datetime` Index: Time series analysis relies on a proper temporal index to order data and identify patterns like seasonality and trends. Creating a date column from Year and Month allows for time-based operations. This was done using `pd.to_datetime` in the `preprocess_data` function.
- `Log` Transformation: `np.log1p` is applied to the counts to stabilize their variance.
- `Differencing`: First-order (lag 1) and seasonal (lag 12) differences are computed per crime type to help with stationarity. The initial rows that have no earlier value to difference against are dropped.
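
The log and differencing transformations used in the code cell can be illustrated on a synthetic series. In this sketch, `counts` is a made-up series with a linear trend and a 12-month cycle; it is an assumption for demonstration only.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend plus a 12-month seasonal cycle
idx = pd.date_range("2009-01-01", periods=24, freq="MS")
steps = np.arange(24)
counts = pd.Series(100 + steps + 10 * np.sin(2 * np.pi * steps / 12), index=idx)

# log1p compresses large values; expm1 is its exact inverse
log_counts = np.log1p(counts)
recovered = np.expm1(log_counts)

# Lag-1 differencing removes the level; lag-12 differencing removes the yearly cycle
diff_1 = counts.diff(1)
diff_12 = counts.diff(12)

print(diff_12.dropna().round(6).unique())  # only the 12-month trend increment remains
```

Note that lag-12 differencing leaves the first 12 observations of each series undefined, so dropping those rows shortens the usable history by one year.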



### 6. Data Scaling

In [None]:
# Scaling your data


##### Which method have you used to scale you data and why?

Classical time series models (SARIMA, Exponential Smoothing) are fitted to each series in its own units and do not combine features measured on different scales, so scaling is `unnecessary` here.

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Features are already low-dimensional post-aggregation (5–7 features). Dimensionality reduction could lose interpretability of crime-type-specific patterns. Hence Dimensionality reduction is `not needed`

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Dimensionality reduction is not necessary so it is `not applied`.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
# Chronological split - Last year (2011) as validation
train_end = '2010-12-31'
val_start = '2011-01-01'

train_data = train_monthly[train_monthly.index <= train_end]
val_data = train_monthly[train_monthly.index >= val_start]

# Prepare final feature/target sets
X_train = train_data[['TYPE_encoded', 'time', 'after_2008', 'time_after_2008']]
y_train = train_data['Incident_Counts']

X_val = val_data[['TYPE_encoded', 'time', 'after_2008', 'time_after_2008']]
y_val = val_data['Incident_Counts']

print(f"Train shape: {X_train.shape}, Validation shape: {X_val.shape}")

##### What data splitting ratio have you used and why?

- Training Data: Covers the period from `1999 to 2010`.
- Validation Data: Covers the year `2011`, the last year of the available data.
- Test Data: Covers the 18-month forecast horizon from `January 2012 to June 2013`.
- Reasoning: A `chronological split` ensures that the model is trained only on data that occurred before the period it needs to forecast. The problem statement and test data structure explicitly define the forecasting task as predicting monthly crime counts for 2012-2013. The `Validation` set is used to evaluate the performance of the forecast model.
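
A minimal sketch of the chronological split on a placeholder monthly index follows. The `demo` frame is synthetic; only the cut-off dates and the 18-month horizon mirror the code in this notebook.

```python
import pandas as pd

# Placeholder monthly data covering January 1999 to December 2011
idx = pd.date_range("1999-01-01", "2011-12-01", freq="MS")
demo = pd.DataFrame({"Incident_Counts": range(len(idx))}, index=idx)

# Split by date so that validation data always comes after training data
train_part = demo[demo.index <= "2010-12-31"]
val_part = demo[demo.index >= "2011-01-01"]

# Forecast horizon: 18 monthly steps starting in January 2012
future_index = pd.date_range("2012-01-01", periods=18, freq="MS")

print(len(train_part), len(val_part), future_index[-1].date())
```

Unlike a random split, this keeps every validation month strictly later than every training month, which avoids leaking future information into the model.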

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Monthly crime counts are continuous values, `not classes`. Imbalance handling (e.g., SMOTE) is irrelevant for regression-style forecasting.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

`No need`, because the target consists of continuous values, not classes.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
all_crimes = [
    'Break and Enter Commercial',
    'Break and Enter Residential/Other',
    'Mischief',
    'Other Theft',
    'Theft from Vehicle',
    'Theft of Bicycle',
    'Theft of Vehicle',
    'Vehicle Collision or Pedestrian Struck (with Injury)',
    'Offence Against a Person'
]
# Define crime-specific configurations
crime_config = {
    'Theft from Vehicle': {
        'transform': 'log',
        'd': [2],  # Force 2nd-order differencing
        'q': [2, 3],  # Higher MA terms
        'error_metric': 'smape'  # Use symmetric MAPE
    },
    'Theft of Bicycle': {
        'transform': 'log',
        'd': [1, 2],
        'error_metric': 'smape'
    },
    'default': {
        'transform': None,
        'error_metric': 'mae'
    }
}

# Symmetric MAPE function
def smape(actual, forecast):
    return 200 * np.mean(np.abs(forecast - actual) / (np.abs(actual) + np.abs(forecast)))

# Prepare data structures
crime_forecasts = {}
validation_metrics = {}
final_model=None #Declare final_model as global
for crime_type in all_crimes:
    print(f"\n===== Modeling: {crime_type} =====")

    # Skip if no data
    if crime_type not in train_monthly['TYPE'].unique():
        print(f"Skipping {crime_type} - no data")
        continue

    # Get config
    config = crime_config.get(crime_type, crime_config['default'])

    # Extract series
    raw_series = train_monthly[train_monthly['TYPE'] == crime_type]['Incident_Counts']

    # Apply transformation
    if config['transform'] == 'log':
        series = np.log1p(raw_series)
        print("Applied log transformation")
    else:
        series = raw_series.copy()

    # Split data
    train_series = series[series.index <= '2010-12-31']
    val_series = series[series.index >= '2011-01-01']

    # Get parameter ranges from config
    p_range = range(0, 3)
    d_range = config.get('d', range(1, 2))
    q_range = config.get('q', range(0, 3))
    P_range = range(0, 2)
    D_range = range(0, 2)
    Q_range = range(0, 2)

    pdq = list(itertools.product(p_range, d_range, q_range))
    seasonal_pdq = list(itertools.product(P_range, D_range, Q_range, [12]))

    # Grid search
    best_metric = np.inf
    best_params = None

    for param in pdq:
        for seasonal_param in seasonal_pdq:
            try:
                model = SARIMAX(
                    train_series,
                    order=param,
                    seasonal_order=seasonal_param,
                    enforce_stationarity=False,
                    enforce_invertibility=False
                )
                results = model.fit(disp=False, maxiter=200)

                # Validate
                val_forecast = results.get_forecast(steps=len(val_series))
                val_pred = val_forecast.predicted_mean

                # Revert transformation for error calculation
                if config['transform'] == 'log':
                    val_pred_orig = np.expm1(val_pred)
                    val_actual_orig = np.expm1(val_series)
                else:
                    val_pred_orig = val_pred
                    val_actual_orig = val_series

                # Select error metric
                if config['error_metric'] == 'smape':
                    error = smape(val_actual_orig, val_pred_orig)
                else:
                    error = mean_absolute_error(val_actual_orig, val_pred_orig)

                if error < best_metric:
                    best_metric = error
                    best_params = (param, seasonal_param)

            except Exception:
                # Skip parameter combinations that fail to fit or forecast
                continue

    if best_params is None:
        best_params = ((1,1,1), (1,1,1,12))
        print("Using fallback parameters")

    print(f"Best SARIMA{best_params[0]}{best_params[1]} with {config['error_metric'].upper()}: {best_metric:.2f}")

    # Final model with best params
    model = SARIMAX(
        series,
        order=best_params[0],
        seasonal_order=best_params[1]
    )
    final_model = model.fit(disp=False)

    # Forecast
    test_forecast = final_model.get_forecast(steps=18)
    forecast = test_forecast.predicted_mean

    # Revert transformation
    if config['transform'] == 'log':
        forecast = np.expm1(forecast)

    crime_forecasts[crime_type] = forecast
    validation_metrics[crime_type] = best_metric

# Generate submission
forecast_dates = pd.date_range(start='2012-01-01', periods=18, freq='MS')
submission_dfs = []

for crime_type in all_crimes:
    if crime_type in crime_forecasts:
        forecast = crime_forecasts[crime_type]
        crime_df = pd.DataFrame({
            'YEAR': forecast_dates.year,
            'MONTH': forecast_dates.month,
            'TYPE': crime_type,
            'Incident_Counts': forecast.values.round(2)
        })
    else:
        # Handle missing crime type (Offence Against a Person)
        crime_df = pd.DataFrame({
            'YEAR': forecast_dates.year,
            'MONTH': forecast_dates.month,
            'TYPE': crime_type,
            'Incident_Counts': 0.0
        })
    submission_dfs.append(crime_df)

submission = pd.concat(submission_dfs)
submission = submission.sort_values(['YEAR', 'MONTH', 'TYPE']).reset_index(drop=True)
submission.to_csv('fbi_crime_forecast_optimized.csv', index=False)


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
#I implemented a Seasonal AutoRegressive Integrated Moving Average (SARIMA) model for each crime type, with crime-specific optimizations:
'''General Model Structure:
SARIMA(p, d, q)(P, D, Q, 12) where:

p: AutoRegressive order (temporal dependencies)

d: Differencing order (trend removal)

q: Moving Average order (error correction)

P, D, Q: Seasonal equivalents

12: Monthly seasonality'''
# I had to use Log transformation for high-variance crimes (Theft from Vehicle, Theft of Bicycle)

# Visualizing evaluation Metric Score chart


# Data from results
crimes = ['B&E Comm', 'B&E Res', 'Mischief', 'Other Theft',
          'Veh Theft', 'Bike Theft', 'Veh Collision']
metrics = [14.16, 27.31, 22.20, 18.31, 11.96, 10.54, 9.67]  # sMAPE for last 3 converted to MAE equivalents

# Create visualization
fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(crimes, metrics, color=['#1f77b4' if m < 20 else '#ff7f0e' for m in metrics])

# Annotate metric types
for i, bar in enumerate(bars):
    metric_type = 'MAE' if i < 4 else 'sMAPE (%)'
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height()+0.5,
            f'{metrics[i]}\n({metric_type})',
            ha='center', va='bottom', fontsize=9)

ax.set_title('Crime Forecasting Performance by Type', fontsize=14)
ax.set_ylabel('Error Metric Value', fontsize=12)
ax.set_ylim(0, 35)
ax.grid(axis='y', linestyle='--', alpha=0.7)

# Interpretation box
textstr = ('Key Insights:\n'
           '- Vehicle and bicycle theft: <12% sMAPE\n'
           '- B&E Comm, Other Theft: <20 MAE\n'
           '- Collisions: lowest error (9.67)')
props = dict(boxstyle='round', facecolor='wheat', alpha=0.5)
ax.text(0.95, 0.95, textstr, transform=ax.transAxes,
        fontsize=10, va='top', ha='right', bbox=props)

plt.tight_layout()
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

# Hyperparameter tuning was integrated directly into the Model Implementation stage above,
# both because of this project's time constraints and so that each model is fitted with its tuned parameters.

##### Which hyperparameter optimization technique have you used and why?


I employed a `Grid Search` approach for hyperparameter optimization.

**Why Grid Search:**

1.  SARIMA has a small set of integer-valued parameters (`p, d, q, P, D, Q`), so a search grid is easy to define. Grid Search exhaustively tries every combination within this predefined grid, which ensures that the best parameter combination *within that grid* is found.
2. Unlike techniques like Random Search or Bayesian Optimization, Grid Search provides direct control over the parameter space being explored. For SARIMA models, understanding the typical ranges and constraints of the parameters (e.g., `d` and `D` often being 0 or 1 for stationarity, seasonal period being 12 for monthly data) makes a defined grid a sensible approach.
3. This implementation allowed for defining specific parameter grids (`d`, `q`) and error metrics (`mae`, `smape`) for different crime types (like 'Theft from Vehicle' and 'Theft of Bicycle') based on their observed characteristics (high variance, different seasonality needs). Grid search facilitates this kind of targeted tuning.

Grid Search provides a reliable way to find the best parameters within a well-defined search window for each individual time series.
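
To make the size of this search concrete, the sketch below enumerates the default grid using the same ranges as the implementation above. Crime-specific configs may override `d` and `q`, so their grids can differ. It only illustrates the enumeration step, not the fitting loop itself.

```python
import itertools

# Default ranges, mirroring the model implementation above
p_range, d_range, q_range = range(0, 3), range(1, 2), range(0, 3)
P_range, D_range, Q_range = range(0, 2), range(0, 2), range(0, 2)

# Every non-seasonal (p, d, q) and seasonal (P, D, Q, 12) combination
pdq = list(itertools.product(p_range, d_range, q_range))
seasonal_pdq = list(itertools.product(P_range, D_range, Q_range, [12]))

# Each pair is one candidate SARIMA model to fit and validate
n_candidates = len(pdq) * len(seasonal_pdq)
print(f"{len(pdq)} non-seasonal x {len(seasonal_pdq)} seasonal = {n_candidates} candidate models")
```

With these defaults the search fits at most 72 models per crime type, which is small enough for an exhaustive search to stay practical.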

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

1. `Mean Absolute Error` (MAE):
- MAE is used as the default error metric for most crime types.
- MAE is highly intuitive and easy to understand. It represents the average magnitude of errors in a set of forecasts.
- Since the model forecasts 'Incident_Counts', MAE directly reflects the average absolute difference between the predicted number of incidents and the actual number.
- MAE helps in planning and allocating resources (e.g., police patrols, investigative units, community programs). If the MAE for 'Break and Enter Commercial' is 14.16, the monthly forecast is off by about 14 incidents on average.
2. `Symmetric Mean Absolute Percentage Error` (sMAPE):
- sMAPE is specifically used for 'Theft from Vehicle' and 'Theft of Bicycle'.
- Many business stakeholders are accustomed to thinking in terms of percentages. sMAPE provides this kind of intuitive understanding of forecasting accuracy, which can be very impactful for high-level strategic discussions and performance reviews, especially for "vehicle crimes" as highlighted in the key insights.
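
The two metrics can be written out as below. This is a minimal sketch: `mae` and `smape_sketch` are illustrative names, and the sMAPE formula shown is one common definition. The notebook's own `smape` helper is defined elsewhere and may differ in detail, for example in how zero denominators are handled.

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error, in the same units as the data (incident counts)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

def smape_sketch(actual, predicted):
    """Symmetric MAPE in percent: mean of 2|a - f| / (|a| + |f|)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    denom = np.abs(actual) + np.abs(predicted)
    # Assumption: a point where both values are zero contributes zero error
    ratio = np.divide(2.0 * np.abs(actual - predicted), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return float(100.0 * np.mean(ratio))

# Hypothetical monthly counts, for illustration only
actual = [100, 120, 80]
predicted = [110, 115, 90]
print(f"MAE: {mae(actual, predicted):.2f} incidents")
print(f"sMAPE: {smape_sketch(actual, predicted):.2f} %")
```

MAE stays in incident counts, while sMAPE is scale-free, which is why the two are not directly comparable across crime types.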

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

- I chose the `SARIMA` model as my final prediction model because the FBI crime data showed high variance in specific crime types (`Theft from Vehicle` and `Theft of Bicycle`), and SARIMA can be tuned per series to handle such behaviour in the training data.
- A log transformation was also applied to these two crime types to reduce the model's average error.
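
The log transformation mentioned above is applied with `np.log1p` before fitting and undone with `np.expm1` on the forecasts. A minimal sketch with made-up counts (the numbers are illustrative only, not taken from the dataset):

```python
import numpy as np

# Hypothetical monthly counts with a wide spread
counts = np.array([0.0, 12.0, 150.0, 900.0])

# log1p computes log(1 + x), so it is defined even when a month has zero incidents
log_counts = np.log1p(counts)

# expm1 is the exact inverse and returns values to the original count scale
restored = np.expm1(log_counts)

print("Range on the count scale:", counts.max() - counts.min())
print("Range on the log scale:  ", round(float(log_counts.max() - log_counts.min()), 2))
print("Round trip matches:", bool(np.allclose(restored, counts)))
```

Compressing large values this way reduces the influence of high-count months on the fit, at the cost of having to invert the transform before computing errors on the count scale.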

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

In [None]:
# Model and feature-importance explainability

# SARIMA is a statistical time series model, so it has no feature importance in the sense that tree-based or linear regression models do.
# Instead, the fitted model's components and coefficients show which time series patterns matter.

# 'final_model' is the fitted results object from the training loop, so it holds the last crime type fitted there;
# 'crime_type_to_explain' should name that same crime type.

print(f"\n===== Model Explainability for {crime_type_to_explain} (SARIMA) =====")

# Display the model summary
print(final_model.summary())

# Look up the configuration (e.g. the transformation used) for the crime type being explained.
# This assumes final_model was fitted on that crime type's series, not on an aggregate series.
config = crime_config.get(crime_type_to_explain, crime_config['default'])


print("\nIn a SARIMA model:")
print("- p (AR order): Significance indicates dependence on past values.")
print("- d (Differencing order): Number of regular differences applied to remove trend.")
print("- q (MA order): Significance indicates dependence on past forecast errors.")
print("- P (Seasonal AR order): Significance indicates dependence on past seasonal values (12 months ago).")
print("- D (Seasonal Differencing order): Number of seasonal (12-month) differences applied to remove seasonal trend.")
print("- Q (Seasonal MA order): Significance indicates dependence on past seasonal forecast errors.")

print("\nAnalyzing the summary table:")
print("- Look at the 'P>|z|' column for the coefficients (e.g., ar.L1, ma.L1, ar.S.L12, ma.S.L12).")
print("- Small p-values (typically < 0.05) indicate that the corresponding parameter is statistically significant.")
print("- Significant non-seasonal terms (ar.L#, ma.L#) indicate the importance of recent past.")
print("- Significant seasonal terms (ar.S.L12, ma.S.L12) indicate the importance of values from the same month in previous years.")

# Fitted values are on the scale the model was trained on (log scale if transformed)
fitted_values = final_model.fittedvalues

# raw_series is assumed to still hold the untransformed series the model was fitted on
raw_series_crime = raw_series.copy()

# Bring fitted values back to the original count scale where a log transform was used
if config.get('transform') == 'log':
    fitted_plot = np.expm1(fitted_values.values)
    fitted_label = 'Fitted Values (Inverse Log)'
else:
    fitted_plot = fitted_values.values
    fitted_label = 'Fitted Values'

plt.figure(figsize=(12, 5))
plt.plot(raw_series_crime.index, raw_series_crime.values, label='Actual Counts', alpha=0.7)
plt.plot(fitted_values.index, fitted_plot, label=fitted_label, linestyle='--')


plt.title(f'Actual vs. Fitted Incident Counts for {crime_type_to_explain}') # Title based on the specified crime type
plt.xlabel('Date')
plt.ylabel('Incident Counts')
plt.legend()
plt.tight_layout()
plt.show()

# Note: The interpretation of feature importance is done by looking at the p-values in the summary table.
# Significant parameters indicate which past patterns (AR, MA, Seasonal AR, Seasonal MA) are important.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.
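
A minimal sketch of the save-and-reload round trip using Python's built-in `pickle` module. The helper names `save_model` and `load_model` and the file name are illustrative, and a plain dictionary stands in for the fitted model so the example runs on its own. In the notebook, the fitted `final_model` object would be passed instead.

```python
import pickle
import tempfile
from pathlib import Path

def save_model(model, path):
    """Serialize a fitted model object to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    """Load a previously saved model object from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Placeholder standing in for the fitted model (assumption for illustration)
placeholder = {"order": (1, 1, 1), "seasonal_order": (1, 1, 1, 12)}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "sarima_model.pkl"  # hypothetical file name
    save_model(placeholder, model_path)
    restored = load_model(model_path)

print("Round trip preserved the object:", restored == placeholder)

# In the notebook this would look like:
#   save_model(final_model, "sarima_model.pkl")
#   reloaded = load_model("sarima_model.pkl")
#   reloaded.get_forecast(steps=18).predicted_mean
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.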


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**


## Conclusion: FBI Crime Time Series Forecasting

- This project successfully developed and evaluated time series forecasting models for FBI crime data. The analysis revealed significant trends, seasonality, and correlations within the dataset.

- Data preprocessing involved cleaning, handling duplicates, addressing invalid spatial data, and aggregating individual incidents into monthly crime counts by type. This transformation was crucial for creating time series suitable for analysis. Exploratory data analysis identified a general downward trend in total crime from 1999 to 2011, with pronounced seasonal peaks in summer months (Q2/Q3).

- Hypothesis testing statistically confirmed these observations: a significant difference in crime rates between summer and winter months, a structural break in the overall crime trend around 2009, and a significant positive correlation between "Theft from Vehicle" and "Theft of Vehicle". Outlier treatment using Winsorization, Robust Regression, and Mahalanobis Distance was applied where indicated by hypothesis testing results to improve data robustness.

- The core of the forecasting involved implementing SARIMA models for each crime type. Recognizing the unique characteristics of different crime categories, specific optimizations were applied, including logarithmic transformations for high-variance series like `Theft from Vehicle` and `Theft of Bicycle`, and targeted parameter `grid searches` during hyperparameter tuning. Grid search was chosen for its systematic exploration of the defined parameter space, ensuring that optimal SARIMA configurations were found for each time series within the grid.

- The SARIMA models demonstrated reasonable performance based on MAE and sMAPE metrics on the validation set (2011 data). Metrics varied by crime type, reflecting the inherent predictability of each series. Overall, the models captured the historical patterns and provided 18 months of forecasts, from January 2012 through June 2013.

- `Future work` could explore alternative time series models, such as Prophet or Exponential Smoothing variants.

- In summary, this project provides a solid foundation for understanding and forecasting FBI crime trends using time series methods. The insights gained from the analysis and the implemented SARIMA models offer valuable tools for informing resource allocation and strategic planning for crime prevention and response.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***