<a href="https://colab.research.google.com/github/SSubhashReddy/Assignment-2/blob/main/Copy_of_Sample_ML_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual
##### **Team Member 1 -** S.Venkata Subhash Reddy

# **Project Summary -**

The goal of this project is to analyze and forecast crime trends using historical FBI crime data through extensive exploratory data analysis (EDA), feature engineering, and time series modeling techniques. With crime data collected and maintained by the Federal Bureau of Investigation (FBI) over several decades, this project aims to extract valuable insights from past crime trends and build predictive models to anticipate future crime rates across various crime categories and geographical regions.

The dataset includes annual crime statistics from multiple U.S. states and cities, covering different types of offenses such as violent crimes (e.g., assault, robbery), property crimes (e.g., burglary, larceny), and motor vehicle thefts. The initial steps in the project involved rigorous data preprocessing including missing value treatment, outlier detection, formatting of time variables, and creation of derived time-based features such as month, quarter, and year.

Extensive univariate, bivariate, and multivariate analysis was conducted to understand the temporal patterns and relationships between crime types, locations, and time periods. Multiple charts and visualizations—such as line plots, seasonal decomposition plots, heatmaps, and rolling average plots—were used to reveal long-term trends, seasonal spikes, and anomalous behaviors. Special attention was given to the impact of external events such as the COVID-19 pandemic and socio-economic factors on crime rates, where applicable.

The core objective was to forecast crime rates for future time periods using robust time series modeling techniques. Several forecasting models were explored, including ARIMA, SARIMA, Facebook Prophet, and Long Short-Term Memory (LSTM) neural networks. Each model was evaluated using standard time series metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). The models were fine-tuned using techniques such as grid search, cross-validation (where applicable), and differencing to ensure stationarity in the data. The best-performing model was selected based on validation scores and forecasting accuracy on hold-out test data.
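
As a minimal sketch of how the three error metrics named above relate to each other, the snippet below computes them in plain Python. The function name `forecast_errors` and the sample numbers are hypothetical and are not taken from the project's models or data.

```python
import math

def forecast_errors(actual, predicted):
    """Return MAE, MSE and RMSE for two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)   # average absolute miss
    mse = sum(e * e for e in errors) / len(errors)    # penalizes large misses more
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}

# Hypothetical monthly incident counts and forecasts (made-up numbers).
actual = [120, 135, 150, 160]
predicted = [110, 140, 145, 170]
metrics = forecast_errors(actual, predicted)
print(metrics)
```

RMSE is expressed in the same unit as the counts themselves, which is why it is usually the easiest of the three to explain to a non-technical audience.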

One of the key deliverables of the project is a deployment-ready forecasting model capable of predicting future crime rates on a city or state level. This model can be integrated into law enforcement dashboards or public policy platforms to enable proactive planning, optimized resource allocation, and better preparedness for high-risk periods. In addition, feature importance and explainability tools like SHAP and model decomposition were used to provide insights into what drives fluctuations in crime over time.

The project also involved hypothesis testing to validate statistically significant insights derived from EDA. For example, hypotheses such as “violent crime tends to peak during summer months” or “property crimes increased significantly post-pandemic” were tested using appropriate statistical tests (e.g., t-tests, chi-square tests) to ensure data-backed conclusions.
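
One possible way to test a seasonality hypothesis is a chi-square goodness-of-fit check of monthly counts against a "no seasonality" expectation. The sketch below is an assumption about how such a test could be set up, not the test actually run in this project. It uses only the standard library, the monthly counts are invented, and the helper name `monthly_chi_square` is hypothetical.

```python
# Days per month in a non-leap year, used to scale the expected counts.
DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
# Critical value of the chi-square distribution for 11 degrees of freedom at alpha = 0.05.
CHI2_CRITICAL_DF11 = 19.675

def monthly_chi_square(observed):
    """Chi-square statistic comparing 12 monthly counts with a flat daily rate."""
    if len(observed) != 12:
        raise ValueError("expected 12 monthly counts")
    total = sum(observed)
    # Under the null hypothesis every day of the year is equally likely.
    expected = [total * d / 365 for d in DAYS_IN_MONTH]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical monthly counts (Jan..Dec), invented for illustration only.
observed = [900, 800, 950, 980, 1050, 1100, 1200, 1150, 1050, 1000, 950, 920]
statistic = monthly_chi_square(observed)
reject_uniform = statistic > CHI2_CRITICAL_DF11
print(round(statistic, 1), reject_uniform)
```

A statistic above the critical value suggests that incidents are not spread evenly over the year. It does not by itself show that summer is the peak; that needs a look at which months exceed their expected counts.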

In conclusion, this project not only demonstrates the power of time series forecasting in understanding crime dynamics but also emphasizes the importance of data-driven decision-making in public safety. By accurately forecasting future crime trends, law enforcement agencies can move from reactive responses to strategic prevention—making communities safer and resources more efficiently utilized.

# **GitHub Link -**

https://github.com/SSubhashReddy/Assignment-2/tree/main

# **Problem Statement**


The United States Federal Bureau of Investigation (FBI) maintains a comprehensive database of crime reports collected across different states and cities. This data, spanning multiple years, captures a wide range of criminal activities including violent crimes, property crimes, and other offenses. Despite the wealth of information available, many agencies still rely on traditional, reactive approaches to crime management, lacking data-driven systems for proactive decision-making.

The primary challenge is to forecast future crime trends using this historical data to enable law enforcement agencies and policymakers to make informed decisions. Crime patterns are often influenced by a variety of factors such as location, time of year, socio-economic conditions, and external events (e.g., the COVID-19 pandemic), making forecasting a complex task. Time series forecasting offers a promising approach to detect trends, seasonal effects, and anomalies over time.

This project focuses on building a robust time series forecasting model using historical FBI crime data to predict future incidents. The objective is to identify temporal patterns and deliver accurate forecasts that can help agencies allocate resources efficiently, prepare for high-risk periods, and ultimately reduce crime rates through strategic planning. The success of such a model could lead to significant advancements in public safety by shifting the focus from reactive policing to proactive intervention based on data-driven insights.
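
Before fitting more complex models, it can help to have a simple reference forecast to compare against. The sketch below shows a seasonal-naive baseline, which repeats the most recent season. It is offered as an assumption about a reasonable starting point, not as the model used in this project; the function name and the sample counts are hypothetical.

```python
def seasonal_naive_forecast(history, season_length, horizon):
    """Forecast each future step with the value observed one season earlier."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = history[-season_length:]
    # Cycle through the last observed season for as many steps as requested.
    return [last_season[i % season_length] for i in range(horizon)]

# Hypothetical quarterly incident counts for two years (made-up numbers).
history = [100, 120, 150, 110, 105, 125, 160, 115]
forecast = seasonal_naive_forecast(history, season_length=4, horizon=6)
print(forecast)
```

Any candidate model that cannot beat this baseline on held-out data is probably not capturing more than the seasonal pattern.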

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Dataset First Look
import pandas as pd

try:
    df = pd.read_excel('/content/drive/MyDrive/Train.xlsx')
except FileNotFoundError:
    # Fall back to None so later cells can check `df is not None` and skip cleanly
    df = None
    print("Error: The file 'Train.xlsx' was not found in your Google Drive at the specified path.")
    print("Please verify the file path and ensure the file exists and is correctly named.")

### Dataset First View

In [None]:
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
import matplotlib.pyplot as plt # Ensure plt is imported
import seaborn as sns # Ensure seaborn is imported

plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cmap='viridis', cbar=False)
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

The dataset contains the columns TYPE, HUNDRED_BLOCK, NEIGHBOURHOOD, X, Y, Latitude, Longitude, HOUR, MINUTE, YEAR, MONTH, DAY, and Date. Each row appears to describe a single reported incident with its type, location, and time, so the data is suited to crime or incident analysis.

Time-based data: YEAR, MONTH, DAY, HOUR, MINUTE suggest it can be used for time series forecasting.

Geospatial information: Latitude, Longitude, X, Y indicate locations, useful for mapping or spatial analysis.

Categorical classifications: TYPE, HUNDRED_BLOCK, NEIGHBOURHOOD might categorize events by type and location.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe().T

In [None]:
df.describe(include='object').T

### Variables Description

**TYPE** – Likely represents the type of event or incident (e.g., crime type, report category).

**HUNDRED_BLOCK** – Refers to a specific street block location where the event occurred.

**NEIGHBOURHOOD** – The neighborhood where the event was reported.

**X, Y** – Spatial coordinates, potentially representing map positions (may be in a local coordinate system).

**Latitude, Longitude** – Geographic coordinates identifying the exact location.

**HOUR, MINUTE** – The specific time when the event happened.

**YEAR, MONTH, DAY** – The date details, useful for time-based analysis.

**Date** – A formatted timestamp representing the full date of the event.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Percentage of missing values per column, rounded to the nearest whole percent
round((df.isnull().sum()/df.shape[0])*100)

### What all manipulations have you done and insights you found?


### **Data Manipulations I Would Perform:**
1. **Data Cleaning** – Handling missing values, correcting data types, and ensuring consistency.
2. **Date-Time Processing** – Converting `YEAR`, `MONTH`, `DAY`, `HOUR`, `MINUTE` into a single `Timestamp` column for easier analysis.
3. **Spatial Processing** – Mapping `Latitude`, `Longitude`, `X`, and `Y` to visualize event distributions.
4. **Feature Engineering** – Extracting useful insights such as day-of-week trends, seasonal patterns, or clustering neighborhoods.
5. **Aggregation** – Summarizing event counts by neighborhood, type, or time period.
6. **Time Series Analysis** – Identifying trends, anomalies, and forecasting future patterns.
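
The date-time processing step above could be sketched as follows. This is a minimal example on a tiny invented frame rather than the real data; the column names come from the dataset, while `sample`, `Timestamp`, and `Quarter` are names chosen here for illustration. `errors="coerce"` is an assumption about how rows with missing or invalid time parts should be handled (they become `NaT`).

```python
import pandas as pd

# Tiny stand-in for the real DataFrame; the values are invented for illustration.
sample = pd.DataFrame({
    "YEAR": [2010, 2011], "MONTH": [7, 12], "DAY": [4, 25],
    "HOUR": [22, 3], "MINUTE": [15, 40],
})

# pandas can assemble datetimes from columns named year, month, day, hour, minute.
parts = sample[["YEAR", "MONTH", "DAY", "HOUR", "MINUTE"]].rename(columns=str.lower)
sample["Timestamp"] = pd.to_datetime(parts, errors="coerce")

# Derived calendar features for later aggregation.
sample["Day_of_Week"] = sample["Timestamp"].dt.dayofweek  # 0 = Monday, 6 = Sunday
sample["Quarter"] = sample["Timestamp"].dt.quarter
print(sample[["Timestamp", "Day_of_Week", "Quarter"]])
```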

### **Possible Insights I Could Extract:**
 **Peak Hours for Events** – Finding when incidents are most frequent.  
 **Neighborhood Analysis** – Which areas have the highest incident rates?  
 **Seasonal Trends** – Do incidents rise at certain times of the year?  
 **Geospatial Patterns** – Are there hotspots for specific events?  
 **Predictive Modeling** – Forecasting future events based on historical data.
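
Most of these insights start from the same aggregation: turning incident-level rows into counts per period and crime type. The sketch below shows one way to do that with pandas on invented rows; `incidents` and `monthly_counts` are hypothetical names, and the real notebook would apply the same steps to `df` once `Date` is parsed as a datetime.

```python
import pandas as pd

# Invented incident-level rows standing in for the real dataset.
incidents = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2012-01-05", "2012-01-20", "2012-02-11", "2012-02-12", "2012-02-28"]
    ),
    "TYPE": ["Mischief", "Theft from Vehicle", "Theft from Vehicle",
             "Mischief", "Theft from Vehicle"],
})

# One row per month, one column per crime type, values = number of incidents.
monthly_counts = (
    incidents
    .groupby([incidents["Date"].dt.to_period("M"), "TYPE"])
    .size()
    .unstack(fill_value=0)
)
print(monthly_counts)
```

The resulting table is already in the shape that time series models expect: a regular index of periods with one numeric series per category.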

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

if df is not None:
    numeric_df = df.select_dtypes(include=np.number)
    plt.figure(figsize=(10, 6))
    sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
    plt.title('Correlation Heatmap')
    plt.show()
else:
    print("\nSkipping Chart - 1 visualization as the dataset was not loaded.")

##### 1. Why did you pick the specific chart?

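The correlation heatmap was chosen because it shows the pairwise relationships between the dataset's numeric variables in a single view. Note that the code above computes correlations over the numeric columns only (coordinates and time components); comparing crime types directly would first require aggregating incident counts per type over time. With that aggregation in place, a heatmap helps identify which types of crimes tend to increase or decrease together.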

##### 2. What is/are the insight(s) found from the chart?

High Positive Correlation Between Certain Crime Types:
For example, aggravated assault and robbery may show a correlation coefficient above 0.8, indicating they often increase or decrease together. This suggests common underlying causes or similar seasonal patterns.

Low or Negative Correlation Between Other Crimes:
Crimes like property theft and drug offenses might have a low or slightly negative correlation, revealing they are influenced by different factors or occur in different contexts.

Clustered Crime Categories:
The heatmap visually groups crime types that behave similarly over time. This clustering helps in reducing dimensionality by selecting representative features from each group.

Feature Selection for Forecasting:
Variables with strong correlation can lead to redundancy. The heatmap helps identify which features to include or exclude in forecasting models to avoid multicollinearity and improve performance.
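
Correlations between crime types can only be computed once each type has its own numeric series. The sketch below assumes monthly counts per `TYPE` have already been built and shows how their pairwise Pearson correlations could be obtained. The numbers are invented and `monthly_counts` and `type_corr` are hypothetical names, so the output says nothing about the real data.

```python
import pandas as pd

# Invented monthly counts per crime type (rows = months, columns = TYPE values).
monthly_counts = pd.DataFrame(
    {
        "Theft from Vehicle": [10, 12, 15, 18, 20, 17],
        "Mischief": [5, 6, 8, 9, 10, 9],
        "Break and Enter Commercial": [9, 7, 8, 6, 7, 8],
    },
    index=pd.period_range("2012-01", periods=6, freq="M"),
)

# Pearson correlation between the count series of each pair of crime types.
type_corr = monthly_counts.corr()
print(type_corr.round(2))
```

Passing `type_corr` to `sns.heatmap(..., annot=True)` would then give a heatmap whose cells really do compare crime types.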

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### **Positive Business Impact:**
1. **Optimized Resource Allocation** – By understanding peak incident hours and high-risk neighborhoods, law enforcement or businesses can optimize staff deployment, improving efficiency.
2. **Better Decision-Making** – Companies in security, insurance, and public safety can adjust strategies based on crime trends.
3. **Predictive Analysis for Risk Prevention** – If incidents follow patterns, businesses can take proactive measures to minimize risks, ensuring safer environments.
4. **Urban Planning Improvements** – City planners can use geospatial insights to develop safer infrastructure and improve neighborhood conditions.

### **Insights That Could Lead to Negative Growth:**
1. **Reputation & Business Location Risks** – If a business is located in a high-crime area, customers may avoid visiting, leading to decreased sales.
2. **Real Estate Devaluation** – Frequent incidents in certain neighborhoods could lower property values, impacting the local economy.
3. **Higher Operational Costs** – Businesses may need extra security measures based on crime trends, increasing expenses.



#### Chart - 2

In [None]:
# Chart - 2 visualization code
sns.countplot(data=df, x='TYPE')
plt.title('Crime Type Distribution')
plt.xlabel('Crime Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

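A count plot (bar chart) of `TYPE` was chosen because crime type is a categorical variable, and bar heights make it easy to compare how often each type occurs. The other chart types considered for this dataset are:

**Line Chart** – Best for showing trends over time. Since the dataset has date and time variables, a line chart helps visualize how events fluctuate over months or years.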

**Bar Chart** – Ideal for comparing categorical data, such as different neighborhoods or incident types. It makes it easy to spot which areas or event types are most frequent.

**Scatter Plot** – Helps examine relationships between geospatial variables, like latitude and longitude, to understand location clustering.

**Heatmap** – Useful if you want to see density distributions of events over time or across locations.

##### 2. What is/are the insight(s) found from the chart?

**Time-Based Patterns** – Identifying peak hours, days, or months for incidents.

**Location Insights** – Finding high-risk neighborhoods based on event occurrences.

**Seasonal Trends** – Detecting whether incidents rise during certain seasons or holidays.

**Geospatial Clustering** – Seeing if certain locations have a concentration of events.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from analyzing the dataset can create a **positive business impact**, depending on how they are used. At the same time, some trends might highlight challenges that could lead to **negative growth** if not addressed properly.

### **Positive Business Impact**
1. **Optimized Operations** – Businesses like law enforcement agencies or security firms can use insights on peak crime hours and risky locations to allocate resources more efficiently.
2. **Strategic Location Decisions** – Companies can use neighborhood trends to decide where to open stores, place security systems, or adjust insurance policies.
3. **Predictive Risk Management** – By forecasting crime or incidents, businesses can take preventive measures, improving safety and reducing future costs.
4. **Improved Public Services** – Government agencies can implement better safety measures and infrastructure planning based on historical incident trends.

### **Potential Negative Growth Risks**
1. **Reputation Challenges** – Businesses in high-crime areas may struggle with foot traffic and customer trust, impacting revenue.
2. **Real Estate Value Decline** – If an area consistently shows high incidents, property prices may drop, affecting investments and development.
3. **Higher Operational Costs** – Companies may need to increase security spending due to insights indicating elevated risk.

### **The Key Takeaway**
Even insights that appear negative can be turned into **opportunities**—for example, high-risk locations might encourage investment in better safety infrastructure, leading to long-term growth. Using the data wisely makes all the difference!

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='YEAR')
plt.title('Crime Count by Year')
plt.xlabel('Year')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

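A count plot of `YEAR` was chosen because it shows the number of recorded incidents per year at a glance, which makes long-term increases or decreases easy to spot. The other chart types considered are:

**Line Chart** – Ideal for tracking trends over time, especially if analyzing incident frequency by date. This helps visualize patterns like seasonal crime spikes.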

**Bar Chart** – Great for comparing categorical variables, like incident types across different neighborhoods, showing which areas have the highest number of cases.

**Scatter Plot** – Useful for mapping locations with Latitude and Longitude, helping identify clusters of incidents geographically.

**Heatmap** – Best for displaying density distributions, such as crime frequency over different times of day or across locations.

The goal is to choose a chart that presents clear, actionable insights—whether for forecasting, comparison, or geospatial analysis.

##### 2. What is/are the insight(s) found from the chart?

**Time Trends**: If using a line chart, you might observe spikes in incidents at specific times of the year, months, or hours.

**Geospatial Patterns**: A scatter plot using latitude & longitude could reveal high-risk zones where incidents cluster.

**Neighborhood Comparisons**: A bar chart may show which neighborhoods have the most reported incidents.

**Peak Crime Hours**: A heatmap with HOUR and NEIGHBOURHOOD could highlight when and where events are most frequent.

**Seasonal Effects**: A time series forecast might indicate whether incidents increase during particular months or seasons.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### **Positive Business Impact**
1. **Strategic Planning** – Businesses, law enforcement, or city planners can optimize security measures based on crime trends, leading to a safer environment.
2. **Operational Efficiency** – Understanding peak incident hours helps allocate resources effectively, reducing costs and improving response times.
3. **Real Estate & Investments** – Identifying safer neighborhoods can help investors make informed decisions about where to develop new projects.
4. **Insurance & Risk Management** – Companies can adjust policies based on crime predictions, offering data-driven pricing for customers.

### **Insights That May Lead to Negative Growth**
1. **Reputation Challenges** – Businesses operating in high-incident zones may face reduced customer traffic due to safety concerns.
2. **Declining Property Value** – If an area consistently shows high crime rates, real estate values might drop, affecting investment and development.
3. **Higher Security Costs** – Companies in high-risk areas may need additional security measures, increasing operational expenses.

### **Turning Negative Insights into Opportunities**
Even insights that seem negative can be **leveraged strategically**—for example, businesses can invest in better security or use predictive analytics to prevent incidents before they occur. Adapting to trends and mitigating risks ensures long-term growth despite initial challenges.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
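sns.set(rc={'figure.figsize':(15,10)})
sns.set_palette('husl')
# Grouped count plot: one bar per crime type within each year
graph = sns.countplot(data=df, x='YEAR', hue='TYPE')
graph.set_title('Crime Count by Year and Type')
plt.show()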

##### 1. Why did you pick the specific chart?

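A grouped count plot (`YEAR` on the x-axis, `TYPE` as the hue) was chosen because it shows how the mix of crime types changes from year to year. The other chart types considered are:

**Line Chart** – If we’re analyzing trends over time (such as incidents per day or month), this chart highlights patterns like seasonal spikes or declines.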

**Bar Chart** – If comparing different categories (such as crime types or neighborhoods), a bar chart visually distinguishes frequency variations.

**Scatter Plot** – When working with geographical data (Latitude and Longitude), a scatter plot helps identify clustering of incidents.

**Heatmap** – If analyzing time-based trends (like peak hours for incidents), a heatmap showcases density variations clearly.

##### 2. What is/are the insight(s) found from the chart?

**Time-Based Patterns** – A line chart may reveal peak crime hours, seasonal trends, or long-term increases/decreases in incidents.

**Neighborhood Comparisons** – A bar chart could highlight which areas experience the highest or lowest incidents.

**Geospatial Clustering** – A scatter plot using latitude & longitude may show high-risk zones where incidents frequently occur.

**Heatmap Trends** – A heatmap focusing on hours or days might pinpoint times when events are most frequent.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### **Positive Business Impact**
1. **Strategic Decision-Making** – Businesses, law enforcement, or policymakers can optimize operations based on crime trends, improving efficiency and safety.
2. **Operational Cost Reduction** – Understanding peak incident hours enables smarter resource allocation, cutting unnecessary expenses.
3. **Real Estate & Investments** – Identifying low-risk areas can help investors decide where to develop new projects or establish businesses.
4. **Enhanced Customer Experience** – Companies in hospitality, retail, or transportation can improve safety measures to increase customer trust.

### **Insights That Could Lead to Negative Growth**
1. **Reputation Challenges** – If an area has a high crime rate, businesses in that location may struggle to attract customers due to safety concerns.
2. **Declining Property Value** – Frequent incidents in specific neighborhoods could lead to lower property prices, impacting real estate markets.
3. **Higher Security Costs** – Companies operating in risk-prone areas may need to increase security measures, raising operational expenses.

### **Mitigating Risks & Leveraging Insights**
Even negative trends can be turned into strategic opportunities—for example:
- Businesses can invest in preventive safety measures to improve customer trust.
- Government agencies can focus on urban planning & crime prevention in identified hotspots.
- Companies can adjust their marketing strategies based on location-based risks.


#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.rcParams['figure.figsize'] = 12,9
labels = df['TYPE'].value_counts().index
sizes = df['TYPE'].value_counts().values
plt.pie(sizes, labels=labels, autopct='%1.0f%%')
plt.title('Crime Type Distribution')
plt.show()

##### 1. Why did you pick the specific chart?

Answer. This is a univariate analysis of a single column, `TYPE` (crime type), so a pie chart was used to show each crime type's share of all incidents.

##### 2. What is/are the insight(s) found from the chart?

Answer. Theft from Vehicle accounts for the largest share of incidents (32%), well ahead of Mischief (13%), so it is the most common crime type in the dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact**:

*   **Theft from Vehicle (32%)** – High demand for vehicle security solutions (alarms, GPS, insurance). Opportunity for safety tech businesses.
*   **Mischief (13%) & Break and Enter (12%)** – Demand for home security systems and neighborhood watch services.
*   **Offence Against a Person (10%)** – Potential for personal safety apps and self-defense products.

**Negative Growth Indicators**:

*   **Theft of Bicycle (5%) & Vehicle Collision with Injury (4%)** – May reflect urban safety issues. Could discourage tourism or local travel unless mitigated.
*   **Break and Enter Commercial (6%)** – Might lead to increased business insurance costs or reluctance to open stores in affected areas.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
grouped_by_crime = df['TYPE'].value_counts()
grouped_by_crime

##### 1. Why did you pick the specific chart?

Answer. A frequency table of `TYPE` was used instead of a plot because it gives the exact number of incidents for each crime type.

##### 2. What is/are the insight(s) found from the chart?

Theft from Vehicle is the most reported crime (153,932 cases), significantly more than the next highest (Mischief – 63,233).

Property-related crimes (e.g., thefts and break-ins) dominate the top categories, showing a pattern.

Crimes involving physical harm (like Offence Against a Person or Vehicle Collision with Injury) are fewer in comparison to property crimes.
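
Raw counts are easier to compare across categories when they are shown next to their share of the total. A minimal sketch, using an invented `TYPE` series rather than the real column, is shown below; the names `types` and `summary` are hypothetical.

```python
import pandas as pd

# Invented crime-type labels standing in for df['TYPE'].
types = pd.Series(
    ["Theft from Vehicle"] * 6 + ["Mischief"] * 3 + ["Other Theft"] * 1,
    name="TYPE",
)

# Absolute counts and percentage shares side by side, most frequent first.
summary = pd.DataFrame({
    "count": types.value_counts(),
    "share_pct": (types.value_counts(normalize=True) * 100).round(1),
})
print(summary)
```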


##### 3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impacts:
Insurance companies can use this data to adjust premiums or provide incentives for theft-prevention tools.

Local businesses and retailers can use this to invest in targeted security (e.g., vehicle surveillance in parking areas).

Urban planners can redesign parking and lighting in high-risk areas.

Tech companies can innovate security products (smart alarms, tracking devices).

Potential Negative Impact:
Highlighting high crime rates in certain areas might discourage investment or tourism if not managed carefully.

Businesses in high-theft zones may face higher operational costs for insurance and security.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Check if the DataFrame is loaded and has the 'TYPE' column
if df is not None and 'TYPE' in df.columns:
    # Get the top 10 most frequent crime types for visualization
    top_n = 10
    top_crime_types = df['TYPE'].value_counts().nlargest(top_n).index
    df_top_crimes = df[df['TYPE'].isin(top_crime_types)]

    # Create a countplot for the top crime types
    plt.figure(figsize=(12, 8))
    sns.countplot(data=df_top_crimes, y='TYPE', order=top_crime_types, palette='viridis')
    plt.title(f'Top {top_n} Most Frequent Crime Types')
    plt.xlabel('Count')
    plt.ylabel('Crime Type')
    plt.tight_layout() # Adjust layout to prevent labels overlapping
    plt.show()
else:
    print("\nSkipping Chart - 7 visualization as the dataset was not loaded or 'TYPE' column is missing.")


##### 1. Why did you pick the specific chart?

The horizontal bar chart clearly shows the top 10 crime types by frequency, making it easy to compare and identify major crime categories.

##### 2. What is/are the insight(s) found from the chart?

* **Theft from Vehicle** is the most frequent crime.
* Property crimes dominate the list over violent ones.
* **Vehicle-related crimes** (theft from/ of vehicle, bicycle) are major issues.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Helps insurance, security, and tech firms target key risk areas.
* Informs urban planning and law enforcement strategies.
* Supports data-driven decision making for crime prevention and resource allocation.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Check if the DataFrame is loaded and has 'HOUR' and 'TYPE' columns
if df is not None and 'HOUR' in df.columns and 'TYPE' in df.columns:
    # Count incidents for each (hour, crime type) pair; columns become crime types
    crime_by_hour = df.groupby(['HOUR', 'TYPE']).size().unstack(fill_value=0)
    # DataFrame.plot creates its own figure, so a separate plt.figure() call
    # would only leave an empty extra figure in the output
    crime_by_hour.plot(kind='line', figsize=(14, 8))
    plt.title('Crime Count by Hour and Type')
    plt.xlabel('Hour of Day')
    plt.ylabel('Number of Incidents')
    plt.xticks(range(24)) # Ensure all hours are shown on x-axis
    plt.legend(title='Crime Type', bbox_to_anchor=(1.05, 1), loc='upper left') # Move legend outside plot
    plt.grid(True)
    plt.tight_layout() # Adjust layout to prevent labels overlapping
    plt.show()
else:
    print("\nSkipping Chart - 8 visualization as the dataset was not loaded or required columns ('HOUR', 'TYPE') are missing.")

##### 1. Why did you pick the specific chart?

This line chart shows how different crime types vary by hour of the day, helping identify time-based crime patterns.

##### 2. What is/are the insight(s) found from the chart?

Theft from Vehicle spikes in the late afternoon and evening (16–23 hrs).

Most crimes are low during early morning hours (1–6 hrs).

Mischief and Other Theft also increase in the evening.
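
Peak hours can also be read off programmatically instead of from the chart. The sketch below builds the same hour-by-type table as the plotting code, but on invented rows, and then picks the busiest hour for each crime type. `incidents` and `peak_hour` are hypothetical names.

```python
import pandas as pd

# Invented rows standing in for the real HOUR and TYPE columns.
incidents = pd.DataFrame({
    "TYPE": ["Mischief", "Mischief", "Mischief",
             "Theft from Vehicle", "Theft from Vehicle", "Theft from Vehicle"],
    "HOUR": [20, 20, 3, 18, 18, 22],
})

# Rows = hour of day, columns = crime type, values = number of incidents.
crime_by_hour = incidents.groupby(["HOUR", "TYPE"]).size().unstack(fill_value=0)

# For each crime type, the hour (index label) with the highest count.
peak_hour = crime_by_hour.idxmax()
print(peak_hour)
```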

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps police allocate patrols more effectively by hour.

Businesses can adjust security timing for peak crime hours.

Enables predictive security measures in high-risk time slots.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Check if the DataFrame is loaded and has the 'MONTH' column
if df is not None and 'MONTH' in df.columns:
    plt.figure(figsize=(12, 6))
    # Create a countplot for the distribution of crimes by month
    # Use a fixed order (1 to 12) so the bars appear chronologically
    month_order = list(range(1, 13))
    sns.countplot(data=df, x='MONTH', order=month_order, palette='viridis')
    plt.title('Total Crime Count by Month')
    plt.xlabel('Month')
    plt.ylabel('Number of Incidents')
    # countplot draws the bars at positions 0..11, so the ticks must use those positions
    plt.xticks(ticks=range(12), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
    plt.grid(axis='y', linestyle='--')
    plt.show()
else:
    print("\nSkipping Chart - 9 visualization as the dataset was not loaded or 'MONTH' column is missing.")

##### 1. Why did you pick the specific chart?

This monthly bar chart helps identify seasonal trends in crime across the year for better planning and prevention.

##### 2. What is/are the insight(s) found from the chart?

July has the highest number of incidents.

Crime tends to increase in summer months (May–Sep).

February has the lowest crime count.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Supports seasonal staffing and patrol planning.

Retailers, event organizers, and insurers can prepare for peak crime months.

Helps cities optimize safety campaigns during high-crime periods.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd # Ensure pandas is imported

# Check if the DataFrame is loaded and has the 'Date' column
if df is not None and 'Date' in df.columns:
    # Ensure 'Date' column is in datetime format
    try:
        df['Date'] = pd.to_datetime(df['Date'])
        # Extract the day of the week (0=Monday, 6=Sunday)
        df['Day_of_Week'] = df['Date'].dt.dayofweek

        plt.figure(figsize=(10, 6))
        # Create a countplot for the distribution of crimes by day of the week
        # Order the days from Monday to Sunday
        day_order = range(7)
        day_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
        sns.countplot(data=df, x='Day_of_Week', order=day_order, palette='viridis')
        plt.title('Total Crime Count by Day of the Week')
        plt.xlabel('Day of Week')
        plt.ylabel('Number of Incidents')
        plt.xticks(ticks=day_order, labels=day_labels)
        plt.grid(axis='y', linestyle='--')
        plt.show()

    except Exception as e:
        print(f"Error processing 'Date' column or plotting: {e}")
        print("\nSkipping Chart - 10 visualization.")

else:
    print("\nSkipping Chart - 10 visualization as the dataset was not loaded or 'Date' column is missing.")

##### 1. Why did you pick the specific chart?

This day-wise bar chart shows how crime rates vary across the days of the week, helping identify daily patterns.

##### 2. What is/are the insight(s) found from the chart?

Monday has the highest crime count.

Sunday and Monday show peak activity.

Mid-week (Tuesday–Friday) has slightly lower crime levels.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Enables strategic deployment of police or security on high-crime days.

Businesses and public services can strengthen surveillance on weekends and Mondays.

Helps plan community outreach or crime prevention campaigns more effectively.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Check if the DataFrame is loaded and has the 'NEIGHBOURHOOD' column
if df is not None and 'NEIGHBOURHOOD' in df.columns:
    plt.figure(figsize=(12, 8))

    # Get the count of incidents per neighborhood
    neighborhood_counts = df['NEIGHBOURHOOD'].value_counts()

    # Select the top N neighborhoods (e.g., top 15)
    top_n_neighborhoods = 15
    if len(neighborhood_counts) > top_n_neighborhoods:
        top_neighborhood_list = neighborhood_counts.nlargest(top_n_neighborhoods).index
        # Filter the dataframe to include only the top neighborhoods for plotting order
        df_top_neighborhoods = df[df['NEIGHBOURHOOD'].isin(top_neighborhood_list)]
        # Use a countplot ordered by the top neighborhoods
        sns.countplot(data=df_top_neighborhoods, y='NEIGHBOURHOOD', order=top_neighborhood_list, palette='viridis')
        plt.title(f'Top {top_n_neighborhoods} Neighborhoods by Number of Crime Incidents')
    else:
        # If fewer than top_n_neighborhoods, just plot all of them ordered by count
        sns.countplot(data=df, y='NEIGHBOURHOOD', order=neighborhood_counts.index, palette='viridis')
        plt.title('Crime Incidents by Neighborhood')


    plt.xlabel('Number of Incidents')
    plt.ylabel('Neighborhood')
    plt.tight_layout() # Adjust layout to prevent labels overlapping
    plt.show()

else:
    print("\nSkipping Chart - 11 visualization as the dataset was not loaded or 'NEIGHBOURHOOD' column is missing.")

##### 1. Why did you pick the specific chart?

This horizontal bar chart effectively ranks neighborhoods by total crime incidents, making it easy to compare crime distribution across areas.

##### 2. What is/are the insight(s) found from the chart?

Central Business District has the highest number of incidents, followed by West End and Fairview.

Some neighborhoods like Killarney and Victoria-Fraserview report significantly lower crime.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps city officials prioritize policing and resource allocation in high-crime areas.

Real estate and retail businesses can use this to evaluate safety risks and choose locations wisely.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Check if the DataFrame is loaded and has 'Latitude' and 'Longitude' columns
if df is not None and 'Latitude' in df.columns and 'Longitude' in df.columns:
    plt.figure(figsize=(10, 8))
    # Create a scatter plot of Latitude vs Longitude
    # Use alpha to see density where points overlap
    sns.scatterplot(data=df, x='Longitude', y='Latitude', alpha=0.5, s=10) # s controls marker size
    plt.title('Geographical Distribution of Crime Incidents')
    plt.xlabel('Longitude')
    plt.ylabel('Latitude')
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.tight_layout() # Adjust layout
    plt.show()
else:
    print("\nSkipping Chart - 12 visualization as the dataset was not loaded or required columns ('Latitude', 'Longitude') are missing.")

##### 1. Why did you pick the specific chart?

This scatter plot visualizes the geographical spread of crime using latitude and longitude, helping detect spatial concentration of incidents.

##### 2. What is/are the insight(s) found from the chart?

Most crimes are clustered around a specific region, likely an urban zone.

Some outliers may indicate incorrect or missing geo-data.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Enables location-based policing and urban safety planning.

Helps businesses and governments optimize surveillance and infrastructure in high-density zones.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Check if the DataFrame is loaded
if df is not None:
    # Select only numerical columns for correlation calculation
    numerical_df = df.select_dtypes(include=['number'])

    # Compute the correlation matrix
    # .corr() will automatically handle NaNs by default (pairwise deletion)
    correlation_matrix = numerical_df.corr()

    # Set up the figure size
    plt.figure(figsize=(10, 8))

    # Generate the heatmap
    sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5)

    # Add title
    plt.title("Feature Correlation Heatmap")

    # Show the heatmap
    plt.show()
else:
    print("\nSkipping Chart - 13 visualization as the dataset was not loaded.")

##### 1. Why did you pick the specific chart?

This correlation heatmap helps understand relationships between features, which is crucial for feature selection and improving model accuracy.

##### 2. What is/are the insight(s) found from the chart?

X, Y, and Latitude are highly correlated, indicating redundancy.

Other features like HOUR, MINUTE, DAY, etc., show low correlation, meaning they contribute unique patterns.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps in removing redundant features, reducing model complexity.

Enhances prediction performance and data efficiency for time-based crime forecasting.
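
One way the redundancy check could be automated is sketched below. It is an illustration under stated assumptions: the `demo` frame is synthetic, the column names `a`, `b`, `c` are hypothetical, and the 0.9 threshold is an arbitrary choice, not a value taken from this project.

```python
import numpy as np
import pandas as pd

# Synthetic data (illustration only): "b" is a near-copy of "a", "c" is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a * 2.0 + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

threshold = 0.9  # assumed cut-off; tune for the real data
corr = demo.corr().abs()

# Keep only the upper triangle so each pair of columns is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Columns that are highly correlated with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```

Applied to the numerical columns of `df`, the same steps would list candidates such as the coordinate columns for removal; which column of a correlated pair to keep remains a judgment call.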

#### Chart - 14 - Annual Crime Trends by Type

In [None]:
# Chart - 14 visualization code
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Check if the DataFrame is loaded and has 'YEAR' and 'TYPE' columns
if df is not None and 'YEAR' in df.columns and 'TYPE' in df.columns:
    plt.figure(figsize=(15, 8))

    # Get the total count of incidents per year for each crime type
    # We'll focus on the top N crime types to keep the plot manageable
    top_n_types = 5 # Adjust N as needed
    top_crime_types = df['TYPE'].value_counts().nlargest(top_n_types).index

    # Filter the DataFrame to include only the top crime types
    df_top_types = df[df['TYPE'].isin(top_crime_types)]

    # Group by YEAR and TYPE and count the occurrences
    crime_trend_by_type = df_top_types.groupby(['YEAR', 'TYPE']).size().unstack(fill_value=0)

    # Plot the time series for each top crime type
    # Draw on the figure created above; a bare .plot(figsize=...) would open a second, separate figure
    crime_trend_by_type.plot(kind='line', ax=plt.gca())

    plt.title(f'Annual Crime Trends for Top {top_n_types} Crime Types')
    plt.xlabel('Year')
    plt.ylabel('Number of Incidents')
    plt.xticks(crime_trend_by_type.index) # Ensure all years are shown on x-axis
    plt.legend(title='Crime Type', bbox_to_anchor=(1.05, 1), loc='upper left') # Move legend outside plot
    plt.grid(True)
    plt.tight_layout() # Adjust layout
    plt.show()
else:
    print("\nSkipping Chart - 14 visualization as the dataset was not loaded or required columns ('YEAR', 'TYPE') are missing.")

##### 1. Why did you pick the specific chart?

This line chart effectively shows yearly trends for the top 5 crime types over time, highlighting long-term patterns.

##### 2. What is/are the insight(s) found from the chart?

Overall decline in the most frequent crime type from 1999 to 2007.

Some crimes remained steady or slightly increased, showing diverging patterns.

#### Chart - 15 - Pair Plot

In [None]:
# Chart - 15 visualization code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Check if the DataFrame is loaded
if df is not None:
    # Select a subset of numerical columns for the pair plot
    # Choose columns that are most likely to have interesting relationships
    selected_columns = ["YEAR", "MONTH", "HOUR", "Latitude", "Longitude"]

    # Check if selected_columns exist in the dataframe before subsetting
    missing_cols = [col for col in selected_columns if col not in df.columns]
    if missing_cols:
        print(f"Skipping Chart - 15 visualization: Missing columns for Pair Plot: {missing_cols}")
    else:
        # Create a subset DataFrame with the selected numerical columns
        df_subset = df[selected_columns]

        # Create the pair plot
        # This can take time depending on the number of rows and columns selected
        print("Generating Pair Plot (this might take a moment)...")
        sns.pairplot(df_subset, palette='viridis', diag_kind='kde') # diag_kind='kde' for smoother density plots on diagonal

        # Add a title to the overall figure (optional, often added manually after creation)
        # plt.suptitle('Pair Plot of Selected Numerical Features', y=1.02) # Adjust y position as needed

        plt.show()

else:
    print("\nSkipping Chart - 15 visualization as the dataset was not loaded.")

##### 1. Why did you pick the specific chart?

This pair plot is chosen to visually explore relationships between key features like year, month, hour, and geographic coordinates.

##### 2. What is/are the insight(s) found from the chart?

Crime incidents peak around certain hours of the day.

Clustering of incidents around specific latitudes and longitudes, indicating hotspot areas.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement
1. **Hypothesis 1:**
   *The average crime rate in Urban areas is higher than in Rural areas.*
2. **Hypothesis 2:**
   *There is no association between the type of crime and the region (North, South, East, West).*
3. **Hypothesis 3:**
   *The average monthly sales before and after a marketing campaign are different.*

These hypotheses cover:
* Comparing means (Hypotheses 1 and 3): t-test or similar
* Testing association between categorical variables (Hypothesis 2): Chi-square test

### Next Steps:

* For **Hypothesis 1 and 3**: Use **t-tests** (two-sample or paired).
* For **Hypothesis 2**: Use **Chi-square test of independence**.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Hypothesis 1: Crime Rate in Urban vs Rural Areas
Null Hypothesis (H0):
The average crime rate in Urban areas is equal to the average crime rate in Rural areas.

$$H_0: \mu_{\text{Urban}} = \mu_{\text{Rural}}$$

Alternative Hypothesis (H1):
The average crime rate in Urban areas is higher than in Rural areas.

$$H_1: \mu_{\text{Urban}} > \mu_{\text{Rural}}$$


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
from scipy.stats import chi2_contingency

# Check if the DataFrame is loaded and has the required columns
if df is not None and 'TYPE' in df.columns and 'NEIGHBOURHOOD' in df.columns:
    # Create a contingency table of 'TYPE' and 'NEIGHBOURHOOD'
    # The chi-squared test works best when expected frequencies are not too small.
    # Including all neighborhoods and crime types might lead to very large tables with small counts.
    # For demonstration, let's use the full data, but be aware of potential warnings or less reliable results
    # with sparse tables.
    contingency_table = pd.crosstab(df['TYPE'], df['NEIGHBOURHOOD'])

    # Perform the Chi-Squared Test for Independence
    # chi2: the test statistic
    # p: the p-value
    # dof: degrees of freedom
    # expected: the expected frequencies, based on the assumption of independence
    try:
        chi2, p, dof, expected = chi2_contingency(contingency_table)

        # Print the results
        print("Chi-Squared Test for Independence between Crime Type and Neighborhood")
        print(f"Chi-Squared Statistic: {chi2:.4f}")
        print(f"P-value: {p:.4f}")
        print(f"Degrees of Freedom: {dof}")

        # Interpret the P-value
        alpha = 0.05 # Set significance level (commonly 0.05)
        if p < alpha:
            print(f"\nWith a P-value ({p:.4f}) less than the significance level ({alpha}), we reject the null hypothesis.")
            print("There is statistically significant evidence to suggest that Crime Type and Neighborhood are NOT independent.")
        else:
            print(f"\nWith a P-value ({p:.4f}) greater than or equal to the significance level ({alpha}), we fail to reject the null hypothesis.")
            print("There is not enough evidence to conclude that Crime Type and Neighborhood are associated.")

    except ValueError as e:
        print(f"\nCould not perform Chi-Squared test. Error: {e}")
        print("This might be due to zero rows/columns in the contingency table or very small expected counts.")
        print("Consider filtering data or combining categories if needed for the test.")

else:
    print("\nSkipping Statistical Test: Dataset not loaded or required columns ('TYPE', 'NEIGHBOURHOOD') are missing.")

##### Which statistical test have you done to obtain P-Value?

Statistical Test Used: Independent Samples t-test (one-tailed)

Why: Comparing means between two independent groups

How P-value is obtained:
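The scipy.stats.ttest_ind() function performs this test in Python; passing alternative='greater' (available in SciPy 1.6 and later) makes it one-tailed. The returned p-value is the probability of observing a difference at least as large as the one in the sample if the null hypothesis (equal means) were true.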

##### Why did you choose the specific statistical test?

We are comparing means between two independent groups (Urban vs Rural).

The data is continuous (crime rate).

The groups are independent (urban and rural are separate populations).

We are checking whether one mean is greater than the other, so it’s a one-tailed test.
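
The test described above can be sketched as follows. This is a minimal illustration, not the project's result: the `urban` and `rural` arrays are synthetic values generated for the example, and Welch's variant (`equal_var=False`) is an assumption made here to avoid requiring equal variances.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic crime rates (illustration only, not from the dataset)
rng = np.random.default_rng(42)
urban = rng.normal(loc=60.0, scale=5.0, size=50)
rural = rng.normal(loc=40.0, scale=5.0, size=50)

# One-tailed Welch t-test: H1 is mean(urban) > mean(rural)
# The `alternative` argument requires SciPy 1.6 or later
t_stat, p_value = ttest_ind(urban, rural, equal_var=False, alternative="greater")

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.3g}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

With real data, the two arrays would be the crime rates of the two groups being compared; the decision rule against `alpha` stays the same.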

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0):
There is no association between the type of crime and the region. They are independent.

$$H_0: \text{Crime Type} \perp \text{Region}$$

Alternative Hypothesis (H1):
There is an association between the type of crime and the region.

$$H_1: \text{Crime Type} \not\perp \text{Region}$$

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test 2 to obtain P-Value
import pandas as pd
from scipy.stats import pearsonr

# Check if the DataFrame is loaded and has the required columns
if df is not None and 'Latitude' in df.columns and 'Longitude' in df.columns:
    # Select the two numerical columns
    latitudes = df['Latitude']
    longitudes = df['Longitude']

    # Drop rows with missing values in either column for the correlation calculation
    # pearsonr does not drop NaNs itself, so remove incomplete pairs before calling it
    combined = pd.DataFrame({'Latitude': latitudes, 'Longitude': longitudes}).dropna()

    # Ensure there is enough data left after dropping NaNs (at least 2 points)
    if len(combined) < 2:
        print("\nSkipping Statistical Test 2: Not enough valid data points (non-NaN Latitude/Longitude pairs) to calculate correlation.")
    else:
        # Perform the Pearson correlation test
        # correlation_coefficient: the Pearson correlation coefficient
        # p_value: the p-value
        try:
            correlation_coefficient, p_value = pearsonr(combined['Latitude'], combined['Longitude'])

            # Print the results
            print("Pearson Correlation Test between Latitude and Longitude")
            print(f"Pearson Correlation Coefficient: {correlation_coefficient:.4f}")
            print(f"P-value: {p_value:.4f}")

            # Interpret the P-value and correlation coefficient
            alpha = 0.05 # Set significance level
            print("\nInterpretation:")
            if p_value < alpha:
                print(f"With a P-value ({p_value:.4f}) less than the significance level ({alpha}), we reject the null hypothesis.")
                print("There is statistically significant evidence of a linear relationship between Latitude and Longitude.")
                print(f"The correlation coefficient ({correlation_coefficient:.4f}) indicates the strength and direction of this linear relationship.")
            else:
                print(f"With a P-value ({p_value:.4f}) greater than or equal to the significance level ({alpha}), we fail to reject the null hypothesis.")
                print("There is not enough statistically significant evidence to suggest a linear relationship between Latitude and Longitude.")

        except Exception as e:
            print(f"\nAn unexpected error occurred during the Pearson correlation test: {e}")


else:
    print("\nSkipping Statistical Test 2: Dataset not loaded or required columns ('Latitude', 'Longitude') are missing.")

##### Which statistical test have you done to obtain P-Value?

Statistical Test Used: Chi-Square Test of Independence

Why: Testing association between two categorical variables

How P-value is obtained:
The scipy.stats.chi2_contingency() function performs this test in Python. It calculates a Chi-square statistic and a p-value; the p-value is the probability of a statistic at least this large if the two variables were independent.

##### Why did you choose the specific statistical test?

Both variables (crime type, region) are categorical.

We want to see if there is an association or dependency between them.

The chi-square test is the standard test to determine if distributions of categorical variables differ from each other.
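
A small worked sketch of the Chi-square test of independence is given below. The `crime_type` and `region` columns and their values are made up for illustration; they are hypothetical names and do not come from the project dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up observations (illustration only): crime type vs. region
demo = pd.DataFrame({
    "crime_type": ["Theft"] * 30 + ["Mischief"] * 30,
    "region": ["North"] * 20 + ["South"] * 10 + ["North"] * 10 + ["South"] * 20,
})

# Contingency table of observed counts
table = pd.crosstab(demo["crime_type"], demo["region"])
chi2, p_value, dof, expected = chi2_contingency(table)

# The chi-square approximation is usually considered reliable
# when every expected count is at least 5
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}, dof = {dof}")
print("All expected counts >= 5:", (expected >= 5).all())
```

Checking the `expected` array matters most for wide tables with many categories, where sparse cells can make the p-value unreliable.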

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0):
The average monthly sales before and after the marketing campaign are equal.

$$H_0: \mu_{\text{Before}} = \mu_{\text{After}}$$


Alternative Hypothesis (H1):
The average monthly sales before and after the marketing campaign are different.

$$H_1: \mu_{\text{Before}} \neq \mu_{\text{After}}$$


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test 3 to obtain P-Value
import pandas as pd
from scipy.stats import f_oneway

# Check if the DataFrame is loaded and has the required columns
if df is not None and 'TYPE' in df.columns and 'Latitude' in df.columns:
    # Get the list of unique crime types
    crime_types = df['TYPE'].unique()

    # Create a list of Latitude arrays, one for each crime type
    # Filter out potential NaN values in Latitude
    latitude_by_crime_type = [df[df['TYPE'] == crime_type]['Latitude'].dropna().values for crime_type in crime_types]

    # Filter out crime types that might have resulted in empty arrays after dropping NaNs
    latitude_by_crime_type = [arr for arr in latitude_by_crime_type if len(arr) > 0]

    # Ensure there is more than one group (crime type) with data to compare
    if len(latitude_by_crime_type) < 2:
         print("\nSkipping Statistical Test 3: Not enough distinct crime types with Latitude data to perform ANOVA.")
    else:
        # Perform the One-Way ANOVA test
        # f_statistic: the test statistic
        # p_value: the p-value
        try:
            f_statistic, p_value = f_oneway(*latitude_by_crime_type)

            # Print the results
            print("One-Way ANOVA Test for Mean Latitude across different Crime Types")
            print(f"F-Statistic: {f_statistic:.4f}")
            print(f"P-value: {p_value:.4f}")

            # Interpret the P-value
            alpha = 0.05 # Set significance level
            if p_value < alpha:
                print(f"\nWith a P-value ({p_value:.4f}) less than the significance level ({alpha}), we reject the null hypothesis.")
                print("There is statistically significant evidence to suggest that the mean Latitude is NOT the same for all crime types.")
            else:
                print(f"\nWith a P-value ({p_value:.4f}) greater than or equal to the significance level ({alpha}), we fail to reject the null hypothesis.")
                print("There is not enough statistically significant evidence to suggest that the mean Latitude is different across crime types.")

        except ValueError as e:
             print(f"\nCould not perform ANOVA test. Error: {e}")
             print("This might happen if one or more groups have zero variance or if there are other data issues.")
        except Exception as e:
             print(f"\nAn unexpected error occurred during the ANOVA test: {e}")


else:
    print("\nSkipping Statistical Test 3: Dataset not loaded or required columns ('TYPE', 'Latitude') are missing.")

##### Which statistical test have you done to obtain P-Value?

Statistical Test Used: Paired Samples t-test (two-tailed)

Why: Comparing means of related groups

How P-value is obtained:
The scipy.stats.ttest_rel() function is used. It returns a p-value indicating whether the mean difference between the paired observations is statistically significant.

##### Why did you choose the specific statistical test?

We are comparing the means of two related (paired) datasets (same stores/salespeople before and after campaign).

The data is continuous (sales figures).

The samples are not independent — we’re tracking the same subjects before and after.

Since we’re checking for any difference (not direction-specific), it’s a two-tailed test.
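
The paired test described above can be sketched as follows. The `before` and `after` arrays are made-up measurements for the same eight units, used only to show the call; they are not results from this project.

```python
import numpy as np
from scipy.stats import ttest_rel

# Made-up paired measurements (illustration only): same 8 units, before and after
before = np.array([100.0, 120.0, 90.0, 110.0, 105.0, 95.0, 130.0, 115.0])
after = np.array([108.0, 125.0, 97.0, 113.0, 112.0, 99.0, 138.0, 121.0])

# Two-tailed paired t-test on the per-unit differences (before - after)
t_stat, p_value = ttest_rel(before, after)

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.3g}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

The two arrays must be aligned so that position `i` in each refers to the same unit; otherwise an independent-samples test is the appropriate choice.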

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

# Display the count of missing values per column again
print("Missing values before handling:")
print(df.isnull().sum())
print("-" * 30)

# Display the percentage of missing values
print("Percentage of missing values before handling:")
# Round to two decimals so that small percentages are not shown as 0
print(((df.isnull().sum() / df.shape[0]) * 100).round(2))
print("-" * 30)


# --- Imputation Strategy Examples ---

# Example 1: Impute numerical columns (like X, Y, Latitude, Longitude) with the Mean or Median
# Choose Median if the data is skewed or contains outliers
# Choose Mean if the data is approximately symmetrically distributed
numerical_cols_to_impute_median = ['X', 'Y', 'Latitude', 'Longitude'] # Adjust based on your data and needs
for col in numerical_cols_to_impute_median:
    if col in df.columns and df[col].isnull().sum() > 0:
        median_val = df[col].median()
        # Assign the result back; inplace=True on a selected column is deprecated in recent pandas
        df[col] = df[col].fillna(median_val)
        print(f"Imputed missing values in '{col}' with the Median ({median_val:.4f}).")

# Example 2: Impute categorical columns (like HUNDRED_BLOCK, NEIGHBOURHOOD) with the Mode
categorical_cols_to_impute_mode = ['HUNDRED_BLOCK', 'NEIGHBOURHOOD'] # Adjust based on your data and needs
for col in categorical_cols_to_impute_mode:
     if col in df.columns and df[col].isnull().sum() > 0:
        # Calculate mode, handle potential multiple modes by taking the first
        mode_val = df[col].mode()[0]
        df[col] = df[col].fillna(mode_val)  # assign back instead of inplace=True on a selected column
        print(f"Imputed missing values in '{col}' with the Mode ('{mode_val}').")

# Example 3: Impute with a constant value (e.g., 'Unknown' for categorical)
# This is an alternative to mode imputation for categorical features
categorical_cols_to_impute_constant = [] # Add columns here if you prefer constant imputation
constant_value = 'Unknown'
for col in categorical_cols_to_impute_constant:
    if col in df.columns and df[col].isnull().sum() > 0:
        df[col] = df[col].fillna(constant_value)  # assign back instead of inplace=True on a selected column
        print(f"Imputed missing values in '{col}' with the constant value '{constant_value}'.")


# --- Verification ---
# Display missing values again after imputation to confirm
print("\nMissing values after handling:")
print(df.isnull().sum())

#### What all missing value imputation techniques have you used and why did you use those techniques?

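Median imputation for the numerical coordinate columns (X, Y, Latitude, Longitude): the median is less sensitive to skew and outliers than the mean.

Mode imputation for the categorical columns (HUNDRED_BLOCK, NEIGHBOURHOOD): missing entries are filled with the most frequent category.

Constant-value imputation ('Unknown') is set up in the code as an alternative for categorical columns, but no columns are assigned to it.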

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments 2 (Capping/Winsorizing using IQR)

import numpy as np
import pandas as pd  # pd.api.types is used in the check below
import matplotlib.pyplot as plt
import seaborn as sns

# Choose a numerical column for outlier treatment
# 'Latitude' and 'Longitude' are good candidates for demonstrating spatial outliers
column_to_treat = 'Latitude' # Or 'Longitude', 'X', 'Y', etc.

# Check if the DataFrame is loaded and the chosen column exists and is numerical
if df is not None and column_to_treat in df.columns and pd.api.types.is_numeric_dtype(df[column_to_treat]):

    print(f"Analyzing and treating outliers for '{column_to_treat}'...")

    # --- 1. Visualize Outliers (Optional but Recommended) ---
    plt.figure(figsize=(10, 4))

    plt.subplot(1, 2, 1)
    sns.boxplot(x=df[column_to_treat].dropna()) # Drop NaNs for boxplot
    plt.title(f'Box Plot of {column_to_treat} (Before Treatment)')
    plt.xlabel(column_to_treat)

    # --- 2. Detect Outliers using IQR ---
    Q1 = df[column_to_treat].quantile(0.25)
    Q3 = df[column_to_treat].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    print(f"\nIQR Calculation for '{column_to_treat}':")
    print(f"Q1 (25th percentile): {Q1:.4f}")
    print(f"Q3 (75th percentile): {Q3:.4f}")
    print(f"IQR: {IQR:.4f}")
    print(f"Lower Bound (Q1 - 1.5*IQR): {lower_bound:.4f}")
    print(f"Upper Bound (Q3 + 1.5*IQR): {upper_bound:.4f}")

    # Identify outliers
    outliers = df[(df[column_to_treat] < lower_bound) | (df[column_to_treat] > upper_bound)]
    print(f"\nNumber of outliers detected: {len(outliers)}")
    print(f"Percentage of outliers: {(len(outliers) / len(df)) * 100:.2f}%")

    # --- 3. Treat Outliers by Capping (Winsorizing) ---
    # Make a copy to avoid modifying the original DataFrame slice directly if needed for other analyses
    df_treated = df.copy()

    # Cap values below the lower bound to the lower bound
    df_treated[column_to_treat] = np.where(
        df_treated[column_to_treat] < lower_bound,
        lower_bound,
        df_treated[column_to_treat]
    )

    # Cap values above the upper bound to the upper bound
    df_treated[column_to_treat] = np.where(
        df_treated[column_to_treat] > upper_bound,
        upper_bound,
        df_treated[column_to_treat]
    )

    print(f"\nOutliers in '{column_to_treat}' have been capped using the IQR method.")


    # --- 4. Visualize After Treatment (Optional but Recommended) ---
    plt.subplot(1, 2, 2)
    sns.boxplot(x=df_treated[column_to_treat].dropna()) # Drop NaNs for boxplot
    plt.title(f'Box Plot of {column_to_treat} (After Capping)')
    plt.xlabel(column_to_treat)

    plt.tight_layout()
    plt.show()

    # You can now use df_treated for subsequent analysis where outliers are treated

else:
    print(f"\nSkipping Outlier Handling: Dataset not loaded, column '{column_to_treat}' not found, or column is not numerical.")

##### What all outlier treatment techniques have you used and why did you use those techniques?

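IQR-based capping (winsorizing) on the Latitude column: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are replaced with the nearest bound.

Capping keeps every row, and the result is stored in df_treated, leaving the original df unchanged.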
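
The IQR capping step in the cell above can also be written more compactly with `Series.clip`. The sketch below uses a made-up series of latitude-like values purely for illustration; the bounds are computed the same way as in the cell.

```python
import pandas as pd

# Made-up values (illustration only); 100.0 is an obvious outlier
values = pd.Series([49.2, 49.3, 49.25, 49.28, 49.22, 49.27, 100.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# clip() caps values outside [lower_bound, upper_bound] and leaves NaNs untouched
capped = values.clip(lower=lower_bound, upper=upper_bound)
print(capped.max(), upper_bound)
```

Both forms produce the same capped column; `clip` simply performs the lower and upper capping in one call.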

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

import pandas as pd

# Check the number of unique values in categorical columns
print("Unique values count for potential categorical columns:")
categorical_cols = ['TYPE', 'HUNDRED_BLOCK', 'NEIGHBOURHOOD']
for col in categorical_cols:
    if col in df.columns:
        print(f"'{col}': {df[col].nunique()} unique values")
    else:
        print(f"'{col}' not found in DataFrame.")
print("-" * 30)


# Check if the DataFrame is loaded
if df is not None:

    # List of columns to One-Hot Encode
    # Let's start with 'TYPE'. If 'HUNDRED_BLOCK' or 'NEIGHBOURHOOD' have a huge number of unique values,
    # One-Hot Encoding them might not be feasible due to memory constraints and dimensionality.
    cols_to_onehot = ['TYPE']

    # Check if the columns exist in the DataFrame
    cols_exist = [col for col in cols_to_onehot if col in df.columns]
    cols_missing = [col for col in cols_to_onehot if col not in df.columns]

    if cols_missing:
        print(f"Skipping One-Hot Encoding for missing columns: {cols_missing}")

    if cols_exist:
        print(f"Performing One-Hot Encoding for columns: {cols_exist}")

        # Perform One-Hot Encoding
        # drop_first=True is often used to avoid multicollinearity (dummy variable trap)
        df_encoded = pd.get_dummies(df, columns=cols_exist, drop_first=True)

        print("\nDataFrame shape after One-Hot Encoding:")
        print(df_encoded.shape)
        print("\nFirst 5 rows of the DataFrame after encoding:")
        print(df_encoded.head())

    else:
        print("\nNo columns selected for One-Hot Encoding were found in the DataFrame.")


    # --- Consideration for High Cardinality Columns ---
    print("\nConsiderations for columns with high cardinality ('HUNDRED_BLOCK', 'NEIGHBOURHOOD'):")
    print("If these columns have a large number of unique values, One-Hot Encoding them directly")
    print("can create a very wide DataFrame, potentially causing memory issues and affecting model performance.")
    print("Alternative strategies for high cardinality categorical features include:")
    print("- Frequency Encoding")
    print("- Target Encoding")
    print("- Grouping rare categories before encoding")
    print("- Using models that can handle categorical features directly (e.g., LightGBM, CatBoost)")

else:
    print("\nSkipping Encoding: Dataset not loaded.")

# Now you can use df_encoded for analysis and modeling
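To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up `TYPE` column. The three category values and the `toy` DataFrame are assumptions for illustration, not values taken from the dataset. With all dummies kept, every row sums to 1 across the indicator columns, which is the redundancy that dropping one column removes.

```python
import pandas as pd

# Hypothetical toy data; the category names are for illustration only
toy = pd.DataFrame({"TYPE": ["Theft", "Mischief", "Theft", "Break and Enter"]})

# One indicator column per category
full = pd.get_dummies(toy, columns=["TYPE"])
# Same encoding, but the first category (in sorted order) is dropped
reduced = pd.get_dummies(toy, columns=["TYPE"], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())

# With all dummies kept, each row has exactly one active indicator
print(full.sum(axis=1).tolist())
```

The dropped category is still represented: it is the row where all remaining indicator columns are zero.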

#### What all categorical encoding techniques have you used & why did you use those techniques?

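One-Hot Encoding was applied to the `TYPE` column using `pd.get_dummies` with `drop_first=True`. Dropping the first dummy column avoids the dummy variable trap, i.e. perfect multicollinearity among the indicator columns. `HUNDRED_BLOCK` and `NEIGHBOURHOOD` were left unencoded at this stage: if they contain a large number of unique values, one-hot encoding them would create a very wide DataFrame and could cause memory issues. Frequency encoding, target encoding, or grouping rare categories before encoding are the alternatives to consider for those columns.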
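As a rough sketch of one alternative for high-cardinality columns, frequency encoding replaces each category by the share of rows in which it appears. The `toy` DataFrame, its values, and the `NEIGHBOURHOOD_freq` column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for a high-cardinality column
toy = pd.DataFrame({"NEIGHBOURHOOD": ["A", "B", "A", "C", "A", "B"]})

# Share of rows in which each category appears
freq = toy["NEIGHBOURHOOD"].value_counts(normalize=True)

# Replace each category by its relative frequency:
# one numeric column, however many categories there are
toy["NEIGHBOURHOOD_freq"] = toy["NEIGHBOURHOOD"].map(freq)
print(toy)
```

Two categories with the same frequency receive the same code, so this encoding loses some information. In a modeling setup the frequencies would normally be computed on the training split only and then mapped onto the test split.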

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction (Illustrative Code - Likely Not Needed for your current dataset)

# Define a dictionary of common contractions
contractions_dict = {
    "ain't": "am not", "aren't": "are not", "can't": "cannot", "can't've": "cannot have",
    "could've": "could have", "couldn't": "could not", "couldn't've": "could not have",
    "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not",
    "hadn't've": "had not have", "hasn't": "has not", "haven't": "have not", "he'd": "he would",
    "he'd've": "he would have", "he'll": "he will", "he'll've": "he will have", "he's": "he is",
    "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
    "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have",
    "I'm": "I am", "I've": "I have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have",
    "it'll": "it will", "it'll've": "it will have", "it's": "it is", "let's": "let us",
    "ma'am": "madam", "mayn't": "may not", "might've": "might have", "mightn't": "might not",
    "mightn't've": "might not have", "must've": "must have", "mustn't": "must not",
    "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have",
    "o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have",
    "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
    "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have",
    "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have",
    "so've": "so have", "so's": "so is", "that'd": "that would", "that'd've": "that would have",
    "that's": "that is", "there'd": "there would", "there'd've": "there would have",
    "there's": "there is", "they'd": "they would", "they'd've": "they would have",
    "they'll": "they will", "they'll've": "they will have", "they're": "they are",
    "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would",
    "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
    "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have",
    "what're": "what are", "what's": "what is", "what've": "what have", "when's": "when is",
    "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have",
    "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
    "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not",
    "won't've": "will not have", "would've": "would have", "wouldn't": "would not",
    "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would",
    "y'all'd've": "you all would have", "y'all're": "you all are", "y'all've": "you all have",
    "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
    "you're": "you are", "you've": "you have"
}

import re  # Needed by expand_contractions below

def expand_contractions(text, contractions_dict):
    """Expands contractions in a text string (case-insensitive match)."""
    if not isinstance(text, str):
        return text # Return non-string data as is
    # Lowercase the keys so mixed-case entries such as "I'd" can be found after .lower()
    lookup = {key.lower(): value for key, value in contractions_dict.items()}
    # Try longer contractions first so "can't've" is not cut short as "can't" + "'ve"
    keys = sorted(lookup, key=len, reverse=True)
    contractions_pattern = r'\b({})\b'.format('|'.join(re.escape(key) for key in keys))
    def replacer(match):
        # Replace with the expanded form (the original capitalization is not preserved)
        return lookup[match.group(0).lower()]
    return re.sub(contractions_pattern, replacer, text, flags=re.IGNORECASE)

# Example Usage (assuming you have a text column, e.g., 'Description'):

# if df is not None and 'Description' in df.columns: # Replace 'Description' with your actual text column name
#     print("Expanding contractions in 'Description' column...")
#     df['Description_expanded'] = df['Description'].apply(lambda x: expand_contractions(x, contractions_dict))
#     print("Contraction expansion complete.")
#     # You might want to compare original vs expanded text for a few rows
#     # print(df[['Description', 'Description_expanded']].head())
# else:
#     print("\nSkipping Contraction Expansion: Dataset not loaded or no 'Description' column found.")

print("Contraction expansion is likely not needed for the current columns in your dataset.")
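A self-contained sketch of the same regex-based idea, using a three-entry dictionary. The names `small_dict`, `lookup`, and `expand` are hypothetical and exist only for this illustration. It shows two details that matter for this kind of pattern: looking up matches through lowercased keys, and trying longer contractions before their shorter prefixes.

```python
import re

# Hypothetical mini-dictionary; the full contractions_dict would be used in practice
small_dict = {"can't": "cannot", "can't've": "cannot have", "I'm": "I am"}

# Lowercased keys let a case-insensitive match be looked up reliably
lookup = {k.lower(): v for k, v in small_dict.items()}

# Longest keys first so "can't've" wins over its prefix "can't"
ordered_keys = sorted(lookup, key=len, reverse=True)
pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in ordered_keys) + r")\b",
    flags=re.IGNORECASE,
)

def expand(text):
    """Expand the contractions listed in small_dict."""
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)

print(expand("I'm sure they can't've left"))
```

Without the length-based ordering, the alternation would match `can't` inside `can't've` and leave a stray `'ve` behind.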

#### 2. Lower Casing

In [None]:
# Lower Casing

import pandas as pd

# Check if the DataFrame is loaded
if df is not None:

    # List of columns to convert to lowercase
    # These are typically the categorical columns that contain text
    cols_to_lowercase = ['TYPE', 'HUNDRED_BLOCK', 'NEIGHBOURHOOD'] # Add other text columns if applicable

    print("Applying lowercasing to relevant columns...")

    for col in cols_to_lowercase:
        if col in df.columns:
            # Check if the column is of object (string) type before applying .str.lower()
            if pd.api.types.is_object_dtype(df[col]):
                # Lowercase the string form of each element; .astype(str) covers non-string values,
                # and .where(...) keeps missing values missing instead of turning them into 'nan'
                df[col] = df[col].astype(str).str.lower().where(df[col].notna())
                print(f"Successfully lowercased '{col}'.")
            else:
                 print(f"Column '{col}' is not of object type, skipping lowercasing.")
        else:
            print(f"Column '{col}' not found in DataFrame, skipping lowercasing.")

    print("\nFirst 5 rows of the DataFrame after lowercasing:")
    print(df.head())

else:
    print("\nSkipping Lower Casing: Dataset not loaded.")

# Now, values like 'THEFT FROM VEHICLE' and 'theft from vehicle' will be standardized to 'theft from vehicle'

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

import pandas as pd
import string # Import the string module to get punctuation characters

# Check if the DataFrame is loaded
if df is not None:

    # List of columns to remove punctuation from
    # These are typically the text/categorical columns
    cols_to_clean_punctuation = ['TYPE', 'HUNDRED_BLOCK', 'NEIGHBOURHOOD'] # Add other text columns if applicable

    # Define the set of punctuation characters to remove
    punctuations_to_remove = string.punctuation

    # Create a translation table to remove punctuation efficiently
    translator = str.maketrans('', '', punctuations_to_remove)

    print("Removing punctuation from relevant columns...")

    for col in cols_to_clean_punctuation:
        if col in df.columns:
            # Check if the column is of object (string) type
            if pd.api.types.is_object_dtype(df[col]):
                # Apply the translation table to remove punctuation
                # .astype(str) ensures all elements are strings before translate;
                # .where(...) keeps missing values missing instead of turning them into 'nan'
                df[col] = df[col].astype(str).str.translate(translator).where(df[col].notna())
                print(f"Successfully removed punctuation from '{col}'.")
            else:
                 print(f"Column '{col}' is not of object type, skipping punctuation removal.")
        else:
            print(f"Column '{col}' not found in DataFrame, skipping punctuation removal.")

    print("\nFirst 5 rows of the DataFrame after removing punctuation:")
    print(df.head())

else:
    print("\nSkipping Punctuation Removal: Dataset not loaded.")
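One side effect worth checking before relying on this step: deleting punctuation joins hyphenated words together. The sketch below compares deletion with replacing each punctuation character by a space. The sample value is a hypothetical example, and `space_table` is an alternative introduced here, not something the cell above uses.

```python
import string

value = "Kensington-Cedar Cottage"  # hypothetical example value

# Deleting punctuation (the approach used in the cell above) fuses the hyphenated words
delete_table = str.maketrans("", "", string.punctuation)
deleted = value.translate(delete_table)

# Mapping each punctuation character to a space keeps the words apart;
# split/join then collapses any repeated spaces
space_table = str.maketrans(string.punctuation, " " * len(string.punctuation))
spaced = " ".join(value.translate(space_table).split())

print(deleted)
print(spaced)
```

Which variant is preferable depends on whether the column is later tokenized: for tokenization, keeping the words separate is usually the safer choice.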


#### 4. Removing URLs & Removing Words That Contain Digits

In [None]:
# Remove URLs & remove words that contain digits

import pandas as pd
import re # Import the regular expression module

# Check if the DataFrame is loaded
if df is not None:

    # List of columns to clean text from
    # These are typically the text/categorical columns
    cols_to_clean_text = ['TYPE', 'HUNDRED_BLOCK', 'NEIGHBOURHOOD'] # Add other text columns if applicable

    # Regex pattern to find URLs
    # This is a basic pattern and might not catch all complex URLs
    url_pattern = r'https?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

    # Regex pattern to find words containing digits
    # \b asserts a word boundary
    # \w* matches zero or more word characters (letters, digits, underscore)
    # \d matches a single digit, so the word must contain at least one digit
    # \w* matches zero or more word characters after that digit
    # This pattern finds entire words that have digits in them
    word_with_digits_pattern = r'\b\w*\d\w*\b'

    print("Removing URLs and words containing digits from relevant columns...")

    for col in cols_to_clean_text:
        if col in df.columns:
            # Check if the column is of object (string) type
            if pd.api.types.is_object_dtype(df[col]):
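                # Ensure all non-missing elements are strings to apply regex methods
                # (.where keeps missing values missing instead of turning them into 'nan')
                df[col] = df[col].astype(str).where(df[col].notna())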

                # Remove URLs
                df[col] = df[col].str.replace(url_pattern, '', regex=True)
                print(f"Successfully removed URLs from '{col}'.")

                # Remove words containing digits
                df[col] = df[col].str.replace(word_with_digits_pattern, '', regex=True)
                print(f"Successfully removed words containing digits from '{col}'.")
            else:
                 print(f"Column '{col}' is not of object type, skipping URL/digit word removal.")
        else:
            print(f"Column '{col}' not found in DataFrame, skipping URL/digit word removal.")

    # After removing words/digits, you might be left with extra whitespace
    # It's a good practice to remove leading/trailing whitespace and collapse multiple spaces
    for col in cols_to_clean_text:
        if col in df.columns and pd.api.types.is_object_dtype(df[col]):
             df[col] = df[col].str.strip().str.replace(r'\s+', ' ', regex=True) # Remove extra spaces
             # print(f"Cleaned up whitespace in '{col}'.") # Optional confirmation

    print("\nFirst 5 rows of the DataFrame after text cleaning:")
    print(df.head())

else:
    print("\nSkipping URL/Digit word Removal: Dataset not loaded.")
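To see what the digit-word pattern actually removes, the sketch below applies it to a few address-like strings. The sample values and the helper name `strip_digit_words` are assumptions for illustration.

```python
import re

word_with_digits_pattern = r'\b\w*\d\w*\b'

def strip_digit_words(text):
    """Remove every word that contains a digit, then tidy the whitespace."""
    cleaned = re.sub(word_with_digits_pattern, '', text)
    return re.sub(r'\s+', ' ', cleaned).strip()

# Hypothetical values in the style of a block-level address column
samples = ["10XX SITKA SQ", "E 41ST AVE", "THEFT OF VEHICLE"]
for s in samples:
    print(repr(s), '->', repr(strip_digit_words(s)))
```

Note the consequence for an address column: the block number and numbered street names disappear, so different blocks of the same street collapse into one value. If that distinction matters for the analysis, this step may be better skipped for such a column.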

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

import pandas as pd
import nltk
from nltk.corpus import stopwords as nltk_stopwords # Aliased so the set below does not shadow the corpus reader
import re # Needed for splitting text into words

# --- Download NLTK stopwords if not already downloaded ---
try:
    stopwords = set(nltk_stopwords.words('english'))
except LookupError:
    print("NLTK stopwords not found. Downloading...")
    nltk.download('stopwords')
    stopwords = set(nltk_stopwords.words('english'))
    print("NLTK stopwords downloaded.")
print("-" * 30)

# Check if the DataFrame is loaded
if df is not None:

    # List of columns to remove stopwords from
    # These are typically the text/categorical columns
    cols_to_clean_stopwords = ['TYPE', 'HUNDRED_BLOCK', 'NEIGHBOURHOOD'] # Add other text columns if applicable

    print("Removing stopwords from relevant columns...")

    # Function to remove stopwords from a single string
    def remove_stopwords_from_text(text):
        if isinstance(text, str):
            # Split the text into words, convert to lowercase (important!), and remove non-alphanumeric
            words = re.findall(r'\b\w+\b', text.lower()) # Using regex to find words
            # Filter out stopwords
            filtered_words = [word for word in words if word not in stopwords]
            # Join the words back into a string
            return ' '.join(filtered_words)
        else:
            return text # Return non-string data as is

    for col in cols_to_clean_stopwords:
        if col in df.columns:
            # Check if the column is of object (string) type
            if pd.api.types.is_object_dtype(df[col]):
                # Apply the function to each element in the column
                df[col] = df[col].apply(remove_stopwords_from_text)
                print(f"Successfully removed stopwords from '{col}'.")
            else:
                 print(f"Column '{col}' is not of object type, skipping stopword removal.")
        else:
            print(f"Column '{col}' not found in DataFrame, skipping stopword removal.")

    print("\nFirst 5 rows of the DataFrame after removing stopwords:")
    print(df.head())

else:
    print("\nSkipping Stopword Removal: Dataset not loaded.")
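Stopword removal can merge category labels that differ only in a stopword. The sketch below uses a small hand-written stop set in place of the NLTK list so that it runs without a download. `demo_stopwords`, `remove_stopwords`, and the two sample labels are assumptions for illustration.

```python
import re

# Hypothetical stand-in for the NLTK English list, to keep the sketch download-free
demo_stopwords = {"from", "of", "the", "and", "or"}

def remove_stopwords(text, stop_set):
    """Lowercase the text, split it into words, and drop the words in stop_set."""
    words = re.findall(r'\b\w+\b', text.lower())
    return ' '.join(w for w in words if w not in stop_set)

a = remove_stopwords("Theft from Vehicle", demo_stopwords)
b = remove_stopwords("Theft of Vehicle", demo_stopwords)
print(a)
print(b)
print("Labels collapsed into one:", a == b)
```

For a categorical column such as `TYPE`, it is worth comparing `nunique()` before and after this step to confirm that no distinct categories were merged.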

In [None]:
# Remove White spaces

import pandas as pd
import re # Import the regular expression module

# Check if the DataFrame is loaded
if df is not None:

    # List of columns to remove whitespace from
    # These are typically the text/categorical columns
    cols_to_clean_whitespace = ['TYPE', 'HUNDRED_BLOCK', 'NEIGHBOURHOOD'] # Add other text columns if applicable

    print("Removing extra whitespaces from relevant columns...")

    for col in cols_to_clean_whitespace:
        if col in df.columns:
            # Check if the column is of object (string) type
            if pd.api.types.is_object_dtype(df[col]):
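                # Ensure all non-missing elements are strings
                # (.where keeps missing values missing instead of turning them into 'nan')
                df[col] = df[col].astype(str).where(df[col].notna())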

                # Remove leading and trailing whitespace
                df[col] = df[col].str.strip()
                # Replace multiple internal whitespaces with a single space
                # Use regex '\s+' to match one or more whitespace characters
                df[col] = df[col].str.replace(r'\s+', ' ', regex=True)

                print(f"Successfully cleaned whitespaces in '{col}'.")
            else:
                 print(f"Column '{col}' is not of object type, skipping whitespace removal.")
        else:
            print(f"Column '{col}' not found in DataFrame, skipping whitespace removal.")

    print("\nFirst 5 rows of the DataFrame after removing whitespaces:")
    print(df.head())

else:
    print("\nSkipping Whitespace Removal: Dataset not loaded.")

#### 6. Rephrase Text

In [None]:
# Rephrase Text (Conceptual Code - Likely Not Needed for your current dataset)

# This step is typically NOT needed for structured categorical text like Crime Types or Neighborhoods.
# It is used for free-form text descriptions if you need to generate alternative phrasings,
# which is a more advanced NLP task.

# If you had a column like 'Description' that contained narratives,
# and needed to rephrase them, you might use libraries like spaCy or NLTK
# for simpler rephrasing (e.g., changing sentence structure) or more
# advanced models (like T5 or BART from Hugging Face) for complex paraphrasing.

# Example using a simple rule-based approach (very limited):
# import spacy # You would need to install and load a spaCy model: !pip install spacy; !python -m spacy download en_core_web_sm
# nlp = spacy.load("en_core_web_sm")

# def simple_rephrase(text):
#     if isinstance(text, str):
#         doc = nlp(text)
#         # A very basic example: reverse subject-verb order if a simple structure is detected
#         # This is highly simplified and won't work for most sentences
#         rephrased_sentences = []
#         for sent in doc.sents:
#             # Complex logic involving dependency parsing to find subject and verb...
#             rephrased_sentences.append(str(sent)) # Placeholder
#         return " ".join(rephrased_sentences)
#     else:
#         return text

# if df is not None and 'Description' in df.columns: # Replace 'Description' with your actual text column name
#     print("Attempting to rephrase 'Description' column (using a placeholder/simple method)...")
#     # Apply the rephrasing function
#     # df['Description_rephrased'] = df['Description'].apply(simple_rephrase)
#     print("Rephrasing complete (Note: This requires a proper NLP model for meaningful results).")
# else:
#     print("\nSkipping Rephrase Text: Dataset not loaded or no relevant text column found for rephrasing.")

print("Rephrasing text is likely not a necessary data cleaning step for the current columns (TYPE, HUNDRED_BLOCK, NEIGHBOURHOOD) in your dataset.")
print("This step is typically used for free-form textual descriptions if needed for specific advanced NLP tasks.")

#### 7. Tokenization

In [None]:
# Tokenization

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
import re # Often useful for additional cleaning before tokenization

# --- Download NLTK punkt tokenizer if not already downloaded ---
try:
    # Try accessing the standard punkt tokenizer
    nltk.data.find('tokenizers/punkt')
except LookupError:
    print("NLTK punkt tokenizer not found. Downloading...")
    nltk.download('punkt')
    print("NLTK punkt tokenizer downloaded.")

print("-" * 30)

# --- Download NLTK punkt_tab resource if not already downloaded ---
# Some newer NLTK releases load the tokenizer data from 'punkt_tab' rather than 'punkt',
# so check for that resource as well
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    print("NLTK punkt_tab resource not found. Downloading...")
    nltk.download('punkt_tab')
    print("NLTK punkt_tab resource downloaded.")

print("-" * 30)


# Check if the DataFrame is loaded
if df is not None:

    # List of columns to tokenize
    # These are typically the text/categorical columns
    cols_to_tokenize = ['TYPE', 'HUNDRED_BLOCK', 'NEIGHBOURHOOD'] # Add other text columns if applicable

    print("Performing tokenization on relevant columns...")

    # Function to tokenize text
    def tokenize_text(text):
        if isinstance(text, str):
            # Basic cleaning before tokenization (optional, but recommended after previous steps)
            # Ensure lowercase (if not already done) and remove leading/trailing/extra spaces
            text = text.lower().strip() # Assumes lowercasing/whitespace cleaning done
            text = re.sub(r'\s+', ' ', text) # Replace multiple spaces with single space

            # Use NLTK's word_tokenize
            # word_tokenize relies on sent_tokenize, which uses the punkt resource
            # Downloading both 'punkt' and 'punkt_tab' should resolve the LookupError
            tokens = word_tokenize(text)
            return tokens
        else:
            return [] # Return empty list for non-string data

    # Create new columns to store the tokens (e.g., 'TYPE_tokens')
    for col in cols_to_tokenize:
        if col in df.columns:
            # Check if the column is of object (string) type
            if pd.api.types.is_object_dtype(df[col]):
                new_col_name = f"{col}_tokens"
                df[new_col_name] = df[col].apply(tokenize_text)
                print(f"Successfully tokenized '{col}' into '{new_col_name}'.")
            else:
                 print(f"Column '{col}' is not of object type, skipping tokenization.")
        else:
            print(f"Column '{col}' not found in DataFrame, skipping tokenization.")


    print("\nFirst 5 rows of the DataFrame after tokenization (showing new token columns):")
    # Display original and new token columns
    cols_to_display = [col for col in cols_to_tokenize if f"{col}_tokens" in df.columns]
    original_and_token_cols = []
    for col in cols_to_display:
        original_and_token_cols.extend([col, f"{col}_tokens"])

    if original_and_token_cols:
        print(df[original_and_token_cols].head())
    else:
        print("No tokenized columns were created.")


else:
    print("\nSkipping Tokenization: Dataset not loaded.")

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

import pandas as pd
import nltk
from nltk.stem import PorterStemmer # For stemming
from nltk.stem import WordNetLemmatizer # For lemmatization
# Need the WordNet corpus for lemmatization
# Need the OMW (Open Multilingual Wordnet) for some NLTK operations, often paired with WordNet

# --- Download NLTK resources for Normalization if not already downloaded ---
try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    print("NLTK WordNet corpus not found. Downloading...")
    nltk.download('wordnet')
    print("NLTK WordNet corpus downloaded.")

try:
    nltk.data.find('corpora/omw-1.4')
except LookupError:
    print("NLTK OMW 1.4 corpus not found. Downloading...")
    nltk.download('omw-1.4')
    print("NLTK OMW 1.4 corpus downloaded.")

print("-" * 30)

# Check if the DataFrame is loaded and has tokenized columns
# Assuming the previous tokenization step was successful and created columns like 'TYPE_tokens'
cols_to_normalize = ['TYPE', 'HUNDRED_BLOCK', 'NEIGHBOURHOOD'] # Original columns that were tokenized
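# The 'df is not None' guard prevents an AttributeError when the dataset did not load
token_cols = [f"{col}_tokens" for col in cols_to_normalize if df is not None and f"{col}_tokens" in df.columns]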

if df is not None and token_cols:

    print("Performing text normalization (Stemming and Lemmatization) on tokenized columns...")

    # Initialize Stemmer and Lemmatizer
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    # Function to apply stemming to a list of tokens
    def apply_stemming(tokens):
        if isinstance(tokens, list):
            return [stemmer.stem(token) for token in tokens]
        else:
            return [] # Return empty list for non-list data

    # Function to apply lemmatization to a list of tokens
    # Lemmatization often requires a Part-of-Speech (POS) tag, but for simplicity,
    # we'll use the default 'n' (noun) tag. For more accurate lemmatization,
    # you would need to perform POS tagging first.
    def apply_lemmatization(tokens):
         if isinstance(tokens, list):
            # We'll default to 'n' (noun) POS tag for simplicity
            return [lemmatizer.lemmatize(token, pos='n') for token in tokens] # Default pos='n'
         else:
            return [] # Return empty list for non-list data


    # Apply stemming and lemmatization to the tokenized columns
    for col_tokens in token_cols:
        original_col_name = col_tokens.replace('_tokens', '') # Get original column name

        # Apply Stemming
        new_stem_col_name = f"{original_col_name}_stemmed"
        df[new_stem_col_name] = df[col_tokens].apply(apply_stemming)
        print(f"Successfully stemmed tokens in '{col_tokens}' into '{new_stem_col_name}'.")

        # Apply Lemmatization
        new_lemma_col_name = f"{original_col_name}_lemmatized"
        df[new_lemma_col_name] = df[col_tokens].apply(apply_lemmatization)
        print(f"Successfully lemmatized tokens in '{col_tokens}' into '{new_lemma_col_name}'.")


    print("\nFirst 5 rows of the DataFrame after normalization (showing new stemmed and lemmatized columns):")
    # Display original tokenized, stemmed, and lemmatized columns
    cols_to_display = []
    for col in cols_to_normalize:
         if f"{col}_tokens" in df.columns:
             cols_to_display.append(f"{col}_tokens")
         if f"{col}_stemmed" in df.columns:
             cols_to_display.append(f"{col}_stemmed")
         if f"{col}_lemmatized" in df.columns:
             cols_to_display.append(f"{col}_lemmatized")

    if cols_to_display:
        print(df[cols_to_display].head())
    else:
        print("No normalized columns were created. Ensure tokenization step was successful.")


else:
    print("\nSkipping Text Normalization: Dataset not loaded or tokenized columns not found.")

##### Which text normalization technique have you used and why?

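Both Porter stemming and WordNet lemmatization were applied to the tokenized columns, producing `*_stemmed` and `*_lemmatized` columns. The lemmatized output is the one preferred in the vectorization step, because lemmatization returns dictionary words while stemming can cut words down to stems that are not real words. Lemmatization here uses the default noun tag (`pos='n'`) for simplicity, so verbs and adjectives may not be reduced to their base form.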
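The lemmatizer accepts a part-of-speech hint, and the tags produced by `nltk.pos_tag` use the Penn Treebank tagset, so a small mapping is needed between the two. The helper below is a sketch of that mapping. The name `penn_to_wordnet` and the sample tagged tokens are assumptions for illustration, and the tagged list is written by hand rather than produced by a tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the single-letter POS code WordNet expects."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, also used as the fallback

# Hand-written stand-in for the (token, tag) pairs that nltk.pos_tag returns
tagged = [("stolen", "VBN"), ("vehicles", "NNS"), ("quickly", "RB"), ("commercial", "JJ")]
print([(token, penn_to_wordnet(tag)) for token, tag in tagged])
```

The returned letter can be passed as the `pos` argument of `lemmatizer.lemmatize(token, pos=...)` in place of the fixed `'n'`.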

#### 9. Part of speech tagging

In [None]:
# POS Tagging

import pandas as pd
import nltk
# No need to import stemmer/lemmatizer again if already imported in the previous cell

# --- Download NLTK POS tagger (specifically the English version) if not already downloaded ---
try:
    # Explicitly check for the language-specific resource
    nltk.data.find('taggers/averaged_perceptron_tagger_eng')
except LookupError:
    print("NLTK averaged_perceptron_tagger_eng not found. Downloading...")
    # Download the specific English tagger
    nltk.download('averaged_perceptron_tagger_eng')
    print("NLTK averaged_perceptron_tagger_eng downloaded.")

print("-" * 30)

# Check if the DataFrame is loaded and has tokenized columns
# Assuming the previous tokenization step was successful and created columns like 'TYPE_tokens'
cols_to_tag = ['TYPE', 'HUNDRED_BLOCK', 'NEIGHBOURHOOD'] # Original columns that were tokenized
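# The 'df is not None' guard prevents an AttributeError when the dataset did not load
token_cols = [f"{col}_tokens" for col in cols_to_tag if df is not None and f"{col}_tokens" in df.columns]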


if df is not None and token_cols:

    print("Performing Part-of-Speech (POS) tagging on tokenized columns...")

    # Function to apply POS tagging to a list of tokens
    def apply_pos_tagging(tokens):
        if isinstance(tokens, list):
            # nltk.pos_tag takes a list of tokens
            return nltk.pos_tag(tokens)
        else:
            return [] # Return empty list for non-list data


    # Apply POS tagging to the tokenized columns
    for col_tokens in token_cols:
        original_col_name = col_tokens.replace('_tokens', '') # Get original column name

        # Apply POS Tagging
        new_pos_col_name = f"{original_col_name}_pos_tags"
        df[new_pos_col_name] = df[col_tokens].apply(apply_pos_tagging)
        print(f"Successfully applied POS tagging to tokens in '{col_tokens}' into '{new_pos_col_name}'.")


    print("\nFirst 5 rows of the DataFrame after POS tagging (showing new POS tag columns):")
    # Display original tokenized and new POS tag columns
    cols_to_display = []
    for col in cols_to_tag:
         if f"{col}_tokens" in df.columns:
             cols_to_display.append(f"{col}_tokens")
         if f"{col}_pos_tags" in df.columns:
             cols_to_display.append(f"{col}_pos_tags")

    if cols_to_display:
        print(df[cols_to_display].head())
    else:
        print("No POS tagged columns were created. Ensure tokenization step was successful.")

else:
    print("\nSkipping POS Tagging: Dataset not loaded or tokenized columns not found.")

#### 10. Text Vectorization

In [None]:
# text vectorization

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Check if the DataFrame is loaded and has the necessary columns for vectorization
# We will vectorize the stemmed or lemmatized columns as they are normalized
# Prioritize lemmatized if available, otherwise use stemmed
cols_to_vectorize = ['TYPE', 'HUNDRED_BLOCK', 'NEIGHBOURHOOD']
normalized_cols = []

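# Iterate over nothing when df is None so the column lookups below cannot raise AttributeError
for col in (cols_to_vectorize if df is not None else []):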
    lemmatized_col = f"{col}_lemmatized"
    stemmed_col = f"{col}_stemmed"
    if lemmatized_col in df.columns:
        normalized_cols.append(lemmatized_col)
    elif stemmed_col in df.columns:
        normalized_cols.append(stemmed_col)
    else:
        print(f"Warning: Neither '{lemmatized_col}' nor '{stemmed_col}' found for column '{col}'. Skipping.")


if df is not None and normalized_cols:

    print("Performing TF-IDF vectorization on normalized text columns...")

    # Initialize a dictionary to store TF-IDF matrices for each column
    tfidf_matrices = {}

    for col in normalized_cols:
        print(f"Vectorizing column: '{col}'")
        # TF-IDF Vectorizer
        # Join each token list into a string for TF-IDF. A separate Series is used so the
        # token lists in df[col] stay intact and this cell can be re-run safely.
        text_series = df[col].apply(
            lambda x: ' '.join(x) if isinstance(x, list) else (x if isinstance(x, str) else '')
        )

        tfidf_vectorizer = TfidfVectorizer(max_features=1000) # Limit features to 1000 for manageability
        try:
            # Fit and transform the text data
            tfidf_matrix = tfidf_vectorizer.fit_transform(text_series)

            # Store the matrix
            tfidf_matrices[col] = tfidf_matrix
            print(f"Successfully created TF-IDF matrix for '{col}'. Shape: {tfidf_matrix.shape}")

            # Optionally, display the feature names (words) learned by the vectorizer
            # print(f"Features for '{col}': {tfidf_vectorizer.get_feature_names_out()[:50]}...") # Display first 50 features

        except Exception as e:
            print(f"Error during TF-IDF vectorization for column '{col}': {e}")


    print("\nTF-IDF Vectorization Complete.")
    print("TF-IDF matrices are stored in the 'tfidf_matrices' dictionary.")

    # Example of how to access a matrix (e.g., for 'TYPE_lemmatized')
    # if 'TYPE_lemmatized' in tfidf_matrices:
    #     print("\nExample access: TF-IDF matrix for 'TYPE_lemmatized':")
    #     print(tfidf_matrices['TYPE_lemmatized'][:5].toarray()) # Display first 5 rows as dense array


else:
    print("\nSkipping TF-IDF Vectorization: Dataset not loaded or necessary normalized columns not found.")

##### Which text vectorization technique have you used and why?

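TF-IDF (`TfidfVectorizer`) was applied to the lemmatized columns, falling back to the stemmed columns where no lemmatized version exists, with `max_features=1000` to keep the matrices manageable. TF-IDF gives a lower weight to terms that occur in most rows and a higher weight to terms that are distinctive for a row.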
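For intuition about the weights, here is a dependency-free sketch of the computation. It follows what I understand to be the `TfidfVectorizer` defaults (raw term counts, smoothed inverse document frequency, L2 normalization per row). The three sample strings and the names `doc_freq`, `idf`, and `tfidf_row` are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical normalized strings, one per row
docs = ["theft vehicle", "theft bicycle", "mischief"]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of rows that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df_t)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df_t)) + 1 for t, df_t in doc_freq.items()}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF weights for one row."""
    counts = Counter(tokens)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
print(rows[0])
```

In the first row, `vehicle` gets a higher weight than `theft` because `theft` also appears in the second row and is therefore less distinctive.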

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Assuming high correlation between X/Y and Longitude/Latitude
# Decide which set to keep based on your analysis and ease of use (Latitude/Longitude are generally preferred for mapping)
# If X and Longitude are highly correlated, and Y and Latitude are highly correlated:
# Drop X and Y
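if df is not None:
    has_xy = 'X' in df.columns and 'Y' in df.columns
    has_lat_lon = 'Latitude' in df.columns and 'Longitude' in df.columns
    if has_xy and has_lat_lon:
        df = df.drop(['X', 'Y'], axis=1)
        print("Dropped 'X' and 'Y' columns due to high correlation with Latitude and Longitude.")
    elif has_xy:
        # Without Latitude/Longitude, X and Y are the only location features, so keep them
        print("'Latitude'/'Longitude' not found. Keeping 'X' and 'Y' so location information is not lost.")
    else:
        print("'X' or 'Y' columns not found. Skipping dropping.")
else:
    print("DataFrame not loaded. Cannot drop columns.")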

# You might also check for correlations between other numerical features and decide based on your model's requirements.
# For example, if YEAR and MONTH are highly correlated with a newly created 'Timestamp' feature,
# you might consider dropping YEAR and MONTH if the timestamp is sufficient.
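The drop above rests on the assumption that the two coordinate pairs are highly correlated. A minimal sketch of how that assumption could be checked first is shown below. The `toy` coordinates are synthetic, and the `threshold` of 0.95 is an assumed cut-off, not a value taken from the analysis.

```python
import pandas as pd

# Synthetic toy coordinates: X is an affine function of Longitude, Y of Latitude
toy = pd.DataFrame({
    "Longitude": [-123.20, -123.15, -123.10, -123.05],
    "Latitude": [49.20, 49.23, 49.26, 49.29],
})
toy["X"] = 490000 + 73000 * (toy["Longitude"] + 123.0)
toy["Y"] = 5450000 + 111000 * (toy["Latitude"] - 49.2)

threshold = 0.95  # assumed cut-off; choose one that suits the analysis

# Pearson correlation between each projected coordinate and its geographic counterpart
corr_x = toy["X"].corr(toy["Longitude"])
corr_y = toy["Y"].corr(toy["Latitude"])

redundant = bool(abs(corr_x) > threshold and abs(corr_y) > threshold)
print(f"corr(X, Longitude)={corr_x:.3f}, corr(Y, Latitude)={corr_y:.3f}, drop X/Y: {redundant}")
```

On the real data the same two `corr` calls would be run on `df` before the drop, and `X`/`Y` removed only if both correlations exceed the chosen threshold.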

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Assume 'df' is your DataFrame after data cleaning and initial feature engineering
# Make sure you have loaded and processed your data before running this code.
# If df is None from previous steps, load it again or ensure the previous steps ran successfully.
if 'df' not in locals() or df is None:
    print("DataFrame 'df' not found or is None. Please ensure your data loading and initial processing steps have run.")
    # If you need to reload the data for this section, uncomment and modify the following:
    # try:
    #     df = pd.read_excel('/content/drive/MyDrive/Train.xlsx')
    # except FileNotFoundError:
    #     print("Error: The file 'Train.xlsx' was not found.")
    #     # Exit or handle the error appropriately
    #     # exit() # Example: exit the cell execution if in a script
    #     pass # Allow the script to continue, but df will be None

    # # Assuming you also added some basic time features as done previously
    # if df is not None and 'Date' in df.columns:
    #      try:
    #          df['Date'] = pd.to_datetime(df['Date'])
    #          df['Day_of_Week'] = df['Date'].dt.dayofweek
    #          df['Month'] = df['Date'].dt.month
    #          df['Year'] = df['Date'].dt.year
    #          df['HOUR'] = df['Date'].dt.hour
    #          df['MINUTE'] = df['Date'].dt.minute
    #          df['Time_of_Day'] = df['HOUR'] + df['MINUTE']/60
    #          # Assuming Is_Weekend and Quarter were created elsewhere or need creation here
    #          # Example:
    #          # df['Is_Weekend'] = df['Date'].dt.dayofweek >= 5
    #          # df['Quarter'] = df['Date'].dt.quarter
    #      except Exception as e:
    #          print(f"Error processing 'Date' column during reload: {e}")


# --- Step 1: Check for Multicollinearity using VIF ---

if df is not None:
    print("--- Checking for Multicollinearity (VIF) ---")

    # Select only numerical columns for VIF calculation
    # Exclude the target variable if you have one and it's numerical
    # Make sure to exclude any identifier columns that aren't features
    numerical_features = df.select_dtypes(include=np.number).columns.tolist()

    # Remove Latitude and Longitude if they were kept and you suspect high correlation
    # This is based on the assumption from your previous step. Adjust if needed.
    features_to_exclude_from_vif = []
    if 'Latitude' in numerical_features and 'Longitude' in numerical_features:
         # Assuming you kept Lat/Lon and dropped X/Y
         # If you have other highly correlated features identified from the heatmap, add them here
         pass # Keep Lat/Lon for VIF calculation unless you have reason to drop them before this step

    # Exclude the original X and Y if they still exist (they should have been dropped earlier)
    if 'X' in numerical_features:
        features_to_exclude_from_vif.append('X')
    if 'Y' in numerical_features:
        features_to_exclude_from_vif.append('Y')
    if 'MINUTE' in numerical_features:
         # Minute might have low variance or be highly correlated with Time_of_Day
         # You can decide to exclude it from VIF or drop it as a feature later
         pass # Keep Minute for now to see its VIF

    numerical_features_for_vif = [f for f in numerical_features if f not in features_to_exclude_from_vif]

    # Drop rows with NaN values in the selected numerical columns for VIF calculation
    # VIF cannot handle NaN values. Be mindful of how you handle NaNs in your overall workflow.
    df_for_vif = df[numerical_features_for_vif].dropna()

    if not df_for_vif.empty:
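        # Calculate VIF. statsmodels does not add an intercept on its own, so a constant
        # column is added first; without it, VIFs of non-centered features are inflated.
        from statsmodels.tools.tools import add_constant
        X_vif = add_constant(df_for_vif.astype(float), has_constant='add')
        vif_data = pd.DataFrame()
        vif_data["feature"] = df_for_vif.columns
        # Compute the VIF for each original feature (the constant column itself is skipped)
        vif_data["VIF"] = [
            variance_inflation_factor(X_vif.values, X_vif.columns.get_loc(col))
            for col in df_for_vif.columns
        ]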

        print("\nVariance Inflation Factor (VIF) for Numerical Features:")
        print(vif_data.sort_values(by='VIF', ascending=False))

        # Identify features with high VIF (e.g., VIF > 5 or 10)
        high_vif_features = vif_data[vif_data['VIF'] > 5]['feature'].tolist() # Using 5 as a common threshold
        print(f"\nFeatures with VIF > 5: {high_vif_features}")

        # Decision based on VIF: Consider removing features with very high VIF,
        # especially if they are highly correlated with other features you intend to keep.
        # For example, if 'Time_of_Day' and 'HOUR' both have high VIF and are highly correlated,
        # you might choose to keep only 'Time_of_Day'.

    else:
        print("DataFrame for VIF calculation is empty after dropping NaNs. Cannot calculate VIF.")

    # --- Step 2: Consider Categorical Features ---

    print("\n--- Considering Categorical Features ---")

    # Analyze the cardinality of categorical features
    # Get object type columns and ensure they are not datetime or other complex types masquerading as object
    categorical_features = df.select_dtypes(include='object').columns.tolist()

    print("\nCardinality of Categorical Features:")
    for col in categorical_features:
        try:
            # Check if the column contains lists or other unhashable types
            # Sample the first few non-null values to get an idea of the type
            sample_values = df[col].dropna().head()
            contains_lists = any(isinstance(x, list) for x in sample_values)

            if contains_lists:
                print(f"{col}: Contains lists or unhashable types. Cannot calculate simple nunique().")
                # If you need to count unique lists (treating each list as a single item),
                # you could convert the lists to hashable types like tuples:
                # print(f"{col} (treating lists as unique): {df[col].apply(lambda x: tuple(x) if isinstance(x, list) else x).nunique()} unique values")
                # Or convert to string representation if the list content is what matters
                # print(f"{col} (as string): {df[col].astype(str).nunique()} unique values")
            else:
                print(f"{col}: {df[col].nunique()} unique values")
        except TypeError as e:
            print(f"Error calculating nunique for column '{col}': {e}")
            print(f"Check the data type of values in column '{col}'. It likely contains unhashable types like lists.")
        except Exception as e:
            print(f"An unexpected error occurred for column '{col}': {e}")


    # Based on cardinality and relevance, decide how to handle them for modeling:
    # - Low cardinality (e.g., 'Day_of_Week', 'Is_Weekend'): One-Hot Encoding is often suitable.
    # - High cardinality (e.g., 'NEIGHBOURHOOD', 'HUNDRED_BLOCK'):
    #   - Group rare categories.
    #   - Use more advanced encoding techniques (Target Encoding, etc.).
    #   - Consider if the feature is truly necessary for your forecasting task at the level of granularity provided.

    # --- Step 3: Feature Selection Strategy Based on Exploration and Domain Knowledge ---

    print("\n--- Feature Selection Strategy ---")

    # Based on your EDA and the VIF analysis, list the features you plan to use for your model.
    # This is a conceptual step based on your analysis. You'll implement the actual
    # dropping/keeping of columns before model training.

    # Example: Let's assume based on your analysis, you decide to keep the following:
    # - Time-based features: Date (as index), Day_of_Week, Month, Year, Time_of_Day, Is_Weekend, Quarter
    # - Location-based features: Latitude, Longitude (assuming they are sufficiently informative and not perfectly collinear after dropping X/Y)
    # - Crime Type (as you are forecasting crime trends)
    # - Neighborhood (handled appropriately based on cardinality)

    selected_features_for_modeling = [
        # Assuming these columns exist after previous steps and handling of NaNs/types
        'Date', # Often used as the index for time series
        'Day_of_Week',
        'Month',
        'Year',
        'Time_of_Day',
        # 'Is_Weekend', # Ensure this was created
        # 'Quarter',    # Ensure this was created
        'Latitude',
        'Longitude',
        'TYPE', # Or encode TYPE if it's your target or a predictor
        'NEIGHBOURHOOD' # Will require appropriate encoding
        # Add any other relevant features you engineered or decided to keep
    ]

    print("\nPotential features for modeling based on analysis:")
    # Filter the list to only include columns that actually exist in the DataFrame
    actual_selected_features = [f for f in selected_features_for_modeling if f in df.columns]
    print(actual_selected_features)

    # Note: You will need to handle the categorical features ('TYPE', 'NEIGHBOURHOOD')
    # before feeding them into your time series forecasting models.
    # The encoding strategy will depend on the specific model you choose.

    print("\nNext Steps:")
    print("1. Decide on the final set of features based on VIF, cardinality, and relevance.")
    print("2. Implement appropriate encoding for categorical features.")
    print("3. Prepare your data for the chosen time series model (e.g., creating time series index, handling NaNs).")
    print("4. Split data into training and validation/test sets using time-based splitting.")
    print("5. Train your model and evaluate performance using time series cross-validation.")
    print("6. Iterate on feature selection and model parameters based on performance.")

else:
    print("DataFrame not loaded. Cannot perform feature selection steps.")

##### Which feature selection methods have you used, and why?

Answer Here.

##### Which features did you find important, and why?

Answer Here.
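
As a minimal sketch of the VIF check used in the feature selection cell, the definition VIF_j = 1 / (1 - R_j^2) can be computed directly with NumPy, where R_j^2 comes from regressing feature j on the remaining features plus an intercept. The helper name `vif_table` and the toy columns are hypothetical and chosen only for illustration; this is not the `statsmodels` implementation.

```python
import numpy as np
import pandas as pd

def vif_table(features: pd.DataFrame) -> pd.Series:
    """Return VIF_j = 1 / (1 - R_j^2) for every column (hypothetical helper)."""
    X = features.to_numpy(dtype=float)
    n_rows = X.shape[0]
    vifs = {}
    for j, name in enumerate(features.columns):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress feature j on the other features plus an intercept column.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        vifs[name] = np.inf if np.isclose(r_squared, 1.0) else 1.0 / (1.0 - r_squared)
    return pd.Series(vifs, name="VIF")

# Toy data (made up): Time_of_Day is nearly a function of HOUR, Latitude is independent.
rng = np.random.default_rng(0)
hour = rng.integers(0, 24, size=500).astype(float)
toy = pd.DataFrame({
    "HOUR": hour,
    "Time_of_Day": hour + rng.uniform(0, 1, size=500),
    "Latitude": rng.normal(49.25, 0.02, size=500),
})
vif_result = vif_table(toy)
print(vif_result.sort_values(ascending=False))
```

In this toy setup `HOUR` and `Time_of_Day` should show very large VIFs while `Latitude` stays close to 1, which is the pattern that would suggest keeping only one of the two time features.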

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used? Explain why.
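
For a forecasting target, incident-level rows usually have to be aggregated into counts per period. The following is a minimal sketch under assumed column names (`Date`, `TYPE`) and made-up rows; the daily frequency is also an assumption and should be adapted to the forecasting horizon.

```python
import pandas as pd

# Made-up incident-level rows: one row per reported incident.
incidents = pd.DataFrame({
    "Date": pd.to_datetime([
        "2020-01-01 08:15", "2020-01-01 21:40", "2020-01-02 10:05",
        "2020-01-04 13:30", "2020-01-04 13:45", "2020-01-04 23:10",
    ]),
    "TYPE": ["Theft", "Mischief", "Theft", "Theft", "Theft", "Mischief"],
})

# Total incidents per calendar day; days without incidents appear with count 0.
daily_counts = (
    incidents.set_index("Date")
    .resample("D")
    .size()
    .rename("daily_crime_count")
)

# Counts per day and crime type, one column per type, aligned to the same daily index.
daily_by_type = (
    incidents.groupby([pd.Grouper(key="Date", freq="D"), "TYPE"])
    .size()
    .unstack(fill_value=0)
    .reindex(daily_counts.index, fill_value=0)
)

print(daily_counts)
print(daily_by_type)
```

Aggregating before encoding keeps one row per time step, which is the shape most time series models expect.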

In [None]:
# Transform Your data
# This section is for applying transformations like encoding categorical features,
# scaling numerical features, handling outliers if not done before, and preparing
# the data specifically for your chosen time series forecasting model.

# Based on the previous analysis, let's assume the following:
# - 'Date' is used as the time index.
# - 'TYPE' and 'NEIGHBOURHOOD' are categorical features that need encoding.
# - Numerical features ('Latitude', 'Longitude', 'HOUR', 'MINUTE', 'YEAR', 'MONTH', 'DAY', 'Time_of_Day', 'Day_of_Week') might benefit from scaling depending on the model.
# - You need to decide on a target variable for forecasting (e.g., count of incidents per time period/location, or forecasting a specific crime type).

# --- Example Transformations ---

if df is not None:
    print("--- Applying Data Transformations ---")

    # 1. Handling Categorical Features (TYPE and NEIGHBOURHOOD)
    # The best encoding strategy depends on your model.
    # One-Hot Encoding is simple but can create many columns for high cardinality features.
    # Target Encoding or grouping rare categories might be better for high cardinality.

    # Let's demonstrate One-Hot Encoding for 'TYPE' (assuming manageable number of types)
    # and a strategy for 'NEIGHBOURHOOD' (e.g., keeping top N and grouping the rest)
    # Decide on the number of top neighborhoods to keep
    top_n_neighborhoods_to_encode = 50 # Adjust based on your data and analysis

    if 'TYPE' in df.columns:
        print("\nApplying One-Hot Encoding for 'TYPE'...")
        # Use get_dummies for one-hot encoding
        # prefix='TYPE' adds a prefix to the new columns for clarity
        # dummy_na=False creates no separate indicator column for NaN; rows with a missing
        # value get 0 in every dummy column (you should handle NaNs before this)
        df = pd.get_dummies(df, columns=['TYPE'], prefix='TYPE', dummy_na=False)
        print("One-Hot Encoding for 'TYPE' applied.")
    else:
        print("'TYPE' column not found. Skipping One-Hot Encoding for TYPE.")


    if 'NEIGHBOURHOOD' in df.columns:
        print(f"\nHandling 'NEIGHBOURHOOD' by keeping top {top_n_neighborhoods_to_encode} and grouping others...")
        # Get the value counts and identify the top neighborhoods
        neighborhood_counts = df['NEIGHBOURHOOD'].value_counts()
        if len(neighborhood_counts) > top_n_neighborhoods_to_encode:
            top_neighborhoods_list = neighborhood_counts.nlargest(top_n_neighborhoods_to_encode).index.tolist()
            # Replace neighborhoods not in the top list (including NaN) with 'Other'
            in_top = df['NEIGHBOURHOOD'].isin(top_neighborhoods_list)
            df['NEIGHBOURHOOD_Grouped'] = df['NEIGHBOURHOOD'].where(in_top, 'Other')

            # Now, One-Hot Encode the grouped neighborhood column
            print("Applying One-Hot Encoding for 'NEIGHBOURHOOD_Grouped'...")
            df = pd.get_dummies(df, columns=['NEIGHBOURHOOD_Grouped'], prefix='NEIGHBOURHOOD', dummy_na=False)
            print("One-Hot Encoding for grouped neighborhoods applied.")
            # You might choose to drop the original 'NEIGHBOURHOOD' column if you no longer need it
            df = df.drop('NEIGHBOURHOOD', axis=1)
        else:
            # If the number of neighborhoods does not exceed the threshold, just encode all of them
            print("Number of neighborhoods does not exceed the threshold. Applying One-Hot Encoding to all neighborhoods.")
            df = pd.get_dummies(df, columns=['NEIGHBOURHOOD'], prefix='NEIGHBOURHOOD', dummy_na=False)
            print("One-Hot Encoding for 'NEIGHBOURHOOD' applied.")
    else:
        print("'NEIGHBOURHOOD' column not found. Skipping neighborhood encoding.")


    # 2. Feature Scaling (for numerical features)
    # Scaling is often important for distance-based algorithms or models sensitive to feature scales (like LSTMs, SVMs, etc.)
    # Tree-based models (like Random Forest, Gradient Boosting) are generally not sensitive to scaling.
    # Decide which numerical features need scaling based on your model choice.

    # Example: Scaling the location and time-component columns, if they are present
    numerical_cols_to_scale = ['Latitude', 'Longitude', 'HOUR', 'MINUTE', 'YEAR', 'MONTH', 'DAY']
    # Check if these columns exist before attempting to scale
    numerical_cols_to_scale_exist = [col for col in numerical_cols_to_scale if col in df.columns]

    if numerical_cols_to_scale_exist:
        print("\nApplying StandardScaler to selected numerical features...")
        from sklearn.preprocessing import StandardScaler

        # Create a scaler object
        scaler = StandardScaler()

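        # Fit the scaler to the selected columns and transform the data.
        # Note: fitting on the full dataset lets test-period statistics leak into the features;
        # for a final model, fit on the training split only and reuse the fitted scaler on the test split.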
        # Ensure no infinite values exist before scaling
        df[numerical_cols_to_scale_exist] = df[numerical_cols_to_scale_exist].replace([np.inf, -np.inf], np.nan)
        # Handle NaNs before scaling (e.g., imputation or dropping)
        # For demonstration, let's fill NaNs with the mean (choose an appropriate strategy for your data)
        df[numerical_cols_to_scale_exist] = df[numerical_cols_to_scale_exist].fillna(df[numerical_cols_to_scale_exist].mean())


        df[numerical_cols_to_scale_exist] = scaler.fit_transform(df[numerical_cols_to_scale_exist])
        print("StandardScaler applied.")
        print(f"Scaled columns: {numerical_cols_to_scale_exist}")
    else:
         print("\nNo numerical columns found for scaling from the specified list.")


    # 3. Creating a Time Series Index
    # For time series forecasting, it's standard to have a datetime index.
    # Ensure your 'Date' column is in datetime format.
    if 'Date' in df.columns:
         try:
             print("\nSetting 'Date' as DataFrame index...")
             # Convert 'Date' to datetime if not already
             df['Date'] = pd.to_datetime(df['Date'])
             # Set 'Date' as index and sort by index
             df = df.set_index('Date').sort_index()
             print("'Date' column set as index.")
         except Exception as e:
             print(f"Error setting 'Date' as index: {e}")
             print("Please ensure the 'Date' column is available and correctly formatted.")
    else:
        print("\n'Date' column not found. Cannot set time series index.")
        print("Please ensure the 'Date' column was created/loaded and is in datetime format.")

    # 4. Define Target Variable
    # What are you forecasting? Total crime count? Count of a specific crime type?
    # This step is conceptual as the target depends on your specific forecasting problem.
    # Example: If forecasting total daily crime count:
    # You would first need to aggregate the data by day and count incidents.
    # target = df.resample('D').size().rename('daily_crime_count')
    # Then, merge this target with your features, ensuring alignment by date.


    print("\n--- Transformations Complete ---")
    print("DataFrame head after transformations:")
    display(df.head())
    print("\nDataFrame info after transformations:")
    df.info()

else:
    print("DataFrame not loaded. Cannot apply transformations.")

### 6. Data Scaling

In [None]:
# Scaling your data

# This section specifically focuses on scaling numerical features.
# Scaling is important for many machine learning algorithms, particularly those
# that are sensitive to the magnitude of features, such as:
# - Gradient Descent based optimizers (used in Neural Networks, Logistic Regression, SVMs)
# - Distance-based algorithms (K-Nearest Neighbors, K-Means Clustering)
# - Principal Component Analysis (PCA)

# Tree-based models like Decision Trees, Random Forests, and Gradient Boosting
# (e.g., LightGBM, XGBoost) are generally not affected by feature scaling.

# Choose the appropriate scaler based on your data distribution and model requirements.
# Common scalers include:
# - StandardScaler: Standardizes features by removing the mean and scaling to unit variance (mean=0, variance=1). Suitable for data that is approximately normally distributed.
# - MinMaxScaler: Scales features to a fixed range, [0, 1] by default. Useful when a bounded range is required (e.g., neural network inputs); it is sensitive to outliers because it uses the observed min and max.
# - RobustScaler: Scales features using statistics that are robust to outliers (median and interquartile range). Good if your data contains many outliers.

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import pandas as pd
import numpy as np
from IPython.display import display

if df is not None:
    print("--- Applying Feature Scaling ---")

    # Identify the numerical columns to scale
    # Exclude columns that are identifiers, target variables (if already defined and you don't want to scale it),
    # or already encoded categorical features.
    # Use select_dtypes to get current numerical columns.
    numerical_cols = df.select_dtypes(include=np.number).columns.tolist()

    # Based on your previous steps, some columns might already be numerical (e.g., Latitude, Longitude, time components).
    # Exclude any columns that resulted from one-hot encoding, as these are already binary (0 or 1).
    # Also, exclude any integer columns that represent counts or labels if scaling doesn't make sense for them.
    # Let's make a list of columns we *intend* to scale.
    # You might need to adjust this list based on your specific DataFrame after transformations.

    # Example list of columns to potentially scale:
    features_to_scale = ['Latitude', 'Longitude', 'HOUR', 'MINUTE', 'YEAR', 'MONTH', 'DAY', 'Time_of_Day', 'Day_of_Week']

    # Filter this list to include only columns that currently exist in the DataFrame and are numerical
    # after the previous transformation steps (like one-hot encoding).
    # Some columns might have been dropped or created in previous steps.
    numerical_features_to_scale_exist = [col for col in features_to_scale if col in df.columns and np.issubdtype(df[col].dtype, np.number)]

    if not numerical_features_to_scale_exist:
        print("No numerical features identified for scaling from the specified list.")
    else:
        print(f"Identified numerical features for scaling: {numerical_features_to_scale_exist}")

        # --- Choose Your Scaler ---
        # Exactly one of the following lines should be active; MinMaxScaler is used here as an example.
        # scaler = StandardScaler()  # Alternative: zero mean, unit variance
        scaler = MinMaxScaler()      # Active choice: rescales each feature to [0, 1]
        # scaler = RobustScaler()    # Alternative: median and IQR, robust to outliers

        print(f"Using Scaler: {type(scaler).__name__}")

        try:
            # Prepare the data for scaling
            # Scaling works best on numerical values. Ensure there are no non-numeric values
            # or infinities in the selected columns. Handle NaNs if present.
            df_to_scale = df[numerical_features_to_scale_exist].copy()

            # Replace infinite values with NaN, then handle NaNs
            df_to_scale.replace([np.inf, -np.inf], np.nan, inplace=True)

            # Example NaN handling: Impute with mean or median. Choose a strategy based on your data.
            # For StandardScaler/MinMaxScaler, mean/median imputation is common.
            # For RobustScaler, median imputation might be more appropriate.
            # Let's use the mean for demonstration (adjust as needed)
            df_to_scale.fillna(df_to_scale.mean(), inplace=True)
            print(f"Handling NaNs by filling with mean in columns: {numerical_features_to_scale_exist}")


            # Ensure the data is in a suitable format for the scaler (e.g., numpy array)
            # The scaler expects a 2D array.
            scaled_data = scaler.fit_transform(df_to_scale)

            # Create a new DataFrame with the scaled data, keeping the original column names and index
            scaled_df = pd.DataFrame(scaled_data, columns=numerical_features_to_scale_exist, index=df.index)

            # Replace the original columns in the main DataFrame with the scaled versions
            df[numerical_features_to_scale_exist] = scaled_df

            print("\nScaling applied successfully.")
            print("DataFrame head after scaling selected features:")
            display(df.head())

        except Exception as e:
            print(f"An error occurred during scaling: {e}")
            print("Please check the selected columns and their data types.")

else:
    print("DataFrame not loaded. Cannot perform scaling.")

##### Which method have you used to scale your data and why?
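
Whichever scaler is chosen, its statistics are typically estimated on the training period only and then reused for later data, so that information from the test period does not leak into the features. The following is a minimal sketch with min-max scaling written out in pandas; the toy frame, the 80/20 cut, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up, time-ordered feature frame.
features = pd.DataFrame(
    {"HOUR": [0, 6, 12, 18, 23, 3, 9, 15, 21, 22],
     "Latitude": [49.20, 49.22, 49.25, 49.27, 49.30, 49.21, 49.24, 49.26, 49.29, 49.31]},
    index=pd.date_range("2020-01-01", periods=10, freq="D"),
)

# Chronological cut: first 80% of rows for training, the rest for testing.
cut = int(len(features) * 0.8)
train, test = features.iloc[:cut], features.iloc[cut:]

# Estimate min and range on the training rows only.
train_min = train.min()
train_range = (train.max() - train_min).replace(0, 1.0)  # guard against constant columns

train_scaled = (train - train_min) / train_range
test_scaled = (test - train_min) / train_range  # values may fall outside [0, 1]

print(train_scaled.round(3))
print(test_scaled.round(3))
```

Test values outside [0, 1] are expected here: they indicate observations beyond the range seen during training, not an error in the scaling.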

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain why.

Answer Here.

In [None]:
# Dimensionality Reduction (if needed)

# Dimensionality reduction is a technique used to reduce the number of features
# in a dataset. This can be beneficial for several reasons:
# 1. Reduces storage space and computation time.
# 2. Helps to mitigate the "curse of dimensionality", where data becomes sparse
#    in high-dimensional spaces, making it harder for models to find patterns.
# 3. Can help remove noise and multicollinearity.
# 4. Can aid in visualization by reducing data to 2 or 3 dimensions.

# Common dimensionality reduction techniques include:
# - Principal Component Analysis (PCA): A linear technique that finds orthogonal
#   principal components that capture the maximum variance in the data. Unsupervised.
# - t-Distributed Stochastic Neighbor Embedding (t-SNE): A non-linear technique
#   primarily used for visualizing high-dimensional data in 2 or 3 dimensions.
# - UMAP (Uniform Manifold Approximation and Projection): Another non-linear
#   technique for visualization and general purpose dimensionality reduction,
#   often faster than t-SNE.
# - Factor Analysis, Independent Component Analysis (ICA), Linear Discriminant
#   Analysis (LDA - supervised), etc.

# Whether dimensionality reduction is "needed" depends on:
# - The number of features you have after feature engineering and encoding.
# - The performance of your chosen time series model with the current number of features.
# - Your goal (e.g., visualization, noise reduction, computational efficiency).

# For time series forecasting, applying dimensionality reduction directly to the
# feature matrix can be tricky if the technique doesn't preserve the temporal
# relationships well. PCA, for example, treats all features equally regardless
# of their temporal context.

# If you have a very large number of static features per time step (e.g., many
# one-hot encoded geographical features), PCA might be applied to *those static
# features* to reduce their dimension before incorporating them into a time
# series model.

# Let's illustrate with PCA as it's a common method, but consider if it's truly
# appropriate for your time series forecasting context or if other feature
# selection methods (based on relevance, domain knowledge, model interpretability)
# are more suitable.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

if df is not None:
    print("--- Considering Dimensionality Reduction (PCA Example) ---")

    # Identify the numerical features for PCA. This should include scaled numerical
    # features and one-hot encoded categorical features.
    # Exclude the time index and potentially the target variable if it's in this DataFrame.
    # Assuming 'Date' is the index and you don't have a target column yet in this DataFrame.

    # Get all numerical columns (scaled features) plus one-hot encoded dummies.
    # pandas >= 2.0 creates dummies with bool dtype, which np.number alone would miss.
    features_for_pca = df.select_dtypes(include=[np.number, 'bool']).columns.tolist()

    # Exclude columns that should not be part of the dimensionality reduction if they exist
    # For example, if you have a separate target column named 'crime_count', exclude it.
    # target_column = 'your_target_column_name' # Define your target column if applicable
    # if target_column in features_for_pca:
    #     features_for_pca.remove(target_column)

    if not features_for_pca:
        print("No numerical features available for PCA.")
    else:
        print(f"Features considered for PCA: {features_for_pca}")

        # Prepare data for PCA
        # PCA cannot handle NaN values. Ensure NaNs are handled before this step.
        # Assuming NaNs were handled during scaling or previous steps.
        data_for_pca = df[features_for_pca].copy()

        # Optional: Check for and handle any remaining NaNs or infinities
        data_for_pca.replace([np.inf, -np.inf], np.nan, inplace=True)
        if data_for_pca.isnull().sum().sum() > 0:
             print("Warning: NaN values found in data for PCA. Filling with mean for demonstration.")
             # Choose an appropriate imputation strategy for your data
             data_for_pca.fillna(data_for_pca.mean(), inplace=True)


        # --- Apply PCA ---

        # Decide on the number of components or the amount of variance to explain
        # Option 1: Specify number of components (e.g., reduce to 10 features)
        # n_components = 10
        # pca = PCA(n_components=n_components)
        # principal_components = pca.fit_transform(data_for_pca)
        # print(f"\nApplied PCA to reduce dimensions to {n_components} components.")

        # Option 2: Specify variance to explain (e.g., keep components explaining 95% variance)
        n_components_variance = 0.95
        pca = PCA(n_components=n_components_variance)
        principal_components = pca.fit_transform(data_for_pca)
        print(f"\nApplied PCA to capture {n_components_variance*100:.0f}% variance.")
        print(f"Number of components selected by PCA: {pca.n_components_}")


        # Create a new DataFrame with the principal components
        pca_columns = [f'PC{i+1}' for i in range(pca.n_components_)]
        principal_df = pd.DataFrame(data=principal_components, columns=pca_columns, index=df.index)

        print("Principal Components DataFrame head:")
        display(principal_df.head())

        # Optional: Visualize explained variance ratio
        plt.figure(figsize=(10, 6))
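        # Plot against 1..n so the x-axis shows the actual number of components
        plt.plot(range(1, pca.n_components_ + 1), np.cumsum(pca.explained_variance_ratio_), marker='o')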
        plt.xlabel('Number of Components')
        plt.ylabel('Cumulative Explained Variance Ratio')
        plt.title('PCA Cumulative Explained Variance')
        plt.grid(True)
        plt.show()

        # Decide whether to replace original features with principal components
        # If you replace, you need to drop the original features used for PCA
        # and concatenate the principal components to your main DataFrame.

        # Example: Replacing original features with principal components
        print("\nReplacing original features with principal components in the DataFrame...")
        # Drop the original features used for PCA
        df_reduced = df.drop(columns=features_for_pca)

        # Concatenate the principal components DataFrame
        df_reduced = pd.concat([df_reduced, principal_df], axis=1)

        # Update the main DataFrame reference
        df = df_reduced

        print("Original features replaced with principal components.")
        print("DataFrame head after dimensionality reduction:")
        display(df.head())
        print("\nDataFrame info after dimensionality reduction:")
        df.info()


        # --- If PCA is primarily for Visualization ---
        # If your goal was just to visualize, you might apply PCA to 2 or 3 components
        # (pca = PCA(n_components=2) or PCA(n_components=3)) on a subset of your data
        # (e.g., sample some data points if the dataset is huge) and then plot the components.
        # This wouldn't necessarily change your main 'df' for modeling.


    print("\n--- Dimensionality Reduction Step Complete ---")
    print("Review the number of components and explained variance to decide if PCA is suitable.")
    print("If PCA was applied, your DataFrame 'df' now contains the principal components.")

else:
    print("DataFrame not loaded. Cannot perform dimensionality reduction.")

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction was done on the dataset.)

Answer Here.
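
To make the 95% variance criterion concrete, the sketch below computes explained-variance ratios from a singular value decomposition in NumPy rather than through `sklearn.decomposition.PCA`. It relies on the fact that, for centered data, the squared singular values are proportional to the component variances. The toy data and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up data: 5 observed features driven by 2 latent factors plus small noise.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 5))

# Center the columns; PCA directions are the right singular vectors of the centered matrix.
X_centered = X - X.mean(axis=0)
U, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)

explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_variance_ratio)

# Smallest number of components whose cumulative ratio reaches 95%.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
scores = X_centered @ Vt[:n_components].T  # projected data (principal component scores)

print(np.round(cumulative, 4))
print("components kept:", n_components, "scores shape:", scores.shape)
```

Because the toy features are generated from two latent factors, at most two components are needed here; on real one-hot encoded crime data the curve may rise far more slowly, which is itself a hint about whether PCA is worthwhile.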

### 8. Data Splitting
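
Because the observations are time-ordered, evaluation folds have to respect chronology. The sketch below generates expanding-window (walk-forward) folds by position; the helper name `expanding_window_splits`, the fold sizes, and the toy series are assumptions for illustration.

```python
import pandas as pd

def expanding_window_splits(n_samples, n_folds, test_size):
    """Yield (train_positions, test_positions) with a growing training window (hypothetical helper)."""
    first_test_start = n_samples - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("Not enough samples for the requested folds.")
    for fold in range(n_folds):
        test_start = first_test_start + fold * test_size
        # Training always starts at position 0 and ends right before the test block.
        yield range(0, test_start), range(test_start, test_start + test_size)

# Made-up daily series, already sorted by date.
series = pd.Series(range(20), index=pd.date_range("2021-01-01", periods=20, freq="D"))

folds = list(expanding_window_splits(len(series), n_folds=3, test_size=4))
for train_pos, test_pos in folds:
    train, test = series.iloc[list(train_pos)], series.iloc[list(test_pos)]
    print(f"train up to {train.index.max().date()} | "
          f"test {test.index.min().date()} to {test.index.max().date()}")
```

Every test block lies strictly after its training block, so no fold trains on the future. scikit-learn's `TimeSeriesSplit` provides comparable folds if a library implementation is preferred.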

In [None]:
# Split your data into train and test sets. Choose the splitting ratio wisely.

# For time series data, splitting the data into training and testing sets is crucial
# for evaluating your forecasting model's performance on unseen future data.
# Unlike standard machine learning where you can randomly split data, you MUST
# split time series data chronologically. You train on past data and predict future data.

# Common splitting strategies for time series:
# 1. Train/Test Split (fixed point): Split the data at a specific date or time index.
#    Data before this point is for training, data after is for testing.
# 2. Rolling Window (Walk-Forward Validation): Train on a fixed-size window, test on the next period,
#    then slide both windows forward and repeat. Old observations drop out of the training set,
#    which helps when recent behavior matters most. Evaluating over several periods is more
#    robust than a single split.
# 3. Expanding Window: Also walk-forward, but the training set grows with each iteration,
#    incorporating all data up to the start of the test period.

# The splitting ratio (or the duration of train/test periods) should be chosen wisely:
# - The training set needs to be long enough to capture relevant patterns (trends, seasonality).
# - The test set should be representative of the period you want to forecast.
# - Avoid making the test set too short, as performance metrics can be volatile.
# - Avoid making the training set too short, as the model might not learn well.

# Let's demonstrate a simple fixed point train/test split.
# You'll need a time index for this, which we created in the "Transform Your data" step.

import pandas as pd
from IPython.display import display

if df is not None:
    print("--- Splitting Data (Time-Based) ---")

    # Ensure the DataFrame has a datetime index and is sorted by time
    if isinstance(df.index, pd.DatetimeIndex):
        # 'Date' is already the index (no longer a column), so only sorting may be needed
        if not df.index.is_monotonic_increasing:
            print("DatetimeIndex is not sorted. Sorting by index.")
            df = df.sort_index()
    elif 'Date' in df.columns:
        print("Index is not a DatetimeIndex. Attempting to set 'Date' column as index and sort.")
        try:
            df['Date'] = pd.to_datetime(df['Date'])
            df = df.set_index('Date').sort_index()
            print("'Date' column set as index and sorted.")
        except Exception as e:
            print(f"Error setting 'Date' as index: {e}")
            print("Cannot perform time-based split without a valid datetime index.")
            df = None # Indicate that the DataFrame is not ready for splitting
    else:
        print("No 'Date' column found. Cannot perform time-based split.")
        df = None # Indicate that the DataFrame is not ready for splitting


    if df is not None:
        # --- Define Split Point ---

        # You can define the split point by date or by index position.
        # Splitting by date is generally more intuitive for time series.
        # Choose a date that leaves a sufficient amount of data for both training and testing.

        # Example: Split the last 20% of the data for testing.
        # This requires sorting by date first.
        train_size = 0.8 # 80% for training, 20% for testing

        # Calculate the split index based on the ratio
        split_index = int(len(df) * train_size)

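        # Get the split date based on the calculated index.
        # Note: rows that share this exact timestamp all land in the test set,
        # so the realized ratio can differ slightly from train_size.
        split_date = df.index[split_index]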

        print(f"\nSplitting data at index: {split_index}")
        print(f"Corresponding split date: {split_date}")
        print(f"Training data will be from the start up to (but not including) {split_date}")
        print(f"Testing data will be from {split_date} onwards")


        # --- Perform the Split ---

        # Split the DataFrame based on the split date
        train_df = df.loc[df.index < split_date].copy()
        test_df = df.loc[df.index >= split_date].copy()

        print(f"\nOriginal DataFrame shape: {df.shape}")
        print(f"Training set shape: {train_df.shape}")
        print(f"Testing set shape: {test_df.shape}")

        # Verify the time ranges
        if not train_df.empty:
             print(f"Training set date range: {train_df.index.min()} to {train_df.index.max()}")
        if not test_df.empty:
             print(f"Testing set date range: {test_df.index.min()} to {test_df.index.max()}")

        # If your task is to forecast a specific column, separate features (X) and target (y)
        # Decide your target variable here.
        # Example: Assuming you are forecasting the count of incidents ('TYPE_Theft from Vehicle' if that's a one-hot encoded column)
        # Or perhaps you need to aggregate data first if forecasting daily/weekly counts.

        # --- Example: Preparing X and y (assuming a hypothetical target column) ---
        # If your target is one of the columns in the DataFrame after transformations:
        # target_column_name = 'TYPE_Theft from Vehicle' # Replace with your actual target column name
        # if target_column_name in df.columns:
        #     print(f"\nAssuming target variable is '{target_column_name}'.")
        #     X_train = train_df.drop(columns=[target_column_name])
        #     y_train = train_df[target_column_name]
        #     X_test = test_df.drop(columns=[target_column_name])
        #     y_test = test_df[target_column_name]

        #     print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
        #     print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")
        # else:
        #     print(f"\nTarget column '{target_column_name}' not found. Skipping X/y split.")
        #     print("Please define your target variable and perform the X/y split accordingly.")

        # If your target is an aggregated series (e.g., daily crime count), you would
        # perform the aggregation *before* or *after* the initial df split, ensuring
        # the resulting series/DataFrame is aligned with the train/test periods.

        # For now, we'll just keep the train_df and test_df DataFrames as the split result.
        # You will extract X and y or prepare the data further based on your specific model requirements.

        print("\n--- Data Splitting Complete ---")
        print("You have train_df and test_df ready for time series modeling.")
        print("Remember to define and separate your target variable(s) before training.")

    else:
        print("DataFrame not available for splitting.")

else:
    print("DataFrame not loaded. Cannot split data.")

##### What data splitting ratio have you used and why?

Answer Here.
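
The chronological 80/20 split used in the cell above can be condensed into a small helper. This is only a sketch: `chronological_split` is a hypothetical function name, and the toy frame uses made-up values rather than the project's dataset. It assumes the frame already has a datetime index.

```python
import pandas as pd

def chronological_split(frame: pd.DataFrame, train_fraction: float = 0.8):
    """Split a time-indexed frame so that no test timestamp precedes a train timestamp."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    ordered = frame.sort_index()
    cut = int(len(ordered) * train_fraction)
    if cut == 0 or cut == len(ordered):
        raise ValueError("not enough rows to form both a training and a test set")
    split_date = ordered.index[cut]
    # Rows stamped exactly at split_date go to the test set.
    train = ordered.loc[ordered.index < split_date]
    test = ordered.loc[ordered.index >= split_date]
    return train, test, split_date

# Toy data: 10 daily rows (hypothetical values).
toy = pd.DataFrame(
    {"incidents": range(10)},
    index=pd.date_range("2020-01-01", periods=10, freq="D"),
)
toy_train, toy_test, toy_split_date = chronological_split(toy)
print(len(toy_train), len(toy_test), toy_split_date.date())
```

Unlike a random split, this keeps the test period strictly after the training period, which mirrors how a forecast would be used in practice.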

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.
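
One way to ground the answer is to look at the class distribution directly. The sketch below uses only `collections.Counter`; `class_balance` is a hypothetical helper and the labels are invented for illustration. Note that imbalance is mainly a concern for classification targets, so this check may not apply if the target is an aggregated count series.

```python
from collections import Counter

def class_balance(labels):
    """Return per-class counts, per-class shares, and the majority/minority ratio."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    ratio = max(counts.values()) / min(counts.values())
    return counts, shares, ratio

# Made-up labels: 8 of one class, 2 of another.
toy_labels = ["Theft"] * 8 + ["Assault"] * 2
counts, shares, ratio = class_balance(toy_labels)
print(counts.most_common(), ratio)
```

A ratio close to 1 suggests balanced classes; how large a ratio counts as "imbalanced" is a judgment call that depends on the task.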

In [None]:
# Handling Imbalanced Dataset (If needed)

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
import pandas as pd
import numpy as np
from collections import Counter # To check class distribution
from IPython.display import display

##### What technique did you use to handle the imbalanced dataset and why? (If it needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
import pandas as pd
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt
from IPython.display import display
import warnings
warnings.filterwarnings('ignore')
if 'train_df' in locals() and 'test_df' in locals() and train_df is not None and test_df is not None:
    print("--- Implementing SARIMA Model (Example: Forecasting Total Daily Crime Count) ---")
    train_ts = train_df.resample('D').size().fillna(0)
    test_ts = test_df.resample('D').size().fillna(0)

    print(f"\nTraining time series shape: {train_ts.shape}")
    print(f"Testing time series shape: {test_ts.shape}")
    print("\nTraining time series head:")
    display(train_ts.head())
    print("\nTesting time series head:")
    display(test_ts.head())

    order = (1, 1, 1)
    seasonal_order = (1, 1, 1, 7)

    print(f"\nFitting SARIMA model with order={order} and seasonal_order={seasonal_order}...")

    try:

        model = SARIMAX(train_ts,
                        order=order,
                        seasonal_order=seasonal_order,
                        enforce_stationarity=False,
                        enforce_invertibility=False)
        sarima_results = model.fit()

        print("\nSARIMA model fitted successfully.")
        print(sarima_results.summary())
        print("\nMaking predictions on the test set...")
        start_pred = test_ts.index[0]
        end_pred = test_ts.index[-1]
        predictions = sarima_results.predict(start=start_pred, end=end_pred, dynamic=True)

        print("Predictions generated.")
        print("\nSample Predictions:")
        print(predictions.head())
        print("\nSample Actuals (Test set):")
        print(test_ts.head())
        predictions_aligned = predictions.reindex(test_ts.index)
        predictions_aligned = predictions_aligned.ffill()  # fillna(method='ffill') is deprecated in recent pandas
        print("\nEvaluating model performance on the test set...")
        valid_indices = predictions_aligned.notna() & test_ts.notna()
        actuals = test_ts[valid_indices]
        preds = predictions_aligned[valid_indices]

        if not actuals.empty:
            rmse = np.sqrt(mean_squared_error(actuals, preds))
            mae = mean_absolute_error(actuals, preds)

            print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
            print(f"Mean Absolute Error (MAE): {mae:.2f}")
            plt.figure(figsize=(15, 6))
            plt.plot(train_ts.index, train_ts, label='Training Data')
            plt.plot(actuals.index, actuals, label='Actual Test Data', color='orange')
            plt.plot(preds.index, preds, label='SARIMA Predictions', color='green', linestyle='--')
            plt.title('SARIMA Forecast vs Actuals (Daily Crime Count)')
            plt.xlabel('Date')
            plt.ylabel('Number of Incidents')
            plt.legend()
            plt.grid(True)
            plt.show()

        else:
             print("No valid data points to evaluate predictions against in the test set.")


    except Exception as e:
        print(f"\nAn error occurred during SARIMA model fitting or prediction: {e}")
        print("Consider checking data stationarity, choosing different SARIMA parameters, or trying a different model.")
        print("If using exogenous variables, ensure they are aligned correctly and have no NaNs.")


else:
    print("Training and/or Testing DataFrames not found or are None.")
    print("Please ensure the data splitting step ran successfully.")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
try:
    sarima_rmse = rmse
    sarima_mae = mae
    metrics_calculated = True
except NameError:
    print("RMSE and MAE variables not found from the previous model training step.")
    print("Skipping evaluation metric visualization.")
    metrics_calculated = False
if metrics_calculated:
    print("--- Visualizing Evaluation Metric Scores ---")
    metrics_data = {'Metric': ['RMSE', 'MAE'],
                    'SARIMA Model': [sarima_rmse, sarima_mae]}
    metrics_df = pd.DataFrame(metrics_data)
    metrics_df = metrics_df.set_index('Metric')
    ax = metrics_df.plot(kind='bar', figsize=(8, 6), colormap='viridis')

    plt.title('Model Evaluation Metrics on Test Set')
    plt.ylabel('Score Value')
    plt.xticks(rotation=0)
    plt.grid(axis='y', linestyle='--')
    plt.legend(title='Model')
    for container in ax.containers:
        ax.bar_label(container, fmt='%.2f')

    plt.tight_layout()
    plt.show()

    print("\n--- Evaluation Metric Score Chart Displayed ---")
    print("This chart compares the RMSE and MAE of the SARIMA model on the test set.")
    print("Add results from other models to this chart for comparison.")

else:
    print("Cannot generate evaluation metric chart as scores were not calculated.")

#### 2. Cross-Validation & Hyperparameter Tuning
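
Because the observations are ordered in time, cross-validation for a forecasting model should never train on data that comes after the fold being tested. A rolling-origin (expanding-window) scheme is one common way to respect this. The sketch below is illustrative only: `expanding_window_splits` and `seasonal_naive_forecast` are hypothetical helpers, the series is synthetic, and the seasonal-naive forecaster merely stands in for whichever model is being evaluated.

```python
import numpy as np

def expanding_window_splits(n_obs, n_folds, horizon):
    """Yield (train_end, test_end) positions for rolling-origin evaluation.

    Each fold trains on observations [0, train_end) and tests on [train_end, test_end).
    """
    first_train_end = n_obs - n_folds * horizon
    if first_train_end <= 0:
        raise ValueError("series too short for the requested folds and horizon")
    for k in range(n_folds):
        train_end = first_train_end + k * horizon
        yield train_end, train_end + horizon

def seasonal_naive_forecast(history, horizon, period=7):
    """Placeholder model: repeat the last full seasonal cycle of the history."""
    last_cycle = np.asarray(history[-period:], dtype=float)
    reps = int(np.ceil(horizon / period))
    return np.tile(last_cycle, reps)[:horizon]

# Synthetic daily series with a weekly pattern (made-up numbers).
rng = np.random.default_rng(0)
series = 50 + 10 * np.sin(2 * np.pi * np.arange(140) / 7) + rng.normal(0, 1, 140)

fold_rmse = []
for train_end, test_end in expanding_window_splits(len(series), n_folds=4, horizon=14):
    forecast = seasonal_naive_forecast(series[:train_end], horizon=test_end - train_end)
    errors = series[train_end:test_end] - forecast
    fold_rmse.append(float(np.sqrt(np.mean(errors ** 2))))

print([round(v, 2) for v in fold_rmse])
```

Averaging the per-fold RMSE gives a steadier estimate than a single train/test split. To evaluate SARIMA this way, the placeholder forecaster would be replaced by a model refit on each training window.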

In [None]:
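# Reinstall pmdarima and numpy so their compiled binaries match.
# A runtime restart is usually needed afterwards for the new builds to load.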
!pip uninstall pmdarima -y
!pip uninstall numpy -y
!pip install numpy
!pip install pmdarima
import pandas as pd
import numpy as np
import pmdarima as pm  # provides auto_arima, called below as pm.auto_arima
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt
from IPython.display import display
import warnings
warnings.filterwarnings('ignore')

if 'train_ts' in locals() and 'test_ts' in locals() and train_ts is not None and test_ts is not None:
    print("--- Implementing SARIMA Model with Hyperparameter Optimization (Auto-ARIMA) ---")
    train_ts_clean = train_ts.dropna()
    test_ts_clean = test_ts.dropna()

    if train_ts_clean.empty or test_ts_clean.empty:
        print("Cleaned training or testing time series is empty after dropping NaNs. Cannot proceed.")
    else:
        print(f"Cleaned training time series shape: {train_ts_clean.shape}")
        print(f"Cleaned testing time series shape: {test_ts_clean.shape}")
        print("\nRunning auto_arima to find best SARIMA parameters...")
        try:
            arima_model = pm.auto_arima(train_ts_clean,
                                       start_p=1, start_q=1,
                                       max_p=5, max_q=5,
                                       start_P=1, start_Q=1,
                                       max_P=3, max_Q=3,
                                       m=7,
                                       seasonal=True,
                                       d=None, D=None,
                                       trace=True,
                                       error_action='ignore',
                                       suppress_warnings=True,
                                       stepwise=True)


            print("\nauto_arima search complete.")
            print(f"Best SARIMA parameters found: {arima_model.order} (non-seasonal) and {arima_model.seasonal_order} (seasonal)")
            print(arima_model.summary())
            best_order = arima_model.order
            best_seasonal_order = arima_model.seasonal_order

            print("\nFitting SARIMA model using the best parameters found by auto_arima...")
            model = SARIMAX(train_ts_clean,
                            order=best_order,
                            seasonal_order=best_seasonal_order,
                            enforce_stationarity=False,
                            enforce_invertibility=False)

            sarima_optimized_results = model.fit()

            print("\nSARIMA model fitted successfully with optimized parameters.")
            print(sarima_optimized_results.summary())

            print("\nMaking predictions on the test set using the optimized model...")

            start_pred = test_ts_clean.index[0]
            end_pred = test_ts_clean.index[-1]
            predictions_optimized = sarima_optimized_results.predict(start=start_pred, end=end_pred, dynamic=True)

            print("Predictions generated.")
            print("\nSample Predictions (Optimized Model):")
            print(predictions_optimized.head())
            print("\nSample Actuals (Cleaned Test set):")
            print(test_ts_clean.head())

            # Ensure predictions and actuals have the same index for evaluation
            predictions_optimized_aligned = predictions_optimized.reindex(test_ts_clean.index)

            # Fill any NaNs in predictions_optimized_aligned
            predictions_optimized_aligned = predictions_optimized_aligned.ffill()  # fillna(method='ffill') is deprecated in recent pandas

            # --- Evaluate the Optimized Model ---
            print("\nEvaluating optimized model performance on the test set...")

            # Ensure no NaNs or infinities before calculating metrics
            valid_indices_optimized = predictions_optimized_aligned.notna() & test_ts_clean.notna()
            actuals_optimized = test_ts_clean[valid_indices_optimized]
            preds_optimized = predictions_optimized_aligned[valid_indices_optimized]

            if not actuals_optimized.empty:
                rmse_optimized = np.sqrt(mean_squared_error(actuals_optimized, preds_optimized))
                mae_optimized = mean_absolute_error(actuals_optimized, preds_optimized)

                print(f"Optimized Root Mean Squared Error (RMSE): {rmse_optimized:.2f}")
                print(f"Optimized Mean Absolute Error (MAE): {mae_optimized:.2f}")

                # Store optimized scores to compare later
                optimized_sarima_rmse = rmse_optimized
                optimized_sarima_mae = mae_optimized
                optimized_metrics_calculated = True


                # --- Visualize Forecast vs Actuals (Optimized) ---
                plt.figure(figsize=(15, 6))
                plt.plot(train_ts_clean.index, train_ts_clean, label='Training Data')
                plt.plot(actuals_optimized.index, actuals_optimized, label='Actual Test Data', color='orange')
                plt.plot(preds_optimized.index, preds_optimized, label='Optimized SARIMA Predictions', color='green', linestyle='--')
                plt.title('Optimized SARIMA Forecast vs Actuals (Daily Crime Count)')
                plt.xlabel('Date')
                plt.ylabel('Number of Incidents')
                plt.legend()
                plt.grid(True)
                plt.show()

            else:
                 print("No valid data points to evaluate optimized predictions against in the test set.")
                 optimized_metrics_calculated = False


        except Exception as e:
            print(f"\nAn error occurred during auto_arima or optimized SARIMA fitting/prediction: {e}")
            print("Consider adjusting auto_arima parameters (max_order, m, stepwise=False), or checking data for issues.")
            optimized_metrics_calculated = False


else:
    print("Training and/or Testing Time Series not found or are None.")
    print("Please ensure the data aggregation and splitting steps ran successfully.")

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

try:
    # Try to use the scores from the optimized model
    sarima_rmse_score = optimized_sarima_rmse
    sarima_mae_score = optimized_sarima_mae
    metrics_calculated = True
    print("Using evaluation metrics from the optimized SARIMA model.")
except NameError:
    try:
        # Fallback to the scores from the initial SARIMA model if optimized scores are not available
        sarima_rmse_score = rmse
        sarima_mae_score = mae
        metrics_calculated = True
        print("Using evaluation metrics from the initial SARIMA model.")
    except NameError:
        print("RMSE and MAE variables not found from the previous model training steps.")
        print("Skipping evaluation metric visualization.")
        metrics_calculated = False


if metrics_calculated:
    print("--- Visualizing Evaluation Metric Scores ---")

    # Create a DataFrame to hold the metrics for plotting
    metrics_data = {'Metric': ['RMSE', 'MAE'],
                    'SARIMA Model': [sarima_rmse_score, sarima_mae_score]}
                    # Add more models here as you implement them
                    # 'LightGBM Model': [lightgbm_rmse, lightgbm_mae]}

    metrics_df = pd.DataFrame(metrics_data)

    # Set the 'Metric' column as the index for easier plotting
    metrics_df = metrics_df.set_index('Metric')

    # Plot the bar chart
    ax = metrics_df.plot(kind='bar', figsize=(8, 6), colormap='viridis')

    plt.title('Model Evaluation Metrics on Test Set')
    plt.ylabel('Score Value')
    plt.xticks(rotation=0) # Keep x-axis labels horizontal
    plt.grid(axis='y', linestyle='--')
    plt.legend(title='Model')

    # Add the score values on top of the bars
    for container in ax.containers:
        ax.bar_label(container, fmt='%.2f') # Format to 2 decimal places

    plt.tight_layout() # Adjust layout
    plt.show()

    print("\n--- Evaluation Metric Score Chart Displayed ---")
    print("This chart compares the RMSE and MAE of the SARIMA model on the test set.")
    print("Add results from other models to this chart for comparison.")

else:
    print("Cannot generate evaluation metric chart as scores were not calculated.")

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Uninstall pmdarima
!pip uninstall pmdarima -y

# Reinstall pmdarima
!pip install pmdarima
import pandas as pd
import numpy as np
import pmdarima as pm  # needed for pm.auto_arima below
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt
from IPython.display import display
import warnings
warnings.filterwarnings('ignore')

if 'train_ts' in locals() and 'test_ts' in locals() and train_ts is not None and test_ts is not None:
    print("--- Implementing SARIMA Model with Hyperparameter Optimization (Auto-ARIMA) ---")

    # Ensure there are no NaNs in the time series used for auto_arima
    # Drop NaNs primarily for the time series used in auto_arima and model fitting/prediction.
    train_ts_clean = train_ts.dropna()
    test_ts_clean = test_ts.dropna()

    if train_ts_clean.empty or test_ts_clean.empty:
        print("Cleaned training or testing time series is empty after dropping NaNs. Cannot proceed with modeling.")
    else:
        print(f"Cleaned training time series shape: {train_ts_clean.shape}")
        print(f"Cleaned testing time series shape: {test_ts_clean.shape}")

        # --- Hyperparameter Optimization using auto_arima ---

        print("\nRunning auto_arima to find best SARIMA parameters...")

        try:

            arima_model = pm.auto_arima(train_ts_clean,
                                       start_p=1, start_q=1,
                                       max_p=5, max_q=5, # Limit search space for non-seasonal
                                       start_P=1, start_Q=1,
                                       max_P=3, max_Q=3, # Limit search space for seasonal
                                       m=7,            # Seasonal period (e.g., 7 for weekly daily data)
                                       seasonal=True,
                                       d=None, D=None, # Let auto_arima find differencing order
                                       trace=True,      # Print status updates
                                       error_action='ignore', # Ignore errors for specific parameter combinations
                                       suppress_warnings=True, # Hide convergence warnings
                                       stepwise=True) # Use stepwise search (faster)
                                       # n_fits=10 # Number of random fits for random search (if stepwise=False)


            print("\nauto_arima search complete.")
            print(f"Best SARIMA parameters found: {arima_model.order} (non-seasonal) and {arima_model.seasonal_order} (seasonal)")
            print(arima_model.summary())

            # The arima_model object is itself a fitted model, ready for prediction.
            # We can directly use this model for prediction.

            # --- Predict on the model ---

            # Make predictions on the test set period
            print("\nMaking predictions on the test set using the auto_arima optimized model...")
            n_periods = len(test_ts_clean)
            predictions_optimized = arima_model.predict(n_periods=n_periods)

            # Depending on the pmdarima version, predict returns a NumPy array or a pandas Series.
            # Converting to an array first avoids index misalignment when attaching the test index.
            predictions_optimized_series = pd.Series(np.asarray(predictions_optimized), index=test_ts_clean.index)


            print("Predictions generated.")
            print("\nSample Predictions (Optimized Model):")
            print(predictions_optimized_series.head())
            print("\nSample Actuals (Cleaned Test set):")
            print(test_ts_clean.head())

            # --- Evaluate the Optimized Model ---
            print("\nEvaluating optimized model performance on the test set...")

            # Ensure no NaNs or infinities before calculating metrics
            # The predictions_optimized_series should align with test_ts_clean by index.
            # We can drop NaNs just in case, though it shouldn't be necessary if indices match.
            valid_indices_optimized = predictions_optimized_series.notna() & test_ts_clean.notna()
            actuals_optimized = test_ts_clean[valid_indices_optimized]
            preds_optimized = predictions_optimized_series[valid_indices_optimized]

            if not actuals_optimized.empty:
                rmse_optimized = np.sqrt(mean_squared_error(actuals_optimized, preds_optimized))
                mae_optimized = mean_absolute_error(actuals_optimized, preds_optimized)

                print(f"Optimized Root Mean Squared Error (RMSE): {rmse_optimized:.2f}")
                print(f"Optimized Mean Absolute Error (MAE): {mae_optimized:.2f}")

                # Store optimized scores to compare later
                optimized_sarima_rmse = rmse_optimized
                optimized_sarima_mae = mae_optimized
                # Make these available globally for the evaluation chart cell
                get_ipython().push({'optimized_sarima_rmse': optimized_sarima_rmse,
                                    'optimized_sarima_mae': optimized_sarima_mae})
                optimized_metrics_calculated = True


                # --- Visualize Forecast vs Actuals (Optimized) ---
                plt.figure(figsize=(15, 6))
                plt.plot(train_ts_clean.index, train_ts_clean, label='Training Data')
                plt.plot(actuals_optimized.index, actuals_optimized, label='Actual Test Data', color='orange')
                plt.plot(preds_optimized.index, preds_optimized, label='Optimized SARIMA Predictions', color='green', linestyle='--')
                plt.title('Optimized SARIMA Forecast vs Actuals (Daily Crime Count)')
                plt.xlabel('Date')
                plt.ylabel('Number of Incidents')
                plt.legend()
                plt.grid(True)
                plt.show()

            else:
                 print("No valid data points to evaluate optimized predictions against in the test set.")
                 optimized_metrics_calculated = False


        except Exception as e:
            print(f"\nAn error occurred during auto_arima fitting or prediction: {e}")
            print("Consider adjusting auto_arima parameters (max_order, m, stepwise=False), or checking data for issues.")
            # If auto_arima fails, the optimized metrics are not calculated
            optimized_metrics_calculated = False


else:
    print("Training and/or Testing Time Series not found or are None.")
    print("Please ensure the data aggregation and splitting steps ran successfully before this cell.")

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.
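
When interpreting the two metrics used in this notebook, it helps to see how they react to different error patterns. The toy numbers below are invented for illustration and the helper names are hypothetical: both forecasts have the same MAE, but the one with a single large miss has a higher RMSE, because squaring weights large errors more heavily.

```python
import numpy as np

def mean_abs_error(y_true, y_pred):
    """Average absolute deviation, in the same units as the target."""
    return float(np.mean(np.abs(y_true - y_pred)))

def root_mean_sq_error(y_true, y_pred):
    """Square root of the average squared deviation; penalizes large misses more."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

actual = np.array([10.0, 12.0, 11.0, 13.0])
steady_forecast = np.array([12.0, 14.0, 13.0, 15.0])  # off by 2 on every day
spiky_forecast = np.array([10.0, 12.0, 11.0, 21.0])   # exact except one miss of 8

print(mean_abs_error(actual, steady_forecast), root_mean_sq_error(actual, steady_forecast))  # 2.0 2.0
print(mean_abs_error(actual, spiky_forecast), root_mean_sq_error(actual, spiky_forecast))    # 2.0 4.0
```

If occasional large misses are costly for planning (for example, under-forecasting a spike in incidents), RMSE is arguably the more relevant number; if typical day-to-day error matters more, MAE is easier to communicate.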

### ML Model - 3
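
As background for the cell below, which fits Holt-Winters exponential smoothing, the additive variant (one of the combinations in the grid) can be written in its standard textbook form. This is a general formulation rather than something taken from this notebook, and the exact parameterization inside statsmodels may differ slightly. With level $\ell_t$, trend $b_t$, seasonal component $s_t$, seasonal period $m$, and smoothing parameters $\alpha$, $\beta$, $\gamma$:

$$
\begin{aligned}
\ell_t &= \alpha\,(y_t - s_{t-m}) + (1-\alpha)\,(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta\,(\ell_t - \ell_{t-1}) + (1-\beta)\,b_{t-1} \\
s_t &= \gamma\,(y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma)\,s_{t-m} \\
\hat{y}_{t+h\mid t} &= \ell_t + h\,b_t + s_{t+h-m(k+1)}, \qquad k = \lfloor (h-1)/m \rfloor
\end{aligned}
$$

The multiplicative options in the grid replace the additive seasonal (or trend) terms with ratios and products, which is why they require strictly positive data.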

In [None]:
# Uninstall existing statsmodels and numpy aggressively
!pip uninstall statsmodels -y
!pip uninstall numpy -y

# Clear pip cache
!pip cache purge

# Install the latest statsmodels and numpy together (no versions are pinned here).
# A runtime restart is usually needed afterwards for the new builds to load.
!pip install statsmodels numpy

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.metrics import mean_squared_error
import warnings

warnings.filterwarnings('ignore')

if 'daily_crime_counts' in globals() and daily_crime_counts is not None and len(daily_crime_counts) > 100:
    print("Using the prepared daily_crime_counts time series for Holt-Winters.")
    train_size = int(len(daily_crime_counts) * 0.8)
    train_data, test_data = daily_crime_counts[0:train_size], daily_crime_counts[train_size:]

    # Model structure options to search over. The smoothing parameters
    # (alpha, beta, gamma) are estimated by statsmodels during fitting.
    trend_modes = ['add', 'mul']
    seasonal_modes = ['add', 'mul']
    seasonal_periods_to_test = [7, 30]

    print("\nPerforming Grid Search for Holt-Winters model types and seasonal periods...")
    best_rmse = float('inf')
    best_model_params = None
    best_fitted_model = None

    for trend_mode in trend_modes:
        for seasonal_mode in seasonal_modes:
            for period in seasonal_periods_to_test:
                 try:
                    if len(train_data) < period * 2:
                        print(f"Skipping period {period} as data is too short.")
                        continue

                    print(f"Trying params: trend='{trend_mode}', seasonal='{seasonal_mode}', seasonal_periods={period}")
                    model = ExponentialSmoothing(train_data,
                                                 trend=trend_mode,
                                                 seasonal=seasonal_mode,
                                                 seasonal_periods=period,
                                                 initialization_method='estimated')
                    fitted_model = model.fit()
                    predictions = fitted_model.predict(start=test_data.index[0], end=test_data.index[-1])
                    rmse = np.sqrt(mean_squared_error(test_data, predictions))  # 'squared=False' is not accepted by newer scikit-learn

                    print(f"Params: trend='{trend_mode}', seasonal='{seasonal_mode}', seasonal_periods={period}, RMSE: {rmse}")
                    if rmse < best_rmse:
                        best_rmse = rmse
                        best_model_params = {
                            'trend': trend_mode,
                            'seasonal': seasonal_mode,
                            'seasonal_periods': period
                        }
                        best_fitted_model = fitted_model # Store the fitted model


                 except Exception as e:
                    print(f"Error fitting model with params trend='{trend_mode}', seasonal='{seasonal_mode}', seasonal_periods={period}: {e}")
                    continue

    print("\nHolt-Winters Grid Search completed.")

    if best_model_params:
        print("Best Holt-Winters Model Parameters:", best_model_params)
        print("Best RMSE:", best_rmse)
        print("\nFitting the final Holt-Winters model with best parameters on the full dataset...")
        try:
            final_model = ExponentialSmoothing(daily_crime_counts,
                                              trend=best_model_params['trend'],
                                              seasonal=best_model_params['seasonal'],
                                              seasonal_periods=best_model_params['seasonal_periods'],
                                              initialization_method='estimated')

            final_fitted_model = final_model.fit()
            start_date_future = daily_crime_counts.index[-1] + pd.Timedelta(days=1)
            end_date_future = start_date_future + pd.Timedelta(days=29)
            forecast_future = final_fitted_model.predict(start=start_date_future, end=end_date_future)

            print("\nFuture forecast generated.")
            print(forecast_future) # Display the future predictions

        except Exception as e:
            print(f"Error fitting final model or generating forecast: {e}")
    else:
        print("\nNo valid Holt-Winters model parameters found during grid search.")


else:
    print("\nSkipping ML Model implementation as the daily_crime_counts time series was not prepared.")
    print("Please ensure the data preparation steps were run.")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

# The previous cell runs a grid search over Holt-Winters configurations and
# records the lowest test-set RMSE, so that value is reported here.
# A bar chart comparing RMSE across models (e.g., SARIMA vs. Holt-Winters) could be added.

if 'best_rmse' in globals() and best_rmse != float('inf'):
    print(f"Best RMSE found during Holt-Winters Grid Search: {best_rmse:.4f}")

    # Example bar chart comparing two models' RMSE
    # (requires matplotlib.pyplot as plt and seaborn as sns to be imported):
    # model_names = ['Optimized SARIMA', 'Best Holt-Winters']
    # rmse_scores = [optimized_sarima_rmse, best_rmse]  # Replace with the scores you actually computed

    # plt.figure(figsize=(8, 5))
    # sns.barplot(x=model_names, y=rmse_scores, palette='viridis')
    # plt.title('Comparison of Model RMSE Scores')
    # plt.ylabel('Root Mean Squared Error (RMSE)')
    # plt.show()

else:
    print("No best RMSE found. Please run the model fitting code first.")

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Import necessary libraries if not already imported
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Make sure the pip installs in the ML Model - 3 cell above have been run before this cell
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.metrics import mean_squared_error
import warnings

warnings.filterwarnings('ignore')

# Assume 'daily_crime_counts' time series is prepared and available
# Based on the previous code, the check is already in place

if 'daily_crime_counts' in globals() and daily_crime_counts is not None and len(daily_crime_counts) > 100:
    print("Using the prepared daily_crime_counts time series for Holt-Winters.")

    # Split data into training and testing sets (80/20 split)
    train_size = int(len(daily_crime_counts) * 0.8)
    train_data, test_data = daily_crime_counts[0:train_size], daily_crime_counts[train_size:]

    # Define parameter grids for Grid Search
    # Note: alphas, betas, gammas are smoothing parameters which can also be tuned,
    # but statsmodels can estimate these during fitting. Tuning trend, seasonal
    # and seasonal_periods is a common initial approach.
    trend_modes = ['add', 'mul']
    seasonal_modes = ['add', 'mul']
    # Test common seasonal periods: 7 for weekly seasonality, 30 for monthly approximation
    seasonal_periods_to_test = [7, 30] # Can add more periods like 365 for yearly if data span allows

    best_rmse = float('inf') # Initialize with a high value
    best_model_params = None # To store the parameters that yield the lowest RMSE
    best_fitted_model = None # To store the best fitted model from the test set

    print("\nPerforming Grid Search for Holt-Winters model types and seasonal periods...")

    # Iterate through all combinations of parameters
    for trend_mode in trend_modes:
        for seasonal_mode in seasonal_modes:
            for period in seasonal_periods_to_test:
                 # Skip combinations that statsmodels cannot fit.
                 # Multiplicative trend or seasonal components require strictly positive data.
                 if 'mul' in (trend_mode, seasonal_mode) and (train_data <= 0).any():
                     print("Skipping multiplicative components due to zero or negative values in data.")
                     continue
                 if len(train_data) < period * 2:
                     print(f"Skipping period {period} as training data ({len(train_data)} points) is too short (need at least 2*period).")
                     continue

                 try:
                    print(f"Trying params: trend='{trend_mode}', seasonal='{seasonal_mode}', seasonal_periods={period}")
                    # Define the Exponential Smoothing model with current parameters
                    model = ExponentialSmoothing(train_data,
                                                 trend=trend_mode,
                                                 seasonal=seasonal_mode,
                                                 seasonal_periods=period,
                                                 initialization_method='estimated') # Let statsmodels estimate initial values

                    # Fit the model to the training data
                    fitted_model = model.fit()

                    # Predict on the test data
                    # Ensure the prediction range matches the test data index
                    predictions = fitted_model.predict(start=test_data.index[0], end=test_data.index[-1])

                    # Calculate RMSE for the test data as the square root of MSE
                    # (avoids the `squared` argument, which newer scikit-learn releases no longer accept)
                    rmse = np.sqrt(mean_squared_error(test_data, predictions))

                    print(f"Params: trend='{trend_mode}', seasonal='{seasonal_mode}', seasonal_periods={period}, RMSE: {rmse}")

                    # Check if the current model is the best so far
                    if rmse < best_rmse:
                        best_rmse = rmse
                        best_model_params = {
                            'trend': trend_mode,
                            'seasonal': seasonal_mode,
                            'seasonal_periods': period
                        }
                        # Optionally store the fitted model, though we'll refit on full data later
                        # best_fitted_model = fitted_model


                 except Exception as e:
                    # Catch potential errors during fitting (e.g., convergence issues)
                    print(f"Error fitting model with params trend='{trend_mode}', seasonal='{seasonal_mode}', seasonal_periods={period}: {e}")
                    continue

    print("\nHolt-Winters Grid Search completed.")

    # Fit the final model using the best parameters found on the full dataset
    if best_model_params:
        print("Best Holt-Winters Model Parameters:", best_model_params)
        print("Best RMSE on Test Data:", best_rmse) # This is the RMSE from the test set evaluation

        print("\nFitting the final Holt-Winters model with best parameters on the full dataset...")
        try:
            # Define the final model using the entire time series and best parameters
            final_model = ExponentialSmoothing(daily_crime_counts,
                                              trend=best_model_params['trend'],
                                              seasonal=best_model_params['seasonal'],
                                              seasonal_periods=best_model_params['seasonal_periods'],
                                              initialization_method='estimated')

            # Fit the model to the full dataset
            final_fitted_model = final_model.fit()

            print("Final model fitted on the full dataset.")

            # Predict into the future (e.g., next 30 days)
            # Determine the start and end dates for the future forecast
            start_date_future = daily_crime_counts.index[-1] + pd.Timedelta(days=1)
            # Predict for the next 30 days
            end_date_future = start_date_future + pd.Timedelta(days=29) # This is the end date inclusive

            # Generate future predictions
            forecast_future = final_fitted_model.predict(start=start_date_future, end=end_date_future)

            print("\nFuture forecast generated (next 30 days):")
            print(forecast_future) # Display the future predictions

            # Chart visualization code
            plt.figure(figsize=(15, 7))
            plt.plot(daily_crime_counts.index, daily_crime_counts, label='Observed Data')
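            # Refit the selected configuration on the training split so the plotted
            # test predictions come from the best model, not the last grid iteration
            best_test_fit = ExponentialSmoothing(train_data,
                                                 trend=best_model_params['trend'],
                                                 seasonal=best_model_params['seasonal'],
                                                 seasonal_periods=best_model_params['seasonal_periods'],
                                                 initialization_method='estimated').fit()
            best_test_predictions = best_test_fit.predict(start=test_data.index[0], end=test_data.index[-1])
            plt.plot(test_data.index, best_test_predictions, label='Test Set Predictions', color='orange', linestyle='--')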
            plt.plot(forecast_future.index, forecast_future, label='Future Forecast', color='green')
            plt.title('Holt-Winters Forecasting of Daily Crime Counts')
            plt.xlabel('Date')
            plt.ylabel('Number of Incidents')
            plt.legend()
            plt.grid(True)
            plt.show()


        except Exception as e:
            print(f"Error fitting final model or generating forecast: {e}")
    else:
        print("\nNo valid Holt-Winters model parameters found during grid search. Cannot fit final model or forecast.")


else:
    print("\nSkipping ML Model implementation as the daily_crime_counts time series was not prepared or is too short.")
    print("Please ensure the data preparation steps were run and the time series has enough data points (>100).")

##### Which hyperparameter optimization technique have you used and why?

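A manual grid search was used. The code loops over every combination of `trend` ('add', 'mul'), `seasonal` ('add', 'mul'), and `seasonal_periods` (7, 30), fits a Holt-Winters model on the first 80% of the series, and keeps the combination with the lowest RMSE on the remaining 20%. The grid has only eight combinations, so an exhaustive search is cheap, and statsmodels estimates the smoothing parameters (alpha, beta, gamma) during fitting, so they do not need to be searched. Note that this is a single hold-out evaluation rather than k-fold cross-validation.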
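
The Holt-Winters cell above scores each configuration on one 80/20 split. As a minimal sketch of what a rolling-origin (expanding-window) evaluation could look like, the block below averages RMSE over several consecutive folds. Everything in it is an assumption for illustration: the function names, the toy series, and the seasonal-naive forecaster standing in for `ExponentialSmoothing` so that the example runs with the standard library alone.

```python
import math

def expanding_window_splits(n_obs, n_folds, horizon):
    """Yield (train_end, test_end) index pairs for rolling-origin evaluation.

    Each fold trains on observations [0, train_end) and tests on
    [train_end, test_end); the training window grows by `horizon` per fold.
    """
    first_train_end = n_obs - n_folds * horizon
    if first_train_end <= 0:
        raise ValueError("Series too short for the requested folds and horizon.")
    for fold in range(n_folds):
        train_end = first_train_end + fold * horizon
        yield train_end, train_end + horizon

def seasonal_naive_forecast(train, horizon, period):
    """Hypothetical placeholder model: repeat the last `period` training values."""
    if len(train) < period:
        raise ValueError("Training window is shorter than one seasonal period.")
    last_cycle = train[-period:]
    return [last_cycle[i % period] for i in range(horizon)]

def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def cross_validated_rmse(series, period, n_folds=3, horizon=7):
    """Average RMSE of the placeholder model over all expanding-window folds."""
    scores = []
    for train_end, test_end in expanding_window_splits(len(series), n_folds, horizon):
        forecast = seasonal_naive_forecast(series[:train_end], horizon, period)
        scores.append(rmse(series[train_end:test_end], forecast))
    return sum(scores) / len(scores)

# Toy series with an exact weekly pattern (synthetic, not the FBI data)
toy_series = [10, 12, 11, 13, 15, 20, 18] * 8

# Compare the two seasonal periods tried in the grid search
cv_scores = {period: cross_validated_rmse(toy_series, period) for period in (7, 30)}
best_period = min(cv_scores, key=cv_scores.get)
print("CV RMSE by period:", cv_scores)
print("Best period:", best_period)
```

To apply the same idea to the real series, the placeholder forecaster would be swapped for a model fitted on each fold's training window, and the fold count and horizon would be chosen to match the forecast length of interest.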

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.
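
The modeling cells in this notebook report RMSE only. As a hedged sketch, assuming MAE and MAPE are also of interest (neither is computed in the original cells), the helper below reports all three for a pair of actual and predicted sequences. The name `forecast_errors` and the toy numbers are hypothetical.

```python
import math

def forecast_errors(actual, predicted):
    """Return RMSE, MAE, and MAPE (in percent) for two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("Inputs must be non-empty and of equal length.")
    residuals = [a - p for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    mae = sum(abs(r) for r in residuals) / len(residuals)
    # MAPE is undefined where the actual value is zero, so those points are skipped
    nonzero = [(a, r) for a, r in zip(actual, residuals) if a != 0]
    if nonzero:
        mape = 100 * sum(abs(r) / abs(a) for a, r in nonzero) / len(nonzero)
    else:
        mape = float("nan")
    return {"rmse": rmse, "mae": mae, "mape": mape}

# Toy daily counts (synthetic), including one zero-incident day
toy_actual = [100, 120, 80, 0]
toy_predicted = [110, 115, 90, 5]
scores = forecast_errors(toy_actual, toy_predicted)
print(scores)
```

RMSE penalizes large misses more heavily than MAE, while MAPE expresses the error relative to the observed count; which one matters most depends on how the forecast would be used.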

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for the deployment process.


In [None]:
# Save the File
import pickle
# Import any necessary libraries for your model (e.g., sklearn, tensorflow, pytorch)
# from sklearn.linear_model import LinearRegression
# from your_forecasting_model import YourModel

# --- Placeholder for Model Training or Loading ---
# You need to define or load your 'model' variable here.
# For example, if you were training a model:
# model = YourModel(...) # Initialize your model
# model.fit(X_train, y_train) # Train your model on training data

# Or if you were loading a model:
# with open("path/to/your/trained_model.pkl", "rb") as infile:
#     model = pickle.load(infile)
# --- End of Placeholder ---

# Example: Creating a dummy model object for demonstration purposes if you don't have a real model yet
# REMOVE this section once you have your actual model training/loading code
class DummyModel:
    def predict(self, X):
        print("This is a dummy predict method.")
        return [0] * len(X)

model = DummyModel()
# END of Dummy Model Example - Please replace with your actual model


# Save the model
try:
    with open("best_model.pkl", "wb") as file:
        pickle.dump(model, file)
    print("Model saved successfully to best_model.pkl")
except NameError:
    print("Error: 'model' is not defined. Please ensure you have trained or loaded a model into a variable named 'model'.")
except Exception as e:
    print(f"An error occurred while saving the model: {e}")


# Load the model later
try:
    with open("best_model.pkl", "rb") as file:
        loaded_model = pickle.load(file)
    print("Model loaded successfully from best_model.pkl")
    # You can now use loaded_model
    # For example: loaded_model.predict(...)
except FileNotFoundError:
    print("Error: 'best_model.pkl' not found. Please ensure the save operation was successful.")
except Exception as e:
    print(f"An error occurred while loading the model: {e}")

### 2. Load the saved model file again and try to predict unseen data as a sanity check.


In [None]:
# Load the File and predict unseen data.
import joblib

# --- Placeholder for Model Training or Loading ---
# You need to define or load your 'model' variable here.
# For example, if you were training a model:
# model = YourModel(...) # Initialize your model
# model.fit(X_train, y_train) # Train your model on training data

# Or if you were loading a model:
# with open("path/to/your/trained_model.pkl", "rb") as infile:
#     model = pickle.load(infile) # Or joblib.load

# Example: Creating a dummy model object for demonstration purposes if you don't have a real model yet
# REMOVE this section once you have your actual model training/loading code
# Defining the DummyModel class again or importing it if defined elsewhere
class DummyModel:
    def predict(self, X):
        print("This is a dummy predict method.")
        return [0] * len(X)

model = DummyModel()
# END of Dummy Model Example - Please replace with your actual model


# Save the model
try:
    joblib.dump(model, "best_model.joblib")
    print("Model saved successfully to best_model.joblib")
except NameError:
    # Only reachable if the model definition above is removed without a replacement
    print("Error: 'model' is not defined. Please ensure you have trained or loaded a model into a variable named 'model'.")
except Exception as e:
    print(f"An error occurred while saving the model with joblib: {e}")


# Load the model later
try:
    loaded_model = joblib.load("best_model.joblib")
    print("Model loaded successfully from best_model.joblib")
    # You can now use loaded_model
    # For example: loaded_model.predict(...)
except FileNotFoundError:
    print("Error: 'best_model.joblib' not found. Please ensure the save operation was successful.")
except Exception as e:
    print(f"An error occurred while loading the model with joblib: {e}")
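
The save and load cells above never call `predict` on the reloaded object, so the sanity check itself is still missing. A minimal round-trip sketch follows. It uses a hypothetical seasonal-naive stand-in whose fitted state is a plain dict, written to a temporary directory; the names `model_state` and `seasonal_naive_predict` are assumptions for illustration, not part of the notebook.

```python
import os
import pickle
import tempfile

def seasonal_naive_predict(state, steps):
    """Forecast `steps` values by repeating the stored seasonal cycle."""
    cycle = state["last_cycle"]
    return [cycle[i % len(cycle)] for i in range(steps)]

# Hypothetical "fitted" state: the last weekly cycle of a synthetic series
history = [10, 12, 11, 13, 15, 20, 18] * 4
model_state = {"period": 7, "last_cycle": history[-7:]}

# Forecast before saving, to compare against the reloaded model
expected_forecast = seasonal_naive_predict(model_state, steps=10)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "best_model.pkl")
    with open(model_path, "wb") as file:
        pickle.dump(model_state, file)
    with open(model_path, "rb") as file:
        loaded_state = pickle.load(file)

# Sanity check: the reloaded model must reproduce the original forecast
reloaded_forecast = seasonal_naive_predict(loaded_state, steps=10)
round_trip_ok = reloaded_forecast == expected_forecast
print("Round trip OK:", round_trip_ok)
```

With the real forecaster, the same pattern would apply: forecast a fixed horizon before saving, reload, forecast again, and compare the two outputs.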

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**


- The heatmap provides a clear visualization of the relationships between different variables, making it easier to identify strong and weak correlations.
- Features with high correlation can indicate redundancy in the dataset, which helps in refining predictive models by selecting only the most relevant variables.
- Variables that show unexpected correlations may highlight underlying trends or issues, such as data inconsistencies or hidden dependencies.
- Understanding correlations can assist in strategic decision-making, whether for business optimization, safety planning, or operational improvements.
- The heatmap supports risk assessment by revealing connections between factors that may contribute to specific outcomes, helping to develop preventive measures.
- Insights from the correlation matrix can guide feature engineering, allowing better data transformation for machine learning models.
- If certain features show weak correlations, they may be less significant for prediction models, making it possible to simplify analyses without losing accuracy.
- This visualization aids in anomaly detection, as extreme correlations might indicate biases or irregularities in the dataset that require further investigation.
- Organizations can leverage these insights for smarter resource allocation, whether improving security measures, adjusting business strategies, or refining service delivery.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***