<a href="https://colab.research.google.com/github/SSubhashReddy/Assignment-2/blob/main/Sample_ML_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual
##### **Team Member 1 -** S.Venkata Subhash Reddy
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**

The goal of this project is to analyze and forecast crime trends using historical FBI crime data through extensive exploratory data analysis (EDA), feature engineering, and time series modeling techniques. With crime data collected and maintained by the Federal Bureau of Investigation (FBI) over several decades, this project aims to extract valuable insights from past crime trends and build predictive models to anticipate future crime rates across various crime categories and geographical regions.

The dataset includes annual crime statistics from multiple U.S. states and cities, covering different types of offenses such as violent crimes (e.g., assault, robbery), property crimes (e.g., burglary, larceny), and motor vehicle thefts. The initial steps in the project involved rigorous data preprocessing including missing value treatment, outlier detection, formatting of time variables, and creation of derived time-based features such as month, quarter, and year.

Extensive univariate, bivariate, and multivariate analysis was conducted to understand the temporal patterns and relationships between crime types, locations, and time periods. Multiple charts and visualizations—such as line plots, seasonal decomposition plots, heatmaps, and rolling average plots—were used to reveal long-term trends, seasonal spikes, and anomalous behaviors. Special attention was given to the impact of external events such as the COVID-19 pandemic and socio-economic factors on crime rates, where applicable.

The core objective was to forecast crime rates for future time periods using robust time series modeling techniques. Several forecasting models were explored, including ARIMA, SARIMA, Facebook Prophet, and Long Short-Term Memory (LSTM) neural networks. Each model was evaluated using standard time series metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). The models were fine-tuned using techniques such as grid search, cross-validation (where applicable), and differencing to ensure stationarity in the data. The best-performing model was selected based on validation scores and forecasting accuracy on hold-out test data.
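As a minimal sketch of the metrics and the differencing step named above, the snippet below computes MAE, MSE, and RMSE with NumPy. The helper `forecast_errors` and the sample numbers are illustrative assumptions, not values from the dataset or the project's actual evaluation code.

```python
import numpy as np

def forecast_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = actual - predicted
    mae = np.mean(np.abs(errors))   # average absolute miss
    mse = np.mean(errors ** 2)      # penalizes large misses more heavily
    rmse = np.sqrt(mse)             # back in the units of the series
    return mae, mse, rmse

# Hypothetical monthly incident counts and a hypothetical forecast
actual = [120, 135, 150, 160]
predicted = [118, 140, 145, 170]

mae, mse, rmse = forecast_errors(actual, predicted)
print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}")

# First-order differencing, a common step toward a stationary series
first_diff = np.diff(actual)
print(first_diff)
```

RMSE is often the easiest of the three to communicate because it is expressed in the same units as the forecast, here incidents per period.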

One of the key deliverables of the project is a deployment-ready forecasting model capable of predicting future crime rates on a city or state level. This model can be integrated into law enforcement dashboards or public policy platforms to enable proactive planning, optimized resource allocation, and better preparedness for high-risk periods. In addition, feature importance and explainability tools like SHAP and model decomposition were used to provide insights into what drives fluctuations in crime over time.

The project also involved hypothesis testing to validate statistically significant insights derived from EDA. For example, hypotheses such as “violent crime tends to peak during summer months” or “property crimes increased significantly post-pandemic” were tested using appropriate statistical tests (e.g., t-tests, chi-square tests) to ensure data-backed conclusions.
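To make the idea of a chi-square test concrete, here is a small sketch that computes the Pearson statistic by hand for a 2x2 table, without a continuity correction. The counts are made up for illustration and do not come from the dataset; the row and column labels are assumptions chosen to mirror the seasonal hypothesis.

```python
import math
import numpy as np

# Hypothetical 2x2 contingency table (made-up counts, not from the dataset)
# rows: summer months, other months; columns: violent crime, property crime
observed = np.array([[90.0, 210.0],
                     [60.0, 240.0]])

# Expected counts under independence: (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

# Pearson chi-square statistic (no continuity correction)
chi2 = ((observed - expected) ** 2 / expected).sum()

# A 2x2 table has 1 degree of freedom, where P(X > chi2) = erfc(sqrt(chi2 / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi2={chi2:.2f}, p={p_value:.4f}")
```

A p-value below the chosen significance level (commonly 0.05) would lead to rejecting the null hypothesis that season and crime category are independent.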

In conclusion, this project not only demonstrates the power of time series forecasting in understanding crime dynamics but also emphasizes the importance of data-driven decision-making in public safety. By accurately forecasting future crime trends, law enforcement agencies can move from reactive responses to strategic prevention—making communities safer and resources more efficiently utilized.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The United States Federal Bureau of Investigation (FBI) maintains a comprehensive database of crime reports collected across different states and cities. This data, spanning multiple years, captures a wide range of criminal activities including violent crimes, property crimes, and other offenses. Despite the wealth of information available, many agencies still rely on traditional, reactive approaches to crime management, lacking data-driven systems for proactive decision-making.

The primary challenge is to forecast future crime trends using this historical data to enable law enforcement agencies and policymakers to make informed decisions. Crime patterns are often influenced by a variety of factors such as location, time of year, socio-economic conditions, and external events (e.g., the COVID-19 pandemic), making forecasting a complex task. Time series forecasting offers a promising approach to detect trends, seasonal effects, and anomalies over time.

This project focuses on building a robust time series forecasting model using historical FBI crime data to predict future incidents. The objective is to identify temporal patterns and deliver accurate forecasts that can help agencies allocate resources efficiently, prepare for high-risk periods, and ultimately reduce crime rates through strategic planning. The success of such a model could lead to significant advancements in public safety by shifting the focus from reactive policing to proactive intervention based on data-driven insights.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Dataset First Look
# Path to the Excel file in the mounted Google Drive
data_path = '/content/drive/MyDrive/Train.xlsx'

try:
    df = pd.read_excel(data_path)
except FileNotFoundError:
    print(f"Error: '{data_path}' was not found in your Google Drive.")
    print("Please verify the file path and ensure the file exists and is correctly named.")
    raise  # Stop here so later cells do not fail on an undefined df

### Dataset First View

In [None]:
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cmap='viridis', cbar=False)
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

Based on its columns (TYPE, HUNDRED_BLOCK, NEIGHBOURHOOD, X, Y, Latitude, Longitude, HOUR, MINUTE, YEAR, MONTH, DAY, Date), the dataset appears to record time-stamped events at specific locations, most likely crime or incident reports.

Time-based data: YEAR, MONTH, DAY, HOUR, MINUTE suggest it can be used for time series forecasting.

Geospatial information: Latitude, Longitude, X, Y indicate locations, useful for mapping or spatial analysis.

Categorical classifications: TYPE, HUNDRED_BLOCK, NEIGHBOURHOOD might categorize events by type and location.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe().T

In [None]:
df.describe(include='object').T

### Variables Description

**TYPE** – Likely represents the type of event or incident (e.g., crime type, report category).

**HUNDRED_BLOCK** – Refers to a specific street block location where the event occurred.

**NEIGHBOURHOOD** – The neighborhood where the event was reported.

**X, Y** – Spatial coordinates, potentially representing map positions (may be in a local coordinate system).

**Latitude, Longitude** – Geographic coordinates identifying the exact location.

**HOUR, MINUTE** – The specific time when the event happened.

**YEAR, MONTH, DAY** – The date details, useful for time-based analysis.

**Date** – A formatted timestamp representing the full date of the event.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
round((df.isnull().sum()/df.shape[0])*100)

### What all manipulations have you done and insights you found?


### **Data Manipulations I Would Perform:**
1. **Data Cleaning** – Handling missing values, correcting data types, and ensuring consistency.
2. **Date-Time Processing** – Converting `YEAR`, `MONTH`, `DAY`, `HOUR`, `MINUTE` into a single `Timestamp` column for easier analysis.
3. **Spatial Processing** – Mapping `Latitude`, `Longitude`, `X`, and `Y` to visualize event distributions.
4. **Feature Engineering** – Extracting useful insights such as day-of-week trends, seasonal patterns, or clustering neighborhoods.
5. **Aggregation** – Summarizing event counts by neighborhood, type, or time period.
6. **Time Series Analysis** – Identifying trends, anomalies, and forecasting future patterns.
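
The date-time processing, feature engineering, and aggregation steps listed above can be sketched as follows. The rows are made up for illustration; only the column names follow the dataset, and the `Timestamp` and `day_of_week` columns are names assumed for this example.

```python
import pandas as pd

# Tiny illustrative frame; values are made up, column names follow the dataset
sample = pd.DataFrame({
    "TYPE": ["Theft from Vehicle", "Mischief", "Theft from Vehicle"],
    "YEAR": [2010, 2010, 2010],
    "MONTH": [1, 1, 2],
    "DAY": [5, 20, 3],
    "HOUR": [14, 23, 8],
    "MINUTE": [30, 0, 15],
})

# pd.to_datetime can assemble a timestamp from year/month/day/hour/minute columns
parts = sample[["YEAR", "MONTH", "DAY", "HOUR", "MINUTE"]].rename(columns=str.lower)
sample["Timestamp"] = pd.to_datetime(parts)

# Feature engineering: day of week derived from the timestamp
sample["day_of_week"] = sample["Timestamp"].dt.day_name()

# Aggregation: incident counts per calendar month (a series ready for forecasting)
monthly_counts = sample.set_index("Timestamp").resample("MS").size()
print(monthly_counts)
```

Resampling to month start (`"MS"`) yields one row per month, which is the regularly spaced shape that most time series models expect.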

### **Possible Insights I Could Extract:**
 **Peak Hours for Events** – Finding when incidents are most frequent.  
 **Neighborhood Analysis** – Which areas have the highest incident rates?  
 **Seasonal Trends** – Do incidents rise at certain times of the year?  
 **Geospatial Patterns** – Are there hotspots for specific events?  
 **Predictive Modeling** – Forecasting future events based on historical data.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
numeric_df = df.select_dtypes(include=np.number)
plt.figure(figsize=(10, 6))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

**Heatmap** – Chosen here because a correlation heatmap shows the pairwise relationships between all numeric columns in a single view, which makes strongly related or redundant features easy to spot.

**Line Chart** – An alternative for showing trends over time, especially when forecasting, since it clearly shows fluctuations and seasonal patterns.

**Scatter Plot** – An alternative for visualizing the relationship between two variables, such as Latitude and Longitude to map incidents.

**Bar Chart** – An alternative for comparing categorical variables, such as incident types across neighborhoods.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### **Positive Business Impact:**
1. **Optimized Resource Allocation** – By understanding peak incident hours and high-risk neighborhoods, law enforcement or businesses can optimize staff deployment, improving efficiency.
2. **Better Decision-Making** – Companies in security, insurance, and public safety can adjust strategies based on crime trends.
3. **Predictive Analysis for Risk Prevention** – If incidents follow patterns, businesses can take proactive measures to minimize risks, ensuring safer environments.
4. **Urban Planning Improvements** – City planners can use geospatial insights to develop safer infrastructure and improve neighborhood conditions.

### **Insights That Could Lead to Negative Growth:**
1. **Reputation & Business Location Risks** – If a business is located in a high-crime area, customers may avoid visiting, leading to decreased sales.
2. **Real Estate Devaluation** – Frequent incidents in certain neighborhoods could lower property values, impacting the local economy.
3. **Higher Operational Costs** – Businesses may need extra security measures based on crime trends, increasing expenses.



#### Chart - 2

In [None]:
# Chart - 2 visualization code
sns.countplot(data=df, x='TYPE')
plt.title('Crime Type Distribution')
plt.xlabel('Crime Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

**Line Chart** – Best for showing trends over time. Since the dataset has date and time variables, a line chart helps visualize how events fluctuate over months or years.

**Bar Chart** – Ideal for comparing categorical data, such as different neighborhoods or incident types. It makes it easy to spot which areas or event types are most frequent.

**Scatter Plot** – Helps examine relationships between geospatial variables, like latitude and longitude, to understand location clustering.

**Heatmap** – Useful for seeing density distributions of events over time or across locations.

##### 2. What is/are the insight(s) found from the chart?

**Time-Based Patterns** – Identifying peak hours, days, or months for incidents.

**Location Insights** – Finding high-risk neighborhoods based on event occurrences.

**Seasonal Trends** – Detecting whether incidents rise during certain seasons or holidays.

**Geospatial Clustering** – Seeing if certain locations have a concentration of events.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from analyzing this dataset can create a **positive business impact**, depending on how they are used. At the same time, some trends might highlight challenges that could lead to **negative growth** if not addressed properly.

### **🔹 Positive Business Impact**
1. **Optimized Operations** – Businesses like law enforcement agencies or security firms can use insights on peak crime hours and risky locations to allocate resources more efficiently.
2. **Strategic Location Decisions** – Companies can use neighborhood trends to decide where to open stores, place security systems, or adjust insurance policies.
3. **Predictive Risk Management** – By forecasting crime or incidents, businesses can take preventive measures, improving safety and reducing future costs.
4. **Improved Public Services** – Government agencies can implement better safety measures and infrastructure planning based on historical incident trends.

### **🔻 Potential Negative Growth Risks**
1. **Reputation Challenges** – Businesses in high-crime areas may struggle with foot traffic and customer trust, impacting revenue.
2. **Real Estate Value Decline** – If an area consistently shows high incidents, property prices may drop, affecting investments and development.
3. **Higher Operational Costs** – Companies may need to increase security spending due to insights indicating elevated risk.

### **⚡ The Key Takeaway**
Even insights that appear negative can be turned into **opportunities**—for example, high-risk locations might encourage investment in better safety infrastructure, leading to long-term growth. Using the data wisely makes all the difference!

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='YEAR')
plt.title('Crime Count by Year')
plt.xlabel('Year')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

**Line Chart** – Ideal for tracking trends over time, especially if analyzing incident frequency by date. This helps visualize patterns like seasonal crime spikes.

**Bar Chart** – Great for comparing categorical variables, like incident types across different neighborhoods, showing which areas have the highest number of cases.

**Scatter Plot** – Useful for mapping locations with Latitude and Longitude, helping identify clusters of incidents geographically.

**Heatmap** – Best for displaying density distributions, such as crime frequency over different times of day or across locations.

The goal is to choose a chart that presents clear, actionable insights—whether for forecasting, comparison, or geospatial analysis.

##### 2. What is/are the insight(s) found from the chart?

**Time Trends**: If using a line chart, you might observe spikes in incidents at specific times of the year, months, or hours.

**Geospatial Patterns**: A scatter plot using latitude & longitude could reveal high-risk zones where incidents cluster.

**Neighborhood Comparisons**: A bar chart may show which neighborhoods have the most reported incidents.

**Peak Crime Hours**: A heatmap with HOUR and NEIGHBOURHOOD could highlight when and where events are most frequent.

**Seasonal Effects**: A time series forecast might indicate whether incidents increase during particular months or seasons.
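
As a rough sketch of the hour-by-neighbourhood view mentioned above, the snippet below builds the count table that such a heatmap would be drawn from. The rows and the neighbourhood labels "A" and "B" are made up; only the column names `HOUR` and `NEIGHBOURHOOD` follow the dataset.

```python
import pandas as pd

# Made-up rows; only the column names HOUR and NEIGHBOURHOOD follow the dataset
sample = pd.DataFrame({
    "NEIGHBOURHOOD": ["A", "A", "B", "B", "B", "A"],
    "HOUR": [22, 22, 9, 22, 9, 9],
})

# Rows are hours, columns are neighbourhoods, cells are incident counts
hour_by_area = pd.crosstab(sample["HOUR"], sample["NEIGHBOURHOOD"])
print(hour_by_area)

# Busiest hour for each neighbourhood
peak_hours = hour_by_area.idxmax()
print(peak_hours)
```

The resulting table can be passed to `sns.heatmap` in the same way as the correlation matrix used elsewhere in this notebook.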

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### **Positive Business Impact**
1. **Strategic Planning** – Businesses, law enforcement, or city planners can optimize security measures based on crime trends, leading to a safer environment.
2. **Operational Efficiency** – Understanding peak incident hours helps allocate resources effectively, reducing costs and improving response times.
3. **Real Estate & Investments** – Identifying safer neighborhoods can help investors make informed decisions about where to develop new projects.
4. **Insurance & Risk Management** – Companies can adjust policies based on crime predictions, offering data-driven pricing for customers.

### **Insights That May Lead to Negative Growth**
1. **Reputation Challenges** – Businesses operating in high-incident zones may face reduced customer traffic due to safety concerns.
2. **Declining Property Value** – If an area consistently shows high crime rates, real estate values might drop, affecting investment and development.
3. **Higher Security Costs** – Companies in high-risk areas may need additional security measures, increasing operational expenses.

### **Turning Negative Insights into Opportunities**
Even insights that seem negative can be **leveraged strategically**—for example, businesses can invest in better security or use predictive analytics to prevent incidents before they occur. Adapting to trends and mitigating risks ensures long-term growth despite initial challenges.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
sns.set(rc={'figure.figsize':(15,10)})
sns.set_palette('husl')
graph = sns.countplot(data=df, x='YEAR', hue='TYPE')
graph.set_title('Crime Count by Year and Type')
plt.show()

##### 1. Why did you pick the specific chart?

**Line Chart** – If we’re analyzing trends over time (such as incidents per day or month), this chart highlights patterns like seasonal spikes or declines.

**Bar Chart** – If comparing different categories (such as crime types or neighborhoods), a bar chart visually distinguishes frequency variations.

**Scatter Plot** – When working with geographical data (Latitude and Longitude), a scatter plot helps identify clustering of incidents.

**Heatmap** – If analyzing time-based trends (like peak hours for incidents), a heatmap showcases density variations clearly.

##### 2. What is/are the insight(s) found from the chart?

**Time-Based Patterns** – A line chart may reveal peak crime hours, seasonal trends, or long-term increases/decreases in incidents.

**Neighborhood Comparisons** – A bar chart could highlight which areas experience the highest or lowest incidents.

**Geospatial Clustering** – A scatter plot using latitude & longitude may show high-risk zones where incidents frequently occur.

**Heatmap Trends** – A heatmap focusing on hours or days might pinpoint times when events are most frequent.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

### **Positive Business Impact**
1. **Strategic Decision-Making** – Businesses, law enforcement, or policymakers can optimize operations based on crime trends, improving efficiency and safety.
2. **Operational Cost Reduction** – Understanding peak incident hours enables smarter resource allocation, cutting unnecessary expenses.
3. **Real Estate & Investments** – Identifying low-risk areas can help investors decide where to develop new projects or establish businesses.
4. **Enhanced Customer Experience** – Companies in hospitality, retail, or transportation can improve safety measures to increase customer trust.

### **Insights That Could Lead to Negative Growth**
1. **Reputation Challenges** – If an area has a high crime rate, businesses in that location may struggle to attract customers due to safety concerns.
2. **Declining Property Value** – Frequent incidents in specific neighborhoods could lead to lower property prices, impacting real estate markets.
3. **Higher Security Costs** – Companies operating in risk-prone areas may need to increase security measures, raising operational expenses.

### **Mitigating Risks & Leveraging Insights**
Even negative trends can be turned into strategic opportunities—for example:
- Businesses can invest in preventive safety measures to improve customer trust.
- Government agencies can focus on urban planning & crime prevention in identified hotspots.
- Companies can adjust their marketing strategies based on location-based risks.


#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.rcParams['figure.figsize'] = 12,9
labels = df['TYPE'].value_counts().index
sizes = df['TYPE'].value_counts().values
plt.pie(sizes, labels=labels, autopct='%1.0f%%')
plt.title('Crime Type Distribution')
plt.show()

##### 1. Why did you pick the specific chart?

Answer. This is a univariate analysis of a single column (`TYPE`), so a pie chart was chosen to show each crime type's share of the total.

##### 2. What is/are the insight(s) found from the chart?

Answer. Theft from Vehicle accounts for the largest share of incidents (32%), well ahead of Mischief (13%), so it is the most common crime type in the dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 **Positive Business Impact**:
Theft from Vehicle (32%)
High demand for vehicle security solutions (alarms, GPS, insurance). Opportunity for safety tech businesses.

Mischief (13%) & Break and Enter (12%)
Demand for home security systems and neighborhood watch services.

Offence Against a Person (10%)
Potential for personal safety apps and self-defense products.

**Negative Growth Indicators**:
Theft of Bicycle (5%) & Vehicle Collision with Injury (4%)
May reflect urban safety issues. Could discourage tourism or local travel unless mitigated.

Break and Enter Commercial (6%)
Might lead to increased business insurance costs or reluctance to open stores in affected areas.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
grouped_by_crime = df['TYPE'].value_counts()
grouped_by_crime

##### 1. Why did you pick the specific chart?

Answer Here. The incident counts per crime type (`TYPE`) were tabulated to see which crime occurs most often.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code
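# Count incidents by hour of day, using the HOUR column of this dataset
hourly_counts = df['HOUR'].value_counts().sort_index()
plt.figure(figsize=(9, 7))
hourly_counts.plot(kind='bar')
plt.title('Incident Count by Hour')
plt.show()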

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load your dataset
df = pd.read_excel('/content/drive/MyDrive/Train.xlsx')

# Compute the correlation matrix over numeric columns only
# (text columns such as TYPE cannot be correlated)
correlation_matrix = df.select_dtypes(include='number').corr()

# Set up the figure size
plt.figure(figsize=(10, 8))

# Generate the heatmap
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5)

# Add title
plt.title("Feature Correlation Heatmap")

# Show the heatmap
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words that contain digits

In [None]:
# Remove URLs & remove words that contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
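
Since the project summary describes a time series forecasting task, lag and rolling-window features are one plausible form of feature manipulation. The sketch below builds them on placeholder data; the column names and window size are assumptions.

```python
import pandas as pd

# Placeholder annual series (not real data)
ts = pd.DataFrame({
    "year": range(2015, 2021),
    "incidents": [100, 110, 105, 120, 130, 125],
})

# Previous period's value
ts["lag_1"] = ts["incidents"].shift(1)

# Mean of the three preceding periods; shift(1) keeps the current value
# out of its own feature and so avoids target leakage
ts["rolling_mean_3"] = ts["incidents"].shift(1).rolling(window=3).mean()

# Relative change from the previous period
ts["pct_change"] = ts["incidents"].pct_change()

# Rows without a full history cannot be used for training
features = ts.dropna().reset_index(drop=True)
print(features)
```

Shifting before rolling is the important detail here: a rolling mean that includes the current row would leak the target into its own predictor.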

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### Which feature selection methods have you used and why?

Answer Here.

##### Which features did you find important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used, and why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data
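
A minimal sketch of standardization (z-score scaling), assuming the data has already been split: the mean and standard deviation are computed on the training rows only and then reused for the test rows. The arrays are placeholder values.

```python
import numpy as np

# Placeholder feature matrices: rows are samples, columns are features
train = np.array([[100.0, 2.0], [120.0, 4.0], [140.0, 6.0]])
test = np.array([[160.0, 8.0]])

# Statistics come from the training set only, to avoid leaking test information
mean = train.mean(axis=0)
std = train.std(axis=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled)
print(test_scaled)
```

Tree-based models are largely insensitive to feature scale, so this step matters mostly for distance-based, linear, or neural models.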

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
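
For time-ordered data, a chronological split is usually safer than a random one, because a random split lets the model train on observations that come after the test period. The sketch below assumes an 80/20 ratio and placeholder data; both are illustrative choices, not values prescribed by this template.

```python
import pandas as pd

# Placeholder annual series (not real data)
data = pd.DataFrame({
    "year": range(2010, 2020),
    "incidents": [90, 95, 100, 98, 105, 110, 108, 115, 120, 118],
})

# Sort by time, then cut once: earlier rows train, later rows test
data = data.sort_values("year").reset_index(drop=True)
split_idx = int(len(data) * 0.8)
train, test = data.iloc[:split_idx], data.iloc[split_idx:]

print("Train years:", train["year"].min(), "to", train["year"].max())
print("Test years:", test["year"].min(), "to", test["year"].max())
```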

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
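
Before charting, the scores themselves have to be computed. Assuming a regression or forecasting task, the sketch below derives MAE and RMSE with the standard library. `regression_scores` is a hypothetical helper, and the actual and predicted values are placeholders.

```python
import math


def regression_scores(y_true, y_pred):
    """Return MAE and RMSE for two equal-length sequences."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"MAE": mae, "RMSE": rmse}


# Placeholder actual and predicted values (not real model output)
y_true = [100, 110, 120, 130]
y_pred = [98, 112, 117, 135]

scores = regression_scores(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

The resulting dictionary can be passed directly to a bar chart, one bar per metric, to compare models side by side. RMSE penalizes large misses more heavily than MAE, so a wide gap between the two suggests a few large errors.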

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model you have used and its feature importance using any model explainability tool.

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
import pickle

# Save the model (`model` is assumed to be the fitted best estimator from section 7)
with open("best_model.pkl", "wb") as file:
    pickle.dump(model, file)

# Load the model later
with open("best_model.pkl", "rb") as file:
    loaded_model = pickle.load(file)


### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
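# Load the File and predict unseen data.
import joblib

# Save the model in joblib format (an alternative to pickle)
joblib.dump(model, "best_model.joblib")

# Load the saved model back from disk
loaded_model = joblib.load("best_model.joblib")

# Sanity check on unseen data. Assumptions: X_test is the held-out split from
# "Data Splitting", and the model exposes a scikit-learn style predict() method
predictions = loaded_model.predict(X_test)
print(predictions[:5])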


### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**


- The heatmap provides a clear visualization of the relationships between different variables, making it easier to identify strong and weak correlations.
- Features with high correlation can indicate redundancy in the dataset, which helps in refining predictive models by selecting only the most relevant variables.
- Variables that show unexpected correlations may highlight underlying trends or issues, such as data inconsistencies or hidden dependencies.
- Understanding correlations can assist in strategic decision-making, whether for business optimization, safety planning, or operational improvements.
- The heatmap supports risk assessment by revealing connections between factors that may contribute to specific outcomes, helping to develop preventive measures.
- Insights from the correlation matrix can guide feature engineering, allowing better data transformation for machine learning models.
- Features that correlate only weakly with the target may contribute little to linear prediction models, so dropping them can simplify the analysis, provided the effect on accuracy is checked afterwards.
- This visualization aids in anomaly detection, as extreme correlations might indicate biases or irregularities in the dataset that require further investigation.
- Organizations can leverage these insights for smarter resource allocation, whether improving security measures, adjusting business strategies, or refining service delivery.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***