<a href="https://colab.research.google.com/github/Gareth11-max/Regression---Rossmann-Retail-Sales-Prediction/blob/main/Regression_Rossmann_Retail_Sales_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Rossmann Retail Sales Prediction




##### **Project Type**    - EDA/Regression
##### **Contribution**    - Individual


# **Project Summary -**

The Rossmann Retail Sales Prediction project develops a machine learning model to forecast daily sales for 1,115 Rossmann stores, part of a drug store chain operating in 7 European countries. The task is to predict sales up to six weeks in advance from historical data, accounting for factors such as promotions, holidays, competition, store location, and seasonality.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Rossmann, a leading drug store chain operating over 3,000 stores in 7 European countries, faces the challenge of accurately predicting daily sales for its stores. Store managers are tasked with forecasting sales up to six weeks in advance to optimize inventory management, staffing, promotions, and other operational aspects. However, sales are influenced by numerous factors such as promotions, state and school holidays, competition, seasonality, and store-specific conditions. The problem is further complicated by some stores being temporarily closed for refurbishment.

The objective of this project is to develop a predictive model that accurately forecasts daily sales for 1,115 Rossmann stores using historical sales data. The model should consider multiple factors that impact sales, such as:

*   Promotions: Ongoing promotional campaigns that could affect daily sales.
*   Competitor Proximity: Distance to nearby competing stores, which could influence sales.
*   Holiday Information: Effects of state holidays and school holidays on shopping behavior.
*   Store-Specific Factors: Different store types and their impact on sales.
*   Store Closures: Temporary closures for store refurbishments and their effects on sales.

The target variable is the Sales column, and the goal is to build a regression model capable of predicting sales for the test set. This will enable Rossmann to forecast sales with higher accuracy, facilitating better planning for inventory, staff, and promotions, and ultimately improving operational efficiency and profitability.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each algorithm.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
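# Load Dataset (the CSV is expected under /content, e.g. uploaded to the Colab session)
# low_memory=False reads each column in one pass, which avoids mixed-type warnings
train_df = pd.read_csv("/content/Rossmann Stores Data.csv", low_memory=False)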

### Dataset First View

In [None]:
# Dataset First Look
train_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
train_shape = train_df.shape
train_shape

### Dataset Information

In [None]:
# Dataset Info
train_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
train_duplicates = train_df.duplicated().sum()
train_duplicates

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
train_missing_values = train_df.isnull().sum()
train_missing_values

In [None]:
# Visualizing the missing values
import seaborn as sns
import matplotlib.pyplot as plt

# Visualize missing values in the training dataset
plt.figure(figsize=(12, 8))
sns.heatmap(train_df.isnull(), cbar=False, cmap='viridis', yticklabels=False)
plt.title("Missing Values Heatmap for Training Dataset")
plt.show()

### What did you know about your dataset?

This dataset contains no null or duplicate values.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
train_df.columns

In [None]:
# Dataset Describe
train_df.describe()

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
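
The unique-value check can be sketched as follows. This is a minimal example on a small hypothetical stand-in frame (`demo_df`), not the real `train_df`; the column names follow the Rossmann data but the values are made up.

```python
import pandas as pd

# Hypothetical stand-in for train_df; the values are illustrative only
demo_df = pd.DataFrame({
    "Store": [1, 1, 2, 2, 3],
    "DayOfWeek": [1, 2, 1, 2, 1],
    "Promo": [0, 1, 0, 1, 1],
    "StateHoliday": ["0", "0", "a", "0", "0"],
})

# Number of distinct values per column
unique_counts = demo_df.nunique()
print(unique_counts)

# List the actual values for low-cardinality columns only
for column in demo_df.columns:
    if demo_df[column].nunique() <= 3:
        print(column, sorted(demo_df[column].unique()))
```

On the real data, replacing `demo_df` with `train_df` should give the same kind of summary; the cut-off of 3 distinct values is an arbitrary choice for this sketch.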

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df_cleaned = train_df.drop_duplicates()
print(f"\nDataset after removing duplicates has {df_cleaned.shape[0]} rows.")




### What all manipulations have you done and insights you found?

Cleaned the dataset to prepare it for analysis by dropping duplicate rows.
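
Beyond dropping duplicates, a few further preparation steps are common for this kind of data. The sketch below is an assumption about what such steps could look like, shown on a hypothetical `demo_df`: it parses the date column, derives calendar features, and normalizes holiday codes to strings.

```python
import pandas as pd

# Hypothetical sample rows; the mixed int/str holiday codes are an assumed quirk of the raw file
demo_df = pd.DataFrame({
    "Date": ["2015-07-31", "2015-07-30", "2014-12-25"],
    "StateHoliday": [0, "0", "c"],
    "Sales": [5263, 5020, 0],
})

# Parse dates so that time-based features can be derived
demo_df["Date"] = pd.to_datetime(demo_df["Date"])
demo_df["Year"] = demo_df["Date"].dt.year
demo_df["Month"] = demo_df["Date"].dt.month

# Store holiday codes uniformly as strings so 0 and "0" count as one category
demo_df["StateHoliday"] = demo_df["StateHoliday"].astype(str)

print(demo_df.dtypes)
```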


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

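# Sales trend over time (averaged across stores per day so the line stays readable)
daily_sales = train_df.assign(Date=pd.to_datetime(train_df['Date'])).groupby('Date', as_index=False)['Sales'].mean()
plt.figure(figsize=(10, 6))
sns.lineplot(data=daily_sales, x='Date', y='Sales')
plt.title('Average Daily Sales Over Time')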
plt.xlabel('Date')
plt.ylabel('Sales')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

A line plot shows how sales evolve over time.

##### 2. What is/are the insight(s) found from the chart?

Sales fluctuate over time, but the overall trend remains stable.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Sales distribution
plt.figure(figsize=(8, 6))
sns.histplot(train_df['Sales'], kde=True)
plt.title('Sales Distribution')
plt.xlabel('Sales')
plt.ylabel('Frequency')
plt.show()


##### 1. Why did you pick the specific chart?

A histogram shows the distribution of sales.

##### 2. What is/are the insight(s) found from the chart?

Sales in the range 10,000 to 20,000 have the highest frequency.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

A histogram can show the distribution of sales, which helps in understanding its spread and central tendency.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Sales by day of the week
plt.figure(figsize=(8, 6))
sns.barplot(x='DayOfWeek', y='Sales', data=train_df)
plt.title('Sales by Day of the Week')
plt.xlabel('Day of Week')
plt.ylabel('Average Sales')
plt.show()


##### 1. Why did you pick the specific chart?

A bar plot helps identify how sales vary by the day of the week.

##### 2. What is/are the insight(s) found from the chart?

Average sales are highest on the first day of the week (DayOfWeek = 1).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This can be used to target improvements on days with lower average sales.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
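# Sales by month (Month is derived from Date in case it is not already a column)
train_df['Month'] = pd.to_datetime(train_df['Date']).dt.month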
plt.figure(figsize=(8, 6))
sns.barplot(x='Month', y='Sales', data= train_df)
plt.title('Sales by Month')
plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.show()


##### 1. Why did you pick the specific chart?

This will help us analyze if there’s any seasonality based on months.

##### 2. What is/are the insight(s) found from the chart?

Certain months have higher sales.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
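# Chart - 5 visualization code
# Average sales per store (the x-axis is the Store ID, not a store type)
store_avg = train_df.groupby('Store', as_index=False)['Sales'].mean()
plt.figure(figsize=(12, 6))
sns.barplot(x='Store', y='Sales', data=store_avg)
plt.title('Average Sales by Store')
plt.xlabel('Store ID')
plt.ylabel('Average Sales')
plt.xticks([])  # too many stores for readable tick labels
plt.show()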


In [None]:
train_df.columns

##### 1. Why did you pick the specific chart?

This chart shows average sales per store, helping identify which stores perform better. Comparing store types would need a store-type column, which the chart code does not use.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Can be helpful in determining future investments.

#### Chart - 6

In [None]:
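# Chart - 6 visualization code
# Customers is a numeric daily count, so a scatter plot on a random sample is used
sample_df = train_df.sample(n=min(10_000, len(train_df)), random_state=42)
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Customers', y='Sales', data=sample_df, alpha=0.3)
plt.title('Sales vs Number of Customers')
plt.xlabel('Customers')
plt.ylabel('Sales')
plt.show()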


##### 1. Why did you pick the specific chart?

To show how sales vary with the number of customers.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Since Customers is a count of customers per store and day, this can help identify the stores and days where higher customer traffic translates into higher sales.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
top_stores = train_df.groupby('Store')['Sales'].mean().sort_values(ascending=False).head(10)
plt.figure(figsize=(10, 6))
top_stores.plot(kind='bar')
plt.title('Top 10 Stores by Average Sales')
plt.xlabel('Store')
plt.ylabel('Average Sales')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

To show top 10 stores.

##### 2. What is/are the insight(s) found from the chart?

Store 262 is the top store by sales.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Can be used to track important stores.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(8, 6))
sns.boxplot(x='Promo', y='Sales', data=train_df)
plt.title('Sales by Promotion (Promo vs Non-Promo)')
plt.xlabel('Promotion')
plt.ylabel('Sales')
plt.show()


##### 1. Why did you pick the specific chart?

This box plot helps compare the sales on promotional days vs non-promotional days.




##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(8, 6))
# Cast to str so that an integer 0 and a string '0' are plotted as one category
sns.barplot(x=train_df['StateHoliday'].astype(str), y=train_df['Sales'])
plt.title('Sales by State Holiday')
plt.xlabel('State Holiday')
plt.ylabel('Average Sales')
plt.show()


##### 1. Why did you pick the specific chart?

This chart shows how sales are affected by state holidays.

##### 2. What is/are the insight(s) found from the chart?

State holidays have a strong effect on sales.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Sales drop on state holidays.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(8, 6))
sns.barplot(x='SchoolHoliday', y='Sales', data=train_df)
plt.title('Sales by School Holiday')
plt.xlabel('School Holiday')
plt.ylabel('Average Sales')
plt.show()


##### 1. Why did you pick the specific chart?

Compare sales during school holidays and non-school holidays.

##### 2. What is/are the insight(s) found from the chart?

There is not much difference in average sales between school holidays and regular school days.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(8, 6))
sns.barplot(x='Open', y='Sales', data=train_df)
plt.title('Sales by Open vs Closed Stores')
plt.xlabel('Open (1 = Open, 0 = Closed)')
plt.ylabel('Average Sales')
plt.show()


##### 1. Why did you pick the specific chart?

This plot compares sales between open and closed stores.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Assumes DayOfWeek is coded 1 = Monday ... 7 = Sunday, so the weekend is 6 and 7
train_df['WeekdayOrWeekend'] = train_df['DayOfWeek'].apply(lambda x: 'Weekend' if x >= 6 else 'Weekday')
plt.figure(figsize=(8, 6))
sns.boxplot(x='WeekdayOrWeekend', y='Sales', data=train_df)
plt.title('Sales by Weekday vs Weekend')
plt.xlabel('Weekday or Weekend')
plt.ylabel('Sales')
plt.show()


##### 1. Why did you pick the specific chart?

This box plot compares sales between weekdays and weekends.

##### 2. What is/are the insight(s) found from the chart?

Average sales on weekends is more

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import seaborn as sns
import matplotlib.pyplot as plt

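# Select relevant columns; StateHoliday holds category codes (e.g. '0', 'a', 'b'),
# so it is reduced to a 0/1 holiday indicator before computing correlations
correlation_data = train_df[['Sales', 'Customers', 'Open', 'Promo', 'SchoolHoliday']].copy()
correlation_data['StateHoliday'] = (train_df['StateHoliday'].astype(str) != '0').astype(int)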

# Compute the correlation matrix
correlation_matrix = correlation_data.corr()

# Create the heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)

# Add a title to the heatmap
plt.title('Correlation Heatmap of Numerical Variables')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Hypothetical Statements:

*   Statement 1: Stores that run promotions (Promo = 1) have higher average sales compared to stores that don't run promotions (Promo = 0).
*   Statement 2: Sales are higher during weekends (DayOfWeek >= 6, assuming 1 = Monday and 7 = Sunday) than on weekdays (DayOfWeek < 6).
*   Statement 3: The number of customers is positively correlated with sales, meaning as the number of customers increases, sales also increase.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Research Hypothesis:
Null Hypothesis (H₀): There is no difference in the average sales between stores that run promotions and stores that don't run promotions.

$$H_0: \mu_{\text{Promo}=1} = \mu_{\text{Promo}=0}$$

Alternative Hypothesis (H₁): Stores that run promotions (Promo = 1) have higher average sales compared to stores that don't run promotions (Promo = 0).

$$H_1: \mu_{\text{Promo}=1} > \mu_{\text{Promo}=0}$$


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy import stats

# Split data into two groups: Promo = 1 (stores with promotions) and Promo = 0 (stores without promotions)
promo_sales = train_df[train_df['Promo'] == 1]['Sales']
no_promo_sales = train_df[train_df['Promo'] == 0]['Sales']

# Perform a one-tailed t-test (Promo = 1 vs Promo = 0)
t_stat, p_value = stats.ttest_ind(promo_sales, no_promo_sales, alternative='greater')

# Display the results
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Determine if we can reject the null hypothesis at 5% significance level
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. Stores with promotions have higher sales on average.")
else:
    print("Fail to reject the null hypothesis. No significant difference in average sales between stores with and without promotions.")


##### Which statistical test have you done to obtain P-Value?

The statistical test that was used to obtain the p-value is the Independent Two-Sample t-test (also known as the Student's t-test).
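
Student's t-test, the default in `scipy.stats.ttest_ind`, assumes both groups have equal variance. If that assumption is in doubt, Welch's variant (`equal_var=False`) is a common alternative. The sketch below uses synthetic samples with made-up means and spreads; it is not the test run in this notebook.

```python
import numpy as np
from scipy import stats

# Synthetic samples with different spreads (illustrative values only)
rng = np.random.default_rng(42)
promo_sales = rng.normal(loc=8000, scale=2500, size=500)
no_promo_sales = rng.normal(loc=6000, scale=1500, size=500)

# Welch's one-tailed t-test: does not assume equal variances
t_stat, p_value = stats.ttest_ind(
    promo_sales, no_promo_sales, equal_var=False, alternative="greater"
)

print(f"T-statistic: {t_stat:.2f}")
print(f"P-value: {p_value:.3g}")
```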

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀): There is no difference in the average sales between weekends and weekdays.

$$H_0: \mu_{\text{Weekend}} = \mu_{\text{Weekday}}$$

Alternative Hypothesis (H₁): Sales during weekends are higher than sales during weekdays.

$$H_1: \mu_{\text{Weekend}} > \mu_{\text{Weekday}}$$


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Split data into weekends and weekdays
# (assumes DayOfWeek is coded 1 = Monday ... 7 = Sunday, so the weekend is 6 and 7)
weekend_sales = train_df[train_df['DayOfWeek'] >= 6]['Sales']
weekday_sales = train_df[train_df['DayOfWeek'] < 6]['Sales']

# Perform a one-tailed t-test (Weekends vs Weekdays)
t_stat, p_value = stats.ttest_ind(weekend_sales, weekday_sales, alternative='greater')

# Display results
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Determine if we can reject the null hypothesis at 5% significance level
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. Sales are higher during weekends than weekdays.")
else:
    print("Fail to reject the null hypothesis. No significant difference in sales between weekends and weekdays.")


##### Which statistical test have you done to obtain P-Value?

The statistical test that was used to obtain the p-value is the Independent Two-Sample t-test (also known as the Student's t-test).

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀): There is no correlation between the number of customers and sales.

$$H_0: \rho = 0$$

where $\rho$ is the correlation coefficient between the number of customers and sales.

Alternative Hypothesis (H₁): There is a positive correlation between the number of customers and sales.

$$H_1: \rho > 0$$

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Calculate the Pearson correlation coefficient
correlation, p_value = stats.pearsonr(train_df['Customers'], train_df['Sales'])

# Display results
print(f"Correlation: {correlation}")
print(f"P-value: {p_value}")

# Determine if we can reject the null hypothesis at 5% significance level
alpha = 0.05
if p_value < alpha and correlation > 0:
    print("Reject the null hypothesis. There is a positive correlation between the number of customers and sales.")
else:
    print("Fail to reject the null hypothesis. No significant positive correlation between the number of customers and sales.")


##### Which statistical test have you done to obtain P-Value?

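The statistical test used to obtain the p-value is the Pearson correlation test (`scipy.stats.pearsonr`), which tests whether the linear correlation between Customers and Sales differs from zero. `pearsonr` reports a two-sided p-value by default, so the code additionally checks that the correlation is positive.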
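
The one-sided hypothesis $\rho > 0$ can also be tested directly. The sketch below, on synthetic data with made-up coefficients, converts the sample correlation $r$ into a t-statistic, $t = r\sqrt{(n-2)/(1-r^2)}$, and reads the upper-tail probability from a t-distribution with $n-2$ degrees of freedom.

```python
import numpy as np
from scipy import stats

# Synthetic data: sales roughly proportional to customer counts (illustrative only)
rng = np.random.default_rng(0)
customers = rng.integers(200, 1500, size=300).astype(float)
sales = 9.0 * customers + rng.normal(0, 800, size=300)

# Sample Pearson correlation
r = np.corrcoef(customers, sales)[0, 1]
n = len(customers)

# t-statistic for H0: rho = 0, and the one-sided (upper-tail) p-value
t_stat = r * np.sqrt((n - 2) / (1 - r**2))
p_one_sided = stats.t.sf(t_stat, df=n - 2)

print(f"r = {r:.3f}, t = {t_stat:.2f}, one-sided p = {p_one_sided:.3g}")
```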

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Checking for missing values in the dataset
missing_values = train_df.isnull().sum()
missing_percentage = (train_df.isnull().sum() / len(train_df)) * 100

# Display missing values count and percentage
print("Missing Values Count:")
print(missing_values)
print("\nMissing Values Percentage:")
print(missing_percentage)
# Impute missing values for numerical columns with the mean or median
train_df['Sales'] = train_df['Sales'].fillna(train_df['Sales'].mean())  # Mean imputation for 'Sales'
train_df['Customers'] = train_df['Customers'].fillna(train_df['Customers'].median())  # Median imputation for 'Customers'


#### What all missing value imputation techniques have you used and why did you use those techniques?

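Mean imputation for Sales and median imputation for Customers. The median is less sensitive to extreme values than the mean. Since the dataset contains no null values, these steps leave the data unchanged and act only as a safeguard.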
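
The difference between the two techniques shows up when the column contains an extreme value. A minimal sketch on a hypothetical series:

```python
import numpy as np
import pandas as pd

# Hypothetical customer counts with one missing entry and one extreme value
customers = pd.Series([500.0, 520.0, np.nan, 480.0, 5000.0])

# Mean imputation: the extreme value pulls the fill value upwards
mean_filled = customers.fillna(customers.mean())

# Median imputation: the fill value stays close to the typical observations
median_filled = customers.fillna(customers.median())

print(f"Mean fill: {mean_filled[2]}, median fill: {median_filled[2]}")
```

Here the mean fill is 1625.0 while the median fill is 510.0, which is much closer to the typical values.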

### 2. Handling Outliers

In [None]:
train_df.columns

In [None]:
# Handling Outliers & Outlier treatments
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats

# Visual check: box plot of Sales
sns.boxplot(x=train_df['Sales'])
plt.show()

# Method 1: Z-score (flag rows more than 3 standard deviations from the mean)
z_scores = np.abs(stats.zscore(train_df['Sales']))
z_outliers = train_df[z_scores > 3]

# Method 2: IQR (flag rows more than 1.5 * IQR outside the quartiles)
Q1 = train_df['Sales'].quantile(0.25)
Q3 = train_df['Sales'].quantile(0.75)
IQR = Q3 - Q1
lower_bound, upper_bound = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
iqr_outliers = train_df[(train_df['Sales'] < lower_bound) | (train_df['Sales'] > upper_bound)]
print(f"Z-score outliers: {len(z_outliers)}, IQR outliers: {len(iqr_outliers)}")

# Keep only the rows within the IQR bounds (inclusive)
df_cleaned = train_df[train_df['Sales'].between(lower_bound, upper_bound)]
df_cleaned





##### What all outlier treatment techniques have you used and why did you use those techniques?

The IQR method is robust because it focuses on the spread of the middle 50% of the data, which helps mitigate the impact of extreme values on the identification of outliers.

The Z-score is useful when the data follows a normal (Gaussian) distribution because it assumes that most data points lie within a certain number of standard deviations from the mean.
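
The two rules can disagree, especially on small samples. The sketch below applies both to ten hypothetical sales values, one of them extreme.

```python
import pandas as pd

# Ten hypothetical daily sales values; the last one is extreme
sales = pd.Series(
    [4000, 4500, 5000, 5200, 5500, 5800, 6000, 6200, 6500, 30000], dtype=float
)

# IQR rule: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
iqr_mask = (sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)

# Z-score rule: flag values more than 3 population standard deviations from the mean
z_scores = (sales - sales.mean()) / sales.std(ddof=0)
z_mask = z_scores.abs() > 3

print("IQR outliers:", sales[iqr_mask].tolist())
print("Z-score outliers:", sales[z_mask].tolist())
```

In this sample the IQR rule flags 30000, while the z-score rule flags nothing: the extreme value inflates the mean and standard deviation it is measured against.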

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# 1. Handle missing values (if any); a missing StateHoliday is assumed to mean '0' (no holiday)
train_df['StateHoliday'] = train_df['StateHoliday'].fillna('0').astype(str)

# 2. Label-encode StateHoliday into a new integer column (the column name is illustrative)
train_df['StateHolidayEncoded'] = LabelEncoder().fit_transform(train_df['StateHoliday'])

# 3. Date feature extraction
train_df['Date'] = pd.to_datetime(train_df['Date'])  # Convert Date column to datetime
train_df['Year'] = train_df['Date'].dt.year


#### What all categorical encoding techniques have you used & why did you use those techniques?

Label Encoding maps each unique category to an integer code. Because the codes imply an order, the technique suits ordinal variables best. For example, Monday becomes 0, Tuesday becomes 1, and so on.

The Store and StateHoliday columns are nominal variables because they represent categories with no inherent order or ranking. For example, Store indicates different store IDs (each store is independent of others), and StateHoliday represents different holiday types (like '0', 'a', 'b').
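
For nominal variables like these, one-hot encoding is a common alternative to label encoding because it does not impose an order on the categories. The sketch below applies `pd.get_dummies` to hypothetical holiday codes; it is an illustration, not the encoding used in this notebook.

```python
import pandas as pd

# Hypothetical holiday codes: '0' = no holiday, letters = holiday types
demo_df = pd.DataFrame({"StateHoliday": ["0", "a", "b", "0", "c"]})

# One indicator column per category, with no implied ordering
one_hot = pd.get_dummies(demo_df["StateHoliday"], prefix="StateHoliday", dtype=int)

print(one_hot)
```

One-hot encoding is practical for StateHoliday, which has only a few categories. For Store, with over a thousand IDs, it would add one column per store, so another scheme may be preferable.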

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
import re

# Define a dictionary for contractions and their expansions
contractions_dict = {
    "I'm": "I am", "I've": "I have", "I'd": "I would", "I'll": "I will", "I am": "I am",
    "you're": "you are", "you've": "you have", "you'll": "you will", "you'd": "you would",
    "he's": "he is", "he'll": "he will", "he'd": "he would", "she's": "she is", "she'll": "she will",
    "she'd": "she would", "it's": "it is", "it'll": "it will", "it'd": "it would",
    "we're": "we are", "we've": "we have", "we'll": "we will", "we'd": "we would",
    "they're": "they are", "they've": "they have", "they'll": "they will", "they'd": "they would",
    "can't": "cannot", "won't": "will not", "don't": "do not", "doesn't": "does not", "didn't": "did not",
    "isn't": "is not", "aren't": "are not", "wasn't": "was not", "weren't": "were not",
    "hasn't": "has not", "haven't": "have not", "hadn't": "had not", "shouldn't": "should not",
    "couldn't": "could not", "wouldn't": "would not", "mustn't": "must not", "mightn't": "might not",
    "ain't": "is not"
}

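# Function to expand contractions in a given text
def expand_contractions(text, contractions_dict):
    # Lower-cased lookup so that capitalized forms such as "It's" match the "it's" entry
    lookup = {key.lower(): value for key, value in contractions_dict.items()}

    # Escape each key and match case-insensitively on word boundaries
    pattern = re.compile(r'\b(' + '|'.join(map(re.escape, lookup)) + r')\b', flags=re.IGNORECASE)

    # Replace contractions using the dictionary (the expansion keeps the dictionary's casing)
    expanded_text = pattern.sub(lambda match: lookup[match.group(0).lower()], text)
    return expanded_text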

# Example usage
sample_text = "I'm happy that you're coming over today! It's going to be fun."

expanded_text = expand_contractions(sample_text, contractions_dict)

print(f"Original Text: {sample_text}")
print(f"Expanded Text: {expanded_text}")


#### 2. Lower Casing

In [None]:
# Lower Casing
# Example text
sample_text = "This is a Simple Text with Mixed CASE letters!"

# Convert the text to lowercase
lowercased_text = sample_text.lower()

print(f"Original Text: {sample_text}")
print(f"Lowercased Text: {lowercased_text}")


#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string

# Example text with punctuation
sample_text = "Hello there! How are you doing today? I hope you're doing great."

# Remove punctuation using string.punctuation
cleaned_text = ''.join(char for char in sample_text if char not in string.punctuation)

print(f"Original Text: {sample_text}")
print(f"Text without Punctuation: {cleaned_text}")


#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits
import re

# Function to remove URLs and words containing digits
def clean_text(text):
    # Remove URLs (matches anything starting with http:// or https:// and any domain)
    text = re.sub(r'http[s]?://\S+', '', text)

    # Remove words containing digits (e.g., 'hello123', 'product_2021')
    text = re.sub(r'\b\w*\d\w*\b', '', text)

    return text

# Example text with URLs and words containing digits
sample_text = """
Visit our website at https://www.example.com for more info.
Our new product version is available: product_2021.
Contact us at support@example"""

cleaned_text = clean_text(sample_text)
cleaned_text


#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

import nltk
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
nltk.download('stopwords')

# Example text
sample_text = "This is an example sentence where some stopwords will be removed."

# Get English stopwords list from NLTK
stop_words = set(stopwords.words('english'))

# Tokenize the text (split it into words)
words = sample_text.split()

# Remove stopwords
filtered_words = [word for word in words if word.lower() not in stop_words]

# Join words back into a clean sentence
cleaned_text = ' '.join(filtered_words)

print(f"Original Text: {sample_text}")
print(f"Text without Stopwords: {cleaned_text}")


#### 6. Rephrase Text

In [None]:
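# Rephrase Text
!pip install transformers

from transformers import pipeline

# Load a pre-trained text-to-text model
# (t5-base is general-purpose, not tuned specifically for paraphrasing, so output quality may be limited)
paraphrase_model = pipeline("text2text-generation", model="t5-base")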

# Example text to rephrase
text = "I enjoy going to the park in the evening because it is relaxing."

# Generate a paraphrase
paraphrased_text = paraphrase_model(f"paraphrase: {text}", max_length=50, num_return_sequences=1)

print(f"Original Text: {text}")
print(f"Paraphrased Text: {paraphrased_text[0]['generated_text']}")


#### 7. Tokenization

In [None]:
# Tokenization
import nltk
from nltk.tokenize import word_tokenize

# Download punkt tokenizer models if not already installed
nltk.download('punkt')

# Example text
sample_text = "I love Natural Language Processing!"

# Tokenize the text into words
tokens = word_tokenize(sample_text)

print(f"Tokens: {tokens}")


#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
import nltk
from nltk.corpus import stopwords

# Download NLTK stopwords
nltk.download('stopwords')

# Example text
sample_text = "This is a sample sentence with stopwords."

# Tokenize the text
tokens = sample_text.split()

# Load stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
normalized_text = " ".join(filtered_tokens)

print(normalized_text)

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

sample_text = "running runs runner"
tokens = sample_text.split()

# Apply stemming to each token
stemmed_tokens = [stemmer.stem(word) for word in tokens]
normalized_text = " ".join(stemmed_tokens)

print(normalized_text)



##### Which text normalization technique have you used and why?

Stemming reduces words to their root form, which may not always be a valid word but helps in grouping different forms of a word.



#### 9. Part of speech tagging

In [None]:
# POS Tagging
import nltk
from nltk.tokenize import word_tokenize

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sample text
sample_text = "The quick brown fox jumps over the lazy dog."

# Tokenize the text
tokens = word_tokenize(sample_text)

# Perform POS tagging
pos_tags = nltk.pos_tag(tokens)

print(pos_tags)


#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.decomposition import PCA

# Standardize the features (numeric predictors only; 'Sales' is the target and is excluded)
from sklearn.preprocessing import StandardScaler
X = train_df.drop(columns='Sales').select_dtypes(include='number')
X_scaled = StandardScaler().fit_transform(X)

# Apply PCA to reduce dimensionality and decorrelate features
pca = PCA(n_components=2)  # Reduce to 2 components, for example
X_pca = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)


#### 2. Feature Selection

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Correlation matrix for numerical features (non-numeric columns are skipped)
corr_matrix = train_df.corr(numeric_only=True)

# Plot heatmap for correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.show()

# Identify highly correlated features (absolute correlation > 0.9)
threshold = 0.9
cols_to_drop = set()

for i in range(len(corr_matrix.columns)):
    for j in range(i):
        if abs(corr_matrix.iloc[i, j]) > threshold:
            colname = corr_matrix.columns[i]
            if colname != 'Sales':  # never drop the target variable
                cols_to_drop.add(colname)

# Drop one feature from each highly correlated pair
train_df = train_df.drop(columns=list(cols_to_drop))


##### Which feature selection methods have you used and why?

Answer Here.

##### Which features did you find important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes. Scaling, categorical encoding, and outlier handling were applied.

### 6. Data Scaling

In [None]:
train_df.columns

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler

# Columns to scale (here only the target 'Sales'; add numerical predictors to the list as needed)
numerical_features = ['Sales']

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the data (Standardization)
train_df[numerical_features] = scaler.fit_transform(train_df[numerical_features])

# Check the transformed data
print(train_df[numerical_features].head())

from sklearn.preprocessing import MinMaxScaler

# Initialize the MinMaxScaler
min_max_scaler = MinMaxScaler()

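# Fit and transform the data (Normalization). Note: this overwrites the standardized values from above, so in practice keep only one of the two scalers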
train_df[numerical_features] = min_max_scaler.fit_transform(train_df[numerical_features])

# Check the transformed data
print(train_df[numerical_features].head())



##### Which method have you used to scale your data and why?

Normalization (min-max scaling) rescales the data to a fixed range, usually 0 to 1. This is especially useful for distance-based algorithms such as KNN and for gradient-based models such as neural networks, both of which are sensitive to feature scale.

Standardization subtracts the mean and divides by the standard deviation, so the scaled feature has a mean of 0 and a standard deviation of 1.
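The two formulas can be checked by hand on a few values. This is a small sketch with made-up sales figures, using only the standard library; it uses the population standard deviation, which is also what scikit-learn's `StandardScaler` uses.

```python
from statistics import mean, pstdev

sales = [4000.0, 5000.0, 6000.0, 9000.0]  # made-up daily sales values

# Min-max normalization: (x - min) / (max - min), result lies in [0, 1]
lo, hi = min(sales), max(sales)
normalized = [(x - lo) / (hi - lo) for x in sales]

# Standardization: (x - mean) / std, result has mean 0 and standard deviation 1
mu, sigma = mean(sales), pstdev(sales)  # population standard deviation
standardized = [(x - mu) / sigma for x in sales]

print(normalized)
print(standardized)
```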

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split


X = train_df.drop(columns=['Sales'])
y = train_df['Sales']  # Target variable

# Split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shape of the split data
print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)


##### What data splitting ratio have you used and why?

For the data splitting, I used an 80/20 ratio, where:

80% of the data is used for training the model.

20% of the data is reserved for testing the model's performance.

An 80/20 split strikes a good balance between having enough data for training while also reserving sufficient data to evaluate the model's generalization to new, unseen data.

The training set needs to be large enough to allow the model to learn meaningful patterns, while the test set needs to be large enough to provide a reliable estimate of the model's performance.
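Because the goal is to forecast up to six weeks ahead, a chronological holdout is a common alternative to a random split: with a random 80/20 split, some training rows come from days after the test rows. The sketch below is not what the cell above does. It uses a small synthetic frame with assumed `Date` and `Sales` columns to show the idea.

```python
import pandas as pd

# Synthetic stand-in for the training data: 100 consecutive days for one store
df = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=100, freq="D"),
    "Sales": range(100),
})

# Hold out the last six weeks (42 days), mirroring the forecasting horizon
cutoff = df["Date"].max() - pd.Timedelta(weeks=6)
train_part = df[df["Date"] <= cutoff]
test_part = df[df["Date"] > cutoff]

print(len(train_part), len(test_part))
```

Every test row is then strictly later than every training row, which is closer to how the model would be used in production.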

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# Convert 'Date' to datetime so that date parts can be extracted
train_df['Date'] = pd.to_datetime(train_df['Date'])

# Extract useful features from the datetime column
train_df['Year'] = train_df['Date'].dt.year
train_df['Month'] = train_df['Date'].dt.month
train_df['Day'] = train_df['Date'].dt.day
train_df['Weekday'] = train_df['Date'].dt.weekday  # Monday=0, Sunday=6

# Drop the original Date column as it's no longer needed
train_df.drop(columns=['Date'], inplace=True)


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score



# Handle missing values - fill numeric columns with their median
train_df.fillna(train_df.median(numeric_only=True), inplace=True)

# One-Hot Encoding for categorical variables
train_df = pd.get_dummies(train_df, drop_first=True)

# Define features (X) and target (y)
X = train_df.drop(columns=['Sales'])
y = train_df['Sales']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model's performance
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae}")

mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")

r2 = r2_score(y_test, y_pred)
print(f"R-squared (R²): {r2}")


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In this case, we used the Random Forest Regressor model to predict Sales for the Rossmann retail dataset.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Predicting on the test set
y_pred = model.predict(X_test)

# Calculating MAE, MSE, and R²
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

r2 = r2_score(y_test, y_pred)

# Print the evaluation metrics
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")

print(f"R-squared (R²): {r2}")

import matplotlib.pyplot as plt
import numpy as np

# Create a list of metric names and their corresponding values
metrics = ['MAE', 'MSE', 'R²']
values = [mae, mse, r2]

# Plotting the metrics
plt.figure(figsize=(10, 6))
plt.bar(metrics, values, color=['blue', 'orange', 'red'])

# Adding titles and labels
plt.title('Model Evaluation Metrics')
plt.xlabel('Metrics')
plt.ylabel('Score')

# Show the plot
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Define the Random Forest model
model = RandomForestRegressor(random_state=42)

# Define the hyperparameters to tune
param_grid = {
    'n_estimators': [50, 100, 200],  # Number of trees in the forest
    'max_depth': [10, 20, None],  # Depth of each tree
    'min_samples_split': [2, 5, 10],  # Minimum samples to split a node
    'min_samples_leaf': [1, 2, 4],  # Minimum samples required at a leaf node
    'max_features': [1.0, 'sqrt', 'log2'],  # Features considered at each split (1.0 = all; 'auto' was removed in scikit-learn 1.3)
}

# Set up GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2, scoring='neg_mean_absolute_error')

# Fit the model
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params_grid = grid_search.best_params_
print("Best Parameters from GridSearchCV:", best_params_grid)

# Predict on the test set
y_pred_grid = grid_search.predict(X_test)

# Evaluate the model
mae_grid = mean_absolute_error(y_test, y_pred_grid)
mse_grid = mean_squared_error(y_test, y_pred_grid)
r2_grid = r2_score(y_test, y_pred_grid)

print(f"GridSearchCV - MAE: {mae_grid}, MSE: {mse_grid}, R²: {r2_grid}")

# RandomizedSearchCV: evaluate a random sample of parameter combinations
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

# Define the Random Forest model
model = RandomForestRegressor(random_state=42)

# Define the hyperparameters for RandomizedSearch
param_dist = {
    'n_estimators': np.arange(50, 300, 50),  # Number of trees in the forest
    'max_depth': [None, 10, 20, 30, 40],  # Depth of each tree
    'min_samples_split': [2, 5, 10, 20],  # Minimum samples to split a node
    'min_samples_leaf': [1, 2, 4, 8],  # Minimum samples required at a leaf node
    'max_features': [1.0, 'sqrt', 'log2'],  # Features considered at each split (1.0 = all; 'auto' was removed in scikit-learn 1.3)
}

# Set up RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=100, cv=5, verbose=2, n_jobs=-1, random_state=42, scoring='neg_mean_absolute_error')

# Fit the model
random_search.fit(X_train, y_train)

# Get the best parameters
best_params_random = random_search.best_params_
print("Best Parameters from RandomizedSearchCV:", best_params_random)

# Predict on the test set
y_pred_random = random_search.predict(X_test)

# Evaluate the model
mae_random = mean_absolute_error(y_test, y_pred_random)
mse_random = mean_squared_error(y_test, y_pred_random)
r2_random = r2_score(y_test, y_pred_random)

print(f"RandomizedSearchCV - MAE: {mae_random}, MSE: {mse_random}, R²: {r2_random}")


# Bayesian optimization with Hyperopt (Tree-structured Parzen Estimator)
from hyperopt import fmin, tpe, hp, Trials
from sklearn.model_selection import cross_val_score

# Define the objective function to minimize
def objective(params):
    model = RandomForestRegressor(
        n_estimators=params['n_estimators'],
        max_depth=params['max_depth'],
        min_samples_split=params['min_samples_split'],
        min_samples_leaf=params['min_samples_leaf'],
        max_features=params['max_features'],
        random_state=42
    )

    # Use cross-validation to evaluate the model
    score = cross_val_score(model, X_train, y_train, scoring='neg_mean_absolute_error', cv=5).mean()
    return -score  # Return the negative score since Hyperopt minimizes the objective

# Define the hyperparameter search space
space = {
    'n_estimators': hp.choice('n_estimators', [50, 100, 150, 200]),
    'max_depth': hp.choice('max_depth', [10, 20, 30, None]),
    'min_samples_split': hp.choice('min_samples_split', [2, 5, 10]),
    'min_samples_leaf': hp.choice('min_samples_leaf', [1, 2, 4]),
    'max_features': hp.choice('max_features', [1.0, 'sqrt', 'log2'])  # 'auto' was removed in scikit-learn 1.3; 1.0 = all features
}

# Set up the trials object to store the results
trials = Trials()

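# Perform the optimization
best_indices_bayes = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)

# fmin returns list indices for hp.choice parameters; space_eval maps them back to the actual values
from hyperopt import space_eval
best_params_bayes = space_eval(space, best_indices_bayes)

print("Best Parameters from Bayesian Optimization:", best_params_bayes)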

# Once the best parameters are found, train the final model
best_model_bayes = RandomForestRegressor(
    n_estimators=best_params_bayes['n_estimators'],
    max_depth=best_params_bayes['max_depth'],
    min_samples_split=best_params_bayes['min_samples_split'],
    min_samples_leaf=best_params_bayes['min_samples_leaf'],
    max_features=best_params_bayes['max_features'],
    random_state=42
)

# Train the model on the full training data
best_model_bayes.fit(X_train, y_train)

# Predict on the test set
y_pred_bayes = best_model_bayes.predict(X_test)

# Evaluate the model
mae_bayes = mean_absolute_error(y_test, y_pred_bayes)
mse_bayes = mean_squared_error(y_test, y_pred_bayes)
r2_bayes = r2_score(y_test, y_pred_bayes)

print(f"Bayesian Optimization - MAE: {mae_bayes}, MSE: {mse_bayes}, R²: {r2_bayes}")


##### Which hyperparameter optimization technique have you used and why?

In this implementation, I’ve demonstrated three hyperparameter optimization techniques: GridSearchCV, RandomizedSearchCV, and Bayesian Optimization. The choice of technique largely depends on the trade-off between search space size, computation time, and the level of optimization required.
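The computation-time side of this trade-off can be made concrete by counting model fits. The sketch below takes the grid sizes, `n_iter=100`, `max_evals=50` and `cv=5` from the cells above; the counts exclude the single final refit that the two scikit-learn searches perform by default.

```python
from math import prod

cv_folds = 5

# Number of candidate values per hyperparameter in the GridSearchCV grid above
grid_sizes = {"n_estimators": 3, "max_depth": 3, "min_samples_split": 3,
              "min_samples_leaf": 3, "max_features": 3}
grid_candidates = prod(grid_sizes.values())  # every combination is tried

random_candidates = 100  # n_iter in RandomizedSearchCV
bayes_candidates = 50    # max_evals in Hyperopt's fmin

for name, n in [("GridSearchCV", grid_candidates),
                ("RandomizedSearchCV", random_candidates),
                ("Bayesian (Hyperopt)", bayes_candidates)]:
    print(f"{name}: {n} candidates x {cv_folds} folds = {n * cv_folds} model fits")
```

On a dataset of this size, the difference between 1,215 and 250 random forest fits is substantial, which is the practical argument for the sampled and Bayesian searches.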

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

In [None]:
train_df.columns

In [None]:
print(train_df.dtypes)

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Example: Loading your dataset
# train_df = pd.read_csv('your_dataset.csv')

# Step 1: Check data types and identify columns that are non-numeric
print(train_df.dtypes)
print(train_df[train_df.isnull().any(axis=1)])  # Rows with NaN values



# Step 2: Convert 'Date' column to datetime and then to numeric (if 'Date' is present)
if 'Date' in train_df.columns:
    train_df['Date'] = pd.to_datetime(train_df['Date'], errors='coerce')  # Convert to datetime
    train_df['Date'] = (train_df['Date'] - pd.Timestamp("1970-01-01")) // pd.Timedelta('1D')  # Convert to days since epoch

# Step 3: Handle missing values - fill missing numeric columns with median
# First, convert all numeric columns explicitly to avoid any string columns
numeric_columns = train_df.select_dtypes(include=['float64', 'int64']).columns
train_df[numeric_columns] = train_df[numeric_columns].apply(pd.to_numeric, errors='coerce')  # Convert to numeric, coercing errors to NaN

# Fill missing values with median for numeric columns
train_df.fillna(train_df.median(numeric_only=True), inplace=True)

# Step 4: One-Hot Encoding for categorical columns
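# Candidate categorical columns; keep only those still present (earlier cells may already have encoded some)
categorical_columns = [col for col in ['Store', 'DayOfWeek', 'StateHoliday', 'SchoolHoliday'] if col in train_df.columns]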

# Apply One-Hot Encoding to categorical columns
train_df = pd.get_dummies(train_df, columns=categorical_columns, drop_first=True)

# Step 5: Define features (X) and target (y)
X = train_df.drop(columns=['Sales'])  # Features, drop the target column 'Sales'
y = train_df['Sales']  # Target variable 'Sales'

# Step 6: Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 7: Initialize and train the RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Step 8: Predict on the test set
y_pred = model.predict(X_test)

# Step 9: Evaluate the model's performance
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae}")

mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")

r2 = r2_score(y_test, y_pred)
print(f"R-squared (R²): {r2}")


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

The Random Forest Regressor is an ensemble machine learning model used for regression tasks. It builds multiple decision trees and combines their predictions into a final prediction. Random Forests rely on bagging (Bootstrap Aggregating): each tree is trained on a random bootstrap sample of the data and predicts independently. Averaging the trees' predictions reduces variance and therefore overfitting.

In this case, the model is used to predict the Sales column based on various features like Store, DayOfWeek, StateHoliday, and others.
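The averaging step can be verified directly on a fitted forest. The sketch below uses synthetic data (not the Rossmann data) and hypothetical names `X_demo`, `y_demo` and `forest`; it compares the forest's prediction with the mean of its individual trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (not the Rossmann data): y depends non-linearly on two features
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.1, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Each fitted tree predicts independently; the forest returns their mean
tree_preds = np.array([tree.predict(X_demo[:5]) for tree in forest.estimators_])
manual_average = tree_preds.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X_demo[:5])))
```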

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's performance
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae}")

mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")

r2 = r2_score(y_test, y_pred)
print(f"R-squared (R²): {r2}")


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
from hyperopt import fmin, tpe, hp, Trials
from sklearn.model_selection import train_test_split, cross_val_score
# Define the features (X) and target (y)
X = train_df.drop(columns=['Sales'])
y = train_df['Sales']

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the RandomForestRegressor model
model = RandomForestRegressor(random_state=42)

# Define the hyperparameters grid to search over
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [1.0, 'sqrt', 'log2']  # 1.0 = all features; 'auto' was removed in scikit-learn 1.3
}

# Set up GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2, scoring='neg_mean_absolute_error')

# Fit the GridSearchCV
grid_search.fit(X_train, y_train)

# Get the best hyperparameters from GridSearchCV
best_params_grid = grid_search.best_params_
print("Best Parameters from GridSearchCV:", best_params_grid)

# Predict on the test set
y_pred_grid = grid_search.predict(X_test)

# Evaluate the model
mae_grid = mean_absolute_error(y_test, y_pred_grid)
mse_grid = mean_squared_error(y_test, y_pred_grid)
r2_grid = r2_score(y_test, y_pred_grid)

print(f"GridSearchCV - MAE: {mae_grid}, MSE: {mse_grid}, R²: {r2_grid}")

# Define the RandomizedSearchCV model
param_dist = {
    'n_estimators': np.arange(50, 300, 50),
    'max_depth': [None, 10, 20, 30, 40],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 4, 8],
    'max_features': [1.0, 'sqrt', 'log2']  # 1.0 = all features; 'auto' was removed in scikit-learn 1.3
}

random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=100, cv=5, verbose=2, n_jobs=-1, random_state=42, scoring='neg_mean_absolute_error')

# Fit the RandomizedSearchCV
random_search.fit(X_train, y_train)

# Get the best hyperparameters from RandomizedSearchCV
best_params_random = random_search.best_params_
print("Best Parameters from RandomizedSearchCV:", best_params_random)

# Predict on the test set
y_pred_random = random_search.predict(X_test)

# Evaluate the model
mae_random = mean_absolute_error(y_test, y_pred_random)
mse_random = mean_squared_error(y_test, y_pred_random)
r2_random = r2_score(y_test, y_pred_random)

print(f"RandomizedSearchCV - MAE: {mae_random}, MSE: {mse_random}, R²: {r2_random}")

# Bayesian optimization with Hyperopt
# Define the objective function to minimize
def objective(params):
    model = RandomForestRegressor(
        n_estimators=params['n_estimators'],
        max_depth=params['max_depth'],
        min_samples_split=params['min_samples_split'],
        min_samples_leaf=params['min_samples_leaf'],
        max_features=params['max_features'],
        random_state=42
    )

    # Use cross-validation to evaluate the model
    score = cross_val_score(model, X_train, y_train, scoring='neg_mean_absolute_error', cv=5).mean()
    return -score  # Return the negative score since Hyperopt minimizes the objective

# Define the hyperparameter search space
space = {
    'n_estimators': hp.choice('n_estimators', [50, 100, 150, 200]),
    'max_depth': hp.choice('max_depth', [10, 20, 30, None]),
    'min_samples_split': hp.choice('min_samples_split', [2, 5, 10]),
    'min_samples_leaf': hp.choice('min_samples_leaf', [1, 2, 4]),
    'max_features': hp.choice('max_features', [1.0, 'sqrt', 'log2'])  # 'auto' was removed in scikit-learn 1.3; 1.0 = all features
}

# Set up the trials object to store the results
trials = Trials()

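# Perform the optimization
best_indices_bayes = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)

# fmin returns list indices for hp.choice parameters; space_eval maps them back to the actual values
from hyperopt import space_eval
best_params_bayes = space_eval(space, best_indices_bayes)

print("Best Parameters from Bayesian Optimization:", best_params_bayes)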

# Once the best parameters are found, train the final model
best_model_bayes = RandomForestRegressor(
    n_estimators=best_params_bayes['n_estimators'],
    max_depth=best_params_bayes['max_depth'],
    min_samples_split=best_params_bayes['min_samples_split'],
    min_samples_leaf=best_params_bayes['min_samples_leaf'],
    max_features=best_params_bayes['max_features'],
    random_state=42
)

# Train the model on the full training data
best_model_bayes.fit(X_train, y_train)

# Predict on the test set
y_pred_bayes = best_model_bayes.predict(X_test)

# Evaluate the model
mae_bayes = mean_absolute_error(y_test, y_pred_bayes)
mse_bayes = mean_squared_error(y_test, y_pred_bayes)
r2_bayes = r2_score(y_test, y_pred_bayes)

print(f"Bayesian Optimization - MAE: {mae_bayes}, MSE: {mse_bayes}, R²: {r2_bayes}")


import matplotlib.pyplot as plt

# Evaluation scores for comparison
metrics = ['MAE', 'MSE', 'R²']
grid_scores = [mae_grid, mse_grid, r2_grid]
random_scores = [mae_random, mse_random, r2_random]
bayes_scores = [mae_bayes, mse_bayes, r2_bayes]

# Create subplots to display evaluation metrics comparison
fig, ax = plt.subplots(figsize=(10, 6))

# Plotting each model's scores
ax.bar([f"{metric} - GridSearchCV" for metric in metrics], grid_scores, color='blue', alpha=0.7, label='GridSearchCV')
ax.bar([f"{metric} - RandomSearchCV" for metric in metrics], random_scores, color='green', alpha=0.7, label='RandomSearchCV')
ax.bar([f"{metric} - Bayesian" for metric in metrics], bayes_scores, color='orange', alpha=0.7, label='Bayesian Optimization')

# Add labels and title
ax.set_xlabel('Evaluation Metrics')
ax.set_ylabel('Score')
ax.set_title('Model Comparison: GridSearchCV vs RandomSearchCV vs Bayesian Optimization')
plt.xticks(rotation=45)
plt.legend()
plt.show()


# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

In the code provided, I implemented three hyperparameter optimization techniques:

GridSearchCV

RandomizedSearchCV

Bayesian Optimization (using Hyperopt)
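One practical detail of the Hyperopt variant: for parameters defined with `hp.choice`, `fmin` reports the index of the selected option rather than the option itself (Hyperopt's `space_eval` performs the reverse lookup). The sketch below imitates that lookup in plain Python; `choices` and `decode` are hypothetical names, and the example result is made up.

```python
# Options as they would be passed to hp.choice for each hyperparameter
choices = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [10, 20, 30, None],
}

def decode(best_indices: dict) -> dict:
    """Map hp.choice-style index positions back to the actual option values."""
    return {name: choices[name][idx] for name, idx in best_indices.items()}

# Made-up example of what fmin might return for the space above
best_indices = {"n_estimators": 2, "max_depth": 3}
print(decode(best_indices))
```

Passing the raw indices straight into the estimator would silently train a forest with 2 trees and a depth of 3 instead of 150 trees and unlimited depth.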



##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Lower MAE and MSE lead to more accurate and reliable predictions of sales, reducing the risk of poor decision-making in inventory management, resource allocation, and revenue forecasting.

Higher R² means the model better explains the variations in sales, allowing businesses to identify trends, optimize marketing strategies, and tailor product offerings.

By improving these evaluation metrics (lower MAE, MSE, and higher R²), the Random Forest Regressor model can significantly enhance the business's ability to make data-driven, informed decisions, leading to cost savings, increased profits, and better customer satisfaction.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation using Gradient Boosting Regressor

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import pandas as pd

# Example: Assuming you have a dataframe named train_df with 'Sales' as the target variable

# Handle missing values - fill numeric columns with their median
train_df.fillna(train_df.median(numeric_only=True), inplace=True)

# One-Hot Encoding for categorical variables
train_df = pd.get_dummies(train_df, drop_first=True)

# Define features (X) and target (y)
X = train_df.drop(columns=['Sales'])
y = train_df['Sales']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the GradientBoostingRegressor
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model's performance
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae}")

mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")

r2 = r2_score(y_test, y_pred)
print(f"R-squared (R²): {r2}")

# Optionally, you can also plot the feature importance to interpret the model
import matplotlib.pyplot as plt
feature_importances = model.feature_importances_
sorted_idx = feature_importances.argsort()

plt.barh(train_df.drop(columns=['Sales']).columns[sorted_idx], feature_importances[sorted_idx])
plt.xlabel("Feature Importance")
plt.title("Feature Importances in Gradient Boosting Regressor")
plt.show()


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
import matplotlib.pyplot as plt

# Evaluation metrics
metrics = ['MAE', 'MSE', 'R²']
scores = [mae, mse, r2]

# Create a bar chart to visualize the metrics
plt.bar(metrics, scores, color=['blue', 'green', 'orange'])
plt.xlabel('Evaluation Metrics')
plt.ylabel('Scores')
plt.title('Model Performance using Gradient Boosting Regressor')
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import pandas as pd

# Assuming you have a DataFrame `train_df` with the 'Sales' column being the target variable

# Handle missing values - fill numeric columns with their median
train_df.fillna(train_df.median(numeric_only=True), inplace=True)

# One-Hot Encoding for categorical variables
train_df = pd.get_dummies(train_df, drop_first=True)

# Define features (X) and target (y)
X = train_df.drop(columns=['Sales'])
y = train_df['Sales']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the GradientBoostingRegressor
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model's performance
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae}")

mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")

r2 = r2_score(y_test, y_pred)
print(f"R-squared (R²): {r2}")


from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 5, 10],
}

# Initialize the model
model = GradientBoostingRegressor(random_state=42)

# Set up GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2, scoring='neg_mean_absolute_error')

# Fit the model
grid_search.fit(X_train, y_train)

# Get the best parameters and best model
best_params_grid = grid_search.best_params_
print("Best Parameters from GridSearchCV:", best_params_grid)

# Predict on the test set
y_pred_grid = grid_search.predict(X_test)

# Evaluate the model's performance
mae_grid = mean_absolute_error(y_test, y_pred_grid)
print(f"GridSearchCV - MAE: {mae_grid}")

mse_grid = mean_squared_error(y_test, y_pred_grid)
print(f"GridSearchCV - MSE: {mse_grid}")

r2_grid = r2_score(y_test, y_pred_grid)
print(f"GridSearchCV - R²: {r2_grid}")


from sklearn.model_selection import RandomizedSearchCV
import numpy as np

# Define the parameter distribution
param_dist = {
    'n_estimators': np.arange(50, 300, 50),
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 4, 5, 6],
    'min_samples_split': [2, 5, 10, 20],
}

# Initialize the model
model = GradientBoostingRegressor(random_state=42)

# Set up RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=100, cv=5, verbose=2, n_jobs=-1, random_state=42, scoring='neg_mean_absolute_error')

# Fit the model
random_search.fit(X_train, y_train)

# Get the best parameters and best model
best_params_random = random_search.best_params_
print("Best Parameters from RandomizedSearchCV:", best_params_random)

# Predict on the test set
y_pred_random = random_search.predict(X_test)

# Evaluate the model's performance
mae_random = mean_absolute_error(y_test, y_pred_random)
print(f"RandomizedSearchCV - MAE: {mae_random}")

mse_random = mean_squared_error(y_test, y_pred_random)
print(f"RandomizedSearchCV - MSE: {mse_random}")

r2_random = r2_score(y_test, y_pred_random)
print(f"RandomizedSearchCV - R²: {r2_random}")


from hyperopt import fmin, tpe, hp, Trials
from sklearn.model_selection import cross_val_score

# Define the objective function for Bayesian Optimization
def objective(params):
    model = GradientBoostingRegressor(
        n_estimators=params['n_estimators'],
        learning_rate=params['learning_rate'],
        max_depth=params['max_depth'],
        min_samples_split=params['min_samples_split'],
        random_state=42
    )

    # Use cross-validation to evaluate the model
    score = cross_val_score(model, X_train, y_train, scoring='neg_mean_absolute_error', cv=5).mean()
    return -score  # Return the negative score as hyperopt minimizes the objective

# Define the hyperparameter search space
space = {
    'n_estimators': hp.choice('n_estimators', [50, 100, 150, 200]),
    'learning_rate': hp.uniform('learning_rate', 0.01, 0.2),
    'max_depth': hp.choice('max_depth', [3, 4, 5, 6]),
    'min_samples_split': hp.choice('min_samples_split', [2, 5, 10, 20]),
}

# Set up the trials object to store the results
trials = Trials()

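# Perform the optimization
best_indices_bayes = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)

# fmin returns list indices for hp.choice parameters; space_eval maps them back to the actual values
from hyperopt import space_eval
best_params_bayes = space_eval(space, best_indices_bayes)

print("Best Parameters from Bayesian Optimization:", best_params_bayes)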

# Once the best parameters are found, train the final model
best_model_bayes = GradientBoostingRegressor(
    n_estimators=best_params_bayes['n_estimators'],
    learning_rate=best_params_bayes['learning_rate'],
    max_depth=best_params_bayes['max_depth'],
    min_samples_split=best_params_bayes['min_samples_split'],
    random_state=42
)

# Train the model on the full training data
best_model_bayes.fit(X_train, y_train)

# Predict on the test set
y_pred_bayes = best_model_bayes.predict(X_test)

# Evaluate the model's performance
mae_bayes = mean_absolute_error(y_test, y_pred_bayes)
print(f"Bayesian Optimization - MAE: {mae_bayes}")

mse_bayes = mean_squared_error(y_test, y_pred_bayes)
print(f"Bayesian Optimization - MSE: {mse_bayes}")

r2_bayes = r2_score(y_test, y_pred_bayes)
print(f"Bayesian Optimization - R²: {r2_bayes}")


##### Which hyperparameter optimization technique have you used and why?

The reason we used these techniques is that each has its own advantages depending on the size of the hyperparameter space, the computational resources available, and the time constraints. For exhaustive searches in smaller spaces, GridSearchCV is ideal. For larger spaces with less computational expense, RandomizedSearchCV is more efficient. For highly efficient optimization with fewer evaluations, Bayesian Optimization is the most suitable choice.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

MAE is chosen because it gives a clear and direct interpretation of how far off the model's predictions are from actual values, which is crucial for practical decision-making (like predicting sales revenue).

MSE is useful for penalizing large errors more heavily. In business contexts, errors like overstocking or understocking products can have disproportionately high costs. Therefore, minimizing MSE helps prevent these costly mistakes.

R² is valuable because it reflects how much of the variation in the target variable is captured by the model. In business, models with high R² values provide confidence that the predictions are based on strong relationships within the data, which can guide strategic decisions like marketing spends, product demand forecasting, and resource allocation.
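The three metrics can be computed by hand to see how they react to a single large miss. The values below are made up for illustration, and the `_demo` names are hypothetical so they do not clash with the notebook's variables.

```python
from math import sqrt

y_true_demo = [100.0, 200.0, 300.0, 400.0]  # made-up actual sales
y_pred_demo = [110.0, 190.0, 300.0, 300.0]  # made-up predictions, one large miss

errors = [t - p for t, p in zip(y_true_demo, y_pred_demo)]
n = len(errors)

mae_demo = sum(abs(e) for e in errors) / n   # average absolute miss
mse_demo = sum(e ** 2 for e in errors) / n   # squaring inflates the large miss
rmse_demo = sqrt(mse_demo)                   # back in sales units

mean_true = sum(y_true_demo) / n
ss_res = sum(e ** 2 for e in errors)                     # unexplained variation
ss_tot = sum((t - mean_true) ** 2 for t in y_true_demo)  # total variation
r2_demo = 1 - ss_res / ss_tot                            # share of variation explained

print(mae_demo, mse_demo, rmse_demo, r2_demo)
```

The single miss of 100 contributes 10,000 of the 10,200 total squared error, while it accounts for 100 of the 120 total absolute error, which is why MSE is the stricter metric for costly outlier predictions.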

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

For this problem, the final model would likely be the Random Forest Regressor or XGBoost depending on the dataset's size, complexity, and the need for interpretability.

Why Random Forest Regressor?

Consistency and Stability: Random Forest is a robust and stable model. It is particularly useful when the dataset has a mix of numerical and categorical features and when the relationships between features are complex and non-linear.

Resilience to Overfitting: A single decision tree tends to overfit its training sample. Random Forest reduces this by averaging the predictions of many trees, each trained on a different bootstrap sample, which lowers variance.

Good Performance: Random Forest handles both small and large datasets effectively and remains robust when the data has many features.

Business Use Case: When the focus is on forecasting sales, inventory, or product demand, Random Forest offers solid predictions in settings where accuracy and reliable decision-making are key.

Why XGBoost (if chosen)?

Predictive Power: If the dataset is large and complex, and the goal is to maximize predictive accuracy, XGBoost might be chosen as the final model. Its gradient boosting technique and wide range of tunable parameters often yield strong results on tabular data.

Handling Messy Data: XGBoost handles missing values natively and applies built-in regularization, which helps on noisy datasets.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Model Used: Random Forest Regressor

Random Forest Regressor is an ensemble learning method that constructs multiple decision trees during training and outputs the average of their predictions for regression tasks. Each tree is trained on a bootstrap sample of the data points and considers only a random subset of the features when choosing each split. This allows the forest to model complex relationships while reducing overfitting.
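A minimal sketch of how feature importance could be inspected for such a model is shown below. It uses scikit-learn's built-in impurity-based importances and `permutation_importance` rather than a dedicated explainability library. The data is synthetic: the column names echo Rossmann-style features, but the values and the target formula are assumptions made up for this example, so the ranking says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): not the real Rossmann features.
rng = np.random.default_rng(42)
n_rows = 500
X_demo = pd.DataFrame({
    "Promo": rng.integers(0, 2, n_rows),
    "CompetitionDistance": rng.uniform(100, 10000, n_rows),
    "DayOfWeek": rng.integers(1, 8, n_rows),
})
# By construction, Promo drives most of the variation in the target.
y_demo = (
    5000
    + 2000 * X_demo["Promo"]
    - 0.05 * X_demo["CompetitionDistance"]
    + rng.normal(0, 100, n_rows)
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
rf_demo = RandomForestRegressor(n_estimators=50, random_state=42)
rf_demo.fit(X_tr, y_tr)

# Impurity-based importances: computed from the training data, sum to 1.
impurity_importance = pd.Series(
    rf_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

# Permutation importance: drop in R² on held-out data when a column is shuffled.
perm_result = permutation_importance(
    rf_demo, X_te, y_te, n_repeats=5, random_state=42
)
perm_importance = pd.Series(
    perm_result.importances_mean, index=X_demo.columns
).sort_values(ascending=False)

print(impurity_importance)
print(perm_importance)
```

Permutation importance is measured on held-out data, so it is usually a safer guide than impurity-based importance when features differ widely in cardinality.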

## ***8.*** ***Future Work (Optional)***

### 1. Save the best-performing ML model in pickle or joblib format for deployment.


In [None]:
# Save the File
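One possible way to fill in this cell is sketched below with `joblib.dump`. The small model trained here is only a stand-in so the example runs on its own; in the notebook, the tuned Random Forest would be passed instead. The file name `rossmann_rf_model.joblib` and the temporary directory are hypothetical choices.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in model on random data (assumption): replace with the tuned model.
rng = np.random.default_rng(0)
X_small = rng.normal(size=(100, 3))
y_small = X_small @ np.array([1.0, 2.0, 3.0])
stand_in_model = RandomForestRegressor(n_estimators=10, random_state=0)
stand_in_model.fit(X_small, y_small)

# Hypothetical output location; any writable path works.
model_path = os.path.join(tempfile.gettempdir(), "rossmann_rf_model.joblib")
joblib.dump(stand_in_model, model_path)

print(f"Model saved to {model_path} ({os.path.getsize(model_path)} bytes)")
```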

### 2. Load the saved model file and predict on unseen data as a sanity check.


In [None]:
# Load the File and predict unseen data.
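The sanity check could look like the round trip sketched below, here using `pickle`: a model is written to disk, read back, and both copies are asked to predict the same rows it was not trained on. The model, the data, and the file name `model_demo.pkl` are stand-ins assumed for illustration. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in training and unseen data (assumption): not the Rossmann dataset.
rng = np.random.default_rng(1)
X_train_demo = rng.normal(size=(200, 3))
y_train_demo = X_train_demo @ np.array([1.0, 2.0, 3.0])
X_unseen_demo = rng.normal(size=(5, 3))

original_model = RandomForestRegressor(n_estimators=10, random_state=1)
original_model.fit(X_train_demo, y_train_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model_demo.pkl")  # hypothetical name
    # Write the fitted model to disk.
    with open(model_path, "wb") as f:
        pickle.dump(original_model, f)
    # Read it back as a separate object.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

# The reloaded model should reproduce the original predictions.
preds_original = original_model.predict(X_unseen_demo)
preds_loaded = loaded_model.predict(X_unseen_demo)
round_trip_ok = bool(np.allclose(preds_original, preds_loaded))

print(f"Predictions on unseen rows: {preds_loaded}")
print(f"Loaded model matches original: {round_trip_ok}")
```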

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, we employed a Random Forest Regressor to forecast daily sales for Rossmann stores. As an ensemble method, Random Forest handles complex, non-linear relationships while remaining robust against overfitting. This makes it a reliable choice for real-world business applications, where the data is often noisy and intricate.

Using hyperparameter optimization techniques (GridSearchCV, RandomizedSearchCV, and Bayesian Optimization), we tuned the model to improve its accuracy and generalization. These methods search the parameter space systematically for settings that perform well under the chosen evaluation metrics.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***