<a href="https://colab.research.google.com/github/Irfan270791/CAPSTONE-PROJECT-Regression--Retail-Sales-Prediction/blob/main/Copy_of_IM_Retail_Sales_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  Retail Sales Prediction.



##### **Project Type**    - Regression
##### **Contribution**    - Team
##### **Team Member 1 -** Irfan Momin.
##### **Team Member 2 -** Sushil Ghodwinde.


# **Project Summary -**

The project, "Retail Sales Prediction", focuses on forecasting the sales amount for Rossmann stores up to six weeks in advance. Initially, Exploratory Data Analysis (EDA) was conducted to gain insights into the dataset. Following the EDA, data wrangling techniques were applied to clean and preprocess the dataset. Additionally, feature engineering techniques were employed to create new and meaningful features from the existing dataset.

To predict the sales, various machine learning models were utilized, including Decision Tree Regression, Random Forest Regression, Gradient Boosting Regression & XGBoost Regression.

The main objective of this project is to assess the effectiveness and performance of these different regression models in accurately predicting the sales amount.

# **GitHub Link -**

https://github.com/Irfan270791/CAPSTONE-PROJECT-Regression--Retail-Sales-Prediction

# **Problem Statement**


Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

We are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that, for each algorithm, the following format is answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn import tree
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, accuracy_score, auc
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Set random seed value for reproducibility.
np.random.seed(42)

# Required library for hyperparameter Tuning.
!pip install optuna --quiet

# Required library for visualizing missing values.
!pip install missingno --quiet



In [None]:
import optuna
import missingno as msno
import matplotlib.gridspec as gridspec

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
df = pd.read_csv('/content/drive/MyDrive/Retail Sales Prediction/Rossmann Stores Data.csv')
df_2 = pd.read_csv('/content/drive/MyDrive/Retail Sales Prediction/store (1).csv')

### Dataset First View

In [None]:
# Dataset First Look - df & shape.
print(df.shape)
df.head()


In [None]:
# Dataset First Look - df_2 & shape.
print(df_2.shape)
df_2.head()

* We have a fact table (df) that contains the sales data for each store & date, and a dimension table (df_2) that contains the information for each store.
* We can merge the fact table with the dimension table for easier analysis.

In [None]:
# Merge datasets.
df_m = df.merge(df_2, on='Store', how='left')
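
The merge above can be sanity-checked with a minimal sketch on toy data. The frames `sales` and `stores` below are hypothetical stand-ins for `df` and `df_2`; the `validate` and `indicator` options are an optional safeguard, not something the notebook itself uses.

```python
import pandas as pd

# Toy stand-ins for the fact table (df) and the dimension table (df_2).
sales = pd.DataFrame({
    "Store": [1, 1, 2, 3],
    "Date": ["2015-07-30", "2015-07-31", "2015-07-31", "2015-07-31"],
    "Sales": [5020, 5263, 6064, 8314],
})
stores = pd.DataFrame({
    "Store": [1, 2, 3],
    "StoreType": ["c", "a", "a"],
})

# validate="many_to_one" raises an error if 'Store' is duplicated in `stores`;
# indicator=True adds a '_merge' column showing where each row came from.
merged = sales.merge(stores, on="Store", how="left",
                     validate="many_to_one", indicator=True)

# Rows tagged 'left_only' are sales rows with no matching store record.
unmatched = (merged["_merge"] == "left_only").sum()
print(f"{len(merged)} rows after merge, {unmatched} without store details")
```

A left join keeps every sales row, so the row count after merging should equal the row count of the fact table.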

In [None]:
# Data set first look.
df_m.sample(5)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f'The dataset has {df_m.shape[0]} rows and {df_m.shape[1]} columns')

### Dataset Information

In [None]:
# Dataset Info
df_m.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print('The dataset has:',df_m.duplicated().sum(),'duplicate rows')

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df_m.isna().sum().sort_values(ascending=False)
missing = missing_values[missing_values > 0]

# Percentage of missing values for each feature that has any.
print('Features with missing values (percent missing):\n')
print(missing * 100 / df_m.shape[0])
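
As a small illustration of the same idea, the sketch below builds a count-and-percentage summary in a single table. The `toy` frame and its values are made up for the example; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps; values are illustrative only.
toy = pd.DataFrame({
    "Sales": [100, 200, 300, 400],
    "CompetitionDistance": [500.0, np.nan, 1200.0, 800.0],
    "PromoInterval": [None, None, "Jan,Apr,Jul,Oct", None],
})

# isna().mean() gives the fraction of missing values per column.
missing_summary = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": toy.isna().mean() * 100,
})

# Keep only the columns that actually have gaps, worst first.
missing_summary = (missing_summary[missing_summary["missing_count"] > 0]
                   .sort_values("missing_pct", ascending=False))
print(missing_summary)
```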

In [None]:
# Visualizing the missing values
msno.matrix(df_m, figsize=(12, 4))

### What did you know about your dataset?

The dataset represents historical sales data for 1,115 Rossmann stores. The data contains 1,017,209 entries (rows) & 18 features (columns). The dataset contains no duplicate entries, and 5 features have more than 30% missing values.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df_m.columns.tolist()

In [None]:
# Dataset Describe
df_m.describe().T

### Variables Description

The dataset includes the following features:
* **id** - Represents a unique identifier for a combination of Store & Date within the test set.
* **Store** - A unique identifier for each store.
* **Sales** - The turnover (sales) for a given day, which is the target variable to be predicted.
* **Customers** - The number of customers on a given day.
* **Open** - An indicator of whether the store was open: 0=closed, 1=open.
* **StateHoliday** - Indicates a state holiday. Most stores are closed on state holidays, with a few exceptions. The values are: a=public holiday, b=Easter holiday, c=Christmas, 0=None.
* **SchoolHoliday** - Indicates whether the (Store, Date) was affected by the closure of public schools.
* **StoreType** - Differentiates between four different store models: a, b, c, d.
* **Assortment** - Describes the assortment level of the store: a=basic, b=extra, c=extended.
* **CompetitionDistance** - The distance in meters to the nearest competitor store.
* **CompetitionOpenSince[Month/Year]** - Provides the approximate year & month when the nearest competitor store was opened.
* **Promo** - Indicates whether a store is running a promotion on a given day.
* **Promo2** - Represents a continuing & consecutive promotion for some stores: 0=store is not participating, 1=store is participating.
* **Promo2Since[Year/Week]** - Describes the year & calendar week when the store started participating in Promo2.
* **PromoInterval** - Describes the consecutive intervals in which Promo2 is started, specifying the months in which the promotion starts anew. For example, "Feb,May,Aug,Nov" means the promotion starts in February, May, August & November of any given year for that store.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df_m.nunique().sort_values(ascending=False)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Dropping the duplicate rows if found.
df_m.drop_duplicates(inplace=True)


In [None]:
# There are data points when the stores are closed.
# Checking the value counts during 'open' & 'closed'
print(df_m['Open'].value_counts())

In [None]:
# Checking the sum of sales during 'open' & 'closed'
print(df_m[['Open','Sales']].groupby(['Open']).sum())

* Since there are no sales when the stores are 'closed', we can drop the rows where the store is 'closed'.
* We can then drop the 'Open' column, since it contains only one unique value.
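
The reasoning above can be sketched on toy data: confirm that closed days contribute no sales before filtering them out. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Toy data: each store has one open day and one closed day.
toy = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Open": [1, 0, 1, 0],
    "Sales": [5263, 0, 6064, 0],
})

# Total sales recorded on closed days; expected to be zero.
closed_sales_total = toy.loc[toy["Open"] == 0, "Sales"].sum()

if closed_sales_total == 0:
    # Safe to drop closed days; 'Open' is then constant and can go too.
    open_only = toy[toy["Open"] == 1].drop(columns="Open")
else:
    # Keep everything if closed days unexpectedly carry sales.
    open_only = toy.copy()

print(open_only)
```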

In [None]:
# Selecting only the rows with Open = 1.
df_m2 = df_m[df_m['Open']== 1].drop('Open', axis=1).copy()

# Converting Date to Datetime for analysis & feature engineering.
df_m2['Date'] = pd.to_datetime(df_m2['Date'])

# We can also convert object Dtype to category for reduced memory usage.
for col in df_m2.select_dtypes('object').columns:
  df_m2[col] = df_m2[col].astype('category')

In [None]:
df_m2.info()

In [None]:
# Checking the number of unique values in each categorical columns.
[df_m2[col].value_counts() for col in df_m2.select_dtypes('category').columns]

In [None]:
# Checking the unique values in 'StateHoliday'
print(df_m2['StateHoliday'].unique())

* The StateHoliday column contains a mix of integer & string representations of the "0" value. This causes the value counts to show "0" twice.

In [None]:
# Replace all variations of "0" with single representation.
df_m2['StateHoliday'] = df_m2['StateHoliday'].replace(['0', 0], '0')
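
An alternative way to collapse the mixed representations, shown here as a sketch on a toy Series, is to cast every value to a string before converting to a categorical. This assumes the string form of each value is the intended label; `state_holiday` is a hypothetical stand-in for the real column.

```python
import pandas as pd

# Mixed integer and string zeros, as can happen when a CSV is read in chunks.
state_holiday = pd.Series([0, "0", "a", 0, "b", "c"], dtype="object")

# Casting to str maps both 0 and "0" to "0"; category keeps memory low.
normalized = state_holiday.astype(str).astype("category")

print(normalized.cat.categories.tolist())
```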

### What all manipulations have you done and insights you found?

* The dataset contains a mix of categorical & numerical columns.
* The sum of sales for 'closed' stores is zero, hence only the rows with Open=1 were selected.
* Dropped the 'Open' column since it then contains only one unique value (1 for open stores).
* Converted the 'Date' column to the datetime data type for analysis & feature engineering purposes.
* The 'StateHoliday' column had a mix of integer & string representations of the "0" value, which was resolved by replacing all variations of "0" with str(0).

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Helper functions for cleaning up the data(Helps better visualize the data)

In [None]:
def cap_outliers(df):
  # Create a copy of the Dataframe to avoid modifying the original data.
  cleaned_df = df.copy()

  # Iterate over numerical columns.
  for column in cleaned_df.select_dtypes(include=np.number):
    # Calculate the 99th and 1st percentile values used as caps.
    upper_cap_value = cleaned_df[column].quantile(0.99)
    lower_cap_value = cleaned_df[column].quantile(0.01)

    # Cap values above the 99th percentile and below the 1st percentile.
    cleaned_df[column] = np.where(cleaned_df[column] > upper_cap_value, upper_cap_value, cleaned_df[column])
    cleaned_df[column] = np.where(cleaned_df[column] < lower_cap_value, lower_cap_value, cleaned_df[column])

  return cleaned_df
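
The same capping can also be written with `DataFrame.clip`. The function `cap_outliers_clip` below is a hypothetical variant for illustration, not the helper used in the rest of the notebook, and the toy data is made up.

```python
import numpy as np
import pandas as pd

def cap_outliers_clip(df, lower_q=0.01, upper_q=0.99):
    """Cap numeric columns at the given quantiles; other columns are untouched."""
    capped = df.copy()
    num_cols = capped.select_dtypes(include=np.number).columns

    # One lower and one upper bound per numeric column.
    lower = capped[num_cols].quantile(lower_q)
    upper = capped[num_cols].quantile(upper_q)

    # axis=1 aligns the bounds with the column labels.
    capped[num_cols] = capped[num_cols].clip(lower=lower, upper=upper, axis=1)
    return capped

# Toy data with one extreme value in 'Sales'.
toy = pd.DataFrame({"Sales": list(range(1, 100)) + [10_000],
                    "StoreType": ["a"] * 100})
capped = cap_outliers_clip(toy)
print(toy["Sales"].max(), capped["Sales"].max())
```

Computing the bounds once per frame avoids the per-column loop, and the original frame is left unchanged because the function works on a copy.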

#### Chart - 1  Univariate Analysis.

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
import seaborn as sns
df_temp = df_m2.copy()
# Numerical features
numerical_features = (df_temp.select_dtypes(include=['int64', 'float64']).columns)

for feature in numerical_features:
  plt.figure(figsize=(6, 4))
  sns.histplot(df_temp[feature], kde=True, bins=30)
  plt.title(f'Distribution of {feature}')
  plt.show()

  # Central tendencies and dispersion.
  mean = df_temp[feature].mean()
  median = df_temp[feature].median()
  std_dev = df_temp[feature].std()
  print(f'Mean of {feature}: {mean}')
  print(f'Median of {feature}: {median}')
  print(f'Standard deviation of {feature}: {std_dev}')
  print(f'Min value of {feature}: {df_temp[feature].min()}')
  print(f'Max value of {feature}: {df_temp[feature].max()}')

##### 1. Why did you pick the specific chart?

* A histogram allows us to see the distribution of the data & understand its central tendency & spread.

##### 2. What is/are the insight(s) found from the chart?

* The sales data is available from 2013-01-01 to 2015-07-31, a span of 941 days.
* Stores are closed on Sundays, hence sales are reported as zero.
* Only a few data points are available during school holidays (SchoolHoliday=1) & state holidays (StateHoliday in a, b, c).
* The distribution of competition open since year is left skewed & ranges from 1900 to 2015.
* The distribution of competition distance is right skewed & ranges from 20 to 75860.
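
The skew direction read off the histograms can be cross-checked numerically with `skew()`. The sketch below uses a handful of made-up values per column (only the minimum and maximum are taken from the ranges noted above), so it illustrates the sign convention rather than the real skewness of the dataset.

```python
import pandas as pd

# Toy values: a long right tail for distance, a long left tail for year.
toy = pd.DataFrame({
    "CompetitionDistance": [20, 150, 300, 450, 900, 2500, 75860],
    "CompetitionOpenSinceYear": [1900, 2005, 2008, 2010, 2012, 2013, 2015],
})

# Positive skewness means right skewed, negative means left skewed.
skewness = toy.skew()
for feature, value in skewness.items():
    direction = "right" if value > 0 else "left"
    print(f"{feature}: skewness={value:.2f} ({direction} skewed)")
```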

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Examining sales data during school & state holidays can help in devising targeted marketing campaigns to attract customers during these periods.
* Understanding the distribution of competition distance can help in strategic placement of new stores to minimize competition & maximize market share.

#### Chart - 2 Histogram of sales.

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(10,5))

data = df_m2

plt.subplot(1, 2, 1)
plt.title('Histogram of sales')
sns.histplot(data['Sales'])

# Calculating the mean & median.
mean_value = data['Sales'].mean().round(2)
median_value = data['Sales'].median().round(2)

# Drawing the lines for mean & median.
plt.axvline(mean_value, color='red', linestyle='--', label='Mean:'+ str(mean_value))
plt.axvline(median_value, color= 'blue', linestyle='--', label='Median:'+str(median_value))
plt.legend()

plt.subplot(1, 2, 2)
plt.title('Sales Spread')
sns.boxplot(y= data['Sales'])
plt.show()

print('5th Percentile of sales', data['Sales'].quantile(0.05))
print('95th Percentile of sales', data['Sales'].quantile(0.95))

##### 1. Why did you pick the specific chart?

* A histogram allows us to see the distribution of the data & understand its central tendency & spread.
* A box plot provides a summary of the distribution of a dataset, including information about the median, quartiles & potential outliers.

##### 2. What is/are the insight(s) found from the chart?

* The distribution of sales is skewed to the right.
* 90% of the time, the sales per day are within the interval 3173 to 12668 (the 5th to 95th percentile).
* The top 5% of sales lie within the interval 12668 to 41551.
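
The link between the two percentiles and the "90% of the time" statement can be sketched as follows. The `sales` Series here is synthetic (drawn from a log-normal distribution chosen only because it is right skewed), so the printed numbers will not match the figures above.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sales; parameters are arbitrary assumptions.
rng = np.random.default_rng(42)
sales = pd.Series(rng.lognormal(mean=8.7, sigma=0.4, size=10_000), name="Sales")

# 5th and 95th percentiles bound the central 90% of observations.
p05, p95 = sales.quantile([0.05, 0.95])
share_inside = sales.between(p05, p95).mean()

print(f"5th percentile: {p05:.0f}, 95th percentile: {p95:.0f}")
print(f"Share of days inside the interval: {share_inside:.1%}")
print(f"Mean > median (right skew): {sales.mean() > sales.median()}")
```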

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Recognizing the skewness in the sales distribution allows the business to tailor marketing strategies to address both the majority of sales & the occasional high-sales events. This approach can improve marketing ROI & customer engagement.
* Understanding the distribution of sales can help the business identify periods of high sales activity. This can be leveraged to optimize inventory levels & marketing efforts during peak sales periods, leading to increased revenue.

#### Chart - 3 Categorical features VS Sales histogram/Box plot.
categorical features (StateHoliday, StoreType, Assortment, PromoInterval)

In [None]:
# Chart - 3 visualization code
import itertools
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style= "whitegrid")
plt.figure(figsize=(10, 20))

# Named 'plot_data' rather than 'input' to avoid shadowing the Python built-in.
plot_data = cap_outliers(df_m2)
columns = ['Promo', 'Promo2', 'SchoolHoliday', 'StateHoliday', 'StoreType', 'Assortment', 'PromoInterval']
colors = ['skyblue', 'lightgreen', 'lightcoral', 'gold']

color_cycle = itertools.cycle(colors)

for i, col in enumerate(columns):
  color_name = next(color_cycle)
  # Left panel: number of records per category.
  plt.subplot(7,2, 2*i+1)
  plt1 = plot_data[col].value_counts().plot(kind='bar', color = color_name)
  plt.title(f'{col} Histogram')
  plt1.set(xlabel=col, ylabel= 'Count of sales')

  # Right panel: distribution of sales per category.
  plt.subplot(7, 2, 2*i+2)
  sns.boxplot(x=col, y='Sales', data=plot_data, color=color_name)
  plt.title(f'Sales over {col} Boxplot')
  plt.xlabel(col)
  plt.ylabel('Sales')

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

* A Histogram allows us to see the distribution of data & understand its central tendency and the spread of the dataset.
* A Boxplot provides a summary of the distribution of a dataset, including information about the median, quartiles & potential outliers.

##### 2. What is/are the insight(s) found from the chart?

1. **Promotion (Promo):** The median value of sales is higher when there is a promotion than when there is none.
2. **Promotion type 2 (Promo2):** The count & median value of sales are relatively similar for both Promo2 values.
3. **School Holiday:** The median sales are slightly higher during school holidays.
4. **State Holiday:** The median sales are highest during state holiday type 'b'.
5. **Store type:**
   * The count of sales is highest for store type 'a'.
   * The median sales are highest for store type 'b'.
6. **Assortment:**
   * The count of sales is highest for assortment type 'a'.
   * The median sales are highest for assortment type 'c'.
7. **Promotion Interval (PromoInterval):**
   * The count of sales is highest for the promo interval 'Jan,Apr,Jul,Oct'.
   * The median sales are highest for the promo interval 'Feb,May,Aug,Nov'.
                                    

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Understanding the impact of promotions, store type, assortment and holiday periods on sales can help Rossmann optimize its strategies.
* Targeting promotions at high-sales periods, such as school holidays or specific state holidays, can lead to increased sales & better resource utilization.

#### Chart - 4 - Median sales over year/Quarter/Month/Week.

In [None]:
# Chart - 4 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import itertools

# Create a copy of the dataframe.
df_temp = df_m.copy()
df_temp['Date'] = pd.to_datetime(df_temp['Date'])

# Extract Year, Month, Week & Quarter from the Date.
df_temp['year'] = df_temp['Date'].dt.year
df_temp['month'] = df_temp['Date'].dt.month_name()  # Getting the months name.
df_temp['week'] = df_temp['Date'].dt.day_name()   #Getting the weekdays name.
df_temp['quarter'] = 'Q'+ df_temp['Date'].dt.quarter.astype(str)  # Getting the quarters.

# Specifying the correct order for months & Weeks.
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
week_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Convert 'month' & 'week' columns to categorical type with the specified order.
df_temp['month'] = pd.Categorical(df_temp['month'], categories = month_order, ordered= True)
df_temp['week']= pd.Categorical(df_temp['week'], categories= week_order, ordered= True)

# Set the seaborn style & color palette.
sns.set_style("whitegrid")
colors = itertools.cycle(['skyblue', 'lightgreen', 'lightcoral', 'gold'])

fig, axes = plt.subplots(2, 2, figsize=(10, 10))

for i,col in enumerate(['year', 'quarter', 'month', 'week']):
    # Group by year, quarter, month, week & find the median of sales.
    sales_data = df_temp.groupby([col])['Sales'].median()

    # Plotting
    ax = axes[i // 2, i % 2]  # selecting the appropriate subplot.
    sns.barplot(x=sales_data.index, y=sales_data.values, color=next(colors), edgecolor='black', ax=ax)

    ax.set_title(f'Median Sales over {col.capitalize()}', fontsize=16)
    ax.set_xlabel(f'{col.capitalize()}', fontsize=14)
    ax.set_ylabel('Median Sales', fontsize=14)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45)  # Rotate x-labels for better visibility.

plt.tight_layout()  # Adjusting spacing between subplots.
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot helps us compare a numerical variable across categories. The categorical variable is plotted along the horizontal axis, and the height of each bar represents the value of the numerical variable.

##### 2. What is/are the insight(s) found from the chart?

1. Yearly Sales:
   * Sales data is available for the years 2013, 2014 & 2015.
   * Median sales increased from 5,599.0 in 2013 to 5,918.0 in 2015, showing a positive trend over the years.

2. Quarterly Sales:
   * The highest median sales were observed in Q4 with a value of 6,044.0, suggesting that the holiday season might contribute significantly to sales.
   * Q1 had the lowest median sales with a value of 5,614.0.
3. Monthly Sales:
   * December recorded the highest median sales with a value of 6,732.0, indicating a peak in sales during the holiday season.
   * January had the lowest median sales with a value of 5,484.0, possibly due to reduced consumer spending after the holiday season.
   * Sales remained relatively stable from February to November, ranging from 5,611.0 to 6,083.0 with minor fluctuations.
4. Sales by Day of the Week:
   * The highest median sales are observed on Monday with a value of 7,311.0, indicating strong sales at the beginning of the week.
   * Sunday has a median of zero sales, suggesting that most stores are closed on Sundays.
   * Sales remained relatively consistent from Tuesday to Friday.
   * Saturday had comparatively lower sales with a value of 5,410.0.
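
The day-of-week comparison relies on an ordered categorical so that the groups come out in calendar order rather than alphabetical order. A minimal sketch, using made-up sales for five real dates:

```python
import pandas as pd

# Toy data: two Mondays, one Tuesday, one Saturday, one Sunday.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2015-07-27", "2015-07-28", "2015-08-01",
                            "2015-08-02", "2015-08-03"]),
    "Sales": [7000, 6000, 5400, 0, 7600],
})

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# An ordered categorical makes groupby sort by weekday, not by name.
toy["day"] = pd.Categorical(toy["Date"].dt.day_name(),
                            categories=day_order, ordered=True)

# observed=True keeps only the weekdays that appear in the data.
median_by_day = toy.groupby("day", observed=True)["Sales"].median()
print(median_by_day)
```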

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The business experiences seasonal variation in sales, with the highest sales occurring in Q4 during the holiday season and the lowest sales in Q1.
* Heavy reliance on the holiday season for generating significant revenue can pose challenges in maintaining consistent sales throughout the year. To mitigate negative growth during non-peak seasons, the business should explore strategies to stimulate demand & attract customers during off-peak periods.

#### Chart - 5  Continuous features VS Sales - Scatterplot.

In [None]:
# Chart - 5 visualization code
# Numerical Columns.
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# The data is stored in the DataFrame 'df_m2'.
# 'Sales' is the dependent variable & the other numerical columns are independent variables.

# numerical_columns = df_m2.select_dtypes(include=['float64','int64']).columns
numerical_columns = ['CompetitionDistance', 'Customers']

# Calculate the number of rows & columns needed for the subplots.
num_rows = (len(numerical_columns) // 2) + (len(numerical_columns) % 2)
num_cols = 2

# Create subplot with 2 columns
fig, axes = plt.subplots(num_rows, num_cols, figsize=(10, 6))

# Flatten the axes array if necessary.
axes = axes.flatten() if isinstance(axes, np.ndarray) else axes

# Iterate through each numerical column & create scatter plots in the subplots.
for i, column in enumerate(numerical_columns):
    sns.scatterplot(x=column, y='Sales', data=df_m2, ax=axes[i], alpha=0.1, markers=['o'])
    axes[i].set_title(f'Scatterplot for {column} VS Sales')

# Remove any empty subplot if the number of variable is odd
if len(numerical_columns) % 2 !=0:
   axes[-1].remove()

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot helps us visualize the relationship between two continuous variables. Each point represents an observation, plotted along the horizontal & vertical axes corresponding to the two variables being analyzed.

##### 2. What is/are the insight(s) found from the chart?

* **Competition Distance VS Sales** - Competition distance appears to be negatively correlated with sales, suggesting that stores with nearby competitors (areas of higher competition) generate higher sales.

* **Customers VS Sales** - The number of customers is positively correlated with sales: more customers visiting the store leads to higher sales.
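
The visual impression from a scatter plot can be backed up with correlation coefficients. The sketch below uses synthetic data in which the signs of the relationships are built in by construction, so it only demonstrates how to compute the values, not what they are for the Rossmann data. The rank-based (Spearman-style) correlation is obtained by correlating ranks.

```python
import numpy as np
import pandas as pd

# Synthetic data; the coefficients below are assumptions for illustration.
rng = np.random.default_rng(0)
n = 1_000
customers = rng.integers(100, 2_000, size=n)
competition_distance = rng.integers(20, 20_000, size=n)
sales = (8_000 + 9.0 * customers - 0.3 * competition_distance
         + rng.normal(0, 500, size=n))

toy = pd.DataFrame({
    "Customers": customers,
    "CompetitionDistance": competition_distance,
    "Sales": sales,
})

# Pearson measures linear association with Sales.
pearson = toy.corr()["Sales"].drop("Sales")

# Correlating ranks gives a rank-based measure that is robust to skew.
rank_corr = toy.rank().corr()["Sales"].drop("Sales")

print(pd.DataFrame({"pearson": pearson, "rank_based": rank_corr}).round(2))
```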

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Strategic decisions regarding store locations can be made based on competition distance to optimize sales & capture a larger market share.

#### Chart - 6  Multiclass variables VS Sales.

In [None]:
# Chart - 6 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

multiclass = ['CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2SinceWeek', 'Promo2SinceYear']
sample_data = cap_outliers(df_m2.reset_index(drop=True).copy())
target = 'Sales'

fig,axes = plt.subplots(len(multiclass),2, figsize=(15, 15))
sns.set(style='whitegrid')

for i, feature in enumerate(multiclass):
    sns.boxplot(x=feature, y=target, data=sample_data, ax=axes[i, 0])
    axes[i, 0].set_title(f'Box plot: {feature} VS {target}')
    axes[i, 0].set_xlabel(feature)
    axes[i, 0].set_ylabel(target)
    axes[i, 0].tick_params(rotation=45)

    sns.lineplot(x=feature, y=target, data=sample_data, ax=axes[i, 1])
    axes[i, 1].set_title(f'Line Plot: {feature} VS {target}')
    axes[i, 1].set_xlabel(feature)
    axes[i, 1].set_ylabel(target)
    axes[i, 1].tick_params(rotation=45)

    # Calculate correlation coefficient.
    correlation = sample_data[feature].corr(sample_data[target])

    # Set color based on correlation sign.
    color = 'lightgreen' if correlation >= 0 else 'lightcoral'

    # Add correlation value as  a legend with color.
    axes[i, 1].legend([f'Correlation: {correlation:.2f}'], facecolor=color)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

* A line plot is a graphical representation of data points connected by straight lines. It is commonly used to show the relationship between two continuous variables or to display the change in a variable over time.

##### 2. What is/are the insight(s) found from the chart?

1. Median Sales by CompetitionOpenSinceMonth:
   * There seems to be a trend in which the highest median sales are observed in June (6.0).
   * The lowest median sales are observed in February (2.0).

2. Median Sales by CompetitionOpenSinceYear:
   * Median sales seem to vary across different years, indicating that the year of competition opening might have an impact on sales performance.

3. Median Sales by Promo2SinceWeek:
   * The varying median sales suggest that the timing of Promo2 activation might influence the sales performance of stores.

4. Median Sales by Promo2SinceYear:
   * The median sales vary across different years of Promo2 activation.
   * Stores that activated Promo2 in 2014 have the highest median sales.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Insights on sales trends based on competition opening, Promo2 activation, and timing can help the business optimize strategies and potentially generate a positive business impact.

#### Chart - 7 Sales Volume and Median Sales by Month.

In [None]:
# Chart - 7 visualization code
import pandas as pd

# Create a copy of the dataframe.
df_temp = df_m.copy()

# Extract year and month from date.
df_temp['year'] = pd.to_datetime(df_temp['Date']).dt.year
df_temp['month'] = pd.to_datetime(df_temp['Date']).dt.month

# Groupby year, month & find the median of sales.
sales_data = df_temp.groupby(['year', 'month']).Sales.median()

# Find sales volume.
sales_volume = df_temp.groupby(['year', 'month']).size()

fig, ax1 = plt.subplots(figsize=(15, 6))

# Plot median sales as a line plot.
sales_data.plot(kind='line', ax=ax1, color='blue', legend=True)

# Create another y-axis for sales volume.
ax2  = ax1.twinx()
sales_volume.plot(kind= 'bar', ax=ax2, alpha=0.3, legend=True)

ax1.set_title('Median Sales and Sales Volume Over Month and Years')
ax1.set_xlabel('Time (year, Month)')
ax1.set_ylabel('Median Sales')
ax2.set_ylabel('Sales Volume')

# Make the new labels for the x-axis, where only the first occurrence of each year is labeled.
labels = [(2013, 'jan'),('feb'),('mar'),('apr'),('may'),('jun'),('jul'),('aug'),('sep'),('oct'),('nov'),('dec'),
          (2014, 'jan'),('feb'),('mar'),('apr'),('may'),('jun'),('jul'),('aug'),('sep'),('oct'),('nov'),('dec'),
          (2015, 'jan'),('feb'),('mar'),('apr'),('may'),('jun'),('jul')]

ax1.set_xticks(range(len(labels)))
ax2.set_xticks(range(len(labels)))

# Set labels for both axis after the plots are created.
ax1.set_xticklabels(labels, rotation=20, ha='right')
ax2.set_xticklabels(labels, rotation=20, ha='right')

plt.show()

##### 1. Why did you pick the specific chart?

A line plot helps us visualize the relationship between two continuous variables or display the change in a variable over time.

##### 2. What is/are the insight(s) found from the chart?

* **Seasonal Sales Trend** - There seems to be a noticeable seasonal pattern in sales, with peaks occurring in December & lower points during the early months of the year. This suggests a possible relationship between sales and the holiday season.

* **Increasing Sales Volume** - The sales volume (the number of records per month, `sales_volume` in the code) shows a general increasing trend over time. This suggests that the number of sales records has been growing steadily.

* **Sales Fluctuations** - While the sales volume exhibits a consistent upward trend, the actual sales figures show some fluctuations from month to month. These fluctuations may be influenced by various factors, such as promotions, external events or changes in customer behaviour.

* **Strong Sales in 2014** - There is a noticeable increase in sales during 2014 compared to the preceding & subsequent years. This could indicate a period of significant growth or successful marketing initiatives during that year.

* **Correlation between Sales & Sales Volume** - There appears to be a positive correlation between sales and sales volume, as the general trend of increasing sales volume aligns with the overall pattern of sales.
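
Median sales and record counts per month can be computed in one pass with named aggregation, which keeps the two series aligned on the same index. A minimal sketch on made-up data, where `median_sales` and `record_count` are hypothetical column names:

```python
import pandas as pd

# Toy data: two records in January 2013, three in February 2013.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2013-01-01", "2013-01-02", "2013-02-01",
                            "2013-02-02", "2013-02-03"]),
    "Sales": [0, 5000, 6000, 7000, 0],
})

# Group by (year, month) and compute both statistics together.
monthly = (
    toy.groupby([toy["Date"].dt.year.rename("year"),
                 toy["Date"].dt.month.rename("month")])
       .agg(median_sales=("Sales", "median"),
            record_count=("Sales", "count"))
)
print(monthly)
```

Note that rows with zero sales (closed days) pull the median down and still count towards the record total, which matters when interpreting both curves.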

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights have the potential to create a positive business impact by leveraging seasonal sales trends, increasing sales volume & identifying success

#### Chart - 8 Median Sales by store Type/Assortment Over Year & Month.

In [None]:
# Chart - 8 visualization code
import pandas as pd
import matplotlib.pyplot as plt

# Create a copy of the dataframe
df_temp = df_m.copy()

# Extract year and month from Date.
df_temp['year'] = pd.to_datetime(df_temp['Date']).dt.year
df_temp['month'] = pd.to_datetime(df_temp['Date']).dt.month

# Groupby year, month & storetype, then find the median of sales.
sales_data = df_temp[['Sales', 'StoreType', 'year', 'month']].groupby(['year', 'month', 'StoreType']).median()

# Reshape the data for plotting
sales_data = sales_data.unstack(level='StoreType')

# Plotting
sales_data.plot(figsize=(10, 6))

plt.title('Median Sales by Store Type Over Year & Month')
plt.xlabel('Time (year, Month)')
plt.ylabel('Median Sales')
plt.show()

##-------------------------------------------------------------------------------------------------------------------------##

# Groupby year, month & Assortment, then find the median of sales.
sales_data = df_temp[['Sales', 'Assortment', 'year', 'month']].groupby(['year', 'month', 'Assortment']).median()

# Reshape the data for plotting
sales_data = sales_data.unstack(level='Assortment')

# Plotting.
sales_data.plot(figsize=(10, 6))
plt.title('Median Sales by Assortment Over Year & Month')
plt.xlabel('Time (Year, Month)')
plt.ylabel('Median Sales')
plt.show()

##### 1. Why did you pick the specific chart?

A line plot helps us identify trends in the data and compare multiple categories.

##### 2. What is/are the insight(s) found from the chart?

* **Overall Trend** - There is a general growth in sales across all store types & assortments from 2013 to 2015.
* **Store Type Performance** - Store type "b" consistently has the highest sales throughout the entire period, followed by store type "d". Store types "a" & "c" generally have lower sales numbers in comparison.
* **Assortment Performance** - Assortment type "c" consistently has the highest sales throughout the entire period, followed by assortment type "b". Assortment type "a" generally has lower sales numbers in comparison.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can help businesses identify the best-performing store type & assortments and allocate resources accordingly.

#### Chart - 9 Median Sales with/without Promo/Promo2 Over Year & Month.

In [None]:
# Chart - 9 visualization code
# Create a copy of the dataframe
df_temp = df_m.copy()

# Extract year and month from the Date.
df_temp['year'] = pd.to_datetime(df_temp['Date']).dt.year
df_temp['month'] = pd.to_datetime(df_temp['Date']).dt.month

# Group by year, month & Promo, then find the median of sales.
sales_data = df_temp[['Sales', 'Promo', 'month', 'year']].groupby(['year', 'month', 'Promo']).median()

# Reshape the data for plotting.
sales_data = sales_data.unstack(level='Promo')

# Plotting.
sales_data.plot(figsize=(10, 6))

plt.title('Median Sales with/without Promo Over Year & Month')
plt.xlabel('Time (Year, Month)')
plt.ylabel('Median Sales')
plt.show()

##------------------------------------------------------------------------------------------------------------------##

# Group by year, month & Promo2, then find the median of sales.
sales_data = df_temp[['Sales', 'Promo2', 'month', 'year']].groupby(['year', 'month', 'Promo2']).median()

# Reshape the data for plotting
sales_data = sales_data.unstack(level='Promo2')

# Plotting.
sales_data.plot(figsize=(10, 6))

plt.title('Median Sales with/without promo2 over Year & Month')
plt.xlabel('Time (Year, Month)')
plt.ylabel('Median Sales')
plt.show()

##### 1. Why did you pick the specific chart?

* A line plot helps us identify trends in the data & compare multiple categories.

##### 2. What is/are the insight(s) found from the chart?

* **Promotion Impact** -
    1. The sales data shows that promotional periods (Promo = 1) generally result in higher sales than non-promotional periods (Promo = 0) across all years.
    2. Promo2 appears to have less impact on sales: periods with Promo2 = 0 show higher median sales than periods with Promo2 = 1.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Promotional Strategy Evaluation** - By analyzing the sales performance during promotional & non-promotional periods, businesses can assess the effectiveness of their promotional strategies.
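
One simple way to put a number on the promotional effect discussed above is the ratio of median sales on promo days to median sales on non-promo days. The sketch below is illustrative only: the toy values and the helper name `promo_uplift` are assumptions, not results from the Rossmann data.

```python
import pandas as pd

def promo_uplift(df, promo_col="Promo", sales_col="Sales"):
    """Return median sales with promo divided by median sales without promo.

    Hypothetical helper for illustration; assumes promo_col holds 0/1 flags.
    """
    medians = df.groupby(promo_col)[sales_col].median()
    return medians.loc[1] / medians.loc[0]

# Toy data, purely illustrative (not taken from the project dataset).
toy = pd.DataFrame({
    "Promo": [0, 0, 0, 1, 1, 1],
    "Sales": [4000, 5000, 6000, 7000, 7500, 8000],
})

uplift = promo_uplift(toy)
print(f"Median sales uplift with promo: {uplift:.2f}x")
```

The same helper could be applied to `Promo2` by passing `promo_col="Promo2"`, which makes the two promotion types directly comparable.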

#### Chart - 10 Average Number of Customer Over School Holiday & Promo.

In [None]:
# Chart - 10 visualization code
# Create a subset of the data with the required columns.
subset_df = df_m[['Customers', 'Promo', 'SchoolHoliday']]

# Group the data by 'Promo' & 'SchoolHoliday' and calculate the average number of customers.
grouped_data = subset_df.groupby(['Promo', 'SchoolHoliday'])['Customers'].mean().reset_index()

# Pivot the data to make it suitable for visualization.
pivot_table = grouped_data.pivot(index='Promo', columns = 'SchoolHoliday', values= 'Customers')

# Plotting the customer count using a heatmap.
plt.figure(figsize=(8, 6))
plt.imshow(pivot_table, cmap='viridis', aspect='auto')

# Set x-axis and y-axis labels.
plt.xlabel('School Holiday')
plt.ylabel('Promotional Activities')

# Set x-tick labels.
plt.xticks([0, 1], ['No', 'Yes'])
plt.yticks([0, 1], ['No', 'Yes'])

# Add a color bar legend
cbar = plt.colorbar()
cbar.set_label('Average Number of Customers')

# Add value annotations to each cell
for i in range(pivot_table.shape[0]):
   for j in range(pivot_table.shape[1]):
     plt.text(j,i,f'{pivot_table.iloc[i, j]:.2f}', ha='center', va='center', color='white')

# Add a title.
plt.title('Average Number of Customers: Promo VS School Holiday')

plt.show()

##### 1. Why did you pick the specific chart?

* Here a heatmap is used to visualize patterns & relationships in data that are organized in a tabular format.

##### 2. What is/are the insight(s) found from the chart?

* The combination of a promotion & non-school holiday has the highest average number of customers (824.27).
* On the other hand, the combination of no promotion & non-school holiday has the lowest average number of customers (498.24)

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights suggest that promotions play a significant role in increasing customer visits, particularly during non-school holidays.
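
The group-then-pivot step behind the heatmap can also be written as a single `pd.pivot_table` call. The sketch below uses made-up customer counts to show the shape of the resulting 2x2 table; only the column names are taken from the notebook.

```python
import pandas as pd

# Toy data, purely illustrative.
toy = pd.DataFrame({
    "Promo":         [0, 0, 1, 1, 0, 1],
    "SchoolHoliday": [0, 1, 0, 1, 0, 0],
    "Customers":     [500, 520, 800, 780, 480, 820],
})

# Rows: Promo (0/1), columns: SchoolHoliday (0/1), cells: mean customer count.
table = pd.pivot_table(
    toy,
    index="Promo",
    columns="SchoolHoliday",
    values="Customers",
    aggfunc="mean",
)
print(table)
```

Each cell can then be read with `table.loc[promo, school_holiday]`, which is convenient when quoting individual values in the write-up.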

#### Chart - 11 - Correlation Heatmap





In [None]:
# Correlation Heatmap visualization code
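# Plot the correlation matrix of the numeric columns and specify the figure size.
# numeric_only=True skips string columns such as StoreType (pandas >= 2.0 raises an error otherwise).
corr_matrix = df_m2.corr(numeric_only=True)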
fig, ax = plt.subplots(figsize= (12, 9))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', square=True)
plt.show()

##### 1. Why did you pick the specific chart?

A correlation matrix shows the correlation coefficient between each pair of variables in a dataset. Correlation coefficients range from -1 to 1.
 * A correlation of 1 indicates a perfect positive linear correlation: as one variable increases, the other increases proportionately.
 * A correlation of -1 indicates a perfect negative linear correlation, and a value near 0 indicates little or no linear relationship.
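
To make the coefficient concrete, here is a minimal sketch of the Pearson correlation computed by hand with NumPy. The function name `pearson_r` and the sample values are illustrative assumptions; in the notebook itself the values come from `DataFrame.corr()`.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centre both variables
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])   # y grows in proportion to x
r_neg = pearson_r(x, [10, 8, 6, 4, 2])   # y shrinks in proportion to x
print(r_pos, r_neg)
```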

##### 2. What is/are the insight(s) found from the chart?

1. The strongest positive correlation appears to be between Sales & Customers (0.82). This is to be expected, as more customers would generally lead to higher sales.
2. The variable Promo has a positive correlation with both Sales (0.37) & Customers (0.18). This suggests that running a promotion tends to increase both sales and customer visits.
3. Promo2 has a weak negative correlation with both Sales (-0.13) and Customers (-0.20), which might indicate that this type of promotion is less effective than Promo at driving sales and customer visits, or even has a negative effect.
4. CompetitionDistance shows a negative correlation with Customers (-0.15), suggesting that stores with closer competitors might have more customers.
5. Promo2SinceYear has a notable negative correlation with Promo2SinceWeek (-0.24), possibly because both are derived from the same start date.

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1 The presence of a promotion (Promo) has a positive impact on sales.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* Null Hypothesis - H0: The presence of a promotion (Promo) has no significant impact on sales.
* Alternate Hypothesis - HA: The presence of a promotion (Promo) has an impact on sales.

#### 2. Perform an appropriate statistical test.

In [None]:
import scipy.stats as stats

# Separate the data into two groups.
promo_true = df_m2[df_m2['Promo']== 1]
promo_false = df_m2[df_m2['Promo']== 0]

In [None]:
# Perform Statistical Test to obtain P-Value
# Student's two-sample t-test (equal variances assumed) with a two-sided alternative.
# Arguments left at their defaults are omitted; some of them are deprecated in newer SciPy releases.
t_stat, p_value = stats.ttest_ind(promo_true['Sales'], promo_false['Sales'], equal_var=True, alternative='two-sided')
print('t-stat:', t_stat)
print('p-value:', p_value)

# Compare p_value with the significance level (0.05)
if p_value < 0.05:
    conclusion = "Reject Null hypothesis. The presence of a promotion (promo) has an impact on sales."
else:
    conclusion = "Fail to Reject Null hypothesis. The presence of a promotion (promo) has no significant impact on sales."

print(conclusion)

##### Which statistical test have you done to obtain P-Value?

The independent t-test, also called the two-sample t-test, independent-samples t-test or Student's t-test, is an inferential statistical test that determines whether there is a statistically significant difference between the means of two unrelated groups.

##### Why did you choose the specific statistical test? Insights from the Hypothesis Testing.

* The t-statistic (t = 363.84) measures the difference between the two group means relative to the standard error of that difference. A higher absolute value indicates a larger difference relative to the variability in the data.

* The t-test produced a p-value that is reported as 0.0 (too small to represent in floating point), so we can reject the null hypothesis. This suggests there is a significant difference in sales between days with a promotion (Promo = 1) and days without one (Promo = 0).
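
With around a million rows, almost any difference yields a p-value near zero, so an effect size is a useful companion to the test. The sketch below runs Welch's t-test (a variant that does not assume equal variances, swapped in here for illustration) and computes Cohen's d on synthetic data. The sample sizes, means, and the helper name `cohens_d` are assumptions for this example only.

```python
import numpy as np
from scipy import stats

# Synthetic samples standing in for sales with and without a promotion.
rng = np.random.default_rng(0)
with_promo = rng.normal(loc=8000, scale=2500, size=500)
without_promo = rng.normal(loc=6000, scale=2000, size=700)

# Welch's t-test: equal_var=False drops the equal-variance assumption.
t_stat, p_value = stats.ttest_ind(with_promo, without_promo, equal_var=False)

def cohens_d(a, b):
    """Standardised mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

d = cohens_d(with_promo, without_promo)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, Cohen's d = {d:.2f}")
```

By common convention, a d of roughly 0.2 is read as small, 0.5 as medium and 0.8 as large, which helps judge whether a statistically significant difference is also practically meaningful.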

### Hypothetical Statement - 2 There is significant difference in sales between different store types.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* Null hypothesis(H0)- There is no significant difference in sales between different store types.

* Alternative hypothesis(H1) - There is a significant difference in sales between different store types.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# perform ANOVA test.
f_statistic, p_value = stats.f_oneway(
    df_m2['Sales'][df_m2.StoreType == 'a'],
    df_m2['Sales'][df_m2.StoreType == 'b'],
    df_m2['Sales'][df_m2.StoreType == 'c'],
    df_m2['Sales'][df_m2.StoreType == 'd'])

print('f-statistic:', f_statistic)
print('p-value:', p_value)

# Compare p-value with significance level (0.05)
if p_value < 0.05:
    conclusion = "Reject Null hypothesis. There is a significant difference in sales between different store types."
else:
    conclusion = "Fail to Reject Null hypothesis. There is no significant difference in sales between different store types."

print(conclusion)

##### Which statistical test have you done to obtain P-Value?

ANOVA is a statistical test used to compare the means of two or more groups to determine whether there are significant differences between them. In this case the different store types are treated as the groups, and the sales data is compared to see whether there are significant variations among the groups.
By choosing the ANOVA test we can compare the means of multiple store types simultaneously, rather than conducting pairwise comparisons between each pair of store types.

##### Why did you choose the specific statistical test? Insights from hypothesis test.

The F-statistic (F = 6081) is the ratio of the variance between groups to the variance within groups. A higher F-statistic suggests a larger difference between the group means relative to the variability within each group.

A p-value below a chosen threshold (commonly 0.05) indicates that the observed differences are unlikely to be due to random chance. In this case the p-value is reported as 0.0, so we can reject the null hypothesis: there is a significant difference in sales between different store types.
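
The ratio described above can be reproduced by hand. The following sketch computes the one-way ANOVA F-statistic from the between-group and within-group sums of squares and compares it with `scipy.stats.f_oneway`. The three small groups and the helper name `f_statistic` are made up for illustration.

```python
import numpy as np
from scipy import stats

def f_statistic(groups):
    """One-way ANOVA F: between-group mean square divided by within-group mean square."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = np.concatenate(groups).mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Toy groups, purely illustrative.
groups = [[5, 6, 7], [8, 9, 10], [4, 5, 6]]
f_manual = f_statistic(groups)
f_scipy, p_value = stats.f_oneway(*groups)
print(f_manual, f_scipy, p_value)  # both F values equal 13 for these groups
```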

### Hypothetical Statement - 3  There is significant difference in sales on different days of the week.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* Null hypothesis (H0) - There is no significant difference in sales on different days of the week.

* Alternative hypothesis (H1) - There is a significant difference in sales on different days of the week.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Perform one-way ANOVA test.
f_statistic, p_value = stats.f_oneway(
    df_m2[df_m2['DayOfWeek']== 1]['Sales'],
    df_m2[df_m2['DayOfWeek']== 2]['Sales'],
    df_m2[df_m2['DayOfWeek']== 3]['Sales'],
    df_m2[df_m2['DayOfWeek']== 4]['Sales'],
    df_m2[df_m2['DayOfWeek']== 5]['Sales'],
    df_m2[df_m2['DayOfWeek']== 6]['Sales'],
    df_m2[df_m2['DayOfWeek']== 7]['Sales'])

print('f-statistic:', f_statistic)
print('p_value:', p_value)

# Compare the p_value with the significance level (0.05)
if p_value < 0.05:
    conclusion = "Reject Null hypothesis. There is a significant difference in sales on different days of the week."
else:
    conclusion = "Fail to Reject Null hypothesis. There is no significant difference in sales on different days of the week."

print(conclusion)

##### Which statistical test have you done to obtain P-Value?

ANOVA is a statistical test used to compare the means of two or more groups to determine whether there are significant differences between them.

##### Why did you choose the specific statistical test? Insight from hypothesis test.

The large F-statistic (F = 7451) suggests a large difference between the group means relative to the variability within each group.

The p-value (reported as 0.0) indicates that the observed differences are unlikely to be due to random chance, so we can reject the null hypothesis, i.e. there is a significant difference in sales on different days of the week.
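
When a test has many groups, the per-group samples can be built with `groupby` instead of one filter per category. The sketch below shows the pattern on a toy frame with three days; the data is illustrative and only the column names follow the notebook.

```python
import pandas as pd
from scipy import stats

# Toy data with three "days", purely illustrative.
toy = pd.DataFrame({
    "DayOfWeek": [1, 1, 1, 2, 2, 2, 7, 7, 7],
    "Sales":     [5, 6, 7, 8, 9, 10, 4, 5, 6],
})

# One array of sales per day of the week, in sorted key order.
groups = [sales.to_numpy() for _, sales in toy.groupby("DayOfWeek")["Sales"]]

f_stat, p_value = stats.f_oneway(*groups)
print(len(groups), f_stat, p_value)
```

This keeps the test code unchanged if the number of categories changes, for example when closed days are filtered out beforehand.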

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

#### 1.1 Let's identify the rows with missing values in CompetitionDistance.

In [None]:
df_m2.shape

In [None]:
df_m2[np.isnan(df_m2['CompetitionDistance'])].isna().sum()

* We can assume that rows with a missing CompetitionDistance have no nearby competition. Hence we fill the competition details in those rows with values outside the observed range.
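
A complementary option, shown here only as a sketch, is to record the missingness in an explicit indicator column before filling, so a model can tell real distances from filled ones. The column name `HasCompetition` and the choice of twice the observed maximum as the fill value are assumptions for illustration, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy data, purely illustrative.
toy = pd.DataFrame({"CompetitionDistance": [570.0, np.nan, 14130.0, np.nan]})

# Hypothetical indicator: 1 where a competitor distance was recorded, 0 where it was missing.
toy["HasCompetition"] = toy["CompetitionDistance"].notna().astype(int)

# Fill with a value guaranteed to lie above the observed range (assumed convention).
sentinel = toy["CompetitionDistance"].max() * 2
toy["CompetitionDistance"] = toy["CompetitionDistance"].fillna(sentinel)
print(toy)
```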

In [None]:
# Calculating max value of competition open since date from year & month.
df_temp = df_m2.dropna().copy()
pd.to_datetime(
    df_temp['CompetitionOpenSinceYear'].astype(int).astype(str) + '-' +
    df_temp['CompetitionOpenSinceMonth'].astype(int).astype(str) + '-1'
).max()

In [None]:
# Calculating the max value of competition distance.
df_temp['CompetitionDistance'].max()

In [None]:
# Creating a copy of the original Dataframe.
df_c1 = df_m2.copy()

# Specifying the column to impute.
columns_to_impute = ['CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear']

# Identifying the rows where CompetitionDistance is missing.
condition = df_c1['CompetitionDistance'].isnull()

# Update the specified columns with a value outside the range for the rows satisfying the condition.
df_c1.loc[condition,'CompetitionDistance'] = 4000
df_c1.loc[condition,'CompetitionOpenSinceMonth'] = 12
df_c1.loc[condition, 'CompetitionOpenSinceYear'] = 2016


#### 1.2 Let's impute the remaining missing values with either the mode or the median.

In [None]:
# Selecting the columns to impute.
columns_to_impute_median = ['CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2SinceWeek', 'Promo2SinceYear']
columns_to_impute_mode = ['PromoInterval', 'Promo2']

df_c1_imputed = df_c1.copy()
# Impute the missing values using the mode.
# Assigning the result back avoids in-place fillna on a column selection, which newer pandas versions warn about.
for column in columns_to_impute_mode:
  df_c1_imputed[column] = df_c1_imputed[column].fillna(df_c1_imputed[column].mode()[0])

# Impute the missing values using the median.
for column in columns_to_impute_median:
  df_c1_imputed[column] = df_c1_imputed[column].fillna(df_c1_imputed[column].median())

In [None]:
df_c1_imputed.shape

#### What all missing value imputation techniques have you used and why did you use those techniques?

Here we used mode & median imputation.

1. **Mode Imputation**: Mode imputation is used to fill in missing categorical or nominal data. The mode is the most frequently occurring value in a dataset. When a value is missing, it can be replaced with the mode of that particular feature. This approach assumes that the missing value is likely to be similar to the most common value in the dataset.

2. **Median Imputation**: Median imputation is used to fill in missing numerical or continuous data. The median is the middle value in a sorted list of numbers. When a value is missing, it can be replaced with the median of that particular feature. This approach assumes that the missing value is likely to be similar to the typical or central value in the dataset.
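
The two techniques can be seen side by side on a tiny frame. This is a minimal sketch with made-up values; only the column names mirror the notebook.

```python
import numpy as np
import pandas as pd

# Toy data, purely illustrative: one categorical and one numerical column with a gap each.
toy = pd.DataFrame({
    "PromoInterval": ["Jan,Apr,Jul,Oct", None, "Jan,Apr,Jul,Oct", "Feb,May,Aug,Nov"],
    "Promo2SinceWeek": [14.0, np.nan, 40.0, 22.0],
})

# Mode imputation for the categorical column (most frequent value).
toy["PromoInterval"] = toy["PromoInterval"].fillna(toy["PromoInterval"].mode()[0])

# Median imputation for the numerical column (middle of 14, 22, 40).
toy["Promo2SinceWeek"] = toy["Promo2SinceWeek"].fillna(toy["Promo2SinceWeek"].median())
print(toy)
```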

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
def cap_outliers(df):
  # Create a copy of the dataframe to avoid modifying the original data.
  cleaned_df = df.copy()

  # Iterate over numerical columns.
  for column in cleaned_df.select_dtypes(include = np.number):
    # Calculate the 99th percentile value.
    upper_cap_value = cleaned_df[column].quantile(0.99)

    # Capping lower outliers seems to decrease the model score, hence it is commented out.
    # lower_cap_value = cleaned_df[column].quantile(0.05)

    # Cap outliers to the 99th percentile value.
    cleaned_df[column] = np.where(cleaned_df[column] > upper_cap_value, upper_cap_value, cleaned_df[column])
    # cleaned_df[column] = np.where(cleaned_df[column] < lower_cap_value, lower_cap_value, cleaned_df[column])

  return cleaned_df

df_c1_capped = cap_outliers(df_c1_imputed)

##### What all outlier treatment techniques have you used and why did you use those techniques?

* Since we are dealing with retail data, which might have spikes in sales during peak seasons, we want to retain most of the values in the Sales & Customers columns.
* However, outliers in the remaining columns need to be addressed.
* For simplicity, we cap the values in all numerical columns at their respective 99th percentile values.
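
Capping at a percentile can be expressed compactly with `Series.clip`. The sketch below is a minimal illustration on the numbers 1 to 100; the helper name `cap_upper` is an assumption and is not part of the notebook.

```python
import pandas as pd

def cap_upper(series, q=0.99):
    """Replace values above the q-th quantile with the quantile itself."""
    return series.clip(upper=series.quantile(q))

# Toy series 1..100: only the largest value lies above the 99th percentile.
s = pd.Series(range(1, 101), dtype=float)
capped = cap_upper(s)
print(s.max(), capped.max())
```

Values at or below the cap are left untouched, so the bulk of the distribution keeps its original shape.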

### 3. Feature Manipulation & Selection

#### 1. Feature Manipulation

#### 1.1 Creating new feature from Date column.

In [None]:
# Manipulating Features to minimize feature correlation & create new feature.
# Create week, month & year columns from Date.
# Series.dt.week was removed in pandas 2.0; isocalendar().week gives the ISO week number.
df_c1_capped['week_number'] = df_c1_capped['Date'].dt.isocalendar().week.astype(int)
df_c1_capped['month'] = df_c1_capped['Date'].dt.month
df_c1_capped['year'] = df_c1_capped['Date'].dt.year



#### 1.2 Creating new feature IsPromoMonth.

In [None]:
# Create a column to detect whether a promo runs in the current month.
df_c1_capped['IsPromoMonth'] = df_c1_capped.apply(lambda row: 1 if row['Date'].strftime('%b') in row['PromoInterval'] else 0, axis=1)
df_c1_capped['IsPromoMonth'].sample(5)

#### 1.3 Create new feature Av_Sales/Assortment/StoreType/Month & Av_customers/Assortment/StoreType/Month.

In [None]:
# Extract Av_sales/Assortment/StoreType/Month.
describe_1 = df_c1_capped.groupby(['StoreType','Assortment','month'])['Sales'].mean().reset_index()
describe_1.rename(columns={'Sales': 'Av_sales/Assortment/StoreType/Month'}, inplace=True)
describe_1.sample(5)

In [None]:
# Extract Av_customer/Assortment/StoreType/Month.
describe_2 = df_c1_capped.groupby(['StoreType','Assortment','month'])['Customers'].mean().reset_index()
describe_2.rename(columns={'Customers': 'Av_customers/Assortment/StoreType/Month'}, inplace=True)
describe_2.sample(5)

In [None]:
# Merge the above two tables.
describe = describe_1.join(describe_2['Av_customers/Assortment/StoreType/Month'])
describe.dropna(inplace=True)
print(describe.shape)
describe.sample(5)


In [None]:
# Add Av_sales/Assortment/StoreType/Month, Av_customer/Assortment/StoreType/Month information to our dataset.
df_c2_capped = df_c1_capped.merge(describe, on=['StoreType', 'Assortment','month'])
print(df_c2_capped.shape)
df_c2_capped.sample(5)

#### 1.4 Convert CompetitionDistance to binned values.

In [None]:
df_c2_capped['CompetitionDistanceBin'] = pd.cut(df_c2_capped['CompetitionDistance'],6, labels=['1','2','3','4','5','6'])

#### 1.5 Create new features 'CompetitionOpenSinceDay' & 'Promo2SinceDay'.

In [None]:
df_temp = df_c2_capped.copy()

# Create 'CompetitionOpenSinceDay' from year & month values.
df_temp['CompetitionOpenSinceDay'] = (pd.to_datetime(
    df_temp['CompetitionOpenSinceYear'].astype(int).astype(str) + '-' +
    df_temp['CompetitionOpenSinceMonth'].astype(int).astype(str) + '-1'
)-df_temp['Date']).dt.days

##---------------------------------------------------------------------------------------##

# Handling Outliers.
upper_cap = df_temp['CompetitionOpenSinceDay'].quantile(0.95)
lower_cap = df_temp['CompetitionOpenSinceDay'].quantile(0.05)

df_temp2 = df_temp.copy()
# Replace outliers with the capped values. Both bounds are applied in one step so that
# the lower cap does not overwrite the result of the upper cap.
df_temp2['CompetitionOpenSinceDay'] = df_temp['CompetitionOpenSinceDay'].clip(lower=lower_cap, upper=upper_cap)

##--------------------------------------------------------------------------------------------##

df_temp2['Promo2SinceWeek'] = df_temp2['Promo2SinceWeek'].astype(int)
df_temp2['Promo2SinceYear'] = df_temp2['Promo2SinceYear'].astype(int)

# Create Promo2SinceDay from year & week values.
df_temp2['Promo2SinceDay'] = (pd.to_datetime(df_temp2['Promo2SinceYear'].astype(str),format='%Y') + \
                              pd.to_timedelta(df_temp2['Promo2SinceWeek'] * 7, unit='days')-df_temp2['Date']).dt.days

# Check the shape of the new dataframe.
print(df_c2_capped.shape)
print(df_temp2.shape)

# Copy the changed dataframe to df_c3_capped.
df_c3_capped = df_temp2.copy()

### 4. Categorical Encoding

In [None]:
# Separate features & target.
cols_to_drop = ['Sales', 'Date', 'Store']
target_col = ['Sales']
X = df_c3_capped.drop(cols_to_drop, axis=1)
y = df_c3_capped[target_col]

# Perform one-hot encoding for features.
X_encoded = pd.get_dummies(X)

In [None]:
X_encoded.info()

#### What all categorical encoding techniques have you used & why did you use those techniques?

Here we used pd.get_dummies(), a function provided by the pandas library in Python. It is used for one-hot encoding categorical variables or features. One-hot encoding is the process of transforming a categorical variable into binary vectors, making it suitable for machine learning algorithms.
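
As a minimal sketch of what one-hot encoding produces, the toy frame below encodes a `StoreType` column with three observed categories. The values are illustrative; `dtype=int` is passed only so the dummies print as 0/1.

```python
import pandas as pd

# Toy data, purely illustrative.
toy = pd.DataFrame({"StoreType": ["a", "b", "a", "d"], "Promo": [1, 0, 1, 0]})

# Each category becomes its own 0/1 column; numeric columns are passed through unchanged.
encoded = pd.get_dummies(toy, columns=["StoreType"], dtype=int)
print(encoded)
```

Note that only categories present in the data get a column, so a category missing from one subset (here `c`) produces no column for that subset.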

### 5. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Taging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

In [None]:
# Manipulate Features to minimize feature correlation and create new features

### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

def calculate_vif(df):
  # select numerical columns.
  numerical_columns = df.select_dtypes(include=['float64', 'int64'])

  # Create a dataframe to store VIF result.
  vif_data = pd.DataFrame()
  vif_data['Variable'] = numerical_columns.columns
  vif_data['VIF'] = [vif(numerical_columns.values, i) for i in range(numerical_columns.shape[1])]

  return vif_data.sort_values(by='VIF', ascending=False)

calculate_vif(X_encoded.drop(['year', 'Promo2SinceYear', 'Av_sales/Assortment/StoreType/Month', 'Customers', 'CompetitionOpenSinceYear', 'month'], axis=1))

In [None]:
# Select top 25 features based on mutual_info_regression.
from sklearn.feature_selection import SelectKBest, mutual_info_regression

selector = SelectKBest(mutual_info_regression, k=25)
X = X_encoded.drop(['year', 'Promo2SinceYear', 'Av_sales/Assortment/StoreType/Month', 'Customers', 'CompetitionOpenSinceYear', 'month'], axis=1).copy()
# y is a single-column DataFrame; flatten it to the 1-D array scikit-learn expects.
X_new = selector.fit_transform(X, y.values.ravel())
selected_features = X.columns[selector.get_support()].to_list()

print('Selected features:', selected_features)

##### What all feature selection methods have you used  and why?

1. **Variance Inflation Factor** (VIF) is a statistical measure of the severity of multicollinearity in regression analysis.
Multicollinearity occurs when independent variables in a multiple regression model are correlated with each other, which can adversely affect the regression results.

2. **SelectKBest** is a feature selection method provided by the sklearn.feature_selection module of the scikit-learn library. It selects the top k features from a given dataset based on a specific scoring function. One of the scoring functions available in scikit-learn is mutual information regression.

* Mutual information regression: This technique measures the dependency or information gain between two variables. It is commonly used in feature selection for regression tasks. Mutual information captures both linear & non-linear relationships between variables.
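
To show what a VIF value means, the sketch below computes it from its definition, VIF = 1 / (1 - R²), where R² comes from regressing one feature on all the others. It uses plain NumPy on synthetic data; the function name `vif_manual` is an assumption, and the intercept added to each auxiliary regression is a choice made for this sketch.

```python
import numpy as np

def vif_manual(X):
    """Return the VIF of every column of X, with an intercept in each auxiliary regression."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for i in range(p):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n), others])  # intercept + remaining features
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2 = 1.0 - (residuals ** 2).sum() / ((target - target.mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic features: c is almost a copy of a, b is independent of both.
rng = np.random.default_rng(42)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif_manual(np.column_stack([a, b, c]))
print(vifs)  # a and c get large VIFs, b stays close to 1
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity, which is why near-duplicate features such as year-based columns are candidates for removal.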

##### Which all features you found important and why?

* From the EDA we can see that sales vary greatly with Promo.
* CompetitionDistance, StoreType & Assortment also have some influence on sales.
* We are not using the Customers column, since it won't be available beforehand and our objective is to forecast sales.

For training the model we can use the features selected using SelectKBest.

### 6. Data Splitting

We'll select a test set from our dataset that is 6 weeks long for evaluating our model on unseen data.

In [None]:
# We'll select the test set from our dataset that is 6 weeks long for evaluating our model on unseen data.
# The boolean mask is built from df_c3_capped itself so that the mask and the frame always share the same index.
test_start = pd.to_datetime('2015-07-31') - pd.to_timedelta(42, unit='d')
df_c1_train_val = df_c3_capped[df_c3_capped['Date'] < test_start]
df_c1_test = df_c3_capped[df_c3_capped['Date'] >= test_start]

# We are dropping the Date & Store columns since they contain a lot of classes and might cause overfitting.
cols_to_drop = ['Sales', 'Date', 'Store']
target_col = ['Sales']

# Select feature & target for training & validation.
X_train_val = df_c1_train_val.drop(cols_to_drop, axis=1)
y_train_val = df_c1_train_val[target_col]

# Select feature & target for testing.
X_test = df_c1_test.drop(cols_to_drop, axis=1)
y_test = df_c1_test[target_col]

In [None]:
# Calculate the percentage of the test set.
test_size_perc = len(df_c1_test)/ len(df_c3_capped)*100
print(f'allocated percentage for test set: {test_size_perc:.2f}%')

In [None]:
# Selecting the features
X_train_val_selected = pd.get_dummies(X_train_val)[selected_features]
# reindex guards against a dummy column that is absent from the test period; such columns are filled with 0.
X_test = pd.get_dummies(X_test).reindex(columns=selected_features, fill_value=0)

# Split data into train & validation
X_train, X_val, y_train, y_val = train_test_split(X_train_val_selected, y_train_val, test_size=0.2, random_state=42)

In [None]:
print(X_train_val_selected.shape)
print(X_test.shape)

##### What data splitting ratio have you used and why?

* The final 6 weeks of data are held out as the test set. Since the remaining dataset is sufficiently large, we use 80% of it for training & 20% for validation.
* Allocating 20% of the data for validation provides a separate set of examples to evaluate the model's performance during model selection.
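
The chronological hold-out used for the test set can be sketched as a small helper. The function name `split_last_n_days` and the convention of keeping strictly the last 42 calendar days are assumptions for illustration; the toy frame has one row per day.

```python
import pandas as pd

def split_last_n_days(df, date_col="Date", n_days=42):
    """Split df so that the final n_days of dates form the test set."""
    cutoff = df[date_col].max() - pd.Timedelta(days=n_days)
    train = df[df[date_col] <= cutoff]
    test = df[df[date_col] > cutoff]
    return train, test

# Toy data: 100 consecutive days, purely illustrative.
toy = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=100, freq="D"),
    "Sales": range(100),
})

train, test = split_last_n_days(toy)
print(len(train), len(test))
```

Because every test date is later than every training date, the evaluation mimics forecasting future sales rather than interpolating between known days.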

### 7. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# Apply square root transformation to target variable.
y_train_s = np.sqrt(y_train)
y_val_s = np.sqrt(y_val)
y_test_s = np.sqrt(y_test)

* Here a square root transformation is applied to the target variable because it exhibits a slightly skewed distribution.
* It can reduce the impact of extreme values & make the data more symmetrical.
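
A model trained on the square-root target predicts on the square-root scale, so predictions need to be squared before they are reported in sales units. The sketch below illustrates the round trip with made-up numbers; `pred_sqrt` stands in for a model's output and is an assumption of this example.

```python
import numpy as np

# Toy target values in sales units, purely illustrative.
y = np.array([0.0, 2500.0, 4900.0, 10000.0])

# Forward transform applied before training.
y_sqrt = np.sqrt(y)

# Pretend model output on the square-root scale (hypothetical: each value is off by +1).
pred_sqrt = y_sqrt + 1.0

# Inverse transform: square the predictions to return to sales units.
pred = np.square(pred_sqrt)
print(pred)
```

Note that an error of 1 on the square-root scale corresponds to very different errors in sales units depending on the sales level, so metrics computed on the transformed target are not directly comparable with metrics in original units.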

### 8. Data Scaling

In [None]:
# Scaling your data
scaler = MinMaxScaler()
X_train_s = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_val_s = pd.DataFrame(scaler.transform(X_val), columns=X_val.columns, index=X_val.index)
X_test_s = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

##### Which method have you used to scale your data and why?

Here we have used MinMaxScaler.
**MinMaxScaler** is used to scale numerical features in a dataset to a specific range using each feature's minimum & maximum value.

* The purpose of using MinMaxScaler is to bring all the features to a similar scale, which can be important for certain machine learning algorithms. Scaling the features ensures that no particular feature dominates the learning process due to differences in their scales.
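
Min-max scaling to the default [0, 1] range maps each value x to (x - min) / (max - min), using the minimum and maximum learned from the training data. The sketch below applies that formula by hand with pandas on made-up values, to show what happens when new data falls outside the training range.

```python
import pandas as pd

# Toy data, purely illustrative.
train = pd.DataFrame({"CompetitionDistance": [100.0, 600.0, 1100.0]})
new = pd.DataFrame({"CompetitionDistance": [350.0, 2100.0]})

# Statistics are learned from the training data only.
col_min = train.min()
col_max = train.max()

train_scaled = (train - col_min) / (col_max - col_min)
new_scaled = (new - col_min) / (col_max - col_min)  # reuse the training min and max
print(train_scaled["CompetitionDistance"].tolist())
print(new_scaled["CompetitionDistance"].tolist())
```

Values beyond the training maximum scale to more than 1, which is expected behaviour when the scaler is fitted on the training set and only applied to validation and test data.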

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Since only 25 features were selected, dimensionality reduction is not required.

In [None]:
# DImensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

To find the optimum model for our dataset, we will train multiple models and compare their evaluation metrics at once.

In [None]:
from datetime import time
from time import time as timer

# Define models.
models = [
    ('Linear Regression', LinearRegression()),
    ('Decision Tree Regression', DecisionTreeRegressor(random_state=42)),
    ('Random Forest Regression', RandomForestRegressor(random_state=42)),
    ('Gradient Boosting Regression', GradientBoostingRegressor(random_state=42)),
    ('XGBoosting Regression', XGBRegressor(random_state=42))
]

def Train_evaluate(models, X_train, y_train, X_val, y_val):
   """
   Train and evaluate regression models using different evaluation metrics.

   Args:
       models (list): List of tuples containing the model names & model instances.
       X_train (array-like) : Training data features.
       y_train (array-like) : Training data targets.
       X_val (array-like) : Validation data features.
       y_val (array-like) : Validation data targets.

   Returns:
       Tuple (model, metrics_df): Trained model and metrics dataframe.
   """
   # Initialize lists to store the metrics
   metrics = []

   # Const numbers to calculate adjusted R-squared
   p_train = X_train.shape[1]
   n_train = y_train.shape[0]

   p_val = X_val.shape[1]
   n_val = y_val.shape[0]

   # Train & evaluate each model.
   for name, model in models:
      # fit & predict using the selected model
      start_train_time = timer()
      model.fit(X_train, y_train)
      end_train_time = timer()
      training_time = end_train_time - start_train_time

      start_pred_time = timer()
      y_pred_train = model.predict(X_train)
      y_pred_val = model.predict(X_val)
      end_pred_time = timer()
      prediction_time = end_pred_time - start_pred_time

      # Evaluation metrics for train data.
      rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
      mae_train = mean_absolute_error(y_train, y_pred_train)
      r2_train = r2_score(y_train, y_pred_train)
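      adjusted_r2_train = 1 - ((1 - r2_train) * (n_train - 1)) / (n_train - p_train - 1)
      # MAE divided by the mean target (a weighted MAPE), not the per-sample mean of |error| / |actual|.
      mape_train = (mae_train / np.mean(y_train)) * 100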

      # Evaluation metrics for val data
      rmse_val = np.sqrt(mean_squared_error(y_val, y_pred_val))
      mae_val = mean_absolute_error(y_val, y_pred_val)
      r2_val = r2_score(y_val, y_pred_val)
      adjusted_r2_val = 1- ((1 - r2_val) * (n_val - 1)) / (n_val - p_val - 1)
      mape_val = (mae_val / np.mean(y_val)) * 100


      # Append evaluation metrics to metrics list
      metrics.append([
          name, rmse_train, mae_train, r2_train, adjusted_r2_train, mape_train[0],
          rmse_val, mae_val, r2_val, adjusted_r2_val, mape_val[0], training_time, prediction_time
      ])

   # Create a dataframe for metrics
   metrics_df = pd.DataFrame(
   metrics,
    columns=['Model', 'RMSE Train', 'MAE Train', 'R-squared Train', 'Adjusted R-squared Train', 'MAPE Train',
             'RMSE val', 'MAE val', 'R-squared val', 'Adjusted R-squared val', 'MAPE val', 'Training Time', 'Prediction Time']
   )

   # Format numeric columns
   metrics_df = metrics_df.round(2)

   return metrics_df, models

In [None]:
metrics_history,_ = Train_evaluate(models, X_train_s, y_train_s, X_val_s, y_val_s)
metrics_history

Here, linear regression explains only 25% of the variance in the target variable. Since it fails to capture the complexity of the data, we use the remaining models in the following analysis.

### ML Model - 1 Decision Tree.

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

* A decision tree is a non-parametric supervised learning algorithm used for both classification and regression tasks. It has a hierarchical, tree-like structure consisting of a root node, branches, internal nodes & leaf nodes.

* In a decision tree, each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label or, for regression, a predicted value. The paths from the root to the leaf nodes represent the decision rules.

* Decision trees are commonly used in operations research, decision analysis & machine learning.
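To make the "test on an attribute" idea concrete for regression, here is a minimal sketch of how a single split could be chosen: try every threshold on one feature and keep the one that minimises the summed squared error of the two resulting leaves. The function names and the data are illustrative only; scikit-learn's `DecisionTreeRegressor` uses an optimized implementation and repeats this kind of search recursively over all features.

```python
def sse(values):
    """Sum of squared errors around the mean of the values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, left_mean, right_mean) minimising the total SSE."""
    candidates = sorted(set(x))
    if len(candidates) < 2:
        raise ValueError("need at least two distinct feature values to split")

    best = None
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [t for f, t in zip(x, y) if f <= threshold]
        right = [t for f, t in zip(x, y) if f > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    return best[1:]


# Hypothetical data: promo flag (0/1) versus daily sales.
promo = [0, 0, 0, 1, 1, 1]
sales = [4000.0, 4200.0, 3800.0, 7000.0, 7400.0, 6600.0]
threshold, left_mean, right_mean = best_split(promo, sales)
```

Each leaf predicts the mean target of the training rows that reach it, which is why a fully grown tree can reproduce the training data exactly.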

In [None]:
# Visualizing evaluation Metric Score chart
metrics_history[metrics_history['Model'] == 'Decision Tree Regression']

* The Decision Tree Regression model overfits the training data, as indicated by the R-squared score of 1.0 on the training set. It still performs reasonably well on the validation data, with an R-squared score of 0.81.

* Overall, the Decision Tree Regression model shows promising performance and can be improved by tuning its hyperparameters.

#### 2. Cross-Validation & Hyperparameter Tuning

* Here we have used the Optuna library for hyperparameter optimization of all ML models.
* Optuna is a hyperparameter optimization framework designed for machine learning. It lets users apply different state-of-the-art optimization methods to search hyperparameter spaces efficiently.
* By default, Optuna uses a Bayesian optimization algorithm, the Tree-structured Parzen Estimator (TPE). TPE builds a probabilistic model of the relationship between the hyperparameters and the objective function, and then samples hyperparameters based on this model to guide the search towards promising regions of the hyperparameter space.
* In the following code, the objective function defines the quantity to be maximized, which is the R-squared score between the predicted and actual values on the validation set. The trial object is used to sample hyperparameters from a specified search space.
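The objective/trial pattern can be sketched without any library. In the toy example below, `ToyTrial`, `toy_objective` and `maximize` are hypothetical names, the score surface is invented, and plain random search stands in for Optuna's TPE sampler; the sketch only shows how an objective asks a trial for hyperparameters and returns the score to be maximized.

```python
import random


class ToyTrial:
    """Hypothetical stand-in for a trial object: samples integers uniformly."""

    def __init__(self, rng):
        self.rng = rng
        self.params = {}

    def suggest_int(self, name, low, high):
        value = self.rng.randint(low, high)  # both bounds inclusive
        self.params[name] = value
        return value


def toy_objective(trial):
    """Invented score surface peaking at min_samples_leaf=10, min_samples_split=4."""
    leaf = trial.suggest_int("min_samples_leaf", 1, 18)
    split = trial.suggest_int("min_samples_split", 2, 10)
    return 0.86 - 0.001 * (leaf - 10) ** 2 - 0.0005 * (split - 4) ** 2


def maximize(objective, n_trials, seed=0):
    """Random search: evaluate n_trials samples and keep the best one."""
    rng = random.Random(seed)
    best_value, best_params = float("-inf"), None
    for _ in range(n_trials):
        trial = ToyTrial(rng)
        value = objective(trial)
        if value > best_value:
            best_value, best_params = value, trial.params
    return best_value, best_params


best_value, best_params = maximize(toy_objective, n_trials=15)
```

A model-based sampler such as TPE differs from this sketch in that it uses the scores of earlier trials to decide where to sample next, instead of drawing every trial independently.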

In [None]:
def objective(trial):
    params = {
        'max_depth': 55, # Performance increases with depth, but so does the risk of overfitting
        'min_samples_split': trial.suggest_int('min_samples_split', 4, 10), # Initial trial used 2, 10, then narrowed to the current range.
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 9, 18), # Initial trial used 1, 10, then narrowed to the current range.
        'criterion': 'friedman_mse'  # Eliminated 'poisson' and 'absolute_error'
    }

    model = DecisionTreeRegressor(**params)

    model.fit(X_train_s, y_train_s)
    val_r2 = model.score(X_val_s, y_val_s)
    print('Train r2 score: ', model.score(X_train_s, y_train_s))

    return val_r2


study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=15)

print('Best hyperparameters: ', study.best_trial.params)
print('Best r2 Score: ', study.best_value)




| max_depth | min_samples_split | min_samples_leaf | criterion      | R2 Score     |
|----|------------------|-----------------|----------------|--------------|
| 65 | 9                | 10              | friedman_mse   | 0.860358     |
| 55 | 8                | 10              | friedman_mse   | 0.860386     |
| 45 | 4                | 10              | friedman_mse   | 0.860393     |
| 45 | 4                | 6               | friedman_mse   | 0.858035     |
| 35 | 5                | 5               | friedman_mse   | 0.856207     |
| 25 | 3                | 5               | friedman_mse   | 0.845925     |
| 15 | 3                | 4               | friedman_mse   | 0.643338     |

* As the results above show, increasing the depth of the tree improves the score up to a point; however, increasing `max_depth` beyond 45 brings no further gain (the score stays at about 0.860).
* Hence we can conclude that the model converges at the hyperparameters {'max_depth': 45, 'min_samples_split': 4, 'min_samples_leaf': 10, 'criterion': 'friedman_mse'}.
* We can stop the trials here, since increasing the depth of the tree further adds model complexity and might lead to overfitting.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
# Selected hyperparameters for Decision Tree Regression
params = {'max_depth' : 45, 'min_samples_split' : 4, 'min_samples_leaf' : 10, 'criterion' : 'friedman_mse'}

# use the best model to make predictions.
models = [('Decision Tree Regression HPO', DecisionTreeRegressor(**params,random_state=42))]
metrics_df,_ = Train_evaluate(models, X_train, y_train, X_val, y_val)

# Save evaluation metrics to metrics_history
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call.
metrics_history = pd.concat([metrics_history, metrics_df])
metrics_history[(metrics_history['Model'] == 'Decision Tree Regression') |
                (metrics_history['Model'] == 'Decision Tree Regression HPO')]

* HPO has led to better results: the R-squared score on the training set dropped from 1.0 to 0.9, while the R-squared score on the validation set rose from 0.81 to 0.85.
* The model generalizes better to unseen data after hyperparameter tuning.

### ML Model - 2  Random Forest.

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

**sklearn.ensemble.RandomForestRegressor.**

* Random forest is a machine learning algorithm used for classification & regression tasks. It is an ensemble learning method that combines multiple decision trees to make predictions. Each individual tree produces its own prediction; the final prediction is determined by majority voting for classification and by averaging the trees' outputs for regression, as is the case here.

* The random forest algorithm is known for its effectiveness and versatility. It can handle a wide range of data and requires little configuration. It is often used as a black-box model in businesses because it generates reasonable predictions without much tuning.
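The ensemble idea can be sketched in a few lines: train each tree on a bootstrap sample of the rows and average the trees' outputs. The example below is a simplified illustration with one-split trees on a single feature and invented data; it is not how `RandomForestRegressor` is implemented (which also subsamples features and grows deep trees), and all function names are hypothetical.

```python
import random


def fit_stump(x, y):
    """Fit a one-split regression tree on a single feature."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [t for f, t in zip(x, y) if f <= thr]
        right = [t for f, t in zip(x, y) if f > thr]
        l_mean, r_mean = sum(left) / len(left), sum(right) / len(right)
        cost = (sum((t - l_mean) ** 2 for t in left)
                + sum((t - r_mean) ** 2 for t in right))
        if best is None or cost < best[0]:
            best = (cost, thr, l_mean, r_mean)
    if best is None:  # the sample had a single distinct x value: predict the mean
        mean = sum(y) / len(y)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, value):
    thr, l_mean, r_mean = stump
    return l_mean if value <= thr else r_mean


def fit_forest(x, y, n_trees=25, seed=42):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(x)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return forest


def predict_forest(forest, value):
    """Regression forests average the trees' outputs instead of voting."""
    preds = [predict_stump(s, value) for s in forest]
    return sum(preds) / len(preds)


# Invented data: a low-sales group (x <= 4) and a high-sales group (x >= 5).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [10.0, 12.0, 11.0, 13.0, 30.0, 31.0, 29.0, 32.0]
forest = fit_forest(x, y)
pred_low, pred_high = predict_forest(forest, 2), predict_forest(forest, 7)
```

Averaging many trees trained on different bootstrap samples smooths out the quirks of any single tree, which is the main reason a forest tends to overfit less than one deep tree.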

In [None]:
# Visualizing evaluation Metric Score chart
metrics_history[metrics_history['Model'] == 'Random Forest Regression']

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
def objective(trial):
    params = {
        'n_estimators' : 100, # trial.suggest_int('n_estimators', 100, 200),
        'max_depth' : trial.suggest_int('max_depth', 15, 50),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 7),
        'min_samples_leaf' : trial.suggest_int('min_samples_leaf', 5, 10),
        'n_jobs': -1
    }

    model = RandomForestRegressor(**params)
    model.fit(X_train_s, y_train_s)
    val_r2 = model.score(X_val_s, y_val_s)
    print('Train r2 score: ', model.score(X_train_s, y_train_s))

    return val_r2

###--- Commenting out following code because HPO takes too long to run----###

# study = optuna.create_study(direction= 'maximize')
# study.optimize(objective, n_trials=25)

# print('Best hyperparameters: ', study.best_trial.params)
# print('Best r2 score: ', study.best_value)

###-----End of HPO for Random Forest Regression---#

| `n_estimators` | `max_depth` | `min_samples_split` | `min_samples_leaf` | Best R-squared Score |
|----------------|-------------|---------------------|--------------------|----------------------|
| 100            | 43          | 5                   | 5                  | 0.8840932266273452   |
| 50             | 24          | 7                   | 9                  | 0.8786926164161606   |
| 25             | 24          | 7                   |                    |                      |


These hyperparameters represent the best configurations found through the hyperparameter optimization process. The corresponding R-squared scores indicate the goodness of fit of the model using these hyperparameter settings.


##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
# Selected hyperparameters from HPO
params = {'n_estimators': 100, 'max_depth': 43, 'min_samples_split':5, 'min_samples_leaf': 5}

# Use the best model to make predictions
models = [('Random Forest Regression HPO', RandomForestRegressor(**params,random_state=42))]
metrics_df,_ = Train_evaluate(models, X_train, y_train, X_val, y_val)

# Save evaluation metrics to metrics_history
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call.
metrics_history = pd.concat([metrics_history, metrics_df])
metrics_history[(metrics_history['Model'] == 'Random Forest Regression') |
                (metrics_history['Model'] == 'Random Forest Regression HPO')]


* HPO has helped the model reduce overfitting, as indicated by a lower R-squared score of 0.93 on the training set, while the R-squared score on the validation set remains at 0.88.

### ML Model - 3 - Gradient Boosting Regressor.

**Gradient Boosting Machine (GBM)**
* GBM is a machine learning technique that uses an ensemble of weak prediction models, typically decision trees, to create a strong learner.
* GBM trains many models in a gradual, additive & sequential manner, and it is highly customizable to the particular needs of the application.
* GBM involves three elements: a loss function to be optimized, a weak learner to make predictions & an additive model that adds weak learners to minimize the loss function.
* GBM can overfit a training dataset quickly, but regularization methods can improve its performance by reducing overfitting.
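The three elements listed above can be seen in a small from-scratch sketch. It assumes squared-error loss (so the negative gradient is simply the residual), uses one-split trees on a single feature as the weak learner, and runs on invented data; the function names are hypothetical and the code is far simpler than `GradientBoostingRegressor`.

```python
def fit_stump(x, residuals):
    """Weak learner: a one-split regression tree fitted to the residuals."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for f, r in zip(x, residuals) if f <= thr]
        right = [r for f, r in zip(x, residuals) if f > thr]
        l_mean, r_mean = sum(left) / len(left), sum(right) / len(right)
        cost = (sum((r - l_mean) ** 2 for r in left)
                + sum((r - r_mean) ** 2 for r in right))
        if best is None or cost < best[0]:
            best = (cost, thr, l_mean, r_mean)
    return best[1:]


def boost(x, y, n_estimators=50, learning_rate=0.1):
    """Additive model: start from the mean and add shrunken stumps one by one."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        # For squared-error loss the negative gradient is the residual y - F(x).
        residuals = [t - p for t, p in zip(y, preds)]
        thr, l_mean, r_mean = fit_stump(x, residuals)
        stumps.append((thr, l_mean, r_mean))
        preds = [p + learning_rate * (l_mean if f <= thr else r_mean)
                 for f, p in zip(x, preds)]
    return base, stumps, preds


def mse(y, preds):
    """Loss function being minimised: mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y, preds)) / len(y)


# Invented data with a jump between x = 4 and x = 5.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [10.0, 12.0, 11.0, 13.0, 30.0, 31.0, 29.0, 32.0]
base, _, preds_1 = boost(x, y, n_estimators=1)
_, _, preds_50 = boost(x, y, n_estimators=50)
```

The `learning_rate` shrinks each stump's contribution, so more estimators are needed to fit the data but each step is less likely to chase noise; this is one of the regularization levers mentioned above.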

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
metrics_history[metrics_history['Model'] == 'Gradient Boosting Regression']

* The RMSE & MAE indicate that the model is making predictions with an average error of around 2300.
* The R-squared & Adjusted R-squared values for both sets are around 0.4, which means that the model explains around 40% of the variance in the target variable.

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
import optuna
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def objective(trial) :
    params = {
        'n_estimators' : 50, # trial.suggest_int('n_estimators', 50, 400),
        'max_depth' : trial.suggest_int('max_depth', 7, 15),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2, log=True),
        'min_samples_split' : trial.suggest_int('min_samples_split', 2, 5),
        'min_samples_leaf' : trial.suggest_int('min_samples_leaf', 1, 3),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
    }
    model = GradientBoostingRegressor(**params)
    model.fit(X_train_s, y_train_s)

    y_pred = model.predict(X_val_s)
    r2 = r2_score(y_val_s, y_pred)

    return r2

###---Commenting out following code because HPO takes too long to run -----###

# study = optuna.create_study(direction = 'maximize')
# study.optimize(objective, n_trials=10)

# print('Best hyperparameters: ', study.best_trial.params)
# print('Best r2 score: ', study.best_value)

###--- End of HPO for Gradient Boosting Regressor----###

| `n_estimators` | `max_depth` | `learning_rate` | `min_samples_split` | `min_samples_leaf` | `subsample` | Best R-squared Score |
|-----------------|-------------|-----------------|---------------------|--------------------|-------------|----------------------|
| 100             | 12          | 0.12507294020594728 | 5                    | 2                  | 0.7857656224069701 | 0.9099075010286225 |
| 50              | 13          | 0.06375016028751634 | 4                    | 3                  | 0.6921886023174675 | 0.7929251654166077 |
| 25              | 11          | 0.13788691898021682 | 4                    | 2                  | 0.6801601804268422 | 0.7225825814634139 |


* These hyperparameters represent the best configuration found through the hyperparameter optimization process. The corresponding R-Squared scores indicate the goodness of fit of the model using these hyperparameter settings.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
# Selected hyperparameters from HPO
params = {'n_estimators': 100,
          'max_depth': 12,
          'learning_rate': 0.12507294020594728,
          'min_samples_split': 5,
          'min_samples_leaf': 2,
          'subsample': 0.7857656224069701}

# Use the best model to make predictions.
models = [('Gradient Boosting Regression HPO', GradientBoostingRegressor(**params, random_state=42))]
metrics_df,_ = Train_evaluate(models, X_train_s, y_train_s, X_val_s, y_val_s)

# Save evaluation metrics to metrics_history
# (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call.)
metrics_history = pd.concat([metrics_history, metrics_df])

metrics_history[(metrics_history['Model'] == 'Gradient Boosting Regression') |
                (metrics_history['Model'] == 'Gradient Boosting Regression HPO')]

* The model has improved significantly after HPO, as indicated by an increase in the R-squared value and a decrease in the RMSE & MAE values.

* The final R-squared value of 0.92 and RMSE of 4.97 indicate that the model is performing well on the validation set.

### ML Model - 4- XGBRegressor.

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

* XGBoost is a highly efficient and widely used machine learning algorithm based on gradient boosting. It combines weak predictive models, typically decision trees, to create a powerful ensemble model.
* XGBoost incorporates regularization techniques to prevent overfitting and provides feature importance analysis. It handles missing values, supports parallel processing and allows for early stopping to find the optimal number of trees.
* XGBoost is known for its speed, scalability and accuracy, making it suitable for various applications.

In [None]:
# Evaluation metrics from XGBoost Regressor Model.
metrics_history[metrics_history['Model'] == 'XGBoosting Regression']

* The XGBoost Regression model demonstrates strong performance based on the evaluation metrics.
* It has low values for RMSE & MAE, indicating accurate predictions with small errors.
* The R-squared & Adjusted R-squared scores of 0.83 suggest that the model captures a significant portion (83%) of the target variable's variance, indicating a good fit.
* The model's performance is consistent on both the training and validation sets, highlighting its generalization ability.

#### 2. Cross-Validation & Hyperparameter Tuning.

In [None]:
import optuna
import xgboost as xgb
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Enable GPU support
xgb_config = {
    'tree_method': 'gpu_hist',  # use GPU accelerated algorithm
    'gpu_id' : 0  # Specify the GPU device index to use (e.g., 0 for the first GPU)
}

def objective(trial):
    params = {
        'max_depth': 20, # trial.suggest_int('max_depth', 3, 9),
        'learning_rate': 0.11096130067213479, # trial.suggest_float('learning_rate', 0.01, 0.2, log=True),
        'min_child_weight': trial.suggest_int('min_child_weight', 4, 10),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'gamma': trial.suggest_float('gamma', 0.1, 0.4),
    }
    # Update the XGBoost parameters with GPU configuration
    params.update(xgb_config)

    model = xgb.XGBRegressor(**params)
    model.fit(X_train_s, y_train_s)

    y_pred = model.predict(X_val_s)
    r2 = r2_score(y_val_s, y_pred)

    return r2

###----- Commenting out following code because HPO takes too long to run----###

# study = optuna.create_study(direction= 'maximize')
# study.optimize(objective, n_trials=20)

# print('Best hyperparameters:', study.best_trial.params)
# print('Best r2 score: ', study.best_value)

###---End of HPO for XGBoost Regressor-----###

| `max_depth` | `min_child_weight` | `colsample_bytree` | `gamma` | Best R-squared Score |
|-------------|--------------------|--------------------|---------|----------------------|
| 20          | 9                  | 0.6007938923707836 | 0.1919642988536765 | 0.9254781023199998 |
| 9           | 1                  | 0.7100240740796231 | 0.38772768285415593 | 0.8859478511092371 |


* These hyperparameters represent the best configurations found through the hyperparameter optimization process. The corresponding R-squared scores indicate the goodness of fit of the model using these hyperparameter settings.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
# Selected hyperparameters from HPO
XGB_parameters = {
     'max_depth':20,
     'learning_rate': 0.11096130067213479,
     'min_child_weight':9,
     'colsample_bytree': 0.7100240740796231,
    # 'tree_method': 'gpu_hist',  # use GPU accelerated algorithm
}

# Use the best model to make predictions
models = [('XGBoost Regression HPO', XGBRegressor(**XGB_parameters, random_state=42))]
metrics_df, XGB_model = Train_evaluate(models, X_train_s, y_train_s, X_val_s, y_val_s)

# Save evaluation metrics to metrics_history
# (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call.)
metrics_history = pd.concat([metrics_history, metrics_df])

metrics_history[(metrics_history['Model'] == 'XGBoosting Regression') |
                (metrics_history['Model'] == 'XGBoost Regression HPO')]

* After HPO the model has improved significantly, with a training R-squared value of 0.98 and a validation R-squared value of 0.93.

* So far XGBoost has shown the best performance on the validation set, with the lowest RMSE & MAE values and the highest R-squared.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

1. Root Mean Squared Error (RMSE):
   * RMSE is the square root of the average squared difference between predicted and actual values, so large errors are penalized more heavily than small ones.
   * A lower RMSE suggests that the model's predictions are closer to the actual values, which is desirable for businesses relying on accurate predictions.

2. Mean Absolute Error (MAE):
   * MAE measures the average magnitude of error between predicted and actual values.
   * MAE is easier to interpret than RMSE since it is the average absolute difference between predicted and actual values, expressed in the same units as the target.

3. R-squared (R^2) Score:
   * R-squared is a statistical measure that represents the proportion of the variance in the dependent variable (target) explained by the independent variables (features).
   * R-squared is at most 1, where 1 indicates a perfect fit. It typically lies between 0 and 1 but can be negative when a model fits worse than always predicting the mean.
   * Higher R-squared values indicate better model performance in capturing the relationship between features and the target variable.

4. Adjusted R-squared:
   * Adjusted R-squared adjusts the R-squared value by considering the number of features and the sample size.
   * It penalizes models with too many features that may overfit the data.
   * A higher adjusted R-squared value suggests better model performance in capturing the relevant information without overfitting.

5. Mean Absolute Percentage Error (MAPE):
   * MAPE measures the average percentage difference between predicted and actual values.
   * It provides a relative measure of the prediction accuracy.
   * Lower MAPE values indicate better accuracy, as the model's predictions are closer to the actual values.
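The five metrics above can be computed by hand as a cross-check. The sketch below uses invented numbers and a hypothetical helper `regression_metrics`; the adjusted R-squared follows the same formula as `Train_evaluate`, while MAPE uses the per-sample definition described in the list (the notebook's `Train_evaluate` instead divides MAE by the mean target), so the two MAPE figures are not expected to match.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Return RMSE, MAE, R-squared, adjusted R-squared and MAPE (in percent)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n

    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n

    ss_res = sum(e ** 2 for e in errors)                # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation
    r2 = 1 - ss_res / ss_tot

    # Adjusted R-squared penalises the number of features p for sample size n.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

    # Per-sample MAPE; assumes no actual value is zero.
    mape = 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    return {"rmse": rmse, "mae": mae, "r2": r2, "adj_r2": adj_r2, "mape": mape}


# Invented actual and predicted daily sales.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
scores = regression_metrics(actual, predicted, n_features=1)
```

Here every prediction is off by 10, so RMSE and MAE are both 10, R-squared is 0.992 and adjusted R-squared is 0.988; the gap between the last two grows as features are added without improving the fit.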

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

In [None]:
metrics_history

In [None]:
# Model Names.
models = [
     "Decision Tree Regression",
     "Decision Tree Regression HPO",
     "Random Forest Regression",
     "Random Forest Regression HPO",
     "Gradient Boosting Regression",
     "Gradient Boosting Regression HPO",
     "XGBoosting Regression",
     "XGBoost Regression HPO"
]

# R squared values
eval_r_squared = [0.81, 0.85, 0.88, 0.88, 0.40, 0.92, 0.83, 0.93]

# Runtime information.
runtime = [1.13, 0.47, 74.15, 40.15, 1.92, 6.83, 1.67, 37.55]

# Create a data frame from the above data.
data = pd.DataFrame({"Model": models, "Validation R-squared":eval_r_squared, "Runtime":runtime })

# Sort the dataframe by R-squared value.
data.sort_values(by= "Validation R-squared", inplace=True, ascending=False)

# Set the color palette.
colors= ["steelblue" if "HPO" not in model else "salmon" for model in data['Model']]

# Create a 1x2 subplot grid (1 row, 2 columns)
gs = gridspec.GridSpec(1, 2, width_ratios=[3, 1])

plt.figure(figsize=(14, 7))

# Create a horizontal bar plot using seaborn in the first subplot.
ax1 = plt.subplot(gs[0])
sns.barplot(x= "Validation R-squared", y="Model", data=data, palette=colors, ax=ax1)
ax1.set_xlabel("Validation R-squared")
ax1.set_ylabel("Model")
ax1.set_title("Comparison of validation R-squared for different Models")

# Add annotations for the corresponding R-squared values.
for i, v in enumerate(data["Validation R-squared"]):
    ax1.text(v + 0.01, i, str(v), color= 'black', va='center')

# Create the line plot in the second subplot.
ax2 = plt.subplot(gs[1])
sns.lineplot(x="Runtime", y="Model", data=data, sort=False, ax=ax2)
ax2.set_xlabel("Runtime (seconds)")
ax2.set_ylabel("")
ax2.set_title("Model Runtime Comparison")
ax2.yaxis.tick_right()

plt.tight_layout()
plt.show()

Observing the chart, the XGBoost Regression HPO (hyperparameter-optimized) model has the highest validation R-squared of 0.93, indicating the best predictive power. However, it also has a relatively high runtime of 37.55 seconds.

On the other hand, the Gradient Boosting Regression HPO model has a slightly lower validation R-squared (0.92), which is still fairly high, and a significantly shorter runtime (6.83 seconds).

If runtime is a critical factor (say, in a production environment where predictions need to be made quickly), we could pick the Gradient Boosting Regression HPO model, as it strikes a good balance between predictive power & efficiency.

However, since we are seeking the highest possible predictive power, the XGBoost Regression HPO model is the better choice.
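The trade-off can also be expressed as a small selection rule: pick the model with the highest validation R-squared among those that fit a runtime budget. The sketch below reuses the R-squared and runtime values from the comparison above; the helper `pick_model` and the 10-second budget are hypothetical and only illustrate the reasoning.

```python
# (validation R-squared, runtime in seconds) as listed in the comparison above.
results = {
    "Decision Tree Regression": (0.81, 1.13),
    "Decision Tree Regression HPO": (0.85, 0.47),
    "Random Forest Regression": (0.88, 74.15),
    "Random Forest Regression HPO": (0.88, 40.15),
    "Gradient Boosting Regression": (0.40, 1.92),
    "Gradient Boosting Regression HPO": (0.92, 6.83),
    "XGBoosting Regression": (0.83, 1.67),
    "XGBoost Regression HPO": (0.93, 37.55),
}


def pick_model(results, max_runtime=None):
    """Return the model with the highest R-squared within the runtime budget."""
    eligible = {
        name: (r2, runtime)
        for name, (r2, runtime) in results.items()
        if max_runtime is None or runtime <= max_runtime
    }
    return max(eligible, key=lambda name: eligible[name][0])


best_overall = pick_model(results)                      # no runtime constraint
best_under_10s = pick_model(results, max_runtime=10.0)  # hypothetical budget
```

Without a budget the rule selects XGBoost Regression HPO; with the assumed 10-second budget it selects Gradient Boosting Regression HPO, matching the discussion above.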

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

In [None]:
importance = XGB_model[0][1].feature_importances_

# Create a Dataframe with feature names and importance scores.
feature_importances_df = pd.DataFrame({'Feature': X_train_s.columns, 'Importance': importance})

# Sort the Dataframe by importance scores in descending order.
feature_importances_df = feature_importances_df.sort_values(by= 'Importance', ascending=False)

# plot feature importance
plt.figure(figsize=(10, 6))
bars = plt.barh(feature_importances_df['Feature'], feature_importances_df['Importance'] )
plt.xlabel('Feature Importance')
plt.ylabel('Features')
plt.title('XGBoost Feature Importance')

# Adding corresponding values to the bars
for bar in bars:
  plt.text(bar.get_width(), bar.get_y() + bar.get_height()/2,
           '{0:.3f}'.format(bar.get_width()),
           va='center', ha='left')

plt.show()


**Feature importance** gives an idea of how much each feature contributes to the model's predictions. For tree-based models, the scores are derived from how much the splits on a feature improve the training objective. For XGBoost's default "gain" importance, this is the average loss reduction over the splits that use the feature, normalized so that the scores sum to 1.

  * In this case the most important feature for predicting the target variable is Promo, with an importance score of about 0.441.
  * The second and third most important features are StoreType_b and StateHoliday_0, with importance scores of 0.114 and 0.038 respectively.
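As a rough illustration of a gain-style importance score, the sketch below averages the gain of all splits that use a feature and normalizes the averages to sum to 1. The split gains are invented and `gain_importance` is a hypothetical helper; real values come from the fitted model's `feature_importances_` attribute.

```python
from collections import defaultdict

# Hypothetical (feature, gain) pairs, one per split in the ensemble.
splits = [
    ("Promo", 40.0),
    ("Promo", 60.0),
    ("StoreType_b", 30.0),
    ("StateHoliday_0", 20.0),
]


def gain_importance(splits):
    """Average gain per feature, normalized so that the scores sum to 1."""
    gains = defaultdict(list)
    for feature, gain in splits:
        gains[feature].append(gain)
    average = {f: sum(g) / len(g) for f, g in gains.items()}
    total = sum(average.values())
    return {f: a / total for f, a in average.items()}


importance_scores = gain_importance(splits)
```

Because the scores are normalized, they describe each feature's share relative to the others in this model rather than an absolute effect on sales.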

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
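# Save the metrics history and the trained model
import pickle

# Context managers make sure each file is closed after writing.
with open('metrics_history2.pkl', 'wb') as f:
    pickle.dump(metrics_history, f)
with open('XGB_model2.pkl', 'wb') as f:
    pickle.dump(XGB_model, f)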

In [None]:
# Commenting out since this only needs to be done once.
# !cp metrics_history2.pkl '/content/drive/MyDrive/Almabetter/Almabetter Projects/Hotel Booking'
# !cp XGB_model2.pkl '/content/drive/MyDrive/Almabetter/Almabetter Projects/Hotel Booking'

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Commenting out since this only needs to be done once
# Load the File and predict unseen data.
# import pickle
# metrics_history = pickle.load(open('/content/drive/MyDrive/Almabetter/Almabetter Project/Hotel Booking/metrics_history.pkl', 'rb'))
# XGB_model = pickle.load(open('/content/drive/MyDrive/Almabetter/Almabetter Projects/Hotel Booking/XGB_model.pkl','rb'))
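A save-and-reload round trip can be checked in isolation before relying on the stored file. The sketch below is a minimal example under stated assumptions: a plain dictionary (`model_state`, a made-up stand-in) takes the place of the fitted model, and a temporary directory is used instead of the Drive path.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model object (hypothetical): a plain dict of settings.
model_state = {
    "name": "XGBoost Regression HPO",
    "params": {"max_depth": 20, "min_child_weight": 9},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Context managers make sure the file is flushed and closed.
    with open(path, "wb") as f:
        pickle.dump(model_state, f)

    with open(path, "rb") as f:
        restored = pickle.load(f)

# Sanity check: the restored object should equal the one that was saved.
round_trip_ok = restored == model_state
```

For a real model, the equivalent check is to compare the predictions of the original and the restored model on the same rows.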

In [None]:
# Evaluate the tuned XGBoost model on the held-out test set.
test_prediction = XGB_model[0][1].predict(X_test_s)
r2_test = r2_score(y_test_s, test_prediction)
rmse = np.sqrt(mean_squared_error(y_test_s, test_prediction))
mae = mean_absolute_error(y_test_s, test_prediction)
print('Test r2 score: ', r2_test)
print('Test rmse: ',rmse)
print('Test mae: ', mae)

In [None]:
# Predicted vs. actual values; the dashed diagonal marks perfect predictions.
plt.plot([y_test_s.min(), y_test_s.max()], [y_test_s.min(), y_test_s.max()], 'k--', lw=4)
plt.scatter(y_test_s, test_prediction)

In [None]:
import matplotlib.pyplot as plt

# Merge the predicted and actual values with the corresponding dates.
merged_data = X_test_s.join(df_m2['Date'])
merged_data['Actual'] = y_test_s
merged_data['Predicted'] = test_prediction

#  Calculate the daily mean values.
daily_mean_actual = merged_data.groupby('Date')['Actual'].mean()
daily_mean_predicted = merged_data.groupby('Date')['Predicted'].mean()

# Set the figure size
plt.figure(figsize=(12, 6))

# Plot the daily mean values.
plt.plot(daily_mean_actual.index, daily_mean_actual, label='Actual')
plt.plot(daily_mean_predicted.index, daily_mean_predicted, label='Predicted')

# Rotate the x-axis labels for better readability
plt.xticks(rotation=90)

# Add labels and title
plt.xlabel('Date')
plt.ylabel('Sales')
plt.title('Prediction on Unseen Data \n 6 - Week Forecast of Average Daily Sales VS Actual Average Daily Sales')

# Add a legend
plt.legend()

plt.show()




# **Conclusion**

These are the key findings from the EDA.

  1. Seasonal Variation: The business experiences seasonal variation in sales, with higher sales during the holiday season. To mitigate negative growth during non-peak seasons, strategies should be explored to stimulate demand and attract customers during off-peak periods.

  2. Competition: The presence of competition impacts sales, with areas having higher competition generating higher sales. Businesses should consider competition when planning store locations and marketing strategies.

  3. Promotions: Promotional periods generally lead to higher sales. Businesses should carefully plan and optimize promotional activities to maximize sales impact.

  4. Store Type and Assortment: Store type 'b' and assortment type 'c' consistently performed better in terms of sales. Businesses could focus on leveraging the strengths of these types while considering strategies for improving the performance of other types.

  5. Sales Volume: The overall sales volume showed a positive trend, indicating a growing number of sales transactions. Businesses should monitor and maintain this growth to sustain revenue generation.

In the next part, we trained and evaluated several regression models, including Decision Tree Regression, Random Forest Regression, Gradient Boosting Regression and XGBoost Regression. We also performed hyperparameter optimization for these models.

These are the observations after training and testing the above models.

  1. In the initial training of the models, the Random Forest regressor obtained the highest R-squared value of 0.88 on the validation set.
  2. After tuning the hyperparameters, the XGBoost model achieved the highest R-squared value of 0.93 on the validation set, indicating that it explains a significant portion of the variance in the target variable.
  3. The XGBoost model also obtained an R-squared value of 0.88 on the test data.
  4. Based on these observations, we conclude that XGBoost Regression is the best of the evaluated models for predicting Rossmann's sales.
