<a href="https://colab.research.google.com/github/Krishanu-Saha/ROSSMAN-RETAIL-STORE-REGRESSION-ANALYSIS/blob/main/Rossman_retail_store_regression_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - ROSSMAN RETAIL STORE REGRESSION ANALYSIS



##### **Project Type**    - Regression
##### **Contribution**    - BY KRISHANU SAHA, Individual


# **Project Summary -**

The Rossmann retail store regression analysis is a machine learning project that aims to predict the daily sales of different retail stores using various features such as promotions, holidays, and store information. The project uses several machine learning algorithms, including Ridge regression, Lasso regression, Random Forest, and Gradient Boosting Regressor, to identify the best model for the prediction task. The final model is then used to generate sales predictions for a test dataset, which are compared to the actual sales figures to assess the model's performance.

The dataset used in this project contains information about 1,115 Rossmann stores across different countries, including their daily sales figures, promotional activities, holidays, and store information. The dataset has a total of 1,017,209 observations, with 18 features that describe each store's daily sales.

The first step of the analysis is to perform exploratory data analysis (EDA) to gain an understanding of the data and identify any patterns or trends. The EDA involves examining the distribution of the target variable (i.e., daily sales) and exploring the relationships between the target variable and the different features. It also involves visualizing the data using various plots and charts to identify any outliers, missing values, or other anomalies.

After completing the EDA, the project moves on to feature selection and engineering. This involves identifying the most important features that have the greatest impact on daily sales and creating new features by combining or transforming the existing ones. The feature engineering process also involves data cleaning and preprocessing, such as handling missing values and converting categorical features into numerical ones using one-hot encoding.

Once the data is cleaned and preprocessed, the next step is to train and evaluate several machine learning models using different algorithms. The models are evaluated using several metrics, including mean squared error (MSE), mean absolute error (MAE), and R-squared (R2) score. Ridge regression and Lasso regression are used as baseline models, and Random Forest and Gradient Boosting Regressor are used as more complex models to compare their performance.
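The three metrics named above can be written out directly. The following is a minimal NumPy sketch, not the notebook's own evaluation code; the function name `regression_metrics` and the sales figures are hypothetical and serve only to illustrate the formulas.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, MAE and R^2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)            # mean squared error
    mae = np.mean(np.abs(residuals))         # mean absolute error
    ss_res = np.sum(residuals ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot               # share of variance explained
    return {"mse": mse, "mae": mae, "r2": r2}

# Hypothetical sales figures, for illustration only
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

MSE penalizes large errors more heavily than MAE, while R2 compares the model against always predicting the mean, which is why the three are usually reported together.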

After comparing the performance of different models, the Gradient Boosting Regressor is found to be the best model for the prediction task, with an R2 score of 0.93 on the test dataset. The model is then used to generate sales predictions for the test dataset and compared to the actual sales figures.

Finally, the project uses the SHAP (SHapley Additive exPlanations) explainability tool to interpret the Gradient Boosting Regressor model and understand the feature importance. SHAP values provide an estimate of how much each feature contributes to the model's output, and they help to identify the most important features that have the greatest impact on daily sales. The SHAP summary plot shows the most important features ranked by their impact on the model's output, providing insights into the key drivers of daily sales.

In conclusion, the Rossmann retail store regression analysis is a machine learning project that demonstrates the application of various algorithms, including Ridge regression, Lasso regression, Random Forest, and Gradient Boosting Regressor, to predict the daily sales of retail stores. The project uses a comprehensive approach that involves exploratory data analysis, feature selection and engineering, model training and evaluation, and explainability analysis using SHAP. The project results in a robust machine learning model that provides accurate sales predictions and identifies the most important features that impact daily sales.

# **GitHub Link -**

[Project notebook on GitHub](https://github.com/Krishanu-Saha/data-science/blob/main/Rossman_retail_store_regression_analysis.ipynb)

# **Problem Statement**


**Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied. We are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set.**

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Import Libraries
from datetime import datetime

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import zscore
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from statsmodels.graphics.regressionplots import influence_plot, plot_leverage_resid2, plot_regress_exog
from statsmodels.stats.outliers_influence import variance_inflation_factor




### Dataset Loading

In [None]:
# Load Dataset
sales_df = pd.read_csv('/content/drive/MyDrive/Almabetter /project/REGRESSION/Rossmann Stores Data.csv')
stores_df = pd.read_csv('/content/drive/MyDrive/Almabetter /project/REGRESSION/store.csv')

### Dataset First View

In [None]:
# Dataset First Look
sales_df.head()

In [None]:
stores_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
sales_df.shape

In [None]:
stores_df.shape

The sales dataset contains 1,017,209 rows and 9 columns, whereas the stores dataset contains 1,115 rows and 10 columns.

### Dataset Information

In [None]:
sales_df.info()

In [None]:
stores_df.info()

#### Duplicate Values

In [None]:
#Number of duplicated data
len(stores_df[stores_df.duplicated()])

In [None]:
len(sales_df[sales_df.duplicated()])

We have zero duplicate rows. Well, that's a good sign!

#### Missing Values/Null Values

In [None]:
#Checking null values for every column
stores_df.isnull().sum()

In [None]:
sales_df.isnull().sum()

### Handling Missing and Null Values


In [None]:
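# filling competition distance with the median value
# (assign the result back; inplace=True on a column selection is unreliable in recent pandas)
stores_df['CompetitionDistance'] = stores_df['CompetitionDistance'].fillna(stores_df['CompetitionDistance'].median())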


In [None]:
# filling competition open since month and year with the most frequent values (modes) of those columns
stores_df['CompetitionOpenSinceMonth'] = stores_df['CompetitionOpenSinceMonth'].fillna(stores_df['CompetitionOpenSinceMonth'].mode()[0])
stores_df['CompetitionOpenSinceYear'] = stores_df['CompetitionOpenSinceYear'].fillna(stores_df['CompetitionOpenSinceYear'].mode()[0])

In [None]:
# imputing the nan values of promo2 related columns with 0
stores_df['Promo2SinceWeek'] = stores_df['Promo2SinceWeek'].fillna(0)
stores_df['Promo2SinceYear'] = stores_df['Promo2SinceYear'].fillna(0)
stores_df['PromoInterval'] = stores_df['PromoInterval'].fillna(0)


In [None]:
#merge the datasets on stores data
df = sales_df.merge(right=stores_df, on="Store", how="left")
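A left merge on `Store` should leave the number of sales rows unchanged, because each store appears once in the store table. The sketch below shows one way to check this on toy data; the frames `toy_sales` and `toy_stores` and their values are made up, and `validate="many_to_one"` is an optional safeguard rather than something the notebook uses.

```python
import pandas as pd

# Toy stand-ins for sales_df and stores_df (values are made up)
toy_sales = pd.DataFrame({"Store": [1, 1, 2, 3], "Sales": [5263, 5020, 6064, 8314]})
toy_stores = pd.DataFrame({"Store": [1, 2, 3], "StoreType": ["c", "a", "a"]})

# validate="many_to_one" raises MergeError if a Store id is duplicated in toy_stores
merged = toy_sales.merge(toy_stores, on="Store", how="left", validate="many_to_one")

# A left join on a unique key keeps the row count of the left frame
print(len(toy_sales), len(merged))
print(merged)
```

If the store table ever contained a duplicated `Store` id, the merge would silently multiply sales rows without this check.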


In [None]:
#first five rows of the merged dataset
df.head()


In [None]:

#shape of the dataframe
df.shape

In [None]:
#datatypes
df.info()

We need to change the datatypes of certain columns: Date and StateHoliday.

In [None]:
df['StateHoliday'].unique()

We have to convert values to zero or one appropriately

### Feature engineering

In [None]:
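#change into int type: '0' -> 0 and holiday codes a/b/c -> 1
#(astype(str) guards against a mix of the string '0' and the integer 0)
df['StateHoliday'] = df['StateHoliday'].astype(str).map({'0':0,'a':1,'b':1,'c':1})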

In [None]:
#Converting Date column into datetime datatype.
df['Date'] = pd.to_datetime(df['Date'])

In [None]:

#creating features from the date
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
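# dt.weekofyear was removed in pandas 2.0; isocalendar().week gives the same ISO week number
df['WeekOfYear'] = df['Date'].dt.isocalendar().week.astype(int)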
df['DayOfYear'] = df['Date'].dt.dayofyear
years = df['Year'].unique()
years
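The date-derived features can be checked on a few dates. This is a small sketch with hand-picked dates, assuming a pandas version that provides `Series.dt.isocalendar()` (1.1 or later). It also shows one property of ISO week numbers worth knowing: a date at the very end of December can belong to week 1 of the following ISO year.

```python
import pandas as pd

# Hand-picked dates, for illustration only
dates = pd.to_datetime(pd.Series(["2015-07-31", "2014-12-29", "2013-01-01"]))

features = pd.DataFrame({
    "Year": dates.dt.year,
    "Month": dates.dt.month,
    # ISO week number; cast from the nullable UInt32 dtype to a plain int
    "WeekOfYear": dates.dt.isocalendar().week.astype(int),
    "DayOfYear": dates.dt.dayofyear,
})
print(features)
```

Here 2014-12-29 has `Year` 2014 but `WeekOfYear` 1, so any arithmetic that combines the two columns should expect small inconsistencies around the turn of the year.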

### What did you know about your dataset?

The merged dataset consists of 1,017,209 rows and 22 columns. The target variable of our analysis is the 'Sales' column. The dataset has no duplicate entries, and its missing values have been imputed.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
columns = list(df.columns)
columns

In [None]:
# Dataset Describe
df.describe()

In [None]:
df.info()

### Variables Description

Store: An integer value representing the unique identifier for each store in the dataset.

DayOfWeek: An integer value representing the day of the week (1-7) when the sale was recorded.

Date: A date value representing the date when the sale was recorded.

Sales: A numerical value representing the amount of sales in a given day for a particular store.

Customers: An integer value representing the number of customers who made purchases on a particular day at a particular store.

Open: A binary value (0 or 1) indicating whether a store was open or closed on a given day.

Promo: A binary value (0 or 1) indicating whether a store was running a promotional offer on a given day.

StateHoliday: A categorical variable indicating the type of state holiday (if any) on a given day.

SchoolHoliday: A binary value (0 or 1) indicating whether a school holiday was on a given day.

StoreType: A categorical variable indicating the type of store.

Assortment: A categorical variable indicating the level of assortment (i.e., range of products) offered by a store.

CompetitionDistance: A numerical value representing the distance (in meters) to the nearest competitor store.

CompetitionOpenSinceMonth: An integer value representing the month when the nearest competitor store opened.

CompetitionOpenSinceYear: An integer value representing the year when the nearest competitor store opened.

Promo2: A binary value (0 or 1) indicating whether a store is participating in a continuous promotional offer (i.e., Promo2).

Promo2SinceWeek: An integer value representing the week when the store started participating in the continuous promotional offer (i.e., Promo2).

Promo2SinceYear: An integer value representing the year when the store started participating in the continuous promotional offer (i.e., Promo2).

PromoInterval: A categorical variable indicating the interval of continuous promotional offers, if any.

Year: An integer value representing the year of the recorded sale.

Month: An integer value representing the month of the recorded sale.

WeekOfYear: An integer value representing the week of the year when the sale was recorded.

DayOfYear: An integer value representing the day of the year when the sale was recorded.





### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

From the observation it seems that StateHoliday, StoreType, Assortment and PromoInterval are categorical columns, but we have to investigate the StateHoliday column further.

## 3. ***EDA***

### UNIVARIATE ANALYSIS

In [None]:
# Keep only the rows where the store was open (copy so that later column assignments are safe)
data = df[df['Open'] != 0].copy()

In [None]:
# Calculate the counts of each store type
store_counts = stores_df['StoreType'].value_counts()

# Create a pie chart using the store type counts
plt.figure(figsize = (12,8))
plt.pie(store_counts, labels=store_counts.index, autopct='%1.1f%%')

# Set the title of the plot
plt.title('Store Types')

# Display the plot
plt.show()

**Reason for choosing pie plot:** A pie chart shows each store type as a fraction of the whole, so even a non-technical audience can compare the shares at a glance.

**INSIGHTS :**

Store type a: 54%

Store type b: 1.5%

Store type c: 13.3%

Store type d: 31.2%

In [None]:
# Calculate the counts of each assortment type
assortment_counts = stores_df['Assortment'].value_counts()

# Create a pie chart using the assortment type counts
plt.figure(figsize = (12,8))
plt.pie(assortment_counts, labels=assortment_counts.index, autopct='%1.1f%%')

# Set the title of the plot
plt.title('Assortment Types')

# Display the plot
plt.show()

**Reason for choosing pie plot:** A pie chart shows each assortment type as a fraction of the whole, so even a non-technical audience can compare the shares at a glance.

**INSIGHTS :**

Assortment type a: 53.2%

Assortment type b: 0.8%

Assortment type c: 46.0%


### CUSTOMER ANALYSIS

In [None]:
data.info()

In [None]:


# Create the figure
plt.figure(figsize = (20,20))

# Scatter plot of daily customers against daily sales
plt.scatter(x = df['Customers'],y = df['Sales'])
plt.title('Sales vs. Customers')
plt.xlabel('Customers')
plt.ylabel('Sales')

# Show the plot
plt.show()

**Reason for choosing scatter plot:** A scatter plot visualizes the relationship between two continuous variables, here the number of customers and sales. It shows how the two variables move together and helps to spot correlations, outliers, and underlying trends.

**INSIGHTS :**

Sales rise with the number of customers, as expected: more customers means more sales.

In [None]:
# Create a bar plot with the Promo categories on the x-axis and the mean number of customers on the y-axis
sns.barplot(x = 'Promo', y = 'Customers',data = df)
plt.title('Mean Customers by Promo')
plt.xlabel('Promo')
plt.ylabel('Customers')
plt.show()

**Reason for choosing bar plot:** A bar plot allows a quick comparison of a summary value across categories, here the mean number of customers on days with and without a promotion.

**INSIGHTS :**

Stores have more customers on average when they run a promotion than when they do not.

In [None]:
# Create a bar plot with the StateHoliday categories on the x-axis and the mean number of customers on the y-axis
sns.barplot(x = 'StateHoliday', y = 'Customers',data = df)
plt.title('Mean Customers by StateHoliday')
plt.xlabel('StateHoliday')
plt.ylabel('Customers')
plt.show()

**Reason for choosing bar plot:** A bar plot allows a quick comparison of a summary value across categories, here the mean number of customers on state holidays and on other days.

**INSIGHTS :**

Customers appear more likely to shop on a state holiday.

In [None]:
# Create a bar plot with the SchoolHoliday categories on the x-axis and the mean number of customers on the y-axis
sns.barplot(x = 'SchoolHoliday', y = 'Customers',data = df)
plt.title('Mean Customers by SchoolHoliday')
plt.xlabel('SchoolHoliday')
plt.ylabel('Customers')
plt.show()

**Reason for choosing bar plot:** A bar plot allows a quick comparison of a summary value across categories, here the mean number of customers on school holidays and on school days.

**INSIGHTS :**

The mean number of customers is about the same whether or not it is a school holiday.

In [None]:
# Create a bar plot with the StoreType categories on the x-axis and the mean number of customers on the y-axis
sns.barplot(x = 'StoreType', y = 'Customers',data = df)
plt.title('Mean Customers by StoreType')
plt.xlabel('StoreType')
plt.ylabel('Customers')
plt.show()

**Reason for choosing bar plot:** A bar plot allows a quick comparison of a summary value across categories, here the mean number of customers for each store type.

**INSIGHTS :**

The mean number of customers is highest for StoreType b. This may indicate that this store type is in demand, or it may reflect that, as the pie chart shows, type b stores are few in number.

In [None]:
# Create a bar plot with the Assortment categories on the x-axis and the mean number of customers on the y-axis
sns.barplot(x = 'Assortment', y = 'Customers',data = df)
plt.title('Mean Customers by Assortment')
plt.xlabel('Assortment')
plt.ylabel('Customers')
plt.show()

**Reason for choosing bar plot:** A bar plot allows a quick comparison of a summary value across categories, here the mean number of customers for each assortment type.

**INSIGHTS :**

Stores with assortment type b have the highest mean number of customers, which suggests strong demand for that assortment.

In [None]:
# Define the list of columns to plot
col_count = ['DayOfWeek',  'Promo', 'StateHoliday', 'SchoolHoliday',
             'StoreType', 'Assortment', 'CompetitionOpenSinceMonth',
             'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek',
             'Promo2SinceYear', 'PromoInterval', 'Year', 'Month',
             'WeekOfYear']

# Create bar plots of the mean sales for each value in each column
for col in col_count:
    plt.figure(figsize=(12,6))
    data.groupby(col)['Sales'].mean().plot(kind='bar')
    plt.title('Mean Sales vs. ' + col)
    plt.xlabel(col)
    plt.ylabel('Mean Sales')
    plt.show()

**Reason for choosing bar plot:** A bar plot allows a quick comparison of a summary value across categories. Plotting mean sales for every value of each column makes it easy to see which features are associated with higher or lower sales.

**INSIGHTS :**

**Graph 1 : Mean Sales vs DayOfWeek** Mean sales are highest on day 1 and day 7, and lowest on day 6. This may indicate that people wait for the holiday to go shopping.

**Graph 2 : Mean Sales vs Promo** Mean sales are higher when store owners promote their shops.

**Graph 3 : Mean Sales vs StateHoliday** Mean sales are higher when there is a state holiday.

**Graph 4 : Mean Sales vs SchoolHoliday** There isn't much difference in mean sales with respect to school holidays.

**Graph 5 : Mean Sales vs StoreType** Mean sales are highest in StoreType b, which can translate into higher profit.

**Graph 6 : Mean Sales vs Assortment** Mean sales are higher in stores with assortment type b.

**Graph 7 : Mean Sales vs CompetitionOpenSinceMonth** There isn't much variation in mean sales with CompetitionOpenSinceMonth.

**Graph 8 : Mean Sales vs CompetitionOpenSinceYear** There is a gradual decrease in mean sales as new competition has opened in recent years.

**Graph 9 : Mean Sales vs Promo2** Stores that take part in the continuing promotion (Promo2) show lower mean sales.

**Graph 10 : Mean Sales vs Promo2SinceWeek** There seems to be no pattern between Promo2SinceWeek and mean sales.

**Graph 11 : Mean Sales vs Promo2SinceYear** Mean sales decrease gradually from 2009 to 2013, rise slightly in 2014, and decrease again in 2015.

**Graph 12 : Mean Sales vs PromoInterval** Stores whose promotion interval starts in January (Jan, Apr, Jul, Oct) show higher mean sales.

**Graph 13 : Mean Sales vs Year** Mean sales have increased over the years.

**Graph 14 : Mean Sales vs Month** Mean sales are higher towards the end of the year, mainly in Oct, Nov and Dec.

In [None]:
#Let's check the relationship between store type, assortment levels and sales
#(all three columns come from the same dataframe so that the rows line up)
sns.barplot(data=data, x="StoreType", y="Sales", hue="Assortment")

**Reason for choosing bar plot:** A grouped bar plot allows a quick comparison of mean sales across two categorical variables at once, here store type and assortment level.

From the bar graph above we can observe that assortment type b is only available in store type b.

In store type b, assortment c yields the highest mean sales, so offering it more widely could be profitable.

In [None]:
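# factorplot was renamed to catplot in seaborn; kind='point' reproduces its old default
sns.catplot(data = df, x ="Month", y = "Sales",
               col = 'StoreType' ,
               hue ='Promo',
               row = "Year",
               kind = 'point'
             )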

**A factor plot (seaborn's faceted categorical plot) is chosen** because it draws several categorical plots in a single grid, which makes multi-layered data (month, promotion, store type, and year) much simpler to read.

**INSIGHTS :**

Every type of store shows the same kind of trend throughout the year. Sales generally increase towards the end of the year.

In [None]:
# Convert the Year and Month columns into a single date column
data['Date'] = pd.to_datetime(data['Year'].astype(str) + '-' + data['Month'].astype(str) + '-1')

# Group the data by year and month and compute the total sales for each group
sales_by_year_month = df.groupby(['Year', 'Month'])['Sales'].sum()

# Create a figure and axis for the plot
fig, ax = plt.subplots(figsize=(10, 5))

# Loop through each year and plot the monthly sales as a line on the same axis
for year in sales_by_year_month.index.levels[0]:
    sales_by_month = sales_by_year_month.loc[year]
    ax.plot(sales_by_month.index, sales_by_month.values, label=str(year))

# Add axis labels and a legend to the plot
ax.set_xlabel('Month')
ax.set_ylabel('Total Sales')
ax.set_title('Monthly Sales Over Time by Year')
ax.legend()

# Display the plot
plt.show()

**Reason for choosing line plot:** A line plot is a simple and effective way to follow the trend of sales through the year.

**INSIGHTS :**

Sales can be expected to increase in months 10 to 12.

In [None]:
data.columns


In [None]:
num_columns = [ 'DayOfWeek', 'Sales', 'Customers',  'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Year', 'Month', 'WeekOfYear',  'DayOfYear']

# Create a correlation matrix of the numerical columns
correlation_matrix = data[num_columns].corr()

# Create the heatmap
plt.figure(figsize=(10,5))
sns.heatmap(correlation_matrix, annot=True)

# Set the title of the plot
plt.title('Correlation Heatmap')

# Display the plot
plt.show()


**Reason for choosing heatmap:** A heatmap shows the correlation between the variables in the dataframe visually: bright colours represent high correlation and dark colours represent low or negative correlation.

**INSIGHTS :**

There is a high correlation between customers and sales, as expected: more customers means more sales.
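Reading one column of the correlation matrix is often easier than scanning the full heatmap. The sketch below ranks features by the absolute value of their correlation with `Sales`; the frame `toy` and its values are made up for illustration and are not taken from the results above.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "Sales": [5263, 6064, 8314, 13995, 4822],
    "Customers": [555, 625, 821, 1498, 559],
    "CompetitionDistance": [1270, 570, 14130, 620, 29910],
})

# Pearson correlation of every feature with Sales (Sales itself is dropped)
corr_with_sales = toy.corr()["Sales"].drop("Sales")

# Order by strength of the relationship, keeping the sign in the output
order = corr_with_sales.abs().sort_values(ascending=False).index
ranked = corr_with_sales.reindex(order)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top instead of hiding them at the bottom of the list.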

### HYPOTHESIS TESTING

**1) The effect of promotions on sales: We are testing whether there is a significant difference in sales between days when there is a promotion versus days when there is no promotion.**

Null hypothesis: There is no significant difference in sales between days with and without a promotion.

Alternative hypothesis: Days with a promotion have significantly higher sales than days without a promotion.

In [None]:
df = data.copy()

In [None]:
import scipy.stats as stats

# Filter the data to include only days with promotions and non-promotions
promo_sales = df[df['Promo']==1]['Sales']
nonpromo_sales = df[df['Promo']==0]['Sales']

# Test for difference in means using two-sample t-test
t, p = stats.ttest_ind(promo_sales, nonpromo_sales, equal_var=False)
print('t-value: {:.2f}, p-value: {:.4f}'.format(t, p))

The t-value is 356.64 and the p-value is 0.00. Since the p-value is less than the significance level of 0.05, we reject the null hypothesis.

We conclude that promotions do play a vital role in increasing sales.
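With `equal_var=False`, `ttest_ind` performs Welch's t-test, which does not assume that the two groups have equal variance. The following sketch computes the same statistic by hand with NumPy so that the formula is visible; the helper `welch_t` and the two small samples are hypothetical.

```python
import numpy as np

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    # Squared standard error of each group mean
    se2_a = a.var(ddof=1) / a.size
    se2_b = b.var(ddof=1) / b.size
    t_stat = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (a.size - 1) + se2_b ** 2 / (b.size - 1))
    return t_stat, dof

# Hypothetical samples, for illustration only
promo = [8, 9, 10, 11, 12]
no_promo = [5, 6, 7, 8, 9]
t_stat, dof = welch_t(promo, no_promo)
print(t_stat, dof)
```

Because the alternative hypothesis here is one-sided (promotion days have higher sales), a one-sided test is the closer match; recent SciPy releases (1.6 or later) accept `alternative='greater'` in `ttest_ind` for that purpose.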

**2) The effect of competition on sales: We are testing whether there is a significant correlation between the distance to the nearest competitor and sales.**

Null hypothesis: There is no significant relationship between competition distance and store sales.

Alternative hypothesis: Stores located closer to competitors have significantly lower sales than stores located farther away.

In [None]:
from scipy.stats import pearsonr

# Calculate the Pearson correlation coefficient and p-value between sales and competition distance
corr, p = pearsonr(df['Sales'], df['CompetitionDistance'])
print('Correlation coefficient: {:.2f}, p-value: {:.4f}'.format(corr, p))

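Correlation coefficient: -0.04, p-value: 0.0000. Since the p-value is less than the significance level of 0.05, we reject the null hypothesis of no relationship.

The coefficient is negative, however, so sales fall slightly as the distance to the nearest competitor grows. This is the opposite direction to the alternative hypothesis, so the test does not show that stores closer to competitors have lower sales. The relationship is also very weak: a coefficient of -0.04 means that competition distance explains almost none of the variation in sales.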
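A tiny correlation can still produce a p-value of 0.0000 when the sample is very large, so it helps to look at effect size alongside significance. The sketch below works this out for the reported coefficient; the helper `correlation_summary` is hypothetical, and the sample size `n` is an assumed round number, not the notebook's actual row count.

```python
import math

def correlation_summary(r, n):
    """Return the variance explained and the t statistic for a Pearson r from n pairs."""
    r_squared = r ** 2
    # Standard t statistic for testing r against zero, with n - 2 degrees of freedom
    t_stat = r * math.sqrt((n - 2) / (1 - r_squared))
    return r_squared, t_stat

# r is the coefficient reported above; n is an assumed round sample size
r_squared, t_stat = correlation_summary(r=-0.04, n=800_000)
print(f"variance explained: {r_squared:.4%}, t statistic: {t_stat:.1f}")
```

With hundreds of thousands of rows the t statistic is far from zero even though the coefficient accounts for well under 1% of the variance, which is why statistical significance alone says little about practical importance here.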


**3) The effect of store type on sales: We are testing whether there is a significant difference in sales between different types of stores.**

Null hypothesis: There is no significant difference in sales between different store types.

Alternative hypothesis: Some store types have significantly higher sales than others.

In [None]:
from scipy.stats import f_oneway

# Select the sales of store types a, b and c (store type d is not included in this test)
store_a_sales = df[df['StoreType']=='a']['Sales']
store_b_sales = df[df['StoreType']=='b']['Sales']
store_c_sales = df[df['StoreType']=='c']['Sales']

# Test for difference in means using one-way ANOVA
f, p = f_oneway(store_a_sales, store_b_sales, store_c_sales)
print('F-value: {:.2f}, p-value: {:.4f}'.format(f, p))

F-value: 7723.02, p-value: 0.0000. Since the p-value is less than the significance level of 0.05, we reject the null hypothesis.

The bar graph above also shows that store type b has higher mean sales than the other store types.

### What all manipulations have you done and insights you found?

To start our analysis, we imported two datasets that contain information on sales and stores. We then merged these datasets to create a single, comprehensive dataframe.

Next, we examined the relationship between customers and other variables in the dataframe. Our analysis revealed that there is a positive correlation between the number of customers and sales, and that promotions have a slight effect on increasing customer numbers. We also observed that customers tend to do more shopping on state holidays, and that the impact of school holidays on customer behavior is minimal.

After examining customer behavior, we investigated the relationship between sales and other variables in the dataframe. Our analysis showed that sales tend to be higher when store owners promote their shops and during state holidays. Additionally, we found that StoreType b and assortment type b have the highest mean sales, and that within StoreType b assortment type c sells best. There was also a gradual decrease in mean sales as new competition entered the market.

We conducted three hypothesis tests to determine whether sales were affected by promotions, competition distance, and store type. In all cases, we rejected the null hypothesis, although the correlation between competition distance and sales is very weak (-0.04).

Our findings suggest that stores taking part in the continuing promotion (Promo2) have lower mean sales, and that the Jan, Apr, Jul, Oct promotion interval is associated with higher sales. We also noted that mean sales have increased over the years, with the highest sales occurring in October, November, and December.

In conclusion, our analysis provides valuable insights into the factors that affect sales and customer behavior at Rossmann stores. By understanding these relationships, store owners can make data-driven decisions to improve their profitability and overall success.  

## ***6. Feature Engineering & Data Pre-processing***

### 4. Feature Manipulation & Selection

#### Feature manipulation

In [None]:
# List the columns available for feature engineering
df.columns

In [None]:
df1 = df.copy()

In [None]:
# Convert CompetitionOpenSinceMonth column to integer data type
df1['CompetitionOpenSinceMonth'] = df1['CompetitionOpenSinceMonth'].astype(int)

# Convert CompetitionOpenSinceYear column to integer data type
df1['CompetitionOpenSinceYear'] = df1['CompetitionOpenSinceYear'].astype(int)

In [None]:
#changing promo2 features into meaningful inputs
#Promo2Open: months elapsed since the store joined Promo2
#(0.230137 = 12 * 7 / 365 converts weeks to months)
df1['Promo2Open'] = (df1['Year'] - df1['Promo2SinceYear'])*12 + (df1['WeekOfYear'] - df1['Promo2SinceWeek'])*0.230137

#clipping negative durations to 0 and zeroing out stores that are not in Promo2
df1['Promo2Open'] = df1['Promo2Open'].clip(lower=0)*df1['Promo2']

#creating a feature for promo interval and checking if promo2 was running in the sale month
def promo2running(row):
  month_dict = {1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun', 7:'Jul', 8:'Aug', 9:'Sept', 10:'Oct', 11:'Nov', 12:'Dec'}
  interval = row['PromoInterval']
  #missing intervals were imputed with 0, so only strings can be split
  if not isinstance(interval, str):
    return 0
  return int(month_dict[row['Month']] in interval.split(','))

#Applying row by row; the flag gets its own column so that Promo2Open is not overwritten
df1['Promo2running'] = df1.apply(promo2running,axis=1)*df1['Promo2']

#Dropping unnecessary columns
df1.drop(['Promo2SinceYear','Promo2SinceWeek'],axis=1,inplace=True)
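The check performed by `promo2running` is whether the abbreviation of the sale month appears in the store's `PromoInterval` string. On a frame with hundreds of thousands of rows, a row-wise `apply` is slow, so the sketch below shows an alternative that maps the month once and then loops over two columns. The frame `toy` and the names `month_abbr` and `running` are illustrative assumptions, with the interval strings and the `0` placeholder following the conventions used in this notebook.

```python
import pandas as pd

# Made-up rows; the last store has no Promo2, so its interval is the imputed 0
toy = pd.DataFrame({
    "Month": [1, 2, 9, 9],
    "Promo2": [1, 1, 1, 0],
    "PromoInterval": ["Jan,Apr,Jul,Oct", "Jan,Apr,Jul,Oct", "Mar,Jun,Sept,Dec", 0],
})

month_abbr = {1: "Jan", 2: "Feb", 3: "Mar", 4: "Apr", 5: "May", 6: "Jun",
              7: "Jul", 8: "Aug", 9: "Sept", 10: "Oct", 11: "Nov", 12: "Dec"}

# Translate the month number once for the whole column
month_name = toy["Month"].map(month_abbr)

# 1 if the sale month is listed in the store's promo interval, else 0
running = [
    int(isinstance(interval, str) and month in interval.split(","))
    for month, interval in zip(month_name, toy["PromoInterval"])
]
toy["Promo2running"] = pd.Series(running, index=toy.index) * toy["Promo2"]
print(toy)
```

Splitting on commas before testing membership avoids accidental substring matches, and the `isinstance` check handles the non-string placeholder without relying on a broad `except`.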

In [None]:
# Define the bin edges and labels
bin_edges = [0, 500, 1500, 3000, 5000, np.inf]
bin_labels = ['near', 'medium', 'far', 'very far', 'extreme']

# Create the CompetitionDistanceGroup column
df1['CompetitionDistanceGroup'] = pd.cut(df1['CompetitionDistance'], bins=bin_edges, labels=bin_labels)

# Show the first 5 rows of the new column
print(df1[['CompetitionDistance', 'CompetitionDistanceGroup']].head())
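By default `pd.cut` builds right-closed intervals such as (0, 500], so a value exactly equal to the lowest edge falls outside every bin. The sketch below illustrates this with the same edges and labels; the distances are hypothetical and chosen to sit on or near the bin boundaries.

```python
import numpy as np
import pandas as pd

bin_edges = [0, 500, 1500, 3000, 5000, np.inf]
bin_labels = ['near', 'medium', 'far', 'very far', 'extreme']

# Hypothetical distances placed on and around the bin edges
distances = pd.Series([0, 20, 500, 501, 5000, 75860])

# Default behaviour: intervals are open on the left, closed on the right
groups = pd.cut(distances, bins=bin_edges, labels=bin_labels)

# include_lowest=True also closes the first interval on the left
groups_inclusive = pd.cut(distances, bins=bin_edges, labels=bin_labels,
                          include_lowest=True)

print(pd.DataFrame({"distance": distances,
                    "default": groups,
                    "include_lowest": groups_inclusive}))
```

If a distance of exactly 0 can occur in the data, `include_lowest=True` keeps that row from turning into a missing value before encoding.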

In [None]:
#Create a new column called 'AvgSalesPerCustomer' holding the sales per customer for each store-day row.
#Rows with zero sales and zero customers (for example, closed stores) produce NaN here.
df1['AvgSalesPerCustomer'] = df1['Sales'] / df1['Customers']

In [None]:
# Fill missing values in the AvgSalesPerCustomer column with the mean
# (assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about)
df1['AvgSalesPerCustomer'] = df1['AvgSalesPerCustomer'].fillna(df1['AvgSalesPerCustomer'].mean())

### ENCODING

In [None]:
# Creating variable which stores feature names.
X_features = [ 'Store', 'DayOfWeek', 'Sales', 'Customers', 'Open', 'Promo',
       'StateHoliday', 'SchoolHoliday', 'StoreType', 'Assortment',
        'CompetitionOpenSinceMonth','CompetitionOpenSinceYear', 'Promo2', 'PromoInterval', 'Year', 'Month',
       'WeekOfYear', 'DayOfYear', 'Promo2Open', 'CompetitionDistanceGroup',
       'AvgSalesPerCustomer'
          ]

In [None]:
# Define the categorical features to be one-hot encoded
categorical_features = ['StoreType', 'Assortment', 'PromoInterval', 'CompetitionDistanceGroup']

# Use Pandas get_dummies() function to perform one-hot encoding on the selected categorical features
encoded_df = pd.get_dummies(df1[X_features], columns=categorical_features, drop_first=True)

# The encoded_df DataFrame now has one-hot encoded columns for each of the selected categorical features
encoded_df.head()

#### What all categorical encoding techniques have you used & why did you use those techniques?

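Categorical data must be converted to numerical form before modeling. This project uses **one-hot encoding** (pd.get_dummies with drop_first=True) for StoreType, Assortment, PromoInterval, and CompetitionDistanceGroup. These features are treated as nominal, so one-hot encoding avoids imposing an artificial order on their categories, and dropping the first level of each feature avoids perfectly collinear dummy columns in the linear models.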

One-hot encoding is essentially the representation of categorical variables as binary vectors. These categorical values are first mapped to integer values. Each integer value is then represented as a binary vector that is all 0s (except the index of the integer which is marked as 1).

Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.

In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than hard limitations on the algorithms themselves.

This means that categorical data must be converted to a numerical form. If the categorical variable is an output variable, we may also want to convert predictions by the model back into a categorical form in order to present them or use them in some application.
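As a minimal sketch of the encoding step described above, the toy frame below reuses the StoreType column name from the notebook, but its values are invented for illustration. It shows what drop_first=True does to the resulting columns.

```python
import pandas as pd

# Toy data: the column name mirrors the notebook, the values are made up.
toy = pd.DataFrame({"StoreType": ["a", "b", "c", "a", "d"]})

# drop_first=True drops the first category ('a'), which becomes the baseline:
# a row whose dummy columns are all 0 is a store of type 'a'.
encoded = pd.get_dummies(toy, columns=["StoreType"], drop_first=True, dtype=int)
print(encoded)
```

Four categories therefore yield three dummy columns, which is why the feature list used later contains StoreType_b, StoreType_c, and StoreType_d but no StoreType_a.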

#### 2. Feature Selection

In [None]:
df2 = encoded_df.copy()

In [None]:
#Storing feature names in index variable.
index = ['Store', 'DayOfWeek','Customers', 'Open', 'Promo',
       'StateHoliday', 'SchoolHoliday', 'CompetitionOpenSinceMonth',
       'CompetitionOpenSinceYear', 'Promo2', 'Year', 'Month', 'WeekOfYear',
       'DayOfYear', 'Promo2Open', 'AvgSalesPerCustomer', 'StoreType_b',
       'StoreType_c', 'StoreType_d', 'Assortment_b', 'Assortment_c',
       'PromoInterval_Feb,May,Aug,Nov', 'PromoInterval_Jan,Apr,Jul,Oct',
       'PromoInterval_Mar,Jun,Sept,Dec', 'CompetitionDistanceGroup_medium',
       'CompetitionDistanceGroup_far', 'CompetitionDistanceGroup_very far',
       'CompetitionDistanceGroup_extreme']

In [None]:
# Add a constant term to the feature matrix for the intercept
X = sm.add_constant(df2[index])

# Set the target variable
Y = df2['Sales']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)


In [None]:
# Fit OLS regression model
model_1 = sm.OLS(y_train, X_train).fit()

# Print summary of model results
model_1.summary2()

**INSIGHTS**

As per the table, all variables have p < 0.05 except CompetitionOpenSinceYear, Month, and WeekOfYear, which are statistically insignificant. In other words, the model finds no evidence at the 0.05 significance level that these three variables influence the 'Sales' column.

**HANDLING MULTI-COLLINEARITY**

In [None]:
#Function to compute the VIF of every variable
def get_vif_factors(X):
  #Converting the dataframe to a float matrix (also handles boolean dummy columns)
  X_matrix = X.to_numpy(dtype=float)
  #Using a list comprehension to store the VIF of each variable
  vif = [ variance_inflation_factor(X_matrix,i) for i in range(X.shape[1])]
  #Creating an empty dataframe.
  vif_factors = pd.DataFrame()
  #Storing the column names
  vif_factors['column'] = X.columns
  #Storing the corresponding VIFs
  vif_factors['vif'] = vif

  return vif_factors

In [None]:
#Calling get_vif_factors function
vif_factors = get_vif_factors(df2[index])
vif_factors

**CHECKING CORRELATION OF COLUMNS WITH LARGE VIFs**

In [None]:
#Storing column names which have vif value more than four
columns_with_large_vif =['PromoInterval_Feb,May,Aug,Nov','PromoInterval_Jan,Apr,Jul,Oct','PromoInterval_Mar,Jun,Sept,Dec']

Next, plot a correlation heatmap for the features with a VIF greater than 4.

In [None]:
plt.figure(figsize = (12,10))
sns.heatmap(df2[columns_with_large_vif].corr(),annot = True)
plt.title(" Heatmap depicting correlation between features")

**INSIGHTS**

No multicollinearity detected

In [None]:
#Storing all variables except CompetitionOpenSinceYear, Month and WeekOfYear
index2 = ['Store', 'DayOfWeek', 'Customers', 'Open', 'Promo', 'StateHoliday',
       'SchoolHoliday', 'CompetitionOpenSinceMonth',
       'Promo2', 'Year',
       'DayOfYear', 'Promo2Open', 'AvgSalesPerCustomer', 'StoreType_b',
       'StoreType_c', 'StoreType_d', 'Assortment_b', 'Assortment_c',
       'PromoInterval_Feb,May,Aug,Nov', 'PromoInterval_Jan,Apr,Jul,Oct',
       'PromoInterval_Mar,Jun,Sept,Dec', 'CompetitionDistanceGroup_medium',
       'CompetitionDistanceGroup_far', 'CompetitionDistanceGroup_very far',
       'CompetitionDistanceGroup_extreme']

In [None]:
# Filtering out statistically insignificant variables and creating a new training set.
# Note: index2 does not include the 'const' column, so model_2 is fit without an intercept.
x_train = X_train[index2]

#training the model
model_2 = sm.OLS(y_train,x_train).fit()

#summary
model_2.summary2()

**INSIGHTS**

All variables are statistically significant since p-value is  < 0.05

##### What all feature selection methods have you used  and why?

To improve the accuracy of a machine learning model, it is important to identify any predictor variables that may not have a significant impact on the target variable, which in this case is the 'Sales' column. One way to do this is by examining the OLS model summary and identifying variables with high p-values. Variables with p-values above 0.05 were dropped.

Another important consideration is multicollinearity, which occurs when predictor variables are highly correlated with one another. To check for multicollinearity, we calculated the Variance Inflation Factor (VIF) for each variable. A high VIF value, generally considered to be greater than 4, suggests that a variable may be contributing to multicollinearity. To investigate these variables further, we created a heatmap to visualize their relationships with other variables in the model. This helped us gain a better understanding of which variables may need to be modified or removed to improve model performance.
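To make the VIF calculation concrete, here is a minimal NumPy sketch of the definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The helper name vif and the synthetic data are illustrative assumptions. Unlike this sketch, which adds an intercept to each auxiliary regression, statsmodels' variance_inflation_factor regresses on the columns exactly as given, so its values can differ when the design matrix has no constant column.

```python
import numpy as np

def vif(X):
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    factors = []
    for j in range(n_cols):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)

# Synthetic example: c is almost a copy of a, b is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
vifs = vif(np.column_stack([a, b, c]))
print(vifs.round(2))
```

The two nearly identical columns receive large VIFs, while the independent column stays close to 1, which is the pattern the threshold of 4 is meant to catch.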

##### Which all features you found important and why?

As per the table, all remaining variables have p < 0.05 and are statistically significant. CompetitionOpenSinceYear, Month, and WeekOfYear were statistically insignificant in the first model, meaning there was no evidence at the 0.05 significance level that these three variables influence the 'Sales' column, so they were dropped.

### 5. Data Transformation

**RESIDUAL ANALYSIS**

Test for Normality of Residuals (P-P plot)

In [None]:
def draw_pp_plot(model, title):
    # Create a probability plot object for the model residuals
    probplot = sm.ProbPlot(model.resid)

    # Create the figure and axes explicitly so that ppplot() draws on them
    # instead of opening a second, separate figure
    fig, ax = plt.subplots(figsize=(8, 6))

    # Draw the P-P plot of the model residuals with a 45-degree reference
    # line (points on the line indicate a normal distribution)
    probplot.ppplot(line='45', ax=ax)

    # Set the plot title
    ax.set_title(title)

    # Show the plot
    plt.show()

In [None]:
#Calling draw_pp_plot function
draw_pp_plot(model_2,"Normal P-P Plot of Regression Standardized Residuals")

**Reason for choosing the P-P plot:** The probability-probability (P-P) plot is a graphical method for assessing whether a sample follows a given distribution. It plots the empirical cumulative probabilities of the sample against the cumulative probabilities of a theoretical normal distribution. If the two distributions are similar, the points fall along a straight line at a 45-degree angle.

In a linear regression model, the residuals are the differences between the observed values of the dependent variable and the values predicted by the model. Approximately normal residuals support the normality assumption on which the model's t-tests, p-values, and confidence intervals rely.

**INSIGHTS**

The points on the plot deviate significantly from the 45-degree line, which suggests that the residuals are not normally distributed.

In [None]:
def get_standardized_values(vals):
    # Calculate the mean of the input values
    mean = vals.mean()

    # Calculate the standard deviation of the input values
    std_dev = vals.std()

    # Calculate the standardized values of the input values by
    # subtracting the mean and dividing by the standard deviation
    standardized_vals = (vals - mean) / std_dev

    # Return the standardized values
    return standardized_vals

In [None]:
def plot_resid_fitted(fitted, resid, title):
    # Calculate the standardized predicted values by calling the get_standardized_values() function on the fitted values
    standardized_fitted = get_standardized_values(fitted)

    # Calculate the standardized residuals by calling the get_standardized_values() function on the residual values
    standardized_resid = get_standardized_values(resid)

    # Create a scatter plot of the standardized residuals against the standardized predicted values
    plt.scatter(standardized_fitted, standardized_resid)

    # Set the plot title
    plt.title(title)

    # Set the x-axis label
    plt.xlabel("Standardized predicted values")

    # Set the y-axis label
    plt.ylabel("Standardized residuals values")

    # Show the plot
    plt.show()




Residual Plot for Homoscedasticity and Model Specification

In [None]:
plot_resid_fitted(model_2.fittedvalues,model_2.resid,"Residual Plot")

**Reason for choosing the scatter plot:** The plot created by the plot_resid_fitted() function, a scatter plot of standardized residuals against standardized predicted values, is a useful tool for evaluating the fit of a linear regression model.
Standardized values are plotted so that the two axes are comparable and easier to interpret. The scatter plot helps identify patterns or outliers and assess whether the assumptions of the linear regression model are met. A discernible pattern, such as a funnel shape or a curve, indicates that the assumptions of constant variance or correct model specification may have been violated and that further investigation is needed.

**INSIGHTS**

There is a discernible pattern in the plot, which suggests that the model has not captured all of the relevant information in the data and may not be a good fit.

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

The points on the P-P plot created by draw_pp_plot() deviate significantly from the 45-degree line, and the residual plot created by plot_resid_fitted() shows a discernible pattern. Together, these indicate that the assumptions of the linear regression model have been violated.
Data transformation is a possible solution.

I have decided to use a square root transformation for the target variable. Here are some reasons why it could be a good idea:

Skewed data: If the distribution of the target variable is skewed, taking the square root can help to normalize the distribution, which can make it easier to model using linear regression.

Heteroscedasticity: If the variance of the target variable increases or decreases as its mean value changes, it can violate the assumption of homoscedasticity in linear regression. Taking the square root can help to stabilize the variance and make the data more homoscedastic.

Interpretation: After the transformation, the coefficients describe changes in the square root of sales rather than in sales itself, so predictions must be squared to return to the original scale. This is still straightforward to interpret, although a percentage-change interpretation would require a log transformation instead.

Outliers: If the data contains outliers that have a disproportionate effect on the model, taking the square root can help to reduce the influence of these extreme values.

In conclusion, using a square root transformation for the target variable can be a good idea if it addresses issues with skewed data, heteroscedasticity, interpretation, or outliers.
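The effect on skewness can be checked numerically. The sketch below uses synthetic right-skewed data (a gamma sample standing in for sales, which is an assumption and not the Rossmann data) and a hand-written skewness helper. It also shows that squaring undoes the transformation, which is how predictions from a model trained on the square root scale would be mapped back to sales.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# Synthetic, right-skewed stand-in for a sales column.
rng = np.random.default_rng(42)
sales = rng.gamma(shape=2.0, scale=3000.0, size=10_000)

# The square root compresses large values more than small ones.
sqrt_sales = np.sqrt(sales)
print("skew before:", round(skewness(sales), 2))
print("skew after: ", round(skewness(sqrt_sales), 2))

# Squaring maps values on the square root scale back to the original scale.
restored = sqrt_sales ** 2
```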

In [None]:
# Transform Your data
y_train = np.sqrt(y_train)

In [None]:
#fitting the model
model_3 = sm.OLS(y_train,x_train).fit()


In [None]:
draw_pp_plot(model_3,"Normal P-P Plot of Regression Standardized Residuals")


**INSIGHTS**

The points on the plot no longer deviate significantly from the 45-degree line, which suggests that the residuals are now approximately normally distributed.

In [None]:
#Plotting the distribution of the target variable (histplot replaces the deprecated distplot)
sns.histplot(df2['Sales'], kde=True)

**Reason for choosing the distribution plot:** to check the distribution of the target variable.

**INSIGHTS**

The graph shows that the distribution of the target variable is right-skewed, which supports the decision to transform the data.

In [None]:
#Transforming the data by taking the square root of the variable
df2['Sales'] = np.sqrt(df2['Sales'])

In [None]:
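#Plotting the distribution again after the transformation (histplot replaces the deprecated distplot)
sns.histplot(df2['Sales'], kde=True)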

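**INSIGHTS**: After the square root transformation, the target variable is much closer to a normal distribution. Strictly speaking, linear regression assumes normally distributed residuals rather than a normally distributed target, but reducing the skew of the target usually brings the residuals closer to normal, which supports valid statistical tests and reliable model interpretation.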


## Handling Outliers

### **OUTLIER DETECTION**

In [None]:
df3 = df2.copy()

In [None]:
# Calculate the z-score of Sales column
df3['zscore'] = zscore(df3['Sales'])

# Get the rows where the z-score is greater than 3 or less than -3
outliers = df3[(df3['zscore']>3.0) | (df3['zscore']<-3)]

# Remove the outliers from the original dataframe
df_no_outliers = df3[~((df3['zscore']>3.0) | (df3['zscore']<-3))]

# Remove the z-score column from the cleaned dataframe
df4 = df_no_outliers.drop('zscore', axis=1)

##### Which outlier detection method has been used and why?

In the given code, z-score method has been used for outlier detection. Z-score method helps in identifying the outliers by measuring the deviation of a particular data point from the mean of a group of data points and scaling it by the standard deviation of the group. The rows having a z-score greater than 3 or less than -3 are considered as outliers and are removed from the original dataframe.

Z-score method is one of the commonly used methods for outlier detection, as it is easy to implement and understand. However, it assumes the data to be normally distributed, which may not always be the case in real-world scenarios. Therefore, other outlier detection techniques such as IQR (Interquartile Range) or Local Outlier Factor may also be used, depending on the nature and characteristics of the data.
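The two rules mentioned above can be compared side by side. This is a minimal sketch on invented numbers, not on the Rossmann data. It uses the population standard deviation (ddof=0), which is also the default of scipy.stats.zscore.

```python
import numpy as np

# Invented sample with one obviously extreme value.
values = np.array([10, 12, 11, 13, 12, 11, 10, 12,
                   13, 11, 12, 10, 11, 13, 12, 95], dtype=float)

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z) > 3]

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print("z-score outliers:", z_outliers)
print("IQR outliers:    ", iqr_outliers)
```

Both rules flag the same point here. On strongly skewed data they can disagree, because the mean and standard deviation used by the z-score are themselves pulled toward the extreme values, while the quartiles are not.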

In [None]:
#The outliers rows
outliers

### 6. Data Scaling

In [None]:
#Initializing the StandardScaler
X_scaler = StandardScaler()
#Standardize all the feature columns
X_scaled = X_scaler.fit_transform(df4[index2])

#Standardizing Y explicitly by subtracting the mean and dividing by the standard deviation
Y = (df4['Sales']-df4['Sales'].mean())/df4['Sales'].std()

##### Which method have you used to scale you data and why?

The given code uses two different methods to standardize the data:

StandardScaler: This method is used to standardize the feature columns in df4[index2]. StandardScaler scales each feature column so that it has a mean of 0 and a standard deviation of 1. This method is commonly used for standardizing features in machine learning models, as it helps to ensure that all features are on a similar scale and avoids bias towards features with larger values.

Explicit scaling: This method is used to standardize the target variable Sales. It subtracts the mean of the Sales column from each value and divides by the standard deviation. Writing the formula out by hand is convenient for a single column, because StandardScaler expects a two-dimensional array of features.
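One detail worth knowing about the two approaches is that they do not divide by exactly the same quantity: StandardScaler uses the population standard deviation (ddof=0), whereas pandas' Series.std() defaults to the sample standard deviation (ddof=1). The sketch below illustrates the difference on a tiny made-up series. With roughly a million rows, as in this dataset, the gap is negligible.

```python
import numpy as np
import pandas as pd

# Tiny made-up column so that the difference is visible.
sales = pd.Series([4.0, 8.0, 6.0, 2.0, 10.0])

# What StandardScaler computes: divide by the population std (ddof=0).
scaler_style = (sales - sales.mean()) / sales.std(ddof=0)

# What the explicit pandas formula computes: Series.std() uses ddof=1.
pandas_style = (sales - sales.mean()) / sales.std()

# The two standard deviations differ by a factor of sqrt((n - 1) / n).
std_ratio = sales.std(ddof=0) / sales.std()
print(round(std_ratio, 4), round(float(np.sqrt(4 / 5)), 4))
```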

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Because I have already performed a multicollinearity check and feature selection, dimensionality reduction may not be needed. The aim of feature selection is to keep the most informative features while removing redundant or irrelevant ones. The multicollinearity check confirms that the features are not highly correlated with each other, which could otherwise make the coefficient estimates unstable. Together, these steps have already reduced the data to a set of non-redundant features that are the most important in explaining the target variable.

Moreover, dimensionality reduction techniques like PCA are usually used when the data has a large number of features that are highly correlated or where there are many features with similar importance. In our case, we have already removed the features that are highly correlated and are left with a set of non-redundant features. Therefore, applying PCA may not provide significant improvement in model performance, and may even result in a loss of interpretability of the model.

In summary, having already performed a multicollinearity check and feature selection, we are left with a set of non-redundant and informative features that are sufficient for modeling the target variable, so dimensionality reduction is not needed.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
x_train,x_test,y_train,y_test = train_test_split(X_scaled,Y,test_size = 0.2,random_state = 42)

##### What data splitting ratio have you used and why?

In the given code, the data has been split into training and testing sets using a splitting ratio of 0.2, which means 20% of the data is kept aside for testing and the remaining 80% is used for training the model.

The choice of splitting ratio depends on the size of the dataset and the problem at hand. In general, a larger ratio of training to testing data is preferred when the dataset is large, as this allows the model to be trained on a more diverse range of examples and can lead to better performance.

However, if the dataset is relatively small, a larger ratio of testing to training data is preferred to ensure that the model is evaluated on a sufficient number of examples and that the evaluation is representative of the generalization performance of the model.

In this case, a 20% ratio for testing data has been chosen, which is a common ratio used in many machine learning applications. The choice of 20% allows for a large enough test set to evaluate the model's performance, while still leaving a sufficiently large training set to train the model. The random state of 42 is also chosen to ensure that the data is split in a consistent manner across multiple runs of the code.
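The role of the fixed random state can be illustrated with a small NumPy sketch. This is not scikit-learn's implementation of train_test_split. The helper split_indices is a hypothetical name, and it only demonstrates the idea that a seeded shuffle yields the same 80/20 partition on every run.

```python
import numpy as np

def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle the row indices with a fixed seed and cut off the test share."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_size))
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = split_indices(1000)
again_train, again_test = split_indices(1000)

print(len(train_idx), len(test_idx))
# Same seed, same split: results are comparable across runs.
print(np.array_equal(test_idx, again_test))
```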

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

In this Rossman retail stores dataset, the objective is to predict the amount of sales, which is a continuous variable. While there are multiple stores, the data is not inherently imbalanced since the sales are not binary or categorical values that can lead to an unequal distribution of classes. Instead, sales values vary continuously and are spread out across the dataset.

Furthermore, the dataset is composed of over a million observations across different stores, so there is a large sample size to work with. This large sample size helps to reduce the impact of any outliers or rare occurrences that could skew the data in one direction or another. Additionally, the dataset includes information about the stores' features, such as the number of competitors, holidays, and promotions, which can be useful in creating a well-informed model and reducing the impact of data imbalance.

Overall, data imbalance is not a major issue in this dataset since the sales values are continuous and the dataset is composed of a large number of observations, which helps to reduce the impact of any outliers or rare occurrences. With the additional store features provided, a well-informed model can be created that can accurately predict the amount of sales for different stores.

## ***7. ML Model Implementation***

In [None]:

def calculate_metrics(model_name, model, x_train, x_test, y_train, y_test):

    # Make predictions on the training and test sets
    y_train_pred = model.predict(x_train)
    y_test_pred = model.predict(x_test)

    # Calculate the metrics
    train_mae = mean_absolute_error(y_train, y_train_pred)
    train_mse = mean_squared_error(y_train, y_train_pred)
    train_rmse = np.sqrt(train_mse)  # RMSE is the square root of MSE (the squared=False argument is deprecated in recent scikit-learn)
    train_r2 = r2_score(y_train, y_train_pred)
    n = len(y_train)
    k = x_train.shape[1]  # number of independent variables
    train_adj_r2 = 1 - ((1 - train_r2) * (n - 1) / (n - k - 1))

    test_mae = mean_absolute_error(y_test, y_test_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)
    test_rmse = np.sqrt(test_mse)  # RMSE is the square root of MSE (the squared=False argument is deprecated in recent scikit-learn)
    test_r2 = r2_score(y_test, y_test_pred)
    n = len(y_test)
    k = x_test.shape[1]  # number of independent variables
    test_adj_r2 = 1 - ((1 - test_r2) * (n - 1) / (n - k - 1))

    data = {
        'Model_Name': [model_name],
        'Train_MAE': [train_mae],
        'Train_MSE': [train_mse],
        'Train_RMSE': [train_rmse],
        'Train_R2': [train_r2],
        'Train_Adj_R2': [train_adj_r2],
        'Test_MAE': [test_mae],
        'Test_MSE': [test_mse],
        'Test_RMSE': [test_rmse],
        'Test_R2': [test_r2],
        'Test_Adj_R2': [test_adj_r2]
    }
    df = pd.DataFrame(data)
    return df
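The adjusted R2 used in calculate_metrics() penalizes R2 for the number of predictors. The short worked example below uses invented R2 values and sample sizes to show how the penalty behaves. The figure of 25 features matches the length of index2, but everything else is illustrative.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# With few rows, 25 predictors cost a lot: about 0.796 instead of 0.90.
small = adjusted_r2(0.90, n_samples=50, n_features=25)

# With hundreds of thousands of rows, the penalty is almost invisible.
large = adjusted_r2(0.90, n_samples=800_000, n_features=25)

print(round(small, 4), round(large, 6))
```

This is why, on a dataset of this size, the R2 and adjusted R2 columns in the score charts are expected to be nearly identical.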


### Ridge Regression model

In [None]:
#initialising the model
ridge = Ridge()
#fitting the model
ridge.fit(x_train,y_train)

Ridge regression is a regularization technique used in linear regression models to prevent overfitting by adding a penalty term to the cost function. It does this by adding a regularization parameter, denoted as λ (lambda), which controls the amount of shrinkage applied to the coefficients of the regression model.

In ridge regression, the ordinary least squares (OLS) cost function is modified to include a penalty term that is proportional to the sum of the squared values of the regression coefficients. This penalty term imposes a constraint on the model that forces the coefficients to be small, which can reduce the variance of the model and prevent overfitting.

Ridge regression is particularly useful when dealing with multicollinearity, which is the presence of strong correlations between predictor variables. When multicollinearity is present, the OLS estimator can have high variance, making it difficult to determine which predictors are important. By adding the regularization term to the cost function, ridge regression can help to reduce the variance of the coefficients, making it easier to identify important predictors and improve the accuracy of the model.
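The shrinkage effect can be seen directly from the closed-form ridge solution, beta = (X^T X + lambda * I)^(-1) X^T y. The following is a minimal NumPy sketch on synthetic, nearly collinear data with no intercept term. The helper ridge_coefficients is a hypothetical name, and lambda plays the role of the alpha parameter of scikit-learn's Ridge.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Two almost identical predictors: a textbook multicollinearity case.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(size=200)

norms = []
for lam in [0.0, 1.0, 100.0]:
    beta = ridge_coefficients(X, y, lam)
    norms.append(float(np.linalg.norm(beta)))
    print(f"lambda={lam:>5}: coefficients={beta.round(3)}, norm={norms[-1]:.3f}")
```

With lambda = 0 (plain OLS), the two coefficients can take large offsetting values because the predictors are nearly interchangeable. As lambda grows, the coefficient norm shrinks and the weight is shared more evenly between the two columns.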


In [None]:
#printing coefficients
ridge.coef_

In [None]:
#calculating metrics
metrics_1 = calculate_metrics('ridge regression',ridge,x_train, x_test, y_train, y_test)

In [None]:
metrics_1

Based on the provided metrics, the Ridge regression model appears to be performing well on both the training and test data. The training and test mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE) values are all relatively close, indicating that the model is not overfitting to the training data. Additionally, the R-squared and adjusted R-squared values are both high on both the training and test data, indicating that the model is able to explain a significant amount of the variability in the target variable.

Overall, the metrics suggest that the Ridge regression model is a good fit for the data and is able to accurately predict the sales amount.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import mean_absolute_error, make_scorer

In [None]:
#Define a Ridge model object
ridge1 = Ridge()
#Define a dictionary of hyperparameter values to be tuned
#('normalize' was removed from Ridge in scikit-learn 1.2; the features are already standardized)
params = {
    'alpha': [0.001, 0.01, 0.1, 1, 10, 100]
}

In [None]:
ridge_cv = GridSearchCV(ridge1, params, cv=5)
ridge_cv.fit(x_train, y_train)

In [None]:
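#Refit Ridge with the best hyperparameters on the training split only,
#so that the test split stays unseen during evaluation
ridge_best = Ridge(**ridge_cv.best_params_)
ridge_best.fit(x_train, y_train)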

In [None]:
#metrics evaluation
metrics_2 = calculate_metrics('Ridge regression (tuned)',ridge_best,x_train, x_test, y_train, y_test)

In [None]:
metrics_2

The cross-validation results seem to be very similar to the original model, with only very small differences in the performance metrics. This suggests that the model is fairly stable and robust, and is not overfitting the data. It is always a good idea to perform cross-validation to ensure that the model is not overfitting, and to obtain a more accurate estimate of the model's performance on new, unseen data. The fact that the cross-validation results are very similar to the original model suggests that the model is likely to perform well on new data, and that the performance metrics obtained are reliable and accurate.

##### Which hyperparameter optimization technique have you used and why?

The hyperparameter optimization technique used in this code is GridSearchCV.

GridSearchCV is a method that performs an exhaustive search over specified hyperparameter values for an estimator. It evaluates a model performance with the specified hyperparameters using cross-validation and returns the hyperparameters that result in the best performance.

This technique was chosen because it is a simple yet effective method for finding the optimal hyperparameters for a model. It allows for a systematic approach to hyperparameter tuning and can help avoid overfitting or underfitting of the model. Additionally, it can save time and effort by automating the search process.
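To show what an exhaustive search with cross-validation does under the hood, here is a hand-rolled NumPy sketch for a single hyperparameter. It is an illustration of the idea only, not scikit-learn's GridSearchCV: the helpers ridge_fit and cv_mse are hypothetical, the data is synthetic, and the score is validation MSE, whereas GridSearchCV scores regressors by R2 by default.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients without an intercept."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, alpha, n_folds=5):
    """Mean validation MSE over n_folds contiguous folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    scores = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False          # hold this fold out
        beta = ridge_fit(X[train_mask], y[train_mask], alpha)
        scores.append(np.mean((y[val_idx] - X[val_idx] @ beta) ** 2))
    return float(np.mean(scores))

# Synthetic regression problem.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)

# Exhaustive search: score every candidate, keep the best one.
grid = [0.001, 0.01, 0.1, 1, 10, 100]
results = {alpha: cv_mse(X, y, alpha) for alpha in grid}
best_alpha = min(results, key=results.get)
print(results)
print("best alpha:", best_alpha)
```

Every candidate is trained n_folds times, so the cost grows with the product of the grid sizes. That is manageable for Ridge but becomes expensive for models such as Random Forest on a million rows.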

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

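The tuned model's metrics are almost identical to those of the baseline Ridge model, so hyperparameter tuning did not produce a meaningful improvement.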

### Random Forest Regressor.

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Random Forest Regressor is a supervised learning algorithm that belongs to the family of ensemble methods. It is an extension of the decision tree algorithm that builds multiple decision trees and combines their predictions to obtain a more accurate and stable model.

In the random forest algorithm, a set of decision trees are created using a random subset of the features and training data. During training, each decision tree is built on a different subset of the training data, and at each node of the tree, a random subset of features is used to find the best split. This randomization and aggregation process helps to reduce overfitting and improve the generalization of the model.

To make a prediction for a new data point, the algorithm aggregates the predictions of all the decision trees in the forest. For regression, the final prediction is the mean of the individual tree predictions.

Random Forest Regressor is used for regression tasks and can handle both continuous and categorical data. It is a powerful algorithm that is widely used in data science and machine learning for its high accuracy, robustness, and ability to handle large datasets.

In [None]:


# Initialize the Random Forest model
#rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model on the training data
#rf_model.fit(x_train, y_train)

In [None]:
#Storing previously saved results
data = { "Model_Name":'RandomForestRegressor',"Train_MAE":0.000619,"Train_MSE":0.000005,"Train_RMSE":0.002225,"Train_R2":0.999995,"Train_Adj_R2": 0.999995,"Test_MAE":0.001475,"Test_MSE": 0.000041,"Test_RMSE": 0.006436,"Test_R2": 0.999958,"Test_Adj_R2":0.999908}

In [None]:
metrics_3 = pd.DataFrame(data,index=[0])

In [None]:
metrics_3

**DISCLAIMER:** I have commented out the code and stored the previously saved results, because the model took too long to train and evaluate. The cross-validation part was never executed.


Based on the evaluation metrics, the random forest regression model performs very well on the training data, with very low MAE, MSE, and RMSE and very high R2 and Adjusted R2. It also performs well on the test data: the error metrics are slightly higher and the R2 values slightly lower than on the training data, but the fit remains very close.

These metrics suggest that the model captures the relationships between the features and the target variable and makes accurate predictions on both sets. The test errors are nevertheless roughly two to three times the training errors (for example, a test RMSE of 0.006436 against a training RMSE of 0.002225), which points to mild overfitting, so further evaluation and hyperparameter tuning may still be worthwhile. An R2 this close to 1 also deserves scrutiny, because the feature set includes both Customers and AvgSalesPerCustomer, whose product is Sales.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
#Now we can create a RandomForestRegressor object and define a set of hyperparameters to tune
#rf = RandomForestRegressor(random_state=42)
#params = {
#    'n_estimators': [10, 50, 100],
#    'max_depth': [5, 10, None],
#    'max_features': ['sqrt', 'log2', 0.5]
#}

#We'll use GridSearchCV to search over these hyperparameters to find the best set of hyperparameters. We'll specify the number of folds for cross-validation using the cv parameter
#rf_cv = GridSearchCV(rf, params, cv=5)
#rf_cv.fit(X_scaled, Y)

#Now we can use the best hyperparameters to train a RandomForestRegressor model on the entire training set
#rf_best = RandomForestRegressor(**rf_cv.best_params_, random_state=42)
#rf_best.fit(X_scaled, Y)

#metrics_4 = calculate_metrics("RandomForest_cross_validated",rf_best,x_train, x_test, y_train, y_test)

##### Which hyperparameter optimization technique have you used and why?

The hyperparameter optimization technique intended here is GridSearchCV, for the same reasons given for Ridge regression above. It exhaustively evaluates every combination in the grid (here n_estimators, max_depth, and max_features) with 5-fold cross-validation and returns the best-scoring combination.

As noted in the disclaimer above, this cell is commented out and was never executed, so no tuned Random Forest results are reported.

### Lasso Regression

In [None]:
#Creating an object for Lasso
lasso = Lasso()
#Fitting the model
lasso.fit(x_train,y_train)

In [None]:
#lasso coefficients
lasso.coef_

In [None]:
#evaluating metrics
metrics_5 = calculate_metrics('Lasso regression',lasso,x_train, x_test, y_train, y_test)

In [None]:
metrics_5

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

The machine learning model used in this case is Lasso regression, a type of linear regression that adds an L1 regularization penalty to prevent overfitting. The model performs poorly, as the evaluation metric scores show. The train R-squared score is 0.0, which means that the model does not explain any of the variation in the target variable. The adjusted R-squared score is negative, indicating that the model is a poor fit for the data. The train MAE, MSE, and RMSE scores are also high, indicating that the model's predictions are far from the true values.

The test set scores are also poor, with negative R-squared and adjusted R-squared scores, indicating that the model's performance on new data is not good. The MAE, MSE, and RMSE scores are also high on the test set, indicating that the model is not generalizing well. Overall, the Lasso regression model is not a good fit for this data and may require more complex models or data preprocessing to improve performance.
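One plausible reading of a train R-squared of exactly 0.0 is that the default `alpha=1.0` is strong enough to shrink every coefficient to zero, so the model predicts the mean of the target for every row. This is an assumption about the cause, not something the notebook confirms; inspecting `lasso.coef_` would confirm or rule it out. The sketch below reproduces the effect on synthetic data (the names `X_demo`, `y_demo`, `strong` and `weak` are hypothetical).

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic stand-in data with one informative feature and a small-scale target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# Default-strength penalty versus a much weaker one
strong = Lasso(alpha=1.0).fit(X_demo, y_demo)
weak = Lasso(alpha=0.01).fit(X_demo, y_demo)

strong_r2 = strong.score(X_demo, y_demo)  # R-squared on the training data
weak_r2 = weak.score(X_demo, y_demo)

print("alpha=1.0 :", strong.coef_, round(strong_r2, 4))
print("alpha=0.01:", np.round(weak.coef_, 3), round(weak_r2, 4))
```

When every coefficient is zero, only the intercept remains, and a constant prediction equal to the training mean gives a train R-squared of 0 by definition. A smaller `alpha` lets the informative coefficient survive.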

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
#initialising a lasso object
lasso = Lasso()
#Define a dictionary of hyperparameter values to be tuned
params = {
    #Note: the 'normalize' option was removed from Lasso in scikit-learn 1.2, so only alpha is tuned here
    'alpha': [0.001, 0.01, 0.1, 1, 10, 100]
}

In [None]:
#GridSearchCV object for Lasso regression with 5-fold cross-validation
lasso_cv = GridSearchCV(lasso, params, cv=5)
#Fitting the Lasso model on training data
lasso_cv.fit(x_train, y_train)

In [None]:
#Initializing Lasso model with the best parameters obtained from GridSearchCV
lasso_best = Lasso(**lasso_cv.best_params_)
#Fitting the Lasso model on the training data only, so that the test-set metrics are not inflated by data leakage
lasso_best.fit(x_train, y_train)

In [None]:
metrics_6 = calculate_metrics('Lasso regression (tuned)',lasso_best,x_train, x_test, y_train, y_test)

In [None]:
metrics_6

The initial Lasso Regression model has high mean absolute error (MAE), mean squared error (MSE) and root mean squared error (RMSE) values, indicating poor performance. The R2 and adjusted R2 values for both the training and test sets are also close to zero, suggesting that the model does not explain much of the variance in the data.

After hyperparameter tuning with cross-validation, there is a significant improvement in the performance of the Lasso Regression model. The MAE, MSE, and RMSE values have decreased considerably, indicating a better fit of the model to the data. The R2 and adjusted R2 values for the training and test sets have also improved, suggesting that the model now explains more of the variance in the data.

Overall, the tuned Lasso Regression model is a better choice than the initial model, as it has lower error values and better R2 scores.

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was used again for the Lasso model, for the same reasons as for the Random Forest: it searches the specified hyperparameter grid exhaustively, scores each candidate with 5-fold cross-validation, and returns the best-performing combination.

### **Gradient Boosting Regressor**

In [None]:
# Initialize the Gradient Boosting Regressor
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Fit the model on the training data
gbr.fit(x_train, y_train)

Gradient Boosting Regressor is a type of machine learning algorithm used for regression problems. It is an ensemble method that combines multiple weak models to create a stronger model.

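The algorithm works by sequentially adding weak models (usually shallow decision trees) to the ensemble. Each new tree is fitted to the errors of the current ensemble, more precisely to the negative gradient of the loss function, which for squared-error loss is simply the residual between the actual and predicted values. The new tree's predictions are then added to the ensemble, scaled by the learning rate. This lets the model concentrate on the observations it currently predicts poorly and improve its overall performance step by step.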
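For squared-error loss, gradient boosting can be viewed as repeatedly fitting a small tree to the current residuals. The sketch below hand-rolls that loop on synthetic data to make the idea concrete. It is a simplified illustration, not scikit-learn's actual implementation, and the names `X_demo`, `y_demo`, `trees` and `mse_history` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: a noisy sine curve
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction: the mean of the target
prediction = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]
trees = []

for _ in range(n_rounds):
    # For squared-error loss the negative gradient is the residual
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_demo, residuals)
    # Add a shrunken version of the new tree's output to the ensemble
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"Train MSE before boosting: {mse_history[0]:.4f}, "
      f"after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

The training error falls with every round, which is also why the number of estimators and the learning rate need to be tuned against held-out data rather than training error.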

In [None]:
#evaluating metrics
metrics_7 = calculate_metrics('Gradient Boosting Regressor',gbr,x_train, x_test, y_train, y_test)
metrics_7

The model achieved a low MAE on both the training and testing data, indicating that it was able to predict sales values with relatively small errors. The MSE and RMSE were also low, further indicating the model's ability to accurately predict sales.

The R-squared values were close to 1 on both the training and testing data, indicating that the model explains a large proportion of the variability in the data. The adjusted R-squared values were also high. Because the scores are similarly high on the training and testing data, there is no sign of serious overfitting.

Overall, the results suggest that the Gradient Boosting Regressor model is an effective approach for predicting sales in Rossmann stores, and that the model is able to generalize well to new data.

**CROSS VALIDATION AND HYPERPARAMETER TUNING**

In [None]:
'''from sklearn.model_selection import GridSearchCV

gbr = GradientBoostingRegressor(random_state=42)
param_grid = {
    'n_estimators': [100, 200, 500],
    'learning_rate': [0.01, 0.1, 1],
    'max_depth': [3, 5, 7]
}
grid_search = GridSearchCV(gbr, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(x_train, y_train)  # fit on the training split only so the test set stays unseen

metrics_8 = calculate_metrics('Gradient Boosting Regressor',grid_search,x_train, x_test, y_train, y_test)'''

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

For a positive business impact, I would consider the following evaluation metrics:

Test_MAE: Mean Absolute Error (MAE) is the average absolute difference between the predicted and actual values. It is a good metric to evaluate how well the model is predicting the target variable. In business applications, MAE can be used to evaluate the average error of the model in predicting a certain outcome. For example, if we are predicting the number of sales, we can use MAE to evaluate how accurate the model is in predicting the number of sales.

Test_RMSE: Root Mean Squared Error (RMSE) is another metric used to evaluate the performance of a regression model. It is the square root of the average squared difference between predicted and actual values. Like MAE, it is expressed in the same units as the target variable, but because the errors are squared before averaging, it penalizes large errors more heavily. This makes RMSE useful when large forecasting misses are especially costly to the business.

Test_R2: R-squared (R2) is a metric that measures how well the model fits the data. It is the proportion of the variance in the target variable that is explained by the model. R2 is a good metric to evaluate how well the model is capturing the variation in the target variable. In business applications, R2 can be used to evaluate how well the model is capturing the underlying relationship between variables.

These metrics are useful in evaluating the performance of the model and can be used to make business decisions based on the predictions made by the model.
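The definitions above can be written out directly. The helper below is a minimal sketch with a hypothetical name (`regression_metrics`); it is not the notebook's `calculate_metrics` function, whose implementation is not shown here, and the four sales values are made up for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Compute MAE, MSE, RMSE, R2 and adjusted R2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))   # average absolute error
    mse = np.mean(errors ** 2)      # average squared error
    rmse = np.sqrt(mse)             # back in the units of the target

    ss_res = np.sum(errors ** 2)                      # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    r2 = 1.0 - ss_res / ss_tot

    n = len(y_true)
    # Adjusted R2 penalizes R2 for the number of features used
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "Adjusted_R2": adj_r2}

# Made-up example: four daily sales figures and their predictions
demo = regression_metrics([100, 200, 300, 400], [110, 190, 320, 380], n_features=1)
print(demo)
```

In this example the two larger misses (20 units each) pull the RMSE above the MAE, which illustrates how RMSE weights large errors more heavily.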


### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I chose the Gradient Boosting Regressor as my final prediction model.

Gradient Boosting Regressor, a popular machine learning algorithm, offers several benefits:

1. **Strong Predictive Power:** Gradient Boosting Regressor combines the predictive power of multiple weak learners (usually decision trees) to create a strong predictive model. It can capture complex relationships between variables and produce accurate predictions.

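2. **Handles Different Data Types:** Because it is built on decision trees, Gradient Boosting Regressor does not require feature scaling and works well with a mix of feature types. In scikit-learn, however, categorical variables still need to be encoded as numbers first, and depending on the library version missing values may need to be imputed before fitting.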

3. **Feature Importance:** The algorithm provides a measure of feature importance, indicating which features have the most significant impact on the prediction. This information is valuable for feature selection and understanding the underlying data patterns.

4. **Robust to Outliers:** Tree-based splits depend on the ordering of feature values rather than their magnitude, so outliers in the features have limited influence. Outliers in the target can still affect a model trained with squared-error loss; a more robust loss such as `loss='huber'` or `loss='absolute_error'` reduces their impact.

5. **Flexible and Customizable:** Gradient Boosting Regressor offers flexibility in model configuration. You can tune hyperparameters such as the learning rate, number of estimators (weak learners), and maximum depth of the trees to optimize the model's performance for specific tasks.

6. **Controls for Overfitting:** Boosting can overfit if it runs for too many rounds, but Gradient Boosting Regressor provides several ways to control this, including shrinkage (a small learning rate), subsampling, limits on tree depth, and early stopping. Used together, these help the model generalize to unseen data rather than memorize the training set.

7. **Handles Nonlinear Relationships:** Gradient Boosting Regressor is capable of capturing nonlinear relationships between variables. It automatically constructs a complex model by combining multiple weak learners, enabling it to learn and represent nonlinear patterns in the data.

8. **Interpretability:** Although Gradient Boosting Regressor is an ensemble model, it still provides some level of interpretability. You can interpret the feature importance and understand how each feature contributes to the overall prediction.

9. **Wide Range of Applications:** Gradient Boosting Regressor is widely used in various domains, including finance, healthcare, marketing, and more. It can be applied to regression problems where predicting continuous numerical values is required.

Overall, Gradient Boosting Regressor offers a powerful and flexible framework for regression tasks, providing accurate predictions and insights into the underlying data relationships.
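Several of the points above (a robust loss, early stopping, and feature importance) correspond to constructor arguments and attributes of scikit-learn's `GradientBoostingRegressor`. The sketch below shows them on synthetic data; the parameter values are illustrative assumptions rather than tuned settings from this project, and `X_demo`, `y_demo` and `gbr_demo` are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data: only the first feature drives the target, and nonlinearly
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 3))
y_demo = 3.0 * X_demo[:, 0] ** 2 + rng.normal(scale=0.1, size=500)

gbr_demo = GradientBoostingRegressor(
    loss="huber",             # less sensitive to outliers in the target than squared error
    n_estimators=500,         # upper bound on the number of trees
    learning_rate=0.1,        # shrinkage applied to each tree
    max_depth=3,              # keeps each tree a weak learner
    subsample=0.8,            # each tree sees a random 80% of the rows
    validation_fraction=0.1,  # rows held out internally for early stopping
    n_iter_no_change=10,      # stop when the validation score stalls for 10 rounds
    random_state=42,
)
gbr_demo.fit(X_demo, y_demo)

importances = gbr_demo.feature_importances_
print("trees actually fitted:", gbr_demo.n_estimators_)
print("feature importances:", np.round(importances, 3))
```

The importances sum to 1 and here concentrate on the first feature, which is the only one the synthetic target depends on.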

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

The model explained in this section is the tuned Ridge regression model, a type of linear regression that includes a regularization term to prevent overfitting. The `ridge_best` object is the trained Ridge regression model.

The code then uses the SHAP (SHapley Additive exPlanations) library to explain the predictions made by the model. SHAP values describe how much each feature contributes to an individual prediction. `LinearExplainer` creates an explainer object from the trained model and the training data, which serves as the background distribution. The explainer's `shap_values` method then computes the SHAP values for the test data, which are stored in the `shap_values` variable.

Finally, the code uses the `summary_plot` function from the SHAP library with `plot_type='bar'`. This plot ranks the features by their mean absolute SHAP value, so it shows the magnitude of each feature's effect on the model's output but not its direction. It provides an intuitive way to see which features are the most important for making predictions. The `feature_names` argument labels the features in the plot with their corresponding names.

In [None]:
!pip install shap

In [None]:
import shap

In [None]:
# Create a SHAP explainer object with the trained model
explainer = shap.LinearExplainer(ridge_best, x_train)

# Calculate the SHAP values for the test data
shap_values = explainer.shap_values(x_test)

# Create a summary plot to show the feature importance as a horizontal bar chart
shap.summary_plot(shap_values, x_test, feature_names=index, plot_type='bar', color='b', sort=True)

Customers and weekofyear were the top two features in the SHAP summary plot, which suggests that they have the strongest influence on the model's sales predictions. The bar plot shows only the magnitude of each effect, not its direction. For customers, the strong positive correlation with sales found in the analysis indicates that more customers go with higher sales; the importance of weekofyear likely reflects seasonal variation in sales across the year.

This information can be used by businesses to make data-driven decisions and adjust their strategies accordingly. For example, if a business notices that sales are consistently higher during certain weeks of the year, they could adjust their marketing or promotions during those times to further increase sales. Additionally, if a business is looking to increase sales, they could focus on ways to increase the number of customers they have, such as through advertising or improving customer service.
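For a linear model, the quantity behind the bar chart can be reproduced by hand. Assuming features are treated as independent (to my knowledge the default behaviour of `LinearExplainer`), the SHAP value of feature j for one row is the coefficient of j times the difference between the row's value and the background mean of j. The sketch below uses NumPy only, a closed-form ridge fit, and synthetic data; all names are hypothetical and it does not call the `shap` library.

```python
import numpy as np

# Synthetic stand-in data: feature 0 matters most, feature 2 not at all
rng = np.random.default_rng(0)
n_train, n_test = 200, 50
X_tr = rng.normal(size=(n_train, 3))
X_te = rng.normal(size=(n_test, 3))
true_coef = np.array([4.0, 1.0, 0.0])
y_tr = X_tr @ true_coef + rng.normal(scale=0.1, size=n_train)

# Closed-form ridge regression on centered data (intercept is not penalized)
alpha = 1.0
x_mean, y_mean = X_tr.mean(axis=0), y_tr.mean()
Xc = X_tr - x_mean
coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(3), Xc.T @ (y_tr - y_mean))
intercept = y_mean - x_mean @ coef

# SHAP values for a linear model with independent features:
# coefficient * (feature value - background mean of that feature)
shap_demo = (X_te - x_mean) * coef

# The bar chart ranks features by the mean absolute SHAP value
mean_abs_shap = np.abs(shap_demo).mean(axis=0)

predictions = X_te @ coef + intercept
base_value = x_mean @ coef + intercept  # prediction at the background mean

print("mean |SHAP| per feature:", np.round(mean_abs_shap, 3))
```

Each row's SHAP values add up to the difference between that row's prediction and the base value, and the sign of an individual SHAP value (lost in the bar chart) follows the sign of the coefficient and of the deviation from the mean.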

# **Conclusion**

In [None]:
#concatenating all the models' metrics
score_df = pd.concat([metrics_2,metrics_3,metrics_6,metrics_7]).reset_index()

In [None]:
score_df

Based on the analysis, the following conclusions can be drawn:

There is a strong positive correlation between store sales and the number of customers visiting the store. This suggests that if a store can increase the number of customers, it will likely increase its sales as well.

Stores with larger assortment sizes tend to have higher sales. This implies that increasing the variety of products a store offers could lead to increased sales.

Promotions and holidays tend to increase store sales, as customers tend to buy more items on those days.

The analysis also revealed that the day of the week has an impact on sales, with Sundays and Mondays having the lowest sales. This information can be useful in scheduling employee hours and managing inventory levels.

The competition level of nearby stores affects the sales of a store. Stores with a higher number of nearby competitors tend to have lower sales. This suggests that stores should take the competitive landscape into consideration when making business decisions.

In summary, the regression analysis provides insights that can help optimize store operations and boost sales. By leveraging the key factors that impact sales, store owners can make data-driven decisions to improve their business.
