<a href="https://colab.research.google.com/github/AkashDas-AD/ML/blob/main/Sample_ML_Submission_Template_AkashDas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Retail Sales Prediction



##### **Project Type**    - Regression - Rossmann Retail Sales Prediction
##### **Contribution**    - Individual


# **GitHub Link -** https://github.com/AkashDas-AD/ML






# **Problem Statement**


Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.
You are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import missingno as msno
import matplotlib
import matplotlib.pylab as pylab
import warnings
warnings.filterwarnings('ignore')
from scipy import stats


%matplotlib inline
matplotlib.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 8,6

import math
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.linear_model import BayesianRidge
from sklearn.linear_model import LassoLars
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import r2_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import ElasticNet



### Dataset Loading

In [None]:
# Load Dataset
rossmann_df = pd.read_excel('Rossmann Stores Data.xlsx')
store_df = pd.read_csv('store.csv')

### Dataset First View

In [None]:
# Dataset First Look
rossmann_df.head()

In [None]:
store_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

In [None]:
rossmann_df.shape , store_df.shape

### Dataset Information

In [None]:
# Dataset Info
rossmann_df.info() , store_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

duplicate_data=rossmann_df.duplicated().sum()
duplicate_data1 = store_df.duplicated().sum()
duplicate_data,duplicate_data1

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

rossmann_df.isnull().sum(), store_df.isnull().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(store_df.isnull(), cmap='viridis', cbar=False, yticklabels=False)
plt.title('Visualization of Null Values in store_df')
plt.xlabel('Columns')
plt.ylabel('Rows')
plt.show()

### What did you know about your dataset?

rossmann_df comprises 1,017,209 rows and 9 columns, and store_df comprises 1,115 rows and 10 columns.

Missing Values : store_df has notable missing values in CompetitionOpenSinceMonth, CompetitionOpenSinceYear, Promo2SinceWeek, Promo2SinceYear, PromoInterval, and CompetitionDistance.

Duplicate Values : There are no duplicate values in either dataset.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

rossmann_df.columns , store_df.columns

In [None]:
# Dataset Describe

rossmann_df.describe() , store_df.describe()

### Variables Description

Data fields
Most of the fields are self-explanatory. The following are descriptions for those that aren't.

Id - an Id that represents a (Store, Date) duple within the test set

Store - a unique Id for each store

Sales - the turnover for any given day (this is what you are predicting)

Customers - the number of customers on a given day

Open - an indicator for whether the store was open: 0 = closed, 1 = open

StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None

SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools

StoreType - differentiates between 4 different store models: a, b, c, d

Assortment - describes an assortment level: a = basic, b = extra, c = extended

CompetitionDistance - distance in meters to the nearest competitor store

CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened

Promo - indicates whether a store is running a promo on that day

Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating

Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2

PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
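
To make the `PromoInterval` encoding concrete, here is a minimal sketch that checks whether a new Promo2 round starts in a given month. The helper name `promo2_round_starts` is hypothetical, and the month abbreviations (including `Sept` for September) are an assumption about how the raw strings are written, so check them against the actual column values first.

```python
from typing import Optional

# Month abbreviations assumed to match the PromoInterval strings.
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sept", "Oct", "Nov", "Dec"]


def promo2_round_starts(promo_interval: Optional[str], month: int) -> bool:
    """Return True if a Promo2 round starts anew in `month` (1-12)."""
    # A missing interval (None or NaN) means the store is not in Promo2.
    if not isinstance(promo_interval, str) or not promo_interval:
        return False
    return MONTH_ABBR[month - 1] in promo_interval.split(",")


print(promo2_round_starts("Feb,May,Aug,Nov", 5))  # May is in the interval
print(promo2_round_starts("Feb,May,Aug,Nov", 6))  # June is not
```

Applied row by row with the `Month` of each record, a helper like this could turn the interval string into a 0/1 feature.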

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

def check_unique_values(df1, df2):
    print("Unique Values in DataFrame 1:")
    for col in df1.columns.tolist():
        print("No. of unique values in ", col, " = ", df1[col].nunique())

    print("\nUnique Values in DataFrame 2:")
    for col in df2.columns.tolist():
        print("No. of unique values in ", col, " = ", df2[col].nunique())

# Call the function with your two DataFrames
check_unique_values(rossmann_df,store_df )

## 3. ***Data Wrangling***

### EDA and Visualization On Rossmann Dataset







In [None]:
# Write your code to make your dataset analysis ready.

# Extract year, month, day and week of year from "Date"
# (vectorized .dt accessors; cast to int64 so the columns stay in the numeric heatmap)

rossmann_df['Date'] = pd.to_datetime(rossmann_df['Date'])
rossmann_df['Year'] = rossmann_df['Date'].dt.year.astype('int64')
rossmann_df['Month'] = rossmann_df['Date'].dt.month.astype('int64')
rossmann_df['Day'] = rossmann_df['Date'].dt.day.astype('int64')
rossmann_df['WeekOfYear'] = rossmann_df['Date'].dt.isocalendar().week.astype('int64')

In [None]:
rossmann_df.sort_values(by=['Date','Store'],inplace=True,ascending=[False,True])
rossmann_df.head(2)

In [None]:
# Heatmap of the Rossmann dataset (lower triangle and diagonal only)
numeric_columns = rossmann_df.select_dtypes(include=['int64', 'float64'])
correlation_map = numeric_columns.corr()
# Boolean mask that hides the redundant upper triangle
mask = np.triu(np.ones_like(correlation_map, dtype=bool), k=1)
fig, ax = plt.subplots()
fig.set_size_inches(9, 9)
sns.heatmap(correlation_map, mask=mask, vmax=.7, square=True, annot=True, ax=ax)
plt.show()

In [None]:
# Check on which days of the week the stores are mainly open

sns.countplot(x='DayOfWeek',hue='Open',data=rossmann_df)
plt.show()

In [None]:
#Impact of promo on sales
Promo_sales = pd.DataFrame(rossmann_df.groupby('Promo').agg({'Sales':'mean'}))
sns.barplot(x=Promo_sales.index, y = Promo_sales['Sales'])

In [None]:
#Monthly sales
sns.catplot(x="Month", y="Sales", data=rossmann_df, kind="point", aspect=2, height=6)
plt.show()

In [None]:
# Value Counts of SchoolHoliday Column
rossmann_df.SchoolHoliday.value_counts()

In [None]:
# Share of records that fall on a school holiday
# (this shows how often school holidays occur, not their effect on sales).

sizes = rossmann_df.SchoolHoliday.value_counts().sort_index()
labels = ['Not-Affected', 'Affected']  # SchoolHoliday = 0, 1
colors = ['gold', 'silver']
explode = (0.1, 0.0)
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=180)
plt.axis('equal')
plt.title("Share of records affected by school holidays",fontsize=10)
fig=plt.gcf()
fig.set_size_inches(6,6)
plt.show()

In [None]:
#Transforming Variable StateHoliday

rossmann_df["StateHoliday"] = rossmann_df["StateHoliday"].map({0: 0, "0": 0, "a": 1, "b": 1, "c": 1})
rossmann_df.StateHoliday.value_counts()

In [None]:
# Share of records that fall on a state holiday
# (this shows how often state holidays occur, not their effect on sales).
sizes = rossmann_df.StateHoliday.value_counts().sort_index()
labels = ['Not-Affected', 'Affected']  # StateHoliday = 0, 1
colors = ['orange','green']
explode = (0.1, 0.0)
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=180)
plt.axis('equal')
plt.title("Share of records affected by state holidays",fontsize=10)
fig=plt.gcf()
fig.set_size_inches(6,6)
plt.show()

In [None]:
# Sales are not noticeably affected by state holidays, so we drop the StateHoliday column

rossmann_df.drop('StateHoliday',inplace=True,axis=1)

In [None]:
# Linear relation between sales and customers
# (palette is omitted because it has no effect without a hue variable)
sns.lmplot(x='Sales', y='Customers', data=rossmann_df, height=5, aspect=1, line_kws={'color':'blue'})
plt.show()

### EDA and Visualization On Store Dataset

In [None]:
#Remove features with high percentages of missing values
# we can see that some features have a high percentage of missing values and they won't be accurate as indicators,
#so we will remove features with more than 30% missing values.

store_df = store_df.drop(['CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear','Promo2SinceWeek',
                     'Promo2SinceYear', 'PromoInterval'], axis=1)
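
The 30% rule above can also be applied programmatically rather than by listing columns by hand. The following is a minimal sketch on a toy frame; `toy_store` and its values are illustrative stand-ins, not rows from `store_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for store_df; the values are illustrative only.
toy_store = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "CompetitionDistance": [1270.0, 570.0, np.nan, 620.0],
    "Promo2SinceWeek": [np.nan, 13.0, np.nan, np.nan],
})

threshold = 0.30  # drop columns with more than 30% missing values

# isnull().mean() gives the fraction of missing values per column.
missing_share = toy_store.isnull().mean()
cols_to_drop = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print(cols_to_drop)

toy_store = toy_store.drop(columns=cols_to_drop)
```

Computing the share first makes the threshold explicit and keeps the drop list in sync if the data changes.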

In [None]:
# Replace missing values in features with low percentages of missing values

# CompetitionDistance is distance in meters to the nearest competitor store
# let's first have a look at its distribution

# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(store_df.CompetitionDistance.dropna(), kde=True)
plt.title("Distribution of Store Competition Distance")
plt.show()

In [None]:
#The distribution is right skewed, so we'll replace missing values with the median.

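# Assign the result back instead of using inplace=True on a selected column,
# which newer pandas versions warn about and may not apply to store_df.
store_df['CompetitionDistance'] = store_df['CompetitionDistance'].fillna(store_df['CompetitionDistance'].median())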
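
As a small illustration of why the median is the safer fill value for a right-skewed column, the sketch below compares mean and median on made-up distances. The numbers are illustrative, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Right-skewed toy distances in meters (illustrative values only).
distances = pd.Series([200.0, 350.0, 500.0, 800.0, 20000.0, np.nan])

mean_value = distances.mean()      # pulled up by the single large value
median_value = distances.median()  # stays close to the typical store

print("mean:", mean_value)
print("median:", median_value)

# Fill the missing entry with the median, as done for CompetitionDistance.
filled = distances.fillna(median_value)
print(filled)
```

Here one distant competitor moves the mean far above most observations, while the median is unaffected.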

In [None]:
#Distribution Of Different Store Types

# value_counts() sorts by frequency, so take the labels from the sorted index
# instead of assuming the order a, b, c, d.
sizes = store_df.StoreType.value_counts().sort_index()
labels = sizes.index
colors = ['orange', 'green' , 'red' , 'pink']
explode = (0.1, 0.0 , 0.15 , 0.0)
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=180)
plt.axis('equal')
plt.title("Distribution of different StoreTypes")
fig=plt.gcf()
fig.set_size_inches(6,6)
plt.show()

In [None]:
# Pairplot for Store Dataset

sns.set_style("whitegrid", {'axes.grid' : False})
pp=sns.pairplot(store_df,hue='StoreType')
pp.fig.set_size_inches(10,10)
plt.show()

In [None]:
#Merging Two Datasets

df = pd.merge(rossmann_df, store_df, on='Store', how='left')
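
A left join like this one should keep exactly one output row per sales record. As a hedged sketch, the `validate` argument of `pd.merge` can enforce that assumption; the frames `sales` and `stores` below are toy stand-ins with illustrative values.

```python
import pandas as pd

# Toy stand-ins for rossmann_df and store_df (illustrative values only).
sales = pd.DataFrame({"Store": [1, 1, 2], "Sales": [5263, 5020, 6064]})
stores = pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]})

# validate="many_to_one" raises MergeError if a Store id is duplicated in `stores`.
merged = pd.merge(sales, stores, on="Store", how="left", validate="many_to_one")

# With a unique key on the right side, the row count must be unchanged.
print(len(merged) == len(sales))
print(merged)
```

Comparing `df.shape` with `rossmann_df.shape` after the real merge gives the same check without the extra argument.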

In [None]:
df.shape

In [None]:
df.info()

In [None]:
# plotting heatmap on the merged dataset
numeric_df = df.select_dtypes(include=['float64', 'int64'])

plt.figure(figsize=(20,12))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title("Correlation Heatmap of Numeric Columns")
plt.show()

In [None]:
df["Avg_Customer_Sales"] = df.Sales/df.Customers

In [None]:
f, ax = plt.subplots(2, 3, figsize = (20,10))

store_df.groupby("StoreType")["Store"].count().plot(kind = "bar", ax = ax[0, 0], title = "Total StoreTypes in the Dataset")
df.groupby("StoreType")["Sales"].sum().plot(kind = "bar", ax = ax[0,1], title = "Total Sales of the StoreTypes")
df.groupby("StoreType")["Customers"].sum().plot(kind = "bar", ax = ax[0,2], title = "Total nr Customers of the StoreTypes")
df.groupby("StoreType")["Sales"].mean().plot(kind = "bar", ax = ax[1,0], title = "Average Sales of StoreTypes")
df.groupby("StoreType")["Avg_Customer_Sales"].mean().plot(kind = "bar", ax = ax[1,1], title = "Average Spending per Customer")
df.groupby("StoreType")["Customers"].mean().plot(kind = "bar", ax = ax[1,2], title = "Average Customers per StoreType")

plt.subplots_adjust(hspace = 0.3)
plt.show()

The graphs show that StoreType A has the most stores, sales, and customers. However, StoreType D has the highest average spending per customer. StoreType B, with only 17 stores, has the highest average number of customers.
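
The same comparison can be collected into one table with a named aggregation. This is a minimal sketch on toy data; the column names `total_sales`, `mean_sales`, `mean_customers`, and `sales_per_customer` are hypothetical, and spending per customer is computed here from totals, which can differ slightly from averaging the per-row ratio.

```python
import pandas as pd

# Toy stand-in for the merged frame (illustrative values only).
toy = pd.DataFrame({
    "StoreType": ["a", "a", "b", "b"],
    "Sales": [6000, 4000, 9000, 11000],
    "Customers": [600, 500, 1800, 2200],
})

grouped = toy.groupby("StoreType")
summary = grouped.agg(
    total_sales=("Sales", "sum"),
    mean_sales=("Sales", "mean"),
    mean_customers=("Customers", "mean"),
)
# Spending per customer from group totals (total sales / total customers).
summary["sales_per_customer"] = summary["total_sales"] / grouped["Customers"].sum()

print(summary)
```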

In [None]:
# Comparison of sales when promo is on
sns.catplot(data=df, x="Month", y="Sales",
            col='Promo',
            hue='Promo2',
            row="Year",
            kind="point",
            aspect=1.5,
            height=4)
plt.show()

In [None]:
# sales for every day of a week

sns.catplot(data=df, x="DayOfWeek", y="Sales", hue="Promo", kind="point", aspect=1.5)
plt.show()

In [None]:
# sales for yearly basis

sns.catplot(data=df, x="Month", y="Sales", col="Year", hue="StoreType", kind="point", aspect=1.5, height=4)
plt.show()

In [None]:
df.columns

In [None]:
# Removed because the box plot showed too many outliers
df.drop(['Avg_Customer_Sales','CompetitionDistance'],axis=1,inplace=True)

In [None]:
# Checking outliers in Sales
sns.boxplot(x=rossmann_df['Sales'])
plt.show()

In [None]:
# Detecting outliers with the z-score method.

# Calculate the Z-scores for the Sales column
df['Sales_zscore'] = stats.zscore(df['Sales'])

# Identify outliers based on Z-scores (|Z| > 3)
outliers = df[df['Sales_zscore'].abs() > 3]

# Display the outliers
outliers.shape

In [None]:
# Since the outliers are less than 1% of the dataset, we remove all of them.

df = df[df['Sales_zscore'].abs() <= 3]

# Drop the temporary Z-score column if not needed anymore
df = df.drop(columns=['Sales_zscore'])

In [None]:
df.shape
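
To show what the z-score filter does on a small scale, here is a sketch with made-up sales values. It uses plain NumPy with the population standard deviation, which is also the default of `stats.zscore`.

```python
import numpy as np

# Toy sales: 19 typical days plus one extreme day (illustrative values only).
sales = np.array([5000.0] * 10 + [6000.0] * 9 + [60000.0])

# z = (x - mean) / std, using the population standard deviation (ddof=0).
z = (sales - sales.mean()) / sales.std()

keep = np.abs(z) <= 3      # same rule as |Z| <= 3 above
removed = sales[~keep]

print("removed values:", removed)
print(f"removed share: {1 - keep.mean():.2%}")
```

Because the mean and standard deviation are themselves affected by extreme values, the rule tends to flag only the most distant points.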

**Conclusion of the analysis:**

Sales are highly correlated with the number of customers.

StoreType A has the most sales and the most customers.

StoreType B has the lowest average sales per customer, which suggests customers visit this type mainly for small purchases.

StoreType D has the highest average spending per customer.

Promos run only on weekdays.

For all stores, promotions lead to an increase in both sales and customers.

More stores are open during school holidays than during state holidays.

Stores that are open during school holidays have higher sales than on normal days.

Sales increase during Christmas week, possibly because people buy more beauty products for the Christmas celebration.

Promo2 does not seem to be correlated with any significant change in sales.



***Drop Subsets Of Data That Might Cause Bias***

In [None]:
# where stores are closed, they won't generate sales, so we will remove that part of the dataset
df = df[df.Open != 0]

In [None]:
# Open isn't a variable anymore, so we'll drop it too
df = df.drop('Open', axis=1)

In [None]:
# Count the rows where a store was open but had zero sales
(df.Sales == 0).sum()

In [None]:
# Percentage of open-store rows with zero sales
(df.Sales == 0).mean() * 100

In [None]:
# remove this part of data to avoid bias
df = df[df.Sales != 0]

In [None]:
df_new=df.copy()

In [None]:
df_new = pd.get_dummies(df_new,columns=['StoreType','Assortment'])

In [None]:
# Plot average sales by day of the week
plt.figure(figsize=(15,8))
sns.barplot(x='DayOfWeek', y='Sales' ,data=df_new);

The plot shows that average sales are highest on the first and last days of the week.

In [None]:
#Setting Features and Target Variables

X = df_new.drop(['Sales','Store','Date','Year'] , axis = 1)
y= df_new.Sales

In [None]:
X.shape,y.shape

In [None]:
# Splitting Dataset Into Training Set and Test Set
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2, random_state=0)

In [None]:
columns=X_train.columns

In [None]:
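# Define the evaluation metrics.

def mape(y_true, y_pred):
    # Mean absolute percentage error, in percent of the actual value
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def rmspe(y_true, y_pred):
    # Root mean squared percentage error, relative to the actual value
    loss = np.sqrt(np.mean(np.square((y_true - y_pred) / y_true), axis=0))

    return loss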
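
As a quick sanity check on the two metrics, the sketch below evaluates them on two hand-computable predictions. It assumes the conventional definitions, in which each error is divided by the actual value; the names `rmspe_reference` and `mape_reference` are hypothetical and only used for this illustration.

```python
import numpy as np


def rmspe_reference(y_true, y_pred):
    """Root mean squared percentage error, relative to the actual values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2))


def mape_reference(y_true, y_pred):
    """Mean absolute percentage error, in percent of the actual values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


# Both predictions are off by 10% of the actual value.
y_true_toy = [100.0, 200.0]
y_pred_toy = [110.0, 180.0]

print(rmspe_reference(y_true_toy, y_pred_toy))  # about 0.1 (a fraction)
print(mape_reference(y_true_toy, y_pred_toy))   # about 10.0 (a percentage)
```

Note the different scales: RMSPE is reported as a fraction, MAPE as a percentage. Both are undefined when an actual value is zero, which is one reason the zero-sales rows are removed before modelling.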

## ***4. ML Model Implementation***

### ML Model - 1 - Linear Regression

In [None]:
# ML Model - 1 Implementation
#Build linear regression model
lin_model = LinearRegression()

#Train model on training dataset
lin_model.fit(X_train,y_train )


In [None]:
#Predict using model

yd_predicted = lin_model.predict(X_train)

#Calculate RMSPE and MAPE

print("The model performance for training dataset:\n")
print("RMSPE :",rmspe(y_train, yd_predicted))
print("MAPE :",mape(y_train, yd_predicted))

In [None]:
#Predict target on test data using model

yd_test_predicted = lin_model.predict(X_test)

#Calculate RMSPE and MAPE

print("The model performance for test dataset:\n")
print("RMSPE :",rmspe(y_test, yd_test_predicted))
print("MAPE :",mape(y_test, yd_test_predicted))

In [None]:
# calculate the regressor score.
train_score_1=lin_model.score(X_train,y_train)
test_score_1=lin_model.score(X_test,y_test)

In [None]:
train_score_1,test_score_1
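
The `score` method of a scikit-learn regressor returns the coefficient of determination R², so the two values above are R² on the training and test sets. As a rough illustration of what that number measures, here is the calculation by hand on toy values; the helper name `r2_by_hand` is hypothetical.

```python
import numpy as np


def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Toy values: ss_res = 2 and ss_tot = 20, so R^2 = 1 - 2/20.
r2_toy = r2_by_hand([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(r2_toy)
```

An R² of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean.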

In [None]:
#storing 100 observations for analysis
simple_lr_pred = yd_test_predicted[:100]
simple_lr_real = y_test[:100]
dataset_lr = pd.DataFrame({'Real':simple_lr_real,'PredictedLR':simple_lr_pred}) #storing these values into dataframe
#storing absolute differences between actual and predicted sales
dataset_lr['diff']=(dataset_lr['Real']-dataset_lr['PredictedLR']).abs()
#visualising our predictions
sns.lmplot(x='Real', y='PredictedLR', data=dataset_lr, line_kws={'color': 'black'});

### ML Model - 2 - Random Forest

In [None]:
# Not able to execute: the search crashes while running
# rfr=RandomForestRegressor(n_jobs=-1)

# params = {
#          'n_estimators':[40,50,60,70,80,90],
#          'min_samples_split':[2,3,6,8],
#          'min_samples_leaf':[1,2,3,4],
#          'max_depth':[None,5,15,30]
#          }

# #the dimensionality is high and the number of combinations to search is enormous, so RandomizedSearchCV is a better option than GridSearchCV
# grid = RandomizedSearchCV(estimator=rfr,param_distributions=params,verbose=True,cv=10)

# #with 10 folds, every part of the training data is used for validation once
# grid.fit(X_train, y_train)
# grid.best_params_
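
With the default `n_iter=10` and `cv=10`, the search above fits 100 forests on the full training data, which may be more than the runtime can handle. One possible way to keep it tractable, sketched here on synthetic data, is to shrink the grid, the number of sampled candidates, and the number of folds. The arrays `X_toy` and `y_toy` and the specific parameter values are assumptions chosen only to make the example run quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for a subsample of the training data.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=300)

param_distributions = {
    "n_estimators": [20, 40],
    "min_samples_leaf": [1, 3],
    "max_depth": [None, 10],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,        # sample 4 of the 8 combinations
    cv=3,            # fewer folds than 10 to cut the number of fits
    random_state=0,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

In the notebook itself, the same idea would mean fitting the search on a random subsample of `X_train` and `y_train` and then retraining the chosen configuration on the full training set.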

In [None]:
#Build random forest model
rdf_model = RandomForestRegressor( n_estimators=80)

#Train model on training dataset
rdf_model.fit(X_train, y_train)

In [None]:
#Predict using model

yl_pred = rdf_model.predict(X_train)

#Calculate RMSPE and MAPE

print("The model performance for training dataset:\n")
print("RMSPE :",rmspe(y_train, yl_pred))
print("MAPE :",mape(y_train, yl_pred))

In [None]:
#Predict target on test data using model

yl_pred_test = rdf_model.predict(X_test)

#Calculate RMSPE and MAPE

print("The model performance for test dataset:\n")
print("RMSPE :",rmspe(y_test, yl_pred_test))
print("MAPE :",mape(y_test, yl_pred_test))

In [None]:
train_score_2=rdf_model.score(X_train, y_train)
test_score_2=rdf_model.score(X_test, y_test)

In [None]:
train_score_2,test_score_2

In [None]:
#storing 100 observations for analysis
rf_prd = yl_pred_test[:100]
rf_real = y_test[:100]
dataset_rf = pd.DataFrame({'Real':rf_real,'PredictedRF':rf_prd})
#storing absolute differences between actual and predicted sales
dataset_rf['diff']=(dataset_rf['Real']-dataset_rf['PredictedRF']).abs()

In [None]:
#visualising our predictions
sns.lmplot(x='Real', y='PredictedRF', data=dataset_rf, line_kws={'color': 'red'},height=6, aspect=1);

### ML Model - 3 - LARS Lasso Regression

In [None]:
# Build and train model.

las = LassoLars(alpha=0.3)
lasreg = las.fit(X_train, y_train)

In [None]:
#Predict using model
yl_pred = lasreg.predict(X_train)
#Calculate RMSPE and MAPE

print("The model performance for training dataset:\n")
print("RMSPE :",rmspe(y_train, yl_pred))
print("MAPE :",mape(y_train, yl_pred))

In [None]:
#Predict target on test data using model

yl_pred_test =lasreg.predict(X_test)

#Calculate RMSPE and MAPE

print("The model performance for test dataset:\n")
print("RMSPE :",rmspe(y_test, yl_pred_test))
print("MAPE :",mape(y_test, yl_pred_test))

In [None]:
train_score_3=lasreg.score(X_train, y_train)
test_score_3=lasreg.score(X_test, y_test)

In [None]:
train_score_3,test_score_3

In [None]:
#storing 100 observations for analysis
las_prd = yl_pred_test[:100]
las_real = y_test[:100]
dataset_las = pd.DataFrame({'Real':las_real,'PredictedLas':las_prd})
#storing absolute differences between actual and predicted sales
dataset_las['diff']=(dataset_las['Real']-dataset_las['PredictedLas']).abs()

In [None]:
#visualising our predictions
sns.lmplot(x='Real', y='PredictedLas', data=dataset_las, line_kws={'color': 'red'},height=6, aspect=1)

# **Conclusion**

In [None]:
# R² scores (from each model's .score method) on the training and test sets
score_df = pd.DataFrame(
    {'Train_Score': [train_score_1, train_score_2, train_score_3],
     'Test_Score': [test_score_1, test_score_2, test_score_3]},
    index=['Linear Regression', 'Random Forest', 'LARS Lasso Regression'])
score_df

Random Forest is the most accurate model so far. Its performance could be improved further with hyperparameter tuning.

In [None]:
# Metric values copied from a run of the model cells above; re-run those cells to refresh them
data = {
    "Model": ["Linear Regression", "Random Forest", "LARS Lasso Regression"],
    "Training RMSPE": [3.550758912267052, 0.06790501224557141, 4.9840958915470415],
    "Test RMSPE": [1.1323232821970965, 0.16482888084789837, 1.238996213686942],
    "Training MAPE": [14.529856129862303, 5.314279398689622, 14.528667832761268],
    "Test MAPE": [14.529823278555165, 13.289990687719957, 14.528301424060732],
}

df1 = pd.DataFrame(data)
df1


Summary:

Random Forest: Has the lowest errors on both the training and test sets, but shows signs of overfitting: test MAPE (13.29) is more than double training MAPE (5.31).

Linear Regression & LARS Lasso Regression: Both models perform similarly, with relatively high errors. Their MAPE is nearly identical on the training and test sets (about 14.53), so they show no sign of overfitting on that metric.
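
The overfitting remark can be made explicit by computing the gap between test and training MAPE for each model. This sketch reuses the MAPE values from the results table above; the frame name `mape_table` and the gap column name are hypothetical.

```python
import pandas as pd

# MAPE values copied from the results table in this notebook.
mape_table = pd.DataFrame(
    {
        "Training MAPE": [14.529856129862303, 5.314279398689622, 14.528667832761268],
        "Test MAPE": [14.529823278555165, 13.289990687719957, 14.528301424060732],
    },
    index=["Linear Regression", "Random Forest", "LARS Lasso Regression"],
)

# A large positive gap means the model does much worse on unseen data.
mape_table["Gap (test - train)"] = mape_table["Test MAPE"] - mape_table["Training MAPE"]

print(mape_table.round(3))
```

A gap near zero for the two linear models and a gap of several percentage points for Random Forest is the pattern described in the summary.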



### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***