
# <b><u> Project Title : Sales Prediction : Predicting sales of a major store chain Rossmann</u></b>

## <b> Problem Description </b>

### Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

### You are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.

## <b> Data Description </b>

### <b>Rossmann Stores Data.csv </b> - historical data including Sales
### <b>store.csv </b> - supplemental information about the stores


### <b><u>Data fields</u></b>
### Most of the fields are self-explanatory. The following are descriptions for those that aren't.

* #### Id - an Id that represents a (Store, Date) duple within the test set
* #### Store - a unique Id for each store
* #### Sales - the turnover for any given day (this is what you are predicting)
* #### Customers - the number of customers on a given day
* #### Open - an indicator for whether the store was open: 0 = closed, 1 = open
* #### StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
* #### SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
* #### StoreType - differentiates between 4 different store models: a, b, c, d
* #### Assortment - describes an assortment level: a = basic, b = extra, c = extended
* #### CompetitionDistance - distance in meters to the nearest competitor store
* #### CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
* #### Promo - indicates whether a store is running a promo on that day
* #### Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
* #### Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
* #### PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
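
The `PromoInterval` encoding described above can be turned into a per-row flag. A minimal sketch, assuming three-letter English month abbreviations as in the "Feb,May,Aug,Nov" example; the helper name `is_promo2_start_month` and the `MONTHS` list are hypothetical and not part of the dataset or the notebook.

```python
# Hypothetical lookup table: English month abbreviations, index 0 = January.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def is_promo2_start_month(promo_interval, month):
    """Return True if `month` (1-12) starts a new Promo2 round.

    `promo_interval` is a comma-separated string such as "Feb,May,Aug,Nov".
    Anything that is not a string (for example a missing value or a 0
    placeholder) is treated as "no Promo2 rounds".
    """
    if not isinstance(promo_interval, str):
        return False
    # Keep only the first three letters so that "Sept" also matches "Sep".
    starts = {name.strip()[:3] for name in promo_interval.split(",")}
    return MONTHS[month - 1] in starts


print(is_promo2_start_month("Feb,May,Aug,Nov", 5))
print(is_promo2_start_month("Feb,May,Aug,Nov", 6))
```

Such a flag could be combined with `Promo2` and `Promo2Since[Year/Week]` to mark the days on which a store is actually in a Promo2 round.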

In [None]:
# importing numpy and pandas libraries
import pandas as pd
import numpy as np

In [None]:
# mounting the drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# reading the csv file from drive
df1=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Capstone project2- regeression/Rossmann Stores Data.csv')

In [None]:
# top 5 rows of dataframe
df1.head(5)

In [None]:
# bottom 5 rows of dataframe
df1.tail(5)

In [None]:
# reading another dataset from drive
df2=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Capstone project2- regeression/Store.csv')

In [None]:
# top 5 rows of dataframe
df2.head(5)

In [None]:
# bottom 5 rows of dataframe
df2.tail(5)

In [None]:
# checking statistics of df2 dataframe
df2.describe()

In [None]:
# checking information of df2 dataframe
df2.info()

In [None]:
# finding number of null values for df2 dataframe
df2.isnull().sum()

In [None]:
# replacing null values for df2
df2.PromoInterval=df2.PromoInterval.fillna(0)
df2.Promo2SinceYear =df2.Promo2SinceYear .fillna(0).astype('int64')
df2.Promo2SinceWeek =df2.Promo2SinceWeek .fillna(0).astype('int64')

In [None]:
# rechecking df2 information
df2.info()

In [None]:
# finding the median
comp_dist_median=df2.CompetitionDistance.median()
comp_dist_median

In [None]:
# replacing the null values with median
df2['CompetitionDistance']=df2.CompetitionDistance.fillna(comp_dist_median)
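
Median imputation hides which rows were originally missing. A small sketch on made-up values (not the real `store.csv`) that keeps that information in an extra flag column; the column name `CompetitionDistance_missing` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for store.csv; the values are invented for illustration.
stores = pd.DataFrame({
    "Store": [1, 2, 3, 4, 5],
    "CompetitionDistance": [500.0, np.nan, 1200.0, 20000.0, np.nan],
})

# Hypothetical flag column: 1 where the distance was missing before imputation.
stores["CompetitionDistance_missing"] = stores["CompetitionDistance"].isna().astype(int)

# The median of the observed values is not pulled up by the one very large distance.
median_distance = stores["CompetitionDistance"].median()
stores["CompetitionDistance"] = stores["CompetitionDistance"].fillna(median_distance)

print(median_distance)
print(stores)
```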

In [None]:
df2.CompetitionDistance=df2.CompetitionDistance.astype('int64')

In [None]:
# value counts for CompetitionOpenSinceMonth
df2.CompetitionOpenSinceMonth.value_counts()

In [None]:
# replacing the null values with the mode (most frequent month)
# Series.mode() returns a Series, so take its first entry explicitly
df2['CompetitionOpenSinceMonth']=df2.CompetitionOpenSinceMonth.fillna(int(df2.CompetitionOpenSinceMonth.mode()[0]))

In [None]:
# value count
df2.CompetitionOpenSinceYear.value_counts()

In [None]:
# Series.mode() returns a Series, so take its first entry explicitly
mode1=int(df2.CompetitionOpenSinceYear.mode()[0])

In [None]:
# replacing the null values with the mode
df2['CompetitionOpenSinceYear']=df2.CompetitionOpenSinceYear.fillna(mode1)
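
Mode imputation can be sketched on a toy series as follows. `Series.mode()` returns a Series because ties are possible, so the first entry is selected explicitly. The values are made up.

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries; the values are invented for illustration.
years = pd.Series([2013.0, np.nan, 2013.0, 2009.0, np.nan],
                  name="CompetitionOpenSinceYear")

# mode() ignores NaN by default and returns the most frequent value(s), sorted.
most_common_year = years.mode().iloc[0]

# After filling, the column has no NaN left and can be cast to an integer dtype.
years_filled = years.fillna(most_common_year).astype("int64")
print(years_filled.tolist())
```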

In [None]:
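# astype returns a new Series, so assign it back to keep the integer dtype
df2['CompetitionOpenSinceYear']=df2.CompetitionOpenSinceYear.astype(int)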

In [None]:
# finding sum of null value for columns in df2
df2.isnull().sum()

In [None]:
#  finding sum of null value for columns in df1
df1.isnull().sum()

In [None]:
# information of columns for df1
df1.info()

In [None]:
# information of columns for df2
df2.info()

In [None]:
# value counts
df2.Promo2SinceYear.value_counts()

In [None]:
df2.info()

In [None]:
df2.CompetitionOpenSinceYear.unique()

In [None]:
df1.shape

In [None]:
df2.shape

In [None]:
# joining two dataset using outer join
new_df1=df1.merge(df2, on= 'Store',how='outer')
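
When joining daily sales to store metadata, it can help to check that each sales row matches exactly one store row. A minimal sketch on toy frames, using the `validate` and `indicator` options of `DataFrame.merge`; the data and the choice of a left join are assumptions for illustration, not what the notebook does.

```python
import pandas as pd

# Toy stand-ins for the two files; the values are invented for illustration.
sales = pd.DataFrame({"Store": [1, 1, 2, 2, 3],
                      "Sales": [5263, 5020, 6064, 5567, 8314]})
stores = pd.DataFrame({"Store": [1, 2, 3],
                       "StoreType": ["c", "a", "a"]})

# validate="many_to_one" raises if a Store id appears twice in `stores`;
# indicator=True adds a `_merge` column showing where each row came from.
merged = sales.merge(stores, on="Store", how="left",
                     validate="many_to_one", indicator=True)

print(merged.shape)
print(merged["_merge"].value_counts())
```

If every row is labelled `both`, left and outer joins give the same result; rows labelled `left_only` or `right_only` would show up as NaN-filled rows after an outer join.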

In [None]:
# shape
new_df1.shape

In [None]:
new_df1.Customers.value_counts()

In [None]:
# information of new dataframe
new_df1.info()

In [None]:
new_df1.head()

In [None]:
# converting to datetime
new_df1['Date']= pd.to_datetime(new_df1.Date)

In [None]:
new_df1['Year']=new_df1.Date.dt.year

In [None]:
new_df1['Month']=new_df1.Date.dt.month
new_df1['Day']=new_df1.Date.dt.day
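# Series.dt.week was removed in pandas 2.0; isocalendar().week gives the ISO week number
new_df1['month_weeks']=new_df1.Date.dt.isocalendar().week.astype('int64')%4
new_df1['total_weeks']=new_df1.Date.dt.isocalendar().week.astype('int64')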
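
The week features can be sketched on a few sample dates. This example uses `dt.isocalendar()` for the ISO week number and, as an alternative to taking the week number modulo 4, a week-of-month counted from the first day of the month. The column name `week_of_month` and its definition are assumptions for illustration.

```python
import pandas as pd

# Three sample dates chosen for illustration.
dates = pd.DataFrame({"Date": pd.to_datetime(["2015-01-01", "2015-07-31", "2015-12-28"])})

# ISO calendar week (1-53); cast from the nullable UInt32 dtype to plain int64.
dates["total_weeks"] = dates["Date"].dt.isocalendar()["week"].astype("int64")

# Hypothetical alternative feature: days 1-7 -> week 1, days 8-14 -> week 2, and so on.
dates["week_of_month"] = (dates["Date"].dt.day - 1) // 7 + 1

print(dates)
```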

In [None]:
new_df1.head()

In [None]:
# importing seaborn and matplotlib lib
import seaborn as sns

In [None]:
import matplotlib.pyplot as plt

In [None]:
# plotting count plot for promo2 sinceyear
sns.countplot(x='Promo2SinceYear',data= new_df1)

In [None]:
# plotting count plot for SchoolHoliday
plot=sns.countplot(x='SchoolHoliday', data= new_df1)
total = len(new_df1['SchoolHoliday'])
for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot.annotate(percentage, (x, y), size = 12)
plt.show()
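
The percentages annotated on the bars can also be computed directly, without a plot, using `value_counts(normalize=True)`. A small sketch on made-up values:

```python
import pandas as pd

# Toy SchoolHoliday column; the values are invented for illustration.
flags = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="SchoolHoliday")

# normalize=True returns fractions; scale to percent and round to one decimal.
share = flags.value_counts(normalize=True).mul(100).round(1)
print(share)
```

The same call works for any of the categorical columns plotted in this section.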

In [None]:
# changing datatype to string
new_df1.StateHoliday=new_df1.StateHoliday.astype('str')

In [None]:
# plotting count plot for StateHoliday
plot1=sns.countplot(x='StateHoliday', data= new_df1)
total = len(new_df1['StateHoliday'])
for p in plot1.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot1.annotate(percentage, (x, y), size = 12)
plt.show()

In [None]:
# plotting count plot for promo2
plot2=sns.countplot(x='Promo2', data= new_df1)
total = len(new_df1['Promo2'])
for p in plot2.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot2.annotate(percentage, (x, y), size = 12)
plt.show()

In [None]:
# plotting count plot for promo
plot3=sns.countplot(x='Promo', data= new_df1)
total = len(new_df1['Promo'])
for p in plot3.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot3.annotate(percentage, (x, y), size = 12)
plt.show()

In [None]:
# plotting count plot for open
plot4=sns.countplot(x='Open', data= new_df1)
total = len(new_df1['Open'])
for p in plot4.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot4.annotate(percentage, (x, y), size = 12)
plt.show()

In [None]:
# plotting count plot for storetype
plot5=sns.countplot(x='StoreType', data= new_df1)
total = len(new_df1['StoreType'])
for p in plot5.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot5.annotate(percentage, (x, y), size = 12)
plt.show()

In [None]:
# statistical description of new df1
new_df1.describe()

### **Label Encoding for categorical columns**

In [None]:
# importing label encoder from sklearn library
from sklearn.preprocessing import LabelEncoder

In [None]:
# calling label encoder
label_encoder=LabelEncoder()

In [None]:
# transforming categorical column Store Type
new_df1.StoreType=label_encoder.fit_transform(new_df1.StoreType)
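
`LabelEncoder` assigns integer codes in sorted order of the labels, which imposes an ordering (a < b < c < d) that a linear model will treat as numeric. A minimal sketch on toy values showing the resulting mapping, alongside one-hot encoding as a possible alternative:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy StoreType column; the values are invented for illustration.
store_type = pd.Series(["c", "a", "d", "b", "a"], name="StoreType")

encoder = LabelEncoder()
codes = encoder.fit_transform(store_type)

# classes_ holds the sorted labels; a label's position is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(codes.tolist())
print(mapping)

# One-hot alternative: one indicator column per store type, no implied order.
one_hot = pd.get_dummies(store_type, prefix="StoreType")
print(one_hot.columns.tolist())
```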

In [None]:
# displaying dataframe new_df1
new_df1

In [None]:
# dropping the Date column
new_df1.drop('Date',axis='columns',inplace=True)

In [None]:
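# correlation of the numeric columns of the dataframe
# (numeric_only=True skips the string columns, which raise an error in pandas 2.0+)
cor=new_df1.corr(numeric_only=True)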

In [None]:
cor

In [None]:
new_df1['Sales'].value_counts()

In [None]:
new_df1['Customers'].value_counts()

In [None]:
# plotting heatmap for better understanding correlation between columns
sns.heatmap(data=cor,cmap='tab20')

In [None]:
new_df1.columns

In [None]:
# dropping DayOfWeek as it has the lowest correlation with Sales
new_df1=new_df1.drop('DayOfWeek',axis='columns')

In [None]:
new_df1.columns

In [None]:
# label encoding
new_df1['Assortment']=label_encoder.fit_transform(new_df1['Assortment'])

In [None]:
# creating dummy columns for the PromoInterval values
dumm=pd.get_dummies(new_df1['PromoInterval'])
new_df2=pd.concat([dumm,new_df1],axis=1)
new_df2

In [None]:
# dropping columns
new_df2.drop([0,'PromoInterval'],axis=1,inplace=True)
new_df2
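
`pd.get_dummies` creates one column per distinct interval string. An alternative, sketched here on made-up rows, is one indicator column per start month via `Series.str.get_dummies`. This is an assumption about how the feature could be encoded, not the notebook's method.

```python
import pandas as pd

# Toy PromoInterval column; 0 is the placeholder used for missing intervals.
promo_interval = pd.Series(["Feb,May,Aug,Nov", 0, "Jan,Apr,Jul,Oct", 0])

# Turn the 0 placeholder into an empty string so every entry is text.
as_text = promo_interval.map(lambda v: v if isinstance(v, str) else "")

# Split on commas and build one indicator column per month name.
month_flags = as_text.str.get_dummies(sep=",")
print(month_flags)
```

Rows without a promotion interval end up with all indicators set to zero.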

In [None]:
new_df2.CompetitionDistance.describe()

In [None]:
new_df2.StateHoliday.unique()

In [None]:
new_df2.StateHoliday=label_encoder.fit_transform(new_df2.StateHoliday)

In [None]:
new_df2.info()

In [None]:
# shape new_df2
new_df2.shape

In [None]:
new_df2

In [None]:
new_df2.columns

In [None]:
new_df2.describe()

In [None]:
# using log for columns which have large values (zeros are kept as 0)
new_df2['Customer_lg'] = new_df2['Customers'].map(lambda x : np.log(x) if x != 0 else 0)
new_df2['CompetitionDistance_lg'] = new_df2['CompetitionDistance'].map(lambda x : np.log(x) if x != 0 else 0)
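
With the mapping above, an input of 0 and an input of 1 both become 0, since log(1) = 0. A small sketch on made-up values comparing it with `np.log1p`, which computes log(1 + x) and is a common alternative for columns that contain zeros:

```python
import numpy as np
import pandas as pd

# Toy Customers column; the values are invented for illustration.
customers = pd.Series([0, 1, 555, 7388], name="Customers")

# Mapping used in the notebook: log(x), with zeros passed through as 0.
log_with_zero_guard = customers.map(lambda v: np.log(v) if v != 0 else 0)

# Alternative: log(1 + x) is defined at 0 and keeps distinct inputs distinct.
log1p_values = np.log1p(customers)

print(log_with_zero_guard.tolist())
print(log1p_values.tolist())
```

`np.expm1` inverts `np.log1p`, which is convenient if transformed values need to be mapped back to the original scale.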


In [None]:
new_df2.Promo2SinceYear.value_counts()

In [None]:
new_df2

In [None]:
new_df2[new_df2['Customers']!=0]

In [None]:
# plotting histogram and checking mean and median
fig=plt.figure(figsize=(10,8))
ax=fig.gca()
new_df2['Customers'].hist(bins=50,ax=ax)
ax.axvline(new_df2.Customers.mean(),color='r',ls= 'dotted',linewidth=3)

ax.axvline(new_df2.Customers.median(),color='orange',ls= 'dotted',linewidth=3)
plt.title('Customer')

In [None]:
# plotting histogram and checking mean and median
fig=plt.figure(figsize=(10,8))
ax=fig.gca()
new_df2.CompetitionDistance.hist(bins=50,ax=ax)
ax.axvline(new_df2.CompetitionDistance.mean(),color='r',ls= 'dotted',linewidth=3)

ax.axvline(new_df2.CompetitionDistance.median(),color='orange',ls= 'dotted',linewidth=3)
plt.title('Competition Distance')

In [None]:
new_df2.columns

In [None]:
# dropping columns
new_df2.drop(['Customers','CompetitionDistance','Store'],axis='columns',inplace=True)

In [None]:
# dependent variable
y= new_df2['Sales']

In [None]:
y

In [None]:
#  independent variable
x=new_df2.drop('Sales',axis='columns')

In [None]:
x

In [None]:
x.columns

### **Dividing data in train and test dataset**

In [None]:
# importing train test split
from sklearn.model_selection import train_test_split

In [None]:
xtrain,xtest,ytrain,ytest= train_test_split(x,y, test_size= 0.2, random_state=0)
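
`train_test_split` shuffles rows, so train and test days are interleaved in time. Because the task is forecasting, one alternative is to hold out the most recent dates instead. A minimal sketch on toy data, assuming a `Date` column is still available (the notebook drops it earlier); this is not the split used in the notebook.

```python
import pandas as pd

# Toy daily series; the values are invented for illustration.
frame = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=10, freq="D"),
    "Sales": [5, 6, 7, 5, 0, 8, 9, 7, 6, 10],
})

# Sort chronologically, then keep the last 20% of rows as the test set.
frame = frame.sort_values("Date")
cutoff = int(len(frame) * 0.8)
train, test = frame.iloc[:cutoff], frame.iloc[cutoff:]

print(train["Date"].max(), test["Date"].min())
```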

In [None]:
ytrain

In [None]:
new_df2.shape

In [None]:
xtrain

In [None]:
xtest.shape

### **Linear Regression model**

In [None]:
# importing linearregression model from sklearn.linear_model
from sklearn.linear_model import LinearRegression as LR

In [None]:
# calling linear model
lr_model= LR()

In [None]:
# importing cross validation
from sklearn.model_selection import cross_val_score

In [None]:
# mean score for cross validation for linear model
cross_val_score(lr_model,xtrain,ytrain,cv=10).mean()
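
For a regressor, `cross_val_score` uses the estimator's own `score` method by default, which is R² for `LinearRegression`. A minimal sketch on synthetic data that spells out the scoring and the fold strategy and also reports the spread across folds; the data and the shuffled `KFold` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data: a linear signal plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Explicit 10-fold split with shuffling and a fixed seed for reproducibility.
folds = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=folds, scoring="r2")

# The mean summarises performance; the standard deviation shows fold-to-fold variation.
print(scores.mean(), scores.std())
```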

In [None]:
# fitting train data into linear model 
lr_model.fit(xtrain,ytrain)

In [None]:
ytest

In [None]:
ytrain

In [None]:
xtrain


In [None]:
# ytrain predict value
y_pred=lr_model.predict(xtrain)

In [None]:
ytrain.shape

In [None]:
y_pred.shape

In [None]:
# importing r2_score to measure goodness of fit (R², not classification accuracy)
from sklearn.metrics import r2_score

In [None]:
# training and predicted score
r2_score(ytrain,y_pred)

In [None]:
y_testpred=lr_model.predict(xtest)

In [None]:
# testing and test predict score
r2_score(ytest,y_testpred)

**The Linear Regression model has a test R² score of 86.16%.**
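
The score reported by `r2_score` is R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the targets from their mean. A small worked sketch on made-up numbers, checked against `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy targets and predictions; the values are invented for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

# Residual sum of squares: 0.25 + 0.25 + 0 + 1 = 1.5
ss_res = np.sum((y_true - y_hat) ** 2)
# Total sum of squares around the mean (6.0): 9 + 1 + 1 + 9 = 20
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_hat))
```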

## **Regularisation using Lasso**

In [None]:
# import linear model from sklearn
from sklearn import linear_model

In [None]:
# calling Lasso from linear model 
lass_reg= linear_model.Lasso(alpha=0.1,max_iter=100)

In [None]:
# train score
cross_val_score(lass_reg,xtrain,ytrain,cv=10).mean()

In [None]:
# fitting train dataset into lasso reg model
lass_reg.fit(xtrain,ytrain)
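
The L1 penalty in Lasso depends on the scale of each feature, so features are often standardised first. A minimal sketch on synthetic data, assuming a `StandardScaler` plus `Lasso` pipeline (not what the notebook uses), showing that coefficients of uninformative features are driven to exactly zero:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of four features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Standardise, then fit Lasso; a larger max_iter gives the solver room to converge.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1, max_iter=10_000))
model.fit(X, y)

# make_pipeline names each step after its lowercased class name.
coefs = model.named_steps["lasso"].coef_
print(coefs)
```

Inspecting `coef_` in this way shows which features the penalty has effectively removed from the model.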

In [None]:
ytestprdict=lass_reg.predict(xtest)

In [None]:
ytrainpred=lass_reg.predict(xtrain)

In [None]:
ytest

In [None]:
# test score
lass_reg.score(xtest,ytest)

In [None]:
# train score using r2 score
r2_score(ytrain,ytrainpred)

In [None]:
# test score using r2 score
r2_score(ytest,ytestprdict)

### **After Lasso regularisation, the model has a test score of approx. 85.69%**

### **Decision Tree Regressor**

In [None]:
# importing decision tree regressor model
from sklearn.tree import DecisionTreeRegressor

In [None]:
# calling decision tree
decision=DecisionTreeRegressor(random_state=10)

In [None]:
# cross validation score for decision model
cross_val_score(decision,xtrain,ytrain,cv=10).mean()

In [None]:
decision.fit(xtrain,ytrain)

In [None]:
# predicting ytest
ytestpred=decision.predict(xtest)

In [None]:
# test score
decision.score(xtest,ytest)
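
A decision tree grown without a depth limit can fit its training data almost perfectly, so comparing train and test scores is a useful overfitting check. A minimal sketch on synthetic data; the data and the `max_depth=4` setting are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy sine curve over one feature.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Compare a fully grown tree (max_depth=None) with a depth-limited one.
scores = {}
for depth in (None, 4):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=10).fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(depth, scores[depth])
```

A large gap between the train and test score of the unrestricted tree suggests that it is memorising noise; limiting `max_depth` or `min_samples_leaf` trades some training fit for stability.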

### **The Decision Tree Regressor model has a test score of approx. 97.12%**