# **Project Name**    - Rossmann Retail Sales Prediction



##### **Project Type**    - Regression
##### **Contribution**    - Team
##### **Team Member 1**    - Juned Akhtar
##### **Team Member 2**    - Muskan Garg

# **Project Summary -**

We have a time series forecasting problem where we need to predict the "Sales" column for the test set. We have historical sales data for 1,115 Rossmann stores, and we need to use this data to train a model that can predict sales up to six weeks in advance.

Given the nature of the problem, there are several ways to approach it. Whichever approach we choose, the model should capture the trends, seasonality, and other time-dependent patterns that affect the sales of Rossmann stores.

To get started, we can explore the dataset to understand the features and their relationships with the target variable. We can also perform data cleaning and preprocessing to handle missing data, outliers, and other issues that may affect the performance of our model.

Once we clean and preprocess the data, we can train our model using a training set and evaluate its performance using a validation set. We can then use the trained model to make predictions on the test set and submit it to the competition.
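Because the goal is to forecast six weeks ahead, one option for the validation set is to hold out the most recent six weeks instead of a random sample. The sketch below illustrates that idea on made-up data; the frame `toy` and the names `cutoff`, `train_part`, and `valid_part` are assumptions for illustration, and the later cells of this notebook use a random split instead.

```python
import pandas as pd

# Toy daily data for a single store; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=120, freq="D"),
    "Sales": range(120),
})

# Hold out the most recent six weeks (42 days) as the validation period.
cutoff = toy["Date"].max() - pd.Timedelta(weeks=6)
train_part = toy[toy["Date"] <= cutoff]
valid_part = toy[toy["Date"] > cutoff]

print(len(train_part), len(valid_part))
```

With a split like this, every validation row is later in time than every training row, which is closer to how the model would be used on the test set.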

There are many factors that can influence store sales, including promotions, competition, holidays, seasonality, and location. With so many different factors to consider, it can be challenging to accurately predict sales, especially when each store has its unique circumstances.

To make matters more complicated, some stores in the dataset are temporarily closed for refurbishment, which will need to be taken into account when making predictions.

To tackle this task, we need to use data analysis and machine learning techniques to build a model that can accurately predict sales for each store. We will need to explore the data to identify trends and patterns that may be useful for making predictions, and then use this information to train and test our model.

Ultimately, our goal is to create a model that can accurately predict sales for each store, even when faced with a wide range of different factors that can influence sales. By doing so, we can help Rossmann make better business decisions and improve their overall performance.


# **GitHub Link -**

https://github.com/Junedaktar/Capstone-project-2-Sales-data-prediction.git

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
store_df=pd.read_csv('/content/store.csv')
rossman = pd.read_csv('/content/Rossmann Stores Data.csv')
store_df

In [None]:
rossman

### Dataset First View

In [None]:
# Dataset First Look
store_df.head()

In [None]:
rossman.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns 
store_df.shape

In [None]:
rossman.shape

### Dataset Information

In [None]:
# Dataset Info
store_df.info()

In [None]:
rossman.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
store_df.duplicated().sum()

In [None]:
rossman.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
store_df.isnull().sum()

In [None]:
rossman.isnull().sum()

In [None]:
# Visualizing the missing values (each highlighted cell is a null)
sns.heatmap(store_df.isnull(), cbar=False)

In [None]:
rossman.isnull()
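A compact way to read the missing-value counts is to put the count and the percentage side by side. This is a minimal sketch on a made-up frame; `toy` and `missing` are illustrative names, and the values are not taken from the real files.

```python
import numpy as np
import pandas as pd

# Small made-up frame with a few gaps, using column names from the store data.
toy = pd.DataFrame({
    "CompetitionDistance": [570.0, np.nan, 14130.0, 620.0],
    "Promo2SinceWeek": [np.nan, 13.0, 14.0, np.nan],
    "Store": [1, 2, 3, 4],
})

# Count and share of missing values per column, largest first.
missing = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(1),
}).sort_values("missing_count", ascending=False)

print(missing)
```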

### What did you know about your dataset?

The dataset contains historical sales data for 1,115 Rossmann stores. It includes store-level information, such as store type, assortment level, and details about the nearest competitor and ongoing promotions, as well as daily records with the date, sales, and number of customers. The dataset is used to train and test a machine learning model that predicts sales for each store up to six weeks in advance.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
store_df.columns

In [None]:
rossman.columns

In [None]:
# Dataset Describe
store_df.describe()

In [None]:
rossman.describe()

### Variables Description 

Most of the fields are self-explanatory. The following are descriptions for those that aren't.

* #### Id - an Id that represents a (Store, Date) duple within the test set
* #### Store - a unique Id for each store
* #### Sales - the turnover for any given day (this is what you are predicting)
* #### Customers - the number of customers on a given day
* #### Open - an indicator for whether the store was open: 0 = closed, 1 = open
* #### StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
* #### SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
* #### StoreType - differentiates between 4 different store models: a, b, c, d
* #### Assortment - describes an assortment level: a = basic, b = extra, c = extended
* #### CompetitionDistance - distance in meters to the nearest competitor store
* #### CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
* #### Promo - indicates whether a store is running a promo on that day
* #### Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
* #### Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
* #### PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
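To make the PromoInterval description concrete, the sketch below checks whether a given date falls in a month where a new Promo2 round starts. The helper `promo2_starts_this_month` and the list `MONTH_ABBR` are hypothetical, the month abbreviations follow the example string above (the real file may spell some months differently), and treating a missing interval as "no Promo2" is an assumption.

```python
import pandas as pd

# Month abbreviations as in the example string above (an assumption for the real file).
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def promo2_starts_this_month(promo_interval, month):
    """Return True if `month` (1-12) is listed in a PromoInterval string."""
    if not isinstance(promo_interval, str):
        return False  # missing interval: treated here as no Promo2 round (assumption)
    return MONTH_ABBR[month - 1] in promo_interval.split(",")

# Made-up rows: two dates for a participating store, one for a store without an interval.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2015-02-10", "2015-03-10", "2015-05-01"]),
    "PromoInterval": ["Feb,May,Aug,Nov", "Feb,May,Aug,Nov", None],
})
toy["Promo2Start"] = [
    promo2_starts_this_month(interval, date.month)
    for interval, date in zip(toy["PromoInterval"], toy["Date"])
]
print(toy)
```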

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable
rossman['Store'].unique()

In [None]:
data=pd.merge(rossman,store_df,on='Store',how='left')

In [None]:
data

In [None]:
data.info()
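When joining daily sales to store attributes with a left merge, it can help to confirm that each store appears only once on the right side and to see which rows found no match. The sketch below shows one way to do that with the `validate` and `indicator` parameters of `pd.merge`; the frames `sales_toy` and `stores_toy` are made up for illustration.

```python
import pandas as pd

# Made-up daily sales and store attributes; store 3 has no attribute row.
sales_toy = pd.DataFrame({"Store": [1, 1, 2, 3], "Sales": [5263, 5020, 6064, 8314]})
stores_toy = pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]})

# validate="many_to_one" raises if a store is duplicated on the right side;
# indicator=True adds a "_merge" column that marks unmatched rows.
merged = pd.merge(sales_toy, stores_toy, on="Store", how="left",
                  validate="many_to_one", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

print(len(merged), len(unmatched))
```

A left merge that passes this check keeps the row count of the sales table unchanged.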

## 3. ***Data Wrangling***

### 1. Competition distance

In [None]:
data[pd.isnull(data['CompetitionDistance'])]

In [None]:
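# Assign the result back; calling fillna(inplace=True) on a selected column may not update `data` in newer pandas versions
data['CompetitionDistance'] = data['CompetitionDistance'].fillna(data['CompetitionDistance'].mean())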

### 2. CompetitionOpenSinceMonth

In [None]:
# Write your code to make your dataset analysis ready.
data['CompetitionOpenSinceMonth']=data['CompetitionOpenSinceMonth'].fillna(0)

In [None]:
data['CompetitionOpenSinceYear']=data['CompetitionOpenSinceYear'].fillna(0)

In [None]:
data['Promo2SinceWeek']=data['Promo2SinceWeek'].fillna(0)

In [None]:
data['Promo2SinceYear']=data['Promo2SinceYear'].fillna(0)

In [None]:
data['PromoInterval']=data['PromoInterval'].fillna(0)

In [None]:
data

### 3. State Holiday

In [None]:
data.loc[data['StateHoliday'] == '0', 'StateHoliday'] = 0
data.loc[data['StateHoliday']=='a','StateHoliday']=1
data.loc[data['StateHoliday']=='b','StateHoliday']=2
data.loc[data['StateHoliday']=='c','StateHoliday']=3
data['StateHoliday']=data['StateHoliday'].astype('int')

In [None]:
data.loc[data['StoreType'] == 'a', 'StoreType']= 0
data.loc[data['StoreType'] == 'b', 'StoreType'] = 1
data.loc[data['StoreType'] == 'c', 'StoreType'] = 2
data.loc[data['StoreType'] == 'd', 'StoreType'] = 3
data['StoreType'] = data['StoreType'].astype('int')


In [None]:
data.loc[data['Assortment'] == 'a', 'Assortment']= 0
data.loc[data['Assortment'] == 'b', 'Assortment']= 1
data.loc[data['Assortment'] == 'c', 'Assortment']= 2
data['Assortment']=data['Assortment'].astype('int')

In [None]:
data['CompetitionOpenSinceMonth'].unique()

In [None]:
data['Date'] = pd.to_datetime(data['Date'], format= '%Y-%m-%d')
data['CompetitionOpenSinceYear']= data['CompetitionOpenSinceYear'].astype(int)
data['Promo2SinceYear']= data['Promo2SinceYear'].astype(int)
data['CompetitionDistance']= data['CompetitionDistance'].astype(int)
data['Promo2SinceWeek']= data['Promo2SinceWeek'].astype(int)
data['CompetitionOpenSinceMonth'] =data['CompetitionOpenSinceMonth'].astype(int)

In [None]:
data

### What all manipulations have you done and insights you found?

1. CompetitionDistance has 2642 null values, which could be replaced with the mean, the median, or zero. A distance cannot be zero, so we replaced them with the mean value.

2. CompetitionOpenSinceMonth, CompetitionOpenSinceYear, Promo2SinceWeek, Promo2SinceYear, and PromoInterval have many null values. These could also be replaced with the mean, the median, or zero; in this case we replaced them with zero.

3. StateHoliday, StoreType, and Assortment are categorical columns coded with the letters a, b, c, d. For the model, we replaced the letters with the integers 0, 1, 2, 3.

4. Some columns had string or float data types, so we converted them to integers.
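The choice between mean and median matters when a column is skewed. The sketch below compares the two on made-up distances and also shows the letter-to-integer encoding written with `Series.map`, which is an alternative to repeated `.loc` assignments. The frame `toy` and the dictionary `store_type_codes` are illustrative; the notebook itself fills with the mean.

```python
import numpy as np
import pandas as pd

# Made-up values: one very distant competitor and one missing distance.
toy = pd.DataFrame({
    "CompetitionDistance": [100.0, 200.0, 300.0, 20000.0, np.nan],
    "StoreType": ["a", "b", "c", "d", "a"],
})

mean_fill = toy["CompetitionDistance"].mean()      # pulled upward by the 20000 value
median_fill = toy["CompetitionDistance"].median()  # less sensitive to that value
toy["CompetitionDistance"] = toy["CompetitionDistance"].fillna(median_fill)

# Same encoding as in the wrangling step, expressed as a single mapping.
store_type_codes = {"a": 0, "b": 1, "c": 2, "d": 3}
toy["StoreType"] = toy["StoreType"].map(store_type_codes).astype(int)

print(mean_fill, median_fill)
print(toy)
```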

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(15,6))
sns.barplot(x=data['CompetitionOpenSinceMonth'],y=data['Sales'])
plt.title('Plot between Sales and Competition Open Since month')

##### 1. Why did you pick the specific chart?

We used a bar plot to show the relation between sales and the month in which the nearest competitor opened. It makes it easy to compare the average sales for each of those months.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(15,6))
sns.barplot(x=data['CompetitionOpenSinceYear'],y=data['Sales'])
plt.title('Plot between Sales and Competition Open Since year')

##### 1. Why did you pick the specific chart?

This graph shows the relation between sales and the year in which the nearest competitor opened. The highest sales occur for the year 1900 and then again for the year 2013.
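Missing competition years were filled with 0 during wrangling, so a 0 category in this chart is a placeholder rather than a real year, and a bar that rests on only a few rows can look extreme. A minimal sketch of how one might check both points is below; the frame `toy` and its values are made up.

```python
import pandas as pd

# Made-up rows; 0 stands for "year unknown" after the fillna(0) step.
toy = pd.DataFrame({
    "CompetitionOpenSinceYear": [0, 0, 1900, 2008, 2008, 2013],
    "Sales": [5000, 7000, 9000, 6000, 6500, 8000],
})

# Drop the placeholder, then report the mean together with the number of rows behind it.
known = toy[toy["CompetitionOpenSinceYear"] > 0]
summary = known.groupby("CompetitionOpenSinceYear")["Sales"].agg(["mean", "count"])

print(summary)
```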

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(15,6))
sns.lineplot(x=data['Promo2SinceWeek'],y=data['Sales'])
plt.title('Plot between Sales and promo2 since week')

##### 1. Why did you pick the specific chart?

This lineplot chart shows the relation between sales and promo2 since week. 

##### 2. What is/are the insight(s) found from the chart?

This clearly shows that the highest sales happen between the 40th and the 50th week. 

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(15,6))
sns.pointplot(x=data['Promo2SinceYear'],y=data['Sales'])
plt.title('Plot between Sales and promo2 since year')

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(15,6))
sns.pointplot(x= 'DayOfWeek', y= 'Sales', data=data)
plt.title('Plot between Sales and day of week')

##### 1. Why did you pick the specific chart?

This pointplot chart shows the relation between sales and days of week. 

##### 2. What is/are the insight(s) found from the chart?

The point plot shows that sales are high in the first days of the week and fall off on the last day.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

From this chart, we can conclude that demand is highest at the beginning of the week, so this is the right time to attract more customers, which can be done through advertisements, sales, and discounts.
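Average sales per weekday include days on which stores are closed, and those rows have zero sales. The sketch below compares the average over all rows with the average over open days only, on made-up data; the frame `toy` and the names `all_days` and `open_days` are illustrative.

```python
import pandas as pd

# Made-up rows: day 7 has two closed-store rows with zero sales.
toy = pd.DataFrame({
    "DayOfWeek": [1, 1, 2, 7, 7, 7],
    "Open":      [1, 1, 1, 0, 0, 1],
    "Sales":     [8000, 9000, 7000, 0, 0, 6000],
})

all_days = toy.groupby("DayOfWeek")["Sales"].mean()
open_days = toy[toy["Open"] == 1].groupby("DayOfWeek")["Sales"].mean()

print(pd.DataFrame({"all_rows": all_days, "open_only": open_days}))
```

If the two columns differ a lot for a weekday, the low average there mostly reflects closures rather than weak sales at open stores.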

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(15,6))
sns.barplot(x=data['Assortment'],y=data['Sales'])
plt.title('comparison between assortment and sales')

##### 1. Why did you pick the specific chart?

A bar plot of Assortment against Sales shows how the assortment level relates to average sales.

##### 2. What is/are the insight(s) found from the chart?

Here 0 represents stores with the basic assortment, and 1 and 2 represent stores with the extra and extended assortments. Stores with the extra or extended assortment have higher sales than stores with the basic assortment.



#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(15,6))
sns.boxplot(x=data['StoreType'],y=data['Sales'])
plt.title('comparison between store sales')

##### 1. Why did you pick the specific chart?

A box plot helps to show the spread of sales for each store type and the outliers in the data.

##### 2. What is/are the insight(s) found from the chart?

There are many outliers in the store data.
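The points a box plot draws as outliers are, with seaborn's default settings, those lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below counts them per store type on made-up data; `count_iqr_outliers` is a hypothetical helper and `toy` is illustrative.

```python
import pandas as pd

def count_iqr_outliers(values, k=1.5):
    """Count values lying more than k * IQR below Q1 or above Q3."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((values < lower) | (values > upper)).sum())

# Made-up sales for two store types; type 0 contains one extreme day.
toy = pd.DataFrame({
    "StoreType": [0] * 5 + [1] * 5,
    "Sales": [5000, 5200, 5100, 4900, 20000, 9000, 9100, 8900, 9050, 8950],
})
outliers_per_type = toy.groupby("StoreType")["Sales"].apply(count_iqr_outliers)

print(outliers_per_type)
```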

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(12,6))
sns.barplot(x="StateHoliday", y="Sales", data=data)

##### 1. Why did you pick the specific chart?

A bar plot makes it easy to compare sales across the state holiday categories.

##### 2. What is/are the insight(s) found from the chart?

This shows the Sales during state holiday.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

We can observe that sales increase massively during the holidays.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(15,8))
# Restrict to numeric columns; Date and PromoInterval are not numeric
cor=data.corr(numeric_only=True)
sns.heatmap(cor,annot=True,cmap='coolwarm')

##### 1. Why did you pick the specific chart?

Heatmaps are used to show relationships between two variables, one plotted on each axis.

##### 2. What is/are the insight(s) found from the chart?

This heatmap shows the pairwise correlations between the variables, which helps to spot multicollinearity in the sales data.
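A heatmap with many columns can be hard to read, so it may help to list the strongly correlated pairs directly. The sketch below does this on synthetic data; the frame `toy`, the 0.8 threshold, and the name `strong` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data in which Sales is built from Customers, so the two correlate strongly.
rng = np.random.default_rng(0)
customers = rng.normal(600, 100, size=200)
toy = pd.DataFrame({
    "Customers": customers,
    "Sales": customers * 9 + rng.normal(0, 50, size=200),
    "CompetitionDistance": rng.normal(5000, 1000, size=200),
})

corr = toy.corr()
cols = corr.columns

# Visit each pair once (upper triangle) and keep the pairs above the threshold.
pairs = [(cols[i], cols[j], corr.iloc[i, j])
         for i in range(len(cols)) for j in range(i + 1, len(cols))]
strong = [(a, b, round(r, 3)) for a, b, r in pairs if abs(r) > 0.8]

print(strong)
```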

## ***5. Feature Engineering & Data Pre-processing***

### 1. Correlation

In [None]:
dependent='Sales'
not_needed=('Store','Date','StoreType','PromoInterval')
independent=list(set(data.columns.tolist())-set(not_needed)-{dependent})

In [None]:
for a in independent:
  plt.figure(figsize=(12,8))
  plt.scatter(x=data[a],y=data[dependent])
  plt.xlabel(a)
  plt.ylabel('Sales')
  plt.title('sales vs '+a)
  co=np.polyfit(data[a],data['Sales'],1)
  eq=np.poly1d(co)(data[a])
  plt.plot(data[a],eq,c='r')

### 2. Import libraries

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
import math
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import Lasso, Ridge

### 3. Calculation for VIF

In [None]:
def cal_vif(v):
  vif=pd.DataFrame()
  vif['variable']=v.columns
  vif['VIF']=[variance_inflation_factor(v.values,i) for i in range(v.shape[1])]
  return(vif)

In [None]:
cal_vif(data[[i for i in independent]])

In [None]:
cal_vif(data[[i for i in independent if i not in ['Promo2']]])
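The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. `variance_inflation_factor` works on the matrix it is given, so whether an intercept column is included can change the values. The sketch below computes the VIF by hand with an intercept on synthetic data; `vif_with_intercept` is a hypothetical helper, not part of statsmodels.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF for each column of X, regressing it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for i in range(X.shape[1]):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic features: `c` is nearly a copy of `a`, `b` is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
vifs = vif_with_intercept(np.column_stack([a, b, c]))

print(vifs.round(2))
```

In this example the near-duplicate pair gets a large VIF while the independent feature stays close to 1, which is the pattern to look for when deciding which column to drop.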

### 4. Feature selection

In [None]:
data = pd.get_dummies(data, columns=['PromoInterval'])

In [None]:
dependent='Sales'
not_needed=('Store','Date','StoreType','Promo2')
independent=list(set(data.columns.tolist())-set(not_needed)-{dependent})

In [None]:
independent

In [None]:
dependent

### 4. Selecting data

In [None]:
x=data[independent].values
y=data[dependent].values

### 5. Train the data

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.20,random_state=0)
print(x_train.shape)
print(x_test.shape)

### 6. Data Scaling

In [None]:

scaler=StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

### 7. Linear regression

In [None]:
regressor = LinearRegression()
regressor.fit(x_train, y_train)

### 8. Prediction

In [None]:
y_pred=regressor.predict(x_test)

In [None]:
regressor.score(x_train,y_train)

In [None]:
# scikit-learn metrics expect (y_true, y_pred); the order matters for r2_score
mse=mean_squared_error(y_test,y_pred)
print('MSE:', mse)
rmse=math.sqrt(mse)
print('RMSE: ', rmse)
r2=r2_score(y_test,y_pred)
print('R Square: ', r2)
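MSE and RMSE give the same value whichever array is passed first, but R² does not, because its denominator is the variance of the true values. The sketch below computes both metrics by hand on made-up numbers to show the difference; `rmse`, `r_squared`, and the toy lists are illustrative names.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error; symmetric in its two arguments."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot uses the mean of the true values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up targets and predictions that vary much less than the targets.
y_true_toy = [0.0, 10.0, 20.0, 30.0]
y_pred_toy = [12.0, 14.0, 16.0, 18.0]

print(rmse(y_true_toy, y_pred_toy), rmse(y_pred_toy, y_true_toy))
print(r_squared(y_true_toy, y_pred_toy), r_squared(y_pred_toy, y_true_toy))
```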

In [None]:
regressor.coef_

### 9. Ridge

In [None]:
ridge=Ridge(alpha=0.5)
ridge.fit(x_train,y_train)


In [None]:
y_pred_ridge=ridge.predict(x_test)

In [None]:
ridge.score(x_train,y_train)

In [None]:
# scikit-learn metrics expect (y_true, y_pred); the order matters for r2_score
mse=mean_squared_error(y_test,y_pred_ridge)
print('MSE:', mse)
rmse=math.sqrt(mse)
print('RMSE: ', rmse)
r2=r2_score(y_test,y_pred_ridge)
print('R Square: ', r2)

### 10. Lasso

In [None]:
lasso=Lasso(alpha=2)
lasso.fit(x_train,y_train)

In [None]:
y_pred_lasso=lasso.predict(x_test)

In [None]:
lasso.score(x_train,y_train)

In [None]:
# scikit-learn metrics expect (y_true, y_pred); the order matters for r2_score
mse=mean_squared_error(y_test,y_pred_lasso)
print('MSE:', mse)
rmse=math.sqrt(mse)
print('RMSE: ', rmse)
r2=r2_score(y_test,y_pred_lasso)
print('R Square: ', r2)
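The regularization strength `alpha` is fixed by hand in the Ridge and Lasso cells. One common alternative is to let cross-validation choose it from a list of candidates. The sketch below does this with `RidgeCV` on synthetic data; `X_toy`, `y_toy`, and `candidate_alphas` are made up, and this is not the procedure used in this notebook.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem: three informative features and two noise features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 5))
y_toy = X_toy @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=300)

# Scale the features, then pick alpha by 5-fold cross-validation.
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
model = make_pipeline(StandardScaler(), RidgeCV(alphas=candidate_alphas, cv=5))
model.fit(X_toy, y_toy)

best_alpha = model.named_steps["ridgecv"].alpha_
print(best_alpha, model.score(X_toy, y_toy))
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the validation folds do not influence the scaling.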

# **Conclusion**

The Sales column contains 1017902 values. We trained linear models using several algorithms and obtained a score (R²) of about 84%.
We also tried Ridge and Lasso for a better model, but they did not have much impact.

We concluded that removing the rows with Sales = 0 would remove a lot of information from the dataset, since there are 172871 such rows, which is a large share. We therefore decided not to remove them. We also tried to choose the regularization parameter so that the model does not overfit.


1) The plot of Sales against CompetitionOpenSinceMonth shows sales increasing from November and peaking in December.

2) The plot of Sales against day of week shows that sales are highest on Monday, decline from Tuesday to Saturday, and are close to zero on Sunday.

3) The plot of promotion against Sales shows that promotions help increase sales.

4) The type of store plays an important role in the opening pattern of stores.

5) Type ‘b’ stores never closed except for refurbishment or other reasons.

6) Type ‘b’ stores have comparatively higher sales, which are mostly constant, with peaks on weekends.

7) Assortment level ‘b’ is only offered at store type ‘b’.

8) Most stores remain closed during state holidays. Interestingly, more stores were open during school holidays than during state holidays.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***