# Rossmann Store Sales
                                                         Code written by: Dasari Mohan


## PROJECT OBJECTIVE :

Forecast sales using store, promotion, and competitor data

## CONTEXT :

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied. We are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column.

Data fields

1.	Id - an Id that represents a (Store, Date) duple within the test set
2.	Store - a unique Id for each store
3.	Sales - the turnover for any given day (this is what you are predicting)
4.	Customers - the number of customers on a given day
5.	Open - an indicator for whether the store was open: 0 = closed, 1 = open
6.	StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
7.	SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
8.	StoreType - differentiates between 4 different store models: a, b, c, d
9.	Assortment - describes an assortment level: a = basic, b = extra, c = extended
10.	CompetitionDistance - distance in meters to the nearest competitor store
11.	CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
12.	Promo - indicates whether a store is running a promo on that day
13.	Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
14.	Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
15.	PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
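
To make the PromoInterval format concrete, here is a minimal sketch of how such a string could be checked for a given month. The helper name `promo_restarts` is hypothetical and is not used elsewhere in this notebook; it assumes month abbreviations are spelled exactly as they appear in the interval string.

```python
def promo_restarts(promo_interval, month_abbr):
    """Return True if month_abbr is one of the months named in promo_interval."""
    # A missing interval (NaN or 0) means the store has no Promo2 rounds
    if not isinstance(promo_interval, str):
        return False
    return month_abbr in promo_interval.split(",")


print(promo_restarts("Feb,May,Aug,Nov", "May"))  # True
print(promo_restarts("Feb,May,Aug,Nov", "Jun"))  # False
```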


# My Approach:

1.	Importing the required libraries and reading the dataset. 

    a.	 Merging of the two datasets 
    
    b.	 Understanding the dataset

2.	Exploratory Data Analysis (EDA) – 

    a.	 Data Visualization

3.	Feature Engineering 

    a.	 Dropping of unwanted columns and values (closed stores)
    
    b.	 Filling Missing Values with Imputation
    
    c.   Outliers Detection and removal

4.  Further Exploratory Data Analysis to find out a few exceptional cases.

5.	Label Encoding (Converting categorical variables to numerical values)

6.	Model Building 

    a.	 Performing train test split 
    
    b.	 Linear Regression Model 
    
    c.	 SGD Regression Model 
    
    d.	 Decision Tree Regression Model 
    
    e.	 Random Forest Regression Model

7.	Model Validation 

    a.	 r2 score 
    
    b.	 Mean absolute error 
    
    c.	 Root mean squared error

8.	Creating the final right model and making predictions

9.	Feature Importance Analysis

10.	Conclusion

In [None]:
# Importing Required Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
store_data = pd.read_csv('../input/c/rossmann-store-sales/store.csv')
store_data.head(5)

In [None]:
train_data = pd.read_csv('../input/c/rossmann-store-sales/train.csv')
train_data.head(5)

In [None]:
combined_data = pd.merge(store_data,train_data,on='Store')
combined_data.head(5)

## Exploratory Data Analysis

In [None]:
combined_data.shape

In [None]:
# Checking for null values
combined_data.isnull().mean()*100

In [None]:
# Unique values
columns = list(combined_data.columns)
columns.remove('Date')
columns.remove('CompetitionDistance')
for i in columns:
    # Print the column name so each list of values can be identified
    print('Unique values in column', i, ':', combined_data[i].unique())

## Data Visualization

In [None]:
# extracting year and month from Date 
combined_data['year'] = combined_data['Date'].apply(lambda x : int(str(x)[0:4]))
combined_data['month'] = combined_data['Date'].apply(lambda x : int(str(x)[5:7]))

# Sales with respect to year 
sns.barplot(x='year', y='Sales', data=combined_data).set(title='Year vs Sales')
# sns.barplot(x='month',y='Sales', data=combined_data).set(title='Month vs Sales')

plt.show()

#### Observation:
Sales have been increasing year to year
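
The year and month above are extracted by slicing the date string. A minimal alternative sketch, assuming the dates are in `YYYY-MM-DD` format, parses the column once with `pd.to_datetime` and reads the fields from the `.dt` accessor. The `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for combined_data
toy = pd.DataFrame({
    "Date": ["2013-01-01", "2014-06-15", "2015-07-31"],
    "Sales": [0, 5263, 6064],
})

# Parse the strings once, then read calendar fields from the .dt accessor
dates = pd.to_datetime(toy["Date"], format="%Y-%m-%d")
toy["year"] = dates.dt.year
toy["month"] = dates.dt.month

print(toy)
```

Parsing also fails loudly on a malformed date, whereas string slicing would silently produce a wrong number.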

In [None]:
sns.barplot(x='DayOfWeek',y='Sales',data=combined_data).set(title='Sales vs Day of Week')

#### Observation
Sales on 1 (Monday) and 5 (Friday) are the highest

In [None]:
# Let's see how Promo impacts sales
sns.barplot(x='Promo',y='Sales',data=combined_data).set(title='Sales on Promo')

#### Observation:
Customers are clearly attracted by promotions, so sales are higher when a store is running a promo.

In [None]:
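# StateHoliday column has both the integer 0 and the string "0", so convert the integer 0 to "0"

# Using .loc on the DataFrame itself avoids chained assignment, which may modify a copy
combined_data.loc[combined_data['StateHoliday'] == 0, 'StateHoliday'] = '0'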

# Sales with respect to State Holiday
sns.barplot(x='StateHoliday', y='Sales', data=combined_data).set(title='State Holiday vs Sales')
plt.show()

#### Observation:

Most stores are closed on state holidays, which is why sales are very low for a, b and c,

where

a = Public Holiday,
b = Easter Holiday,
c = Christmas,
0 = No Holiday, Working day
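
The mixed `0` / `"0"` values handled above can also be normalized in one step by casting the whole column to strings. A minimal sketch on a made-up Series (the `state_holiday` values are hypothetical):

```python
import pandas as pd

# Hypothetical sample mimicking the mixed integer/string values in StateHoliday
state_holiday = pd.Series([0, "0", "a", "b", 0, "c"], dtype=object)

# Casting every element to str turns the integer 0 into the string "0"
normalized = state_holiday.astype(str)

print(normalized.unique())
```

Another option, not used in this notebook, is to pass `dtype={'StateHoliday': str}` to `pd.read_csv` so the column is read as text from the start.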

In [None]:
# Sales with respect to School Holiday
sns.barplot(x='SchoolHoliday', y='Sales', data=combined_data).set(title='School Holiday vs Sales')

#### Observation:
Sales are higher on school holidays

In [None]:
# Sales with respect to Storetype
sns.barplot(x='StoreType', y='Sales', data=combined_data).set(title='StoreType vs Sales')

#### Observation:
Of the four store models (a, b, c, d), type 'b' stores have the highest sales

In [None]:
# Sales with respect to Assortment
sns.barplot(x='Assortment', y='Sales', data=combined_data).set(title='Assortment vs Sales')

#### Observation:
Assortment level 'b' has the highest sales

### Filling Missing Values and Removing Outliers
A few columns have a high number of missing values, so we need to fill them with an appropriate method for better results.

##### Approach
1: The null values in the columns Promo2SinceWeek, Promo2SinceYear and PromoInterval occur because Promo2 is 0 for those stores, so we fill all the null values in these columns with 0.

2: CompetitionDistance is missing for 3 stores, so we fill it with the mean of the distances given for all other stores.

3: CompetitionOpenSinceMonth and CompetitionOpenSinceYear can be filled with the most frequently occurring month and year, respectively.
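
The three rules above can be sketched on a miniature, made-up version of the store table. The `store` frame and its values are hypothetical; only the column names match the dataset. For simplicity this sketch fills PromoInterval with the string "0" directly, which is an assumption that differs slightly from the notebook (it fills with the integer 0 and converts later).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of store.csv with a few of its columns
store = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "CompetitionDistance": [100.0, np.nan, 300.0, 200.0],
    "CompetitionOpenSinceMonth": [9.0, 9.0, np.nan, 4.0],
    "Promo2": [0, 1, 0, 1],
    "Promo2SinceYear": [np.nan, 2010.0, np.nan, 2012.0],
    "PromoInterval": [np.nan, "Jan,Apr,Jul,Oct", np.nan, "Feb,May,Aug,Nov"],
})

# Rule 1: Promo2 fields are missing only where Promo2 == 0, so fill with zero
store["Promo2SinceYear"] = store["Promo2SinceYear"].fillna(0)
store["PromoInterval"] = store["PromoInterval"].fillna("0")

# Rule 2: fill the missing distance with the mean of the known distances
mean_distance = store["CompetitionDistance"].mean()
store["CompetitionDistance"] = store["CompetitionDistance"].fillna(mean_distance)

# Rule 3: fill the missing month with the most frequent month
mode_month = store["CompetitionOpenSinceMonth"].mode()[0]
store["CompetitionOpenSinceMonth"] = store["CompetitionOpenSinceMonth"].fillna(mode_month)

print(store.isnull().sum())
```

Assigning the result of `fillna` back to the column, rather than relying on `inplace=True` on a selected column, keeps the behaviour the same across pandas versions.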

In [None]:
store_data.isnull().sum()

In [None]:
train_data.isnull().sum()

#### Observation:
Here we can clearly see that only the store data has null values, so only the store data needs its missing values filled

In [None]:
# Filling Promo2SinceWeek, Promo2SinceYear, PromoInterval with 0
# fillna(..., inplace=True) returns None, so assign the filled column back instead
store_data['Promo2SinceWeek'] = store_data['Promo2SinceWeek'].fillna(0)
store_data['Promo2SinceYear'] = store_data['Promo2SinceYear'].fillna(0)
store_data['PromoInterval'] = store_data['PromoInterval'].fillna(0)

In [None]:
# Filling CompetitionDistance with mean distance
mean_CompetitionDistance = store_data['CompetitionDistance'].mean()
store_data['CompetitionDistance'] = store_data['CompetitionDistance'].fillna(mean_CompetitionDistance)

In [None]:
# Filling CompetitionOpenSinceMonth, CompetitionOpenSinceYear with most occurring month and year respectively
mode_CompetitionOpenSinceMonth = store_data['CompetitionOpenSinceMonth'].mode()[0]
mode_CompetitionOpenSinceYear = store_data['CompetitionOpenSinceYear'].mode()[0]

store_data['CompetitionOpenSinceMonth'] = store_data['CompetitionOpenSinceMonth'].fillna(mode_CompetitionOpenSinceMonth)
store_data['CompetitionOpenSinceYear'] = store_data['CompetitionOpenSinceYear'].fillna(mode_CompetitionOpenSinceYear)

store_data.isnull().sum()

In [None]:
combined_data = pd.merge(store_data,train_data,on='Store')
print(combined_data.shape)
combined_data.head(5)

In [None]:
combined_data.isnull().mean()*100

#### Great ! We don't have any null values, we can proceed further

In [None]:
combined_data.plot(x='CompetitionDistance',y='Sales',kind='scatter',figsize =(10,6))

#### Observation:
From the above plot we can say that the nearer the competitor stores are, the higher the sales in Rossmann stores.

## Finding Outliers

In [None]:
sns.displot(combined_data,x='Sales',bins=60)

#### Observation:
As we can see in the distribution plot, very few sales values are greater than 25k, so they might be outliers.

### Z-Score: If the Z-Score of a data point is greater than 3 (the threshold), it can be considered an outlier

In [None]:
mean_sales = np.mean(combined_data['Sales'])
std_sales = np.std(combined_data['Sales'])

threshold = 3

outliers = []
for i in combined_data['Sales']:
    z_score = (i-mean_sales)/std_sales
    if z_score > threshold:
        outliers.append(i)
        
print('Total No.of outliers in dataset: ', len(outliers))

sns.displot(x=outliers,bins=20).set(title='Outliers Distribution')
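
The loop above can also be written as a vectorized NumPy expression, using z = (x - mean) / std for every value at once. This is a minimal sketch on made-up numbers (the `sales` array is hypothetical). Like the loop, it flags only the upper tail and uses the population standard deviation that `np.std` computes by default.

```python
import numpy as np

# Hypothetical sales values: nineteen ordinary days and one extreme day
sales = np.array([10.0] * 19 + [1000.0])

# Z-score of every value in one step
z_scores = (sales - sales.mean()) / sales.std()

# Keep only the values more than 3 standard deviations above the mean
threshold = 3
outliers = sales[z_scores > threshold]

print('Total number of outliers:', len(outliers))
```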

In [None]:
# Percentage of Outliers 
zero_sales = combined_data.loc[combined_data['Sales']==0]

sales_greater_than_25k = combined_data.loc[combined_data['Sales'] > 25000]

print('Length of the dataset:', len(combined_data))
print('Percentage of Zeros in dataset: %.3f%%' %((len(zero_sales)/len(combined_data))*100))
print('Percentage of sales greater than 25k in dataset: %.3f%% ' %((len(sales_greater_than_25k)/len(combined_data))*100))

#### Observation:
We can drop the data points with sales greater than 25k, as they make up a very small percentage of the dataset and are probably outliers

In [None]:
combined_data.drop(combined_data.loc[combined_data['Sales'] > 25000].index,inplace=True)

In [None]:
combined_data.shape

### Some exceptional cases
Looking for a scenario where the stores are open and yet there are no sales on that day

In [None]:
no_sales = combined_data.loc[(combined_data['Sales']==0) & (combined_data['Open'] == 1) & (combined_data['StateHoliday'] == 0) 
                               & (combined_data['SchoolHoliday'] == 0)]
print(no_sales.shape)
no_sales.head()

#### Observation:
There are a total of 12 dates with no recorded sales even though there was no holiday. We can remove these data points too, as they are exceptional cases

In [None]:
combined_data.drop(combined_data.loc[(combined_data['Sales']==0) & (combined_data['Open'] == 1)
                                     & (combined_data['StateHoliday'] == 0) & 
                                     (combined_data['SchoolHoliday'] == 0)].index,inplace=True)
print(combined_data.shape)

In [None]:
combined_data.head()

### Converting Categorical Variable to Numeric

In [None]:
combined_data['Year'] = combined_data['Date'].apply(lambda x: int(str(x)[0:4]))
combined_data['Month'] = combined_data['Date'].apply(lambda x: int(str(x)[5:7]))
combined_data.drop(['Date'],axis=1,inplace=True)

combined_data.head(5)

In [None]:
combined_data.dtypes

In [None]:
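# StateHoliday column has both the integer 0 and the string "0", so convert the integer 0 to "0"

# Using .loc on the DataFrame itself avoids chained assignment, which may modify a copy
combined_data.loc[combined_data['StateHoliday'] == 0, 'StateHoliday'] = '0'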

In [None]:
# PromoInterval column mixes the integer 0 (filled in above) with strings, so convert 0 to "0"

# Using .loc on the DataFrame itself avoids chained assignment, which may modify a copy
combined_data.loc[combined_data['PromoInterval'] == 0, 'PromoInterval'] = '0'

In [None]:
combined_data['PromoInterval'].head()

In [None]:
# Encoding all categorical variables to numeric values
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

combined_data['StoreType'] = label_encoder.fit_transform(combined_data['StoreType'])
combined_data['Assortment'] = label_encoder.fit_transform(combined_data['Assortment'])
combined_data['StateHoliday'] = label_encoder.fit_transform(combined_data['StateHoliday'])
combined_data['PromoInterval'] = label_encoder.fit_transform(combined_data['PromoInterval'])

combined_data.head()
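
To see what `LabelEncoder` does to a column, here is a minimal sketch on made-up store types (the input list is hypothetical). Labels are sorted and numbered from 0, and the fitted `classes_` attribute records which label got which code.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical StoreType values
store_types = ["a", "c", "b", "a", "d"]

encoder = LabelEncoder()
codes = encoder.fit_transform(store_types)

# classes_ holds the sorted labels; the position of each label is its code
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}

print(codes)
print(mapping)
```

Because the cell above re-fits a single encoder for each column, only the last column's mapping remains available afterwards. Keeping one encoder per column would be one way to retain every mapping.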

In [None]:
# Correlation
correlation = combined_data.corr()
correlation

In [None]:
# Heat Map
plt.figure(figsize=(18,10))
sns.heatmap(correlation, annot=True, linewidths=0.2, cmap='BrBG')

#### Observation:
Correlation map shows

Sales is highly correlated with Customers, Open and Promo, and weakly correlated with SchoolHoliday
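
A heat map can be hard to read with many columns. A minimal sketch, on a made-up frame (the `toy` values are hypothetical), of ranking features by the strength of their correlation with Sales:

```python
import pandas as pd

# Hypothetical toy data with a few of the dataset's column names
toy = pd.DataFrame({
    "Sales": [0, 5000, 7000, 0, 9000],
    "Customers": [0, 500, 650, 0, 900],
    "Open": [0, 1, 1, 0, 1],
    "DayOfWeek": [7, 1, 2, 7, 5],
})

# Take the Sales column of the correlation matrix, drop Sales itself,
# and sort by absolute value so strong negative correlations rank high too
corr_with_sales = toy.corr()["Sales"].drop("Sales").sort_values(key=abs, ascending=False)

print(corr_with_sales)
```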


## Building a Regression Model

#### Here we want our ML model to predict sales only when stores are open, since we know there will be no sales if the store is closed

In [None]:
combined_data_open = combined_data[combined_data['Open']==1]
combined_data_closed = combined_data[combined_data['Open']==0]

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error
import math

X_train, X_test, y_train, y_test_open = train_test_split(combined_data_open.drop(['Sales','Customers','Open'],axis=1),
                                                        combined_data_open['Sales'], test_size=0.2, random_state=23)

In [None]:
X_train.columns

In [None]:
y_train.head()

In [None]:
y_test_closed = np.zeros(combined_data_closed.shape[0])
y_test = np.append(y_test_open, y_test_closed)

## Linear Regression Algorithm

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train,y_train)

In [None]:
prediction_open = model.predict(X_test)
prediction_closed = np.zeros(combined_data_closed.shape[0])

y_predict = np.append(prediction_open,prediction_closed)

In [None]:
# Performance of the model

print('r2_score:',r2_score(y_test,y_predict))
print('Mean absolute error: %.2f' % mean_absolute_error(y_test,y_predict))
print('Root mean squared error: ', math.sqrt(mean_squared_error(y_test,y_predict)))
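
The same three metrics are printed for every model in this notebook, so they could be wrapped in a small helper. This is a minimal sketch; the function name `evaluate` and the sample arrays are hypothetical and are not used by the cells in this notebook.

```python
import math

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def evaluate(y_true, y_pred):
    """Return r2 score, mean absolute error and root mean squared error."""
    return {
        "r2": r2_score(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": math.sqrt(mean_squared_error(y_true, y_pred)),
    }


# Hypothetical values: the last entry mimics a closed store (actual 0, predicted 0)
y_true = np.array([100.0, 200.0, 300.0, 0.0])
y_pred = np.array([110.0, 190.0, 300.0, 0.0])

scores = evaluate(y_true, y_pred)
print(scores)
```

Note that the closed-store rows are appended with an actual value of 0 and a prediction of 0, so they count as perfect predictions. Scores computed on the combined arrays will therefore look better than scores computed on open stores alone.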

In [None]:
plt.figure(figsize=(8,8))
plt.scatter(y_test,y_predict)

p1 = max(max(y_predict),max(y_test))
p2 = min(min(y_predict),min(y_test))
plt.plot([p1,p2],[p1,p2],c='r')
plt.xlabel('Actual values')
plt.ylabel('Predicted values')

#### Observation:
From the above plot we can see that the Linear Regression model performs poorly: it makes no predictions above 10000, even when actual sales are 25000.

## SGD Regression Algorithm

In [None]:
from sklearn.linear_model import SGDRegressor

model = SGDRegressor()
model.fit(X_train,y_train)

prediction_open = model.predict(X_test)
prediction_closed = np.zeros(combined_data_closed.shape[0])

y_predict = np.append(prediction_open,prediction_closed)

# Performance of the model

print('r2_score:',r2_score(y_test,y_predict))
print('Mean absolute error: %.2f' % mean_absolute_error(y_test,y_predict))
print('Root mean squared error: ', math.sqrt(mean_squared_error(y_test,y_predict)))

plt.figure(figsize=(8,8))
plt.scatter(y_test,y_predict)

p1 = max(max(y_predict),max(y_test))
p2 = min(min(y_predict),min(y_test))
plt.plot([p1,p2],[p1,p2],c='r')
plt.xlabel('Actual values')
plt.ylabel('Predicted values')

#### Observation:
The SGD regressor performs worse than Linear Regression, as it gives a negative r2 score. Let's look at other regression models.

## Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor()
model.fit(X_train,y_train)

prediction_open = model.predict(X_test)
prediction_closed = np.zeros(combined_data_closed.shape[0])

y_predict = np.append(prediction_open,prediction_closed)

# Performance of the model

print('r2_score:',r2_score(y_test,y_predict))
print('Mean absolute error: %.2f' % mean_absolute_error(y_test,y_predict))
print('Root mean squared error: ', math.sqrt(mean_squared_error(y_test,y_predict)))

plt.figure(figsize=(8,8))
plt.scatter(y_test,y_predict)

p1 = max(max(y_predict),max(y_test))
p2 = min(min(y_predict),min(y_test))
plt.plot([p1,p2],[p1,p2],c='r')
plt.xlabel('Actual values')
plt.ylabel('Predicted values')

#### Observation:
The decision tree regressor performs well compared to the Linear and SGD regressors

## Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

random_forest_model = RandomForestRegressor(n_estimators=100)
random_forest_model.fit(X_train,y_train)

prediction_open = random_forest_model.predict(X_test)
prediction_closed = np.zeros(combined_data_closed.shape[0])

y_predict = np.append(prediction_open,prediction_closed)

# Performance of the model

print('r2_score:',r2_score(y_test,y_predict))
print('Mean absolute error: %.2f' % mean_absolute_error(y_test,y_predict))
print('Root mean squared error: ', math.sqrt(mean_squared_error(y_test,y_predict)))

plt.figure(figsize=(8,8))
plt.scatter(y_test,y_predict)

p1 = max(max(y_predict),max(y_test))
p2 = min(min(y_predict),min(y_test))
plt.plot([p1,p2],[p1,p2],c='r')
plt.xlabel('Actual values')
plt.ylabel('Predicted values')

#### Observation:
Random Forest regressor had the lowest error as compared to other models, which means it is better at predicting sales than other models.

### Understanding the important features

In [None]:
# getting weights of all the features used in the data
feature_importance = random_forest_model.feature_importances_
feature_importance

In [None]:
# features used
columns = list(X_train.columns)
columns

In [None]:
# Let's make a dataframe consisting of the features and their importance values
feature_importance_df = pd.DataFrame({'Features':columns, 'Values':feature_importance})
feature_importance_df

In [None]:
feature_importance_df.sort_values(by=["Values"], inplace=True, ascending=False)
feature_importance_df

In [None]:
# Feature Importance
plt.figure(figsize=(15,6))

sns.barplot(x='Features', y='Values', data=feature_importance_df).set(title='Feature Importance')

plt.xticks(rotation=90)
plt.show()
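
The feature/importance pairing and sorting done above can also be expressed with a single pandas Series. A minimal sketch with made-up numbers (the `features` and `importances` values are hypothetical, not results from the model in this notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical feature names and importances
features = ["Store", "Promo", "CompetitionDistance"]
importances = np.array([0.2, 0.5, 0.3])

# Index the importances by feature name and sort from most to least important
ranked = pd.Series(importances, index=features).sort_values(ascending=False)

print(ranked)
```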

# Conclusion:

1. A shorter competition distance makes stores more competitive, and promotions can help them boost their sales.

2. Store type affects sales: of the four store models (a, b, c, d), type 'b' stores have the highest sales.

3. Promotions can help stores compete and lead to more sales.

4. Sales on 1 (Monday) and 5 (Friday) are the highest.

5. Assortment level 'b' has the highest sales.

6. Customers are clearly attracted by promotions, so sales are higher when a store is running a promo.

7. Since most of the stores are closed on state holidays, the StateHoliday feature has no effect on sales.