# DESCRIPTION

Walmart, one of the leading retail chains in the US, wants to predict sales and demand accurately. Certain events and holidays affect sales on particular days. Sales data are available for 45 Walmart stores. The business struggles with unforeseen demand and sometimes runs out of stock because its current machine learning algorithm is a poor fit for the problem. An ideal ML algorithm would predict demand accurately and take into account economic factors such as CPI and the Unemployment Index.

Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labour Day, Thanksgiving, and Christmas. The weeks that include these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data. Historical sales data for 45 Walmart stores located in different regions are available.
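The five-times weighting of holiday weeks can be made concrete with a weighted error metric. The sketch below assumes the evaluation is a weighted mean absolute error; the text above states only the weighting, so the metric form, the function name `weighted_mae`, and the toy numbers are illustrative assumptions.

```python
import numpy as np

def weighted_mae(y_true, y_pred, is_holiday, holiday_weight=5.0):
    """Mean absolute error in which holiday weeks count `holiday_weight` times.

    Assumption: the evaluation is a weighted MAE; the source only says that
    holiday weeks are weighted five times higher than non-holiday weeks.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.where(np.asarray(is_holiday, dtype=bool), holiday_weight, 1.0)
    return float(np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights))

# Toy values, made up for illustration: the last week is a holiday week.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 250.0]
holiday = [0, 0, 1]

score = weighted_mae(actual, predicted, holiday)
print(score)
```

Under such a metric, an error made in a holiday week costs as much as the same error repeated in five ordinary weeks, which is why holiday behaviour deserves extra attention when modeling.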

Dataset Description

This is the historical data that covers sales from 2010-02-05 to 2012-11-01, in the file Walmart_Store_sales. Within this file you will find the following fields:

    Store - the store number

    Date - the week of sales

    Weekly_Sales -  sales for the given store

    Holiday_Flag - whether the week is a special holiday week (1 = holiday week, 0 = non-holiday week)

    Temperature - Temperature on the day of sale

    Fuel_Price - Cost of fuel in the region

    CPI - Prevailing consumer price index

    Unemployment - Prevailing unemployment rate

Holiday Events

Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
Labour Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13

Analysis Tasks

Basic Statistics tasks

    Which store has maximum sales

    Which store has maximum standard deviation i.e., the sales vary a lot. Also, find out the coefficient of mean to standard deviation (standard deviation divided by mean, i.e. the coefficient of variation)

    Which store/s has good quarterly growth rate in Q3’2012

    Some holidays have a negative impact on sales. Find out holidays which have higher sales than the mean sales in non-holiday season for all stores together

    Provide a monthly and semester view of sales in units and give insights

Statistical Model

For Store 1 – Build prediction models to forecast demand

    Linear Regression – Utilize variables like date and restructure dates as 1 for 5 Feb 2010 (starting from the earliest date in order). Hypothesize if CPI, unemployment, and fuel price have any impact on sales.

    Change dates into days by creating new variable.

Select the model which gives best accuracy.
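The Store 1 task above can be sketched as follows. This is a minimal illustration on a made-up stand-in for the Store 1 rows, not the notebook's own model: the frame `store1`, the column name `date_index`, and all numbers are assumptions. It numbers the weeks from 1 starting at the earliest date and fits a linear regression on that index together with CPI, unemployment, and fuel price.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the Store 1 rows; every number here is made up for illustration.
store1 = pd.DataFrame({
    "Date": pd.date_range("2010-02-05", periods=8, freq="7D"),
    "CPI": [211.10, 211.24, 211.29, 211.32, 211.35, 211.38, 211.22, 211.02],
    "Unemployment": [8.11, 8.11, 8.11, 8.11, 8.11, 8.11, 8.11, 8.11],
    "Fuel_Price": [2.57, 2.55, 2.51, 2.56, 2.63, 2.67, 2.72, 2.73],
    "Weekly_Sales": [1.64e6, 1.64e6, 1.61e6, 1.41e6, 1.55e6, 1.44e6, 1.47e6, 1.40e6],
})

# Restructure dates: 1 for the earliest week (5 Feb 2010), 2 for the next, and so on.
store1 = store1.sort_values("Date").reset_index(drop=True)
store1["date_index"] = range(1, len(store1) + 1)

features = ["date_index", "CPI", "Unemployment", "Fuel_Price"]
model = LinearRegression().fit(store1[features], store1["Weekly_Sales"])

# Coefficient signs hint at the direction of each variable's association with sales.
coefficients = pd.Series(model.coef_, index=features)
print(coefficients)
```

On the real data, the same steps would start from `data[data['Store'] == 1]`; the coefficients from a toy frame this small carry no meaning beyond showing the mechanics.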


In [None]:
# Import necessary libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import dates
from datetime import datetime


In [None]:
data=pd.read_csv("data/Walmart_Store_sales.csv")

In [None]:
data.head()

In [None]:
data.isnull().sum()

In [None]:
data.describe()

In [None]:
data['Date']=pd.to_datetime(data['Date'], dayfirst=True)  # assumes day-first strings (DD-MM-YYYY), matching the holiday lists used later

In [None]:
data.info()

In [None]:
data.dtypes

In [None]:
data['days']=pd.DatetimeIndex(data['Date']).day

In [None]:
data['Month']=pd.DatetimeIndex(data["Date"]).month

In [None]:
data['year']=pd.DatetimeIndex(data["Date"]).year

In [None]:
data.head()

In [None]:
maxsale=data.groupby('Store')['Weekly_Sales'].sum().sort_values(ascending=False)

In [None]:
maxsale.idxmax()  # store number with the highest total sales

In [None]:
maxsale.index[0]


In [None]:
maxsale.plot(kind="bar",figsize=(16,6),stacked=True,
                                             title="Total sales for each store")

# Which store has maximum standard deviation i.e., the sales vary a lot. Also, find out the coefficient of mean to standard deviation

In [None]:
maxstd=pd.DataFrame(data.groupby('Store')['Weekly_Sales'].std().sort_values(ascending=False))  # std(), not sum(): rank stores by sales variability

In [None]:
print("Store "+str(maxstd.index[0])+" has the maximum standard deviation: {0:.0f} $".format(maxstd['Weekly_Sales'].iloc[0]))

In [None]:
plt.figure(figsize=(15,7))
sns.histplot(data[data['Store']==maxstd.index[0]]['Weekly_Sales'], kde=True)  # histplot replaces the deprecated distplot
plt.title('The Sales Distribution of Store #'+ str(maxstd.head(1).index[0]));

In [None]:
co_mean_std = pd.DataFrame(data.groupby('Store')['Weekly_Sales'].std()/data.groupby('Store')['Weekly_Sales'].mean())

In [None]:
co_mean_std=co_mean_std.rename(columns={'Weekly_Sales':'Coefficient of mean to standard deviation'})

In [None]:
co_mean_std

In [None]:
co_mean_max_std=co_mean_std.sort_values(by='Coefficient of mean to standard deviation')

In [None]:
plt.figure(figsize=(15,7))
sns.histplot(data[data['Store']==co_mean_max_std.tail(1).index[0]]['Weekly_Sales'], kde=True)  # histplot replaces the deprecated distplot
plt.title('The Sales Distribution of Store #'+ str(co_mean_max_std.tail(1).index[0])+' (highest coefficient of variation)');

# Which store/s has good quarterly growth rate in Q3’2012

In [None]:
# Select the weekly dates inside each calendar quarter (end bounds are exclusive)
q3=data[(data['Date']>= '2012-07-01') & (data['Date'] < '2012-10-01')].groupby('Store')["Weekly_Sales"].sum()
q2=data[(data['Date']>= '2012-04-01') & (data['Date'] < '2012-07-01')].groupby('Store')["Weekly_Sales"].sum()

In [None]:
q3.head(3)

In [None]:
plt.figure(figsize=(16,6))
q2.plot(ax=q3.plot(kind='bar',legend=True),kind='bar',color='r',alpha=0.2,legend=True);
plt.legend(["Q3' 2012", "Q2' 2012"]);

In [None]:
growth=(q3-q2)/q2*100  # growth rate of Q3 over Q2 per store, in percent
print('Store with the best quarterly growth rate in Q3’2012 is Store '+str(growth.idxmax())+' with {0:.2f} %'.format(growth.max()))

# Some holidays have a negative impact on sales. Find out holidays which have higher sales than the mean sales in non-holiday season for all stores together

In [None]:
data['Holiday_Flag'].unique()

In [None]:
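# Holiday dates are written day-first, so parse them explicitly before matching against data.Date
Super_Bowl=pd.to_datetime(['12-2-2010', '11-2-2011', '10-2-2012', '8-2-2013'], format='%d-%m-%Y')
Labour_Day=pd.to_datetime(['10-9-2010', '9-9-2011', '7-9-2012', '6-9-2013'], format='%d-%m-%Y')
Thanksgiving=pd.to_datetime(['26-11-2010', '25-11-2011', '23-11-2012', '29-11-2013'], format='%d-%m-%Y')
Christmas=pd.to_datetime(['31-12-2010', '30-12-2011', '28-12-2012', '27-12-2013'], format='%d-%m-%Y')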

In [None]:
data.loc[data.Date.isin(Super_Bowl)]

In [None]:
total_sales = data.groupby('Date')['Weekly_Sales'].sum().reset_index()

In [None]:
total_sales

In [None]:
# Yearly Sales in holidays
Super_Bowl_df = pd.DataFrame(data.loc[data.Date.isin(Super_Bowl)].groupby('year')['Weekly_Sales'].sum())
Thanksgiving_df = pd.DataFrame(data.loc[data.Date.isin(Thanksgiving)].groupby('year')['Weekly_Sales'].sum())
Labour_Day_df = pd.DataFrame(data.loc[data.Date.isin(Labour_Day)].groupby('year')['Weekly_Sales'].sum())
Christmas_df = pd.DataFrame(data.loc[data.Date.isin(Christmas)].groupby('year')['Weekly_Sales'].sum())

Super_Bowl_df.plot(kind='bar',legend=False,title='Yearly Sales in Super Bowl holiday') 
Thanksgiving_df.plot(kind='bar',legend=False,title='Yearly Sales in Thanksgiving holiday') 
Labour_Day_df.plot(kind='bar',legend=False,title='Yearly Sales in Labour_Day holiday')
Christmas_df.plot(kind='bar',legend=False,title='Yearly Sales in Christmas holiday')
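The yearly bar charts show holiday totals but do not yet compare them with non-holiday weeks, which is what the task asks for. A minimal sketch of that comparison follows, run on a made-up frame named `toy` so that it is self-contained; the holiday dates are the ones listed in the description, while the sales figures are illustrative assumptions.

```python
import pandas as pd

# Holiday dates from the problem description, written in ISO format.
holiday_dates = {
    "Super Bowl": ["2010-02-12", "2011-02-11", "2012-02-10", "2013-02-08"],
    "Labour Day": ["2010-09-10", "2011-09-09", "2012-09-07", "2013-09-06"],
    "Thanksgiving": ["2010-11-26", "2011-11-25", "2012-11-23", "2013-11-29"],
    "Christmas": ["2010-12-31", "2011-12-30", "2012-12-28", "2013-12-27"],
}

# Toy weekly rows; the sales values are made up for illustration.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-09-10",
                            "2010-11-26", "2010-12-31", "2011-01-07"]),
    "Holiday_Flag": [0, 1, 1, 1, 1, 0],
    "Weekly_Sales": [100.0, 120.0, 108.0, 150.0, 90.0, 110.0],
})

# Baseline: mean weekly sales over all non-holiday weeks.
non_holiday_mean = toy.loc[toy["Holiday_Flag"] == 0, "Weekly_Sales"].mean()

# Mean weekly sales for each holiday, matched by date.
holiday_means = pd.Series({
    name: toy.loc[toy["Date"].isin(pd.to_datetime(dates)), "Weekly_Sales"].mean()
    for name, dates in holiday_dates.items()
})

# Holidays whose mean sales exceed the non-holiday baseline.
above = holiday_means[holiday_means > non_holiday_mean]
print(non_holiday_mean)
print(above)
```

On the real data, `toy` would be replaced by `data`; holidays missing from `above` are the ones that do not lift sales over the non-holiday mean.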

# Provide a monthly and semester view of sales in units and give insights

In [None]:
plt.scatter(data[data.year==2010]["Month"],data[data.year==2010]["Weekly_Sales"])
plt.xlabel("months")
plt.ylabel("Weekly Sales")
plt.title("Monthly view of sales in 2010")
plt.show()

In [None]:
plt.scatter(data[data.year==2011]["Month"],data[data.year==2011]["Weekly_Sales"])
plt.xlabel("Month")
plt.ylabel("Weekly Sales")
plt.title("Monthly sale in 2011")
plt.show()

In [None]:
plt.scatter(data[data.year==2012]["Month"],data[data.year==2012]["Weekly_Sales"])
plt.xlabel("Month")
plt.ylabel("Weekly sales")
plt.title("Monthly sale in 2012")
plt.show()

In [None]:
plt.figure(figsize=(10,6))
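monthly_sales=data.groupby("Month")["Weekly_Sales"].sum()  # aggregate first so each bar shows the month's total
plt.bar(monthly_sales.index,monthly_sales.values)
plt.xlabel("months")
plt.ylabel("Total Sales")
plt.title("Monthly view of sales")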

In [None]:
plt.figure(figsize=(10,6))
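yearly_sales=data.groupby("year")["Weekly_Sales"].sum()  # aggregate first so each bar shows the year's total
plt.bar(yearly_sales.index.astype(str),yearly_sales.values)
plt.xlabel("Year")
plt.ylabel("Total Sales")
plt.title("Yearly view of sales")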
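The plots above cover the monthly and yearly views; the semester view asked for in the task can be obtained by grouping months into two halves. A minimal sketch on a made-up frame named `toy` follows. It assumes semester 1 is January to June and semester 2 is July to December, a definition the task does not spell out.

```python
import pandas as pd

# Toy weekly rows; the sales values are made up for illustration.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-07-02",
                            "2010-12-31", "2011-01-07"]),
    "Weekly_Sales": [100.0, 120.0, 80.0, 150.0, 90.0],
})
toy["year"] = toy["Date"].dt.year
toy["Month"] = toy["Date"].dt.month

# Assumption: semester 1 = January-June, semester 2 = July-December.
toy["Semester"] = (toy["Month"] > 6).astype(int) + 1

# Total sales per (year, month) and per (year, semester).
monthly = toy.groupby(["year", "Month"])["Weekly_Sales"].sum()
semester = toy.groupby(["year", "Semester"])["Weekly_Sales"].sum()
print(monthly)
print(semester)
```

Grouping by year as well as month keeps incomplete years from distorting the comparison, which matters here because the data start in February 2010 and end before December 2012.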

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics
import sklearn.linear_model as lm

In [None]:
data.columns

In [None]:
x=data[['Store','Fuel_Price', 'CPI', 'Unemployment', 'days', 'Month', 'year']]
y=data['Weekly_Sales']  # 1-D target; a single-column DataFrame triggers shape warnings in scikit-learn

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)  # fixed seed so the split is reproducible

In [None]:
x_train.shape,x_test.shape,y_train.shape,y_test.shape

In [None]:
glm=lm.LinearRegression()

In [None]:
glm.fit(x_train,y_train)

In [None]:
metrics.mean_squared_error(y_test,glm.predict(x_test))
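A single MSE value does not say whether CPI, unemployment, or fuel price relate to sales at all. One quick, informal check is the Pearson correlation of each variable with `Weekly_Sales`. The sketch below uses a made-up frame named `toy` with deliberately simple values; it measures linear association only and is not a formal hypothesis test.

```python
import pandas as pd

# Toy rows with deliberately simple relationships; all values are made up.
toy = pd.DataFrame({
    "Weekly_Sales": [10.0, 12.0, 14.0, 16.0, 18.0],
    "CPI": [1.0, 2.0, 3.0, 4.0, 5.0],           # rises with sales
    "Unemployment": [5.0, 4.0, 3.0, 2.0, 1.0],  # falls as sales rise
    "Fuel_Price": [3.0, 2.0, 3.0, 2.0, 3.0],    # no linear relation to sales
})

# Pearson correlation of each economic variable with weekly sales.
corr = toy.corr()["Weekly_Sales"].drop("Weekly_Sales")
print(corr)
```

On the real data the same call would be `data[['Weekly_Sales','CPI','Unemployment','Fuel_Price']].corr()`. Values near zero suggest little linear impact, while a formal test would also need p-values.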

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
li=[]
print(li)

In [None]:

def nestimator(n):
    """Fit a random forest for each n_estimators in range(400, n, 10) and append its test MSE to li."""
    for i in range(400,n,10):
        rfr = RandomForestRegressor(n_estimators = i,max_depth=15,n_jobs=5)
        rfr.fit(x_train,y_train)
        y_pred=rfr.predict(x_test)
        li.append(metrics.mean_squared_error(y_test, y_pred))



In [None]:
nestimator(800)

In [None]:
print(min(li))

In [None]:
400 + 10*li.index(min(li))  # n_estimators value that produced the lowest MSE

In [None]:
li[li.index(min(li))]

In [None]:
# Random Forest Regressor
print('Random Forest Regressor:')
print()
rfr = RandomForestRegressor(n_estimators = 450,max_depth=15,n_jobs=5)        
rfr.fit(x_train,y_train)
y_pred=rfr.predict(x_test)
print('R^2 score (%):',rfr.score(x_test, y_test)*100)  # score() returns R^2 for regressors, not classification accuracy

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

