# Attention Walmart Shoppers
### A Walmart retail analysis

The data was originally retrieved from:
 - https://www.kaggle.com/rutuspatel/retail-analysis-with-walmart-sales-data
 - https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data

### Data Dictionary

| Target                  |  Data Type       | Description                     |
|-------------------------|------------------|---------------------------------|
| next_week_sales_target  |   float64        | Sales in USD per week by store  |


| Column Name             |  Data Type        | Description                                                     |  
|:------------------------|:------------------|:----------------------------------------------------------------|
| store_id                |   int64           | unique identifier for store  (1-45)                             |
| Temperature             |   float64         | temperature in Fahrenheit                                       |
| Fuel_Price              |   float64         | cost of fuel (in USD) in region                                 |
| CPI                     |   float64         | Prevailing consumer price index, cost of goods                  |
| this_week_date          |   datetime64[ns]  | date for current week                                           |
| this_week_sales         |   float64         | sales in USD for current week                                   |     
| this_week_holiday_flag  |   int64           | indicator of a Holiday for current week (boolean)               |
| this_week_unemployment  |   float64         | unemployment rate for current week                              |     
| store_type              |   object          | A: SuperStore, B: Walmart, C: neighborhood Walmart              |
| store_size              |   int64           | size of specific location in sqft                               |
| next_week_1_year_ago    |   float64         | sales for following week of previous year (51 weeks ago)        |
| next_week_date          |   datetime64[ns]  | the date of the following week                                  |
| next_week_holiday_flag  |   float64         | indicator of a Holiday for following week (boolean)             |
| next_week_holiday_name  |   object          | name of holiday for following week                              |  
| christmas               |   uint8           | indicator of Christmas (boolean)                                |     
| labor_day               |   uint8           | indicator of labor day (boolean)                                |    
| pre_christmas           |   uint8           | indicator of pre-christmas: 2 weeks prior to christmas (boolean)|     
| super_bowl              |   uint8           | indicator of super bowl (boolean)                               |     
| thanksgiving            |   uint8           | indicator of Thanksgiving (boolean)                             |
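
The wrangling module is not shown in this notebook, so the following is only a minimal sketch of how lag-style columns such as `next_week_sales_target` and `next_week_1_year_ago` could be derived from raw weekly sales. The helper name `add_lag_features` and the toy frame are assumptions for illustration; the 51-week shift follows the description in the table above.

```python
import pandas as pd

def add_lag_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: derive the next-week target and a year-ago lag per store."""
    out = raw.sort_values(["store_id", "this_week_date"]).copy()
    by_store = out.groupby("store_id")["this_week_sales"]
    # next week's sales: pull the following row up within each store
    out["next_week_sales_target"] = by_store.shift(-1)
    # sales 51 weeks before the current week, i.e. 52 weeks before next week
    out["next_week_1_year_ago"] = by_store.shift(51)
    return out

# toy data: one store, 60 consecutive weeks, sales equal to the week index
toy = pd.DataFrame({
    "store_id": [1] * 60,
    "this_week_date": pd.date_range("2010-02-05", periods=60, freq="7D"),
    "this_week_sales": [float(i) for i in range(60)],
})
lagged = add_lag_features(toy)
```

Rows without a following week or without 51 weeks of history end up as nulls, which would need to be dropped or imputed before modeling.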

## Goal:
- predict next week's sales (in USD) for each store

## Think about...
- What is your goal?
- What is your target? What are the drivers of that target?
- What is one observation? What does one row of your dataset represent?

## Daily meetings
- standup doc
- shared knowledge doc

### Three important Questions
- what did you work on since we last talked?
- what are you planning on working on next?
- what are your blockers?

In [1]:
import pandas as pd
import numpy as np

#visualization
import matplotlib.pyplot as plt
import seaborn as sns

#math
from scipy import stats
import math

#sklearn
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression, LassoLars, TweedieRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.cluster import KMeans
from sklearn.metrics import mean_squared_error, explained_variance_score

#custom modules
import new_wrangle as w

#remove warnings
import warnings
warnings.filterwarnings("ignore")

<hr style="border:2px solid black"> </hr>

# Acquire

In [2]:
#bring in walmart data using new_wrangle.py
df= w.acquire_data()

In [3]:
#take a look
df.head()

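|    | Store | Date       | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI        | Unemployment | Type | Size   |
|---:|------:|:-----------|-------------:|-------------:|------------:|-----------:|-----------:|-------------:|:-----|-------:|
|  0 |     1 | 05-02-2010 |    1643690.9 |            0 |       42.31 |      2.572 | 211.096358 |        8.106 | A    | 151315 |
|  1 |     1 | 12-02-2010 |   1641957.44 |            1 |       38.51 |      2.548 |  211.24217 |        8.106 | A    | 151315 |
|  2 |     1 | 19-02-2010 |   1611968.17 |            0 |       39.93 |      2.514 | 211.289143 |        8.106 | A    | 151315 |
|  3 |     1 | 26-02-2010 |   1409727.59 |            0 |       46.63 |      2.561 | 211.319643 |        8.106 | A    | 151315 |
|  4 |     1 | 05-03-2010 |   1554806.68 |            0 |        46.5 |      2.625 | 211.350143 |        8.106 | A    | 151315 |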


In [4]:
#check for nulls, dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6435 entries, 0 to 6434
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Store         6435 non-null   int64  
 1   Date          6435 non-null   object 
 2   Weekly_Sales  6435 non-null   float64
 3   Holiday_Flag  6435 non-null   int64  
 4   Temperature   6435 non-null   float64
 5   Fuel_Price    6435 non-null   float64
 6   CPI           6435 non-null   float64
 7   Unemployment  6435 non-null   float64
 8   Type          6435 non-null   object 
 9   Size          6435 non-null   int64  
dtypes: float64(5), int64(3), object(2)
memory usage: 553.0+ KB
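
`Date` is stored as `object`, and the sample rows above step by one week only if the strings are read day-first (05-02-2010, 12-02-2010, 19-02-2010). A minimal sketch of parsing such strings with an explicit format is shown here; the variable names are illustrative, and this is not necessarily how `new_wrangle.py` does it.

```python
import pandas as pd

raw_dates = pd.Series(["05-02-2010", "12-02-2010", "19-02-2010"])

# an explicit format makes 05-02-2010 parse as 5 February rather than 2 May
parsed = pd.to_datetime(raw_dates, format="%d-%m-%Y")

# consecutive rows should be exactly one week apart
gaps = parsed.diff().dropna()
```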


<hr style="border:2px solid black"> </hr>

# Prepare

In [5]:
#import the cleaned data using new_wrangle.py
df= w.wrangle_walmart()

NameError: name 'date' is not defined

In [None]:
#make sure that all columns are created
df.info()

In [None]:
#take a look at the data
df.tail().T

In [None]:
df.isnull().sum()

In [None]:
#train test split (only train and test are needed for exploration)
train, test, *_ = w.split_scale(df,'next_week_sales_target')

In [None]:
#take a look at the train split
train.info()

<hr style="border:2px solid black"> </hr>

# Explore

In [None]:
#count of season
#train.season.value_counts()

In [None]:
#counts by holidays
train.next_week_holiday_name.value_counts()

## bivariate exploration

In [None]:
#average weekly sales by store
stores = train.groupby(['store_id']).agg({'next_week_sales_target': ['mean']})

plt.figure(figsize=(20, 10))
plt.bar(stores.index,stores['next_week_sales_target']['mean'])
plt.xticks(np.arange(1, 46, step=1))
plt.ylabel('Next Week Sales (in USD)', fontsize=16)
plt.xlabel('Store', fontsize=16)
plt.show()

In [None]:
#visualize store_type by store_size
plt.figure(figsize=(14, 6))
sns.boxplot(x='store_type', y='store_size', data=train)

### Takeaways:
- Type A: appears to include only the larger stores
- Type B: appears to include the midsized stores
- Type C: appears to include only the smaller stores

- Outliers were addressed (stores 3, 5, 33, and 36 were classified incorrectly)

In [None]:
#visualize stores and weekly sales
plt.figure(figsize=(14,6))
sns.boxplot(x='store_type', y='next_week_sales_target', data=train)

In [None]:
#visualize store type and unemployment rate
plt.figure(figsize=(14,6))
sns.boxplot(x='store_type', y='this_week_unemployment', data=train)

In [None]:
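#correlation matrix on numeric columns only (object and datetime columns are skipped)
walmart = train.corr(numeric_only=True)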
walmart

In [None]:
#this shows correlation with sales
wal_corr = walmart['next_week_sales_target'].sort_values(ascending=False)
wal_corr

In [None]:
plt.figure(figsize=(8, 12))
#use the train split (not the full df) so exploration does not peek at test data
target_corr = train.corr(numeric_only=True)[['next_week_sales_target']].sort_values(by='next_week_sales_target', ascending=False)
heatmap = sns.heatmap(target_corr, vmin=-1, vmax=1, annot=True, cmap='mako_r')
heatmap.set_title('Features Correlating with Next Week Sales', fontdict={'fontsize':18}, pad=16);

<hr style="border:2px solid black"> </hr>

## Hypothesis 1: Pearson's (cont vs cont)
$H_0$: There is no correlation between next week sales and store_size

$H_a$: There is a correlation between next week sales and store_size


In [None]:
#pearsons correlation on entire train set
#number of rows
n = train.shape[0] 

#degrees of freedom- how much the data can vary
deg_f = n-2 

#confidence level
conf_in = 0.95

alpha = 1- conf_in

In [None]:
x= train.next_week_sales_target
y= train.store_size

In [None]:
r, p = stats.pearsonr(x,y)
r,p

In [None]:
p < alpha

In [None]:
print(f'We calculate a Pearson r of {r:.3f} and a p-value of {p:.4f}')
decision = 'reject' if p < alpha else 'fail to reject'
print(f'With p = {p:.4f} and α = {alpha:.2f}, we {decision} our null hypothesis')

### Takeaways:
- We rejected our null hypothesis, thus indicating that there is a correlation between next week sales and store size
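
The same Pearson steps recur for each hypothesis in this section. A minimal sketch of wrapping them in one function is shown here; `pearson_test` is a hypothetical helper, and the synthetic `size` and `sales` arrays only stand in for `train.store_size` and `train.next_week_sales_target`.

```python
import numpy as np
from scipy import stats

def pearson_test(x, y, alpha=0.05):
    """Hypothetical helper: run Pearson's r and report whether H0 is rejected."""
    r, p = stats.pearsonr(x, y)
    return {"r": r, "p": p, "reject_null": bool(p < alpha)}

# synthetic data with a strong linear relationship plus noise
rng = np.random.default_rng(0)
size = rng.uniform(30_000, 220_000, 500)
sales = 8 * size + rng.normal(0, 100_000, 500)

result = pearson_test(sales, size)
```

In the notebook this could be called as `pearson_test(train.next_week_sales_target, train.store_size)`, assuming neither column contains nulls.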

<hr style="border:1px solid black"> </hr>

## Hypothesis 2: Pearson's (cont vs cont)
$H_0$: There is no correlation between this week sales and next week sales

$H_a$: There is a correlation between this week sales and next week sales

In [None]:
#pearsons correlation on entire train set
#number of rows
n = train.shape[0] 

#degrees of freedom- how much the data can vary
deg_f = n-2 

#confidence level
conf_in = 0.95

alpha = 1- conf_in

x= train.next_week_sales_target
y= train.this_week_sales

In [None]:
r, p = stats.pearsonr(x,y)
r,p

In [None]:
p < alpha

In [None]:
print(f'We calculate a Pearson r of {r:.3f} and a p-value of {p:.4f}')
decision = 'reject' if p < alpha else 'fail to reject'
print(f'With p = {p:.4f} and α = {alpha:.2f}, we {decision} our null hypothesis')

### Takeaways:
- We rejected our null hypothesis, thus indicating that there is a correlation between this week sales and next week sales.

<hr style="border:1px solid black"> </hr>

## Hypothesis 3: Pearson's (cont vs cont)
$H_0$: There is no correlation between next week sales and sales for the same week last year

$H_a$: There is a correlation between next week sales and sales for the same week last year

In [None]:
#pearsons correlation on entire train set
#number of rows
n = train.shape[0] 

#degrees of freedom- how much the data can vary
deg_f = n-2 

#confidence level
conf_in = 0.95

alpha = 1- conf_in

In [None]:
x= train.next_week_sales_target
y= train.next_week_1_year_ago

In [None]:
r, p = stats.pearsonr(x,y)
r,p

In [None]:
p < alpha

In [None]:
print(f'We calculate a Pearson r of {r:.3f} and a p-value of {p:.4f}')
decision = 'reject' if p < alpha else 'fail to reject'
print(f'With p = {p:.4f} and α = {alpha:.2f}, we {decision} our null hypothesis')

### Takeaways:
- We rejected our null hypothesis, thus indicating that there is a correlation between next week sales and sales for the same week last year.

<hr style="border:1px solid black"> </hr>

## Hypothesis 4: T-Test (discrete vs cont)
$H_0$: There is no relationship between next_week_sales_target and pre_christmas

$H_a$: There is a relationship between next_week_sales_target and pre_christmas

In [None]:
#set alpha
alpha = .05

In [None]:
# sample size per group; each should be more than 30 to meet the assumption
train.groupby('pre_christmas').next_week_sales_target.count()

In [None]:
#check variance of sales within each group
train.groupby('pre_christmas').next_week_sales_target.var()

#if the group variances are not equal, use equal_var=False (Welch's t-test)

In [None]:
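#t-test on entire train set: compare sales in pre-christmas weeks against all other weeks
pre_xmas_sales = train[train.pre_christmas == 1].next_week_sales_target
other_sales = train[train.pre_christmas == 0].next_week_sales_target
t, p = stats.ttest_ind(pre_xmas_sales, other_sales, equal_var=False)
t, p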

In [None]:
p <alpha

In [None]:
print(f'We calculate a t of {t:.3f} and a p-value of {p:.4f}')
decision = 'reject' if p < alpha else 'fail to reject'
print(f'With p = {p:.4f} and α = {alpha:.2f}, we {decision} our null hypothesis')

### Takeaways:
- We rejected our null hypothesis, thus indicating that there is a relationship between next week sales and pre-christmas.
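
Comparing raw variances by eye is informal. One option, sketched here on synthetic data, is to run Levene's test first and let its result choose between the pooled and Welch forms of the t-test. The group arrays and their parameters are made up for illustration and do not come from the Walmart data.

```python
import numpy as np
from scipy import stats

# synthetic stand-ins: a small, high-variance group and a large, low-variance group
rng = np.random.default_rng(7)
pre_xmas = rng.normal(1_500_000, 400_000, 40)
other = rng.normal(1_000_000, 100_000, 400)

alpha = 0.05

# Levene's test: H0 is that both groups have equal variance
_, p_levene = stats.levene(pre_xmas, other)
equal_var = bool(p_levene >= alpha)

# equal_var=False gives Welch's t-test, which does not pool the variances
t, p = stats.ttest_ind(pre_xmas, other, equal_var=equal_var)
```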

<hr style="border:2px solid black"> </hr>

# Modeling

In [None]:
train, test, X_train_scaled, X_test_scaled, y_train, y_test= w.split_scale(df, 'next_week_sales_target', scaler= MinMaxScaler())
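
`w.split_scale` lives in `new_wrangle.py` and is not shown here. The sketch below is a hypothetical stand-in that returns the same six objects, assuming a random split and a scaler fit on train only; the real helper may split differently (for example by date) and may scale a different set of columns.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def split_scale_sketch(df, target, scaler=None, test_size=0.2, seed=123):
    """Hypothetical stand-in for w.split_scale; the real helper may differ."""
    scaler = scaler if scaler is not None else MinMaxScaler()
    train, test = train_test_split(df, test_size=test_size, random_state=seed)
    num_cols = train.drop(columns=[target]).select_dtypes("number").columns
    # fit the scaler on train only, then reuse it on test to avoid leakage
    X_train_scaled = pd.DataFrame(scaler.fit_transform(train[num_cols]),
                                  columns=num_cols, index=train.index)
    X_test_scaled = pd.DataFrame(scaler.transform(test[num_cols]),
                                 columns=num_cols, index=test.index)
    return train, test, X_train_scaled, X_test_scaled, train[target], test[target]

# toy frame with made-up values, only to exercise the function
rng = np.random.default_rng(1)
toy = pd.DataFrame({"store_size": rng.integers(30_000, 220_000, 100),
                    "this_week_sales": rng.normal(1e6, 2e5, 100),
                    "next_week_sales_target": rng.normal(1e6, 2e5, 100)})
parts = split_scale_sketch(toy, "next_week_sales_target")
```

Because the scaler only sees the train rows, scaled test values can fall slightly outside the 0 to 1 range.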

In [None]:
#set features
#we do not want to include all columns in this because it could cause overfitting
features = ['store_size', 'this_week_unemployment', 'next_week_1_year_ago', 'this_week_sales', 'pre_christmas']

In [None]:
# We need y_train and y_test to be dataframes to append the new columns with predicted values. 
y_train = pd.DataFrame({'actual': y_train})
y_test = pd.DataFrame({'actual': y_test})

## Baseline

In [None]:
#create the baseline using mean of all sales
baseline= y_train['actual'].mean()

In [None]:
#create column called baseline to compare
y_train['baseline'] = baseline

In [None]:
#calculate RMSE for baseline model
rmse_baseline_train= math.sqrt(mean_squared_error(y_train.actual, y_train.baseline))

In [None]:
#create a dataframe to make data easier to visualize/understand
metric_df = pd.DataFrame(data=[{
    'model': "Baseline (using mean)",
    'rmse_train': round(rmse_baseline_train, 2),
    'r^2_train': round(explained_variance_score(y_train.actual, y_train.baseline),4),

}])

metric_df

## Baseline 2

In [None]:
#baseline version 2 using last years sales
baseline2 = train['next_week_1_year_ago']

In [None]:
#prediction
#create column called baseline to compare
y_train['last_year_baseline'] = baseline2

In [None]:
#calculate RMSE for baseline model
rmse_baseline2_train= math.sqrt(mean_squared_error(y_train.actual, y_train.last_year_baseline))

In [None]:
#create a dataframe to make data easier to visualize/understand
#DataFrame.append was removed in pandas 2.0, so add the new row with pd.concat
metric_df = pd.concat([metric_df, pd.DataFrame([{
    'model': "Baseline (using last year's sales)",
    'rmse_train': round(rmse_baseline2_train, 2),
    'r^2_train': round(explained_variance_score(y_train.actual, y_train.last_year_baseline),4),
    }])], ignore_index=True)

metric_df
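
The `r^2_train` column is filled with `explained_variance_score`, which matches R² only when the prediction errors average to zero. For a baseline that may be systematically off, such as last year's sales, the two can differ. A small illustration with made-up numbers, where every prediction is too high by the same amount:

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

actual = np.array([1.0, 2.0, 3.0])
# every prediction is off by exactly +1, so the errors have zero variance
biased_pred = actual + 1.0

ev = explained_variance_score(actual, biased_pred)
r2 = r2_score(actual, biased_pred)
```

Here explained variance is 1.0 because the residuals do not vary, while R² is -0.5 because the constant offset still counts as error.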

## OLS Model

In [None]:
#ordinary least squares
#create the model (features are already scaled; the normalize argument was removed in scikit-learn 1.2)
model1 = LinearRegression()

#fit the model
model1.fit(X_train_scaled[features], y_train.actual)

In [None]:
# predict train
y_train['sales_pred_lm'] = model1.predict(X_train_scaled[features])

# evaluate: rmse
rmse_train = mean_squared_error(y_train.actual, y_train.sales_pred_lm)**(1/2)

In [None]:
#create visual to see baseline vs LinearRegression model
#DataFrame.append was removed in pandas 2.0, so add the new row with pd.concat
metric_df = pd.concat([metric_df, pd.DataFrame([{
    'model': 'Model 1: OLS',
    'rmse_train': round(rmse_train, 2),
    'r^2_train': round(explained_variance_score(y_train.actual, y_train.sales_pred_lm),4),
    }])], ignore_index=True)

metric_df

## Lasso Lars

In [None]:
# create the model object
model2 = LassoLars(alpha= 2)

# fit the model to our training data. We must specify the column in y_train, 
# since we have converted it to a dataframe from a series! 
model2.fit(X_train_scaled[features], y_train.actual)

# predict train
y_train['sales_pred_lars'] = model2.predict(X_train_scaled[features])

# evaluate: rmse
rmse_train = mean_squared_error(y_train.actual, y_train.sales_pred_lars)**(1/2)

In [None]:
#shows baseline vs LinearRegression vs LassoLars
#DataFrame.append was removed in pandas 2.0, so add the new row with pd.concat
metric_df = pd.concat([metric_df, pd.DataFrame([{
    'model': 'Model 2: LassoLars (alpha 2)',
    'rmse_train': round(rmse_train,2),
    'r^2_train': round(explained_variance_score(y_train.actual, y_train.sales_pred_lars),4),
    }])], ignore_index=True)

metric_df

## Polynomial Regression

In [None]:
#make the polynomial features to get a new set of features
model3 = PolynomialFeatures(degree=2)

# fit and transform X_train_scaled features
X_train_degree2 = model3.fit_transform(X_train_scaled[features])

In [None]:
#create the model (the normalize argument was removed in scikit-learn 1.2)
lm2 = LinearRegression()

#fit the model
lm2.fit(X_train_degree2, y_train.actual)

#use the model
y_train['sale_pred_lm2'] = lm2.predict(X_train_degree2)

# evaluate: rmse
rmse_train_model3 = mean_squared_error(y_train.actual, y_train.sale_pred_lm2) ** (1/2)

In [None]:
#shows baseline vs LinearRegression vs LassoLars vs Polynomial Regression
#DataFrame.append was removed in pandas 2.0, so add the new row with pd.concat
metric_df = pd.concat([metric_df, pd.DataFrame([{
    'model': 'Model 3: Polynomial Regression (degree=2)',
    'rmse_train': round(rmse_train_model3,2),
    'r^2_train': round(explained_variance_score(y_train.actual, y_train.sale_pred_lm2),4),
    }])], ignore_index=True)

metric_df

### Takeaways
- Data was scaled using MinMaxScaler
- Features included for modeling were: 'store_size', 'this_week_unemployment', 'next_week_1_year_ago', 'this_week_sales', and 'pre_christmas'

<br>

- 2nd Degree Polynomial Regression model outperformed the baseline (using last year's sales) by 23.44% on the train set
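
The notebook does not show how the improvement percentage was computed. One common reading, assumed here, is the relative reduction in RMSE against the baseline. The helper name and the example numbers are illustrative only and are not results from this analysis.

```python
def pct_improvement(rmse_baseline: float, rmse_model: float) -> float:
    """Relative RMSE reduction versus a baseline, in percent (assumed definition)."""
    return (rmse_baseline - rmse_model) / rmse_baseline * 100

# made-up values: a model RMSE of 150 against a baseline RMSE of 200
example = pct_improvement(200.0, 150.0)
```

With the metrics table above, the inputs would be the `rmse_train` values for the last-year baseline and for Model 3.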

<hr style="border:2px solid black"> </hr>