# Walmart Sales Prediction
-----
>In this second phase of the project, we will try to:  

>> find out which model best suits this prediction problem   
>> understand which predictors might have high impact on the prediction outputs

> Please go to the folder **iframe_figures**, to see the different visualizations.

---------

### Table of Contents

* [1. Load Data](#section1)
* [2. Linear Regression Models](#section2)
    * [2.1. Data Preparing for Modeling](#section21)
    * [2.2. Linear Regressor](#section22)
    * [2.6. Features Importance](#section26)

In [1]:
import warnings
warnings.filterwarnings('ignore')

# Generic librairies 
import pandas as pd
pd.options.display.max_columns=None

import numpy as np

# Visualization librairies 
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "iframe" 


# Machine learning librairies
# split
from sklearn.model_selection import train_test_split

# Preprocessing
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import  OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

# Regressors
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# score metrics
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# model selection
from sklearn.model_selection import cross_val_score, GridSearchCV

# predefined modules
from modules import MyFunctions as MyFunct

# Global parameters 
filepath = 'data/prep_walmart_sales.csv'
results_path='results/'

 # Load data

In [2]:
print("Loading dataset...")
dataset = pd.read_csv(filepath)
print("...Done.")
print()

Loading dataset...
...Done.



# Linear Regression Models

## Data Preparing for Modeling

In [3]:
# Define target variable (y) and explanatory variables (X)
Y = dataset['Weekly_Sales']
X = dataset.drop(['Weekly_Sales', 'Store'], axis = 'columns')

In [4]:
# Divide dataset 
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# Convert pandas DataFrames to numpy arrays before using scikit-learn
X_train = X_train.values
X_test = X_test.values
Y_train = Y_train.tolist()
Y_test = Y_test.tolist()

# Create pipeline for numeric features 
#Num_X =['Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'Week'] 
num_X = [1,2,3,4,5]
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')), 
    ('scaler', StandardScaler())
])

# Create pipeline for categorical features
#cat_X = [Holiday_Flag']
cat_X = [0]
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first'))
])

# Use ColumnTranformer to make a preprocessor object 
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_X),
        ('cat', categorical_transformer, cat_X)
    ])

# Preprocessings on train set
X_train = preprocessor.fit_transform(X_train)
X_test  = preprocessor.transform(X_test)

## Linear Regressor

In [5]:
iterables = [["RMSE", "R2", "ADJ R2"], ["Train", "Test"]]
ind = pd.MultiIndex.from_product(iterables)

metrics = pd.DataFrame(index = ind)

In [6]:
linear_regressor = LinearRegression()
linear_regressor.fit(X_train, Y_train)

y_train_pred, y_test_pred= linear_regressor.predict(X_train), linear_regressor.predict(X_test)
mse_train, mse_test = mean_squared_error(Y_train,y_train_pred ), mean_squared_error(Y_test, y_test_pred)
r2_train, r2_test = r2_score(Y_train,y_train_pred), r2_score(Y_test, y_test_pred)
adjR2_train, adjR2_test = MyFunct.adjusted_r2(r2_train, X_train.shape[0], X_train.shape[1]), MyFunct.adjusted_r2(r2_test, X_train.shape[0], X_train.shape[1]) 

# compute SS for the F_value computation
SST1, SSR1, SSE1 = MyFunct.sum_squares(Y_train, y_train_pred)

metrics['Linear Regressor'] = [np.sqrt(mse_train), np.sqrt(mse_test), r2_train, r2_test,adjR2_train, adjR2_test]
metrics

Unnamed: 0,Unnamed: 1,Linear Regressor
RMSE,Train,559332.999435
RMSE,Test,567622.906003
R2,Train,0.027987
R2,Test,0.032217
ADJ R2,Train,0.026828
ADJ R2,Test,0.031062


>🗒 Note: The obtained very low scores confirmed that the predictor **Store** explains the **majority of the target variability**. 

## Features Importance

In [7]:
coef = pd.DataFrame()
coef['feature'] = ['Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'Week','Holiday_Flag']

coef['coef_linear_regressor'] = linear_regressor.coef_
coef['coef_linear_regressor'] = coef.coef_linear_regressor.abs()
coef = coef.sort_values(by = 'coef_linear_regressor', ascending = True)

px.bar(coef, x ='coef_linear_regressor', y = 'feature')


>> The predictors importance ordering coincide with our human intuition. The consumer price index (CPI) and the unemployment are the most important predictors, the week and the holiday_flag are as well significant predictors.

>> Propose another model that do not use stores as predictors