### Walmart Sales Forecasting

## Problem:
Sales in some seasons run well above or below the average. A company that does not anticipate these swings can lose a lot of money, so predicting future sales is one of its most important planning tasks. A sales forecast helps the company arrange stock, estimate revenue, and decide whether to make new investments. Knowing future sales also lets the company set targets it can meet from the start of a season: meeting them can lift the stock price and investors' perceptions, while missing a projected target can significantly damage the stock price. For a company as large as Walmart, that is a serious problem.

## Aim:
My aim in this project is to build a model that predicts store sales. With such a model, Walmart decision-makers can plan ahead, which matters for arranging stock, estimating revenue, and deciding whether to make new investments.

## Solution:
With accurate predictions, the company can:
- Identify seasonal demand and act on it
- Avoid losing money, since meeting sales targets can have a positive effect on stock prices and investors' perceptions
- Forecast revenue easily and accurately
- Manage inventory
- Run more effective campaigns

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

### Dataset information (feature description)
- Store        : Store number
- Dept         : Department number
- Date         : Week
- Weekly_Sales : Sales for the given department in the given store
- IsHoliday    : Whether the week is a holiday week
- Temperature  : Average temperature in the region
- Fuel_Price   : Cost of fuel in the region
- MarkDown1 to 5: Anonymized data related to promotional markdowns that Walmart is running
- CPI          : Consumer price index
- Unemployment : Unemployment rate in the region

In [2]:
# Reading the data from all the given Datasets.

Sales_features=pd.read_csv('data/Sales_features.csv')
Sales_stores=pd.read_csv('data/Sales_stores.csv')
Sales_test=pd.read_csv('data/Sales_test.csv')
Sales_train=pd.read_csv('data/Sales_train.csv')


In [3]:
# Sales_features, Sales_stores and Sales_train contain some common features.
# They need to be merged to create the training dataset.

def showCols(data,name):
    print(name," : ",data.columns)
showCols(Sales_features,"Sales_features")
showCols(Sales_train,"Sales_train")
showCols(Sales_stores,"Sales_stores")
showCols(Sales_test,"Sales_test")


Sales_features  :  Index(['Store', 'Date', 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2',
       'MarkDown3', 'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment',
       'IsHoliday'],
      dtype='object')
Sales_train  :  Index(['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday'], dtype='object')
Sales_stores  :  Index(['Store', 'Type', 'Size'], dtype='object')
Sales_test  :  Index(['Store', 'Dept', 'Date', 'IsHoliday'], dtype='object')


Sales_train and Sales_stores share the 'Store' column, so we first merge them on 'Store'.<br>
The resulting df and Sales_features share the 'Store', 'Date' and 'IsHoliday' columns ('Dept' appears only in Sales_train), <br>
so we then merge these two datasets on those three columns.
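
The merge keys can also be derived rather than read off by eye. The following is a minimal sketch: the empty frames only carry the column names printed above, and `common_cols` is a hypothetical helper introduced here for illustration, not part of the original notebook.

```python
import pandas as pd

# Empty stand-in frames that carry only the column names printed above,
# so the sketch runs without the CSV files.
train = pd.DataFrame(columns=['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday'])
stores = pd.DataFrame(columns=['Store', 'Type', 'Size'])
features = pd.DataFrame(columns=['Store', 'Date', 'Temperature', 'Fuel_Price',
                                 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4',
                                 'MarkDown5', 'CPI', 'Unemployment', 'IsHoliday'])


def common_cols(left, right):
    """Hypothetical helper: columns present in both frames, in the order of `left`."""
    return [c for c in left.columns if c in right.columns]


# Keys for the first merge (train with stores).
keys_step1 = common_cols(train, stores)

# Columns of the intermediate frame after the first merge.
merged_cols = pd.DataFrame(columns=list(train.columns) + ['Type', 'Size'])

# Keys for the second merge (intermediate frame with features).
keys_step2 = common_cols(merged_cols, features)

print(keys_step1)
print(keys_step2)
```

Deriving the keys this way guards against listing a column, such as 'Dept', that exists in only one of the two frames.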

In [4]:
df=Sales_train.merge(Sales_stores,how='left',on='Store') # merging Sales_train and Sales_stores
df=df.merge(Sales_features,on=['Store','Date','IsHoliday']) # merging Sales_features into df

# final Dataset
df.head() 

Unnamed: 0,Store,Dept,Date,Weekly_Sales,IsHoliday,Type,Size,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,Unemployment
0,1,1,2010-02-05,24924.5,False,A,151315,42.31,2.572,,,,,,211.096358,8.106
1,1,2,2010-02-05,50605.27,False,A,151315,42.31,2.572,,,,,,211.096358,8.106
2,1,3,2010-02-05,13740.12,False,A,151315,42.31,2.572,,,,,,211.096358,8.106
3,1,4,2010-02-05,39954.04,False,A,151315,42.31,2.572,,,,,,211.096358,8.106
4,1,5,2010-02-05,32229.38,False,A,151315,42.31,2.572,,,,,,211.096358,8.106
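
A merge like this can silently duplicate or drop rows when the keys are not unique or do not match. The sketch below shows one way to check for that, using made-up toy values rather than the real data. It relies on the `validate` and `indicator` parameters of `DataFrame.merge`, and it uses `how='left'` for the second merge (the notebook uses the default inner join) so that unmatched rows stay visible.

```python
import pandas as pd

# Toy frames with invented values; only the column names mirror the notebook.
train = pd.DataFrame({
    'Store': [1, 1, 2],
    'Dept': [1, 2, 1],
    'Date': ['2010-02-05', '2010-02-05', '2010-02-05'],
    'Weekly_Sales': [100.0, 200.0, 300.0],
    'IsHoliday': [False, False, False],
})
stores = pd.DataFrame({'Store': [1, 2], 'Type': ['A', 'B'], 'Size': [1000, 2000]})
features = pd.DataFrame({
    'Store': [1, 2],
    'Date': ['2010-02-05', '2010-02-05'],
    'IsHoliday': [False, False],
    'CPI': [211.0, 210.0],
})

# validate='many_to_one' raises MergeError if the right-hand keys are not unique.
merged = train.merge(stores, how='left', on='Store', validate='many_to_one')

# indicator=True adds a '_merge' column recording where each row found a match.
merged = merged.merge(features, how='left', on=['Store', 'Date', 'IsHoliday'],
                      validate='many_to_one', indicator=True)

# Rows of the training data that found no partner in the features table.
unmatched = int((merged['_merge'] != 'both').sum())
merged = merged.drop(columns='_merge')

# A many-to-one left merge should preserve the number of training rows.
assert len(merged) == len(train)
print(unmatched)
print(merged)
```
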


In [5]:
df.isnull().sum()

Store                0
Dept                 0
Date                 0
Weekly_Sales         0
IsHoliday            0
Type                 0
Size                 0
Temperature          0
Fuel_Price           0
MarkDown1       270889
MarkDown2       310322
MarkDown3       284479
MarkDown4       286603
MarkDown5       270138
CPI                  0
Unemployment         0
dtype: int64

The MarkDown1 to MarkDown5 features are null in roughly 64% to 74% of the 421,570 rows, which makes them unsuitable for model training.<br>
Hence we drop these 5 features before further analysis.
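
The share of missing values per column can be computed from the counts instead of estimated. This is a minimal sketch using the null counts printed above and the row count reported by `df.info()`. The 50% cutoff is an assumption chosen for illustration, not a threshold stated in the notebook.

```python
import pandas as pd

# Null counts copied from the df.isnull().sum() output; row count from df.info().
total_rows = 421570
null_counts = pd.Series({
    'MarkDown1': 270889,
    'MarkDown2': 310322,
    'MarkDown3': 284479,
    'MarkDown4': 286603,
    'MarkDown5': 270138,
})

# Percentage of rows in which each column is null.
null_pct = null_counts / total_rows * 100
print(null_pct.round(1))

# Assumed cutoff for illustration: drop columns that are more than half empty.
threshold_pct = 50
cols_to_drop = null_pct[null_pct > threshold_pct].index.tolist()
print(cols_to_drop)
```

On the real frame, `df.isnull().mean() * 100` gives the same percentages for every column in one step.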

In [6]:
# Dropping the MarkDown1 to 5 features from the df training dataset

df.drop(['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5'],axis=1,inplace=True)

In [7]:
df.head()

Unnamed: 0,Store,Dept,Date,Weekly_Sales,IsHoliday,Type,Size,Temperature,Fuel_Price,CPI,Unemployment
0,1,1,2010-02-05,24924.5,False,A,151315,42.31,2.572,211.096358,8.106
1,1,2,2010-02-05,50605.27,False,A,151315,42.31,2.572,211.096358,8.106
2,1,3,2010-02-05,13740.12,False,A,151315,42.31,2.572,211.096358,8.106
3,1,4,2010-02-05,39954.04,False,A,151315,42.31,2.572,211.096358,8.106
4,1,5,2010-02-05,32229.38,False,A,151315,42.31,2.572,211.096358,8.106


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 421570 entries, 0 to 421569
Data columns (total 11 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Store         421570 non-null  int64  
 1   Dept          421570 non-null  int64  
 2   Date          421570 non-null  object 
 3   Weekly_Sales  421570 non-null  float64
 4   IsHoliday     421570 non-null  bool   
 5   Type          421570 non-null  object 
 6   Size          421570 non-null  int64  
 7   Temperature   421570 non-null  float64
 8   Fuel_Price    421570 non-null  float64
 9   CPI           421570 non-null  float64
 10  Unemployment  421570 non-null  float64
dtypes: bool(1), float64(5), int64(3), object(2)
memory usage: 35.8+ MB


The 'Date' feature is stored as a string (dtype object) and cannot be used by a model directly, so we convert it into 'year', 'month' and 'week' features, which lets the model capture seasonal changes in 'Weekly_Sales'.


In [9]:
dates=pd.to_datetime(df['Date']) ## Parsing the date strings once
df['year']=dates.dt.year ## Extracting 'year' data
df['month']=dates.dt.month ## Extracting 'month' data
df['week']=dates.dt.isocalendar().week.astype('int64') ## Extracting ISO 'week' data (DatetimeIndex.week was removed in pandas 2.0)
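
To see what these calendar features look like, the extraction can be tried on a few sample dates. This is a sketch under assumptions: the first date comes from the output above, the other two are illustrative, and pandas 1.1 or later is assumed so that `Series.dt.isocalendar()` is available.

```python
import pandas as pd

# First date is taken from the notebook output; the other two are illustrative.
sample = pd.DataFrame({'Date': ['2010-02-05', '2010-02-12', '2011-11-25']})

# Parse the strings into datetimes once, then read the calendar parts.
dates = pd.to_datetime(sample['Date'])
sample['year'] = dates.dt.year
sample['month'] = dates.dt.month

# isocalendar() returns the ISO year, week and day; only the week is kept here.
sample['week'] = dates.dt.isocalendar().week.astype('int64')

print(sample)
```

ISO week numbers follow the ISO 8601 calendar, so a date in the first days of January can belong to week 52 or 53 of the previous ISO year, which is worth keeping in mind when 'year' and 'week' are used together.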

In [10]:
# The date is now stored in three features (year, month, week), hence we don't need the 'Date' feature

df.drop('Date',inplace=True,axis=1)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 421570 entries, 0 to 421569
Data columns (total 13 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Store         421570 non-null  int64  
 1   Dept          421570 non-null  int64  
 2   Weekly_Sales  421570 non-null  float64
 3   IsHoliday     421570 non-null  bool   
 4   Type          421570 non-null  object 
 5   Size          421570 non-null  int64  
 6   Temperature   421570 non-null  float64
 7   Fuel_Price    421570 non-null  float64
 8   CPI           421570 non-null  float64
 9   Unemployment  421570 non-null  float64
 10  year          421570 non-null  int64  
 11  month         421570 non-null  int64  
 12  week          421570 non-null  int64  
dtypes: bool(1), float64(5), int64(6), object(1)
memory usage: 42.2+ MB
