### Walmart Sales Forecasting
## Exploratory Data Analysis
### Problem:
There are many seasons in which sales are significantly higher or lower than average. A company that does not anticipate these seasons can lose a lot of money. Predicting future sales is therefore one of a company's most crucial planning tasks. Sales forecasts help the company arrange stock, calculate revenue, and decide whether to make new investments. Another advantage of knowing future sales is that hitting predetermined targets from the start of a season can have a positive effect on stock prices and investors' perceptions. Conversely, missing the projected target can significantly damage stock prices. This is an especially big problem for a company as large as Walmart.

## Aim:
My aim in this project is to build a model that predicts the sales of the stores. With this model, Walmart decision-makers can plan ahead, which is very important for arranging stock, calculating revenue, and deciding whether to make new investments.
## Solution:
With accurate predictions, the company can:
- Determine seasonal demand and act on it
- Protect against losses, because achieving sales targets can have a positive effect on stock prices and investors' perceptions
- Forecast revenue easily and accurately
- Manage inventories
- Run more effective campaigns

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

### Dataset information (feature description)
- Store        : Store number
- Dept         : Department number
- Date         : Week
- Weekly_Sales : Sales for given dept in given store
- IsHoliday    : Whether the week is a holiday week
- Temperature  : Average temperature in the region
- Fuel_Price   : Cost of fuel in the region
- MarkDown1 to 5: Anonymized data related to promotional markdowns that Walmart is running.
- CPI          : Consumer price index
- Unemployment : Unemployment rate in the region

In [None]:
# Reading the data from all the given Datasets.

Sales_features=pd.read_csv('data/Sales_features.csv')
Sales_stores=pd.read_csv('data/Sales_stores.csv')
Sales_test=pd.read_csv('data/Sales_test.csv')
Sales_train=pd.read_csv('data/Sales_train.csv')


In [None]:
# Sales_features, Sales_stores and Sales_train contain some common features.
# They need to be merged to create the training dataset.

def showCols(data,name):
    print(name," : ",data.columns)
showCols(Sales_features,"Sales_features")
showCols(Sales_train,"Sales_train")
showCols(Sales_stores,"Sales_stores")
showCols(Sales_test,"Sales_test")


The Sales_features and Sales_stores datasets have the 'Store' feature in common, so we merge the two datasets on 'Store'.
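As a minimal sketch of what a merge on 'Store' does, the toy frames below (made-up values standing in for Sales_features and Sales_stores) are joined many-to-one. The `validate` argument is an optional safety check that is not part of the original notebook; it is shown here only as one way to confirm that each store appears once in the store table.

```python
import pandas as pd

# Toy stand-ins for Sales_features and Sales_stores (all values are made up)
features = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Date": ["2010-02-05", "2010-02-12", "2010-02-05", "2010-02-12"],
    "Temperature": [42.3, 38.5, 40.2, 38.5],
})
stores = pd.DataFrame({
    "Store": [1, 2],
    "Type": ["A", "B"],
    "Size": [151315, 202307],
})

# Each feature row picks up the Type and Size of its store.
# validate="many_to_one" raises a MergeError if 'Store' is duplicated in `stores`.
merged = features.merge(stores, how="inner", on="Store", validate="many_to_one")

print(merged)
```

The row count stays at that of the feature table because every store in it has exactly one match in the store table.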

In [None]:
dataset=Sales_features.merge(Sales_stores,how='inner',on='Store')

In [None]:
dataset

In [None]:
# First, split the date into year and week-of-year features

dataset['Date']=pd.to_datetime(dataset['Date'])
dataset['year']=dataset['Date'].dt.year ## Extracting 'year' data
dataset['week']=dataset['Date'].dt.strftime('%U').astype(int)  ## Extracting 'week' (%U, Sunday-based week number) as an integer
dataset.info()

The output above shows that the MarkDown1 to MarkDown5 features are about 58% null, which makes them unsuitable for model training.<br>
Hence we drop these 5 features for cleaner data analysis.
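The share of missing values per column can be computed directly. The sketch below uses a toy frame with made-up values, and the 50% cut-off is a hypothetical threshold chosen for illustration, not a value taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; MarkDown1 is mostly missing
toy = pd.DataFrame({
    "Store": [1, 1, 2, 2, 3],
    "MarkDown1": [np.nan, np.nan, 5.0, np.nan, 7.5],
    "CPI": [211.1, 211.2, np.nan, 210.8, 214.0],
})

# isnull() gives booleans, so the column mean is the fraction of nulls
null_pct = toy.isnull().mean() * 100

threshold = 50  # hypothetical cut-off, in percent
cols_to_drop = null_pct[null_pct > threshold].index.tolist()
reduced = toy.drop(columns=cols_to_drop)

print(null_pct.round(1))
print("dropped:", cols_to_drop)
```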

In [None]:
# Dropping the MarkDown1 to 5 features from dataset

dataset.drop(['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5'],axis=1,inplace=True)

#### Training data
Merging the 'dataset' and 'Sales_train' for making the training dataset

In [None]:
Sales_train['Date']=pd.to_datetime(Sales_train['Date'])

df_train=Sales_train.merge(dataset,how='inner',on=['Store','IsHoliday','Date']) # merging Sales_train and dataset.

# final Training Dataset
df_train.head() 

#### Testing data
Merging the 'dataset' and 'Sales_test' for making the testing dataset
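An inner merge on several keys silently drops rows that have no match. One way to see how many rows would be lost is a left merge with `indicator=True`, sketched here on toy frames with made-up values (the frame names are illustrative only).

```python
import pandas as pd

keys = ["Store", "Date", "IsHoliday"]

# Toy sales rows; Store 2 has no counterpart in `extra`
sales = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-02-05"]),
    "IsHoliday": [False, True, False],
})
extra = pd.DataFrame({
    "Store": [1, 1],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12"]),
    "IsHoliday": [False, True],
    "CPI": [211.1, 211.2],
})

# indicator=True adds a '_merge' column saying where each row was found
checked = sales.merge(extra, how="left", on=keys, indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

# The inner merge keeps only the matched rows
inner = sales.merge(extra, how="inner", on=keys)

print(len(sales), len(inner), len(unmatched))
```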

In [None]:
Sales_test['Date']=pd.to_datetime(Sales_test['Date'])
df_test=Sales_test.merge(dataset,how='inner',on=['Store','IsHoliday','Date']) # merging Sales_test and dataset.

# final Testing Dataset 
df_test.head()

Dropping the 'Date' column, as both the training and testing datasets already contain the 'year' and 'week' features
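The following small sketch shows how 'year' and 'week' capture the information in 'Date'. It uses three hand-picked dates; the `%U` format code numbers weeks from the first Sunday of the year, so days before that Sunday fall in week 0. Converting the week to an integer is an assumption of this sketch.

```python
import pandas as pd

# Three hand-picked dates: New Year's Day, an early-February Friday, year end
dates = pd.to_datetime(pd.Series(["2010-01-01", "2010-02-05", "2010-12-31"]))

parts = pd.DataFrame({
    "Date": dates,
    "year": dates.dt.year,
    # %U: week of the year with Sunday as the first day of the week
    "week": dates.dt.strftime("%U").astype(int),
})

print(parts)
```

Because `%U` restarts at 0 each January, the same week number in different years refers to roughly the same part of the calendar, which is what makes it usable as a seasonal feature.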

In [None]:
df_test.drop(['Date'],axis=1,inplace=True)
df_train.drop(['Date'],axis=1,inplace=True)
df_test.shape,df_train.shape

In [None]:
df_train.head()

#### Saving the prepared df_train dataset as a .csv file to make data ingestion into the modules easier
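A short sketch of why `index=False` matters when writing the file: without it, the row index is written as an extra unnamed column that shows up again on reading. The file names and values below are made up, and a temporary directory is used so nothing is written to the project.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"Store": [1, 2], "Weekly_Sales": [24924.5, 46039.25]})

with tempfile.TemporaryDirectory() as tmp:
    # Written without the index: the file has exactly the frame's columns
    path = os.path.join(tmp, "toy_sales.csv")
    toy.to_csv(path, index=False)
    restored = pd.read_csv(path)

    # Written with the index: read_csv sees an extra, unnamed first column
    path_with_index = os.path.join(tmp, "toy_sales_index.csv")
    toy.to_csv(path_with_index)
    restored_with_index = pd.read_csv(path_with_index)

print(list(restored.columns))
print(list(restored_with_index.columns))
```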

In [None]:
# to_csv file 'Sales_data'

df_train.to_csv('data/Sales_data.csv',index=False)

In [None]:
df_test.head()

In [None]:
df_train.isnull().sum()

In [None]:
df_test.isnull().sum()

In [None]:
df_train.head()

In [None]:
# Find the indices of rows with null values for the specific feature
null_ind = df_test[df_test['Unemployment'].isnull()].index

# Delete the rows with null values
df_test=df_test.drop(null_ind)

In [None]:
df_train.info()

In [None]:
# The target variable 'Weekly_Sales' may contain outliers.
# Here, outliers are rows with zero or negative 'Weekly_Sales'.

df_train.loc[df_train['Weekly_Sales']<=0]


In [None]:
#dropping those outliers
df_train=df_train.drop(df_train[df_train['Weekly_Sales']<=0].index,axis=0)

In [None]:
## Categorical Variables -> 'Type', share of records for each store type
values=df_train['Type'].value_counts()
plt.pie(values,labels=values.index,autopct='%.2f%%')  # labels taken from the counts so they match the slices
plt.title("Percent of Store types")

plt.show()

We conclude that Type A stores account for the largest share of records compared with the other types.
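Pie-chart labels have to be in the same order as the values they describe. The sketch below, on a made-up series, shows that `value_counts()` orders categories by frequency while `unique()` orders them by first appearance, so labels are safest when taken from the index of the counts.

```python
import pandas as pd

# Made-up store types; 'B' appears first but 'A' is the most frequent
types = pd.Series(["B", "A", "A", "C", "A", "B"], name="Type")

# Percentage share per category, sorted by frequency (largest first)
shares = types.value_counts(normalize=True) * 100

labels_from_counts = shares.index.tolist()   # matches the order of `shares`
labels_from_unique = types.unique().tolist() # order of first appearance

print(shares.round(2))
print(labels_from_counts, labels_from_unique)
```

Here the two label orders differ, so pairing `shares` with the `unique()` labels would attach 'B' to the largest slice.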

In [None]:
## 'IsHoliday', share of holiday records
values=df_train['IsHoliday'].value_counts()
plt.pie(values,labels=values.index,autopct='%.2f%%')  # labels taken from the counts so they match the slices
plt.title("Percent of Holidays")

plt.show()

Holiday weeks make up only a small share of the records.

A relatively high positive correlation is found between 'Fuel_Price' and 'year'.


In [None]:
# Barplot between 'year' and 'Fuel_Price'

sns.barplot(x='year',y='Fuel_Price',data=df_train)

Fuel prices are higher in 2011 and 2012 than in 2010, which is consistent with a positive correlation between 'Fuel_Price' and 'year'.
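The bar heights in such a plot are per-year means, and the strength of the relationship can be summarised with a Pearson correlation. The sketch below uses made-up fuel prices purely to show the two calculations; the numbers are not taken from the dataset.

```python
import pandas as pd

# Made-up fuel prices, two observations per year
toy = pd.DataFrame({
    "year": [2010, 2010, 2011, 2011, 2012, 2012],
    "Fuel_Price": [2.6, 2.8, 3.4, 3.6, 3.5, 3.7],
})

# Mean fuel price per year (what a bar plot of year vs. Fuel_Price displays)
yearly_mean = toy.groupby("year")["Fuel_Price"].mean()

# Pearson correlation between the two columns
pearson_r = toy["year"].corr(toy["Fuel_Price"])

print(yearly_mean)
print(round(pearson_r, 3))
```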

In [None]:
# Relationship of unemployment with the stores

plt.rcParams['figure.figsize']=[10,3]
sns.lineplot(data=df_train,x='Store',y='Unemployment')

Unemployment appears broadly similar across most stores.

In [None]:
# Mean Weekly_Sales per week, computed separately for each year
weekly_sales_2010=df_train[df_train['year']==2010].groupby('week')['Weekly_Sales'].mean()
weekly_sales_2011=df_train[df_train['year']==2011].groupby('week')['Weekly_Sales'].mean()
weekly_sales_2012=df_train[df_train['year']==2012].groupby('week')['Weekly_Sales'].mean()

In [None]:
## Average Weekly_Sales per week, one panel per year

fig,axs=plt.subplots(nrows=1,ncols=3,figsize=(15,3))
# Draw each year on its own axes instead of mixing plt.subplots with plt.subplot
sns.lineplot(x=weekly_sales_2010.index,y=weekly_sales_2010.values,ax=axs[0])
sns.lineplot(x=weekly_sales_2011.index,y=weekly_sales_2011.values,ax=axs[1])
sns.lineplot(x=weekly_sales_2012.index,y=weekly_sales_2012.values,ax=axs[2])
plt.show()


In [None]:
## Combining the weekly averages of all years in one plot

plt.rcParams['figure.figsize']=[15,5]
sns.lineplot(x=weekly_sales_2010.index,y=weekly_sales_2010.values,label='2010')
sns.lineplot(x=weekly_sales_2011.index,y=weekly_sales_2011.values,label='2011')
sns.lineplot(x=weekly_sales_2012.index,y=weekly_sales_2012.values,label='2012')
plt.grid()
plt.legend(loc='best')
plt.title("Average weekly sales for every year")
plt.xticks(np.arange(0,54))  # %U week numbers run from 0 to 53
plt.show()

This shows that, in every year, Weekly_Sales increase toward the end of the year.
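The per-year weekly averages can also be built in one step as a week-by-year table, which makes year-to-year comparison of the same week straightforward. This is an alternative sketch on made-up numbers, not the approach used in the cells above.

```python
import pandas as pd

# Made-up sales records for two late-year weeks in two years
toy = pd.DataFrame({
    "year": [2010, 2010, 2010, 2011, 2011, 2011],
    "week": [50, 51, 51, 50, 51, 51],
    "Weekly_Sales": [100.0, 200.0, 300.0, 110.0, 210.0, 330.0],
})

# Mean sales per (year, week), reshaped so rows are weeks and columns are years
weekly_mean = (
    toy.groupby(["year", "week"])["Weekly_Sales"]
    .mean()
    .unstack("year")
)

print(weekly_mean)
```

Each column of `weekly_mean` is one year's curve, so a single table holds everything the three separate series hold.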

In [None]:
# Average sales in every department

plt.rcParams['figure.figsize']=[20,5]
sns.barplot(data=df_train,x='Dept',y='Weekly_Sales',palette='bright')
plt.grid()
plt.title("Average sales in every department")
plt.show()


Average Weekly_Sales vary widely from department to department.

In [None]:
# Barplot between the 'Stores' and the 'Weekly_Sales', Average sales in each store

plt.rcParams['figure.figsize']=[15,5]
sns.barplot(x='Store',y='Weekly_Sales',data=df_train,palette='bright')
plt.grid()
plt.title("Average sales in every Store")
plt.show()


In [None]:
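# Correlation between the numeric features, shown as a correlation-matrix heatmap
# numeric_only=True (pandas >= 1.5) skips non-numeric columns such as 'Type'

sns.heatmap(data=df_train.corr(numeric_only=True),annot=True,fmt='.2f',cmap='Blues')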

#### Conclusion:

- The 'MarkDown' features have almost 60% null values, so they were dropped
- The 'Date' feature is converted to 'year' and 'week' features, as the problem states that sales depend strongly on the week
- 'Weekly_Sales' increase toward the end of every year
- Holiday weeks are far fewer than non-holiday weeks
- 'Fuel_Price' is highly correlated with the 'year' feature
- 'Size' and 'Dept' are noticeably correlated with 'Weekly_Sales'
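Correlation-based conclusions like the last two can be read off a heatmap, or listed by ranking features on their correlation with the target. The sketch below does the ranking on a toy frame with made-up values; selecting numeric columns first is one way to keep string columns such as 'Type' out of the calculation.

```python
import pandas as pd

# Toy frame with made-up values; 'Type' is a string column
toy = pd.DataFrame({
    "Size": [100, 150, 200, 250, 300],
    "Temperature": [60.0, 45.0, 70.0, 50.0, 65.0],
    "Type": ["A", "B", "A", "C", "B"],
    "Weekly_Sales": [10.0, 16.0, 19.0, 26.0, 30.0],
})

# Pearson correlation is only defined for numeric columns
numeric = toy.select_dtypes(include="number")

# Correlation of every numeric feature with the target, target itself removed
corr = numeric.corr()["Weekly_Sales"].drop("Weekly_Sales")

# Rank by absolute strength so strong negative correlations are not overlooked
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)

print(ranked.round(3))
```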