
In this machine learning tutorial, I will forecast sales with a Linear Regression model and compare the actual and forecasted sales using three metrics: root mean squared error (RMSE), mean absolute error (MAE), and R2 score.

We are going to use daily sales data (items sold per day) from different stores, covering 2013 to 2017.

# **SETTING UP OUR DATA**

**Importing Necessary Libraries**

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score
from tensorflow.keras.layers import Dense,LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping,ModelCheckpoint
plt.rcParams['axes.facecolor'] = 'green'

In [None]:
store_sales = pd.read_csv("train.csv")
store_sales.head(10)

Unnamed: 0,date,store,item,sales
0,2013-01-01,1.0,1.0,13.0
1,2013-01-02,1.0,1.0,11.0
2,2013-01-03,1.0,1.0,14.0
3,2013-01-04,1.0,1.0,13.0
4,2013-01-05,1.0,1.0,10.0
5,2013-01-06,1.0,1.0,12.0
6,2013-01-07,1.0,1.0,10.0
7,2013-01-08,1.0,1.0,9.0
8,2013-01-09,1.0,1.0,12.0
9,2013-01-10,1.0,1.0,9.0


# **DATA PREPROCESSING**

**Checking For Null Values In The Dataset**

In [None]:
store_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173155 entries, 0 to 173154
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   date    173155 non-null  object 
 1   store   173154 non-null  float64
 2   item    173154 non-null  float64
 3   sales   173154 non-null  float64
dtypes: float64(3), object(1)
memory usage: 5.3+ MB
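
The `info()` output above shows one fewer non-null value in `store`, `item` and `sales` than in `date`, which suggests a single incomplete row. A minimal sketch of how the missing values could be counted and located directly, using a small hypothetical frame (`toy`) in place of `store_sales`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for store_sales; the last row is deliberately incomplete.
toy = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-02", "2013-01"],
    "store": [1.0, 1.0, np.nan],
    "item": [1.0, 1.0, np.nan],
    "sales": [13.0, 11.0, np.nan],
})

# Count the missing values in each column.
null_counts = toy.isna().sum()
print(null_counts)

# Select the rows that contain at least one missing value.
incomplete_rows = toy[toy.isna().any(axis=1)]
print(incomplete_rows)
```

On the real data, the same two calls on `store_sales` should point at the row responsible for the missing values.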


**Dropping Store And Item Columns To Concentrate More On The Sales Value.**

In [None]:
store_sales = store_sales.drop(["store","item"],axis = 1)

In [None]:
store_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173155 entries, 0 to 173154
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   date    173155 non-null  object 
 1   sales   173154 non-null  float64
dtypes: float64(1), object(1)
memory usage: 2.6+ MB


**Converting Date From Object Datatype To Datetime Datatype**

In [None]:
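#The file contains one incomplete row (see info() above), and parsing its date raised a ParserError.
#Coerce unparseable dates to NaT, then drop the incomplete rows.
store_sales['date'] = pd.to_datetime(store_sales['date'], errors='coerce')
store_sales = store_sales.dropna()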

In [None]:
store_sales.info()

**Converting Date To A Month Period, Then Summing The Items Sold In Each Month**

In [None]:
store_sales['date'] = store_sales['date'].dt.to_period("M")
monthly_sales = store_sales.groupby('date').sum().reset_index()

**Convert The Resulting Date Column To Timestamp Datatype**

In [None]:
monthly_sales['date'] = monthly_sales['date'].dt.to_timestamp()

In [None]:
#to see our result on converting to monthly.
monthly_sales.head(12)
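
To make the daily-to-monthly conversion concrete, here is a small sketch on hypothetical data (`daily` and its values are made up for illustration). It follows the same three steps: convert to a month period, sum per month, convert back to a timestamp.

```python
import pandas as pd

# Hypothetical daily sales spanning two calendar months.
daily = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-30", "2013-01-31", "2013-02-01", "2013-02-02"]),
    "sales": [10.0, 12.0, 7.0, 9.0],
})

# Map every day to its calendar month, then total the sales within each month.
daily["date"] = daily["date"].dt.to_period("M")
monthly = daily.groupby("date")["sales"].sum().reset_index()

# Convert the monthly periods back to timestamps (first day of the month) for plotting.
monthly["date"] = monthly["date"].dt.to_timestamp()
print(monthly)
```

The two January rows collapse into one row dated 2013-01-01, and the two February rows into one dated 2013-02-01.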

**Visualization**

In [None]:
plt.figure(figsize=(15,5))
plt.plot(monthly_sales['date'],monthly_sales['sales'],color="yellow")
plt.xlabel("Date")
plt.ylabel("Sales")
plt.title("Monthly Customer Sales")
plt.grid()

**Differencing The Sales Column To Make The Sales Data Stationary**

In [None]:
#Calculating the increase and decrease in sales over the period of time.
monthly_sales['sales_diff'] = monthly_sales['sales'].diff()
monthly_sales = monthly_sales.dropna()
monthly_sales.head(10)
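
As a quick illustration of what `diff()` does, the sketch below uses a short made-up series. It also shows that the original values can be rebuilt from the first value plus the running total of the differences, which is the idea used later to turn predicted differences back into sales.

```python
import pandas as pd

# Hypothetical monthly sales totals.
sales = pd.Series([100.0, 120.0, 90.0, 130.0])

# Each entry is the change from the previous month; the first entry has no predecessor (NaN).
sales_diff = sales.diff()
print(sales_diff.tolist())

# Undo the differencing: first value plus the cumulative sum of the changes.
rebuilt = sales.iloc[0] + sales_diff.dropna().cumsum()
print(rebuilt.tolist())
```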

In [None]:
plt.figure(figsize=(15,5))
plt.plot(monthly_sales['date'],monthly_sales['sales_diff'],color="yellow")
plt.xlabel("Date")
plt.ylabel("Sales Difference")
plt.title("Monthly Customer Sales Difference")
plt.grid()

**Dropping The Sales And Date Columns**

In [None]:
Supervised_Data = monthly_sales.drop(['date','sales'],axis=1)

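**Preparing The Supervised Data**

In [None]:
#Give each row the previous 12 months of sales differences as input features (month1 = one month back).
for i in range(1,13):
  col_name = 'month'+str(i)
  Supervised_Data[col_name] = Supervised_Data['sales_diff'].shift(i)
#The first 12 rows have an incomplete lag history, so drop them.
Supervised_Data=Supervised_Data.dropna().reset_index(drop=True)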
Supervised_Data.head(10)
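
The lag construction is easier to follow on a tiny example. The sketch below uses a made-up series and only 2 lags instead of 12 (`diffs` and `n_lags` are illustrative names, not part of the notebook).

```python
import pandas as pd

# Hypothetical differenced series.
diffs = pd.DataFrame({"sales_diff": [1.0, 2.0, 3.0, 4.0, 5.0]})

n_lags = 2  # the tutorial uses 12
for i in range(1, n_lags + 1):
    # month{i} holds the value from i rows (months) earlier.
    diffs[f"month{i}"] = diffs["sales_diff"].shift(i)

# Rows without a full lag history contain NaN and are dropped.
supervised = diffs.dropna().reset_index(drop=True)
print(supervised)
```

Each remaining row pairs a target (`sales_diff`) with the values that preceded it, so the first row reads 3.0, 2.0, 1.0.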

**Split The Data Into Train And Test Data**

In [None]:
train_data = Supervised_Data[:-12]        #Everything Except The Last 12 Months
test_data  = Supervised_Data[-12:]         #The Last 12 Months, Held Out For Testing
print("Train Data Shape:",train_data.shape)
print("Test Data Shape:",test_data.shape)

In [None]:
#Use the MinMaxScaler to scale the feature values to a range of -1 to 1
scaler = MinMaxScaler(feature_range=(-1,1))
#Fit the scaler on the training data only
scaler.fit(train_data)
train_data = scaler.transform(train_data)
test_data = scaler.transform(test_data)
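
One consequence of fitting the scaler on the training data only is worth seeing on a small example: test values outside the training range land outside [-1, 1]. The numbers below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-column data: training range is 0 to 10, the test value is 15.
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[15.0]])

scaler = MinMaxScaler(feature_range=(-1, 1))
scaler.fit(train)  # learns min=0 and max=10 from the training data only

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# Same result by hand: (x - min) / (max - min) * 2 - 1
manual = (test - 0.0) / (10.0 - 0.0) * 2 - 1

print(train_scaled.ravel(), test_scaled.ravel(), manual.ravel())
```

This is expected behaviour rather than an error: the scaler must not see the test data, so it cannot guarantee the test values stay within the range.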

In [None]:
#Column 0 is the target (sales_diff); the remaining columns are the 12 lag features.
x_train,y_train = train_data[:,1:], train_data[:,0:1]
x_test,y_test = test_data[:,1:], test_data[:,0:1]
y_train = y_train.ravel()
y_test = y_test.ravel()
print("x_train.shape:",x_train.shape)
print("y_train.shape:",y_train.shape)
print("x_test.shape:",x_test.shape)
print("y_test.shape:",y_test.shape)


**Make A Prediction DataFrame To Merge The Predicted Sales Of All Trained Algorithms**

In [None]:
sales_dates = monthly_sales['date'][-12:].reset_index(drop = True)
predict_df = pd.DataFrame(sales_dates)

In [None]:
#Take the last 13 months: each predicted difference is added to the previous month's actual sales.
actual_sales = monthly_sales['sales'][-13:].to_list()
print(actual_sales)

# **LINEAR REGRESSION**

**Create The Linear Regression Model and Predicted Output**

In [None]:
lr_model = LinearRegression()
lr_model.fit(x_train,y_train)
lr_pre = lr_model.predict(x_test)
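
For readers new to scikit-learn, here is a minimal sketch of the same fit and predict pattern on made-up data that follows y = 2x + 1 exactly, so the fitted coefficients can be checked by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from y = 2x + 1.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 2 * X.ravel() + 1

model = LinearRegression()
model.fit(X, y)

# Predict for an unseen input.
pred = model.predict(np.array([[5.0]]))
print(model.coef_, model.intercept_, pred)
```

In the tutorial, the inputs are the 12 scaled lag features and the target is the scaled sales difference, but the calls are the same.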

In [None]:
#Reshape the linear regression predictions into a column vector
lr_pre = lr_pre.reshape(-1,1)
#Concatenate the predictions with x_test so the matrix has the same columns the scaler was fitted on
#(predicted output first, then the input features of the test data), then undo the scaling.
lr_pre_test_set = np.concatenate([lr_pre,x_test],axis=1)
lr_pre_test_set = scaler.inverse_transform(lr_pre_test_set)
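
The concatenation is needed because the scaler was fitted on a matrix with the target in column 0 followed by the features, and `inverse_transform` expects that same number of columns. A small sketch with made-up numbers (two columns instead of thirteen):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training matrix: column 0 is the target, column 1 is one lag feature.
train = np.array([[0.0, 10.0], [10.0, 20.0], [20.0, 30.0]])
scaler = MinMaxScaler(feature_range=(-1, 1)).fit(train)

# Hypothetical model output and test feature, both in scaled units.
scaled_pred = np.array([[0.5]])
scaled_features = np.array([[0.0]])

# Rebuild a row with the original column layout, then undo the scaling.
full_row = np.concatenate([scaled_pred, scaled_features], axis=1)
restored = scaler.inverse_transform(full_row)

# Only column 0 (the prediction) is of interest.
pred_original_units = restored[:, 0]
print(pred_original_units)
```

Here the target column ranges from 0 to 20, so a scaled value of 0.5 maps back to 15.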

In [None]:
#Add each predicted difference to the previous month's actual sales to get the predicted sales
result_list = []
for index in range(0,len(lr_pre_test_set)):
  result_list.append(lr_pre_test_set[index][0] + actual_sales[index])
lr_pre_series = pd.Series(result_list, name = "Linear Prediction")
predict_df = predict_df.merge(lr_pre_series, left_index = True, right_index=True)
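
The loop above relies on `actual_sales` holding one more value than there are predictions, so that index `i` is the month before prediction `i`. A sketch of the same arithmetic with made-up numbers:

```python
# Hypothetical predicted month-over-month changes for three months.
pred_diffs = [5.0, -3.0, 2.0]

# Hypothetical actual sales: the month before the first prediction, then the three predicted months.
actual_sales = [100.0, 104.0, 102.0, 105.0]

# Predicted sales = predicted change + previous month's actual sales.
# zip stops at the shorter list, so the last actual value is not used as a base.
pred_sales = [diff + previous for diff, previous in zip(pred_diffs, actual_sales)]
print(pred_sales)
```

Because every prediction starts from the previous month's actual value, this is a one-step-ahead forecast: errors do not accumulate from one month to the next.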

In [None]:
#rmse = root mean squared error (square root of the mean squared error)
lr_rmse = np.sqrt(mean_squared_error(monthly_sales['sales'][-12:], predict_df['Linear Prediction']))
#mae = mean absolute error
lr_mae = mean_absolute_error(monthly_sales['sales'][-12:], predict_df['Linear Prediction'])
#r2 = coefficient of determination; scikit-learn expects (y_true, y_pred)
lr_r2 = r2_score(monthly_sales['sales'][-12:], predict_df['Linear Prediction'])
print("Linear Regression RMSE: ",lr_rmse)
print("Linear Regression MAE: ",lr_mae)
print("Linear Regression R2: ",lr_r2)
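
To show what these metrics measure, the sketch below computes them by hand with NumPy on made-up values and compares the results with scikit-learn. It also illustrates that the argument order matters for R2, since scikit-learn's metric functions take the true values first and the predictions second.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values.
y_true = np.array([100.0, 110.0, 120.0])
y_pred = np.array([102.0, 108.0, 123.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)          # average squared error
rmse = np.sqrt(mse)                 # same units as the sales values
mae = np.mean(np.abs(errors))       # average absolute error
# 1 minus (residual sum of squares / total sum of squares around the mean of y_true)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, rmse, mae, r2)
print(r2_score(y_true, y_pred), r2_score(y_pred, y_true))
```

MSE and MAE give the same value whichever way round the arguments are passed, but R2 does not, because its denominator depends on the spread of the first argument.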

**Visualization Of The Prediction Against The Actual Sales**

In [None]:
plt.figure(figsize = (15,5))
#Actual Sales
plt.plot(monthly_sales['date'],monthly_sales['sales'],color ='yellow')
#Predicted Sales
plt.plot(predict_df['date'],predict_df['Linear Prediction'],color='red')
plt.title("Customer Sales Forecast Using LR Model")
plt.xlabel("Date")
plt.ylabel("Sales")
plt.legend(['Actual Sales','Predicted Sales'])
plt.grid()

# **SUMMARY**

As you can see, the yellow line shows the actual sales and the red line shows the predicted sales, which cover only the final 12 months (2017 to 2018).

Over that period the model does a reasonable job: the predictions follow the actual sales fairly closely, apart from some of the steeper rises and falls.
