Skip to content

Abdelrhman-Sadek/Store-chain-Analysis-and-Forecasting

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

39 Commits
 
 
 
 

Repository files navigation

Store-chain-Analysis-and-Forecasting

image

Description

This is a complete Time_Series store chain analysis the goal was to determine:

  • products at each store of the chain and their promotion at a given date.
  • which product family has the most and least sales.
  • which store has the most and least sales
  • which (city, state, type) has the most and least sales.
  • The trend and seasonality of the time series and how the data changes over time(weekly, monthly, daily, etc)
  • the relationship between sales and (Holidays(Local, Regional), Oil Price) and how big they affect the sales.
  • the relationship between the time lags
  • How many shop visitors in each day(transactions) and how it affects the sales

Data Analysis

before beginning with the analysis first, we need to know the skewness of the data: image
the sales are highly positively skewed, the given distribution is shifted to the left and with its tail on the right side.
Yearly average sales by: day, week, month
Daily: image
Weekly: image
Monthly: image
Weekday average sales:
image
there is an increase in sales every year that indicates a trend variable
every day of the year the sales =0 as shown from "Daily Avg, which means that the market is closed.
sales on the weekend the highest of the week and generally low on Thursdays.

Identifies the type of family product sold: image
The best family products sold are: ['GROCERY I', 'BEVERAGES', 'PRODUCE', 'CLEANING', 'DAIRY']
The worst family products sold are: ['MAGAZINES', 'HARDWARE', 'HOME APPLIANCES', 'BABY CARE', 'BOOKS']
Identifies how the sales are going with each store: image
Best stores sales are : [44, 45, 47, 3, 49]
Worst stores sales are : [35, 30, 32, 22, 52]

Determine the trend

I chose a window of 365 days since this series has daily observations to smooth over any short-term changes within the year so that only the long-term changes remain image
The sales have been increasing over the years

Determine seasonality

Using the periodogram to determine the seasonality there are 10 different seasonalities Annual (1) Semiannual (2) Quarterly (4) Bimonthly (6) Monthly (12) Biweekly (26) Weekly (52) Semiweekly(104) Daily(365) Time of day image
The periodogram suggests a strong weekly seasonality
image
A lag plot of a time series shows its values plotted against its lags. Serial dependence in a time series will often become apparent by looking at a lag plot.
Using plot_pacf to see the correlation between 12 lags only: image
plot_pacf shows a strong correlation between lags (1 3 5 6 7 8 and 9) so we will be using these lags in training

Holidays

After removing the transferred data so it dont caused misleading the sales showed a strong correlation with sales: image image
Comparing Avg_sales on holidays vs on workdays image
Sales are significantly higher in Holidays

Oil prices

image
The oil price has a negative correlation with sales the lower the oil price is the more purchasing power for the customers.

Stores

image the plots shows:

  • the best city sales is 'Quito' with over 500 and a percentage of 9.4% of total sales the lowest city sales is 'Puyo' with below 100 sales with only 1.2%

  • The order of best store type sales is (A, over(700),38% of sales),(B, over(300),19% of sales),(E,(300),17% of sales),(C, over(200),10.7% of sales)

  • the highest sale state is 'pichincha' with 13.2% of the sales and over 500 while the other states are close in sales with the highest percentage of 8.7 ('tungurahua') and the lowest is 'pastaza' at 1.8%

  • the highest store cluster is 5 with over 1000 and a percentage of 16%

Modeling

Linear regression excels at extrapolating trends, but can't learn interactions. CATBoost excels at learning interactions, but can't extrapolate trends. In the next codes, I'll create "hybrid" forecasters that combine complementary learning algorithms and let the strengths of one make up for the weaknesses of the other.
It's possible to use one algorithm for some of the components and another algorithm for the rest. This way we can always choose the best algorithm for each component. To do this, we use one algorithm to fit the original series and then the second algorithm to fit the residual series.
In detail, the process is this:
1-Train and predict with the first model model_1.fit(X_train_1, y_train)
y_pred_1 = model_1.predict(X_train)
2-Train and predict with the second model on residuals
model_2.fit(X_train_2, y_train - y_pred_1)
y_pred_2 = model_2.predict(X_train_2)
And then Add to get overall predictions
y_pred = y_pred_1 + y_pred_2
image

Forecasting the next 16 days in sales:

image


Data Link:
https://www.kaggle.com/competitions/store-sales-time-series-forecasting/overview

Releases

No releases published

Packages

No packages published