<a href="https://colab.research.google.com/github/Billy1999/Sales-Prediction-Model/blob/main/Sales_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Problem Statement:**

Predicting Daily Sales for a Retail Store Using a Regression Machine Learning Model

**Objective:**

To develop a machine learning model that accurately predicts daily sales for a retail store based on historical sales data, store identifiers, item identifiers, and date features.

**Background**:

Retail businesses rely heavily on accurate sales forecasting to manage inventory, optimize staffing, and improve customer satisfaction. By predicting future sales, a store can keep popular items in stock, which reduces stockouts, and avoid excess inventory, which reduces waste and storage costs. This project uses historical sales data to build a predictive model that forecasts daily sales for each store-item combination.

**Data**:

The dataset consists of historical sales records, including the following features:

1. date:  *The date of the sales record.*

2. store: *The identifier for the store.*

3. item: *The identifier for the item.*

4. sales: *The number of units sold.*

**Problem**:

Develop a regression model using the CatBoost algorithm to predict the daily sales for a given store and item on a specific date. The model should learn from historical sales data and be able to generalize well to future dates, providing accurate predictions.

**Challenges**:

- **Seasonality and Trends:** Sales data often exhibit seasonal patterns and long-term trends that the model needs to capture.

- **Store and Item Variability:** Different stores and items may have distinct sales patterns, so the model must handle this variability.

- **Data Preprocessing:** Proper handling of date features, such as extracting the day of the week, month, and year, is crucial for model performance.

- **Evaluation:** The model's performance is evaluated with the Mean Absolute Percentage Error (MAPE) to measure how reliable its predictions are.
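
The date handling mentioned under Data Preprocessing can be sketched with pandas. This is a minimal illustration and not the notebook's own pipeline, which later obtains calendar features through Upgini enrichment. The sample rows, the `add_date_features` helper, and the `is_weekend` column are assumptions made for the example.

```python
import pandas as pd

# Hypothetical three-row sample with the same columns as the dataset described above.
sample = pd.DataFrame({
    "date": ["2013-01-01", "2015-07-04", "2017-12-31"],
    "store": ["1", "2", "3"],
    "item": ["10", "20", "30"],
    "sales": [13, 42, 27],
})
sample["date"] = pd.to_datetime(sample["date"])


def add_date_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with calendar features derived from `date`."""
    out = frame.copy()
    out["year"] = out["date"].dt.year
    out["month"] = out["date"].dt.month
    out["day_of_week"] = out["date"].dt.dayofweek  # Monday=0 ... Sunday=6
    out["is_weekend"] = out["day_of_week"] >= 5    # Saturday or Sunday
    return out


features = add_date_features(sample)
print(features[["date", "year", "month", "day_of_week", "is_weekend"]])
```

Working on a copy keeps the original frame unchanged, so the same helper could be applied to both training and prediction inputs.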

**Goals**:

- Preprocess the data to ensure it is suitable for training a regression model.
- Train a CatBoostRegressor model on the historical sales data.
- Evaluate the model's performance using MAPE.
- Predict sales for a single raw input record (date, store, item).
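
For reference, MAPE averages the absolute error of each prediction relative to the actual value. The sketch below computes it by hand on made-up numbers; the `mape` helper and the sample values are assumptions for illustration and are not part of the notebook.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction (0.10 means 10%)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    # Each term is the absolute error relative to the true value.
    # Note: this simple version fails when an actual value is zero.
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return sum(ratios) / len(ratios)


# Hypothetical actual and predicted daily sales, for illustration only.
actual_sales = [20, 50, 80]
predicted_sales = [25, 45, 80]

score = mape(actual_sales, predicted_sales)
print(f"MAPE: {score:.4f}")
```

`sklearn.metrics.mean_absolute_percentage_error` also reports the score as a fraction rather than a percentage, so a value of 0.20 means an average error of 20%.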

In [None]:
%pip install -Uq upgini catboost
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from upgini import FeaturesEnricher, SearchKey
from upgini.metadata import CVType
from catboost import CatBoostRegressor
from sklearn.metrics import mean_absolute_percentage_error

**Data Preparation**

In [None]:
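from os.path import exists

# Use a local copy of the dataset if present; otherwise download it from the Upgini repository.
data_path = 'train.csv.zip' if exists('train.csv.zip') else 'https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip'
df = pd.read_csv(data_path)

# Work with a reproducible 10,000-row sample.
df = df.sample(n=10000, random_state=0)

# Store and item are identifiers, so keep them as strings rather than numbers.
df['store'] = df['store'].astype(str)
df['item'] = df['item'].astype(str)

# Parse dates and sort chronologically so the data can be split by time.
df['date'] = pd.to_datetime(df['date'])

df.sort_values('date', inplace=True)
df.reset_index(inplace=True, drop=True)
df.head()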

date,store,item,sales
2013-01-01,3,12,38
2013-01-01,4,9,19
2013-01-01,10,21,33
2013-01-01,3,27,11
2013-01-01,2,3,19


**Exploratory Data Analysis**

In [None]:
fig = px.bar(df, x='date', y='sales', title='Date vs Sales', color_discrete_sequence=['#1093B8']*len(df))
fig.update_layout(bargap=0.2)
fig.show()


In [None]:
fig = px.bar(df, x='item', y='sales', color= 'item', text_auto='.2s',
            title="Volume of Sales per item")
fig.show()

In [None]:
fig = px.bar(df, x='store', y='sales', color= 'store', title="Volume of Sales by store")
fig.show()

In [None]:
# Chronological split: train on records before 2017 and evaluate on 2017 onward.
train_df = df[df['date'] < '2017-01-01']
test_df = df[df['date'] >= '2017-01-01']

In [None]:
train_input = train_df.drop(columns= 'sales')
target_input = train_df['sales']
test_input = test_df.drop(columns= 'sales')
test_target = test_df['sales']

**Feature Enrichment**

In [None]:
enricher = FeaturesEnricher(
    search_keys = {
        'date': SearchKey.DATE,
    },
    cv = CVType.time_series
)

enricher.fit(train_input,
             target_input,
             eval_set = [(test_input, test_target)])







Try to add other keys like the COUNTRY, POSTAL_CODE, PHONE NUMBER, EMAIL/HEM, IPv4 to your training dataset
for search through all the available data sources.
See docs https://github.com/upgini/upgini#-total-239-countries-and-up-to-41-years-of-history



Detected task type: ModelTaskType.REGRESSION




Column name,Status,Errors
date,All valid,-
target,All valid,-



Running search request, search_id=cf29b8c7-84ec-41c8-b6fa-82a6c4027ee2
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com

72 relevant feature(s) found with the search keys: ['date']


Feature name,SHAP value,Coverage %,Value preview,Provider,Source,Updates
f_weather_date_weather_umap_31_fa6d9a99,0.0414,100.0,"5.0192, 4.9369, 4.8183",Upgini,Weather & climate normals data,Daily
f_weather_date_weather_umap_48_b39cd0c4,0.025,100.0,"4.9914, 5.1518, 5.8073",Upgini,Weather & climate normals data,Daily
f_weather_date_weather_umap_34_c3ef5b4f,0.0185,100.0,"4.7504, 4.8679, 5.2753",Upgini,Weather & climate normals data,Daily
f_autofe_div_89c56a5f,0.0134,100.0,"0.7546, -0.8222, 0.5888",Upgini,"AutoFE: features from Calendar data,Markets data",Daily
f_autofe_div_62d91cae,0.0078,100.0,"52.1132, 54.1773, 54.6842",Upgini,AutoFE: features from Markets data,Daily
f_autofe_div_a152c923,0.0073,100.0,"-0.0031, -0.0026, 0.0033",Upgini,"AutoFE: features from Calendar data,Markets data",Daily
f_autofe_mul_6a97c336,0.0073,100.0,"1.0023, 0.9289, -1.0908",Upgini,"AutoFE: features from Calendar data,Markets data",Daily
f_autofe_div_d6db5d7a,0.007,100.0,"-1.192, -0.2143, -0.5971",Upgini,"AutoFE: features from Calendar data,Markets data",Daily
f_autofe_div_1c37ce17,0.0069,100.0,"0.2188, 0.3307, -0.3185",Upgini,"AutoFE: features from Calendar data,Markets data",Daily
f_events_date_year_cos1_9014a856,0.0052,100.0,"-0.9987, -0.7841, -0.8703",Upgini,Calendar data,Daily


Provider,Source,All features SHAP,Number of relevant features
Upgini,Weather & climate normals data,0.1014,14
Upgini,"AutoFE: features from Calendar data,Markets data",0.0936,26
Upgini,AutoFE: features from Markets data,0.0291,14
Upgini,AutoFE: features from Calendar data,0.0083,7
Upgini,Calendar data,0.0055,4
Upgini,World economic indicators,0.0014,4
Upgini,Markets data,0.0003,3


Sources,Feature name,Feature 1,Feature 2,Function
"Calendar data,Markets data",f_autofe_div_89c56a5f,f_events_date_year_cos1_9014a856,f_financial_date_natural_gas_7d_to_7d_1y_shift_a5c3c07f,/
Markets data,f_autofe_div_62d91cae,f_financial_date_dow_jones_65aaa996,f_financial_date_stoxx_043cbcd4,/
"Calendar data,Markets data",f_autofe_div_a152c923,f_events_date_year_cos1_9014a856,f_financial_date_stoxx_043cbcd4,/
"Calendar data,Markets data",f_autofe_mul_6a97c336,f_events_date_week_cos1_f6a8c1fc,f_financial_date_vix_7d_to_1y_634c77eb,*
"Calendar data,Markets data",f_autofe_div_d6db5d7a,f_events_date_year_cos1_9014a856,f_financial_date_crude_oil_7d_to_1y_c3e0ad17,/
"Calendar data,Markets data",f_autofe_div_1c37ce17,f_events_date_week_sin1_847b5db1,f_financial_date_natural_gas_92dac942,/
"Calendar data,Markets data",f_autofe_div_0a5adf97,f_events_date_year_cos1_9014a856,f_financial_date_natural_gas_92dac942,/
"Calendar data,Markets data",f_autofe_mul_25296268,f_events_date_week_sin1_847b5db1,f_financial_date_vix_7d_to_1y_634c77eb,*
"Calendar data,Markets data",f_autofe_mul_af6d166b,f_events_date_week_cos3_7525fe31,f_financial_date_vix_7d_to_1y_634c77eb,*
"Calendar data,Markets data",f_autofe_mul_b59b15f6,f_events_date_week_sin1_847b5db1,f_financial_date_dow_jones_65aaa996,*



Examples of outliers with maximum value of target:
40    205
24    196
46    176
Name: target, dtype: int64
Outliers will be excluded during the metrics calculation.
Calculating accuracy uplift after enrichment...

which makes metrics between the train and eval_set incomparable.


Dataset type,Rows,Mean target,Baseline mean_squared_error,Enriched mean_squared_error,Uplift
Train,7988,50.1955,311.8579,213.1434,98.7144
Eval 1,2012,59.4155,503.3738,380.9944,122.3794


**Model Training**

In [None]:
model = CatBoostRegressor(verbose=False, allow_writing_files=False, random_state=0)
enricher.calculate_metrics(
    train_input,
    target_input,
    eval_set = [(test_input, test_target)],
    estimator = model,
    scoring = 'mean_absolute_percentage_error'
)

Calculating accuracy uplift after enrichment...
which makes metrics between the train and eval_set incomparable.


Dataset type,Rows,Mean target,Baseline mean_absolute_percentage_error,Enriched mean_absolute_percentage_error,Uplift
Train,7988,50.1955,0.260691,0.17055,0.090141
Eval 1,2012,59.4155,0.265491,0.186428,0.079063


In [None]:
new_train_input = enricher.transform(train_input, keep_input=True)
new_test_input = enricher.transform(test_input, keep_input=True)

You use Trial access to Upgini data enrichment. Limit for Trial: 10000 rows. You have already enriched: 0 rows.



Column name,Status,Errors
date,All valid,-



Running search request, search_id=ad2c8778-bd61-4f0a-a070-37b8f7ae1a2d
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com

Retrieving selected features from data sources...


You use Trial access to Upgini data enrichment. Limit for Trial: 10000 rows. You have already enriched: 0 rows.



Column name,Status,Errors
date,All valid,-



Running search request, search_id=e4a8967e-ccc4-48fe-9374-3f091dee8865
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com

Retrieving selected features from data sources...


In [None]:
# Baseline: fit on the original features only and score on the test period.
model.fit(train_input, target_input)
preds = model.predict(test_input)

baseline_mape = mean_absolute_percentage_error(test_target, preds)
baseline_mape

0.319122534318008

In [None]:
# Enriched: refit on the original features plus the Upgini features and score again.
model.fit(new_train_input, target_input)
preds = model.predict(new_test_input)

new_mape = mean_absolute_percentage_error(test_target, preds)
new_mape

0.15989892782373913
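
To put the two scores above in perspective, the short calculation below derives the absolute and relative reduction in MAPE. The values are copied from the cell outputs; the variable names are chosen for this example only. Since the scores come from a single split of a 10,000-row sample, the result is best read as indicative.

```python
# MAPE values copied from the two cell outputs above.
baseline_score = 0.319122534318008    # original features only
enriched_score = 0.15989892782373913  # original plus enriched features

# Absolute reduction is the plain difference between the two scores.
absolute_reduction = baseline_score - enriched_score

# Relative reduction expresses that difference as a share of the baseline.
relative_reduction = absolute_reduction / baseline_score

print(f"Absolute MAPE reduction: {absolute_reduction:.4f}")
print(f"Relative MAPE reduction: {relative_reduction:.1%}")
```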

**Predicting a single input**

In [None]:
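def predict_input(single_input):
    """Predict sales for one record given as a dict with 'date', 'store' and 'item'."""
    input_df = pd.DataFrame([single_input])
    # Apply the same type conversions used during data preparation.
    input_df['store'] = input_df['store'].astype(str)
    input_df['item'] = input_df['item'].astype(str)
    input_df['date'] = pd.to_datetime(input_df['date'])
    # Add the external features selected by the fitted enricher.
    pred_inputs = enricher.transform(input_df, keep_input=True)

    pred = model.predict(pred_inputs)[0]
    return pred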

In [None]:
new_input = {'date': '2018-06-19',
             'store': 3,
             'item': 5}

prediction = predict_input(new_input)
print(f'Prediction: {prediction}')

You use Trial access to Upgini data enrichment. Limit for Trial: 10000 rows. You have already enriched: 0 rows.

That search key will add constant features for different y values.
Please add extra search keys with non constant values, like the COUNTRY, POSTAL_CODE, DATE, PHONE NUMBER, EMAIL/HEM or IPv4



Column name,Status,Errors
date,All valid,-



Running search request, search_id=875b28b6-823e-46c0-8a34-c5d5e821c3c6
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com

Retrieving selected features from data sources...
Prediction: 32.708767243131234
