# **PASTRY PREDICTION**

<img src="https://drive.google.com/uc?export=view&id=10ReGEglV-DfjhQ3CCkeDxgE6WxNo4BXM" width="9000" height="400">
<hr>

## **Overview**
**Predicting Pastry Sales for Smarter Resource Use**

In a world where systematic overproduction drives both economic inefficiency and environmental degradation, the challenge of accurate demand forecasting has never been more pressing. This competition invites you to tackle one such real-world scenario: predicting pastry sales. Framed as a time-series forecasting problem, the task calls for innovative yet efficient machine learning solutions to help reduce waste while meeting customer demands.

The dual objectives of reducing waste and satisfying customer demand embody a broader economic challenge. Can businesses achieve resource efficiency without compromising profitability? This competition highlights the power of data science to address these dilemmas, while also hinting at a deeper truth: sustainable change might require not just better prediction models, but a shift in societal values and consumption habits.
<hr>

## **Description**
**The Problem: Pastry Planning and Overproduction**

Pastry sales present a particularly intriguing forecasting challenge. Freshness is critical, and unsold items often end up as waste at the end of the day. Overproduction leads to increased costs and environmental harm, while underproduction risks lost revenue and dissatisfied customers. The stakes are high: businesses that master this balance can not only improve their profitability but also contribute to global efforts to reduce food waste.

**The Goal: Intelligent Solutions for a Complex Trade-Off**

Your mission is to develop models that enable better pastry production planning. By accurately predicting daily and seasonal variations in sales, bakeries can produce just the right amount. However, the solution comes with its own trade-offs. Many state-of-the-art machine learning methods require significant computational resources, raising questions about their sustainability. This competition encourages you to weigh these trade-offs and explore innovative, computationally efficient approaches to forecasting.
<hr>


## **1. Importing the Necessary Libraries**

In [1]:
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [2]:
!pip install opendatasets --quiet

### 1.1 Downloading the Dataset


In [3]:
import opendatasets as od
# Download the dataset

dataset_url = 'https://www.kaggle.com/competitions/pastry-prediction/data'

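# Note: a bare `%time` line times an empty statement, so the timings printed below
# do not measure the download. Write `%time od.download(dataset_url)` on one line to time it.
%time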
od.download(dataset_url)

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 8.34 µs
Extracting archive ./pastry-prediction/pastry-prediction.zip to ./pastry-prediction


### 1.2 Viewing the Datasets

In [4]:
data_dir = 'pastry-prediction'

In [5]:
!ls -lh {data_dir}

total 916K
-rw-r--r-- 1 root root  99K Apr  9 01:11 sample_submission.csv
-rw-r--r-- 1 root root 142K Apr  9 01:11 test.csv
-rw-r--r-- 1 root root 671K Apr  9 01:11 train.csv


In [6]:
# No. of lines in training set
!wc -l {data_dir}/train.csv

5143 pastry-prediction/train.csv


In [7]:
!wc -l {data_dir}/test.csv

1576 pastry-prediction/test.csv


In [8]:
!head {data_dir}/train.csv

date,store,is_state_holiday,is_school_holiday,is_special_day,temperature_max,temperature_min,temperature_mean,sunshine_sum,precipitation_sum,sales,unsold,ordered
2021-08-02,store_5,normal_day,school_holiday,normal_day,21.8,16.3,19.59090909090909,203,0.0,1.4327476970685655,,
2021-08-02,store_8,normal_day,school_holiday,normal_day,21.8,16.3,19.59090909090909,203,0.0,1.2531925968069404,,
2021-08-02,store_7,normal_day,school_holiday,normal_day,21.8,16.3,19.59090909090909,203,0.0,0.9850258886239939,,
2021-08-02,store_4,normal_day,school_holiday,normal_day,21.8,16.3,19.59090909090909,203,0.0,0.889418627445726,,
2021-08-02,store_1,normal_day,school_holiday,normal_day,21.8,16.3,19.59090909090909,203,0.0,0.5722823464641544,,
2021-08-03,store_8,normal_day,school_holiday,normal_day,22.3,18.7,20.945454545454545,162,0.0,0.8381171702281188,,
2021-08-03,store_7,normal_day,school_holiday,normal_day,22.3,18.7,20.945454545454545,162,0.0,0.7471736778878152,,
2021-08-03,store_4,normal_day,school_holiday,n

## **2. Exploratory Data Analysis**

To start, we explore the train and test datasets. First, we load both files into pandas DataFrames and parse the ``date`` column as a datetime.

In [9]:
train_df=pd.read_csv(f'{data_dir}/train.csv', parse_dates=['date'])
test_df =pd.read_csv(f'{data_dir}/test.csv', parse_dates=['date'])

### 2.1 Train Datasets

In [10]:
train_df.head()

Unnamed: 0,date,store,is_state_holiday,is_school_holiday,is_special_day,temperature_max,temperature_min,temperature_mean,sunshine_sum,precipitation_sum,sales,unsold,ordered
0,2021-08-02,store_5,normal_day,school_holiday,normal_day,21.8,16.3,19.590909,203,0.0,1.432748,,
1,2021-08-02,store_8,normal_day,school_holiday,normal_day,21.8,16.3,19.590909,203,0.0,1.253193,,
2,2021-08-02,store_7,normal_day,school_holiday,normal_day,21.8,16.3,19.590909,203,0.0,0.985026,,
3,2021-08-02,store_4,normal_day,school_holiday,normal_day,21.8,16.3,19.590909,203,0.0,0.889419,,
4,2021-08-02,store_1,normal_day,school_holiday,normal_day,21.8,16.3,19.590909,203,0.0,0.572282,,


In [11]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5142 entries, 0 to 5141
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   date               5142 non-null   datetime64[ns]
 1   store              5142 non-null   object        
 2   is_state_holiday   5142 non-null   object        
 3   is_school_holiday  5142 non-null   object        
 4   is_special_day     5142 non-null   object        
 5   temperature_max    5142 non-null   float64       
 6   temperature_min    5142 non-null   float64       
 7   temperature_mean   5142 non-null   float64       
 8   sunshine_sum       5142 non-null   int64         
 9   precipitation_sum  5142 non-null   float64       
 10  sales              5142 non-null   float64       
 11  unsold             3226 non-null   float64       
 12  ordered            3226 non-null   float64       
dtypes: datetime64[ns](1), float64(7), int64(1), object(4)
memory us

In [12]:
#the start and end date of my dataframe

print("Start date:", train_df['date'].min())
print("End date:", train_df['date'].max())


Start date: 2021-08-02 00:00:00
End date: 2023-11-30 00:00:00


In [13]:
#descriptive summary of the train dataset
train_df.describe()

Unnamed: 0,date,temperature_max,temperature_min,temperature_mean,sunshine_sum,precipitation_sum,sales,unsold,ordered
count,5142,5142.0,5142.0,5142.0,5142.0,5142.0,5142.0,3226.0,3226.0
mean,2022-11-21 12:47:19.673278720,15.60142,10.825729,13.641699,179.550758,0.916725,0.11553,0.047613,0.09261
min,2021-08-02 00:00:00,-4.4,-8.1,-5.654545,0.0,0.0,-1.656999,-1.289811,-1.715776
25%,2022-04-18 00:00:00,8.9,5.8,7.581818,11.0,0.0,-0.582,-0.67552,-0.594898
50%,2022-12-31 00:00:00,16.0,10.95,13.827273,148.0,0.0,0.045276,-0.146548,0.00985
75%,2023-06-30 00:00:00,21.9,16.2,19.754545,308.0,0.4,0.712195,0.570125,0.672079
max,2023-11-30 00:00:00,37.7,30.2,35.172727,551.0,32.2,7.539953,7.472362,7.869175
std,,8.317952,6.946531,7.825537,162.651716,2.596219,0.997744,0.994486,0.967469


#### Observations


*   The training set has 13 columns.
*   Only the ``unsold`` and ``ordered`` columns have missing values (3226 of 5142 rows are non-null).
*   The ``date`` column spans from August 2021 to November 2023.
*   The ``sales``, ``ordered`` and ``unsold`` columns contain negative values, which is implausible for raw counts, so we will investigate this.
*   Some feature engineering will likely be needed.
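
The checks behind these observations can be reproduced in a couple of lines. The following is a minimal sketch on a hypothetical toy frame (`toy_df` and its values are made up for illustration, they only mimic the structure of `train_df`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the structure of train_df (values are made up).
toy_df = pd.DataFrame({
    "sales":   [1.43, -0.58, 0.05, -1.10],
    "unsold":  [np.nan, -0.67, 0.57, np.nan],
    "ordered": [np.nan, -0.59, 0.67, np.nan],
})

# Count of missing values per column.
missing_counts = toy_df.isna().sum()

# Count of negative values per column (NaN compares as False, so it is not counted).
negative_counts = (toy_df < 0).sum()

summary = pd.DataFrame({"missing": missing_counts, "negative": negative_counts})
print(summary)
```

Running the same two expressions on `train_df` should give the per-column counts directly, without reading them off `info()` and `describe()`.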




### 2.2 Test Datasets

In [14]:
test_df.head()

Unnamed: 0,row_id,date,store,is_state_holiday,is_school_holiday,is_special_day,temperature_max,temperature_min,temperature_mean,sunshine_sum,precipitation_sum
0,0,2023-12-01,store_3,normal_day,normal_day,normal_day,-0.8,-2.7,-1.709091,273,0.8
1,1,2023-12-01,store_5,normal_day,normal_day,normal_day,-2.7,-6.4,-4.554545,8,0.0
2,2,2023-12-01,store_1,normal_day,normal_day,normal_day,-2.7,-6.4,-4.554545,8,0.0
3,3,2023-12-01,store_4,normal_day,normal_day,normal_day,-2.7,-6.4,-4.554545,8,0.0
4,4,2023-12-01,store_7,normal_day,normal_day,normal_day,-2.7,-6.4,-4.554545,8,0.0


In [15]:
test_df.describe()

Unnamed: 0,row_id,date,temperature_max,temperature_min,temperature_mean,sunshine_sum,precipitation_sum
count,1575.0,1575,1575.0,1575.0,1575.0,1575.0,1575.0
mean,787.0,2024-02-27 23:58:10.285714432,11.143683,6.960063,9.32592,131.841905,0.917651
min,0.0,2023-12-01 00:00:00,-4.5,-8.8,-6.545455,0.0,0.0
25%,393.5,2024-01-14 00:00:00,7.2,3.4,5.445455,0.0,0.0
50%,787.0,2024-02-27 00:00:00,9.9,6.6,8.3,76.0,0.0
75%,1180.5,2024-04-12 00:00:00,15.2,10.3,12.977273,233.0,0.7
max,1574.0,2024-05-31 00:00:00,28.5,21.0,24.8,529.0,27.9
std,454.807652,,7.277286,5.813653,6.667414,145.860553,2.30591


In [16]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1575 entries, 0 to 1574
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   row_id             1575 non-null   int64         
 1   date               1575 non-null   datetime64[ns]
 2   store              1575 non-null   object        
 3   is_state_holiday   1575 non-null   object        
 4   is_school_holiday  1575 non-null   object        
 5   is_special_day     1575 non-null   object        
 6   temperature_max    1575 non-null   float64       
 7   temperature_min    1575 non-null   float64       
 8   temperature_mean   1575 non-null   float64       
 9   sunshine_sum       1575 non-null   int64         
 10  precipitation_sum  1575 non-null   float64       
dtypes: datetime64[ns](1), float64(4), int64(2), object(4)
memory usage: 135.5+ KB


### 2.3 EDA-Train Dataset

In [17]:
train_df_roll_back = train_df.copy()

Investigating the negative figures in the ``sales`` column

In [18]:
train_df[train_df['unsold'] > 0 ].head()

Unnamed: 0,date,store,is_state_holiday,is_school_holiday,is_special_day,temperature_max,temperature_min,temperature_mean,sunshine_sum,precipitation_sum,sales,unsold,ordered
1676,2022-07-05,store_8,normal_day,normal_day,normal_day,25.7,18.3,23.145455,315,0.0,-0.0993,0.740761,0.315218
1680,2022-07-06,store_8,normal_day,normal_day,normal_day,20.4,16.7,19.090909,56,0.0,0.34609,0.868738,0.808596
1685,2022-07-07,store_1,normal_day,school_holiday,normal_day,20.2,14.8,17.218182,40,5.9,-0.262532,1.06497,0.238576
1688,2022-07-07,store_7,normal_day,school_holiday,normal_day,20.2,14.8,17.218182,40,5.9,0.733182,0.382425,1.069655
1689,2022-07-07,store_4,normal_day,school_holiday,normal_day,20.2,14.8,17.218182,40,5.9,-0.246209,1.337988,0.331983


In [19]:
# train_df['cross_check'] = train_df['sales'] + train_df['unsold']
train_df[train_df['unsold'] > 0 ].head()

Unnamed: 0,date,store,is_state_holiday,is_school_holiday,is_special_day,temperature_max,temperature_min,temperature_mean,sunshine_sum,precipitation_sum,sales,unsold,ordered
1676,2022-07-05,store_8,normal_day,normal_day,normal_day,25.7,18.3,23.145455,315,0.0,-0.0993,0.740761,0.315218
1680,2022-07-06,store_8,normal_day,normal_day,normal_day,20.4,16.7,19.090909,56,0.0,0.34609,0.868738,0.808596
1685,2022-07-07,store_1,normal_day,school_holiday,normal_day,20.2,14.8,17.218182,40,5.9,-0.262532,1.06497,0.238576
1688,2022-07-07,store_7,normal_day,school_holiday,normal_day,20.2,14.8,17.218182,40,5.9,0.733182,0.382425,1.069655
1689,2022-07-07,store_4,normal_day,school_holiday,normal_day,20.2,14.8,17.218182,40,5.9,-0.246209,1.337988,0.331983


In [20]:
# Count how many negative values the sales column contains

negative_sales_count = train_df[train_df['sales'] < 0]['sales'].count()
print(f"Number of negative values in 'sales' column: {negative_sales_count}")


Number of negative values in 'sales' column: 2465


Upon investigation, the negative values in the ``sales`` column are not an issue. Comparing ``sales`` with the ``unsold`` and ``ordered`` columns, the numbers do not add up (``sales`` plus ``unsold`` does not equal ``ordered``), which is most likely a result of the scaling method applied to the data.
<br>
The count of negative values is also large (``2465`` of 5142 rows), which leads us to conclude that these values represent sales below the mean.
<br>
We will drop the ``ordered`` and ``unsold`` columns when building the model, since they are not available in the test dataset. These two columns are also the source of the missing values seen earlier.
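
To illustrate why scaling would produce this pattern, here is a minimal sketch assuming z-score standardization. This is an assumption, since the notebook does not state which scaling method the organizers used, and the raw numbers below are made up:

```python
import numpy as np

# Hypothetical raw daily sales counts (made-up numbers, not from the dataset).
raw_sales = np.array([120.0, 80.0, 100.0, 60.0, 140.0])

# Z-score standardization: subtract the mean, divide by the standard deviation.
scaled_sales = (raw_sales - raw_sales.mean()) / raw_sales.std()

# Every value below the raw mean becomes negative after scaling.
below_mean = raw_sales < raw_sales.mean()
print(scaled_sales)
print("negative after scaling:", int((scaled_sales < 0).sum()))
print("below the raw mean:    ", int(below_mean.sum()))
```

Under this assumption, a mean near 0 and a standard deviation near 1, as reported by `describe()` for ``sales``, are what one would expect to see.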


### Let us investigate other columns

In [21]:
# Count the unique values in the is_state_holiday column
train_df['is_state_holiday'].value_counts()

Unnamed: 0_level_0,count
is_state_holiday,Unnamed: 1_level_1
normal_day,5025
state_holiday,114
day_after,2
day_before,1


In [22]:
train_df['is_school_holiday'].value_counts()

Unnamed: 0_level_0,count
is_school_holiday,Unnamed: 1_level_1
normal_day,3957
school_holiday,1185


In [23]:
train_df['is_special_day'].value_counts()

Unnamed: 0_level_0,count
is_special_day,Unnamed: 1_level_1
normal_day,4947
special_day,192
day_before,3


For the categorical columns, we will use one-hot encoding.

### 2.4 EDA-Visualization

In [24]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Creating the subplot for numerical features vs sales
fig = make_subplots(rows=1, cols=5, shared_yaxes=True, subplot_titles=[
    'Max Temp vs Sales', 'Min Temp vs Sales', 'Mean Temp vs Sales',
    'Sunshine vs Sales', 'Precipitation vs Sales'
])

# Scatter plots for each numerical column against sales
fig.add_trace(go.Scatter(x=train_df['temperature_max'], y=train_df['sales'], mode='markers', name='Max Temp'),
              row=1, col=1)
fig.add_trace(go.Scatter(x=train_df['temperature_min'], y=train_df['sales'], mode='markers', name='Min Temp'),
              row=1, col=2)
fig.add_trace(go.Scatter(x=train_df['temperature_mean'], y=train_df['sales'], mode='markers', name='Mean Temp'),
              row=1, col=3)
fig.add_trace(go.Scatter(x=train_df['sunshine_sum'], y=train_df['sales'], mode='markers', name='Sunshine'),
              row=1, col=4)
fig.add_trace(go.Scatter(x=train_df['precipitation_sum'], y=train_df['sales'], mode='markers', name='Precipitation'),
              row=1, col=5)

# Update layout
fig.update_layout(title_text="Numerical Features vs Sales", height=500, width=1500)
fig.show()


In [25]:
# Creating the subplot for categorical features vs sales
fig = make_subplots(rows=1, cols=3, shared_yaxes=True, subplot_titles=[
    'State Holiday vs Sales', 'School Holiday vs Sales', 'Special Day vs Sales'
])

# Box plots for each categorical column against sales
fig.add_trace(go.Box(x=train_df['is_state_holiday'], y=train_df['sales'], name='State Holiday'),
              row=1, col=1)
fig.add_trace(go.Box(x=train_df['is_school_holiday'], y=train_df['sales'], name='School Holiday'),
              row=1, col=2)
fig.add_trace(go.Box(x=train_df['is_special_day'], y=train_df['sales'], name='Special Day'),
              row=1, col=3)

# Update layout
fig.update_layout(title_text="Categorical Features vs Sales", height=500, width=1500)
fig.show()


In [26]:
# Selecting only numerical columns
numerical_columns = train_df.select_dtypes(include=['float64', 'int64'])

# Saving the numerical columns to a new dataframe
numerical_df = numerical_columns.copy()

# drop unsold and ordered
numerical_df.drop(['unsold', 'ordered'], axis=1, inplace=True)

# Display the new dataframe
numerical_df.head()

Unnamed: 0,temperature_max,temperature_min,temperature_mean,sunshine_sum,precipitation_sum,sales
0,21.8,16.3,19.590909,203,0.0,1.432748
1,21.8,16.3,19.590909,203,0.0,1.253193
2,21.8,16.3,19.590909,203,0.0,0.985026
3,21.8,16.3,19.590909,203,0.0,0.889419
4,21.8,16.3,19.590909,203,0.0,0.572282


In [27]:
correlation_matrix = numerical_df.corr()
# Create the heatmap with a red color scale
fig = go.Figure(data=go.Heatmap(
    z=correlation_matrix.values,
    x=correlation_matrix.columns,
    y=correlation_matrix.columns,
    colorscale='reds',  # Red color scale
    #colorbar=dict(title="Correlation", tickvals=[-1, 0, 1], ticktext=["-1", "0", "1"])
))

# Update layout for title and aesthetics
fig.update_layout(
    title="Correlation Heatmap",
    xaxis=dict(title="Features"),
    yaxis=dict(title="Features")
)

fig.show()

In [28]:


# Example: Plotting 'temperature_max' against 'sales' with 'is_state_holiday' as legend
fig = px.scatter(train_df,
                 x='temperature_max',
                 y='sales',
                 color='is_state_holiday',  # Use categorical column as legend
                 title="Temperature Max vs Sales with State Holiday as Legend")

# Show the plot
fig.show()


## **3. Feature Engineering**
We extract the year, month and weekday from the ``date`` column.
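
A minimal sketch of wrapping this extraction in a helper, so that the training and test frames go through exactly the same code. The name `add_date_features` and the toy frame are hypothetical and not part of the notebook:

```python
import pandas as pd

def add_date_features(df, date_col="date"):
    """Return a copy of df with year, month and weekday columns derived from date_col."""
    out = df.copy()
    out["year"] = out[date_col].dt.year
    out["month"] = out[date_col].dt.month
    out["weekday"] = out[date_col].dt.dayofweek  # Monday=0 ... Sunday=6
    return out

# Hypothetical two-row frame using the first train date and the first test date.
toy = pd.DataFrame({"date": pd.to_datetime(["2021-08-02", "2023-12-01"])})
toy_features = add_date_features(toy)
print(toy_features)
```

Because the helper returns a copy, the original frame stays untouched, which makes it easier to re-run cells without accumulating columns.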

In [29]:
# Extracting year, month and weekday from the 'date' column
train_df['year'] = train_df['date'].dt.year
train_df['month'] = train_df['date'].dt.month
train_df['weekday'] = train_df['date'].dt.dayofweek
train_df.head()

Unnamed: 0,date,store,is_state_holiday,is_school_holiday,is_special_day,temperature_max,temperature_min,temperature_mean,sunshine_sum,precipitation_sum,sales,unsold,ordered,year,month,weekday
0,2021-08-02,store_5,normal_day,school_holiday,normal_day,21.8,16.3,19.590909,203,0.0,1.432748,,,2021,8,0
1,2021-08-02,store_8,normal_day,school_holiday,normal_day,21.8,16.3,19.590909,203,0.0,1.253193,,,2021,8,0
2,2021-08-02,store_7,normal_day,school_holiday,normal_day,21.8,16.3,19.590909,203,0.0,0.985026,,,2021,8,0
3,2021-08-02,store_4,normal_day,school_holiday,normal_day,21.8,16.3,19.590909,203,0.0,0.889419,,,2021,8,0
4,2021-08-02,store_1,normal_day,school_holiday,normal_day,21.8,16.3,19.590909,203,0.0,0.572282,,,2021,8,0


In [30]:
# Extracting year, month and weekday from the 'date' column
test_df['year'] = test_df['date'].dt.year
test_df['month'] = test_df['date'].dt.month
test_df['weekday'] = test_df['date'].dt.dayofweek
test_df.head()

Unnamed: 0,row_id,date,store,is_state_holiday,is_school_holiday,is_special_day,temperature_max,temperature_min,temperature_mean,sunshine_sum,precipitation_sum,year,month,weekday
0,0,2023-12-01,store_3,normal_day,normal_day,normal_day,-0.8,-2.7,-1.709091,273,0.8,2023,12,4
1,1,2023-12-01,store_5,normal_day,normal_day,normal_day,-2.7,-6.4,-4.554545,8,0.0,2023,12,4
2,2,2023-12-01,store_1,normal_day,normal_day,normal_day,-2.7,-6.4,-4.554545,8,0.0,2023,12,4
3,3,2023-12-01,store_4,normal_day,normal_day,normal_day,-2.7,-6.4,-4.554545,8,0.0,2023,12,4
4,4,2023-12-01,store_7,normal_day,normal_day,normal_day,-2.7,-6.4,-4.554545,8,0.0,2023,12,4


In [31]:
train_df['weekday'].value_counts()

Unnamed: 0_level_0,count
weekday,Unnamed: 1_level_1
3,741
1,740
0,740
2,735
4,733
5,729
6,724


In [32]:
fig = px.violin(train_df, x='year', y='sales', title='Sales by Year')
fig.show()

In [33]:
fig = px.violin(train_df, x='month', y='sales', title='Sales by Month')
fig.show()

In [34]:
fig = px.violin(train_df, x='weekday', y='sales', title='Sales by Days')
fig.show()

## **4. Prepare Dataset for Training**

- Split Training & Validation Set
- Fill/Remove Missing Values
- Extract Inputs & Outputs
   - Training
   - Validation
   - Test

### 4.1 Transforming the Categorical Features for Machine Learning

# ✈ ✈ ✈

#### Trainset

In [35]:
train_df_copy = train_df.copy()
train_df_copy.head()

Unnamed: 0,date,store,is_state_holiday,is_school_holiday,is_special_day,temperature_max,temperature_min,temperature_mean,sunshine_sum,precipitation_sum,sales,unsold,ordered,year,month,weekday
0,2021-08-02,store_5,normal_day,school_holiday,normal_day,21.8,16.3,19.590909,203,0.0,1.432748,,,2021,8,0
1,2021-08-02,store_8,normal_day,school_holiday,normal_day,21.8,16.3,19.590909,203,0.0,1.253193,,,2021,8,0
2,2021-08-02,store_7,normal_day,school_holiday,normal_day,21.8,16.3,19.590909,203,0.0,0.985026,,,2021,8,0
3,2021-08-02,store_4,normal_day,school_holiday,normal_day,21.8,16.3,19.590909,203,0.0,0.889419,,,2021,8,0
4,2021-08-02,store_1,normal_day,school_holiday,normal_day,21.8,16.3,19.590909,203,0.0,0.572282,,,2021,8,0


In [36]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

def fit_one_hot_encoder(df):
    """Fit a OneHotEncoder on the categorical (object) columns of df.

    Returns the fitted encoder, the categorical column names, and the
    names of the encoded columns produced at fit time.
    """
    categorical_cols = df.select_dtypes('object').columns.tolist()
    enc = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    enc.fit(df[categorical_cols])

    expected_fit_columns = enc.get_feature_names_out(categorical_cols).tolist()

    return enc, categorical_cols, expected_fit_columns

def transform_with_encoder(df, enc, categorical_cols, expected_fit_columns):
    """Transform new data with a fitted encoder and ensure all expected columns are present.

    Takes the encoder, the categorical columns and the encoded column names
    returned by fit_one_hot_encoder (encoder, cat_cols and actual_enc_cols below).
    See the pastry prediction joblib file for these values.
    """
    encoded_array = enc.transform(df[categorical_cols])
    encoded_df = pd.DataFrame(encoded_array, columns=enc.get_feature_names_out(categorical_cols), index=df.index)

    df_encoded = pd.concat([df.drop(columns=categorical_cols), encoded_df], axis=1)

    # Add missing columns and fill with 0
    for col in expected_fit_columns:
        if col not in df_encoded.columns:
            df_encoded[col] = 0

    # Sort columns alphabetically so train and test end up in the same column order
    df_encoded = df_encoded[df_encoded.columns.sort_values()]

    return df_encoded


In [37]:
# Fit on training data
encoder, cat_cols, actual_enc_cols = fit_one_hot_encoder(train_df_copy)


In [38]:
train_df_encoded = transform_with_encoder(train_df_copy, encoder, cat_cols, actual_enc_cols)
train_df_encoded.head()

Unnamed: 0,date,is_school_holiday_normal_day,is_school_holiday_school_holiday,is_special_day_day_before,is_special_day_normal_day,is_special_day_special_day,is_state_holiday_day_after,is_state_holiday_day_before,is_state_holiday_normal_day,is_state_holiday_state_holiday,...,store_store_6,store_store_7,store_store_8,sunshine_sum,temperature_max,temperature_mean,temperature_min,unsold,weekday,year
0,2021-08-02,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,203,21.8,19.590909,16.3,,0,2021
1,2021-08-02,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,203,21.8,19.590909,16.3,,0,2021
2,2021-08-02,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,203,21.8,19.590909,16.3,,0,2021
3,2021-08-02,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,203,21.8,19.590909,16.3,,0,2021
4,2021-08-02,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,203,21.8,19.590909,16.3,,0,2021


#### Split Training & Validation Set

We'll set aside 20% of the training data as the validation set, to evaluate the models we train on previously unseen data.

In [39]:
from sklearn.model_selection import train_test_split

# Sort the dataframe by the 'date' column to ensure the split is based on time order
train_df_copy_sorted = train_df_encoded.sort_values(by='date', ascending=True)

#drop unwanted column
train_df_copy_sorted.drop(['date'], axis=1, inplace=True)
train_df_copy_sorted.drop(['unsold','ordered'], axis=1, inplace=True)

# Calculate the index for the split
validation_size = int(0.2 * len(train_df_copy_sorted))


# Split the data
train_df_final = train_df_copy_sorted[:-validation_size]  # Training set (80%)
val_df = train_df_copy_sorted[-validation_size:]  # Validation set (20%)

# Display the shapes of the train and validation datasets
print(f"Training Data Shape: {train_df_final.shape}")
print(f"Validation Data Shape: {val_df.shape}")

Training Data Shape: (4114, 27)
Validation Data Shape: (1028, 27)
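
Because several stores share each date, a cutoff chosen by row count can place rows of the same date on both sides of the boundary. A minimal sketch of an alternative that cuts on a date instead; the helper name `split_by_date` and the toy frame are hypothetical, and note that it reserves a fraction of the dates rather than a fraction of the rows:

```python
import pandas as pd

def split_by_date(df, date_col="date", val_fraction=0.2):
    """Chronological split where every row of a given date lands on the same side."""
    unique_dates = df[date_col].sort_values().unique()
    n_val_dates = max(1, int(len(unique_dates) * val_fraction))
    cutoff = unique_dates[-n_val_dates]  # first date of the validation period
    train_part = df[df[date_col] < cutoff]
    val_part = df[df[date_col] >= cutoff]
    return train_part, val_part

# Hypothetical frame: 5 dates x 2 stores.
dates = pd.date_range("2023-01-01", periods=5, freq="D")
toy = pd.DataFrame({
    "date": dates.repeat(2),
    "store": ["store_1", "store_2"] * 5,
})
train_part, val_part = split_by_date(toy)
print(len(train_part), len(val_part))
```

With roughly the same number of stores reporting each day, the two approaches give similar set sizes; the date-based version simply guarantees that the validation period starts on a clean day boundary.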


##### Training Datasets

In [40]:
train_df_final.head()

Unnamed: 0,is_school_holiday_normal_day,is_school_holiday_school_holiday,is_special_day_day_before,is_special_day_normal_day,is_special_day_special_day,is_state_holiday_day_after,is_state_holiday_day_before,is_state_holiday_normal_day,is_state_holiday_state_holiday,month,...,store_store_5,store_store_6,store_store_7,store_store_8,sunshine_sum,temperature_max,temperature_mean,temperature_min,weekday,year
0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,1.0,0.0,0.0,0.0,203,21.8,19.590909,16.3,0,2021
1,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,0.0,1.0,203,21.8,19.590909,16.3,0,2021
2,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,1.0,0.0,203,21.8,19.590909,16.3,0,2021
3,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,0.0,0.0,203,21.8,19.590909,16.3,0,2021
4,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,0.0,0.0,203,21.8,19.590909,16.3,0,2021


###### Extract Inputs and Outputs

In [41]:
train_df_final.head()

Unnamed: 0,is_school_holiday_normal_day,is_school_holiday_school_holiday,is_special_day_day_before,is_special_day_normal_day,is_special_day_special_day,is_state_holiday_day_after,is_state_holiday_day_before,is_state_holiday_normal_day,is_state_holiday_state_holiday,month,...,store_store_5,store_store_6,store_store_7,store_store_8,sunshine_sum,temperature_max,temperature_mean,temperature_min,weekday,year
0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,1.0,0.0,0.0,0.0,203,21.8,19.590909,16.3,0,2021
1,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,0.0,1.0,203,21.8,19.590909,16.3,0,2021
2,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,1.0,0.0,203,21.8,19.590909,16.3,0,2021
3,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,0.0,0.0,203,21.8,19.590909,16.3,0,2021
4,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,0.0,0.0,203,21.8,19.590909,16.3,0,2021


In [42]:
# get all the input column from the datasets
def input_column(df):
    """
    Returns a list of all column names in the DataFrame except the 'sales' column.

    Parameters:
    - df: Pandas DataFrame

    Returns:
    - List of column names excluding 'sales'
    """
    return [col for col in df.columns if col != 'sales']

In [43]:
input_col= input_column(train_df_final)
target_col = 'sales'

In [44]:
train_input = train_df_final[input_col]
train_target = train_df_final[target_col]
train_input.head()

Unnamed: 0,is_school_holiday_normal_day,is_school_holiday_school_holiday,is_special_day_day_before,is_special_day_normal_day,is_special_day_special_day,is_state_holiday_day_after,is_state_holiday_day_before,is_state_holiday_normal_day,is_state_holiday_state_holiday,month,...,store_store_5,store_store_6,store_store_7,store_store_8,sunshine_sum,temperature_max,temperature_mean,temperature_min,weekday,year
0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,1.0,0.0,0.0,0.0,203,21.8,19.590909,16.3,0,2021
1,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,0.0,1.0,203,21.8,19.590909,16.3,0,2021
2,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,1.0,0.0,203,21.8,19.590909,16.3,0,2021
3,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,0.0,0.0,203,21.8,19.590909,16.3,0,2021
4,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,0.0,0.0,203,21.8,19.590909,16.3,0,2021


In [45]:
train_target.head()

Unnamed: 0,sales
0,1.432748
1,1.253193
2,0.985026
3,0.889419
4,0.572282


##### Validation Sets

In [46]:
val_input = val_df[input_col]
val_target = val_df[target_col]
val_input.head()

Unnamed: 0,is_school_holiday_normal_day,is_school_holiday_school_holiday,is_special_day_day_before,is_special_day_normal_day,is_special_day_special_day,is_state_holiday_day_after,is_state_holiday_day_before,is_state_holiday_normal_day,is_state_holiday_state_holiday,month,...,store_store_5,store_store_6,store_store_7,store_store_8,sunshine_sum,temperature_max,temperature_mean,temperature_min,weekday,year
4119,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,1.0,0.0,41,21.8,20.772727,19.1,2,2023
4121,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,1.0,0.0,0.0,0.0,41,21.8,20.772727,19.1,2,2023
4120,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,0.0,1.0,41,21.8,20.772727,19.1,2,2023
4118,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,0.0,0.0,41,21.8,20.772727,19.1,2,2023
4114,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,0.0,0.0,6,18.9,17.636364,15.8,2,2023


In [47]:
val_target.head()

Unnamed: 0,sales
4119,-0.733573
4121,0.019626
4120,0.502326
4118,-0.549354
4114,-1.468117


#### Testsets

##### Test Datasets One-Hot Encoding

In [48]:
test_df_encoded = transform_with_encoder(test_df, encoder, cat_cols, actual_enc_cols)
test_df_encoded= test_df_encoded.drop(['date','row_id'], axis=1)
test_df_encoded.head()

Unnamed: 0,is_school_holiday_normal_day,is_school_holiday_school_holiday,is_special_day_day_before,is_special_day_normal_day,is_special_day_special_day,is_state_holiday_day_after,is_state_holiday_day_before,is_state_holiday_normal_day,is_state_holiday_state_holiday,month,...,store_store_5,store_store_6,store_store_7,store_store_8,sunshine_sum,temperature_max,temperature_mean,temperature_min,weekday,year
0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,12,...,0.0,0.0,0.0,0.0,273,-0.8,-1.709091,-2.7,4,2023
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,12,...,1.0,0.0,0.0,0.0,8,-2.7,-4.554545,-6.4,4,2023
2,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,12,...,0.0,0.0,0.0,0.0,8,-2.7,-4.554545,-6.4,4,2023
3,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,12,...,0.0,0.0,0.0,0.0,8,-2.7,-4.554545,-6.4,4,2023
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,12,...,0.0,0.0,1.0,0.0,8,-2.7,-4.554545,-6.4,4,2023


In [49]:
test_inputs = test_df_encoded
test_inputs.head()

Unnamed: 0,is_school_holiday_normal_day,is_school_holiday_school_holiday,is_special_day_day_before,is_special_day_normal_day,is_special_day_special_day,is_state_holiday_day_after,is_state_holiday_day_before,is_state_holiday_normal_day,is_state_holiday_state_holiday,month,...,store_store_5,store_store_6,store_store_7,store_store_8,sunshine_sum,temperature_max,temperature_mean,temperature_min,weekday,year
0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,12,...,0.0,0.0,0.0,0.0,273,-0.8,-1.709091,-2.7,4,2023
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,12,...,1.0,0.0,0.0,0.0,8,-2.7,-4.554545,-6.4,4,2023
2,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,12,...,0.0,0.0,0.0,0.0,8,-2.7,-4.554545,-6.4,4,2023
3,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,12,...,0.0,0.0,0.0,0.0,8,-2.7,-4.554545,-6.4,4,2023
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,12,...,0.0,0.0,1.0,0.0,8,-2.7,-4.554545,-6.4,4,2023


## **5. Linear Regression**
### Train Hardcoded & Baseline Models

- Hardcoded model: always predict the average sales
- Baseline model: linear regression

Models are evaluated with the mean squared error (MSE).
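
As a reminder of what the metric computes, here is a minimal NumPy sketch of the MSE and its square root, the RMSE. The helper `mse` and the toy numbers are illustrative only; the notebook itself uses `sklearn.metrics.mean_squared_error`:

```python
import numpy as np

def mse(targets, preds):
    """Mean squared error: the average of the squared differences."""
    targets = np.asarray(targets, dtype=float)
    preds = np.asarray(preds, dtype=float)
    return float(np.mean((targets - preds) ** 2))

# Made-up numbers purely for illustration.
toy_targets = [1.0, 0.0, -1.0, 2.0]
toy_preds = [0.5, 0.0, -2.0, 2.0]

toy_mse = mse(toy_targets, toy_preds)
toy_rmse = np.sqrt(toy_mse)  # RMSE is the square root of the MSE
print(toy_mse, toy_rmse)
```

The MSE is expressed in squared units of the target, while the RMSE is in the same units as the target, which matters when comparing an error against the range of ``sales``.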

In [50]:
class MeanRegressor():
    def fit(self, inputs, targets):
        self.mean = targets.mean()

    def predict(self, inputs):
        return np.full(inputs.shape[0], self.mean)

In [51]:
mean_model = MeanRegressor()
mean_model.fit(train_input, train_target)

In [52]:
mean_model.mean

np.float64(0.24354615915432087)

In [53]:
train_preds = mean_model.predict(train_input)
train_preds

array([0.24354616, 0.24354616, 0.24354616, ..., 0.24354616, 0.24354616,
       0.24354616])

In [54]:
val_preds = mean_model.predict(val_input)
val_preds

array([0.24354616, 0.24354616, 0.24354616, ..., 0.24354616, 0.24354616,
       0.24354616])

In [55]:
from sklearn.metrics import mean_squared_error

In [56]:
train_rmse = mean_squared_error(train_target, train_preds)  # note: this value is the MSE; apply np.sqrt to it to get the RMSE
train_rmse

1.0037659076299277

In [57]:
val_rmse = mean_squared_error(val_target, val_preds)
val_rmse

1.04338825347136

In [58]:
mse = 1.04  # validation MSE of the mean model, from the cell above
rmse = np.sqrt(mse)  # RMSE is in the same units as sales
sales_max = 7
sales_min = -1.6
sales_range = sales_max - sales_min  # 8.6

rmse_percent = (rmse / sales_range) * 100
print(f"Baseline RMSE is {rmse_percent:.2f}% of the sales range.")


Baseline RMSE is 11.86% of the sales range.


The hardcoded mean model has a validation MSE of about ``1.04``. With the scaled sales ranging from roughly ``-1.6`` to ``7``, this is a large error, but it serves only as the baseline that later models must beat.
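
One way to read a baseline error close to 1: on the data it was fitted on, a constant prediction equal to the mean has an MSE equal to the variance of the target. A minimal sketch with made-up targets (not values from the dataset):

```python
import numpy as np

# Made-up targets for illustration only.
toy_targets = np.array([1.4, -0.6, 0.1, -1.1, 0.7])

# A constant prediction equal to the mean of the targets.
constant_preds = np.full(toy_targets.shape[0], toy_targets.mean())

mse_of_mean = np.mean((toy_targets - constant_preds) ** 2)
population_variance = np.var(toy_targets)  # ddof=0 by default

print(mse_of_mean, population_variance)
```

Since ``sales`` has a standard deviation close to 1, a mean-model MSE near 1 is roughly what one would expect, and any useful model should land clearly below it.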

### 5.1 Train & Evaluate Baseline Model

We'll train a linear regression model as our baseline, which tries to express the target as a weighted sum of the inputs.
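
To make "weighted sum of the inputs" concrete, here is a minimal sketch that recovers known weights from synthetic data with ordinary least squares. The data, weights and bias are made up, and `np.linalg.lstsq` stands in for what `LinearRegression` does conceptually:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic inputs and a target built from known (made-up) weights and bias.
X = rng.normal(size=(200, 3))
true_weights = np.array([0.5, -1.0, 2.0])
true_bias = 0.3
y = X @ true_weights + true_bias

# Append a column of ones so the bias is learned as one more weight.
X_with_bias = np.hstack([X, np.ones((X.shape[0], 1))])
solution, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)
learned_weights, learned_bias = solution[:-1], solution[-1]

# The prediction is the weighted sum of the inputs plus the bias.
preds = X @ learned_weights + learned_bias
print(learned_weights, learned_bias)
```

On real data the target is not an exact linear function of the inputs, so the fitted weights only give the best linear approximation in the least-squares sense.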

In [59]:
train_input

Unnamed: 0,is_school_holiday_normal_day,is_school_holiday_school_holiday,is_special_day_day_before,is_special_day_normal_day,is_special_day_special_day,is_state_holiday_day_after,is_state_holiday_day_before,is_state_holiday_normal_day,is_state_holiday_state_holiday,month,...,store_store_5,store_store_6,store_store_7,store_store_8,sunshine_sum,temperature_max,temperature_mean,temperature_min,weekday,year
0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,1.0,0.0,0.0,0.0,203,21.8,19.590909,16.3,0,2021
1,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,0.0,1.0,203,21.8,19.590909,16.3,0,2021
2,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,1.0,0.0,203,21.8,19.590909,16.3,0,2021
3,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,0.0,0.0,203,21.8,19.590909,16.3,0,2021
4,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,0.0,0.0,203,21.8,19.590909,16.3,0,2021
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4110,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,1.0,0.0,0.0,0.0,214,22.7,18.872727,16.9,1,2023
4106,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,0.0,0.0,148,18.0,16.718182,15.3,1,2023
4108,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,1.0,0.0,0.0,148,18.0,16.718182,15.3,1,2023
4107,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,8,...,0.0,0.0,0.0,0.0,214,22.7,18.872727,16.9,1,2023


In [60]:
from sklearn.linear_model import LinearRegression
linreg_model = LinearRegression()
linreg_model.fit(train_input, train_target)

In [61]:
train_preds = linreg_model.predict(train_input)
train_preds

array([-0.0727168 ,  0.44438017,  0.52523946, ..., -1.57260003,
       -0.80104307, -0.06978827])

In [62]:
val_preds = linreg_model.predict(val_input)
val_preds

array([ 0.1207375 , -0.47721876,  0.03987821, ...,  0.44575365,
       -1.34417991, -0.32804692])

### 5.2 Model Evaluation

In [63]:
from sklearn.metrics import mean_squared_error

def compute_train_val_mse(train_targets, train_preds, val_targets, val_preds):
    """
    Computes train and validation Mean Squared Error (MSE)
    and returns the results as a pandas DataFrame.
    """
    train_mse = mean_squared_error(train_targets, train_preds)
    val_mse = mean_squared_error(val_targets, val_preds)

    return pd.DataFrame({
        "dataset": ["trainsets", "validation"],
        "mse": [train_mse, val_mse]
    })



def compute_mse(targets, predictions):
    """
    Returns the Mean Squared Error (MSE) between targets and predictions.
    """
    return mean_squared_error(targets, predictions)


In [64]:
compute_train_val_mse(train_target, train_preds, val_target, val_preds)

Unnamed: 0,dataset,mse
0,trainsets,0.453235
1,validation,0.185161


Surprisingly, the model scored better on the validation set than on the training set: the validation MSE of ``0.185`` is far below the mean-model baseline of about ``1.04``.
<br>
This happened on the first attempt, which we attribute to the feature engineering and scaling done earlier. A validation error lower than the training error is unusual, though, and may simply mean that the validation period is easier to predict than the training period.
<br>
This score would put us at the top of the leaderboard, since it beats the best score of ``0.20``. Note, however, that it is measured on our own validation set and not on the test set.
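
Because a validation error below the training error can depend on which period ends up in the validation set, one way to sanity-check the result is to score the model on several time-ordered splits. The sketch below does this with sklearn's `TimeSeriesSplit` on synthetic data. The arrays `X` and `y` are placeholders, and the approach assumes the rows are sorted by date, which would have to be ensured for the real `train_input`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic, time-ordered data standing in for the real features and target
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = X @ np.array([0.5, -1.0, 0.25]) + rng.normal(scale=0.3, size=300)

# Each split trains on an earlier block of rows and validates on the next block
tscv = TimeSeriesSplit(n_splits=5)
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=tscv, scoring="neg_mean_squared_error"
)

# sklearn reports negated MSE so that higher is better; flip the sign back
fold_mse = -neg_mse
print("MSE per fold:", np.round(fold_mse, 3))
print("Mean MSE:", round(float(fold_mse.mean()), 3))
```

If the per-fold scores vary widely, a single train/validation split is probably not a reliable estimate of test performance.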

## **6. Submission**

In [65]:
test_preds = linreg_model.predict(test_inputs)
test_preds

array([-0.43222593,  0.110118  , -0.10403982, ...,  0.17804441,
       -1.35140301,  0.09718512])

In [66]:
# def get_submission(test_preds, output_path=None):
#     sub_df = pd.read_csv(data_dir+'/sample_submission.csv')
#     sub_df.drop(['sales'], axis=1, inplace=True)
#     sub_df['sales'] = test_preds
#     sub_df.drop(['unsold','ordered'], axis=1, inplace=True)

#     if output_path:
#         sub_df.to_csv(output_path, index=False)
#         print(f"Submission saved to: {output_path}")
#     return sub_df

In [67]:
def get_submission(test_df, predictions, output_path=None):
    """
    Creates a submission DataFrame with 'row_id' and 'sales' columns.

    Parameters:
    - test_df (DataFrame): The original test DataFrame with a 'row_id' column.
    - predictions (array-like): The predicted sales values, same length as test_df.
    - output_path (str, optional): If provided, saves the submission as a CSV to this path.

    Returns:
    - DataFrame: Submission DataFrame with columns ['row_id', 'sales'].
    """
    submission_df = test_df[['row_id']].copy()
    submission_df['sales'] = predictions

    if output_path:
        submission_df.to_csv(output_path, index=False)
        print(f"Submission saved to: {output_path}")

    return submission_df


In [68]:
get_submission(test_df, test_preds, output_path='linreg_submission.csv')

Submission saved to: linreg_submission.csv


Unnamed: 0,row_id,sales
0,0,-0.432226
1,1,0.110118
2,2,-0.104040
3,3,0.134205
4,4,0.708074
...,...,...
1570,1570,-0.876855
1571,1571,-0.395825
1572,1572,0.178044
1573,1573,-1.351403


## **7. Ridge Regression**

##### Helper Class
Let's define a helper class to fit a model, evaluate it, and generate test predictions.

In [69]:


from sklearn.metrics import mean_squared_error


class ModelTrainer:
    """Thin wrapper that fits a model, predicts, scores, and writes submissions."""

    def __init__(self, model_1):
        self.model = model_1
        self.model_name = type(self.model).__name__

    def fit(self, train_inputs, train_targets):
        self.model.fit(train_inputs, train_targets)

    def evaluate(self, train_inputs, train_targets, val_inputs, val_targets):
        """
        Predicts on the training and validation inputs and returns
        (train_preds, val_preds). The targets are accepted for a uniform
        call signature but are not used here.
        """
        train_preds = self.model.predict(train_inputs)
        val_preds = self.model.predict(val_inputs)

        return train_preds, val_preds

    def calc_train_val_mse(self, train_targets, train_preds, val_targets, val_preds):
        """
        Computes train and validation Mean Squared Error (MSE)
        and returns the results as a pandas DataFrame.
        """
        train_mse = mean_squared_error(train_targets, train_preds)
        val_mse = mean_squared_error(val_targets, val_preds)

        return pd.DataFrame({
            "dataset": ["trainsets", "validation"],
            "mse": [train_mse, val_mse]
        })

    def predict_test(self, test_inputs):
        test_preds = self.model.predict(test_inputs)
        return test_preds

    def get_submission(self, test_df, predictions, output_path=None):
        """
        Creates a submission DataFrame with 'row_id' and 'sales' columns.

        Parameters:
        - test_df (DataFrame): The original test DataFrame with a 'row_id' column.
        - predictions (array-like): The predicted sales values, same length as test_df.
        - output_path (str, optional): If provided, saves the submission as a CSV to this path.

        Returns:
        - DataFrame: Submission DataFrame with columns ['row_id', 'sales'].
        """
        submission_df = test_df[['row_id']].copy()
        submission_df['sales'] = predictions

        if output_path:
            submission_df.to_csv(output_path, index=False)
            print(f"Submission saved to: {output_path}")

        return submission_df

    def get_model_name(self):
        return self.model_name


##### Back to the code

In [70]:
from sklearn.linear_model import Ridge

# Step 1: Initialize model
ridge_model = Ridge(random_state=42)

# Step 2: Wrap it in your ModelTrainer
ridge_trainer = ModelTrainer(ridge_model)

# Step 3: Fit the model
ridge_trainer.fit(train_input, train_target)

# Step 4: Get predictions
ridge_train_preds, ridge_val_preds = ridge_trainer.evaluate(train_input, train_target, val_input, val_target)

# Step 5: Get MSE
ridge_mse = ridge_trainer.calc_train_val_mse(train_target, ridge_train_preds, val_target, ridge_val_preds)
ridge_mse


Unnamed: 0,dataset,mse
0,trainsets,0.453248
1,validation,0.185337


## **8. Random Forest**

In [71]:
from sklearn.ensemble import RandomForestRegressor

# Step 1: Initialize model
rf_model = RandomForestRegressor(random_state=42)

# Step 2: Wrap it
rf_trainer = ModelTrainer(rf_model)

# Step 3: Fit
rf_trainer.fit(train_input, train_target)

# Step 4: Evaluate
rf_train_preds, rf_val_preds = rf_trainer.evaluate(train_input, train_target, val_input, val_target)

# Step 5: MSE
rf_mse = rf_trainer.calc_train_val_mse(train_target, rf_train_preds, val_target, rf_val_preds)
"Random Forest MSE:\n", rf_mse


('Random Forest MSE:\n',
       dataset       mse
 0   trainsets  0.025992
 1  validation  0.153403)

In [72]:
# Predict once and reuse the result for the submission file
rf_test_preds = rf_trainer.predict_test(test_inputs)
rf_trainer.get_submission(test_df, rf_test_preds, output_path='rf_submission.csv')

Submission saved to: rf_submission.csv


Unnamed: 0,row_id,sales
0,0,-0.338015
1,1,0.060340
2,2,-0.545203
3,3,0.171198
4,4,0.734815
...,...,...
1570,1570,-0.480843
1571,1571,-0.022208
1572,1572,0.638251
1573,1573,-1.141443


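Random forest performed best on the validation set (MSE of about ``0.153`` versus ``0.185`` for linear and ridge regression), although its much lower training MSE of about ``0.026`` suggests some overfitting. We will therefore try tuning some hyperparameters; thanks to sklearn, there are search utilities that help find a good configuration.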
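
One of those search utilities is `RandomizedSearchCV`, which samples a fixed number of configurations from a search space and scores each with cross-validation. The following is a minimal sketch on synthetic data rather than the notebook's `train_input` and `train_target`. The parameter ranges are assumptions chosen for illustration, not recommended values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for train_input / train_target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Hypothetical search space; these ranges are assumptions, not tuned values
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 2, 5],
    "max_features": [0.5, 1.0],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,                          # number of sampled configurations
    scoring="neg_mean_squared_error",  # negated MSE, higher is better
    cv=3,
    random_state=42,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated MSE:", -search.best_score_)
```

Limiting `max_depth` or raising `min_samples_leaf` is a common way to narrow a gap between training and validation error, which is why both appear in the search space here.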

## **Tying Everything Together**

In [82]:
def predict_input(single_input):
    """
    Predicts sales for one input row.
    `single_input` should be a dictionary whose keys match the columns of test_df.
    """
    input_df = pd.DataFrame([single_input])
    input_df['date'] = pd.to_datetime(input_df['date'])
    input_df_encoded = transform_with_encoder(input_df, encoder, cat_cols, actual_enc_cols)
    input_df_ref = input_df_encoded.drop(['date','row_id'], axis=1)
    pred_sale = rf_trainer.predict_test(input_df_ref)
    return pred_sale


In [83]:
# Sample input with the same keys as the columns of test_df; the values are arbitrary and only used for a test run

sample_data = {
    "row_id": 1,
    "date": "2024-05-15",
    "store": "Store_4",
    "is_state_holiday": 'normal_day',
    "is_school_holiday": 'normal_day',
    "is_special_day": 'normal_day',
    "temperature_max": 26.5,
    "temperature_min": 14.2,
    "temperature_mean": 20.3,
    "sunshine_sum": 8.5,
    "precipitation_sum": 2.1,
    "year": 2024,
    "month": 5,
    "weekday": 3
}



In [84]:
predict_input(sample_data)

array([-0.63393151])

In [76]:
import joblib

In [77]:
# Bundle everything needed for inference into one dictionary.
# joblib relies on pickle, which stores functions and classes by reference,
# so the helpers must be importable wherever this file is loaded.
pastery_sale_predictor = {
    'model': rf_trainer,
    'input_col': input_col,
    'target_col': target_col,
    'get_encoder': fit_one_hot_encoder,
    'encoder_transform': transform_with_encoder,
    'encoder': encoder,
    'cat_cols': cat_cols,
    'actual_enc_cols': actual_enc_cols,
    'predict_input': predict_input,
    'sample_data': sample_data
}
joblib.dump(pastery_sale_predictor, 'pastery_sale_predictor.joblib')

In [85]:
import cloudpickle

pastery_sale_predictor = {
    'model':rf_trainer,
    'input_col':input_col,
    'target_col':target_col,
    'get_encoder':fit_one_hot_encoder,
    'encoder_transform': transform_with_encoder,
    'encoder': encoder,
    'cat_cols':cat_cols,
    'actual_enc_cols':actual_enc_cols,
    'predict_input':predict_input,
    'sample_data':sample_data,
    'train_input':train_input,
    'train_target':train_target,
}

with open("pastery_sale_predictor.pkl", "wb") as f:
    cloudpickle.dump(pastery_sale_predictor, f)
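
To confirm that a saved bundle can be used later, it helps to load it back and compare predictions. The sketch below round-trips a small bundle with `joblib`. The model, the data, and the `input_cols` key are placeholders rather than the notebook's objects, and the bundle stores only data (no functions), which is an assumption made to keep the example simple.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model standing in for the trained random forest
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X[:, 0] - X[:, 1]
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Hypothetical bundle: the model plus the column names it expects
bundle = {"model": model, "input_cols": ["f0", "f1", "f2"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bundle.joblib")
    joblib.dump(bundle, path)      # write to disk
    restored = joblib.load(path)   # read back

# The restored model should give the same predictions as the original
same = np.allclose(model.predict(X), restored["model"].predict(X))
print("Predictions match after reload:", same)
```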


## **ROLL BACK**
# ✈ ✈ ✈

### Roll Back
We want to iterate here, so we will create another version of the training dataset and fill the null values in the ``ordered`` column, since it is strongly correlated with the ``sales`` column.

In [None]:
# Selecting only numerical columns
numerical_columns = train_df_roll_back.select_dtypes(include=['float64', 'int64'])

# Saving the numerical columns to a new dataframe
numerical_df = numerical_columns.copy()

# Display the new dataframe
numerical_df.head()

import plotly.graph_objects as go
import pandas as pd

# Calculate the correlation matrix
correlation_matrix = numerical_df.corr()

# Create the heatmap with a red color scale
fig = go.Figure(data=go.Heatmap(
    z=correlation_matrix.values,
    x=correlation_matrix.columns,
    y=correlation_matrix.columns,
    colorscale='reds',  # Red color scale
    colorbar=dict(title="Correlation", tickvals=[-1, 0, 1], ticktext=["-1", "0", "1"])
))

# Update layout for title and aesthetics
fig.update_layout(
    title="Correlation Heatmap",
    xaxis=dict(title="Features"),
    yaxis=dict(title="Features")
)

fig.show()



In [None]:
train_df_roll_back.info()

In [None]:
train_df_roll_back.iloc[1693:1700]

#### Let us fill the N/A values in the ``ordered`` column

# ✈ ✈ ✈

In [None]:
from sklearn.linear_model import LinearRegression

test_roll_back = train_df_roll_back.copy()
# Filter non-null rows
ordered_not_null = test_roll_back[test_roll_back['ordered'].notnull()]

# Features and target
X = ordered_not_null[['sales']]
y = ordered_not_null['ordered']

# Fit model
model = LinearRegression()
model.fit(X, y)

# Predict missing values
ordered_missing = test_roll_back['ordered'].isnull()
test_roll_back.loc[ordered_missing, 'ordered'] = model.predict(test_roll_back.loc[ordered_missing, ['sales']])
test_roll_back.info()


In [None]:
# Selecting only numerical columns
numerical_columns = test_roll_back.select_dtypes(include=['float64', 'int64'])

# Saving the numerical columns to a new dataframe
numerical_df = numerical_columns.copy()


# Calculate the correlation matrix
correlation_matrix = numerical_df.corr()

# Create the heatmap with a red color scale
fig = go.Figure(data=go.Heatmap(
    z=correlation_matrix.values,
    x=correlation_matrix.columns,
    y=correlation_matrix.columns,
    colorscale='reds',  # Red color scale
    colorbar=dict(title="Correlation", tickvals=[-1, 0, 1], ticktext=["-1", "0", "1"])
))

# Update layout for title and aesthetics
fig.update_layout(
    title="Correlation Heatmap",
    xaxis=dict(title="Features"),
    yaxis=dict(title="Features")
)

fig.show()


In [None]:
test_roll_back.drop(['unsold'], axis=1, inplace=True)
test_roll_back.info()

In [None]:
# Extract the year and month from the date column of test_roll_back
test_roll_back['year'] = test_roll_back['date'].dt.year
test_roll_back['month'] = test_roll_back['date'].dt.month
test_roll_back.info()