#### Iron Kaggle

TODO: Predict the sales of shops.

- Dataset (training.csv) containing information on shops’ sales per day.

- <u>**Training data (640841 entries)**</u>: we will share with you a training set of store sales per day, with some information about what happened on that day in that store.

    - TODO: Validate that it has 640841 entries.

- <u>**Real-Life Data**</u> (+70k entries): we will also share with you entries without the sales. These will be used (on the teachers' side) to verify how good your model really is!

TODOs:
 - R2 Prediction
 - R2 Score
 - R2 Difference


#### Expected Deliverables:

- “Real-life data set” with an extra column called “sales”, with your predictions. Name this G1.csv (or G2, G3...)

- Expected R2 of performance of your model. Save the number in a file `g4_r2_prediction.txt`

- A 5-minute presentation on the choices you made and the road you took

- Your code (jupyter notebook)

- Upload everything to your Github repo.
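
A minimal sketch of how the two file deliverables could be written once predictions exist. The helper name `save_deliverables`, the toy data frame, and the temporary output folder are assumptions for illustration, not part of the task description.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_deliverables(real_life_df, predictions, expected_r2, out_dir, group="G1"):
    """Write <group>.csv with a "sales" column and the expected-R² text file (hypothetical helper)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Copy so the caller's frame is untouched, then add the required "sales" column.
    submission = real_life_df.copy()
    submission["sales"] = predictions

    csv_path = out_dir / f"{group}.csv"
    txt_path = out_dir / f"{group.lower()}_r2_prediction.txt"
    submission.to_csv(csv_path, index=False)
    txt_path.write_text(f"{expected_r2}\n")
    return csv_path, txt_path


# Toy stand-in for the real-life data set (values are made up).
toy_real_life = pd.DataFrame({"store_ID": [1, 2], "open": [1, 0]})
csv_path, txt_path = save_deliverables(
    toy_real_life, [5000.0, 0.0], expected_r2=0.85, out_dir=tempfile.mkdtemp()
)
```

Passing `group="G4"` would produce `G4.csv` and `g4_r2_prediction.txt`, matching the file name pattern in the list above.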

In [2]:
import pandas as pd

data = pd.read_csv("./data/training.csv")

data.head()

,Unnamed: 0,store_ID,day_of_week,date,nb_customers_on_day,open,promotion,state_holiday,school_holiday,sales
0,425390,366,4,2013-04-18,517,1,0,0,0,4422
1,291687,394,6,2015-04-11,694,1,0,0,0,8297
2,411278,807,4,2013-08-29,970,1,1,0,0,9729
3,664714,802,2,2013-05-28,473,1,1,0,0,6513
4,540835,726,4,2013-10-10,1068,1,1,0,0,10882


In [None]:
data.shape  # Note: one row fewer than the 640841 entries stated in the task description

(640840, 10)
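
One way to investigate the one-row difference is to compare the number of physical lines in the CSV with the number of rows pandas parses. The sketch below uses a tiny temporary file in place of `training.csv`; the helper name `count_lines_and_rows` is hypothetical. Whether the stated 640841 actually counts the header line is an assumption that would need checking against the real file.

```python
import tempfile
from pathlib import Path

import pandas as pd


def count_lines_and_rows(csv_path):
    """Return (physical lines in the file, data rows parsed by pandas)."""
    with open(csv_path, "r", encoding="utf-8") as handle:
        n_lines = sum(1 for _ in handle)
    n_rows = len(pd.read_csv(csv_path))
    return n_lines, n_rows


# Toy file: one header line plus three data rows.
toy_path = Path(tempfile.mkdtemp()) / "toy.csv"
toy_path.write_text("store_ID,sales\n1,100\n2,200\n3,300\n")

n_lines, n_rows = count_lines_and_rows(toy_path)
print(n_lines, n_rows)
```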

In [9]:
print(data.isnull().sum()) # No nulls

Unnamed: 0             0
store_ID               0
day_of_week            0
date                   0
nb_customers_on_day    0
open                   0
promotion              0
state_holiday          0
school_holiday         0
sales                  0
dtype: int64


In [None]:
print(data.dtypes)  # date and state_holiday are object (string) columns; all others are int64

Unnamed: 0              int64
store_ID                int64
day_of_week             int64
date                   object
nb_customers_on_day     int64
open                    int64
promotion               int64
state_holiday          object
school_holiday          int64
sales                   int64
dtype: object


In [11]:
data.describe()

,Unnamed: 0,store_ID,day_of_week,nb_customers_on_day,open,promotion,school_holiday,sales
count,640840.0,640840.0,640840.0,640840.0,640840.0,640840.0,640840.0,640840.0
mean,355990.675084,558.211348,4.000189,633.398577,0.830185,0.381718,0.178472,5777.469011
std,205536.290268,321.878521,1.996478,464.094416,0.37547,0.485808,0.38291,3851.338083
min,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,178075.75,280.0,2.0,405.0,1.0,0.0,0.0,3731.0
50%,355948.5,558.0,4.0,609.0,1.0,0.0,0.0,5746.0
75%,533959.25,837.0,6.0,838.0,1.0,1.0,0.0,7860.0
max,712044.0,1115.0,7.0,5458.0,1.0,1.0,1.0,41551.0


##### Observations:
- `sales` ranges from 0 to 41,551 with `mean` ~5,777 and `standard deviation` ~3,851

- `nb_customers_on_day` ranges from 0 to 5,458 with `mean` ~633

- `open = 0` means the store was closed. Those rows are expected to have `sales` = 0, which is worth verifying.
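
A short sketch of how the last observation could be checked. It runs on a made-up frame that reuses the notebook's column names; on the real data the same two counts would be computed on `data`.

```python
import pandas as pd

# Toy rows (made up) with the same column names as the training set.
toy = pd.DataFrame({
    "open": [1, 0, 1, 0],
    "nb_customers_on_day": [500, 0, 0, 0],
    "sales": [4000, 0, 0, 0],
})

# Closed stores that still report sales would contradict the assumption.
closed = toy[toy["open"] == 0]
closed_with_sales = int((closed["sales"] > 0).sum())

# Open stores with zero sales are a separate edge case worth knowing about.
open_without_sales = int(((toy["open"] == 1) & (toy["sales"] == 0)).sum())

print(closed_with_sales, open_without_sales)
```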

##### Facts:
- Random variation and “spikes” in sales will reduce predictive accuracy.
- If `sales` have high standard deviation, even a good model may not explain 100% of the variation.

##### Some theory:
- `Standard deviation` measures how spread out the values are from the mean.
- `Variance` is the standard deviation squared.
- `R²` measures the share of the target's variance that your model explains. It is at most 1 and drops below 0 when the model does worse than always predicting the mean.


| R² Value | Meaning                                                                       |
| -------- | ----------------------------------------------------------------------------- |
| **1**    | Perfect prediction: model explains 100% of the variance                       |
| **0.9**  | Model explains 90% of the variance — very good                                |
| **0.7**  | Model explains 70% of the variance — decent, may improve with better features |
| **0.5**  | Model explains 50% of the variance — weak                                     |
| **0**    | Model explains none of the variance — predicts just the mean                  |
| **<0**   | Model is worse than predicting the mean — seriously bad                       |
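
The table rows follow from the definition R² = 1 - SS_res / SS_tot. Below is a small sketch that computes it by hand on made-up numbers and compares the result with `sklearn.metrics.r2_score`; the helper name `r2_manual` is hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score


def r2_manual(y_true, y_pred):
    """R² = 1 - (residual sum of squares) / (total sum of squares around the mean)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_true = np.array([3.0, 5.0, 7.0, 9.0])

perfect = r2_manual(y_true, y_true)                        # predictions equal the truth
mean_only = r2_manual(y_true, np.full(4, y_true.mean()))   # always predict the mean
reversed_order = r2_manual(y_true, y_true[::-1])           # worse than the mean

print(perfect, mean_only, reversed_order)  # 1.0 0.0 -3.0
```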


##### Model limitations:

- A `LinearRegression` using all features may capture 60–80% of the variance.

- A tree-based model (`Random Forest / XGBoost`) with good feature engineering may reach R² ≈ 0.85–0.95, but never exactly 1 because:

    - Sales have random daily fluctuations

    - Unknown events (e.g., local events) affect sales
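
To illustrate why random fluctuations cap R² below 1, the sketch below scores a "perfect" model that knows the true signal exactly but cannot know the noise. All numbers are made up; the noise level is an assumption, not an estimate from the sales data.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Predictable part of the target and an unpredictable daily fluctuation.
signal = rng.uniform(2000, 10000, size=10_000)
noise = rng.normal(0, 1000, size=10_000)
observed = signal + noise

# Even predicting the exact signal leaves the noise unexplained.
ceiling = r2_score(observed, signal)
print(f"R² of the noise-free signal: {ceiling:.3f}")
```

With these parameters the ceiling sits around 0.84, since the noise variance is roughly 16% of the total variance.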

In [None]:
data["open"].value_counts()


open
1    532016
0    108824
Name: count, dtype: int64

In [None]:
data["state_holiday"].value_counts()

state_holiday
0    621160
a     12842
b      4214
c      2624
Name: count, dtype: int64

In [None]:
data["promotion"].value_counts()

promotion
0    396220
1    244620
Name: count, dtype: int64

In [59]:
data["store_ID"].value_counts()

store_ID
1045    645
309     636
754     635
432     634
286     634
       ... 
1004    448
287     448
1065    445
81      438
542     436
Name: count, Length: 1115, dtype: int64

##### Next steps:

- Step 1: Clean & prepare the data

- Step 2: Choose features and target X,y

- Step 3: Split train/test (`train_test_split`)

- Step 4: Train (which algorithm(s)?) -> start with `LinearRegression`
    - Candidate algorithms for sales data (predicting sales is a regression problem):

        - `LinearRegression` (Cons: cannot capture non-linear patterns)
        - Tree-based models (most common for sales data): `XGBoost`, `DecisionTreeRegressor`

- Step 5: Predict (`y_pred`)

- Step 6: R2 (`r2_score(y_test, y_pred)`)
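
The steps above can be sketched end to end on synthetic data. The toy frame and its coefficients are assumptions chosen only so that the example runs on its own; the real notebook uses `training.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Step 1: a clean toy data set (made up) with a non-linear "closed store" effect.
rng = np.random.default_rng(42)
n = 2000
toy = pd.DataFrame({
    "nb_customers_on_day": rng.integers(0, 1500, size=n),
    "open": rng.integers(0, 2, size=n),
    "promotion": rng.integers(0, 2, size=n),
})
toy["sales"] = toy["open"] * (
    9 * toy["nb_customers_on_day"] + 800 * toy["promotion"] + rng.normal(0, 300, size=n)
)

# Step 2: features and target.
X = toy.drop(columns=["sales"])
y = toy["sales"]

# Step 3: train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Steps 4 to 6: train each candidate, predict, and score.
candidates = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(max_depth=8, random_state=42),
}
scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    scores[name] = r2_score(y_test, estimator.predict(X_test))

print(scores)
```

On this toy data the tree scores higher because it can model the interaction between `open` and the other features, which a plain linear model cannot.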

In [28]:
data.groupby("state_holiday")["sales"].sum()

# Preprocessing ideas:
# - convert state_holiday to a binary flag (0, 1, 1, 1) or ordinal codes (0, 1, 2, 3) and cast it to integer
# - extract month and year from date
# - drop date

state_holiday
0    3697272529
a       3626172
b       1065876
c        468664
Name: sales, dtype: int64
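
A minimal sketch of the two `state_holiday` conversions mentioned in the cell comments, run on a toy column. The new column names `is_state_holiday` and `state_holiday_code` are hypothetical, and the cast to `str` guards against a possible mix of integer `0` and string `"0"`, which is an assumption about how the CSV was parsed.

```python
import pandas as pd

# Toy column mixing string codes and an integer zero.
toy = pd.DataFrame({"state_holiday": ["0", "a", "b", "c", 0]})

# Normalise everything to text so 0 and "0" are treated alike.
as_text = toy["state_holiday"].astype(str)

# Option 1: binary flag (0, 1, 1, 1).
toy["is_state_holiday"] = (as_text != "0").astype(int)

# Option 2: ordinal codes (0, 1, 2, 3).
toy["state_holiday_code"] = as_text.map({"0": 0, "a": 1, "b": 2, "c": 3}).astype(int)

print(toy)
```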

In [None]:
print("Sum: State Holiday - Open\n", data.groupby("state_holiday")["open"].sum())
# 0 → normal days → 531,437 open store-days
# a → public holiday → 429 open store-days
# b → Easter → 102 open store-days
# c → Christmas → 48 open store-days
print("\nSum: State Holiday - Sales\n", data.groupby("state_holiday")["sales"].sum())
# Insight: total sales are far lower on holidays, mainly because few stores are open on those days.
print("\nCount: State Holiday - Open\n", data.groupby("state_holiday")["open"].count())
# Total rows (store-days) in the dataset for each holiday type.
# E.g. 12,842 rows are public holidays (a), 4,214 are Easter (b), etc.

Sum: State Holiday - Open
 state_holiday
0    531437
a       429
b       102
c        48
Name: open, dtype: int64

Sum: State Holiday - Sales
 state_holiday
0    3697272529
a       3626172
b       1065876
c        468664
Name: sales, dtype: int64

Count: State Holiday - Open
 state_holiday
0    621160
a     12842
b      4214
c      2624
Name: open, dtype: int64
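
Totals mix two effects: how many stores are open and how much each open store sells. Dividing the totals above by the open counts gives roughly 6,957 per open store-day on normal days versus roughly 8,453 (a), 10,450 (b) and 9,764 (c), assuming closed stores contribute zero sales. A sketch of the same per-open-store summary on made-up rows:

```python
import pandas as pd

# Toy rows (made up) with the notebook's column names.
toy = pd.DataFrame({
    "state_holiday": ["0", "0", "0", "a", "a", "a"],
    "open":          [1,   1,   0,   1,   0,   0],
    "sales":         [5000, 7000, 0, 9000, 0, 0],
})

summary = toy.groupby("state_holiday").agg(
    rows=("open", "size"),        # store-days per holiday type
    open_rows=("open", "sum"),    # store-days on which the store was open
    total_sales=("sales", "sum"),
)
summary["share_open"] = summary["open_rows"] / summary["rows"]

# Mean sales restricted to open stores, so closed stores do not drag the average down.
summary["mean_sales_when_open"] = (
    toy[toy["open"] == 1].groupby("state_holiday")["sales"].mean()
)

print(summary)
```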


In [69]:
import matplotlib.pyplot as plt
import seaborn as sns

# --- Correlation heatmap of the numeric columns ---
# state_holiday holds string codes such as 'a', so a plain data.corr() raises
# "ValueError: could not convert string to float: 'a'".
# numeric_only=True restricts the correlation to numeric columns.
plt.figure(figsize=(10,8))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()


#### Manipulate data

In [None]:
data = data.drop(columns=["Unnamed: 0"])

In [53]:
data["date"] = pd.to_datetime(data["date"])
data["month"] = data["date"].dt.month
data["year"] = data["date"].dt.year
data = data.drop(columns=["date"])
data.head()

,store_ID,day_of_week,nb_customers_on_day,open,promotion,state_holiday,school_holiday,sales,month,year
0,366,4,517,1,0,0,0,4422,4,2013
1,394,6,694,1,0,0,0,8297,4,2015
2,807,4,970,1,1,0,0,9729,8,2013
3,802,2,473,1,1,0,0,6513,5,2013
4,726,4,1068,1,1,0,0,10882,10,2013


In [96]:
# Step 2: Define features and target

X = data[["store_ID", "day_of_week", "nb_customers_on_day", "open", "promotion",
          "month", "year"]]  # excludes "state_holiday" and "school_holiday"

y = data["sales"]

In [97]:
# Step 3: Split into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

#### Models Time:

In [98]:
# Linear Regression
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

# To use state_holiday as a feature, it has to be made numeric first:
# Option 1: Convert the categories to numeric codes
# Option 2 (better for linear models): One-hot encode

#### Predict & R2 per Model:

In [99]:
from sklearn.metrics import r2_score

y_pred_linear_regression = model.predict(X_test)
r2_linear_regression = r2_score(y_test, y_pred_linear_regression)
# R² for different feature sets (features left out -> R² score):
# "state_holiday", "school_holiday"                 -> 0.8513495928002516
# "store_ID", "state_holiday", "school_holiday"     -> 0.8511873298774476
# "store_ID", "state_holiday"                       -> 0.8511872183658378
# "state_holiday", "school_holiday", "day_of_week"  -> 0.8508182898805067
# "store_ID", "state_holiday", "day_of_week"        -> 0.8506748572837379


print(f"R² score: {r2_linear_regression}")


R² score: 0.8513495928002516
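
The feature-exclusion runs recorded in the comments can be automated with a small loop instead of editing the feature list by hand. This is a sketch on synthetic data; the helper name `r2_without` and the toy coefficients are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def r2_without(df, target, excluded):
    """Fit LinearRegression on every column except `target` and `excluded`; return the test R²."""
    X = df.drop(columns=[target, *excluded])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = LinearRegression().fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))


# Toy data (made up): sales driven mostly by customers, a little by promotion.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "nb_customers_on_day": rng.integers(0, 1500, size=1000),
    "promotion": rng.integers(0, 2, size=1000),
    "day_of_week": rng.integers(1, 8, size=1000),
})
toy["sales"] = (
    9 * toy["nb_customers_on_day"] + 800 * toy["promotion"] + rng.normal(0, 300, size=1000)
)

exclusions = [(), ("day_of_week",), ("promotion",), ("nb_customers_on_day",)]
results = {excluded: r2_without(toy, "sales", list(excluded)) for excluded in exclusions}

for excluded, score in results.items():
    print(excluded or "(nothing excluded)", round(score, 4))
```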


#### One-hot encoding of state_holiday and new results

In [102]:
# One-hot encode state_holiday; drop_first=True drops the first category (0) as the baseline
data_after_one_hot = pd.get_dummies(data, columns=["state_holiday"], drop_first=True)
data_after_one_hot

,store_ID,day_of_week,nb_customers_on_day,open,promotion,school_holiday,sales,month,year,state_holiday_a,state_holiday_b,state_holiday_c
0,366,4,517,1,0,0,4422,4,2013,False,False,False
1,394,6,694,1,0,0,8297,4,2015,False,False,False
2,807,4,970,1,1,0,9729,8,2013,False,False,False
3,802,2,473,1,1,0,6513,5,2013,False,False,False
4,726,4,1068,1,1,0,10882,10,2013,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
640835,409,6,483,1,0,0,4553,10,2013,False,False,False
640836,97,1,987,1,1,0,12307,4,2014,False,False,False
640837,987,1,925,1,0,0,6800,7,2014,False,False,False
640838,1084,4,725,1,0,0,5344,6,2014,False,False,False
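
A toy illustration of what `drop_first=True` does: with four categories only three indicator columns are created, and a row whose indicators are all false belongs to the dropped baseline category. The frame below is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "state_holiday": ["0", "a", "b", "c"],
    "sales": [5000, 0, 0, 0],
})

encoded = pd.get_dummies(toy, columns=["state_holiday"], drop_first=True)

# Columns: sales, state_holiday_a, state_holiday_b, state_holiday_c
print(list(encoded.columns))

# The first row (category "0") has no indicator set.
print(encoded.iloc[0, 1:].any())
```

Dropping one category avoids a redundant column, since the dropped category is implied when all other indicators are false.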


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = data_after_one_hot[[
    "store_ID",
    "day_of_week",
    "nb_customers_on_day",
    "open",
    "promotion",
    "school_holiday",
    "month",
    "year",
    "state_holiday_a",
    "state_holiday_b",
    "state_holiday_c"
]]

y = data_after_one_hot["sales"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred_linear_regression_one_hot = model.predict(X_test)
r2_linear_regression_one_hot = r2_score(y_test, y_pred_linear_regression_one_hot)

print(f"R² score: {r2_linear_regression_one_hot}") # R² score: 0.8523183320745474

R² score: 0.8523183320745474


#### One-hot encoding of store_ID


In [104]:
# One-hot encode store_ID as well: one indicator column per store, with the first store dropped
data_after_one_hot_Store_ID_and_state_holiday = pd.get_dummies(data_after_one_hot, columns=["store_ID"], drop_first=True)
data_after_one_hot_Store_ID_and_state_holiday

,day_of_week,nb_customers_on_day,open,promotion,school_holiday,sales,month,year,state_holiday_a,state_holiday_b,...,store_ID_1106,store_ID_1107,store_ID_1108,store_ID_1109,store_ID_1110,store_ID_1111,store_ID_1112,store_ID_1113,store_ID_1114,store_ID_1115
0,4,517,1,0,0,4422,4,2013,False,False,...,False,False,False,False,False,False,False,False,False,False
1,6,694,1,0,0,8297,4,2015,False,False,...,False,False,False,False,False,False,False,False,False,False
2,4,970,1,1,0,9729,8,2013,False,False,...,False,False,False,False,False,False,False,False,False,False
3,2,473,1,1,0,6513,5,2013,False,False,...,False,False,False,False,False,False,False,False,False,False
4,4,1068,1,1,0,10882,10,2013,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
640835,6,483,1,0,0,4553,10,2013,False,False,...,False,False,False,False,False,False,False,False,False,False
640836,1,987,1,1,0,12307,4,2014,False,False,...,False,False,False,False,False,False,False,False,False,False
640837,1,925,1,0,0,6800,7,2014,False,False,...,False,False,False,False,False,False,False,False,False,False
640838,4,725,1,0,0,5344,6,2014,False,False,...,False,False,False,False,False,False,False,False,False,False


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = data_after_one_hot_Store_ID_and_state_holiday.drop(columns=["sales"])
y = data_after_one_hot_Store_ID_and_state_holiday["sales"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred_linear_regression_one_hot2 = model.predict(X_test)
r2_linear_regression_one_hot2 = r2_score(y_test, y_pred_linear_regression_one_hot2)

print(f"R² score: {r2_linear_regression_one_hot2}")  # R² score: 0.9531318324317393

# Note: this cell took a long time to run (34.9 seconds)





R² score: 0.9531318324317393


# Observations on R² Scores

## 1. Before one-hot encoding
- Features: `store_ID`, `day_of_week`, `nb_customers_on_day`, `open`, `promotion`, `month`, `year`  
  (excluding `state_holiday` and `school_holiday`)
- R² score: ~0.8513
- Variations when including/excluding features:

| Features Excluded                                    | R² Score  |
|------------------------------------------------------|-----------|
| `state_holiday` & `school_holiday`                  | 0.8513    |
| `store_ID`, `state_holiday`, `school_holiday`       | 0.8512    |
| `store_ID` & `state_holiday`                        | 0.8512    |
| `state_holiday`, `school_holiday`, `day_of_week`    | 0.8508    |
| `store_ID`, `state_holiday`, `day_of_week`          | 0.8507    |


## 2. After one-hot encoding `state_holiday`
- R² score: ~0.8523
- Slight improvement, computation is still fast.

## 3. After one-hot encoding both `state_holiday` and `store_ID`
- R² score: ~0.9531
- Much better fit (the model explains about 95% of the variance in sales)
- Training time: ~34.9 seconds due to the high number of features (1,124 feature columns)
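
One possible way to keep the per-store indicators while avoiding a wide dense frame is to let scikit-learn build a sparse one-hot matrix inside a pipeline. This is a sketch on synthetic data with 50 made-up stores; it was not timed against the notebook's approach, so any speed-up on the real data is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up): each store has its own sales offset.
rng = np.random.default_rng(0)
n = 5000
X = pd.DataFrame({
    "store_ID": rng.integers(1, 51, size=n),
    "nb_customers_on_day": rng.integers(0, 1500, size=n),
})
y = 9 * X["nb_customers_on_day"] + 20 * X["store_ID"] + rng.normal(0, 300, size=n)

# One-hot encode store_ID (sparse by default) and pass the other columns through unchanged.
preprocess = ColumnTransformer(
    transformers=[("store", OneHotEncoder(handle_unknown="ignore"), ["store_ID"])],
    remainder="passthrough",
)
pipeline = make_pipeline(preprocess, LinearRegression())
pipeline.fit(X, y)

# The fitted first step returns a SciPy sparse matrix: 50 indicators + 1 numeric column.
encoded = pipeline[0].transform(X)
print(type(encoded).__name__, encoded.shape)
```

Keeping the encoder inside the pipeline also means the same transformation is applied automatically when predicting on the real-life data set.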


In [109]:
y_pred_train = model.predict(X_train)

In [111]:
r2_new = r2_score(y_train, y_pred_train)  # R² on the training set, to compare with the test R²
r2_new

0.9537409319110332
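
The training R² (0.9537) is only about 0.0006 above the test R² (0.9531), which suggests little overfitting. The sketch below wraps that comparison in a helper and contrasts a linear model with an unrestricted decision tree on synthetic data; the helper name `r2_train_test_gap` and the toy data are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def r2_train_test_gap(model, X_train, X_test, y_train, y_test):
    """Return (train R², test R², train minus test) for an already fitted model."""
    train_r2 = r2_score(y_train, model.predict(X_train))
    test_r2 = r2_score(y_test, model.predict(X_test))
    return train_r2, test_r2, train_r2 - test_r2


# Toy data (made up): one informative feature, two irrelevant ones, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 3))
y = 10 * X[:, 0] + rng.normal(0, 2, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

linear_gap = r2_train_test_gap(
    LinearRegression().fit(X_train, y_train), X_train, X_test, y_train, y_test
)
tree_gap = r2_train_test_gap(
    DecisionTreeRegressor(random_state=42).fit(X_train, y_train),
    X_train, X_test, y_train, y_test,
)

print("linear:", linear_gap)
print("tree:  ", tree_gap)
```

The unrestricted tree memorises the training rows, so its gap is much larger than the linear model's; a large gap is the usual warning sign of overfitting.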

In [None]:
def adjusted_r2_score(y_true, y_pred, X):
    """Compute adjusted R² for a regression model."""
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)         # number of samples
    p = X.shape[1]          # number of predictors
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y_pred = model.predict(X_test)
adj_r2 = adjusted_r2_score(y_test, y_pred, X_test)
print("Adjusted R²:", adj_r2)


Adjusted R²: 0.9527171710938716
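
As a check on the arithmetic, the printed value can be reproduced by hand. This assumes the test split holds n = 128,168 rows (20% of the 640,840 rows) and p = 1,124 predictors (the number of columns in `X`):

$$
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
= 1 - (1 - 0.953132)\,\frac{128167}{127043}
\approx 1 - 0.046868 \times 1.008847
\approx 0.95272
$$

The penalty is small because the number of test rows is about 114 times the number of predictors.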


In [113]:
X.shape[1] 

1124