
# 📊 Sales Forecasting Project

Hey! This is my attempt at building a simple sales forecasting model.  
The idea is to take sales data (daily/weekly), do some feature engineering, and then train a model (LightGBM) to predict future sales.

I'll go step by step so it’s easy to follow.  


In [None]:

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lightgbm as lgb

# train_test_split is not needed: the split below is chronological, not random
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import joblib

# just to keep results consistent
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

def mape(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    mask = y_true != 0
    return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100



## 1. Load the data

I have a CSV file with sales data. I'll load it here.  
(Change the filename if your dataset has a different name.)


In [None]:

# Load dataset
DATA_PATH = "/mnt/data/submission.csv"  # change if needed
df = pd.read_csv(DATA_PATH)

print("Shape of data:", df.shape)
df.head()



## 2. Quick look at the data

Let’s check datatypes, missing values, and summary statistics.


In [None]:

df.info()  # info() prints directly and returns None, so no print() needed
print("\nMissing values per column:")
print(df.isna().sum())

df.describe().T



## 3. Preprocessing

Now I’ll make sure the date column is in datetime format,  
and then set the target variable (sales).

👉 Change `DATE_COL` and `TARGET` below if your dataset uses different column names.


In [None]:

# Change these if needed
DATE_COL = "date"    # example: 'date', 'order_date', etc.
TARGET = "sales"     # example: 'sales', 'y', etc.

df[DATE_COL] = pd.to_datetime(df[DATE_COL])
df = df.sort_values(DATE_COL).reset_index(drop=True)

# If multiple rows per day, aggregate
daily = df.groupby(DATE_COL)[TARGET].sum().reset_index()
daily.rename(columns={TARGET: "y"}, inplace=True)
daily.head()
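
One thing to watch before building lags: `groupby` only returns dates that actually appear in the data, and the lag features in the next step shift by rows, not by calendar days. Here is a minimal sketch of filling calendar gaps, assuming daily frequency and assuming that a missing day means zero sales (that may not hold for every dataset). The `toy` frame is hypothetical and only for illustration.

```python
import pandas as pd

# Hypothetical toy data: 2024-01-03 is missing on purpose
toy = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"]),
    "y": [10.0, 12.0, 9.0],
})

# Reindex onto a complete daily calendar; missing days show up as NaN
full = toy.set_index("date").asfreq("D")

# Assumption: no row means no sales, so fill the gap with 0
full["y"] = full["y"].fillna(0.0)
full = full.reset_index()

print(full)
```

If gaps mean "not recorded" rather than "no sales", interpolation or dropping those days might be a better choice than filling with zero.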



## 4. Feature engineering

I’ll add some date-based features (day, month, weekday, etc.)  
and also create lag/rolling features (previous sales, moving averages).


In [None]:

data = daily.copy()
data['day'] = data[DATE_COL].dt.day
data['month'] = data[DATE_COL].dt.month
data['year'] = data[DATE_COL].dt.year
data['dayofweek'] = data[DATE_COL].dt.dayofweek

# Lags and rolling averages
for lag in [1, 7, 14]:
    data[f"lag_{lag}"] = data["y"].shift(lag)

for window in [7, 14]:
    data[f"roll_mean_{window}"] = data["y"].shift(1).rolling(window).mean()

# drop rows with NaN (from lagging)
data = data.dropna().reset_index(drop=True)
data.head()
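
The `shift(1)` before `rolling(...)` matters: without it, the rolling mean for a given day would include that day's own sales, which leaks the target into the features. A small sketch on a made-up series shows the difference.

```python
import pandas as pd

# Made-up series, just to compare the two variants side by side
y = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Rolling mean that includes the current value (leaks the target)
leaky = y.rolling(2).mean()

# Shift first, so each row only sees strictly earlier values
safe = y.shift(1).rolling(2).mean()

print(pd.DataFrame({"y": y, "leaky": leaky, "safe": safe}))
```

In the `safe` column, the value at each row is the mean of the two previous rows only, which is what would be known at prediction time.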



## 5. Train-test split

I’ll keep the last 30 days (the last `TEST_DAYS` rows) for testing and train on everything before them, so the model never sees data from the test period.


In [None]:

TEST_DAYS = 30
train = data.iloc[:-TEST_DAYS]
test = data.iloc[-TEST_DAYS:]

FEATURES = [c for c in data.columns if c not in [DATE_COL, "y"]]
print("Features:", FEATURES)

X_train, y_train = train[FEATURES], train["y"]
X_test, y_test = test[FEATURES], test["y"]
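
A single holdout gives only one estimate of the error. If more than one is wanted, scikit-learn's `TimeSeriesSplit` produces several chronological folds where training data always precedes test data. This is a sketch on a hypothetical stand-in array, not part of the notebook's pipeline; `n_splits=3` and `test_size=10` are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical stand-in for the feature matrix: 100 time-ordered rows
X = np.arange(100).reshape(-1, 1)

# Expanding-window folds; each test block has 10 rows
tscv = TimeSeriesSplit(n_splits=3, test_size=10)

folds = []
for train_idx, test_idx in tscv.split(X):
    folds.append((train_idx, test_idx))
    print(f"train: 0..{train_idx[-1]}  test: {test_idx[0]}..{test_idx[-1]}")
```

Each fold could be used to fit and score the model in turn, and the scores averaged.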



## 6. Train a model

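I’ll use a LightGBM regressor, which tends to work well on tabular data with lag and calendar features.  
The features also go through a StandardScaler inside a pipeline. Tree-based models don’t depend on feature scale, so this step is optional, but the pipeline keeps preprocessing and model together in one object.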


In [None]:

model = Pipeline([
    ("scaler", StandardScaler()),
    ("lgb", lgb.LGBMRegressor(random_state=RANDOM_STATE))
])

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("RMSE:", rmse(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))
print("MAPE:", mape(y_test, y_pred))
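
These numbers are easier to judge next to a simple baseline. One common choice is the seasonal-naive forecast, which predicts the value from 7 days earlier. The sketch below runs on a synthetic series (hypothetical data with a weekly pattern plus noise), so the MAE it prints says nothing about the real dataset. It only shows how the baseline could be computed.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Hypothetical synthetic daily series: weekly pattern plus noise
rng = np.random.default_rng(42)
dates = pd.date_range("2024-01-01", periods=120, freq="D")
weekly = np.tile([100, 90, 95, 105, 130, 160, 150], 18)[:120]
y = pd.Series(weekly + rng.normal(0, 5, size=120), index=dates)

TEST_DAYS = 30

# Seasonal-naive forecast: predict the value observed 7 days earlier
naive_pred = y.shift(7).iloc[-TEST_DAYS:]
actual = y.iloc[-TEST_DAYS:]

baseline_mae = mean_absolute_error(actual, naive_pred)
print(f"Seasonal-naive MAE: {baseline_mae:.2f}")
```

If the model's MAE is not clearly below the baseline's on the same test window, the extra features are probably not adding much.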



## 7. Predictions vs Actual


In [None]:

plt.figure(figsize=(12,5))
plt.plot(test[DATE_COL], y_test, marker="o", label="Actual")
plt.plot(test[DATE_COL], y_pred, marker="o", label="Predicted")
plt.legend()
plt.title(f"Sales Forecast (Last {TEST_DAYS} days)")
plt.show()



## 8. Save model

Next, I’ll save the trained pipeline (scaler + model) so it can be reused later (deployment, API, etc.).


In [None]:

joblib.dump(model, "/mnt/data/sales_forecast_model.joblib")
print("Model saved!")
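
Loading the model back works with `joblib.load`. The sketch below is a round trip on a stand-in pipeline: it uses `LinearRegression` instead of LightGBM and random data, purely so the example runs on its own, and it writes to a temporary directory rather than the path used above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data and pipeline (assumption: not the notebook's real model)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, 3.0])

pipe = Pipeline([("scaler", StandardScaler()), ("reg", LinearRegression())])
pipe.fit(X, y)

# Save to a temporary file, then load it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.joblib")
    joblib.dump(pipe, path)
    loaded = joblib.load(path)

# The reloaded pipeline should give the same predictions
same = np.allclose(pipe.predict(X), loaded.predict(X))
print("Predictions match after reload:", same)
```

When reusing the real model, the input needs the same feature columns, in the same order, as `FEATURES` at training time.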



## 9. Save predictions in a clean format for Power BI



In [None]:

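results = test.copy()
results["Predicted_Sales"] = y_pred
# The target was renamed to "y" during preprocessing, so rename it back for the export
results = results.rename(columns={"y": TARGET})
results = results[[DATE_COL, TARGET, "Predicted_Sales"]]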

results.to_csv("sales_forecast_results.csv", index=False)
print("Saved: sales_forecast_results.csv")
