<a href="https://colab.research.google.com/github/Shahroz-Harral/demand-forecasting/blob/main/forecasting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 1: Data Preparation

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
# Load dataset
data = pd.read_csv('/content/train_0irEZ2H.csv')

In [10]:
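# Convert week column to datetime.
# NOTE: '%y/%m/%d' reads a value such as 17/01/11 as 2017-01-11. If the file
# stores the day first, use format='%d/%m/%y' instead.
data['week'] = pd.to_datetime(data['week'], format='%y/%m/%d')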

In [11]:
# Extract relevant date features
data['year'] = data['week'].dt.year
data['month'] = data['week'].dt.month
data['day'] = data['week'].dt.day
data['week_number'] = data['week'].dt.isocalendar().week

In [12]:
# Define features and target
features = ['year', 'month', 'day', 'week_number', 'store_id', 'sku_id', 'total_price', 'base_price', 'is_featured_sku', 'is_display_sku']
target = 'units_sold'


In [13]:
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data[features], data[target], test_size=0.2, random_state=42)
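
Because the rows are weekly observations, a random split places some training weeks after some test weeks. One alternative is to hold out the most recent weeks instead. The sketch below assumes a date-based holdout is acceptable for this problem; `chronological_split` and the toy frame are illustrative names, not part of the original notebook.

```python
import pandas as pd

def chronological_split(df, date_col, test_fraction=0.2):
    """Split df so the most recent dates form the test set (hypothetical helper)."""
    # Sort the distinct dates and pick the cutoff that leaves roughly
    # test_fraction of the dates in the test set.
    unique_dates = df[date_col].drop_duplicates().sort_values().reset_index(drop=True)
    n_test = max(1, int(round(len(unique_dates) * test_fraction)))
    cutoff = unique_dates.iloc[-n_test]
    train = df[df[date_col] < cutoff]
    test = df[df[date_col] >= cutoff]
    return train, test

# Toy data: 10 weeks, two rows per week
toy = pd.DataFrame({
    'week': pd.date_range('2011-01-17', periods=10, freq='7D').repeat(2),
    'units_sold': range(20),
})
train_df, test_df = chronological_split(toy, 'week', test_fraction=0.2)
print(len(train_df), len(test_df))
print(train_df['week'].max() < test_df['week'].min())
```

With this kind of split, every test week is later than every training week, which is closer to how a forecast is used in practice.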

# Step 2: Feature Engineering

Feature engineering can improve model performance. Here we add two features derived from the price columns: the difference between total_price and base_price, and the ratio of total_price to base_price.

In [14]:
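# Feature engineering: price-based features
# Work on copies so the assignments below do not trigger SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()

# Difference between selling price and base price (negative when discounted)
X_train['price_diff'] = X_train['total_price'] - X_train['base_price']
X_test['price_diff'] = X_test['total_price'] - X_test['base_price']

# Ratio of selling price to base price
X_train['total_base_ratio'] = X_train['total_price'] / X_train['base_price']
X_test['total_base_ratio'] = X_test['total_price'] / X_test['base_price']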
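
The same two transformations have to be applied to every frame the model sees (training, test, and any future data), so one option is to wrap them in a function. This is a minimal sketch; `add_price_features` is a hypothetical helper name that does not appear in the original notebook.

```python
import pandas as pd

def add_price_features(df):
    """Return a copy of df with price_diff and total_base_ratio added."""
    out = df.copy()
    out['price_diff'] = out['total_price'] - out['base_price']
    out['total_base_ratio'] = out['total_price'] / out['base_price']
    return out

# Small illustrative frame: one discounted row, one full-price row
example = pd.DataFrame({'total_price': [90.0, 100.0], 'base_price': [100.0, 100.0]})
example = add_price_features(example)
print(example)
```

Returning a copy keeps the input frame unchanged, which avoids surprises when the same frame is reused elsewhere.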

# Step 3: Model Selection and Training
We train three regression models and select the best one by mean squared error (MSE) on the test set.

In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

In [16]:
# Initialize models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'XGBoost': XGBRegressor(n_estimators=100, random_state=42)
}

In [17]:
# Train and evaluate models
model_performance = {}
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    model_performance[model_name] = mean_squared_error(y_test, y_pred)


In [18]:
# Print model performance
for model_name, mse in model_performance.items():
    print(f"{model_name}: MSE = {mse:.4f}")

Linear Regression: MSE = 2319.4835
Random Forest: MSE = 720.1785
XGBoost: MSE = 590.8568
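
MSE is expressed in squared units, which makes it hard to relate to actual sales. Taking the square root gives the RMSE, which is in units sold. The short sketch below does this for the three values printed above; the dictionary name `model_mse` is introduced here only for illustration.

```python
import math

# MSE values copied from the output above
model_mse = {
    'Linear Regression': 2319.4835,
    'Random Forest': 720.1785,
    'XGBoost': 590.8568,
}

# RMSE is the square root of MSE, so it is in the same units as units_sold
model_rmse = {name: math.sqrt(mse) for name, mse in model_mse.items()}

# Print from best (lowest error) to worst
for name, rmse in sorted(model_rmse.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {rmse:.2f}")
```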



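# Step 4: Model Evaluation and Hyperparameter Tuning

We'll perform hyperparameter tuning using GridSearchCV. XGBoost had the lowest test MSE in Step 3, but the example below tunes the Random Forest; the same procedure applies to XGBRegressor with its own parameter grid.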

In [19]:
from sklearn.model_selection import GridSearchCV

In [None]:
# Example with Random Forest
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=3, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Best model and parameters
best_model = grid_search.best_estimator_
print(f"Best parameters: {grid_search.best_params_}")
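
Grid search fits one model per parameter combination per cross-validation fold, so the cost grows quickly with the grid size. The sketch below counts the fits for the grid above using only the standard library. It assumes GridSearchCV's default `refit=True`, which adds one final fit on the full training set.

```python
from itertools import product

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
}
cv_folds = 3

# Every combination of one value per parameter
combinations = list(product(*param_grid.values()))

# One fit per combination per fold, plus the final refit of the best combination
n_fits = len(combinations) * cv_folds + 1
print(f"{len(combinations)} combinations, {n_fits} model fits")
```

If this takes too long on the full dataset, passing `n_jobs=-1` to GridSearchCV runs the fits in parallel across the available cores.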

# Step 5: Forecasting and Performance Improvement

Finally, we use the best model for forecasting and evaluate its performance.

In [None]:
# Predict and evaluate the best model
y_pred_best = best_model.predict(X_test)
best_mse = mean_squared_error(y_test, y_pred_best)
print(f"Best Model MSE: {best_mse:.4f}")

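# Forecast future demand (assuming future data structure is similar)
# Start one week after the last observed week so the forecast does not overlap the training data
future_weeks = pd.date_range(start=data['week'].max() + pd.Timedelta(weeks=1), periods=30, freq='7D')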
future_data = pd.DataFrame({
    'week': future_weeks,
    'year': future_weeks.year,
    'month': future_weeks.month,
    'day': future_weeks.day,
    'week_number': future_weeks.isocalendar().week,
    'store_id': [8091]*30,  # Assuming a single store for simplicity
    'sku_id': [216418]*30,  # Assuming a single SKU for simplicity
    'total_price': [100]*30,  # Placeholder values
    'base_price': [90]*30,  # Placeholder values
    'is_featured_sku': [0]*30,
    'is_display_sku': [0]*30
})

# Add engineered features
future_data['price_diff'] = future_data['total_price'] - future_data['base_price']
future_data['total_base_ratio'] = future_data['total_price'] / future_data['base_price']

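# Forecast (select the same columns, in the same order, that the model was trained on,
# including the engineered features that are not in the features list)
future_predictions = best_model.predict(future_data[X_train.columns])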

# Output the forecasted demand
forecast = pd.DataFrame({'week': future_weeks, 'predicted_units_sold': future_predictions})
print(forecast)
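
A model fitted on a DataFrame expects the same columns, in the same order, at prediction time. One way to guard against a mismatch is to check the future frame against the training columns before calling predict. This is a minimal sketch; `align_to_training_columns` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def align_to_training_columns(frame, training_columns):
    """Return frame restricted to training_columns, in that order (hypothetical helper)."""
    missing = [col for col in training_columns if col not in frame.columns]
    if missing:
        # Fail early with a clear message instead of inside model.predict
        raise KeyError(f"Missing columns: {missing}")
    return frame[list(training_columns)]

# Toy example: the future frame has an extra column and a different column order
training_columns = ['year', 'total_price', 'price_diff']
future = pd.DataFrame({
    'week': ['w1'],
    'price_diff': [10],
    'total_price': [100],
    'year': [2013],
})
aligned = align_to_training_columns(future, training_columns)
print(list(aligned.columns))
```

In the notebook, this could be called as `align_to_training_columns(future_data, X_train.columns)` before predicting.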
