# Capstone Project Rossmann

# Predictive Modeling

## Selection of models that will be tested to predict the sales of Rossmann stores

- Linear Regression Models:
	- Linear Regression: If the relationship between the predictors and the sales is linear, linear regression can be a good starting point.
	- Ridge/Lasso Regression: These are variations of linear regression that include regularization to prevent overfitting, especially useful if you have many predictors.

- Tree-based Models:
	- Decision Trees: Good for capturing non-linear relationships but can overfit.
	- Random Forest: An ensemble of decision trees, it is more robust and less likely to overfit than a single decision tree.
	- Gradient Boosting Machines (GBM): Models like Gradient Boosting Regressor or XGBoost, LightGBM, and CatBoost are powerful for capturing complex patterns in data.

- Support Vector Machines (SVM):
	- SVR (Support Vector Regression): Effective in high-dimensional spaces and with kernel functions, it can capture complex relationships.

- Nearest Neighbors:
	- K-Nearest Neighbors (KNN): Can be used for regression; it predicts the value based on the 'k' closest points.

- Time Series Specific Models (not in scikit-learn but worth considering):
	- ARIMA/SARIMA: Traditional time series models suitable for univariate time series.
	- Prophet: Developed by Facebook, good for daily data with multiple seasonality and holiday effects.
	- LSTM/GRU (Deep Learning): RNNs like LSTM or GRU can be effective, especially if you have a large amount of historical data.

- Linear Regression Models:  
	- Linear Regression
	- Ridge Regression
	- Lasso Regression

- Tree-based Models:
	- Decision Trees
	- Random Forest
	- Gradient Boosting Machines (GBM) (in scikit-learn: GradientBoostingRegressor)

- Support Vector Machines (SVM):
	- SVR (Support Vector Regression)

- Nearest Neighbors:
	- K-Nearest Neighbors (KNN)

- Neural Networks:
	- MLP (Multi-layer Perceptron)

- Time Series Specific Models:
	- ARIMA/SARIMA
	- Prophet
	- LSTM/GRU (Deep Learning)


I want to use:
- LinearRegression
- RidgeRegression
- LassoRegression
- DecisionTreeRegressor
- RandomForestRegressor
- GradientBoostingRegressor
- SVR
- KNN
- MLPRegressor


## Definition of KPIs for model evaluation

- Mean Absolute Error (MAE): The average of the absolute differences between predictions and actual values. It gives an idea of the magnitude of the error, but no information about the direction (over or under predicting).
- Mean Squared Error (MSE): The average of the squared differences between predictions and actual values. Squaring gives more weight to larger errors, so MSE is more sensitive to outliers than MAE; its value is in squared units of the response variable.
- Root Mean Squared Error (RMSE): The square root of the MSE, it is more interpretable than the MSE as it is in the same units as the response variable.
- R-squared (R2): The proportion of the variance in the dependent variable that is explained by the independent variables. It indicates the goodness of fit; computed on held-out data, it also indicates how well unseen samples are likely to be predicted by the model.
- Adjusted R-squared: The R-squared value adjusted for the number of predictors in the model. It is useful for comparing models with different numbers of predictors.
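The KPI definitions above can be written out directly. The following is a minimal sketch in plain Python, not the notebook's own implementation (the notebook uses the scikit-learn metric functions); the function name `regression_metrics` and the toy numbers are illustrative assumptions.

```python
from math import sqrt


def regression_metrics(y_true, y_pred, n_features):
    """Return MAE, MSE, RMSE, R2 and adjusted R2 for two equal-length sequences.

    n_features is the number of predictors used by the (hypothetical) model;
    it is only needed for the adjusted R2.
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n        # average absolute error
    mse = sum(e * e for e in errors) / n         # average squared error
    rmse = sqrt(mse)                             # same units as the target

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot

    # Penalize R2 for the number of predictors
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "Adj_R2": adj_r2}


# Toy data, invented for illustration only
y_true = [100.0, 150.0, 200.0, 250.0, 300.0]
y_pred = [110.0, 140.0, 210.0, 240.0, 330.0]

metrics = regression_metrics(y_true, y_pred, n_features=2)
print(metrics)
```

In this toy example the single error of 30 raises the RMSE (about 16.1) above the MAE (14.0), which illustrates how squaring weights large errors more heavily.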


## Functions needed for testing

### Test models with a train/test split. The test set includes the last 8 weeks of each store

In [3]:
## Test models with a train/test split. The test set includes the last 8 weeks of each store

def testModelsTestSplit8W(df, scaler):
	train_data = []
	test_data = []

	# Group by store and split into training and test data
	# (assumes the rows of each store are sorted chronologically, one row per week)
	amount_test_weeks = 8
	for store_id, group in df.groupby('Store'):
		train_data.append(group[: -amount_test_weeks])
		test_data.append(group[-amount_test_weeks:])

	# Combine the list entries to one dataframe
	train_df = pd.concat(train_data)
	test_df = pd.concat(test_data)

	# Create feature and target data frames
	X_train = train_df.drop(columns=['Future_Sales'])
	y_train = train_df['Future_Sales']
	X_test = test_df.drop(columns=['Future_Sales'])
	y_test = test_df['Future_Sales']

	# Scaling of the data
	if scaler:
		X_train = scaler.fit_transform(X_train)
		X_test = scaler.transform(X_test)

	def adj_r2_score(model, X, y):
		n = X.shape[0]
		p = X.shape[1]
		r2 = r2_score(y, model.predict(X))
		return 1 - (1 - r2) * ((n - 1) / (n - p - 1))

	# Defining the models to test
	models = [
		('LinearRegression', LinearRegression(n_jobs=-1)),
		#('RidgeRegression', Ridge(random_state=42)),
		#('LassoRegression', Lasso(random_state=42)),
		#('DecisionTreeRegressor', DecisionTreeRegressor(random_state=42)),
		#('RandomForestRegressor', RandomForestRegressor(n_jobs=-1, max_depth=10, random_state=42, n_estimators=100)),
		#('SVR', SVR()),
		#('KNN', KNeighborsRegressor())
	]

	results = []
	# Train models and calculate metrics
	for name, model in models:
		model.fit(X_train, y_train)
		y_train_pred = model.predict(X_train)
		y_test_pred = model.predict(X_test)

		results.append({
			'Model': name,
			'RMSE_Train': sqrt(mse(y_train, y_train_pred)),
			'MAE_Train': mae(y_train, y_train_pred),
			'R2_Train': r2_score(y_train, y_train_pred),
			'Adj_R2_Train': adj_r2_score(model, X_train, y_train),
			'RMSE_Test': sqrt(mse(y_test, y_test_pred)),
			'MAE_Test': mae(y_test, y_test_pred),
			'R2_Test': r2_score(y_test, y_test_pred),
			'Adj_R2_Test': adj_r2_score(model, X_test, y_test)
		})
		#print last result
		print(results[-1])

	results_df = pd.DataFrame(results)
	return results_df
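
The per-store hold-out used in the function above relies on negative slicing. The sketch below reproduces only that slicing step with plain Python lists standing in for the per-store DataFrame groups; `split_last_weeks` and the toy stores are hypothetical names and data, and each store's rows are assumed to be in chronological order.

```python
def split_last_weeks(rows_by_store, n_test_weeks=8):
    """Put the last n_test_weeks rows of every store into the test set."""
    train, test = [], []
    for store_id, rows in rows_by_store.items():
        train.extend(rows[:-n_test_weeks])   # everything before the last weeks
        test.extend(rows[-n_test_weeks:])    # the last weeks of this store
    return train, test


# Toy data: (store_id, week_number) tuples, invented for illustration
rows_by_store = {
    1: [(1, week) for week in range(1, 21)],  # 20 weeks of history
    2: [(2, week) for week in range(1, 13)],  # 12 weeks of history
}

train, test = split_last_weeks(rows_by_store)
print(len(train), len(test))
print([week for store, week in test if store == 1])
```

Note that a store with 8 or fewer rows would contribute all of its rows to the test set and none to the training set, because `rows[:-8]` is then empty.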

### Creates x train/test splits in which each test split contains an 8-week window per store (the last 8 weeks for the first split), with the windows spaced evenly using a gap

In [4]:
# Creates x train/test splits in which each test split contains an 8-week window per store
# (the last 8 weeks for the first split), with the windows spaced evenly using a gap

def testModelsCV8W(df, scaler):

    n_splits = 5
    window_size = 8
    total_weeks = 109
    train_size = window_size / 0.2
    gap = int((total_weeks - window_size - train_size) // (n_splits))

    results = []

    for split in range(n_splits):
        train_data = []
        test_data = []

        for store_id, group in df.groupby('Store'):
            # calculate start and end index for test data
            if split == 0:
                test_start_index = -window_size
                test_df_store = group[test_start_index:]  # No end index for the first split
            else:
                test_start_index = -(window_size + gap * split)
                test_end_index = test_start_index + window_size
                test_df_store = group[test_start_index:test_end_index]
                print("test:", test_df_store.shape, "Test Start Index:", test_start_index, "Test End Index:", test_end_index)
            
            train_start_index = -int(-test_start_index + gap + train_size)
            train_df_store = group[train_start_index:test_start_index]
            print("Train:", train_df_store.shape, "Train Start Index:", train_start_index, "Train End Index:", test_start_index)
            # Check if test set contains data
            if not test_df_store.empty:
                train_data.append(train_df_store)
                test_data.append(test_df_store)
            else:
                print(f"Store {store_id} has not enough data for splitting {split}")

        # Combine the list entries to one dataframe
        train_df_combined = pd.concat(train_data)
        test_df_combined = pd.concat(test_data)

        # Create feature and target data frames
        X_train = train_df_combined.drop(columns=['Future_Sales'])
        y_train = train_df_combined['Future_Sales']
        X_test = test_df_combined.drop(columns=['Future_Sales'])
        y_test = test_df_combined['Future_Sales']

        # Scaling of the data
        if scaler:
            X_train = scaler.fit_transform(X_train)
            X_test = scaler.transform(X_test)

        def adj_r2_score(model, X, y):
            n = X.shape[0]
            p = X.shape[1]
            r2 = r2_score(y, model.predict(X))
            return 1 - (1 - r2) * ((n - 1) / (n - p - 1))

        # Defining the models to test
        models = [
            ('LinearRegression', LinearRegression(n_jobs=-1)),
            #('RidgeRegression', Ridge(random_state=42)),
            #('LassoRegression', Lasso(random_state=42)),
            #('DecisionTreeRegressor', DecisionTreeRegressor(random_state=42)),
            #('RandomForestRegressor', RandomForestRegressor(n_jobs=-1, max_depth=10, random_state=42, n_estimators=100)),
            #('SVR', SVR()),
            #('KNN', KNeighborsRegressor())
        ]
        
        # Train models and calculate metrics
        for name, model in models:
            model.fit(X_train, y_train)
            y_train_pred = model.predict(X_train)
            y_test_pred = model.predict(X_test)
        
            results.append({
                'Model': name,
                'RMSE_Train': sqrt(mse(y_train, y_train_pred)),
                'MAE_Train': mae(y_train, y_train_pred),
                'R2_Train': r2_score(y_train, y_train_pred),
                'Adj_R2_Train': adj_r2_score(model, X_train, y_train),
                'RMSE_Test': sqrt(mse(y_test, y_test_pred)),
                'MAE_Test': mae(y_test, y_test_pred),
                'R2_Test': r2_score(y_test, y_test_pred),
                'Adj_R2_Test': adj_r2_score(model, X_test, y_test)
            })
            #print last result
            print(results[-1])

    results_df = pd.DataFrame(results)

    # Calculate the mean of every metric across all splits, per model
    results_mean_df = (
        results_df.groupby('Model', sort=False)
        .mean(numeric_only=True)
        .reset_index()
    )

    return results_df, results_mean_df
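
The index arithmetic of the splitting function above is easier to follow with the numbers written out. The sketch below recomputes only the slice boundaries, using the same constants as the function (109 weeks, 8-week window, 5 splits); `cv_windows` and `test_share` are hypothetical names, and no data is involved.

```python
def cv_windows(total_weeks=109, window_size=8, n_splits=5, test_share=0.2):
    """Return the gap and the (start, end) slice indices of every split.

    Indices are negative offsets from the end of a store's history,
    as used for positional slicing in the function above.
    """
    train_size = round(window_size / test_share)          # 40 weeks
    gap = (total_weeks - window_size - train_size) // n_splits

    windows = []
    for split in range(n_splits):
        test_start = -(window_size + gap * split)
        # The first split runs to the end of the data, so it has no end index
        test_end = None if split == 0 else test_start + window_size
        train_start = -(-test_start + gap + train_size)
        windows.append({
            "split": split,
            "train": (train_start, test_start),
            "test": (test_start, test_end),
        })
    return gap, windows


gap, windows = cv_windows()
print("gap:", gap)
for window in windows:
    print(window)
```

With these constants the gap is 12 weeks, each training slice spans `gap + train_size` = 52 rows directly before its test window, and the earliest training slice starts at offset -108, which still fits into 109 weeks of history.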


## Performance Reference

In [2]:
import pandas as pd
import numpy as np
from datetime import datetime

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from pandas.api.types import infer_dtype

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

from sklearn.metrics import mean_absolute_error as mae, mean_squared_error as mse, r2_score
from math import sqrt

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler 
from sklearn.preprocessing import RobustScaler

pd.set_option('display.max_columns', None)

In [8]:
df = pd.read_csv('weekly_sales_with_store_info.csv')
df['Date'] = pd.to_datetime(df['Date'])

In [10]:
## Test models with test and train data. Test includes the last 8 weeks from each store

train_data = []
test_data = []

# Group by store and split into training and test data
amount_test_weeks = 8
for store_id, group in df.groupby('Store'):
    train_data.append(group[: -amount_test_weeks])
    test_data.append(group[-amount_test_weeks:])

# Combine the list entries to one dataframe
train_df = pd.concat(train_data)
test_df = pd.concat(test_data)

# Create feature and target data frames
X_train = train_df
y_train = train_df['Sales']
X_test = test_df
y_test = test_df['Sales']

def adj_r2_score(model, X, y):
    n = X.shape[0]
    p = X.shape[1]
    r2 = r2_score(y, model.predict(X))
    return 1 - (1 - r2) * ((n - 1) / (n - p - 1))

results = []

# Calculate the mean sales and use it as a constant prediction
# df for means of last x weeks
timeframeForMean = 12
last_day_in_train = X_train['Date'].max()
df_X_train_for_means = X_train[X_train['Date'] > last_day_in_train - pd.Timedelta(weeks=timeframeForMean)]
mean_sales_train = df_X_train_for_means.mean(numeric_only=True)['Sales']

y_train_pred = np.full(y_train.shape, mean_sales_train)
y_test_pred = np.full(y_test.shape, mean_sales_train)

results.append({
    'Model': "Mean reference",
    'RMSE_Train': sqrt(mse(y_train, y_train_pred)),
    'MAE_Train': mae(y_train, y_train_pred),
    'R2_Train': r2_score(y_train, y_train_pred),
    #'Adj_R2_Train': adj_r2_score(model, X_train, y_train),
    'RMSE_Test': sqrt(mse(y_test, y_test_pred)),
    'MAE_Test': mae(y_test, y_test_pred),
    'R2_Test': r2_score(y_test, y_test_pred),
    #'Adj_R2_Test': adj_r2_score(model, X_test, y_test)
})
#print last result
print(results[-1])

# Convert the list of dictionaries into a DataFrame
results_df = pd.DataFrame(results)

# Display the results
results_df

{'Model': 'Mean reference', 'RMSE_Train': 18049.073796005105, 'MAE_Train': 13313.040861216872, 'R2_Train': -0.018051765838443812, 'RMSE_Test': 16117.825576344818, 'MAE_Test': 11780.306415371311, 'R2_Test': -3.2271684458073935e-07}


| Model | RMSE_Train | MAE_Train | R2_Train | RMSE_Test | MAE_Test | R2_Test |
| --- | --- | --- | --- | --- | --- | --- |
| Mean reference | 18049.073796 | 13313.040861 | -0.018052 | 16117.825576 | 11780.306415 | -3.227168e-07 |


**Result:**  
-> The test MAE of the simple mean forecast is 11780.306415. This is the reference value for the models tested afterwards.
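
The R2 values of the mean reference (about -0.018 on train and practically 0 on test) follow from the definition of R2: a constant prediction scores exactly 0 when it equals the mean of the evaluated targets and below 0 otherwise. A minimal sketch in plain Python, with a hypothetical helper `r2` and invented numbers:

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy targets with mean 25.0, invented for illustration
y = [10.0, 20.0, 30.0, 40.0]

r2_at_mean = r2(y, [25.0] * len(y))   # constant equal to the target mean
r2_off_mean = r2(y, [20.0] * len(y))  # constant that misses the target mean

print(r2_at_mean, r2_off_mean)
```

The near-zero R2_Test therefore suggests that the mean of the last 12 training weeks happens to lie very close to the mean of the test targets, while it differs somewhat from the mean over the whole training period.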

In [12]:
df = pd.read_csv('df_nans_handeled_cat_power.csv')

In [13]:
testModelsTestSplit8W(df, MinMaxScaler())

{'Model': 'LinearRegression', 'RMSE_Train': 9081.178921413037, 'MAE_Train': 6229.184072093825, 'R2_Train': 0.7515692210307627, 'Adj_R2_Train': 0.7511593471555716, 'RMSE_Test': 5941.163691018397, 'MAE_Test': 4395.438901345292, 'R2_Test': 0.8641279045091326, 'Adj_R2_Test': 0.8611545348667453}


| Model | RMSE_Train | MAE_Train | R2_Train | Adj_R2_Train | RMSE_Test | MAE_Test | R2_Test | Adj_R2_Test |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LinearRegression | 9081.178921 | 6229.184072 | 0.751569 | 0.751159 | 5941.163691 | 4395.438901 | 0.864128 | 0.861155 |
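
To put the linear regression result next to the mean reference, the relative reduction in test MAE can be computed from the two printed values. This is only a rough comparison: the reference was evaluated on `Sales` from `weekly_sales_with_store_info.csv`, while the model was evaluated on `Future_Sales` from `df_nans_handeled_cat_power.csv`, so the two targets are assumed here to be comparable.

```python
# Test MAE values copied from the outputs above
mae_mean_reference = 11780.306415
mae_linear_regression = 4395.438901

# Share of the reference error that the model removes
relative_improvement = (mae_mean_reference - mae_linear_regression) / mae_mean_reference

print(f"Relative MAE reduction: {relative_improvement:.1%}")
```

Under that assumption, the linear regression reduces the test MAE by roughly 63% compared with the simple mean forecast.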
