# Capstone Project Rossmann

# Predictive Modeling

## Selection of models that will be tested to predict the sales of Rossmann stores

- Linear Regression Models:
	- Linear Regression: If the relationship between the predictors and the sales is linear, linear regression can be a good starting point.
	- Ridge/Lasso Regression: Variations of linear regression that include regularization to prevent overfitting, especially useful when there are many predictors.

- Tree-based Models:
	- Decision Trees: Good for capturing non-linear relationships, but they can overfit.
	- Random Forest: An ensemble of decision trees that is more robust and less likely to overfit than a single decision tree.
	- Gradient Boosting Machines (GBM): Models such as Gradient Boosting Regressor, XGBoost, LightGBM, and CatBoost are powerful for capturing complex patterns in data.

- Support Vector Machines (SVM):
	- SVR (Support Vector Regression): Effective in high-dimensional spaces; with kernel functions it can capture complex relationships.

- Nearest Neighbors:
	- K-Nearest Neighbors (KNN): Can be used for regression; it predicts the value based on the 'k' closest points.

- Time Series Specific Models (not in scikit-learn but worth considering):
	- ARIMA/SARIMA: Traditional time series models suitable for univariate time series.
	- Prophet: Developed by Facebook, good for daily data with multiple seasonality and holiday effects.
	- LSTM/GRU (Deep Learning): RNNs such as LSTM or GRU can be effective, especially with a large amount of historical data.

Overview of the candidate models:

- Linear Regression Models:  
	- Linear Regression
	- Ridge Regression
	- Lasso Regression

- Tree-based Models:
	- Decision Trees
	- Random Forest
	- Gradient Boosting Machines (GBM) (in scikit-learn: GradientBoostingRegressor)

- Support Vector Machines (SVM):
	- SVR (Support Vector Regression)

- Nearest Neighbors:
	- K-Nearest Neighbors (KNN)

- Neural Networks:
	- MLP (Multi-layer Perceptron)

- Time Series Specific Models:
	- ARIMA/SARIMA
	- Prophet
	- LSTM/GRU (Deep Learning)


I want to use:
- LinearRegression
- RidgeRegression
- LassoRegression
- DecisionTreeRegressor
- RandomForestRegressor
- GradientBoostingRegressor
- SVR
- KNN
- MLPRegressor
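
The labels in the list above can be mapped to scikit-learn estimators roughly as follows. This is a minimal sketch, not the final setup: the dictionary name `candidate_models`, the use of default hyperparameters, and the `random_state=0` values are assumptions made for illustration.

```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

# Labels follow the list above; hyperparameters are library defaults (assumption).
# random_state is fixed only so that repeated runs give the same models.
candidate_models = {
    'LinearRegression': LinearRegression(),
    'RidgeRegression': Ridge(),
    'LassoRegression': Lasso(),
    'DecisionTreeRegressor': DecisionTreeRegressor(random_state=0),
    'RandomForestRegressor': RandomForestRegressor(random_state=0),
    'GradientBoostingRegressor': GradientBoostingRegressor(random_state=0),
    'SVR': SVR(),
    'KNN': KNeighborsRegressor(),
    'MLPRegressor': MLPRegressor(random_state=0),
}

# Show which scikit-learn class stands behind each label
for label, estimator in candidate_models.items():
    print(f"{label:26s} -> {type(estimator).__name__}")
```

Note that "RidgeRegression", "LassoRegression" and "KNN" are only labels; the corresponding scikit-learn classes are `Ridge`, `Lasso` and `KNeighborsRegressor`.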


## Definition of KPIs for model evaluation

- Mean Absolute Error (MAE): The average of the absolute differences between predictions and actual values. It gives an idea of the magnitude of the error, but no information about its direction (over- or under-prediction).
- Mean Squared Error (MSE): The average of the squared differences between predictions and actual values. Because the errors are squared, it gives more weight to larger errors and is therefore more sensitive to outliers than MAE.
- Root Mean Squared Error (RMSE): The square root of the MSE. It is more interpretable than the MSE because it is in the same units as the response variable.
- R-squared (R2): The proportion of the variance in the dependent variable that is predictable from the independent variables. It provides an indication of the goodness of fit and therefore a measure of how well unseen samples are likely to be predicted by the model, through the proportion of explained variance.
- Adjusted R-squared: The R-squared value adjusted for the number of predictors in the model. It is useful for comparing models with different numbers of predictors.
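
The KPI definitions above can be written out directly with NumPy. The following is a small sketch on made-up numbers; the helper name `regression_kpis` and the toy values are assumptions for illustration, and the adjusted R2 uses the usual formula 1 - (1 - R2) * (n - 1) / (n - p - 1) with n samples and p predictors.

```python
import numpy as np

def regression_kpis(y_true, y_pred, n_predictors):
    """Compute MAE, MSE, RMSE, R2 and adjusted R2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))        # average absolute error
    mse = np.mean(errors ** 2)           # average squared error
    rmse = np.sqrt(mse)                  # back in the units of the target

    ss_res = np.sum(errors ** 2)                      # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2': r2, 'Adj_R2': adj_r2}

# Made-up sales figures, every prediction is off by exactly 10
y_true = [100.0, 200.0, 300.0, 400.0, 500.0]
y_pred = [110.0, 190.0, 310.0, 390.0, 510.0]

kpis = regression_kpis(y_true, y_pred, n_predictors=2)
for name, value in kpis.items():
    print(f"{name}: {value:.4f}")
```

For this toy data all errors are ±10, so MAE and RMSE are both 10 and MSE is 100; R2 is 0.995 and the adjusted R2 with two predictors is 0.99.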


In [6]:
import pandas as pd
import numpy as np

# Seed the random number generator for reproducibility
np.random.seed(0)

# Create a DataFrame with 1000 rows and 4 columns
n_rows = 1000
df = pd.DataFrame({
    'Feature1': np.random.rand(n_rows),  # Random values between 0 and 1
    'Feature2': np.random.randint(1, 10, size=n_rows),  # Random integers between 1 and 9
    'Feature3': np.random.randn(n_rows),  # Random values from a standard normal distribution
    'Target': np.random.randint(0, 2, size=n_rows)  # Binary target variable (0 or 1)
})

X = df[['Feature1', 'Feature2', 'Feature3']]
y = df['Target']

print(df.head())  # Show the first 5 rows of the DataFrame
print("Feature matrix (X):")
print(X.head())
print("\nTarget vector (y):")
print(y.head())

   Feature1  Feature2  Feature3  Target
0  0.548814         2 -1.029392       0
1  0.715189         7 -0.386121       1
2  0.602763         6  0.539313       0
3  0.544883         5 -0.415789       1
4  0.423655         5 -0.718476       0
Feature matrix (X):
   Feature1  Feature2  Feature3
0  0.548814         2 -1.029392
1  0.715189         7 -0.386121
2  0.602763         6  0.539313
3  0.544883         5 -0.415789
4  0.423655         5 -0.718476

Target vector (y):
0    0
1    1
2    0
3    1
4    0
Name: Target, dtype: int32


In [20]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error as mae, mean_squared_error as mse, r2_score
from math import sqrt

# Create an example DataFrame
np.random.seed(0)
n_rows = 1000
df = pd.DataFrame({
    'Feature1': np.random.rand(n_rows),
    'Feature2': np.random.randint(1, 10, size=n_rows),
    'Feature3': np.random.randn(n_rows),
    'Target': np.random.randint(0, 2, size=n_rows)
})

# Split into features (X) and target variable (y)
feature_columns = df.columns.difference(['Target'])
X = df[feature_columns]
y = df['Target']

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Function to compute the adjusted R2 (n = number of samples, p = number of predictors)
def adj_r2_score(model, X, y):
    n = X.shape[0]
    p = X.shape[1]
    r2 = r2_score(y, model.predict(X))
    return 1 - (1 - r2) * ((n - 1) / (n - p - 1))

# Define the models
models = [
    ('LinearRegression', LinearRegression()),
    ('RidgeRegression', Ridge()),
    ('LassoRegression', Lasso()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('RandomForestRegressor', RandomForestRegressor()),
    ('SVR', SVR()),
    ('KNN', KNeighborsRegressor())
]

# Collect the results as a list of dictionaries
results = []

# Train the models and evaluate the metrics
for name, model in models:
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    results.append({
        'Model': name,
        'RMSE_Train': sqrt(mse(y_train, y_train_pred)),
        'MAE_Train': mae(y_train, y_train_pred),
        'R2_Train': r2_score(y_train, y_train_pred),
        'Adj_R2_Train': adj_r2_score(model, X_train, y_train),
        'RMSE_Test': sqrt(mse(y_test, y_test_pred)),
        'MAE_Test': mae(y_test, y_test_pred),
        'R2_Test': r2_score(y_test, y_test_pred),
        'Adj_R2_Test': adj_r2_score(model, X_test, y_test)
    })

# Convert the list of dictionaries into a DataFrame
results_df = pd.DataFrame(results)

# Display the results
results_df


| | Model | RMSE_Train | MAE_Train | R2_Train | Adj_R2_Train | RMSE_Test | MAE_Test | R2_Test | Adj_R2_Test |
|---|---|---|---|---|---|---|---|---|---|
| 0 | LinearRegression | 0.499868 | 0.499735 | 0.000523 | -0.003243 | 0.500522 | 0.500388 | -0.00249 | -0.017834 |
| 1 | RidgeRegression | 0.499868 | 0.499735 | 0.000523 | -0.003243 | 0.500527 | 0.500393 | -0.002508 | -0.017853 |
| 2 | LassoRegression | 0.499998 | 0.499997 | 0.0 | -0.003769 | 0.500027 | 0.500025 | -0.000506 | -0.01582 |
| 3 | DecisionTreeRegressor | 0.0 | 0.0 | 1.0 | 1.0 | 0.655744 | 0.43 | -0.720688 | -0.747025 |
| 4 | RandomForestRegressor | 0.19998 | 0.185588 | 0.840032 | 0.839429 | 0.514354 | 0.4777 | -0.058663 | -0.074867 |
| 5 | SVR | 0.55232 | 0.481259 | -0.220236 | -0.224835 | 0.580585 | 0.513486 | -0.348854 | -0.369499 |
| 6 | KNN | 0.447661 | 0.3975 | 0.198395 | 0.195374 | 0.548635 | 0.497 | -0.204482 | -0.222918 |


## Performance Reference

In [None]:
# Uses a single train/test split in which the test set contains the last 8 weeks of each store

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error as mae, mean_squared_error as mse, r2_score
from math import sqrt

# Split into training and test data
# Lists that hold the per-store training and test data
train_data = []
test_data = []

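# Group by store and split into training and test data
# Note: this takes the last `amount_test_weeks` rows of each store, which equals
# 8 weeks only if df holds one row per store and week, sorted by Date
amount_test_weeks = 8
for store_id, group in df.groupby('Store'):
    train_data.append(group[:-amount_test_weeks])
    test_data.append(group[-amount_test_weeks:])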

# Combine the training and test data
train_df = pd.concat(train_data)
test_df = pd.concat(test_data)

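# X still contains 'Sales' and 'Date'; that is fine for the mean baseline below,
# but the target column must be dropped before fitting an actual model
X_train = train_df
y_train = train_df['Sales']
X_test = test_df
y_test = test_df['Sales']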


# Function to compute the adjusted R2 (n = number of samples, p = number of predictors)
def adj_r2_score(model, X, y):
    n = X.shape[0]
    p = X.shape[1]
    r2 = r2_score(y, model.predict(X))
    return 1 - (1 - r2) * ((n - 1) / (n - p - 1))


# Collect the results as a list of dictionaries
results = []

# Calculate the mean sales and use it as a constant prediction
# The mean is taken over the last `timeframeForMean` weeks of the training data
timeframeForMean = 12
last_day_in_train = X_train['Date'].max()
df_X_train_for_means = X_train[X_train['Date'] > last_day_in_train - pd.Timedelta(weeks=timeframeForMean)]
mean_sales_train = df_X_train_for_means.mean(numeric_only=True)['Sales']
#mean_sales_test = X_test['Sales'].mean()

y_train_pred = np.full(y_train.shape, mean_sales_train)
y_test_pred = np.full(y_test.shape, mean_sales_train)

results.append({
    'Model': "Mean reference",
    'RMSE_Train': sqrt(mse(y_train, y_train_pred)),
    'MAE_Train': mae(y_train, y_train_pred),
    'R2_Train': r2_score(y_train, y_train_pred),
    #'Adj_R2_Train': adj_r2_score(model, X_train, y_train),
    'RMSE_Test': sqrt(mse(y_test, y_test_pred)),
    'MAE_Test': mae(y_test, y_test_pred),
    'R2_Test': r2_score(y_test, y_test_pred),
    #'Adj_R2_Test': adj_r2_score(model, X_test, y_test)
})
# Print the last result
print(results[-1])

# Convert the list of dictionaries into a DataFrame
results_df = pd.DataFrame(results)

# Display the results
results_df
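
The split above slices a fixed number of rows per store. If the data has one row per store and day instead of one row per week, a split by calendar date may be closer to the stated goal of holding out the last 8 weeks. The following is a sketch under that assumption; the helper name `split_last_weeks_per_store` and the toy data are hypothetical, and only the column names `Store`, `Date` and `Sales` are taken from the code above.

```python
import pandas as pd

def split_last_weeks_per_store(df, n_test_weeks=8, store_col='Store', date_col='Date'):
    """Put each store's final n_test_weeks (by calendar date) into the test set."""
    df = df.sort_values([store_col, date_col])
    # Last observed date of each store, aligned row by row with df
    last_date = df.groupby(store_col)[date_col].transform('max')
    cutoff = last_date - pd.Timedelta(weeks=n_test_weeks)
    is_test = df[date_col] > cutoff
    return df[~is_test], df[is_test]

# Hypothetical toy data: two stores with 20 daily rows each
dates = pd.date_range('2015-01-01', periods=20, freq='D')
toy_df = pd.DataFrame({
    'Store': [1] * 20 + [2] * 20,
    'Date': list(dates) * 2,
    'Sales': range(40),
})

# Hold out the last week of each store
train_df, test_df = split_last_weeks_per_store(toy_df, n_test_weeks=1)
print(train_df.groupby('Store').size())
print(test_df.groupby('Store').size())
```

With 20 daily rows per store and a one-week hold-out, each store contributes 13 rows to the training set and 7 rows to the test set. Because the cutoff is computed per store, stores whose data ends on different dates are still split relative to their own last observation.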