# Random Forest Model

- A random forest is an ensemble learning method
- It consists of many decision trees
- Each tree is trained on a random sample of the data (typically a bootstrap sample) and a random subset of the features
- The predictions of the individual trees are aggregated by majority vote (for classification) or by their mean (for regression)
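
The sampling and aggregation steps described above can be sketched in plain Python. This is only an illustration, not how scikit-learn implements it: the helper names `bootstrap_sample`, `aggregate_classification`, and `aggregate_regression` are made up for this example, and the tree outputs are hypothetical values.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]


def aggregate_classification(tree_votes):
    """Majority vote over the class labels predicted by the individual trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_predictions):
    """Mean of the numeric predictions of the individual trees."""
    return mean(tree_predictions)


rng = random.Random(42)
rows = list(range(10))
sample = bootstrap_sample(rows, rng)  # same size as rows, duplicates allowed

# Hypothetical outputs of five trees for a single input
vote = aggregate_classification(["A", "B", "A", "A", "B"])
avg = aggregate_regression([250.0, 300.0, 275.0, 260.0, 290.0])
print(vote, avg)
```

Because every tree sees a slightly different sample, the individual errors partly cancel out when the predictions are combined.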

## Advantages

- More robust against overfitting than a single decision tree, because the predictions of many trees trained on different data samples are averaged
- Well suited to large data sets with many features
- Can model non-linear relationships well

## Disadvantages

- Higher complexity and computational cost than a single decision tree
- Harder to interpret than simple models

In [1]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import LabelEncoder
import numpy as np

In [3]:
# Load the dataset
file_path = '17072024_sales_data/Clean_Data.csv'
data = pd.read_csv(file_path)

In [5]:
# Convert date columns to datetime
data['Order_Date'] = pd.to_datetime(data['Order_Date'])
data['Ship_Date'] = pd.to_datetime(data['Ship_Date'])

In [6]:
# Extract useful date features
data['Order_Year'] = data['Order_Date'].dt.year
data['Order_Month'] = data['Order_Date'].dt.month
data['Order_Day'] = data['Order_Date'].dt.day
data['Ship_Year'] = data['Ship_Date'].dt.year
data['Ship_Month'] = data['Ship_Date'].dt.month
data['Ship_Day'] = data['Ship_Date'].dt.day

In [11]:
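# Display the first rows of the dataset.
# Note: going by the cell numbers, this cell was run after In [7] and In [10],
# so the output reflects the data after the date columns were dropped and the encoding cell was run.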
data.head()

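| Unnamed: 0 | Ship_Mode | Customer_ID | Segment | City | State | Postal_Code | Region | Product_ID | Category | Sub_Category | Product_Name | Sales | Order_Year | Order_Month | Order_Day | Ship_Year | Ship_Month | Ship_Day |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2 | 143 | 0 | 194 | 15 | 42420.0 | 2 | 12 | 0 | 4 | 384 | 261.96 | 2017 | 11 | 8 | 2017 | 11 | 11 |
| 1 | 2 | 143 | 0 | 194 | 15 | 42420.0 | 2 | 55 | 0 | 5 | 835 | 731.94 | 2017 | 11 | 8 | 2017 | 11 | 11 |
| 2 | 2 | 237 | 1 | 265 | 3 | 90036.0 | 3 | 945 | 1 | 10 | 1429 | 14.62 | 2017 | 6 | 12 | 2017 | 6 | 16 |
| 3 | 3 | 705 | 0 | 153 | 8 | 33311.0 | 2 | 319 | 0 | 16 | 364 | 957.5775 | 2016 | 10 | 11 | 2016 | 10 | 18 |
| 4 | 3 | 705 | 0 | 153 | 8 | 33311.0 | 2 | 1315 | 1 | 14 | 570 | 22.368 | 2016 | 10 | 11 | 2016 | 10 | 18 |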

In [7]:
# Drop original date columns
data = data.drop(columns=['Order_Date', 'Ship_Date'])

In [10]:
# Encode categorical variables
label_encoders = {}
for column in data.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    label_encoders[column] = le
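
To make the encoding step concrete: `LabelEncoder` maps each distinct label to an integer code, assigned in sorted order of the labels. The sketch below mirrors that behaviour in plain Python. The function `fit_label_codes` and the toy `segments` values are assumptions made for this illustration and are not taken from the dataset.

```python
def fit_label_codes(values):
    """Map each distinct label to an integer code, in sorted label order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


segments = ["Consumer", "Corporate", "Consumer", "Home Office"]  # toy values
codes = fit_label_codes(segments)
encoded = [codes[value] for value in segments]

# The stored mapping can be inverted, which is the reason for keeping the encoders
inverse = {code: label for label, code in codes.items()}
decoded = [inverse[code] for code in encoded]
print(codes, encoded)
```

The integer codes carry an arbitrary order (here alphabetical). Tree-based models can often cope with that, but it is worth keeping in mind when the same encoding is reused for other model types.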

In [12]:
# Prepare features and target variable
X = data.drop(columns=['Sales'])
y = data['Sales']

In [13]:
# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [14]:
# Initialize and train the Random Forest model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

In [15]:
# Predict on the test set
y_pred = model.predict(X_test)

In [16]:
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

mse, mae, r2

(334881.8198153792, 194.58894879425068, 0.31535733754885675)

# Interpretation of the results

## Results

- MSE: 334881.8198153792
- MAE: 194.58894879425068
- R²: 0.31535733754885675

With an R² of about 0.32, the model explains roughly a third of the variance in Sales on the test set.
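
Since the MSE is expressed in squared units, taking its square root (the RMSE) makes it easier to compare with the MAE. The following sketch uses only the two values reported above; the variable names are chosen for this example.

```python
import math

mse = 334881.8198153792   # test-set MSE reported above
mae = 194.58894879425068  # test-set MAE reported above

rmse = math.sqrt(mse)  # back in the units of Sales
print(f"RMSE: {rmse:.2f}, MAE: {mae:.2f}, RMSE/MAE: {rmse / mae:.2f}")
```

The RMSE is never smaller than the MAE. A gap as large as the one here suggests that a small number of large errors dominates the squared metric, although this would need to be checked against the residuals.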

# Mean Squared Error (MSE)

- Mean of the squared differences between the predicted values and the actual values
- A lower value means that the model is more accurate

## Advantages

- Penalizes large errors more heavily than small ones, because the errors are squared
- Because large errors dominate the score, minimizing the MSE favors models that avoid a few very large misses

## Disadvantages

- Sensitive to outliers, because large errors are squared
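
A small worked example shows the effect of squaring. Both toy predictions below have the same total absolute error (40), but concentrating it in one point quadruples the MSE. The function name `mean_squared_error_manual` and the numbers are illustrative assumptions.

```python
def mean_squared_error_manual(y_true, y_pred):
    """MSE = (1/n) * sum((y_i - yhat_i)^2)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [100.0, 200.0, 300.0, 400.0]
y_spread = [110.0, 190.0, 310.0, 390.0]   # four errors of size 10
y_outlier = [100.0, 200.0, 300.0, 440.0]  # one error of size 40

mse_spread = mean_squared_error_manual(y_true, y_spread)
mse_outlier = mean_squared_error_manual(y_true, y_outlier)
print(mse_spread, mse_outlier)
```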

# Mean Absolute Error (MAE)

- Mean of the absolute differences between the predicted values and the actual values
- A lower value means that the model is more accurate

## Advantages

- Less sensitive to outliers than the MSE, because the errors are not squared
- Easy to interpret: it is the average absolute deviation, in the same units as the target

## Disadvantages

- Weights every error in proportion to its size, so large errors receive no extra penalty
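
One way to see the different outlier sensitivity: among constant predictions, the MAE is minimized by the median and the MSE by the mean. The sketch below checks this on toy values with a single outlier; `mae_constant`, `mse_constant`, and the data are assumptions for illustration.

```python
from statistics import mean, median


def mae_constant(values, prediction):
    """MAE when the same value is predicted for every observation."""
    return sum(abs(v - prediction) for v in values) / len(values)


def mse_constant(values, prediction):
    """MSE when the same value is predicted for every observation."""
    return sum((v - prediction) ** 2 for v in values) / len(values)


sales = [10.0, 12.0, 11.0, 13.0, 100.0]  # toy values with one outlier
center_mean = mean(sales)      # pulled toward the outlier
center_median = median(sales)  # stays with the bulk of the data

print("MAE:", mae_constant(sales, center_median), mae_constant(sales, center_mean))
print("MSE:", mse_constant(sales, center_median), mse_constant(sales, center_mean))
```

The MAE prefers the median, which ignores how far away the outlier is, while the MSE prefers the mean, which the outlier drags upward.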

# R² score (coefficient of determination)

- The R² score indicates how well the predicted values explain the actual values
- A value closer to 1 means that the model fits better

## Advantages

- Indicates how much of the variance of the dependent variable is explained by the model
- At most 1 and usually between 0 and 1; it becomes negative when the model is worse than always predicting the mean

## Disadvantages

- The reading as "share of variance explained" is exact only for linear least-squares models, so it can be misleading for non-linear relationships
- With overfitting, the R² score on the training data can appear artificially high, so it should be judged on held-out data

# Improvement attempt

- Hyperparameter tuning using grid search with cross-validation
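
Cross-validation splits the training data into k folds, and each fold serves once as the validation set while the model is trained on the rest. The sketch below generates such index splits in plain Python. The helper `k_fold_indices` is hypothetical and only approximates what an unshuffled k-fold splitter does; the sizes (10 samples, 3 folds) are illustrative.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # The first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Grid search repeats this procedure for every parameter combination and keeps the combination with the best average validation score.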

In [17]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

In [18]:
# Define the parameter grid for GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
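
The cost of a grid search grows with the product of the number of values per parameter. The sketch below enumerates the combinations of the grid above with `itertools.product`; the value `cv_folds = 3` is taken from the `cv=3` argument used in this notebook. The resulting counts match the "108 candidates, totalling 324 fits" line in the search log.

```python
from itertools import product

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Every combination of one value per parameter is one candidate
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]
n_candidates = len(candidates)

cv_folds = 3  # matches cv=3 in the GridSearchCV call
total_fits = n_candidates * cv_folds
print(n_candidates, total_fits)
```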

In [19]:
# Initialize GridSearchCV: 3-fold cross-validation, scored by R², using all CPU cores
grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           cv=3, n_jobs=-1, verbose=2, scoring='r2')

In [20]:
# Fit the grid search to the data
grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 108 candidates, totalling 324 fits
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   1.9s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   1.9s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   1.9s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   3.7s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   3.7s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   3.7s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   1.8s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   1.8s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=300; total time=   5.5s
[CV] END max_depth=10, min_s

In [21]:
# Best parameters from GridSearchCV
best_params = grid_search.best_params_

In [22]:
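# Train the Random Forest model with the best parameters
# (with the default refit=True, grid_search.best_estimator_ already holds an equivalent model fitted on X_train)
best_model = RandomForestRegressor(**best_params, random_state=42)
best_model.fit(X_train, y_train)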

In [23]:
# Predict on the test set with the best model
y_pred_best = best_model.predict(X_test)

In [24]:
# Evaluate the best model
mse_best = mean_squared_error(y_test, y_pred_best)
mae_best = mean_absolute_error(y_test, y_pred_best)
r2_best = r2_score(y_test, y_pred_best)

best_params, mse_best, mae_best, r2_best

({'max_depth': 20,
  'min_samples_leaf': 4,
  'min_samples_split': 10,
  'n_estimators': 300},
 344656.6245043581,
 190.81066144330944,
 0.2953734270729407)

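# Results after tuning

- Best settings: 'max_depth': 20, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 300
- MSE: 344656.6245043581
- MAE: 190.81066144330944
- R²: 0.2953734270729407
- Compared with the untuned model, the MAE improved slightly (194.59 to 190.81), but the MSE rose (334881.82 to 344656.62) and R² fell (0.3154 to 0.2954), so the tuning did not improve the model overall on the test set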
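
The two sets of test metrics can be compared directly. The sketch below uses only the values reported in this notebook; the dictionary names are chosen for this example. For MSE and MAE a negative change is an improvement, for R² a positive one.

```python
baseline = {"mse": 334881.8198153792, "mae": 194.58894879425068, "r2": 0.31535733754885675}
tuned = {"mse": 344656.6245043581, "mae": 190.81066144330944, "r2": 0.2953734270729407}

# Absolute change of each metric after tuning
changes = {name: tuned[name] - baseline[name] for name in baseline}

for name, change in changes.items():
    relative = 100.0 * change / baseline[name]
    print(f"{name}: {baseline[name]:.4f} -> {tuned[name]:.4f} ({relative:+.2f}%)")
```

One possible reading is that the grid search selected parameters by cross-validated R² on the training data, which does not guarantee a better score on this particular test split.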
