# Vergleich und Bewertung von Regressionsmodellen

- MAE:

$$MAE=\frac{1}{n}\sum_{i=1}^n \vert y_i - \hat{y_i} \vert$$

$$\text{Residuum}=y_i-\hat y_i$$

- MSE:

$$MSE=\frac{1}{n} \sum_{i=1}^n(y_i-\hat{y_i})^2$$

- RMSE:

$$RMSE=\sqrt{\frac{1}{n} \sum_{i=1}^n(y_i-\hat{y_i})^2}$$

- R²:

**Gesamtvariablilität (TSS, SST, $SS_{tot}$)**

$$SS_{tot}=\sum_{i=1}^n(y_i-\overline{y})^2$$

**Erklärte Variabilität (RSS, SSR)**

$$SSR=\sum_{i=1}^n({\hat{y_i}-\overline{y}})^2$$

**Modellfehler (Residuen oder ESS $SS_{res}$)**

$$SS_{res}=\sum_{i=1}^n(y_i-\hat{y_i})^2$$

**Zusammenhang:**

$$SS_{tot}=SSR+SS_{res}$$

$$R^2=1- \frac{SS_{res}}{SS_{tot}}$$

## Funktion für Regressions-Metriken

In [1]:
from sklearn.model_selection import KFold, cross_val_score
import numpy as np
from sklearn.base import clone
from sklearn.metrics import r2_score, mean_squared_error

In [2]:
def evaluate_regression(
    model,
    X_train, y_train,
    X_test, y_test,
    *, # Ab hier nur noch Keyword-Argumente
    cv_splits: int = 5, # Anzahl der Fold
    cv_shuffle: bool = True, # Daten vor Aufteilung mischen
    cv_random_state: int = 42,
    scoring: str = "r2", # Angabe der Metrik, für die CV
    return_print: bool = True
):
    # 1. Cross-Validation:
    cv = KFold(n_splits=cv_splits, shuffle=cv_shuffle, random_state=cv_random_state) # Strategie der Aufteilung
    cv_scores = cross_val_score(model, X_train, y_train, scoring=scoring, cv=cv) # CV durchführen
    cv_mean = float(np.mean(cv_scores))
    cv_std = float(np.std(cv_scores, ddof=1))
    
    # 2. Modell fitten und Test-Vorhersage:
    fitted = clone(model).fit(X_train, y_train)
    y_pred = fitted.predict(X_test)
    
    # 3. Metriken:
    r2_test = float(r2_score(y_test, y_pred))
    rmse_test = float(np.sqrt(mean_squared_error(y_test, y_pred)))
    
    # 4. Korrigierte Bestimmtheitsmaß
    # 4.1 Anzahl der Beobachtungen im Testset:
    n_test = X_test.shape[0]
    # 4.2 Anzahl der Features (Prädikatoren):
    p = X_test.shape[1]
    # 4.3 Nenner der Formel:
    denom = n_test - p - 1
    # 4.4 Berechnung:
    if denom <= 0:
        adj_r2_test = np.nan
    else:
        adj_r2_test = 1 - (1-r2_test) * (n_test - 1) / denom
        adj_r2_test = float(adj_r2_test)
    
    # 5. Metriken zusammenfassen:
    result = {
        "R2_test": r2_test,
        "Adj_R2_test": adj_r2_test,
        "RMSE_test": rmse_test,
        "CV_mean": cv_mean,
        "CV_std": cv_std,
        "CV_metric": scoring,
        "n_test": n_test,
        "p_features": p,
        "cv_splits": cv_splits
    }
    
    if return_print:
        print(f"RMSE (Test): {result['RMSE_test']:.4f}")
        print(f"R² (Test): {result['R2_test']:.4f}")
        print(f"Adjusted R² (Test): {result['Adj_R2_test']:.4f}" if np.isfinite(result['Adj_R2_test']) else "Adjusted R² (Test): n/a" )
        print(f"CV-{scoring.upper()} (Train) Mittelwert: {result['CV_mean']:.4f} | Std: {result['CV_std']:.4f} | Folds: {cv_splits}")
    
    return result

$$R^2_{adj}=1-(1-R^2) \cdot \frac{n-1}{n-p-1}$$

In [3]:
import pandas as pd

df_avocado = pd.read_csv("data/avocado.csv")
df_avocado

Unnamed: 0.1,Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2,2015-12-13,0.93,118220.22,794.70,109149.67,130.50,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,3,2015-12-06,1.08,78992.15,1132.00,71976.41,72.58,5811.16,5677.40,133.76,0.0,conventional,2015,Albany
4,4,2015-11-29,1.28,51039.60,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18244,7,2018-02-04,1.63,17074.83,2046.96,1529.20,0.00,13498.67,13066.82,431.85,0.0,organic,2018,WestTexNewMexico
18245,8,2018-01-28,1.71,13888.04,1191.70,3431.50,0.00,9264.84,8940.04,324.80,0.0,organic,2018,WestTexNewMexico
18246,9,2018-01-21,1.87,13766.76,1191.92,2452.79,727.94,9394.11,9351.80,42.31,0.0,organic,2018,WestTexNewMexico
18247,10,2018-01-14,1.93,16205.22,1527.63,2981.04,727.01,10969.54,10919.54,50.00,0.0,organic,2018,WestTexNewMexico


In [4]:
df_avocado.drop("Unnamed: 0", axis=1, inplace=True)
df_avocado

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2015-12-13,0.93,118220.22,794.70,109149.67,130.50,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,2015-12-06,1.08,78992.15,1132.00,71976.41,72.58,5811.16,5677.40,133.76,0.0,conventional,2015,Albany
4,2015-11-29,1.28,51039.60,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany
...,...,...,...,...,...,...,...,...,...,...,...,...,...
18244,2018-02-04,1.63,17074.83,2046.96,1529.20,0.00,13498.67,13066.82,431.85,0.0,organic,2018,WestTexNewMexico
18245,2018-01-28,1.71,13888.04,1191.70,3431.50,0.00,9264.84,8940.04,324.80,0.0,organic,2018,WestTexNewMexico
18246,2018-01-21,1.87,13766.76,1191.92,2452.79,727.94,9394.11,9351.80,42.31,0.0,organic,2018,WestTexNewMexico
18247,2018-01-14,1.93,16205.22,1527.63,2981.04,727.01,10969.54,10919.54,50.00,0.0,organic,2018,WestTexNewMexico


## Datenvorbereitung

### Datentransformation schief verteilter Variablen

- Skewness beschreibt wie symmetrisch unsere Daten sind:
- `0`: Symmetrische Verteilung
- `> 0`: Rechts-schief
- `< 0`: Links-schief

Faustregel:
- -0,5 < skewness < 0,5 --> kein Problem
- -1 < skewness < 1 --> noch okay
- 1 < skewness < -1 --> wir müssen eingreifen!

In [5]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

def show_dist(
    dataset: pd.DataFrame,
    columns_list,
    rows: int,
    cols: int,
    title: str = "Distributions"
):
    
    # Subplots erstellen:
    
    fig = make_subplots(
        rows=rows,
        cols=cols,
        subplot_titles=[str(c) for c in columns_list]
    )
    
    for i, col in enumerate(columns_list):
        r = i // cols + 1 # Berechnet die Zeilennummer des Subplots 
        c = i % cols + 1 # Berechnet die Spaltennummer des Subplots 
        
        # Nur numersiche Werte verwenden:
        df_cleaned = pd.to_numeric(dataset[col], errors="coerce").dropna()
        
        # Histogramm zeichnen:
        fig.add_trace(go.Histogram(
            x=df_cleaned,
            nbinsx=min(50, max(10, int(np.sqrt(len(df_cleaned))))),
            showlegend=False
        ),
            row=r,
            col=c
        )
        
        # Skewness berechnen:
        skew = float(df_cleaned.skew())
        fig.layout.annotations[i].text = f"{col} (Skewness: {skew:.2f})"
        
    fig.update_layout(
        title=title,
        template="plotly_white",
        bargap=0.05,
        
        height=max(400, 280 * rows),
        width=max(500, 320 * cols)
    )
    
    for ax in fig["layout"]:
        fig.update_yaxes(title_text="Häufigkeit")
    
    return fig

In [6]:
numeric_columns = ["AveragePrice", "Total Volume", "4046", "4225", "4770", "Total Bags", "Small Bags", "Large Bags", "XLarge Bags"]

show_dist(df_avocado, numeric_columns, rows=3, cols=3)

**Log-Tranformation:**

$$y=ln(1+x)$$

| Original x | y    |
| ---------- | ---- |
| 0          | 0    |
| 9          | 2,30 |
| 99         | 4,61 |
| 9999       | 9,21 |



In [7]:
def log_transform_and_skewness(df, numeric_columns, show_skewness=False):
    # Kopie:
    df_transformed = df.copy()
    
    # Transformation anwenden:
    df_transformed[numeric_columns] = np.log1p(df_transformed[numeric_columns])
    
    # Skewness vorher und nachher:
    before_skew = df[numeric_columns].skew()
    after_skew = df_transformed[numeric_columns].skew()
    
    # Skewness anzeigen:
    if show_skewness:
        skew_df = pd.DataFrame({
            "Before": before_skew,
            "After": after_skew
        })
        print(skew_df)
    
    return df_transformed

In [8]:
df_avocado_log_transformed = log_transform_and_skewness(df=df_avocado, numeric_columns=numeric_columns, show_skewness=True)

                 Before     After
AveragePrice   0.580303  0.138629
Total Volume   9.007687  0.088098
4046           8.648220 -0.328195
4225           8.942466 -0.486654
4770          10.159396  0.099986
Total Bags     9.756072 -0.218874
Small Bags     9.540660 -0.622148
Large Bags     9.796455 -0.547765
XLarge Bags   13.139751  1.176494


In [9]:
df_avocado_log_transformed

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-12-27,0.845868,11.070344,6.944801,10.905146,3.895080,9.070833,9.060055,4.545951,0.0,conventional,2015,Albany
1,2015-12-20,0.854415,10.912867,6.515127,10.706381,4.083115,9.159737,9.149429,4.589955,0.0,conventional,2015,Albany
2,2015-12-13,0.657520,11.680313,6.679222,11.600485,4.879007,9.005325,8.992584,4.645736,0.0,conventional,2015,Albany
3,2015-12-06,0.732368,11.277116,7.032624,11.184108,4.298373,8.667708,8.644425,4.903495,0.0,conventional,2015,Albany
4,2015-11-29,0.824175,10.840377,6.848515,10.688288,4.340944,8.729874,8.697389,5.291746,0.0,conventional,2015,Albany
...,...,...,...,...,...,...,...,...,...,...,...,...,...
18244,2018-02-04,0.966984,9.745419,7.624599,7.333154,0.000000,9.510421,9.477908,6.070391,0.0,organic,2018,WestTexNewMexico
18245,2018-01-28,0.996949,9.538855,7.083975,8.141044,0.000000,9.134090,9.098407,5.786284,0.0,organic,2018,WestTexNewMexico
18246,2018-01-21,1.054312,9.530085,7.084159,7.805389,6.591591,9.147945,9.143431,3.768384,0.0,organic,2018,WestTexNewMexico
18247,2018-01-14,1.075002,9.693150,7.332127,8.000363,6.590315,9.302969,9.298401,3.931826,0.0,organic,2018,WestTexNewMexico


In [10]:
show_dist(df_avocado_log_transformed, numeric_columns, rows=3, cols=3)

In [11]:
df_avocado["Date"]

0        2015-12-27
1        2015-12-20
2        2015-12-13
3        2015-12-06
4        2015-11-29
            ...    
18244    2018-02-04
18245    2018-01-28
18246    2018-01-21
18247    2018-01-14
18248    2018-01-07
Name: Date, Length: 18249, dtype: object

In [12]:
df_avocado_log_transformed["Date"] = pd.to_datetime(df_avocado_log_transformed["Date"], format="%Y-%m-%d", errors="raise")
df_avocado_log_transformed.dtypes

Date            datetime64[ns]
AveragePrice           float64
Total Volume           float64
4046                   float64
4225                   float64
4770                   float64
Total Bags             float64
Small Bags             float64
Large Bags             float64
XLarge Bags            float64
type                    object
year                     int64
region                  object
dtype: object

In [13]:
df_avocado_log_transformed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18249 entries, 0 to 18248
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Date          18249 non-null  datetime64[ns]
 1   AveragePrice  18249 non-null  float64       
 2   Total Volume  18249 non-null  float64       
 3   4046          18249 non-null  float64       
 4   4225          18249 non-null  float64       
 5   4770          18249 non-null  float64       
 6   Total Bags    18249 non-null  float64       
 7   Small Bags    18249 non-null  float64       
 8   Large Bags    18249 non-null  float64       
 9   XLarge Bags   18249 non-null  float64       
 10  type          18249 non-null  object        
 11  year          18249 non-null  int64         
 12  region        18249 non-null  object        
dtypes: datetime64[ns](1), float64(9), int64(1), object(2)
memory usage: 1.8+ MB


In [14]:
df_avocado_log_transformed["type"] = df_avocado_log_transformed["type"].astype("category")
df_avocado_log_transformed["region"] = df_avocado_log_transformed["region"].astype("category")

df_avocado_log_transformed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18249 entries, 0 to 18248
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Date          18249 non-null  datetime64[ns]
 1   AveragePrice  18249 non-null  float64       
 2   Total Volume  18249 non-null  float64       
 3   4046          18249 non-null  float64       
 4   4225          18249 non-null  float64       
 5   4770          18249 non-null  float64       
 6   Total Bags    18249 non-null  float64       
 7   Small Bags    18249 non-null  float64       
 8   Large Bags    18249 non-null  float64       
 9   XLarge Bags   18249 non-null  float64       
 10  type          18249 non-null  category      
 11  year          18249 non-null  int64         
 12  region        18249 non-null  category      
dtypes: category(2), datetime64[ns](1), float64(9), int64(1)
memory usage: 1.6 MB


In [15]:
df_avocado_log_transformed["month"] = df_avocado_log_transformed["Date"].dt.month.astype("Int16").astype("category")
# df_avocado_log_transformed["weekday"] = df_avocado_log_transformed["Date"].dt.dayofweek.astype("Int16").astype("category")
df_avocado_log_transformed

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region,month
0,2015-12-27,0.845868,11.070344,6.944801,10.905146,3.895080,9.070833,9.060055,4.545951,0.0,conventional,2015,Albany,12
1,2015-12-20,0.854415,10.912867,6.515127,10.706381,4.083115,9.159737,9.149429,4.589955,0.0,conventional,2015,Albany,12
2,2015-12-13,0.657520,11.680313,6.679222,11.600485,4.879007,9.005325,8.992584,4.645736,0.0,conventional,2015,Albany,12
3,2015-12-06,0.732368,11.277116,7.032624,11.184108,4.298373,8.667708,8.644425,4.903495,0.0,conventional,2015,Albany,12
4,2015-11-29,0.824175,10.840377,6.848515,10.688288,4.340944,8.729874,8.697389,5.291746,0.0,conventional,2015,Albany,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18244,2018-02-04,0.966984,9.745419,7.624599,7.333154,0.000000,9.510421,9.477908,6.070391,0.0,organic,2018,WestTexNewMexico,2
18245,2018-01-28,0.996949,9.538855,7.083975,8.141044,0.000000,9.134090,9.098407,5.786284,0.0,organic,2018,WestTexNewMexico,1
18246,2018-01-21,1.054312,9.530085,7.084159,7.805389,6.591591,9.147945,9.143431,3.768384,0.0,organic,2018,WestTexNewMexico,1
18247,2018-01-14,1.075002,9.693150,7.332127,8.000363,6.590315,9.302969,9.298401,3.931826,0.0,organic,2018,WestTexNewMexico,1


In [16]:
from sklearn.model_selection import train_test_split

X = df_avocado_log_transformed.drop("AveragePrice", axis=1)
y = df_avocado_log_transformed["AveragePrice"]

X_train, X_test, y_train, y_test  = train_test_split(X, y, test_size=0.2, random_state=42)

**IQR**

- Untere Grenze $Q_1 - 1.5 \cdot IQR$
- Obere Grenze $Q_3 + 1.5 \cdot IQR$

In [17]:
from collections import Counter

def iqr_outlier_indices(df, features, whisker=1.5 , min_cols=1):
    all_idx = []
    for col in features:
        s = pd.to_numeric(df[col], errors="coerce")
        q1, q3 = s.quantile([0.25, 0.75])
        iqr = q3 - q1
        lower ,upper = q1 - whisker * iqr, q3 + whisker * iqr
        filter = (s < lower) | (s > upper)
        idx = s[filter].index.to_list()
        all_idx.extend(idx)
    counts = Counter(all_idx)
    return [i for i, k in counts.items() if k >= min_cols]

In [18]:
numeric_columns = [
    "Total Volume", "4046", "4225", "4770",
    "Total Bags", "Small Bags", "Large Bags", "XLarge Bags"
]

out_idx = iqr_outlier_indices(df=df_avocado_log_transformed, features=numeric_columns, min_cols=1)
print(f"{len(out_idx)} Zeilen mit Außreisern erkannt")

479 Zeilen mit Außreisern erkannt


**Winsorizing**
- Alle werte unterhalb der unteren Grenze auf die untere Grenze setzen.
- Alle werte oberhalb der oberen Grenze auf die obere Grenze setzen.
- 

In [19]:
def iqr_clip_cols(df, columns, whisker=1.5, bounds=None, return_bounds=False):
    df_out = df.copy()
    if bounds is None:
        num = df_out[columns].apply(pd.to_numeric, errors="coerce")
        q = num.quantile([0.25, 0.75])
        iqr = q.loc[0.75] - q.loc[0.25]
        lower = q.loc[0.25] - whisker * iqr
        upper = q.loc[0.75] + whisker * iqr
        bounds = pd.DataFrame({
            "lower": lower,
            "upper": upper
        })
        
    df_out[columns] = df_out[columns].clip(lower=bounds["lower"], upper=bounds["upper"], axis=1)
    
    if return_bounds:
        return df_out, bounds
    return df_out

In [20]:
X_train_capped, bounds_train =  iqr_clip_cols(X_train, numeric_columns, return_bounds=True)
X_test_capped, bounds_test =  iqr_clip_cols(X_test, numeric_columns, return_bounds=True)
bounds_train

Unnamed: 0,lower,upper
Total Volume,3.757399,18.520135
4046,-0.551189,18.923846
4225,2.129847,17.799276
4770,-13.106223,21.843706
Total Bags,3.920413,16.243185
Small Bags,2.91615,16.386385
Large Bags,-2.963719,17.807235
XLarge Bags,-7.357069,12.261782


In [21]:
out_idx = iqr_outlier_indices(df=X_train_capped, features=numeric_columns, min_cols=1)
print(f"{len(out_idx)} Zeilen mit Außreisern erkannt")

0 Zeilen mit Außreisern erkannt


In [22]:
fig = go.Figure()

columns_to_plot = ["Total Bags", "Large Bags"]

for col in columns_to_plot:
    fig.add_traces(go.Box(
        y=X_train_capped[col],
        name=col,
        boxpoints="outliers",
    ))

fig

## Standardisierung (z-Transformaton)

In [23]:
from sklearn.preprocessing import StandardScaler

numeric_columns = [
    "Total Volume", "4046", "4225", "4770",
    "Total Bags", "Small Bags", "Large Bags", "XLarge Bags"
]

# Kopien:
X_train_scaled = X_train_capped.copy()
X_test_scaled = X_test_capped.copy()

# Sclaer anwenden:
scaler = StandardScaler()
X_train_scaled[numeric_columns] = scaler.fit_transform(X_train_capped[numeric_columns])
X_test_scaled[numeric_columns] = scaler.fit_transform(X_test_capped[numeric_columns])

# Kontrolle:
print("Arithmetisches Mittel:")
print(X_train_scaled[numeric_columns].mean().round(4))
print("\nStandardabweichung:")
print(X_train_scaled[numeric_columns].std().round(4))

Arithmetisches Mittel:
Total Volume    0.0
4046           -0.0
4225            0.0
4770            0.0
Total Bags      0.0
Small Bags      0.0
Large Bags     -0.0
XLarge Bags    -0.0
dtype: float64

Standardabweichung:
Total Volume    1.0
4046            1.0
4225            1.0
4770            1.0
Total Bags      1.0
Small Bags      1.0
Large Bags      1.0
XLarge Bags     1.0
dtype: float64


In [24]:
X_train_scaled

Unnamed: 0,Date,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region,month
16391,2017-11-12,-1.192413,-1.186422,-1.092390,-1.220408,-0.815529,-0.546993,-1.319327,-0.646836,organic,2017,Orlando,11
4990,2016-01-10,0.849121,0.878532,0.741728,1.184886,0.993868,0.866898,1.151544,-0.646836,conventional,2016,SanDiego,1
13653,2016-11-27,-1.111638,-0.410174,-0.714429,-1.220408,-1.183396,-0.892171,-0.756099,-0.646836,organic,2016,PhoenixTucson,11
2074,2015-02-08,0.247172,0.674353,0.254325,-0.037372,0.025623,0.173844,0.191401,-0.646836,conventional,2015,Roanoke,2
12377,2016-06-12,-0.310956,-1.158125,0.253532,-1.220408,-1.237785,-0.927046,-1.848537,-0.646836,organic,2016,Chicago,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...
11284,2015-06-28,-0.808981,-0.360137,-0.219339,-1.220408,-1.952690,-1.571546,-1.848537,-0.646836,organic,2015,SanDiego,6
11964,2016-05-22,-1.849105,-1.684595,-1.419439,-1.220408,-1.562515,-1.219796,-1.848537,-0.646836,organic,2016,Albany,5
5390,2016-05-01,0.003915,-0.621986,0.378183,0.381961,-0.004497,0.184785,-1.848537,-0.646836,conventional,2016,Syracuse,5
860,2015-06-14,0.528593,0.495177,0.711161,0.615565,0.551022,0.665779,0.215336,1.459596,conventional,2015,HarrisburgScranton,6


## One-Hot-Encoding

In [25]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Kategorien definieren:
categorical_features = ["type", "year", "region", "month"]

# Encoder erstellen:
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

# Columntransformer verwenden:
ct = ColumnTransformer(
    transformers=[("ohe", ohe, categorical_features)],
    remainder="passthrough",
    verbose_feature_names_out=False
)

# OHE anwenden:
X_train_ohe = ct.fit_transform(X_train_scaled) # Lerne, wie du umwandelst und wandle dann um
X_test_ohe = ct.transform(X_test_scaled) # Wande um, so wie du es gelernt hast beim Training

# Spaltennamen vom CT holen:
feature_names = ct.get_feature_names_out()

X_train_ohe = pd.DataFrame(X_train_ohe, columns=feature_names, index=X_train_scaled.index)
X_test_ohe = pd.DataFrame(X_test_ohe, columns=feature_names, index=X_test_scaled.index)
X_test_ohe

Unnamed: 0,type_conventional,type_organic,year_2015,year_2016,year_2017,year_2018,region_Albany,region_Atlanta,region_BaltimoreWashington,region_Boise,...,month_12,Date,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags
8604,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,2018-02-11,0.435627,0.693736,0.121004,0.717855,0.639458,0.602012,0.879166,1.539086
2608,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2015-11-01,0.689529,1.012704,0.479439,-0.095769,0.553472,0.401856,0.955733,-0.641141
14581,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2016-01-24,-1.313334,-0.582518,-2.16593,-1.206524,-0.99497,-0.713459,-1.491549,-0.641141
4254,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2016-03-06,0.524668,0.887652,0.129859,0.317288,0.549717,0.608867,0.658927,-0.641141
16588,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2017-02-19,-0.960167,-0.680523,-2.699032,-1.206524,-0.519701,-0.310672,-0.228145,-0.641141
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15956,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2017-01-22,-0.015112,0.096222,0.053815,-0.866212,0.242538,0.377562,0.225706,-0.641141
12471,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2016-08-21,-1.046245,-0.763347,-0.542952,-1.206524,-1.019071,-1.063571,0.043103,-0.641141
4574,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2016-01-10,1.203217,1.415487,0.919101,1.052057,0.755005,0.691363,0.982066,-0.641141
16359,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2017-06-18,-0.603453,-1.883775,-1.38334,-1.206524,-0.121677,0.075421,-1.878523,-0.641141


## Modell erstellen

In [26]:
from sklearn.linear_model import LinearRegression

lm = LinearRegression()

X_train_ohe.drop(columns=["Date"], inplace=True)
X_test_ohe.drop(columns=["Date"], inplace=True)

# Training:
lm.fit(X_train_ohe, y_train)

# Vorhersagen:
y_pred = lm.predict(X_test_ohe)

In [27]:
evaluate_regression(lm, X_train_ohe, y_train, X_test_ohe, y_test, cv_splits=5)

RMSE (Test): 0.0780
R² (Test): 0.7744
Adjusted R² (Test): 0.7694
CV-R2 (Train) Mittelwert: 0.7772 | Std: 0.0063 | Folds: 5


{'R2_test': 0.7744221912073155,
 'Adj_R2_test': 0.7693658099511052,
 'RMSE_test': 0.07797220948940693,
 'CV_mean': 0.777198040020689,
 'CV_std': 0.006276653790277569,
 'CV_metric': 'r2',
 'n_test': 3650,
 'p_features': 80,
 'cv_splits': 5}