# 2. Baseline: Linear Regression

**KI1-Projekt 308**: California Housing dataset

This notebook establishes the baseline: a reference model against which all subsequent methods are compared.

In [None]:
import sys
sys.path.insert(0, '..')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

from utils.data import load_and_clean_data, get_train_test_split
from utils.evaluation import evaluate_model, add_result
from utils.plotting import plot_predicted_vs_actual, plot_residuals, plot_feature_importances

plt.rcParams['figure.dpi'] = 100
%matplotlib inline

## 2.1 Load data

In [None]:
df = load_and_clean_data()
X_train, X_test, y_train, y_test, feature_names = get_train_test_split(df)
print(f"Training:  {X_train.shape[0]} Samples, {X_train.shape[1]} Features")
print(f"Test:      {X_test.shape[0]} Samples")
print(f"Features:  {feature_names}")

## 2.2 Linear regression (all features, no scaling)

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

result_lr = evaluate_model(lr, X_train, X_test, y_train, y_test, "Linear Regression")
add_result(result_lr)
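
The implementation of `evaluate_model` lives in the project's `utils.evaluation` module and is not shown here. As a rough orientation only, the following sketch computes three metrics that a regression evaluation commonly reports (R², RMSE, MAE). The function name `regression_metrics` and the returned keys are hypothetical and not part of the project code.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return R², RMSE and MAE for one set of predictions (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    # R² compares the model's squared error with that of a constant mean predictor.
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)

    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mae": float(np.mean(np.abs(residuals))),
    }


# Toy check: predicting the mean everywhere gives R² = 0.
metrics = regression_metrics([0.0, 2.0], [1.0, 1.0])
print(metrics)
```

Computing the same metrics on the training and the test split, as the arguments to `evaluate_model` suggest, makes it possible to spot overfitting by comparing the two.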

In [None]:
# Coefficients of the linear regression, sorted by absolute value
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': lr.coef_
}).sort_values('Coefficient', key=abs, ascending=False)

print(f"Intercept: {lr.intercept_:.4f}\n")
print(coef_df.to_string(index=False))
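
Because the model is fitted without scaling, the size of each coefficient depends on the unit of its feature, so sorting by absolute value does not rank feature importance directly. The sketch below uses synthetic data (not the California Housing features) to illustrate two points: standardizing the inputs leaves the predictions of ordinary least squares unchanged, and the coefficient ranking can differ between the raw and the standardized fit. All variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Three synthetic features on very different scales (std of about 1, 100 and 0.01).
X = rng.normal(size=(500, 3)) * np.array([1.0, 100.0, 0.01])
y = 2.0 * X[:, 0] + 0.05 * X[:, 1] + 300.0 * X[:, 2] + rng.normal(scale=0.1, size=500)

# Fit once on the raw features and once on standardized features.
raw = LinearRegression().fit(X, y)
scaler = StandardScaler().fit(X)
X_std = scaler.transform(X)
scaled = LinearRegression().fit(X_std, y)

# Predictions agree up to floating-point error.
same_predictions = np.allclose(raw.predict(X), scaled.predict(X_std))

# The feature with the largest absolute coefficient differs between the two fits.
largest_raw = int(np.argmax(np.abs(raw.coef_)))
largest_scaled = int(np.argmax(np.abs(scaled.coef_)))

print(f"Same predictions: {same_predictions}")
print(f"Largest raw coefficient: feature {largest_raw}")
print(f"Largest standardized coefficient: feature {largest_scaled}")
```

Each standardized coefficient equals the raw coefficient multiplied by the feature's standard deviation, so it measures the change in the target per standard deviation of the feature rather than per unit.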

In [None]:
fig, ax = plot_predicted_vs_actual(
    y_test, lr.predict(X_test),
    title="Linear Regression: Predicted vs. Actual",
    save_name="baseline_pred_vs_actual"
)
plt.show()

In [None]:
fig, ax = plot_residuals(
    y_test, lr.predict(X_test),
    title="Linear Regression: Residuals",
    save_name="baseline_residuals"
)
plt.show()
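
A residual plot can be complemented by a simple numeric check: if the linear model is adequate, the mean residual should be close to zero across the whole range of predictions. The sketch below uses synthetic, deliberately non-linear data (not the housing data) and assumes residuals are defined as actual minus predicted. The sign convention used by `plot_residuals` is not shown in this notebook.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic example: the true relationship is quadratic, the fitted model is linear.
x = rng.uniform(0.0, 3.0, size=1000)
y = x ** 2 + rng.normal(scale=0.2, size=1000)

slope, intercept = np.polyfit(x, y, 1)
pred = slope * x + intercept
residuals = y - pred  # assumption: residual = actual - predicted

# Sort by predicted value and split into five equally sized bins.
order = np.argsort(pred)
bin_means = [float(chunk.mean()) for chunk in np.array_split(residuals[order], 5)]

for i, mean in enumerate(bin_means, start=1):
    print(f"Bin {i}: mean residual = {mean:+.3f}")
```

In this toy setup the bin means are positive at both ends and negative in the middle. A U-shaped pattern like this is typical of a straight line fitted to a curved relationship.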

## 2.3 Linear regression with MedInc only (strongest feature)

In [None]:
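# Use MedInc as the only feature.
# Assumes feature_names is a list and X_train/X_test are NumPy arrays.
medinc_idx = feature_names.index('MedInc')
# reshape(-1, 1) keeps the 2D shape (n_samples, 1) that scikit-learn expects
X_train_medinc = X_train[:, medinc_idx].reshape(-1, 1)
X_test_medinc = X_test[:, medinc_idx].reshape(-1, 1)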

lr_medinc = LinearRegression()
lr_medinc.fit(X_train_medinc, y_train)

result_medinc = evaluate_model(lr_medinc, X_train_medinc, X_test_medinc, y_train, y_test, "LR MedInc only")
add_result(result_medinc)
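
For a model with a single feature and an intercept, the R² on the training data equals the squared Pearson correlation between that feature and the target. This gives a quick plausibility check for the single-feature result. The sketch below demonstrates the identity on synthetic data. It does not use the housing data and does not carry over to the test split.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic single-feature data with a noisy linear relationship.
x = rng.normal(size=200)
y = 1.5 * x + rng.normal(size=200)

# Ordinary least squares with intercept, fitted on the same data it is evaluated on.
slope, intercept = np.polyfit(x, y, 1)
pred = slope * x + intercept

ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

pearson_r = np.corrcoef(x, y)[0, 1]

print(f"R² (train):  {r2:.6f}")
print(f"Pearson r²:  {pearson_r ** 2:.6f}")
```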

## 2.4 Summary

The baseline linear regression provides the reference R² value.
All subsequent models should exceed it.

**Observations:**
- MedInc alone already explains a large share of the variance
- The residuals show systematic patterns, which indicates non-linear relationships
- Feature engineering and more complex models offer room for improvement