# California Housing Dataset: Regression Models

## Task 1 (First Regression)
Goal: Understand what regression is.

1. Load the dataset housing.csv with Pandas, or load it through Scikit-Learn with `from sklearn.datasets import fetch_california_housing`.
2. Examine the structure of the dataset: How many rows does it have? Which columns?
3. Create a simple scatter plot:
   - x-axis: median_income (`MedInc` in the Scikit-Learn version)
   - y-axis: median_house_value (`MedHouseVal` in the Scikit-Learn version)
4. What do you see in the scatter plot?
5. Fit a simple linear regression model with median_income as the only input.
6. Compute the R² value.
   - What does this value mean? Is the model "good"?
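
To make the R² question in step 6 more concrete, here is a minimal sketch that computes R² from its definition, 1 - SS_res / SS_tot, in plain Python. The helper name `r2_manual` and the toy values are assumptions for illustration only and are not taken from the housing data. A value of 1 means the predictions match the targets exactly, a value of 0 means the model is no better than always predicting the mean, and negative values mean it is worse than that.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (hypothetical helper)."""
    mean_y = sum(y_true) / len(y_true)
    # Sum of squared residuals between targets and predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares around the mean of the targets.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up toy values, not taken from the housing data.
y_true = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]        # predictions equal the targets
mean_only = [2.5, 2.5, 2.5, 2.5]      # always predict the mean of y_true

print(r2_manual(y_true, perfect))     # 1.0
print(r2_manual(y_true, mean_only))   # 0.0
```

Read this way, an R² of about 0.47 would mean that the model accounts for a little under half of the variance in the target.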

In [4]:
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

housing_df = pd.read_csv("data/housing 2.csv")
housing_df

Unnamed: 0.1,Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
2,2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
3,3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
...,...,...,...,...,...,...,...,...,...,...
20635,20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
20636,20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771
20637,20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
20638,20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847


In [6]:
print("Number of rows:", housing_df.shape[0])
print("Columns:", housing_df.columns.tolist())

# Scatter plot of income against house value. Only the last expression of a
# cell is rendered automatically, so this figure is shown explicitly.
fig_income = px.scatter(
  housing_df,
  x="MedInc",
  y="MedHouseVal",
  title="Median Income vs. Median House Value"
)
fig_income.show()

# Simple linear regression with MedInc as the only input.
X = housing_df[["MedInc"]]
y = housing_df["MedHouseVal"]
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

# Note: this R^2 is computed on the same rows the model was fitted on.
r2 = r2_score(y, y_pred)
print(f"R^2 (simple model): {r2:.4f}")

# Multivariate linear regression, evaluated on a held-out test set.
features = ["MedInc", "HouseAge", "AveRooms", "Population"]
X = housing_df[features]
y = housing_df["MedHouseVal"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lin_model = LinearRegression()
lin_model.fit(X_train, y_train)
y_pred_lin = lin_model.predict(X_test)

r2_lin = r2_score(y_test, y_pred_lin)
print(f"R^2 (multivariate model): {r2_lin:.4f}")

# Same model without 'Population', to see how much that feature contributes.
X_reduced = X.drop("Population", axis=1)
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_reduced, y, test_size=0.2, random_state=42)
lin_model_r = LinearRegression()
lin_model_r.fit(X_train_r, y_train_r)
y_pred_r = lin_model_r.predict(X_test_r)
print(f"R^2 without 'Population': {r2_score(y_test_r, y_pred_r):.4f}")

# Compare three regressors on the same train/test split.
models = {
  "LinearRegression": LinearRegression(),
  "DecisionTree": DecisionTreeRegressor(random_state=42),
  "RandomForest": RandomForestRegressor(random_state=42)
}

# Separate loop names avoid overwriting `model` and `y_pred` from above.
for name, candidate in models.items():
    candidate.fit(X_train, y_train)
    y_pred_candidate = candidate.predict(X_test)
    print(f"\nModel: {name}")
    print(f"R^2: {r2_score(y_test, y_pred_candidate):.4f}")
    print(f"MAE: {mean_absolute_error(y_test, y_pred_candidate):.4f}")
    print(f"MSE: {mean_squared_error(y_test, y_pred_candidate):.4f}")

# Engineered feature. Note that this divides the average rooms per household
# by the block population; AveRooms / AveOccup would be rooms per person.
housing_df["RoomsPerPerson"] = housing_df["AveRooms"] / housing_df["Population"]
# A population of zero would produce infinities; treat them as missing and drop them.
housing_df["RoomsPerPerson"] = housing_df["RoomsPerPerson"].replace([np.inf, -np.inf], np.nan)
housing_df = housing_df.dropna(subset=["RoomsPerPerson"])

features_with_new = features + ["RoomsPerPerson"]
X_new = housing_df[features_with_new]
y_new = housing_df["MedHouseVal"]
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new, y_new, test_size=0.2, random_state=42)

model_with_new = LinearRegression()
model_with_new.fit(X_train_new, y_train_new)
y_pred_new = model_with_new.predict(X_test_new)

print(f"\nR^2 with new feature 'RoomsPerPerson': {r2_score(y_test_new, y_pred_new):.4f}")

# Map-style view: each point is a district, colored by its median house value.
px.scatter(
  housing_df,
  x="Longitude",
  y="Latitude",
  color="MedHouseVal",
  title="House Prices in California",
  labels={"MedHouseVal": "Median House Value"},
  color_continuous_scale="Viridis"
)


Number of rows: 20640
Columns: ['Unnamed: 0', 'MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude', 'MedHouseVal']
R^2 (simple model): 0.4734
R^2 (multivariate model): 0.4980
R^2 without 'Population': 0.4972

Model: LinearRegression
R^2: 0.4980
MAE: 0.6025
MSE: 0.6578

Model: DecisionTree
R^2: 0.1939
MAE: 0.7426
MSE: 1.0563

Model: RandomForest
R^2: 0.5621
MAE: 0.5486
MSE: 0.5738

R^2 with new feature 'RoomsPerPerson': 0.5015
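
The error metrics reported above can be read more easily once their definitions and units are spelled out. The sketch below is an illustration under assumptions: the helpers `mae` and `mse` and the toy values are hypothetical, and it assumes the CSV keeps Scikit-Learn's convention of expressing MedHouseVal in units of 100,000 US dollars. MAE is already in the target's units, while MSE is in squared units, so taking its square root (RMSE) makes it comparable to MAE.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error (hypothetical helper)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error (hypothetical helper)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up toy values, not taken from the housing data.
y_true = [2.0, 1.0, 4.0]
y_pred = [2.5, 1.0, 3.0]
print("MAE:", mae(y_true, y_pred))
print("MSE:", mse(y_true, y_pred))
print("RMSE:", math.sqrt(mse(y_true, y_pred)))

# Converting the RandomForest figures reported above, assuming the target
# is measured in units of 100,000 US dollars.
rf_mae, rf_mse = 0.5486, 0.5738
rf_rmse = math.sqrt(rf_mse)
print(f"RandomForest MAE in dollars:  {rf_mae * 100_000:,.0f}")
print(f"RandomForest RMSE in dollars: {rf_rmse * 100_000:,.0f}")
```

Because squaring weights large errors more heavily, RMSE is never smaller than MAE; a wide gap between the two suggests that a few districts are predicted very badly.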
