## Read `cars.csv` into a DataFrame

In [16]:
import numpy as np
import pandas as pd
from scipy.stats import randint, norm
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import RobustScaler

In [17]:
df = pd.read_csv("data/cars.csv")
df.head()

Unnamed: 0,price,yearOfRegistration,powerPS,kilometer,model,fuelType,name
0,1450,1997,75,90000,andere,benzin,Toyota_Toyota_Starlet_1._Hand__TÜV_neu
1,13100,2005,280,5000,golf,benzin,R32_tauschen_oder_kaufen
2,4500,2008,87,90000,yaris,benzin,Toyota_Yaris_1.3_VVT_i
3,6000,2009,177,125000,3er,diesel,320_Alpinweiss_Kohlenstoff
4,3990,1999,118,90000,3er,benzin,BMW_318i_E46_+++_1._Hand_+++_Liebhaberfahrzeug


## Data Cleaning

* Remove the features `model` and `name`
* Remove observations with `NaN` entries
* Remove observations whose `fuelType` is neither `benzin` nor `diesel`
* One-hot encode `fuelType`

In [18]:
df = df.drop(columns=["model", "name"])
df = df.dropna()
df.isna().sum()

price                 0
yearOfRegistration    0
powerPS               0
kilometer             0
fuelType              0
dtype: int64

In [19]:
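# Keep petrol and diesel cars only; .copy() avoids a SettingWithCopyWarning on the assignment below
df = df.query("fuelType == 'benzin' or fuelType == 'diesel'").copy()
# With two remaining categories, a single 0/1 column is enough
df['isDiesel'] = df['fuelType'].map({'benzin': 0, 'diesel': 1})
df = df.drop(columns='fuelType')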

In [20]:
df.head()

Unnamed: 0,price,yearOfRegistration,powerPS,kilometer,isDiesel
0,1450,1997,75,90000,0
1,13100,2005,280,5000,0
2,4500,2008,87,90000,0
3,6000,2009,177,125000,1
4,3990,1999,118,90000,0


![output](assets/data_cleaning_output.png)
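
The manual 0/1 mapping above works because only two fuel types remain. As a minimal sketch of the same cleaning steps with `pd.get_dummies`, here is a version on a small made-up frame (the `toy` data and its values are hypothetical, not taken from `cars.csv`):

```python
import pandas as pd

# Hypothetical mini-frame mimicking two columns of cars.csv (values are made up).
toy = pd.DataFrame({
    "price": [1450, 13100, 6000, 2500, None],
    "fuelType": ["benzin", "benzin", "diesel", "lpg", "diesel"],
})

# Drop rows with missing values, then keep petrol and diesel cars only.
toy = toy.dropna()
toy = toy[toy["fuelType"].isin(["benzin", "diesel"])].copy()

# drop_first=True keeps a single indicator column (fuelType_diesel),
# which carries the same information as an isDiesel column.
encoded = pd.get_dummies(toy, columns=["fuelType"], drop_first=True, dtype=int)
print(encoded)
```

With more than two categories, `get_dummies` would produce one indicator column per remaining category, which the manual mapping cannot express.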

## Training

* Use 20% of the data as the test set and `random_state=42`
* Predict `price`
* Fit a LinearRegression and determine which feature has a negative influence on the price
* Fit a RandomForestRegressor and determine the most important feature

In [21]:
X = df.drop(columns='price')
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [22]:
model_lin_reg = LinearRegression()
model_lin_reg.fit(X_train, y_train)

In [23]:
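# Pairwise correlation of each column with price, used here as a quick check of the sign of each influence
df.corr()['price']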

price                 1.000000
yearOfRegistration    0.357412
powerPS               0.608716
kilometer            -0.443695
isDiesel              0.443016
Name: price, dtype: float64
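
Correlation is a pairwise measure, whereas the task asks about the fitted regression. The coefficients of a `LinearRegression` give the sign of each feature's influence directly. A minimal sketch on synthetic stand-in data (the `X_toy`/`y_toy` names and the generating formula are assumptions for illustration only):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical): price rises with power, falls with mileage.
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({
    "powerPS": rng.uniform(50, 300, size=200),
    "kilometer": rng.uniform(5_000, 150_000, size=200),
})
y_toy = 40 * X_toy["powerPS"] - 0.03 * X_toy["kilometer"] + rng.normal(0, 100, size=200)

toy_model = LinearRegression().fit(X_toy, y_toy)

# One coefficient per feature, in column order; negative values lower the prediction.
coefs = pd.Series(toy_model.coef_, index=X_toy.columns)
print(coefs[coefs < 0])
```

On the notebook's data the equivalent would be `pd.Series(model_lin_reg.coef_, index=X_train.columns)`. Coefficients are expressed per unit of each feature, so their magnitudes are not directly comparable across features.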

In [24]:
model_forrest = RandomForestRegressor(random_state=42)
model_forrest.fit(X_train, y_train)
pd.DataFrame(data=model_forrest.feature_importances_, index=X_train.columns)
# powerPS is the most important feature (importance of about 0.46, see output)

Unnamed: 0,0
yearOfRegistration,0.280431
powerPS,0.463751
kilometer,0.224976
isDiesel,0.030842


![test](assets/forest_feature_importances.png)
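
For more than a handful of features, a labelled and sorted `Series` is easier to read than an unnamed DataFrame column. A small sketch on synthetic data (the `signal` and `noise` features are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (hypothetical): only "signal" actually drives the target.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y_toy = 5 * X_toy["signal"] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_toy, y_toy)

# Impurity-based importances are normalised to sum to 1; sort to get a ranking.
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```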

## Evaluation

* Compute the mean squared error for both models (the `calc_error` helper reports its square root, the RMSE)
* Does either model perform better when the data is scaled?

In [25]:
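def calc_error(model, X_test, y_true):
    """Return the root mean squared error (RMSE) of the model on the given data."""
    predictions = model.predict(X_test)
    # Taking the square root puts the error in the same unit as price
    return np.sqrt(mean_squared_error(y_true, predictions))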

In [26]:
print("Lin Reg:", calc_error(model_lin_reg, X_test, y_test), "\n")
print("Forest:", calc_error(model_forrest, X_test, y_test))

Lin Reg: 3703.9811554900343 

Forest: 3370.642243161223
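
The values printed above are root mean squared errors, so they are in the same unit as `price`. The relation between MSE and RMSE can be sketched by hand on three made-up predictions (the numbers are illustrative only):

```python
import numpy as np

# Hypothetical true prices and predictions.
y_true = np.array([1000.0, 2000.0, 3000.0])
y_pred = np.array([1100.0, 1900.0, 3200.0])

# MSE: mean of the squared residuals (unit: price squared).
mse = np.mean((y_true - y_pred) ** 2)
# RMSE: square root of the MSE (unit: price).
rmse = np.sqrt(mse)
print(mse, rmse)
```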


In [27]:
scaler = RobustScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_lin_reg_scaled = LinearRegression()
model_lin_reg_scaled.fit(X_train_scaled, y_train)
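# Evaluate the model fitted on scaled data against the scaled test set (least squares is unaffected by this kind of scaling)
print("Lin Reg:", calc_error(model_lin_reg_scaled, X_test_scaled, y_test), "\n")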

Lin Reg: 3703.9811554900343 
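
Ordinary least squares with an intercept is invariant to shifting and rescaling individual features, so scaling is not expected to change the linear model's predictions. Tree-based models split on thresholds and are likewise unaffected by monotonic rescaling. A minimal sketch of the linear case on synthetic data (all names and values here are hypothetical), using a pipeline so the scaler is applied consistently at fit and predict time:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic data (hypothetical) with features on a large scale.
rng = np.random.default_rng(1)
X_toy = rng.uniform(0, 100_000, size=(200, 3))
y_toy = X_toy @ np.array([0.5, -0.2, 0.1]) + rng.normal(0, 10, size=200)
X_fit, X_eval = X_toy[:150], X_toy[150:]
y_fit = y_toy[:150]

# Same estimator, once on raw features and once behind a RobustScaler.
plain = LinearRegression().fit(X_fit, y_fit)
scaled = make_pipeline(RobustScaler(), LinearRegression()).fit(X_fit, y_fit)

# Compare predictions up to floating-point tolerance.
same = np.allclose(plain.predict(X_eval), scaled.predict(X_eval))
print(same)
```

Scaling matters for models that are sensitive to feature magnitudes, for example regularised linear models or distance-based methods.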



### Tuning

Find a model whose error on the test set (RMSE, as returned by `calc_error`) is below 3200.

In [29]:
model_grid_randomized_cv = RandomizedSearchCV(estimator=RandomForestRegressor(),
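                                              param_distributions={'n_estimators': randint(0, 1000),  # caveat: can draw 0, which is not a valid n_estimators
                                                                   'max_features': norm(loc=0.5, scale=0.15)},  # caveat: unbounded, draws can fall outside (0, 1]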
                                              scoring='neg_mean_squared_error',  # negated MSE, so that higher is better
                                              cv=5,
                                              n_iter=20,
                                              n_jobs=8,
                                              random_state=42)
model_grid_randomized_cv.fit(X_train, y_train)
pd.DataFrame(model_grid_randomized_cv.cv_results_)
calc_error(model_grid_randomized_cv, X_test, y_test)

2959.5069126973817

![tuning](assets/tuning.png)
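
The search above samples `n_estimators` from `randint(0, 1000)` and `max_features` from a normal distribution, both of which can in principle produce invalid values. One possible alternative is to use bounded distributions. The sketch below uses synthetic data and deliberately small, hypothetical ranges so that it runs quickly; it is not the configuration that produced the result above.

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (hypothetical), small enough for a quick search.
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(120, 4))
y_toy = X_toy[:, 0] * 3 + rng.normal(scale=0.5, size=120)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions={
        "n_estimators": randint(10, 60),      # integers in [10, 59], never 0
        "max_features": uniform(0.25, 0.75),  # floats in [0.25, 1.0]
    },
    scoring="neg_mean_squared_error",  # negated MSE, so that higher is better
    cv=3,
    n_iter=4,
    random_state=42,
)
search.fit(X_toy, y_toy)
print(search.best_params_)
```

Note that `scipy.stats.uniform(loc, scale)` covers the interval from `loc` to `loc + scale`, not from `loc` to `scale`.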