## Read `cars.csv` into a DataFrame

In [2]:
import pandas as pd

cars = pd.read_csv('cars.csv')
cars.describe()

| | price | yearOfRegistration | powerPS | kilometer |
|---|---|---|---|---|
| count | 250.0 | 250.0 | 250.0 | 250.0 |
| mean | 8553.384 | 2006.448 | 118.068 | 84520.0 |
| std | 7832.009205 | 7.103883 | 49.183202 | 38265.21274 |
| min | 0.0 | 1974.0 | 41.0 | 5000.0 |
| 25% | 2820.0 | 2002.0 | 75.25 | 60000.0 |
| 50% | 5900.0 | 2008.0 | 109.0 | 90000.0 |
| 75% | 12000.0 | 2011.0 | 150.0 | 125000.0 |
| max | 44000.0 | 2018.0 | 280.0 | 125000.0 |


## Data Cleaning

* Remove the features `model` and `name`
* Remove observations with `NaN` entries
* Remove observations whose `fuelType` is not `benzin` or `diesel`
* One-hot encode `fuelType`

In [3]:
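# Drop the `model` and `name` columns
cars = cars.drop(['model', 'name'], axis=1)
# Drop rows that contain missing values
cars = cars.dropna()
# Keep only cars whose fuelType is 'benzin' or 'diesel'
cars = cars[cars["fuelType"].isin(['benzin', 'diesel'])]
# One-hot encode fuelType; drop_first=True leaves a single `diesel` indicator column
fuelType = pd.get_dummies(cars['fuelType'], drop_first=True)
cars = cars.drop(['fuelType'], axis=1)
cars = pd.concat([cars, fuelType], axis=1)
cars.head()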


| | price | yearOfRegistration | powerPS | kilometer | diesel |
|---|---|---|---|---|---|
| 0 | 1450 | 1997 | 75 | 90000 | 0 |
| 1 | 13100 | 2005 | 280 | 5000 | 0 |
| 2 | 4500 | 2008 | 87 | 90000 | 0 |
| 3 | 6000 | 2009 | 177 | 125000 | 1 |
| 4 | 3990 | 1999 | 118 | 90000 | 0 |
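
The output above contains only a `diesel` column because `drop_first=True` drops the first category (`benzin`, alphabetically). The following minimal sketch illustrates this on a small, made-up `fuelType` series (the three values are hypothetical, not taken from `cars.csv`). It passes `dtype=int` to get 0/1 values; without it, recent pandas versions return boolean columns.

```python
import pandas as pd

# Hypothetical sample values, only for illustration
fuel = pd.Series(["benzin", "diesel", "benzin"], name="fuelType")

# Full one-hot encoding: one indicator column per category
full = pd.get_dummies(fuel, dtype=int)

# drop_first=True removes the first category ('benzin'),
# so a 0 in 'diesel' implicitly means 'benzin'
reduced = pd.get_dummies(fuel, drop_first=True, dtype=int)

print(full)
print(reduced)
```

With only two categories, the dropped column carries no extra information, and leaving it out avoids a perfectly collinear feature in the linear regression.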


## Training

* Use 20% of the data for testing and `random_state=42`
* Predict `price`
* Use a `LinearRegression` and determine which feature has a negative effect on the price
* Use a `RandomForestRegressor` and determine the most important feature

In [4]:
from sklearn.model_selection import train_test_split
from sklearn import linear_model
X = cars.drop(['price'], axis=1)
y = cars['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
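
To make the effect of `test_size=0.2` and `random_state=42` concrete, here is a small sketch on made-up data (the arrays `X_demo` and `y_demo` are hypothetical). It shows that 20% of the rows end up in the test set and that repeating the call with the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Calling again with the same random_state gives the same test rows
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te_again))
```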

In [5]:
priceLinReg = linear_model.LinearRegression()
priceLinReg.fit(X_train, y_train)
pd.DataFrame([priceLinReg.coef_, X.columns])

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 131.662959 | 84.224383 | -0.079952 | 4997.553764 |
| 1 | yearOfRegistration | powerPS | kilometer | diesel |
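
In the coefficient output above, `kilometer` is the only feature with a negative coefficient, so higher mileage lowers the predicted price. The sketch below reads this off programmatically. It copies the coefficients from the output instead of refitting the model, so it runs on its own. The coefficients are expressed per unit of each feature, so their magnitudes are not directly comparable across features.

```python
import pandas as pd

# Coefficients copied from the output above, labelled by feature name
coefficients = pd.Series(
    [131.662959, 84.224383, -0.079952, 4997.553764],
    index=["yearOfRegistration", "powerPS", "kilometer", "diesel"],
)

# Features with a negative coefficient lower the predicted price
negative = coefficients[coefficients < 0]
print(negative)

# Predicted price change for an extra 10,000 km, all else being equal
change_per_10k_km = coefficients["kilometer"] * 10_000
print(change_per_10k_km)
```

Inside the notebook, the same labelled view can be built directly with `pd.Series(priceLinReg.coef_, index=X.columns)`.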


In [6]:
# RandomForestRegressor (no random_state is set, so the results vary slightly between runs)
from sklearn.ensemble import RandomForestRegressor

priceForestReg = RandomForestRegressor()
priceForestReg.fit(X_train, y_train)
pd.DataFrame([priceForestReg.feature_importances_,X.columns])

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.304028 | 0.44865 | 0.212051 | 0.035271 |
| 1 | yearOfRegistration | powerPS | kilometer | diesel |
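
According to the importances above, `powerPS` is the most important feature for the random forest in this run. The following sketch ranks the features using the values copied from the output, so it does not need the fitted model. Since the forest was created without a `random_state`, another run may produce slightly different numbers.

```python
import pandas as pd

# Feature importances copied from the output above
importances = pd.Series(
    [0.304028, 0.44865, 0.212051, 0.035271],
    index=["yearOfRegistration", "powerPS", "kilometer", "diesel"],
)

# Sort from most to least important
ranked = importances.sort_values(ascending=False)
most_important = ranked.index[0]

print(ranked)
print(most_important)

# The importances of a fitted forest are normalised to sum to 1
print(importances.sum())
```

Inside the notebook, the equivalent is `pd.Series(priceForestReg.feature_importances_, index=X.columns).sort_values(ascending=False)`.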


## Evaluation

* Compute an average error (here the mean absolute error) for both models

In [7]:
from sklearn import metrics
print(metrics.mean_absolute_error(y_test, priceLinReg.predict(X_test)))     # linear regression
print(metrics.mean_absolute_error(y_test, priceForestReg.predict(X_test)))  # random forest

2681.997339586527
2489.3779047619046
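
The first value is the error of the linear regression, the second that of the random forest. Both are mean absolute errors, i.e. the average of |y − ŷ| over the test set, expressed in the same unit as `price`. The sketch below recomputes the metric by hand on three made-up price pairs (hypothetical values) and then compares the two printed results. The helper name `mean_absolute_error_manual` is chosen here for illustration only.

```python
import numpy as np

def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical true and predicted prices
y_true = [1450, 13100, 4500]
y_pred = [2000, 12000, 4500]

# Absolute errors are 550, 1100 and 0, so the mean is 550
mae = mean_absolute_error_manual(y_true, y_pred)
print(mae)

# Relative difference between the two results printed above
mae_linear = 2681.997339586527
mae_forest = 2489.3779047619046
relative_gain = (mae_linear - mae_forest) / mae_linear
print(round(relative_gain, 3))
```

In this run the random forest's error is roughly 7% lower than that of the linear regression. Because the forest was trained without a fixed `random_state`, the gap can shift somewhat between runs.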
