<a href="https://colab.research.google.com/github/Konstantin5054232/ausbildungsprojekte/blob/main/11_autokosten/autokosten.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Determining Car Prices**

A used-car dealer is developing an app to attract new customers. In the app, users can quickly find out the market value of their car. Historical data are available: technical specifications, equipment, and prices of cars. The task is to build a model that predicts the price.

The customer cares about:

*   prediction quality;
*   prediction speed;
*   training time.

# Data preparation

In [None]:
# Import the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from lightgbm.sklearn import LGBMRegressor

In [None]:
# Load the data
df = pd.read_csv('/content/autos.csv')

In [None]:
# Inspect the loaded data
df.info()
display(df.shape)
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Kilometer          354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)

(354369, 16)

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,2016-03-24 11:52:17,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


We drop the columns that are not needed for the analysis. In our view, the factors that determine the price are the registration year, the mileage, the brand and model, and whether the car has been repaired. There are many models, but we also have the engine power of each car. To speed up the models, we assume that a brand has only one model with a given power, and that this combination also determines the vehicle type, gearbox type, and fuel type.
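The assumption that brand and power identify the model can be checked before `Model` is dropped. The following is a minimal sketch on a made-up toy frame (the rows are illustrative and not taken from `autos.csv`). It counts how many distinct models share each (`Brand`, `Power`) pair and reports the share of pairs for which the assumption holds.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Brand': ['volkswagen', 'volkswagen', 'volkswagen', 'audi', 'audi'],
    'Power': [75, 75, 105, 190, 190],
    'Model': ['golf', 'polo', 'golf', 'a5', 'a5'],
})

# Number of distinct models per (Brand, Power) pair
models_per_pair = toy.groupby(['Brand', 'Power'])['Model'].nunique()

# Share of pairs that map to exactly one model
share_unique = (models_per_pair == 1).mean()

print(models_per_pair)
print(f'{share_unique:.0%} of (Brand, Power) pairs map to a single model')
```

On the real data, a low share would suggest keeping `Model` (or `VehicleType`, `Gearbox`, `FuelType`) as features despite the extra dummy columns.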

In [None]:
df.drop(['DateCrawled', 'VehicleType', 'RegistrationMonth', 'DateCreated', 'NumberOfPictures', 'PostalCode', 
         'LastSeen', 'Model', 'Gearbox', 'FuelType'], axis=1, inplace = True)

# df.info() prints its report itself and returns None, so it is not wrapped in display()
df.info()
display(df.shape)
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Price             354369 non-null  int64 
 1   RegistrationYear  354369 non-null  int64 
 2   Power             354369 non-null  int64 
 3   Kilometer         354369 non-null  int64 
 4   Brand             354369 non-null  object
 5   NotRepaired       283215 non-null  object
dtypes: int64(4), object(2)
memory usage: 16.2+ MB

(354369, 6)

Unnamed: 0,Price,RegistrationYear,Power,Kilometer,Brand,NotRepaired
0,480,1993,0,150000,volkswagen,
1,18300,2011,190,125000,audi,yes
2,9800,2004,163,125000,jeep,
3,1500,2001,75,150000,volkswagen,no
4,3600,2008,69,90000,skoda,no


In [None]:
# Replace the categorical features with numeric ones (one-hot encoding)
df_ohe = pd.get_dummies(df, drop_first=True)
display(df_ohe.shape)
display(df_ohe.head())

(354369, 44)

Unnamed: 0,Price,RegistrationYear,Power,Kilometer,Brand_audi,Brand_bmw,Brand_chevrolet,Brand_chrysler,Brand_citroen,Brand_dacia,...,Brand_skoda,Brand_smart,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,NotRepaired_yes
0,480,1993,0,150000,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,18300,2011,190,125000,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,9800,2004,163,125000,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1500,2001,75,150000,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,3600,2008,69,90000,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
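`NotRepaired` contains missing values (283215 of 354369 rows are non-null), and `pd.get_dummies` encodes a missing value as 0 in every dummy column. With `drop_first=True`, a missing entry therefore looks the same as `no`. The sketch below, on a three-row toy column, shows this default next to one possible alternative: filling the gaps with an explicit category first. The label `'unknown'` is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value, for illustration only
toy = pd.DataFrame({'NotRepaired': ['yes', 'no', np.nan]})

# Default behaviour: the NaN row gets 0, exactly like the 'no' row
default = pd.get_dummies(toy, drop_first=True)

# Alternative: make missingness an explicit category before encoding
# ('unknown' is a hypothetical label)
explicit = pd.get_dummies(toy.fillna({'NotRepaired': 'unknown'}), drop_first=True)

print(default.columns.tolist())
print(explicit.columns.tolist())
```

Whether the extra column helps depends on the data; it only keeps the information that the field was left empty.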


In [None]:
# Separate the target from the features
target = df_ohe['Price']
features = df_ohe.drop('Price', axis=1)

In [None]:
# Split the data into training, validation, and test sets
features_train_1, features_valid, target_train_1, target_valid = train_test_split(features, target, test_size=0.20, random_state=12345)
features_train, features_test, target_train, target_test = train_test_split(features_train_1, 
                                                                            target_train_1, test_size=0.25, random_state=12345)

In [None]:
# Check that the data were split in the intended proportions
print('{:.0%}'.format(features_train.shape[0]/features.shape[0]))
print('{:.0%}'.format(features_valid.shape[0]/features.shape[0]))
print('{:.0%}'.format(features_test.shape[0]/features.shape[0]))

60%
20%
20%
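The 60/20/20 proportions follow from the two chained calls: the first split sets aside 20% of all rows, and the second takes 25% of the remaining 80%, which is 0.25 × 0.80 = 0.20 of the total. A small sketch on a hypothetical 1000-row array makes the arithmetic visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical 1000-row data set, for illustration only
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# First split: 20% of all rows go to the validation set
X_rest, X_valid, y_rest, y_valid = train_test_split(
    X, y, test_size=0.20, random_state=12345)

# Second split: 25% of the remaining 80%, i.e. 20% of the total, go to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=12345)

sizes = (len(X_train), len(X_valid), len(X_test))
print(sizes)
```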


# Model training

We select the best hyperparameters for several models, comparing them by RMSE on the validation set.

In [None]:
%%time
model = LinearRegression()
model.fit(features_train, target_train)
predictions = model.predict(features_valid) 
result = mean_squared_error(target_valid, predictions) ** 0.5 
print("LinearRegression", result)

LinearRegression 3852.6079894236263
CPU times: user 702 ms, sys: 83 ms, total: 785 ms
Wall time: 479 ms
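The expression `mean_squared_error(...) ** 0.5` used throughout is the root mean squared error (RMSE), which is expressed in the same units as the price. As a quick sanity check, the sketch below compares it with a manual computation on three made-up prices.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted prices, for illustration only
y_true = np.array([1000.0, 2000.0, 3000.0])
y_pred = np.array([1100.0, 1900.0, 3200.0])

# RMSE by hand: square root of the mean of the squared errors
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn, as in the cells of this notebook
rmse_sklearn = mean_squared_error(y_true, y_pred) ** 0.5

print(rmse_manual, rmse_sklearn)
```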


In [None]:
best_result = 5000
best_depth = 0
for depth in range(10, 20, 1):
    model = DecisionTreeRegressor(max_depth=depth, random_state=12345)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    result = mean_squared_error(target_valid, predictions_valid) ** 0.5
    if result < best_result:
        best_result = result
        best_depth = depth
    
print(best_depth, best_result)

14 2145.6947851677387
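The same search pattern is repeated for each model, so it can be wrapped in a small helper. This is a sketch, not the notebook's own code: `search_grid` and its arguments are hypothetical names, and the data are synthetic. Every combination in the grid is passed to the model constructor as keyword arguments, so each parameter in the grid actually reaches the model.

```python
from itertools import product

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def search_grid(make_model, grid, X_train, y_train, X_valid, y_valid):
    """Return (best_params, best_rmse) over all parameter combinations in `grid`."""
    best_params, best_rmse = None, float('inf')
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        model = make_model(**params).fit(X_train, y_train)
        rmse = mean_squared_error(y_valid, model.predict(X_valid)) ** 0.5
        if rmse < best_rmse:
            best_params, best_rmse = params, rmse
    return best_params, best_rmse


# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
y = 3 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, size=500)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=12345)

best_params, best_rmse = search_grid(
    lambda **p: DecisionTreeRegressor(random_state=12345, **p),
    {'max_depth': range(2, 8)},
    X_tr, y_tr, X_va, y_va,
)
print(best_params, best_rmse)
```

Starting from `float('inf')` instead of a fixed threshold such as 5000 means the helper still returns a result when every candidate is worse than that threshold.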


In [None]:
%%time
model = DecisionTreeRegressor(random_state=12345, max_depth=14) 
model.fit(features_train, target_train)
predictions = model.predict(features_valid)
result = mean_squared_error(target_valid, predictions) ** 0.5
print('DecisionTreeRegressor', result)

DecisionTreeRegressor 2145.6947851677387
CPU times: user 1.12 s, sys: 4.99 ms, total: 1.12 s
Wall time: 1.11 s


In [None]:
best_result = 5000
best_est = 0
best_depth = 0
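for est in range(10, 41, 10):
    for depth in range(16, 18, 1):
        # n_estimators must follow the loop variable. The output recorded below stems
        # from a run with n_estimators fixed at 90, which is why its RMSE differs from
        # the 10-tree model trained in the next cell.
        model = RandomForestRegressor(random_state=12345, n_estimators=est, max_depth=depth)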
        model.fit(features_train, target_train) 
        predictions_valid = model.predict(features_valid) 
        result = mean_squared_error(target_valid, predictions_valid)**0.5 
        if result < best_result:
            best_result = result
            best_est = est
            best_depth = depth
            
print(best_est, best_depth, best_result)

10 17 1988.3880598136138


In [None]:
%%time
model = RandomForestRegressor(random_state=12345, n_estimators=10, max_depth=17)
model.fit(features_train, target_train) 
predictions = model.predict(features_valid) 
result = mean_squared_error(target_valid, predictions)**0.5 
print('RandomForestRegressor', result)

RandomForestRegressor 2005.41141819194
CPU times: user 7.68 s, sys: 28 ms, total: 7.71 s
Wall time: 7.66 s


In [None]:
from lightgbm.sklearn import LGBMRegressor
best_result = 5000
best_est = 0
best_depth = 0
for est in range(2000, 2201, 100):
    for depth in range (9, 12, 1):
        model = LGBMRegressor(n_estimators=est, max_depth=depth)
        model.fit(features_train, target_train) 
        predictions_valid = model.predict(features_valid) 
        result = mean_squared_error(target_valid, predictions_valid)**0.5 
        if result < best_result:
            best_result = result
            best_est = est
            best_depth = depth
            
print(best_est, best_depth, best_result)

2000 11 1928.582833550351


In [None]:
%%time
model = LGBMRegressor(n_estimators=2000, max_depth=11)
model.fit(features_train, target_train) 
predictions_valid = model.predict(features_valid) 
result = mean_squared_error(target_valid, predictions_valid)**0.5 
print('LGBMRegressor', result)

LGBMRegressor 1928.582833550351
CPU times: user 38.8 s, sys: 311 ms, total: 39.1 s
Wall time: 20 s


# Model analysis

We evaluate the models on the test set and measure how long training and prediction take.
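Outside a notebook, where the `%%time` cell magic is not available, the same three numbers can be collected with `time.perf_counter`. The helper below is a sketch under that assumption; `timed_evaluation` is a hypothetical name and the data are synthetic.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def timed_evaluation(model, X_train, y_train, X_test, y_test):
    """Fit and evaluate `model`, returning wall-clock times and the test RMSE."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_test)
    predict_seconds = time.perf_counter() - start

    rmse = mean_squared_error(y_test, predictions) ** 0.5
    return {'fit_s': fit_seconds, 'predict_s': predict_seconds, 'rmse': rmse}


# Synthetic linear data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=1000)

report = timed_evaluation(LinearRegression(), X[:800], y[:800], X[800:], y[800:])
print(report)
```

Single measurements fluctuate from run to run, so repeating the call a few times gives a more reliable picture.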

In [None]:
%%time
model = LinearRegression()
model.fit(features_train, target_train)

CPU times: user 677 ms, sys: 60 ms, total: 737 ms
Wall time: 450 ms


In [None]:
%%time
predictions = model.predict(features_test) 
result = mean_squared_error(target_test, predictions) ** 0.5 
print("LinearRegression", result)

LinearRegression 3841.780370810574
CPU times: user 36.1 ms, sys: 26.8 ms, total: 62.9 ms
Wall time: 31.5 ms


In [None]:
%%time
model = DecisionTreeRegressor(random_state=12345, max_depth=14) 
model.fit(features_train, target_train)

CPU times: user 1.1 s, sys: 89 ms, total: 1.19 s
Wall time: 1.1 s


In [None]:
%%time
predictions = model.predict(features_test)
result = mean_squared_error(target_test, predictions) ** 0.5
print('DecisionTreeRegressor', result)

DecisionTreeRegressor 2099.582348645584
CPU times: user 35.5 ms, sys: 2.01 ms, total: 37.5 ms
Wall time: 33.1 ms


In [None]:
%%time
model = RandomForestRegressor(random_state=12345, n_estimators=10, max_depth=17)
model.fit(features_train, target_train) 

CPU times: user 7.48 s, sys: 13 ms, total: 7.49 s
Wall time: 7.46 s


In [None]:
%%time
predictions = model.predict(features_test) 
result = mean_squared_error(target_test, predictions)**0.5 
print('RandomForestRegressor', result)

RandomForestRegressor 1971.2955291349315
CPU times: user 166 ms, sys: 2 ms, total: 168 ms
Wall time: 168 ms


In [None]:
%%time
model = LGBMRegressor(n_estimators=2000, max_depth=11)
model.fit(features_train, target_train) 

CPU times: user 26 s, sys: 281 ms, total: 26.3 s
Wall time: 13.5 s


In [None]:
%%time
predictions = model.predict(features_test) 
result = mean_squared_error(target_test, predictions)**0.5 
print('LGBMRegressor', result)

LGBMRegressor 1899.5044347231749
CPU times: user 13 s, sys: 14.9 ms, total: 13.1 s
Wall time: 6.73 s


In [None]:
# Summarize the results in a table
display(pd.DataFrame([['450 ms', '31.5 ms', '3842'], 
                    ['1.1 s', '33.1 ms', '2100'], 
                    ['7.46 s', '168 ms', '1971'],
                    ['13.5 s', '6.73 s', '1900']], 
                     columns=['Training time', 'Prediction time', 'RMSE'],
                    index = ['LinearRegression', 'DecisionTreeRegressor', 'RandomForestRegressor', 'LGBMRegressor']))

Unnamed: 0,Training time,Prediction time,RMSE
LinearRegression,450 ms,31.5 ms,3842
DecisionTreeRegressor,1.1 s,33.1 ms,2100
RandomForestRegressor,7.46 s,168 ms,1971
LGBMRegressor,13.5 s,6.73 s,1900


# General conclusion

The LGBMRegressor model achieves the best quality, i.e. the lowest RMSE on the test set (1900). It also has the longest training time (13.5 seconds) and by far the longest prediction time (6.73 seconds).
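Because the customer cares about speed as well as quality, it helps to put the trade-off in numbers. The sketch below re-enters the measured values from the results table as numeric columns (times converted to seconds; the column names are chosen for illustration) and compares LGBMRegressor with the runner-up, RandomForestRegressor.

```python
import pandas as pd

# Values from the results table, with times converted to seconds
results = pd.DataFrame(
    {
        'train_s': [0.450, 1.1, 7.46, 13.5],
        'predict_s': [0.0315, 0.0331, 0.168, 6.73],
        'rmse': [3842, 2100, 1971, 1900],
    },
    index=['LinearRegression', 'DecisionTreeRegressor',
           'RandomForestRegressor', 'LGBMRegressor'],
)

best = results['rmse'].idxmin()
rf = results.loc['RandomForestRegressor']
lgbm = results.loc['LGBMRegressor']

# Relative RMSE improvement and prediction slowdown of LightGBM versus the random forest
rmse_gain = 1 - lgbm['rmse'] / rf['rmse']
slowdown = lgbm['predict_s'] / rf['predict_s']

print(best)
print(f'RMSE lower by {rmse_gain:.1%}, prediction slower by a factor of {slowdown:.0f}')
```

On these measurements, LightGBM lowers the RMSE by about 3.6% relative to the random forest, while prediction takes roughly 40 times longer.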