# Used Car Models

### In this notebook:
I will attempt to use car details to predict the price of a used car. I will be using data that I cleaned in another notebook.

In [1]:
import time

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import sklearn
from sklearn.model_selection import train_test_split

from sklearn import ensemble
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn import tree
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import GridSearchCV

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)

1.12.0


In [2]:
# Bringing in the data from the other notebook
autos_import = pd.read_csv("autos_export.csv", index_col=0)

# The newly scraped data
new_used_cars = pd.read_csv('new_used_cars.csv', index_col=0)

In [3]:
new_used_cars.shape

(180, 11)

In [4]:
# Sampling down because my computer is weak
autos_sample = autos_import.sample(n=25000, random_state=24)

# Combining the two datasets
autos = pd.concat([autos_sample, new_used_cars], axis=0)

In [5]:
autos.head()

Unnamed: 0,age,brand,fuelType,gearbox,kilometer,model,name,notRepairedDamage,powerPS,price,vehicleType
135357,14.0,ford,benzin,manuell,150000,fiesta,Ford_Fiesta_Mod.2005,nein,60.0,1830.0,kleinwagen
57557,6.0,opel,benzin,manuell,50000,astra,Opel_Astra_1_4_Sports_Tourer_Kombi_Limosine__T...,nein,140.0,12500.0,limousine
30729,14.0,bmw,diesel,automatik,150000,x_reihe,BMW_X3_3.0_D,nein,204.0,10500.0,suv
249210,11.0,seat,benzin,manuell,125000,ibiza,Seat_Ibiza_1.2_12V__Klimaautomatik,blank,69.0,2800.0,kleinwagen
159151,11.0,suzuki,benzin,manuell,80000,swift,Suzuki_Swift_mit_neuem_Tuev,blank,92.0,3999.0,kleinwagen


In [6]:
# Looking good
autos.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25180 entries, 135357 to 255
Data columns (total 11 columns):
age                  25180 non-null float64
brand                25180 non-null object
fuelType             25180 non-null object
gearbox              25180 non-null object
kilometer            25180 non-null int64
model                25180 non-null object
name                 25180 non-null object
notRepairedDamage    25180 non-null object
powerPS              25180 non-null float64
price                25180 non-null float64
vehicleType          25180 non-null object
dtypes: float64(3), int64(1), object(7)
memory usage: 2.3+ MB


In [7]:
# Splitting categorical and continuous data so I can make dummies
autos_cat = autos.select_dtypes(exclude=['float', 'int'])
autos_cont = autos.select_dtypes(include=['float', 'int'])

In [8]:
# Creating dummy variabels out of the categorical variables
autos_dummies = autos_cat.drop(columns=['name'])
autos_dummies = pd.get_dummies(autos_dummies)

In [9]:
combined_datasets = pd.concat([autos_cont.drop(columns=['price']), autos_dummies], axis=1)

In [10]:
features = combined_datasets.iloc[:(len(combined_datasets)-180)]
outcome = autos.price.iloc[:(len(combined_datasets)-180)]

In [11]:
later_features = combined_datasets.iloc[(len(combined_datasets)-180):]
later_outcome = autos.price.iloc[(len(combined_datasets)-180):]

In [12]:
X_train, X_test, y_train, y_test = train_test_split(features, 
                                                    outcome,
                                                    test_size=.3,
                                                    random_state=24)

In [13]:
X_train.head()

Unnamed: 0,age,kilometer,powerPS,brand_alfa romeo,brand_alfa_romeo,brand_audi,brand_bmw,brand_chevrolet,brand_chrysler,brand_citroen,...,notRepairedDamage_ja,notRepairedDamage_nein,vehicleType_andere,vehicleType_bus,vehicleType_cabrio,vehicleType_coupe,vehicleType_kleinwagen,vehicleType_kombi,vehicleType_limousine,vehicleType_suv
277428,22.0,150000,75.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
147288,21.0,150000,139.0,0,0,0,1,0,0,0,...,0,1,0,0,1,0,0,0,0,0
271765,26.0,150000,150.0,0,0,1,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
179659,13.0,150000,116.0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
108415,10.0,125000,170.0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0


### Scaling
StandardScaler is useful if there are large outliers. But I dealt with that thoroughly in my cleaning process. MinMaxScaler is useful for the neural networks that I will be trying later so I will use that.

In [14]:
# MinMaxScaler
mms = sklearn.preprocessing.MinMaxScaler()
X_train_scaled = mms.fit_transform(X_train)
X_test_scaled = mms.fit_transform(X_test)

  return self.partial_fit(X, y)
  return self.partial_fit(X, y)


### K Nearest Neighbors

In [15]:
# Using default parameters
start = time.time()

knn = sklearn.neighbors.KNeighborsRegressor()
knn.fit(X_train_scaled, y_train)
knn_preds = knn.predict(X_test_scaled)
print(sklearn.metrics.r2_score(y_test, knn_preds))

knn_time = time.time() - start
print(knn_time)

0.6293126575672813
25.38869285583496


### Random Forest

In [16]:
# Using default parameters
start = time.time()

rfc = ensemble.RandomForestRegressor(random_state=24)
rfc.fit(X_train, y_train)
rfc_preds = rfc.predict(X_test)
print(sklearn.metrics.r2_score(y_test, rfc_preds))

rfc_time = time.time() - start
print(rfc_time)



0.8177188967137915
2.5202600955963135


### Gradient Boosting

In [17]:
# Using default parameters
start = time.time()

gbc = ensemble.GradientBoostingRegressor()
gbc.fit(X_train, y_train)
gbc_preds = gbc.predict(X_test)
print(sklearn.metrics.r2_score(y_test, gbc_preds))

gbc_time = time.time() - start
print(gbc_time)

0.7818630029084347
6.794547080993652


Random Forest and Gradient boosting are significantly faster and more accurate than KNN so I will focus in on those.

### Optimized Random Forest

In [18]:
# rfr_params =[{'n_estimators':[10, 90, 100, 200, 300, 400],
#              'max_features' : ['auto', 'sqrt', 'log2'],
#              'min_sample_split' [2, 10]
# }]
# I commented this out because it takes forever to run. It chose the defult parameters with 'n_estimators' = 300

In [19]:
# rfr_search = GridSearchCV(estimator=rfc, param_grid=rfr_params, scoring='r2', cv=5)
# rfr_search.fit(X_train, y_train)
# tuned_rfr = rfr_search.best_estimator_

In [23]:
tuned_rfr = ensemble.RandomForestRegressor(n_estimators=300, random_state=24)
tuned_rfr.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=None,
           oob_score=False, random_state=24, verbose=0, warm_start=False)

In [25]:
tuned_rfr_preds = tuned_rfr.predict(X_test)
print(sklearn.metrics.r2_score(y_test, tuned_rfr_preds))

0.8221253693854778


### Optimized Gradient Boosting

In [None]:
# gbc_params = [{'learning_rate' : [.05, .09, .1, .15, .25, .3], 
#                'n_estimators' : [100, 150, 200, 250, 350, 400, 500, 600]
# }]

# I commented this out so I would not have to run it again and again. It chose learning_rate=0.25 and
# n_estimators=400

In [None]:
# gbc_search = GridSearchCV(estimator=gbc, param_grid=gbc_params, scoring='r2', cv=5)
# gbc_search.fit(X_train, y_train)
# tuned_gbc = gbc_search.best_estimator_

In [27]:
# Ended up choosing learning rate of .25 and 400 estimators
tuned_gbr = ensemble.GradientBoostingRegressor(learning_rate=0.25, n_estimators=400)
tuned_gbr.fit(X_train, y_train)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.25, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=400, n_iter_no_change=None, presort='auto',
             random_state=None, subsample=1.0, tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)

In [29]:
tuned_gbr_preds = tuned_gbr.predict(X_test)
print(sklearn.metrics.r2_score(y_test, tuned_gbr_preds))

0.8150663624650674


## Scikit Learn modeling Summary
- Gradient Boosting and Random Forest outperformed other models for predicting price.
- Once tuned, they improved to achieve r2 scores of roughly .82

## Modeling with TensorFlow

X_train_scaled.shape

def build_model():
    model = keras.Sequential([
        layers.Dense(160, activation=tf.nn.relu, input_shape=[X_train_scaled.shape[1]]),
        layers.Dense(80, activation=tf.nn.relu),
        layers.Dense(40, activation=tf.nn.relu),
        layers.Dense(20, activation=tf.nn.relu),
        layers.Dense(1)
    ])
    
    optimizer = tf.keras.optimizers.RMSprop(0.001)
    
    model.compile(loss='mean_squared_error',
                 optimizer=optimizer,
                 metrics=['mean_absolute_error', 'mean_squared_error'])
    return model

model = build_model()

model.summary()

example_batch = X_train_scaled[:10]

example_result = model.predict(example_batch)
example_result

class PrintDot(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs):
        if epoch % 100 == 0: print('')
        print('.', end='')
        
EPOCHS = 30

history = model.fit(X_train_scaled, y_train, epochs=EPOCHS, validation_split=.2, verbose=1,
                   callbacks=[PrintDot()])

# Modeling the Newly Scraped Data
At this point I have created some models to predict the price of cars on the large eBay used cars dataset that I found on Kaggle.com. I also scraped a new dataset, cleaned it, and preprocessed it the same way as the old dataset. I am now ready to run some predictions on this new dataset.

My random forest regressor was the most accurate predictor so I will be using that to run my predictions now.

In [30]:
final_predictions = tuned_rfr.predict(later_features)

In [32]:
# Let's see how close my predictions came to the actual sale prices
sklearn.metrics.r2_score(later_outcome, final_predictions)

0.743358854826337

That's pretty good for how small the datasets were that I used. The data science product will be finding which of these cars are underpriced. That way we can buy them and make a profit.

In [36]:
# Subtracting the sale price from the car's value. Higher values represent good deals on cars.
final_predictions - later_outcome

0      -1019.502222
1     -53261.563333
2       -140.273475
3        403.901667
4       1260.960873
5      -4083.026667
6      -5744.880000
7        994.857683
8       1426.738333
9         35.620952
10      2625.122152
12      1904.677690
13     -5372.270000
15     -2233.250556
16      3886.085777
17      -968.229778
18      5036.201905
20      3513.147460
21      -878.315885
23       569.093440
25      -537.464722
27      2880.332111
29     -2342.973103
30       235.640016
33      2084.670856
37      1181.211111
38      4323.344444
39      -221.623901
41      2121.014877
42       235.640016
           ...     
212    -1979.678611
213     1024.683333
215     2585.793981
216     -358.571111
217     -608.803889
218      529.580524
220       51.230889
222      733.077612
224    -2381.461667
225     2050.482167
228     1465.349667
230     3293.319907
233      961.625288
234     1326.897778
235      447.205413
236     3139.433333
237     1195.637341
239       50.722222
242     3199.816359
