# Pricing a Used Toyota Corolla
### Shan Wang
### 29 August, 2021

## 1 Loading and Preprocessing the Data Set

Let us read the data set:

In [1]:
import pandas as pd

toyotaDF = pd.read_csv('ToyotaCorolla.csv', encoding = 'GBK' )
toyotaDF.info()
toyotaDF.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1436 entries, 0 to 1435
Data columns (total 39 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Id                 1436 non-null   int64 
 1   Model              1436 non-null   object
 2   Price              1436 non-null   int64 
 3   Age_08_04          1436 non-null   int64 
 4   Mfg_Month          1436 non-null   int64 
 5   Mfg_Year           1436 non-null   int64 
 6   KM                 1436 non-null   int64 
 7   Fuel_Type          1436 non-null   object
 8   HP                 1436 non-null   int64 
 9   Met_Color          1436 non-null   int64 
 10  Color              1436 non-null   object
 11  Automatic          1436 non-null   int64 
 12  CC                 1436 non-null   int64 
 13  Doors              1436 non-null   int64 
 14  Cylinders          1436 non-null   int64 
 15  Gears              1436 non-null   int64 
 16  Quarterly_Tax      1436 non-null   int64 


Unnamed: 0,Id,Model,Price,Age_08_04,Mfg_Month,Mfg_Year,KM,Fuel_Type,HP,Met_Color,...,Powered_Windows,Power_Steering,Radio,Mistlamps,Sport_Model,Backseat_Divider,Metallic_Rim,Radio_cassette,Parking_Assistant,Tow_Bar
0,1,TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors,13500,23,10,2002,46986,Diesel,90,1,...,1,1,0,0,0,1,0,0,0,0
1,2,TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors,13750,23,10,2002,72937,Diesel,90,1,...,0,1,0,0,0,1,0,0,0,0
2,3,TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors,13950,24,9,2002,41711,Diesel,90,1,...,0,1,0,0,0,1,0,0,0,0
3,4,TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors,14950,26,7,2002,48000,Diesel,90,0,...,0,1,0,0,0,1,0,0,0,0
4,5,TOYOTA Corolla 2.0 D4D HATCHB SOL 2/3-Doors,13750,30,3,2002,38500,Diesel,90,0,...,1,1,0,1,0,1,0,0,0,0


We will use predictors Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco, Automatic_airco, CD_Player, Powered_Windows, Sport_Model and Tow_Bar to predict Price.

So we drop the irrelevant variables.

In [2]:
toyota = toyotaDF.drop(toyotaDF.columns[[0,1,4,5,9,10,12,14,15,17,19,21,22,23,26,28,30,31,32,34,35,36,37]], axis = 1)
toyota

Unnamed: 0,Price,Age_08_04,KM,Fuel_Type,HP,Automatic,Doors,Quarterly_Tax,Mfr_Guarantee,Guarantee_Period,Airco,Automatic_airco,CD_Player,Powered_Windows,Sport_Model,Tow_Bar
0,13500,23,46986,Diesel,90,0,3,210,0,3,0,0,0,1,0,0
1,13750,23,72937,Diesel,90,0,3,210,0,3,1,0,1,0,0,0
2,13950,24,41711,Diesel,90,0,3,210,1,3,0,0,0,0,0,0
3,14950,26,48000,Diesel,90,0,3,210,1,3,0,0,0,0,0,0
4,13750,30,38500,Diesel,90,0,3,210,1,3,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1431,7500,69,20544,Petrol,86,0,3,69,1,3,1,0,0,1,1,0
1432,10845,72,19000,Petrol,86,0,3,69,0,3,0,0,0,0,1,0
1433,8500,71,17016,Petrol,86,0,3,69,0,3,0,0,0,0,0,0
1434,7250,70,16916,Petrol,86,0,3,69,1,3,0,0,0,0,0,0
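
Dropping columns by position is brittle if the column order of the file ever changes. As an alternative sketch, the columns to keep can be selected by name. The small DataFrame `demo` below is a hypothetical stand-in for `toyotaDF` (a few columns and the first two rows shown above), and `keep_cols` is a name introduced here for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for toyotaDF: a few columns, first two rows from the output above
demo = pd.DataFrame({
    'Id': [1, 2],
    'Price': [13500, 13750],
    'Age_08_04': [23, 23],
    'Mfg_Month': [10, 10],
    'KM': [46986, 72937],
    'Fuel_Type': ['Diesel', 'Diesel'],
})

# Columns to keep, listed by name (an illustrative subset: the label plus three predictors)
keep_cols = ['Price', 'Age_08_04', 'KM', 'Fuel_Type']
selected = demo[keep_cols]
print(selected.columns.tolist())
```

Selecting by name fails loudly with a `KeyError` if a column is missing, whereas positional indices would silently pick the wrong column.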


Create dummy variables for Fuel_Type.

In [3]:
toyota.Fuel_Type.value_counts()

Petrol    1264
Diesel     155
CNG         17
Name: Fuel_Type, dtype: int64

In [4]:
FuelType = pd.get_dummies(toyota.Fuel_Type, prefix = 'Fuel_Type')
FuelType

Unnamed: 0,Fuel_Type_CNG,Fuel_Type_Diesel,Fuel_Type_Petrol
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,0,1,0
...,...,...,...
1431,0,0,1
1432,0,0,1
1433,0,0,1
1434,0,0,1


In [5]:
# Attach the dummies, then drop the original column and the Petrol dummy (Petrol is the baseline)
toyota = toyota.join(FuelType)
toyota = toyota.drop(['Fuel_Type', 'Fuel_Type_Petrol'], axis=1)
toyota

Unnamed: 0,Price,Age_08_04,KM,HP,Automatic,Doors,Quarterly_Tax,Mfr_Guarantee,Guarantee_Period,Airco,Automatic_airco,CD_Player,Powered_Windows,Sport_Model,Tow_Bar,Fuel_Type_CNG,Fuel_Type_Diesel
0,13500,23,46986,90,0,3,210,0,3,0,0,0,1,0,0,0,1
1,13750,23,72937,90,0,3,210,0,3,1,0,1,0,0,0,0,1
2,13950,24,41711,90,0,3,210,1,3,0,0,0,0,0,0,0,1
3,14950,26,48000,90,0,3,210,1,3,0,0,0,0,0,0,0,1
4,13750,30,38500,90,0,3,210,1,3,1,0,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1431,7500,69,20544,86,0,3,69,1,3,1,0,0,1,1,0,0,0
1432,10845,72,19000,86,0,3,69,0,3,0,0,0,0,1,0,0,0
1433,8500,71,17016,86,0,3,69,0,3,0,0,0,0,0,0,0,0
1434,7250,70,16916,86,0,3,69,1,3,0,0,0,0,0,0,0,0
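
The reason for dropping `Fuel_Type_Petrol` is that the three dummies always sum to one, so one of them is redundant; the dropped level becomes the baseline category. A minimal sketch on a toy series follows (the values in `fuel` are made up for illustration). Passing `dtype=int` is an assumption made here to obtain 0/1 columns regardless of the pandas version, since recent versions return booleans by default.

```python
import pandas as pd

# Toy fuel-type column (made-up values)
fuel = pd.Series(['Diesel', 'Petrol', 'CNG', 'Petrol'], name='Fuel_Type')

# One 0/1 column per level; columns are ordered alphabetically by level
dummies = pd.get_dummies(fuel, prefix='Fuel_Type', dtype=int)

# Each row has exactly one dummy equal to 1, so one column carries no extra information
row_sums = dummies.sum(axis=1)

# Drop the Petrol dummy: a row of all zeros now means "Petrol"
encoded = dummies.drop(columns=['Fuel_Type_Petrol'])
print(encoded)
```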


For the label Price, we usually take the logarithm.

In [6]:
import numpy as np

toyota['Price'] = np.log(toyota['Price'])
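
Because the model is trained on log prices, its predictions have to be mapped back with the exponential before they can be read as prices. A small sketch using the first three prices from the table above:

```python
import numpy as np

# First three prices from the data set
prices = np.array([13500.0, 13750.0, 13950.0])

# Forward transformation applied to the label
log_prices = np.log(prices)

# Inverse transformation: exp undoes log
recovered = np.exp(log_prices)

print(log_prices)
print(recovered)
```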

Split the data set into a training set and a validation set. We use the function `train_test_split()` here.

In [7]:
from sklearn.model_selection import train_test_split

X = toyota.drop(['Price'], axis = 1)
y = toyota['Price']
X_train0, X_test0, y_train0, y_test0 = train_test_split(X, y, test_size = 0.3, random_state = 1)
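
With `test_size = 0.3` and 1436 rows, the split leaves 1005 rows for training and 431 for validation. The sketch below checks these sizes on a placeholder array of the same length (`rows` is not the real data, only row numbers):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder with as many rows as the Toyota data set
rows = np.arange(1436).reshape(-1, 1)

# Same split settings as in the cell above
train_rows, test_rows = train_test_split(rows, test_size=0.3, random_state=1)

print(train_rows.shape, test_rows.shape)
```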

We first use `MinMaxScaler` from `sklearn.preprocessing` to scale every variable to the range 0-1. The scalers are fitted on the training set only and stored in `scalerX` and `scalerY`; their `transform` method then applies the same transformation to both the training and the validation data. Here X denotes the predictors and y the label.

In [8]:
from sklearn.preprocessing import MinMaxScaler

# Min-max normalization; use a separate scaler for X and for y,
# so that fitting one does not overwrite the other
scalerX = MinMaxScaler().fit(X_train0)
X_train = scalerX.transform(X_train0)
X_test = scalerX.transform(X_test0)

scalerY = MinMaxScaler().fit(y_train0.values.reshape(-1, 1))
y_train = scalerY.transform(y_train0.values.reshape(-1, 1))
y_test = scalerY.transform(y_test0.values.reshape(-1, 1))
print('X_test:', X_test.shape, '\n',  X_test[:3,:],'\n', 'y_test:', y_test.shape, '\n', y_test[:3,:])

X_test: (431, 16) 
 [[0.56962025 0.23052816 0.22764228 0.         1.         0.25
  1.         0.         0.         0.         1.         1.
  0.         0.         0.         0.        ]
 [0.65822785 0.34257467 0.33333333 0.         1.         0.25
  1.         0.         0.         0.         0.         0.
  0.         0.         0.         0.        ]
 [0.44303797 0.15569741 0.33333333 0.         1.         0.25
  0.         0.         1.         0.         1.         1.
  0.         1.         0.         0.        ]] 
 y_test: (431, 1) 
 [[0.45676658]
 [0.45653844]
 [0.44987618]]
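
`MinMaxScaler` maps each column to (x - min) / (max - min), where min and max are taken from the data the scaler was fitted on. Because the fit uses the training set only, validation values can fall slightly outside 0-1. A toy sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training and validation values for a single column
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [35.0]])

# Fit on the training data only: min = 10, max = 30
scaler = MinMaxScaler().fit(train)

train_scaled = scaler.transform(train)   # 0, 0.5, 1
test_scaled = scaler.transform(test)     # 0.25 and 1.25 (35 exceeds the training max)

# inverse_transform recovers the original values
restored = scaler.inverse_transform(test_scaled)
print(train_scaled.ravel(), test_scaled.ravel(), restored.ravel())
```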


This completes the preprocessing.

## 2 Training the Deep Neural Network - An MLP

In [9]:
import torch
import torch.nn as nn
# Convert the NumPy arrays to float32 tensors
x_trainNN = torch.tensor(X_train, dtype=torch.float32)
y_trainNN = torch.tensor(y_train, dtype=torch.float32)
x_testNN = torch.tensor(X_test, dtype=torch.float32)
y_testNN = torch.tensor(y_test, dtype=torch.float32)

In [10]:
# Build the network: An MLP
myNet = nn.Sequential(
    nn.Linear(16, 15),
    nn.Tanh(),
    nn.Linear(15, 25),
    nn.ReLU(),
    nn.Linear(25,1)
    
)
print(myNet)

Sequential(
  (0): Linear(in_features=16, out_features=15, bias=True)
  (1): Tanh()
  (2): Linear(in_features=15, out_features=25, bias=True)
  (3): ReLU()
  (4): Linear(in_features=25, out_features=1, bias=True)
)
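
To get a feel for the size of this network, the trainable parameters can be counted layer by layer: each `nn.Linear(m, n)` holds m × n weights plus n biases. The sketch below rebuilds the same architecture under the hypothetical name `net` and counts them:

```python
import torch.nn as nn

# Same architecture as myNet above
net = nn.Sequential(
    nn.Linear(16, 15),   # 16*15 + 15 = 255 parameters
    nn.Tanh(),
    nn.Linear(15, 25),   # 15*25 + 25 = 400 parameters
    nn.ReLU(),
    nn.Linear(25, 1),    # 25*1 + 1 = 26 parameters
)

# Total number of trainable parameters: 255 + 400 + 26 = 681
n_params = sum(p.numel() for p in net.parameters())
print(n_params)
```

With about a thousand training rows and a few hundred parameters, the model is small enough to train on the full batch at once.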


In [11]:
# Define the optimizer and loss function
optimizer = torch.optim.SGD(myNet.parameters(), lr=0.05)
loss_func = nn.MSELoss()

In [13]:
# Train the MLP with full-batch gradient descent
for epoch in range(1000):
    out = myNet(x_trainNN)
    loss = loss_func(out, y_trainNN)  # compute the loss
    optimizer.zero_grad()  # clear the gradients
    loss.backward()  # backpropagate
    optimizer.step()  # update the weights
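
The loop above does not report the loss, so it is hard to tell whether 1000 epochs are enough. A minimal sketch that records the loss at every epoch is given below. It runs on synthetic data: the tensors `X_demo` and `y_demo`, the smaller network `net` and the 200 epochs are assumptions made for illustration, not the settings used in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic data: 100 rows with 16 features in [0, 1]; the target is the row mean
X_demo = torch.rand(100, 16)
y_demo = X_demo.mean(dim=1, keepdim=True)

# A small MLP, an SGD optimizer and MSE loss, as in the cells above
net = nn.Sequential(nn.Linear(16, 15), nn.Tanh(), nn.Linear(15, 1))
optimizer = torch.optim.SGD(net.parameters(), lr=0.05)
loss_func = nn.MSELoss()

losses = []
for epoch in range(200):
    out = net(X_demo)                # forward pass on the full batch
    loss = loss_func(out, y_demo)    # compute the loss
    optimizer.zero_grad()            # clear the gradients
    loss.backward()                  # backpropagate
    optimizer.step()                 # update the weights
    losses.append(loss.item())       # keep the loss as a plain float

print(f'first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}')
```

Plotting or printing `losses` shows whether the curve has flattened out or is still falling when training stops.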

## 3 Performance on the training set

Use the trained model to predict the training set. Calling `myNet(x_trainNN)` runs a forward pass of the network on the training inputs.

In [14]:
y_pred = myNet(x_trainNN).detach()  # detach from the autograd graph
y_pred = y_pred.numpy()

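The goodness of fit can be measured with the function `explained_variance_score`. This is the share of the label's variance that the model explains, not a classification accuracy, even though the code prints it under the name "Accuracy".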

In [15]:
from sklearn.metrics import explained_variance_score

acc = explained_variance_score(y_true=y_train, y_pred=y_pred)
print('Accuracy',acc)

Accuracy 0.8413021461656649
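
Explained variance is defined as 1 - Var(y - y_pred) / Var(y), so it ignores any constant offset in the predictions. `r2_score`, also in `sklearn.metrics`, does penalize such an offset. A toy comparison with made-up numbers:

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

# Made-up labels and predictions that are off by a constant +1
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_shifted = y_true + 1.0

# The residuals are constant, so their variance is 0 and the score is 1
ev = explained_variance_score(y_true, y_shifted)

# R^2 = 1 - SS_res / SS_tot = 1 - 4 / 5 = 0.2
r2 = r2_score(y_true, y_shifted)

print(ev, r2)
```

When the two scores are close, the predictions have little systematic bias; a large gap points to a constant over- or under-prediction.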


Error metrics such as MSE, RMSE and MAE can be computed with the `sklearn.metrics` module:

In [16]:
from sklearn import metrics

def validation(y_true,y_pred):
    # MSE
    print('MSE:', metrics.mean_squared_error(y_true, y_pred)) 
    # RMSE
    print('RMSE:', np.sqrt(metrics.mean_squared_error(y_true, y_pred))) 
    # MAE
    print('MAE:', metrics.mean_absolute_error(y_true, y_pred)) 

validation(y_train,y_pred)

MSE: 0.0034668390544152603
RMSE: 0.05887986968748538
MAE: 0.04378607107236114


## 4 Predict the validation set

In [17]:
y_pred = myNet(x_testNN).detach()  # detach from the autograd graph
y_pred = y_pred.numpy()

As on the training set, we compute the explained variance with `explained_variance_score`, together with MSE, RMSE and MAE, this time on the validation set.

In [18]:
acc = explained_variance_score(y_true=y_test, y_pred=y_pred)
print('Accuracy',acc)
validation(y_test,y_pred)

Accuracy 0.843668308014645
MSE: 0.0033420965622137843
RMSE: 0.05781086889343374
MAE: 0.04534487305850327
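
The errors above are measured on the min-max scaled log price, so they are not in currency units. One way to express them in price units is to invert both transformations: first `inverse_transform` of the label scaler, then `np.exp`. The sketch below uses made-up prices and prediction errors; `scaler_y`, `pred_scaled` and the other names are hypothetical stand-ins for `scalerY` and `y_pred`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up prices, transformed the same way as the label: log, then min-max scaling
true_prices = np.array([7500.0, 10845.0, 13500.0, 14950.0])
log_prices = np.log(true_prices).reshape(-1, 1)
scaler_y = MinMaxScaler().fit(log_prices)
true_scaled = scaler_y.transform(log_prices)

# Pretend predictions on the scaled log scale (true values plus small made-up errors)
pred_scaled = true_scaled + np.array([[0.02], [-0.03], [0.01], [-0.02]])

# Undo the scaling, then undo the log
pred_prices = np.exp(scaler_y.inverse_transform(pred_scaled)).ravel()

# RMSE in the original price units
rmse_price = np.sqrt(np.mean((pred_prices - true_prices) ** 2))
print(pred_prices.round(0), round(rmse_price, 1))
```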
