# Car Base Regression

This part of the project applies cross-validation to validate a model. It builds on the results of the preprocessing analysis described in the file "CarBase_Regression.ipynb".

PyTorch does not provide a built-in cross-validation routine for training and validating a model, so we use the **Skorch** library for that. The [Skorch library](https://skorch.readthedocs.io/en/stable/index.html) wraps PyTorch models in the familiar, easy-to-follow scikit-learn interface, which lets us combine PyTorch's deep learning capabilities with scikit-learn tools. Training a neural network is a simple example: with plain PyTorch we must write the training loop ourselves and pass the data through each layer of the model. With Skorch, we define the network structure in a class and pass that class to Skorch's regressor or classifier, together with parameters such as the *learning rate*, the *optimization algorithm*, and the number of training epochs. Some examples of how to use the Skorch library can be seen [HERE](https://skorch.readthedocs.io/en/stable/user/quickstart.html).
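To make the contrast concrete, here is a minimal sketch of the kind of hand-written loop that plain PyTorch requires. The toy data, the small `nn.Sequential` model, and the learning rate are assumptions chosen for illustration; they are not the model or data used in this notebook.

```python
import torch
from torch import nn, optim

torch.manual_seed(123)

# Toy regression data (hypothetical): 64 samples, 3 features.
X = torch.randn(64, 3)
y = X.sum(dim=1, keepdim=True)

# A small stand-in model, not the network used later in this notebook.
model = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 1))
criterion = nn.L1Loss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

losses = []
for epoch in range(100):
    optimizer.zero_grad()          # reset gradients from the previous step
    loss = criterion(model(X), y)  # forward pass and loss computation
    loss.backward()                # backpropagation
    optimizer.step()               # parameter update
    losses.append(loss.item())

print('first loss: %.4f, last loss: %.4f' % (losses[0], losses[-1]))
```

Skorch runs an equivalent loop internally when `fit` is called, which is what allows sklearn utilities to drive the training.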

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import torch
from  torch import nn, optim
from skorch import NeuralNetRegressor
from sklearn.model_selection import cross_val_score

In [4]:
np.random.seed(123)
torch.manual_seed(123)

<torch._C.Generator at 0x254eb063750>

In [5]:
base = pd.read_csv('./autos.zip', encoding='ISO-8859-1')

In [6]:
base.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


In [7]:
# Features dropped based on their relevance to the project
base = base.drop(columns=['dateCrawled', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'])

# Features dropped based on the analysis made in the "CarBase_Regression.ipynb" file
base = base.drop(columns=['name', 'seller', 'offerType'])

In [8]:
base.shape

(371528, 15)

## Treating Outliers in the price Feature:

In [13]:
Q1 = base['price'].quantile(0.25)
Q3 = base['price'].quantile(0.75)

IQR = Q3-Q1

base_ = base[~((base['price'] < (Q1-1.5*IQR)) | (base['price'] > (Q3 + 1.5*IQR)))]

In [14]:
print('number of observations dropped from the dataset:\t %d (%.2f%%)' % (base.shape[0] - base_.shape[0], (base.shape[0] - base_.shape[0])/base.shape[0]*100))

number of observations dropped from the dataset:	 28108 (7.57%)
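The IQR rule used above keeps only the rows whose price lies between Q1 - 1.5·IQR and Q3 + 1.5·IQR. The sketch below applies the same rule to a small, made-up price series (the values are hypothetical) so that the fences can be checked by hand. For non-null prices, `Series.between` (inclusive on both ends) selects the same rows as the negated mask in the cell above.

```python
import pandas as pd

# Hypothetical prices: five ordinary values and one extreme value.
prices = pd.Series([100, 200, 300, 400, 500, 10_000], name='price')

q1 = prices.quantile(0.25)   # 225.0
q3 = prices.quantile(0.75)   # 475.0
iqr = q3 - q1                # 250.0

lower = q1 - 1.5 * iqr       # -150.0
upper = q3 + 1.5 * iqr       # 850.0

# Keep only the prices inside the fences.
kept = prices[prices.between(lower, upper)]
print(kept.tolist())         # [100, 200, 300, 400, 500]
```

In this toy example the lower fence is negative, so the rule can only remove values on the high side. Very low prices would need a separate check.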


Replacing null values in each column (the replacement values are based on the analysis in the file "CarBase_Regression.ipynb"):

In [20]:
values = {'vehicleType': 'limousine', 'gearbox': 'manuell', 'model': 'golf', 'fuelType': 'benzin', 'notRepairedDamage': 'nein'}

base_ = base_.fillna(value=values)
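When `fillna` receives a dictionary, it fills only the columns named in that dictionary and leaves nulls in every other column untouched. A small sketch on a made-up frame (the rows are hypothetical; the column names are taken from the dataset) illustrates this:

```python
import numpy as np
import pandas as pd

# Hypothetical rows with missing values in three columns.
toy = pd.DataFrame({
    'gearbox': ['manuell', np.nan, 'automatik'],
    'fuelType': [np.nan, 'diesel', 'benzin'],
    'powerPS': [75.0, np.nan, 163.0],
})

# Only 'gearbox' and 'fuelType' are listed, so 'powerPS' keeps its NaN.
filled = toy.fillna(value={'gearbox': 'manuell', 'fuelType': 'benzin'})
print(filled.isna().sum())
```

It can therefore be worth checking `base_.isna().sum()` after this step to confirm that no unexpected nulls remain.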

In [21]:
# Column 0 is the target (price); the remaining columns are the input features
features = base_.iloc[:, 1:].values
classes = base_.iloc[:, 0].values.reshape(-1, 1)

Encoding the categorical features with the One Hot Encoder technique:

In [22]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

oneHotEncoder = ColumnTransformer(transformers=[('OneHot', OneHotEncoder(), [0,1,3,5,8,9,10])], remainder='passthrough')

features = oneHotEncoder.fit_transform(features).toarray()
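The sketch below shows, on a tiny made-up array, how the `ColumnTransformer` lays out its output: the one-hot columns come first, with categories in sorted order, followed by the passthrough columns. The toy rows are hypothetical. Note that `ColumnTransformer` returns a sparse matrix only when the output density is below its `sparse_threshold` (0.3 by default), so the sketch checks for sparsity before calling `.toarray()`. On the full dataset the output is sparse, which is why the cell above can call `.toarray()` directly.

```python
import numpy as np
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: two categorical columns and one numeric column.
toy = np.array([
    ['test', 'golf', 1993],
    ['control', 'fabia', 2008],
    ['test', 'golf', 2001],
], dtype=object)

encoder = ColumnTransformer(
    transformers=[('OneHot', OneHotEncoder(), [0, 1])],
    remainder='passthrough',
)

encoded = encoder.fit_transform(toy)
if sparse.issparse(encoded):      # dense output is returned for small, dense data
    encoded = encoded.toarray()
encoded = encoded.astype('float32')

# 2 categories + 2 categories + 1 passthrough column = 5 columns
print(encoded.shape)
print(encoded[0])
```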

In [25]:
features = features.astype('float32')
classes = classes.astype('float32')

## Creating the Neural Network Model:

Now that we have prepared our data in the "Preprocessing data" phase, we will create our neural network model so that we can predict the price of a vehicle from the parameters given to the model.

As explained in the skorch documentation, the neural network model must be defined as a class, which is then passed as a parameter to the regressor used by skorch.

So, first, we must create our class:

In [26]:
# Model Structure: 315 -> 158 -> 158 -> 1
# Hidden Layer:    (Input + Output)/2 = 158

class neuralNetTrain(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense0 = nn.Linear(315, 158)
        self.dense1 = nn.Linear(158, 158)
        self.Output = nn.Linear(158, 1)
        self.activation0 = nn.ReLU()

    def forward(self, X):
        X = self.dense0(X)
        X = self.activation0(X)
        X = self.dense1(X)
        X = self.activation0(X)
        X = self.Output(X)

        return X
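The input size of 315 is hard-coded and must match the number of columns produced by the one-hot encoding step (`features.shape[1]`). As a hedged sketch, the variant below takes the input size as a constructor argument and derives the hidden size with the same (Input + Output)/2 rule. The class name `PriceNet` and the argument `n_features` are hypothetical names introduced only for this illustration. A forward pass on random input is a quick way to confirm the output shape before training.

```python
import torch
from torch import nn


class PriceNet(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, n_features=315):
        super().__init__()
        n_hidden = (n_features + 1) // 2   # (Input + Output)/2, 158 for 315 inputs
        self.dense0 = nn.Linear(n_features, n_hidden)
        self.dense1 = nn.Linear(n_hidden, n_hidden)
        self.output = nn.Linear(n_hidden, 1)
        self.activation = nn.ReLU()

    def forward(self, X):
        X = self.activation(self.dense0(X))
        X = self.activation(self.dense1(X))
        return self.output(X)


model = PriceNet(n_features=315)
batch = torch.randn(4, 315)        # 4 random samples with 315 features each
out = model(batch)
print(out.shape)                   # torch.Size([4, 1])
```

With skorch, constructor arguments like this can be forwarded through the `module__` prefix, for example `module__n_features=features.shape[1]`.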


Now that we have created our neural network class with its structure, we instantiate the neural network regressor, passing some parameters such as the NN structure to be optimized, the optimization algorithm, the number of epochs, and so on.

In [27]:
nnreg = NeuralNetRegressor(
    module=neuralNetTrain,
    criterion=torch.nn.L1Loss,
    optimizer=torch.optim.Adam, 
    batch_size=300, 
    max_epochs=100,
    device='cuda' if torch.cuda.is_available() else 'cpu',  # fall back to CPU when no GPU is present
    train_split = False)

Now that our neural network regressor is created, we can call the cross-validation function from sklearn and train our model with this technique.

In [31]:
results = cross_val_score(nnreg, features, classes, cv=5, scoring='neg_mean_absolute_error')

  epoch    train_loss     dur
-------  ------------  ------
      1     2498.4433  9.6925
      2     2137.1967  8.8871
      3     2059.0723  8.5646
      4     2016.5721  8.5125
      5     1989.0098  8.5587
      6     1945.3866  8.8191
      7     1942.6489  8.6033
      8     1921.8457  8.5985
      9     1898.5018  11.5820
     10     1883.5594  12.3071
     11     1863.4250  8.6000
     12     1843.5910  8.4911
     13     1826.1203  8.7392
     14     1817.9483  8.5344
     15     1799.3464  8.6623
     16     1778.0419  8.3195
     17     1784.6129  8.5245
     18     1757.6450  8.5236
     19     1753.2489  9.2406
     20     1762.1706  8.6955
     21     1750.7953  8.5539
     22     1743.8013  8.5469
     23     1745.1796  8.4181
     24     1719.9659  8.5192
     25     1744.8844  8.5527
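To show what `cross_val_score` does with `cv=5`, the sketch below writes out the fold loop by hand and compares it with the library call. It uses synthetic data and `LinearRegression` as a lightweight stand-in for the skorch regressor; both are assumptions made so that the example runs quickly. For a regressor, an integer `cv` corresponds to an unshuffled `KFold`, and a fresh copy of the estimator is fitted on each training split, which is why the skorch regressor prints one training table per fold.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (hypothetical), not the car dataset.
rng = np.random.default_rng(123)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

estimator = LinearRegression()  # stand-in for the skorch regressor

manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    model = clone(estimator).fit(X[train_idx], y[train_idx])    # fresh model per fold
    mae = mean_absolute_error(y[test_idx], model.predict(X[test_idx]))
    manual_scores.append(-mae)                                   # sklearn reports negated MAE
manual_scores = np.array(manual_scores)

auto_scores = cross_val_score(estimator, X, y, cv=5,
                              scoring='neg_mean_absolute_error')

print(np.allclose(manual_scores, auto_scores))
```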

## Model Evaluation:

In [32]:
mean = results.mean()
std_dev = results.std()

In [33]:
print('mean score:\t%0.05f \nstandard deviation: \t%0.05f' %(mean, std_dev))

mean score:	-1758.99980 
standard deviation: 	44.63939
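The mean score is negative because sklearn scorers follow a "higher is better" convention, so `neg_mean_absolute_error` returns the MAE multiplied by -1. Flipping the sign gives the error in the units of the price column: a mean score of -1758.99980 corresponds to a mean absolute error of roughly 1759. The sketch below shows the conversion on made-up fold scores (the values are hypothetical and are not the results of this notebook).

```python
import numpy as np

# Hypothetical fold scores, for illustration only.
scores = np.array([-1700.0, -1800.0, -1750.0])

mae_per_fold = -scores            # flip the sign to recover the MAE per fold
mean_mae = mae_per_fold.mean()    # 1750.0
std_mae = mae_per_fold.std()      # unchanged by the sign flip

print('mean MAE:\t%0.05f \nstandard deviation: \t%0.05f' % (mean_mae, std_mae))
```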
