## CSE5ML Lab 2 Part 1: Machine Learning with scikit-learn for Regression

This week, we are going to learn how to use some of the ML models in the scikit-learn package on regression and classification tasks. An important step before building a model is to preprocess the given dataset.

In this Part 1, we are going to solve a regression task.

We will be working on the vehicle dataset from cardekho. This dataset contains information about used cars listed on www.cardekho.com. Our goal is to build a price prediction model using regression models, so that the model can help set a suitable price for a given car.

The dataset consists of the following variables (Selling_Price is the target we want to predict; the others are independent variables):

1. Car_Name: Name of the car
2. Year: Year in which the car was bought
3. Selling_Price: Price at which the car is being sold
4. Kms_Driven: Number of kilometres the car has been driven
5. Fuel_Type: Fuel type of the car (petrol / diesel / CNG / LPG / electric)
6. Seller_Type: Whether the seller is an individual or a dealer
7. Transmission: Gear transmission of the car (Automatic/Manual)
8. Owner: Number of previous owners of the car
9. Mileage: Mileage of the car
10. Engine: Engine capacity of the car
11. Max_power: Maximum power of the engine
12. Seats: Number of seats in the car

### Load the dataset
Use pandas to load the CSV file "Car details.csv" provided on LMS, then check the dataset length and print the first 5 rows of the dataset.

In [1]:
import pandas as pd

dataset = pd.read_csv("Car details.csv")
print("dataset length:", len(dataset))
dataset.head()

dataset length: 7761


Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,seats
0,Maruti Swift Dzire VDI,2014,450000,145500,Diesel,Individual,Manual,First Owner,23.4 kmpl,1248 CC,74 bhp,5.0
1,Skoda Rapid 1.5 TDI Ambition,2014,370000,120000,Diesel,Individual,Manual,Second Owner,21.14 kmpl,1498 CC,103.52 bhp,5.0
2,Honda City 2017-2020 EXi,2006,158000,140000,Petrol,Individual,Manual,Third Owner,17.7 kmpl,1497 CC,78 bhp,5.0
3,Hyundai i20 Sportz Diesel,2010,225000,127000,Diesel,Individual,Manual,First Owner,23.0 kmpl,1396 CC,90 bhp,5.0
4,Maruti Swift VXI BSIII,2007,130000,120000,Petrol,Individual,Manual,First Owner,16.1 kmpl,1298 CC,88.2 bhp,5.0


### Preprocess the dataset

#### Drop columns that we are not going to use

In [2]:
# drop the columns that we are not going to use; here, we are dropping the car name
dataset.drop(['name'], axis=1, inplace=True)
dataset.head()

Unnamed: 0,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,seats
0,2014,450000,145500,Diesel,Individual,Manual,First Owner,23.4 kmpl,1248 CC,74 bhp,5.0
1,2014,370000,120000,Diesel,Individual,Manual,Second Owner,21.14 kmpl,1498 CC,103.52 bhp,5.0
2,2006,158000,140000,Petrol,Individual,Manual,Third Owner,17.7 kmpl,1497 CC,78 bhp,5.0
3,2010,225000,127000,Diesel,Individual,Manual,First Owner,23.0 kmpl,1396 CC,90 bhp,5.0
4,2007,130000,120000,Petrol,Individual,Manual,First Owner,16.1 kmpl,1298 CC,88.2 bhp,5.0


#### Dealing with missing values
##### 1) Identify whether there are any missing values in the dataset

In [3]:
# check if there are any missing values in the dataset
dataset.isna().sum()

year              0
selling_price     0
km_driven         0
fuel              0
seller_type       0
transmission      0
owner             0
mileage          19
engine           19
max_power        13
seats            19
dtype: int64

##### 2) Drop the rows that have missing values

In [4]:
# since only a very small number of rows have missing values, we can simply remove them
dataset = dataset.dropna()
print("dataset length:", len(dataset))

dataset length: 7742


#### Dealing with duplicated rows
##### 1) Check if there are any duplicated rows in the dataset

In [5]:
# check if there are any duplicated rows
dataset.duplicated().any()

True

##### 2) Remove duplicated rows

In [6]:
# remove duplicate
dataset = dataset.drop_duplicates()
print("dataset length:", len(dataset))

dataset length: 6539
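
The two cleaning steps above can be tried on a tiny example first. This is a minimal sketch on made-up data (the `toy` frame and its values are illustrative and not taken from "Car details.csv"):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value and one exact duplicate row
toy = pd.DataFrame({
    "km_driven": [1000, 1000, 5000, 7000],
    "seats": [5.0, 5.0, np.nan, 4.0],
})

print(toy.isna().sum())  # number of missing values per column

# drop rows with missing values first, then drop exact duplicates
cleaned = toy.dropna().drop_duplicates()
print(len(toy), "->", len(cleaned))
```

Both `dropna` and `drop_duplicates` return a new DataFrame, which is why the lab code assigns the result back to `dataset`.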


#### Remove the unit strings from the columns mileage, engine and max_power, then convert the columns to a numeric type

In [7]:
# take a dataframe and a column name, and return the column values as a list
# with the unit suffix removed (e.g. "23.4 kmpl" -> "23.4")
def remove_unit(df, column_name):
    values = []
    for value in df[column_name]:
        # keep only the text before the first space
        number = str(value).split(' ')[0]
        values.append(number)
    return values

dataset['mileage'] = remove_unit(dataset,'mileage')

# transform the column type to float
dataset['mileage'] = pd.to_numeric(dataset['mileage'])
dataset.head()

Unnamed: 0,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,seats
0,2014,450000,145500,Diesel,Individual,Manual,First Owner,23.4,1248 CC,74 bhp,5.0
1,2014,370000,120000,Diesel,Individual,Manual,Second Owner,21.14,1498 CC,103.52 bhp,5.0
2,2006,158000,140000,Petrol,Individual,Manual,Third Owner,17.7,1497 CC,78 bhp,5.0
3,2010,225000,127000,Diesel,Individual,Manual,First Owner,23.0,1396 CC,90 bhp,5.0
4,2007,130000,120000,Petrol,Individual,Manual,First Owner,16.1,1298 CC,88.2 bhp,5.0
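
The same unit stripping can also be written with pandas' vectorized string methods instead of a Python loop. The following is a sketch of that alternative on a small made-up frame (the `toy` data is an assumption for illustration), not a replacement for the lab's `remove_unit` function:

```python
import pandas as pd

# Hypothetical toy data mimicking the "value unit" strings in the lab dataset
toy = pd.DataFrame({
    "mileage": ["23.4 kmpl", "21.14 kmpl"],
    "engine": ["1248 CC", "1498 CC"],
})

for col in ["mileage", "engine"]:
    # keep the text before the first space, then convert it to a number
    toy[col] = pd.to_numeric(toy[col].str.split(" ").str[0])

print(toy.dtypes)
print(toy)
```

Note that `pd.to_numeric` picks the narrowest suitable type, so a column of whole numbers such as `engine` becomes an integer column while `mileage` becomes a float column.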


In [8]:
a= 1234
b="1234"

In [9]:
# do the same for engine and max_power and investigate the dataset
dataset['engine'] = remove_unit(dataset,'engine')
dataset['max_power'] = remove_unit(dataset,'max_power')

dataset['engine'] = pd.to_numeric(dataset['engine'])
dataset['max_power'] = pd.to_numeric(dataset['max_power'])
dataset.head()

Unnamed: 0,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,seats
0,2014,450000,145500,Diesel,Individual,Manual,First Owner,23.4,1248,74.0,5.0
1,2014,370000,120000,Diesel,Individual,Manual,Second Owner,21.14,1498,103.52,5.0
2,2006,158000,140000,Petrol,Individual,Manual,Third Owner,17.7,1497,78.0,5.0
3,2010,225000,127000,Diesel,Individual,Manual,First Owner,23.0,1396,90.0,5.0
4,2007,130000,120000,Petrol,Individual,Manual,First Owner,16.1,1298,88.2,5.0


#### Add an 'age' feature to capture how old the car is, and drop the 'year' feature since it is now redundant

In [10]:
dataset['age'] = 2022 - dataset['year']

# drop the year column with the same drop function that we used before
dataset.drop(['year'], axis = 1, inplace = True)

# take a look at the dataset afterwards
dataset.head()

Unnamed: 0,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,seats,age
0,450000,145500,Diesel,Individual,Manual,First Owner,23.4,1248,74.0,5.0,8
1,370000,120000,Diesel,Individual,Manual,Second Owner,21.14,1498,103.52,5.0,8
2,158000,140000,Petrol,Individual,Manual,Third Owner,17.7,1497,78.0,5.0,16
3,225000,127000,Diesel,Individual,Manual,First Owner,23.0,1396,90.0,5.0,12
4,130000,120000,Petrol,Individual,Manual,First Owner,16.1,1298,88.2,5.0,15


#### Get a summary of numerical columns

In [11]:
dataset.describe()

Unnamed: 0,selling_price,km_driven,mileage,engine,max_power,seats,age
count,6539.0,6539.0,6539.0,6539.0,6539.0,6539.0,6539.0
mean,530542.9,72696.97,19.517721,1430.571953,87.85761,5.433553,8.273589
std,513261.4,58693.17,4.04541,491.703071,31.685995,0.979437,3.810427
min,29999.0,1000.0,0.0,624.0,32.8,2.0,2.0
25%,250000.0,36000.0,16.8,1197.0,68.0,5.0,5.0
50%,425000.0,66444.0,19.64,1248.0,81.86,5.0,8.0
75%,650000.0,100000.0,22.54,1498.0,100.0,5.0,11.0
max,10000000.0,2360457.0,42.0,3604.0,400.0,14.0,28.0


#### Handling categorical variables
##### 1) check value count for the categorical variables
including column fuel, seller_type, transmission and owner

In [12]:
# check value count for the categorical variables
print(dataset.fuel.value_counts(),"\n")

#please do the same for seller_type, transmission and owner 
print(dataset.seller_type.value_counts(),"\n")
print(dataset.transmission.value_counts(), "\n")
print(dataset.owner.value_counts())

Diesel    3569
Petrol    2885
CNG         51
LPG         34
Name: fuel, dtype: int64 

Individual          5851
Dealer               661
Trustmark Dealer      27
Name: seller_type, dtype: int64 

Manual       5973
Automatic     566
Name: transmission, dtype: int64 

First Owner     4160
Second Owner    1886
Third Owner      493
Name: owner, dtype: int64


##### 2) Deal with ordinal variables
Transform the strings to numbers.

In [13]:
#Ordinal encoding
dataset['owner'] = dataset['owner'].replace({'First Owner': 1, 'Second Owner': 2, 'Third Owner': 3})
dataset.head()

Unnamed: 0,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,seats,age
0,450000,145500,Diesel,Individual,Manual,1,23.4,1248,74.0,5.0,8
1,370000,120000,Diesel,Individual,Manual,2,21.14,1498,103.52,5.0,8
2,158000,140000,Petrol,Individual,Manual,3,17.7,1497,78.0,5.0,16
3,225000,127000,Diesel,Individual,Manual,1,23.0,1396,90.0,5.0,12
4,130000,120000,Petrol,Individual,Manual,1,16.1,1298,88.2,5.0,15


##### 3) Deal with nominal variables
Transform the nominal variables into dummy variables:

In [14]:
dataset = pd.get_dummies(dataset, columns=['fuel', 'seller_type', 'transmission'])
dataset.head() 

Unnamed: 0,selling_price,km_driven,owner,mileage,engine,max_power,seats,age,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,seller_type_Dealer,seller_type_Individual,seller_type_Trustmark Dealer,transmission_Automatic,transmission_Manual
0,450000,145500,1,23.4,1248,74.0,5.0,8,0,1,0,0,0,1,0,0,1
1,370000,120000,2,21.14,1498,103.52,5.0,8,0,1,0,0,0,1,0,0,1
2,158000,140000,3,17.7,1497,78.0,5.0,16,0,0,0,1,0,1,0,0,1
3,225000,127000,1,23.0,1396,90.0,5.0,12,0,1,0,0,0,1,0,0,1
4,130000,120000,1,16.1,1298,88.2,5.0,15,0,0,0,1,0,1,0,0,1
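
To see what dummy (one-hot) encoding does in isolation, here is a minimal sketch on a made-up single-column frame (the `toy` data is illustrative). The `dtype=int` argument is an addition relative to the lab code: it requests 0/1 columns explicitly, because newer pandas versions return True/False by default.

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame({"fuel": ["Diesel", "Petrol", "CNG", "Petrol"]})

# one new column per category; each row has exactly one 1
encoded = pd.get_dummies(toy, columns=["fuel"], dtype=int)
print(encoded)
```

Each original category becomes its own column named `<column>_<category>`, and the original `fuel` column is removed.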


#### Check dataset shape

In [15]:
dataset.shape

(6539, 17)

#### Define the input variables and the target variable
The target variable is selling_price, and the input variables are the remaining columns (check the dataset column names and shape to work out which column indices to use here).

In [16]:
array = dataset.values
X = array[:,1:17]
y = array[:,0]
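
Selecting columns by position works here because `selling_price` is the first column. An alternative that does not depend on column order is to select by name. The sketch below uses a small made-up frame (`toy`, `X_toy` and `y_toy` are illustrative names, not part of the lab code):

```python
import pandas as pd

# Hypothetical toy data with the target in the first column
toy = pd.DataFrame({
    "selling_price": [450000, 370000, 158000],
    "km_driven": [145500, 120000, 140000],
    "age": [8, 8, 16],
})

X_toy = toy.drop(columns=["selling_price"]).values  # every column except the target
y_toy = toy["selling_price"].values                 # the target only
print(X_toy.shape, y_toy.shape)
```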

### Split the dataset and normalize data

#### Split the training and testing dataset
Randomly split the dataset with a random state of 123, using 90% for training and 10% for testing.

In [17]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=123)

#### Apply normalization to both the training and testing datasets
Normalization matters because each input variable is multiplied by a model weight, so the scale of the inputs affects the scale of the outputs and of the gradients. The scaler is fitted on the training data only and then applied to both sets, so no information from the testing data is used during training.

Although a model might converge without normalization, normalization makes training much more stable.

In [18]:
from sklearn.preprocessing import MinMaxScaler

# fit scaler on training data
norm = MinMaxScaler().fit(X_train)

# transform training data
X_train_norm = norm.transform(X_train)

# transform testing data
X_test_norm = norm.transform(X_test)
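
`MinMaxScaler` rescales each feature as (x - min) / (max - min), where min and max come from the data it was fitted on. The sketch below checks this by hand on made-up numbers (`train`, `test` and `manual` are illustrative) and shows that test values outside the training range land outside [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data (e.g. kilometres driven)
train = np.array([[1000.0], [50000.0], [100000.0]])
test = np.array([[25000.0], [120000.0]])

scaler = MinMaxScaler().fit(train)  # learns min and max from the training data only

# the same transformation written out by hand
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(scaler.transform(test).ravel())
print(manual.ravel())
```

The second test value is larger than the training maximum, so its scaled value is greater than 1. This is expected behaviour when the scaler is fitted on the training set only.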

### Train a model

#### Approach 1: Train the model on the entire training dataset and then evaluate the model on the testing dataset

Example of how to build a Linear Regression (LR) model

In [19]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(X_train_norm, y_train)

test_score = model.score(X_test_norm, y_test)
print("R2 of LR:", test_score)

R2 of LR: 0.5917399630409527


Example of how to build a support vector machine (SVM) model

In [20]:
from sklearn.svm import SVR

model = SVR()

model.fit(X_train_norm, y_train)

test_score = model.score(X_test_norm, y_test)
print("R2 of SVM:", test_score)

R2 of SVM: -0.04101852488410551
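
A negative score is possible because R2 is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. A model that always predicts the mean scores 0, and anything worse than that scores below 0. A minimal sketch with made-up numbers (`r2_manual` is a helper written for this illustration):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_mean_pred = np.full_like(y_true, y_true.mean())  # always predict the mean
y_bad_pred = np.full_like(y_true, 150.0)           # a constant away from the mean

def r2_manual(y, y_hat):
    """R2 written out from its definition."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

print(r2_score(y_true, y_mean_pred), r2_manual(y_true, y_mean_pred))
print(r2_score(y_true, y_bad_pred), r2_manual(y_true, y_bad_pred))
```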


#### Approach 2: Train the model on the training dataset with cross validation and then evaluate the model on the testing dataset
##### 1) Define a 10-fold cross validation with data shuffling and set the random state to 123
Benefits of cross validation: every training sample is used for both fitting and validation, so the performance estimate is more reliable than one from a single split, and overfitting is easier to detect. The value of k is normally 5 or 10.

In [21]:
from sklearn.model_selection import KFold

kfold = KFold(n_splits=10, shuffle=True, random_state=123)  # 10-fold cross validation, shuffling the dataset with random seed 123
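
To make the fold mechanics concrete, the sketch below prints the index split produced by `KFold` on a tiny made-up array. It uses 5 folds on 10 samples so the output stays short (`X_toy` and `kf` are illustrative names):

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
kf = KFold(n_splits=5, shuffle=True, random_state=123)

for fold, (train_idx, val_idx) in enumerate(kf.split(X_toy), start=1):
    # each sample appears in the validation part of exactly one fold
    print(f"fold {fold}: train={train_idx}, validation={val_idx}")
```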

##### 2) Run 10-fold cross validation and print the average r-squared score based on the cross validation results
For a regression task, the default evaluation metric is R-squared (`cross_val_score` uses the estimator's own `score` method when no `scoring` argument is given).

First we evaluate on a linear regression model

In [22]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

#Basic training of the linear regression model
# define a LR model with default parameter setting
lr = LinearRegression()
# run the previously defined 10-fold validation on the dataset
results = cross_val_score(lr, X_train_norm, y_train, cv=kfold)
# print the average r squared score
print("Average R2 of LR:",results.mean())

Average R2 of LR: 0.6301286590968104


In [23]:
results

array([0.66534965, 0.68389047, 0.59910499, 0.62052783, 0.57296351,
       0.56451801, 0.64380195, 0.63352477, 0.67666672, 0.64093867])

Now, we evaluate on a SVM model

In [24]:
from sklearn.svm import SVR

svr = SVR()
results = cross_val_score(svr, X_train_norm, y_train, cv=kfold)
print("Average R2 of SVM:",results.mean())

Average R2 of SVM: -0.0433563621588632


In [25]:
results

array([-0.05091688, -0.04148946, -0.06101536, -0.03055859, -0.04283295,
       -0.03243792, -0.04872806, -0.04082401, -0.04594328, -0.03881711])
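
In the cells above, the scaler was fitted on the whole training set before cross validation, so the minimum and maximum of each validation fold had already influenced the scaling. One common way to avoid this is to put the scaler and the model into a pipeline, so that the scaler is refitted on the training part of every fold. The sketch below shows the idea on synthetic data from `make_regression` (the data and the names `X_demo`, `y_demo`, `pipe` and `cv` are assumptions for illustration, not the lab dataset):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for unscaled training data
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=123)

# the scaler is fitted inside each fold, on that fold's training part only
pipe = make_pipeline(MinMaxScaler(), LinearRegression())
cv = KFold(n_splits=10, shuffle=True, random_state=123)

scores = cross_val_score(pipe, X_demo, y_demo, cv=cv)
print("Average R2:", scores.mean())
```

With the lab data, the pipeline would be given the unscaled `X_train` rather than `X_train_norm`.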

### Optimize models with cross validation

First we optimize the Linear Regression model

The parameters that can be applied in grid_params can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html You can add more values and parameters in the grid_params_lr.

In [26]:
# fine tune parameters for lr model
from sklearn.model_selection import GridSearchCV

grid_params_lr = {
    'fit_intercept': [True, False],
    'positive': [True, False]
}

lr = LinearRegression()
gs_lr_result = GridSearchCV(lr, grid_params_lr, cv=kfold).fit(X_train_norm, y_train)
print(gs_lr_result.best_score_)

0.6301286590968104


Then we optimize the SVM model

The parameters that can be applied in grid_params can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html You can add more values and parameters in the grid_params_svr. Note: finding the optimal SVM model takes some time (sometimes an hour or more).

In [27]:
# fine tune parameters for svr model
from sklearn.model_selection import GridSearchCV

grid_params_svr = {
    'kernel' : ('linear', 'poly', 'rbf', 'sigmoid'),
    'C' : [1,5],
    'degree' : [3,8],
    'coef0' : [0.01,10,1],
    'gamma' : ('auto','scale')
}

svr = SVR()
gs_svr_result = GridSearchCV(svr, grid_params_svr, cv=kfold).fit(X_train_norm, y_train)
print(gs_svr_result.best_score_)

0.8663820328015628
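
The long running time follows from the size of the grid: every combination of values is trained once per fold. The sketch below counts the combinations with `ParameterGrid`. The second grid, `reduced_grid`, is an assumption added for illustration: it relies on the SVR documentation stating that `degree` only affects the `poly` kernel, `coef0` only affects `poly` and `sigmoid`, and `gamma` is not used by the `linear` kernel.

```python
from sklearn.model_selection import ParameterGrid

# the grid used in the lab
grid_params_svr = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'C': [1, 5],
    'degree': [3, 8],
    'coef0': [0.01, 10, 1],
    'gamma': ['auto', 'scale'],
}

n_candidates = len(ParameterGrid(grid_params_svr))
n_folds = 10
print("candidates:", n_candidates)
print("model fits during the search:", n_candidates * n_folds)

# Hypothetical reduced grid: one dict per kernel, listing only the
# parameters that kernel actually uses
reduced_grid = [
    {'kernel': ['linear'], 'C': [1, 5]},
    {'kernel': ['rbf'], 'C': [1, 5], 'gamma': ['auto', 'scale']},
    {'kernel': ['sigmoid'], 'C': [1, 5], 'gamma': ['auto', 'scale'], 'coef0': [0.01, 10, 1]},
    {'kernel': ['poly'], 'C': [1, 5], 'gamma': ['auto', 'scale'], 'coef0': [0.01, 10, 1], 'degree': [3, 8]},
]
n_reduced = len(ParameterGrid(reduced_grid))
print("candidates in the reduced grid:", n_reduced)
```

`GridSearchCV` accepts such a list of dictionaries as its parameter grid, so the reduced grid covers the same distinct models with fewer fits.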


### Evaluate the trained Linear Regression model using testing dataset

In [29]:
# use the best model and evaluate on testing set
lr_test_R2 = gs_lr_result.best_estimator_.score(X_test_norm, y_test)
print("R2 of LR in testing:", lr_test_R2)

R2 of LR in testing: 0.5917399630409527


In [30]:
# check the parameter setting for the best selected model
gs_lr_result.best_params_

{'fit_intercept': True, 'positive': False}

### Evaluate the trained Support Vector Machine model using testing dataset

In [31]:
# use the best model and evaluate on testing set
svr_test_R2 = gs_svr_result.best_estimator_.score(X_test_norm, y_test)
print("R2 of SVM in testing:", svr_test_R2)

R2 of SVM in testing: 0.8734479774991956


In [32]:
# check the parameter setting for the best selected model
gs_svr_result.best_params_

{'C': 5, 'coef0': 10, 'degree': 8, 'gamma': 'scale', 'kernel': 'poly'}

### Predict with a trained model

In [35]:
# predict with the first 5 data points
y_predict = gs_lr_result.best_estimator_.predict(X_test_norm[:5]) 
print(y_predict)

[1214499.9736509   337037.49493902  551382.04202298  143190.56974847
  498800.41080653]


In [36]:
# predict with the first 5 data points
y_predict = gs_svr_result.best_estimator_.predict(X_test_norm[:5]) 
print(y_predict)

[1328635.02826361  260837.69530753  305721.66056136  147022.75486139
  434599.19056777]


### Save and load a trained model

#### Linear regression model

In [37]:
import pickle

# Save to file in the current working directory
pkl_filename = "lr_model.pkl"  
with open(pkl_filename, 'wb') as file:  
    pickle.dump(gs_lr_result.best_estimator_, file)

# Load from file
with open(pkl_filename, 'rb') as file:  
    pickle_model = pickle.load(file)

# Calculate the R2 score of the loaded model on the testing dataset
score = pickle_model.score(X_test_norm, y_test)  
print("R2 score:", score)  

R2 score: 0.5917399630409527


#### Similarly for an SVM model

In [38]:
import pickle

# Save to file in the current working directory
pkl_filename = "svm_model.pkl"  
with open(pkl_filename, 'wb') as file:  
    pickle.dump(gs_svr_result.best_estimator_, file)

# Load from file
with open(pkl_filename, 'rb') as file:  
    pickle_model = pickle.load(file)

# Calculate the R2 score of the loaded model on the testing dataset
score = pickle_model.score(X_test_norm, y_test)  
print("R2 score:", score)  

R2 score: 0.8734479774991956
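
The pickled models above expect inputs that have already been normalized, so the fitted scaler is needed again whenever the model is loaded for new data. One option is to pickle the scaler and the model together as a single pipeline. The sketch below illustrates this on made-up data and writes to an in-memory buffer instead of a file (the data and the names `pipe`, `buffer` and `restored` are assumptions for illustration):

```python
import io
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data following y = 2 * x + 1
X_raw = np.array([[0.0], [10.0], [20.0], [30.0]])
y_raw = 2 * X_raw.ravel() + 1

# scaler and model are fitted and stored as one object
pipe = make_pipeline(MinMaxScaler(), LinearRegression()).fit(X_raw, y_raw)

buffer = io.BytesIO()  # in-memory stand-in for a .pkl file
pickle.dump(pipe, buffer)
buffer.seek(0)
restored = pickle.load(buffer)

# the restored pipeline accepts raw, unscaled input
print(restored.predict(np.array([[15.0]])))
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.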
