# Model selection with AHP-TOPSIS

Model selection is one of the most important topics in machine learning. Many metrics are available for it, and models are usually compared on a single metric chosen to fit the structure of the problem and the goal. But what if we want to select a model using all of the metrics at once? In this kernel I will show how to choose a model using every metric together.

Let our problem be a regression problem. Metrics commonly used for regression are R squared, RMSE, and MAE. I will use 10 regression algorithms and, alongside these 3 metrics, also take each model's training time into account, so that the best of the 10 models is chosen according to 4 criteria.
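As a quick reminder of what the three accuracy metrics measure, here is a minimal base R sketch. The observed and predicted values are made-up numbers used only for illustration, and R squared is computed as the squared correlation, which is (as far as I know) also what caret's `R2()` returns by default.

```r
# Toy observed values and predictions (made-up numbers, illustration only)
obs  <- c(18, 21, 20, 16, 15)
pred <- c(17, 22, 19, 17, 15)

err <- pred - obs

# R squared as the squared Pearson correlation between predictions and observations
r2 <- cor(pred, obs)^2

# Mean absolute error: the average size of the errors
mae <- mean(abs(err))

# Root mean squared error: penalises large errors more heavily than MAE
rmse <- sqrt(mean(err^2))

metrics <- c(R2 = r2, MAE = mae, RMSE = rmse)
print(round(metrics, 4))
```

R squared is a "higher is better" criterion, while MAE and RMSE are "lower is better"; this distinction matters later when the criteria are passed to TOPSIS.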

In [1]:
library(caret)
library(ahptopsis2n)
library(tidyverse)


Loading required package: lattice

Loading required package: ggplot2


Attaching package: ‘caret’


The following object is masked from ‘package:httr’:

    progress


── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ tibble  3.1.5     ✔ dplyr   1.0.7
✔ tidyr   1.1.4     ✔ stringr 1.4.0
✔ readr   2.0.2     ✔ forcats 0.5.1
✔ purrr   0.3.4     

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()   masks stats::filter()
✖ dplyr::lag()      masks stats::lag()
✖ purrr::lift()     masks caret::lift()
✖ caret::progress() masks httr::progress()



I will use the mpg dataset, which is available once the tidyverse is loaded (it ships with ggplot2). Without going into data preprocessing or feature selection, I will build the models directly and evaluate them.

In [2]:
df = mpg
head(df)

manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
<chr>,<chr>,<dbl>,<int>,<int>,<chr>,<chr>,<int>,<int>,<chr>,<chr>
audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
audi,a4,2.8,1999,6,manual(m5),f,18,26,p,compact


In [3]:
df = df %>% mutate_if(is.character, as.factor)

In [4]:
set.seed(1255)
indeks = sample(1:nrow(df), size = 0.8*nrow(df))
train = df[indeks, ]
test = df[-indeks, ]

In this study, the cty variable will be predicted with 10 different machine learning models.
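Every model in this notebook follows the same pattern: record the start time, fit, record the end time, predict on the test set, and collect R2, MAE, RMSE and Time in a one-row data frame. That pattern could be wrapped in a small helper. The function `time_and_score` below is hypothetical (it is not part of caret or of this notebook), and the usage example runs on the built-in `mtcars` data with a plain `lm()` so that it is self-contained.

```r
# Hypothetical helper: time a model fit and score it on held-out data.
# `fit_fun` is any zero-argument function returning a fitted model
# that has a predict() method.
time_and_score <- function(fit_fun, test_data, truth, label) {
  start <- Sys.time()
  model <- fit_fun()
  # Force seconds so that fast and slow fits are reported in the same unit
  elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))

  pred <- as.numeric(predict(model, test_data))
  out <- data.frame(R2   = cor(pred, truth)^2,
                    MAE  = mean(abs(pred - truth)),
                    RMSE = sqrt(mean((pred - truth)^2)),
                    Time = elapsed)
  rownames(out) <- label
  out
}

# Usage on the built-in mtcars data with a plain lm() fit
set.seed(1)
idx      <- sample(seq_len(nrow(mtcars)), size = floor(0.8 * nrow(mtcars)))
train_mt <- mtcars[idx, ]
test_mt  <- mtcars[-idx, ]

res <- time_and_score(function() lm(mpg ~ wt + hp, data = train_mt),
                      test_mt, test_mt$mpg, "lm")
print(res)
```

In this notebook the first argument would instead be something like `function() train(cty~., data = train, method = 'rf', trControl = trainControl(method = 'cv', number = 5))`. The cells are written out one by one here, which keeps each model's warnings next to its own code.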

# Linear Model

In [5]:
lm_start = Sys.time()
lm_model = train(cty~., data = train,
                  method = 'lm',
                  trControl = trainControl(method = 'cv', number = 5))

lm_end = Sys.time()
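# Fix the unit to seconds; a bare difftime chooses its own unit (secs, mins, ...)
lm_time = as.numeric(difftime(lm_end, lm_start, units = 'secs'))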
tahminlm = predict(lm_model, test)

Lm = data.frame(R2 = R2(tahminlm, test$cty),
                 MAE = MAE(tahminlm, test$cty),
                 RMSE = RMSE(tahminlm, test$cty),
                 Time = lm_time)
rownames(Lm) = 'LinearModel'

“prediction from a rank-deficient fit may be misleading”
“prediction from a rank-deficient fit may be misleading”
“prediction from a rank-deficient fit may be misleading”
“prediction from a rank-deficient fit may be misleading”
“prediction from a rank-deficient fit may be misleading”
“prediction from a rank-deficient fit may be misleading”


# Lasso Model

In [6]:
lasso_start = Sys.time()
lasso_model = train(cty~., data = train,
                 method = 'lasso',
                 na.action = na.omit,
                 trControl = trainControl(method = 'cv', number = 5))

lasso_end = Sys.time()
lasso_time = as.numeric(difftime(lasso_end, lasso_start, units = 'secs'))
tahminlasso = predict(lasso_model, test)

Lasso = data.frame(R2 = R2(tahminlasso, test$cty),
                MAE = MAE(tahminlasso, test$cty),
                RMSE = RMSE(tahminlasso, test$cty),
                Time = lasso_time)
rownames(Lasso) = "Lasso"

“model fit failed for Fold4: fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = 0, ...) : 
  Some of the columns of x have zero variance
”
“model fit failed for Fold5: fraction=0.9 Error in elasticnet::enet(as.matrix(x), y, lambda = 0, ...) : 
  Some of the columns of x have zero variance
”
“There were missing values in resampled performance measures.”


# Ridge Model

In [7]:
R_start = Sys.time()
R_model = train(cty~., data = train,
                 method = 'ridge',
                 trControl = trainControl(method = 'cv', number = 5))

R_end = Sys.time()
R_time = as.numeric(difftime(R_end, R_start, units = 'secs'))
tahminR = predict(R_model, test)

Ridge = data.frame(R2 = R2(tahminR, test$cty),
                MAE = MAE(tahminR, test$cty),
                RMSE = RMSE(tahminR, test$cty),
                Time = R_time)
rownames(Ridge) = 'Ridge'

“model fit failed for Fold1: lambda=0e+00 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) : 
  Some of the columns of x have zero variance
”
“model fit failed for Fold1: lambda=1e-01 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) : 
  Some of the columns of x have zero variance
”
“model fit failed for Fold1: lambda=1e-04 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) : 
  Some of the columns of x have zero variance
”
“model fit failed for Fold2: lambda=0e+00 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) : 
  Some of the columns of x have zero variance
”
“model fit failed for Fold2: lambda=1e-01 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) : 
  Some of the columns of x have zero variance
”
“model fit failed for Fold2: lambda=1e-04 Error in elasticnet::enet(as.matrix(x), y, lambda = param$lambda) : 
  Some of the columns of x have zero variance
”
“model fit failed for Fold3: lambda=0e+00 Erro

# PLS Model

In [8]:
Pls_start = Sys.time()
Pls_model = train(cty~., data = train,
                 method = 'pls',
                 trControl = trainControl(method = 'cv', number = 5))

Pls_end = Sys.time()
Pls_time = as.numeric(difftime(Pls_end, Pls_start, units = 'secs'))
tahminPls = predict(Pls_model, test)

Pls = data.frame(R2 = R2(tahminPls, test$cty),
                MAE = MAE(tahminPls, test$cty),
                RMSE = RMSE(tahminPls, test$cty),
                Time = Pls_time)
rownames(Pls) = 'Partial least squares'


# PCR Model

In [9]:
Pcr_start = Sys.time()
Pcr_model = train(cty~., data = train,
                  method = 'pcr',
                  trControl = trainControl(method = 'cv', number = 5))

pcr_end = Sys.time()
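# Measure from Pcr_start. Measuring from Pls_start would also count the PLS fit
# and inflate this time, as in the 2.87 reported for PCR in the results table.
pcr_time = as.numeric(difftime(pcr_end, Pcr_start, units = 'secs'))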
tahminPcr = predict(Pcr_model, test)

Pcr = data.frame(R2 = R2(tahminPcr, test$cty),
                 MAE = MAE(tahminPcr, test$cty),
                 RMSE = RMSE(tahminPcr, test$cty),
                 Time = pcr_time)
rownames(Pcr) = 'Principal component regression'

# KNN Model

In [10]:
knn_start = Sys.time()
knn_model = train(cty~., data = train,
                  method = 'knn',
                  trControl = trainControl(method = 'cv', number = 5))

knn_end = Sys.time()
knn_time = as.numeric(difftime(knn_end, knn_start, units = 'secs'))
tahmin = predict(knn_model, test)

knn = data.frame(R2 = R2(tahmin, test$cty),
                 MAE = MAE(tahmin, test$cty),
                 RMSE = RMSE(tahmin, test$cty),
                 Time = knn_time)

rownames(knn) = 'KNN'

# SVM Model

In [11]:
SVM_start = Sys.time()
SVM_model = train(cty~., data = train,
                 method = 'svmRadial',
                 trControl = trainControl(method = 'cv', number = 5))

SVM_end = Sys.time()
SVM_time = as.numeric(difftime(SVM_end, SVM_start, units = 'secs'))
tahminSVM = predict(SVM_model, test)

SVM = data.frame(R2 = R2(tahminSVM, test$cty),
                MAE = MAE(tahminSVM, test$cty),
                RMSE = RMSE(tahminSVM, test$cty),
                Time = SVM_time)
rownames(SVM) = 'SVM'


“Variable(s) `' constant. Cannot scale data.”
“Variable(s) `' constant. Cannot scale data.”
“Variable(s) `' constant. Cannot scale data.”
“Variable(s) `' constant. Cannot scale data.”
“Variable(s) `' constant. Cannot scale data.”
“Variable(s) `' constant. Cannot scale data.”


# Random Forest Model

In [12]:
RF_start = Sys.time()
RF_model = train(cty~., data = train,
                 method = 'rf',
                 trControl = trainControl(method = 'cv', number = 5))

RF_end = Sys.time()
RF_time = as.numeric(difftime(RF_end, RF_start, units = 'secs'))
tahminrf = predict(RF_model, test)

RF = data.frame(R2 = R2(tahminrf, test$cty),
                MAE = MAE(tahminrf, test$cty),
                RMSE = RMSE(tahminrf, test$cty),
                Time = RF_time)
rownames(RF) = 'RandomForest'

# CART Model

In [13]:
C_start = Sys.time()
C_model = train(cty~., data = train,
                 method = 'rpart',
                 trControl = trainControl(method = 'cv', number = 5))

C_end = Sys.time()
C_time = as.numeric(difftime(C_end, C_start, units = 'secs'))
tahminC = predict(C_model, test)

Cart = data.frame(R2 = R2(tahminC, test$cty),
                MAE = MAE(tahminC, test$cty),
                RMSE = RMSE(tahminC, test$cty),
                Time = C_time)
rownames(Cart) = 'CART'

“There were missing values in resampled performance measures.”


# XGBTREE Model

In [14]:
XGB_start = Sys.time()
XGB_model = train(cty~., data = train,
                 method = 'xgbTree',
                 trControl = trainControl(method = 'cv', number = 5))

XGB_end = Sys.time()
XGB_time = as.numeric(difftime(XGB_end, XGB_start, units = 'secs'))
tahminXGB = predict(XGB_model, test)

XGB = data.frame(R2 = R2(tahminXGB, test$cty),
                MAE = MAE(tahminXGB, test$cty),
                RMSE = RMSE(tahminXGB, test$cty),
                Time = XGB_time)
rownames(XGB) = 'XgbTree'

Let's combine all of our results.

In [15]:
Sonuc = rbind(Lm, Lasso, Ridge, Pls, Pcr, knn, SVM, RF, Cart, XGB)
Sonuc

Model,R2,MAE,RMSE,Time
,<dbl>,<dbl>,<dbl>,<dbl>
LinearModel,0.920544,0.9534646,1.212376,1.0956013
Lasso,0.7048675,1.6764454,2.960005,1.9276955
Ridge,0.9322373,0.870538,1.107316,1.8893993
Partial least squares,0.9338042,0.8611106,1.079692,1.1515598
Principal component regression,0.9326192,0.8377937,1.073957,2.8716326
KNN,0.9152363,0.8422492,1.227772,0.8565326
SVM,0.7949561,1.1706389,2.037794,3.2574952
RandomForest,0.9416602,0.7390092,1.038782,8.2906361
CART,0.7405199,1.6463683,2.131761,1.4979632
XgbTree,0.928072,0.8445458,1.19252,14.8707221


# AHP-TOPSIS

Now we will rank the results from best to worst using AHP and TOPSIS, two Multi-Criteria Decision Making techniques.

In [16]:
criteria = matrix(c(1,1, 1, 1,
                    1, 1, 1, 1,
                    1, 1, 1, 1,
                    1, 1, 1, 1
                    ), ncol = 4, byrow =TRUE)


AHPTOPSIS = ahptopsis2n(as.matrix(Sonuc), criteria, c('max', 'min', 'min', 'min'))
AHPTOPSIS

0
0

Model,values,ranking
,<dbl>,<dbl>
LinearModel,0.9232733,3
Lasso,0.6041827,8
Ridge,0.9231556,4
Partial least squares,0.9568676,1
Principal component regression,0.873826,5
KNN,0.9484325,2
SVM,0.7168319,6
RandomForest,0.5895821,9
CART,0.6874227,7
XgbTree,0.3529064,10

Model,values,ranking
,<dbl>,<dbl>
LinearModel,0.8724814,5
Lasso,0.3475577,10
Ridge,0.9173811,2
Partial least squares,0.9323993,1
Principal component regression,0.9098798,3
KNN,0.9087061,4
SVM,0.5513931,8
RandomForest,0.7718442,6
CART,0.4290121,9
XgbTree,0.6110061,7
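To make the ranking step less of a black box, here is a minimal base R sketch of textbook TOPSIS with vector normalisation and equal weights, applied to three rows copied from the results table. This is an illustration of the general technique under those assumptions; it is not necessarily the exact computation that `ahptopsis2n` performs, so the closeness values will not match the output above.

```r
# Three rows copied from the results table in this notebook
scores <- matrix(c(0.9338042, 0.8611106, 1.079692,  1.1515598,
                   0.9416602, 0.7390092, 1.038782,  8.2906361,
                   0.9280720, 0.8445458, 1.192520, 14.8707221),
                 ncol = 4, byrow = TRUE,
                 dimnames = list(c("PLS", "RandomForest", "XgbTree"),
                                 c("R2", "MAE", "RMSE", "Time")))

weights    <- rep(0.25, 4)                 # equal weights, as in this notebook
is_benefit <- c(TRUE, FALSE, FALSE, FALSE) # R2: maximise; the rest: minimise

# 1. Vector normalisation: divide each column by its Euclidean norm
normed <- sweep(scores, 2, sqrt(colSums(scores^2)), "/")

# 2. Apply the criterion weights
v <- sweep(normed, 2, weights, "*")

# 3. Ideal and anti-ideal value per criterion
ideal <- ifelse(is_benefit, apply(v, 2, max), apply(v, 2, min))
anti  <- ifelse(is_benefit, apply(v, 2, min), apply(v, 2, max))

# 4. Euclidean distance of each model to the ideal and anti-ideal points
d_plus  <- sqrt(rowSums(sweep(v, 2, ideal, "-")^2))
d_minus <- sqrt(rowSums(sweep(v, 2, anti,  "-")^2))

# 5. Relative closeness to the ideal (1 = best possible, 0 = worst possible)
closeness <- d_minus / (d_plus + d_minus)
ranking   <- rank(-closeness)

print(data.frame(closeness, ranking))
```

Because the accuracy columns are nearly identical across these three models while Time differs by an order of magnitude, the Time column dominates the distances, which is the same effect visible in the full ranking.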


Looking at the last table, we see that the best model is Partial Least Squares (PLS) regression. When we evaluate according to TOPSIS alone, PLS again comes first in the ranking. One point to keep in mind, however, is that we gave every criterion a weight of 0.25; in other words, we treated training time as being as important as each of the accuracy metrics. This is why XGB and RandomForest, despite their high accuracy, were pushed to lower positions in the ranking: they are slow to train. The calculation can also be done with a separate weight assigned to each metric.
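As a sketch of what unequal weights could look like, the block below builds a hypothetical AHP pairwise comparison matrix in which each accuracy metric is judged 3 times as important as training time, and derives the weights from its principal eigenvector in base R. The matrix values and the helper `ahp_weights` are assumptions made for illustration; only the final commented call reuses the `ahptopsis2n` call already shown in this notebook.

```r
# Hypothetical pairwise comparison matrix, criteria in the order
# R2, MAE, RMSE, Time: each accuracy metric is 3 times as important as
# training time, and the accuracy metrics are equally important to each other.
criteria_weighted <- matrix(c(1,   1,   1,   3,
                              1,   1,   1,   3,
                              1,   1,   1,   3,
                              1/3, 1/3, 1/3, 1),
                            ncol = 4, byrow = TRUE)

# Hypothetical helper: AHP weights as the normalised principal eigenvector
ahp_weights <- function(m) {
  e <- eigen(m)
  k <- which.max(Re(e$values))   # index of the largest eigenvalue
  w <- Re(e$vectors[, k])
  w / sum(w)                     # rescale so the weights sum to 1
}

w <- ahp_weights(criteria_weighted)
names(w) <- c("R2", "MAE", "RMSE", "Time")
print(round(w, 3))

# The all-ones matrix used in this notebook gives 0.25 for every criterion
w_equal <- ahp_weights(matrix(1, 4, 4))
print(w_equal)

# The weighted matrix could then replace `criteria` in the existing call:
# ahptopsis2n(as.matrix(Sonuc), criteria_weighted, c('max', 'min', 'min', 'min'))
```

With this matrix the three accuracy metrics receive a weight of 0.3 each and Time receives 0.1, so slow but accurate models such as RandomForest and XgbTree would be penalised less than under equal weights.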

Thanks in advance for your comments and suggestions.