220524.md

Understanding Nearest Neighbor Classification

Nearest neighbor classifiers

  • Classify an unlabeled example by assigning it the class of similar labeled examples
  • Well suited to classification tasks where the relationships between the features and the target classes are numerous, complicated, or very hard to understand, yet items of a similar class type tend to be fairly homogeneous

The k-NN algorithm

Strengths

  • Simple and effective
  • Makes no assumptions about the underlying data distribution
  • Fast training phase

Weaknesses

  • Does not produce a model, which limits the ability to understand how the features relate to the class
  • Requires choosing an appropriate k
  • Slow classification phase
  • Nominal features and missing data require additional processing

Measuring similarity with distance

  • Finding the nearest neighbors requires a distance function, a formula that measures the similarity between two instances
  • Traditionally, the k-NN algorithm uses Euclidean distance
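
To make the distance idea concrete, here is a minimal sketch in R. The function name `euclidean_distance` and the two example vectors are made up for illustration; `dist()` is the base R function that computes the same quantity for the rows of a matrix.

```r
# Euclidean distance between two numeric vectors of equal length
euclidean_distance <- function(p, q) {
  stopifnot(length(p) == length(q))
  sqrt(sum((p - q)^2))
}

# Two hypothetical examples, each described by two features
a <- c(1, 2)
b <- c(4, 6)
euclidean_distance(a, b)   # sqrt(3^2 + 4^2) = 5

# Base R's dist() gives the same distance between the rows of a matrix
dist(rbind(a, b), method = "euclidean")
```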

Choosing an appropriate k

  • The value of k determines how well the model will generalize to future data
  • The balance between overfitting and underfitting the training data is a problem known as the bias-variance tradeoff
  • A large k reduces the impact of noisy data, and thus the variance, but can bias the learner so that it risks ignoring small yet important patterns
  • A small k lets noisy data or outliers unduly influence the classification of examples
  • A common practice is to start with k equal to the square root of the number of training examples, then test several values of k and pick the one with the best classification performance
  • When the data is not very noisy and the training dataset is large, the choice of k matters less
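
As a rough sketch of the square-root starting point, the helper below rounds the square root down and moves even results up to the next odd number. The name `suggest_k` is hypothetical and not part of any package, and the odd-number adjustment is an assumption made here so that a two-class vote cannot tie. The result is only a first guess; other values of k should still be tested.

```r
# Hypothetical helper: a starting value for k based on the training set size
suggest_k <- function(n_train) {
  k <- floor(sqrt(n_train))     # square-root rule of thumb
  if (k %% 2 == 0) k <- k + 1   # prefer an odd k to avoid tied votes with two classes
  k
}

suggest_k(469)   # floor(21.66) = 21, already odd
suggest_k(100)   # floor(10) = 10, bumped to 11
```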

Preparing data for use with k-NN

  • Features are typically transformed to a standard range before the algorithm is applied
  • This is because the distance formula depends heavily on how the features are measured
  • Traditionally, min-max normalization is used

Min-max normalization

  • Transforms a feature so that all of its values fall in the range 0 to 1
  • Xnew = (X - min(X)) / (max(X) - min(X))

z-score standardization

  • Subtracts the mean of feature X and divides the result by the standard deviation of X
  • Xnew = (X - Mean(X)) / StdDev(X)
  • Rescales each feature value in terms of how many standard deviations it falls above or below the mean
  • The resulting value is called a z-score
  • Unlike normalized values, z-scores have no predefined minimum and maximum
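
The two rescaling formulas can be compared side by side on a small vector. The values in `x` are made up for illustration. Note that `sd()` uses the sample standard deviation (n - 1 denominator), which is also what base R's `scale()` uses.

```r
# Hypothetical feature values, used only for illustration
x <- c(10, 20, 30, 40, 50)

# Min-max normalization: the result always lies in [0, 1]
x_minmax <- (x - min(x)) / (max(x) - min(x))

# z-score standardization: mean 0, standard deviation 1, no fixed bounds
x_z <- (x - mean(x)) / sd(x)

x_minmax
# [1] 0.00 0.25 0.50 0.75 1.00
round(x_z, 3)
# [1] -1.265 -0.632  0.000  0.632  1.265
```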

The Euclidean distance formula

  • Not defined for nominal data
  • To compute distances between nominal features, they need to be converted to a numeric format
  • Dummy coding is the typical approach
  • In dummy coding, a value of 1 indicates the given category and 0 indicates any other category
  • An n-category nominal feature can be dummy coded by creating binary indicator variables for (n - 1) levels of the feature
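
A small sketch of dummy coding, assuming a made-up three-category feature called `temperature` (it is not part of the dataset used later). With n = 3 categories, two indicator variables are enough, because the third category is implied when both indicators are 0.

```r
# Hypothetical nominal feature with n = 3 categories
temperature <- factor(c("hot", "medium", "cold", "hot"),
                      levels = c("cold", "medium", "hot"))

# Manual dummy coding: n - 1 = 2 indicators; "cold" is the reference
# category, implied when both indicators are 0
hot    <- ifelse(temperature == "hot", 1, 0)
medium <- ifelse(temperature == "medium", 1, 0)
data.frame(temperature, hot, medium)

# model.matrix() builds the same n - 1 indicators (plus an intercept column)
mm <- model.matrix(~ temperature)
mm
```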

Why the k-NN algorithm is lazy

  • Classification algorithms based on nearest neighbor methods are called lazy learners
  • This is because no abstraction takes place
  • That is, the training phase only stores the training data, so it is fast, but nothing is actually learned
  • The prediction phase tends to be relatively slow compared with the training phase
  • Because it relies heavily on the training instances rather than an abstracted model, lazy learning is also called instance-based learning or rote learning
  • Since no model is built, it can be considered a class of non-parametric learning methods
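
The lazy behavior can be illustrated with a from-scratch sketch. This is not how the `class` package implements `knn()`; the function names `knn_train` and `knn_predict_one` and the toy data are hypothetical. The "training" step only stores the data, and all distance computation happens at prediction time.

```r
# "Training" a lazy learner: nothing is fitted, the data is only stored
knn_train <- function(features, labels) {
  list(features = as.matrix(features), labels = as.factor(labels))
}

# Prediction does all the work: compute distances, then take a majority vote
knn_predict_one <- function(model, new_point, k) {
  diffs <- sweep(model$features, 2, new_point)   # subtract new_point from every row
  distances <- sqrt(rowSums(diffs^2))            # Euclidean distance to each stored example
  nearest <- order(distances)[seq_len(k)]        # indices of the k closest examples
  votes <- table(model$labels[nearest])
  names(votes)[which.max(votes)]                 # most common class (first one on ties)
}

# Hypothetical toy data: two well-separated groups
toy_features <- rbind(c(1, 1), c(1, 2), c(2, 1),
                      c(8, 8), c(8, 9), c(9, 8))
toy_labels <- c("A", "A", "A", "B", "B", "B")

model <- knn_train(toy_features, toy_labels)
knn_predict_one(model, c(2, 2), k = 3)
# [1] "A"
```

Every prediction scans all stored examples, which is why the classification phase slows down as the training set grows.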

Example: diagnosing cancer with the k-NN algorithm

Data

  • Wisconsin Breast Cancer Diagnostic dataset from the UCI Machine Learning Repository

Exploring the data

wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors=F)
str(wbcd)
> str(wbcd)
'data.frame':	569 obs. of  32 variables:
 $ id               : int  87139402 8910251 905520 868871 9012568 906539 925291 87880 862989 89827 ...
 $ diagnosis        : chr  "B" "B" "B" "B" ...
 $ radius_mean      : num  12.3 10.6 11 11.3 15.2 ...
 $ texture_mean     : num  12.4 18.9 16.8 13.4 13.2 ...
 $ perimeter_mean   : num  78.8 69.3 70.9 73 97.7 ...
 $ area_mean        : num  464 346 373 385 712 ...
 $ smoothness_mean  : num  0.1028 0.0969 0.1077 0.1164 0.0796 ...
 $ compactness_mean : num  0.0698 0.1147 0.078 0.1136 0.0693 ...
 $ concavity_mean   : num  0.0399 0.0639 0.0305 0.0464 0.0339 ...
 $ points_mean      : num  0.037 0.0264 0.0248 0.048 0.0266 ...
 $ symmetry_mean    : num  0.196 0.192 0.171 0.177 0.172 ...
 $ dimension_mean   : num  0.0595 0.0649 0.0634 0.0607 0.0554 ...
 $ radius_se        : num  0.236 0.451 0.197 0.338 0.178 ...
 $ texture_se       : num  0.666 1.197 1.387 1.343 0.412 ...
 $ perimeter_se     : num  1.67 3.43 1.34 1.85 1.34 ...
 $ area_se          : num  17.4 27.1 13.5 26.3 17.7 ...
 $ smoothness_se    : num  0.00805 0.00747 0.00516 0.01127 0.00501 ...
 $ compactness_se   : num  0.0118 0.03581 0.00936 0.03498 0.01485 ...
 $ concavity_se     : num  0.0168 0.0335 0.0106 0.0219 0.0155 ...
 $ points_se        : num  0.01241 0.01365 0.00748 0.01965 0.00915 ...
 $ symmetry_se      : num  0.0192 0.035 0.0172 0.0158 0.0165 ...
 $ dimension_se     : num  0.00225 0.00332 0.0022 0.00344 0.00177 ...
 $ radius_worst     : num  13.5 11.9 12.4 11.9 16.2 ...
 $ texture_worst    : num  15.6 22.9 26.4 15.8 15.7 ...
 $ perimeter_worst  : num  87 78.3 79.9 76.5 104.5 ...
 $ area_worst       : num  549 425 471 434 819 ...
 $ smoothness_worst : num  0.139 0.121 0.137 0.137 0.113 ...
 $ compactness_worst: num  0.127 0.252 0.148 0.182 0.174 ...
 $ concavity_worst  : num  0.1242 0.1916 0.1067 0.0867 0.1362 ...
 $ points_worst     : num  0.0939 0.0793 0.0743 0.0861 0.0818 ...
 $ symmetry_worst   : num  0.283 0.294 0.3 0.21 0.249 ...
 $ dimension_worst  : num  0.0677 0.0759 0.0788 0.0678 0.0677 ...
  • 569 examples of cancer biopsies
  • 32 variables: the patient ID, the diagnosis, and 30 numeric measurements
  • The first variable is the patient ID; it carries no useful information and should be excluded
  • Apart from the meaningless ID and the character-typed diagnosis, all other features are numeric
wbcd <- wbcd[-1] # drop the patient ID, which provides no useful information
table(wbcd$diagnosis)
> table(wbcd$diagnosis)

  B   M 
357 212 
  • The diagnosis variable is the outcome we want to predict
  • It indicates whether the mass is benign (B) or malignant (M)
  • 357 are benign and 212 are malignant
  • It needs to be converted to a factor
  • More informative labels should also be supplied
wbcd$diagnosis <- factor(wbcd$diagnosis, levels=c("B","M"),
                         labels=c("Benign","Malignant"))
round(prop.table(table(wbcd$diagnosis)) * 100, digits=1)
> round(prop.table(table(wbcd$diagnosis)) * 100, digits=1)

   Benign Malignant 
     62.7      37.3 
  • 62.7% are benign and 37.3% are malignant
summary(wbcd[c("radius_mean","area_mean","smoothness_mean")])
> summary(wbcd[c("radius_mean","area_mean","smoothness_mean")])
  radius_mean       area_mean      smoothness_mean  
 Min.   : 6.981   Min.   : 143.5   Min.   :0.05263  
 1st Qu.:11.700   1st Qu.: 420.3   1st Qu.:0.08637  
 Median :13.370   Median : 551.1   Median :0.09587  
 Mean   :14.127   Mean   : 654.9   Mean   :0.09636  
 3rd Qu.:15.780   3rd Qu.: 782.7   3rd Qu.:0.10530  
 Max.   :28.110   Max.   :2501.0   Max.   :0.16340 
  • Smoothness ranges from 0.05 to 0.16, whereas area ranges from 143.5 to 2501.0, so area would dominate the distance calculation
  • Normalization needs to be applied

Normalizing numeric data

Creating a min-max normalization function

normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
normalize(c(1, 2, 3, 4, 5))
normalize(c(10, 20, 30, 40, 50))
> normalize(c(1, 2, 3, 4, 5))
[1] 0.00 0.25 0.50 0.75 1.00
> normalize(c(10, 20, 30, 40, 50))
[1] 0.00 0.25 0.50 0.75 1.00

lapply()

  • Takes a list and applies the specified function to each list element
  • A data frame is a list of vectors of equal length
  • lapply() returns a list, so the result needs to be converted back to a data frame
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
summary(wbcd_n$area_mean)
> summary(wbcd_n$area_mean)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.1174  0.1729  0.2169  0.2711  1.0000 
  • The original range was 143.5 to 2501.0; the values now range from 0 to 1

Preparing the data

Splitting into train and test

  • Use the first 469 examples for training and the remaining 100 for testing
wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]

Take the diagnosis factor to create the label vectors

wbcd_train_label <- wbcd[1:469, 1]
wbcd_test_label <- wbcd[470:569, 1]

Training a model on the data

  • The training set has 469 examples, so try k = 21, an odd number close to the square root of 469 (about 21.7)
  • With a two-category outcome, an odd k removes the possibility of a tied vote
library(class)
wbcd_test_pred <- knn(train=wbcd_train, test=wbcd_test, 
                      cl=wbcd_train_label, k=21)

๋ชจ๋ธ ์„ฑ๋Šฅ ํ‰๊ฐ€

library(gmodels)
CrossTable(x=wbcd_test_label, y=wbcd_test_pred, prop.chisq=F)
> CrossTable(x=wbcd_test_label, y=wbcd_test_pred, prop.chisq=F)

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
                | wbcd_test_pred 
wbcd_test_label |    Benign | Malignant | Row Total | 
----------------|-----------|-----------|-----------|
         Benign |        61 |         0 |        61 | 
                |     1.000 |     0.000 |     0.610 | 
                |     0.968 |     0.000 |           | 
                |     0.610 |     0.000 |           | 
----------------|-----------|-----------|-----------|
      Malignant |         2 |        37 |        39 | 
                |     0.051 |     0.949 |     0.390 | 
                |     0.032 |     1.000 |           | 
                |     0.020 |     0.370 |           | 
----------------|-----------|-----------|-----------|
   Column Total |        63 |        37 |       100 | 
                |     0.630 |     0.370 |           | 
----------------|-----------|-----------|-----------|
  • The top-left cell shows the true negative (TN) results
  • 61 of the 100 masses were benign, and the k-NN algorithm correctly identified them as benign
  • The bottom-right cell shows the true positive (TP) results
  • These are the malignant masses that the classifier correctly labeled as malignant
  • The bottom-left cell holds the false negatives (FN) and the top-right cell the false positives (FP); both are errors
  • 2 of the 100 masses were misclassified
  • Accuracy is high, but both errors are false negatives (malignant masses predicted as benign), so performance still needs to improve
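
The counts in the cross table above can be turned into summary metrics. This is a small sketch that treats Malignant as the positive class; the variable names are chosen here for illustration.

```r
# Counts copied from the cross table above (Malignant treated as the positive class)
TN <- 61; FP <- 0
FN <- 2;  TP <- 37

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 98 / 100 = 0.98
sensitivity <- TP / (TP + FN)                   # 37 / 39, about 0.949
specificity <- TN / (TN + FP)                   # 61 / 61 = 1

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
```

Sensitivity is the metric that exposes the false negatives: 2 of the 39 malignant masses were missed even though overall accuracy is 98%.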

Improving model performance

  1. Rescale the numeric features with a different method
  2. Try several different values of k

z-score standardization

  • Use scale()
  • scale() can be applied directly to a data frame
wbcd_z <- as.data.frame(scale(wbcd[-1]))
summary(wbcd_z$area_mean)
> summary(wbcd_z$area_mean)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.4532 -0.6666 -0.2949  0.0000  0.3632  5.2459 
  • Rescales every feature except diagnosis in the first column
  • The mean of a z-score standardized variable is always 0, and its range should be fairly compact
  • A z-score greater than 3 or less than -3 indicates an extremely rare value
wbcd_train <- wbcd_z[1:469, ]
wbcd_test <- wbcd_z[470:569, ]
wbcd_train_label <- wbcd[1:469, 1]
wbcd_test_label <- wbcd[470:569, 1]
wbcd_test_pred <- knn(train=wbcd_train, test=wbcd_test,
                      cl=wbcd_train_label, k=21)
CrossTable(x=wbcd_test_label, y=wbcd_test_pred,
           prop.chisq=F)
> CrossTable(x=wbcd_test_label, y=wbcd_test_pred,
+            prop.chisq=F)

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
                | wbcd_test_pred 
wbcd_test_label |    Benign | Malignant | Row Total | 
----------------|-----------|-----------|-----------|
         Benign |        61 |         0 |        61 | 
                |     1.000 |     0.000 |     0.610 | 
                |     0.924 |     0.000 |           | 
                |     0.610 |     0.000 |           | 
----------------|-----------|-----------|-----------|
      Malignant |         5 |        34 |        39 | 
                |     0.128 |     0.872 |     0.390 | 
                |     0.076 |     1.000 |           | 
                |     0.050 |     0.340 |           | 
----------------|-----------|-----------|-----------|
   Column Total |        66 |        34 |       100 | 
                |     0.660 |     0.340 |           | 
----------------|-----------|-----------|-----------|
  • Accuracy decreased (from 98% to 95%)
  • The result is worse: false negatives rose from 2 to 5

Testing alternative values of k

Try k values of 1, 5, 11, 15, 21, and 27

# vary the value of k
k.val <- c(1, 5, 11, 15, 21, 27)
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]
wbcd_train_label <- wbcd[1:469, 1]
wbcd_test_label <- wbcd[470:569, 1]
for (i in k.val) {
  wbcd_test_pred <- knn(train=wbcd_train, test=wbcd_test,
                        cl=wbcd_train_label, k=i)
  # CrossTable() prints the full table itself; $t extracts the raw count matrix
  print(CrossTable(x=wbcd_test_label, y=wbcd_test_pred,
             prop.chisq=F)$t)
}
> for (i in k.val) {
+   wbcd_test_pred <- knn(train=wbcd_train, test=wbcd_test,
+                         cl=wbcd_train_label, k=i)
+   print(CrossTable(x=wbcd_test_label, y=wbcd_test_pred,
+              prop.chisq=F)$t)
+ }

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
                | wbcd_test_pred 
wbcd_test_label |    Benign | Malignant | Row Total | 
----------------|-----------|-----------|-----------|
         Benign |        59 |         2 |        61 | 
                |     0.967 |     0.033 |     0.610 | 
                |     0.952 |     0.053 |           | 
                |     0.590 |     0.020 |           | 
----------------|-----------|-----------|-----------|
      Malignant |         3 |        36 |        39 | 
                |     0.077 |     0.923 |     0.390 | 
                |     0.048 |     0.947 |           | 
                |     0.030 |     0.360 |           | 
----------------|-----------|-----------|-----------|
   Column Total |        62 |        38 |       100 | 
                |     0.620 |     0.380 |           | 
----------------|-----------|-----------|-----------|

 
           y
x           Benign Malignant
  Benign        59         2
  Malignant      3        36

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
                | wbcd_test_pred 
wbcd_test_label |    Benign | Malignant | Row Total | 
----------------|-----------|-----------|-----------|
         Benign |        60 |         1 |        61 | 
                |     0.984 |     0.016 |     0.610 | 
                |     0.968 |     0.026 |           | 
                |     0.600 |     0.010 |           | 
----------------|-----------|-----------|-----------|
      Malignant |         2 |        37 |        39 | 
                |     0.051 |     0.949 |     0.390 | 
                |     0.032 |     0.974 |           | 
                |     0.020 |     0.370 |           | 
----------------|-----------|-----------|-----------|
   Column Total |        62 |        38 |       100 | 
                |     0.620 |     0.380 |           | 
----------------|-----------|-----------|-----------|

 
           y
x           Benign Malignant
  Benign        60         1
  Malignant      2        37

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
                | wbcd_test_pred 
wbcd_test_label |    Benign | Malignant | Row Total | 
----------------|-----------|-----------|-----------|
         Benign |        60 |         1 |        61 | 
                |     0.984 |     0.016 |     0.610 | 
                |     0.952 |     0.027 |           | 
                |     0.600 |     0.010 |           | 
----------------|-----------|-----------|-----------|
      Malignant |         3 |        36 |        39 | 
                |     0.077 |     0.923 |     0.390 | 
                |     0.048 |     0.973 |           | 
                |     0.030 |     0.360 |           | 
----------------|-----------|-----------|-----------|
   Column Total |        63 |        37 |       100 | 
                |     0.630 |     0.370 |           | 
----------------|-----------|-----------|-----------|

 
           y
x           Benign Malignant
  Benign        60         1
  Malignant      3        36

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
                | wbcd_test_pred 
wbcd_test_label |    Benign | Malignant | Row Total | 
----------------|-----------|-----------|-----------|
         Benign |        61 |         0 |        61 | 
                |     1.000 |     0.000 |     0.610 | 
                |     0.953 |     0.000 |           | 
                |     0.610 |     0.000 |           | 
----------------|-----------|-----------|-----------|
      Malignant |         3 |        36 |        39 | 
                |     0.077 |     0.923 |     0.390 | 
                |     0.047 |     1.000 |           | 
                |     0.030 |     0.360 |           | 
----------------|-----------|-----------|-----------|
   Column Total |        64 |        36 |       100 | 
                |     0.640 |     0.360 |           | 
----------------|-----------|-----------|-----------|

 
           y
x           Benign Malignant
  Benign        61         0
  Malignant      3        36

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
                | wbcd_test_pred 
wbcd_test_label |    Benign | Malignant | Row Total | 
----------------|-----------|-----------|-----------|
         Benign |        61 |         0 |        61 | 
                |     1.000 |     0.000 |     0.610 | 
                |     0.924 |     0.000 |           | 
                |     0.610 |     0.000 |           | 
----------------|-----------|-----------|-----------|
      Malignant |         5 |        34 |        39 | 
                |     0.128 |     0.872 |     0.390 | 
                |     0.076 |     1.000 |           | 
                |     0.050 |     0.340 |           | 
----------------|-----------|-----------|-----------|
   Column Total |        66 |        34 |       100 | 
                |     0.660 |     0.340 |           | 
----------------|-----------|-----------|-----------|

 
           y
x           Benign Malignant
  Benign        61         0
  Malignant      5        34

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
                | wbcd_test_pred 
wbcd_test_label |    Benign | Malignant | Row Total | 
----------------|-----------|-----------|-----------|
         Benign |        61 |         0 |        61 | 
                |     1.000 |     0.000 |     0.610 | 
                |     0.924 |     0.000 |           | 
                |     0.610 |     0.000 |           | 
----------------|-----------|-----------|-----------|
      Malignant |         5 |        34 |        39 | 
                |     0.128 |     0.872 |     0.390 | 
                |     0.076 |     1.000 |           | 
                |     0.050 |     0.340 |           | 
----------------|-----------|-----------|-----------|
   Column Total |        66 |        34 |       100 | 
                |     0.660 |     0.340 |           | 
----------------|-----------|-----------|-----------|

 
           y
x           Benign Malignant
  Benign        61         0
  Malignant      5        34

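Summary

| k value | FN | FP | Error rate |
|---------|----|----|------------|
| 1       | 3  | 2  | 5%         |
| 5       | 2  | 1  | 3%         |
| 11      | 3  | 1  | 4%         |
| 15      | 3  | 0  | 3%         |
| 21      | 5  | 0  | 5%         |
| 27      | 5  | 0  | 5%         |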
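
The error rates in the summary can be recomputed from the FN and FP counts. The sketch below assumes the 100-example test set used throughout. Ranking by false negatives first is an assumption about priorities made here, not something the results dictate.

```r
# Counts copied from the summary table (100 test examples)
results <- data.frame(
  k  = c(1, 5, 11, 15, 21, 27),
  FN = c(3, 2, 3, 3, 5, 5),
  FP = c(2, 1, 1, 0, 0, 0)
)
n_test <- 100
results$error_rate <- (results$FN + results$FP) / n_test

# Rank by false negatives first, then by overall error rate
results[order(results$FN, results$error_rate), ]
```

Under this ordering k = 5 comes first; k = 15 has the same 3% error rate but one more false negative.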