In [1]:
library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(caret)
library(stringr)

Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats

Attaching package: ‘testthat’

The following object is masked from ‘package:dplyr’:

    matches

The following object is masked from ‘package:purrr’:

    is_null

Loading required package: lattice

Attaching package: ‘caret’

The following object is masked from ‘package:purrr’:

    lift



## “Who’s that Pokémon?”: Predicting Pokémon Types and Legendary Status from Their Generation and Various Attributes Using k-NN Classification

In [2]:
pokemon <- read_csv("https://raw.githubusercontent.com/UBC-DSCI/datasets/master/pokemon/Pokemon.csv")
head(pokemon)

Parsed with column specification:
cols(
  `#` = col_integer(),
  Name = col_character(),
  `Type 1` = col_character(),
  `Type 2` = col_character(),
  Total = col_integer(),
  HP = col_integer(),
  Attack = col_integer(),
  Defense = col_integer(),
  `Sp. Atk` = col_integer(),
  `Sp. Def` = col_integer(),
  Speed = col_integer(),
  Generation = col_integer(),
  Legendary = col_character()
)


#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,False


As loaded, the data needs to be wrangled and cleaned before it can be used for classification. This involves renaming columns, removing unnecessary columns, adding new columns, and scaling the data.

In [3]:
# Rename the columns to remove spaces and punctuation (e.g. `Type 1`, `Sp. Atk`) and make the names more descriptive.
names(pokemon) <- c("Number", "Name", "Type_1", "Type_2", "Total", "Hit_Points", "Attack", "Defense",
                    "Special_Attack", "Special_Defense", "Speed", "Generation", "Legendary")
head(pokemon)

Number,Name,Type_1,Type_2,Total,Hit_Points,Attack,Defense,Special_Attack,Special_Defense,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,False


In [4]:
# removing the unnecessary columns
cleaned_pokemon <- pokemon %>% 
                    select(-Number, -Name)
head(cleaned_pokemon)

Type_1,Type_2,Total,Hit_Points,Attack,Defense,Special_Attack,Special_Defense,Speed,Generation,Legendary
Grass,Poison,318,45,49,49,65,65,45,1,False
Grass,Poison,405,60,62,63,80,80,60,1,False
Grass,Poison,525,80,82,83,100,100,80,1,False
Grass,Poison,625,80,100,123,122,120,80,1,False
Fire,,309,39,52,43,60,50,65,1,False
Fire,,405,58,64,58,80,65,80,1,False


Below we create new columns: Type_summary and the indicator columns Gen_1 through Gen_6.
Type_summary labels each Pokémon as having one type ("Mono") or two types ("Dual"), so that the data can be separated for further classification. Generation is split into six separate 0/1 columns because it is a categorical variable: used as a single numeric feature, the generation number would impose an ordering and spacing between generations that the distance calculation in k-NN would treat as meaningful.
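To illustrate the reasoning about Generation, the following minimal sketch (base R, toy values rather than rows from the dataset) compares Euclidean distances under the two encodings. With the raw generation number, generations 1 and 6 are five times farther apart than generations 1 and 2; with 0/1 indicator columns, every pair of different generations is equally far apart. The helper names `one_hot` and `euclid` are made up for this illustration.

```r
# Hypothetical helper: encode a generation number (1-6) as six 0/1 indicators.
one_hot <- function(gen, levels = 1:6) as.numeric(levels == gen)

# Euclidean distance between two numeric vectors.
euclid <- function(a, b) sqrt(sum((a - b)^2))

# Raw integer coding: the distance grows with the gap between generation numbers.
d_int_12 <- euclid(1, 2)
d_int_16 <- euclid(1, 6)

# Indicator coding: any two different generations are sqrt(2) apart.
d_hot_12 <- euclid(one_hot(1), one_hot(2))
d_hot_16 <- euclid(one_hot(1), one_hot(6))

c(d_int_12, d_int_16, d_hot_12, d_hot_16)
```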

In [5]:
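# Create new columns:
#   Type_sum     - TRUE if both types are equal, NA for single-type Pokémon (dropped again in the scaling step)
#   Gen_1..Gen_6 - 0/1 indicator columns, one per generation
#   Type_summary - "Mono" if Type_2 is missing, otherwise "Dual"
cleaned_pokemon2 <- cleaned_pokemon %>% mutate(Type_sum = (Type_1 == Type_2),
                                               Gen_1 = as.numeric(Generation == 1),
                                               Gen_2 = as.numeric(Generation == 2),
                                               Gen_3 = as.numeric(Generation == 3),
                                               Gen_4 = as.numeric(Generation == 4),
                                               Gen_5 = as.numeric(Generation == 5),
                                               Gen_6 = as.numeric(Generation == 6),
                                               Type_summary = if_else(is.na(Type_2), "Mono", "Dual"))

head(cleaned_pokemon2, n = 10)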



Type_1,Type_2,Total,Hit_Points,Attack,Defense,Special_Attack,Special_Defense,Speed,Generation,Legendary,Type_sum,Gen_1,Gen_2,Gen_3,Gen_4,Gen_5,Gen_6,Type_summary
Grass,Poison,318,45,49,49,65,65,45,1,False,False,1,0,0,0,0,0,Dual
Grass,Poison,405,60,62,63,80,80,60,1,False,False,1,0,0,0,0,0,Dual
Grass,Poison,525,80,82,83,100,100,80,1,False,False,1,0,0,0,0,0,Dual
Grass,Poison,625,80,100,123,122,120,80,1,False,False,1,0,0,0,0,0,Dual
Fire,,309,39,52,43,60,50,65,1,False,,1,0,0,0,0,0,Mono
Fire,,405,58,64,58,80,65,80,1,False,,1,0,0,0,0,0,Mono
Fire,Flying,534,78,84,78,109,85,100,1,False,False,1,0,0,0,0,0,Dual
Fire,Dragon,634,78,130,111,130,85,100,1,False,False,1,0,0,0,0,0,Dual
Fire,Flying,634,78,104,78,159,115,100,1,False,False,1,0,0,0,0,0,Dual
Water,,314,44,48,65,50,64,43,1,False,,1,0,0,0,0,0,Mono


In [6]:
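# Scale the relevant columns so that attributes with larger magnitudes do not dominate
# the distance calculation and bias the predictions.
# With center = FALSE, scale() divides each column by its root mean square.
# Also remove the columns that are no longer needed.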
scaled_pokemon <- cleaned_pokemon2 %>%
                                    mutate(scaled_Total = scale(Total,center = FALSE), 
                                    scaled_Hit_Points = scale(Hit_Points, center = FALSE), 
                                    scaled_Attack = scale(Attack, center = FALSE),
                                    scaled_Defense = scale(Defense, center = FALSE),
                                    scaled_Special_Attack = scale(Special_Attack, center = FALSE),
                                    scaled_Special_Defense = scale(Special_Defense, center = FALSE),
                                    scaled_Speed = scale(Speed, center = FALSE),
                                    scaled_Gen_1 = scale(Gen_1, center = FALSE),
                                    scaled_Gen_2 = scale(Gen_2, center = FALSE),
                                    scaled_Gen_3 = scale(Gen_3, center = FALSE),
                                    scaled_Gen_4 = scale(Gen_4, center = FALSE),
                                    scaled_Gen_5 = scale(Gen_5, center = FALSE),
                                    scaled_Gen_6 = scale(Gen_6, center = FALSE)) %>%
                                    select(-Type_sum, -Generation,-Total, -Hit_Points, -Attack, -Defense, -Special_Attack, 
                                           -Special_Defense, -Speed, -Gen_1, -Gen_2, -Gen_3, -Gen_4, -Gen_5, -Gen_6)
head(scaled_pokemon)

Type_1,Type_2,Legendary,Type_summary,scaled_Total,scaled_Hit_Points,scaled_Attack,scaled_Defense,scaled_Special_Attack,scaled_Special_Defense,scaled_Speed,scaled_Gen_1,scaled_Gen_2,scaled_Gen_3,scaled_Gen_4,scaled_Gen_5,scaled_Gen_6
Grass,Poison,False,Dual,0.7041635,0.6092888,0.5734039,0.6109768,0.8137637,0.8426019,0.6061101,2.193913,0,0,0,0,0
Grass,Poison,False,Dual,0.896812,0.8123851,0.7255314,0.7855417,1.0015554,1.0370485,0.8081468,2.193913,0,0,0,0,0
Grass,Poison,False,Dual,1.1625341,1.0831801,0.9595738,1.03492,1.2519442,1.2963107,1.077529,2.193913,0,0,0,0,0
Grass,Poison,False,Dual,1.3839691,1.0831801,1.170212,1.5336766,1.5273719,1.5555728,1.077529,2.193913,0,0,0,0,0
Fire,,False,Mono,0.6842343,0.5280503,0.6085102,0.5361634,0.7511665,0.6481553,0.8754923,2.193913,0,0,0,0,0
Fire,,False,Mono,0.896812,0.7853056,0.7489357,0.7231971,1.0015554,0.8426019,1.077529,2.193913,0,0,0,0,0
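The value 2.193913 in scaled_Gen_1 may look surprising for a 0/1 column. With `center = FALSE`, `scale()` divides each column by its root mean square, sqrt(sum(x^2) / (n - 1)), rather than by its standard deviation, so a sparse indicator column ends up with values greater than 1. A minimal sketch on a toy vector (made up for illustration, not taken from the dataset):

```r
# Toy 0/1 indicator column, made up for illustration.
x <- c(1, 0, 0, 0, 0)

# Root mean square as used by scale() when center = FALSE.
rms <- sqrt(sum(x^2) / (length(x) - 1))

# scale() returns a one-column matrix; as.numeric() flattens it to a plain vector.
scaled_x <- as.numeric(scale(x, center = FALSE))

# Dividing by the root mean square by hand gives the same result.
manual_x <- x / rms

scaled_x
```

Because `scale()` returns a matrix, wrapping each call in `as.numeric()` inside `mutate()` is one way to keep the new columns as ordinary numeric vectors.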


In [12]:
#knn using full data (model 1)
set.seed(151)
training_rows <- scaled_pokemon %>%
    select(Legendary) %>%
    unlist() %>%
    createDataPartition(p=0.75, list=FALSE)

scaled_pokemon <- scaled_pokemon %>%
    mutate(Legendary=as.factor(Legendary))

scaled_pokemon$Type_2[is.na(scaled_pokemon$Type_2)] <- 'Mono'

X_train <- scaled_pokemon %>% 
    select(scaled_Total, scaled_Hit_Points, scaled_Attack, scaled_Defense, scaled_Special_Attack, scaled_Special_Defense, scaled_Speed) %>% 
    slice(training_rows) %>% 
    data.frame()

Y_train_legendary <- scaled_pokemon %>% 
    select(Legendary) %>% 
    slice(training_rows) %>% 
    unlist()

Y_train_type1 <- scaled_pokemon %>% 
    select(Type_1) %>% 
    slice(training_rows) %>% 
    unlist()

Y_train_type2 <- scaled_pokemon %>% 
    select(Type_2) %>% 
    slice(training_rows) %>% 
    unlist()

X_test <- scaled_pokemon %>% 
    select(scaled_Total, scaled_Hit_Points, scaled_Attack, scaled_Defense, scaled_Special_Attack, scaled_Special_Defense, scaled_Speed) %>% 
    slice(-training_rows) %>% 
    data.frame()

Y_test_legendary <- scaled_pokemon %>% 
    select(Legendary) %>% 
    slice(-training_rows) %>%
    unlist()

Y_test_type1 <- scaled_pokemon %>% 
    select(Type_1) %>% 
    slice(-training_rows) %>%  # held-out rows, not the training rows
    unlist()

Y_test_type2 <- scaled_pokemon %>% 
    select(Type_2) %>% 
    slice(-training_rows) %>%  # held-out rows, not the training rows
    unlist()

ks <- data.frame(k=c(1:11))
train_control <- trainControl(method='cv', number=3)
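The split above relies on negative indexing: `slice(-training_rows)` keeps every row that is not in `training_rows`, so test labels must be built with the negated index to line up with `X_test`. The sketch below shows the same idea in base R on toy labels. It uses `sample()` as a simple stand-in for `createDataPartition()`, which means the split here is not stratified, and all values are made up.

```r
set.seed(151)

# Toy data: 20 rows with an imbalanced label, made up for illustration.
n <- 20
labels <- rep(c("False", "True"), times = c(16, 4))

# Stand-in for createDataPartition(): pick 75% of the row indices at random.
train_idx <- sort(sample(n, size = 15))
test_idx <- setdiff(seq_len(n), train_idx)

y_train <- labels[train_idx]
y_test <- labels[-train_idx]   # negative index = every row NOT in train_idx

# Sanity checks: the two sets cover all rows exactly once,
# and negative indexing selects the same rows as the explicit complement.
covers_all <- length(y_train) + length(y_test) == n
same_rows <- identical(y_test, labels[test_idx])
c(covers_all, same_rows)
```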

In [8]:
install.packages("e1071", dependencies=TRUE, type='source')
library(e1071)

also installing the dependencies ‘cluster’, ‘mlbench’, ‘SparseM’

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done


In [15]:
set.seed(151)
knn_cv_legendary <- train(x=X_train, y=Y_train_legendary, method='knn', tuneGrid=ks, trControl=train_control)
knn_cv_legendary

k-Nearest Neighbors 

601 samples
  7 predictor
  2 classes: 'False', 'True' 

No pre-processing
Resampling: Cross-Validated (3 fold) 
Summary of sample sizes: 400, 401, 401 
Resampling results across tuning parameters:

  k   Accuracy   Kappa    
   1  0.9351078  0.5623313
   2  0.9317828  0.5132229
   3  0.9400995  0.5832989
   4  0.9301078  0.5173349
   5  0.9434163  0.6142356
   6  0.9434163  0.6047997
   7  0.9450912  0.6336299
   8  0.9384245  0.5761218
   9  0.9384245  0.5811766
  10  0.9384245  0.5704622
  11  0.9367579  0.5550565

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 7.
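
The Kappa column in the output above is Cohen's kappa, which compares the observed accuracy with the accuracy expected by chance given the class frequencies. This is worth reading alongside accuracy here because the Legendary classes are imbalanced, so a high accuracy on its own can be misleading. A minimal sketch of the calculation follows; the function name `cohen_kappa` and the toy vectors are hypothetical, and this is not caret's internal implementation.

```r
# Cohen's kappa from two label vectors: (p_obs - p_exp) / (1 - p_exp).
cohen_kappa <- function(truth, pred) {
  lv <- union(truth, pred)
  tab <- table(factor(truth, lv), factor(pred, lv))   # confusion matrix
  n <- sum(tab)
  p_obs <- sum(diag(tab)) / n                         # observed agreement (accuracy)
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2     # agreement expected by chance
  (p_obs - p_exp) / (1 - p_exp)
}

# Toy labels, made up for illustration: 18 "False" and 2 "True".
truth <- c(rep("False", 18), rep("True", 2))

# A classifier that always predicts the majority class.
pred_majority <- rep("False", 20)
acc_majority <- mean(truth == pred_majority)
kappa_majority <- cohen_kappa(truth, pred_majority)

# A classifier that finds one of the two "True" cases.
pred_better <- c(rep("False", 19), "True")
acc_better <- mean(truth == pred_better)
kappa_better <- cohen_kappa(truth, pred_better)

c(acc_majority, kappa_majority, acc_better, kappa_better)
```

In this toy case the majority-class predictor reaches 90% accuracy with a kappa of 0, which is why the two columns are best read together.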

In [16]:
set.seed(151)
knn_cv_type1 <- train(x=X_train, y=Y_train_type1, method='knn', tuneGrid=ks, trControl=train_control)
knn_cv_type1

k-Nearest Neighbors 

601 samples
  7 predictor
 18 classes: 'Bug', 'Dark', 'Dragon', 'Electric', 'Fairy', 'Fighting', 'Fire', 'Flying', 'Ghost', 'Grass', 'Ground', 'Ice', 'Normal', 'Poison', 'Psychic', 'Rock', 'Steel', 'Water' 

No pre-processing
Resampling: Cross-Validated (3 fold) 
Summary of sample sizes: 398, 403, 401 
Resampling results across tuning parameters:

  k   Accuracy   Kappa    
   1  0.2063458  0.1402913
   2  0.1914936  0.1231304
   3  0.1932276  0.1235876
   4  0.1865363  0.1145878
   5  0.2180150  0.1464921
   6  0.2247243  0.1515009
   7  0.2163729  0.1416106
   8  0.2264571  0.1511181
   9  0.2198138  0.1417513
  10  0.2280240  0.1497314
  11  0.2280150  0.1503588

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 10.

In [17]:
set.seed(151)
knn_cv_type2 <- train(x=X_train, y=Y_train_type2, method='knn', tuneGrid=ks, trControl=train_control)
knn_cv_type2

k-Nearest Neighbors 

601 samples
  7 predictor
 19 classes: 'Bug', 'Dark', 'Dragon', 'Electric', 'Fairy', 'Fighting', 'Fire', 'Flying', 'Ghost', 'Grass', 'Ground', 'Ice', 'Mono', 'Normal', 'Poison', 'Psychic', 'Rock', 'Steel', 'Water' 

No pre-processing
Resampling: Cross-Validated (3 fold) 
Summary of sample sizes: 400, 403, 399 
Resampling results across tuning parameters:

  k   Accuracy   Kappa     
   1  0.3094819  0.04354262
   2  0.2962144  0.03191236
   3  0.3543180  0.05429599
   4  0.3828042  0.05792460
   5  0.4194151  0.08142840
   6  0.4526764  0.09193225
   7  0.4593263  0.07631357
   8  0.4558926  0.06602656
   9  0.4576264  0.05908062
  10  0.4677192  0.05595406
  11  0.4609518  0.02872364

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 10.
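
One way to put these accuracies in context is a majority-class baseline: the accuracy obtained by always predicting the most frequent label. Since all single-type Pokémon were relabelled 'Mono', that class may be large, and the low kappa values above suggest the Type_2 model may not improve much on such a baseline. A sketch with a hypothetical helper name and toy labels (not taken from the dataset):

```r
# Hypothetical helper: accuracy of always predicting the most frequent label.
majority_baseline <- function(y) {
  counts <- table(y)
  max(counts) / length(y)
}

# Toy Type_2 labels, made up for illustration.
toy_type2 <- c("Mono", "Mono", "Mono", "Flying", "Poison", "Mono", "Ground", "Mono")

baseline <- majority_baseline(toy_type2)
baseline
```

In the notebook this could be applied as `majority_baseline(Y_train_type2)` and compared with the cross-validated accuracy reported by `train()`.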