In [8]:
library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(caret)
library(stringr)
install.packages("e1071", dependencies=TRUE, type='source')
library(e1071)
install.packages("GGally")
library(GGally)

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
also installing the dependencies ‘progress’, ‘reshape’

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done

Attaching package: ‘GGally’

The following object is masked from ‘package:dplyr’:

    nasa



# “Who’s that Pokémon?”: Predicting Pokémon Types and Legendary Status from Generation and Attributes Using k-nn Classification

## Introduction

On February 27, 2019, the 23rd anniversary of the Pokémon franchise, Nintendo announced its latest Pokémon games, which will introduce the eighth generation of Pokémon. What unique Pokémon might we expect in these new games? The goal of this project is to train a k-nearest neighbours classification model that can predict a given Pokémon’s Type(s) as well as its Legendary status from its Generation number and Stats (measures of a Pokémon’s capabilities). To train this model, we will use the Pokémon Stats Data Set, which contains the Names, Types, Generation, Stats, and Legendary status of 800 Pokémon from six generations.

In [9]:
pokemon <- read_csv("https://raw.githubusercontent.com/UBC-DSCI/datasets/master/pokemon/Pokemon.csv")
head(pokemon)

Parsed with column specification:
cols(
  `#` = col_integer(),
  Name = col_character(),
  `Type 1` = col_character(),
  `Type 2` = col_character(),
  Total = col_integer(),
  HP = col_integer(),
  Attack = col_integer(),
  Defense = col_integer(),
  `Sp. Atk` = col_integer(),
  `Sp. Def` = col_integer(),
  Speed = col_integer(),
  Generation = col_integer(),
  Legendary = col_character()
)


#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,False


## Data Wrangling

In its raw form, the data needs to be wrangled and cleaned before it can be used for classification. This process involves renaming the columns, removing unnecessary columns, adding new columns, and scaling the data.

In [10]:
#renaming the columns: removing the spaces in "Type 1" and "Type 2" and making the other names more descriptive.
names(pokemon) <- c("Number", "Name", "Type_1", "Type_2", "Total", "Hit_Points", "Attack", "Defense", 
                    "Special_Attack", "Special_Defense", "Speed", "Generation", "Legendary")
head(pokemon)

In [11]:
# removing the unnecessary columns
cleaned_pokemon <- pokemon %>% 
    select(-Number, -Name)
head(cleaned_pokemon)

Type_1,Type_2,Total,Hit_Points,Attack,Defense,Special_Attack,Special_Defense,Speed,Generation,Legendary
Grass,Poison,318,45,49,49,65,65,45,1,False
Grass,Poison,405,60,62,63,80,80,60,1,False
Grass,Poison,525,80,82,83,100,100,80,1,False
Grass,Poison,625,80,100,123,122,120,80,1,False
Fire,,309,39,52,43,60,50,65,1,False
Fire,,405,58,64,58,80,65,80,1,False


Below we create new columns: the Type_summary column and the indicator columns Gen_1 through Gen_6.
The Type_summary column is needed because it will be used to separate Pokémon with one type (Mono) from those with two types (Dual) for further classification. The Generation column is split into six separate indicator columns because it is a categorical variable: its numeric codes are only labels, so the distances between them are not meaningful for k-nn.
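
To see why integer Generation codes are awkward for a distance-based method, here is a small sketch in base R. The toy vector `generation` and the helper `one_hot` are hypothetical names introduced only for illustration; they are not part of the analysis.

```r
# Toy example (hypothetical data): three rows from generations 1, 2 and 6.
generation <- c(1, 2, 6)

# Treating Generation as a number makes generation 6 look five times
# farther from generation 1 than generation 2 is.
dist_numeric <- as.matrix(dist(generation))

# One indicator column per generation, like the Gen_1 ... Gen_6 columns.
one_hot <- function(g, levels = 1:6) {
  sapply(levels, function(l) as.numeric(g == l))
}
encoded <- one_hot(generation)
colnames(encoded) <- paste0("Gen_", 1:6)

# With indicator columns, every pair of distinct generations is equally
# far apart (Euclidean distance sqrt(2)).
dist_one_hot <- as.matrix(dist(encoded))

print(dist_numeric)
print(dist_one_hot)
```

With the indicator encoding, no generation is treated as "closer" to another purely because of how it happens to be numbered.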

In [12]:
#creating new columns 
cleaned_pokemon2 <- cleaned_pokemon %>% 
    mutate(Type_sum =(Type_1 == Type_2),
    Gen_1 = as.numeric(Generation == 1),
    Gen_2 = as.numeric(Generation == 2),
    Gen_3 = as.numeric(Generation == 3),
    Gen_4 = as.numeric(Generation == 4),
    Gen_5 = as.numeric(Generation == 5),
    Gen_6 = as.numeric(Generation == 6),
    Type_summary = if_else(is.na(`Type_2`), "Mono", "Dual"))
                                        
head(cleaned_pokemon2, n = 10)

Type_1,Type_2,Total,Hit_Points,Attack,Defense,Special_Attack,Special_Defense,Speed,Generation,Legendary,Type_sum,Gen_1,Gen_2,Gen_3,Gen_4,Gen_5,Gen_6,Type_summary
Grass,Poison,318,45,49,49,65,65,45,1,False,False,1,0,0,0,0,0,Dual
Grass,Poison,405,60,62,63,80,80,60,1,False,False,1,0,0,0,0,0,Dual
Grass,Poison,525,80,82,83,100,100,80,1,False,False,1,0,0,0,0,0,Dual
Grass,Poison,625,80,100,123,122,120,80,1,False,False,1,0,0,0,0,0,Dual
Fire,,309,39,52,43,60,50,65,1,False,,1,0,0,0,0,0,Mono
Fire,,405,58,64,58,80,65,80,1,False,,1,0,0,0,0,0,Mono
Fire,Flying,534,78,84,78,109,85,100,1,False,False,1,0,0,0,0,0,Dual
Fire,Dragon,634,78,130,111,130,85,100,1,False,False,1,0,0,0,0,0,Dual
Fire,Flying,634,78,104,78,159,115,100,1,False,False,1,0,0,0,0,0,Dual
Water,,314,44,48,65,50,64,43,1,False,,1,0,0,0,0,0,Mono


In [13]:
#scaling the relevant columns so that attributes with larger magnitudes do not dominate the distance
#calculation and bias the predictions.
#Also removing unnecessary columns.
scaled_pokemon <- cleaned_pokemon2 %>%
    mutate(scaled_Total = scale(Total,center = FALSE), 
    scaled_Hit_Points = scale(Hit_Points, center = FALSE), 
    scaled_Attack = scale(Attack, center = FALSE),
    scaled_Defense = scale(Defense, center = FALSE),
    scaled_Special_Attack = scale(Special_Attack, center = FALSE),
    scaled_Special_Defense = scale(Special_Defense, center = FALSE),
    scaled_Speed = scale(Speed, center = FALSE),
    scaled_Gen_1 = scale(Gen_1, center = FALSE),
    scaled_Gen_2 = scale(Gen_2, center = FALSE),
    scaled_Gen_3 = scale(Gen_3, center = FALSE),
    scaled_Gen_4 = scale(Gen_4, center = FALSE),
    scaled_Gen_5 = scale(Gen_5, center = FALSE),
    scaled_Gen_6 = scale(Gen_6, center = FALSE)) %>%
    select(-Type_sum, -Generation,-Total, -Hit_Points, -Attack, -Defense, -Special_Attack, 
           -Special_Defense, -Speed, -Gen_1, -Gen_2, -Gen_3, -Gen_4, -Gen_5, -Gen_6)
head(scaled_pokemon)

Type_1,Type_2,Legendary,Type_summary,scaled_Total,scaled_Hit_Points,scaled_Attack,scaled_Defense,scaled_Special_Attack,scaled_Special_Defense,scaled_Speed,scaled_Gen_1,scaled_Gen_2,scaled_Gen_3,scaled_Gen_4,scaled_Gen_5,scaled_Gen_6
Grass,Poison,False,Dual,0.7041635,0.6092888,0.5734039,0.6109768,0.8137637,0.8426019,0.6061101,2.193913,0,0,0,0,0
Grass,Poison,False,Dual,0.896812,0.8123851,0.7255314,0.7855417,1.0015554,1.0370485,0.8081468,2.193913,0,0,0,0,0
Grass,Poison,False,Dual,1.1625341,1.0831801,0.9595738,1.03492,1.2519442,1.2963107,1.077529,2.193913,0,0,0,0,0
Grass,Poison,False,Dual,1.3839691,1.0831801,1.170212,1.5336766,1.5273719,1.5555728,1.077529,2.193913,0,0,0,0,0
Fire,,False,Mono,0.6842343,0.5280503,0.6085102,0.5361634,0.7511665,0.6481553,0.8754923,2.193913,0,0,0,0,0
Fire,,False,Mono,0.896812,0.7853056,0.7489357,0.7231971,1.0015554,0.8426019,1.077529,2.193913,0,0,0,0,0
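
As a check on what the scaling step does: with `center = FALSE`, `scale()` divides each column by its root mean square (computed with `n - 1` in the denominator) rather than by its standard deviation. Below is a minimal sketch using only the six `Hit_Points` values shown earlier, so the numbers will not match the table above, which was scaled using all rows. The names `hp`, `rms`, `manual` and `scaled` are illustrative.

```r
# Hit_Points of the first six rows shown earlier (a toy subset, so the
# results will not match the full-data table).
hp <- c(45, 60, 80, 80, 39, 58)

# Root mean square as used by scale(center = FALSE): sqrt(sum(x^2) / (n - 1)).
rms <- sqrt(sum(hp^2) / (length(hp) - 1))
manual <- hp / rms

# scale() returns a one-column matrix; as.numeric() drops the matrix
# structure and attributes so the result is a plain vector.
scaled <- as.numeric(scale(hp, center = FALSE))

print(all.equal(manual, scaled))
```

Wrapping each `scale()` call in `as.numeric()` inside `mutate()` would likewise keep the new columns as plain numeric vectors instead of one-column matrices, which can be easier to work with downstream.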


## Exploratory Data Visualisations

(Include visualisations)

## Models

(write-up about what are the three models we are using, why we do so, and what we expect)

In [None]:
#knn using full data (model 1)
set.seed(151)
training_rows <- scaled_pokemon %>%
    select(Legendary) %>%
    unlist() %>%
    createDataPartition(p=0.80, list=FALSE)

scaled_pokemon <- scaled_pokemon %>%
    mutate(Legendary=as.factor(Legendary))

scaled_pokemon$Type_2[is.na(scaled_pokemon$Type_2)] <- 'Not_Dual'

X_train <- scaled_pokemon %>% 
    select(scaled_Total, scaled_Hit_Points, scaled_Attack, scaled_Defense, scaled_Special_Attack, 
           scaled_Special_Defense, scaled_Speed, scaled_Gen_1, scaled_Gen_2, scaled_Gen_3,
           scaled_Gen_4, scaled_Gen_5, scaled_Gen_6) %>% 
    slice(training_rows) %>% 
    data.frame()

Y_train_legendary <- scaled_pokemon %>% 
    select(Legendary) %>% 
    slice(training_rows) %>% 
    unlist()

Y_train_type1 <- scaled_pokemon %>% 
    select(Type_1) %>% 
    slice(training_rows) %>% 
    unlist()

Y_train_type2 <- scaled_pokemon %>% 
    select(Type_2) %>% 
    slice(training_rows) %>% 
    unlist()

X_test <- scaled_pokemon %>% 
    select(scaled_Total, scaled_Hit_Points, scaled_Attack, scaled_Defense, scaled_Special_Attack, 
           scaled_Special_Defense, scaled_Speed, scaled_Gen_1, scaled_Gen_2, scaled_Gen_3,
           scaled_Gen_4, scaled_Gen_5, scaled_Gen_6) %>% 
    slice(-training_rows) %>% 
    data.frame()

Y_test_legendary <- scaled_pokemon %>% 
    select(Legendary) %>% 
    slice(-training_rows) %>%
    unlist()

Y_test_type1 <- scaled_pokemon %>% 
    select(Type_1) %>% 
    slice(-training_rows) %>% 
    unlist()

Y_test_type2 <- scaled_pokemon %>% 
    select(Type_2) %>% 
    slice(-training_rows) %>% 
    unlist()

ks <- data.frame(k=c(1:11))
train_control <- trainControl(method='cv', number=10)

In [None]:
set.seed(151)
knn_cv_legendary <- train(x=X_train, y=Y_train_legendary, method='knn', tuneGrid=ks, trControl=train_control)
knn_cv_legendary

In [None]:
set.seed(151)
knn_cv_type1 <- train(x=X_train, y=Y_train_type1, method='knn', tuneGrid=ks, trControl=train_control)
knn_cv_type1

In [None]:
set.seed(151)
knn_cv_type2 <- train(x=X_train, y=Y_train_type2, method='knn', tuneGrid=ks, trControl=train_control)
knn_cv_type2

In [None]:
#knn using mono types data only to predict legendary status (model 2)

set.seed(1234)
training_rows <- scaled_pokemon %>% 
mutate(Legendary = as.factor(Legendary)) %>%
  select(Legendary) %>% 
  unlist() %>% 
  createDataPartition(p = 0.80, list = FALSE)

# training_rows indexes the full table, so slice first and then keep the mono-type rows
training_set <- scaled_pokemon %>% slice(training_rows) %>% filter(Type_summary == "Mono")
testing_set <- scaled_pokemon %>% slice(-training_rows) %>% filter(Type_summary == "Mono")

head(training_set)
head(testing_set)
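
One detail worth checking in a split like this: the indices returned by `createDataPartition()` refer to rows of the full table, so they only pick out the intended rows if the slicing happens before the filtering. The sketch below illustrates the difference in base R on a made-up six-row table; `toy`, `train_idx` and the other names are hypothetical.

```r
# Hypothetical six-row table with a row id so we can track which rows survive.
toy <- data.frame(
  id = 1:6,
  Type_summary = c("Dual", "Mono", "Dual", "Mono", "Mono", "Dual")
)

# Pretend these row numbers of the FULL table were chosen for training.
train_idx <- c(1, 3, 4, 6)

# Index first, then filter: the Mono training rows come from the chosen rows.
train_rows <- toy[train_idx, ]
train_mono <- train_rows[train_rows$Type_summary == "Mono", ]

# Filter first, then index: positions now refer to the 3-row Mono table,
# so the same indices select different rows (out-of-range ones are dropped here).
mono_only <- toy[toy$Type_summary == "Mono", ]
valid_idx <- train_idx[train_idx <= nrow(mono_only)]
wrong_mono <- mono_only[valid_idx, ]

# Test rows obtained by indexing first, for comparison.
test_rows <- toy[-train_idx, ]
test_mono <- test_rows[test_rows$Type_summary == "Mono", ]

print(train_mono$id)  # Mono ids among the intended training rows
print(wrong_mono$id)  # ids picked by position within the Mono subset
print(test_mono$id)   # Mono ids held out for testing
```

In this toy case, filtering first ends up selecting exactly the rows that were meant to be held out, which is the kind of train/test overlap that can inflate reported accuracy.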

In [None]:
Y_status <- training_set %>% select(Legendary) %>% unlist()
X_attributes <- training_set %>% select(scaled_Hit_Points, scaled_Attack, scaled_Defense, 
                                        scaled_Special_Attack, scaled_Special_Defense, 
                                        scaled_Speed, scaled_Gen_1, scaled_Gen_2, scaled_Gen_3, 
                                        scaled_Gen_4, scaled_Gen_5, scaled_Gen_6) %>% data.frame()

k <- c(1,3,5,7,9,11)
ks <- data.frame(k)

train_control <- trainControl(method = "cv", number = 10)
choose_k <- train(x = X_attributes, y = Y_status, method = 'knn', tuneGrid = ks, trControl = train_control)
choose_k

k_accuracies <- choose_k$results
k_accuracies 
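
The `results` element of a caret `train` object is a data frame with one row per candidate `k` and an `Accuracy` column, so the best-scoring `k` can be read off programmatically rather than by eye. A minimal sketch follows, using a made-up `cv_results` table in place of `choose_k$results`; the accuracy values are invented purely for illustration.

```r
# Hypothetical stand-in for choose_k$results; the numbers are invented.
cv_results <- data.frame(
  k = c(1, 3, 5, 7, 9, 11),
  Accuracy = c(0.90, 0.93, 0.95, 0.94, 0.94, 0.92)
)

# Row with the highest cross-validated accuracy (the first one if there are ties).
best_row <- cv_results[which.max(cv_results$Accuracy), ]
best_k <- best_row$k

# Data frame in the shape expected by the tuneGrid argument of train().
final_k <- data.frame(k = best_k)
print(final_k)
```

caret also stores the selected tuning value in the `bestTune` element of the fitted object, which could be used in the same way.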

In [None]:

Y_train <- training_set %>% select(Legendary) %>% unlist()
X_train <- training_set %>% select(scaled_Hit_Points, scaled_Attack, scaled_Defense, 
                                        scaled_Special_Attack, scaled_Special_Defense, 
                                        scaled_Speed, scaled_Gen_1, scaled_Gen_2, scaled_Gen_3, 
                                        scaled_Gen_4, scaled_Gen_5, scaled_Gen_6) %>% data.frame()


final_k = data.frame(k = 1)
final_classifier_legendary <- train(x = X_train, y = Y_train, method = "knn", tuneGrid = final_k)
final_classifier_legendary


X_test <- testing_set %>% 
    select(scaled_Hit_Points, scaled_Attack, scaled_Defense, 
        scaled_Special_Attack, scaled_Special_Defense, 
        scaled_Speed, scaled_Gen_1, scaled_Gen_2, scaled_Gen_3, 
        scaled_Gen_4, scaled_Gen_5, scaled_Gen_6) %>% 
        data.frame()

Y_test <- testing_set %>% 
    select(Legendary) %>% 
    unlist()
test_pred <- predict(final_classifier_legendary, X_test) 
head(test_pred)


In [None]:
#knn using mono types data only to predict Type_1 (model 2)
set.seed(5678)   
training_rows2 <- scaled_pokemon %>% 
mutate(Type_1 = as.factor(Type_1)) %>%
  select(Type_1) %>% 
  unlist() %>% 
  createDataPartition(p = 0.80, list = FALSE)

Y_type1 <- training_set %>% select(Type_1) %>% unlist()

k <- c(1,3,5,7,9,11)
ks <- data.frame(k)


train_control2 <- trainControl(method = "cv", number = 10)

choose_k <- train(x = X_attributes, y = Y_type1, method = 'knn', tuneGrid = ks, trControl = train_control2)
choose_k

k_accuracies2 <- choose_k$results
k_accuracies2 

In [None]:

final_k = data.frame(k = 1)
final_classifier_type1 <- train(x = X_attributes, y = Y_type1, method = "knn", tuneGrid = final_k)
final_classifier_type1

test_pred <- predict(final_classifier_type1, X_test) 
head(test_pred)


In [None]:
#knn using mono types data only to predict Type_2 (model 2)
set.seed(1435)   
training_rows3 <- scaled_pokemon %>% 
mutate(Type_2 = as.factor(Type_2)) %>%
  select(Type_2) %>% 
  unlist() %>% 
  createDataPartition(p = 0.80, list = FALSE)

Y_type2 <- training_set %>% select(Type_2) %>% unlist()

k <- c(1,3,5,7,9,11)
ks <- data.frame(k)


train_control3 <- trainControl(method = "cv", number = 10)

choose_k <- train(x = X_attributes, y = Y_type2, method = 'knn', tuneGrid = ks, trControl = train_control3)
choose_k

k_accuracies2 <- choose_k$results
k_accuracies2 

In [None]:
final_k = data.frame(k = 1)
final_classifier_type2 <- train(x = X_attributes, y = Y_type2, method = "knn", tuneGrid = final_k)
final_classifier_type2

test_pred <- predict(final_classifier_type2, X_test) 
head(test_pred)

From the mono-type models, we see that training on mono-type Pokémon gives high accuracy (95%) when predicting the Legendary status of mono-type Pokémon, and 100% accuracy when predicting Type_2. The latter is expected: every mono-type Pokémon has its Type_2 set to 'Not_Dual', so there is only one class to predict. Accuracy is very low (19%), however, when predicting Type_1, which suggests that the predictor (x) variables are too similar across mono-type Pokémon of different types to distinguish between them accurately.
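
Accuracies like these come from comparing predicted labels with the held-out labels. As a minimal sketch of that comparison in base R (the vectors `truth` and `pred` are made-up stand-ins for objects such as `Y_test` and `test_pred`, not results from this analysis):

```r
# Made-up labels standing in for Y_test and test_pred.
truth <- factor(c("False", "False", "True", "False", "True", "False"))
pred  <- factor(c("False", "False", "True", "False", "False", "False"),
                levels = levels(truth))

# Overall accuracy: share of predictions that match the true label.
accuracy <- mean(pred == truth)

# Cross-tabulation of predictions against the truth (a confusion matrix).
confusion <- table(Predicted = pred, Actual = truth)

# Accuracy of always guessing the most common class, as a baseline.
baseline <- max(table(truth)) / length(truth)

print(accuracy)
print(confusion)
print(baseline)
```

When one class is much more common than the other, comparing against the majority-class baseline helps show how much of a high accuracy is really due to the model. caret's `confusionMatrix()` reports similar summaries.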

In [None]:
#knn using Dual type only predicting legendary status (model 2)
set.seed(1234)
training_rows_d <- scaled_pokemon %>% 
mutate(Legendary = as.factor(Legendary)) %>%
  select(Legendary) %>% 
  unlist() %>% 
  createDataPartition(p = 0.80, list = FALSE)

# training_rows_d indexes the full table, so slice first and then keep the dual-type rows
training_set_d <- scaled_pokemon %>% slice(training_rows_d) %>% filter(Type_summary == "Dual")
testing_set_d <- scaled_pokemon %>% slice(-training_rows_d) %>% filter(Type_summary == "Dual")

head(training_set_d)
head(testing_set_d)

In [None]:
Y_train_d <- training_set_d %>% select(Legendary) %>% unlist()
X_train_d <- training_set_d %>% select(scaled_Hit_Points, scaled_Attack, scaled_Defense, 
                                        scaled_Special_Attack, scaled_Special_Defense, 
                                        scaled_Speed, scaled_Gen_1, scaled_Gen_2, scaled_Gen_3, 
                                        scaled_Gen_4, scaled_Gen_5, scaled_Gen_6) %>% data.frame()

k <- c(1,3,5,7,9,11)
ks <- data.frame(k)

train_control <- trainControl(method = "cv", number = 10)
choose_k_d <- train(x = X_train_d, y = Y_train_d, method = 'knn', tuneGrid = ks, trControl = train_control)
choose_k_d

k_accuracies_d <- choose_k_d$results
k_accuracies_d

In [None]:
#refit with k = 3 on the dual-type training set
Y_train_d2 <- training_set_d %>% select(Legendary) %>% unlist()
X_train_d2 <- training_set_d %>% select(scaled_Hit_Points, scaled_Attack, scaled_Defense, 
                                        scaled_Special_Attack, scaled_Special_Defense, 
                                        scaled_Speed, scaled_Gen_1, scaled_Gen_2, scaled_Gen_3, 
                                        scaled_Gen_4, scaled_Gen_5, scaled_Gen_6) %>% data.frame()


final_kd = data.frame(k = 3)
final_classifier_legendary_d <- train(x = X_train_d2, y = Y_train_d2, method = "knn", tuneGrid = final_kd)
final_classifier_legendary_d


X_test_d2 <- testing_set_d %>% 
    select(scaled_Hit_Points, scaled_Attack, scaled_Defense, 
        scaled_Special_Attack, scaled_Special_Defense, 
        scaled_Speed, scaled_Gen_1, scaled_Gen_2, scaled_Gen_3, 
        scaled_Gen_4, scaled_Gen_5, scaled_Gen_6) %>% 
        data.frame()

Y_test_d2 <- testing_set_d %>% 
    select(Legendary) %>% 
    unlist()
test_pred_d <- predict(final_classifier_legendary_d, X_test_d2) 
head(test_pred_d)

In [None]:
#knn using dual types data only to predict Type_1 (model 2)
set.seed(5678)   
training_rows2_d <- scaled_pokemon %>% 
mutate(Type_1 = as.factor(Type_1)) %>%
  select(Type_1) %>% 
  unlist() %>% 
  createDataPartition(p = 0.80, list = FALSE)

Y_type1_d <- training_set_d %>% select(Type_1) %>% unlist()

k <- c(1,3,5,7,9,11)
kks <- data.frame(k)


train_control2_d <- trainControl(method = "cv", number = 10)

choose_k_d3 <- train(x = X_train_d, y = Y_type1_d, method = 'knn', tuneGrid = kks, trControl = train_control2_d)
choose_k_d3

k_accuracies2_d <- choose_k_d3$results
k_accuracies2_d

In [None]:
final_k = data.frame(k = 1)
final_classifier_type1_d <- train(x = X_train_d, y = Y_type1_d, method = "knn", tuneGrid = final_k)
final_classifier_type1_d

test_pred_d <- predict(final_classifier_type1_d, X_test_d2) 
head(test_pred_d)

In [None]:
#knn using Dual types data only to predict Type_2 (model 2)
set.seed(1435)   
training_rows3_d <- scaled_pokemon %>% 
mutate(Type_2 = as.factor(Type_2)) %>%
  select(Type_2) %>% 
  unlist() %>% 
  createDataPartition(p = 0.80, list = FALSE)

Y_type2_d <- training_set_d %>% select(Type_2) %>% unlist()

k <- c(1,3,5,7,9,11)
ks <- data.frame(k)


train_control3_d <- trainControl(method = "cv", number = 10)

choose_k_d <- train(x = X_train_d, y = Y_type2_d, method = 'knn', tuneGrid = ks, trControl = train_control3_d)
choose_k_d

k_accuracies2_d <- choose_k_d$results
k_accuracies2_d

In [None]:
final_k_d = data.frame(k = 1)
final_classifier_type2_d <- train(x = X_train_d, y = Y_type2_d, method = "knn", tuneGrid = final_k_d)
final_classifier_type2_d

test_pred_d <- predict(final_classifier_type2_d, X_test_d2) 
head(test_pred_d)

In [None]:
##CAN BE DELETED

#trying to see if we can use a subset of the x-variables based on those that differentiate between 
#the mono and dual types best.(i.e those that have a significant difference in the means)
#All in an effort to see if we could have increased the accuracy of the prediction models
means <- scaled_pokemon %>% 
        group_by(Type_summary) %>%
        summarise(average_Hit_Points = mean(scaled_Hit_Points, trim = 0,na.rm = TRUE),
                average_attack = mean(scaled_Attack,trim = 0,na.rm = TRUE),
                average_defense = mean(scaled_Defense,trim = 0,na.rm = TRUE),
                average_SPattack = mean(scaled_Special_Attack,trim = 0,na.rm = TRUE),
                average_SPdefense = mean(scaled_Special_Defense,trim = 0,na.rm = TRUE),
                average_speed = mean(scaled_Speed,trim = 0,na.rm = TRUE),
                average_Gen1 = mean(scaled_Gen_1,trim = 0,na.rm = TRUE),
                average_Gen2 = mean(scaled_Gen_2,trim = 0,na.rm = TRUE),
                average_Gen3 = mean(scaled_Gen_3,trim = 0,na.rm = TRUE),
                average_Gen4 = mean(scaled_Gen_4,trim = 0,na.rm = TRUE),
                average_Gen5 = mean(scaled_Gen_5,trim = 0,na.rm = TRUE),
                average_Gen6 = mean(scaled_Gen_6,trim = 0,na.rm = TRUE))
        
means
#Gen2 and Gen3 seem to have the smallest difference in means, but removing only them would make the models inaccurate because
#the other generation columns would still be included. To avoid this we would have to remove all of the generation columns,
#so let's try that.

## Discussion

## Conclusion