## Table of Contents

#### Set-Up
- [Splitting the data](#split)
- [Adjusting some features](#adjusting)
- [Clustering](#clustering)
- [Modeling guidelines](#modeling_guidelines)

#### Modeling
- [KNN](#knn)
    - [KNN with clustering](#knn_cluster)
    - [Two-Stage KNN](#two_stage_knn)
    - [Two-Stage KNN with clustering](#two_stage_knn_cluster)
    
- [Performance Results](#performance)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import re
import json
import pandas_profiling
import requests
from bs4 import BeautifulSoup
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale, StandardScaler, MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.manifold import TSNE
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.metrics import precision_recall_fscore_support, log_loss, r2_score, mean_squared_error, f1_score, make_scorer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.cluster import KMeans, DBSCAN, MeanShift
from sklearn.neighbors import KNeighborsClassifier
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

In [2]:
pokemon_abilities_df = pd.read_csv('./data/pokemon_abilities_df.csv', index_col="name")
pokemon_learnsets_df = pd.read_csv('./data/pokemon_learnsets_df.csv', index_col='name')
pokemon_data = pd.read_csv('./data/pokemon_data.csv', index_col="name")

In [3]:
pokemon_data

name,hp,atk,def,spa,spd,spe,weight,height,formats,generation,...,Ability Cutoff 2,Ability Cutoff 3,Ability Cutoff 4,Ability Cutoff 5,Ability Cutoff 6,Best Ability,Best Ability <100,Unique Powerful Ability,oldformats,oldformat codes
Bulbasaur,45,49,49,65,65,45,6.9,0.7,ZU,RB,...,1.0,0.0,0.0,0.0,0.0,63.636364,63.636364,0,ZU,0
Ivysaur,60,62,63,80,80,60,13.0,1.0,ZU,RB,...,1.0,0.0,0.0,0.0,0.0,63.636364,63.636364,0,ZU,0
Venusaur,80,82,83,100,100,80,100.0,2.0,OU,RB,...,1.0,0.0,0.0,0.0,0.0,63.636364,63.636364,0,UU,4
Charmander,39,52,43,60,50,65,8.5,0.6,ZU,RB,...,1.0,0.0,0.0,0.0,0.0,50.000000,50.000000,0,ZU,0
Charmeleon,58,64,58,80,65,80,19.0,1.1,ZU,RB,...,1.0,0.0,0.0,0.0,0.0,50.000000,50.000000,0,ZU,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,800.0,2.2,NU,SS,...,1.0,1.0,1.0,0.0,0.0,75.000000,75.000000,0,NU,2
Spectrier,100,65,60,145,80,130,44.5,2.0,Uber,SS,...,0.0,0.0,0.0,0.0,0.0,1.000000,0.000000,1,Uber,6
Calyrex,100,80,80,80,80,80,7.7,1.1,PU,SS,...,0.0,0.0,0.0,0.0,0.0,18.181818,18.181818,0,ZU,0
Calyrex-Ice,100,165,150,85,130,50,809.1,2.4,Uber,SS,...,0.0,0.0,0.0,0.0,0.0,1.000000,0.000000,1,Uber,6


<a id="split"></a>
### Splitting the data

In [4]:
pokemon_data.columns

Index(['hp', 'atk', 'def', 'spa', 'spd', 'spe', 'weight', 'height', 'formats',
       'generation', 'format codes', 'Weaknesses', 'Strong Weaknesses',
       'Resists', 'Strong Resists', 'Immune', 'STAB', 'Resistance Index',
       'Entry Hazards', 'Hazard Removal', 'Removal Deterrent', 'Cleric',
       'Pivot', 'Item Removal', 'Setup', 'Priority', 'HP Drain', 'HP Recovery',
       'Weather Set', 'Weather Gimmick', 'Physical Cutoff 1',
       'Physical Cutoff 2', 'Physical Cutoff 3', 'Physical Cutoff 4',
       'Physical Cutoff 5', 'Physical Cutoff 6', 'Physical Coverage 1',
       'Physical Coverage 2', 'Physical Coverage 3', 'Physical Coverage 4',
       'Special Cutoff 1', 'Special Cutoff 2', 'Special Cutoff 3',
       'Special Cutoff 4', 'Special Cutoff 5', 'Special Cutoff 6',
       'Special Cutoff 7', 'Special Coverage 1', 'Special Coverage 2',
       'Special Coverage 3', 'Special Coverage 4', 'Special Coverage 5',
       'Special Coverage 6', 'Special Coverage 7', 'Special Cove

In [5]:
X = pokemon_data.drop(columns=['weight', 'height', 'Weaknesses', 'Strong Weaknesses', 'Resists',
                                'Strong Resists', 'Immune', 'STAB', 'Physical Cutoff 1', 'Physical Cutoff 2',
                                'Physical Cutoff 4', 'Physical Cutoff 5', 'Physical Cutoff 6',
                                'Physical Coverage 1', 'Physical Coverage 2', 'Physical Coverage 4',
                                'Special Cutoff 1', 'Special Cutoff 2', 'Special Cutoff 4',
                                'Special Cutoff 5', 'Special Cutoff 6', 'Special Cutoff 7',
                                'Special Coverage 1', 'Special Coverage 2', 'Special Coverage 3',
                                'Special Coverage 4', 'Special Coverage 6', 'Special Coverage 7',
                                'Special Coverage 8', 'Special Coverage 9', 'Special Coverage 10',
                                'Ability Cutoff 1', 'Ability Cutoff 2', 'Ability Cutoff 4', 'Ability Cutoff 5',
                                'Ability Cutoff 6', 'Best Ability <100', 'formats', 'generation',
                                'format codes', 'oldformats', 'oldformat codes'])

# Note: only 'formats' and 'format codes' are selected here, so the extra names in
# `columns` ('oldformats', 'oldformat codes') come out as all-NaN columns.
y_df = pd.DataFrame(pokemon_data[['formats', 'format codes']], index=pokemon_data.index, columns=['formats', 'format codes', 'oldformats', 'oldformat codes'])
# 4-class target: ZU / PU / NU-RU-UU / OU-Uber
y_df['formats4'] = y_df['formats'].replace({'ZU':'Not c', 'PU': 'Low c', 'NU': 'Mid c', 'RU': 'Mid c', 'UU': 'Mid c', 'OU': 'High c', 'Uber': 'High c'})
y_df['format codes4'] = y_df['format codes'].replace({3:2, 4: 2, 5:3, 6:3})
# Alternative 4-class target: OU moves into the middle group, leaving only Uber as "High c"
y_df['formats4alt'] = y_df['formats'].replace({'ZU':'Not c', 'PU': 'Low c', 'NU': 'Mid c', 'RU': 'Mid c', 'UU': 'Mid c', 'OU': 'Mid c', 'Uber': 'High c'})
y_df['format codes4alt'] = y_df['format codes'].replace({3:2, 4: 2, 5:2, 6:3})
# Binary target: ZU vs. every other tier
y_df['formats2'] = y_df['formats'].replace({'ZU':'No', 'PU': 'Yes', 'NU': 'Yes', 'RU': 'Yes', 'UU': 'Yes', 'OU': 'Yes', 'Uber': 'Yes'})
y_df

name,formats,format codes,oldformats,oldformat codes,formats4,format codes4,formats4alt,format codes4alt,formats2
Bulbasaur,ZU,0,,,Not c,0,Not c,0,No
Ivysaur,ZU,0,,,Not c,0,Not c,0,No
Venusaur,OU,5,,,High c,3,Mid c,2,Yes
Charmander,ZU,0,,,Not c,0,Not c,0,No
Charmeleon,ZU,0,,,Not c,0,Not c,0,No
...,...,...,...,...,...,...,...,...,...
Glastrier,NU,2,,,Mid c,2,Mid c,2,Yes
Spectrier,Uber,6,,,High c,3,High c,3,Yes
Calyrex,PU,1,,,Low c,1,Low c,1,Yes
Calyrex-Ice,Uber,6,,,High c,3,High c,3,Yes


<a id="adjusting"></a>
### Adjusting some features

- remove: Ability Cutoff 3 and Unique Powerful Ability

In [6]:
X.columns

Index(['hp', 'atk', 'def', 'spa', 'spd', 'spe', 'Resistance Index',
       'Entry Hazards', 'Hazard Removal', 'Removal Deterrent', 'Cleric',
       'Pivot', 'Item Removal', 'Setup', 'Priority', 'HP Drain', 'HP Recovery',
       'Weather Set', 'Weather Gimmick', 'Physical Cutoff 3',
       'Physical Coverage 3', 'Special Cutoff 3', 'Special Coverage 5',
       'Misc Status', 'Unique Powerful Move', 'Ability Cutoff 3',
       'Best Ability', 'Unique Powerful Ability'],
      dtype='object')

In [7]:
X.drop(columns=['Ability Cutoff 3', 'Unique Powerful Ability'], inplace=True)
X

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Hazard Removal,Removal Deterrent,...,HP Recovery,Weather Set,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability
Bulbasaur,45,49,49,65,65,45,2,0,0,0,...,1,0,5,4,3,5,2,4,0,63.636364
Ivysaur,60,62,63,80,80,60,2,0,0,0,...,1,0,5,4,3,5,2,4,0,63.636364
Venusaur,80,82,83,100,100,80,2,0,0,0,...,1,0,5,6,4,6,4,4,0,63.636364
Charmander,39,52,43,60,50,65,3,0,0,0,...,0,0,3,10,9,6,2,3,0,50.000000
Charmeleon,58,64,58,80,65,80,3,0,0,0,...,0,0,3,10,9,6,2,3,0,50.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,0,1,12,7,3,2,2,0,75.000000
Spectrier,100,65,60,145,80,130,8,0,0,1,...,0,0,0,4,4,4,3,3,0,1.000000
Calyrex,100,80,80,80,80,80,-2,0,0,0,...,0,0,2,3,3,9,4,3,0,18.181818
Calyrex-Ice,100,165,150,85,130,50,-4,0,0,0,...,0,0,2,15,9,12,5,3,1,1.000000


In [8]:
X.columns

Index(['hp', 'atk', 'def', 'spa', 'spd', 'spe', 'Resistance Index',
       'Entry Hazards', 'Hazard Removal', 'Removal Deterrent', 'Cleric',
       'Pivot', 'Item Removal', 'Setup', 'Priority', 'HP Drain', 'HP Recovery',
       'Weather Set', 'Weather Gimmick', 'Physical Cutoff 3',
       'Physical Coverage 3', 'Special Cutoff 3', 'Special Coverage 5',
       'Misc Status', 'Unique Powerful Move', 'Best Ability'],
      dtype='object')

- fold Weather Set into Weather Gimmick (weather setters get their own code, 6)

In [9]:
X['Weather Gimmick'].value_counts()

2    289
1    171
0    161
5     70
3     40
4      7
Name: Weather Gimmick, dtype: int64

In [10]:
X['Weather Set'].value_counts()

0    709
1     29
Name: Weather Set, dtype: int64

In [11]:
X.loc[X['Weather Set'] == 1, 'Weather Gimmick'] = 6
X['Weather Gimmick'].value_counts()

2    265
1    167
0    161
5     70
3     39
6     29
4      7
Name: Weather Gimmick, dtype: int64

In [12]:
X.drop(columns=['Weather Set'], inplace=True)
X

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Hazard Removal,Removal Deterrent,...,HP Drain,HP Recovery,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability
Bulbasaur,45,49,49,65,65,45,2,0,0,0,...,2,1,5,4,3,5,2,4,0,63.636364
Ivysaur,60,62,63,80,80,60,2,0,0,0,...,2,1,5,4,3,5,2,4,0,63.636364
Venusaur,80,82,83,100,100,80,2,0,0,0,...,2,1,5,6,4,6,4,4,0,63.636364
Charmander,39,52,43,60,50,65,3,0,0,0,...,0,0,3,10,9,6,2,3,0,50.000000
Charmeleon,58,64,58,80,65,80,3,0,0,0,...,0,0,3,10,9,6,2,3,0,50.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,0,1,12,7,3,2,2,0,75.000000
Spectrier,100,65,60,145,80,130,8,0,0,1,...,0,0,0,4,4,4,3,3,0,1.000000
Calyrex,100,80,80,80,80,80,-2,0,0,0,...,2,0,2,3,3,9,4,3,0,18.181818
Calyrex-Ice,100,165,150,85,130,50,-4,0,0,0,...,2,0,2,15,9,12,5,3,1,1.000000


- fold hp drain and hp recovery together into a recovery feature

In [13]:
X['HP Recovery'].value_counts()

0    517
1    184
2     37
Name: HP Recovery, dtype: int64

In [14]:
X['HP Drain'].value_counts()

0    482
2    202
1     49
3      4
4      1
Name: HP Drain, dtype: int64

In [15]:
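# Recode HP Drain so that recovery moves take the top codes (3 and 4).
# Caveat: HP Drain already used 3 and 4 for a few rows (see the counts above),
# so those rows can no longer be told apart from the recovery codes.
X.loc[X['HP Recovery'] == 1, 'HP Drain'] = 3
X.loc[X['HP Recovery'] == 2, 'HP Drain'] = 4
X['HP Drain'].value_counts()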

0    382
3    187
2     91
1     41
4     37
Name: HP Drain, dtype: int64

In [16]:
X.drop(columns=['HP Recovery'], inplace=True)
X

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Hazard Removal,Removal Deterrent,...,Priority,HP Drain,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability
Bulbasaur,45,49,49,65,65,45,2,0,0,0,...,0,3,5,4,3,5,2,4,0,63.636364
Ivysaur,60,62,63,80,80,60,2,0,0,0,...,0,3,5,4,3,5,2,4,0,63.636364
Venusaur,80,82,83,100,100,80,2,0,0,0,...,0,3,5,6,4,6,4,4,0,63.636364
Charmander,39,52,43,60,50,65,3,0,0,0,...,1,0,3,10,9,6,2,3,0,50.000000
Charmeleon,58,64,58,80,65,80,3,0,0,0,...,1,0,3,10,9,6,2,3,0,50.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,0,1,12,7,3,2,2,0,75.000000
Spectrier,100,65,60,145,80,130,8,0,0,1,...,0,0,0,4,4,4,3,3,0,1.000000
Calyrex,100,80,80,80,80,80,-2,0,0,0,...,0,2,2,3,3,9,4,3,0,18.181818
Calyrex-Ice,100,165,150,85,130,50,-4,0,0,0,...,0,2,2,15,9,12,5,3,1,1.000000


In [17]:
X['HP Recovery'] = X['HP Drain']
X

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Hazard Removal,Removal Deterrent,...,HP Drain,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability,HP Recovery
Bulbasaur,45,49,49,65,65,45,2,0,0,0,...,3,5,4,3,5,2,4,0,63.636364,3
Ivysaur,60,62,63,80,80,60,2,0,0,0,...,3,5,4,3,5,2,4,0,63.636364,3
Venusaur,80,82,83,100,100,80,2,0,0,0,...,3,5,6,4,6,4,4,0,63.636364,3
Charmander,39,52,43,60,50,65,3,0,0,0,...,0,3,10,9,6,2,3,0,50.000000,0
Charmeleon,58,64,58,80,65,80,3,0,0,0,...,0,3,10,9,6,2,3,0,50.000000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,1,12,7,3,2,2,0,75.000000,0
Spectrier,100,65,60,145,80,130,8,0,0,1,...,0,0,4,4,4,3,3,0,1.000000,0
Calyrex,100,80,80,80,80,80,-2,0,0,0,...,2,2,3,3,9,4,3,0,18.181818,2
Calyrex-Ice,100,165,150,85,130,50,-4,0,0,0,...,2,2,15,9,12,5,3,1,1.000000,2


In [18]:
X.drop(columns=['HP Drain'], inplace=True)
X

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Hazard Removal,Removal Deterrent,...,Priority,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability,HP Recovery
Bulbasaur,45,49,49,65,65,45,2,0,0,0,...,0,5,4,3,5,2,4,0,63.636364,3
Ivysaur,60,62,63,80,80,60,2,0,0,0,...,0,5,4,3,5,2,4,0,63.636364,3
Venusaur,80,82,83,100,100,80,2,0,0,0,...,0,5,6,4,6,4,4,0,63.636364,3
Charmander,39,52,43,60,50,65,3,0,0,0,...,1,3,10,9,6,2,3,0,50.000000,0
Charmeleon,58,64,58,80,65,80,3,0,0,0,...,1,3,10,9,6,2,3,0,50.000000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,1,12,7,3,2,2,0,75.000000,0
Spectrier,100,65,60,145,80,130,8,0,0,1,...,0,0,4,4,4,3,3,0,1.000000,0
Calyrex,100,80,80,80,80,80,-2,0,0,0,...,0,2,3,3,9,4,3,0,18.181818,2
Calyrex-Ice,100,165,150,85,130,50,-4,0,0,0,...,0,2,15,9,12,5,3,1,1.000000,2


In [19]:
X['HP Recovery'].value_counts()

0    382
3    187
2     91
1     41
4     37
Name: HP Recovery, dtype: int64
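
Because HP Drain already contained a few 3s and 4s before the fold, the recoded values above mix two meanings. A minimal sketch of one way to avoid that, assuming it is acceptable to shift the recovery codes above the largest existing drain code; the toy frame `demo` and the column name `Recovery Combined` are hypothetical and not part of the notebook:

```python
import pandas as pd

# Toy stand-in for the two columns being folded together (values are made up).
demo = pd.DataFrame({"HP Drain": [0, 2, 3, 0, 4],
                     "HP Recovery": [0, 0, 0, 1, 2]})

# Shift recovery codes past the largest drain code so the two scales cannot collide.
offset = demo["HP Drain"].max()

# Keep the drain code where there is no recovery move, otherwise use the shifted recovery code.
demo["Recovery Combined"] = demo["HP Drain"].where(
    demo["HP Recovery"] == 0, demo["HP Recovery"] + offset)

print(demo)
```

Whether the extra codes help or hurt KNN is an open question, since the model reads them as distances on a single numeric axis.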

- considering: removal deterrent (could arguably just be removed since it's based on abilities), hazard removal, cleric, entry hazards (those last 3 might go into misc status)

In [20]:
X.drop(columns=['Removal Deterrent'], inplace=True)
X

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Hazard Removal,Cleric,...,Priority,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability,HP Recovery
Bulbasaur,45,49,49,65,65,45,2,0,0,0,...,0,5,4,3,5,2,4,0,63.636364,3
Ivysaur,60,62,63,80,80,60,2,0,0,0,...,0,5,4,3,5,2,4,0,63.636364,3
Venusaur,80,82,83,100,100,80,2,0,0,0,...,0,5,6,4,6,4,4,0,63.636364,3
Charmander,39,52,43,60,50,65,3,0,0,0,...,1,3,10,9,6,2,3,0,50.000000,0
Charmeleon,58,64,58,80,65,80,3,0,0,0,...,1,3,10,9,6,2,3,0,50.000000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,1,12,7,3,2,2,0,75.000000,0
Spectrier,100,65,60,145,80,130,8,0,0,0,...,0,0,4,4,4,3,3,0,1.000000,0
Calyrex,100,80,80,80,80,80,-2,0,0,1,...,0,2,3,3,9,4,3,0,18.181818,2
Calyrex-Ice,100,165,150,85,130,50,-4,0,0,1,...,0,2,15,9,12,5,3,1,1.000000,2


In [21]:
X['Misc Status'].value_counts()

3    335
2    234
1     89
0     41
4     35
5      4
Name: Misc Status, dtype: int64

In [22]:
X['Hazard Removal'].value_counts()

0    558
1    174
2      6
Name: Hazard Removal, dtype: int64

In [23]:
# Only Hazard Removal == 1 is folded in; the 6 rows with value 2 keep their Misc Status.
X.loc[X['Hazard Removal'] == 1, 'Misc Status'] = 4
X['Misc Status'].value_counts()

3    249
4    205
2    185
1     63
0     32
5      4
Name: Misc Status, dtype: int64

In [24]:
X.drop(columns=['Hazard Removal'], inplace=True)
X

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Cleric,Pivot,...,Priority,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability,HP Recovery
Bulbasaur,45,49,49,65,65,45,2,0,0,0,...,0,5,4,3,5,2,4,0,63.636364,3
Ivysaur,60,62,63,80,80,60,2,0,0,0,...,0,5,4,3,5,2,4,0,63.636364,3
Venusaur,80,82,83,100,100,80,2,0,0,0,...,0,5,6,4,6,4,4,0,63.636364,3
Charmander,39,52,43,60,50,65,3,0,0,0,...,1,3,10,9,6,2,3,0,50.000000,0
Charmeleon,58,64,58,80,65,80,3,0,0,0,...,1,3,10,9,6,2,3,0,50.000000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,1,12,7,3,2,2,0,75.000000,0
Spectrier,100,65,60,145,80,130,8,0,0,0,...,0,0,4,4,4,3,3,0,1.000000,0
Calyrex,100,80,80,80,80,80,-2,0,1,0,...,0,2,3,3,9,4,3,0,18.181818,2
Calyrex-Ice,100,165,150,85,130,50,-4,0,1,0,...,0,2,15,9,12,5,3,1,1.000000,2


I'll leave the other ones (Cleric and Entry Hazards) alone for now. Updating them would be complicated, and it's probably not even a good idea since they performed better than Hazard Removal.

<a id="clustering"></a>
### Clustering

In [25]:
cluster_dfs = {}

n_clusters = list(range(5, 35, 5))
n_clusters

[5, 10, 15, 20, 25, 30]

These are the cluster counts we'll test in each model that uses clusters (only half of the models do). We'll need to remember to treat the resulting cluster labels as categories rather than numbers.
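
A minimal sketch of that conversion, assuming one-hot encoding is the intended way to turn cluster labels into categories; the toy assignments and the `_cluster_` separator are made up for illustration:

```python
import pandas as pd

# Hypothetical cluster assignments for four Pokemon (the labels are arbitrary IDs).
clusters = pd.DataFrame(
    {"features": [1, 3, 3, 0], "stats": [0, 0, 3, 1]},
    index=["Bulbasaur", "Ivysaur", "Venusaur", "Charmander"],
)

# Cast the integer labels to categories, then one-hot encode them so that a
# distance-based model does not read "cluster 3" as three times "cluster 1".
cluster_dummies = pd.get_dummies(clusters.astype("category"), prefix_sep="_cluster_")

print(cluster_dummies.columns.tolist())
```

Each row ends up with exactly one active indicator per cluster type, which can then be joined onto the feature matrix before scaling.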

We'll cluster on 4 different subsets of features, as we did during EDA:
- one for overall features (scaled)
- one for stats (scaled)
- one for abilities (not scaled, because abilities are one-hot encoded)
- one for learnsets (not scaled, because learnsets are one-hot encoded)

Then we'll make 6 dataframes, one per cluster count, each holding all 4 cluster types, and store them in the cluster_dfs dictionary.

In [26]:
cluster5 = pd.DataFrame(index=X.index, columns=['features', 'stats', 'abilities', 'learnsets'])

X_scaled = StandardScaler().fit_transform(X)
stats_scaled = StandardScaler().fit_transform(X.loc[:, ['hp', 'atk', 'def', 'spa', 'spd', 'spe']])

kmeans = KMeans(n_clusters=5, random_state=273)
cluster5['features'] = kmeans.fit_predict(X_scaled)

kmeans = KMeans(n_clusters=5, random_state=273)
cluster5['stats'] = kmeans.fit_predict(stats_scaled)

kmeans = KMeans(n_clusters=5, random_state=273)
cluster5['abilities'] = kmeans.fit_predict(pokemon_abilities_df)

kmeans = KMeans(n_clusters=5, random_state=273)
cluster5['learnsets'] = kmeans.fit_predict(pokemon_learnsets_df)

cluster_dfs[5] = cluster5

In [27]:
cluster10 = pd.DataFrame(index=X.index, columns=['features', 'stats', 'abilities', 'learnsets'])

X_scaled = StandardScaler().fit_transform(X)
stats_scaled = StandardScaler().fit_transform(X.loc[:, ['hp', 'atk', 'def', 'spa', 'spd', 'spe']])

kmeans = KMeans(n_clusters=10, random_state=273)
cluster10['features'] = kmeans.fit_predict(X_scaled)

kmeans = KMeans(n_clusters=10, random_state=273)
cluster10['stats'] = kmeans.fit_predict(stats_scaled)

kmeans = KMeans(n_clusters=10, random_state=273)
cluster10['abilities'] = kmeans.fit_predict(pokemon_abilities_df)

kmeans = KMeans(n_clusters=10, random_state=273)
cluster10['learnsets'] = kmeans.fit_predict(pokemon_learnsets_df)

cluster_dfs[10] = cluster10

In [28]:
cluster15 = pd.DataFrame(index=X.index, columns=['features', 'stats', 'abilities', 'learnsets'])

X_scaled = StandardScaler().fit_transform(X)
stats_scaled = StandardScaler().fit_transform(X.loc[:, ['hp', 'atk', 'def', 'spa', 'spd', 'spe']])

kmeans = KMeans(n_clusters=15, random_state=273)
cluster15['features'] = kmeans.fit_predict(X_scaled)

kmeans = KMeans(n_clusters=15, random_state=273)
cluster15['stats'] = kmeans.fit_predict(stats_scaled)

kmeans = KMeans(n_clusters=15, random_state=273)
cluster15['abilities'] = kmeans.fit_predict(pokemon_abilities_df)

kmeans = KMeans(n_clusters=15, random_state=273)
cluster15['learnsets'] = kmeans.fit_predict(pokemon_learnsets_df)

cluster_dfs[15] = cluster15

In [29]:
cluster20 = pd.DataFrame(index=X.index, columns=['features', 'stats', 'abilities', 'learnsets'])

X_scaled = StandardScaler().fit_transform(X)
stats_scaled = StandardScaler().fit_transform(X.loc[:, ['hp', 'atk', 'def', 'spa', 'spd', 'spe']])

kmeans = KMeans(n_clusters=20, random_state=273)
cluster20['features'] = kmeans.fit_predict(X_scaled)

kmeans = KMeans(n_clusters=20, random_state=273)
cluster20['stats'] = kmeans.fit_predict(stats_scaled)

kmeans = KMeans(n_clusters=20, random_state=273)
cluster20['abilities'] = kmeans.fit_predict(pokemon_abilities_df)

kmeans = KMeans(n_clusters=20, random_state=273)
cluster20['learnsets'] = kmeans.fit_predict(pokemon_learnsets_df)

cluster_dfs[20] = cluster20

In [30]:
cluster25 = pd.DataFrame(index=X.index, columns=['features', 'stats', 'abilities', 'learnsets'])

X_scaled = StandardScaler().fit_transform(X)
stats_scaled = StandardScaler().fit_transform(X.loc[:, ['hp', 'atk', 'def', 'spa', 'spd', 'spe']])

kmeans = KMeans(n_clusters=25, random_state=273)
cluster25['features'] = kmeans.fit_predict(X_scaled)

kmeans = KMeans(n_clusters=25, random_state=273)
cluster25['stats'] = kmeans.fit_predict(stats_scaled)

kmeans = KMeans(n_clusters=25, random_state=273)
cluster25['abilities'] = kmeans.fit_predict(pokemon_abilities_df)

kmeans = KMeans(n_clusters=25, random_state=273)
cluster25['learnsets'] = kmeans.fit_predict(pokemon_learnsets_df)

cluster_dfs[25] = cluster25

In [31]:
cluster30 = pd.DataFrame(index=X.index, columns=['features', 'stats', 'abilities', 'learnsets'])

X_scaled = StandardScaler().fit_transform(X)
stats_scaled = StandardScaler().fit_transform(X.loc[:, ['hp', 'atk', 'def', 'spa', 'spd', 'spe']])

kmeans = KMeans(n_clusters=30, random_state=273)
cluster30['features'] = kmeans.fit_predict(X_scaled)

kmeans = KMeans(n_clusters=30, random_state=273)
cluster30['stats'] = kmeans.fit_predict(stats_scaled)

kmeans = KMeans(n_clusters=30, random_state=273)
cluster30['abilities'] = kmeans.fit_predict(pokemon_abilities_df)

kmeans = KMeans(n_clusters=30, random_state=273)
cluster30['learnsets'] = kmeans.fit_predict(pokemon_learnsets_df)

cluster_dfs[30] = cluster30
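
The six cells above differ only in the cluster count, so the same work could be written as one loop. The sketch below runs on synthetic stand-in data (`X_demo`, `abilities_demo` and `learnsets_demo` are hypothetical, not the notebook's frames). It makes two assumptions the original cells do not: it aligns the abilities and learnsets rows to the feature index by name instead of relying on row order, and it pins `n_init=10` so the result does not depend on the scikit-learn version's default.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(273)
names = [f"mon_{i}" for i in range(60)]
stat_cols = ["hp", "atk", "def", "spa", "spd", "spe"]

# Synthetic stand-ins for X, pokemon_abilities_df and pokemon_learnsets_df.
X_demo = pd.DataFrame(rng.normal(size=(60, 8)), index=names,
                      columns=stat_cols + ["extra_1", "extra_2"])
abilities_demo = pd.DataFrame(rng.integers(0, 2, size=(60, 20)), index=names,
                              columns=[f"ability_{j}" for j in range(20)])
learnsets_demo = pd.DataFrame(rng.integers(0, 2, size=(60, 20)), index=names,
                              columns=[f"move_{j}" for j in range(20)])

# Scale once; the one-hot frames stay unscaled and are aligned to X_demo by name.
inputs = {
    "features": StandardScaler().fit_transform(X_demo),
    "stats": StandardScaler().fit_transform(X_demo[stat_cols]),
    "abilities": abilities_demo.loc[X_demo.index],
    "learnsets": learnsets_demo.loc[X_demo.index],
}

# One dataframe of cluster labels per cluster count, keyed by that count.
cluster_dfs_demo = {}
for n in range(5, 35, 5):
    cluster_dfs_demo[n] = pd.DataFrame(
        {name: KMeans(n_clusters=n, random_state=273, n_init=10).fit_predict(data)
         for name, data in inputs.items()},
        index=X_demo.index,
    )

print(sorted(cluster_dfs_demo), cluster_dfs_demo[5].shape)
```

The alignment step matters only if the three CSV files list the Pokemon in different orders; if they share the same order, it is a no-op.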

In [32]:
cluster_dfs

{5:                 features  stats  abilities  learnsets
 name                                                 
 Bulbasaur              1      0          1          1
 Ivysaur                3      0          1          1
 Venusaur               3      3          1          1
 Charmander             1      0          1          4
 Charmeleon             2      1          1          4
 ...                  ...    ...        ...        ...
 Glastrier              2      2          1          1
 Spectrier              0      3          1          1
 Calyrex                3      4          1          3
 Calyrex-Ice            4      2          1          3
 Calyrex-Shadow         4      3          1          3
 
 [738 rows x 4 columns],
 10:                 features  stats  abilities  learnsets
 name                                                 
 Bulbasaur              9      3          7          6
 Ivysaur                9      9          7          6
 Venusaur               9      

<a id="modeling_guidelines"></a>
### Modeling guidelines

How many models am I making?

One-stage: (3 + 1) x 2 = 8 models. Four targets without clusters (7 class, 4 class, 4 class modified, 2 class), then each again with clusters.

Two-stage: (2 + 1) x 2 = 6 models. Three targets without clusters (7 class, 4 class, 4 class modified), then each again with clusters.

That makes 14 models in total for each modeling type.

Modeling types: Logistic Regression, KNN, Decision Tree, Random Forest, CatBoost

Extra considerations:

- For Logistic Regression and KNN we will need to scale our features.

- We might not even bother with clustering for something like logistic regression, though we can look up whether it might be worthwhile.

- The metric will be weighted F1 score. There is no well-established ROC curve for multi-class problems, and log loss is a poor fit for unbalanced classes. Weighted F1 should be especially appropriate for unbalanced classes and for cases where precision and recall matter equally (for our problem, a false positive costs no more than a false negative).
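
To make the metric concrete, here is a small worked example of weighted F1 on made-up labels (the tiers and predictions are illustrative, not model output). It shows that scikit-learn's `average='weighted'` is the per-class F1 averaged with class supports as weights:

```python
from sklearn.metrics import f1_score

# Made-up example: 6 true ZU and 2 true OU, with one mistake in each direction.
y_true = ["ZU"] * 6 + ["OU"] * 2
y_pred = ["ZU"] * 5 + ["OU"] + ["OU", "ZU"]

labels = ["ZU", "OU"]
per_class = f1_score(y_true, y_pred, labels=labels, average=None)
weighted = f1_score(y_true, y_pred, labels=labels, average="weighted")
macro = f1_score(y_true, y_pred, labels=labels, average="macro")

# Recompute the weighted score by hand from the per-class scores and supports.
supports = [y_true.count(label) for label in labels]
manual = sum(f * s for f, s in zip(per_class, supports)) / sum(supports)

print(per_class, weighted, macro, manual)
```

Here the majority class scores 5/6 and the minority class 1/2, so the weighted average (0.75) sits closer to the majority score than the macro average does. That is worth keeping in mind when reading the weighted scores alongside the per-tier breakdowns.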

In [33]:
k_list = [2, 3, 5, 10]  # numbers of cross-validation folds to try

cluster_types = list(cluster_dfs[5].columns)

<a id="knn"></a>
### KNN

In [34]:
pipe = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier())

pipe.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'standardscaler', 'kneighborsclassifier', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'kneighborsclassifier__algorithm', 'kneighborsclassifier__leaf_size', 'kneighborsclassifier__metric', 'kneighborsclassifier__metric_params', 'kneighborsclassifier__n_jobs', 'kneighborsclassifier__n_neighbors', 'kneighborsclassifier__p', 'kneighborsclassifier__weights'])

In [35]:
param_grid = {'kneighborsclassifier__n_neighbors': list(range(1, 16)),
              'kneighborsclassifier__p': [1, 2],
              'kneighborsclassifier__weights': ['uniform', 'distance']}
param_grid

{'kneighborsclassifier__n_neighbors': [1,
  2,
  3,
  4,
  5,
  6,
  7,
  8,
  9,
  10,
  11,
  12,
  13,
  14,
  15],
 'kneighborsclassifier__p': [1, 2],
 'kneighborsclassifier__weights': ['uniform', 'distance']}

#### 7 classes, no clusters

In [36]:
best = []

# The split and pipeline do not depend on the fold count, so build them once.
pipe = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier())
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats'].values)

# Try each cross-validation fold count and record the best weighted F1 score.
for k in k_list:
    knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
    knn_grid.fit(X_train, y_train)
    best.append(knn_grid.best_score_)

max(best), k_list[best.index(max(best))]

(0.5173231380306713, 5)

In [37]:
pipe = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier())
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats'].values)
knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=5, verbose=0)
knn_grid.fit(X_train, y_train)
knn_grid.best_params_, knn_grid.best_score_

({'kneighborsclassifier__n_neighbors': 5,
  'kneighborsclassifier__p': 1,
  'kneighborsclassifier__weights': 'uniform'},
 0.5173231380306713)

In [38]:
f1_score(y_train, knn_grid.predict(X_train), labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'], average='weighted')

0.6531496355198588

In [39]:
precision_recall_fscore_support(y_train, knn_grid.predict(X_train), labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'])

(array([0.83636364, 0.54362416, 0.25925926, 0.33333333, 0.52941176,
        0.47727273, 0.73913043]),
 array([0.91633466, 0.68067227, 0.22580645, 0.17142857, 0.26470588,
        0.48837209, 0.425     ]),
 array([0.87452471, 0.60447761, 0.24137931, 0.22641509, 0.35294118,
        0.48275862, 0.53968254]),
 array([251, 119,  31,  35,  34,  43,  40], dtype=int64))

Compared to logistic regression, the scores on ZU and PU are similar. KNN does better on NU, UU and OU and worse on RU and Ubers, so overall the performance is comparable. Note that the figures above are computed on the training split.

KNN offers little in terms of explainability: it just looks up the nearest neighbors, which in some sense isn't a "model" at all. It therefore needs better performance to justify its use. So far it doesn't look like it will clearly achieve that, but we still need to experiment a lot more.
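
Since the scores above come from the training split, a comparison on held-out data would give a fairer picture of whether KNN earns its place. The sketch below shows the pattern on synthetic data from `make_classification` (an assumption for illustration, not the Pokemon features), with a smaller grid than the one used above:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem standing in for the real feature matrix and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=6,
                                     n_classes=3, random_state=273)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=273, stratify=y_demo)

# Scaling lives inside the pipeline, so each CV fold is scaled using only its own training part.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(pipe,
                    {"kneighborsclassifier__n_neighbors": [3, 5, 9],
                     "kneighborsclassifier__p": [1, 2]},
                    scoring="f1_weighted", cv=5)
grid.fit(X_tr, y_tr)

# Compare the optimistic training score with the held-out score.
train_f1 = f1_score(y_tr, grid.predict(X_tr), average="weighted")
test_f1 = f1_score(y_te, grid.predict(X_te), average="weighted")
print(grid.best_params_, round(train_f1, 3), round(test_f1, 3))
```

The string `"f1_weighted"` is scikit-learn's built-in name for the same scorer that `make_scorer(f1_score, average='weighted')` builds.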

#### 4 classes, no clusters

In [40]:
best = []

# The split and pipeline do not depend on the fold count, so build them once.
pipe = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier())
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats4'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats4'].values)

# Try each cross-validation fold count and record the best weighted F1 score.
for k in k_list:
    knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
    knn_grid.fit(X_train, y_train)
    best.append(knn_grid.best_score_)

max(best), k_list[best.index(max(best))]

(0.6083256519383994, 10)

In [41]:
pipe = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier())
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats4'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats4'].values)
knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=10, verbose=0)
knn_grid.fit(X_train, y_train)
knn_grid.best_params_, knn_grid.best_score_

({'kneighborsclassifier__n_neighbors': 9,
  'kneighborsclassifier__p': 1,
  'kneighborsclassifier__weights': 'uniform'},
 0.6083256519383994)

In [42]:
f1_score(y_train, knn_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.6798280384593879

In [43]:
precision_recall_fscore_support(y_train, knn_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.76315789, 0.55454545, 0.55555556, 0.77586207]),
 array([0.92430279, 0.51260504, 0.45      , 0.54216867]),
 array([0.83603604, 0.53275109, 0.49723757, 0.63829787]),
 array([251, 119, 100,  83], dtype=int64))

Very similar scores to logistic regression, except a bit worse on mid competitive pokemon.

It's also worth noting that Manhattan distance and uniform weights continue to be the best parameters.
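For reference, the `p` parameter selects the order of the Minkowski distance: `p=1` is Manhattan and `p=2` is Euclidean. A small sketch of the difference (the helper `minkowski` is written here for illustration and is not part of the notebook):

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two 1-D arrays."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manhattan = minkowski(a, b, p=1)  # |3| + |4| = 7
euclidean = minkowski(a, b, p=2)  # sqrt(3^2 + 4^2) = 5
```

Manhattan distance adds up the per-feature differences, so a large gap in a single feature counts for less than it does under Euclidean distance, which squares each difference first.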

#### 4 class alt no clusters

In [44]:
best = []

for k in k_list:
    pipe = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier())
    X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats4alt'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats4alt'].values)
    knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
    knn_grid.fit(X_train, y_train)
    best.append(knn_grid.best_score_)

max(best), k_list[best.index(max(best))]

(0.6352113271349081, 3)

In [45]:
pipe = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier())
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats4alt'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats4alt'].values)
knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=3, verbose=0)
knn_grid.fit(X_train, y_train)
knn_grid.best_params_, knn_grid.best_score_

({'kneighborsclassifier__n_neighbors': 9,
  'kneighborsclassifier__p': 1,
  'kneighborsclassifier__weights': 'uniform'},
 0.6352113271349081)

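Note that KNN has no regularization parameter; the closest analogue is n_neighbors, where larger values give smoother decisions. Here the search picked 9 neighbors again, the same as for the standard 4 class model, even though the best number of CV folds changed from 10 to 3.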

In [46]:
f1_score(y_train, knn_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.6865193037830023

In [47]:
precision_recall_fscore_support(y_train, knn_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.76470588, 0.5       , 0.67479675, 0.94444444]),
 array([0.93227092, 0.44537815, 0.58041958, 0.425     ]),
 array([0.84021544, 0.47111111, 0.62406015, 0.5862069 ]),
 array([251, 119, 143,  40], dtype=int64))

This model performed notably worse than logistic regression, with F-scores about .05 lower in 3 categories and almost .15 lower in Ubers.

The trend of Manhattan distance and uniform weights being the best parameters continues

#### 2 class no clusters

In [48]:
best = []

for k in k_list:
    pipe = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier())
    X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
    knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
    knn_grid.fit(X_train, y_train)
    best.append(knn_grid.best_score_)

max(best), k_list[best.index(max(best))]

(0.8753511195158892, 5)

Interesting that 5-fold CV works better here, even though 10-fold worked better for the standard 4 class model (maybe 10-fold did so well there by lowering variance).

In [49]:
pipe = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier())
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=5, verbose=0)
knn_grid.fit(X_train, y_train)
knn_grid.best_params_, knn_grid.best_score_

({'kneighborsclassifier__n_neighbors': 13,
  'kneighborsclassifier__p': 2,
  'kneighborsclassifier__weights': 'uniform'},
 0.8753511195158892)

This also uses a larger number of neighbors (13 instead of 9), which is KNN's closest analogue to stronger regularization, so we've had a fair amount of variance in that parameter.

In [50]:
f1_score(y_train, knn_grid.predict(X_train), labels=['No', 'Yes'], average='weighted')

0.9041078722534099

In [51]:
precision_recall_fscore_support(y_train, knn_grid.predict(X_train), labels=['No', 'Yes'])

(array([0.89919355, 0.90819672]),
 array([0.88844622, 0.91721854]),
 array([0.89378758, 0.91268534]),
 array([251, 302], dtype=int64))

These are very close scores to logistic regression. But we did get Euclidean distance outperforming Manhattan distance this time, which means I'm probably right to be testing for that. If uniform continues to outperform distance-based weights every time, then I may stop testing for that during the two-stage modeling, since it improves run-time.

<a id="knn_cluster"></a>
#### KNN with clustering

#### 7 class with clustering

Since one-hot encoded columns shouldn't really be scaled, we can scale the rest of our data to be compatible with them via min-max scaling between 0 and 1 (which is the default range for MinMaxScaler).
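A minimal sketch of that idea on toy data (the frame `X_toy` and the cluster labels `clusters_toy` are hypothetical stand-ins for `X` and one entry of `cluster_dfs`): after min-max scaling, every numeric column lies in [0, 1], the same range as the one-hot cluster columns.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-ins for X and a single cluster-label Series.
names = pd.Index(["Pikachu", "Venusaur", "Glastrier"], name="name")
X_toy = pd.DataFrame({"hp": [35, 80, 100], "spe": [90, 80, 30]}, index=names)
clusters_toy = pd.Series([0, 1, 1], index=names, name="cluster")

# Scale the numeric features to [0, 1] and keep the index and column names.
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X_toy),
                        index=X_toy.index, columns=X_toy.columns)

# One-hot encode the cluster labels and attach them by index.
dummies = pd.get_dummies(clusters_toy, prefix="cluster").astype(float)
X_final = X_scaled.join(dummies)
```

Like the cells below, this sketch fits the scaler on all rows. Fitting it on the training rows only would avoid letting the test-set ranges influence the scaling.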

In [52]:
best = [0, 0, 0, 0]

for n in n_clusters:
    for c_type in cluster_types:
        for k in k_list:
            pipe = make_pipeline(
                KNeighborsClassifier())
            X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
            X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[n][c_type]), on='name')
            X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats'],
                                                                test_size=0.25,
                                                                random_state=273,
                                                                stratify=y_df['formats'].values)
            knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
            knn_grid.fit(X_train, y_train)
            if knn_grid.best_score_ > best[0]:
                best = [knn_grid.best_score_, k, c_type, n]
                
best

[0.5786460635292554, 10, 'stats', 15]

The performance is very similar to logistic regression, but it's interesting that it chose the stats clusters this time for the 7 class model, as opposed to the general features chosen in logistic regression

In [53]:
pipe = make_pipeline(
    KNeighborsClassifier())
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[15]['stats']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats'].values)
knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=10, verbose=0)
knn_grid.fit(X_train, y_train)
knn_grid.best_params_, knn_grid.best_score_

({'kneighborsclassifier__n_neighbors': 5,
  'kneighborsclassifier__p': 2,
  'kneighborsclassifier__weights': 'uniform'},
 0.5786460635292554)

We have Euclidean outperforming Manhattan distance for a second time!

In [54]:
f1_score(y_train, knn_grid.predict(X_train), labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'], average='weighted')

0.6768973425739454

In [55]:
precision_recall_fscore_support(y_train, knn_grid.predict(X_train), labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'])

(array([0.87401575, 0.51176471, 0.38235294, 0.58823529, 0.57894737,
        0.425     , 0.94736842]),
 array([0.88446215, 0.73109244, 0.41935484, 0.28571429, 0.32352941,
        0.39534884, 0.45      ]),
 array([0.87920792, 0.60207612, 0.4       , 0.38461538, 0.41509434,
        0.40963855, 0.61016949]),
 array([251, 119,  31,  35,  34,  43,  40], dtype=int64))

KNN is performing slightly worse on high performing classes like ZU, PU and Ubers, and slightly worse on most classes in general, but it's doing a lot better on the worst classes like NU, which went up almost .3. I like the evenness of its performance compared to logistic regression, but I'm not sure it's better.

#### 4 class with clustering

In [56]:
best = [0, 0, 0, 0]

for n in n_clusters:
    for c_type in cluster_types:
        for k in k_list:
            pipe = make_pipeline(
                KNeighborsClassifier())
            X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
            X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[n][c_type]), on='name')
            X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats4'],
                                                                test_size=0.25,
                                                                random_state=273,
                                                                stratify=y_df['formats4'].values)
            knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
            knn_grid.fit(X_train, y_train)
            if knn_grid.best_score_ > best[0]:
                best = [knn_grid.best_score_, k, c_type, n]
                
best

[0.6525176296607775, 5, 'stats', 25]

KNN is choosing the stats clusters every single time, which is quite different from logistic regression, which often preferred learnsets or features. That said, logistic regression also chose stats for the 2 class model, which we also used for the two-stage model.

In [57]:
pipe = make_pipeline(
    KNeighborsClassifier())
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[25]['stats']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats4'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats4'].values)
knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=5, verbose=0)
knn_grid.fit(X_train, y_train)
knn_grid.best_params_, knn_grid.best_score_

({'kneighborsclassifier__n_neighbors': 5,
  'kneighborsclassifier__p': 2,
  'kneighborsclassifier__weights': 'distance'},
 0.6525176296607775)

And not only did Euclidean distance outperform Manhattan again, distance weighting outperformed uniform weighting for the first time, so we do need to test for it after all.

In [58]:
f1_score(y_train, knn_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.9963911995177818

In [59]:
precision_recall_fscore_support(y_train, knn_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([1.        , 0.98347107, 1.        , 1.        ]),
 array([0.99203187, 1.        , 1.        , 1.        ]),
 array([0.996     , 0.99166667, 1.        , 1.        ]),
 array([251, 119, 100,  83], dtype=int64))

Wait, this is shocking.  The model got almost everything right?

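This is amazing, but I'm also concerned about overfitting, since the cross-validation score was only about 0.65. One likely contributor is the distance weighting: when predicting on the training data, each pokemon is its own nearest neighbor at distance zero, and in scikit-learn zero-distance neighbors receive all of the weight, so a near-perfect training score is almost guaranteed. It may also be that cross validation is harsh on KNN with so few examples, because holding out a fold changes each point's neighbors drastically.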
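To see how much of a training score like this can come from distance weighting alone, here is a small sketch on pure noise (synthetic data, not the pokemon features). The labels are unrelated to the features, so any score well above chance on the training data reflects memorization rather than signal:

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Random features with random labels: there is nothing real to learn here.
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(200, 5))
y_noise = rng.integers(0, 4, size=200)

knn = KNeighborsClassifier(n_neighbors=5, weights="distance")

# Scoring on the data the model was fitted on: every point finds itself
# at distance zero, so its own label wins the weighted vote.
knn.fit(X_noise, y_noise)
train_f1 = f1_score(y_noise, knn.predict(X_noise), average="weighted")

# Out-of-fold predictions: no point can vote for itself.
cv_pred = cross_val_predict(knn, X_noise, y_noise, cv=5)
cv_f1 = f1_score(y_noise, cv_pred, average="weighted")
```

Under these assumptions the training F-score is perfect while the cross-validated one stays near chance, so for models with `weights='distance'` the cross-validation and test scores are the ones to trust.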

Solution: in my final notebook for modeling, I'm going to test all of my best models on the test set, instead of just my one best model. Normally you shouldn't test multiple models on the test set, since there is a danger of tuning your performance to the test set, but I think there are ways to mitigate that problem. I'm not going to choose the best model based on minor improvements or close comparisons on the test set; I'll value training and validation scores more there.

But if there is a massive performance change on the test set for only some of the models and not others, then statistically that is extremely unlikely to be due to the peculiarities of the test set. Rather, it is a sign of massive overfitting, and that's the value of seeing the performance of more than one model on the test set in this case. Minor overfitting is to be expected, and maybe that's all this is, but if this model does quite poorly on the test set and other models are doing much better, then I won't be able to accept this one.

Let's look at which ones the model actually got wrong, since there are so few:

In [60]:
wrong = pd.DataFrame(knn_grid.predict(X_train), index=X_train.index).merge(y_train, on='name')
wrong.loc[wrong[0] != wrong['formats4']]

name,0,formats4
Silvally-Dark,Low c,Not c
Silvally-Fighting,Low c,Not c


Probably in the case of Silvally it used the resistance index to differentiate between them, and dark and fighting may be good typings which caused them to be closer to some stronger neighbors.

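The fact that the model did get these few wrong shows that it isn't reproducing the labels perfectly, which is a somewhat good sign. It is weaker evidence than it looks, though: with distance weighting, a training pokemon should only be misclassified when other training pokemon sit at zero distance with a different label, so these errors may just reflect Silvally forms with identical feature values.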

#### 4 class alt with clustering

In [61]:
best = [0, 0, 0, 0]

for n in n_clusters:
    for c_type in cluster_types:
        for k in k_list:
            pipe = make_pipeline(
                KNeighborsClassifier())
            X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
            X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[n][c_type]), on='name')
            X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats4alt'],
                                                                test_size=0.25,
                                                                random_state=273,
                                                                stratify=y_df['formats4alt'].values)
            knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
            knn_grid.fit(X_train, y_train)
            if knn_grid.best_score_ > best[0]:
                best = [knn_grid.best_score_, k, c_type, n]
                
best

[0.6865444747067663, 5, 'stats', 25]

This has the same best clustering parameters as the other 4 class clustering model, which is sensible.

In [62]:
pipe = make_pipeline(
    KNeighborsClassifier())
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[25]['stats']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats4alt'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats4alt'].values)
knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=5, verbose=0)
knn_grid.fit(X_train, y_train)
knn_grid.best_params_, knn_grid.best_score_

({'kneighborsclassifier__n_neighbors': 13,
  'kneighborsclassifier__p': 2,
  'kneighborsclassifier__weights': 'uniform'},
 0.6865444747067663)

However, it has a higher best number of neighbors, and goes back to uniform weighting of the neighbors.

In [63]:
f1_score(y_train, knn_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.7131340875081684

In [64]:
precision_recall_fscore_support(y_train, knn_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.87250996, 0.52      , 0.58333333, 0.9047619 ]),
 array([0.87250996, 0.54621849, 0.63636364, 0.475     ]),
 array([0.87250996, 0.53278689, 0.60869565, 0.62295082]),
 array([251, 119, 143,  40], dtype=int64))

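This model actually performs worse than logistic regression again. For logistic regression we found that the standard 4 class model performs better than the alternative, and on training scores we are seeing that again even more drastically for KNN. The comparison is not entirely fair, though: the standard model's near-perfect training score came with distance weighting, and by cross-validation score the alternative is actually slightly ahead (0.687 vs 0.653).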

#### 2 class with clustering

In [65]:
best = [0, 0, 0, 0]

for n in n_clusters:
    for c_type in cluster_types:
        for k in k_list:
            pipe = make_pipeline(
                KNeighborsClassifier())
            X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
            X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[n][c_type]), on='name')
            X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats2'],
                                                                test_size=0.25,
                                                                random_state=273,
                                                                stratify=y_df['formats2'].values)
            knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
            knn_grid.fit(X_train, y_train)
            if knn_grid.best_score_ > best[0]:
                best = [knn_grid.best_score_, k, c_type, n]
                
best

[0.9018900579706037, 3, 'stats', 25]

Exactly the same best parameters as for logistic regression. It's very interesting that KNN seems to cluster by stats every time, whereas other algorithms use learnsets and features also.

In [66]:
pipe = make_pipeline(
    KNeighborsClassifier())
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[25]['stats']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=3, verbose=0)
knn_grid.fit(X_train, y_train)
knn_grid.best_params_, knn_grid.best_score_

({'kneighborsclassifier__n_neighbors': 15,
  'kneighborsclassifier__p': 2,
  'kneighborsclassifier__weights': 'uniform'},
 0.9018900579706037)

For two classes, it likes a high number of neighbors. Maybe even more than 15 would do well, but I'm not sure it matters, and a higher number of neighbors often leads to underfitting.

In [67]:
f1_score(y_train, knn_grid.predict(X_train), labels=['No', 'Yes'], average='weighted')

0.9016989074110657

In [68]:
precision_recall_fscore_support(y_train, knn_grid.predict(X_train), labels=['No', 'Yes'])

(array([0.93777778, 0.87804878]),
 array([0.84063745, 0.95364238]),
 array([0.88655462, 0.91428571]),
 array([251, 302], dtype=int64))

This is very comparable performance to logistic regression. It's still amazing how much better the standard 4 class model performed on the training data, even than a 2 class model! Assuming no mistakes or absurd overfitting, it's almost like 4 is a more natural clustering amount for the dataset.

<a id="two_stage_knn"></a>
#### Two-Stage KNN

#### two-stage 7 class, no clusters

The first part of this two-stage model is just regular KNN for two classes (which we already did, so we can just use that model again). It separates out the largest class, ZU, i.e. relatively non-competitive pokemon, so that the second model doesn't have to include it and can exercise ALL of its discernment on figuring out which competitive class a competitive pokemon belongs to. As we saw from many of the f-scores above, that can be quite difficult in some cases, so letting the second model focus on it might lead to higher performance than a single-stage model.
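The two-stage idea can be sketched as a small helper. This is an illustration on synthetic data under assumed names (`two_stage_predict`, `X_demo`, `tiers` are all hypothetical); the cells below do the same thing with DataFrame merges instead:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def two_stage_predict(first_stage, second_stage, X, fallback_label):
    """Use fallback_label where stage 1 predicts 'No', else ask stage 2."""
    is_competitive = first_stage.predict(X) == "Yes"
    y_pred = pd.Series(fallback_label, index=X.index, dtype=object)
    if is_competitive.any():
        y_pred[is_competitive] = second_stage.predict(X[is_competitive])
    return y_pred

# Synthetic stand-in data: class 0 plays the role of ZU.
X_arr, y_arr = make_classification(n_samples=300, n_features=6, n_informative=4,
                                   n_classes=3, random_state=0)
X_demo = pd.DataFrame(X_arr)
tiers = pd.Series(np.array(["ZU", "PU", "OU"], dtype=object)[y_arr], index=X_demo.index)
competitive = pd.Series(np.where(tiers == "ZU", "No", "Yes"), index=X_demo.index)

# Stage 1 separates competitive from non-competitive rows.
first = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=13))
first.fit(X_demo, competitive)

# Stage 2 is trained on the truly competitive rows only.
mask = competitive == "Yes"
second = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
second.fit(X_demo[mask], tiers[mask])

y_pred = two_stage_predict(first, second, X_demo, fallback_label="ZU")
```

Any mistake made by the first stage is passed on: a competitive row predicted as 'No' can never receive a competitive tier from the second stage.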

In [69]:
pipe = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier())
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
first_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=5, verbose=0)
first_stage.fit(X_train, y_train)
first_stage.best_params_, first_stage.best_score_

({'kneighborsclassifier__n_neighbors': 13,
  'kneighborsclassifier__p': 2,
  'kneighborsclassifier__weights': 'uniform'},
 0.8753511195158892)

That's the same model that we used before to separate competitive and non-competitive pokemon. To train the second model, we first remove the ZU pokemon from consideration by filtering X and y_df on the true labels, so that we're only looking at competitive pokemon. The first-stage predictions come into play later, when the two stages are combined:

In [70]:
y_df['formats2'].loc[y_df['formats2'] == 'Yes']

name
Venusaur          Yes
Charizard         Yes
Blastoise         Yes
Pikachu           Yes
Raichu            Yes
                 ... 
Glastrier         Yes
Spectrier         Yes
Calyrex           Yes
Calyrex-Ice       Yes
Calyrex-Shadow    Yes
Name: formats2, Length: 403, dtype: object

In [71]:
X_second = X.loc[y_df['formats2'] == 'Yes']
X_second

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Cleric,Pivot,...,Priority,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability,HP Recovery
Venusaur,80,82,83,100,100,80,2,0,0,0,...,0,5,6,4,6,4,4,0,63.636364,3
Charizard,78,84,78,109,85,100,8,0,0,0,...,1,3,14,10,11,5,4,0,50.000000,3
Blastoise,79,83,100,85,105,78,2,0,0,3,...,2,2,12,10,11,7,4,0,75.000000,0
Pikachu,35,55,40,50,50,90,2,0,1,3,...,2,2,7,7,5,3,3,0,70.000000,1
Raichu,60,90,55,90,80,110,2,0,1,3,...,2,2,7,7,6,4,3,0,70.000000,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,1,12,7,3,2,2,0,75.000000,0
Spectrier,100,65,60,145,80,130,8,0,0,0,...,0,0,4,4,4,3,3,0,1.000000,0
Calyrex,100,80,80,80,80,80,-2,0,1,0,...,0,2,3,3,9,4,3,0,18.181818,2
Calyrex-Ice,100,165,150,85,130,50,-4,0,1,0,...,0,2,15,9,12,5,3,1,1.000000,2


In [72]:
y_second_7 = y_df['formats'].loc[y_df['formats2'] == 'Yes']
y_second_7

name
Venusaur            OU
Charizard           PU
Blastoise           NU
Pikachu             PU
Raichu              PU
                  ... 
Glastrier           NU
Spectrier         Uber
Calyrex             PU
Calyrex-Ice       Uber
Calyrex-Shadow    Uber
Name: formats, Length: 403, dtype: object

In [73]:
best = []

for k in k_list:
    pipe = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier())
    X_train, X_test, y_train, y_test = train_test_split(X_second, y_second_7,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_7.values)
    knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
    knn_grid.fit(X_train, y_train)
    best.append(knn_grid.best_score_)

max(best), k_list[best.index(max(best))]

(0.3909825466677862, 5)

As with logistic regression, this is quite a low score.

In [74]:
pipe = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier())
X_train, X_test, y_train, y_test = train_test_split(X_second, y_second_7,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_7.values)
second_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=5, verbose=0)
second_stage.fit(X_train, y_train)
second_stage.best_params_, second_stage.best_score_

({'kneighborsclassifier__n_neighbors': 4,
  'kneighborsclassifier__p': 1,
  'kneighborsclassifier__weights': 'distance'},
 0.3909825466677862)

In [75]:
precision_recall_fscore_support(y_train, second_stage.predict(X_train), labels=['PU', 'NU', 'RU', 'UU', 'OU', 'Uber'])

(array([1.     , 0.96875, 1.     , 1.     , 1.     , 1.     ]),
 array([0.99159664, 1.        , 1.        , 1.        , 1.        ,
        1.        ]),
 array([0.99578059, 0.98412698, 1.        , 1.        , 1.        ,
        1.        ]),
 array([119,  31,  35,  34,  43,  40], dtype=int64))

What!?!?! Yet again we are faced with a model that gets almost everything correct! And this is in stark contradiction to its cross-validation F-score! The chosen model uses distance weighting, so on the training data every pokemon is its own zero-distance neighbor and effectively votes for its own label, which makes a near-perfect training score expected rather than informative. A low number of neighbors like 4 also tends to overfit. Yet again, we'll have to try it on the test set to see how much overfitting there is, but the performance here is astonishingly good.

In [76]:
wrong = pd.DataFrame(second_stage.predict(X_train), index=X_train.index).merge(y_train, on='name')
wrong.loc[wrong[0] != wrong['formats']]

name,0,formats
Silvally-Poison,NU,PU


Silvally poison is in fact the only pokemon we got wrong! This is interesting and may be due to resistance index again.

In [77]:
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
pred_1 = pd.DataFrame(first_stage.predict(X_train), index=X_train.index)
pred_2 = pd.DataFrame(second_stage.predict(X_train[pred_1[0] == 'Yes']), index=X_train[pred_1[0] == 'Yes'].index)
y_pred = pred_1.merge(pred_2, on='name', how='left')['0_y'].fillna(value='ZU')
y_validation = y_train.to_frame().merge(y_df['formats'], on='name', how='left')['formats']
y_validation

name
Absol              PU
Ninetales-Alola    OU
Palossand          PU
Ponyta-Galar       ZU
Carvanha           ZU
                   ..
Dragonair          ZU
Qwilfish           PU
Cryogonal          PU
Wailord            ZU
Blaziken           OU
Name: formats, Length: 553, dtype: object

In [78]:
f1_score(y_validation, y_pred, labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'], average='weighted')

0.8435444591764393

In [79]:
precision_recall_fscore_support(y_validation, y_pred, labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'])

(array([0.89919355, 0.71851852, 0.82142857, 0.80555556, 0.87878788,
        0.86842105, 0.91428571]),
 array([0.88844622, 0.76377953, 0.76666667, 0.85294118, 0.85294118,
        0.80487805, 0.88888889]),
 array([0.89378758, 0.74045802, 0.79310345, 0.82857143, 0.86567164,
        0.83544304, 0.90140845]),
 array([251, 127,  30,  34,  34,  41,  36], dtype=int64))

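This is actually a very fascinating result. Making the model two-stage not only gives very good performance on all 7 classes, it also means the model is not getting EVERY pokemon correct (which would hardly be believable). Part of that is probably not reduced overfitting, though. The support counts here (e.g. 127 PU) differ from the ones the second stage was trained on (119 PU), because X and X_second were split separately. So some of these pokemon were never seen by the second stage during training, and the first stage's mistakes carry through as well.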
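One way to keep the two stages on the same rows is to split once and derive the second-stage training set from that split, instead of calling `train_test_split` separately on the filtered data. A minimal sketch on toy data (all names and values here are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame indexed by name, with a binary competitive / non-competitive label.
names = pd.Index([f"mon{i}" for i in range(100)], name="name")
X_demo = pd.DataFrame({"stat": np.arange(100)}, index=names)
y2 = pd.Series(np.where(np.arange(100) % 2 == 0, "Yes", "No"), index=names)

# Split once on the full data...
X_train, X_test, y2_train, y2_test = train_test_split(
    X_demo, y2, test_size=0.25, random_state=273, stratify=y2)

# ...then take the second-stage training rows from that same split.
X_train_second = X_train[y2_train == "Yes"]
```

With this layout, no row of `X_test` can end up in the second stage's training data, and the combined evaluation on `X_train` only covers rows that both stages were fitted on.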

#### two-stage 4 class, no clusters

In [80]:
y_second_4 = y_df['formats4'].loc[y_df['formats2'] == 'Yes']
best = []

for k in k_list:
    pipe = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier())
    X_train, X_test, y_train, y_test = train_test_split(X_second, y_second_4,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_4.values)
    knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
    knn_grid.fit(X_train, y_train)
    best.append(knn_grid.best_score_)

max(best), k_list[best.index(max(best))]

(0.5617779163259898, 10)

In [81]:
pipe = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier())
X_train, X_test, y_train, y_test = train_test_split(X_second, y_second_4,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_4.values)
second_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=10, verbose=0)
second_stage.fit(X_train, y_train)
second_stage.best_params_, second_stage.best_score_

({'kneighborsclassifier__n_neighbors': 12,
  'kneighborsclassifier__p': 2,
  'kneighborsclassifier__weights': 'distance'},
 0.5617779163259898)

In [82]:
precision_recall_fscore_support(y_train, second_stage.predict(X_train), labels=['Low c', 'Mid c', 'High c'])

(array([1., 1., 1.]),
 array([1., 1., 1.]),
 array([1., 1., 1.]),
 array([119, 100,  83], dtype=int64))

Perfect scores again, which is just funny at this point. Luckily the two-stage model might bail us out from this level of overfitting.

In [83]:
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
pred_1 = pd.DataFrame(first_stage.predict(X_train), index=X_train.index)
pred_2 = pd.DataFrame(second_stage.predict(X_train[pred_1[0] == 'Yes']), index=X_train[pred_1[0] == 'Yes'].index)
y_pred = pred_1.merge(pred_2, on='name', how='left')['0_y'].fillna(value='Not c')
y_validation = y_train.to_frame().merge(y_df['formats4'], on='name', how='left')['formats4']
y_validation

name
Absol               Low c
Ninetales-Alola    High c
Palossand           Low c
Ponyta-Galar        Not c
Carvanha            Not c
                    ...  
Dragonair           Not c
Qwilfish            Low c
Cryogonal           Low c
Wailord             Not c
Blaziken           High c
Name: formats4, Length: 553, dtype: object

In [84]:
f1_score(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.8479552672955538

In [85]:
precision_recall_fscore_support(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.89919355, 0.71631206, 0.82474227, 0.95522388]),
 array([0.88844622, 0.79527559, 0.81632653, 0.83116883]),
 array([0.89378758, 0.75373134, 0.82051282, 0.88888889]),
 array([251, 127,  98,  77], dtype=int64))

Similarly to the 7 class model, the two-stage seems to take away a lot of overfitting while still giving quite GOOD performance, which is promising. However, since the performance isn't necessarily notably better than the 7 class model, there might not be a reason to use 4 classes when doing a two-stage model, which is also what we found for logistic regression.

Another interesting fact is that this model used 12 neighbors, which actually isn't that small, so maybe the low number of neighbors wasn't that important. Cross validation still has a large effect, though, on how many useful neighbors are available to the model.
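A rough way to see the cross validation effect is to count how many rows each fold actually gets to train on. The sketch below is plain arithmetic under the assumption that folds are as equal in size as possible; `fold_sizes` is a hypothetical helper, and 302 is the second-stage training size implied by the support counts (119 + 100 + 83) printed above.

```python
def fold_sizes(n_samples, n_folds):
    """Return (training rows, validation rows) for the largest validation fold."""
    validation = -(-n_samples // n_folds)  # ceiling division
    return n_samples - validation, validation


n_second_stage = 119 + 100 + 83  # 302 rows in the second-stage training split

sizes = {k: fold_sizes(n_second_stage, k) for k in (2, 3, 5, 10)}
for k, (train_rows, val_rows) in sizes.items():
    print(f"cv={k}: about {train_rows} rows to search for neighbors, {val_rows} held out")
```

With `cv=2`, only about half of the rows are available as candidate neighbors in each fold, while `cv=10` keeps roughly 90% of them, so the same `n_neighbors` value reaches much further into sparse classes at low fold counts.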

#### two-stage 4 class alt, no clusters

In [86]:
y_second_4alt = y_df['formats4alt'].loc[y_df['formats2'] == 'Yes']
best = []

for k in k_list:
    pipe = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier())
    X_train, X_test, y_train, y_test = train_test_split(X_second, y_second_4alt,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_4alt.values)
    knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
    knn_grid.fit(X_train, y_train)
    best.append(knn_grid.best_score_)

max(best), k_list[best.index(max(best))]

(0.5934554693725439, 2)

In [87]:
pipe = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier())
X_train, X_test, y_train, y_test = train_test_split(X_second, y_second_4alt,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_4alt.values)
second_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=2, verbose=0)
second_stage.fit(X_train, y_train)
second_stage.best_params_, second_stage.best_score_

({'kneighborsclassifier__n_neighbors': 14,
  'kneighborsclassifier__p': 2,
  'kneighborsclassifier__weights': 'distance'},
 0.5934554693725439)

In [88]:
precision_recall_fscore_support(y_train, second_stage.predict(X_train), labels=['Low c', 'Mid c', 'High c'])

(array([1., 1., 1.]),
 array([1., 1., 1.]),
 array([1., 1., 1.]),
 array([119, 143,  40], dtype=int64))

Perfect again, unsurprisingly.

In [89]:
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
pred_1 = pd.DataFrame(first_stage.predict(X_train), index=X_train.index)
pred_2 = pd.DataFrame(second_stage.predict(X_train[pred_1[0] == 'Yes']), index=X_train[pred_1[0] == 'Yes'].index)
y_pred = pred_1.merge(pred_2, on='name', how='left')['0_y'].fillna(value='Not c')
y_validation = y_train.to_frame().merge(y_df['formats4alt'], on='name', how='left')['formats4alt']
y_validation

name
Absol              Low c
Ninetales-Alola    Mid c
Palossand          Low c
Ponyta-Galar       Not c
Carvanha           Not c
                   ...  
Dragonair          Not c
Qwilfish           Low c
Cryogonal          Low c
Wailord            Not c
Blaziken           Mid c
Name: formats4alt, Length: 553, dtype: object

In [90]:
f1_score(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.8579077608459496

In [91]:
precision_recall_fscore_support(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.89919355, 0.79831933, 0.81875   , 1.        ]),
 array([0.88844622, 0.7480315 , 0.94244604, 0.72222222]),
 array([0.89378758, 0.77235772, 0.87625418, 0.83870968]),
 array([251, 127, 139,  36], dtype=int64))

And again, good performance, but not much better than 7 classes.
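As a sanity check on how the overall score relates to the per-class arrays, the weighted F1 is just the per-class F1 scores averaged with the class supports as weights. The sketch below recomputes it from the printed (rounded) outputs above; `weighted_f1` is a hypothetical helper, not a scikit-learn function.

```python
def weighted_f1(per_class_f1, support):
    """Support-weighted average of per-class F1 scores."""
    total = sum(support)
    return sum(f1 * n for f1, n in zip(per_class_f1, support)) / total


# Per-class F1 and support, copied from the two-stage outputs above
# (label order: 'Not c', 'Low c', 'Mid c', 'High c').
f1_4 = [0.89378758, 0.75373134, 0.82051282, 0.88888889]
support_4 = [251, 127, 98, 77]

f1_4alt = [0.89378758, 0.77235772, 0.87625418, 0.83870968]
support_4alt = [251, 127, 139, 36]

score_4 = weighted_f1(f1_4, support_4)
score_4alt = weighted_f1(f1_4alt, support_4alt)
print(round(score_4, 4), round(score_4alt, 4))
```

Because "Not c" carries 251 of the 553 rows, the first-stage result dominates the overall number, which is part of why the different second-stage variants end up so close together.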

<a id="two_stage_knn_cluster"></a>
#### two-stage KNN with clustering

#### two-stage 7 class with clustering

We need to set it up so that our first stage has clustering now:

In [92]:
pipe = make_pipeline(
    KNeighborsClassifier())
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[25]['stats']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
first_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=3, verbose=0)
first_stage.fit(X_train, y_train)
first_stage.best_params_, first_stage.best_score_

({'kneighborsclassifier__n_neighbors': 15,
  'kneighborsclassifier__p': 2,
  'kneighborsclassifier__weights': 'uniform'},
 0.9018900579706037)

And we already have X_second and all iterations of y_second set up to make our training and testing sets, so we can just go ahead and do hyperparameter search:

In [93]:
best = [0, 0, 0, 0]

for n in n_clusters:
    for c_type in cluster_types:
        for k in k_list:
            pipe = make_pipeline(
                KNeighborsClassifier())
            X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X_second), index=X_second.index, columns=X_second.columns)
            X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[n][c_type]), on='name')
            X_train, X_test, y_train, y_test = train_test_split(X_final, y_second_7,
                                                                test_size=0.25,
                                                                random_state=273,
                                                                stratify=y_second_7.values)
            knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
            knn_grid.fit(X_train, y_train)
            if knn_grid.best_score_ > best[0]:
                best = [knn_grid.best_score_, k, c_type, n]
                
best

[0.4365004582201468, 10, 'stats', 15]

It is interesting that KNN always seems to choose to cluster based on stats, but logistic regression had more variance in the type of clustering it used.

In [94]:
pipe = make_pipeline(
    KNeighborsClassifier())
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X_second), index=X_second.index, columns=X_second.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[15]['stats']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_second_7,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_7.values)
second_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=10, verbose=0)
second_stage.fit(X_train, y_train)
second_stage.best_params_, second_stage.best_score_

({'kneighborsclassifier__n_neighbors': 8,
  'kneighborsclassifier__p': 2,
  'kneighborsclassifier__weights': 'distance'},
 0.4365004582201468)

In [95]:
precision_recall_fscore_support(y_train, second_stage.predict(X_train), labels=['PU', 'NU', 'RU', 'UU', 'OU', 'Uber'])

(array([1.     , 0.96875, 1.     , 1.     , 1.     , 1.     ]),
 array([0.99159664, 1.        , 1.        , 1.        , 1.        ,
        1.        ]),
 array([0.99578059, 0.98412698, 1.        , 1.        , 1.        ,
        1.        ]),
 array([119,  31,  35,  34,  43,  40], dtype=int64))

Again very nearly 100%. We can guess that Silvally will be what it gets wrong again:

In [96]:
wrong = pd.DataFrame(second_stage.predict(X_train), index=X_train.index).merge(y_train, on='name')
wrong.loc[wrong[0] != wrong['formats']]

Unnamed: 0_level_0,0,formats
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Silvally-Poison,NU,PU


Yep, just Silvally-Poison is wrong again.

In [97]:
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[25]['stats']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
pred_1 = pd.DataFrame(first_stage.predict(X_train), index=X_train.index)

#drop old clusters and merge the new clusters
#because the two models work with different clusterings
X_train = X_train.drop(columns=list(range(0, 25)))
X_train = X_train.merge(pd.get_dummies(cluster_dfs[15]['stats']), on='name', how='left')

pred_2 = pd.DataFrame(second_stage.predict(X_train[pred_1[0] == 'Yes']), index=X_train[pred_1[0] == 'Yes'].index)
y_pred = pred_1.merge(pred_2, on='name', how='left')['0_y'].fillna(value='ZU')
y_validation = y_train.to_frame().merge(y_df['formats'], on='name', how='left')['formats']
y_validation

name
Absol              PU
Ninetales-Alola    OU
Palossand          PU
Ponyta-Galar       ZU
Carvanha           ZU
                   ..
Dragonair          ZU
Qwilfish           PU
Cryogonal          PU
Wailord            ZU
Blaziken           OU
Name: formats, Length: 553, dtype: object

In [98]:
f1_score(y_validation, y_pred, labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'], average='weighted')

0.849138551373321

In [99]:
precision_recall_fscore_support(y_validation, y_pred, labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'])

(array([0.93777778, 0.69871795, 0.91666667, 0.87878788, 0.79487179,
        0.83333333, 0.91176471]),
 array([0.84063745, 0.85826772, 0.73333333, 0.85294118, 0.91176471,
        0.85365854, 0.86111111]),
 array([0.88655462, 0.77031802, 0.81481481, 0.86567164, 0.84931507,
        0.84337349, 0.88571429]),
 array([251, 127,  30,  34,  34,  41,  36], dtype=int64))

This is very slightly better than, but extremely similar to, the two-stage model that didn't use clustering.

Intuitively, it makes some sense why the clustering wouldn't help KNN by much, since KNN is just looking for the data points that are closest, and the clusters only give somewhat redundant information on what is closer, since they are also judging by what is closer. This may be a good justification for not using a clustering model for KNN even if it performs slightly better, since it wouldn't be any nearer a performance "elbow" or "knee", and because it's redundant to a large extent.
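The redundancy can be made concrete with a small distance calculation. This is a sketch under assumptions: two made-up points with features already scaled to [0, 1] (as `MinMaxScaler` would produce) and one-hot cluster columns like the ones `pd.get_dummies` appends. The helpers `one_hot` and `squared_distance` are hypothetical.

```python
def one_hot(cluster_id, n_clusters):
    """One-hot encode a cluster label as a list of floats."""
    return [1.0 if i == cluster_id else 0.0 for i in range(n_clusters)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Two made-up points with scaled features.
a_feats, b_feats = [0.20, 0.70], [0.25, 0.60]

base = squared_distance(a_feats, b_feats)
same = squared_distance(a_feats + one_hot(1, 3), b_feats + one_hot(1, 3))
diff = squared_distance(a_feats + one_hot(1, 3), b_feats + one_hot(2, 3))

print(base, same, diff)
```

Points in the same cluster keep exactly their original distance, and points in different clusters get a fixed penalty of 2 added to their squared Euclidean distance. Since the clusters were themselves built from closeness in feature space, that penalty mostly pushes apart points that were already far apart.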

#### two-stage 4 class with clustering

In [100]:
best = [0, 0, 0, 0]

for n in n_clusters:
    for c_type in cluster_types:
        for k in k_list:
            pipe = make_pipeline(
                KNeighborsClassifier())
            X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X_second), index=X_second.index, columns=X_second.columns)
            X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[n][c_type]), on='name')
            X_train, X_test, y_train, y_test = train_test_split(X_final, y_second_4,
                                                                test_size=0.25,
                                                                random_state=273,
                                                                stratify=y_second_4.values)
            knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
            knn_grid.fit(X_train, y_train)
            if knn_grid.best_score_ > best[0]:
                best = [knn_grid.best_score_, k, c_type, n]
                
best

[0.5586070813170988, 10, 'learnsets', 15]

That's the first time in a while that learnsets has been the best clustering parameter, but let's see if it's to any significant effect:

In [101]:
pipe = make_pipeline(
    KNeighborsClassifier())
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X_second), index=X_second.index, columns=X_second.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[15]['learnsets']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_second_4,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_4.values)
second_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=10, verbose=0)
second_stage.fit(X_train, y_train)
second_stage.best_params_, second_stage.best_score_

({'kneighborsclassifier__n_neighbors': 12,
  'kneighborsclassifier__p': 1,
  'kneighborsclassifier__weights': 'distance'},
 0.5586070813170988)

In [102]:
precision_recall_fscore_support(y_train, second_stage.predict(X_train), labels=['Low c', 'Mid c', 'High c'])

(array([1., 1., 1.]),
 array([1., 1., 1.]),
 array([1., 1., 1.]),
 array([119, 100,  83], dtype=int64))

Predictable

In [103]:
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[25]['stats']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
pred_1 = pd.DataFrame(first_stage.predict(X_train), index=X_train.index)

#drop old clusters and merge the new clusters
#because the two models work with different clusterings
X_train = X_train.drop(columns=list(range(0, 25)))
X_train = X_train.merge(pd.get_dummies(cluster_dfs[15]['learnsets']), on='name', how='left')

pred_2 = pd.DataFrame(second_stage.predict(X_train[pred_1[0] == 'Yes']), index=X_train[pred_1[0] == 'Yes'].index)
y_pred = pred_1.merge(pred_2, on='name', how='left')['0_y'].fillna(value='Not c')
y_validation = y_train.to_frame().merge(y_df['formats4'], on='name', how='left')['formats4']
y_validation

name
Absol               Low c
Ninetales-Alola    High c
Palossand           Low c
Ponyta-Galar        Not c
Carvanha            Not c
                    ...  
Dragonair           Not c
Qwilfish            Low c
Cryogonal           Low c
Wailord             Not c
Blaziken           High c
Name: formats4, Length: 553, dtype: object

In [104]:
f1_score(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.8355227241160874

In [105]:
precision_recall_fscore_support(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.93777778, 0.7       , 0.73636364, 0.92647059]),
 array([0.84063745, 0.82677165, 0.82653061, 0.81818182]),
 array([0.88655462, 0.75812274, 0.77884615, 0.86896552]),
 array([251, 127,  98,  77], dtype=int64))

Typical, and actually even worse performance than the 7 class model, which is funny.

#### two-stage 4 class alt with clustering

In [106]:
best = [0, 0, 0, 0]

for n in n_clusters:
    for c_type in cluster_types:
        for k in k_list:
            pipe = make_pipeline(
                KNeighborsClassifier())
            X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X_second), index=X_second.index, columns=X_second.columns)
            X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[n][c_type]), on='name')
            X_train, X_test, y_train, y_test = train_test_split(X_final, y_second_4alt,
                                                                test_size=0.25,
                                                                random_state=273,
                                                                stratify=y_second_4alt.values)
            knn_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
            knn_grid.fit(X_train, y_train)
            if knn_grid.best_score_ > best[0]:
                best = [knn_grid.best_score_, k, c_type, n]
                
best

[0.6321945296191911, 2, 'stats', 15]

In [107]:
pipe = make_pipeline(
    KNeighborsClassifier())
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X_second), index=X_second.index, columns=X_second.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[15]['stats']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_second_4alt,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_4alt.values)
second_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=2, verbose=0)
second_stage.fit(X_train, y_train)
second_stage.best_params_, second_stage.best_score_

({'kneighborsclassifier__n_neighbors': 15,
  'kneighborsclassifier__p': 2,
  'kneighborsclassifier__weights': 'uniform'},
 0.6321945296191911)

In [108]:
precision_recall_fscore_support(y_train, second_stage.predict(X_train), labels=['Low c', 'Mid c', 'High c'])

(array([0.55072464, 0.62758621, 1.        ]),
 array([0.63865546, 0.63636364, 0.475     ]),
 array([0.59143969, 0.63194444, 0.6440678 ]),
 array([119, 143,  40], dtype=int64))

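The alt classes fall from perfect training scores back to pretty poor ones, which is good information. Unlike the other second-stage models in this section, this one selected uniform rather than distance weights. With distance weighting, every training point is its own zero-distance neighbor, which likely explains a good part of the perfect training scores seen earlier.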
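One thing worth checking when reading these training-set scores is the `weights` setting. Below is a toy, dependency-free KNN (a hypothetical `knn_predict`, not the scikit-learn implementation) on made-up 1-D data with alternating labels. It assumes the usual convention that, under distance weighting, an exact zero-distance match decides the vote.

```python
from collections import defaultdict


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def knn_predict(train_X, train_y, query, k, weights):
    """Toy KNN vote with 'uniform' or 'distance' weights."""
    neighbours = sorted(
        (euclidean(x, query), label) for x, label in zip(train_X, train_y)
    )[:k]
    votes = defaultdict(float)
    exact = [label for dist, label in neighbours if dist == 0.0]
    if weights == "distance" and exact:
        # A zero-distance neighbour takes the whole vote.
        for label in exact:
            votes[label] += 1.0
    else:
        for dist, label in neighbours:
            votes[label] += 1.0 if weights == "uniform" else 1.0 / dist
    return max(votes, key=votes.get)


def training_accuracy(train_X, train_y, k, weights):
    hits = sum(
        knn_predict(train_X, train_y, x, k, weights) == y
        for x, y in zip(train_X, train_y)
    )
    return hits / len(train_y)


# Made-up data: labels alternate, so neighbours usually disagree with the point itself.
train_X = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
train_y = ["Low c", "Mid c", "Low c", "Mid c", "Low c", "Mid c"]

acc_distance = training_accuracy(train_X, train_y, k=3, weights="distance")
acc_uniform = training_accuracy(train_X, train_y, k=3, weights="uniform")
print(acc_distance, acc_uniform)
```

Even on data this noisy, distance weighting scores perfectly on its own training rows, because each row finds itself first. So a perfect training score with `weights='distance'` says little about overfitting either way, and the cross-validated score is the more informative number.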

In [109]:
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[25]['stats']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
pred_1 = pd.DataFrame(first_stage.predict(X_train), index=X_train.index)

#drop old clusters and merge the new clusters
#because the two models work with different clusterings
X_train = X_train.drop(columns=list(range(0, 25)))
X_train = X_train.merge(pd.get_dummies(cluster_dfs[15]['stats']), on='name', how='left')

pred_2 = pd.DataFrame(second_stage.predict(X_train[pred_1[0] == 'Yes']), index=X_train[pred_1[0] == 'Yes'].index)
y_pred = pred_1.merge(pred_2, on='name', how='left')['0_y'].fillna(value='Not c')
y_validation = y_train.to_frame().merge(y_df['formats4alt'], on='name', how='left')['formats4alt']
y_validation

name
Absol              Low c
Ninetales-Alola    Mid c
Palossand          Low c
Ponyta-Galar       Not c
Carvanha           Not c
                   ...  
Dragonair          Not c
Qwilfish           Low c
Cryogonal          Low c
Wailord            Not c
Blaziken           Mid c
Name: formats4alt, Length: 553, dtype: object

In [110]:
f1_score(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.7104159549676446

In [111]:
precision_recall_fscore_support(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.93777778, 0.48322148, 0.55      , 0.89473684]),
 array([0.84063745, 0.56692913, 0.63309353, 0.47222222]),
 array([0.88655462, 0.52173913, 0.58862876, 0.61818182]),
 array([251, 127, 139,  36], dtype=int64))

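This is greatly reduced performance compared to the standard two-stage 4 class model with clustering (0.836) and the two-stage 4 class alt model without clustering (0.858), and it is no better than the one-stage 4 class alt model with clustering (0.713). There is no reason to use this model.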

Notebook runtime: about 7 minutes and 30 seconds on my computer

<a id="performance"></a>
## Score Summary:

### one-stage, no clusters

#### 7 classes, no clusters
0.6531496355198588\
[0.87452471, 0.60447761, 0.24137931, 0.22641509, 0.35294118, 0.48275862, 0.53968254]
 
#### 4 classes, no clusters
0.6798280384593879\
[0.83603604, 0.53275109, 0.49723757, 0.63829787]

#### 4 class alt no clusters
0.6865193037830023\
[0.84021544, 0.47111111, 0.62406015, 0.5862069 ]

#### 2 class no clusters
0.9041078722534099\
[0.89378758, 0.91268534]

### one-stage, with clustering

#### 7 class with clustering
0.6768973425739454\
[0.87920792, 0.60207612, 0.4, 0.38461538, 0.41509434, 0.40963855, 0.61016949]

#### 4 class with clustering
0.9963911995177818\
[0.996     , 0.99166667, 1.        , 1.        ]

#### 4 class alt with clustering
0.7131340875081684\
[0.87250996, 0.53278689, 0.60869565, 0.62295082]

#### 2 class with clustering
0.9016989074110657\
[0.88655462, 0.91428571]

### two-stage, no clustering

#### two-stage 7 class, no clusters
0.8435444591764393\
[0.89378758, 0.74045802, 0.79310345, 0.82857143, 0.86567164, 0.83544304, 0.90140845]

#### two-stage 4 class, no clusters
0.8479552672955538\
[0.89378758, 0.75373134, 0.82051282, 0.88888889]

#### two-stage 4 class alt, no clusters
0.8579077608459496\
[0.89378758, 0.77235772, 0.87625418, 0.83870968]

### two-stage, with clustering

#### two-stage 7 class with clustering
0.849138551373321\
[0.88655462, 0.77031802, 0.81481481, 0.86567164, 0.84931507, 0.84337349, 0.88571429]

#### two-stage 4 class with clustering
0.8355227241160874\
[0.88655462, 0.75812274, 0.77884615, 0.86896552]

#### two-stage 4 class alt with clustering
0.7104159549676446\
[0.88655462, 0.52173913, 0.58862876, 0.61818182]

## Performance Summary

### best 7 class model: two-stage 7 class with clustering (5/8)

#### best 7 class overall
- two-stage 7 class with clustering, 0.849138551373321

#### best 7 class ZU
- two-stage 7 class, no clusters, 0.89378758

#### best 7 class PU
- two-stage 7 class with clustering, 0.77031802

#### best 7 class NU
- two-stage 7 class with clustering, 0.81481481

#### best 7 class RU
- two-stage 7 class with clustering, 0.86567164

#### best 7 class UU
- two-stage 7 class, no clusters, 0.86567164

#### best 7 class OU
- two-stage 7 class with clustering, 0.84337349

#### best 7 class Uber
- two-stage 7 class, no clusters, 0.90140845
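The per-class tally above can be reproduced directly from the two F1 arrays in the score summary. This is only a sketch of the comparison; the variable names are made up for illustration.

```python
labels = ['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber']

# Per-class F1 copied from the score summary for the two-stage 7 class models.
no_clusters = [0.89378758, 0.74045802, 0.79310345, 0.82857143,
               0.86567164, 0.83544304, 0.90140845]
with_clusters = [0.88655462, 0.77031802, 0.81481481, 0.86567164,
                 0.84931507, 0.84337349, 0.88571429]

wins_with_clusters = [
    label for label, a, b in zip(labels, no_clusters, with_clusters) if b > a
]
wins_no_clusters = [
    label for label, a, b in zip(labels, no_clusters, with_clusters) if a > b
]
largest_gap = max(abs(a - b) for a, b in zip(no_clusters, with_clusters))

print(wins_with_clusters, wins_no_clusters, round(largest_gap, 4))
```

Clustering wins four of the seven classes (plus the overall score, giving the 5/8 above), but no single class differs by more than about 0.04 in F1.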

### best 4 class model: 4 class with clustering (5/5), one-stage

#### best 4 class overall
- 4 class with clustering, 0.9963911995177818

#### best 4 class "not competitive"
- 4 class with clustering, 0.996

#### best 4 class "low competitive"
- 4 class with clustering, 0.99166667

#### best 4 class "mid competitive"
- 4 class with clustering, 1.

#### best 4 class "high competitive"
- 4 class with clustering, 1.

### 2 class no clusters

Intuitively, it makes some sense why the clustering wouldn't help KNN by much, since KNN is just looking for the data points that are closest, and the clusters only give somewhat redundant information on what is closer, since they are also judging by what is closer. This may be a good justification for not using a clustering model for KNN even if it performs slightly better, since it wouldn't be any nearer a performance "elbow" or "knee", and because it's redundant to a large extent.

Therefore, for the 7 class model, I'm going to choose the model which doesn't use clustering even though the one that uses clustering slightly outperforms it, because the gain in performance is small and doesn't justify the added modeling complexity.

Two-stage modeling, on the other hand, seems to make a big difference for KNN, but we'll see based on the other models and the test set. This is only for 7 classes though, just like for logistic regression; for 4 classes two-stage models are worse.

We will use the one-stage standard 4 class model, because it has near perfect performance. Even though we expect this to have a lot of overfitting, there's no reason to use two-stage 4 class models instead because the 7 class two-stage models outperform the 4 class one and give more nuanced results.

## Explainability Summary

KNN offers far less explanation than the other models, since there are no coefficients, splits, visualizations, or feature importances of any relevance. It simply labels each point by a vote among its nearest neighbors in feature space. Therefore, its only justified use would come from its performance. That might be high, but we have to make sure it's not overfitting. If it really does perform best, then I might keep both KNN and another model as my best models: KNN for the best predictions, and the other model for more insight into how our features lead to the results, which is more useful to people actually working with and building Pokémon.