## Table of Contents

#### Set-Up
- [Splitting the data](#split)
- [Adjusting some features](#adjusting)
- [Clustering](#clustering)
- [Modeling guidelines](#modeling_guidelines)

#### Modeling
- [Logistic Regression](#logistic_regression)
    - [Logistic Regression with clustering](#logistic_regression_cluster)
    - [Two-Stage Logistic Regression](#two_stage_logistic)
    - [Two-Stage Logistic Regression with clustering](#two_stage_logistic_cluster)
    
- [Performance Results](#performance)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import re
import json
import pandas_profiling
import requests
from bs4 import BeautifulSoup
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale, StandardScaler, MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.manifold import TSNE
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.metrics import precision_recall_fscore_support, log_loss, r2_score, mean_squared_error, f1_score, make_scorer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.cluster import KMeans, DBSCAN, MeanShift
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

In [2]:
pokemon_abilities_df = pd.read_csv('./data/pokemon_abilities_df.csv', index_col="name")
pokemon_learnsets_df = pd.read_csv('./data/pokemon_learnsets_df.csv', index_col='name')
pokemon_data = pd.read_csv('./data/pokemon_data.csv', index_col="name")

In [3]:
pokemon_data

name,hp,atk,def,spa,spd,spe,weight,height,formats,generation,...,Ability Cutoff 2,Ability Cutoff 3,Ability Cutoff 4,Ability Cutoff 5,Ability Cutoff 6,Best Ability,Best Ability <100,Unique Powerful Ability,oldformats,oldformat codes
Bulbasaur,45,49,49,65,65,45,6.9,0.7,ZU,RB,...,1.0,0.0,0.0,0.0,0.0,63.636364,63.636364,0,ZU,0
Ivysaur,60,62,63,80,80,60,13.0,1.0,ZU,RB,...,1.0,0.0,0.0,0.0,0.0,63.636364,63.636364,0,ZU,0
Venusaur,80,82,83,100,100,80,100.0,2.0,OU,RB,...,1.0,0.0,0.0,0.0,0.0,63.636364,63.636364,0,UU,4
Charmander,39,52,43,60,50,65,8.5,0.6,ZU,RB,...,1.0,0.0,0.0,0.0,0.0,50.000000,50.000000,0,ZU,0
Charmeleon,58,64,58,80,65,80,19.0,1.1,ZU,RB,...,1.0,0.0,0.0,0.0,0.0,50.000000,50.000000,0,ZU,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,800.0,2.2,NU,SS,...,1.0,1.0,1.0,0.0,0.0,75.000000,75.000000,0,NU,2
Spectrier,100,65,60,145,80,130,44.5,2.0,Uber,SS,...,0.0,0.0,0.0,0.0,0.0,1.000000,0.000000,1,Uber,6
Calyrex,100,80,80,80,80,80,7.7,1.1,PU,SS,...,0.0,0.0,0.0,0.0,0.0,18.181818,18.181818,0,ZU,0
Calyrex-Ice,100,165,150,85,130,50,809.1,2.4,Uber,SS,...,0.0,0.0,0.0,0.0,0.0,1.000000,0.000000,1,Uber,6


<a id="split"></a>
### Splitting the Data

In [4]:
pokemon_data.columns

Index(['hp', 'atk', 'def', 'spa', 'spd', 'spe', 'weight', 'height', 'formats',
       'generation', 'format codes', 'Weaknesses', 'Strong Weaknesses',
       'Resists', 'Strong Resists', 'Immune', 'STAB', 'Resistance Index',
       'Entry Hazards', 'Hazard Removal', 'Removal Deterrent', 'Cleric',
       'Pivot', 'Item Removal', 'Setup', 'Priority', 'HP Drain', 'HP Recovery',
       'Weather Set', 'Weather Gimmick', 'Physical Cutoff 1',
       'Physical Cutoff 2', 'Physical Cutoff 3', 'Physical Cutoff 4',
       'Physical Cutoff 5', 'Physical Cutoff 6', 'Physical Coverage 1',
       'Physical Coverage 2', 'Physical Coverage 3', 'Physical Coverage 4',
       'Special Cutoff 1', 'Special Cutoff 2', 'Special Cutoff 3',
       'Special Cutoff 4', 'Special Cutoff 5', 'Special Cutoff 6',
       'Special Cutoff 7', 'Special Coverage 1', 'Special Coverage 2',
       'Special Coverage 3', 'Special Coverage 4', 'Special Coverage 5',
       'Special Coverage 6', 'Special Coverage 7', 'Special Cove

In [5]:
X = pokemon_data.drop(columns=['weight', 'height', 'Weaknesses', 'Strong Weaknesses', 'Resists',
                                'Strong Resists', 'Immune', 'STAB', 'Physical Cutoff 1', 'Physical Cutoff 2',
                                'Physical Cutoff 4', 'Physical Cutoff 5', 'Physical Cutoff 6',
                                'Physical Coverage 1', 'Physical Coverage 2', 'Physical Coverage 4',
                                'Special Cutoff 1', 'Special Cutoff 2', 'Special Cutoff 4',
                                'Special Cutoff 5', 'Special Cutoff 6', 'Special Cutoff 7',
                                'Special Coverage 1', 'Special Coverage 2', 'Special Coverage 3',
                                'Special Coverage 4', 'Special Coverage 6', 'Special Coverage 7',
                                'Special Coverage 8', 'Special Coverage 9', 'Special Coverage 10',
                                'Ability Cutoff 1', 'Ability Cutoff 2', 'Ability Cutoff 4', 'Ability Cutoff 5',
                                'Ability Cutoff 6', 'Best Ability <100', 'formats', 'generation',
                                'format codes', 'oldformats', 'oldformat codes'])

y_df = pd.DataFrame(pokemon_data[['formats', 'format codes']], index=pokemon_data.index, columns=['formats', 'format codes', 'oldformats', 'oldformat codes'])
y_df['formats4'] = y_df['formats'].replace({'ZU':'Not c', 'PU': 'Low c', 'NU': 'Mid c', 'RU': 'Mid c', 'UU': 'Mid c', 'OU': 'High c', 'Uber': 'High c'})
y_df['format codes4'] = y_df['format codes'].replace({3:2, 4: 2, 5:3, 6:3})
y_df['formats4alt'] = y_df['formats'].replace({'ZU':'Not c', 'PU': 'Low c', 'NU': 'Mid c', 'RU': 'Mid c', 'UU': 'Mid c', 'OU': 'Mid c', 'Uber': 'High c'})
y_df['format codes4alt'] = y_df['format codes'].replace({3:2, 4: 2, 5:2, 6:3})
y_df['formats2'] = y_df['formats'].replace({'ZU':'No', 'PU': 'Yes', 'NU': 'Yes', 'RU': 'Yes', 'UU': 'Yes', 'OU': 'Yes', 'Uber': 'Yes'})
y_df

name,formats,format codes,oldformats,oldformat codes,formats4,format codes4,formats4alt,format codes4alt,formats2
Bulbasaur,ZU,0,,,Not c,0,Not c,0,No
Ivysaur,ZU,0,,,Not c,0,Not c,0,No
Venusaur,OU,5,,,High c,3,Mid c,2,Yes
Charmander,ZU,0,,,Not c,0,Not c,0,No
Charmeleon,ZU,0,,,Not c,0,Not c,0,No
...,...,...,...,...,...,...,...,...,...
Glastrier,NU,2,,,Mid c,2,Mid c,2,Yes
Spectrier,Uber,6,,,High c,3,High c,3,Yes
Calyrex,PU,1,,,Low c,1,Low c,1,Yes
Calyrex-Ice,Uber,6,,,High c,3,High c,3,Yes


<a id="adjusting"></a>
### Adjusting some features

- Remove: Ability Cutoff 3 and Unique Powerful Ability

In [6]:
X.columns

Index(['hp', 'atk', 'def', 'spa', 'spd', 'spe', 'Resistance Index',
       'Entry Hazards', 'Hazard Removal', 'Removal Deterrent', 'Cleric',
       'Pivot', 'Item Removal', 'Setup', 'Priority', 'HP Drain', 'HP Recovery',
       'Weather Set', 'Weather Gimmick', 'Physical Cutoff 3',
       'Physical Coverage 3', 'Special Cutoff 3', 'Special Coverage 5',
       'Misc Status', 'Unique Powerful Move', 'Ability Cutoff 3',
       'Best Ability', 'Unique Powerful Ability'],
      dtype='object')

In [7]:
X.drop(columns=['Ability Cutoff 3', 'Unique Powerful Ability'], inplace=True)
X

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Hazard Removal,Removal Deterrent,...,HP Recovery,Weather Set,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability
Bulbasaur,45,49,49,65,65,45,2,0,0,0,...,1,0,5,4,3,5,2,4,0,63.636364
Ivysaur,60,62,63,80,80,60,2,0,0,0,...,1,0,5,4,3,5,2,4,0,63.636364
Venusaur,80,82,83,100,100,80,2,0,0,0,...,1,0,5,6,4,6,4,4,0,63.636364
Charmander,39,52,43,60,50,65,3,0,0,0,...,0,0,3,10,9,6,2,3,0,50.000000
Charmeleon,58,64,58,80,65,80,3,0,0,0,...,0,0,3,10,9,6,2,3,0,50.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,0,1,12,7,3,2,2,0,75.000000
Spectrier,100,65,60,145,80,130,8,0,0,1,...,0,0,0,4,4,4,3,3,0,1.000000
Calyrex,100,80,80,80,80,80,-2,0,0,0,...,0,0,2,3,3,9,4,3,0,18.181818
Calyrex-Ice,100,165,150,85,130,50,-4,0,0,0,...,0,0,2,15,9,12,5,3,1,1.000000


In [8]:
X.columns

Index(['hp', 'atk', 'def', 'spa', 'spd', 'spe', 'Resistance Index',
       'Entry Hazards', 'Hazard Removal', 'Removal Deterrent', 'Cleric',
       'Pivot', 'Item Removal', 'Setup', 'Priority', 'HP Drain', 'HP Recovery',
       'Weather Set', 'Weather Gimmick', 'Physical Cutoff 3',
       'Physical Coverage 3', 'Special Cutoff 3', 'Special Coverage 5',
       'Misc Status', 'Unique Powerful Move', 'Best Ability'],
      dtype='object')

- Fold Weather Set into Weather Gimmick (weather setters become a new level, 6)

In [9]:
X['Weather Gimmick'].value_counts()

2    289
1    171
0    161
5     70
3     40
4      7
Name: Weather Gimmick, dtype: int64

In [10]:
X['Weather Set'].value_counts()

0    709
1     29
Name: Weather Set, dtype: int64

In [11]:
X.loc[X['Weather Set'] == 1, 'Weather Gimmick'] = 6
X['Weather Gimmick'].value_counts()

2    265
1    167
0    161
5     70
3     39
6     29
4      7
Name: Weather Gimmick, dtype: int64

In [12]:
X.drop(columns=['Weather Set'], inplace=True)
X

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Hazard Removal,Removal Deterrent,...,HP Drain,HP Recovery,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability
Bulbasaur,45,49,49,65,65,45,2,0,0,0,...,2,1,5,4,3,5,2,4,0,63.636364
Ivysaur,60,62,63,80,80,60,2,0,0,0,...,2,1,5,4,3,5,2,4,0,63.636364
Venusaur,80,82,83,100,100,80,2,0,0,0,...,2,1,5,6,4,6,4,4,0,63.636364
Charmander,39,52,43,60,50,65,3,0,0,0,...,0,0,3,10,9,6,2,3,0,50.000000
Charmeleon,58,64,58,80,65,80,3,0,0,0,...,0,0,3,10,9,6,2,3,0,50.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,0,1,12,7,3,2,2,0,75.000000
Spectrier,100,65,60,145,80,130,8,0,0,1,...,0,0,0,4,4,4,3,3,0,1.000000
Calyrex,100,80,80,80,80,80,-2,0,0,0,...,2,0,2,3,3,9,4,3,0,18.181818
Calyrex-Ice,100,165,150,85,130,50,-4,0,0,0,...,2,0,2,15,9,12,5,3,1,1.000000


- Fold HP Drain and HP Recovery together into a single recovery feature

In [13]:
X['HP Recovery'].value_counts()

0    517
1    184
2     37
Name: HP Recovery, dtype: int64

In [14]:
X['HP Drain'].value_counts()

0    482
2    202
1     49
3      4
4      1
Name: HP Drain, dtype: int64

In [15]:
X.loc[X['HP Recovery'] == 1, 'HP Drain'] = 3
X.loc[X['HP Recovery'] == 2, 'HP Drain'] = 4
X['HP Drain'].value_counts()

0    382
3    187
2     91
1     41
4     37
Name: HP Drain, dtype: int64

In [16]:
X.drop(columns=['HP Recovery'], inplace=True)
X

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Hazard Removal,Removal Deterrent,...,Priority,HP Drain,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability
Bulbasaur,45,49,49,65,65,45,2,0,0,0,...,0,3,5,4,3,5,2,4,0,63.636364
Ivysaur,60,62,63,80,80,60,2,0,0,0,...,0,3,5,4,3,5,2,4,0,63.636364
Venusaur,80,82,83,100,100,80,2,0,0,0,...,0,3,5,6,4,6,4,4,0,63.636364
Charmander,39,52,43,60,50,65,3,0,0,0,...,1,0,3,10,9,6,2,3,0,50.000000
Charmeleon,58,64,58,80,65,80,3,0,0,0,...,1,0,3,10,9,6,2,3,0,50.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,0,1,12,7,3,2,2,0,75.000000
Spectrier,100,65,60,145,80,130,8,0,0,1,...,0,0,0,4,4,4,3,3,0,1.000000
Calyrex,100,80,80,80,80,80,-2,0,0,0,...,0,2,2,3,3,9,4,3,0,18.181818
Calyrex-Ice,100,165,150,85,130,50,-4,0,0,0,...,0,2,2,15,9,12,5,3,1,1.000000


In [17]:
X['HP Recovery'] = X['HP Drain']
X

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Hazard Removal,Removal Deterrent,...,HP Drain,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability,HP Recovery
Bulbasaur,45,49,49,65,65,45,2,0,0,0,...,3,5,4,3,5,2,4,0,63.636364,3
Ivysaur,60,62,63,80,80,60,2,0,0,0,...,3,5,4,3,5,2,4,0,63.636364,3
Venusaur,80,82,83,100,100,80,2,0,0,0,...,3,5,6,4,6,4,4,0,63.636364,3
Charmander,39,52,43,60,50,65,3,0,0,0,...,0,3,10,9,6,2,3,0,50.000000,0
Charmeleon,58,64,58,80,65,80,3,0,0,0,...,0,3,10,9,6,2,3,0,50.000000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,1,12,7,3,2,2,0,75.000000,0
Spectrier,100,65,60,145,80,130,8,0,0,1,...,0,0,4,4,4,3,3,0,1.000000,0
Calyrex,100,80,80,80,80,80,-2,0,0,0,...,2,2,3,3,9,4,3,0,18.181818,2
Calyrex-Ice,100,165,150,85,130,50,-4,0,0,0,...,2,2,15,9,12,5,3,1,1.000000,2


In [18]:
X.drop(columns=['HP Drain'], inplace=True)
X

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Hazard Removal,Removal Deterrent,...,Priority,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability,HP Recovery
Bulbasaur,45,49,49,65,65,45,2,0,0,0,...,0,5,4,3,5,2,4,0,63.636364,3
Ivysaur,60,62,63,80,80,60,2,0,0,0,...,0,5,4,3,5,2,4,0,63.636364,3
Venusaur,80,82,83,100,100,80,2,0,0,0,...,0,5,6,4,6,4,4,0,63.636364,3
Charmander,39,52,43,60,50,65,3,0,0,0,...,1,3,10,9,6,2,3,0,50.000000,0
Charmeleon,58,64,58,80,65,80,3,0,0,0,...,1,3,10,9,6,2,3,0,50.000000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,1,12,7,3,2,2,0,75.000000,0
Spectrier,100,65,60,145,80,130,8,0,0,1,...,0,0,4,4,4,3,3,0,1.000000,0
Calyrex,100,80,80,80,80,80,-2,0,0,0,...,0,2,3,3,9,4,3,0,18.181818,2
Calyrex-Ice,100,165,150,85,130,50,-4,0,0,0,...,0,2,15,9,12,5,3,1,1.000000,2


In [19]:
X['HP Recovery'].value_counts()

0    382
3    187
2     91
1     41
4     37
Name: HP Recovery, dtype: int64

- Considering: Removal Deterrent (could arguably just be removed, since it's driven by abilities), Hazard Removal, Cleric, and Entry Hazards (those last three might be folded into Misc Status)

In [20]:
X.drop(columns=['Removal Deterrent'], inplace=True)
X

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Hazard Removal,Cleric,...,Priority,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability,HP Recovery
Bulbasaur,45,49,49,65,65,45,2,0,0,0,...,0,5,4,3,5,2,4,0,63.636364,3
Ivysaur,60,62,63,80,80,60,2,0,0,0,...,0,5,4,3,5,2,4,0,63.636364,3
Venusaur,80,82,83,100,100,80,2,0,0,0,...,0,5,6,4,6,4,4,0,63.636364,3
Charmander,39,52,43,60,50,65,3,0,0,0,...,1,3,10,9,6,2,3,0,50.000000,0
Charmeleon,58,64,58,80,65,80,3,0,0,0,...,1,3,10,9,6,2,3,0,50.000000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,1,12,7,3,2,2,0,75.000000,0
Spectrier,100,65,60,145,80,130,8,0,0,0,...,0,0,4,4,4,3,3,0,1.000000,0
Calyrex,100,80,80,80,80,80,-2,0,0,1,...,0,2,3,3,9,4,3,0,18.181818,2
Calyrex-Ice,100,165,150,85,130,50,-4,0,0,1,...,0,2,15,9,12,5,3,1,1.000000,2


In [21]:
X['Misc Status'].value_counts()

3    335
2    234
1     89
0     41
4     35
5      4
Name: Misc Status, dtype: int64

In [22]:
X['Hazard Removal'].value_counts()

0    558
1    174
2      6
Name: Hazard Removal, dtype: int64

In [23]:
X.loc[X['Hazard Removal'] == 1, 'Misc Status'] = 4
X['Misc Status'].value_counts()

3    249
4    205
2    185
1     63
0     32
5      4
Name: Misc Status, dtype: int64

In [24]:
X.drop(columns=['Hazard Removal'], inplace=True)
X

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Cleric,Pivot,...,Priority,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability,HP Recovery
Bulbasaur,45,49,49,65,65,45,2,0,0,0,...,0,5,4,3,5,2,4,0,63.636364,3
Ivysaur,60,62,63,80,80,60,2,0,0,0,...,0,5,4,3,5,2,4,0,63.636364,3
Venusaur,80,82,83,100,100,80,2,0,0,0,...,0,5,6,4,6,4,4,0,63.636364,3
Charmander,39,52,43,60,50,65,3,0,0,0,...,1,3,10,9,6,2,3,0,50.000000,0
Charmeleon,58,64,58,80,65,80,3,0,0,0,...,1,3,10,9,6,2,3,0,50.000000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,1,12,7,3,2,2,0,75.000000,0
Spectrier,100,65,60,145,80,130,8,0,0,0,...,0,0,4,4,4,3,3,0,1.000000,0
Calyrex,100,80,80,80,80,80,-2,0,1,0,...,0,2,3,3,9,4,3,0,18.181818,2
Calyrex-Ice,100,165,150,85,130,50,-4,0,1,0,...,0,2,15,9,12,5,3,1,1.000000,2


I'll leave the other ones (Cleric and Entry Hazards) alone for now. Updating them would be complicated, and it's probably not even a good idea, since they performed better than Hazard Removal.
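The folding steps in this section all follow the same pattern: rows flagged in one column get a new level in a target column, and the flag column is then dropped. Here is a minimal sketch of that pattern on toy data. The helper name `fold_flag` and the toy values are assumptions for illustration, not part of the notebook.

```python
import pandas as pd


def fold_flag(df, flag_col, target_col, mapping):
    """Overwrite target_col where flag_col equals a key in mapping, then drop flag_col.

    Hypothetical helper for illustration only.
    """
    out = df.copy()  # leave the caller's frame untouched
    for flag_value, new_level in mapping.items():
        out.loc[out[flag_col] == flag_value, target_col] = new_level
    return out.drop(columns=[flag_col])


# Toy stand-in for the real feature table (values are made up).
toy = pd.DataFrame(
    {"Weather Set": [0, 1, 0, 1], "Weather Gimmick": [2, 5, 0, 1]},
    index=["a", "b", "c", "d"],
)

folded = fold_flag(toy, "Weather Set", "Weather Gimmick", {1: 6})
print(folded["Weather Gimmick"].tolist())  # [2, 6, 0, 6]
```

Note that the fold overwrites whatever level the target column held before, so rows that already carried information in the target column lose it.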

<a id="clustering"></a>
### Clustering

In [25]:
cluster_dfs = {}

n_clusters = list(range(5, 35, 5))
n_clusters

[5, 10, 15, 20, 25, 30]

These are the cluster counts we'll test in each model that uses clusters (only half of the models do). We'll want to remember to convert the cluster labels to categories before modeling.
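Cluster labels are arbitrary integers, so a linear model should not read them as ordered numbers. One way to convert them to categories is one-hot encoding, sketched below on made-up labels. The frame `toy_clusters` and the `_cluster_` separator are assumptions for illustration.

```python
import pandas as pd

# Made-up cluster labels for four rows; the real labels come from KMeans.
toy_clusters = pd.DataFrame(
    {"features": [1, 3, 3, 0], "stats": [0, 0, 2, 1]},
    index=["a", "b", "c", "d"],
)

# Mark the labels as categorical, then expand each column into indicator columns.
as_categories = toy_clusters.astype("category")
cluster_dummies = pd.get_dummies(as_categories, prefix_sep="_cluster_")

print(cluster_dummies.columns.tolist())
```

Each original column becomes one indicator column per cluster label that actually occurs in it.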

We are going to do clustering of 4 different subsets of features, as we did during EDA:
- one for overall features (scaled)
- one for stats (scaled)
- one for abilities (not scaled, because abilities are one-hot encoded)
- one for learnsets (not scaled, because learnsets are one-hot encoded)

Then we'll make 6 dataframes, one for each number of clusters. Each holds the cluster labels for all 4 of those subsets, and they all go in the cluster_dfs dictionary.
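The six dataframes share the same recipe and differ only in the number of clusters, so they can also be built in a loop. The sketch below does this on synthetic stand-in data: `toy_X`, `toy_abilities`, `toy_learnsets`, and `toy_cluster_dfs` are assumed names, and `n_init=10` is set explicitly only to keep behavior stable across scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(273)
names = [f"mon_{i}" for i in range(60)]

# Synthetic stand-ins for X (first six columns play the role of the stats)
# and for the one-hot abilities and learnsets tables.
toy_X = pd.DataFrame(rng.normal(size=(60, 8)), index=names)
toy_abilities = pd.DataFrame(rng.integers(0, 2, size=(60, 12)), index=names)
toy_learnsets = pd.DataFrame(rng.integers(0, 2, size=(60, 20)), index=names)

# Scale the numeric subsets once; leave the one-hot tables unscaled.
inputs = {
    "features": StandardScaler().fit_transform(toy_X),
    "stats": StandardScaler().fit_transform(toy_X.iloc[:, :6]),
    "abilities": toy_abilities.to_numpy(),
    "learnsets": toy_learnsets.to_numpy(),
}

toy_cluster_dfs = {}
for k in range(5, 35, 5):
    # One KMeans fit per subset, all with the same k and seed.
    labels = {
        name: KMeans(n_clusters=k, random_state=273, n_init=10).fit_predict(data)
        for name, data in inputs.items()
    }
    toy_cluster_dfs[k] = pd.DataFrame(labels, index=names)

print(sorted(toy_cluster_dfs), toy_cluster_dfs[5].shape)
```

The scaling is done once outside the loop because it does not depend on the number of clusters.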

In [26]:
cluster5 = pd.DataFrame(index=X.index, columns=['features', 'stats', 'abilities', 'learnsets'])

X_scaled = StandardScaler().fit_transform(X)
stats_scaled = StandardScaler().fit_transform(X.loc[:, ['hp', 'atk', 'def', 'spa', 'spd', 'spe']])

kmeans = KMeans(n_clusters=5, random_state=273)
cluster5['features'] = kmeans.fit_predict(X_scaled)

kmeans = KMeans(n_clusters=5, random_state=273)
cluster5['stats'] = kmeans.fit_predict(stats_scaled)

kmeans = KMeans(n_clusters=5, random_state=273)
cluster5['abilities'] = kmeans.fit_predict(pokemon_abilities_df)

kmeans = KMeans(n_clusters=5, random_state=273)
cluster5['learnsets'] = kmeans.fit_predict(pokemon_learnsets_df)

cluster_dfs[5] = cluster5

In [27]:
cluster10 = pd.DataFrame(index=X.index, columns=['features', 'stats', 'abilities', 'learnsets'])

X_scaled = StandardScaler().fit_transform(X)
stats_scaled = StandardScaler().fit_transform(X.loc[:, ['hp', 'atk', 'def', 'spa', 'spd', 'spe']])

kmeans = KMeans(n_clusters=10, random_state=273)
cluster10['features'] = kmeans.fit_predict(X_scaled)

kmeans = KMeans(n_clusters=10, random_state=273)
cluster10['stats'] = kmeans.fit_predict(stats_scaled)

kmeans = KMeans(n_clusters=10, random_state=273)
cluster10['abilities'] = kmeans.fit_predict(pokemon_abilities_df)

kmeans = KMeans(n_clusters=10, random_state=273)
cluster10['learnsets'] = kmeans.fit_predict(pokemon_learnsets_df)

cluster_dfs[10] = cluster10

In [28]:
cluster15 = pd.DataFrame(index=X.index, columns=['features', 'stats', 'abilities', 'learnsets'])

X_scaled = StandardScaler().fit_transform(X)
stats_scaled = StandardScaler().fit_transform(X.loc[:, ['hp', 'atk', 'def', 'spa', 'spd', 'spe']])

kmeans = KMeans(n_clusters=15, random_state=273)
cluster15['features'] = kmeans.fit_predict(X_scaled)

kmeans = KMeans(n_clusters=15, random_state=273)
cluster15['stats'] = kmeans.fit_predict(stats_scaled)

kmeans = KMeans(n_clusters=15, random_state=273)
cluster15['abilities'] = kmeans.fit_predict(pokemon_abilities_df)

kmeans = KMeans(n_clusters=15, random_state=273)
cluster15['learnsets'] = kmeans.fit_predict(pokemon_learnsets_df)

cluster_dfs[15] = cluster15

In [29]:
cluster20 = pd.DataFrame(index=X.index, columns=['features', 'stats', 'abilities', 'learnsets'])

X_scaled = StandardScaler().fit_transform(X)
stats_scaled = StandardScaler().fit_transform(X.loc[:, ['hp', 'atk', 'def', 'spa', 'spd', 'spe']])

kmeans = KMeans(n_clusters=20, random_state=273)
cluster20['features'] = kmeans.fit_predict(X_scaled)

kmeans = KMeans(n_clusters=20, random_state=273)
cluster20['stats'] = kmeans.fit_predict(stats_scaled)

kmeans = KMeans(n_clusters=20, random_state=273)
cluster20['abilities'] = kmeans.fit_predict(pokemon_abilities_df)

kmeans = KMeans(n_clusters=20, random_state=273)
cluster20['learnsets'] = kmeans.fit_predict(pokemon_learnsets_df)

cluster_dfs[20] = cluster20

In [30]:
cluster25 = pd.DataFrame(index=X.index, columns=['features', 'stats', 'abilities', 'learnsets'])

X_scaled = StandardScaler().fit_transform(X)
stats_scaled = StandardScaler().fit_transform(X.loc[:, ['hp', 'atk', 'def', 'spa', 'spd', 'spe']])

kmeans = KMeans(n_clusters=25, random_state=273)
cluster25['features'] = kmeans.fit_predict(X_scaled)

kmeans = KMeans(n_clusters=25, random_state=273)
cluster25['stats'] = kmeans.fit_predict(stats_scaled)

kmeans = KMeans(n_clusters=25, random_state=273)
cluster25['abilities'] = kmeans.fit_predict(pokemon_abilities_df)

kmeans = KMeans(n_clusters=25, random_state=273)
cluster25['learnsets'] = kmeans.fit_predict(pokemon_learnsets_df)

cluster_dfs[25] = cluster25

In [31]:
cluster30 = pd.DataFrame(index=X.index, columns=['features', 'stats', 'abilities', 'learnsets'])

X_scaled = StandardScaler().fit_transform(X)
stats_scaled = StandardScaler().fit_transform(X.loc[:, ['hp', 'atk', 'def', 'spa', 'spd', 'spe']])

kmeans = KMeans(n_clusters=30, random_state=273)
cluster30['features'] = kmeans.fit_predict(X_scaled)

kmeans = KMeans(n_clusters=30, random_state=273)
cluster30['stats'] = kmeans.fit_predict(stats_scaled)

kmeans = KMeans(n_clusters=30, random_state=273)
cluster30['abilities'] = kmeans.fit_predict(pokemon_abilities_df)

kmeans = KMeans(n_clusters=30, random_state=273)
cluster30['learnsets'] = kmeans.fit_predict(pokemon_learnsets_df)

cluster_dfs[30] = cluster30

In [32]:
cluster_dfs

{5:                 features  stats  abilities  learnsets
 name                                                 
 Bulbasaur              1      0          1          1
 Ivysaur                3      0          1          1
 Venusaur               3      3          1          1
 Charmander             1      0          1          4
 Charmeleon             2      1          1          4
 ...                  ...    ...        ...        ...
 Glastrier              2      2          1          1
 Spectrier              0      3          1          1
 Calyrex                3      4          1          3
 Calyrex-Ice            4      2          1          3
 Calyrex-Shadow         4      3          1          3
 
 [738 rows x 4 columns],
 10:                 features  stats  abilities  learnsets
 name                                                 
 Bulbasaur              9      3          7          6
 Ivysaur                9      9          7          6
 Venusaur               9      

<a id="modeling_guidelines"></a>
### Modeling guidelines

How many models am I making:

One-stage: (3 + 1) x 2 = 8 models. The targets are 7 class, 4 class, 4 class modified, and 2 class, each without clusters and then again with clusters.

Two-stage: (2 + 1) x 2 = 6 models. The targets are 7 class, 4 class, and 4 class modified, each without clusters and then again with clusters.

That gives 14 total models for each modeling type.
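As a quick check of that count, the configurations can be enumerated directly. The list names below are assumptions for illustration.

```python
from itertools import product

# Target definitions available to each approach, as listed above.
one_stage_targets = ["7 class", "4 class", "4 class modified", "2 class"]
two_stage_targets = ["7 class", "4 class", "4 class modified"]
cluster_options = ["no clusters", "with clusters"]

one_stage = list(product(["one-stage"], one_stage_targets, cluster_options))
two_stage = list(product(["two-stage"], two_stage_targets, cluster_options))
all_models = one_stage + two_stage

print(len(one_stage), len(two_stage), len(all_models))  # 8 6 14
```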

Modeling types: Logistic Regression, KNN, Decision Tree, Random Forest, CatBoost

Extra considerations:

- For Logistic Regression and KNN we will need to scale our features.

- We might not even bother with clustering using something like logistic regression, though we can look up whether it might be worthwhile

- The metric will be weighted F1 score. There is no well-developed ROC curve for multi-class problems, and log loss is not good for unbalanced classes. Weighted F1 should be especially appropriate for unbalanced classes and for problems where we don't care more about precision than recall (there is no greater cost to a false positive than to a false negative for our problem).
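To make the metric concrete: weighted F1 is the mean of the per-class F1 scores, with each class weighted by its share of the true labels. The sketch below computes it by hand on made-up labels. `weighted_f1` is a hypothetical helper, and it should correspond to what `f1_score(..., average='weighted')` reports.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative helper)."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Weight each class by its share of the true labels.
        score += (n_true / total) * f1
    return score


# Made-up labels with an unbalanced class distribution.
y_true = ["ZU", "ZU", "ZU", "ZU", "PU", "PU", "OU", "OU"]
y_pred = ["ZU", "ZU", "ZU", "PU", "PU", "ZU", "OU", "PU"]

print(round(weighted_f1(y_true, y_pred), 4))  # 0.6417
```

Here the per-class F1 scores are 0.75 (ZU), 0.4 (PU), and about 0.667 (OU). The majority class carries half the weight, so the weighted score (about 0.642) sits above the unweighted mean (about 0.606).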

In [33]:
k_list = [2, 3, 5, 10]

cluster_types = list(cluster_dfs[5].columns)

<a id="logistic_regression"></a>
### Logistic Regression

In [34]:
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=273))

pipe.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'standardscaler', 'logisticregression', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'logisticregression__C', 'logisticregression__class_weight', 'logisticregression__dual', 'logisticregression__fit_intercept', 'logisticregression__intercept_scaling', 'logisticregression__l1_ratio', 'logisticregression__max_iter', 'logisticregression__multi_class', 'logisticregression__n_jobs', 'logisticregression__penalty', 'logisticregression__random_state', 'logisticregression__solver', 'logisticregression__tol', 'logisticregression__verbose', 'logisticregression__warm_start'])

In [35]:
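# Note: 'l1' is not supported by LogisticRegression's default 'lbfgs' solver,
# so with the pipeline above those candidates fail to fit and score NaN.
param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
             'logisticregression__penalty': ['l1', 'l2']}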
param_grid

{'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
 'logisticregression__penalty': ['l1', 'l2']}
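If the L1 candidates are meant to take part in the search, the estimator needs a solver that supports both penalties. The sketch below uses `solver='saga'` on synthetic data; the solver choice, `max_iter=5000`, the smaller grid, and all `toy_` names are assumptions, and results with a different solver would not match the numbers reported in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the real features and target.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=273
)

# 'saga' handles both L1 and L2 penalties; it may need more iterations to converge.
toy_pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", max_iter=5000, random_state=273),
)
toy_grid = {
    "logisticregression__C": [0.01, 1, 100],
    "logisticregression__penalty": ["l1", "l2"],
}

search = GridSearchCV(toy_pipe, toy_grid, scoring="f1_weighted", cv=3)
search.fit(X_toy, y_toy)

# Count candidates whose fits failed (they would show up as NaN scores).
n_failed = int(np.isnan(search.cv_results_["mean_test_score"]).sum())
print(search.best_params_, n_failed)
```

Checking `cv_results_` for NaN scores is a cheap way to confirm that every candidate in a grid was actually fitted.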

#### 7 classes, no clusters

In [36]:
best = []

# The split does not depend on k, so do it once outside the loop.
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats'].values)

# Try each number of CV folds and record the best mean CV score.
for k in k_list:
    pipe = make_pipeline(
        StandardScaler(),
        LogisticRegression(random_state=273))
    log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
    log_reg_grid.fit(X_train, y_train)
    best.append(log_reg_grid.best_score_)

max(best), k_list[best.index(max(best))]

(0.5891107913370068, 2)

In [37]:
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=273))
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats'].values)
log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=2, verbose=0)
log_reg_grid.fit(X_train, y_train)
log_reg_grid.best_params_, log_reg_grid.best_score_

({'logisticregression__C': 1, 'logisticregression__penalty': 'l2'},
 0.5891107913370068)

In [38]:
f1_score(y_train, log_reg_grid.predict(X_train), labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'], average='weighted')

0.6462772739719035

In [39]:
precision_recall_fscore_support(y_train, log_reg_grid.predict(X_train), labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'])

(array([0.86641221, 0.50314465, 0.21428571, 0.46666667, 0.36363636,
        0.41025641, 0.64285714]),
 array([0.90438247, 0.67226891, 0.09677419, 0.2       , 0.23529412,
        0.37209302, 0.675     ]),
 array([0.88499025, 0.57553957, 0.13333333, 0.28      , 0.28571429,
        0.3902439 , 0.65853659]),
 array([251, 119,  31,  35,  34,  43,  40], dtype=int64))
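The tuple above holds precision, recall, F1, and support, in that order, with one entry per label in the order passed to `labels`. A labelled table is easier to read; here is a minimal sketch on made-up predictions, where `report` and the toy labels are assumptions.

```python
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# Made-up labels; in the notebook these would be y_train and the model's predictions.
labels = ["ZU", "PU", "OU"]
y_true = ["ZU", "ZU", "ZU", "ZU", "PU", "PU", "OU", "OU"]
y_pred = ["ZU", "ZU", "ZU", "PU", "PU", "ZU", "OU", "PU"]

precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)

# One row per class, one column per metric.
report = pd.DataFrame(
    {"precision": precision, "recall": recall, "f1": f1, "support": support},
    index=labels,
)
print(report)
```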

These scores (computed on the training set) are better than what we got from out-of-the-box modeling during EDA, which is quite promising.

In [40]:
# Use the public named_steps accessor rather than the private _final_estimator attribute.
final_lr = log_reg_grid.best_estimator_.named_steps['logisticregression']
log_reg_7 = pd.DataFrame(final_lr.coef_, index=final_lr.classes_, columns=X_train.columns)
log_reg_7

Unnamed: 0,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Cleric,Pivot,...,Priority,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability,HP Recovery
NU,-0.111133,0.205987,0.064693,0.518538,0.116299,-0.187792,-0.280516,-0.00649,-0.049495,0.025564,...,0.026757,-0.217606,-0.130588,-0.024758,-0.224802,0.164442,-0.527176,-0.025308,0.038963,-0.149843
OU,0.283591,0.45511,0.571591,0.512399,0.196176,0.189714,0.464009,-0.331018,-0.04647,0.18168,...,0.065143,0.266433,0.084942,0.051977,0.564928,-0.495174,-0.126628,-0.103267,0.245477,0.322557
PU,-0.680436,-0.762866,-0.29146,-0.803732,-0.249462,-0.312833,-0.5644,-0.188765,0.208137,0.010171,...,0.1888,-0.071469,-0.035046,-0.102691,-0.55089,0.520494,0.350721,0.016018,-0.031374,-0.372265
RU,0.143561,0.438847,0.143958,0.005752,-0.135036,-0.061946,0.296418,0.156393,-0.024617,-0.086199,...,-0.038158,-0.085178,0.111255,-0.255209,-0.080273,0.073537,0.037847,-0.068155,-0.040967,0.117361
UU,0.288437,0.152608,0.128232,0.45916,0.482262,0.577118,0.488172,0.025385,0.329287,-0.022632,...,-0.15429,0.013193,0.111884,0.104749,0.224275,-0.212932,-0.394472,-0.315796,0.117181,0.043705
Uber,1.18147,0.900738,0.724624,0.648128,0.695291,1.23802,0.625875,0.549582,-0.043738,-0.119046,...,-0.250934,0.152662,0.00763,0.498544,0.482831,-0.687824,0.152247,0.202135,-0.159858,0.574109
ZU,-1.10549,-1.390424,-1.341638,-1.340245,-1.10553,-1.442281,-1.029558,-0.205087,-0.373104,0.010461,...,0.162681,-0.058035,-0.150077,-0.272613,-0.416068,0.637458,0.507461,0.294374,-0.169423,-0.535624


It's not easy to explain a model with this many coefficients. Logistic regression will therefore only be justifiable if it performs better than other methods, since it's not going to do significantly better in terms of explainability (and might even do significantly worse).
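One way to make a wide coefficient table easier to read is to keep only the few largest coefficients (by absolute value) per class. This is a minimal sketch rather than part of the original analysis: `top_coefficients` is a hypothetical helper, and `demo` holds a handful of values copied from the table above.

```python
import pandas as pd

def top_coefficients(coef_df, k=3):
    """For each class (row), list the k features with the largest |coefficient|."""
    rows = {}
    for label, row in coef_df.iterrows():
        order = row.abs().sort_values(ascending=False).index[:k]
        rows[label] = [f"{feat} ({row[feat]:+.2f})" for feat in order]
    columns = [f"top {i + 1}" for i in range(k)]
    return pd.DataFrame.from_dict(rows, orient="index", columns=columns)

# A small excerpt of the coefficient table (Uber and ZU rows, three features).
demo = pd.DataFrame(
    {"hp": [1.18147, -1.10549],
     "spe": [1.23802, -1.442281],
     "Cleric": [-0.043738, -0.373104]},
    index=["Uber", "ZU"],
)

summary = top_coefficients(demo, k=2)
print(summary)
```

Applied to the full table, the same helper could be called as `top_coefficients(log_reg_7, k=5)`.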

#### 4 class no clusters

In [41]:
best = []

for k in k_list:
    pipe = make_pipeline(
        StandardScaler(),
        LogisticRegression(random_state=273))
    X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats4'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats4'].values)
    log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
    log_reg_grid.fit(X_train, y_train)
    best.append(log_reg_grid.best_score_)

max(best), k_list[best.index(max(best))]

(0.6695938696017982, 10)

In [42]:
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=273))
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats4'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats4'].values)
log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=10, verbose=0)
log_reg_grid.fit(X_train, y_train)
log_reg_grid.best_params_, log_reg_grid.best_score_

({'logisticregression__C': 0.1, 'logisticregression__penalty': 'l2'},
 0.6695938696017982)

In [43]:
f1_score(y_train, log_reg_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.706612591329458

In [44]:
precision_recall_fscore_support(y_train, log_reg_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.85338346, 0.52542373, 0.58241758, 0.65384615]),
 array([0.90438247, 0.5210084 , 0.53      , 0.61445783]),
 array([0.87814313, 0.52320675, 0.55497382, 0.63354037]),
 array([251, 119, 100,  83], dtype=int64))

In this case, the score is only very slightly better than the out-of-the-box modeling.

In [45]:
log_reg_4 = pd.DataFrame(log_reg_grid.best_estimator_._final_estimator.coef_, index=log_reg_grid.best_estimator_._final_estimator.classes_, columns=X_train.columns)
log_reg_4

Unnamed: 0,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Cleric,Pivot,...,Priority,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability,HP Recovery
High c,0.410015,0.607423,0.528918,0.556025,0.395796,0.607469,0.379749,-0.052432,-0.098665,0.016725,...,0.084073,0.209763,0.136008,0.096187,0.117334,-0.159005,-0.013065,0.095112,0.080691,0.302932
Low c,-0.23904,-0.138494,0.016593,-0.178468,0.051514,-0.04112,-0.195163,-0.012537,0.189829,0.01606,...,-0.001571,-0.065799,-0.067428,0.10558,-0.089416,0.0283,0.094991,-0.041119,-0.109832,-0.067364
Mid c,0.29879,0.266414,0.175217,0.206825,0.249262,0.29028,0.32025,0.126393,0.129996,-0.003659,...,-0.096958,-0.085097,0.046739,-0.047373,-0.031776,-0.019252,-0.196093,-0.215245,0.132179,0.037675
Not c,-0.469766,-0.735343,-0.720728,-0.584382,-0.696571,-0.856629,-0.504836,-0.061425,-0.22116,-0.029126,...,0.014456,-0.058867,-0.115319,-0.154395,0.003859,0.149957,0.114167,0.161252,-0.103038,-0.273244


#### 4 class alt no clusters

In [46]:
best = []

for k in k_list:
    pipe = make_pipeline(
        StandardScaler(),
        LogisticRegression(random_state=273))
    X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats4alt'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats4alt'].values)
    log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
    log_reg_grid.fit(X_train, y_train)
    best.append(log_reg_grid.best_score_)

max(best), k_list[best.index(max(best))]

(0.6884465426350673, 10)

In [47]:
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=273))
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats4alt'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats4alt'].values)
log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=10, verbose=0)
log_reg_grid.fit(X_train, y_train)
log_reg_grid.best_params_, log_reg_grid.best_score_

({'logisticregression__C': 1000, 'logisticregression__penalty': 'l2'},
 0.6884465426350673)

Strangely, there is very little regularization for this model (`C = 1000`, and in scikit-learn a larger `C` means weaker regularization). This may just mean that the regularization choice is somewhat random and is not having much impact.
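One way to check whether the choice of `C` actually matters is to look at the mean cross-validated score for every candidate value, not only the winner. The sketch below runs on synthetic data; `demo_grid`, `X_demo` and `y_demo` are assumptions for illustration, since the notebook's own `param_grid` is defined elsewhere.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5,
                                     n_classes=3, random_state=273)

# Assumed grid of regularization strengths (smaller C = stronger regularization).
demo_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100, 1000]}

pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(max_iter=1000, random_state=273))
search = GridSearchCV(pipe, demo_grid, scoring="f1_weighted", cv=5)
search.fit(X_demo, y_demo)

# One row per candidate C with its mean and std of the CV score.
c_summary = pd.DataFrame({
    "C": [p["logisticregression__C"] for p in search.cv_results_["params"]],
    "mean_f1": search.cv_results_["mean_test_score"],
    "std_f1": search.cv_results_["std_test_score"],
})

# If the spread of means is small relative to std_f1, the "best" C is mostly noise.
score_spread = c_summary["mean_f1"].max() - c_summary["mean_f1"].min()
print(c_summary)
print(score_spread)
```

If `score_spread` is smaller than the typical `std_f1`, the selected `C` is probably not meaningfully better than its neighbours.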

In [48]:
f1_score(y_train, log_reg_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.7378931899018644

In [49]:
precision_recall_fscore_support(y_train, log_reg_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.87795276, 0.53703704, 0.63580247, 0.86206897]),
 array([0.88844622, 0.48739496, 0.72027972, 0.625     ]),
 array([0.88316832, 0.51101322, 0.67540984, 0.72463768]),
 array([251, 119, 143,  40], dtype=int64))

The model has basically the same performance on ZU and PU, but it performs notably better on mid-competitive pokemon (NU, RU, UU, OU) and high-competitive pokemon (Uber), with an f-score about .1 higher in both cases. So the alternative 4 classes definitely seem to be a better option.

In [50]:
log_reg_4alt = pd.DataFrame(log_reg_grid.best_estimator_._final_estimator.coef_, index=log_reg_grid.best_estimator_._final_estimator.classes_, columns=X_train.columns)
log_reg_4alt

Unnamed: 0,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Cleric,Pivot,...,Priority,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability,HP Recovery
High c,1.439259,1.327013,0.790724,0.642143,1.270383,1.659536,0.937151,0.494351,-0.280272,-0.397176,...,-0.307978,0.074816,0.588032,-0.185251,0.547001,-0.976177,0.003204,0.195564,-0.057306,0.780857
Low c,-0.633873,-0.467124,-0.085216,-0.242228,-0.320494,-0.44094,-0.403587,-0.114764,0.39442,0.024205,...,0.126775,-0.075804,-0.181974,0.127519,-0.394233,0.450832,0.154418,-0.134706,-0.034506,-0.298235
Mid c,0.311595,0.350955,0.441698,0.603235,0.338031,0.357546,0.481562,-0.161636,0.139855,0.139627,...,0.094399,-0.030271,0.040025,0.078803,0.231062,-0.313114,-0.47442,-0.288955,0.187085,0.060956
Not c,-1.116981,-1.210843,-1.147206,-1.00315,-1.28792,-1.576143,-1.015126,-0.217951,-0.254003,0.233344,...,0.086804,0.031259,-0.446082,-0.021072,-0.383829,0.838458,0.316798,0.228097,-0.095273,-0.543579


#### 2 class no clusters

In [51]:
best = []

for k in k_list:
    pipe = make_pipeline(
        StandardScaler(),
        LogisticRegression(random_state=273))
    X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
    log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
    log_reg_grid.fit(X_train, y_train)
    best.append(log_reg_grid.best_score_)

max(best), k_list[best.index(max(best))]

(0.9014975740163707, 5)

Interesting that 5-fold CV works better here, even though 10-fold worked better for 4 classes (maybe 10-fold did so well by lowering variance in those cases).

In [52]:
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=273))
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=5, verbose=0)
log_reg_grid.fit(X_train, y_train)
log_reg_grid.best_params_, log_reg_grid.best_score_

({'logisticregression__C': 0.01, 'logisticregression__penalty': 'l2'},
 0.9014975740163707)

This model also has much stronger regularization (`C = 0.01`), so we've had a lot of variance in the `C` parameter.

In [53]:
f1_score(y_train, log_reg_grid.predict(X_train), labels=['No', 'Yes'], average='weighted')

0.9088498774843347

In [54]:
precision_recall_fscore_support(y_train, log_reg_grid.predict(X_train), labels=['No', 'Yes'])

(array([0.95475113, 0.87951807]),
 array([0.84063745, 0.96688742]),
 array([0.8940678 , 0.92113565]),
 array([251, 302], dtype=int64))

The f-score on the training data (about 0.909) is very comparable to the cross-validated f-score (about 0.901), within a margin of about 1%.
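The comparison between training and cross-validated scores can be made explicit for the three models without clusters fitted so far. The numbers below are copied from the cell outputs above; the `scores` dictionary itself is just an illustrative way to organize them.

```python
# Weighted f-scores copied from the outputs above:
# "cv" is GridSearchCV's best_score_, "train" is f1_score on the training split.
scores = {
    "4 class": {"cv": 0.6695938696017982, "train": 0.706612591329458},
    "4 class alt": {"cv": 0.6884465426350673, "train": 0.7378931899018644},
    "2 class": {"cv": 0.9014975740163707, "train": 0.9088498774843347},
}

# A larger train-minus-CV gap suggests more overfitting to the training split.
gaps = {name: s["train"] - s["cv"] for name, s in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.3f}")
```

The 2 class model has by far the smallest gap, which is consistent with its stronger regularization.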

In [55]:
log_reg_2 = pd.DataFrame(log_reg_grid.best_estimator_._final_estimator.coef_, columns=X_train.columns)
log_reg_2.T.sort_values(by=0, ascending=False)

Unnamed: 0,0
def,0.392491
spd,0.384042
atk,0.345165
spe,0.342129
spa,0.329095
hp,0.259617
Resistance Index,0.17506
Physical Cutoff 3,0.132301
Cleric,0.113465
Item Removal,0.105858


It's a good sign that all of the coefficients are positive, since some were negative during training, which didn't make much sense. Regularization might have helped there.

Overall, this model will almost surely be too simple to be our best one, but the number and order of the coefficients do well in terms of explainability.

<a id="logistic_regression_cluster"></a>
#### Logistic Regression with clustering

#### 7 class with clustering

Since one-hot encoded columns shouldn't really be scaled, we can make the rest of our data compatible with them via min-max scaling between 0 and 1.
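The idea is that min-max scaled features and 0/1 dummy columns end up on the same range. A minimal sketch on a toy frame: the stat values are illustrative and the `clusters` labels are hypothetical, standing in for one column of `cluster_dfs`.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy feature frame indexed by pokemon name (illustrative values only).
features = pd.DataFrame(
    {"hp": [80, 78, 79, 35], "spe": [80, 100, 78, 90]},
    index=pd.Index(["Venusaur", "Charizard", "Blastoise", "Pikachu"], name="name"),
)

# Hypothetical cluster assignments for the same pokemon.
clusters = pd.Series([0, 1, 0, 2], index=features.index, name="cluster")

# Scale every numeric feature into [0, 1], keeping index and column names.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(features),
                      index=features.index, columns=features.columns)

# One-hot encode the cluster labels; these columns are already 0/1.
dummies = pd.get_dummies(clusters, prefix="cluster", dtype=int)

# Align on the shared 'name' index.
combined = scaled.join(dummies)
print(combined)
```

Using a `prefix` for the dummy columns also avoids bare integer column names in the resulting coefficient tables.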

In [56]:
best = [0, 0, 0, 0]

for n in n_clusters:
    for c_type in cluster_types:
        for k in k_list:
            pipe = make_pipeline(
                LogisticRegression(random_state=273, solver='newton-cg'))
            X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
            X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[n][c_type]), on='name')
            X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats'],
                                                                test_size=0.25,
                                                                random_state=273,
                                                                stratify=y_df['formats'].values)
            log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
            log_reg_grid.fit(X_train, y_train)
            if log_reg_grid.best_score_ > best[0]:
                best = [log_reg_grid.best_score_, k, c_type, n]
                
best

[0.5991175293786364, 2, 'features', 10]

For some reason, newton-cg is the only solver that converges for this task! I tried all of the other ones and they did not converge, some even after increasing the number of iterations.
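A solver comparison like this can be recorded programmatically by catching scikit-learn's `ConvergenceWarning`. The sketch below uses synthetic data, so it does not reproduce the behaviour described above; `converged`, `X_demo` and `y_demo` are hypothetical names for illustration.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: features already in [0, 1], as after min-max scaling.
rng = np.random.default_rng(273)
X_demo = rng.uniform(size=(200, 6))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1).astype(int)

def converged(solver, max_iter=100):
    """Fit once and report whether no ConvergenceWarning was raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        LogisticRegression(solver=solver, max_iter=max_iter,
                           random_state=273).fit(X_demo, y_demo)
    return not any(issubclass(w.category, ConvergenceWarning) for w in caught)

solver_status = {s: converged(s) for s in ["lbfgs", "newton-cg", "sag", "saga"]}
print(solver_status)
```

Raising `max_iter` for the solvers that report `False` is the usual first thing to try before switching solvers.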

Our best result has a score slightly better than not using clusters for 7 classes (but not by much, so it's probably not significant). 2-fold cross validation was best for the 7 class model (which makes sense since we have very small class sizes), and the best clustering setting was to cluster on all of our features in X with 10 clusters.
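The search over cluster count, cluster type and number of folds can also be written as a single loop with `itertools.product`, returning a labelled result instead of a positional list. This is only a sketch: `best_configuration` is a hypothetical helper, and `fake_evaluate` is a made-up scoring function standing in for the grid search fit, not real results.

```python
from itertools import product

def best_configuration(n_clusters, cluster_types, k_list, evaluate):
    """Try every (n, cluster type, k) combination and keep the best score."""
    best = None
    for n, c_type, k in product(n_clusters, cluster_types, k_list):
        score = evaluate(n, c_type, k)
        if best is None or score > best["score"]:
            best = {"score": score, "k": k, "cluster_type": c_type, "n_clusters": n}
    return best

# Made-up scoring function, for illustration only. In the notebook this would
# build X_final, run GridSearchCV with cv=k and return best_score_.
def fake_evaluate(n, c_type, k):
    bonus = 0.01 if c_type == "features" else 0.0
    return 0.5 + 0.001 * n + bonus - 0.001 * k

result = best_configuration([5, 10], ["features", "stats"], [2, 5], fake_evaluate)
print(result)
```

A dictionary result makes it harder to mix up which position holds `k` and which holds the number of clusters.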

In [57]:
pipe = make_pipeline(
    LogisticRegression(random_state=273, solver='newton-cg'))
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[10]['features']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats'].values)
log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=2, verbose=0)
log_reg_grid.fit(X_train, y_train)
log_reg_grid.best_params_, log_reg_grid.best_score_

({'logisticregression__C': 1000, 'logisticregression__penalty': 'l2'},
 0.5991175293786364)

It's interesting that again we have very weak regularization. This has been a quite inconsistent parameter.

In [58]:
f1_score(y_train, log_reg_grid.predict(X_train), labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'], average='weighted')

0.6981130120478787

In [59]:
precision_recall_fscore_support(y_train, log_reg_grid.predict(X_train), labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'])

(array([0.890625  , 0.60283688, 0.21428571, 0.5       , 0.55555556,
        0.42222222, 0.67391304]),
 array([0.90836653, 0.71428571, 0.09677419, 0.34285714, 0.44117647,
        0.44186047, 0.775     ]),
 array([0.89940828, 0.65384615, 0.13333333, 0.40677966, 0.49180328,
        0.43181818, 0.72093023]),
 array([251, 119,  31,  35,  34,  43,  40], dtype=int64))

This performed a lot better on some classes than not using clusters (by as much as .2 in some cases), and about the same on others. It did worse in none of the classes, but it still has a very hard time with the NU tier. The across-the-board improvements are quite a good sign for the usefulness of clustering, even if this model almost surely won't be good enough.

In [60]:
log_reg_7 = pd.DataFrame(log_reg_grid.best_estimator_._final_estimator.coef_, index=log_reg_grid.best_estimator_._final_estimator.classes_, columns=X_train.columns)
log_reg_7

Unnamed: 0,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Cleric,Pivot,...,0,1,2,3,4,5,6,7,8,9
NU,-3.398959,0.825846,-1.683641,3.837615,1.02939,-1.400224,-2.079783,-0.44195,-0.143571,0.538195,...,1.048154,-2.261804,1.465805,-0.776241,1.175927,-0.165251,-0.897639,-0.096038,0.795752,-0.284758
OU,2.897497,2.775805,5.655337,3.3912,2.32551,1.908341,3.719062,-0.343067,-2.288212,0.614503,...,0.861997,-1.241821,-1.456279,-0.179236,-0.816807,1.362088,-0.826147,1.34631,1.992823,-1.04667
PU,-8.943639,-7.108291,-5.757723,-5.364777,-3.554413,-3.221569,-3.958967,-1.919589,3.420528,0.663186,...,-0.999621,2.773132,0.577097,-1.835868,0.81175,-2.74591,0.704332,0.421272,1.60139,-1.29362
RU,0.596482,3.66338,0.255777,-0.45404,-2.12055,-0.537035,1.523593,0.562605,-1.047274,0.003358,...,0.539891,-2.333245,0.947494,-0.083741,-1.023909,0.889341,-1.173732,0.164865,1.404266,0.669748
UU,4.685442,2.119602,2.517959,2.992264,4.562877,4.665134,3.143711,1.007433,-0.054786,-0.698701,...,0.023841,-1.438526,-0.663294,1.407431,-1.176071,1.870178,-0.539609,-0.536413,0.740884,0.308451
Uber,16.447546,8.550706,11.544501,3.618391,8.200529,10.312648,4.667832,3.169069,-1.954663,-2.003499,...,-0.810683,-0.095555,-2.005025,3.850885,0.423075,2.50743,-0.025316,-1.124938,-4.907439,2.156279
ZU,-12.284369,-10.827048,-12.532209,-8.020654,-10.443342,-11.727296,-7.015448,-2.0345,2.067978,0.882959,...,-0.663579,4.597819,1.134202,-2.38323,0.606035,-3.717876,2.758111,-0.175059,-1.627676,-0.50943


#### 4 class with clustering

In [61]:
best = [0, 0, 0, 0]

for n in n_clusters:
    for c_type in cluster_types:
        for k in k_list:
            pipe = make_pipeline(
                LogisticRegression(random_state=273, solver='newton-cg'))
            X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
            X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[n][c_type]), on='name')
            X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats4'],
                                                                test_size=0.25,
                                                                random_state=273,
                                                                stratify=y_df['formats4'].values)
            log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
            log_reg_grid.fit(X_train, y_train)
            if log_reg_grid.best_score_ > best[0]:
                best = [log_reg_grid.best_score_, k, c_type, n]
                
best

[0.6760594264361616, 10, 'learnsets', 20]

Again, performance is very similar to the results without clustering, but slightly better. 10-fold cross validation is best again, as it was for 4 classes without clustering. The clustering settings changed, though: now we have 20 clusters, built from learnsets instead of from our features.

In [62]:
pipe = make_pipeline(
    LogisticRegression(random_state=273, solver='newton-cg'))
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[20]['learnsets']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats4'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats4'].values)
log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=10, verbose=0)
log_reg_grid.fit(X_train, y_train)
log_reg_grid.best_params_, log_reg_grid.best_score_

({'logisticregression__C': 100, 'logisticregression__penalty': 'l2'},
 0.6760594264361616)

Still not very much regularization, yet more than last time.

In [63]:
f1_score(y_train, log_reg_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.7954788209205903

In [64]:
precision_recall_fscore_support(y_train, log_reg_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.91666667, 0.67460317, 0.66666667, 0.75949367]),
 array([0.92031873, 0.71428571, 0.64      , 0.72289157]),
 array([0.91848907, 0.69387755, 0.65306122, 0.74074074]),
 array([251, 119, 100,  83], dtype=int64))

Although the cross-validated weighted f-score did not improve by much over non-clustering, the training f-scores do seem significantly improved in each class:
- Not c improved by .04
- Low c improved by .17
- Mid c improved by .1
- High c improved by .11
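The per-class improvements listed above can be reproduced directly from the two f-score arrays printed earlier (4 class without clusters and 4 class with clusters). The arrays below are copied from those outputs.

```python
import numpy as np

labels = ["Not c", "Low c", "Mid c", "High c"]

# Training f-scores copied from the precision_recall_fscore_support outputs.
f_no_clusters = np.array([0.87814313, 0.52320675, 0.55497382, 0.63354037])
f_clusters = np.array([0.91848907, 0.69387755, 0.65306122, 0.74074074])

# Positive values mean the clustered model did better on that class.
improvement = f_clusters - f_no_clusters

for label, delta in zip(labels, improvement):
    print(f"{label}: {delta:+.2f}")
```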

In [65]:
log_reg_4 = pd.DataFrame(log_reg_grid.best_estimator_._final_estimator.coef_, index=log_reg_grid.best_estimator_._final_estimator.classes_, columns=X_train.columns)
log_reg_4

Unnamed: 0,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Cleric,Pivot,...,10,11,12,13,14,15,16,17,18,19
High c,7.283015,6.439933,9.19356,5.132881,6.481213,10.176112,4.589923,-0.349624,-0.331814,0.269942,...,5.3549,0.326926,-7.013502,1.021632,-1.816366,2.222112,0.025215,0.303529,0.692468,-1.397766
Low c,-3.682915,-2.133413,-1.78042,-2.752695,-0.223626,-2.555312,-1.978635,-0.01518,1.03165,0.343865,...,-1.900658,-1.689608,2.800292,-0.418902,0.335869,2.806994,0.892975,0.431405,-0.731773,0.36248
Mid c,5.308342,3.548529,4.365388,2.45181,3.038902,5.618649,3.177357,0.431602,1.181753,-0.223484,...,-1.449385,0.829953,-2.41281,0.443607,1.317979,-2.700681,0.20317,-0.23101,1.223632,0.811493
Not c,-8.908442,-7.85505,-11.778528,-4.831996,-9.296488,-13.239449,-5.788645,-0.066797,-1.88159,-0.390323,...,-2.004857,0.532729,6.62602,-1.046336,0.162519,-2.328425,-1.121361,-0.503923,-1.184327,0.223794


#### 4 class alt with clustering

In [66]:
best = [0, 0, 0, 0]

for n in n_clusters:
    for c_type in cluster_types:
        for k in k_list:
            pipe = make_pipeline(
                LogisticRegression(random_state=273, solver='newton-cg'))
            X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
            X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[n][c_type]), on='name')
            X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats4alt'],
                                                                test_size=0.25,
                                                                random_state=273,
                                                                stratify=y_df['formats4alt'].values)
            log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
            log_reg_grid.fit(X_train, y_train)
            if log_reg_grid.best_score_ > best[0]:
                best = [log_reg_grid.best_score_, k, c_type, n]
                
best

[0.716340287913965, 10, 'learnsets', 5]

Yet again, the overall f-score is slightly improved compared to not using clustering, and 10-fold again works best with 4 classes. Learnset clustering proved to be the most useful for a 2nd time, which is surprising and interesting, but this time we needed only 5 clusters!

In [67]:
pipe = make_pipeline(
    LogisticRegression(random_state=273, solver='newton-cg'))
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[5]['learnsets']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats4alt'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats4alt'].values)
log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=10, verbose=0)
log_reg_grid.fit(X_train, y_train)
log_reg_grid.best_params_, log_reg_grid.best_score_

({'logisticregression__C': 1000, 'logisticregression__penalty': 'l2'},
 0.716340287913965)

And yet again, very weak regularization seems best.

In [68]:
f1_score(y_train, log_reg_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.7674903161456883

In [69]:
precision_recall_fscore_support(y_train, log_reg_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.88976378, 0.57391304, 0.69677419, 0.86206897]),
 array([0.90039841, 0.55462185, 0.75524476, 0.625     ]),
 array([0.8950495 , 0.56410256, 0.72483221, 0.72463768]),
 array([251, 119, 143,  40], dtype=int64))

We see improvements in each class of 0 to about .05. While this is an improvement, it's smaller than the one we saw for the standard 4 class model, which now has a higher f-score in every class except mid-competitive, whose size we artificially increased in the alternative class schema.

So the standard 4 classes might be better when using clustering.

In [70]:
log_reg_4alt = pd.DataFrame(log_reg_grid.best_estimator_._final_estimator.coef_, index=log_reg_grid.best_estimator_._final_estimator.classes_, columns=X_train.columns)
log_reg_4alt

Unnamed: 0,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Cleric,Pivot,...,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability,HP Recovery,0,1,2,3,4
High c,14.548473,8.203466,6.659328,3.335335,10.575092,12.883202,5.298188,1.667269,-1.106433,-0.61056,...,-3.737695,1.093407,1.962446,0.797984,1.758652,2.322299,1.477958,-6.322592,1.028088,1.493626
Low c,-6.555992,-2.991315,-0.868274,-1.628833,-3.276305,-3.885007,-2.565184,-0.360377,1.740586,-0.053888,...,1.276707,0.172651,-1.131606,-0.399643,-0.769825,-1.091232,-0.311059,2.653136,-0.205808,-1.044763
Mid c,4.153323,2.846985,3.628214,3.364185,4.107399,4.165556,2.859981,-0.635682,0.888049,0.380809,...,-0.440543,-1.327495,0.081995,0.948077,-0.00812,0.826601,0.228169,-1.770321,0.166913,0.548649
Not c,-12.145804,-8.059137,-9.419268,-5.070687,-11.406186,-13.163752,-5.592984,-0.67121,-1.522201,0.283638,...,2.901532,0.061437,-0.912835,-1.346418,-0.980707,-2.057668,-1.395068,5.439777,-0.989193,-0.997512


#### 2 class with clustering

In [71]:
best = [0, 0, 0, 0]

for n in n_clusters:
    for c_type in cluster_types:
        for k in k_list:
            pipe = make_pipeline(
                LogisticRegression(random_state=273, solver='newton-cg'))
            X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
            X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[n][c_type]), on='name')
            X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats2'],
                                                                test_size=0.25,
                                                                random_state=273,
                                                                stratify=y_df['formats2'].values)
            log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
            log_reg_grid.fit(X_train, y_train)
            if log_reg_grid.best_score_ > best[0]:
                best = [log_reg_grid.best_score_, k, c_type, n]
                
best

[0.910813950674712, 3, 'stats', 25]

Slight improvement in performance as usual, but the other parameters are even more informative. 3-fold cross validation is something we haven't seen win before, and with a new number of classes (2), we have a new type of clustering getting the best results: stats. This is also the largest number of clusters yet, 25.

In [72]:
pipe = make_pipeline(
    LogisticRegression(random_state=273, solver='newton-cg'))
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[25]['stats']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=3, verbose=0)
log_reg_grid.fit(X_train, y_train)
log_reg_grid.best_params_, log_reg_grid.best_score_

({'logisticregression__C': 1, 'logisticregression__penalty': 'l2'},
 0.910813950674712)

And now we're back to middling regularization strength.

In [73]:
f1_score(y_train, log_reg_grid.predict(X_train), labels=['No', 'Yes'], average='weighted')

0.9162620322390561

In [74]:
precision_recall_fscore_support(y_train, log_reg_grid.predict(X_train), labels=['No', 'Yes'])

(array([0.95555556, 0.8902439 ]),
 array([0.85657371, 0.96688742]),
 array([0.90336134, 0.92698413]),
 array([251, 302], dtype=int64))

The f-score is slightly improved in all categories, which has been consistent when using the clusters, so using clustering is indisputably worthwhile for one-stage logistic regression.

In [75]:
log_reg_2 = pd.DataFrame(log_reg_grid.best_estimator_._final_estimator.coef_, columns=X_train.columns)
log_reg_2.T.sort_values(by=0, ascending=False)

Unnamed: 0,0
def,2.355332
spd,2.158972
spe,2.099348
spa,2.042772
atk,1.854335
Resistance Index,1.830234
22,1.293846
4,1.279747
12,1.141616
hp,1.085723


This time we have negative coefficients on our features again, but some of these are sensible since they are clusters (and those clusters may simply contain pokemon that are uncompetitive in terms of their stats). Weather Gimmick and Pivot being negative does not really make sense, though only to a very small degree. Still, it is a sign that logistic regression, at least with one stage, is very unlikely to be our best choice of model.

<a id="two_stage_logistic"></a>
#### Two-Stage Logistic Regression

#### two-stage 7 class, no clusters

The first part of this two-stage model is just regular logistic regression for two classes (which we already did, so we can use that model again). It separates out the largest class, ZU, i.e. relatively non-competitive pokemon, so that the second model doesn't have to include it and can exercise all of its discernment on figuring out which competitive class a competitive pokemon belongs to. As we saw from the f-scores above, that can be quite difficult in some cases, so letting the second model focus on it might lead to higher performance than a single-stage model.
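To make the two-stage idea concrete, here is a minimal sketch of how the two models could be combined at prediction time: rows the first stage calls non-competitive are labelled `ZU`, and only the remaining rows are passed to the second stage. Everything here runs on synthetic data; `two_stage_predict`, `X_demo` and the three-tier labels are assumptions for illustration, not the notebook's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: a "strength" score decides the tier.
rng = np.random.default_rng(273)
n = 300
X_demo = rng.normal(size=(n, 4))
strength = X_demo.sum(axis=1)
y_tier = np.where(strength < 0, "ZU", np.where(strength < 1.5, "PU", "OU"))
y_binary = np.where(y_tier == "ZU", "No", "Yes")

# Stage 1: competitive ("Yes") vs non-competitive ("No").
first = make_pipeline(StandardScaler(), LogisticRegression(random_state=273))
first.fit(X_demo, y_binary)

# Stage 2: trained only on rows whose true tier is not ZU.
competitive = y_tier != "ZU"
second = make_pipeline(StandardScaler(), LogisticRegression(random_state=273))
second.fit(X_demo[competitive], y_tier[competitive])

def two_stage_predict(first_stage, second_stage, X):
    """Label rows 'ZU' unless stage 1 says 'Yes', then defer to stage 2."""
    is_competitive = first_stage.predict(X) == "Yes"
    out = np.full(len(X), "ZU", dtype=object)
    if is_competitive.any():
        out[is_competitive] = second_stage.predict(X[is_competitive])
    return out

combined = two_stage_predict(first, second, X_demo)
print(combined[:10])
```

The combined predictions can then be scored against the full tier labels with `f1_score(..., average='weighted')`, which is what makes a two-stage model comparable to the one-stage results.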

In [76]:
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=273))
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
first_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=5, verbose=0)
first_stage.fit(X_train, y_train)
first_stage.best_params_, first_stage.best_score_

({'logisticregression__C': 0.01, 'logisticregression__penalty': 'l2'},
 0.9014975740163707)

That's the same model that we used before to separate competitive and non-competitive pokemon. Next, we remove the ZU pokemon from consideration in the second model by filtering X and y_df so that we're only looking at competitive pokemon. Note that the cells below filter on the known `formats2` labels rather than on the first stage's predictions:

In [77]:
y_df['formats2'].loc[y_df['formats2'] == 'Yes']

name
Venusaur          Yes
Charizard         Yes
Blastoise         Yes
Pikachu           Yes
Raichu            Yes
                 ... 
Glastrier         Yes
Spectrier         Yes
Calyrex           Yes
Calyrex-Ice       Yes
Calyrex-Shadow    Yes
Name: formats2, Length: 403, dtype: object

In [78]:
X_second = X.loc[y_df['formats2'] == 'Yes']
X_second

Unnamed: 0_level_0,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Cleric,Pivot,...,Priority,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability,HP Recovery
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Venusaur,80,82,83,100,100,80,2,0,0,0,...,0,5,6,4,6,4,4,0,63.636364,3
Charizard,78,84,78,109,85,100,8,0,0,0,...,1,3,14,10,11,5,4,0,50.000000,3
Blastoise,79,83,100,85,105,78,2,0,0,3,...,2,2,12,10,11,7,4,0,75.000000,0
Pikachu,35,55,40,50,50,90,2,0,1,3,...,2,2,7,7,5,3,3,0,70.000000,1
Raichu,60,90,55,90,80,110,2,0,1,3,...,2,2,7,7,6,4,3,0,70.000000,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,1,12,7,3,2,2,0,75.000000,0
Spectrier,100,65,60,145,80,130,8,0,0,0,...,0,0,4,4,4,3,3,0,1.000000,0
Calyrex,100,80,80,80,80,80,-2,0,1,0,...,0,2,3,3,9,4,3,0,18.181818,2
Calyrex-Ice,100,165,150,85,130,50,-4,0,1,0,...,0,2,15,9,12,5,3,1,1.000000,2


In [79]:
y_second_7 = y_df['formats'].loc[y_df['formats2'] == 'Yes']
y_second_7

name
Venusaur            OU
Charizard           PU
Blastoise           NU
Pikachu             PU
Raichu              PU
                  ... 
Glastrier           NU
Spectrier         Uber
Calyrex             PU
Calyrex-Ice       Uber
Calyrex-Shadow    Uber
Name: formats, Length: 403, dtype: object

In [80]:
best = []

for k in k_list:
    pipe = make_pipeline(
        StandardScaler(),
        LogisticRegression(random_state=273))
    X_train, X_test, y_train, y_test = train_test_split(X_second, y_second_7,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_7.values)
    log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
    log_reg_grid.fit(X_train, y_train)
    best.append(log_reg_grid.best_score_)

max(best), k_list[best.index(max(best))]

(0.4108714041695132, 10)

Our second stage still has a fairly low f-score (about .41) when sorting the competitive pokemon into 6 tiers, but it may still lead to a significantly higher overall f-score than using only a one-stage model.
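The search over fold counts in the cell above can be sketched on toy data as follows. This is only a minimal sketch: `toy_k_list` and `toy_grid` are stand-ins I made up for the notebook's `k_list` and `param_grid` (which are defined earlier), the data is synthetic, and the split is done once outside the loop since it does not depend on `k`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data, not the pokemon data.
rng = np.random.default_rng(273)
X_toy = rng.normal(size=(150, 4))
y_toy = np.where(X_toy[:, 0] + 0.5 * rng.normal(size=150) > 0, "Yes", "No")

# The split does not depend on k, so it only needs to happen once.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=273, stratify=y_toy)

toy_k_list = [2, 3, 5]                             # hypothetical fold counts
toy_grid = {"logisticregression__C": [0.1, 1, 10]}  # hypothetical grid
scorer = make_scorer(f1_score, average="weighted")

scores = {}
for k in toy_k_list:
    pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=273))
    search = GridSearchCV(pipe, toy_grid, scoring=scorer, cv=k)
    search.fit(X_tr, y_tr)
    scores[k] = search.best_score_  # best mean CV score for this fold count

best_k = max(scores, key=scores.get)
print(scores[best_k], best_k)
```

Keeping the scores in a dict keyed by `k` avoids the `best.index(max(best))` lookup, but the result is the same kind of `(score, k)` pair.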

In [81]:
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=273))
X_train, X_test, y_train, y_test = train_test_split(X_second, y_second_7,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_7.values)
second_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=10, verbose=0)
second_stage.fit(X_train, y_train)
second_stage.best_params_, second_stage.best_score_

({'logisticregression__C': 10, 'logisticregression__penalty': 'l2'},
 0.4108714041695132)

In [82]:
precision_recall_fscore_support(y_train, second_stage.predict(X_train), labels=['PU', 'NU', 'RU', 'UU', 'OU', 'Uber'])

(array([0.65432099, 0.27777778, 0.5       , 0.375     , 0.43243243,
        0.65116279]),
 array([0.8907563 , 0.16129032, 0.25714286, 0.26470588, 0.37209302,
        0.7       ]),
 array([0.7544484 , 0.20408163, 0.33962264, 0.31034483, 0.4       ,
        0.6746988 ]),
 array([119,  31,  35,  34,  43,  40], dtype=int64))

PU and Uber are, unsurprisingly, predicted more accurately than the others, while NU remains especially difficult for the model to detect. This could indicate the value of a three-stage model, since only the top 5 classes are close to balanced in size.
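The raw tuple of arrays above is hard to read at a glance, so here is a small sketch of one way to label it. The helper name `per_class_report` and the toy labels are my own assumptions for illustration; only `precision_recall_fscore_support` itself comes from the notebook.

```python
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

def per_class_report(y_true, y_pred, labels):
    """Return precision, recall, f-score and support as a labeled DataFrame."""
    precision, recall, fscore, support = precision_recall_fscore_support(
        y_true, y_pred, labels=labels)
    return pd.DataFrame(
        {"precision": precision, "recall": recall,
         "fscore": fscore, "support": support},
        index=labels)

# Toy labels for illustration only, not taken from the notebook's data.
toy_true = ["PU", "PU", "NU", "Uber", "Uber", "NU"]
toy_pred = ["PU", "NU", "NU", "Uber", "PU", "NU"]
report = per_class_report(toy_true, toy_pred, ["PU", "NU", "Uber"])
print(report)
```

In the notebook this would presumably be called as `per_class_report(y_train, second_stage.predict(X_train), ['PU', 'NU', 'RU', 'UU', 'OU', 'Uber'])`.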

So now, in order to get our actual scores, we will need to apply both models together. They've already been separately cross-validated for best parameters, so we can just apply them to all of the training data.

In [83]:
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
pred_1 = pd.DataFrame(first_stage.predict(X_train), index=X_train.index)
pred_2 = pd.DataFrame(second_stage.predict(X_train[pred_1[0] == 'Yes']), index=X_train[pred_1[0] == 'Yes'].index)
y_pred = pred_1.merge(pred_2, on='name', how='left')['0_y'].fillna(value='ZU')
y_validation = y_train.to_frame().merge(y_df['formats'], on='name', how='left')['formats']
y_validation

name
Absol              PU
Ninetales-Alola    OU
Palossand          PU
Ponyta-Galar       ZU
Carvanha           ZU
                   ..
Dragonair          ZU
Qwilfish           PU
Cryogonal          PU
Wailord            ZU
Blaziken           OU
Name: formats, Length: 553, dtype: object
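The combination in the cell above (first stage decides "Yes"/"No", second stage relabels only the "Yes" rows, everything else falls back to `ZU`) can also be written without the merges. This is a sketch under my own assumptions, not the notebook's code: `two_stage_predict` is a hypothetical helper, and the models and data below are synthetic stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def two_stage_predict(first, second, X, positive="Yes", fallback="ZU"):
    """Predict with `first`; rows it labels `positive` are relabeled by `second`."""
    stage_one = pd.Series(first.predict(X), index=X.index)
    result = pd.Series(fallback, index=X.index, dtype=object)
    passed = stage_one == positive
    if passed.any():
        result.loc[passed] = second.predict(X[passed])
    return result

# Synthetic stand-in data: column "a" drives competitiveness, "b" drives tier.
rng = np.random.default_rng(273)
X_toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
X_toy.index.name = "name"
is_comp = (X_toy["a"] > 0).to_numpy()
y_binary = np.where(is_comp, "Yes", "No")
tier = np.where(X_toy["b"] > 0, "OU", "PU")

first = LogisticRegression().fit(X_toy, y_binary)
second = LogisticRegression().fit(X_toy[is_comp], tier[is_comp])

y_toy_pred = two_stage_predict(first, second, X_toy)
print(y_toy_pred.value_counts())
```

The result keeps the `name` index of `X`, so it lines up directly with a label series for `f1_score`.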

In [84]:
f1_score(y_validation, y_pred, labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'], average='weighted')

0.6674254958803228

In [85]:
precision_recall_fscore_support(y_validation, y_pred, labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'])

(array([0.95475113, 0.52941176, 0.3       , 0.36      , 0.375     ,
        0.37142857, 0.58536585]),
 array([0.84063745, 0.77952756, 0.2       , 0.26470588, 0.26470588,
        0.31707317, 0.66666667]),
 array([0.8940678 , 0.63057325, 0.24      , 0.30508475, 0.31034483,
        0.34210526, 0.62337662]),
 array([251, 127,  30,  34,  34,  41,  36], dtype=int64))

This two-stage model gives better results than the one-stage model in almost all classes, except for the two highest classes, OU and Uber, which perform slightly worse. The overall f-score is close to that of the one-stage 4 class model, which is a notable improvement.

In terms of explainability, it is partly more difficult, since we have coefficients in two models, but partly easier, since we can see separately how ZU pokemon are separated out, and then how the model ranks the competitive pokemon.

#### two-stage 4 class, no clusters

In [86]:
y_second_4 = y_df['formats4'].loc[y_df['formats2'] == 'Yes']
best = []

for k in k_list:
    pipe = make_pipeline(
        StandardScaler(),
        LogisticRegression(random_state=273))
    X_train, X_test, y_train, y_test = train_test_split(X_second, y_second_4,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_4.values)
    log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
    log_reg_grid.fit(X_train, y_train)
    best.append(log_reg_grid.best_score_)

max(best), k_list[best.index(max(best))]

(0.5627918735620913, 5)

In [87]:
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=273))
X_train, X_test, y_train, y_test = train_test_split(X_second, y_second_4,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_4.values)
second_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=5, verbose=0)
second_stage.fit(X_train, y_train)
second_stage.best_params_, second_stage.best_score_

({'logisticregression__C': 0.1, 'logisticregression__penalty': 'l2'},
 0.5627918735620913)

In [88]:
precision_recall_fscore_support(y_train, second_stage.predict(X_train), labels=['Low c', 'Mid c', 'High c'])

(array([0.64827586, 0.64556962, 0.69230769]),
 array([0.78991597, 0.51      , 0.65060241]),
 array([0.71212121, 0.5698324 , 0.67080745]),
 array([119, 100,  83], dtype=int64))

Compared to the one-stage 4 class model, we are getting much better scores for low competitive, a bit better for high competitive, and about the same for mid competitive.

In [89]:
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
pred_1 = pd.DataFrame(first_stage.predict(X_train), index=X_train.index)
pred_2 = pd.DataFrame(second_stage.predict(X_train[pred_1[0] == 'Yes']), index=X_train[pred_1[0] == 'Yes'].index)
y_pred = pred_1.merge(pred_2, on='name', how='left')['0_y'].fillna(value='Not c')
y_validation = y_train.to_frame().merge(y_df['formats4'], on='name', how='left')['formats4']
y_validation

name
Absol               Low c
Ninetales-Alola    High c
Palossand           Low c
Ponyta-Galar        Not c
Carvanha            Not c
                    ...  
Dragonair           Not c
Qwilfish            Low c
Cryogonal           Low c
Wailord             Not c
Blaziken           High c
Name: formats4, Length: 553, dtype: object

In [90]:
f1_score(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.7229682104036999

In [91]:
precision_recall_fscore_support(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.95475113, 0.53571429, 0.56976744, 0.58974359]),
 array([0.84063745, 0.70866142, 0.5       , 0.5974026 ]),
 array([0.8940678 , 0.61016949, 0.5326087 , 0.59354839]),
 array([251, 127,  98,  77], dtype=int64))

Similar results to the 7-class two-stage model: we get better performance than one-stage on not competitive and low competitive pokemon, but lower performance on mid competitive and high competitive pokemon (though by a smaller margin).

#### two-stage 4 class alt, no clusters

In [92]:
y_second_4alt = y_df['formats4alt'].loc[y_df['formats2'] == 'Yes']
best = []

for k in k_list:
    pipe = make_pipeline(
        StandardScaler(),
        LogisticRegression(random_state=273))
    X_train, X_test, y_train, y_test = train_test_split(X_second, y_second_4alt,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_4alt.values)
    log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
    log_reg_grid.fit(X_train, y_train)
    best.append(log_reg_grid.best_score_)

max(best), k_list[best.index(max(best))]

(0.6431205741104676, 2)

In [93]:
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=273))
X_train, X_test, y_train, y_test = train_test_split(X_second, y_second_4alt,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_4alt.values)
second_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=2, verbose=0)
second_stage.fit(X_train, y_train)
second_stage.best_params_, second_stage.best_score_

({'logisticregression__C': 0.1, 'logisticregression__penalty': 'l2'},
 0.6431205741104676)

In [94]:
precision_recall_fscore_support(y_train, second_stage.predict(X_train), labels=['Low c', 'Mid c', 'High c'])

(array([0.71304348, 0.66666667, 0.84      ]),
 array([0.68907563, 0.75524476, 0.525     ]),
 array([0.7008547 , 0.70819672, 0.64615385]),
 array([119, 143,  40], dtype=int64))

We are getting drastically better performance on low competitive, slightly better performance on mid competitive, and notably worse performance on high competitive, probably because it's by far the smallest class in this case.

In [95]:
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
pred_1 = pd.DataFrame(first_stage.predict(X_train), index=X_train.index)
pred_2 = pd.DataFrame(second_stage.predict(X_train[pred_1[0] == 'Yes']), index=X_train[pred_1[0] == 'Yes'].index)
y_pred = pred_1.merge(pred_2, on='name', how='left')['0_y'].fillna(value='Not c')
y_validation = y_train.to_frame().merge(y_df['formats4alt'], on='name', how='left')['formats4alt']
y_validation

name
Absol              Low c
Ninetales-Alola    Mid c
Palossand          Low c
Ponyta-Galar       Not c
Carvanha           Not c
                   ...  
Dragonair          Not c
Qwilfish           Low c
Cryogonal          Low c
Wailord            Not c
Blaziken           Mid c
Name: formats4alt, Length: 553, dtype: object

In [96]:
f1_score(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.7307719431394758

In [97]:
precision_recall_fscore_support(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.95475113, 0.55725191, 0.56666667, 0.71428571]),
 array([0.84063745, 0.57480315, 0.73381295, 0.41666667]),
 array([0.8940678 , 0.56589147, 0.63949843, 0.52631579]),
 array([251, 127, 139,  36], dtype=int64))

Yet again, higher performance on not competitive and low competitive, and lower performance on mid competitive and high competitive. Although this is our highest weighted f-score from a two-stage 4 class model so far (.731, just under the one-stage version's .738), the Uber f-score decreased by about .2 compared to one-stage, which is a massive decrease. The weighted average barely moves because Uber is a small class, but I'm not sure it's worth such a large decrease in a single class. I also tried macro and micro averages, and they did not change the result at all, so weighted is probably fine to use, and even the most sensible choice, since it doesn't make sense to treat one class like Uber as more important. But the decreases in Uber are strange, and I don't love this aspect of my new two-stage models.
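To make the difference between the three averages concrete, here is a small sketch on made-up labels (not the notebook's predictions). Macro weights every class equally, micro pools all decisions, and weighted scales each class by its support, which is why a small class like Uber moves the weighted score so little.

```python
from sklearn.metrics import f1_score

labels = ["Not c", "Low c", "High c"]

# Toy labels: 6 "Not c", 3 "Low c", 2 "High c" (illustrative only).
toy_true = ["Not c"] * 6 + ["Low c"] * 3 + ["High c"] * 2
toy_pred = ["Not c"] * 6 + ["Low c", "Low c", "Not c"] + ["High c", "Low c"]

# Per-class f-scores, in the order given by `labels`.
per_class = f1_score(toy_true, toy_pred, labels=labels, average=None)

# The three averaging schemes applied to the same predictions.
averages = {
    avg: f1_score(toy_true, toy_pred, labels=labels, average=avg)
    for avg in ("macro", "micro", "weighted")
}

print(dict(zip(labels, per_class.round(3))))
print({name: round(value, 3) for name, value in averages.items()})
```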

<a id="two_stage_logistic_cluster"></a>
#### two-stage logistic regression with clustering

#### two-stage 7 class with clustering

We need to set it up so that our first stage has clustering now:

In [98]:
pipe = make_pipeline(
    LogisticRegression(random_state=273, solver='newton-cg'))
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[25]['stats']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
first_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=3, verbose=0)
first_stage.fit(X_train, y_train)
first_stage.best_params_, first_stage.best_score_

({'logisticregression__C': 1, 'logisticregression__penalty': 'l2'},
 0.910813950674712)
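The feature preparation in the cell above (min-max scaling, then appending one-hot cluster columns) can be sketched on a tiny frame. The stat values are copied from the table earlier in this section, but `toy_clusters` is a made-up assignment; the real ones come from `cluster_dfs`. I also use a `cluster_` prefix here, which is my own choice: as far as I know, newer scikit-learn versions reject DataFrames whose column names mix strings and integers, which is what unprefixed dummies produce.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

toy_X = pd.DataFrame(
    {"hp": [80, 78, 35, 100], "spe": [80, 100, 90, 30]},
    index=pd.Index(["Venusaur", "Charizard", "Pikachu", "Glastrier"], name="name"))

# Hypothetical cluster labels, one per pokemon.
toy_clusters = pd.Series([0, 1, 1, 2], index=toy_X.index, name="cluster")

# Scale every stat to [0, 1], keeping the index and column names.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy_X),
                      index=toy_X.index, columns=toy_X.columns)

# One-hot encode the cluster labels with string column names.
dummies = pd.get_dummies(toy_clusters, prefix="cluster")

# Both frames share the "name" index, so a join lines them up row by row.
features = scaled.join(dummies)
print(features)
```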

And we already have X_second and all iterations of y_second set up to make our training and testing sets, so we can just go ahead and do hyperparameter search:

In [99]:
best = [0, 0, 0, 0]

for n in n_clusters:
    for c_type in cluster_types:
        for k in k_list:
            pipe = make_pipeline(
                LogisticRegression(random_state=273, solver='newton-cg'))
            X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X_second), index=X_second.index, columns=X_second.columns)
            X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[n][c_type]), on='name')
            X_train, X_test, y_train, y_test = train_test_split(X_final, y_second_7,
                                                                test_size=0.25,
                                                                random_state=273,
                                                                stratify=y_second_7.values)
            log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
            log_reg_grid.fit(X_train, y_train)
            if log_reg_grid.best_score_ > best[0]:
                best = [log_reg_grid.best_score_, k, c_type, n]
                
best

[0.44595342251317494, 10, 'learnsets', 30]

In [100]:
pipe = make_pipeline(
    LogisticRegression(random_state=273, solver='newton-cg'))
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X_second), index=X_second.index, columns=X_second.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[30]['learnsets']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_second_7,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_7.values)
second_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=10, verbose=0)
second_stage.fit(X_train, y_train)
second_stage.best_params_, second_stage.best_score_

({'logisticregression__C': 100, 'logisticregression__penalty': 'l2'},
 0.44595342251317494)

In [101]:
precision_recall_fscore_support(y_train, second_stage.predict(X_train), labels=['PU', 'NU', 'RU', 'UU', 'OU', 'Uber'])

(array([0.8503937 , 0.39393939, 0.73333333, 0.4516129 , 0.63157895,
        0.76744186]),
 array([0.90756303, 0.41935484, 0.62857143, 0.41176471, 0.55813953,
        0.825     ]),
 array([0.87804878, 0.40625   , 0.67692308, 0.43076923, 0.59259259,
        0.79518072]),
 array([119,  31,  35,  34,  43,  40], dtype=int64))

This is dramatically improved performance; clustering clearly seems to help. Our lowest class, which as usual is NU, is up to an f-score of about .4, compared to the .13 it was getting before. And four of the six classes are above .5, on a six class model. Let's try the whole seven class two-stage model:

In [102]:
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[25]['stats']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
pred_1 = pd.DataFrame(first_stage.predict(X_train), index=X_train.index)

#drop old clusters and merge the new clusters
#because the two models work with different clusterings
X_train = X_train.drop(columns=list(range(0, 25)))
X_train = X_train.merge(pd.get_dummies(cluster_dfs[30]['learnsets']), on='name', how='left')

pred_2 = pd.DataFrame(second_stage.predict(X_train[pred_1[0] == 'Yes']), index=X_train[pred_1[0] == 'Yes'].index)
y_pred = pred_1.merge(pred_2, on='name', how='left')['0_y'].fillna(value='ZU')
y_validation = y_train.to_frame().merge(y_df['formats'], on='name', how='left')['formats']
y_validation

name
Absol              PU
Ninetales-Alola    OU
Palossand          PU
Ponyta-Galar       ZU
Carvanha           ZU
                   ..
Dragonair          ZU
Qwilfish           PU
Cryogonal          PU
Wailord            ZU
Blaziken           OU
Name: formats, Length: 553, dtype: object

In [103]:
f1_score(y_validation, y_pred, labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'], average='weighted')

0.7324626277882527

In [104]:
precision_recall_fscore_support(y_validation, y_pred, labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'])

(array([0.95555556, 0.66423358, 0.42307692, 0.52380952, 0.46153846,
        0.5       , 0.57692308]),
 array([0.85657371, 0.71653543, 0.36666667, 0.64705882, 0.52941176,
        0.3902439 , 0.83333333]),
 array([0.90336134, 0.68939394, 0.39285714, 0.57894737, 0.49315068,
        0.43835616, 0.68181818]),
 array([251, 127,  30,  34,  34,  41,  36], dtype=int64))

This is very good for a 7 class model, and it does much better in many classes, especially some of the middle classes that the one-stage model had a hard time with (NU and RU especially). It did worse in Uber, but only by a small amount, and the overall performance gains make the two-stage version clearly justifiable in this case.

#### two-stage 4 class with clustering

In [105]:
best = [0, 0, 0, 0]

for n in n_clusters:
    for c_type in cluster_types:
        for k in k_list:
            pipe = make_pipeline(
                LogisticRegression(random_state=273, solver='newton-cg'))
            X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X_second), index=X_second.index, columns=X_second.columns)
            X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[n][c_type]), on='name')
            X_train, X_test, y_train, y_test = train_test_split(X_final, y_second_4,
                                                                test_size=0.25,
                                                                random_state=273,
                                                                stratify=y_second_4.values)
            log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
            log_reg_grid.fit(X_train, y_train)
            if log_reg_grid.best_score_ > best[0]:
                best = [log_reg_grid.best_score_, k, c_type, n]
                
best

[0.6023443718987143, 5, 'learnsets', 20]

In [106]:
pipe = make_pipeline(
    LogisticRegression(random_state=273, solver='newton-cg'))
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X_second), index=X_second.index, columns=X_second.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[20]['learnsets']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_second_4,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_4.values)
second_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=5, verbose=0)
second_stage.fit(X_train, y_train)
second_stage.best_params_, second_stage.best_score_

({'logisticregression__C': 1, 'logisticregression__penalty': 'l2'},
 0.6023443718987143)

In [107]:
precision_recall_fscore_support(y_train, second_stage.predict(X_train), labels=['Low c', 'Mid c', 'High c'])

(array([0.69127517, 0.64285714, 0.72463768]),
 array([0.86554622, 0.54      , 0.60240964]),
 array([0.76865672, 0.58695652, 0.65789474]),
 array([119, 100,  83], dtype=int64))

This is arguably not even an improvement over the one-stage model, though we'll have to see how it performs once combined with the first stage. It does better in low competitive, which is the largest class here (119 of 302) and so carries the most weight in the f-score, but it struggles a bit more with the other classes. This might also lend credibility to a third stage: the model does well with the PU tier and has high recall there, but we may need to remove the big classes ZU and PU from consideration to get a model that does well in the higher classes.

In [108]:
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[25]['stats']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
pred_1 = pd.DataFrame(first_stage.predict(X_train), index=X_train.index)

#drop old clusters and merge the new clusters
#because the two models work with different clusterings
X_train = X_train.drop(columns=list(range(0, 25)))
X_train = X_train.merge(pd.get_dummies(cluster_dfs[20]['learnsets']), on='name', how='left')

pred_2 = pd.DataFrame(second_stage.predict(X_train[pred_1[0] == 'Yes']), index=X_train[pred_1[0] == 'Yes'].index)
y_pred = pred_1.merge(pred_2, on='name', how='left')['0_y'].fillna(value='Not c')
y_validation = y_train.to_frame().merge(y_df['formats4'], on='name', how='left')['formats4']
y_validation

name
Absol               Low c
Ninetales-Alola    High c
Palossand           Low c
Ponyta-Galar        Not c
Carvanha            Not c
                    ...  
Dragonair           Not c
Qwilfish            Low c
Cryogonal           Low c
Wailord             Not c
Blaziken           High c
Name: formats4, Length: 553, dtype: object

In [109]:
f1_score(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.735776201933809

In [110]:
precision_recall_fscore_support(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.95555556, 0.57831325, 0.51401869, 0.69090909]),
 array([0.85657371, 0.75590551, 0.56122449, 0.49350649]),
 array([0.90336134, 0.6552901 , 0.53658537, 0.57575758]),
 array([251, 127,  98,  77], dtype=int64))

4 class models in general seem less useful than 7 class models for two-stage modeling: the score is barely higher (less than .01 above the 7 class model), perhaps because our class groupings are more artificial, and the tiny gain in performance is not worth losing the nuance of 3 more classes.

Mid and high competitive take nosedives in performance, and every class performs worse than in the one-stage version of the model. It's interesting to think about why that happens here, whereas for 7 classes the two-stage model performs better.

#### two-stage 4 class alt with clustering

In [111]:
best = [0, 0, 0, 0]

for n in n_clusters:
    for c_type in cluster_types:
        for k in k_list:
            pipe = make_pipeline(
                LogisticRegression(random_state=273, solver='newton-cg'))
            X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X_second), index=X_second.index, columns=X_second.columns)
            X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[n][c_type]), on='name')
            X_train, X_test, y_train, y_test = train_test_split(X_final, y_second_4alt,
                                                                test_size=0.25,
                                                                random_state=273,
                                                                stratify=y_second_4alt.values)
            log_reg_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
            log_reg_grid.fit(X_train, y_train)
            if log_reg_grid.best_score_ > best[0]:
                best = [log_reg_grid.best_score_, k, c_type, n]
                
best

[0.6711716864231281, 2, 'features', 25]

In [112]:
pipe = make_pipeline(
    LogisticRegression(random_state=273, solver='newton-cg'))
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X_second), index=X_second.index, columns=X_second.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[25]['features']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_second_4alt,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_4alt.values)
second_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=2, verbose=0)
second_stage.fit(X_train, y_train)
second_stage.best_params_, second_stage.best_score_

({'logisticregression__C': 100, 'logisticregression__penalty': 'l2'},
 0.6711716864231281)

In [113]:
precision_recall_fscore_support(y_train, second_stage.predict(X_train), labels=['Low c', 'Mid c', 'High c'])

(array([0.80869565, 0.76282051, 0.90322581]),
 array([0.78151261, 0.83216783, 0.7       ]),
 array([0.79487179, 0.79598662, 0.78873239]),
 array([119, 143,  40], dtype=int64))

Okay, these are actually all very high scores, and this is definitely the best-performing 4 class model we've seen yet. This is the first model, other than the 2 class models, where every class is above .75 f-score.

In [114]:
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[25]['stats']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
pred_1 = pd.DataFrame(first_stage.predict(X_train), index=X_train.index)

#drop old clusters and merge the new clusters
#because the two models work with different clusterings
X_train = X_train.drop(columns=list(range(0, 25)))
X_train = X_train.merge(pd.get_dummies(cluster_dfs[25]['features']), on='name', how='left')

pred_2 = pd.DataFrame(second_stage.predict(X_train[pred_1[0] == 'Yes']), index=X_train[pred_1[0] == 'Yes'].index)
y_pred = pred_1.merge(pred_2, on='name', how='left')['0_y'].fillna(value='Not c')
y_validation = y_train.to_frame().merge(y_df['formats4alt'], on='name', how='left')['formats4alt']
y_validation

name
Absol              Low c
Ninetales-Alola    Mid c
Palossand          Low c
Ponyta-Galar       Not c
Carvanha           Not c
                   ...  
Dragonair          Not c
Qwilfish           Low c
Cryogonal          Low c
Wailord            Not c
Blaziken           Mid c
Name: formats4alt, Length: 553, dtype: object

In [115]:
f1_score(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.7637156014905899

In [116]:
precision_recall_fscore_support(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.95555556, 0.67346939, 0.62569832, 0.54901961]),
 array([0.85657371, 0.51968504, 0.8057554 , 0.77777778]),
 array([0.90336134, 0.58666667, 0.70440252, 0.64367816]),
 array([251, 127, 139,  36], dtype=int64))

This performance is okay, but it's not very good. It does as well as the other two-stage clustering model (4 class) in not competitive, but that model outperforms it in low competitive by about .07, while this model outperforms that one in mid and high competitive by a lot (almost .17 in mid competitive). However, the one-stage model outperforms it in those two categories.

So I don't think two-stage modeling helped with 4 classes, though it did help with 7 classes.

Notebook runtime: about 4 minutes on my computer

<a id="performance"></a>
## Score Summary:

### one-stage, no clusters

#### 7 classes, no clusters
0.6462772739719035\
[0.88499025, 0.57553957, 0.13333333, 0.28, 0.28571429, 0.3902439, 0.65853659]
 
#### 4 classes, no clusters
0.706612591329458\
[0.87814313, 0.52320675, 0.55497382, 0.63354037]

#### 4 class alt no clusters
0.7378931899018644\
[0.88316832, 0.51101322, 0.67540984, 0.72463768]

#### 2 class no clusters
0.9088498774843347\
[0.8940678 , 0.92113565]

### one-stage, with clustering

#### 7 class with clustering
0.6981130120478787\
[0.89940828, 0.65384615, 0.13333333, 0.40677966, 0.49180328, 0.43181818, 0.72093023]

#### 4 class with clustering
0.7954788209205903\
[0.91848907, 0.69387755, 0.65306122, 0.74074074]

#### 4 class alt with clustering
0.7674903161456883\
[0.8950495 , 0.56410256, 0.72483221, 0.72463768]

#### 2 class with clustering
0.9162620322390561\
[0.90336134, 0.92698413]

### two-stage, no clustering

#### two-stage 7 class, no clusters
0.6674254958803228\
[0.8940678, 0.63057325, 0.24, 0.30508475, 0.31034483, 0.34210526, 0.62337662]

#### two-stage 4 class, no clusters
0.7229682104036999\
[0.8940678 , 0.61016949, 0.5326087 , 0.59354839]

#### two-stage 4 class alt, no clusters
0.7307719431394758\
[0.8940678 , 0.56589147, 0.63949843, 0.52631579]

### two-stage, with clustering

#### two-stage 7 class with clustering
0.7324626277882527\
[0.90336134, 0.68939394, 0.39285714, 0.57894737, 0.49315068, 0.43835616, 0.68181818]

#### two-stage 4 class with clustering
0.735776201933809\
[0.90336134, 0.6552901 , 0.53658537, 0.57575758]

#### two-stage 4 class alt with clustering
0.7637156014905899\
[0.90336134, 0.58666667, 0.70440252, 0.64367816]
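The weighted f-scores listed above are easier to compare side by side. As a sketch, here they are copied into a table (rounded to 4 decimals) with the best model picked per class scheme; the column names and model labels are my own shorthand.

```python
import pandas as pd

# (model, class scheme, weighted f-score), copied from the score summary.
rows = [
    ("one-stage, no clusters", "7 class", 0.6463),
    ("one-stage, no clusters", "4 class", 0.7066),
    ("one-stage, no clusters", "4 class alt", 0.7379),
    ("one-stage, no clusters", "2 class", 0.9088),
    ("one-stage, clustering", "7 class", 0.6981),
    ("one-stage, clustering", "4 class", 0.7955),
    ("one-stage, clustering", "4 class alt", 0.7675),
    ("one-stage, clustering", "2 class", 0.9163),
    ("two-stage, no clusters", "7 class", 0.6674),
    ("two-stage, no clusters", "4 class", 0.7230),
    ("two-stage, no clusters", "4 class alt", 0.7308),
    ("two-stage, clustering", "7 class", 0.7325),
    ("two-stage, clustering", "4 class", 0.7358),
    ("two-stage, clustering", "4 class alt", 0.7637),
]
summary = pd.DataFrame(rows, columns=["model", "classes", "weighted_f1"])

# One row per class scheme, one column per model (NaN where no model was run).
table = summary.pivot(index="classes", columns="model", values="weighted_f1")

# Name of the best-scoring model for each class scheme.
best_model = table.idxmax(axis=1)

print(table)
print(best_model)
```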

## Performance Summary

### best 7 class model: two-stage 7 class with clustering (7/8)

#### best 7 class overall
- two-stage 7 class with clustering, 0.7324626277882527

#### best 7 class ZU
- two-stage 7 class with clustering, 0.90336134

#### best 7 class PU
- two-stage 7 class with clustering, 0.68939394

#### best 7 class NU
- two-stage 7 class with clustering, 0.39285714

#### best 7 class RU
- two-stage 7 class with clustering, 0.57894737

#### best 7 class UU
- two-stage 7 class with clustering, 0.49315068

#### best 7 class OU
- two-stage 7 class with clustering, 0.43835616

#### best 7 class Uber
- 7 class with clustering, 0.72093023

### best 4 class model: 4 class with clustering (4/5), one-stage

#### best 4 class overall
- 4 class with clustering, 0.7954788209205903

#### best 4 class "not competitive"
- 4 class with clustering, 0.91848907

#### best 4 class "low competitive"
- 4 class with clustering, 0.69387755

#### best 4 class "mid competitive"
- 4 class alt with clustering, 0.72483221

#### best 4 class "high competitive"
- 4 class with clustering, 0.74074074

### best 2 class model: 2 class with clustering

## Explainability Summary

The models with only two classes, as well as the first stages of the two-stage models, are the most easily and usefully explainable, since they have a single set of coefficients that tells us the relative contribution of each feature to the model.

The 7 and 4 class models, as well as the second stages of the two-stage models, are less easily explainable because they have a separate set of coefficients for each class, but they do give us more nuance about how the features contribute to a pokemon being sorted into each tier, in addition to a more nuanced assessment in the first place.
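The difference shows up directly in the shape of `coef_`. A minimal sketch on synthetic data (the features and labels below are made up; only the class names echo the notebook):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic features and labels for illustration only.
rng = np.random.default_rng(273)
X_toy = rng.normal(size=(120, 4))
y_two = np.where(X_toy[:, 0] > 0, "Yes", "No")
tiers = np.array(["Low c", "Mid c", "High c"])
y_three = tiers[np.digitize(X_toy[:, 1], [-0.5, 0.5])]

binary_model = LogisticRegression().fit(X_toy, y_two)
multi_model = LogisticRegression().fit(X_toy, y_three)

print(binary_model.coef_.shape)  # (1, 4): one row of coefficients
print(multi_model.coef_.shape)   # (3, 4): one row per class

# Rows of coef_ follow the order of classes_, so they can be paired up.
for label, row in zip(multi_model.classes_, multi_model.coef_):
    print(label, row.round(2))
```

For the grid-searched pipelines in this notebook, the fitted model should be reachable with something like `second_stage.best_estimator_.named_steps['logisticregression']`, since `make_pipeline` names each step after its lowercased class name.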