## Table of Contents

#### Set-Up
- [Splitting the data](#split)
- [Adjusting some features](#adjusting)
- [Clustering](#clustering)
- [Modeling guidelines](#modeling_guidelines)

#### Modeling
- [Decision Tree](#dt)
    - [Decision Tree with clustering](#dt_cluster)
    - [Two-Stage Decision Tree](#two_stage_dt)
    - [Two-Stage Decision Tree with clustering](#two_stage_dt_cluster)
    
- [Performance Results](#performance)

Note to self: re-upload this notebook and the EDA notebook when publishing everything to GitHub.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import re
import json
import pandas_profiling
import requests
from subprocess import call
from IPython.display import Image
from graphviz import render
from bs4 import BeautifulSoup
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale, StandardScaler, MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.manifold import TSNE
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.metrics import precision_recall_fscore_support, log_loss, r2_score, mean_squared_error, f1_score, make_scorer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.cluster import KMeans, DBSCAN, MeanShift
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

In [2]:
pokemon_abilities_df = pd.read_csv('./data/pokemon_abilities_df.csv', index_col="name")
pokemon_learnsets_df = pd.read_csv('./data/pokemon_learnsets_df.csv', index_col='name')
pokemon_data = pd.read_csv('./data/pokemon_data.csv', index_col="name")

In [3]:
pokemon_data

name,hp,atk,def,spa,spd,spe,weight,height,formats,generation,...,Ability Cutoff 2,Ability Cutoff 3,Ability Cutoff 4,Ability Cutoff 5,Ability Cutoff 6,Best Ability,Best Ability <100,Unique Powerful Ability,oldformats,oldformat codes
Bulbasaur,45,49,49,65,65,45,6.9,0.7,ZU,RB,...,1.0,0.0,0.0,0.0,0.0,63.636364,63.636364,0,ZU,0
Ivysaur,60,62,63,80,80,60,13.0,1.0,ZU,RB,...,1.0,0.0,0.0,0.0,0.0,63.636364,63.636364,0,ZU,0
Venusaur,80,82,83,100,100,80,100.0,2.0,OU,RB,...,1.0,0.0,0.0,0.0,0.0,63.636364,63.636364,0,UU,4
Charmander,39,52,43,60,50,65,8.5,0.6,ZU,RB,...,1.0,0.0,0.0,0.0,0.0,50.000000,50.000000,0,ZU,0
Charmeleon,58,64,58,80,65,80,19.0,1.1,ZU,RB,...,1.0,0.0,0.0,0.0,0.0,50.000000,50.000000,0,ZU,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,800.0,2.2,NU,SS,...,1.0,1.0,1.0,0.0,0.0,75.000000,75.000000,0,NU,2
Spectrier,100,65,60,145,80,130,44.5,2.0,Uber,SS,...,0.0,0.0,0.0,0.0,0.0,1.000000,0.000000,1,Uber,6
Calyrex,100,80,80,80,80,80,7.7,1.1,PU,SS,...,0.0,0.0,0.0,0.0,0.0,18.181818,18.181818,0,ZU,0
Calyrex-Ice,100,165,150,85,130,50,809.1,2.4,Uber,SS,...,0.0,0.0,0.0,0.0,0.0,1.000000,0.000000,1,Uber,6


<a id="split"></a>
### Splitting the Data

In [4]:
pokemon_data.columns

Index(['hp', 'atk', 'def', 'spa', 'spd', 'spe', 'weight', 'height', 'formats',
       'generation', 'format codes', 'Weaknesses', 'Strong Weaknesses',
       'Resists', 'Strong Resists', 'Immune', 'STAB', 'Resistance Index',
       'Entry Hazards', 'Hazard Removal', 'Removal Deterrent', 'Cleric',
       'Pivot', 'Item Removal', 'Setup', 'Priority', 'HP Drain', 'HP Recovery',
       'Weather Set', 'Weather Gimmick', 'Physical Cutoff 1',
       'Physical Cutoff 2', 'Physical Cutoff 3', 'Physical Cutoff 4',
       'Physical Cutoff 5', 'Physical Cutoff 6', 'Physical Coverage 1',
       'Physical Coverage 2', 'Physical Coverage 3', 'Physical Coverage 4',
       'Special Cutoff 1', 'Special Cutoff 2', 'Special Cutoff 3',
       'Special Cutoff 4', 'Special Cutoff 5', 'Special Cutoff 6',
       'Special Cutoff 7', 'Special Coverage 1', 'Special Coverage 2',
       'Special Coverage 3', 'Special Coverage 4', 'Special Coverage 5',
       'Special Coverage 6', 'Special Coverage 7', 'Special Cove

In [5]:
X = pokemon_data.drop(columns=['weight', 'height', 'Weaknesses', 'Strong Weaknesses', 'Resists',
                                'Strong Resists', 'Immune', 'STAB', 'Physical Cutoff 1', 'Physical Cutoff 2',
                                'Physical Cutoff 4', 'Physical Cutoff 5', 'Physical Cutoff 6',
                                'Physical Coverage 1', 'Physical Coverage 2', 'Physical Coverage 4',
                                'Special Cutoff 1', 'Special Cutoff 2', 'Special Cutoff 4',
                                'Special Cutoff 5', 'Special Cutoff 6', 'Special Cutoff 7',
                                'Special Coverage 1', 'Special Coverage 2', 'Special Coverage 3',
                                'Special Coverage 4', 'Special Coverage 6', 'Special Coverage 7',
                                'Special Coverage 8', 'Special Coverage 9', 'Special Coverage 10',
                                'Ability Cutoff 1', 'Ability Cutoff 2', 'Ability Cutoff 4', 'Ability Cutoff 5',
                                'Ability Cutoff 6', 'Best Ability <100', 'formats', 'generation',
                                'format codes', 'oldformats', 'oldformat codes'])

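# Note: only 'formats' and 'format codes' are selected from pokemon_data here, so the
# 'oldformats' and 'oldformat codes' columns requested via `columns=` are created empty (NaN).
y_df = pd.DataFrame(pokemon_data[['formats', 'format codes']], index=pokemon_data.index, columns=['formats', 'format codes', 'oldformats', 'oldformat codes'])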
y_df['formats4'] = y_df['formats'].replace({'ZU':'Not c', 'PU': 'Low c', 'NU': 'Mid c', 'RU': 'Mid c', 'UU': 'Mid c', 'OU': 'High c', 'Uber': 'High c'})
y_df['format codes4'] = y_df['format codes'].replace({3:2, 4: 2, 5:3, 6:3})
y_df['formats4alt'] = y_df['formats'].replace({'ZU':'Not c', 'PU': 'Low c', 'NU': 'Mid c', 'RU': 'Mid c', 'UU': 'Mid c', 'OU': 'Mid c', 'Uber': 'High c'})
y_df['format codes4alt'] = y_df['format codes'].replace({3:2, 4: 2, 5:2, 6:3})
y_df['formats2'] = y_df['formats'].replace({'ZU':'No', 'PU': 'Yes', 'NU': 'Yes', 'RU': 'Yes', 'UU': 'Yes', 'OU': 'Yes', 'Uber': 'Yes'})
y_df

name,formats,format codes,oldformats,oldformat codes,formats4,format codes4,formats4alt,format codes4alt,formats2
Bulbasaur,ZU,0,,,Not c,0,Not c,0,No
Ivysaur,ZU,0,,,Not c,0,Not c,0,No
Venusaur,OU,5,,,High c,3,Mid c,2,Yes
Charmander,ZU,0,,,Not c,0,Not c,0,No
Charmeleon,ZU,0,,,Not c,0,Not c,0,No
...,...,...,...,...,...,...,...,...,...
Glastrier,NU,2,,,Mid c,2,Mid c,2,Yes
Spectrier,Uber,6,,,High c,3,High c,3,Yes
Calyrex,PU,1,,,Low c,1,Low c,1,Yes
Calyrex-Ice,Uber,6,,,High c,3,High c,3,Yes
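
The collapsed targets above all derive from the same seven tiers. As a minimal sketch on a toy series, one way to guard against a tier being left out of a mapping is to check coverage before applying it. The helper name `collapse_tiers` and the toy values are assumptions for illustration, not part of the notebook:

```python
import pandas as pd

# Same grouping as the 'formats4' column in the notebook.
FOUR_CLASS = {'ZU': 'Not c', 'PU': 'Low c', 'NU': 'Mid c', 'RU': 'Mid c',
              'UU': 'Mid c', 'OU': 'High c', 'Uber': 'High c'}

def collapse_tiers(formats, mapping):
    """Map tier labels to coarser classes, failing loudly on unmapped labels."""
    missing = set(formats.unique()) - set(mapping)
    if missing:
        raise ValueError(f"unmapped tiers: {sorted(missing)}")
    return formats.map(mapping)

# Toy stand-in for y_df['formats'] (hypothetical values).
toy = pd.Series(['ZU', 'OU', 'NU', 'Uber', 'PU'], index=['a', 'b', 'c', 'd', 'e'])
collapsed = collapse_tiers(toy, FOUR_CLASS)
print(collapsed.tolist())
```

Unlike `.replace`, `.map` would turn an unmapped label into NaN rather than pass it through silently, which is why the explicit check comes first.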


<a id="adjusting"></a>
### Adjusting some features

- Remove: Ability Cutoff 3 and Unique Powerful Ability

In [6]:
X.columns

Index(['hp', 'atk', 'def', 'spa', 'spd', 'spe', 'Resistance Index',
       'Entry Hazards', 'Hazard Removal', 'Removal Deterrent', 'Cleric',
       'Pivot', 'Item Removal', 'Setup', 'Priority', 'HP Drain', 'HP Recovery',
       'Weather Set', 'Weather Gimmick', 'Physical Cutoff 3',
       'Physical Coverage 3', 'Special Cutoff 3', 'Special Coverage 5',
       'Misc Status', 'Unique Powerful Move', 'Ability Cutoff 3',
       'Best Ability', 'Unique Powerful Ability'],
      dtype='object')

In [7]:
X.drop(columns=['Ability Cutoff 3', 'Unique Powerful Ability'], inplace=True)
X

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Hazard Removal,Removal Deterrent,...,HP Recovery,Weather Set,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability
Bulbasaur,45,49,49,65,65,45,2,0,0,0,...,1,0,5,4,3,5,2,4,0,63.636364
Ivysaur,60,62,63,80,80,60,2,0,0,0,...,1,0,5,4,3,5,2,4,0,63.636364
Venusaur,80,82,83,100,100,80,2,0,0,0,...,1,0,5,6,4,6,4,4,0,63.636364
Charmander,39,52,43,60,50,65,3,0,0,0,...,0,0,3,10,9,6,2,3,0,50.000000
Charmeleon,58,64,58,80,65,80,3,0,0,0,...,0,0,3,10,9,6,2,3,0,50.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,0,1,12,7,3,2,2,0,75.000000
Spectrier,100,65,60,145,80,130,8,0,0,1,...,0,0,0,4,4,4,3,3,0,1.000000
Calyrex,100,80,80,80,80,80,-2,0,0,0,...,0,0,2,3,3,9,4,3,0,18.181818
Calyrex-Ice,100,165,150,85,130,50,-4,0,0,0,...,0,0,2,15,9,12,5,3,1,1.000000


In [8]:
X.columns

Index(['hp', 'atk', 'def', 'spa', 'spd', 'spe', 'Resistance Index',
       'Entry Hazards', 'Hazard Removal', 'Removal Deterrent', 'Cleric',
       'Pivot', 'Item Removal', 'Setup', 'Priority', 'HP Drain', 'HP Recovery',
       'Weather Set', 'Weather Gimmick', 'Physical Cutoff 3',
       'Physical Coverage 3', 'Special Cutoff 3', 'Special Coverage 5',
       'Misc Status', 'Unique Powerful Move', 'Best Ability'],
      dtype='object')

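- Fold Weather Set into Weather Gimmick: Pokemon that can set weather get a new level, 6, in Weather Gimmick, and the Weather Set column is then dropped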

In [9]:
X['Weather Gimmick'].value_counts()

2    289
1    171
0    161
5     70
3     40
4      7
Name: Weather Gimmick, dtype: int64

In [10]:
X['Weather Set'].value_counts()

0    709
1     29
Name: Weather Set, dtype: int64

In [11]:
X.loc[X['Weather Set'] == 1, 'Weather Gimmick'] = 6
X['Weather Gimmick'].value_counts()

2    265
1    167
0    161
5     70
3     39
6     29
4      7
Name: Weather Gimmick, dtype: int64

In [12]:
X.drop(columns=['Weather Set'], inplace=True)
X

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Hazard Removal,Removal Deterrent,...,HP Drain,HP Recovery,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability
Bulbasaur,45,49,49,65,65,45,2,0,0,0,...,2,1,5,4,3,5,2,4,0,63.636364
Ivysaur,60,62,63,80,80,60,2,0,0,0,...,2,1,5,4,3,5,2,4,0,63.636364
Venusaur,80,82,83,100,100,80,2,0,0,0,...,2,1,5,6,4,6,4,4,0,63.636364
Charmander,39,52,43,60,50,65,3,0,0,0,...,0,0,3,10,9,6,2,3,0,50.000000
Charmeleon,58,64,58,80,65,80,3,0,0,0,...,0,0,3,10,9,6,2,3,0,50.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,0,1,12,7,3,2,2,0,75.000000
Spectrier,100,65,60,145,80,130,8,0,0,1,...,0,0,0,4,4,4,3,3,0,1.000000
Calyrex,100,80,80,80,80,80,-2,0,0,0,...,2,0,2,3,3,9,4,3,0,18.181818
Calyrex-Ice,100,165,150,85,130,50,-4,0,0,0,...,2,0,2,15,9,12,5,3,1,1.000000


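- Fold HP Drain and HP Recovery together into a single recovery feature: HP Recovery levels 1 and 2 overwrite HP Drain with levels 3 and 4, and the combined column is then renamed HP Recovery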

In [13]:
X['HP Recovery'].value_counts()

0    517
1    184
2     37
Name: HP Recovery, dtype: int64

In [14]:
X['HP Drain'].value_counts()

0    482
2    202
1     49
3      4
4      1
Name: HP Drain, dtype: int64

In [15]:
X.loc[X['HP Recovery'] == 1, 'HP Drain'] = 3
X.loc[X['HP Recovery'] == 2, 'HP Drain'] = 4
X['HP Drain'].value_counts()

0    382
3    187
2     91
1     41
4     37
Name: HP Drain, dtype: int64
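
Because the HP Drain counts above already included a few rows at levels 3 and 4 before the fold, it may be worth checking which combinations of the two columns occur before overwriting. The following is only a sketch on a hypothetical toy frame (the values are invented; the column names follow the notebook):

```python
import pandas as pd

# Toy frame with hypothetical values; column names follow the notebook.
toy = pd.DataFrame({'HP Drain':    [0, 2, 3, 1, 4, 0],
                    'HP Recovery': [0, 1, 0, 2, 0, 1]})

# Cross-tabulate the two columns to see which (drain, recovery) pairs occur.
print(pd.crosstab(toy['HP Drain'], toy['HP Recovery']))

# Rows already at drain level 3 or 4 with no recovery move would become
# indistinguishable from recovery users once the fold is applied.
ambiguous = toy[(toy['HP Drain'] >= 3) & (toy['HP Recovery'] == 0)]
print(len(ambiguous))

# The fold itself, written the same way as in the notebook cell.
folded = toy['HP Drain'].copy()
folded.loc[toy['HP Recovery'] == 1] = 3
folded.loc[toy['HP Recovery'] == 2] = 4
print(folded.tolist())
```

If the ambiguous count is non-zero on the real data, shifting the recovery levels to unused values (for example 5 and 6) would be one way to keep the groups apart.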

In [16]:
X.drop(columns=['HP Recovery'], inplace=True)
X

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Hazard Removal,Removal Deterrent,...,Priority,HP Drain,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability
Bulbasaur,45,49,49,65,65,45,2,0,0,0,...,0,3,5,4,3,5,2,4,0,63.636364
Ivysaur,60,62,63,80,80,60,2,0,0,0,...,0,3,5,4,3,5,2,4,0,63.636364
Venusaur,80,82,83,100,100,80,2,0,0,0,...,0,3,5,6,4,6,4,4,0,63.636364
Charmander,39,52,43,60,50,65,3,0,0,0,...,1,0,3,10,9,6,2,3,0,50.000000
Charmeleon,58,64,58,80,65,80,3,0,0,0,...,1,0,3,10,9,6,2,3,0,50.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,0,1,12,7,3,2,2,0,75.000000
Spectrier,100,65,60,145,80,130,8,0,0,1,...,0,0,0,4,4,4,3,3,0,1.000000
Calyrex,100,80,80,80,80,80,-2,0,0,0,...,0,2,2,3,3,9,4,3,0,18.181818
Calyrex-Ice,100,165,150,85,130,50,-4,0,0,0,...,0,2,2,15,9,12,5,3,1,1.000000


In [17]:
X['HP Recovery'] = X['HP Drain']
X

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Hazard Removal,Removal Deterrent,...,HP Drain,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability,HP Recovery
Bulbasaur,45,49,49,65,65,45,2,0,0,0,...,3,5,4,3,5,2,4,0,63.636364,3
Ivysaur,60,62,63,80,80,60,2,0,0,0,...,3,5,4,3,5,2,4,0,63.636364,3
Venusaur,80,82,83,100,100,80,2,0,0,0,...,3,5,6,4,6,4,4,0,63.636364,3
Charmander,39,52,43,60,50,65,3,0,0,0,...,0,3,10,9,6,2,3,0,50.000000,0
Charmeleon,58,64,58,80,65,80,3,0,0,0,...,0,3,10,9,6,2,3,0,50.000000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,1,12,7,3,2,2,0,75.000000,0
Spectrier,100,65,60,145,80,130,8,0,0,1,...,0,0,4,4,4,3,3,0,1.000000,0
Calyrex,100,80,80,80,80,80,-2,0,0,0,...,2,2,3,3,9,4,3,0,18.181818,2
Calyrex-Ice,100,165,150,85,130,50,-4,0,0,0,...,2,2,15,9,12,5,3,1,1.000000,2


In [18]:
X.drop(columns=['HP Drain'], inplace=True)
X

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Hazard Removal,Removal Deterrent,...,Priority,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability,HP Recovery
Bulbasaur,45,49,49,65,65,45,2,0,0,0,...,0,5,4,3,5,2,4,0,63.636364,3
Ivysaur,60,62,63,80,80,60,2,0,0,0,...,0,5,4,3,5,2,4,0,63.636364,3
Venusaur,80,82,83,100,100,80,2,0,0,0,...,0,5,6,4,6,4,4,0,63.636364,3
Charmander,39,52,43,60,50,65,3,0,0,0,...,1,3,10,9,6,2,3,0,50.000000,0
Charmeleon,58,64,58,80,65,80,3,0,0,0,...,1,3,10,9,6,2,3,0,50.000000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,1,12,7,3,2,2,0,75.000000,0
Spectrier,100,65,60,145,80,130,8,0,0,1,...,0,0,4,4,4,3,3,0,1.000000,0
Calyrex,100,80,80,80,80,80,-2,0,0,0,...,0,2,3,3,9,4,3,0,18.181818,2
Calyrex-Ice,100,165,150,85,130,50,-4,0,0,0,...,0,2,15,9,12,5,3,1,1.000000,2


In [19]:
X['HP Recovery'].value_counts()

0    382
3    187
2     91
1     41
4     37
Name: HP Recovery, dtype: int64

- Considering: Removal Deterrent (could arguably just be removed, since it's based on abilities), Hazard Removal, Cleric, and Entry Hazards (the last three might be folded into Misc Status)

In [20]:
X.drop(columns=['Removal Deterrent'], inplace=True)
X

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Hazard Removal,Cleric,...,Priority,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability,HP Recovery
Bulbasaur,45,49,49,65,65,45,2,0,0,0,...,0,5,4,3,5,2,4,0,63.636364,3
Ivysaur,60,62,63,80,80,60,2,0,0,0,...,0,5,4,3,5,2,4,0,63.636364,3
Venusaur,80,82,83,100,100,80,2,0,0,0,...,0,5,6,4,6,4,4,0,63.636364,3
Charmander,39,52,43,60,50,65,3,0,0,0,...,1,3,10,9,6,2,3,0,50.000000,0
Charmeleon,58,64,58,80,65,80,3,0,0,0,...,1,3,10,9,6,2,3,0,50.000000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,1,12,7,3,2,2,0,75.000000,0
Spectrier,100,65,60,145,80,130,8,0,0,0,...,0,0,4,4,4,3,3,0,1.000000,0
Calyrex,100,80,80,80,80,80,-2,0,0,1,...,0,2,3,3,9,4,3,0,18.181818,2
Calyrex-Ice,100,165,150,85,130,50,-4,0,0,1,...,0,2,15,9,12,5,3,1,1.000000,2


In [21]:
X['Misc Status'].value_counts()

3    335
2    234
1     89
0     41
4     35
5      4
Name: Misc Status, dtype: int64

In [22]:
X['Hazard Removal'].value_counts()

0    558
1    174
2      6
Name: Hazard Removal, dtype: int64

In [23]:
X.loc[X['Hazard Removal'] == 1, 'Misc Status'] = 4
X['Misc Status'].value_counts()

3    249
4    205
2    185
1     63
0     32
5      4
Name: Misc Status, dtype: int64

In [24]:
X.drop(columns=['Hazard Removal'], inplace=True)
X

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Cleric,Pivot,...,Priority,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability,HP Recovery
Bulbasaur,45,49,49,65,65,45,2,0,0,0,...,0,5,4,3,5,2,4,0,63.636364,3
Ivysaur,60,62,63,80,80,60,2,0,0,0,...,0,5,4,3,5,2,4,0,63.636364,3
Venusaur,80,82,83,100,100,80,2,0,0,0,...,0,5,6,4,6,4,4,0,63.636364,3
Charmander,39,52,43,60,50,65,3,0,0,0,...,1,3,10,9,6,2,3,0,50.000000,0
Charmeleon,58,64,58,80,65,80,3,0,0,0,...,1,3,10,9,6,2,3,0,50.000000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,1,12,7,3,2,2,0,75.000000,0
Spectrier,100,65,60,145,80,130,8,0,0,0,...,0,0,4,4,4,3,3,0,1.000000,0
Calyrex,100,80,80,80,80,80,-2,0,1,0,...,0,2,3,3,9,4,3,0,18.181818,2
Calyrex-Ice,100,165,150,85,130,50,-4,0,1,0,...,0,2,15,9,12,5,3,1,1.000000,2


I'll leave the others (Cleric and Entry Hazards) alone for now. Updating them would be complicated, and it's probably not even a good idea, since they performed better than Hazard Removal.

<a id="clustering"></a>
### Clustering

In [25]:
cluster_dfs = {}

n_clusters = list(range(5, 35, 5))
n_clusters

[5, 10, 15, 20, 25, 30]

These are the cluster counts we'll test in each model that uses clusters, which is only half of the models. We'll want to remember to convert the cluster labels to categories.
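
Cluster ids are arbitrary names rather than magnitudes, so treating them as plain integers would impose an ordering that has no meaning. One possible way to do the conversion to categories is sketched below with `pd.get_dummies`; the toy labels are hypothetical, and in the notebook they would come from one of the frames in cluster_dfs:

```python
import pandas as pd

# Hypothetical cluster labels for four Pokemon, one column per cluster type.
clusters = pd.DataFrame({'stats': [0, 2, 1, 2], 'abilities': [1, 1, 0, 1]},
                        index=['a', 'b', 'c', 'd'])

# Cast the integer ids to the category dtype, then one-hot encode them so that
# each cluster becomes its own indicator column.
dummies = pd.get_dummies(clusters.astype('category'))
print(dummies.columns.tolist())
```

The indicator columns can then be joined onto the feature frame by index before fitting a model.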

We will cluster four different subsets of features, as we did during EDA:
- one for overall features (scaled)
- one for stats (scaled)
- one for abilities (not scaled, because abilities are one-hot encoded)
- one for learnsets (not scaled, because learnsets are one-hot encoded)

Then we'll build six dataframes, one per cluster count, each holding the cluster labels for all four feature subsets, and store them in the cluster_dfs dictionary.

In [26]:
cluster5 = pd.DataFrame(index=X.index, columns=['features', 'stats', 'abilities', 'learnsets'])

X_scaled = StandardScaler().fit_transform(X)
stats_scaled = StandardScaler().fit_transform(X.loc[:, ['hp', 'atk', 'def', 'spa', 'spd', 'spe']])

kmeans = KMeans(n_clusters=5, random_state=273)
cluster5['features'] = kmeans.fit_predict(X_scaled)

kmeans = KMeans(n_clusters=5, random_state=273)
cluster5['stats'] = kmeans.fit_predict(stats_scaled)

kmeans = KMeans(n_clusters=5, random_state=273)
cluster5['abilities'] = kmeans.fit_predict(pokemon_abilities_df)

kmeans = KMeans(n_clusters=5, random_state=273)
cluster5['learnsets'] = kmeans.fit_predict(pokemon_learnsets_df)

cluster_dfs[5] = cluster5

In [27]:
cluster10 = pd.DataFrame(index=X.index, columns=['features', 'stats', 'abilities', 'learnsets'])

X_scaled = StandardScaler().fit_transform(X)
stats_scaled = StandardScaler().fit_transform(X.loc[:, ['hp', 'atk', 'def', 'spa', 'spd', 'spe']])

kmeans = KMeans(n_clusters=10, random_state=273)
cluster10['features'] = kmeans.fit_predict(X_scaled)

kmeans = KMeans(n_clusters=10, random_state=273)
cluster10['stats'] = kmeans.fit_predict(stats_scaled)

kmeans = KMeans(n_clusters=10, random_state=273)
cluster10['abilities'] = kmeans.fit_predict(pokemon_abilities_df)

kmeans = KMeans(n_clusters=10, random_state=273)
cluster10['learnsets'] = kmeans.fit_predict(pokemon_learnsets_df)

cluster_dfs[10] = cluster10

In [28]:
cluster15 = pd.DataFrame(index=X.index, columns=['features', 'stats', 'abilities', 'learnsets'])

X_scaled = StandardScaler().fit_transform(X)
stats_scaled = StandardScaler().fit_transform(X.loc[:, ['hp', 'atk', 'def', 'spa', 'spd', 'spe']])

kmeans = KMeans(n_clusters=15, random_state=273)
cluster15['features'] = kmeans.fit_predict(X_scaled)

kmeans = KMeans(n_clusters=15, random_state=273)
cluster15['stats'] = kmeans.fit_predict(stats_scaled)

kmeans = KMeans(n_clusters=15, random_state=273)
cluster15['abilities'] = kmeans.fit_predict(pokemon_abilities_df)

kmeans = KMeans(n_clusters=15, random_state=273)
cluster15['learnsets'] = kmeans.fit_predict(pokemon_learnsets_df)

cluster_dfs[15] = cluster15

In [29]:
cluster20 = pd.DataFrame(index=X.index, columns=['features', 'stats', 'abilities', 'learnsets'])

X_scaled = StandardScaler().fit_transform(X)
stats_scaled = StandardScaler().fit_transform(X.loc[:, ['hp', 'atk', 'def', 'spa', 'spd', 'spe']])

kmeans = KMeans(n_clusters=20, random_state=273)
cluster20['features'] = kmeans.fit_predict(X_scaled)

kmeans = KMeans(n_clusters=20, random_state=273)
cluster20['stats'] = kmeans.fit_predict(stats_scaled)

kmeans = KMeans(n_clusters=20, random_state=273)
cluster20['abilities'] = kmeans.fit_predict(pokemon_abilities_df)

kmeans = KMeans(n_clusters=20, random_state=273)
cluster20['learnsets'] = kmeans.fit_predict(pokemon_learnsets_df)

cluster_dfs[20] = cluster20

In [30]:
cluster25 = pd.DataFrame(index=X.index, columns=['features', 'stats', 'abilities', 'learnsets'])

X_scaled = StandardScaler().fit_transform(X)
stats_scaled = StandardScaler().fit_transform(X.loc[:, ['hp', 'atk', 'def', 'spa', 'spd', 'spe']])

kmeans = KMeans(n_clusters=25, random_state=273)
cluster25['features'] = kmeans.fit_predict(X_scaled)

kmeans = KMeans(n_clusters=25, random_state=273)
cluster25['stats'] = kmeans.fit_predict(stats_scaled)

kmeans = KMeans(n_clusters=25, random_state=273)
cluster25['abilities'] = kmeans.fit_predict(pokemon_abilities_df)

kmeans = KMeans(n_clusters=25, random_state=273)
cluster25['learnsets'] = kmeans.fit_predict(pokemon_learnsets_df)

cluster_dfs[25] = cluster25

In [31]:
cluster30 = pd.DataFrame(index=X.index, columns=['features', 'stats', 'abilities', 'learnsets'])

X_scaled = StandardScaler().fit_transform(X)
stats_scaled = StandardScaler().fit_transform(X.loc[:, ['hp', 'atk', 'def', 'spa', 'spd', 'spe']])

kmeans = KMeans(n_clusters=30, random_state=273)
cluster30['features'] = kmeans.fit_predict(X_scaled)

kmeans = KMeans(n_clusters=30, random_state=273)
cluster30['stats'] = kmeans.fit_predict(stats_scaled)

kmeans = KMeans(n_clusters=30, random_state=273)
cluster30['abilities'] = kmeans.fit_predict(pokemon_abilities_df)

kmeans = KMeans(n_clusters=30, random_state=273)
cluster30['learnsets'] = kmeans.fit_predict(pokemon_learnsets_df)

cluster_dfs[30] = cluster30
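
The six cells above differ only in the cluster count, so they could be condensed into a loop. The sketch below shows the idea on synthetic data; the helper name `build_cluster_frames`, the toy arrays, and the explicit `n_init=10` are assumptions for illustration and not part of the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def build_cluster_frames(inputs, n_clusters_list, index, random_state=273):
    """Return {k: DataFrame of cluster labels}, with one column per named input."""
    frames = {}
    for k in n_clusters_list:
        labels = pd.DataFrame(index=index)
        for name, data in inputs.items():
            km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
            labels[name] = km.fit_predict(data)
        frames[k] = labels
    return frames

# Synthetic stand-ins (hypothetical): 60 rows, 6 numeric "stats", 8 binary "abilities".
rng = np.random.default_rng(0)
index = [f"mon{i}" for i in range(60)]
stats = rng.normal(size=(60, 6))
abilities = rng.integers(0, 2, size=(60, 8))

inputs = {
    'stats': StandardScaler().fit_transform(stats),  # scaled, as in the notebook
    'abilities': abilities,                          # one-hot style data left unscaled
}
toy_cluster_dfs = build_cluster_frames(inputs, [3, 5], index)
print(toy_cluster_dfs[5].head())
```

As in the original cells, the labels are assigned positionally, so every input is assumed to have its rows in the same order as the index.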

In [32]:
cluster_dfs

{5:                 features  stats  abilities  learnsets
 name                                                 
 Bulbasaur              0      1          2          1
 Ivysaur                1      1          2          1
 Venusaur               2      3          2          1
 Charmander             0      1          0          2
 Charmeleon             1      4          0          2
 ...                  ...    ...        ...        ...
 Glastrier              3      0          0          3
 Spectrier              1      3          0          3
 Calyrex                2      2          0          1
 Calyrex-Ice            4      0          0          1
 Calyrex-Shadow         4      3          0          1
 
 [738 rows x 4 columns],
 10:                 features  stats  abilities  learnsets
 name                                                 
 Bulbasaur              3      5          1          3
 Ivysaur                9      9          1          3
 Venusaur               9      

<a id="modeling_guidelines"></a>
### Modeling guidelines

How many models am I making:

One-stage: (3 + 1) x 2 = 8 models. The targets are 7 class, 4 class, 4 class modified, and 2 class, each fit without clusters and then with clusters.

Two-stage: (2 + 1) x 2 = 6 models. The targets are 7 class, 4 class, and 4 class modified, each fit without clusters and then with clusters.

That makes 14 models in total for each modeling type.

Modeling types: Logistic Regression, KNN, Decision Tree, Random Forest, CatBoost
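
The model tally for a single modeling type can be reproduced by enumerating the combinations. This is only a bookkeeping sketch; the list names are illustrative and not used elsewhere in the notebook:

```python
from itertools import product

# Target definitions per model family, as listed in the guidelines.
one_stage_targets = ['7 class', '4 class', '4 class modified', '2 class']
two_stage_targets = ['7 class', '4 class', '4 class modified']
cluster_options = ['no clusters', 'clusters']

# Every target is fit once without and once with cluster features.
one_stage = list(product(one_stage_targets, cluster_options))
two_stage = list(product(two_stage_targets, cluster_options))
print(len(one_stage), len(two_stage), len(one_stage) + len(two_stage))
```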

Extra considerations:

- For Logistic Regression and KNN we will need to scale our features.

- We might not bother with clustering for a model like logistic regression, though we can look into whether it would be worthwhile.

- The metric will be the weighted F1 score. There is no well-developed ROC curve for multi-class problems, and log loss is not a good fit for unbalanced classes. Weighted F1 should be especially appropriate for unbalanced classes and for cases like ours where precision and recall matter equally (a false positive costs no more than a false negative).
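
To make the metric choice concrete: the weighted F1 score is the mean of the per-class F1 scores, weighted by each class's support (its number of true samples). The sketch below checks this on a small set of hypothetical labels and compares it with the unweighted macro average:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_fscore_support

# Hypothetical imbalanced labels: 6 'ZU', 3 'PU', 1 'OU'.
y_true = ['ZU'] * 6 + ['PU'] * 3 + ['OU']
y_pred = ['ZU'] * 5 + ['PU'] + ['PU'] * 2 + ['ZU'] + ['OU']
labels = ['ZU', 'PU', 'OU']

# Per-class F1 and support (number of true samples per class).
_, _, f1_per_class, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0)

# Weighted F1 is the support-weighted mean of the per-class F1 scores.
manual = np.sum(f1_per_class * support) / np.sum(support)
weighted = f1_score(y_true, y_pred, labels=labels, average='weighted')
macro = f1_score(y_true, y_pred, labels=labels, average='macro')
print(manual, weighted, macro)
```

Because large classes dominate the weighted average, a model can score reasonably while failing completely on a small class, so the per-class scores are still worth inspecting.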

In [33]:
k_list = [2, 3, 5, 10]

cluster_types = list(cluster_dfs[5].columns)

<a id="dt"></a>
### Decision Tree

In [34]:
pipe = make_pipeline(
    DecisionTreeClassifier(random_state=273))

pipe.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'decisiontreeclassifier', 'decisiontreeclassifier__ccp_alpha', 'decisiontreeclassifier__class_weight', 'decisiontreeclassifier__criterion', 'decisiontreeclassifier__max_depth', 'decisiontreeclassifier__max_features', 'decisiontreeclassifier__max_leaf_nodes', 'decisiontreeclassifier__min_impurity_decrease', 'decisiontreeclassifier__min_impurity_split', 'decisiontreeclassifier__min_samples_leaf', 'decisiontreeclassifier__min_samples_split', 'decisiontreeclassifier__min_weight_fraction_leaf', 'decisiontreeclassifier__random_state', 'decisiontreeclassifier__splitter'])

In [35]:
param_grid = {'decisiontreeclassifier__max_depth': [5, 10, 20, 40],
              'decisiontreeclassifier__min_samples_split': [2, 5, 10, 20, 40],
              'decisiontreeclassifier__min_samples_leaf': [1, 2, 5, 10, 20, 40],
              'decisiontreeclassifier__class_weight': [None, 'balanced']}
param_grid

{'decisiontreeclassifier__max_depth': [5, 10, 20, 40],
 'decisiontreeclassifier__min_samples_split': [2, 5, 10, 20, 40],
 'decisiontreeclassifier__min_samples_leaf': [1, 2, 5, 10, 20, 40],
 'decisiontreeclassifier__class_weight': [None, 'balanced']}

#### 7 classes, no clusters

In [36]:
best = []

for k in k_list:
    pipe = make_pipeline(
        DecisionTreeClassifier(random_state=273))
    X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats'].values)
    dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
    dt_grid.fit(X_train, y_train)
    best.append(dt_grid.best_score_)

max(best), k_list[best.index(max(best))]

(0.5318290312188122, 10)

In [37]:
pipe = make_pipeline(
    DecisionTreeClassifier(random_state=273))
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats'].values)
dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=10, verbose=0)
dt_grid.fit(X_train, y_train)
dt_grid.best_params_, dt_grid.best_score_

({'decisiontreeclassifier__class_weight': None,
  'decisiontreeclassifier__max_depth': 10,
  'decisiontreeclassifier__min_samples_leaf': 5,
  'decisiontreeclassifier__min_samples_split': 40},
 0.5318290312188122)
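
Beyond the single best score, the fitted grid search also stores the mean and spread of the scores for every parameter combination in `cv_results_`. The sketch below, on synthetic data, shows one way to rank them in a table. The toy data and the smaller grid are assumptions, and the string `'f1_weighted'` is used as a shorthand for the scorer built with `make_scorer` in the notebook:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Pokemon features (hypothetical data).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, n_informative=4,
                                   n_classes=3, random_state=273)

grid = GridSearchCV(DecisionTreeClassifier(random_state=273),
                    {'max_depth': [2, 5, 10], 'min_samples_leaf': [1, 5, 20]},
                    scoring='f1_weighted', cv=5)
grid.fit(X_toy, y_toy)

# One row per parameter combination, sorted by rank (1 = best mean CV score).
results = (pd.DataFrame(grid.cv_results_)
             [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
             .sort_values('rank_test_score'))
print(results.head())
```

If the top few rows sit within one standard deviation of each other, the "best" parameters are largely interchangeable.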

In [38]:
f1_score(y_train, dt_grid.predict(X_train), labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'], average='weighted')

0.6167818874054956

In [39]:
precision_recall_fscore_support(y_train, dt_grid.predict(X_train), labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'])

(array([0.85306122, 0.52380952, 0.30769231, 0.        , 0.4       ,
        0.3442623 , 0.63636364]),
 array([0.83266932, 0.64705882, 0.51612903, 0.        , 0.17647059,
        0.48837209, 0.525     ]),
 array([0.84274194, 0.57894737, 0.38554217, 0.        , 0.24489796,
        0.40384615, 0.57534247]),
 array([251, 119,  31,  35,  34,  43,  40], dtype=int64))
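
The four arrays above are precision, recall, F1 and support, in the order of the `labels` argument. A labelled table may be easier to read; the sketch below builds one from hypothetical predictions in which RU is never predicted, using `zero_division=0` so that the undefined precision is reported as 0 without a warning:

```python
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

labels = ['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber']

# Hypothetical predictions in which 'RU' is never predicted.
y_true = ['ZU', 'ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber']
y_pred = ['ZU', 'ZU', 'PU', 'NU', 'NU', 'UU', 'OU', 'Uber']

# Each returned array has one entry per label, in the order given by `labels`.
p, r, f, s = precision_recall_fscore_support(y_true, y_pred, labels=labels, zero_division=0)
report = pd.DataFrame({'precision': p, 'recall': r, 'f1': f, 'support': s}, index=labels)
print(report.round(2))
```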

This decision tree scores poorly compared to logistic regression and KNN. The overall average is a bit lower, and while it does surprisingly well on NU (though still not well in absolute terms), it got nothing correct for RU, and UU, OU and Uber saw significant decreases. Still, let's visualize the tree to see what it looks like:

In [40]:
export_graphviz(dt_grid.best_estimator_['decisiontreeclassifier'], out_file='trees/7classdt.dot',
               feature_names = X_train.columns,
               class_names = sorted(list(y_train.unique())), #needs to be in alphabetical order
               rounded = True, proportion = False, precision = 2, filled = True)

#There seems to be a problem with "GraphViz's executables not found"
#So I'm inserting some additional lines to fix it
os.environ["PATH"] += os.pathsep + r'C:\Users\Owner\anaconda3\pkgs\graphviz-2.38-hfd603c8_2\Library\bin\graphviz'

render('dot', 'png', 'trees/7classdt.dot', outfile='trees/7classdt.png')

'trees\\7classdt.png'

Yes, this tree is pretty unreasonable. For example, there is one split in which, if the attack is over 91, it puts the pokemon in ZU, the lowest tier, and gives it a chance to be in a higher tier if its attack is higher. This obviously doesn't capture the essence of the class.
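Individual splits like this can also be read without GraphViz. The sketch below uses `sklearn.tree.export_text` to print a tree's rules as indented text; the iris dataset is only a stand-in for `X_train` and `y_train`, and the shallow `max_depth=2` is chosen to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data: the iris dataset replaces X_train / y_train here.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=273)
tree.fit(iris.data, iris.target)

# export_text prints each split as an indented rule, no GraphViz needed.
report = export_text(tree, feature_names=list(iris.feature_names))
print(report)
```

In this notebook the equivalent call would presumably be `export_text(dt_grid.best_estimator_['decisiontreeclassifier'], feature_names=list(X_train.columns))`.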

#### 4 class no clusters

In [41]:
best = []

for k in k_list:
    pipe = make_pipeline(
        DecisionTreeClassifier(random_state=273))
    X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats4'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats4'].values)
    dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
    dt_grid.fit(X_train, y_train)
    best.append(dt_grid.best_score_)

max(best), k_list[best.index(max(best))]

(0.5959004981971628, 10)

In [42]:
pipe = make_pipeline(
    DecisionTreeClassifier(random_state=273))
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats4'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats4'].values)
dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=10, verbose=0)
dt_grid.fit(X_train, y_train)
dt_grid.best_params_, dt_grid.best_score_

({'decisiontreeclassifier__class_weight': None,
  'decisiontreeclassifier__max_depth': 10,
  'decisiontreeclassifier__min_samples_leaf': 1,
  'decisiontreeclassifier__min_samples_split': 2},
 0.5959004981971628)

Interesting: this one has much lower values for the minimum split and leaf samples, but the cross-validated performance is still quite poor.

In [43]:
f1_score(y_train, dt_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.9801185987970462

In [44]:
precision_recall_fscore_support(y_train, dt_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.99193548, 0.96638655, 0.98      , 0.96511628]),
 array([0.98007968, 0.96638655, 0.98      , 1.        ]),
 array([0.98597194, 0.96638655, 0.98      , 0.98224852]),
 array([251, 119, 100,  83], dtype=int64))

As with KNN, this decision tree is probably massively overfitting: the cross-validation score is low (about 0.60), yet the weighted F1 on the training set is nearly 1 (0.98). It's hard to believe, though maybe we have so few samples that cross-validation gives a skewed impression, since there are so few examples in each fold. As with KNN, we'll have to see how it performs on the test set before we can believe it.

One thing we know is that low values for the minimum leaf and split samples can lead to overfitting, so that's probably a large part of what's happening here. However, these values were chosen by cross-validation, so they still performed better there than higher minimum samples did.
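The gap between training and cross-validated scores can be reproduced on synthetic data. This is only a sketch: `make_classification` with noisy labels stands in for the pokemon features, and the two `min_samples_leaf` values are picked for contrast rather than taken from `param_grid`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the pokemon features: 4 classes, noisy labels.
X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     n_informative=6, n_classes=4,
                                     flip_y=0.2, random_state=273)

gap = {}
for leaf in (1, 10):
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=273)
    # Cross-validated weighted F1 (the tree is cloned and refit per fold).
    cv_f1 = cross_val_score(tree, X_demo, y_demo, cv=5,
                            scoring='f1_weighted').mean()
    # Weighted F1 on the same rows the tree was trained on.
    train_f1 = f1_score(y_demo, tree.fit(X_demo, y_demo).predict(X_demo),
                        average='weighted')
    gap[leaf] = (train_f1, cv_f1)
    print(leaf, round(train_f1, 3), round(cv_f1, 3))
```

With a leaf size of 1 the tree can memorize the noisy labels, so its training score says little about how it will do on unseen rows.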

Let's see what it's getting wrong:

In [45]:
wrong = pd.DataFrame(dt_grid.predict(X_train), index=X_train.index).merge(y_train, on='name')
wrong.loc[wrong[0] != wrong['formats4']]

name,0,formats4
Gourgeist-Large,Mid c,Not c
Gastrodon,High c,Mid c
Silvally-Bug,Not c,Low c
Rotom-Fan,High c,Low c
Silvally-Dark,Low c,Not c
Thievul,High c,Not c
Silvally-Steel,Low c,Mid c
Silvally-Dragon,Not c,Low c
Lanturn,Mid c,Low c
Silvally-Flying,Low c,Not c


It's more than just Silvally this time, though there is a lot of that. It's interesting that the model always guesses non-Silvally pokemon as higher than they are.
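The direction of each error can be checked by ranking the tiers. The sketch below retypes the ten rows from the table above; the column names `predicted`, `actual`, `offset` and `silvally` are introduced here and are not part of the notebook's dataframes.

```python
import pandas as pd

# The ten training-set errors listed above: (name, predicted, actual).
errors = pd.DataFrame(
    [('Gourgeist-Large', 'Mid c', 'Not c'),
     ('Gastrodon', 'High c', 'Mid c'),
     ('Silvally-Bug', 'Not c', 'Low c'),
     ('Rotom-Fan', 'High c', 'Low c'),
     ('Silvally-Dark', 'Low c', 'Not c'),
     ('Thievul', 'High c', 'Not c'),
     ('Silvally-Steel', 'Low c', 'Mid c'),
     ('Silvally-Dragon', 'Not c', 'Low c'),
     ('Lanturn', 'Mid c', 'Low c'),
     ('Silvally-Flying', 'Low c', 'Not c')],
    columns=['name', 'predicted', 'actual']).set_index('name')

# Rank the tiers so that predicted - actual gives the error direction.
tier_rank = {'Not c': 0, 'Low c': 1, 'Mid c': 2, 'High c': 3}
errors['offset'] = (errors['predicted'].map(tier_rank)
                    - errors['actual'].map(tier_rank))
errors['silvally'] = errors.index.str.startswith('Silvally')

print(errors.groupby('silvally')['offset'].agg(['min', 'max', 'mean']))
```

A positive offset means the model guessed a higher tier than the true one, which is the case for every non-Silvally row here.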

Now let's make our visualization:

In [46]:
export_graphviz(dt_grid.best_estimator_['decisiontreeclassifier'], out_file='trees/4classdt.dot',
               feature_names = X_train.columns,
               class_names = sorted(list(y_train.unique())), #needs to be in alphabetical order
               rounded = True, proportion = False, precision = 2, filled = True)

render('dot', 'png', 'trees/4classdt.dot', outfile='trees/4classdt.png')

'trees\\4classdt.png'

The low minimum samples make this tree quite complex. There are still at least some nonsensical splits: in the bottom right corner, for example, a split puts a pokemon in low competitive if its stats are low on a bunch of things that aren't bad, even though low competitive isn't the lowest class. However, this split has a sample count of 1, which is both an example of absurd overfitting and an explanation of why such nonsense is possible. So I'm also inclined to think that this model isn't very good.

#### 4 class alt no clusters

In [47]:
best = []

for k in k_list:
    pipe = make_pipeline(
        DecisionTreeClassifier(random_state=273))
    X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats4alt'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats4alt'].values)
    dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
    dt_grid.fit(X_train, y_train)
    best.append(dt_grid.best_score_)

max(best), k_list[best.index(max(best))]

(0.6235339024869554, 5)

In [48]:
pipe = make_pipeline(
    DecisionTreeClassifier(random_state=273))
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats4alt'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats4alt'].values)
dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=5, verbose=0)
dt_grid.fit(X_train, y_train)
dt_grid.best_params_, dt_grid.best_score_

({'decisiontreeclassifier__class_weight': 'balanced',
  'decisiontreeclassifier__max_depth': 20,
  'decisiontreeclassifier__min_samples_leaf': 5,
  'decisiontreeclassifier__min_samples_split': 2},
 0.6235339024869554)

This is the first time that the tree is using balanced class weights! It's also a very deep tree. The min samples split is very low, but a min samples leaf of 5 is probably less prone to overfitting than the last tree's value of 1, at least.
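For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * count)`. The sketch below applies that formula to the formats4alt training counts that appear in the support array of the next output; the dictionary names are introduced here for illustration.

```python
# Training-set class counts for formats4alt
# (order: Not c, Low c, Mid c, High c).
counts = {'Not c': 251, 'Low c': 119, 'Mid c': 143, 'High c': 40}
n_samples = sum(counts.values())
n_classes = len(counts)

# class_weight='balanced' uses n_samples / (n_classes * count) per class.
balanced = {label: n_samples / (n_classes * n) for label, n in counts.items()}

for label, weight in balanced.items():
    print(f'{label}: {weight:.3f}')
```

Under this weighting a High c sample counts several times as much as a Not c sample, which may push the tree toward predicting High c more readily.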

In [49]:
f1_score(y_train, dt_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.8175913105129116

In [50]:
precision_recall_fscore_support(y_train, dt_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.9587156 , 0.72093023, 0.83464567, 0.50632911]),
 array([0.83266932, 0.78151261, 0.74125874, 1.        ]),
 array([0.891258  , 0.75      , 0.78518519, 0.67226891]),
 array([251, 119, 143,  40], dtype=int64))

This performance is actually a lot more reasonable. It's possible there is some overfitting, but it's performing more similarly to some of our better two-stage KNN models.

In [51]:
export_graphviz(dt_grid.best_estimator_['decisiontreeclassifier'], out_file='trees/4classaltdt.dot',
               feature_names = X_train.columns,
               class_names = sorted(list(y_train.unique())), #needs to be in alphabetical order
               rounded = True, proportion = False, precision = 2, filled = True)

render('dot', 'png', 'trees/4classaltdt.dot', outfile='trees/4classaltdt.png')

'trees\\4classaltdt.png'

#### 2 class no clusters

In [52]:
best = []

for k in k_list:
    pipe = make_pipeline(
        DecisionTreeClassifier(random_state=273))
    X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
    dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
    dt_grid.fit(X_train, y_train)
    best.append(dt_grid.best_score_)

max(best), k_list[best.index(max(best))]

(0.8567492190328831, 10)

In [53]:
pipe = make_pipeline(
    DecisionTreeClassifier(random_state=273))
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=10, verbose=0)
dt_grid.fit(X_train, y_train)
dt_grid.best_params_, dt_grid.best_score_

({'decisiontreeclassifier__class_weight': 'balanced',
  'decisiontreeclassifier__max_depth': 10,
  'decisiontreeclassifier__min_samples_leaf': 10,
  'decisiontreeclassifier__min_samples_split': 2},
 0.8567492190328831)

Again we get a higher value for min samples leaf, so probably this will be less overfitted, and that does seem to show in the scores.

In [54]:
f1_score(y_train, dt_grid.predict(X_train), labels=['No', 'Yes'], average='weighted')

0.8895905415239665

In [55]:
precision_recall_fscore_support(y_train, dt_grid.predict(X_train), labels=['No', 'Yes'])

(array([0.88617886, 0.89250814]),
 array([0.8685259 , 0.90728477]),
 array([0.87726358, 0.8998358 ]),
 array([251, 302], dtype=int64))

This is a very close score to the other algorithms, so it's probably a decent model.

In [56]:
export_graphviz(dt_grid.best_estimator_['decisiontreeclassifier'], out_file='trees/2classdt.dot',
               feature_names = X_train.columns,
               class_names = sorted(list(y_train.unique())), #needs to be in alphabetical order
               rounded = True, proportion = False, precision = 2, filled = True)

render('dot', 'png', 'trees/2classdt.dot', outfile='trees/2classdt.png')

'trees\\2classdt.png'

<a id="dt_cluster"></a>
#### Decision Tree with clustering

#### 7 class with clustering

Since one-hot encoded columns shouldn't really be scaled, we can make the rest of our data compatible with them via min-max scaling between 0 and 1 (which is the default range for MinMaxScaler).
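One way to keep the scaling next to the model is to put it inside the pipeline with a `ColumnTransformer`, so that only the numeric columns are scaled and the one-hot cluster columns pass through. This is a sketch on made-up toy data, not the notebook's `X` or `cluster_dfs`; the names `X_toy`, `clusters`, `y_toy` and `prep` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins for X and one cluster assignment (values are made up).
X_toy = pd.DataFrame({'hp': [80, 78, 35, 60, 100, 45],
                      'atk': [82, 84, 55, 90, 145, 30]},
                     index=pd.Index(list('abcdef'), name='name'))
clusters = pd.Series([0, 1, 0, 2, 1, 2], index=X_toy.index, name='cluster')
y_toy = pd.Series(['Yes', 'Yes', 'No', 'No', 'Yes', 'No'], index=X_toy.index)

# One-hot encode the cluster labels and join on the shared 'name' index.
dummies = pd.get_dummies(clusters, prefix='cluster', dtype=float)
X_final = X_toy.join(dummies)

# Scale only the numeric columns; the dummy columns pass through unchanged.
prep = ColumnTransformer([('scale', MinMaxScaler(), list(X_toy.columns))],
                         remainder='passthrough')
pipe = make_pipeline(prep, DecisionTreeClassifier(random_state=273))
pipe.fit(X_final, y_toy)

# Inspect the transformed matrix: scaled columns first, then the dummies.
scaled = prep.fit_transform(X_final)
print(scaled)
```

With the scaler inside the pipeline, a grid search would refit it on each training fold. For a decision tree the scaling itself should not change the splits, since each split is a threshold on a single feature; it mainly keeps the inputs consistent with the KNN and logistic regression runs.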

In [57]:
best = [0, 0, 0, 0]

for n in n_clusters:
    for c_type in cluster_types:
        for k in k_list:
            pipe = make_pipeline(
                DecisionTreeClassifier(random_state=273))
            X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
            X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[n][c_type]), on='name')
            X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats'],
                                                                test_size=0.25,
                                                                random_state=273,
                                                                stratify=y_df['formats'].values)
            dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
            dt_grid.fit(X_train, y_train)
            if dt_grid.best_score_ > best[0]:
                best = [dt_grid.best_score_, k, c_type, n]
                
best

[0.5683512722864769, 10, 'stats', 5]

The performance is very similar to logistic regression, but it's interesting that it chose the stats clusters this time for the 7 class model, as opposed to the general features clusters chosen in logistic regression.

In [58]:
pipe = make_pipeline(
    DecisionTreeClassifier(random_state=273))
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[5]['stats']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats'].values)
dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=10, verbose=0)
dt_grid.fit(X_train, y_train)
dt_grid.best_params_, dt_grid.best_score_

({'decisiontreeclassifier__class_weight': None,
  'decisiontreeclassifier__max_depth': 5,
  'decisiontreeclassifier__min_samples_leaf': 10,
  'decisiontreeclassifier__min_samples_split': 2},
 0.5683512722864769)

In [59]:
f1_score(y_train, dt_grid.predict(X_train), labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'], average='weighted')

0.6461860544509731

In [60]:
precision_recall_fscore_support(y_train, dt_grid.predict(X_train), labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'])

(array([0.89211618, 0.52906977, 0.26229508, 0.36363636, 0.5       ,
        0.47727273, 1.        ]),
 array([0.85657371, 0.76470588, 0.51612903, 0.11428571, 0.20588235,
        0.48837209, 0.25      ]),
 array([0.87398374, 0.62542955, 0.34782609, 0.17391304, 0.29166667,
        0.48275862, 0.4       ]),
 array([251, 119,  31,  35,  34,  43,  40], dtype=int64))

It's interesting that the decision tree performs so poorly on Uber (recall of 0.25), which is really rare. Also, some of these decision trees have struggled a lot on the RU tier for some reason, and that applies here too.
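A confusion table would show which tiers the RU and Uber pokemon are being mistaken for. The sketch below uses `pd.crosstab` on made-up labels purely to show the shape of the output; in the notebook the inputs would be `y_train` and `dt_grid.predict(X_train)`.

```python
import pandas as pd

tiers = ['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber']

# Made-up labels for illustration only.
actual = pd.Series(['RU', 'RU', 'RU', 'Uber', 'Uber', 'PU', 'NU', 'OU'],
                   name='actual')
predicted = pd.Series(['PU', 'NU', 'RU', 'OU', 'Uber', 'PU', 'NU', 'OU'],
                      name='predicted')

# Rows are true tiers, columns are predicted tiers, both in tier order.
confusion = (pd.crosstab(actual, predicted)
             .reindex(index=tiers, columns=tiers, fill_value=0))
print(confusion)
```

Reading across the RU row then shows directly whether the misses land in neighboring tiers or are scattered.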

In [61]:
export_graphviz(dt_grid.best_estimator_['decisiontreeclassifier'], out_file='trees/7classdt_cluster.dot',
               feature_names = X_train.columns,
               class_names = sorted(list(y_train.unique())), #needs to be in alphabetical order
               rounded = True, proportion = False, precision = 2, filled = True)

render('dot', 'png', 'trees/7classdt_cluster.dot', outfile='trees/7classdt_cluster.png')

'trees\\7classdt_cluster.png'

#### 4 class with clustering

In [62]:
best = [0, 0, 0, 0]

for n in n_clusters:
    for c_type in cluster_types:
        for k in k_list:
            pipe = make_pipeline(
                DecisionTreeClassifier(random_state=273))
            X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
            X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[n][c_type]), on='name')
            X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats4'],
                                                                test_size=0.25,
                                                                random_state=273,
                                                                stratify=y_df['formats4'].values)
            dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
            dt_grid.fit(X_train, y_train)
            if dt_grid.best_score_ > best[0]:
                best = [dt_grid.best_score_, k, c_type, n]
                
best

[0.6255272116107875, 5, 'features', 5]

It's interesting that the decision trees seem to work better with a small number of clusters. That also makes sense, since with fewer clusters a single split on a cluster column gives you a lot of information.

In [63]:
pipe = make_pipeline(
    DecisionTreeClassifier(random_state=273))
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[5]['features']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats4'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats4'].values)
dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=5, verbose=0)
dt_grid.fit(X_train, y_train)
dt_grid.best_params_, dt_grid.best_score_

({'decisiontreeclassifier__class_weight': 'balanced',
  'decisiontreeclassifier__max_depth': 10,
  'decisiontreeclassifier__min_samples_leaf': 1,
  'decisiontreeclassifier__min_samples_split': 20},
 0.6255272116107875)

I am very concerned about the min samples leaf size of 1 since it usually leads to overfitting.

In [64]:
f1_score(y_train, dt_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.7717135294791215

In [65]:
precision_recall_fscore_support(y_train, dt_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.98477157, 0.60377358, 0.63302752, 0.71590909]),
 array([0.77290837, 0.80672269, 0.69      , 0.75903614]),
 array([0.86607143, 0.69064748, 0.66028708, 0.73684211]),
 array([251, 119, 100,  83], dtype=int64))

These are quite decent scores, but not the overfitting I was expecting. Perhaps the high min samples split of 20 also helps to prevent such overfitting even with small leaf size.
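The effect of `min_samples_split` on tree size can be checked directly. This is a sketch on synthetic data, assuming the same `max_depth=10` and `min_samples_leaf=1` as the best parameters above; `make_classification` stands in for the real features.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the pokemon features: 4 classes, noisy labels.
X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     n_informative=6, n_classes=4,
                                     flip_y=0.2, random_state=273)

leaves = {}
for split in (2, 20):
    tree = DecisionTreeClassifier(min_samples_leaf=1, min_samples_split=split,
                                  max_depth=10, random_state=273)
    tree.fit(X_demo, y_demo)
    # Record the leaf count and print it with the training accuracy.
    leaves[split] = tree.get_n_leaves()
    print(split, leaves[split], round(tree.score(X_demo, y_demo), 3))
```

A node with fewer than 20 samples is never split when `min_samples_split=20`, so single-sample leaves can only appear as the small side of a split on a larger node, which limits how much the tree can memorize.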

In [66]:
export_graphviz(dt_grid.best_estimator_['decisiontreeclassifier'], out_file='trees/4classdt_cluster.dot',
               feature_names = X_train.columns,
               class_names = sorted(list(y_train.unique())), #needs to be in alphabetical order
               rounded = True, proportion = False, precision = 2, filled = True)

render('dot', 'png', 'trees/4classdt_cluster.dot', outfile='trees/4classdt_cluster.png')

'trees\\4classdt_cluster.png'

#### 4 class alt with clustering

In [67]:
best = [0, 0, 0, 0]

for n in n_clusters:
    for c_type in cluster_types:
        for k in k_list:
            pipe = make_pipeline(
                DecisionTreeClassifier(random_state=273))
            X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
            X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[n][c_type]), on='name')
            X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats4alt'],
                                                                test_size=0.25,
                                                                random_state=273,
                                                                stratify=y_df['formats4alt'].values)
            dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
            dt_grid.fit(X_train, y_train)
            if dt_grid.best_score_ > best[0]:
                best = [dt_grid.best_score_, k, c_type, n]
                
best

[0.6602694867928545, 10, 'features', 5]

Yet again, only 5 clusters: a consistent theme with decision trees.

In [68]:
pipe = make_pipeline(
    DecisionTreeClassifier(random_state=273))
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[5]['features']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats4alt'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats4alt'].values)
dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=10, verbose=0)
dt_grid.fit(X_train, y_train)
dt_grid.best_params_, dt_grid.best_score_

({'decisiontreeclassifier__class_weight': None,
  'decisiontreeclassifier__max_depth': 10,
  'decisiontreeclassifier__min_samples_leaf': 5,
  'decisiontreeclassifier__min_samples_split': 40},
 0.6602694867928545)

We have pretty high numbers for min samples, so that's promising for avoiding overfitting.

In [69]:
f1_score(y_train, dt_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.7452341666036475

In [70]:
precision_recall_fscore_support(y_train, dt_grid.predict(X_train), labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.9055794 , 0.55921053, 0.6835443 , 1.        ]),
 array([0.84063745, 0.71428571, 0.75524476, 0.25      ]),
 array([0.87190083, 0.62730627, 0.71760797, 0.4       ]),
 array([251, 119, 143,  40], dtype=int64))

While the overall score isn't terrible, this scores very low on high competitive which I dislike, and I almost surely won't use this one.

In [71]:
export_graphviz(dt_grid.best_estimator_['decisiontreeclassifier'], out_file='trees/4classaltdt_cluster.dot',
               feature_names = X_train.columns,
               class_names = sorted(list(y_train.unique())), #needs to be in alphabetical order
               rounded = True, proportion = False, precision = 2, filled = True)

render('dot', 'png', 'trees/4classaltdt_cluster.dot', outfile='trees/4classaltdt_cluster.png')

'trees\\4classaltdt_cluster.png'

#### 2 class with clustering

In [72]:
best = [0, 0, 0, 0]

for n in n_clusters:
    for c_type in cluster_types:
        for k in k_list:
            pipe = make_pipeline(
                DecisionTreeClassifier(random_state=273))
            X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
            X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[n][c_type]), on='name')
            X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats2'],
                                                                test_size=0.25,
                                                                random_state=273,
                                                                stratify=y_df['formats2'].values)
            dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
            dt_grid.fit(X_train, y_train)
            if dt_grid.best_score_ > best[0]:
                best = [dt_grid.best_score_, k, c_type, n]
                
best

[0.8907376690554821, 10, 'features', 5]

Exactly the same best parameters as for logistic regression

In [73]:
pipe = make_pipeline(
    DecisionTreeClassifier(random_state=273))
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[5]['features']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=10, verbose=0)
dt_grid.fit(X_train, y_train)
dt_grid.best_params_, dt_grid.best_score_

({'decisiontreeclassifier__class_weight': 'balanced',
  'decisiontreeclassifier__max_depth': 10,
  'decisiontreeclassifier__min_samples_leaf': 5,
  'decisiontreeclassifier__min_samples_split': 40},
 0.8907376690554821)

Again I'm happy with the min samples, since I'm going to be using this in my two-stage model.

In [74]:
f1_score(y_train, dt_grid.predict(X_train), labels=['No', 'Yes'], average='weighted')

0.8967874929630921

In [75]:
precision_recall_fscore_support(y_train, dt_grid.predict(X_train), labels=['No', 'Yes'])

(array([0.89754098, 0.89644013]),
 array([0.87250996, 0.91721854]),
 array([0.88484848, 0.90671031]),
 array([251, 302], dtype=int64))

An F-score of around 0.9 is pretty standard performance for the 2 class models.

In [76]:
export_graphviz(dt_grid.best_estimator_['decisiontreeclassifier'], out_file='trees/2classdt_cluster.dot',
               feature_names = X_train.columns,
               class_names = sorted(list(y_train.unique())), #needs to be in alphabetical order
               rounded = True, proportion = False, precision = 2, filled = True)

render('dot', 'png', 'trees/2classdt_cluster.dot', outfile='trees/2classdt_cluster.png')

'trees\\2classdt_cluster.png'

<a id="two_stage_dt"></a>
#### Two-Stage Decision Tree

#### two-stage 7 class, no clusters

The first part of this two-stage model is just the regular decision tree for two classes (which we already built, so we can use that model again). It separates out the largest class, ZU, i.e. relatively non-competitive pokemon. The second model then doesn't have to include that class and can exercise all of its discernment on figuring out which competitive class a competitive pokemon belongs to. As we saw from many of the F-scores above, that can be quite difficult in some cases, so letting the second model focus on it might lead to higher performance than a single-stage model.
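The two-stage idea can be sketched end to end on synthetic data. Everything here is a stand-in: class 0 from `make_classification` plays the role of ZU, the tier names are borrowed only as labels, the hyperparameters are arbitrary rather than grid-searched, and `two_stage_predict` is a hypothetical helper. The sketch uses one shared train/test split for both stages.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: class 0 plays the role of ZU, classes 1-3 the
# competitive tiers.
X_demo, y_num = make_classification(n_samples=600, n_features=12,
                                    n_informative=6, n_classes=4,
                                    weights=[0.45, 0.25, 0.15, 0.15],
                                    random_state=273)
y_demo = np.array(['ZU', 'PU', 'UU', 'OU'])[y_num]

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=273, stratify=y_demo)

# Stage 1: competitive or not, trained on every training row.
competitive = y_tr != 'ZU'
stage1 = DecisionTreeClassifier(min_samples_leaf=10, random_state=273)
stage1.fit(X_tr, competitive)

# Stage 2: which competitive tier, trained only on truly competitive rows.
stage2 = DecisionTreeClassifier(min_samples_leaf=5, random_state=273)
stage2.fit(X_tr[competitive], y_tr[competitive])

def two_stage_predict(X):
    """Predict ZU where stage 1 says non-competitive, else ask stage 2."""
    is_comp = stage1.predict(X)
    out = np.full(len(X), 'ZU', dtype=object)
    if is_comp.any():
        out[is_comp] = stage2.predict(X[is_comp])
    return out

y_hat = two_stage_predict(X_te)
print((y_hat == y_te).mean())
```

Splitting once and filtering the training part for stage 2 keeps the held-out rows unseen by both stages, which makes the combined score easier to interpret.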

In [77]:
pipe = make_pipeline(
    DecisionTreeClassifier(random_state=273))
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
first_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=10, verbose=0)
first_stage.fit(X_train, y_train)
first_stage.best_params_, first_stage.best_score_

({'decisiontreeclassifier__class_weight': 'balanced',
  'decisiontreeclassifier__max_depth': 10,
  'decisiontreeclassifier__min_samples_leaf': 10,
  'decisiontreeclassifier__min_samples_split': 2},
 0.8567492190328831)

That's the same model that we used before to separate competitive and non-competitive pokemon. Now let's remove the ZU pokemon from consideration in the next model that we build, by filtering X and y_df on the formats2 label so that we're only looking at competitive pokemon:

In [78]:
y_df['formats2'].loc[y_df['formats2'] == 'Yes']

name
Venusaur          Yes
Charizard         Yes
Blastoise         Yes
Pikachu           Yes
Raichu            Yes
                 ... 
Glastrier         Yes
Spectrier         Yes
Calyrex           Yes
Calyrex-Ice       Yes
Calyrex-Shadow    Yes
Name: formats2, Length: 403, dtype: object

In [79]:
X_second = X.loc[y_df['formats2'] == 'Yes']
X_second

name,hp,atk,def,spa,spd,spe,Resistance Index,Entry Hazards,Cleric,Pivot,...,Priority,Weather Gimmick,Physical Cutoff 3,Physical Coverage 3,Special Cutoff 3,Special Coverage 5,Misc Status,Unique Powerful Move,Best Ability,HP Recovery
Venusaur,80,82,83,100,100,80,2,0,0,0,...,0,5,6,4,6,4,4,0,63.636364,3
Charizard,78,84,78,109,85,100,8,0,0,0,...,1,3,14,10,11,5,4,0,50.000000,3
Blastoise,79,83,100,85,105,78,2,0,0,3,...,2,2,12,10,11,7,4,0,75.000000,0
Pikachu,35,55,40,50,50,90,2,0,1,3,...,2,2,7,7,5,3,3,0,70.000000,1
Raichu,60,90,55,90,80,110,2,0,1,3,...,2,2,7,7,6,4,3,0,70.000000,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Glastrier,100,145,130,65,110,30,-3,0,0,0,...,0,1,12,7,3,2,2,0,75.000000,0
Spectrier,100,65,60,145,80,130,8,0,0,0,...,0,0,4,4,4,3,3,0,1.000000,0
Calyrex,100,80,80,80,80,80,-2,0,1,0,...,0,2,3,3,9,4,3,0,18.181818,2
Calyrex-Ice,100,165,150,85,130,50,-4,0,1,0,...,0,2,15,9,12,5,3,1,1.000000,2


In [80]:
y_second_7 = y_df['formats'].loc[y_df['formats2'] == 'Yes']
y_second_7

name
Venusaur            OU
Charizard           PU
Blastoise           NU
Pikachu             PU
Raichu              PU
                  ... 
Glastrier           NU
Spectrier         Uber
Calyrex             PU
Calyrex-Ice       Uber
Calyrex-Shadow    Uber
Name: formats, Length: 403, dtype: object

In [81]:
best = []

for k in k_list:
    pipe = make_pipeline(
        DecisionTreeClassifier(random_state=273))
    X_train, X_test, y_train, y_test = train_test_split(X_second, y_second_7,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_7.values)
    dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
    dt_grid.fit(X_train, y_train)
    best.append(dt_grid.best_score_)

max(best), k_list[best.index(max(best))]

(0.4187828408209852, 10)

As with logistic regression, this is quite a low score.

In [82]:
pipe = make_pipeline(
    DecisionTreeClassifier(random_state=273))
X_train, X_test, y_train, y_test = train_test_split(X_second, y_second_7,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_7.values)
second_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=10, verbose=0)
second_stage.fit(X_train, y_train)
second_stage.best_params_, second_stage.best_score_

({'decisiontreeclassifier__class_weight': None,
  'decisiontreeclassifier__max_depth': 10,
  'decisiontreeclassifier__min_samples_leaf': 1,
  'decisiontreeclassifier__min_samples_split': 5},
 0.4187828408209852)

In [83]:
precision_recall_fscore_support(y_train, second_stage.predict(X_train), labels=['PU', 'NU', 'RU', 'UU', 'OU', 'Uber'])

(array([0.9338843 , 0.74193548, 0.83333333, 0.85714286, 0.79591837,
        0.91666667]),
 array([0.94957983, 0.74193548, 0.71428571, 0.88235294, 0.90697674,
        0.825     ]),
 array([0.94166667, 0.74193548, 0.76923077, 0.86956522, 0.84782609,
        0.86842105]),
 array([119,  31,  35,  34,  43,  40], dtype=int64))

This is a little weird: these are high scores, but not like the obviously overfitted ones near 100%, so this might be a decent model.

In [84]:
export_graphviz(second_stage.best_estimator_['decisiontreeclassifier'], out_file='trees/7classdt_stage2.dot',
               feature_names = X_train.columns,
               class_names = sorted(list(y_train.unique())), #needs to be in alphabetical order
               rounded = True, proportion = False, precision = 2, filled = True)

render('dot', 'png', 'trees/7classdt_stage2.dot', outfile='trees/7classdt_stage2.png')

'trees\\7classdt_stage2.png'

In [85]:
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
pred_1 = pd.DataFrame(first_stage.predict(X_train), index=X_train.index)
pred_2 = pd.DataFrame(second_stage.predict(X_train[pred_1[0] == 'Yes']), index=X_train[pred_1[0] == 'Yes'].index)
y_pred = pred_1.merge(pred_2, on='name', how='left')['0_y'].fillna(value='ZU')
y_validation = y_train.to_frame().merge(y_df['formats'], on='name', how='left')['formats']
y_validation

name
Absol              PU
Ninetales-Alola    OU
Palossand          PU
Ponyta-Galar       ZU
Carvanha           ZU
                   ..
Dragonair          ZU
Qwilfish           PU
Cryogonal          PU
Wailord            ZU
Blaziken           OU
Name: formats, Length: 553, dtype: object
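The stitching of the two stages' predictions in the cell above can also be written with `reindex` instead of a merge. This is a sketch with made-up predictions for four pokemon; `pred_1` and `pred_2` are Series here rather than the one-column dataframes used in the notebook.

```python
import pandas as pd

# Toy stage-1 output (made-up values) indexed by pokemon name.
pred_1 = pd.Series(['Yes', 'No', 'Yes', 'No'],
                   index=pd.Index(['Absol', 'Carvanha', 'Blaziken', 'Wailord'],
                                  name='name'))

# Stage 2 only sees the rows stage 1 flagged as competitive.
pred_2 = pd.Series(['PU', 'OU'], index=pred_1[pred_1 == 'Yes'].index)

# Align stage 2 back onto the full index; rows it never saw become ZU.
y_pred = pred_2.reindex(pred_1.index).fillna('ZU')
print(y_pred)
```

The result keeps the row order of `pred_1`, so it lines up with labels taken from the same index.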

In [86]:
f1_score(y_validation, y_pred, labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'], average='weighted')

0.770759487293633

In [87]:
precision_recall_fscore_support(y_validation, y_pred, labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'])

(array([0.88617886, 0.7107438 , 0.46153846, 0.625     , 0.71428571,
        0.65957447, 0.81818182]),
 array([0.8685259 , 0.67716535, 0.6       , 0.58823529, 0.73529412,
        0.75609756, 0.75      ]),
 array([0.87726358, 0.69354839, 0.52173913, 0.60606061, 0.72463768,
        0.70454545, 0.7826087 ]),
 array([251, 127,  30,  34,  34,  41,  36], dtype=int64))

The scores are generally quite good, with the mid competitive tiers being somewhat weak (NU at 0.52 and RU at 0.61 are the lowest f-scores).
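As a sanity check on how the weighted score relates to the per-class scores, the sketch below recomputes it from the numbers printed by In [87]. The only assumption is that the f-scores and supports are copied correctly from that output.

```python
import numpy as np

# Per-class f-scores and supports copied from the output of In [87].
labels = ['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber']
f1_per_class = np.array([0.87726358, 0.69354839, 0.52173913, 0.60606061,
                         0.72463768, 0.70454545, 0.7826087])
support = np.array([251, 127, 30, 34, 34, 41, 36])

# average='weighted' is the support-weighted mean of the per-class f-scores.
weighted_f1 = np.average(f1_per_class, weights=support)

# The unweighted (macro) mean treats every tier equally.
macro_f1 = f1_per_class.mean()

print(f"weighted: {weighted_f1:.4f}, macro: {macro_f1:.4f}")
```

ZU accounts for 251 of the 553 rows, so its high f-score pulls the weighted figure above the macro mean. That is worth keeping in mind when comparing models mainly on the weighted number.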

#### two-stage 4 class, no clusters

In [88]:
y_second_4 = y_df['formats4'].loc[y_df['formats2'] == 'Yes']
best = []

for k in k_list:
    pipe = make_pipeline(
        DecisionTreeClassifier(random_state=273))
    X_train, X_test, y_train, y_test = train_test_split(X_second, y_second_4,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_4.values)
    dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
    dt_grid.fit(X_train, y_train)
    best.append(dt_grid.best_score_)

max(best), k_list[best.index(max(best))]

(0.515165051441064, 10)

In [89]:
pipe = make_pipeline(
    DecisionTreeClassifier(random_state=273))
X_train, X_test, y_train, y_test = train_test_split(X_second, y_second_4,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_4.values)
second_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=10, verbose=0)
second_stage.fit(X_train, y_train)
second_stage.best_params_, second_stage.best_score_

({'decisiontreeclassifier__class_weight': None,
  'decisiontreeclassifier__max_depth': 10,
  'decisiontreeclassifier__min_samples_leaf': 2,
  'decisiontreeclassifier__min_samples_split': 2},
 0.515165051441064)

In [90]:
precision_recall_fscore_support(y_train, second_stage.predict(X_train), labels=['Low c', 'Mid c', 'High c'])

(array([0.87401575, 0.91463415, 0.82795699]),
 array([0.93277311, 0.75      , 0.92771084]),
 array([0.90243902, 0.82417582, 0.875     ]),
 array([119, 100,  83], dtype=int64))

These are quite high scores but not perfect, which might be promising.

In [91]:
export_graphviz(second_stage.best_estimator_['decisiontreeclassifier'], out_file='trees/4classdt_stage2.dot',
               feature_names = X_train.columns,
               class_names = sorted(list(y_train.unique())), #needs to be in alphabetical order
               rounded = True, proportion = False, precision = 2, filled = True)

render('dot', 'png', 'trees/4classdt_stage2.dot', outfile='trees/4classdt_stage2.png')

'trees\\4classdt_stage2.png'

In [92]:
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
pred_1 = pd.DataFrame(first_stage.predict(X_train), index=X_train.index)
pred_2 = pd.DataFrame(second_stage.predict(X_train[pred_1[0] == 'Yes']), index=X_train[pred_1[0] == 'Yes'].index)
y_pred = pred_1.merge(pred_2, on='name', how='left')['0_y'].fillna(value='Not c')
y_validation = y_train.to_frame().merge(y_df['formats4'], on='name', how='left')['formats4']
y_validation

name
Absol               Low c
Ninetales-Alola    High c
Palossand           Low c
Ponyta-Galar        Not c
Carvanha            Not c
                    ...  
Dragonair           Not c
Qwilfish            Low c
Cryogonal           Low c
Wailord             Not c
Blaziken           High c
Name: formats4, Length: 553, dtype: object

In [93]:
f1_score(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.7839593210332558

In [94]:
precision_recall_fscore_support(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.88617886, 0.64285714, 0.74712644, 0.75      ]),
 array([0.8685259 , 0.70866142, 0.66326531, 0.77922078]),
 array([0.87726358, 0.6741573 , 0.7027027 , 0.76433121]),
 array([251, 127,  98,  77], dtype=int64))

All scores are above .65, which makes this one of the better models, though I doubt it will be one of the best.

#### two-stage 4 class alt, no clusters

In [95]:
y_second_4alt = y_df['formats4alt'].loc[y_df['formats2'] == 'Yes']
best = []

for k in k_list:
    pipe = make_pipeline(
        DecisionTreeClassifier(random_state=273))
    X_train, X_test, y_train, y_test = train_test_split(X_second, y_second_4alt,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_4alt.values)
    dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
    dt_grid.fit(X_train, y_train)
    best.append(dt_grid.best_score_)

max(best), k_list[best.index(max(best))]

(0.5941599147209478, 2)

In [96]:
pipe = make_pipeline(
    DecisionTreeClassifier(random_state=273))
X_train, X_test, y_train, y_test = train_test_split(X_second, y_second_4alt,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_4alt.values)
second_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=2, verbose=0)
second_stage.fit(X_train, y_train)
second_stage.best_params_, second_stage.best_score_

({'decisiontreeclassifier__class_weight': None,
  'decisiontreeclassifier__max_depth': 20,
  'decisiontreeclassifier__min_samples_leaf': 1,
  'decisiontreeclassifier__min_samples_split': 2},
 0.5941599147209478)

In [97]:
precision_recall_fscore_support(y_train, second_stage.predict(X_train), labels=['Low c', 'Mid c', 'High c'])

(array([1., 1., 1.]),
 array([1., 1., 1.]),
 array([1., 1., 1.]),
 array([119, 143,  40], dtype=int64))

This one is ridiculously overfitted, getting literally 100% on the training data. 2-fold cross-validation, very low min samples leaf and split, and high tree depth are all probably strong contributors to the overfitting. I won't be using this model.
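One way to put a number on this kind of overfitting is to compare the training f-score with the cross-validated one. The sketch below does that on synthetic stand-in data (not the Pokemon features); the helper name `train_and_cv_f1` and both sets of tree settings are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; flip_y adds label noise so a perfect fit is suspicious.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=4,
                                     n_classes=3, flip_y=0.2, random_state=273)


def train_and_cv_f1(model, X, y, cv=5):
    """Return (weighted F1 on the training data, mean cross-validated weighted F1)."""
    cv_f1 = cross_val_score(model, X, y, cv=cv, scoring='f1_weighted').mean()
    train_f1 = f1_score(y, model.fit(X, y).predict(X), average='weighted')
    return train_f1, cv_f1


# Unconstrained tree: free to grow until every leaf is pure.
deep = DecisionTreeClassifier(max_depth=None, min_samples_leaf=1, random_state=273)
# Constrained tree: shallower, with a minimum leaf size.
pruned = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, random_state=273)

deep_train, deep_cv = train_and_cv_f1(deep, X_demo, y_demo)
pruned_train, pruned_cv = train_and_cv_f1(pruned, X_demo, y_demo)

print(f"deep:   train={deep_train:.3f}  cv={deep_cv:.3f}")
print(f"pruned: train={pruned_train:.3f}  cv={pruned_cv:.3f}")
```

A large gap between the two numbers for a model is the same warning sign as the perfect training scores above sitting next to a cross-validated score of about 0.59.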

In [98]:
export_graphviz(second_stage.best_estimator_['decisiontreeclassifier'], out_file='trees/4classaltdt_stage2.dot',
               feature_names = X_train.columns,
               class_names = sorted(list(y_train.unique())), #needs to be in alphabetical order
               rounded = True, proportion = False, precision = 2, filled = True)

render('dot', 'png', 'trees/4classaltdt_stage2.dot', outfile='trees/4classaltdt_stage2.png')

'trees\\4classaltdt_stage2.png'

In [99]:
X_train, X_test, y_train, y_test = train_test_split(X, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
pred_1 = pd.DataFrame(first_stage.predict(X_train), index=X_train.index)
pred_2 = pd.DataFrame(second_stage.predict(X_train[pred_1[0] == 'Yes']), index=X_train[pred_1[0] == 'Yes'].index)
y_pred = pred_1.merge(pred_2, on='name', how='left')['0_y'].fillna(value='Not c')
y_validation = y_train.to_frame().merge(y_df['formats4alt'], on='name', how='left')['formats4alt']
y_validation

name
Absol              Low c
Ninetales-Alola    Mid c
Palossand          Low c
Ponyta-Galar       Not c
Carvanha           Not c
                   ...  
Dragonair          Not c
Qwilfish           Low c
Cryogonal          Low c
Wailord            Not c
Blaziken           Mid c
Name: formats4alt, Length: 553, dtype: object

In [100]:
f1_score(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.8298126438271775

In [101]:
precision_recall_fscore_support(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.88617886, 0.72      , 0.84246575, 0.77777778]),
 array([0.8685259 , 0.70866142, 0.88489209, 0.77777778]),
 array([0.87726358, 0.71428571, 0.86315789, 0.77777778]),
 array([251, 127, 139,  36], dtype=int64))

In spite of the semi-reasonable performance here once the two stages are combined, I find it hard to trust this model when its second stage overfits so much.
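As far as I can tell from the cells above, the combined two-stage scores are computed on rows from `X_train`, so they cannot show how much the overfitting costs on unseen data. Below is a minimal sketch of scoring a two-stage model on held-out rows as well. It uses synthetic data, fixed tree settings instead of a grid search, and a hypothetical helper `predict_two_stage`, so it illustrates the evaluation pattern only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

tier_names = ['Not c', 'Low c', 'Mid c', 'High c']

# Synthetic stand-in for the features and the 4-class target.
X_arr, y_arr = make_classification(n_samples=400, n_features=8, n_informative=5,
                                   n_classes=4, random_state=273)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])
y_demo = pd.Series(np.array(tier_names)[y_arr])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=273, stratify=y_demo)

# Stage one: competitive or not. Stage two: which tier, fit on competitive rows only.
gate_tr = np.where(y_tr == 'Not c', 'No', 'Yes')
stage1 = DecisionTreeClassifier(max_depth=5, random_state=273).fit(X_tr, gate_tr)
competitive = y_tr != 'Not c'
stage2 = DecisionTreeClassifier(max_depth=5, random_state=273).fit(
    X_tr[competitive], y_tr[competitive])


def predict_two_stage(X):
    """Run stage one on every row, then stage two only where stage one says 'Yes'."""
    gate = pd.Series(stage1.predict(X), index=X.index)
    out = pd.Series('Not c', index=X.index)
    yes = gate == 'Yes'
    if yes.any():
        out[yes] = stage2.predict(X[yes])
    return out


train_f1 = f1_score(y_tr, predict_two_stage(X_tr), average='weighted')
test_f1 = f1_score(y_te, predict_two_stage(X_te), average='weighted')
print(f"train: {train_f1:.3f}  held-out: {test_f1:.3f}")
```

Reporting both numbers side by side makes it easier to judge whether a model with a suspicious second stage still holds up.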

<a id="two_stage_dt_cluster"></a>
#### two-stage Decision Tree with clustering

#### two-stage 7 class with clustering

We need to set it up so that our first stage has clustering now:

In [102]:
pipe = make_pipeline(
    DecisionTreeClassifier(random_state=273))
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[5]['features']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
first_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=10, verbose=0)
first_stage.fit(X_train, y_train)
first_stage.best_params_, first_stage.best_score_

({'decisiontreeclassifier__class_weight': 'balanced',
  'decisiontreeclassifier__max_depth': 10,
  'decisiontreeclassifier__min_samples_leaf': 5,
  'decisiontreeclassifier__min_samples_split': 40},
 0.8907376690554821)

And we already have X_second and all iterations of y_second set up to make our training and testing sets, so we can just go ahead and do hyperparameter search:

In [103]:
best = [0, 0, 0, 0]

for n in n_clusters:
    for c_type in cluster_types:
        for k in k_list:
            pipe = make_pipeline(
                DecisionTreeClassifier(random_state=273))
            X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X_second), index=X_second.index, columns=X_second.columns)
            X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[n][c_type]), on='name')
            X_train, X_test, y_train, y_test = train_test_split(X_final, y_second_7,
                                                                test_size=0.25,
                                                                random_state=273,
                                                                stratify=y_second_7.values)
            dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
            dt_grid.fit(X_train, y_train)
            if dt_grid.best_score_ > best[0]:
                best = [dt_grid.best_score_, k, c_type, n]
                
best

[0.4525307416763124, 5, 'stats', 15]

It is interesting that KNN always seems to choose to cluster based on stats, but logistic regression had more variance in the type of clustering it used.
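The three nested loops above keep only the running best combination. A minimal sketch of an alternative is shown below: it flattens the loops with `itertools.product` and records every combination so the runners-up can be inspected too. The data is synthetic, the clusters are computed on the fly with KMeans rather than read from `cluster_dfs`, and the names prefixed with `demo_` are hypothetical.

```python
from itertools import product

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_second / y_second_7.
X_arr, y_demo = make_classification(n_samples=200, n_features=6, n_informative=4,
                                    n_classes=3, random_state=273)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

demo_n_clusters = [2, 4]
demo_k_list = [2, 3]
demo_grid = {'max_depth': [3, 5], 'min_samples_leaf': [2, 10]}

results = []
for n, k in product(demo_n_clusters, demo_k_list):
    # One-hot encode the cluster assignment and append it to the features.
    clusters = KMeans(n_clusters=n, n_init=10, random_state=273).fit_predict(X_demo)
    dummies = pd.get_dummies(pd.Series(clusters, index=X_demo.index), prefix='cluster')
    X_final = X_demo.join(dummies)

    # 'f1_weighted' is the built-in name for the weighted f-score scorer.
    search = GridSearchCV(DecisionTreeClassifier(random_state=273), demo_grid,
                          scoring='f1_weighted', cv=k)
    search.fit(X_final, y_demo)
    results.append({'n_clusters': n, 'cv': k, 'score': search.best_score_})

# Keep every combination instead of only the running best.
results_df = pd.DataFrame(results).sort_values('score', ascending=False)
print(results_df)
```

One caveat with this kind of search: scores obtained with different numbers of folds are estimated on validation sets of different sizes, so small differences between rows with different `cv` values may not mean much.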

In [104]:
pipe = make_pipeline(
    DecisionTreeClassifier(random_state=273))
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X_second), index=X_second.index, columns=X_second.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[15]['stats']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_second_7,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_7.values)
second_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=5, verbose=0)
second_stage.fit(X_train, y_train)
second_stage.best_params_, second_stage.best_score_

({'decisiontreeclassifier__class_weight': None,
  'decisiontreeclassifier__max_depth': 10,
  'decisiontreeclassifier__min_samples_leaf': 5,
  'decisiontreeclassifier__min_samples_split': 2},
 0.4525307416763124)

In [105]:
precision_recall_fscore_support(y_train, second_stage.predict(X_train), labels=['PU', 'NU', 'RU', 'UU', 'OU', 'Uber'])

(array([0.75735294, 0.48717949, 0.48484848, 0.57142857, 0.61363636,
        0.82758621]),
 array([0.86554622, 0.61290323, 0.45714286, 0.35294118, 0.62790698,
        0.6       ]),
 array([0.80784314, 0.54285714, 0.47058824, 0.43636364, 0.62068966,
        0.69565217]),
 array([119,  31,  35,  34,  43,  40], dtype=int64))

Interesting: these are not very overfit. If anything, they look underfit.

In [106]:
export_graphviz(second_stage.best_estimator_['decisiontreeclassifier'], out_file='trees/7classdt_cluster_stage2.dot',
               feature_names = X_train.columns,
               class_names = sorted(list(y_train.unique())), #needs to be in alphabetical order
               rounded = True, proportion = False, precision = 2, filled = True)

render('dot', 'png', 'trees/7classdt_cluster_stage2.dot', outfile='trees/7classdt_cluster_stage2.png')

'trees\\7classdt_cluster_stage2.png'

In [107]:
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[5]['features']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
pred_1 = pd.DataFrame(first_stage.predict(X_train), index=X_train.index)

#drop old clusters and merge the new clusters
#because the two models work with different clusterings
X_train = X_train.drop(columns=list(range(0, 5)))
X_train = X_train.merge(pd.get_dummies(cluster_dfs[15]['stats']), on='name', how='left')

pred_2 = pd.DataFrame(second_stage.predict(X_train[pred_1[0] == 'Yes']), index=X_train[pred_1[0] == 'Yes'].index)
y_pred = pred_1.merge(pred_2, on='name', how='left')['0_y'].fillna(value='ZU')
y_validation = y_train.to_frame().merge(y_df['formats'], on='name', how='left')['formats']
y_validation

name
Absol              PU
Ninetales-Alola    OU
Palossand          PU
Ponyta-Galar       ZU
Carvanha           ZU
                   ..
Dragonair          ZU
Qwilfish           PU
Cryogonal          PU
Wailord            ZU
Blaziken           OU
Name: formats, Length: 553, dtype: object
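Swapping the cluster columns between stages relies on the one-hot columns being named `0` to `4`. A minimal sketch of a variant that names the cluster columns with a prefix is shown below, so the swap does not depend on how many clusters the first stage used. The toy features, the cluster assignments, and the helper `with_clusters` are all hypothetical, and a model would need to be trained on the prefixed column names for this to line up at prediction time.

```python
import pandas as pd

names = pd.Index(['Absol', 'Carvanha', 'Blaziken'], name='name')

# Toy scaled features and two hypothetical clusterings (illustrative values only).
X_scaled = pd.DataFrame({'atk': [0.9, 0.6, 0.8], 'spe': [0.5, 0.4, 0.6]}, index=names)
stage1_clusters = pd.Series([0, 1, 0], index=names)
stage2_clusters = pd.Series([2, 0, 1], index=names)


def with_clusters(features, clusters, prefix):
    """Append one-hot cluster columns, named with a prefix so they are easy to find."""
    return features.join(pd.get_dummies(clusters, prefix=prefix))


X_stage1 = with_clusters(X_scaled, stage1_clusters, prefix='c1')

# Swap clusterings: drop every stage-one cluster column by prefix, then add stage two's.
stage1_cols = [c for c in X_stage1.columns if c.startswith('c1_')]
X_stage2 = with_clusters(X_stage1.drop(columns=stage1_cols), stage2_clusters, prefix='c2')

print(list(X_stage1.columns))
print(list(X_stage2.columns))
```

String column names also sidestep a possible issue with mixed integer and string column labels, which I believe newer scikit-learn releases refuse when fitting on a DataFrame.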

In [108]:
f1_score(y_validation, y_pred, labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'], average='weighted')

0.6628007522983858

In [109]:
precision_recall_fscore_support(y_validation, y_pred, labels=['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber'])

(array([0.89754098, 0.5703125 , 0.33333333, 0.24444444, 0.42857143,
        0.34210526, 0.70588235]),
 array([0.87250996, 0.57480315, 0.4       , 0.32352941, 0.35294118,
        0.31707317, 0.66666667]),
 array([0.88484848, 0.57254902, 0.36363636, 0.27848101, 0.38709677,
        0.32911392, 0.68571429]),
 array([251, 127,  30,  34,  34,  41,  36], dtype=int64))

The main problem is that this model performs poorly on the mid competitive tiers (NU, RU, UU, and OU all have f-scores below 0.4), so it won't be used.

#### two-stage 4 class with clustering

In [110]:
best = [0, 0, 0, 0]

for n in n_clusters:
    for c_type in cluster_types:
        for k in k_list:
            pipe = make_pipeline(
                DecisionTreeClassifier(random_state=273))
            X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X_second), index=X_second.index, columns=X_second.columns)
            X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[n][c_type]), on='name')
            X_train, X_test, y_train, y_test = train_test_split(X_final, y_second_4,
                                                                test_size=0.25,
                                                                random_state=273,
                                                                stratify=y_second_4.values)
            dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
            dt_grid.fit(X_train, y_train)
            if dt_grid.best_score_ > best[0]:
                best = [dt_grid.best_score_, k, c_type, n]
                
best

[0.5378482916147932, 3, 'stats', 5]

Stats is once again the best clustering parameter, this time with only 5 clusters, but let's see if it has any significant effect:

In [111]:
pipe = make_pipeline(
    DecisionTreeClassifier(random_state=273))
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X_second), index=X_second.index, columns=X_second.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[5]['stats']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_second_4,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_4.values)
second_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=3, verbose=0)
second_stage.fit(X_train, y_train)
second_stage.best_params_, second_stage.best_score_

({'decisiontreeclassifier__class_weight': 'balanced',
  'decisiontreeclassifier__max_depth': 10,
  'decisiontreeclassifier__min_samples_leaf': 10,
  'decisiontreeclassifier__min_samples_split': 2},
 0.5378482916147932)

In [112]:
precision_recall_fscore_support(y_train, second_stage.predict(X_train), labels=['Low c', 'Mid c', 'High c'])

(array([0.76785714, 0.60869565, 0.73333333]),
 array([0.72268908, 0.7       , 0.6626506 ]),
 array([0.74458874, 0.65116279, 0.69620253]),
 array([119, 100,  83], dtype=int64))

Not terrible, but not good.

In [113]:
export_graphviz(second_stage.best_estimator_['decisiontreeclassifier'], out_file='trees/4classdt_cluster_stage2.dot',
               feature_names = X_train.columns,
               class_names = sorted(list(y_train.unique())), #needs to be in alphabetical order
               rounded = True, proportion = False, precision = 2, filled = True)

render('dot', 'png', 'trees/4classdt_cluster_stage2.dot', outfile='trees/4classdt_cluster_stage2.png')

'trees\\4classdt_cluster_stage2.png'

In [114]:
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[5]['features']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
pred_1 = pd.DataFrame(first_stage.predict(X_train), index=X_train.index)

#drop old clusters and merge the new clusters
#because the two models work with different clusterings
X_train = X_train.drop(columns=list(range(0, 5)))
X_train = X_train.merge(pd.get_dummies(cluster_dfs[5]['stats']), on='name', how='left')

pred_2 = pd.DataFrame(second_stage.predict(X_train[pred_1[0] == 'Yes']), index=X_train[pred_1[0] == 'Yes'].index)
y_pred = pred_1.merge(pred_2, on='name', how='left')['0_y'].fillna(value='Not c')
y_validation = y_train.to_frame().merge(y_df['formats4'], on='name', how='left')['formats4']
y_validation

name
Absol               Low c
Ninetales-Alola    High c
Palossand           Low c
Ponyta-Galar        Not c
Carvanha            Not c
                    ...  
Dragonair           Not c
Qwilfish            Low c
Cryogonal           Low c
Wailord             Not c
Blaziken           High c
Name: formats4, Length: 553, dtype: object

In [115]:
f1_score(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.6985110285875017

In [116]:
precision_recall_fscore_support(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.89754098, 0.57627119, 0.42975207, 0.64285714]),
 array([0.87250996, 0.53543307, 0.53061224, 0.58441558]),
 array([0.88484848, 0.55510204, 0.47488584, 0.6122449 ]),
 array([251, 127,  98,  77], dtype=int64))

This is also not a very good model for 4 classes.

#### two-stage 4 class alt with clustering

In [117]:
best = [0, 0, 0, 0]

for n in n_clusters:
    for c_type in cluster_types:
        for k in k_list:
            pipe = make_pipeline(
                DecisionTreeClassifier(random_state=273))
            X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X_second), index=X_second.index, columns=X_second.columns)
            X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[n][c_type]), on='name')
            X_train, X_test, y_train, y_test = train_test_split(X_final, y_second_4alt,
                                                                test_size=0.25,
                                                                random_state=273,
                                                                stratify=y_second_4alt.values)
            dt_grid = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=k, verbose=0)
            dt_grid.fit(X_train, y_train)
            if dt_grid.best_score_ > best[0]:
                best = [dt_grid.best_score_, k, c_type, n]
                
best

[0.6288629993582472, 2, 'stats', 20]

In [118]:
pipe = make_pipeline(
    DecisionTreeClassifier(random_state=273))
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X_second), index=X_second.index, columns=X_second.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[20]['stats']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_second_4alt,
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_second_4alt.values)
second_stage = GridSearchCV(pipe, param_grid, scoring=make_scorer(f1_score, average='weighted'), n_jobs=-1, cv=2, verbose=0)
second_stage.fit(X_train, y_train)
second_stage.best_params_, second_stage.best_score_

({'decisiontreeclassifier__class_weight': None,
  'decisiontreeclassifier__max_depth': 20,
  'decisiontreeclassifier__min_samples_leaf': 2,
  'decisiontreeclassifier__min_samples_split': 10},
 0.6288629993582472)

In [119]:
precision_recall_fscore_support(y_train, second_stage.predict(X_train), labels=['Low c', 'Mid c', 'High c'])

(array([0.84251969, 0.88489209, 0.83333333]),
 array([0.89915966, 0.86013986, 0.75      ]),
 array([0.8699187 , 0.87234043, 0.78947368]),
 array([119, 143,  40], dtype=int64))

This is definitely our best model among the two-stage decision tree models, which is interesting because historically the 4 class alt models have performed relatively poorly in two-stage models.

In [120]:
export_graphviz(second_stage.best_estimator_['decisiontreeclassifier'], out_file='trees/4classaltdt_cluster_stage2.dot',
               feature_names = X_train.columns,
               class_names = sorted(list(y_train.unique())), #needs to be in alphabetical order
               rounded = True, proportion = False, precision = 2, filled = True)

render('dot', 'png', 'trees/4classaltdt_cluster_stage2.dot', outfile='trees/4classaltdt_cluster_stage2.png')

'trees\\4classaltdt_cluster_stage2.png'

In [121]:
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)
X_final = pd.merge(X_scaled, pd.get_dummies(cluster_dfs[5]['features']), on='name')
X_train, X_test, y_train, y_test = train_test_split(X_final, y_df['formats2'],
                                                    test_size=0.25,
                                                    random_state=273,
                                                    stratify=y_df['formats2'].values)
pred_1 = pd.DataFrame(first_stage.predict(X_train), index=X_train.index)

#drop old clusters and merge the new clusters
#because the two models work with different clusterings
X_train = X_train.drop(columns=list(range(0, 5)))
X_train = X_train.merge(pd.get_dummies(cluster_dfs[20]['stats']), on='name', how='left')

pred_2 = pd.DataFrame(second_stage.predict(X_train[pred_1[0] == 'Yes']), index=X_train[pred_1[0] == 'Yes'].index)
y_pred = pred_1.merge(pred_2, on='name', how='left')['0_y'].fillna(value='Not c')
y_validation = y_train.to_frame().merge(y_df['formats4alt'], on='name', how='left')['formats4alt']
y_validation

name
Absol              Low c
Ninetales-Alola    Mid c
Palossand          Low c
Ponyta-Galar       Not c
Carvanha           Not c
                   ...  
Dragonair          Not c
Qwilfish           Low c
Cryogonal          Low c
Wailord            Not c
Blaziken           Mid c
Name: formats4alt, Length: 553, dtype: object

In [122]:
f1_score(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'], average='weighted')

0.7478408263494628

In [123]:
precision_recall_fscore_support(y_validation, y_pred, labels=['Not c', 'Low c', 'Mid c', 'High c'])

(array([0.89754098, 0.60504202, 0.66438356, 0.56818182]),
 array([0.87250996, 0.56692913, 0.69784173, 0.69444444]),
 array([0.88484848, 0.58536585, 0.68070175, 0.625     ]),
 array([251, 127, 139,  36], dtype=int64))

Overall though, it's still not a great model, reinforcing that decision trees are probably not the best choice for this task, though they might still offer interesting visualizations.

Notebook runtime: Roughly 20 minutes

<a id="performance"></a>
## Score Summary:

### one-stage, no clusters

#### 7 classes, no clusters
0.6167818874054956\
[0.84274194, 0.57894737, 0.38554217, 0., 0.24489796, 0.40384615, 0.57534247]
 
#### 4 classes, no clusters
0.9801185987970462\
[0.98597194, 0.96638655, 0.98      , 0.98224852]

#### 4 class alt no clusters
0.8175913105129116\
[0.891258  , 0.75      , 0.78518519, 0.67226891]

#### 2 class no clusters
0.8895905415239665\
[0.87726358, 0.8998358]

### one-stage, with clustering

#### 7 class with clustering
0.6461860544509731\
[0.87398374, 0.62542955, 0.34782609, 0.17391304, 0.29166667, 0.48275862, 0.4]

#### 4 class with clustering
0.7717135294791215\
[0.86607143, 0.69064748, 0.66028708, 0.73684211]

#### 4 class alt with clustering
0.7452341666036475\
[0.87190083, 0.62730627, 0.71760797, 0.4]

#### 2 class with clustering
0.8967874929630921\
[0.88484848, 0.90671031]

### two-stage, no clustering

#### two-stage 7 class, no clusters
0.770759487293633\
[0.87726358, 0.69354839, 0.52173913, 0.60606061, 0.72463768, 0.70454545, 0.7826087]

#### two-stage 4 class, no clusters
0.7839593210332558\
[0.87726358, 0.6741573, 0.7027027, 0.76433121]

#### two-stage 4 class alt, no clusters
0.8298126438271775\
[0.87726358, 0.71428571, 0.86315789, 0.77777778]

### two-stage, with clustering

#### two-stage 7 class with clustering
0.6628007522983858\
[0.88484848, 0.57254902, 0.36363636, 0.27848101, 0.38709677, 0.32911392, 0.68571429]

#### two-stage 4 class with clustering
0.6985110285875017\
[0.88484848, 0.55510204, 0.47488584, 0.6122449]

#### two-stage 4 class alt with clustering
0.7478408263494628\
[0.88484848, 0.58536585, 0.68070175, 0.625]

## Performance Summary

### best 7 class model: two-stage 7 class, no clusters (7/8)

#### best 7 class overall
- two-stage 7 class, no clusters, 0.770759487293633

#### best 7 class ZU
- two-stage 7 class with clustering, 0.88484848

#### best 7 class PU
- two-stage 7 class, no clusters, 0.69354839

#### best 7 class NU
- two-stage 7 class, no clusters, 0.52173913

#### best 7 class RU
- two-stage 7 class, no clusters, 0.60606061

#### best 7 class UU
- two-stage 7 class, no clusters, 0.72463768

#### best 7 class OU
- two-stage 7 class, no clusters, 0.70454545

#### best 7 class Uber
- two-stage 7 class, no clusters, 0.7826087

### best 4 class model: two-stage 4 class alt, no clusters (3/5)

#### best 4 class overall
- two-stage 4 class alt, no clusters, 0.8298126438271775

#### best 4 class "not competitive"
- 4 class alt no clusters, 0.891258

#### best 4 class "low competitive"
- 4 class alt no clusters, 0.75

#### best 4 class "mid competitive"
- two-stage 4 class alt, no clusters, 0.86315789

#### best 4 class "high competitive"
- two-stage 4 class alt, no clusters, 0.77777778

### best 2 class model: 2 class with clustering

There was one 4 class model which performed better than the one I assigned to the best, but based on the performance of the other models, it seems extremely obvious that the near 100% f-score 4 class model is heavily overfitting, so I don't think that model has any real likelihood of success. Thus I chose the 2nd best in that case, which has at least an appearance of more reasonable performance.

Overall though, the performance of decision trees seems lower than KNN while higher than Logistic Regression, but we were also worried about overfitting a lot when it came to KNN so it's possible that decision trees could still be better.
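The 7/8 tally for the 7 class models can be reproduced from the Score Summary. The sketch below assumes the per-class f-scores and overall scores are copied correctly from that section, and counts how often each model is the best per tier.

```python
import pandas as pd

tiers = ['ZU', 'PU', 'NU', 'RU', 'UU', 'OU', 'Uber']

# Per-class f-scores copied from the Score Summary.
scores_7 = pd.DataFrame({
    '7 classes, no clusters':
        [0.84274194, 0.57894737, 0.38554217, 0.0, 0.24489796, 0.40384615, 0.57534247],
    '7 class with clustering':
        [0.87398374, 0.62542955, 0.34782609, 0.17391304, 0.29166667, 0.48275862, 0.4],
    'two-stage 7 class, no clusters':
        [0.87726358, 0.69354839, 0.52173913, 0.60606061, 0.72463768, 0.70454545, 0.7826087],
    'two-stage 7 class with clustering':
        [0.88484848, 0.57254902, 0.36363636, 0.27848101, 0.38709677, 0.32911392, 0.68571429],
}, index=tiers)

# Overall weighted f-scores copied from the same section.
overall_7 = pd.Series({
    '7 classes, no clusters': 0.6167818874054956,
    '7 class with clustering': 0.6461860544509731,
    'two-stage 7 class, no clusters': 0.770759487293633,
    'two-stage 7 class with clustering': 0.6628007522983858,
})

# Best model per tier, then how many tiers each model wins.
best_per_tier = scores_7.idxmax(axis=1)
wins = best_per_tier.value_counts()

print(best_per_tier)
print(wins)
print("best overall:", overall_7.idxmax())
```

The two-stage model without clusters wins six of the seven tiers plus the overall score, which matches the 7/8 above. The same layout would work for the 4 class and 4 class alt models.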

## Explainability Summary

Decision trees offer the best explainability of any model, since we can directly see and visualize every decision they make. This means that even with low performance they could be worth looking at in the results. Random Forest, which I'm going to try next, also uses decision trees, but it uses so many trees that its explainability is not as straightforward. Random Forest does offer some interpretability with its feature importances, but less than decision trees, so it will have to justify itself with performance.
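Besides the rendered graphviz images, scikit-learn can print a tree's rules as plain text with `export_text`, and every fitted tree exposes `feature_importances_`. The sketch below shows both on the bundled iris data, used here only as a stand-in, with an arbitrary `max_depth=2` to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Iris is a small stand-in dataset, not the Pokemon data.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=273)
tree.fit(iris.data, iris.target)

# Plain-text view of every split and leaf in the fitted tree.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# Impurity-based importances: one value per feature, summing to 1.
importances = dict(zip(iris.feature_names, tree.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

The text rules carry the same information as the exported images, while the importances are the summary that a Random Forest can still provide once individual trees become too numerous to read.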