# The Average Pokémon Used by Serious Competitors

I've been following the Pokémon series for a really long time (almost 20 years at this point!), and I thought it would be good to put my impractical knowledge to (good?) use in a small data analysis project.

I first discovered the world of competitive online Pokémon battles in 2008, way back when Pokémon was still in its 3rd Gen(eration), where each generation is marked by the release of new Pokémon. While I stopped playing the games some time during the 5th/6th Gen (we are currently in the 8th Gen, marked by Sword and Shield), I still consider myself a part of the competitive battling scene. That being said, I am definitely a lot less "in the know" than I was years ago, though I would like to think that I can still hold my own against an amateur battler :)

## Question of Interest

The question that I seek to address via this project is: **Which Pokémon has base stats that most closely resemble the "average" Pokémon used by the most skilled players in the Over-Used (OU) battling scene?**

Before diving into the methodology and actual analyses, here are a few sections of relevant background information for those who might be unfamiliar with this scene:

## A Brief Background on Competitive Online Pokémon Battling

Competitive online Pokémon battling involves two people, each using a team of up to 6 Pokémon. Pokémon are categorised into tiers based on viability, usage frequency, available counters, etc. The important thing to note is that tiers are not set in stone: team strategies evolve over time, and Pokémon move between tiers as a result. I won't do a deep dive into the details of it (you can view more information on tiers [here](https://www.smogon.com/bw/articles/bw_tiers)), but I raise this because I have chosen to analyse Pokémon in the Over-Used (OU) tier, which is the most popular tier (and also happens to be the one that I'm most familiar with).

## A Brief Background on Data Sources

In the competitive online Pokémon battling scene, [Smogon](https://www.smogon.com/) is widely recognized as a leading "authority in the competitive Pokémon arena." They operate [Pokémon Showdown!](https://play.pokemonshowdown.com/), one of the most popular online Pokémon battle simulators. Literal millions of battles occur on that website every month, and I understand that there is a mature data collection pipeline for all of those battles along with documentation of the various fields (see this [forum post](https://www.smogon.com/forums/threads/gen-8-smogon-university-usage-statistics-discussion-thread.3657197/) for an overview). This makes it a great source for Pokémon battling data! The Internet truly is a magical place.

## An Overview on Pokémon Stats

There are six different stats (`HP`, `Attack`, `Defense`, `SpAtk`, `SpDef`, and `Speed`) that a Pokémon has.

There are two main types of damaging moves in the game: Physical and Special. The amount of damage dealt by each move is dependent on several things, but to simplify the explanation, the amount of damage done by Physical moves is determined by the attacking Pokémon's `Attack` stat and the defending Pokémon's `Defense` stat. The amount of damage done by Special moves is determined by the attacking Pokémon's Special Attack (`SpAtk`) stat and the defending Pokémon's Special Defense (`SpDef`) stat. When two Pokémon are facing each other, the one with the higher `Speed` stat moves first. `HP` stands for Hit Points, and it basically refers to the amount of health that a Pokémon has. Total is simply the sum of all of the stats, and can be used as a general indicator of a Pokémon's battle potential.

Note that in the data, the numbers under each stat's column refer to the base stat of that Pokémon. There are entire pages dedicated to what this truly means, but suffice it to say that a base stat represents a Pokémon's "potential" in that stat. While a Pokémon's actual stat can be inflated by items, natures, and training, the base stat is an indicator of the highest possible value that the Pokémon can attain for that particular attribute.
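
To make "potential" a bit more concrete, here is a minimal sketch of the standard stat formula used from Gen 3 onwards, which converts a base stat into an actual in-battle stat. The function name `calc_stat()` and its defaults (level 100, maximum IVs and EVs) are chosen purely for illustration; nothing in the analysis below depends on it.

```r
# Sketch of the Gen 3+ stat formula. `calc_stat()` is a hypothetical helper,
# not something used elsewhere in this analysis.
calc_stat <- function(base, iv = 31, ev = 252, level = 100,
                      nature = 1.0, is_hp = FALSE) {
  core <- floor((2 * base + iv + floor(ev / 4)) * level / 100)
  if (is_hp) {
    # HP ignores natures and gets a flat level-based bonus instead
    return(core + level + 10)
  }
  # Every other stat gets +5 and is then scaled by the nature (0.9, 1.0 or 1.1)
  floor((core + 5) * nature)
}

# Landorus-Therian has 145 base Attack; Toxapex has 50 base HP
max_attack <- calc_stat(145, nature = 1.1)  # fully invested, boosting nature
max_hp     <- calc_stat(50, is_hp = TRUE)   # fully invested HP
```

Because every other input is capped, a higher base stat always means a higher ceiling, which is why base stats are a reasonable basis for comparing Pokémon.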

## Who are the most skilled players in the OU battling scene?

The Pokémon usage statistics on Smogon are helpfully tiered based on player skill level, which is determined by one's rating on the ladder. I won't dive into the specifics of it, but I'm essentially taking usage statistics from the most skilled players in the OU tier (i.e., players rated 1825 and above). To quote [Question 8 of the Gen 8 Smogon University Usage Statistics Discussion Thread FAQ](https://www.smogon.com/forums/threads/gen-8-smogon-university-usage-statistics-discussion-thread.3657197/):

_"1760 (1825 for OU) stats represent "1337" stats, what the best-of-the-best in the metagame are doing. To some extent, this is what all players should strive to be doing, but there are some Pokemon and strategies that are difficult to pull off and might require a greater amount of skill than the typical competitive player possesses."_

Wow, that was certainly _a lot_. But now that we have all of the background information out of the way, let's look at the methodology I am employing to answer the question.

# Methodology

## Data Sources

I looked at Gen 8 OU battle data for each month in 2021 (you can see a sample file for January 2021 at this link [here](https://www.smogon.com/stats/2021-01/gen8ou-1825.txt)). I also pulled the data for the last two months of 2019 when Gen 8 was first released. I downloaded each of the `.txt` files and read them in.

Unfortunately, the Smogon dataset lacks the necessary base stat information, so I supplemented that by scraping data off [Pokémon Database](https://pokemondb.net/pokedex/all) since all of the base stat information was already organised in a table.

## Determining the Average Pokémon

1. For each month, I filtered the data to the 100 most used Pokémon in the OU tier.
2. I then assigned each of those 100 Pokémon a weight proportional to its `Usage %` field, i.e., how often it appeared on teams (see this [lengthy discussion on how weighting is used](https://www.smogon.com/forums/threads/weighted-stats-faq.3478570/)). The weighted average of each stat across all 100 Pokémon is what I take to be the "average" Pokémon's base stats.
3. I then calculated a difference score for every single Pokémon against the "average" Pokémon's base stats. This difference score was calculated by taking the sum of squared differences for each stat.
   * This difference score calculation methodology was selected as it imposes a bigger penalty on larger deviations from the "average" Pokémon's base stat.
4. The Pokémon with the smallest difference score is the answer to the question!
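
Written out, steps 2 and 3 amount to the following. The notation is introduced here for illustration only: $u_i$ is the `Usage %` of Pokémon $i$ among the top 100 in a given month, $b_{i,s}$ is its base stat $s$ (one of the six stats), and $p$ ranges over every Pokémon in the base stat table.

$$
w_i = \frac{u_i}{\sum_{j=1}^{100} u_j},
\qquad
\bar{b}_s = \sum_{i=1}^{100} w_i \, b_{i,s},
\qquad
D_p = \sum_{s} \left( b_{p,s} - \bar{b}_s \right)^2
$$

The answer for a given month is the Pokémon $p$ with the smallest $D_p$, which is the squared Euclidean distance between its base stats and the "average" Pokémon's base stats.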

## Predictions

I hypothesize that my analyses will yield the following two results:
1. The same Pokémon will appear for all months in 2021.
   * It has been over 2 years since the Gen 8 metagame was released, so it should be at a fairly mature and stable point.
   * Consequently, I doubt that the most frequently used Pokémon would fluctuate by a lot, and so I would expect that the "average" base stats would be fairly constant across those months.
2. The Pokémon for the last two months of 2019 will be different from the one listed in 2021.
   * I expect the metagame to be quite different from when it first started (e.g., Pokémon that are frequently used might be moved to different tiers if they become excessively dominant).

# Analysis

## Loading in relevant libraries

In [1]:
library(tidyverse)
library(vctrs)
library(rvest)
library(RCurl)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.7
✔ tidyr   1.1.4     ✔ stringr 1.4.0
✔ readr   2.1.1     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Attaching package: ‘vctrs’


The following object is masked from ‘package:dplyr’:

    data_frame


The following object is masked from ‘package:tibble’:

    data_frame



Attaching package: ‘rvest’


The following object is masked from ‘package:readr’:

    guess_encoding



Attaching package: ‘RCurl’


The following object is masked from ‘package:tid

## Downloading Smogon data files

In [2]:
# Function for saving files
save_smogon_file <- function(year_month){
  
  file <- getURL(str_c("https://www.smogon.com/stats/", year_month, "/gen8ou-1825.txt")
                 , ssl.verifypeer = FALSE)
  
  write_file(file
             , str_c("gen8ou-1825-", year_month,".txt"))
  
}

# Generate the vector of year_month
year_month_vec <- 1:12 %>%
                    str_pad(width = 2, pad = "0") %>%
                    str_c("2021-", .) %>% 
                    vec_c(c("2019-11", "2019-12"))

# Download all files for 2021 and for the last two months of 2019
walk(year_month_vec
     , save_smogon_file)

I'll flag that the `ssl.verifypeer = FALSE` in `getURL()` is necessary because without it, the following error is produced:

`Error in function (type, msg, asError = TRUE)  : 
  SSL certificate problem: certificate has expired`
  
According to [this helpful post](https://community.rstudio.com/t/ssl-certificate-problem-certificate-has-expired/68619), this appears to be related to a certificate that expired on May 30, 2020. I will admit that I do not fully understand the technical details of the problem, but my understanding is that the error can be bypassed by skipping peer certificate verification. I understand that doing so is not great from a security standpoint, but I am also analyzing data for a children's video game, so I think it's fine to let it slide this time.
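
If skipping verification feels uncomfortable, one option is to attempt a normal download and report any failure, rather than turning verification off up front. The following is a rough sketch under that assumption, using base R only. `build_smogon_url()` and `download_smogon_file()` are hypothetical helpers named for illustration, and whether the request succeeds depends on the local R setup and on Smogon's current certificate.

```r
# Hypothetical helpers; the names are illustrative and not used in the analysis.
build_smogon_url <- function(year_month) {
  paste0("https://www.smogon.com/stats/", year_month, "/gen8ou-1825.txt")
}

download_smogon_file <- function(year_month) {
  dest <- paste0("gen8ou-1825-", year_month, ".txt")
  # Skip files that were already downloaded on a previous run
  if (file.exists(dest)) return(invisible(dest))
  # Report a failed download instead of disabling certificate checks
  tryCatch(
    download.file(build_smogon_url(year_month), dest, quiet = TRUE),
    error = function(e) {
      message("Download failed for ", year_month, ": ", conditionMessage(e))
    }
  )
  invisible(dest)
}

# The same 14 months as above, built with sprintf() instead of str_pad()
year_months <- c(sprintf("2021-%02d", 1:12), "2019-11", "2019-12")
```

Calling `invisible(lapply(year_months, download_smogon_file))` would then fetch whichever files are still missing.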

## Webscraping Base Stats Information for All Pokémon

In [3]:
all_pokemon_url <- "https://pokemondb.net/pokedex/all"
all_pokemon_html <- read_html(all_pokemon_url)
all_pokemon_data <- all_pokemon_html %>%
                      html_nodes(xpath = "//*[@id = 'pokedex']") %>% # Determined after examining source HTML
                      html_table()
all_pokemon_data <- all_pokemon_data[[1]] # Extract out the data frame

rm(all_pokemon_url, all_pokemon_html)

glimpse(all_pokemon_data)

Rows: 1,047
Columns: 10
$ `#`       <int> 1, 2, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9, 9, 10, 11, 12, 13, 14, 1…
$ Name      <chr> "Bulbasaur", "Ivysaur", "Venusaur", "VenusaurMega Venusaur",…
$ Type      <chr> "GrassPoison", "GrassPoison", "GrassPoison", "GrassPoison", …
$ Total     <int> 318, 405, 525, 625, 309, 405, 534, 634, 634, 314, 405, 530, …
$ HP        <int> 45, 60, 80, 80, 39, 58, 78, 78, 78, 44, 59, 79, 79, 45, 50, …
$ Attack    <int> 49, 62, 82, 100, 52, 64, 84, 130, 104, 48, 63, 83, 103, 30, …
$ Defense   <int> 49, 63, 83, 123, 43, 58, 78, 111, 78, 65, 80, 100, 120, 35, …
$ `Sp. Atk` <int> 65, 80, 100, 122, 60, 80, 109, 130, 159, 50, 65, 85, 135, 20…
$ `Sp. Def` <int> 65, 80, 100, 120, 50, 65, 85, 85, 115, 64, 80, 105, 115, 20,…
$ Speed     <int> 45, 60, 80, 80, 65, 80, 100, 100, 100, 43, 58, 7

## Importing and Cleaning Data

In [4]:
# Standardize column names
colnames(all_pokemon_data) <- all_pokemon_data %>%
                                colnames() %>%
                                str_to_lower() %>% 
                                str_replace_all("\\.", "_") %>% 
                                str_remove_all(" ")

# Function to assist with general name replacement for Galarian and Alolan names
rename_region_helper <- function(string, region){
  
  if(!str_detect(string, region)){
    
    return(string)
    
  } else {
    
    if(region == "Galarian"){
      
      replaced_string <- str_replace(string, "Galarian", "-Galar")
      
    }
    
    if(region == "Alolan"){
      
      replaced_string <- str_replace(string, "Alolan", "-Alola")
      
    }
    
    space_position <- str_locate(replaced_string, " ")[1]
    
    final_string <- str_sub(replaced_string
                            , end = space_position - 1)
    
    return(final_string)
    
  }
  
}

# Function to assist general renaming of Pokemon species, specifically targeting Rotom and Lycanroc
rename_species_helper <- function(string, species){
  
  if(str_count(string, species) == 2 | species == "Lycanroc"){
    
    replaced_string <- str_replace(string, species, str_c(species, "-"))
    
    final_string <- str_remove(replaced_string, str_c(" ", species))
    
    # For Lycanroc
    final_string <- str_remove(final_string, " Form")
    
    return(final_string)
    
  } else{
    
    return(string)
    
  }
  
}

# Manual replacement of pokemon variable was determined after an initial join and determining
# entries that had missing values. These are a result of inconsistencies in the naming convention
# for certain Pokémon that have different forms and thus have the potential to be named differently
# across different sources.
to_join <- all_pokemon_data %>% 
             select(name
                    , total
                    , hp
                    , attack
                    , defense
                    , sp_atk
                    , sp_def
                    , speed) %>% 
             rename(pokemon = name) %>% 
             mutate(pokemon = str_trim(pokemon)
                    , pokemon = case_when(
                        pokemon == "UrshifuSingle Strike Style" ~ "Urshifu"
                        , pokemon == "UrshifuRapid Strike Style" ~ "Urshifu-Rapid-Strike"
                        , str_detect(pokemon, "Therian Forme") ~ str_replace(pokemon, "Therian Forme", "-Therian")
                        , pokemon == "AegislashShield Forme" ~ "Aegislash"
                        , pokemon == "ZamazentaCrowned Shield" ~ "Zamazenta-Crowned"
                        , pokemon == "KeldeoOrdinary Form" ~ "Keldeo"
                        , pokemon == "ToxtricityLow Key Form" ~ "Toxtricity"
                        , pokemon == "IndeedeeMale" ~ "Indeedee"
                        , pokemon == "EiscueIce Face" ~ "Eiscue"
                        , pokemon == "MorpekoFull Belly Mode" ~ "Morpeko"
                        , pokemon == "DarmanitanStandard Mode" ~ "Darmanitan"
                        , str_detect(pokemon, "Galarian") ~ map_chr(pokemon, rename_region_helper, region = "Galarian")
                        , str_detect(pokemon, "Alolan") ~ map_chr(pokemon, rename_region_helper, region = "Alolan")
                        , str_detect(pokemon, "Rotom") ~ map_chr(pokemon, rename_species_helper, species = "Rotom")
                        , str_detect(pokemon, "Lycanroc") ~ map_chr(pokemon, rename_species_helper, species = "Lycanroc")
                        , TRUE ~ pokemon
                        )
                    )
  
clean_smogon_file <- function(year_month_input){
  
  # Define column names
  column_names <- c("dummy_one", "rank", "pokemon", "usage_pct", "raw", "raw_pct", "real", "real_pct", "dummy_two")

  raw_data <- read_delim(str_c("gen8ou-1825-", year_month_input,".txt")
                         , delim = "|"
                         , skip = 5
                         , col_names = column_names
                         , show_col_types = FALSE)
  
  # Pick the first 100
  cleaned_data <- raw_data %>% 
                    slice(1:100) %>% 
                    select(pokemon, usage_pct) %>% 
                    mutate(
                        usage_pct = str_replace(usage_pct, "%", "")
                        , usage_pct = as.numeric(usage_pct)
                        , pokemon = str_trim(pokemon)
                        , year_month = year_month_input
                    ) %>% 
                    left_join(to_join, by = "pokemon")
  
  return(cleaned_data)
  
}

stacked_data <- map_dfr(year_month_vec
                        , clean_smogon_file)

glimpse(stacked_data)

“One or more parsing issues, see `problems()` for details”


Rows: 1,402
Columns: 10
$ pokemon    <chr> "Landorus-Therian", "Toxapex", "Magearna", "Cinderace", "Cl…
$ usage_pct  <dbl> 35.25502, 31.25370, 26.34967, 22.53238, 21.13526, 20.76790,…
$ year_month <chr> "2021-01", "2021-01", "2021-01", "2021-01", "2021-01", "202…
$ total      <int> 600, 495, 600, 530, 483, 600, 530, 490, 540, 600, 510, 600,…
$ hp         <int> 89, 50, 80, 80, 95, 108, 100, 95, 255, 92, 110, 91, 74, 98,…
$ attack     <int> 145, 63, 95, 116, 70, 130, 125, 75, 10, 105, 65, 90, 94, 87…
$ defense    <int> 90, 152, 115, 75, 73, 95, 90, 110, 10, 90, 105, 106, 131, 1…
$ sp_atk     <int> 105, 53, 130, 65, 95, 80, 60, 100, 75, 125, 55, 130, 54, 53…
$ sp_def     <int> 80, 142, 115, 75, 90, 85, 70, 80, 135, 90, 95, 106, 116, 85…
$ speed      <int> 91, 35, 65, 119, 60, 102, 85, 30, 55, 98, 80, 7

I'll flag that there are three instances of manual replacement where selecting the exact form impacts the base stats of the Pokémon selected: `AegislashShield Forme`, `IndeedeeMale`, and `EiscueIce Face`. In all three instances, I used the [Smogon Pokédex](https://www.smogon.com/dex/ss/pokemon/) to ensure that the correct form was being used.

That being said, I can see arguments for using `Aegislash-Blade` or `Eiscue-Noice` instead, given that those are the forms used as offensive sweepers. As a sensitivity check, I reran the analysis using both of those forms, and the results didn't change; this is to be expected given that neither was used very frequently in OU.
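
The manual replacements in the cleaning step were found by joining first and then looking for rows with missing stats. A minimal sketch of that kind of check is below, using toy data with made-up usage numbers rather than the real tables; `usage_toy` and `stats_toy` are hypothetical stand-ins for one month of Smogon data and for `to_join`.

```r
library(dplyr)

# Toy stand-ins for the usage data and the scraped base stat table
usage_toy <- data.frame(
  pokemon   = c("Toxapex", "Urshifu-Rapid-Strike", "Rotom-Heat"),
  usage_pct = c(31.3, 12.1, 5.4),
  stringsAsFactors = FALSE
)
stats_toy <- data.frame(
  pokemon = c("Toxapex", "UrshifuRapid Strike Style", "Rotom-Heat", "Rotom-Heat"),
  hp      = c(50, 100, 50, 50),
  stringsAsFactors = FALSE
)

# Names in the usage data with no match in the base stat table:
# these are the ones that need a manual rename
unmatched <- anti_join(usage_toy, stats_toy, by = "pokemon")

# Names that appear more than once in the base stat table:
# each of these would duplicate rows in a left_join()
duplicated_keys <- stats_toy %>%
  count(pokemon) %>%
  filter(n > 1)
```

On the real data, the 1,402 rows in `stacked_data` (rather than 14 × 100 = 1,400) suggest that at least one name in `to_join` matches more than once, so a check along the lines of `duplicated_keys` might be worth running.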

## Calculating the "average" Pokémon's base stats

In [5]:
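summarized_data <- stacked_data %>%
  group_by(year_month) %>% 
  # Normalize usage within each month so that the weights sum to 1
  mutate(weight = usage_pct / sum(usage_pct)) %>% 
  # Multiply every base stat by its weight...
  mutate(across(hp:speed
                , ~ weight * .x
                , .names = "weighted_{.col}")) %>% 
  # ...and sum within the month to get the weighted average
  summarise(across(weighted_hp:weighted_speed
                   , ~ sum(.x)))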

glimpse(summarized_data)

Rows: 14
Columns: 7
$ year_month       <chr> "2019-11", "2019-12", "2021-01", "2021-02", "2021-03"…
$ weighted_hp      <dbl> 81.96590, 84.16067, 93.12312, 91.75425, 91.84276, 91.…
$ weighted_attack  <dbl> 95.83742, 95.71805, 97.87233, 98.36021, 100.17417, 10…
$ weighted_defense <dbl> 86.13144, 88.79475, 94.67103, 94.79502, 93.32024, 95.…
$ weighted_sp_atk  <dbl> 72.28417, 74.78787, 88.43751, 89.33527, 88.14034, 88.…
$ weighted_sp_def  <dbl> 84.04505, 86.24581, 89.83071, 90.43455, 88.57389, 91.…
$ weighted_speed   <dbl> 76.35360, 76.96890, 78.63541, 78.77809, 80.51884, 81.…
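
As a sanity check on the weighting logic, the same calculation can be reproduced on a handful of numbers. The values below are made up for illustration; the point is only that normalizing the usage percentages and summing gives the same result as base R's `weighted.mean()`.

```r
# Toy example with made-up usage percentages and HP base stats
usage_pct <- c(30, 20, 10)
hp        <- c(50, 90, 110)

# Normalize usage so that the weights sum to 1, as in the pipeline above
weight <- usage_pct / sum(usage_pct)
manual <- sum(weight * hp)

# weighted.mean() divides by the sum of the weights internally,
# so the raw percentages can be passed in directly
builtin <- weighted.mean(hp, usage_pct)
```

Both `manual` and `builtin` come out to 4400 / 60, or roughly 73.3, which sits closer to the most used Pokémon's HP than the plain mean of about 83.3 does.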


## Generating difference scores and determining the Pokémon with the closest base stats

In [6]:
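get_average_pokemon <- function(year_month_input){
  
  # Weighted-average base stats for the requested month (a single row)
  stats <- summarized_data %>% filter(year_month == year_month_input)
  
  all_pokemon_data %>%
    # Cross join: attach the month's averages to every Pokémon
    # (dplyr >= 1.1.0 offers cross_join() for this)
    full_join(stats
              , by = character()) %>%
    # Squared difference between each base stat and its weighted average
    mutate(across(hp:speed
                  , ~ (.x - get(str_c("weighted_", cur_column())))^2
                  , .names = "{.col}_diff"
                  )
    ) %>% 
    # Sum the six squared differences into one score per Pokémon
    rowwise() %>%
    mutate(diff_total = sum(c_across(hp_diff:speed_diff))) %>%
    ungroup() %>% 
    # Keep the Pokémon with the smallest score
    arrange(diff_total) %>% 
    slice(1)
}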

all_average_pokemon <- map_dfr(year_month_vec
                               , get_average_pokemon)
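
For intuition, the difference score can also be computed in a vectorized way with base R. This is a sketch on toy data with three hypothetical Pokémon and two stats, offered as an alternative to the `rowwise()` approach above rather than as the code that produced the results.

```r
# Toy base stats for three hypothetical Pokémon (rows) across two stats
base_stats <- matrix(c(80, 100,
                       60,  70,
                       95,  85),
                     ncol = 2, byrow = TRUE,
                     dimnames = list(c("A", "B", "C"), c("hp", "attack")))

# Made-up "average" Pokémon
average <- c(hp = 90, attack = 90)

# Subtract the average from every row, square, and sum across stats
diff_total <- rowSums(sweep(base_stats, 2, average)^2)

# Name of the Pokémon with the smallest score
closest <- names(which.min(diff_total))
```

Here `diff_total` is 200 for A, 1300 for B, and 50 for C, so `closest` is `"C"`. Note how B's 30-point HP gap alone contributes 900, which is the "bigger penalty on larger deviations" described in the methodology.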

In [7]:
all_average_pokemon %>% select(year_month, name) %>% View()

| year_month (chr) | name (chr) |
|---|---|
| 2021-01 | Feraligatr |
| 2021-02 | Feraligatr |
| 2021-03 | Feraligatr |
| 2021-04 | Feraligatr |
| 2021-05 | Feraligatr |
| 2021-06 | Feraligatr |
| 2021-07 | Feraligatr |
| 2021-08 | Feraligatr |
| 2021-09 | Feraligatr |
| 2021-10 | Feraligatr |


The results of the analyses are aligned with my expectations, hooray! This wasn't a data exercise that ended in futility! It's also a happy coincidence that I quite like both `Feraligatr` and `Nidoqueen` :)

What I find really interesting, though, is that neither `Feraligatr` nor `Nidoqueen` is in the OU tier. `Feraligatr` doesn't technically exist in Gen 8 because it's one of _many_ Pokémon that are not available in Sword and Shield, while `Nidoqueen` is in [Rarely Used (RU)](https://www.smogon.com/dex/ss/pokemon/nidoqueen/).

## Methodological Limitations

Whether the data itself is sufficient to answer the original question actually depends a lot on how one defines "usage". The way I see it, there are at least two ways to define usage:
1. Usage as defined on a battler's team specific level.
   * Under this definition, a Pokémon on a specific battler's team will only count **once** towards usage statistics. This is regardless of the number of battles that the battler participates in.
2. Usage as defined as the number of battles that a Pokémon appears in.
   * Under this definition, a Pokémon will **always** count towards usage statistics as long as it appears in a battle. This means that if a battler participates in X battles, each Pokémon in the team will count X times towards usage statistics.

If my understanding of the Smogon usage percentage is correct, it follows the second definition. Whether that is an accurate measurement of usage really depends on what one wants to capture, but it is something to keep in mind when interpreting the results.
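
To see how much the two definitions can diverge, here is a small sketch on a made-up battle log. The players, battles, and column names are all invented for illustration and do not reflect how Smogon stores its data.

```r
# Made-up battle log: one row per (battle, player, Pokémon) appearance
battles <- data.frame(
  battle  = c(1, 1, 2, 2, 3, 3),
  player  = c("ash", "misty", "ash", "brock", "ash", "misty"),
  pokemon = c("Toxapex", "Clefable", "Toxapex", "Clefable", "Toxapex", "Clefable"),
  stringsAsFactors = FALSE
)

# Definition 1: each (player, Pokémon) pair counts once,
# no matter how many battles that player takes part in
per_team <- table(unique(battles[, c("player", "pokemon")])$pokemon)

# Definition 2: every appearance in a battle counts
per_battle <- table(battles$pokemon)
```

Under the first definition, Clefable (on two players' teams) is used twice as much as Toxapex (on one player's team). Under the second, they are tied at three appearances each, because the single Toxapex user plays in every battle.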

# Final Thoughts

This took way more time and energy than I originally expected it to, but I am quite pleased with the level of rigour that I managed to extract out of a data analysis project involving a children's video game :)

If you have any questions or comments about the data, methodology, or just want to chat about Pokémon and how it is currently just a money grab operation that I can no longer financially support, please feel free to reach out! Especially about the last item. I have a lot of ***strong*** feelings about the franchise...