# 1. Theoretical Framework

## Introduction

**Pokémon** is a long-running multimedia franchise developed by Game Freak and published by Nintendo, centered around capturing, training, and battling fictional creatures called Pokémon. While the franchise began with turn-based RPGs, it has evolved into a global phenomenon encompassing video games, trading cards, television shows, and competitive gaming communities. Within the games themselves, players take on the role of Pokémon Trainers, building teams of creatures with unique abilities, statistics, and move sets in order to battle other trainers.

In the competitive scene, particularly through platforms such as [Pokémon Showdown](https://pokemonshowdown.com/), battles are conducted using meticulously balanced rulesets curated by the community (notably Smogon University). Players construct teams of six Pokémon chosen from a pool of hundreds, each with distinct roles, strategies, and limitations. Due to this variety, the competitive metagame is both dynamic and complex, requiring players to adapt to frequent shifts in viability and usage.

Despite this complexity, there is currently no single metric that objectively quantifies a Pokémon’s overall viability within a tier. Viability is influenced not only by raw strength but also by factors such as usage trends, versatility, and synergy with other team members. This project proposes a **Meta Viability Index (MVI)**, a composite indicator designed to capture these multidimensional aspects and produce a single score representing a Pokémon’s competitive relevance in a specific tier. This Jupyter Notebook gives a brief rundown of each section alongside the accompanying code. A separate Word document covers the aspects of the project that need fuller explanation.

---

## Justification for the Index

Competitive viability in Pokémon is a multidimensional concept. It cannot be fully captured by a single attribute like base stats or usage percentage. Instead, a Pokémon’s effectiveness in the metagame is shaped by its consistency, versatility, and how well it synergizes with common team structures. These aspects, though individually measurable, are rarely combined into a single interpretable metric.

Currently, the competitive community, particularly Smogon University, uses tier systems based on both usage statistics and expert discussions to group Pokémon. While effective in practice, these tier lists do not provide a scoring mechanism for comparing Pokémon within a tier or understanding the marginal differences between them. As a result, players often rely on subjective opinions when evaluating Pokémon for their teams.

Composite indicators are not new to games. In football video games like *FIFA*, players are scored using overall ratings that aggregate multiple statistics such as speed, stamina, passing, and shooting. Similarly, MOBA titles like *League of Legends* use matchmaking rating (MMR) and champion win rates as blended indicators to estimate performance and balance. These systems combine variables into a single numerical value to inform gameplay decisions, matchmaking, and tier lists.

This project follows a similar philosophy by introducing a composite indicator to competitive Pokémon. The proposed **Meta Viability Index (MVI)** integrates usage statistics, performance trends, and role flexibility into a unified score. This approach allows for transparent, data-driven comparison of Pokémon and helps capture the complexity of competitive viability in a structured and objective way.

---

## Data Source

The primary data used in this index comes from [Smogon.com](https://www.smogon.com/stats/), which publishes monthly statistics from Pokémon Showdown, an online battle simulator widely used by the competitive community. These statistics are:
- **Large-scale** (millions of battles per month),
- **Current** (updated monthly),
- **Representative** of actual meta usage,
- **Structured** and consistent for processing.

As expert interviews or official Pokémon Company data are unavailable, these community-driven sources represent the most authoritative and widely accepted datasets available for this kind of analysis.

---

## Defining Meta Viability and Variable Selection Criteria

For the purposes of this project, **meta viability** is defined as a Pokémon’s overall effectiveness, consistency, and strategic value within a competitive tier. It reflects how often a Pokémon is used, how reliably it performs, and how flexibly it fits into a variety of viable team compositions. This concept is inherently multidimensional and must be captured through a combination of different measurable attributes.

To ensure methodological consistency and relevance, the variables selected for the Meta Viability Index (MVI) must meet the following criteria:

- Be **publicly available** and consistently measured across all Pokémon in the dataset.
- Represent **distinct aspects** of competitive performance to avoid redundancy.
- Be **quantifiable** and suitable for standardisation.
- Offer **strategic insight** into how and why a Pokémon is effective in its role(s).

The selected variables will be grouped into three broad categories to support a balanced evaluation:
- **Usage Metrics** (e.g., usage rate, lead frequency)
- **Performance Metrics** (e.g., win rate, success over time)
- **Flexibility Metrics** (e.g., moveset diversity, role variety)

This structure will guide the data selection and analysis process in the sections that follow, supporting the creation of a composite indicator that captures the full competitive profile of each Pokémon.

---

## Relevance to Players and Researchers

This index may prove useful to:
- **Competitive players** aiming to identify meta threats or reliable team staples.
- **Analysts** seeking trends in usage and performance.
- **Developers** and community tiering councils looking for data to support bans or suspect tests.

# 2. Data Selection

## Overview

This section outlines the process of selecting appropriate indicators for the Meta Viability Index (MVI). All variables were chosen for their relevance to the concept of meta viability and their availability from credible data sources. Each variable captures a distinct aspect of a Pokémon’s performance in competitive play, as defined in the previous section.

The focus of this project is strictly on Pokémon within a single tier: **Gen 9 OverUsed (OU)**, to ensure fair comparisons and avoid confounding the analysis with the effects of tier-based filtering. The MVI is designed to rank Pokémon within this tier, based on how viable they are relative to Pokémon in the same tier.

---

## Selected Variables and Justification

| Variable                    | Category            | Description                                                   | Rationale |
|----------------------------|---------------------|---------------------------------------------------------------|-----------|
| **Usage Percentage**       | Usage Metric        | The proportion of teams in OU that include the Pokémon        | Indicates popularity and meta centrality |
| **Raw Count of Battles**   | Usage Metric        | Total number of battles featuring the Pokémon                 | Helps validate reliability of the usage % |
| **Lead Usage Percentage**  | Usage Metric        | Frequency of use as the first Pokémon in battle               | Suggests role consistency and strategic fit |
| **Moveset Diversity Score**| Flexibility Metric  | Number of distinct viable movesets used                       | Reflects versatility and unpredictability |
| **Teammate Diversity Score** | Flexibility Metric | Number of distinct teammates commonly paired with the Pokémon | Indicates adaptability to team archetypes |


---

## Strengths and Limitations of the Data

**Strengths:**
- Data is sourced from [Smogon.com](https://www.smogon.com/stats/), based on Pokémon Showdown battles, a widely popular, trusted and consistent source.
- Available monthly and by tier, making it easy to isolate Gen 9 OU data as well as data of any other tier.
- Usage and moveset information is based on real player behavior and reflects current metagame trends.

**Limitations:**
- No direct win/loss data or performance outcomes are available.
- Moveset and teammate diversity require computation and may be sensitive to parsing accuracy.
- Synergy and matchup effectiveness are not directly captured and must be inferred.

In cases of data scarcity, **proxy variables** were selected to approximate competitive value. For instance, **moveset diversity** is used as a stand-in for overall flexibility, while **lead usage** reflects reliability in a defined role. Because the lead slot is an important role on a team, Pokémon that are chosen as leads more often are treated as more viable overall.
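
One way such a diversity proxy could be computed from the per-option percentages in the moveset file is sketched below. The function name `diversity_score`, the 5% cut-off, and the exclusion of an aggregated `Other` row are assumptions made for illustration; they are not the scoring rule adopted by this project.

```python
def diversity_score(option_usage, min_pct=5.0, ignore=("Other",)):
    """Count the options (moves or teammates) used at least `min_pct` percent of the time.

    option_usage maps an option name to its usage percentage, for example the
    "Moves" or "Teammates" dictionary parsed from a moveset block.
    The threshold and the ignored labels are illustrative assumptions.
    """
    return sum(
        1
        for name, pct in option_usage.items()
        if name not in ignore and pct >= min_pct
    )


# Made-up percentages, for illustration only.
sample_moves = {
    "Earthquake": 92.0,
    "Rapid Spin": 71.5,
    "Knock Off": 48.3,
    "Ice Spinner": 30.1,
    "Stealth Rock": 4.2,
    "Other": 9.9,
}
print(diversity_score(sample_moves))  # 4
```

A count above a threshold is only one candidate. An entropy-based measure would also reward evenly spread usage, at the cost of being harder to interpret.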

---

## Summary of Data Characteristics

| Variable                    | Source     | Availability | Type       | Role        |
|----------------------------|------------|--------------|------------|-------------|
| Usage Percentage           | Smogon     | Monthly OU   | Hard       | Output      |
| Raw Count of Battles       | Smogon     | Monthly OU   | Hard       | Input       |
| Lead Usage Percentage      | Smogon     | Monthly OU   | Hard       | Process     |
| Moveset Diversity Score    | Derived    | Computed     | Soft       | Process     |
| Teammate Diversity Score   | Derived    | Computed     | Soft       | Input       |

These variables form the basis of the MVI and will be normalised, weighted, and aggregated in the following sections.
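
As a rough preview of that pipeline, the sketch below min-max normalises each variable to the range [0, 1] and combines the results with a weighted sum. Min-max scaling is just one common choice, and the weights, sample values, and helper names (`min_max`, `composite_score`) are placeholders for illustration; the method and weights actually used are defined in the later sections.

```python
def min_max(values):
    """Rescale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def composite_score(rows, weights):
    """Weighted sum of min-max normalised variables.

    rows:    {name: {variable: raw value}}
    weights: {variable: weight}, assumed to sum to 1.
    """
    names = list(rows)
    scores = {name: 0.0 for name in names}
    for variable, weight in weights.items():
        normalised = min_max([rows[name][variable] for name in names])
        for name, value in zip(names, normalised):
            scores[name] += weight * value
    return scores


# Placeholder data and weights, not real Smogon figures.
sample = {
    "A": {"usage": 30.0, "lead": 10.0, "moves": 6},
    "B": {"usage": 10.0, "lead": 0.0, "moves": 4},
    "C": {"usage": 20.0, "lead": 5.0, "moves": 8},
}
weights = {"usage": 0.5, "lead": 0.25, "moves": 0.25}
print(composite_score(sample, weights))
```

With these placeholder inputs, `A` scores 0.875, `B` scores 0.0, and `C` scores 0.625.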

# 3. Imputation of Missing Data

## Overview

Missing data can result from Pokémon being used too infrequently to be logged in all Smogon statistical categories, or from structural inconsistencies across the data files. Left unaddressed, these gaps may introduce bias or instability into the composite indicator.

The objective of this stage is to ensure that the dataset used for the Meta Viability Index (MVI) is both complete and robust. The imputation strategy includes a combination of filtering and logical defaulting (e.g., setting missing values to 0% where appropriate), as well as removing entries with insufficient information. As Smogon is an extremely reliable source, most gaps are handled by filtering rather than by estimating values.

---

## Variable-Specific Handling

### Usage Percentage & Raw Count
All Pokémon in the `usage.txt` file have valid usage statistics, but not all are relevant for inclusion in a competitive analysis. To reduce noise, any Pokémon with a usage percentage below **0.5%** was removed from the dataset. This threshold filters out fringe or gimmick entries and ensures that only Pokémon with meaningful presence in the metagame are included.
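
For reference, the sketch below applies the same 0.5% cut-off to a few rows laid out like the table in `usage.txt`. The rows are invented (the names and figures are not real statistics), the column layout is assumed from the published files, and `filter_usage` is a hypothetical helper that works on text rather than a file path.

```python
SAMPLE_USAGE = """\
 | Rank | Pokemon            | Usage %   | Raw    | %       |
 | 1    | Example A          | 31.20000% | 312000 | 26.100% |
 | 2    | Example B          |  0.75000% |   7500 |  0.700% |
 | 3    | Example C          |  0.49000% |   4900 |  0.450% |
"""


def filter_usage(text, min_usage_pct=0.5):
    """Return (name, usage %) pairs at or above the threshold."""
    kept = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.strip().split("|")]
        # Data rows look like ['', rank, name, usage, ...]; the header row
        # is skipped because its rank column is not a number.
        if len(parts) < 4 or not parts[1].isdigit():
            continue
        usage_pct = float(parts[3].rstrip("%"))
        if usage_pct >= min_usage_pct:
            kept.append((parts[2], usage_pct))
    return kept


print(filter_usage(SAMPLE_USAGE))
# [('Example A', 31.2), ('Example B', 0.75)]
```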

### Lead Usage Percentage
Pokémon that are not used as leads are often absent from the `leads.txt` file. Any Pokémon in the filtered usage data that does not appear in the leads file is assigned a **lead usage percentage of 0.0%**. This reflects the fact that it is never used in the lead position, rather than the value being truly “missing”. Likewise, any Pokémon removed by the usage filter is also removed from the leads data.
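
The defaulting rule can be illustrated without any libraries. In this minimal sketch, `align_leads` is a hypothetical helper and the names and values are invented:

```python
def align_leads(usage_names, lead_pct):
    """Return {name: lead %} for every retained Pokémon.

    usage_names: names that survived the usage filter.
    lead_pct:    {name: lead usage %} parsed from the leads file.
    Names absent from lead_pct get 0.0; names not in usage_names are dropped.
    """
    return {name: lead_pct.get(name, 0.0) for name in usage_names}


# Invented names and values, for illustration only.
retained = ["Example A", "Example B"]
leads = {"Example A": 12.5, "Example Z": 3.0}
print(align_leads(retained, leads))
# {'Example A': 12.5, 'Example B': 0.0}
```

Because the result is built from the usage list, both halves of the rule (default to zero, drop filtered-out entries) fall out of a single pass.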

### Moveset and Teammate Diversity
Derived from the `moveset.txt` file, these variables are only available for Pokémon with sufficient representation in battle logs. If a Pokémon from the filtered usage list does not appear in the moveset file, it is excluded.
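
A minimal sketch of this exclusion step follows. The helper name `drop_without_moveset` and the sample names are assumptions; returning the dropped names alongside the kept ones is a design choice that makes the exclusions easy to audit.

```python
def drop_without_moveset(usage_names, moveset_names):
    """Split retained Pokémon into those with and without a moveset block."""
    available = set(moveset_names)
    kept = [name for name in usage_names if name in available]
    dropped = [name for name in usage_names if name not in available]
    return kept, dropped


# Invented names, for illustration only.
kept, dropped = drop_without_moveset(
    ["Example A", "Example B", "Example C"],
    ["Example A", "Example C"],
)
print(kept)     # ['Example A', 'Example C']
print(dropped)  # ['Example B']
```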

---

## Summary of Strategy

- Pokémon below 0.5% usage were excluded.
- Pokémon with missing lead usage were retained, with lead usage set to 0%.
- Pokémon missing moveset or teammate data were excluded to preserve indicator consistency.
- No synthetic or model-based imputation was used; all substitutions reflect either logical defaults or established exclusion criteria.

The cleaned dataset resulting from this process serves as a stable foundation for the normalisation and aggregation steps that follow.

# 3.1 Imputation of Missing Data - Code

## Imports

In [181]:
import re

import pandas as pd
from IPython.display import display, HTML

## Usage Parsing

In [182]:
def parse_and_filter_usage(filepath, min_usage_pct=0.5):
    """
    Parses a Smogon usage.txt file and filters out Pokémon below a minimum usage threshold.

    Args:
        filepath (str or Path): Path to the usage text file.
        min_usage_pct (float): Minimum usage percentage to keep a Pokémon.

    Returns:
        pd.DataFrame: Filtered DataFrame with Rank, Pokémon, and Usage %.
    """
    with open(filepath, "r", encoding="utf-8") as file:
        lines = file.readlines()

    data = []
    for line in lines:
        parts = line.strip().split("|")
        if len(parts) >= 4:
            try:
                rank = int(parts[1].strip())
                pokemon = parts[2].strip()
                usage_pct = float(parts[3].strip().replace('%', ''))
                data.append((rank, pokemon, usage_pct))
            except ValueError:
                continue

    df = pd.DataFrame(data, columns=["Rank", "Pokemon", "Usage %"])
    return df[df["Usage %"] >= min_usage_pct].reset_index(drop=True)

## Leads Parsing & Alignment

In [183]:
def parse_and_align_leads(filepath, usage_df):
    """
    Parses a Smogon leads.txt file and aligns it with a filtered usage DataFrame.

    Any Pokémon present in the usage data but missing from the leads data will be
    added with a lead usage percentage of 0.0%.

    Args:
        filepath (str or Path): Path to the leads text file.
        usage_df (pd.DataFrame): Filtered usage DataFrame containing valid Pokémon.

    Returns:
        pd.DataFrame: Cleaned and aligned DataFrame with columns:
                      'Pokemon', 'Lead Usage %'
    """
    with open(filepath, "r", encoding="utf-8") as file:
        lines = file.readlines()

    data = []
    for line in lines:
        parts = line.strip().split("|")
        if len(parts) >= 4:
            try:
                pokemon = parts[2].strip()
                lead_usage_pct = float(parts[3].strip().replace('%', ''))
                data.append((pokemon, lead_usage_pct))
            except ValueError:
                continue

    df_leads = pd.DataFrame(data, columns=["Pokemon", "Lead Usage %"])

    # Keep only Pokémon that survived the usage filter
    df_leads_filtered = df_leads[df_leads["Pokemon"].isin(usage_df["Pokemon"])].copy()

    # Add Pokémon from usage that never appear as leads, with 0% lead usage
    missing = usage_df.loc[
        ~usage_df["Pokemon"].isin(df_leads_filtered["Pokemon"]), ["Pokemon"]
    ].copy()
    if not missing.empty:
        missing["Lead Usage %"] = 0.0
        df_leads_filtered = pd.concat([df_leads_filtered, missing], ignore_index=True)

    # Sort by Lead Usage % in descending order
    df_leads_filtered = df_leads_filtered.sort_values("Lead Usage %", ascending=False).reset_index(drop=True)

    return df_leads_filtered

## Moveset Parsing & Alignment

In [184]:
def parse_and_align_moveset(filepath, usage_df):
    """
    Splits a Smogon moveset file into per-Pokémon text blocks, including all sections.

    Args:
        filepath (str): Path to the moveset file.
        usage_df (pd.DataFrame or None): Filtered usage DataFrame. Only Pokémon listed
            in it are kept; pass None to keep every block.

    Returns:
        dict: Maps each Pokémon name to the list of text lines in its block.
    """
    with open(filepath, "r", encoding="utf-8") as file:
        lines = file.readlines()

    all_blocks = {}
    current_block = []
    current_pokemon = None
    valid_names = set(usage_df["Pokemon"]) if usage_df is not None else None

    previous_line = ""
    for line in lines:
        line_stripped = line.strip()

        # Detect new Pokémon block start: | Name | and previous line is a +------+ border
        if (
            line_stripped.startswith("|") and line_stripped.endswith("|")
            and not any(x in line_stripped for x in ["Raw count", "Abilities", "Items", "Spreads", "Moves",
                                                     "Teammates", "Tera", "Checks", "Avg. weight", "Viability", "%", ":"])
            and len(line_stripped.strip("|").strip()) > 0
            and previous_line.strip().startswith("+")  # Required: follows a separator line
        ):
            # Save previous block
            if current_pokemon and (valid_names is None or current_pokemon in valid_names):
                all_blocks[current_pokemon] = current_block

            # Start new block
            current_pokemon = line_stripped.strip("|").strip()
            current_block = [line.rstrip()]
        else:
            if current_pokemon:
                current_block.append(line.rstrip())

        previous_line = line

    # Catch final Pokémon block
    if current_pokemon and (valid_names is None or current_pokemon in valid_names):
        all_blocks[current_pokemon] = current_block

    return all_blocks

In [185]:
def parse_pokemon_blocks_to_dict(all_blocks):
    """
    Takes a dictionary of Pokémon blocks and parses each block into a structured dictionary.
    
    Args:
        all_blocks (dict): Dictionary where keys are Pokémon names and values are lists of text lines.
        
    Returns:
        pd.DataFrame: DataFrame where each row represents a Pokémon and columns contain structured data.
    """
    structured_data = []

    for pokemon, block in all_blocks.items():
        poke_data = {
            "Pokemon": pokemon,
            "Raw count": None,
            "Avg. weight": None,
            "Viability Ceiling": None,
            "Abilities": {},
            "Items": {},
            "Spreads": {},
            "Moves": {},
            "Tera Types": {},
            "Teammates": {},
        }

        current_section = None

        for line in block:
            stripped = line.strip()

            # Parse raw count, weight, and viability
            if "Raw count" in stripped:
                match = re.search(r"Raw count:\s*([\d,]+)", stripped)
                if match:
                    poke_data["Raw count"] = int(match.group(1).replace(",", ""))
            elif "Avg. weight" in stripped:
                match = re.search(r"Avg\. weight:\s*([\d.]+)", stripped)
                if match:
                    poke_data["Avg. weight"] = float(match.group(1))
            elif "Viability Ceiling" in stripped:
                match = re.search(r"Viability Ceiling:\s*(\d+)", stripped)
                if match:
                    poke_data["Viability Ceiling"] = int(match.group(1))

            # Section headers
            elif stripped.startswith("|") and "%" not in stripped and ":" not in stripped and len(stripped) > 5:
                section_title = stripped.strip("|").strip()
                if section_title in poke_data:
                    current_section = section_title
                else:
                    current_section = None

            # Section data
            elif "%" in stripped and current_section:
                entry_match = re.match(r"\|\s*(.*?)\s+([\d.]+)%", stripped)
                if entry_match:
                    key, val = entry_match.groups()
                    poke_data[current_section][key.strip()] = float(val)

        structured_data.append(poke_data)

    df_structured = pd.DataFrame(structured_data)
    return df_structured
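
To make the entry-matching step easier to follow, the sketch below runs the same regular expression on a few invented rows shaped like entries in a moveset block. The sample rows and the name `ENTRY_PATTERN` are illustrative only.

```python
import re

# Same pattern as in the parser above: a leading "|", a lazily matched key,
# whitespace, then a number followed by "%".
ENTRY_PATTERN = re.compile(r"\|\s*(.*?)\s+([\d.]+)%")

# Invented rows, for illustration only.
sample_lines = [
    "| Earthquake 92.000% |",
    "| Knock Off 48.300% |",
    "| Jolly:0/252/4/0/0/252 17.250% |",
    "| Other 9.900% |",
]

parsed = {}
for line in sample_lines:
    match = ENTRY_PATTERN.match(line)
    if match:
        key, value = match.groups()
        parsed[key] = float(value)

print(parsed)
# {'Earthquake': 92.0, 'Knock Off': 48.3, 'Jolly:0/252/4/0/0/252': 17.25, 'Other': 9.9}
```

Multi-word names and spread strings are captured whole because the key group is lazy and stops only at the whitespace before the percentage. An aggregated `Other` row, where present, is stored like any other key, so later steps may need to treat it separately.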

## Parsing each file we need

In [None]:
# Parsing our usage file, serves as the base for leads and moveset.
df_usage_jan = parse_and_filter_usage("Data/smogon_2025-01_gen9ou_usage.txt")


df_leads_jan = parse_and_align_leads("Data/smogon_2025-01_gen9ou_leads.txt", df_usage_jan)

# Parsing the file in the way it appears, then parsing said output into a dict to use in a DataFrame
all_blocks = parse_and_align_moveset("Data/smogon_2025-01_gen9ou_moveset.txt", df_usage_jan)
df_structured = parse_pokemon_blocks_to_dict(all_blocks)

# Uncomment the line below to see the Usage DataFrame
# display(df_usage_jan.style.hide(axis="index"))

# Uncomment the line below to see the Leads DataFrame
# display(df_leads_jan.style.hide(axis="index"))

# Uncomment the line below to see the Moveset DataFrame
# display(df_structured.style.hide(axis="index"))