# Preprocess Data Alternate

Scott Ratchford, (c) 2025

See LICENSE.txt for license information.

This notebook preprocesses the Pokemon data used in this project.

It uses a smaller Pokemon image dataset than `preprocess_data.ipynb`. The smaller dataset contains no duplicates, at the cost of fewer images. The trade-off is worthwhile because the larger dataset contains many grayscale images and many "shiny" images (extremely rare alternate-color forms of Pokemon), both of which would likely distort the color analysis.

## Data

1. Pokemon Images - [Pokemon with Stats and Images by Christoffer MS, Kaggle](https://www.kaggle.com/datasets/christofferms/pokemon-with-stats-and-image)
2. Pokemon Statistics - ["Pokemon Pokedex" by Kumar Arnav, Kaggle](https://www.kaggle.com/datasets/arnavvvvv/pokemon-pokedex)

## Setup

### Instructions

1. Download the `.zip` version of each dataset listed above and unzip it.
2. Update the 

### Imports

In [39]:
from collections import Counter
import os
import random
from typing import Tuple

import numpy as np
import pandas as pd
from PIL import Image
import webcolors

### Parameters

In [None]:
# Seed the random number generator
random.seed(151)

# Defines train-test split
train_percent = 0.8   # If train percent is 80%, test percent is 20%

CWD = os.getcwd()

# Input paths for data (modify as required)
PKMN_IMG_ALT_DIR = os.path.join(CWD, "data", "pokemon_images", "images")        # Directory containing all Pokemon images
POKEDEX_ALT_PATH = os.path.join(CWD, "data", "pokemon_images", "pokedex.csv")   # Modified .csv file from "Pokemon with Stats and Images"
PKMN_STATS_PATH = os.path.join(CWD, "data", "pokemon_stats.csv")                # .csv file from "Pokemon Pokedex"

# Output paths for created and modified data (modify as required)
PKMN_IMG_COLORS_ALT_PATH = os.path.join(CWD, "pokemon_colors.csv")              # Pokemon color data output
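
Because every later cell depends on these paths, it can help to confirm that they exist before running the rest of the notebook. The following is a minimal sketch rather than part of the original notebook; the helper name `find_missing_paths` is hypothetical, and a temporary directory stands in for the real data folder.

```python
import os
import tempfile
from typing import Iterable, List


def find_missing_paths(paths: Iterable[str]) -> List[str]:
    """Return the paths that do not exist on disk (hypothetical helper)."""
    return [path for path in paths if not os.path.exists(path)]


# Demonstration: a temporary directory stands in for the data folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    present = os.path.join(tmp_dir, "pokedex.csv")
    absent = os.path.join(tmp_dir, "pokemon_stats.csv")
    open(present, "w").close()  # create an empty placeholder file
    missing = find_missing_paths([tmp_dir, present, absent])

print(missing)  # contains only the path that was never created
```

In this notebook the equivalent call would be `find_missing_paths([PKMN_IMG_ALT_DIR, POKEDEX_ALT_PATH, PKMN_STATS_PATH])`, and a non-empty result would indicate a path that still needs updating.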

### Load Data

In [41]:
pokedex_df = pd.read_csv(POKEDEX_ALT_PATH, index_col=0)
num_pkmn = pokedex_df.shape[0]
pokedex_df["Name"] = pokedex_df["Name"].apply(lambda x: x.lower())

print(f"Loaded {num_pkmn} Pokemon.")

Loaded 1215 Pokemon.


### Preprocessing Data

#### Add Pokemon Numbers to Data

In [42]:
# Load Pokedex numbers from stats file
pkmn_numbers = pd.read_csv(PKMN_STATS_PATH)
numbers_keep_cols = ["Number", "Name", ]
numbers_drop_cols = [col for col in pkmn_numbers.columns if col not in numbers_keep_cols]
pkmn_numbers = pkmn_numbers.drop(labels=numbers_drop_cols, axis=1)
pkmn_numbers["Name"] = pkmn_numbers["Name"].apply(lambda x: x.lower())

# Add Pokedex numbers to main dataset
pokedex_df = pd.merge(pokedex_df, pkmn_numbers, on=["Name", ], how="inner")
# missing_df = pd.merge(pokedex_df, pkmn_numbers, on=["Name", ], how="outer", indicator=True)
# missing_df = missing_df.query("_merge != \"both\"")
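
The inner merge silently drops any Pokemon whose name does not appear in both files. The commented-out lines above hint at a way to audit this; the sketch below shows the same idea on toy data. The frames `images_df` and `stats_df` and their values are illustrative assumptions, not rows from the real datasets.

```python
import pandas as pd

# Toy stand-ins for the two datasets (illustrative values only).
images_df = pd.DataFrame({"Name": ["bulbasaur", "pikachu", "missingno"]})
stats_df = pd.DataFrame(
    {"Number": [1, 25, 150], "Name": ["bulbasaur", "pikachu", "mewtwo"]}
)

# An outer merge with indicator=True adds a "_merge" column recording whether
# each row came from the left frame, the right frame, or both.
audit_df = pd.merge(images_df, stats_df, on="Name", how="outer", indicator=True)
unmatched_df = audit_df[audit_df["_merge"] != "both"]

left_only = sorted(unmatched_df.loc[unmatched_df["_merge"] == "left_only", "Name"])
right_only = sorted(unmatched_df.loc[unmatched_df["_merge"] == "right_only", "Name"])

print(left_only)   # names with an image row but no stats row
print(right_only)  # names with a stats row but no image row
```

Names that show up in either list are usually spelling or formatting mismatches between the two sources, so inspecting them is a quick way to see what the inner merge discards.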

#### Drop Ignored Pokemon

In [44]:
ignored_pkmn = pokedex_df[pokedex_df["Ignore"] == True].shape[0]
pokedex_df = pokedex_df[pokedex_df["Ignore"] == False]
num_pkmn = pokedex_df.shape[0]

print(f"{num_pkmn} Pokemon remain after dropping {ignored_pkmn} Pokemon. The dropped Pokemon were ignored or their numbers were missing from the data set.")

959 Pokemon remain after dropping 2 Pokemon. The dropped Pokemon were ignored or their numbers were missing from the data set.


#### Split Pokemon into Test and Training

Shuffle the Pokedex, then mark the first `train_percent` of the rows as training data and the remainder as test data.

In [45]:
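# Randomly shuffle the Pokedex.
# DataFrame.sample draws from NumPy's random state, not Python's `random` module,
# so the seed is passed explicitly to keep the split reproducible.
pokedex_df = pokedex_df.sample(frac=1, random_state=151).reset_index(drop=True)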

num_train_pkmn = int(num_pkmn * train_percent)
num_test_pkmn = num_pkmn - num_train_pkmn

train_pct = f"{train_percent * 100:.0f}"
test_pct = f"{(1 - train_percent) * 100:.0f}"

# Apply Train column values (randomly distributed)
train_values = np.full(shape=num_train_pkmn, fill_value=True, dtype=bool)
test_values = np.full(shape=num_test_pkmn, fill_value=False, dtype=bool)
pokedex_df["Train"] = pd.Series(np.concatenate([train_values, test_values]))  # np.concat only exists in NumPy 2.0+

pokedex_train_df = pokedex_df[pokedex_df["Train"] == True]
pokedex_test_df = pokedex_df[pokedex_df["Train"] == False]

# Ensure values were correctly applied
if (pokedex_train_df.shape[0] != num_train_pkmn) or (pokedex_test_df.shape[0] != num_test_pkmn):
    raise ValueError("Pokedex train values were not applied correctly.")

print(f"The dataset of {num_pkmn} Pokemon was split into {num_train_pkmn} training Pokemon and {num_test_pkmn} testing Pokemon, an {train_pct}:{test_pct} split.")

The dataset of 959 Pokemon was split into 767 training Pokemon and 192 testing Pokemon, an 80:20 split.
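
The split arithmetic and the shuffle can be checked in isolation. The sketch below is a simplified stand-in for the cell above, not the notebook's exact code: `split_train_test` is a hypothetical helper, the toy frame is illustrative, and the seed value is assumed to match the one in the Parameters cell. It passes `random_state` to `DataFrame.sample` because pandas draws from NumPy's random state, which `random.seed` does not affect.

```python
import pandas as pd

SEED = 151  # assumption: same value as the seed in the Parameters cell


def split_train_test(df: pd.DataFrame, train_fraction: float, seed: int) -> pd.DataFrame:
    """Shuffle df reproducibly and flag the first train_fraction of rows as training."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    num_train = int(len(shuffled) * train_fraction)
    # After reset_index the index is 0..n-1, so this marks the first num_train rows.
    shuffled["Train"] = shuffled.index < num_train
    return shuffled


toy_df = pd.DataFrame({"Name": [f"pkmn_{i}" for i in range(10)]})
first = split_train_test(toy_df, 0.8, SEED)
second = split_train_test(toy_df, 0.8, SEED)

print(int(first["Train"].sum()), int((~first["Train"]).sum()))
print(first.equals(second))  # the same seed gives the same shuffle
```

The same arithmetic explains the counts reported above: `int(959 * 0.8)` is 767, which leaves 192 Pokemon for testing.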


## Analyze Pokemon Image Colors

### Color Analysis Functions

In [46]:
COLOR_DICT = {
    # '#00ffff': 'aqua',
    '#000000': 'black',
    '#0000ff': 'blue',
    '#ff00ff': 'pink',
    '#008000': 'green',
    # '#808080': 'gray',
    # '#00ff00': 'lime',
    # '#800000': 'maroon',
    # '#000080': 'navy',
    # '#808000': 'olive',
    '#800080': 'purple',
    '#ff0000': 'red',
    # '#c0c0c0': 'silver',
    # '#008080': 'teal',
    '#ffffff': 'white',
    '#ffff00': 'yellow',
    '#d29214': 'orange',
}
COLOR_NAMES = list(COLOR_DICT.values())

In [48]:
def closest_color(requested_color: Tuple[int, int, int]) -> str:
    """
    Given an RGB tuple, find the closest color name based on Euclidean distance.
    
    Parameters:
        requested_color (Tuple[int, int, int]): The RGB color tuple.
        
    Returns:
        str: The name of the closest color.
    """
    min_distance = float('inf')
    closest_name = None
    for hex_code, name in COLOR_DICT.items():
        rgb = webcolors.hex_to_rgb(hex_code)
        # Squared Euclidean distance; the square root is skipped because it does not change the ordering.
        distance = sum((comp1 - comp2) ** 2 for comp1, comp2 in zip(requested_color, rgb))
        if distance < min_distance:
            min_distance = distance
            closest_name = name

    return closest_name

def image_color_breakdown(image_path: str) -> pd.DataFrame:
    """
    Given the path to a PNG image, compute the percentage breakdown
    of the colors present in the image.
    Transparent pixels (alpha == 0) are ignored.
    
    Parameters:
        image_path (str): The path to the PNG image.
        
    Returns:
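        pd.DataFrame: A single-row DataFrame with one column per name in COLOR_NAMES,
            holding the percentage of non-transparent pixels closest to that color.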
    """
    # Open the image in RGBA mode to handle transparency.
    img = Image.open(image_path).convert('RGBA')
    # Filter out pixels where the alpha channel is 0 (fully transparent).
    pixels = [pixel for pixel in img.getdata() if pixel[3] != 0]
    
    if not pixels:
        raise ValueError("No non-transparent pixels found in the image.")
    
    total_pixels = len(pixels)
    
    # Count the occurrences of each allowed color.
    color_counts = Counter()
    for pixel in pixels:
        rgb = pixel[:3]
        color_name = closest_color(rgb)
        color_counts[color_name] += 1
    
    # Prepare the breakdown dictionary with all allowed colors.
    breakdown = {}
    for color in COLOR_NAMES:
        breakdown[color] = 0.0  # default 0%
    
    # Compute the percentage for each color that occurred.
    for color, count in color_counts.items():
        breakdown[color] = (count / total_pixels) * 100
    
    # Create the DataFrame with the columns in the required order.
    df = pd.DataFrame([breakdown], columns=COLOR_NAMES)
    
    return df
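
To make the nearest-color matching concrete, the sketch below applies the same idea to a handful of hand-written RGBA pixels using only the standard library. It is a simplified illustration, not the notebook's implementation: `PALETTE`, `nearest_name`, and `breakdown` are hypothetical names, and the palette is a reduced version of `COLOR_DICT`.

```python
from collections import Counter
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]
RGBA = Tuple[int, int, int, int]

# Reduced illustrative palette; the notebook's COLOR_DICT has more entries.
PALETTE: Dict[str, RGB] = {
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "red": (255, 0, 0),
    "blue": (0, 0, 255),
}


def nearest_name(rgb: RGB) -> str:
    """Return the palette name with the smallest squared Euclidean distance to rgb."""
    return min(
        PALETTE,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(rgb, PALETTE[name])),
    )


def breakdown(pixels: List[RGBA]) -> Dict[str, float]:
    """Percentage of non-transparent pixels closest to each palette color."""
    opaque = [pixel for pixel in pixels if pixel[3] != 0]
    if not opaque:
        raise ValueError("No non-transparent pixels found.")
    counts = Counter(nearest_name(pixel[:3]) for pixel in opaque)
    return {name: 100 * counts[name] / len(opaque) for name in PALETTE}


pixels = [
    (250, 10, 10, 255),    # close to red
    (200, 30, 30, 255),    # darker, but still closest to red
    (5, 5, 240, 255),      # close to blue
    (240, 240, 240, 255),  # close to white
    (0, 0, 0, 0),          # fully transparent, so it is ignored
]
result = breakdown(pixels)
print(result)  # {'black': 0.0, 'white': 25.0, 'red': 50.0, 'blue': 25.0}
```

Note that the transparent pixel is black in RGB terms but does not count toward black, which is why the alpha filter matters for sprites on transparent backgrounds.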

### Analyze Training and Testing Pokemon Image Colors

In [49]:
drop_cols = ['Total', 'HP', 'Attack', 'Defense', 'SP. Atk.', 'SP. Def', 'Speed', 'Ignore', ]
create_cols = COLOR_NAMES.copy()

In [50]:
pkmn_color_train_df = pokedex_train_df.drop(labels=drop_cols, axis=1)
for col in create_cols:
    pkmn_color_train_df[col] = np.nan

for i, row in pkmn_color_train_df.iterrows():
    img_path = os.path.join(PKMN_IMG_ALT_DIR, row["Filename"])
    temp_df = image_color_breakdown(img_path)
    for col in temp_df.columns:
        pkmn_color_train_df.at[i, col] = temp_df.at[0, col]
    print(f"Finished analyzing the colors of the Pokemon {row['Name'].capitalize()}.")  # single quotes inside the f-string work before Python 3.12

print("Analyzed the colors of all training Pokemon images.")

Finished analyzing the colors of the Pokemon Flaaffy.
Finished analyzing the colors of the Pokemon Totodile.
Finished analyzing the colors of the Pokemon Snubbull.
Finished analyzing the colors of the Pokemon Florges.
Finished analyzing the colors of the Pokemon Karrablast.
Finished analyzing the colors of the Pokemon Sawsbuck.
Finished analyzing the colors of the Pokemon Pansear.
Finished analyzing the colors of the Pokemon Lanturn.
Finished analyzing the colors of the Pokemon Roaring moon.
Finished analyzing the colors of the Pokemon Machoke.
Finished analyzing the colors of the Pokemon Durant.
Finished analyzing the colors of the Pokemon Roggenrola.
Finished analyzing the colors of the Pokemon Grovyle.
Finished analyzing the colors of the Pokemon Ferroseed.
Finished analyzing the colors of the Pokemon Scovillain.
Finished analyzing the colors of the Pokemon Gliscor.
Finished analyzing the colors of the Pokemon Salandit.
Finished analyzing the colors of the Pokemon Iron boulder.
...

In [51]:
pkmn_color_test_df = pokedex_test_df.drop(labels=drop_cols, axis=1)
for col in create_cols:
    pkmn_color_test_df[col] = np.nan

for i, row in pkmn_color_test_df.iterrows():
    img_path = os.path.join(PKMN_IMG_ALT_DIR, row["Filename"])
    temp_df = image_color_breakdown(img_path)
    for col in temp_df.columns:
        pkmn_color_test_df.at[i, col] = temp_df.at[0, col]
    print(f"Finished analyzing the colors of the Pokemon {row['Name'].capitalize()}.")  # single quotes inside the f-string work before Python 3.12

print("Analyzed the colors of all testing Pokemon images.")

Finished analyzing the colors of the Pokemon Feebas.
Finished analyzing the colors of the Pokemon Tinkaton.
Finished analyzing the colors of the Pokemon Cinderace.
Finished analyzing the colors of the Pokemon Duskull.
Finished analyzing the colors of the Pokemon Gigalith.
Finished analyzing the colors of the Pokemon Servine.
Finished analyzing the colors of the Pokemon Bayleef.
Finished analyzing the colors of the Pokemon Sharpedo.
Finished analyzing the colors of the Pokemon Phantump.
Finished analyzing the colors of the Pokemon Quaxwell.
Finished analyzing the colors of the Pokemon Aurorus.
Finished analyzing the colors of the Pokemon Vullaby.
Finished analyzing the colors of the Pokemon Treecko.
Finished analyzing the colors of the Pokemon Ninetales.
Finished analyzing the colors of the Pokemon Drampa.
Finished analyzing the colors of the Pokemon Beldum.
Finished analyzing the colors of the Pokemon Popplio.
Finished analyzing the colors of the Pokemon Cherrim.
...

### Save Color Data to File

In [52]:
# Combine the test and training color data into a single dataframe
pkmn_color_df = pd.concat([pkmn_color_test_df, pkmn_color_train_df], ignore_index=True)
pkmn_color_df.to_csv(PKMN_IMG_COLORS_ALT_PATH, sep=",")

print("Created a file with all Pokemon image color data.")

Created a file with all Pokemon image color data.
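
When the saved file is read back later, keep in mind that `to_csv` writes the DataFrame index as an unnamed first column by default. The sketch below shows a round trip through an in-memory buffer instead of a real file; the frame and its values are illustrative assumptions, not actual results from the color analysis.

```python
import io

import pandas as pd

# Illustrative values only; not real output from the color analysis.
color_df = pd.DataFrame(
    {"Name": ["pikachu", "gengar"], "yellow": [61.5, 0.0], "purple": [0.0, 72.25]}
)

buffer = io.StringIO()
color_df.to_csv(buffer)  # the index is written as an unnamed first column
buffer.seek(0)

# index_col=0 restores that column as the index instead of loading it
# as an extra "Unnamed: 0" data column.
restored_df = pd.read_csv(buffer, index_col=0)
print(list(restored_df.columns))
```

An alternative is to pass `index=False` to `to_csv` so that no index column is written in the first place.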
