# 0. Downloading Dataset

For my GAN models I use the `Pokemon Generation Images` dataset by `@truthisneverlinear` from `kaggle.com`.   
The dataset contains `39180` suitable `.png` images of various sizes.   
We download the `.zip` file from `kaggle.com` and unzip it with the **kaggle_api** library:

In [1]:
from pathlib import Path
import os

DATA_PATH = Path("../data")
POKEMON_PATH = DATA_PATH / "pokemon"

if POKEMON_PATH.is_dir():
    print("Skipping download")
else:
    print("Downloading Data")
    DATA_PATH.mkdir(parents=True, exist_ok=True)
    os.environ['KAGGLE_USERNAME'] = "YOUR_KAGGLE_USERNAME" # Your username on kaggle.com
    os.environ['KAGGLE_KEY'] = "YOUR_KAGGLE_API_KEY" # Get your API key on kaggle.com

    # Imported only after the credentials are set: the kaggle package authenticates on import
    import kaggle

    kaggle.api.dataset_download_files('truthisneverlinear/pokemon-generations-image-dataset/2', path=DATA_PATH, unzip=True)
        

Skipping download


In [2]:
def dir_walkthrough(dir_path: str | Path) -> None:
    """Print the number of directories and files found in each folder under dir_path.

    Args:
        dir_path (str or pathlib.Path): Root directory to walk.
    """
    for dirpath, dirnames, filenames in os.walk(dir_path):
        print(f"There are {len(dirnames)} directories and {len(filenames)} images in {dirpath}.")

Print the number of images in each subdirectory:

In [4]:
dir_walkthrough(POKEMON_PATH)

There are 1 directories and 0 images in ..\data\pokemon.
There are 0 directories and 39180 images in ..\data\pokemon\raw.
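
`dir_walkthrough` counts every file it finds, whatever its type. As a minimal sketch, a per-extension count can confirm that the files really are `.png` images. The helper name `count_by_extension` is hypothetical and not part of the original notebook:

```python
from collections import Counter
from pathlib import Path
from typing import Union


def count_by_extension(dir_path: Union[str, Path]) -> Counter:
    """Count files under dir_path (recursively), grouped by lower-cased suffix."""
    return Counter(
        p.suffix.lower() for p in Path(dir_path).rglob("*") if p.is_file()
    )
```

Calling `count_by_extension(POKEMON_PATH)` returns a `Counter` keyed by suffix, for example `".png"` and `".gif"`.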


# 1. Dataset formatting

The dataset is split into multiple sub-folders by pokemon class and generation. For our task we need to filter the images and merge them into a single folder, `/raw`.   
The image names contain only the pokemon class label, so the same name repeats across sub-folders. To avoid collisions when merging, we append a randomly generated integer to the original name:   
`100.png` -> `100-<random_int>.png`, e.g. `100-2769845.png`
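
The renaming scheme can be sketched in isolation. This is an illustration only: the helper name `with_random_suffix` is hypothetical, and a random suffix makes a second collision unlikely rather than impossible.

```python
import os
import random


def with_random_suffix(file_name: str, upper: int = 10_000_000) -> str:
    """Return file_name with '-<random int>' inserted before the extension."""
    stem, ext = os.path.splitext(file_name)
    return f"{stem}-{random.randrange(upper)}{ext}"


random.seed(0)  # seeded only so that the demo is repeatable
print(with_random_suffix("100.png"))
```

`os.path.splitext` splits on the last dot only, so a name such as `a.b.png` keeps its `.png` extension.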

In [6]:
import shutil
import os
import random

def recursive_move_files(path: str | Path) -> None:
    """Recursively walk the given path and move ALL found files into its root.

    Args:
        path (str or pathlib.Path): A path of a root directory.
    """
    target_dir = Path(path)  # accept str as well as Path
    
    for dirpath, dirnames, filenames in os.walk(target_dir):
        print(f"Moving {len(filenames)} images in {dirpath}.")
        if Path(dirpath) == target_dir:
            continue  # files already in the root stay where they are
        for file_name in filenames:
            try:
                shutil.move(os.path.join(dirpath, file_name), target_dir)
            except shutil.Error:
                # Name collision in the root: append a random integer to the name
                stem, ext = os.path.splitext(file_name)
                new_file_name = f"{stem}-{random.randrange(10000000)}{ext}"
                shutil.move(os.path.join(dirpath, file_name), target_dir / new_file_name)
                

Move the files to the root of `data/pokemon`:

In [15]:
recursive_move_files(POKEMON_PATH)

Moving 0 images in ..\data\pokemon.
Moving 202 images in ..\data\pokemon\conquest.
Moving 942 images in ..\data\pokemon\icons.
Moving 5 images in ..\data\pokemon\icons\female.
Moving 737 images in ..\data\pokemon\icons\old.
Moving 3 images in ..\data\pokemon\icons\old\female.
Moving 80 images in ..\data\pokemon\icons\right.
Moving 0 images in ..\data\pokemon\main-sprites.
Moving 754 images in ..\data\pokemon\main-sprites\black-white.
Moving 754 images in ..\data\pokemon\main-sprites\black-white\back.
Moving 88 images in ..\data\pokemon\main-sprites\black-white\back\female.
Moving 753 images in ..\data\pokemon\main-sprites\black-white\back\shiny.
Moving 88 images in ..\data\pokemon\main-sprites\black-white\back\shiny\female.
Moving 93 images in ..\data\pokemon\main-sprites\black-white\female.
Moving 753 images in ..\data\pokemon\main-sprites\black-white\shiny.
Moving 93 images in ..\data\pokemon\main-sprites\black-white\shiny\female.
Moving 277 images in ..\data\pokemon\main-sprites\cry

## 1.1 Preprocessing formatted data
All of our data now sits in a single directory (the root of `data/pokemon`). We finish the formatting with a last preprocessing pass:
- Create a new `/raw` folder to store **ALL** of our processed files
- Delete all `.gif` files
- Move all remaining files to the `/raw` folder
- Delete all images of 0-class pokemon (they are not pokemon images)
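
Before touching the disk, the steps above can be sketched as a dry run that only classifies file names. The helper `plan_action` is hypothetical. It assumes that image ids are not zero-padded, so a name starting with `0` belongs to the 0 class:

```python
from pathlib import PurePath


def plan_action(file_name: str) -> str:
    """Classify a file in the data root: 'delete', 'move' or 'keep'."""
    path = PurePath(file_name)
    if path.suffix == ".gif":
        return "delete"      # GIFs are dropped
    if path.suffix == ".png":
        if path.name.startswith("0"):
            return "delete"  # 0-class images are not pokemon
        return "move"        # every other PNG goes to /raw
    return "keep"            # other file types are left untouched


for name in ["25.png", "25-1234567.png", "0.png", "25.gif", "notes.txt"]:
    print(name, "->", plan_action(name))
```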

In [8]:
def finish_preprocessing(source_path: str | Path) -> None:
    """Run the final preprocessing steps on the flattened pokemon data.

    Args:
        source_path (str or pathlib.Path): A root of data directory
    """
    source_dir = Path(source_path)  # accept str as well as Path
    target_dir = source_dir / "raw"
    if target_dir.is_dir():
        print("RAW already created, adding")
    else:
        target_dir.mkdir(parents=True, exist_ok=True)
    
    print("Removing GIF's")
    for file_name in source_dir.glob("*.gif"):
        os.remove(file_name)
        
    print("Moving PNG's to /raw")
    for file_name in source_dir.glob("*.png"):
        shutil.move(file_name, target_dir)
        
    print("Removing 0 class")
    # "0*" matches every file name that starts with 0
    for file_name in target_dir.glob("0*"):
        os.remove(file_name)
        
    ### TODO: DELETE LAST non-CLASSES, DELETE EMPTY FOLDERS
        
    # print("Deleting empty folders")
    # for dirpath, dirnames, filenames in os.walk(target_dir):
    #     print(f"Moving {len(filenames)} images in {dirpath}.")
    #     for file_name in filenames:
    #         try:
    #             shutil.move(os.path.join(dirpath, file_name), target_dir)
    #         except:
    #             file_name_arr = file_name.split('.')
    #             new_file_name = file_name_arr[0]+'-'+str(random.randrange(10000000))+'.'+file_name_arr[1]
    #             shutil.move(os.path.join(dirpath, file_name), target_dir / new_file_name)
    

In [9]:
finish_preprocessing(POKEMON_PATH)

RAW already created, adding
Removing GIF's
Moving PNG's to /raw
Removing 0 class


Count the remaining files:

In [10]:
len(list((POKEMON_PATH / "raw").glob("*.png")))

39180

# 2. Preparing classes - *Optional*

Load the class data into a `pd.DataFrame` to map image ids to pokemon names and data.

In [12]:
import pandas as pd

pokemon_data = pd.read_csv(DATA_PATH / 'Pokemon.txt')
pokemon_data = pokemon_data.drop_duplicates(subset=["#"],keep="first")
pokemon_data = pokemon_data.set_index(["#"], drop=True)

In [16]:
from typing import List, Tuple

In [19]:
def get_classes(df: pd.DataFrame) -> Tuple[List[str], dict[str, int]]:
    """Get class names as a list and as a dict of class_name: index

    Args:
        df (pd.DataFrame): Pokemon data with a "Name" column

    Returns:
        Tuple[List[str], dict[str, int]]: A list of class names and a dict mapping class_name to index
    """
    classes = df["Name"].unique().tolist()
    classes_to_idx = {class_name: i for i, class_name in enumerate(classes)}
    return classes, classes_to_idx

In [20]:
classes, classes_to_idx = get_classes(pokemon_data)
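
The lookup built here can be illustrated without pandas. The sketch below uses toy names rather than `Pokemon.txt`, and the helper `build_class_maps` is hypothetical. It also builds the inverse mapping, which is handy for turning a predicted index back into a name:

```python
from typing import Dict, List, Tuple


def build_class_maps(names: List[str]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Build name -> index and index -> name lookups, keeping first-seen order."""
    unique_names = list(dict.fromkeys(names))  # de-duplicate, order preserved
    class_to_idx = {name: i for i, name in enumerate(unique_names)}
    idx_to_class = {i: name for name, i in class_to_idx.items()}
    return class_to_idx, idx_to_class


# Toy names, not read from Pokemon.txt
class_to_idx, idx_to_class = build_class_maps(
    ["Bulbasaur", "Ivysaur", "Venusaur", "Ivysaur"]
)
print(class_to_idx["Venusaur"])  # 2
print(idx_to_class[0])           # Bulbasaur
```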

# 3. Structured Images to DataLoaders

## 3.1 Data Class

A custom `Dataset` class inheriting from `torch.utils.data.Dataset`. It is functionally similar to the `ImageFolder` data class from `torchvision`, with the extra functionality needed for our dataset.

In [21]:
from torch.utils.data import Dataset
from typing import Tuple, Dict, List
import torch
import re
import PIL
from PIL import Image

class PokemonData(Dataset):
    def __init__(self,
                 targ_dir: str | Path,
                 classes_df: pd.DataFrame,
                 transform=None):
        super().__init__()
        # Collect all image paths
        self.paths = list(Path(targ_dir).glob('*.png'))
        self.transform = transform
        # Reuse the class lookup from section 2 of the 01_data_processing notebook
        self.classes, self.classes_to_idx = self.get_classes(classes_df)
        
    def get_classes(self, df):
        classes = df["Name"].unique()
        classes_to_idx = {class_name: i for i, class_name in enumerate(classes)}
        return classes, classes_to_idx
    
    def load_image(self, index: int) -> Image.Image:
        """Load an image by index as a PIL Image.

        Args:
            index (int): Index of the image to load

        Returns:
            PIL.Image.Image: The image in RGB mode on a white background
        """
        image_path = self.paths[index]
        image = Image.open(image_path).convert("RGBA")
        # Replace the transparent PNG background with white
        return Image.composite(image, Image.new('RGBA', image.size, 'white'), image).convert("RGB")
    
    
    def __len__(self) -> int:
        """Return the length of the dataset. Has to be implemented in a custom Dataset class.
        
        Returns:
            int: The number of image paths
        """
        return len(self.paths)
    
    
    def __getitem__(self, index: int) -> Tuple[torch.Tensor, int]:
        """Load an image, apply the transform and return it with its class label.

        Args:
            index (int): An index of image to load

        Returns:
            Tuple[torch.Tensor, int]: The image turned into a tensor by the transform, with its class label
        """
        img = self.load_image(index)
        img_name = self.paths[index].name.split('.')[0]
        # Take the first run of digits in the file name as the image id
        img_id = int(re.findall(r"\d+", img_name)[0])
        # Positional lookup: assumes the image id is a valid index into self.classes
        class_name = self.classes[img_id]
        class_idx = self.classes_to_idx[class_name]
        if self.transform:
            return self.transform(img), class_idx
        else:
            return img, class_idx
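
The id extraction in `__getitem__` can be tried on its own. In this minimal sketch the helper `parse_image_id` is hypothetical. Unlike indexing the result of `re.findall` with `[0]`, which raises `IndexError` for a name without digits, it returns `None` so the caller can decide to skip the file:

```python
import re
from typing import Optional


def parse_image_id(file_name: str) -> Optional[int]:
    """Return the first run of digits in file_name as an int, or None if there is none."""
    match = re.search(r"\d+", file_name)
    return int(match.group()) if match else None


print(parse_image_id("100.png"))          # 100
print(parse_image_id("100-2769845.png"))  # 100
print(parse_image_id("substitute.png"))   # None
```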

## 3.2 Transformers

A template image transformer for the Pokemon dataset. It usually changes slightly from model to model, depending on the architecture.

In [22]:
from torchvision.transforms import v2

image_transformer = v2.Compose([
    v2.ToImage(),
    v2.Resize(size=(128,128)),
    v2.ToDtype(torch.float32, scale=True)
])

In [23]:
raw_data = PokemonData(targ_dir=POKEMON_PATH / "raw",
                       transform=image_transformer,
                       classes_df=pokemon_data)

## 3.3 DataLoaders

To split our dataset into batches for loading into the model, we use a `DataLoader`:

In [27]:

import os
from torch.utils.data import DataLoader

BATCH_SIZE = 32
# Number of worker processes used to load data. All CPU cores is a common starting point, not always the fastest setting
NUM_WORKERS = os.cpu_count() or 0 # cpu_count() can return None

pokemon_dataloader = DataLoader(dataset=raw_data,
                                batch_size=BATCH_SIZE,
                                shuffle=True,
                                num_workers=NUM_WORKERS,
                                pin_memory=True)
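
As a quick check of what this batch size means for one epoch, the batch count can be worked out by hand. The sketch below uses the `39180` images counted earlier. The helper `batch_layout` is hypothetical, and its `drop_last` flag mirrors the `DataLoader` argument of the same name, which defaults to `False`:

```python
import math
from typing import Tuple


def batch_layout(num_samples: int, batch_size: int, drop_last: bool = False) -> Tuple[int, int]:
    """Return (number of batches, size of the final batch) for one epoch."""
    if num_samples == 0:
        return 0, 0
    if drop_last:
        # The incomplete final batch is discarded
        return num_samples // batch_size, batch_size
    num_batches = math.ceil(num_samples / batch_size)
    last_batch = num_samples - (num_batches - 1) * batch_size
    return num_batches, last_batch


print(batch_layout(39180, 32))                  # (1225, 12)
print(batch_layout(39180, 32, drop_last=True))  # (1224, 32)
```

With the settings above, the final batch holds only 12 images, which matters for models that expect a fixed batch size.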