# Dataset preparation

## Purpose
This notebook extracts the data we require from the PokeAPI dataset. The
finished version of this process is located in the
/data/dataset-assembler.py file. The PokeAPI dataset is an extensive dataset
containing all sorts of information about the Pokemon universe, including all
the Pokemon, their types and abilities, pictures and much more. We will only work
with a small fraction of it.

Additionally, we will also prepare the pictures of all the Digimon we will use
in our dataset.

## Reasoning
The PokeAPI dataset is a gargantuan dataset containing any and all information
one could want about the Pokemon universe. However, for our purposes, most of
that information is irrelevant and superfluous.

Since our aim is to create a dataset for training a model that classifies whether
a given image is a Pokemon or not, and that classifies the Pokemon's type based on
the image, we only really need the Pokemon names, types, images, and a set of
non-Pokemon images. The rest of the information does not serve this purpose:
knowing, for example, a Pokemon's height or weight does not help us identify it
from an image.

Since the Digimon only serve as negative examples of Pokemon, we only need their
images and none of the additional information about them that is included in the
dataset.

In [24]:
import pandas as pd

import os
import shutil
import re

from data.properties import read_properties

## Data Acquisition

For this notebook to work, you must have downloaded the complete PokeAPI
datasets and the Digimon pictures.

If you do not have these datasets, you can download them at the following links:
- [PokeAPI Repository](https://github.com/PokeAPI/pokeapi.git) (GitHub)
- [PokeAPI Sprites Repository](https://github.com/PokeAPI/sprites.git) (GitHub)
- [Digimon Webcrawler Data](https://drive.google.com/drive/folders/1tmcdsoX67NvmAgtmGJgo6kb3N6SlJeLu?usp=share_link) (Google Drive)

The configured paths must point to folders that follow this structure:

```
- PokeAPI Repository
| - data
| | - v2
| | | - csv
| | | | - <ALL CSV FILES>

- PokeAPI Sprites Repository
| - sprites
| | - <ALL POKEMON SPRITES>

- Digimon Webcrawler Data
| - images (unzipped contents of images.zip)
| | - <ALL DIGIMON IMAGES>
```

The configuration file must be named `paths.env` and must be located where the
loading cell below expects it (`../paths.env`, one directory above the
notebook's working directory), with the following content:

```ini
pokeAPI-data=path/to/PokeAPI
pokeAPI-sprites=path/to/PokeAPI-sprites
digimon-images=path/to/Digimon

dataset-dir=path/to/output
```
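
For orientation, a loader such as `read_properties` could parse this file roughly as follows. This is only a sketch, assuming the file holds plain `key=value` lines. The project's actual `data.properties` module is not shown in this notebook and may behave differently, and the name `parse_properties` is hypothetical.

```python
def parse_properties(text):
    """Parse 'key=value' lines into a dict (hypothetical helper, not the project's loader)."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines
        if not line or line.startswith('#'):
            continue
        # Split on the first '=' only, so values may themselves contain '='
        key, sep, value = line.partition('=')
        if sep:
            props[key.strip()] = value.strip()
    return props


# Same placeholder content as the configuration example above
example = """
pokeAPI-data=path/to/PokeAPI
pokeAPI-sprites=path/to/PokeAPI-sprites
digimon-images=path/to/Digimon

dataset-dir=path/to/output
"""
props = parse_properties(example)
print(props['dataset-dir'])
```

The blank line between the input paths and `dataset-dir` is skipped, so the grouping in the file is purely cosmetic under this assumption.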

In [25]:
props = read_properties('../paths.env')

pokeAPI_data_repo = props['pokeAPI-data']
pokeAPI_sprites_repo = props['pokeAPI-sprites']
digimon_datasource = props['digimon-images']

output_dir = props['dataset-dir']

In [26]:
# PokeAPI csv data
pokeAPI_data_root = os.path.join(pokeAPI_data_repo, 'data/v2/csv')
# PokeAPI sprites folder
pokeAPI_sprites_folder = os.path.join(pokeAPI_sprites_repo, 'sprites')
# Digimon images folder
digimon_images_folder = os.path.join(digimon_datasource, 'images')

In [27]:
print('Checking for data...')

pokeAPI_data_url = 'https://github.com/PokeAPI/pokeapi.git'
pokeAPI_sprites_url = 'https://github.com/PokeAPI/sprites.git'
digimon_data_url = 'https://drive.google.com/drive/folders/1tmcdsoX67NvmAgtmGJgo6kb3N6SlJeLu?usp=share_link'


print('Retrieving PokeAPI data from: {}'.format(pokeAPI_data_root))
if not os.path.exists(pokeAPI_data_root):
    print('You must download the PokeAPI data first from the following git: {}'.format(pokeAPI_data_url))
    raise FileNotFoundError('PokeAPI data not found')
print('Retrieving PokeAPI sprites from: {}'.format(pokeAPI_sprites_folder))
if not os.path.exists(pokeAPI_sprites_folder):
    print('You must download the PokeAPI sprites first from the following git: {}'.format(pokeAPI_sprites_url))
    raise FileNotFoundError('PokeAPI sprites not found')
print('Retrieving Digimon images from: {}'.format(digimon_images_folder))
if not os.path.exists(digimon_images_folder):
    print('You must download the Digimon data first from the following Google Drive: {}'.format(digimon_data_url))
    raise FileNotFoundError('Digimon images not found')
if os.path.exists(output_dir):
    print('The dataset folder already exists. Please remove it before running the script.')
    raise FileExistsError('Dataset folder already exists')

os.makedirs(output_dir)

Checking for data...
Retrieving PokeAPI data from: /media/ieris19/development-drive/Development/Data/PokeAPI/server/data/v2/csv
Retrieving PokeAPI sprites from: /media/ieris19/development-drive/Development/Data/PokeAPI/sprites/sprites
Retrieving Digimon images from: /media/ieris19/development-drive/Development/Data/Digimon/images


In [28]:
# Reading the PokeAPI data
pokemon = pd.read_csv(os.path.join(pokeAPI_data_root, 'pokemon.csv'))
pokemon

Unnamed: 0,id,identifier,species_id,height,weight,base_experience,order,is_default
0,1,bulbasaur,1,7,69,64,1.0,1
1,2,ivysaur,2,10,130,142,2.0,1
2,3,venusaur,3,20,1000,263,3.0,1
3,4,charmander,4,6,85,62,5.0,1
4,5,charmeleon,5,11,190,142,6.0,1
...,...,...,...,...,...,...,...,...
1297,10273,ogerpon-wellspring-mask,1017,12,398,275,,0
1298,10274,ogerpon-hearthflame-mask,1017,12,398,275,,0
1299,10275,ogerpon-cornerstone-mask,1017,12,398,275,,0
1300,10276,terapagos-terastal,1024,3,160,90,,0


In [29]:
types = pd.read_csv(os.path.join(pokeAPI_data_root, 'types.csv'))
types

Unnamed: 0,id,identifier,generation_id,damage_class_id
0,1,normal,1,2.0
1,2,fighting,1,2.0
2,3,flying,1,2.0
3,4,poison,1,2.0
4,5,ground,1,2.0
5,6,rock,1,2.0
6,7,bug,1,2.0
7,8,ghost,1,2.0
8,9,steel,2,2.0
9,10,fire,1,3.0


In [30]:
pokemon_types = pd.read_csv(os.path.join(pokeAPI_data_root, 'pokemon_types.csv'))
pokemon_types

Unnamed: 0,pokemon_id,type_id,slot
0,1,12,1
1,1,4,2
2,2,12,1
3,2,4,2
4,3,12,1
...,...,...,...
2023,10274,10,2
2024,10275,12,1
2025,10275,6,2
2026,10276,1,1


## Trimming the data

The PokeAPI dataset is quite extensive. It contains all Pokemon (currently 1025)
as well as information on many alternative forms. These forms are assigned IDs
greater than 10000 so that they do not interfere with the original Pokemon
IDs. As such, we keep only the rows with IDs below 10000.

Additionally, we will remove a lot of the columns that are not relevant to our
current analysis, such as the foreign keys pointing to relationships outside our
scope and some of the data irrelevant to us such as height and weight.

There are also some types that are not relevant to our analysis, because they are
only used for specific mechanics in the games and do not represent a Pokemon's
type, such as the "shadow" type. We will remove these types from the dataset.
Thankfully, the same rule as for Pokemon IDs applies: these types have IDs
greater than 10000.

We will also rename some of the columns to make them more readable and to avoid
confusion later on.
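
The effect of these steps can be seen on a miniature table. The following is a sketch only: the rows and the constant name `ID_CUTOFF` are made up for illustration and are not read from the PokeAPI files.

```python
import pandas as pd

ID_CUTOFF = 10000  # same cutoff value the notebook uses

# Made-up miniature stand-in for types.csv (illustration only)
toy_types = pd.DataFrame({
    'id': [1, 10, 10002],
    'identifier': ['normal', 'fire', 'shadow'],
    'generation_id': [1, 1, 3],
})

# Keep only rows whose ID is below the cutoff
trimmed = toy_types[toy_types['id'] < ID_CUTOFF]
# Drop a column we do not need and give the label a clearer name
trimmed = trimmed.drop(columns=['generation_id'])
trimmed = trimmed.rename(columns={'identifier': 'type_label'})

print(trimmed)
```

Only the two rows with IDs below the cutoff remain, and the table is reduced to the `id` and `type_label` columns.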

In [31]:
id_cutoff = 10000

# Keep only the base Pokemon and the regular types (IDs below the cutoff)
pokemon_types = pokemon_types[pokemon_types['pokemon_id'] < id_cutoff]
pokemon = pokemon[pokemon['id'] < id_cutoff]
types = types[types['id'] < id_cutoff]
# Drop the columns that are not relevant to our analysis
types = types.drop(columns=['damage_class_id', 'generation_id'])
pokemon = pokemon.drop(columns=['species_id', 'height', 'weight', 'base_experience', 'order', 'is_default'])
# Use more readable column names
pokemon = pokemon.rename(columns={'identifier': 'name'})
types = types.rename(columns={'identifier': 'type_label'})

display(pokemon)
display(types)
display(pokemon_types)

Unnamed: 0,id,name
0,1,bulbasaur
1,2,ivysaur
2,3,venusaur
3,4,charmander
4,5,charmeleon
...,...,...
1020,1021,raging-bolt
1021,1022,iron-boulder
1022,1023,iron-crown
1023,1024,terapagos


Unnamed: 0,id,type_label
0,1,normal
1,2,fighting
2,3,flying
3,4,poison
4,5,ground
5,6,rock
6,7,bug
7,8,ghost
8,9,steel
9,10,fire


Unnamed: 0,pokemon_id,type_id,slot
0,1,12,1
1,1,4,2
2,2,12,1
3,2,4,2
4,3,12,1
...,...,...,...
1546,1023,9,1
1547,1023,14,2
1548,1024,1,1
1549,1025,4,1


## Merging the data
The data is currently split into three tables: the Pokemon table, the Types table,
and the Pokemon_Types table that holds the relationship between the two. We will
merge them using the IDs that act as foreign keys between them.

The final dataset will be a single table in which each row contains a Pokemon and
one of its types. Dual-type Pokemon will have two rows in the table, one for each
type.

The final dataset will be saved in the \<target>/dataset folder as a CSV file
for future reference.
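
The merge described above can be sketched on toy tables to show how a dual-type Pokemon ends up with two rows. The tables below are made up for illustration. The `validate` argument is an optional pandas safeguard that the notebook itself does not use; it raises an error if the key relationship is not the expected one.

```python
import pandas as pd

# Made-up miniature tables shaped like the trimmed ones (illustration only)
toy_pokemon = pd.DataFrame({'id': [1, 4], 'name': ['bulbasaur', 'charmander']})
toy_types = pd.DataFrame({'id': [4, 10, 12], 'type_label': ['poison', 'fire', 'grass']})
toy_links = pd.DataFrame({
    'pokemon_id': [1, 1, 4],
    'type_id': [12, 4, 10],
    'slot': [1, 2, 1],
})

# Attach the type label to each link row; many link rows may share one type
links_with_labels = toy_links.merge(
    toy_types, left_on='type_id', right_on='id', validate='many_to_one'
).drop(columns=['id'])

# Attach the Pokemon name; one Pokemon may own several link rows
toy_merged = toy_pokemon.merge(
    links_with_labels, left_on='id', right_on='pokemon_id', validate='one_to_many'
).drop(columns=['id'])

print(toy_merged)
```

In this toy result, bulbasaur appears twice (once per type) and charmander once.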

In [32]:
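# Attach the type label to every (Pokemon, type) pair
pokemon_merged = pokemon_types.merge(types, left_on='type_id', right_on='id').drop(columns=['id'])
# 'identifier' was already renamed to 'type_label' above, so only 'slot' needs renaming
pokemon_merged = pokemon_merged.rename(columns={'slot': 'type_slot'})
# Attach the Pokemon name to every pair
pokemon_merged = pokemon.merge(pokemon_merged, left_on='id', right_on='pokemon_id').drop(columns=['id'])
pokemon_merged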

Unnamed: 0,name,pokemon_id,type_id,type_slot,type_label
0,bulbasaur,1,12,1,grass
1,bulbasaur,1,4,2,poison
2,ivysaur,2,12,1,grass
3,ivysaur,2,4,2,poison
4,venusaur,3,12,1,grass
...,...,...,...,...,...
1546,iron-crown,1023,9,1,steel
1547,iron-crown,1023,14,2,psychic
1548,terapagos,1024,1,1,normal
1549,pecharunt,1025,4,1,poison


In [33]:
dataset_dir = os.path.join(output_dir, 'dataset')
dataset_target = os.path.join(dataset_dir, 'pokemon.csv')
# Create the dataset folder if it does not exist yet
os.makedirs(dataset_dir, exist_ok=True)
pokemon_merged.to_csv(dataset_target, index=False)
print('Pokemon data saved to {}'.format(dataset_target))

Pokemon data saved to /media/ieris19/development-drive/Development/Data/MAL-project/dataset/pokemon.csv


## Copying the images
The official-artwork folder contains all the Pokemon images, named in the format
\<id>.png.

Since this folder also includes the alternative forms and other images that are
not part of the dataset, we will use a regex to skip images with IDs of 10000
or greater.

For that, the following regex pattern will be used:
- `^[0-9]{1,4}\.png`

The pattern matches a file name that starts with 1 to 4 digits immediately
followed by the .png extension.

This means that any image whose ID has more than 4 digits will be ignored.
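
As a quick check of the pattern, the sketch below applies it to a few hypothetical file names. The names are chosen only to illustrate the filter and are not a listing of the real folder.

```python
import re

# Pattern as described above: 1 to 4 digits followed by ".png"
valid_image_pattern = re.compile(r'^[0-9]{1,4}\.png')

# Hypothetical file names, chosen only to illustrate the filter
candidates = ['1.png', '1025.png', '10001.png', '25-gmax.png', 'substitute.png']
kept = [name for name in candidates if valid_image_pattern.match(name)]
print(kept)
```

Only `1.png` and `1025.png` pass the filter. Because the pattern has no end anchor, a hypothetical name such as `7.png.bak` would also pass; using `valid_image_pattern.fullmatch(name)` is one way to rule that out if such files ever appear.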

In [34]:
source_images = os.path.join(pokeAPI_sprites_folder, 'pokemon/other/official-artwork')
pokeAPI_image_files = os.listdir(source_images)
print('Preparing {} Pokemon images...'.format(len(pokeAPI_image_files)))
target_images = os.path.join(output_dir, 'images/pokemon')
os.makedirs(target_images)

Preparing 1294 Pokemon images...


In [35]:
valid_image_pattern = re.compile(r'^[0-9]{1,4}\.png')
for file in pokeAPI_image_files:
    if valid_image_pattern.match(file):
        source = os.path.join(source_images, file)
        target = os.path.join(target_images, file)
        shutil.copy(source, target)
        print('Copied {} to {}'.format(file, target))

Copied 181.png to /media/ieris19/development-drive/Development/Data/MAL-project/images/pokemon/181.png
Copied 56.png to /media/ieris19/development-drive/Development/Data/MAL-project/images/pokemon/56.png
Copied 1.png to /media/ieris19/development-drive/Development/Data/MAL-project/images/pokemon/1.png
Copied 10.png to /media/ieris19/development-drive/Development/Data/MAL-project/images/pokemon/10.png
Copied 100.png to /media/ieris19/development-drive/Development/Data/MAL-project/images/pokemon/100.png
Copied 1000.png to /media/ieris19/development-drive/Development/Data/MAL-project/images/pokemon/1000.png
Copied 1001.png to /media/ieris19/development-drive/Development/Data/MAL-project/images/pokemon/1001.png
Copied 1002.png to /media/ieris19/development-drive/Development/Data/MAL-project/images/pokemon/1002.png
Copied 1003.png to /media/ieris19/development-drive/Development/Data/MAL-project/images/pokemon/1003.png
Copied 1004.png to /media/ieris19/development-drive/Development/Data/MAL-

## Digimon images
The Digimon images are simply copied to the output folder.

In [36]:
digimon_image_files = os.listdir(digimon_images_folder)
print('Preparing {} Digimon images...'.format(len(digimon_image_files)))
target_images = os.path.join(output_dir, 'images/digimon')
os.makedirs(target_images)
for file in digimon_image_files:
    source = os.path.join(digimon_images_folder, file)
    target = os.path.join(target_images, file)
    shutil.copy(source, target)
    print('Copied file\n\tfrom: {}\n\tto: {}'.format(file, target))

Preparing 1127 Digimon images...
Copied file
	from: Duftmon.jpg
	to: /media/ieris19/development-drive/Development/Data/MAL-project/images/digimon/Duftmon.jpg
Copied file
	from: Mamemon_X.jpg
	to: /media/ieris19/development-drive/Development/Data/MAL-project/images/digimon/Mamemon_X.jpg
Copied file
	from: Abbadomon.jpg
	to: /media/ieris19/development-drive/Development/Data/MAL-project/images/digimon/Abbadomon.jpg
Copied file
	from: Abbadomon_core.jpg
	to: /media/ieris19/development-drive/Development/Data/MAL-project/images/digimon/Abbadomon_core.jpg
Copied file
	from: Achillesmon.jpg
	to: /media/ieris19/development-drive/Development/Data/MAL-project/images/digimon/Achillesmon.jpg
Copied file
	from: Aegiochusmon.jpg
	to: /media/ieris19/development-drive/Development/Data/MAL-project/images/digimon/Aegiochusmon.jpg
Copied file
	from: Aegiochusmon_blue2.jpg
	to: /media/ieris19/development-drive/Development/Data/MAL-project/images/digimon/Aegiochusmon_blue2.jpg
Copied file
	from: Aegiochusmo