The pokemon dataset was taken from the pokemon image dataset on [Kaggle](https://www.kaggle.com/datasets/lantian773030/pokemonclassification). All pokemon are used, unlike the previous models, which covered only the four starter pokemon. The animal dataset was taken from the animal image dataset on [Kaggle](https://www.kaggle.com/datasets/iamsouravbanerjee/animal-image-dataset-90-different-animals). The two datasets were combined to train a binary classification model that can differentiate between a real-life animal and a pokemon.

In [154]:
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing import image
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import Dense, Input, Conv2D, MaxPooling2D, Dropout, Flatten, GlobalMaxPooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.applications import VGG16
import os
import shutil
import splitfolders
import random

The images all sit in subdirectories of each dataset folder, so I will use the function below to copy them out of the subfolders into a single new folder per class (animal or pokemon).

In [62]:
animal_data_dir = 'C:/Users/John/Documents/ml_projects/pokemon-classifier/data/animals'
pokemon_data_dir = 'C:/Users/John/Documents/ml_projects/pokemon-classifier/data/all_pokemon'
new_animal_dir = 'C:/Users/John/Documents/ml_projects/pokemon-classifier/data/all_images/all_animals'
new_pokemon_dir = 'C:/Users/John/Documents/ml_projects/pokemon-classifier/data/all_images/all_pokemon'

def copy_files(initial_directory, final_directory, subfolder=True):
    """
    Copies images from initial_directory into final_directory.

    If subfolder is True, the images inside each subfolder of initial_directory
    are copied. If subfolder is False, the items directly inside
    initial_directory are copied instead.

    Parameters:
        initial_directory: folder with subfolders of images (or the images themselves)
        final_directory: folder where all images will be copied to
        subfolder: whether the images sit one level down, inside subfolders

    Returns:
        None. The images are copied into final_directory. Files that share a
        name overwrite each other, so the final count can be lower than the
        source count.
    """
    # Create the destination first; otherwise shutil.copy treats it as a file name
    os.makedirs(final_directory, exist_ok=True)
    if subfolder:
        for item in os.listdir(initial_directory):
            for image in os.listdir(os.path.join(initial_directory, item)):
                image_path = os.path.join(initial_directory, item, image)
                shutil.copy(image_path, final_directory)
    else:
        for item in os.listdir(initial_directory):
            image_path = os.path.join(initial_directory, item)
            shutil.copy(image_path, final_directory)

In [None]:
copy_files(animal_data_dir, new_animal_dir)
copy_files(pokemon_data_dir, new_pokemon_dir)

After the cell above runs, I will double-check that the correct number of images has been copied to the new folders, then delete the originals to save space.
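A minimal sketch of that double check, assuming the source layout is one level of class subfolders. The helper name `count_source_images` and the toy file names are hypothetical. It reports both the total number of files and the number of distinct file names, because copying into a single flat folder keeps only one file per name.

```python
import os
import tempfile

def count_source_images(initial_directory):
    """Return (total files, distinct file names) across the subfolders."""
    total = 0
    names = set()
    for item in os.listdir(initial_directory):
        subfolder_path = os.path.join(initial_directory, item)
        if not os.path.isdir(subfolder_path):
            continue
        for image in os.listdir(subfolder_path):
            total += 1
            names.add(image)
    return total, len(names)

# Toy layout (hypothetical names): two class folders that both contain "1.jpg".
with tempfile.TemporaryDirectory() as root:
    for folder, files in {"cat": ["1.jpg", "2.jpg"], "dog": ["1.jpg"]}.items():
        os.makedirs(os.path.join(root, folder))
        for name in files:
            open(os.path.join(root, folder, name), "w").close()
    total, distinct = count_source_images(root)

print(total, distinct)  # 3 2
```

On the real data, the same helper could be called on `animal_data_dir` and compared with `len(os.listdir(new_animal_dir))`. If the distinct count is lower than the total, the flattened folder is expected to hold fewer files than the source.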

I will check for class imbalance, which could bias the model, by seeing whether there is a large discrepancy in the number of images between the two classes.

In [71]:
num_animal_images = len(os.listdir(new_animal_dir))
num_pokemon_images = len(os.listdir(new_pokemon_dir))
print('number of animal images: ', num_animal_images)
print('number of pokemon images: ', num_pokemon_images)
total_images = num_animal_images + num_pokemon_images
print('total images: ', total_images)
print('fraction of animal images in total: ', num_animal_images/total_images)

number of animal images:  5399
number of pokemon images:  6742
total images:  12141
fraction of animal images in total:  0.4446915410592208


As there are more pokemon images, I will undersample by randomly choosing 5399 pokemon images to balance the dataset. I will then repeat the process above and copy the selected images to a new folder where the train/validation/test split will occur.

In [172]:
np.random.seed(seed=10)
selected_pokemon = np.random.permutation(os.listdir(new_pokemon_dir))[:5399].tolist()
selected_pokemon_list = [os.path.join(new_pokemon_dir, image) for image in selected_pokemon]
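One caveat with the cell above: `os.listdir` returns names in arbitrary order, so the same seed can select different files on another machine. The sketch below is one possible way around that, not the method used in this notebook. It sorts the names first and swaps in the standard library's `random.Random` for `np.random.permutation`. The helper name `undersample` and the toy file names are hypothetical.

```python
import random

def undersample(file_names, n_samples, seed=10):
    """Pick n_samples names without replacement, independent of listing order."""
    ordered = sorted(file_names)   # fix the order before sampling
    rng = random.Random(seed)      # local generator, leaves global state alone
    return rng.sample(ordered, n_samples)

# The same files listed in two different orders give the same selection.
names_a = ["c.png", "a.png", "b.png", "d.png"]
names_b = ["d.png", "b.png", "a.png", "c.png"]
picked_a = undersample(names_a, 2)
picked_b = undersample(names_b, 2)
print(picked_a == picked_b)
```

Because the generator differs from NumPy's, this would not reproduce the exact 5399 files chosen above; it only makes the choice independent of directory listing order.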

In [168]:
def same_name_copy(initial_file_name, final_file_name):
    """
    Copies a file into a folder without overwriting an existing file of the same name.

    Parameters:
        initial_file_name: path of the file to copy
        final_file_name: path of the folder the file is copied into

    Returns:
        None. If the name is already taken in the folder, the copy is saved
        with a numeric suffix instead (name_1.ext, name_2.ext, ...).
    """
    name = os.path.basename(initial_file_name)
    if not os.path.exists(os.path.join(final_file_name, name)):
        shutil.copy(initial_file_name, os.path.join(final_file_name, name))
    else:
        root, extension = os.path.splitext(name)
        i = 1
        while os.path.exists(os.path.join(final_file_name, f'{root}_{i}{extension}')):
            i += 1
        shutil.copy(initial_file_name, os.path.join(final_file_name, f'{root}_{i}{extension}'))
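
To illustrate the renaming rule used by `same_name_copy`, here is a small sketch that applies the same suffix logic to plain strings, without touching the file system. The helper `next_free_name` and the file names are hypothetical and exist only for this illustration.

```python
import os

def next_free_name(name, taken):
    """Return name, or name with _1, _2, ... before the extension if taken."""
    if name not in taken:
        return name
    root, extension = os.path.splitext(name)
    i = 1
    while f"{root}_{i}{extension}" in taken:
        i += 1
    return f"{root}_{i}{extension}"

taken = set()
results = []
for name in ["pikachu.jpg", "pikachu.jpg", "pikachu.jpg", "eevee.png"]:
    free = next_free_name(name, taken)
    taken.add(free)       # remember the name so later copies avoid it
    results.append(free)

print(results)  # ['pikachu.jpg', 'pikachu_1.jpg', 'pikachu_2.jpg', 'eevee.png']
```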


In [None]:
final_pokemon_dir = 'C:/Users/John/Documents/ml_projects/pokemon-classifier/data/train_test_val/pokemon'
final_animal_dir = 'C:/Users/John/Documents/ml_projects/pokemon-classifier/data/train_test_val/animal'

copy_files(new_animal_dir, final_animal_dir, subfolder=False)
for image in selected_pokemon_list:
    same_name_copy(image, final_pokemon_dir)


In [176]:
print('number of animal images: ', len(os.listdir(final_animal_dir)))
print('number of pokemon images: ', len(os.listdir(final_pokemon_dir)))

number of animal images:  5399
number of pokemon images:  5399


Now it is time to split the dataset into train, validation and test sets, using an 80/10/10 ratio.

In [177]:
all_images = 'C:/Users/John/Documents/ml_projects/pokemon-classifier/data/train_test_val'
splitfolders.ratio(all_images, output = 'output_folder', seed = 1, ratio=(0.8,0.1,0.1))

Copying files: 10798 files [00:49, 217.12 files/s]
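
A quick way to sanity-check the split is to count the files per split and per class. The sketch below assumes the `train`/`val`/`test` layout with one subfolder per class that splitfolders documents for its output. The helper name `count_split` is hypothetical, and the demo runs on a throwaway directory tree instead of the real `output_folder`.

```python
import os
import tempfile

def count_split(output_dir):
    """Return {split: {class_name: file_count}} for an output directory."""
    counts = {}
    for split in sorted(os.listdir(output_dir)):
        split_path = os.path.join(output_dir, split)
        if not os.path.isdir(split_path):
            continue
        counts[split] = {
            cls: len(os.listdir(os.path.join(split_path, cls)))
            for cls in sorted(os.listdir(split_path))
            if os.path.isdir(os.path.join(split_path, cls))
        }
    return counts

# Toy tree standing in for the split output: 8/1/1 files per class.
with tempfile.TemporaryDirectory() as output_dir:
    layout = {"train": 8, "val": 1, "test": 1}
    for split, n_files in layout.items():
        for cls in ("animal", "pokemon"):
            cls_path = os.path.join(output_dir, split, cls)
            os.makedirs(cls_path)
            for i in range(n_files):
                open(os.path.join(cls_path, f"{i}.jpg"), "w").close()
    demo_counts = count_split(output_dir)

for split, per_class in demo_counts.items():
    print(split, per_class)
```

Calling `count_split('output_folder')` on the real output should show both classes in every split, with the per-class counts summing to 5399.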


Now that the folder is created, I will move it back into the data folder to tidy up the repository. 

In [None]:
train = ImageDataGenerator(rescale=1/255.0)
validation = ImageDataGenerator(rescale=1/255.0)

