# What is this script used for?

The aim of this notebook is to move the images from the folder where they are stored to new folders that are better suited to loading data for deep learning.

**This notebook prepares a folder architecture for the binary classification of images by faction (horde or alliance), using the images without background.**

More information about folder architectures for Keras loaders: https://machinelearningmastery.com/how-to-load-large-datasets-from-directories-for-deep-learning-with-keras/


# How is it done?

I use the **pandas, os and shutil** libraries.

In [1]:
import os
import pandas as pd
import shutil

from tqdm.notebook import tqdm

# Load information about characters

In [None]:
# import the dataframe with the characters' information for the database without background
df = pd.read_csv("Data/DB_without_BG.csv", header=0, sep=";", 
                 names=["Rank", "Guild", "Realm", "ItemsLvl", "name",
                        "class", "faction", "gender", "race", 
                        "url_with_background", "url_character", "ID"])

# Shuffle the rows before the split (random_state for reproducibility!)
df = df.sample(frac=1, random_state=0)

# We take 70% of the data for the training and the rest for the test (we'll use cross-validation 
# with the training dataset)
N_train = int(df.shape[0] * 0.7)

df_train = df[:N_train]
df_test = df[N_train:]
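The shuffle-then-slice split above can be checked on a small table before running it on the real data. This is a minimal sketch: the helper name `shuffle_split` and the toy rows are invented for the illustration and are not part of the original notebook.

```python
import pandas as pd

def shuffle_split(df, train_fraction=0.7, seed=0):
    """Shuffle the rows, then cut them into a train part and a test part."""
    shuffled = df.sample(frac=1, random_state=seed)
    n_train = int(len(shuffled) * train_fraction)
    # iloc slices by position, whatever the shuffled index looks like
    return shuffled.iloc[:n_train], shuffled.iloc[n_train:]

# toy data, invented for the illustration
toy = pd.DataFrame({
    "ID": [f"char{i}-realm" for i in range(10)],
    "faction": ["horde"] * 6 + ["alliance"] * 4,
})
toy_train, toy_test = shuffle_split(toy)

# share of each faction in the training part
print(toy_train["faction"].value_counts(normalize=True))
```

Comparing these shares with those of the full table gives a quick indication of whether the random split kept the faction balance.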

# Folder creation 

In [None]:
# creation of the folders that will contain the data
os.mkdir("Data/Without_background/train")
os.mkdir("Data/Without_background/test")

# for each group (train/test) and label (horde/alliance) a folder is created
for dtset_type in ["train", "test"]:
    for faction in ["horde", "alliance"]:
        os.mkdir("Data/Without_background/" + dtset_type + "/" + faction)
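`os.mkdir` raises `FileExistsError` when the cell is run a second time. A minimal sketch of a re-runnable variant, assuming the same folder names as above; the helper name `make_class_folders` is hypothetical and not part of the original notebook:

```python
import os

def make_class_folders(root, splits=("train", "test"), labels=("horde", "alliance")):
    """Create root/<split>/<label> for every combination and return the paths."""
    created = []
    for split in splits:
        for label in labels:
            path = os.path.join(root, split, label)
            # exist_ok=True makes the call safe to repeat;
            # makedirs also creates the intermediate <split> folder
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created
```

With the paths used in this notebook, the call would be `make_class_folders("Data/Without_background")`.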

# Move images to the corresponding folders

You can download the database here : https://drive.google.com/drive/u/0/folders/1jQmVFVDiBfFnWIGnOfJNonF2-bA6xQS5

In [None]:
# The path where the data are stored before the move
path_raw = "Data/Without_background"

# A simple loop to iterate on the dataframes created above as well as on the type of data set
for df_split, data_type in zip([df_train, df_test], ["train", "test"]):
    # for every character of the dataframe: its ID and its faction
    for image_id, faction in tqdm(zip(df_split["ID"], df_split["faction"]), total=len(df_split)):
        # images are stored in this format: "name-server.jpg"
        image = image_id + ".jpg"
        # the current path of the image
        current_path = path_raw + '/' + image
        # if the faction isn't horde --> it's alliance
        folder = "horde" if faction == "horde" else "alliance"
        # move the image to the folder of its data set and faction
        new_path = path_raw + '/' + data_type + '/' + folder + '/' + image
        shutil.move(current_path, new_path)
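After the move, counting the files in each folder is a cheap way to confirm that nothing was lost. A minimal sketch, assuming the `train`/`test` and `horde`/`alliance` folder names used above; the helper name `count_images` is hypothetical:

```python
import os

def count_images(root, splits=("train", "test"), labels=("horde", "alliance")):
    """Return {(split, label): number of .jpg files in root/<split>/<label>}."""
    counts = {}
    for split in splits:
        for label in labels:
            folder = os.path.join(root, split, label)
            counts[(split, label)] = sum(
                name.endswith(".jpg") for name in os.listdir(folder)
            )
    return counts
```

The total `sum(count_images("Data/Without_background").values())` should equal `len(df_train) + len(df_test)` if every image listed in the CSV was found and moved.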

# Validation dataset

**Edit: I added this part because, when I created this notebook, I thought I could use the "validation_split" option of the Keras fit method, but this is not possible with "ImageDataGenerator".**

Link: https://stackoverflow.com/questions/63166479/valueerror-validation-split-is-only-supported-for-tensors-or-numpy-arrays-fo

A validation set is needed to tune a machine learning (and therefore deep learning) algorithm correctly.
More information here: https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets#Validation_dataset

We take 20% of the training set in order to create this new set.

In [None]:
# directories of the training dataset (the folders without background created above)
dir_ally = "Data/Without_background/train/alliance"
dir_h2 = "Data/Without_background/train/horde"

# list of the images in each category
list_ally = os.listdir(dir_ally)
list_h2 = os.listdir(dir_h2)

# total number of images in the training dataset
nb_total = len(list_ally) + len(list_h2)

# 20% for the validation dataset
percent_valid = 0.2
nb_valid = int(nb_total * percent_valid)

# In order to have a representative validation set, one third of the validation images
# come from the alliance category (there are twice as many images of characters from the horde in the total dataset)
nb_ally_valid = int(nb_valid / 3)
nb_h2_valid = nb_valid - nb_ally_valid

# images that will be transferred to the validation set
ally_valid = list_ally[:nb_ally_valid]
h2_valid = list_h2[:nb_h2_valid]

# directory of the validation set
dir_valid = "Data/Without_background/validation/"
# create the folder and one sub-folder per faction (shutil.move needs them to exist)
for faction in ['alliance', 'horde']:
    os.makedirs(dir_valid + faction)

# Move images from training folder to validation folder
for list_name, direct, faction in zip([ally_valid, h2_valid], [dir_ally, dir_h2], ['alliance', 'horde']):
    for image_name in tqdm(list_name):
        current_path = direct + '/' + image_name
        new_path = dir_valid + faction + "/" + image_name
        shutil.move(current_path, new_path)
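The cell above fixes the alliance share of the validation set by hand. A possible alternative is to move the same fraction of every class folder, so that the validation set follows whatever ratio the training folders actually have. This is only a sketch: the helper name `move_fraction` and its parameters are hypothetical and not part of the original notebook.

```python
import os
import shutil

def move_fraction(train_dir, valid_dir, labels=("alliance", "horde"), fraction=0.2):
    """Move `fraction` of the files of each class folder from train_dir to valid_dir."""
    moved = {}
    for label in labels:
        src = os.path.join(train_dir, label)
        dst = os.path.join(valid_dir, label)
        # the destination sub-folder must exist before moving files into it
        os.makedirs(dst, exist_ok=True)
        # sorted() makes the selection deterministic; os.listdir order is arbitrary
        names = sorted(os.listdir(src))
        n_valid = int(len(names) * fraction)
        for name in names[:n_valid]:
            shutil.move(os.path.join(src, name), os.path.join(dst, name))
        moved[label] = n_valid
    return moved
```

The function returns the number of files moved per class, which can be compared with the 20% target.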