# What is this notebook used for?

This notebook extracts the information contained in the dictionary previously stored in the pickle file "top_50k_informations.pkl". 
This information is then stored in 2 DataFrames that serve as labels for the 2 image databases created previously (with or without background).


# How is it done?

I rely on the properties of Python dictionaries and on the pandas library.
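
As a minimal sketch of the idea, assume the pickled dictionary maps an integer index to either a per-character dictionary or an empty string when the character is missing. The toy names and values below are made up for illustration and are not taken from the real data.

```python
import pandas as pd

# Hypothetical toy data mimicking the assumed layout of the pickled dictionary:
# one entry per character, '' when the character could not be downloaded.
toy_informations = {
    0: {"name": "Aaa", "class": "mage"},
    1: "",
    2: {"name": "Ccc", "class": "warlock"},
}

toy_cols = ["name", "class"]

# Build one row per entry; missing characters become rows of empty strings.
rows = [
    {col: (entry.get(col, "") if isinstance(entry, dict) else "") for col in toy_cols}
    for entry in toy_informations.values()
]
toy_df = pd.DataFrame(rows, columns=toy_cols)
print(toy_df)
```

The function defined further down follows the same logic, with a progress bar and one list per column.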

In [1]:
import pickle
import pandas as pd
import re

from tqdm.notebook import tqdm

In [2]:
# DataFrame previously created with top 50K characters
df = pd.read_csv("Data/characters_informations.csv", sep=";", header=0, 
                  names=["Rank", "Character", "Guild", "Realm", "ItemsLvl"])

# Dictionary with information about these top 50K characters
with open('Data/pickle/top_50k_informations.pkl', 'rb') as save_file:
    characters_informations = pickle.load(save_file)

In [3]:
# Column names of the downloaded information
cols = list(characters_informations[1].keys())
print("Column names of the future DataFrame : \n", cols)

Column names of the future DataFrame : 
 ['name', 'class', 'faction', 'gender', 'race', 'url_with_background', 'url_character']


In [4]:
def data_extraction(dico_info, cols=cols, N=50000):
    """
    Extract the downloaded character information stored in "dico_info" for the
    first N characters, keeping only the features listed in "cols", and return
    it as a DataFrame.
    This function is built on the storage model of the dictionary "top_50k_informations.pkl".
        
    Parameters
    ----------
    dico_info : dict
                the dictionary with the previously downloaded information
    cols : list
           list of columns to recover and store
    N :  int
         number of rows
         
    Returns
    -------
    df_update : pandas.DataFrame 
                size = N * len(cols)
                information from dico_info as a DataFrame
    
    Example
    -------
    
    >>> df_update = data_extraction(characters_informations)
    --> See the output below
    """
    
    # One list per column; these lists are used to build the DataFrame
    data = {col: [] for col in cols}
    
    for i in tqdm(range(N)):
        # Information about one character: a dictionary, or '' when the
        # character was no longer on the website at download time
        dico_info_character = dico_info[i]
        
        for feature in cols:
            if isinstance(dico_info_character, dict):
                # Recover the feature, or '' if it is missing for this character
                data[feature].append(dico_info_character.get(feature, ''))
            else:
                # No data for this character: append '' to every list
                data[feature].append('')
                    
    # Transform the dictionary of lists into a DataFrame                
    df_update = pd.DataFrame(data)
    return df_update

In [5]:
# Use above function to create a DataFrame
df_update = data_extraction(characters_informations)
df_update.head()

| | name | class | faction | gender | race | url_with_background | url_character |
|---|---|---|---|---|---|---|---|
| 0 | Nekroth | death-knight | alliance | MALE | human | https://render-eu.worldofwarcraft.com/characte... | |
| 1 | Teliah | warlock | alliance | FEMALE | dark-iron-dwarf | https://render-eu.worldofwarcraft.com/characte... | https://render-eu.worldofwarcraft.com/characte... |
| 2 | Nárud | death-knight | horde | FEMALE | blood-elf | https://render-eu.worldofwarcraft.com/characte... | https://render-eu.worldofwarcraft.com/characte... |
| 3 | Vandaríel | death-knight | horde | MALE | blood-elf | https://render-eu.worldofwarcraft.com/characte... | |
| 4 | Ezraeelm | mage | horde | MALE | blood-elf | https://render-eu.worldofwarcraft.com/characte... | |


Some characters were no longer on the official World of Warcraft website and so could not be downloaded. For these characters the output is "", and the cell below removes the corresponding rows from the "df_update" DataFrame.

In [6]:
print("Before cleaning :", df_update.shape)

df_update = df_update[df_update["name"] != ""]

print("After removing missing values :", df_update.shape)

Before cleaning : (50000, 7)
After removing missing values : (44876, 7)
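
The filter above can be illustrated on a toy DataFrame. This is only a sketch with made-up values; the point is that the boolean mask keeps the original index labels of the surviving rows.

```python
import pandas as pd

# Hypothetical toy frame: the second character is missing (empty strings)
toy = pd.DataFrame({"name": ["Aaa", "", "Ccc"], "race": ["human", "", "orc"]})

# Boolean mask marking rows whose name is the empty-string placeholder
missing = toy["name"] == ""
print("Missing characters :", int(missing.sum()))

# Keep the other rows; their original index labels are preserved
cleaned = toy[~missing]
print(cleaned.index.tolist())
```

Because the index labels are preserved, the cleaned rows can still be aligned with the ranking table by index.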


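We then merge the 2 DataFrames into a single DataFrame containing all the information. Because the filtered "df_update" keeps its original index, merging on the index keeps only the characters that were actually downloaded.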

In [7]:
# Merge using index
df_update = pd.merge(df, df_update, left_index=True, right_index=True)
df_update.head()

| | Rank | Character | Guild | Realm | ItemsLvl | name | class | faction | gender | race | url_with_background | url_character |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Nekroth | Instinct | Alleria | 484.75 | Nekroth | death-knight | alliance | MALE | human | https://render-eu.worldofwarcraft.com/characte... | |
| 1 | 2 | Teliah | | Dun Modr | 484.38 | Teliah | warlock | alliance | FEMALE | dark-iron-dwarf | https://render-eu.worldofwarcraft.com/characte... | https://render-eu.worldofwarcraft.com/characte... |
| 2 | 3 | Nárud | Repax | Eredar | 484.25 | Nárud | death-knight | horde | FEMALE | blood-elf | https://render-eu.worldofwarcraft.com/characte... | https://render-eu.worldofwarcraft.com/characte... |
| 3 | 4 | Vandaríel | Astounding | Blackhand | 484.25 | Vandaríel | death-knight | horde | MALE | blood-elf | https://render-eu.worldofwarcraft.com/characte... | |
| 4 | 5 | Ezraeelm | Winterfall | Draenor | 484.0 | Ezraeelm | mage | horde | MALE | blood-elf | https://render-eu.worldofwarcraft.com/characte... | |
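
The index-based merge can be sketched on toy data as follows. The values are made up for illustration. `pd.merge` performs an inner join by default, so only index labels present in both frames survive.

```python
import pandas as pd

# Hypothetical ranking table with a default integer index (0, 1, 2)
ranking = pd.DataFrame({"Rank": [1, 2, 3], "Character": ["Aaa", "Bbb", "Ccc"]})

# Hypothetical details table where the character at index 1 was removed
details = pd.DataFrame({"name": ["Aaa", "Ccc"], "race": ["human", "orc"]},
                       index=[0, 2])

# Default how="inner": rows are matched on index labels present in both frames
merged = pd.merge(ranking, details, left_index=True, right_index=True)
print(merged)

# Optional sanity check: both name columns should agree after the merge
names_agree = bool((merged["Character"] == merged["name"]).all())
print("Names agree :", names_agree)
```

A check like `names_agree` is a cheap way to confirm that the two tables were aligned row by row before one of the name columns is dropped.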


The "Character" and "name" columns contain the same information, so one of the two can be dropped.

In [8]:
# Remove "Character" column
df_update = df_update.drop(columns=["Character"])

In [9]:
df_update.head()

| | Rank | Guild | Realm | ItemsLvl | name | class | faction | gender | race | url_with_background | url_character |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Instinct | Alleria | 484.75 | Nekroth | death-knight | alliance | MALE | human | https://render-eu.worldofwarcraft.com/characte... | |
| 1 | 2 | | Dun Modr | 484.38 | Teliah | warlock | alliance | FEMALE | dark-iron-dwarf | https://render-eu.worldofwarcraft.com/characte... | https://render-eu.worldofwarcraft.com/characte... |
| 2 | 3 | Repax | Eredar | 484.25 | Nárud | death-knight | horde | FEMALE | blood-elf | https://render-eu.worldofwarcraft.com/characte... | https://render-eu.worldofwarcraft.com/characte... |
| 3 | 4 | Astounding | Blackhand | 484.25 | Vandaríel | death-knight | horde | MALE | blood-elf | https://render-eu.worldofwarcraft.com/characte... | |
| 4 | 5 | Winterfall | Draenor | 484.0 | Ezraeelm | mage | horde | MALE | blood-elf | https://render-eu.worldofwarcraft.com/characte... | |


**Last but not least**, we need to create an ID to match images with their labels!

In World of Warcraft, once a character name is taken on a server, no other player on that server can use it. 
A character's name together with its server is therefore a unique identifier for that character.

The natural **ID is thus the name and the server** of a character, written in the same way as the file names of the stored images.

In [10]:
def create_ID(row):
    """
    The purpose of this function is to create an ID for each character using
    its name and realm (in lowercase).
    
    Parameters
    ----------
    row : pandas.core.series.Series
          the row of the dataframe df_update
          
    Returns
    -------
    unique_ID : string
                character's ID
    Example
    -------  
    >>> create_ID(df_update.iloc[0])
    --> 'nekroth-alleria'
    """
    # Character's name
    name = row["name"].lower()
    # Character's realm
    realm = row["Realm"].lower()
    # use "-" rather than a space because images are stored with a "-" separator
    unique_ID = "-".join((name, realm))
    # replace spaces with "-": some realm names are two words separated by a " "
    unique_ID = re.sub(' ', '-', unique_ID)
    # return the ID
    return unique_ID

In [11]:
df_update["ID"] = df_update.apply(create_ID, axis=1)
df_update.head(2)

| | Rank | Guild | Realm | ItemsLvl | name | class | faction | gender | race | url_with_background | url_character | ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Instinct | Alleria | 484.75 | Nekroth | death-knight | alliance | MALE | human | https://render-eu.worldofwarcraft.com/characte... | | nekroth-alleria |
| 1 | 2 | | Dun Modr | 484.38 | Teliah | warlock | alliance | FEMALE | dark-iron-dwarf | https://render-eu.worldofwarcraft.com/characte... | https://render-eu.worldofwarcraft.com/characte... | teliah-dun-modr |
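
Since the ID is meant to be unique, it may be worth verifying that on the data. The sketch below uses made-up names and builds the ID with vectorized string methods, which is an alternative to the row-wise function above and not the notebook's own method. It then checks uniqueness with `Series.is_unique`.

```python
import pandas as pd

# Hypothetical toy data: the same name on two realms, and a two-word realm
toy = pd.DataFrame({
    "name": ["Aaa", "Aaa", "Bbb"],
    "Realm": ["Dun Modr", "Eredar", "Dun Modr"],
})

# Vectorized variant of the ID: lowercase "name-realm", spaces replaced by "-"
toy["ID"] = (
    toy["name"].str.lower() + "-" + toy["Realm"].str.lower()
).str.replace(" ", "-", regex=False)
print(toy["ID"].tolist())

# True when no two rows share the same ID
print("IDs are unique :", toy["ID"].is_unique)

# Rows involved in a collision, if any (empty here)
duplicates = toy[toy["ID"].duplicated(keep=False)]
```

If `duplicates` were not empty on the real data, the affected rows would need to be inspected before using the ID to match images and labels.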


In the previous notebook "images_download.ipynb", 2 databases were created: one with the characters on a background, the other without the background (just the character). We therefore need 2 label DataFrames.

To identify which characters belong to each of these 2 databases, we only need to filter "df_update" on the columns "url_with_background" and "url_character".

In [12]:
# Information for images without background
df_without_bg = df_update[df_update["url_character"] != ""]
# sort by ID
df_without_bg = df_without_bg.sort_values("ID")

# Information for images with a background
df_with_bg = df_update[df_update["url_with_background"] != ""]
# sort by ID
df_with_bg = df_with_bg.sort_values("ID")

print("Number of images without background : {:d}".format(df_without_bg.shape[0]))
print("Number of images with background : {:d}".format(df_with_bg.shape[0]))

Number of images without background : 11004
Number of images with background : 44876


# Export the database information to CSV files

In [13]:
df_without_bg.to_csv("Data/DB_without_BG.csv", sep=";")
df_with_bg.to_csv("Data/DB_with_BG.csv", sep=";")
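
When these files are read back later, the same separator has to be given, and the saved index otherwise comes back as an extra "Unnamed: 0" column. The sketch below shows a round trip on made-up data through an in-memory buffer; the URL is a placeholder, not a real one. By default `read_csv` also turns empty strings into NaN, so `keep_default_na=False` is used here on the assumption that empty URLs should stay as "".

```python
import io

import pandas as pd

# Hypothetical toy labels; the first character has no image without background
toy = pd.DataFrame(
    {
        "name": ["Aaa", "Ccc"],
        "url_character": ["", "https://example.invalid/ccc.png"],
    },
    index=[0, 2],
)

# Write to an in-memory buffer instead of a file, with the same ";" separator
buffer = io.StringIO()
toy.to_csv(buffer, sep=";")
buffer.seek(0)

# index_col=0 restores the saved index; keep_default_na=False keeps "" as ""
restored = pd.read_csv(buffer, sep=";", index_col=0, keep_default_na=False)
print(restored)
```

Note that with `keep_default_na=False` every empty field is read as an empty string, including fields that were genuinely missing, so this choice depends on how the labels are used afterwards.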