# What is this notebook used for?

The purpose of this notebook is to retrieve information about the top 50,000 EU players (ranked by item level) in World of Warcraft. A later notebook will use this information to download the corresponding images of these players, which will serve to train Artificial Intelligence models.

# How is it done?

I use **requests** and **BeautifulSoup** to retrieve this information from https://www.wowprogress.com/gearscore/eu and load the data into a DataFrame using **pandas**.

In [1]:
import requests
import pandas as pd

from tqdm.notebook import tqdm
from bs4 import BeautifulSoup as bs

In [2]:
def recoverCharacterInformation(url_page):
    """
    This function retrieves the characters' information from url_page using requests
    and BeautifulSoup and returns it as a list of dictionaries.
    
    The function is based on the html architecture of the wowprogress website.

    Parameters
    ----------
    url_page : string
               URL of a wowprogress page (23 players per page)

    Returns
    -------
    infosCharacters : list
                      list of 23 dictionaries, one per player
                  
    Example 
    -------
    
    >>> recoverCharacterInformation("https://www.wowprogress.com/pve/eu/rating/next/0/rating.#rating370")
    
    [{'rank': '1', 'character': 'Nekroth', 'guild': 'Instinct', 'raid': '', 'realm': 'Alleria', 'itemsLvl': '484.75'},
    ...
    {'rank': '23', 'character': 'Yukile', 'guild': 'Not Great Not Terr..', 'raid': '', 'realm': 'Sylvanas', 
    'itemsLvl': '483.50'}]
     
    """
    # List that will contain the information for each character on the page
    infosCharacters = []
    
    # Send a GET request to the URL
    response = requests.get(url_page)
           
    # Parse the HTML so that it can be searched
    soup = bs(response.content, "html.parser")
    
    # Retrieve the table containing the information
    ratingTable = soup.find("table", class_="rating")
    # Retrieve the rows of this table (find_all returns a list)
    Rows = ratingTable.find_all("tr")
    
    # Column names
    recoveredItems = ["rank", "character", "guild", 
                      "raid", "realm", "itemsLvl"]
    # Skip the first row (index 0) because it contains the column names
    for i in range(1, len(Rows)):
        # One dictionary per row (i.e. per character)
        dicoCharacter = {}
        # The cells of Rows[i] are "td" elements
        row = Rows[i].find_all("td")
        
        # Use j here so that the inner loop does not reuse the outer index i
        for j in range(len(recoveredItems)):
            # Name of the column at position j
            col_name = recoveredItems[j]
            # Add the text of the matching cell to the dictionary
            dicoCharacter[col_name] = row[j].get_text()

        # Add the row's dictionary to the list
        infosCharacters.append(dicoCharacter)
    # Return the list
    return infosCharacters
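
The row-parsing logic can be tried out without contacting the website. The sketch below applies the same table lookup to a small hand-written HTML snippet. The snippet, the helper name `parse_rating_table`, and the decisions to strip whitespace and skip rows with too few cells are assumptions made for illustration; they are not taken from the real page.

```python
from bs4 import BeautifulSoup

COLUMNS = ["rank", "character", "guild", "raid", "realm", "itemsLvl"]

# Hand-written sample that only mimics the layout the function expects:
# a table with class "rating" whose first row holds the column names.
SAMPLE_HTML = """
<table class="rating">
  <tr><th>Rank</th><th>Character</th><th>Guild</th><th>Raid</th><th>Realm</th><th>iLvl</th></tr>
  <tr><td>1</td><td>Nekroth</td><td>Instinct</td><td></td><td>Alleria</td><td>484.75</td></tr>
  <tr><td>23</td><td> Yukile </td><td>Not Great Not Terr..</td><td></td><td>Sylvanas</td><td>483.50</td></tr>
  <tr><td colspan="6">row without character data</td></tr>
</table>
"""


def parse_rating_table(html):
    """Return one dictionary per data row of the "rating" table (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="rating")
    if table is None:
        # No matching table: return an empty list instead of raising
        return []

    characters = []
    # Skip the header row, then read the cells of every remaining row
    for tr in table.find_all("tr")[1:]:
        cells = tr.find_all("td")
        if len(cells) < len(COLUMNS):
            # Not a character row (too few cells), so ignore it
            continue
        characters.append(
            {name: cell.get_text(strip=True) for name, cell in zip(COLUMNS, cells)}
        )
    return characters


parsed = parse_rating_table(SAMPLE_HTML)
print(parsed[0])
```

Pairing column names with cells through `zip` avoids index bookkeeping, and the `None` check keeps a page with an unexpected layout from raising an `AttributeError`.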

## To retrieve information from the different pages, we first need to identify the structure of the URL

The URL in the address bar doesn't change when we click the "next" button to go to the next page, but inspecting the button (right click + Inspect) reveals the URLs below.

URL of page 2: https://www.wowprogress.com/gearscore/eu/char_rating/next/0#char_rating

URL of page 3: https://www.wowprogress.com/gearscore/eu/char_rating/next/1#char_rating

...

We can see that only the page index changes in the URL (0 for page 2, 1 for page 3, and so on), so a **simple loop is enough to iterate over the URLs.**

There are 23 characters per page and the website ranks the top 50,000 players, so we need 50000/23 ≈ 2173.9, rounded up to **2174 pages.**
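
The page count and the URL pattern described above can be checked with a few lines of Python. This is only a sketch: the constant names and the `page_urls` helper are hypothetical, and the pattern is the one observed in the two example URLs.

```python
import math

PLAYERS_RANKED = 50_000
PLAYERS_PER_PAGE = 23
MAIN_PAGE = "https://www.wowprogress.com/gearscore/eu"


def page_urls(total_players=PLAYERS_RANKED, per_page=PLAYERS_PER_PAGE):
    """Return the main page followed by every "next" page URL (hypothetical helper)."""
    # Round up: the last page is only partially filled
    n_pages = math.ceil(total_players / per_page)
    urls = [MAIN_PAGE]
    # Page 2 uses index 0, page 3 uses index 1, and so on
    for index in range(n_pages - 1):
        urls.append(f"{MAIN_PAGE}/char_rating/next/{index}#char_rating")
    return urls


urls = page_urls()
print(len(urls))   # 2174
print(urls[1])     # URL of page 2
```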

In [3]:
# URL of the main page
urlMainPage = "https://www.wowprogress.com/gearscore/eu"

In [4]:
# Number of pages to scrape in the loop (2174 - 1 because the first page is fetched outside the loop)
N = 2173

# This dictionary will contain the output of the above function for each page
dicoList = {}

# The main page doesn't have the same URL structure, so we call the above function outside the loop
dicoList["0-23"] = recoverCharacterInformation(urlMainPage)
 
# Iterate over the URLs and add each page's list to the dictionary
for i in tqdm(range(N)):
    # Build the URL as described above
    url = "https://www.wowprogress.com/gearscore/eu/char_rating/next/" + str(i) + "#char_rating"
    # Add the information from this page
    dicoList[str(23*(i+1)) + '-' + str(23*(i+2))] = recoverCharacterInformation(url)
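
With more than two thousand requests, a single page that fails (for example, when no table with class "rating" is found, `soup.find` returns `None` and the next call raises an `AttributeError`) stops the whole loop. One possible way to guard against this is sketched below. The names `scrape_all` and `fake_fetch`, the `delay_seconds` parameter, and the example URLs are hypothetical; in the notebook, `recoverCharacterInformation` would be passed as `fetch_page`.

```python
import time


def scrape_all(urls, fetch_page, delay_seconds=0.0):
    """Call fetch_page on every URL, recording failures instead of stopping.

    fetch_page is any callable that takes a URL and returns a list of
    dictionaries (hypothetical signature, matching the function above).
    """
    results = {}
    failed = []
    for url in urls:
        try:
            results[url] = fetch_page(url)
        except Exception as error:
            # Remember the failing URL so it can be retried later
            failed.append((url, repr(error)))
        # Optional pause between requests to avoid hammering the server
        time.sleep(delay_seconds)
    return results, failed


def fake_fetch(url):
    """Stand-in for a real page download, used only for this demonstration."""
    if url.endswith("/bad"):
        raise AttributeError("rating table not found")
    return [{"rank": "1", "character": "Example"}]


results, failed = scrape_all(
    ["https://example.invalid/ok", "https://example.invalid/bad"], fake_fetch
)
print(len(results), len(failed))
```

The failed URLs can then be passed to `scrape_all` again once the run has finished, rather than restarting from the first page.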

# DataFrame creation & export into a csv file

In [5]:
# These lists will contain the information retrieved above
list_ranks = []
list_characters = []
list_guilds = []
list_realm = []
list_ilvl = []

# We iterate on the lists of different pages
for twenty_3_characters in tqdm(dicoList.values()):
    # For 1 page, we iterate on the characters of this page
    for character in twenty_3_characters:
        # Add the character's information to the lists
        list_ranks.append(character["rank"])
        list_characters.append(character["character"])
        list_guilds.append(character["guild"])
        list_realm.append(character["realm"])
        list_ilvl.append(character["itemsLvl"])
        
# Dataframe creation        
data = {"Rank": list_ranks,
        "Character": list_characters,
        "Guild": list_guilds,
        "Realm": list_realm,
        "ItemsLvl": list_ilvl}

df = pd.DataFrame(data)
df.tail()


| | Rank | Character | Guild | Realm | ItemsLvl |
|---|---|---|---|---|---|
| 49995 | 49996 | Skyæh | CynicaI | Ragnaros | 470.31 |
| 49996 | 49997 | Kamazing | Schattenspiel | Garrosh | 470.31 |
| 49997 | 49998 | Barnzyh | Why no love | Twisting Nether | 470.31 |
| 49998 | 49999 | Yoinkymage | Couch Drei | Aegwynn | 470.31 |
| 49999 | 50000 | Ragnarøz | Yes Lads | Ragnaros | 470.25 |
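
The per-column lists are not strictly necessary, because pandas can build a DataFrame directly from a list of dictionaries. The sketch below shows this alternative on a two-page stand-in for `dicoList`. The variable names are hypothetical, and the conversion of the scraped strings to numbers is an optional extra step that the notebook itself does not perform.

```python
from itertools import chain

import pandas as pd

# Tiny stand-in for dicoList: one list of character dictionaries per page
dico_list_example = {
    "0-23": [{"rank": "1", "character": "Nekroth", "guild": "Instinct",
              "raid": "", "realm": "Alleria", "itemsLvl": "484.75"}],
    "23-46": [{"rank": "23", "character": "Yukile", "guild": "Not Great Not Terr..",
               "raid": "", "realm": "Sylvanas", "itemsLvl": "483.50"}],
}

# Flatten the pages into a single list of character dictionaries
rows = list(chain.from_iterable(dico_list_example.values()))

# One row per dictionary; drop the unused "raid" column and rename the others
df_example = pd.DataFrame(rows).drop(columns="raid").rename(
    columns={"rank": "Rank", "character": "Character", "guild": "Guild",
             "realm": "Realm", "itemsLvl": "ItemsLvl"}
)

# Optional: the scraped values are strings, so convert the numeric columns
df_example["Rank"] = pd.to_numeric(df_example["Rank"])
df_example["ItemsLvl"] = pd.to_numeric(df_example["ItemsLvl"])

print(df_example)
```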


In [6]:
# Export the DataFrame into a csv file
df.to_csv("Data/characters_informations.csv", sep=';')
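
Because `to_csv` also writes the DataFrame index as a first column without a name, the file has to be read back with the same separator and with that column declared as the index. Otherwise pandas labels it `Unnamed: 0`. The sketch below illustrates this on a two-row DataFrame and an in-memory buffer instead of the real file; the variable names are hypothetical.

```python
import io

import pandas as pd

# Two rows in the same shape as the exported DataFrame
df_small = pd.DataFrame({
    "Rank": ["49999", "50000"],
    "Character": ["Yoinkymage", "Ragnarøz"],
    "Guild": ["Couch Drei", "Yes Lads"],
    "Realm": ["Aegwynn", "Ragnaros"],
    "ItemsLvl": ["470.31", "470.25"],
})

# Write to an in-memory buffer with the same separator as the notebook
buffer = io.StringIO()
df_small.to_csv(buffer, sep=";")

# Reading without index_col keeps the saved index as an extra column
buffer.seek(0)
naive = pd.read_csv(buffer, sep=";")

# Reading with index_col=0 restores the original layout
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";", index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```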