# Video Game Statistics Scraping Project - Hugo Peralta Muñoz

This project scrapes a video game website that serves as a database cataloguing each title. The site is the well-known **Metacritic** (<https://www.metacritic.com/>), which covers movies, music, TV shows, and games, and is one of the most widely used sites for looking up video game scores.

The scraping is done with the `BeautifulSoup4` library in *Python*. Usage information and a link to its documentation are available on the *Python Package Index*: <https://pypi.org/project/beautifulsoup4/>

## Importing libraries

In [56]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

## Data initialization

We initialize all the data needed for the scrape and create a header that imitates a regular browser user with the characteristics we specify.

In [57]:
pages = 10

ranks = []
titles = []
descriptions = []
platforms = []
release_dates = []
developers = []
genres = []
metascores = []
user_scores = []


header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
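The header above is sent with every request in the cells that follow, and those requests do not check the response status. As a minimal sketch, a small wrapper could retry on a non-200 status and pause between attempts. The name `fetch_content` and its parameters are hypothetical and not part of the original notebook. The `get` argument is assumed to be any callable shaped like `requests.get`.

```python
import time

def fetch_content(get, url, headers, attempts=3, pause=1.0):
    """Call get(url, headers=headers) until it returns HTTP 200.

    `get` is any callable with the same shape as requests.get (an assumption
    of this sketch). Returns the response body, or None if every attempt
    returned a non-200 status.
    """
    for attempt in range(attempts):
        response = get(url, headers=headers)
        if response.status_code == 200:
            return response.content
        if attempt < attempts - 1:
            time.sleep(pause)  # wait before the next attempt
    return None
```

In the notebook this would be called as `fetch_content(requests.get, page, header)`. Network exceptions raised by `requests.get` are not handled in this sketch.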

We store the base URL that will be used to access the site and scrape its content.

In [58]:
page_games_url = "https://www.metacritic.com/browse/game/?releaseYearMin=1958&releaseYearMax=2025&page="

## Scraping the data

### Scraping the pages

Each listing page of the database shows $24$ games. We first scrape the entries for these video games.

In [59]:
game_pages = []

for page in range(1, pages + 1):
    game_pages.append(page_games_url + str(page))

game_pages

['https://www.metacritic.com/browse/game/?releaseYearMin=1958&releaseYearMax=2025&page=1',
 'https://www.metacritic.com/browse/game/?releaseYearMin=1958&releaseYearMax=2025&page=2',
 'https://www.metacritic.com/browse/game/?releaseYearMin=1958&releaseYearMax=2025&page=3',
 'https://www.metacritic.com/browse/game/?releaseYearMin=1958&releaseYearMax=2025&page=4',
 'https://www.metacritic.com/browse/game/?releaseYearMin=1958&releaseYearMax=2025&page=5',
 'https://www.metacritic.com/browse/game/?releaseYearMin=1958&releaseYearMax=2025&page=6',
 'https://www.metacritic.com/browse/game/?releaseYearMin=1958&releaseYearMax=2025&page=7',
 'https://www.metacritic.com/browse/game/?releaseYearMin=1958&releaseYearMax=2025&page=8',
 'https://www.metacritic.com/browse/game/?releaseYearMin=1958&releaseYearMax=2025&page=9',
 'https://www.metacritic.com/browse/game/?releaseYearMin=1958&releaseYearMax=2025&page=10']
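The page URLs above could also be assembled from a parameter dictionary rather than by string concatenation. The following is a minimal sketch using `urllib.parse.urlencode` from the standard library. The helper name `build_page_urls`, the constant `BASE_URL`, and the keyword arguments are hypothetical and not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.metacritic.com/browse/game/"

def build_page_urls(pages, year_min=1958, year_max=2025):
    """Return the listing URLs for pages 1..pages (hypothetical helper)."""
    urls = []
    for page in range(1, pages + 1):
        # urlencode keeps the dictionary's insertion order
        query = urlencode({
            "releaseYearMin": year_min,
            "releaseYearMax": year_max,
            "page": page,
        })
        urls.append(f"{BASE_URL}?{query}")
    return urls

game_pages = build_page_urls(10)
```

Keeping the query parameters in a dictionary makes it easier to change the year range without editing the URL string by hand.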

Once we have the page URLs, we need to extract the link to each video game's entry.

In [60]:
game_links = []

for page in game_pages:
    response = requests.get(page, headers=header)
    soup = BeautifulSoup(response.content, "html.parser")

    for game in soup.find_all("a", class_="c-finderProductCard_container"):
        game_links.append("https://www.metacritic.com" + game["href"])

game_links

['https://www.metacritic.com/game/the-legend-of-zelda-ocarina-of-time/',
 'https://www.metacritic.com/game/soulcalibur/',
 'https://www.metacritic.com/game/grand-theft-auto-iv/',
 'https://www.metacritic.com/game/super-mario-galaxy/',
 'https://www.metacritic.com/game/super-mario-galaxy-2/',
 'https://www.metacritic.com/game/the-legend-of-zelda-breath-of-the-wild/',
 'https://www.metacritic.com/game/tony-hawks-pro-skater-3/',
 'https://www.metacritic.com/game/perfect-dark-2000/',
 'https://www.metacritic.com/game/red-dead-redemption-2/',
 'https://www.metacritic.com/game/grand-theft-auto-v/',
 'https://www.metacritic.com/game/metroid-prime/',
 'https://www.metacritic.com/game/grand-theft-auto-iii/',
 'https://www.metacritic.com/game/super-mario-odyssey/',
 'https://www.metacritic.com/game/halo-combat-evolved/',
 'https://www.metacritic.com/game/nfl-2k1/',
 'https://www.metacritic.com/game/half-life-2/',
 'https://www.metacritic.com/game/bioshock/',
 'https://www.metacritic.com/game/gol
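Because the rank is later derived from the position of each link, the order of `game_links` matters, and a duplicated link would shift every rank after it. As a hedged sketch, relative `href` values could be resolved with `urllib.parse.urljoin` and de-duplicated while preserving first-seen order. The helper `absolute_unique_links` and the sample `href` values are illustrative assumptions, not output from the site.

```python
from urllib.parse import urljoin

SITE_ROOT = "https://www.metacritic.com"

def absolute_unique_links(hrefs, root=SITE_ROOT):
    """Resolve hrefs against root and drop duplicates, keeping first-seen order."""
    seen = set()
    links = []
    for href in hrefs:
        url = urljoin(root, href)
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links

# Hand-written sample hrefs, including one repeat.
sample_hrefs = ["/game/soulcalibur/", "/game/bioshock/", "/game/soulcalibur/"]
links = absolute_unique_links(sample_hrefs)
```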

The `game_links` variable now holds the link to each individual game. Next we extract the relevant data for each one, which fills the lists initialized earlier.

In [61]:
rank_counter = 1

for game_url in game_links:

    response = requests.get(game_url, headers=header)
    soup = BeautifulSoup(response.content, "html.parser")

    # Rank
    ranks.append(rank_counter)
    rank_counter += 1

    # Title
    title = soup.find("h1").get_text(strip=True)
    titles.append(title)

    # Description/Summary
    summary = soup.find("span", class_="c-productionDetailsGame_description g-text-xsmall")
    descriptions.append(summary.get_text(strip=True) if summary else None)

    # Platforms
    platform = soup.find("div", class_="c-gameDetails_Platforms u-flexbox u-flexbox-row") \
                   .find_all("li", class_="c-gameDetails_listItem g-color-gray70 u-inline-block")
    platforms.append(", ".join([p.get_text(strip=True) for p in platform]) if platform else None)

    # Release Date
    release_date = soup.find("div", class_="c-gameDetails_ReleaseDate u-flexbox u-flexbox-row") \
                       .find("span", class_="g-outer-spacing-left-medium-fluid g-color-gray70 u-block")
    release_dates.append(release_date.get_text(strip=True) if release_date else None)

    # Developer
    developer = soup.find("div", class_="c-gameDetails_Developer u-flexbox u-flexbox-row") \
                    .find("li", class_="c-gameDetails_listItem g-color-gray70 u-inline-block")
    developers.append(developer.get_text(strip=True) if developer else None)

    # Genre
    genre = soup.find("ul", class_="c-genreList u-flexbox u-block g-outer-spacing-left-medium-fluid") \
                .find_all("span", class_="c-globalButton_label")
    genres.append(", ".join([g.get_text(strip=True) for g in genre]) if genre else None)

    # Metascore
    metascore = soup.find("div", class_="c-siteReviewScore u-flexbox-column u-flexbox-alignCenter u-flexbox-justifyCenter g-text-bold c-siteReviewScore_green g-color-gray90 c-siteReviewScore_medium") \
                    .find("span")
    metascores.append(metascore.get_text(strip=True) if metascore else None)

    # User Score
    user_score = soup.find("div", class_="c-siteReviewScore u-flexbox-column u-flexbox-alignCenter u-flexbox-justifyCenter g-text-bold c-siteReviewScore_green c-siteReviewScore_user g-color-gray90 c-siteReviewScore_medium")
    if user_score:
        user_score = user_score.find("span")
    user_scores.append(user_score.get_text(strip=True) if user_score else None)
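In the loop above, the chained lookups of the form `soup.find(...).find(...)` raise an `AttributeError` when the outer `find` returns `None`, before the `if ... else None` guard is reached. This could happen, for example, if a game page lacks a section, or if a score is not rendered with the `c-siteReviewScore_green` class that the lookup string includes. A minimal sketch of a `None`-safe alternative follows. The helpers `find_chain` and `text_or_none` are hypothetical, the HTML is a tiny hand-written sample rather than real Metacritic markup, and matching on a single class name instead of the full class string is a design choice of this sketch.

```python
from bs4 import BeautifulSoup

def find_chain(root, *steps):
    """Apply successive find(tag, class_=...) lookups, stopping at the first miss."""
    node = root
    for tag, css_class in steps:
        if node is None:
            return None
        node = node.find(tag, class_=css_class)
    return node

def text_or_none(node):
    """Return the stripped text of a node, or None when the node is missing."""
    return node.get_text(strip=True) if node is not None else None

# Tiny hand-written HTML, not real Metacritic markup.
sample = BeautifulSoup(
    '<div class="c-gameDetails_Developer u-flexbox"><ul>'
    '<li class="c-gameDetails_listItem">Nintendo</li></ul></div>',
    "html.parser",
)

developer = text_or_none(find_chain(
    sample,
    ("div", "c-gameDetails_Developer"),
    ("li", "c-gameDetails_listItem"),
))
missing = text_or_none(find_chain(
    sample,
    ("div", "c-gameDetails_ReleaseDate"),
    ("span", "u-block"),
))
```

With this pattern every field appends exactly one value per game, either the text or `None`, so the lists stay aligned even when a page is incomplete.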


We take a quick look at the lists to spot any obvious failures after the scrape.

In [62]:
print(ranks)
print(titles)
print(descriptions)
print(platforms)
print(release_dates)
print(developers)
print(genres)
print(metascores)
print(user_scores)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 22

We also check that all the lengths match, so that building the dataframe causes no problems. Each listing page yields $24$ games on average, so the `pages` value set at the top determines the length of the dataset (here, $10 \times 24 = 240$ entries).

In [63]:
print(len(ranks))
print(len(titles))
print(len(descriptions))
print(len(platforms))
print(len(release_dates))
print(len(developers))
print(len(genres))
print(len(metascores))
print(len(user_scores))


240
240
240
240
240
240
240
240
240
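Instead of comparing the printed lengths by eye, the same check could be expressed as a function that fails loudly on a mismatch. This is a minimal sketch. The helper `check_equal_lengths` is hypothetical and the dictionary below holds illustrative values, not scraped data.

```python
def check_equal_lengths(columns):
    """Return the shared length of all lists, or raise ValueError listing the lengths."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)

# Illustrative data, not scraped values.
columns = {
    "ranks": [1, 2, 3],
    "titles": ["a", "b", "c"],
    "metascores": ["99", "98", "98"],
}
n_rows = check_equal_lengths(columns)
```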


Now we define a function to normalize the lengths of the lists. In general they will not cause any problem, but should they differ, we take the lists whose values are unique per game (rank, title, and description) to determine how many entries we should have.

In [64]:
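def normalize_length(values, length_wanted, default_value=None):
    # Pad with default_value when too short; otherwise truncate to length_wanted.
    # The parameter is named `values` to avoid shadowing the built-in `list`.
    if len(values) < length_wanted:
        return values + [default_value] * (length_wanted - len(values))
    return values[:length_wanted]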
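To make the behavior of the function concrete, here is a short usage sketch with made-up inputs. The function is repeated so the block runs on its own.

```python
def normalize_length(values, length_wanted, default_value=None):
    """Pad with default_value or truncate so the result has length_wanted items."""
    if len(values) < length_wanted:
        return values + [default_value] * (length_wanted - len(values))
    return values[:length_wanted]

short = normalize_length(["99", "98"], 4)     # ['99', '98', None, None]
long = normalize_length([1, 2, 3, 4, 5], 3)   # [1, 2, 3]
exact = normalize_length(["a", "b"], 2)       # ['a', 'b']
```

Padding appends the default value at the end, so it only keeps rows aligned if the missing entries really belong at the end of the list.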

In [65]:
max_length = max(len(ranks), len(titles), len(descriptions))

# Normalize all lists to match length
ranks = normalize_length(ranks, max_length)
titles = normalize_length(titles, max_length)
descriptions = normalize_length(descriptions, max_length)
platforms = normalize_length(platforms, max_length)
release_dates = normalize_length(release_dates, max_length)
developers = normalize_length(developers, max_length)
genres = normalize_length(genres, max_length)
metascores = normalize_length(metascores, max_length)
user_scores = normalize_length(user_scores, max_length)

We now create the `DataFrame` from the lists scraped earlier. The rank is used as the index of the dataframe, so the games are ordered by their score ranking.

In [66]:
games_df = pd.DataFrame({
    "Rank": ranks,
    "Title": titles,
    "Description": descriptions,
    "Developer": developers,
    "Release Date": release_dates,
    "Genre": genres,
    "Metascore": metascores,
    "User Score": user_scores,
    "Platform": platforms
}).set_index("Rank")

We can now inspect the fields of our dataset by viewing the first $20$ entries.

In [67]:
games_df.head(20)

| Rank | Title | Description | Developer | Release Date | Genre | Metascore | User Score | Platform |
|---|---|---|---|---|---|---|---|---|
| 1 | The Legend of Zelda: Ocarina of Time | As a young boy, Link is tricked by Ganondorf, ... | Nintendo | Nov 23, 1998 | Open-World Action | 99 | 9.1 | Nintendo 64 |
| 2 | SoulCalibur | [Xbox Live Arcade] Soulcalibur, the highest M... | Namco | Sep 8, 1999 | 3D Fighting | 98 | 7.8 | Dreamcast, iOS (iPhone/iPad), Xbox 360 |
| 3 | Grand Theft Auto IV | [Metacritic's 2008 PS3 Game of the Year; Also ... | Rockstar North | Apr 29, 2008 | Open-World Action | 98 | 8.3 | PlayStation 3, Xbox 360, PC |
| 4 | Super Mario Galaxy | [Metacritic's 2007 Wii Game of the Year] The u... | Nintendo | Nov 12, 2007 | 3D Platformer | 97 | 9.1 | Wii |
| 5 | Super Mario Galaxy 2 | Super Mario Galaxy 2, the sequel to the galaxy... | Nintendo EAD Tokyo | May 23, 2010 | 3D Platformer | 97 | 9.0 | Wii |
| 6 | The Legend of Zelda: Breath of the Wild | Ignore everything you know about The Legend of... | Nintendo | Mar 3, 2017 | Open-World Action | 97 | 8.9 | Wii U, Nintendo Switch |
| 7 | Tony Hawk's Pro Skater 3 | Challenge up to four friends in online competi... | Neversoft Entertainment | Oct 30, 2001 | Skating | 97 | 7.7 | PlayStation 2, GameCube, Xbox, PlayStation, PC... |
| 8 | Perfect Dark (2000) | [Xbox Live Arcade] Agent Joanna Dark hit the ... | Rare Ltd. | May 22, 2000 | FPS | 97 | 8.5 | Nintendo 64, Xbox 360 |
| 9 | Red Dead Redemption 2 | Developed by the creators of Grand Theft Auto ... | Rockstar Games | Oct 26, 2018 | Open-World Action | 97 | 8.9 | Xbox One, PlayStation 4, PC |
| 10 | Grand Theft Auto V | Los Santos is a vast, sun-soaked metropolis fu... | Rockstar North | Nov 18, 2014 | Open-World Action | 97 | 8.5 | PlayStation 3, Xbox 360, PlayStation 4, Xbox O... |
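Since every value comes from `get_text`, the score and date columns hold strings. If numeric sorting or date filtering is needed, they could be converted with `pd.to_numeric` and `pd.to_datetime`. The sketch below is an assumption about a possible next step, not part of the original notebook. The first two rows reuse values from the table above, while the third row, including its `"tbd"` placeholder, is entirely hypothetical and only there to show how unparseable values are coerced.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Title": [
        "The Legend of Zelda: Ocarina of Time",
        "SoulCalibur",
        "Hypothetical Game",  # made-up row for illustration
    ],
    "Release Date": ["Nov 23, 1998", "Sep 8, 1999", "Jan 1, 2020"],
    "Metascore": ["99", "98", "97"],
    "User Score": ["9.1", "7.8", "tbd"],
})

# errors="coerce" turns values that cannot be parsed into NaN
sample_df["Metascore"] = pd.to_numeric(sample_df["Metascore"], errors="coerce")
sample_df["User Score"] = pd.to_numeric(sample_df["User Score"], errors="coerce")

# The format string matches dates such as "Nov 23, 1998"
sample_df["Release Date"] = pd.to_datetime(
    sample_df["Release Date"], format="%b %d, %Y", errors="coerce"
)
```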


Finally, we export the dataframe to a CSV file, which can serve, for example, as a historical record of the site's current ranking.

In [68]:
games_df.to_csv("metacritic_games.csv")
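When the CSV is loaded again later, the `Rank` column can be restored as the index with `index_col`. The following sketch round-trips a small illustrative frame through an in-memory buffer instead of a file. The frame `sample_df` reuses two rows from the table above and is not the full dataset.

```python
import io
import pandas as pd

sample_df = pd.DataFrame({
    "Rank": [1, 2],
    "Title": ["The Legend of Zelda: Ocarina of Time", "SoulCalibur"],
    "Platform": ["Nintendo 64", "Dreamcast, iOS (iPhone/iPad), Xbox 360"],
}).set_index("Rank")

# Write to an in-memory buffer, then read it back with Rank as the index
buffer = io.StringIO()
sample_df.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="Rank")
```

Fields that contain commas, such as the platform list, are quoted on export, so they come back as a single value.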