# Web scraping

**Web scraping** is the process of obtaining content and information from websites through software. The most common approach is to download the HTML of a page and use its DOM structure to locate the desired information. In this notebook, we retrieve the reviews of certain games from the website *Metacritic*, so we first had to analyze the structure of its pages.
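
As a minimal sketch of the idea, the snippet below extracts the text of the tags carrying a given class from a small, made-up HTML fragment. It uses only the standard library's `html.parser` so that it runs without extra dependencies; the rest of this notebook relies on BeautifulSoup instead. The class name `ClassTextExtractor` and the HTML fragment are assumptions made for illustration.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every <span> tag that carries a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'span' and self.target_class in classes:
            self.inside = True
            self.texts.append('')

    def handle_endtag(self, tag):
        if tag == 'span':
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.texts[-1] += data


# Made-up fragment, only for illustration (not copied from any real page)
html = '<ol><li><span class="label">Positive:</span><span class="count">1,234</span></li></ol>'
parser = ClassTextExtractor('count')
parser.feed(html)
print(parser.texts)
```

The sketch does not handle nested `<span>` tags, which is one reason a full parser such as BeautifulSoup is more convenient for real pages.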

## Imports

First, we import the libraries needed to download and process the data.

In [1]:
from bs4 import BeautifulSoup
import requests, pickle
import numpy as np, pandas as pd
#from tqdm import tqdm_notebook # Progress bar (deprecated; use tqdm.notebook.tqdm instead wherever it is used)
from time import sleep

## Functions

The next step is to define the functions that let us retrieve this information in an automated way.

In [44]:
def get_soup(url):
    sleep(1)
    page = requests.get(url, headers = {'user-agent': 'Mozilla/5.0'})
    return BeautifulSoup(page.content, 'html.parser')
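
`get_soup` parses whatever the server returns, including error pages. A possible safeguard, sketched here under assumptions, is to retry when the status code is not 200. The helper name `fetch_with_retry` and its parameters are hypothetical; the `fetch` callable is injected (it could be `requests.get`) so that the example runs offline with a fake response.

```python
from time import sleep


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) until the response has status code 200.

    `fetch` is any callable returning an object with a `status_code`
    attribute; it is a parameter so the sketch can run without network access.
    """
    last_status = None
    for attempt in range(retries):
        response = fetch(url)
        last_status = response.status_code
        if last_status == 200:
            return response
        sleep(delay * (attempt + 1))  # wait a little longer after each failure
    raise RuntimeError(f'{url} failed with status {last_status} after {retries} attempts')


class FakeResponse:
    """Stand-in for a real HTTP response (assumption for this offline demo)."""

    def __init__(self, status_code):
        self.status_code = status_code


# The fake server fails once and then succeeds
statuses = iter([503, 200])
response = fetch_with_retry(lambda url: FakeResponse(next(statuses)), 'http://example.com', delay=0)
print(response.status_code)
```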

In [54]:
def get_reviews_overview(game, platform, category):
    d = {}
    d['category'] = category
    soup = get_soup(f'http://www.metacritic.com/game/{platform}/{game}/{category}-reviews').find('div', class_='module score_details_module')
    if category == 'critic':
        d['medium_score'] = int(soup.find('div', class_='score_summary metascore_summary').find('span').get_text())
    elif category == 'user': # Category is user; different structure in html
        d['medium_score'] = float(soup.find('div', class_='metascore_w').get_text())
    score_count = soup.find('ol', class_='score_counts hover_none')
    for item in score_count.select('li[class*="score_count"]'): # Strip colons and commas so that the labels match the column names
        d[item.find('span', class_='label').get_text().replace(':','').lower()] = int(item.find('span', class_='count').get_text().replace(',',''))
    d['scored_reviews'] = d['positive'] + d['mixed'] + d['negative']
    return d
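
The label and count clean-up done inside the loop can be sketched in isolation as follows. The helper name `normalize_count` and the sample values are assumptions for illustration; the extra `strip()` is also an addition that the function above does not perform.

```python
def normalize_count(label, count):
    """Turn a pair such as ('Positive:', '1,234') into ('positive', 1234)."""
    key = label.replace(':', '').strip().lower()
    value = int(count.replace(',', ''))  # remove thousands separators
    return key, value


# Made-up label/count pairs in the shape the page is assumed to provide
raw = [('Positive:', '1,234'), ('Mixed:', '56'), ('Negative:', '7')]
counts = dict(normalize_count(label, count) for label, count in raw)
counts['scored_reviews'] = counts['positive'] + counts['mixed'] + counts['negative']
print(counts)
```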

In [46]:
def get_reviews_critics(game, platform):
    soup = get_soup(f'http://www.metacritic.com/game/{platform}/{game}/critic-reviews').find('div', class_='module reviews_module critic_reviews_module')
    # Retrieve scored reviews
    scored_reviews = soup.find('ol', class_='reviews critic_reviews')
    data_scored_reviews = []
    for review in scored_reviews.select(f'li[class*="review critic_review"]'):
        current_review = {}
        # Source of the review and link to original text
        if review.find('div', 'source').find('a') is not None:
            source = review.find('div', 'source').find('a')
            current_review['source'] = source.get_text()
            current_review['link'] = source['href']
        else: # There is no link to original review
            current_review['source'] = review.find('div', 'source').get_text()
            current_review['link'] = ''
        # Date of the review
        current_review['date'] = review.find('div', 'date').get_text()
        # Review grade and score type
        current_review['grade'] = int(review.find('div', 'metascore_w').get_text())
        if current_review['grade'] < 50:
            current_review['scoreType'] = 'Negative'
        elif current_review['grade'] > 74:
            current_review['scoreType'] = 'Positive'
        else:
            current_review['scoreType'] = 'Mixed'
        # Review body
        exp = review.find('span', class_='blurb blurb_expanded')
        if exp is None:
            current_review['text'] = review.find('div', class_='review_body').get_text()
        else: # Large review that has an expand option
            current_review['text'] = exp.get_text()
        data_scored_reviews.append(current_review)
    # Retrieve unscored reviews
    unscored_reviews = soup.find('div', 'unscored_reviews').find('ol', class_='reviews critic_reviews')
    data_unscored_reviews = []
    for review in unscored_reviews.select(f'li[class*="review critic_review"]'):
        current_review = {}
        # Source of the review and link to original text
        if review.find('div', 'source').find('a') is not None:
            source = review.find('div', 'source').find('a')
            current_review['source'] = source.get_text()
            current_review['link'] = source['href']
        else: # There is no link to original review
            current_review['source'] = review.find('div', 'source').get_text()
            current_review['link'] = ''
        # Date of the review
        current_review['date'] = review.find('div', 'date').get_text()
        # Review grade and score type (omitted)
        #current_review['grade'] = ''
        #current_review['scoreType'] = ''
        # Review body
        exp = review.find('span', class_='blurb blurb_expanded')
        if exp is None:
            current_review['text'] = review.find('div', class_='review_body').get_text()
        else: # Large review that has an expand option
            current_review['text'] = exp.get_text()
        data_unscored_reviews.append(current_review)
    return data_scored_reviews, data_unscored_reviews
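
The score type rule can be written as a small standalone function. This is only a sketch: the name `score_type` and the `scale` parameter are hypothetical, while the thresholds are the ones used in this notebook (below 50 is negative and above 74 is positive for critics; the 0-10 thresholds are those of the user review function defined further down).

```python
def score_type(grade, scale=100):
    """Classify a grade as 'Negative', 'Mixed' or 'Positive'.

    scale=100 uses the critic thresholds, scale=10 the user thresholds.
    """
    if scale == 100:
        low, high = 50, 74
    elif scale == 10:
        low, high = 5, 7
    else:
        raise ValueError('scale must be 10 or 100')
    if grade < low:
        return 'Negative'
    if grade > high:
        return 'Positive'
    return 'Mixed'


for grade in (49, 50, 74, 75):
    print(grade, score_type(grade))
```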

In [65]:
def get_reviews_users(game, platform):
    i = 0
    data_user_reviews = []
    thereAreReviews = True
    while thereAreReviews:
        # Distinguish first query from the others
        if i == 0:
            soup = get_soup(f'http://www.metacritic.com/game/{platform}/{game}/user-reviews').find('div', class_='module reviews_module user_reviews_module')
        else:
            soup = get_soup(f'http://www.metacritic.com/game/{platform}/{game}/user-reviews?page={i}').find('div', class_='module reviews_module user_reviews_module')
        # Retrieve this page's reviews
        #print(soup)
        if soup.find('ol', class_='reviews user_reviews') is not None:
            reviews = soup.find('ol', class_='reviews user_reviews')
            for review in reviews.select(f'li[class*="review user_review"]'):
                current_review = {}
                # Source of the review and link to original text
                if review.find('div', 'name').find('a') is not None:
                    source = review.find('div', 'name').find('a')
                    current_review['source'] = source.get_text()
                    current_review['link'] = 'http://www.metacritic.com' + source['href']
                else: # There is no link to the user's profile; the name is still inside the 'name' div
                    current_review['source'] = review.find('div', 'name').get_text()
                    current_review['link'] = ''
                # Date of the review
                current_review['date'] = review.find('div', 'date').get_text()
                # Review grade and score type
                current_review['grade'] = int(review.find('div', 'metascore_w').get_text())
                if current_review['grade'] < 5:
                    current_review['scoreType'] = 'Negative'
                elif current_review['grade'] > 7:
                    current_review['scoreType'] = 'Positive'
                else:
                    current_review['scoreType'] = 'Mixed'
                # Review body
                exp = review.find('span', class_='blurb blurb_expanded')
                if exp is None:
                    if review.find('div', class_='review_body').find('span') is None: # Review with no text
                        current_review['text'] = ''
                    else:
                        current_review['text'] = review.find('div', class_='review_body').find('span').get_text()
                else: # Large review that has an expand option
                    current_review['text'] = exp.get_text()
                # Thumb count
                current_review['upThumbs'] = int(review.find('span', 'total_ups').get_text())
                current_review['totalThumbs'] = int(review.find('span', 'total_thumbs').get_text())
                if current_review['totalThumbs'] == 0:
                    current_review['helpfulness'] = ''
                else:
                    current_review['helpfulness'] = current_review['upThumbs'] / current_review['totalThumbs']
                data_user_reviews.append(current_review)
        else:
            thereAreReviews = False
        i += 1
    return data_user_reviews
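
The pagination logic of `get_reviews_users` boils down to requesting consecutive pages until one comes back without reviews. A minimal offline sketch of that loop follows; `collect_all_pages`, `fake_get_page` and the sample data are hypothetical names introduced for illustration.

```python
def collect_all_pages(get_page):
    """Request pages 0, 1, 2, ... and stop at the first empty one."""
    items = []
    page = 0
    while True:
        batch = get_page(page)
        if not batch:  # an empty page means there is nothing left to read
            break
        items.extend(batch)
        page += 1
    return items


# Fake paginated source standing in for the website
fake_pages = [['review A', 'review B'], ['review C']]


def fake_get_page(page):
    return fake_pages[page] if page < len(fake_pages) else []


print(collect_all_pages(fake_get_page))
```

Note that this stopping rule always costs one extra request, the one that returns the empty page.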

## Retrieve reviews

Once we have defined all the prerequisites, we can begin with the information retrieval.

In [48]:
# Define games and platforms
games = [
    {'title': 'fire-emblem-engage', 'platform': 'switch'},
    {'title': 'hi-fi-rush', 'platform': 'xbox-series-x'},
    {'title': 'forspoken', 'platform': 'playstation-5'},
    {'title': 'hi-fi-rush', 'platform': 'pc'},
    {'title': 'forspoken', 'platform': 'pc'}
]
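
Each entry combines with a review category to form the address that the functions above request. The sketch below only rebuilds that URL pattern; the helper name `review_url` is hypothetical.

```python
def review_url(title, platform, category):
    """Build the review page URL following the pattern used in this notebook."""
    return f'http://www.metacritic.com/game/{platform}/{title}/{category}-reviews'


example = {'title': 'fire-emblem-engage', 'platform': 'switch'}
for category in ('critic', 'user'):
    print(review_url(example['title'], example['platform'], category))
```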

### Retrieve and store overviews

We will pull the review overviews for both professional critics and users.

In [57]:
for game in games:
    # Overview of users' and critics' reviews
    overviews = []
    overviews.append(get_reviews_overview(game['title'], game['platform'], 'user'))
    overviews.append(get_reviews_overview(game['title'], game['platform'], 'critic'))
    # Convert to df and store in csv
    df_overviews = pd.DataFrame(overviews)
    path = f'data/overviews/{game["title"]}_{game["platform"]}.csv'
    df_overviews.to_csv(path, index=False)
    print(f'Retrieved data of {game["title"]} for {game["platform"]}')

Retrieved data of fire-emblem-engage for switch
Retrieved data of hi-fi-rush for xbox-series-x
Retrieved data of forspoken for playstation-5
Retrieved data of hi-fi-rush for pc
Retrieved data of forspoken for pc
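
The loop above assumes that the folder `data/overviews` already exists, since writing a CSV into a missing directory fails. A possible way to create the folder on demand is sketched below with the standard library only (`pathlib` and `csv` rather than pandas, so that it runs without dependencies). The helper name `save_rows` is hypothetical, and the sketch assumes that every row has the same keys.

```python
import csv
import tempfile
from pathlib import Path


def save_rows(rows, path):
    """Write a list of dicts to a CSV file, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


rows = [
    {'category': 'user', 'medium_score': 6.6},
    {'category': 'critic', 'medium_score': 80},
]

# Write into a temporary folder so the demo leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    saved = save_rows(rows, Path(tmp) / 'data' / 'overviews' / 'example.csv')
    content = saved.read_text(encoding='utf-8')

print(content)
```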


We can check that the retrieval worked correctly by loading one of the CSV files.

In [2]:
# Import the data
path = f'data/overviews/fire-emblem-engage_switch.csv'
importedData = pd.read_csv(path)
importedData

Unnamed: 0,category,medium_score,positive,mixed,negative,scored_reviews
0,user,6.6,394,111,193,698
1,critic,80.0,97,26,0,123


### Retrieve and store critic reviews

In the next step, we will get all the critics' reviews, taking into account that some of them do not include a numerical grade.

In [58]:
for game in games:
    # Scored and unscored reviews
    data_scored_reviews, data_unscored_reviews = get_reviews_critics(game['title'], game['platform'])
    # Convert to df and store in csv
    df_data_scored_reviews = pd.DataFrame(data_scored_reviews)
    df_data_unscored_reviews = pd.DataFrame(data_unscored_reviews)
    path = f'data/criticReviews/scored/{game["title"]}_{game["platform"]}.csv'
    df_data_scored_reviews.to_csv(path, index=False)
    path = f'data/criticReviews/unscored/{game["title"]}_{game["platform"]}.csv'
    df_data_unscored_reviews.to_csv(path, index=False)
    print(f'Retrieved data of {game["title"]} for {game["platform"]}')

Retrieved data of fire-emblem-engage for switch
Retrieved data of hi-fi-rush for xbox-series-x
Retrieved data of forspoken for playstation-5
Retrieved data of hi-fi-rush for pc
Retrieved data of forspoken for pc


We can check that the retrieval worked correctly by loading one of the CSV files (for both the scored and the unscored case).

In [2]:
# Import the data (scored)
path = f'data/criticReviews/scored/fire-emblem-engage_switch.csv'
importedData = pd.read_csv(path)
importedData

Unnamed: 0,source,link,date,grade,scoreType,text
0,Dexerto,https://www.dexerto.com/gaming/fire-emblem-eng...,"Jan 17, 2023",100,Positive,\n Fire Emblem ...
1,Digitally Downloaded,https://www.digitallydownloaded.net/2023/01/re...,"Jan 17, 2023",100,Positive,\n The only thi...
2,Siliconera,https://www.siliconera.com/review-fire-emblem-...,"Jan 17, 2023",100,Positive,\n Fire Emblem ...
3,Pure Nintendo,https://purenintendo.com/review-fire-emblem-en...,"Feb 14, 2023",95,Positive,\n Fire Emblem:...
4,The Mako Reactor,https://themakoreactor.com/reviews/fire-emblem...,"Feb 8, 2023",95,Positive,\n Fire Emblem ...
...,...,...,...,...,...,...
118,Inverse,https://www.inverse.com/gaming/fire-emblem-eng...,"Jan 17, 2023",60,Mixed,\n Fire Emblem’...
119,App Trigger,https://apptrigger.com/2023/01/26/fire-emblem-...,"Jan 27, 2023",55,Mixed,\n Fire Emblem:...
120,DualShockers,https://www.dualshockers.com/fire-emblem-engag...,"Jan 25, 2023",55,Mixed,\n Fire Emblem ...
121,Slant Magazine,https://www.slantmagazine.com/games/fire-emble...,"Jan 29, 2023",50,Mixed,\n What makes F...


In [4]:
# Import the data (unscored)
path = f'data/criticReviews/unscored/fire-emblem-engage_switch.csv'
importedData = pd.read_csv(path)
importedData

Unnamed: 0,source,link,date,text
0,3DJuegos,https://www.3djuegos.com/juegos/fire-emblem-en...,"Jan 20, 2023",\n Although the...
1,Eurogamer,https://www.eurogamer.net/fire-emblem-engage-r...,"Jan 17, 2023",\n Nintendo's l...
2,Kotaku,https://kotaku.com/fire-emblem-engage-review-n...,"Feb 1, 2023",\n I’ve always ...
3,Polygon,https://www.polygon.com/reviews/23554012/fire-...,"Jan 17, 2023","\n Engage, even..."
4,The Verge,https://www.theverge.com/23557639/fire-emblem-...,"Jan 17, 2023",\n There’s some...
5,Vice,https://www.vice.com/en/article/epz3q4/fire-em...,"Jan 17, 2023",\n Fire Emblem ...


### Retrieve and store user reviews

Finally, we will retrieve all the user reviews of each game. However, some of these reviews may not be written in English, so we will need to discard those later.

In [66]:
for game in games:
    # User reviews
    data_user_reviews = get_reviews_users(game['title'], game['platform'])
    # Convert to df and store in csv
    df_data_user_reviews = pd.DataFrame(data_user_reviews)
    path = f'data/userReviews/{game["title"]}_{game["platform"]}.csv'
    df_data_user_reviews.to_csv(path, index=False)
    print(f'Retrieved data of {game["title"]} for {game["platform"]}')

Retrieved data of fire-emblem-engage for switch
Retrieved data of hi-fi-rush for xbox-series-x
Retrieved data of forspoken for playstation-5
Retrieved data of hi-fi-rush for pc
Retrieved data of forspoken for pc


We can check that the retrieval worked correctly by loading one of the CSV files.

In [4]:
# Import the data
path = f'data/userReviews/fire-emblem-engage_switch.csv'
importedData = pd.read_csv(path)
importedData

Unnamed: 0,source,link,date,grade,scoreType,text,upThumbs,totalThumbs,helpfulness
0,NagisaNeko,http://www.metacritic.com/user/NagisaNeko,"Jan 24, 2023",7,Mixed,I admit that this work is better than the prev...,19,19,1.000000
1,Belonski,http://www.metacritic.com/user/Belonski,"Jan 25, 2023",6,Mixed,One of the best Fire Emblems in technical aspe...,17,17,1.000000
2,avantic00,http://www.metacritic.com/user/avantic00,"Feb 6, 2023",0,Negative,"Combat was great and all, but the amount of cr...",16,17,0.941176
3,Faetori,http://www.metacritic.com/user/Faetori,"Jan 25, 2023",5,Mixed,The gameplay itself and the technical portion ...,16,18,0.888889
4,SomethingSom,http://www.metacritic.com/user/SomethingSom,"Jan 26, 2023",5,Mixed,Gameplay is fun. The story and characters are ...,15,17,0.882353
...,...,...,...,...,...,...,...,...,...
379,Aurok11,http://www.metacritic.com/user/Aurok11,"Mar 1, 2023",10,Positive,"Honestly, after reading user reviews of FE Eng...",0,3,0.000000
380,hyper06,http://www.metacritic.com/user/hyper06,"Mar 5, 2023",9,Positive,Fire Emblem Engage is an amazing game and the ...,0,4,0.000000
381,PMG-Writer,http://www.metacritic.com/user/PMG-Writer,"Mar 18, 2023",9,Positive,Nintendo and the developers from Intelligent S...,0,0,
382,shw079,http://www.metacritic.com/user/shw079,"Mar 20, 2023",8,Positive,This is my first fire emblem game and it is ov...,0,0,


We see that the number of retrieved reviews is smaller than the number of reviews reported in the overview. This is because Metacritic lets users give a numerical rating without writing a review.
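
As a rough worked example with the numbers shown above for Fire Emblem Engage on Switch, the overview reports 698 scored user reviews, while the user review table ends at index 382, that is, 383 rows. The helper name `ratings_without_row` is hypothetical, and the result is only an upper bound on the ratings given without text, since other causes for the gap cannot be ruled out from these numbers alone.

```python
def ratings_without_row(overview_count, retrieved_rows):
    """Difference between the ratings counted in the overview and the rows retrieved."""
    return overview_count - retrieved_rows


scored_reviews = 698   # 'scored_reviews' of the user category in the overview table
retrieved_rows = 383   # the user review table is indexed from 0 to 382
print(ratings_without_row(scored_reviews, retrieved_rows))
```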

## Next notebook

In this notebook, we have retrieved data from a website without using an API. However, not all of this information is useful, and we will need to clean the data before using it. Before cleaning, it is preferable to understand the data well. We already understand the scraped data, since we defined the dataframe structure ourselves from the contents of the pages. The information previously obtained through the RESTful API, however, has not been analyzed as thoroughly, so we will look deeper into it in our next notebook, [data structure](./03DataStructure.ipynb).