# Web Scraping
---

## We need to collect user reviews for our data from [Metactric Games](https://www.metacritic.com/game), wiki link [here](https://en.wikipedia.org/wiki/Metacritic), specifically all the user reviews for Metacritic's Top Games. We want to focus on the most recent consoles to get a better idea of what games are popular with the more recent generation of consoles. We will be focusing on PS4, Xbox One, Switch, PC, Xbox Series X, and PS5 games.  Ex. [Metactitic Best PS4 Games of All Time](https://www.metacritic.com/browse/games/score/metascore/all/ps4/filtered?sort=desc&view=detailed) 

In [234]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep

### To avoid the 301 Moved Permanently status we need to set a different user agent. This will give us a 200 status code which means we are good to continue web scraping.

In [160]:
session = requests.Session()
session.headers['User-Agent'] = \
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)\
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'

url = 'https://www.metacritic.com/browse/games/score/metascore/all/\
ps4/filtered?sort=desc&view=detailed'
response = session.get(url)
response

<Response [200]>

### Now let's begin webscraping the top 100 page for each console. We will be scraping each game title, summary, details, and user reviews. Then we will be adding it into a list of dictionaries to then later create a dataframe to work with.

In [None]:
# grab text from top 100 page
html = response.text

# create BS instance
soup = BeautifulSoup(html, 'lxml')

# create main list to store dictionaries
video_games = []

# iterate over each table (site split up into 4 uneven tables)
for table in soup.find_all('table', {'class': 'clamp-list'}):
    
    # in each table search for the game
    for game in table.find_all('a', {'class':'title'}):
        
        # create a dictionary to store game info
        video_game = {}
        
        # add console to dict
        video_game['console'] = 'ps4'
        
        # grab url for game
        url = 'https://www.metacritic.com/' + game.get('href')

        # add game name to dict
        video_game['video_game_name'] = game.text
        
        # establish connection to the link that has the game info and create BS instance
        session2 = requests.Session()
        sleep(.5)
        session2.headers['User-Agent'] = \
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)\
        AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
        response2 = session2.get(url)
        html2 = response2.text
        soup2 = BeautifulSoup(html2, 'lxml')
        
        # add game summary to dict
        try:
            video_game['summary'] = soup2.find('div', 
                                    {'class': 'module product_data product_data_summary'})\
                                    .find('span', {'class':'blurb blurb_expanded'}).text
        except:
            video_game['summary'] = soup2.find('div', 
                                    {'class': 'module product_data product_data_summary'})\
                                    .find('span', {'class':'data'}).text
        else:
            pass
        
        # add developer to dict
        try:
            video_game['developer'] = soup2.find('a', {'class': 'button'}).text
        except:
            pass
        
        # add genre to dict
        try:
            video_game['genre(s)'] = ' '.join(soup2.find('li', 
                                         {'class': 'summary_detail product_genre'}).text.split())
        except:
            pass
        
        # add number of players to dict
        try:
            video_game['num_players'] = ' '.join(soup2.find('li', 
                                         {'class': 'summary_detail product_players'}).text.split())
        except:
            pass
        
        # add esrb game rating to dict
        try:
            video_game['esrb_rating'] = ' '.join(soup2.find('li', 
                                         {'class': 'summary_detail product_rating'}).text.split())
        except:
            pass
        
        # add metacritic critic score
        video_game['esrb_rating'] = soup2.find('span',{'itemprop': 'ratingValue'}).text
        
        # add average user score
        try:
            video_game['avg_user_score'] = soup2.find('div', 
                                    {'class': 'metascore_w user large game positive'}).text
        except AttributeError:
            video_game['avg_user_score'] = soup2.find('div', 
                                    {'class': 'metascore_w user large game mixed'}).text
        except:
            video_game['avg_user_score'] = soup2.find('div', 
                                    {'class': 'metascore_w user large game negative'}).text
        else:
            pass
        
        # establish connection to the user reviews page
        url = url + '/user-reviews'
        session3 = requests.Session()
        sleep(.5)
        session3.headers['User-Agent'] = \
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)\
        AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
        response3 = session3.get(url)
        html3 = response3.text
        soup3 = BeautifulSoup(html3, 'lxml')

        # we want to check if there are more than one page, if not scrape the page we're on
        if soup3.find('li', {'class': 'page last_page'}) == None: 
            print('single page only')

            # get the first review and score
            first_review = soup3.find('li', {'class': 'review user_review first_review'})
            
            # check to see if there is only 1 review on 1 page
            one_review = soup3.find('li', {'class': 'review user_review first_review last_review'})
            if one_review:
                if first_review.find('div', {'class': 'review_body'})\
                .find('span', {'class': 'blurb blurb_expanded'}) == None:

                    # if not grab regular review_body and score
                    video_game['user_review'] = first_review.find('div', {'class': 'review_body'}).text

                    video_game['user_score'] = first_review.find('div', {'class': 'review_grade'}).text
                    break
                # if it is "expanded" grab expanded review and score
                else:
                    video_game['user_review'] = first_review.find('div', {'class': 'review_body'})\
                    .find('span', {'class': 'blurb blurb_expanded'}).text

                    video_game['user_score'] = first_review.find('div', {'class': 'review_grade'}).text
                    break
                # add dict to list
                video_games.append(video_game)

            # check to see if the review is an "expanded" review
            if first_review.find('div', {'class': 'review_body'})\
            .find('span', {'class': 'blurb blurb_expanded'}) == None:

                # if not grab regular review_body and score
                video_game['user_review'] = first_review.find('div', {'class': 'review_body'}).text

                video_game['user_score'] = first_review.find('div', {'class': 'review_grade'}).text

            # if it is "expanded" grab expanded review and score
            else:
                video_game['user_review'] = first_review.find('div', {'class': 'review_body'})\
                .find('span', {'class': 'blurb blurb_expanded'}).text

                video_game['user_score'] = first_review.find('div', {'class': 'review_grade'}).text

            # add dict to list
            video_games.append(video_game)

            # iterate over all the user reviews and grab the score and review
            for element in soup3.find_all('li', {'class': 'review user_review'}):
                video_game_2 = video_game.copy()

                # check to see if the review is an "expanded" review
                if element.find('div', {'class': 'review_body'})\
                .find('span', {'class': 'blurb blurb_expanded'}) == None:

                    # if not grab regular review_body and score
                    video_game_2['user_review'] = element.find('div', {'class': 'review_body'}).text

                    video_game_2['user_score'] = element.find('div', {'class': 'review_grade'}).text

                # if it is "expanded" grab expanded review and score
                else:
                    video_game_2['user_review'] = element.find('div', {'class': 'review_body'})\
                    .find('span', {'class': 'blurb blurb_expanded'}).text

                    video_game_2['user_score'] = element.find('div', {'class': 'review_grade'}).text

                # add dict to list
                video_games.append(video_game_2)

            # create new dict to add to list
            video_game_3 = video_game.copy()

            # get the last review and score
            # check to see if the review is an "expanded" review
            last_review = soup3.find('li', {'class': 'review user_review last_review'})

            # check to see if the review is an "expanded" review
            if last_review.find('div', {'class': 'review_body'})\
            .find('span', {'class': 'blurb blurb_expanded'}) == None:

                # if not grab regular review_body and score
                video_game_3['user_review'] = last_review.find('div', {'class': 'review_body'}).text

                video_game_3['user_score'] = last_review.find('div', {'class': 'review_grade'}).text

            # if it is "expanded" grab expanded review and score
            else:
                video_game_3['user_review'] = last_review.find('div', {'class': 'review_body'})\
                .find('span', {'class': 'blurb blurb_expanded'}).text

                video_game_3['user_score'] = last_review.find('div', {'class': 'review_grade'}).text

            # add dict to list
            video_games.append(video_game_3)
        # if there is more than one page scrape all the pages            
        else:
            # here we are going to iterate through all of the user review pages and scrape the reviews
            for review_page in range(0, int(soup3.find('li', {'class': 'page last_page'})
                             .find('a', {'class': 'page_num'}).text)):
                # check to see if we are on the first page, if so scrape user reviews
                if review_page == 0:
                    # get the first review and score
                    first_review = soup3.find('li', {'class': 'review user_review first_review'})

                    # check to see if the review is an "expanded" review
                    if first_review.find('div', {'class': 'review_body'})\
                    .find('span', {'class': 'blurb blurb_expanded'}) == None:

                        # if not grab regular review_body and score
                        video_game['user_review'] = first_review.find('div', {'class': 'review_body'}).text

                        video_game['user_score'] = first_review.find('div', {'class': 'review_grade'}).text

                    # if it is "expanded" grab expanded review and score
                    else:
                        video_game['user_review'] = first_review.find('div', {'class': 'review_body'})\
                        .find('span', {'class': 'blurb blurb_expanded'}).text

                        video_game['user_score'] = first_review.find('div', {'class': 'review_grade'}).text

                    # add dict to list
                    video_games.append(video_game)

                    # iterate over all the user reviews and grab the score and review
                    for element in soup3.find_all('li', {'class': 'review user_review'}):
                        video_game_2 = video_game.copy()

                        # check to see if the review is an "expanded" review
                        if element.find('div', {'class': 'review_body'})\
                        .find('span', {'class': 'blurb blurb_expanded'}) == None:

                            # if not grab regular review_body and score
                            video_game_2['user_review'] = element.find('div', {'class': 'review_body'}).text

                            video_game_2['user_score'] = element.find('div', {'class': 'review_grade'}).text

                        # if it is "expanded" grab expanded review and score
                        else:
                            video_game_2['user_review'] = element.find('div', {'class': 'review_body'})\
                            .find('span', {'class': 'blurb blurb_expanded'}).text

                            video_game_2['user_score'] = element.find('div', {'class': 'review_grade'}).text

                        # add dict to list
                        video_games.append(video_game_2)

                    # create new dict to add to list
                    video_game_3 = video_game.copy()

                    # get the last review and score
                    # check to see if the review is an "expanded" review
                    last_review = soup3.find('li', {'class': 'review user_review last_review'})

                    # check to see if the review is an "expanded" review
                    if last_review.find('div', {'class': 'review_body'})\
                    .find('span', {'class': 'blurb blurb_expanded'}) == None:

                        # if not grab regular review_body and score
                        video_game_3['user_review'] = last_review.find('div', {'class': 'review_body'}).text

                        video_game_3['user_score'] = last_review.find('div', {'class': 'review_grade'}).text

                    # if it is "expanded" grab expanded review and score
                    else:
                        video_game_3['user_review'] = last_review.find('div', {'class': 'review_body'})\
                        .find('span', {'class': 'blurb blurb_expanded'}).text

                        video_game_3['user_score'] = last_review.find('div', {'class': 'review_grade'}).text

                    # add dict to list
                    video_games.append(video_game_3)

                # if we are not on the first page connect to the next page and scrape reviews
                # we only want the first 10 pages, some games have over 100 pages of reviews
                # if we scraped them all we'd have too much
                elif review_page < 10:
                    params = {'page': review_page}
                    session4 = requests.Session()
                    sleep(.5)
                    session4.headers['User-Agent'] = \
                    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)\
                    AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
                    response4 = session4.get(url, params=params)
                    html4 = response4.text
                    soup4 = BeautifulSoup(html4, 'lxml')

                    # get the first review and score
                    first_review = soup4.find('li', {'class': 'review user_review first_review'})

                    # check to see if the review is an "expanded" review
                    if first_review.find('div', {'class': 'review_body'})\
                    .find('span', {'class': 'blurb blurb_expanded'}) == None:

                        # if not grab regular review_body and score
                        video_game['user_review'] = first_review.find('div', {'class': 'review_body'}).text

                        video_game['user_score'] = first_review.find('div', {'class': 'review_grade'}).text

                    # if it is "expanded" grab expanded review and score
                    else:
                        video_game['user_review'] = first_review.find('div', {'class': 'review_body'})\
                        .find('span', {'class': 'blurb blurb_expanded'}).text

                        video_game['user_score'] = first_review.find('div', {'class': 'review_grade'}).text

                    # add dict to list
                    video_games.append(video_game)

                    # iterate over all the user reviews and grab the score and review
                    for element in soup4.find_all('li', {'class': 'review user_review'}):
                        video_game_2 = video_game.copy()

                        # check to see if the review is an "expanded" review
                        if element.find('div', {'class': 'review_body'})\
                        .find('span', {'class': 'blurb blurb_expanded'}) == None:

                            # if not grab regular review_body and score
                            video_game_2['user_review'] = element.find('div', {'class': 'review_body'}).text

                            video_game_2['user_score'] = element.find('div', {'class': 'review_grade'}).text

                        # if it is "expanded" grab expanded review and score
                        else:
                            video_game_2['user_review'] = element.find('div', {'class': 'review_body'})\
                            .find('span', {'class': 'blurb blurb_expanded'}).text

                            video_game_2['user_score'] = element.find('div', {'class': 'review_grade'}).text

                        # add dict to list
                        video_games.append(video_game_2)

                    # create new dict to add to list
                    video_game_3 = video_game.copy()

                    # get the last review and score
                    # check to see if the review is an "expanded" review
                    last_review = soup4.find('li', {'class': 'review user_review last_review'})

                    # check to see if the review is an "expanded" review
                    if last_review.find('div', {'class': 'review_body'})\
                    .find('span', {'class': 'blurb blurb_expanded'}) == None:

                        # if not grab regular review_body and score
                        video_game_3['user_review'] = last_review.find('div', {'class': 'review_body'}).text

                        video_game_3['user_score'] = last_review.find('div', {'class': 'review_grade'}).text

                    # if it is "expanded" grab expanded review and score
                    else:
                        video_game_3['user_review'] = last_review.find('div', {'class': 'review_body'})\
                        .find('span', {'class': 'blurb blurb_expanded'}).text

                        video_game_3['user_score'] = last_review.find('div', {'class': 'review_grade'}).text

                    # add dict to list
                    video_games.append(video_game_3)
                else:
                    break

In [565]:
df = pd.DataFrame(video_games)
print(df.shape)
df.head()

(18102, 10)


Unnamed: 0,console,video_game_name,summary,developer,genre(s),num_players,esrb_rating,avg_user_score,user_review,user_score
0,ps4,Red Dead Redemption 2,Developed by the creators of Grand Theft Auto ...,Rockstar Games,"Genre(s): Action Adventure, Open-World",# of players: Up to 32,97,8.6,"\nThis site is a joke, this the first time whe...",\n9\n
1,ps4,Red Dead Redemption 2,Developed by the creators of Grand Theft Auto ...,Rockstar Games,"Genre(s): Action Adventure, Open-World",# of players: Up to 32,97,8.6,Fair review of RDR2\r I'm almost 15% finished ...,\n7\n
2,ps4,Red Dead Redemption 2,Developed by the creators of Grand Theft Auto ...,Rockstar Games,"Genre(s): Action Adventure, Open-World",# of players: Up to 32,97,8.6,I really wanted to love it. The over-world is ...,\n6\n
3,ps4,Red Dead Redemption 2,Developed by the creators of Grand Theft Auto ...,Rockstar Games,"Genre(s): Action Adventure, Open-World",# of players: Up to 32,97,8.6,"\nBeautiful graphics, excellent voice acting, ...",\n7\n
4,ps4,Red Dead Redemption 2,Developed by the creators of Grand Theft Auto ...,Rockstar Games,"Genre(s): Action Adventure, Open-World",# of players: Up to 32,97,8.6,This game is really overrated.\rThe amazing en...,\n7\n


In [566]:
pd.set_option('display.max_colwidth', 50)
# df[df.isnull().any(axis=1)]
df

Unnamed: 0,console,video_game_name,summary,developer,genre(s),num_players,esrb_rating,avg_user_score,user_review,user_score
0,ps4,Red Dead Redemption 2,Developed by the creators of Grand Theft Auto ...,Rockstar Games,"Genre(s): Action Adventure, Open-World",# of players: Up to 32,97,8.6,"\nThis site is a joke, this the first time whe...",\n9\n
1,ps4,Red Dead Redemption 2,Developed by the creators of Grand Theft Auto ...,Rockstar Games,"Genre(s): Action Adventure, Open-World",# of players: Up to 32,97,8.6,Fair review of RDR2\r I'm almost 15% finished ...,\n7\n
2,ps4,Red Dead Redemption 2,Developed by the creators of Grand Theft Auto ...,Rockstar Games,"Genre(s): Action Adventure, Open-World",# of players: Up to 32,97,8.6,I really wanted to love it. The over-world is ...,\n6\n
3,ps4,Red Dead Redemption 2,Developed by the creators of Grand Theft Auto ...,Rockstar Games,"Genre(s): Action Adventure, Open-World",# of players: Up to 32,97,8.6,"\nBeautiful graphics, excellent voice acting, ...",\n7\n
4,ps4,Red Dead Redemption 2,Developed by the creators of Grand Theft Auto ...,Rockstar Games,"Genre(s): Action Adventure, Open-World",# of players: Up to 32,97,8.6,This game is really overrated.\rThe amazing en...,\n7\n
...,...,...,...,...,...,...,...,...,...,...
18097,ps4,Apex Legends,"Conquer with character in Apex Legends, a free...",Respawn Entertainment,"Genre(s): Action, Shooter, First-Person, Tactical",# of players: Up to 60,89,6.9,Hands down my favorite Battle Royale game for ...,\n8\n
18098,ps4,Apex Legends,"Conquer with character in Apex Legends, a free...",Respawn Entertainment,"Genre(s): Action, Shooter, First-Person, Tactical",# of players: Up to 60,89,6.9,\nStill a few bugs but improving every season ...,\n10\n
18099,ps4,Apex Legends,"Conquer with character in Apex Legends, a free...",Respawn Entertainment,"Genre(s): Action, Shooter, First-Person, Tactical",# of players: Up to 60,89,6.9,\ni hated battle royal games .. but when i fir...,\n9\n
18100,ps4,Apex Legends,"Conquer with character in Apex Legends, a free...",Respawn Entertainment,"Genre(s): Action, Shooter, First-Person, Tactical",# of players: Up to 60,89,6.9,\nThe best shooter on the market currently. Th...,\n6\n
