# Web Scraping Metacritic for Game Reviews

I'm practicing collecting data through web scraping, followed by some cleaning. Metacritic doesn't have an API, but it is a great source of compiled critic ratings and user ratings.

### Scraping game ratings from Metacritic.com

I will be scraping game ratings from Metacritic.com using BeautifulSoup.

In [133]:
import requests
import bs4
from bs4 import BeautifulSoup
import urllib
from urllib.request import urlopen # changed python 2 code: from urllib import urlopen
import pandas as pd
import matplotlib.pyplot as plt # the plotting interface lives in matplotlib.pyplot, not the top-level package
import re
%matplotlib inline

I was getting the following error:
    "HTTPError: HTTP Error 403: Forbidden"
when I was using this code:
```python
platform = "all"; # "all", "ps4", "xboxone", "pc", "ps3", "wii-u", etc.
year = 2016;
maxPage = 4; # The first page is page "0" in the URL but page 1 on the website UI; enter this number using the UI convention
whitespaceBlock = "                                            "

queries = [];
currentPage = 0
for i in range (currentPage, maxPage):
    URL = "http://www.metacritic.com/browse/games/score/metascore/year/all/filtered?year_selected=" + str(year) + "&sort=desc&page=" + str(i) 
    print(URL)
    html = urllib.request.urlopen(URL).read() # urllib.urlopen updated
    soup = BeautifulSoup(html, "html.parser")
    print(soup)
```
            
It looks like Metacritic blocks requests that don't come from a browser.

So I added the code below, which sends a browser-like User-Agent header to make the site think I'm using a browser:
```python
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers={'User-Agent':user_agent,} 
request=urllib.request.Request(URL,None,headers) 
response = urllib.request.urlopen(request)
data = response.read() 
```

This forum thread helped me get there: https://stackoverflow.com/questions/13303449/urllib2-httperror-http-error-403-forbidden
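
Since the same header setup is needed for every page, it could be wrapped in a small helper. The sketch below is one possible way to do that, not part of the original notebook: `build_request` and `fetch_html` are hypothetical names, and the 30-second timeout is an arbitrary assumption.

```python
import urllib.error
import urllib.request

# Same browser-like User-Agent string used in the snippet above.
USER_AGENT = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) '
              'Gecko/2009021910 Firefox/3.0.7')


def build_request(url):
    """Return a Request carrying a browser-like User-Agent header (hypothetical helper)."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


def fetch_html(url, timeout=30):
    """Fetch a page and return its raw bytes, or None on an HTTP error (hypothetical helper)."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # A 403 here would suggest the User-Agent is still being rejected.
        print("Request failed with HTTP status", err.code)
        return None
```

Calling `fetch_html(URL)` inside the page loop would then replace the four request lines, and a `None` result signals a page that should be skipped or retried.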

I've constructed the for loop below to go through the first 4 pages of results for games on all platforms from 2016.

Let's take a look at this URL: http://www.metacritic.com/browse/games/score/metascore/year/all/filtered?year_selected=2016&sort=desc&page=1

We can see that year_selected can be changed to the year we care about and that page can be changed to the page we care about. The page parameter only appears in the URL once you leave the first page while browsing, and it is zero-based: the second page is page=1. Fortunately page=0 gives us the first page, so the code doesn't get too messy.
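
The mapping between the page number shown in the site UI and the zero-based page value in the URL can be sketched as below. `BASE_URL` and `build_page_url` are hypothetical names introduced for illustration; the query parameters are the ones visible in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.metacritic.com/browse/games/score/metascore/year/all/filtered"


def build_page_url(year, ui_page):
    """Build the listing URL for a year and a 1-based page number as shown in the site UI."""
    if ui_page < 1:
        raise ValueError("ui_page starts at 1")
    # The URL counts pages from 0, so UI page 1 becomes page=0.
    params = {"year_selected": year, "sort": "desc", "page": ui_page - 1}
    return BASE_URL + "?" + urlencode(params)


print(build_page_url(2016, 2))
```

For the arguments shown, this prints the same URL as the one discussed above (ending in `page=1`).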

In [132]:
platform = "all"; # "all", "ps4", "xboxone", "pc", "ps3", "wii-u", etc.
year = 2016;
maxPage = 4; # The first page is page "0" in the URL but page 1 on the website UI; enter this number using the UI convention

currentPage = 0
for i in range (currentPage, maxPage):
    URL = "http://www.metacritic.com/browse/games/score/metascore/year/all/filtered?year_selected=" + str(year) + "&sort=desc&page=" + str(i) 

    print(URL)

    user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
    headers={'User-Agent':user_agent,} 
    request=urllib.request.Request(URL,None,headers)
    response = urllib.request.urlopen(request)
    data = response.read()

    # print(data)
    soup = BeautifulSoup(data, "html.parser")
    # print(soup)

http://www.metacritic.com/browse/games/score/metascore/year/all/filtered?year_selected=2016&sort=desc&page=0
http://www.metacritic.com/browse/games/score/metascore/year/all/filtered?year_selected=2016&sort=desc&page=1
http://www.metacritic.com/browse/games/score/metascore/year/all/filtered?year_selected=2016&sort=desc&page=2
http://www.metacritic.com/browse/games/score/metascore/year/all/filtered?year_selected=2016&sort=desc&page=3


Let's look at part of the HTML to get a sense of which tags to pull out:

```html
<div class="clr"></div>
</div>
<div class="body_wrap">
<div class="product_condensed">
<div class="product_rows">
<div class="product_row game first">
<div class="product_item row_num">
                101.
            </div>
<div class="product_item product_score">
<div class="metascore_w small game positive">84</div>
</div>
<div class="product_item product_title">
<a href="/game/pc/darkest-dungeon">
                    Darkest Dungeon
                                            (PC)
                                    </a>
</div>
<div class="product_item product_userscore_txt">
<span class="label">User:</span>
<span class="data textscore textscore_favorable">7.8</span>
</div>
<div class="product_item product_date">
                                                                                                                        Jan 19, 2016
            </div>
</div>
<div class="product_row game">
<div class="product_item row_num">
                102.
            </div>
<div class="product_item product_score">
<div class="metascore_w small game positive">84</div>
</div>
<div class="product_item product_title">
<a href="/game/playstation-4/day-of-the-tentacle-remastered">
                    Day of the Tentacle Remastered
                                            (PS4)
                                    </a>
</div>
<div class="product_item product_userscore_txt">
<span class="label">User:</span>
<span class="data textscore textscore_mixed">7.3</span>
</div>
<div class="product_item product_date">
                                                                                                                        Mar 22, 2016
            </div>
</div>
```

In [124]:
# print(soup.find_all('div', attrs = {'class':'product_row game'}))

For each game we now have HTML that follows this pattern:
```html
<div class="product_row game">
<div class="product_item row_num">
                399.
            </div>
<div class="product_item product_score">
<div class="metascore_w small game positive">75</div>
</div>
<div class="product_item product_title">
<a href="/game/xbox-one/batman-the-telltale-series---episode-4-guardian-of-gotham">
                    Batman: The Telltale Series - Episode 4: Guardian of Gotham
                                            (XONE)
                                    </a>
</div>
<div class="product_item product_userscore_txt">
<span class="label">User:</span>
<span class="data textscore textscore_mixed">5.1</span>
</div>
<div class="product_item product_date">
                                                                                                                        Nov 22, 2016
            </div>
</div>
```

That means we can now find each game by using: 
```python
soup.find_all('div', attrs = {'class':'product_row game'})
```
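
One thing worth checking: in the sample HTML shown earlier, the first row on a page carries the class `product_row game first`. BeautifulSoup compares a class string that contains a space against the whole attribute value, so an exact search for `product_row game` may not return that first row. The sketch below uses two shortened, made-up rows modeled on that sample to compare the exact-string search with a CSS selector; using `select` is a suggested alternative, not what the notebook does.

```python
from bs4 import BeautifulSoup

# Two simplified rows modeled on the HTML shown earlier (contents shortened).
sample_html = """
<div class="product_row game first"><div class="product_item row_num">101.</div></div>
<div class="product_row game"><div class="product_item row_num">102.</div></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# A class string containing a space is compared against the whole attribute value.
exact_rows = soup.find_all('div', class_='product_row game')

# A CSS selector matches any div that has both classes, whatever else it carries.
all_rows = soup.select('div.product_row.game')

print(len(exact_rows), len(all_rows))
```

If the two counts differ on a real page, iterating over `soup.select('div.product_row.game')` would include the first row as well.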

Debugging the for loop:

In [131]:
for games in soup.find_all('div', class_='product_row game'):
    title_and_platform = games.find('div', class_='product_item product_title').find('a').text    
    title_and_platform = str(title_and_platform).replace("  ", "") # removes any spaces more than one in a row
    title = re.findall('(.*?)\n', title_and_platform)[1] # takes the second line, which is the title
    platform = re.findall('\n\((.*?)\)', title_and_platform)
    platform = str(platform)
    critic_score = games.find('div', class_='product_item product_score').text
    critic_score = str(critic_score).replace("  ", "") # removes any spaces more than one in a row
    critic_score = str(critic_score).replace('\n', "") # removes any new lines
    user_score = games.find('span', attrs = {"class": re.compile("(data textscore .*?)")}).text     
    # user_score = games.find('div', attrs = {"class": "product_item product_userscore_txt"}).find('span', attrs = {"class": re.compile("(data textscore .*?)")}).contents # this works equally well, and is a more careful coding in case "data textscore" was reused somewhere else
    if user_score != 'tbd':
        user_score = int(float(str(user_score))*10) # The critic scores are out of 100, while the user scores are out of 10 - normalize them here
    release_date = games.find('div', class_='product_item product_date').text
    release_date = str(release_date).replace("  ", "") # removes any spaces more than one in a row
    #df = df.append({"title":title, "platform":platform, "critic_score": critic_score, "user_score": user_score, "release_date": release_date}, ignore_index=True) 
    # print(release_date) # working
    # print(user_score) # working
    # print(title_and_platform) # working
    # print(title) # working
    # print(platform) # working
    # print(critic_score) # working
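
The `replace("  ", "")` calls above remove spaces two at a time, so a run with an odd number of spaces leaves one space behind. As an alternative sketch using only the standard library, the anchor text can be collapsed with `split()` and then divided into title and platform with a single regular expression. `split_title_and_platform` is a hypothetical helper, and `raw_text` copies the padded anchor text from the sample HTML.

```python
import re

# Anchor text as it appears in the sample HTML above, including the padding.
raw_text = ("\n                    Darkest Dungeon\n"
            "                                            (PC)\n"
            "                                    ")


def split_title_and_platform(text):
    """Split anchor text into (title, platform); platform is None if no trailing '(...)' is found."""
    collapsed = " ".join(text.split())  # collapse every run of whitespace into one space
    match = re.fullmatch(r"(.*?)\s*\(([^()]*)\)", collapsed)
    if match is None:
        return collapsed, None
    return match.group(1), match.group(2)


print(split_title_and_platform(raw_text))
```

Unlike `re.findall`, which returns a list such as `['PC']`, this returns the platform as a plain string.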

Alright, let's put everything together into a for loop that will go through the first 4 pages and give us the title, platform, critic score, user score, and release date for each game:

In [130]:
rows = [] # one dict per game; DataFrame.append was removed in pandas 2.0, so collect the rows first

for i in range(currentPage, maxPage):
    URL = "http://www.metacritic.com/browse/games/score/metascore/year/all/filtered?year_selected=" + str(year) + "&sort=desc&page=" + str(i) 
    user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
    headers={'User-Agent':user_agent,} 
    request=urllib.request.Request(URL,None,headers)
    response = urllib.request.urlopen(request)
    data = response.read()
    soup = BeautifulSoup(data, "html.parser")

    for games in soup.find_all('div', class_='product_row game'):
        title_and_platform = games.find('div', class_='product_item product_title').find('a').text    
        title_and_platform = str(title_and_platform).replace("  ", "") # removes the padding spaces, two at a time
        title = re.findall(r'(.*?)\n', title_and_platform)[1] # takes the second line, which is the title
        platform = re.findall(r'\n\((.*?)\)', title_and_platform) # a list such as ['PC']
        critic_score = games.find('div', class_='product_item product_score').text
        critic_score = str(critic_score).replace("  ", "") # removes the padding spaces, two at a time
        critic_score = str(critic_score).replace('\n', "") # removes any new lines
        user_score = games.find('span', attrs = {"class": re.compile("(data textscore .*?)")}).text     
        # user_score = games.find('div', attrs = {"class": "product_item product_userscore_txt"}).find('span', attrs = {"class": re.compile("(data textscore .*?)")}).contents # this works equally well, and is a more careful coding in case "data textscore" was reused somewhere else
        if user_score != 'tbd':
            user_score = int(float(str(user_score))*10) # The critic scores are out of 100, while the user scores are out of 10 - normalize them here
        release_date = games.find('div', class_='product_item product_date').text
        release_date = str(release_date).replace("  ", "") # removes the padding spaces, two at a time
        release_date = str(release_date).replace('\n', "") # removes any new lines
        rows.append({"title":title, "platform":platform, "critic_score": critic_score, "user_score": user_score, "release_date": release_date})

# build the DataFrame once, after all pages have been scraped
df = pd.DataFrame(rows, columns=["title", "platform", "critic_score", "user_score", "release_date"])
df.head()

|   | title | platform | critic_score | user_score | release_date |
|---|---|---|---|---|---|
| 0 | INSIDE | [XONE] | 93 | 82 | Jun 29, 2016 |
| 1 | Out of the Park Baseball 17 | [PC] | 92 | 34 | Mar 22, 2016 |
| 2 | The Witcher 3: Wild Hunt - Blood and Wine | [PC] | 92 | 92 | May 30, 2016 |
| 3 | Overwatch | [PC] | 91 | 68 | May 23, 2016 |
| 4 | Overwatch | [XONE] | 91 | 57 | May 23, 2016 |
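
In the output above, the platform is still a one-element list, the critic score is still a string, and the release date is plain text. A possible cleaning step is sketched below using only the standard library. `clean_row` is a hypothetical helper, the example row is taken from the first line of the output, and parsing month abbreviations with `%b` assumes an English locale.

```python
from datetime import datetime


def clean_row(row):
    """Return a copy of a scraped row with typed values (hypothetical helper)."""
    cleaned = dict(row)
    # platform was stored as a one-element list such as ['XONE']
    platform = row["platform"]
    cleaned["platform"] = platform[0] if platform else None
    # int() tolerates leftover surrounding whitespace in the scraped string
    cleaned["critic_score"] = int(row["critic_score"])
    # user_score becomes None when the site reports 'tbd'
    cleaned["user_score"] = None if row["user_score"] == "tbd" else int(row["user_score"])
    # assumes English month abbreviations, e.g. "Jun 29, 2016"
    cleaned["release_date"] = datetime.strptime(row["release_date"].strip(), "%b %d, %Y").date()
    return cleaned


example = {"title": "INSIDE", "platform": ["XONE"], "critic_score": "93",
           "user_score": 82, "release_date": "Jun 29, 2016"}
print(clean_row(example))
```

Applying this to every scraped row before building the DataFrame would give numeric and date columns that are easier to sort, filter, and plot.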


In [135]:
# save as csv
# df.to_csv("~/Desktop/Metacritic-10-29-2017.csv" , sep=',', encoding='utf-8')

Future directions:
---
I'll look at:

1) patterns across critic score and platform

2) how well correlated critic score and user score are

3) ways to visualize these patterns
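
As a rough sketch of the second item, the Pearson correlation between critic and user scores can be computed with the standard library alone. The five pairs below are the rows shown in the `df.head()` output, which is far too small a sample to draw conclusions from; `pearson` is a hypothetical helper, and rows with a `tbd` user score would need to be dropped first.

```python
from math import sqrt

# The five (critic_score, user_score) pairs from the df.head() output above.
pairs = [(93, 82), (92, 34), (92, 92), (91, 68), (91, 57)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


critic, user = zip(*pairs)
print(pearson(critic, user))
```

Once both columns are numeric, pandas should give the same figure with `df["critic_score"].corr(df["user_score"])`.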

Crediting sources/inspiration:
---
I modeled parts of my code on the two sources below, but most of the code here is my own.

As a starting point for this, I used the notebook from here: http://nbviewer.jupyter.org/github/rowandl/portfolio/blob/master/Webscraping%20Indeed.com/Webscraping%20Indeed.ipynb

and I had a look at the code from this github:
https://github.com/ogienma/game-review-analysis/blob/master/review-scraper.js