# PS4 GAMES DATA EXPLORATION

We will be analyzing PS4 game data collected from three different sites, namely ps-timetracker.com, gamepressure.com, and VGchartz.com.

**Part 1**: `Data Collection`

**Part 2**: `Data Preprocessing`

**Part 3**: `Data Analysis`

This notebook covers **Part 1**, data collection: we scrape three websites to obtain the necessary data and then store it in corresponding JSON files.


# PART 1: Data Collection
<font color='red'>Valid as of March 2021. Files used for this analysis can be found in the GitHub repository.</font>

This part covers the entire data collection process. Because few datasets on PS4 games are available, we opted to scrape three websites to obtain the data needed for analysis: PS-TimeTracker, GamePressure, and VGChartz.

### Import all necessary packages

`requests` - used to request a web page

`AsyncHTMLSession` - used to request dynamic pages that use JavaScript when loading

`pandas` -  an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

`BeautifulSoup` - a Python library for pulling data out of HTML and XML files.

`json` - the Python standard-library module for reading and writing JSON, a lightweight data-interchange format.

In [1]:
import requests
from requests_html import AsyncHTMLSession
import pandas as pd
from bs4 import BeautifulSoup
import json

## 1.1 Web Scraping ps-timetracker.com

PS-Timetracker is a website that lists the top 100 games for each month, ranked primarily by the total number of hours played. The dataset also includes the total player base and the length of an average session. The website covers both single-player and multiplayer-oriented games and is updated monthly.

The first step is to request the page and load it into BeautifulSoup. We start by storing the base URL.

In [None]:
URL="https://ps-timetracker.com/statistic/"

We begin by searching for the required data. We will use January 2021 as the month of interest.

In [None]:
page=requests.get(URL + "2021-01")
soup = BeautifulSoup(page.content, 'html.parser')

The ps-timetracker data is located on a table with the class "table table-striped"

In [None]:
ps4_timetables=soup.find("table", {"class": "table table-striped"})

We will find all the rows and store them into an array

In [None]:
ps4_timetables=ps4_timetables.find_all('tr')

We will then select an arbitrary row (index 3 here) and view its contents so that we can work out how to extract them

In [None]:
info_row=ps4_timetables[3].contents

In [None]:
print(info_row)

Each even index of the row's contents holds a newline character (`\n`). The data we are looking for is located at the odd indexes.

Index

1 = Rank, 3 = Chart Movement, 5 = Game Title, 7 = Hours Played, 9 = Number of Players, 11 = Total Sessions for the month, and 13 = Average time spent per Session

In [None]:
# Extract items of interest
Rank=info_row[1].text.strip()
Chart_movement=info_row[3].text.strip()
Title=info_row[5].text.strip()
Hours_Played=info_row[7].text.strip()
Players=info_row[9].text.strip()
Sessions=info_row[11].text.strip()
Avg_Session=info_row[13].text.strip()

print(Rank + " " + Title + " " + " " + Hours_Played + " " + Players + " " + Sessions + " " + Avg_Session)
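
The odd-index layout can also be written down once as a lookup table. The sketch below is only an illustration: `FIELD_INDEXES` and `row_to_record` are hypothetical names, and the row is simulated with plain strings and made-up values, whereas the real `contents` list holds BeautifulSoup elements whose text is read through `.text`.

```python
# Simulated `contents` of one table row: even positions hold "\n",
# odd positions hold the cell text. All values here are made up.
sample_contents = ["\n", "1", "\n", "+2", "\n", "Example Game", "\n",
                   "1,000", "\n", "200", "\n", "3,000", "\n", "00:20", "\n"]

# Odd index -> field name, following the index list above.
FIELD_INDEXES = {
    1: "Rank",
    3: "Chart_Movement",
    5: "Title",
    7: "Hours_Played",
    9: "Players",
    11: "Sessions",
    13: "Avg_Session",
}


def row_to_record(contents):
    """Pick the odd-indexed cells and label them with their field names."""
    return {name: contents[index].strip() for index, name in FIELD_INDEXES.items()}


record = row_to_record(sample_contents)
print(record)
```

Keeping the mapping in one place means a change in the table layout would only require editing the dictionary.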

We will then create a loop to view each row. The headers are stored at index 0 of the table, so we will start at index 1. Note that the loop stops at `len(ps4_timetables)-1`, so the final row of the table is not read.

In [None]:
info_start = 1
for i in range(info_start, len(ps4_timetables)-1):
    # Extract items of interest
    info_row = ps4_timetables[i].contents
    Rank=info_row[1].text.strip()
    Chart_movement=info_row[3].text.strip()
    Title=info_row[5].text.strip()
    Hours_Played=info_row[7].text.strip()
    Players=info_row[9].text.strip()
    Sessions=info_row[11].text.strip()
    Avg_Session=info_row[13].text.strip()
    print(Rank + " " + Title + " " + " " + Hours_Played + " " + Players + " " + Sessions + " " + Avg_Session)

### Final Code 1.1

We can see that the Top 100 rows were displayed. We will now create the final code, which is a double loop. The outer loop goes through each month's page, whose URL has the format **"https://ps-timetracker.com/statistic/YEAR-MONTH"**. The inner loop is the same as the one above and collects each row of data.

In [None]:
ps4_times_json = []
info_start = 1
Year = "2021"
Months = ["01","02","03"]

for i in range(0, len(Months)):
    page=requests.get(URL + Year + "-" + Months[i])
    soup = BeautifulSoup(page.content, 'html.parser')
    ps4_timetables=soup.find("table", {"class": "table table-striped"})
    ps4_timetables=ps4_timetables.find_all('tr')
    
    for j in range(info_start, len(ps4_timetables)-1):
        # Extract items of interest
        info_row = ps4_timetables[j].contents
        Rank=info_row[1].text.strip()
        Chart_Movement=info_row[3].text.strip()
        Title=info_row[5].text.strip()
        Hours_Played=info_row[7].text.strip()
        Players=info_row[9].text.strip()
        Sessions=info_row[11].text.strip()
        Avg_Session=info_row[13].text.strip()

        ps4_times_json.append({
            "Year": Year,
            "Month": Months[i],
            "Rank": Rank,
            "Chart_Movement": Chart_Movement,
            "Title": Title,
            "Hours_Played": Hours_Played,
            "Players": Players,
            "Sessions": Sessions,
            "Avg_Session": Avg_Session
        })
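
Because the month pages follow a fixed YEAR-MONTH pattern, the list of pages to request can be built before any request is made. The following is a minimal sketch; `month_urls` is a hypothetical helper that is not part of the original notebook, and it performs no network access.

```python
BASE_URL = "https://ps-timetracker.com/statistic/"


def month_urls(year, months, base_url=BASE_URL):
    """Build one statistics URL per month in YEAR-MONTH form."""
    # {month:02d} pads single-digit months with a leading zero, e.g. 1 -> "01".
    return [f"{base_url}{year}-{month:02d}" for month in months]


urls = month_urls(2021, range(1, 4))
for url in urls:
    print(url)
```

Passing months as integers avoids typing the zero-padded strings by hand when the range of months grows.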

We will save the collected data to a JSON file

In [None]:
with open('ps4-times.json', 'w') as json_file:
    json.dump(ps4_times_json, json_file)
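
One way to confirm that a saved file is complete is to read it back and compare it with the list in memory. The sketch below assumes a made-up record and writes to a temporary directory, so it does not touch the real `ps4-times.json`; the file name `example-times.json` is hypothetical.

```python
import json
import os
import tempfile

# A single made-up record with the same kind of keys as the scraped rows.
records = [{"Year": "2021", "Month": "01", "Rank": "1", "Title": "Example Game"}]

# Write to a temporary directory so existing output files are left alone.
path = os.path.join(tempfile.mkdtemp(), "example-times.json")
with open(path, "w", encoding="utf-8") as json_file:
    # indent=2 makes the file readable; ensure_ascii=False keeps accented titles as-is.
    json.dump(records, json_file, ensure_ascii=False, indent=2)

# Read the file back and check that nothing was lost in the round trip.
with open(path, encoding="utf-8") as json_file:
    loaded = json.load(json_file)

print(loaded == records)
```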

Viewing the data using a pandas dataframe for verification

In [None]:
ps4_times_df = pd.json_normalize(ps4_times_json)
ps4_times_df

## 1.2 Web Scraping gamepressure.com

Gamepressure is a gaming news and media platform offering guides and walkthroughs on various consoles and gaming platforms. The games on the PS4 platform were scraped from this website. Aside from the title of the game, tags that define the game, its developer, publisher, mode, release date, game description, and ratings were collected.

### 1.2.1 LINK COLLECTION

The first step is to request the page and load it onto BeautifulSoup

In [None]:
URL="https://www.gamepressure.com/games/"

The game data is stored on different pages, so we will first go to the PS4 game list page to collect all the links

In [None]:
page=requests.get(URL + "ps4" + "/")
soup = BeautifulSoup(page.content, 'html.parser')

The links are stored in a div with class "lista lista-gry"

In [None]:
ps4_game_links=soup.find("div", {"class": "lista lista-gry"}) 

We will extract each link, which is stored in an anchor with class "pic-c", and place them in an array

In [None]:
ps4_game_links=ps4_game_links.find_all("a", {"class": "pic-c"},href=True) 

A for loop to verify if all the links were collected

In [None]:
for i in range (0, len(ps4_game_links)):
    print(ps4_game_links[i]['href'])

### Final Code 1.2.1 Link Collection

The final code to collect all links is a double for loop. The outer loop iterates through each list page (pages 1 to 20) and the inner loop stores all the links.

In [None]:
start_page = 1
end_page = 21
ps4_links= []

for i in range (start_page, end_page):
    page=requests.get(URL + "ps4" + "/-" + str(i))
    soup = BeautifulSoup(page.content, 'html.parser')
    ps4_game_links=soup.find("div", {"class": "lista lista-gry"}) 
    ps4_game_links=ps4_game_links.find_all("a", {"class": "pic-c"},href=True) 
    
    for j in range (0, len(ps4_game_links)):
        ps4_links.append(ps4_game_links[j]['href'])

In [None]:
ps4_links[3]
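
The notebook does not say whether the list pages ever repeat a game. If they do, the same page would be scraped twice later on. A minimal sketch of an order-preserving cleanup, using made-up paths and a hypothetical helper named `unique_links`:

```python
def unique_links(links):
    """Drop repeated links while keeping the first-seen order."""
    # dict keys are unique and remember insertion order.
    return list(dict.fromkeys(links))


# Made-up paths standing in for scraped href values.
sample_links = [
    "/games/example-a/z00001",
    "/games/example-b/z00002",
    "/games/example-a/z00001",
]
deduplicated = unique_links(sample_links)
print(len(sample_links), len(deduplicated))
```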

### 1.2.2 GAME DATA

We will now select the first link to look for the game data. We will use an AsyncHTMLSession because the ratings portion requires JavaScript

In [None]:
URL="https://www.gamepressure.com"

The amount of time it took for pages to render wasn't tracked, so the timeout was set to 0. However, this code may be modified to use a specific timeout, so that a failed page connection does not freeze the code.

In [None]:
from requests_html import AsyncHTMLSession
session = AsyncHTMLSession()
r = await session.get(URL + ps4_links[0])
await r.html.arender(timeout=0) 
print(r.html.raw_html) 
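
One possible way to keep a stalled page from freezing the run is to wrap any awaited call in `asyncio.wait_for`. This is only a sketch under stated assumptions: `run_with_timeout` and `slow_page` are hypothetical names, and `slow_page` merely sleeps to imitate a slow render instead of contacting a website.

```python
import asyncio


async def run_with_timeout(coroutine, seconds):
    """Await `coroutine`, returning None if it takes longer than `seconds`."""
    try:
        return await asyncio.wait_for(coroutine, timeout=seconds)
    except asyncio.TimeoutError:
        return None


async def slow_page(delay):
    """Stand-in for a slow page render; not a real network call."""
    await asyncio.sleep(delay)
    return "rendered"


async def demo():
    fast = await run_with_timeout(slow_page(0.01), seconds=1.0)
    slow = await run_with_timeout(slow_page(1.0), seconds=0.05)
    return fast, slow


# In a notebook cell, use `await demo()` instead of asyncio.run().
fast_result, slow_result = asyncio.run(demo())
print(fast_result, slow_result)
```

A page that returns `None` can then be logged and skipped, so that one unresponsive page does not hold up the whole scrape.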

In [None]:
soup = BeautifulSoup(r.html.raw_html, 'html.parser')

The game data is stored in a section with class "article-left"

In [None]:
game_info=soup.find("section", {"class": "article-left"}) 

The game title is stored in a span with id "game-title-cnt"

In [None]:
game_title = game_info.find("span", {"id": "game-title-cnt"})
print(game_title.text.strip())

The tags are stored in anchors within a paragraph with class "tagi"

In [None]:
tags_list = []
tags = game_info.find("p", {"class": "tagi"})
tags = tags.find_all("a")
for i in range(0, len(tags)):
    tags_list.append(tags[i].text.strip())

tags_list

The game developer name is stored in a div with id "game-developer-cnt"

In [None]:
game_developer = game_info.find("div", {"id": "game-developer-cnt"})
print("developer: " + game_developer.text.strip().replace("developer: ", ""))

The game publisher name is stored in a div with id "game-publisher-cnt"

In [None]:
game_publisher = game_info.find("div", {"id": "game-publisher-cnt"})
print("publisher: " + game_publisher.text.strip().replace("publisher: ", ""))

The game mode is stored in a div with class "S016-game-info"

In [None]:
game_mode = game_info.find("div", {"class": "S016-game-info"})
game_mode = game_mode.find_all("p")
print("Game mode: " + game_mode[3].text.strip().replace("Game mode: ", ""))

To get the release date, we first find the div with id "game-dates-cnt" and then the paragraph with class "p2" inside it. Within this paragraph, three spans with classes "s1", "s2", and "s3" contain the day, month, and year, respectively.

In [None]:
release_date = game_info.find("div", {"id": "game-dates-cnt"})
release_date = release_date.find("p", {"class": "p2"})
release_date_day = release_date.find("span", {"class": "s1"})
release_date_month = release_date.find("span", {"class": "s2"}) 
release_date_year = release_date.find("span", {"class": "s3"})

print("Release date: " + release_date_day.text.strip() + " " + release_date_month.text.strip() + " " + release_date_year.text.strip())
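
Since the day, month, and year arrive as three separate strings, and any of them may be missing, they can be combined into a single date in a later step. The sketch below assumes the month is an English month name or its three-letter abbreviation, which has not been verified against the site; `combine_release_date` is a hypothetical helper.

```python
from datetime import date

MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]


def combine_release_date(day, month, year):
    """Return a date from three scraped strings, or None if any part is unusable."""
    if not (day and month and year):
        return None
    # Match on the first three letters so "Nov" and "November" both work.
    prefix = month.strip()[:3].lower()
    for number, name in enumerate(MONTH_NAMES, start=1):
        if name[:3].lower() == prefix:
            try:
                return date(int(year), number, int(day))
            except ValueError:
                # Non-numeric day/year or an impossible date such as 31 February.
                return None
    return None


full = combine_release_date("15", "November", "2013")
partial = combine_release_date(None, "November", "2013")
print(full, partial)
```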

The game description content varies per page. We will just collect all the content and clean it later

In [None]:
game_description = game_info.find("div", {"id": "game-description-cnt"})
game_description = game_description.text.strip()
game_description

The game rating is stored in spans within a div with id "game-misc-cnt". The first span (index 0) holds the current score and the second span (index 1) holds the expected score. Each number is stored in the first `<b>` element (index 0) within its span.

In [None]:
game_rating = game_info.find("div", {"id": "game-misc-cnt"})
game_rating = game_rating.find_all("span")

current = game_rating[0].find_all("b")[0].text.strip()
expected = game_rating[1].find_all("b")[0].text.strip()

print("Current Score: " + current)
print("Expected Score: " + expected)

We will close the async session that we created

In [None]:
await session.close()

### Final Code 1.2.2 Game Data

The main loop is a single loop that combines all the code fragments above. It iterates over all the links, collects the data, and saves it to a JSON file after collection. The collection took 3-4 hours; the page requests have no error checking or timeout, which may have contributed to this.

In [None]:
ps4_games_json = []
URL="https://www.gamepressure.com"

for i in range (0, len(ps4_links)):
    session = AsyncHTMLSession()
    r = await session.get(URL + ps4_links[i])
    await r.html.arender(timeout=0) 
    soup = BeautifulSoup(r.html.raw_html, 'html.parser')
    
    game_info = soup.find("section", {"class": "article-left"}) 
    
    if game_info is not None:
        game_title = game_info.find("span", {"id": "game-title-cnt"})
        if game_title is not None:
            game_title = game_title.text.strip()

        tags_list = []
        tags = game_info.find("p", {"class": "tagi"})
        # Check for a missing tag paragraph before searching inside it
        if tags is not None:
            for tag in tags.find_all("a"):
                tags_list.append(tag.text.strip())

        game_developer = game_info.find("div", {"id": "game-developer-cnt"})
        if game_developer is not None:
            game_developer = game_developer.text.strip().replace("developer: ", "")
 
        game_publisher = game_info.find("div", {"id": "game-publisher-cnt"})
        if game_publisher is not None:
            game_publisher = game_publisher.text.strip().replace("publisher: ", "")

        game_mode = game_info.find("div", {"class": "S016-game-info"})
        game_mode = game_mode.find_all("p")
        if(len(game_mode) >= 4):
            game_mode = game_mode[3].text.strip().replace("Game mode: ", "")
        else:
            game_mode = None

        release_date = game_info.find("div", {"id": "game-dates-cnt"})
        release_date = release_date.find("p", {"class": "p2"})
        release_date_day = None
        release_date_month = None
        release_date_year = None
        if release_date.find("span", {"class": "s1"}) is not None:
            release_date_day = release_date.find("span", {"class": "s1"}).text.strip()
        if release_date.find("span", {"class": "s2"}) is not None:
            release_date_month = release_date.find("span", {"class": "s2"}).text.strip()
        if release_date.find("span", {"class": "s3"}) is not None:
            release_date_year = release_date.find("span", {"class": "s3"}).text.strip()

        game_description = game_info.find("div", {"id": "game-description-cnt"})
        if game_description is not None:
            game_description = game_description.text.strip()
        else:
            game_description = None


        # Reset both ratings so values from a previous game never carry over
        current = None
        expected = None
        game_rating = game_info.find("div", {"id": "game-misc-cnt"})
        if game_rating is not None:
            game_rating = game_rating.find_all("span")

            if len(game_rating) == 2:
                current = game_rating[0].find_all("b")[0].text.strip()
                expected = game_rating[1].find_all("b")[0].text.strip()
            elif len(game_rating) == 1:
                expected = game_rating[0].find_all("b")[0].text.strip()


        ps4_games_json.append({
                "Title":  game_title,
                "Tags": tags_list,
                "Developer": game_developer,
                "Publisher": game_publisher,
                "Mode": game_mode,
                "Release_Date_Day": release_date_day,
                "Release_Date_Month": release_date_month,
                "Release_Date_Year": release_date_year,
                "Game_Description": game_description,
                "Expected_Rating": expected,
                "Current_Rating": current
        })

    await session.close()
    
with open('ps4-games.json', 'w') as json_file:
    json.dump(ps4_games_json, json_file)
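
The loop above repeats the same "find, check for None, strip" pattern for several fields. One way to shorten it is a small helper, sketched below. `text_or_none` is a hypothetical name, and `SimpleNamespace` is used here only as a stand-in for a BeautifulSoup tag, which likewise exposes a `.text` attribute.

```python
from types import SimpleNamespace


def text_or_none(node, prefix=""):
    """Return the node's stripped text without `prefix`, or None if the node is missing."""
    if node is None:
        return None
    return node.text.strip().replace(prefix, "")


# SimpleNamespace stands in for a tag found on the page; the text is made up.
found = SimpleNamespace(text="  developer: Example Studio ")
developer = text_or_none(found, "developer: ")
missing = text_or_none(None, "developer: ")
print(developer, missing)
```

With such a helper, a line like `game_developer = text_or_none(game_info.find("div", {"id": "game-developer-cnt"}), "developer: ")` would replace a three-line block.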

We will view the data in a pandas dataframe for verification

In [None]:
ps4_games_df = pd.json_normalize(ps4_games_json)
ps4_games_df

## 1.3 Web Scraping VGChartz

VGChartz is a game coverage website that organizes details of games into a database. Details include the platform, title, total shipped, sales from the NA, PAL, Japan, and other regions, as well as total sales.

We will first create an array with keywords for the consoles

In [None]:
console = ["XOne", "PS4", "PC", "NS"]

We will search for the data using one console, here the first in the list.

In [None]:
URL = "https://www.vgchartz.com/games/games.php?name=&keyword=&console=" + console[0] + "&region=All&developer=&publisher=&goty_year=&genre=&boxart=Both&banner=Both&ownership=Both&showmultiplat=No&results=15000&order=Sales&showtotalsales=0&showtotalsales=1&showpublisher=0&showvgchartzscore=0&shownasales=0&shownasales=1&showdeveloper=0&showcriticscore=0&showpalsales=0&showpalsales=1&showreleasedate=0&showuserscore=0&showjapansales=0&showjapansales=1&showlastupdate=0&showothersales=0&showothersales=1&showshipped=0&showshipped=1"

The VGChartz database is slow and prone to server errors. We will set a timeout of 300 seconds and retry until the response has status code 200 (success)

In [None]:
status = 0
while(status!=200):
    try: 
        page = requests.get(URL, timeout=300)
        status = page.status_code
    except requests.exceptions.Timeout:
        # A timed-out request returns no page, so reset the status and retry
        status = 0
soup = BeautifulSoup(page.content, 'html.parser')
soup
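
A `while` loop that waits for status 200 can run indefinitely if the server never recovers. The sketch below shows one way to cap the number of attempts. `fetch_with_retries` is a hypothetical helper, and the fetcher is a fake that imitates one timeout, one server error, and one success so that no network access is needed.

```python
import time
from types import SimpleNamespace


def fetch_with_retries(fetch, max_attempts=5, delay_seconds=0.0, retry_on=(TimeoutError,)):
    """Call `fetch()` until it returns status 200 or attempts run out.

    Returns the successful response, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        try:
            response = fetch()
            if response.status_code == 200:
                return response
        except retry_on:
            pass  # Treat a timeout like any other failed attempt.
        time.sleep(delay_seconds)
    return None


# Fake fetcher: times out once, returns a 500, then succeeds.
outcomes = iter([
    TimeoutError(),
    SimpleNamespace(status_code=500),
    SimpleNamespace(status_code=200),
])


def fake_fetch():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = fetch_with_retries(fake_fetch)
print(result.status_code)
```

With `requests`, the same helper could be called as `fetch_with_retries(lambda: requests.get(URL, timeout=300), retry_on=(requests.exceptions.Timeout,))`, and a `None` result would signal that the console should be skipped or retried later.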

The data is stored in a table. There is no ID, so we will just get all table rows

In [None]:
sales_table = soup.find_all("tr")

The first 27 table rows contain unrelated elements from the website. The sales data begins at row 28 (index 27) and ends one row before the last

In [None]:
size = len(sales_table) - 1

In [None]:
sales_table = sales_table[27:size]

In [None]:
len(sales_table)
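
The slice used above keeps everything from index 27 up to, but not including, the last row. The short sketch below illustrates this on a stand-in list with made-up labels and shows that a negative end index gives the same result.

```python
# Stand-in for the list returned by soup.find_all("tr"): 27 leading rows,
# 3 data rows, and 1 trailing row. The labels are made up.
rows = [f"header-{n}" for n in range(27)] + ["game-1", "game-2", "game-3"] + ["footer"]

size = len(rows) - 1
by_size = rows[27:size]      # form used in the cells above
by_negative = rows[27:-1]    # equivalent shorthand

print(by_size)
print(by_size == by_negative)
```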

After extracting the rows, we will select one item (the first) to extract the sales data

In [None]:
item = sales_table[0]

We will find all the items in each row and store them in an array

In [None]:
item = item.find_all("td")

The data is stored at indexes 0-9, excluding indexes 1 and 3.

Index 0 = game publisher, Index 2 = game title, Index 4 = total number of copies shipped, Index 5 = total sales in millions, and Indexes 6-9 = sales in millions for North America, PAL, Japan, and Others, respectively. Note that the code below does not read index 0; it stores the console keyword in `sales_publisher` instead.

In [None]:
sales_publisher = console[0]
sales_title = item[2].text.strip()
sales_total_shipped = item[4].text.strip()
sales_total_sales = item[5].text.strip()
sales_NA_sales = item[6].text.strip()
sales_PAL_sales = item[7].text.strip()
sales_Japan_sales = item[8].text.strip()
sales_Other_sales = item[9].text.strip()


print(sales_publisher + " " + sales_title + " " + sales_total_shipped + " " + 
      sales_total_sales + " " + sales_NA_sales + " " + sales_PAL_sales + " " +
     sales_Japan_sales + " " + sales_Other_sales)
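
The sales figures are collected as text, so they need converting before any arithmetic. The sketch below assumes values look like "1.23m" for millions and "N/A" for missing data; that format is an assumption and should be checked against the scraped output. `parse_millions` is a hypothetical helper.

```python
def parse_millions(value):
    """Convert text such as "1.23m" to a float in millions; return None for "N/A"."""
    cleaned = value.strip().lower()
    if cleaned in ("", "n/a"):
        return None
    # Remove the trailing unit marker before converting to a number.
    return float(cleaned.rstrip("m"))


examples = ["1.23m", " 0.50m ", "N/A"]
parsed = [parse_millions(text) for text in examples]
print(parsed)
```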

### Final Code 1.3

The main code is a double loop. The outer loop requests up to 15,000 results for each console and the inner loop collects the data. After all the data is collected, it is stored in a JSON file

In [None]:
console = ["XOne", "PS4", "PC", "NS"]
ps4_sales_json = []

for i in range (0, len(console)):
    URL = "https://www.vgchartz.com/games/games.php?name=&keyword=&console=" + console[i] + "&region=All&developer=&publisher=&goty_year=&genre=&boxart=Both&banner=Both&ownership=Both&showmultiplat=No&results=15000&order=Sales&showtotalsales=0&showtotalsales=1&showpublisher=0&showvgchartzscore=0&shownasales=0&shownasales=1&showdeveloper=0&showcriticscore=0&showpalsales=0&showpalsales=1&showreleasedate=0&showuserscore=0&showjapansales=0&showjapansales=1&showlastupdate=0&showothersales=0&showothersales=1&showshipped=0&showshipped=1"
    status = 0
    while(status!=200):
        try: 
            page = requests.get(URL, timeout=300)
            status = page.status_code
        except requests.exceptions.Timeout:
            # A timed-out request returns no page, so reset the status and retry
            status = 0
    soup = BeautifulSoup(page.content, 'html.parser')
    sales_table = soup.find_all("tr")
    size = len(sales_table) - 1
    sales_table = sales_table[27:size]
    
    for j in range (0, len(sales_table)):
        item = sales_table[j]
        item = item.find_all("td")
        sales_publisher = console[i]
        sales_title = item[2].text.strip()
        sales_total_shipped = item[4].text.strip()
        sales_total_sales = item[5].text.strip()
        sales_NA_sales = item[6].text.strip()
        sales_PAL_sales = item[7].text.strip()
        sales_Japan_sales = item[8].text.strip()
        sales_Other_sales = item[9].text.strip()

        ps4_sales_json.append({
            
            "Publisher": sales_publisher,
            "Title": sales_title,
            "Total_Shipped": sales_total_shipped,
            "Total_Sales": sales_total_sales,
            "NA_Sales": sales_NA_sales,
            "PAL_Sales": sales_PAL_sales,
            "Japan_Sales": sales_Japan_sales,
            "Other_Sales": sales_Other_sales 
             
        })
        
with open('ps4-sales.json', 'w') as json_file:
    json.dump(ps4_sales_json, json_file)
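
The long query string can also be assembled from named parameters, which makes it easier to see what changes per console. The sketch below uses only a subset of the parameters from the original URL for brevity, so it is not claimed to return the same page; `console_url` is a hypothetical helper and no request is sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.vgchartz.com/games/games.php"


def console_url(console_name):
    """Build a query URL for one console from a subset of the original parameters."""
    params = [
        ("console", console_name),
        ("region", "All"),
        ("results", 15000),
        ("order", "Sales"),
        # The original URL sends each show* flag twice (0 then 1); a list of
        # pairs keeps repeated keys, which a dict could not.
        ("showtotalsales", 0),
        ("showtotalsales", 1),
    ]
    return BASE_URL + "?" + urlencode(params)


urls = {name: console_url(name) for name in ["XOne", "PS4", "PC", "NS"]}
print(urls["PS4"])
```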

### Adding Missing Files
As of April 2021, the GamePressure site has been updated. Below is the updated code, together with a list of links that were not included in the previous scrape.

In [2]:
missing = ['https://www.gamepressure.com/games/brawlhalla/zf45a5#ps4',
'https://www.gamepressure.com/games/call-of-duty-black-ops-iiii/zf51ed#ps4',
'https://www.gamepressure.com/games/conan-exiles/z04624#ps4',
'https://www.gamepressure.com/games/demons-souls-remake/ze5be2', 
'https://www.gamepressure.com/games/destruction-allstars/z15be5',
'https://www.gamepressure.com/games/diablo-iii-reaper-of-souls-ultimate-evil-edition/z1391d#ps4',
'https://www.gamepressure.com/games/dying-light-the-following/z543f8#ps4',
'https://www.gamepressure.com/games/enter-the-gungeon/z13f5c#ps4',
'https://www.gamepressure.com/games/hollow-knight/z14b7d#ps4',
'https://www.gamepressure.com/games/hunt-showdown/z03d4e#ps4',
'https://www.gamepressure.com/games/madden-nfl-21/z55b86#ps4',
'https://www.gamepressure.com/games/maneater/zd53cb#ps4',
'https://www.gamepressure.com/games/monster-hunter-world-iceborne/za579e#ps4',
'https://www.gamepressure.com/games/nba-2k20/z85847#ps4',
'https://www.gamepressure.com/games/persona-5-scramble-the-phantom-strikers/z75783#ps4',
'https://www.gamepressure.com/games/smite/z02de5#ps4',
'https://www.gamepressure.com/games/tom-clancys-rainbow-six-siege/z23d74#ps4',
'https://www.gamepressure.com/games/warface/z4259d#ps4',
'https://www.gamepressure.com/games/we-were-here/ze4cbe#ps4',
'https://www.gamepressure.com/games/world-of-tanks-mercenaries/z229e2#ps4']

In [193]:
ps4_missing_games_json = []

for i in range (0, len(missing)):
    session = AsyncHTMLSession()
    r = await session.get(missing[i])
    await r.html.arender(timeout=0) 
    soup = BeautifulSoup(r.html.raw_html, 'html.parser')
    
    game_info = soup.find("section", {"class": "article-left"}) 
    
    if game_info is not None:
        game_title = game_info.find("span", {"id": "game-title-cnt"})
        if game_title is not None:
            game_title = game_title.text.strip()

        tags_list = []
        tags = game_info.find("p", {"class": "tagi"})
        # Check for a missing tag paragraph before searching inside it
        if tags is not None:
            for tag in tags.find_all("a"):
                tags_list.append(tag.text.strip())

        game_developer = game_info.find("span", {"id": "game-developer-cnt"})
        if game_developer is not None:
            game_developer = game_developer.text.strip().replace("developer: ", "")
 
        game_publisher = game_info.find("span", {"id": "game-publisher-cnt"})
        if game_publisher is not None:
            game_publisher = game_publisher.text.strip().replace("publisher: ", "")

        game_mode = game_info.find("div", {"id": "game-misc-cnt"})
        game_mode = game_mode.find_all("p")
        game_mode = game_mode[0].find("b")
        game_mode =game_mode.text.strip()

        release_date = game_info.find("div", {"id": "game-dates-cnt"})
        release_date = release_date.find("p", {"class": "p2"})
        release_date_day = None
        release_date_month = None
        release_date_year = None
        if release_date.find("span", {"class": "s1"}) is not None:
            release_date_day = release_date.find("span", {"class": "s1"}).text.strip()
        if release_date.find("span", {"class": "s2"}) is not None:
            release_date_month = release_date.find("span", {"class": "s2"}).text.strip()
        if release_date.find("span", {"class": "s3"}) is not None:
            release_date_year = release_date.find("span", {"class": "s3"}).text.strip()

        game_description = game_info.find("div", {"id": "game-description-cnt"})
        if game_description is not None:
            game_description = game_description.text.strip()
        else:
            game_description = None


        # Reset both ratings so values from a previous game never carry over
        current = None
        expected = None
        game_rating = game_info.find("div", {"id": "game-misc-cnt"})

        if game_rating is not None:
            game_rating = game_rating.find_all("p")

            if len(game_rating) == 3:
                current = game_rating[1].find_all("b")[0].text.strip()
                expected = game_rating[2].find_all("b")[0].text.strip()
            elif len(game_rating) == 2:
                expected = game_rating[1].find_all("b")[0].text.strip()


        ps4_missing_games_json.append({
                "Title":  game_title,
                "Tags": tags_list,
                "Developer": game_developer,
                "Publisher": game_publisher,
                "Mode": game_mode,
                "Release_Date_Day": release_date_day,
                "Release_Date_Month": release_date_month,
                "Release_Date_Year": release_date_year,
                "Game_Description": game_description,
                "Expected_Rating": expected,
                "Current_Rating": current
        })

    await session.close()
    
with open('ps4-games-missing.json', 'w') as json_file:
    json.dump(ps4_missing_games_json, json_file)

#### That concludes the data collection part.

The data is now stored in the corresponding JSON files.