# Data Acquisition - Game Reviews <a id='top'></a>

In this last data extraction notebook, we create a dataset focused on game reviews. The data is extracted through the Steam API, which returns the reviews written for each selected game. The final columns of our dataset are:
* **recommendationid** (string): Unique identifier for the recommendation.
* **author** (object): Contains information about the author of the recommendation.
  * **steamid** (string): Author's Steam ID.
  * **num_games_owned** (integer): Number of games the author owns on Steam.
  * **num_reviews** (integer): Number of reviews written by the author.
  * **playtime_forever** (integer): Lifetime playtime of the author in the reviewed game, in minutes.
  * **playtime_last_two_weeks** (integer): Playtime of the author in the reviewed game over the last two weeks, in minutes.
  * **playtime_at_review** (integer): The author's playtime on the game at the time of writing the review (in minutes).
  * **last_played** (integer): Unix timestamp of the last time the author played the game.
* **language** (string): Language of the review (e.g., "english").
* **review** (string): The actual review content.
* **timestamp_created** (integer): Unix timestamp of when the review was created.
* **timestamp_updated** (integer): Unix timestamp of when the review was last updated.
* **voted_up** (boolean): Whether the review is a positive recommendation (true) or a negative one (false).
* **votes_up** (integer): Number of upvotes the review has received.
* **votes_funny** (integer): Number of times the review was marked as funny.
* **weighted_vote_score** (string): A score representing the helpfulness of the review.
* **comment_count** (integer): Number of comments on the review.
* **steam_purchase** (boolean): Whether the author purchased the game on Steam.
* **received_for_free** (boolean): Whether the author received the game for free.
* **written_during_early_access** (boolean): Whether the review was written during the game's Early Access period.
* **hidden_in_steam_china** (boolean): Whether the review is hidden on the Steam China store.
* **steam_china_location** (string): User's location for the Steam China store.
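
As a rough illustration of how these columns can be used, the sketch below flattens the nested **author** object and converts a Unix timestamp and a playtime in minutes into friendlier units. The record is made up for illustration and only mimics the shape described above, and `flatten_review` is a hypothetical helper, not part of the Steam API.

```python
from datetime import datetime, timezone

# Made-up record that only mimics the documented shape; all values are illustrative.
sample_review = {
    "recommendationid": "0000001",
    "author": {
        "steamid": "00000000000000000",
        "playtime_forever": 754,      # minutes
        "playtime_at_review": 120,    # minutes
    },
    "language": "english",
    "review": "Example text",
    "timestamp_created": 1704067200,  # Unix timestamp (seconds)
    "voted_up": True,
}


def flatten_review(review):
    """Return a flat dict: author fields get an 'author_' prefix,
    and the Unix timestamps become timezone-aware UTC datetimes."""
    flat = {key: value for key, value in review.items() if key != "author"}
    for key, value in review.get("author", {}).items():
        flat[f"author_{key}"] = value
    for key in ("timestamp_created", "timestamp_updated"):
        if key in flat:
            flat[key] = datetime.fromtimestamp(flat[key], tz=timezone.utc)
    return flat


flat = flatten_review(sample_review)
hours_played = flat["author_playtime_forever"] / 60  # minutes to hours
```

Prefixing the author fields keeps them distinguishable from the review-level fields once everything sits in a single row.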


The structure of this notebook is as follows:

[0. Import Libraries](#libraries) <br>
[1. Define Functions for Data Extraction](#functions) <br>
[2. Import Games Basic Data](#import_games) <br>
&emsp; [2.1. Import Topsellers](#topseller) <br>
&emsp; [2.2. Import Steam Most Played Games](#most_played) <br>
&emsp; [2.3. Import Steam Monthly Top Games](#monthly) <br>
[3. Extract Data](#extract) <br>
&emsp; [3.1. Check Current Data](#current_data) <br>
&emsp; [3.2. Extract Data](#extract_data) <br>


# 0. Import Libraries<a id='libraries'></a>
[to the top](#top)  

The first step is to import the necessary libraries.

In [None]:
import polars as pl
import os
import time
from urllib.parse import quote
import requests

# 1. Define Functions for Data Extraction<a id='functions'></a>
[to the top](#top)  

Next, we define the functions needed to extract the reviews of the selected games.

The following function fetches one page of recent English reviews (up to 100) for a Steam application, based on its app ID and a pagination cursor.

In [None]:
def get_steam_reviews(app_id, cursor='*'):
    """Fetch one page of recent English reviews for a Steam app.

    Returns the parsed JSON response, or None if the request failed.
    """
    url = f"https://store.steampowered.com/appreviews/{app_id}?json=1&filter=recent&language=english&cursor={cursor}&num_per_page=100"
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to get reviews for app ID {app_id}. Status code: {response.status_code}")
        return None

    data = response.json()
    if data['success'] != 1:
        print(f"Failed to get valid reviews for app ID {app_id}.")
        print(data)
        # Return None so that callers do not index into an invalid payload
        return None
    return data
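
As a possible alternative to formatting the query string by hand, the sketch below builds the same URL with `urllib.parse.urlencode`, which percent-encodes the cursor automatically. `build_reviews_url` is a hypothetical helper and the cursor value is made up; only the endpoint and the query parameters come from the function above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_reviews_url(app_id, cursor="*", language="english", num_per_page=100):
    """Build the appreviews URL; urlencode escapes characters such as '+', '/' and '='."""
    query = urlencode({
        "json": 1,
        "filter": "recent",
        "language": language,
        "cursor": cursor,
        "num_per_page": num_per_page,
    })
    return f"https://store.steampowered.com/appreviews/{app_id}?{query}"


# Made-up cursor containing characters that must be escaped in a URL
url = build_reviews_url(553850, cursor="AoJ4/abc+def==")

# Decoding the query string gives back the original cursor
decoded_cursor = parse_qs(urlsplit(url).query)["cursor"][0]
```

With a helper like this, the raw cursor returned by the API would be passed in directly; quoting it beforehand would encode it twice.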

Next, the reviews of each game are saved in individual parquet files named with the format {appID}\_reviews\_{numberOfReviews}. Additionally, this function has a safety net: a cursor that keeps track of where the previous request left off when fetching reviews from the Steam API. This lets the function retrieve reviews in batches, each one starting where the last one stopped, until all reviews have been fetched, which helps manage large amounts of data efficiently.

In [None]:
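def fetch_all_reviews(app_id):
    """Fetch every review page for an app, save them to a parquet file, and return the review count."""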
    cursor = '*'
    all_reviews = []
    
    while True:
        data = get_steam_reviews(app_id, cursor)
        if data is None or data['query_summary']['num_reviews'] == 0:
            break
        
        # Collect reviews
        reviews = data['reviews']
        all_reviews.extend(reviews)
        
        # Update cursor
        cursor = quote(data['cursor'])
    
    # Create a DataFrame from reviews
    reviews_df = pl.DataFrame(all_reviews)
    num_entries = len(reviews_df)

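    # Specify the directory and file name for the parquet file
    # (os.path.join keeps the path valid on every operating system)
    directory = os.path.join("data", "parquets")
    os.makedirs(directory, exist_ok=True)
    file_name = f"{app_id}_reviews_{num_entries}.parquet"
    file_path = os.path.join(directory, file_name)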
    
    # Save to parquet
    reviews_df.write_parquet(file_path)

    return num_entries
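
The cursor logic can be illustrated without calling the API. The sketch below follows cursors through a stubbed set of pages; `collect_pages` and `fake_pages` are hypothetical names, and stopping when a cursor repeats is an extra safeguard assumed here, not something the function above does.

```python
def collect_pages(fetch_page, start_cursor="*"):
    """Follow cursors until a page is empty, the fetch fails, or a cursor repeats."""
    cursor = start_cursor
    seen_cursors = set()
    items = []
    while cursor not in seen_cursors:
        seen_cursors.add(cursor)
        page = fetch_page(cursor)
        if page is None or not page["reviews"]:
            break
        items.extend(page["reviews"])
        cursor = page["cursor"]  # the next request starts where this one stopped
    return items


# Stub standing in for the API: maps a cursor to the page it would return
fake_pages = {
    "*": {"reviews": ["r1", "r2"], "cursor": "c1"},
    "c1": {"reviews": ["r3"], "cursor": "c2"},
    "c2": {"reviews": [], "cursor": "c2"},
}

reviews = collect_pages(fake_pages.get)
```

Passing the page fetcher as an argument makes it possible to test the loop with stub data before running it against the real endpoint.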

Below, the reviews are extracted using the defined functions. Each time we run the cell, we have to decide which game's reviews we are going to fetch.

In [None]:
# start_time = time.time()
# num_entries = fetch_all_reviews(553850)
# elapsed_time = time.time() - start_time

# print(f"Execution time was {elapsed_time:.2f} seconds.")
# print(f"Number of entries: {num_entries}")

# 2. Import Games Basic Data<a id='import_games'></a>
[to the top](#top)  

Our next step is to import the lists of games stored in our JSON files, so that we can later confirm which of these games already have their reviews downloaded.

## 2.1. Import Topsellers<a id='topseller'></a>
[to the top](#top)  

The first JSON file we imported is related to the Steam Top Seller games and has information about each game's name and ID.

In [None]:
topseller = [x for x in list(pl.read_json('data/SteamTopSellers.json'))[0]]
topseller[:5] 

## 2.2. Import Steam Most Played Games<a id='most_played'></a>
[to the top](#top)  

The second JSON file we imported is related to the Most Played games at the moment of the data extraction and has information about each game's rank, unique identifier, title, and peak current player count within the last 24 hours relative to the data extraction time.

In [None]:
mostplayed = [x for x in list(pl.read_json('data/SteamMostPlayed.json'))[0]]
mostplayed[:5] 

## 2.3. Import Steam Monthly Top Games<a id='monthly'></a>
[to the top](#top)  

The last JSON file we imported is related to the Most Played games of the first three months of 2024 and has information about each game's rank, unique identifier, title, and peak current player count within the last 24 hours relative to the data extraction time.

In [None]:
monthlytopgames = [x for x in list(pl.read_json('data/SteamMonthlyTopGames.json'))[0]]
monthlytopgames[:5] 

# 3. Extract Data<a id='extract'></a>
[to the top](#top)  

After selecting which games we want data on, it is time to start extracting the data.

## 3.1. Check Current Data<a id='current_data'></a>
[to the top](#top)  

As extracting data is a long process, we decided to create a way to check which games we already have data on, so that we only extract data for the ones we do not have yet. To do so, we create a list of the app IDs of the games whose data has already been extracted.

In [None]:
folder_path = os.path.join('data', 'parquets')
downloaded = []
for filename in os.listdir(folder_path):
    if filename.endswith('.parquet'):
        game = filename.split('_')[0]
        downloaded.append(game)
len(downloaded)
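
Since the file names follow the {appID}\_reviews\_{numberOfReviews} format, the review count can be recovered together with the app ID. The following is a minimal sketch under that assumption; `parse_review_filename` is a hypothetical helper and the file names in the example are made up.

```python
import os


def parse_review_filename(filename):
    """Split '{appID}_reviews_{count}.parquet' into (app_id, count).

    Returns None when the name does not follow that format.
    """
    stem, extension = os.path.splitext(filename)
    parts = stem.split("_")
    if (
        extension != ".parquet"
        or len(parts) != 3
        or parts[1] != "reviews"
        or not parts[2].isdigit()
    ):
        return None
    return parts[0], int(parts[2])


# Made-up file names, used only to illustrate the parsing
example_files = ["553850_reviews_1200.parquet", "730_reviews_0.parquet", "notes.txt"]
parsed = (parse_review_filename(name) for name in example_files)
review_counts = dict(pair for pair in parsed if pair is not None)
```

Note that the app IDs recovered this way are strings, which matters when they are compared with IDs loaded from other sources.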

## 3.2. Extract Data<a id='extract_data'></a>
[to the top](#top)  

Below, we check whether we already have data on each game, first for the top sellers, then for the most played games, and lastly for the monthly top games. If we do not, we proceed to extract it. Each iteration also reports how long the extraction took for that game.

In [None]:
# for game in topseller:
#     if str(game) in downloaded:  # IDs parsed from file names are strings
#         continue
#     start_time = time.time()
#     num_entries = fetch_all_reviews(game)
#     elapsed_time = time.time() - start_time

#     # Conversion of time from seconds to hours, minutes, and seconds
#     hours, remainder = divmod(elapsed_time, 3600)
#     minutes, seconds = divmod(remainder, 60)

#     print(f"""AppID: {game}
#     Duration: {int(hours)} hours, {int(minutes)} minutes, {seconds:.2f} seconds
#     Count: {num_entries}
#     ####################""")

In [None]:
for game in mostplayed:
    # IDs parsed from file names are strings, so compare as strings
    if str(game) in downloaded:
        continue
    start_time = time.time()
    num_entries = fetch_all_reviews(game)
    elapsed_time = time.time() - start_time

    # Conversion of time from seconds to hours, minutes, and seconds
    hours, remainder = divmod(elapsed_time, 3600)
    minutes, seconds = divmod(remainder, 60)

    print(f"""AppID: {game}
    Duration: {int(hours)} hours, {int(minutes)} minutes, {seconds:.2f} seconds
    Count: {num_entries}
    ####################""")

In [None]:
for game in monthlytopgames:
    # IDs parsed from file names are strings, so compare as strings
    if str(game) in downloaded:
        continue
    start_time = time.time()
    num_entries = fetch_all_reviews(game)
    elapsed_time = time.time() - start_time

    # Conversion of time from seconds to hours, minutes, and seconds
    hours, remainder = divmod(elapsed_time, 3600)
    minutes, seconds = divmod(remainder, 60)

    print(f"""AppID: {game}
    Duration: {int(hours)} hours, {int(minutes)} minutes, {seconds:.2f} seconds
    Count: {num_entries}
    ####################""")
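
The three loops above share the same skip check and time conversion. One way to factor them out is sketched below; `format_duration` and `pending_games` are hypothetical helpers, and comparing IDs as strings is an assumption made so that IDs parsed from file names match IDs loaded from the JSON files.

```python
def format_duration(elapsed_seconds):
    """Convert a duration in seconds into hours, minutes, and seconds."""
    hours, remainder = divmod(elapsed_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{int(hours)} hours, {int(minutes)} minutes, {seconds:.2f} seconds"


def pending_games(game_lists, downloaded):
    """Return the games still to fetch, in order, without repeats across lists."""
    done = {str(game) for game in downloaded}
    pending = []
    for games in game_lists:
        for game in games:
            key = str(game)
            if key not in done:
                done.add(key)  # a game present in several lists is fetched once
                pending.append(game)
    return pending


duration_text = format_duration(3725.5)
to_fetch = pending_games([[10, 20], [20, 30]], downloaded=["10"])
```

With these helpers, a single loop over `pending_games([topseller, mostplayed, monthlytopgames], downloaded)` could replace the three cells, calling `fetch_all_reviews` and printing `format_duration` for each game.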