# Recommendation Systems using Steam Data - Data Extraction 2 <a id='top'></a>

In this notebook, we perform our second data extraction. This time, we build a dataset focused on game reviews, which we extract for each game through the Steam API. The final columns of our dataset are:
* aa
* aa

The structure of this notebook is as follows:

[0. Import Libraries](#libraries) <br>
[1. Import Topsellers File](#topseller) <br>
[2. Extract Data](#extract) <br>
&emsp; [2.1. Check Current Data](#current_data) <br>
&emsp; [2.2. Extract Data](#extract_data) <br>


# 0. Import Libraries<a id='libraries'></a>
[to the top](#top)  

The first step is to import the necessary libraries.

In [6]:
import os
import time
from urllib.parse import quote

import polars as pl
import requests

In [None]:
def get_steam_reviews(app_id, cursor='*'):
    # Request one page (up to 100) of recent English reviews for the given app.
    # The cursor must already be URL-encoded by the caller.
    url = f"https://store.steampowered.com/appreviews/{app_id}?json=1&filter=recent&language=english&cursor={cursor}&num_per_page=100"
    response = requests.get(url, timeout=30)
    if response.status_code != 200:
        print(f"Failed to get reviews for app ID {app_id}. Status code: {response.status_code}")
        return None
    data = response.json()
    if data.get('success') != 1:
        # Return None so the caller stops instead of reading keys that may be missing
        print(f"Failed to get valid reviews for app ID {app_id}.")
        print(data)
        return None
    return data
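
The request URL in `get_steam_reviews` is assembled by hand, so the caller has to URL-encode the cursor. As a minimal sketch of an alternative, assuming the same endpoint and query parameters, `urllib.parse.urlencode` can take care of the encoding. The helper name `build_reviews_url` and the sample cursor value are hypothetical and not part of the notebook.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def build_reviews_url(app_id, cursor="*", num_per_page=100):
    # Hypothetical helper: same parameters as the hand-written URL,
    # but urlencode percent-encodes every value (including the cursor).
    params = {
        "json": 1,
        "filter": "recent",
        "language": "english",
        "cursor": cursor,
        "num_per_page": num_per_page,
    }
    return f"https://store.steampowered.com/appreviews/{app_id}?{urlencode(params)}"

# Made-up cursor containing characters that need encoding
url = build_reviews_url(553850, cursor="AoJ4+abc/=")

# Decoding the query string gives back the original cursor
print(parse_qs(urlsplit(url).query)["cursor"])
```

With this approach the raw cursor from the previous response would be passed in directly, without calling `quote` first, since encoding it twice would corrupt it.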

In [None]:
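def fetch_all_reviews(app_id):
    # Page through all reviews of one app and save them to a parquet file.
    cursor = '*'
    seen_cursors = set()
    all_reviews = []

    # Stop if the API hands back a cursor we already used, to avoid looping forever
    while cursor not in seen_cursors:
        seen_cursors.add(cursor)
        data = get_steam_reviews(app_id, cursor)
        if data is None or data['query_summary']['num_reviews'] == 0:
            break

        # Collect reviews
        all_reviews.extend(data['reviews'])

        # Update cursor (URL-encoded, since it may contain characters such as '+')
        cursor = quote(data['cursor'])

    # Create a DataFrame from reviews
    reviews_df = pl.DataFrame(all_reviews)
    num_entries = len(reviews_df)

    # Build the output path in an OS-independent way and make sure the folder exists
    directory = os.path.join("data", "parquets")
    os.makedirs(directory, exist_ok=True)
    file_name = f"{app_id}_reviews_{num_entries}.parquet"
    file_path = os.path.join(directory, file_name)

    # Save to parquet
    reviews_df.write_parquet(file_path)

    return num_entries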
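
The cursor-based pagination in `fetch_all_reviews` can be tried out without touching the network. The sketch below is an illustration under stated assumptions, not the notebook's implementation: `collect_pages` and the `fake_pages` dictionary are hypothetical stand-ins for the API call and its responses, and the only response fields assumed are `reviews` and `cursor`.

```python
def collect_pages(fetch_page, start_cursor="*"):
    """Follow cursors until a page is missing or empty, or a cursor repeats."""
    cursor = start_cursor
    seen = set()
    items = []
    while cursor not in seen:
        seen.add(cursor)
        page = fetch_page(cursor)
        if page is None or not page["reviews"]:
            break
        items.extend(page["reviews"])
        cursor = page["cursor"]
    return items

# Hypothetical in-memory pages standing in for API responses
fake_pages = {
    "*": {"reviews": [{"id": 1}, {"id": 2}], "cursor": "c1"},
    "c1": {"reviews": [{"id": 3}], "cursor": "c2"},
    "c2": {"reviews": [], "cursor": "c2"},
}

reviews = collect_pages(fake_pages.get)
print(len(reviews))  # prints 3
```

Passing the page fetcher as an argument keeps the loop logic separate from the HTTP request, which makes the stopping conditions easy to check in isolation.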

# Single fetch for specific game

In [7]:
# start_time = time.time()
# num_entries = fetch_all_reviews(553850)
# elapsed_time = time.time() - start_time

# print(f"Execution time: {elapsed_time:.2f} seconds.")
# print(f"Number of entries: {num_entries}")

# 1. Import Topsellers File<a id='topseller'></a>
[to the top](#top)  

Since we want data on the top-selling games, we import a JSON file containing the app IDs of the top sellers.

In [8]:
topseller = [x for x in list(pl.read_json('data/SteamTopSellers.json'))[0]]
topseller[:5] 

['730',
 '1145350',
 '1604030',
 '281990',
 '1840080',
 '553850',
 '2195250',
 '2215430',
 '1158310',
 '1363080',
 '2479810',
 '236390',
 '1172470',
 '1145360',
 '1599340',
 '236850',
 '1151340',
 '983870',
 '1086940',
 '394360',
 '374320',
 '813230',
 '1142710',
 '570',
 '1222670',
 '1812450',
 '381210',
 '2881650',
 '1677280',
 '306130',
 '1085660',
 '1245620',
 '529340',
 '39210',
 '1177980',
 '377160',
 '805550',
 '230410',
 '1774580',
 '1449850',
 '1669000',
 '1954200',
 '2670630',
 '1962663',
 '582660',
 '255710',
 '227300',
 '294100',
 '1172620',
 '2519060',
 '1248130',
 '570940',
 '1794680',
 '2426960',
 '359550',
 '578080',
 '271590',
 '1203620',
 '1517290',
 '1426210',
 '1250410',
 '435150',
 '552990',
 '1476970',
 '1284190',
 '1222700',
 '1091500',
 '870780',
 '252490',
 '2381740',
 '1326470',
 '413150',
 '2418520',
 '2399830',
 '1018830',
 '238960',
 '427410',
 '1172380',
 '761890',
 '756800',
 '1284210',
 '1407200',
 '1328670',
 '1693980',
 '1237950',
 '1887840',
 '2362300

# 2. Extract Data<a id='extract'></a>
[to the top](#top)  

After selecting which games we want data on, it is time to start extracting the data.

## 2.1. Check Current Data<a id='current_data'></a>
[to the top](#top)  

Since extracting data is a long process, we created a way to check which games we already have data on, so that we only extract the ones still missing. To do so, we build a list of the app IDs of the games whose reviews we have already extracted.
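
The idea can be sketched on a throwaway folder. This is a minimal illustration that assumes the file naming used when saving (`<app_id>_reviews_<count>.parquet`). The helper name `downloaded_app_ids`, the temporary files, and the sample ID list are hypothetical.

```python
import tempfile
from pathlib import Path

def downloaded_app_ids(folder):
    # The app ID is the part of the file name before the first underscore
    return {p.name.split("_")[0] for p in Path(folder).glob("*.parquet")}

with tempfile.TemporaryDirectory() as tmp:
    # Create empty placeholder files, including one that is not a parquet file
    for name in ("730_reviews_100.parquet", "570_reviews_50.parquet", "notes.txt"):
        (Path(tmp) / name).touch()
    ids = downloaded_app_ids(tmp)

# Keep only the games that still need to be extracted
todo = [game for game in ["730", "553850", "570"] if game not in ids]
print(todo)  # prints ['553850']
```

A set is used here instead of a list because membership tests on a set stay fast as the number of downloaded games grows.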

In [9]:
folder_path = os.path.join('data', 'parquets')
downloaded = []
for filename in os.listdir(folder_path):
    if filename.endswith('.parquet'):
        # File names follow "<app_id>_reviews_<count>.parquet"
        game = filename.split('_')[0]
        downloaded.append(game)
len(downloaded)

['1000760',
 '1011190',
 '1012110',
 '1027820',
 '104600',
 '1061180',
 '1062960',
 '1064460',
 '1065310',
 '1069740',
 '1070710',
 '1075740',
 '1077290',
 '1082430',
 '1086640',
 '1088790',
 '1091920',
 '1093910',
 '1096720',
 '1098340',
 '1099640',
 '10',
 '1100990',
 '1101190',
 '1101450',
 '1119700',
 '1133060',
 '11390',
 '1139890',
 '1144030',
 '1145960',
 '1149660',
 '1153430',
 '1162680',
 '1162960',
 '1164850',
 '1184160',
 '1194590',
 '1197370',
 '1200',
 '1201830',
 '1202900',
 '1203420',
 '1205040',
 '1205960',
 '1212620',
 '1218900',
 '1225590',
 '1229460',
 '1235140',
 '1236990',
 '1245620',
 '1252330',
 '1255650',
 '1257410',
 '1259980',
 '1260810',
 '1266820',
 '1275890',
 '1285670',
 '1296840',
 '1299120',
 '1306620',
 '1318090',
 '1334000',
 '1342890',
 '1345820',
 '1347030',
 '1349960',
 '13530',
 '13580',
 '1361700',
 '1377360',
 '1377380',
 '1386780',
 '1403020',
 '1409530',
 '1421250',
 '1430680',
 '1446330',
 '1472660',
 '1473050',
 '1500740',
 '1504020',
 '15071

## 2.2. Extract Data<a id='extract_data'></a>
[to the top](#top)  

Below, we check whether we already have data on each game and, if not, extract it. We also measure how long each extraction takes.
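
The elapsed time is reported in hours, minutes, and seconds. A minimal sketch of that conversion, using two `divmod` calls, is shown here. The function name `split_duration` is hypothetical.

```python
def split_duration(elapsed_seconds):
    # First split off whole hours, then split the remainder into minutes and seconds
    hours, remainder = divmod(elapsed_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return int(hours), int(minutes), seconds

h, m, s = split_duration(3725.5)
print(f"Duration: {h} hours, {m} minutes, {s:.2f} seconds")
# prints "Duration: 1 hours, 2 minutes, 5.50 seconds"
```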

In [11]:
for game in topseller:
    if game in downloaded:
        continue
    start_time = time.time()
    num_entries = fetch_all_reviews(game)  # fetch the reviews of the current game
    elapsed_time = time.time() - start_time

    # Conversion of time from seconds to hours, minutes, and seconds
    hours, remainder = divmod(elapsed_time, 3600)
    minutes, seconds = divmod(remainder, 60)

    print(f"""AppID: {game}
    Duration: {int(hours)} hours, {int(minutes)} minutes, {seconds:.2f} seconds
    Count: {num_entries}
    ####################""")


KeyboardInterrupt: 