# Data Selection <a id='top'></a>

After acquiring and exploring our data, it is time to start selecting what is going to be used in our model. In this notebook, we partition the games into two categories based on a predefined threshold of 50,000 reviews: those with high review counts and those with low review counts. We then save the corresponding lists of app IDs to text files, so that they are ready to be used when training our model.

The structure of this notebook is as follows:

[0. Import Libraries](#libraries) <br>
[1. Data Selection](#select) <br>

# 0. Import Libraries<a id='libraries'></a>
[to the top](#top)  

The first step is to import the necessary libraries.

In [1]:
import polars as pl
from helper_functions import get_parquets_data_info, save_appids_to_txt

# 1. Data Selection<a id='select'></a>
[to the top](#top)  

In this section of the notebook, we start by collecting information about the review files and then separate the games into two groups: one with the games that have a high number of reviews and another with the remaining games.

In [2]:
parquet_folder_path = 'data/parquets'
parquet_preprocessed_folder_path = 'data/parquets_preprocessed'

The get_parquets_data_info function summarizes all Parquet files in a given folder. It finds every Parquet file in the folder and uses a regular expression to extract the app ID and review count from each filename. It then loads SteamGames.json and builds a lookup dictionary from app ID to game name, so that each file can be labeled with its game. The result is a Polars DataFrame containing file names, app IDs, game names, and review counts, which gives an organized overview of the available Parquet files for further analysis.
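The steps described above can be sketched in plain Python. This is not the actual helper from helper_functions, which is not shown in this notebook, but a minimal illustration under stated assumptions: the filename pattern `<appid>_<review_count>.parquet`, the JSON layout (a list of objects with `appid` and `name` keys), the column names `file_name` and `name`, and the function names `parse_parquet_filename` and `collect_parquet_info` are all hypothetical.

```python
import json
import re
from pathlib import Path

# Assumed filename pattern "<appid>_<review_count>.parquet".
# The real pattern used by get_parquets_data_info is not shown in this notebook.
FILENAME_RE = re.compile(r"^(?P<appid>\d+)_(?P<review_count>\d+)\.parquet$")


def parse_parquet_filename(filename):
    """Return (appid, review_count) from a filename, or None if it does not match."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    return int(match.group("appid")), int(match.group("review_count"))


def collect_parquet_info(folder, games_json_path):
    """Build one summary row (a dict) per Parquet file found in `folder`."""
    # Assumed JSON layout: a list of {"appid": ..., "name": ...} objects.
    with open(games_json_path, encoding="utf-8") as f:
        games = json.load(f)
    # Lookup dictionary for fast app ID -> game name conversion.
    name_by_appid = {int(game["appid"]): game["name"] for game in games}

    rows = []
    for path in sorted(Path(folder).glob("*.parquet")):
        parsed = parse_parquet_filename(path.name)
        if parsed is None:
            continue  # skip files that do not follow the assumed naming scheme
        appid, review_count = parsed
        rows.append(
            {
                "file_name": path.name,
                "appid": appid,
                "name": name_by_appid.get(appid),  # None if the app ID is unknown
                "review_count": review_count,
            }
        )
    return rows
```

Passing the returned rows to `pl.DataFrame(rows)` would give a Polars DataFrame with one column per key, which is the kind of summary table the text describes.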

In [3]:
df_parquets = get_parquets_data_info(parquet_folder_path)
df_parquets_preprocessed = get_parquets_data_info(parquet_preprocessed_folder_path)

In [4]:
df_parquets.head()

In [5]:
df_parquets_preprocessed.head()

The next cell sets a threshold of 50,000 reviews and splits df_parquets into two subsets: high_review_df, with the games whose review count exceeds the threshold, and low_review_df, with the games whose review count is below it. Because both comparisons are strict, a game with exactly 50,000 reviews would fall into neither subset. For each subset, the app IDs are extracted into a list and the number of app IDs is printed. The lists are then saved with the save_appids_to_txt function to the text files "high_review_games.txt" and "low_review_games.txt", so that later steps can select games by review volume.

In [6]:
threshold = 50000

high_review_df = df_parquets.filter(pl.col("review_count") > threshold)
high_review_appid_list = high_review_df.select("appid").to_series().to_list()
print(f"high:{len(high_review_appid_list)}")

# Save the list of appids to a text file
save_appids_to_txt(high_review_appid_list, 'data/high_review_games.txt')

low_review_df = df_parquets.filter(pl.col("review_count") < threshold)
low_review_appid_list = low_review_df.select("appid").to_series().to_list()
print(f"low:{len(low_review_appid_list)}")

# Save the list of appids to a text file
save_appids_to_txt(low_review_appid_list, 'data/low_review_games.txt')

high:23
low:487
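
The save_appids_to_txt helper is imported from helper_functions and its body is not shown here. As a rough illustration of the split-and-save step, the sketch below assumes the text files hold one app ID per line. The functions `split_by_threshold`, `write_appids` and `read_appids` are hypothetical stand-ins, not the notebook's own helpers.

```python
from pathlib import Path


def split_by_threshold(review_counts, threshold):
    """Split a {appid: review_count} mapping into (high, low) app ID lists.

    Uses the same strict comparisons as the cell above, so an app with
    exactly `threshold` reviews ends up in neither list.
    """
    high = [appid for appid, count in review_counts.items() if count > threshold]
    low = [appid for appid, count in review_counts.items() if count < threshold]
    return high, low


def write_appids(appids, path):
    """Hypothetical stand-in for save_appids_to_txt: write one app ID per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("".join(f"{appid}\n" for appid in appids), encoding="utf-8")


def read_appids(path):
    """Read the app IDs back as integers, ignoring blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [int(line) for line in lines if line.strip()]
```

Reading a file back with `read_appids` is a quick way to confirm that the saved list matches the one printed in the notebook before it is used for training.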
