We begin by importing the necessary libraries: pandas for data manipulation and the angle_emb package, whose AnglE class loads pretrained sentence-embedding models. We then load the pretrained UAE-Large-V1 model with the CLS pooling strategy (pooling_strategy='cls') and move it to the GPU with .cuda(). Later in the notebook, this model turns a short text summary of each game into an embedding vector.

In [2]:
import pandas as pd
from angle_emb import AnglE
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()

The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  from .autonotebook import tqdm as notebook_tqdm





We load the "steam.csv" dataset, which contains information about various Steam games. We explore it by examining a random sample of three entries to understand its structure and content. Later, we will use the pretrained embedding model to embed text built from this game data.

In [3]:
df = pd.read_csv("data/data-with-media/steam.csv")
df.sample(3)

Unnamed: 0,appid,name,release_date,english,developer,publisher,platforms,required_age,categories,genres,steamspy_tags,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,owners,price
15136,654940,RXE,2017-07-21,1,Buce Studios LLC,Buce Studios LLC,windows;mac;linux,0,Single-player;Multi-player;Steam Achievements;...,Indie;Simulation;Early Access,Early Access;Indie;Simulation,2,1,3,0,0,0-20000,4.79
12395,566980,Crashimals,2017-09-07,1,Rogue Earth LLC,GAMEPUMP,windows,0,Single-player,Action;Casual;Indie;Strategy,Action;Casual;Indie,0,10,0,0,0,0-20000,0.79
18844,757690,Limit of defense,2017-12-16,1,Jl Apps,Jl Apps,windows,0,Single-player,Indie,Indie,0,0,1,0,0,0-20000,2.09


Next, we load the "steam_media_data.csv" dataset, which contains media-related information (header image, screenshots, background, and movies) for the same games. We examine a random sample of three entries to understand its structure and content, which will help us bring media into the analysis.

In [4]:
media_df = pd.read_csv("data/data-with-media/steam_media_data.csv")
media_df.sample(3)

Unnamed: 0,steam_appid,header_image,screenshots,background,movies
147,6000,https://steamcdn-a.akamaihd.net/steam/apps/600...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/600...,"[{'id': 256668530, 'name': 'Republic Commando ..."
7830,434160,https://steamcdn-a.akamaihd.net/steam/apps/434...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/434...,"[{'id': 256685194, 'name': 'A Hole New World -..."
21042,816750,https://steamcdn-a.akamaihd.net/steam/apps/816...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/816...,"[{'id': 256710161, 'name': 'Travildorn', 'thum..."


We now load all three datasets side by side: "steam.csv", which contains general information about each Steam game; "steam_media_data.csv", which holds the media-related information; and "steam_description_data.csv", which provides detailed descriptions. With all three loaded, we can combine game attributes, media elements, and descriptions in a single analysis.

In [5]:
games_df = pd.read_csv("data/data-with-media/steam.csv")
media_df = pd.read_csv("data/data-with-media/steam_media_data.csv")
description_df = pd.read_csv("data/data-with-media/steam_description_data.csv")


We merge the three datasets (games_df, media_df, and description_df), matching "appid" in the games dataset against "steam_appid" in the media and description datasets. Because merge performs an inner join by default, only games present in all three files are kept. The combined DataFrame consolidates all relevant information about each game.

Next, we iterate through the merged DataFrame to construct a formatted summary for each game, including the game name, developer, genres, and a brief summary. This summary is stored in a new column called "text." Additionally, we extract the minimum number of owners by processing the "owners" column to convert the range into an integer, representing the lower bound. Finally, we display a transposed sample of three entries from the DataFrame to review the newly created features and structure.

In [6]:
df = games_df.merge(media_df, left_on="appid" , right_on="steam_appid").merge(description_df, on="steam_appid")
game_texts = []
for _, row in df.iterrows():
    game_text = f'''Game name : {row["name"]}
Developer : {row["developer"]}
Genres : {row["genres"]}
Summary : {row["short_description"]}'''
    game_texts.append(game_text)
df["text"] = game_texts
df["minimum_owners"] = df ["owners"].apply(lambda v : int(v.split("-")[0]))
df.sample(3).T

Unnamed: 0,4542,19813,143
appid,341780,785870,4880
name,Chronicles of a Dark Lord: Episode II War of T...,Admine,Cossacks: European Wars
release_date,2015-01-16,2018-02-16,2011-08-26
english,1,1,1
developer,Kisareth Studios,Irbynx,GSC Game World
publisher,Kisareth Studios,Irbynx,GSC World Publishing
platforms,windows,windows;linux,windows
required_age,0,0,0
categories,Single-player;Steam Achievements;Full controll...,Single-player;Steam Achievements;Steam Cloud;S...,Single-player
genres,Indie;RPG,Indie,Strategy
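
The join behavior of the merge step can be checked on a toy example. This is a minimal sketch with made-up rows standing in for the three files; only the column names appid, steam_appid, header_image, and short_description come from the notebook, and everything else is hypothetical.

```python
import pandas as pd

# Made-up stand-ins for games_df, media_df and description_df.
games = pd.DataFrame({"appid": [10, 20, 30], "name": ["A", "B", "C"]})
media = pd.DataFrame(
    {"steam_appid": [10, 20], "header_image": ["a.jpg", "b.jpg"]}
)
descriptions = pd.DataFrame(
    {"steam_appid": [10, 30], "short_description": ["first", "third"]}
)

# Same chain as in the notebook: the first merge matches differently named
# keys, the second reuses the "steam_appid" column carried over from media.
merged = games.merge(
    media, left_on="appid", right_on="steam_appid"
).merge(descriptions, on="steam_appid")

print(merged)
print(len(games), "games before merging,", len(merged), "after")
```

Only the game that appears in all three toy frames survives, so comparing len(games_df) with len(df) after the real merge is a quick way to see how many games were dropped.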


In this step, we extract the minimum number of owners from the "owners" column of the DataFrame by applying a lambda function that splits the range (e.g., "5000-10000") and converts the lower bound into an integer. This extracted value is stored in a new column called "minimum_owners."

Afterward, we sort the DataFrame in descending order based on the "minimum_owners" column. This sorting allows us to easily identify the games with the highest minimum ownership, facilitating further analysis of popular titles within the dataset.

In [7]:
df["minimum_owners"] = df ["owners"].apply(lambda v : int(v.split("-")[0]))
df = df.sort_values("minimum_owners",ascending = False)
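
To make the range parsing concrete, here is a small sketch in plain Python. The helper name parse_min_owners and the sample rows are assumptions for illustration; the logic mirrors the lambda used above.

```python
def parse_min_owners(owners: str) -> int:
    """Return the lower bound of an owners range such as "0-20000"."""
    lower, _, _ = owners.partition("-")
    return int(lower.strip())


# Hypothetical rows; the range format follows the "owners" column.
rows = [
    {"name": "small game", "owners": "0-20000"},
    {"name": "huge game", "owners": "10000000-20000000"},
    {"name": "mid game", "owners": "50000-100000"},
]

for row in rows:
    row["minimum_owners"] = parse_min_owners(row["owners"])

# Largest lower bound first, as in sort_values(..., ascending=False).
ranked = sorted(rows, key=lambda r: r["minimum_owners"], reverse=True)
print([r["name"] for r in ranked])
```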

In this section, we define a function called display_game, which takes a row from the DataFrame as input and constructs an HTML representation of the game. The function formats the game’s name as a heading, displays the genres in bold, provides a short description, and includes the game's header image.

We then use this function to display a randomly selected game from the DataFrame. By calling df.sample(1, random_state=0), we ensure that a single game is sampled for consistent output each time this cell is executed. The resulting HTML content is rendered in the notebook for a visually appealing presentation of the game details.

In [8]:
from IPython.display import display , HTML
def display_game(row):
    html =''
    html +=  f'<h3>{row["name"]}</h3>'
    html +=  f'<strong>{row["genres"]}</strong>'
    html +=  f'<p>{row["short_description"]}</p>'
    html +=  f'<img src ="{row["header_image"]}">'
    display(HTML(html))
    
for _, row in df.sample(1, random_state=0).iterrows():
    display_game(row)
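
display_game inserts the dataset's strings directly into the HTML, so a name or description containing characters such as < or & could break the markup. One possible variant, sketched here under the assumption that escaping is wanted, builds the same HTML with the standard library's html.escape. The function name build_game_html and the sample row are hypothetical.

```python
from html import escape


def build_game_html(row: dict) -> str:
    """Build the same markup as display_game, with text fields escaped."""
    parts = [
        f'<h3>{escape(str(row["name"]))}</h3>',
        f'<strong>{escape(str(row["genres"]))}</strong>',
        f'<p>{escape(str(row["short_description"]))}</p>',
        # escape() also escapes quotes by default, which keeps the
        # attribute value from closing early.
        f'<img src="{escape(str(row["header_image"]))}">',
    ]
    return "".join(parts)


sample = {
    "name": "Tom & Jerry <Deluxe>",
    "genres": "Action;Indie",
    "short_description": "A made-up description.",
    "header_image": "https://example.com/header.jpg",
}
print(build_game_html(sample))
```

In the notebook, the returned string could be passed to display(HTML(...)) exactly as before.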

In this segment, we measure the execution time of our code using the %%time magic command. We first create a subset of the DataFrame, top_df, by sorting the original DataFrame by the "minimum_owners" column in descending order and selecting the top 4000 games.

Next, we prepare the text data for embedding by splitting the "text" column into batches of 50 entries each. This batch processing helps manage memory usage and ensures efficient encoding. We then iterate over these batches, printing the progress every five iterations for monitoring purposes.

Within the loop, we use the angle.encode method to generate embeddings for each chunk of text, appending the results to the embeddings list. Finally, we store these embeddings in a new column called "embedding" in the top_df DataFrame, enabling us to perform further analyses and tasks on the embedded data.

In [12]:
%%time
import numpy as np
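# .copy() makes top_df an independent frame, so adding a column below is safe.
top_df = df.sort_values("minimum_owners", ascending=False).head(4000).copy()
embeddings = []
# array_split takes a number of sections (not a batch size);
# max(1, ...) avoids asking for zero sections on small inputs.
batches = np.array_split(top_df["text"], max(1, len(top_df) // 50))
for idx, chunk_text in enumerate(batches):
    if idx % 5 == 0:
        print(f"{idx} / {len(batches)}")
    # Encode one batch and collect one vector per game.
    embeddings += list(angle.encode(list(chunk_text), to_numpy=True))
top_df["embedding"] = embeddings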

0 / 80
5 / 80
10 / 80
15 / 80
20 / 80
25 / 80
30 / 80
35 / 80
40 / 80
45 / 80
50 / 80
55 / 80
60 / 80
65 / 80
70 / 80
75 / 80
CPU times: total: 2h 18min 28s
Wall time: 27min 4s
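
np.array_split takes a number of sections, so the batches are exactly 50 texts long only when the row count is a multiple of 50. As an alternative, the following sketch slices by a fixed batch size in plain Python. The names encode_in_batches and fake_encode are hypothetical; fake_encode merely stands in for angle.encode so the example runs without a model.

```python
from typing import Callable, List, Sequence


def encode_in_batches(
    texts: Sequence[str],
    encode: Callable[[List[str]], List[List[float]]],
    batch_size: int = 50,
    report_every: int = 5,
) -> List[List[float]]:
    """Encode texts in fixed-size batches and return one vector per text."""
    vectors: List[List[float]] = []
    n_batches = (len(texts) + batch_size - 1) // batch_size  # ceiling division
    for batch_idx, start in enumerate(range(0, len(texts), batch_size)):
        if batch_idx % report_every == 0:
            print(f"{batch_idx} / {n_batches}")
        vectors.extend(encode(list(texts[start:start + batch_size])))
    return vectors


def fake_encode(batch: List[str]) -> List[List[float]]:
    # Stand-in encoder: a one-dimensional "embedding" equal to the text length.
    return [[float(len(text))] for text in batch]


texts = [f"game {i}" for i in range(120)]
vectors = encode_in_batches(texts, fake_encode)
print(len(vectors))
```

The last batch is simply shorter when the number of texts is not a multiple of batch_size, and an empty input yields an empty list.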


In this section, we use the cosine_distances function from the sklearn.metrics.pairwise module to find games similar to a specified title, in this case, "Factorio." We first extract the row corresponding to "Factorio" from the top_df DataFrame.

We then compute the cosine distances between the embedding of "Factorio" and the embeddings of every game in top_df, including "Factorio" itself. A smaller distance means the two embeddings point in a more similar direction, so the games are described in more similar terms.

After calculating the distances, we sort the game indices from smallest to largest distance. Finally, we display the 8 closest entries using the display_game function. Because "Factorio" has a distance of 0 to itself, it appears first, followed by its 7 nearest neighbors.

In [4]:
from sklearn.metrics.pairwise import cosine_distances
game_name = 'Factorio'
game_row = top_df[top_df["name"] == game_name].iloc[0]
distances = cosine_distances(np.array([game_row.embedding]), np.array(top_df.embedding.tolist()))[0]
sorted_indices = distances.argsort()
for idx in sorted_indices[:8]:
    similar_game = top_df.iloc[idx]
    display_game(similar_game)
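
The ranking logic can be illustrated without the model. The sketch below computes cosine distance (1 minus cosine similarity) with NumPy on made-up two-dimensional vectors; the helper name cosine_distance_to_all and the toy data are assumptions, and real embeddings have far more dimensions.

```python
import numpy as np


def cosine_distance_to_all(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine distance between one query vector and every row of matrix."""
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return 1.0 - matrix @ query


# Made-up vectors; row 0 plays the role of the query game.
names = ["query game", "near match", "unrelated", "opposite"]
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])

distances = cosine_distance_to_all(vectors[0], vectors)
order = distances.argsort()

# Skip index 0 so the query game is not reported as its own neighbor.
nearest = [names[i] for i in order if i != 0][:2]
print(nearest)
```

Filtering out the query's own index is one way to return 8 genuinely different games instead of the query plus 7 neighbors.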

In this step, we iterate through each row of the DataFrame df to create a formatted summary for each game. For every game, we construct a string that includes the game name, developer, genres, and a brief summary of the game, which is derived from the "short_description" column.

These formatted summaries are stored in a list called game_texts. After processing all rows, we add this list as a new column named "text" in the DataFrame df. This column provides a concise overview of each game, facilitating easier access to essential information for subsequent analyses.

In [18]:
game_texts = []
for _, row in df.iterrows():
    game_text = f'''Game name : {row["name"]}
Developer : {row["developer"]}
Genres : {row["genres"]}
Summary : {row["short_description"]}'''
    game_texts.append(game_text)
df["text"] = game_texts

In this final step, we save the top_df DataFrame to a Parquet file named "game_database.parquet." The Parquet format is chosen for its efficient storage and fast retrieval capabilities, making it suitable for handling large datasets. By saving our processed data, which includes game details and their corresponding embeddings, we ensure that it can be easily accessed and reused in future analyses without needing to repeat the earlier processing steps.

In [19]:
top_df.to_parquet("data/game_database.parquet")

We import the textdistance library, which provides various algorithms for measuring the distance or similarity between strings. We will use it in our web application so that the search function can tolerate small typos in game names.

In [None]:
import textdistance

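Example of how textdistance works. The import cell above must be run first; otherwise the call fails with a NameError like the one shown in the output: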

In [1]:
textdistance.levenshtein("path of ele","path of exile")

NameError: name 'textdistance' is not defined
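
For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The following is a minimal pure-Python sketch of that definition, not the textdistance implementation; the helper closest_names and the list of names are hypothetical, and lowercasing before comparison is an assumption about how a search box might behave.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b using the classic row-by-row table."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def closest_names(query: str, names: List[str], limit: int = 3) -> List[str]:
    """Return the names with the smallest case-insensitive edit distance."""
    return sorted(names, key=lambda n: levenshtein(query.lower(), n.lower()))[:limit]


print(levenshtein("path of ele", "path of exile"))  # 2: insert "x" and "i"
print(closest_names("path of exale", ["Path of Exile", "Factorio", "Portal 2"], limit=1))
```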

Anything below this Markdown cell is just code I ran to test some functions before using them above.

In [None]:
top_df["name"].apply(lambda v : textdistance.levenshtein(v, "path of exale")).sort_values()

In [3]:
top_df.loc[1772]

appid                                                              238960
name                                                        Path of Exile
release_date                                                   2013-10-23
english                                                                 1
developer                                             Grinding Gear Games
publisher                                             Grinding Gear Games
platforms                                                         windows
required_age                                                            0
categories              Single-player;Multi-player;Online Multi-Player...
genres                  Action;Adventure;Free to Play;Indie;Massively ...
steamspy_tags                      Free to Play;Action RPG;Hack and Slash
achievements                                                          120
positive_ratings                                                    71593
negative_ratings                      

In [None]:
vec = angle.encode('hello world', to_numpy = True)
print(vec)
vecs=angle.encode(['hello world1','hello world2'], to_numpy=True)
print(vecs)

In [None]:
from sklearn.metrics.pairwise import cosine_distances
# to_numpy=True returns NumPy arrays, which scikit-learn expects.
cosine_distances(angle.encode(['beef stakes is delicious ',
                              'the capital of Morocco is Rabat',
                              'I love meat'], to_numpy=True))[0]

This code saves your session, so you don't have to re-execute the whole notebook to restore your variables.