### Steam game tag vectorization

Tags on Steam can only be chosen from a predefined list\* and are used to broadly categorize games. Developers can provide them initially, but the community can also suggest (vote for) other tags, which are then displayed on the game's store page.

Tags include:
* General genres, such as "RPG", "Comedy", "Survival Horror" or "puzzle"
* Multiplayer information ("co-op", "singleplayer", "pvp", "pve")
* Information about game mood such as "relaxing", "cute", "violent" or "cinematic"
* Opinions about some aspects of the game ("lore-rich", "replay value", "great soundtrack")

The tags provide a rich, community-driven description of a game. Because every tag belongs to a limited set of possible values, that description is also easy to convert to a vector representation.

\* Initially, users were able to put anything as a tag, but that option was quickly removed due to abuse/trolling. The tags provided by users included "Polish" for The Witcher, "broken" for games that users considered to be buggy or unplayable, and "hat simulator" for Team Fortress. The phenomenon of users joking with tags hasn't entirely disappeared - at the time of writing, OBS Studio (video recording software) is tagged as "Emotional", "Dark" and "Romance". Accurate tags are more popular, however.

In [1]:
from pathlib import Path
data_dir = Path('../data/raw')
csv_path = data_dir / "games.csv"
json_path = data_dir / "games.json"

import pandas as pd

df = pd.read_csv(csv_path)

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85103 entries, 0 to 85102
Data columns (total 39 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   AppID                       85103 non-null  int64  
 1   Name                        85097 non-null  object 
 2   Release date                85103 non-null  object 
 3   Estimated owners            85103 non-null  object 
 4   Peak CCU                    85103 non-null  int64  
 5   Required age                85103 non-null  int64  
 6   Price                       85103 non-null  float64
 7   DLC count                   85103 non-null  int64  
 8   About the game              81536 non-null  object 
 9   Supported languages         85103 non-null  object 
 10  Full audio languages        85103 non-null  object 
 11  Reviews                     9743 non-null   object 
 12  Header image                85103 non-null  object 
 13  Website                     394

In [4]:
# Keep only games that have tags; copy so that columns added later do not touch df
tagged_games = df[df["Tags"].notnull()].copy()

In [6]:
tagged_games["Tags"].head()

0                          Indie,Casual,Sports,Bowling
1    Indie,Action,Pixel Graphics,2D,Retro,Arcade,Sc...
3    2D Platformer,Atmospheric,Surreal,Mystery,Puzz...
4    Indie,Adventure,Nudity,Violent,Sexual Content,...
5    Turn-Based Combat,Massively Multiplayer,Multip...
Name: Tags, dtype: object

In [7]:
def str_tags_to_set(tags: str) -> set[str]:
    """Split a comma-separated tag string into a set of tag names."""
    return set(tags.split(","))

In [8]:
# Quick check on the first game's tag string
str_tags_to_set("Indie,Casual,Sports,Bowling")

{'Bowling', 'Casual', 'Indie', 'Sports'}

In [10]:
pd.options.mode.chained_assignment = None
tagged_games["Tags_set"] = tagged_games["Tags"].apply(str_tags_to_set)

In [11]:
tagged_games["Tags_set"]

0                         {Bowling, Indie, Casual, Sports}
1        {Arcade, Fast-Paced, Western, Funny, Blood, Re...
3        {Stylized, Surreal, 2D, Singleplayer, Physics,...
4        {Nudity, Sexual Content, Indie, Adventure, Vio...
5        {Mythology, Strategy, MMORPG, RPG, Multiplayer...
                               ...                        
85077    {Relaxing, 2D, Clicker, Sandbox, Indie, Nature...
85079    {3D, Indie, Adventure, Stylized, Point & Click...
85083    {Sexual Content, Time Management, Funny, Matur...
85085    {3D, Indie, Adventure, Psychological Horror, S...
85094    {Time Management, Demons, 3D, Base-Building, I...
Name: Tags_set, Length: 64003, dtype: object

In [12]:
tags_set_list = tagged_games["Tags_set"].tolist()

tags_collection = set()

# Union of every game's tag set
for tags in tags_set_list:
    tags_collection.update(tags)

In [13]:
len(tags_collection)

448

In [14]:
tag_dict = dict()
for i, el in enumerate(tags_collection):
    tag_dict[el] = i
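
A side note on the mapping above: the iteration order of a Python set of strings is not guaranteed to be the same from one interpreter run to the next, so rebuilding the dictionary could assign different indices unless the saved JSON file is reused. A minimal sketch of a reproducible variant, assuming alphabetical order is acceptable (`example_tags` and `stable_tag_dict` are hypothetical names, not part of the notebook):

```python
# Hypothetical small tag collection standing in for tags_collection.
example_tags = {"Indie", "Casual", "Sports", "Bowling"}

# Sorting fixes the order, so every run assigns the same index to each tag.
stable_tag_dict = {tag: i for i, tag in enumerate(sorted(example_tags))}
print(stable_tag_dict)
# {'Bowling': 0, 'Casual': 1, 'Indie': 2, 'Sports': 3}
```

With a stable mapping, vectors computed in different sessions stay comparable even if the dictionary is regenerated.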

In [15]:
print(tag_dict)

{'Fast-Paced': 0, 'Farming': 1, 'Golf': 2, 'Masterpiece': 3, 'Lemmings': 4, 'Sniper': 5, 'Feature Film': 6, 'Procedural Generation': 7, 'Romance': 8, 'Pixel Graphics': 9, 'Robots': 10, 'Multiplayer': 11, 'Grid-Based Movement': 12, 'Metroidvania': 13, 'Fighting': 14, 'Parody': 15, 'Lore-Rich': 16, 'Narration': 17, 'Lara Croft': 18, 'Online Co-Op': 19, 'Female Protagonist': 20, 'Dark Fantasy': 21, 'Puzzle': 22, 'Utilities': 23, 'FMV': 24, 'Real-Time': 25, 'Remake': 26, 'Snowboarding': 27, 'Software Training': 28, 'Choose Your Own Adventure': 29, 'Nonlinear': 30, 'Free to Play': 31, 'Music': 32, 'Action Roguelike': 33, 'Time Manipulation': 34, 'Hand-drawn': 35, 'Level Editor': 36, 'Gaming': 37, 'Minimalist': 38, 'Roguelike Deckbuilder': 39, 'e-sports': 40, 'Nature': 41, 'Violent': 42, 'Interactive Fiction': 43, 'Dark': 44, 'Skateboarding': 45, 'Sports': 46, 'Score Attack': 47, 'Comedy': 48, 'Real Time Tactics': 49, 'Philosophical': 50, 'Tanks': 51, 'Instrumental Music': 52, 'Pirates': 53,

In [16]:
import json
with open("tag_dictionary.json", "w", encoding="utf-8") as f:
    json.dump(tag_dict, f)

In [17]:
import numpy as np

with open("tag_dictionary.json", "r", encoding="utf-8") as f:
    tag_dict = json.load(f)

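def vectorize_str_tags(tags: str) -> np.ndarray:
    """Return a multi-hot vector with a 1 at the index of each tag in `tags`."""
    # Size the vector from the dictionary instead of hard-coding 448
    vec = np.zeros(len(tag_dict))
    for tag in str_tags_to_set(tags):
        vec[tag_dict[tag]] = 1
    return vec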
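
For vectorizing a whole column at once, pandas offers `Series.str.get_dummies`, which splits each string on a separator and returns one 0/1 column per distinct value. The sketch below is an alternative to the per-string function above, not the approach used in this notebook; `example_tags` is a hypothetical miniature of the `Tags` column.

```python
import pandas as pd

# Hypothetical miniature of the "Tags" column, indexed by made-up game ids.
example_tags = pd.Series(
    ["Indie,Casual,Sports,Bowling", "Indie,Action,2D"],
    index=[10, 20],
)

# One column per distinct tag; a cell is 1 where the game carries that tag.
tag_matrix = example_tags.str.get_dummies(sep=",")
print(tag_matrix)
```

Each row of `tag_matrix` plays the same role as a vector from `vectorize_str_tags`, except that the column order is chosen by pandas rather than by `tag_dict`.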

In [29]:
from numpy.linalg import norm
def cosine_similarity(A: np.ndarray, B: np.ndarray) -> float:
    # Cosine similarity is undefined if either vector is all zeros; return 0.0 then
    if not (np.any(A) and np.any(B)):
        return 0.0
    return float(np.dot(A, B) / (norm(A) * norm(B)))

def tags_similarity(tags1 : str, tags2 : str)->float:
    vec1 = vectorize_str_tags(tags1)
    vec2 = vectorize_str_tags(tags2)
    return cosine_similarity(vec1,vec2)

def compare_games(id1, id2):
    # Look each game up once instead of filtering df four times
    game1 = df[df['AppID'] == id1].iloc[0]
    game2 = df[df['AppID'] == id2].iloc[0]
    print(f"Similarity between \n>>>{game1['Name']} \nand \n>>>{game2['Name']}")
    print(tags_similarity(game1['Tags'], game2['Tags']))
    print("\n\n")
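
Because the vectors contain only 0s and 1s, the dot product counts the shared tags and each norm is the square root of a game's tag count. Cosine similarity therefore reduces to |A ∩ B| / sqrt(|A| · |B|), which can be computed from the tag sets directly. A minimal sketch of that equivalence (`tag_set_cosine` is a hypothetical helper, not used elsewhere in the notebook):

```python
import math

def tag_set_cosine(tags1: str, tags2: str) -> float:
    """Cosine similarity of two comma-separated tag strings via set sizes."""
    # Drop empty entries so an empty string yields an empty set.
    set1 = {t for t in tags1.split(",") if t}
    set2 = {t for t in tags2.split(",") if t}
    if not set1 or not set2:
        return 0.0
    # Shared tags divided by the geometric mean of the two tag counts.
    return len(set1 & set2) / math.sqrt(len(set1) * len(set2))

# Two shared tags out of four each: 2 / sqrt(4 * 4)
print(tag_set_cosine("Indie,Casual,Sports,Bowling", "Indie,Action,Sports,2D"))
# 0.5
```

This form also makes scores easier to interpret: if both games carry 20 tags, every shared tag adds 0.05, and 8 shared tags between games with 19 and 20 tags give 8 / sqrt(19 · 20) ≈ 0.4104. Whether the games compared in this notebook have those tag counts is an assumption, not something verified here.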

In [30]:
compare_games(557630, 885810)
compare_games(885810, 1901370)

compare_games(885810, 570940)
compare_games(236430, 570940)
compare_games(374320, 570940)

compare_games(24780, 570940)
compare_games(24780, 255710)

compare_games(557630, 255710)

compare_games(255710, 292030)
compare_games(374320, 292030)

Similarity between 
>>>Hello Charlotte EP2: Requiem Aeternam Deo 
and 
>>>The Witch's House MV
0.6499999999999999



Similarity between 
>>>The Witch's House MV 
and 
>>>Ib
0.7499999999999999



Similarity between 
>>>The Witch's House MV 
and 
>>>DARK SOULS™: REMASTERED
0.29999999999999993



Similarity between 
>>>DARK SOULS™ II 
and 
>>>DARK SOULS™: REMASTERED
0.7499999999999999



Similarity between 
>>>DARK SOULS™ III 
and 
>>>DARK SOULS™: REMASTERED
0.8499999999999999



Similarity between 
>>>SimCity™ 4 Deluxe Edition 
and 
>>>DARK SOULS™: REMASTERED
0.14999999999999997



Similarity between 
>>>SimCity™ 4 Deluxe Edition 
and 
>>>Cities: Skylines
0.6499999999999999



Similarity between 
>>>Hello Charlotte EP2: Requiem Aeternam Deo 
and 
>>>Cities: Skylines
0.04999999999999999



Similarity between 
>>>Cities: Skylines 
and 
>>>The Witcher® 3: Wild Hunt
0.19999999999999996



Similarity between 
>>>DARK SOULS™ III 
and 
>>>The Witcher® 3: Wild Hunt
0.5499999999999999





In [31]:
compare_games(1307710, 292030)
compare_games(1307710, 1551360)

Similarity between 
>>>GRID Legends 
and 
>>>The Witcher® 3: Wild Hunt
0.19999999999999996



Similarity between 
>>>GRID Legends 
and 
>>>Forza Horizon 5
0.5999999999999999





In [32]:
compare_games(1307710, 2108330)

Similarity between 
>>>GRID Legends 
and 
>>>F1® 23
0.4999999999999999





In [33]:
compare_games(1307710, 739630)

Similarity between 
>>>GRID Legends 
and 
>>>Phasmophobia
0.14999999999999997





In [34]:
compare_games(739630, 238320)

Similarity between 
>>>Phasmophobia 
and 
>>>Outlast
0.3499999999999999





In [35]:
compare_games(1150440, 238320)

Similarity between 
>>>Aliens: Dark Descent 
and 
>>>Outlast
0.4103913408340616





In [36]:
compare_games(1150440, 413150)

Similarity between 
>>>Aliens: Dark Descent 
and 
>>>Stardew Valley
0.1025978352085154





In [37]:
compare_games(1150440, 2108330)

Similarity between 
>>>Aliens: Dark Descent 
and 
>>>F1® 23
0.2051956704170308





### Conclusion:

Vectorizing games by their tags produces vectors that are cheap to compare, and in most of the examples above the scores match intuition: games from the same series or genre score high, while unrelated games score low.

Tags are coarser than descriptions, though: several games are more likely to share an identical tag set than an identical description, so tag similarity is probably best used as one component of a larger search / recommendation system.
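
As a rough illustration of how such vectors might feed a search component, the sketch below ranks games by cosine similarity to a query game using a single matrix product. Everything in it is hypothetical: the game names, the tiny `tag_matrix`, and the `most_similar` helper are made up for the example and are not part of this notebook.

```python
import numpy as np

# Hypothetical multi-hot matrix: one row per game, one column per tag.
names = ["Game A", "Game B", "Game C", "Game D"]
tag_matrix = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0],  # a game with no tags
], dtype=float)

def most_similar(query_index: int, matrix: np.ndarray, top_n: int = 2) -> list[int]:
    """Return indices of the top_n rows most cosine-similar to the query row."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Rows with zero norm stay all-zero, so their similarity is 0 rather than NaN.
    unit = np.divide(matrix, norms, out=np.zeros_like(matrix), where=norms > 0)
    scores = unit @ unit[query_index]
    scores[query_index] = -np.inf  # exclude the query itself
    return np.argsort(-scores, kind="stable")[:top_n].tolist()

print([names[i] for i in most_similar(0, tag_matrix)])
# ['Game B', 'Game C']
```

Normalizing the rows once means each query costs one matrix-vector product, and the resulting scores could then be blended with other signals in a larger system.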