### Steam game description vectorization

This notebook follows the same steps as fasttext_vectorization, but uses gensim's implementation of doc2vec instead of fasttext.

In [11]:
from pathlib import Path
data_dir = Path('../data/raw')
csv_path = data_dir / "games.csv"
json_path = data_dir / "games.json"

import pandas as pd

df = pd.read_csv(csv_path)

### Filtering to English-language games with a name and description:

In [12]:
df.dropna(subset=['Name', 'About the game'], how='any', inplace=True)
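# na=False treats a missing language list as "not English"; .copy() makes the result
# an independent frame, so later column assignments do not act on a view of df
english_descriptions = df[df['Supported languages'].str.contains("English", na=False)].copy()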

In [13]:
pd.options.mode.chained_assignment = None
english_descriptions['About the game'] = english_descriptions['About the game'].astype(str)

In [14]:
english_descriptions[english_descriptions['Name'].isnull()]

Unnamed: 0,AppID,Name,Release date,Estimated owners,Peak CCU,Required age,Price,DLC count,About the game,Supported languages,...,Average playtime two weeks,Median playtime forever,Median playtime two weeks,Developers,Publishers,Categories,Genres,Tags,Screenshots,Movies


In [15]:
english_descriptions[english_descriptions['About the game'].isnull()]

Unnamed: 0,AppID,Name,Release date,Estimated owners,Peak CCU,Required age,Price,DLC count,About the game,Supported languages,...,Average playtime two weeks,Median playtime forever,Median playtime two weeks,Developers,Publishers,Categories,Genres,Tags,Screenshots,Movies


In [16]:
english_descriptions.info()

<class 'pandas.core.frame.DataFrame'>
Index: 77964 entries, 0 to 85102
Data columns (total 39 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   AppID                       77964 non-null  int64  
 1   Name                        77964 non-null  object 
 2   Release date                77964 non-null  object 
 3   Estimated owners            77964 non-null  object 
 4   Peak CCU                    77964 non-null  int64  
 5   Required age                77964 non-null  int64  
 6   Price                       77964 non-null  float64
 7   DLC count                   77964 non-null  int64  
 8   About the game              77964 non-null  object 
 9   Supported languages         77964 non-null  object 
 10  Full audio languages        77964 non-null  object 
 11  Reviews                     9651 non-null   object 
 12  Header image                77964 non-null  object 
 13  Website                     38350 no

### Normalizing description

* Removing punctuation
* Changing all text to lowercase

In [17]:
import string
def process_line(line: str) -> str:
    """Remove ASCII punctuation (string.punctuation) and lowercase the text."""
    processed = line.translate(str.maketrans('', '', string.punctuation))
    return processed.lower()

In [18]:
process_line("Test text!.? OK")

'test text ok'

In [19]:
english_descriptions['About the game'] = english_descriptions['About the game'].apply(process_line)

In [20]:
english_descriptions['About the game'].head()

0    galactic bowling is an exaggerated and stylize...
1    the law looks to be a showdown atop a train th...
2    jolt project the army now has a new robotics p...
3    henosis™ is a mysterious 2d platform puzzler w...
4    about the game play as a hacker who has arrang...
Name: About the game, dtype: object
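
The output above shows one side effect of this normalization: `string.punctuation` covers only ASCII punctuation, so symbols such as `™` survive. Deleting hyphens also fuses compounds ("fast-paced" becomes "fastpaced"). A minimal sketch of an alternative, assuming one would rather split such compounds and drop non-alphanumeric symbols; `normalize_spaced` is a hypothetical helper, not part of this notebook:

```python
import re


def normalize_spaced(line: str) -> str:
    """Lowercase the text and replace each run of non-alphanumeric characters with one space."""
    # [\W_] matches anything that is not a Unicode letter or digit,
    # so hyphens, ASCII punctuation, and symbols such as ™ all become spaces.
    spaced = re.sub(r"[\W_]+", " ", line.lower())
    return spaced.strip()


print(normalize_spaced("Henosis™ is a fast-paced, over-the-top puzzler!"))
```

The trade-off is that apostrophes also become spaces ("don't" turns into "don t"), and `simple_preprocess` with its default `min_len=2` would then drop the stray single-letter token.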

In [21]:
from gensim.utils import simple_preprocess
from gensim.models import doc2vec

descriptions = english_descriptions['About the game'].tolist()

In [22]:
print(descriptions[:3])

['galactic bowling is an exaggerated and stylized bowling game with an intergalactic twist players will engage in fastpaced single and multiplayer competition while being submerged in a unique new universe filled with overthetop humor wild characters unique levels and addictive game play the title is aimed at players of all ages and skill sets through accessible and intuitive controls and gameplay galactic bowling allows you to jump right into the action a singleplayer campaign and online play allow you to work your way up the ranks of the galactic bowling league whether you have hours to play or only a few minutes galactic bowling is a fast paced and entertaining experience that will leave you wanting more full singleplayer story campaign including 11 characters and environments 2 singleplayer play modes including regular and battle modes head to head online multiplayer play modes super powers special balls and whammies unlockable characters environments and minigames unlock all 30 st

In [23]:
from gensim.models.doc2vec import TaggedDocument

train_corpus = []
for i, text in enumerate(descriptions):
    tokens = simple_preprocess(text)
    train_corpus.append(TaggedDocument(tokens, [i]))
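
Note that the tags assigned above are list positions (0, 1, 2, ...), whereas `english_descriptions` keeps its original index labels after filtering (77964 rows with labels running from 0 to 85102, per `info()`). A tag returned by the model therefore has to be resolved by position, not by label. A small sketch on a toy frame; the data and the `tag_to_appid` lookup are hypothetical:

```python
import pandas as pd

# Toy stand-in for the filtered frame: the index has gaps, as it does after filtering.
games = pd.DataFrame(
    {"AppID": [10, 20, 30], "Name": ["A", "B", "C"]},
    index=[0, 2, 5],
)

# Tags produced by enumerate() are positions 0, 1, 2, not the index labels 0, 2, 5.
tags = list(range(len(games)))

# Resolve a tag by position with .iloc; games.loc[1] would raise a KeyError here.
tag = 1
print(games.iloc[tag]["Name"])

# Alternative: keep an explicit tag -> AppID list alongside the model.
tag_to_appid = games["AppID"].tolist()
print(tag_to_appid[tag])
```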

In [24]:
# epochs=40: longer training than the default because the dataset isn't large
model = doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
model.build_vocab(train_corpus)

In [25]:
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

In [26]:
model.save("doc2vec_trained")

### Testing the vectorization

In [27]:
import numpy as np
from numpy.linalg import norm
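def cosine_similarity(A, B):
    # Cosine similarity is undefined if either vector is all zeros; return 0.0 instead
    has_zero_vector = not (np.any(A) and np.any(B))
    if has_zero_vector:
        return 0.0
    return np.dot(A, B) / (norm(A) * norm(B))

def compare_text(A, B):
    # Apply the same normalization and tokenization that were used for the training corpus
    A_normalized = simple_preprocess(process_line(A))
    B_normalized = simple_preprocess(process_line(B))
    # Infer a vector for each text with the trained model
    A_vector = model.infer_vector(A_normalized)
    B_vector = model.infer_vector(B_normalized)

    print(cosine_similarity(A_vector, B_vector))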
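
Keep in mind that `infer_vector` re-runs training steps for the new text, so repeated calls on the same tokens can return slightly different vectors, and the scores below may shift a little between runs. A minimal sketch of one way to reduce that noise is to average the similarity over several inferences. `mean_similarity` and the stub `fake_infer` are hypothetical; with the trained model one would pass `model.infer_vector` as `infer`:

```python
import numpy as np


def mean_similarity(infer, tokens_a, tokens_b, runs=5):
    """Average the cosine similarity of two token lists over several inference runs."""
    scores = []
    for _ in range(runs):
        a = np.asarray(infer(tokens_a))
        b = np.asarray(infer(tokens_b))
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        # Treat a zero vector as "no similarity" instead of dividing by zero.
        scores.append(0.0 if denom == 0 else float(np.dot(a, b) / denom))
    return float(np.mean(scores))


# Stand-in for model.infer_vector: a fixed vector per token plus small random noise.
rng = np.random.default_rng(0)
base = {"rpg": np.array([1.0, 0.0]), "racing": np.array([0.0, 1.0])}


def fake_infer(tokens):
    return sum(base[t] for t in tokens) + rng.normal(scale=0.01, size=2)


same = mean_similarity(fake_infer, ["rpg"], ["rpg"])
different = mean_similarity(fake_infer, ["rpg"], ["racing"])
print(same > different)
```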

In [28]:
compare_text("realistic", "simulation")

0.8014595


In [29]:
compare_text("realistic", "fantasy")

0.47903386


In [30]:
compare_text("RPG", "roleplay")

0.6531818


In [31]:
compare_text("RPG", "RTS")

0.6985966


In [32]:
compare_text("RPG", "racing game")

0.39194736


In [33]:
compare_text("horror", "racing game")

0.44411784


In [34]:
compare_text("pixel art", "RTS")

0.5793386


In [35]:
compare_text("pixel art", "RPG")

0.6517919


In [36]:
def compare_games(id1, id2):
    # Look up each game's row once by AppID
    game_1 = english_descriptions[english_descriptions['AppID'] == id1].iloc[0]
    game_2 = english_descriptions[english_descriptions['AppID'] == id2].iloc[0]
    print(f"Similarity between \n{game_1['Name']} and \n{game_2['Name']}")
    compare_text(game_1['About the game'], game_2['About the game'])
    print("\n\n")

In [37]:
compare_games(557630, 885810)
compare_games(885810, 1901370)

compare_games(885810, 570940)
compare_games(236430, 570940)
compare_games(374320, 570940)

compare_games(24780, 570940)
compare_games(24780, 255710)

compare_games(557630, 255710)

compare_games(255710, 292030)
compare_games(374320, 292030)

Similarity between 
Hello Charlotte EP2: Requiem Aeternam Deo and 
The Witch's House MV
0.22953954



Similarity between 
The Witch's House MV and 
Ib
0.51349425



Similarity between 
The Witch's House MV and 
DARK SOULS™: REMASTERED
0.2900842



Similarity between 
DARK SOULS™ II and 
DARK SOULS™: REMASTERED
0.7924236



Similarity between 
DARK SOULS™ III and 
DARK SOULS™: REMASTERED
0.8505721



Similarity between 
SimCity™ 4 Deluxe Edition and 
DARK SOULS™: REMASTERED
0.08018278



Similarity between 
SimCity™ 4 Deluxe Edition and 
Cities: Skylines
0.66219866



Similarity between 
Hello Charlotte EP2: Requiem Aeternam Deo and 
Cities: Skylines
-0.056560945



Similarity between 
Cities: Skylines and 
The Witcher® 3: Wild Hunt
0.16696687



Similarity between 
DARK SOULS™ III and 
The Witcher® 3: Wild Hunt
0.16673931





In [38]:
compare_games(1307710, 292030)
compare_games(1307710, 1551360)

Similarity between 
GRID Legends and 
The Witcher® 3: Wild Hunt
0.15940717



Similarity between 
GRID Legends and 
Forza Horizon 5
0.67024606





In [40]:
compare_games(1307710, 2108330)

Similarity between 
GRID Legends and 
F1® 23
0.5873475





In [41]:
compare_games(1307710, 739630)

Similarity between 
GRID Legends and 
Phasmophobia
0.16849494





In [42]:
compare_games(739630, 238320)

Similarity between 
Phasmophobia and 
Outlast
0.45465654





In [43]:
compare_games(1150440, 238320)

Similarity between 
Aliens: Dark Descent and 
Outlast
0.45350683





In [44]:
compare_games(1150440, 413150)

Similarity between 
Aliens: Dark Descent and 
Stardew Valley
0.039598938





In [45]:
compare_games(1150440, 2108330)

Similarity between 
Aliens: Dark Descent and 
F1® 23
0.08933396





#### Conclusion:

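The vectors created by doc2vec appear to represent game characteristics much better than those from fasttext. Closely related games score high (the DARK SOULS™ titles reach 0.79 to 0.85, and SimCity™ 4 Deluxe Edition and Cities: Skylines reach 0.66), while unrelated pairs stay near zero (Aliens: Dark Descent and Stardew Valley score 0.04). Short keyword queries are less convincing: "RPG" scores closer to "RTS" (0.70) than to "roleplay" (0.65).
It's likely that the game descriptions are long enough for doc2vec to be the better fit for this task.

Vectorization itself is slower than with fasttext, but using pre-calculated vectors and a vector database should make finding similar games fast enough for real-time usage.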
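
The lookup described above can be sketched without a vector database: once all description vectors are pre-calculated and L2-normalized, cosine similarity against every game reduces to a single matrix-vector product. This is a brute-force stand-in for illustration, not the vector database itself; `top_k_similar` and the random toy vectors are hypothetical, and in practice the rows would come from the trained model:

```python
import numpy as np


def top_k_similar(matrix, query, k=3):
    """Return (row index, cosine similarity) pairs for the k rows most similar to query.

    Assumes every row of matrix is already L2-normalized and query is non-zero.
    """
    q = query / np.linalg.norm(query)
    scores = matrix @ q  # one dot product per stored vector
    best = np.argsort(scores)[::-1][:k]  # indices sorted by descending similarity
    return [(int(i), float(scores[i])) for i in best]


rng = np.random.default_rng(42)
vectors = rng.normal(size=(1000, 50))  # toy stand-in for 50-dimensional doc2vec vectors
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # normalize once, up front

# Querying with a stored vector should return that same row first.
results = top_k_similar(vectors, vectors[7], k=3)
print(results[0])
```

A full scan like this costs time proportional to the number of stored vectors per query; a vector database with an approximate index trades some accuracy for faster lookups on larger collections.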