### Steam game description vectorization

"About this game" (game description) is a long text field provided by the developer of a game that appears on its store page. It is often used to explain what makes the game stand out from other games in its genre, and to provide a short introduction to the game's story/lore.

Unlike tags, the description is:
* Provided by the game's developer (not by users).
* Free-form text with no predefined format.

This makes it a good candidate for an embedding algorithm that converts the text into a feature vector for content-based recommendations.

Game descriptions can also include images, which unfortunately are not vectorized by the methods applied here.

In [19]:
from pathlib import Path

import pandas as pd

# Raw dataset files, relative to the notebook's directory.
data_dir = Path('../data/raw')
csv_path = data_dir / "games.csv"
json_path = data_dir / "games.json"

df = pd.read_csv(csv_path)

### Filtering to English-language games with a name and description:

In [20]:
df.dropna(subset=['Name', 'About the game'], how='any', inplace=True)
# na=False treats rows without a language list as non-English; .copy() makes the
# later column assignments act on an independent frame rather than a view of df.
english_descriptions = df[df['Supported languages'].str.contains("English", na=False)].copy()

In [21]:
english_descriptions['About the game'] = english_descriptions['About the game'].astype(str)

In [22]:
english_descriptions[english_descriptions['Name'].isnull()]

Empty DataFrame: 0 rows (no remaining entry has a missing name)


In [23]:
english_descriptions[english_descriptions['About the game'].isnull()]

Empty DataFrame: 0 rows (no remaining entry has a missing description)


In [24]:
english_descriptions.info()

<class 'pandas.core.frame.DataFrame'>
Index: 77964 entries, 0 to 85102
Data columns (total 39 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   AppID                       77964 non-null  int64  
 1   Name                        77964 non-null  object 
 2   Release date                77964 non-null  object 
 3   Estimated owners            77964 non-null  object 
 4   Peak CCU                    77964 non-null  int64  
 5   Required age                77964 non-null  int64  
 6   Price                       77964 non-null  float64
 7   DLC count                   77964 non-null  int64  
 8   About the game              77964 non-null  object 
 9   Supported languages         77964 non-null  object 
 10  Full audio languages        77964 non-null  object 
 11  Reviews                     9651 non-null   object 
 12  Header image                77964 non-null  object 
 13  Website                     38350 no

### Normalizing description

* Removing punctuation (ASCII punctuation characters only)
* Changing all text to lowercase

In [25]:
import string
def process_line(line: str) -> str:
    """Strip ASCII punctuation (string.punctuation) and lowercase the text."""
    processed = line.translate(str.maketrans('', '', string.punctuation))
    return processed.lower()

In [26]:
process_line("Test text!.? OK")

'test text ok'

In [27]:
english_descriptions['About the game'] = english_descriptions['About the game'].apply(process_line)

In [28]:
english_descriptions['About the game'].head()

0    galactic bowling is an exaggerated and stylize...
1    the law looks to be a showdown atop a train th...
2    jolt project the army now has a new robotics p...
3    henosis™ is a mysterious 2d platform puzzler w...
4    about the game play as a hacker who has arrang...
Name: About the game, dtype: object
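
The output above still contains symbols such as `™`, because `string.punctuation` only covers ASCII characters. Deleting punctuation outright also fuses hyphenated words (for example "fast-paced" becomes "fastpaced"). A possible alternative, sketched below under the assumption that punctuation and symbols should become word boundaries, replaces every Unicode punctuation or symbol character with a space. The name `normalize_text` is hypothetical and is not used elsewhere in this notebook.

```python
import unicodedata

def normalize_text(text: str) -> str:
    """Lowercase text and turn every Unicode punctuation/symbol character into a space."""
    cleaned = "".join(
        # Unicode categories starting with "P" are punctuation, "S" are symbols.
        " " if unicodedata.category(ch)[0] in ("P", "S") else ch
        for ch in text
    )
    # Collapse the runs of whitespace left behind by the replacements.
    return " ".join(cleaned.split()).lower()

print(normalize_text("Test text!.? OK"))
print(normalize_text("HENOSIS™ is a fast-paced, over-the-top puzzler"))
```

One trade-off of this approach is that apostrophes also become spaces, so "don't" turns into two tokens.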

In [29]:
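from gensim.models import FastText

# One normalized description string per game.
# Note: gensim documents its training input as an iterable of token lists;
# a plain string is iterated character by character.
data_input = english_descriptions['About the game'].tolist()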

In [30]:
print(data_input[:3])

['galactic bowling is an exaggerated and stylized bowling game with an intergalactic twist players will engage in fastpaced single and multiplayer competition while being submerged in a unique new universe filled with overthetop humor wild characters unique levels and addictive game play the title is aimed at players of all ages and skill sets through accessible and intuitive controls and gameplay galactic bowling allows you to jump right into the action a singleplayer campaign and online play allow you to work your way up the ranks of the galactic bowling league whether you have hours to play or only a few minutes galactic bowling is a fast paced and entertaining experience that will leave you wanting more full singleplayer story campaign including 11 characters and environments 2 singleplayer play modes including regular and battle modes head to head online multiplayer play modes super powers special balls and whammies unlockable characters environments and minigames unlock all 30 st
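
The printed examples are whole strings. gensim's documentation describes the training corpus as an iterable of token lists, and iterating over a plain Python string yields one character at a time. That would make single characters the model's "words", which would be consistent with the single-character neighbours returned by `most_similar` later in this notebook. A minimal sketch of whitespace tokenization follows. `tokenize` and `toy_descriptions` are hypothetical names, and the resulting lists are the kind of input that would be passed as `corpus_iterable`.

```python
def tokenize(description: str) -> list[str]:
    """Split an already-normalized description into whitespace-separated tokens."""
    return description.split()

# Hypothetical, already-normalized descriptions.
toy_descriptions = [
    "galactic bowling is an exaggerated bowling game",
    "play as a hacker",
]

# Iterating a raw string yields characters, not words.
print(list(toy_descriptions[1])[:4])

# Tokenized form: one list of words per description.
tokenized = [tokenize(text) for text in toy_descriptions]
print(tokenized[1])
```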

In [78]:
model = FastText(vector_size=200, window=5, min_count=100)
model.build_vocab(corpus_iterable=data_input)
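
`min_count=100` asks the model to leave tokens seen fewer than 100 times out of the vocabulary. The counting idea can be sketched with the standard library. This is an illustration only, not gensim's implementation, and `build_vocab_counts` and the toy corpus are invented for the example.

```python
from collections import Counter

def build_vocab_counts(token_lists, min_count):
    """Count tokens across all documents and keep those seen at least min_count times."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {token: n for token, n in counts.items() if n >= min_count}

# Hypothetical tokenized corpus.
toy_corpus = [
    ["fast", "paced", "bowling", "game"],
    ["city", "building", "game"],
    ["horror", "game", "fast"],
]

print(build_vocab_counts(toy_corpus, min_count=2))
```

With a threshold as high as 100, what survives depends heavily on what counts as a token, so it is worth inspecting the vocabulary after building it.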

In [79]:
# Train for 25 epochs instead of the default 5, because the dataset isn't large.
model.train(data_input, total_examples=model.corpus_count, epochs=25)

(418588353, 2390533450)

In [80]:
model.save("fasttext_trained_v3")

### Testing the vectorization

In [81]:
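# Words close to "realistic" + "graphics" and far from "pixel".
# restrict_vocab=200 limits the search to the first 200 vocabulary entries
# (most frequent first, in gensim's default ordering).
similarities = model.wv.most_similar(positive=['realistic', 'graphics'], negative=['pixel'], topn=10, restrict_vocab=200)
most_similar = similarities[:5]
print(most_similar)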

[('s', 0.18077237904071808), ('c', 0.14272266626358032), ('―', 0.13422559201717377), ('u', 0.13260796666145325), ('m', 0.11276456713676453)]
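
The query above asks for entries close to "realistic" and "graphics" but far from "pixel". The underlying idea is vector arithmetic followed by a nearest-neighbour search by cosine similarity. The sketch below shows a simplified version with hand-made 2-D vectors. gensim additionally normalizes and averages the input vectors, so its exact scores differ, and every name and value here is illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def analogy(positive, negative, vectors, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + v for q, v in zip(query, vectors[word])]
    for word in negative:
        query = [q - v for q, v in zip(query, vectors[word])]
    # The query words themselves are excluded from the results.
    excluded = set(positive) | set(negative)
    ranked = sorted(
        ((w, cosine(query, vec)) for w, vec in vectors.items() if w not in excluded),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:topn]

# Hypothetical 2-D word vectors (illustrative values only).
toy = {
    "realistic": [1.0, 0.2],
    "graphics": [0.6, 0.6],
    "pixel": [0.1, 1.0],
    "simulation": [0.9, 0.0],
    "retro": [0.0, 0.9],
}

best_word, best_score = analogy(["realistic", "graphics"], ["pixel"], toy)[0]
print(best_word)
```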


In [82]:
import numpy as np
from numpy.linalg import norm

def cosine_similarity(A, B):
    # Cosine similarity is undefined if either vector is all zeros; return 0.0 in that case.
    either_is_zero = not (np.any(A) and np.any(B))
    if either_is_zero:
        return 0.0
    return np.dot(A, B) / (norm(A) * norm(B))

def compare_text(A, B):
    # Normalize both texts the same way as the training data, then compare their sentence vectors.
    A_normalized = process_line(A)
    B_normalized = process_line(B)
    A_vector = model.wv.get_sentence_vector(A_normalized)
    B_vector = model.wv.get_sentence_vector(B_normalized)

    print(cosine_similarity(A_vector, B_vector))
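
For reference, the two quantities computed above can be written as follows, treating the sentence vector as a plain mean of its token vectors $w_1, \dots, w_n$. This is a simplification: gensim's `get_sentence_vector` may additionally normalize each token vector before averaging.

```latex
\bar{v} = \frac{1}{n} \sum_{i=1}^{n} w_i,
\qquad
\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
```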

In [83]:
compare_text("realistic", "simulation")

0.9201918


In [84]:
compare_text("realistic", "fantasy")

0.7028196


In [85]:
compare_text("RPG", "roleplay")

0.5577753


In [86]:
compare_text("RPG", "RTS")

0.4688035


In [87]:
compare_text("RPG", "racing game")

0.69503295


In [88]:
compare_text("horror", "racing game")

0.48132095


In [89]:
compare_text("pixel art", "RTS")

0.68783134


In [90]:
compare_text("pixel art", "RPG")

0.720731


In [91]:
def compare_games(id1, id2):
    # Look up each game's row once by AppID.
    game_1 = english_descriptions[english_descriptions['AppID'] == id1].iloc[0]
    game_2 = english_descriptions[english_descriptions['AppID'] == id2].iloc[0]
    print(f"Similarity between \n{game_1['Name']} and \n{game_2['Name']}")
    # Descriptions are already normalized; compare_text normalizes again, which is harmless.
    compare_text(game_1['About the game'], game_2['About the game'])
    print("\n\n")

In [95]:
compare_games(557630, 885810)
compare_games(885810, 1901370)

compare_games(885810, 570940)
compare_games(236430, 570940)
compare_games(374320, 570940)

compare_games(24780, 570940)
compare_games(24780, 255710)

compare_games(557630, 255710)

compare_games(255710, 292030)
compare_games(374320, 292030)

Similarity between 
Hello Charlotte EP2: Requiem Aeternam Deo and 
The Witch's House MV
0.99708587



Similarity between 
The Witch's House MV and 
Ib
0.9983693



Similarity between 
The Witch's House MV and 
DARK SOULS™: REMASTERED
0.996813



Similarity between 
DARK SOULS™ II and 
DARK SOULS™: REMASTERED
0.9984196



Similarity between 
DARK SOULS™ III and 
DARK SOULS™: REMASTERED
0.9990789



Similarity between 
SimCity™ 4 Deluxe Edition and 
DARK SOULS™: REMASTERED
0.9978094



Similarity between 
SimCity™ 4 Deluxe Edition and 
Cities: Skylines
0.9989362



Similarity between 
Hello Charlotte EP2: Requiem Aeternam Deo and 
Cities: Skylines
0.9953719



Similarity between 
Cities: Skylines and 
The Witcher® 3: Wild Hunt
0.9977336



Similarity between 
DARK SOULS™ III and 
The Witcher® 3: Wild Hunt
0.9973756





### Conclusion:

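Unfortunately, it appears that the game descriptions are too long for FastText to capture useful information when their word vectors are averaged.

Training with different numbers of epochs did not produce satisfactory results.

Many games (and genres) have nearly identical description vectors: all ten compared pairs score between 0.995 and 0.999. Dissimilar games also sometimes score higher than similar ones. For example, SimCity™ 4 Deluxe Edition and DARK SOULS™: REMASTERED (0.9978) score higher than Hello Charlotte EP2: Requiem Aeternam Deo and The Witch's House MV (0.9971).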
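
One plausible contributor to such uniformly high scores is the averaging itself. If the vectors being averaged share a common component, the mean of a long document converges towards that component and the differences between documents shrink. The sketch below illustrates this with synthetic vectors only, not the trained model. The dimension, noise level and shared offset are arbitrary assumptions, and all function names are hypothetical.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def random_document(rng, n_words, dim=50, shared=0.5):
    """Average of n_words synthetic word vectors: a shared offset plus Gaussian noise."""
    words = [[shared + rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_words)]
    return mean_vector(words)

rng = random.Random(0)  # fixed seed so the run is repeatable
short_sim = cosine(random_document(rng, 1), random_document(rng, 1))
long_sim = cosine(random_document(rng, 1000), random_document(rng, 1000))
print(f"1-word documents:    {short_sim:.3f}")
print(f"1000-word documents: {long_sim:.3f}")
```

In this toy setting the two long documents are unrelated by construction, yet their averaged vectors end up almost parallel.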

Doc2vec might provide better results.