# **Machine Learning Group Project:** Recommender System Preparation

The purpose of this notebook is to extract content-based similarities between products from the textual data contained in descriptive columns such as _tags_, _genres_, _description_, and _steamspy_tags_. These similarities can later be used when building other models.
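
As a rough illustration of what a content-based similarity between two products could look like, the sketch below computes the cosine similarity between word-count vectors of two tag strings. The function name `cosine_similarity` and the three tag strings are made up for illustration; this is one possible measure under those assumptions, not necessarily the one used later in the project.

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two strings."""
    counts_a = Counter(text_a.split())
    counts_b = Counter(text_b.split())
    # Dot product over the words that both texts share
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no similarity to anything
    return dot / (norm_a * norm_b)

# Hypothetical tag strings, for illustration only
game_a = "action fps multiplayer"
game_b = "action fps classic"
game_c = "puzzle casual indie"

sim_ab = cosine_similarity(game_a, game_b)  # two of three words shared
sim_ac = cosine_similarity(game_a, game_c)  # no words shared
```

Games that share more tags score closer to 1, while games with no tags in common score 0.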

In [7]:
import ast
import pandas as pd

## Text Data DataFrame

First, we create a DataFrame that contains all descriptive textual data for the products, with a single entry per game. We use an inner join because the same kind of merge is performed between these datasets for the final_df in [a_csv_creation.ipynb](a_csv_creation.ipynb). This keeps the data we work with here compatible with later work.
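
To make the effect of the inner join concrete, here is a minimal sketch on toy data. The two small frames and their values are invented stand-ins for the real files. The `validate="one_to_one"` argument is an optional safeguard added for this sketch: it makes pandas raise an error if either side has duplicate keys, which would break the one-entry-per-game assumption.

```python
import pandas as pd

# Toy stand-ins for steam.csv and games_metadata.json (values are made up)
left = pd.DataFrame({"appid": [10, 20, 30], "name": ["A", "B", "C"]})
right = pd.DataFrame({"app_id": [10, 30, 40],
                      "tags": [["fps"], ["rpg"], ["indie"]]})

# An inner merge keeps only the ids present in both frames.
# validate="one_to_one" raises MergeError if a key is duplicated on either side.
merged = left.merge(right, left_on="appid", right_on="app_id",
                    how="inner", validate="one_to_one")

kept_ids = merged["appid"].tolist()
```

Here the ids 20 and 40 each appear in only one frame, so they are dropped from the result.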

In [8]:
# Import Metadata
df_games_meta_data = pd.read_json('data/games_metadata.json', lines=True)

# Import Games Data
df_games_additional = pd.read_csv("data/steam.csv", 
                                  usecols= ["appid", "name", "genres", "steamspy_tags",
                                           "categories"])

# Merge to a Single DataFrame
textual_df = df_games_additional.join(df_games_meta_data.set_index("app_id"),
                                      on = "appid", how = "inner")

# Show Head
textual_df.head()

| | appid | name | categories | genres | steamspy_tags | description | tags |
|---|---|---|---|---|---|---|---|
| 0 | 10 | Counter-Strike | Multi-player;Online Multi-Player;Local Multi-P... | Action | Action;FPS;Multiplayer | Play the world's number 1 online action game. ... | [Action, FPS, Multiplayer, Shooter, Classic, T... |
| 1 | 20 | Team Fortress Classic | Multi-player;Online Multi-Player;Local Multi-P... | Action | Action;FPS;Multiplayer | One of the most popular online action games of... | [Action, FPS, Multiplayer, Classic, Hero Shoot... |
| 2 | 30 | Day of Defeat | Multi-player;Valve Anti-Cheat enabled | Action | FPS;World War II;Multiplayer | Enlist in an intense brand of Axis vs. Allied ... | [FPS, World War II, Multiplayer, Shooter, Acti... |
| 3 | 40 | Deathmatch Classic | Multi-player;Online Multi-Player;Local Multi-P... | Action | Action;FPS;Multiplayer | Enjoy fast-paced multiplayer gaming with Death... | [Action, FPS, Classic, Multiplayer, Shooter, F... |
| 4 | 50 | Half-Life: Opposing Force | Single-player;Multi-player;Valve Anti-Cheat en... | Action | FPS;Action;Sci-fi | Return to the Black Mesa Research Facility as ... | [FPS, Action, Classic, Sci-fi, Singleplayer, S... |


## Data Preparation

#### Text Homogenization

The first step is to homogenize the text across the different columns. We do the following:
- Transform the _tags_ column from a list to a string
- Remove separators in the strings of columns such as _categories_, _steamspy_tags_, etc.
- Make all words fully lowercase
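
The three steps above can be sketched on single values before applying them to whole columns. The helper names `homogenize_tags` and `homogenize_separated` are hypothetical and exist only for this illustration; the punctuation set mirrors the one used in this notebook.

```python
# Punctuation characters to strip (raw string so the backslash is literal)
PUNC = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
PUNC_TABLE = str.maketrans("", "", PUNC)

def homogenize_tags(tags: list) -> str:
    """List -> string, drop punctuation, lowercase."""
    return str(tags).translate(PUNC_TABLE).lower()

def homogenize_separated(value: str) -> str:
    """Replace ';' separators with spaces and lowercase."""
    return value.replace(";", " ").lower()

tags_out = homogenize_tags(["Action", "Team-Based", "Sci-fi"])
cats_out = homogenize_separated("Multi-player;Online Multi-Player")
```

Note that stripping punctuation fuses hyphenated tags into single tokens (for example `Team-Based` becomes `teambased`), whereas the separator-only cleanup keeps hyphens intact.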

In [9]:
# Turn each list of tags into its string representation
textual_df["tags"] = textual_df["tags"].apply(str)

# Punctuation to drop (raw string so the backslash is taken literally)
punc = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''

# Remove punctuation and make everything lowercase in one vectorized pass
textual_df["tags"] = (textual_df["tags"]
                      .str.translate(str.maketrans("", "", punc))
                      .str.lower())

In [10]:
# Replace the ';' separators with spaces and lowercase the list-like columns
for col in ["categories", "genres", "steamspy_tags"]:
    textual_df[col] = textual_df[col].str.replace(";", " ", regex=False).str.lower()

In [11]:
# Inspect the homogenized DataFrame (the description column has not been cleaned yet)
textual_df

| | appid | name | categories | genres | steamspy_tags | description | tags |
|---|---|---|---|---|---|---|---|
| 0 | 10 | Counter-Strike | multi-player online multi-player local multi-p... | action | action fps multiplayer | Play the world's number 1 online action game. ... | action fps multiplayer shooter classic teambas... |
| 1 | 20 | Team Fortress Classic | multi-player online multi-player local multi-p... | action | action fps multiplayer | One of the most popular online action games of... | action fps multiplayer classic hero shooter sh... |
| 2 | 30 | Day of Defeat | multi-player valve anti-cheat enabled | action | fps world war ii multiplayer | Enlist in an intense brand of Axis vs. Allied ... | fps world war ii multiplayer shooter action wa... |
| 3 | 40 | Deathmatch Classic | multi-player online multi-player local multi-p... | action | action fps multiplayer | Enjoy fast-paced multiplayer gaming with Death... | action fps classic multiplayer shooter firstpe... |
| 4 | 50 | Half-Life: Opposing Force | single-player multi-player valve anti-cheat en... | action | fps action sci-fi | Return to the Black Mesa Research Facility as ... | fps action classic scifi singleplayer shooter ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 27066 | 1064060 | The Mystery of Bikini Island | single-player | adventure casual indie rpg early access | early access adventure sexual content | Solve puzzles and meet beautiful women in this... | sexual content nudity adventure casual indie rpg |
| 27067 | 1064580 | CaptainMarlene | single-player | adventure indie early access | early access indie adventure | In the game CaptainMarlene, you control the sp... | indie adventure casual arcade platformer actio... |
| 27070 | 1065230 | Room of Pandora | single-player steam achievements | adventure casual indie | adventure indie casual | The Room of Pandora is a third-person interact... | puzzle investigation point click hidden objec... |
| 27071 | 1065570 | Cyber Gun | single-player | action adventure indie | action indie adventure | Cyber Gun is a hardcore first-person shooter w... | indie action adventure fps firstperson minimal... |


#### Stopword Removal

The _description_ column contains stopwords, which could be detrimental to our analysis. Columns such as _categories_, _genres_, _steamspy_tags_, and _tags_ do not suffer from this issue.
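
A minimal sketch of how stopwords might be removed from a description is shown below. The stopword set is a deliberately tiny, hand-picked assumption, and `clean_description` is a hypothetical helper. A real run would more likely rely on a fuller list, such as the English stopword lists shipped with NLTK or scikit-learn.

```python
import re

# Small illustrative stopword set (an assumption; real lists are much longer)
STOPWORDS = {"a", "an", "and", "as", "in", "is", "of", "the", "to", "with"}

def clean_description(text: str) -> str:
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    # Drop apostrophes first so that "world's" becomes "worlds", not "world" + "s"
    lowered = text.lower().replace("'", "")
    tokens = re.findall(r"[a-z]+", lowered)
    return " ".join(t for t in tokens if t not in STOPWORDS)

sample = "Play the world's number 1 online action game."
cleaned = clean_description(sample)
```

Applied to the whole column, this could look like `textual_df["description"].apply(clean_description)`, assuming the column contains only strings.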

## Text Data Exploration

In [12]:
# Read the data
final_df = pd.read_csv("data/final_df.csv")

# Split and count steamspy_tags and genres
steamspy_tags = final_df['steamspy_tags'].str.split(';', expand=True).stack().value_counts()
genres = final_df['genres'].str.split(';', expand=True).stack().value_counts()

# Print the most common steamspy_tags and genres
print("Most common steamspy_tags:\n", steamspy_tags.head(10))
print("\nMost common genres:\n", genres.head(10))

# Store the split steamspy_tags and genres back in the DataFrame as lists
final_df['steamspy_tags'] = final_df['steamspy_tags'].str.split(';')
final_df['genres'] = final_df['genres'].str.split(';')

# Cast the playtime, review, and price columns to numeric types
final_df['average_playtime'] = final_df['average_playtime'].astype(float)
final_df['reviews'] = final_df['reviews'].astype(int)
final_df['price_final'] = final_df['price_final'].astype(float)

# Group by steamspy_tags and genres, and calculate the average popularity, sales metrics, and price_final
tags_popularity = final_df.explode('steamspy_tags').groupby('steamspy_tags').agg({
    'average_playtime': 'mean',
    'reviews': 'sum',
    'price_final': 'mean'
}).sort_values('reviews', ascending=False)

genres_popularity = final_df.explode('genres').groupby('genres').agg({
    'average_playtime': 'mean',
    'reviews': 'sum',
    'price_final': 'mean'
}).sort_values('reviews', ascending=False)

# Print the popularity, sales metrics, and price_final for the most common steamspy_tags and genres
print("\nPopularity, sales metrics, and price_final by steamspy_tags:\n", tags_popularity.head(10))
print("\nPopularity, sales metrics, and price_final by genres:\n", genres_popularity.head(10))

Most common steamspy_tags:
 Multiplayer     1306846
Open World       975555
Survival         709232
Free to Play     528619
Simulation       485486
Early Access     434366
Action           410372
RPG              406981
FPS              373718
Strategy         356269
dtype: int64

Most common genres:
 Action                   2317164
Adventure                1352712
Indie                    1274129
Simulation               1245893
RPG                      1180447
Massively Multiplayer     871115
Strategy                  813447
Free to Play              701041
Early Access              434366
Sports                    182598
dtype: int64

Popularity, sales metrics, and price_final by steamspy_tags:
                average_playtime  reviews  price_final
steamspy_tags                                        
Multiplayer        10551.283588  2211400    18.785185
Open World          5465.449322  1658514    24.866857
Survival            7628.942128  1200746    28.686975
Free to Play        5

In [13]:
final_df["description"]

0          Carve a path of destruction through an intense...
1          Carve a path of destruction through an intense...
2          Project Zomboid is the ultimate in zombie surv...
3          Carve a path of destruction through an intense...
4          Carve a path of destruction through an intense...
                                 ...                        
3927517    Cuphead is a classic run and gun action game h...
3927518    Cuphead is a classic run and gun action game h...
3927519    Cuphead is a classic run and gun action game h...
3927520    Cuphead is a classic run and gun action game h...
3927521    Cuphead is a classic run and gun action game h...
Name: description, Length: 3927522, dtype: object
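
The repeated descriptions and the length of 3,927,522 rows above suggest that final_df holds many rows per game, so the tag and genre counts printed earlier are weighted by row rather than by game. If per-game counts are wanted, one option is to drop duplicate games before splitting. The sketch below uses toy data and assumes a key column named `app_id`, which is a hypothetical name not confirmed by this notebook.

```python
import pandas as pd

# Toy frame with several rows per game; "app_id" is a hypothetical key column
df = pd.DataFrame({
    "app_id": [1, 1, 1, 2],
    "steamspy_tags": ["Action;FPS", "Action;FPS", "Action;FPS", "RPG;Action"],
})

# Row-weighted counts: every repeated row contributes again
row_counts = df["steamspy_tags"].str.split(";").explode().value_counts()

# Per-game counts: keep one row per game before splitting
game_counts = (
    df.drop_duplicates("app_id")["steamspy_tags"]
      .str.split(";")
      .explode()
      .value_counts()
)
```

In the toy data, `Action` is counted four times by row but only twice by game, which shows how strongly the row weighting can shift the ranking.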

In [14]:
# Note: steamspy_tags holds lists after the split in In [12], so .str.replace yields NaN here
final_df["steamspy_tags"].str.replace(';',' ')

0         NaN
1         NaN
2         NaN
3         NaN
4         NaN
           ..
3927517   NaN
3927518   NaN
3927519   NaN
3927520   NaN
3927521   NaN
Name: steamspy_tags, Length: 3927522, dtype: float64
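
The all-NaN result above is consistent with _steamspy_tags_ having been split into lists in In [12]: pandas string methods such as `.str.replace` return NaN for elements that are not strings. A minimal sketch on toy data, assuming the column holds lists of strings, shows that `.str.join` rebuilds space-separated strings instead:

```python
import pandas as pd

# Toy column shaped like steamspy_tags after the split in In [12]
tags = pd.Series([["Action", "FPS"], ["RPG"]])

# .str.replace only handles string elements, so list elements come back as NaN
replaced = tags.str.replace(";", " ")

# .str.join concatenates the items of each list with the given separator
joined = tags.str.join(" ")
```

Alternatively, running the replacement before the column is split into lists would avoid the problem altogether.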