# **Machine Learning Group Project:** Recommender System Preparation

The purpose of this notebook is to extract content-based similarities between products from the textual data contained in descriptive columns such as _tags_, _genres_, _description_, and _steamspy_tags_. These findings can later be used when building other models.

In [20]:
import requests
import pandas as pd

## Text Data DataFrame

First, we create a DataFrame that holds all descriptive textual data for the products, with a single entry per game. We run an inner merge because the same kind of merge is used between these datasets for the final_df in [a_csv_creation.ipynb](a_csv_creation.ipynb). This keeps the data we are working with compatible with later work.

In [3]:
# Import Metadata
df_games_meta_data = pd.read_json('data/games_metadata.json', lines=True)

# Import Games Data
df_games_additional = pd.read_csv("data/steam.csv", 
                                  usecols= ["appid", "name", "genres", "steamspy_tags",
                                           "categories"])

# Merge to a Single DataFrame
textual_df = df_games_additional.join(df_games_meta_data.set_index("app_id"),
                                      on = "appid", how = "inner")

# Show Head
textual_df.head()

| | appid | name | categories | genres | steamspy_tags | description | tags |
|---|---|---|---|---|---|---|---|
| 0 | 10 | Counter-Strike | Multi-player;Online Multi-Player;Local Multi-P... | Action | Action;FPS;Multiplayer | Play the world's number 1 online action game. ... | [Action, FPS, Multiplayer, Shooter, Classic, T... |
| 1 | 20 | Team Fortress Classic | Multi-player;Online Multi-Player;Local Multi-P... | Action | Action;FPS;Multiplayer | One of the most popular online action games of... | [Action, FPS, Multiplayer, Classic, Hero Shoot... |
| 2 | 30 | Day of Defeat | Multi-player;Valve Anti-Cheat enabled | Action | FPS;World War II;Multiplayer | Enlist in an intense brand of Axis vs. Allied ... | [FPS, World War II, Multiplayer, Shooter, Acti... |
| 3 | 40 | Deathmatch Classic | Multi-player;Online Multi-Player;Local Multi-P... | Action | Action;FPS;Multiplayer | Enjoy fast-paced multiplayer gaming with Death... | [Action, FPS, Classic, Multiplayer, Shooter, F... |
| 4 | 50 | Half-Life: Opposing Force | Single-player;Multi-player;Valve Anti-Cheat en... | Action | FPS;Action;Sci-fi | Return to the Black Mesa Research Facility as ... | [FPS, Action, Classic, Sci-fi, Singleplayer, S... |
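
To make the effect of the inner merge concrete, here is a minimal sketch on two toy frames. The rows and values are invented for illustration; only the key names `appid` and `app_id` come from the cell above. It shows that only games present in both sources survive the join.

```python
import pandas as pd

# Toy stand-ins for steam.csv and games_metadata.json (values are made up)
toy_additional = pd.DataFrame({
    "appid": [10, 20, 30],
    "name": ["Game A", "Game B", "Game C"],
})
toy_meta = pd.DataFrame({
    "app_id": [20, 30, 40],
    "description": ["desc B", "desc C", "desc D"],
})

# Same pattern as above: join on appid against the metadata indexed by app_id
toy_joined = toy_additional.join(toy_meta.set_index("app_id"), on="appid", how="inner")

# Only appids found in both frames remain, one row per game
print(toy_joined)
```

Game A (only in the first frame) and appid 40 (only in the metadata) are both dropped, which is why the merged DataFrame can be smaller than either input.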


## Data Preparation

#### Text Homogenization

The first step is homogenizing the text across the different columns. We do the following:
- Make all words fully lowercase
- Transform the _tags_ column from a list to a string
- Remove separators in the strings of columns such as _categories_, _steamspy_tags_, etc.
- Remove any other punctuation
- Homogenize words with spelling discrepancies (e.g. multiplayer & multi-player)

In [4]:
# Turn to String
textual_df["tags"] = textual_df["tags"].apply(str)

# Punctuation to drop (raw string so the backslash is kept literally)
punc = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''

# Remove punctuation and make everything lowercase (vectorized, no row loop)
textual_df["tags"] = textual_df["tags"].str.translate(str.maketrans("", "", punc)).str.lower()

In [51]:
# Remove all separators from the list columns and turn everything to lowercase
for col in ["categories", "genres", "steamspy_tags", "tags"]:
    cleaned = textual_df[col].str.replace(";", " ", regex=False).str.lower()

    # Homogenize spelling variants, in the same order as before
    for old, new in [("multi-player", "multiplayer"), ("free to play", "free-to-play"),
                     ("singleplayer", "single-player"), ("post-apocalyptic", "postapocalyptic"),
                     ("scifi", "sci-fi")]:
        cleaned = cleaned.str.replace(old, new, regex=False)

    textual_df[col] = cleaned

In [29]:
# Define numbers
num = '0123456789'

# Remove punctuation and numbers, and turn description to lowercase
# (dropping the hyphen also turns "multi-player" into "multiplayer")
textual_df["description"] = textual_df["description"].str.lower()\
.str.translate(str.maketrans("", "", punc + num))
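
As a compact illustration of the homogenization steps listed above, the sketch below applies them to a single string. The helper `homogenize` is hypothetical and not part of the notebook. It also makes two simplifying assumptions: it uses `string.punctuation`, which is slightly broader than the `punc` string defined above, and it strips punctuation from every input, whereas the notebook keeps hyphens in columns such as _categories_.

```python
import string

# Hypothetical helper (not in the notebook): applies the listed steps to one string
def homogenize(text: str) -> str:
    # Lowercase and turn the ';' separators into spaces
    text = text.lower().replace(";", " ")
    # Drop punctuation and digits in a single pass
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    # Collapse any repeated whitespace left behind by the removals
    return " ".join(text.split())

sample = "Multi-player;Online Multi-Player;Valve Anti-Cheat enabled 2"
print(homogenize(sample))
```

Because hyphens are removed, spelling variants such as "Multi-player" and "Multiplayer" end up as the same token without a dedicated replacement rule.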

#### Stopword Removal

Columns like _categories_, _genres_, _steamspy_tags_, and _tags_ do not suffer from this issue, but the _description_ column contains stopwords, which could be detrimental to our analysis. The stopword list is obtained from a public [GitHub gist](https://gist.githubusercontent.com/rg089/35e00abf8941d72d419224cfd5b5925d/raw/12d899b70156fd0041fa9778d657330b024b959c/stopwords.txt).

In [25]:
# Obtain stopwords
stopwords_list = requests.get("https://gist.githubusercontent.com/rg089/35e00abf8941d72d419224cfd5b5925d/raw/12d899b70156fd0041fa9778d657330b024b959c/stopwords.txt").content
stopwords = set(stopwords_list.decode().splitlines()) 

# Remove Stopwords
textual_df['description'] = textual_df['description']\
.apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords)]))
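
The filter above can be sketched on a single sentence. The tiny stopword set below is hypothetical and stands in for the downloaded list. Since punctuation has already been stripped from the descriptions, a contraction such as "don't" arrives as "dont"; the sketch assumes one might also want apostrophe-free variants of the stopwords so that they still match. That normalization is an assumption added here, not something the notebook does.

```python
# Hypothetical miniature stopword list (the notebook downloads a much longer one)
toy_stopwords = {"the", "a", "of", "in", "don't"}

# Assumption: add apostrophe-free variants so contractions match punctuation-free text
normalized_stopwords = toy_stopwords | {w.replace("'", "") for w in toy_stopwords}

def remove_stopwords(text: str, stopwords: set) -> str:
    # Keep only the tokens that are not in the stopword set
    return " ".join(word for word in text.split() if word not in stopwords)

cleaned = remove_stopwords("dont miss the best game of the year", normalized_stopwords)
print(cleaned)
```

With the unmodified `toy_stopwords`, the token "dont" would be kept, which is the mismatch the extra normalization is meant to avoid.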

#### Full Text Column Creation

Finally, we create a new column which contains all the textual information we have on a game.   

In [52]:
# Concatenate all text columns of a game into one space-separated string
text_cols = ["categories", "genres", "steamspy_tags", "description", "tags"]
textual_df["full_text"] = textual_df[text_cols].apply(" ".join, axis=1)

In [53]:
textual_df.head()

| | appid | name | categories | genres | steamspy_tags | description | tags | full_text |
|---|---|---|---|---|---|---|---|---|
| 0 | 10 | Counter-Strike | multiplayer online multiplayer local multiplay... | action | action fps multiplayer | play worlds number online action game engage i... | action fps multiplayer shooter classic teambas... | multiplayer online multiplayer local multiplay... |
| 1 | 20 | Team Fortress Classic | multiplayer online multiplayer local multiplay... | action | action fps multiplayer | popular online action games time team fortress... | action fps multiplayer classic hero shooter sh... | multiplayer online multiplayer local multiplay... |
| 2 | 30 | Day of Defeat | multiplayer valve anti-cheat enabled | action | fps world war ii multiplayer | enlist intense brand axis allied teamplay set ... | fps world war ii multiplayer shooter action wa... | multiplayer valve anti-cheat enabled action fp... |
| 3 | 40 | Deathmatch Classic | multiplayer online multiplayer local multiplay... | action | action fps multiplayer | enjoy fastpaced multiplayer gaming deathmatch ... | action fps classic multiplayer shooter firstpe... | multiplayer online multiplayer local multiplay... |
| 4 | 50 | Half-Life: Opposing Force | single-player multiplayer valve anti-cheat ena... | action | fps action sci-fi | return black mesa facility military specialist... | fps action classic sci-fi single-player shoote... | single-player multiplayer valve anti-cheat ena... |
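
The row-wise concatenation can be sketched on a toy frame. All values below are invented. Filling missing text with an empty string is an assumption added here, because joining strings fails on missing values; whether the real data contains any is not checked in this notebook.

```python
import pandas as pd

text_cols = ["categories", "genres", "steamspy_tags", "description", "tags"]

# Toy rows (invented values); the second one lacks a description
toy_df = pd.DataFrame({
    "categories": ["multiplayer", "single-player"],
    "genres": ["action", "indie"],
    "steamspy_tags": ["action fps", "puzzle"],
    "description": ["play online", None],
    "tags": ["action shooter", "puzzle relaxing"],
})

# Assumption: treat missing text as empty, then join the columns row by row
toy_df["full_text"] = (
    toy_df[text_cols]
    .fillna("")
    .apply(" ".join, axis=1)
    .str.split()      # split on any whitespace ...
    .str.join(" ")    # ... and rejoin to drop doubled spaces
)
print(toy_df["full_text"].tolist())
```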


## Text Data Exploration

In [1]:
# Read the data
final_df = pd.read_csv("data/final_df.csv")

# Split and count steamspy_tags and genres
steamspy_tags = final_df['steamspy_tags'].str.split(';', expand=True).stack().value_counts()
genres = final_df['genres'].str.split(';', expand=True).stack().value_counts()

# Print the most common steamspy_tags and genres
print("Most common steamspy_tags:\n", steamspy_tags.head(10))
print("\nMost common genres:\n", genres.head(10))

# Combine the split steamspy_tags and genres back into the DataFrame
final_df['steamspy_tags'] = final_df['steamspy_tags'].str.split(';')
final_df['genres'] = final_df['genres'].str.split(';')

# Cast the playtime, review, and price columns to numeric types
final_df['average_playtime'] = final_df['average_playtime'].astype(float)
final_df['reviews'] = final_df['reviews'].astype(int)
final_df['price_final'] = final_df['price_final'].astype(float)

# Group by steamspy_tags and genres, and calculate the average popularity, sales metrics, and price_final
tags_popularity = final_df.explode('steamspy_tags').groupby('steamspy_tags').agg({
    'average_playtime': 'mean',
    'reviews': 'sum',
    'price_final': 'mean'
}).sort_values('reviews', ascending=False)

genres_popularity = final_df.explode('genres').groupby('genres').agg({
    'average_playtime': 'mean',
    'reviews': 'sum',
    'price_final': 'mean'
}).sort_values('reviews', ascending=False)

# Print the popularity, sales metrics, and price_final for the most common steamspy_tags and genres
print("\nPopularity, sales metrics, and price_final by steamspy_tags:\n", tags_popularity.head(10))
print("\nPopularity, sales metrics, and price_final by genres:\n", genres_popularity.head(10))

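
The split, explode, and group-by pattern used in the cell above can be sketched on a toy frame. The rows and numbers are invented; only the column names `genres`, `reviews`, and `price_final` are taken from the cell.

```python
import pandas as pd

# Toy data (invented values) mirroring the columns used above
toy_final = pd.DataFrame({
    "genres": ["Action;Indie", "Action", "Indie;Strategy"],
    "reviews": [100, 50, 10],
    "price_final": [10.0, 20.0, 5.0],
})

# Split the ';'-separated string into a list, then give each genre its own row
exploded = toy_final.assign(genres=toy_final["genres"].str.split(";")).explode("genres")

# Aggregate per genre: total reviews and mean price
toy_popularity = (
    exploded.groupby("genres")
    .agg({"reviews": "sum", "price_final": "mean"})
    .sort_values("reviews", ascending=False)
)
print(toy_popularity)
```

Note that a game with several genres contributes its full review count to each of them, so the per-genre totals add up to more than the overall number of reviews.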

## Content-Based Similarity