# Machine Learning Group Project 

User game rating prediction & systematic discount offering on Steam. Project developed by Team XX, composed of:
| Student Name | Student Number | Class Group |
| --- | --- | --- |
| **Alessandro Maugeri** | 53067 | TA |
| **Frank Andreas Bauer** | XXXX | XX |
|  **Johannes Rahn** | XXXX | XX |
| **Nicole Zoppi** | XXXX | XX |
| **Yannick von der Heyden** | XXXX | XX |

## Importing Packages

In [37]:
import ast
import pandas as pd
import numpy as np
from datetime import datetime

## Importing Data

The data for this project was retrieved from [Kaggle](https://www.kaggle.com/datasets/antonkozyriev/game-recommendations-on-steam?select=games.csv) and stored in the "data" folder found in the notebook's directory. The folder includes **five data files**:

The CSV file **[games.csv](data/games.csv)** presents data concerning individual games in the Steam library:

| Column | Description | Example|
| --- | --- | --- |
| **app_id** | Product ID on Steam | 620 |
| **title** | Product Commercial Title | Portal 2|
|  **date_release** | Release Date of Title (y-m-d) | 2011-04-18 |
| **win** | Boolean Denoting Compatibility to Windows Computers | True |
| **mac** | Boolean Denoting Compatibility to Mac Computers  | True | 
| **linux** | Boolean Denoting Compatibility to Linux Computers  | True |
| **rating** | Categorical Rating of Product (e.g. "Positive")| Overwhelmingly Positive |
| **positive_ratio** | Percentage of Positive Feedback for Game  | 98 |
| **user_reviews** | Number of Reviews  | 267142 |
| **price_final** | Final Price in USD | 9.99 |
| **price_original** | Price Before Discounts in USD | 9.99 |
| **discount** | Applied Discount Percentage | 0 |
| **steam_deck** | Boolean Denoting Compatibility with Steam Deck | True |



In [38]:
df_games_data = pd.read_csv("data/games.csv")
df_games_data.head(2)

Unnamed: 0,app_id,title,date_release,win,mac,linux,rating,positive_ratio,user_reviews,price_final,price_original,discount,steam_deck
0,10090,Call of Duty: World at War,2008-11-18,True,False,False,Very Positive,92,37039,19.99,19.99,0.0,True
1,13500,Prince of Persia: Warrior Within™,2008-11-21,True,False,False,Very Positive,84,2199,9.99,9.99,0.0,True


----
The **CSV file [users.csv](data/users.csv)** presents data concerning individual users found in the datasets:

| Column | Description | Example|
| --- | --- | --- |
| **user_id** | User ID on Steam | 5693478 |
| **products** | Number of Products from Steam Library Owned | 156 |
|  **reviews** | Number of Reviews Published | 1 |

In [40]:
df_users = pd.read_csv("data/users.csv")
df_users.head(2)

Unnamed: 0,user_id,products,reviews
0,5693478,156,1
1,3595958,329,3


----
The **CSV file [recommendations.csv](data/recommendations.csv)** contains data concerning user reviews of specific games. Each row is one user's review of one game, so the file has a many-to-one relationship to both users.csv and games.csv and links the two:

| Column | Description | Example|
| --- | --- | --- |
| **app_id** | Product ID on Steam | 620 |
| **helpful** | Number of Users Who Found Review Helpful | 0 |
|  **funny** | Number of Users Who Found Review Funny | 0 |
| **date** | Date in Which Review was Published (y-m-d) | 2022-12-12 |
| **is_recommended** | Does the User Recommend the Title | True | 
| **hours** | Hours Spent by User Playing Game  | 36.3 |
| **user_id** | User ID of Review Author | 19954 |
| **review_id** | ID of Individual Review  | 0 |

In [46]:
df_recommendations = pd.read_csv("data/recommendations.csv")
df_recommendations.head(2)

Unnamed: 0,app_id,helpful,funny,date,is_recommended,hours,user_id,review_id
0,975370,0,0,2022-12-12,True,36.3,19954,0
1,304390,4,0,2017-02-17,False,11.5,1098,1


----
Finally, the folder includes a **JSON file [games_metadata.json](data/games_metadata.json)** containing metadata on individual games:

| Column | Description | Example|
| --- | --- | --- |
| **app_id** | Product ID on Steam | 304430 |
| **description** | Game Description on Steam | "Hunted and alone, a boy finds himself drawn into the center of a dark project. INSIDE is a dark, narrative-driven platformer combining intense action with challenging puzzles. It has been critically acclaimed for its moody art style, ambient soundtrack and unsettling atmosphere." |
|  **tags** | Additional Tags on Steam Platform | ["2.5D", "Story Rich", "Puzzle Platformer" , "Atmospheric" , "Adventure" , "Indie" , "Dark" , "Horror" , "Singleplayer" , "Action-Adventure" , "Puzzle" , "Multiple Endings" , "Exploration" , "2D Platformer" , "Platformer" , "Controller" , "Soundtrack" , "Ambient" , "Action" , "Narrative"] |

In [47]:
df_games_meta_data = pd.read_json('data/games_metadata.json', lines=True)
df_games_meta_data.head(2)

Unnamed: 0,app_id,description,tags
0,10090,"Call of Duty is back, redefining war like you'...","[Zombies, World War II, FPS, Multiplayer, Acti..."
1,13500,Enter the dark underworld of Prince of Persia ...,"[Action, Adventure, Parkour, Third Person, Gre..."


----
Two additional data files are utilized from a separate [Kaggle page](https://www.kaggle.com/datasets/nikdavis/steam-store-games?select=steam.csv) to enrich the analysis. These are imported below.


The **CSV file [steam.csv](data/steam.csv)** provides additional data on the games:

| Column | Description | Example|
| --- | --- | --- |
| **appid** | Product ID on Steam| 10 |
| **name** | Name of Game | Counter-Strike |
|  **release_date** | Release Date of Title (y-m-d) | 2000-11-01 |
|  **english** | Is the Game Available in English? | 1 |
|  **developer** | Developer Company of Game | Valve |
|  **publisher** | Publishing Company of Game | Valve |
|  **platforms** | Semicolon-Delimited List of Systems that Can Run the Game | windows;mac;linux |
|  **required_age** | Minimum Age Required to Play by PEGI UK (0 Means Unsupplied) | 0 |
|  **categories** | Game Categorisation (Semicolon Delimited) | Multi-player;Online Multi-Player... |
|  **genres** | Game's Genre (Semicolon Delimited) | Action |
|  **steamspy_tags** | Tags from Steamspy API (Semicolon Delimited) | Action;FPS;Multiplayer |
|  **achievements** | Number of In-Game Achievements (If Any) | 0 |
|  **positive_ratings** | Number of Positive Ratings | 124534 |
|  **negative_ratings** | Number of Negative Ratings | 3339 |
|  **average_playtime** | Average Playtime by User in Minutes| 17612 |
|  **median_playtime** | Median Playtime by User in Minutes | 317 |
|  **owners** | Number of Users that Own the Game (Bracket) | 10000000-20000000 |
|  **price** | Price of Game | 7.19 |



In [55]:
df_games_additional = pd.read_csv("data/steam.csv")
df_games_additional.head(3)

Unnamed: 0,appid,name,release_date,english,developer,publisher,platforms,required_age,categories,genres,steamspy_tags,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,owners,price
0,10,Counter-Strike,2000-11-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,124534,3339,17612,317,10000000-20000000,7.19
1,20,Team Fortress Classic,1999-04-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,3318,633,277,62,5000000-10000000,3.99
2,30,Day of Defeat,2003-05-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Valve Anti-Cheat enabled,Action,FPS;World War II;Multiplayer,0,3416,398,187,34,5000000-10000000,3.99
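
Several columns in steam.csv pack multiple values into one string (semicolon-delimited lists, and an ownership bracket written as `low-high`). A minimal sketch of how these could be unpacked is shown below. It uses a toy DataFrame with values copied from the preview above, and the column names `owners_min` and `owners_max` are hypothetical, not part of the dataset.

```python
import pandas as pd

# Toy rows mimicking the steam.csv layout (values copied from the preview above).
toy = pd.DataFrame({
    "appid": [10, 30],
    "platforms": ["windows;mac;linux", "windows;mac;linux"],
    "steamspy_tags": ["Action;FPS;Multiplayer", "FPS;World War II;Multiplayer"],
    "owners": ["10000000-20000000", "5000000-10000000"],
})

# Split the semicolon-delimited strings into Python lists.
for col in ["platforms", "steamspy_tags"]:
    toy[col] = toy[col].str.split(";")

# Split the ownership bracket into numeric lower and upper bounds.
# "owners_min" / "owners_max" are hypothetical column names.
bounds = toy["owners"].str.split("-", expand=True).astype(int)
toy["owners_min"] = bounds[0]
toy["owners_max"] = bounds[1]

print(toy[["appid", "platforms", "owners_min", "owners_max"]])
```

The same pattern should carry over to `categories` and `genres`, which the table above also describes as semicolon-delimited.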


### JOEY'S TASK
The **CSV file [blabla.csv](data/steam.csv)** contains information about system requirements in order to run the games:

| Column | Description | Example|
| --- | --- | --- |
| **JOEY** | Product ID on Steam| 10 |
| **JOEY** | Name of Game | Counter-Strike |
|  **JOEY** | Release Date of Title (y-m-d) | 2000-11-01 |
|  **JOEY** | Is the Game Available in English? | 1 |

In [56]:
# File Import 

## Preliminary Data Exploration

In [48]:
df_games_data["rating"].unique()

array(['Very Positive', 'Positive', 'Mixed', 'Mostly Positive',
       'Overwhelmingly Positive', 'Mostly Negative',
       'Overwhelmingly Negative', 'Negative', 'Very Negative'],
      dtype=object)

## Data Preparation

#### Games Data
- Turning relevant columns to datetime objects
- **AA IF ANYTHING ELSE PLEASE ADD**

In [34]:
# Turn date_release column to Pandas DateTime
df_games_data["date_release"] = pd.to_datetime(df_games_data["date_release"])

#### Additional Game Data

In [None]:
# Drop Redundant Columns


#### User Data

#### Recommendations Data

In [8]:
# Turn date column to Pandas DateTime
df_recommendations["date"] = pd.to_datetime(df_recommendations["date"])

#### Games Metadata

In [9]:
# Turn the Description Column to a String
df_games_meta_data['description'] = df_games_meta_data['description'].astype(str)

# Turn the Tags Column Into a List
df_games_meta_data["tags"] = df_games_meta_data["tags"].astype(str).apply(ast.literal_eval)

df_games_meta_data

Unnamed: 0,app_id,description,tags
0,10090,"Call of Duty is back, redefining war like you'...","[Zombies, World War II, FPS, Multiplayer, Acti..."
1,13500,Enter the dark underworld of Prince of Persia ...,"[Action, Adventure, Parkour, Third Person, Gre..."
2,22364,,[Action]
3,113020,Monaco: What's Yours Is Mine is a single playe...,"[Co-op, Stealth, Indie, Heist, Local Co-Op, St..."
4,226560,Escape Dead Island is a Survival-Mystery adven...,"[Zombies, Adventure, Survival, Action, Third P..."
...,...,...,...
46063,758560,"Welcome to Versus World! Shoot, stab, snipe, a...","[Action, Indie, Early Access, Gore, Violent, F..."
46064,886910,,"[Simulation, Free to Play, Multiplayer, Single..."
46065,1477870,Fire and Water what is that an online game mad...,"[Casual, Action, Adventure, Action-Adventure, ..."
46066,1638430,A modern turn-based deckbuilding JRPG involvin...,"[RPG, Pixel Graphics, Party-Based RPG, JRPG, A..."


## Merging Datasets

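We merge the data into a single DataFrame, `final_df`, in which each row is one review enriched with the information on the reviewed game and on the reviewing user.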

In [10]:
# Merge all information on games to one DataFrame (inner join on the shared app_id key)
games_df = df_games_data.merge(df_games_meta_data, on = "app_id")

# Merge game information into the recommendations DataFrame
recs_df = df_recommendations.merge(games_df, how = "left", on = "app_id")

# Merge all information on users into a final DataFrame
final_df = recs_df.merge(df_users, how="left", on = "user_id")

In [11]:
final_df.head(2)

Unnamed: 0,app_id,helpful,funny,date,is_recommended,hours,user_id,review_id,title,date_release,...,positive_ratio,user_reviews,price_final,price_original,discount,steam_deck,description,tags,products,reviews
0,975370,0,0,2022-12-12,True,36.3,19954,0,Dwarf Fortress,2022-12-06,...,95,17773,29.99,29.99,0.0,True,"The deepest, most intricate simulation of a wo...","[Colony Sim, Indie, Pixel Graphics, Simulation...",28,3
1,304390,4,0,2017-02-17,False,11.5,1098,1,FOR HONOR™,2017-02-13,...,68,76071,14.99,14.99,0.0,True,Carve a path of destruction through an intense...,"[Medieval, Swordplay, Action, Multiplayer, PvP...",269,1
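
The left merges above keep every review, but they only preserve the row count if `app_id` is unique in the games table and `user_id` is unique in the users table. A minimal sketch of how this could be checked is given below, using pandas' `validate` and `indicator` arguments on toy data. The toy frames `games` and `reviews` are illustrative assumptions, not the project data.

```python
import pandas as pd

# Toy stand-ins for the games and recommendations tables (hypothetical values).
games = pd.DataFrame({"app_id": [1, 2], "title": ["A", "B"]})
reviews = pd.DataFrame({
    "app_id": [1, 1, 3],
    "user_id": [7, 8, 7],
    "is_recommended": [True, False, True],
})

# validate="many_to_one" raises MergeError if app_id is duplicated in `games`;
# indicator=True adds a "_merge" column recording where each row was matched.
merged = reviews.merge(
    games, how="left", on="app_id", validate="many_to_one", indicator=True
)

# Reviews whose game was not found on the right-hand side.
unmatched = merged[merged["_merge"] == "left_only"]

print(len(merged) == len(reviews))   # a left many-to-one merge keeps the row count
print(unmatched["app_id"].tolist())  # app_ids with no game information
```

Unmatched reviews end up with missing values in every game column, so it is worth deciding whether to drop or impute them before modelling.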


## Feature Engineering

**Elapsed Time:** A new feature which tracks the amount of time that elapsed between the game's release and the publication of the review. This could be informative because people who purchase a game right after its release are likely to be bigger fans of the genre or franchise.

In [12]:
final_df["elapsed_time"] = final_df["date"] - final_df["date_release"]
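
Subtracting two datetime columns yields a `Timedelta` column, which most estimators cannot consume directly. One possible follow-up, sketched here on the two dates from the preview above, is to convert it to a whole number of days. The column name `elapsed_days` is a hypothetical addition.

```python
import pandas as pd

# Review and release dates copied from the two preview rows above.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2022-12-12", "2017-02-17"]),
    "date_release": pd.to_datetime(["2022-12-06", "2017-02-13"]),
})

# Same subtraction as in the notebook: the result has dtype timedelta64[ns].
toy["elapsed_time"] = toy["date"] - toy["date_release"]

# Hypothetical numeric version of the feature, in whole days.
toy["elapsed_days"] = toy["elapsed_time"].dt.days

print(toy)
```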

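**Relative Recommendation:** The difference between a user's recommendation of a given game and that user's average recommendation rate across all of their reviews. A positive value means the user liked the game more than they usually like games; a negative value means they liked it less.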

In [13]:
usr_avg_rating = final_df[["user_id","is_recommended"]].groupby("user_id").mean()
usr_avg_rating.rename(columns = {"is_recommended":"avg_rating"}, inplace = True)

In [14]:
final_df = final_df.merge(usr_avg_rating, how = "left", on = "user_id")
final_df["rel_rec"] = (final_df["is_recommended"] - final_df["avg_rating"])
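
To make the feature concrete, the sketch below recomputes it on a toy table with two hypothetical users. It uses `groupby(...).transform("mean")` as a one-step alternative to the groupby-then-merge approach above; this is an equivalent formulation under that assumption, not the notebook's own code.

```python
import pandas as pd

# Two hypothetical users: user 1 recommends 3 of 4 games, user 2 recommends none.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "is_recommended": [True, True, True, False, False, False],
})

# Cast the boolean to float so the arithmetic below is explicit.
rec = toy["is_recommended"].astype(float)

# Per-user average recommendation rate, broadcast back to every review row.
toy["avg_rating"] = rec.groupby(toy["user_id"]).transform("mean")

# Relative recommendation: this review minus the user's average.
toy["rel_rec"] = rec - toy["avg_rating"]

print(toy)
```

Note that the average includes the review being scored, so a user with a single review always gets `rel_rec = 0`, and the feature carries information about `is_recommended` itself. If `is_recommended` is the prediction target, the average would presumably need to be computed on training data only.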

## Final Dataset Exploration

In [54]:
# We see how many positive and negative recommendations there are
rec_counts = df_recommendations["is_recommended"].value_counts()
print(rec_counts)

# What is their ratio? (index by label rather than by position)
print("\nHow many positive reviews for each negative one?")
rec_counts.loc[True] / rec_counts.loc[False]

True     8599822
False    1472448
Name: is_recommended, dtype: int64

How many positive reviews for each negative one?


5.840492839135915

## Model Preparation

In [27]:
# train_test_split with validation set
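
A three-way split can be obtained by calling scikit-learn's `train_test_split` twice. The sketch below is one possible setup, assuming a 60/20/20 split and stratification on the target so that each subset keeps the same share of positive reviews. The synthetic `X` and `y` are placeholders for the real feature matrix and `is_recommended` column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 1000 rows, 3 features, roughly 85% positive labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.85).astype(int)

# First split: 60% train, 40% held out (stratified on the label).
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)

# Second split: divide the held-out 40% evenly into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
print(y_train.mean(), y_val.mean(), y_test.mean())
```

With review data, splitting by user or by date instead of at random may be worth considering, since reviews by the same user are not independent.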

#### Logistic Regression Model

In [None]:
# Consider balancing out data because of overwhelming positivity
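
With almost six positive reviews for every negative one, an unweighted classifier can score well by mostly predicting "recommended". One option among several (others include under- or over-sampling) is scikit-learn's `class_weight="balanced"`, which reweights samples inversely to class frequency. The sketch below compares the two settings on synthetic data; the data-generating process is an assumption made purely for illustration, and the scores are computed in-sample for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic imbalanced data: roughly 85% positives, with positives shifted by +1
# in both features so the classes overlap but are partly separable.
rng = np.random.default_rng(0)
n = 2000
y = (rng.random(n) < 0.85).astype(int)
X = rng.normal(size=(n, 2)) + y[:, None] * 1.0

# Unweighted model versus a model with class weights inversely
# proportional to class frequencies.
plain = LogisticRegression().fit(X, y)
balanced = LogisticRegression(class_weight="balanced").fit(X, y)

# Recall on the minority (negative) class, evaluated in-sample.
recall_neg_plain = recall_score(y, plain.predict(X), pos_label=0)
recall_neg_balanced = recall_score(y, balanced.predict(X), pos_label=0)

print("negative-class recall, unweighted:", recall_neg_plain)
print("negative-class recall, balanced:  ", recall_neg_balanced)
```

Balancing typically trades some accuracy on the majority class for better recall on the minority class, so the choice depends on how costly it is to miss a negative review.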