# Steam Games Rating Prediction
In this exploration notebook, we will demonstrate the entire pipeline of our project.

First, we import external libraries along with the dataset that will be used throughout the project. 

We then preprocess the data, adjusting each game's average rating for its total number of ratings (among other factors).

Next, we clean the data into the format our machine learning models expect (namely numerical values for classification).

Finally, we implement decision tree and support vector machine models and compare their performance on the test set.

## Importing libraries

In [12]:
# Default libraries
from collections import Counter
# External Libraries
import pandas
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

## Importing the dataset

In [13]:
data = pandas.read_csv("datasets/steam_games_2024.csv")
data.head()

Unnamed: 0,appid,name,price,release_date,required_age,publishers,developers,categories,genres,ratings,totalRatings,average_playtime,median_playtime,num_owners
0,10,Counter-Strike,7.19,2000-11-01,0,Valve,Valve,Multi-player;Online Multi-Player;Local Multi-P...,Action,97.39,127873,17612,317,10000000-20000000
1,20,Team Fortress Classic,3.99,1999-04-01,0,Valve,Valve,Multi-player;Online Multi-Player;Local Multi-P...,Action,83.98,3951,277,62,5000000-10000000
2,30,Day of Defeat,3.99,2003-05-01,0,Valve,Valve,Multi-player;Valve Anti-Cheat enabled,Action,89.56,3814,187,34,5000000-10000000
3,40,Deathmatch Classic,3.99,2001-06-01,0,Valve,Valve,Multi-player;Online Multi-Player;Local Multi-P...,Action,82.66,1540,258,184,5000000-10000000
4,50,Half-Life: Opposing Force,3.99,1999-11-01,0,Valve,Gearbox Software,Single-player;Multi-player;Valve Anti-Cheat en...,Action,94.8,5538,624,415,5000000-10000000
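
Several columns in this preview are not yet numerical: `categories` and `genres` hold `;`-separated labels, and `num_owners` is a range string. The sketch below shows one way such values might be turned into something a classifier can use. The helper names `owners_midpoint` and `split_labels` are hypothetical and not part of the notebook, and taking the midpoint of the owner range is an assumption, not necessarily the encoding the project uses.

```python
def owners_midpoint(owner_range: str) -> float:
    """Convert a range string such as "5000000-10000000" to its midpoint."""
    low, high = owner_range.split("-")
    return (int(low) + int(high)) / 2


def split_labels(field: str) -> list:
    """Split a ';'-separated field into a list of individual labels."""
    return [label.strip() for label in field.split(";") if label.strip()]


# Values copied from the preview above
print(owners_midpoint("10000000-20000000"))
print(split_labels("Multi-player;Valve Anti-Cheat enabled"))
```

Once split, each label could become its own 0/1 indicator column so that the models only see numbers.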


## Data preprocessing

### Added columns
- mean rating
- weighted rating
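
Judging from the code in the next cell, the weighted rating shrinks each game's rating toward the global mean. Here $m$ and $C$ follow the names used in the code (the ratings threshold and the mean rating across all games), while $R$ (a game's `ratings` value) and $v$ (its `totalRatings`) are labels introduced here for readability:

```latex
\[
\text{weighted\_rating} = \frac{v}{v + m}\,R + \frac{m}{v + m}\,C
\]
```

When $v \gg m$ the result stays close to $R$; when $v = 0$ it equals $C$, so games with few ratings are pulled toward the overall mean.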

In [14]:
# Step 1: Determine the minimum number of ratings required to be listed, using the 50th percentile as a threshold
m = data['totalRatings'].quantile(0.50)
print(m)

# Step 2: Calculate C, the mean rating across all games
C = data['ratings'].mean()

# Apply the weighted rating formula
data['weighted_rating'] = (data['totalRatings'] / (data['totalRatings'] + m) * data['ratings']) + (m / (data['totalRatings'] + m) * C)

# Sort by number of ratings and display the games above the threshold m
# to check the weighted ratings (note: `sorted` shadows Python's built-in)
sorted = data.sort_values(by=['totalRatings'], ascending=False)
sorted[sorted['totalRatings'] > m]

36.0


Unnamed: 0,appid,name,price,release_date,required_age,publishers,developers,categories,genres,ratings,totalRatings,average_playtime,median_playtime,num_owners,weighted_rating
25,730,Counter-Strike: Global Offensive,0.00,2012-08-21,0,Valve,Valve;Hidden Path Entertainment,Multi-player;Steam Achievements;Full controlle...,Action;Free to Play,86.80,3046717,22494,6502,50000000-100000000,86.799819
22,570,Dota 2,0.00,2013-07-09,0,Valve,Valve,Multi-player;Co-op;Steam Trading Cards;Steam W...,Action;Free to Play;Strategy,85.87,1005586,23944,801,100000000-200000000,85.869484
12836,578080,PLAYERUNKNOWN'S BATTLEGROUNDS,26.99,2017-12-21,0,PUBG Corporation,PUBG Corporation,Multi-player;Online Multi-Player;Stats,Action;Adventure;Massively Multiplayer,50.46,983260,22938,12434,50000000-100000000,50.460768
19,440,Team Fortress 2,0.00,2007-10-10,0,Valve,Valve,Multi-player;Cross-Platform Multiplayer;Steam ...,Action;Free to Play,93.81,549915,8495,623,20000000-50000000,93.808536
2478,271590,Grand Theft Auto V,24.99,2015-04-13,18,Rockstar Games,Rockstar North,Single-player;Multi-player;Steam Achievements;...,Action;Adventure,70.26,468369,9837,4834,10000000-20000000,70.260091
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18763,755670,Sleep Tight,11.39,2018-07-26,0,We Are Fuzzy,We Are Fuzzy,Single-player;Steam Achievements;Full controll...,Action;Indie;Strategy,86.49,37,0,0,0-20000,79.071959
9942,500320,A Tale of Caos: Overture,5.99,2016-12-21,0,Eli Daddio,ExperaGameStudio,Single-player;Steam Achievements;Steam Trading...,Adventure;Indie,70.27,37,0,0,0-20000,70.850863
7473,423440,Choice of Kung Fu,3.99,2015-12-11,0,Choice of Games,Choice of Games,Single-player;Steam Achievements;Captions avai...,Indie;RPG,94.59,37,0,0,0-20000,83.177438
20540,805450,Derrek Quest V Regression,0.79,2018-02-28,0,Alexeibelih,Manuf,Single-player,Action;Adventure;Casual;Indie,48.65,37,0,0,0-20000,59.892780
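
To make the effect of the weighting concrete, the sketch below recomputes two rows of the table by hand. The function name `weighted_rating` and its argument names are illustrative. The notebook never prints `C`, so the value used here (about 71.45) is an assumption back-solved from the displayed rows and is only approximate.

```python
def weighted_rating(rating, total_ratings, m, c):
    """Shrink a game's rating toward the global mean c, weighted by its rating count."""
    v = total_ratings
    return v / (v + m) * rating + m / (v + m) * c


M = 36.0          # threshold printed by the cell above
C_APPROX = 71.45  # ASSUMPTION: back-solved from the displayed rows, not printed in the notebook

# "Sleep Tight": 86.49 from only 37 ratings is pulled well toward the mean
few = weighted_rating(86.49, 37, M, C_APPROX)
# "Dota 2": 85.87 from 1,005,586 ratings barely moves
many = weighted_rating(85.87, 1005586, M, C_APPROX)

print(round(few, 2), round(many, 2))
```

The game with 37 ratings moves by several points, while the game with over a million ratings is essentially unchanged, which is the behaviour the threshold `m` is meant to produce.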


In [17]:
# Isolating the 'weighted_rating' column for K-means
X = sorted[['weighted_rating']]

# Fit K-means with a fixed k = 3 (the elbow method would instead compare several values of k)
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
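
The cell above fixes the number of clusters at 3. A minimal sketch of how an elbow-method sweep could look is given below. It runs on synthetic scores, not the Steam data, and the range of `k` values, `n_init=10`, and `random_state=0` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the weighted_rating column (NOT the Steam data):
# three loose groups of scores on a 0-100 scale
rng = np.random.default_rng(0)
scores = np.concatenate([
    rng.normal(40, 5, 200),
    rng.normal(70, 5, 200),
    rng.normal(90, 3, 200),
]).reshape(-1, 1)

# Fit K-means for k = 1..8 and record the inertia
# (within-cluster sum of squared distances) for each k
ks = list(range(1, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(scores)
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

In the notebook, `plt.plot(ks, inertias, marker="o")` would draw the curve; the "elbow" is the value of `k` after which the inertia stops dropping sharply.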

In [8]:
# Collect every publisher name; multi-publisher games separate names with ";"
publishers = data["publishers"]
publisher_names = []
for line in publishers:
    # str() guards against non-string entries such as NaN
    publisher_names.extend(str(line).split(";"))

# Count how many games each publisher is listed on
counts = Counter(publisher_names)
print(counts["Rockstar Games"])
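
As a sanity check of the split-and-count logic, the sketch below applies it to a handful of publisher strings. Four are copied from rows shown earlier in the notebook; the multi-publisher entry is hypothetical and only there to exercise the `;` split. A publisher that appears in the input, such as Rockstar Games, should end up with a count of at least 1.

```python
from collections import Counter

# Publisher fields copied from rows shown earlier in the notebook, plus one
# HYPOTHETICAL multi-publisher entry to exercise the ';' split
publisher_fields = [
    "Valve",
    "Valve",
    "PUBG Corporation",
    "Rockstar Games",
    "Studio A;Studio B",  # hypothetical, not from the dataset
]

# Flatten the fields into individual names, then count them in one pass
names = [name for field in publisher_fields for name in field.split(";")]
counts = Counter(names)

print(counts["Rockstar Games"])
print(counts.most_common(1))
```

`Counter` accepts any iterable of hashable items, so no intermediate list of `(name, 1)` pairs is needed.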
