Name: Noah Wagner
Dataset: https://www.kaggle.com/datasets/whigmalwhim/steam-releases/

In [1]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import r2_score

In [2]:
data = pd.read_csv("game_data_trimmed.csv")
data = data[["release", "peak_players", "total_reviews", "rating", "players_right_now"]]
# Replace missing values with 0; assigning the result avoids modifying a slice in place
data = data.fillna(value=0)

# Convert players_right_now to numeric; some values are strings such as "1,234"
data["players_right_now"] = data["players_right_now"].apply(lambda x: int(x.replace(",", "")) if isinstance(x, str) else x)

# Encode release dates as integers by stripping the dashes (yielding values such as 20230317)
data["release"] = data["release"].apply(lambda x: int(x.replace("-", "")))

xs = data.drop(columns=["players_right_now"])
ys = data["players_right_now"]

train_x, test_x, train_y, test_y = train_test_split(xs, ys, train_size=0.7)
print(train_x, test_x, train_y, test_y)

       release  peak_players  total_reviews  rating
357   20230317            12             36   81.30
1759  20230127             3              3   67.06
7378  20220325             3              7   73.26
2054  20230203             1              2   64.08
5737  20221108            17             15   78.30
...        ...           ...            ...     ...
3956  20220318            85            304   86.74
5196  20221007            11             20   80.00
7531  20220301             4             34   73.19
2911  20230220             3              3   55.69
2353  20230119             2             11   61.97

[7000 rows x 4 columns]        release  peak_players  total_reviews  rating
433   20230309             7             30   80.07
2308  20230105             4              5   62.51
9513  20220914             4              8   68.15
3313  20230126             1              1   40.58
8202  20220211       1325305         191273   71.29
...        ...           ...           

In [3]:
steps = [
    ("minmax", MinMaxScaler()),
    ("predict", LinearRegression(n_jobs=-1))
]

pipeline = Pipeline(steps)

pipeline.fit(train_x, train_y)

In [4]:
predict_y = pipeline.predict(test_x)
r2_score(test_y, predict_y)

0.5733298808808963

### Why I chose the feature columns
The target column describes how many players are currently playing a given game. Below is the rationale for why I included each feature column.
- __release__: Older games usually have fewer players, so I hoped the model would learn that the earlier the release date (the lower the encoded number), the lower the current player count is likely to be.
- __peak_players__: If a game has had many players at some point, its current player count is likely to be some fraction of that peak. I expected the model to learn a positive relationship between this feature and the target.
- __total_reviews__: peak_players alone is not enough, however. Some games become very popular but have low replayability, so the player count dies out quickly, which would lead to a low total_reviews count. I hoped this feature would help with those edge cases: the more reviews a game has, the more likely it is to still have players.
- __rating__: If a game isn't fun, people are less likely to keep playing. I expected the model to learn a positive correlation between rating and current player count.
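
One way to check whether a fitted model actually learned relationships like these is to look at the sign and size of its coefficients. The sketch below is illustrative only. It uses synthetic stand-in data rather than the Steam dataset, and the names `demo_x`, `demo_y`, and `demo_pipeline` are hypothetical. The target is constructed (by assumption) to depend only on `peak_players`, so the printed numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (NOT the Steam dataset); column names mirror the notebook.
rng = np.random.default_rng(0)
n = 500
demo_x = pd.DataFrame({
    # Integers in the YYYYMMDD range; not necessarily valid calendar dates
    "release": rng.integers(20220101, 20230331, size=n),
    "peak_players": rng.integers(1, 1000, size=n),
    "total_reviews": rng.integers(1, 500, size=n),
    "rating": rng.uniform(40, 95, size=n),
})

# Assumed target: a fixed fraction of the peak player count plus a little noise
demo_y = 0.1 * demo_x["peak_players"] + rng.normal(0, 1, size=n)

# Same two steps as the notebook's pipeline: min-max scaling, then linear regression
demo_pipeline = Pipeline([
    ("minmax", MinMaxScaler()),
    ("predict", LinearRegression()),
])
demo_pipeline.fit(demo_x, demo_y)

# Pair each learned coefficient with its feature name
coefs = pd.Series(
    demo_pipeline.named_steps["predict"].coef_,
    index=demo_x.columns,
)
print(coefs.sort_values(ascending=False))
```

Because `MinMaxScaler` maps every feature to the range 0 to 1, each coefficient can be read as the predicted change in the target when that feature moves from its minimum to its maximum value. That makes the coefficients roughly comparable across features. Applying the same `named_steps["predict"].coef_` lookup to the pipeline fitted above would show whether each expected relationship came out positive or negative.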