# Analyzing and Predicting Player Count Trends in Online Games

## Part 1: An Introduction

As we enter the third decade of the 21st century, our world grows more interconnected every day. Activities that were once done alone are now shared online, and video games are no exception. There are an estimated 2.7 billion gamers worldwide, and this number is likely to keep growing given the impact COVID-19 has had, and will continue to have, on our culture. With that in mind, businesses that serve these customers must be able to anticipate change. They must predict how their player base will change over the long run to address capital constraints, over the medium run to time events and announcements for the largest possible audience, and over the short run to decide when to employ short-term retention strategies.

## Our Project
We will use data on concurrent players for multiple games to predict the number of active players. We will do this on a long-term, medium-term, and short-term basis, so that we can answer business questions across all three time horizons. This will allow us to make both long- and short-term recommendations to the businesses that would benefit from our findings.

### Project Map
First, we need to collect data for our predictions. This will come from Steam, the dominant platform for PC gaming. While this will not capture the full population of gamers worldwide, Steam is a sensible choice of data source: a large share of online gaming takes place on PC rather than on consoles, and Steam dominates the PC gaming market by a wide margin.

Our data is structured quite simply. The raw data contains the date, the number of players on that day, any flags associated with the date (usually events), and the number of Twitch viewers on that day. We will drop the Twitch viewer count for our modelling phase, as we do not expect it to help predict the number of players. However, taking Twitch viewers into account during our EDA phase may give richer context.

We will have to build models for multiple games in order to avoid overfitting our model to a single game. While a single business may wish to have a model that is precisely fitted to their product, we are attempting to construct a model that can be applied to any online game. 

During our modelling phase, we will need to adjust parameters to account for the different time periods we wish to predict. To do this, we will use SARIMA or SARIMAX, two time series models that account for seasonal patterns in data (SARIMAX additionally accepts exogenous regressors). When we model, our "seasons" will match the horizon we are analyzing: yearly for the long run (LR), monthly for the medium run (MR), and weekly for the short run (SR).
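
As a rough sketch of how the seasonal period might change with the forecast horizon, the helper below maps each horizon to a season length in days. The name `seasonal_period` and the period lengths (7, 30, 365) are assumptions for daily data that has not been resampled; they are not taken from our modelling notebook.

```python
# Hypothetical helper: map a forecast horizon to a seasonal period (in days).
# Assumes one observation per day; the period lengths are illustrative only.
SEASONAL_PERIODS = {
    "short": 7,     # weekly seasonality
    "medium": 30,   # approximately monthly seasonality
    "long": 365,    # approximately yearly seasonality
}


def seasonal_period(horizon: str) -> int:
    """Return the assumed seasonal period for 'short', 'medium', or 'long'."""
    try:
        return SEASONAL_PERIODS[horizon]
    except KeyError:
        raise ValueError(f"unknown horizon: {horizon!r}") from None


for horizon in ("short", "medium", "long"):
    print(horizon, seasonal_period(horizon))
```

In statsmodels, for instance, this period would be the fourth element of the `seasonal_order` tuple passed to `SARIMAX`. A period of 365 on daily data can be slow to fit, so resampling to a coarser frequency for the long-run model may be worth considering.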

## Data Collection
We established criteria for selecting games from steamdb.info. Each selected game should:
1. Be online.
2. Be popular.
3. Not be a new release.
4. Have in-game events, seasonal or non-seasonal.
5. Have in-game rewards.
6. Be available worldwide.

While it is not imperative for our games to meet all of these criteria, the more that are met, the better we can apply the given game to our investigation. After some initial research on SteamDB, the following games meet most of our criteria, and will be used for our investigation:

- Rocket League, 2015
- Counter-Strike: Global Offensive, 2012
- DOTA 2, 2013
- Team Fortress 2, 2007 *

*Team Fortress 2 has limited events and rewards, but it has a very large sample of daily observations, so it acts as a useful control for the other selected games.

## Raw Data

This section serves to introduce our data, make some basic alterations for ease of use, and complete introductory feature engineering. 

In [25]:
# we begin with importing introductory libraries

from fsds.imports import * 
import pandas as pd  # made explicit; the wildcard import above is assumed to provide it as well

In [26]:
# we import our raw data

dota = pd.read_csv('data/Raw/dota2.csv', parse_dates = ['DateTime'], index_col = 'DateTime')
csgo = pd.read_csv('data/Raw/csgo.csv', parse_dates = ['DateTime'], index_col = 'DateTime')
rl = pd.read_csv('data/Raw/rl.csv', parse_dates = ['DateTime'], index_col = 'DateTime')
tf = pd.read_csv('data/Raw/tf2.csv', parse_dates = ['DateTime'], index_col = 'DateTime')

# we construct a list of these dataframes to use throughout

ls = [csgo, dota, rl, tf]
labels = ["CS:GO", "DOTA 2", "Rocket League", 'Team Fortress 2']

dota

DateTime,Players,Flags,Twitch Viewers
2011-09-22,194.0,,
2011-09-23,240.0,,
2011-09-24,,,
2011-09-25,233.0,,
2011-09-26,222.0,,
...,...,...,...
2020-12-30,625344.0,,71052.0
2020-12-31,561914.0,,60719.0
2021-01-01,642032.0,,57219.0
2021-01-02,694687.0,,63547.0


In [27]:
# we inspect columns
dota.columns

Index(['Players', 'Flags', 'Twitch Viewers'], dtype='object')

### Flags
Upon inspecting this column, we were disappointed to see that event notes did not translate when importing our data, resulting in this column being empty. We will instead have to complete this manually from SteamDB in our feature engineering section. 
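
One way to confirm that a column carries no information before dropping it is to check whether every entry is missing. The snippet below is a minimal sketch on a toy frame; the values are made up and only the column names mirror our raw data.

```python
import pandas as pd

# Toy frame standing in for one game's raw data (values are made up).
raw = pd.DataFrame(
    {"Players": [194.0, 240.0, None], "Flags": [None, None, None]},
    index=pd.to_datetime(["2011-09-22", "2011-09-23", "2011-09-24"]),
)

# True when every entry in the column is missing.
flags_empty = bool(raw["Flags"].isna().all())
print(flags_empty)

# Drop the column only if it carries no information.
if flags_empty:
    raw = raw.drop(columns="Flags")
print(list(raw.columns))
```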

In [28]:
# we drop 'Flags'

ls = list(map(lambda df:df.drop(columns = 'Flags'), ls))
ls

[              Players  Twitch Viewers
 DateTime                             
 2011-11-30      680.0             NaN
 2011-12-01        NaN             NaN
 2011-12-02        NaN             NaN
 2011-12-03        NaN             NaN
 2011-12-04        NaN             NaN
 ...               ...             ...
 2020-12-30  1056057.0         77815.0
 2020-12-31   982583.0         60120.0
 2021-01-01  1020715.0         91532.0
 2021-01-02  1079804.0         93678.0
 2021-01-03  1067795.0         94130.0
 
 [3323 rows x 2 columns],
              Players  Twitch Viewers
 DateTime                            
 2011-09-22     194.0             NaN
 2011-09-23     240.0             NaN
 2011-09-24       NaN             NaN
 2011-09-25     233.0             NaN
 2011-09-26     222.0             NaN
 ...              ...             ...
 2020-12-30  625344.0         71052.0
 2020-12-31  561914.0         60719.0
 2021-01-01  642032.0         57219.0
 2021-01-02  694687.0         63547.0
 ... (output truncated)

A breakdown of the columns in our data:
- DateTime: The date, shown in the format YYYY-MM-DD HH:MM:SS. Since data is collected daily at 00:00:00, only the date is meaningful, so we will want to display only the date while keeping the column as a pandas datetime.
- Players: The number of players on that day. This will be our target during our modelling phase.
- Flags: A note for the recorded day, usually indicating an event. This column came through empty and was dropped above.
- Twitch Viewers: The number of Twitch viewers on that day.
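
The date handling described for the DateTime column could look roughly like the following. This is a sketch on a toy index, not the exact step used in this notebook: `normalize()` keeps the datetime dtype, while `strftime` produces date-only labels for display.

```python
import pandas as pd

# Toy index with midnight timestamps, mirroring the daily collection time.
idx = pd.to_datetime(["2020-12-30 00:00:00", "2020-12-31 00:00:00"])
df = pd.DataFrame({"Players": [625344, 561914]}, index=idx)

# normalize() sets the time component to midnight and keeps the datetime64 dtype.
df.index = df.index.normalize()

# strftime gives date-only strings, useful for display only.
date_labels = df.index.strftime("%Y-%m-%d").tolist()
print(date_labels)
```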

In [29]:
# we rename columns; 'DateTime' is the index here, so its entry in the mapping has no effect

rename_dict = {
    "DateTime" : "time",
    "Players" : "players",
    "Twitch Viewers" : 'viewers'
}

ls = list(map(lambda df: df.rename(columns = rename_dict), ls))
ls

[              players  viewers
 DateTime                      
 2011-11-30      680.0      NaN
 2011-12-01        NaN      NaN
 2011-12-02        NaN      NaN
 2011-12-03        NaN      NaN
 2011-12-04        NaN      NaN
 ...               ...      ...
 2020-12-30  1056057.0  77815.0
 2020-12-31   982583.0  60120.0
 2021-01-01  1020715.0  91532.0
 2021-01-02  1079804.0  93678.0
 2021-01-03  1067795.0  94130.0
 
 [3323 rows x 2 columns],
              players  viewers
 DateTime                     
 2011-09-22     194.0      NaN
 2011-09-23     240.0      NaN
 2011-09-24       NaN      NaN
 2011-09-25     233.0      NaN
 2011-09-26     222.0      NaN
 ...              ...      ...
 2020-12-30  625344.0  71052.0
 2020-12-31  561914.0  60719.0
 2021-01-01  642032.0  57219.0
 2021-01-02  694687.0  63547.0
 2021-01-03  692293.0  55268.0
 
 [3392 rows x 2 columns],
              players   viewers
 DateTime                      
 2014-01-07      12.0       NaN
 ... (output truncated)

## Null Values

In [30]:
# we fill NaN with 0

ls = list(map(lambda df: df.fillna(0), ls))
ls[0]
    

DateTime,players,viewers
2011-11-30,680.0,0.0
2011-12-01,0.0,0.0
2011-12-02,0.0,0.0
2011-12-03,0.0,0.0
2011-12-04,0.0,0.0
...,...,...
2020-12-30,1056057.0,77815.0
2020-12-31,982583.0,60120.0
2021-01-01,1020715.0,91532.0
2021-01-02,1079804.0,93678.0


Our data includes sparse player and viewer counts from before, and shortly after, each game's release, so we will need to adjust the start of each time series to obtain data that can be modelled.

In [31]:
# We convert players and viewers from float to int

for df in ls:
    df['players'] = df['players'].astype(int)
    df['viewers'] = df['viewers'].astype(int)

In [32]:
# We update time boundaries to exclude pre-release days and the period immediately after each game's release

csgo = ls[0]["2015-01-01":]
dota = ls[1]["2015-01-01":]
rl = ls[2]["2015-07-27":]
tf = ls[3]["2012-01-01":]

ls = [csgo, dota, rl, tf]
ls = list(map(lambda df: df.reset_index(), ls))
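
To illustrate the trimming step, here is a small sketch of label-based slicing on a `DatetimeIndex` followed by `reset_index`. The dates and counts are made up; the leading zeros stand in for the filled pre-release values.

```python
import pandas as pd

# Toy daily series: zeros before a hypothetical start date, counts after.
idx = pd.date_range("2014-12-29", periods=6, freq="D")
df = pd.DataFrame({"players": [0, 0, 0, 120, 135, 128]}, index=idx)
df.index.name = "DateTime"

# Label-based slicing on a sorted DatetimeIndex includes the start date.
trimmed = df.loc["2015-01-01":]

# reset_index moves the dates into a regular 'DateTime' column.
trimmed = trimmed.reset_index()
print(trimmed)
```

After `reset_index`, the frame has a fresh integer index starting at 0, which matters for any later step that joins or looks up rows by position.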

### Introductory Feature Engineering
There are some features that we need to address as soon as possible, though others will likely reveal themselves as we continue on. As stated earlier, we must manually address the issue of events not being encoded in our data. We will be constructing a column labelled 'event', encoded as '1' for an event ongoing, or '0' for no event. 

Furthermore, we should construct columns for the percent change in players and viewers, giving us a measure of day-over-day change.

#### Events
We made a csv file named 'events' that contains the date and one column per game: 1 for an ongoing event, 0 for no event. Team Fortress 2 is not included in this file, because we did not record any events for it over the period covered by our data.
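
A file like this could also be generated from a list of event windows rather than filled in by hand. The sketch below assumes hypothetical (start, end) date pairs; the dates are invented for illustration and do not come from SteamDB.

```python
import pandas as pd

# Hypothetical event windows (start, end), inclusive; dates are made up.
event_windows = [("2020-01-03", "2020-01-05"), ("2020-01-08", "2020-01-08")]

# One row per calendar day over the period of interest.
events = pd.DataFrame({"time": pd.date_range("2020-01-01", "2020-01-10", freq="D")})
events["csgo"] = 0

# Mark every day that falls inside any window with 1.
for start, end in event_windows:
    in_window = events["time"].between(pd.Timestamp(start), pd.Timestamp(end))
    events.loc[in_window, "csgo"] = 1

print(events["csgo"].tolist())
```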

In [33]:
# we construct our 'event' column. 

In [34]:
# we import the csv file we constructed for this process

df_events = pd.read_csv('data/Raw/events.csv').fillna(0)

df_events['csgo'].value_counts()

0.0    3211
1.0     112
Name: csgo, dtype: int64

In [35]:
# we format the 'time' column to datetime (vectorized, which avoids chained assignment)

df_events['time'] = pd.to_datetime(df_events['time'].str.replace('-', '', regex = False), format = '%Y%m%d')

In [36]:
# we establish event df for each game

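df_csgo = df_events.set_index('time')['csgo']
df_dota = df_events.set_index('time')['dota']
df_rl = df_events.set_index('time')['rl']


In [37]:
# and join the player count dataframes with these event series on the date,
# so that each day is matched to its own event flag rather than by row position

csgo_join = ls[0].join(df_csgo, on = 'DateTime', how = 'inner')
dota_join = ls[1].join(df_dota, on = 'DateTime', how = 'inner')
rl_join = ls[2].join(df_rl, on = 'DateTime', how = 'inner')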
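
Because the player dataframes were trimmed and re-indexed, it is worth checking that event flags line up with the right dates. The toy example below contrasts a join on row position with a join keyed on the date; all values are made up.

```python
import pandas as pd

# Toy player data starting later than the events calendar (values are made up).
players = pd.DataFrame(
    {"DateTime": pd.date_range("2020-01-03", periods=3, freq="D"), "players": [10, 12, 11]}
)
events = pd.DataFrame(
    {"time": pd.date_range("2020-01-01", periods=5, freq="D"), "csgo": [1, 0, 0, 0, 1]}
)

# Positional join: row 0 of players meets row 0 of events, regardless of date.
by_position = players.join(events["csgo"], how="inner")

# Date-keyed join: each row is matched to the event flag for its own date.
by_date = players.join(events.set_index("time")["csgo"], on="DateTime", how="inner")

print(by_position["csgo"].tolist())
print(by_date["csgo"].tolist())
```

The two results differ whenever the first row of the player data does not fall on the first date of the events file.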


In [38]:
# finally, we rename the column to reflect 'event', we also need to add an event column for tf
# even though this column will be fully 0

csgo = csgo_join.rename(columns = {'csgo':'event'})
dota = dota_join.rename(columns = {'dota':'event'})
rl = rl_join.rename(columns = {'rl':'event'})

tf = ls[3]
tf['event'] = 0 


In [39]:
# we re-establish our list of dataframes

ls = [csgo, dota, rl, tf]

In [40]:
# we relabel the numeric event flag as 'Yes'/'No' (vectorized, which avoids chained assignment)

for df in ls:
    df['event'] = df['event'].eq(1).map({True: 'Yes', False: 'No'})
        

### Marginal change in Players/Viewers

This will be a helpful column to construct now, as the change in players/viewers is arguably more informative than the raw counts. It also puts our games on a common scale, so that we can compare them even though their player bases differ greatly in size.

In [41]:
# we get percent change of our players and viewers columns

for df in ls:
    df['%chg_players'] = df['players'].pct_change()
    df['%chg_viewers'] = df['viewers'].pct_change()
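
One caveat with percent change on zero-filled data: moving from 0 to a positive count gives an infinite value, and 0 to 0 gives NaN. The sketch below shows this on a toy series and one possible way to handle it; replacing infinities with NaN is our suggestion here, not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# Toy counts: the leading zeros stand in for missing values filled with 0.
viewers = pd.Series([0, 0, 50, 100, 0, 80])

pct = viewers.pct_change()
# 0 -> 0 gives NaN, 0 -> 50 gives inf, 50 -> 100 gives 1.0, 100 -> 0 gives -1.0
print(pct.tolist())

# Replace infinite values with NaN so summaries and plots are not distorted.
pct_clean = pct.replace([np.inf, -np.inf], np.nan)
print(pct_clean.tolist())
```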


### Game Column

We should add a column to all of our dataframes that will specify which game it is for.

In [42]:
for i, df in enumerate(ls):
    df['title'] = labels[i]
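
With a title column in place, the per-game dataframes can be stacked and compared in one pass. This is a minimal sketch with two invented games and made-up counts.

```python
import pandas as pd

# Toy frames standing in for two games (values are made up).
a = pd.DataFrame({"players": [100, 110, 121], "title": "Game A"})
b = pd.DataFrame({"players": [1000, 900, 990], "title": "Game B"})

# Stack the frames; the 'title' column keeps track of each row's game.
combined = pd.concat([a, b], ignore_index=True)

# Per-game summary, e.g. the mean player count.
mean_players = combined.groupby("title")["players"].mean()
print(mean_players)
```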

In [43]:
ls[3]

,DateTime,players,viewers,event,%chg_players,%chg_viewers,title
0,2012-01-01,46374,0,No,,,Team Fortress 2
1,2012-01-02,45005,0,No,-0.029521,,Team Fortress 2
2,2012-01-03,40923,0,No,-0.090701,,Team Fortress 2
3,2012-01-04,43962,0,No,0.074261,,Team Fortress 2
4,2012-01-05,46017,0,No,0.046745,,Team Fortress 2
...,...,...,...,...,...,...,...
3286,2020-12-30,100552,988,No,-0.142026,-0.943012,Team Fortress 2
3287,2020-12-31,112504,932,No,0.118864,-0.056680,Team Fortress 2
3288,2021-01-01,109008,718,No,-0.031074,-0.229614,Team Fortress 2
3289,2021-01-02,111686,839,No,0.024567,0.168524,Team Fortress 2


In this introductory section, we have:

* Cleaned our data, including datatype and null value handling.
* Manually addressed the 'Flags' column, which did not translate correctly from our data source, by constructing a csv file of events and joining it with our dataframes.
* Completed some basic feature engineering, giving us the percent change in both players and viewers, which will be extremely helpful in our EDA.

Next, we will be completing Exploratory Data Analysis, in which we will be exploring trends in our data, relationships between predictors, and much more. 

In [44]:
# we export our dataframes to use in the next notebook

csgo.to_csv('data/Clean/csgo.csv')
dota.to_csv('data/Clean/dota.csv')
rl.to_csv('data/Clean/rl.csv')
tf.to_csv('data/Clean/tf.csv')
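
When these files are read back in the next notebook, the integer index and the datetime dtype need to be restored explicitly. The round trip below is a sketch using an in-memory buffer in place of the files on disk.

```python
import io

import pandas as pd

# Toy cleaned frame with two rows of player counts.
clean = pd.DataFrame(
    {"DateTime": pd.to_datetime(["2021-01-01", "2021-01-02"]), "players": [109008, 111686]}
)

# Write to an in-memory buffer instead of disk for this sketch.
buffer = io.StringIO()
clean.to_csv(buffer)  # the integer index is written as an unnamed first column
buffer.seek(0)

# index_col=0 restores that index; parse_dates restores the datetime dtype.
reloaded = pd.read_csv(buffer, index_col=0, parse_dates=["DateTime"])
print(reloaded.dtypes)
```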
