# Analyzing and Predicting Player Count Trends in Online Games

## Intro

As we enter the third decade of the 21st century, our world grows more interconnected every day. Activities once done alone are now shared online, and video games are no different. An estimated 2.7 billion gamers exist worldwide, and this number is likely to keep growing given the impact COVID-19 has had, and will continue to have, on our culture. With that in mind, businesses that serve these customers must be able to handle these changes. They must predict how their player base will change over the long run to plan around capital constraints, over the medium run to time events and announcements for the largest possible audience, and over the short run to decide when to employ short-term retention strategies.

## Our Project
We will use data on concurrent players for multiple games to predict the number of active players. We do this on a long-term, medium-term, and short-term basis so that we can answer business questions across all time horizons. Breaking these investigations down, we will:
- Make a long-run (LR) prediction for player base, which will help us decide how a business should increase or decrease their capital, namely servers, the primary capital for most online game businesses. 
- Make a medium-run (MR) prediction for player base, which will help us show when most players are online over the course of a year. Knowing yearly trends will assist in planning events for players, which will lead to increased player retention. 
- Make a short-run (SR) prediction for player base, which will help show short-term trends. This will show businesses where short-term retention strategies would be most applicable.
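
One way to view the same daily series at these three horizons is to aggregate it at different frequencies. The sketch below uses a small synthetic daily series (not our Steam data) with pandas `resample`; the series name `players` and the choice of the mean as the aggregate are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic daily player counts covering two full years (illustrative only)
idx = pd.date_range("2019-01-01", "2020-12-31", freq="D")
rng = np.random.default_rng(0)
players = pd.Series(1000 + rng.integers(0, 200, size=len(idx)), index=idx, name="players")

# Long run: one averaged value per year ("YS" = year start)
yearly = players.resample("YS").mean()

# Medium run: one averaged value per month ("MS" = month start)
monthly = players.resample("MS").mean()

# Short run: one averaged value per week
weekly = players.resample("W").mean()

print(len(yearly), len(monthly), len(weekly))
```

Coarser aggregates smooth out day-to-day noise, which is why the long-run view is better suited to capacity questions than to retention questions.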

### Project Map
First, we need to collect data for our predictions. This will come from Steam, the dominant platform in PC gaming. While this will not capture the full population of gamers worldwide, Steam is a strong choice of data source: a large share of online gamers play on PC rather than console, and Steam leads the PC gaming market by a wide margin.

Our data is structured quite simply. The raw data contains the date, the number of players during that day, any flags associated with the date (usually events), and the number of Twitch viewers at that time. We will drop the number of Twitch viewers for our modelling phase, as it is not helpful in predicting the number of players. However, taking Twitch viewers into account during our EDA phase may be useful in giving richer context.

We will have to build models for multiple games in order to avoid overfitting our model to a single game. While a single business may wish to have a model that is precisely fitted to their product, we are attempting to construct a model that can be applied to any online game. 

During our modelling phase, we will need to adjust parameters to account for the different time periods we wish to predict. To do this, we will use SARIMA or SARIMAX, two time series models that account for seasonal patterns in data (SARIMAX additionally accepts exogenous regressors). When we model, our "seasons" will adjust to the time period we are analyzing: LR is yearly, MR is monthly, and SR is weekly.

## Data Collection
We established criteria for selecting games from steamdb.info. Our selected games to investigate should:
1. Be online. 
2. Be popular.
3. Not be a new release.
4. Have in-game events, seasonal or non-seasonal.
5. Have in-game rewards.

While it is not imperative for our games to meet all of these criteria, the more that are met, the better we can apply the given game to our investigation. After some initial research on SteamDB, the following games meet most of our criteria, and will be used for our investigation:

- Rocket League, 2015
- Counter-Strike: Global Offensive, 2012
- Dota 2, 2013
- Team Fortress 2, 2007 *

*Team Fortress 2 has limited events and rewards, but has an extremely large sample, so it acts as a useful control compared to the other games selected.

## Raw Data

This section serves to introduce our data, make some basic alterations for ease of use, and calculate our summary statistics. 

In [53]:
# we begin with importing introductory libraries

!pip install -U fsds

from fsds.imports import *

# pandas is used directly below, so we also import it explicitly
import pandas as pd

Requirement already up-to-date: fsds in c:\users\rmcar\anaconda\envs\learn-env\lib\site-packages (0.2.27)


In [66]:
# we import our raw data

dota = pd.read_csv('data/dota2.csv')
csgo = pd.read_csv('data/csgo.csv')
rl = pd.read_csv('data/rl.csv')
tf = pd.read_csv('data/tf2.csv')

# we construct a dictionary to use throughout

dictionary = {"dota":dota,
              "csgo":csgo,
              "rl":rl,
              "tf":tf}

dota.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3392 entries, 0 to 3391
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   DateTime        3392 non-null   object 
 1   Players         3364 non-null   float64
 2   Flags           0 non-null      float64
 3   Twitch Viewers  1998 non-null   float64
dtypes: float64(3), object(1)
memory usage: 106.1+ KB


In [55]:
# we inspect columns
dota.columns

Index(['DateTime', 'Players', 'Flags', 'Twitch Viewers'], dtype='object')

### Flags
Upon inspecting this column, we were disappointed to see that the event notes did not carry over when importing our data, leaving the column empty (0 non-null values). We will instead have to fill it in manually from SteamDB in our feature engineering section.
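
A minimal sketch of what that manual step might look like is given here. The event dates are hypothetical placeholders (real ones would be copied by hand from SteamDB), the small frame stands in for one game's data, and the column name `event` is an assumption.

```python
import pandas as pd

# Stand-in frame with the same columns as our raw data (values are illustrative)
df = pd.DataFrame({
    "DateTime": pd.date_range("2020-12-20", periods=7, freq="D"),
    "Players": [100.0, 120.0, 150.0, 170.0, 160.0, 130.0, 110.0],
})

# Hypothetical event dates; not taken from SteamDB
event_dates = pd.to_datetime(["2020-12-22", "2020-12-23"])

# 1 on days with an event, 0 otherwise
df["event"] = df["DateTime"].isin(event_dates).astype(int)
print(df)
```

A 0/1 column like this could later serve as an exogenous regressor if a model that accepts one is used.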

In [62]:
# we drop 'Flags'

ls = [df.drop(columns='Flags') for df in dictionary.values()]

In [63]:
ls

[                 DateTime   Players  Twitch Viewers
 0     2011-09-22 00:00:00     194.0             NaN
 1     2011-09-23 00:00:00     240.0             NaN
 2     2011-09-24 00:00:00       NaN             NaN
 3     2011-09-25 00:00:00     233.0             NaN
 4     2011-09-26 00:00:00     222.0             NaN
 ...                   ...       ...             ...
 3387  2020-12-30 00:00:00  625344.0         71052.0
 3388  2020-12-31 00:00:00  561914.0         60719.0
 3389  2021-01-01 00:00:00  642032.0         57219.0
 3390  2021-01-02 00:00:00  694687.0         63547.0
 3391  2021-01-03 00:00:00  692293.0         55268.0
 
 [3392 rows x 3 columns],
                  DateTime    Players  Twitch Viewers
 0     2011-11-30 00:00:00      680.0             NaN
 1     2011-12-01 00:00:00        NaN             NaN
 2     2011-12-02 00:00:00        NaN             NaN
 3     2011-12-03 00:00:00        NaN             NaN
 4     2011-12-04 00:00:00        NaN             NaN
 ...        

A breakdown of the columns in our raw data:
- DateTime: Shows the date in the format YYYY-MM-DD HH:MM:SS. However, data is collected daily at 00:00:00, so we will need to format this column so that only the date shows, and convert it to a pandas datetime type.
- Players: Shows the number of players during the day. This will be our target during our modelling phase.
- Flags: A note column for the recorded day, usually indicating an event. It is empty in our import, so we dropped it above.
- Twitch Viewers: Shows the number of Twitch viewers during that day.
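
The DateTime conversion described in the list can be sketched as follows. The three rows are copied from the Dota 2 output above; setting the column as the index is an assumption about how the data will be used later, not a step taken in this section.

```python
import pandas as pd

# Stand-in rows copied from the first lines of the Dota 2 data
df = pd.DataFrame({
    "DateTime": ["2011-09-22 00:00:00", "2011-09-23 00:00:00", "2011-09-24 00:00:00"],
    "Players": [194.0, 240.0, None],
})

# Parse the strings into pandas datetimes; normalize() sets the time to midnight,
# so only the date component carries information
df["DateTime"] = pd.to_datetime(df["DateTime"]).dt.normalize()

# A DatetimeIndex makes later resampling and time series modelling easier
df = df.set_index("DateTime")
print(df.index.dtype)
```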

In [65]:
# we rename columns

rename_dict = {
    "DateTime" : "time",
    "Players" : "players",
    "Twitch Viewers" : 'viewers'
}

ls = [df.rename(columns=rename_dict) for df in ls]

ls

[                     time   players  viewers
 0     2011-09-22 00:00:00     194.0      NaN
 1     2011-09-23 00:00:00     240.0      NaN
 2     2011-09-24 00:00:00       NaN      NaN
 3     2011-09-25 00:00:00     233.0      NaN
 4     2011-09-26 00:00:00     222.0      NaN
 ...                   ...       ...      ...
 3387  2020-12-30 00:00:00  625344.0  71052.0
 3388  2020-12-31 00:00:00  561914.0  60719.0
 3389  2021-01-01 00:00:00  642032.0  57219.0
 3390  2021-01-02 00:00:00  694687.0  63547.0
 3391  2021-01-03 00:00:00  692293.0  55268.0
 
 [3392 rows x 3 columns],
                      time    players  viewers
 0     2011-11-30 00:00:00      680.0      NaN
 1     2011-12-01 00:00:00        NaN      NaN
 2     2011-12-02 00:00:00        NaN      NaN
 3     2011-12-03 00:00:00        NaN      NaN
 4     2011-12-04 00:00:00        NaN      NaN
 ...                   ...        ...      ...
 3318  2020-12-30 00:00:00  1056057.0  77815.0
 3319  2020-12-31 00:00:00   982583.0  60120

In [None]:
# we construct some summary stats to get a feel for our data
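for df in ls:
    # the loop body is missing in the source; describe() is an assumed minimal completion
    print(df.describe())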