# Sim-Racing Steam DB analysis

Hypothesis: "Sim racing is getting more & more popular"

*Analysis of the Steam database with 10 games tagged "Automobile Sim".*

The data are .csv files from the SteamDB website, which fetches its data directly from the Steam API.

## Import libraries


In [1]:
import os
import pandas as pd
import numpy as np
import glob
import seaborn as sns
import matplotlib.pyplot as plt

## Load & concatenate all data sets
All .csv files have the same structure and the same variables.

In [2]:
folder_path = '/Users/macbook/Dropbox/Mac/Documents/Pro/Data_Analyst/sim-racing-steam-db/data/raw'
dfs = {}

for file in os.listdir(folder_path):
    if file.endswith('.csv'):
        game_name = os.path.splitext(file)[0]  # Extract game name from filename
        df = pd.read_csv(os.path.join(folder_path, file))
        df['game'] = game_name
        dfs[game_name] = df


In [3]:
merged_df = pd.concat(dfs.values(), ignore_index=True)

### Checking the load & merge

In [4]:
merged_df.head(20)

Unnamed: 0,DateTime,Players,Average Players,Twitch Viewers,game,DateTime;Players;Average Players;Twitch Viewers;game
0,2013-10-01 00:00:00,8.0,,,Assetto_corsa,
1,2013-10-02 00:00:00,,,,Assetto_corsa,
2,2013-10-03 00:00:00,,,,Assetto_corsa,
3,2013-10-04 00:00:00,,,,Assetto_corsa,
4,2013-10-05 00:00:00,,,,Assetto_corsa,
5,2013-10-06 00:00:00,,,,Assetto_corsa,
6,2013-10-07 00:00:00,,,,Assetto_corsa,
7,2013-10-08 00:00:00,,,,Assetto_corsa,
8,2013-10-09 00:00:00,,,,Assetto_corsa,
9,2013-10-10 00:00:00,,,,Assetto_corsa,


In [5]:
merged_df.tail(20)

Unnamed: 0,DateTime,Players,Average Players,Twitch Viewers,game,DateTime;Players;Average Players;Twitch Viewers;game
38645,2024-01-09 10:40:00,699.0,,22.0,DiRT_rally_2.0_,
38646,2024-01-09 10:50:00,692.0,,22.0,DiRT_rally_2.0_,
38647,2024-01-09 11:00:00,712.0,966.0,11.0,DiRT_rally_2.0_,
38648,2024-01-09 11:10:00,729.0,,11.0,DiRT_rally_2.0_,
38649,2024-01-09 11:20:00,753.0,,11.0,DiRT_rally_2.0_,
38650,2024-01-09 11:30:00,776.0,,11.0,DiRT_rally_2.0_,
38651,2024-01-09 11:40:00,800.0,,11.0,DiRT_rally_2.0_,
38652,2024-01-09 11:50:00,828.0,,11.0,DiRT_rally_2.0_,
38653,2024-01-09 12:00:00,843.0,966.0,14.0,DiRT_rally_2.0_,
38654,2024-01-09 12:10:00,852.0,,14.0,DiRT_rally_2.0_,


We see that in 2013 there was one player record per day, whereas the most recent records were taken every 10 minutes.
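
Because the sampling interval changes over time, row counts and raw averages are not directly comparable between early and recent years. One possible way to put all records on a common daily grid is sketched below. It runs on a small made-up frame (`sample`, with a hypothetical `demo_game`) that only mimics the column names of the real data, and the choice of a daily mean is an assumption rather than something this notebook does.

```python
import pandas as pd

# Made-up sample: one daily record and three 10-minute records.
sample = pd.DataFrame({
    'DateTime': ['2013-10-01 00:00:00',
                 '2024-01-09 10:40:00',
                 '2024-01-09 10:50:00',
                 '2024-01-09 11:00:00'],
    'Players': [8.0, 699.0, 692.0, 712.0],
    'game': ['demo_game'] * 4,
})
sample['DateTime'] = pd.to_datetime(sample['DateTime'])

# Truncate each timestamp to its calendar day, then average per game and day.
day = sample['DateTime'].dt.floor('D')
daily = (
    sample
    .groupby(['game', day])['Players']
    .mean()
    .reset_index()
)
print(daily)
```

Each game then contributes at most one row per day, whatever the original recording frequency was.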

In [6]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38665 entries, 0 to 38664
Data columns (total 6 columns):
 #   Column                                                Non-Null Count  Dtype  
---  ------                                                --------------  -----  
 0   DateTime                                              36175 non-null  object 
 1   Players                                               33790 non-null  float64
 2   Average Players                                       10587 non-null  float64
 3   Twitch Viewers                                        32387 non-null  float64
 4   game                                                  38665 non-null  object 
 5   DateTime;Players;Average Players;Twitch Viewers;game  2490 non-null   object 
dtypes: float64(3), object(3)
memory usage: 1.8+ MB


We need to convert the DateTime column to a datetime type. The merge also added an extra column that we need to drop.

In [7]:
round(merged_df.describe())

Unnamed: 0,Players,Average Players,Twitch Viewers
count,33790.0,10587.0,32387.0
mean,9065.0,8455.0,1165.0
std,11098.0,8626.0,3614.0
min,1.0,265.0,0.0
25%,1609.0,2040.0,44.0
50%,4158.0,6321.0,236.0
75%,12690.0,11906.0,894.0
max,81096.0,44392.0,127965.0


## Data preparation & cleaning

Let's drop the extra column and rename the column headers.

In [8]:
print(merged_df.columns)

Index(['DateTime', 'Players', 'Average Players', 'Twitch Viewers', 'game',
       'DateTime;Players;Average Players;Twitch Viewers;game'],
      dtype='object')


In [9]:
# drop extra column
merged_df.drop(columns=['DateTime;Players;Average Players;Twitch Viewers;game'], inplace=True)

In [10]:
print(merged_df.columns)

Index(['DateTime', 'Players', 'Average Players', 'Twitch Viewers', 'game'], dtype='object')


In [11]:
# rename column headers
merged_df.columns = ['datetime', 'players', 'average_players', 'twitch_viewers', 'game']

In [12]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38665 entries, 0 to 38664
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   datetime         36175 non-null  object 
 1   players          33790 non-null  float64
 2   average_players  10587 non-null  float64
 3   twitch_viewers   32387 non-null  float64
 4   game             38665 non-null  object 
dtypes: float64(3), object(2)
memory usage: 1.5+ MB


In [13]:
merged_df['datetime'] = merged_df['datetime'].astype('datetime64[s]')
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38665 entries, 0 to 38664
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype        
---  ------           --------------  -----        
 0   datetime         36175 non-null  datetime64[s]
 1   players          33790 non-null  float64      
 2   average_players  10587 non-null  float64      
 3   twitch_viewers   32387 non-null  float64      
 4   game             38665 non-null  object       
dtypes: datetime64[s](1), float64(3), object(1)
memory usage: 1.5+ MB


In [14]:
# create an independent copy of the dataframe to test functions
# (plain assignment would only create a second name for the same object)
df_test = merged_df.copy()

In [15]:
# check for mixed types: flag columns where a value's Python type differs from the first row's
for col in df_test.columns.tolist():
    weird = (df_test[[col]].applymap(type) != df_test[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_test[weird]) > 0:
        print(col)

datetime
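
The check most likely flags `datetime` because missing values are stored as `NaT`, whose Python type differs from that of a valid `Timestamp`, and not because the column mixes real data types. A small sketch on made-up values (the series `s` is hypothetical) illustrates the difference:

```python
import pandas as pd

# Made-up column with one valid timestamp and one missing value.
s = pd.Series(pd.to_datetime(['2024-01-09 10:40:00', None]))

# Element-wise Python types: Timestamp for the valid entry, NaTType for the missing one.
types = [type(value) for value in s]
print(types)

# The column as a whole still has a single datetime dtype.
print(s.dtype)
```

Under that reading, the flagged column points to the missing dates examined in the following cells.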


In [16]:
merged_df.tail()

Unnamed: 0,datetime,players,average_players,twitch_viewers,game
38660,2024-01-09 13:10:00,913.0,,18.0,DiRT_rally_2.0_
38661,2024-01-09 13:20:00,923.0,,18.0,DiRT_rally_2.0_
38662,2024-01-09 13:30:00,,,18.0,DiRT_rally_2.0_
38663,2024-01-09 13:40:00,,,18.0,DiRT_rally_2.0_
38664,2024-01-09 13:50:00,,,18.0,DiRT_rally_2.0_


In [17]:
# Checking the Null values
print(merged_df.isnull().sum())


datetime            2490
players             4875
average_players    28078
twitch_viewers      6278
game                   0
dtype: int64


In [18]:
null_dates = merged_df[merged_df['datetime'].isnull()]
null_dates

Unnamed: 0,datetime,players,average_players,twitch_viewers,game
14875,NaT,,,,American_truck
14876,NaT,,,,American_truck
14877,NaT,,,,American_truck
14878,NaT,,,,American_truck
14879,NaT,,,,American_truck
...,...,...,...,...,...
17360,NaT,,,,American_truck
17361,NaT,,,,American_truck
17362,NaT,,,,American_truck
17363,NaT,,,,American_truck


Every row with a missing datetime belongs to American_truck, so we need to re-import the data set for that game.
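
The 2490 non-null values in the stray `DateTime;Players;Average Players;Twitch Viewers;game` column match the 2490 missing dates, which suggests that this one file uses `;` as its delimiter instead of `,`. A minimal sketch of the re-import under that assumption follows. The inline text is made-up data standing in for the real file, and `raw_text` and `df_american` are illustrative names.

```python
import io
import pandas as pd

# Made-up stand-in for the American_truck file: same header, ';' as delimiter.
raw_text = (
    "DateTime;Players;Average Players;Twitch Viewers;game\n"
    "2024-01-09 10:40:00;100;;5;American_truck\n"
    "2024-01-09 10:50:00;110;;5;American_truck\n"
)

# With the real file, pass its path instead of the StringIO object.
df_american = pd.read_csv(
    io.StringIO(raw_text),
    sep=';',                   # assumed delimiter of this one file
    parse_dates=['DateTime'],  # parse the timestamps while reading
)

print(df_american.dtypes)
print(df_american.shape)
```

Once the file is read with the right delimiter, it has the same columns as the other games and can be concatenated with them without producing the extra column.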

In [19]:
game = merged_df['game'].unique()
game

array(['Assetto_corsa', 'Forza Horizon 5 Steam Charts',
       'Forza Horizon 4 Steam Charts', 'Assetto_corsa_competizione',
       'American_truck', 'BeamNG', 'Euro_truck_2',
       'Automobilista 2 Steam Charts', 'CarX_drift_racing_online',
       'DiRT_rally_2.0_'], dtype=object)

In [20]:
# .sum() counts the null values per column (.count() would only count the rows)
merged_df[merged_df['game'] == 'American_truck'].isna().sum()

datetime           2490
players            2490
average_players    2490
twitch_viewers     2490
game                  0
dtype: int64

In [21]:
import plotly.express as px

fig = px.histogram(merged_df, x='datetime')
fig.show()

## Exploratory analysis & visualisation

In [22]:
# Select only the rows dated within 2020 (2020-01-01 to 2020-12-31 inclusive).
# Bounds must be timestamps: a bare 2020-1-1 is integer arithmetic (2018).
start = pd.Timestamp('2020-01-01')
end = pd.Timestamp('2021-01-01')  # exclusive upper bound, keeps all of 2020-12-31
df_filtered = merged_df[(merged_df['datetime'] >= start) & (merged_df['datetime'] < end)]

In [None]:
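import plotly.express as px

# NOTE: df_american is not defined in the cells above; it is assumed to be the
# re-imported American_truck data with its original column names.
fig = px.histogram(df_american, x='Players')
fig.show()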

In [None]:
import plotly.express as px

fig = px.histogram(df_american, x='Twitch Viewers')
fig.show()

In [None]:
boxplot = px.box(df_american,x='Players')
boxplot.show()

## Ask & Answer questions

## Summary & conclusion