# ETL process (output_steam_games file)

This notebook documents an ETL process applied to data from the Steam video game platform. The goal is to clean the data and prepare it for later analysis. We start by opening the file with the appropriate Python libraries and modules, then apply some simple transformations, and finally save the resulting file for its later use.

In [2]:
#Importing the libraries
import pandas as pd
import numpy as np
import json
import ast

# Opening the file and exploring the data

In [4]:
df_games = pd.read_json("output_steam_games.json", lines=True)

In [6]:
df_games.head(3)

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,,,,,,,,,,,,,
1,,,,,,,,,,,,,
2,,,,,,,,,,,,,


In [7]:
df_games.columns

Index(['publisher', 'genres', 'app_name', 'title', 'url', 'release_date',
       'tags', 'reviews_url', 'specs', 'price', 'early_access', 'id',
       'developer'],
      dtype='object')

In [8]:
df_games.shape

(120445, 13)

In [9]:
#Getting general information about the dataframe
df_games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120445 entries, 0 to 120444
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   publisher     24083 non-null  object 
 1   genres        28852 non-null  object 
 2   app_name      32133 non-null  object 
 3   title         30085 non-null  object 
 4   url           32135 non-null  object 
 5   release_date  30068 non-null  object 
 6   tags          31972 non-null  object 
 7   reviews_url   32133 non-null  object 
 8   specs         31465 non-null  object 
 9   price         30758 non-null  object 
 10  early_access  32135 non-null  float64
 11  id            32133 non-null  float64
 12  developer     28836 non-null  object 
dtypes: float64(2), object(11)
memory usage: 11.9+ MB
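
The `info()` output above reports 120445 rows, yet no column has more than 32135 non-null values, which suggests that a large share of the rows is entirely empty. A minimal sketch of how this could be checked, using a small hypothetical frame (`toy`, with made-up values) in place of the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_games:
# two fully empty rows followed by two partially filled rows
toy = pd.DataFrame({
    "app_name": [np.nan, np.nan, "Game A", "Game B"],
    "price": [np.nan, np.nan, np.nan, 4.99],
})

# A row counts as fully empty when every column in it is null
fully_empty = toy.isna().all(axis=1)
n_empty = int(fully_empty.sum())
print(f"{n_empty} of {len(toy)} rows are fully empty")
```

Applied to `df_games`, the same expression would give the number of rows that a later `dropna(how="all")` is expected to remove.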


In [10]:
#Getting general statistics
df_games.describe()

Unnamed: 0,early_access,id
count,32135.0,32133.0
mean,0.060588,451757.4
std,0.238577,182714.0
min,0.0,10.0
25%,0.0,329280.0
50%,0.0,452060.0
75%,0.0,593400.0
max,1.0,2028850.0


In [27]:
#Null values count
print("\nNull values per column")
{"publisher":df_games["publisher"].isna().sum(),"genres":df_games["genres"].isna().sum(),
 "app_name":df_games["app_name"].isna().sum(),"title":df_games["title"].isna().sum(),
 "url":df_games["url"].isna().sum(),"release_date":df_games["release_date"].isna().sum(),
 "tags":df_games["tags"].isna().sum(),"reviews_url":df_games["reviews_url"].isna().sum(),
 "specs":df_games["specs"].isna().sum(),"price":df_games["price"].isna().sum(),"id":df_games["id"].isna().sum()}


Null values per column


{'publisher': 96362,
 'genres': 91593,
 'app_name': 88312,
 'title': 90360,
 'url': 88310,
 'release_date': 90377,
 'tags': 88473,
 'reviews_url': 88312,
 'specs': 88980,
 'price': 89687,
 'id': 88312}
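
The per-column dictionary above can also be produced without naming each column by hand. A minimal sketch, shown on a small hypothetical frame (`toy`, with made-up values) rather than on `df_games`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror two of the real columns
toy = pd.DataFrame({
    "publisher": [np.nan, "Studio A", np.nan],
    "id": [np.nan, 10.0, 20.0],
})

# isna() flags each null cell, sum() counts the flags per column,
# and to_dict() turns the resulting Series into a plain dictionary
null_counts = toy.isna().sum().to_dict()
print(null_counts)
```

Because the column names come from the frame itself, a label cannot drift away from the column it describes.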

In [28]:
#Dropping columns that are not needed for the analysis
df_games.drop(columns=["url","reviews_url"],inplace=True)

In [30]:
#Checking the resulting dataframe
df_games.head(3)

Unnamed: 0,publisher,genres,app_name,title,release_date,tags,specs,price,early_access,id,developer
0,,,,,,,,,,,
1,,,,,,,,,,,
2,,,,,,,,,,,


In [33]:
#Placing publisher and developer side by side to check how much they overlap
df_comparison=pd.concat([df_games["publisher"],df_games["developer"]],axis=1)
df_comparison

Unnamed: 0,publisher,developer
0,,
1,,
2,,
3,,
4,,
...,...,...
120440,Ghost_RUS Games,"Nikita ""Ghost_RUS"""
120441,Sacada,Sacada
120442,Laush Studio,Laush Dmitriy Sergeevich
120443,SIXNAILS,"xropi,stev3ns"


In [37]:
df_comparison=df_comparison.dropna()
len(df_comparison)

24018

In [36]:
df_comparison.query("publisher==developer")

Unnamed: 0,publisher,developer
88310,Kotoshiro,Kotoshiro
88312,Poolians.com,Poolians.com
88313,彼岸领域,彼岸领域
88315,Trickjump Games Ltd,Trickjump Games Ltd
88317,Poppermost Productions,Poppermost Productions
...,...,...
120435,Retro Army Limited,Retro Army Limited
120437,INGAME,INGAME
120438,Riviysky,Riviysky
120439,Bidoniera Games,Bidoniera Games
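
The truncated table above does not show how many rows match. One way to quantify the overlap is to take the mean of the boolean comparison. A minimal sketch on hypothetical data (`pairs` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical publisher/developer pairs; one row has a missing publisher
pairs = pd.DataFrame({
    "publisher": ["Studio A", "Studio B", np.nan, "Studio C"],
    "developer": ["Studio A", "Dev X", "Dev Y", "Studio C"],
})

# Keep only rows where both values are present, as in the cell above
complete = pairs.dropna()

# Element-wise equality gives a boolean Series; its mean is the share of matches
same = complete["publisher"] == complete["developer"]
share_equal = same.mean()
print(f"{same.sum()} of {len(complete)} complete rows match ({share_equal:.0%})")
```

A high share would support treating one of the two columns as redundant, while a low share would argue for keeping both.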


In [None]:
#Dropping publisher after the comparison with developer above
df_games.drop(columns=["publisher"],inplace=True)

In [50]:
df_games.head(3)

Unnamed: 0,genres,app_name,title,release_date,tags,specs,price,early_access,id,developer
0,,,,,,,,,,
1,,,,,,,,,,
2,,,,,,,,,,


In [44]:
#Placing app_name and title side by side to check how much they overlap
df_comparison2=pd.concat([df_games["app_name"],df_games["title"]],axis=1)
df_comparison2

Unnamed: 0,app_name,title
0,,
1,,
2,,
3,,
4,,
...,...,...
120440,Colony On Mars,Colony On Mars
120441,LOGistICAL: South Africa,LOGistICAL: South Africa
120442,Russian Roads,Russian Roads
120443,EXIT 2 - Directions,EXIT 2 - Directions


In [46]:
df_comparison2=df_comparison2.dropna()
len(df_comparison2)

30085

In [48]:
df_comparison2.query("app_name==title")

Unnamed: 0,app_name,title
88310,Lost Summoner Kitty,Lost Summoner Kitty
88311,Ironbound,Ironbound
88312,Real Pool 3D - Poolians,Real Pool 3D - Poolians
88313,弹炸人2222,弹炸人2222
88315,Battle Royale Trainer,Battle Royale Trainer
...,...,...
120439,Kebab it Up!,Kebab it Up!
120440,Colony On Mars,Colony On Mars
120441,LOGistICAL: South Africa,LOGistICAL: South Africa
120442,Russian Roads,Russian Roads


In [51]:
#Dropping title after the comparison with app_name above
df_games.drop(columns=["title"],inplace=True)
df_games.head(3)

Unnamed: 0,genres,app_name,release_date,tags,specs,price,early_access,id,developer
0,,,,,,,,,
1,,,,,,,,,
2,,,,,,,,,


In [53]:
#Placing genres and tags side by side to check how much they overlap
df_comparison3=pd.concat([df_games["genres"],df_games["tags"]],axis=1)
df_comparison3

Unnamed: 0,genres,tags
0,,
1,,
2,,
3,,
4,,
...,...,...
120440,"[Casual, Indie, Simulation, Strategy]","[Strategy, Indie, Casual, Simulation]"
120441,"[Casual, Indie, Strategy]","[Strategy, Indie, Casual]"
120442,"[Indie, Racing, Simulation]","[Indie, Simulation, Racing]"
120443,"[Casual, Indie]","[Indie, Casual, Puzzle, Singleplayer, Atmosphe..."


In [54]:
df_comparison3=df_comparison3.dropna()
len(df_comparison3)

28828

In [55]:
df_comparison3.query("genres==tags")

Unnamed: 0,genres,tags
88313,"[Action, Adventure, Casual]","[Action, Adventure, Casual]"
88316,"[Free to Play, Indie, Simulation, Sports]","[Free to Play, Indie, Simulation, Sports]"
88317,"[Free to Play, Indie, Simulation, Sports]","[Free to Play, Indie, Simulation, Sports]"
88318,"[Free to Play, Indie, Simulation, Sports]","[Free to Play, Indie, Simulation, Sports]"
88326,"[Free to Play, Indie, Simulation, Sports]","[Free to Play, Indie, Simulation, Sports]"
...,...,...
120406,"[Action, Adventure, Casual, Indie, RPG]","[Action, Adventure, Casual, Indie, RPG]"
120408,"[Utilities, Video Production]","[Utilities, Video Production]"
120412,[Indie],[Indie]
120432,[Indie],[Indie]
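
Comparing two list columns with `==` only matches rows where both lists hold the same items in the same order. Row 120441 in the earlier side-by-side table, for example, carries the same three labels in both columns but in a different order, so it would not count as equal. A minimal sketch of an order-insensitive comparison, using a hypothetical frame (`lists`) with made-up rows:

```python
import pandas as pd

# Hypothetical rows: same items in a different order, different items, identical lists
lists = pd.DataFrame({
    "genres": [["Casual", "Indie"], ["Indie", "Racing"], ["Indie"]],
    "tags": [["Indie", "Casual"], ["Indie", "Simulation"], ["Indie"]],
})

# Plain list equality is sensitive to element order
same_order = [g == t for g, t in zip(lists["genres"], lists["tags"])]

# Converting each list to a set ignores order (and repeated items)
same_set = [set(g) == set(t) for g, t in zip(lists["genres"], lists["tags"])]

print(same_order)
print(same_set)
```

Whether order should matter depends on how the two columns will be used later, so this is an alternative view rather than a correction.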


In [61]:
#Removing rows in which every remaining column is empty
df_games.dropna(how="all",inplace=True)
df_games.head(3)

Unnamed: 0,genres,app_name,release_date,tags,specs,price,early_access,id,developer
88310,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",[Single-player],4.99,0.0,761140.0,Kotoshiro
88311,"[Free to Play, Indie, RPG, Strategy]",Ironbound,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...","[Single-player, Multi-player, Online Multi-Pla...",Free To Play,0.0,643980.0,Secret Level SRL
88312,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...","[Single-player, Multi-player, Online Multi-Pla...",Free to Play,0.0,670290.0,Poolians.com


In [63]:
#Replacing the non-numeric labels in price with NaN (assigning back instead of a chained inplace call)
free_labels=["Free to Play","Free To Play","Free Mod","Free Demo","Play for Free!","Play Now",
             "Third-party","Free Movie","Play the Demo","Free to Try","Free HITMAN™ Holiday Pack","Install Now","Play WARMACHINE: Tactics Demo",
             "Install Theme","Free to Use","Free"]
df_games["price"]=df_games["price"].replace(free_labels,np.nan)

In [64]:
#Keeping only the numeric part of the "Starting at" prices
df_games["price"]=df_games["price"].replace(["Starting at $449.00","Starting at $499.00"],["449.00","499.00"])

In [65]:
#Converting price to float and rounding it to 3 decimals
df_games["price"]=df_games["price"].astype("float64")
df_games["price"]=df_games["price"].round(3)

In [67]:
df_games.head(3)

Unnamed: 0,genres,app_name,release_date,tags,specs,price,early_access,id,developer
88310,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",[Single-player],4.99,0.0,761140.0,Kotoshiro
88311,"[Free to Play, Indie, RPG, Strategy]",Ironbound,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...","[Single-player, Multi-player, Online Multi-Pla...",,0.0,643980.0,Secret Level SRL
88312,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...","[Single-player, Multi-player, Online Multi-Pla...",,0.0,670290.0,Poolians.com
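
The price cleanup above lists every non-numeric label explicitly. As an alternative that does not depend on a fixed list, the numeric part can be pulled out with a regular expression. This is only a sketch under assumptions: `parse_price` and `raw` are hypothetical names, the sample values are illustrative, and a label that happens to contain digits would be read as a price.

```python
import re

import numpy as np
import pandas as pd

def parse_price(value):
    """Return the price as a float, or NaN when no number can be found."""
    # Values that are already numeric pass through (NaN stays NaN)
    if isinstance(value, (int, float)):
        return float(value)
    # For text, take the first decimal number that appears, if any
    if isinstance(value, str):
        match = re.search(r"\d+(?:\.\d+)?", value)
        if match:
            return float(match.group())
    # Anything else (None, text without digits) is treated as missing
    return np.nan

# Hypothetical sample mixing the kinds of values seen in the price column
raw = pd.Series(["4.99", "Free to Play", "Starting at $499.00", 0.99, None])
prices = raw.map(parse_price)
print(prices)
```

As in the cells above, free labels end up as NaN here. Mapping them to 0.0 instead would be a separate decision for the analysis stage.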
