# PREPROCESSING

Study of a dataset containing various information about 32135 games published on the Steam gaming platform.

| Column | Description | Example |
| -- | - | - |
| publisher | Company that published the content | [Ubisoft, Dovetail Games - Trains, Degica] |
| genres | Genre of the content | [Action, Adventure, Racing, Simulation, Strategy] |
| app_name | Name of the content | [Warzone, Soundtrack, Puzzle Blocks] |
| title | Title of the content | [The Dream Machine: Chapter 4, Fate/EXTELLA - Sweet Room Dream, Fate/EXTELLA - Charming Bunny] |
| url | URL where the content is published | http://store.steampowered.com/app/761140/Lost_Summoner_Kitty/ |
| release_date | Release date | [2018-01-04] |
| tags | Content tags | [Simulation, Indie, Action, Adventure, Funny, Open World, First-Person, Sandbox, Free to Play] |
| discount_price | Discounted price | [22.66, 0.49, 0.69] |
| reviews_url | Reviews of the content | http://steamcommunity.com/app/681550/reviews/?browsefilter=mostrecent&p=1 |
| specs | Specifications | [Multi-player, Co-op, Cross-Platform Multiplayer, Downloadable Content] |
| price | Price of the content | [4.99, 9.99, Free to Use, Free to Play] |
| early_access | Early access | [False, True] |
| id | Unique content identifier | [761140, 643980, 670290] |
| developer | Developer | [Kotoshiro, Secret Level SRL, Poolians.com] |
| sentiment | Sentiment analysis | [Mixed, Very Positive, Positive, 3 user reviews] |
| metascore | Metacritic score | [80, 74, 77, 75] |



Import relevant libraries.

In [1]:
import pandas as pd
import numpy as np
import ast
import json

## Read Data

I tried to open the JSON file with pd.read_json, but it could not be parsed because of its format.

In [2]:
# pd.read_json()

That's why I decided to use "open" and then transform each row into a dictionary. With the list of all the rows I can create a DataFrame.

In [3]:
allNewLines = []

# Open the file; the context manager closes it once the loop is done.
with open('datasets/steam_games.json') as file:
    # Go through each line.
    for line in file:
        # Create a dictionary from the line.
        myNewdict = ast.literal_eval(line)
        # Append it to the list of all the lines in the file.
        allNewLines.append(myNewdict)

rawGamesDf = pd.DataFrame(allNewLines)
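
As a small illustration of why a JSON reader can fail on this kind of file: assuming each line is a Python dict literal (single-quoted strings, `True`/`False`) rather than strict JSON, which is what the use of `ast.literal_eval` above suggests, `json.loads` rejects a line that `ast.literal_eval` parses. The sample line below is made up for the demonstration.

```python
import ast
import json

# Hypothetical line in the style of the dataset: a Python dict literal, not strict JSON.
sample_line = "{'app_name': 'Ironbound', 'early_access': False, 'id': '643980'}"

try:
    json.loads(sample_line)
    json_ok = True
except json.JSONDecodeError:
    # Strict JSON requires double-quoted strings and lowercase true/false.
    json_ok = False

# literal_eval only evaluates Python literals, so it is safe for untrusted text.
parsed = ast.literal_eval(sample_line)
print(json_ok, parsed["app_name"])
```

If the file really were one JSON object per line, `pd.read_json(path, lines=True)` would be the usual alternative.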

In [4]:
rawGamesDf.head(2)

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,discount_price,reviews_url,specs,price,early_access,id,developer,sentiment,metascore
0,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_...,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",4.49,http://steamcommunity.com/app/761140/reviews/?...,[Single-player],4.99,False,761140,Kotoshiro,,
1,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,http://store.steampowered.com/app/643980/Ironb...,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",,http://steamcommunity.com/app/643980/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free To Play,False,643980,Secret Level SRL,Mostly Positive,


## Columns: url & reviews_url
We do not need any url for the analysis so we will drop them.

In [5]:
rawGamesDf.drop(columns=['url', 'reviews_url'], inplace=True)

In [6]:
rawGamesDf.head(2)

Unnamed: 0,publisher,genres,app_name,title,release_date,tags,discount_price,specs,price,early_access,id,developer,sentiment,metascore
0,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",4.49,[Single-player],4.99,False,761140,Kotoshiro,,
1,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",,"[Single-player, Multi-player, Online Multi-Pla...",Free To Play,False,643980,Secret Level SRL,Mostly Positive,


---

## Duplicates

Check whether there are duplicates.

In [7]:
# Games with duplicate ids, dropna to avoid games with ids = NaN.
rawGamesDf[rawGamesDf.duplicated(keep=False, subset='id')].dropna(subset='id')

Unnamed: 0,publisher,genres,app_name,title,release_date,tags,discount_price,specs,price,early_access,id,developer,sentiment,metascore
13894,Bethesda Softworks,[Action],Wolfenstein II: The New Colossus,Wolfenstein II: The New Colossus,2017-10-26,"[Action, FPS, Gore, Violent, Alternate History...",,"[Single-player, Steam Achievements, Full contr...",59.99,False,612880,Machine Games,Mostly Positive,86
14573,Bethesda Softworks,[Action],Wolfenstein II: The New Colossus,Wolfenstein II: The New Colossus,2017-10-26,"[Action, FPS, Gore, Violent, Alternate History...",,"[Single-player, Steam Achievements, Full contr...",59.99,False,612880,Machine Games,Mostly Positive,86


In [8]:
# keep='last' flags every duplicated id except its last occurrence; the flagged rows are the ones to drop.
index_last_dup_to_remove = rawGamesDf[rawGamesDf.duplicated(keep='last', subset='id')].dropna(subset='id').index
index_last_dup_to_remove

Index([13894], dtype='int64')

In [9]:
rawGamesDf.loc[index_last_dup_to_remove]

Unnamed: 0,publisher,genres,app_name,title,release_date,tags,discount_price,specs,price,early_access,id,developer,sentiment,metascore
13894,Bethesda Softworks,[Action],Wolfenstein II: The New Colossus,Wolfenstein II: The New Colossus,2017-10-26,"[Action, FPS, Gore, Violent, Alternate History...",,"[Single-player, Steam Achievements, Full contr...",59.99,False,612880,Machine Games,Mostly Positive,86


In [10]:
rawGamesDf.drop(index=index_last_dup_to_remove, inplace=True)

In [11]:
# Can't use this one yet because some columns contain lists.
# rawGamesDf.drop_duplicates(inplace=True)
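
To illustrate the limitation noted in the comment above: `drop_duplicates()` needs hashable cell values, so a column of lists raises a `TypeError`. One possible workaround, sketched here on a made-up toy frame (the helper `to_hashable` is hypothetical and not part of the notebook), is to compare a view where lists are converted to tuples.

```python
import pandas as pd

# Toy data: rows 0 and 1 are identical, including their list column.
toy = pd.DataFrame({
    "id": [1, 1, 2],
    "genres": [["Action"], ["Action"], ["Indie", "RPG"]],
})

try:
    toy.drop_duplicates()
    direct_call_failed = False
except TypeError:
    # Lists are unhashable, so pandas cannot compare the rows directly.
    direct_call_failed = True

def to_hashable(value):
    """Turn lists into tuples and leave every other value unchanged."""
    return tuple(value) if isinstance(value, list) else value

# Detect duplicates on the hashable view, then filter the original frame.
hashable_view = toy.apply(lambda col: col.map(to_hashable))
deduplicated = toy[~hashable_view.duplicated()]
print(deduplicated)
```

The original frame keeps its list columns; only the temporary view is converted.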

---

## Endpoints, general

Since the data is static, the most efficient approach is probably to precompute the result for every year, so that each API endpoint can serve it with minimal processing at request time.

Since every endpoint exposed by the API takes a year as input, let's process that first.

### Year

We only need the year for this study, so we create a new column called "year" with that information. The year is extracted with a pattern that matches four consecutive digits.

In [12]:
# Work on a copy so that adding columns does not modify rawGamesDf.
gamesWithYearDf = rawGamesDf.copy()

In [13]:
gamesWithYearDf['release_date'].unique()

array(['2018-01-04', '2017-07-24', '2017-12-07', ..., '2016-11-19',
       'January 2018', '2018-10-01'], dtype=object)

In [14]:
# Regex pattern to extract: four digits starting with 1 or 2 (1XXX or 2XXX).
pattern = r'([1-2][0-9][0-9][0-9])'
gamesWithYearDf['year'] = gamesWithYearDf['release_date'].str.extract(pattern, expand=False)
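
A minimal sketch of how this pattern behaves on differently formatted values. The first two strings use formats seen in `release_date`; "Coming soon" is a made-up placeholder for a value without a year.

```python
import pandas as pd

# Two date formats seen in the column, one made-up value without a year, one missing value.
dates = pd.Series(["2018-01-04", "January 2018", "Coming soon", None])

pattern = r"([1-2][0-9][0-9][0-9])"
# With a single capture group, expand=False returns a Series instead of a DataFrame.
years = dates.str.extract(pattern, expand=False)
print(years.tolist())
```

Values with no match become NaN, and the extracted years are still strings, so a numeric conversion is needed afterwards.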

In [15]:
gamesWithYearDf['year'].head()

0    2018
1    2018
2    2017
3    2017
4     NaN
Name: year, dtype: object

In [16]:
gamesWithYearDf['year'].describe()

count     29966
unique       44
top        2017
freq       9594
Name: year, dtype: object

In [17]:
gamesWithYearDf['year'].isna().sum()

2168

Drop "release_date", and drop the rows where year is NaN, since they provide no relevant information for the API queries.

In [18]:
gamesWithYearDf = gamesWithYearDf.drop(columns='release_date')

In [19]:
gamesWithYearDf.shape

(32134, 14)

In [20]:
gamesWithYearDf = gamesWithYearDf.dropna(subset=['year'])

In [21]:
gamesWithYearDf.shape

(29966, 14)

Lastly, we convert the years to int, check that they are valid years, and drop the invalid ones.

In [22]:
gamesWithYearDf['year'] = pd.to_numeric(gamesWithYearDf['year'])
gamesWithYearDf['year'].info()

<class 'pandas.core.series.Series'>
Index: 29966 entries, 0 to 32133
Series name: year
Non-Null Count  Dtype
--------------  -----
29966 non-null  int64
dtypes: int64(1)
memory usage: 468.2 KB


In [23]:
gamesWithYearDf[gamesWithYearDf['year'] > 2023]

Unnamed: 0,publisher,genres,app_name,title,tags,discount_price,specs,price,early_access,id,developer,sentiment,metascore,year
13428,一次元创作组,"[Casual, Indie, Early Access]",Puzzle Sisters Foer,Puzzle Sisters Foer,"[Early Access, Casual, Indie]",,"[Single-player, Steam Achievements, Steam Trad...",,True,710190,一次元创作组,,,2756


In [24]:
gamesWithYearDf[gamesWithYearDf['year'] < 1970]

Unnamed: 0,publisher,genres,app_name,title,tags,discount_price,specs,price,early_access,id,developer,sentiment,metascore,year


In [25]:
gamesWithYearDf = gamesWithYearDf[gamesWithYearDf['year'] <= 2023]

## Endpoint 1: genero

def genero( Año: str ): Takes a year and returns a list with the 5 best-selling genres, in the corresponding order.

Example:
| Year | Genres |
| -- | - |
| 2018 | {'Indie': 125, 'Action': 75, 'Adventure': 74, 'Casual': 45, 'Simulation': 43} |
| 2016 | {'Indie': 4106, 'Action': 2544, 'Casual': 2141, 'Adventure': 2042, 'Simulation': 1666} |

For this endpoint only the "genres" and "year" columns matter, so we can drop the others.


In [26]:
generoEndpointDf = gamesWithYearDf[['genres', 'year']]
generoEndpointDf.head()

Unnamed: 0,genres,year
0,"[Action, Casual, Indie, Simulation, Strategy]",2018
1,"[Free to Play, Indie, RPG, Strategy]",2018
2,"[Casual, Free to Play, Indie, Simulation, Sports]",2017
3,"[Action, Adventure, Casual]",2017
5,"[Action, Adventure, Simulation]",2018


Now we can drop all the rows that have a NaN in either of the two columns, because they add no information within the scope of the genero function.

In [27]:
generoEndpointDf.describe()

Unnamed: 0,year
count,29965.0
mean,2014.769965
std,3.50409
min,1970.0
25%,2014.0
50%,2016.0
75%,2017.0
max,2021.0


In [28]:
generoEndpointDf.isna().sum()

genres    1234
year         0
dtype: int64

In [29]:
generoEndpointDf = generoEndpointDf.dropna()

In [30]:
generoEndpointDf.isna().sum()

genres    0
year      0
dtype: int64

### Genre

Each game can have several genres: the "genres" column is filled with lists.

For this reason we need to explode it so we can count all the genres separately.

In [31]:
genresCompleteList = generoEndpointDf.explode(column=['genres'])['genres'].unique()
print(genresCompleteList)
print(genresCompleteList.shape)

['Action' 'Casual' 'Indie' 'Simulation' 'Strategy' 'Free to Play' 'RPG'
 'Sports' 'Adventure' 'Racing' 'Massively Multiplayer' 'Early Access'
 'Animation &amp; Modeling' 'Video Production' 'Web Publishing'
 'Education' 'Software Training' 'Utilities' 'Design &amp; Illustration'
 'Audio Production' 'Photo Editing' 'Accounting']
(22,)
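
The list above shows that some genre names carry the HTML entity `&amp;`. The notebook keeps them as they are. If decoded names were wanted, one optional step (not part of the original pipeline) could use the standard library's `html.unescape`, sketched here on a few of the names printed above.

```python
import html

# Genre names copied from the output above, two of them with an HTML entity.
raw_genres = ["Animation &amp; Modeling", "Design &amp; Illustration", "Action"]

# html.unescape converts character references such as &amp; back to plain characters.
clean_genres = [html.unescape(name) for name in raw_genres]
print(clean_genres)
```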


In [32]:
generoEndpointDf.shape

(28731, 2)

In [33]:
generoEndpointDf = generoEndpointDf.explode('genres')
generoEndpointDf.head()

Unnamed: 0,genres,year
0,Action,2018
0,Casual,2018
0,Indie,2018
0,Simulation,2018
0,Strategy,2018


In [34]:
generoEndpointDf.shape

(71212, 2)

Now we can group by year and genre to count how often each genre appears per year, then unstack the result to get a new DataFrame with the data we need.

In [35]:
generoEndpointDf2 = generoEndpointDf.groupby(by=['year', 'genres'])['genres'].count()
generoEndpointDf2.tail()

year  genres   
2019  RPG          2
      Strategy     1
2021  Adventure    1
      Indie        1
      RPG          1
Name: genres, dtype: int64

In [36]:
generoEndpointDf3 = generoEndpointDf2.unstack(level=1, fill_value=0)
generoEndpointDf3.head()

genres,Accounting,Action,Adventure,Animation &amp; Modeling,Audio Production,Casual,Design &amp; Illustration,Early Access,Education,Free to Play,...,Photo Editing,RPG,Racing,Simulation,Software Training,Sports,Strategy,Utilities,Video Production,Web Publishing
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1983,0,1,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1984,0,1,1,0,0,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1985,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1986,0,0,1,0,0,1,0,0,0,0,...,0,1,0,1,0,0,0,0,0,0
1987,0,0,3,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0
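
The explode, group, count and unstack steps can be followed end to end on a tiny example. The frame below is made-up toy data with the same two columns.

```python
import pandas as pd

# Toy data: three games, each with a list of genres and a release year.
toy = pd.DataFrame({
    "genres": [["Action", "Indie"], ["Indie"], ["Action"]],
    "year": [2017, 2017, 2018],
})

# One row per (game, genre) pair.
exploded = toy.explode("genres")
# Number of occurrences of each genre within each year.
counts = exploded.groupby(["year", "genres"])["genres"].count()
# Genres become columns; combinations that never occur are filled with 0.
table = counts.unstack(level=1, fill_value=0)
# One dictionary of genre counts per year.
result = table.to_dict(orient="index")
print(result)
```

Here `fill_value=0` is what makes every year list every genre, including those with no games.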


In [37]:
# Now we have exactly what we need.
generoEndpointDict = generoEndpointDf3.to_dict(orient='index')
generoEndpointDict

{1983: {'Accounting': 0,
  'Action': 1,
  'Adventure': 1,
  'Animation &amp; Modeling': 0,
  'Audio Production': 0,
  'Casual': 1,
  'Design &amp; Illustration': 0,
  'Early Access': 0,
  'Education': 0,
  'Free to Play': 0,
  'Indie': 0,
  'Massively Multiplayer': 0,
  'Photo Editing': 0,
  'RPG': 0,
  'Racing': 0,
  'Simulation': 0,
  'Software Training': 0,
  'Sports': 0,
  'Strategy': 0,
  'Utilities': 0,
  'Video Production': 0,
  'Web Publishing': 0},
 1984: {'Accounting': 0,
  'Action': 1,
  'Adventure': 1,
  'Animation &amp; Modeling': 0,
  'Audio Production': 0,
  'Casual': 2,
  'Design &amp; Illustration': 0,
  'Early Access': 0,
  'Education': 0,
  'Free to Play': 0,
  'Indie': 1,
  'Massively Multiplayer': 0,
  'Photo Editing': 0,
  'RPG': 0,
  'Racing': 0,
  'Simulation': 0,
  'Software Training': 0,
  'Sports': 0,
  'Strategy': 0,
  'Utilities': 0,
  'Video Production': 0,
  'Web Publishing': 0},
 1985: {'Accounting': 0,
  'Action': 0,
  'Adventure': 0,
  'Animation &amp;

Sort by values.

In [38]:
generoEndpointDict_sorted = {}
for y in generoEndpointDict.keys():
    genres = generoEndpointDict[y]
    genres_sorted = {k: v for k, v in sorted(genres.items(), key=lambda item: item[1], reverse=True)}
    # print(genres_sorted)
    generoEndpointDict_sorted[y] = genres_sorted

generoEndpointDict_sorted

{1983: {'Action': 1,
  'Adventure': 1,
  'Casual': 1,
  'Accounting': 0,
  'Animation &amp; Modeling': 0,
  'Audio Production': 0,
  'Design &amp; Illustration': 0,
  'Early Access': 0,
  'Education': 0,
  'Free to Play': 0,
  'Indie': 0,
  'Massively Multiplayer': 0,
  'Photo Editing': 0,
  'RPG': 0,
  'Racing': 0,
  'Simulation': 0,
  'Software Training': 0,
  'Sports': 0,
  'Strategy': 0,
  'Utilities': 0,
  'Video Production': 0,
  'Web Publishing': 0},
 1984: {'Casual': 2,
  'Action': 1,
  'Adventure': 1,
  'Indie': 1,
  'Accounting': 0,
  'Animation &amp; Modeling': 0,
  'Audio Production': 0,
  'Design &amp; Illustration': 0,
  'Early Access': 0,
  'Education': 0,
  'Free to Play': 0,
  'Massively Multiplayer': 0,
  'Photo Editing': 0,
  'RPG': 0,
  'Racing': 0,
  'Simulation': 0,
  'Software Training': 0,
  'Sports': 0,
  'Strategy': 0,
  'Utilities': 0,
  'Video Production': 0,
  'Web Publishing': 0},
 1985: {'Simulation': 1,
  'Accounting': 0,
  'Action': 0,
  'Adventure': 0,

In [39]:
with open("datasets/steam_games_endpoint_1_genero.json", "w") as outfile:
    json.dump(generoEndpointDict_sorted, outfile)
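
One detail worth keeping in mind when reading the file back: JSON object keys are always strings, so the integer years used as dictionary keys come back as strings after a round trip. A short sketch using two of the 2016 counts:

```python
import json

# Integer year as key, as in the dictionary built from the DataFrame.
by_year = {2016: {"Indie": 4106, "Action": 2544}}

# Serializing and parsing again turns the key 2016 into the string "2016".
round_tripped = json.loads(json.dumps(by_year))
print(list(round_tripped.keys()))
```

This is why the lookup functions take the year as a `str`.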

**TEST**

In [40]:
def genero(Year: str): 
    """
    Takes a year and returns a dictionary with the 5 best-selling genres, in the corresponding order.
    """

    # Read data
    with open("datasets/steam_games_endpoint_1_genero.json") as json_file:
        genres_dict = json.load(json_file)
    # Since data is already sorted, get the first 5 as top 5.
    top5 = dict(list(genres_dict[Year].items())[0:5])
    return top5

genero('2016')

{'Indie': 4106,
 'Action': 2544,
 'Casual': 2141,
 'Adventure': 2042,
 'Simulation': 1666}

---

## Endpoint 2: juegos

def juegos( Año: str ): Takes a year and returns a list with the games released in that year.

Example:
| Year | Games |
| -- | - |
| 2018 | ['juego1', 'juego2', ... ] |
| 2016 | ['juegoX', 'juegoY', ... ] |

For this endpoint only the "app_name", "title" and "year" columns matter, so we can drop the others.

In [41]:
juegosEndpointDf = gamesWithYearDf[['app_name', 'title', 'year']]
juegosEndpointDf.head()

Unnamed: 0,app_name,title,year
0,Lost Summoner Kitty,Lost Summoner Kitty,2018
1,Ironbound,Ironbound,2018
2,Real Pool 3D - Poolians,Real Pool 3D - Poolians,2017
3,弹炸人2222,弹炸人2222,2017
5,Battle Royale Trainer,Battle Royale Trainer,2018


First, let's study app_name and title and unify them into one column. Each has a single NaN, both in the same row, so we drop that row.

In [42]:
print("Unique:",juegosEndpointDf['app_name'].unique())
print("NaNs:", juegosEndpointDf['app_name'].isna().sum())

Unique: ['Lost Summoner Kitty' 'Ironbound' 'Real Pool 3D - Poolians' ...
 'LOGistICAL: South Africa' 'Russian Roads' 'EXIT 2 - Directions']
NaNs: 1


In [43]:
print("Unique:",juegosEndpointDf['title'].unique())
print("NaNs:", juegosEndpointDf['title'].isna().sum())

Unique: ['Lost Summoner Kitty' 'Ironbound' 'Real Pool 3D - Poolians' ...
 'LOGistICAL: South Africa' 'Russian Roads' 'EXIT 2 - Directions']
NaNs: 1


In [44]:
juegosEndpointDf[juegosEndpointDf['title'].isna()]

Unnamed: 0,app_name,title,year
2580,,,2014


In [45]:
juegosEndpointDf.shape

(29965, 3)

In [46]:
juegosEndpointDf = juegosEndpointDf.dropna(subset=['app_name', 'title'])

In [47]:
juegosEndpointDf.shape

(29964, 3)

Comparing the rows where the two names differ, "title" contains HTML-escaped characters (for example `&amp;`) while "app_name" shows them as plain characters, so we keep "app_name" and drop "title".

In [48]:
juegosEndpointDf[juegosEndpointDf['app_name'] != juegosEndpointDf['title']].tail()

Unnamed: 0,app_name,title,year
31871,Sam & Max 105: Reality 2.0,Sam &amp; Max 105: Reality 2.0,2007
31872,Sam & Max 104: Abe Lincoln Must Die!,Sam &amp; Max 104: Abe Lincoln Must Die!,2007
31873,Sam & Max 106: Bright Side of the Moon,Sam &amp; Max 106: Bright Side of the Moon,2007
31898,Making History: The Calm & the Storm,Making History: The Calm &amp; the Storm,2007
32052,Dark Messiah of Might & Magic,Dark Messiah of Might &amp; Magic,2006


In [49]:
juegosEndpointDf = juegosEndpointDf.drop(columns='title')

### Games

Each game has the name on column "app_name". Let's prepare the json.

In [50]:
juegosEndpointDf.shape

(29964, 2)

Now we can group by year and collect the game names of each year into a list, which gives a Series with the data we need.

In [62]:
# juegosEndpointDf2 = juegosEndpointDf.groupby(by=['year']).apply(lambda df: dict(df['app_name']))
juegosEndpointDf2 = juegosEndpointDf.groupby(by=['year']).apply(lambda df: list(df['app_name']))

juegosEndpointDf2.tail(20)

year
2001    [Return to Castle Wolfenstein, Max Payne, X-CO...
2002    [Arx Fatalis, The Sum of All Fears, Freedom Fo...
2003    [Ghost Master®, Railroad Tycoon 3, BloodRayne,...
2004    [Rome: Total War™ - Collection, Manhunt, Far C...
2005    [Advent Rising, FlatOut, Act of War: Direct Ac...
2006    [Disciples II: Gallean's Return, Disciples II:...
2007    [Lost Planet™: Extreme Condition, RIP - Trilog...
2008    [Command & Conquer: Red Alert 3, Conflict: Den...
2009    [PRR Wagon Pack 01, Class 421 London South Eas...
2010    [Avencast: Rise of the Mage, Pirates, Vikings,...
2011    [Mafia II, Lara Croft GoL: Raziel and Kain Cha...
2012    [Hard Reset Extended Edition, Blackwell Unboun...
2013    [RIFT, Trine 2: Complete Story, Angelica Weave...
2014    [Imagine Earth, BeatBlasters III, Blood of the...
2015    [Assault Android Cactus, America's Army: Provi...
2016    [Fallen Mage, Of Guards And Thieves, The Inter...
2017    [Real Pool 3D - Poolians, 弹炸人2222, RC Plane 3 ...
2018    [
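
For reference, the same grouping can be written with `agg(list)` on the selected column. The sketch below uses made-up game names and is an alternative spelling, not the cell the notebook ran.

```python
import pandas as pd

# Toy data with placeholder game names.
toy = pd.DataFrame({
    "app_name": ["Game A", "Game B", "Game C"],
    "year": [2017, 2017, 2018],
})

# Collect the names of each year into a list, then convert the Series to a dict.
games_by_year = toy.groupby("year")["app_name"].agg(list).to_dict()
print(games_by_year)
```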

In [66]:
juegosEndpointDict = dict(juegosEndpointDf2)

In [67]:
with open("datasets/steam_games_endpoint_2_juegos.json", "w") as outfile:
    json.dump(juegosEndpointDict, outfile)

**TEST**

In [70]:
def juegos( Year: str ):
    """
    Takes a year and returns a list with the games released in that year.
    """
    # Read data
    with open("datasets/steam_games_endpoint_2_juegos.json") as json_file:
        juegos_dict = json.load(json_file)
    return juegos_dict[Year]

juegos('2020')

KeyError: '2020'
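
The `KeyError` appears because no game in the processed data has 2020 as its year. One way an endpoint could handle a missing year is sketched below. The function name `juegos_or_empty` and the in-memory dictionary are hypothetical stand-ins for the data loaded from the JSON file.

```python
# Stand-in for the dictionary loaded from the JSON file; game names are placeholders.
juegos_dict = {"2018": ["juego1", "juego2"], "2016": ["juegoX", "juegoY"]}

def juegos_or_empty(year: str) -> list:
    """Return the games for a year, or an empty list when the year is absent."""
    return juegos_dict.get(year, [])

print(juegos_or_empty("2018"))
print(juegos_or_empty("2020"))
```

Returning an empty list is only one option; an API could equally answer with an explicit error message for unknown years.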

---

## Endpoint 3: specs

def specs( Año: str ): Takes a year and returns a list with the 5 specs that appear most often in that year, in the corresponding order.

Example:
| Year | Specs |
| -- | - |
| 2018 | {'Single-player': 147, 'Steam Achievements': 68, 'Full controller support': 44, 'Steam Cloud': 30, 'Partial Controller Support': 29} |

For this endpoint only the "specs" and "year" columns matter, so we can drop the others.

In [71]:
specsEndpointDf = gamesWithYearDf[['specs', 'year']]
specsEndpointDf.head()

Unnamed: 0,specs,year
0,[Single-player],2018
1,"[Single-player, Multi-player, Online Multi-Pla...",2018
2,"[Single-player, Multi-player, Online Multi-Pla...",2017
3,[Single-player],2017
5,"[Single-player, Steam Achievements]",2018


Now we can drop all the rows that have a NaN in either of the two columns, because they add no information within the scope of the specs function.

In [72]:
specsEndpointDf.describe()

Unnamed: 0,year
count,29965.0
mean,2014.769965
std,3.50409
min,1970.0
25%,2014.0
50%,2016.0
75%,2017.0
max,2021.0


In [73]:
specsEndpointDf.isna().sum()

specs    669
year       0
dtype: int64

In [74]:
specsEndpointDf = specsEndpointDf.dropna()

In [76]:
specsEndpointDf.isna().sum()

specs    0
year     0
dtype: int64

### Specs

Similarly to genres, each game can have several specs: the "specs" column is filled with lists.

For this reason we need to explode it so we can count all the specs separately.

In [77]:
specsCompleteList = specsEndpointDf.explode(column=['specs'])['specs'].unique()
print(specsCompleteList)
print(specsCompleteList.shape)

['Single-player' 'Multi-player' 'Online Multi-Player'
 'Cross-Platform Multiplayer' 'Steam Achievements' 'Steam Trading Cards'
 'In-App Purchases' 'Stats' 'Downloadable Content'
 'Full controller support' 'Steam Cloud' 'Steam Leaderboards'
 'Partial Controller Support' 'Local Co-op' 'Shared/Split Screen'
 'Valve Anti-Cheat enabled' 'Local Multi-Player'
 'Steam Turn Notifications' 'MMO' 'Co-op' 'Captions available'
 'Steam Workshop' 'Includes level editor' 'Mods' 'Mods (require HL2)'
 'Online Co-op' 'Game demo' 'Includes Source SDK' 'Commentary available'
 'SteamVR Collectibles' 'Mods (require HL1)']
(31,)


In [None]:
specsEndpointDf.shape

(28731, 2)

In [78]:
specsEndpointDf = specsEndpointDf.explode('specs')
specsEndpointDf.head()

Unnamed: 0,specs,year
0,Single-player,2018
1,Single-player,2018
1,Multi-player,2018
1,Online Multi-Player,2018
1,Cross-Platform Multiplayer,2018


In [79]:
specsEndpointDf.shape

(128228, 2)

Now we can group by year and spec to count how often each spec appears per year, then unstack the result to get a new DataFrame with the data we need.

In [85]:
specsEndpointDf2 = specsEndpointDf.groupby(by=['year', 'specs'])['specs'].count()
specsEndpointDf2.tail()

year  specs               
2019  Single-player           4
      Steam Achievements      2
      Steam Cloud             1
2021  Downloadable Content    1
      Single-player           1
Name: specs, dtype: int64

In [86]:
specsEndpointDf3 = specsEndpointDf2.unstack(level=1, fill_value=0)
specsEndpointDf3.head()

specs,Captions available,Co-op,Commentary available,Cross-Platform Multiplayer,Downloadable Content,Full controller support,Game demo,In-App Purchases,Includes Source SDK,Includes level editor,...,Single-player,Stats,Steam Achievements,Steam Cloud,Steam Leaderboards,Steam Trading Cards,Steam Turn Notifications,Steam Workshop,SteamVR Collectibles,Valve Anti-Cheat enabled
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1970,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1975,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1980,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1981,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1982,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [87]:
# Now we have exactly what we need.
specsEndpointDict = specsEndpointDf3.to_dict(orient='index')
specsEndpointDict

{1970: {'Captions available': 2,
  'Co-op': 0,
  'Commentary available': 0,
  'Cross-Platform Multiplayer': 0,
  'Downloadable Content': 0,
  'Full controller support': 0,
  'Game demo': 0,
  'In-App Purchases': 0,
  'Includes Source SDK': 0,
  'Includes level editor': 0,
  'Local Co-op': 0,
  'Local Multi-Player': 0,
  'MMO': 0,
  'Mods': 0,
  'Mods (require HL1)': 0,
  'Mods (require HL2)': 0,
  'Multi-player': 0,
  'Online Co-op': 0,
  'Online Multi-Player': 0,
  'Partial Controller Support': 0,
  'Shared/Split Screen': 0,
  'Single-player': 0,
  'Stats': 0,
  'Steam Achievements': 0,
  'Steam Cloud': 0,
  'Steam Leaderboards': 0,
  'Steam Trading Cards': 0,
  'Steam Turn Notifications': 0,
  'Steam Workshop': 0,
  'SteamVR Collectibles': 0,
  'Valve Anti-Cheat enabled': 0},
 1975: {'Captions available': 1,
  'Co-op': 0,
  'Commentary available': 0,
  'Cross-Platform Multiplayer': 0,
  'Downloadable Content': 0,
  'Full controller support': 0,
  'Game demo': 0,
  'In-App Purchases':

Sort by values.

In [89]:
specsEndpointDict_sorted = {}
for y in specsEndpointDict.keys():
    specs = specsEndpointDict[y]
    specs_sorted = {k: v for k, v in sorted(specs.items(), key=lambda item: item[1], reverse=True)}
    specsEndpointDict_sorted[y] = specs_sorted

specsEndpointDict_sorted

{1970: {'Captions available': 2,
  'Co-op': 0,
  'Commentary available': 0,
  'Cross-Platform Multiplayer': 0,
  'Downloadable Content': 0,
  'Full controller support': 0,
  'Game demo': 0,
  'In-App Purchases': 0,
  'Includes Source SDK': 0,
  'Includes level editor': 0,
  'Local Co-op': 0,
  'Local Multi-Player': 0,
  'MMO': 0,
  'Mods': 0,
  'Mods (require HL1)': 0,
  'Mods (require HL2)': 0,
  'Multi-player': 0,
  'Online Co-op': 0,
  'Online Multi-Player': 0,
  'Partial Controller Support': 0,
  'Shared/Split Screen': 0,
  'Single-player': 0,
  'Stats': 0,
  'Steam Achievements': 0,
  'Steam Cloud': 0,
  'Steam Leaderboards': 0,
  'Steam Trading Cards': 0,
  'Steam Turn Notifications': 0,
  'Steam Workshop': 0,
  'SteamVR Collectibles': 0,
  'Valve Anti-Cheat enabled': 0},
 1975: {'Captions available': 1,
  'Co-op': 0,
  'Commentary available': 0,
  'Cross-Platform Multiplayer': 0,
  'Downloadable Content': 0,
  'Full controller support': 0,
  'Game demo': 0,
  'In-App Purchases':

In [90]:
with open("datasets/steam_games_endpoint_3_specs.json", "w") as outfile:
    json.dump(specsEndpointDict_sorted, outfile)

**TEST**

In [94]:
def specs( Year: str ):
    """
    Takes a year and returns a dictionary with the 5 specs that appear most often in that year, in the corresponding order.
    """
    # Read data
    with open("datasets/steam_games_endpoint_3_specs.json") as json_file:
        specs_dict = json.load(json_file)

    # TODO Add validation

    # Since data is already sorted, get the first 5 as top 5.
    top5 = dict(list(specs_dict[Year].items())[0:5])
    return top5

specs('2018')

{'Single-player': 147,
 'Steam Achievements': 68,
 'Full controller support': 44,
 'Steam Cloud': 30,
 'Partial Controller Support': 29}
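
The test functions reopen the JSON file on every call. Since the data is static, one possible refinement is to read each file once and reuse it, sketched here with `functools.lru_cache`. The names `load_endpoint_data` and `top5`, the demo file and its counts are all hypothetical.

```python
import json
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_endpoint_data(path: str) -> dict:
    """Read a precomputed endpoint file; repeated calls reuse the cached dict."""
    with open(path, encoding="utf-8") as json_file:
        return json.load(json_file)


def top5(path: str, year: str) -> dict:
    """Return the first five entries for a year (the data is stored already sorted)."""
    return dict(list(load_endpoint_data(path)[year].items())[:5])


# Demo with a throwaway file and made-up counts.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = str(Path(tmp_dir) / "demo_specs.json")
    with open(demo_path, "w", encoding="utf-8") as outfile:
        json.dump({"2018": {"Single-player": 3, "Co-op": 1}}, outfile)
    first_call = top5(demo_path, "2018")   # reads the file
    second_call = top5(demo_path, "2018")  # served from the cache
```

The cached dictionary is shared between calls, so callers should treat it as read-only; `top5` builds a new dictionary for each result.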

---

## CREATE API FUNCTIONS (Testing)