# EDA

Study of a dataset cotaining various information about 32135 games published on Steam gaming plataform.

| Columna | Descripción | Ejemplo |
| -- | - | - |
| publisher | Empresa publicadora del contenido | [Ubisoft,Dovetail Games - Trains,Degica] |
| genres | Genero del contenido | [Action, Adventure, Racing, Simulation, Strategy] |
| app_name | Nombre del contenido | [Warzone, Soundtrack, Puzzle Blocks] |
| title | Titulo del contenido | [The Dream Machine: Chapter 4 , Fate/EXTELLA - Sweet Room Dream, Fate/EXTELLA - Charming Bunny] |
| url | URL de publicación del contenido | http://store.steampowered.com/app/761140/Lost_Summoner_Kitty/ |
| release_date | Fecha de lanzamiento | [2018-01-04] |
| tags | etiquetas de contenido | [Simulation, Indie, Action, Adventure, Funny, Open World, First-Person, Sandbox, Free to Play] |
| discount_price | precio de descuento | [22.66, 0.49, 0.69] |
| reviews_url | Reviews de contenido | http://steamcommunity.com/app/681550/reviews/?browsefilter=mostrecent&p=1 |
| specs | Especificaciones | [Multi-player, Co-op, Cross-Platform Multiplayer, Downloadable Content] |
| price | Precio del contenido | [4.99, 9.99, Free to Use, Free to Play] |
| early_access | acceso temprano | [False, True] |
| id | identificador unico de contenido | [761140, 643980, 670290] |
| developer | Desarrollador | [Kotoshiro, Secret Level SRL, Poolians.com] |
| sentiment | Análisis de sentimiento | [Mixed, Very Positive, Positive, 3 user reviews] |
| metascore | Score por metacritic | [80, 74, 77, 75] |



Import relevant libraries.

In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ast

## Read Data

I tried to open the json file with pd.read_json but i could not do it because of its format.

In [19]:
# pd.read_json()

That's why I decided to use "open" and then transform each row into a dictionary.

In [20]:
# Open file.
file = open('datasets/steam_games.json')

allNewLines = []

# Go through each line 
for line in file:
    # Create a dictionary of each line.
    myNewdict = ast.literal_eval(line)
    # Append to a list of all the lines in the file.
    allNewLines.append(myNewdict)

With the list of all the rows I can create a DataFrame.

In [21]:
rawGamesDf = pd.DataFrame(allNewLines)
rawGamesDf.head(2)

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,discount_price,reviews_url,specs,price,early_access,id,developer,sentiment,metascore
0,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_...,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",4.49,http://steamcommunity.com/app/761140/reviews/?...,[Single-player],4.99,False,761140,Kotoshiro,,
1,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,http://store.steampowered.com/app/643980/Ironb...,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",,http://steamcommunity.com/app/643980/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free To Play,False,643980,Secret Level SRL,Mostly Positive,


## Basic Info

In [22]:
# Column names
rawGamesDf.columns

Index(['publisher', 'genres', 'app_name', 'title', 'url', 'release_date',
       'tags', 'discount_price', 'reviews_url', 'specs', 'price',
       'early_access', 'id', 'developer', 'sentiment', 'metascore'],
      dtype='object')

In [23]:
rawGamesDf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32135 entries, 0 to 32134
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   publisher       24083 non-null  object 
 1   genres          28852 non-null  object 
 2   app_name        32133 non-null  object 
 3   title           30085 non-null  object 
 4   url             32135 non-null  object 
 5   release_date    30068 non-null  object 
 6   tags            31972 non-null  object 
 7   discount_price  225 non-null    float64
 8   reviews_url     32133 non-null  object 
 9   specs           31465 non-null  object 
 10  price           30758 non-null  object 
 11  early_access    32135 non-null  bool   
 12  id              32133 non-null  object 
 13  developer       28836 non-null  object 
 14  sentiment       24953 non-null  object 
 15  metascore       2677 non-null   object 
dtypes: bool(1), float64(1), object(14)
memory usage: 3.7+ MB


## Columns: url & reviews_url
We do not need any url for the analysis so we will drop them.

In [24]:
rawGamesDf.drop(columns=['url', 'reviews_url'], inplace=True)

## Column: genres, tags and specs

We need to start working with genres, because it's a column containing a list of different items. We should try to make a column for each genre.

In [25]:
print('Describe:')
print(rawGamesDf['genres'].describe())
print('')
print('NaN amount:', rawGamesDf['genres'].isna().sum())

Describe:
count        28852
unique         883
top       [Action]
freq          1880
Name: genres, dtype: object

NaN amount: 3283


In [26]:
genresCompleteList = rawGamesDf.explode(column=['genres'])['genres'].unique()
print(genresCompleteList)
print(genresCompleteList.shape)

['Action' 'Casual' 'Indie' 'Simulation' 'Strategy' 'Free to Play' 'RPG'
 'Sports' 'Adventure' nan 'Racing' 'Early Access' 'Massively Multiplayer'
 'Animation &amp; Modeling' 'Video Production' 'Utilities'
 'Web Publishing' 'Education' 'Software Training'
 'Design &amp; Illustration' 'Audio Production' 'Photo Editing'
 'Accounting']
(23,)


We have 23 genres, **WE NEED TO DECIDE WHAT TO DO WITH THE NaN category**

We could use a groupby and get dummies to get all the genre columns.

In [27]:
# Using pd.get_dummies() method
genresDummies = pd.get_dummies(rawGamesDf['genres'].apply(pd.Series).stack(), prefix='genre').groupby(level=0).sum()

# display result
genresDummies.head()

Unnamed: 0,genre_Accounting,genre_Action,genre_Adventure,genre_Animation &amp; Modeling,genre_Audio Production,genre_Casual,genre_Design &amp; Illustration,genre_Early Access,genre_Education,genre_Free to Play,...,genre_Photo Editing,genre_RPG,genre_Racing,genre_Simulation,genre_Software Training,genre_Sports,genre_Strategy,genre_Utilities,genre_Video Production,genre_Web Publishing
0,0,1,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
1,0,0,0,0,0,0,0,0,0,1,...,0,1,0,0,0,0,1,0,0,0
2,0,0,0,0,0,1,0,0,0,1,...,0,0,0,1,0,1,0,0,0,0
3,0,1,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,1,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [28]:
# Let's merge the new genres data with the original DataFrame.
cleanedGamesDf = rawGamesDf.merge(genresDummies, left_index=True, right_index=True)

# Now we will also drop the genres column since we do not need it anymore.
cleanedGamesDf.drop(columns='genres', inplace=True)

cleanedGamesDf.head()

Unnamed: 0,publisher,app_name,title,release_date,tags,discount_price,specs,price,early_access,id,...,genre_Photo Editing,genre_RPG,genre_Racing,genre_Simulation,genre_Software Training,genre_Sports,genre_Strategy,genre_Utilities,genre_Video Production,genre_Web Publishing
0,Kotoshiro,Lost Summoner Kitty,Lost Summoner Kitty,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",4.49,[Single-player],4.99,False,761140,...,0,0,0,1,0,0,1,0,0,0
1,"Making Fun, Inc.",Ironbound,Ironbound,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",,"[Single-player, Multi-player, Online Multi-Pla...",Free To Play,False,643980,...,0,1,0,0,0,0,1,0,0,0
2,Poolians.com,Real Pool 3D - Poolians,Real Pool 3D - Poolians,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",,"[Single-player, Multi-player, Online Multi-Pla...",Free to Play,False,670290,...,0,0,0,1,0,1,0,0,0,0
3,彼岸领域,弹炸人2222,弹炸人2222,2017-12-07,"[Action, Adventure, Casual]",0.83,[Single-player],0.99,False,767400,...,0,0,0,0,0,0,0,0,0,0
5,Trickjump Games Ltd,Battle Royale Trainer,Battle Royale Trainer,2018-01-04,"[Action, Adventure, Simulation, FPS, Shooter, ...",,"[Single-player, Steam Achievements]",3.99,False,772540,...,0,0,0,1,0,0,0,0,0,0


**tags**

In [29]:
tagsCompleteList = rawGamesDf.explode(column=['tags'])['tags'].unique()
print(tagsCompleteList)
print(tagsCompleteList.shape)

['Strategy' 'Action' 'Indie' 'Casual' 'Simulation' 'Free to Play' 'RPG'
 'Card Game' 'Trading Card Game' 'Turn-Based' 'Fantasy' 'Tactical'
 'Dark Fantasy' 'Board Game' 'PvP' '2D' 'Competitive' 'Replay Value'
 'Character Customization' 'Female Protagonist' 'Difficult'
 'Design & Illustration' 'Sports' 'Multiplayer' 'Adventure' 'FPS'
 'Shooter' 'Third-Person Shooter' 'Sniper' 'Third Person' 'Racing'
 'Early Access' 'Survival' 'Pixel Graphics' 'Cute' 'Physics' 'Science'
 'VR' 'Tutorial' 'Classic' 'Gore' "1990's" 'Singleplayer' 'Sci-fi'
 'Aliens' 'First-Person' 'Story Rich' 'Atmospheric' 'Silent Protagonist'
 'Great Soundtrack' 'Moddable' 'Linear' 'Retro' 'Funny'
 'Turn-Based Strategy' 'Platformer' 'Side Scroller'
 'Massively Multiplayer' 'Clicker' 'Gothic' 'Isometric' 'Stealth'
 'Mystery' 'Assassin' 'Comedy' 'Stylized' 'Co-op' 'War' 'Rome'
 'Historical' 'Open World' 'Realistic' 'Crafting' 'Trading' 'MMORPG'
 'Swordplay' 'Hunting' 'Violent' 'Experience' 'City Builder' 'Building'
 'Economy'

Too much info in tags to make individual columns  **DROP FOR NOW?????????**

In [30]:
cleanedGamesDf.drop(columns='tags', inplace=True)
cleanedGamesDf.head()

Unnamed: 0,publisher,app_name,title,release_date,discount_price,specs,price,early_access,id,developer,...,genre_Photo Editing,genre_RPG,genre_Racing,genre_Simulation,genre_Software Training,genre_Sports,genre_Strategy,genre_Utilities,genre_Video Production,genre_Web Publishing
0,Kotoshiro,Lost Summoner Kitty,Lost Summoner Kitty,2018-01-04,4.49,[Single-player],4.99,False,761140,Kotoshiro,...,0,0,0,1,0,0,1,0,0,0
1,"Making Fun, Inc.",Ironbound,Ironbound,2018-01-04,,"[Single-player, Multi-player, Online Multi-Pla...",Free To Play,False,643980,Secret Level SRL,...,0,1,0,0,0,0,1,0,0,0
2,Poolians.com,Real Pool 3D - Poolians,Real Pool 3D - Poolians,2017-07-24,,"[Single-player, Multi-player, Online Multi-Pla...",Free to Play,False,670290,Poolians.com,...,0,0,0,1,0,1,0,0,0,0
3,彼岸领域,弹炸人2222,弹炸人2222,2017-12-07,0.83,[Single-player],0.99,False,767400,彼岸领域,...,0,0,0,0,0,0,0,0,0,0
5,Trickjump Games Ltd,Battle Royale Trainer,Battle Royale Trainer,2018-01-04,,"[Single-player, Steam Achievements]",3.99,False,772540,Trickjump Games Ltd,...,0,0,0,1,0,0,0,0,0,0


**specs**

In [31]:
specsCompleteList = rawGamesDf.explode(column=['specs'])['specs'].unique()
print(specsCompleteList)
print(specsCompleteList.shape)

['Single-player' 'Multi-player' 'Online Multi-Player'
 'Cross-Platform Multiplayer' 'Steam Achievements' 'Steam Trading Cards'
 'In-App Purchases' 'Stats' 'Full controller support' 'HTC Vive'
 'Oculus Rift' 'Tracked Motion Controllers' 'Room-Scale'
 'Downloadable Content' 'Steam Cloud' 'Steam Leaderboards'
 'Partial Controller Support' 'Seated' 'Standing' 'Local Co-op'
 'Shared/Split Screen' nan 'Valve Anti-Cheat enabled' 'Local Multi-Player'
 'Steam Turn Notifications' 'MMO' 'Co-op' 'Online Co-op'
 'Captions available' 'Commentary available' 'Steam Workshop'
 'Includes level editor' 'Mods' 'Mods (require HL2)' 'Game demo'
 'Includes Source SDK' 'SteamVR Collectibles' 'Keyboard / Mouse' 'Gamepad'
 'Windows Mixed Reality' 'Mods (require HL1)']
(41,)


Too much info in specs to make individual columns  **DROP FOR NOW?????????**

In [32]:
cleanedGamesDf.drop(columns='specs', inplace=True)
cleanedGamesDf.head()

Unnamed: 0,publisher,app_name,title,release_date,discount_price,price,early_access,id,developer,sentiment,...,genre_Photo Editing,genre_RPG,genre_Racing,genre_Simulation,genre_Software Training,genre_Sports,genre_Strategy,genre_Utilities,genre_Video Production,genre_Web Publishing
0,Kotoshiro,Lost Summoner Kitty,Lost Summoner Kitty,2018-01-04,4.49,4.99,False,761140,Kotoshiro,,...,0,0,0,1,0,0,1,0,0,0
1,"Making Fun, Inc.",Ironbound,Ironbound,2018-01-04,,Free To Play,False,643980,Secret Level SRL,Mostly Positive,...,0,1,0,0,0,0,1,0,0,0
2,Poolians.com,Real Pool 3D - Poolians,Real Pool 3D - Poolians,2017-07-24,,Free to Play,False,670290,Poolians.com,Mostly Positive,...,0,0,0,1,0,1,0,0,0,0
3,彼岸领域,弹炸人2222,弹炸人2222,2017-12-07,0.83,0.99,False,767400,彼岸领域,,...,0,0,0,0,0,0,0,0,0,0
5,Trickjump Games Ltd,Battle Royale Trainer,Battle Royale Trainer,2018-01-04,,3.99,False,772540,Trickjump Games Ltd,Mixed,...,0,0,0,1,0,0,0,0,0,0


## Duplicates

Analysis if there are duplicates

In [33]:
# Games with duplicate ids
cleanedGamesDf[cleanedGamesDf.duplicated(keep=False, subset='id')].sort_values(by='id')

Unnamed: 0,publisher,app_name,title,release_date,discount_price,price,early_access,id,developer,sentiment,...,genre_Photo Editing,genre_RPG,genre_Racing,genre_Simulation,genre_Software Training,genre_Sports,genre_Strategy,genre_Utilities,genre_Video Production,genre_Web Publishing
13894,Bethesda Softworks,Wolfenstein II: The New Colossus,Wolfenstein II: The New Colossus,2017-10-26,,59.99,False,612880,Machine Games,Mostly Positive,...,0,0,0,0,0,0,0,0,0,0
14573,Bethesda Softworks,Wolfenstein II: The New Colossus,Wolfenstein II: The New Colossus,2017-10-26,,59.99,False,612880,Machine Games,Mostly Positive,...,0,0,0,0,0,0,0,0,0,0


In [34]:
cleanedGamesDf.drop_duplicates(inplace=True)

## Function to describe columns

In [35]:
def column_describe(name):
    """
    Basic description of a categorical column.
    """
    print('Describe:')
    print(cleanedGamesDf[name].describe())
    print('')
    print('Unique Values:')
    print(cleanedGamesDf[name].unique())
    print('')
    print('NaN amount:', cleanedGamesDf[name].isna().sum())

## publisher

In [36]:
column_describe('publisher')

Describe:
count       23954
unique       8226
top       Ubisoft
freq          383
Name: publisher, dtype: object

Unique Values:
['Kotoshiro' 'Making Fun, Inc.' 'Poolians.com' ... 'OrtiGames/OrtiSoft'
 'INGAME' 'Bidoniera Games']

NaN amount: 4897


**What to do with this Amount of NaNs????**

## app_name and title

This columns are redundant. So let's drop *app_name*.

In [37]:
column_describe('app_name')

Describe:
count          28850
unique         28827
top       Soundtrack
freq               3
Name: app_name, dtype: object

Unique Values:
['Lost Summoner Kitty' 'Ironbound' 'Real Pool 3D - Poolians' ...
 'LOGistICAL: South Africa' 'Russian Roads' 'EXIT 2 - Directions']

NaN amount: 1


In [38]:
column_describe('title')

Describe:
count          28850
unique         28827
top       Soundtrack
freq               3
Name: title, dtype: object

Unique Values:
['Lost Summoner Kitty' 'Ironbound' 'Real Pool 3D - Poolians' ...
 'LOGistICAL: South Africa' 'Russian Roads' 'EXIT 2 - Directions']

NaN amount: 1


**CUAL elegir, mezclar ambas en el null??????**

In [39]:
cleanedGamesDf.drop(columns='app_name', inplace=True)
cleanedGamesDf.head(2)

Unnamed: 0,publisher,title,release_date,discount_price,price,early_access,id,developer,sentiment,metascore,...,genre_Photo Editing,genre_RPG,genre_Racing,genre_Simulation,genre_Software Training,genre_Sports,genre_Strategy,genre_Utilities,genre_Video Production,genre_Web Publishing
0,Kotoshiro,Lost Summoner Kitty,2018-01-04,4.49,4.99,False,761140,Kotoshiro,,,...,0,0,0,1,0,0,1,0,0,0
1,"Making Fun, Inc.",Ironbound,2018-01-04,,Free To Play,False,643980,Secret Level SRL,Mostly Positive,,...,0,1,0,0,0,0,1,0,0,0


## discount_price

In [40]:
column_describe('discount_price')

Describe:
count    204.000000
mean      12.605392
std       18.207547
min        0.490000
25%        0.890000
50%        4.090000
75%       22.660000
max      139.990000
Name: discount_price, dtype: float64

Unique Values:
[  4.49    nan   0.83   8.79   1.59   0.59   0.84   7.19   6.79   9.89
   5.99   0.49   0.74  22.66  49.96   0.89  14.99   6.29   3.14  59.11
   7.49  13.39  10.49  35.97   0.99   7.99  11.39  24.9    3.59   2.69
   3.49   4.68   4.19  17.08   8.99   0.62   2.99  19.78   2.05   2.39
  19.99  79.99  49.99 139.99   2.44   0.69   3.74   0.79   1.79   1.49
   3.99   6.99   2.09  10.04  24.82  29.75   6.74   4.79   2.19   3.34
   0.5    5.24   2.51   1.19   1.99   0.66  22.46  44.1    9.99   3.24
  17.49   1.39  31.49   1.69   4.24  25.49   3.19  11.69  11.99]

NaN amount: 28647


In [41]:
amount_games_0_dscount_price = cleanedGamesDf[cleanedGamesDf['discount_price'] == 0]['discount_price'].shape[0]
amount_games_NaN_dscount_price = cleanedGamesDf[cleanedGamesDf['discount_price'].isna()]['discount_price'].shape[0]

print(f"Amount of games with  0  discount price: {amount_games_0_dscount_price}")
print(f"Amount of games with NaN discount price: {amount_games_NaN_dscount_price}")

Amount of games with  0  discount price: 0
Amount of games with NaN discount price: 28647


Since there are no games with 0 discount price we can assume that all the NaNs are games with no discount price. That's why we can impute 0 into the NaNs in that column.

In [42]:
cleanedGamesDf['discount_price'].fillna(0, inplace=True)

## price

In [43]:
column_describe('price')

Describe:
count     27621.00
unique      150.00
top           4.99
freq       3860.00
Name: price, dtype: float64

Unique Values:
[4.99 'Free To Play' 'Free to Play' 0.99 3.99 9.99 18.99 29.99 nan 10.99
 2.99 1.59 14.99 1.99 59.99 8.99 6.99 7.99 39.99 'Free' 19.99 7.49 12.99
 5.99 2.49 15.99 1.25 24.99 17.99 61.99 3.49 11.99 13.99 'Free Demo'
 'Play for Free!' 34.99 74.76 1.49 32.99 99.99 14.95 69.99 16.99 79.99
 49.99 5.0 44.99 13.98 29.96 109.99 149.99 771.71 'Install Now' 21.99
 89.99 'Play WARMACHINE: Tactics Demo' 0.98 139.92 4.29 64.99 'Free Mod'
 54.99 74.99 'Install Theme' 0.89 'Third-party' 0.5 'Play Now' 299.99 1.29
 119.99 3.0 15.0 5.49 23.99 49.0 20.99 10.93 1.39
 'Free HITMAN™ Holiday Pack' 36.99 4.49 2.0 4.0 1.95 1.5 199.0 189.0 6.66
 27.99 129.99 179.0 26.99 399.99 31.99 399.0 20.0 40.0 3.33 22.99 320.0
 38.85 71.7 995.0 27.49 3.39 6.0 19.95 499.99 199.99 16.06 4.68 131.4
 44.98 202.76 2.3 0.95 172.24 249.99 2.97 10.96 10.0 30.0 2.66 6.48 1.0
 11.15 'Play the Demo' 99.0 

First let's drop the NaNs because we do not have enough information to assume that they are free or which price are they selling.

In [44]:
# In general there are no tags associated with free to assume that they could be $0.
rawGamesDf[rawGamesDf['price'].isna()].head(5)

Unnamed: 0,publisher,genres,app_name,title,release_date,tags,discount_price,specs,price,early_access,id,developer,sentiment,metascore
9,RewindApp,"[Casual, Indie, Racing, Simulation]",Race,Race,2018-01-04,"[Indie, Casual, Simulation, Racing]",,"[Single-player, Multi-player, Partial Controll...",,False,768800,RewindApp,,
10,Qucheza,"[Action, Indie, Simulation, Early Access]",Uncanny Islands,Uncanny Islands,Soon..,"[Early Access, Action, Indie, Simulation, Surv...",,[Single-player],,True,768570,Qucheza,,
31,BlueLine Games,"[Casual, Indie, Strategy]",Lost Cities,Lost Cities,2018-01-01,"[Casual, Indie, Strategy, Card Game, Board Gam...",,"[Single-player, Multi-player, Online Multi-Pla...",,False,520680,BlueLine Games,,
32,Games by Brundle,[Action],Twisted Enhanced Edition,Twisted Enhanced Edition,2018-01-01,"[Action, Platformer, Side Scroller]",,"[Single-player, Full controller support]",,False,690410,Games by Brundle,,
34,ProjectorGames,"[Action, Casual, Indie, Massively Multiplayer,...",Tactics Forever,Tactics Forever,2018-01-01,"[Casual, Action, Massively Multiplayer, Indie,...",,"[Online Multi-Player, MMO, Cross-Platform Mult...",,False,413120,ProjectorGames,,


In [45]:
cleanedGamesDf[cleanedGamesDf['price'].isna()].head(5)

Unnamed: 0,publisher,title,release_date,discount_price,price,early_access,id,developer,sentiment,metascore,...,genre_Photo Editing,genre_RPG,genre_Racing,genre_Simulation,genre_Software Training,genre_Sports,genre_Strategy,genre_Utilities,genre_Video Production,genre_Web Publishing
9,RewindApp,Race,2018-01-04,0.0,,False,768800,RewindApp,,,...,0,0,1,1,0,0,0,0,0,0
10,Qucheza,Uncanny Islands,Soon..,0.0,,True,768570,Qucheza,,,...,0,0,0,1,0,0,0,0,0,0
31,BlueLine Games,Lost Cities,2018-01-01,0.0,,False,520680,BlueLine Games,,,...,0,0,0,0,0,0,1,0,0,0
32,Games by Brundle,Twisted Enhanced Edition,2018-01-01,0.0,,False,690410,Games by Brundle,,,...,0,0,0,0,0,0,0,0,0,0
34,ProjectorGames,Tactics Forever,2018-01-01,0.0,,False,413120,ProjectorGames,,,...,0,0,0,1,0,1,1,0,0,0


In [46]:
cleanedGamesDf.dropna(subset=['price'], inplace=True)

In [47]:
cleanedGamesDf['price'].isna().sum()

0

There are several games that have some text instead of price.

Most notably the word "Free" referring to free to play games. We could find them all and put them a price of 0.

In [48]:
cleanedGamesDf[cleanedGamesDf['price'].map(type) == str]['price'].unique()

array(['Free To Play', 'Free to Play', 'Free', 'Free Demo',
       'Play for Free!', 'Install Now', 'Play WARMACHINE: Tactics Demo',
       'Free Mod', 'Install Theme', 'Third-party', 'Play Now',
       'Free HITMAN™ Holiday Pack', 'Play the Demo', 'Free to Try',
       'Free to Use'], dtype=object)

In [49]:
print("Games containinig Free on the price:", cleanedGamesDf['price'].str.count("Free").sum())

Games containinig Free on the price: 1533.0


In [50]:
# Change all games containing Free to 0
cleanedGamesDf[cleanedGamesDf['price'].str.contains('Free', na=False)] = 0

We could eliminate all rows with prices that are still a string since there are few.

In [51]:
cleanedGamesDf[cleanedGamesDf['price'].map(type) == str]['price'].unique()

array(['Install Now', 'Play WARMACHINE: Tactics Demo', 'Install Theme',
       'Third-party', 'Play Now', 'Play the Demo'], dtype=object)

In [52]:
cleanedGamesDf[cleanedGamesDf['price'].map(type) == str]

Unnamed: 0,publisher,title,release_date,discount_price,price,early_access,id,developer,sentiment,metascore,...,genre_Photo Editing,genre_RPG,genre_Racing,genre_Simulation,genre_Software Training,genre_Sports,genre_Strategy,genre_Utilities,genre_Video Production,genre_Web Publishing
2405,EVGA,EVGA Precision XOC,2014-09-19,0.0,Install Now,False,268850,EVGA,Mostly Positive,,...,0,0,0,0,0,0,0,1,0,0
2871,Privateer Press Interactive,WARMACHINE: Tactics,2014-11-20,0.0,Play WARMACHINE: Tactics Demo,False,253510,WhiteMoon Dreams,Mixed,,...,0,0,0,0,0,0,1,0,0,0
3832,,FREE China Theme Pack,2015-06-10,0.0,Install Theme,False,370880,Stolen Couch Games,4 user reviews,,...,0,1,0,1,0,0,0,0,0,0
3918,,Parcel - Soundtrack,2015-07-02,0.0,Third-party,False,362970,Polar Bunny Ltd,,,...,0,0,0,0,0,0,0,0,0,0
4026,DigitalEZ,Oblivious Garden ~White Day,2015-07-20,0.0,Play Now,False,345040,"CorypheeSoft,DigitalEZ",Mostly Positive,,...,0,0,0,0,0,0,0,0,0,0
22734,Boomzap Entertainment,Legends of Callasia,2016-06-10,0.0,Play the Demo,False,438920,Boomzap Entertainment,Very Positive,,...,0,0,0,0,0,0,1,0,0,0
26217,,Area-X - Extra Gallery,2015-06-24,0.0,Play Now,False,383860,Zeiva Inc,,,...,0,0,0,1,0,0,0,0,0,0
31838,"PopCap Games, Inc.",Peggle Extreme,2007-09-11,0.0,Third-party,False,3483,"PopCap Games, Inc.",Very Positive,,...,0,0,0,0,0,0,0,0,0,0


In [53]:
cleanedGamesDf.drop(index=cleanedGamesDf[cleanedGamesDf['price'].map(type) == str].index, inplace=True)

In [54]:
cleanedGamesDf[cleanedGamesDf['price'].map(type) == str]['price'].unique()

array([], dtype=object)

Now change DataType of price to numeric.

In [55]:
cleanedGamesDf['price'] = pd.to_numeric(cleanedGamesDf['price'])
cleanedGamesDf['price'].info()

<class 'pandas.core.series.Series'>
Index: 27613 entries, 0 to 32133
Series name: price
Non-Null Count  Dtype  
--------------  -----  
27613 non-null  float64
dtypes: float64(1)
memory usage: 431.5 KB


## metascore

We should drop them since most are NaNs 

In [56]:
column_describe('metascore')

Describe:
count     4043
unique      71
top          0
freq      1533
Name: metascore, dtype: int64

Unique Values:
[nan 0 96 84 80 76 70 'NA' 69 81 75 72 66 67 77 91 89 83 88 65 94 57 86 87
 92 79 82 58 74 85 90 68 71 60 73 59 64 61 54 53 78 51 44 63 38 56 49 52
 62 93 48 34 95 43 55 24 46 41 20 39 45 35 47 40 36 50 32 37 33 42 29 30]

NaN amount: 23570


In [57]:
cleanedGamesDf.drop(columns='metascore', inplace=True)
cleanedGamesDf.head(2)

Unnamed: 0,publisher,title,release_date,discount_price,price,early_access,id,developer,sentiment,genre_Accounting,...,genre_Photo Editing,genre_RPG,genre_Racing,genre_Simulation,genre_Software Training,genre_Sports,genre_Strategy,genre_Utilities,genre_Video Production,genre_Web Publishing
0,Kotoshiro,Lost Summoner Kitty,2018-01-04,4.49,4.99,False,761140,Kotoshiro,,0,...,0,0,0,1,0,0,1,0,0,0
1,0,0,0,0.0,0.0,0,0,0,0.0,0,...,0,0,0,0,0,0,0,0,0,0


## release_date QUE HACER CON ESTO???????????

We could try to convert to Date Time.

In [58]:
column_describe('release_date')

Describe:
count     27609
unique     3193
top           0
freq       1533
Name: release_date, dtype: int64

Unique Values:
['2018-01-04' 0 '2017-12-07' ... '2003-11-01' '2004-03-16' '2004-03-01']

NaN amount: 4


In [59]:
cleanedGamesDf[cleanedGamesDf['release_date'].map(type) == str]['release_date'].unique()

array(['2018-01-04', '2017-12-07', '2018-01-03', ..., '2003-11-01',
       '2004-03-16', '2004-03-01'], dtype=object)

In [60]:
cleanedGamesDf[cleanedGamesDf['release_date'].str.startsWith('Soon..')]

AttributeError: 'StringMethods' object has no attribute 'startsWith'

In [None]:
cleanedGamesDf[cleanedGamesDf['release_date'] == 'Soon..']

Unnamed: 0,publisher,title,release_date,discount_price,price,early_access,id,developer,sentiment,genre_Accounting,...,genre_Photo Editing,genre_RPG,genre_Racing,genre_Simulation,genre_Software Training,genre_Sports,genre_Strategy,genre_Utilities,genre_Video Production,genre_Web Publishing


In [None]:
pd.to_datetime(cleanedGamesDf['release_date'])

ValueError: time data "0" doesn't match format "%Y-%m-%d", at position 1. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

In [None]:
column_describe('release_date')

Describe:
count          30068
unique          3582
top       2012-10-16
freq             100
Name: release_date, dtype: object

Unique Values:
['2018-01-04' '2017-07-24' '2017-12-07' ... '2016-11-19' 'January 2018'
 '2018-10-01']

NaN amount: 2067


'publisher', 'release_date', 
       'early_access', 'id', 'developer', 'sentiment'

In [None]:
cleanedGamesDf.columns

Index(['publisher', 'title', 'release_date', 'discount_price', 'price',
       'early_access', 'id', 'developer', 'sentiment', 'genre_Accounting',
       'genre_Action', 'genre_Adventure', 'genre_Animation &amp; Modeling',
       'genre_Audio Production', 'genre_Casual',
       'genre_Design &amp; Illustration', 'genre_Early Access',
       'genre_Education', 'genre_Free to Play', 'genre_Indie',
       'genre_Massively Multiplayer', 'genre_Photo Editing', 'genre_RPG',
       'genre_Racing', 'genre_Simulation', 'genre_Software Training',
       'genre_Sports', 'genre_Strategy', 'genre_Utilities',
       'genre_Video Production', 'genre_Web Publishing'],
      dtype='object')

---
Otras cosas para borrar luego

In [None]:
print("Games containinig Free on the price:", cleanedGamesDf['price'].str.count("Free").sum())

with pd.option_context('display.max_rows', None,
                       'display.max_columns', None,
                       'display.precision', 3,
                       ):

    print(cleanedGamesDf['price'].value_counts())

Games containinig Free on the price: 0.0
price
4.99                             3860
9.99                             3584
2.99                             3178
0.99                             2478
1.99                             2251
19.99                            1569
0                                1533
3.99                             1416
14.99                            1391
6.99                             1112
7.99                             1008
5.99                              868
29.99                             478
12.99                             382
24.99                             366
8.99                              314
11.99                             303
39.99                             278
49.99                             155
34.99                             104
59.99                              92
17.99                              84
16.99                              83
13.99                              78
10.99                              78
15.