# Data Preprocessing and Transformation
## Data Mining - Group 2ASJ


## Libraries and tools



In [None]:
# Data load and manipulation
from google.colab import files
import io
 
# DataFrame library
import pandas as pd
 
# Visualization 
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Basic Operations
import numpy as np
import itertools
import operator
 
# Preprocessing
from sklearn import preprocessing 
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer

## Main dataset

### Loading the data

In [None]:
main_dataset = pd.read_csv('./steam.csv', sep=',')
main_dataset.head()

Unnamed: 0,appid,name,release_date,english,developer,publisher,platforms,required_age,categories,genres,steamspy_tags,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,owners,price
0,10,Counter-Strike,2000-11-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,124534,3339,17612,317,10000000-20000000,7.19
1,20,Team Fortress Classic,1999-04-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,3318,633,277,62,5000000-10000000,3.99
2,30,Day of Defeat,2003-05-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Valve Anti-Cheat enabled,Action,FPS;World War II;Multiplayer,0,3416,398,187,34,5000000-10000000,3.99
3,40,Deathmatch Classic,2001-06-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,1273,267,258,184,5000000-10000000,3.99
4,50,Half-Life: Opposing Force,1999-11-01,1,Gearbox Software,Valve,windows;mac;linux,0,Single-player;Multi-player;Valve Anti-Cheat en...,Action,FPS;Action;Sci-fi,0,5250,288,624,415,5000000-10000000,3.99


Because we are going to modify the 'main_dataset' DataFrame, we create a copy so that the original data remains available after the modifications.

In [None]:
original_data = main_dataset.copy()

In [None]:
print("Number of records in the original dataset: " + str(len(main_dataset)))

Number of records in the original dataset: 27075


### Handling missing values

In [None]:
print("Number of missing values per variable in the main dataset: ")
main_dataset.isnull().sum()

Number of missing values per variable in the main dataset: 


appid               0
name                0
release_date        0
english             0
developer           0
publisher           0
platforms           0
required_age        0
categories          0
genres              0
steamspy_tags       0
achievements        0
positive_ratings    0
negative_ratings    0
average_playtime    0
median_playtime     0
owners              0
price               0
dtype: int64

As the previous cell shows, the main dataset has no missing values, so no treatment is needed before using it in later stages.

### Target variable

The target variable is computed as the number of positive ratings divided by the total number of ratings. We chose this ratio to normalize the score, so that games with many ratings and games with few ratings are measured on the same scale.

In [None]:
main_dataset['target'] = main_dataset['positive_ratings']/(main_dataset['positive_ratings'] + main_dataset['negative_ratings'])

To avoid anomalous values of the target variable, we keep only the games that have at least 20 ratings.

In [None]:
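main_dataset['total_reviews'] = main_dataset['positive_ratings'] + main_dataset['negative_ratings']
# .copy() makes the filtered result an independent DataFrame, so the in-place changes made in later cells do not trigger SettingWithCopyWarning
main_dataset = main_dataset[main_dataset['total_reviews'] >= 20].copy()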

In [None]:
print("Number of records in the main dataset after filtering by the target variable: " + str(len(main_dataset)))

Number of records in the main dataset after filtering by the target variable: 16807
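
As a small illustration of why the minimum-ratings filter matters, the sketch below computes the same ratio on a toy table. The game names and counts are invented for the example, and `min_reviews` is a hypothetical variable name; only the formula and the threshold of 20 come from the cells above.

```python
import pandas as pd

# Toy data (invented): a popular game, a game with a single rating, and a game with no ratings
toy = pd.DataFrame({
    'name': ['popular', 'one_vote', 'unrated'],
    'positive_ratings': [900, 1, 0],
    'negative_ratings': [100, 0, 0],
})

min_reviews = 20  # hypothetical name for the threshold used above

toy['total_reviews'] = toy['positive_ratings'] + toy['negative_ratings']
# 0/0 gives NaN for games without ratings; a single positive vote gives a "perfect" 1.0
toy['target'] = toy['positive_ratings'] / toy['total_reviews']

# Keep only the games with enough ratings for the ratio to be meaningful
filtered = toy[toy['total_reviews'] >= min_reviews].copy()
print(filtered[['name', 'target']])
```

Without the filter, the single-vote game would look better than the popular one, and the unrated game would carry a NaN target.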


### Preprocessing the remaining variables of the main dataset


As we saw in the analysis notebook, the 'english' column takes the value 1 in 98% of the records and does not appear to be correlated with the target variable, so we decided to discard it.

In [None]:
main_dataset.drop(['english'], axis=1, inplace = True)

Next, we look at the treatment applied to the 'developer' and 'publisher' columns.

In [None]:
print("Number of distinct values in the 'developer' column: " + str(main_dataset['developer'].nunique()))
print("Number of distinct values in the 'publisher' column: " + str(main_dataset['publisher'].nunique()))
print("Number of records in which both columns have the same value: " + str(len(main_dataset[main_dataset['developer'] == main_dataset['publisher']])) + " out of " + str(len(main_dataset)))

Number of distinct values in the 'developer' column: 10680
Number of distinct values in the 'publisher' column: 8391
Number of records in which both columns have the same value: 10054 out of 16807


As we can see, these columns have far too many distinct values to consider turning each value into its own column. We consider the best option in this case to be adding to the dataset the number of games developed by the corresponding developer.

In addition, we could add one column for each of the most important developers and one more column grouping the 'indie' studios that have only a few games. However, since no small subset of studios clearly has more games than the rest, we decided to keep only the column with the number of games developed.

In [None]:
# Build a dictionary mapping each developer to the number of games it developed in the original (unfiltered) main dataset.
# Grouping by the column name (a string, not a one-element list) keeps the keys as plain developer names instead of 1-tuples.
developer_dict = original_data.groupby('developer').size().to_dict()

# Create a new column with the number of games developed and discard the original column
main_dataset['number_developer_games'] = main_dataset['developer'].map(developer_dict)
main_dataset.drop(['developer'], axis=1, inplace = True)
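
The count-based encoding described above can also be sketched on a toy table with `value_counts` and `map`. The developer labels and the table names `all_games` and `kept_games` are invented for the example; this is one possible way to express the idea, not necessarily the most suitable one for the real data.

```python
import pandas as pd

# Toy data (invented developer labels)
all_games = pd.DataFrame({'developer': ['A', 'A', 'A', 'B', 'C', 'C']})
kept_games = pd.DataFrame({'developer': ['A', 'C', 'B']})

# Count games per developer on the full table, then look the counts up for the kept rows
games_per_developer = all_games['developer'].value_counts()
kept_games['number_developer_games'] = kept_games['developer'].map(games_per_developer)
print(kept_games)
```

Counting on the full table rather than on the filtered one means that a developer's games with few ratings still contribute to its total.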

Although the 'developer' and 'publisher' columns hold the same value in most cases, the number of games developed and the number of games published by the same company need not be the same.

Even so, as we saw in the analysis notebook, these two numbers still coincide in most cases, which makes the publisher information highly redundant and of little added value.

In [None]:
# Discard the 'publisher' column because it is redundant with the 'developer' information
main_dataset.drop(['publisher'], axis=1, inplace = True)

The 'platforms' column lists the platforms on which the game is available. Its values are the platform names separated by the ';' character, so the natural transformation is to split them into separate boolean columns.

In [None]:
# Define a function that, given the platforms separated by ';', returns the corresponding list of 1/0 flags to add to the dataset
def get_platform_avalaible_list(x):
  platforms_avalaible = [0, 0, 0]
  platforms = x.split(';')

  for platform in platforms:
    if platform == 'windows':
      i = 0
    elif platform == 'mac':
      i = 1
    elif platform == 'linux':
      i = 2
    else:
      # Report the unknown platform and keep processing the remaining ones
      print("Unidentified platform: " + platform)
      continue
    platforms_avalaible[i] = 1

  return platforms_avalaible

# Compute the values of each new column for the main dataset
platform_available_list = [get_platform_avalaible_list(x) for x in main_dataset['platforms'].tolist()]

# Add the columns to the main dataset
main_dataset['avalaible_on_windows'] = [x[0] for x in platform_available_list]
main_dataset['avalaible_on_mac'] =  [x[1] for x in platform_available_list]
main_dataset['avalaible_on_linux'] =  [x[2] for x in platform_available_list]

# Discard the original 'platforms' column
main_dataset.drop(['platforms'], axis=1, inplace = True)
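
For columns whose values are tokens joined by a separator, pandas offers `Series.str.get_dummies`, which builds the 0/1 columns in one call. The sketch below applies it to a few platform strings in the format shown in the dataset preview; it is an alternative formulation, not the code used in this notebook.

```python
import pandas as pd

platforms = pd.Series(['windows;mac;linux', 'windows', 'windows;linux'], name='platforms')

# One 0/1 column per distinct token found between the separators (columns come out sorted by name)
flags = platforms.str.get_dummies(sep=';')

# The prefix mirrors the column names used in this notebook, including their spelling
flags = flags.add_prefix('avalaible_on_')
print(flags)
```

The same call works for any ';'-separated column, with the difference that it creates a column for every token it finds rather than for a fixed list of expected values.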

The 'required_age' column needs no transformation, since it is correct to interpret it as numeric.

The 'categories' column, like 'platforms', contains values made up of several subvalues separated by the ';' character. As with 'platforms', it is transformed into one column per subvalue.

In [None]:
categories = []

# Collect every subvalue that can appear in the 'categories' column
for element in list(main_dataset['categories'].unique()):
  for separated_element in element.split(';'):
    categories.append(separated_element)
categories = set(categories)

# Create one column per subvalue in the main dataset
for c in categories:
  main_dataset[c] = 0

# Go through the main dataset record by record, check which subvalues appear in the 'categories' column and set the corresponding columns to 1
for i, row in main_dataset.iterrows():
  for subvalue in row['categories'].split(';'):
    main_dataset.at[i, subvalue] = 1

# Discard the original 'categories' column
main_dataset.drop(['categories'], axis=1, inplace = True)

To reduce the number of columns added from the subvalues of the 'categories' column, we looked at the percentage of occurrences of the value '1' in each column and discarded the columns where this percentage was below 2% or above 60%.

In this way we reduced the columns generated from the 'categories' column from 29 to 20.

In [None]:
print("Percentage of occurrences of the value '1' in: ")

# Go through all the columns generated from the 'categories' column
# (iterating over the 'categories' set avoids depending on the position of these columns in the DataFrame)
for column in categories:
  # Compute the percentage of occurrences of the value '1' in this column
  one_value_percentage = len(main_dataset[main_dataset[column] == 1]) * 100 / len(main_dataset)
  print("\tColumn '" + column + "': " + "{:.3f}".format(one_value_percentage) + " % ")

  # Discard the column if this percentage is below 2% or above 60%
  if (one_value_percentage < 2) or (one_value_percentage > 60):
    main_dataset.drop([column], axis=1, inplace = True)

Percentage of occurrences of the value '1' in: 
	Column 'Local Multi-Player': 5.081 % 
	Column 'Mods': 0.012 % 
	Column 'Stats': 7.913 % 
	Column 'Includes Source SDK': 0.167 % 
	Column 'SteamVR Collectibles': 0.238 % 
	Column 'Cross-Platform Multiplayer': 5.289 % 
	Column 'Full controller support': 23.829 % 
	Column 'Captions available': 3.261 % 
	Column 'Steam Cloud': 33.034 % 
	Column 'Steam Leaderboards': 14.518 % 
	Column 'VR Support': 1.178 % 
	Column 'Includes level editor': 4.718 % 
	Column 'Steam Achievements': 59.440 % 
	Column 'Single-player': 94.282 % 
	Column 'Steam Workshop': 4.516 % 
	Column 'Commentary available': 0.690 % 
	Column 'Online Multi-Player': 10.859 % 
	Column 'Online Co-op': 5.069 % 
	Column 'MMO': 2.237 % 
	Column 'Shared/Split Screen': 8.032 % 
	Column 'Multi-player': 18.052 % 
	Column 'Steam Trading Cards': 41.917 % 
	Column 'Mods (require HL2)': 0.006 % 
	Column 'Valve Anti-Cheat enabled': 0.559 % 
	Column 'Local Co-op': 3.445 % 
	C
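
The frequency-based filter can be sketched in vectorized form: for 0/1 columns, the column mean is the fraction of ones. The column names and values below are invented, and `low` and `high` are hypothetical names for the 2% and 60% thresholds mentioned in the text.

```python
import pandas as pd

# Toy 0/1 indicator columns (invented): the share of ones is 1%, 50% and 95%
flags = pd.DataFrame({
    'rare': [1] * 1 + [0] * 99,
    'balanced': [1] * 50 + [0] * 50,
    'almost_always': [1] * 95 + [0] * 5,
})

low, high = 2, 60  # thresholds in percent

# For 0/1 columns the mean is the fraction of ones
percent_ones = flags.mean() * 100

# Select the columns outside the accepted band and drop them in a single call
to_drop = percent_ones[(percent_ones < low) | (percent_ones > high)].index
reduced = flags.drop(columns=to_drop)
print(list(reduced.columns))
```

Computing all the percentages first and dropping once also avoids modifying the DataFrame while looping over its columns.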

Having dealt with the 'categories' column, we do something similar with the 'genres' column. We first obtain the distinct subvalues that can appear in this column separated by ';', then create one column per subvalue, and finally go through the records one by one to see which subvalues appear in the 'genres' column.

In [None]:
genres = []

# Collect every subvalue that can appear in the 'genres' column
for element in list(main_dataset['genres'].unique()):
  for separated_element in element.split(';'):
    genres.append(separated_element)
genres = set(genres)

# Create one column per subvalue in the main dataset
for g in genres:
  main_dataset[g] = 0

# Go through the main dataset record by record, check which subvalues appear in the 'genres' column and set the corresponding columns to 1
for i, row in main_dataset.iterrows():
  for subvalue in row['genres'].split(';'):
    main_dataset.at[i, subvalue] = 1

# Discard the original 'genres' column
main_dataset.drop(['genres'], axis=1, inplace = True)

As we did earlier with the categories, we reduce the number of columns generated from the 'genres' column. To do so, we check the percentage of occurrences of the value '1' in each created column and discard the columns where this percentage is below 1%.

In this way we reduced the columns generated from the 'genres' column from 27 to 16.

In [None]:
print("Percentage of occurrences of the value '1' in: ")

# Go through all the columns generated from the 'genres' column
# (iterating over the 'genres' set avoids depending on the position of these columns in the DataFrame)
for column in genres:
  # Compute the percentage of occurrences of the value '1' in this column
  one_value_percentage = len(main_dataset[main_dataset[column] == 1]) * 100 / len(main_dataset)
  print("\tColumn '" + column + "': " + "{:.3f}".format(one_value_percentage) + " % ")

  # Discard the column if this percentage is below 1%
  if one_value_percentage < 1:
    main_dataset.drop([column], axis=1, inplace = True)

Percentage of occurrences of the value '1' in: 
	Column 'Education': 0.161 % 
	Column 'Casual': 32.320 % 
	Column 'Massively Multiplayer': 3.445 % 
	Column 'Animation & Modeling': 0.280 % 
	Column 'Video Production': 0.125 % 
	Column 'Sexual Content': 1.059 % 
	Column 'Gore': 1.940 % 
	Column 'Racing': 3.796 % 
	Column 'RPG': 18.034 % 
	Column 'Adventure': 37.948 % 
	Column 'Violent': 2.838 % 
	Column 'Free to Play': 8.865 % 
	Column 'Audio Production': 0.107 % 
	Column 'Accounting': 0.006 % 
	Column 'Early Access': 9.193 % 
	Column 'Action': 43.791 % 
	Column 'Indie': 69.578 % 
	Column 'Photo Editing': 0.059 % 
	Column 'Game Development': 0.059 % 
	Column 'Software Training': 0.137 % 
	Column 'Sports': 4.451 % 
	Column 'Strategy': 20.848 % 
	Column 'Design & Illustration': 0.327 % 
	Column 'Utilities': 0.535 % 
	Column 'Simulation': 20.093 % 
	Column 'Web Publishing': 0.131 % 
	Column 'Nudity': 1.148 % 


We discard the 'steamspy_tags' column, since we have a supporting dataset that provides this same information in a better form.

In [None]:
main_dataset.drop(['steamspy_tags'], axis=1, inplace = True)

The 'achievements' column contains the number of achievements the game has on Steam, which may matter for our objective because there is a player profile that focuses on earning them. This column needs no preprocessing, since it already comes as an integer.

The 'owners' column contains ranges of integer values. Turning each value of this column into a separate column would not be appropriate, because we would lose the ordinal nature of the variable. To preserve it, we replace each range with a numeric value, specifically the midpoint of the two ends of the range.

In [None]:
# Define a function that converts a range such as '5000000-10000000' into a numeric value (its midpoint)
def convertir_rango_a_numerico(x):
  vals = x.split('-')
  val_1 = int(vals[0])
  val_2 = int(vals[1])
  return (val_1 + val_2) // 2

# Convert the 'owners' column with the function defined above
main_dataset['owners'] = main_dataset['owners'].apply(convertir_rango_a_numerico)
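
The range-to-midpoint conversion can also be sketched without a row-wise function, using the string methods of pandas. The first two ranges below appear in the dataset preview; the third is added only for illustration.

```python
import pandas as pd

owners = pd.Series(['10000000-20000000', '5000000-10000000', '0-20000'])

# Split each range into its two bounds (one column per bound) and convert them to integers
bounds = owners.str.split('-', expand=True).astype(int)

# Midpoint of each range, using integer division as in the function above
midpoint = bounds.sum(axis=1) // 2
print(midpoint.tolist())
```

Because the midpoints keep the order of the original ranges, the variable stays ordinal while becoming numeric.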

The 'price' column is very important, since price weighs heavily when rating a game. It is of type float, so it needs no preprocessing or transformation.

With this we have preprocessed and transformed all the features of the main dataset. Finally, we discard the columns that provide no information or that were only used to generate others, keeping just the essential columns plus the 'appid' and 'name' columns, which will let us merge this dataset with the supporting datasets.

In [None]:
cols_to_drop = ['release_date', 'positive_ratings', 'negative_ratings', 'total_reviews']

main_dataset.drop(cols_to_drop, axis=1, inplace = True)

Finally, we reorder the columns so that their order resembles that of the original dataset and the target column comes last.

In [None]:
reorder_cols = ['appid', 'name', 'number_developer_games', 'avalaible_on_linux', 
                'avalaible_on_mac', 'avalaible_on_windows', 'required_age']
reorder_cols.extend(main_dataset.columns[13:])
reorder_cols.extend(['achievements', 'average_playtime', 'median_playtime', 'owners', 'price', 'target'])

main_dataset = main_dataset[reorder_cols]
main_dataset.head()

Unnamed: 0,appid,name,number_developer_games,avalaible_on_linux,avalaible_on_mac,avalaible_on_windows,required_age,Local Multi-Player,Stats,Cross-Platform Multiplayer,Full controller support,Captions available,Steam Cloud,Steam Leaderboards,Includes level editor,Steam Achievements,Steam Workshop,Online Multi-Player,Online Co-op,MMO,Shared/Split Screen,Multi-player,Steam Trading Cards,Local Co-op,Co-op,Partial Controller Support,In-App Purchases,Casual,Massively Multiplayer,Sexual Content,Gore,Racing,RPG,Adventure,Violent,Free to Play,Early Access,Action,Indie,Sports,Strategy,Simulation,Nudity,achievements,average_playtime,median_playtime,owners,price,target
0,10,Counter-Strike,26,1,1,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,17612,317,15000000,7.19,0.973888
1,20,Team Fortress Classic,26,1,1,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,277,62,7500000,3.99,0.839787
2,30,Day of Defeat,26,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,187,34,7500000,3.99,0.895648
3,40,Deathmatch Classic,26,1,1,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,258,184,7500000,3.99,0.826623
4,50,Half-Life: Opposing Force,7,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,624,415,7500000,3.99,0.947996


## Merging the supporting datasets with the main dataset

### Supporting dataset 1 - Computer requirements

We load auxiliary dataset 1. It is the result of preprocessing the supporting dataset that contained information about the hardware requirements of the games.

In [None]:
game_requirements = pd.read_csv('auxiliar1.csv', sep=',')
game_requirements.head()

Unnamed: 0.1,Unnamed: 0,steam_appid,pc_requirements,mac_requirements,linux_requirements,minimum,RAM,RAM_MBytes,Processor,Processor MHZ
0,0,10,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...","500 mhz processor, 96mb ram, 16mb video card, ...",96mb ram,96.0,500 mhz,500.0
1,1,20,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...","500 mhz processor, 96mb ram, 16mb video card, ...",96mb ram,96.0,500 mhz,500.0
2,2,30,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...","500 mhz processor, 96mb ram, 16mb video card, ...",96mb ram,96.0,500 mhz,500.0
3,3,40,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...","500 mhz processor, 96mb ram, 16mb video card, ...",96mb ram,96.0,500 mhz,500.0
4,4,50,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...","500 mhz processor, 96mb ram, 16mb video card, ...",96mb ram,96.0,500 mhz,500.0


From this dataset we only use the 'RAM_MBytes' column, which contains the RAM requirements of the game. The 'Processor MHZ' column could also be used, but we prefer not to, because we could not extract this information for many of the games and it contains many null values.

In [None]:
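# Add the RAM_MBytes column to the main dataset, matching games by their appid rather than by row position
ram_by_appid = game_requirements.drop_duplicates('steam_appid').set_index('steam_appid')['RAM_MBytes']
main_dataset['RAM'] = main_dataset['appid'].map(ram_by_appid)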
main_dataset.head()

Unnamed: 0,appid,name,number_developer_games,avalaible_on_linux,avalaible_on_mac,avalaible_on_windows,required_age,Local Multi-Player,Stats,Cross-Platform Multiplayer,Full controller support,Captions available,Steam Cloud,Steam Leaderboards,Includes level editor,Steam Achievements,Steam Workshop,Online Multi-Player,Online Co-op,MMO,Shared/Split Screen,Multi-player,Steam Trading Cards,Local Co-op,Co-op,Partial Controller Support,In-App Purchases,Casual,Massively Multiplayer,Sexual Content,Gore,Racing,RPG,Adventure,Violent,Free to Play,Early Access,Action,Indie,Sports,Strategy,Simulation,Nudity,achievements,average_playtime,median_playtime,owners,price,target,RAM
0,10,Counter-Strike,26,1,1,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,17612,317,15000000,7.19,0.973888,96.0
1,20,Team Fortress Classic,26,1,1,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,277,62,7500000,3.99,0.839787,96.0
2,30,Day of Defeat,26,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,187,34,7500000,3.99,0.895648,96.0
3,40,Deathmatch Classic,26,1,1,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,258,184,7500000,3.99,0.826623,96.0
4,50,Half-Life: Opposing Force,7,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,624,415,7500000,3.99,0.947996,96.0
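
When a column from another table is attached, matching on a key such as the app id is less fragile than relying on row order. The sketch below uses `DataFrame.merge` on toy tables; the values are invented, and only the column names 'appid', 'steam_appid' and 'RAM_MBytes' mirror the datasets above.

```python
import pandas as pd

# Toy tables (invented values); the main table keeps the row labels left by an earlier filter
main = pd.DataFrame({'appid': [10, 30, 50], 'price': [7.19, 3.99, 3.99]}, index=[0, 2, 4])
requirements = pd.DataFrame({'steam_appid': [50, 10, 20], 'RAM_MBytes': [512.0, 96.0, 128.0]})

merged = main.merge(
    requirements.rename(columns={'steam_appid': 'appid', 'RAM_MBytes': 'RAM'}),
    on='appid',
    how='left',             # keep every game of the main table
    validate='one_to_one',  # raise an error if an appid is duplicated on either side
    indicator=True,         # add a '_merge' column saying where each row was found
)
print(merged)
```

Note that the result carries a fresh 0..n-1 index, so any later step that relied on the original row labels would have to match on 'appid' instead.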


### Supporting dataset 2 - Steamspy Tags

We load auxiliary dataset 2, which is the supporting dataset named 'steamspy_tag_data'. It contains information about the tags that the SteamSpy community assigns to the games in our dataset.

In [None]:
steamspy_tags = pd.read_csv('steamspy_tag_data.csv', sep=',')
steamspy_tags.head()

Unnamed: 0,appid,1980s,1990s,2.5d,2d,2d_fighter,360_video,3d,3d_platformer,3d_vision,4_player_local,4x,6dof,atv,abstract,action,action_rpg,action_adventure,addictive,adventure,agriculture,aliens,alternate_history,america,animation_&_modeling,anime,arcade,arena_shooter,artificial_intelligence,assassin,asynchronous_multiplayer,atmospheric,audio_production,bmx,base_building,baseball,based_on_a_novel,basketball,batman,battle_royale,...,touch_friendly,tower_defense,trackir,trading,trading_card_game,trains,transhumanism,turn_based,turn_based_combat,turn_based_strategy,turn_based_tactics,tutorial,twin_stick_shooter,typing,underground,underwater,unforgiving,utilities,vr,vr_only,vampire,video_production,villain_protagonist,violent,visual_novel,voice_control,voxel,walking_simulator,war,wargame,warhammer_40k,web_publishing,werewolves,western,word_game,world_war_i,world_war_ii,wrestling,zombies,e_sports
0,10,144,564,0,0,0,0,0,0,0,0,0,0,0,0,2681,0,0,0,0,0,0,0,0,0,0,0,0,0,151,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,40,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,550
1,20,0,71,0,0,0,0,0,0,0,0,0,0,0,0,208,0,0,0,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,26,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,30,0,0,0,0,0,0,0,0,0,0,0,0,0,0,99,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,80,0,0,0,0,0,0,5,122,0,0,0
3,40,0,0,0,0,0,0,0,0,0,0,0,0,0,0,85,0,0,0,0,0,0,0,0,0,0,0,22,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,50,0,77,0,0,0,0,0,0,0,0,0,0,0,0,211,0,0,0,87,0,122,0,0,0,0,0,0,0,0,0,73,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


This dataset has a large number of columns. For now we add all of them to the data table, although an exhaustive feature selection will be needed later. The only processing we apply before merging it with the main dataset is to normalize the values of each record: every value is divided by the total number of votes for that game, so that the values of all the columns in a record sum to 1. A game with no votes at all leads to a 0/0 division, which produces NaN values and the runtime warning printed after the cell.

In [None]:
# Convert all the tag columns to float
for c in steamspy_tags.columns[1:]:
  steamspy_tags[c] = steamspy_tags[c].astype(float) 

# Go through all the records of the dataset
for i, row in steamspy_tags.iterrows():
  total_votes = 0
  # Sum all the votes of this record
  for c in steamspy_tags.columns[1:]:
    total_votes += row[c]

  # Divide the value of every column by the total; a record with no votes yields 0/0 = NaN
  for c in steamspy_tags.columns[1:]:
    steamspy_tags.at[i, c] = float(int(row[c])/total_votes)

steamspy_tags.head()


invalid value encountered in double_scalars



Unnamed: 0,appid,1980s,1990s,2.5d,2d,2d_fighter,360_video,3d,3d_platformer,3d_vision,4_player_local,4x,6dof,atv,abstract,action,action_rpg,action_adventure,addictive,adventure,agriculture,aliens,alternate_history,america,animation_&_modeling,anime,arcade,arena_shooter,artificial_intelligence,assassin,asynchronous_multiplayer,atmospheric,audio_production,bmx,base_building,baseball,based_on_a_novel,basketball,batman,battle_royale,...,touch_friendly,tower_defense,trackir,trading,trading_card_game,trains,transhumanism,turn_based,turn_based_combat,turn_based_strategy,turn_based_tactics,tutorial,twin_stick_shooter,typing,underground,underwater,unforgiving,utilities,vr,vr_only,vampire,video_production,villain_protagonist,violent,visual_novel,voice_control,voxel,walking_simulator,war,wargame,warhammer_40k,web_publishing,werewolves,western,word_game,world_war_i,world_war_ii,wrestling,zombies,e_sports
0,10,0.009231,0.036156,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.17187,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00968,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002564,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035259
1,20,0.0,0.043425,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.127217,0.0,0.0,0.0,0.009174,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015902,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,30,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.100202,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.080972,0.0,0.0,0.0,0.0,0.0,0.0,0.005061,0.123482,0.0,0.0,0.0
3,40,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.221354,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.057292,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,50,0.0,0.043307,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.118673,0.0,0.0,0.0,0.048931,0.0,0.068616,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041057,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
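
The row normalization can also be written without explicit loops. The sketch below uses invented vote counts and makes one assumption that is not taken from the notebook: games without any votes receive 0 for every tag instead of NaN.

```python
import numpy as np
import pandas as pd

# Toy vote counts (invented); the last game has no votes at all
tags = pd.DataFrame({
    'appid': [10, 20, 30],
    'action': [30, 1, 0],
    'fps': [10, 3, 0],
})

tag_cols = tags.columns[1:]
totals = tags[tag_cols].sum(axis=1)

# Divide each row by its total; zero totals become NaN first, so the division gives NaN instead of a warning
shares = tags[tag_cols].div(totals.replace(0, np.nan), axis=0)

# Assumption of this sketch only: games without votes get 0 for every tag
shares = shares.fillna(0.0)
normalized = pd.concat([tags[['appid']], shares], axis=1)
print(normalized)
```

On a table with several hundred tag columns this form is typically much faster than iterating with `iterrows`, since the division is applied to whole columns at once.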


After this, we merge it with the main dataset.

In [None]:
main_dataset = main_dataset.merge(steamspy_tags, left_on='appid', right_on = 'appid')

### Supporting dataset 3 - Mean, median and standard deviation of playtime

We load auxiliary dataset 3. It is the result of preprocessing the supporting dataset that contained information about the time different users spent playing each video game.

In [None]:
hours_played = pd.read_csv('auxiliar3.csv', sep=',')
hours_played.head()

Unnamed: 0.1,Unnamed: 0,appid,average_playtime,median_playtime,standard_deviation
0,0,10,17612.0,317.0,617.95
1,1,20,277.0,62.0,1.13
2,2,30,187.0,34.0,45.92
3,3,40,258.0,184.0,1.63
4,4,50,624.0,415.0,0.0


With this dataset we replace the values of the 'average_playtime' and 'median_playtime' columns in the main dataset.

In [None]:
# Keep only the games for which we have both the mean and the median playtime
hours_played = hours_played[(hours_played['average_playtime'] != 0) & (hours_played['median_playtime'] != 0)]

# Match games by 'appid': after the previous merge, the index of main_dataset no longer corresponds to the row numbers stored in 'Unnamed: 0'
hours_by_appid = hours_played.drop_duplicates('appid').set_index('appid')

# Replace the values in the main dataset where this dataset provides them, keeping the existing value otherwise
for col in ['average_playtime', 'median_playtime']:
  main_dataset[col] = main_dataset['appid'].map(hours_by_appid[col]).fillna(main_dataset[col])
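
Since `DataFrame.merge` returns a table with a fresh integer index, row labels saved before a merge may no longer point to the same game afterwards. The toy sketch below (invented values) illustrates this and shows a key-based update through 'appid' as one possible alternative.

```python
import pandas as pd

# Main table after a filter: the row labels 0, 2 and 4 survive
main = pd.DataFrame({'appid': [10, 30, 50], 'average_playtime': [100.0, 300.0, 500.0]}, index=[0, 2, 4])
other = pd.DataFrame({'appid': [10, 30, 50], 'tag': [1, 0, 1]})

merged = main.merge(other, on='appid')
print(list(merged.index))  # the labels are now consecutive, so label 2 refers to a different game than before

# Key-based update: look the new values up by appid and keep the old value where there is none
updates = pd.DataFrame({'appid': [50, 10], 'average_playtime': [555.0, 111.0]})
new_values = merged['appid'].map(updates.set_index('appid')['average_playtime'])
merged['average_playtime'] = new_values.fillna(merged['average_playtime'])
print(merged)
```

The lookup through `map` requires the app ids in the update table to be unique, which is worth checking on real data.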

### Supporting dataset 4 - Game descriptions

We load auxiliary dataset 4. It is the result of preprocessing the supporting dataset that contained the descriptions of the video games.

In [None]:
descripciones = pd.read_csv('auxiliar4.csv', sep=',',index_col=0)
descripciones.head()

Unnamed: 0,appid,2d,360,3d,50,60,70,80,90,abandoned,ability,able,access,accessible,acclaimed,according,account,accurate,achieve,achievement,acquire,across,act,acting,action,actionpacked,active,activity,actual,actually,adapt,add,added,addictive,adding,addition,additional,adjust,advance,advanced,...,zone,CC,CD,DT,EX,FW,IN,JJ,JJR,JJS,MD,NN,NNP,NNPS,NNS,PDT,POS,PRP,RB,RBR,RBS,RP,SYM,TO,UH,VB,VBD,VBG,VBN,VBP,VBZ,WDT,WP,WRB,num_palabras,num_oraciones,negative,positive,neutral,compound
0,10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.113545,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.049722,0.0,0.0,0.0,0.0,0.340236,0.0,0.0,0.0,0.927666,0.0,0.0,0.0,0.0,0.0,0.0,0.136959,0.0,0.0,0.0,0.0,0.0,0.0,0.049461,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,58,7,0.197,0.361,0.442,0.7867
1,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.140025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.127178,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.084791,0.0,0.0,0.0,0.0,0.217577,0.0,0.0,0.0,0.970744,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.039645,0.0,0.0,0.039432,0.0,0.0,0.0,0.0,63,2,0.1,0.171,0.729,0.4767
2,30,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.429065,0.0,0.0,0.080876,0.889482,0.055853,0.0,0.0,0.0,0.0,0.0,0.041871,0.0,0.0,0.0,0.0,0.0,0.0,0.045363,0.042644,0.0,0.097134,0.0,0.0,0.0,0.0,0.0,73,4,0.194,0.085,0.722,-0.743
3,40,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.179714,0.0,0.178032,0.0,0.0,0.0,0.9414,0.169949,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.129757,0.063664,0.0,0.0,0.0,0.0,0.0,0.0,45,3,0.0,0.173,0.827,0.5859
4,50,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.131625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.058139,0.0,0.0,0.0,0.0,0.596748,0.0,0.0,0.0,0.788875,0.0,0.0,0.0,0.0,0.0,0.0,0.106762,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.054367,0.0,0.061918,0.0,0.0,0.0,0.0,0.0,55,4,0.053,0.076,0.871,0.2023


This dataset has a very large number of columns because it is the result of preprocessing text. For now we add all of them and leave feature selection, which will be needed to reduce the dimensionality of the data table, for a later stage.

In [None]:
# Drop the 'name' column from the main dataset to avoid a clash with the 'name' column that comes from the game descriptions (the word 'name' as a text token)
main_dataset.drop(['name'], axis=1, inplace=True)

# Merge the descriptions dataset into our main dataset
main_dataset = main_dataset.merge(descripciones, left_on='appid', right_on='appid')
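
Because `merge` performs an inner join by default, any `appid` missing from either frame is silently dropped. The following is a minimal sketch of how that could be audited, using two small toy frames as hypothetical stand-ins for `main_dataset` and `descripciones` (the values and the `price`/`compound` columns are illustrative only, not the notebook's actual data):

```python
import pandas as pd

# Toy stand-ins for main_dataset and descripciones (hypothetical values)
left = pd.DataFrame({'appid': [10, 20, 30], 'price': [7.19, 3.99, 3.99]})
right = pd.DataFrame({'appid': [10, 20, 40], 'compound': [0.7867, 0.4767, 0.5859]})

# Inner join (pandas default): only appids present in both frames survive.
# validate='one_to_one' raises if appid is duplicated on either side.
merged = left.merge(right, on='appid', validate='one_to_one')

# An outer join with indicator=True reveals which rows the inner join dropped
audit = left.merge(right, on='appid', how='outer', indicator=True)
dropped = audit[audit['_merge'] != 'both']

print(len(merged), "rows kept;", len(dropped), "rows without a match")
```

Running the same audit on the real frames before overwriting `main_dataset` would show how many games have no description row, which may help explain the final record count.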

## Building the final data table

After completing all the preprocessing of the main dataset and the necessary merges with the auxiliary datasets, we make one final adjustment to the data table: we drop the `appid` variable, which is only an identifier and adds nothing useful.

In [None]:
main_dataset.drop(['appid'], axis=1, inplace = True)

The final data table has 16807 records and 2220 columns. The next cell shows what it looks like:

In [None]:
main_dataset.head()

Unnamed: 0,number_developer_games,avalaible_on_linux,avalaible_on_mac,avalaible_on_windows,required_age,Local Multi-Player,Stats,Cross-Platform Multiplayer,Full controller support,Captions available,Steam Cloud,Steam Leaderboards,Includes level editor,Steam Achievements,Steam Workshop,Online Multi-Player,Online Co-op,MMO,Shared/Split Screen,Multi-player,Steam Trading Cards,Local Co-op,Co-op,Partial Controller Support,In-App Purchases,Casual,Massively Multiplayer,Sexual Content,Gore,Racing,RPG,Adventure,Violent,Free to Play,Early Access,Action,Indie,Sports,Strategy,Simulation,...,zone,CC,CD,DT,EX,FW,IN,JJ,JJR,JJS,MD,NN,NNP,NNPS,NNS,PDT,POS,PRP,RB,RBR,RBS,RP,SYM,TO,UH,VB,VBD,VBG,VBN,VBP,VBZ,WDT,WP,WRB,num_palabras,num_oraciones,negative,positive,neutral,compound
0,26.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.049722,0.0,0.0,0.0,0.0,0.340236,0.0,0.0,0.0,0.927666,0.0,0.0,0.0,0.0,0.0,0.0,0.136959,0.0,0.0,0.0,0.0,0.0,0.0,0.049461,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,58,7,0.197,0.361,0.442,0.7867
1,26.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.084791,0.0,0.0,0.0,0.0,0.217577,0.0,0.0,0.0,0.970744,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.039645,0.0,0.0,0.039432,0.0,0.0,0.0,0.0,63,2,0.1,0.171,0.729,0.4767
2,26.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.429065,0.0,0.0,0.080876,0.889482,0.055853,0.0,0.0,0.0,0.0,0.0,0.041871,0.0,0.0,0.0,0.0,0.0,0.0,0.045363,0.042644,0.0,0.097134,0.0,0.0,0.0,0.0,0.0,73,4,0.194,0.085,0.722,-0.743
3,26.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.179714,0.0,0.178032,0.0,0.0,0.0,0.9414,0.169949,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.129757,0.063664,0.0,0.0,0.0,0.0,0.0,0.0,45,3,0.0,0.173,0.827,0.5859
4,7.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.058139,0.0,0.0,0.0,0.0,0.596748,0.0,0.0,0.0,0.788875,0.0,0.0,0.0,0.0,0.0,0.0,0.106762,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.054367,0.0,0.061918,0.0,0.0,0.0,0.0,0.0,55,4,0.053,0.076,0.871,0.2023


In [None]:
main_dataset.shape

(16807, 2220)

Finally, we export the data table as a CSV file and download it to our machine.

In [None]:
# to_csv opens and writes the file itself, so no surrounding open() is needed
main_dataset.to_csv('tarjeta_datos.csv', index=False)

files.download('tarjeta_datos.csv')

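
As a sketch of why the export passes `index=False`, the snippet below round-trips a small hypothetical frame through an in-memory CSV. The `round_trip` helper and the sample values are assumptions made for illustration; only the `to_csv` and `read_csv` calls mirror the notebook.

```python
import io
import pandas as pd

# Small hypothetical frame standing in for the final data table
frame = pd.DataFrame({'required_age': [0.0, 18.0], 'compound': [0.7867, -0.743]})

def round_trip(df, **to_csv_kwargs):
    """Write df to an in-memory CSV and read it back (illustrative helper)."""
    buffer = io.StringIO()
    df.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)

restored = round_trip(frame, index=False)  # same columns as the original
with_index = round_trip(frame)             # default index=True adds a column

print(list(restored.columns))
print(list(with_index.columns))
```

With the default `index=True`, the row index is written as an unnamed first column and comes back as `Unnamed: 0` when the file is reloaded, which would add a spurious feature to the data table.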