Importing libraries

In [1]:
import pandas as pd
import gzip
import json
import ast

## Transforming **steam_games**

### Loading Steam game data from a gzip-compressed JSON file

In [2]:
# Build a DataFrame from a gzip-compressed JSON file
filas = []

with gzip.open("datasets/steam_games.json.gz", 'rt', encoding='utf-8') as archivo:
    for linea in archivo:
        filas.append(json.loads(linea))
        
# Create the DataFrame
df_games = pd.DataFrame(filas)
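
As a minimal, self-contained sketch of the line-by-line loading pattern above, the snippet below writes a tiny gzip-compressed file with one JSON object per line and reads it back the same way. The file name `sample.json.gz` and the two records are made up for illustration; only the reading loop mirrors the notebook.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Hypothetical sample records, not taken from the real dataset.
sample_records = [
    {"app_name": "Game A", "price": 4.99},
    {"app_name": "Game B", "price": "Free To Play"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.json.gz")  # illustrative name

    # Write one JSON document per line.
    with gzip.open(sample_path, "wt", encoding="utf-8") as out_file:
        for record in sample_records:
            out_file.write(json.dumps(record) + "\n")

    # Read it back: "rt" opens the gzip stream in text mode.
    loaded_rows = []
    with gzip.open(sample_path, "rt", encoding="utf-8") as in_file:
        for line in in_file:
            loaded_rows.append(json.loads(line))

df_sample = pd.DataFrame(loaded_rows)
print(df_sample)
```

Reading line by line keeps each record independent, so a single malformed line can be located by its position in the file.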

In [3]:
df_games.head()

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,,,,,,,,,,,,,
1,,,,,,,,,,,,,
2,,,,,,,,,,,,,
3,,,,,,,,,,,,,
4,,,,,,,,,,,,,


### Checking and cleaning the DataFrame

In [4]:
# Check the dimensions of our DataFrame
df_games.shape

(120445, 13)

In [5]:
# Drop the rows that contain ONLY null values
df_games = df_games.dropna(how="all").reset_index(drop=True)
df_games.head()

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_...,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",http://steamcommunity.com/app/761140/reviews/?...,[Single-player],4.99,False,761140,Kotoshiro
1,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,http://store.steampowered.com/app/643980/Ironb...,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",http://steamcommunity.com/app/643980/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free To Play,False,643980,Secret Level SRL
2,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,Real Pool 3D - Poolians,http://store.steampowered.com/app/670290/Real_...,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",http://steamcommunity.com/app/670290/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free to Play,False,670290,Poolians.com
3,彼岸领域,"[Action, Adventure, Casual]",弹炸人2222,弹炸人2222,http://store.steampowered.com/app/767400/2222/,2017-12-07,"[Action, Adventure, Casual]",http://steamcommunity.com/app/767400/reviews/?...,[Single-player],0.99,False,767400,彼岸领域
4,,,Log Challenge,,http://store.steampowered.com/app/773570/Log_C...,,"[Action, Indie, Casual, Sports]",http://steamcommunity.com/app/773570/reviews/?...,"[Single-player, Full controller support, HTC V...",2.99,False,773570,


In [6]:
# Check the dimensions of our DataFrame again
df_games.shape

(32135, 13)

After dropping the rows that contained only null values, the DataFrame went from (120445 rows, 13 columns) to (32135 rows, 13 columns), so 88310 rows were removed. The discarded rows carried no information in any column.
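
To make the effect of `how="all"` concrete, here is a small sketch on a made-up frame (`df_demo` and its values are hypothetical). It contrasts `how="all"`, which drops only rows that are empty in every column, with the default behaviour, which drops any row containing at least one null.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is entirely empty, row 2 is only partly empty.
df_demo = pd.DataFrame(
    {
        "app_name": ["Game A", np.nan, "Game C"],
        "price": [4.99, np.nan, np.nan],
    }
)

# Drops only the fully empty row.
only_empty_removed = df_demo.dropna(how="all").reset_index(drop=True)

# Default (how="any") also drops the partly empty row.
any_null_removed = df_demo.dropna().reset_index(drop=True)

print(len(only_empty_removed), len(any_null_removed))
```

Using `how="all"` keeps partially filled rows such as "Log Challenge", which still carry usable fields.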

In [7]:
# Data-quality report for a DataFrame
def verificar_datos(df):
    """Return each column's dtype, non-null and null percentages, and null count."""
    verif = {"nombre_campo": [], "tipo_dato": [], "%_No_Nulos": [], "%_Nulos": [], "Nulos": []}

    for columna in df.columns:
        porcentaje_no_nulos = (df[columna].count() / len(df)) * 100
        verif["nombre_campo"].append(columna)
        verif["tipo_dato"].append(df[columna].dtypes)
        verif["%_No_Nulos"].append(round(porcentaje_no_nulos, 2))
        verif["%_Nulos"].append(round(100 - porcentaje_no_nulos, 2))
        verif["Nulos"].append(df[columna].isnull().sum())

    df_info = pd.DataFrame(verif)
        
    return df_info

In [8]:
verificar_datos(df_games)

Unnamed: 0,nombre_campo,tipo_dato,%_No_Nulos,%_Nulos,Nulos
0,publisher,object,74.94,25.06,8052
1,genres,object,89.78,10.22,3283
2,app_name,object,99.99,0.01,2
3,title,object,93.62,6.38,2050
4,url,object,100.0,0.0,0
5,release_date,object,93.57,6.43,2067
6,tags,object,99.49,0.51,163
7,reviews_url,object,99.99,0.01,2
8,specs,object,97.92,2.08,670
9,price,object,95.71,4.29,1377
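
The same per-column profile can also be built without an explicit loop. The sketch below is an alternative, not the notebook's function: `perfil_nulos` and the toy frame are hypothetical names, and rounding may differ from `verificar_datos` in the last decimal.

```python
import numpy as np
import pandas as pd


def perfil_nulos(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, null count and null percentages for each column."""
    nulls = df.isnull().sum()
    pct_nulls = (df.isnull().mean() * 100).round(2)
    return pd.DataFrame(
        {
            "nombre_campo": list(df.columns),
            "tipo_dato": df.dtypes.values,
            "%_No_Nulos": (100 - pct_nulls).round(2).values,
            "%_Nulos": pct_nulls.values,
            "Nulos": nulls.values,
        }
    )


# Hypothetical toy data: column "a" has one missing value out of four.
df_toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": ["x", "y", "z", "w"]})
report = perfil_nulos(df_toy)
print(report)
```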


### Dropping columns
The columns **'title'**, **'url'**, **'early_access'**, **'reviews_url'** and **'specs'** are removed from the DataFrame df_games. 'title' is dropped because 'app_name' holds similar information with fewer nulls (2 versus 2050), and the other columns are not needed for the requested endpoints. Removing them simplifies the dataset so that we work only with the relevant columns.

In [9]:
df_games.drop(['title', 'url', 'early_access', 'reviews_url','specs'], axis=1, inplace=True)

In [10]:
df_games.head()

Unnamed: 0,publisher,genres,app_name,release_date,tags,price,id,developer
0,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",4.99,761140,Kotoshiro
1,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",Free To Play,643980,Secret Level SRL
2,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",Free to Play,670290,Poolians.com
3,彼岸领域,"[Action, Adventure, Casual]",弹炸人2222,2017-12-07,"[Action, Adventure, Casual]",0.99,767400,彼岸领域
4,,,Log Challenge,,"[Action, Indie, Casual, Sports]",2.99,773570,


### Adjustments needed to save the DataFrame in Parquet and CSV formats

The **'price'** column of **df_games** mixes numeric values (such as 4.99) with text labels (such as 'Free To Play'), so it was cast to `str` to give it a single type. The DataFrame was then saved in Parquet and CSV formats.

In [11]:
# Cast the 'price' column to string
df_games['price'] = df_games['price'].astype(str)
df_games.to_parquet('datasets/data_transf/t_df_games.parquet', index=False)
df_games.to_csv('datasets/data_transf/t_df_games.csv', index=False)
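
One caveat worth checking: depending on the pandas version, `astype(str)` may turn missing prices into the literal text `'nan'`, which would hide the nulls reported earlier for this column. The sketch below, on a made-up series, shows one way to cast only the non-null entries; it is an optional variation, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical prices: a number, a text label and a missing value.
price = pd.Series([4.99, "Free To Play", np.nan], dtype=object)

# Plain cast, as in the cell above.
as_text = price.astype(str)

# Variation: keep missing values missing and cast everything else to text.
text_keep_na = price.where(price.isna(), price.astype(str))

print(as_text.tolist())
print(text_keep_na.tolist())
```

With the second form, null counts stay the same after the column is converted.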

## Transforming **user_reviews**

#### Loading Steam review data from a gzip-compressed JSON file

In [12]:
# Build a DataFrame from a gzip-compressed JSON file
row = []

with gzip.open("datasets/user_reviews.json.gz", 'rt', encoding='utf-8') as archivo:
    for linea in archivo:
        # ast.literal_eval() parses each line, which is a text string, into a Python object
        row.append(ast.literal_eval(linea))

# Create the DataFrame
df_reviews = pd.DataFrame(row)
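
This cell uses `ast.literal_eval` instead of `json.loads`, presumably because the lines in this file are written as Python literals (single quotes, `True`/`False`) rather than strict JSON. The sketch below uses a made-up line in that style to show the difference; the sample content is an assumption, not copied from the file.

```python
import ast
import json

# Hypothetical line: single-quoted keys and Python's True are not valid JSON.
linea = "{'user_id': 'js41637', 'reviews': [{'item_id': '251610', 'recommend': True}]}"

try:
    json.loads(linea)
    json_ok = True
except json.JSONDecodeError:
    # Strict JSON requires double quotes and lowercase true/false.
    json_ok = False

# literal_eval safely evaluates Python literals without running arbitrary code.
registro = ast.literal_eval(linea)
print(json_ok, registro["reviews"][0]["recommend"])
```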

In [13]:
df_reviews.head()

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."


In [14]:
# Flatten the nested JSON data into a DataFrame
df_reviews = pd.json_normalize(row, record_path='reviews', meta=['user_id','user_url'])
df_reviews.head()

Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review,user_id,user_url
0,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...,76561197970982479,http://steamcommunity.com/profiles/76561197970...
1,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.,76561197970982479,http://steamcommunity.com/profiles/76561197970...
2,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...,76561197970982479,http://steamcommunity.com/profiles/76561197970...
3,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,js41637,http://steamcommunity.com/id/js41637
4,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...,js41637,http://steamcommunity.com/id/js41637
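
To illustrate what `pd.json_normalize` does with `record_path` and `meta`, here is a sketch on two made-up users (all values are hypothetical). Each element of the nested `reviews` list becomes one row, and the `meta` fields are repeated on every row that came from the same user.

```python
import pandas as pd

# Hypothetical nested records with the same layout as the reviews data.
rows = [
    {
        "user_id": "user_a",
        "user_url": "http://example.com/user_a",
        "reviews": [
            {"item_id": "10", "recommend": True},
            {"item_id": "20", "recommend": False},
        ],
    },
    {
        "user_id": "user_b",
        "user_url": "http://example.com/user_b",
        "reviews": [{"item_id": "30", "recommend": True}],
    },
]

# One output row per review; user fields are carried along as metadata.
flat = pd.json_normalize(rows, record_path="reviews", meta=["user_id", "user_url"])
print(flat)
```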


#### Checking and cleaning the DataFrame

In [15]:
# Check the dimensions of our DataFrame
df_reviews.shape

(59305, 9)

In [16]:
# Drop the columns that are not needed for the following steps
df_reviews.drop(['funny','last_edited','helpful','user_url'], axis=1, inplace=True)

In [17]:
# Drop fully null rows and duplicate rows
df_reviews = df_reviews.dropna(how='all')
df_reviews = df_reviews.drop_duplicates()
df_reviews.head()

Unnamed: 0,posted,item_id,recommend,review,user_id
0,"Posted November 5, 2011.",1250,True,Simple yet with great replayability. In my opi...,76561197970982479
1,"Posted July 15, 2011.",22200,True,It's unique and worth a playthrough.,76561197970982479
2,"Posted April 21, 2011.",43110,True,Great atmosphere. The gunplay can be a bit chu...,76561197970982479
3,"Posted June 24, 2014.",251610,True,I know what you think when you see this title ...,js41637
4,"Posted September 8, 2013.",227300,True,For a simple (it's actually not all that simpl...,js41637


In [18]:
# Check the dimensions of our DataFrame again
df_reviews.shape

(58431, 5)

After dropping the unneeded columns and the rows that were entirely null or duplicated, the DataFrame went from (59305 rows, 9 columns) to (58431 rows, 5 columns), a reduction of 874 rows.
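
If it is useful to know how many of the removed rows were duplicates, they can be counted before dropping. A minimal sketch on made-up data follows; `df_dup` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical reviews: the second row repeats the first one exactly.
df_dup = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u2", "u2"],
        "item_id": ["10", "10", "10", "20"],
        "recommend": [True, True, True, False],
    }
)

# duplicated() marks every repeat of an earlier identical row.
n_duplicates = int(df_dup.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
df_unique = df_dup.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(df_unique))
```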

In [19]:
verificar_datos(df_reviews)

Unnamed: 0,nombre_campo,tipo_dato,%_No_Nulos,%_Nulos,Nulos
0,posted,object,100.0,0.0,0
1,item_id,object,100.0,0.0,0
2,recommend,bool,100.0,0.0,0
3,review,object,100.0,0.0,0
4,user_id,object,100.0,0.0,0


#### Saving the transformed DataFrame as Parquet and CSV files

In [20]:
df_reviews.to_parquet('datasets/data_transf/t_user_reviews.parquet', index=False)
df_reviews.to_csv('datasets/data_transf/t_user_reviews.csv', index=False)

## Transforming **users_items**

In [21]:
# Build a DataFrame from a gzip-compressed JSON file
row = []

with gzip.open("datasets/users_items.json.gz", 'rt', encoding='utf-8') as archivo:
    for linea in archivo:
        # ast.literal_eval() parses each line, which is a text string, into a Python object
        row.append(ast.literal_eval(linea))

# Create the DataFrame
df_items = pd.DataFrame(row)

In [22]:
df_items.head()

Unnamed: 0,user_id,items_count,steam_id,user_url,items
0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
1,js41637,888,76561198035864385,http://steamcommunity.com/id/js41637,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
2,evcentric,137,76561198007712555,http://steamcommunity.com/id/evcentric,"[{'item_id': '1200', 'item_name': 'Red Orchest..."
3,Riot-Punch,328,76561197963445855,http://steamcommunity.com/id/Riot-Punch,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
4,doctr,541,76561198002099482,http://steamcommunity.com/id/doctr,"[{'item_id': '300', 'item_name': 'Day of Defea..."


In [23]:
# Flatten the nested JSON data into a DataFrame
df_items = pd.json_normalize(row, record_path='items', meta=['user_id','items_count','steam_id','user_url'])
df_items.head()

Unnamed: 0,item_id,item_name,playtime_forever,playtime_2weeks,user_id,items_count,steam_id,user_url
0,10,Counter-Strike,6,0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...
1,20,Team Fortress Classic,0,0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...
2,30,Day of Defeat,7,0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...
3,40,Deathmatch Classic,0,0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...
4,50,Half-Life: Opposing Force,0,0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...


#### Checking and cleaning the DataFrame

In [24]:
# Check the dimensions of our DataFrame
df_items.shape

(5153209, 8)

In [25]:
# Drop the columns that are not needed for the following steps
df_items.drop(['user_url', 'item_name', 'steam_id', 'playtime_2weeks'], axis=1, inplace=True)

In [26]:
# Drop fully null rows and duplicate rows
df_items = df_items.dropna(how='all')
df_items = df_items.drop_duplicates()
df_items.head()

Unnamed: 0,item_id,playtime_forever,user_id,items_count
0,10,6,76561197970982479,277
1,20,0,76561197970982479,277
2,30,7,76561197970982479,277
3,40,0,76561197970982479,277
4,50,0,76561197970982479,277


In [27]:
# Check the dimensions of our DataFrame again
df_items.shape

(5094092, 4)

After dropping the unneeded columns and the rows that were entirely null or duplicated, the DataFrame went from (5153209 rows, 8 columns) to (5094092 rows, 4 columns), a reduction of 59117 rows.

In [28]:
verificar_datos(df_items)

Unnamed: 0,nombre_campo,tipo_dato,%_No_Nulos,%_Nulos,Nulos
0,item_id,object,100.0,0.0,0
1,playtime_forever,int64,100.0,0.0,0
2,user_id,object,100.0,0.0,0
3,items_count,object,100.0,0.0,0
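
The report lists `items_count` as `object` even though it holds whole numbers, which is presumably a side effect of `json_normalize` carrying `meta` fields over as generic objects. If a numeric column is preferred downstream, it could be converted as in the sketch below; the data is made up and this step is not part of the original notebook.

```python
import pandas as pd

# Hypothetical record with the same layout as the users_items data.
rows = [
    {
        "user_id": "user_a",
        "items_count": 2,
        "items": [
            {"item_id": "10", "playtime_forever": 6},
            {"item_id": "20", "playtime_forever": 0},
        ],
    },
]

flat_items = pd.json_normalize(
    rows, record_path="items", meta=["user_id", "items_count"]
)

# Convert the metadata column to a numeric dtype.
flat_items["items_count"] = pd.to_numeric(flat_items["items_count"])
print(flat_items.dtypes)
```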


#### Saving the transformed DataFrame in Parquet and CSV formats

In [29]:
df_items.to_parquet('datasets/data_transf/t_users_items.parquet', index=False)
df_items.to_csv('datasets/data_transf/t_users_items.csv', index=False)