### ETL Steam Games ###


Data Loading and Preparation

The pandas, json, and ast libraries are imported below to load and prepare the data. The JSON file is read with the json library, while pandas and ast are used to transform the data into a format suitable for the later analysis.

In [65]:
import pandas as pd
import json as js
import ast as ast


The variable "ruta_archivo" is defined to hold the path to the JSON file with the Steam games data.

In [66]:
ruta_archivo="output_steam_games.json"

The data is ingested from the JSON file into a pandas DataFrame. The file is read line by line; each line holds one JSON object, which is deserialized into a Python dictionary and appended to a list. Finally, the list is converted into a DataFrame to make the analysis easier.

In [67]:
df_games = []
with open(ruta_archivo, 'rt', encoding='utf-8') as file:
    for line in file:  # iterate over every line of the file
        df_games.append(js.loads(line))  # deserialize the line and append the resulting dict to the list

df_games = pd.DataFrame(df_games)
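As an alternative to the manual loop, pandas can read newline-delimited JSON directly. The following is a minimal sketch, assuming every line is a valid JSON object; the two sample records and the names `muestra` and `df_muestra` are made up for illustration and do not come from the dataset.

```python
import io
import pandas as pd

# Two made-up records in JSON Lines form (one JSON object per line)
muestra = io.StringIO(
    '{"app_name": "Game A", "price": 4.99}\n'
    '{"app_name": "Game B", "price": "Free To Play"}\n'
)

# lines=True tells pandas to parse each line as a separate JSON object
df_muestra = pd.read_json(muestra, lines=True)
print(df_muestra.shape)
```

With a real file, the path would be passed in place of the in-memory buffer. Note that `read_json` may infer column types differently from the manual loop, so the resulting dtypes are worth checking.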

The head() function was used to preview the first records of the DataFrame.

In [68]:
df_games.head()

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,,,,,,,,,,,,,
1,,,,,,,,,,,,,
2,,,,,,,,,,,,,
3,,,,,,,,,,,,,
4,,,,,,,,,,,,,


As part of data preparation, the quality of the 'title' and 'app_name' columns was evaluated. The number of complete observations was determined and the records with missing values in 'app_name' were located. To avoid unintended changes to the original dataset, a working copy was created, and all subsequent cleaning operations were run on that copy.

In [69]:
df_games_limpio = df_games.copy() # working copy of the DataFrame, to be modified

Because the first 88,310 rows of the DataFrame consisted entirely of null values, they were excluded from the analysis.

In [70]:
df_games_limpio = df_games_limpio.drop(df_games_limpio.index[:88310])
df_games_limpio

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
88310,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_...,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",http://steamcommunity.com/app/761140/reviews/?...,[Single-player],4.99,False,761140,Kotoshiro
88311,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,http://store.steampowered.com/app/643980/Ironb...,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",http://steamcommunity.com/app/643980/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free To Play,False,643980,Secret Level SRL
88312,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,Real Pool 3D - Poolians,http://store.steampowered.com/app/670290/Real_...,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",http://steamcommunity.com/app/670290/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free to Play,False,670290,Poolians.com
88313,彼岸领域,"[Action, Adventure, Casual]",弹炸人2222,弹炸人2222,http://store.steampowered.com/app/767400/2222/,2017-12-07,"[Action, Adventure, Casual]",http://steamcommunity.com/app/767400/reviews/?...,[Single-player],0.99,False,767400,彼岸领域
88314,,,Log Challenge,,http://store.steampowered.com/app/773570/Log_C...,,"[Action, Indie, Casual, Sports]",http://steamcommunity.com/app/773570/reviews/?...,"[Single-player, Full controller support, HTC V...",2.99,False,773570,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
120440,Ghost_RUS Games,"[Casual, Indie, Simulation, Strategy]",Colony On Mars,Colony On Mars,http://store.steampowered.com/app/773640/Colon...,2018-01-04,"[Strategy, Indie, Casual, Simulation]",http://steamcommunity.com/app/773640/reviews/?...,"[Single-player, Steam Achievements]",1.99,False,773640,"Nikita ""Ghost_RUS"""
120441,Sacada,"[Casual, Indie, Strategy]",LOGistICAL: South Africa,LOGistICAL: South Africa,http://store.steampowered.com/app/733530/LOGis...,2018-01-04,"[Strategy, Indie, Casual]",http://steamcommunity.com/app/733530/reviews/?...,"[Single-player, Steam Achievements, Steam Clou...",4.99,False,733530,Sacada
120442,Laush Studio,"[Indie, Racing, Simulation]",Russian Roads,Russian Roads,http://store.steampowered.com/app/610660/Russi...,2018-01-04,"[Indie, Simulation, Racing]",http://steamcommunity.com/app/610660/reviews/?...,"[Single-player, Steam Achievements, Steam Trad...",1.99,False,610660,Laush Dmitriy Sergeevich
120443,SIXNAILS,"[Casual, Indie]",EXIT 2 - Directions,EXIT 2 - Directions,http://store.steampowered.com/app/658870/EXIT_...,2017-09-02,"[Indie, Casual, Puzzle, Singleplayer, Atmosphe...",http://steamcommunity.com/app/658870/reviews/?...,"[Single-player, Steam Achievements, Steam Cloud]",4.99,False,658870,"xropi,stev3ns"
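Dropping rows by position depends on knowing exactly how many empty rows there are. A possible alternative, shown here on a small made-up DataFrame (`df_demo` is a hypothetical name), is to drop every row in which all columns are null, which does not rely on a hard-coded count.

```python
import numpy as np
import pandas as pd

# Made-up data: the first two rows are entirely null
df_demo = pd.DataFrame(
    {
        "app_name": [np.nan, np.nan, "Game A", "Game B"],
        "price": [np.nan, np.nan, 4.99, np.nan],
    }
)

# how="all" drops a row only when every column in it is null;
# the last row survives because "app_name" is present
df_demo_limpio = df_demo.dropna(how="all")
print(len(df_demo_limpio))
```

Unlike slicing by position, this would also remove fully empty rows that appear later in the data, so it is equivalent to the positional drop only if the empty rows are confined to the start.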


In [71]:
array = df_games_limpio["app_name"] == df_games_limpio["title"]
print(array)

# Count the True and False values in the comparison result
print(array.value_counts().get(True, 0))
valores_dif = array.value_counts().get(False, 0)
print(valores_dif)

# Count the nulls in each column
nulos_title = df_games_limpio["title"].isnull()
cantidad_nulos_title = df_games_limpio["title"].isnull().sum()
print(f" There are {cantidad_nulos_title} nulls in the title column")

nulos_app_name = df_games_limpio["app_name"].isnull()
cantidad_nulos_app_name = df_games_limpio["app_name"].isnull().sum()
print(f" The app_name column has {cantidad_nulos_app_name} nulls ")

# Add up the nulls of both columns and subtract them from the number of differing values.
# Note: a row that is null in both columns is counted twice in this sum.
nulos = cantidad_nulos_app_name + cantidad_nulos_title
filas_sin_nulos = valores_dif - nulos
print(f"rows without nulls and with different values: {filas_sin_nulos}")

88310      True
88311      True
88312      True
88313      True
88314     False
          ...  
120440     True
120441     True
120442     True
120443     True
120444    False
Length: 32135, dtype: bool
29530
2605
 There are 2050 nulls in the title column
 The app_name column has 2 nulls 
rows without nulls and with different values: 553
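Subtracting the two null counts from the number of differing rows double-counts any row that is null in both columns. A sketch of a more direct count, using a boolean mask on made-up data (`df_demo`, `ambos_presentes`, `distintos`, and `filas_distintas` are hypothetical names), could look like this:

```python
import numpy as np
import pandas as pd

# Made-up data: one row differs with both values present,
# one row is null in both columns, one is null only in "title"
df_demo = pd.DataFrame(
    {
        "app_name": ["A", "B", np.nan, "D", "E"],
        "title": ["A", "X", np.nan, np.nan, "E"],
    }
)

# True only where both columns hold a value
ambos_presentes = df_demo["app_name"].notna() & df_demo["title"].notna()
# True where the values differ (this is also True whenever a null is involved)
distintos = df_demo["app_name"] != df_demo["title"]

# Rows with both values present and different from each other
filas_distintas = int((ambos_presentes & distintos).sum())
print(filas_distintas)
```

Because the mask requires both values to be present, no subtraction of null counts is needed.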


The 'genres', 'tags', 'price', and 'developer' columns then went through a transformation step that cleaned anomalous data and filled in missing values. First, the rows with a missing 'app_name' are listed.

In [72]:
filas_nulas = df_games_limpio[df_games_limpio['app_name'].isnull()]

# Iterate to find out which rows hold the null values
for i, fila in filas_nulas.iterrows():
  print(f"Index: {i}")
  print(f"row: {fila}")

Index: 88384
row: publisher                                  NaN
genres                                     NaN
app_name                                   NaN
title                                      NaN
url             http://store.steampowered.com/
release_date                               NaN
tags                                       NaN
reviews_url                                NaN
specs                                      NaN
price                                    19.99
early_access                             False
id                                         NaN
developer                                  NaN
Name: 88384, dtype: object
Index: 90890
row: publisher                                                     NaN
genres                                            [Action, Indie]
app_name                                                      NaN
title                                                         NaN
url                   http://store.steampowered.com/app/31

There is considerable overlap between the information in the 'tags' and 'genres' columns, so each column is used to fill the gaps in the other.

In [73]:
print(df_games_limpio[['genres', 'tags']].dtypes)

genres    object
tags      object
dtype: object
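One way to quantify the overlap between the two columns is to check, row by row, whether every genre also appears among the tags. The sketch below uses made-up rows and a hypothetical helper `generos_en_tags`; it assumes both cells hold lists, so rows with missing values would need to be filtered out first.

```python
import pandas as pd

# Made-up rows; in the real data both columns hold lists of strings
df_demo = pd.DataFrame(
    {
        "genres": [["Action", "Indie"], ["RPG"], ["Casual"]],
        "tags": [["Indie", "Action", "Puzzle"], ["Strategy"], ["Casual"]],
    }
)

def generos_en_tags(fila):
    # True when every genre of the row is also listed in its tags
    return set(fila["genres"]).issubset(fila["tags"])

contenidos = df_demo.apply(generos_en_tags, axis=1)

# Share of rows whose genres are fully contained in the tags
print(contenidos.mean())
```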


In [74]:
df_games_limpio['genres'] = df_games_limpio['genres'].astype(object)
df_games_limpio['tags'] = df_games_limpio['tags'].astype(object)

In [75]:
df_games_limpio['genres'] = df_games_limpio['genres'].fillna(df_games_limpio['tags'])
df_games_limpio['tags'] = df_games_limpio['tags'].fillna(df_games_limpio['genres'])


The analysis of the DataFrame structure determined that the 'price' column will store both the price of each game and an indication of whether it is free.
The 'genres' column will hold the main genre of each game, and the 'tags' column, which is not relevant to the analysis, will be dropped from the DataFrame.

In [77]:
# def si_free_to_play(tags):
#     if tags is None:
#         return None
#     try:
#         tags_list = ast.literal_eval(tags)
#         if 'Free to Play' in tags_list:
#             return 'Free to Play'
#         else:
#             return None
#     except (SyntaxError, ValueError):
#         return None

In [78]:
# def tags_limpieza(tags):
#     if pd.isna(tags):  # Handle missing values (NaN)
#         return None
#     try:
#         return ast.literal_eval(tags)
#     except (SyntaxError, ValueError):
#         return None

# df_games_limpio['tags'] = df_games_limpio['tags'].apply(tags_limpieza)

In [76]:
# Convert the tags values to strings
df_games_limpio['tags'] = df_games_limpio['tags'].astype(str)

# Keep only "Free to Play" when it is present; return None otherwise
def si_free_to_play(tags):
    try:
        tags_list = ast.literal_eval(tags)
        if 'Free to Play' in tags_list:
            return 'Free to Play'
        else:
            return None
    except (SyntaxError, ValueError):
        return None

# Apply to tags
df_games_limpio['tags'] = df_games_limpio['tags'].apply(si_free_to_play)
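Converting the lists to strings and parsing them back with ast.literal_eval works, but the same check can be done on the lists directly. A minimal sketch under the assumption that each cell is either a Python list or a missing value; `si_free_to_play_directo` and the sample Series are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up tags column: two lists and one missing value
tags = pd.Series([["Free to Play", "Indie"], ["Action"], np.nan], dtype=object)

def si_free_to_play_directo(valor):
    # Lists are inspected directly; anything else (for example NaN) yields None
    if isinstance(valor, list) and "Free to Play" in valor:
        return "Free to Play"
    return None

resultado = tags.apply(si_free_to_play_directo)
print(resultado)
```

This avoids the string round trip, so it does not depend on the string form of each list being parseable.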

In [79]:
df_games_limpio['price'] = df_games_limpio['price'].fillna(df_games_limpio['tags'])


In [80]:
# replace "Free to Play" (in any casing) with "Free"
df_games_limpio['price'] = df_games_limpio['price'].replace('(?i)Free to Play', 'Free', regex=True)

The 'Free to Play' category is removed from the 'genres' column and the remaining values are joined into a single string.

In [81]:
df_games_limpio['genres'] = df_games_limpio['genres'].astype(str)

def eliminar(generos):
    try:
        lista_generos = ast.literal_eval(generos)  # convert the string back to a list
        if lista_generos is None:
            return None
        if "Free to Play" in lista_generos:  # if "Free to Play" is present, remove it
            lista_generos.remove("Free to Play")
        return ', '.join(lista_generos)  # join into a string and return it
    except (SyntaxError, ValueError):
        return None

# Apply to genres
df_games_limpio['genres'] = df_games_limpio['genres'].apply(eliminar)

Only the first element of the string in the "genres" column is kept.

In [82]:
# split the string on commas and keep the first element
df_games_limpio['genres'] = df_games_limpio['genres'].str.split(',').str[0].str.strip()

print(df_games_limpio['genres'])

88310           Action
88311            Indie
88312           Casual
88313           Action
88314           Action
              ...     
120440          Casual
120441          Casual
120442           Indie
120443          Casual
120444    Early Access
Name: genres, Length: 32135, dtype: object
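The two steps above (removing "Free to Play" and keeping the first genre) could also be combined into a single pass over the original lists. This is only a sketch on made-up data; `genero_principal` is a hypothetical helper, and unlike the string-based version it returns None rather than an empty string when "Free to Play" is the only genre.

```python
import numpy as np
import pandas as pd

# Made-up genres column holding lists and one missing value
genres = pd.Series(
    [["Free to Play", "Indie", "RPG"], ["Action", "Casual"], ["Free to Play"], np.nan],
    dtype=object,
)

def genero_principal(valor):
    # Non-list values (for example NaN) have no main genre
    if not isinstance(valor, list):
        return None
    # Drop "Free to Play" and keep the first remaining genre, if any
    restantes = [g for g in valor if g != "Free to Play"]
    return restantes[0] if restantes else None

principal = genres.apply(genero_principal)
print(principal)
```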


Cleaning and transformation of the 'release_date' variable

In [83]:
# Normalize 'release_date' by keeping only the four-digit year;
# values without a four-digit year become NaN
df_games_limpio['release_date'] = df_games_limpio['release_date'].str.extract(r'(\d{4})')

print(df_games_limpio)

               publisher        genres                  app_name  \
88310          Kotoshiro        Action       Lost Summoner Kitty   
88311   Making Fun, Inc.         Indie                 Ironbound   
88312       Poolians.com        Casual   Real Pool 3D - Poolians   
88313               彼岸领域        Action                   弹炸人2222   
88314                NaN        Action             Log Challenge   
...                  ...           ...                       ...   
120440   Ghost_RUS Games        Casual            Colony On Mars   
120441            Sacada        Casual  LOGistICAL: South Africa   
120442      Laush Studio         Indie             Russian Roads   
120443          SIXNAILS        Casual       EXIT 2 - Directions   
120444               NaN  Early Access               Maze Run VR   

                           title  \
88310        Lost Summoner Kitty   
88311                  Ironbound   
88312    Real Pool 3D - Poolians   
88313                    弹炸人2222   
883
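To illustrate what the year extraction does, here is a small sketch on made-up date strings (the sample values and the names `fechas` and `anio` are assumptions, not taken from the dataset). It passes `expand=False` so that the result is a Series rather than a one-column DataFrame.

```python
import numpy as np
import pandas as pd

# Made-up release dates in different formats, plus a missing value
fechas = pd.Series(["2018-01-04", "Jun 2009", "Soon..", np.nan], dtype=object)

# Capture the first run of four digits; rows without one become NaN
anio = fechas.str.extract(r"(\d{4})", expand=False)
print(anio)
```

The extracted years are still strings; converting them to a numeric type would be a separate step.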

The 'price' column is then transformed: every string value (for example "Free") is replaced with the numeric value zero, while numeric values are kept unchanged.

In [84]:
# Lambda that returns 0 for strings and keeps the original value otherwise
replace_string_with_zero = lambda x: 0 if isinstance(x, str) else x

# Apply the lambda to 'price'
df_games_limpio['price'] = df_games_limpio['price'].apply(replace_string_with_zero)

print(df_games_limpio['price'])

88310     4.99
88311     0.00
88312     0.00
88313     0.99
88314     2.99
          ... 
120440    1.99
120441    4.99
120442    1.99
120443    4.99
120444    4.99
Name: price, Length: 32135, dtype: float64
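Replacing every string with zero also zeroes out any price that happens to be stored as text. A hedged variant, sketched on made-up values (`precio`, `es_texto`, and `precio_num` are hypothetical names), first tries to parse each value as a number and only sets non-numeric text to zero, leaving genuinely missing prices as NaN.

```python
import numpy as np
import pandas as pd

# Made-up prices: numbers, free-text labels, a numeric string, and a missing value
precio = pd.Series([4.99, "Free", "2.49", np.nan, 0.99], dtype=object)

# Remember which entries were text before converting
es_texto = precio.apply(lambda x: isinstance(x, str))

# Parse what can be parsed; everything else becomes NaN
precio_num = pd.to_numeric(precio, errors="coerce")

# Text that could not be parsed (for example "Free") is treated as price 0
precio_num = precio_num.mask(es_texto & precio_num.isna(), 0.0)
print(precio_num)
```

Whether a label such as "Free" should map to zero is a modeling decision; the sketch simply mirrors the choice made in the notebook.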


The columns that are not needed for the analysis are dropped from the DataFrame and the remaining ones are reordered.

In [85]:
columns_to_drop = ['publisher','url', 'reviews_url', 'specs', 'early_access', 'tags','title']

# Drop the listed columns
df_games_limpio = df_games_limpio.drop(columns=columns_to_drop, errors='ignore')

In [86]:
columnas_ordenadas = [
    'id',
    'app_name', 
    'developer',
    'genres',
    'price',
    'release_date',
]

# DataFrame columns in the new order
df_games_limpio = df_games_limpio[columnas_ordenadas]

df_games_limpio

Unnamed: 0,id,app_name,developer,genres,price,release_date
88310,761140,Lost Summoner Kitty,Kotoshiro,Action,4.99,2018
88311,643980,Ironbound,Secret Level SRL,Indie,0.00,2018
88312,670290,Real Pool 3D - Poolians,Poolians.com,Casual,0.00,2017
88313,767400,弹炸人2222,彼岸领域,Action,0.99,2017
88314,773570,Log Challenge,,Action,2.99,
...,...,...,...,...,...,...
120440,773640,Colony On Mars,"Nikita ""Ghost_RUS""",Casual,1.99,2018
120441,733530,LOGistICAL: South Africa,Sacada,Casual,4.99,2018
120442,610660,Russian Roads,Laush Dmitriy Sergeevich,Indie,1.99,2018
120443,658870,EXIT 2 - Directions,"xropi,stev3ns",Casual,4.99,2017


### ETL User Reviews ###


In [87]:
import pandas as pd
from pandas import json_normalize # for working with nested data
import json as js 
import ast as ast


The variable "ruta_reviews" is defined to hold the path to the JSON file with the user reviews.

In [88]:
ruta_reviews= "australian_user_reviews.json"

The contents of the 'australian_user_reviews.json' file were loaded into a list.

In [89]:
lista_reviews = []

# Open the file
with open(ruta_reviews, encoding='utf-8') as file:
    for line in file:  # iterate line by line
        # Parse each line with ast.literal_eval, which turns the text into a Python object,
        # and append the result to the list
        lista_reviews.append(ast.literal_eval(line))
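The use of ast.literal_eval instead of json.loads suggests that the lines in this file are Python-style literals (single-quoted strings, `True`/`False`) rather than strict JSON. That is an assumption about the file, not something the notebook states. The sketch below uses a made-up line in that style to show the difference between the two parsers.

```python
import ast
import json

# Made-up line written as a Python literal, not as strict JSON
linea = "{'user_id': 'js41637', 'reviews': [{'item_id': '251610', 'recommend': True}]}"

# json.loads rejects single-quoted keys and the Python literal True
try:
    json.loads(linea)
    es_json_valido = True
except json.JSONDecodeError:
    es_json_valido = False

# ast.literal_eval safely evaluates Python literals without running arbitrary code
registro = ast.literal_eval(linea)
print(es_json_valido, registro["user_id"])
```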

A DataFrame named 'df_reviews' was created from the 'lista_reviews' list. To preserve the original data, a copy of this DataFrame, named 'df_reviews_limpio', was created, and all necessary modifications are made on that copy.


In [90]:
df_reviews  = pd.DataFrame(lista_reviews)
df_reviews_limpio = df_reviews.copy()

The "user_url" column was dropped from the "df_reviews_limpio" DataFrame.

In [91]:
df_reviews_limpio = df_reviews_limpio.drop('user_url', axis=1)
df_reviews_limpio

Unnamed: 0,user_id,reviews
0,76561197970982479,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,"[{'funny': '3 people found this review funny',..."
...,...,...
25794,76561198306599751,"[{'funny': '', 'posted': 'Posted May 31.', 'la..."
25795,Ghoustik,"[{'funny': '', 'posted': 'Posted June 17.', 'l..."
25796,76561198310819422,"[{'funny': '1 person found this review funny',..."
25797,76561198312638244,"[{'funny': '', 'posted': 'Posted July 21.', 'l..."


At this stage of the analysis, the 'reviews' column of the 'df_reviews_limpio' DataFrame was unnested. Each element of the nested lists became its own row, producing a new DataFrame with a flatter structure that is easier to analyze.

In [92]:
# Unnest the 'reviews' column
df_desanido = df_reviews_limpio.explode('reviews')

df_desanido.reset_index(drop=True, inplace=True)


In [93]:
df_desanido.head(25)

Unnamed: 0,user_id,reviews
0,76561197970982479,"{'funny': '', 'posted': 'Posted November 5, 20..."
1,76561197970982479,"{'funny': '', 'posted': 'Posted July 15, 2011...."
2,76561197970982479,"{'funny': '', 'posted': 'Posted April 21, 2011..."
3,js41637,"{'funny': '', 'posted': 'Posted June 24, 2014...."
4,js41637,"{'funny': '', 'posted': 'Posted September 8, 2..."
5,js41637,"{'funny': '', 'posted': 'Posted November 29, 2..."
6,evcentric,"{'funny': '', 'posted': 'Posted February 3.', ..."
7,evcentric,"{'funny': '', 'posted': 'Posted December 4, 20..."
8,evcentric,"{'funny': '', 'posted': 'Posted November 3, 20..."
9,evcentric,"{'funny': '', 'posted': 'Posted October 15, 20..."


The dictionaries in the 'reviews' column of the 'df_desanido' DataFrame were flattened into a tabular format.

In [94]:
df_reviews_normalizado = json_normalize(df_desanido['reviews'])

In [95]:
df_reviews_normalizado.head(25)

Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review
0,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.
2,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
3,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
4,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...
5,,"Posted November 29, 2013.",,239030,1 of 4 people (25%) found this review helpful,True,Very fun little game to play when your bored o...
6,,Posted February 3.,,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...
7,,"Posted December 4, 2015.","Last edited December 5, 2015.",370360,No ratings yet,True,"""Run for fun? What the hell kind of fun is that?"""
8,,"Posted November 3, 2014.",,237930,No ratings yet,True,"Elegant integration of gameplay, story, world ..."
9,,"Posted October 15, 2014.",,263360,No ratings yet,True,"Random drops and random quests, with stat poin..."


The 'df_desanido' and 'df_reviews_normalizado' DataFrames were combined column-wise to create the final DataFrame.

In [96]:
df_final = pd.concat([df_desanido.drop(columns=['reviews']), df_reviews_normalizado], axis=1)

In [97]:
df_final.head(25)

Unnamed: 0,user_id,funny,posted,last_edited,item_id,helpful,recommend,review
0,76561197970982479,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,76561197970982479,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.
2,76561197970982479,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
3,js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
4,js41637,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...
5,js41637,,"Posted November 29, 2013.",,239030,1 of 4 people (25%) found this review helpful,True,Very fun little game to play when your bored o...
6,evcentric,,Posted February 3.,,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...
7,evcentric,,"Posted December 4, 2015.","Last edited December 5, 2015.",370360,No ratings yet,True,"""Run for fun? What the hell kind of fun is that?"""
8,evcentric,,"Posted November 3, 2014.",,237930,No ratings yet,True,"Elegant integration of gameplay, story, world ..."
9,evcentric,,"Posted October 15, 2014.",,263360,No ratings yet,True,"Random drops and random quests, with stat poin..."
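The unnest, normalize, and concatenate steps can be seen end to end on a small example. This is a sketch on made-up data with hypothetical names (`df_demo`, `desanidado`, `normalizado`, `final`); it also shows why the index is reset after explode: concat with axis=1 aligns rows by index, and explode repeats the original index for every expanded element.

```python
import pandas as pd

# Made-up nested data: one list of review dicts per user
df_demo = pd.DataFrame(
    {
        "user_id": ["u1", "u2"],
        "reviews": [
            [{"item_id": "10", "recommend": True}, {"item_id": "20", "recommend": False}],
            [{"item_id": "30", "recommend": True}],
        ],
    }
)

# One row per review; reset the index so rows are numbered 0..n-1
desanidado = df_demo.explode("reviews").reset_index(drop=True)

# Turn each review dict into columns
normalizado = pd.json_normalize(desanidado["reviews"].tolist())

# Align the two frames row by row and join them side by side
final = pd.concat([desanidado.drop(columns=["reviews"]), normalizado], axis=1)
print(final)
```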


The 'df_final' DataFrame was transformed by dropping the columns listed in 'colum_a_eliminar'. The resulting DataFrame, 'df_reviews_limpio', served as the basis for the subsequent analyses.

In [98]:
colum_a_eliminar = ['funny', 'last_edited', 'helpful']

# Drop the listed columns (columns= is required; a bare list is treated as row labels)
df_reviews_limpio = df_final.drop(columns=colum_a_eliminar, errors='ignore')

df_reviews_limpio

Unnamed: 0,user_id,posted,item_id,recommend,review
0,76561197970982479,"Posted November 5, 2011.",1250,True,Simple yet with great replayability. In my opi...
1,76561197970982479,"Posted July 15, 2011.",22200,True,It's unique and worth a playthrough.
2,76561197970982479,"Posted April 21, 2011.",43110,True,Great atmosphere. The gunplay can be a bit chu...
3,js41637,"Posted June 24, 2014.",251610,True,I know what you think when you see this title ...
4,js41637,"Posted September 8, 2013.",227300,True,For a simple (it's actually not all that simpl...
...,...,...,...,...,...
59328,76561198312638244,Posted July 10.,70,True,a must have classic from steam definitely wort...
59329,76561198312638244,Posted July 8.,362890,True,this game is a perfect remake of the original ...
59330,LydiaMorley,Posted July 3.,273110,True,had so much fun plaing this and collecting res...
59331,LydiaMorley,Posted July 20.,730,True,:D
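When dropping columns, note that DataFrame.drop interprets a bare list as row labels unless `columns=` (or `axis=1`) is given, and `errors='ignore'` hides the resulting mismatch. A minimal demonstration on a made-up one-row DataFrame (`df_demo`, `sin_efecto`, and `sin_columna` are hypothetical names):

```python
import pandas as pd

df_demo = pd.DataFrame({"user_id": ["u1"], "funny": [""], "review": ["ok"]})

# Without columns=, "funny" is looked up among the row labels; it is not
# found there, and errors="ignore" silently leaves the frame unchanged
sin_efecto = df_demo.drop(["funny"], errors="ignore")

# With columns=, the column is actually removed
sin_columna = df_demo.drop(columns=["funny"], errors="ignore")

print(list(sin_efecto.columns))
print(list(sin_columna.columns))
```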


### ETL User Items ###

In [99]:
import pandas as pd
from pandas import json_normalize # for working with nested JSON data
import json as js 
import ast as ast

The variable "ruta_items" is defined to hold the path to the JSON file with the user items.

In [100]:
ruta_items= "australian_users_items.json"

The 'australian_users_items.json' file was loaded into the list 'lista_items', which was then converted into a DataFrame.

In [101]:
lista_items = []
with open(ruta_items, encoding='utf-8') as file:
    for line in file:  # iterate line by line
        lista_items.append(ast.literal_eval(line))

df_items = pd.DataFrame(lista_items)  # create the DataFrame

df_items_limpio = df_items.copy()

In [102]:
df_items_limpio

Unnamed: 0,user_id,items_count,steam_id,user_url,items
0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
1,js41637,888,76561198035864385,http://steamcommunity.com/id/js41637,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
2,evcentric,137,76561198007712555,http://steamcommunity.com/id/evcentric,"[{'item_id': '1200', 'item_name': 'Red Orchest..."
3,Riot-Punch,328,76561197963445855,http://steamcommunity.com/id/Riot-Punch,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
4,doctr,541,76561198002099482,http://steamcommunity.com/id/doctr,"[{'item_id': '300', 'item_name': 'Day of Defea..."
...,...,...,...,...,...
88305,76561198323066619,22,76561198323066619,http://steamcommunity.com/profiles/76561198323...,"[{'item_id': '413850', 'item_name': 'CS:GO Pla..."
88306,76561198326700687,177,76561198326700687,http://steamcommunity.com/profiles/76561198326...,"[{'item_id': '11020', 'item_name': 'TrackMania..."
88307,XxLaughingJackClown77xX,0,76561198328759259,http://steamcommunity.com/id/XxLaughingJackClo...,[]
88308,76561198329548331,7,76561198329548331,http://steamcommunity.com/profiles/76561198329...,"[{'item_id': '304930', 'item_name': 'Unturned'..."


The explode function transforms the df_items_limpio DataFrame into df_items_desanido by breaking the nested lists in the 'items' column into individual rows. Each element of a list becomes a new row, which lengthens the DataFrame.

In [103]:
df_items_desanido = df_items_limpio.explode('items')
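Users with no items have an empty list in 'items', and explode turns an empty list into a single row holding NaN. Whether to keep those rows depends on the analysis; the sketch below, on made-up data with hypothetical names, shows how they could be filtered out.

```python
import pandas as pd

# Made-up data: the second user owns no items
df_demo = pd.DataFrame(
    {
        "user_id": ["u1", "u2"],
        "items": [[{"item_id": "10"}, {"item_id": "20"}], []],
    }
)

# The empty list becomes one row with NaN in "items"
desanidado = df_demo.explode("items")

# Keep only rows that actually hold an item, then renumber the rows
con_items = desanidado.dropna(subset=["items"]).reset_index(drop=True)
print(len(desanidado), len(con_items))
```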

The index is then reset, because explode repeats the original index for every row it creates; after the reset, the rows are numbered consecutively.

In [104]:
df_items_desanido.reset_index(drop=True, inplace=True)

The json_normalize function from pandas turns the dictionaries in the 'items' column into a tabular DataFrame.

In [105]:
df_items_normalizado = json_normalize(df_items_desanido['items'])

The pandas "concat" function is used to join the two DataFrames column-wise.

In [106]:
df_items_final = pd.concat([df_items_desanido.drop(columns=['items']), df_items_normalizado], axis=1)

To clean up the "df_items_final" DataFrame, a list, "col_eliminar", is defined with the columns 'steam_id', 'user_url', 'item_name', and 'playtime_2weeks', which are then dropped.

In [107]:
col_eliminar= ['steam_id','user_url','item_name','playtime_2weeks']

# Drop the listed columns
df_items_limpio = df_items_final.drop(columns=col_eliminar, errors='ignore')

Export to CSV

In [117]:
df_games_limpio.to_csv("output_steam_games.csv", index=False)

In [109]:
df_reviews_limpio.to_csv("australian_user_reviews.csv", index= False)

In [110]:
df_items_limpio.to_csv("australian_users_items.csv", index= False)
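CSV files do not store column types, so identifier columns that were strings (such as 'item_id' in the outputs above) may be read back as integers. A small round-trip sketch on made-up data, using an in-memory buffer instead of a file; the names `df_demo`, `leido_defecto`, and `leido_texto` are hypothetical.

```python
import io
import pandas as pd

# Made-up frame whose identifiers are stored as strings
df_demo = pd.DataFrame({"item_id": ["10", "761140"], "playtime_forever": [6, 0]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)

# Default read: pandas infers item_id as an integer column
buffer.seek(0)
leido_defecto = pd.read_csv(buffer)

# Explicit dtype: item_id stays a string
buffer.seek(0)
leido_texto = pd.read_csv(buffer, dtype={"item_id": str})

print(leido_defecto["item_id"].tolist(), leido_texto["item_id"].tolist())
```

Passing `dtype` when reading is one way to keep identifiers consistent across the three exported files.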

Export to Parquet

In [118]:
df_games_limpio.to_parquet('output_steam_games_limpio.parquet', index=False)

In [119]:
df_reviews_limpio.to_parquet("australian_user_reviews_limpio.parquet", index= False)

In [120]:
df_items_limpio.to_parquet("australian_users_items_limpio.parquet", index= False)