## ETL of the `user_reviews` dataset

This Jupyter notebook performs the extraction, transformation, and loading (ETL) of the `user_reviews.json.gz` dataset.

### Column descriptions for `user_reviews.json.gz`

| Column      | Description                                   | Example                                                                      |
|-------------|-----------------------------------------------|------------------------------------------------------------------------------|
| user_id     | Unique user identifier                        | [76561197970982479, evcentric, maplemage]                                   |
| user_url    | URL of the user's profile                     | [http://steamcommunity.com/id/evcentric]                                    |
| reviews     | The user's reviews, as JSON-like dictionaries | {'funny': '', 'posted': 'Posted September 8, 2013.', 'last_edited': '', 'item_id': '227300', 'helpful': '0 of 1 people (0%) found this review helpful', 'recommend': True, 'review': "For a simple (it's actually not all that simple but it can be!) truck driving Simulator, it is quite a fun and relaxing game. Playing on simple (or easy?) its just the basic WASD keys for driving but (if you want) the game can be much harder and realistic with having to manually change gears, much harder turning, etc. And reversing in this game is a ♥♥♥♥♥, as I imagine it would be with an actual truck. Luckily, you don't have to reverse park it but you get extra points if you do cause it is bloody hard. But this is surprisingly a nice truck driving game and I had a bit of fun with it."} |


In [1]:
import gzip  # Read Gzip-compressed files
import json  # Work with JSON data
import matplotlib.pyplot as plt  # Plots and visualizations
import numpy as np  # Efficient numerical operations and array manipulation
import pandas as pd  # Data analysis and manipulation with DataFrames
import pyarrow as pa  # Tools for the Apache Arrow columnar format
import pyarrow.parquet as pq  # Read and write Parquet files
import seaborn as sns  # Data visualization library built on matplotlib
import ast  # Parse Python literals and work with the abstract syntax tree

In [2]:
filas_review = []  # empty list that collects one parsed record per line of the file
with gzip.open("data/user_reviews.json.gz", 'rt', encoding='utf-8') as file:
    for line in file:  # iterate line by line instead of loading everything with readlines()
        # Parse each line as a Python literal (the records use single quotes and True/False)
        filas_review.append(ast.literal_eval(line))

# Convert the records into a DataFrame
reviews = pd.DataFrame(filas_review)
reviews

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."
...,...,...,...
25794,76561198306599751,http://steamcommunity.com/profiles/76561198306...,"[{'funny': '', 'posted': 'Posted May 31.', 'la..."
25795,Ghoustik,http://steamcommunity.com/id/Ghoustik,"[{'funny': '', 'posted': 'Posted June 17.', 'l..."
25796,76561198310819422,http://steamcommunity.com/profiles/76561198310...,"[{'funny': '1 person found this review funny',..."
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,"[{'funny': '', 'posted': 'Posted July 21.', 'l..."
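
The loading cell relies on `ast.literal_eval` instead of `json.loads`. The following is a minimal sketch of why that matters, assuming the file's lines look like the records shown above (single-quoted keys and the bare literal `True`). The sample `line` is hypothetical and is not taken from the file.

```python
import ast
import json

# Hypothetical sample line, shaped like the records displayed above
line = "{'user_id': 'js41637', 'reviews': [{'recommend': True, 'review': 'ok'}]}"

try:
    json.loads(line)
    parsed_as_json = True
except json.JSONDecodeError:
    # Single quotes and the bare literal True are not valid JSON
    parsed_as_json = False

# literal_eval only evaluates Python literals (dicts, lists, strings, booleans, ...)
record = ast.literal_eval(line)

print(parsed_as_json)
print(record['reviews'][0]['recommend'])
```

If the lines were strict JSON, `json.loads` would be the more natural choice.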


In [3]:
reviews.shape  # Check the size of the dataset

(25799, 3)

The `reviews` column holds a list of reviews per user, so these records need to be unnested. Before doing that, `user_id` is checked for duplicated rows.

In [4]:
duplicados_columnas = reviews[reviews.duplicated(subset=['user_id'], keep=False)]  # Check for rows with a duplicated user_id
duplicados_columnas

Unnamed: 0,user_id,user_url,reviews
9,76561198156664158,http://steamcommunity.com/profiles/76561198156...,"[{'funny': '', 'posted': 'Posted June 16.', 'l..."
50,Rivtex,http://steamcommunity.com/id/Rivtex,"[{'funny': '', 'posted': 'Posted December 23, ..."
83,76561198094224872,http://steamcommunity.com/profiles/76561198094...,[]
119,DieMadchenschanderin,http://steamcommunity.com/id/DieMadchenschanderin,"[{'funny': '', 'posted': 'Posted August 29, 20..."
147,relesprit,http://steamcommunity.com/id/relesprit,"[{'funny': '', 'posted': 'Posted December 27, ..."
...,...,...,...
17819,76561198076474887,http://steamcommunity.com/profiles/76561198076...,"[{'funny': '', 'posted': 'Posted April 12.', '..."
17916,yolofaceguy,http://steamcommunity.com/id/yolofaceguy,"[{'funny': '', 'posted': 'Posted October 31, 2..."
18028,76561198075591109,http://steamcommunity.com/profiles/76561198075...,"[{'funny': '', 'posted': 'Posted December 26, ..."
18234,76561198092022514,http://steamcommunity.com/profiles/76561198092...,"[{'funny': '', 'posted': 'Posted July 3.', 'la..."
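
The `keep=False` argument is what makes every copy of a repeated `user_id` appear in the result. A small sketch with made-up data (the `toy` frame is an assumption, not part of the dataset) contrasts it with the default behaviour:

```python
import pandas as pd

# Toy frame: user 'a' appears twice
toy = pd.DataFrame({
    'user_id': ['a', 'b', 'a', 'c'],
    'reviews': [[{'item_id': '1'}], [], [{'item_id': '1'}], []],
})

# keep=False flags every row that belongs to a duplicated group
all_copies = toy[toy.duplicated(subset=['user_id'], keep=False)]
# the default (keep='first') flags only the later repetitions
later_copies = toy[toy.duplicated(subset=['user_id'])]

print(all_copies.index.tolist())
print(later_copies.index.tolist())
```

Restricting the check to `user_id` with `subset` also keeps the list-valued `reviews` column out of the comparison.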


In [6]:
# Spot-check three of the duplicated users
for user_id in ['05041129', 'SuchGayMuchWow', '76561198076474887']:
    user_reviews = duplicados_columnas[duplicados_columnas['user_id'] == user_id]['reviews']

    for review_list in user_reviews:
        for review in review_list:
            print(review['review'])
        print('-' * 40)

This game to me it is so good that it is better than any of the games out their and $15 worth it
this is the best third person game ever that i have played
this will be the  number one game if it have more competitive things
----------------------------------------
This game to me it is so good that it is better than any of the games out their and $15 worth it
this is the best third person game ever that i have played
this will be the  number one game if it have more competitive things
----------------------------------------
10/10 too many children with mikes.
----------------------------------------
10/10 too many children with mikes.
----------------------------------------
Excelente e cada vez mais bonito.Comprei o jogo pouco tempo depois de seu lançamento, e cheguei as minhas primeiras 24 horas jogadas em um piscar de olhos. Com jogabilidade simples, bela musica e arte o jogo me conquistou quase que de imediato.Devo admitir que após algumas horas de jogo, me senti frustrado com al
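
Reading the printed reviews suggests that the repeated rows carry the same content. The same check can be done programmatically rather than by eye. This is a sketch on made-up data, and the helper `copies_identical` is hypothetical:

```python
import pandas as pd

# Toy frame: user 'a' has two identical copies, user 'b' has two different ones
toy = pd.DataFrame({
    'user_id': ['a', 'a', 'b', 'b'],
    'reviews': [[{'review': 'x'}], [{'review': 'x'}], [{'review': 'y'}], []],
})

def copies_identical(group: pd.Series) -> bool:
    """Return True when every review list in the group equals the first one."""
    first = group.iloc[0]
    return all(item == first for item in group)

same = toy.groupby('user_id')['reviews'].apply(copies_identical)
print(same.to_dict())
```

Users for which the result is `False` would deserve a closer look before one of their rows is dropped.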

In [7]:
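# Keep only the first row for each user_id
reviews = reviews.drop_duplicates(subset='user_id', keep='first')
# Confirm that no duplicated user_id remains (the result should be empty)
duplicados_columnas = reviews[reviews.duplicated(subset=['user_id'], keep=False)]
duplicados_columnas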

Unnamed: 0,user_id,user_url,reviews


In [8]:
reviews.shape

(25485, 3)

Inspect the `reviews` column.

In [9]:
reviews['reviews'][0]

[{'funny': '',
  'posted': 'Posted November 5, 2011.',
  'last_edited': '',
  'item_id': '1250',
  'helpful': 'No ratings yet',
  'recommend': True,
  'review': 'Simple yet with great replayability. In my opinion does "zombie" hordes and team work better than left 4 dead plus has a global leveling system. Alot of down to earth "zombie" splattering fun for the whole family. Amazed this sort of FPS is so rare.'},
 {'funny': '',
  'posted': 'Posted July 15, 2011.',
  'last_edited': '',
  'item_id': '22200',
  'helpful': 'No ratings yet',
  'recommend': True,
  'review': "It's unique and worth a playthrough."},
 {'funny': '',
  'posted': 'Posted April 21, 2011.',
  'last_edited': '',
  'item_id': '43110',
  'helpful': 'No ratings yet',
  'recommend': True,
  'review': 'Great atmosphere. The gunplay can be a bit chunky at times but at the end of the day this game is definitely worth it and I hope they do a sequel...so buy the game so I get a sequel!'}]

| Field         | Description                                                                      |
|---------------|----------------------------------------------------------------------------------|
| user_id       | Unique identifier of the user.                                                   |
| user_url      | URL of the user's profile on steamcommunity.                                     |
| reviews       | List of dictionaries with information about the user's reviews.                  |
| - funny       | Text stating how many people marked the review as funny (empty if none).         |
| - posted      | Date the review was posted, in the form "Posted April 21, 2011.".                |
| - last_edited | Date the review was last edited.                                                 |
| - item_id     | Unique identifier of the item (game).                                            |
| - helpful     | Text summarizing how many other users found the review helpful.                  |
| - recommend   | Boolean indicating whether the user recommends the game.                         |
| - review      | String with the user's comments about the game.                                  |
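
The `posted` field is free text, not a date type. The following is a hedged sketch of one way to turn it into a date, assuming the two shapes visible in the outputs above (with a year, such as "Posted November 5, 2011.", and without one, such as "Posted February 3."). It also assumes an English locale for month names.

```python
import pandas as pd

# Sample values copied from the shapes visible in the notebook outputs
posted = pd.Series(['Posted November 5, 2011.', 'Posted February 3.'])

# Strip the fixed prefix and the trailing period
cleaned = posted.str.replace('Posted ', '', regex=False).str.rstrip('.')

# Values that do not match the format (here, the one without a year) become NaT
dates = pd.to_datetime(cleaned, format='%B %d, %Y', errors='coerce')
print(dates)
```

How to treat the entries without a year is a separate decision that this sketch leaves open.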


In [10]:
review_norm = pd.json_normalize(reviews['reviews'].dropna())  # Use json_normalize to spread each user's list of reviews across columns
review_norm.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,"{'funny': '', 'posted': 'Posted November 5, 20...","{'funny': '', 'posted': 'Posted July 15, 2011....","{'funny': '', 'posted': 'Posted April 21, 2011...",,,,,,,
1,"{'funny': '', 'posted': 'Posted June 24, 2014....","{'funny': '', 'posted': 'Posted September 8, 2...","{'funny': '', 'posted': 'Posted November 29, 2...",,,,,,,
2,"{'funny': '', 'posted': 'Posted February 3.', ...","{'funny': '', 'posted': 'Posted December 4, 20...","{'funny': '', 'posted': 'Posted November 3, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...",,,,
3,"{'funny': '', 'posted': 'Posted October 14, 20...","{'funny': '', 'posted': 'Posted July 28, 2012....","{'funny': '', 'posted': 'Posted June 2, 2012.'...","{'funny': '', 'posted': 'Posted June 29, 2014....","{'funny': '', 'posted': 'Posted November 22, 2...","{'funny': '', 'posted': 'Posted February 23, 2...",,,,
4,"{'funny': '3 people found this review funny', ...","{'funny': '1 person found this review funny', ...","{'funny': '2 people found this review funny', ...","{'funny': '', 'posted': 'Posted July 11, 2013....",,,,,,


The normalized frame no longer carries `user_id` or `user_url`.

In [11]:
# Add 'user_id' and 'user_url' back next to the spread-out review columns
# (pd.concat with axis=1 matches rows by index label)
review_norm = pd.concat([reviews[['user_id', 'user_url']], review_norm], axis=1)
review_norm.head()

Unnamed: 0,user_id,user_url,0,1,2,3,4,5,6,7,8,9
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted November 5, 20...","{'funny': '', 'posted': 'Posted July 15, 2011....","{'funny': '', 'posted': 'Posted April 21, 2011...",,,,,,,
1,js41637,http://steamcommunity.com/id/js41637,"{'funny': '', 'posted': 'Posted June 24, 2014....","{'funny': '', 'posted': 'Posted September 8, 2...","{'funny': '', 'posted': 'Posted November 29, 2...",,,,,,,
2,evcentric,http://steamcommunity.com/id/evcentric,"{'funny': '', 'posted': 'Posted February 3.', ...","{'funny': '', 'posted': 'Posted December 4, 20...","{'funny': '', 'posted': 'Posted November 3, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...",,,,
3,doctr,http://steamcommunity.com/id/doctr,"{'funny': '', 'posted': 'Posted October 14, 20...","{'funny': '', 'posted': 'Posted July 28, 2012....","{'funny': '', 'posted': 'Posted June 2, 2012.'...","{'funny': '', 'posted': 'Posted June 29, 2014....","{'funny': '', 'posted': 'Posted November 22, 2...","{'funny': '', 'posted': 'Posted February 23, 2...",,,,
4,maplemage,http://steamcommunity.com/id/maplemage,"{'funny': '3 people found this review funny', ...","{'funny': '1 person found this review funny', ...","{'funny': '2 people found this review funny', ...","{'funny': '', 'posted': 'Posted July 11, 2013....",,,,,,
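
One thing worth checking at this step: `pd.concat(..., axis=1)` pairs rows by index label, not by position. After `drop_duplicates`, the labels of `reviews` may have gaps, while `json_normalize` returns a fresh 0-based index. The sketch below uses made-up frames (`left`, `right`) to show the effect and one possible remedy, `reset_index(drop=True)`:

```python
import pandas as pd

# 'left' mimics a frame whose index has a gap (label 1 was dropped)
left = pd.DataFrame({'user_id': ['a', 'b', 'c']}, index=[0, 2, 3])
# 'right' mimics a freshly built frame with labels 0, 1, 2
right = pd.DataFrame({'first_review': ['r_a', 'r_b', 'r_c']})

# Labels are matched, so user 'b' (label 2) is paired with 'r_c'
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first pairs the rows by position
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(len(misaligned), len(aligned))
print(aligned)
```

Comparing the row count before and after the concatenation is a quick way to notice this kind of mismatch.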


In [12]:
# Use pd.melt to turn the review columns into rows while keeping 'user_id' and 'user_url'
review_norm = pd.melt(review_norm, id_vars=['user_id', 'user_url'],
                       value_vars=list(range(9)),  # columns 0 through 8; column 9 is not included
                       value_name='reviews')
review_norm.head()

Unnamed: 0,user_id,user_url,variable,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,0,"{'funny': '', 'posted': 'Posted November 5, 20..."
1,js41637,http://steamcommunity.com/id/js41637,0,"{'funny': '', 'posted': 'Posted June 24, 2014...."
2,evcentric,http://steamcommunity.com/id/evcentric,0,"{'funny': '', 'posted': 'Posted February 3.', ..."
3,doctr,http://steamcommunity.com/id/doctr,0,"{'funny': '', 'posted': 'Posted October 14, 20..."
4,maplemage,http://steamcommunity.com/id/maplemage,0,"{'funny': '3 people found this review funny', ..."
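
To make the reshaping concrete, here is a small sketch of `pd.melt` on a made-up wide frame. The `wide` data is an assumption, and `value_vars` is derived from the column names here so that no review column is left out.

```python
import pandas as pd

# Toy wide frame: one column per review position, None where a user has fewer reviews
wide = pd.DataFrame({
    'user_id': ['a', 'b'],
    0: ['a0', 'b0'],
    1: ['a1', None],
})

# Every column other than the identifier is treated as a review column
review_cols = [c for c in wide.columns if c != 'user_id']

long = pd.melt(wide, id_vars=['user_id'], value_vars=review_cols, value_name='reviews')
print(long)

# Rows coming from empty positions can then be dropped
print(long.dropna())
```

Each original row yields one output row per review column, and the `variable` column records which position the value came from.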


In [14]:
review_norm[review_norm['user_id']=='evcentric']

Unnamed: 0,user_id,user_url,variable,reviews
2,evcentric,http://steamcommunity.com/id/evcentric,0,"{'funny': '', 'posted': 'Posted February 3.', ..."
25801,evcentric,http://steamcommunity.com/id/evcentric,1,"{'funny': '', 'posted': 'Posted December 4, 20..."
51600,evcentric,http://steamcommunity.com/id/evcentric,2,"{'funny': '', 'posted': 'Posted November 3, 20..."
77399,evcentric,http://steamcommunity.com/id/evcentric,3,"{'funny': '', 'posted': 'Posted October 15, 20..."
103198,evcentric,http://steamcommunity.com/id/evcentric,4,"{'funny': '', 'posted': 'Posted October 15, 20..."
128997,evcentric,http://steamcommunity.com/id/evcentric,5,"{'funny': '', 'posted': 'Posted October 15, 20..."
154796,evcentric,http://steamcommunity.com/id/evcentric,6,
180595,evcentric,http://steamcommunity.com/id/evcentric,7,
206394,evcentric,http://steamcommunity.com/id/evcentric,8,


In [16]:
# Drop the rows whose value is missing (None/NaN)
review_norm = review_norm.dropna()
# Check that each 'user_id' keeps only as many rows as it has review dictionaries
review_norm[review_norm['user_id']=='evcentric']

Unnamed: 0,user_id,user_url,variable,reviews
2,evcentric,http://steamcommunity.com/id/evcentric,0,"{'funny': '', 'posted': 'Posted February 3.', ..."
25801,evcentric,http://steamcommunity.com/id/evcentric,1,"{'funny': '', 'posted': 'Posted December 4, 20..."
51600,evcentric,http://steamcommunity.com/id/evcentric,2,"{'funny': '', 'posted': 'Posted November 3, 20..."
77399,evcentric,http://steamcommunity.com/id/evcentric,3,"{'funny': '', 'posted': 'Posted October 15, 20..."
103198,evcentric,http://steamcommunity.com/id/evcentric,4,"{'funny': '', 'posted': 'Posted October 15, 20..."
128997,evcentric,http://steamcommunity.com/id/evcentric,5,"{'funny': '', 'posted': 'Posted October 15, 20..."
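
As a point of comparison, the same one-row-per-review layout can be reached with `DataFrame.explode`, which avoids the intermediate wide frame. This is only a sketch on made-up data (the `toy` frame and its values are assumptions), not the notebook's own method:

```python
import pandas as pd

# Toy frame shaped like `reviews`: one list of review dictionaries per user
toy = pd.DataFrame({
    'user_id': ['a', 'b', 'c'],
    'user_url': ['url_a', 'url_b', 'url_c'],
    'reviews': [
        [{'item_id': '1250', 'recommend': True}, {'item_id': '22200', 'recommend': True}],
        [{'item_id': '43110', 'recommend': False}],
        [],  # a user without reviews
    ],
})

# One row per review; empty lists become NaN and are dropped
exploded = toy.explode('reviews').dropna(subset=['reviews']).reset_index(drop=True)

# Expand each review dictionary into its own columns
fields = pd.json_normalize(exploded['reviews'].tolist())

# Both frames share a 0-based index, so the rows line up
flat = pd.concat([exploded[['user_id', 'user_url']], fields], axis=1)
print(flat)
```

Because `explode` repeats `user_id` and `user_url` for every review, no separate step is needed to reattach them.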
