---
## Starting the ETL for the second file, 'user_reviews'

We import the libraries we will use.

In [1]:
import json
import re
import ast
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

## Extracting and reading the file

After some trial and error, we decided to load the file directly from the .gz archive. That way the data persists after un-nesting.

In [3]:
# Extract the JSON file
row = []  # empty list that collects the parsed rows

# `with` guarantees the file is closed once the block ends
with open("../DataJSon/australian_user_reviews.json", 'r', encoding='utf-8') as file:
    for line in file:  # read line by line and append each parsed record to row
        row.append(ast.literal_eval(line))  # parse the line as a Python literal (a dict)

# Build the DataFrame
reviews = pd.DataFrame(row)
reviews

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."
...,...,...,...
25794,76561198306599751,http://steamcommunity.com/profiles/76561198306...,"[{'funny': '', 'posted': 'Posted May 31.', 'la..."
25795,Ghoustik,http://steamcommunity.com/id/Ghoustik,"[{'funny': '', 'posted': 'Posted June 17.', 'l..."
25796,76561198310819422,http://steamcommunity.com/profiles/76561198310...,"[{'funny': '1 person found this review funny',..."
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,"[{'funny': '', 'posted': 'Posted July 21.', 'l..."
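
The loading cell above can be sketched as a small reusable helper. This is only an illustration under assumptions: `load_literal_lines` is a hypothetical name, and the `.gz` branch assumes the compressed file holds the same one-record-per-line text as the `.json` file. The records use Python literal syntax (single-quoted keys), which `json.loads` rejects, so the sketch keeps `ast.literal_eval`.

```python
import ast
import gzip
import json

import pandas as pd


def load_literal_lines(path):
    """Parse a file with one Python-literal dict per line into a DataFrame.

    Hypothetical helper: paths ending in .gz are opened with gzip in text
    mode, anything else with the built-in open.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    rows = []
    with opener(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                rows.append(ast.literal_eval(line))
    return pd.DataFrame(rows)


# A single-quoted record is a valid Python literal but not valid JSON
sample = "{'user_id': 'js41637', 'reviews': [{'funny': '', 'recommend': True}]}"
try:
    json.loads(sample)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

record = ast.literal_eval(sample)
```

Iterating over the file handle reads one line at a time, so the whole file is never held in memory as raw text.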


We can see how much data is nested in the reviews column. We also drop the user_url column, since it is of no interest for this project.

In [4]:
reviews.drop(columns='user_url', inplace=True)


In [5]:
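reviews.info  # without parentheses this shows the bound method's repr; call reviews.info() for the dtype summary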

<bound method DataFrame.info of                  user_id                                            reviews
0      76561197970982479  [{'funny': '', 'posted': 'Posted November 5, 2...
1                js41637  [{'funny': '', 'posted': 'Posted June 24, 2014...
2              evcentric  [{'funny': '', 'posted': 'Posted February 3.',...
3                  doctr  [{'funny': '', 'posted': 'Posted October 14, 2...
4              maplemage  [{'funny': '3 people found this review funny',...
...                  ...                                                ...
25794  76561198306599751  [{'funny': '', 'posted': 'Posted May 31.', 'la...
25795           Ghoustik  [{'funny': '', 'posted': 'Posted June 17.', 'l...
25796  76561198310819422  [{'funny': '1 person found this review funny',...
25797  76561198312638244  [{'funny': '', 'posted': 'Posted July 21.', 'l...
25798        LydiaMorley  [{'funny': '1 person found this review funny',...

[25799 rows x 2 columns]>

## Transforming the dataset

We un-nest the list of dictionaries in the "reviews" column.

In [6]:
# Use explode to expand the column so that each review gets its own row
# The result is kept in a variable
exploded = reviews.explode('reviews')
exploded

Unnamed: 0,user_id,reviews
0,76561197970982479,"{'funny': '', 'posted': 'Posted November 5, 20..."
0,76561197970982479,"{'funny': '', 'posted': 'Posted July 15, 2011...."
0,76561197970982479,"{'funny': '', 'posted': 'Posted April 21, 2011..."
1,js41637,"{'funny': '', 'posted': 'Posted June 24, 2014...."
1,js41637,"{'funny': '', 'posted': 'Posted September 8, 2..."
...,...,...
25797,76561198312638244,"{'funny': '', 'posted': 'Posted July 10.', 'la..."
25797,76561198312638244,"{'funny': '', 'posted': 'Posted July 8.', 'las..."
25798,LydiaMorley,"{'funny': '1 person found this review funny', ..."
25798,LydiaMorley,"{'funny': '', 'posted': 'Posted July 20.', 'la..."
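
What `explode` does to the index and to empty lists can be seen on a toy frame. This is a minimal sketch with made-up data (`toy` and its values are illustrative, not taken from the dataset): each list element becomes its own row, the original index label is repeated, and a user with an empty list yields one row holding NaN.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "reviews": [
        [{"item_id": "10"}, {"item_id": "20"}],  # two reviews
        [],                                       # no reviews
        [{"item_id": "30"}],                      # one review
    ],
})

toy_exploded = toy.explode("reviews")

# The original index label is repeated for every element of the list
index_labels = toy_exploded.index.tolist()
# Rows that came from an empty list hold NaN instead of a dict
missing_rows = int(toy_exploded["reviews"].isna().sum())
```

Rows like the one for `u2` are presumably the reason the next cell calls `dropna()` before flattening.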


In [7]:
# Normalize (flatten) the data
normalizado = pd.json_normalize(exploded['reviews'].dropna())
normalizado

Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review
0,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.
2,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
3,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
4,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...
...,...,...,...,...,...,...,...
59300,,Posted July 10.,,70,No ratings yet,True,a must have classic from steam definitely wort...
59301,,Posted July 8.,,362890,No ratings yet,True,this game is a perfect remake of the original ...
59302,1 person found this review funny,Posted July 3.,,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...
59303,,Posted July 20.,,730,No ratings yet,True,:D


In [8]:
# Reset the index so the rows stay in order
normalizado.reset_index(inplace=True)
normalizado

Unnamed: 0,index,funny,posted,last_edited,item_id,helpful,recommend,review
0,0,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,1,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.
2,2,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
3,3,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
4,4,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...
...,...,...,...,...,...,...,...,...
59300,59300,,Posted July 10.,,70,No ratings yet,True,a must have classic from steam definitely wort...
59301,59301,,Posted July 8.,,362890,No ratings yet,True,this game is a perfect remake of the original ...
59302,59302,1 person found this review funny,Posted July 3.,,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...
59303,59303,,Posted July 20.,,730,No ratings yet,True,:D


In [9]:
exploded.reset_index(inplace=True)
exploded

Unnamed: 0,index,user_id,reviews
0,0,76561197970982479,"{'funny': '', 'posted': 'Posted November 5, 20..."
1,0,76561197970982479,"{'funny': '', 'posted': 'Posted July 15, 2011...."
2,0,76561197970982479,"{'funny': '', 'posted': 'Posted April 21, 2011..."
3,1,js41637,"{'funny': '', 'posted': 'Posted June 24, 2014...."
4,1,js41637,"{'funny': '', 'posted': 'Posted September 8, 2..."
...,...,...,...
59328,25797,76561198312638244,"{'funny': '', 'posted': 'Posted July 10.', 'la..."
59329,25797,76561198312638244,"{'funny': '', 'posted': 'Posted July 8.', 'las..."
59330,25798,LydiaMorley,"{'funny': '1 person found this review funny', ..."
59331,25798,LydiaMorley,"{'funny': '', 'posted': 'Posted July 20.', 'la..."


In [10]:
# Concatenate with the original data and drop the original nested "reviews" column
# (pd.concat with axis=1 pairs rows by index label)
reviews = pd.concat([exploded, normalizado], axis=1)
reviews = reviews.drop(columns=['reviews'])
reviews

Unnamed: 0,index,user_id,index.1,funny,posted,last_edited,item_id,helpful,recommend,review
0,0,76561197970982479,0.0,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,0,76561197970982479,1.0,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.
2,0,76561197970982479,2.0,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
3,1,js41637,3.0,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
4,1,js41637,4.0,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...
...,...,...,...,...,...,...,...,...,...,...
59328,25797,76561198312638244,,,,,,,,
59329,25797,76561198312638244,,,,,,,,
59330,25798,LydiaMorley,,,,,,,,
59331,25798,LydiaMorley,,,,,,,,
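
In the output above, the last rows of `reviews` are empty even though `exploded` holds reviews for those users. A likely cause is that `normalizado` was built after `dropna()`, so it has fewer rows than `exploded`; because `pd.concat(axis=1)` pairs rows by index label, every review after the first dropped row would then sit next to the wrong `user_id`. The sketch below reproduces the effect on toy data and shows one possible way to keep each review beside its author. The names and values are illustrative assumptions, not the author's method.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": ["a", "b", "c"],
    "reviews": [[{"item_id": "1"}], [], [{"item_id": "2"}]],
})

# Positional pairing after dropna: the review of "c" lands next to "b"
exploded_toy = toy.explode("reviews").reset_index(drop=True)
flat_shifted = pd.json_normalize(exploded_toy["reviews"].dropna().tolist())
shifted = pd.concat([exploded_toy[["user_id"]], flat_shifted], axis=1)

# Dropping the empty rows from both sides first keeps the pairing intact
kept = exploded_toy.dropna(subset=["reviews"]).reset_index(drop=True)
flat = pd.json_normalize(kept["reviews"].tolist())
aligned = pd.concat([kept[["user_id"]], flat], axis=1)
```

Comparing a few `user_id` and `item_id` pairs against the raw file is a quick way to check which of the two situations applies to the real data.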


In [11]:
# Drop the duplicate index columns so the DataFrame's own index is the only ordering
reviews= reviews.drop(columns="index")
reviews

Unnamed: 0,user_id,funny,posted,last_edited,item_id,helpful,recommend,review
0,76561197970982479,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,76561197970982479,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.
2,76561197970982479,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
3,js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
4,js41637,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...
...,...,...,...,...,...,...,...,...
59328,76561198312638244,,,,,,,
59329,76561198312638244,,,,,,,
59330,LydiaMorley,,,,,,,
59331,LydiaMorley,,,,,,,


In [12]:
# Build the dictionary directly with a list comprehension
tipo_data = {
    "columna": reviews.columns,
    "tipos_de_datos": [reviews[col].apply(type).unique() for col in reviews.columns]
}

# Create the DataFrame from the dictionary
analisis = pd.DataFrame(tipo_data)
analisis

Unnamed: 0,columna,tipos_de_datos
0,user_id,[<class 'str'>]
1,funny,"[<class 'str'>, <class 'float'>]"
2,posted,"[<class 'str'>, <class 'float'>]"
3,last_edited,"[<class 'str'>, <class 'float'>]"
4,item_id,"[<class 'str'>, <class 'float'>]"
5,helpful,"[<class 'str'>, <class 'float'>]"
6,recommend,"[<class 'bool'>, <class 'float'>]"
7,review,"[<class 'str'>, <class 'float'>]"
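
The `float` entries that appear beside `str` or `bool` in the table above are consistent with missing values, since NaN is itself a float. A minimal sketch on a made-up column (the values are illustrative):

```python
import numpy as np
import pandas as pd

col = pd.Series(["Posted July 3.", np.nan])

# NaN is stored as a Python float, so it shows up as a second type
python_types = col.apply(type).unique().tolist()
missing_count = int(col.isna().sum())
```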


### Searching for duplicates and nulls

#### In this DataFrame, nulls are searched for and removed after normalization, because of the large amount of information held in the nested "reviews" column, so that no data is lost before it has been analyzed

In [13]:
# The duplicados variable stores the search result so it can be compared
duplicados = reviews.loc[reviews.duplicated()]
duplicados

Unnamed: 0,user_id,funny,posted,last_edited,item_id,helpful,recommend,review
1112,bobseagull,,"Posted September 24, 2015.",,346110,1 of 1 people (100%) found this review helpful,True,yep
2894,ImSeriouss,,"Posted January 13, 2014.",,211820,No ratings yet,True,If you want to play this game.. expect glithes...
2895,ImSeriouss,,"Posted January 10, 2014.",,440,No ratings yet,True,Really good game! fun! Good for people who wan...
2896,ImSeriouss,,"Posted March 19, 2012.",,42680,No ratings yet,True,Good but a bit overdone. Still love it though.
3582,76561198062039159,,"Posted December 11, 2015.",,730,0 of 1 people (0%) found this review helpful,True,I rate it R8/Revolver
...,...,...,...,...,...,...,...,...
59327,76561198312638244,,,,,,,
59328,76561198312638244,,,,,,,
59329,76561198312638244,,,,,,,
59331,LydiaMorley,,,,,,,
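
When deciding whether duplicates can be dropped, it can help to look at every copy side by side rather than only the repeats. A minimal sketch on made-up rows (the values are illustrative): `duplicated()` with its default `keep='first'` flags only the later occurrences, while `keep=False` flags the whole group.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": ["a", "a", "b", "b"],
    "item_id": ["730", "730", "440", "70"],
    "review": ["yep", "yep", "fun", "classic"],
})

# keep=False flags every member of a duplicated group, not only the repeats
all_copies = toy[toy.duplicated(keep=False)]
# The default (keep='first') flags only the later occurrences
repeats_only = toy[toy.duplicated()]
# drop_duplicates(keep='first') removes exactly the rows in repeats_only
deduplicated = toy.drop_duplicates(keep="first")
```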


### Checking whether it is correct to drop the duplicates

In [14]:
reviews = reviews.drop_duplicates(keep='first')
reviews

Unnamed: 0,user_id,funny,posted,last_edited,item_id,helpful,recommend,review
0,76561197970982479,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,76561197970982479,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.
2,76561197970982479,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
3,js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
4,js41637,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...
...,...,...,...,...,...,...,...,...
59323,76561198306599751,,,,,,,
59324,Ghoustik,,,,,,,
59325,76561198310819422,,,,,,,
59326,76561198312638244,,,,,,,


In [15]:
nulos= reviews.isnull().sum()
nulos

user_id         0
funny          18
posted         18
last_edited    18
item_id        18
helpful        18
recommend      18
review         18
dtype: int64

In [16]:
reviews = reviews.dropna().reset_index(drop=True)
reviews

Unnamed: 0,user_id,funny,posted,last_edited,item_id,helpful,recommend,review
0,76561197970982479,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,76561197970982479,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.
2,76561197970982479,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
3,js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
4,js41637,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...
...,...,...,...,...,...,...,...,...
59156,Fuckfhaisjnsnsjakaka,,Posted July 10.,,70,No ratings yet,True,a must have classic from steam definitely wort...
59157,3214213216,,Posted July 8.,,362890,No ratings yet,True,this game is a perfect remake of the original ...
59158,ChrisCoroner,1 person found this review funny,Posted July 3.,,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...
59159,CaptainAmericaCw,,Posted July 20.,,730,No ratings yet,True,:D


We notice that the word "Posted" gets in the way in the posted column, so we remove it.

In [17]:
# Replace the word Posted with an empty string
reviews['posted'] = reviews['posted'].replace({'Posted': ''}, regex=True)
reviews.head(3)

Unnamed: 0,user_id,funny,posted,last_edited,item_id,helpful,recommend,review
0,76561197970982479,,"November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,76561197970982479,,"July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.
2,76561197970982479,,"April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...


In [18]:
# Now convert the column to the datetime data type
reviews['posted'] = pd.to_datetime(reviews['posted'], errors='coerce')
reviews.head(3)

Unnamed: 0,user_id,funny,posted,last_edited,item_id,helpful,recommend,review
0,76561197970982479,,2011-11-05,,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,76561197970982479,,2011-07-15,,22200,No ratings yet,True,It's unique and worth a playthrough.
2,76561197970982479,,2011-04-21,,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
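
Some entries seen earlier, such as "Posted July 10.", carry no year, so they cannot yield a complete date; how the format-free call above treats them may depend on the pandas version. The sketch below is one possible variant, not the notebook's method: it strips the prefix and the trailing period, then parses with an explicit format so that year-less entries become NaT and can be counted. The sample values mirror strings shown in the outputs above.

```python
import pandas as pd

posted = pd.Series(["Posted November 5, 2011.", "Posted July 10."])

# Remove the "Posted " prefix and the trailing period
cleaned = (
    posted.str.replace(r"^Posted\s+", "", regex=True)
          .str.rstrip(".")
)

# With an explicit format, entries that lack a year do not match and become NaT
parsed = pd.to_datetime(cleaned, format="%B %d, %Y", errors="coerce")
missing_year = int(parsed.isna().sum())
```

Counting the NaT values right after parsing shows how many reviews lose their date in this step.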


## Saving the file in Parquet format to optimize memory usage

In [19]:
df = reviews.copy()

In [20]:
save = '../DataSets/user_review_limpio.csv'
reviews.to_csv(save, index=False, encoding='utf-8')

In [28]:
# Convert the CSV file to Parquet
# Read the CSV file
reviews = pd.read_csv("../DataSets/user_review_limpio.csv")

# Set where to save the Parquet file and under what name
output_file = "../DataSets/user_review.parquet"

# Convert the CSV data to Parquet through a pyarrow Table
table = pa.Table.from_pandas(reviews)
pq.write_table(table, output_file)
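
One thing to keep in mind with the CSV intermediate step is that CSV stores only text, so a datetime column is read back as plain strings unless it is parsed again. The sketch below shows this on a made-up one-row frame and, as an assumption about what may be preferable, a hypothetical `save_parquet` helper that writes the cleaned DataFrame straight to Parquet with `DataFrame.to_parquet`.

```python
import io

import pandas as pd

df_demo = pd.DataFrame({
    "user_id": ["js41637"],
    "posted": pd.to_datetime(["2014-06-24"]),
})

# Round-trip through CSV text: the datetime column comes back as plain strings
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

dtype_before = str(df_demo["posted"].dtype)
posted_is_datetime_after = pd.api.types.is_datetime64_any_dtype(from_csv["posted"])


def save_parquet(df, path):
    """Hypothetical helper: write the DataFrame straight to Parquet."""
    # DataFrame.to_parquet needs a Parquet engine such as pyarrow installed
    df.to_parquet(path, index=False)
```

Calling `save_parquet(df, "../DataSets/user_review.parquet")` on the copy made earlier would skip the CSV read and keep the column types as they are in memory.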