# User Reviews
###### ETL process for the `user_reviews.json.gz` file

### Libraries and function definitions

In [1]:
import pandas as pd
import numpy as np
import ast
from datetime import datetime
import gzip
import os

In [2]:
pd.options.display.max_columns = None
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 999) 

In [3]:
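# Parent of the current working directory: '__file__' here is only a placeholder that dirname() strips off
project_path = os.path.dirname(os.path.realpath('../__file__'))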

## Extraction and inspection

### Reading the data

To see how the file was structured, it was decompressed and opened as a text file. This showed that each line of the file holds one record written as a Python `dict` literal.

In [4]:
# Path to the compressed file (built from components so it also works outside Windows)
file_path = os.path.join(project_path, 'Data', 'source_DATA', 'user_reviews.json.gz')

# List to store the records
data_list = []

# Decompress the file and read it line by line
with gzip.open(file_path, 'rt', encoding='utf-8') as file:
    for line in file:
        # Each line is a Python dict literal, so it is parsed with ast.literal_eval
        data = ast.literal_eval(line)
        data_list.append(data)

# Build a DataFrame from the list of records
df_reviews = pd.DataFrame(data_list)
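
The cell above relies on `ast.literal_eval` rather than a JSON parser. A minimal sketch of the likely reason, assuming the lines use Python-style single quotes and `True`/`False` as the displayed records suggest (the sample line below is hypothetical, not taken from the file):

```python
import ast
import json

# Hypothetical line, shaped like the records described above
sample_line = "{'user_id': 'js41637', 'reviews': [{'recommend': True, 'review': 'ok'}]}"

# json.loads rejects it: JSON requires double quotes and lowercase true/false
try:
    json.loads(sample_line)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval only evaluates Python literals (dict, list, str, bool, ...),
# so it can read the line without executing arbitrary code
record = ast.literal_eval(sample_line)

print(parsed_as_json)
print(record['reviews'][0]['recommend'])
```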

### Getting to know the dataset

At this point we use several attributes and methods of pandas DataFrames to get an idea of how the data is structured and how many records there are.

In [5]:
df_reviews.head(20)

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."
5,Wackky,http://steamcommunity.com/id/Wackky,"[{'funny': '', 'posted': 'Posted May 5, 2014.'..."
6,76561198079601835,http://steamcommunity.com/profiles/76561198079...,"[{'funny': '1 person found this review funny',..."
7,MeaTCompany,http://steamcommunity.com/id/MeaTCompany,"[{'funny': '', 'posted': 'Posted July 24.', 'l..."
8,76561198089393905,http://steamcommunity.com/profiles/76561198089...,"[{'funny': '5 people found this review funny',..."
9,76561198156664158,http://steamcommunity.com/profiles/76561198156...,"[{'funny': '', 'posted': 'Posted June 16.', 'l..."


In [6]:
df_reviews.tail(20)

Unnamed: 0,user_id,user_url,reviews
25779,CaptainAmericaCw,http://steamcommunity.com/id/CaptainAmericaCw,"[{'funny': '1 person found this review funny',..."
25780,76561198267374962,http://steamcommunity.com/profiles/76561198267...,"[{'funny': '1 person found this review funny',..."
25781,KinkyyyCSGO,http://steamcommunity.com/id/KinkyyyCSGO,"[{'funny': '', 'posted': 'Posted March 6.', 'l..."
25782,76561198270958927,http://steamcommunity.com/profiles/76561198270...,"[{'funny': '', 'posted': 'Posted July 3.', 'la..."
25783,elasticgoose,http://steamcommunity.com/id/elasticgoose,"[{'funny': '', 'posted': 'Posted January 24.',..."
25784,76561198272389051,http://steamcommunity.com/profiles/76561198272...,"[{'funny': '1 person found this review funny',..."
25785,MeloncraftLP,http://steamcommunity.com/id/MeloncraftLP,"[{'funny': '1 person found this review funny',..."
25786,76561198277602337,http://steamcommunity.com/profiles/76561198277...,"[{'funny': '3 people found this review funny',..."
25787,943525,http://steamcommunity.com/id/943525,"[{'funny': '', 'posted': 'Posted March 5.', 'l..."
25788,vinquility,http://steamcommunity.com/id/vinquility,"[{'funny': '', 'posted': 'Posted March 5.', 'l..."


In [7]:
df_reviews.shape

(25799, 3)

The `reviews` column holds lists whose elements are dictionaries, and these contain the information on the reviews written by each user.

In [8]:
df_reviews.isnull().sum()

user_id     0
user_url    0
reviews     0
dtype: int64

There are no null values.

As an example, take one value from the `reviews` column. It shows that each value is not a single dictionary but a list of dictionaries.

In [9]:
df_reviews['reviews'][13]

[{'funny': '',
  'posted': 'Posted September 5, 2015.',
  'last_edited': '',
  'item_id': '232090',
  'helpful': '0 of 1 people (0%) found this review helpful',
  'recommend': True,
  'review': 'Amazing, Non-stop action of blowing stuff to bits, Decapitation and shooting everything you see. With a combination of action, thriller and emmersive gameplay, as well as enviromental challanges (Jump physics). This game will really put your eyes to the test, can you see the enemys before they see you? Cause their are so many!This is the second level of the killing floor, I quote bill LF4D "Son we just crossed the street" But in reality they only moved up an elevator level on genes. What has yet to come as the game is slowly realsed with thrilling and horryfing creations, But who really cares, Let\'s just blow it up, I invite you to get on the GODAMN KILLING FLOOR, LET\'S SHOOT ♥♥♥♥♥ES AND GET PAID! RAHHHHHHHHHHH!'},
 {'funny': '',
  'posted': 'Posted March 30, 2015.',
  'last_edited': '',
  'i

## Transformation

### Duplicate data

First we use the `user_id` column to check for repeated records.

In [10]:
duplicated_reviews_user_id = df_reviews.loc[df_reviews['user_id'].duplicated(keep=False)]
duplicated_reviews_user_id

Unnamed: 0,user_id,user_url,reviews
9,76561198156664158,http://steamcommunity.com/profiles/76561198156...,"[{'funny': '', 'posted': 'Posted June 16.', 'l..."
50,Rivtex,http://steamcommunity.com/id/Rivtex,"[{'funny': '', 'posted': 'Posted December 23, ..."
83,76561198094224872,http://steamcommunity.com/profiles/76561198094...,[]
119,DieMadchenschanderin,http://steamcommunity.com/id/DieMadchenschanderin,"[{'funny': '', 'posted': 'Posted August 29, 20..."
147,relesprit,http://steamcommunity.com/id/relesprit,"[{'funny': '', 'posted': 'Posted December 27, ..."
...,...,...,...
17819,76561198076474887,http://steamcommunity.com/profiles/76561198076...,"[{'funny': '', 'posted': 'Posted April 12.', '..."
17916,yolofaceguy,http://steamcommunity.com/id/yolofaceguy,"[{'funny': '', 'posted': 'Posted October 31, 2..."
18028,76561198075591109,http://steamcommunity.com/profiles/76561198075...,"[{'funny': '', 'posted': 'Posted December 26, ..."
18234,76561198092022514,http://steamcommunity.com/profiles/76561198092...,"[{'funny': '', 'posted': 'Posted July 3.', 'la..."


We drop the records with a repeated `user_id`, keeping the first occurrence of each.

In [11]:
df_reviews.drop_duplicates(subset='user_id', keep='first', inplace=True)
df_reviews

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."
...,...,...,...
25794,76561198306599751,http://steamcommunity.com/profiles/76561198306...,"[{'funny': '', 'posted': 'Posted May 31.', 'la..."
25795,Ghoustik,http://steamcommunity.com/id/Ghoustik,"[{'funny': '', 'posted': 'Posted June 17.', 'l..."
25796,76561198310819422,http://steamcommunity.com/profiles/76561198310...,"[{'funny': '1 person found this review funny',..."
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,"[{'funny': '', 'posted': 'Posted July 21.', 'l..."
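
The two cells above use `duplicated(keep=False)` to list every copy and `drop_duplicates(keep='first')` to keep one row per user. A small sketch of the difference on a hypothetical toy frame (the `toy` data and the `n` column are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: user 'a' appears twice
toy = pd.DataFrame({'user_id': ['a', 'b', 'a', 'c'], 'n': [1, 2, 3, 4]})

# keep=False flags every row whose user_id occurs more than once
all_copies = toy.loc[toy['user_id'].duplicated(keep=False)]

# keep='first' retains only the first row for each user_id
deduped = toy.drop_duplicates(subset='user_id', keep='first')

print(all_copies.index.tolist())
print(deduped['user_id'].tolist())
```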


Next we use the `reviews` column to check for repeated records.

In [12]:
duplicated_reviews_reviews = df_reviews.loc[df_reviews['reviews'].duplicated(keep=False)]
duplicated_reviews_reviews

Unnamed: 0,user_id,user_url,reviews
62,gdxsd,http://steamcommunity.com/id/gdxsd,[]
83,76561198094224872,http://steamcommunity.com/profiles/76561198094...,[]
1047,76561198021575394,http://steamcommunity.com/profiles/76561198021...,[]
3954,cmuir37,http://steamcommunity.com/id/cmuir37,[]
5394,Jaysteeny,http://steamcommunity.com/id/Jaysteeny,[]
6135,ML8989,http://steamcommunity.com/id/ML8989,[]
7583,76561198079215291,http://steamcommunity.com/profiles/76561198079...,[]
7952,76561198079342142,http://steamcommunity.com/profiles/76561198079...,[]
9894,76561198061996985,http://steamcommunity.com/profiles/76561198061...,[]
10381,76561198108286351,http://steamcommunity.com/profiles/76561198108...,[]


There are repeated values in the `reviews` column; however, they correspond to missing values, that is, empty lists for users with no reviews.

In [13]:
df_reviews.isnull().sum()

user_id     0
user_url    0
reviews     0
dtype: int64
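
The null count stays at zero because an empty list is a regular Python object, not a null. A minimal sketch of how the empty entries could be counted instead, using a hypothetical toy frame (`toy`, `reviews_per_user`, and `empty_count` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical toy data: user 'b' has no reviews
toy = pd.DataFrame({
    'user_id': ['a', 'b', 'c'],
    'reviews': [[{'item_id': '10'}], [], [{'item_id': '20'}, {'item_id': '30'}]],
})

# isnull() does not flag the empty list
null_count = toy['reviews'].isnull().sum()

# Counting list lengths exposes users without reviews
reviews_per_user = toy['reviews'].apply(len)
empty_count = (reviews_per_user == 0).sum()

print(null_count, empty_count)
```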

### Transforming the `reviews` column

The `reviews` column holds a list whose elements are dictionaries. The list is expanded first, giving one row per review, and then each dictionary is flattened into columns.

In [14]:
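# One row per review; users with an empty list get a single row with NaN
df_reviews = df_reviews.explode('reviews')

# Drop the rows of users without reviews
df_reviews = df_reviews.dropna(subset=['reviews'])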
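
To make the effect of this step concrete, here is a small sketch on a hypothetical toy frame (the data is made up). It shows that `explode` turns an empty list into a `NaN` row, which `dropna` then removes:

```python
import pandas as pd

# Hypothetical toy data: user 'b' has an empty list
toy = pd.DataFrame({
    'user_id': ['a', 'b', 'c'],
    'reviews': [[{'item_id': '10'}], [], [{'item_id': '20'}, {'item_id': '30'}]],
})

# Each list element becomes its own row; the empty list becomes NaN
exploded = toy.explode('reviews')

# Removing the NaN rows leaves only actual reviews
cleaned = exploded.dropna(subset=['reviews'])

print(len(exploded), len(cleaned))
print(cleaned['user_id'].tolist())
```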

In [15]:
# Flatten each review dictionary into columns; user_id and user_url are not carried over
df_reviews = pd.json_normalize(df_reviews['reviews'])
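
The normalized frame contains only the review fields. If the user identifier were needed alongside each review, one possible approach is sketched below. This is an alternative under stated assumptions, not the notebook's method: the toy data and the names `exploded`, `flat`, and `with_user` are hypothetical.

```python
import pandas as pd

# Hypothetical exploded frame: the index repeats, as it does after explode()
exploded = pd.DataFrame(
    {
        'user_id': ['a', 'c', 'c'],
        'reviews': [
            {'item_id': '10', 'recommend': True},
            {'item_id': '20', 'recommend': False},
            {'item_id': '30', 'recommend': True},
        ],
    },
    index=[0, 2, 2],
)

# Reset the index so rows line up positionally with the normalized output
exploded = exploded.reset_index(drop=True)

# json_normalize on a list of dicts returns a frame with a fresh 0..n-1 index
flat = pd.json_normalize(exploded['reviews'].tolist())

# Column-wise concat keeps the user next to each flattened review
with_user = pd.concat([exploded[['user_id']], flat], axis=1)

print(with_user.columns.tolist())
```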


In [16]:
df_reviews

Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review
0,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.
2,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
3,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
4,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...
...,...,...,...,...,...,...,...
58425,,Posted July 10.,,70,No ratings yet,True,a must have classic from steam definitely wort...
58426,,Posted July 8.,,362890,No ratings yet,True,this game is a perfect remake of the original ...
58427,1 person found this review funny,Posted July 3.,,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...
58428,,Posted July 20.,,730,No ratings yet,True,:D


In [17]:
# Rename the columns with 'review_' as a prefix
df_reviews.columns = ['review_' + col for col in df_reviews.columns]
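
For reference, pandas also provides `DataFrame.add_prefix`, which should give the same column names as the list comprehension above. A brief sketch on a hypothetical one-row frame:

```python
import pandas as pd

# Hypothetical frame with a few of the review fields
toy = pd.DataFrame({'funny': [''], 'item_id': ['1250'], 'review': ['ok']})

# add_prefix returns a new frame with every column label prefixed
prefixed = toy.add_prefix('review_')

print(prefixed.columns.tolist())
```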

In [18]:
df_reviews

Unnamed: 0,review_funny,review_posted,review_last_edited,review_item_id,review_helpful,review_recommend,review_review
0,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.
2,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
3,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
4,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...
...,...,...,...,...,...,...,...
58425,,Posted July 10.,,70,No ratings yet,True,a must have classic from steam definitely wort...
58426,,Posted July 8.,,362890,No ratings yet,True,this game is a perfect remake of the original ...
58427,1 person found this review funny,Posted July 3.,,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...
58428,,Posted July 20.,,730,No ratings yet,True,:D
