# ETL Process for `user_reviews`

This notebook describes the Extract, Transform, Load (ETL) process applied to the `user_reviews` dataset, which prepares user reviews from the Steam platform for analysis. These reviews show how users perceive the games and how satisfied they are with them, and so offer insight into each product's popularity and reception.

Required libraries

In [8]:
import pandas as pd
from textblob import TextBlob
from utils.utils import cargar_json_gz

## Data Extraction and Exploration

Loading the data

In [9]:

# Build the DataFrame from the list of records read from the compressed file
reviews = cargar_json_gz("../data/user_reviews.json.gz", way=1)
reviews

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."
...,...,...,...
25794,76561198306599751,http://steamcommunity.com/profiles/76561198306...,"[{'funny': '', 'posted': 'Posted May 31.', 'la..."
25795,Ghoustik,http://steamcommunity.com/id/Ghoustik,"[{'funny': '', 'posted': 'Posted June 17.', 'l..."
25796,76561198310819422,http://steamcommunity.com/profiles/76561198310...,"[{'funny': '1 person found this review funny',..."
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,"[{'funny': '', 'posted': 'Posted July 21.', 'l..."
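
The notebook does not show the implementation of `cargar_json_gz`, and the meaning of its `way=1` argument is not documented here. As a rough sketch only, a loader of this kind could look like the following. The function name `load_literal_gz` and the sample file are hypothetical, and the sketch assumes each line of the file holds one record written as a Python literal; if the file were strict JSON, `json.loads` would take the place of `ast.literal_eval`.

```python
import ast
import gzip
import os
import tempfile

import pandas as pd


def load_literal_gz(path):
    """Hypothetical stand-in for cargar_json_gz (assumed: one dict literal per line)."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                rows.append(ast.literal_eval(line))
    return pd.DataFrame(rows)


# Tiny made-up file so the sketch runs without the real dataset.
sample_lines = [
    "{'user_id': 'u1', 'user_url': 'http://example.com/u1', "
    "'reviews': [{'item_id': '10', 'recommend': True}]}",
    "{'user_id': 'u2', 'user_url': 'http://example.com/u2', 'reviews': []}",
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_reviews.json.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
        handle.write("\n".join(sample_lines))
    sample = load_literal_gz(sample_path)
```

The result has one row per user, with the `reviews` column still holding a list of dictionaries, which matches the shape of the output shown above.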


Initial exploration

In [10]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25799 entries, 0 to 25798
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   user_id   25799 non-null  object
 1   user_url  25799 non-null  object
 2   reviews   25799 non-null  object
dtypes: object(3)
memory usage: 604.8+ KB


## Data Transformation
### Unnesting the Reviews

We flatten the nested structure of the reviews so that each review becomes its own row and each of its attributes its own column.

In [11]:
# Unnest the 'reviews' column (one row per review) and expand each dictionary into separate columns
reviews = reviews.explode('reviews')
reviews = pd.concat([reviews.drop(['reviews'], axis=1),
                    reviews['reviews'].apply(pd.Series)], axis=1)
reviews

Unnamed: 0,user_id,user_url,funny,posted,last_edited,item_id,helpful,recommend,review,0
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...,
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.,
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...,
1,js41637,http://steamcommunity.com/id/js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,
1,js41637,http://steamcommunity.com/id/js41637,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...,
...,...,...,...,...,...,...,...,...,...,...
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,,Posted July 10.,,70,No ratings yet,True,a must have classic from steam definitely wort...,
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,,Posted July 8.,,362890,No ratings yet,True,this game is a perfect remake of the original ...,
25798,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,1 person found this review funny,Posted July 3.,,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...,
25798,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,,Posted July 20.,,730,No ratings yet,True,:D,
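
The column labeled `0` in the output above most likely comes from users whose review list is empty: `explode` turns an empty list into `NaN`, and `pd.Series` applied to that `NaN` produces a single column named `0`. The sketch below, on made-up data, shows one possible way to avoid that residual column by dropping the empty entries before expanding the dictionaries. It is an illustration under those assumptions, not the notebook's method, and the toy column values are invented.

```python
import pandas as pd

# Made-up data: one user with two reviews, one user with none.
toy = pd.DataFrame({
    "user_id": ["a", "b"],
    "reviews": [
        [{"item_id": "10", "review": "good"}, {"item_id": "20", "review": "bad"}],
        [],
    ],
})

# One row per review; the empty list becomes NaN.
exploded = toy.explode("reviews")

# Drop users without reviews, then realign the index so concat matches rows by position.
exploded = exploded.dropna(subset=["reviews"]).reset_index(drop=True)

# Build the attribute columns directly from the list of dictionaries.
details = pd.DataFrame(exploded["reviews"].tolist())
flat = pd.concat([exploded.drop(columns="reviews"), details], axis=1)
```

Building the frame from a list of dictionaries also avoids calling `pd.Series` once per row, which tends to be slow on large inputs.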


### Cleaning and Preparation

We apply a series of steps to clean and prepare the data.

In [24]:
# Extract the four-digit year from 'posted' into a new 'year' column.
# Entries without a year (e.g. 'Posted July 10.') yield NaN here.
reviews['year'] = reviews['posted'].str.extract(r'(\d{4})')

# Drop columns that are irrelevant to the analysis
columns_to_drop = ['user_url', 'last_edited', 'funny', 'helpful', 0]
reviews.drop(columns_to_drop, axis=1, inplace=True)

# Remove duplicate rows
reviews.drop_duplicates(inplace=True)

# Handle nulls: drop rows missing the review text, the recommendation, or the year
reviews.dropna(subset=['review', 'recommend', 'year'], inplace=True)

# Reset the index
reviews.reset_index(drop=True, inplace=True)

In [25]:
reviews

Unnamed: 0,user_id,posted,item_id,recommend,review,year
0,76561197970982479,"Posted November 5, 2011.",1250,True,Simple yet with great replayability. In my opi...,2011
1,76561197970982479,"Posted July 15, 2011.",22200,True,It's unique and worth a playthrough.,2011
2,76561197970982479,"Posted April 21, 2011.",43110,True,Great atmosphere. The gunplay can be a bit chu...,2011
3,js41637,"Posted June 24, 2014.",251610,True,I know what you think when you see this title ...,2014
4,js41637,"Posted September 8, 2013.",227300,True,For a simple (it's actually not all that simpl...,2013
...,...,...,...,...,...,...
48493,wayfeng,"Posted October 14, 2015.",730,True,its FUNNNNNNNN,2015
48494,76561198251004808,"Posted October 10, 2015.",253980,True,Awesome fantasy game if you don't mind the gra...,2015
48495,72947282842,"Posted October 31, 2015.",730,True,Prettyy Mad Game,2015
48496,ApxLGhost,"Posted December 14, 2015.",730,True,AMAZING GAME 10/10,2015
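
The year extraction relies on the regular expression `(\d{4})`, so a `posted` value such as `Posted July 10.` has nothing to match and ends up as `NaN`. Because `year` is one of the columns passed to `dropna`, those rows are removed, which is consistent with the output above no longer showing them. A minimal sketch of that behavior on two sample strings taken from the outputs in this notebook:

```python
import pandas as pd

posted = pd.Series(["Posted November 5, 2011.", "Posted July 10."])

# expand=False returns a Series instead of a one-column DataFrame.
year = posted.str.extract(r"(\d{4})", expand=False)

# Number of entries that carry no four-digit year.
missing_years = int(year.isna().sum())
```

Note that the extracted year is a string; converting it to an integer would be a separate step that this notebook does not perform.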


## Data Storage

Finally, we store the cleaned and transformed DataFrame in Pickle format.


In [31]:
ruta_guardar_pickle = "../data/user_reviews.pkl"

reviews.to_pickle(ruta_guardar_pickle)

In [32]:
# Check that the file was saved correctly by reading it back

reviews_recuperado = pd.read_pickle(ruta_guardar_pickle)
reviews_recuperado

Unnamed: 0,user_id,posted,item_id,recommend,review,year
0,76561197970982479,"Posted November 5, 2011.",1250,True,Simple yet with great replayability. In my opi...,2011
1,76561197970982479,"Posted July 15, 2011.",22200,True,It's unique and worth a playthrough.,2011
2,76561197970982479,"Posted April 21, 2011.",43110,True,Great atmosphere. The gunplay can be a bit chu...,2011
3,js41637,"Posted June 24, 2014.",251610,True,I know what you think when you see this title ...,2014
4,js41637,"Posted September 8, 2013.",227300,True,For a simple (it's actually not all that simpl...,2013
...,...,...,...,...,...,...
48493,wayfeng,"Posted October 14, 2015.",730,True,its FUNNNNNNNN,2015
48494,76561198251004808,"Posted October 10, 2015.",253980,True,Awesome fantasy game if you don't mind the gra...,2015
48495,72947282842,"Posted October 31, 2015.",730,True,Prettyy Mad Game,2015
48496,ApxLGhost,"Posted December 14, 2015.",730,True,AMAZING GAME 10/10,2015
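
Displaying the reloaded DataFrame gives a visual check only. As a possible complement, the round trip can be verified programmatically. The sketch below uses a small made-up DataFrame and a temporary directory rather than the real `../data/user_reviews.pkl` file, so the names `toy_reviews` and `toy_reviews.pkl` are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy_reviews = pd.DataFrame({
    "user_id": ["js41637", "wayfeng"],
    "item_id": ["251610", "730"],
    "recommend": [True, True],
    "year": ["2014", "2015"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = Path(tmp_dir) / "toy_reviews.pkl"
    toy_reviews.to_pickle(pickle_path)
    recovered = pd.read_pickle(pickle_path)

# Raises AssertionError if values, dtypes, or index differ.
pd.testing.assert_frame_equal(toy_reviews, recovered)
round_trip_ok = toy_reviews.equals(recovered)
```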


# Final Processing Report for `user_reviews`

1. **`reviews`**:
   - **Unnesting and Normalization**: The `reviews` column, which held nested data, was unnested so that its elements became individual columns, making each attribute of a review easier to access and analyze.

2. **`posted`**:
   - **Year Extraction**: The year was extracted from the publication text and stored in a new `year` column, enabling time-based analysis of the reviews. This helps reveal trends over time and how games were received in specific years.

3. **Irrelevant Columns**:
   - **Removal**: Several columns were dropped: `user_url`, `last_edited`, `funny`, `helpful`, and the residual column `0` left over from the unnesting step. None of them contributed to sentiment analysis or recommendation trends, so removing them keeps the dataset focused on those analyses.

4. **Duplicates and Null Values**:
   - **Cleaning and Filtering**: Duplicate reviews were removed so that each row is unique. Rows with null values in the key columns `review`, `recommend`, and `year` were then dropped, which also discards reviews whose `posted` text carries no year.

5. **`recommend`**:
   - **Retained**: The `recommend` column was kept unchanged to record whether the user recommends the game, a direct indicator of a positive or negative perception of it.

6. **Storage**:
   - **Use of Pickle**: The final DataFrame was stored in Pickle format (`user_reviews.pkl`), chosen for its efficiency in serializing Python data structures and because it preserves the DataFrame intact for later analysis.
