## **ETL - user_reviews**
**In this notebook we perform the extraction, transformation, and loading (ETL) of `user_reviews`**

In [1]:
# Import libraries
import json             # JSON encoder and decoder
import ast              # Abstract Syntax Trees module
import pandas as pd     # Library for manipulating datasets
import pyarrow as pa    # Useful for reading and writing data
import pyarrow.parquet as pq   # Useful for reading and writing Parquet files efficiently
import gzip             # Library for compressing and decompressing data
import os               # Creating directories and checking that they exist
from textblob import TextBlob   # Text-processing library

#### **Extraction and exploration**

In [2]:
# Create an empty list to collect the record parsed from each line in the for loop.
contenido = []

# Loop over the lines of the compressed dataset
for i in gzip.open("dataset/user_reviews.json.gz"):
    contenido.append(ast.literal_eval(i.decode('utf-8')))

# Build the DataFrame from the list
reviews = pd.DataFrame(contenido)
reviews

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."
...,...,...,...
25794,76561198306599751,http://steamcommunity.com/profiles/76561198306...,"[{'funny': '', 'posted': 'Posted May 31.', 'la..."
25795,Ghoustik,http://steamcommunity.com/id/Ghoustik,"[{'funny': '', 'posted': 'Posted June 17.', 'l..."
25796,76561198310819422,http://steamcommunity.com/profiles/76561198310...,"[{'funny': '1 person found this review funny',..."
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,"[{'funny': '', 'posted': 'Posted July 21.', 'l..."
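
The loop above parses each line with `ast.literal_eval` rather than `json.loads`, presumably because the records are written as Python literals (single-quoted strings, `True`/`False`) and not as strict JSON. A minimal sketch of the difference, using a made-up `line` shaped like the records shown above (the sample string is an assumption, not a line copied from the file):

```python
import ast
import json

# Hypothetical record: single quotes and a Python boolean make it invalid JSON.
line = "{'user_id': 'js41637', 'reviews': [{'item_id': '251610', 'recommend': True}]}"

# json.loads rejects the line because JSON requires double-quoted keys.
try:
    json.loads(line)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval evaluates it as a Python literal without executing code.
record = ast.literal_eval(line)
```

Here `record` is a regular `dict`, which is why the list of records can be handed straight to `pd.DataFrame`.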


In [3]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25799 entries, 0 to 25798
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   user_id   25799 non-null  object
 1   user_url  25799 non-null  object
 2   reviews   25799 non-null  object
dtypes: object(3)
memory usage: 604.8+ KB


#### **Transformation**


We unnest the data in the `reviews` column

In [4]:
# Use explode to give each element of the 'reviews' lists its own row,
# then expand each review dict into separate columns
exploded = reviews.explode('reviews')
norm_reviews = exploded['reviews'].apply(pd.Series)

# Reset the index of both frames so their rows stay aligned.
norm_reviews.reset_index(inplace=True)
exploded.reset_index(inplace=True)

# Concatenate with the original columns
reviews_ok = pd.concat([exploded, norm_reviews], axis=1)
reviews_ok.head()

Unnamed: 0,index,user_id,user_url,reviews,index.1,funny,posted,last_edited,item_id,helpful,recommend,review,0
0,0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted November 5, 20...",0,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...,
1,0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted July 15, 2011....",0,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.,
2,0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted April 21, 2011...",0,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...,
3,1,js41637,http://steamcommunity.com/id/js41637,"{'funny': '', 'posted': 'Posted June 24, 2014....",1,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,
4,1,js41637,http://steamcommunity.com/id/js41637,"{'funny': '', 'posted': 'Posted September 8, 2...",1,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...,
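
The unnesting step can be illustrated on a toy frame. This is only a sketch under assumptions: the names `toy`, `fields` and `flat` are hypothetical, the columns are built with `pd.DataFrame(...tolist())` instead of `apply(pd.Series)`, and rows coming from empty review lists are dropped right away, whereas the notebook removes them later in the null-handling step.

```python
import pandas as pd

# Toy data: one list of review dicts per user; "u2" has no reviews.
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "reviews": [
        [{"item_id": "10", "recommend": True}, {"item_id": "20", "recommend": False}],
        [],
        [{"item_id": "30", "recommend": True}],
    ],
})

# explode gives each list element its own row; an empty list becomes NaN.
exploded = toy.explode("reviews").dropna(subset=["reviews"]).reset_index(drop=True)

# Turn the remaining dicts into columns and attach them to user_id.
fields = pd.DataFrame(exploded["reviews"].tolist())
flat = pd.concat([exploded[["user_id"]], fields], axis=1)
```

`flat` ends up with one row per review and the columns `user_id`, `item_id` and `recommend`.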


In [5]:
# Overwrite 'posted' with the four-digit year extracted from its values (NaN when the text has no year)
reviews_ok['posted'] = reviews_ok['posted'].str.extract(r'(\d{4})')
reviews_ok.head(3)

Unnamed: 0,index,user_id,user_url,reviews,index.1,funny,posted,last_edited,item_id,helpful,recommend,review,0
0,0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted November 5, 20...",0,,2011,,1250,No ratings yet,True,Simple yet with great replayability. In my opi...,
1,0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted July 15, 2011....",0,,2011,,22200,No ratings yet,True,It's unique and worth a playthrough.,
2,0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted April 21, 2011...",0,,2011,,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...,
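
A small sketch of what the regular expression does, using two strings taken from the outputs above. The variable names `posted` and `years` are illustrative, and `expand=False` is passed here so the result is a Series rather than a one-column DataFrame.

```python
import pandas as pd

posted = pd.Series(["Posted November 5, 2011.", "Posted February 3."])

# (\d{4}) captures the first run of four digits; no match yields NaN.
years = posted.str.extract(r"(\d{4})", expand=False)
```

The second value has no year in its text, so it becomes NaN; entries of that kind show up later as nulls in `posted`.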


In [6]:
# Drop the columns that are of little relevance to the later analysis
reviews_ok.drop(['reviews','index','user_url','last_edited','funny','helpful'], axis=1, inplace=True)
# Delete the last column, labelled 0
del reviews_ok[reviews_ok.columns[-1]]

In [7]:
# Build a dictionary with the keys "columna" and "tipo"
# columna - holds the column names
# tipo - holds the distinct Python types found in each column
data_type_process = {
    "columna": reviews_ok.columns.tolist(),
    "tipo": [reviews_ok[columna].apply(type).unique() for columna in reviews_ok.columns]
}

# Display each column name alongside its data types
data_type = pd.DataFrame(data_type_process)
data_type

Unnamed: 0,columna,tipo
0,user_id,[<class 'str'>]
1,posted,"[<class 'str'>, <class 'float'>]"
2,item_id,"[<class 'str'>, <class 'float'>]"
3,recommend,"[<class 'bool'>, <class 'float'>]"
4,review,"[<class 'str'>, <class 'float'>]"
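
The `float` entries in the table above are most likely missing values: in an object column pandas stores a missing value as NaN, and NaN is a Python `float`. A minimal sketch with a made-up column (`col`, `kinds` and `names` are hypothetical names):

```python
import pandas as pd

# Object column mixing strings with one missing value.
col = pd.Series(["1250", float("nan"), "22200"])

# Same inspection as in the cell above: distinct Python types per column.
kinds = col.apply(type).unique()
names = sorted(t.__name__ for t in kinds)
```

`names` lists both `float` and `str`, mirroring the mixed entries reported for `posted`, `item_id` and `review`.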


Finding and handling duplicates

In [8]:
# Store the duplicated rows in `duplicados` so they can be inspected and compared
duplicados = reviews_ok.loc[reviews_ok.duplicated()]
duplicados

Unnamed: 0,user_id,posted,item_id,recommend,review
1114,bokkkbokkk,2015,346110,True,yep
2894,ImSeriouss,2014,218620,True,"Good graphics, fun heists! A bit laggy"
2895,ImSeriouss,2014,105600,True,So fun! DEFINITELY NOT RIP OFF OF MINECRAFT! e...
2896,ImSeriouss,2014,570,True,bobo pinoy
2897,ImSeriouss,2014,211820,True,If you want to play this game.. expect glithes...
...,...,...,...,...,...
44456,76561198092022514,,422400,True,Muy entretenido y una coleccion de armas prome...
44457,76561198092022514,,218620,True,"Tiene una jugabilidad y tematica muy buena :D,..."
44458,76561198092022514,2014,261820,True,"Buen juego, no importa el desarrrollo que tien..."
44459,76561198092022514,2014,224260,True,exelente aporte :D¡¡¡ es una buen mod basado e...
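
As a quick illustration of how `duplicated()` and `drop_duplicates(keep='first')` relate, here is a sketch on a toy frame (the data and the names `toy`, `mask` and `deduped` are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "item_id": ["10", "10", "10"],
    "review": ["yep", "yep", "yep"],
})

# duplicated() flags a row only if an identical row appeared earlier.
mask = toy.duplicated()

# drop_duplicates(keep="first") removes exactly the flagged rows.
deduped = toy.drop_duplicates(keep="first")
```

The original index labels are kept after dropping, so gaps in the index are expected until it is reset.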


In [9]:
reviews_ok = reviews_ok.drop_duplicates(keep='first')
reviews_ok

Unnamed: 0,user_id,posted,item_id,recommend,review
0,76561197970982479,2011,1250,True,Simple yet with great replayability. In my opi...
1,76561197970982479,2011,22200,True,It's unique and worth a playthrough.
2,76561197970982479,2011,43110,True,Great atmosphere. The gunplay can be a bit chu...
3,js41637,2014,251610,True,I know what you think when you see this title ...
4,js41637,2013,227300,True,For a simple (it's actually not all that simpl...
...,...,...,...,...,...
59328,76561198312638244,,70,True,a must have classic from steam definitely wort...
59329,76561198312638244,,362890,True,this game is a perfect remake of the original ...
59330,LydiaMorley,,273110,True,had so much fun plaing this and collecting res...
59331,LydiaMorley,,730,True,:D


Finding and handling null values

In [10]:
# Count the null values in each column
nulos = reviews_ok.isnull().sum()
nulos

user_id         0
posted       9961
item_id        28
recommend      28
review         28
dtype: int64

In [11]:
# Drop the rows that have a null value in any of these columns
reviews_ok = reviews_ok.dropna(subset=['posted', 'item_id', 'review', 'recommend'])

In [12]:
# Select the columns relevant to the analysis, in the desired order, and reset the index
reviews_ok = reviews_ok[['user_id','item_id','review','recommend','posted']].reset_index(drop=True)
reviews_ok

Unnamed: 0,user_id,item_id,review,recommend,posted
0,76561197970982479,1250,Simple yet with great replayability. In my opi...,True,2011
1,76561197970982479,22200,It's unique and worth a playthrough.,True,2011
2,76561197970982479,43110,Great atmosphere. The gunplay can be a bit chu...,True,2011
3,js41637,251610,I know what you think when you see this title ...,True,2014
4,js41637,227300,For a simple (it's actually not all that simpl...,True,2013
...,...,...,...,...,...
48493,wayfeng,730,its FUNNNNNNNN,True,2015
48494,76561198251004808,253980,Awesome fantasy game if you don't mind the gra...,True,2015
48495,72947282842,730,Prettyy Mad Game,True,2015
48496,ApxLGhost,730,AMAZING GAME 10/10,True,2015


**I store the DataFrame in the Parquet file format to make it more efficient and easier to use in later queries**

An independent copy of the `reviews_ok` DataFrame is created and assigned to `reviews_copy`, so that any change made to the copy leaves the original DataFrame untouched.

In [32]:
# Create the copy
reviews_copy = reviews_ok.copy()
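
To see why `.copy()` matters, the sketch below contrasts it with plain assignment on a toy frame (the names `original`, `alias` and `independent` are hypothetical):

```python
import pandas as pd

original = pd.DataFrame({"posted": ["2011", "2014"]})

alias = original                # plain assignment: a second name for the same object
independent = original.copy()   # copy(): new object with its own data (deep by default)

# Modifying the copy leaves the original untouched.
independent.loc[0, "posted"] = "1999"
```

After this, `original` still holds `"2011"` in its first row, while `alias` would have reflected any change made through it.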

#### **First I save the DataFrame as a CSV file named `user_reviews.csv`**

In [34]:
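# Folder where the output files are written
ruta_carpeta = "data"

# Create the folder if it does not exist yet
if not os.path.exists(ruta_carpeta):
    os.makedirs(ruta_carpeta)

# Write the DataFrame to CSV without the index column
ruta_guardar_csv = 'data/user_reviews.csv'
reviews_copy.to_csv(ruta_guardar_csv, index=False, encoding='utf-8')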

#### **I convert the CSV and save it as `user_reviews.parquet`**

In [35]:
# Read the CSV file
reviews_copy = pd.read_csv("data/user_reviews.csv")

# Path and file name for the Parquet output
ruta_guardar_parquet = "data/user_reviews.parquet"

# Convert the DataFrame read from the CSV into an Arrow table and write it as Parquet
table = pa.Table.from_pandas(reviews_copy)
pq.write_table(table,ruta_guardar_parquet)
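
One thing to keep in mind with the CSV round trip: `pd.read_csv` infers the column types again, so columns that held digit-only strings (such as `item_id` and `posted`) come back as integers unless a `dtype` is given. The sketch below shows this on a toy frame using an in-memory buffer; the data and the names `df`, `inferred` and `pinned` are assumptions for illustration only.

```python
import io

import pandas as pd

# Toy frame where both columns hold strings, as in reviews_ok.
df = pd.DataFrame({
    "user_id": ["72947282842", "js41637"],
    "item_id": ["730", "251610"],
})

# Write to an in-memory CSV instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default read: types are inferred, so item_id becomes an integer column.
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Explicit dtype: both columns are kept as strings.
buffer.seek(0)
pinned = pd.read_csv(buffer, dtype={"user_id": str, "item_id": str})
```

If the string types should survive into the Parquet file, passing `dtype` when reading the CSV is one option.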