# **ETL for Project 1: MLOps**
### user_reviews table
#### Importing libraries:

In [2]:
import pandas as pd
import ast

## **Starting the Extraction**

In [None]:
ruta_user_reviews = r"C:\Users\Usuario\Desktop\Labs\Proyecto_1\datasets_originales\australian_user_reviews.json"

user_reviews = []  # List that stores the parsed records

with open(ruta_user_reviews, encoding="utf-8") as reviews:
    for linea in reviews:  # Parse each line as a Python literal
        user_reviews.append(ast.literal_eval(linea))

for i in range(5):   # Check that the content was stored in the list
    print(user_reviews[i])
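
The cell above parses each line with `ast.literal_eval` instead of `json.loads`, presumably because the lines are Python literals (single-quoted strings, `True`/`False`) rather than strict JSON. A minimal sketch of the difference, using a made-up line that only imitates the records shown in this notebook:

```python
import ast
import json

# Hypothetical line imitating the dataset's format: single quotes and Python's True
linea = "{'user_id': 'js41637', 'reviews': [{'item_id': '251610', 'recommend': True}]}"

# Strict JSON requires double-quoted strings, so json.loads rejects this line
try:
    json.loads(linea)
    es_json_valido = True
except json.JSONDecodeError:
    es_json_valido = False

# ast.literal_eval safely evaluates Python literals (dicts, lists, strings, booleans)
registro = ast.literal_eval(linea)

print(es_json_valido)
print(registro["reviews"][0]["recommend"])
```

If the file turned out to be valid JSON Lines, `json.loads` would be the more conventional choice.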

#### We convert the list into a DataFrame and inspect it:

In [4]:
df_reviews = pd.DataFrame(user_reviews)
df_reviews.head(5)

| | user_id | user_url | reviews |
|---|---|---|---|
| 0 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | [{'funny': '', 'posted': 'Posted November 5, 2... |
| 1 | js41637 | http://steamcommunity.com/id/js41637 | [{'funny': '', 'posted': 'Posted June 24, 2014... |
| 2 | evcentric | http://steamcommunity.com/id/evcentric | [{'funny': '', 'posted': 'Posted February 3.',... |
| 3 | doctr | http://steamcommunity.com/id/doctr | [{'funny': '', 'posted': 'Posted October 14, 2... |
| 4 | maplemage | http://steamcommunity.com/id/maplemage | [{'funny': '3 people found this review funny',... |


## **Starting the Transformation**

In [5]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25799 entries, 0 to 25798
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   user_id   25799 non-null  object
 1   user_url  25799 non-null  object
 2   reviews   25799 non-null  object
dtypes: object(3)
memory usage: 604.8+ KB


We check whether any user IDs are repeated:

In [6]:
frecuencia = df_reviews["user_id"].value_counts()
repetidos = frecuencia[frecuencia > 1]
print(repetidos)

user_id
76561198027488037    3
76561198045953692    3
76561198051777058    3
76561198100326818    3
blablabla174         3
                    ..
oLdxZ5vt             2
76561198064484479    2
76561198036124769    2
relesprit            2
76561198088807138    2
Name: count, Length: 309, dtype: int64
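
As a complement to `value_counts()`, `duplicated()` can separate two numbers that are easy to confuse: how many IDs are repeated and how many rows are surplus. A minimal sketch on toy data (the IDs below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy data: "a" appears three times, "b" twice, "c" once
df = pd.DataFrame({"user_id": ["a", "b", "a", "c", "b", "a"]})

# keep=False marks every row whose user_id appears more than once
mascara = df["user_id"].duplicated(keep=False)
filas_repetidas = df[mascara].sort_values("user_id")

# Number of distinct IDs that are repeated
n_ids_repetidos = df.loc[mascara, "user_id"].nunique()

# Number of rows that drop_duplicates(keep="first") would remove
n_filas_sobrantes = int(df["user_id"].duplicated().sum())

print(n_ids_repetidos, n_filas_sobrantes)
```

On the real data, the first number corresponds to the 309 repeated IDs listed above, and the second should match the 314 rows that disappear between the two `info()` outputs in this notebook (25,799 down to 25,485).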


We check whether the repeated records carry the same information:

In [7]:
df_reviews[df_reviews["user_id"] == '76561198027488037']

| | user_id | user_url | reviews |
|---|---|---|---|
| 6935 | 76561198027488037 | http://steamcommunity.com/profiles/76561198027... | [{'funny': '', 'posted': 'Posted May 12.', 'la... |
| 6936 | 76561198027488037 | http://steamcommunity.com/profiles/76561198027... | [{'funny': '', 'posted': 'Posted May 12.', 'la... |
| 15693 | 76561198027488037 | http://steamcommunity.com/profiles/76561198027... | [{'funny': '', 'posted': 'Posted May 12.', 'la... |
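
The inspection above covers a single user. One possible way to extend the check to every repeated ID is to compare the text representation of the `reviews` lists within each `user_id`. This is only a sketch on toy data; the variable names `variantes` and `ids_con_diferencias` are hypothetical:

```python
import pandas as pd

# Toy data: user "a" has identical copies, user "b" has two different versions
df = pd.DataFrame({
    "user_id": ["a", "a", "b", "b"],
    "reviews": [[{"item_id": "1"}], [{"item_id": "1"}], [{"item_id": "2"}], []],
})

# Lists are not hashable, so we compare their string representation instead
variantes = df["reviews"].astype(str).groupby(df["user_id"]).nunique()

# IDs whose repeated rows do NOT all carry the same reviews
ids_con_diferencias = variantes[variantes > 1].index.tolist()

print(ids_con_diferencias)
```

An empty list here would support dropping duplicates with `keep="first"`; a non-empty one would suggest looking at those users before discarding rows.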


We drop the repeated IDs, keeping the first occurrence of each:

In [8]:
df_reviews = df_reviews.drop_duplicates(subset=["user_id"], keep="first")

# Check whether any duplicates remain
frecuencia = df_reviews["user_id"].value_counts()
repetidos = frecuencia[frecuencia > 1]
print(repetidos)

Series([], Name: count, dtype: int64)


We reset the index:

In [9]:
df_reviews = df_reviews.reset_index(drop=True)
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25485 entries, 0 to 25484
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   user_id   25485 non-null  object
 1   user_url  25485 non-null  object
 2   reviews   25485 non-null  object
dtypes: object(3)
memory usage: 597.4+ KB


To unnest the reviews, we first expand each user's list so that every review gets its own row:

In [10]:
df_reviews_expandido = df_reviews.explode("reviews")
df_reviews_expandido.head(5)

| | user_id | user_url | reviews |
|---|---|---|---|
| 0 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | {'funny': '', 'posted': 'Posted November 5, 20... |
| 0 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | {'funny': '', 'posted': 'Posted July 15, 2011.... |
| 0 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | {'funny': '', 'posted': 'Posted April 21, 2011... |
| 1 | js41637 | http://steamcommunity.com/id/js41637 | {'funny': '', 'posted': 'Posted June 24, 2014.... |
| 1 | js41637 | http://steamcommunity.com/id/js41637 | {'funny': '', 'posted': 'Posted September 8, 2... |
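
Two side effects of `explode` are worth keeping in mind: the original index label is repeated for every review of the same user, and an empty list becomes a single row with `NaN`. A minimal sketch on toy data (the users and items are made up):

```python
import pandas as pd

# Toy data: u1 has two reviews, u2 has one, u3 has none
df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "reviews": [[{"item_id": "10"}, {"item_id": "20"}], [{"item_id": "30"}], []],
})

expandido = df.explode("reviews")

# The index label of each user is repeated once per review
indices = expandido.index.tolist()

# An empty list is turned into one row holding NaN
n_sin_reviews = int(expandido["reviews"].isna().sum())

print(indices, n_sin_reviews)
```

Whether the real dataset contains users with an empty `reviews` list is not shown in this notebook, so counting the `NaN` rows after exploding is a cheap sanity check.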


We normalize the expanded rows, turning the dictionaries in the reviews column into a DataFrame with one column per key:

In [11]:
df_reviews_desanidado = pd.json_normalize(df_reviews_expandido["reviews"])
df_reviews_desanidado.head(5)

| | funny | posted | last_edited | item_id | helpful | recommend | review |
|---|---|---|---|---|---|---|---|
| 0 | | Posted November 5, 2011. | | 1250 | No ratings yet | True | Simple yet with great replayability. In my opi... |
| 1 | | Posted July 15, 2011. | | 22200 | No ratings yet | True | It's unique and worth a playthrough. |
| 2 | | Posted April 21, 2011. | | 43110 | No ratings yet | True | Great atmosphere. The gunplay can be a bit chu... |
| 3 | | Posted June 24, 2014. | | 251610 | 15 of 20 people (75%) found this review helpful | True | I know what you think when you see this title ... |
| 4 | | Posted September 8, 2013. | | 227300 | 0 of 1 people (0%) found this review helpful | True | For a simple (it's actually not all that simpl... |
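
To make the normalization step concrete, here is a small sketch of what `pd.json_normalize` does with a list of flat dictionaries. The records are abbreviated, made-up versions of the reviews above, and passing a plain list via `.tolist()` is a choice of this sketch rather than what the cell above does:

```python
import pandas as pd

# Abbreviated, illustrative review records
reviews = pd.Series([
    {"item_id": "1250", "recommend": True, "review": "Simple yet fun"},
    {"item_id": "22200", "recommend": True, "review": "Worth a playthrough"},
])

# Each dictionary key becomes a column; each dictionary becomes a row
tabla = pd.json_normalize(reviews.tolist())

print(tabla.columns.tolist())
print(tabla.dtypes)
```

In this toy example `item_id` stays a string because that is how it is stored in the dictionaries; if the same holds for the real data, a type conversion may be needed before joining with other tables.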


We reset the index of both DataFrames to avoid alignment errors, since `explode` leaves repeated index labels:

In [12]:
df_reviews_expandido = df_reviews_expandido.reset_index(drop=True)
df_reviews_desanidado = df_reviews_desanidado.reset_index(drop=True)

We concatenate both DataFrames column-wise to build the complete DataFrame:

In [13]:
df_reviews_final = pd.concat([df_reviews_expandido, df_reviews_desanidado], axis=1)
df_reviews_final.head(2)

| | user_id | user_url | reviews | funny | posted | last_edited | item_id | helpful | recommend | review |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | {'funny': '', 'posted': 'Posted November 5, 20... | | Posted November 5, 2011. | | 1250 | No ratings yet | True | Simple yet with great replayability. In my opi... |
| 1 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | {'funny': '', 'posted': 'Posted July 15, 2011.... | | Posted July 15, 2011. | | 22200 | No ratings yet | True | It's unique and worth a playthrough. |


We drop the reviews column, since it is now redundant:

In [14]:
df_reviews_final = df_reviews_final.drop("reviews", axis=1)
df_reviews_final.head(3)

| | user_id | user_url | funny | posted | last_edited | item_id | helpful | recommend | review |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | | Posted November 5, 2011. | | 1250 | No ratings yet | True | Simple yet with great replayability. In my opi... |
| 1 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | | Posted July 15, 2011. | | 22200 | No ratings yet | True | It's unique and worth a playthrough. |
| 2 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | | Posted April 21, 2011. | | 43110 | No ratings yet | True | Great atmosphere. The gunplay can be a bit chu... |
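
The explode, normalize, reset, concatenate and drop steps can also be folded into one helper. The function below is a hypothetical sketch, not the notebook's own code: the name `desanidar_reviews` is invented, and the `dropna` call is an added assumption that users without reviews should be discarded.

```python
import pandas as pd

def desanidar_reviews(df):
    """Return one row per review, with the review fields as columns."""
    expandido = (
        df.explode("reviews")
        .dropna(subset=["reviews"])   # assumption: skip users with no reviews
        .reset_index(drop=True)       # unique 0..n-1 index for safe alignment
    )
    # json_normalize on a list also yields a 0..n-1 index, so rows line up
    campos = pd.json_normalize(expandido["reviews"].tolist())
    return pd.concat([expandido.drop(columns="reviews"), campos], axis=1)

# Toy data: u1 has two reviews, u2 has none
df = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "reviews": [
        [{"item_id": "10", "recommend": True}, {"item_id": "20", "recommend": False}],
        [],
    ],
})

resultado = desanidar_reviews(df)
print(resultado)
```

Keeping the steps in separate cells, as done here, makes each intermediate result easier to inspect; the helper is mainly useful if the same unnesting has to be repeated for other tables.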


The columns 'user_url', 'funny', 'posted', 'last_edited' and 'helpful' are not useful for my model or for the endpoints,  
so I will drop them:

In [15]:
a_eliminar = ['user_url', 'funny', 'posted', 'last_edited', 'helpful']
df_reviews_final = df_reviews_final.drop(a_eliminar, axis=1)

df_reviews_final.head(3)

| | user_id | item_id | recommend | review |
|---|---|---|---|---|
| 0 | 76561197970982479 | 1250 | True | Simple yet with great replayability. In my opi... |
| 1 | 76561197970982479 | 22200 | True | It's unique and worth a playthrough. |
| 2 | 76561197970982479 | 43110 | True | Great atmosphere. The gunplay can be a bit chu... |


## **Starting the Load**
**We save the DataFrame as a Parquet file**

In [16]:
df_reviews_final.to_parquet("./datasets/user_reviews.parquet")
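
`to_parquet` does not create missing directories, and it needs a Parquet engine such as `pyarrow` or `fastparquet` to be installed. A small wrapper can take care of the first point; the function name `guardar_parquet` is hypothetical and this is only a sketch:

```python
from pathlib import Path

import pandas as pd

def guardar_parquet(df, ruta):
    """Write df to ruta as Parquet, creating the parent folder if needed."""
    ruta = Path(ruta)
    # Create ./datasets (or any parent folder) before writing
    ruta.parent.mkdir(parents=True, exist_ok=True)
    # Requires pyarrow or fastparquet; pandas raises ImportError otherwise
    df.to_parquet(ruta)
    return ruta
```

With this helper, the cell above would become `guardar_parquet(df_reviews_final, "./datasets/user_reviews.parquet")`, and reading the file back with `pd.read_parquet` is a quick way to confirm the write succeeded.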