# ETL: User Reviews

In [2]:
import json
import os
import pandas as pd
import gzip
import ast
import numpy as np
import re
from datetime import datetime

### When I tried to load the file into a DataFrame, I ran into this error:

**JSONDecodeError**: Expecting property name enclosed in double quotes: line 1 column 3 (char 2)

The error arises because each line of the file is a Python dictionary literal (single-quoted keys, `True`/`False`) rather than valid JSON. After an extensive search on Stack Overflow, I found an answer that provided the breakthrough:

[Convert JSON to pd.DataFrame](https://stackoverflow.com/questions/55338899/convert-json-to-pd-dataframe/65427497#65427497)
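A minimal sketch of why switching to `ast.literal_eval` helps. The `sample_line` value is a made-up record in the same style as the file, not a line taken from it: `json.loads` rejects the single-quoted keys, while `ast.literal_eval` parses the line as a Python literal.

```python
import ast
import json

# Hypothetical line in the same style as the file: a Python dict literal, not JSON
sample_line = "{'user_id': 'example_user', 'reviews': [{'recommend': True, 'review': 'ok'}]}"

try:
    json.loads(sample_line)
    json_failed = False
except json.JSONDecodeError as error:
    # Single quotes are not valid JSON string delimiters
    json_failed = True
    print(f"json.loads failed: {error}")

# literal_eval only evaluates Python literals (dicts, lists, strings, booleans...)
record = ast.literal_eval(sample_line)
print(record["reviews"][0]["recommend"])
```

Unlike `eval`, `ast.literal_eval` does not execute arbitrary expressions, which makes it a reasonable choice for files of this kind.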

In [3]:
rows = []

# Open the gzip-compressed JSON file
with gzip.open(r'C:\Users\flore\OneDrive\Escritorio\Etapa Labs\MLOPs\01. PI MLOps - STEAM\user_reviews.json.gz', 'rb') as f:
    # Iterate over the file line by line (avoids building a list of all lines first)
    for line in f:
        # Decode the line and evaluate it as a Python literal
        rows.append(ast.literal_eval(line.decode('utf-8')))
        
# Convert the list of dictionaries into a DataFrame
df_reviews = pd.DataFrame(rows)

# Display the first few rows of the DataFrame
df_reviews.head()

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."


Now we can display the data in the file: there are 3 columns, and the last one contains lists of dictionaries.
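As a quick way to inspect such a nested column, the sketch below counts the reviews per user. The `sample` DataFrame is an illustrative stand-in with made-up values, not the real data.

```python
import pandas as pd

# Illustrative stand-in for the loaded data: one list of review dicts per user
sample = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "reviews": [[{"item_id": "10"}, {"item_id": "20"}], []],
})

# Length of each list = number of reviews written by that user
reviews_per_user = sample["reviews"].map(len)
total_reviews = int(reviews_per_user.sum())
print(reviews_per_user.tolist(), total_reviews)
```

The total gives the number of rows to expect once the column is unnested, which is a handy sanity check.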

In [4]:
df_reviews.columns

Index(['user_id', 'user_url', 'reviews'], dtype='object')

In [5]:
df_reviews.shape

(25799, 3)

In [6]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25799 entries, 0 to 25798
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   user_id   25799 non-null  object
 1   user_url  25799 non-null  object
 2   reviews   25799 non-null  object
dtypes: object(3)
memory usage: 604.8+ KB


In [7]:
# Check for duplicates based on 'user_id' and 'user_url'

duplicated_rows = df_reviews[df_reviews.duplicated(subset=['user_id', 'user_url'], keep=False)]
duplicated_rows

Unnamed: 0,user_id,user_url,reviews
9,76561198156664158,http://steamcommunity.com/profiles/76561198156...,"[{'funny': '', 'posted': 'Posted June 16.', 'l..."
50,Rivtex,http://steamcommunity.com/id/Rivtex,"[{'funny': '', 'posted': 'Posted December 23, ..."
83,76561198094224872,http://steamcommunity.com/profiles/76561198094...,[]
119,DieMadchenschanderin,http://steamcommunity.com/id/DieMadchenschanderin,"[{'funny': '', 'posted': 'Posted August 29, 20..."
147,relesprit,http://steamcommunity.com/id/relesprit,"[{'funny': '', 'posted': 'Posted December 27, ..."
...,...,...,...
17819,76561198076474887,http://steamcommunity.com/profiles/76561198076...,"[{'funny': '', 'posted': 'Posted April 12.', '..."
17916,yolofaceguy,http://steamcommunity.com/id/yolofaceguy,"[{'funny': '', 'posted': 'Posted October 31, 2..."
18028,76561198075591109,http://steamcommunity.com/profiles/76561198075...,"[{'funny': '', 'posted': 'Posted December 26, ..."
18234,76561198092022514,http://steamcommunity.com/profiles/76561198092...,"[{'funny': '', 'posted': 'Posted July 3.', 'la..."


In [8]:
# Sort the DataFrame by 'user_id' so duplicated rows appear next to each other
df_sorted = df_reviews.sort_values(by='user_id')

# Look for duplicated rows based on the 'user_id' and 'user_url' columns
duplicated_rows = df_sorted[df_sorted.duplicated(subset=['user_id', 'user_url'], keep=False)]
duplicated_rows

Unnamed: 0,user_id,user_url,reviews
12888,05041129,http://steamcommunity.com/id/05041129,"[{'funny': '', 'posted': 'Posted May 18, 2015...."
5250,05041129,http://steamcommunity.com/id/05041129,"[{'funny': '', 'posted': 'Posted May 18, 2015...."
3134,111222333444555666888,http://steamcommunity.com/id/11122233344455566...,"[{'funny': '', 'posted': 'Posted December 22, ..."
3133,111222333444555666888,http://steamcommunity.com/id/11122233344455566...,"[{'funny': '', 'posted': 'Posted December 22, ..."
4138,29123,http://steamcommunity.com/id/29123,"[{'funny': '', 'posted': 'Posted March 26.', '..."
...,...,...,...
2721,xXAussieRockXx,http://steamcommunity.com/id/xXAussieRockXx,"[{'funny': '', 'posted': 'Posted July 17, 2015..."
2680,yolofaceguy,http://steamcommunity.com/id/yolofaceguy,"[{'funny': '', 'posted': 'Posted October 31, 2..."
17916,yolofaceguy,http://steamcommunity.com/id/yolofaceguy,"[{'funny': '', 'posted': 'Posted October 31, 2..."
5855,zeroblade,http://steamcommunity.com/id/zeroblade,"[{'funny': '', 'posted': 'Posted November 30, ..."


### Comparing the rows against each other confirms that the records are duplicated, so we remove the duplicates.
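The comparison can also be done programmatically. The sketch below uses a small made-up `sample` DataFrame and checks that rows sharing the same keys also carry the same `reviews` content. Comparing `repr` strings is an assumption made here because lists are unhashable; it treats two lists as equal only if their dictionaries list the keys in the same order.

```python
import pandas as pd

# Illustrative data: user "a" appears twice with identical reviews
sample = pd.DataFrame({
    "user_id": ["a", "a", "b"],
    "user_url": ["url/a", "url/a", "url/b"],
    "reviews": [[{"item_id": "1"}], [{"item_id": "1"}], []],
})

key_columns = ["user_id", "user_url"]
dupes = sample[sample.duplicated(subset=key_columns, keep=False)]

# Lists are unhashable, so compare their text representation instead
payload_variants = (
    dupes.assign(reviews_repr=dupes["reviews"].map(repr))
    .groupby(key_columns)["reviews_repr"]
    .nunique()
)

# True when every duplicated key has exactly one distinct reviews payload
all_identical = bool((payload_variants == 1).all())
print(all_identical)
```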

In [9]:
# Drop duplicates based on the 'user_id' and 'user_url' columns, keeping the first occurrence of each record
df_reviews = df_reviews.drop_duplicates(subset=['user_id', 'user_url'], keep='first')
df_reviews

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."
...,...,...,...
25794,76561198306599751,http://steamcommunity.com/profiles/76561198306...,"[{'funny': '', 'posted': 'Posted May 31.', 'la..."
25795,Ghoustik,http://steamcommunity.com/id/Ghoustik,"[{'funny': '', 'posted': 'Posted June 17.', 'l..."
25796,76561198310819422,http://steamcommunity.com/profiles/76561198310...,"[{'funny': '1 person found this review funny',..."
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,"[{'funny': '', 'posted': 'Posted July 21.', 'l..."


## Now we can investigate the reviews column.
This column contains all the reviews posted by the user identified by the unique user ID.

### We display a sample row of the third column to understand which information is valuable and how to extract it. This information is the input to our sentiment analysis, so it is crucial.

In [77]:
df_reviews.iloc[1,2]

[{'funny': '',
  'posted': 'Posted June 24, 2014.',
  'last_edited': '',
  'item_id': '251610',
  'helpful': '15 of 20 people (75%) found this review helpful',
  'recommend': True,
  'review': 'I know what you think when you see this title "Barbie Dreamhouse Party" but do not be intimidated by it\'s title, this is easily one of my GOTYs. You don\'t get any of that cliche game mechanics that all the latest games have, this is simply good core gameplay. Yes, you can\'t 360 noscope your friends, but what you can do is show them up with your bad ♥♥♥ dance moves and put them to shame as you show them what true fashion and color combinations are.I know this game says for kids but, this is easily for any age range and any age will have a blast playing this.8/8'},
 {'funny': '',
  'posted': 'Posted September 8, 2013.',
  'last_edited': '',
  'item_id': '227300',
  'helpful': '0 of 1 people (0%) found this review helpful',
  'recommend': True,
  'review': "For a simple (it's actually not all th

In [10]:
df_reviews.iloc[100,2]

[{'funny': '',
  'posted': 'Posted October 13, 2014.',
  'last_edited': '',
  'item_id': '209870',
  'helpful': '3 of 8 people (38%) found this review helpful',
  'recommend': True,
  'review': 'Its a very fun game i recomend as its nearly like TITANFALL but its FREE!Play this game now'}]

### We need to extract the information inside each field of the third column into separate columns.
The fields that become columns are: funny, posted, last_edited, item_id, helpful, recommend, review.

In [11]:
# 'reviews' is the column that contains a list of dictionaries
# We need the values inside it to build the sentiment analysis
# Initialize an empty list to hold the unnested reviews
unnested_reviews = []

# Iterate over each row of the 'df_reviews' DataFrame
for index, row in df_reviews.iterrows():
    user_id = row['user_id']
    user_url = row['user_url']
    reviews = row['reviews']
    
    # Within the cell, iterate over each review in the 'reviews' list
    for review in reviews:
        new_review = {
            'user_id': user_id,
            'user_url': user_url,
            'funny': review.get('funny', ''),
            'posted': review.get('posted', ''),
            'last_edited': review.get('last_edited', ''),
            'item_id': review.get('item_id', ''),
            'helpful': review.get('helpful', ''),
            'recommend': review.get('recommend', ''),
            'review_text': review.get('review', '')  # Rename 'review' to 'review_text'
        }
        
        # Append the review to the list
        unnested_reviews.append(new_review)

# Create a new DataFrame 'df_reviews_unnested' from the list of unnested reviews
df_reviews_unnested = pd.DataFrame(unnested_reviews)
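As an alternative to the explicit loop, the same unnesting could be sketched with `DataFrame.explode` and `pd.json_normalize`. This is not the approach used in this notebook; the `sample` DataFrame and its values are made up for illustration. Users with an empty list produce a missing value after `explode`, so those rows are dropped to match the loop, which skips them.

```python
import pandas as pd

# Illustrative data: u2 has no reviews
sample = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "reviews": [
        [{"item_id": "10", "recommend": True, "review": "good"},
         {"item_id": "20", "recommend": False, "review": "bad"}],
        [],
        [{"item_id": "30", "recommend": True, "review": ""}],
    ],
})

# One row per review; empty lists become NaN and are removed
exploded = (
    sample.explode("reviews")
    .dropna(subset=["reviews"])
    .reset_index(drop=True)
)

# Expand each review dict into its own columns
fields = pd.json_normalize(exploded["reviews"].tolist())

flat = pd.concat([exploded[["user_id"]], fields], axis=1).rename(
    columns={"review": "review_text"}
)
print(flat)
```

One difference from the loop: keys missing from some dictionaries become NaN here, whereas `review.get(key, '')` fills them with an empty string.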

In [12]:
df_reviews_unnested

Unnamed: 0,user_id,user_url,funny,posted,last_edited,item_id,helpful,recommend,review_text
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.
2,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
3,js41637,http://steamcommunity.com/id/js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
4,js41637,http://steamcommunity.com/id/js41637,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...
...,...,...,...,...,...,...,...,...,...
58425,76561198312638244,http://steamcommunity.com/profiles/76561198312...,,Posted July 10.,,70,No ratings yet,True,a must have classic from steam definitely wort...
58426,76561198312638244,http://steamcommunity.com/profiles/76561198312...,,Posted July 8.,,362890,No ratings yet,True,this game is a perfect remake of the original ...
58427,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,1 person found this review funny,Posted July 3.,,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...
58428,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,,Posted July 20.,,730,No ratings yet,True,:D


In [13]:
df_reviews_unnested.columns

Index(['user_id', 'user_url', 'funny', 'posted', 'last_edited', 'item_id',
       'helpful', 'recommend', 'review_text'],
      dtype='object')

In [14]:
df_reviews = df_reviews_unnested

In [15]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58430 entries, 0 to 58429
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      58430 non-null  object
 1   user_url     58430 non-null  object
 2   funny        58430 non-null  object
 3   posted       58430 non-null  object
 4   last_edited  58430 non-null  object
 5   item_id      58430 non-null  object
 6   helpful      58430 non-null  object
 7   recommend    58430 non-null  bool  
 8   review_text  58430 non-null  object
dtypes: bool(1), object(8)
memory usage: 3.6+ MB


## Checking for missing values and duplicates
After unnesting, many columns hold empty strings ('') as values; these should be converted to missing values (NaN).

In [16]:
# Copy the reviews DataFrame
df_reviews_copy = df_reviews.copy()

# Replace empty strings with NaN
df_reviews_copy.replace('', np.nan, inplace=True)

# Assign the copy back to the original name
df_reviews = df_reviews_copy

In [17]:
# Count the missing values in each column
none_count = df_reviews.isnull().sum()
# Compute the percentage of missing values for each column
none_percentage = (none_count / len(df_reviews)) * 100
# Combine the counts and percentages into a DataFrame
none_info = pd.DataFrame({'None Count': none_count, 'None Percentage': none_percentage})
# Add a column with the total number of records
none_info['Total Registers'] = len(df_reviews)
# Reorder the columns
none_info = none_info[['Total Registers', 'None Count', 'None Percentage']]
none_info

Unnamed: 0,Total Registers,None Count,None Percentage
user_id,58430,0,0.0
user_url,58430,0,0.0
funny,58430,50420,86.291289
posted,58430,0,0.0
last_edited,58430,52393,89.667979
item_id,58430,0,0.0
helpful,58430,0,0.0
recommend,58430,0,0.0
review_text,58430,30,0.051343
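Since this summary is computed more than once in the notebook, it could be wrapped in a small helper. The sketch below is one way to do that; the function name `missing_summary` and the `sample` data are hypothetical.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return total rows, missing count and missing percentage per column."""
    missing_count = df.isna().sum()
    return pd.DataFrame({
        "Total Registers": len(df),  # scalar, broadcast to every row
        "None Count": missing_count,
        "None Percentage": missing_count / len(df) * 100,
    })


# Illustrative data with empty strings standing in for missing values
sample = pd.DataFrame({
    "funny": ["", "1 person found this review funny", "", ""],
    "item_id": ["10", "20", "30", "40"],
})

summary = missing_summary(sample.replace("", np.nan))
print(summary)
```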


### Given what we now know about the review data, we can drop the two columns that hold mostly missing values: last_edited and funny

In [18]:
df_reviews['last_edited'].value_counts()

last_edited
Last edited November 25, 2013.    99
Last edited October 17, 2015.     18
Last edited July 25, 2015.        17
Last edited June 22, 2015.        16
Last edited December 29, 2015.    16
                                  ..
Last edited August 13, 2014.       1
Last edited February 26, 2014.     1
Last edited November 30, 2014.     1
Last edited February 28, 2014.     1
Last edited August 15, 2014.       1
Name: count, Length: 1014, dtype: int64

In [19]:
df_reviews['funny'].value_counts()

funny
1 person found this review funny        5083
2 people found this review funny        1213
3 people found this review funny         488
4 people found this review funny         263
5 people found this review funny         162
                                        ... 
58 people found this review funny          1
405 people found this review funny         1
105 people found this review funny         1
1,130 people found this review funny       1
825 people found this review funny         1
Name: count, Length: 185, dtype: int64

In [20]:
# Drop the two mostly empty columns; 'user_url' is dropped as well
df_reviews.drop(['last_edited', 'funny', 'user_url'], axis=1, inplace=True)

In [21]:
df_reviews

Unnamed: 0,user_id,posted,item_id,helpful,recommend,review_text
0,76561197970982479,"Posted November 5, 2011.",1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,76561197970982479,"Posted July 15, 2011.",22200,No ratings yet,True,It's unique and worth a playthrough.
2,76561197970982479,"Posted April 21, 2011.",43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
3,js41637,"Posted June 24, 2014.",251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
4,js41637,"Posted September 8, 2013.",227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...
...,...,...,...,...,...,...
58425,76561198312638244,Posted July 10.,70,No ratings yet,True,a must have classic from steam definitely wort...
58426,76561198312638244,Posted July 8.,362890,No ratings yet,True,this game is a perfect remake of the original ...
58427,LydiaMorley,Posted July 3.,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...
58428,LydiaMorley,Posted July 20.,730,No ratings yet,True,:D


### We can also remove the records where review_text is missing.

In [22]:
# Drop the rows with missing values in 'review_text'
df_reviews = df_reviews.dropna(subset=['review_text'])
df_reviews

Unnamed: 0,user_id,posted,item_id,helpful,recommend,review_text
0,76561197970982479,"Posted November 5, 2011.",1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,76561197970982479,"Posted July 15, 2011.",22200,No ratings yet,True,It's unique and worth a playthrough.
2,76561197970982479,"Posted April 21, 2011.",43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
3,js41637,"Posted June 24, 2014.",251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
4,js41637,"Posted September 8, 2013.",227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...
...,...,...,...,...,...,...
58425,76561198312638244,Posted July 10.,70,No ratings yet,True,a must have classic from steam definitely wort...
58426,76561198312638244,Posted July 8.,362890,No ratings yet,True,this game is a perfect remake of the original ...
58427,LydiaMorley,Posted July 3.,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...
58428,LydiaMorley,Posted July 20.,730,No ratings yet,True,:D


In [23]:
# Count the missing values in each column
none_count = df_reviews.isnull().sum()
# Compute the percentage of missing values for each column
none_percentage = (none_count / len(df_reviews)) * 100
# Combine the counts and percentages into a DataFrame
none_info = pd.DataFrame({'None Count': none_count, 'None Percentage': none_percentage})
# Add a column with the total number of records
none_info['Total Registers'] = len(df_reviews)
# Reorder the columns
none_info = none_info[['Total Registers', 'None Count', 'None Percentage']]
none_info

Unnamed: 0,Total Registers,None Count,None Percentage
user_id,58400,0,0.0
posted,58400,0,0.0
item_id,58400,0,0.0
helpful,58400,0,0.0
recommend,58400,0,0.0
review_text,58400,0,0.0


### Posted
This column holds the date on which the user posted the game review.<br>
The format is 'Posted <Month> <day>, <year>.', and we need to convert it to yyyy-mm-dd.<br>
We extract the month, day and year from the string and build a 'date' column and a 'year' column.<br>
The column does not follow a single format, however. For example:<br>

- 'Posted November 5, 2011.'<br>
- 'Posted July 2.'<br>

Entries like the second one lack the year. The conversion function below assigns them the current year as a placeholder, so they can be identified and filtered out later.<br>
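A vectorized variant of this parsing is sketched below. It differs from the notebook's approach in one deliberate way, which is an assumption of this sketch: dates without a year become `NaT` instead of receiving a placeholder year. The `posted` values are illustrative, and `%B` assumes English month names.

```python
import pandas as pd

# Illustrative values covering both formats
posted = pd.Series([
    "Posted November 5, 2011.",
    "Posted July 2.",
    "Posted June 24, 2014.",
])

# Capture month, day and year; rows without a year do not match and yield NaN
parts = posted.str.extract(r"Posted (?P<month>\w+) (?P<day>\d{1,2}), (?P<year>\d{4})")

# Rebuild a clean "Month day year" string (NaN propagates for unmatched rows)
date_text = parts["month"] + " " + parts["day"] + " " + parts["year"]

# Unparseable or missing values become NaT
dates = pd.to_datetime(date_text, format="%B %d %Y", errors="coerce")
print(dates)
```

With this variant, the year-less records can be selected with `dates.isna()` rather than by comparing against a specific year.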

In [24]:
df_reviews

Unnamed: 0,user_id,posted,item_id,helpful,recommend,review_text
0,76561197970982479,"Posted November 5, 2011.",1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,76561197970982479,"Posted July 15, 2011.",22200,No ratings yet,True,It's unique and worth a playthrough.
2,76561197970982479,"Posted April 21, 2011.",43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
3,js41637,"Posted June 24, 2014.",251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
4,js41637,"Posted September 8, 2013.",227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...
...,...,...,...,...,...,...
58425,76561198312638244,Posted July 10.,70,No ratings yet,True,a must have classic from steam definitely wort...
58426,76561198312638244,Posted July 8.,362890,No ratings yet,True,this game is a perfect remake of the original ...
58427,LydiaMorley,Posted July 3.,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...
58428,LydiaMorley,Posted July 20.,730,No ratings yet,True,:D


In [25]:
# Define the conversion function
def convert_posted(posted_str):
    match = re.search(r"Posted (\w+) (\d{1,2})(?:,)?(?: (\d{4}))?", posted_str)
    if match:
        month_str, day_str, year_str = match.groups()
        month_mapping = {
            "January": "01",
            "February": "02",
            "March": "03",
            "April": "04",
            "May": "05",
            "June": "06",
            "July": "07",
            "August": "08",
            "September": "09",
            "October": "10",
            "November": "11",
            "December": "12"
        }
        current_year = str(pd.Timestamp.now().year)
        formatted_date = f"{year_str or current_year}-{month_mapping[month_str]}-{day_str.zfill(2)}"
        return formatted_date
    else:
        return None

In [26]:
df_reviews_copy = df_reviews.copy()
# Apply the conversion function to parse the 'posted' column
df_reviews_copy['date'] = df_reviews_copy['posted'].apply(convert_posted)

# Convert the 'date' column to datetime format
df_reviews_copy['date'] = pd.to_datetime(df_reviews_copy['date'], errors='coerce')

# Extract the year from the 'date' column
df_reviews_copy['year'] = df_reviews_copy['date'].dt.year

df_reviews_copy

Unnamed: 0,user_id,posted,item_id,helpful,recommend,review_text,date,year
0,76561197970982479,"Posted November 5, 2011.",1250,No ratings yet,True,Simple yet with great replayability. In my opi...,2011-11-05,2011
1,76561197970982479,"Posted July 15, 2011.",22200,No ratings yet,True,It's unique and worth a playthrough.,2011-07-15,2011
2,76561197970982479,"Posted April 21, 2011.",43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...,2011-04-21,2011
3,js41637,"Posted June 24, 2014.",251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,2014-06-24,2014
4,js41637,"Posted September 8, 2013.",227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...,2013-09-08,2013
...,...,...,...,...,...,...,...,...
58425,76561198312638244,Posted July 10.,70,No ratings yet,True,a must have classic from steam definitely wort...,2024-07-10,2024
58426,76561198312638244,Posted July 8.,362890,No ratings yet,True,this game is a perfect remake of the original ...,2024-07-08,2024
58427,LydiaMorley,Posted July 3.,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...,2024-07-03,2024
58428,LydiaMorley,Posted July 20.,730,No ratings yet,True,:D,2024-07-20,2024


In [27]:
df_reviews = df_reviews_copy

### We now have a date column, but it holds the current year for records whose "posted" field lacked a year. These 9929 records can be filtered by the year column and should not be considered for the best_developer_year function.

In [28]:
df_reviews['year'].value_counts()

year
2014    21821
2015    18146
2024     9929
2013     6707
2012     1201
2011      530
2010       66
Name: count, dtype: int64
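A minimal sketch of that filtering step, assuming the placeholder year is 2024 as in the output above. The `sample` DataFrame and the `placeholder_year` name are illustrative.

```python
import pandas as pd

# Illustrative records: two carry the placeholder year
sample = pd.DataFrame({
    "item_id": [1250, 70, 730],
    "year": [2011, 2024, 2024],
})

# Assumption: the year in which the conversion ran, used as placeholder
placeholder_year = 2024

# Keep only the records whose year came from the original 'posted' text
with_real_year = sample[sample["year"] != placeholder_year]
print(with_real_year)
```

This only works because the genuine review years (2010 to 2015 here) never coincide with the placeholder; re-running the notebook in a different year changes the placeholder value.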

In [29]:
# Count the NaN values in 'year'
nan_count = df_reviews['year'].isna().sum()
nan_count

0

In [30]:
df_reviews.drop(['posted'], axis=1, inplace= True)

In [31]:
# Count the NaN values in 'user_id'
nan_count = df_reviews['user_id'].isna().sum()
nan_count

0

In [32]:
df_reviews["item_id"] = pd.to_numeric(df_reviews["item_id"], errors="coerce")

In [33]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 58400 entries, 0 to 58429
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   user_id      58400 non-null  object        
 1   item_id      58400 non-null  int64         
 2   helpful      58400 non-null  object        
 3   recommend    58400 non-null  bool          
 4   review_text  58400 non-null  object        
 5   date         58400 non-null  datetime64[ns]
 6   year         58400 non-null  int32         
dtypes: bool(1), datetime64[ns](1), int32(1), int64(1), object(3)
memory usage: 3.0+ MB


### Storing the reviews DataFrame
Now that the data has been loaded and transformed into useful information, we store it to proceed with the EDA.
We store the data as .parquet because of size limitations.

In [34]:
# Define the path where the work is saved as a .parquet file
reviews = 'data/reviews.parquet'

# Save the DataFrame
df_reviews.to_parquet(reviews, index=False)

# Confirmation message
print(f'reviews DataFrame was stored into {reviews}')

reviews DataFrame was stored into data/reviews.parquet
