# 🚀 ETL of the `australian_user_reviews` dataset
#### This notebook covers the 📦 extraction, 💱 transformation and 📥 loading of the `australian_user_reviews` dataset

#### 📥 Imports

In [2]:
import pandas as pd
import json
import ast
import re

#### 📦 Data extraction and first exploration

The data is read from the JSON file, converted into a DataFrame, and its contents are inspected.

In [2]:
# Path to the dataset
ruta_review = './australian_user_reviews.json'

# Read and parse the dataset line by line
filas_review = []
with open(ruta_review, encoding='utf-8') as archivo:
    for line in archivo:
        filas_review.append(ast.literal_eval(line))

# Convert the parsed rows into a DataFrame
df_reviews = pd.DataFrame(filas_review)
df_reviews

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."
...,...,...,...
25794,76561198306599751,http://steamcommunity.com/profiles/76561198306...,"[{'funny': '', 'posted': 'Posted May 31.', 'la..."
25795,Ghoustik,http://steamcommunity.com/id/Ghoustik,"[{'funny': '', 'posted': 'Posted June 17.', 'l..."
25796,76561198310819422,http://steamcommunity.com/profiles/76561198310...,"[{'funny': '1 person found this review funny',..."
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,"[{'funny': '', 'posted': 'Posted July 21.', 'l..."


The DataFrame has 3 columns and 25799 rows.
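The parsing step relies on `ast.literal_eval` rather than `json.loads`. A plausible reason, judging from the single-quoted keys and the `True` literal visible in the output above, is that each line is a Python literal rather than strict JSON. The minimal sketch below illustrates the difference on a made-up line (the sample record is an assumption, not a row copied from the file):

```python
import ast
import json

# Hypothetical line, written in the same style as the records shown above
linea = "{'user_id': 'js41637', 'reviews': [{'recommend': True, 'review': 'Git gud'}]}"

# Strict JSON requires double quotes and lowercase true/false, so this fails
try:
    json.loads(linea)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# ast.literal_eval safely evaluates Python literals (dicts, lists, strings, booleans)
registro = ast.literal_eval(linea)
```

Unlike `eval`, `ast.literal_eval` only accepts literal structures, which makes it a reasonable choice for data files of this kind.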

#### ✅ Check the data type of each column and whether there are nulls

In [3]:
def verificar_tipo_datos(df):
    """Summarize, per column, the Python types found and the share of null values."""
    mi_dict = {"nombre_campo": [], "tipo_datos": [], "no_nulos_%": [], "nulos_%": [], "nulos": []}

    for columna in df.columns:
        porcentaje_no_nulos = (df[columna].count() / len(df)) * 100
        mi_dict["nombre_campo"].append(columna)
        mi_dict["tipo_datos"].append(df[columna].apply(type).unique())
        mi_dict["no_nulos_%"].append(round(porcentaje_no_nulos, 2))
        mi_dict["nulos_%"].append(round(100-porcentaje_no_nulos, 2))
        mi_dict["nulos"].append(df[columna].isnull().sum())

    df_info = pd.DataFrame(mi_dict)
            
    return df_info

In [4]:
verificar_tipo_datos(df_reviews)

Unnamed: 0,nombre_campo,tipo_datos,no_nulos_%,nulos_%,nulos
0,user_id,[<class 'str'>],100.0,0.0,0
1,user_url,[<class 'str'>],100.0,0.0,0
2,reviews,[<class 'list'>],100.0,0.0,0
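As a cross-check of the null figures reported by `verificar_tipo_datos`, the same percentages can be obtained with vectorized pandas calls. This is only a sketch on a toy frame (the data and the name `toy` are assumptions for illustration):

```python
import pandas as pd

# Toy frame (hypothetical data) with one missing value in each column
toy = pd.DataFrame({
    'user_id': ['a', 'b', None, 'd'],
    'user_url': ['u1', None, 'u3', 'u4'],
})

# isna() gives a boolean mask; its mean is the fraction of nulls per column
null_pct = toy.isna().mean().mul(100).round(2)

# Absolute number of nulls per column
null_count = toy.isna().sum()
```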


Next, the data is checked for duplicates by 'user_id'.

In [5]:
def verifica_duplicados_por_columna(df, columna):

    # Select every row whose value in `columna` appears more than once
    duplicated_rows = df[df.duplicated(subset=columna, keep=False)]
    if duplicated_rows.empty:
        return "No hay duplicados"
    
    # Sort the duplicated rows so that matching rows sit next to each other
    duplicated_rows_sorted = duplicated_rows.sort_values(by=columna)
    return duplicated_rows_sorted

In [6]:
filas_duplicadas = verifica_duplicados_por_columna(df_reviews, 'user_id')
filas_duplicadas

Unnamed: 0,user_id,user_url,reviews
12888,05041129,http://steamcommunity.com/id/05041129,"[{'funny': '', 'posted': 'Posted May 18, 2015...."
5250,05041129,http://steamcommunity.com/id/05041129,"[{'funny': '', 'posted': 'Posted May 18, 2015...."
3133,111222333444555666888,http://steamcommunity.com/id/11122233344455566...,"[{'funny': '', 'posted': 'Posted December 22, ..."
3134,111222333444555666888,http://steamcommunity.com/id/11122233344455566...,"[{'funny': '', 'posted': 'Posted December 22, ..."
4139,29123,http://steamcommunity.com/id/29123,"[{'funny': '', 'posted': 'Posted March 26.', '..."
...,...,...,...
2721,xXAussieRockXx,http://steamcommunity.com/id/xXAussieRockXx,"[{'funny': '', 'posted': 'Posted July 17, 2015..."
2680,yolofaceguy,http://steamcommunity.com/id/yolofaceguy,"[{'funny': '', 'posted': 'Posted October 31, 2..."
17916,yolofaceguy,http://steamcommunity.com/id/yolofaceguy,"[{'funny': '', 'posted': 'Posted October 31, 2..."
5855,zeroblade,http://steamcommunity.com/id/zeroblade,"[{'funny': '', 'posted': 'Posted November 30, ..."


There are 623 duplicated rows in the 'user_id' column.
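The count above includes every row involved in a duplicate, because the helper calls `duplicated` with `keep=False`. The sketch below, on a small hypothetical frame, contrasts that with the default behaviour, which flags only the repeated occurrences:

```python
import pandas as pd

# Hypothetical frame where user 'a' appears twice
df = pd.DataFrame({'user_id': ['a', 'b', 'a', 'c'], 'n': [1, 2, 3, 4]})

# keep=False marks every member of a duplicated group
mask_all = df.duplicated(subset='user_id', keep=False)

# The default (keep='first') leaves the first occurrence unmarked
mask_default = df.duplicated(subset='user_id')
```

With `keep=False` both rows of user `'a'` are selected, which is convenient for comparing them side by side as done here.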

The next step checks whether the reviews nested inside 'reviews' are themselves duplicated, or whether only the 'user_id' repeats because the user posted more than one review.

In [7]:
user_id = '05041129'
user_reviews = filas_duplicadas[filas_duplicadas['user_id'] == user_id]['reviews']

for review_list in user_reviews:
    for review in review_list:
        print(review['review'])
    print('-' * 100)

This game to me it is so good that it is better than any of the games out their and $15 worth it
this is the best third person game ever that i have played
this will be the  number one game if it have more competitive things
----------------------------------------------------------------------------------------------------
This game to me it is so good that it is better than any of the games out their and $15 worth it
this is the best third person game ever that i have played
this will be the  number one game if it have more competitive things
----------------------------------------------------------------------------------------------------
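The check above inspects a single user by eye. One possible way to extend it to every duplicated 'user_id' is to compare the nested lists group by group. The following is a sketch on hypothetical data; the helper name `same_reviews` is an assumption and not part of the notebook:

```python
import pandas as pd

# Hypothetical data: user 'a' has identical review lists, user 'b' does not
df = pd.DataFrame({
    'user_id': ['a', 'a', 'b', 'b'],
    'reviews': [
        [{'review': 'x'}], [{'review': 'x'}],
        [{'review': 'y'}], [{'review': 'z'}],
    ],
})

def same_reviews(lists):
    """Return True when every review list in the group equals the first one."""
    first = lists.iloc[0]
    return all(item == first for item in lists)

# One boolean per user_id: True means the duplicated rows carry the same reviews
identical = df.groupby('user_id')['reviews'].apply(same_reviews)
```

If every group returns `True`, dropping duplicates by 'user_id' loses no reviews.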


#### 💱 Data transformation

For this user the reviews are identical in both records, so the duplicates are dropped, keeping the first occurrence of each record.

In [8]:
df_reviews = df_reviews.drop_duplicates(subset='user_id', keep='first')
verifica_duplicados_por_columna(df_reviews, 'user_id')

'No hay duplicados'

The 'reviews' column is inspected to understand its data type.

In [9]:
df_reviews['reviews'][0]

[{'funny': '',
  'posted': 'Posted November 5, 2011.',
  'last_edited': '',
  'item_id': '1250',
  'helpful': 'No ratings yet',
  'recommend': True,
  'review': 'Simple yet with great replayability. In my opinion does "zombie" hordes and team work better than left 4 dead plus has a global leveling system. Alot of down to earth "zombie" splattering fun for the whole family. Amazed this sort of FPS is so rare.'},
 {'funny': '',
  'posted': 'Posted July 15, 2011.',
  'last_edited': '',
  'item_id': '22200',
  'helpful': 'No ratings yet',
  'recommend': True,
  'review': "It's unique and worth a playthrough."},
 {'funny': '',
  'posted': 'Posted April 21, 2011.',
  'last_edited': '',
  'item_id': '43110',
  'helpful': 'No ratings yet',
  'recommend': True,
  'review': 'Great atmosphere. The gunplay can be a bit chunky at times but at the end of the day this game is definitely worth it and I hope they do a sequel...so buy the game so I get a sequel!'}]

The columns of this dataset are:

* **user_id**: a unique identifier for the user.
* **user_url**: the URL of the user's profile on steamcommunity.
* **reviews**: a list of dictionaries. Each user has one or more dictionaries, one per review. Each dictionary contains:
    * **funny**: indicates whether other users marked the review as funny.
    * **posted**: the date the review was posted, in the format `Posted Month DD, YYYY.` (some records omit the year).
    * **last_edited**: the date of the last edit.
    * **item_id**: the unique identifier of the item, that is, the game.
    * **helpful**: the statistic of how many other users found the review helpful.
    * **recommend**: a boolean indicating whether the user recommends the game.
    * **review**: a string with the user's comments about the game.

#### 💱 Transforming the 'reviews' column

The 'reviews' column is nested: each value is a list with one or more dictionaries as elements.

First, one column is created per dictionary position, so that each dictionary can later become its own record.

In [10]:
# Spread the elements of each list across columns
df_reviews2 = pd.json_normalize(df_reviews['reviews'])
df_reviews2.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,"{'funny': '', 'posted': 'Posted November 5, 20...","{'funny': '', 'posted': 'Posted July 15, 2011....","{'funny': '', 'posted': 'Posted April 21, 2011...",,,,,,,
1,"{'funny': '', 'posted': 'Posted June 24, 2014....","{'funny': '', 'posted': 'Posted September 8, 2...","{'funny': '', 'posted': 'Posted November 29, 2...",,,,,,,
2,"{'funny': '', 'posted': 'Posted February 3.', ...","{'funny': '', 'posted': 'Posted December 4, 20...","{'funny': '', 'posted': 'Posted November 3, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...",,,,
3,"{'funny': '', 'posted': 'Posted October 14, 20...","{'funny': '', 'posted': 'Posted July 28, 2012....","{'funny': '', 'posted': 'Posted June 2, 2012.'...","{'funny': '', 'posted': 'Posted June 29, 2014....","{'funny': '', 'posted': 'Posted November 22, 2...","{'funny': '', 'posted': 'Posted February 23, 2...",,,,
4,"{'funny': '3 people found this review funny', ...","{'funny': '1 person found this review funny', ...","{'funny': '2 people found this review funny', ...","{'funny': '', 'posted': 'Posted July 11, 2013....",,,,,,


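This transformation loses the 'user_id' and 'user_url' each dictionary belongs to, although the rows keep the same order. The result is therefore concatenated with the previous DataFrame to recover those fields. Note that `pd.concat` with `axis=1` matches rows by index label rather than by position, so this step assumes both DataFrames share the same index; because `drop_duplicates` leaves gaps in the index of `df_reviews`, resetting it beforehand with `reset_index(drop=True)` would be the safer choice.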

In [11]:
# Add 'user_id' and 'user_url' to the separated columns
df_reviews2 = pd.concat([df_reviews[['user_id', 'user_url']], df_reviews2], axis=1)
df_reviews2.head()

Unnamed: 0,user_id,user_url,0,1,2,3,4,5,6,7,8,9
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted November 5, 20...","{'funny': '', 'posted': 'Posted July 15, 2011....","{'funny': '', 'posted': 'Posted April 21, 2011...",,,,,,,
1,js41637,http://steamcommunity.com/id/js41637,"{'funny': '', 'posted': 'Posted June 24, 2014....","{'funny': '', 'posted': 'Posted September 8, 2...","{'funny': '', 'posted': 'Posted November 29, 2...",,,,,,,
2,evcentric,http://steamcommunity.com/id/evcentric,"{'funny': '', 'posted': 'Posted February 3.', ...","{'funny': '', 'posted': 'Posted December 4, 20...","{'funny': '', 'posted': 'Posted November 3, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...",,,,
3,doctr,http://steamcommunity.com/id/doctr,"{'funny': '', 'posted': 'Posted October 14, 20...","{'funny': '', 'posted': 'Posted July 28, 2012....","{'funny': '', 'posted': 'Posted June 2, 2012.'...","{'funny': '', 'posted': 'Posted June 29, 2014....","{'funny': '', 'posted': 'Posted November 22, 2...","{'funny': '', 'posted': 'Posted February 23, 2...",,,,
4,maplemage,http://steamcommunity.com/id/maplemage,"{'funny': '3 people found this review funny', ...","{'funny': '1 person found this review funny', ...","{'funny': '2 people found this review funny', ...","{'funny': '', 'posted': 'Posted July 11, 2013....",,,,,,
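Because a column-wise `pd.concat` aligns on index labels, it is worth checking that both frames share the same index before joining them. The sketch below uses two small hypothetical frames to show what can happen when one of them has gaps in its index (as is the case after `drop_duplicates`) and how `reset_index(drop=True)` avoids it:

```python
import pandas as pd

# Left frame with gaps in its index, as left behind by drop_duplicates
left = pd.DataFrame({'user_id': ['a', 'b', 'c']}, index=[0, 2, 3])

# Right frame with a fresh 0..2 index, as produced by a new DataFrame
right = pd.DataFrame({'review': ['r_a', 'r_b', 'r_c']})

# Aligning on labels pairs 'b' with the wrong review and creates extra rows
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first restores a purely positional match
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)
```

In the misaligned result the union of both indexes yields four rows instead of three, with nulls where a label exists in only one frame.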


The dictionaries are now spread across columns, alongside the user who wrote them.

Next, one record is created per dictionary, keeping the user who wrote it in each case.

In [12]:
# pd.melt turns the columns into rows while keeping 'user_id' and 'user_url'.
# Note: range(9) selects columns 0 to 8, so column 9 of the table above is left out.
df_reviews2 = pd.melt(df_reviews2, id_vars=['user_id', 'user_url'], 
                       value_vars=list(range(9)),
                       value_name='reviews')
df_reviews2.head()

Unnamed: 0,user_id,user_url,variable,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,0,"{'funny': '', 'posted': 'Posted November 5, 20..."
1,js41637,http://steamcommunity.com/id/js41637,0,"{'funny': '', 'posted': 'Posted June 24, 2014...."
2,evcentric,http://steamcommunity.com/id/evcentric,0,"{'funny': '', 'posted': 'Posted February 3.', ..."
3,doctr,http://steamcommunity.com/id/doctr,0,"{'funny': '', 'posted': 'Posted October 14, 20..."
4,maplemage,http://steamcommunity.com/id/maplemage,0,"{'funny': '3 people found this review funny', ..."
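A hard-coded list of value columns can silently miss a position if the widest user has more reviews than expected. One way to guard against that, sketched here on a hypothetical wide frame, is to derive `value_vars` from the columns themselves and to drop only the rows whose review is missing:

```python
import pandas as pd

# Hypothetical wide frame: one column per review position (0, 1, 2)
wide = pd.DataFrame({
    'user_id': ['u1', 'u2'],
    0: ['a', 'c'],
    1: ['b', None],
    2: [None, None],
})

# Every column that is not an identifier is treated as a review position
review_cols = [col for col in wide.columns if col != 'user_id']

# Melt to long format, then drop only rows with no review in that position
long_df = (
    wide.melt(id_vars='user_id', value_vars=review_cols, value_name='reviews')
        .dropna(subset=['reviews'])
)
```

Restricting `dropna` to the 'reviews' column avoids discarding rows because of nulls in unrelated columns.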


Some records are left with None. This happens because some users wrote more reviews than others, so the unused positions are empty. The following example shows such a case:

In [13]:
df_reviews2[df_reviews2['user_id']=='76561197970982479']

Unnamed: 0,user_id,user_url,variable,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,0,"{'funny': '', 'posted': 'Posted November 5, 20..."
25799,76561197970982479,http://steamcommunity.com/profiles/76561197970...,1,"{'funny': '', 'posted': 'Posted July 15, 2011...."
51598,76561197970982479,http://steamcommunity.com/profiles/76561197970...,2,"{'funny': '', 'posted': 'Posted April 21, 2011..."
77397,76561197970982479,http://steamcommunity.com/profiles/76561197970...,3,
103196,76561197970982479,http://steamcommunity.com/profiles/76561197970...,4,
128995,76561197970982479,http://steamcommunity.com/profiles/76561197970...,5,
154794,76561197970982479,http://steamcommunity.com/profiles/76561197970...,6,
180593,76561197970982479,http://steamcommunity.com/profiles/76561197970...,7,
206392,76561197970982479,http://steamcommunity.com/profiles/76561197970...,8,


Records with None in 'reviews' are dropped.

In [14]:
df_reviews2 = df_reviews2.dropna()

A check confirms that each 'user_id' keeps only the dictionaries that belong to it.

In [15]:
df_reviews2[df_reviews2['user_id']=='76561197970982479']

Unnamed: 0,user_id,user_url,variable,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,0,"{'funny': '', 'posted': 'Posted November 5, 20..."
25799,76561197970982479,http://steamcommunity.com/profiles/76561197970...,1,"{'funny': '', 'posted': 'Posted July 15, 2011...."
51598,76561197970982479,http://steamcommunity.com/profiles/76561197970...,2,"{'funny': '', 'posted': 'Posted April 21, 2011..."
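The column-spreading, concatenation, melt and null-dropping steps above produce one row per review. As an alternative technique (not the one used in this notebook), `DataFrame.explode` reaches a similar long format in a single call while carrying the identifier columns along. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical nested frame: each user holds a list of review dictionaries
df = pd.DataFrame({
    'user_id': ['u1', 'u2'],
    'user_url': ['http://example.com/u1', 'http://example.com/u2'],
    'reviews': [
        [{'item_id': '1250', 'recommend': True}, {'item_id': '22200', 'recommend': True}],
        [{'item_id': '730', 'recommend': False}],
    ],
})

# explode repeats user_id and user_url once per element of each list
long_df = df.explode('reviews').reset_index(drop=True)
```

Since the identifiers never leave the frame, this approach does not depend on index alignment between separate DataFrames.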


Each dictionary key is turned into a column.

In [16]:
# Split each key of 'reviews' into its own column
df_reviews = df_reviews2['reviews'].apply(pd.Series, dtype='object')
df_reviews = df_reviews.add_prefix('reviews_')
df_reviews.head()

Unnamed: 0,reviews_funny,reviews_posted,reviews_last_edited,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review
0,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
2,,Posted February 3.,,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...
3,,"Posted October 14, 2013.",,250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...
4,3 people found this review funny,"Posted April 15, 2014.",,211420,35 of 43 people (81%) found this review helpful,True,Git gud


The 'user_id' and 'user_url' columns were lost again, so they are concatenated back.

In [17]:
# Join with 'user_id' and 'user_url'
df_reviews = pd.concat([df_reviews2[['user_id', 'user_url']], df_reviews], axis=1)
df_reviews.head()

Unnamed: 0,user_id,user_url,reviews_funny,reviews_posted,reviews_last_edited,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,js41637,http://steamcommunity.com/id/js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
2,evcentric,http://steamcommunity.com/id/evcentric,,Posted February 3.,,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...
3,doctr,http://steamcommunity.com/id/doctr,,"Posted October 14, 2013.",,250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...
4,maplemage,http://steamcommunity.com/id/maplemage,3 people found this review funny,"Posted April 15, 2014.",,211420,35 of 43 people (81%) found this review helpful,True,Git gud
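Expanding a column of dictionaries with `apply(pd.Series)` works, though it can be slow on large frames. A commonly used alternative, sketched below on hypothetical data, is to pass the list of dictionaries to `pd.json_normalize` and join the result back after resetting the index so that both frames align:

```python
import pandas as pd

# Hypothetical long frame: one review dictionary per row
long_df = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2'],
    'reviews': [
        {'item_id': '1250', 'recommend': True},
        {'item_id': '22200', 'recommend': True},
        {'item_id': '730', 'recommend': False},
    ],
}).reset_index(drop=True)

# One column per dictionary key, prefixed as in the notebook
expanded = pd.json_normalize(long_df['reviews'].tolist()).add_prefix('reviews_')

# Both frames now share a 0..n-1 index, so the column-wise concat lines up
flat = pd.concat([long_df[['user_id']], expanded], axis=1)
```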


Some columns have missing values that are not stored as nulls but as empty strings:

In [18]:
df_reviews['reviews_last_edited'][0]

''

These empty strings are replaced with null values.

In [19]:
df_reviews.replace('', None, inplace=True)
df_reviews.head()

Unnamed: 0,user_id,user_url,reviews_funny,reviews_posted,reviews_last_edited,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,js41637,http://steamcommunity.com/id/js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
2,evcentric,http://steamcommunity.com/id/evcentric,,Posted February 3.,,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...
3,doctr,http://steamcommunity.com/id/doctr,,"Posted October 14, 2013.",,250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...
4,maplemage,http://steamcommunity.com/id/maplemage,3 people found this review funny,"Posted April 15, 2014.",,211420,35 of 43 people (81%) found this review helpful,True,Git gud
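The handling of `None` as a replacement value in `replace` appears to have changed between pandas releases, so an explicit `np.nan` may be a more predictable choice; this is an assumption worth verifying against the installed version. A minimal sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame where missing values are stored as empty strings
toy = pd.DataFrame({
    'reviews_funny': ['', '3 people found this review funny'],
    'reviews_last_edited': ['', ''],
})

# Replace empty strings with NaN so that isna() and dropna() detect them
cleaned = toy.replace('', np.nan)
```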


The data types and the remaining nulls are analyzed after un-nesting the 'reviews' column.

In [20]:
verificar_tipo_datos(df_reviews)

Unnamed: 0,nombre_campo,tipo_datos,no_nulos_%,nulos_%,nulos
0,user_id,[<class 'str'>],100.0,0.0,0
1,user_url,[<class 'str'>],100.0,0.0,0
2,reviews_funny,"[<class 'NoneType'>, <class 'str'>]",13.76,86.24,49498
3,reviews_posted,[<class 'str'>],100.0,0.0,0
4,reviews_last_edited,"[<class 'NoneType'>, <class 'str'>]",10.28,89.72,51499
5,reviews_item_id,[<class 'str'>],100.0,0.0,0
6,reviews_helpful,[<class 'str'>],100.0,0.0,0
7,reviews_recommend,[<class 'bool'>],100.0,0.0,0
8,reviews_review,"[<class 'str'>, <class 'NoneType'>]",99.95,0.05,30


The 'reviews_funny' and 'reviews_last_edited' columns are 86.24% and 89.72% null respectively, so both columns are dropped.

On the other hand, 'reviews_review' has 0.05% nulls (30 records). These records are not dropped at this step; they are handled further on, when that column is transformed.

In [21]:
# Drop the 'reviews_funny' and 'reviews_last_edited' columns
df_reviews = df_reviews.drop(columns=['reviews_funny', 'reviews_last_edited'])
df_reviews.columns

Index(['user_id', 'user_url', 'reviews_posted', 'reviews_item_id',
       'reviews_helpful', 'reviews_recommend', 'reviews_review'],
      dtype='object')

#### 💱 Transforming the 'reviews_posted' column

The date on which the review was posted comes in the format `Posted November 5, 2011.` and is converted to the `YYYY-MM-DD` format.

In [22]:
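def convertir_fecha(cadena_fecha):
    # Look for a "Month D, YYYY" pattern, e.g. "November 5, 2011"
    match = re.search(r'(\w+\s\d{1,2},\s\d{4})', cadena_fecha)
    if match:
        fecha_str = match.group(1)
        try:
            fecha_dt = pd.to_datetime(fecha_str)
            return fecha_dt.strftime('%Y-%m-%d')
        except ValueError:
            # The text matched the pattern but is not a real date
            return 'Fecha inválida'
    else:
        # No year present (e.g. "Posted February 3.") or an unexpected layout
        return 'Formato inválido'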

In [23]:
df_reviews['reviews_date'] = df_reviews['reviews_posted'].apply(convertir_fecha)
df_reviews['reviews_date']

0               2011-11-05
1               2014-06-24
2         Formato inválido
3               2013-10-14
4               2014-04-15
                ...       
231291          2014-08-15
231293          2014-08-02
231419          2015-07-31
231499          2015-12-20
231501    Formato inválido
Name: reviews_date, Length: 57397, dtype: object
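The row-by-row conversion above stores sentinel strings for dates it cannot parse. A different design, shown here only as a sketch, extracts the date with the same regular expression in a vectorized way and keeps unparseable values as nulls. The sample strings are taken from the outputs above, the variable names are assumptions, and month names are assumed to be parsed under an English locale:

```python
import pandas as pd

posted = pd.Series([
    'Posted November 5, 2011.',
    'Posted February 3.',        # no year: the pattern does not match
    'Posted June 24, 2014.',
])

# Same pattern as convertir_fecha; rows without a match become NaN
extracted = posted.str.extract(r'(\w+\s\d{1,2},\s\d{4})', expand=False)

# errors='coerce' turns anything unparseable into NaT instead of raising
dates = pd.to_datetime(extracted, format='%B %d, %Y', errors='coerce')

# ISO-formatted strings; missing dates stay null
iso = dates.dt.strftime('%Y-%m-%d')
```

Keeping nulls instead of text markers leaves the column usable as a real datetime type, at the cost of no longer distinguishing an invalid format from an invalid date.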

Some records have a format that differs from the rest and is flagged as invalid: as the example `Posted February 3.` shows, the year is missing. These records cannot be queried by date from the API, but their remaining columns still provide useful information.

In [24]:
df_reviews[df_reviews['reviews_date'] == 'Formato inválido']

Unnamed: 0,user_id,user_url,reviews_posted,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review,reviews_date
2,evcentric,http://steamcommunity.com/id/evcentric,Posted February 3.,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...,Formato inválido
6,76561198079601835,http://steamcommunity.com/profiles/76561198079...,Posted May 20.,730,0 of 1 people (0%) found this review helpful,True,ZIKA DO BAILE,Formato inválido
7,MeaTCompany,http://steamcommunity.com/id/MeaTCompany,Posted July 24.,730,No ratings yet,True,BEST GAME IN THE BLOODY WORLD,Formato inválido
9,76561198156664158,http://steamcommunity.com/profiles/76561198156...,Posted June 16.,252950,0 of 1 people (0%) found this review helpful,True,love it,Formato inválido
10,76561198077246154,http://steamcommunity.com/profiles/76561198077...,Posted June 11.,440,No ratings yet,True,mt bom,Formato inválido
...,...,...,...,...,...,...,...,...
223569,76561198040184950,http://steamcommunity.com/profiles/76561198040...,Posted April 12.,394690,No ratings yet,True,I cannot say much right now due to the game no...,Formato inválido
226105,76561198046474248,http://steamcommunity.com/profiles/76561198046...,Posted March 28.,234140,No ratings yet,True,"Oh what a day .., What a lovely day to play th...",Formato inválido
228109,dmitry_who,http://steamcommunity.com/id/dmitry_who,Posted May 17.,376210,10 of 28 people (36%) found this review helpful,True,░░░░░░░░░░░█▀▀░░█░░░░░░░░░░░▄▀▀▀▀░░░░░█▄▄░░░░░...,Formato inválido
229231,76561198079507136,http://steamcommunity.com/profiles/76561198079...,Posted January 3.,730,No ratings yet,False,got VACed,Formato inválido


The 'reviews_posted' column is dropped because its date information is now captured in 'reviews_date'.

In [25]:
df_reviews = df_reviews.drop('reviews_posted', axis=1)
df_reviews.columns

Index(['user_id', 'user_url', 'reviews_item_id', 'reviews_helpful',
       'reviews_recommend', 'reviews_review', 'reviews_date'],
      dtype='object')

#### 💱 Transforming the 'reviews_review' column

This column has 0.05% null values (30 records), so those records are dropped.

In [26]:
df_reviews = df_reviews.dropna(subset=['reviews_review'])
verificar_tipo_datos(df_reviews)

Unnamed: 0,nombre_campo,tipo_datos,no_nulos_%,nulos_%,nulos
0,user_id,[<class 'str'>],100.0,0.0,0
1,user_url,[<class 'str'>],100.0,0.0,0
2,reviews_item_id,[<class 'str'>],100.0,0.0,0
3,reviews_helpful,[<class 'str'>],100.0,0.0,0
4,reviews_recommend,[<class 'bool'>],100.0,0.0,0
5,reviews_review,[<class 'str'>],100.0,0.0,0
6,reviews_date,[<class 'str'>],100.0,0.0,0


## 📥 Loading the `australian_user_reviews` dataset

The transformed dataset is saved as `user_review_limpio`.

In [27]:
archivo_limpio = 'Data/user_review_limpio.csv'
df_reviews.to_csv(archivo_limpio, index=False, encoding='utf-8')
print(f'Saved file {archivo_limpio}')

Saved file Data/user_review_limpio.csv
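When the saved CSV is read back, pandas infers column types from the text, so identifiers made only of digits may be converted to integers and lose leading zeros (the dataset contains ids such as `05041129`). The sketch below shows this on a one-row hypothetical frame, using an in-memory buffer instead of the real file, and one way to preserve the identifiers with the `dtype` parameter:

```python
import io
import pandas as pd

# Hypothetical one-row frame with string identifiers
df = pd.DataFrame({
    'user_id': ['05041129'],
    'reviews_item_id': ['1250'],
    'reviews_recommend': [True],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default read: digit-only columns are inferred as integers
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Explicit dtypes keep the identifiers as text
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={'user_id': str, 'reviews_item_id': str})
```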
