# Data Extraction, Transformation, and Loading (ETL)

### Importing libraries

These libraries let us manipulate the data and prepare it for consumption.

In [56]:
import pandas as pd
import numpy as np
import json
import ast
import warnings
warnings.filterwarnings("ignore")
import sys
sys.path.insert(0, '../')
import Herramientas as Herr

### Loading the data

We load the file through a function from our Herramientas module, which turns the JSON content into Python objects. The result is then converted to a DataFrame for manipulation.

In [57]:
filas = Herr.read_json('../datasets/australian_user_reviews.json')

data_reviews = pd.DataFrame(filas)
data_reviews

El archivo se leyó con éxito


Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."
...,...,...,...
25794,76561198306599751,http://steamcommunity.com/profiles/76561198306...,"[{'funny': '', 'posted': 'Posted May 31.', 'la..."
25795,Ghoustik,http://steamcommunity.com/id/Ghoustik,"[{'funny': '', 'posted': 'Posted June 17.', 'l..."
25796,76561198310819422,http://steamcommunity.com/profiles/76561198310...,"[{'funny': '1 person found this review funny',..."
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,"[{'funny': '', 'posted': 'Posted July 21.', 'l..."
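
The source of `Herr.read_json` is not shown in this notebook. As a rough idea of what such a loader could look like, here is a minimal sketch that assumes the file stores one record per line and that some lines may be Python dict literals rather than strict JSON. The name `read_json_lines` and the fallback to `ast.literal_eval` are assumptions for illustration, not the module's actual code.

```python
import ast
import json
import os
import tempfile


def read_json_lines(path):
    """Hypothetical stand-in for Herr.read_json: one record per line."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                rows.append(json.loads(line))        # strict JSON first
            except json.JSONDecodeError:
                rows.append(ast.literal_eval(line))  # Python-literal fallback
    return rows


# Tiny demo file mixing both styles (made-up records).
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write('{"user_id": "a", "reviews": []}\n')
    tmp.write("{'user_id': 'b', 'reviews': [{'recommend': True}]}\n")

demo_rows = read_json_lines(tmp.name)
os.remove(tmp.name)
```

The resulting list of dicts can be passed straight to `pd.DataFrame`, as the cell above does with `filas`.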


We inspect the columns for null values and for the unique data types they hold, using a function from the Herramientas module.

In [58]:
Herr.analizar_datos(data_reviews)

Unnamed: 0,Nombre,Tipos de Datos Únicos,% de Valores No Nulos,% de Valores Nulos,Cantidad de Valores Nulos
0,user_id,[<class 'str'>],100.0,0.0,0
1,user_url,[<class 'str'>],100.0,0.0,0
2,reviews,[<class 'list'>],100.0,0.0,0
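
The implementation of `Herr.analizar_datos` is not part of this notebook. The sketch below shows one way a report with the same columns could be produced; the function name `summarize_columns` and its internals are assumptions, and only the output column names are taken from the table above.

```python
import pandas as pd


def summarize_columns(df):
    """Hypothetical stand-in for Herr.analizar_datos."""
    rows = []
    for col in df.columns:
        null_count = int(df[col].isna().sum())
        null_pct = round(100 * null_count / len(df), 2) if len(df) else 0.0
        rows.append({
            "Nombre": col,
            # Python type of every cell, so mixed-type columns become visible
            "Tipos de Datos Únicos": list(df[col].map(type).unique()),
            "% de Valores No Nulos": round(100 - null_pct, 2),
            "% de Valores Nulos": null_pct,
            "Cantidad de Valores Nulos": null_count,
        })
    return pd.DataFrame(rows)


# Made-up data: "reviews" mixes lists and missing values.
toy = pd.DataFrame({
    "user_id": ["a", "b", "c", "d"],
    "reviews": [[1], None, [2, 3], None],
})
summary = summarize_columns(toy)
```

Looking at the Python type of each cell, rather than the column dtype, is what lets the report distinguish `dict`, `list`, and `NoneType` values inside an object column.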


### Data transformation

We check for repeated values in the "user_id" column. `duplicated()` flags every occurrence after the first, so the count below shows how many rows repeat a user that was already seen, and only those rows are displayed.

In [59]:
duplicados = data_reviews['user_id'].duplicated()
print(duplicados.value_counts())
print(110*'-')
filas_duplicadas = data_reviews.loc[duplicados,:]
filas_duplicadas

user_id
False    25485
True       314
Name: count, dtype: int64
--------------------------------------------------------------------------------------------------------------


Unnamed: 0,user_id,user_url,reviews
456,bokkkbokkk,http://steamcommunity.com/id/bokkkbokkk,"[{'funny': '', 'posted': 'Posted September 24,..."
1182,ImSeriouss,http://steamcommunity.com/id/ImSeriouss,"[{'funny': '', 'posted': 'Posted January 10, 2..."
1456,76561198062039159,http://steamcommunity.com/profiles/76561198062...,"[{'funny': '', 'posted': 'Posted August 24, 20..."
1477,76561198045009232,http://steamcommunity.com/profiles/76561198045...,"[{'funny': '', 'posted': 'Posted October 31, 2..."
1746,nitr0ticwolf,http://steamcommunity.com/id/nitr0ticwolf,"[{'funny': '', 'posted': 'Posted December 12, ..."
...,...,...,...
17819,76561198076474887,http://steamcommunity.com/profiles/76561198076...,"[{'funny': '', 'posted': 'Posted April 12.', '..."
17916,yolofaceguy,http://steamcommunity.com/id/yolofaceguy,"[{'funny': '', 'posted': 'Posted October 31, 2..."
18028,76561198075591109,http://steamcommunity.com/profiles/76561198075...,"[{'funny': '', 'posted': 'Posted December 26, ..."
18234,76561198092022514,http://steamcommunity.com/profiles/76561198092...,"[{'funny': '', 'posted': 'Posted July 3.', 'la..."


Checking several of the duplicated IDs, we found no repeated comments among the reviews printed for each one, which indicates that the same user simply wrote more than one review.

In [60]:
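# Named id_usuario to avoid shadowing the built-in id()
id_usuario = '76561198062039159'
resenia = filas_duplicadas[filas_duplicadas['user_id'] == id_usuario]['reviews']

# Each element is the list of review dicts stored in one duplicated row
for x in resenia:
    for r in x:
        print(r['review'])
        print(100*'--')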

Doto of the ancients > LoLegend league of Legends10/IV would play again, best fps in today's scientific community
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This is possibly the best tactical FPS i've ever played.Insurgency provides hours of fun and it also looks fabutabulous. It has amazing gunplay and guns, fantastic supression mechanics and somewhat good people.This game is a 10/10, no questions asked.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Buy it and feel bad when a stupid ♥♥♥♥ing robot kills your dudes.It's really good.
-----------------------------------------------------------------------------------------------------------------------------------------------------
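
Because `duplicated()` keeps the first occurrence out of `filas_duplicadas`, the loop above only shows the later rows of each user. A complementary check, sketched here on made-up data, marks every occurrence with `keep=False` and counts how many distinct review contents each repeated user has. The toy DataFrame and variable names are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3", "u2"],
    "reviews": [
        [{"review": "great"}],
        [{"review": "ok"}],
        [{"review": "great"}],      # same content as the first u1 row
        [{"review": "meh"}],
        [{"review": "different"}],  # u2 again, with new content
    ],
})

# keep=False marks every occurrence of a repeated user_id, not only the later ones.
repeated = toy[toy["user_id"].duplicated(keep=False)]

# Lists are unhashable, so compare their string form within each user.
distinct_contents = (
    repeated["reviews"].astype(str).groupby(repeated["user_id"]).nunique()
)

# Users whose repeated rows all carry identical content.
exact_copies = distinct_contents[distinct_contents == 1].index.tolist()
```

A user listed in `exact_copies` has rows that are plain copies of each other, while a count above 1 means the repeated rows hold different reviews.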

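The json_normalize function is applied to un-nest the "reviews" column, and the result is stored in a new DataFrame. Because each cell holds a list of dicts, the result has one column per list position (0 to 9), and each cell still holds a whole review dict.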

In [61]:
data_reviews2 = pd.json_normalize(data_reviews['reviews'])
data_reviews2

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,"{'funny': '', 'posted': 'Posted November 5, 20...","{'funny': '', 'posted': 'Posted July 15, 2011....","{'funny': '', 'posted': 'Posted April 21, 2011...",,,,,,,
1,"{'funny': '', 'posted': 'Posted June 24, 2014....","{'funny': '', 'posted': 'Posted September 8, 2...","{'funny': '', 'posted': 'Posted November 29, 2...",,,,,,,
2,"{'funny': '', 'posted': 'Posted February 3.', ...","{'funny': '', 'posted': 'Posted December 4, 20...","{'funny': '', 'posted': 'Posted November 3, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...",,,,
3,"{'funny': '', 'posted': 'Posted October 14, 20...","{'funny': '', 'posted': 'Posted July 28, 2012....","{'funny': '', 'posted': 'Posted June 2, 2012.'...","{'funny': '', 'posted': 'Posted June 29, 2014....","{'funny': '', 'posted': 'Posted November 22, 2...","{'funny': '', 'posted': 'Posted February 23, 2...",,,,
4,"{'funny': '3 people found this review funny', ...","{'funny': '1 person found this review funny', ...","{'funny': '2 people found this review funny', ...","{'funny': '', 'posted': 'Posted July 11, 2013....",,,,,,
...,...,...,...,...,...,...,...,...,...,...
25794,"{'funny': '', 'posted': 'Posted May 31.', 'las...",,,,,,,,,
25795,"{'funny': '', 'posted': 'Posted June 17.', 'la...",,,,,,,,,
25796,"{'funny': '1 person found this review funny', ...",,,,,,,,,
25797,"{'funny': '', 'posted': 'Posted July 21.', 'la...","{'funny': '', 'posted': 'Posted July 10.', 'la...","{'funny': '', 'posted': 'Posted July 10.', 'la...","{'funny': '', 'posted': 'Posted July 8.', 'las...",,,,,,


The first DataFrame (data_reviews) is concatenated column-wise with the second (data_reviews2).

In [62]:
data_reviews3 = pd.concat([data_reviews,data_reviews2],axis=1)
data_reviews3

Unnamed: 0,user_id,user_url,reviews,0,1,2,3,4,5,6,7,8,9
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2...","{'funny': '', 'posted': 'Posted November 5, 20...","{'funny': '', 'posted': 'Posted July 15, 2011....","{'funny': '', 'posted': 'Posted April 21, 2011...",,,,,,,
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014...","{'funny': '', 'posted': 'Posted June 24, 2014....","{'funny': '', 'posted': 'Posted September 8, 2...","{'funny': '', 'posted': 'Posted November 29, 2...",,,,,,,
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',...","{'funny': '', 'posted': 'Posted February 3.', ...","{'funny': '', 'posted': 'Posted December 4, 20...","{'funny': '', 'posted': 'Posted November 3, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...",,,,
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2...","{'funny': '', 'posted': 'Posted October 14, 20...","{'funny': '', 'posted': 'Posted July 28, 2012....","{'funny': '', 'posted': 'Posted June 2, 2012.'...","{'funny': '', 'posted': 'Posted June 29, 2014....","{'funny': '', 'posted': 'Posted November 22, 2...","{'funny': '', 'posted': 'Posted February 23, 2...",,,,
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',...","{'funny': '3 people found this review funny', ...","{'funny': '1 person found this review funny', ...","{'funny': '2 people found this review funny', ...","{'funny': '', 'posted': 'Posted July 11, 2013....",,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
25794,76561198306599751,http://steamcommunity.com/profiles/76561198306...,"[{'funny': '', 'posted': 'Posted May 31.', 'la...","{'funny': '', 'posted': 'Posted May 31.', 'las...",,,,,,,,,
25795,Ghoustik,http://steamcommunity.com/id/Ghoustik,"[{'funny': '', 'posted': 'Posted June 17.', 'l...","{'funny': '', 'posted': 'Posted June 17.', 'la...",,,,,,,,,
25796,76561198310819422,http://steamcommunity.com/profiles/76561198310...,"[{'funny': '1 person found this review funny',...","{'funny': '1 person found this review funny', ...",,,,,,,,,
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,"[{'funny': '', 'posted': 'Posted July 21.', 'l...","{'funny': '', 'posted': 'Posted July 21.', 'la...","{'funny': '', 'posted': 'Posted July 10.', 'la...","{'funny': '', 'posted': 'Posted July 10.', 'la...","{'funny': '', 'posted': 'Posted July 8.', 'las...",,,,,,


The "reviews" column is dropped, since its content now lives in the columns added from the normalized DataFrame.

In [63]:
data_reviews3 = data_reviews3.drop('reviews',axis=1)

The pandas 'melt' function is applied to the columns produced by un-nesting "reviews". The resulting value column is again named "reviews", and the first two columns ('user_id', 'user_url') are kept as identifiers.

In [64]:
data_reviews3 = pd.melt(data_reviews3,id_vars=['user_id','user_url'],value_vars=list(range(10)),value_name='reviews')
data_reviews3

Unnamed: 0,user_id,user_url,variable,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,0,"{'funny': '', 'posted': 'Posted November 5, 20..."
1,js41637,http://steamcommunity.com/id/js41637,0,"{'funny': '', 'posted': 'Posted June 24, 2014...."
2,evcentric,http://steamcommunity.com/id/evcentric,0,"{'funny': '', 'posted': 'Posted February 3.', ..."
3,doctr,http://steamcommunity.com/id/doctr,0,"{'funny': '', 'posted': 'Posted October 14, 20..."
4,maplemage,http://steamcommunity.com/id/maplemage,0,"{'funny': '3 people found this review funny', ..."
...,...,...,...,...
257985,76561198306599751,http://steamcommunity.com/profiles/76561198306...,9,
257986,Ghoustik,http://steamcommunity.com/id/Ghoustik,9,
257987,76561198310819422,http://steamcommunity.com/profiles/76561198310...,9,
257988,76561198312638244,http://steamcommunity.com/profiles/76561198312...,9,
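
To make the wide-to-long reshaping easier to follow, here is a small sketch of `pd.melt` on made-up data with two review slots instead of ten. The toy values are assumptions for illustration only.

```python
import pandas as pd

# Wide layout: one column per review slot, None where a user has no review.
wide = pd.DataFrame({
    "user_id": ["a", "b"],
    0: [{"review": "r1"}, {"review": "r3"}],
    1: [{"review": "r2"}, None],
})

# Long layout: one row per (user, slot) pair.
long = pd.melt(
    wide,
    id_vars=["user_id"],
    value_vars=[0, 1],
    value_name="reviews",
)

empty_slots = int(long["reviews"].isna().sum())
```

Every user contributes one row per slot, so users with fewer reviews than the maximum produce rows whose "reviews" value is missing.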


After un-nesting, we check which data types the "reviews" column contains and whether it has nulls.

In [65]:
Herr.analizar_datos(data_reviews3)

Unnamed: 0,Nombre,Tipos de Datos Únicos,% de Valores No Nulos,% de Valores Nulos,Cantidad de Valores Nulos
0,user_id,[<class 'str'>],100.0,0.0,0
1,user_url,[<class 'str'>],100.0,0.0,0
2,variable,[<class 'int'>],100.0,0.0,0
3,reviews,"[<class 'dict'>, <class 'NoneType'>]",22.99,77.01,198685


We pick an arbitrary user to look at their reviews. It is immediately visible that there are many nulls: every user gets ten slots, and this user wrote fewer reviews than that.  
The same can be verified for any user.

In [66]:
user = 'LydiaMorley'
filtrado = data_reviews3[data_reviews3['user_id'] == user]
filtrado

Unnamed: 0,user_id,user_url,variable,reviews
25798,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,0,"{'funny': '1 person found this review funny', ..."
51597,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,1,"{'funny': '', 'posted': 'Posted July 20.', 'la..."
77396,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,2,"{'funny': '', 'posted': 'Posted July 2.', 'las..."
103195,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,3,
128994,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,4,
154793,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,5,
180592,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,6,
206391,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,7,
232190,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,8,
257989,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,9,


On the "data_reviews3" DataFrame, the 'dropna' function is applied to remove the rows that contain no review.

In [67]:
data_reviews3 = data_reviews3.dropna()
Herr.analizar_datos(data_reviews3)

Unnamed: 0,Nombre,Tipos de Datos Únicos,% de Valores No Nulos,% de Valores Nulos,Cantidad de Valores Nulos
0,user_id,[<class 'str'>],100.0,0.0,0
1,user_url,[<class 'str'>],100.0,0.0,0
2,variable,[<class 'int'>],100.0,0.0,0
3,reviews,[<class 'dict'>],100.0,0.0,0


'pd.Series' is applied to the "reviews" column to expand the information inside each dict into separate columns.

In [68]:
data_des = data_reviews3['reviews'].apply(pd.Series)
data_des


Unnamed: 0,funny,posted,last_edited,item_id,helpful,recommend,review
0,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
2,,Posted February 3.,,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...
3,,"Posted October 14, 2013.",,250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...
4,3 people found this review funny,"Posted April 15, 2014.",,211420,35 of 43 people (81%) found this review helpful,True,Git gud
...,...,...,...,...,...,...,...
256785,,"Posted June 7, 2015.",,400,No ratings yet,True,This is the best stratigy/puzzle game out ther...
256904,,"Posted March 16, 2015.",,313120,No ratings yet,True,Not a bad game for alpha. I think with future ...
257402,,"Posted March 29, 2014.",Last edited April 12.,17410,No ratings yet,True,Pretty great graphics and gameplay is no diffe...
257718,,"Posted August 9, 2014.","Last edited October 3, 2014.",304930,No ratings yet,True,This game is so much fun but so challenging it...


The DataFrame with the expanded "reviews" column (data_des) is concatenated with the identifier columns ('user_id', 'user_url') of the third DataFrame (data_reviews3).

In [69]:
data_normalizado = pd.concat([data_reviews3[['user_id','user_url']],data_des],axis=1)
data_normalizado

Unnamed: 0,user_id,user_url,funny,posted,last_edited,item_id,helpful,recommend,review
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,js41637,http://steamcommunity.com/id/js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
2,evcentric,http://steamcommunity.com/id/evcentric,,Posted February 3.,,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...
3,doctr,http://steamcommunity.com/id/doctr,,"Posted October 14, 2013.",,250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...
4,maplemage,http://steamcommunity.com/id/maplemage,3 people found this review funny,"Posted April 15, 2014.",,211420,35 of 43 people (81%) found this review helpful,True,Git gud
...,...,...,...,...,...,...,...,...,...
256785,BonnieMTD,http://steamcommunity.com/id/BonnieMTD,,"Posted June 7, 2015.",,400,No ratings yet,True,This is the best stratigy/puzzle game out ther...
256904,amillionlemons,http://steamcommunity.com/id/amillionlemons,,"Posted March 16, 2015.",,313120,No ratings yet,True,Not a bad game for alpha. I think with future ...
257402,keepit1hunid,http://steamcommunity.com/id/keepit1hunid,,"Posted March 29, 2014.",Last edited April 12.,17410,No ratings yet,True,Pretty great graphics and gameplay is no diffe...
257718,SKELETRONPRIMEISOP,http://steamcommunity.com/id/SKELETRONPRIMEISOP,,"Posted August 9, 2014.","Last edited October 3, 2014.",304930,No ratings yet,True,This game is so much fun but so challenging it...
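
The same flat table can also be reached in fewer steps. The sketch below is an alternative, not the method used in this notebook: it uses `DataFrame.explode` to create one row per review and `pd.json_normalize` to expand the dicts. The toy data is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": ["a", "b"],
    "user_url": ["url_a", "url_b"],
    "reviews": [
        [{"item_id": "10", "review": "good"}, {"item_id": "20", "review": "bad"}],
        [{"item_id": "30", "review": "fine"}],
    ],
})

# One row per review; ignore_index gives a clean 0..n-1 index.
exploded = toy.explode("reviews", ignore_index=True)

# Expand each review dict into its own columns.
details = pd.json_normalize(exploded["reviews"].tolist())

# Both frames share the same 0..n-1 index, so they align row by row.
flat = pd.concat([exploded[["user_id", "user_url"]], details], axis=1)
```

With this route no placeholder rows are created for unused slots. One caveat: a user with an empty review list explodes into a single missing value, so such rows would need a `dropna` before the `json_normalize` step.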


We check how many null or empty values the columns have. Some columns hold empty strings instead of None, so the report below counts no nulls even though those cells carry no information.

In [70]:
Herr.analizar_datos(data_normalizado)

Unnamed: 0,Nombre,Tipos de Datos Únicos,% de Valores No Nulos,% de Valores Nulos,Cantidad de Valores Nulos
0,user_id,[<class 'str'>],100.0,0.0,0
1,user_url,[<class 'str'>],100.0,0.0,0
2,funny,[<class 'str'>],100.0,0.0,0
3,posted,[<class 'str'>],100.0,0.0,0
4,last_edited,[<class 'str'>],100.0,0.0,0
5,item_id,[<class 'str'>],100.0,0.0,0
6,helpful,[<class 'str'>],100.0,0.0,0
7,recommend,[<class 'bool'>],100.0,0.0,0
8,review,[<class 'str'>],100.0,0.0,0


The empty strings are replaced with None, so that our function can count the actual nulls in this DataFrame.

In [71]:
data_normalizado = data_normalizado.replace('', None)
Herr.analizar_datos(data_normalizado)

Unnamed: 0,Nombre,Tipos de Datos Únicos,% de Valores No Nulos,% de Valores Nulos,Cantidad de Valores Nulos
0,user_id,[<class 'str'>],100.0,0.0,0
1,user_url,[<class 'str'>],100.0,0.0,0
2,funny,"[<class 'NoneType'>, <class 'str'>]",13.74,86.26,51154
3,posted,[<class 'str'>],100.0,0.0,0
4,last_edited,"[<class 'NoneType'>, <class 'str'>]",10.35,89.65,53165
5,item_id,[<class 'str'>],100.0,0.0,0
6,helpful,[<class 'str'>],100.0,0.0,0
7,recommend,[<class 'bool'>],100.0,0.0,0
8,review,"[<class 'str'>, <class 'NoneType'>]",99.95,0.05,30
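
As a small illustration of why the empty strings went unnoticed, the sketch below counts them explicitly and then converts them to missing values. It uses a regular expression so that whitespace-only cells are caught as well, which goes slightly beyond the exact-match replacement in the cell above; the toy data is made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "funny": ["", "1 person found this review funny", ""],
    "review": ["Git gud", "   ", "ok"],
})

# isna() does not see empty strings, so count them explicitly.
nulls_before = toy.isna().sum()
empty_before = (toy == "").sum()

# Turn empty or whitespace-only strings into real missing values.
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)
nulls_after = cleaned.isna().sum()
```

After the replacement, `isna()` reports the gaps that were previously hidden behind string values.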


The "funny" (86.26%) and "last_edited" (89.65%) columns have a very high share of nulls. They are dropped, because with so many values missing any imputation would be unreliable.

In [72]:
data_normalizado = data_normalizado.drop(columns=['funny','last_edited'])
data_normalizado

Unnamed: 0,user_id,user_url,posted,item_id,helpful,recommend,review
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"Posted November 5, 2011.",1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,js41637,http://steamcommunity.com/id/js41637,"Posted June 24, 2014.",251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
2,evcentric,http://steamcommunity.com/id/evcentric,Posted February 3.,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...
3,doctr,http://steamcommunity.com/id/doctr,"Posted October 14, 2013.",250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...
4,maplemage,http://steamcommunity.com/id/maplemage,"Posted April 15, 2014.",211420,35 of 43 people (81%) found this review helpful,True,Git gud
...,...,...,...,...,...,...,...
256785,BonnieMTD,http://steamcommunity.com/id/BonnieMTD,"Posted June 7, 2015.",400,No ratings yet,True,This is the best stratigy/puzzle game out ther...
256904,amillionlemons,http://steamcommunity.com/id/amillionlemons,"Posted March 16, 2015.",313120,No ratings yet,True,Not a bad game for alpha. I think with future ...
257402,keepit1hunid,http://steamcommunity.com/id/keepit1hunid,"Posted March 29, 2014.",17410,No ratings yet,True,Pretty great graphics and gameplay is no diffe...
257718,SKELETRONPRIMEISOP,http://steamcommunity.com/id/SKELETRONPRIMEISOP,"Posted August 9, 2014.",304930,No ratings yet,True,This game is so much fun but so challenging it...


We also remove the records with a null "review" (0.05%), since in a sentiment analysis they could fall into any class (Negative, Neutral, or Positive).  
As they represent only 30 reviews, dropping them has no meaningful impact.

In [73]:
data_normalizado = data_normalizado.dropna(subset=['review'])
data_normalizado

Unnamed: 0,user_id,user_url,posted,item_id,helpful,recommend,review
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"Posted November 5, 2011.",1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,js41637,http://steamcommunity.com/id/js41637,"Posted June 24, 2014.",251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
2,evcentric,http://steamcommunity.com/id/evcentric,Posted February 3.,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...
3,doctr,http://steamcommunity.com/id/doctr,"Posted October 14, 2013.",250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...
4,maplemage,http://steamcommunity.com/id/maplemage,"Posted April 15, 2014.",211420,35 of 43 people (81%) found this review helpful,True,Git gud
...,...,...,...,...,...,...,...
256785,BonnieMTD,http://steamcommunity.com/id/BonnieMTD,"Posted June 7, 2015.",400,No ratings yet,True,This is the best stratigy/puzzle game out ther...
256904,amillionlemons,http://steamcommunity.com/id/amillionlemons,"Posted March 16, 2015.",313120,No ratings yet,True,Not a bad game for alpha. I think with future ...
257402,keepit1hunid,http://steamcommunity.com/id/keepit1hunid,"Posted March 29, 2014.",17410,No ratings yet,True,Pretty great graphics and gameplay is no diffe...
257718,SKELETRONPRIMEISOP,http://steamcommunity.com/id/SKELETRONPRIMEISOP,"Posted August 9, 2014.",304930,No ratings yet,True,This game is so much fun but so challenging it...


We verify that no column has nulls or mixed data types.

In [74]:
Herr.analizar_datos(data_normalizado)

Unnamed: 0,Nombre,Tipos de Datos Únicos,% de Valores No Nulos,% de Valores Nulos,Cantidad de Valores Nulos
0,user_id,[<class 'str'>],100.0,0.0,0
1,user_url,[<class 'str'>],100.0,0.0,0
2,posted,[<class 'str'>],100.0,0.0,0
3,item_id,[<class 'str'>],100.0,0.0,0
4,helpful,[<class 'str'>],100.0,0.0,0
5,recommend,[<class 'bool'>],100.0,0.0,0
6,review,[<class 'str'>],100.0,0.0,0


Next, the "posted" column must keep only the year, since that value is what the API will consume later. Records whose date carries no year (for example "Posted February 3.", which has only a month and day) are assigned None. This is done with a function from our Herramientas module.

In [75]:
data_normalizado['posted'] = data_normalizado['posted'].apply(Herr.extraccion_anio)
data_normalizado

Unnamed: 0,user_id,user_url,posted,item_id,helpful,recommend,review
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,2011,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,js41637,http://steamcommunity.com/id/js41637,2014,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
2,evcentric,http://steamcommunity.com/id/evcentric,,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...
3,doctr,http://steamcommunity.com/id/doctr,2013,250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...
4,maplemage,http://steamcommunity.com/id/maplemage,2014,211420,35 of 43 people (81%) found this review helpful,True,Git gud
...,...,...,...,...,...,...,...
256785,BonnieMTD,http://steamcommunity.com/id/BonnieMTD,2015,400,No ratings yet,True,This is the best stratigy/puzzle game out ther...
256904,amillionlemons,http://steamcommunity.com/id/amillionlemons,2015,313120,No ratings yet,True,Not a bad game for alpha. I think with future ...
257402,keepit1hunid,http://steamcommunity.com/id/keepit1hunid,2014,17410,No ratings yet,True,Pretty great graphics and gameplay is no diffe...
257718,SKELETRONPRIMEISOP,http://steamcommunity.com/id/SKELETRONPRIMEISOP,2014,304930,No ratings yet,True,This game is so much fun but so challenging it...
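
The code of `Herr.extraccion_anio` is not shown here. A minimal sketch of one possible approach follows, assuming the year always appears as a four-digit number in the text; the name `extract_year` and the regular expression are assumptions, not the module's actual implementation.

```python
import re

# A standalone run of exactly four digits is treated as the year.
YEAR_PATTERN = re.compile(r"\b(\d{4})\b")


def extract_year(posted):
    """Hypothetical stand-in for Herr.extraccion_anio."""
    if not isinstance(posted, str):
        return None
    match = YEAR_PATTERN.search(posted)
    # Return the year as a string, or None when the text has no year.
    return match.group(1) if match else None


examples = [
    "Posted November 5, 2011.",
    "Posted February 3.",
    "Posted June 24, 2014.",
]
years = [extract_year(text) for text in examples]
```

Returning the year as a string mirrors the state of the "posted" column at this point in the notebook, which is only converted to a number in a later step.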


The earlier type report showed that "posted" is a string column, so it is first converted to numeric, which turns the None values into NaN, so that it can later be cast to integer (int).

In [76]:
data_normalizado['posted'] = pd.to_numeric(data_normalizado['posted'],errors='coerce')

A boolean mask named "nones" is created to flag the records whose "posted" value is missing.  
The flags are then summed to decide what to do with those records.

In [77]:
nones = data_normalizado['posted'].isna()
contador = nones.sum()
contador

10116

The median of the years is computed, ignoring missing values, and used to impute the missing entries in that column.

In [78]:
mediana = np.nanmedian(data_normalizado['posted'])
data_normalizado['posted'] = data_normalizado['posted'].fillna(mediana)

The "posted" column is cast to integer (int).

In [79]:
data_normalizado['posted'] = data_normalizado['posted'].astype(int)
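
The three steps above (numeric conversion, median imputation, integer cast) can be traced on a tiny made-up Series. The values below are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

posted = pd.Series(["2011", None, "2014", "2015", None])

# Strings become floats; None (or anything unparseable) becomes NaN.
as_number = pd.to_numeric(posted, errors="coerce")
missing = int(as_number.isna().sum())

# Median of the known years, ignoring NaN.
median_year = np.nanmedian(as_number)

# Fill the gaps, then cast: the cast only works once no NaN remains.
filled = as_number.fillna(median_year).astype(int)
```

One detail to keep in mind: with an even number of known years the median can end in .5, and `astype(int)` then truncates it rather than rounding.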

The "helpful" column is dropped, since it is not used later and provides no relevant information.

In [80]:
data_normalizado = data_normalizado.drop('helpful',axis=1)

A final check confirms the column types and the absence of nulls in the cleaned DataFrame.

In [81]:
Herr.analizar_datos(data_normalizado)

Unnamed: 0,Nombre,Tipos de Datos Únicos,% de Valores No Nulos,% de Valores Nulos,Cantidad de Valores Nulos
0,user_id,[<class 'str'>],100.0,0.0,0
1,user_url,[<class 'str'>],100.0,0.0,0
2,posted,[<class 'int'>],100.0,0.0,0
3,item_id,[<class 'str'>],100.0,0.0,0
4,recommend,[<class 'bool'>],100.0,0.0,0
5,review,[<class 'str'>],100.0,0.0,0


### Exporting the data

Using a function from our Herramientas module, we export the transformed file.

In [82]:
Herr.export_data_csv('../datasets/australian_reviews.csv',data_normalizado)
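
`Herr.export_data_csv` is defined outside this notebook. A minimal sketch of an equivalent export is shown below, assuming it wraps `DataFrame.to_csv` and keeps the (path, DataFrame) argument order seen in the call above. The name `export_csv`, the `index=False` choice, and the toy data are assumptions.

```python
import os
import tempfile

import pandas as pd


def export_csv(path, df):
    """Hypothetical stand-in for Herr.export_data_csv."""
    # index=False keeps the melt-derived row index out of the file.
    df.to_csv(path, index=False, encoding="utf-8")


toy = pd.DataFrame({
    "user_id": ["a"],
    "posted": [2014],
    "item_id": ["1250"],
    "recommend": [True],
})

out_path = os.path.join(tempfile.mkdtemp(), "toy_reviews.csv")
export_csv(out_path, toy)

# CSV stores no types: item_id would be read back as an integer
# unless the dtype is stated explicitly.
restored = pd.read_csv(out_path, dtype={"item_id": str})
```

Whoever reads the exported file should state the type of "item_id" if it must remain a string, as the read-back line illustrates.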