In [62]:
import pandas as pd
import ast
import json


%load_ext autoreload
%autoreload 2
import utils

import warnings
warnings.filterwarnings("ignore")

In [58]:
#pip install textblob

Collecting textblob
  Using cached textblob-0.17.1-py2.py3-none-any.whl (636 kB)
Installing collected packages: textblob
Successfully installed textblob-0.17.1
Note: you may need to restart the kernel to use updated packages.




## Dataset  `australian_user_reviews`

This section loads the dataset containing the reviews written by users.

In [2]:
# Path to the australian_user_reviews dataset
ruta_review = 'data/australian_user_reviews.json'

# Each line of the file is read and parsed as a Python literal
filas_review = []
with open(ruta_review, encoding='MacRoman') as f:
    for line in f.readlines():
        filas_review.append(ast.literal_eval(line))

# The parsed rows are converted into a dataframe
df_reviews = pd.DataFrame(filas_review)
df_reviews

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."
...,...,...,...
25794,76561198306599751,http://steamcommunity.com/profiles/76561198306...,"[{'funny': '', 'posted': 'Posted May 31.', 'la..."
25795,Ghoustik,http://steamcommunity.com/id/Ghoustik,"[{'funny': '', 'posted': 'Posted June 17.', 'l..."
25796,76561198310819422,http://steamcommunity.com/profiles/76561198310...,"[{'funny': '1 person found this review funny',..."
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,"[{'funny': '', 'posted': 'Posted July 21.', 'l..."
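
The loading cell uses `ast.literal_eval` rather than `json.loads`, presumably because each line is written as a Python literal (single quotes, `True`/`False`) instead of strict JSON. The sketch below illustrates the difference on a made-up line shaped like the ones in the file; the `parse_line` helper and the sample line are assumptions for illustration, not part of the original notebook.

```python
import ast
import json

# Hypothetical line, shaped like those in the file: Python literal syntax, not JSON
line = "{'user_id': 'js41637', 'reviews': [{'recommend': True, 'review': 'ok'}]}"

def parse_line(text):
    """Parse one line, trying strict JSON first and falling back to a Python literal."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # literal_eval only evaluates literals (dict, list, str, bool, ...), never arbitrary code
        return ast.literal_eval(text)

record = parse_line(line)
```

A helper like this would let the same loop read both kinds of file, at the cost of one failed JSON attempt per line on the literal-style files.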


In [28]:
# Data types and the presence of nulls are checked
utils.verificar_tipo_datos(df_reviews)

Unnamed: 0,nombre_campo,tipo_datos,no_nulos_%,nulos_%,nulos
0,user_id,[<class 'str'>],100.0,0.0,0
1,user_url,[<class 'str'>],100.0,0.0,0
2,reviews,[<class 'list'>],100.0,0.0,0
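
The `utils` module is not shown in this notebook, so the exact implementation of `verificar_tipo_datos` is unknown. As a rough idea of what such a helper might do, the following sketch builds a summary with the same column names as the output above. The function name `summarize_columns` and its logic are assumptions, not the author's code.

```python
import pandas as pd

def summarize_columns(df):
    """Hypothetical stand-in for utils.verificar_tipo_datos (the real helper is not shown)."""
    rows = []
    for col in df.columns:
        n_null = int(df[col].isna().sum())
        pct_null = round(100 * n_null / len(df), 2) if len(df) else 0.0
        rows.append({
            "nombre_campo": col,
            # Python types actually found in the column, element by element
            "tipo_datos": df[col].map(type).unique().tolist(),
            "no_nulos_%": round(100 - pct_null, 2),
            "nulos_%": pct_null,
            "nulos": n_null,
        })
    return pd.DataFrame(rows)

# Toy data: the second user has no reviews at all
demo = pd.DataFrame({"user_id": ["a", "b"], "reviews": [[{"x": 1}], None]})
summary = summarize_columns(demo)
```

Listing the element types per column, rather than the pandas dtype, is what makes mixed columns (for example `str` and `NoneType`) visible.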


In [4]:
# The content of one 'reviews' entry is inspected
df_reviews['reviews'][0]

[{'funny': '',
  'posted': 'Posted November 5, 2011.',
  'last_edited': '',
  'item_id': '1250',
  'helpful': 'No ratings yet',
  'recommend': True,
  'review': 'Simple yet with great replayability. In my opinion does "zombie" hordes and team work better than left 4 dead plus has a global leveling system. Alot of down to earth "zombie" splattering fun for the whole family. Amazed this sort of FPS is so rare.'},
 {'funny': '',
  'posted': 'Posted July 15, 2011.',
  'last_edited': '',
  'item_id': '22200',
  'helpful': 'No ratings yet',
  'recommend': True,
  'review': "It's unique and worth a playthrough."},
 {'funny': '',
  'posted': 'Posted April 21, 2011.',
  'last_edited': '',
  'item_id': '43110',
  'helpful': 'No ratings yet',
  'recommend': True,
  'review': 'Great atmosphere. The gunplay can be a bit chunky at times but at the end of the day this game is definitely worth it and I hope they do a sequel...so buy the game so I get a sequel!'}]

This dataset has 3 columns and 25799 rows, with no null values. The columns are:

* user_id: the user's identifier; some are purely numeric, others are text or a mix of both.
* user_url: the URL of the user's profile on steamcommunity.
* reviews: a list of dictionaries. Each user has one or more dictionaries, one per review. Each dictionary contains:
    * funny: indicates whether other users marked the review as funny; it is either an empty string or text such as '3 people found this review funny'.
    * posted: the posting date as text, for example 'Posted April 21, 2011.'; some entries omit the year ('Posted February 3.').
    * last_edited: usually an empty string; when present it holds text such as 'Last edited November 3, 2014.'.
    * item_id: a numeric identifier stored as a string (for example '1250'); it appears to identify the reviewed item rather than the review itself.
    * helpful: a text string in which other users indicate whether the review was useful.
    * recommend: a True/False value.
    * review: a text string with the user's comments about the game.
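
Since `posted` is free text rather than a date, it would need parsing before any time-based analysis. The sketch below shows one possible approach with a regular expression and `datetime.strptime`; the `parse_posted` helper and its `default_year` parameter are assumptions introduced here, and how to treat entries without a year is left to the analyst.

```python
import re
from datetime import date, datetime

# 'Posted <Month> <day>, <year>.' where the ', <year>' part is optional
POSTED_RE = re.compile(r"Posted (\w+) (\d{1,2})(?:, (\d{4}))?\.")

def parse_posted(text, default_year=None):
    """Return a date for strings like 'Posted April 21, 2011.', or None if unparseable.

    Entries without a year ('Posted February 3.') use default_year when it is given.
    """
    match = POSTED_RE.fullmatch(text.strip())
    if match is None:
        return None
    month, day, year = match.groups()
    year = year or default_year
    if year is None:
        return None
    try:
        return datetime.strptime(f"{month} {day} {year}", "%B %d %Y").date()
    except ValueError:
        # e.g. an unknown month name or an impossible day
        return None

example = parse_posted("Posted April 21, 2011.")
same_day = example == date(2011, 4, 21)
```

Month names are matched with `%B`, which assumes an English locale.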

First, the **user_url** column is dropped, since it will not be used in the analysis.

In [31]:
df_reviews = df_reviews.drop(columns=['user_url'])

Next, one column is generated for each dictionary in the list.

In [34]:
df_reviews.head()

Unnamed: 0,user_id,reviews
0,76561197970982479,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,"[{'funny': '3 people found this review funny',..."
...,...,...
25794,76561198306599751,"[{'funny': '', 'posted': 'Posted May 31.', 'la..."
25795,Ghoustik,"[{'funny': '', 'posted': 'Posted June 17.', 'l..."
25796,76561198310819422,"[{'funny': '1 person found this review funny',..."
25797,76561198312638244,"[{'funny': '', 'posted': 'Posted July 21.', 'l..."


In [35]:
df_reviews2 = pd.json_normalize(df_reviews['reviews'])
df_reviews2.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,"{'funny': '', 'posted': 'Posted November 5, 20...","{'funny': '', 'posted': 'Posted July 15, 2011....","{'funny': '', 'posted': 'Posted April 21, 2011...",,,,,,,
1,"{'funny': '', 'posted': 'Posted June 24, 2014....","{'funny': '', 'posted': 'Posted September 8, 2...","{'funny': '', 'posted': 'Posted November 29, 2...",,,,,,,
2,"{'funny': '', 'posted': 'Posted February 3.', ...","{'funny': '', 'posted': 'Posted December 4, 20...","{'funny': '', 'posted': 'Posted November 3, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...",,,,
3,"{'funny': '', 'posted': 'Posted October 14, 20...","{'funny': '', 'posted': 'Posted July 28, 2012....","{'funny': '', 'posted': 'Posted June 2, 2012.'...","{'funny': '', 'posted': 'Posted June 29, 2014....","{'funny': '', 'posted': 'Posted November 22, 2...","{'funny': '', 'posted': 'Posted February 23, 2...",,,,
4,"{'funny': '3 people found this review funny', ...","{'funny': '1 person found this review funny', ...","{'funny': '2 people found this review funny', ...","{'funny': '', 'posted': 'Posted July 11, 2013....",,,,,,
...,...,...,...,...,...,...,...,...,...,...
25794,"{'funny': '', 'posted': 'Posted May 31.', 'las...",,,,,,,,,
25795,"{'funny': '', 'posted': 'Posted June 17.', 'la...",,,,,,,,,
25796,"{'funny': '1 person found this review funny', ...",,,,,,,,,
25797,"{'funny': '', 'posted': 'Posted July 21.', 'la...","{'funny': '', 'posted': 'Posted July 10.', 'la...","{'funny': '', 'posted': 'Posted July 10.', 'la...","{'funny': '', 'posted': 'Posted July 8.', 'las...",,,,,,


In [36]:
# The user_id is added to the separated columns
df_reviews2 = pd.concat([df_reviews['user_id'], df_reviews2], axis=1)
df_reviews2.head()

Unnamed: 0,user_id,0,1,2,3,4,5,6,7,8,9
0,76561197970982479,"{'funny': '', 'posted': 'Posted November 5, 20...","{'funny': '', 'posted': 'Posted July 15, 2011....","{'funny': '', 'posted': 'Posted April 21, 2011...",,,,,,,
1,js41637,"{'funny': '', 'posted': 'Posted June 24, 2014....","{'funny': '', 'posted': 'Posted September 8, 2...","{'funny': '', 'posted': 'Posted November 29, 2...",,,,,,,
2,evcentric,"{'funny': '', 'posted': 'Posted February 3.', ...","{'funny': '', 'posted': 'Posted December 4, 20...","{'funny': '', 'posted': 'Posted November 3, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...","{'funny': '', 'posted': 'Posted October 15, 20...",,,,
3,doctr,"{'funny': '', 'posted': 'Posted October 14, 20...","{'funny': '', 'posted': 'Posted July 28, 2012....","{'funny': '', 'posted': 'Posted June 2, 2012.'...","{'funny': '', 'posted': 'Posted June 29, 2014....","{'funny': '', 'posted': 'Posted November 22, 2...","{'funny': '', 'posted': 'Posted February 23, 2...",,,,
4,maplemage,"{'funny': '3 people found this review funny', ...","{'funny': '1 person found this review funny', ...","{'funny': '2 people found this review funny', ...","{'funny': '', 'posted': 'Posted July 11, 2013....",,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
25794,76561198306599751,"{'funny': '', 'posted': 'Posted May 31.', 'las...",,,,,,,,,
25795,Ghoustik,"{'funny': '', 'posted': 'Posted June 17.', 'la...",,,,,,,,,
25796,76561198310819422,"{'funny': '1 person found this review funny', ...",,,,,,,,,
25797,76561198312638244,"{'funny': '', 'posted': 'Posted July 21.', 'la...","{'funny': '', 'posted': 'Posted July 10.', 'la...","{'funny': '', 'posted': 'Posted July 10.', 'la...","{'funny': '', 'posted': 'Posted July 8.', 'las...",,,,,,


In [37]:
# pd.melt is used to turn the columns into rows while keeping the user_id
# Note: range(9) selects columns 0 to 8, so column 9 (a tenth review) is not melted
df_reviews2 = pd.melt(df_reviews2, id_vars=['user_id'], 
                       value_vars=list(range(9)),
                       value_name='reviews')
df_reviews2.head()

Unnamed: 0,user_id,variable,reviews
0,76561197970982479,0,"{'funny': '', 'posted': 'Posted November 5, 20..."
1,js41637,0,"{'funny': '', 'posted': 'Posted June 24, 2014...."
2,evcentric,0,"{'funny': '', 'posted': 'Posted February 3.', ..."
3,doctr,0,"{'funny': '', 'posted': 'Posted October 14, 20..."
4,maplemage,0,"{'funny': '3 people found this review funny', ..."
...,...,...,...
232186,76561198306599751,8,
232187,Ghoustik,8,
232188,76561198310819422,8,
232189,76561198312638244,8,


In [40]:
# Each 'user_id' now appears once for every dictionary it had in 'reviews'
# The None values come from columns that only hold dictionaries for other users. These rows must be removed.
df_reviews2[df_reviews2['user_id']=='76561197970982479']

Unnamed: 0,user_id,variable,reviews
0,76561197970982479,0,"{'funny': '', 'posted': 'Posted November 5, 20..."
25799,76561197970982479,1,"{'funny': '', 'posted': 'Posted July 15, 2011...."
51598,76561197970982479,2,"{'funny': '', 'posted': 'Posted April 21, 2011..."
77397,76561197970982479,3,
103196,76561197970982479,4,
128995,76561197970982479,5,
154794,76561197970982479,6,
180593,76561197970982479,7,
206392,76561197970982479,8,


In [42]:
# Rows with a None value are removed
df_reviews2 = df_reviews2.dropna()
# Check that each 'user_id' keeps only as many rows as it had dictionaries
df_reviews2[df_reviews2['user_id']=='76561197970982479']

Unnamed: 0,user_id,variable,reviews
0,76561197970982479,0,"{'funny': '', 'posted': 'Posted November 5, 20..."
25799,76561197970982479,1,"{'funny': '', 'posted': 'Posted July 15, 2011...."
51598,76561197970982479,2,"{'funny': '', 'posted': 'Posted April 21, 2011..."
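
The `json_normalize`, `melt` and `dropna` steps above reshape the data so that each row holds one user and one review dictionary. As an alternative sketch, `DataFrame.explode` produces the same long shape in one step and does not depend on knowing how many reviews the most active user has. The toy frame below is made up for illustration.

```python
import pandas as pd

# Toy frame with the same shape as df_reviews: one list of review dicts per user
toy = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "reviews": [
        [{"item_id": "10", "recommend": True}, {"item_id": "20", "recommend": False}],
        [{"item_id": "30", "recommend": True}],
    ],
})

# explode() emits one row per list element and repeats user_id alongside it.
# A user with an empty list would get a single NaN row, which dropna removes.
long = toy.explode("reviews").dropna(subset=["reviews"]).reset_index(drop=True)
```

Because `user_id` travels with each exploded row, no later concatenation is needed to recover it.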


Now each dictionary can be expanded into columns.

In [44]:
# Each key of the review dictionaries is split into its own column
df_reviews = df_reviews2['reviews'].apply(pd.Series, dtype='object')
df_reviews = df_reviews.add_prefix('reviews_')
df_reviews.head()

Unnamed: 0,reviews_funny,reviews_posted,reviews_last_edited,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review
0,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
2,,Posted February 3.,,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...
3,,"Posted October 14, 2013.",,250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...
4,3 people found this review funny,"Posted April 15, 2014.",,211420,35 of 43 people (81%) found this review helpful,True,Git gud
...,...,...,...,...,...,...,...
231919,,"Posted August 15, 2014.","Last edited November 3, 2014.",440,No ratings yet,True,TF2 is alot of fun and its really good but the...
231921,,"Posted August 2, 2014.",,304930,No ratings yet,True,Fun game with friends
232047,,"Posted July 31, 2015.",,265630,No ratings yet,True,So Fun!! :D
232127,,"Posted December 20, 2015.",,304050,No ratings yet,True,"This game is great. The only thing is,Why cant..."


The previous step dropped the 'user_id' column, so it is concatenated back.

In [46]:
# The 'user_id' column is joined back
df_reviews = pd.concat([df_reviews2['user_id'], df_reviews], axis=1)
df_reviews

Unnamed: 0,user_id,reviews_funny,reviews_posted,reviews_last_edited,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review
0,76561197970982479,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
2,evcentric,,Posted February 3.,,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...
3,doctr,,"Posted October 14, 2013.",,250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...
4,maplemage,3 people found this review funny,"Posted April 15, 2014.",,211420,35 of 43 people (81%) found this review helpful,True,Git gud
...,...,...,...,...,...,...,...,...
231919,SKELETRONPRIMEISOP,,"Posted August 15, 2014.","Last edited November 3, 2014.",440,No ratings yet,True,TF2 is alot of fun and its really good but the...
231921,76561198141079508,,"Posted August 2, 2014.",,304930,No ratings yet,True,Fun game with friends
232047,ShadowYT100,,"Posted July 31, 2015.",,265630,No ratings yet,True,So Fun!! :D
232127,bestcustomurlevermade,,"Posted December 20, 2015.",,304050,No ratings yet,True,"This game is great. The only thing is,Why cant..."
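
Expanding dictionaries with `.apply(pd.Series)` works, but it builds one Series per row and can be slow on a frame of this size. A possible alternative, sketched here on toy data, is to build the columns directly from the list of dictionaries while reusing the original index, so that `user_id` can be joined back by index. The variable names are illustrative only.

```python
import pandas as pd

long = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "reviews": [
        {"item_id": "10", "recommend": True},
        {"item_id": "20", "recommend": False},
        {"item_id": "30", "recommend": True},
    ],
}, index=[0, 5, 9])  # non-contiguous index, as is the case after dropna()

# Build the columns straight from the list of dicts, keeping the original index
expanded = pd.DataFrame(long["reviews"].tolist(), index=long.index).add_prefix("reviews_")

# join() aligns on the index, so each user_id stays attached to the right row
flat = long[["user_id"]].join(expanded)
```

Passing `index=long.index` matters: without it the new frame would be numbered from zero and the join would misalign rows.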


Some columns have missing values that are not stored as nulls; they probably hold an empty or blank string. This is checked below.

In [48]:
df_reviews['reviews_last_edited'][0]

''

Those empty strings are replaced with null values.

In [50]:
df_reviews.replace('', None, inplace=True)
df_reviews

Unnamed: 0,user_id,reviews_funny,reviews_posted,reviews_last_edited,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review
0,76561197970982479,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
2,evcentric,,Posted February 3.,,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...
3,doctr,,"Posted October 14, 2013.",,250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...
4,maplemage,3 people found this review funny,"Posted April 15, 2014.",,211420,35 of 43 people (81%) found this review helpful,True,Git gud
...,...,...,...,...,...,...,...,...
231919,SKELETRONPRIMEISOP,,"Posted August 15, 2014.","Last edited November 3, 2014.",440,No ratings yet,True,TF2 is alot of fun and its really good but the...
231921,76561198141079508,,"Posted August 2, 2014.",,304930,No ratings yet,True,Fun game with friends
232047,ShadowYT100,,"Posted July 31, 2015.",,265630,No ratings yet,True,So Fun!! :D
232127,bestcustomurlevermade,,"Posted December 20, 2015.",,304050,No ratings yet,True,"This game is great. The only thing is,Why cant..."
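
The replacement above passes `None` as a positional value. Some older pandas releases reportedly treated a scalar `value=None` as "no value given", which could change the result, so a dictionary mapping is sketched here as a form that states the replacement explicitly. The toy data is made up, and the behaviour should be checked against the installed pandas version.

```python
import pandas as pd

toy = pd.DataFrame({
    "reviews_funny": ["", "3 people found this review funny"],
    "reviews_last_edited": ["", ""],
})

# Dict form: every empty string is mapped to None explicitly
cleaned = toy.replace({"": None})

# Share of nulls per column, in percent
null_share = cleaned.isna().mean() * 100
```

Checking `isna()` afterwards confirms that the empty strings are now counted as missing.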


The data types and remaining nulls are reviewed after unnesting the 'reviews' column.

In [51]:
utils.verificar_tipo_datos(df_reviews)

Unnamed: 0,nombre_campo,tipo_datos,no_nulos_%,nulos_%,nulos
0,user_id,[<class 'str'>],100.0,0.0,0
1,reviews_funny,"[<class 'NoneType'>, <class 'str'>]",13.77,86.23,50904
2,reviews_posted,[<class 'str'>],100.0,0.0,0
3,reviews_last_edited,"[<class 'NoneType'>, <class 'str'>]",10.3,89.7,52953
4,reviews_item_id,[<class 'str'>],100.0,0.0,0
5,reviews_helpful,[<class 'str'>],100.0,0.0,0
6,reviews_recommend,[<class 'bool'>],100.0,0.0,0
7,reviews_review,"[<class 'str'>, <class 'NoneType'>]",99.95,0.05,30


About 86% of the values in 'reviews_funny' and about 90% in 'reviews_last_edited' are missing, so both columns are dropped. The review text itself ('reviews_review') is missing in only 0.05% of the records (30 rows); those records are kept and will be treated as neutral comments.

In [57]:
# The 'reviews_funny' and 'reviews_last_edited' columns are dropped
df_reviews = df_reviews.drop(columns=['reviews_funny', 'reviews_last_edited'])
df_reviews.columns

Index(['user_id', 'reviews_posted', 'reviews_item_id', 'reviews_helpful',
       'reviews_recommend', 'reviews_review'],
      dtype='object')

### Sentiment analysis

The task is to create a new column called 'sentiment_analysis', intended to replace 'reviews_review', that holds the result of an NLP sentiment analysis on the following scale:

* 0 if it is negative,
* 1 if it is neutral or there is no review,
* 2 if it is positive.
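
The `utils.analyze_sentiment` function is not shown in this notebook. Since `textblob` is installed at the top, one plausible shape for it is sketched below: TextBlob's polarity score (between -1 and 1) is mapped onto the 0/1/2 scale. The function names and the 0.2 threshold are assumptions made for this sketch and may differ from the author's choices.

```python
def polarity_to_label(polarity, threshold=0.2):
    """Map a polarity score in [-1, 1] to 0 (negative), 1 (neutral) or 2 (positive).

    The threshold is an assumption made for this sketch.
    """
    if polarity < -threshold:
        return 0
    if polarity > threshold:
        return 2
    return 1

def analyze_sentiment_sketch(review, threshold=0.2):
    """Hypothetical stand-in for utils.analyze_sentiment; missing reviews count as neutral."""
    if not isinstance(review, str) or not review.strip():
        return 1
    # Imported lazily so the mapping above can be used without textblob installed
    from textblob import TextBlob
    return polarity_to_label(TextBlob(review).sentiment.polarity, threshold)
```

Keeping the threshold logic in its own function makes it easy to adjust the cut-off and to test the mapping without running the NLP model.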

In [64]:
df_reviews['sentiment_analysis'] = df_reviews['reviews_review'].apply(utils.analyze_sentiment)
df_reviews

Unnamed: 0,user_id,reviews_posted,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review,sentiment_analysis
0,76561197970982479,"Posted November 5, 2011.",1250,No ratings yet,True,Simple yet with great replayability. In my opi...,1
1,js41637,"Posted June 24, 2014.",251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,1
2,evcentric,Posted February 3.,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...,2
3,doctr,"Posted October 14, 2013.",250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...,2
4,maplemage,"Posted April 15, 2014.",211420,35 of 43 people (81%) found this review helpful,True,Git gud,1
...,...,...,...,...,...,...,...
231919,SKELETRONPRIMEISOP,"Posted August 15, 2014.",440,No ratings yet,True,TF2 is alot of fun and its really good but the...,1
231921,76561198141079508,"Posted August 2, 2014.",304930,No ratings yet,True,Fun game with friends,1
232047,ShadowYT100,"Posted July 31, 2015.",265630,No ratings yet,True,So Fun!! :D,2
232127,bestcustomurlevermade,"Posted December 20, 2015.",304050,No ratings yet,True,"This game is great. The only thing is,Why cant...",1


In [67]:
df_reviews[df_reviews['sentiment_analysis']==2]

Unnamed: 0,user_id,reviews_posted,reviews_item_id,reviews_helpful,reviews_recommend,reviews_review,sentiment_analysis
2,evcentric,Posted February 3.,248820,No ratings yet,True,A suitably punishing roguelike platformer. Wi...,2
3,doctr,"Posted October 14, 2013.",250320,2 of 2 people (100%) found this review helpful,True,This game... is so fun. The fight sequences ha...,2
9,76561198156664158,Posted June 16.,252950,0 of 1 people (0%) found this review helpful,True,love it,2
15,Rainbow-Dashie,"Posted June 19, 2015.",730,1 of 3 people (33%) found this review helpful,True,Good Game It Was Very Fun!!!!,2
19,devvonst,"Posted February 12, 2014.",440,6 of 9 people (67%) found this review helpful,True,Best first person shooter,2
...,...,...,...,...,...,...,...
230279,ninjagato,"Posted April 18, 2014.",51100,No ratings yet,True,esse jogo √© muitom bom! paresse counter strik...,2
230975,JustHacksNoSkill,"Posted May 16, 2014.",233270,No ratings yet,True,Its sexual Enough,2
231213,MsVen,"Posted July 24, 2013.",226980,No ratings yet,True,Comes with one free table so you can compete a...,2
231855,76561198133319761,"Posted June 27, 2014.",238460,No ratings yet,True,Its a great game if your up for a challenge or...,2


## Dataset `output_steam_games`

In [5]:
# Path to the output_steam_games dataset
ruta_games = 'data/output_steam_games.json'

# Each line of the file is read and parsed as JSON
filas_games = []
with open(ruta_games, encoding='MacRoman') as f:
    for line in f.readlines():
        data = json.loads(line)
        filas_games.append(data)

# The parsed rows are converted into a dataframe
df_games = pd.DataFrame(filas_games)
df_games

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,,,,,,,,,,,,,
1,,,,,,,,,,,,,
2,,,,,,,,,,,,,
3,,,,,,,,,,,,,
4,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
120440,Ghost_RUS Games,"[Casual, Indie, Simulation, Strategy]",Colony On Mars,Colony On Mars,http://store.steampowered.com/app/773640/Colon...,2018-01-04,"[Strategy, Indie, Casual, Simulation]",http://steamcommunity.com/app/773640/reviews/?...,"[Single-player, Steam Achievements]",1.99,False,773640,"Nikita ""Ghost_RUS"""
120441,Sacada,"[Casual, Indie, Strategy]",LOGistICAL: South Africa,LOGistICAL: South Africa,http://store.steampowered.com/app/733530/LOGis...,2018-01-04,"[Strategy, Indie, Casual]",http://steamcommunity.com/app/733530/reviews/?...,"[Single-player, Steam Achievements, Steam Clou...",4.99,False,733530,Sacada
120442,Laush Studio,"[Indie, Racing, Simulation]",Russian Roads,Russian Roads,http://store.steampowered.com/app/610660/Russi...,2018-01-04,"[Indie, Simulation, Racing]",http://steamcommunity.com/app/610660/reviews/?...,"[Single-player, Steam Achievements, Steam Trad...",1.99,False,610660,Laush Dmitriy Sergeevich
120443,SIXNAILS,"[Casual, Indie]",EXIT 2 - Directions,EXIT 2 - Directions,http://store.steampowered.com/app/658870/EXIT_...,2017-09-02,"[Indie, Casual, Puzzle, Singleplayer, Atmosphe...",http://steamcommunity.com/app/658870/reviews/?...,"[Single-player, Steam Achievements, Steam Cloud]",4.99,False,658870,"xropi,stev3ns"


In [29]:
# Data types and the presence of nulls are checked
utils.verificar_tipo_datos(df_games)

Unnamed: 0,nombre_campo,tipo_datos,no_nulos_%,nulos_%,nulos
0,publisher,"[<class 'float'>, <class 'str'>]",20.0,80.0,96362
1,genres,"[<class 'float'>, <class 'list'>]",23.95,76.05,91593
2,app_name,"[<class 'float'>, <class 'str'>]",26.68,73.32,88312
3,title,"[<class 'float'>, <class 'str'>]",24.98,75.02,90360
4,url,"[<class 'float'>, <class 'str'>]",26.68,73.32,88310
5,release_date,"[<class 'float'>, <class 'str'>]",24.96,75.04,90377
6,tags,"[<class 'float'>, <class 'list'>]",26.54,73.46,88473
7,reviews_url,"[<class 'float'>, <class 'str'>]",26.68,73.32,88312
8,specs,"[<class 'float'>, <class 'list'>]",26.12,73.88,88980
9,price,"[<class 'float'>, <class 'str'>]",25.54,74.46,89687


In [15]:
# The kind of data stored in 'genres' is inspected
df_games['genres'][101505]

['Action', 'Adventure', 'Free to Play', 'RPG', 'Early Access']

In [16]:
# The kind of data stored in 'tags' is inspected
df_games['tags'][101505]

['Early Access', 'Free to Play', 'Action', 'Adventure', 'RPG']

In [18]:
# The kind of data stored in 'specs' is inspected
df_games['specs'][101500]

['Single-player', 'Multi-player']

In [20]:
# The kind of data stored in 'price' is inspected
df_games['price'].unique()

array([nan, 4.99, 'Free To Play', 'Free to Play', 0.99, 2.99, 3.99, 9.99,
       18.99, 29.99, 'Free', 10.99, 1.59, 14.99, 1.99, 59.99, 8.99, 6.99,
       7.99, 39.99, 19.99, 7.49, 12.99, 5.99, 2.49, 15.99, 1.25, 24.99,
       17.99, 61.99, 3.49, 11.99, 13.99, 'Free Demo', 'Play for Free!',
       34.99, 74.76, 1.49, 32.99, 99.99, 14.95, 69.99, 16.99, 79.99,
       49.99, 5.0, 44.99, 13.98, 29.96, 119.99, 109.99, 149.99, 771.71,
       'Install Now', 21.99, 89.99, 'Play WARMACHINE: Tactics Demo', 0.98,
       139.92, 4.29, 64.99, 'Free Mod', 54.99, 74.99, 'Install Theme',
       0.89, 'Third-party', 0.5, 'Play Now', 299.99, 1.29, 3.0, 15.0,
       5.49, 23.99, 49.0, 20.99, 10.93, 1.39, 'Free HITMAN™ Holiday Pack',
       36.99, 4.49, 2.0, 4.0, 9.0, 234.99, 1.95, 1.5, 199.0, 189.0, 6.66,
       27.99, 10.49, 129.99, 179.0, 26.99, 399.99, 31.99, 399.0, 20.0,
       40.0, 3.33, 199.99, 22.99, 320.0, 38.85, 71.7, 59.95, 995.0, 27.49,
       3.39, 6.0, 19.95, 499.99, 16.06, 4.68, 131.4, 44.

In [21]:
# The kind of data stored in 'early_access' is inspected
df_games['early_access'].unique()

array([nan, False, True], dtype=object)

This dataset has 13 columns and 120445 rows, with nulls in every column. The columns are:

* publisher: the company that publishes the content. Names appear in both upper and lower case.
* genres: the genres of the content, as a list of one or more genres per record.
* app_name: the name of the content.
* title: the title of the content.
* url: the URL where the content is published.
* release_date: the release date, in the format 2018-01-04.
* tags: the tags of the content, as a list of one or more tags per record.
* reviews_url: the URL where the reviews are found.
* specs: the specifications of each record, as a list of one or more strings.
* price: the price of the content. Some values are not numeric but text related to the price, such as 'Free To Play'.
* early_access: indicates early access with a True/False value.
* id: the unique identifier of the content.
* developer: the developer of the content.
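
Because `price` mixes numbers with text, it would need cleaning before any numeric use. The sketch below shows one possible rule: numbers are kept, labels that suggest free content become 0.0, and any other text becomes NaN. The `price_to_float` helper and the choice of which labels count as free are assumptions, not decisions taken in this notebook.

```python
import math

# Assumption: these labels mean the content costs nothing; extend as needed
FREE_LABELS = {"free", "free to play", "play for free!", "free mod", "free demo"}

def price_to_float(value):
    """Return the price as a float, 0.0 for free-style labels, NaN otherwise."""
    if isinstance(value, (int, float)):
        return float(value)  # NaN passes through unchanged
    if isinstance(value, str):
        text = value.strip().lower()
        if text in FREE_LABELS:
            return 0.0
        try:
            return float(text)
        except ValueError:
            return math.nan  # e.g. 'Install Now', 'Third-party'
    return math.nan
```

It could be applied with `df_games['price'].apply(price_to_float)`; whether unrecognised labels should become NaN or 0.0 depends on how the price is used later.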

## Dataset `australian_users_items`

In [8]:
# Path to the australian_users_items dataset
ruta_items = 'data/australian_users_items.json'

# Each line of the file is read and parsed as a Python literal
filas_items = []
with open(ruta_items, encoding='MacRoman') as f:
    for line in f.readlines():
        filas_items.append(ast.literal_eval(line))

# The parsed rows are converted into a dataframe
df_items = pd.DataFrame(filas_items)
df_items

Unnamed: 0,user_id,items_count,steam_id,user_url,items
0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
1,js41637,888,76561198035864385,http://steamcommunity.com/id/js41637,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
2,evcentric,137,76561198007712555,http://steamcommunity.com/id/evcentric,"[{'item_id': '1200', 'item_name': 'Red Orchest..."
3,Riot-Punch,328,76561197963445855,http://steamcommunity.com/id/Riot-Punch,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
4,doctr,541,76561198002099482,http://steamcommunity.com/id/doctr,"[{'item_id': '300', 'item_name': 'Day of Defea..."
...,...,...,...,...,...
88305,76561198323066619,22,76561198323066619,http://steamcommunity.com/profiles/76561198323...,"[{'item_id': '413850', 'item_name': 'CS:GO Pla..."
88306,76561198326700687,177,76561198326700687,http://steamcommunity.com/profiles/76561198326...,"[{'item_id': '11020', 'item_name': 'TrackMania..."
88307,XxLaughingJackClown77xX,0,76561198328759259,http://steamcommunity.com/id/XxLaughingJackClo...,[]
88308,76561198329548331,7,76561198329548331,http://steamcommunity.com/profiles/76561198329...,"[{'item_id': '304930', 'item_name': 'Unturned'..."


In [25]:
# The kind of data stored in 'items' is inspected
df_items['items'][0]

[{'item_id': '10',
  'item_name': 'Counter-Strike',
  'playtime_forever': 6,
  'playtime_2weeks': 0},
 {'item_id': '20',
  'item_name': 'Team Fortress Classic',
  'playtime_forever': 0,
  'playtime_2weeks': 0},
 {'item_id': '30',
  'item_name': 'Day of Defeat',
  'playtime_forever': 7,
  'playtime_2weeks': 0},
 {'item_id': '40',
  'item_name': 'Deathmatch Classic',
  'playtime_forever': 0,
  'playtime_2weeks': 0},
 {'item_id': '50',
  'item_name': 'Half-Life: Opposing Force',
  'playtime_forever': 0,
  'playtime_2weeks': 0},
 {'item_id': '60',
  'item_name': 'Ricochet',
  'playtime_forever': 0,
  'playtime_2weeks': 0},
 {'item_id': '70',
  'item_name': 'Half-Life',
  'playtime_forever': 0,
  'playtime_2weeks': 0},
 {'item_id': '130',
  'item_name': 'Half-Life: Blue Shift',
  'playtime_forever': 0,
  'playtime_2weeks': 0},
 {'item_id': '300',
  'item_name': 'Day of Defeat: Source',
  'playtime_forever': 4733,
  'playtime_2weeks': 0},
 {'item_id': '240',
  'item_name': 'Counter-Strike: S

In [30]:
# Data types and the presence of nulls are checked
utils.verificar_tipo_datos(df_items)

Unnamed: 0,nombre_campo,tipo_datos,no_nulos_%,nulos_%,nulos
0,user_id,[<class 'str'>],100.0,0.0,0
1,items_count,[<class 'int'>],100.0,0.0,0
2,steam_id,[<class 'str'>],100.0,0.0,0
3,user_url,[<class 'str'>],100.0,0.0,0
4,items,[<class 'list'>],100.0,0.0,0


This dataset has 5 columns and 88309 rows, with no nulls. The columns are:

* user_id: the user's id.
* items_count: an integer whose meaning is not documented. In the sample above, a user with 0 has an empty items list, so it may be the number of items.
* steam_id: a Steam id number, stored as a string, for which no further information is available.
* user_url: the URL of the user's profile.
* items: a list of one or more dictionaries with the items each user consumes; it can also be empty. Each dictionary has the following keys:
  * item_id: a numeric id of the item, stored as a string.
  * item_name: the name of the content consumed.
  * playtime_forever: an integer for which no information is available.
  * playtime_2weeks: an integer for which no information is available.
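
The nested `items` column has the same list-of-dictionaries shape as `reviews`. One way to flatten it, sketched here on made-up records, is `pd.json_normalize` with `record_path` and `meta`, which yields one row per item and carries the chosen user fields along. This is an illustration, not a step taken in this notebook.

```python
import pandas as pd

# Toy records shaped like the parsed lines of australian_users_items
records = [
    {"user_id": "u1", "items_count": 2, "items": [
        {"item_id": "10", "item_name": "Counter-Strike",
         "playtime_forever": 6, "playtime_2weeks": 0},
        {"item_id": "20", "item_name": "Team Fortress Classic",
         "playtime_forever": 0, "playtime_2weeks": 0},
    ]},
    {"user_id": "u2", "items_count": 0, "items": []},
]

# One row per dictionary in 'items'; user_id and items_count are repeated on each row
items_flat = pd.json_normalize(records, record_path="items", meta=["user_id", "items_count"])
```

Users with an empty `items` list contribute no rows. With the real data, the list `filas_items` built in the loading cell could be passed in place of `records`.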