## Loading JSON data

Python's `gzip` and `json` modules make it easy to read gzipped JSON files, so the provided data can be inspected and loaded quickly.

After a quick look at the JSON files I noticed that they are structured differently, so I decided to load, clean and transform each one separately.

### 1) `steam_games.json.gz` data

In [111]:
import gzip
import json
import numpy as np 
import pandas as pd

from functions import load_jsongz, gzip_json_file
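
The `functions` module is not shown in this notebook. As a rough idea of what helpers like `load_jsongz` and `gzip_json_file` could look like, here is a minimal sketch. The names `load_jsongz_sketch` and `gzip_json_file_sketch`, the JSON Lines layout and the toy data are all assumptions of mine, not the actual implementation.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def gzip_json_file_sketch(path, df, subset=None):
    """Hypothetical helper: write df (or a column subset) as gzipped JSON Lines."""
    out = df if subset is None else df[subset]
    out.to_json(path, orient='records', lines=True, compression='gzip')


def load_jsongz_sketch(path, mode='rt', encoding='UTF-8'):
    """Hypothetical helper: read a gzipped JSON Lines file into a list of dicts."""
    with gzip.open(path, mode, encoding=encoding) as fh:
        # Skip blank lines so json.loads never sees an empty string
        records = [json.loads(line) for line in fh if line.strip()]
    print('Number of records:', len(records))
    return records


# Round trip on toy data (values invented for illustration)
toy = pd.DataFrame({'id': ['1', '2'], 'app_name': ['A', 'B'], 'price': ['4.99', '0.99']})
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'toy.json.gz')
    gzip_json_file_sketch(target, df=toy, subset=['id', 'app_name'])
    loaded = load_jsongz_sketch(target)
```

Writing one JSON object per line keeps the file readable with the same line-by-line approach used for the raw data.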

In [47]:
with gzip.open('./raw_data/steam_games.json.gz', 'rt') as file:
    data = file.readlines()

data = list(map(lambda x: json.loads(x), data))
print(len(data))

120445


The file holds 120445 records. Let's create a pandas DataFrame.

In [63]:
df_games = pd.DataFrame(data)
print(df_games.info())
df_games.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120445 entries, 0 to 120444
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   publisher     24083 non-null  object
 1   genres        28852 non-null  object
 2   app_name      32133 non-null  object
 3   title         30085 non-null  object
 4   url           32135 non-null  object
 5   release_date  30068 non-null  object
 6   tags          31972 non-null  object
 7   reviews_url   32133 non-null  object
 8   specs         31465 non-null  object
 9   price         30758 non-null  object
 10  early_access  32135 non-null  object
 11  id            32133 non-null  object
 12  developer     28836 non-null  object
dtypes: object(13)
memory usage: 11.9+ MB
None


Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,,,,,,,,,,,,,
1,,,,,,,,,,,,,
2,,,,,,,,,,,,,
3,,,,,,,,,,,,,
4,,,,,,,,,,,,,


Many rows appear to be completely null, so only rows where every column is null will be dropped.
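
Before dropping anything, it can help to count how many rows are completely empty and to see how `how='all'` differs from `how='any'`. A minimal sketch on toy data (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'app_name': [None, 'Game A', None, 'Game B'],
    'price': [np.nan, 'Free To Play', np.nan, 2.99],
    'developer': [None, 'Studio A', None, None],
})

# Boolean mask: True where every column of the row is null
all_null = toy.isna().all(axis=1)
n_all_null = int(all_null.sum())
print('Fully empty rows:', n_all_null)

# how='all' removes only the fully empty rows
cleaned = toy.dropna(how='all').reset_index(drop=True)

# how='any' would also remove rows that are only partially filled
strict = toy.dropna(how='any')
print(len(cleaned), len(strict))
```

Using `reset_index(drop=True)` here gives the same renumbering as `ignore_index=True` while also working on older pandas versions.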

In [64]:
df_games.dropna(axis=0, how='all', inplace=True, ignore_index=True)
print(df_games.info())
df_games.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32135 entries, 0 to 32134
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   publisher     24083 non-null  object
 1   genres        28852 non-null  object
 2   app_name      32133 non-null  object
 3   title         30085 non-null  object
 4   url           32135 non-null  object
 5   release_date  30068 non-null  object
 6   tags          31972 non-null  object
 7   reviews_url   32133 non-null  object
 8   specs         31465 non-null  object
 9   price         30758 non-null  object
 10  early_access  32135 non-null  object
 11  id            32133 non-null  object
 12  developer     28836 non-null  object
dtypes: object(13)
memory usage: 3.2+ MB
None


Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_...,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",http://steamcommunity.com/app/761140/reviews/?...,[Single-player],4.99,False,761140,Kotoshiro
1,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,http://store.steampowered.com/app/643980/Ironb...,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",http://steamcommunity.com/app/643980/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free To Play,False,643980,Secret Level SRL
2,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,Real Pool 3D - Poolians,http://store.steampowered.com/app/670290/Real_...,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",http://steamcommunity.com/app/670290/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free to Play,False,670290,Poolians.com
3,彼岸领域,"[Action, Adventure, Casual]",弹炸人2222,弹炸人2222,http://store.steampowered.com/app/767400/2222/,2017-12-07,"[Action, Adventure, Casual]",http://steamcommunity.com/app/767400/reviews/?...,[Single-player],0.99,False,767400,彼岸领域
4,,,Log Challenge,,http://store.steampowered.com/app/773570/Log_C...,,"[Action, Indie, Casual, Sports]",http://steamcommunity.com/app/773570/reviews/?...,"[Single-player, Full controller support, HTC V...",2.99,False,773570,


There are 32135 entries but only 32133 non-null `id` values. Entries can't go without an `id`, so let's check.

In [65]:
df_games[df_games['id'].isnull()]

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
74,,,,,http://store.steampowered.com/,,,,,19.99,False,,
30961,"Warner Bros. Interactive Entertainment, Feral ...","[Action, Adventure]",Batman: Arkham City - Game of the Year Edition,Batman: Arkham City - Game of the Year Edition,http://store.steampowered.com/app/200260,2012-09-07,"[Action, Open World, Batman, Adventure, Stealt...",,"[Single-player, Steam Achievements, Steam Trad...",19.99,False,,"Rocksteady Studios,Feral Interactive (Mac)"


It turns out that I already have an entry for "Batman: Arkham City - Game of the Year Edition", and the other row is essentially empty.

**I can drop both.**

In [66]:
df_games[df_games['id'] == '200260']

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
1068,"Warner Bros. Interactive Entertainment, Feral ...","[Action, Adventure]",Batman: Arkham City - Game of the Year Edition,Batman: Arkham City - Game of the Year Edition,http://store.steampowered.com/app/200260/Batma...,2012-09-07,"[Action, Open World, Batman, Adventure, Stealt...",http://steamcommunity.com/app/200260/reviews/?...,"[Single-player, Steam Achievements, Steam Trad...",19.99,False,200260,"Rocksteady Studios,Feral Interactive (Mac)"
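
The same check can be done without typing the id by hand: for rows missing an `id`, the app id can often be recovered from the `url` and compared against the ids already present. A minimal sketch, assuming store URLs follow the `/app/<digits>` pattern seen above (the toy frame is illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    'id': ['767400', None, '200260', None],
    'url': [
        'http://store.steampowered.com/app/767400/2222/',
        'http://store.steampowered.com/',
        'http://store.steampowered.com/app/200260/',
        'http://store.steampowered.com/app/200260',
    ],
})

# Rows without an id
missing = toy[toy['id'].isna()].copy()

# Pull the numeric app id out of the URL; NaN when the pattern is absent
missing['id_from_url'] = missing['url'].str.extract(r'/app/(\d+)', expand=False)

# Is the recovered id already in the table under another row?
known_ids = set(toy['id'].dropna())
missing['already_present'] = missing['id_from_url'].isin(known_ids)
print(missing[['url', 'id_from_url', 'already_present']])
```

A row whose recovered id is already present is a duplicate, and a row with no recoverable id carries nothing that identifies a game.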


In [67]:
df_games.drop([74, 30961], axis=0, inplace=True)
df_games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32133 entries, 0 to 32134
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   publisher     24082 non-null  object
 1   genres        28851 non-null  object
 2   app_name      32132 non-null  object
 3   title         30084 non-null  object
 4   url           32133 non-null  object
 5   release_date  30067 non-null  object
 6   tags          31971 non-null  object
 7   reviews_url   32133 non-null  object
 8   specs         31464 non-null  object
 9   price         30756 non-null  object
 10  early_access  32133 non-null  object
 11  id            32133 non-null  object
 12  developer     28835 non-null  object
dtypes: object(13)
memory usage: 3.4+ MB


**32133 records left**

The idea is to keep it simple: I will discard the columns I won't need for the analysis and keep the most informative ones, that is, those that allow me to create relationships (games-reviews, user-items, for example).
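
Hard-coded index labels such as 74 and 30961 only hold for this exact load of the file. A version that doesn't depend on row positions is to drop every row whose `id` is null. A minimal sketch on toy data (values invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'id': ['761140', None, '643980', None],
    'app_name': ['Game A', None, 'Game B', 'Game C'],
})

# Drop rows where 'id' is missing, whatever their index labels are
cleaned = toy.dropna(subset=['id']).reset_index(drop=True)
print(len(cleaned))
```

This is only equivalent to the manual drop when every row without an `id` is meant to be discarded, as is the case here.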

In [112]:
# Keeping id, developer, app_name, genres, tags, specs, release_date, price
# Creating a gzip-compressed JSON file
columns = ['id', 'developer', 'app_name', 'genres','tags', 'specs','release_date', 'price']
gzip_json_file('./data/steam_games_c.json.gz',
               df=df_games,
               subset=columns)

*** 

### 2) User reviews data in file: `user_reviews.json.gz`

Note: the following files contain nested information.

In [113]:
revs = load_jsongz('./raw_data/user_reviews.json.gz', mode='rt', encoding='UTF-8')

Number of records: 25799
Item type: <class 'dict'>


In [114]:
df_revs = pd.DataFrame(revs)
print(df_revs.info())
df_revs.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25799 entries, 0 to 25798
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   user_id   25799 non-null  object
 1   user_url  25799 non-null  object
 2   reviews   25799 non-null  object
dtypes: object(3)
memory usage: 604.8+ KB
None


Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."


25799 entries, but are there 25799 distinct users?

In [115]:
# Counting distinct values
len(pd.unique(df_revs['user_id']))

25485

In [116]:
# Identifying duplicated user id in the dataFrame
duplicated = df_revs['user_id'].value_counts()
duplicated = duplicated[duplicated > 1]
len(duplicated)

309

Check whether duplicated entries have the same set of reviews, using one duplicated user as a spot check.

In [117]:
idxs = df_revs[df_revs['user_id'] == '76561198027488037'].loc[:,'reviews'].index

for idx in idxs:
    print(df_revs.loc[idx, 'reviews'][0])

{'funny': '', 'posted': 'Posted May 12.', 'last_edited': '', 'item_id': '463490', 'helpful': '1 of 1 people (100%) found this review helpful', 'recommend': True, 'review': 'I gotta say, Melons is my favourite song from evry one of my soundtracks.'}
{'funny': '', 'posted': 'Posted May 12.', 'last_edited': '', 'item_id': '463490', 'helpful': '1 of 1 people (100%) found this review helpful', 'recommend': True, 'review': 'I gotta say, Melons is my favourite song from evry one of my soundtracks.'}
{'funny': '', 'posted': 'Posted May 12.', 'last_edited': '', 'item_id': '463490', 'helpful': '1 of 1 people (100%) found this review helpful', 'recommend': True, 'review': 'I gotta say, Melons is my favourite song from evry one of my soundtracks.'}
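
The cell above only looks at the first review of one user. To compare the complete review lists of every duplicated user, each list can be serialized to a string and the distinct variants counted per `user_id`. A minimal sketch on toy data (the frame and its values are illustrative):

```python
import json

import pandas as pd

toy = pd.DataFrame({
    'user_id': ['a', 'a', 'b', 'b', 'c'],
    'reviews': [
        [{'item_id': '1', 'review': 'x'}],
        [{'item_id': '1', 'review': 'x'}],
        [{'item_id': '2', 'review': 'y'}],
        [{'item_id': '2', 'review': 'y'}, {'item_id': '3', 'review': 'z'}],
        [],
    ],
})

# Lists of dicts are not hashable, so compare a canonical JSON string instead
serialized = toy['reviews'].map(lambda r: json.dumps(r, sort_keys=True))

# Number of distinct review lists per user
variants = serialized.groupby(toy['user_id']).nunique()

# Users whose repeated rows do NOT carry identical reviews
conflicting = variants[variants > 1].index.tolist()
print(conflicting)
```

If `conflicting` comes back empty on the real data, keeping the first row per user loses nothing. Otherwise the differing lists would need to be merged first.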


Dropping duplicates

In [118]:
df_revs.drop_duplicates('user_id', inplace=True, ignore_index=True)
print(len(df_revs))

25485


25485 distinct users.

- What is the average number of reviews per user?
- What is the maximum number of reviews that a single user has?
- What is the minimum number?

In [119]:
print('Average number of reviews for a single user:', df_revs['reviews'].map(len).mean())
print('Maximum number of reviews for a single user:', df_revs['reviews'].map(len).max())
print('Minimum number of reviews for a single user:', df_revs['reviews'].map(len).min())

Average number of reviews for a single user: 2.2927212085540516
Maximum number of reviews for a single user: 10
Minimum number of reviews for a single user: 0
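
The review counts can also be computed once and reused, which makes it easy to see how many users have an empty review list (the minimum of 0 above shows that there are some). A minimal sketch on toy data (values invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3'],
    'reviews': [
        [{'item_id': '10'}, {'item_id': '20'}, {'item_id': '30'}],
        [],
        [{'item_id': '10'}],
    ],
})

# Length of each user's review list, computed a single time
counts = toy['reviews'].map(len)

# Mean, max and min in one call
summary = counts.agg(['mean', 'max', 'min'])
print(summary)

# Users whose review list is empty
n_without_reviews = int((counts == 0).sum())
print('Users without reviews:', n_without_reviews)
```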


Unnesting reviews

In [120]:
revs = []
for i in df_revs.index:
    # For every user in dataFrame
    user_id = df_revs.loc[i,'user_id']
    for rev in df_revs.loc[i, 'reviews']:
        # For every review in each user's reviews list
        revs.append(
            {
                'user_id': user_id,
                'posted': rev['posted'],
                'item_id': rev['item_id'],
                'recommend': rev['recommend'],
                'review': rev['review'],
            }
        )
# Creating a dataFrame from that list
df = pd.DataFrame(revs)
df.head(10)

Unnamed: 0,user_id,posted,item_id,recommend,review
0,76561197970982479,"Posted November 5, 2011.",1250,True,Simple yet with great replayability. In my opi...
1,76561197970982479,"Posted July 15, 2011.",22200,True,It's unique and worth a playthrough.
2,76561197970982479,"Posted April 21, 2011.",43110,True,Great atmosphere. The gunplay can be a bit chu...
3,js41637,"Posted June 24, 2014.",251610,True,I know what you think when you see this title ...
4,js41637,"Posted September 8, 2013.",227300,True,For a simple (it's actually not all that simpl...
5,js41637,"Posted November 29, 2013.",239030,True,Very fun little game to play when your bored o...
6,evcentric,Posted February 3.,248820,True,A suitably punishing roguelike platformer. Wi...
7,evcentric,"Posted December 4, 2015.",370360,True,"""Run for fun? What the hell kind of fun is that?"""
8,evcentric,"Posted November 3, 2014.",237930,True,"Elegant integration of gameplay, story, world ..."
9,evcentric,"Posted October 15, 2014.",263360,True,"Random drops and random quests, with stat poin..."
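
The explicit loop is easy to follow. As an alternative, the same flattening can be sketched with `DataFrame.explode`, which creates one row per list element. This is a sketch on toy data, not the method used above; users with an empty list yield a null row that has to be dropped.

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3'],
    'reviews': [
        [{'posted': 'Posted May 12.', 'item_id': '10', 'recommend': True, 'review': 'ok'},
         {'posted': 'Posted June 1.', 'item_id': '20', 'recommend': False, 'review': 'meh'}],
        [],
        [{'posted': 'Posted July 4.', 'item_id': '30', 'recommend': True, 'review': 'fun'}],
    ],
})

# One row per review; empty lists become NaN and are removed
exploded = toy.explode('reviews', ignore_index=True).dropna(subset=['reviews'])

# Expand each review dict into columns, keeping the row alignment
fields = pd.DataFrame(exploded['reviews'].tolist(), index=exploded.index)

keep = ['posted', 'item_id', 'recommend', 'review']
flat = pd.concat([exploded[['user_id']], fields[keep]], axis=1).reset_index(drop=True)
print(flat)
```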


In [121]:
# Saving it into a file - 58430 reviews total
gzip_json_file('./data/user_reviews_c.json.gz',
               df=df)

***

### 3) `users_items.json.gz` data

In [3]:
df_items = load_jsongz('./raw_data/users_items.json.gz',
                       mode='rt', encoding='UTF-8')

Number of records: 88310
Item type: <class 'dict'>


In [4]:
df_items = pd.DataFrame(df_items)
print(df_items.info())
df_items.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88310 entries, 0 to 88309
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      88310 non-null  object
 1   items_count  88310 non-null  int64 
 2   steam_id     88310 non-null  object
 3   user_url     88310 non-null  object
 4   items        88310 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.4+ MB
None


Unnamed: 0,user_id,items_count,steam_id,user_url,items
0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
1,js41637,888,76561198035864385,http://steamcommunity.com/id/js41637,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
2,evcentric,137,76561198007712555,http://steamcommunity.com/id/evcentric,"[{'item_id': '1200', 'item_name': 'Red Orchest..."
3,Riot-Punch,328,76561197963445855,http://steamcommunity.com/id/Riot-Punch,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
4,doctr,541,76561198002099482,http://steamcommunity.com/id/doctr,"[{'item_id': '300', 'item_name': 'Day of Defea..."


In [41]:
print('Unique users', len(pd.unique(df_items['user_id'])))
# Getting how many duplicated records
dupl = df_items['user_id'].value_counts()
dupl = dupl[dupl > 1]
print('Duplicated:', len(dupl))

Unique users 87626
Duplicated: 673


673 of the 87626 distinct users appear more than once.
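
Note that the number of repeated ids and the number of surplus rows are different quantities: 88310 - 87626 = 684 rows are removed by deduplication, while 673 ids are repeated, so some ids appear more than twice. A minimal sketch of the two counts on toy data (values invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'user_id': ['a', 'a', 'a', 'b', 'b', 'c']})

# Ids that occur more than once
counts = toy['user_id'].value_counts()
ids_with_repeats = int((counts > 1).sum())

# Rows that drop_duplicates would remove (every occurrence after the first)
surplus_rows = int(toy.duplicated('user_id').sum())

distinct = toy['user_id'].nunique()
print(ids_with_repeats, surplus_rows, distinct)
```

The row count minus the surplus rows always equals the number of distinct ids, which is a quick sanity check after deduplicating.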

In [44]:
# Dropping duplicates
df_items.drop_duplicates('user_id', inplace=True, ignore_index=True)
print(df_items.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87626 entries, 0 to 87625
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      87626 non-null  object
 1   items_count  87626 non-null  int64 
 2   steam_id     87626 non-null  object
 3   user_url     87626 non-null  object
 4   items        87626 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.3+ MB
None


Are there users with 0 items?

In [72]:
print('Number of users with no items:', 
      len(df_items[df_items['items_count'] == 0]))
df_items[df_items['items_count'] == 0].head()

Number of users with no items: 16714


Unnamed: 0,user_id,items_count,steam_id,user_url,items
9,Wackky,0,76561198039117046,http://steamcommunity.com/id/Wackky,[]
11,76561198079601835,0,76561198079601835,http://steamcommunity.com/profiles/76561198079...,[]
31,hellom8o,0,76561198117222320,http://steamcommunity.com/id/hellom8o,[]
38,starkillershadow553,0,76561198059648579,http://steamcommunity.com/id/starkillershadow553,[]
54,darkenkane,0,76561198058876001,http://steamcommunity.com/id/darkenkane,[]


It is possible that users with `items_count` = 0 only hold their accounts to write reviews, so I'll keep the records.
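
That hypothesis could be tested by checking how many of the zero-item users appear in the reviews table. A minimal sketch, assuming two frames that share a `user_id` column (the toy frames and ids below are hypothetical):

```python
import pandas as pd

items = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3'],
    'items_count': [0, 5, 0],
})
reviews = pd.DataFrame({'user_id': ['u3', 'u2', 'u3']})

# Users that own no items
no_items = items.loc[items['items_count'] == 0, 'user_id']

# Which of them have written at least one review?
wrote_review = no_items.isin(reviews['user_id'])
n_reviewers = int(wrote_review.sum())
print('Zero-item users with reviews:', n_reviewers, 'of', len(no_items))
```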

In [122]:
# Saving file
gzip_json_file('./data/users_items_c.json.gz',
               df=df_items)