## Loading JSON data

Python's `gzip` and `json` modules can read gzipped JSON files directly, which makes it easy to inspect and load the data provided.
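The helper `fts.load_json_gz` used below lives in a separate `functions.py` that is not shown here. As a rough sketch of what such a loader might look like, assuming the file holds one JSON object per line, the hypothetical `load_json_lines_gz` below combines `gzip.open` in text mode with `json.loads`. If the lines were Python dict literals rather than strict JSON, a different parser would be needed.

```python
import gzip
import json
import os
import tempfile


def load_json_lines_gz(path, mode="rt", encoding="utf-8"):
    """Hypothetical loader: one JSON object per line of a gzipped text file."""
    records = []
    with gzip.open(path, mode=mode, encoding=encoding) as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Round trip on a throwaway file, so the sketch runs without the real data
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        fh.write('{"id": "1", "app_name": "A"}\n')
        fh.write('{"id": "2", "app_name": "B"}\n')
    sample = load_json_lines_gz(path)

print("Number of records:", len(sample))
print("Item type:", type(sample[0]))
```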

After a quick look at the JSON files, I noticed that they are structured differently, so I decided to load, clean, and transform each one separately.

### 1) 'steam_games.json.gz' data

In [1]:
import gzip
import json
import numpy as np 
import pandas as pd
import importlib

import functions as fts 
importlib.reload(fts)

<module 'functions' from '/home/juancml/Personal-Profesional/henry/pi_mlops/functions.py'>

In [2]:
data = fts.load_json_gz('./raw_data/steam_games.json.gz', mode='rt', encoding='UTF-8')
print(len(data))

Number of records: 120445
Item type: <class 'dict'>
120445


The file contains 120445 records. Let's create a pandas DataFrame.

In [57]:
df_games = pd.DataFrame(data)
print(df_games.info())
df_games.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120445 entries, 0 to 120444
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   publisher     24083 non-null  object
 1   genres        28852 non-null  object
 2   app_name      32133 non-null  object
 3   title         30085 non-null  object
 4   url           32135 non-null  object
 5   release_date  30068 non-null  object
 6   tags          31972 non-null  object
 7   reviews_url   32133 non-null  object
 8   specs         31465 non-null  object
 9   price         30758 non-null  object
 10  early_access  32135 non-null  object
 11  id            32133 non-null  object
 12  developer     28836 non-null  object
dtypes: object(13)
memory usage: 11.9+ MB
None


Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,,,,,,,,,,,,,
1,,,,,,,,,,,,,
2,,,,,,,,,,,,,
3,,,,,,,,,,,,,
4,,,,,,,,,,,,,


The first rows appear to be entirely null, so only rows where every column is null will be dropped.

In [58]:
df_games.dropna(axis=0, how='all', inplace=True, ignore_index=True)
print(df_games.info())
df_games.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32135 entries, 0 to 32134
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   publisher     24083 non-null  object
 1   genres        28852 non-null  object
 2   app_name      32133 non-null  object
 3   title         30085 non-null  object
 4   url           32135 non-null  object
 5   release_date  30068 non-null  object
 6   tags          31972 non-null  object
 7   reviews_url   32133 non-null  object
 8   specs         31465 non-null  object
 9   price         30758 non-null  object
 10  early_access  32135 non-null  object
 11  id            32133 non-null  object
 12  developer     28836 non-null  object
dtypes: object(13)
memory usage: 3.2+ MB
None


Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_...,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",http://steamcommunity.com/app/761140/reviews/?...,[Single-player],4.99,False,761140,Kotoshiro
1,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,http://store.steampowered.com/app/643980/Ironb...,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",http://steamcommunity.com/app/643980/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free To Play,False,643980,Secret Level SRL
2,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,Real Pool 3D - Poolians,http://store.steampowered.com/app/670290/Real_...,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",http://steamcommunity.com/app/670290/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free to Play,False,670290,Poolians.com
3,彼岸领域,"[Action, Adventure, Casual]",弹炸人2222,弹炸人2222,http://store.steampowered.com/app/767400/2222/,2017-12-07,"[Action, Adventure, Casual]",http://steamcommunity.com/app/767400/reviews/?...,[Single-player],0.99,False,767400,彼岸领域
4,,,Log Challenge,,http://store.steampowered.com/app/773570/Log_C...,,"[Action, Indie, Casual, Sports]",http://steamcommunity.com/app/773570/reviews/?...,"[Single-player, Full controller support, HTC V...",2.99,False,773570,
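The choice of `how='all'` matters here: the default `how='any'` would also discard rows that are only partially filled, such as the "Log Challenge" row above. A toy frame (made-up values, not from the dataset) illustrates the difference:

```python
import numpy as np
import pandas as pd

# Row 0 is complete, row 1 is entirely null, row 2 is partially filled
toy = pd.DataFrame({
    "id": ["1", np.nan, np.nan],
    "title": ["A", np.nan, "C"],
})

all_null_dropped = toy.dropna(axis=0, how="all")  # drops only row 1
any_null_dropped = toy.dropna(axis=0, how="any")  # drops rows 1 and 2

print(len(all_null_dropped), len(any_null_dropped))
```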


There are 32135 entries but only 32133 non-null 'id' values. Entries without an 'id' can't be used, so let's check whether the rows with a null 'id' are duplicated entries or empty records.

In [59]:
df_games[df_games['id'].isnull()]

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
74,,,,,http://store.steampowered.com/,,,,,19.99,False,,
30961,"Warner Bros. Interactive Entertainment, Feral ...","[Action, Adventure]",Batman: Arkham City - Game of the Year Edition,Batman: Arkham City - Game of the Year Edition,http://store.steampowered.com/app/200260,2012-09-07,"[Action, Open World, Batman, Adventure, Stealt...",,"[Single-player, Steam Achievements, Steam Trad...",19.99,False,,"Rocksteady Studios,Feral Interactive (Mac)"


It turns out that I already have an entry for "Batman: Arkham City - Game of the Year Edition", and the other row is almost empty.

 **I can drop both**

In [60]:
df_games[df_games['app_name'] == 'Batman: Arkham City - Game of the Year Edition']

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
1068,"Warner Bros. Interactive Entertainment, Feral ...","[Action, Adventure]",Batman: Arkham City - Game of the Year Edition,Batman: Arkham City - Game of the Year Edition,http://store.steampowered.com/app/200260/Batma...,2012-09-07,"[Action, Open World, Batman, Adventure, Stealt...",http://steamcommunity.com/app/200260/reviews/?...,"[Single-player, Steam Achievements, Steam Trad...",19.99,False,200260.0,"Rocksteady Studios,Feral Interactive (Mac)"
30961,"Warner Bros. Interactive Entertainment, Feral ...","[Action, Adventure]",Batman: Arkham City - Game of the Year Edition,Batman: Arkham City - Game of the Year Edition,http://store.steampowered.com/app/200260,2012-09-07,"[Action, Open World, Batman, Adventure, Stealt...",,"[Single-player, Steam Achievements, Steam Trad...",19.99,False,,"Rocksteady Studios,Feral Interactive (Mac)"


In [61]:
# Dropping using indexes
df_games.drop([74, 30961], axis=0, inplace=True)
df_games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32133 entries, 0 to 32134
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   publisher     24082 non-null  object
 1   genres        28851 non-null  object
 2   app_name      32132 non-null  object
 3   title         30084 non-null  object
 4   url           32133 non-null  object
 5   release_date  30067 non-null  object
 6   tags          31971 non-null  object
 7   reviews_url   32133 non-null  object
 8   specs         31464 non-null  object
 9   price         30756 non-null  object
 10  early_access  32133 non-null  object
 11  id            32133 non-null  object
 12  developer     28835 non-null  object
dtypes: object(13)
memory usage: 3.4+ MB


In [62]:
# Null handling
df_games.fillna(
    {
        'developer': 'Unknown',
        'genres': 'Empty',
        'app_name': 'Unknown',
        'release_date': 'Unknown',
        'tags': 'Empty',
        'specs': 'Empty',
        'price': 0
    },
    inplace=True
)

**32133 records left**

The idea is to keep it simple: I will discard the columns I won't need for the analysis and keep the most informative ones, that is, those that allow me to create relationships (games-reviews, user-items, for example).

In [63]:
# Function to extract only the year from a date
def get_year(y: str):
    # The column mixes several date formats, plus the 'Unknown'
    # strings introduced when filling null values above.
    # Anything that cannot be parsed as a date is returned as 'Unknown'.
    try:
        date = pd.to_datetime(y, format='mixed')
        return date.year
    except Exception:
        return 'Unknown'
    

# Getting only the year
df_games['release_date'] = df_games['release_date'].apply(get_year)
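A vectorized alternative is sketched below, under the assumption that a missing value is acceptable in place of the string 'Unknown'. The hypothetical `release_year` helper parses the whole column at once with `errors='coerce'`, so unparseable entries become `NaT`, and then stores the years in a nullable integer column. Like the cell above, it relies on `format='mixed'`, which needs pandas 2.0 or later.

```python
import pandas as pd


def release_year(dates: pd.Series) -> pd.Series:
    """Hypothetical helper: year per entry, <NA> where parsing fails."""
    # Unparseable strings (e.g. 'Unknown') are coerced to NaT instead of raising
    parsed = pd.to_datetime(dates, format="mixed", errors="coerce")
    # Nullable integer dtype keeps years as integers despite missing values
    return parsed.dt.year.astype("Int64")


years = release_year(pd.Series(["2018-01-04", "Unknown", "2017-07-24"]))
print(years)
```

This differs from `get_year` in one respect: the result is a numeric column with missing values rather than a mix of integers and strings.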

In [64]:
# Renaming id and release date columns
df_games.rename(columns={'id':'item_id', 'release_date':'release_year'}, inplace=True)
# Keeping ids, genres, specs, prices, developer, app_names, tags
# Creating a gzip compressed json file
columns = ['item_id', 'developer', 'app_name', 'genres','tags', 'specs','release_year', 'price']
fts.gzip_json_file('./data/games.json.gz',
               df=df_games,
               subset=columns)

File saved at "data/games.json.gz"
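`fts.gzip_json_file` is also defined outside this notebook. One possible way to write such a file, offered only as a sketch, is pandas' own `DataFrame.to_json` with `compression='gzip'`. The name `save_json_lines_gz` and the line-delimited layout are assumptions, not the actual helper.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def save_json_lines_gz(path, df, subset=None):
    """Hypothetical writer: one JSON object per row, gzip-compressed."""
    frame = df if subset is None else df[subset]
    frame.to_json(path, orient="records", lines=True, compression="gzip")


toy = pd.DataFrame({
    "item_id": ["761140", "643980"],
    "app_name": ["Lost Summoner Kitty", "Ironbound"],
    "url": ["ignored", "ignored"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "games.json.gz")
    save_json_lines_gz(path, toy, subset=["item_id", "app_name"])
    # Read the file back with the standard library to check its contents
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        rows = [json.loads(line) for line in fh if line.strip()]

print(rows)
```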


*** 

### 2) User reviews data in file: `user_reviews.json.gz`

Note: There is nested information within the following files

In [9]:
revs = fts.load_json_gz('./raw_data/user_reviews.json.gz', mode='rt', encoding='UTF-8')

Number of records: 25799
Item type: <class 'dict'>


In [10]:
# To Dataframe
df_revs = pd.DataFrame(revs)
print(df_revs.info())
df_revs.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25799 entries, 0 to 25798
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   user_id   25799 non-null  object
 1   user_url  25799 non-null  object
 2   reviews   25799 non-null  object
dtypes: object(3)
memory usage: 604.8+ KB
None


Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."


25799 entries -> 25799 distinct users?

In [11]:
# Counting distinct values
len(pd.unique(df_revs['user_id']))

25485

In [12]:
# Identifying duplicated user id in the dataFrame
duplicated = df_revs['user_id'].value_counts()
duplicated = duplicated[duplicated > 1]
len(duplicated)

309

Check whether duplicated entries carry the same reviews, using one duplicated user as a spot check.

In [13]:
idxs = df_revs[df_revs['user_id'] == '76561198027488037'].loc[:,'reviews'].index

for idx in idxs:
    print(df_revs.loc[idx, 'reviews'][0])

{'funny': '', 'posted': 'Posted May 12.', 'last_edited': '', 'item_id': '463490', 'helpful': '1 of 1 people (100%) found this review helpful', 'recommend': True, 'review': 'I gotta say, Melons is my favourite song from evry one of my soundtracks.'}
{'funny': '', 'posted': 'Posted May 12.', 'last_edited': '', 'item_id': '463490', 'helpful': '1 of 1 people (100%) found this review helpful', 'recommend': True, 'review': 'I gotta say, Melons is my favourite song from evry one of my soundtracks.'}
{'funny': '', 'posted': 'Posted May 12.', 'last_edited': '', 'item_id': '463490', 'helpful': '1 of 1 people (100%) found this review helpful', 'recommend': True, 'review': 'I gotta say, Melons is my favourite song from evry one of my soundtracks.'}
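The spot check above compares only the first review of a single user. To extend it to every duplicated user, one option is sketched below: serialize each review list and count the distinct versions per `user_id`. The function name is hypothetical and the toy data is made up for illustration.

```python
import json

import pandas as pd


def users_with_conflicting_reviews(df, key="user_id", col="reviews"):
    """Return the ids whose duplicated rows do not hold identical review lists."""
    # Lists of dicts are not hashable, so compare their JSON serialization
    serialized = df[col].map(lambda r: json.dumps(r, sort_keys=True))
    distinct_per_user = serialized.groupby(df[key]).nunique()
    return sorted(distinct_per_user[distinct_per_user > 1].index)


toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2"],
    "reviews": [
        [{"item_id": "10", "recommend": True}],
        [{"item_id": "10", "recommend": True}],   # exact duplicate of u1
        [{"item_id": "20", "recommend": True}],
        [{"item_id": "20", "recommend": False}],  # differs from the other u2 row
    ],
})

conflicting = users_with_conflicting_reviews(toy)
print(conflicting)
```

An empty result on the real data would suggest that dropping duplicates by `user_id` loses no reviews.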


Dropping duplicates

In [14]:
df_revs.drop_duplicates('user_id', inplace=True, ignore_index=True)
print(len(df_revs))

25485


25485 distinct users.

- What is the average number of reviews per user?
- What is the maximum number of reviews that a single user has?
- What is the minimum number?

In [15]:
print('Average number of reviews for a single user:', df_revs['reviews'].map(len).mean())
print('Maximum number of reviews for a single user:', df_revs['reviews'].map(len).max())
print('Minimum number of reviews for a single user:', df_revs['reviews'].map(len).min())

Average number of reviews for a single user: 2.2927212085540516
Maximum number of reviews for a single user: 10
Minimum number of reviews for a single user: 0


How many users have 0 reviews?

In [16]:
print('Number of users with no reviews registered:', 
      (df_revs['reviews'].map(len) == 0).sum())

# Dropping them
indexes = df_revs[df_revs['reviews'].map(len) == 0].index
df_revs.drop(index=indexes, inplace=True)

# Checking
print('After dropping:', 
      (df_revs['reviews'].map(len) == 0).sum())

Number of users with no reviews registered: 28
After dropping: 0


Unnesting reviews

In [17]:
revs = fts.json_unpacking(df_revs, where='reviews',
                             values=['posted', 'item_id', 'recommend', 'review'],
                             old_colums=['user_id'])

# Creating a dataFrame from that list
# df_revs = pd.DataFrame(revs)
# df_revs.head(10)

Shape of the resulting array: (58430, 5)
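Since `fts.json_unpacking` is not shown, here is a minimal sketch of the unnesting idea: every dictionary in the nested list becomes one row, and the selected outer columns are repeated on each of those rows. The name `unpack_nested` and its `keep` parameter are hypothetical, and the real helper may work differently.

```python
import pandas as pd


def unpack_nested(df, where, values, keep):
    """Hypothetical unnesting: one output row per dict in df[where]."""
    rows = []
    for record in df.to_dict("records"):
        for item in record[where]:
            row = {col: record[col] for col in keep}           # outer columns
            row.update({key: item.get(key) for key in values})  # nested fields
            rows.append(row)
    return pd.DataFrame(rows, columns=list(keep) + list(values))


toy = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "reviews": [
        [{"item_id": "10", "recommend": True}, {"item_id": "20", "recommend": False}],
        [{"item_id": "10", "recommend": True}],
    ],
})

flat = unpack_nested(toy, where="reviews", values=["item_id", "recommend"], keep=["user_id"])
print(flat.shape)
```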


In [19]:
revs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58430 entries, 0 to 58429
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   user_id    58430 non-null  object
 1   posted     58430 non-null  object
 2   item_id    58430 non-null  object
 3   recommend  58430 non-null  bool  
 4   review     58430 non-null  object
dtypes: bool(1), object(4)
memory usage: 1.8+ MB


In [28]:
# Saving the unnested reviews into a file - 58430 reviews total
# (the columns below exist in `revs`, not in the nested `df_revs`)
columns = ['user_id', 'item_id', 'recommend', 'review']
fts.gzip_json_file('./data/user_reviews_c.json.gz',
               df=revs,
               subset=columns)

File saved at "data/user_reviews_c.json.gz"


***

### 3) `users_items.json.gz` data

This file is much **heavier** than the others.

**Just opening the file and parsing the data can take noticeably longer than for the previous files** (up to 2 minutes or more).

In [20]:
df_items = fts.load_json_gz('./raw_data/users_items.json.gz',
                     mode='rt', encoding='UTF-8')

Number of records: 88310
Item type: <class 'dict'>


In [21]:
# Creating a dataframe with the dictionaries
df_items = pd.DataFrame(df_items)
print(df_items.info())
df_items.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88310 entries, 0 to 88309
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      88310 non-null  object
 1   items_count  88310 non-null  int64 
 2   steam_id     88310 non-null  object
 3   user_url     88310 non-null  object
 4   items        88310 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.4+ MB
None


Unnamed: 0,user_id,items_count,steam_id,user_url,items
0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
1,js41637,888,76561198035864385,http://steamcommunity.com/id/js41637,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
2,evcentric,137,76561198007712555,http://steamcommunity.com/id/evcentric,"[{'item_id': '1200', 'item_name': 'Red Orchest..."
3,Riot-Punch,328,76561197963445855,http://steamcommunity.com/id/Riot-Punch,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
4,doctr,541,76561198002099482,http://steamcommunity.com/id/doctr,"[{'item_id': '300', 'item_name': 'Day of Defea..."


Looking for duplicated entries

In [22]:
print('Unique users', len(pd.unique(df_items['user_id'])))
# Getting how many duplicated records
dupl = df_items['user_id'].value_counts()
dupl = dupl[dupl > 1]
print('Duplicated entries:', len(dupl))

Unique users 87626
Duplicated entries: 673


Only 87626 distinct user ids; 673 of them appear more than once.
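Note that the number of ids that appear more than once is not the same as the number of rows removed by `drop_duplicates`: here 88310 - 87626 = 684 rows are redundant, while 673 ids are duplicated, because some ids appear three or more times. A toy example (made-up ids) shows the two counts side by side:

```python
import pandas as pd

toy = pd.DataFrame({"user_id": ["a", "a", "a", "b", "b", "c"]})

counts = toy["user_id"].value_counts()
ids_with_duplicates = int((counts > 1).sum())       # ids seen more than once
extra_rows = int(toy.duplicated("user_id").sum())   # rows beyond the first per id
remaining = len(toy.drop_duplicates("user_id"))     # rows left after dropping

print(ids_with_duplicates, extra_rows, remaining)
```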

In [23]:
# Dropping duplicates
df_items.drop_duplicates('user_id', inplace=True, ignore_index=True)
print(df_items.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87626 entries, 0 to 87625
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      87626 non-null  object
 1   items_count  87626 non-null  int64 
 2   steam_id     87626 non-null  object
 3   user_url     87626 non-null  object
 4   items        87626 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.3+ MB
None


- Are there users with 0 items?

- Have they written any reviews?

In [24]:
print('Number of users with no items:', 
      len(df_items[df_items['items_count'] == 0]))

# List of unique users with reviews registered
rev_users = list(revs['user_id'].unique())

# Users without any item
no_items_users = df_items[df_items['items_count'] == 0].user_id

# Checking whether there are reviews from users without items.
print('How many of them have reviewed something?\n ->', 
      no_items_users.isin(rev_users).sum())

Number of users with no items: 16714
How many of them have reviewed something?
 -> 2841


2841 users have 0 items but at least one review. They represent about **3%** of the 87626 distinct users.

Dropping all users with 0 items for simplicity.

In [25]:
# Getting indexes to drop
idxs = df_items[df_items['items_count'] == 0].index

df_items.drop(index=idxs, inplace=True)
# Final version of the data
df_items.info()

<class 'pandas.core.frame.DataFrame'>
Index: 70912 entries, 0 to 87624
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      70912 non-null  object
 1   items_count  70912 non-null  int64 
 2   steam_id     70912 non-null  object
 3   user_url     70912 non-null  object
 4   items        70912 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.2+ MB


In [26]:
df_items['items'][0][-1]

{'item_id': '273350',
 'item_name': 'Evolve Stage 2',
 'playtime_forever': 58,
 'playtime_2weeks': 0}

Just above is a single item (a JSON-like dictionary) from the 'items' column.

Each item follows the same key-value structure, with the information we need:

- ``item_id``
- ``item_name``
- ``playtime_forever`` (presumably the total playtime for that item; the Steam Web API reports this field in minutes, so the unit is likely minutes rather than hours).
- ``playtime_2weeks`` (presumably the playtime for that item over the last 2 weeks, in the same unit).

Items have to be unnested.

The result is a huge dataset, but it will be easier to query and analyze data.

The cell below can take several minutes to execute, and so can creating a pandas DataFrame from its result.

In [27]:
# Unpacking
items = fts.json_unpacking(
    df = df_items,
    where='items',
    values=['item_id', 'playtime_forever'],
    old_colums=['user_id']
)

Shape of the resulting array: (5094082, 3)


The resulting dataset is pretty big; **5M+ items**.

In [28]:
items.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5094082 entries, 0 to 5094081
Data columns (total 3 columns):
 #   Column            Dtype 
---  ------            ----- 
 0   user_id           object
 1   item_id           object
 2   playtime_forever  int64 
dtypes: int64(1), object(2)
memory usage: 116.6+ MB


- Are there users who own items but have never played them? (``playtime_forever = 0`` for some item)

In [29]:
items[items['playtime_forever'] == 0]

Unnamed: 0,user_id,item_id,playtime_forever
1,76561197970982479,20,0
3,76561197970982479,40,0
4,76561197970982479,50,0
5,76561197970982479,60,0
6,76561197970982479,70,0
...,...,...,...
5094072,76561198326700687,519170,0
5094073,76561198326700687,358390,0
5094074,76561198326700687,521570,0
5094077,76561198329548331,346330,0


Almost 2M records have zero playtime.

There is no reason to keep those records, since the data will only be used to compute playtime aggregations.

In [30]:
# Dropping
items.drop(
    index = items[items['playtime_forever'] == 0].index,
    inplace=True
    )

# Checking new info and size
items.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3246352 entries, 0 to 5094081
Data columns (total 3 columns):
 #   Column            Dtype 
---  ------            ----- 
 0   user_id           object
 1   item_id           object
 2   playtime_forever  int64 
dtypes: int64(1), object(2)
memory usage: 99.1+ MB
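One caveat worth keeping in mind, illustrated on made-up numbers: dropping zero-playtime rows leaves per-item sums unchanged, but it does change means and counts, and items that nobody played disappear from the result altogether. Whether that matters depends on which aggregations are computed later.

```python
import pandas as pd

toy = pd.DataFrame({
    "item_id": ["10", "10", "10", "20"],
    "playtime_forever": [0, 30, 90, 0],
})
# Boolean mask equivalent to dropping the zero-playtime rows by index
played = toy[toy["playtime_forever"] > 0]

total_all = toy.groupby("item_id")["playtime_forever"].sum()
total_played = played.groupby("item_id")["playtime_forever"].sum()
mean_all = toy.groupby("item_id")["playtime_forever"].mean()
mean_played = played.groupby("item_id")["playtime_forever"].mean()

print(total_all["10"], total_played["10"])  # sums agree
print(mean_all["10"], mean_played["10"])    # means differ
print("20" in total_played.index)           # unplayed item is gone
```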


Saving this cleaner, lighter DataFrame

In [14]:
# Saving the cleaned DataFrame to a gzip compressed json file
fts.gzip_json_file(
    './data/items.json.gz',
    df=items,
)

File saved at "data/items.json.gz"


In [92]:
mask = df_games['genres'].map(lambda x: 'Action' in x, na_action='ignore')

In [93]:
mask

0         True
1        False
2        False
3         True
4          NaN
         ...  
32130    False
32131    False
32132    False
32133    False
32134      NaN
Name: genres, Length: 32133, dtype: object