# Obtain data

In this notebook, we retrieve the raw data and save it as JSON files.

Although the data is described as JSON, running it through a JSON linter showed that it is not valid JSON: strings are delimited with single quotes instead of double quotes.

To remedy this, we explored options such as replacing the single quotes with double quotes. However, this broke records where the name of a game includes an apostrophe, e.g. `Assassin's Creed`.

A workable solution was to parse the data with the `ast.literal_eval()` function and save the results as `.json` files.
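
To make the difference concrete, here is a minimal sketch comparing the two approaches on a single record. The record itself is hypothetical (the `item_id` value is made up), written in the same Python-literal style as the raw files.

```python
import ast
import json

# Hypothetical record in the same style as the raw files: the name contains
# an apostrophe, so Python's repr wraps it in double quotes instead.
raw_line = """{'item_id': '1', 'item_name': "Assassin's Creed"}"""

# Naive fix: swap every single quote for a double quote.
naive = raw_line.replace("'", '"')
try:
    json.loads(naive)
    naive_ok = True
except json.JSONDecodeError:
    # The apostrophe became a stray double quote that ends the string early.
    naive_ok = False

# literal_eval parses the line as a Python literal, so quoting is handled correctly.
record = ast.literal_eval(raw_line)

# The resulting dict serialises to valid JSON.
as_json = json.dumps(record)

print(naive_ok)
print(record['item_name'])
```

Because `ast.literal_eval()` only accepts literals (strings, numbers, lists, dicts, booleans, `None` and similar), it does not execute arbitrary code the way `eval()` would.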

## User-items data

The dataset was downloaded from https://cseweb.ucsd.edu/~jmcauley/datasets.html#steam_data (file *Version 1: User and Item Data*).

In [1]:
# Import libraries

import pandas as pd
import numpy as np
import ast
import json

In [2]:
# Read the first 30,000 lines of the file using readlines()

with open('australian_users_items.json') as f:
    lines = f.readlines()[:30000]
len(lines)

30000

The full file contains 88310 lines, each representing a user. Only the first 30000 lines are read here.
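
As an aside, the line count and a partial read can both be done without holding the whole file in memory. The sketch below is one possible way, using hypothetical helper names (`count_lines`, `read_first_lines`) and a small in-memory stand-in for the real file.

```python
import io
from itertools import islice


def count_lines(handle):
    """Count lines by iterating, so only one line is held in memory at a time."""
    return sum(1 for _ in handle)


def read_first_lines(handle, n):
    """Return at most the first n lines without reading the rest of the file."""
    return list(islice(handle, n))


# Stand-in for the real file: five tiny hypothetical records, one per line.
sample = io.StringIO("".join("{'user_id': 'u%d'}\n" % i for i in range(5)))

total = count_lines(sample)
sample.seek(0)  # rewind before reading again
first_three = read_first_lines(sample, 3)

print(total, len(first_three))
```

With the real data, the same helpers would be called on the handle returned by `open('australian_users_items.json')`.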

In [3]:
# Evaluate the first line
# j = ast.literal_eval(lines[0])
# j

In [4]:
# Create a string representing a list with each line separated by a comma
# newstring = "[" + ",".join(lines) + "]"

# Evaluate this new string
j = ast.literal_eval("[" + ",".join(lines) + "]")
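
Joining every line into one large string works, but a single malformed line makes the whole call fail without saying where. An alternative sketch, assuming each line holds exactly one record, parses line by line and reports the offending line number. The helper name `parse_literal_lines` and the sample records are hypothetical.

```python
import ast


def parse_literal_lines(lines):
    """Parse each non-empty line as a Python literal; report the failing line number."""
    records = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            records.append(ast.literal_eval(line))
        except (ValueError, SyntaxError) as exc:
            raise ValueError(f"line {number} is not a valid Python literal") from exc
    return records


# Two hypothetical records in the same style as the raw file.
sample_lines = [
    "{'user_id': 'a', 'items_count': 2}\n",
    "{'user_id': 'b', 'items_count': 0}\n",
]

records = parse_literal_lines(sample_lines)

# For well-formed input this matches the join-then-evaluate approach.
joined = ast.literal_eval("[" + ",".join(sample_lines) + "]")
print(records == joined)
```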

In [5]:
# Save file as JSON

with open('data.json', 'w') as json_file:
    json.dump(j, json_file)
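
A quick way to confirm that the saved file is valid JSON is to load it back and compare it with the in-memory data. The sketch below does this with a small hypothetical record and a temporary directory, so it does not touch the real `data.json`.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the parsed data.
records = [{'user_id': 'a', 'items': [{'item_name': "Assassin's Creed"}]}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data.json')

    # Save as JSON, exactly as in the cell above.
    with open(path, 'w') as json_file:
        json.dump(records, json_file)

    # Load it back with a strict JSON parser.
    with open(path) as json_file:
        reloaded = json.load(json_file)

    # Keep the raw text to inspect the quoting.
    with open(path) as json_file:
        text = json_file.read()

round_trip_ok = reloaded == records
print(round_trip_ok)
print(text)
```

The comparison holds here because the records contain only dicts, lists and strings. Python-only types such as tuples would come back as lists after a JSON round trip.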

We now have a `.json` file that we can easily view as a Pandas DataFrame.

In [6]:
# Load dataframe to check
df = pd.DataFrame(j)
df.head()

| | user_id | items_count | steam_id | user_url | items |
|---|---|---|---|---|---|
| 0 | 76561197970982479 | 277 | 76561197970982479 | http://steamcommunity.com/profiles/76561197970... | [{'item_id': '10', 'item_name': 'Counter-Strik... |
| 1 | js41637 | 888 | 76561198035864385 | http://steamcommunity.com/id/js41637 | [{'item_id': '10', 'item_name': 'Counter-Strik... |
| 2 | evcentric | 137 | 76561198007712555 | http://steamcommunity.com/id/evcentric | [{'item_id': '1200', 'item_name': 'Red Orchest... |
| 3 | Riot-Punch | 328 | 76561197963445855 | http://steamcommunity.com/id/Riot-Punch | [{'item_id': '10', 'item_name': 'Counter-Strik... |
| 4 | doctr | 541 | 76561198002099482 | http://steamcommunity.com/id/doctr | [{'item_id': '300', 'item_name': 'Day of Defea... |


## Items detail data

This dataset was downloaded from https://cseweb.ucsd.edu/~jmcauley/datasets.html#steam_data (file *Version 2: item metadata*).

As above, we use `ast.literal_eval()` to read the data and then save it as a `.json` file.

In [7]:
# Read data

with open('steam_games.json') as f:
    lines = f.readlines()

In [8]:
# View first line

lines[0]

"{u'publisher': u'Kotoshiro', u'genres': [u'Action', u'Casual', u'Indie', u'Simulation', u'Strategy'], u'app_name': u'Lost Summoner Kitty', u'title': u'Lost Summoner Kitty', u'url': u'http://store.steampowered.com/app/761140/Lost_Summoner_Kitty/', u'release_date': u'2018-01-04', u'tags': [u'Strategy', u'Action', u'Indie', u'Casual', u'Simulation'], u'discount_price': 4.49, u'reviews_url': u'http://steamcommunity.com/app/761140/reviews/?browsefilter=mostrecent&p=1', u'specs': [u'Single-player'], u'price': 4.99, u'early_access': False, u'id': u'761140', u'developer': u'Kotoshiro'}\n"

In [9]:
# Get number of lines
len(lines)

32135

There are 32135 lines, each representing a different game.

In [10]:
# Evaluate the first line
j = ast.literal_eval(lines[0])
j

{'publisher': 'Kotoshiro',
 'genres': ['Action', 'Casual', 'Indie', 'Simulation', 'Strategy'],
 'app_name': 'Lost Summoner Kitty',
 'title': 'Lost Summoner Kitty',
 'url': 'http://store.steampowered.com/app/761140/Lost_Summoner_Kitty/',
 'release_date': '2018-01-04',
 'tags': ['Strategy', 'Action', 'Indie', 'Casual', 'Simulation'],
 'discount_price': 4.49,
 'reviews_url': 'http://steamcommunity.com/app/761140/reviews/?browsefilter=mostrecent&p=1',
 'specs': ['Single-player'],
 'price': 4.99,
 'early_access': False,
 'id': '761140',
 'developer': 'Kotoshiro'}
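
The `u'...'` prefixes in the raw line are Python 2 unicode string markers, and `False` is a Python literal rather than JSON's `false`. Both are rejected by a JSON parser but accepted by `ast.literal_eval()`. The sketch below illustrates this on a shortened copy of the first line, keeping only three of its keys.

```python
import ast
import json

# Shortened copy of the first line of steam_games.json (three keys only).
raw_line = "{u'app_name': u'Lost Summoner Kitty', u'price': 4.99, u'early_access': False}\n"

# A strict JSON parser rejects the u-prefixed, single-quoted keys.
try:
    json.loads(raw_line)
    json_rejected = False
except json.JSONDecodeError:
    json_rejected = True

# literal_eval treats the line as a Python dict literal.
game = ast.literal_eval(raw_line)

# Serialising the dict produces double quotes and a lowercase false.
as_json = json.dumps(game)

print(json_rejected)
print(as_json)
```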

In [11]:
# Create a string representing a list with each line separated by a comma
newstring = '[' + ','.join(lines) + ']'

In [12]:
# Evaluate this new string
j = ast.literal_eval(newstring)

In [13]:
# Save file as JSON

with open('gamesdata.json', 'w') as json_file:
    json.dump(j, json_file)
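
Since both datasets go through the same read, evaluate and save steps, they could be wrapped in one reusable function. The following is only a sketch: the function name `convert_literal_file`, the `limit` parameter and the explicit `utf-8` encoding are assumptions, not part of the original cells, and the demo uses a temporary two-line file rather than the real data.

```python
import ast
import json
import os
import tempfile


def convert_literal_file(src_path, dst_path, limit=None):
    """Read Python-literal lines from src_path and write them to dst_path as one JSON array."""
    records = []
    # Assumption: the raw files are UTF-8 encoded.
    with open(src_path, encoding='utf-8') as src:
        for index, line in enumerate(src):
            if limit is not None and index >= limit:
                break  # stop after the requested number of lines
            if line.strip():
                records.append(ast.literal_eval(line))
    with open(dst_path, 'w', encoding='utf-8') as dst:
        json.dump(records, dst)
    return len(records)


# Demo on a temporary file holding two shortened records.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, 'sample_games.txt')
    dst_path = os.path.join(tmp, 'sample_games.json')
    with open(src_path, 'w', encoding='utf-8') as f:
        f.write("{u'app_name': u'Lost Summoner Kitty', u'price': 4.99}\n")
        f.write("{u'app_name': u'Ironbound', u'price': u'Free To Play'}\n")

    count = convert_literal_file(src_path, dst_path)
    with open(dst_path, encoding='utf-8') as f:
        converted = json.load(f)

print(count, converted[1]['price'])
```

With the real files, the calls would be along the lines of `convert_literal_file('steam_games.json', 'gamesdata.json')` and `convert_literal_file('australian_users_items.json', 'data.json', limit=30000)`.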

We now have a `.json` file that we can easily view as a Pandas DataFrame.

In [14]:
# Load dataframe to check
df = pd.DataFrame(j)
df.head()

| | publisher | genres | app_name | title | url | release_date | tags | discount_price | reviews_url | specs | price | early_access | id | developer | sentiment | metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kotoshiro | [Action, Casual, Indie, Simulation, Strategy] | Lost Summoner Kitty | Lost Summoner Kitty | http://store.steampowered.com/app/761140/Lost_... | 2018-01-04 | [Strategy, Action, Indie, Casual, Simulation] | 4.49 | http://steamcommunity.com/app/761140/reviews/?... | [Single-player] | 4.99 | False | 761140 | Kotoshiro | | |
| 1 | Making Fun, Inc. | [Free to Play, Indie, RPG, Strategy] | Ironbound | Ironbound | http://store.steampowered.com/app/643980/Ironb... | 2018-01-04 | [Free to Play, Strategy, Indie, RPG, Card Game... | | http://steamcommunity.com/app/643980/reviews/?... | [Single-player, Multi-player, Online Multi-Pla... | Free To Play | False | 643980 | Secret Level SRL | Mostly Positive | |
| 2 | Poolians.com | [Casual, Free to Play, Indie, Simulation, Sports] | Real Pool 3D - Poolians | Real Pool 3D - Poolians | http://store.steampowered.com/app/670290/Real_... | 2017-07-24 | [Free to Play, Simulation, Sports, Casual, Ind... | | http://steamcommunity.com/app/670290/reviews/?... | [Single-player, Multi-player, Online Multi-Pla... | Free to Play | False | 670290 | Poolians.com | Mostly Positive | |
| 3 | 彼岸领域 | [Action, Adventure, Casual] | 弹炸人2222 | 弹炸人2222 | http://store.steampowered.com/app/767400/2222/ | 2017-12-07 | [Action, Adventure, Casual] | 0.83 | http://steamcommunity.com/app/767400/reviews/?... | [Single-player] | 0.99 | False | 767400 | 彼岸领域 | | |
| 4 | | | Log Challenge | | http://store.steampowered.com/app/773570/Log_C... | | [Action, Indie, Casual, Sports] | 1.79 | http://steamcommunity.com/app/773570/reviews/?... | [Single-player, Full controller support, HTC V... | 2.99 | False | 773570 | | | |
