# Lab Report 3: How to Load, Convert, and Write JSON Files in Python
## Name: Afnan Alabdulwahab
----------------------------------------------------------------------------------------------------------------------------------------------------


## Problem 0
Import the following libraries:

In [72]:
import numpy as np
import pandas as pd
import requests
import json
import sys
sys.tracebacklimit = 0 # turn off the error tracebacks

## Problem 1 

### Why does a CSV file usually take less memory than a JSON formatted file for the same data?
The way the data is structured in a JSON file usually makes it larger than a CSV file holding the same data. In a CSV file, the data is represented in a tabular format where each line is a row and the columns are separated by a delimiter (a comma in most cases). A JSON file for the same dataset needs additional characters such as `{}`, `[]`, and `""` to structure the data. In addition, every record repeats the key name for each value, whereas a CSV file lists the column names once in a header line. This overhead grows with the number of records and with the depth of nesting. So the lack of nested structure and repeated key-value pairs keeps a CSV file smaller.
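
As a rough illustration of this overhead, the sketch below serializes the same flat records both ways and compares the byte counts. The records are made-up values, not the lab data, and the names `records`, `csv_size`, and `json_size` are only for this example.

```python
import csv
import io
import json

# Hypothetical flat records (made-up values, not from the lab data)
records = [{"name": f"item{i}", "id": i, "mass": i * 10} for i in range(100)]

# CSV: the column names appear once, in the header row
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["name", "id", "mass"])
writer.writeheader()
writer.writerows(records)
csv_size = len(csv_buffer.getvalue().encode("utf-8"))

# JSON: every record repeats the key names, quotes, and braces
json_size = len(json.dumps(records).encode("utf-8"))

print("CSV bytes: ", csv_size)
print("JSON bytes:", json_size)
```

For flat records like these, the CSV text comes out smaller because the key names are written only once.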

### Under what conditions could a JSON file be smaller in memory than a CSV file for the same data?
* **If the data is complex and nested**, the JSON format can be more efficient because it takes advantage of objects, arrays, and mixed data types. This can be more compact than flattening nested data into a 2D table, which requires repeating fields and therefore increases the file size.
* **If the data has many missing values**. Missing data can simply be omitted from a JSON record. In a CSV file, each missing value still takes up space, either as a placeholder such as 'NaN' or as the delimiter around an empty cell.
* **If values are shared across many items**. JSON objects and arrays can store a shared value once and list the related items beneath it. In a CSV file, the shared value is repeated as plain text on every row and consumes more space.
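
The missing-data condition can be sketched with a deliberately sparse dataset. The example assumes 30 hypothetical columns (`c0` to `c29`) where each record fills in only one of them; the values are made up for illustration.

```python
import csv
import io
import json

# Hypothetical sparse data: 30 possible columns, but each record has only one value
columns = [f"c{i}" for i in range(30)]
sparse_records = [{f"c{i % 30}": i} for i in range(100)]

# CSV: every row still needs a delimiter for each empty cell
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=columns)  # missing keys are written as ''
writer.writeheader()
writer.writerows(sparse_records)
sparse_csv_size = len(csv_buffer.getvalue().encode("utf-8"))

# JSON: missing keys are simply left out of each record
sparse_json_size = len(json.dumps(sparse_records).encode("utf-8"))

print("CSV bytes: ", sparse_csv_size)
print("JSON bytes:", sparse_json_size)
```

Here the JSON text is the smaller one, because the CSV rows are mostly commas separating empty cells.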

## Problem 2

First, I will get the user agent, using 'https://httpbin.org', to include in the `headers` parameter for requesting webpages.

In [75]:
url = 'https://httpbin.org/user-agent'
r = requests.get(url)
r

<Response [200]>

In [76]:
myjson = json.loads(r.text)
useragent = myjson['user-agent']
headers = {'User-Agent': useragent,
          'From': 'aa7dd@virginia.edu'}
headers

{'User-Agent': 'python-requests/2.32.3', 'From': 'aa7dd@virginia.edu'}

Examining the data in my web-browser I can see the data *contains nesting* (geolocation), but *no metadata*. In that case the best strategy is to:
1. use `requests.get()` to download the raw JSON data from the JSON url provided, then check that the request was successful.

In [77]:
url = 'https://data.nasa.gov/resource/y77d-th95.json'
meteorites = requests.get(url, headers=headers)
meteorites

<Response [200]>

2. use `json.loads()` on the `.text` attribute of the output `meteorites` to register the data as a list in Python.

In [78]:
meteorites_json = json.loads(meteorites.text)
type(meteorites_json)

list

3. use `pd.json_normalize()` on the list `meteorites_json` to store every feature in the data in a separate column, regardless of the depth of nesting of features. `pd.json_normalize()` produces a `pandas` dataframe.

In [79]:
meteorites_df = pd.json_normalize(meteorites_json)
meteorites_df

Unnamed: 0,name,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation.type,geolocation.coordinates,:@computed_region_cbhk_fwbd,:@computed_region_nnqa_25f4
0,Aachen,1,Valid,L5,21,Fell,1880-01-01T00:00:00.000,50.775000,6.083330,Point,"[6.08333, 50.775]",,
1,Aarhus,2,Valid,H6,720,Fell,1951-01-01T00:00:00.000,56.183330,10.233330,Point,"[10.23333, 56.18333]",,
2,Abee,6,Valid,EH4,107000,Fell,1952-01-01T00:00:00.000,54.216670,-113.000000,Point,"[-113, 54.21667]",,
3,Acapulco,10,Valid,Acapulcoite,1914,Fell,1976-01-01T00:00:00.000,16.883330,-99.900000,Point,"[-99.9, 16.88333]",,
4,Achiras,370,Valid,L6,780,Fell,1902-01-01T00:00:00.000,-33.166670,-64.950000,Point,"[-64.95, -33.16667]",,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,Tirupati,24009,Valid,H6,230,Fell,1934-01-01T00:00:00.000,13.633330,79.416670,Point,"[79.41667, 13.63333]",,
996,Tissint,54823,Valid,Martian (shergottite),7000,Fell,2011-01-01T00:00:00.000,29.481950,-7.611230,Point,"[-7.61123, 29.48195]",,
997,Tjabe,24011,Valid,H6,20000,Fell,1869-01-01T00:00:00.000,-7.083330,111.533330,Point,"[111.53333, -7.08333]",,
998,Tjerebon,24012,Valid,L5,16500,Fell,1922-01-01T00:00:00.000,-6.666670,106.583330,Point,"[106.58333, -6.66667]",,
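
To see how `pd.json_normalize()` treats the nested `geolocation` field, here is a minimal sketch on two hypothetical records shaped like the meteorite data. The values are made up, and the second record deliberately has no `geolocation` key.

```python
import pandas as pd

# Two hypothetical records shaped like the meteorite data (values made up)
sample = [
    {"name": "A", "id": "1",
     "geolocation": {"type": "Point", "coordinates": [6.0, 50.0]}},
    {"name": "B", "id": "2"},  # no geolocation key at all
]

flat = pd.json_normalize(sample)

# Nested dictionary keys become dotted column names
print(flat.columns.tolist())

# A list value is kept whole in a single cell rather than split into columns
print(flat.loc[0, "geolocation.coordinates"])

# A key missing from a record shows up as a missing value
print(flat["geolocation.type"].isna().tolist())
```

This matches the table above, where `geolocation` appears as the two columns `geolocation.type` and `geolocation.coordinates`.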


## Problem 3

First, I will use `requests.get()` to download the raw JSON data from the JSON url provided and I will check if the request was successful.

In [80]:
url = 'https://www.reddit.com/r/popular/top.json'
reddit = requests.get(url, headers=headers)
reddit

<Response [200]>

Next, I will use `json.loads()` on the `.text` attribute of the output `reddit` to register the data as a list/dict in Python.

In [81]:
reddit_json = json.loads(reddit.text)
type(reddit_json)

dict

Examining the data in my web-browser, I can see the data contains *metadata* and *nesting*. The top-level has two keys: “kind” and “data”, and the data live in “data”. Within this branch, there are four more metadata branches, “after”, “dist”, “modhash”, and “geo_filter”, and the data I need exist within “children”. So I can use the keys ['data']['children'] in the loaded dictionary `reddit_json` to access the data I need for looping:

In [82]:
reddit_data = reddit_json['data']['children']
type(reddit_data)

list

Now I can loop through `reddit_data`, extract `subreddit`, `title`, `ups`, and `created_utc` from each record, and save the results in a dataframe by placing a `for` loop inside `pd.DataFrame()`.

(Note: each record carries its own metadata key, 'kind', and the data I need is within 'data', so I'll use `x['data']` to access it in my loop.)

In [89]:
reddit_df = pd.DataFrame([x['data']['subreddit'], x['data']['title'],
                          x['data']['ups'], x['data']['created_utc']]
                         for x in reddit_data)
reddit_df.columns = ['subreddit', 'title', 'ups', 'created_utc']
reddit_df

Unnamed: 0,subreddit,title,ups,created_utc
0,clevercomebacks,My thumb is the size of a nuke explosion,64050,1719624000.0
1,interestingasfuck,How a breeding bull is greeted by pasture full...,53027,1719630000.0
2,TikTokCringe,Oh how times have changed,54014,1719630000.0
3,Damnthatsinteresting,Grab your iced tea and Raise a toast!,51499,1719615000.0
4,pics,Matthew McConaughey &amp; Woody Harrelson padd...,45773,1719609000.0
5,MadeMeSmile,A love-hate-love relationship,48017,1719670000.0
6,MadeMeSmile,She’s a real friend,41208,1719628000.0
7,BlackPeopleTwitter,Post-debate Waffle House,39654,1719623000.0
8,OldSchoolCool,I used to live in a NYC loft in the 1970s: A 5...,38616,1719615000.0
9,meme,what is that,36692,1719626000.0
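
An alternative sketch of the same extraction builds one dictionary per post, so the column names come from the keys and no separate `.columns` assignment is needed. The `listing` dictionary below is a hypothetical miniature with the same shape as the Reddit response; its values are made up.

```python
import pandas as pd

# Hypothetical miniature of the Reddit response (values are made up)
listing = {
    "kind": "Listing",
    "data": {
        "after": None,
        "children": [
            {"kind": "t3", "data": {"subreddit": "pics", "title": "First post",
                                    "ups": 10, "created_utc": 1719609000.0}},
            {"kind": "t3", "data": {"subreddit": "meme", "title": "Second post",
                                    "ups": 7, "created_utc": 1719626000.0}},
        ],
    },
}

fields = ["subreddit", "title", "ups", "created_utc"]

# One dictionary per post: the keys become the column names directly
posts = pd.DataFrame(
    [{field: child["data"][field] for field in fields}
     for child in listing["data"]["children"]]
)
print(posts)
```

Keeping the wanted keys in one list (`fields`) means adding or removing a column only requires changing that list.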


## Problem 4

### Part a

First, I will use `requests.get()` to download the raw JSON data from the JSON url provided and I will check if the request was successful.

In [84]:
# part a
url = 'https://stats.nba.com/js/data/sportvu/2015/shootingTeamData.json'
nba = requests.get(url, headers=headers)
nba

<Response [200]>

Next, I will use `json.loads()` on the `.text` attribute of the output `nba` to register the data in Python's memory.

In [85]:
nba_json = json.loads(nba.text)
type(nba_json)

dict

### Part b

### The path that leads to the team-by-team data
The path starts at the top-level key '**resultSets**', which holds a list containing a single dictionary. That dictionary has a '**rowSet**' list containing the team-by-team data.
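
One way to confirm a path like this is to step through it one level at a time and check the type at each step. The sketch below does so on a hypothetical miniature of the NBA response (`mini`, with two made-up rows), not on the real data.

```python
# Hypothetical miniature of the NBA response (two made-up rows, three columns)
mini = {
    "resultSets": [
        {
            "headers": ["TEAM_ID", "TEAM_CITY", "TEAM_NAME"],
            "rowSet": [[1, "City A", "Team A"], [2, "City B", "Team B"]],
        }
    ]
}

# Step 1: the top-level key holds a list
result_sets = mini["resultSets"]
print(type(result_sets).__name__, len(result_sets))

# Step 2: the single element of that list is a dictionary
first = result_sets[0]
print(type(first).__name__, sorted(first.keys()))

# Step 3: 'rowSet' is a list with one inner list per team
rows = first["rowSet"]
print(len(rows), rows[0])
```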

### Part c

In [66]:
nba_df = pd.json_normalize(nba_json, record_path = ["resultSets", "rowSet"])
nba_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,23,24,25,26,27,28,29,30,31,32
0,1610612744,Golden State,Warriors,GSW,,82,48.7,114.9,14.9,0.498,...,0.478,21.2,42.5,0.497,2.3,6.3,0.363,10.8,25.3,0.429
1,1610612759,San Antonio,Spurs,SAS,,82,48.3,103.5,14.8,0.481,...,0.506,18.3,39.8,0.46,0.9,2.6,0.341,6.1,15.9,0.381
2,1610612739,Cleveland,Cavaliers,CLE,,82,48.7,104.3,16.9,0.481,...,0.473,18.2,40.7,0.447,1.7,5.7,0.299,9.0,23.9,0.378
3,1610612746,Los Angeles,Clippers,LAC,,82,48.6,104.5,15.0,0.497,...,0.48,18.9,42.0,0.45,2.0,6.0,0.334,7.7,20.8,0.373
4,1610612760,Oklahoma City,Thunder,OKC,,82,48.6,110.2,16.1,0.48,...,0.497,17.5,38.7,0.451,1.6,5.1,0.321,6.6,18.6,0.356
5,1610612737,Atlanta,Hawks,ATL,,82,48.6,102.8,19.0,0.463,...,0.483,19.4,44.6,0.435,1.0,3.1,0.311,9.0,25.3,0.355
6,1610612745,Houston,Rockets,HOU,,82,48.6,106.5,17.2,0.433,...,0.472,15.5,36.4,0.426,2.3,7.4,0.318,8.4,23.5,0.355
7,1610612757,Portland,Trail Blazers,POR,,82,48.5,105.1,17.5,0.441,...,0.447,18.0,39.8,0.453,1.7,5.9,0.295,8.8,22.6,0.389
8,1610612758,Sacramento,Kings,SAC,,81,48.4,106.7,18.7,0.452,...,0.473,18.1,39.7,0.454,0.9,3.1,0.276,7.2,19.4,0.372
9,1610612764,Washington,Wizards,WAS,,82,48.5,104.1,15.4,0.48,...,0.483,19.5,44.3,0.439,0.7,2.7,0.254,8.0,21.5,0.371


### Part d

The path that leads to the headers goes through the top-level key '**resultSets**', which holds a list containing a single dictionary. That dictionary contains the '**headers**' list.

In [86]:
column_names = nba_json["resultSets"][0]["headers"]
column_names

['TEAM_ID',
 'TEAM_CITY',
 'TEAM_NAME',
 'TEAM_ABBREVIATION',
 'TEAM_CODE',
 'GP',
 'MIN',
 'PTS',
 'PTS_DRIVE',
 'FGP_DRIVE',
 'PTS_CLOSE',
 'FGP_CLOSE',
 'PTS_CATCH_SHOOT',
 'FGP_CATCH_SHOOT',
 'PTS_PULL_UP',
 'FGP_PULL_UP',
 'FGA_DRIVE',
 'FGA_CLOSE',
 'FGA_CATCH_SHOOT',
 'FGA_PULL_UP',
 'EFG_PCT',
 'CFGM',
 'CFGA',
 'CFGP',
 'UFGM',
 'UFGA',
 'UFGP',
 'CFG3M',
 'CFG3A',
 'CFG3P',
 'UFG3M',
 'UFG3A',
 'UFG3P']

In [87]:
nba_df.columns = column_names
nba_df

Unnamed: 0,TEAM_ID,TEAM_CITY,TEAM_NAME,TEAM_ABBREVIATION,TEAM_CODE,GP,MIN,PTS,PTS_DRIVE,FGP_DRIVE,...,CFGP,UFGM,UFGA,UFGP,CFG3M,CFG3A,CFG3P,UFG3M,UFG3A,UFG3P
0,1610612744,Golden State,Warriors,GSW,,82,48.7,114.9,14.9,0.498,...,0.478,21.2,42.5,0.497,2.3,6.3,0.363,10.8,25.3,0.429
1,1610612759,San Antonio,Spurs,SAS,,82,48.3,103.5,14.8,0.481,...,0.506,18.3,39.8,0.46,0.9,2.6,0.341,6.1,15.9,0.381
2,1610612739,Cleveland,Cavaliers,CLE,,82,48.7,104.3,16.9,0.481,...,0.473,18.2,40.7,0.447,1.7,5.7,0.299,9.0,23.9,0.378
3,1610612746,Los Angeles,Clippers,LAC,,82,48.6,104.5,15.0,0.497,...,0.48,18.9,42.0,0.45,2.0,6.0,0.334,7.7,20.8,0.373
4,1610612760,Oklahoma City,Thunder,OKC,,82,48.6,110.2,16.1,0.48,...,0.497,17.5,38.7,0.451,1.6,5.1,0.321,6.6,18.6,0.356
5,1610612737,Atlanta,Hawks,ATL,,82,48.6,102.8,19.0,0.463,...,0.483,19.4,44.6,0.435,1.0,3.1,0.311,9.0,25.3,0.355
6,1610612745,Houston,Rockets,HOU,,82,48.6,106.5,17.2,0.433,...,0.472,15.5,36.4,0.426,2.3,7.4,0.318,8.4,23.5,0.355
7,1610612757,Portland,Trail Blazers,POR,,82,48.5,105.1,17.5,0.441,...,0.447,18.0,39.8,0.453,1.7,5.9,0.295,8.8,22.6,0.389
8,1610612758,Sacramento,Kings,SAC,,81,48.4,106.7,18.7,0.452,...,0.473,18.1,39.7,0.454,0.9,3.1,0.276,7.2,19.4,0.372
9,1610612764,Washington,Wizards,WAS,,82,48.5,104.1,15.4,0.48,...,0.483,19.5,44.3,0.439,0.7,2.7,0.254,8.0,21.5,0.371
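
Because the rows are plain lists and the headers are a separate list, the two steps can also be combined by passing both to `pd.DataFrame()`. This is a sketch on a hypothetical miniature of the response (`mini`, made-up values), offered as an alternative and not as the method used above.

```python
import pandas as pd

# Hypothetical miniature of the NBA response (values are made up)
mini = {
    "resultSets": [
        {
            "headers": ["TEAM_ID", "TEAM_CITY", "TEAM_NAME"],
            "rowSet": [[1, "City A", "Team A"], [2, "City B", "Team B"]],
        }
    ]
}

result = mini["resultSets"][0]

# Rows and column names come from the same dictionary, so build the frame in one call
mini_df = pd.DataFrame(result["rowSet"], columns=result["headers"])
print(mini_df)
```

pandas raises an error if the number of headers does not match the width of the rows, which gives an early check that the two lists belong together.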


## Problem 5

To save the NBA dataframe into a JSON formatted file, I can use `.to_json()`. To organize the data as a dictionary with three lists (`columns` lists the column names, `index` lists the row names, and `data` is a list-of-lists of data points), I can use the `orient="split"` parameter.

In [88]:
nba_df.to_json("nbajson.json", orient="split")
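
To check what `orient="split"` produces, the sketch below converts a small hypothetical dataframe (`small`, made-up values) to a JSON string instead of a file, inspects the three lists, and reads it back with `pd.read_json()`.

```python
import io
import json

import pandas as pd

# Small hypothetical dataframe (values are made up)
small = pd.DataFrame({"TEAM_NAME": ["Team A", "Team B"], "GP": [82, 81]})

# With no file path, to_json() returns the JSON text as a string
text = small.to_json(orient="split")
parsed = json.loads(text)

print(sorted(parsed.keys()))
print(parsed["columns"])
print(parsed["index"])
print(parsed["data"])

# Reading it back needs the same orient so pandas knows the layout
restored = pd.read_json(io.StringIO(text), orient="split")
print(restored)
```

The column names are stored once in `columns`, so this layout avoids repeating the key names in every record.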