Logs
- [2024/10/22]  
  [Class session]: Review the three data-reading functions `.text`, `.csv`, and `.json`.  
  Use the JSON formatter in VSCode to prettify a one-line `.json` file.
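
  The same prettifying can also be done without an editor. A minimal sketch using only the standard library; the sample string is made up and merely stands in for the contents of a real one-line `.json` file:

```python
import json

# Hypothetical one-line JSON text standing in for the contents of a real file
one_line = '{"id": 143, "name": "Silicon Valley", "genres": ["Comedy"]}'

# Parse, then re-serialize with indentation to get a readable multi-line form
pretty = json.dumps(json.loads(one_line), indent=2)
print(pretty)
```

  From a shell, `python -m json.tool <file>` should give a similar result.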

# Week 09

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

import json
import pprint

In [2]:
spark = (SparkSession
  .builder
  .master("local[*]") 
  .appName("Processing JSON")
  .getOrCreate())


In [3]:
spark

We read the `.json` file with Python's built-in `json` library.

In [4]:
with open("./data/shows/shows-silicon-valley.json", 'r') as f_json:
  data = f_json.read()
shows_json = json.loads(data)
pprint.pp(shows_json)

{'id': 143,
 'url': 'https://www.tvmaze.com/shows/143/silicon-valley',
 'name': 'Silicon Valley',
 'type': 'Scripted',
 'language': 'English',
 'genres': ['Comedy'],
 'status': 'Ended',
 'runtime': 30,
 'averageRuntime': 30,
 'premiered': '2014-04-06',
 'ended': '2019-12-08',
 'officialSite': 'http://www.hbo.com/silicon-valley/',
 'schedule': {'time': '22:00', 'days': ['Sunday']},
 'rating': {'average': 8.4},
 'weight': 89,
 'network': {'id': 8,
             'name': 'HBO',
             'country': {'name': 'United States',
                         'code': 'US',
                         'timezone': 'America/New_York'},
             'officialSite': 'https://www.hbo.com/'},
 'webChannel': None,
 'dvdCountry': None,
 'externals': {'tvrage': 33759, 'thetvdb': 277165, 'imdb': 'tt2575988'},
 'image': {'medium': 'https://static.tvmaze.com/uploads/images/medium_portrait/215/538434.jpg',
           'original': 'https://static.tvmaze.com/uploads/images/original_untouched/215/538434.jpg'},
 'summar

Now we read the same `.json` file with PySpark.

In [5]:
shows = spark.read.json("./data/shows/shows-silicon-valley.json")
shows.count()

1
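
By default, `spark.read.json` treats each line of the input as one JSON document (the JSON Lines convention), which is why a file holding a whole show on a single line yields exactly one row. A prettified, multi-line file would need `multiLine=True` instead. The plain-Python sketch below mimics line-by-line parsing to show the difference. The sample record and the helper name `parse_line_by_line` are made up for illustration, and this is not Spark's actual parser:

```python
import json

def parse_line_by_line(text):
    """Parse each non-empty line as its own JSON document (JSON Lines style)."""
    records, bad_lines = [], 0
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad_lines += 1  # this line is not a complete JSON document
    return records, bad_lines

# Hypothetical flat record, used only for this demo
record = {"id": 143, "name": "Silicon Valley"}
one_line = json.dumps(record)           # whole document on one line
pretty = json.dumps(record, indent=2)   # same document spread over several lines

print(parse_line_by_line(one_line))  # one record, no bad lines
print(parse_line_by_line(pretty))    # no records: every line fails on its own
```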

We read all the `.json` files in the `shows` directory.

In [6]:
multiple_shows = spark.read.json(
  "./data/shows/shows-*.json", multiLine=True)
multiple_shows.count()

8
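
A quick way to double-check this count is to list the files that the pattern matches. The sketch below uses Python's `glob` on a throwaway directory with made-up file names. For a simple `*` wildcard it should match the same files as the Spark path, although Spark resolves globs through its own file-system layer:

```python
import glob
import os
import tempfile

# Build a throwaway directory with hypothetical file names for the demo
tmp_dir = tempfile.mkdtemp()
for file_name in ["shows-a.json", "shows-b.json", "notes.txt"]:
    with open(os.path.join(tmp_dir, file_name), "w") as f:
        f.write("{}")

# Only names matching the pattern are returned; notes.txt is skipped
matched = sorted(glob.glob(os.path.join(tmp_dir, "shows-*.json")))
matched_names = [os.path.basename(p) for p in matched]
print(matched_names)
```

On the real data, `len(glob.glob("./data/shows/shows-*.json"))` would be expected to equal the row count above, provided each file holds exactly one show.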

In [7]:
shows.printSchema()

root
 |-- _embedded: struct (nullable = true)
 |    |-- episodes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _links: struct (nullable = true)
 |    |    |    |    |-- self: struct (nullable = true)
 |    |    |    |    |    |-- href: string (nullable = true)
 |    |    |    |    |-- show: struct (nullable = true)
 |    |    |    |    |    |-- href: string (nullable = true)
 |    |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- airdate: string (nullable = true)
 |    |    |    |-- airstamp: string (nullable = true)
 |    |    |    |-- airtime: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- image: struct (nullable = true)
 |    |    |    |    |-- medium: string (nullable = true)
 |    |    |    |    |-- original: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- number: long (nullable = true)
 |    |    |    |-- rating:

Show all top-level columns

In [8]:
shows.columns

['_embedded',
 '_links',
 'averageRuntime',
 'dvdCountry',
 'ended',
 'externals',
 'genres',
 'id',
 'image',
 'language',
 'name',
 'network',
 'officialSite',
 'premiered',
 'rating',
 'runtime',
 'schedule',
 'status',
 'summary',
 'type',
 'updated',
 'url',
 'webChannel',
 'weight']
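
Note that the columns come back in alphabetical order, not in the order the keys appear in the file (compare with the `pprint` output above); Spark's JSON schema inference appears to sort field names. A plain-Python counterpart on a small subset of the record, assuming only the four keys shown:

```python
# Subset of the record, keys in the order they appear in the file
record = {
    "id": 143,
    "url": "https://www.tvmaze.com/shows/143/silicon-valley",
    "name": "Silicon Valley",
    "type": "Scripted",
}

file_order = list(record)      # dicts preserve insertion order
sorted_order = sorted(record)  # alphabetical, like the column list above

print(file_order)
print(sorted_order)
```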

Select the `name` and `genres` columns

In [9]:
array_subset = shows.select("name", "genres")

array_subset.show(1, False)

+--------------+--------+
|name          |genres  |
+--------------+--------+
|Silicon Valley|[Comedy]|
+--------------+--------+



Select the `name` and `episodes` columns   
Hint: `episodes` is nested inside `_embedded`

In [15]:
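# Get all episodes (returns a DataFrame with a single array column)
shows.select(F.col("_embedded").episodes)

# Equivalent: attribute (dot) notation on the DataFrame; .first()[0] pulls the array out of the first row
# shows.select(shows._embedded.episodes).first()[0]

# Equivalent: a dotted path inside F.col
# shows.select(F.col("_embedded.episodes")).first()[0]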

DataFrame[_embedded.episodes: array<struct<_links:struct<self:struct<href:string>,show:struct<href:string,name:string>>,airdate:string,airstamp:string,airtime:string,id:bigint,image:struct<medium:string,original:string>,name:string,number:bigint,rating:struct<average:double>,runtime:bigint,season:bigint,summary:string,type:string,url:string>>]

In [47]:
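# Number of episodes stored in the array of the single show row
n_episodes = shows.select(F.size(F.col("_embedded.episodes"))).first()[0]
# Collect each episode name by indexing into the array, one index at a time
array_of_episode_names = [
  shows.select(F.col("_embedded.episodes")[i].alias("episode"))
       .select("episode.name").first()[0]
  for i in range(n_episodes)
]
array_of_episode_names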

['Minimum Viable Product',
 'The Cap Table',
 'Articles of Incorporation',
 'Fiduciary Duties',
 'Signaling Risk',
 'Third Party Insourcing',
 'Proof of Concept',
 'Optimal Tip-to-Tip Efficiency',
 'Sand Hill Shuffle',
 'Runaway Devaluation',
 'Bad Money',
 'The Lady',
 'Server Space',
 'Homicide',
 'Adult Content',
 'White Hat/Black Hat',
 'Binding Arbitration',
 'Two Days of the Condor',
 'Founder Friendly',
 'Two in the Box',
 "Meinertzhagen's Haversack",
 'Maleant Data Systems Solutions',
 'The Empty Chair',
 'Bachmanity Insanity',
 'To Build a Better Beta',
 "Bachman's Earning's Over-ride",
 'Daily Active Users',
 'The Uptick',
 'Success Failure',
 'Terms of Service',
 'Intellectual Property',
 'Teambuilding Exercise',
 'The Blood Boy',
 'Customer Service',
 'The Patent Troll',
 'The Keenan Vortex',
 'Hooli-Con',
 'Server Error',
 'Grow Fast or Die Slow',
 'Reorientation',
 'Chief Operating Officer',
 'Tech Evangelist',
 'Facial Recognition',
 'Artificial Emotional Intelligence',


In [50]:
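# Attach the Python list as a literal array column; a distinct alias avoids
# a second column called `name`
shows.select(
  F.col("name"), F.lit(array_of_episode_names).alias("episode_names")
).show(1, False)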

+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Struct type within column