# JavaScript Object Notation (JSON)

- A common notation for web data
- Non-tabular:
  - Records do not all share the same set of attributes
- Data is organized into collections of objects
- Objects are collections of key: value attributes
- Nested JSON: objects can be nested inside other objects
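
As a small illustration of these points, the sketch below parses two made-up records with Python's standard `json` module. The field names and values are invented for the example; they are not taken from any dataset used in this notebook.

```python
import json

# Two hypothetical records: they do not share the same attributes,
# and the first one contains a nested object.
raw = """
[
  {"name": "Ana", "geo": {"city": "Lima", "state": "LI"}},
  {"name": "Luis", "age": 30}
]
"""

records = json.loads(raw)  # a collection (list) of objects (dicts)

# Each object is a collection of key: value attributes.
keys_per_record = [sorted(r.keys()) for r in records]
print(keys_per_record)

# Nested objects are reached by chaining keys.
city = records[0]["geo"]["city"]
print(city)
```

The two records have different key sets, which is why this kind of data does not map directly onto a fixed table.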

#### JSON File Types

- **Record Orientation**: the most common JSON layout
- **Column Orientation**: uses space more efficiently than record orientation
- **Specifying Orientation**: needed for other layouts, such as split-oriented data
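
The sketch below builds the same two rows in three layouts using plain Python. The layouts follow the conventions pandas uses for its `records`, `columns`, and `split` orientations, which is an assumption about what the list above refers to; the sample rows are made up.

```python
import json

# Hypothetical sample rows used only for illustration.
columns = ["city", "count"]
rows = [["Lima", 15], ["Quito", 1]]

# Record orientation: one object per row, column names repeated in every row.
record_oriented = [dict(zip(columns, row)) for row in rows]

# Column orientation: one object per column, keyed by row index.
column_oriented = {
    col: {str(i): row[j] for i, row in enumerate(rows)}
    for j, col in enumerate(columns)
}

# Split orientation: column names, row index, and values stored separately.
split_oriented = {
    "columns": columns,
    "index": list(range(len(rows))),
    "data": rows,
}

for name, layout in [("records", record_oriented),
                     ("columns", column_oriented),
                     ("split", split_oriented)]:
    print(name, json.dumps(layout))
```

Printing the three layouts side by side makes it easy to see where column names are repeated and where they appear only once.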

<img src='https://docs.oracle.com/en/database/oracle/oracle-database/21/lnoci/img/json_doc.png'>

In [0]:
%run ./Includes/Classroom-Setup

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Reading JSON Files

Read from JSON with DataFrameReader's `json` method and the `inferSchema` option

In [0]:
eventsJsonPath = "/FileStore/events-500k.json"

eventsDF = (spark.read
  .option("inferSchema", True)
  .json(eventsJsonPath))

eventsDF.printSchema()

root
 |-- device: string (nullable = true)
 |-- ecommerce: struct (nullable = true)
 |    |-- purchase_revenue_in_usd: double (nullable = true)
 |    |-- total_item_quantity: long (nullable = true)
 |    |-- unique_items: long (nullable = true)
 |-- event_name: string (nullable = true)
 |-- event_previous_timestamp: long (nullable = true)
 |-- event_timestamp: long (nullable = true)
 |-- geo: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- coupon: string (nullable = true)
 |    |    |-- item_id: string (nullable = true)
 |    |    |-- item_name: string (nullable = true)
 |    |    |-- item_revenue_in_usd: double (nullable = true)
 |    |    |-- price_in_usd: double (nullable = true)
 |    |    |-- quantity: long (nullable = true)
 |-- traffic_source: string (nullable = true)
 |-- user_first_touch_timestamp: long (nullable = true)

Read data faster by creating a `StructType` with the schema's field names and data types, so Spark does not have to scan the data to infer them

In [0]:
from pyspark.sql.types import ArrayType, DoubleType, IntegerType, LongType, StringType, StructType, StructField

userDefinedSchema = StructType([
  StructField("device", StringType(), True),
  StructField("ecommerce", StructType([
    StructField("purchase_revenue_in_usd", DoubleType(), True),
    StructField("total_item_quantity", LongType(), True),
    StructField("unique_items", LongType(), True)
  ]), True),
  StructField("event_name", StringType(), True),
  StructField("event_previous_timestamp", LongType(), True),
  StructField("event_timestamp", LongType(), True),
  StructField("geo", StructType([
    StructField("city", StringType(), True),
    StructField("state", StringType(), True)
  ]), True),
  StructField("items", ArrayType(
    StructType([
      StructField("coupon", StringType(), True),
      StructField("item_id", StringType(), True),
      StructField("item_name", StringType(), True),
      StructField("item_revenue_in_usd", DoubleType(), True),
      StructField("price_in_usd", DoubleType(), True),
      StructField("quantity", LongType(), True)
    ])
  ), True),
  StructField("traffic_source", StringType(), True),
  StructField("user_first_touch_timestamp", LongType(), True),
  StructField("user_id", StringType(), True)
])

eventsDF = (spark.read
  .schema(userDefinedSchema)
  .json(eventsJsonPath))
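
To see why a predefined schema can speed up reading, consider what inference involves: the records (or a sample of them) have to be examined to discover field names and types before the data can be loaded. The function below is a simplified plain-Python illustration of that idea, not Spark's actual implementation; `infer_schema` and the sample lines are hypothetical.

```python
import json

def infer_schema(json_lines):
    """Scan every line and record the Python type name seen for each field."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for key, value in record.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema

# Hypothetical line-delimited JSON sample.
sample = [
    '{"device": "macOS", "event_timestamp": 1593878946592107}',
    '{"device": "Linux", "traffic_source": "google"}',
]

inferred = infer_schema(sample)
print(inferred)
```

The cost of this scan grows with the amount of data, whereas a schema supplied up front requires no such pass.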

## Downloading Files from Databricks

In [0]:
dbutils.fs.ls('dbfs:/FileStore')

dbutils.fs.cp('src', 'dst', True)  # recursive copy; 'src' and 'dst' are placeholder paths


/dbfs/mnt/training/ecommerce/events/events-500k.json


https://community.cloud.databricks.com/files/<path-file>?o=#####

https://community.cloud.databricks.com/files/events-500k.json/part-00000-tid-309888144738233288-fab86c62-ff9f-4176-98c6-587f95ee9066-2365-1-c000.json?o=8507771068099703
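
The URL pattern above can be assembled with a small helper. This is a sketch under the assumption that files stored under `/FileStore` are served under `/files/` on the same host, as the example URL suggests; `build_download_url` is a hypothetical helper and the workspace ID used here is a placeholder.

```python
from urllib.parse import quote

BASE_URL = "https://community.cloud.databricks.com"

def build_download_url(dbfs_path, workspace_id):
    """Map a path under /FileStore to a browser download URL (assumed pattern)."""
    prefix = "/FileStore/"
    # Accept both 'dbfs:/FileStore/...' and '/FileStore/...'.
    path = dbfs_path[len("dbfs:"):] if dbfs_path.startswith("dbfs:") else dbfs_path
    if not path.startswith(prefix):
        raise ValueError("only files under /FileStore can be mapped this way")
    relative = quote(path[len(prefix):])  # '/' is left unescaped by default
    return f"{BASE_URL}/files/{relative}?o={workspace_id}"

# Placeholder workspace ID, for illustration only.
url = build_download_url("dbfs:/FileStore/events-500k.json", "0000000000000000")
print(url)
```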

## Lab
------------------

- Read the flight-data.json file in two ways:
  - Inferring the schema
  - Defining a read schema

In [2]:
# Only needed when running locally
import findspark

findspark.init()
findspark.find()

'E:\\LibreriasPython\\spark-3.1.2-bin-hadoop2.7\\python\\pyspark'

In [3]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ManagementJSON').getOrCreate()

In [9]:
flightJsonPath = "../data/2015-summary.json"

flightDF = (spark.read
  .option("inferSchema", True)
  .json(flightJsonPath))

flightDF.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



In [10]:
flightDF.show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
+-----------------+-------------------+-----+
only showing top 5 rows
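
By default `spark.read.json` expects line-delimited JSON, that is, one JSON object per line. The sketch below parses such lines with the standard library, using values that match the rows displayed above; the exact text of the lines in the file is an assumption.

```python
import json

# Assumed file layout: one JSON object per line (values match the rows shown above).
json_lines = """\
{"ORIGIN_COUNTRY_NAME": "Romania", "DEST_COUNTRY_NAME": "United States", "count": 15}
{"ORIGIN_COUNTRY_NAME": "Croatia", "DEST_COUNTRY_NAME": "United States", "count": 1}
{"ORIGIN_COUNTRY_NAME": "Ireland", "DEST_COUNTRY_NAME": "United States", "count": 344}
"""

# Parse each non-empty line independently.
rows = [json.loads(line) for line in json_lines.splitlines() if line.strip()]

total = sum(row["count"] for row in rows)
print(len(rows), total)
```

Because each line is a complete object, a reader can split the file by lines and parse the pieces independently.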



In [12]:
from pyspark.sql.types import ArrayType, DoubleType, IntegerType, LongType, StringType, StructType, StructField

userDefinedSchema = StructType([
  StructField("DEST_COUNTRY_NAME", StringType(), True),
  StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
  StructField("count", IntegerType(), True)
])

flightDF = (spark.read
  .schema(userDefinedSchema)
  .json(flightJsonPath))

flightDF.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)



In [13]:
flightDF.show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
+-----------------+-------------------+-----+
only showing top 5 rows
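
The user-defined schema declares `count` as `IntegerType` (32-bit signed), where inference chose `long` (64-bit signed). A narrower type is only safe if every value fits in its range. The sketch below checks the five counts displayed above; `fits_int32` is a hypothetical helper, and the full file may contain larger values than this sample.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

def fits_int32(value):
    """Return True if value is representable as a 32-bit signed integer."""
    return INT32_MIN <= value <= INT32_MAX

# Counts from the five rows displayed above.
sample_counts = [15, 1, 344, 15, 62]

all_fit = all(fits_int32(c) for c in sample_counts)
print(all_fit)
```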

