# Spark Tables

This notebook shows how to use the Spark Catalog API to query metadata about databases, tables, and columns.

A full list of documented methods is available in the [PySpark `Catalog` API reference](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Catalog).

In [1]:
us_flights_file = "./data/flights/departuredelays.csv"

In [18]:
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
spark = (
    SparkSession
    .builder
    .appName("04_chap")
    .config("spark.sql.catalogImplementation", "hive")
    .getOrCreate()
    )
sc = spark.sparkContext

25/05/09 00:09:23 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


### Create Managed Tables

In [19]:
# Create database and managed tables

spark.sql("DROP DATABASE IF EXISTS learn_spark_db CASCADE") 
spark.sql("CREATE DATABASE learn_spark_db")
spark.sql("USE learn_spark_db")
spark.sql("CREATE TABLE us_delay_flights_tbl(date STRING, delay INT, distance INT, origin STRING, destination STRING)")

25/05/09 00:09:25 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
25/05/09 00:09:25 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
25/05/09 00:09:28 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
25/05/09 00:09:28 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore khodosevichleo@198.18.1.200
25/05/09 00:09:28 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
25/05/09 00:09:29 WARN ObjectStore: Failed to get database learn_spark_db, returning NoSuchObjectException

DataFrame[]

### Display the databases

In [20]:
display(spark.catalog.listDatabases())

                                                                                

[Database(name='default', catalog='spark_catalog', description='Default Hive database', locationUri='file:/Users/khodosevichleo/Desktop/fun/Spark/Learning-Spark-book/04_chap/spark-warehouse'),
 Database(name='learn_spark_db', catalog='spark_catalog', description='', locationUri='file:/Users/khodosevichleo/Desktop/fun/Spark/Learning-Spark-book/04_chap/spark-warehouse/learn_spark_db.db')]
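
The `locationUri` values above show where each database keeps its files. As a rough sketch, and assuming the default layout in which a managed table created without an explicit location is stored in a subdirectory named after the table, a database URI can be turned into the directory where the table's files would be expected. The helper name `table_dir` is hypothetical and not part of the Catalog API.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def table_dir(location_uri, table_name):
    """Return the expected directory of a managed table under its database.

    Assumption: the table was created without an explicit location, so it
    sits directly below the database directory.
    """
    db_path = PurePosixPath(urlparse(location_uri).path)
    return str(db_path / table_name)


# locationUri copied from the listDatabases() output above
location_uri = (
    "file:/Users/khodosevichleo/Desktop/fun/Spark/Learning-Spark-book/"
    "04_chap/spark-warehouse/learn_spark_db.db"
)
print(table_dir(location_uri, "us_delay_flights_tbl"))
```

With a live session, the URI could be read from the records returned by `spark.catalog.listDatabases()` instead of being copied by hand.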

### Read our US Flights table

In [22]:
df = (spark.read.format("csv")
      .schema("date STRING, delay INT, distance INT, origin STRING, destination STRING")
      .option("header", "true")
      .load(us_flights_file))  # path defined in the first cell

### Save into our table

In [23]:
df.write.mode("overwrite").saveAsTable("us_delay_flights_tbl")

25/05/09 00:10:10 WARN MemoryManager: Total allocation exceeds 95,00% (983 197 274 bytes) of heap memory
Scaling row group sizes to 91,57% for 8 writers
                                                                                

### Cache the Table

In [24]:
# %sql
spark.sql("CACHE TABLE us_delay_flights_tbl")

                                                                                

DataFrame[]

Check whether the table is cached:

In [25]:
spark.catalog.isCached("us_delay_flights_tbl")

True

### Display tables within a Database

Note that the table is MANAGED by Spark.

In [26]:
spark.catalog.listTables(dbName="learn_spark_db")

[Table(name='us_delay_flights_tbl', catalog='spark_catalog', namespace=['learn_spark_db'], description=None, tableType='MANAGED', isTemporary=False)]

### Display Columns for a table

In [27]:
spark.catalog.listColumns("us_delay_flights_tbl")

[Column(name='date', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),
 Column(name='delay', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='distance', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='origin', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),
 Column(name='destination', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False)]
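
The column records above carry the same names and types as the schema string passed to the CSV reader earlier. As an illustrative sketch, the records can be joined back into such a string. The `Column` namedtuple below is a stand-in that mirrors the fields printed above so the example runs without a Spark session, and `columns_to_ddl` is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-in mirroring the fields of the records printed above (an assumption
# for illustration); with a live session, use
# spark.catalog.listColumns("us_delay_flights_tbl") instead.
Column = namedtuple(
    "Column",
    ["name", "description", "dataType", "nullable", "isPartition", "isBucket"],
)


def columns_to_ddl(columns):
    """Join catalog column records into a 'name TYPE, ...' schema string."""
    return ", ".join(f"{c.name} {c.dataType.upper()}" for c in columns)


# Records copied from the listColumns output above
columns = [
    Column("date", None, "string", True, False, False),
    Column("delay", None, "int", True, False, False),
    Column("distance", None, "int", True, False, False),
    Column("origin", None, "string", True, False, False),
    Column("destination", None, "string", True, False, False),
]

ddl = columns_to_ddl(columns)
print(ddl)
```

The helper only reads the `name` and `dataType` attributes, so it should accept the list returned by `listColumns` directly.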

### Create Unmanaged Tables

In [33]:
# Drop the database, recreate it, and create an unmanaged (external) table.
# NOTE: this path exists only on Databricks; for a local run, point it at a
# local copy of the file, otherwise Spark warns that the directory was not found.
spark.sql("DROP DATABASE IF EXISTS learn_spark_db CASCADE")
spark.sql("CREATE DATABASE learn_spark_db")
spark.sql("USE learn_spark_db")
spark.sql("""CREATE TABLE us_delay_flights_tbl (date STRING, delay INT, distance INT, origin STRING, destination STRING)
    USING csv OPTIONS (path '/databricks-datasets/learning-spark-v2/flights/departuredelays.csv')""")

25/05/09 00:13:24 WARN ObjectStore: Failed to get database learn_spark_db, returning NoSuchObjectException
25/05/09 00:13:24 WARN HadoopFSUtils: The directory file:/databricks-datasets/learning-spark-v2/flights/departuredelays.csv was not found. Was it deleted very recently?
25/05/09 00:13:24 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider csv. Persisting data source table `spark_catalog`.`learn_spark_db`.`us_delay_flights_tbl` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.


DataFrame[]

### Display Tables

**Note**: Here `tableType='EXTERNAL'`, which indicates that the table is unmanaged by Spark, whereas in the earlier listing it was `tableType='MANAGED'`.

In [34]:
spark.catalog.listTables(dbName="learn_spark_db")

[Table(name='us_delay_flights_tbl', catalog='spark_catalog', namespace=['learn_spark_db'], description=None, tableType='EXTERNAL', isTemporary=False)]
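
To make the MANAGED versus EXTERNAL distinction easier to scan when a database holds many tables, the listing can be grouped by `tableType`. This is a minimal sketch: the `Table` namedtuple is a stand-in that mirrors the fields printed above so the example runs without a Spark session, and `names_by_type` is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-in mirroring the fields of the records printed above (an assumption
# for illustration); with a live session, use
# spark.catalog.listTables(dbName="learn_spark_db") instead.
Table = namedtuple(
    "Table",
    ["name", "catalog", "namespace", "description", "tableType", "isTemporary"],
)


def names_by_type(tables):
    """Group table names by their tableType value."""
    grouped = {}
    for t in tables:
        grouped.setdefault(t.tableType, []).append(t.name)
    return grouped


# Records copied from the two listTables outputs in this notebook
managed_listing = [
    Table("us_delay_flights_tbl", "spark_catalog", ["learn_spark_db"],
          None, "MANAGED", False),
]
external_listing = [
    Table("us_delay_flights_tbl", "spark_catalog", ["learn_spark_db"],
          None, "EXTERNAL", False),
]

print(names_by_type(managed_listing))
print(names_by_type(external_listing))
```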

### Display Columns for a table

In [35]:
spark.catalog.listColumns("us_delay_flights_tbl")

[Column(name='date', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),
 Column(name='delay', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='distance', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='origin', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),
 Column(name='destination', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False)]

In [36]:
spark.sql("DROP TABLE us_delay_flights_tbl")

DataFrame[]

In [38]:
spark.sql("DROP DATABASE IF EXISTS learn_spark_db CASCADE")

25/05/09 00:14:04 WARN TxnHandler: Cannot perform cleanup since metastore table does not exist


DataFrame[]
