# Spark Tables

This notebook shows how to use the Spark Catalog Interface API to query databases, tables, and columns.

A full list of documented methods is available in the [PySpark Catalog API documentation](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Catalog).

In [2]:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = (SparkSession
    .builder
    .appName("Example-3_6")
    .getOrCreate())

In [3]:
us_flights_file = "../../databricks-datasets/learning-spark-v2/flights/departuredelays.csv"

## Create Managed Tables

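Note: in a session without Hive support, the `CREATE TABLE` statement in the next cell fails with the `AnalysisException` shown in its output. See this Stack Overflow discussion for background: https://stackoverflow.com/questions/50914102/why-do-i-get-a-hive-support-is-required-to-create-hive-table-as-select-error/54552891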

In [4]:
# Create database and managed tables
spark.sql("DROP DATABASE IF EXISTS learn_spark_db CASCADE") 
spark.sql("CREATE DATABASE learn_spark_db")
spark.sql("USE learn_spark_db")
spark.sql("CREATE TABLE us_delay_flights_tbl(date STRING, delay INT, distance INT, origin STRING, destination STRING)")

AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);
'CreateTable `learn_spark_db`.`us_delay_flights_tbl`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, ErrorIfExists
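
The error above says that the statement was treated as a request for a Hive table. A minimal sketch of one possible workaround, assuming the session was built without Hive support: declare a data source explicitly with a `USING` clause so that Spark creates a managed data source table instead. The helper name `create_managed_table` and the choice of `parquet` are assumptions made for illustration and do not come from the notebook.

```python
def create_managed_table(spark, db="learn_spark_db", table="us_delay_flights_tbl"):
    """Create a managed data source table.

    `spark` is an existing SparkSession, such as the one built at the top
    of this notebook. The function name and the parquet format are
    illustrative assumptions.
    """
    schema = "date STRING, delay INT, distance INT, origin STRING, destination STRING"
    statements = [
        f"CREATE DATABASE IF NOT EXISTS {db}",
        f"USE {db}",
        # The USING clause selects a data source table rather than a Hive table.
        f"CREATE TABLE IF NOT EXISTS {table} ({schema}) USING parquet",
    ]
    for statement in statements:
        spark.sql(statement)
    return statements
```

In this notebook it could be called as `create_managed_table(spark)`. Another option is to call `.enableHiveSupport()` on the `SparkSession` builder, which only works when the Hive classes are available to Spark.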


## Display the databases

In [5]:
display(spark.catalog.listDatabases())

[Database(name='default', description='default database', locationUri='file:/media/jose/Repositorio/Git_Hub/Big-Data-Procesing/Learning_Spark/Segunda_Edicion/notebooks/chapter4/spark-warehouse'),
 Database(name='learn_spark_db', description='', locationUri='file:/media/jose/Repositorio/Git_Hub/Big-Data-Procesing/Learning_Spark/Segunda_Edicion/notebooks/chapter4/spark-warehouse/learn_spark_db.db')]

## Read our US Flights table

In [6]:
df = (spark.read.format("csv")
      .schema("date STRING, delay INT, distance INT, origin STRING, destination STRING")
      .option("header", "true")
      .option("path", "departuredelays.csv")
      .load())

## Save into our table

In [7]:
df.write.mode("overwrite").saveAsTable("us_delay_flights_tbl")

## Cache the Table

In [8]:
%sql
CACHE TABLE us_delay_flights_tbl

SyntaxError: invalid syntax (<ipython-input-8-b0de4460edf3>, line 2)

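Check whether the table is cached. Because the `%sql` cell above failed with a syntax error, the table was never cached, so the check returns `False`.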

In [9]:
spark.catalog.isCached("us_delay_flights_tbl")

False
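
The `CACHE TABLE` statement can also be sent through `spark.sql`, which does not depend on a `%sql` cell magic being available. The following is a minimal sketch under that assumption; the helper name `cache_and_check` is hypothetical, and `spark` is expected to be an existing SparkSession.

```python
def cache_and_check(spark, table="us_delay_flights_tbl"):
    """Cache `table` through plain SQL and report what the catalog says.

    `spark` is an existing SparkSession; the helper name is an
    illustrative assumption.
    """
    # Issue the same statement the %sql cell attempted, via the Python API.
    spark.sql(f"CACHE TABLE {table}")
    # Ask the catalog whether the table is now registered as cached.
    return spark.catalog.isCached(table)
```

In this notebook it could be called as `cache_and_check(spark)`. `spark.catalog.cacheTable("us_delay_flights_tbl")` is an alternative that goes through the Catalog API directly.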

## Display tables within a Database

Note that the table is MANAGED by Spark.

In [None]:
spark.catalog.listTables(dbName="learn_spark_db")

## Display Columns for a table

In [10]:
spark.catalog.listColumns("us_delay_flights_tbl")

[Column(name='date', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),
 Column(name='delay', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='distance', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='origin', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),
 Column(name='destination', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False)]

## Create Unmanaged Tables

In [11]:
# Drop the database and create unmanaged tables
spark.sql("DROP DATABASE IF EXISTS learn_spark_db CASCADE")
spark.sql("CREATE DATABASE learn_spark_db")
spark.sql("USE learn_spark_db")
spark.sql("""
    CREATE TABLE us_delay_flights_tbl (
        date STRING, delay INT, distance INT, origin STRING, destination STRING)
    USING csv
    OPTIONS (path '/databricks-datasets/learning-spark-v2/flights/departuredelays.csv')
""")

DataFrame[]

## Display Tables
Note: Here tableType='EXTERNAL', which indicates that the table is unmanaged by Spark, whereas the table created earlier with saveAsTable is MANAGED.

In [12]:
spark.catalog.listTables(dbName="learn_spark_db")

[Table(name='us_delay_flights_tbl', database='learn_spark_db', description=None, tableType='EXTERNAL', isTemporary=False)]
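
Since `listTables` returns records with a `tableType` field, managed and external tables can be told apart programmatically. The sketch below groups table names by type; the helper name `tables_by_type` is hypothetical, and the `Table` namedtuple is only a stand-in that mirrors the fields printed above so the example runs without a Spark session.

```python
from collections import namedtuple

# Stand-in for the records returned by spark.catalog.listTables(),
# with the same fields as in the output above (an assumption for illustration).
Table = namedtuple("Table", "name database description tableType isTemporary")

def tables_by_type(tables):
    """Group table names by their tableType (e.g. MANAGED or EXTERNAL)."""
    grouped = {}
    for table in tables:
        grouped.setdefault(table.tableType, []).append(table.name)
    return grouped

sample_tables = [
    Table("us_delay_flights_tbl", "learn_spark_db", None, "EXTERNAL", False),
]
print(tables_by_type(sample_tables))
```

With a live session, the same function could be applied to `spark.catalog.listTables(dbName="learn_spark_db")`.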

## Display Columns for a table

In [None]:
spark.catalog.listColumns("us_delay_flights_tbl")
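
`listColumns` returns a list of `Column` records like those printed earlier in this notebook. As a rough sketch, they can be reduced to a name-to-type mapping for quick inspection; the helper name `column_types` is hypothetical, and the `Column` namedtuple is a stand-in with the same fields as the printed records so the example runs without a Spark session.

```python
from collections import namedtuple

# Stand-in for the records returned by spark.catalog.listColumns(),
# using the fields visible in this notebook's output (an assumption for illustration).
Column = namedtuple(
    "Column", "name description dataType nullable isPartition isBucket"
)

def column_types(columns):
    """Map each column name to its data type string."""
    return {column.name: column.dataType for column in columns}

sample_columns = [
    Column("date", None, "string", True, False, False),
    Column("delay", None, "int", True, False, False),
    Column("distance", None, "int", True, False, False),
]
print(column_types(sample_columns))
```

With a live session, the same function could be applied to `spark.catalog.listColumns("us_delay_flights_tbl")`.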