# Manipulating databases directly

**Managed tables**: Spark manages both metadata and data.

**Unmanaged tables**: Spark manages only the metadata, while we manage the data ourselves in an external data source.

In [75]:
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import *

In [76]:
spark = (SparkSession
        .builder
        .appName("Spark SQL Example 2")
        .getOrCreate())

23/11/16 12:53:33 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [77]:
csv_file = "flights/departuredelays.csv"

In [78]:
spark.sql("CREATE DATABASE IF NOT EXISTS test")
spark.sql("USE test")

DataFrame[]

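By default, the data warehouse is created in a `spark-warehouse` directory under the current folder. To use a different location, set `spark.sql.warehouse.dir` on the session builder before the session is first created. It is a static setting and cannot be changed on a running session:

```SparkSession.builder.config("spark.sql.warehouse.dir", "/path/to/warehouse")```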
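
As a minimal sketch of choosing the warehouse location at session start-up, the helper below passes an absolute path to the builder. The function names `resolve_warehouse_dir` and `build_session` are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def resolve_warehouse_dir(path):
    """Return an absolute path so the warehouse does not depend on the working directory."""
    return str(Path(path).expanduser().resolve())


def build_session(app_name, warehouse_dir):
    """Create a session whose warehouse lives in warehouse_dir (hypothetical helper)."""
    # Imported here so resolve_warehouse_dir stays usable without Spark installed.
    from pyspark.sql import SparkSession

    return (SparkSession
            .builder
            .appName(app_name)
            .config("spark.sql.warehouse.dir", resolve_warehouse_dir(warehouse_dir))
            .getOrCreate())
```

If a session is already running, `getOrCreate()` returns it and static settings such as the warehouse directory are ignored, which is what the `Using an existing Spark session` warning above refers to.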

In [79]:
spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "false")

In [80]:
spark.sql("""CREATE TABLE managed_us_delay_flights_tbl (date STRING, delay INT, distance INT,
origin STRING, destination STRING)""")

DataFrame[]
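
The same column list is needed both for the table definition and for reading the CSV file. One possible way to keep the two in sync, sketched here with a hypothetical constant `FLIGHTS_SCHEMA` and helper `create_table_sql`, is to build the statement from a single DDL-style schema string.

```python
# Single definition of the columns (the same columns used in this notebook).
FLIGHTS_SCHEMA = "date STRING, delay INT, distance INT, origin STRING, destination STRING"


def create_table_sql(table_name, schema):
    """Build a CREATE TABLE statement from a DDL-style schema string."""
    return f"CREATE TABLE IF NOT EXISTS {table_name} ({schema})"


# With an active session, the same string can define the table and the reader:
#   spark.sql(create_table_sql("managed_us_delay_flights_tbl", FLIGHTS_SCHEMA))
#   flights_df = spark.read.csv("flights/departuredelays.csv", schema=FLIGHTS_SCHEMA)
```

`IF NOT EXISTS` is added here so that the cell can be re-run without failing when the table is already present.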

In [95]:
schema = "date STRING, delay INT,distance INT,origin STRING, destination STRING"


In [82]:
flights_df = spark.read.csv(csv_file, schema = schema)

In [89]:
flights_df.write.saveAsTable("df_us_delay_flights_tbl")

                                                                                

We can also create unmanaged tables from a data source such as CSV, Parquet, or JSON.

In [90]:
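import os
# Use an absolute path: a relative PATH is resolved against the database location (test.db).
csv_path = os.path.abspath(csv_file)
spark.sql(f"""CREATE TABLE local_us_delay_flights_tbl (date STRING, delay INT, distance INT,
origin STRING, destination STRING)
USING csv OPTIONS (PATH '{csv_path}')""")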

DataFrame[]
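
Dropping a managed table deletes its data files, whereas dropping an unmanaged table removes only the metadata. To see which kind each table is, the catalog entries can be grouped by their `tableType` field. This is a rough sketch: `tables_by_type` is a hypothetical helper, and it assumes the table created with an explicit `PATH` is reported as `EXTERNAL`.

```python
from collections import defaultdict


def tables_by_type(tables):
    """Group table names by their tableType (for example 'MANAGED' or 'EXTERNAL')."""
    grouped = defaultdict(list)
    for table in tables:
        grouped[table.tableType].append(table.name)
    return dict(grouped)


# With an active session and the `test` database selected:
#   tables_by_type(spark.catalog.listTables())
```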

In [85]:
spark.catalog.listDatabases()

[Database(name='default', catalog='spark_catalog', description='default database', locationUri='file:/home/amit/Documents/CS535-resources/examples/spark/spark-sql-python/spark-warehouse'),
 Database(name='test', catalog='spark_catalog', description='', locationUri='file:/home/amit/Documents/CS535-resources/examples/spark/spark-sql-python/spark-warehouse/test.db')]

In [94]:
spark.catalog.listTables()

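AnalysisException: [SCHEMA_NOT_FOUND] The schema `test` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a catalog, verify the current_schema() output, or qualify the name with the correct catalog.
To tolerate the error on drop use DROP SCHEMA IF EXISTS.

This error comes from running the cells out of order: `In [94]` was executed after `In [93]` had already dropped the `test` database, which was still the session's current database. Run before the `DROP` statements, `listTables()` returns the tables in `test`.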
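
One way to avoid a `SCHEMA_NOT_FOUND` error when a database may already have been dropped is to check the catalog before listing its tables. The helpers `database_exists` and `list_tables_if_present` below are hypothetical names used only for this sketch.

```python
def database_exists(databases, name):
    """Return True if any catalog entry in `databases` has the given name."""
    return any(db.name == name for db in databases)


def list_tables_if_present(spark, db_name):
    """Return the tables in db_name, or an empty list if the database is gone."""
    if not database_exists(spark.catalog.listDatabases(), db_name):
        return []
    return spark.catalog.listTables(db_name)


# With an active session:
#   list_tables_if_present(spark, "test")
```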

In [92]:
spark.sql("DROP TABLE IF EXISTS local_us_delay_flights_tbl")
spark.sql("DROP TABLE IF EXISTS managed_us_delay_flights_tbl")
spark.sql("DROP TABLE IF EXISTS df_us_delay_flights_tbl")


DataFrame[]

In [93]:
spark.sql("DROP DATABASE test")

DataFrame[]
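
`DROP DATABASE` refuses to remove a database that still contains tables unless `CASCADE` is given, which is why the tables are dropped first above. As a small sketch, a hypothetical helper `drop_database_sql` can build either form of the statement:

```python
def drop_database_sql(name, cascade=False):
    """Build a DROP DATABASE statement; CASCADE also drops the tables it contains."""
    suffix = " CASCADE" if cascade else ""
    return f"DROP DATABASE IF EXISTS {name}{suffix}"


# With an active session:
#   spark.sql(drop_database_sql("test", cascade=True))
```

Adding `IF EXISTS` makes the cleanup cell safe to re-run after the database is already gone.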

In [71]:
spark.stop()