# Manipulating databases directly

**Managed tables**: Spark manages both metadata and data.

**Unmanaged tables**: Spark manages only the metadata, while we manage the data ourselves in an external data source.

In [56]:
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import *

In [57]:
spark = (SparkSession
        .builder
        .appName("Spark SQL Example 1")
        .getOrCreate())

23/11/16 10:50:00 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [58]:
csv_file = "flights/departuredelays.csv"

In [59]:
spark.sql("CREATE DATABASE IF NOT EXISTS test")
spark.sql("USE test")

DataFrame[]

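By default, the data warehouse is created in a `spark-warehouse` folder under the current directory. To use a different location, set the `spark.sql.warehouse.dir` option on the session builder. It is a static setting, so it has to be supplied before the session is created and cannot be changed with `spark.conf.set` on a running session:

```.config("spark.sql.warehouse.dir", "/path/to/warehouse")```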

In [60]:
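# Without a USING clause, create tables with the default data source (Parquet)
# instead of as Hive tables.
spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "false")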

In [61]:
spark.sql("CREATE TABLE managed_us_delay_flights_tbl (date STRING, delay INT,distance INT,\
origin STRING, destination STRING)")

DataFrame[]

In [62]:
schema = "date STRING, delay INT, distance INT, origin STRING, destination STRING"

In [63]:
flights_df = spark.read.csv(csv_file, schema = schema)

In [64]:
flights_df.write.saveAsTable("df_us_delay_flights_tbl")

                                                                                

We can also create unmanaged tables from a data source such as CSV, Parquet, or JSON.

In [65]:
spark.sql("""CREATE TABLE local_us_delay_flights_tbl (date STRING, delay INT,distance INT,
origin STRING, destination STRING)
USING csv OPTIONS (PATH 'flights/departuredelays.csv')""")

DataFrame[]

In [66]:
spark.catalog.listDatabases()

[Database(name='default', catalog='spark_catalog', description='default database', locationUri='file:/home/amit/Documents/CS535-resources/examples/spark/spark-sql-python/spark-warehouse'),
 Database(name='test', catalog='spark_catalog', description='', locationUri='file:/home/amit/Documents/CS535-resources/examples/spark/spark-sql-python/spark-warehouse/test.db')]

In [67]:
spark.catalog.listTables()

[Table(name='df_us_delay_flights_tbl', catalog='spark_catalog', namespace=['test'], description=None, tableType='MANAGED', isTemporary=False),
 Table(name='local_us_delay_flights_tbl', catalog='spark_catalog', namespace=['test'], description=None, tableType='EXTERNAL', isTemporary=False),
 Table(name='managed_us_delay_flights_tbl', catalog='spark_catalog', namespace=['test'], description=None, tableType='MANAGED', isTemporary=False)]

In [69]:
spark.sql("DROP TABLE IF EXISTS local_us_delay_flights_tbl")
spark.sql("DROP TABLE IF EXISTS managed_us_delay_flights_tbl")
spark.sql("DROP TABLE IF EXISTS df_us_delay_flights_tbl")

DataFrame[]

In [70]:
spark.sql("DROP DATABASE test")

DataFrame[]

In [71]:
spark.stop()