# Working in the Default Schema

## Creating managed tables
First, we create a managed table named `managed_default` and populate it with data:

In [0]:
USE CATALOG hive_metastore;

CREATE TABLE managed_default 
(
  country STRING, code STRING, dial_code STRING
);

INSERT INTO managed_default
VALUES (
  'France', 'Fr', '+33'
)

### Describing the table
Because we did not specify the `LOCATION` keyword, this table is a _managed_ table in the default database.
Running the `DESCRIBE EXTENDED` command on the table returns detailed metadata. We focus on three key elements:
- The **type** of table, which is indeed `MANAGED`
- The **location**, which shows that the table resides in the default Hive warehouse directory, under `dbfs:/user/hive/warehouse`
- The **provider**, which confirms that this is a Delta Lake table

In [0]:
DESCRIBE EXTENDED managed_default

## Creating external tables
We now create an external table within the default database. To achieve this, we simply add the `LOCATION` keyword followed by the desired storage path.

In [0]:
CREATE TABLE external_default
(
  country STRING, code STRING, dial_code STRING
)
LOCATION 'dbfs:/mnt/demo/external_default';

INSERT INTO external_default
VALUES (
  'France', 'Fr', '+33')

### Describing the table
Running `DESCRIBE EXTENDED` on the external table confirms its external nature and its storage location.

In [0]:
DESCRIBE EXTENDED external_default 

## Dropping tables
If you want to remove tables from the database, you simply drop them using the `DROP TABLE` command. However, it is important to note that the behaviour differs for managed and external tables.

### Dropping managed tables

In [0]:
DROP TABLE managed_default

Dropping a managed table deletes its metadata from the metastore. This means that the table's definition, including its schema, column names, data types, and other relevant information, is no longer stored in the metastore. We can confirm this by trying to query the table, which results in a "TABLE_OR_VIEW_NOT_FOUND" error.

In [0]:
SELECT * FROM managed_default

Dropping a managed table not only removes its metadata from the metastore, but also deletes all associated data files from storage. This is confirmed by a "FileNotFoundException" when we list the table directory.

In [0]:
%fs ls 'dbfs:/user/hive/warehouse/managed_default'

### Dropping external tables
Dropping the external table with `DROP TABLE external_default` also removes its entry from the metastore. We can confirm this by trying to query the table, which should result in a "TABLE_OR_VIEW_NOT_FOUND" error. However, since the underlying data is stored outside the database directory, the data files should remain intact. We can confirm that the data files of the table still persist by checking the table directory:

In [0]:
%fs ls 'dbfs:/mnt/demo/external_default'

In Databricks, you can directly access a Delta table by querying its directory using the following `SELECT` statement:

In [0]:
SELECT * FROM DELTA.`dbfs:/mnt/demo/external_default`

You can manually remove the table directory and its content by running the `dbutils.fs.rm` function in Python:

In [0]:
%python
dbutils.fs.rm('dbfs:/mnt/demo/external_default', True)
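
The two drop behaviours seen in this section can be summarised with a small, self-contained Python model. This is only a toy sketch of the rule, not Databricks code: the `metastore` dictionary, the `create_table` and `drop_table` helpers, and the temporary directories standing in for `dbfs:/user/hive/warehouse` and `dbfs:/mnt/demo` are all hypothetical names introduced for illustration.

```python
import shutil
import tempfile
from pathlib import Path

# Temporary local folders standing in for the warehouse and the external mount (assumption).
warehouse = Path(tempfile.mkdtemp())
external_root = Path(tempfile.mkdtemp())

# Toy metastore: table name -> table type and data location.
metastore = {}


def create_table(name, location=None):
    """Register a table; without a location it is managed and lives in the warehouse."""
    table_type = "EXTERNAL" if location else "MANAGED"
    path = Path(location) if location else warehouse / name
    path.mkdir(parents=True, exist_ok=True)
    # Placeholder data file (plain text, not a real Parquet file).
    (path / "data-file.txt").write_text("France,Fr,+33")
    metastore[name] = {"type": table_type, "location": path}


def drop_table(name):
    """Always remove the metastore entry; delete the files only for managed tables."""
    entry = metastore.pop(name)
    if entry["type"] == "MANAGED":
        shutil.rmtree(entry["location"])


create_table("managed_default")
create_table("external_default", location=external_root / "external_default")

drop_table("managed_default")
drop_table("external_default")

# Check what is left on disk after both tables were dropped.
managed_files_exist = (warehouse / "managed_default").exists()
external_files_exist = (external_root / "external_default").exists()
print("managed files remain:", managed_files_exist)
print("external files remain:", external_files_exist)

# Clean up the temporary folders used by this sketch.
shutil.rmtree(warehouse, ignore_errors=True)
shutil.rmtree(external_root, ignore_errors=True)
```

In this model, both tables disappear from the metastore, but only the managed table's directory is deleted, which mirrors why the external directory had to be removed manually with `dbutils.fs.rm`.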

# Working in a New Schema

## Creating a new database
You can create a new database using either the `CREATE DATABASE` or `CREATE SCHEMA` syntax. 

In [0]:
USE CATALOG hive_metastore;
CREATE SCHEMA new_default

Once created, you can inspect the database's metadata using the `DESCRIBE DATABASE EXTENDED` command. This command provides information about the database, such as its location in the underlying storage.

In [0]:
DESCRIBE DATABASE EXTENDED new_default

The new database is stored under the default Hive directory with a `.db` extension to distinguish it from other table folders in the directory.

### Creating tables in the new database
To create tables within a database, you first need to set it as the current schema by specifying its name in a `USE DATABASE` statement.

In [0]:
USE DATABASE new_default;
/* Create a managed table*/

CREATE TABLE managed_new_default 
(
  country STRING, code STRING, dial_code STRING
);

INSERT INTO managed_new_default 
VALUES (
  'France', 'Fr', '+33'
);

/* Create an external table */

CREATE TABLE external_new_default
(
  country STRING, code STRING, dial_code STRING
)
LOCATION 'dbfs:/mnt/demo/external_new_default';

INSERT INTO external_new_default
VALUES (
  'France', 'Fr', '+33');

By running `DESCRIBE EXTENDED` on each of these tables, we can see that the first table is indeed a managed table created in its database folder under the default Hive directory:

In [0]:
DESCRIBE EXTENDED managed_new_default

The second table, where we used the `LOCATION` keyword, has been defined as an external table under the `/mnt/demo/` location:

In [0]:
DESCRIBE EXTENDED external_new_default

## Dropping tables

In [0]:
DROP TABLE managed_new_default;
DROP TABLE external_new_default;

Dropping the tables removes their entries from the Hive metastore. Moreover, dropping the managed table removes its directory and associated data files from storage:

In [0]:
%fs ls 'dbfs:/user/hive/warehouse/new_default.db/managed_new_default'

However, as expected, in the case of the external table, although the table itself is dropped from the database, the directory and its data files persist in the specified external location:

In [0]:
%fs ls 'dbfs:/mnt/demo/external_new_default'

# Working in a Custom-Location Schema
In the last scenario, we will create a database in a custom location outside of the default Hive directory.

## Creating the database
To achieve this, we begin by using the `CREATE SCHEMA` statement and we add the `LOCATION` keyword followed by the desired storage path:

In [0]:
CREATE SCHEMA custom
LOCATION 'dbfs:/Shared/schemas/custom.db'

Using the `DESCRIBE DATABASE EXTENDED` command, we can confirm that the database is located in the custom location we specified:

In [0]:
DESCRIBE DATABASE EXTENDED custom

## Creating tables
We proceed to use this database to create both managed and external tables.

In [0]:
USE DATABASE custom;
/* Create a managed table */

CREATE TABLE managed_custom (
  country STRING, code STRING, dial_code STRING
);

INSERT INTO managed_custom
VALUES (
  'France', 'Fr', '+33'
);

/* Create an external table */

CREATE TABLE external_custom
(
  country STRING, code STRING, dial_code STRING
)
LOCATION 'dbfs:/mnt/demo/external_custom';

INSERT INTO external_custom
VALUES (
  'France', 'Fr', '+33');

We can confirm that the `managed_custom` table is indeed a managed table, since it was created in its database folder located in the custom location.

In [0]:
DESCRIBE EXTENDED managed_custom;

The `external_custom` table is an external table because its location was specified during table creation.

In [0]:
DESCRIBE EXTENDED external_custom;

## Dropping tables

In [0]:
DROP TABLE managed_custom;
DROP TABLE external_custom;

In [0]:
%python
dbutils.fs.rm('dbfs:/mnt/demo/external_custom', True)

# Setting Up Delta Tables

## Create Table As SELECT
CTAS (Create Table As Select) statements create and populate a table at the same time, based on the results of a `SELECT` query. With CTAS statements, you can therefore create a new table from existing data sources. For example, in the
`CREATE TABLE table_2 AS SELECT * FROM table_1`
statement, we create `table_2` by selecting all data from `table_1`.

CTAS statements automatically **infer schema information** from the query results, eliminating the need for manual schema declaration.

### CTAS statements in Databricks
CTAS statements in Databricks also let you transform data while creating Delta tables. These transformations can include tasks such as
- Renaming columns; or
- Selecting specific columns for inclusion in the target table.

`CREATE TABLE table_2 AS SELECT col_1, col_3 AS new_col_3 FROM table_1`

This CTAS statement creates a new table named `table_2` by selecting columns `col_1` and `col_3` from `table_1`. Additionally, `col_3` is renamed to `new_col_3` in the resulting table.
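
The column selection and renaming described above can be tried locally. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in engine, since it also supports `CREATE TABLE ... AS SELECT`; it is not a Delta table and Databricks behaviour may differ in detail. The column types and sample rows of `table_1` are assumptions made for the example.

```python
import sqlite3

# In-memory SQLite database used only as a local stand-in (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical source table with three columns.
conn.execute("CREATE TABLE table_1 (col_1 INTEGER, col_2 TEXT, col_3 REAL)")
conn.execute("INSERT INTO table_1 VALUES (1, 'a', 1.5), (2, 'b', 2.5)")

# CTAS: no schema is declared for table_2; it is derived from the query.
conn.execute(
    "CREATE TABLE table_2 AS SELECT col_1, col_3 AS new_col_3 FROM table_1"
)

# Inspect the resulting column names and the copied rows.
column_names = [row[1] for row in conn.execute("PRAGMA table_info(table_2)")]
rows = conn.execute("SELECT * FROM table_2 ORDER BY col_1").fetchall()
print(column_names)
print(rows)
conn.close()
```

The new table contains only the selected columns, with `col_3` exposed under its new name, and it is populated as soon as it is created.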

Moreover, a range of options can be added to the `CREATE TABLE` clause to customise table creation, allowing for precise control over table properties and storage configurations.

`CREATE TABLE new_users
  COMMENT "Contains PII"
  PARTITIONED BY (city, birth_date)
  LOCATION 'some/path'
  AS SELECT id, name, email, birth_date, city FROM users`

**Comment**
The COMMENT clause enables you to provide a descriptive comment for the table, helping in the discovery and understanding of its contents.

**Partitioning**
The underlying data of the table can be partitioned into subfolders. The PARTITIONED BY clause partitions the data based on one or more columns. Partitioning can significantly improve the performance of large Delta tables by reducing the amount of data scanned. However, for small to medium-sized tables, the benefits may be negligible or outweighed by drawbacks. One significant drawback is the "small files problem", which arises when partitioning produces numerous files that each contain only a small amount of data.
Many small files can hinder file compaction and make data skipping less efficient.
In general, partitioning should be applied selectively, based on the size and nature of the data.
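
A quick back-of-the-envelope calculation shows how partitioning can lead to small files. The sketch below is plain arithmetic in Python; the table size, the file counts, and the `average_file_size_mb` helper are hypothetical values and names chosen for illustration, not measurements or Databricks defaults.

```python
def average_file_size_mb(table_size_mb: float, file_count: int) -> float:
    """Average size of one data file when a table is split into file_count files."""
    return table_size_mb / file_count


table_size_mb = 1024  # hypothetical 1 GB table

# Unpartitioned: assume the data is written as 8 files.
unpartitioned = average_file_size_mb(table_size_mb, file_count=8)

# Partitioned: assume 4096 distinct partition values with one file each.
partitioned = average_file_size_mb(table_size_mb, file_count=4096)

print(f"unpartitioned: {unpartitioned} MB per file")
print(f"partitioned:   {partitioned} MB per file")
```

Under these assumptions, the same data goes from a handful of reasonably sized files to thousands of tiny ones, which is the situation the small files problem describes.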

**External location**
The `LOCATION` option enables the creation of external tables.

### Comparing CREATE TABLE and CTAS

|                        | **CREATE TABLE statement**                                           | **CTAS statement**                                                                 |
|------------------------|----------------------------------------------------------------------|-------------------------------------------------------------------------------------|
| **Example**            | `CREATE TABLE table_2 (col1 INT, col2 STRING, col3 DOUBLE)`          | `CREATE TABLE table_2 AS SELECT col1, col2, col3 FROM table_1`                     |
| **Schema declaration** | Requires manual schema declaration.                                 | Does not allow manual schema declaration. It automatically infers the table schema. |
| **Populating data**    | Creates an empty table; a data loading statement, such as `INSERT INTO`, is required to populate it. | The table is created with data as specified.                      |

## Table Constraints
After creating a Delta Lake table, either through the CREATE TABLE statement or a CTAS statement, you have the option to enhance its integrity by adding **constraints**. Databricks currently supports two types of constraints:
- **NOT NULL** constraints
- **CHECK** constraints

`ALTER TABLE table_name ADD CONSTRAINT <constraint_name> <constraint_detail>`

When applying constraints to a Delta table, ensure that the existing data in the table adheres to these constraints before defining them; otherwise, the statement will fail. 
Once a constraint is enforced, any new data that violates the constraint will result in a **write failure**.

Suppose you want to ensure that dates in the `date` column fall within a specific range. **CHECK constraints** define conditions that incoming data must satisfy in order to be accepted in the table.

`ALTER TABLE my_table ADD CONSTRAINT valid_date CHECK (date >= '2024-01-01' AND date <= '2024-12-31');`

`valid_date` is the name of our constraint, and the condition ensures that the date column values fall within the specified range for the year 2024. Any attempt to insert or update data with dates outside this range will be rejected. This helps maintain data consistency and integrity within the Delta Lake table.
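
To see a write failure caused by a CHECK constraint without a cluster, the following sketch uses Python's built-in `sqlite3` module as a local stand-in. Two assumptions apply: SQLite only accepts the constraint inside `CREATE TABLE` (it has no `ALTER TABLE ... ADD CONSTRAINT`), and the `id` column and sample rows are made up for the example. The error type raised by Databricks will differ.

```python
import sqlite3

# In-memory SQLite database used only as a local stand-in (assumption).
conn = sqlite3.connect(":memory:")

# SQLite requires the CHECK constraint to be declared at table creation time.
conn.execute(
    "CREATE TABLE my_table ("
    "  id INTEGER,"
    "  date TEXT,"
    "  CONSTRAINT valid_date CHECK (date >= '2024-01-01' AND date <= '2024-12-31')"
    ")"
)

# A date inside the allowed range is accepted.
conn.execute("INSERT INTO my_table VALUES (1, '2024-06-15')")

# A date outside the range violates valid_date and the write is rejected.
try:
    conn.execute("INSERT INTO my_table VALUES (2, '2025-01-01')")
    rejected = False
except sqlite3.IntegrityError as error:
    rejected = True
    print("write failed:", error)

row_count = conn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]
print("rows stored:", row_count)
conn.close()
```

Only the valid row is stored; the violating insert fails as a whole rather than being silently dropped or corrected.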

## Cloning Delta Lake Tables
In Databricks, if you need to back up or duplicate your Delta Lake table, you have two efficient options:
- Deep clone
- Shallow clone

### Deep Clone
A deep clone copies both data and metadata from a source table to a target.

`CREATE TABLE table_clone DEEP CLONE source_table`

Because a deep clone must copy all of the data, the process may take quite a while, especially for large source tables.

### Shallow Clone
A shallow clone creates a new Delta table that **shares** the data files (Parquet) of the source table; only the metadata is copied. It is therefore fast and uses minimal storage. However, if the source table or its files are dropped, the clone may break or become incomplete. Shallow clones are best suited to rapid development, testing, or creating lightweight staging environments.

`CREATE TABLE table_clone SHALLOW CLONE source_table`

# Exploring Views
SQL views are virtual tables created by saving an SQL query that dynamically presents data from one or more underlying tables without storing the data itself. They behave like real tables in terms of rows and columns but do not physically hold data; instead, the data is retrieved fresh each time the view is queried.  

## Key characteristics of SQL views
- **Virtual table**: A view is based on the result of a SELECT statement.
- **Dynamic data**: The data shown by a view is always up-to-date because the database runs the underlying query whenever the view is accessed. 
- **Simplify complex queries**: Views can encapsulate joins, filters, and calculations, making it easier for users to query complex data structures without writing complicated SQL repeatedly.
- **Security**: Views can restrict user access to specific rows or columns, enhancing data security by exposing only necessary information.

To demonstrate how views function within Databricks, we will start by creating a table of data called `cars`. This table contains columns for the ID, model, brand, and release year of the cars.

In [0]:
USE CATALOG hive_metastore;

CREATE TABLE IF NOT EXISTS cars
(id INT, model STRING, brand STRING, year INT);

INSERT INTO cars
VALUES (1, 'Cybertruck', 'Tesla', 2024),
     (2, 'Model S', 'Tesla', 2023),
     (3, 'Model Y', 'Tesla', 2022),
     (4, 'Model X 75D', 'Tesla', 2017),
     (5, 'G-Class G63', 'Mercedes-Benz', 2024),
     (6, 'E-Class E200', 'Mercedes-Benz', 2023),
     (7, 'C-Class C300', 'Mercedes-Benz', 2016),
     (8, 'Everest', 'Ford', 2023),
     (9, 'Puma', 'Ford', 2021),
     (10, 'Focus', 'Ford', 2019)

We can use the SHOW TABLES command to list all tables and views in the default database:

In [0]:
SHOW TABLES  

## View types
There are three types of views available in Databricks:
- Stored views
- Temporary views
- Global temporary views

### Stored views
Stored views are similar to traditional database views. They are objects whose metadata is persisted in the database.

To create a stored view, you use the `CREATE VIEW` statement followed by the `AS` keyword and the logical SQL query defining the view:

`CREATE VIEW view_name AS <query>`

Let's create a stored view that displays only Tesla cars from our `cars` table. We use the CREATE VIEW statement, naming our view `view_tesla_cars`:

In [0]:
CREATE VIEW view_tesla_cars 
AS SELECT * FROM cars WHERE brand = 'Tesla'

Running the `SHOW TABLES` command confirms that the view has been persisted in the default database and it is not a temporary object, as shown in the `isTemporary` column.

In [0]:
SHOW TABLES

Once created, you can query the stored view using a standard SELECT statement, treating it as if it were a table object.

In [0]:
SELECT * FROM view_tesla_cars

It is worth noting that this result is retrieved directly from the `cars` table. Each time the view is queried, its underlying logical query is executed against the source table, in this case the `cars` table.
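
The fact that a view re-runs its query on every access can be checked locally. This minimal sketch uses Python's built-in `sqlite3` module as a stand-in engine rather than Databricks, with a shortened copy of the `cars` data; the extra `Model 3` row is a hypothetical addition used only to show the view picking up new data.

```python
import sqlite3

# In-memory SQLite database used only as a local stand-in (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (id INT, model TEXT, brand TEXT, year INT)")
conn.executemany(
    "INSERT INTO cars VALUES (?, ?, ?, ?)",
    [
        (1, "Cybertruck", "Tesla", 2024),
        (2, "Model S", "Tesla", 2023),
        (8, "Everest", "Ford", 2023),
    ],
)

# The view stores only the query, not the rows.
conn.execute(
    "CREATE VIEW view_tesla_cars AS SELECT * FROM cars WHERE brand = 'Tesla'"
)
count_before = conn.execute("SELECT COUNT(*) FROM view_tesla_cars").fetchone()[0]

# Add a row to the source table (hypothetical row, not part of the original data).
conn.execute("INSERT INTO cars VALUES (11, 'Model 3', 'Tesla', 2024)")

# The same view now returns the new row without being recreated.
count_after = conn.execute("SELECT COUNT(*) FROM view_tesla_cars").fetchone()[0]
print(count_before, count_after)
conn.close()
```

The view definition never changes, yet its result grows with the source table, because the filter is evaluated against `cars` at query time.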

### Temporary views
Temporary views are bound to the Spark session and are automatically dropped when the session ends.
They are handy for temporary data manipulations or analyses.
To create a temporary view, you simply add the `TEMPORARY` or `TEMP` keyword to the CREATE VIEW command:

`CREATE TEMP VIEW view_name AS <query>`

Let's create a temporary view called `temp_view_cars_brands`. This temporary view simply retrieves the unique list of brands from our `cars` table.

In [0]:
CREATE TEMP VIEW temp_view_cars_brands AS SELECT DISTINCT brand FROM cars;
SELECT * FROM temp_view_cars_brands

Running the `SHOW TABLES` command confirms the addition of the temporary view to the list. The `isTemporary` column indicates its temporary nature. 
In addition, since it is a temporary object, it is not persisted to any database, as indicated by having no database specified in the database column.

In [0]:
SHOW TABLES
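
The session-scoped nature of temporary views can be illustrated by analogy. In the sketch below, two `sqlite3` connections to the same database file play the role of two sessions; this is an assumption made for a locally runnable example, and SQLite connections are not Spark sessions. The names `session_a` and `session_b` are hypothetical.

```python
import os
import shutil
import sqlite3
import tempfile

# A file-based database so that two connections can share the same tables.
work_dir = tempfile.mkdtemp()
db_path = os.path.join(work_dir, "cars.db")

# First "session": creates the table and a temporary view.
session_a = sqlite3.connect(db_path)
session_a.execute("CREATE TABLE cars (id INT, model TEXT, brand TEXT, year INT)")
session_a.executemany(
    "INSERT INTO cars VALUES (?, ?, ?, ?)",
    [
        (1, "Cybertruck", "Tesla", 2024),
        (2, "Model S", "Tesla", 2023),
        (8, "Everest", "Ford", 2023),
    ],
)
session_a.commit()
session_a.execute(
    "CREATE TEMP VIEW temp_view_cars_brands AS SELECT DISTINCT brand FROM cars"
)
brands_in_a = sorted(
    row[0] for row in session_a.execute("SELECT * FROM temp_view_cars_brands")
)

# Second "session": sees the persisted table but not the temporary view.
session_b = sqlite3.connect(db_path)
cars_visible_in_b = session_b.execute("SELECT COUNT(*) FROM cars").fetchone()[0]
try:
    session_b.execute("SELECT * FROM temp_view_cars_brands")
    view_visible_in_b = True
except sqlite3.OperationalError:
    view_visible_in_b = False

print(brands_in_a, cars_visible_in_b, view_visible_in_b)

session_a.close()
session_b.close()
shutil.rmtree(work_dir, ignore_errors=True)
```

The persisted table is shared, while the temporary view exists only for the connection that created it and disappears when that connection closes.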

### Global temporary views
Global temporary views behave similarly to other temporary views but are tied to the cluster instead of a specific session. This means that as long as the cluster is running, any notebook attached to it can access its global temporary views. 

To define a global temporary view, you add the `GLOBAL TEMP` keywords to the `CREATE VIEW` command:

`CREATE GLOBAL TEMP VIEW view_name AS <query>`

For example, the following view retrieves all cars from our `cars` table released in 2022 or later, ordered by year in descending order:

In [0]:
CREATE GLOBAL TEMP VIEW global_temp_view_recent_cars AS SELECT * FROM cars WHERE year >= 2022 ORDER BY year DESC

Global temporary views are stored in a cluster's temporary database, named `global_temp`. When querying a global temporary view in a `SELECT` statement, you need to specify the `global_temp` database qualifier:

In [0]:
SELECT * FROM global_temp.global_temp_view_recent_cars 

Since the global temporary views are tied to the `global_temp` database, we need to use the command `SHOW TABLES IN`, explicitly specifying the database name `global_temp`:

In [0]:
SHOW TABLES IN global_temp

We can see the `global_temp_view_recent_cars`, which is indeed a temporary object tied to the `global_temp` database.

## Dropping Views
Let's drop our stored view by running the `DROP VIEW` command, as in standard SQL:


In [0]:
DROP VIEW view_tesla_cars

If you want to delete temporary views without waiting for the session to end or for the cluster to terminate, you can manually achieve this by using the `DROP VIEW` command as well:

`DROP VIEW temp_view_cars_brands;
DROP VIEW global_temp.global_temp_view_recent_cars;`