# Relational Entities in Databricks

Relational entities form the backbone of data organization and analysis within Databricks. These entities, such as *tables* and *views*, provide a structured way to store and retrieve data. The relational model allows for the representation of complex relationships between datasets, facilitating efficient querying, reporting, and analysis. By understanding and effectively utilizing relational entities, you will be able to use the full power of Databricks to derive valuable insights from your data.

## Basics of Relational Entities

> Relational entities are structures that store and organize data in a way that reflects the relationships between different datasets. These entities enable users to manage and analyze data efficiently.

<p align="center">
    <img src="images/Primary Objects.png" width="500"/>
</p>

### 1. Metastore

The *Metastore* in Databricks serves as a central repository for storing metadata related to tables, views, and other data structures. It stores information such as table schema, location, and properties, enabling efficient query planning and execution. The Metastore is essential for managing the metadata associated with relational entities.

### 2. Catalogs

A *catalog* is a named collection of databases. It is a top-level organizational unit that helps in categorizing and managing various databases. Each catalog can contain multiple databases, providing a way to structure and segregate data based on different projects, teams, or use cases.

### 3. Schemas (Databases)

Within a catalog, *schemas* (or *databases*) are logical containers that hold tables, views, and other relational entities. Schemas provide a way to organize data within a catalog by grouping related tables and views under a common namespace.

### 4. Tables

**Tables** are structured collections of data stored within a specific schema. They represent the fundamental entities for data storage and retrieval, organized into rows and columns. Tables are used to store and manage large datasets efficiently, supporting various data manipulation and analysis operations.

### 5. Views

**Views** are virtual tables derived from one or more existing tables or views. They provide a dynamic perspective on the underlying datasets without storing the data themselves. Views are beneficial for simplifying complex queries, aggregations, and transformations, offering a logical abstraction layer for users interacting with the data.
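
The hierarchy described above shows up directly in how objects are addressed. As a minimal sketch, assuming a workspace that exposes the legacy Hive Metastore as a catalog named `hive_metastore` and a hypothetical table `my_table` in the `default` schema, a fully qualified name walks the levels from catalog to schema to table:

```sql
-- catalog.schema.table: each part corresponds to one level of the hierarchy.
-- `hive_metastore` and `my_table` are assumptions for illustration only.
SELECT *
FROM hive_metastore.default.my_table
LIMIT 10;

-- The same table addressed with only schema.table;
-- the catalog is then taken from the current context.
SELECT *
FROM default.my_table
LIMIT 10;
```

Whether the three-part form is accepted depends on the workspace and runtime version, so treat the catalog name as something to check in your own environment.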

## Using Databases in Databricks

To organize and manage data effectively, Databricks lets you create databases, which act as logical containers for tables.

Databases can be created using the `CREATE DATABASE` or `CREATE SCHEMA` statement. The two terms are interchangeable in Databricks:

```sql
CREATE DATABASE database_name
```
> Remember, if you want to run these commands you have to run them in a Databricks Notebook. You will need to make sure you have a working cluster in your Community Edition account that's attached to your current Notebook. You will also need to change the default language to SQL. 

To ensure a database is only created if it doesn't already exist, you can use:

```sql
CREATE DATABASE IF NOT EXISTS database_name
```
As an example, we will create a new database called `my_first_db` using the command above. The `SHOW DATABASES` command allows you to view a list of existing databases:

<p align="center">
    <img src="images/ShowDatabases.png" width="750" height="250"/>
</p>

As you can see, besides our previously created database, we also have a `default` database. The default database is automatically present in every Databricks workspace and serves as the default context when a particular database is not specified in SQL queries. For example, if a table is created without explicitly specifying a database, it will be created in the default database.

For detailed information about a specific database, use the `DESCRIBE DATABASE database_name` command.

<p align="center">
    <img src="images/DescribeDatabase.png" width="800" height="300"/>
</p>

This command will display important information about the database, including its catalog, name, location and owner.
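
Put together, the commands from this section could be run in a single notebook cell as in the sketch below. It assumes the `my_first_db` name used in the text. The `EXTENDED` variant is an optional addition not shown in the screenshots:

```sql
-- Create the example database only if it is not already present
CREATE DATABASE IF NOT EXISTS my_first_db;

-- List all databases; the result should include `default` and `my_first_db`
SHOW DATABASES;

-- Show metadata for the new database
DESCRIBE DATABASE my_first_db;

-- EXTENDED additionally lists any database properties
DESCRIBE DATABASE EXTENDED my_first_db;
```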

> By default, when you create a database, it is stored in the *Hive Metastore* (a central repository that stores metadata information about databases, tables, and partitions) with a default physical storage location in the **Databricks File System (DBFS)**.

Let's break down the location:

- `dbfs:/` refers to the Databricks File System
- `user/hive/warehouse` is the default base directory for Hive databases in DBFS
- `my_first_db.db` is the specific directory corresponding to the `my_first_db` database

Once you have created a database, you can also visualize it in the **Data** explorer tab.

<p align="center">
    <img src="images/DataExplorer.png" width="700" height="400"/>
</p>

> You can customize the location at which a database is created by explicitly specifying the `LOCATION` clause when creating the database:

```sql
CREATE DATABASE IF NOT EXISTS my_second_db
LOCATION 'dbfs:/your/custom/location/';
```

For example, that might look like this:

```sql
CREATE DATABASE IF NOT EXISTS my_second_db
LOCATION 'dbfs:/custom/databases/my_second_db/';
```
If you now run the `DESCRIBE DATABASE` command, you should see the custom location in the location field.

Two commands to be aware of when working with databases are:

- `USE database_name`: For setting a specific database as the current context, simplifying subsequent table creation and querying
- `DROP DATABASE IF EXISTS database_name CASCADE`: For deleting a database and removing all its associated tables. Remember to exercise caution when dropping databases to avoid data loss.
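
A short sketch of both commands, assuming the `my_first_db` and `my_second_db` databases created earlier in this section:

```sql
-- Make my_first_db the current database for this session
USE my_first_db;

-- Confirm which database unqualified table names now resolve against
SELECT current_database();

-- Remove my_second_db together with every table it contains.
-- This is destructive: the data of managed tables is deleted as well.
DROP DATABASE IF EXISTS my_second_db CASCADE;
```

Without `CASCADE`, dropping a database that still contains tables fails, which can serve as a safety net while you are experimenting.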

## Creating Tables in Databricks

Tables in Databricks serve as structured containers for organizing and storing data. When working with tables, there are two primary types: *managed tables* and *external tables*.

### Managed Tables

> **Managed tables** are fully handled and maintained by Databricks. Databricks takes care of storing both the data and its associated metadata. Managed tables are stored in the default Hive Metastore location within DBFS.

To create a managed table, you can use the `CREATE TABLE` statement and specify the table name along with its schema. The schema defines the structure of the table, including the names and data types of its columns. For example:

```sql
CREATE TABLE my_managed_table (
    id INT,
    name STRING,
    age INT
);
```
In this example, we've created a table named `my_managed_table` with three columns: `id` of type `INT`, `name` of type `STRING`, and `age` of type `INT`. Together, these column names and data types make up the schema of the table.
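
To check how a table was registered, you can inspect its metadata. The sketch below assumes `my_managed_table` from the example above. The exact rows returned vary between runtime versions, but the detailed section typically includes the table type and its storage location:

```sql
-- Show the column definitions followed by detailed table metadata.
-- For a managed table, the type should be reported as MANAGED.
DESCRIBE EXTENDED my_managed_table;
```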

### External Tables

> **External tables** reference data stored in a location specified by the `LOCATION` clause during table creation. This location is outside the default directory structure.

To create an external table you can use the following syntax:

```sql
CREATE TABLE table_name
LOCATION 'path'
```

So, for example:

```sql
CREATE TABLE my_external_table (
    id INT,
    name STRING,
    age INT
)
LOCATION 'your/external/location/';
```

An external table could have its data files stored within Databricks (in a different directory than the database directory in DBFS) or in an entirely external storage system, such as Azure Blob Storage or AWS S3. 

The key differences between managed and external tables are:

- Managed tables store both data and metadata in the default Hive Metastore location within DBFS
- External tables reference data stored externally, and the location is specified using the `LOCATION` clause
- Databricks fully manages the lifecycle of a managed table, meaning that dropping a managed table will delete the underlying data files
- Dropping an external table will not delete the underlying data files, as Databricks only manages its metadata
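
The difference in drop behavior can be tried out safely on throwaway tables. In the sketch below, the table names `scratch_managed` and `scratch_external` and the path `dbfs:/tmp/scratch_external/` are hypothetical, chosen so that the tutorial's own tables are left untouched:

```sql
-- Hypothetical scratch tables, used so the tutorial's own tables stay intact
CREATE TABLE IF NOT EXISTS scratch_managed (id INT);

CREATE TABLE IF NOT EXISTS scratch_external (id INT)
LOCATION 'dbfs:/tmp/scratch_external/';  -- assumed path, adjust as needed

-- Removes the metadata and the underlying data files
DROP TABLE IF EXISTS scratch_managed;

-- Removes only the metadata; files at the LOCATION path remain
DROP TABLE IF EXISTS scratch_external;
```

After the second `DROP TABLE`, the directory given in `LOCATION` still exists and has to be cleaned up separately if it is no longer needed.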

### Inserting Data into Tables

Once you have created a table, you can use the `INSERT INTO` statement to add data to it. This is a common and straightforward method to populate tables with meaningful information.

The `INSERT INTO` statement allows you to insert specific values or data from another source into a table. Here's a simple example:

```sql
-- Inserting data into a managed table
INSERT INTO my_managed_table (id, name, age)
VALUES
  (1, 'John Doe', 25),
  (2, 'Jane Smith', 30),
  (3, 'Bob Johnson', 28);
```

In this example, data is being inserted into `my_managed_table`. The specified columns (`id`, `name`, and `age`) are matched with the corresponding values. Ensure the order of values in each row of the `VALUES` clause matches the order of the column list in the `INSERT INTO` clause.
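
Inserting data from another source works the same way, with a query in place of the `VALUES` clause. The sketch below assumes `my_external_table` exists, has the same three columns, and already holds some rows. The `age >= 18` filter is purely illustrative:

```sql
-- Copy rows from another table instead of listing literal values.
-- `my_external_table` is assumed to exist with matching columns.
INSERT INTO my_managed_table (id, name, age)
SELECT id, name, age
FROM my_external_table
WHERE age >= 18;

-- Check the result
SELECT * FROM my_managed_table ORDER BY id;
```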

As with any relational entity, not only should you now be able to see this new table using the **Data** explorer page, but by double-clicking on the table name, you should also be able to see its schema and sample data:

<p align="center">
    <img src="images/DataExplorerTables.png" width="700" height="475"/>
</p>

Note that `INSERT INTO` is just one of the many ways you can populate tables with data. As you become more familiar with Databricks, you can explore additional methods for data insertion and manipulation.

## Working with Views

> **Views** provide a powerful way to organize and present data without physically duplicating it. They are virtual tables based on the result of a `SELECT` query. They do not store the data themselves but provide a way to represent the data from one or more tables.

### Creating Views in Databricks

To create a view in Databricks, you can use the `CREATE VIEW` statement. There are different types of views:

1. Stored Views

   - *Stored views* are persistent and stored in the Metastore. They can be referenced across different sessions and notebooks.
   - An example of creating a stored view:

```sql
CREATE VIEW my_stored_view AS
SELECT id, name
FROM my_table
WHERE age > 25;
```

2. Temporary Views

   - *Temporary views* are session-scoped and are available only for the duration of the current session
   - An example of creating a temporary view:

```sql
CREATE OR REPLACE TEMPORARY VIEW my_temporary_view AS
SELECT id, name
FROM my_table
WHERE age > 25;
```

3. Global Temporary Views

   - *Global temporary views* are similar to temporary views but can be shared across different sessions within the same cluster
   - An example of creating a global temporary view:

```sql
CREATE OR REPLACE GLOBAL TEMPORARY VIEW my_global_temporary_view AS
SELECT id, name
FROM my_table
WHERE age > 25;
```
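
Once created, the three kinds of view are queried slightly differently. The sketch below assumes the view names from the examples above. Note that global temporary views are registered in a reserved schema called `global_temp` and must be referenced through it:

```sql
-- Stored and temporary views are queried by name, like a table
SELECT * FROM my_stored_view;
SELECT * FROM my_temporary_view;

-- Global temporary views live in the reserved `global_temp` schema
SELECT * FROM global_temp.my_global_temporary_view;

-- Remove a view when it is no longer needed
DROP VIEW IF EXISTS my_temporary_view;
```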

### Differences Between Views and Tables

While both views and tables represent structured data, there are key differences:

- Views do not store data themselves; they are based on the result of a `SELECT` query. Tables, on the other hand, physically store data.

- Views do not declare a schema of their own; their columns and data types are derived from the defining `SELECT` query over the underlying tables. Tables have an explicitly defined schema that specifies the structure of the data.

- Data cannot be directly modified through a view. If you want to modify data, you need to modify the underlying tables.

- When you modify or delete the underlying table, the view is affected. Since views are essentially queries over tables, changes to the underlying table directly impact the data presented by the view. 
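
The last point can be observed directly. The sketch below assumes `my_managed_table` from earlier in this document. The view name `adults_over_25` and the inserted row are made up for illustration:

```sql
-- Hypothetical view over the table created earlier
CREATE OR REPLACE VIEW adults_over_25 AS
SELECT id, name
FROM my_managed_table
WHERE age > 25;

-- Add a row to the underlying table only
INSERT INTO my_managed_table (id, name, age)
VALUES (4, 'Alice Brown', 41);

-- The view is re-evaluated on each query, so the new row appears
SELECT * FROM adults_over_25;
```

No refresh step is involved: the view stores only its query text, so every read runs against the current contents of the table.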

## Key Takeaways

- The Metastore serves as a central repository storing metadata related to tables, views, and data structures in Databricks, crucial for efficient query planning and execution
- Catalogs are named collections of databases that provide top-level organizational units, aiding in the categorization and management of various databases within Databricks
- Databases, or Schemas, are logical containers within catalogs that hold tables, views, and other relational entities, enabling organized data storage and retrieval
- You can create databases using the `CREATE DATABASE` or `CREATE SCHEMA` statements. You can also create databases with custom locations using the `LOCATION` clause.
- Managed tables are fully handled and maintained by Databricks. These tables store both data and metadata in the default Hive Metastore location within DBFS.
- External tables reference data stored externally, allowing for flexibility in data storage locations, either within Databricks or in external storage systems like Azure Blob Storage or AWS S3
- You can create tables using the `CREATE TABLE` statement
- Tables can be populated using the `INSERT INTO` statement, a common and straightforward method for adding meaningful information to tables
- Views are virtual tables derived from `SELECT` queries, providing a dynamic perspective on underlying datasets without storing the data themselves
- Stored views are persistent views stored in the Metastore that can be referenced across different sessions and notebooks
- Temporary views are session-scoped views available only during the current session
- Global temporary views are similar to temporary views but can be shared across different sessions within the same cluster