# Lecture 19. Querying Files

## Querying Files Directly

To query the contents of a file, we can use a plain `SELECT` statement.

We write `SELECT * FROM`, followed by the file format and the file path. Note that the path is wrapped in backticks, not single quotes.

<div style="text-align: center;">
<img src="../../assets/images/Presentation-Images/Querying Files Directly.jpg" style="width:640px" >
</div> 

This works well with self-describing formats that have a well-defined schema, such as JSON and Parquet.
However, it is not very useful with formats that are not self-describing, such as CSV.

- The path can point to
  - a single file,
  - multiple files at once, using a wildcard character,
  - or a whole directory, assuming that all of the files in the directory have the same format and schema.

- File Format
  - Extract a JSON file.

    ```sql
    SELECT * FROM json.`/path/file_name.json`
    ```

    As you can see, we simply write `SELECT * FROM json`, followed by the path to the file wrapped in backticks.

  - Extract files as raw strings

    When working with text-based files such as JSON, CSV, TSV, and TXT, you can use the `text` format to extract the data as raw strings.
    
    ```sql
    SELECT * FROM text.`/path/to/file`
    ```

    This can be useful when the input data could be corrupted.
    In this case, we extract the data as raw strings and apply custom text-parsing functions to extract the values.

  - Extract files as raw bytes

    In some cases, we need the binary representation of a file's content, for example when dealing with images and unstructured data.
    Here we can simply use `binaryFile` as the format.

    ```sql
    SELECT * FROM binaryFile.`/path/to/file`
    ```
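
The path variants and file formats listed above can be sketched as follows. This is a minimal illustration rather than a definitive recipe: the directory and the `export_*.json` file names are hypothetical and should be replaced with real locations.

```sql
-- Hypothetical paths; replace them with real locations in your storage.

-- 1. A single file
SELECT * FROM json.`/path/to/dir/export_001.json`;

-- 2. Several files matched by a wildcard
SELECT * FROM json.`/path/to/dir/export_*.json`;

-- 3. Every file in a directory (all files must share one format and schema)
SELECT * FROM json.`/path/to/dir`;

-- text format: one row per line, in a single string column named value
SELECT value FROM text.`/path/to/dir`;

-- binaryFile format: one row per file, with path, modificationTime, length and content
SELECT path, length FROM binaryFile.`/path/to/dir`;
```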

## CTAS: Registering Tables from Files

Usually, after extracting data from external data sources,
we need to load it into the lakehouse,
which ensures that all of the benefits of the Databricks platform can be fully leveraged.

To load data from files into Delta tables,
we use CTAS statements; CTAS stands for "Create Table As Select".

```sql
CREATE TABLE table_name
AS SELECT * FROM file_format.`/path/to/file`
```

Here we are querying data from files directly.

CTAS statements automatically infer schema information from the query results and do not support manual schema declaration.
This means that CTAS statements are useful for external data ingestion from sources with a well-defined schema, such as Parquet files and tables.
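
As a minimal sketch of schema inference, assuming Parquet files at a placeholder path, the statements below create a table and then inspect the columns and types that were inferred. The table name `sales_bronze` is hypothetical.

```sql
-- Hypothetical table name and placeholder path.
CREATE TABLE sales_bronze
AS SELECT * FROM parquet.`/path/to/sales`;

-- Inspect the inferred columns and types, plus table metadata such as the provider.
DESCRIBE EXTENDED sales_bronze;
```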

### Limitation

CTAS statements also do not support specifying additional file options.
This is a significant limitation when trying to ingest data from CSV files.
For a format that requires additional options, we need another solution that supports them.
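
To illustrate this limitation, here is a minimal sketch, assuming semicolon-delimited CSV files with a header row. The table name `sales_unparsed` and the path are hypothetical. With the CSV reader's defaults (comma delimiter, no header), one would expect the header line to show up as a data row and each line to land in a single, generically named column.

```sql
-- Hypothetical table name and placeholder path.
-- Assumes semicolon-delimited files with a header row.
CREATE TABLE sales_unparsed
AS SELECT * FROM csv.`/path/to/csv_files`;

-- No options could be passed above, so the rows are likely not parsed as intended.
SELECT * FROM sales_unparsed LIMIT 5;
```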

## Configuring the Options for External Sources

### `CREATE TABLE USING` statement

  This solution is the regular `CREATE TABLE` statement, but with the `USING` keyword.
  With `USING`, we specify the external data source type, for example CSV, along with any additional options.

  And of course, we need to specify the location where these files are stored.

  ```sql
  CREATE TABLE table_name
              (col_name1 col_type1, ...)
  USING data_source_type
  OPTIONS (key1 = val1, key2 = val2, ...)
  LOCATION path
  ```

  That means with this command, we are always creating an external table.
  The table here is just a reference to the files.

  Unlike with CTAS statements, no data is moved during table creation.
  We are just pointing to files stored in an external location.

  Moreover, these files are kept in their original format, which means we are creating a non-Delta table here.
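
One way to check these properties is to inspect the table metadata. This is a small sketch that reuses the `table_name` placeholder from the template above; in the output, one would look at the rows that report the table type, the provider, and the location.

```sql
-- table_name is the placeholder used in the template above.
-- Look for the Type, Provider, and Location rows in the result.
DESCRIBE EXTENDED table_name;
```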

#### Examples

- Here is an example of creating a table from a CSV external source.

  ```sql
  CREATE TABLE table_name
  (col_name1 col_type1, ...)
  USING CSV
  OPTIONS (header = "true",
          delimiter = ";")
  LOCATION path
  ```

  So again, it's not a Delta table.

  We are pointing to CSV files that exist in an external location.

  We are also specifying the options for reading the files,
  such as the fact that a header is present in the files and that the delimiter is a semicolon.

  And finally, we are providing the location of these CSV files.

- Another example is to create a table using a JDBC connection to refer to data in an external SQL database. We provide the necessary options, such as the connection string, the username and password for the database, and of course the database table containing the data.

  ```sql
  CREATE TABLE table_name
  (col_name1 col_type1, ...)
  USING JDBC
  OPTIONS (url = "jdbc:sqlite://hostname:port",
          dbtable = "database.table",
          user = "username",
          password = "pwd")
  ```

#### Limitation

And again, a table backed by an external data source has a limitation:
it is not a Delta table.

* This means the performance and features of Delta Lake are no longer guaranteed, such as time travel and the guarantee that we always read the most recent version of the data.

* In addition, referring to a huge database table can also cause performance issues.
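
The freshness issue can be sketched as follows, assuming an external table named `table_name` (the placeholder used earlier) whose underlying files have changed since it was last read. `REFRESH TABLE` invalidates the cached entries so that the next query rescans the files; on a large dataset this rescan may take a long time.

```sql
-- After new files land in the external location, this may still return cached results.
SELECT count(*) FROM table_name;

-- Invalidate the cached data and metadata for the table.
REFRESH TABLE table_name;

-- The next query reads the files again.
SELECT count(*) FROM table_name;
```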

### Solution for `CREATE TABLE USING` Statement Limitation

The solution is simply to 
create a temporary view, referring to the external data source, 
and then query this temporary view to create a table using CTAS statements.

```sql
CREATE TEMP VIEW temp_view_name (col_name1 col_type1, ...)
USING data_source
OPTIONS (key1 = "val1", key2 = "val2", ..., path = "/path/to/files");

CREATE TABLE table_name
AS SELECT * FROM temp_view_name
```

In this way, we extract the data from the external data source and load it into a Delta table.

And as you can see, a CTAS statement can query not only files but any object, such as the temporary view in this case.
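
A concrete instance of this pattern for CSV files might look like the sketch below. The view name, table name, and column names are hypothetical, and the path is a placeholder. The options mirror the CSV example above (a header row and a semicolon delimiter).

```sql
-- Hypothetical view, table, and column names; the path is a placeholder.
CREATE TEMP VIEW sales_tmp_vw (order_id STRING, amount DOUBLE, order_date DATE)
USING CSV
OPTIONS (header = "true", delimiter = ";", path = "/path/to/files");

-- CTAS copies the parsed rows into a new table.
CREATE TABLE sales AS
SELECT * FROM sales_tmp_vw;

-- On Databricks, where Delta is the default format, the provider is expected to be delta.
DESCRIBE EXTENDED sales;
```

Because the options are attached to the temporary view, the CTAS statement itself needs none, which is how this pattern works around the limitation described earlier.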