# Objectives
- Querying data files
- Writing to tables
- Performing advanced ETL operations
- Discovering the potential of higher-order functions and user-defined functions (UDFs) in Spark SQL

# Querying Data Files
To query a file directly, we use the `SELECT * FROM` syntax, followed by the file format and the path to the file:
```sql
SELECT * FROM file_format.`/path/to/file`
```
The file path is enclosed in **backticks** to prevent syntax errors and ensure that the path is interpreted correctly.

A file path in this context can refer to:
- A single file
- A wildcard pattern that matches multiple files, which are then read together; or
- An entire directory, assuming that all files within that directory share the same format and schema
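
The three path forms can be illustrated without a cluster. The following is a minimal plain-Python sketch, not Spark code: it creates a temporary directory with hypothetical file names and uses the standard `glob` module to show which files each kind of path would cover. Spark's own path resolution may differ in details, so treat this only as an analogy.

```python
import glob
import os
import tempfile

# Hypothetical file names, created only for this illustration.
names = ["export_001.json", "export_002.json", "notes.json"]

with tempfile.TemporaryDirectory() as root:
    for name in names:
        with open(os.path.join(root, name), "w") as f:
            f.write("{}\n")

    # 1. A single file: the path names exactly one file.
    single = [os.path.join(root, "export_001.json")]

    # 2. A wildcard pattern: every file whose name matches is included.
    wildcard = sorted(glob.glob(os.path.join(root, "export_*.json")))

    # 3. An entire directory: every file in the directory is included.
    directory = sorted(glob.glob(os.path.join(root, "*")))

    resolved = {
        "single": [os.path.basename(p) for p in single],
        "wildcard": [os.path.basename(p) for p in wildcard],
        "directory": [os.path.basename(p) for p in directory],
    }

for form, files in resolved.items():
    print(form, files)
```

Note that the directory form also picks up `notes.json`, which is why all files in a directory need to share the same format and schema.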

We can now demonstrate extracting data directly from files using a real-world dataset representing an online school environment. This dataset consists of three tables:
- Students
- Enrollments
- Courses

We begin by running a helper notebook, "School-Setup", which can be found in the `Includes` subfolder. This helper notebook downloads the dataset to the Databricks file system and prepares the working environment:

```
%run ./Includes/School-Setup
```

## Querying JSON Format
The student data in this dataset is stored in JSON format. The `dataset_school` variable referenced in the following command is defined in the "School-Setup" notebook and points to the location where the dataset files are stored on the file system. The command lists the contents of the `students-json` directory:

```python
%python
# List the files in the students-json directory
files = dbutils.fs.ls(f"{dataset_school}/students-json")
display(files)
```

The output above shows that there are 6 JSON files in the `students-json` folder.

### Reading a single data file
To read a single JSON file, we use the syntax `SELECT * FROM json.` followed by the full path to the file, enclosed in backticks. The path starts with the `${dataset.school}` placeholder, which resolves to the location where the dataset files are stored. This placeholder is configured in the "School-Setup" notebook:

```sql
SELECT * FROM json.`${dataset.school}/students-json/export_001.json`
```

The result displays the extracted student data, including:
- Student ID
- Email
- GPA score
- Profile information (in JSON format); and
- The last updated timestamp

### Querying multiple files
To query multiple files simultaneously, you can use the wildcard character (`*`) in the path. For example, the following query reads all JSON files whose names start with `export_`:

```sql
SELECT * FROM json.`${dataset.school}/students-json/export_*.json`
```

### Querying an entire directory
You can query an entire directory of files, assuming a consistent format and schema across all files in the directory. In the following query, the directory path is specified instead of an individual file:

```sql
SELECT * FROM json.`${dataset.school}/students-json`
```

#### Recording the source file
When dealing with multiple files, the `input_file_name` function becomes useful. This built-in Spark SQL function records the source data file for each record, which helps in troubleshooting data-related issues by pinpointing their exact source.

```sql
SELECT *, input_file_name() source_file FROM json.`${dataset.school}/students-json`
```

In addition to the original columns, the output above shows a new column, `source_file`, which identifies the file that each record came from.
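
The idea of tagging each record with its origin can be sketched in plain Python. This is an analogy rather than Spark code: the helper `read_json_dir_with_source` and the sample records are hypothetical, and the sketch assumes each file holds one JSON object per line.

```python
import json
import os
import tempfile


def read_json_dir_with_source(directory):
    """Read every JSON Lines file in `directory`, tagging each record with its file."""
    records = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path) as f:
            for line in f:
                if line.strip():
                    record = json.loads(line)
                    # Plays the same role as the source_file column in the query.
                    record["source_file"] = path
                    records.append(record)
    return records


# Hypothetical sample data: two small files with one record per line.
with tempfile.TemporaryDirectory() as root:
    samples = {
        "export_001.json": [{"student_id": "S1"}, {"student_id": "S2"}],
        "export_002.json": [{"student_id": "S3"}],
    }
    for name, rows in samples.items():
        with open(os.path.join(root, name), "w") as f:
            f.write("\n".join(json.dumps(r) for r in rows) + "\n")

    tagged = read_json_dir_with_source(root)

for record in tagged:
    print(record["student_id"], os.path.basename(record["source_file"]))
```

If a record later turns out to be malformed, the extra field points straight to the file that needs inspecting.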

## Querying Using the text Format
Databricks can also read any text-based file, such as JSON, CSV, TSV, or TXT, using the `text` format:
```sql
SELECT * FROM text.`/path/to/file`
```
This format extracts the data as raw strings, which is particularly useful when the input data might be corrupted or contain anomalies. You can then apply custom parsing logic to the raw strings to extract the relevant values.

```sql
SELECT * FROM text.`${dataset.school}/students-json`
```

The output above displays the student data as raw strings. Each line of a file is loaded as a record with a single string column named `value`.

With this result, you can apply custom parsing or transformation techniques to extract specific fields, correct anomalies, or reformat the data as needed for subsequent analysis.
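
As a rough illustration of custom parsing over raw strings, the plain-Python sketch below separates lines that parse as JSON from lines that do not. It is not Spark code: the function name `parse_raw_lines` and the sample values are hypothetical, and it assumes one JSON object per line.

```python
import json


def parse_raw_lines(lines):
    """Split raw text lines into parsed records and lines that failed to parse."""
    parsed, rejected = [], []
    for line in lines:
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            # Keep the raw string so the anomaly can be inspected later.
            rejected.append(line)
    return parsed, rejected


# Hypothetical raw values, as they might appear in the `value` column.
raw_values = [
    '{"student_id": "S1", "gpa": 3.5}',
    '{"student_id": "S2", "gpa": ',  # truncated, hence corrupted
]

parsed, rejected = parse_raw_lines(raw_values)
print(parsed)
print(rejected)
```

Because the corrupted line is kept as a string instead of aborting the read, it can be repaired or reported separately.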

## Querying Using the binaryFile Format
In some scenarios, the binary representation of file content is essential, such as when working with images or other unstructured data. The `binaryFile` format is suited to this task:
```sql
SELECT * FROM binaryFile.`/path/sample_image.png`
```
In the sample query provided, the `binaryFile` format is employed to query an image (`sample_image.png`), allowing you to work directly with the binary representation of the file's content.

We can also use the `binaryFile` format to extract the raw bytes and metadata of the student files:

```sql
SELECT * FROM binaryFile.`${dataset.school}/students-json`
```

The output of the query provides the following details about each source file:
- `path` provides the location of the source file in storage
- `modificationTime` gives the last modification time of the file
- `length` indicates the size of the file in bytes
- `content` holds the raw binary content of the file

By using the `binaryFile` format, you can access both the content and the metadata of files, which gives you a detailed view of your dataset.
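
To make the four fields concrete, here is a plain-Python sketch that collects comparable information from the local file system. It is an analogy only: the helper `describe_files` is hypothetical, and the modification time here is a number of seconds since the epoch rather than the timestamp value Spark returns.

```python
import os
import tempfile


def describe_files(directory):
    """Return one dict per file, with keys named after the binaryFile columns."""
    rows = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path, "rb") as f:
            content = f.read()
        rows.append({
            "path": path,
            "modificationTime": os.path.getmtime(path),  # seconds since the epoch
            "length": os.path.getsize(path),  # size in bytes
            "content": content,  # raw bytes of the file
        })
    return rows


# Hypothetical sample file, created only for this illustration.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "export_001.json"), "wb") as f:
        f.write(b'{"student_id": "S1"}\n')
    rows = describe_files(root)

for row in rows:
    print(os.path.basename(row["path"]), row["length"], row["content"][:10])
```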

## Querying Non-Self-Describing Formats
The previous querying approach is particularly effective with self-describing formats that possess a well-defined schema, such as JSON and Parquet. By nature, these formats offer a built-in structure that makes it easy to retrieve and interpret data using `SELECT` queries.

However, when dealing with non-self-describing formats such as CSV, the `SELECT` statement may not be as informative. Unlike JSON and Parquet, CSV files lack a predefined schema, making the format less suitable for direct querying. In such cases, additional steps, such as defining a schema, may be necessary for effective data extraction and analysis.

```sql
SELECT * FROM csv.`${dataset.school}/courses-csv`
```

As the output above shows, the data is not parsed correctly:
- The header row is extracted as a table row; and
- All columns are loaded into a single column, `_c0`.

The single column is explained by the delimiter, the symbol used to separate columns in the file, which in this case is a semicolon rather than the standard comma. The header row appears as data because a direct query does not treat the first line of a CSV file as a header by default.
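
The effect of a mismatched delimiter can be reproduced with Python's standard `csv` module. This is a plain-Python analogy, not Spark code, and the sample content and column names are hypothetical.

```python
import csv
import io

# Hypothetical semicolon-delimited sample; the column names are illustrative only.
sample = "course_id;title\nC01;Intro to SQL\n"

# Default parsing assumes a comma, so each line stays in a single field
# and the header line is returned like any other row.
default_rows = list(csv.reader(io.StringIO(sample)))

# Telling the parser about the semicolon splits the columns as intended.
semicolon_rows = list(csv.reader(io.StringIO(sample), delimiter=";"))

print(default_rows)
print(semicolon_rows)
```

With the default settings, every row collapses into one field, which mirrors the single `_c0` column seen in the query output.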

This issue highlights a challenge with querying files without a well-defined schema, particularly in formats like CSV. In the upcoming sections, we will learn how to address this challenge.