# 📂 Querying Files in Databricks

Databricks allows you to query files directly using **Spark SQL**, making it easy to extract, view, and transform data before loading it into Delta tables.

---

## 🧾 Extracting Data from Files

- Use `SELECT` statements to read data directly from files.
- File paths must be wrapped in **backticks (`)**.
  
```sql
SELECT * FROM file_format.`/path/to/file.csv`;
```
---
## File Formats
- Self-describing formats like JSON and Parquet are ideal because they contain schema information.
- Non-self-describing formats like CSV may require additional options (e.g., headers, schema definition).
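```sql
SELECT * FROM json.`/path/to/file.json`;
```

Note: In this example, `json` takes the place of the `file_format` placeholder; other self-describing formats such as `parquet` work the same way.
 - Direct queries are less useful for non-self-describing formats like CSV and TSV, because the files are read with default options (for example, a header row is treated as data).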
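
For CSV, one option is the `read_files` table-valued function, which accepts read options inline. This is a minimal sketch, assuming a Databricks runtime recent enough to provide `read_files`; the path is the same placeholder used above.

```sql
-- Assumes read_files is available on the runtime in use.
-- format and header are passed as named options.
SELECT *
FROM read_files(
  '/path/to/file.csv',
  format => 'csv',
  header => true
);
```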
---
## ✳️ Wildcard Characters
Use wildcards to read multiple files or full directories with matching format and schema:
```sql
SELECT * FROM json.`/mnt/logs/2025/*`;
```
- **Single file**: `file_2025.csv`
- **Multiple files**: `file_*.csv`
  - This assumes that all matching files share the same format and schema.
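
When a wildcard matches many files, it can help to see which file each row came from. A small sketch using Spark SQL's `input_file_name()` function; the `source_file` alias is only illustrative, and some newer runtimes may favor the `_metadata.file_path` column instead.

```sql
-- Tag each record with the file it was read from.
SELECT *, input_file_name() AS source_file
FROM json.`/mnt/logs/2025/*`;
```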
---
## 📝 Handling Text-Based Files
When working with formats like `CSV`, `TSV`, `TXT`, or `JSON`:
- Use the `text` format to extract raw strings.
- This can be helpful when input data may be malformed or partially corrupted.
```sql
SELECT * FROM text.`/mnt/raw/input.txt`;
```
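
The `text` format returns one row per line in a single string column named `value`, which can then be parsed with string functions. A sketch that flags suspicious lines, assuming a comma-delimited file with three fields per line (the field count is hypothetical):

```sql
-- Split each raw line on commas and keep lines whose field count is unexpected.
SELECT value, size(split(value, ',')) AS field_count
FROM text.`/mnt/raw/input.txt`
WHERE size(split(value, ',')) <> 3;
```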
---
## 🖼️ Handling Binary Data
Use the `binaryFile` format for unstructured data like images or PDFs:
```sql
SELECT * FROM binaryFile.`/mnt/images/*.png`;
```
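
Each row returned by `binaryFile` describes one file through the columns `path`, `modificationTime`, `length`, and `content`. As a sketch, the metadata can be inspected without pulling the raw bytes into the result:

```sql
-- List matching images by size, largest first; content (the raw bytes) is left out.
SELECT path, length, modificationTime
FROM binaryFile.`/mnt/images/*.png`
ORDER BY length DESC;
```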
---
## 💾 Loading Data into Delta Lake
- Use CTAS (Create Table As Select) to load external data into a Delta Lake table.
```sql
CREATE TABLE delta_table_name
AS SELECT * FROM csv.`/mnt/data/source.csv`;
```
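Note: CTAS takes its schema from the query result. A direct `csv` query reads the file with default options, so if the file needs settings such as `header` or an explicit schema, define them in a view or external table first and select from that.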
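
To check the result, the table metadata can be inspected. A sketch, assuming the CTAS above has run on Databricks, where new tables default to Delta:

```sql
-- The output includes the table format (provider), location, and column types.
DESCRIBE EXTENDED delta_table_name;
```

If the columns are named `_c0`, `_c1`, and so on, the header row was read as data and the source likely needs read options applied before the CTAS.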

---
## 🛠️ Create Table with External Source
Use `CREATE TABLE ... USING` to create a table that references external data without moving it:
```sql
CREATE TABLE external_table
USING csv
OPTIONS (
  path '/mnt/data/example.csv',
  header 'true',
  inferSchema 'true'
);
```
These tables are not Delta tables, so they lack Delta Lake guarantees such as ACID transactions and schema enforcement, and they may not benefit from its performance optimizations.
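
Because the data stays in the original files, Spark may cache file listings for such a table. A hedged sketch of inspecting the table and refreshing it after the underlying file changes:

```sql
-- Show the provider (csv), location, and inferred column types.
DESCRIBE EXTENDED external_table;

-- Invalidate cached data and metadata so later queries re-read the file.
REFRESH TABLE external_table;
```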

---
## 🧪 Using Temporary Views + CTAS
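To convert external data into a Delta table when the source needs read options (such as a CSV header):

1. Create a temporary view on the source data, passing the options there:
```sql
CREATE OR REPLACE TEMP VIEW temp_data
USING csv
OPTIONS (path '/mnt/data/example.csv', header 'true', inferSchema 'true');
```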
2. Load the data into Delta using CTAS:
```sql
CREATE TABLE delta_table AS
SELECT * FROM temp_data;
```
This gives you the performance and reliability of Delta Lake while using raw files as the source.
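
CTAS takes its schema from the query, so the `SELECT` is also the place to fix types and names. A sketch under assumptions: the columns `order_id`, `amount`, and `order_date` and the table name `delta_table_typed` are hypothetical and not part of the source file described above.

```sql
-- Cast columns from the staged view while loading into a new Delta table.
CREATE TABLE delta_table_typed AS
SELECT
  CAST(order_id AS INT) AS order_id,
  CAST(amount AS DOUBLE) AS amount,
  CAST(order_date AS DATE) AS order_date
FROM temp_data;
```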