<p><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/UNAL_Logosimbolo.svg/583px-UNAL_Logosimbolo.svg.png" alt="" width="1280" height="300" /></p>

# TRANSVERSAL

## PATHS

Spark is designed to work with distributed file systems rather than the local file system. For this reason, a file path should include the URI scheme that indicates the storage type; a path without a scheme is resolved against the cluster's default file system.

Also, keep in mind that the default file format in Spark is Parquet, a columnar format optimized for performance.
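
Because Parquet is the default data source, `spark.read.load()` needs no explicit format, while other formats have to be named. A minimal sketch, assuming an existing `SparkSession` is passed in as `spark` (as is predefined in Databricks notebooks); the helper names and the CSV `header` option are illustrative choices, not part of the original material:

```python
def read_default(spark, path):
    """Read `path` with Spark's default source (Parquet unless
    `spark.sql.sources.default` has been changed)."""
    return spark.read.load(path)


def read_csv(spark, path):
    """Read `path` as CSV by naming the format explicitly."""
    return spark.read.format("csv").option("header", "true").load(path)


# Usage inside a notebook, where `spark` already exists:
# df = read_default(spark, "dbfs:/mnt/datalake/file.parquet")
```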

### HDFS

Hadoop Distributed File System


```bash
hdfs://namenode:9000/user/data/file.parquet
```

### DATABRICKS (DBFS)


```bash
dbfs:/mnt/datalake/file.parquet
```


### LOCAL READ

```bash
file:///path/local/file.parquet
```

> 💡 **Note:** Using local paths (`file://`) is not recommended in production, as Spark is designed to process data in distributed environments.
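
To see how the scheme identifies the storage type, a path can be split with Python's standard `urllib.parse`. This is only an illustrative sketch: the helper name `storage_scheme` is hypothetical, and Spark performs its own path resolution internally.

```python
from urllib.parse import urlparse


def storage_scheme(path: str) -> str:
    """Return the URI scheme of a path, or 'default' when none is given."""
    scheme = urlparse(path).scheme
    return scheme if scheme else "default"


paths = [
    "hdfs://namenode:9000/user/data/file.parquet",
    "dbfs:/mnt/datalake/file.parquet",
    "file:///path/local/file.parquet",
    "/user/data/file.parquet",  # no scheme: resolved against the default file system
]

for p in paths:
    print(f"{storage_scheme(p):8} {p}")
```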

## MAGIC COMMANDS

In Databricks, magic commands are prefixed with % or %% and let you interact with different languages or Databricks features within a notebook cell.

### MARKDOWN

`%md`  
  Render Markdown content.
![](https://i.postimg.cc/MTmpnW7x/dbo.png)

### PYTHON

`%python`  
  Run a cell using Python.  

![](https://i.postimg.cc/MTmpnW7x/dbo.png)

In [0]:
%python
print("hello, all!")

### SQL
`%sql`  
  Run SQL queries directly.
![](https://i.postimg.cc/MTmpnW7x/dbo.png)

In [0]:
%sql
SELECT current_date

### SCALA
`%scala`  
  Run Scala code in a cell.
![](https://i.postimg.cc/MTmpnW7x/dbo.png)

In [0]:
%scala
val x = 10; 
println(x)

### R
`%r`  
  Run R code in a cell.
![](https://i.postimg.cc/MTmpnW7x/dbo.png)

In [0]:
%r  
summary(cars)

### BASH
`%sh`  
  Run shell/bash commands.
![](https://i.postimg.cc/MTmpnW7x/dbo.png)

In [0]:
%sh  
ls -lrt

### DBFS
`%fs`  
 Interact with the Databricks File System (DBFS).

In [0]:
%fs ls 

### MODULAR
 Include and run another notebook or file.

#### RUN
`%run`  
 Include and run another notebook.
![](https://i.postimg.cc/MTmpnW7x/dbo.png)

In [0]:
%run ./modularx/notebook

In [0]:
print(simple_header())

In [0]:
print(YEAR)

#### IMPORT

In [0]:
from modularx.utilities import set_config

In [0]:
output = set_config(
  spark_master='spark://5.6.7.8:7077',
  spark_executor_memory='4g',
  spark_eventLog_enabled='true',
  spark_serializer='org.apache.spark.serializer.KryoSerializer',
)

print(output)
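
The source of `modularx.utilities` is not shown here. Purely as an illustration, a helper like `set_config` could map its keyword arguments to Spark property names by replacing underscores with dots. The following is an assumed implementation, not the actual module:

```python
def set_config(**options):
    """Hypothetical stand-in for modularx.utilities.set_config.

    Converts keyword names such as `spark_executor_memory` into Spark
    property keys such as `spark.executor.memory`.
    """
    return {name.replace("_", "."): value for name, value in options.items()}


output = set_config(
    spark_master="spark://5.6.7.8:7077",
    spark_executor_memory="4g",
)
print(output)
```

A mapping this simple would mangle any property whose real name contains an underscore, so a production helper would likely need an explicit lookup table.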

## NOTEBOOKS

Once you have a DataFrame (commonly named df), there are several essential methods you can use to inspect, manipulate, and analyze your data effectively. These methods help you understand the structure, contents, and performance implications of your DataFrame, especially when working in distributed environments like Spark.


### df.collect()
Returns all rows of the DataFrame to the driver as a list of Row objects. Use it carefully: collecting a large DataFrame can exhaust the driver's memory.
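
When only a sample is needed, bounding the row count before collecting keeps the driver safe. A small sketch, assuming `df` is any Spark DataFrame; `collect_sample` is a hypothetical helper name:

```python
def collect_sample(df, n=5):
    """Bring at most `n` rows to the driver instead of the whole DataFrame."""
    return df.limit(n).collect()


# Usage in a notebook where `df` already exists:
# for row in collect_sample(df, 3):
#     print(row.asDict())
```

`df.take(n)` is a shorthand for the same limit-then-collect pattern.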

### df.show()	
Prints a tabular view of the DataFrame in the console, with a limited number of rows (default 20).

### df.printSchema()
Prints the schema of the DataFrame in a tree format, showing each column name, data type, and whether it is nullable. Very useful for understanding the structure of your data.

### df.show()

Prints the **first `n` rows** of a DataFrame (20 by default) in a **tabular format** on the console. With `truncate=True`, strings longer than 20 characters are shortened.

```python
df.show(n=20, truncate=True)
```
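
How `truncate` shortens a single cell can be imitated in plain Python. This sketch mirrors the observable behaviour (values longer than the limit are cut and end in `...`) and is not Spark's own code. `truncate_cell` is a hypothetical name, and it takes the limit as an integer, whereas `df.show()` also accepts `True` (meaning 20) and `False` (no truncation):

```python
def truncate_cell(value, truncate=20):
    """Shorten a cell the way df.show() displays long strings."""
    text = str(value)
    if truncate <= 0 or len(text) <= truncate:
        return text  # short enough, or truncation disabled
    if truncate < 4:
        return text[:truncate]  # no room for an ellipsis
    return text[: truncate - 3] + "..."


print(truncate_cell("Universidad Nacional de Colombia"))
print(truncate_cell("short value"))
```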

### df.explain()

In Apache Spark, the method .explain() is used to display the execution plan for a DataFrame or SQL query. It helps developers understand how Spark will process their code, including details about:

* Logical plans
* Physical plans
* Optimizations (e.g., predicate pushdown, broadcast joins)

```python
df.explain(True)             # Includes physical and logical plan
df.explain(mode="extended")  # Same as above
```
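
Since Spark 3.0, `explain()` also accepts other `mode` values such as `"simple"` and `"formatted"`. A short sketch that prints the plan once per mode, assuming `df` is an existing DataFrame; `print_plans` is a hypothetical helper name:

```python
def print_plans(df, modes=("simple", "extended", "formatted")):
    """Print the execution plan of `df` once per explain mode."""
    for mode in modes:
        print(f"--- mode={mode} ---")
        df.explain(mode=mode)


# Usage in a notebook where `df` already exists:
# print_plans(df.filter("id > 10"))
```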

### display(df)

In Databricks notebooks, `display(df)` renders the DataFrame as an interactive table instead of plain console text.

![](https://i.postimg.cc/MTmpnW7x/dbo.png)

### df.display()

Same as `display(df)` above, but called as a method on the DataFrame.

![](https://i.postimg.cc/MTmpnW7x/dbo.png)

### displayHTML(html_code)
Renders HTML code inside Databricks notebooks, which is useful for showing custom tables, reports, or web content.

![](https://i.postimg.cc/MTmpnW7x/dbo.png)

In [0]:
displayHTML(
"""
<div style="display: flex; align-items: center; background-color: #fff4e5; border: 1px solid #f47920; padding: 10px 14px; border-radius: 6px; font-family: Arial, sans-serif; color: #b34700; font-size: 15px; max-width: 600px;">
  <img src="https://images.icon-icons.com/2699/PNG/512/databricks_logo_icon_170295.png" alt="Databricks" style="width: 20px; height: 20px; margin-right: 10px;">
  <strong>Testing</strong>
</div>
"""

)
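
The HTML passed to `displayHTML` is an ordinary string, so it can be generated from data. A minimal sketch in plain Python; `rows_to_html_table` and the sample rows are made up for illustration, and the final `displayHTML` call is left as a comment because that function only exists inside Databricks:

```python
from html import escape


def rows_to_html_table(headers, rows):
    """Build a small HTML table, escaping every header and cell value."""
    head = "".join(f"<th>{escape(str(h))}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


html_code = rows_to_html_table(["key", "value"], [("a", 1), ("<b>", 2)])
print(html_code)
# In a Databricks notebook: displayHTML(html_code)
```

Escaping each value prevents data that happens to contain markup from being interpreted as HTML.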

### help(object.attribute)

Python's built-in `help()` prints the documentation of any object, such as a PySpark function.

In [0]:
from pyspark.sql import functions as fc

help(fc.upper)
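
`help()` prints to the console; when the text itself is needed, the standard `inspect` module returns the docstring as a string. The sketch below uses the built-in `str.upper` as a stand-in so that it runs without Spark; in a notebook, `fc.upper` could be passed instead. `first_doc_line` is a hypothetical helper name:

```python
import inspect


def first_doc_line(obj):
    """Return the first line of an object's docstring, or '' if it has none."""
    doc = inspect.getdoc(obj) or ""
    return doc.splitlines()[0] if doc else ""


print(first_doc_line(str.upper))
```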

## DBUTILS

`dbutils` is a utility library provided to help you interact with the Databricks environment. It includes tools for working with files, secrets, notebooks, widgets, and more.

![](https://i.postimg.cc/MTmpnW7x/dbo.png)


In [0]:
dbutils.help()

In [0]:
dbutils.fs.help()

In [0]:
dbutils.fs.ls('/')

In [0]:
display(dbutils.fs.ls('/'))

In [0]:
display(dbutils.fs.ls('/user'))
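
`dbutils.fs.ls` returns a list of `FileInfo` entries with fields such as `path`, `name` and `size`, so the result can be filtered with ordinary Python. Since `dbutils` only exists inside Databricks, the sketch below uses a `namedtuple` stand-in with the same field names; `largest_files` and the sample entries are made up for illustration:

```python
from collections import namedtuple

# Stand-in for the entries returned by dbutils.fs.ls (assumed fields only).
FileInfo = namedtuple("FileInfo", ["path", "name", "size"])


def largest_files(entries, suffix=".parquet", top=3):
    """Return up to `top` entries whose name ends with `suffix`, largest first."""
    matches = [e for e in entries if e.name.endswith(suffix)]
    return sorted(matches, key=lambda e: e.size, reverse=True)[:top]


sample = [
    FileInfo("dbfs:/mnt/datalake/a.parquet", "a.parquet", 10),
    FileInfo("dbfs:/mnt/datalake/b.csv", "b.csv", 99),
    FileInfo("dbfs:/mnt/datalake/c.parquet", "c.parquet", 25),
]
print([e.name for e in largest_files(sample)])
# In Databricks: largest_files(dbutils.fs.ls("dbfs:/mnt/datalake/"))
```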

# SQL

## CATALOGS

In data platforms like Databricks, data is organized as:

**catalog.schema.table**

- **Catalog**: Top-level container that holds schemas and defines access control.
- **Schema**: Logical grouping of tables and views (similar to a database).
- **Table**: Structured data stored in rows and columns.

**Example:**
```sql
SELECT * FROM main.sales.customers;
```

- `main` = **catalog**
- `sales` = **schema**
- `customers` = **table**
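
The three-level naming can be taken apart programmatically as well. A minimal sketch, assuming plain identifiers without backtick quoting or embedded dots; `split_table_name` is a hypothetical helper:

```python
def split_table_name(full_name):
    """Split 'catalog.schema.table' into its three parts."""
    parts = full_name.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected catalog.schema.table, got {full_name!r}")
    catalog, schema, table = parts
    return {"catalog": catalog, "schema": schema, "table": table}


print(split_table_name("main.sales.customers"))
```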

### input_file_name()

The function `input_file_name()` in Databricks (and Apache Spark) returns the full path of the file from which each row of the DataFrame was read.

```sql
SELECT input_file_name() AS file_path, * FROM my_table;
```