
<img src="https://databricks.gallerycdn.vsassets.io/extensions/databricks/databricks/1.1.5/1696858282359/Microsoft.VisualStudio.Services.Icons.Default" alt="iconDatabricks" width="100"/>

<br>


## Databricks with PySpark
------


- What is PySpark?
- What is a Cluster?
- Notebooks
- DataFrame
- Utilities (dbutils)
- FileSystem


## What is PySpark?
<br>
PySpark is the Python library for Apache Spark. <br>
PySpark provides a user-friendly API for interacting with Spark's distributed computing capabilities.

<br>
<br>

<img src="https://3.bp.blogspot.com/-tlCzGQ9Tslw/Wn3rA1eJM4I/AAAAAAAAEE4/nmHxKp3qWbkz1Ehzv792izraR_wxjEKhQCLcBGAs/s1600/ApacheSpark.JPG" alt="Apache Spark" width="400"/>

<br>


PySpark supports all of Spark’s features (modules), such as Spark SQL, DataFrames, Structured Streaming, Machine Learning (MLlib), and Spark Core.

[PySpark Docs](https://spark.apache.org/docs/latest/api/python/index.html)

## What is a Cluster?
An Azure Databricks cluster is a set of computation resources and configurations on which you run data engineering, data science, and data analytics workloads, such as production ETL pipelines, streaming analytics, ad-hoc analytics, and machine learning. 
<br>
<br>
You run these workloads as a set of commands in a notebook or as an automated job.
<br>


<img src="https://516237376-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MIIWE47MSPMOIxmLgUz%2F-MdGKuZ-zriRADaPpkEi%2F-MdGL-5SaZRHuVHEODP2%2F001-Azure%20Data%20Lake%20Storage%20Credential%20Passthrough.png?alt=media&token=8311d127-558e-4b74-865c-f3af04d15dba" alt="Cluster" width="400"/>

[Databricks Cluster Docs](https://learn.microsoft.com/en-us/azure/databricks/clusters/)

## Notebooks
A notebook is a collection of cells that run code on a Databricks Spark cluster. <br>
You can run different languages in the same notebook using magic commands. <br>

--------
**Magic commands** override the notebook's default language for the cell in which they appear.

<br> 

>1. %python
>2. %scala
>3. %md
>4. %sql
>5. %r
<br>

[Azure Databricks Notebook Docs](https://learn.microsoft.com/en-us/azure/databricks/notebooks/)
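
To make the idea concrete, here is a minimal sketch in plain Python of how a magic command selects the language of a single cell. The helper `cell_language` and the `MAGIC_LANGUAGES` table are hypothetical names written for this illustration; they are not part of Databricks. The sketch also ignores leading blank lines, which is a simplification rather than documented behavior.

```python
# Hypothetical lookup table: magic command -> language used for that cell.
MAGIC_LANGUAGES = {
    "%python": "python",
    "%scala": "scala",
    "%sql": "sql",
    "%r": "r",
    "%md": "markdown",
}


def cell_language(cell_source, default_language="python"):
    """Return the language a cell would use, based on its first non-empty line."""
    for line in cell_source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # simplification: skip leading blank lines
        first_token = stripped.split()[0]
        # Anything that is not a known magic command falls back to the default.
        return MAGIC_LANGUAGES.get(first_token, default_language)
    return default_language  # empty cell


print(cell_language("print('Hola Everyone')"))   # python
print(cell_language("%sql\nSELECT 1 AS one"))    # sql
print(cell_language("#%sql\nSELECT 1 AS one"))   # python
```

Note the last call: once the magic command is commented out, the cell falls back to the notebook's default language, so the SQL statement below it would be treated as Python.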

In [0]:
# this command cell is a python cell
print('Hola Everyone')

In [0]:
%sql
-- Magic commands override the default language of the notebook for this cell

SELECT 'this message comes from a SQL command' AS sql_command

In [0]:
# The result of the most recent %sql cell is stored in a DataFrame variable called _sqldf
_sqldf.show()

In [0]:

%scala
val msg = "this command is running scala language"
print(msg)
// Magic Command to execute scala

## DataFrame

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.  
<br>
DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.  

[DataFrames on Azure Databricks](https://learn.microsoft.com/en-us/azure/databricks/getting-started/dataframes-python)

In [0]:
# We are going to create a DataFrame from a list of tuples using the 'createDataFrame' function

from datetime import datetime, date
df = spark.createDataFrame([
    (1, 2., 'name1', date(2023, 1, 1), datetime(2023, 1, 1, 12, 0)),
    (2, 3., 'name2', date(2023, 2, 1), datetime(2023, 1, 2, 12, 0)),
    (3, 4., 'name3', date(2023, 3, 1), datetime(2023, 1, 3, 12, 0))
], schema='id long, value double, name string, date date, time timestamp')
df.show()

In [0]:
print(df.dtypes)
# df.printSchema()  # prints the schema as a tree; it returns None, so wrapping it in print() is unnecessary
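
As a small illustration of what `df.dtypes` gives you, the sketch below works on a plain Python list of `(column, type)` pairs. The list is what we would expect `df.dtypes` to return for the DataFrame above (Spark reports `long` columns as `bigint`); it is written by hand here, not captured from a real run. The helpers `dtypes_to_ddl` and `columns_of_type` are hypothetical names for this example.

```python
# Assumed value of df.dtypes for the DataFrame created above (written by hand).
dtypes = [
    ("id", "bigint"),
    ("value", "double"),
    ("name", "string"),
    ("date", "date"),
    ("time", "timestamp"),
]


def dtypes_to_ddl(pairs):
    """Join (column, type) pairs into a DDL-style schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in pairs)


def columns_of_type(pairs, wanted):
    """Return the names of all columns whose type equals `wanted`."""
    return [name for name, dtype in pairs if dtype == wanted]


ddl = dtypes_to_ddl(dtypes)
print(ddl)
print(columns_of_type(dtypes, "double"))
```

A string in this form has the same shape as the `schema=` argument passed to `createDataFrame` earlier, so it can be handy when you want to recreate an empty DataFrame with the same columns.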

### Notation

In [0]:
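# Column notations: by name (string), bracket notation, and dot notation
display(df.select('id'))
# display(df.select('ID'))      # column names are case-insensitive by default
# display(df.select(df['id']))  # bracket notation
# display(df.select(df.id))     # dot notation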

## Utilities
The Databricks Utilities module (dbutils) provides various utilities for interacting with Databricks. <br>
We are going to focus on the file system utilities.

In [0]:
dbutils.help()
# Lets us run other notebooks from inside the current notebook
dbutils.notebook.help()
# Lets us pass parameters between notebooks or from Azure Data Factory (ADF)
dbutils.widgets.help()

## FileSystem
dbutils.fs <br>
Provides utilities for working with FileSystems. <br> 
Most methods in this package can take either a DBFS path (e.g., "/foo" or "dbfs:/foo"), or another FileSystem URI. 
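
The two DBFS spellings above refer to the same location. As a rough sketch of that equivalence, the hypothetical helper `to_dbfs_uri` below rewrites a bare absolute path into the explicit `dbfs:/` form and leaves any path that already carries a URI scheme untouched. It is plain Python for illustration only; `dbutils.fs` does this resolution for you.

```python
import re

# A URI scheme such as "dbfs:", "abfss:" or "file:" at the start of the path.
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def to_dbfs_uri(path):
    """Return `path` in explicit 'dbfs:/...' form when it is a bare absolute path."""
    if _SCHEME.match(path):
        return path  # already has a scheme: dbfs:/..., abfss://..., file:/...
    if path.startswith("/"):
        return "dbfs:" + path
    # Rejecting relative paths is a choice made for this sketch only.
    raise ValueError(f"expected an absolute path or a URI, got: {path!r}")


print(to_dbfs_uri("/foo"))        # dbfs:/foo
print(to_dbfs_uri("dbfs:/foo"))   # dbfs:/foo
```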

In [0]:
dbutils.fs.help()

In [0]:
dbutils.fs.help('ls')

In [0]:
# Displays what is mounted within DBFS: each entry shows the mountPoint (short path) and its source
# Mounts let you:
#   - access files using file system semantics (paths) instead of URLs
#   - access data without supplying credentials each time
#   - store files in object storage
display(dbutils.fs.mounts())
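
To show how a mount point relates to its source, here is a minimal sketch that translates a mounted path into the underlying source URI. It uses a `namedtuple` as a stand-in for the entries returned by `dbutils.fs.mounts()`, keeping only the `mountPoint` and `source` fields. The mount points, container, and account names are made up for the example, and `resolve_mount` is a hypothetical helper, not a Databricks function.

```python
from collections import namedtuple

# Stand-in for the entries returned by dbutils.fs.mounts() (only two fields kept).
MountInfo = namedtuple("MountInfo", ["mountPoint", "source"])

# Hypothetical mounts; the container and account names are invented.
mounts = [
    MountInfo("/mnt/example", "abfss://container@account.dfs.core.windows.net/"),
    MountInfo("/mnt/example/nested", "abfss://other@account.dfs.core.windows.net/base"),
]


def resolve_mount(path, mount_list):
    """Translate a mounted path to its source URI, or return None if nothing matches."""
    best = None
    for mount in mount_list:
        point = mount.mountPoint.rstrip("/")
        if path == point or path.startswith(point + "/"):
            # Prefer the longest (most specific) matching mount point.
            if best is None or len(point) > len(best.mountPoint.rstrip("/")):
                best = mount
    if best is None:
        return None
    remainder = path[len(best.mountPoint.rstrip("/")):].lstrip("/")
    base = best.source.rstrip("/")
    return f"{base}/{remainder}" if remainder else base + "/"


print(resolve_mount("/mnt/example/data/file.csv", mounts))
print(resolve_mount("/mnt/example/nested/x", mounts))
```

The same directory can therefore be listed either through its short mounted path or through its full source URI.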

In [0]:
# Lists the contents of a directory (or container)
display(dbutils.fs.ls('/databricks-datasets/'))
#display(dbutils.fs.ls('/databricks-datasets/wine-quality/'))


In [0]:
datasets = dbutils.fs.ls('/databricks-datasets/wine-quality/')
display(datasets)
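
The result of `dbutils.fs.ls` is a regular Python list, so it can be filtered and aggregated like any other list. The sketch below assumes each entry exposes `path`, `name`, and `size`, and that directories appear with a trailing `/` in `name` and a size of 0. A `namedtuple` stands in for the real entries so the example runs anywhere; the file names and sizes are invented, and `summarize` is a hypothetical helper.

```python
from collections import namedtuple

# Stand-in for the entries returned by dbutils.fs.ls (only the fields used here).
FileInfo = namedtuple("FileInfo", ["path", "name", "size"])

# Hypothetical listing; names and sizes are made up for illustration.
listing = [
    FileInfo("dbfs:/example/data/README.md", "README.md", 1200),
    FileInfo("dbfs:/example/data/part-a.csv", "part-a.csv", 84000),
    FileInfo("dbfs:/example/data/part-b.csv", "part-b.csv", 264000),
    FileInfo("dbfs:/example/data/archive/", "archive/", 0),
]


def summarize(entries, suffix=".csv"):
    """Count the files whose name ends with `suffix` and add up their sizes."""
    matching = [entry for entry in entries if entry.name.endswith(suffix)]
    return {
        "count": len(matching),
        "total_bytes": sum(entry.size for entry in matching),
        "names": sorted(entry.name for entry in matching),
    }


summary = summarize(listing)
print(summary)
```

In a notebook you would pass the real listing instead, for example `summarize(datasets)`.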

In [0]:
disney = dbutils.fs.ls('/mnt/adl2/d3/70_training_dataset_D3/public_datasets/Disney/')
display(disney)


In [0]:
disney_full = dbutils.fs.ls('abfss://d3-shared-data@cdlprdadl2weu.dfs.core.windows.net/70_training_dataset_D3/public_datasets/Disney/')
display(disney_full)