Let's create a first notebook and attach it to a cluster. We can choose a `Python` notebook and make use of PySpark. Plea

In [0]:
print("Hello world")

Hello world


Where are we actually sitting? What files are available to us?

In [0]:
dbutils.fs.ls("/")

[FileInfo(path='dbfs:/databricks-datasets/', name='databricks-datasets/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/databricks-results/', name='databricks-results/', size=0, modificationTime=0)]

In [0]:
%fs ls

path,name,size,modificationTime
dbfs:/databricks-datasets/,databricks-datasets/,0,0
dbfs:/databricks-results/,databricks-results/,0,0


In [0]:
%fs ls databricks-datasets

path,name,size,modificationTime
dbfs:/databricks-datasets/COVID/,COVID/,0,0
dbfs:/databricks-datasets/README.md,README.md,976,1532468253000
dbfs:/databricks-datasets/Rdatasets/,Rdatasets/,0,0
dbfs:/databricks-datasets/SPARK_README.md,SPARK_README.md,3359,1455043490000
dbfs:/databricks-datasets/adult/,adult/,0,0
dbfs:/databricks-datasets/airlines/,airlines/,0,0
dbfs:/databricks-datasets/amazon/,amazon/,0,0
dbfs:/databricks-datasets/asa/,asa/,0,0
dbfs:/databricks-datasets/atlas_higgs/,atlas_higgs/,0,0
dbfs:/databricks-datasets/bikeSharing/,bikeSharing/,0,0


[DBFS](https://docs.databricks.com/data/databricks-file-system.html) is an abstraction on top of scalable object storage and offers the following benefits:

- Allows you to mount storage objects so that you can seamlessly access data without requiring credentials.
- Allows you to interact with object storage using directory and file semantics instead of storage URLs.
- Persists files to object storage, so you won’t lose data after you terminate a cluster.

Where can I play around?

In [0]:
%fs mkdirs mywork

path,name,size,modificationTime
dbfs:/databricks-datasets/,databricks-datasets/,0,0
dbfs:/databricks-results/,databricks-results/,0,0


In [0]:
%fs ls 

path,name,size,modificationTime
dbfs:/databricks-datasets/,databricks-datasets/,0,0
dbfs:/databricks-results/,databricks-results/,0,0


In [0]:
%fs ls mywork

## Read

We'll now read some files as strings and as DataFrames.

In [0]:
filename = "dbfs:/databricks-datasets/README.md"

df = spark.read.text(filename)
df.show(10, truncate=False)

+-------------------------------------------------------------------------------+
|value                                                                          |
+-------------------------------------------------------------------------------+
|Databricks Hosted Datasets                                                     |
|                                                                               |
|The data contained within this directory is hosted for users to build          |
|data pipelines using Apache Spark and Databricks.                              |
|                                                                               |
|License                                                                        |
|-------                                                                        |
|Unless otherwise noted (e.g. within the README for a given data set), the data |
|is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0),  |
+---------------

In [0]:
print(df.count())
print(type(df))

23
<class 'pyspark.sql.dataframe.DataFrame'>


## Dataframes

Then, what is this `DataFrame` class?

A [DataFrame](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.html) is similar to a relational table in SQL or to a Pandas dataframe.

Once created, it can be manipulated using the various domain-specific language (DSL) functions. We'll find out more about them during this training.

Another data structure that can be used to process data in Spark is the [RDD](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.RDD.html#pyspark.RDD), or Resilient Distributed Dataset, the basic abstraction in Spark. Throughout this course, we will focus on `DataFrame`s. Since Spark 2.x, RDDs have been relegated to the low-level API and are not recommended for common use cases.

## What's happening behind the scenes?

To understand what’s happening under the hood with our sample code, you’ll need to be familiar with some of the key concepts of a Spark application and how the code is transformed and executed as tasks across the Spark executors. We’ll begin by defining some important terms:

- **Application**. A user program built on Spark using its APIs. It consists of a driver program and executors on the cluster.
- **SparkSession**. An object that provides a point of entry to interact with underlying Spark functionality and allows programming Spark with its APIs. In an interactive Spark shell, the Spark driver instantiates a SparkSession for you, while in a Spark application, you create a SparkSession object yourself.
- **Job**. A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g., save(), collect()).
- **Stage**. Each job gets divided into smaller sets of tasks called stages that depend on each other.
- **Task**. A single unit of work or execution that will be sent to a Spark executor.

We now move back to our slides!