Let's create a first notebook and attach it to a cluster. We can choose `Python` as the notebook language and make use of PySpark.

In [None]:
print("Hello world")

Where are we actually sitting? What files are available to us?

In [None]:
dbutils.fs.ls("/")

In [None]:
%fs ls

In [None]:
%fs ls databricks-datasets

[DBFS](https://docs.databricks.com/data/databricks-file-system.html) is an abstraction on top of scalable object storage and offers the following benefits:

- Allows you to mount storage objects so that you can seamlessly access data without requiring credentials.
- Allows you to interact with object storage using directory and file semantics instead of storage URLs.
- Persists files to object storage, so you won’t lose data after you terminate a cluster.
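To make the "directory and file semantics" point concrete, here is a minimal sketch in plain Python. It assumes the cluster exposes DBFS on the driver's local filesystem under `/dbfs` (this depends on the cluster configuration), and the helper name `dbfs_uri_to_local_path` is made up for this illustration; it is not part of `dbutils` or Spark.

```python
def dbfs_uri_to_local_path(uri: str) -> str:
    """Translate a dbfs:/ URI into the matching path under the /dbfs mount.

    Assumption: the driver exposes DBFS at /dbfs. This hypothetical helper
    only rewrites the string; it does not check that the file exists.
    """
    prefix = "dbfs:/"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a dbfs:/ URI, got {uri!r}")
    # Drop the scheme and re-root the remainder under /dbfs.
    return "/dbfs/" + uri[len(prefix):].lstrip("/")


local_path = dbfs_uri_to_local_path("dbfs:/databricks-datasets/README.md")
print(local_path)
```

On a cluster where that mount is available, the returned path could be passed to Python's built-in `open()` to read the file as an ordinary string.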

Where can I play around?

In [None]:
%fs ls FileStore/shared_uploads

## Read

We'll now read some files as strings and as DataFrames.

In [None]:
filename = "dbfs:/databricks-datasets/README.md"

df = spark.read.text(filename)
df.show(10, truncate=False)

In [None]:
df.count()

## DataFrames and RDDs

Spark computations are expressed as operations. These operations are then converted into low-level RDD-based bytecode as tasks, which are distributed to Spark’s executors for execution.

Also note that we used the high-level Structured APIs to read a text file into a Spark DataFrame rather than an RDD. Throughout the book, we will focus more on these Structured APIs; since Spark 2.x, RDDs are now consigned to low-level APIs.

To understand what’s happening under the hood with our sample code, you’ll need to be familiar with some of the key concepts of a Spark application and how the code is transformed and executed as tasks across the Spark executors. We’ll begin by defining some important terms:

- **Application**. A user program built on Spark using its APIs. It consists of a driver program and executors on the cluster.
- **SparkSession**. An object that provides a point of entry to interact with underlying Spark functionality and allows programming Spark with its APIs. In an interactive Spark shell, the Spark driver instantiates a SparkSession for you, while in a Spark application, you create a SparkSession object yourself.
- **Job**. A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g., save(), collect()).
- **Stage**. Each job gets divided into smaller sets of tasks called stages that depend on each other.
- **Task**. A single unit of work or execution that will be sent to a Spark executor.
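The relationship between actions, jobs, and tasks can be sketched with a toy model in plain Python. This is not Spark's API: `ToyDriver` and `ToyDataset` are hypothetical classes invented for this illustration, stages are left out for brevity, and the tasks run sequentially in one process rather than on executors. The sketch only shows that recording a transformation does no work, while an action spawns a job made of one task per partition.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Predicate = Callable[[int], bool]


@dataclass
class ToyDriver:
    """Records every job that an action triggers (hypothetical, not a Spark class)."""
    jobs: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class ToyDataset:
    """Data split into partitions, plus the transformations recorded so far."""
    driver: ToyDriver
    partitions: Tuple[Tuple[int, ...], ...]
    filters: Tuple[Predicate, ...] = ()

    def filter(self, predicate: Predicate) -> "ToyDataset":
        # Transformation: remember the step and return a new dataset; no work happens.
        return ToyDataset(self.driver, self.partitions, self.filters + (predicate,))

    def _run_task(self, partition: Tuple[int, ...]) -> int:
        # Task: one unit of work over a single partition.
        return sum(1 for x in partition if all(f(x) for f in self.filters))

    def count(self) -> int:
        # Action: spawns a job consisting of one task per partition.
        self.driver.jobs.append("count")
        return sum(self._run_task(p) for p in self.partitions)


driver = ToyDriver()
data = ToyDataset(driver, ((1, 2, 3), (4, 5, 6), (7, 8)))

evens = data.filter(lambda x: x % 2 == 0)  # transformation only, no job yet
jobs_before_action = len(driver.jobs)

even_count = evens.count()                 # the action triggers the job
jobs_after_action = len(driver.jobs)

print(jobs_before_action, even_count, jobs_after_action)
```

In the notebook itself, the analogous calls would be a DataFrame transformation such as `df.filter(...)` followed by an action such as `df.count()`.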

We now move back to our slides!