## Quick Start Using Python
* This Databricks notebook showcases DataFrame operations in Python (PySpark)
* Reference: http://spark.apache.org/docs/latest/quick-start.html

In [2]:
# Take a look at the file system
display(dbutils.fs.ls("/databricks-datasets/samples/docs/"))

path,name,size
dbfs:/databricks-datasets/samples/docs/README.md,README.md,3137


DataFrames have ***transformations***, which return pointers to new DataFrames, and ***actions***, which return values.

In [4]:
# transformation: define a DataFrame over the README file (one row per line)
textFile = spark.read.text("/databricks-datasets/samples/docs/README.md")

In [5]:
# action: count the rows (lines) in the DataFrame
textFile.count()

In [6]:
# Output the first line from the text file
textFile.first()

Next, use a filter ***transformation*** to return a new DataFrame containing a subset of the lines in the file. First check the column names: `spark.read.text` stores each line in a single column named `value`.

In [8]:
textFile.columns

In [9]:
# Keep only the lines that contain "Spark"
linesWithSpark = textFile.filter(textFile.value.contains("Spark"))

In [10]:
# Equivalent filter that references the column with col() instead of attribute access
from pyspark.sql.functions import col
linesWithSpark = textFile.filter(col("value").contains("Spark"))

Notice that the filter cells complete quickly: a filter is a transformation, and no action has run yet, so Spark has not processed the data.  
* When you run the actions below (e.g. count, take, show), you will see the jobs execute.
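
The deferred execution described above can be illustrated with a plain-Python analogy. This is a minimal sketch, not Spark code: the built-in `filter` stands in for a DataFrame transformation, and the names `sample_lines`, `contains_spark`, and `evaluated` are hypothetical, introduced only for this illustration.

```python
# Plain-Python analogy for lazy evaluation (assumption: this only mimics the
# idea; it does not use Spark or reflect how Spark is implemented).

# Hypothetical sample data standing in for the lines of a text file.
sample_lines = [
    "# Apache Spark",
    "Spark is a unified analytics engine.",
    "See the online documentation for details.",
    "Building Spark uses Apache Maven.",
]

evaluated = []  # records every line the predicate has actually inspected


def contains_spark(line):
    """Predicate that logs each line it is asked to inspect."""
    evaluated.append(line)
    return "Spark" in line


# "Transformation": filter() only builds a lazy iterator; nothing is inspected yet.
lines_with_spark = filter(contains_spark, sample_lines)
inspected_before_action = len(evaluated)

# "Action": counting the matches forces the predicate to run over every line.
match_count = sum(1 for _ in lines_with_spark)
inspected_after_action = len(evaluated)

print(inspected_before_action, inspected_after_action, match_count)  # prints "0 4 3"
```

One difference to keep in mind: a Python iterator is consumed after a single pass, whereas a Spark DataFrame can be used in several actions, and each action re-runs the plan unless the DataFrame is cached.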

In [12]:
# Perform a count (action)
linesWithSpark.count()

In [13]:
# Output the first five rows using take
linesWithSpark.take(5)

In [14]:
# Output the first five rows using show
linesWithSpark.show(5, truncate=False)