# Reading Data - CSV Files

**Technical Accomplishments:**
- Start working with the API documentation
- Introduce the class `SparkSession` and other entry points
- Introduce the class `DataFrameReader`
- Read data from:
  * CSV without a Schema.
  * CSV with a Schema.

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

Run the following cell to configure our "classroom."

In [0]:
%run "./Includes/Classroom-Setup"

In [0]:
%run "./Includes/Utility-Methods"

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Entry Points

Our entry point for Spark 2.0 applications is the class `SparkSession`.

An instance of this class has already been created for us, as we can see by running the next cell:

In [0]:
print(spark)

<pyspark.sql.session.SparkSession object at 0x7fd5b86de9a0>


It's worth noting that in Spark 2.0 `SparkSession` is a replacement for the other entry points:
* `SparkContext`, available in our notebook as **sc**.
* `SQLContext`, or more specifically its subclass `HiveContext`, available in our notebook as **sqlContext**.

In [0]:
print(sc)
print(sqlContext)

<SparkContext master=local[8] appName=Databricks Shell>
<pyspark.sql.context.SQLContext object at 0x7fd5b86deac0>


Before we can dig into the functionality of the `SparkSession` class, we need to know how to access the API documentation for Apache Spark.

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Spark API

### **Spark API Home Page**
0. Open a new browser tab.
0. Search for **Spark API Latest** or **Spark API _x.x.x_** for a specific version.
0. Select **Spark API Documentation - Spark _x.x.x_ Documentation - Apache Spark**. 
0. Which set of documentation you will use depends on which language you will use.

Other Documentation:
* Programming Guides for DataFrames, SQL, Graphs, Machine Learning, Streaming...
* Deployment Guides for Spark Standalone, Mesos, Yarn...
* Configuration, Monitoring, Tuning, Security...

Here are some shortcuts:
  * <a href="https://spark.apache.org/docs/latest/api/" target="_blank">Spark API Documentation - Latest</a>
  * <a href="https://spark.apache.org/docs/2.4.0/" target="_blank">Spark API Documentation - 2.4.0</a>
  * <a href="https://spark.apache.org/docs/2.2.0/" target="_blank">Spark API Documentation - 2.2.0</a>
  * <a href="https://spark.apache.org/docs/2.1.1/" target="_blank">Spark API Documentation - 2.1.1</a>
  * <a href="https://spark.apache.org/docs/2.0.2/" target="_blank">Spark API Documentation - 2.0.2</a>
  * <a href="https://spark.apache.org/docs/1.6.3/" target="_blank">Spark API Documentation - 1.6.3</a>

### Spark API (Scala)

0. Select **Spark Scala API (Scaladoc)**.
0. Look up the documentation for `org.apache.spark.sql.SparkSession`.
  0. In the upper-left-hand-corner type **SparkSession** into the search field.
  0. The search will execute automatically.
  0. In the class/package list, click on **SparkSession**.
  0. The documentation should open in the right-hand pane.

### Spark API (Python)

0. Select **Spark Python API (Sphinx)**.
0. Look up the documentation for `pyspark.sql.SparkSession`.
  0. In the lower-left-hand-corner type **SparkSession** into the search field.
  0. Hit **[Enter]**.
  0. The search results should appear in the right-hand pane.
  0. Click on **pyspark.sql.SparkSession (Python class, in pyspark.sql module)**
  0. The documentation should open in the right-hand pane.

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) SparkSession

Quick function review:
* `createDataset(..)`
* `createDataFrame(..)`
* `emptyDataset(..)`
* `emptyDataFrame(..)`
* `range(..)`
* `read(..)`
* `readStream(..)`
* `sparkContext(..)`
* `sqlContext(..)`
* `sql(..)`
* `streams(..)`
* `table(..)`
* `udf(..)`

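The one we are most interested in is `SparkSession.read`, which returns a `DataFrameReader`. In Python it is a property, so it is written `spark.read` with no parentheses, as in the cells later in this notebook.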

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) DataFrameReader

Look up the documentation for `DataFrameReader`.

Quick function review:
* `csv(path)`
* `jdbc(url, table, ..., connectionProperties)`
* `json(path)`
* `format(source)`
* `load(path)`
* `orc(path)`
* `parquet(path)`
* `table(tableName)`
* `text(path)`
* `textFile(path)`

Configuration methods:
* `option(key, value)`
* `options(map)`
* `schema(schema)`

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Reading from CSV w/InferSchema

We are going to start by reading in a very simple text file.

### The Data Source
* For this exercise, we will be using a tab-separated file called **pageviews_by_second.tsv** (a file of about 262 MB from Wikipedia)
* We can use **%fs ls ...** to view the file on the DBFS.

In [0]:
%fs ls /mnt/training/wikipedia/pageviews/

path,name,size,modificationTime
dbfs:/mnt/training/wikipedia/pageviews/pageviews_by_second.parquet/,pageviews_by_second.parquet/,0,0
dbfs:/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv,pageviews_by_second.tsv,262099389,1509989475000


We can use **%fs head ...** to peek at the first couple thousand characters of the file.

In [0]:
%fs head /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv

There are a couple of things to note here:
* The file has a header.
* The file is tab separated (we can infer that from the file extension and the lack of other characters between each "column").
* The first two columns are strings and the third is a number.

Knowing those details, we can read in the "CSV" file.

### Step #1 - Read The CSV File
Let's start with the bare minimum by specifying the tab character as the delimiter and the location of the file:

In [0]:
# A reference to our tab-separated-file
csvFile = "/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv"

tempDF = (spark.read           # The DataFrameReader
   .option("sep", "\t")        # Use tab delimiter (default is comma-separator)
   .csv(csvFile)               # Creates a DataFrame from CSV after reading in the file
)

This is guaranteed to <u>trigger one job</u>.

A *Job* is triggered any time we are "physically" __required to touch the data__.

In some cases, __one action may create multiple jobs__ (multiple reasons to touch the data).

In this case, the reader has to __"peek" at the first line__ of the file to determine how many columns of data we have.
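
As a rough analogy in plain Python (not Spark's actual implementation), counting the columns requires reading only the first line, however large the file is. The sample text is made up to mirror the layout of the file.

```python
import io

# Hypothetical sample mirroring the layout of the TSV file (header plus two rows).
sample = io.StringIO(
    "timestamp\tsite\trequests\n"
    "2015-03-16T00:09:55\tmobile\t1595\n"
    "2015-03-16T00:19:39\tdesktop\t2460\n"
)

# Read only the first line; the rest of the "file" is never touched.
first_line = sample.readline().rstrip("\n")
num_columns = len(first_line.split("\t"))

# Without the header option, the columns receive generated names: _c0, _c1, ...
generated_names = ["_c" + str(i) for i in range(num_columns)]
print(generated_names)
```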

We can see the structure of the `DataFrame` by executing the command `printSchema()`

It prints to the console the name of each column, its data type, and whether it is nullable.

**Note:** *We will be covering the other `DataFrame` functions in other notebooks.*

In [0]:
tempDF.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)



We can see from the schema that...
* there are three columns
* the column names are **_c0**, **_c1**, and **_c2** (automatically generated names)
* all three columns are **strings**
* all three columns are **nullable**

And if we take a quick peek at the data, we can see that line #1 contains the headers and not data:

In [0]:
display(tempDF)

_c0,_c1,_c2
timestamp,site,requests
2015-03-16T00:09:55,mobile,1595
2015-03-16T00:10:39,mobile,1544
2015-03-16T00:19:39,desktop,2460
2015-03-16T00:38:11,desktop,2237
2015-03-16T00:42:40,mobile,1656
2015-03-16T00:52:24,desktop,2452
2015-03-16T00:54:16,mobile,1654
2015-03-16T01:18:11,mobile,1720
2015-03-16T01:30:32,desktop,2288


### Step #2 - Use the File's Header
Next, we can add an option that tells the reader that the data contains a header and to use that header to determine our column names.

**Note:** *We know we have a header based on what we saw in the "head" of the file earlier.*

In [0]:
(spark.read                    # The DataFrameReader
   .option("sep", "\t")        # Use tab delimiter (default is comma-separator)
   .option("header", "true")   # Use first line of all files as header
   .csv(csvFile)               # Creates a DataFrame from CSV after reading in the file
   .printSchema()
)

root
 |-- timestamp: string (nullable = true)
 |-- site: string (nullable = true)
 |-- requests: string (nullable = true)



A couple of notes about this iteration:
* again, only one job
* there are three columns
* all three columns are **strings**
* all three columns are **nullable**
* the column names are specified: **timestamp**, **site**, and **requests** (the change we were looking for)

A "peek" at the first line of the file is all that the reader needs to determine the number of columns and the name of each column.

Before going on, make a note of the duration of the previous call - it should be just under 3 seconds.

### Step #3 - Infer the Schema

Lastly, we can add an option that tells the reader to infer each column's data type (aka the schema).

In [0]:
(spark.read                        # The DataFrameReader
   .option("header", "true")       # Use first line of all files as header
   .option("sep", "\t")            # Use tab delimiter (default is comma-separator)
   .option("inferSchema", "true")  # Automatically infer data types
   .csv(csvFile)                   # Creates a DataFrame from CSV after reading in the file
   .printSchema()
)

root
 |-- timestamp: timestamp (nullable = true)
 |-- site: string (nullable = true)
 |-- requests: integer (nullable = true)



### Review: Reading CSV w/InferSchema
* we still have three columns
* all three columns are still **nullable**
* all three columns have their proper names
* two jobs were executed (not one as in the previous example)
* our three columns now have distinct data types:
  * **timestamp** == **timestamp**
  * **site** == **string**
  * **requests** == **integer**

**Question:** Why were there two jobs?

**Question:** How long did the last job take?

**Question:** Why did it take so much longer?

Discuss...
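
As a starting point for the discussion: deciding on a column's type means looking at the values in that column, not just at the header line. The following is a rough plain-Python analogy. It is not Spark's inference logic, the helper `infer_type` is hypothetical, and the sample rows are made up.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample mirroring the layout of the TSV file.
sample = io.StringIO(
    "timestamp\tsite\trequests\n"
    "2015-03-16T00:09:55\tmobile\t1595\n"
    "2015-03-16T00:19:39\tdesktop\t2460\n"
)

def infer_type(value):
    """Return a crude type name for a single string value."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
        return "timestamp"
    except ValueError:
        return "string"

reader = csv.reader(sample, delimiter="\t")
header = next(reader)                      # line 1 gives the column names
seen = {name: set() for name in header}
for row in reader:                         # every remaining line must be read
    for name, value in zip(header, row):
        seen[name].add(infer_type(value))

# A column keeps a specific type only if all of its values agreed on it.
inferred = {
    name: (next(iter(types)) if len(types) == 1 else "string")
    for name, types in seen.items()
}
print(inferred)
```

The loop over every row is the part with no counterpart in the header-only read from Step #2.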

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Reading from CSV w/User-Defined Schema

This time we are going to read the same file.

The difference here is that we are going to define the schema beforehand and hopefully avoid the execution of any extra jobs.

### Step #1
Declare the schema.

This is just a list of field names and data types.

In [0]:
# Required for StructField, StringType, IntegerType, etc.
from pyspark.sql.types import *

csvSchema = StructType([
  StructField("timestamp", StringType(), False),
  StructField("site", StringType(), False),
  StructField("requests", IntegerType(), False)
])

### Step #2
Read in our data (and print the schema).

We can specify the schema, or rather the `StructType`, with the `schema(..)` command:

In [0]:
(spark.read                   # The DataFrameReader
  .option('header', 'true')   # Ignore line #1 - it's a header
  .option('sep', "\t")        # Use tab delimiter (default is comma-separator)
  .schema(csvSchema)          # Use the specified schema
  .csv(csvFile)               # Creates a DataFrame from CSV after reading in the file
  .printSchema()
)

root
 |-- timestamp: string (nullable = true)
 |-- site: string (nullable = true)
 |-- requests: integer (nullable = true)



### Review: Reading CSV w/ User-Defined Schema
* We still have three columns
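* All three columns are still reported as **nullable**, as the `printSchema()` output shows, even though we declared them as not nullable; when reading from a file, Spark treats the columns as nullable regardless of the declared schema.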
* All three columns have their proper names
* Zero jobs were executed
* Our three columns now have distinct data types:
  * **timestamp** == **string**
  * **site** == **string**
  * **requests** == **integer**

**Question:** Why were there no jobs?

**Question:** What is different about the data types of these columns compared to the previous exercise & why?

**Question:** Do I need to indicate that the file has a header?

**Question:** Do the declared column names need to match the columns in the header of the TSV file?

Discuss...

For a list of all the options related to reading CSV files, please see the documentation for `DataFrameReader.csv(..)`

Let's take a look at some of the other details of the `DataFrame` we just created, for comparison's sake.

In [0]:
csvDF = (spark.read
  .option('header', 'true')
  .option('sep', "\t")
  .schema(csvSchema)
  .csv(csvFile)
)
print("Partitions: " + str(csvDF.rdd.getNumPartitions()) )
printRecordsPerPartition(csvDF)
print("-"*80)

Partitions: 8
Per-Partition Counts
#1: 914,994
#2: 914,493
#3: 914,669
#4: 913,926
#5: 914,091
#6: 914,221
#7: 914,314
#8: 799,292
--------------------------------------------------------------------------------
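
The eight partitions, with a smaller final partition, are consistent with how Spark sizes file splits. The following is a back-of-the-envelope sketch under stated assumptions: Spark's default settings (`spark.sql.files.maxPartitionBytes` of 128 MB and `spark.sql.files.openCostInBytes` of 4 MB) and a default parallelism of 8, taken from `master=local[8]`. The actual cluster configuration may differ.

```python
import math

file_size = 262_099_389                   # bytes, from the %fs ls listing
max_partition_bytes = 128 * 1024 * 1024   # assumed default: spark.sql.files.maxPartitionBytes
open_cost = 4 * 1024 * 1024               # assumed default: spark.sql.files.openCostInBytes
parallelism = 8                           # assumed, from master=local[8]

# Target split size: spread the file (plus its open cost) across the cores,
# capped by the maximum partition size.
bytes_per_core = (file_size + open_cost) // parallelism
max_split_bytes = min(max_partition_bytes, max(open_cost, bytes_per_core))

num_splits = math.ceil(file_size / max_split_bytes)
last_split = file_size - (num_splits - 1) * max_split_bytes

print(num_splits, max_split_bytes, last_split)
```

Under these assumptions the final split is roughly 87% the size of the others, which lines up with partition #8 holding fewer records than the rest.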


## Next steps

Start the next lesson, [Reading Data - JSON]($./2.Reading%20Data%20-%20JSON)