# Reading Data - Tables and Views

**Technical Accomplishments:**
* Demonstrate how to pre-register data sources in Azure Databricks.
* Introduce temporary views over files.
* Read data from tables/views.
* Introduce the utility method `printRecordsPerPartition(..)`, which
  * converts the specified `DataFrame` to an RDD
  * counts the number of records in each partition
  * prints the results to the console.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

Run the following cells to configure our "classroom."

In [0]:
%run "./Includes/Classroom-Setup"

In [0]:
%run "./Includes/Utility-Methods"

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Registering Tables in Databricks

So far we've seen purely programmatic methods for reading in data.

Databricks allows us to "register" the equivalent of "tables" so that they can be easily accessed by all users.

It also allows us to specify configuration settings such as secret keys, tokens, usernames and passwords, etc., without exposing that information to all users.

## Register a Table/View
* The Databricks UI has built-in support for working with a number of different data sources
* New ones are being added regularly
* In our case, we are going to upload the file <a href="http://files.training.databricks.com/static/data/pageviews_by_second_example.tsv">pageviews_by_second_example.tsv</a> and then use the UI to create a table.

There are several benefits to this strategy:
* Once set up, it never has to be done again
* It is available for any user on the platform (permissions permitting)
* Minimizes exposure of credentials
* No real overhead to reading the schema (no infer-schema)
* Easier to advertise available datasets to other users

## Follow these steps to register a new Table

**NOTE:** *It may be easiest for you to duplicate this browser tab so you can refer back to these steps.*

1. Download the [pageviews_by_second_example.tsv](http://files.training.databricks.com/static/data/pageviews_by_second_example.tsv) file to your computer.
2. Select **Data** in the left-hand menu.
3. Select the database with your username.
4. Select **Add Data** to create a new Table.

  ![The Data menu item and Add Data button are both highlighted.](https://databricksdemostore.blob.core.windows.net/images/03-de-learning-path/data-add-data.png)

5. In the Create New Table form, make sure **Upload File** is selected, then click **browse** and select the [pageviews_by_second_example.tsv](http://files.training.databricks.com/static/data/pageviews_by_second_example.tsv) file, or drag and drop it into the File box.
6. Select **Create Table with UI**.

  ![The previously listed form options are shown.](https://databricksdemostore.blob.core.windows.net/images/03-de-learning-path/create-new-table-1.png)

7. Select your cluster, then select **Preview Table**.
8. Under **Create in Database**, select the database with your username in the list. It is **important** that you do not skip this step. You can find the database name in the output of `cell 3` above.
9. Select **Create Table**.

  ![The previously listed form options are shown.](https://databricksdemostore.blob.core.windows.net/images/03-de-learning-path/create-new-table-2.png)
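
The UI steps above can also be approximated in code. The following is only a sketch, not what the UI does internally: the helper name `register_tsv_table`, the reader options, and the assumption that the file has a header row are illustrative. A table saved this way is written in the cluster's default table format rather than pointing at the uploaded file.

```python
# Reader options for a tab-separated file.
# Assumption: the file has a header row; adjust "header" if it does not.
TSV_OPTIONS = {"sep": "\t", "header": "true", "inferSchema": "true"}


def register_tsv_table(spark, path, table_name):
    """Read a TSV file and save it as a table (hypothetical helper).

    `spark` is the SparkSession that a Databricks notebook provides.
    """
    df = spark.read.options(**TSV_OPTIONS).csv(path)
    # The schema is inferred once here and then stored with the table,
    # so later readers do not pay the infer-schema cost again.
    # With the default save mode this fails if the table already exists.
    df.write.saveAsTable(table_name)
    return df
```

In a notebook this might be called as `register_tsv_table(spark, uploadedPath, "pageviews_sketch")`, where `uploadedPath` is wherever the uploaded file ended up and `pageviews_sketch` is a made-up table name chosen so that it does not collide with the table created through the UI.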

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Reading from a Table/View

We can now read in the "table" **pageviews_by_second_example_tsv** as a `DataFrame` with one simple command (and then print the schema):

In [0]:
pageviewsBySecondsExampleDF = spark.read.table("pageviews_by_second_example_tsv")

pageviewsBySecondsExampleDF.printSchema()

And of course we can now view that data as well:

In [0]:
display(pageviewsBySecondsExampleDF)

### Review: Reading from Tables
* No job is executed - the schema is stored in the table definition on Databricks.
* The data types shown here are those we defined when we registered the table.
* In our case, the file was uploaded to Databricks and is stored on the DBFS.
  * If we used JDBC, it would open the connection to the database and read it in.
  * If we used an object store (like the one backing the DBFS), it would read the data from the source.
* The "registration" of the table simply makes future access, or access by multiple users, easier.
* The users of the notebook cannot see usernames and passwords, secret keys, tokens, etc.
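
To see the column types recorded for a registered table without displaying any rows, one option is to look at the `dtypes` of the `DataFrame` that `spark.table(..)` returns. This is a minimal sketch; the helper names `format_dtypes` and `show_table_schema` are made up for illustration.

```python
def format_dtypes(dtypes):
    """Return one 'name: type' line per (name, type) pair."""
    return ["{}: {}".format(name, dtype) for name, dtype in dtypes]


def show_table_schema(spark, table_name):
    """Print the column names and types of a registered table."""
    # spark.table(..) is shorthand for spark.read.table(..);
    # DataFrame.dtypes is a list of (column name, type string) tuples.
    for line in format_dtypes(spark.table(table_name).dtypes):
        print(line)
```

For the table registered above, the call would be `show_table_schema(spark, "pageviews_by_second_example_tsv")`.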

Let's take a look at some of the other details of the `DataFrame` we just created, for comparison's sake.

In [0]:
print("Partitions: " + str(pageviewsBySecondsExampleDF.rdd.getNumPartitions()))
printRecordsPerPartition(pageviewsBySecondsExampleDF)
print("-"*80)
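
The source of `printRecordsPerPartition(..)` is not shown in this lesson; it is presumably loaded by the `Utility-Methods` include. Based only on its description at the top of this lesson (convert to an RDD, count the records in each partition, print the results), a helper like it could be sketched as follows. The names `count_records` and `print_records_per_partition` and the output format are assumptions.

```python
def count_records(partition_index, records):
    """Yield one (partition index, record count) pair for a single partition."""
    yield partition_index, sum(1 for _ in records)


def print_records_per_partition(df):
    """Print how many records each partition of `df` holds.

    Hypothetical stand-in for printRecordsPerPartition(..).
    """
    # Convert the DataFrame to an RDD and count the records in each partition.
    counts = df.rdd.mapPartitionsWithIndex(count_records).collect()
    # Print the results to the console, one line per partition.
    for partition_index, count in sorted(counts):
        print("#{}: {:,}".format(partition_index, count))
    return counts
```

Because the counting runs on the executors and only one small pair per partition is collected, this stays cheap even for large tables.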

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Temporary Views

Tables that are loadable by the call `spark.read.table(..)` are also accessible through the SQL APIs.

For example, we already used Databricks to expose **pageviews_by_second_example_tsv** as a table/view.

In [0]:
%sql
select * from pageviews_by_second_example_tsv limit(5)

You can also take an existing `DataFrame` and register it as a view, exposing it as a table to the SQL API.

If you recall from earlier, we have an instance called `parquetDF`.

We can create a [temporary] view with this call...

In [0]:
# create a DataFrame from a parquet file
parquetFile = "/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/"
parquetDF = spark.read.parquet(parquetFile)

# create a temporary view from the resulting DataFrame
parquetDF.createOrReplaceTempView("parquet_table")

And now we can use the SQL API to reference that same `DataFrame` as the table **parquet_table**.

In [0]:
%sql
select * from parquet_table order by requests desc limit(5)
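
The same kind of query can also be issued from a Python cell with `spark.sql(..)`, which returns a `DataFrame` instead of rendering a result grid. The sketch below uses two hypothetical helpers, `top_n_query` and `top_n`; because the table and column names are formatted straight into the SQL text, they should only ever come from trusted code, not from user input.

```python
def top_n_query(table_name, order_column, n=5):
    """Build a query for the `n` rows with the largest `order_column`."""
    return "select * from {} order by {} desc limit {}".format(
        table_name, order_column, int(n)
    )


def top_n(spark, table_name, order_column, n=5):
    """Run the query through spark.sql(..) and return the resulting DataFrame."""
    return spark.sql(top_n_query(table_name, order_column, n))
```

For example, `display(top_n(spark, "parquet_table", "requests"))` would show the five rows with the most requests.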

**Note #1:** *The method `createOrReplaceTempView(..)` is bound to the `SparkSession`, meaning the view will be discarded once the session ends.*

**Note #2:** *On the other hand, the method `createOrReplaceGlobalTempView(..)` is bound to the Spark application.*

*Or to put that another way, I can use `createOrReplaceTempView(..)` in this notebook only. However, I can call `createOrReplaceGlobalTempView(..)` in this notebook and then access it from another.*
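
One detail worth knowing when trying this out: Spark places global temporary views in a reserved database, `global_temp` by default, so the view name has to be qualified when it is read back. A minimal sketch, with hypothetical helper and view names:

```python
# Spark's default database for global temporary views.
GLOBAL_TEMP_DB = "global_temp"


def qualified_global_view(view_name):
    """Return the name under which a global temporary view is queried."""
    return "{}.{}".format(GLOBAL_TEMP_DB, view_name)


def share_as_global_view(df, view_name):
    """Register `df` as a global temporary view and return its qualified name."""
    df.createOrReplaceGlobalTempView(view_name)
    return qualified_global_view(view_name)
```

After calling `share_as_global_view(parquetDF, "parquet_global")` here, another notebook attached to the same cluster could read the view with `spark.table("global_temp.parquet_global")`.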

## Next steps

Start the next lesson, [Writing Data]($./5.Writing%20Data)