# Describe a DataFrame

Data processing in Azure Databricks is accomplished by defining DataFrames that read and process your data.

This notebook introduces how to read your data using Azure Databricks DataFrames.

# Introduction

**Data Source**
* One hour of Pagecounts from the English Wikimedia projects captured August 5, 2016, at 12:00 PM UTC.
* Size on Disk: ~23 MB
* Type: Compressed Parquet File
* More Info: <a href="https://dumps.wikimedia.org/other/pagecounts-raw" target="_blank">Page view statistics for Wikimedia projects</a>

**Technical Accomplishments:**
* Develop familiarity with the `DataFrame` APIs
* Introduce the classes...
  * `SparkSession`
  * `DataFrame` (aka `Dataset[Row]`)
* Introduce the actions...
  * `count()`

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

Run the following cell to configure our "classroom."

In [4]:
%run "./Includes/Classroom-Setup"

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) The Data Source

* In this notebook, we will be using a compressed Parquet "file" called **pagecounts** (~23 MB of data from Wikipedia).
* We will explore the data and develop an understanding of it as we progress.
* You can read more about this dataset here: <a href="https://dumps.wikimedia.org/other/pagecounts-raw/" target="_blank">Page view statistics for Wikimedia projects</a>.

We can use **dbutils.fs.ls()** to list the files in our data source.

In [6]:
(source, sasEntity, sasToken) = getAzureDataSource()

spark.conf.set(sasEntity, sasToken)

In [7]:
path = source + "/wikipedia/pagecounts/staging_parquet_en_only_clean/"
files = dbutils.fs.ls(path)
display(files)

path,name,size
wasbs://training@dbtrainwestus.blob.core.windows.net/wikipedia/pagecounts/staging_parquet_en_only_clean/_SUCCESS,_SUCCESS,0
wasbs://training@dbtrainwestus.blob.core.windows.net/wikipedia/pagecounts/staging_parquet_en_only_clean/_committed_6241970109963426653,_committed_6241970109963426653,760
wasbs://training@dbtrainwestus.blob.core.windows.net/wikipedia/pagecounts/staging_parquet_en_only_clean/_started_6241970109963426653,_started_6241970109963426653,0
wasbs://training@dbtrainwestus.blob.core.windows.net/wikipedia/pagecounts/staging_parquet_en_only_clean/part-00000-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,part-00000-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,2996913
wasbs://training@dbtrainwestus.blob.core.windows.net/wikipedia/pagecounts/staging_parquet_en_only_clean/part-00001-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,part-00001-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,2994285
wasbs://training@dbtrainwestus.blob.core.windows.net/wikipedia/pagecounts/staging_parquet_en_only_clean/part-00002-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,part-00002-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,2994196
wasbs://training@dbtrainwestus.blob.core.windows.net/wikipedia/pagecounts/staging_parquet_en_only_clean/part-00003-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,part-00003-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,2992431
wasbs://training@dbtrainwestus.blob.core.windows.net/wikipedia/pagecounts/staging_parquet_en_only_clean/part-00004-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,part-00004-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,2990093
wasbs://training@dbtrainwestus.blob.core.windows.net/wikipedia/pagecounts/staging_parquet_en_only_clean/part-00005-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,part-00005-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,2989931
wasbs://training@dbtrainwestus.blob.core.windows.net/wikipedia/pagecounts/staging_parquet_en_only_clean/part-00006-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,part-00006-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf-0-c000.gz.parquet,2989314


As we can see from the files listed above, this data is stored in <a href="https://parquet.apache.org" target="_blank">Parquet</a> files, which can be read with a single command that returns a `DataFrame`.
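To make that observation concrete, the listing can be filtered down to its Parquet part files and their sizes added up. The sketch below is a minimal, self-contained illustration: `FileInfo` is a hypothetical stand-in for the objects returned by `dbutils.fs.ls()` (which expose `path`, `name`, and `size`), and `sampleFiles` simply reproduces the names and sizes shown in the listing above so that it runs outside Databricks.

```python
from collections import namedtuple

# Hypothetical stand-in for the entries returned by dbutils.fs.ls()
FileInfo = namedtuple("FileInfo", ["path", "name", "size"])

base = ("wasbs://training@dbtrainwestus.blob.core.windows.net"
        "/wikipedia/pagecounts/staging_parquet_en_only_clean/")
suffix = ("-tid-6241970109963426653-cd9cd6ee-cb10-4da2-82b3-ea25a8369cbf"
          "-0-c000.gz.parquet")

# Sizes of part-00000 through part-00006, copied from the listing above
partSizes = [2996913, 2994285, 2994196, 2992431, 2990093, 2989931, 2989314]

# Marker files written alongside the data, followed by the part files
entries = [
    ("_SUCCESS", 0),
    ("_committed_6241970109963426653", 760),
    ("_started_6241970109963426653", 0),
] + [("part-{0:05d}{1}".format(i, suffix), size)
     for i, size in enumerate(partSizes)]

sampleFiles = [FileInfo(base + name, name, size) for name, size in entries]

# Keep only the Parquet part files and total their sizes
parquetFiles = [f for f in sampleFiles if f.name.endswith(".parquet")]
totalBytes = sum(f.size for f in parquetFiles)

print("Parquet files: {0}".format(len(parquetFiles)))
print("Total size: {0:,} bytes".format(totalBytes))
```

In the notebook itself, the same list comprehension and `sum()` should work directly on the `files` variable, without the stand-in data.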

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Create a DataFrame
* We can read the Parquet files into a `DataFrame`.
* We'll start with the object **spark**, an instance of `SparkSession` and the entry point to Spark 2.0 applications.
* From there we can access the `read` object which gives us an instance of `DataFrameReader`.

In [10]:
parquetDir = source + "/wikipedia/pagecounts/staging_parquet_en_only_clean/"

In [11]:
pagecountsEnAllDF = (spark  # Our SparkSession & Entry Point
  .read                     # Our DataFrameReader
  .parquet(parquetDir)      # Returns an instance of DataFrame
)
print(pagecountsEnAllDF)    # Prints the DataFrame's column names and types, not its rows

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) count()

If you look at the API docs, `count()` is described like this:
> Returns the number of rows in the Dataset.

`count()` will trigger a job to process the request and return a value.

We can now count all records in our `DataFrame` like this:

In [13]:
total = pagecountsEnAllDF.count()

print("Record Count: {0:,}".format( total ))

That tells us that there are around 2 million rows in the `DataFrame`.
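The `{0:,}` format specifier used in the cell above inserts thousands separators, which makes large counts easier to read. The sketch below shows it in isolation; `sampleTotal` is a hypothetical value chosen for illustration, not the actual record count.

```python
# Hypothetical value for illustration only, not the real count
sampleTotal = 2000000

# str.format with the "," option groups digits in thousands
message = "Record Count: {0:,}".format(sampleTotal)
print(message)  # Record Count: 2,000,000

# Equivalent f-string form (Python 3.6+)
fMessage = f"Record Count: {sampleTotal:,}"
print(fMessage)
```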

## Next steps

Start the next lesson, [Use common DataFrame methods]($./2.Use-common-dataframe-methods)