# Lesson 14 - Introduction to DataFrames

## DataFrames

The Spark **DataFrame** is a high-level data structure used for working with structured data. The structure of a DataFrame is similar to that of a table. A DataFrame is essentially an RDD containing several objects of type **`Row`**. A Row in Spark is an ordered collection of objects. Each Row in a DataFrame must contain the same number of elements. The elements at the same position in each of the Rows together form a **`Column`**. Each row in a DataFrame is intended to represent an individual record or observation, and each column is intended to represent a specific piece of information that has been recorded for each of the records.

The Spark DataFrame is inspired by similar data structures in R and in the pandas package for Python. Some key differences between Spark DataFrames and these other data structures are:

1. Spark DataFrames are immutable. 
2. Spark DataFrames are distributed. 
3. Transformations performed on Spark DataFrames are evaluated lazily. 
4. Transformations performed on Spark DataFrames are highly optimized to improve performance.

## Spark Data Types

Every column in a DataFrame has a data type associated with it. The data types used by Spark are based on Scala data types and differ somewhat from the data types available in Python. Before continuing on to discuss DataFrames in detail, we need to take a moment to review some common Spark data types.

* **String Data Types.** Spark provides a `string` data type that is directly equivalent to the Python `str` type. 
* **Integer Data Types.** Spark provides several data types for storing integers of different sizes. These include the 1-byte `byte` type, the 2-byte `short` type, the 4-byte `integer` type, and the 8-byte `long` type. In contrast, Python provides a single `int` data type that is capable of scaling based on need. 
* **Floating Point Data Types.** Spark provides two data types for storing floating point numbers. These are the 4-byte `float` type and the 8-byte `double` type. Unlike its `int` type, the Python `float` type does not scale: it is typically a fixed-size 8-byte (double-precision) value, so it corresponds to Spark's `double` type.
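
To make the integer sizes concrete, the sketch below computes the range of values each size can hold. It uses only plain Python and assumes signed two's-complement storage, as used by the JVM types that Spark relies on; the dictionary names are hypothetical.

```python
# Storage size in bytes for each Spark integer type.
int_sizes = {'byte': 1, 'short': 2, 'integer': 4, 'long': 8}

# Signed range implied by each size (two's complement).
int_ranges = {}
for name, n_bytes in int_sizes.items():
    bits = 8 * n_bytes
    int_ranges[name] = (-2 ** (bits - 1), 2 ** (bits - 1) - 1)
    print(name, int_ranges[name])

# A Python int grows as needed, so it can exceed the largest Spark long.
big_value = int_ranges['long'][1] + 1
print(big_value)
```

A value such as `big_value` fits in a Python `int` but is too large for any of the Spark integer types listed above.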

You can find more information about Spark data types here: [Spark Documentation: Spark Data Types](https://spark.apache.org/docs/latest/sql-ref-datatypes.html)

## The SparkSession

The `SparkSession` object is the primary entry point for working with structured data. The tools that Spark provides specifically for working with DataFrames can be accessed through the `SparkSession`.

In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

## Creating a DataFrame from an RDD or List

We can use the `createDataFrame()` method of the `SparkSession` object to create a DataFrame from an in-memory object such as a list of lists or a pandas DataFrame. This is illustrated with a small example in the following cell.

Note that Spark DataFrames have a `collect()` method, just like RDDs. When we call this method on our newly constructed DataFrame and print each result, we can see that the DataFrame consists of several `Row` objects.

In [0]:
employees_list = [
  ['Mary', 43, 15.6],
  ['John', 56, 13.7],
  ['Kent', 28, 16.2],
  ['Rose', 34, 16.2],
  ['Lona', 52, 16.2],
]

employees_df = spark.createDataFrame(employees_list)

for row in employees_df.collect():
  print(row)

## Schemas

Every Spark DataFrame comes with a **schema** which defines the structure of the DataFrame by specifying a name and data type for each of the columns in the DataFrame. We can view a DataFrame's schema by calling its `printSchema()` method.

In [0]:
employees_df.printSchema()

Notice the expressions **`_1`**, **`_2`**, and **`_3`** in the output above. These are the names that Spark has assigned to the three columns in our DataFrame. In just a bit, we will see how we can set these names ourselves when creating the DataFrame. 

Following the name of each column, you will see the Spark data type that has been assigned to that column. These are, in order, `string`, `long`, and `double`. The expression `(nullable = true)` indicates that each of these columns is allowed to contain missing values.

### Assigning a Custom Schema

In the example above, Spark assigned the names `_1`, `_2`, and `_3` to the columns of our DataFrame. If we wish to specify our own column names, or would like to have control over the data types being assigned to the columns, we can create a custom schema.

### Schema Classes

There are two commonly used approaches to creating a schema. The first technique that we discuss requires several classes from the `pyspark.sql.types` module. In particular, we need to import the `StructType` class, the `StructField` class, and a specific class associated with each data type that appears within our columns. Some examples of these data type classes are `StringType`, `IntegerType`, `LongType`, and `DoubleType`. 

Every schema is represented by an instance of the `StructType` class. Each column is represented by a `StructField` instance. When creating a `StructField` instance, we provide three arguments. These are a string representing the name of the column, an instance of the class representing the data type associated with the column, and a boolean value indicating whether or not the column can accept null values. 

In the cell below, we illustrate the process of creating a custom schema by performing the following steps:

1. We import all of the relevant classes. 
2. We create the schema as an instance of the `StructType` class. 
3. We provide the schema as an argument when calling `createDataFrame()`. 
4. We use `printSchema()` to display the schema for the DataFrame.

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, LongType

my_schema = StructType([
  StructField('Name', StringType(), True),
  StructField('Age', IntegerType(), True),
  StructField('Rate', DoubleType(), True) 
])

employees_df = spark.createDataFrame(employees_list, schema=my_schema)

employees_df.printSchema()

### DDL Strings

An alternate approach to creating a schema is to use a **Data Definition Language (DDL) String**. A DDL string is simply a Python string that states the name and data type for each column. The name and data type are separated by spaces, and the information for each column is separated by a comma. This approach is typically much quicker and more concise than the `StructType` approach to defining a schema, and it requires no special import statements. 

We illustrate how to define a schema using a DDL string in the cell below.

In [0]:
my_schema = 'Name STRING, Age INTEGER, Rate DOUBLE'

employees_df = spark.createDataFrame(employees_list, schema=my_schema)

employees_df.printSchema()

### Column Names

Every DataFrame has a `columns` attribute that contains a list of the names of the columns in that DataFrame, as well as a `dtypes` attribute that contains a list of tuples containing the name and data type of each column.

In [0]:
print(employees_df.columns)
print(employees_df.dtypes)

## Displaying DataFrames

There are several tools that can be used to display the contents of a DataFrame. The first one we will demonstrate is the `show()` DataFrame method. This method displays the first several rows of the DataFrame. By default, 20 rows are displayed, but we can ask for `n` rows to be displayed using `show(n)`.

In [0]:
employees_df.show()

Databricks provides a `display()` function that produces an interactive display of the contents of a DataFrame and can also be used to generate plots from a DataFrame. Note that this tool is specific to Databricks.

In [0]:
display(employees_df)

| Name | Age | Rate |
|------|-----|------|
| Mary | 43  | 15.6 |
| John | 56  | 13.7 |
| Kent | 28  | 16.2 |
| Rose | 34  | 16.2 |
| Lona | 52  | 16.2 |


We can convert a Spark DataFrame into a pandas DataFrame using `toPandas()`. Since pandas DataFrames are not distributed, using this method will cause all of the contents of the DataFrame to be loaded into memory on the node running the driver process. If the dataset is too large to fit in that machine's memory, then it will likely crash the node.

In [0]:
employees_pdf = employees_df.toPandas()
employees_pdf

|   | Name | Age | Rate |
|---|------|-----|------|
| 0 | Mary | 43  | 15.6 |
| 1 | John | 56  | 13.7 |
| 2 | Kent | 28  | 16.2 |
| 3 | Rose | 34  | 16.2 |
| 4 | Lona | 52  | 16.2 |


## Reading Data from a File

We can create DataFrames from external files using Spark's **Read API**. This API is accessed through the `spark.read` object, which is of type `DataFrameReader`. We can use the `option()` method of `spark.read` to customize how the file is read into the DataFrame. Uses of this method include setting the delimiter for the data file and specifying whether the file contains a header. We can use the `schema()` method to provide the schema to be used for the new DataFrame. Finally, the `csv()` method is used to provide the path to the data file being read. 

In the cell below, we will create a DataFrame using the gapminder dataset. This dataset is stored in the tab-separated file `/FileStore/tables/gapminder_data.txt`.

In [0]:
gm_schema = (
    'country STRING, year INTEGER, continent STRING, population INTEGER, '
    'life_exp DOUBLE, gdp_per_cap INTEGER, gini DOUBLE'
)

gm_df = (
    spark.read
    .option('delimiter', '\t')
    .option('header', True)
    .schema(gm_schema)
    .csv('/FileStore/tables/gapminder_data.txt')
)
    
gm_df.printSchema()

We will now display the first several rows of the DataFrame we just created.

In [0]:
gm_df.show(10)

### Inferring the Schema

The Spark Read API provides us with the option to allow Spark to scan the dataset to infer the appropriate schema rather than requiring us to specify the schema. This can be set using `option('inferSchema', True)`. Inferring the schema might save us from having to create our own `StructType` or DDL string, but it is less efficient than these other approaches since it requires Spark to scan the dataset in order to determine the correct data types. It is generally better for you to provide Spark with the desired schema information rather than asking Spark to infer it.

In [0]:
gm_df = (
    spark.read
    .option('delimiter', '\t')
    .option('header', True)
    .option('inferSchema', True)
    .csv('/FileStore/tables/gapminder_data.txt')
)
    
gm_df.printSchema()