## Spark DataFrames

### Imports

For these examples, we only need to import two modules from **pyspark.sql**:
- `types`
- `functions`

We need `pyspark.sql.types` to define schemas for the DataFrames. The `pyspark.sql.functions` module contains the built-in column functions for SQL and DataFrames in **PySpark**.

In [0]:
from pyspark.sql.types import *  # Necessary for creating schemas
from pyspark.sql.functions import * # Importing PySpark functions

### Creating DataFrames

#### Making a Simple DataFrame from a Tuple List

In [0]:
# Make a tuple list
a_list = [('a', 1), ('b', 2), ('c', 3)]

# Create a Spark DataFrame, without supplying a schema value
df_from_list_no_schema = \
sqlContext.createDataFrame(a_list)

# Print the DF object
print(df_from_list_no_schema)

# Print a collected list of Row objects
print(df_from_list_no_schema.collect())

# Show the DataFrame
df_from_list_no_schema.show()

#### Making a Simple DataFrame from a Tuple List and a Schema

In [0]:
# Create a Spark DataFrame, this time with a schema
df_from_list_with_schema = \
sqlContext.createDataFrame(a_list, ['letters', 'numbers']) # column names only; types are inferred

# Show the DataFrame
df_from_list_with_schema.show()

# Show the DataFrame's schema
df_from_list_with_schema.printSchema()

#### Making a Simple DataFrame from a List of Dictionaries

In [0]:
# Make a list of dictionaries (one per row)
a_dict = [{'letters': 'a', 'numbers': 1},
          {'letters': 'b', 'numbers': 2},
          {'letters': 'c', 'numbers': 3}]

# Create a Spark DataFrame from the list of dictionaries
df_from_dict = \
(sqlContext
 .createDataFrame(a_dict)) # Spark may warn that inferring a schema from dicts is deprecated

# Show the DataFrame
df_from_dict.show()

#### Making a Simple DataFrame Using a StructType Schema + RDD

In [0]:
# Define the schema
schema = StructType([
    StructField('letters', StringType(), True),
    StructField('numbers', IntegerType(), True)])

# Create an RDD from a list
rdd = sc.parallelize(a_list)

# Create the DataFrame from these raw components
nice_df = \
(sqlContext
 .createDataFrame(rdd, schema))

# Show the DataFrame
nice_df.show()

### Simple Inspection Functions

Now that we have `nice_df`, here are some handy functions for inspecting it.

In [0]:
# `columns`: return all column names as a list
nice_df.columns

In [0]:
# `dtypes`: get the datatypes for all columns
nice_df.dtypes

In [0]:
# `printSchema()`: prints the schema of the supplied DF
nice_df.printSchema()

In [0]:
# `schema`: returns the DF's schema as a `StructType`
nice_df.schema

In [0]:
# `first()` returns the first row as a Row, while
# `head(n)` and `take(n)` return a list of up to `n` Row objects
print(nice_df.first()) # takes no argument; always a single Row
print(nice_df.head(2)) # `n` is optional: `head()` returns a single Row,
                      # `head(n)` returns a list (even for n = 1)
print(nice_df.take(2)) # requires `n`; always a list

In [0]:
# `count()`: returns a count of all rows in DF
nice_df.count()

In [0]:
# `describe()`: compute summary stats (count, mean, stddev, min, max)
# and return them as a DataFrame
nice_df.describe().show() # can optionally supply a list of column names

In [0]:
# `explain()`: print the physical plan Spark will run to evaluate this DF
nice_df.explain()

### Relatively Simple DataFrame Manipulation Functions

Let's use these functions:
- `unionAll()`: append the rows of one DataFrame to another, matching columns by position and keeping duplicates (an alias for `union()` in Spark 2.0 and later)
- `orderBy()`: sort DataFrame rows by one or more columns
- `select()`: choose which DataFrame columns to retain
- `drop()`: remove a DataFrame column
- `filter()`: retain DataFrame rows that match a condition

In [0]:
# Append the DataFrame to itself
(nice_df
 .unionAll(nice_df)
 .show())

In [0]:
# Add it to itself twice
(nice_df
 .unionAll(nice_df)
 .unionAll(nice_df)
 .show())



In [0]:
# Sorting the DataFrame by the `numbers` column
(nice_df
 .unionAll(nice_df)
 .unionAll(nice_df)
 .orderBy('numbers')
 .show())

# Sort the same column in reverse order
(nice_df
 .unionAll(nice_df)
 .unionAll(nice_df)
 .orderBy('numbers',
          ascending=False)
 .show())

In [0]:
# `select()` takes column names (as separate arguments or as a list),
# while `drop()` takes the name of the column to remove

# Select only the first column of the DF
(nice_df
 .select('letters')
 .show())

# Re-order columns in the DF using `select()`
(nice_df
 .select(['numbers', 'letters'])
 .show())

# Drop the `letters` column (the first column) from the DF
(nice_df
 .drop('letters')
 .show())

In [0]:
# The `filter()` function performs filtering of DF rows

# Here is some numeric filtering with comparison operators
# (>, <, >=, <=, ==, != all work)

# Keep rows where the value in `numbers` is > 1
(nice_df
 .filter(nice_df.numbers > 1)
 .show())

# Chain two filters (rows must satisfy both conditions)
(nice_df
 .filter(nice_df.numbers > 1)
 .filter(nice_df.numbers < 3)
 .show())

# Not just numbers! Use the `filter()` + `isin()`
# combo to filter on string columns with a set of values
(nice_df
 .filter(nice_df.letters
         .isin(['a', 'b']))
 .show())