# Using lazy processing

- Lazy processing operations usually return in about the same amount of time regardless of how much data is involved, because Spark does not perform any transformation until an action is requested.

- In this exercise, we'll define a DataFrame (`aa_dfw_df`) and add a couple of transformations. Compare the time the transformations take when they are defined with the time taken when the data is actually queried. The difference may be small here, but it should be noticeable. On a full Spark cluster with larger quantities of data, it will be more apparent.
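
As a rough analogy only (this is plain Python, not Spark's implementation), a `map()` pipeline behaves in a similar way: defining the step touches no data, and the work happens only when a result is requested. The names `processed`, `lower_airport`, and `codes` are made up for this sketch.

```python
# Plain-Python analogy for lazy evaluation; hypothetical names, not Spark code.
processed = []  # records every value that has actually been processed

def lower_airport(code):
    """Stand-in for a transformation such as F.lower()."""
    processed.append(code)
    return code.lower()

codes = ["HNL", "OGG", "DTW", "STL"]

# "Transformation": map() only records what to do; no row is touched yet.
pipeline = map(lower_airport, codes)
calls_after_definition = len(processed)

# "Action": consuming the pipeline forces every row through lower_airport().
result = list(pipeline)
calls_after_action = len(processed)

print(calls_after_definition, calls_after_action, result)
```

The first count is zero because nothing has run yet, which is why defining a transformation takes roughly the same time whether there are four rows or four million.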

## Instructions

- Load the DataFrame.
- Apply `F.lower()` to the `Destination Airport` column, storing the result in a new `airport` column.
- Drop the `Destination Airport` column from the DataFrame `aa_dfw_df`. Note the time these operations take to complete.
- Show the DataFrame, noting the time difference for this action to complete.
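
One possible way to note the times asked for above is a small timing helper. This is a minimal sketch: `timed` is a hypothetical helper, not part of PySpark, and the workload here is a stand-in so the block runs without Spark.

```python
import time

def timed(label, fn, *args, **kwargs):
    """Call fn(*args, **kwargs), print the elapsed wall-clock time,
    and return (result, seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f} s")
    return result, elapsed

# Stand-in workload (an assumption for illustration only).
total, seconds = timed("sum of 0..999", sum, range(1000))
```

In the notebook, the same helper could wrap each step, for example `timed("show", aa_dfw_df.show)` for the action. In Jupyter, the built-in `%%time` cell magic is an alternative.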

In [1]:
# Initialization
import os
import sys

os.environ["SPARK_HOME"] = "/home/talentum/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.6"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

# NOTE: List whichever packages you need here.
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0 pyspark-shell' 
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'

In [2]:
# Entry point (Spark 2.x)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()

# To run on YARN, specify .master("yarn"):
# spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().master("yarn").getOrCreate()

sc = spark.sparkContext

In [4]:
import pyspark.sql.functions as F

# Load the CSV file
aa_dfw_df = spark.read.format('csv').options(header=True).load('file:///home/talentum/test-jupyter/P3/M1/SM2/Dataset/AA_DFW_2018_Departures_Short.csv.gz')

# printSchema() prints the schema itself and returns None, so it is not wrapped in print()
aa_dfw_df.printSchema()

# Keep two columns and save them for the schema example further down
df2 = aa_dfw_df.select('Date (MM/DD/YYYY)', 'Destination Airport')
print(df2)
# CSV stores no schema metadata, and no header row is written by default
df2.write.csv('file:///home/talentum/test.csv')

# Add the airport column using the F.lower() function
aa_dfw_df = aa_dfw_df.withColumn('airport', F.lower(aa_dfw_df['Destination Airport']))

# Drop the Destination Airport column
aa_dfw_df = aa_dfw_df.drop(aa_dfw_df['Destination Airport'])

# Show the DataFrame; show() also prints directly and returns None
aa_dfw_df.show()

root
 |-- Date (MM/DD/YYYY): string (nullable = true)
 |-- Flight Number: string (nullable = true)
 |-- Destination Airport: string (nullable = true)
 |-- Actual elapsed time (Minutes): string (nullable = true)

DataFrame[Date (MM/DD/YYYY): string, Destination Airport: string]
+-----------------+-------------+-----------------------------+-------+
|Date (MM/DD/YYYY)|Flight Number|Actual elapsed time (Minutes)|airport|
+-----------------+-------------+-----------------------------+-------+
|       01/01/2018|         0005|                          498|    hnl|
|       01/01/2018|         0007|                          501|    ogg|
|       01/01/2018|         0043|                            0|    dtw|
|       01/01/2018|         0051|                          100|    stl|
|       01/01/2018|         0075|                          147|    dca|
|       01/01/2018|         0096|                           92|    stl|
|       01/01/2018|         0103|                          227|    sj

In [5]:
from pyspark.sql.types import StructType, StructField, StringType

# Define a new schema using the StructType class
people_schema = StructType([
  # Define a StructField for each field
  StructField('Date (MM/DD/YYYY)', StringType(), True),
  StructField('Destination Airport', StringType(), True)
])
print(people_schema)

StructType(List(StructField(Date (MM/DD/YYYY),StringType,true),StructField(Destination Airport,StringType,true)))
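
A schema matters when reading CSV because the format carries no type information: everything comes back as text unless the reader is told otherwise. The sketch below shows the idea with the standard library `csv` module as a stand-in for Spark's reader. The sample rows and the names `rows`, `schema`, and `typed` are hypothetical.

```python
import csv
import io

# Hypothetical two-column sample with integer values.
rows = [{"flight": 5, "minutes": 498}, {"flight": 7, "minutes": 501}]

# Write to an in-memory CSV: the integers become plain text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["flight", "minutes"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every value is a string because CSV stores no types.
buffer.seek(0)
raw = [dict(row) for row in csv.DictReader(buffer)]

# A "schema" (column name -> converter) restores the intended types.
schema = {"flight": int, "minutes": int}
typed = [{name: schema[name](value) for name, value in row.items()} for row in raw]

print(raw[0], typed[0])
```

In the notebook both fields are declared as `StringType`, so the schema there mainly supplies the column names rather than changing types.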


In [6]:
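# Caveat: test.csv was written without a header row, so header=True makes Spark
# skip the first data row; use header=False to keep every row.
new_df = spark.read.format('csv').options(header=True).load('file:///home/talentum/test.csv', schema=people_schema)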

In [7]:
new_df.show()

+-----------------+-------------------+
|Date (MM/DD/YYYY)|Destination Airport|
+-----------------+-------------------+
|       01/01/2018|                OGG|
|       01/01/2018|                DTW|
|       01/01/2018|                STL|
|       01/01/2018|                DCA|
|       01/01/2018|                STL|
|       01/01/2018|                SJC|
|       01/01/2018|                OGG|
|       01/01/2018|                HNL|
|       01/01/2018|                MCO|
|       01/01/2018|                EWR|
|       01/01/2018|                SJC|
|       01/01/2018|                RDU|
|       01/01/2018|                SAT|
|       01/01/2018|                SFO|
|       01/01/2018|                MIA|
|       01/01/2018|                LAS|
|       01/01/2018|                KOA|
|       01/01/2018|                CVG|
|       01/01/2018|                MIA|
|       01/01/2018|                LIH|
+-----------------+-------------------+
only showing top 20 rows
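
One thing to watch in the output above: the first `show()` starts with `hnl`, whereas the re-read data starts with `OGG`. A likely explanation is that `test.csv` was written without a header row and then read with the header option enabled, so the first data row was consumed as a header. The sketch below reproduces that effect with the standard library `csv` module rather than Spark. The three sample rows are taken from the output above, and `with_header`, `without_header`, and `names` are illustrative names.

```python
import csv
import io

# Headerless data, like a CSV file written with no header row.
text = "01/01/2018,HNL\n01/01/2018,OGG\n01/01/2018,DTW\n"

# Treating the first line as a header silently drops one data row.
with_header = list(csv.DictReader(io.StringIO(text)))

# Supplying the column names up front keeps every row.
names = ["Date (MM/DD/YYYY)", "Destination Airport"]
without_header = list(csv.DictReader(io.StringIO(text), fieldnames=names))

print(len(with_header), len(without_header))
```

In the notebook, either writing with a header or reading without one should keep the row counts in line.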

