## Spark SQL and DataFrames

This module covers Spark SQL operations and working with DataFrames.

In [None]:
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import *
import matplotlib.pyplot as plt
import numpy as np
import datetime

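# `sc` and `spark` are not created in this notebook; they are assumed to be
# provided by the notebook kernel. The cells below use the `spark` session.
sqlContext = SQLContext(sc)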

Import the data from a public-facing Azure Blob Storage account.

In [None]:
# Location of the training data: Dec 2013 trip and fare data from NYC
trip_file_loc = "wasb://data@cdspsparksamples.blob.core.windows.net/NYCTaxi/KDD2016/trip_data_12.csv"
fare_file_loc = "wasb://data@cdspsparksamples.blob.core.windows.net/NYCTaxi/KDD2016/trip_fare_12.csv"

## READ IN TRIP DATA FRAME FROM CSV
trip = spark.read.csv(path=trip_file_loc, header=True, inferSchema=True)

## READ IN FARE DATA FRAME FROM CSV
fare = spark.read.csv(path=fare_file_loc, header=True, inferSchema=True)

## REGISTER THE DATA FRAMES AS TEMP VIEWS IN THE SPARK SESSION
trip.createOrReplaceTempView("trip")
fare.createOrReplaceTempView("fare")

List the tables and temporary views registered in your Spark SQL session catalog (which may be backed by the Hive metastore).

In [None]:
spark.sql("show tables").show()

To preview the rows of a DataFrame, use the `show()` method. To inspect its schema, use the `printSchema()` method.

In [None]:
trip.printSchema()

In [None]:
fare.printSchema()

In [None]:
fare.groupBy("payment_type").count().show()

Code | Payment Description
--- | ---
"CRD" | card, debit or credit
"CSH" | cash
"DIS" | disputed fare
"NOC" | no charge
"UNK" | unknown

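The payment codes in the table can also be kept as a plain Python lookup, which is handy when labeling aggregated results. The sketch below is only illustrative: `describe_payment` is a hypothetical helper, and the choice to trim and upper-case codes and to map unrecognized or missing codes to "unknown" is an assumption, not something the dataset documentation states.

```python
# Lookup table transcribed from the payment-code table above.
PAYMENT_DESCRIPTIONS = {
    "CRD": "card, debit or credit",
    "CSH": "cash",
    "DIS": "disputed fare",
    "NOC": "no charge",
    "UNK": "unknown",
}


def describe_payment(code):
    """Return the description for a payment code (hypothetical helper).

    Assumption: codes are trimmed and upper-cased before lookup, and
    unrecognized or missing codes fall back to the "UNK" description.
    """
    if code is None:
        return PAYMENT_DESCRIPTIONS["UNK"]
    normalized = str(code).strip().upper()
    return PAYMENT_DESCRIPTIONS.get(normalized, PAYMENT_DESCRIPTIONS["UNK"])


# Print the description for a few sample codes, including an unrecognized one.
for sample in ["CRD", " csh ", "XYZ"]:
    print(repr(sample), "->", describe_payment(sample))
```

A function like this could in principle be wrapped as a Spark UDF to add a description column to `fare`; that step is not shown here.
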
In [None]:
%%sql
SELECT payment_type, AVG(tip_amount) AS ave_tip FROM fare GROUP BY payment_type
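
To see what this aggregation computes without a cluster, the same statement can be tried on a few rows in an in-memory SQLite database. This is only a sketch: SQLite stands in for Spark SQL here, the rows in `toy_fares` are made up for illustration and are not taken from the NYC taxi data, and an `ORDER BY` clause is added solely to make the printed order predictable.

```python
import sqlite3

# Hypothetical toy rows (payment_type, tip_amount); NOT real taxi data.
toy_fares = [
    ("CRD", 2.0),
    ("CRD", 4.0),
    ("CSH", 0.0),
    ("CSH", 0.0),
    ("NOC", 0.0),
]

# Build a throwaway in-memory table named like the Spark temp view.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fare (payment_type TEXT, tip_amount REAL)")
conn.executemany("INSERT INTO fare VALUES (?, ?)", toy_fares)

# Same SELECT as the %%sql cell, plus ORDER BY for a stable row order.
query = """
SELECT payment_type, AVG(tip_amount) AS ave_tip
FROM fare
GROUP BY payment_type
ORDER BY payment_type
"""
avg_tip_by_type = dict(conn.execute(query).fetchall())
conn.close()

# One line per payment type: the group key and its mean tip.
for payment_type, ave_tip in avg_tip_by_type.items():
    print(payment_type, ave_tip)
```

Each output row pairs one `payment_type` with the mean of `tip_amount` over that group, which is the shape of result the Spark SQL query returns for the full `fare` view.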