# Spark Lab 5 - Using Spark SQL (Solution)

In this lab, we practice the critical skills of creating a proper DataFrame from a source dataset (CSV), querying it using both the `spark.sql` and DataFrame APIs, and saving the results to storage.

## Dataset

We will use `auctiondata.csv`, which is distributed with `sparkdata.zip`. If you have not previously downloaded `sparkdata.zip`, you can download it from `http://idsdl.csom.umn.edu/c/share/sparkdata.zip` using `wget`. Alternatively, you can paste the URL into your browser and download it from there.

You can do this lab on the Cloudera VM (reading the file from HDFS) or in any alternative Spark environment (using a local copy of the file).

Our dataset is a `.csv` file that consists of online auction data. Each auction has an auction id associated with it and can have multiple bids. Each row represents a bid. For each bid, we have the following information:

Column|Type|Description
--|--|--
`auctionid`|String|Auction ID
`bid`|Float|Bid amount
`bidtime`|Float|Time of bid from start of auction
`bidder`|String|The bidder’s userid
`bidrate`|Int|The bidder’s rating
`openbid`|Float|Opening price
`price`|Float|Final price
`itemtype`|String|Item type
`dtl`|Int|Days to live



## Step 1: Explore the data from the shell
What does the data look like? Does it have a header row? What is the delimiter?

**Tip**: you can run operating system commands in notebook cells by prefixing the command with `!`, e.g.:
```bash
! ls
```

In [1]:
!head sparkdata/auctiondata.csv

8213034705,95,2.927373,jake7870,0,95,117.5,xbox,3
8213034705,115,2.943484,davidbresler2,1,95,117.5,xbox,3
8213034705,100,2.951285,gladimacowgirl,58,95,117.5,xbox,3
8213034705,117.5,2.998947,daysrus,10,95,117.5,xbox,3
8213060420,2,0.065266,donnie4814,5,1,120,xbox,3
8213060420,15.25,0.123218,myreeceyboy,52,1,120,xbox,3
8213060420,3,0.186539,parakeet2004,5,1,120,xbox,3
8213060420,10,0.18669,parakeet2004,5,1,120,xbox,3
8213060420,24.99,0.187049,parakeet2004,5,1,120,xbox,3
8213060420,20,0.249491,bluebubbles_1,25,1,120,xbox,3
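
The same questions can also be checked programmatically. The sketch below is plain Python over three of the lines printed above (no Spark needed). The helper name `looks_numeric` is made up for this illustration, and the header heuristic (a header row would contain no numeric fields) is an assumption that suits this file, not a general rule.

```python
# Three lines copied from the `head` output above.
sample = """8213034705,95,2.927373,jake7870,0,95,117.5,xbox,3
8213034705,115,2.943484,davidbresler2,1,95,117.5,xbox,3
8213060420,2,0.065266,donnie4814,5,1,120,xbox,3"""


def looks_numeric(text):
    """Return True if the text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


lines = sample.splitlines()

# If "," is the delimiter, every line should split into the same number of fields.
field_counts = {len(line.split(",")) for line in lines}

# Assumed heuristic: a header row would hold only names, so no numeric fields.
first_row = lines[0].split(",")
has_header = not any(looks_numeric(value) for value in first_row)

print(field_counts, has_header)
```

Every line splits into 9 fields on `,`, and the first line already contains numbers, so the file is comma-delimited and has no header row.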


## Step 2: Creating the DataFrame

There are several ways to create a new DataFrame.

### Approach 1: The RDD route

Read the text file into an RDD, convert the RDD to RDD[Row] with proper field names and data types, then convert the RDD to a DataFrame. 

This approach is useful if you are dealing with raw data (maybe unstructured) and you want to explore it before turning it into a table.


In [60]:
data_file = "sparkdata/auctiondata.csv"
rawDataRDD = sc.textFile(data_file).cache()

In [61]:
from pyspark.sql import Row

csvRDD = rawDataRDD.map(lambda l: l.split(","))

# convert each list of strings into a Row with typed fields
rowRDD = csvRDD.map(lambda p: Row(
    auctionid=int(p[0]),
    bid=float(p[1]),
    bidtime=float(p[2]),
    bidder=p[3],
    bidrate=int(p[4]),
    openbid=float(p[5]),
    price=float(p[6]),
    itemtype=p[7],
    dtl=int(p[8]),
    )
)

bidsDF = spark.createDataFrame(rowRDD)
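
A lambda like the one above raises an exception on the first malformed line. One possible alternative is to move the conversion into a named function that can be tested without Spark. The sketch below is plain Python; `parse_bid` is a hypothetical helper, not part of the original solution, and returning `None` for bad lines is just one way to handle them.

```python
def parse_bid(line):
    """Turn one CSV line into a dict of typed fields, or None if it is malformed."""
    parts = line.split(",")
    if len(parts) != 9:
        return None
    try:
        return {
            # int here so Spark would infer long; keep parts[0] as-is for a string id
            "auctionid": int(parts[0]),
            "bid": float(parts[1]),
            "bidtime": float(parts[2]),
            "bidder": parts[3],
            "bidrate": int(parts[4]),
            "openbid": float(parts[5]),
            "price": float(parts[6]),
            "itemtype": parts[7],
            "dtl": int(parts[8]),
        }
    except ValueError:
        # A field that should be numeric could not be converted.
        return None


good = parse_bid("8213034705,95,2.927373,jake7870,0,95,117.5,xbox,3")
bad = parse_bid("not,a,valid,row")
print(good["bid"], good["bidrate"], bad)
```

In the notebook, this could be wired in as `rawDataRDD.map(parse_bid).filter(lambda r: r is not None).map(lambda r: Row(**r))`; treat that as a suggestion to try rather than as part of the lab solution.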

Verify the column names and data types of the DataFrame.

**Question**: Are the data types the same as what you specified? What do you think has happened?

In [62]:
bidsDF.printSchema()

root
 |-- auctionid: long (nullable = true)
 |-- bid: double (nullable = true)
 |-- bidder: string (nullable = true)
 |-- bidrate: long (nullable = true)
 |-- bidtime: double (nullable = true)
 |-- dtl: long (nullable = true)
 |-- itemtype: string (nullable = true)
 |-- openbid: double (nullable = true)
 |-- price: double (nullable = true)



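**Answer**: The data types are not the same. When a DataFrame is built from an RDD of `Row` objects, Spark infers each column's type from the Python values: a Python `int` becomes `long`, and a Python `float` becomes `double`. Spark chooses the types best suited to DataFrames rather than the ones named in the dataset description. In Spark versions before 3.0, the fields of a `Row` created with keyword arguments are also sorted by name, which is why the columns appear in alphabetical order.

Verify the DataFrame by showing its first 5 rows in a tabular format.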

In [63]:
bidsDF.show(5)

+----------+-----+--------------+-------+--------+---+--------+-------+-----+
| auctionid|  bid|        bidder|bidrate| bidtime|dtl|itemtype|openbid|price|
+----------+-----+--------------+-------+--------+---+--------+-------+-----+
|8213034705| 95.0|      jake7870|      0|2.927373|  3|    xbox|   95.0|117.5|
|8213034705|115.0| davidbresler2|      1|2.943484|  3|    xbox|   95.0|117.5|
|8213034705|100.0|gladimacowgirl|     58|2.951285|  3|    xbox|   95.0|117.5|
|8213034705|117.5|       daysrus|     10|2.998947|  3|    xbox|   95.0|117.5|
|8213060420|  2.0|    donnie4814|      5|0.065266|  3|    xbox|    1.0|120.0|
+----------+-----+--------------+-------+--------+---+--------+-------+-----+
only showing top 5 rows



### Approach 2: using DataFrame Reader for CSV files

Use the CSV reader to read the file, and try to infer schema from the source data.

In [64]:
bids = spark.read.option("inferSchema","true").csv(data_file)

Verify the schema. You'll notice that the columns have automatic names such as `_c0`.

In [13]:
bids.printSchema()

root
 |-- _c0: long (nullable = true)
 |-- _c1: double (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: integer (nullable = true)
 |-- _c5: double (nullable = true)
 |-- _c6: double (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: integer (nullable = true)



#### Attach a schema (string) at the loading time.
One way to add a schema (column names and types) is the `schema(schema_str)` method of the DataFrame Reader, where `schema_str` takes the form `"col1 INT, col2 STRING"`. For your convenience, we have provided the schema string below.

In [111]:
schema_str = """
    auctionid long, bid double, bidtime double, bidder string,
    bidrate long, openbid double, price double,  
    itemtype string, dtl long
"""
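
If the column list already lives elsewhere in your code, the same kind of string can be assembled from (name, type) pairs. This is a minimal sketch in plain Python; the names `fields` and `ddl_schema` are illustrative and not part of the lab.

```python
# (column name, Spark SQL type) pairs, in the order they appear in the file.
fields = [
    ("auctionid", "long"), ("bid", "double"), ("bidtime", "double"),
    ("bidder", "string"), ("bidrate", "long"), ("openbid", "double"),
    ("price", "double"), ("itemtype", "string"), ("dtl", "long"),
]

# Join the pairs into the "name type, name type, ..." form described above.
ddl_schema = ", ".join(f"{name} {dtype}" for name, dtype in fields)
print(ddl_schema)
```

The resulting string should be usable in place of `schema_str` when calling `spark.read.schema(...)`.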

Now recreate the dataframe `bids` using the `schema_str` and verify the schema of the resulting dataframe.

In [27]:
bids = spark.read.schema(schema_str).format("csv").load("sparkdata/auctiondata.csv")

In [28]:
bids.printSchema()

root
 |-- auctionid: long (nullable = true)
 |-- bid: double (nullable = true)
 |-- bidtime: double (nullable = true)
 |-- bidder: string (nullable = true)
 |-- bidrate: long (nullable = true)
 |-- openbid: double (nullable = true)
 |-- price: double (nullable = true)
 |-- itemtype: string (nullable = true)
 |-- dtl: long (nullable = true)



#### Attach a schema (StructType) at the loading time.

`schema()` can also take a `StructType` as its argument. In our lecture, we showed you an example of creating a `StructType` schema:

```python
from pyspark.sql.types import *
schema = StructType([    
    StructField("name", StringType(), False),
    StructField("age", IntegerType(), True)
])
```

Now it is your turn to create the StructType for our dataset and use it when you load a DataFrame from csv.

In [79]:
from pyspark.sql.types import *

# define a StructType as the schema
schema_structtype = StructType([
 StructField("auctionid",LongType(),True),
 StructField("bid",DoubleType(),True),
 StructField("bidtime",DoubleType(),True),
 StructField("bidder",StringType(),True),
 StructField("bidrate",LongType(),True),
 StructField("openbid",DoubleType(),True),
 StructField("price",DoubleType(),True),
 StructField("itemtype",StringType(),True),
 StructField("dtl",LongType(),True)
])
 

In [80]:
bids = spark.read.schema(schema_structtype).csv("sparkdata/auctiondata.csv")

Verify your DataFrame's schema and top 5 rows.

In [81]:
bids.printSchema()

root
 |-- auctionid: long (nullable = true)
 |-- bid: double (nullable = true)
 |-- bidtime: double (nullable = true)
 |-- bidder: string (nullable = true)
 |-- bidrate: long (nullable = true)
 |-- openbid: double (nullable = true)
 |-- price: double (nullable = true)
 |-- itemtype: string (nullable = true)
 |-- dtl: long (nullable = true)



In [112]:
bids.show(5)

+----------+-----+--------+--------------+-------+-------+-----+--------+---+
| auctionid|  bid| bidtime|        bidder|bidrate|openbid|price|itemtype|dtl|
+----------+-----+--------+--------------+-------+-------+-----+--------+---+
|8213034705| 95.0|2.927373|      jake7870|      0|   95.0|117.5|    xbox|  3|
|8213034705|115.0|2.943484| davidbresler2|      1|   95.0|117.5|    xbox|  3|
|8213034705|100.0|2.951285|gladimacowgirl|     58|   95.0|117.5|    xbox|  3|
|8213034705|117.5|2.998947|       daysrus|     10|   95.0|117.5|    xbox|  3|
|8213060420|  2.0|0.065266|    donnie4814|      5|    1.0|120.0|    xbox|  3|
+----------+-----+--------+--------------+-------+-------+-----+--------+---+
only showing top 5 rows



#### Add column names after you create the DataFrame

Another way is to rename the columns after you have created the DataFrame.

You can do it using `withColumnRenamed`, but that is tedious because it renames one column per call.

Alternatively, you can use the `toDF(*colnames)` method to create a new DataFrame with the given names, where `colnames` is a list of names unpacked with `*`; for two columns this amounts to `toDF("name", "age")`.

For your convenience, we have provided a list of names in `cols`. Please use it to rename the columns after you load the data (recreate the "unnamed" DataFrame if you have overwritten it).

In [87]:
cols = [
 'auctionid',
 'bid',
 'bidtime',
 'bidder',
 'bidrate',
 'openbid',
 'price',
 'itemtype',
 'dtl'
]

In [84]:
bids = spark.read.option("inferSchema",True).csv("sparkdata/auctiondata.csv")

In [88]:
bids = bids.toDF(*cols)
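
To see why `withColumnRenamed` is the more tedious route, it helps to spell out the rename pairs it would need. The sketch below only builds those pairs in plain Python. It assumes the default names follow the `_c0`, `_c1`, ... pattern shown in the inferred schema earlier, and `rename_pairs` is a name made up here.

```python
cols = ['auctionid', 'bid', 'bidtime', 'bidder', 'bidrate',
        'openbid', 'price', 'itemtype', 'dtl']

# One (old_name, new_name) pair per column, in file order.
rename_pairs = [(f"_c{i}", name) for i, name in enumerate(cols)]

for old_name, new_name in rename_pairs:
    print(old_name, "->", new_name)
```

Each pair corresponds to one `withColumnRenamed(old_name, new_name)` call, whereas `toDF(*cols)` passes all nine names at once.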

Verify your DataFrame's schema and top 5 rows.

In [113]:
bids.printSchema()

root
 |-- auctionid: long (nullable = true)
 |-- bid: double (nullable = true)
 |-- bidtime: double (nullable = true)
 |-- bidder: string (nullable = true)
 |-- bidrate: integer (nullable = true)
 |-- openbid: double (nullable = true)
 |-- price: double (nullable = true)
 |-- itemtype: string (nullable = true)
 |-- dtl: integer (nullable = true)



In [114]:
bids.show(5)

+----------+-----+--------+--------------+-------+-------+-----+--------+---+
| auctionid|  bid| bidtime|        bidder|bidrate|openbid|price|itemtype|dtl|
+----------+-----+--------+--------------+-------+-------+-----+--------+---+
|8213034705| 95.0|2.927373|      jake7870|      0|   95.0|117.5|    xbox|  3|
|8213034705|115.0|2.943484| davidbresler2|      1|   95.0|117.5|    xbox|  3|
|8213034705|100.0|2.951285|gladimacowgirl|     58|   95.0|117.5|    xbox|  3|
|8213034705|117.5|2.998947|       daysrus|     10|   95.0|117.5|    xbox|  3|
|8213060420|  2.0|0.065266|    donnie4814|      5|    1.0|120.0|    xbox|  3|
+----------+-----+--------+--------------+-------+-------+-----+--------+---+
only showing top 5 rows



## Step 3: Run Queries on your DataFrame

**Query 1**: We are interested in auction id `1645914432` and want to list its bids in descending order of bid time. Show `bid`, `bidder`, and `bidtime`. Implement this using the DataFrame API approach (i.e. using DataFrame methods such as `.select` and `.filter`).

In [99]:
bids.filter(bids.auctionid==1645914432) \
    .select("bid","bidder","bidtime").sort(bids.bidtime.desc()).show()

+-----+-----------+--------+
|  bid|     bidder| bidtime|
+-----+-----------+--------+
|511.0|gracedivine|2.645231|
|501.0|   beelprez|2.379583|
|451.0|   beelprez|2.379352|
|405.0|   beelprez| 2.37912|
|310.0|   beelprez|2.378866|
|280.0|   beelprez|2.374896|
|260.0|   beelprez|2.374711|
|250.0|   beelprez|2.374502|
|500.0| jcobb74787|2.029803|
|220.0|   beelprez|1.943183|
|210.0|    leakang|0.882593|
|200.0|   beelprez|0.433669|
+-----+-----------+--------+
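
For comparison, the same question can be written in SQL. This sketch runs the statement with Python's built-in `sqlite3` module on five rows copied from the `head` output in Step 1, purely as a stand-in so that it runs without a cluster; the auction id is therefore `8213034705` rather than the one in the question. For a query this simple, the same statement should also work with `spark.sql` once a `bids` view exists.

```python
import sqlite3

# Five rows from the Step 1 `head` output: (auctionid, bid, bidtime, bidder).
rows = [
    (8213034705, 95.0, 2.927373, "jake7870"),
    (8213034705, 115.0, 2.943484, "davidbresler2"),
    (8213034705, 100.0, 2.951285, "gladimacowgirl"),
    (8213034705, 117.5, 2.998947, "daysrus"),
    (8213060420, 2.0, 0.065266, "donnie4814"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bids (auctionid INTEGER, bid REAL, bidtime REAL, bidder TEXT)"
)
conn.executemany("INSERT INTO bids VALUES (?, ?, ?, ?)", rows)

# Filter to one auction, keep three columns, newest bid first.
query1 = """
    SELECT bid, bidder, bidtime
    FROM bids
    WHERE auctionid = 8213034705
    ORDER BY bidtime DESC
"""
result = conn.execute(query1).fetchall()
for row in result:
    print(row)
conn.close()
```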



**Query 2**: Fetch the maximum price (column name `max_price`) and the number of bids (column name `num_bids`) by item type (`itemtype`) using the `spark.sql` approach (note that you need to create a view first).

In [91]:
bids.createOrReplaceTempView("bids")

In [92]:
itemtypes = spark.sql("""
    SELECT itemtype, max(price) as max_price, count(*) as num_bids 
    from bids 
    group by itemtype
""")

In [93]:
itemtypes.show()

+--------+---------+--------+
|itemtype|max_price|num_bids|
+--------+---------+--------+
| cartier|   5400.0|    1953|
|    palm|    290.0|    5917|
|    xbox|   501.77|    2784|
+--------+---------+--------+



**Query 3**: Repeat Query 2 using the DataFrame API approach.

In [95]:
import pyspark.sql.functions as f
# f.count(col) counts the non-null values of col; any column without nulls gives the row count
bids.groupBy("itemtype").agg(
    f.max(bids.price).alias("max_price"),
    f.count(bids.price).alias("num_bids"),
).show()

+--------+---------+--------+
|itemtype|max_price|num_bids|
+--------+---------+--------+
| cartier|   5400.0|    1953|
|    palm|    290.0|    5917|
|    xbox|   501.77|    2784|
+--------+---------+--------+
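
One detail worth knowing: `count(*)` counts rows, while `count(column)` counts only the non-null values of that column. The two agree in this lab, since the counts above match those of Query 2, but they differ as soon as a column contains nulls. The sketch below shows the difference with Python's built-in `sqlite3` module on a few made-up rows (the data and the `NULL` price are invented for illustration); Spark SQL follows the same standard SQL rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bids (itemtype TEXT, price REAL)")
conn.executemany(
    "INSERT INTO bids VALUES (?, ?)",
    # Invented rows: one xbox bid has a missing (NULL) price.
    [("xbox", 117.5), ("xbox", 120.0), ("xbox", None), ("palm", 290.0)],
)

summary = conn.execute("""
    SELECT itemtype,
           max(price)   AS max_price,
           count(*)     AS num_rows,
           count(price) AS num_prices
    FROM bids
    GROUP BY itemtype
    ORDER BY itemtype
""").fetchall()
print(summary)
conn.close()
```

For `xbox`, `num_rows` is 3 but `num_prices` is 2, because the row with the missing price is skipped by `count(price)`.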



**Query 4**: For all bids on "cartier", find the maximum price for each associated auction; display auctionid, itemtype, and price. Use the DataFrame API approach.

In [104]:
maxprices = bids.select("auctionid","itemtype","price") \
    .filter(bids.itemtype=='cartier') \
    .groupBy("auctionid") \
    .max("price") \
    .withColumnRenamed("max(price)","max_price")

In [105]:
maxprices.show()

+----------+---------+
| auctionid|max_price|
+----------+---------+
|1642534283|   1125.0|
|1645542737|    386.0|
|1644077790|   1400.0|
|1644724061|    305.0|
|1647149304|   752.56|
|1640495398|   1800.0|
|1644109746|   3103.0|
|1647329406|   1000.0|
|1644752795|   1800.0|
|1642875447|    831.0|
|1644197869|    300.0|
|1647320738|   452.87|
|1649718196|   1799.0|
|1640793161|   2395.0|
|1643244227|   1025.0|
|1644343468|  1038.99|
|1642561397|   1825.0|
|1646448593|    202.5|
|1646988233|    620.0|
|1645989170|    280.0|
+----------+---------+
only showing top 20 rows



## Step 4: Write the results to CSV
You can use the Spark DataFrame Writer to write results in the format you need. Here we ask you to write the results as CSV files. Keep in mind that in big data, a dataset is usually a folder of files rather than a single file.

Save the result of your previous query (the `maxprices` DataFrame) in a folder called "maxprices" in CSV format.

In [106]:
maxprices.write.csv("maxprices")

Verify the files in the folder. **Question**: Why are there multiple files? Which ones hold your result set?

In [107]:
!ls -l maxprices/

total 51
-rw-rw-r-- 1 vagrant vagrant  0 Jul 31 15:59 part-00000-ca1de9b7-b178-4376-85a3-e6dd948f0723-c000.csv
-rw-rw-r-- 1 vagrant vagrant 18 Jul 31 15:59 part-00001-ca1de9b7-b178-4376-85a3-e6dd948f0723-c000.csv
-rw-rw-r-- 1 vagrant vagrant 35 Jul 31 15:59 part-00002-ca1de9b7-b178-4376-85a3-e6dd948f0723-c000.csv
-rw-rw-r-- 1 vagrant vagrant 17 Jul 31 15:59 part-00004-ca1de9b7-b178-4376-85a3-e6dd948f0723-c000.csv
-rw-rw-r-- 1 vagrant vagrant 18 Jul 31 15:59 part-00006-ca1de9b7-b178-4376-85a3-e6dd948f0723-c000.csv
-rw-rw-r-- 1 vagrant vagrant 36 Jul 31 15:59 part-00007-ca1de9b7-b178-4376-85a3-e6dd948f0723-c000.csv
-rw-rw-r-- 1 vagrant vagrant 18 Jul 31 15:59 part-00010-ca1de9b7-b178-4376-85a3-e6dd948f0723-c000.csv
-rw-rw-r-- 1 vagrant vagrant 18 Jul 31 15:59 part-00012-ca1de9b7-b178-4376-85a3-e6dd948f0723-c000.csv
-rw-rw-r-- 1 vagrant vagrant 70 Jul 31 15:59 part-00013-ca1de9b7-b178-4376-85a3-e6dd948f0723-c000.csv
-rw-rw-r-- 1 vagrant vagrant 18 Jul 31 15:59 part-00014-ca1de9b

**Answer**: All of the CSV files are part of the result. There are multiple files because of parallel processing: each partition of your data may write its own output file.

Combine your data into a single file.

*Hint: consider using Linux redirection (`>`) to create this single file.*

In [109]:
! cat maxprices/* > maxprices.csv

In [116]:
! head maxprices.csv

1642534283,1125.0
1645542737,386.0
1644077790,1400.0
1644724061,305.0
1647149304,752.56
1640495398,1800.0
1644109746,3103.0
1647329406,1000.0
1644752795,1800.0
1642875447,831.0
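
The same merge can be done from Python, which lets you pick up only the part files and skip marker files such as `_SUCCESS`. This is a minimal sketch using the standard library; `merge_part_files` is a hypothetical helper, the `part-*.csv` pattern is assumed from the listing above, and the demo writes small stand-in files to a temporary folder instead of touching the real `maxprices` folder.

```python
import glob
import os
import tempfile


def merge_part_files(folder, target):
    """Concatenate every part-*.csv in `folder` into `target`, in name order."""
    part_paths = sorted(glob.glob(os.path.join(folder, "part-*.csv")))
    with open(target, "w") as out:
        for path in part_paths:
            with open(path) as part:
                out.write(part.read())
    return len(part_paths)


# Demo with stand-in part files (contents copied from the output above).
demo_dir = tempfile.mkdtemp()
demo_files = [
    ("part-00000-demo.csv", ""),  # empty partitions still produce a file
    ("part-00001-demo.csv", "1642534283,1125.0\n"),
    ("part-00002-demo.csv", "1645542737,386.0\n1644077790,1400.0\n"),
    ("_SUCCESS", ""),  # marker file, not matched by the pattern
]
for name, body in demo_files:
    with open(os.path.join(demo_dir, name), "w") as handle:
        handle.write(body)

merged_path = os.path.join(demo_dir, "merged.csv")
merged_count = merge_part_files(demo_dir, merged_path)

with open(merged_path) as handle:
    merged_lines = handle.read().splitlines()
print(merged_count, merged_lines)
```

In the notebook, the call would look like `merge_part_files("maxprices", "maxprices.csv")`.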
