# Spark Lab 5 - Using Spark SQL

In this lab, we practice the critical skills of creating a proper DataFrame from a source dataset (CSV), querying it using both `spark.sql` and the DataFrame API, and saving the results to storage.

## Dataset

We will use the `auctiondata.csv` file distributed with `sparkdata.zip`. If you have not previously downloaded `sparkdata.zip`, you can download it from `http://idsdl.csom.umn.edu/c/share/sparkdata.zip` using `wget`. Alternatively, you can paste the URL into your browser and download it from there.

You can do this lab using the Cloudera VM (with the file on HDFS) or using any of the alternative Spark environments (with a local copy of the file).

Our dataset is a `.csv` file that consists of online auction data. Each auction has an auction id associated with it and can have multiple bids. Each row represents a bid. For each bid, we have the following information:

Column|Type| Description
--|--|--
`auctionid`|String| Auction ID
`bid`|Float| Bid amount
`bidtime`|Float| Time of bid from start of auction
`bidder`|String| The bidder’s userid
`bidrate`|Int| The bidder’s rating
`openbid`|Float| Opening price
`price`|Float| Final price
`itemtype`|String| Item type
`dtl`|Int| Days to live



## Step 1: Explore the data from the shell
What does the data look like? Does it have a header row? What is the delimiter?

**Tip**: you can run operating system commands in notebook cells by prefixing the command with "!", e.g.,
```bash
! ls
```
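If your environment has no shell access, a few lines of plain Python can answer the same questions. The sketch below is only one way to do it: the helper name `peek` is made up for illustration, and the path in the usage comment assumes the file sits in your working directory.

```python
import itertools

def peek(path, n=5):
    """Return the first n lines of a text file without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in itertools.islice(f, n)]

# Usage (assumed path; adjust it to wherever you unzipped sparkdata.zip):
# for line in peek("auctiondata.csv"):
#     print(line)
```

Looking at the printed lines is usually enough to tell whether the first row is a header and which character separates the fields.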

## Step 2: Create the DataFrame

There are several ways to create a new DataFrame.

### Approach 1: The RDD route

Read the text file into an RDD, convert the RDD to RDD[Row] with proper field names and data types, then convert the RDD to a DataFrame. 

This approach is useful if you are dealing with raw data (maybe unstructured) and you want to explore it before turning it into a table.
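The RDD route could be sketched as follows. This is a minimal sketch, not the reference solution: it assumes the file is comma-delimited with no header row (check this in Step 1), that `spark` is an existing `SparkSession`, and the names `FIELDS`, `CASTS`, `parse_bid`, and `bids_from_text` are made up for illustration. The field types follow the table in the Dataset section.

```python
# Field names and the Python type each field is cast to (per the Dataset table).
FIELDS = ["auctionid", "bid", "bidtime", "bidder", "bidrate",
          "openbid", "price", "itemtype", "dtl"]
CASTS = [str, float, float, str, int, float, float, str, int]

def parse_bid(line):
    """Split one comma-separated line and cast each field to its Python type."""
    values = line.strip().split(",")
    return tuple(cast(v) for cast, v in zip(CASTS, values))

def bids_from_text(spark, path):
    """Text file -> RDD of tuples -> RDD[Row] -> DataFrame."""
    from pyspark.sql import Row  # imported here so parse_bid works without Spark
    rows = (spark.sparkContext.textFile(path)
            .map(parse_bid)
            .map(lambda t: Row(**dict(zip(FIELDS, t)))))
    return spark.createDataFrame(rows)

# Usage (assumed path):
# bids = bids_from_text(spark, "auctiondata.csv")
```

Keeping the parsing in a plain function lets you try it on a single line of text before running it over the whole RDD.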


Verify the column names and data types of the DataFrame.

**Question**: Are the data types the same as what you specified? What do you think has happened?

**Answer**: The data types are not the same. Some are converted from string to long, or from float to double. Spark has internal mechanisms that convert input data types into the types best suited for Spark DataFrames; for instance, a Python `int` becomes a `long` and a Python `float` becomes a `double`.

Verify the DataFrame by showing its first 5 rows in a tabular format.

### Approach 2: using DataFrame Reader for CSV files

Use the CSV reader to read the file, and try to infer schema from the source data.

Verify the schema. You'll notice that the columns have automatic names such as `_c0`.

#### Attach a schema (string) at the loading time.
One way to add a schema (column names and types) is to use the `schema(schema_str)` method of the DataFrame Reader, where `schema_str` takes the form `"col1 INT, col2 STRING"`. For your convenience, we have provided the schema string below.

```python
schema_str = """
    auctionid long, bid double, bidtime double, bidder string,
    bidrate long, openbid double, price double,
    itemtype string, dtl long
"""
```

Now recreate the dataframe `bids` using the `schema_str` and verify the schema of the resulting dataframe.
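As a hedged sketch of this step, the reader call might look like the function below. The function name `load_bids` and the path in the usage comment are assumptions; `spark` is your existing `SparkSession`, and `schema` can be the `schema_str` defined above.

```python
def load_bids(spark, path, schema):
    """Read a CSV file with an explicit schema instead of inferring one."""
    return spark.read.schema(schema).csv(path)

# Usage (assumed path):
# bids = load_bids(spark, "auctiondata.csv", schema_str)
# bids.printSchema()
```

Because the schema is supplied up front, Spark does not need an extra pass over the data to infer column types.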

#### Attach a schema (StructType) at the loading time.

The `schema()` method can also take a `StructType` as its argument. In our lecture, we showed you an example of creating a `StructType` schema.

```python
from pyspark.sql.types import *
schema = StructType([    
    StructField("name", StringType(), False),
    StructField("age", IntegerType(), True)
])
```

Now it is your turn to create the `StructType` for our dataset and use it when you load a DataFrame from the CSV file.

Verify your DataFrame's schema and top 5 rows.

#### Add column names after you create the DataFrame

Another way is to rename the columns after you have created the DataFrame.

You can do it using `withColumnRenamed`, but that is tedious because it renames one column per call.

Alternatively, you can use the `toDF(*colnames)` function to create a new DataFrame with the given names. `toDF` takes the names as separate arguments, e.g. `toDF("name", "age")`, so a list of names must be unpacked with `*`.

For your convenience, we have provided a list of names in `cols`. Please use it to rename the columns after you load the data (recreate the "unnamed" DataFrame if you have overwritten it).

```python
cols = [
    'auctionid',
    'bid',
    'bidtime',
    'bidder',
    'bidrate',
    'openbid',
    'price',
    'itemtype',
    'dtl'
]
```

Verify your DataFrame's schema and top 5 rows.
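The rename-and-verify step could be sketched as below. The function name `rename_and_check` is made up for illustration, and `unnamed` stands for the DataFrame you loaded without column names.

```python
def rename_and_check(unnamed, names):
    """Return a copy of the DataFrame with new column names, then display it."""
    bids = unnamed.toDF(*names)  # unpack the list into separate arguments
    bids.printSchema()           # column names and data types
    bids.show(5)                 # first 5 rows in tabular form
    return bids

# Usage:
# bids = rename_and_check(unnamed, cols)
```

Note that `toDF` only changes the names; the data types stay whatever the reader produced.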

## Step 3: Run Queries on your DataFrame

**Query 1**: We are interested in auction id `1645914432` and want to list its bids in descending order of bid time. Show `bid`, `bidder`, and `bidtime`. Implement this using the DataFrame API approach (i.e. using DataFrame methods such as `.select`, `.filter`, etc.).
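One possible shape for this query is sketched below; it is not the only valid answer. The function name `bids_for_auction` is made up, and `bids` is assumed to be the DataFrame from Step 2 with the column names given in `cols`.

```python
def bids_for_auction(bids, auction_id):
    """Bids of one auction, latest bid first."""
    return (bids
            .filter(f"auctionid = {auction_id}")  # SQL-style filter expression
            .select("bid", "bidder", "bidtime")
            .orderBy("bidtime", ascending=False))

# Usage:
# bids_for_auction(bids, 1645914432).show()
```

Each method returns a new DataFrame, so the calls can be chained; nothing is computed until an action such as `show()` runs.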

**Query 2**: Fetch the maximum price (column name: `max_price`) and the number of bids (column name: `num_bids`) by item type (`itemtype`) using the `spark.sql` approach (note that you need to create a view first).
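A sketch of the SQL approach follows. The view name `bids`, the constant `QUERY`, and the function name `max_price_by_itemtype` are assumptions chosen for illustration; `spark` is your existing `SparkSession`.

```python
QUERY = """
    SELECT itemtype,
           MAX(price) AS max_price,
           COUNT(*)   AS num_bids
    FROM bids
    GROUP BY itemtype
"""

def max_price_by_itemtype(spark, bids):
    """Register the DataFrame as a temporary view, then query it with SQL."""
    bids.createOrReplaceTempView("bids")
    return spark.sql(QUERY)

# Usage:
# max_price_by_itemtype(spark, bids).show()
```

The temporary view only gives the DataFrame a name that SQL can refer to; it does not copy or persist the data.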

**Query 3**: Implement Query 2 using the DataFrame API approach.

**Query 4**: For all bids on "cartier" items, find the maximum bid price for each associated auction, and display `auctionid`, `itemtype`, and `price`. Use the DataFrame API approach.

## Step 4: Write the results to CSV
You can use the Spark DataFrame Writer to write results in the format you need. Here we ask you to write the results as CSV files. Keep in mind that in big data tools, a dataset is stored as a folder of files rather than as a single file.

Save the result of your previous query (a DataFrame) in CSV format in a folder called "maxprices".

Verify the files in the folder. **Question**: Why are there multiple files? Which one(s) contain your result set?

**Answer**: All of the CSV files are part of the result. You have multiple files because of parallel processing (each partition of your data may write its own output file).

Combine your data into a single file.

*Hint: consider using Linux redirection (">") to create this single file.*
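If you would rather stay in Python than use the shell, the part files can be concatenated with a short helper. This is a sketch under several assumptions: the output folder is on the local file system (not HDFS), the files were written without a header row (otherwise the header would repeat once per part file), and the name `merge_part_files` is made up for illustration.

```python
import glob
import os
import shutil

def merge_part_files(folder, out_path):
    """Concatenate Spark's part-*.csv files, in name order, into one file."""
    parts = sorted(glob.glob(os.path.join(folder, "part-*.csv")))
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(parts)  # number of part files merged

# Usage (assumed output name):
# merge_part_files("maxprices", "maxprices.csv")
```

The pattern matches only the data files, so marker files such as `_SUCCESS` are left out of the merged result.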