# DataFrame & Column
##### Objectives
1. Construct columns
1. Subset columns
1. Add or replace columns
1. Subset rows
1. Sort rows

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html" target="_blank">DataFrame</a>: `select`, `selectExpr`, `drop`, `withColumn`, `withColumnRenamed`, `filter`, `distinct`, `limit`, `sort`
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Column.html" target="_blank">Column</a>: `alias`, `isin`, `cast`, `isNotNull`, `desc`, operators

In [0]:
%run ./Includes/Classroom-Setup

Let's use the BedBricks events dataset.

In [0]:
eventsDF = spark.read.parquet(eventsPath)
display(eventsDF)

## Column Expressions

A <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Column.html" target="_blank">Column</a> is a logical construction that will be computed from the data in a DataFrame using an expression.

Construct a new Column based on existing columns in a DataFrame.

In [0]:
from pyspark.sql.functions import col

eventsDF.device
eventsDF["device"]
col("device")

Out[6]: Column<'device'>

Scala supports an additional syntax for creating a new Column based on existing columns in a DataFrame.

In [0]:
%scala
$"device"

### Column Operators and Methods
| Method | Description |
| --- | --- |
| \*, + , <, >= | Math and comparison operators |
| ==, != | Equality and inequality tests (Scala operators are `===` and `=!=`) |
| alias | Gives the column an alias |
| cast, astype | Casts the column to a different data type |
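| isNull, isNotNull, isNaN | Is null, is not null, is NaN (`isNaN` is the Scala Column method; in Python, use the `isnan` function from `pyspark.sql.functions`) |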
| asc, desc | Returns a sort expression based on ascending/descending order of the column |

Create complex expressions with existing columns, operators, and methods.

In [0]:
col("ecommerce.purchase_revenue_in_usd") + col("ecommerce.total_item_quantity")
col("event_timestamp").desc()
(col("ecommerce.purchase_revenue_in_usd") * 100).cast("int")

Here's an example of using these column expressions in the context of a DataFrame.

In [0]:
revDF = (eventsDF
         .filter(col("ecommerce.purchase_revenue_in_usd").isNotNull())
         .withColumn("purchase_revenue", (col("ecommerce.purchase_revenue_in_usd") * 100).cast("int"))
         .withColumn("avg_purchase_revenue", col("ecommerce.purchase_revenue_in_usd") / col("ecommerce.total_item_quantity"))
         .sort(col("avg_purchase_revenue").desc())
        )

display(revDF)

In [0]:
%sql
CREATE TABLE IF NOT EXISTS events USING parquet OPTIONS (path "/mnt/training/ecommerce/events/events.parquet");

In [0]:
eventsPath = "/mnt/training/ecommerce/events/events.parquet"
eventsDF = spark.read.format("parquet").load(eventsPath)

In [0]:
revDF = (eventsDF.filter(col("ecommerce.purchase_revenue_in_usd").isNotNull())
         .withColumn("purchase_revenue", (col("ecommerce.purchase_revenue_in_usd")*100).cast("int")) 
         .withColumn("average_purchase_revenue", (col("ecommerce.purchase_revenue_in_usd")/ col("ecommerce.total_item_quantity"))*100) 
         .sort(col("average_purchase_revenue").desc())
      )

## DataFrame Transformation Methods
| Method | Description |
| --- | --- |
| select | Returns a new DataFrame by computing given expression for each element |
| drop | Returns a new DataFrame with a column dropped |
| withColumnRenamed | Returns a new DataFrame with a column renamed |
| withColumn | Returns a new DataFrame by adding a column or replacing the existing column that has the same name |
| filter, where | Filters rows using the given condition |
| sort, orderBy | Returns a new DataFrame sorted by the given expressions |
| dropDuplicates, distinct | Returns a new DataFrame with duplicate rows removed |
| limit | Returns a new DataFrame by taking the first n rows |
| groupBy | Groups the DataFrame using the specified columns, so we can run aggregation on them |

### Subset columns
Use DataFrame transformations to subset columns.

#### `select()`
Selects a list of columns or column-based expressions.

In [0]:
devicesDF = eventsDF.select("user_id", "device")
display(devicesDF)

user_id,device
UA000000107379500,macOS
UA000000107359357,Windows
UA000000107375547,macOS
UA000000107370581,iOS
UA000000107377108,Windows
UA000000107377161,Windows
UA000000107370851,iOS
UA000000107360961,macOS
UA000000107376205,Android
UA000000107359805,Windows


In [0]:
from pyspark.sql.functions import col

locationsDF = eventsDF.select(
    "user_id", 
    col("geo.city").alias("city"), 
    col("geo.state").alias("state")
)
display(locationsDF)

user_id,city,state
UA000000107379500,Montrose,MI
UA000000107359357,Northampton,MA
UA000000107375547,Salinas,CA
UA000000107370581,Everett,MA
UA000000107377108,Cottage Grove,MN
UA000000107377161,Medina,MN
UA000000107370851,Mount Pleasant,UT
UA000000107360961,Piedmont,AL
UA000000107376205,Rancho Santa Margarita,CA
UA000000107359805,Elyria,OH


#### `selectExpr()`
Selects a list of SQL expressions.

In [0]:
appleDF = eventsDF.selectExpr("user_id", "device in ('macOS', 'iOS') as apple_user")
display(appleDF)

user_id,apple_user
UA000000107379500,True
UA000000107359357,False
UA000000107375547,True
UA000000107370581,True
UA000000107377108,False
UA000000107377161,False
UA000000107370851,True
UA000000107360961,True
UA000000107376205,False
UA000000107359805,False


#### `drop()`
Returns a new DataFrame after dropping the given column, specified as a string or Column object.

Use strings to specify multiple columns.

In [0]:
anonymousDF = eventsDF.drop("user_id", "geo", "device")
display(anonymousDF)

ecommerce,event_name,event_previous_timestamp,event_timestamp,items,traffic_source,user_first_touch_timestamp
"List(null, null, null)",warranty,1593878899217692.0,1593878946592107,List(),google,1593878899217692
"List(null, null, null)",press,1593876662175340.0,1593877011756535,List(),google,1593876662175340
"List(null, null, null)",add_item,1593878792892652.0,1593878815459100,"List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",youtube,1593878455472030
"List(null, null, null)",mattresses,1593878178791663.0,1593878809276923,List(),facebook,1593877903116176
"List(null, null, null)",mattresses,,1593878628143633,List(),google,1593878628143633
"List(null, null, null)",main,,1593878634344194,List(),youtube,1593878634344194
"List(null, null, null)",main,,1593877936171803,List(),direct,1593877936171803
"List(null, null, null)",main,,1593876843215329,List(),instagram,1593876843215329
"List(null, null, null)",warranty,1593878529774474.0,1593879213196400,List(),instagram,1593878529774474
"List(null, null, null)",main,,1593876713246514,List(),facebook,1593876713246514


In [0]:
noSalesDF = eventsDF.drop(col("ecommerce"))
display(noSalesDF)

device,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id
macOS,warranty,1593878899217692.0,1593878946592107,"List(Montrose, MI)",List(),google,1593878899217692,UA000000107379500
Windows,press,1593876662175340.0,1593877011756535,"List(Northampton, MA)",List(),google,1593876662175340,UA000000107359357
macOS,add_item,1593878792892652.0,1593878815459100,"List(Salinas, CA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",youtube,1593878455472030,UA000000107375547
iOS,mattresses,1593878178791663.0,1593878809276923,"List(Everett, MA)",List(),facebook,1593877903116176,UA000000107370581
Windows,mattresses,,1593878628143633,"List(Cottage Grove, MN)",List(),google,1593878628143633,UA000000107377108
Windows,main,,1593878634344194,"List(Medina, MN)",List(),youtube,1593878634344194,UA000000107377161
iOS,main,,1593877936171803,"List(Mount Pleasant, UT)",List(),direct,1593877936171803,UA000000107370851
macOS,main,,1593876843215329,"List(Piedmont, AL)",List(),instagram,1593876843215329,UA000000107360961
Android,warranty,1593878529774474.0,1593879213196400,"List(Rancho Santa Margarita, CA)",List(),instagram,1593878529774474,UA000000107376205
Windows,main,,1593876713246514,"List(Elyria, OH)",List(),facebook,1593876713246514,UA000000107359805


### Add or replace columns
Use DataFrame transformations to add or replace columns.

#### `withColumn()`
Returns a new DataFrame by adding a column or replacing an existing column that has the same name.

In [0]:
mobileDF = eventsDF.withColumn("mobile", col("device").isin("iOS", "Android"))
display(mobileDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id,mobile
macOS,"List(null, null, null)",warranty,1593878899217692.0,1593878946592107,"List(Montrose, MI)",List(),google,1593878899217692,UA000000107379500,False
Windows,"List(null, null, null)",press,1593876662175340.0,1593877011756535,"List(Northampton, MA)",List(),google,1593876662175340,UA000000107359357,False
macOS,"List(null, null, null)",add_item,1593878792892652.0,1593878815459100,"List(Salinas, CA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",youtube,1593878455472030,UA000000107375547,False
iOS,"List(null, null, null)",mattresses,1593878178791663.0,1593878809276923,"List(Everett, MA)",List(),facebook,1593877903116176,UA000000107370581,True
Windows,"List(null, null, null)",mattresses,,1593878628143633,"List(Cottage Grove, MN)",List(),google,1593878628143633,UA000000107377108,False
Windows,"List(null, null, null)",main,,1593878634344194,"List(Medina, MN)",List(),youtube,1593878634344194,UA000000107377161,False
iOS,"List(null, null, null)",main,,1593877936171803,"List(Mount Pleasant, UT)",List(),direct,1593877936171803,UA000000107370851,True
macOS,"List(null, null, null)",main,,1593876843215329,"List(Piedmont, AL)",List(),instagram,1593876843215329,UA000000107360961,False
Android,"List(null, null, null)",warranty,1593878529774474.0,1593879213196400,"List(Rancho Santa Margarita, CA)",List(),instagram,1593878529774474,UA000000107376205,True
Windows,"List(null, null, null)",main,,1593876713246514,"List(Elyria, OH)",List(),facebook,1593876713246514,UA000000107359805,False


In [0]:
display(eventsDF.withColumn("Cities", col("geo.city").isin("Everett", "Montrose", "Salinas"))
                .filter("Cities == True"))

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id,Cities
macOS,"List(null, null, null)",warranty,1593878899217692.0,1593878946592107,"List(Montrose, MI)",List(),google,1593878899217692,UA000000107379500,True
macOS,"List(null, null, null)",add_item,1593878792892652.0,1593878815459100,"List(Salinas, CA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",youtube,1593878455472030,UA000000107375547,True
iOS,"List(null, null, null)",mattresses,1593878178791663.0,1593878809276923,"List(Everett, MA)",List(),facebook,1593877903116176,UA000000107370581,True
macOS,"List(null, null, null)",premium,1593877223736871.0,1593877973962436,"List(Everett, WA)",List(),instagram,1593877223736871,UA000000107364368,True
Linux,"List(null, null, null)",add_item,1593876752928952.0,1593876781634367,"List(Salinas, CA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",facebook,1593876752928952,UA000000107360110,True
Chrome OS,"List(null, null, null)",pillows,1593878452222535.0,1593878532244478,"List(Salinas, CA)",List(),google,1593877167160537,UA000000107363819,True
Windows,"List(null, null, null)",pillows,,1593877490080721,"List(Salinas, CA)",List(),facebook,1593877490080721,UA000000107366824,True
macOS,"List(null, null, null)",cart,1593877578341409.0,1593877602752023,"List(Salinas, CA)","List(List(null, M_STAN_K, Standard King Mattress, 1195.0, 1195.0, 1))",facebook,1593877576180278,UA000000107367639,True
Windows,"List(null, null, null)",mattresses,,1593877497837665,"List(Montrose, CO)",List(),direct,1593877497837665,UA000000107366908,True
Windows,"List(null, null, null)",delivery,1593876601379674.0,1593876827618536,"List(Everett, WA)",List(),google,1593876601379674,UA000000107358843,True


In [0]:
display(eventsDF.filter(col("ecommerce.unique_items").isNotNull())
         .withColumn("average_purchase_revenue", ((col("ecommerce.purchase_revenue_in_usd")/col("ecommerce.total_item_quantity"))*100).cast("int"))
         .withColumn("state", col("geo.state").isin("CA"))
         .filter("state==True")
         .orderBy(col("average_purchase_revenue").desc())
        )

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id,average_purchase_revenue,state
Windows,"List(1995.0, 1, 1)",finalize,1593877957860351,1593878245298003,"List(Rancho Santa Margarita, CA)","List(List(null, M_PREM_K, Premium King Mattress, 1995.0, 1995.0, 1))",google,1593877273555818,UA000000107364837,199500,True
Android,"List(1995.0, 1, 1)",finalize,1593608158091859,1593610655395923,"List(Orange, CA)","List(List(null, M_PREM_K, Premium King Mattress, 1995.0, 1995.0, 1))",youtube,1593604357392277,UA000000106485303,199500,True
macOS,"List(1995.0, 1, 1)",finalize,1593439551062805,1593439617933160,"List(Los Angeles, CA)","List(List(null, M_PREM_K, Premium King Mattress, 1995.0, 1995.0, 1))",facebook,1593438829421906,UA000000106030732,199500,True
macOS,"List(1995.0, 1, 1)",finalize,1593164915457780,1593164958009379,"List(South El Monte, CA)","List(List(null, M_PREM_K, Premium King Mattress, 1995.0, 1995.0, 1))",google,1593163464873878,UA000000105187571,199500,True
Linux,"List(1995.0, 1, 1)",finalize,1593604474704238,1593604512139307,"List(Daly City, CA)","List(List(null, M_PREM_K, Premium King Mattress, 1995.0, 1995.0, 1))",google,1593600130952042,UA000000106474698,199500,True
macOS,"List(1995.0, 1, 1)",finalize,1593186873252444,1593187155737576,"List(Foster City, CA)","List(List(null, M_PREM_K, Premium King Mattress, 1995.0, 1995.0, 1))",google,1593181504189185,UA000000105243151,199500,True
iOS,"List(1995.0, 1, 1)",finalize,1593447348154121,1593447865950398,"List(Cupertino, CA)","List(List(null, M_PREM_K, Premium King Mattress, 1995.0, 1995.0, 1))",google,1593442146181947,UA000000106046395,199500,True
Windows,"List(1995.0, 1, 1)",finalize,1593182323731571,1593182341549092,"List(Huntington Beach, CA)","List(List(null, M_PREM_K, Premium King Mattress, 1995.0, 1995.0, 1))",facebook,1593178009281894,UA000000105226459,199500,True
Chrome OS,"List(1995.0, 1, 1)",finalize,1593266709663555,1593267001502519,"List(Perris, CA)","List(List(null, M_PREM_K, Premium King Mattress, 1995.0, 1995.0, 1))",instagram,1593263431371710,UA000000105464467,199500,True
macOS,"List(1995.0, 1, 1)",finalize,1592763824575449,1592764122740370,"List(Sunnyvale, CA)","List(List(null, M_PREM_K, Premium King Mattress, 1995.0, 1995.0, 1))",instagram,1592757832733826,UA000000104076430,199500,True


In [0]:
purchaseQuantityDF = eventsDF.withColumn("purchase_quantity", col("ecommerce.total_item_quantity").cast("int"))
purchaseQuantityDF.printSchema()

root
 |-- device: string (nullable = true)
 |-- ecommerce: struct (nullable = true)
 |    |-- purchase_revenue_in_usd: double (nullable = true)
 |    |-- total_item_quantity: long (nullable = true)
 |    |-- unique_items: long (nullable = true)
 |-- event_name: string (nullable = true)
 |-- event_previous_timestamp: long (nullable = true)
 |-- event_timestamp: long (nullable = true)
 |-- geo: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- coupon: string (nullable = true)
 |    |    |-- item_id: string (nullable = true)
 |    |    |-- item_name: string (nullable = true)
 |    |    |-- item_revenue_in_usd: double (nullable = true)
 |    |    |-- price_in_usd: double (nullable = true)
 |    |    |-- quantity: long (nullable = true)
 |-- traffic_source: string (nullable = true)
 |-- user_first_touch_timestamp: long (nullable = true)
 |-- user_id: string (nullable = true)
 |-- purchase_quantity: integer (nullable = true)

#### `withColumnRenamed()`
Returns a new DataFrame with a column renamed.

In [0]:
locationDF = eventsDF.withColumnRenamed("geo", "location")
display(locationDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,location,items,traffic_source,user_first_touch_timestamp,user_id
macOS,"List(null, null, null)",warranty,1593878899217692.0,1593878946592107,"List(Montrose, MI)",List(),google,1593878899217692,UA000000107379500
Windows,"List(null, null, null)",press,1593876662175340.0,1593877011756535,"List(Northampton, MA)",List(),google,1593876662175340,UA000000107359357
macOS,"List(null, null, null)",add_item,1593878792892652.0,1593878815459100,"List(Salinas, CA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",youtube,1593878455472030,UA000000107375547
iOS,"List(null, null, null)",mattresses,1593878178791663.0,1593878809276923,"List(Everett, MA)",List(),facebook,1593877903116176,UA000000107370581
Windows,"List(null, null, null)",mattresses,,1593878628143633,"List(Cottage Grove, MN)",List(),google,1593878628143633,UA000000107377108
Windows,"List(null, null, null)",main,,1593878634344194,"List(Medina, MN)",List(),youtube,1593878634344194,UA000000107377161
iOS,"List(null, null, null)",main,,1593877936171803,"List(Mount Pleasant, UT)",List(),direct,1593877936171803,UA000000107370851
macOS,"List(null, null, null)",main,,1593876843215329,"List(Piedmont, AL)",List(),instagram,1593876843215329,UA000000107360961
Android,"List(null, null, null)",warranty,1593878529774474.0,1593879213196400,"List(Rancho Santa Margarita, CA)",List(),instagram,1593878529774474,UA000000107376205
Windows,"List(null, null, null)",main,,1593876713246514,"List(Elyria, OH)",List(),facebook,1593876713246514,UA000000107359805


### Subset rows
Use DataFrame transformations to subset rows.

#### `filter()`
Filters rows using the given SQL expression or column-based condition.

In [0]:
purchasesDF = eventsDF.filter("ecommerce.total_item_quantity > 0")
display(purchasesDF)

In [0]:
revenueDF = eventsDF.filter(col("ecommerce.purchase_revenue_in_usd").isNotNull())
display(revenueDF)

In [0]:
androidDF = eventsDF.filter((col("traffic_source") != "direct") & (col("device") == "Android"))
display(androidDF)

#### `dropDuplicates()`
Returns a new DataFrame with duplicate rows removed, optionally considering only a subset of columns.

##### Alias: `distinct`

In [0]:
eventsDF.distinct()

In [0]:
distinctUsersDF = eventsDF.dropDuplicates(["user_id"])
display(distinctUsersDF)

#### `limit()`
Returns a new DataFrame by taking the first n rows.

In [0]:
limitDF = eventsDF.limit(100)
display(limitDF)

### Sort rows
Use DataFrame transformations to sort rows.

#### `sort()`
Returns a new DataFrame sorted by the given columns or expressions.

##### Alias: `orderBy`

In [0]:
increaseTimestampsDF = eventsDF.sort("event_timestamp")
display(increaseTimestampsDF)

In [0]:
decreaseTimestampsDF = eventsDF.sort(col("event_timestamp").desc())
display(decreaseTimestampsDF)

In [0]:
increaseSessionsDF = eventsDF.orderBy(["user_first_touch_timestamp", "event_timestamp"])
display(increaseSessionsDF)

In [0]:
decreaseSessionsDF = eventsDF.sort(col("user_first_touch_timestamp").desc(), col("event_timestamp"))
display(decreaseSessionsDF)

# Purchase Revenues Lab

Prepare dataset of events with purchase revenue.

##### Tasks
1. Extract purchase revenue for each event
2. Filter events where revenue is not null
3. Check what types of events have revenue
4. Drop unneeded column

##### Methods
- DataFrame: `select`, `drop`, `withColumn`, `filter`, `dropDuplicates`
- Column: `isNotNull`

In [0]:
eventsDF = spark.read.parquet(eventsPath)
display(eventsDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id
macOS,"List(null, null, null)",warranty,1593878899217692.0,1593878946592107,"List(Montrose, MI)",List(),google,1593878899217692,UA000000107379500
Windows,"List(null, null, null)",press,1593876662175340.0,1593877011756535,"List(Northampton, MA)",List(),google,1593876662175340,UA000000107359357
macOS,"List(null, null, null)",add_item,1593878792892652.0,1593878815459100,"List(Salinas, CA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",youtube,1593878455472030,UA000000107375547
iOS,"List(null, null, null)",mattresses,1593878178791663.0,1593878809276923,"List(Everett, MA)",List(),facebook,1593877903116176,UA000000107370581
Windows,"List(null, null, null)",mattresses,,1593878628143633,"List(Cottage Grove, MN)",List(),google,1593878628143633,UA000000107377108
Windows,"List(null, null, null)",main,,1593878634344194,"List(Medina, MN)",List(),youtube,1593878634344194,UA000000107377161
iOS,"List(null, null, null)",main,,1593877936171803,"List(Mount Pleasant, UT)",List(),direct,1593877936171803,UA000000107370851
macOS,"List(null, null, null)",main,,1593876843215329,"List(Piedmont, AL)",List(),instagram,1593876843215329,UA000000107360961
Android,"List(null, null, null)",warranty,1593878529774474.0,1593879213196400,"List(Rancho Santa Margarita, CA)",List(),instagram,1593878529774474,UA000000107376205
Windows,"List(null, null, null)",main,,1593876713246514,"List(Elyria, OH)",List(),facebook,1593876713246514,UA000000107359805


### 1. Extract purchase revenue for each event
Add new column **`revenue`** by extracting **`ecommerce.purchase_revenue_in_usd`**

In [0]:
from pyspark.sql.functions import col

In [0]:
# TODO
revenueDF = eventsDF.withColumn("revenue", col("ecommerce.purchase_revenue_in_usd"))
display(revenueDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id,revenue
macOS,"List(null, null, null)",warranty,1593878899217692.0,1593878946592107,"List(Montrose, MI)",List(),google,1593878899217692,UA000000107379500,
Windows,"List(null, null, null)",press,1593876662175340.0,1593877011756535,"List(Northampton, MA)",List(),google,1593876662175340,UA000000107359357,
macOS,"List(null, null, null)",add_item,1593878792892652.0,1593878815459100,"List(Salinas, CA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",youtube,1593878455472030,UA000000107375547,
iOS,"List(null, null, null)",mattresses,1593878178791663.0,1593878809276923,"List(Everett, MA)",List(),facebook,1593877903116176,UA000000107370581,
Windows,"List(null, null, null)",mattresses,,1593878628143633,"List(Cottage Grove, MN)",List(),google,1593878628143633,UA000000107377108,
Windows,"List(null, null, null)",main,,1593878634344194,"List(Medina, MN)",List(),youtube,1593878634344194,UA000000107377161,
iOS,"List(null, null, null)",main,,1593877936171803,"List(Mount Pleasant, UT)",List(),direct,1593877936171803,UA000000107370851,
macOS,"List(null, null, null)",main,,1593876843215329,"List(Piedmont, AL)",List(),instagram,1593876843215329,UA000000107360961,
Android,"List(null, null, null)",warranty,1593878529774474.0,1593879213196400,"List(Rancho Santa Margarita, CA)",List(),instagram,1593878529774474,UA000000107376205,
Windows,"List(null, null, null)",main,,1593876713246514,"List(Elyria, OH)",List(),facebook,1593876713246514,UA000000107359805,


**CHECK YOUR WORK**

In [0]:
expected1 = [5830.0, 5485.0, 5289.0, 5219.1, 5180.0, 5175.0, 5125.0, 5030.0, 4985.0, 4985.0]
result1 = [row.revenue for row in revenueDF.sort(col("revenue").desc_nulls_last()).limit(10).collect()]

assert(expected1 == result1)

### 2. Filter events where revenue is not null
Filter for records where **`revenue`** is not **`null`**

In [0]:
# TODO
purchasesDF = revenueDF.filter(col("revenue").isNotNull())
display(purchasesDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id,revenue
Linux,"List(1195.0, 1, 1)",finalize,1593878893766134,1593878897648871,"List(Shawnee, KS)","List(List(null, M_STAN_K, Standard King Mattress, 1195.0, 1195.0, 1))",google,1593876996316576,UA000000107362263,1195.0
iOS,"List(1045.0, 1, 1)",finalize,1593878485345763,1593878487460247,"List(Detroit, MI)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",facebook,1593877230282722,UA000000107364432,1045.0
Android,"List(595.0, 1, 1)",finalize,1593877930076602,1593878966392505,"List(East Chicago, IN)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593876889575474,UA000000107361347,595.0
iOS,"List(2290.0, 2, 2)",finalize,1593877650094042,1593877652106953,"List(Warwick, RI)","List(List(null, M_PREM_F, Premium Full Mattress, 1695.0, 1695.0, 1), List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593876687337581,UA000000107359573,2290.0
macOS,"List(945.0, 1, 1)",finalize,1593879151529456,1593879197837168,"List(Boonville, MO)","List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))",facebook,1593878603312910,UA000000107376872,945.0
Windows,"List(595.0, 1, 1)",finalize,1593877908876473,1593878020119079,"List(Hampton, VA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593877033894464,UA000000107362622,595.0
Android,"List(945.0, 1, 1)",finalize,1593878355764861,1593878641498265,"List(White Bear Lake, MN)","List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))",direct,1593877080764516,UA000000107363039,945.0
Chrome OS,"List(1095.0, 1, 1)",finalize,1593879073813036,1593879191730221,"List(San Antonio, TX)","List(List(null, M_PREM_T, Premium Twin Mattress, 1095.0, 1095.0, 1))",instagram,1593877153633764,UA000000107363715,1095.0
macOS,"List(1045.0, 1, 1)",finalize,1593877425584678,1593877429621158,"List(Searcy, AR)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",direct,1593876851338276,UA000000107361027,1045.0
iOS,"List(1045.0, 1, 1)",finalize,1593878984623390,1593879046209960,"List(Southport, IN)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",instagram,1593876574686487,UA000000107358614,1045.0


**CHECK YOUR WORK**

In [0]:
assert purchasesDF.filter(col("revenue").isNull()).count() == 0, "Nulls in 'revenue' column"

### 3. Check what types of events have revenue
Find unique **`event_name`** values in **`purchasesDF`** in one of two ways:
- Select "event_name" and get distinct records
- Drop duplicate records based on the "event_name" only

<img src="https://files.training.databricks.com/images/icon_hint_32.png" alt="Hint"> There's only one event associated with revenues

In [0]:
# TODO
distinctDF = purchasesDF.dropDuplicates(['event_name'])
display(distinctDF)

device,ecommerce,event_name,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id,revenue
Linux,"List(1195.0, 1, 1)",finalize,1593878893766134,1593878897648871,"List(Shawnee, KS)","List(List(null, M_STAN_K, Standard King Mattress, 1195.0, 1195.0, 1))",google,1593876996316576,UA000000107362263,1195.0


### 4. Drop unneeded column
Since there's only one event type, drop **`event_name`** from **`purchasesDF`**.

In [0]:
# TODO
finalDF = purchasesDF.drop(col('event_name'))
display(finalDF)

device,ecommerce,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id,revenue
Linux,"List(1195.0, 1, 1)",1593878893766134,1593878897648871,"List(Shawnee, KS)","List(List(null, M_STAN_K, Standard King Mattress, 1195.0, 1195.0, 1))",google,1593876996316576,UA000000107362263,1195.0
iOS,"List(1045.0, 1, 1)",1593878485345763,1593878487460247,"List(Detroit, MI)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",facebook,1593877230282722,UA000000107364432,1045.0
Android,"List(595.0, 1, 1)",1593877930076602,1593878966392505,"List(East Chicago, IN)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593876889575474,UA000000107361347,595.0
iOS,"List(2290.0, 2, 2)",1593877650094042,1593877652106953,"List(Warwick, RI)","List(List(null, M_PREM_F, Premium Full Mattress, 1695.0, 1695.0, 1), List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593876687337581,UA000000107359573,2290.0
macOS,"List(945.0, 1, 1)",1593879151529456,1593879197837168,"List(Boonville, MO)","List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))",facebook,1593878603312910,UA000000107376872,945.0
Windows,"List(595.0, 1, 1)",1593877908876473,1593878020119079,"List(Hampton, VA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593877033894464,UA000000107362622,595.0
Android,"List(945.0, 1, 1)",1593878355764861,1593878641498265,"List(White Bear Lake, MN)","List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))",direct,1593877080764516,UA000000107363039,945.0
Chrome OS,"List(1095.0, 1, 1)",1593879073813036,1593879191730221,"List(San Antonio, TX)","List(List(null, M_PREM_T, Premium Twin Mattress, 1095.0, 1095.0, 1))",instagram,1593877153633764,UA000000107363715,1095.0
macOS,"List(1045.0, 1, 1)",1593877425584678,1593877429621158,"List(Searcy, AR)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",direct,1593876851338276,UA000000107361027,1045.0
iOS,"List(1045.0, 1, 1)",1593878984623390,1593879046209960,"List(Southport, IN)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",instagram,1593876574686487,UA000000107358614,1045.0


**CHECK YOUR WORK**

In [0]:
expected_columns = {"device", "ecommerce", "event_previous_timestamp", "event_timestamp",
                    "geo", "items", "revenue", "traffic_source",
                    "user_first_touch_timestamp", "user_id"}
assert(set(finalDF.columns) == expected_columns)

### 5. Chain all the steps above excluding step 3

In [0]:
# TODO
finalDF = (eventsDF
                 .filter(col("ecommerce.purchase_revenue_in_usd").isNotNull())
                 .withColumn("revenue", col("ecommerce.purchase_revenue_in_usd"))
                 .drop(col("event_name"))  
          )

display(finalDF)

device,ecommerce,event_previous_timestamp,event_timestamp,geo,items,traffic_source,user_first_touch_timestamp,user_id,revenue
Linux,"List(1195.0, 1, 1)",1593878893766134,1593878897648871,"List(Shawnee, KS)","List(List(null, M_STAN_K, Standard King Mattress, 1195.0, 1195.0, 1))",google,1593876996316576,UA000000107362263,1195.0
iOS,"List(1045.0, 1, 1)",1593878485345763,1593878487460247,"List(Detroit, MI)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",facebook,1593877230282722,UA000000107364432,1045.0
Android,"List(595.0, 1, 1)",1593877930076602,1593878966392505,"List(East Chicago, IN)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593876889575474,UA000000107361347,595.0
iOS,"List(2290.0, 2, 2)",1593877650094042,1593877652106953,"List(Warwick, RI)","List(List(null, M_PREM_F, Premium Full Mattress, 1695.0, 1695.0, 1), List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593876687337581,UA000000107359573,2290.0
macOS,"List(945.0, 1, 1)",1593879151529456,1593879197837168,"List(Boonville, MO)","List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))",facebook,1593878603312910,UA000000107376872,945.0
Windows,"List(595.0, 1, 1)",1593877908876473,1593878020119079,"List(Hampton, VA)","List(List(null, M_STAN_T, Standard Twin Mattress, 595.0, 595.0, 1))",google,1593877033894464,UA000000107362622,595.0
Android,"List(945.0, 1, 1)",1593878355764861,1593878641498265,"List(White Bear Lake, MN)","List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))",direct,1593877080764516,UA000000107363039,945.0
Chrome OS,"List(1095.0, 1, 1)",1593879073813036,1593879191730221,"List(San Antonio, TX)","List(List(null, M_PREM_T, Premium Twin Mattress, 1095.0, 1095.0, 1))",instagram,1593877153633764,UA000000107363715,1095.0
macOS,"List(1045.0, 1, 1)",1593877425584678,1593877429621158,"List(Searcy, AR)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",direct,1593876851338276,UA000000107361027,1045.0
iOS,"List(1045.0, 1, 1)",1593878984623390,1593879046209960,"List(Southport, IN)","List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))",instagram,1593876574686487,UA000000107358614,1045.0


**CHECK YOUR WORK**

In [0]:
assert(finalDF.count() == 180678)

In [0]:
expected_columns = {"device", "ecommerce", "event_previous_timestamp", "event_timestamp",
                    "geo", "items", "revenue", "traffic_source",
                    "user_first_touch_timestamp", "user_id"}
assert(set(finalDF.columns) == expected_columns)

### Clean up classroom

In [0]:
%run ./Includes/Classroom-Cleanup



&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>