
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Abandoned Carts Lab
Find the cart items abandoned by users (identified by email) who never completed a purchase.
1. Get emails of converted users from transactions
2. Join emails with user IDs
3. Get cart item history for each user
4. Join cart item history with emails
5. Filter for emails with abandoned cart items

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html" target="_blank">DataFrame</a>: **`join`**
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html?#functions" target="_blank">Built-In Functions</a>: **`collect_set`**, **`explode`**, **`lit`**
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameNaFunctions.html" target="_blank">DataFrameNaFunctions</a>: **`fill`**

### Setup
Run the cells below to create DataFrames **`sales_df`**, **`users_df`**, and **`events_df`**.

In [0]:
%run ../Includes/Classroom-Setup

Deleted the working directory dbfs:/user/si@elisaandgeeks.com/dbacademy/aspwd/asp_3_4l_abandoned_carts_lab


Your working directory is
dbfs:/user/si@elisaandgeeks.com/dbacademy/aspwd

The source for this dataset is
wasbs://courseware@dbacademy.blob.core.windows.net/apache-spark-programming-with-databricks/v02/

Skipping install of existing dataset to
dbfs:/user/si@elisaandgeeks.com/dbacademy/aspwd/datasets


Out[5]: DataFrame[key: string, value: string]

In [0]:
# sale transactions at BedBricks
sales_df = spark.read.format("delta").load(sales_path)
#display(sales_df)

In [0]:
# user IDs and emails at BedBricks
users_df = spark.read.format("delta").load(users_path)
#display(users_df)

In [0]:
# events logged on the BedBricks website
events_df = spark.read.format("delta").load(events_path)
#display(events_df)

### 1: Get emails of converted users from transactions
- Select the **`email`** column in **`sales_df`** and remove duplicates
- Add a new column **`converted`** with the value **`True`** for all rows

Save the result as **`converted_users_df`**.
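
As a rough illustration of what this step asks for, the plain-Python sketch below performs the same two operations on a tiny, made-up list of sale records. The sample rows and the names `sales_rows`, `unique_emails`, and `converted_users` are assumptions for illustration only, not part of the lab data. In Spark the corresponding calls are `select`, `distinct`, and `lit`.

```python
# Hypothetical sale records; several sales can share one email.
sales_rows = [
    {"email": "a@example.com", "order_id": 1},
    {"email": "b@example.com", "order_id": 2},
    {"email": "a@example.com", "order_id": 3},
]

# Keep one entry per email (similar to select("email").distinct()),
# preserving first-seen order for readability.
unique_emails = list(dict.fromkeys(row["email"] for row in sales_rows))

# Attach a constant column (similar to lit(True).alias("converted")).
converted_users = [{"email": e, "converted": True} for e in unique_emails]
```

Because the constant is the same for every row, adding it before or after deduplication gives the same result.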

In [0]:
# Deduplicate purchaser emails and flag each one as converted
from pyspark.sql.functions import col, collect_set, explode, lit

converted_users_df = (sales_df
                      .select("email", lit(True).alias("converted"))
                      .distinct()
                     )
#display(converted_users_df)

#### 1.1: Check Your Work

Run the following cell to verify that your solution works:

In [0]:
expected_columns = ["email", "converted"]

expected_count = 210370

assert converted_users_df.columns == expected_columns, "converted_users_df does not have the correct columns"

assert converted_users_df.count() == expected_count, "converted_users_df does not have the correct number of rows"

assert converted_users_df.select(col("converted")).first()[0] == True, "converted column not correct"
print("All tests pass")

All tests pass


### 2: Join emails with user IDs
- Perform an outer join of **`converted_users_df`** and **`users_df`** on the **`email`** field
- Filter for users where **`email`** is not null
- Fill null values in **`converted`** with **`False`**

Save the result as **`conversions_df`**.
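
The sketch below walks through the same join, filter, and fill logic in plain Python, to show why nulls appear and where they go. All rows and the helper `outer_join_on_email` are hypothetical and written only for this illustration; the real lab uses the DataFrame `join` and `na.fill` methods.

```python
# Hypothetical toy inputs (not the lab data).
users = [
    {"email": "a@example.com", "user_id": "U1"},
    {"email": "b@example.com", "user_id": "U2"},
    {"email": None, "user_id": "U3"},  # anonymous visitor without an email
]
converted_users = [
    {"email": "a@example.com", "converted": True},
    {"email": "z@example.com", "converted": True},  # purchaser with no user row
]

def outer_join_on_email(left, right):
    """Full outer join on 'email'; null emails never match, as in SQL."""
    right_by_email = {r["email"]: r for r in right}
    matched = set()
    joined = []
    for l in left:
        r = right_by_email.get(l["email"]) if l["email"] is not None else None
        if r is not None:
            matched.add(r["email"])
        joined.append({
            "email": l["email"],
            "user_id": l["user_id"],
            "converted": r["converted"] if r is not None else None,
        })
    # Right-side rows with no partner get nulls for the left-side columns.
    for r in right:
        if r["email"] not in matched:
            joined.append({"email": r["email"], "user_id": None,
                           "converted": r["converted"]})
    return joined

# Drop rows without an email, then replace null 'converted' values with False.
conversions = [
    {**row, "converted": False if row["converted"] is None else row["converted"]}
    for row in outer_join_on_email(users, converted_users)
    if row["email"] is not None
]
```

In this toy run, `b@example.com` has no purchase and ends up with `converted` set to `False`, while the anonymous visitor is removed by the email filter.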

In [0]:
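# Outer-join users with purchasers, drop rows lacking an email,
# and mark users without a purchase as not converted
conversions_df = (users_df
                  .join(other=converted_users_df, on="email", how="outer")
                  .filter("email is not null")
                  .na
                  .fill(False, subset=["converted"])
                 )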
#display(conversions_df)

#### 2.1: Check Your Work

Run the following cell to verify that your solution works:

In [0]:
expected_columns = ["email", "user_id", "user_first_touch_timestamp", "converted"]

expected_count = 782749

expected_false_count = 572379

assert conversions_df.columns == expected_columns, "Columns are not correct"

assert conversions_df.filter(col("email").isNull()).count() == 0, "Email column contains null"

assert conversions_df.count() == expected_count, "There is an incorrect number of rows"

assert conversions_df.filter(col("converted") == False).count() == expected_false_count, "There is an incorrect number of false entries in converted column"
print("All tests pass")

All tests pass


### 3: Get cart item history for each user
- Explode the **`items`** field in **`events_df`** with the results replacing the existing **`items`** field
- Group by **`user_id`**
  - Collect a set of all **`items.item_id`** values for each user and alias the column to **`cart`**

Save the result as **`carts_df`**.
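
To show what exploding and then collecting a set does to the shape of the data, here is a small plain-Python sketch. The events and item IDs in it are invented for illustration, and the variable names `events`, `exploded`, and `carts` are assumptions rather than lab objects.

```python
from collections import defaultdict

# Hypothetical events: each has a user and a list of item structs.
events = [
    {"user_id": "U1", "items": [{"item_id": "M_STAN_K"}, {"item_id": "P_DOWN_S"}]},
    {"user_id": "U1", "items": [{"item_id": "M_STAN_K"}]},
    {"user_id": "U2", "items": []},  # an empty array yields no exploded rows
]

# "Explode": one output row per (event, item) pair, replacing the list
# in 'items' with a single item.
exploded = [
    {"user_id": e["user_id"], "items": item}
    for e in events
    for item in e["items"]
]

# Group by user and keep distinct item ids (similar to collect_set).
carts = defaultdict(set)
for row in exploded:
    carts[row["user_id"]].add(row["items"]["item_id"])
```

Note that `U2` does not appear in `carts` at all: Spark's `explode` likewise drops rows whose array is empty or null, which is why the later left join produces null carts for some users.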

In [0]:
# One row per item, then one set of distinct item IDs per user
carts_df = (events_df
            .withColumn("items", explode("items"))
            .groupBy("user_id")
            .agg(collect_set("items.item_id").alias("cart"))
           )
#display(carts_df)

#### 3.1: Check Your Work

Run the following cell to verify that your solution works:

In [0]:
expected_columns = ["user_id", "cart"]

expected_count = 488403

assert carts_df.columns == expected_columns, "Incorrect columns"

assert carts_df.count() == expected_count, "Incorrect number of rows"

assert carts_df.select(col("user_id")).drop_duplicates().count() == expected_count, "Duplicate user_ids present"
print("All tests pass")

All tests pass


### 4: Join cart item history with emails
- Perform a left join of **`conversions_df`** with **`carts_df`** on the **`user_id`** field

Save the result as **`email_carts_df`**.
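
A left join keeps every row from the left side and fills in nulls where the right side has no match. The plain-Python sketch below shows that behavior on made-up data; the rows and the dictionary `carts` are hypothetical stand-ins, not the lab DataFrames.

```python
# Hypothetical toy inputs (not the lab data).
conversions = [
    {"user_id": "U1", "email": "a@example.com", "converted": True},
    {"user_id": "U2", "email": "b@example.com", "converted": False},
]
carts = {"U2": {"M_STAN_K"}}  # user_id -> set of item ids

# Left join: keep every conversions row; 'cart' is None when no match exists.
email_carts = [{**row, "cart": carts.get(row["user_id"])} for row in conversions]
```

The row count stays the same as the left input, which matches the check below expecting the same count as **`conversions_df`**.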

In [0]:
# TODO
email_carts_df = conversions_df.join(other=carts_df, on="user_id", how="left")

#display(email_carts_df)

#### 4.1: Check Your Work

Run the following cell to verify that your solution works:

In [0]:
expected_columns = ["user_id", "email", "user_first_touch_timestamp", "converted", "cart"]

expected_count = 782749

expected_cart_null_count = 397799

assert email_carts_df.columns == expected_columns, "Columns do not match"

assert email_carts_df.count() == expected_count, "Counts do not match"

assert email_carts_df.filter(col("cart").isNull()).count() == expected_cart_null_count, "Cart null counts incorrect from join"
print("All tests pass")

All tests pass


### 5: Filter for emails with abandoned cart items
- Filter **`email_carts_df`** for users where **`converted`** is **`False`**
- Filter for users with non-null carts

Save the result as **`abandoned_carts_df`**.
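
The two filters can be read as a single condition: the user never purchased and the user has a cart. A minimal plain-Python sketch on invented rows (the data and variable names are assumptions for illustration):

```python
# Hypothetical joined rows (not the lab data).
email_carts = [
    {"email": "a@example.com", "converted": True, "cart": ["M_STAN_K"]},
    {"email": "b@example.com", "converted": False, "cart": ["M_STAN_Q", "P_DOWN_S"]},
    {"email": "c@example.com", "converted": False, "cart": None},
]

# Keep rows where the user did not convert AND a cart exists.
abandoned_carts = [
    row for row in email_carts
    if row["converted"] is False and row["cart"] is not None
]
```

Only `b@example.com` survives here: the first user purchased, and the third never added anything to a cart.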

In [0]:
# Keep non-converted users who have at least one cart item
abandoned_carts_df = (email_carts_df
                      .filter("converted is False")
                      .filter("cart is not null")
                     )
#display(abandoned_carts_df)

#### 5.1: Check Your Work

Run the following cell to verify that your solution works:

In [0]:
expected_columns = ["user_id", "email", "user_first_touch_timestamp", "converted", "cart"]

expected_count = 204272

assert abandoned_carts_df.columns == expected_columns, "Columns do not match"

assert abandoned_carts_df.count() == expected_count, "Counts do not match"
print("All tests pass")

All tests pass


### 6: Bonus Activity
Plot the number of abandoned cart items by product.
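
Before plotting, the carts have to be flattened and counted per product. The sketch below shows that counting step in plain Python with `collections.Counter`; the carts are made up for illustration and the names `item_counts` and `sorted_counts` are assumptions.

```python
from collections import Counter

# Hypothetical abandoned carts (not the lab data).
abandoned_carts = [
    {"email": "b@example.com", "cart": ["M_STAN_Q", "P_DOWN_S"]},
    {"email": "d@example.com", "cart": ["M_STAN_Q"]},
]

# Flatten every cart into single items (similar to explode), then count
# how many carts contain each product (similar to groupBy(...).count()).
item_counts = Counter(item for row in abandoned_carts for item in row["cart"])

# Sort by product id so the result is stable, as sort("items") does.
sorted_counts = sorted(item_counts.items())
```

The resulting pairs of product ID and count are the kind of data a bar chart of abandoned items would be drawn from.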

In [0]:
# TODO
abandoned_items_df = (abandoned_carts_df
                      .withColumn("items", explode("cart"))
                      .groupBy("items")
                      .count()
                      .sort("items")
                     )
display(abandoned_items_df)

items,count
M_PREM_F,6363
M_PREM_K,9839
M_PREM_Q,11976
M_PREM_T,12527
M_STAN_F,25761
M_STAN_K,38765
M_STAN_Q,47008
M_STAN_T,50315
P_DOWN_K,2987
P_DOWN_S,6917


#### 6.1: Check Your Work

Run the following cell to verify that your solution works:

In [0]:
abandoned_items_df.count()

Out[28]: 12

In [0]:
expected_columns = ["items", "count"]

expected_count = 12

assert abandoned_items_df.count() == expected_count, "Counts do not match"

assert abandoned_items_df.columns == expected_columns, "Columns do not match"
print("All tests pass")

All tests pass


### Clean up classroom

In [0]:
classroom_cleanup()

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>