
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>



# Active Users Lab
Plot daily active users and average active users by day of week.
1. Extract timestamp and date of events
2. Get daily active users
3. Get average number of active users by day of week
4. Sort day of week in correct order

In [0]:
%run ../Includes/Classroom-Setup

Python interpreter will be restarted.


Resetting the learning environment:
| No action taken

Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/apache-spark-programming-with-databricks/v03"

Validating the locally installed datasets:
| listing local files...(3 seconds)
| validation completed...(3 seconds total)

Creating & using the schema "labuser6023680_cng7_da_asp" in the catalog "hive_metastore"...(0 seconds)

Predefined tables in "labuser6023680_cng7_da_asp":
| -none-

Predefined paths variables:
| DA.paths.working_dir: dbfs:/mnt/dbacademy-users/labuser6023680@vocareum.com/apache-spark-programming-with-databricks
| DA.paths.user_db:     dbfs:/mnt/dbacademy-users/labuser6023680@vocareum.com/apache-spark-programming-with-databricks/database.db
| DA.paths.datasets:    dbfs:/mnt/dbacademy-datasets/apache-spark-programming-with-databricks/v03
| DA.paths.checkpoints: dbfs:/mnt/dbacademy-users/labuser6023680@vocareum.com/apache-spark-programming-with-databricks/_checkpoints

Setup completed (5 seconds)




### Setup
Run the cell below to create the starting DataFrame of user IDs and timestamps of events logged on the BedBricks website.

In [0]:
from pyspark.sql.functions import col

df = (spark
      .read
      .format("delta")
      .load(DA.paths.events)
      .select("user_id", col("event_timestamp").alias("ts"))
     )

display(df)

user_id,ts
UA000000107379500,1593878946592107
UA000000107359357,1593877011756535
UA000000107375547,1593878815459100
UA000000107370581,1593878809276923
UA000000107377108,1593878628143633
UA000000107377161,1593878634344194
UA000000107370851,1593877936171803
UA000000107360961,1593876843215329
UA000000107376205,1593879213196400
UA000000107359805,1593876713246514




### 1. Extract timestamp and date of events
- Convert **`ts`** from microseconds to seconds by dividing by 1 million, then cast the result to a timestamp
- Add a **`date`** column by converting **`ts`** to a date
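To make the arithmetic concrete, here is a minimal sketch of the same conversion in plain Python (standard library only, no Spark), applied to the first **`ts`** value from the sample output above. It is an illustration rather than the lab solution: it splits the value with integer division instead of dividing by a float, and the variable names are chosen for this example only.

```python
from datetime import datetime, timedelta, timezone

# First `ts` value from the sample output above: microseconds since the Unix epoch.
ts_micros = 1593878946592107

# Split into whole seconds and leftover microseconds so no precision is lost to floats.
seconds, micros = divmod(ts_micros, 1_000_000)

# Build a timezone-aware UTC timestamp, then truncate it to a calendar date.
ts = datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(microseconds=micros)
date = ts.date()

print(ts.isoformat())  # 2020-07-04T16:09:06.592107+00:00
print(date)            # 2020-07-04
```

Without the division by 1 million, the raw value would be read as seconds and land far outside the expected date range, which is why the cast to timestamp comes after the unit conversion.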

In [0]:
# Convert ts from microseconds to seconds, cast to timestamp, then derive the date
from pyspark.sql.functions import to_date

datetime_df = (df
               .withColumn("ts", (col("ts") / 1e6).cast("timestamp"))
               .withColumn("date", to_date("ts"))
              )
              
display(datetime_df)

user_id,ts,date
UA000000106459980,2020-07-01T06:33:33.296+0000,2020-07-01
UA000000106546041,2020-07-01T15:38:10.744+0000,2020-07-01
UA000000106556702,2020-07-01T16:17:02.994+0000,2020-07-01
UA000000106525232,2020-07-01T14:34:49.359+0000,2020-07-01
UA000000106502389,2020-07-01T13:13:07.617+0000,2020-07-01
UA000000106476093,2020-07-01T10:54:59.397+0000,2020-07-01
UA000000106528363,2020-07-01T14:43:56.012+0000,2020-07-01
UA000000106492536,2020-07-01T12:47:45.186+0000,2020-07-01
UA000000106522577,2020-07-01T14:25:28.140+0000,2020-07-01
UA000000106514480,2020-07-01T14:12:30.857+0000,2020-07-01





**1.1: CHECK YOUR WORK**

In [0]:
from pyspark.sql.types import DateType, StringType, StructField, StructType, TimestampType

expected1a = StructType([StructField("user_id", StringType(), True),
                         StructField("ts", TimestampType(), True),
                         StructField("date", DateType(), True)])

result1a = datetime_df.schema

assert expected1a == result1a, "datetime_df does not have the expected schema"
print("All tests pass")

All tests pass


In [0]:
import datetime

expected1b = datetime.date(2020, 6, 19)
result1b = datetime_df.sort("date").first().date

assert expected1b == result1b, "datetime_df does not have the expected date values"
print("All tests pass")

All tests pass




### 2. Get daily active users
- Group by date
- Aggregate approximate count of distinct **`user_id`** and alias to "active_users"
  - Use the built-in function **`approx_count_distinct`**. It returns an estimate of the number of distinct values, so its result can differ from an exact count distinct.
- Sort by date
- Plot as line graph
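As a reference point for what the approximate aggregate is estimating, the sketch below computes the exact number of distinct users per day in plain Python, using one set per date. The events are hypothetical and invented for this illustration. Spark's `approx_count_distinct` trades this exactness for lower memory use on large data, and it accepts an optional `rsd` argument that bounds the relative error.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (date, user_id) events; "UA2" appears twice on the first day.
events = [
    (date(2020, 6, 19), "UA1"),
    (date(2020, 6, 19), "UA2"),
    (date(2020, 6, 19), "UA2"),
    (date(2020, 6, 20), "UA1"),
    (date(2020, 6, 20), "UA3"),
    (date(2020, 6, 20), "UA4"),
]

# Group by date, keeping one set of user IDs per day (sets drop duplicates).
users_by_date = defaultdict(set)
for day, user_id in events:
    users_by_date[day].add(user_id)

# Exact distinct count per day, sorted by date.
active_users = sorted((day, len(users)) for day, users in users_by_date.items())

for day, count in active_users:
    print(day, count)
```

Keeping a full set per group is what becomes expensive at scale, which is the motivation for using an approximate count in the lab.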

In [0]:
# Count approximate distinct users per date, sorted chronologically
from pyspark.sql.functions import approx_count_distinct

active_users_df = (datetime_df
                   .groupBy("date")
                   .agg(approx_count_distinct("user_id").alias("active_users"))
                   .sort("date")
                  )
display(active_users_df)

date,active_users
2020-06-19,251573
2020-06-20,357215
2020-06-21,305055
2020-06-22,239094
2020-06-23,243117
2020-06-24,235205
2020-06-25,246548
2020-06-26,245022
2020-06-27,301330
2020-06-28,260756





**2.1: CHECK YOUR WORK**

In [0]:
from pyspark.sql.types import LongType

expected2a = StructType([StructField("date", DateType(), True),
                         StructField("active_users", LongType(), False)])

result2a = active_users_df.schema

assert expected2a == result2a, "active_users_df does not have the expected schema"
print("All tests pass")

All tests pass


In [0]:
expected2b = [(datetime.date(2020, 6, 19), 251573), (datetime.date(2020, 6, 20), 357215), (datetime.date(2020, 6, 21), 305055), (datetime.date(2020, 6, 22), 239094), (datetime.date(2020, 6, 23), 243117)]

result2b = [(row.date, row.active_users) for row in active_users_df.orderBy("date").take(5)]

assert expected2b == result2b, "active_users_df does not have the expected values"
print("All tests pass")

All tests pass




### 3. Get average number of active users by day of week
- Add a **`day`** column by extracting the day of week from **`date`** with a datetime pattern string. The expected output is a day name, not a number (e.g. **`Mon`**, not **`1`**)
- Group by **`day`**
- Aggregate average of **`active_users`** and alias to "avg_users"
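The steps above can be sketched in plain Python (no Spark) to show the logic of deriving a day name and averaging per day. The input is only the ten rows displayed in the step 2 output, so the averages here are not expected to match the lab result computed over the full dataset. `DAY_NAMES` and the other names are chosen for this illustration.

```python
from collections import defaultdict
from datetime import date

# date.weekday() returns 0 for Monday through 6 for Sunday.
DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# The ten (date, active_users) rows shown in the step 2 output.
daily = [
    (date(2020, 6, 19), 251573), (date(2020, 6, 20), 357215),
    (date(2020, 6, 21), 305055), (date(2020, 6, 22), 239094),
    (date(2020, 6, 23), 243117), (date(2020, 6, 24), 235205),
    (date(2020, 6, 25), 246548), (date(2020, 6, 26), 245022),
    (date(2020, 6, 27), 301330), (date(2020, 6, 28), 260756),
]

# Accumulate [sum, count] per day name, then divide to get the average.
totals = defaultdict(lambda: [0, 0])
for d, users in daily:
    bucket = totals[DAY_NAMES[d.weekday()]]
    bucket[0] += users
    bucket[1] += 1

avg_users = {day: total / n for day, (total, n) in totals.items()}

for day, avg in avg_users.items():
    print(day, avg)
```

In the Spark version, the pattern string passed to `date_format` plays the role of the `DAY_NAMES` lookup, and `groupBy` with `avg` replaces the manual accumulation.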

In [0]:
# Derive the day name from each date, then average active users per day of week
from pyspark.sql.functions import date_format, avg

active_dow_df = (active_users_df
                 .withColumn("day", date_format(col("date"), "E"))
                 .groupBy("day")
                 .agg(avg(col("active_users")).alias("avg_users"))
                )
display(active_dow_df)

day,avg_users
Sun,282905.5
Mon,238195.5
Thu,264620.0
Sat,278482.0
Wed,227214.0
Fri,247180.66666666663
Tue,260942.5





**3.1: CHECK YOUR WORK**

In [0]:
from pyspark.sql.types import DoubleType

expected3a = StructType([StructField("day", StringType(), True),
                         StructField("avg_users", DoubleType(), True)])

result3a = active_dow_df.schema

assert expected3a == result3a, "active_dow_df does not have the expected schema"
print("All tests pass")

All tests pass


In [0]:
expected3b = [("Fri", 247180.66666666666), ("Mon", 238195.5), ("Sat", 278482.0), ("Sun", 282905.5), ("Thu", 264620.0), ("Tue", 260942.5), ("Wed", 227214.0)]

result3b = [(row.day, row.avg_users) for row in active_dow_df.sort("day").collect()]

assert expected3b == result3b, "active_dow_df does not have the expected values"
print("All tests pass")

All tests pass




### Clean up classroom

In [0]:
DA.cleanup()

Resetting the learning environment:
| dropping the schema "labuser6023680_cng7_da_asp"...(1 seconds)
| removing the working directory "dbfs:/mnt/dbacademy-users/labuser6023680@vocareum.com/apache-spark-programming-with-databricks"...(0 seconds)

Validating the locally installed datasets:
| listing local files...(3 seconds)
| validation completed...(3 seconds total)


&copy; 2023 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>