# Breakfast at the Frat: A Time Series Analysis

This dataset contains sales and promotion information on the top five products from each of the top three brands within four selected categories (mouthwash, pretzels, frozen pizza, and boxed cereal), gathered from a sample of stores over 156 weeks. It includes:

- Unit sales, households, visits, and spend data by product, store, and week
- Base Price and Actual Shelf Price, to determine a product’s discount, if any
- Promotional support details (e.g., sale tag, in-store display), if applicable for the given product/store/week
- Store information, including size and location, as well as a price tier designation (e.g., upscale vs. value)
- Product information, including UPC, size, and description
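
As a minimal sketch of how the base price and the actual shelf price combine into a discount, the helper below computes the absolute and relative markdown. The function name `discount` is hypothetical and not part of the dataset or of Spark; the two input values are copied from one row of the transaction sample in this notebook.

```python
def discount(base_price: float, price: float) -> tuple[float, float]:
    """Return (absolute discount, discount as a fraction of the base price)."""
    amount = round(base_price - price, 2)
    # Guard against a zero base price instead of dividing by it.
    fraction = amount / base_price if base_price else 0.0
    return amount, fraction

# BASE_PRICE 1.57 and PRICE 1.39 come from a sample transaction row.
amount, fraction = discount(1.57, 1.39)
print(f"discount: {amount:.2f} ({fraction:.1%} off base price)")
# prints: discount: 0.18 (11.5% off base price)
```

A negative amount would mean the shelf price is above the base price, which may be worth inspecting rather than clamping to zero.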

To identify outliers, consider looking at:

- The ratio of units to visits
- The ratio of visits to households
- Items that may be out of stock or discontinued at a given store
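
The two ratios above can be sketched in plain Python before porting them to Spark. The cut-off values and the names `ratios` and `is_outlier` are assumptions made for illustration; the rows are the three sample transactions shown in this notebook, with the counts already converted to integers.

```python
# (store_num, upc, units, visits, hhs) from the transaction sample
rows = [
    (367, "1111009477", 13, 13, 13),
    (367, "1111009497", 20, 18, 18),
    (367, "1111009507", 14, 14, 14),
]

# Hypothetical cut-offs; tune them against the real distribution of each ratio.
MAX_UNITS_PER_VISIT = 5.0
MAX_VISITS_PER_HOUSEHOLD = 3.0

def ratios(units, visits, hhs):
    """Return (units per visit, visits per household); None when undefined."""
    units_per_visit = units / visits if visits else None
    visits_per_hh = visits / hhs if hhs else None
    return units_per_visit, visits_per_hh

def is_outlier(units, visits, hhs):
    upv, vph = ratios(units, visits, hhs)
    if upv is None or vph is None:
        return True  # a recorded row with zero visits or households is suspicious
    return upv > MAX_UNITS_PER_VISIT or vph > MAX_VISITS_PER_HOUSEHOLD

flagged = [(store, upc) for store, upc, u, v, h in rows if is_outlier(u, v, h)]
print(flagged)
```

Out-of-stock or discontinued items would need a different check, for example looking for weeks in which a store/product pair has no row at all.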

**Source:** https://www.dunnhumby.com/source-files/

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder \
    .appName("breakfast") \
    .getOrCreate()

In [3]:
product_data_folder = "dataset/products"
store_data_folder = "dataset/stores"
transaction_data_folder = "dataset/transactions"

### Perform ETL to Answer the Following Questions

1. What is the range of prices offered on products?
1. What is the impact of promotions on units per visit, by geography?
1. For which products would you lower the price to increase sales?
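
For the second question, one possible starting point is to total units and visits per promotion flag and divide. The sketch below does this in plain Python on the three sample transaction rows from this notebook, grouping only on `TPR_ONLY`; a geographic key from the store table would be added to the grouping key, and is left out here because the store columns are not shown.

```python
from collections import defaultdict

# (units, visits, tpr_only) from the three sample transaction rows
rows = [(13, 13, 1), (20, 18, 0), (14, 14, 0)]

totals = defaultdict(lambda: [0, 0])  # promo flag -> [total units, total visits]
for units, visits, tpr_only in rows:
    totals[tpr_only][0] += units
    totals[tpr_only][1] += visits

# Divide the totals rather than averaging per-row ratios, so busy weeks weigh more.
units_per_visit = {flag: u / v for flag, (u, v) in totals.items() if v}
print(units_per_visit)
```

Three rows are far too few to draw any conclusion; the point is only the shape of the aggregation.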

In [4]:
product_df = (
    spark.read
    .option("header", True)
    .csv(product_data_folder)
)

In [5]:
product_df.show(1)

+----------+--------------------+-------------+----------+------------+------------+
|       UPC|         DESCRIPTION| MANUFACTURER|  CATEGORY|SUB_CATEGORY|PRODUCT_SIZE|
+----------+--------------------+-------------+----------+------------+------------+
|1111009477|PL MINI TWIST PRE...|PRIVATE LABEL|BAG SNACKS|    PRETZELS|       15 OZ|
+----------+--------------------+-------------+----------+------------+------------+
only showing top 1 row



In [6]:
product_df.show()

+----------+--------------------+-------------+--------------------+--------------------+------------+
|       UPC|         DESCRIPTION| MANUFACTURER|            CATEGORY|        SUB_CATEGORY|PRODUCT_SIZE|
+----------+--------------------+-------------+--------------------+--------------------+------------+
|1111009477|PL MINI TWIST PRE...|PRIVATE LABEL|          BAG SNACKS|            PRETZELS|       15 OZ|
|1111009497|   PL PRETZEL STICKS|PRIVATE LABEL|          BAG SNACKS|            PRETZELS|       15 OZ|
|1111009507|   PL TWIST PRETZELS|PRIVATE LABEL|          BAG SNACKS|            PRETZELS|       15 OZ|
|1111035398|PL BL MINT ANTSPT...|PRIVATE LABEL|ORAL HYGIENE PROD...|MOUTHWASHES (ANTI...|      1.5 LT|
|1111038078|PL BL MINT ANTSPT...|PRIVATE LABEL|ORAL HYGIENE PROD...|MOUTHWASHES (ANTI...|      500 ML|
|1111038080|PL ANTSPTC SPG MN...|PRIVATE LABEL|ORAL HYGIENE PROD...|MOUTHWASHES (ANTI...|      500 ML|
|1111085319|PL HONEY NUT TOAS...|PRIVATE LABEL|         COLD CEREAL|   AL

In [7]:
store_df = (
    spark.read
    .option("header", True)
    .csv(store_data_folder)
)

In [8]:
product_df.show(3)

+----------+--------------------+-------------+----------+------------+------------+
|       UPC|         DESCRIPTION| MANUFACTURER|  CATEGORY|SUB_CATEGORY|PRODUCT_SIZE|
+----------+--------------------+-------------+----------+------------+------------+
|1111009477|PL MINI TWIST PRE...|PRIVATE LABEL|BAG SNACKS|    PRETZELS|       15 OZ|
|1111009497|   PL PRETZEL STICKS|PRIVATE LABEL|BAG SNACKS|    PRETZELS|       15 OZ|
|1111009507|   PL TWIST PRETZELS|PRIVATE LABEL|BAG SNACKS|    PRETZELS|       15 OZ|
+----------+--------------------+-------------+----------+------------+------------+
only showing top 3 rows



In [9]:
transaction_df = (
    spark.read
    .option("header", True)
    .csv(transaction_data_folder)
)

In [10]:
transaction_df.show(3)

+-------------+---------+----------+-----+------+---+-----+-----+----------+-------+-------+--------+
|WEEK_END_DATE|STORE_NUM|       UPC|UNITS|VISITS|HHS|SPEND|PRICE|BASE_PRICE|FEATURE|DISPLAY|TPR_ONLY|
+-------------+---------+----------+-----+------+---+-----+-----+----------+-------+-------+--------+
|    14-Jan-09|      367|1111009477|   13|    13| 13|18.07| 1.39|      1.57|      0|      0|       1|
|    14-Jan-09|      367|1111009497|   20|    18| 18| 27.8| 1.39|      1.39|      0|      0|       0|
|    14-Jan-09|      367|1111009507|   14|    14| 14|19.32| 1.38|      1.38|      0|      0|       0|
+-------------+---------+----------+-----+------+---+-----+-----+----------+-------+-------+--------+
only showing top 3 rows
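
`WEEK_END_DATE` arrives as text such as `14-Jan-09`, which does not sort chronologically. A minimal sketch of the parsing step, assuming every value follows the day, abbreviated English month, two-digit year layout seen in the sample (the helper name `parse_week_end` is hypothetical):

```python
from datetime import date, datetime

def parse_week_end(value: str) -> date:
    # "%d-%b-%y" matches strings such as "14-Jan-09".
    return datetime.strptime(value, "%d-%b-%y").date()

week = parse_week_end("14-Jan-09")
print(week.isoformat())  # prints: 2009-01-14
```

In Spark, the same layout would presumably be expressed with a date pattern passed to a date-parsing function; the exact pattern string should be checked against the Spark version in use.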



In [11]:
product_df.createOrReplaceTempView("products")
transaction_df.createOrReplaceTempView("transaction")

In [12]:
spark.sql("""
    SELECT
        *
        
    FROM transaction
    JOIN products
    ON
        transaction.upc = products.upc
""").show(3)

+-------------+---------+----------+-----+------+---+-----+-----+----------+-------+-------+--------+----------+--------------------+-------------+----------+------------+------------+
|WEEK_END_DATE|STORE_NUM|       UPC|UNITS|VISITS|HHS|SPEND|PRICE|BASE_PRICE|FEATURE|DISPLAY|TPR_ONLY|       UPC|         DESCRIPTION| MANUFACTURER|  CATEGORY|SUB_CATEGORY|PRODUCT_SIZE|
+-------------+---------+----------+-----+------+---+-----+-----+----------+-------+-------+--------+----------+--------------------+-------------+----------+------------+------------+
|    14-Jan-09|      367|1111009477|   13|    13| 13|18.07| 1.39|      1.57|      0|      0|       1|1111009477|PL MINI TWIST PRE...|PRIVATE LABEL|BAG SNACKS|    PRETZELS|       15 OZ|
|    14-Jan-09|      367|1111009497|   20|    18| 18| 27.8| 1.39|      1.39|      0|      0|       0|1111009497|   PL PRETZEL STICKS|PRIVATE LABEL|BAG SNACKS|    PRETZELS|       15 OZ|
|    14-Jan-09|      367|1111009507|   14|    14| 14|19.32| 1.38|      1.38

In [13]:
df = spark.sql("""
    SELECT
        *
        
    FROM transaction
    JOIN products
    ON
        transaction.upc = products.upc
""")

In [14]:
df.show(3)

+-------------+---------+----------+-----+------+---+-----+-----+----------+-------+-------+--------+----------+--------------------+-------------+----------+------------+------------+
|WEEK_END_DATE|STORE_NUM|       UPC|UNITS|VISITS|HHS|SPEND|PRICE|BASE_PRICE|FEATURE|DISPLAY|TPR_ONLY|       UPC|         DESCRIPTION| MANUFACTURER|  CATEGORY|SUB_CATEGORY|PRODUCT_SIZE|
+-------------+---------+----------+-----+------+---+-----+-----+----------+-------+-------+--------+----------+--------------------+-------------+----------+------------+------------+
|    14-Jan-09|      367|1111009477|   13|    13| 13|18.07| 1.39|      1.57|      0|      0|       1|1111009477|PL MINI TWIST PRE...|PRIVATE LABEL|BAG SNACKS|    PRETZELS|       15 OZ|
|    14-Jan-09|      367|1111009497|   20|    18| 18| 27.8| 1.39|      1.39|      0|      0|       0|1111009497|   PL PRETZEL STICKS|PRIVATE LABEL|BAG SNACKS|    PRETZELS|       15 OZ|
|    14-Jan-09|      367|1111009507|   14|    14| 14|19.32| 1.38|      1.38

In [15]:
spark.sql("""
    SELECT
        products.upc
        , price
        , description
        , category
        
    FROM transaction
    JOIN products
    ON
        transaction.upc = products.upc
""").show(3)

+----------+-----+--------------------+----------+
|       upc|price|         description|  category|
+----------+-----+--------------------+----------+
|1111009477| 1.39|PL MINI TWIST PRE...|BAG SNACKS|
|1111009497| 1.39|   PL PRETZEL STICKS|BAG SNACKS|
|1111009507| 1.38|   PL TWIST PRETZELS|BAG SNACKS|
+----------+-----+--------------------+----------+
only showing top 3 rows



In [16]:
spark.sql("""
    SELECT
        products.upc
        , min (price)
        , max (price)
        , description
        , category
        
    FROM transaction
    JOIN products
    ON
        transaction.upc = products.upc
    GROUP BY
        1, 4, 5
""").show(3)

+----------+----------+----------+--------------------+----------+
|       upc|min(price)|max(price)|         description|  category|
+----------+----------+----------+--------------------+----------+
|1111009477|      0.89|      1.83|PL MINI TWIST PRE...|BAG SNACKS|
|1111009497|      0.86|      1.69|   PL PRETZEL STICKS|BAG SNACKS|
|1111009507|       0.8|      1.69|   PL TWIST PRETZELS|BAG SNACKS|
+----------+----------+----------+--------------------+----------+
only showing top 3 rows
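
One caveat about these ranges: the CSV readers in this notebook set only the `header` option, so Spark loads `price` as a string, and `MIN`/`MAX` on strings compare character by character. The plain-Python sketch below shows the effect on made-up values (they are not taken from the dataset); string comparison in Python behaves the same way for this purpose.

```python
prices_as_text = ["9.99", "10.49", "2.50"]  # made-up values for illustration

# Text ordering: "1..." sorts before "2..." and "9...", regardless of magnitude.
print(min(prices_as_text), max(prices_as_text))  # prints: 10.49 9.99

# Numeric ordering after conversion.
prices = [float(p) for p in prices_as_text]
print(min(prices), max(prices))  # prints: 2.5 10.49
```

In the query, wrapping the column as `CAST(price AS DOUBLE)` inside `MIN` and `MAX` would be the corresponding change. The three rows shown above stay below 10, so their ranges are unlikely to be affected, but products priced at 10 or more could be.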



In [17]:
spark.sql("""
    SELECT
        upc
        , product_size
        , CASE
            WHEN CONTAINS(product_size, 'OZ') THEN 'yes'
        END AS is_oz
        
    FROM products
""").show(10)

+----------+------------+-----+
|       upc|product_size|is_oz|
+----------+------------+-----+
|1111009477|       15 OZ|  yes|
|1111009497|       15 OZ|  yes|
|1111009507|       15 OZ|  yes|
|1111035398|      1.5 LT| null|
|1111038078|      500 ML| null|
|1111038080|      500 ML| null|
|1111085319|    12.25 OZ|  yes|
|1111085345|       20 OZ|  yes|
|1111085350|       18 OZ|  yes|
|1111087395|     32.7 OZ|  yes|
+----------+------------+-----+
only showing top 10 rows
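
Beyond flagging ounces, the size string can be split into a number and a unit so that sizes become comparable within a unit. This is a sketch under the assumption that every value looks like the ones in the output above (a number, a space, and a unit); the names `SIZE_PATTERN` and `parse_size` are hypothetical.

```python
import re

# A number (optionally with decimals), optional whitespace, then a unit made of letters.
SIZE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")

def parse_size(product_size: str):
    """Return (quantity, unit), or None when the text does not match the layout."""
    match = SIZE_PATTERN.match(product_size)
    if match is None:
        return None
    return float(match.group(1)), match.group(2).upper()

for text in ["15 OZ", "1.5 LT", "500 ML", "12.25 OZ"]:
    print(text, "->", parse_size(text))
```

Converting between units is deliberately left out: whether `OZ` means weight or volume depends on the product category.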

