# Breakfast at the Frat: A Time Series Analysis

This dataset contains sales and promotion information on the top five products from each of the top three brands within four selected categories (mouthwash, pretzels, frozen pizza, and boxed cereal), gathered from a sample of stores over 156 weeks. It includes:

- Unit sales, households, visits, and spend data by product, store, and week
- Base Price and Actual Shelf Price, to determine a product’s discount, if any
- Promotional support details (e.g., sale tag, in-store display), if applicable for the given product/store/week
- Store information, including size and location, as well as a price tier designation (e.g., upscale vs. value)
- Product information, including UPC, size, and description
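
As a rough illustration of how the two price fields combine into a discount, here is a minimal sketch in plain Python. The function name `discount_fraction` and the sample rows are made up for illustration; they are not part of the dataset or of this notebook's pipeline.

```python
def discount_fraction(base_price, shelf_price):
    """Return the fraction taken off the base price, or None if undefined.

    A negative result means the shelf price is above the base price.
    """
    if base_price is None or shelf_price is None or base_price <= 0:
        return None
    return (base_price - shelf_price) / base_price


# Made-up (base_price, shelf_price) pairs, not taken from the dataset.
examples = [(4.00, 3.00), (2.50, 2.50), (0.0, 1.00)]
discounts = [discount_fraction(base, shelf) for base, shelf in examples]
print(discounts)
```

The same arithmetic could later be expressed as a column expression once the price columns are cast to numbers.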

To identify outliers, consider examining:

- The ratio of units to visits
- The ratio of visits to households
- Items that may be out of stock or discontinued at a given store
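
The two ratios above can be sketched in plain Python before pushing the logic into Spark. Everything here is an assumption made for illustration: the helper names, the threshold values, and the sample counts. Sensible thresholds would have to come from looking at the actual distributions.

```python
def visit_ratios(units, visits, households):
    """Return (units per visit, visits per household); None where undefined."""
    units_per_visit = units / visits if visits else None
    visits_per_household = visits / households if households else None
    return units_per_visit, visits_per_household


def is_suspect(units, visits, households,
               max_units_per_visit=5.0, max_visits_per_household=3.0):
    """Flag a product/store/week row for review.

    The thresholds are arbitrary placeholders, not values from the dataset.
    Rows with zero visits or households are flagged too, since the ratios
    cannot be computed for them.
    """
    units_per_visit, visits_per_household = visit_ratios(units, visits, households)
    if units_per_visit is None or visits_per_household is None:
        return True
    return (units_per_visit > max_units_per_visit
            or visits_per_household > max_visits_per_household)


# Made-up (units, visits, households) counts.
print(is_suspect(12, 10, 9))  # ordinary-looking row
print(is_suspect(60, 5, 5))   # 12 units per visit
```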

**Source:** https://www.dunnhumby.com/source-files/

In [None]:
from pyspark.sql import Row, SparkSession

In [None]:
spark = SparkSession.builder \
    .appName("breakfast") \
    .getOrCreate()

In [None]:
product_data_folder = "dataset/products"
store_data_folder = "dataset/stores"
transaction_data_folder = "dataset/transactions"

### Perform ETL to Answer the Following Questions

1. What is the range of prices offered on products?
1. How do promotions affect units per visit across geographies?
1. For which products would you lower the price to increase sales?

In [None]:
product_df = spark.read.option("header", True).csv(product_data_folder)

In [None]:
transaction_df = spark.read.option("header", True).csv(transaction_data_folder)
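
One thing to watch: reading a CSV with only `header` set, and no schema or `inferSchema`, leaves every column typed as a string, including `price`. Aggregates such as `min` and `max` on a string column compare text, not numbers. The sketch below uses plain Python string comparison as a stand-in to show the effect on made-up prices; it does not touch the Spark DataFrames.

```python
# Made-up prices, held as text the way an un-typed CSV read would hold them.
prices_as_text = ["9.99", "10.49", "2.50"]

# Text comparison goes character by character, so "1..." sorts before "2..." and "9...".
text_min = min(prices_as_text)
text_max = max(prices_as_text)

# Converting to numbers first gives the intended range.
prices = [float(p) for p in prices_as_text]
numeric_min = min(prices)
numeric_max = max(prices)

print(text_min, text_max)        # 10.49 9.99
print(numeric_min, numeric_max)  # 2.5 10.49
```

Casting `price` to a numeric type before aggregating, or supplying a schema at read time, avoids this.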

In [None]:
product_df.createOrReplaceTempView("products")

transaction_df.createOrReplaceTempView("transactions")

In [None]:
# Join on upc with USING so the result carries a single upc column
# instead of one copy from each table.
df = spark.sql("""
    select
        *

    from transactions
    join products
    using (upc)
""")

In [None]:
df.show(1)

In [None]:
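# price is read from CSV as a string, so cast it before aggregating;
# otherwise min/max would compare text rather than numbers.
spark.sql("""
    select
        products.upc
        , description
        , category
        , min(cast(price as double)) as min_price
        , max(cast(price as double)) as max_price

    from transactions
    join products
    on
        transactions.upc = products.upc
    group by
        products.upc, description, category
""").show(3)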