# Assignment 2 - Spark Dataframes
***Note***: All the dataset files were stored in the same folder as this notebook.

In [1]:
import os
import pyspark
import sys

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
conf = pyspark.SparkConf()
sc = pyspark.SparkContext(conf=conf)
spark = pyspark.sql.SparkSession(sc)
spark

## 1. 15 Points
**Datafile**: BreadBasket_DMS.csv

**Solve**: What is the most popular (most sold) between the 8:00AM and 8:59AM for each day?

Example output (not actual solution)

    2016-10-30, Pastry

    2016-10-31, Coffee
     :
     :

### Approach:
1. Import `BreadBasket_DMS.csv` into a dataframe
2. Extract dates in `YYYY-MM-DD` format from the `Date` column and times in `hh:mm:ss` format from the `Time` column
3. Filter the data by `Time` in the range of `08:00:00` and `08:59:00` inclusive and remove rows with `None` in the `Item` column
4. Group the data by `Date` and `Item`, aggregate the `sum` of `Transaction` for each `Item` aliased as `Total` and, sort by `Date` and `Total`
5. Group the data by `Date` and return the last `Item` and last `Total`

In [2]:
# 1. Import BreadBasket_DMS.csv into a dataframe and `filter` out rows with `NONE` in the `Item` column
from pyspark.sql.functions import col
BreadBasket_DMS = spark.read.option("header", True).option("InferSchema", True).csv("BreadBasket_DMS.csv")
BreadBasket_DMS = BreadBasket_DMS.filter(col("Item") != "NONE")

# 2. Extract dates in `YYYY-MM-DD` format from the `Date` column and times in `hh:mm:ss` format from the `Time` column
from pyspark.sql.functions import to_date, date_format
BreadBasket_DMS = BreadBasket_DMS.withColumn("Date", to_date(col("Date"), "YYYY-MM-DD"))
BreadBasket_DMS = BreadBasket_DMS.withColumn("Time", date_format(col("Time"),"hh:mm:ss"))

# 3. Filter the data by `Time` in the range of `08:00:00` and `08:59:00` inclusive 
q1 = BreadBasket_DMS
q1 = q1.filter((col("Time") <= "08:59:00") & (col("Time") >= "08:00:00"))

# 4. Group the data by `Date` and `Item`, aggregate by the `count` and, sort by `Date` and `count`
q1 = q1.groupBy("Date","Item").count().sort("Date","count")

# 5. Group the data by `Date` and return the last `Item` and last `count`
from pyspark.sql.functions import last
q1 = q1.groupBy("Date").agg(last("Item").alias("Most Popular Iteam"),last("count").alias("Total Transactions"))

# Display results
print("List of the most popular (most sold) items between the 8:00 AM and 8:59 AM for each day and their total transactions that day:")
q1.show()

List of the most popular (most sold) items between the 8:00 AM and 8:59 AM for each day and their total transactions that day:
+----------+------------------+------------------+
|      Date|Most Popular Iteam|Total Transactions|
+----------+------------------+------------------+
|2016-10-31|             Bread|                 2|
|2016-11-01|               Tea|                 3|
|2016-11-02|            Coffee|                 8|
|2016-11-03|            Coffee|                 4|
|2016-11-04|            Coffee|                 2|
|2016-11-05|             Bread|                 6|
|2016-11-07|            Coffee|                 1|
|2016-11-08|            Coffee|                 1|
|2016-11-09|            Coffee|                 1|
|2016-11-10|            Coffee|                 2|
|2016-11-11|             Bread|                 6|
|2016-11-12|         Medialuna|                 1|
|2016-11-14|            Coffee|                 2|
|2016-11-15|  Keeping It Local|                 1|
|2016-

## 2. 15 Points
**Datafile**: BreadBasket_DMS.csv

**Solve**: What is the most common item bought along with “Brownie”? (items bought in the same transaction)

### Assumptions:
We will assume that we will count each time an item was bought with “Brownie”. If an item was bought more than once in the same transaction we will count each time that item was bought in that transaction.

### Approach:
1. Import `BreadBasket_DMS.csv` into a dataframe
2. Extract dates in `YYYY-MM-DD` format from the `Date` column and times in `hh:mm:ss` format from the `Time` column
3. Make list of Brownie Trans and Dates and Time
4. Filter Brownie from list
5. Join both
6. GroupBy

In [3]:
# 1. Import BreadBasket_DMS.csv into a dataframe and `filter` out rows with `NONE` in the `Item` column
from pyspark.sql.functions import col
BreadBasket_DMS = spark.read.option("header", True).option("InferSchema", True).csv("BreadBasket_DMS.csv")
BreadBasket_DMS = BreadBasket_DMS.filter(col("Item") != "NONE")

# 2. Extract dates in `YYYY-MM-DD` format from the `Date` column and times in `hh:mm:ss` format from the `Time` column
from pyspark.sql.functions import to_date, date_format
BreadBasket_DMS = BreadBasket_DMS.withColumn("Date", to_date(col("Date"), "YYYY-MM-DD"))
BreadBasket_DMS = BreadBasket_DMS.withColumn("Time", date_format(col("Time"),"hh:mm:ss"))

# 3 Brownie Transactions Date and Time
BrownieTransactions = BreadBasket_DMS.filter(col("Item") == "Brownie").sort("Transaction")

# Non-Brownie Transactions.sort("Transaction")
OtherTransactions = BreadBasket_DMS.filter(col("Item") != "Brownie")

# Items bought with Brownies
JoinExpression = BrownieTransactions["Transaction"] == OtherTransactions["Transaction"]
ItemBougtWithBrownie = OtherTransactions.join(BrownieTransactions,JoinExpression, "left_semi").sort("Transaction")

# Total Transactions
from pyspark.sql.functions import desc
print("List of the most common items bought along with “Brownie” sorted by their counts")
ItemBougtWithBrownie.groupBy("Item").count().sort(desc("count")).show()

List of the most common items bought along with “Brownie” sorted by their counts
+-----------------+-----+
|             Item|count|
+-----------------+-----+
|           Coffee|  237|
|            Bread|  115|
|              Tea|   71|
|             Cake|   43|
|    Hot chocolate|   42|
|         Sandwich|   27|
|        Alfajores|   27|
|          Cookies|   26|
|            Juice|   24|
|           Pastry|   23|
|        Medialuna|   19|
|           Muffin|   18|
|             Soup|   15|
|            Scone|   12|
|             Coke|   11|
|       Farm House|   11|
|         Truffles|   11|
|    Mineral water|    9|
|            Toast|    7|
|Hearty & Seasonal|    6|
+-----------------+-----+
only showing top 20 rows

