# Assignment 2 - Spark Dataframes
***Note***: All the dataset files were stored in the same folder as this notebook.

In [1]:
import os
import pyspark
import sys

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
conf = pyspark.SparkConf()
sc = pyspark.SparkContext(conf=conf)
spark = pyspark.sql.SparkSession(sc)
spark

## 1. 15 Points
**Datafile**: BreadBasket_DMS.csv

**Solve**: What is the most popular (most sold) item between 8:00 AM and 8:59 AM for each day?

Example output (not actual solution)

    2016-10-30, Pastry

    2016-10-31, Coffee
     :
     :

### Approach:
1. Import `BreadBasket_DMS.csv` into a dataframe
2. Extract dates in `YYYY-MM-DD` format from the `Date` column and times in `hh:mm:ss` format from the `Time` column
3. Filter the data by `Time` in the range `08:00:00` to `08:59:00` inclusive and remove rows with `NONE` in the `Item` column
4. Group the data by `Date` and `Item`, aggregate the `sum` of `Transaction` for each `Item` (aliased as `Total`), and sort by `Date` and `Total`
5. Group the data by `Date` and return the last `Item` and last `Total`
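
The approach above can be sketched in plain Python, independent of Spark, to make the logic easy to check by hand. This is only an illustration under stated assumptions: the sample rows are hypothetical, the helper name `most_popular_at_8am` is invented here, times are read on a 24-hour clock so that 20:15 is not treated as 08:15, and popularity is measured by counting rows per item rather than summing `Transaction` values.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical sample rows in the CSV's column order: (Date, Time, Transaction, Item)
rows = [
    ("2016-10-30", "08:05:11", 1, "Coffee"),
    ("2016-10-30", "08:05:11", 1, "Pastry"),
    ("2016-10-30", "08:41:02", 2, "Coffee"),
    ("2016-10-30", "20:15:00", 3, "Bread"),   # 8 PM: must not match the 8 AM window
    ("2016-10-31", "08:59:30", 4, "Bread"),
    ("2016-10-31", "08:10:00", 5, "NONE"),    # placeholder item, dropped
    ("2016-10-31", "09:00:00", 6, "Coffee"),  # outside the window
]

def most_popular_at_8am(rows):
    """Return {date: (item, count)} for items sold from 08:00:00 through 08:59:59."""
    counts = defaultdict(Counter)
    for date, time, _transaction, item in rows:
        if item == "NONE":
            continue
        # %H is the 24-hour clock, so 20:15 is not folded onto 08:15
        if datetime.strptime(time, "%H:%M:%S").hour == 8:
            counts[date][item] += 1
    # most_common(1) returns the single highest-count (item, count) pair
    return {date: c.most_common(1)[0] for date, c in sorted(counts.items())}

result = most_popular_at_8am(rows)
for date, (item, n) in result.items():
    print(f"{date}, {item} ({n} sold)")
```

Counting rows is one possible reading of "most sold"; the notebook's own cells use the `sum` of `Transaction` instead, so the two can rank items differently.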

In [2]:
# 1. Import BreadBasket_DMS.csv into a dataframe and `filter` out rows with `NONE` in the `Item` column
from pyspark.sql.functions import col
BreadBasket_DMS = spark.read.option("header", True).option("InferSchema", True).csv("BreadBasket_DMS.csv")
BreadBasket_DMS = BreadBasket_DMS.filter(col("Item") != "NONE")

# 2. Extract dates in `YYYY-MM-DD` format from the `Date` column and times in `hh:mm:ss` format from the `Time` column
from pyspark.sql.functions import to_date, date_format
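# Spark date patterns are case-sensitive: `yyyy-MM-dd` is year-month-day (`YYYY` is week-based year, `DD` is day of year)
BreadBasket_DMS = BreadBasket_DMS.withColumn("Date", to_date(col("Date"), "yyyy-MM-dd"))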
BreadBasket_DMS = BreadBasket_DMS.withColumn("Time", date_format(col("Time"),"hh:mm:ss"))

# 3. Filter the data by `Time` in the range of `08:00:00` and `08:59:00` inclusive 
q1 = BreadBasket_DMS
q1 = q1.filter((col("Time") <= "08:59:00") & (col("Time") >= "08:00:00"))

# 4. Group the data by `Date` and `Item`, aggregate the `sum` of `Transaction` for each `Item` (aliased as `Total`), and sort by `Date` and `Total`
from pyspark.sql.functions import sum
q1 = q1.groupBy("Date","Item").agg(sum("Transaction").alias("Total")).sort("Date","Total")

# 5. Group the data by `Date` and return the last `Item` and last `Total`
from pyspark.sql.functions import last
q1 = q1.groupBy("Date").agg(last("Item").alias("Most Popular Item"),last("Total").alias("Total Transactions"))

# Display results
q1.show()

+----------+-----------------+------------------+
|      Date|Most Popular Item|Total Transactions|
+----------+-----------------+------------------+
|2016-10-31|            Bread|               165|
|2016-11-01|              Tea|               542|
|2016-11-02|           Coffee|              2064|
|2016-11-03|           Coffee|              1382|
|2016-11-04|           Coffee|               883|
|2016-11-05|            Bread|              3164|
|2016-11-07|           Coffee|               739|
|2016-11-08|            Bread|               816|
|2016-11-09|            Bread|               890|
|2016-11-10|           Coffee|              1879|
|2016-11-11|            Bread|              6067|
|2016-11-12|        Medialuna|              1104|
|2016-11-14|        Medialuna|              2555|
|2016-11-15| Keeping It Local|              1343|
|2016-11-16|            Bread|              1409|
|2016-11-17|         Siblings|              2953|
|2016-11-18|           Coffee|

## 2. 15 Points
**Datafile**: BreadBasket_DMS.csv

**Solve**: What is the most common item bought along with “Brownie”? (items bought in the same transaction)

### Assumptions:
We will assume that:
1. Items bought on the same date and at the same time as "Brownie" qualify as items bought in the same transaction.
2. "Most common" could mean either the highest total of `Transaction` values or the highest number of occurrences of the item. I will compute both.
3. An item only occurs once in an individual transaction.
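
Assumptions 1 and 3 can be illustrated with a small plain-Python sketch. The rows below are hypothetical, and the helper `baskets` is invented for this example; it groups items into baskets either by `(Date, Time)`, as assumed above, or by the `Transaction` column as an alternative key, and stores each basket as a set so an item is kept once per basket.

```python
from collections import defaultdict

# Hypothetical rows: (Date, Time, Transaction, Item)
rows = [
    ("2016-11-03", "13:02:37", 391, "Brownie"),
    ("2016-11-03", "13:02:37", 391, "Coffee"),
    ("2016-11-03", "13:02:37", 391, "Coffee"),   # repeated item in one basket
    ("2016-11-03", "13:19:57", 392, "Pastry"),
]

def baskets(rows, key):
    """Group item sets by the given key function; a set keeps each item once per basket."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[key(row)].add(row[3])
    return dict(grouped)

by_timestamp = baskets(rows, key=lambda r: (r[0], r[1]))  # assumption 1
by_transaction = baskets(rows, key=lambda r: r[2])        # alternative key

# When every transaction has its own timestamp, both keys yield the same baskets
same = sorted(map(sorted, by_timestamp.values())) == sorted(map(sorted, by_transaction.values()))
print(same)
```

If two different transactions ever shared a timestamp, the two groupings would differ, which is the case assumption 1 rules out.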

### Approach:
1. Import `BreadBasket_DMS.csv` into a dataframe (See Q1)
2. Extract dates in `YYYY-MM-DD` format from the `Date` column and times in `hh:mm:ss` format from the `Time` column (See Q1)
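3. Collect the `Date` and `Time` of every row whose `Item` is "Brownie"
4. Select all rows whose `Item` is not "Brownie"
5. Left-semi join the non-Brownie rows to the Brownie dates and times, keeping only the items bought alongside a Brownie

#### Most Number of Total Transactions:
Group the joined rows by `Item` and sum `Transaction`, sorted in descending order.

#### Most Number of Occurrences:
Group the joined rows by `Item` and count the rows, sorted in descending order.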
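
Steps 3 to 5 and the two rankings can be traced on a handful of rows in plain Python. This is a sketch on hypothetical data, not the Spark implementation: a set of `(Date, Time)` keys stands in for the Brownie side of the semi join, and `collections.Counter` stands in for the grouped aggregations.

```python
from collections import Counter

# Hypothetical rows: (Date, Time, Transaction, Item)
rows = [
    ("2016-11-03", "13:02:37", 391, "Brownie"),
    ("2016-11-03", "13:02:37", 391, "Coffee"),
    ("2016-11-03", "13:19:57", 392, "Coffee"),   # no Brownie at this timestamp
    ("2016-11-03", "14:26:27", 403, "Brownie"),
    ("2016-11-03", "14:26:27", 403, "Coffee"),
    ("2016-11-03", "14:26:27", 403, "Cake"),
]

# Step 3: timestamps of rows that contain a Brownie
brownie_keys = {(d, t) for d, t, _, item in rows if item == "Brownie"}

# Steps 4-5: keep non-Brownie rows whose (Date, Time) matches a Brownie row (semi join)
bought_with = [r for r in rows if r[3] != "Brownie" and (r[0], r[1]) in brownie_keys]

# Ranking by occurrences: number of matching rows per item
occurrences = Counter(item for _, _, _, item in bought_with)

# Ranking by summed Transaction values
transaction_sums = Counter()
for _, _, transaction, item in bought_with:
    transaction_sums[item] += transaction

print(occurrences.most_common())
print(transaction_sums.most_common())
```

A semi join only filters the left side, so each non-Brownie row appears at most once in `bought_with` even when several Brownie rows share its timestamp.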

In [3]:
# 3. Dates and times of the rows that contain a Brownie
BrownieTransactions = BreadBasket_DMS.filter(col("Item") == "Brownie").select("Date","Time")

# 4. Rows for every item other than Brownie
OtherTransactions = BreadBasket_DMS.filter(col("Item") != "Brownie")

# 5. Items bought with Brownies: a left-semi join keeps only the non-Brownie rows that match a Brownie date and time
JoinExpression = (BrownieTransactions["Date"] == OtherTransactions["Date"]) & (BrownieTransactions["Time"] == OtherTransactions["Time"])
ItemBougtWithBrownie = OtherTransactions.join(BrownieTransactions,JoinExpression, "left_semi").sort("Date","Time")
ItemBougtWithBrownie.show()

+----------+--------+-----------+--------------------+
|      Date|    Time|Transaction|                Item|
+----------+--------+-----------+--------------------+
|2016-11-03|01:02:37|        391|            Sandwich|
|2016-11-03|01:02:37|        391|              Coffee|
|2016-11-03|01:19:57|        392|              Pastry|
|2016-11-03|01:19:57|        392|            Focaccia|
|2016-11-03|01:19:57|        392|          Farm House|
|2016-11-03|02:26:27|        403|                Cake|
|2016-11-03|02:26:27|        403|              Pastry|
|2016-11-03|03:55:46|        419|              Coffee|
|2016-11-03|03:55:46|        419|           Alfajores|
|2016-11-03|03:55:46|        419|Ella's Kitchen Po...|
|2016-11-03|03:55:46|        419|               Juice|
|2016-11-03|04:06:19|        421|             Cookies|
|2016-11-03|04:06:19|        421|               Bread|
|2016-11-03|04:06:19|        421|               Juice|
|2016-11-03|04:06:19|        421|           Alfajores|
|2016-11-0

In [4]:
# Total Transactions
from pyspark.sql.functions import desc, count, max
# `show()` prints the table and returns None, so its result is not assigned
ItemBougtWithBrownie.groupBy("Item").agg(sum("Transaction").alias("Transactions")).sort(desc("Transactions")).show()

# Total Occurrences
ItemBougtWithBrownie.groupBy("Item").agg(count("Item").alias("Occurrences")).sort(desc("Occurrences")).show()

+-------------+------------+
|         Item|Transactions|
+-------------+------------+
|       Coffee|     1000233|
|        Bread|      540798|
|          Tea|      320766|
|         Cake|      224202|
|Hot chocolate|      179817|
|      Cookies|      148147|
|     Sandwich|      138445|
|    Alfajores|      108437|
|        Juice|      107391|
|       Pastry|       85160|
|    Medialuna|       72654|
|        Scone|       66893|
|         Coke|       60400|
|         Soup|       56780|
|   Farm House|       48390|
|     Truffles|       47428|
|       Muffin|       47178|
|        Toast|       41480|
|       Tiffin|       41090|
|     Baguette|       37517|
+-------------+------------+
only showing top 20 rows

+-----------------+-----------+
|             Item|Occurrences|
+-----------------+-----------+
|           Coffee|        237|
|            Bread|        115|
|              Tea|         71|
|             Cake|         43|
|    Hot chocolate|         42|
|         Sandwich|         27|