<a href="https://colab.research.google.com/github/AshKhanNY/BigData/blob/main/BDM_HW1_23900457_Khan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 1 - Streaming

**SUBMISSION**: please download your Colab notebook as `.ipynb` and name it `BDM_HW1_<EMPL_ID>_<LastName>.ipynb` (replacing `<EMPL_ID>` and `<LastName>` with your information) before submitting on Blackboard.

---
You are given sales transaction data, stored in a CSV file with a structure similar to the one below:

|Customer ID|Transaction ID|Date|Product ID|Item Cost|
|---|---|---|---|---|
|129482221|T29518|2018/02/28|A|10.99|
|129482221|T29518|2018/02/28|B|4.99|
|129482221|T93990|2018/03/15|A|9.99|
|583910109|T11959|2017/04/13|C|0.99|
|583910109|T29852|2017/12/25|D|13.99|
|873803751|T35662|2018/01/01|D|13.99|
|873803751|T17583|2018/05/08|B|5.99|
|873803751|T17583|2018/05/08|A|11.99|
|...|...|...|...|...|

The input data is *sorted by* **`Customer ID`**. You are asked to compute the following table, *sorted by* **`Product ID`**.

|Product ID|Customer Count|Total Revenue|
|---|---|---|
|A|2|32.97|
|B|2|10.98|
|C|1|0.99|
|D|2|27.98|
|...|...|...|

where:

* `Customer Count` is the number of unique customers who bought the product with the given ID.
* `Total Revenue` is the sum of the product's `Item Cost` over all transactions, reported to 2 decimal places.
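
As a sanity check on these two definitions, the sketch below recomputes the expected table from the eight sample rows shown earlier. It is a brute-force reference only: it keeps a set of customers per product, which the assignment does not allow for the real data. The names `SAMPLE` and `brute_force_summary` are made up for this illustration.

```python
import csv
import io

# The eight sample rows from the input table above.
SAMPLE = """Customer ID,Transaction ID,Date,Product ID,Item Cost
129482221,T29518,2018/02/28,A,10.99
129482221,T29518,2018/02/28,B,4.99
129482221,T93990,2018/03/15,A,9.99
583910109,T11959,2017/04/13,C,0.99
583910109,T29852,2017/12/25,D,13.99
873803751,T35662,2018/01/01,D,13.99
873803751,T17583,2018/05/08,B,5.99
873803751,T17583,2018/05/08,A,11.99
"""

def brute_force_summary(text):
    """Non-streaming reference: stores every (product, customer) pair."""
    customers = {}  # Product ID -> set of Customer IDs
    revenue = {}    # Product ID -> summed Item Cost
    for row in csv.DictReader(io.StringIO(text)):
        pid = row['Product ID']
        customers.setdefault(pid, set()).add(row['Customer ID'])
        revenue[pid] = revenue.get(pid, 0.0) + float(row['Item Cost'])
    # One CSV record per product, sorted by Product ID
    return [f'{pid},{len(customers[pid])},{revenue[pid]:.2f}'
            for pid in sorted(customers)]

summary = brute_force_summary(SAMPLE)
print('\n'.join(summary))
```

Product `A`, for example, is bought twice by customer `129482221` and once by `873803751`, so it counts 2 customers and 10.99 + 9.99 + 11.99 = 32.97 in revenue.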


**YOU ARE NOT ALLOWED TO STORE ALL DATA ROWS**.
The data is assumed to be large, with the number of customers much larger than the number of products. You must complete the task in a streaming fashion, and your storage should not depend on the number of customers. In particular, a storage complexity of **O(n)**, where n is the number of customers, is not advisable.
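
One way to see why the sort order makes this possible: since all rows for a customer are adjacent, it is enough to remember which products the *current* customer has bought. The sketch below illustrates this for the customer count only, using `itertools.groupby` from the standard library. The function name `count_customers` and the tiny `rows` sample are assumptions made for the illustration, not part of the assignment.

```python
from itertools import groupby
from operator import itemgetter

def count_customers(rows):
    """Count distinct customers per product from rows sorted by Customer ID."""
    counts = {}  # Product ID -> number of distinct customers
    for _customer, block in groupby(rows, key=itemgetter('Customer ID')):
        # Holds at most one entry per product, so its size is bounded by
        # the number of products, not the number of customers.
        bought = {row['Product ID'] for row in block}
        for product in bought:
            counts[product] = counts.get(product, 0) + 1
    return counts

# Hypothetical rows, already sorted by Customer ID
rows = [
    {'Customer ID': '1', 'Product ID': 'A'},
    {'Customer ID': '1', 'Product ID': 'A'},
    {'Customer ID': '1', 'Product ID': 'B'},
    {'Customer ID': '2', 'Product ID': 'A'},
]
counts = count_customers(rows)
print(counts)
```

Both `counts` and `bought` grow only with the number of products, which is the storage behavior the constraint asks for.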

In addition, you need to complete the provided `homework1` *generator* without importing any non-built-in Python packages (e.g. `pandas` is not allowed). 

Each record that `homework1` yields must be a CSV string (a comma-separated list), e.g. `"A,2,32.97"` or `"B,2,10.98"`, that can be written directly to a CSV file.
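
A minimal sketch of building such a record is shown here. The helper name `format_record` is hypothetical, and the sketch assumes that product IDs contain no commas or quotes, so no CSV quoting is needed.

```python
def format_record(product_id, customer_count, total_revenue):
    """Build one CSV line; the revenue always gets exactly 2 decimal places."""
    return f'{product_id},{customer_count},{total_revenue:.2f}'

print(format_record('A', 2, 32.97))
print(format_record('B', 2, 10.9))
```

The `:.2f` format keeps trailing zeros, which `str(round(x, 2))` would drop for a value such as `10.9`.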

In [None]:
# Download the data set
!gdown 1DSxCQGZBaPG5bZ2T0fhgPA4wIkoIeiyi
!wc -l sales.csv

Downloading...
From: https://drive.google.com/uc?id=1DSxCQGZBaPG5bZ2T0fhgPA4wIkoIeiyi
To: /content/sales.csv
100% 42.6k/42.6k [00:00<00:00, 50.5MB/s]
1001 sales.csv


In [None]:
import csv

def homework1(reader):
    '''
    reader: a csv.DictReader, so columns can be accessed by name,
            e.g. row['Customer ID'] or row['Date'].
    '''
    # Map each Product ID to its running customer count and total revenue.
    # Storage grows with the number of products, not the number of customers.
    result = {}
    lastCustomer = ''
    # Products already counted for the current customer. Because the input
    # is sorted by Customer ID, this is reset whenever the customer changes.
    uniqueProducts = set()

    for row in reader:
        product = row['Product ID']

        # Create an entry the first time a product is seen
        if product not in result:
            result[product] = {'Customer Count': 0, 'Total Revenue': 0.0}

        # A new Customer ID starts a new block of rows
        if row['Customer ID'] != lastCustomer:
            lastCustomer = row['Customer ID']
            uniqueProducts = set()

        # Count each customer at most once per product
        if product not in uniqueProducts:
            uniqueProducts.add(product)
            result[product]['Customer Count'] += 1

        # Every row adds its item cost to the product's revenue
        result[product]['Total Revenue'] += float(row['Item Cost'])

    # Emit one CSV record per product, sorted by Product ID
    for key in sorted(result):
        revenue = f'{result[key]["Total Revenue"]:.2f}'
        yield f'{key},{result[key]["Customer Count"]},{revenue}'


with open('sales.csv', 'r', newline='') as fi:
    reader = csv.DictReader(fi)
    print('Product ID,Customer Count,Total Revenue')
    for product in homework1(reader):
        print(product)

Product ID,Customer Count,Total Revenue
P02291,16,1181.97
P19498,17,989.99
P32565,17,1006.09
P33162,18,1210.92
P39328,17,1129.01
P58225,17,1349.82
P61235,18,959.02
P76615,18,1087.96
P82222,17,950.05
P92449,14,966.17
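
Since each yielded record is already a CSV line, saving the result to a file only requires writing the header and then one line per record. The sketch below assumes a hypothetical helper `write_records` and a temporary output path. The two records stand in for whatever the generator yields.

```python
import os
import tempfile

def write_records(path, header, records):
    """Write a header line, then each CSV record on its own line."""
    with open(path, 'w', newline='') as fo:
        fo.write(header + '\n')
        for record in records:
            fo.write(record + '\n')

# Stand-in for the records produced by the generator
records = iter(['A,2,32.97', 'B,2,10.98'])
out_path = os.path.join(tempfile.mkdtemp(), 'output.csv')
write_records(out_path, 'Product ID,Customer Count,Total Revenue', records)

# Read the file back to confirm its contents
with open(out_path) as fi:
    lines = fi.read().splitlines()
print(lines)
```

Because `records` is consumed one item at a time, the output side never holds more than a single record in memory.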
