## Product Mapping v2
### Anthony Ung

#### Some Jupyter things you need to be aware of ...
#### 
#### The mapping of the products table is idempotent only if you run the cells in order, from the top.
#### Several cells append to shared lists, so to re-run an individual cell you need to restart the kernel first.
#### Go to "Kernel" > "Restart Kernel and Run up to Selected Cell..."
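
A minimal sketch of why run order matters, using hypothetical stand-in names (`mapped`, `source`, `mapping_cell`) rather than the notebook's real data: a cell that appends to a shared list duplicates its rows when it is run a second time without a kernel restart.

```python
# Hypothetical stand-ins for the notebook's shared state (illustrative names only).
mapped = []   # plays the role of an append-only list such as PRODUCTS_MAPPED
source = [{'itemType': 'Milk'}, {'itemType': 'Eggs'}]

def mapping_cell():
    # Simulates one notebook cell that appends every source row to the shared list.
    for row in source:
        mapped.append(row)

mapping_cell()
count_after_first_run = len(mapped)

# Re-running the same cell without restarting the kernel appends the rows again.
mapping_cell()
count_after_second_run = len(mapped)

print(count_after_first_run, count_after_second_run)
```

Restarting the kernel resets `mapped` to an empty list, which is why "Restart Kernel and Run up to Selected Cell..." gives a repeatable result.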

In [1]:
import csv
import re

In [2]:
products_old = []
PRODUCTS_MAPPED = []
PRODUCT_CLASSES_NEW = []

# Read the product and product classes files.
csv.register_dialect('piper', delimiter='|', quoting=csv.QUOTE_NONE)
csv.register_dialect('tab', delimiter='\t', quoting=csv.QUOTE_NONE)

# newline='' is what the csv module documentation recommends for file objects.
with open('Products1.txt', 'r', newline='') as csvfile:
    for row in csv.DictReader(csvfile, dialect='piper'):
        products_old.append(row)

with open('product_class.txt', 'r', newline='') as csvfile:
    for row in csv.DictReader(csvfile, dialect='tab'):
        PRODUCT_CLASSES_NEW.append(row)

In [3]:
class DEBUG:
    @staticmethod
    def print_product_classes():
        # Print the new product classes as pipe-delimited rows.
        # Joining the fields avoids nesting single quotes inside a
        # single-quoted f-string, which is a syntax error before Python 3.12.
        fields = ['product_class_id', 'product_subcategory', 'product_category',
                  'product_department', 'product_family']
        print('|'.join(fields))
        for product in PRODUCT_CLASSES_NEW:
            print('|'.join(str(product[field]) for field in fields))

    @staticmethod
    def product_dump(product_arr):
        # Write the given products to a CSV file for manual inspection.
        # Assumes product_arr is non-empty: the header comes from the first row.
        with open('products_to_be_mapped.csv', 'w', newline='') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=product_arr[0].keys())

            writer.writeheader()
            writer.writerows(product_arr)
    

### A utility function that invokes some ETL code on our behalf

The convention:  
- `func` - Contains the ETL code to be invoked on our behalf.
- `src` - The source array.
- `dst1` - The destination array for products successfully mapped.
- `dst2` - The destination array for products not successfully mapped.

The names `src`, `dst1`, and `dst2` are only a convention. `pipeline` passes the three arrays positionally, so a definition of `func` may name its parameters anything as long as it keeps them in this order.

Each updated product needs to have the following fields:
- `product_class_id` - The code of the new product class.
- `meta_code` - A unique ID.
- `meta_mapped_by` - The initials of the person who mapped the product (e.g. AU, SJ, GK, AB, NB, etc.).
- `meta_reason` - The reason why this product was mapped (e.g. from a character match, from a specific manufacturer, etc.).

In [4]:
def pipeline(func, src, dst1, dst2):
    func(src, dst1, dst2)

def update_product(product, product_class_id, code, mapped_by, reason):
    product['product_class_id'] = product_class_id
    product['meta_code'] = code
    product['meta_mapped_by'] = mapped_by
    product['meta_reason'] = reason
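
As an illustration of the convention, here is a minimal sketch of a custom `func` passed through `pipeline`. The rule `map_milk`, the class ID `'99'`, the code `2`, the initials `'XX'`, and the sample rows are all hypothetical; only `pipeline`, `update_product`, and the `itemType` field come from this notebook. The two helpers are repeated so the sketch runs on its own.

```python
def pipeline(func, src, dst1, dst2):
    # Same shape as the notebook's helper: the arrays are passed positionally.
    func(src, dst1, dst2)

def update_product(product, product_class_id, code, mapped_by, reason):
    product['product_class_id'] = product_class_id
    product['meta_code'] = code
    product['meta_mapped_by'] = mapped_by
    product['meta_reason'] = reason

def map_milk(unmapped, mapped, still_unmapped):
    # Hypothetical rule. The parameter names differ from src/dst1/dst2 on
    # purpose: only their order matters to pipeline().
    for product in unmapped:
        if product['itemType'] == 'Milk':
            update_product(
                product=product,
                product_class_id='99',   # hypothetical class ID
                code=2,                  # hypothetical rule code
                mapped_by='XX',          # hypothetical initials
                reason='Hypothetical rule: item type is Milk')
            mapped.append(product)
        else:
            still_unmapped.append(product)

# Hypothetical sample rows, not taken from Products1.txt.
sample = [{'itemType': 'Milk'}, {'itemType': 'Widgets'}]
done, todo = [], []
pipeline(map_milk, sample, done, todo)

print(len(done), len(todo))
```

Every product lands in exactly one of the two destination arrays, so the unmapped array from one rule can serve as the source array for the next.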

#### Slide 9 stipulates that every product must have a key that will be mapped to our dimension table.

In [5]:
def generate_surrogate_key(src, dst1=None, dst2=None):
    product_id = 1

    for product in src:
        product['product_id'] = product_id
        product_id += 1

generate_surrogate_key(products_old)
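
A quick sanity check for the surrogate keys, as a sketch on hypothetical sample rows. The function below is an `enumerate`-based variant that should behave like the loop above; it is repeated here only so the check runs on its own.

```python
def generate_surrogate_key(src, dst1=None, dst2=None):
    # Assign 1, 2, 3, ... in list order (variant of the notebook's loop).
    for product_id, product in enumerate(src, start=1):
        product['product_id'] = product_id

# Hypothetical sample rows, not taken from Products1.txt.
sample = [{'itemType': 'Milk'}, {'itemType': 'Eggs'}, {'itemType': 'Bread'}]

generate_surrogate_key(sample)
first_pass = [product['product_id'] for product in sample]

# Running it again overwrites each key with the same value, so the step
# is safe to repeat as long as the list order has not changed.
generate_surrogate_key(sample)
second_pass = [product['product_id'] for product in sample]

keys_unique = len(set(first_pass)) == len(first_pass)
print(first_pass, second_pass, keys_unique)
```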


### Slide 17 stipulates that we have specific suppliers.

In [6]:
def generate_suppliers(src, dst1=None, dst2=None):
    for product in src:
        if product['itemType'] == 'Milk':
            product['Supplier'] = 'Rowan Dairy'
        else:
            product['Supplier'] = 'Rowan Warehouse'
            
generate_suppliers(products_old)


### Some useful conventions in this cell:

An array name in all caps indicates that the array is either (1) append-only or (2) not to be modified at all.
`PRODUCTS_MAPPED` is type 1 (append-only). `PRODUCT_CLASSES_NEW` is type 2 (read-only once loaded).

In [7]:
def natural_mapping(src, dst1, dst2):
    '''
        Map each product whose old item type matches a new subcategory name.

        Disallow duplicate product classes.
        Used the following Linux command to identify duplicates:
            cat product_class.txt | cut -f 2 | sort | uniq -c | sort -r | head
    '''
    product_subcategories = {}
    for subcategory in PRODUCT_CLASSES_NEW:
        # Leave out the subcategories excluded as duplicates.
        if subcategory['product_subcategory'] not in ('Coffee', 'Cleaners'):
            product_subcategories[subcategory['product_subcategory']] = subcategory['product_class_id']

    # Resolve one duplicate by hand: verified that the smaller of the two
    # class IDs is the one to use. Note that this value is an int, whereas
    # the IDs read from the file are strings.
    product_subcategories['Fresh Vegetables'] = 13

    for product in src:
        if product['itemType'] in product_subcategories:
            update_product(
                product=product,
                product_class_id=product_subcategories[product['itemType']],
                code=1,
                mapped_by='AU',
                reason='Mapped from old item type into new subcategory')
            dst1.append(product)
        else:
            dst2.append(product)

Products_To_Be_Mapped = []
natural_mapping(products_old, PRODUCTS_MAPPED, Products_To_Be_Mapped)
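
The duplicate check mentioned in the docstring can also be done inside the notebook. The following is a sketch using `collections.Counter`; the helper name `duplicate_subcategories` and the sample rows (including their class IDs) are hypothetical, shaped like the rows `csv.DictReader` produces for `product_class.txt`.

```python
from collections import Counter

# Hypothetical rows, not the real contents of product_class.txt.
sample_classes = [
    {'product_class_id': '1', 'product_subcategory': 'Nuts'},
    {'product_class_id': '2', 'product_subcategory': 'Coffee'},
    {'product_class_id': '3', 'product_subcategory': 'Coffee'},
]

def duplicate_subcategories(product_classes):
    # Count how often each subcategory name occurs and keep the repeated ones.
    counts = Counter(row['product_subcategory'] for row in product_classes)
    return {name: n for name, n in counts.items() if n > 1}

duplicates = duplicate_subcategories(sample_classes)
print(duplicates)
```

Calling the helper on `PRODUCT_CLASSES_NEW` would list the subcategory names that need a manual decision before they can be used as dictionary keys.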

In [8]:
print(len(Products_To_Be_Mapped))

DEBUG.product_dump(Products_To_Be_Mapped)

904
