# Project 4 Retail Dashboard - Background Work

    This notebook performs our data cleaning and transformation:
        - extracting raw data from the directory:
            - PLEASE NOTE: You must download the raw files into a folder called 'raw' within the 'data' folder. Otherwise the existing code won't work!
        - extracting additional files, i.e. branch_expenses, product_list.
            - PLEASE NOTE: these files must be kept apart from the raw data files, in their own folder called 'other' within the 'data' folder. Otherwise the existing code won't work!
        - removing duplicates
        - acquiring the product list
        - acquiring the region list
        - adding product category as a field
        - adding region & county as separate fields
        - cleaning the data:
            - fixing headers so they all match
            - converting incorrect str dtypes to int or float
        - saving a temporary csv file
        - extracting the temp file using pandas
        - adding the temp file's data to the existing grouped dataframe
        - saving the grouped dataframe
        
    This is done by iterating over the raw directory. The iteration is run 3 times; each run is determined by the year the branches were established.

    As the user of this script, after every cycle YOU MUST change the name of the final save file and the file_range variable to the correct establishing year.
    This is also noted in the code itself.

    Happy Coding!

## Setting Up
    - imports
    - directory creation
    - removal of duplicates
    - acquisition of product & region lists
    - creation of empty dataframes.

### imports

In [15]:
import openpyxl
import os
import pandas as pd
import petl as etl

### create data directory

In [16]:
raw_data_path = "data/raw"
raw_data_directory = []

for filename in os.listdir(raw_data_path):
    file = os.path.join(raw_data_path, filename)

    if os.path.isfile(file):
        #print(file)
        raw_data_directory.append(file)

In [17]:
#raw_data_directory

### remove csv duplicates

In [18]:
len(raw_data_directory)

94

In [19]:
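# drop every csv that has a json twin; build a new list rather than
# removing items from the list while iterating over it
json_twins = {
    file[:-len('json')] + 'csv'
    for file in raw_data_directory
    if file.endswith('json')
}
raw_data_directory = [file for file in raw_data_directory if file not in json_twins]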

len(raw_data_directory)

86

In [20]:
#raw_data_directory

### acquire product list

In [21]:
# acquire product list from directory
product_list = list(
    etl.fromcsv('data/other/products_list.csv')\
        .values('product', 'category'))
        
# print for checking
product_list


[('Fairtrade Bananas Loose', 'fruits & vegetables'),
 ('British carrots loose', 'fruits & vegetables'),
 ('Onions Loose', 'fruits & vegetables'),
 ('British baking potatoes loose', 'fruits & vegetables'),
 ('Red pepper', 'fruits & vegetables'),
 ('Mixed pepper', 'fruits & vegetables'),
 ('Brocolli loose', 'fruits & vegetables'),
 ('Lemon', 'fruits & vegetables'),
 ('Spring onions bunch', 'fruits & vegetables'),
 ('Sweet potatoes loose', 'fruits & vegetables'),
 ('courgette loose', 'fruits & vegetables'),
 ('baby potatoes', 'fruits & vegetables'),
 ('british parsnips loose', 'fruits & vegetables'),
 ('fine beans', 'fruits & vegetables'),
 ('garlic', 'fruits & vegetables'),
 ('celery', 'fruits & vegetables'),
 ('aubergine', 'fruits & vegetables'),
 ('raspberries', 'fruits & vegetables'),
 ('british bramley cooking apples', 'fruits & vegetables'),
 ('easy peeler loose', 'fruits & vegetables'),
 ('scottish salmon', 'meat & fish'),
 ('beef mince', 'meat & fish'),
 ('british fresh chicken br

### acquire region list

In [22]:
# acquire branch list from directory
branch_list = list(
    etl.fromxlsx('data/other/branch_list.xlsx')\
        .values('region', 'county', 'branch_name'))



# print for checking
branch_list

[('East of England', 'Bedfordshire', 'Bedfordshire store'),
 ('East of England', 'East Cambridgeshire', 'East Cambridgeshire outlet'),
 ('West Midlands', 'Warwickshire', 'Warwickshire branch'),
 ('East Midlands', 'Lincolnshire', 'Lincolnshire store'),
 ('London', 'Islington', 'Islington branch'),
 ('North East England', 'Stockton-on-Tees', 'Stockton-on-Tees store'),
 ('Northern Ireland', 'Belfast', 'Belfast branch'),
 ('North West England', 'Wyre', 'Wyre branch'),
 ('South East England', 'Mole Valley', 'Mole Valley store'),
 ('Scotland', 'Orkney', 'Orkney store'),
 ('South West England', 'Dorset', 'Dorset outlet'),
 ('East Midlands', 'Nottinghamshire', 'Nottinghamshire store'),
 ('Yorkshire and the Humber', 'North Yorkshire', 'North Yorkshire outlet'),
 ('South East England', 'Reigate and Banstead', 'Reigate and Banstead branch'),
 ('South East England', 'Epsom and Ewell', 'Epsom and Ewell store'),
 ('Wales', 'Wrexham', 'Wrexham store'),
 ('Wales', 'Isle of Anglesey', 'Isle of Anglesey

### creation of blank dfs

In [23]:
all_branches_df = pd.DataFrame()

## Cleanup & Transform Functions
    - adding the product category as a field
    - adding region & county as separate fields
    - cleanup of data

### adding product category

In [24]:
# add product category function
def add_product_category(prv, cur, nxt):
    for row in product_list:
        if cur.product == row[0]:
            return row[1]
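
The function above scans product_list once for every row. A minimal sketch of the same lookup done through a dictionary follows. It assumes each product name appears only once; the two sample pairs are copied from the printed product_list, and `category_for` is a hypothetical helper name that is not part of the notebook.

```python
# Two (product, category) pairs copied from the product_list output above.
product_list = [
    ('Fairtrade Bananas Loose', 'fruits & vegetables'),
    ('scottish salmon', 'meat & fish'),
]

# Build the mapping once; assumes each product name appears only once.
category_by_product = dict(product_list)


def category_for(product):
    """Hypothetical helper: return the category, or None for an unknown product."""
    return category_by_product.get(product)


print(category_for('scottish salmon'))   # meat & fish
print(category_for('unknown product'))   # None
```

Like the original function, this returns None when a product is missing from the list, so unmatched rows would still end up with an empty product_category.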

### adding region & county

In [25]:
# add region & county function
def add_region_city(table, file):
    # acquire branch name from the file name
    branch_name = str(file).split('_', 1).pop(1)\
        .rsplit('.', 1).pop(0)\
        .replace('_', ' ')
    # check through branch list for a match
    for row in branch_list:
        if branch_name == row[2]:
            table = etl.addfields(table, [('county', row[1]), ('region', row[0])])
            return table
    # no match: fail loudly instead of silently returning None
    raise ValueError(f'no branch named {branch_name!r} in branch_list')
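
The branch name is recovered from the file name by cutting at the first underscore, so the split only works while the directory part of the path contains no underscore. Below is a minimal sketch of a parser that looks at the file name alone. `branch_name_from_path` is a hypothetical helper, and the sample path is copied from the output printed by the iteration cell.

```python
from pathlib import PurePosixPath


def branch_name_from_path(path):
    """Hypothetical helper: '<years>_<Branch_name>.<ext>' -> 'Branch name'."""
    # Normalise Windows separators so the same code works on either platform.
    file_name = PurePosixPath(str(path).replace('\\', '/'))
    # stem drops the extension; the first '_' separates the years from the name.
    _, branch = file_name.stem.split('_', 1)
    return branch.replace('_', ' ')


print(branch_name_from_path('data/raw\\2012-2020_Reigate_and_Banstead_branch.csv'))
# Reigate and Banstead branch
```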

### cleanup
    - fix any mismatched headers
    - convert incorrect dtypes to appropriate ones.

#### fix headers

In [26]:
def fix_headers(table):
    # if the table has the header 'sku' or 'item', rename it to 'product'
    if 'sku' in etl.header(table):
        table = etl.rename(table, 'sku', 'product')
    elif 'item' in etl.header(table):
        table = etl.rename(table, 'item', 'product')

    # rename any quantity header to the single name 'total_quantity_purchased'
    if 'quantity' in etl.header(table):
        table = etl.rename(table, 'quantity', 'total_quantity_purchased')
    elif 'total_quantity' in etl.header(table):
        table = etl.rename(table, 'total_quantity', 'total_quantity_purchased')
    elif 'quantity_purchased' in etl.header(table):
        table = etl.rename(table, 'quantity_purchased', 'total_quantity_purchased')

    return table
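
The renames in fix_headers can also be read as a single alias table. A minimal sketch in plain Python follows, assuming a file never carries two aliases of the same field; `HEADER_ALIASES` and `normalise_header` are hypothetical names introduced here for illustration.

```python
# Alias -> canonical name, taken from the renames in fix_headers.
HEADER_ALIASES = {
    'sku': 'product',
    'item': 'product',
    'quantity': 'total_quantity_purchased',
    'total_quantity': 'total_quantity_purchased',
    'quantity_purchased': 'total_quantity_purchased',
}


def normalise_header(header):
    """Hypothetical helper: map each field name to its canonical name."""
    return [HEADER_ALIASES.get(name, name) for name in header]


print(normalise_header(['sku', 'quantity', 'amount_in_gbp']))
# ['product', 'total_quantity_purchased', 'amount_in_gbp']
```

Unlike the if/elif chain, this maps every alias it finds, which is why the one-alias-per-field assumption matters. Adding a new alias then means adding one dictionary entry rather than another branch.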

#### dtype conversions

In [27]:
def str_conversion(table):
    # convert field to int
    table = etl.convert(
                    table, 
                    ['total_quantity_purchased'],
                    int)

    # convert field to float,
    # rounded to 2 decimal places
    table = etl.convert(
                    table, 
                    'amount_in_gbp',
                    lambda cell: round(float(cell), 2))

    return table
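
To show what the two conversions do to a single cell, here is a minimal sketch using the same built-ins outside petl. The sample values are made up for illustration and the function names `to_quantity` and `to_gbp` are hypothetical.

```python
def to_quantity(cell):
    """Same conversion as the quantity field: plain int()."""
    return int(cell)


def to_gbp(cell):
    """Same conversion as amount_in_gbp: float rounded to 2 decimal places."""
    return round(float(cell), 2)


# Made-up sample cells, as they might arrive from a csv file (all strings).
print(to_quantity('3'))     # 3
print(to_gbp('12.3456'))    # 12.35
```

Note that `int('3.0')` raises a ValueError, so a quantity column written with decimals would need `int(float(cell))` instead; whether any raw file looks like that is not something the notebook shows.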

## Iterating Through Files
    - PLEASE NOTE: 
        This cell needs to be run 3 times, each time changing the file_range variable (set in the next cell) to the next year in which branches were established.
        That is, it is first set to 2010; after that run, change it to 2011, and then to 2012.

        DON'T FORGET to also change the save file's name to the corresponding year! Otherwise each run will overwrite the previous year's output.
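
The year check used in the next cell can be sketched on its own, which also shows how the three runs could be driven from one loop instead of editing file_range by hand. `established_in` is a hypothetical helper; the sample paths are copied from the output printed further down.

```python
def established_in(path, year):
    """Same test as the notebook cell: the text before the first '-' ends with the year."""
    return str(path).split('-')[0].endswith(str(year))


# File names copied from the printed output of the iteration cell.
sample_files = [
    'data/raw\\2012-2020_Armagh_outlet.csv',
    'data/raw\\2012-2020_Stockton-on-Tees_store.csv',
]

for year in ('2010', '2011', '2012'):
    matches = [f for f in sample_files if established_in(f, year)]
    print(year, len(matches))
# 2010 0
# 2011 0
# 2012 2
```

Only the part before the first hyphen is inspected, so hyphenated branch names such as Stockton-on-Tees do not affect the match.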

In [28]:
# set of files to work on through this iteration
# PLEASE CHANGE THIS TO 2011 AND 2012 FOR THE NEXT TWO RUN THROUGHS
file_range = "2010"
# for each file in the directory
for file in raw_data_directory:
    # skip files whose establishing year does not match file_range
    if not str(file).split('-').pop(0).endswith(file_range):
        continue
    print(file)

    # extract with the reader that matches the file type (csv, otherwise json)
    if str(file).endswith('csv'):
        current_table = etl.fromcsv(file)
    else:
        current_table = etl.fromjson(file)

    # initiate cleanup functions
    # fix mislabelled headers
    current_table = fix_headers(current_table)

    # convert field types from string to number
    current_table = str_conversion(current_table)

    # adding product categories to current_table
    current_table = etl.addfieldusingcontext(current_table, 'product_category', add_product_category)

    # adding region & county
    current_table = add_region_city(current_table, file)

    # save temp file
    current_table.tocsv('temp_branch.csv')

    # extract file using pandas
    current_df = pd.read_csv('temp_branch.csv')

    # append the current df to the total df for branches
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    all_branches_df = pd.concat([all_branches_df, current_df], ignore_index=True)
                      


data/raw\2012-2020_Armagh_outlet.csv
data/raw\2012-2020_Ballymoney_store.csv
data/raw\2012-2020_Bargoed_outlet.json
data/raw\2012-2020_Bedfordshire_store.json
data/raw\2012-2020_Colchester_outlet.json
data/raw\2012-2020_Darlington_store.csv
data/raw\2012-2020_East_Dunbartonshire_branch.json
data/raw\2012-2020_East_Hertfordshire_branch.json
data/raw\2012-2020_Edinburgh_City_branch.json
data/raw\2012-2020_Glasgow_City_outlet.csv
data/raw\2012-2020_Hackney_store.csv
data/raw\2012-2020_Isle_of_Anglesey_outlet.json
data/raw\2012-2020_Lancashire_store.json
data/raw\2012-2020_Lincolnshire_store.json
data/raw\2012-2020_Neath_Port_Talbot_outlet.json
data/raw\2012-2020_Newark_and_Sherwood_store.json
data/raw\2012-2020_Reigate_and_Banstead_branch.csv
data/raw\2012-2020_Rugby_branch.json
data/raw\2012-2020_Rushcliffe_branch.csv
data/raw\2012-2020_Selby_branch.json
data/raw\2012-2020_Sevenoaks_branch.csv
data/raw\2012-2020_Shepway_store.csv
data/raw\2012-2020_Stockton-on-Tees_store.csv
data/raw\201

## save
    - please note: after each run through, change the year at the end of the file name to the correct establishing year; otherwise the previous year's file will be overwritten!
    - also, please don't forget to create the 'refined' directory inside 'data' beforehand; otherwise an error will occur!

In [None]:

all_branches_df.to_csv('data/refined/branches_established_in_2010.csv', index=False)
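
Both reminders above can be handled in code. The following is a minimal sketch, assuming the same 'data/refined' location and file-name pattern as the save cell; `refined_csv_path` is a hypothetical helper that is not part of the notebook.

```python
import os


def refined_csv_path(year, base_dir='data/refined'):
    """Hypothetical helper: create base_dir if needed and return the save path for a year."""
    # exist_ok=True makes this safe to call on every run.
    os.makedirs(base_dir, exist_ok=True)
    return os.path.join(base_dir, f'branches_established_in_{year}.csv')
```

In the notebook this could be used as `all_branches_df.to_csv(refined_csv_path(file_range), index=False)`, so the file name follows file_range automatically.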