# P2P (Procure to Pay) Data Analysis & Visualization, Machine Learning Predictive Analytics using Julia Language

This is **Part - 3** of 3 ERP Data analysis notebooks.
- Part 1 - General Ledger, Data Science Basics
- Part 2 - General Ledger Data Analysis & Visualization
- Part 3 - P2P (Procure to Pay) Data Analysis & Visualization

**Related blogs:**
    
- [Web-scrapping, Web automation using Julia Language](https://amit-shukla.medium.com/web-scrapping-web-automation-using-julia-language-2c473db84fbc)
- Working with ODBC, ORM, XML, JSON, PDF, TXT, CSV, XLS
- Working with PDF documents, Image Scanner, OCR Reader

**Target Audience:** This notebook is meant for ERP consultants, IT developers, finance, supply chain, HR & CRM managers, executive leaders, or anyone curious about applying data science concepts in the ERP space.

+ **Author:** Amit Shukla
+ **Contact:** info@elishconsulting.com

In parts 1 and 2 of this three-part series, we covered the basics and details of the ERP Finance data model, learned the basics of the DataFrames.jl package, and performed detailed ERP data analysis with visualizations.


In this part 3 notebook, we will continue by analyzing Supply Chain data from the Procure to Pay (P2P) perspective, often referred to as Buy to Pay (B2P).

## adding Packages

In [40]:
using Pkg
Pkg.add("DataFrames")
Pkg.add("Dates")
Pkg.add("CategoricalArrays")
Pkg.add("Interact")
Pkg.add("WebIO")
Pkg.add("CSV")
Pkg.add("XLSX")
Pkg.add("DelimitedFiles")
Pkg.add("Distributions")
Pkg.build("WebIO")
Pkg.status();

[32m[1m      Status[22m[39m `~/.julia/environments/v1.7/Project.toml`
 [90m [336ed68f] [39mCSV v0.10.3
 [90m [54eefc05] [39mCascadia v1.0.1
 [90m [324d7699] [39mCategoricalArrays v0.10.5
 [90m [8f4d0f93] [39mConda v1.7.0
 [90m [a93c6f00] [39mDataFrames v1.3.2
 [90m [31c24e10] [39mDistributions v0.25.53
 [90m [e30172f5] [39mDocumenter v0.27.15
 [90m [8f5d6c58] [39mEzXML v1.1.0
 [90m [708ec375] [39mGumbo v0.8.0
 [90m [cd3eb016] [39mHTTP v0.9.17
 [90m [7073ff75] [39mIJulia v1.23.2
 [90m [c601a237] [39mInteract v0.10.4
 [90m [0f8b85d8] [39mJSON3 v1.9.4
 [90m [b9914132] [39mJSONTables v1.0.3
 [90m [4d0d745f] [39mPDFIO v0.1.13
 [90m [c3e4b0f8] [39mPluto v0.18.4
 [90m [2dfb63ee] [39mPooledArrays v1.4.0
 [90m [438e738f] [39mPyCall v1.93.1
 [90m [88034a9c] [39mStringDistances v0.11.2
 [90m [a2db99b7] [39mTextAnalysis v0.7.3
 [90m [05625dda] [39mWebDriver v0.1.2
 [90m [0f1e0344] [39mWebIO v0.8.17
 [90m [fdbf4ff8] [39mXLSX v0.7.9
 [90m [ade2ca70]

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Manifest.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Manifest.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Manifest.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Manifest.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.juli

In [1]:
using DataFrames, Dates, Interact, CategoricalArrays, WebIO, CSV, XLSX, DelimitedFiles, Distributions

*For the rest of this blog, I will assume you have added all packages and imported them into the current namespace/notebook scope.*

--- 
## Supply Chain Data Model
We already covered DataFrames and the ERP Finance data model in the Part 1 and Part 2 notebooks. In the section below, let's recreate all Supply Chain DataFrames so we can continue with advanced analytics and visualization.

#### Dimensions

- Item master, Item Attribs, Item Costing

    **UNSPSC:** The United Nations Standard Products and Services Code® (UNSPSC®) is a global classification system used to classify products and services.
    
    **GUDID:** The Global Unique Device Identification Database (GUDID) is a database administered by the FDA that will serve as a reference catalog for every device with a unique device identifier (UDI).

    **GTIN:** Global Trade Item Number (GTIN) can be used by a company to uniquely identify all of its trade items. GS1 defines trade items as products or services that are priced, ordered or invoiced at any point in the supply chain.

    **GMDN:** The Global Medical Device Nomenclature (GMDN) is a comprehensive set of terms, within a structured category hierarchy, which name and group ALL medical device products including implantables, medical equipment, consumables, and diagnostic devices.
    
    
- Vendor master, Vendor Attribs, Vendor Costing
- Customer/Buyer/Procurement Officer Attribs
- Ship-to, warehouse, storage & inventory locations
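
As a quick illustration of the item identifiers described above: a GTIN (8, 12, 13 or 14 digits) ends in a GS1 mod-10 check digit, so identifiers can be sanity-checked before they are loaded into a database. The sketch below is not part of the original notebook. `gtin_check_digit` and `is_valid_gtin` are hypothetical helper names, and the sample values are identifiers taken from the PO sample output further down in this notebook.

```julia
# Compute the GS1 mod-10 check digit for a GTIN payload (all digits except the last).
# Weights alternate 3, 1, 3, ... starting from the rightmost payload digit.
function gtin_check_digit(payload::AbstractString)
    total = 0
    for (i, c) in enumerate(reverse(payload))
        d = c - '0'                      # Char -> digit value
        total += isodd(i) ? 3 * d : d
    end
    return mod(10 - mod(total, 10), 10)
end

# Hypothetical validator: the length must be 8, 12, 13 or 14 digits
# and the last digit must equal the computed check digit.
function is_valid_gtin(code::AbstractString)
    (length(code) in (8, 12, 13, 14) && all(isdigit, code)) || return false
    return gtin_check_digit(code[1:end-1]) == code[end] - '0'
end

is_valid_gtin("191072148612")    # 12-digit identifier from the PO sample
is_valid_gtin("10884521795143")  # 14-digit identifier from the PO sample
is_valid_gtin("M7256008545")     # alphanumeric identifier, so not a GTIN
```

Not every device identifier in GUDID is a GTIN, so a check like this is best treated as a filter for GTIN-shaped values rather than a rule for the whole column.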

#### Transactions

-   PurchaseOrder
-   MSR - Material Service request
-   Voucher
-   Invoice
-   Receipt
-   Shipment
-   Sales, Revenue
-   Travel, Expense, TimeCard
-   Accounting Lines

## Item master

In [4]:
###############################
## create SUPPLY CHAIN DATA ###
###############################
# Item master, Item Attribs, Item Costing ##
#       UNSPSC, GUDID, GTIN, GMDN
############################################

##########
# UNSPSC #
##########
# UNSPSC file can be downloaded from this link https://www.ungm.org/Public/UNSPSC
xf = XLSX.readxlsx("sampleData/UNGM_UNSPSC_09-Apr-2022..xlsx")
# xf will display names of sheets and rows with data
# let's read this data in to a DataFrame

# using below command will read xlsx data into DataFrame but will not render column labels
# df = DataFrame(XLSX.readdata("UNGM_UNSPSC_09-Apr-2022..xlsx", "UNSPSC", "A1:D12988"), :auto)
dfUNSPSC = DataFrame(XLSX.readtable("sampleData/UNGM_UNSPSC_09-Apr-2022..xlsx", "UNSPSC")...)
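# ... operator will splat the tuple (data, column_labels) into the constructor of DataFrame
# (this matches XLSX v0.7.x as installed above; newer XLSX.jl releases may need DataFrame(XLSX.readtable(...)) without the splat)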

# replace missing values with an integer 99999
replace!(dfUNSPSC."Parent key", missing => 99999)
size(dfUNSPSC)

# let's export this clean csv, we'll load this into database
# CSV.write("UNSPSC.csv", dfUNSPSC)

# # remember to empty dataFrame after usage
# # Julia will flush it out automatically after session,
# # but often ERP data gets bulky during session
# Base.summarysize(dfUNSPSC)
# empty!(dfUNSPSC)
# Base.summarysize(dfUNSPSC)

"UNSPSC.csv"

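The missing-value replacement used in the cell above can be tried on a toy vector first. This is a minimal sketch with invented key values. The sentinel `99999` is the notebook's own choice, and `raw` and `filled` are hypothetical names. It compares `replace!`, which edits in place, with `coalesce.`, which returns a new vector.

```julia
# Toy stand-in for the "Parent key" column: top-level rows have no parent.
raw = Union{Missing, Int}[missing, 100000, 100000, missing, 100050]

# coalesce. builds a new vector with a concrete Int element type; raw is untouched.
filled = coalesce.(raw, 99999)

# replace! edits raw in place; its element type stays Union{Missing, Int}.
replace!(raw, missing => 99999)

eltype(filled), eltype(raw)
```

A concrete element type is usually friendlier when the column is later exported or joined, which is one reason to prefer `coalesce.` when an in-place update is not required.
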
In [5]:
##########
# GUDID ##
##########
# The complete list of GUDID Data Elements and descriptions can be found at this link.
# https://www.fda.gov/media/120974/download
# The complete GUDID Database (delimited version) download (250+MB)
# https://accessgudid.nlm.nih.gov/release_files/download/AccessGUDID_Delimited_Full_Release_20220401.zip
# let's extract all GUDID files in a folder
# readdir(pwd())
# readdir("sampleData/GUDID")
# since these files are in txt (delimited) format, we'll use the DelimitedFiles package

########################
## large txt files #####
## read one at a time ##
########################

# data, header = readdlm("sampleData/GUDID/contacts.txt", '|', header=true)
# dfGUDIDcontacts = DataFrame(data, vec(header))

# data, header = readdlm("sampleData/GUDID/identifiers.txt", '|', header=true)
# dfGUDIDidentifiers = DataFrame(data, vec(header))

data, header = readdlm("sampleData/GUDID/device.txt", '|', header=true)
dfGUDIDdevice = DataFrame(data, vec(header))

# CSV.write("GUDID.csv", dfGUDIDdevice[1:1000,:])

# # remember to empty dataFrame after usage
# # Julia will flush it out automatically after session,
# # but often ERP data gets bulky during session
# Base.summarysize(dfGUDIDcontacts),Base.summarysize(dfGUDIDidentifiers),Base.summarysize(dfGUDIDdevice)
# empty!(dfGUDIDcontacts)
# empty!(dfGUDIDidentifiers)
# empty!(dfGUDIDdevice)
# Base.summarysize(dfGUDIDcontacts),Base.summarysize(dfGUDIDidentifiers),Base.summarysize(dfGUDIDdevice)

"GUDID.csv"

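To see what `readdlm` returns with `header=true` without downloading the 250+MB release, here is a small sketch that reads a pipe-delimited sample from memory. The three rows are made up, and only the column names `PrimaryDI`, `brandName` and `companyName` come from the notebook.

```julia
using DelimitedFiles

# Three made-up rows in the same pipe-delimited layout as the GUDID release files.
sample = """
PrimaryDI|brandName|companyName
A100|Alpha Pump|ACME MEDICAL
B200|Beta Stent|ACME MEDICAL
C300|Gamma Scope|OMEGA DEVICES
"""

# With header=true, readdlm returns a (data, header) tuple.
data, header = readdlm(IOBuffer(sample), '|', header=true)

size(data)     # (rows, columns), not counting the header line
vec(header)    # the 1×N header matrix flattened into a vector of column names
# DataFrame(data, vec(header)) then builds the table, as in the cell above.
```
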
In [6]:
# dfGUDIDdevice has more than 3.3 million rows (3308327),
# let's split this in 6 mini files, 
# so that, it can be loaded into RDBMS easily
size(dfGUDIDdevice)
# CSV.write("dfGUDIDdevice_1.csv", dfGUDIDdevice[1:500000,:])
# CSV.write("dfGUDIDdevice_2.csv", dfGUDIDdevice[500001:1000000,:])
# CSV.write("dfGUDIDdevice_3.csv", dfGUDIDdevice[1000001:1500000,:])
# CSV.write("dfGUDIDdevice_4.csv", dfGUDIDdevice[1500001:2000000,:])
# CSV.write("dfGUDIDdevice_5.csv", dfGUDIDdevice[2000001:2500000,:])
# CSV.write("dfGUDIDdevice_6.csv", dfGUDIDdevice[2500001:3308327,:])

(3308327, 34)
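
The hand-written row ranges above can also be generated. A minimal sketch, assuming fixed-size chunks are acceptable: `chunk_ranges` is a hypothetical helper built on `Iterators.partition`, and the `CSV.write` call is left commented out, as in the notebook.

```julia
# Split 1:nrows into consecutive ranges of at most `chunksize` rows.
# `chunk_ranges` is a hypothetical helper, not part of any package.
chunk_ranges(nrows::Integer, chunksize::Integer) =
    collect(Iterators.partition(1:nrows, chunksize))

ranges = chunk_ranges(3_308_327, 500_000)

for (i, r) in enumerate(ranges)
    println("dfGUDIDdevice_$(i).csv <- rows $(first(r)):$(last(r))")
    # CSV.write("dfGUDIDdevice_$(i).csv", dfGUDIDdevice[r, :])
end
```

With a strict 500,000-row limit this yields seven ranges, whereas the cell above folds the remainder into the sixth file.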

In [7]:
##########
# GTIN ###
##########

# xf = XLSX.readxlsx("sampleData/DS_GTIN_ALL.xlsx")
# xf will display names of sheets and rows with data
# let's read this data in to a DataFrame

# using below command will read xlsx data into DataFrame but will not render column labels
# df = DataFrame(XLSX.readdata("sampleData/DS_GTIN_ALL.xlsx", "Worksheet", "A14:E143403"), :auto)
dfGTIN = DataFrame(XLSX.readtable("sampleData/DS_GTIN_ALL.xlsx", "Worksheet";first_row=14)...)
# ... operator will splat the tuple (data, column_labels) into the constructor of DataFrame

# replace missing values with an integer 99999
# replace!(dfUNSPSC."Parent key", missing => 99999)
# size(dfUNSPSC)

# let's export this clean csv, we'll load this into database
# CSV.write("GTIN.csv", dfGTIN)
# readdir(pwd())

# # remember to empty dataFrame after usage
# # Julia will flush it out automatically after session,
# # but often ERP data gets bulky during session
# Base.summarysize(dfGTIN)
# empty!(dfGTIN)
# Base.summarysize(dfGTIN)

"GTIN.csv"

In [25]:
##########
# GMDN ###
##########

## GMDN data is not available

# # remember to empty dataFrame after usage
# # Julia will flush it out automatically after session,
# # but often ERP data gets bulky during session
# Base.summarysize(dfGMDN)
# empty!(dfGMDN)
# Base.summarysize(dfGMDN)

## Vendor Master

In [8]:
#################
# Vendor master #
#################
# create Vendor Master from GUDID dataset
# show(first(dfGUDIDdevice,5), allcols=true)
# show(first(dfGUDIDdevice[:,[:brandName, :catalogNumber, :dunsNumber, :companyName, :rx, :otc]],5), allcols=true)
# names(dfGUDIDdevice)
# dfVendor = unique(dfGUDIDdevice[:,[:brandName, :catalogNumber, :dunsNumber, :companyName, :rx, :otc]])
# dfVendor = unique(dfGUDIDdevice[:,[:companyName]]) # 7574 unique vendors
dfVendor = unique(dfGUDIDdevice[:,[:brandName, :dunsNumber, :companyName, :rx, :otc]])
# dfVendor is a good dataset: about 216k rows covering 7574 unique vendors

# # remember to empty dataFrame after usage
# # Julia will flush it out automatically after session,
# # but often ERP data gets bulky during session
# Base.summarysize(dfVendor)
# empty!(dfVendor)
# Base.summarysize(dfVendor)

# CSV.write("VENDOR.csv", dfVendor[1:1000,:])

"VENDOR.csv"

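The vendor table above has far more rows than vendors because `unique` over several columns keeps every distinct combination of values. The sketch below shows the same effect on made-up (brandName, companyName) pairs, using plain tuples instead of a DataFrame to keep it dependency-free.

```julia
# Made-up device rows: (brandName, companyName)
rows = [
    ("Alpha Pump",  "ACME MEDICAL"),
    ("Alpha Pump",  "ACME MEDICAL"),   # exact duplicate, dropped by unique
    ("Beta Stent",  "ACME MEDICAL"),
    ("Gamma Scope", "OMEGA DEVICES"),
]

combos  = unique(rows)           # distinct (brand, company) combinations
vendors = unique(last.(rows))    # distinct company names only

length(combos), length(vendors)
```

This is why the transaction cells later in the notebook sample from `unique(dfVendor.companyName)` rather than from `dfVendor.companyName` directly.
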
## Location Master

In [9]:
data, header = readdlm("sampleData/uscities.csv", ',', header=true)
dfLocation = DataFrame(data, vec(header))

# # remember to empty dataFrame after usage
# # Julia will flush it out automatically after session,
# # but often ERP data gets bulky during session
# Base.summarysize(dfLocation)
# empty!(dfLocation)
# Base.summarysize(dfLocation)

# CSV.write("LOCATION_MASTER.csv", dfLocation[1:1000,:])

"LOCATION_MASTER.csv"

In [36]:
readdir("sampleData/GUDID")

9-element Vector{String}:
 "contacts.txt"
 "device.txt"
 "deviceSizes.txt"
 "environmentalConditions.txt"
 "gmdnTerms.txt"
 "identifiers.txt"
 "premarketSubmissions.txt"
 "productCodes.txt"
 "sterilizationMethodTypes.txt"

## Organization Master

In [10]:
dfOrgMaster = DataFrame(
    ENTITY=repeat(["HeadOffice"], inner=8),
    GROUP=repeat(["Operations"], inner=8),
    DEPARTMENT=["Procurement","Procurement","Procurement","Procurement","Procurement","HR","HR","MFG"],
    UNIT=["Sourcing","Sourcing","Maintenance","Support","Services","Helpdesk","ServiceCall","IT"])

    # CSV.write("ORG_MASTER.csv", dfOrgMaster[:,:])

"ORG_MASTER.csv"

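The transactions created later carry only a `UNIT` value, so a lookup back into the org hierarchy is handy. A minimal sketch, reusing the `DEPARTMENT` and `UNIT` values from the cell above as plain vectors. `unit_to_dept` and `txn_units` are hypothetical names.

```julia
# Same DEPARTMENT and UNIT values as dfOrgMaster above, as plain vectors.
department = ["Procurement","Procurement","Procurement","Procurement","Procurement","HR","HR","MFG"]
unit       = ["Sourcing","Sourcing","Maintenance","Support","Services","Helpdesk","ServiceCall","IT"]

# UNIT => DEPARTMENT lookup; the repeated "Sourcing" row maps to the same department.
unit_to_dept = Dict(zip(unit, department))

# Tag a few transaction units with their department.
txn_units = ["Helpdesk", "Sourcing", "IT"]
[unit_to_dept[u] for u in txn_units]
```
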
--- 

## creating complete Supply Chain Data Model DataFrames
Now that we have created the Supply Chain attributes (chartfields/dimensions):

- item master
- vendor master
- location master
- org Hierarchy

Using the above chartfields, let's create the following Supply Chain transactions:

-   MSR - Material Service request
-   PurchaseOrder
-   Voucher
-   Invoice
-   Receipt
-   Shipment
-   Sales, Revenue
-   Travel, Expense, TimeCard
-   Accounting Lines

## MSR - Material Service request

In [11]:
sampleSize = 1000 # number of rows, scale as needed

dfMSR = DataFrame(
    UNIT = rand(dfOrgMaster.UNIT, sampleSize),
    MSR_DATE=rand(collect(Date(2020,1,1):Day(1):Date(2022,5,1)), sampleSize),
    FROM_UNIT = rand(dfOrgMaster.UNIT, sampleSize),
    TO_UNIT = rand(dfOrgMaster.UNIT, sampleSize),
    GUDID = rand(dfGUDIDdevice.PrimaryDI, sampleSize),
    QTY = rand(1:150, sampleSize)); # numeric quantity, same range as the other transaction cells
first(dfMSR, 5)

    # CSV.write("MSR.csv", dfMSR[1:1000,:])

"MSR.csv"

## Purchase Order

In [12]:
sampleSize = 1000 # number of rows, scale as needed

dfPO = DataFrame(
    UNIT = rand(dfOrgMaster.UNIT, sampleSize),
    PO_DATE=rand(collect(Date(2020,1,1):Day(1):Date(2022,5,1)), sampleSize),
    VENDOR=rand(unique(dfVendor.companyName), sampleSize),
    GUDID = rand(dfGUDIDdevice.PrimaryDI, sampleSize),
    QTY = rand(1:150, sampleSize),
    UNIT_PRICE = rand(Normal(100, 2), sampleSize)
    );
show(first(dfPO, 5),allcols=true)

# CSV.write("PO.csv", dfPO[1:1000,:])

5×6 DataFrame
 Row │ UNIT         PO_DATE     VENDOR                             GUDID           QTY    UNIT_PRICE
     │ String       Date        Any                                Any             Int64  Float64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Sourcing     2021-03-12  CAREDX, INC.                       191072148612       42     99.4394
   2 │ Services     2021-11-15  SUTTER MEDICAL TECHNOLOGIES USA,…  M7256008545       106    101.331
   3 │ ServiceCall  2020-09-16  NEWMARKET BIOMEDICAL LIMITED       10884521795143    146     99.0341
   4 │ Support      2022-02-08  WATERS MEDICAL SYSTEMS, LLC        5415067023377     131     97.6662
   5 │ Sourcing     2020-05-17  ULTHERA, INC.                      385640029568      142     96.7298

"PO.csv"

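Because every transaction cell draws random values, each run produces different data. If repeatable samples are wanted, one option is to pass an explicit seeded RNG. This is a sketch under stated assumptions: `sample_po` is a hypothetical helper, the seed 42 is arbitrary, and `100 .+ 2 .* randn(...)` stands in for `rand(Normal(100, 2), ...)` so the example needs only the standard library.

```julia
using Random, Dates

# Hypothetical helper: draw `n` synthetic PO lines from an explicit RNG.
function sample_po(rng::AbstractRNG, n::Integer)
    dates = Date(2020, 1, 1):Day(1):Date(2022, 5, 1)   # rand can sample a range directly
    return (
        PO_DATE    = rand(rng, dates, n),
        QTY        = rand(rng, 1:150, n),
        # mean + sd * standard normal draw, i.e. a Normal(100, 2) sample
        UNIT_PRICE = 100 .+ 2 .* randn(rng, n),
    )
end

a = sample_po(MersenneTwister(42), 1000)
b = sample_po(MersenneTwister(42), 1000)
a == b    # same seed, same data
```

The result is a NamedTuple of equal-length vectors, which `DataFrame(a)` should accept as a column table.
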
## Voucher Invoice

In [13]:
sampleSize = 1000 # number of rows, scale as needed

dfVCHR = DataFrame(
    UNIT = rand(dfOrgMaster.UNIT, sampleSize),
    VCHR_DATE=rand(collect(Date(2020,1,1):Day(1):Date(2022,5,1)), sampleSize),
    STATUS=rand(["Closed","Paid","Open","Cancelled","Exception"], sampleSize),
    VENDOR_INVOICE_NUM = rand(10001:9999999, sampleSize),
    VENDOR=rand(unique(dfVendor.companyName), sampleSize),
    GUDID = rand(dfGUDIDdevice.PrimaryDI, sampleSize),
    QTY = rand(1:150, sampleSize),
    UNIT_PRICE = rand(Normal(100, 2), sampleSize)
    );
show(first(dfVCHR, 5),allcols=true)

# CSV.write("VOUCHER.csv", dfVCHR[1:1000,:])

5×8 DataFrame
 Row │ UNIT         VCHR_DATE   STATUS     VENDOR_INVOICE_NUM  VENDOR                          GUDID          QTY    UNIT_PRICE
     │ String       Date        String     Int64               Any                             Any            Int64  Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Helpdesk     2021-08-17  Open                   769145  GERMAINE LABORATORIES, INC      814033020559      31    102.394
   2 │ Maintenance  2020-03-18  Cancelled             6922830  OSADA ELECTRIC CO.,LTD.         814978020546      50    101.569
   3 │ Support      2022-02-24  Cancelled             2343017  Hangzhou AGS MedTech Co., Ltd.  8056640016767    146     99.7956
   4 │ Services     2021-08-02  Closed                179

"VOUCHER.csv"

## SALES

In [14]:
sampleSize = 1000 # number of rows, scale as needed

dfREVENUE = DataFrame(
    UNIT = rand(dfOrgMaster.UNIT, sampleSize),
    SALES_DATE=rand(collect(Date(2020,1,1):Day(1):Date(2022,5,1)), sampleSize),
    STATUS=rand(["Sold","Pending","Hold","Cancelled","Exception"], sampleSize),
    SALES_RECEIPT_NUM = rand(10001:9999999, sampleSize),
    CUSTOMER=rand(unique(dfVendor.companyName), sampleSize),
    GUDID = rand(dfGUDIDdevice.PrimaryDI, sampleSize),
    QTY = rand(1:150, sampleSize),
    UNIT_PRICE = rand(Normal(100, 2), sampleSize)
    );
show(first(dfREVENUE, 5),allcols=true)

# CSV.write("SALES.csv", dfREVENUE[1:1000,:])

5×8 DataFrame
 Row │ UNIT      SALES_DATE  STATUS     SALES_RECEIPT_NUM  CUSTOMER                           GUDID          QTY    UNIT_PRICE
     │ String    Date        String     Int64              Any                                Any            Int64  Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Sourcing  2020-02-05  Cancelled            9590264  Siemens Shanghai Medical Equipme…  8806378323473     83    100.088
   2 │ Support   2021-07-12  Sold                 8926812  AB Ardent                          3700780603770     23    104.813
   3 │ Services  2021-05-23  Hold                 4004679  Ambra Health                       887517046635      72    100.262
   4 │ Helpdesk  2022-02-20  Pending              1459425  SHIFE

"SALES.csv"

## SHIPMENT, RECEIPT

In [15]:
sampleSize = 1000 # number of rows, scale as needed

dfSHIPRECEIPT = DataFrame(
    UNIT = rand(dfOrgMaster.UNIT, sampleSize),
    SHIP_DATE=rand(collect(Date(2020,1,1):Day(1):Date(2022,5,1)), sampleSize),
    STATUS=rand(["Shipped","Returned","In process","Cancelled","Exception"], sampleSize),
    SHIPMENT_NUM = rand(10001:9999999, sampleSize),
    CUSTOMER=rand(unique(dfVendor.companyName), sampleSize),
    GUDID = rand(dfGUDIDdevice.PrimaryDI, sampleSize),
    QTY = rand(1:150, sampleSize),
    UNIT_PRICE = rand(Normal(100, 2), sampleSize)
    );
show(first(dfSHIPRECEIPT, 5),allcols=true)

# CSV.write("SHIPRECEIPT.csv", dfSHIPRECEIPT[1:1000,:])

5×8 DataFrame
 Row │ UNIT         SHIP_DATE   STATUS     SHIPMENT_NUM  CUSTOMER                           GUDID           QTY    UNIT_PRICE
     │ String       Date        String     Int64         Any                                Any             Int64  Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Helpdesk     2020-01-03  Shipped         6110428  NIMBIC SYSTEMS, INC.               10809160251607     20     95.9247
   2 │ Sourcing     2020-01-11  Cancelled       4638687  Scican Ltd                         30634303020123    130     99.9792
   3 │ ServiceCall  2020-09-25  Cancelled       9330541  W.H.P.M. INC.                      887517888754       68    100.744
   4 │ Helpdesk     2020-12-03  Exception       6892474  DORAN SCALE

"SHIPRECEIPT.csv"