# Everything you need to know about Julia DataFrames to support ERP Data Science Analysis
---

**Background:** Most of Enterprise ERP providers like SAP, Oracle, Microsoft build HCM, Finance, Supply Chain, CRM like systems, which store data in highly structured RDBMS tables.
Recent advancements in ERP systems also support authoring non-structured data like digital invoices, receipt or hand-held OCR readers.

All of these ERP systems are great OLTP systems, but depend on Analytic systems for creating dashboards, ad-hoc analysis, operational reporting or live predictive analytics.

Further, ERP systems depend on ELT/ELT or 3rd party tools for data mining, analysis and visualizations.

While data engineers use Java, Scala, SPARK based big data solutions to move data, they depend on 3rd party BI Reporting tools for creating dashboards, use Data Mining tools for data cleansing and AI Languages for advance predictive analytics.

When I started learning more about Julia Language, I thought of using Julia Language to solve ERP Analytics multiple languages problem.
Why not just use Julia Language to move, clean massive data set as Big data reporting solution, as Julia support multi-threading, distributing parallel computing.
Julia language and associated packages has first class support for large arrays, which can be used for data analysis.

and Julia has great visualization packages to publish interactive dashboards, live data reporting.

best of all, Julia is great in numerical computing, advance data science machine learning.

This blog, I am sharing my notes specific to perform typical ERP data analysis using Julia Language.

**Target Audience:** This notebook, is meant for ERP consultants, IT Developers, Finance, Supply chain, HR & CRM managers, executive leaders or anyone curious to implement data science concepts in ERP space.

+ **Author:** Amit Shukla
+ **Contact:** info@elishconsulting.com

-----

# - About ERP Systems, General Ledger & Supply chain
A typical ERP system consists of many modules based on business domain, functions and operations.
GL is core of Finance and Supply chain domains and Buy to Pay, Order to Cash deal with different aspects of business operations in an Organization.
Many organization, use ERPs in different ways and may chose to implement all or some of the modules.
You can find examples of module specific business operations/processes diagram here.
- [General Ledger process flow](https://github.com/AmitXShukla/AmitXShukla.github.io/raw/master/blogs/PlutoCon/gl.png)
- [Account Payable process flow](https://github.com/AmitXShukla/AmitXShukla.github.io/raw/master/blogs/PlutoCon/ap.png)
- [Tax Analytics](https://github.com/AmitXShukla/AmitXShukla.github.io/raw/master/blogs/PlutoCon/tax.png)
- [Sample GL ERD - Entity Relaton Diagram](https://github.com/AmitXShukla/AmitXShukla.github.io/raw/master/blogs/PlutoCon/gl_erd.png)

A typical ERP modules list looks like below diagram.

![ERP Modules](https://github.com/AmitXShukla/AmitXShukla.github.io/raw/master/blogs/PlutoCon/ERP_modules.png)

A typical ERP business process flow looks like below diagram.

![ERP Processes](https://github.com/AmitXShukla/P2P.ai/blob/main/docs/assets/images/ERD_logical.png?raw=true)

A typical GL Balance sheet, Cash-flow or Income Statement looks like this 

[click here](https://s2.q4cdn.com/470004039/files/doc_financials/2020/q4/FY20_Q4_Consolidated_Financial_Statements.pdf)

In this notebook, I will do my best to cite examples from real world data like above mentioned GL Financial statement.

---

# start with Julia 
It literally takes < 1 min to install Julia environments on almost any machine.

Here is [link to my tutorial](https://medium.com/me/stats/post/823d84f2cb28), which discuss Julia installation on different machines (including remote and mobile tablets).


## adding Packages

In [7]:
using Pkg
Pkg.add("DataFrames")
Pkg.add("Dates")
using DataFrames, Dates
Pkg.status()

[32m[1m      Status[22m[39m `~/.julia/environments/v1.7/Project.toml`
 [90m [324d7699] [39mCategoricalArrays v0.10.5
 [90m [a93c6f00] [39mDataFrames v1.3.2
 [90m [7073ff75] [39mIJulia v1.23.2
 [90m [c3e4b0f8] [39mPluto v0.18.4
 [90m [2dfb63ee] [39mPooledArrays v1.4.0
 [90m [ade2ca70] [39mDates


[32m[1m    Updating[22m[39m registry at `~/.julia/registries/General`
[32m[1m    Updating[22m[39m git-repo `https://github.com/JuliaRegistries/General.git`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Manifest.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Manifest.toml`


*rest of this blog, I will assume, you have added all packages and imported in current namespace/notebook scope.*

## helper functions

In [78]:
repeat(["AMIT","SHUKLA"], inner=5) # repeat list/string number of times
fill("34", 4) # repeat list/string number of times
range(1.0, stop=9.0, length=100) # generate n number of equal values between start and stop values
11000:1000:45000 # genarate a range of # from start to finish with set intervals
collect(1:4) # collect funtion collect all values in list
rand([1,2,3,4]) # random value from a list of values
rand(11000:1000:45000) # random value from a list of values
randn() # random # from a list of float values (+ or -)

# more helper string functions - replace etc... 
# fill these in from Pluto notebooks

-0.46708904491315945

## create DataFrame

Chart of accounts (organized hierarchy of account groups in tree form), Location/Department or Product based hierarchy allows businesses to group and report organization activities based on business processes.

These hierarchical grouping help capture monetary and statistical values of organization in finance statements.

---

To create Finance Data model and 
[Ledger Cash-flow or Balance Sheet like statements](https://s2.q4cdn.com/470004039/files/doc_financials/2020/q4/FY20_Q4_Consolidated_Financial_Statements.pdf),
We need associated dimensions (chartfields like chart of accounts).

We will discuss how to load actual data from CSV or RDBMS later. We will also learn how to group and create chartfield hierarchies later.

But for now, first Let's start with creating fake ACCOUNT, department and location chartfields.

In [61]:
# create dummy data
accounts = DataFrame(AS_OF_DATE=Date("1900-01-01", dateformat"y-m-d"), 
    ID = 11000:1000:45000,
    CLASSIFICATION=repeat([
        "OPERATING_EXPENSES","NON-OPERATING_EXPENSES", "ASSETS","LIABILITIES","NET_WORTH","STATISTICS","REVENUE"
                ], inner=5),
    CATEGORY=[
        "Travel","Payroll","non-Payroll","Allowance","Cash",
        "Facility","Supply","Services","Investment","Misc.",
        "Depreciation","Gain","Service","Retired","Fault.",
        "Receipt","Accrual","Return","Credit","ROI",
        "Cash","Funds","Invest","Transfer","Roll-over",
        "FTE","Members","Non_Members","Temp","Contractors",
        "Sales","Merchant","Service","Consulting","Subscriptions"],
    STATUS="A",
    DESCR=repeat([
    "operating expenses","non-operating expenses","assets","liability","net-worth","stats","revenue"], inner=5),
    ACCOUNT_TYPE=repeat(["E","E","A","L","N","S","R"],inner=5));

show("Accounts DIM size is: "), show(size(accounts)), show("Accounts Dim sample: "), accounts[collect(1:5:35),:]

"Accounts DIM size is: "(35, 7)"Accounts Dim sample: "

(nothing, nothing, nothing, [1m7×7 DataFrame[0m
[1m Row [0m│[1m AS_OF_DATE [0m[1m ID    [0m[1m CLASSIFICATION         [0m[1m CATEGORY     [0m[1m STATUS [0m[1m DESCR [0m ⋯
[1m     [0m│[90m Date       [0m[90m Int64 [0m[90m String                 [0m[90m String       [0m[90m String [0m[90m String[0m ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ 1900-01-01  11000  OPERATING_EXPENSES      Travel        A       operat ⋯
   2 │ 1900-01-01  16000  NON-OPERATING_EXPENSES  Facility      A       non-op
   3 │ 1900-01-01  21000  ASSETS                  Depreciation  A       assets
   4 │ 1900-01-01  26000  LIABILITIES             Receipt       A       liabil
   5 │ 1900-01-01  31000  NET_WORTH               Cash          A       net-wo ⋯
   6 │ 1900-01-01  36000  STATISTICS              FTE           A       stats
   7 │ 1900-01-01  41000  REVENUE                 Sales         A       revenu
[36m                            

There is lot to unpack here in above Julia code and lot is wrong (not best practice for sure).

First, **what is a dataframe anyway**, think of Julia DataFrame as tabular representation of data arranged in rows and columns. Unlike SQL, you should get into habit of reading and writing one column at a time (not because of reason, you can't read/write rows). Each column is an Array or a list of values, referred as vector.

Above Julia code creates accounts dataframe with columns name as AS_OF_DATE, DESCR, CATEGORY, ACCOUNT_TYPE, CLASSIFICATION, STATUS.

There are 35 rows, with same AS_OF_DATE, IDs starting from 11000-45000 in 1000 incremental values, all with STATUS = A (Active), 7 distinct Descriptions and account types (E=Expense, L=Liability, A= Assets, N=Net worth, S=Stats, R=Revenue) repeating 5 times per category.

For 35 rows, it's fine to store data like this, but now is a good time to learn about Categorical and Pooled Arrays, in case when dataframe has millions of rows.

In [11]:
using Pkg
Pkg.add("CategoricalArrays")
Pkg.add("PooledArrays")
Pkg.status()

[32m[1m      Status[22m[39m `~/.julia/environments/v1.7/Project.toml`
 [90m [324d7699] [39mCategoricalArrays v0.10.5
 [90m [a93c6f00] [39mDataFrames v1.3.2
 [90m [7073ff75] [39mIJulia v1.23.2
 [90m [c3e4b0f8] [39mPluto v0.18.4
 [90m [2dfb63ee] [39mPooledArrays v1.4.0
 [90m [ade2ca70] [39mDates


[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Manifest.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.7/Manifest.toml`


In [56]:
# here CLASSIFICATION column vector stores 3500 distinct values in an array
CLASSIFICATION=repeat(["OPERATING_EXPENSES","NON-OPERATING_EXPENSES", "ASSETS","LIABILITIES","NET_WORTH","STATISTICS","REVENUE"
                ], inner=500)

using CategoricalArrays
cl = categorical(CLASSIFICATION)
levels(cl)

using PooledArrays
pl = categorical(CLASSIFICATION)
levels(pl)

# show values in tabular format
df = DataFrame(Dict("Descr" => "CLASSIFICATION...ARR...", "Value" => size(CLASSIFICATION)[1]))
push!(df,("CAT...ARR...",size(cl)[1]))
push!(df,("CAT...ARR..COMPRESS.",size(compress(cl))[1]))
push!(df,("POOL...ARR...",size(pl)[1]))
push!(df,("POOL...ARR..COMPRESS.",size(compress(pl))[1]))
push!(df,("CAT...LEVELs...",size(levels(cl))[1]))
push!(df,("POOL...LEVELs...",size(levels(pl))[1]))
push!(df,("CLASSIFICATION...MEMSIZE", Base.summarysize(CLASSIFICATION)))
push!(df,("CAT...ARR...MEMSIZE", Base.summarysize(cl)))
push!(df,("POOL...ARR...MEMSIZE", Base.summarysize(pl)))
push!(df,("CAT...ARR..COMPRESS...MEMSIZE", Base.summarysize(compress(cl))))
push!(df,("POOL...ARR..COMPRESS...MEMSIZE", Base.summarysize(compress(pl))))

Unnamed: 0_level_0,Descr,Value
Unnamed: 0_level_1,String,Int64
1,CLASSIFICATION...ARR...,3500
2,CAT...ARR...,3500
3,CAT...ARR..COMPRESS.,3500
4,POOL...ARR...,3500
5,POOL...ARR..COMPRESS.,3500
6,CAT...LEVELs...,7
7,POOL...LEVELs...,7
8,CLASSIFICATION...MEMSIZE,28179
9,CAT...ARR...MEMSIZE,14739
10,POOL...ARR...MEMSIZE,14739


**Categorical and Pooled Arrays** as name suggests, are data structure to store voluminous data efficiently,specially when a column in a data frame has small number of distinct values (aka levels), repeated across entire column vector.

as an example, Finance Ledger may have millions of transactions and every row has one of these seven type of accounts. It's not recommended to store repeating value of entire string in every row. Instead, using a Categorical or PooledArray data type, memory/data size can be significantly reduced with out losing any data quality. (size(..) stays same for original, Categorical and PooledArray data type.

as you can see in above example, size of categorical / pooled array data type matches with original column vector but significantly reduces size/memory of data. (Base.summarysize(...)) is reduced 50% and is further reduced by 85% if used with compress(...))

Using Categorical Array type over PooledArray is recommended when there are fewer unique values, user need meaningful ordering and grouping. On the other hand, PoolArray is preferred when small memory usage is needed.

## ways of creating dataframe

There are different ways of creating dataframe, and it all depends on how user want to see data, which is almost always tabular anyway.

There are few things to keep in mind when working DataFrames.

- Think in terms of columns, and pay attention to column datatypes. for example, as mentioned earlier, using Categorical or PooledArray can significantly improve data analysis performance and save on memory
- Think, how do you want to filter dataset, and in that case, row index become very important.
- Pay attention to column names (column names with spaces, or special characters, can cause inconvenience). Julia has class support for variable names, which means it can store any type of literal string and not break. But one should follow standard guidelines for naming conventions.
- Pay attention to read dataframe columns efficiently, reading/mutating original version can harm data quality and analysis but unnecessary making copies of data add clutter to your temp space.
- Often DataFrames is good enough to support any query and transformations, but just in case, if you need, there are more data query frameworks like DataFramesMeta.jl & Query.jl to support advance use.



## creating data frame from 
this is a tuple
this is a named tuple
this is a vector
this is a categorical vector

## creating data frame from CSV, JSON, ODBC/ORM, XML
DataFrame(a=1:2, b=[1.0, missing],
                 c=categorical('a':'b'), d=[1//2, missing])
from tuple
series
dataframe
dict
namedtuple
csv
xls
json
xml
sql
column name with space
tuple
named tuple
dicts

normal/guassian distribution

## type systems
ledger
subledger
accounting
chartfields

category
typeof
subtypes
supertype
eltypes


#### show functions
first
last
show
eachcol
nrow
ncols
names
propertynames
describe
eltype

## transformation

unique rows
group by
order by
sort


mapscols
broadcasting
ncols
cols
regesx
match
view

#group by

# Visualization

# build interactive visualization

In [6]:
show()

"/home/ubuntu/amit/WIP/AmitXShukla.github.io/blogs/julia"

Hello Friends,
In this video, we will discuss everything one need to know about Julia Data Frames to perform a detail ERP Data analyis.

In case if you are not familiar with Julia Language, it's one of newer langauge for Data Science, you can compare this with R and Python. However, it's a newer language, which runs like C and walks like Python.

I'm not going to discuss, R vs Python vs Julia, I think, each language has Pros and Cons. Please don't waste your time on pointless powerpoint comparisons, specially when it's easier to just pick these languages and start coding, and you will sooner or later, once you get a hang of programming language, there comes a time, you will know, which language meets your need.

In this blog, we will discuss following topics.

1. about ERP data analysiswhat are 
2. why Julia Language
3. Julia & package Installation
4. using Julia Data Frames for data analysis
5. Data Visualization
6. other packages like online stats, ODBC, JuliaDB
7. Data Cleansing, Wrangling, Masking & Analysis
