# Everything you need to know about Julia DataFrames to support ERP Data Science for Finance Data Model Analysis
---

This is **Part - 1** of 3 ERP Data analysis notebooks.
- Part 1 - General Ledger, Data Science Basics
- Part 2 - General Ledger Data Analysis & Visualization
- Part 3 - P2P (Procure to Pay) Data Analysis & Visualization


**Target Audience:** This notebook is meant for ERP consultants, IT developers, Finance, Supply Chain, HR & CRM managers, executive leaders or anyone curious to implement data science concepts in the ERP space.

+ **Author:** Amit Shukla
+ **Contact:** info@elishconsulting.com

**Background:** Most enterprise ERP providers, such as SAP, Oracle and Microsoft, build HCM, Finance, Supply Chain and CRM systems that store data in highly structured RDBMS tables.
Recent ERP systems also support capturing unstructured data such as digital invoices, receipts and input from hand-held OCR readers.

All of these ERP systems are great OLTP systems, but they depend on analytic systems for dashboards, ad-hoc analysis, operational reporting and live predictive analytics.

Further, ERP systems depend on ETL/ELT or third-party tools for data mining, analysis and visualization.

While data engineers use Java, Scala and Spark based big data solutions to move data, they depend on third-party BI reporting tools for dashboards, data mining tools for data cleansing and AI languages for advanced predictive analytics.

When I started learning the Julia language, I thought of using it to solve this multiple-languages problem in ERP analytics.
Why not use Julia alone to move and clean massive data sets as a big data reporting solution, since Julia supports multi-threading and distributed parallel computing?
Julia and its packages have first-class support for large arrays, which can be used for data analysis.

Julia also has great visualization packages for publishing interactive dashboards and live data reporting.

Best of all, Julia is great at numerical computing, advanced data science and machine learning.

In this blog, I am sharing my notes on performing typical ERP data analysis with Julia.

-----

# - About ERP Systems, General Ledger & Supply chain
A typical ERP system consists of many modules based on business domains, functions and operations.
GL is the core of the Finance and Supply Chain domains, while Buy to Pay and Order to Cash deal with different aspects of business operations in an organization.
Organizations use ERPs in different ways and may choose to implement all or only some of the modules.
You can find examples of module-specific business operations/process diagrams here.
- [General Ledger process flow](https://github.com/AmitXShukla/AmitXShukla.github.io/raw/master/blogs/PlutoCon/gl.png)
- [Account Payable process flow](https://github.com/AmitXShukla/AmitXShukla.github.io/raw/master/blogs/PlutoCon/ap.png)
- [Tax Analytics](https://github.com/AmitXShukla/AmitXShukla.github.io/raw/master/blogs/PlutoCon/tax.png)
- [Sample GL ERD - Entity Relation Diagram](https://github.com/AmitXShukla/AmitXShukla.github.io/raw/master/blogs/PlutoCon/gl_erd.png)

A typical list of ERP modules looks like the diagram below.

![ERP Modules](https://github.com/AmitXShukla/AmitXShukla.github.io/raw/master/blogs/PlutoCon/ERP_modules.png)

A typical ERP business process flow looks like the diagram below.

![ERP Processes](https://github.com/AmitXShukla/P2P.ai/blob/main/docs/assets/images/ERD_logical.png?raw=true)

A typical GL Balance Sheet, Cash-flow or Income Statement looks like this:

[click here](https://s2.q4cdn.com/470004039/files/doc_financials/2020/q4/FY20_Q4_Consolidated_Financial_Statements.pdf)

In this notebook, I will do my best to cite examples from real-world data like the GL financial statement mentioned above.

---

# start with Julia 
It takes only about a minute to install a Julia environment on almost any machine.

Here is a [link to my tutorial](https://amit-shukla.medium.com/setup-local-machine-ipad-android-tablets-for-julia-lang-data-science-computing-823d84f2cb28), which discusses Julia installation on different machines (including remote machines and mobile tablets).


## adding Packages

In [1]:
using Pkg
Pkg.add("DataFrames")
Pkg.add("Dates")
using DataFrames, Dates
Pkg.status()

      Status `~/.julia/environments/v1.7/Project.toml`
  [54eefc05] Cascadia v1.0.1
  [324d7699] CategoricalArrays v0.10.5
  [a93c6f00] DataFrames v1.3.2
  [8f5d6c58] EzXML v1.1.0
  [708ec375] Gumbo v0.8.0
  [cd3eb016] HTTP v0.9.17
  [7073ff75] IJulia v1.23.2
  [c601a237] Interact v0.10.4
  [0f8b85d8] JSON3 v1.9.4
  [b9914132] JSONTables v1.0.3
  [4d0d745f] PDFIO v0.1.13
  [c3e4b0f8] Pluto v0.18.4
  [2dfb63ee] PooledArrays v1.4.0
  [88034a9c] StringDistances v0.11.2
  [a2db99b7] TextAnalysis v0.7.3
  [05625dda] WebDriver v0.1.2
  [0f1e0344] WebIO v0.8.17
  [ade2ca70] Dates


    Updating registry at `~/.julia/registries/General`
    Updating git-repo `https://github.com/JuliaRegistries/General.git`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.7/Project.toml`
  No Changes to `~/.julia/environments/v1.7/Manifest.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.7/Project.toml`
  No Changes to `~/.julia/environments/v1.7/Manifest.toml`


*For the rest of this blog, I will assume you have added all packages and imported them into the current namespace/notebook scope.*

## helper functions

In [2]:
# run one command at a time
repeat(["AMIT","SHUKLA"], inner=5) # repeat each element of a list a number of times
fill("34", 4) # create a vector holding the same value n times
range(1.0, stop=9.0, length=100) # generate n equally spaced values between start and stop
11000:1000:45000 # generate a range from start to finish with a set interval
collect(1:4) # collect function materializes all values of a range into a vector
rand([1,2,3,4]) # random value from a list of values
rand(11000:1000:45000) # random value from a range of values
randn() # random Float64 drawn from a standard normal distribution (+ or -)

# more helper string functions - replace etc... 
# fill these in from Pluto notebooks

0.1215354890767833

In [3]:
# run one command at a time
# basic dataframe is constructed by passing column vectors (think of adding one excel column at a time)
org = "Apple Inc" # is a simple string
_ap = [1,2] # this is a vector

using CategoricalArrays
ap = categorical(_ap) # this is a categorical vector
fy = categorical(repeat([2022], inner=2)) # this is a categorical vector

actuals, budget = (98.40, 100) # this is a tuple
amount = (actuals = 98.54, budget = 100) # this is a named tuple
df_Ledger = DataFrame(Entity=fill(org), FiscalYear=fy, AccountingPeriod = ap, Actuals = actuals, Budget = budget)
# fill(org) or org will produce same results

| Row | Entity | FiscalYear | AccountingPeriod | Actuals | Budget |
|-----|--------|------------|------------------|---------|--------|
|     | String | Cat… | Cat… | Float64 | Int64 |
| 1 | Apple Inc | 2022 | 1 | 98.4 | 100 |
| 2 | Apple Inc | 2022 | 2 | 98.4 | 100 |


In [4]:
# run one command at a time
# adding one row at a time, can be done, but is not very efficient
push!(df_Ledger, Dict(:Entity => "Google", :FiscalYear => 2022, 
        :AccountingPeriod => 1, :Actuals => 95.42, :Budget => 101))
push!(df_Ledger, Dict(:Entity => "Google", :FiscalYear => 2022, 
        :AccountingPeriod => 2, :Actuals => 91.42, :Budget => 99))

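| Row | Entity | FiscalYear | AccountingPeriod | Actuals | Budget |
|-----|--------|------------|------------------|---------|--------|
|     | String | Cat… | Cat… | Float64 | Int64 |
| 1 | Apple Inc | 2022 | 1 | 98.4 | 100 |
| 2 | Apple Inc | 2022 | 2 | 98.4 | 100 |
| 3 | Google | 2022 | 1 | 95.42 | 101 |
| 4 | Google | 2022 | 2 | 91.42 | 99 |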


## create DataFrame

A chart of accounts (an organized hierarchy of account groups in tree form) and Location/Department or Product based hierarchies allow businesses to group and report organization activities based on business processes.

These hierarchical groupings help capture the monetary and statistical values of an organization in financial statements.

---

To create a Finance data model and 
[Ledger Cash-flow or Balance Sheet like statements](https://s2.q4cdn.com/470004039/files/doc_financials/2020/q4/FY20_Q4_Consolidated_Financial_Statements.pdf),
we need the associated dimensions (chartfields such as the chart of accounts).

We will discuss how to load actual data from CSV or an RDBMS later. We will also learn how to group and create chartfield hierarchies later.

For now, let's start by creating fake account, department and location chartfields.

In [5]:
# create dummy data
accountsDF = DataFrame(
    ENTITY = "Apple Inc.",
    AS_OF_DATE=Date("1900-01-01", dateformat"y-m-d"),
    ID = 11000:1000:45000,
    CLASSIFICATION=repeat([
        "OPERATING_EXPENSES","NON-OPERATING_EXPENSES", "ASSETS","LIABILITIES","NET_WORTH","STATISTICS","REVENUE"
                ], inner=5),
    CATEGORY=[
        "Travel","Payroll","non-Payroll","Allowance","Cash",
        "Facility","Supply","Services","Investment","Misc.",
        "Depreciation","Gain","Service","Retired","Fault.",
        "Receipt","Accrual","Return","Credit","ROI",
        "Cash","Funds","Invest","Transfer","Roll-over",
        "FTE","Members","Non_Members","Temp","Contractors",
        "Sales","Merchant","Service","Consulting","Subscriptions"],
    STATUS="A",
    DESCR=repeat([
    "operating expenses","non-operating expenses","assets","liability","net-worth","stats","revenue"], inner=5),
    ACCOUNT_TYPE=repeat(["E","E","A","L","N","S","R"],inner=5));

show("Accounts DIM size is: "), show(size(accountsDF)), show("Accounts Dim sample: "), accountsDF[collect(1:5:35),:]

"Accounts DIM size is: "(35, 8)"Accounts Dim sample: "

(nothing, nothing, nothing, 7×8 DataFrame
 Row │ ENTITY      AS_OF_DATE  ID     CLASSIFICATION          CATEGORY      ST ⋯
     │ String      Date        Int64  String                  String        St ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Apple Inc.  1900-01-01  11000  OPERATING_EXPENSES      Travel        A  ⋯
   2 │ Apple Inc.  1900-01-01  16000  NON-OPERATING_EXPENSES  Facility      A
   3 │ Apple Inc.  1900-01-01  21000  ASSETS                  Depreciation  A
   4 │ Apple Inc.  1900-01-01  26000  LIABILITIES             Receipt       A
   5 │ Apple Inc.  1900-01-01  31000  NET_WORTH               Cash          A  ⋯
   6 │ Apple Inc.  1900-01-01  36000  STATISTICS              FTE           A
   7 │ Apple Inc.  1900-01-01  41000  REVENUE                 Sales         A

There is a lot to unpack in the Julia code above, and a lot of it is not best practice.

First, **what is a dataframe anyway?** Think of a Julia DataFrame as a tabular representation of data arranged in rows and columns. Unlike SQL, you should get into the habit of reading and writing one column at a time for faster performance (not because you can't read/write rows). Each column is an array, or list of values, referred to as a vector.

The Julia code above creates the accounts dataframe with columns named ENTITY, AS_OF_DATE, ID, CLASSIFICATION, CATEGORY, STATUS, DESCR and ACCOUNT_TYPE.

There are 35 rows, all with the same ENTITY and AS_OF_DATE, IDs running from 11000 to 45000 in increments of 1000, and all with STATUS = A (Active). There are 7 distinct classifications and descriptions, each repeated 5 times, and 6 distinct account types (E=Expense, A=Assets, L=Liability, N=Net worth, S=Stats, R=Revenue), because both expense classifications share type E.

For 35 rows, it's fine to store data like this, but now is a good time to learn about Categorical and Pooled Arrays, which matter when a dataframe has millions of rows.

In [6]:
using Pkg
Pkg.add("CategoricalArrays")
Pkg.add("PooledArrays")
Pkg.status()

      Status `~/.julia/environments/v1.7/Project.toml`
  [54eefc05] Cascadia v1.0.1
  [324d7699] CategoricalArrays v0.10.5
  [a93c6f00] DataFrames v1.3.2
  [8f5d6c58] EzXML v1.1.0
  [708ec375] Gumbo v0.8.0
  [cd3eb016] HTTP v0.9.17
  [7073ff75] IJulia v1.23.2
  [c601a237] Interact v0.10.4
  [0f8b85d8] JSON3 v1.9.4
  [b9914132] JSONTables v1.0.3
  [4d0d745f] PDFIO v0.1.13
  [c3e4b0f8] Pluto v0.18.4
  [2dfb63ee] PooledArrays v1.4.0
  [88034a9c] StringDistances v0.11.2
  [a2db99b7] TextAnalysis v0.7.3
  [05625dda] WebDriver v0.1.2
  [0f1e0344] WebIO v0.8.17
  [ade2ca70] Dates


   Resolving package versions...
  No Changes to `~/.julia/environments/v1.7/Project.toml`
  No Changes to `~/.julia/environments/v1.7/Manifest.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.7/Project.toml`
  No Changes to `~/.julia/environments/v1.7/Manifest.toml`


In [7]:
# here the CLASSIFICATION column vector stores 3500 values (only 7 distinct) in an array
CLASSIFICATION=repeat(["OPERATING_EXPENSES","NON-OPERATING_EXPENSES", "ASSETS","LIABILITIES","NET_WORTH","STATISTICS","REVENUE"
                ], inner=500)

using CategoricalArrays
cl = categorical(CLASSIFICATION)
levels(cl)

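using PooledArrays
# NOTE: this line calls categorical(...) again, so `pl` is a CategoricalArray and the
# "POOL" rows below simply repeat the "CAT" numbers;
# use PooledArray(CLASSIFICATION) to build an actual pooled array
pl = categorical(CLASSIFICATION)
levels(pl)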

# show values in tabular format
# run one command at a time
df = DataFrame(Dict("Descr" => "CLASSIFICATION...ARR...", "Value" => size(CLASSIFICATION)[1]))
push!(df,("CAT...ARR...",size(cl)[1]))
push!(df,("CAT...ARR..COMPRESS.",size(compress(cl))[1]))
push!(df,("POOL...ARR...",size(pl)[1]))
push!(df,("POOL...ARR..COMPRESS.",size(compress(pl))[1]))
push!(df,("CAT...LEVELs...",size(levels(cl))[1]))
push!(df,("POOL...LEVELs...",size(levels(pl))[1]))
push!(df,("CLASSIFICATION...MEMSIZE", Base.summarysize(CLASSIFICATION)))
push!(df,("CAT...ARR...MEMSIZE", Base.summarysize(cl)))
push!(df,("POOL...ARR...MEMSIZE", Base.summarysize(pl)))
push!(df,("CAT...ARR..COMPRESS...MEMSIZE", Base.summarysize(compress(cl))))
push!(df,("POOL...ARR..COMPRESS...MEMSIZE", Base.summarysize(compress(pl))))

| Row | Descr | Value |
|-----|-------|-------|
|     | String | Int64 |
| 1 | CLASSIFICATION...ARR... | 3500 |
| 2 | CAT...ARR... | 3500 |
| 3 | CAT...ARR..COMPRESS. | 3500 |
| 4 | POOL...ARR... | 3500 |
| 5 | POOL...ARR..COMPRESS. | 3500 |
| 6 | CAT...LEVELs... | 7 |
| 7 | POOL...LEVELs... | 7 |
| 8 | CLASSIFICATION...MEMSIZE | 28179 |
| 9 | CAT...ARR...MEMSIZE | 14739 |
| 10 | POOL...ARR...MEMSIZE | 14739 |


**Categorical and Pooled Arrays**, as the names suggest, are data structures for storing voluminous data efficiently, especially when a column in a data frame has a small number of distinct values (aka levels) repeated across the entire column vector.

As an example, a Finance Ledger may have millions of transactions, and every row has one of these seven account classifications. Storing the entire string in every row is not recommended. Instead, with a Categorical or PooledArray data type, memory use can be significantly reduced without losing any data quality (size(..) stays the same for the original, Categorical and PooledArray data types).

As you can see in the example above, the size of the categorical / pooled array matches the original column vector, while memory use drops significantly: Base.summarysize(...) is reduced by roughly 50% (28179 to 14739 bytes) and is further reduced by 85% if used with compress(...).

Using the Categorical Array type over PooledArray is recommended when there are fewer unique values and the user needs meaningful ordering and grouping. On the other hand, PooledArray is preferred when small memory usage is needed.
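
The comparison described above can be sketched as follows. This is a minimal illustration, assuming the CategoricalArrays and PooledArrays packages are installed; the names `raw`, `catarr` and `poolarr` are mine and do not appear in the notebook, and the byte counts will vary by Julia and package version.

```julia
using CategoricalArrays, PooledArrays

# 3500 values, only 7 distinct levels (same shape as the CLASSIFICATION example)
raw = repeat(["OPERATING_EXPENSES", "NON-OPERATING_EXPENSES", "ASSETS",
              "LIABILITIES", "NET_WORTH", "STATISTICS", "REVENUE"], inner=500)

catarr  = categorical(raw)   # CategoricalArray: levels can be ordered and re-leveled
poolarr = PooledArray(raw)   # PooledArray: plain values backed by a small lookup pool

# the element count is unchanged; only the storage differs
lengths  = (length(raw), length(catarr), length(poolarr))
n_levels = (length(levels(catarr)), length(unique(poolarr)))

# compare memory footprints in bytes (exact numbers depend on versions)
sizes = (raw         = Base.summarysize(raw),
         categorical = Base.summarysize(catarr),
         compressed  = Base.summarysize(compress(catarr)),
         pooled      = Base.summarysize(poolarr))
```

Inspecting `sizes` in the notebook shows how each representation trades memory for features; here `compress` is applied to the CategoricalArray only.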

## ways of creating dataframe

There are different ways of creating a dataframe, and the choice depends on how the user wants to see the data, which is almost always tabular anyway.

There are a few things to keep in mind when working with DataFrames.

- Think in terms of columns, and pay attention to column datatypes. For example, as mentioned earlier, using Categorical or PooledArray can significantly improve data analysis performance and save memory.
- Think about how you want to filter the dataset; in that case, the row index becomes very important.
- Pay attention to column names (names with spaces or special characters can be inconvenient). A DataFrame can store almost any string as a column name without breaking, but one should still follow standard naming conventions.
- Pay attention to how you read dataframe columns. Reading/mutating the original version can harm data quality and analysis, but unnecessarily making copies of data adds clutter to your temp space.
- DataFrames is often good enough to support any query and transformation, but if you need more, there are data query frameworks like DataFramesMeta.jl & Query.jl to support advanced use.



## creating data frame from CSV, JSON, ODBC/ORM, XML, PDF or web.
We will cover all IO topics and reading from different data sources in other tutorials. Please visit these related blogs:
    
[Web-scrapping, Web automation using Julia Language](https://amit-shukla.medium.com/web-scrapping-web-automation-using-julia-language-2c473db84fbc)

Working with ODBC, ORM, XML, JSON, PDF, TXT, CSV, XLS

Working with PDF documents, Image Scanner, OCR Reader

## reading data from dataframe

In [8]:
# run one command at a time
accountsDF.ENTITY # read all values from a column (no copy)
accountsDF[:,"ENTITY"] # read all values from a column by name (returns a copy)
accountsDF[:,:ENTITY] # same, using a column symbol (returns a copy)
accountsDF[!,:ENTITY] # read all values from a column without making a copy
# learn to make use of ! properly, it's faster, memory efficient and accesses data directly without making a copy
# but accidentally assigned values will overwrite original data
accountsDF[:,:ENTITY] # returns a vector
accountsDF[:,[:ENTITY]] # returns another DataFrame
accountsDF[:,[:ENTITY, :ID, :STATUS]] # returns another DataFrame with selected columns
accountsDF[:,1:3] # returns another DataFrame with the first three columns (ENTITY, AS_OF_DATE, ID)
accountsDF[:, All()] # same as [:,:]
accountsDF[:, Between(:ENTITY, :ID)]
accountsDF[:, Cols(x -> startswith(x, "ENTITY"))]
accountsDF[:, Cols(r"ENTITY", :)]
accountsDF[:, Cols(Not(r"ENTITY"), :)]
accountsDF[:, Cols(Not([:ENTITY, :ID, :STATUS]))] # returns another DataFrame without selective columns
size(accountsDF) # displays size of dataframe
nrow(accountsDF) # displays # of rows of dataframe
ncol(accountsDF) # displays # of columns of dataframe
names(accountsDF) # displays names of columns of dataframe
propertynames(accountsDF) # displays symbol names of columns of dataframe
accountsDF[:,[:ENTITY, :ID, :CLASSIFICATION]] # returns specific columns of DataFrame
first(accountsDF, 2) # print first 2 rows of dataframe
last(accountsDF, 2) # print last 2 rows of dataframe
# show(accountsDF[1:2, :], allcols=true) # show all columns between rows 1 & 2
# show(accountsDF[:, 1:2], allrows=true)# show all rows between columns 1 & 2
eltype.(eachcol(accountsDF)) # displays type of each column
describe(accountsDF) # describe column stats

| Row | variable | mean | min | median | max | nmissing | eltype |
|-----|----------|------|-----|--------|-----|----------|--------|
|     | Symbol | Union… | Any | Any | Any | Int64 | DataType |
| 1 | ENTITY | | Apple Inc. | | Apple Inc. | 0 | String |
| 2 | AS_OF_DATE | | 1900-01-01 | 1900-01-01 | 1900-01-01 | 0 | Date |
| 3 | ID | 28000.0 | 11000 | 28000.0 | 45000 | 0 | Int64 |
| 4 | CLASSIFICATION | | ASSETS | | STATISTICS | 0 | String |
| 5 | CATEGORY | | Accrual | | non-Payroll | 0 | String |
| 6 | STATUS | | A | | A | 0 | String |
| 7 | DESCR | | assets | | stats | 0 | String |
| 8 | ACCOUNT_TYPE | | A | | S | 0 | String |


## view, copy, deepcopy, subdataframe

Reading a dataframe seems intuitive at first, but be aware that ERP data science analysis often deals with billions of rows, and data reads can be very slow and expensive if performed poorly.

df[:,:] is not the same as df[!,:]: the first creates a copy, while the second reads directly from the memory location of the stored columns.

Julia offers built-in support for multithreaded, parallel and distributed computing, and learning these simple concepts adds up to even bigger savings and faster performance.

Here is an awesome explanation I found on discourse.julialang.org.

In [78]:
a = [[1,2,3],[4,5,6]]
b = copy(a)
c = deepcopy(a)
a[1][1] = 11
@show a, b, c

(a, b, c) = ([[11, 2, 3], [4, 5, 6]], [[11, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]])


([[11, 2, 3], [4, 5, 6]], [[11, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]])
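
The same copy-versus-reference distinction applies to DataFrame columns. The following is a small sketch, assuming only the DataFrames package; the three-row `df` and the variable names are made up for illustration.

```julia
using DataFrames

df = DataFrame(ID = [11000, 12000, 13000], STATUS = ["A", "A", "A"])

copied = df[:, :STATUS]    # `:` row selector returns a fresh copy of the column
direct = df[!, :STATUS]    # `!` returns the stored column itself, no copy
viewed = @view df[1:2, :]  # a SubDataFrame that references the parent's rows

copied[1] = "I"            # changes only the copy
direct[2] = "I"            # changes df itself
viewed.STATUS[1] = "X"     # writes through the view into df

df.STATUS                  # the first two entries changed, the copy did not leak back
```

Views avoid allocating new columns, which matters on large ledgers, but every write through `!` or a view lands in the original data.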

--- 
## complete Finance Data Model
Now that we have a handle on dataframe basics, let's create the other chartfields/dimensions and build a complete Ledger DataFrame.

In [9]:
# we already have created accounts dimension (35 rows, 8 columns) above
size(accountsDF)

(35, 8)

In [10]:
# DEPARTMENT Chartfield
deptDF = DataFrame(
    AS_OF_DATE=Date("2000-01-01", dateformat"y-m-d"), 
    ID = 1100:100:1500,
    CLASSIFICATION=["SALES","HR", "IT","BUSINESS","OTHERS"],
    CATEGORY=["sales","human_resource","IT_Staff","business","others"],
    STATUS="A",
    DESCR=[
    "Sales & Marketing","Human Resource","Information Technology","Business leaders","other temp"
        ],
    DEPT_TYPE=["S","H","I","B","O"]);
size(deptDF),deptDF[collect(1:5),:]

((5, 7), 5×7 DataFrame
 Row │ AS_OF_DATE  ID     CLASSIFICATION  CATEGORY        STATUS  DESCR        ⋯
     │ Date        Int64  String          String          String  String       ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ 2000-01-01   1100  SALES           sales           A       Sales & Mark ⋯
   2 │ 2000-01-01   1200  HR              human_resource  A       Human Resour
   3 │ 2000-01-01   1300  IT              IT_Staff        A       Information
   4 │ 2000-01-01   1400  BUSINESS        business        A       Business lea
   5 │ 2000-01-01   1500  OTHERS          others          A       other temp   ⋯
                                                               2 columns omitted)

In [11]:
# LOCATION Chartfield
locationDF = DataFrame(
    AS_OF_DATE=Date("2000-01-01", dateformat"y-m-d"), 
    ID = 11:1:22,
    CLASSIFICATION=repeat([
        "Region A","Region B", "Region C"], inner=4),
    CATEGORY=repeat([
        "Region A","Region B", "Region C"], inner=4),
    STATUS="A",
    DESCR=[
"Boston","New York","Philadelphia","Cleveland","Richmond",
"Atlanta","Chicago","St. Louis","Minneapolis","Kansas City",
"Dallas","San Francisco"],
    LOC_TYPE="Physical");
locationDF[:,:]

| Row | AS_OF_DATE | ID | CLASSIFICATION | CATEGORY | STATUS | DESCR | LOC_TYPE |
|-----|------------|----|----------------|----------|--------|-------|----------|
|     | Date | Int64 | String | String | String | String | String |
| 1 | 2000-01-01 | 11 | Region A | Region A | A | Boston | Physical |
| 2 | 2000-01-01 | 12 | Region A | Region A | A | New York | Physical |
| 3 | 2000-01-01 | 13 | Region A | Region A | A | Philadelphia | Physical |
| 4 | 2000-01-01 | 14 | Region A | Region A | A | Cleveland | Physical |
| 5 | 2000-01-01 | 15 | Region B | Region B | A | Richmond | Physical |
| 6 | 2000-01-01 | 16 | Region B | Region B | A | Atlanta | Physical |
| 7 | 2000-01-01 | 17 | Region B | Region B | A | Chicago | Physical |
| 8 | 2000-01-01 | 18 | Region B | Region B | A | St. Louis | Physical |
| 9 | 2000-01-01 | 19 | Region C | Region C | A | Minneapolis | Physical |
| 10 | 2000-01-01 | 20 | Region C | Region C | A | Kansas City | Physical |


In [12]:
# creating Ledger
ledgerDF = DataFrame(
            LEDGER = String[], FISCAL_YEAR = Int[], PERIOD = Int[], ORGID = String[],
            OPER_UNIT = String[], ACCOUNT = Int[], DEPT = Int[], LOCATION = Int[],
            POSTED_TOTAL = Float64[]
            );

# create 2020 Period 1-12 Actuals Ledger 
l = "Actuals";
fy = 2020;
for p = 1:12
    for i = 1:10^5
        push!(ledgerDF, (l, fy, p, "ABC Inc.", rand(locationDF.CATEGORY),
            rand(accountsDF.ID), rand(deptDF.ID), rand(locationDF.ID), rand()*10^8))
    end
end

# create 2021 Period 1-4 Actuals Ledger 
l = "Actuals";
fy = 2021;
for p = 1:4
    for i = 1:10^5
        push!(ledgerDF, (l, fy, p, "ABC Inc.", rand(locationDF.CATEGORY),
            rand(accountsDF.ID), rand(deptDF.ID), rand(locationDF.ID), rand()*10^8))
    end
end

# create 2021 Period 1-12 Budget Ledger 
l = "Budget";
fy = 2021;
for p = 1:12
    for i = 1:10^5
        push!(ledgerDF, (l, fy, p, "ABC Inc.", rand(locationDF.CATEGORY),
            rand(accountsDF.ID), rand(deptDF.ID), rand(locationDF.ID), rand()*10^8))
    end
end

# here is a ~2.8 million row ledger dataframe
size(ledgerDF)

(2800000, 9)
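
As noted earlier, adding one row at a time with `push!` works but is not very efficient. One alternative is to build whole column vectors and stack the pieces with `vcat`. This is only a sketch: `make_ledger` is a hypothetical helper, the ID ranges stand in for the dimension tables so the block is self-contained, and `n` is kept small (the notebook uses 10^5 rows per period).

```julia
using DataFrames

# stand-ins for the dimension IDs used above (assumed here for self-containment)
account_ids  = 11000:1000:45000
dept_ids     = 1100:100:1500
location_ids = 11:22
regions      = ["Region A", "Region B", "Region C"]

# build `n` random rows per period as whole column vectors instead of push!-ing rows
function make_ledger(ledger, fy, periods; n = 1_000)
    total = n * length(periods)
    DataFrame(
        LEDGER       = fill(ledger, total),
        FISCAL_YEAR  = fill(fy, total),
        PERIOD       = repeat(collect(periods), inner = n),
        ORGID        = fill("ABC Inc.", total),
        OPER_UNIT    = rand(regions, total),
        ACCOUNT      = rand(account_ids, total),
        DEPT         = rand(dept_ids, total),
        LOCATION     = rand(location_ids, total),
        POSTED_TOTAL = rand(total) .* 10^8,
    )
end

# vcat stacks the three ledgers into one DataFrame
small_ledger = vcat(
    make_ledger("Actuals", 2020, 1:12),
    make_ledger("Actuals", 2021, 1:4),
    make_ledger("Budget", 2021, 1:12),
)
size(small_ledger)
```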

## using joins

In [13]:
# rename dimensions columns for innerjoin
df_accounts = rename(accountsDF, :ID => :ACCOUNTS_ID, :CLASSIFICATION => :ACCOUNTS_CLASSIFICATION, 
    :CATEGORY => :ACCOUNTS_CATEGORY, :DESCR => :ACCOUNTS_DESCR);
df_dept = rename(deptDF, :ID => :DEPT_ID, :CLASSIFICATION => :DEPT_CLASSIFICATION, 
    :CATEGORY => :DEPT_CATEGORY, :DESCR => :DEPT_DESCR);
df_location = rename(locationDF, :ID => :LOCATION_ID, :CLASSIFICATION => :LOCATION_CLASSIFICATION,
    :CATEGORY => :LOCATION_CATEGORY, :DESCR => :LOCATION_DESCR);

# join Ledger accounts chartfield with accounts chartfield dataframe to pull all accounts fields
# join Ledger dept chartfield with dept chartfield dataframe to pull all dept fields
# join Ledger location chartfield with location chartfield dataframe to pull all location fields
df_ledger = innerjoin(
                innerjoin(
                    innerjoin(ledgerDF, df_accounts, on = [:ACCOUNT => :ACCOUNTS_ID], makeunique=true),
                    df_dept, on = [:DEPT => :DEPT_ID], makeunique=true), df_location,
                on = [:LOCATION => :LOCATION_ID], makeunique=true);

# note, how ledger DF has 28 columns now (inclusive of all chartfields join)
size(df_accounts),size(df_dept),size(df_location), size(ledgerDF), size(df_ledger)

((35, 8), (5, 7), (12, 7), (2800000, 9), (2800000, 28))
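
In the join above, the row count stays at 2,800,000 because every ledger key exists in its dimension. With real data that is not guaranteed, and `innerjoin` silently drops unmatched rows. A quick way to check is `antijoin`, sketched here on a tiny made-up ledger (the account 99999 is deliberately missing from the dimension).

```julia
using DataFrames

ledger   = DataFrame(ACCOUNT = [11000, 12000, 99999], POSTED_TOTAL = [10.0, 20.0, 30.0])
accounts = DataFrame(ACCOUNTS_ID = [11000, 12000], ACCOUNTS_DESCR = ["Travel", "Payroll"])

# innerjoin keeps only ledger rows whose ACCOUNT exists in the dimension
joined = innerjoin(ledger, accounts, on = :ACCOUNT => :ACCOUNTS_ID)

# antijoin lists the ledger rows that found no match, i.e. what innerjoin dropped
orphans = antijoin(ledger, accounts, on = :ACCOUNT => :ACCOUNTS_ID)

nrow(joined), nrow(orphans)
```

If `orphans` is not empty, `leftjoin` may be the safer choice, since it keeps those rows and fills the dimension columns with `missing`.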

## data transformation
Often, a user needs to add, update or transform an existing column in a dataset.

Use the select/select! and transform/transform! functions to select, rename and transform columns in a data frame.

The transform and transform! functions work identically to select and select!, with the only difference that they retain all columns that are present in the source data frame.
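
The difference between the two families is easiest to see side by side. A minimal sketch on a made-up three-row frame; the column name `TOTAL_MM` is illustrative only.

```julia
using DataFrames

df = DataFrame(PERIOD = [1, 5, 12], POSTED_TOTAL = [1.5e6, 2.0e6, 3.25e6])

# select returns only the columns named in the call
picked = select(df, :PERIOD, :POSTED_TOTAL => ByRow(x -> x / 10^6) => :TOTAL_MM)

# transform keeps every source column and appends the new one
kept = transform(df, :POSTED_TOTAL => ByRow(x -> x / 10^6) => :TOTAL_MM)

names(picked), names(kept)
```

Neither call modifies `df`; the bang versions (`select!`, `transform!`) update it in place.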

In [14]:
# run one command at a time
names(df_ledger) # displays all 28 columns available in dataframes
unique(df_ledger.PERIOD) # displays unique values of accounting periods

# in this example, the user wants to add a new field "QTR" which displays the quarter # (1-4) based on month/period
# we can use select, but because we want to retain all existing columns
# let's use the transform function instead to create this column
# also, we will use transform! (transform bang) to update the original ledger DF

# let's first create a function, which takes month/period and shows Qtr #
function periodToQtr(x)
    if x ∈ 1:3
        return 1
    elseif x ∈ 4:6
        return 2
    elseif x ∈ 7:9
        return 3
    else return 4
    end
end

# now we will use this function to transform a new column
transform!(df_ledger, :PERIOD => ByRow(periodToQtr) => :QTR)

# let's create one more generic function, which converts a number to USD currency
function numToCurrency(x)
        return string("USD ",round(x/10^6; digits = 2), " million")
end

transform!(df_ledger, :POSTED_TOTAL => ByRow(numToCurrency) => :TOTAL)
df_ledger[1:5,["POSTED_TOTAL","TOTAL"]]
"df_ledger_size after transformation is: ", size(df_ledger)

("df_ledger_size after transformation is: ", (2800000, 30))
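
For periods 1 to 12, the period-to-quarter mapping can also be written as a ceiling division. This is an alternative sketch, not the notebook's function: `period_to_qtr` is a name I am introducing, and unlike `periodToQtr` it does not fall back to 4 for values outside 1 to 12.

```julia
# map an accounting period (1-12) to its quarter with ceiling division
period_to_qtr(p::Integer) = cld(p, 3)

# broadcast over all twelve periods
quarters = period_to_qtr.(1:12)
```

It plugs into the same call shape, e.g. `ByRow(period_to_qtr)` in place of `ByRow(periodToQtr)`.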

## using split-apply-combine for data grouping
[Ref](https://dataframes.juliadata.org/stable/man/split_apply_combine/) The DataFrames package supports the split-apply-combine strategy through the groupby function, which creates a GroupedDataFrame, followed by combine, select/select! or transform/transform!. The strategy consists of:

- splitting a data set into groups,
- applying some functions to each of the groups,
- combining the results.

In [25]:
# group by all Chartfields
gdf = groupby(df_ledger, [:LEDGER, :FISCAL_YEAR, :QTR, :OPER_UNIT, :ACCOUNTS_CLASSIFICATION, :DEPT_CLASSIFICATION, 
    # :LOCATION_CLASSIFICATION,
    :LOCATION_DESCR])
# group by Ledger, FY, Quarter & Operating Unit only
# gdf = groupby(df_ledger, [:LEDGER, :FISCAL_YEAR, :QTR, :OPER_UNIT]) # create a GroupedDataFrame
gdf_plot = combine(gdf, :POSTED_TOTAL => sum => :TOTAL)
gdf_plot = transform(gdf_plot, :TOTAL => ByRow(numToCurrency) => :TOTAL)
gdf_plot[1:5,:]

| Row | LEDGER | FISCAL_YEAR | QTR | OPER_UNIT | ACCOUNTS_CLASSIFICATION | DEPT_CLASSIFICATION |
|-----|--------|-------------|-----|-----------|-------------------------|---------------------|
|     | String | Int64 | Int64 | String | String | String |
| 1 | Actuals | 2020 | 1 | Region A | STATISTICS | HR |
| 2 | Actuals | 2020 | 1 | Region B | NON-OPERATING_EXPENSES | BUSINESS |
| 3 | Actuals | 2020 | 1 | Region C | OPERATING_EXPENSES | IT |
| 4 | Actuals | 2020 | 1 | Region A | NET_WORTH | HR |
| 5 | Actuals | 2020 | 1 | Region C | NET_WORTH | OTHERS |
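
To see the split, apply and combine steps on something small enough to check by hand, here is a sketch on a made-up four-row ledger (the data and the name `totals` are illustrative, not from the notebook).

```julia
using DataFrames

df = DataFrame(
    LEDGER       = ["Actuals", "Actuals", "Budget", "Actuals"],
    QTR          = [1, 1, 1, 2],
    POSTED_TOTAL = [100.0, 50.0, 75.0, 20.0],
)

# split by LEDGER and QTR, apply sum and a row count to each group,
# then combine the per-group results into one table
totals = combine(groupby(df, [:LEDGER, :QTR]),
                 :POSTED_TOTAL => sum => :TOTAL,
                 nrow => :ROWS)

# sort so the group order is predictable before reading the result
totals = sort(totals, [:LEDGER, :QTR])
```

The two Actuals rows for quarter 1 collapse into a single row, while the other two groups keep one row each.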


## GL BalanceSheet, IncomeStatement & CashFlow

### Balance Sheet (Interactive)

In [16]:
Pkg.add("Interact")
Pkg.add("WebIO")
using Interact
using WebIO
Pkg.build("WebIO")

   Resolving package versions...
  No Changes to `~/.julia/environments/v1.7/Project.toml`
  No Changes to `~/.julia/environments/v1.7/Manifest.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.7/Project.toml`
  No Changes to `~/.julia/environments/v1.7/Manifest.toml`
    Building WebIO → `~/.julia/scratchspaces/44cfe95a-1eb2-52ea-b672-e2afdf69b78f/c9529be473e97fa0b3b2642cdafcd0896b4c9494/build.log`


In [70]:
@manipulate for ld = Dict("Actuals"=> "Actuals", "Budget" => "Budget"), 
                rg = Dict("Region A"=> "Region A", "Region B" => "Region B", "Region C" => "Region C"),
                yr = slider(2020:1:2022; value=2021),
                qtr = 1:1:4
    
    @show ld, rg, yr, qtr
    
select(gdf_plot[(
    (gdf_plot.FISCAL_YEAR .== yr)
    .&
    (gdf_plot.QTR .== qtr)
    .&
    (gdf_plot.LEDGER .== ld)
    .&
    (gdf_plot.OPER_UNIT .== rg)
    ),:],
        :OPER_UNIT => :Org,
        :FISCAL_YEAR => :FY,
        :QTR => :Qtr,
        :ACCOUNTS_CLASSIFICATION => :Accounts,
        :DEPT_CLASSIFICATION => :Dept,
        # :LOCATION_CLASSIFICATION => :Region,
        :LOCATION_DESCR => :Loc,
        :TOTAL => :TOTAL)
end

(ld, rg, yr, qtr) = ("Actuals", "Region B", 2021, 2)


### Income Statement (Interactive)

In [74]:
@manipulate for ld = Dict("Actuals"=> "Actuals", "Budget" => "Budget"), 
                rg = Dict("Region A"=> "Region A", "Region B" => "Region B", "Region C" => "Region C"),
                yr = slider(2020:1:2022; value=2021),
                qtr = 1:1:4
    
    @show ld, rg, yr, qtr
    
select(gdf_plot[(
    (gdf_plot.FISCAL_YEAR .== yr)
    .&
    (gdf_plot.QTR .== qtr)
    .&
    (gdf_plot.LEDGER .== ld)
    .&
    (gdf_plot.OPER_UNIT .== rg)
    .&
    (in.(gdf_plot.ACCOUNTS_CLASSIFICATION, Ref(["ASSETS", "LIABILITIES", "REVENUE","NET_WORTH"])))
    ),:],
        :OPER_UNIT => :Org,
        :FISCAL_YEAR => :FY,
        :QTR => :Qtr,
        :ACCOUNTS_CLASSIFICATION => :Accounts,
        :DEPT_CLASSIFICATION => :Dept,
        # :LOCATION_CLASSIFICATION => :Region,
        :LOCATION_DESCR => :Loc,
        :TOTAL => :TOTAL)
end

(ld, rg, yr, qtr) = ("Actuals", "Region B", 2021, 2)


### Cash Flow Statement (Interactive)

In [75]:
@manipulate for ld = Dict("Actuals"=> "Actuals", "Budget" => "Budget"), 
                rg = Dict("Region A"=> "Region A", "Region B" => "Region B", "Region C" => "Region C"),
                yr = slider(2020:1:2022; value=2021),
                qtr = 1:1:4
    
    @show ld, rg, yr, qtr
    
select(gdf_plot[(
    (gdf_plot.FISCAL_YEAR .== yr)
    .&
    (gdf_plot.QTR .== qtr)
    .&
    (gdf_plot.LEDGER .== ld)
    .&
    (gdf_plot.OPER_UNIT .== rg)
    .&
    (in.(gdf_plot.ACCOUNTS_CLASSIFICATION, Ref(["NON-OPERATING_EXPENSES","OPERATING_EXPENSES"])))
    ),:],
        :OPER_UNIT => :Org,
        :FISCAL_YEAR => :FY,
        :QTR => :Qtr,
        :ACCOUNTS_CLASSIFICATION => :Accounts,
        :DEPT_CLASSIFICATION => :Dept,
        # :LOCATION_CLASSIFICATION => :Region,
        :LOCATION_DESCR => :Loc,
        :TOTAL => :TOTAL)
end

(ld, rg, yr, qtr) = ("Actuals", "Region B", 2021, 2)


click here to continue - Part 2 - General Ledger Data Analysis & Visualization