## Effortlessly Updating Mass Columns in Multiple Files with Custom Functions using Pandas, PyArrow and Polars
---

This tutorial walks you through writing code that mass-updates columns across CSV and other data files.

This first notebook focuses on simple calculations; later notebooks in the Automation series will gradually introduce more complex RPA projects that rely on similar code snippets for mass updates. Coding a basic calculator here gives you a foundation for those more advanced techniques.

Throughout the notebook, you'll define functions for simple arithmetic calculations and pass arithmetic or algebraic operations as parameters. These concepts are then applied to mass calculations across multiple files and directories.

Keep this notebook handy as a reference. In it, you will:

- define functions for simple arithmetic calculations
- define functions that take arithmetic/algebraic operations as parameters
- perform mass calculations across files and directories

In [None]:
def add(x, y):
    return x + y

def subtract(x, y):
    return x - y

def multiply(x, y):
    return x * y

def divide(x, y):
    return x / y

print("Select operation.")
print("1. Add")
print("2. Subtract")
print("3. Multiply")
print("4. Divide")

while True:
    choice = input("Enter choice (1/2/3/4): ")

    # NOTE: this is ChatGPT-generated code
    # TODO: refactor the if/elif chain below; a match statement (Python 3.10+)
    # would be the preferred way to write it
    if choice in ('1', '2', '3', '4'):
        num1 = float(input("Enter first number: "))
        num2 = float(input("Enter second number: "))

        if choice == '1':
            print(num1, "+", num2, "=", add(num1, num2))

        elif choice == '2':
            print(num1, "-", num2, "=", subtract(num1, num2))

        elif choice == '3':
            print(num1, "*", num2, "=", multiply(num1, num2))

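        elif choice == '4':
            # guard against division by zero before calling divide()
            if num2 == 0:
                print("Cannot divide by zero")
            else:
                print(num1, "/", num2, "=", divide(num1, num2))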
        break
    else:
        print("Invalid Input")
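
The introduction also promises functions that take an operation as a parameter. A minimal sketch of that idea is shown below, assuming the standard-library `operator` module stands in for the four functions above. The names `OPERATIONS`, `calculate`, and `run_choice` are hypothetical helpers introduced only for illustration.

```python
import operator

# Map each menu choice to a (symbol, function) pair so that the operation
# itself can be handed around as a parameter.
OPERATIONS = {
    "1": ("+", operator.add),
    "2": ("-", operator.sub),
    "3": ("*", operator.mul),
    "4": ("/", operator.truediv),
}


def calculate(operation, x, y):
    """Apply any two-argument callable to x and y."""
    return operation(x, y)


def run_choice(choice, x, y):
    """Look up the menu choice and return a printable result line."""
    if choice not in OPERATIONS:
        return "Invalid Input"
    symbol, func = OPERATIONS[choice]
    if func is operator.truediv and y == 0:
        return "Cannot divide by zero"
    return f"{x} {symbol} {y} = {calculate(func, x, y)}"


print(run_choice("1", 2.0, 3.0))  # 2.0 + 3.0 = 5.0
print(run_choice("4", 1.0, 0.0))  # Cannot divide by zero
```

Because the operation is just a value in a dictionary, adding a new one means adding one entry rather than another `elif` branch.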

## Mass calculations

Let's consider a real-life scenario where you have a dataset that contains three distinct types of deposits.

1. Simple-interest certificate, fixed deposit, or buddy deposit
    
    In this system you borrow money, for example from a friend, and agree to repay it after a specific duration, either with or without a fixed interest rate. The interest is simple, that is, it is not compounded.

```math
    BD = P \left(1 + \frac{r\,T}{n}\right)
```
2. Compound interest
```math
    CD = P \left(1 + \frac{r}{n}\right)^{nT}
```
3. Mutual Fund deposit
```math
    \text{MFDeposit} = \text{GOK} \iff \text{god only knows} \sim \text{some known formula}
```

```text
    BD = Buddy Deposit (total at maturity)
    CD = Certificate of Deposit (total at maturity)
    P  = Principal amount
    R  = Rate of Interest (percent per year)
    r  = R/100
    T  = Time in years
    n  = compounding periods per year (365 = daily, 12 = monthly, 4 = quarterly, 1 = yearly)
```
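
As a quick check of the two formulas, here is a small worked example. It is a sketch under stated assumptions: the inputs (P = 100000, R = 2.875, 60 months) are illustrative, time is entered in months and converted to years, and `simple_total` and `compound_total` are hypothetical helper names.

```python
def simple_total(P, R, months, n=1):
    """BD = P * (1 + r*T/n), with r = R/100 and T in years."""
    r = R / 100
    T = months / 12
    return P * (1 + r * T / n)


def compound_total(P, R, months, n):
    """CD = P * (1 + r/n) ** (n*T), with r = R/100 and T in years."""
    r = R / 100
    T = months / 12
    return P * (1 + r / n) ** (n * T)


P, R, months = 100_000, 2.875, 60

bd = simple_total(P, R, months)
cd_yearly = compound_total(P, R, months, n=1)

print("simple total:  ", round(bd, 2))
print("compound total:", round(cd_yearly, 2))
print("extra interest from compounding:", round(cd_yearly - bd, 2))
```

Compounding yearly earns slightly more than simple interest over the same five years, and a larger `n` widens that gap a little further.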

In [12]:
import pandas as pd
df = pd.read_csv("../SampleData/sampleData.csv")
df.head(10)

Unnamed: 0,deposit,amount,ROI,time,rate,compound,Interest,Total
0,buddy,100000.0,2.875,60.0,simple,1.0,14375.0,114375.0
1,CD,100000.0,2.875,60.0,daily,365.0,15458.888733,115458.888733
2,CD,100000.0,2.875,60.0,monthly,12.0,15439.693571,115439.693571
3,CD,100000.0,2.875,60.0,qtr,4.0,15400.195311,115400.195311
4,CD,100000.0,2.875,60.0,annual,1.0,15225.669739,115225.669739
5,MF-1,100000.0,0.0,60.0,Group A,1.0,59000.0,159000.0
6,MF-2,100000.0,0.0,60.0,Group B,1.0,19000.0,119000.0
7,MF-3,100000.0,0.0,60.0,Group D,1.0,-5000.0,95000.0
8,MF-4,100000.0,0.0,60.0,Group A,1.0,37000.0,137000.0
9,MF-5,100000.0,0.0,60.0,Group A,1.0,51000.0,151000.0


Imagine you work at a bank and are tasked with recalculating or verifying the calculations in the spreadsheet above.

The task may look straightforward at first, but it becomes very challenging once you have to carry it out for:

- a total of 350 banks
- each with at least 1 million daily transactions
- on each of the 365 days of the year
- for the past 5 years
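
To put a number on that scale, the figures above can simply be multiplied out. This is a back-of-the-envelope sketch that assumes exactly 1 million transactions per bank per day (the stated minimum) and one row per transaction.

```python
banks = 350
transactions_per_day = 1_000_000  # stated minimum per bank
days_per_year = 365
years = 5

total_rows = banks * transactions_per_day * days_per_year * years
print(f"{total_rows:,} rows")  # 638,750,000,000 rows
```

That is well over 600 billion rows, which is why per-row Python overhead matters so much in the comparisons that follow.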

Here are a few options:
- using pandas to update columns with calculations using functions
- using pyarrow to update columns with calculations using functions
- using polars to update columns with calculations using functions
- using polars to transform columns inline (without calling functions)

## Mass calculations, step by step
- Step 1: OOP - classes, data structures, custom transformation methods
- Step 2: data transformation using Pandas
- Step 3: data transformation using PyArrow
- Step 4: data transformation using Polars

In [13]:
# let's first set up an OOP Deposit class
import random


class Deposit:

    def __init__(self, P, n, r, t):
        self.principalAmount = P
        self.rate = r / 100   # annual rate as a fraction (R / 100)
        self.compound = n     # compounding periods per year; n = 1 for simple interest
        self.time = t / 12    # t is given in months; convert to years

    def getSampleBDeposit(self):
        # buddy | fixed | certificate deposit (simple interest)
        # returns a tuple of (interest, total)
        total = self.principalAmount * (1 + (self.rate * self.time) / self.compound)
        return round(total - self.principalAmount, ndigits=2), round(total, ndigits=2)

    def getSampleCDeposit(self):
        # compound-interest deposit
        # returns a tuple of (interest, total)
        total = self.principalAmount * (1 + self.rate / self.compound) ** (self.compound * self.time)
        return round(total - self.principalAmount, ndigits=2), round(total, ndigits=2)

    def getSampleMFDeposit(self):
        # mutual fund deposit: no formula, so a random return stands in
        # returns a tuple of (interest, total)
        x = random.random()
        total = self.principalAmount * (1 + x)
        return round(total - self.principalAmount, ndigits=2), round(total, ndigits=2)


def getInterestMethod(deposit='CD', amount=10000, compound=1.0, rate=2.875, t=60):
    # pick the calculation that matches the deposit type
    d = Deposit(amount, compound, rate, t)
    if deposit == "CD":
        return d.getSampleCDeposit()
    elif deposit == "buddy":
        return d.getSampleBDeposit()
    else:
        return d.getSampleMFDeposit()

In [15]:
getInterestMethod("CD",50000,1,2.875,600)

(156282.51, 206282.51)

## using pandas to update columns with calculation using functions

---

In [17]:
# each row gets an (interest, total) tuple, stored in a single object column
df['interest_new'] = df.apply(
    lambda row: getInterestMethod(row.deposit, row.amount, row.compound, row.ROI, row.time),
    axis=1,
)
# df.info()
df.head()

(159, 9)
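
Since the function returns an `(interest, total)` tuple, it can be convenient to spread the result over two columns instead of one. The sketch below uses `result_type="expand"` on a tiny, made-up DataFrame. `interest_and_total` is a hypothetical stand-in for `getInterestMethod` that only handles simple interest, and the column names `interest_new` and `total_new` are assumptions.

```python
import pandas as pd


def interest_and_total(row):
    """Hypothetical stand-in for getInterestMethod: simple interest only."""
    total = row["amount"] * (1 + (row["ROI"] / 100) * (row["time"] / 12) / row["compound"])
    return round(total - row["amount"], 2), round(total, 2)


# made-up demo rows; time is in months
demo = pd.DataFrame({
    "amount": [100000.0, 50000.0],
    "ROI": [2.875, 4.0],
    "time": [60.0, 12.0],
    "compound": [1.0, 1.0],
})

# result_type="expand" turns each returned tuple into separate columns
demo[["interest_new", "total_new"]] = demo.apply(
    interest_and_total, axis=1, result_type="expand"
)
print(demo)
```

Keeping interest and total in separate numeric columns makes later comparisons against the stored `Interest` and `Total` columns straightforward.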

In [19]:
import timeit

perfStats = []
# timeit returns the total time in seconds for all `number` runs (10 here)
perfStats.append(("Pandas Assignment", timeit.timeit(lambda: df.apply(lambda row: getInterestMethod(row.deposit, row.amount, row.compound, row.ROI, row.time), axis=1), number=10)))
perfStats

[('Pandas Assignment', 0.04505509999580681)]

In [20]:
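# NOTE: despite the label, this cell times whatever is currently in `df`;
# it only measures 1 billion rows if `df` has been loaded with that many rows first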
perfStats.append(("Pandas Assignment = 1 Billion rows", 
                  timeit.timeit(lambda: df.apply 
                                (lambda row: getInterestMethod(row.deposit,row.amount,row.compound,row.ROI,row.time), 
                                 axis=1), 
                                number=10)))
perfStats

[('Pandas Assignment', 0.04505509999580681),
 ('Pandas Assignment = 1 Billion rows', 0.04852850001771003)]

In [None]:
df.info()

## using pyarrow to update columns with calculation using functions

---
Arrow supports logical compute operations over inputs of possibly different data types. Here, instead of the built-in pyarrow compute functions, we will register a user-defined function (UDF).

`Note that this API is currently experimental.`

`Note that Arrow tables (and arrays) are immutable: they cannot be modified once they are created.`

You therefore cannot update a table in place. To change it, you create a copy of the data that includes the modification.

Keep in mind as well that Arrow's built-in operations for modifying strings are fairly restricted in scope.

In [None]:
# pip install pyarrow
import pyarrow as pa
df.info()
df.dtypes
df["amount"] = df["amount"].astype('float32[pyarrow]')

In [33]:
#####################################
## using pyarrow compute function
#####################################

import pyarrow as pa
from pyarrow import csv
import pyarrow.compute as pc

dfPATable = csv.read_csv("../SampleData/sampleData.csv")
# print(type(dfPATable))

#####################################
## using pyarrow standard compute
#####################################
# validate column arrays element by element;
# we will use this check to compare recalculated values with the stored ones
a = dfPATable["Interest"]
# b = dfPATable["Interest_newlyCalculated"]
b = dfPATable["Total"]  # placeholder until the recalculated column exists
pc.equal(a, b)

#####################################
## using pyarrow grouped aggregations
#####################################
# note down stats before transformation
dfPATable.group_by("deposit").aggregate([("Total", "sum")])

#####################################
## using pyarrow join
#####################################
# we are working with only one table here,
# join is not applicable
#####################################
# table1 = pa.table({'id': [1, 2, 3],
#                    'year': [2020, 2022, 2019]})
# table2 = pa.table({'id': [3, 4],
#                    'n_legs': [5, 100],
#                    'animal': ["Brittle stars", "Centipede"]})
# joined_table = table1.join(table2, keys="id")

#####################################
## using pyarrow filter
#####################################
dfPATable.filter((pc.field("deposit") == "CD") & (pc.field("ROI") > 1.5))

######################################
## using pyarrow user defined function
## this API is experimental
######################################
function_name = "getInterestCalPolar"
function_docs = {
      "summary": "Calculates the Interest rate",
      "description":
         "Given Deposit type, Principal, rate, compound interest, time\n"
         "calculate Interest accrued amount."
}

input_types = {
   "deposit" : pa.string(),
   "amount" : pa.float64(),
   "compound" : pa.float64(),
   "ROI" : pa.float64(),
   "time" : pa.float64()
}

output_type = pa.float64()

def getInterestCalPolar(ctx, deposit, amount, compound, ROI, time):
   # Each argument is assumed to arrive as a pyarrow Array holding one value
   # per row, as happens when the function is called from a dataset scan.
   # This is not the preferred approach, because calling Python object/class
   # methods row by row slows down the data transformation and does not truly
   # use Arrow compute functions. Prefer rewriting the formula here with
   # pyarrow.compute functions instead.
   interest = [
      getInterestMethod(d, p, n, r, t)[0]   # [0] = interest, [1] = total
      for d, p, n, r, t in zip(
         deposit.to_pylist(), amount.to_pylist(), compound.to_pylist(),
         ROI.to_pylist(), time.to_pylist())
   ]
   # a scalar UDF must return an array of the registered output type
   return pa.array(interest, type=pa.float64())

pc.register_scalar_function(getInterestCalPolar,
                           function_name,
                           function_docs,
                           input_types,
                           output_type)

In [None]:
from pyarrow import csv
import pyarrow.compute as pc
dfPATable = csv.read_csv("../SampleData/sampleData.csv")
# print(type(dfPATable))
# dfPATable.shape
# dfPATable.filter((pc.field("deposit") == "CD") & (pc.field("ROI") > 1.5))
import pyarrow.dataset as ds
dataset = ds.dataset(dfPATable)

func_args = [ds.field("deposit"), ds.field("amount"), ds.field("compound"), ds.field("ROI"), ds.field("time")]
dataset.to_table(
            columns={
                # new column holding the interest returned by the registered UDF
                'interest_new': ds.field('')._call("getInterestCalPolar", func_args),
                'deposit': ds.field('deposit'),
                'amount': ds.field('amount'),
                'compound': ds.field('compound'),
                'ROI': ds.field('ROI'),
                'time': ds.field('time')
            })

## Efficient Column Updates with Function Calculations using Polars
---
While there are plenty of state-of-the-art dataframe packages on the market, the Polars dataframe library (built on Rust) claims the fastest execution for complex data science operations on tabular datasets. Its selling points include:

- analytics on datasets larger than memory (RAM)
- automatic query optimization
- embarrassingly parallel execution
- an easy-to-learn, consistent, and predictable API with a strict schema
- both a lazy API and eager execution

`Note that I will soon release another blog post that delves into the Rust Polars dataframe in great detail.`

Regarding custom functions: Polars expressions are powerful and versatile, so the need for custom Python functions is much lower than in other libraries.

However, it is still essential to be able to pass an expression's state to a third-party library or to apply your own function over data in Polars.

In [None]:
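# pip install polars
import polars as pl
dfPolars = pl.scan_csv("../SampleData/sampleData.csv")  # lazy scan; must exist before fetch
dfPolars.fetch(n_rows=3)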

In [None]:
import random
import polars as pl
dfPolars = pl.scan_csv("../SampleData/sampleData.csv")
# type(dfPolars)
dfPolars.fetch(n_rows=10)

def rank_pct():
    print(pl.element())
    pass

# out = dfPolars.select([pl.concat_list(pl.col(["deposit","amount"])).alias("new_Calc")]).collect()

# all_vals packs the numeric inputs into one list per row:
# x[0] principalAmount
# x[1] rate (percent)
# x[2] time in months
# x[3] compound type (compounding periods per year)
# note: newer Polars releases rename `apply` to `map_elements`

# buddy deposit calculation (simple-interest total)
dfPolars.filter(pl.col("deposit") == "buddy").with_columns(
    # create the list of homogeneous data; the columns are named explicitly so
    # the list positions match the x[...] indices above (pl.all().exclude(...)
    # would also pick up any other remaining column, such as an unnamed index)
    pl.concat_list(["amount", "ROI", "time", "compound"]).alias("all_vals")
).select([
    # select all columns except the intermediate list
    pl.all().exclude("all_vals"),
    # compute the total for each row with a Python lambda
    pl.col("all_vals").apply(lambda x: 
                             round(x[0] * (1 + ((x[1]/100) * (x[2]/12)) / x[3]), ndigits=2))
    .alias("new_Calc")
]).collect()

# CD deposit calculation (compound-interest total), mirroring getSampleCDeposit:
# self.principalAmount * (1 + self.rate / self.compound)**(self.compound * self.time)
dfPolars.filter(pl.col("deposit") == "CD").with_columns(
    # create the list of homogeneous data, in the same order as above
    pl.concat_list(["amount", "ROI", "time", "compound"]).alias("all_vals")
).select([
    # select all columns except the intermediate list
    pl.all().exclude("all_vals"),
    # compute the total for each row with a Python lambda
    pl.col("all_vals").apply(lambda x: 
                             round(x[0] * (1 + ((x[1]/100)/x[3])) ** (x[3] * x[2]/12), ndigits=2))
    .alias("new_Calc")
]).collect()

In [None]:
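# using a Polars custom function via map or apply
# be mindful: apply or map calls a Python function for every row,
# so it is very slow
# (newer Polars releases rename `apply` to `map_elements`)

# this code is as slow as using plain Python code on a pandas DataFrame

def getInterestMethod(deposit='CD', amount=10000, compound=1.0, rate=2.875, t=60):
    d = Deposit(amount, compound, rate, t)
    if deposit == "CD":
        return d.getSampleCDeposit()
    elif deposit == "buddy":
        return d.getSampleBDeposit()
    else:
        return d.getSampleMFDeposit()


out = dfPolars.select(
    [
        pl.all(),
        # pack the inputs into a struct so the Python function sees one dict per row;
        # [0] keeps the interest part of the (interest, total) tuple
        pl.struct(["deposit", "amount", "compound", "ROI", "time"])
        .apply(lambda row: getInterestMethod(
            row["deposit"], row["amount"], row["compound"], row["ROI"], row["time"])[0])
        .alias("solution_apply"),
        # native expression for comparison: simple interest only (matches the buddy rows)
        (pl.col("amount") * (pl.col("ROI") / 100) * (pl.col("time") / 12) / pl.col("compound"))
        .round(2)
        .alias("solution_expr"),
    ]
).collect()
print(out)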
