# Exercise 1 - Jupyter Notebook 
## Using Isolation Forest for Outlier Analysis on financial transaction data
SAP TechEd 2025, Hands-On Workshop: DA261 - Unlocking AI-driven insights from your business data in SAP HANA Cloud
<br><br>

### Understanding the exercise scenario 

In this exercise, you will explore how to __apply machine learning techniques__ like __Isolation Forest__ for __outlier analysis__ on __financial business transaction data__.
- __Financial business transaction data__ of most SAP applications is managed in central tables like the [Universal Journal](https://help.sap.com/docs/SAP_S4HANA_ON-PREMISE/651d8af3ea974ad1a4d74449122c620e/523b8a55559ad007e10000000a44538d.html?locale=en-US&version=LATEST) (table ACDOCA) and is accessible for different purposes through a large variety of CDS views, for example [I_GLAccountLineItemRawData](https://help.sap.com/docs/SAP_S4HANA_CLOUD/c0c54048d35849128be8e872df5bea6d/7fe239a3f2214e2cb36e90d453eee6d3.html). The Universal Journal is one of the tables in S/4HANA systems with the largest number of links to business transactions, and is thus typically also a very large table within the system.
    - As reference information and background reading, see the blog posts [Analytics on Universal Journal, the heart of SAP S/4HANA](https://community.sap.com/t5/enterprise-resource-planning-blog-posts-by-sap/analytics-on-universal-journal-the-heart-of-sap-s-4hana/ba-p/13489661) and [Understanding the Universal Journal in SAP S/4HANA](https://community.sap.com/t5/enterprise-resource-planning-blog-posts-by-sap/understanding-the-universal-journal-in-sap-s-4hana/ba-p/13345726).  
- Given the variety of business transactions managed in the Universal Journal, __outlier analysis__ as a form of __analytics on the universal journal__ likewise involves a great variety of techniques to detect errors or fraudulent transactions. 
    - Examples include detecting outliers by applying rules that search the data for false or incomplete bookings, e.g. missing transaction types, missing functional areas, wrong accounts for a business scenario, a non-zero balance in certain accounts, and so on.

<br>

In this exercise, we introduce a __trending machine learning__ technique for __outlier analysis__ called __isolation forest__:
- the technique uses an ensemble of multiple decision trees and identifies outliers based on the assumption that the __decision tree depth needed to isolate an outlier is shorter__ than the average decision tree depth for normal values;
- the technique can also be applied to financial business transaction data, and can be applied within the SAP HANA database;
- thus, __adding to the mix of outlier detection methods__, it can be __helpful__ to __identify outliers in the business transaction data by their unusual pattern of data values__;
- furthermore, the technique is well known for being __capable of analyzing larger amounts of data__.
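
To make the depth assumption tangible, here is a minimal pure-Python sketch on a single numeric column. It is not the PAL implementation: the function names and the sample amounts are made up for illustration, and a real isolation forest also picks a random feature at every split. The sketch only counts how many random splits are needed until a value sits alone in its partition.

```python
import random

def isolation_depth(values, target, rng, depth=0):
    """Number of random splits needed until `target` is alone in its partition."""
    if len(values) <= 1:
        return depth
    lo, hi = min(values), max(values)
    if lo == hi:  # identical values cannot be separated any further
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that still contains the target value
    if target < split:
        side = [v for v in values if v < split]
    else:
        side = [v for v in values if v >= split]
    return isolation_depth(side, target, rng, depth + 1)

def average_depth(values, target, n_trees=200, seed=42):
    """Average isolation depth of `target` over several random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(n_trees)) / n_trees

# 20 typical amounts and one extreme amount (made-up values)
amounts = [100.0 + i for i in range(20)] + [5000.0]
depth_outlier = average_depth(amounts, 5000.0)
depth_normal = average_depth(amounts, 110.0)
print(f"average depth outlier: {depth_outlier:.2f}, normal: {depth_normal:.2f}")
```

The extreme amount is usually separated by the very first split, while a typical amount needs several more splits. This difference in depth is what the anomaly score is built on.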

Data used for the analysis
- Outlier analysis on financial business transaction data typically __focuses on specific, smaller subsets or slices of the data__, e.g. a business area with a specific and typical financial transaction data pattern. Outlier analysis is therefore usually not applied to the complete table at once.
- Hence, filtering for subsets and slices, and the use of specific CDS views, is common practice.
- Data can potentially be extracted for further analysis to a side-by-side SAP HANA Cloud system via the Smart Data Access (SDA) ABAP adapter, which is used to consume remote ABAP CDS views. See this blog post for details: [Taking Data Federation to the Next Level: Accessing Remote ABAP CDS View Entities in SAP HANA Cloud](https://community.sap.com/t5/technology-blog-posts-by-sap/taking-data-federation-to-the-next-level-accessing-remote-abap-cds-view/ba-p/13635034).
- In this exercise, we use a small, artificially generated table called ACDOCA. For details on how the data is generated, see the code in the appendix of the notebook.

<br><br>

Overview of the exercise tasks

![](./images/ex1_scenario.png)


<br>
Now let's get started with the exercise!

<br><br>

## Ex. 1.0 - Connect to your SAP HANA Cloud instance

### Step 0: Establish and check connection


Throughout this exercise, we will be using the __Python Machine Learning client for SAP HANA (hana-ml)__. As your reference to all its functions and capabilities, see the [hana-ml documentation](https://help.sap.com/doc/cd94b08fe2e041c2ba778374572ddba9/2025_2_QRC/en-US/change_log.html). The current version, released with SAP HANA Cloud 2025 Q3, is 2.26.
- In general, the Python package hana-ml allows you to script in Python, while SQL code is generated on the fly and passed directly to a connected SAP HANA database system for execution.
    - It allows you to access and prepare data by means of a HANA dataframe, a Python object holding a SQL select query. Many methods are provided for the HANA dataframe; they change the SQL select query behind the scenes.
    - At its core, it provides methods to apply AI functions (algorithms from the [Predictive Analysis Library (PAL)](https://help.sap.com/docs/hana-cloud-database/sap-hana-cloud-sap-hana-database-predictive-analysis-library/sap-hana-cloud-sap-hana-database-predictive-analysis-library-pal?locale=en-US&version=LATEST) and the [Automated Predictive Library (APL)](https://help.sap.com/docs/apl?locale=en-US&version=LATEST)) to the data prepared with a HANA dataframe, designed so that all processing is applied within the SAP HANA database.
- The Python package is released as a component with every SAP HANA Client delivery. In addition, the latest __hana-ml__ version can always be found in the public PyPI repository at https://pypi.org/project/hana-ml/.
<br><br>

In Python, installed packages need to be __imported__ into the current session in order to be available for use in Python scripts.
- Let's first run the import of hana-ml

In [None]:
# Importing the Python Machine Learning client library for SAP HANA and get the version
import hana_ml
print(hana_ml.__version__)

Next, execute the following cell and the script it references, prepared in the Getting Started section, to connect to the SAP HANA Cloud database.

In [None]:
%run "../ex0/ex0_2-check_setup.ipynb"

## Ex. 1.1 - Exploring financial transaction data

### Step 1: Create HANA dataframe for the financial business transaction sample table

__Introduction to SAP HANA dataframes__
- The [HANA dataframe](https://help.sap.com/doc/cd94b08fe2e041c2ba778374572ddba9/2025_2_QRC/en-US/hana_ml.dataframe.html#module-hana_ml.dataframe) represents a database query as a hana-ml dataframe object in Python, comparable to a pandas dataframe. 
  - Most HANA dataframe operations are designed to NOT bring data back from the database into the Python environment, unless it is a small aggregated result set or explicitly requested. 
- SAP HANA dataframes can be created
    - based on database tables, SQL views and calculation views (incl. parameters), custom SQL statements incl. multi-statements
    - or created from pandas dataframe or spark dataframe.

Based on the SAP HANA Cloud database [ConnectionContext-object](https://help.sap.com/doc/cd94b08fe2e041c2ba778374572ddba9/2025_2_QRC/en-US/hana_ml.dataframe.html#hana_ml.dataframe.ConnectionContext) __"myconn"__, we use the __table-method__ to create the initial HANA dataframe.
- Note, ConnectionContext is a class of the hana_ml.dataframe module, which is why we can apply the table-method and other methods of the connection to create a HANA dataframe.

In [None]:
# Check the connection using "myconn"
myconn.connection.isconnected()

<br>

Create a HANA dataframe __acdoca_hdf__ against the ACDOCA demo database table, which contains artificially generated data and columns. See the appendix for details.

In [None]:
# Creating a HANA dataframe in Python against the HANA Cloud table
acdoca_hdf = myconn.table("ACDOCA", schema="DA261_SHARE")

<br>

Explore what a HANA dataframe object is within the python environment. 
- The attribute __select_statement__ shows the current SQL select statement of the HANA dataframe
- The Python functions __print()__ and __display()__ present the output of executed Python commands.

In [None]:
# Understand what is the HANA dataframe
print(acdoca_hdf)
print(acdoca_hdf.select_statement)

<br>

Understand the data structure of the query set underlying the dataframe
- using the shape and columns attributes and the dtypes()-method

In [None]:
# Understand the data structure of the query set underlying the dataframe
print(acdoca_hdf.shape, '\n')
print(acdoca_hdf.columns, '\n') 
display(acdoca_hdf.dtypes())

As the shape output indicates, the table we use for the purpose of this session is very small and has only 500 rows and 13 columns.  
The data will be explored in more detail using the following dataframe methods and their output.
<br>

<br>

For an __initial view of the data__, the __head()-method__ can be used in conjunction with the __collect()-method__.

- __Only__ when the __collect()-method__ is used is a __HANA dataframe's SQL query result set__ actually __transferred into the Python environment__. 
- Therefore, use the __collect()-method carefully__; it is best __not__ to use collect() without any filtering or aggregation methods applied.
- The head()-method, for example, adds the TOP N clause to the SQL select statement.

In [None]:
# Filter on the TOP 5 rows, and show them in python
display(acdoca_hdf.head(5).collect())

<br>
The dataframe with its methods applied is still only a SQL query statement. So what does it look like now?

In [None]:
print(acdoca_hdf.head(5).select_statement)

<br><br>
### Step 2: Explore the ACDOCA data using HANA dataframe methods

Let's look at some exemplary filtering and aggregation methods for the HANA dataframe
- Using a column list with the select()-method to restrict the result set columns, and the head(5)-method to restrict the result set rows

In [None]:
# Selecting columns using the select-method and list of columns
acdoca_hdf.select('Company Code', 'G/L Account', 'Profit Center', 'Financial Account Type','Amount (Transaction)').head(5).collect()

In [None]:
# Again, it's just a change to the select statement query of the dataframe
acdoca_hdf.select('Company Code', 'G/L Account', 'Profit Center', 'Financial Account Type','Amount (Transaction)').head(5).select_statement

<br>

Filtering data rows by applying SQL WHERE-clause expressions using the filter-method
- Because the Python string holding the expression is delimited by single quotes, 'string' values within the expression need to be escaped using \\'string\\'
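
As a side note on quoting, the following sketch shows that the backslash escaping is purely a matter of how the Python string is written. The helper `sql_literal` is a hypothetical name introduced here for illustration, not part of hana-ml; it applies the standard SQL rule of doubling single quotes inside a string literal.

```python
def sql_literal(value: str) -> str:
    """Quote a Python string as a SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"

# The same WHERE expression, written in three ways
escaped = '"Company Code" = \'CC01\' AND "Profit Center"=\'PC002\''
double_quoted = "\"Company Code\" = 'CC01' AND \"Profit Center\"='PC002'"
built = f'"Company Code" = {sql_literal("CC01")} AND "Profit Center"={sql_literal("PC002")}'

print(escaped)
print(escaped == double_quoted == built)
```

All three variants produce the identical expression string, so you can pick whichever style keeps the column names in double quotes and the values in single quotes most readable.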

In [None]:
# Filter by rows, the filter-method applies a SQL WHERE clause  [Note, string-quotes "'" within the where-expression need to be escaped using "\"]
acdoca_hdf.filter('"Company Code" = \'CC01\' AND "Profit Center"=\'PC002\'').head(5).collect()

# If you are interested, also explore the select statement by uncommenting the next line
# acdoca_hdf.filter('"Company Code" = \'CC01\' AND "Profit Center"=\'PC002\'').select_statement

<br>
Combining it all

In [None]:
display(acdoca_hdf.select('G/L Account', 'Profit Center', 'Financial Account Type','Amount (Transaction)').filter('"Profit Center"=\'PC002\'').head(5).collect())
print(acdoca_hdf.select('G/L Account', 'Profit Center', 'Financial Account Type','Amount (Transaction)').filter('"Profit Center"=\'PC002\'').head(5).select_statement)

As you can see from the __nested SQL statement__, the __order of method calls__ matters (here select > filter > head) and impacts the results.  
For data inspection using a dataframe, it is good practice as a safeguard to always consider using the head()-method before the collect()-call.
<br><br>
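
Why the order of method calls matters can be illustrated with a toy stand-in for a lazy dataframe. The class `ToyQuery` below is hypothetical and only mimics the idea of wrapping the previous select statement; the SQL text that hana-ml actually generates differs in its details.

```python
class ToyQuery:
    """Hypothetical stand-in for a lazy dataframe: it only stores a SQL string."""

    def __init__(self, select_statement: str):
        self.select_statement = select_statement

    def select(self, *columns: str) -> "ToyQuery":
        cols = ", ".join(f'"{c}"' for c in columns)
        return ToyQuery(f"SELECT {cols} FROM ({self.select_statement}) AS DT")

    def filter(self, condition: str) -> "ToyQuery":
        return ToyQuery(f"SELECT * FROM ({self.select_statement}) AS DT WHERE {condition}")

    def head(self, n: int) -> "ToyQuery":
        return ToyQuery(f"SELECT TOP {n} * FROM ({self.select_statement}) AS DT")


base = ToyQuery('SELECT * FROM "DA261_SHARE"."ACDOCA"')
cond = "\"Profit Center\" = 'PC002'"

filter_then_head = base.filter(cond).head(5)   # first 5 rows of the filtered data
head_then_filter = base.head(5).filter(cond)   # filter applied to the first 5 rows only

print(filter_then_head.select_statement)
print(head_then_filter.select_statement)
```

Each method wraps the previous statement as a subquery, so filtering before head() returns up to five matching rows, whereas filtering after head() may return fewer rows or none at all.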

Creating a new HANA dataframe from an existing dataframe
- New dataframe = existing dataframe.\<methods ...\>

In [None]:
# Note, we are not using the collect()-method, as we don't want to transfer any data
hdf_acdoca_tmp=acdoca_hdf.select('G/L Account', 'Profit Center', 'Financial Account Type','Amount (Transaction)').filter('"Profit Center"=\'PC002\'').head(5)
print(hdf_acdoca_tmp.select_statement)

<br><br>
## Ex 1.2 - Outlier analysis using IsolationForest

### Step 3: Execute basic outlier analysis using Isolation Forest

Now, let's explore our financial business transaction data for outliers, 
- filtering the HANA dataframe for Company Code CC01 data and for Profit Center PC002,  
- and prepare a HANA dataframe __hdf_acdoca_slice__ as the input data set to our Isolation Forest analysis.

In [None]:
hdf_acdoca_slice=acdoca_hdf.filter('"Company Code" = \'CC01\' AND "Profit Center"=\'PC002\'')

<br>

The __Isolation Forest__-function in SAP HANA Cloud (see the [PAL SQL reference](https://help.sap.com/docs/hana-cloud-database/sap-hana-cloud-sap-hana-database-predictive-analysis-library/isolation-forest-isolation-forest-11345d9?q=evaluation_metric&locale=en-US&version=LATEST) and the [hana-ml Python reference](https://help.sap.com/doc/cd94b08fe2e041c2ba778374572ddba9/2025_2_QRC/en-US/pal/algorithms/hana_ml.algorithms.pal.preprocessing.IsolationForest.html#isolationforest)) builds an __ensemble model__ of __multiple decision trees__ (thus called __"forest"__). 
- It tries to detect or isolate outliers based on the assumption that the __decision tree depth__ for __outlier values__ is __shorter__ than the average decision tree depth for normal values.  
- It is regarded as a technique that is also suitable for large datasets.  
- For additional background reading on the algorithm, see [wikipedia: isolation forest](https://en.wikipedia.org/wiki/Isolation_forest) and [wikipedia: different outlier detection techniques](https://en.wikipedia.org/wiki/Anomaly_detection).
![](./images/ex1_IF_algorithm.png)

Amongst other capabilities, the __SAP HANA Cloud Isolation Forest-function__
- supports the analysis of outliers within a __mixed feature set__ of numeric as well as __categorical__ columns,
- provides __AI explainability__ insights for predicted outliers,
- allows for __massive data-parallel outlier analysis__ on __multiple subsets__ of data in parallel.

<br>
Prepare the feature set as a list of the columns we want to consider for the outlier detection, incl. some categorical columns

In [None]:
outlier_features=['Debit/Credit', 'Accounting Document Type', 'Transaction Type', 'Financial Account Type', 'Amount (USD)', 'Amount (Transaction)'] 

<br>

Now in the first step, we will use the __fit-method__ to __build the Isolation Forest model__, and in the second step __apply__ the model with the __predict-method__ to retrieve the detected outliers.
- If we want to detect outliers in very large datasets, the Isolation Forest model could be trained on a representative and large enough sample of the data,
- while the predict-method with the trained model can then be applied to the full data set. 
- The predict task is also row-independent, and thus various parallel invocation techniques can be applied (e.g. parallel by partition, by value, ...) for faster performance.  

We are applying the Isolation Forest method with the __default algorithm parameter values__, which are
- __n_estimators=100__, which specifies the number of trees the model will be composed of.
- __max_samples=256__, which refers to the number of sample rows drawn from the input data to train each tree. 
- __bootstrap=False__, i.e. row sampling happens without replacement, so a row appears at most once in the sample drawn for a tree.
- __random_state=251104__, which can be set to any value. It simply ensures repeatability by setting a fixed starting point for randomness within the algorithm.
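
To illustrate what max_samples, bootstrap and random_state control, here is a small sketch of per-tree row sampling using only the Python standard library. The function `draw_tree_sample` and the row identifiers are made up for illustration. Capping the sample size at the number of available rows is an assumption of this sketch, not a statement about how PAL behaves.

```python
import random

def draw_tree_sample(row_ids, max_samples, bootstrap, rng):
    """Draw the row sample for one tree, with or without replacement."""
    size = min(max_samples, len(row_ids))     # assumption: cap at the available rows
    if bootstrap:
        return rng.choices(row_ids, k=size)   # with replacement: duplicates possible
    return rng.sample(row_ids, size)          # without replacement: every row at most once

rng = random.Random(251104)                   # fixed seed, so the draws are repeatable
row_ids = list(range(1000))                   # hypothetical row identifiers
samples = [draw_tree_sample(row_ids, 256, False, rng) for _ in range(3)]

for i, sample in enumerate(samples):
    print(f"tree {i}: {len(sample)} rows, {len(set(sample))} distinct")
```

Every tree receives its own sample of 256 distinct rows. Re-running the cell with the same seed reproduces exactly the same samples, which is the repeatability that random_state provides.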

For details on how to set Isolation Forest parameter values 
- see the reference section in the Appendix of the notebook __Setting Isolation Forest parameter values with larger datasets__

In [None]:
# Loading the Isolation Forest method class
from hana_ml.algorithms.pal.preprocessing import IsolationForest

# Creating our IsolationForest model object named "isof"
isof = IsolationForest(random_state=251104, n_estimators=100, max_samples=256, bootstrap=False)

# Executing the fitting, i.e. the training of the Isolation Forest Outlier model
isof.fit(data=hdf_acdoca_slice, features=outlier_features) 

As the fit method in this case doesn't present results, let's inspect what has been executed on the SAP HANA Cloud database.
- There is a list of [base-methods](https://help.sap.com/doc/cd94b08fe2e041c2ba778374572ddba9/2025_2_QRC/en-US/pal/algorithms/hana_ml.algorithms.pal.pal_base.PALBase.html#hana_ml.algorithms.pal.pal_base.PALBase) available for all PAL algorithms in hana-ml, helpful to explore generated or executed SQL. As an example we will use __"get_fit_execute_statement"__.
- In addition, there is the __last_execute_statement__ attribute of the connection object, which is sometimes also helpful to quickly inspect what has been executed last.

In [None]:
# What has just been executed in the SAP HANA Cloud database?
print(isof.get_fit_execute_statement())

In [None]:
# Alternatively, what has been last executed over the connection?
#print(myconn.last_execute_statement)

<br>

Next, in this second step, we want to __predict the outliers__ in our data using the isolation forest model we just created.
- While we are using the same data set in this exercise step, this could just as well be a different, much larger or updated data set, provided it has the same structure and columns.
- As with other AI functions and algorithms, the predict-method requires a key-/ID-column to be able to match the predicted results with the input data used to generate the predictions.

Therefore, we first create a simple ID-column using the __add_id()-method__

In [None]:
# For the purpose of this demo, we're using the simple add_id()-method. With your real data, one would likely create a key column over a set of composite logical key columns.
hdf_acdoca_slice_id=hdf_acdoca_slice.add_id()

hdf_acdoca_slice_id.head(10).collect()

<br>

Now, let's execute the __predict-method__
- The __most important parameter__ we set is the __contamination value__; it is basically the __proportion of expected outliers__ in the data set. 
- Thus the value should be set by a data analyst who is familiar with the business context of the data.
- Here, let's start with 0.05, i.e. expecting 5% outliers in our data.
- Try it with different values like 0.01 or 0.02 as well.
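
The following sketch shows one common way in which a contamination proportion can be turned into outlier labels: the rows with the highest anomaly scores, up to the given proportion, are labelled -1. The function name and the scores are made up, and how exactly PAL derives its threshold is an assumption not verified here.

```python
def label_by_contamination(scores, contamination):
    """Label the highest-scoring fraction of rows as outliers (-1), all others as 1."""
    n_outliers = round(len(scores) * contamination)
    # Indices of the rows, ordered from highest to lowest anomaly score
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    outlier_idx = set(ranked[:n_outliers])
    return [-1 if i in outlier_idx else 1 for i in range(len(scores))]

scores = [0.40 + 0.001 * i for i in range(100)]   # made-up anomaly scores
for contamination in (0.01, 0.02, 0.05):
    labels = label_by_contamination(scores, contamination)
    print(contamination, labels.count(-1))
```

The scores themselves do not change with the contamination value; only the cut-off between outliers and normal rows moves, which is why trying several values is cheap and instructive.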

In [None]:
outlier_results = isof.predict(data=hdf_acdoca_slice_id, key='ID', features=outlier_features, 
                               contamination=0.05)

<br>

The result object of the __isof.predict()__ call is again a HANA dataframe, let's take a closer look ...

In [None]:
# The results are persisted into a temporary database table, identified by the HANA dataframe "outlier_results".
outlier_results.select_statement

In [None]:
# If you wish to explore, what has been executed within the SAP HANA database
# print(isof.get_predict_execute_statement())

<br>

Now let's explore the __Isolation Forest outlier analysis results__
- The result dataframe presents the columns ID, __SCORE__ and __LABEL__
- __LABEL__ values of __-1__ indicate that the __data row has been identified as an outlier__
- The __SCORE (or anomaly score)__ value __quantifies how easily__ a data point could be isolated from the rest of the data, and thus classified as an outlier or not.
    - It is calculated based on the average path length in the trees of the isolation forest, with __shorter path lengths__ indicating __higher anomaly scores__ (closer to 1),  
  suggesting the point is an anomaly, while longer path lengths indicate lower anomaly scores (closer to 0) and are considered normal.
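
The relation between path length and score can be sketched with the formula from the original isolation forest paper by Liu, Ting and Zhou, s = 2^(-E[h(x)] / c(n)), where c(n) normalizes by the average path length for a sample of n rows. The function names below are illustrative, and whether the PAL SCORE uses exactly this normalization is an assumption.

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n: int) -> float:
    """c(n): average path length of an unsuccessful search in a binary search tree of n rows."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + EULER_GAMMA   # approximation of the harmonic number H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length: float, n: int) -> float:
    """s = 2 ** (-E[h(x)] / c(n)), as defined in the original isolation forest paper."""
    return 2.0 ** (-mean_path_length / average_path_length(n))

n = 256                                        # sample size per tree (max_samples)
for mean_depth in (2.0, average_path_length(n), 14.0):
    print(f"mean path length {mean_depth:5.2f} -> score {anomaly_score(mean_depth, n):.3f}")
```

A row whose mean path length equals the average c(n) gets a score of exactly 0.5; shorter paths push the score towards 1, longer paths towards 0.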

In [None]:
# Sort outlier results descending by SCORE-value
display(outlier_results.sort('SCORE', desc=True).head(5).collect())

# Count number of predicted outliers (LABEL = -1) and non-outliers (LABEL = 1)
display(outlier_results.agg([('count','ID','n_transactions')] ,group_by = ['LABEL']).collect())

The first result set presents the top 5 predictions,
- using the sort-method with the result dataframe, sorting the results in descending order of the SCORE values.

The second result set aggregates the prediction results and
- counts the predictions by LABEL value. Thus we can conclude that we identified 9 outlier records in our data.
<br>


<br>

Next, let's visualize the outlier scores and how they distribute across the amount values, using a Plotly visualization

In [None]:
# Visualize the outlier scores and how they distribute across the amount values, using a Plotly visualization
# Export the results into a pandas dataframe for visualization in Python, in this case it is still a small set of data points
pdf=hdf_acdoca_slice_id.select('ID', ('"Amount (Transaction)"', 'AMOUNT')).set_index("ID").join(outlier_results.set_index("ID")).sort('SCORE', desc=True).collect()
pdf['LABEL'] = pdf['LABEL'].astype(str)

# Building a custom scatter plot
import plotly.express as px
fig = px.scatter(pdf, x="AMOUNT", y="SCORE", color_discrete_map = {"1": "blue", "-1":"red"}, color="LABEL", width=800, height=400)
fig.update_traces(marker=dict(size=5))
fig.show()

<br>

A first approach to understanding the outlier classification predictions by the model is to join the results back with the input data.
- here we use the HANA dataframe __join-method__

In [None]:
# Join dataframes with each other, applying set_index as the inner join key column; restricting to the columns we have used as features with Isolation Forest
hdf_acdoca_slice_id.select('ID', 'Debit/Credit', 'Accounting Document Type', 'Transaction Type', 'Financial Account Type', 'Amount (USD)', 'Amount (Transaction)'
                                          ).set_index("ID").join(outlier_results.set_index("ID")).filter('LABEL = -1').sort('SCORE', desc=True).head(5).collect()

Depending on our familiarity and expertise with the data, it might still be difficult to understand the outlier decisions made by the Isolation Forest model.  
Therefore, let's try to get more insights by applying AI explainability methods.
<br><br>

### Step 4: Add Shapley explanations to Isolation Forest outlier predictions

The SAP HANA Predictive Analysis Library provides an extensive set of __AI explainability__ approaches for the predictions of [classification](https://help.sap.com/docs/hana-cloud-database/sap-hana-cloud-sap-hana-database-predictive-analysis-library/local-interpretability-of-models-local-interpretability-of-models-c330665?locale=en-US&version=LATEST), regression as well as time series functions.
In AI explainability, we try to gain transparency at the level of the AI / machine learning model as well as of the individual predictions. We therefore distinguish between
- __Global AI explainability__ as analysis and reports, presenting the overall model insights, i.e. the "global" overall contribution of individual features to the model. See this [blog post](https://community.sap.com/t5/technology-blog-posts-by-sap/global-explanation-capabilities-in-sap-hana-machine-learning/ba-p/13620594) if interested in more details;
- __Local AI explainability__, as explanations for individual predictions, showing which feature values contributed to what degree to the model's prediction outcome. 

The so-called __SHapley Additive exPlanations (SHAP)__ have become a widely used standard AI explainability method
- SHAP provides explainability models derived from game theory, indicating which feature values had the largest impact on the decision for a given prediction.
- For more details see this community blog post [On Responsible AI: SHAP of you](https://community.sap.com/t5/technology-blog-posts-by-members/on-responsible-ai-shap-of-you/ba-p/13553641)
- The hana-ml package provides a series of methods and visualizations, see the documentation for [local_interpretability](https://help.sap.com/doc/1d0ebfe5e8dd44d09606814d83308d4b/2.0.07/en-US/pal/topics/local_interpretability.html)
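
To give an idea of the game-theoretic foundation, the sketch below computes exact Shapley values for a tiny, made-up scoring function by averaging each feature's marginal contribution over all feature orderings. The function `toy_score` and its numbers are purely hypothetical. Practical SHAP implementations use far more efficient, often approximate, algorithms.

```python
from itertools import permutations

def shapley_values(features, value):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    contributions = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for ordering in orderings:
        coalition = frozenset()
        for f in ordering:
            with_f = coalition | {f}
            contributions[f] += value(with_f) - value(coalition)
            coalition = with_f
    return {f: total / len(orderings) for f, total in contributions.items()}

def toy_score(coalition):
    """Hypothetical model output for a given set of known features (made-up numbers)."""
    score = 0.40                               # baseline score with no feature known
    if "Transaction Type" in coalition:
        score += 0.10
    if "Amount (USD)" in coalition:
        score += 0.05
    if {"Transaction Type", "Amount (USD)"} <= coalition:
        score += 0.04                          # interaction, shared between both features
    return score

features = ["Transaction Type", "Amount (USD)", "Debit/Credit"]
phi = shapley_values(features, toy_score)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.3f}")
```

The contributions add up to the difference between the full score and the baseline score, which is the additive property that gives the method its name. The interaction effect is split evenly between the two features involved, and a feature without any influence receives a value of zero.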

<br>

The __SHapley Additive exPlanations__ can also be used with the SAP HANA Cloud __Isolation Forest__-function (since 2025 Q2).
- Local AI explainability is activated in the predict function using the __show_explainer=True__ parameter; the explanation output is generated into the REASON_CODE column
- Due to its processing impact, it is recommended not to enable it by default, and to use it together with __explain_scope__ to explain only the outliers.
- With __top_k_attributions__ we can limit the explanation output within the REASON_CODE column to the __top k columns within the individual explanations__

In [None]:
outlier_results_explained = isof.predict(data=hdf_acdoca_slice_id, key='ID', features=outlier_features,
                       contamination=0.05,
                       show_explainer=True, explain_scope='outliers', top_k_attributions=5)

In [None]:
# Displaying the REASON_CODE explanations of the predicted outliers in full width
import pandas as pd
pd.set_option('display.max_colwidth', None) 
outlier_results_explained.filter('LABEL = -1').head(5).collect()

How to read and interpret the results in the REASON_CODE column:
- for ID 17, the top 1 column in the explanation result is: {__"attr":"Transaction Type"__,"val":0.4292705364764549,__"pct":26.700000000000004__}
- __"Transaction Type" is the top 1 attribute, contributing 26.7% to the outlier prediction__ (classification as outlier and SCORE value). 
- "val" is the Shapley value, which on its own cannot be explained here.
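
If you want to process the explanations programmatically, the REASON_CODE content can be parsed with Python's json module. The sketch below assumes that a REASON_CODE value is a JSON array of objects with the keys "attr", "val" and "pct". The first entry is copied from the example above; the other two entries are made up for illustration.

```python
import json

# First entry copied from the text above; the remaining entries are made up for illustration
reason_code = (
    '[{"attr":"Transaction Type","val":0.4292705364764549,"pct":26.700000000000004},'
    '{"attr":"Amount (USD)","val":0.31,"pct":19.3},'
    '{"attr":"Debit/Credit","val":-0.05,"pct":3.1}]'
)

def top_attributions(reason_code_json: str, k: int = 3):
    """Return the k attributions with the largest percentage contribution."""
    entries = json.loads(reason_code_json)
    ranked = sorted(entries, key=lambda e: e["pct"], reverse=True)
    return [(e["attr"], round(e["pct"], 1), e["val"]) for e in ranked[:k]]

for attr, pct, val in top_attributions(reason_code):
    print(f"{attr}: {pct}% (Shapley value {val:+.3f})")
```

Applied to a collected pandas dataframe, such a helper could be mapped over the REASON_CODE column to derive, for example, the top contributing attribute per outlier.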
<br><br>

In [None]:
# Now joining the results again with the input features, for better reasoning about the explanation results
hdf_acdoca_slice_id.select('ID', 'Debit/Credit', 'Accounting Document Type', 'Transaction Type', 'Financial Account Type', 'Amount (USD)', 'Amount (Transaction)'
                                          ).set_index("ID").join(outlier_results_explained.set_index("ID")).sort('SCORE', desc=True).head(3).collect()

<br>

We can review __global AI explainability__, i.e. overall insights at the level of the isolation forest model, using the __Shapley Explainer summary plot__
- Select the __Feature Effects__ on the __Bar-Plot tab__

In [None]:
# Global AI Explainability results, inspect the Feature Effects on the Bar Plot of the visualization
from hana_ml.visualizers.shap import ShapleyExplainer
shapley_explainer = ShapleyExplainer(reason_code_data=outlier_results_explained.sort('ID').select('REASON_CODE'), feature_data=hdf_acdoca_slice_id.sort('ID').select(outlier_features)) 

# Global Shapley model explanations
shapley_explainer.summary_plot()

<br>

The ShapleyExplainer-object also provides another __local AI explainability__ visualization, the so-called __Shapley Force plot__.  
- As an example to understand the Force Plot, select row 56 in the plot; it should match ID 57 in the data
- Expand the force plot visualization by clicking the "+" icon to the left of the row
- The __Shapley values__, not the percentage values, are __displayed with their positive / negative sign__ and their impact on the classification decision

In [None]:
# Explore the Force plot, for a visualization of the local AI explainability, Shapley values
shapley_explainer.force_plot()

<br><br>

## Ex 1.3 - Outlier analysis per subgroup using Massive Isolation Forest (optional)

The __Isolation Forest__-function of the __Predictive Analysis Library__ in SAP HANA Cloud recently (release 2025 Q2) introduced [massive, data-parallel outlier analysis](https://help.sap.com/docs/hana-cloud-database/sap-hana-cloud-sap-hana-database-predictive-analysis-library/isolation-forest-isolation-forest-11345d9#ariaid-title6). 
- This allows you to run isolation forest analysis tasks for each subgroup independently and in parallel, for maximum performance.
- This __massive, data-parallel__ pattern is a commonly used approach within SAP applications, for example when applying time series forecasting at the scale of modelling and forecasting multiple thousand time series in parallel. 
- __Resource consumption__ can be fully controlled by SAP HANA's __Workload Management__ capabilities, e.g. by __limiting the number of available threads__ for the task, which slows down the overall task by reducing the degree of parallelism.
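
The data-parallel pattern itself can be sketched in plain Python: the data is split by a group key, and the same analysis runs independently for each group on a limited pool of worker threads. This is only a conceptual illustration with made-up rows and a trivial stand-in analysis; it does not reflect how SAP HANA schedules the work internally.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import median

def largest_deviation(amounts):
    """Stand-in for a per-group analysis: the amount farthest from the group median."""
    center = median(amounts)
    return max(amounts, key=lambda a: abs(a - center))

# Made-up (G/L Account, amount) rows
rows = [("400000", 100.0), ("400000", 105.0), ("400000", 980.0),
        ("500000", 20.0), ("500000", 22.0), ("500000", 21.0),
        ("600000", 7.0), ("600000", 250.0), ("600000", 8.0)]

# Split the rows by group key
groups = defaultdict(list)
for account, amount in rows:
    groups[account].append(amount)

# max_workers caps the degree of parallelism, comparable in spirit to limiting threads
with ThreadPoolExecutor(max_workers=2) as pool:
    results = dict(zip(groups, pool.map(largest_deviation, groups.values())))

print(results)
```

Because every group is analyzed on its own, an amount of 250 can stand out within one account although it would look unremarkable next to the amounts of another account.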

### Step 5: Data-parallel outlier analysis per "G/L Account"

Let's apply the massive, data-parallel isolation forest to the same data

In [None]:
# Review our data and features
print(outlier_features, '\n')
display(hdf_acdoca_slice.head(2).collect())
display(hdf_acdoca_slice_id.head(2).collect())

<br>

Let's search for outliers in our data slice, for each "G/L Account" in parallel

In [None]:
# How many "G/L Account" values do we have in our data?
hdf_acdoca_slice.distinct("G/L Account").agg([('count','G/L Account','Count of G/L Accounts')]).collect()

<br>

The IsolationForest object parameter __massive=True__ and the fit-/predict-method parameter __group_key="G/L Account"__ trigger the massive data-parallel processing

In [None]:
from hana_ml.algorithms.pal.preprocessing import IsolationForest
parallel_isof = IsolationForest(massive=True, random_state=2, n_estimators=100, max_samples=256, bootstrap=False)

parallel_isof.fit(data=hdf_acdoca_slice, group_key="G/L Account", features=outlier_features)

res, err = parallel_isof.predict(data=hdf_acdoca_slice_id, key="ID", group_key="G/L Account", features=outlier_features)


In [None]:
#print(parallel_isof.get_fit_execute_statement())

In [None]:
# The results show outlier predictions across the data groups of "G/L Accounts"
display(res.sort('SCORE', desc=True).head(10).collect())

In [None]:
# join original value with outlier data
hdf_acdoca_slice_id.select('ID', 'G/L Account', 'Amount (USD)', 'Amount (Transaction)').set_index("ID").join(res.set_index("ID")).sort('SCORE', desc=True).collect()

<br><br>

## Summary

You've now completed exercise 1, great!

Continue to - [Exercise 2 - Analyzing consumer complaints using text embeddings and machine learning](../ex2/README.md)


<br><br>

## Further reference information and examples 
Note the appendix section of the ex1_notebook.ipynb-file might include additional expert-level details for your offline study, incl. reference to the data used in the exercise.

<br><br>

# Appendix - reference sections

## Code generation for design-time applications (optional)

<br> 

You are happy with the isolation forest model of the outlier analysis scenario and now want to embed it into your design-time application.
1. Utilize methods from the [hana-ml PALBase class](https://help.sap.com/doc/cd94b08fe2e041c2ba778374572ddba9/2025_3_QRC/en-US/pal/algorithms/hana_ml.algorithms.pal.pal_base.PALBase.html), which are implicitly available with all PAL algorithm objects in hana-ml
<br> 
    ![](./images/ex1-appd-dt-basemethods.png)

<br>

2. Utilize methods from the [hana-ml.artifacts package](https://help.sap.com/doc/cd94b08fe2e041c2ba778374572ddba9/2025_3_QRC/en-US/hana_ml.artifacts.html) to generate ABAP AMDP, CAP or HANA-native HDI (HANA deployment infrastructure) design-time artifacts for a respective project
    - SQL tracing needs to be enabled
    - At least both the fit- and the predict-method have to be executed to generate persistent artifacts from the traced execution code

<br>

Approach 1.) __Review recently generated code and temporary objects__

In [None]:
outlier_features=['Debit/Credit', 'Accounting Document Type', 'Transaction Type', 'Financial Account Type', 'Amount (USD)', 'Amount (Transaction)'] 

In [None]:
# Loading the Isolation Forest method class
from hana_ml.algorithms.pal.preprocessing import IsolationForest

# Creating our IsolationForest model object named "isof"
isof = IsolationForest(random_state=251104)

# Executing the fitting, i.e. the training of the Isolation Forest Outlier model
isof.fit(data=hdf_acdoca_slice, features=outlier_features) 

In [None]:
print(isof.consume_fit_hdbprocedure("ISOF_OUTLIER_BASE_PROC_NAME")['base'], "\n")
#print(isof.fit_hdbprocedure)
print(isof.get_fit_output_table_names())
print(isof.get_fit_parameters())

print(isof.consume_fit_hdbprocedure('<PROC_NAME>', in_tables=["<ANALYSIS_DATA>"], out_tables=["<IF_MODEL_TABLE>"])['consume'], "\n")

<br><br>

Approach 2.) __Now, generate the full design-time artifacts along with other files required by the development project, such as synonyms__

In [None]:
# For the purpose of later artifact generation, enable sql tracing
myconn.sql_tracer.enable_sql_trace(True)
myconn.sql_tracer.enable_trace_history(True)

In [None]:
# Loading the Isolation Forest method class
from hana_ml.algorithms.pal.preprocessing import IsolationForest

# Creating our IsolationForest model object named "isof_dev"
isof_dev = IsolationForest(random_state=251104)

In [None]:
# Executing the fitting, i.e. the training of the Isolation Forest Outlier model
isof_dev.fit(data=hdf_acdoca_slice, features=outlier_features) 

In [None]:
# Run predict on the same traced object that was fitted above, so that both calls are captured
outlier_results = isof_dev.predict(data=hdf_acdoca_slice_id, key='ID', features=outlier_features, 
                                   contamination=0.05)

In [None]:
# What has been captured as objects in the SQL tracer log?
print(myconn.sql_tracer.trace_sql_log.keys())


In [None]:
# What are the trace objects at the next level of the fit object?
print(myconn.sql_tracer.trace_sql_log['IsolationForest1'].keys())
print(myconn.sql_tracer.trace_sql_log['IsolationForest1']['fit'].keys())

#print(myconn.sql_tracer.trace_sql_log['IsolationForest1']['predict'].keys())

In [None]:
# And what does the SQL object look like for the fit call?
display(myconn.sql_tracer.trace_sql_log['IsolationForest1']['fit']['sql'])

In [None]:
display(myconn.sql_tracer.trace_sql_log['IsolationForest1']['fit']['output_tables'])

In [None]:
from hana_ml.artifacts.generators import hana
from hana_ml.artifacts.generators.hana import HANAGeneratorForCAP
hanagen = HANAGeneratorForCAP(project_name="OutlierAnalysis_HANA-CAP",
                              output_dir="./generated_src4CAP",
                              namespace="hana.ml")
# Generate the artifacts from the traced fit and predict calls of isof_dev
hanagen.generate_artifacts(isof_dev, model_position=True, cds_gen=False, tudf=True)

In [None]:
# shows current path: /home/user/projects/teched2025-DA261/exercises/ex1
!pwd

# in case of windows systems
#!cd

In [None]:
# List artifacts in target directory
!ls ./generated_src4CAP/OutlierAnalysis_HANA-CAP/db/src/

# in case of windows systems
#!dir .\\generated_src4CAP\\OutlierAnalysis_HANA-CAP\\db\\src\\

In [None]:
!cat ./generated_src4CAP/OutlierAnalysis_HANA-CAP/db/src/hana-ml-base-pal-isolation-forest.hdbprocedure
#!type .\\generated_src4CAP\\OutlierAnalysis_HANA-CAP\\db\\src\\hana-ml-base-pal-isolation-forest.hdbprocedure

In [None]:
!cat ./generated_src4CAP/OutlierAnalysis_HANA-CAP/db/src/hana-ml-cons-pal-isolation-forest.hdbprocedure
#!type .\\generated_src4CAP\\OutlierAnalysis_HANA-CAP\\db\\src\\hana-ml-cons-pal-isolation-forest.hdbprocedure

<br>

Generating non-CAP, HANA HDI only artifacts for a native SAP HANA application

In [None]:
# Artifact generation is complete, disable SQL tracing again
myconn.sql_tracer.enable_sql_trace(False)
myconn.sql_tracer.enable_trace_history(False)

## Setting Isolation Forest parameter values with larger datasets (optional)

The Isolation Forest algorithm is commonly regarded as an outlier detection technique that is also suitable for large datasets, for example for fraud detection on transactional data.
- In such cases, the Isolation Forest outlier model can be trained on a representative and large enough sample of the data. The predict method with the trained outlier model can then be applied to the full data, including the use of additional parallelization techniques.
- Nevertheless, adjusting and experimenting with the Isolation Forest parameter values will be required to handle larger datasets.

For larger datasets, increasing both max_samples and n_estimators when fitting the model can improve the accuracy of anomaly detection by capturing more diverse information and achieving a better consensus among trees. Model fitting will then require more computation time and resources.  
Therefore, as the dataset grows, consider raising both max_samples and n_estimators:
- For max_samples, consider increasing it incrementally by a fraction (e.g. 10%) of the dataset size at each step, provided that computational resources allow.
- For n_estimators, aim to increase it gradually in increments of 100 until reaching a threshold such as 1000. However, the extent of this increase should be guided by computational constraints, which are difficult to predict without conducting experiments and validation.
      
Example: if the dataset to analyze has 10,000,000 rows
- You may consider training the model on a fraction of the data, e.g. 10%, which is 1 million rows
- If you want the trees to cover 100% of the data used to build the model, then max_samples * n_estimators needs to add up to at least 1 million
    - You could evaluate n_estimators=500 (build 500 trees) and max_samples=2000 (each tree captures the data of 2000 rows)
    - If you switch to bootstrap=True, row sampling happens with replacement. Each row can then be sampled multiple times, so you would require higher values for n_estimators and max_samples to avoid rows not being sampled by any tree. The model itself would become more robust, as multiple trees would jointly determine whether a data point is an outlier or not.
    - If the data pattern is extremely diverse, the decision trees may grow very large, i.e. a very large tree depth implies very big models that require more resources to use and are possibly overfitted and too granular. Remember, Isolation Forest tries to isolate outliers by identifying them at shorter depths in the tree. Therefore you should choose max_samples of at most ~50,000, which leads to a tree depth of ~16 (rule of thumb: Log(2, <max samples value>) <= 16). For example, as a SQL expression fragment: `ROUND(Log(2, {datarows}/SAMPLES_PCT), 0) AS MAX_DEPTH`
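
The rule-of-thumb arithmetic from this example can be checked with a few lines of plain Python. This is a minimal sketch using only the example numbers from this section; the helper name `sizing_check` is hypothetical and not part of hana-ml, and the depth is simply the rounded-up base-2 logarithm of max_samples.

```python
import math

def sizing_check(train_rows, n_estimators, max_samples):
    """Return (rows drawn across all trees, covers training sample?, approx. tree depth)."""
    rows_drawn = n_estimators * max_samples
    covers = rows_drawn >= train_rows
    approx_depth = math.ceil(math.log2(max_samples))
    return rows_drawn, covers, approx_depth

total_rows = 10_000_000
train_rows = total_rows // 10  # train on a 10% sample, i.e. 1 million rows

# Candidate from the example: 500 trees with 2000 rows each
print(sizing_check(train_rows, n_estimators=500, max_samples=2000))

# Upper bound suggested for max_samples: ~50,000 rows per tree
print(sizing_check(train_rows, n_estimators=20, max_samples=50_000))
```

Whether a given combination is actually affordable still has to be validated by experiment, as noted above.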

<br>

Moreover, it is __strongly recommended to apply workload classes__ to control the maximum resources consumed by a PAL call, or by multiple calls as a whole.
- There are multiple ways an SAP HANA database administrator can set up workload classes, mapping the constraints to users, certain database objects like the PAL procedures, or applications. 
- For details see [workload class management documentation](https://help.sap.com/docs/hana-cloud-database/sap-hana-cloud-sap-hana-database-administration-guide/managing-workload-classes-in-sap-hana-cloud-central?locale=en-US&version=LATEST)

Within hana-ml scenarios, if a workload class is not already implicitly applied to the user and scenario, an existing workload class can be applied by setting it for the PAL method object. It is then attached to the anonymous SQL block generated by hana-ml and sent with each execution initiated by the object, until disabled again.


In [None]:
# enable_workload_class(workload_class_name)
isof.enable_workload_class("PAL_AUTOML_WORKLOAD")

In [None]:
# Check the number of rows of your data slice to analyze for outliers
datarows = hdf_acdoca_slice_id.shape[0]    # shape returns [rows, columns]
datarows2 = hdf_acdoca_slice_id.count()    # count() returns the number of rows
print(datarows, datarows2)
print(f"The data slice (dataframe result set) has {datarows} record(s).")

<br>

Let's say we want to build an outlier model on a 1 million row sample of our 10 million rows of data.

In [None]:
# Loading the Isolation Forest method class
from hana_ml.algorithms.pal.preprocessing import IsolationForest

# Creating our IsolationForest model object named "isof"
#isof = IsolationForest(random_state=251104, n_estimators=500, max_samples=2000, bootstrap=False)

# Or with bootstrap sampling applied
isof = IsolationForest(random_state=251104, n_estimators=500, max_samples=8000, bootstrap=True)

# Executing the fitting, i.e. the training of the Isolation Forest Outlier model
isof.fit(data=hdf_acdoca_slice, features=outlier_features) 
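
Why the higher max_samples value in the bootstrap variant? A rough estimate can make this plausible. The sketch below assumes that every draw picks one of the 1 million training rows uniformly at random and independently of all other draws. This is a simplification for illustration, not a description of the PAL implementation, and the helper name `expected_unsampled_fraction` is hypothetical.

```python
def expected_unsampled_fraction(train_rows, n_estimators, max_samples):
    """Expected share of rows never drawn, assuming independent uniform draws with replacement."""
    draws = n_estimators * max_samples
    # Probability that one specific row is missed by every single draw
    return (1.0 - 1.0 / train_rows) ** draws

for max_samples in (2000, 8000):
    frac = expected_unsampled_fraction(1_000_000, 500, max_samples)
    print(f"max_samples={max_samples}: ~{frac:.1%} of rows never sampled")
```

Under this simplified model, 500 trees with 2000 rows each leave about 37% of the rows unsampled, while 500 trees with 8000 rows each leave about 2%.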

<br>

The predict task is row-independent, and thus various parallel invocation techniques can be applied.
Besides the PAL "massive" data-parallel function implementations, which provide optimized group-by-value parallel, independent processing including optimized resource usage and batching of the parallelization task, PAL and the SAP HANA SQL engine provide additional parallelization techniques for such row- or data-subset-independent processing tasks:
- PAL calls with the hint [parallel by partition](https://help.sap.com/docs/hana-cloud-database/sap-hana-cloud-sap-hana-database-predictive-analysis-library/calling-pal-procedures-in-parallel-with-hint-parallel-by-parameter-partitions-calling-pal-procedures-in-parallel-with-hint-parallel-by-parameter-partitions-ed5807b?locale=en-US&version=LATEST) invoke one predict function call for each SAP HANA Cloud table partition in parallel 
- Furthermore, such PAL calls may be parallelized using the SQL patterns [MAP_REDUCE](https://help.sap.com/docs/hana-cloud-database/sap-hana-cloud-sap-hana-database-predictive-analysis-library/calling-pal-procedures-in-parallel-with-map-reduce?version=LATEST&locale=en-US) or [MAP_MERGE](https://help.sap.com/docs/hana-cloud-database/sap-hana-cloud-sap-hana-database-predictive-analysis-library/calling-pal-procedures-in-parallel-with-operator-map-merge?locale=en-US&version=LATEST)
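
The reason row-independent scoring parallelizes so easily can be pictured in plain Python: scoring each partition separately and concatenating the results gives the same output as scoring all rows at once. The sketch below illustrates only this idea. `score_row`, the partitioning helper and the thread pool are hypothetical stand-ins and do not reflect how PAL or the SAP HANA SQL engine execute parallel calls.

```python
from concurrent.futures import ThreadPoolExecutor

def score_row(amount):
    """Toy row-level score standing in for the predict step of a trained model."""
    return abs(amount) / 1000.0

def score_partition(rows):
    """Score every row of one partition; no row depends on any other row."""
    return [score_row(r) for r in rows]

def split_into_partitions(rows, n_partitions):
    """Split rows into contiguous chunks, similar in spirit to table partitions."""
    size = max(1, -(-len(rows) // n_partitions))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

rows = [120.0, -45.5, 19999.0, 300.0, -15000.0, 7.25, 820.0]
partitions = split_into_partitions(rows, 3)

# One scoring call per partition, executed in parallel; map() keeps partition order
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(score_partition, partitions))

parallel_scores = [s for part in partial_results for s in part]
sequential_scores = score_partition(rows)
print(parallel_scores == sequential_scores)
```

Model fitting does not have this property, which is why the techniques listed above are discussed for the predict call.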



<br>

Let's apply the parallel by partition hint with the predict-call
- "apply_with_hint" allows the use of SQL hints with the anonymous SQL block generated by hana-ml and sent for execution
- The parameter value p1 with PARALLEL_BY_PARAMETER_PARTITIONS refers to the 1st input table of the PAL procedure call, which is the input data table and its physical partitioning scheme to be used to determine the parallelization. 

In [None]:
# Setting the hint for the PAL method object
isof.apply_with_hint('PARALLEL_BY_PARAMETER_PARTITIONS(p1)')

In [None]:
outlier_results = isof.predict(data=hdf_acdoca_slice_id, key='ID', features=outlier_features, 
                        contamination=0.05)

In [None]:
# Important, disabling the hint again for the PAL method object
isof.disable_with_hint()

In [None]:
print(isof.get_predict_execute_statement())

## Setting Isolation Forest parameter values for each "massive" grouping set (optional)

In [None]:
# Filtering the data for illustration on two G/L Accounts
filterSQL=f'"G/L Account" in (720000, 630000)'
hdf_acdoca_slice_id.filter(filterSQL).head(10).collect()

Applying parameters with massive, data-parallel Isolation Forest outlier analysis scenarios
- A parameter value set as a general parameter is applied to all groups that have no group-specific setting (e.g. n_estimators=101)
- group_params allows setting parameters for each individual grouping set by its group-id value
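
The precedence between general and group-specific values can be pictured as a simple dictionary merge. This is an illustrative sketch of the behavior described above, not hana-ml's internal implementation; the helper `effective_params` is hypothetical.

```python
def effective_params(general, group_params, group_id):
    """Group-specific values override general ones; groups without an entry keep the general values."""
    merged = dict(general)  # copy, so the general settings stay untouched
    merged.update(group_params.get(group_id, {}))
    return merged

general = {"n_estimators": 101, "max_samples": 100000}
group_params = {"720000": {"n_estimators": 50, "max_samples": 10000}}

print(effective_params(general, group_params, "720000"))  # group-specific values win
print(effective_params(general, group_params, "630000"))  # falls back to general values
```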

In [None]:
# Massive data-parallel Isolation Forest Outlier Analysis with group-specific parameter values
from hana_ml.algorithms.pal.preprocessing import IsolationForest
parallel_isof = IsolationForest(massive=True, random_state=2,  n_estimators=101, max_samples=100000,
                                group_params={'720000': {'n_estimators':50, 'max_samples' : 10000}, 
                                              '630000': {'n_estimators':50, 'max_samples' : 10000}})

filterSQL=f'"G/L Account" in (720000, 630000)'
parallel_isof.fit(data=hdf_acdoca_slice.filter(filterSQL), group_key="G/L Account", features=outlier_features)

res, err = parallel_isof.predict(data=hdf_acdoca_slice_id.filter(filterSQL), key="ID", group_key="G/L Account", features=outlier_features,  contamination=0.05,   
                                 group_params={'720000': {'contamination': 0.10}, 
                                              '630000': {'contamination':  0.025}})


In [None]:
print(parallel_isof.get_fit_execute_statement())

<br>

Preparing the group_parameters as a python dict-variable and applying it to the method call

In [None]:
mygroup_params=dict({'720000': {'contamination': 0.10}, '630000': {'contamination':  0.025}})

In [None]:
# Massive with reason code
res, err = parallel_isof.predict(data=hdf_acdoca_slice_id.filter(filterSQL), key="ID", group_key="G/L Account", features=outlier_features,  contamination=0.05,   
                       group_params=mygroup_params
                       ,show_explainer=True, explain_scope='outliers', top_k_attributions=5
                      )
display(res.sort('SCORE', desc=True).head(10).collect())

In [None]:
print(parallel_isof.get_predict_execute_statement())

## Model storage and retrieval of outlier Isolation Forest models (optional)

In [None]:
from hana_ml.model_storage import ModelStorage
MODEL_SCHEMA = '<your user schema | or different>' # HANA schema in which models are to be saved
model_storage = ModelStorage(connection_context=myconn, schema=MODEL_SCHEMA)

In [None]:
#isof.model_.collect()

In [None]:
isof.name = 'IF_ACDOCA_OUTLIERMODEL'
model_storage.save_model(model=isof) #if_exists='replace', if_exists='upgrade'

In [None]:
display(model_storage.list_models(display_type='simple')) #display_type: 'complete', 'simple', 'no_reports'

In [None]:
#Retrieve model
isof_reloaded = model_storage.load_model(name='IF_ACDOCA_OUTLIERMODEL', version=1)

In [None]:
# Score the data slice that carries the 'ID' key column with the reloaded model
out = isof_reloaded.predict(data=hdf_acdoca_slice_id, key='ID', features=outlier_features,
                       contamination=0.05)
print(out.head(3).collect())

In [None]:
model_storage.delete_model(name='IF_ACDOCA_OUTLIERMODEL', version=1)
#model_storage.delete_models(name=model.name)
#model_storage.clean_up()

In [None]:
display(model_storage.list_models(display_type='simple'))

## Outlier data generation (optional)
Instructions on how to generate sample outlier data

In [None]:
import random
import pandas as pd
company_codes = ['CC01']
gl_accounts = [str(x) for x in range(400000, 800000, 10000)]
profit_centers = ['PC001', 'PC002', 'PC003']
cost_centers = ['C101', 'C102', 'C103']
functional_areas = ['FA01', 'FA02']
business_areas = ['BA01', 'BA02', 'BA03']
segments = ['S1', 'S2', 'S3']
dc_indicators = ['S', 'H']
doc_types = ['SA', 'SB', 'SC', 'SD']
tx_types = ['TA01', 'TA02', 'TA03']
fin_types = ['P+L Statement', 'Balance Sheet Asset', 'Balance Sheet Liability', 'Equity']
data = []
for _ in range(500):
   amount = round(random.uniform(-20000, 20000), 2)
   data.append([
       random.choice(company_codes),
       random.choice(gl_accounts),
       random.choice(profit_centers),
       random.choice(cost_centers),
       random.choice(functional_areas),
       random.choice(business_areas),
       random.choice(segments),
       random.choice(dc_indicators),
       random.choice(doc_types),
       random.choice(tx_types),
       random.choice(fin_types),
       abs(amount),
       amount
   ])
df = pd.DataFrame(data, columns=[
   'Company Code', 'G/L Account', 'Profit Center', 'Cost Center',
   'Functional Area', 'Business Area', 'Segment', 'Debit/Credit',
   'Accounting Document Type', 'Transaction Type', 'Financial Account Type',
   'Amount (USD)', 'Amount (Transaction)'
])
df.to_csv('acdoca_data.csv', index=False)
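
The generator above draws all amounts uniformly, so it does not plant any known anomalies. If you want a few labeled outliers to check the model against, one option is to scale the amounts of a handful of randomly chosen rows. This is a sketch for illustration only and not part of the workshop data preparation; the helper `inject_amount_outliers`, its `factor` and `seed` defaults, and the shortened sample rows are assumptions.

```python
import random

def inject_amount_outliers(rows, n_outliers, factor=50.0, seed=42):
    """Return a copy of rows with n_outliers amounts scaled by factor, plus their indices.

    Each row is a list whose last two entries are 'Amount (USD)' (absolute value)
    and 'Amount (Transaction)' (signed value), as in the column order used above.
    """
    rng = random.Random(seed)  # local generator, keeps the result reproducible
    result = [list(row) for row in rows]
    outlier_idx = sorted(rng.sample(range(len(result)), n_outliers))
    for i in outlier_idx:
        signed = result[i][-1] * factor
        result[i][-2] = abs(signed)
        result[i][-1] = signed
    return result, outlier_idx

# Shortened rows for illustration: company code, G/L account, and the two amount columns
sample_rows = [["CC01", "400000", 150.0, -150.0],
               ["CC01", "410000", 80.5, 80.5],
               ["CC01", "420000", 1200.0, 1200.0]]
rows_with_outliers, idx = inject_amount_outliers(sample_rows, n_outliers=1)
print(idx, rows_with_outliers)
```

The returned indices can serve as ground truth when comparing against the rows the Isolation Forest model flags.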

In [None]:
from hana_ml.dataframe import create_dataframe_from_pandas

# Upload the generated pandas DataFrame into the HANA table "ACDOCA"
acdoca_hdf = create_dataframe_from_pandas(
        myconn,
        df,
        table_name="ACDOCA",
        force=True,
        replace=True,
        drop_exist_tab=True
        )
print(acdoca_hdf.select_statement)

In [None]:
display(acdoca_hdf.collect())