<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Solve Imbalanced Class Problems with ClearScape Analytics
 <br>       
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 150px; height: auto; margin-top: 20pt;">
  <br>
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial'>Real-world datasets frequently suffer from imbalanced classes. In scenarios like fraud detection, customer churn prediction, or medical diagnosis, one class is often heavily outnumbered by the other. The result? Predictive models that ignore the rare but critical minority class. Left unchecked, imbalanced data can distort your model’s performance and lead to poor decision-making. In this notebook we describe the core challenges posed by imbalanced classes and show how addressing them is key to building better models. This is a practitioner's guide to imbalanced classification, with a focus on putting it into production using Teradata's ClearScape Analytics.</p>

<p style = 'font-size:16px;font-family:Arial'>To get a clearer picture of how imbalanced classes show up in real-world situations, let's check out some common examples. The table below shows cases from industries like finance, healthcare, and marketing, where the uneven split between majority and minority classes can cause costly errors. In each scenario, false positives and false negatives have different consequences, highlighting why it’s so important to tackle the imbalance and keep models performing well. These examples give you a practical sense of what’s at stake when working with imbalanced datasets.</p>

<table style = 'font-size:16px;font-family:Arial;border: 1px solid black'>
  <tr style = 'font-size:16px;font-family:Arial;border: 1px solid black'>
    <th>Real-World Example</th>
    <th>Source of Imbalance</th>
    <th>False Positive</th>
    <th>False Negative</th>  
  </tr>
  <tr style = 'font-size:14px;font-family:Arial;border: 1px solid black'>
      <td><b>Fraud Detection (Credit Card)</b></td>
      <td>Fraudulent transactions are rare</td>
      <td>Flagging a legitimate transaction as fraud, causing customer inconvenience. </td>
      <td>Missing a fraudulent transaction, leading to financial loss.</td> 
  </tr>
  <tr style = 'font-size:14px;font-family:Arial;border: 1px solid black'>
      <td><b>Medical Diagnosis (Cancer)</b></td>
      <td>Most patients tested do not have cancer</td>
      <td>Diagnosing a healthy person with cancer, leading to stress and unnecessary treatment.</td>
      <td>Missing a cancer diagnosis, delaying treatment and reducing survival chances.</td>
  </tr>
  <tr style = 'font-size:14px;font-family:Arial;border: 1px solid black'>
    <td><b>Spam Detection (Email)</b></td>
      <td>Most emails are legitimate</td>
      <td>Marking a legitimate email as spam, causing important communication to be missed.</td>
      <td>Allowing spam into the inbox, increasing the risk of phishing or malware.</td>
  </tr>
  <tr style = 'font-size:14px;font-family:Arial;border: 1px solid black'>
    <td><b>Loan Default Prediction</b></td>
      <td>Most borrowers repay loans on time</td>
      <td>Denying a loan to a creditworthy individual, losing potential business.</td>
      <td>Approving a loan to a high-risk borrower, leading to financial loss from default.</td>
  </tr>
  <tr style = 'font-size:14px;font-family:Arial;border: 1px solid black'>
    <td><b>Churn Prediction (Customer)</b></td>
      <td>Most customers stay, only a few churn</td>
      <td>Targeting a loyal customer for retention efforts, wasting resources.</td>
      <td>Failing to identify a churn risk, losing a valuable customer.</td>
  </tr>
  <tr style = 'font-size:14px;font-family:Arial;border: 1px solid black'>
    <td><b>Predictive Maintenance (Manufacturing)</b></td>
      <td>Equipment failures are rare compared to normal operation</td>
      <td>Incorrectly predicting a failure, causing unnecessary downtime and maintenance costs.</td>
      <td>Failing to predict a true failure, leading to unexpected downtime, lost production, and costly repairs.</td>
  </tr>  
</table>

<p style = 'font-size:16px;font-family:Arial'>Now that we have an idea of what imbalanced classes are and the challenges they bring, the next step is to understand different ways to tackle the problem. The diagram below gives an overview of methods across two dimensions: when in the modeling process they’re applied and how much effort it takes to put them into production. The diagram flows from left to right, covering the different workflow stages like data preprocessing, modeling, and post-processing. The vertical axis shows the effort needed to implement each method in production, although this is a subjective call.</p>

<img src="images/effortvsstage.png"> 

<p style = 'font-size:16px;font-family:Arial'>Experienced data scientists usually approach this issue with a trial-and-error mindset, testing out methods to see how they impact model performance. There’s no one-size-fits-all fix for imbalanced data; every dataset is different, and what works for one might not work for another. In practice, we might get some quick wins by adjusting class weights or using resampling techniques, but more advanced methods like active learning or custom loss functions could be necessary too. By experimenting with these techniques, we can find a solution that strikes the right balance between performance and ease of implementation.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial'><b>1. Connect to Vantage</b></p>


<p style = 'font-size:16px;font-family:Arial'>In this section, we import the required libraries and set environment variables and environment paths (if required).</p>

In [None]:
import warnings
warnings.filterwarnings('ignore')

import getpass
import pandas as pd
import plotly.express as px

import seaborn as sns
import matplotlib.pyplot as plt

from teradataml import *
display.max_rows = 5
# from teradataml import configure
# configure.val_install_location = "val"

<p style = 'font-size:16px;font-family:Arial'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell. Begin running steps with Shift + Enter keys.</p>

In [None]:
%run -i ../../UseCases/startup.ipynb
eng = create_context(host = 'host.docker.internal', username = 'demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=PP_Recipe_Solve_Imbalance_Class_Problems.ipynb;' UPDATE FOR SESSION; ''')

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial'><b>2. Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. We have the option of either running the demo using foreign tables to access the data without using any storage on our environment or downloading the data to local storage, which may yield somewhat faster execution. However, we need to consider available storage. There are two statements in the following cell, and one is commented out. We may switch which mode we choose by changing the comment string.</p>   


In [None]:
# %run -i ../../UseCases/run_procedure.py "call get_data('DEMO_DataImbalance_cloud');"  # Takes about 20 secs
%run -i ../../UseCases/run_procedure.py "call get_data('DEMO_DataImbalance_local');"  # Takes about 50 secs

<p style = 'font-size:16px;font-family:Arial'>Next is an optional step – run it if you want to see the status of the databases/tables created and the space used.</p>

In [None]:
%run -i ../../UseCases/run_procedure.py "call space_report();"

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>3. Analyze the raw data set</b></p>

<p style = 'font-size:16px;font-family:Arial'>Let us start by creating a teradataml DataFrame, a "virtual DataFrame" that points directly to the dataset in Vantage.</p>



In [None]:
df=DataFrame(in_schema("DEMO_DataImbalance","imbalanced_data"))
df

<p style = 'font-size:16px;font-family:Arial'>We are using synthetic data here to show the various techniques that can be used on imbalanced data. As we can see above, this is a dummy dataset.</p>
<p></p>
<p style = 'font-size:16px;font-family:Arial'>We will use the TrainTestSplit function to split the dataset into a training set and a test set.</p>

In [None]:
cols = df.columns
cols.remove('row_id')
cols.remove('classlabel')
all_features = cols
target = "classlabel"
key = "row_id"

In [None]:
# Stratified 80/20 split into training and test sets
DF_splitted = TrainTestSplit(
    data = df, 
    id_column = 'row_id', 
    stratify_column='classlabel', 
    seed = 42, 
    train_size = 0.8
).result

In [None]:
DF_train = (
    DF_splitted
    .loc[DF_splitted.TD_IsTrainRow==1]
    .drop(columns=["TD_IsTrainRow"])
)

In [None]:
DF_test = (
    DF_splitted
    .loc[DF_splitted.TD_IsTrainRow==0]
    .drop(columns=["TD_IsTrainRow"])
)
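
<p style = 'font-size:16px;font-family:Arial'>To make the idea of a stratified split concrete, here is a minimal plain-Python sketch on toy data. It does not describe how <code>TrainTestSplit</code> works internally; the helper <code>stratified_split</code> and the toy rows are assumptions made up for illustration. The point is only that each class is split separately, so the rare class keeps roughly the same share in both parts.</p>

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, train_size=0.8, seed=42):
    """Split rows into train/test, splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)                      # random order within the class
        cut = round(len(members) * train_size)    # same fraction for every class
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Hypothetical toy data: (row_id, classlabel) with a 2% minority class.
rows = [(i, 0) for i in range(980)] + [(i, 1) for i in range(980, 1000)]
train, test = stratified_split(rows, label_of=lambda r: r[1])

print(len(train), sum(label for _, label in train))  # rows and minority rows in train
print(len(test), sum(label for _, label in test))    # rows and minority rows in test
```

<p style = 'font-size:16px;font-family:Arial'>Without stratification, a purely random 80/20 split of such a small minority class could leave the test set with very few positive rows, which makes every downstream metric noisy.</p>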

<hr style="height:1px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>3.1 Initial Analysis: Univariate Statistics, Visualization, and a Baseline Model</b></p>

<p style = 'font-size:16px;font-family:Arial'>To start our analysis, we'll conduct some Exploratory Data Analysis (EDA) to get a better feel for the dataset and set up a baseline model. We’ll use <code>teradataml</code> as our Python client and keep the entire analysis within the database, taking full advantage of in-database processing for speed and scalability.</p>

<p style = 'font-size:16px;font-family:Arial'>EDA is super important when dealing with imbalanced classification problems because it helps us spot patterns, understand how features are distributed, and catch any anomalies that might affect model performance. For imbalanced datasets, it’s especially crucial to see how features relate to the minority class, as even small differences can have a big impact.</p>

<p style = 'font-size:16px;font-family:Arial'>We’ll start by calculating Cohen's d for each feature. This metric tells us the effect size, showing how much each feature separates the minority class from the majority class, helping us find the most relevant features. Then, we’ll visualize the feature with the strongest effect size using bivariate histograms to see how its distribution changes across classes.</p>

<p style = 'font-size:16px;font-family:Arial'>To keep things efficient, we’ve put together a Python module (<code>tdimbalancedlearn</code>, imported as <code>tdimbl</code>) with reusable functions for common tasks. We’ll just import and use these functions directly.</p>


<p style = 'font-size:20px;font-family:Arial'><b>Cohen's D and Boxplots</b></p>

<p style = 'font-size:16px;font-family:Arial'>To get a sense of how each feature affects our target variable, we’ll calculate Cohen’s d using <code>UnivariateStatistics</code> and a bit of DataFrame manipulation. <a href="https://rpsychologist.com/cohend/">Cohen’s d</a> gives us a measure of effect size, helping us find out which features show the biggest difference between the minority and majority classes. This helps us focus on the features that matter most.</p>

<p style = 'font-size:16px;font-family:Arial'>Next, we create a simple boxplot based on these statistics so we can visually compare the distributions. This makes it easy to quickly spot which features have the most significant impact.</p>
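
<p style = 'font-size:16px;font-family:Arial'>As a reference for what the statistic measures, the sketch below computes Cohen’s d in plain Python using the common pooled-standard-deviation definition: the difference of the two group means divided by the pooled standard deviation. Whether <code>tdimbl.get_cohens_d_stats</code> uses exactly this variant is an assumption; the function name <code>cohens_d</code> and the toy values are illustrative only.</p>

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (sample variances)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical feature values for the two classes.
minority_values = [2.0, 4.0, 6.0]
majority_values = [1.0, 3.0, 5.0]

d = cohens_d(minority_values, majority_values)
print(d)  # means differ by 1, pooled standard deviation is 2
```

<p style = 'font-size:16px;font-family:Arial'>The sign of d only tells us which class has the larger mean, which is why the ranking of features is done on the absolute value.</p>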


In [None]:
import tdimbalancedlearn as tdimbl
DF_cohensd = tdimbl.get_cohens_d_stats(DF_train, all_features, target)

In [None]:
this_fig = tdimbl.plot_grouped_boxplot(
    DF_cohensd.to_pandas().sort_values(by= "Attribute", key=lambda x: x.str.extract(r'(\d+)')[0].astype(int)
))

In [None]:
DF_cohensd.select(["Attribute","cohens_d","abs_cohens_d"]).sort("abs_cohens_d", ascending=False)

<p style = 'font-size:16px;font-family:Arial'>We can see that <code>feature_19</code> stands out with the strongest effect size. We’ll focus on this feature next to create bivariate histograms, as it gives the most noticeable separation between the classes.</p>

<p style = 'font-size:20px;font-family:Arial'><b>Histograms: Understanding Feature Distributions</b></p>

<p style = 'font-size:16px;font-family:Arial'>Since <code>feature_19</code> has the largest Cohen’s d, we’ll visualize its distribution across classes using histograms. Keep in mind that Cohen’s d assumes normal distributions, while real-world data, especially in imbalanced cases, doesn’t always play by those rules. Histograms let us check this assumption and see if the groups are actually distinguishable in practice.</p>

<p style = 'font-size:16px;font-family:Arial'>Our <code>plot_histograms</code> function creates both absolute and relative frequency plots for <b>feature_19</b>. The function performs the following steps:</p>
<li style = 'font-size:16px;font-family:Arial'><code>Histogram Calculation:</code> We use the in-database <code>tdml.Histogram</code> function to get the percentile-based distribution of the feature.</li>
<li style = 'font-size:16px;font-family:Arial'><code>Pivoting the Table:</code> The result is then pivoted so we can easily compare the two classes.</li>
<li style = 'font-size:16px;font-family:Arial'><code>In-Database Plotting:</code> Finally, we use in-database plotting with equal-width bins, making sure the bar chart accurately shows how the feature’s values are spread.</li>

<p style = 'font-size:16px;font-family:Arial'>By plotting both absolute and relative frequencies, we get a fuller picture. The <code>absolute frequency plot</code> shows raw counts, which gives us a sense of the majority class’s dominance. The <code>relative frequency plot</code>, on the other hand, scales everything to percentages, so we can better compare the shapes of the distributions regardless of the class sizes.</p>
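
<p style = 'font-size:16px;font-family:Arial'>The difference between the two views can be sketched in a few lines of plain Python. This is not the code inside <code>plot_histograms</code>; the helper names and the toy values are assumptions for illustration. Both classes are binned with the same equal-width bins, and the relative view divides each count by the size of its own class.</p>

```python
def equal_width_counts(values, lo, hi, n_bins):
    """Count values per equal-width bin; assumes lo <= value <= hi."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # value == hi goes to last bin
        counts[idx] += 1
    return counts


def to_relative(counts):
    """Scale counts so that they sum to 1 within one class."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical feature values: 8 majority rows, 2 minority rows.
majority = [0.5, 0.5, 1.5, 1.5, 1.5, 1.5, 2.5, 3.5]
minority = [2.5, 3.5]

abs_major = equal_width_counts(majority, lo=0.0, hi=4.0, n_bins=4)
abs_minor = equal_width_counts(minority, lo=0.0, hi=4.0, n_bins=4)
rel_major = to_relative(abs_major)
rel_minor = to_relative(abs_minor)

print(abs_major, abs_minor)  # raw counts: the majority dominates every bin it occupies
print(rel_major, rel_minor)  # shares: the minority's shape becomes visible
```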

<p style = 'font-size:16px;font-family:Arial'>Looking at the histograms, you can see there’s a lot of overlap between the majority (blue) and minority (red) classes, but there are also areas where the minority class stands out, especially in the negative range of the feature values.</p>

In [None]:
histplot = tdimbl.plot_histograms(DF_train, "feature_19", target)
histplot.show()

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>4. Baseline GLM Model and Performance Check: ROC & AUC</b></p>

<p style = 'font-size:16px;font-family:Arial'>To set up a benchmark, we start with a simple <b><u>G</u></b>eneralized <b><u>L</u></b>inear <b><u>M</u></b>odel <b>(GLM)</b> as our baseline. This "vanilla" version of the GLM is trained without any special techniques to handle the class imbalance. By starting with this basic model, we can later compare its performance to more advanced methods designed to deal with imbalanced data. We’ll measure its effectiveness using the ROC curve and AUC values.</p>

<p style = 'font-size:16px;font-family:Arial'>We take advantage of in-database processing for both training and prediction. The model training uses the <code>GLM</code> function, and the predictions are handled through <code>TDGLMPredict</code>, so there’s no need to move data out of the database.</p>


In [None]:
GLM_obj = GLM(data = DF_train, 
                   input_columns=all_features,
                   response_column = target,
                   family = "BINOMIAL")

In [None]:
DF_pred = TDGLMPredict(
                    object=GLM_obj, 
                    newdata=DF_test,
                    id_column=key, 
                    accumulate=[target], 
                    output_prob=True,
                    output_responses="1").result

<p style = 'font-size:16px;font-family:Arial'>To see how our model performs, we calculate classification metrics and ROC stats, storing everything in database tables for consistency. We use the in-database <code>ClassificationEvaluator</code> function, which gives us both a confusion matrix and a set of standard metrics like precision, recall, and F1-score.</p>

<p style = 'font-size:16px;font-family:Arial'>With imbalanced datasets, these metrics need to be taken with a grain of salt, as they might not show the whole picture. For example, accuracy can look high simply because the majority class dominates. That’s why looking at the confusion matrix or the ROC curve can be more helpful, as these give a more detailed view of how the model is doing, especially for the minority class.</p>
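
<p style = 'font-size:16px;font-family:Arial'>A small worked example shows why accuracy alone is misleading. The confusion-matrix counts below are hypothetical (a classifier that always predicts the majority class on 20,000 rows with 2% positives), and <code>classification_metrics</code> is an illustrative helper, not part of <code>ClassificationEvaluator</code>.</p>

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts; 0.0 when undefined."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical: every row is predicted as the majority class (label 0).
always_negative = classification_metrics(tp=0, fp=0, tn=19600, fn=400)
print(always_negative)  # high accuracy, but recall and F1 are zero
```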

<p style = 'font-size:16px;font-family:Arial'>Next, we calculate the ROC (Receiver Operating Characteristic) curve using the <code>tdml.ROC</code> function. The ROC curve shows how the true positive rate (sensitivity) and the false positive rate change across different threshold levels, giving us a visual sense of the model’s performance at various points.</p>

<p style = 'font-size:16px;font-family:Arial'>The AUC (Area Under the Curve) value gives us a single score to measure the model's performance, showing how likely the model is to rank a random positive example higher than a random negative one. The ROC and AUC are particularly useful because they help us choose the threshold that makes the most business sense. This threshold isn’t always set at 0.5 but instead is based on what maximizes value, considering the costs of false positives and false negatives.</p>
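
<p style = 'font-size:16px;font-family:Arial'>The ranking interpretation of AUC can be checked directly on a toy example. The sketch below counts, over all positive/negative pairs, how often the positive example gets the higher score (ties count as half). It is a plain-Python illustration with made-up scores, not the in-database computation, and it is only practical for small inputs because it compares every pair.</p>

```python
def auc_by_ranking(scores_pos, scores_neg):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical predicted probabilities for class 1.
scores_pos = [0.9, 0.4]        # rows whose true label is 1
scores_neg = [0.8, 0.3, 0.1]   # rows whose true label is 0

auc = auc_by_ranking(scores_pos, scores_neg)
print(auc)  # 5 of the 6 pairs are ranked correctly
```

<p style = 'font-size:16px;font-family:Arial'>A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, independent of any particular threshold.</p>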


In [None]:
tdimbl.clean_classeval(eval_id = "eval")
tdimbl.clean_roc(eval_id = "eval")

In [None]:
# Run the classification evaluation on the predictions and save the results
tdimbl.save_classeval(
    DF_pred, 
    model_id = "vanilla_glm", 
    key= key, 
    target= target, 
    eval_id  = "eval", # Table prefix 
    prediction = "prediction"
)

In [None]:
DF_conf,DF_metrics = tdimbl.get_classeval(eval_id  = "eval")
DF_conf

In [None]:
DF_metrics

In [None]:
# Save ROC statistics for the model
tdimbl.save_roc(
    DF_pred, 
    model_id =  "vanilla_glm", 
    key= key, 
    target = target, 
    probability = "prob_1", 
    eval_id = "eval"
)

In [None]:
DF_roc, DF_auc = tdimbl.get_rocauc(eval_id = "eval")
tdimbl.plot_roc_curves(DF_roc, highlight_model="vanilla_glm")

In [None]:
DF_auc

<p style = 'font-size:16px;font-family:Arial'>Let’s take a look at the initial results: The ROC curve for the <code>vanilla_glm</code> model shows it’s only doing a little better than random guessing, which you can tell by how close it is to the diagonal line. The AUC score of 0.567 indicates that the model doesn’t have much power when it comes to telling the two classes apart. This isn’t too surprising, though, considering the model is quite basic and the dataset is imbalanced.</p>


<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>5. Post Processing Methods</b></p>

<p style = 'font-size:16px;font-family:Arial'>Now that we’ve built a simple baseline model, it’s important to remember that a model’s (probability) predictions don’t automatically translate into business decisions. In real-world scenarios, predictions need to be interpreted and adjusted to match business goals and constraints. In this section, we’ll look at post-processing methods that can help fine-tune model outputs to make them more useful for decision-making.</p>

<p style = 'font-size:16px;font-family:Arial'>We’ll focus on two main techniques: <code>threshold tuning</code> and <code>cost-sensitive tuning</code>. These methods let us fine-tune the decision boundary and adjust the outputs based on the costs of false positives and false negatives. This way, we can make sure the model fits business needs and helps minimize risks.</p>


<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>5.1 Threshold Tuning</b></p>

<p style = 'font-size:16px;font-family:Arial'>In real-world scenarios, business requirements often set specific performance targets for models. We’ve kept our dataset use-case-neutral up until now, but for this section, let’s imagine it’s for predicting credit defaults. For instance, the goal might be to catch at least <b>70%</b> of credit defaults (true positive rate ≥ 0.70). In these cases, the classification threshold should be adjusted to hit that target while keeping the false positive rate (FPR) as low as possible.</p>

<p style = 'font-size:16px;font-family:Arial'>To find the best threshold, we filter the ROC dataframe for points where the true positive rate (TPR) is at least 0.70, then choose the one with the lowest FPR. Here’s how you can do this using a <code>teradataml</code> DataFrame.</p>


In [None]:
DF_roc.loc[DF_roc.tpr>=0.70].sort("fpr")
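
<p style = 'font-size:16px;font-family:Arial'>The same selection rule can be sketched in plain Python on a handful of ROC points. The points and the <code>threshold_value</code> key are hypothetical; only the logic (keep points that meet the TPR target, then take the lowest FPR) mirrors the cell above.</p>

```python
# Hypothetical ROC points, one per candidate threshold.
roc_points = [
    {"threshold_value": 0.50, "tpr": 0.10, "fpr": 0.01},
    {"threshold_value": 0.10, "tpr": 0.55, "fpr": 0.20},
    {"threshold_value": 0.05, "tpr": 0.72, "fpr": 0.41},
    {"threshold_value": 0.02, "tpr": 0.90, "fpr": 0.70},
    {"threshold_value": 0.00, "tpr": 1.00, "fpr": 1.00},
]

target_tpr = 0.70

# Keep only the points that reach the required true positive rate ...
eligible = [p for p in roc_points if p["tpr"] >= target_tpr]

# ... and among those pick the one with the fewest false alarms.
best = min(eligible, key=lambda p: p["fpr"])
print(best)
```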

<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>5.2 Cost-Sensitive Tuning</b></p>

<p style = 'font-size:16px;font-family:Arial'>Cost-sensitive tuning is all about optimizing model predictions based on the business costs and benefits tied to different outcomes. Let’s say our data represents a banking scenario where we’re predicting the probability of credit defaults and deciding whether or not to approve a customer’s credit application. The idea is to choose a threshold that maximizes overall business outcomes, taking into account the financial impact of true positives, false positives, true negatives, and false negatives.</p>

<p style = 'font-size:16px;font-family:Arial'>To do this, we evaluate the confusion matrix at each threshold and multiply the frequency of each outcome (e.g., true positive, false negative) by its business value. Adding these up gives us the overall outcome for each threshold, and our goal is to pick the one that gives the best result.</p>

<p style = 'font-size:16px;font-family:Arial'>The <code>tdimbl.calculate_costs</code> function does this calculation by looping through the ROC dataframe and finding the business outcome for each threshold. We use in-database plotting to see how different thresholds impact the overall outcome, and the <code>WhichMax</code> function helps us find the one with the highest return.</p>

<p style = 'font-size:16px;font-family:Arial'>In our example, here’s how we’ve defined the business values:</p>
<li style = 'font-size:16px;font-family:Arial'><code>True Positive</code>: The customer is denied credit they wouldn’t have been able to repay. No gain or loss. - Outcome: <b>0</b></li>
    
<li style = 'font-size:16px;font-family:Arial'><code>False Positive</code>: The customer is denied credit they could have repaid. Again, no gain or loss. - Outcome: <b>0</b></li>
    
<li style = 'font-size:16px;font-family:Arial'><code>True Negative</code>: The customer is approved and repays, earning interest. - Outcome: <b>+600</b></li>
    
<li style = 'font-size:16px;font-family:Arial'><code>False Negative</code>: The customer is approved but defaults, leading to a loss. - Outcome: <b>-30,000</b></li>
    
<p style = 'font-size:16px;font-family:Arial'>By factoring these values into the cost-sensitive tuning, we make sure that our model’s threshold selection focuses on maximizing business value, rather than just looking at basic performance metrics.</p>
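
<p style = 'font-size:16px;font-family:Arial'>The calculation described above can be sketched as follows. The business values are the ones listed in this section; the confusion-matrix counts per threshold are hypothetical toy numbers for 1,000 applicants, and <code>total_outcome</code> is an illustrative helper rather than the implementation inside <code>tdimbl</code>.</p>

```python
# Business value of each outcome, as defined in this section.
OUTCOME = {"tp": 0, "fp": 0, "tn": 600, "fn": -30000}


def total_outcome(counts, outcome=OUTCOME):
    """Sum of (frequency x business value) over the four outcomes."""
    return sum(counts[name] * outcome[name] for name in outcome)


# Hypothetical confusion-matrix counts at three thresholds (1,000 applicants, 20 defaulters).
confusion_by_threshold = {
    0.01: {"tp": 18, "fp": 700, "tn": 280, "fn": 2},
    0.05: {"tp": 12, "fp": 300, "tn": 680, "fn": 8},
    0.50: {"tp": 0, "fp": 0, "tn": 980, "fn": 20},
}

totals = {thr: total_outcome(c) for thr, c in confusion_by_threshold.items()}
best_threshold = max(totals, key=totals.get)

for thr, value in totals.items():
    print(thr, value)
print("best threshold:", best_threshold)
```

<p style = 'font-size:16px;font-family:Arial'>In this toy example, approving everyone (threshold 0.50) loses money because 20 defaults outweigh 980 successful repayments, which is the asymmetry that drives the real result as well.</p>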

In [None]:
outcome_tp =       0
outcome_fp =       0
outcome_tn = +   600 
outcome_fn = - 30000 

In [None]:
DF_cost = tdimbl.calculate_costs(
    DF_test, 
    outcome_tp, 
    outcome_fp, 
    outcome_tn, 
    outcome_fn, 
    eval_id = "eval"
)

In [None]:
DF_cost.plot(x = DF_cost.threshold_value, 
             y = DF_cost.total_outcome,
                kind="line")

In [None]:
threshold = (
    WhichMax(
        data=DF_cost.select(["threshold_value","total_outcome"]),
        target_column="total_outcome")
    .result.to_pandas().threshold_value.values[0]
)
threshold

<p style = 'font-size:16px;font-family:Arial'>Based on our cost-sensitive analysis, the best threshold from the <code>WhichMax</code> function is <code>0.0101</code>, which could bring in about <code>USD 643,200</code> for the 20,000 customers in our test set. This threshold strikes a balance between approving and denying credit applications, maximizing the overall business value.</p>

<p style = 'font-size:16px;font-family:Arial'>However, keep in mind that this threshold was tuned using the test set, so to make sure it holds up, it’s best to validate it on a separate evaluation set to see if it really works outside of the test data.</p>

<p style = 'font-size:16px;font-family:Arial'>Even though the results are positive, this approach isn’t without issues. At this threshold, <code>14,787</code> out of 20,000 customers get denied credit, even though they could have paid it back. This high rejection rate comes from the fact that the cost of a default (USD -30,000) is way bigger than the profit from successful repayments (USD +600). So, the model focuses on avoiding defaults, even if that means rejecting a lot of good customers.</p>
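
<p style = 'font-size:16px;font-family:Arial'>A short back-of-the-envelope calculation illustrates how strong this asymmetry is. Assuming, hypothetically, that the predicted probabilities were perfectly calibrated (which is not guaranteed for our model), approving an applicant with default probability p has an expected value of (1 - p) * 600 - p * 30,000. The sketch below computes the probability at which this becomes zero; the function name is illustrative.</p>

```python
outcome_tn = 600      # approved and repaid
outcome_fn = -30000   # approved and defaulted


def expected_value_of_approval(p_default):
    """Expected outcome of approving one applicant with default probability p."""
    return (1 - p_default) * outcome_tn + p_default * outcome_fn


# Solve (1 - p) * 600 + p * (-30000) = 0 for p.
break_even = outcome_tn / (outcome_tn - outcome_fn)

print(round(break_even, 4))                 # about 0.0196
print(expected_value_of_approval(0.01))     # positive: approval pays off on average
print(expected_value_of_approval(0.05))     # negative: approval loses money on average
```

<p style = 'font-size:16px;font-family:Arial'>Under this idealized assumption, an applicant with even a 2% chance of default is not worth approving, so a very low decision threshold and a high rejection rate are to be expected with these business values.</p>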

<p style = 'font-size:16px;font-family:Arial'>Doubts aside, deploying this threshold would be pretty simple. It just involves adding a column in the production setup that checks if the predicted probabilities meet the <code>0.0101</code> threshold, making sure the decisions match up with the business value goal.</p>



In [None]:
DF_pred.assign(credit_denied = case([(DF_pred.prob_1 > threshold,1)], else_=0))

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>6. Undersample & Oversample</b></p>

<p style = 'font-size:16px;font-family:Arial'>Next, we focus on the data prep stage and how to fix the imbalance using some simple sampling techniques. The easiest ones are random undersampling, oversampling, and a combo of both. These methods balance the classes by either reducing the majority class (undersampling) or boosting the minority class (oversampling).</p>

<p style = 'font-size:16px;font-family:Arial'>These methods are easy to apply, but they come with trade-offs. Undersampling helps shrink the dataset, which speeds up training and saves resources. However, it can also drop important data points from the majority class, leading to increased variance. Oversampling, meanwhile, duplicates samples from the minority class to balance things out. This way, you don’t lose any data, but there’s a risk the model might overfit by memorizing the repeated samples.</p>

<p style = 'font-size:16px;font-family:Arial'>A combination of the two tries to find a middle ground by trimming the majority class and boosting the minority class, which can help build a more balanced model. But it’s still important to manage the bias and variance carefully. On the technical side, we can easily use the <code>sample</code> function on <code>DF_train</code> for these methods. For undersampling, <code>sample</code> the majority class without replacement. For oversampling, set <code>replace=True</code> to sample with replacement, so that more rows can be drawn than the minority class contains. Also, set <code>randomize=True</code> so that rows are drawn randomly across AMPs rather than proportionally per AMP, and pass a fixed <code>seed</code>, as in the cells below, to keep the sample reproducible.</p>
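
<p style = 'font-size:16px;font-family:Arial'>Before running this in the database, the two basic operations can be sketched in plain Python on toy row ids. The helpers <code>undersample</code> and <code>oversample</code> and the toy data are assumptions for illustration; they are not the teradataml <code>sample</code> API. The key difference is sampling without replacement versus with replacement.</p>

```python
import random


def undersample(majority_rows, n, seed=42):
    """Draw n majority rows WITHOUT replacement (each row at most once)."""
    return random.Random(seed).sample(majority_rows, n)


def oversample(minority_rows, n, seed=42):
    """Draw n minority rows WITH replacement (rows may repeat)."""
    return random.Random(seed).choices(minority_rows, k=n)


# Hypothetical toy data: 100 majority row ids and 10 minority row ids.
majority_rows = list(range(100))
minority_rows = list(range(100, 110))

# Undersampling to a 4:1 ratio: keep 4 * 10 = 40 majority rows.
under = undersample(majority_rows, n=4 * len(minority_rows))

# Oversampling to a 4:1 ratio: grow the minority to 100 / 4 = 25 rows.
over = oversample(minority_rows, n=len(majority_rows) // 4)

print(len(under), len(set(under)))  # no duplicates among the majority rows
print(len(over), len(set(over)))    # duplicates are unavoidable: only 10 distinct rows
```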


In [None]:
# Undersample: we know we have 1600 rows of the minority class,
# and we want a majority-to-minority ratio of 4:1
(DF_train
     .loc[DF_train[target] == 0]
     .sample(n=4*1600, randomize=True, seed=42, id_column=key).drop(columns=["sampleid"])
     .to_sql("undersampled_train", if_exists="replace", primary_index=key))

(DF_train
     .loc[DF_train[target] == 1]
     .to_sql("undersampled_train", if_exists="append"))

In [None]:
DF_undersampled_train = DataFrame("undersampled_train")
DF_undersampled_train.select(["row_id","classlabel"]).groupby("classlabel").count()

In [None]:
# oversample, we know we have 78400 of the majority class. 
# we want to have a ratio of 4:1
(DF_train
     .loc[DF_train[target] == 0]
     .to_sql("oversampled_train", if_exists="replace", primary_index=key))

(DF_train
     .loc[DF_train[target] == 1]
     .sample(n=int(78400/4), 
             replace = True,
             randomize=True, seed=42, id_column=key)
     .drop(columns=["sampleid"])
     .to_sql("oversampled_train", if_exists="append"))

## continue with machine learning

In [None]:
DF_oversampled_train = DataFrame("oversampled_train")
DF_oversampled_train.select(["row_id","classlabel"]).groupby("classlabel").count()

In [None]:
# Resample both classes with replacement to a 4:1 ratio (40,000 majority, 10,000 minority rows)

(DF_train.sample(case_when_then = {
                     DF_train[target] == 0: 40000,
                     DF_train[target] == 1: 10000, },
             replace = True,
             randomize=True, seed=42, id_column="row_id")
      .drop(columns=["sampleid"])
     .to_sql("resampled_train", if_exists="replace"))

## continue with machine learning 

In [None]:
DF_resampled_train = DataFrame("resampled_train")
DF_resampled_train.select(["row_id","classlabel"]).groupby("classlabel").count()

<p style = 'font-size:16px;font-family:Arial'>Next, we train the GLM model again using the resampled datasets, just like we did before. We then evaluate its performance using the original, non-resampled <code>DF_test</code> dataset to see how the different resampling methods affect the results compared to our baseline model.</p>

In [None]:
GLM_undersampled_obj = GLM(data = DF_undersampled_train, 
                   input_columns=all_features,
                   response_column = target,
                   family = "BINOMIAL",
                   volatile = True)

GLM_oversampled_obj = GLM(data = DF_oversampled_train, 
                   input_columns=all_features,
                   response_column = target,
                   family = "BINOMIAL",
                   volatile = True)

GLM_resampled_obj = GLM(data = DF_resampled_train, 
                   input_columns=all_features,
                   response_column = target,
                   family = "BINOMIAL",
                   volatile = True)

DF_pred_undersampled = TDGLMPredict(
        object=GLM_undersampled_obj, 
        newdata=DF_test,
        id_column=key, 
        accumulate=[target], 
        output_prob=True,
        output_responses="1", 
        volatile = True).result
tdimbl.save_roc(DF_pred_undersampled, model_id="undersampled_glm", key=key, target=target )

DF_pred_oversampled = TDGLMPredict(
        object=GLM_oversampled_obj, 
        newdata=DF_test,
        id_column=key, 
        accumulate=[target], 
        output_prob=True,
        output_responses="1", 
        volatile = True).result
tdimbl.save_roc(DF_pred_oversampled, model_id="oversampled_glm" , key=key, target=target )

DF_pred_resampled = TDGLMPredict(
        object=GLM_resampled_obj, 
        newdata=DF_test,
        id_column=key, 
        accumulate=[target], 
        output_prob=True,
        output_responses="1", 
        volatile = True).result
tdimbl.save_roc(DF_pred_resampled, model_id="resampled_glm" , key=key, target=target )

In [None]:
DF_auc

In [None]:
# highlight undersampled because of best AUC value
tdimbl.plot_roc_curves(DF_roc, highlight_model= "undersampled_glm")

<p style = 'font-size:16px;font-family:Arial'>Interestingly, we see a clear improvement, with the undersampling approach delivering the highest AUC among all the methods we’ve tried so far. The ROC curves show that undersampling gives a solid boost in performance compared to the baseline and the other resampling techniques.</p>

<p style = 'font-size:16px;font-family:Arial'>This shows that even a simple method like undersampling can lead to better results, but it’s worth noting that these improvements depend on the specific dataset and might not work in every situation. Also, it’s important to point out that the evaluation was done on the original, non-resampled test dataset to make sure the results are a true reflection of real-world performance.</p>


<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>7. SMOTE and ADASYN</b></p>

<p style = 'font-size:16px;font-family:Arial'>SMOTE (Synthetic Minority Over-sampling Technique) and ADASYN (Adaptive Synthetic Sampling) are more advanced than simple methods like random oversampling or undersampling. Instead of just duplicating existing samples, these techniques create synthetic instances for the minority class based on the feature space, giving us more realistic and varied examples. This helps reduce the overfitting risk that comes with basic oversampling, as the model gets new examples rather than seeing the same ones over and over.</p>

<p style = 'font-size:16px;font-family:Arial'>The big advantage of SMOTE and ADASYN is that they can enrich the minority class with more diverse and useful samples, which can lead to better decision boundaries and overall model performance. SMOTE creates synthetic data points by interpolating between an existing minority sample and one of its nearest minority-class neighbors. ADASYN takes this a step further by generating more synthetic points around minority samples that are hard to learn, meaning those with many majority-class neighbors, so the model gets extra training signal in the tough spots.</p>
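<p style = 'font-size:16px;font-family:Arial'>The interpolation step at the heart of SMOTE can be sketched in a few lines. This is a simplified illustration, not the <code>imbalanced-learn</code> implementation: the helper name <code>smote_point</code> and the sample values are made up, and the nearest-neighbor search is skipped by assuming the neighbor is already known.</p>

```python
import random

def smote_point(x, neighbor, rng):
    """Return a synthetic point on the line segment between x and neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)
x = [1.0, 2.0]          # a minority-class sample (made-up values)
neighbor = [3.0, 6.0]   # one of its minority-class nearest neighbors (made-up)
synthetic = smote_point(x, neighbor, rng)
print(synthetic)
```

<p style = 'font-size:16px;font-family:Arial'>Because the same random factor is applied to every feature, the synthetic point always lies on the straight line between the two real samples.</p>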

<p style = 'font-size:16px;font-family:Arial'>That said, these methods aren’t without their downsides. By generating synthetic data, there’s a risk of adding noise or patterns that don’t actually exist in real-world scenarios, which can lead to overfitting and make the model less generalizable. Plus, if the minority class isn’t well represented in the first place, these synthetic examples might not accurately reflect the data’s true distribution, which could result in biased models.</p>

<p style = 'font-size:16px;font-family:Arial'>To see how effective SMOTE and ADASYN are, we’ll use the <code>imbalanced-learn</code> Python package to resample our dataset. We’ll start by generating synthetic samples with SMOTE, then use ADASYN to see how each approach enriches the minority class differently. Once we’ve done that, we’ll move back to the database, train a GLM model on each of the resampled datasets, and test its performance on the test set, just like we did with the other models.</p>
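<p style = 'font-size:16px;font-family:Arial'>The cells below pass <code>sampling_strategy = 20000.0/80000.0</code> to both samplers. In <code>imbalanced-learn</code>, a float value is the desired ratio of minority rows to majority rows after resampling. Here is a minimal sketch of that arithmetic. The class counts are hypothetical (the real counts of <code>DF_train</code> are not shown at this point), and <code>n_synthetic_needed</code> is an illustrative helper, not a library function.</p>

```python
def n_synthetic_needed(n_majority, n_minority, sampling_strategy):
    """Number of synthetic minority rows needed to reach the target ratio."""
    target_minority = int(sampling_strategy * n_majority)
    # Simplification: if the ratio is already met, nothing is generated.
    return max(target_minority - n_minority, 0)

# Hypothetical class counts, chosen only for illustration.
n_majority, n_minority = 80_000, 1_600
ratio = 20000.0 / 80000.0   # the value used as sampling_strategy in this notebook
extra = n_synthetic_needed(n_majority, n_minority, ratio)
print(ratio, extra)
```

<p style = 'font-size:16px;font-family:Arial'>ADASYN aims for the same ratio but distributes the new points adaptively, so the final minority count it produces is only approximately the target.</p>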


In [None]:
from imblearn.over_sampling import  SMOTE, ADASYN
df_train = DF_train.to_pandas(all_rows = True)

In [None]:
sm = SMOTE( sampling_strategy = 20000.0/80000.0, random_state=42)
X_res, y_res = sm.fit_resample(df_train[all_features], df_train[target])

df_train_smote = pd.DataFrame(X_res, columns=all_features)
df_train_smote[target] = y_res
df_train_smote = df_train_smote.reset_index().rename(columns={'index': key})

In [None]:
copy_to_sql(df_train_smote, "df_train_smote",if_exists='replace', primary_index=key)

In [None]:
DF_train_SMOTE = DataFrame("df_train_smote")
DF_train_SMOTE.select(["row_id","classlabel"]).groupby("classlabel").count()

In [None]:
sm = ADASYN(sampling_strategy=20000.0/80000.0, random_state=42)
X_res, y_res = sm.fit_resample(df_train[all_features], df_train[target])

df_train_adasyn = pd.DataFrame(X_res, columns=all_features)
df_train_adasyn[target] = y_res
df_train_adasyn = df_train_adasyn.reset_index().rename(columns={'index': key})

In [None]:
copy_to_sql(df_train_adasyn, "df_train_adasyn",if_exists='replace', primary_index=key)

In [None]:
DF_train_ADASYN = DataFrame("df_train_adasyn")
DF_train_ADASYN.select(["row_id","classlabel"]).groupby("classlabel").count()

In [None]:
GLM_SMOTE_obj = GLM(data = DF_train_SMOTE, 
                   input_columns=all_features,
                   response_column = target,
                   family = "BINOMIAL",
                   volatile = True)

GLM_ADASYN_obj = GLM(data = DF_train_ADASYN, 
                   input_columns=all_features,
                   response_column = target,
                   family = "BINOMIAL",
                   volatile = True)

DF_pred_SMOTE = TDGLMPredict(
        object=GLM_SMOTE_obj, 
        newdata=DF_test,
        id_column=key, 
        accumulate=[target], 
        output_prob=True,
        output_responses="1", 
        volatile = True).result
tdimbl.save_roc(DF_pred_SMOTE, model_id="SMOTE_glm", key=key, target=target )

DF_pred_ADASYN = TDGLMPredict(
        object=GLM_ADASYN_obj, 
        newdata=DF_test,
        id_column=key, 
        accumulate=[target], 
        output_prob=True,
        output_responses="1", 
        volatile = True).result
tdimbl.save_roc(DF_pred_ADASYN, model_id="ADASYN_glm" , key=key, target=target )

In [None]:
DF_auc

In [None]:
tdimbl.plot_roc_curves(DF_roc, highlight_model= "ADASYN_glm")

<p style = 'font-size:16px;font-family:Arial'>Well, it looks like the results for <b>SMOTE</b> and <b>ADASYN</b> aren’t that different from plain random undersampling. The AUC scores are nearly the same, and the ROC curves show that neither method significantly outperforms it. This suggests that in this case, the more sophisticated sampling techniques didn’t provide any clear advantage. It goes to show that sometimes keeping it simple works just as well, and that the dataset characteristics play a huge role in determining which method will shine.</p>

<p style = 'font-size:16px;font-family:Arial'>Outlook: If you’re using a VantageCloud Lake environment, there’s a way to make things even more efficient by skipping the data movement between pandas and the database. The in-database <code>TD_SMOTE</code> function allows you to apply SMOTE directly within Vantage, saving time and streamlining the workflow. You can check out how it works in the <a href='https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Feature-Engineering-Transform-Functions/TD_SMOTE'>TD_SMOTE documentation.</a></p>


<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>8. Modeling with Adjusted Class Weights</b></p>

<p style = 'font-size:16px;font-family:Arial'>We’ve already covered post-processing and data preprocessing techniques for handling imbalanced classes. Now, let’s move on to adjusting the models themselves, specifically for imbalanced learning. A simple and effective way to do this is by using algorithms that support class weights, so the model can put more focus on the minority class during training.</p>

<p style = 'font-size:16px;font-family:Arial'>When an algorithm has class weights, it tweaks the penalty for getting each class wrong. By giving more weight to the minority class, the model learns to pay closer attention to those cases, boosting its accuracy on those tougher instances.</p>
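<p style = 'font-size:16px;font-family:Arial'>To make the effect of class weights concrete, here is a generic sketch of a class-weighted log loss. It illustrates the idea only and makes no claim about how <code>TD_SVM</code> or <code>TD_GLM</code> apply weights internally. The function name, labels, and predicted probabilities are all made up for the example.</p>

```python
import math

def weighted_log_loss(y_true, p_pred, class_weight):
    """Average binary cross-entropy where each row is scaled by its class weight."""
    total = 0.0
    weight_sum = 0.0
    for y, p in zip(y_true, p_pred):
        w = class_weight[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += w
    return total / weight_sum

y_true = [0, 0, 0, 0, 1]
p_pred = [0.1, 0.1, 0.1, 0.1, 0.2]   # the single positive is badly missed

unweighted = weighted_log_loss(y_true, p_pred, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(y_true, p_pred, {0: 0.1, 1: 1.0})
print(round(unweighted, 3), round(weighted, 3))
```

<p style = 'font-size:16px;font-family:Arial'>With equal weights, the four easy majority rows dilute the one missed minority row. Down-weighting class 0 makes that miss dominate the loss, which is what pushes a model to pay attention to the minority class.</p>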

<p style = 'font-size:16px;font-family:Arial'>Teradata supports class-weighted models using:</p>
<li style = 'font-size:16px;font-family:Arial'><code>SVM (Support Vector Machine)</code>: Teradata’s SVM can use class weights to help set better decision boundaries when the classes aren’t balanced. You can find more details <a href='https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Model-Training-Functions/TD_SVM/TD_SVM-Syntax'>here</a>.</li>
<li style = 'font-size:16px;font-family:Arial'><code>GLM (Generalized Linear Model)</code>: Teradata’s GLM also allows class weights, making it simple to adjust the model to focus more on the minority class. The <a href='https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Database-Analytic-Functions/Model-Training-Functions/TD_GLM/TD_GLM-Syntax'>documentation</a> has more info.</li>

<p style = 'font-size:16px;font-family:Arial'>Since our synthetic dataset is already standardized and centered, we won’t need to do any extra preprocessing. We can go straight into applying the algorithms with the adjusted class weights. Finding the best weights can be tricky, so we’ll use an in-database grid search, another feature from ClearScape, to test different weight setups. This lets us efficiently find the best configuration without having to move any data out of Vantage. The code below sets up the grid and runs the search entirely in the database.</p>


In [None]:
params = {
    "input_columns":all_features, 
    "response_column":target, 
    "model_type":'Classification', 
    "class_weights":('0:0.5, 1:1.0',
                  '0:0.2, 1:1.0',
                   '0:0.1, 1:1.0',
                   '0:0.05, 1:1.0',
                  ), 
}
eval_params = {"id_column": key,
               "accumulate": target}


In [None]:
gs_obj = GridSearch(func=SVM, params=params)

In [None]:
gs_obj.fit(data=DF_train, **eval_params)

In [None]:
gs_obj.models

In [None]:
gs_obj.model_stats

In [None]:
all_models = list(gs_obj.model_stats.MODEL_ID.unique())

In [None]:
try:
    db_drop_table("svm_grid_predictions")
except Exception:
    # the table may not exist yet on a first run
    pass

In [None]:
for model_id in all_models:
    gs_obj.set_model(model_id)
    (gs_obj
         .predict(newdata=DF_test, **eval_params, 
                   output_prob=True, output_responses ="1")
         .result
         .assign(model_id = model_id)
         .to_sql("svm_grid_predictions", if_exists="append")   
    )

In [None]:
DF_pred_gridSVM = DataFrame("svm_grid_predictions")
DF_pred_gridSVM

In [None]:
for model_id in all_models:
    DF_ = DF_pred_gridSVM.loc[DF_pred_gridSVM.model_id == model_id]
    tdimbl.save_roc(DF_, model_id=model_id , key=key, target=target )

In [None]:
tdimbl.plot_roc_curves(DF_roc, all_models)

<p style = 'font-size:16px;font-family:Arial'>The ROC curves for the SVM models show no real difference, with all of them performing at almost the same level as random guessing. This tells us that the SVM didn’t manage to effectively separate the classes in this case.</p>

<p style = 'font-size:16px;font-family:Arial'>This result isn’t too surprising, though. SVMs usually work best when the classes are clearly separated or have distinct boundaries. But when the data points overlap a lot or exist in a continuous space—like we have here—SVMs often struggle to draw a clear line. This outcome suggests that the dataset’s characteristics might just not be a good fit for this algorithm. It’s a reminder that not every model works for every kind of data, so it’s important to choose one that matches the data’s distribution for the best results.</p>


<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>9. Modeling with Ensembles that come with Embedded Rebalancing</b></p>

<p style = 'font-size:16px;font-family:Arial'>Our next method uses ensemble algorithms that are designed to handle imbalanced classes automatically. The <code>imbalanced-learn</code> Python package, which is built on top of scikit-learn, has several powerful ensemble methods just for this. We’ll give these algorithms a try by training the models locally in Python, then deploying them back in Vantage using the Bring Your Own Model (BYOM) setup. This way, we get the best of both worlds—using advanced models externally while still benefiting from Vantage’s capabilities.</p>

<p style = 'font-size:16px;font-family:Arial'>For those interested in learning more about the BYOM workflow, you can explore this <a href='https://github.com/martinhillebrand/ClearscapeCookbook/blob/main/03_sklearn_ONNX/BYOM_recipe_sklearn_ONNX_tdprepview.ipynb'>notebook</a> or review the <a href='https://docs.teradata.com/r/Enterprise_IntelliFlex_Lake_VMware/Teradata-VantageTM-Bring-Your-Own-Model-User-Guide/Welcome-to-Bring-Your-Own-Model'>official documentation</a>.</p>

<p style = 'font-size:16px;font-family:Arial'>Before we get into the deployment process, let’s take a look at the ensemble algorithms available in <code>imbalanced-learn</code>:</p>

<p style = 'font-size:16px;font-family:Arial'><b>Bagging Algorithms:</b></p>
    <li style = 'font-size:16px;font-family:Arial'><code>BalancedRandomForestClassifier</code>: A version of random forest that balances classes during the tree-building phase.</li>

<p style = 'font-size:16px;font-family:Arial'><b>Boosting Algorithms:</b></p>
    <li style = 'font-size:16px;font-family:Arial'><code>RUSBoostClassifier</code>: A twist on AdaBoost that adds random under-sampling (RUS) to the boosting process.</li>

<p style = 'font-size:16px;font-family:Arial'>These algorithms work to balance the classes right during the training phase, ensuring that each subset of data is balanced and doesn’t let one class dominate. This helps build models that are more robust and well-suited for handling imbalanced data. In the next steps, we’ll train these models, evaluate how they perform, and show how to deploy them using Vantage.</p>
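<p style = 'font-size:16px;font-family:Arial'>The idea of giving each ensemble member a balanced subset can be sketched as a balanced bootstrap. This is a simplified illustration and not the actual sampling code of <code>BalancedRandomForestClassifier</code> or <code>RUSBoostClassifier</code>, whose options differ between versions. The helper name <code>balanced_bootstrap</code> and the toy labels are assumptions.</p>

```python
import random
from collections import Counter

def balanced_bootstrap(labels, rng):
    """Draw row indices so every class contributes as many rows as the rarest class."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_per_class = min(len(rows) for rows in by_class.values())
    sample = []
    for rows in by_class.values():
        # Sample with replacement, as a bootstrap does.
        sample.extend(rng.choices(rows, k=n_per_class))
    return sample

labels = [0] * 95 + [1] * 5          # toy 95:5 imbalance
rng = random.Random(0)
idx = balanced_bootstrap(labels, rng)
counts = Counter(labels[i] for i in idx)
print(counts)
```

<p style = 'font-size:16px;font-family:Arial'>Each tree or boosting round would then be fitted on a different draw like this one, so no single learner sees the full imbalance, while the ensemble as a whole still covers most of the majority class.</p>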


<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>9.1 Models Training</b></p>

<p style = 'font-size:16px;font-family:Arial'>We start by loading the training dataset <code>DF_train</code> into a local pandas DataFrame. Next, we use the <code>imbalanced-learn</code> package to set up some ensemble algorithms that are built to work well with imbalanced data.</p>


In [None]:
df_train = DF_train.to_pandas(all_rows=True)

In [None]:
from imblearn.ensemble import (
    BalancedRandomForestClassifier, 
    RUSBoostClassifier)

model_rusboost = RUSBoostClassifier(algorithm='SAMME',random_state=42)

model_balancedrandomforest = BalancedRandomForestClassifier(random_state=42)


model_rusboost.fit(df_train[all_features], df_train[target])

model_balancedrandomforest.fit(df_train[all_features], df_train[target])

<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>9.2 imblearn to sklearn models</b></p>

<p style = 'font-size:16px;font-family:Arial'>To deploy our models with BYOM, we need to convert them to ONNX format. The catch? The ONNX conversion tools only work with standard scikit-learn models, not directly with the <code>imbalanced-learn</code> classifiers.</p>

<p style = 'font-size:16px;font-family:Arial'>But no worries, we’ve got a simple workaround. We "rewrap" the fitted <code>imbalanced-learn</code> models into their equivalent scikit-learn versions. Since these models are built on top of scikit-learn, swapping them like this keeps their prediction behavior the same while making them ONNX-friendly. The key here is that <code>imbalanced-learn</code> only changes the training process. When it comes to prediction, the fitted models behave just like regular scikit-learn models, so this trick lets us convert them without any fuss.</p>

<p style = 'font-size:16px;font-family:Arial'>Here’s how the conversions look:</p>
<li style = 'font-size:16px;font-family:Arial'>RUSBoostClassifier → sklearn.ensemble.AdaBoostClassifier</li>
<li style = 'font-size:16px;font-family:Arial'>BalancedRandomForestClassifier → sklearn.ensemble.RandomForestClassifier</li><p></p>

<p style = 'font-size:16px;font-family:Arial'>Once we’ve done the swap, we test <code>predict_proba()</code> on both versions to make sure they give the same results. This way, they’re all set for ONNX export and ready to deploy in Vantage. If you’re curious about how this works, check out the code in <code>tdimbalancedlearn.py</code>.</p>
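<p style = 'font-size:16px;font-family:Arial'>The source of <code>tdimbl.compare_predictproba</code> is not reproduced in this notebook. A check of this kind could look roughly like the sketch below, which compares two tables of class probabilities element by element. The function name <code>same_probabilities</code>, the tolerance, and the sample values are assumptions chosen for illustration.</p>

```python
import math

def same_probabilities(proba_a, proba_b, abs_tol=1e-9):
    """True if two row-by-class probability tables agree element-wise within abs_tol."""
    if len(proba_a) != len(proba_b):
        return False
    for row_a, row_b in zip(proba_a, proba_b):
        if len(row_a) != len(row_b):
            return False
        for a, b in zip(row_a, row_b):
            if not math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol):
                return False
    return True

# Made-up outputs standing in for predict_proba() of the two model versions.
proba_original = [[0.9, 0.1], [0.3, 0.7]]
proba_rewrapped = [[0.9, 0.1], [0.3, 0.7]]
proba_different = [[0.9, 0.1], [0.4, 0.6]]
print(same_probabilities(proba_original, proba_rewrapped))
print(same_probabilities(proba_original, proba_different))
```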


In [None]:
model_dict = {"rusboost":{"imblearnmodel":model_rusboost},
              "balancedrandomforest":{"imblearnmodel":model_balancedrandomforest},
             }

In [None]:
for modelid  in model_dict.keys():
    print(modelid)
    imblearn_model =  model_dict[modelid]["imblearnmodel"]
    new_sklearn_model = tdimbl.convert_imblearn2sklearn(imblearn_model)
    
    same_predictions = tdimbl.compare_predictproba(df_train[all_features],
           imblearn_model, new_sklearn_model)
    assert same_predictions
    model_dict[modelid]["sklearnmodel"] = new_sklearn_model

In [None]:
model_dict

<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>9.3 ONNX Conversion</b></p>

<p style = 'font-size:16px;font-family:Arial'>Next up, we need to convert our scikit-learn models into ONNX format to get them ready for deployment. ONNX (Open Neural Network Exchange) is an open format that makes it easy to use models across different machine learning platforms, which is perfect for deploying them inside Teradata Vantage. If you want to dig into the details of the conversion process, you can check out more info <a href='https://onnx.ai/sklearn-onnx/'>here</a>.</p>


In [None]:
from skl2onnx import convert_sklearn, to_onnx
from skl2onnx.common.data_types import FloatTensorType

TARGET_OPSET = 12 

# Define the input type based on the number of features in your dataset
n_features = len(all_features)
initial_type = [('float_input', FloatTensorType([None, n_features]))]

for modelid  in model_dict.keys():
    print(modelid)
    sklearn_model = model_dict[modelid]["sklearnmodel"]
    onnx_model = convert_sklearn(
        sklearn_model, 
        modelid,
        initial_types=initial_type,
        target_opset=TARGET_OPSET)

    model_dict[modelid]["onnxmodel"] = onnx_model

<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>9.4 Local ONNX Test</b></p>

<p style = 'font-size:16px;font-family:Arial'>To check that the ONNX models work as they should, we run them through an <code>onnxruntime</code> session to see if they produce the same predictions as the scikit-learn ones. This step ensures the conversion was successful and everything is functioning properly before deploying to Vantage. The tests show consistent results, so we’re confident the models are solid.</p>


In [None]:
import numpy as np
from onnxruntime import InferenceSession

for modelid  in model_dict.keys():
    print(modelid)
    onnx_model = model_dict[modelid]["onnxmodel"]
    sess = InferenceSession(
        onnx_model.SerializeToString(), providers=["CPUExecutionProvider"]
    )
    input_name = sess.get_inputs()[0].name
    proba_name = sess.get_outputs()[1].name  # second output holds the class probabilities
    pred_onx = sess.run(
        [proba_name], 
        {input_name: df_train[all_features].head(1).astype(np.float32).values})[0]
    
    model_dict[modelid]["model_pred"] = pred_onx
    

for modelid  in model_dict.keys():
    print(modelid)
    print(model_dict[modelid]["imblearnmodel"].predict_proba(df_train[all_features].head(1)))
    print(model_dict[modelid]["model_pred"])

<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>9.5 Save to Files, Upload to Vantage, Predict & Evaluate</b></p>

<p style = 'font-size:16px;font-family:Arial'>The ONNX models are saved to files and uploaded into Vantage. From there, we use <code>ONNXPredict</code> for in-database predictions and evaluate how the models perform by looking at the ROC chart.</p>


In [None]:
# This needs to be set so that Vantage can find the scoring functions. Very often it is "mldb".
configure.byom_install_location = "mldb"

for modelid  in model_dict.keys():
    with open(f"{modelid}.onnx", "wb") as f:
        f.write(model_dict[modelid]["onnxmodel"].SerializeToString())

try:
    db_drop_table("model_repo")
except Exception:
    # the table may not exist yet on a first run
    pass

for modelid  in model_dict.keys():
    save_byom(
        modelid,
        f"{modelid}.onnx",
        table_name="model_repo"
    )

In [None]:
DF_test.to_sql("df_test", if_exists="replace")
DF_test = DataFrame("df_test")

In [None]:
from sqlalchemy import literal_column
for modelid  in model_dict.keys():
    print(modelid)
    myonnxmodel = retrieve_byom(modelid,"model_repo")
    DF_pred = ONNXPredict(
        newdata=DF_test,
        modeldata=myonnxmodel, 
        accumulate=[key, target], 
        overwrite_cached_models = "true",
        model_input_fields_map="float_input="+",".join(all_features),
        volatile = True
    ).result
    
    DF_pred_onnx = DF_pred.assign(
        json_report_json =  literal_column(f"NEW JSON (json_report)"),
        prob_distribution = literal_column(f"json_report_json.JSONExtract('$.output_probability[0].value')"),
        prob_1 =  literal_column(f"CAST(prob_distribution.JSONExtractValue('$[0].1') AS FLOAT)"), 
    )
    
    tdimbl.save_roc(DF_pred_onnx, model_id=modelid , key=key, target=target )
    

In [None]:
DF_auc.sort("AUC", ascending=False)

In [None]:
tdimbl.plot_roc_curves(DF_roc, model_list = list(model_dict.keys()))

<p style = 'font-size:16px;font-family:Arial'>Now, it’s time to compare the models, and the results speak for themselves. The ensemble models really shine, with <code>BalancedRandomForest</code> leading the way with the highest AUC scores. This shows just how effective specialized algorithms can be for handling imbalanced data. In a real-world setting, these kinds of improvements could mean the right customers get the credit they need, while the bank reduces losses and boosts profits—a win-win for everyone.</p>


<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>10. Reframe as Anomaly Detection</b></p>

<p style = 'font-size:16px;font-family:Arial'>Sometimes, it makes more sense to treat an imbalanced classification problem as an anomaly detection task. This works well when you don’t have any known outliers and assume that all the existing data shows "normal" behavior. The idea is to train the model using only the majority class, ignoring the minority class, and then flag anything that looks different as an anomaly. We’ll try two methods here: an in-database One-Class SVM and an open-source approach with Isolation Forest, which we later deploy using ONNX and BYOM.</p>

<p style = 'font-size:16px;font-family:Arial'>The <code>One-Class SVM</code> tries to create a boundary that captures most of the "normal" data, marking anything outside this boundary as an anomaly. Even though SVM methods haven’t worked great on this dataset so far, it’s still worth testing it in this anomaly detection setup.</p>

<p style = 'font-size:16px;font-family:Arial'>The <code>Isolation Forest</code> is a tree-based method that splits features randomly to isolate data points. It’s good at spotting anomalies because unusual points tend to be isolated after only a few random splits, so they end up with short paths in the trees, while normal points need many more splits. That makes it a candidate for imbalanced datasets like ours. This method works well when the minority class behaves very differently from the majority.</p>
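<p style = 'font-size:16px;font-family:Arial'>The link between path length and anomaly score can be written down directly. The sketch below follows the score definition from the original Isolation Forest paper, where a point’s mean path length is normalized by the average path length expected for a sample of size n. The function names are illustrative. Note that scikit-learn reports a shifted, sign-flipped version of this score, so lower values mean more anomalous there.</p>

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """c(n): average path length of an unsuccessful search in a binary search tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA   # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """Score in (0, 1]; values near 1 mean the point was isolated after very few splits."""
    return 2.0 ** (-mean_path_length / average_path_length(n))

n = 256                      # sub-sample size per tree suggested in the paper
c = average_path_length(n)
print(round(anomaly_score(2.0, n), 3))    # isolated quickly
print(round(anomaly_score(c, n), 3))      # isolated at average depth
print(round(anomaly_score(2 * c, n), 3))  # isolated late
```

<p style = 'font-size:16px;font-family:Arial'>A point isolated at exactly the average depth scores 0.5, quickly isolated points score closer to 1, and points buried deep in the trees score closer to 0.</p>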

<p style = 'font-size:16px;font-family:Arial'>By switching to an anomaly detection approach, we can use algorithms designed to spot unusual patterns, which might be more effective at catching rare events like credit defaults when traditional classification methods don’t quite cut it.</p>


<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>10.1 In-DB OneClassSVM</b></p>

<p style = 'font-size:16px;font-family:Arial'>The OneClassSVM is trained directly in the database, using only the majority class data. We then evaluate the model with the ROC curve to see how well it detects anomalies, treating the minority class instances as outliers or deviations from the norm.</p>

In [None]:
#model fitting
svm_obj = OneClassSVM(
    data=DF_train.loc[DF_train[target]==0],     
    input_columns=all_features, 
    volatile=True
)

# predictions
svm_pred_obj = OneClassSVMPredict(
    object=svm_obj, 
    newdata=DF_test,
    id_column=key,
    accumulate=[target], output_prob=True,
    output_responses= ["0","1"],
   volatile=True
)

DF_pred_svm = (svm_pred_obj.result
 .drop(columns = ["prediction", "prob_1" ])
 .assign(prob_1 = svm_pred_obj.result.prob_0)
 .drop(columns = [ "prob_0" ]))

tdimbl.save_roc(DF_pred_svm, model_id="oneclass_svm" , key=key, target=target )

<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>10.2 IsolationForest with BYOM</b></p>

<p style = 'font-size:16px;font-family:Arial'>We train the Isolation Forest model using scikit-learn and then deploy it with BYOM. Unlike regular classifiers, the Isolation Forest gives an anomaly score instead of a probability. Negative scores mean outliers (anomalies), and positive scores mean normal instances.</p>

<p style = 'font-size:16px;font-family:Arial'>To turn these scores into something usable for setting thresholds, we apply a logistic (sigmoid) transformation that maps them to values between 0 and 1, with more negative scores landing closer to 1. While these values aren’t true probabilities, they still work well for setting thresholds that make sense for business decisions.</p>
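<p style = 'font-size:16px;font-family:Arial'>The SQL expression used further down, <code>1 - 1 / (1 + EXP(-10 * anomaly_score))</code>, is a logistic function with a steepness factor of 10. A plain Python mirror of it shows how scores map to the 0 to 1 range. The function name and the sample scores are made up for illustration.</p>

```python
import math

def score_to_pseudo_probability(anomaly_score, steepness=10.0):
    """Map an Isolation Forest decision score to (0, 1); negative scores land above 0.5."""
    return 1.0 - 1.0 / (1.0 + math.exp(-steepness * anomaly_score))

for s in (-0.2, 0.0, 0.2):
    print(s, round(score_to_pseudo_probability(s), 3))
```

<p style = 'font-size:16px;font-family:Arial'>A score of 0 maps to 0.5, and the steepness factor controls how quickly values saturate towards 0 or 1. Since ROC and AUC depend only on the ranking of scores, this monotonic rescaling does not change them.</p>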


In [None]:
from sklearn.ensemble import IsolationForest
modelid = "isolation_forest"
# Filter majority class (0) for training the Isolation Forest
df_majority = df_train[df_train[target] == 0]

# Train Isolation Forest only on majority class
iso_forest = IsolationForest(contamination=0.02, random_state=42)
iso_forest.fit(df_majority[all_features])

In [None]:
from skl2onnx import to_onnx



iso_forest_onnx = to_onnx(
        iso_forest, 
        name= modelid,
        initial_types=initial_type,
        target_opset={"": TARGET_OPSET, "ai.onnx.ml": 3}
)

with open(f"{modelid}.onnx", "wb") as f:
        f.write(iso_forest_onnx.SerializeToString())

save_byom(
    modelid,
    f"{modelid}.onnx",
    table_name="model_repo"
)

myonnxmodel = retrieve_byom(modelid,"model_repo")

DF_pred = ONNXPredict(
    newdata=DF_test,
    modeldata=myonnxmodel, 
    accumulate=[key, target], 
    overwrite_cached_models = "true",
    model_input_fields_map="float_input="+",".join(all_features),
    volatile = True
).result.assign(
    json_report_json = literal_column(f"NEW JSON (json_report)"),
    prob_distribution =  literal_column(f"json_report_json.JSONExtract('$.scores[0]')"),
    anomaly_score =  literal_column(f"CAST(prob_distribution.JSONExtractValue('$[0].0') AS FLOAT)"),   
)

In [None]:
# translate anomaly_score into a pseudo-probability for the positive class
DF_pred_onnx = DF_pred.assign(
 prob_1 = literal_column(f"1- 1 / (1 + EXP(-10 * (anomaly_score)))"), 
)

In [None]:
tdimbl.save_roc(DF_pred_onnx, model_id=modelid , key=key, target=target )

<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>10.3 Compare Model Performance</b></p>

In [None]:
DF_auc.loc[DF_auc.model_id.isin(["isolation_forest","oneclass_svm"])]

In [None]:
tdimbl.plot_roc_curves(DF_roc, model_list = ["isolation_forest","oneclass_svm"])

<p style = 'font-size:16px;font-family:Arial'>In this case, Isolation Forest didn’t quite work, but that doesn’t mean it’s not worth trying in other scenarios. It’s great when you’re dealing with anomalies that are rare, clearly separated, and structurally different from the norm—like detecting network intrusions, spotting fraud, or identifying rare equipment breakdowns. Isolation Forest is built for finding these rare, isolated cases (hence the name) by splitting the data with random cuts. It works best when anomalies are grouped in distinct clusters or stand apart in simpler spaces. However, based on our EDA, we saw that the classes aren’t very separable, with Cohen’s d only reaching 0.5 in the best features, which likely explains why it didn’t do well here. In the end, the data itself will point you toward the right approach.</p>


<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>11. Comparison of All Approaches</b></p>

<p style = 'font-size:16px;font-family:Arial'>Time for the showdown! We’ve reached the final stage of our tests, where we see how all the models compare when dealing with the imbalanced dataset. Below is the table showing AUC and Gini scores for each one. The results are clear: ensemble methods designed for imbalanced data, like <code>BalancedRandomForest</code> and <code>RUSBoost</code>, come out on top with the best AUC and Gini scores. This suggests they’re the most effective at telling the classes apart.</p>

<p style = 'font-size:16px;font-family:Arial'>In contrast, resampling methods like undersampling, SMOTE and ADASYN with GLM offer some improvement but don’t reach the same level as the specialized ensembles. This indicates that while resampling is an easy way to balance the classes, it might not be enough for more complex decision boundaries.</p>

<p style = 'font-size:16px;font-family:Arial'>The <code>OneClass SVM</code> and <code>Isolation Forest</code>, which we framed as anomaly detection models, didn’t perform well here. The low AUC scores show that these methods might not be the best choice when labels for both classes are available and the class distributions overlap heavily, as they do in this dataset.</p>

<p style = 'font-size:16px;font-family:Arial'>Lastly, the <code>SVMs</code> with class weights didn’t perform well either, which isn’t surprising since the data didn't offer an easy way to split the classes.</p>
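<p style = 'font-size:16px;font-family:Arial'>For readers less familiar with the two metrics in the table: AUC is the probability that a randomly chosen positive is scored higher than a randomly chosen negative, and the Gini coefficient is commonly defined as 2 × AUC − 1. How <code>tdimbl</code> computes them is not shown here, so the sketch below uses these standard definitions with made-up labels and scores.</p>

```python
def auc_from_scores(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def gini_from_auc(auc):
    """Rescale AUC so that random guessing maps to 0 and a perfect ranking to 1."""
    return 2.0 * auc - 1.0

y_true = [0, 0, 0, 1, 0, 1]
scores = [0.1, 0.3, 0.2, 0.8, 0.7, 0.4]
auc = auc_from_scores(y_true, scores)
gini = gini_from_auc(auc)
print(auc, gini)
```

<p style = 'font-size:16px;font-family:Arial'>Because Gini is a linear rescaling of AUC, sorting the table by either column gives the same ranking of models.</p>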


In [None]:
DF_auc.to_pandas().sort_values("AUC", ascending=False)

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>Conclusion</b></p>

<p style = 'font-size:16px;font-family:Arial'>We’ve covered a lot of ground in this notebook, exploring different strategies for handling imbalanced classification, from simple sampling techniques and quick post-processing tricks to more advanced ensemble methods and anomaly detection. The key takeaway? It’s all about the data: knowing its patterns, its behavior, and how your model’s predictions impact the business. But staying practical is crucial. While advanced methods can definitely improve things, sometimes the simpler approach works just as well with far less effort. It’s all about finding that sweet spot between effort and results to build models that deliver real value.</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>12. Cleanup</b></p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Work Tables</b></p>

In [None]:
tables_tobedeleted =['model_repo',
 'oversampled',
 'oversampled_train',
 'resampled_train',
 'df_train_smote',
 'df_train_adasyn',
 'sampling_predictions',
 'svm_grid_predictions',
 'undersampled',
 'undersampled_train',
 'df_test',
 'eval_auc',
 'eval_confusion',
 'eval_metrics',
 'eval_roc',
 'iris_enc_sample']
for t in tables_tobedeleted:
    try:
        db_drop_table(t)
    except Exception:
        # ignore tables that were never created in this session
        pass

<p style = 'font-size:18px;font-family:Arial'><b>Databases and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../../UseCases/run_procedure.py "call remove_data('DEMO_DataImbalance');" 
#Takes 40 seconds

In [None]:
remove_context()

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>High-Effort Approaches: What’s Next?</b></p>

<p style = 'font-size:16px;font-family:Arial'>In this notebook, we covered a bunch of methods for dealing with imbalanced data, focusing on techniques that are easy to implement and deploy in real-world scenarios. But as our initial diagram showed, there are some more advanced approaches out there that, while they need more time and resources, can be super effective for the right use cases. These methods are complex, but here’s a quick overview of these high-effort strategies and when they might be worth diving into.</p>

<p style = 'font-size:18px;font-family:Arial'><b>Advanced Feature Engineering</b></p>

<p style = 'font-size:16px;font-family:Arial'>Feature engineering is a game-changer, especially with tricky data like time series or mixed data types. If you’re working with time series data, transforming it into the frequency domain (think Fourier or Wavelet Transforms) can reveal hidden patterns you wouldn’t spot otherwise. These features can seriously boost the model’s ability to tell the classes apart. But heads up—advanced feature engineering needs a lot of domain knowledge and experimenting, so it’s resource-heavy and might require teaming up with domain experts.</p>

<p style = 'font-size:18px;font-family:Arial'><b>Advanced EDA on Unused Data Sources</b></p>

<p style = 'font-size:16px;font-family:Arial'>Exploring and using unused or unstructured data sources can totally change the game when detecting rare events. There might be logs, text reports, or sensor data just waiting to be tapped into. Doing some deep-dive EDA on these sources can uncover new patterns that could boost your model’s performance. For example, text mining or NLP techniques could help you extract useful insights from reports. But integrating these sources means investing in data cleaning, processing, and possibly building new ways to capture and handle the data—definitely not a quick fix.</p>

<p style = 'font-size:18px;font-family:Arial'><b>New Data Sources: Stepping Up Data Collection</b></p>

<p style = 'font-size:16px;font-family:Arial'>Sometimes, the data we’ve got just doesn’t cut it. In manufacturing, for instance, adding new sensors or cameras to the production line can give a fuller view of the process, helping spot product defects you’d otherwise miss. But bringing in new data sources often comes with big costs: new hardware, new software, and a solid pipeline to manage all the additional data. It’s worth it if accuracy is a top priority, but it’s not a small investment.</p>

<p style = 'font-size:18px;font-family:Arial'><b>Advanced Data Synthesis</b></p>

<p style = 'font-size:16px;font-family:Arial'>If you can’t fix data collection issues or class imbalance through traditional methods, advanced data synthesis might be the way to go. Techniques like GANs (using packages like <code>ctgan</code>) can generate synthetic data points that mimic the minority class, bulking up your dataset without needing additional real-world observations. It’s powerful but tricky: making sure the synthetic data is realistic and unbiased takes serious validation work. It also needs a lot of computing power and expertise, so it’s not for quick, easy wins.</p>

<p style = 'font-size:18px;font-family:Arial'><b>Custom Cost Function Integration</b></p>

<p style = 'font-size:16px;font-family:Arial'>Another route is to create custom cost functions that fit your business needs and build them right into your model. This lets you optimize for what really matters to your business, rather than for generic metrics like accuracy or AUC. For example, if false negatives are far more costly than false positives, a custom loss function can make the model penalize those mistakes more heavily. But this approach needs a deep understanding of the business and often requires adjusting the training algorithm itself, making it a fairly technical and time-consuming solution.</p>
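
<p style = 'font-size:16px;font-family:Arial'>As a rough illustration of penalizing false negatives more heavily, the sketch below implements a weighted binary cross-entropy in NumPy. The function name <code>weighted_log_loss</code> and the 5:1 weighting are assumptions made for this example; real weights should come from the actual business cost of each error type, and plugging such a loss into a training algorithm usually also requires its gradient.</p>

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, fn_weight=5.0, fp_weight=1.0):
    """Binary cross-entropy with separate weights for errors on positives and negatives.

    fn_weight scales the loss on true positives (missed positives become expensive);
    fp_weight scales the loss on true negatives. Both defaults are illustrative.
    """
    y = np.asarray(y_true, dtype=float)
    # Clip probabilities to keep the logarithm finite.
    p = np.clip(np.asarray(p_pred, dtype=float), 1e-12, 1 - 1e-12)
    per_row = -(fn_weight * y * np.log(p) + fp_weight * (1 - y) * np.log(1 - p))
    return float(per_row.mean())

# One positive and one negative example.
y = np.array([1, 0])
miss_positive = weighted_log_loss(y, [0.1, 0.1])  # model confidently misses the positive
false_alarm = weighted_log_loss(y, [0.9, 0.9])    # model confidently flags the negative
print(miss_positive, false_alarm)
```

<p style = 'font-size:16px;font-family:Arial'>With equal weights the two mistakes would cost the same; with the weighting above, missing the positive case is penalized much more than raising a false alarm, which is the behavior described in the paragraph above.</p>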

<p style = 'font-size:18px;font-family:Arial'><b>Active Learning</b></p>

<p style = 'font-size:16px;font-family:Arial'>When labeled data is scarce or expensive, active learning can help a lot. The model itself picks the samples that would be most useful to label next, so performance improves with as little labeling as possible. It’s a good fit for fields like medical diagnostics or legal review, where labeling is costly. But active learning isn’t simple: it needs iterative model training, human-in-the-loop interactions, and careful sampling strategies, which all add up to a lot of time and effort.</p>
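
<p style = 'font-size:16px;font-family:Arial'>One common sampling strategy is uncertainty sampling: send the unlabeled rows the current model is least sure about to a human labeler. The sketch below shows only the selection step for a binary classifier; the function name <code>select_most_uncertain</code> and the example probabilities are hypothetical, and a full active learning loop would retrain the model after each labeling round.</p>

```python
import numpy as np

def select_most_uncertain(probabilities, k):
    """Return indices of the k samples whose positive-class probability is closest to 0.5."""
    p = np.asarray(probabilities, dtype=float)
    # Distance from 0.5 measures confidence; a small distance means an uncertain prediction.
    distance = np.abs(p - 0.5)
    # A stable sort keeps the original order among ties.
    order = np.argsort(distance, kind="stable")
    return order[:k].tolist()

# Hypothetical model scores for five unlabeled rows.
pool_probs = [0.02, 0.48, 0.97, 0.55, 0.30]
to_label = select_most_uncertain(pool_probs, k=2)
print(to_label)
```

<p style = 'font-size:16px;font-family:Arial'>Here the rows scored 0.48 and 0.55 are selected, since the model is closest to undecided on them. Rows scored near 0 or 1 are left unlabeled because a label for them is unlikely to change the model much.</p>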


<hr style="height:1px;border:none;">

<p style = 'font-size:16px;font-family:Arial'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Medium Blog Posts: <a href = 'https://medium.com/teradata'>here</a></li>
</ul>

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2025. All Rights Reserved
        </div>
    </div>
</footer>