## <u>Anomaly Detection</u>
###### After the data has been ingested into the **PesaTransactionsV1** Lakehouse, the project flow is divided into the following main steps.

1) Train the model and test the data for anomalies.
2) Use OpenAI GPT-4o mini to assign each anomaly a risk level (low/medium/high), explain the reasoning behind it, and draft the alert message that will be sent to the risk analysts.
3) Connect to Azure Blob Storage and upload a `.txt` file for each anomaly found (each file contains the alert message with the key details).

After the last step, Power Automate takes over. We've set up a trigger that listens for added or modified files in the blob storage, followed by steps that email the alert message to the risk analysts.

##### **STEP 1:** Train the model and test the data for anomalies.

In [None]:
# Read the data and initiate the DataFrame.
df = spark.read.table("PesaTransactionsV1.PesaTransactions")
df = df.toPandas()

In [None]:
# Select and prepare relevant features (Acc No, Posting Date, Amount).
import pandas as pd
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

df["AccountNumber"] = df["NewAccountNo"].str.extract(r"-(\d+)-")
df["PostingDate"] = pd.to_datetime(df["PostingDate"])
df["Day"] = df["PostingDate"].dt.dayofyear

# The amount needs to be normalized
df["Scaled_Amount"] = scaler.fit_transform(df[["Amount"]])

# Determine the features
features = df[["AccountNumber", "Scaled_Amount", "Day"]]
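
The scaling step above standardizes `Amount` into z-scores. As a minimal sketch of what that computes, the block below reproduces the formula with the standard library only. It assumes the default `StandardScaler` behaviour (subtract the mean, divide by the population standard deviation); the amounts are hypothetical and `standardize` is an illustrative helper, not part of the notebook.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical transaction amounts in KES (not taken from the lakehouse table).
amounts = [500, 1200, 800, 30000, 21000, 950]
scaled = standardize(amounts)

for amount, z in zip(amounts, scaled):
    print(f"{amount:>6} -> {z:+.3f}")
```

After scaling, the values have mean 0 and standard deviation 1, so unusually large amounts stand out as large positive z-scores regardless of the currency unit.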

In [None]:
# Split the data for testing and evaluation.
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(features, test_size=0.005, random_state=42)
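
With `test_size=0.005`, only 0.5% of the rows are held out and scored for anomalies. The sketch below illustrates that arithmetic with a plain shuffle-and-slice. The 10,000-row figure is a hypothetical size (the real table size is not stated here), and `split_rows` mirrors the idea of the split rather than scikit-learn's exact shuffling.

```python
import math
import random

def split_rows(rows, test_size, seed=42):
    """Shuffle rows and hold out ceil(test_size * n) of them as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(test_size * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical row ids standing in for the feature rows.
rows = list(range(10_000))
train_rows, test_rows = split_rows(rows, test_size=0.005)
print(len(train_rows), len(test_rows))
```

Because the held-out share is so small, almost all rows are used for fitting and only a handful are labelled as "Normal" or "Anomaly" in the later cells.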

In [None]:
# Train the model on the training split (train_data)
"""
- We chose the Isolation Forest algorithm for its strengths in working with high-dimensional datasets.
"""

from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='most_frequent')
train_data_imputed = imputer.fit_transform(train_data)

model = IsolationForest(
    n_estimators=100,
    contamination=0.05,
    bootstrap=False,
    verbose=1,
    random_state=42
)
model.fit(train_data_imputed)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s finished
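
The intuition behind Isolation Forest is that unusual points can be separated from the rest with fewer random splits than typical points. The block below is a simplified one-dimensional illustration of that idea using only the standard library. It is not scikit-learn's implementation, and `isolation_depth`, `mean_depth`, and the amounts are hypothetical.

```python
import random

def isolation_depth(values, target, rng, depth=0, max_depth=50):
    """Count the random splits needed to isolate `target` from `values`."""
    if len(values) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(values), max(values)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that still contains the target.
    same_side = [v for v in values if (v < split) == (target < split)]
    return isolation_depth(same_side, target, rng, depth + 1, max_depth)

def mean_depth(values, target, n_trees=100, seed=42):
    """Average the isolation depth over several random 'trees'."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(n_trees)) / n_trees

# 200 closely spaced amounts plus one large, unusual amount (all hypothetical).
amounts = [1000 + 10 * i for i in range(200)] + [30000]
typical = mean_depth(amounts, 2000)
unusual = mean_depth(amounts, 30000)
print(f"typical amount: {typical:.1f} splits, unusual amount: {unusual:.1f} splits")
```

A shorter average path means the point is easier to isolate, which is what the model turns into an anomaly score. The `contamination` parameter then decides how large a share of the lowest-scoring rows is labelled as anomalous.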


In [None]:
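# Detect anomalies in the test data
# Reuse the imputer fitted on the training data so that both splits are filled with the same values.
# Work on a copy so that adding the "Status" column below does not trigger a SettingWithCopyWarning.
test_data = test_data.copy()
test_data_imputed = imputer.transform(test_data)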

test_data["Status"] = model.predict(test_data_imputed)
test_data["Status"] = test_data["Status"].map({1: "Normal", -1: "Anomaly"})
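
Imputation with the `most_frequent` strategy fills each missing value with the most common value in its column. A common convention is to learn those fill values on the training rows and reuse them on the test rows. The sketch below shows that pattern with the standard library only; `fit_most_frequent`, `apply_fill`, and the sample rows are hypothetical and do not reproduce `SimpleImputer` exactly (for example, tie-breaking is not handled).

```python
from collections import Counter

def fit_most_frequent(rows):
    """Learn the most frequent non-missing value for each column position."""
    n_cols = len(rows[0])
    fill_values = []
    for col in range(n_cols):
        observed = [row[col] for row in rows if row[col] is not None]
        fill_values.append(Counter(observed).most_common(1)[0][0])
    return fill_values

def apply_fill(rows, fill_values):
    """Replace each missing value (None) with the learned value for its column."""
    return [
        [fill if value is None else value for value, fill in zip(row, fill_values)]
        for row in rows
    ]

# Hypothetical (AccountNumber, Day) rows with some missing entries.
train_rows = [[97346, 128], [97346, None], [9202, 128], [None, 84]]
test_rows = [[None, 310], [17388, None]]

fills = fit_most_frequent(train_rows)   # learned on the training rows only
imputed = apply_fill(test_rows, fills)  # reused unchanged on the test rows
print(fills)
print(imputed)
```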

In [None]:
# Combine with original data and save to lakehouse
"""
- We'll create a new table named "Anomaly_Results_v2" that will have the original values, 
and an added column of status where the status value for each row will be either "Normal" or "Anomaly".
"""
from pyspark.sql.functions import col

result_df = test_data.copy()
result_df["Amount"] = df.loc[result_df.index, "Amount"]

spark_df = spark.createDataFrame(result_df)
spark_df = spark_df.withColumn("Amount", col("Amount").cast("int"))
spark_df.write \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("PesaTransactionsV1.Anomaly_Results_v2")


In [None]:
# Keep only the rows flagged as anomalies
df_anomalies = spark.read.table("PesaTransactionsV1.Anomaly_Results_v2").filter("Status = 'Anomaly'").toPandas()
df_anomalies.head()

| | AccountNumber | Scaled_Amount | Day | Status | Amount |
|---|---|---|---|---|---|
| 0 | 97346 | 2.392144 | 128 | Anomaly | 30000 |
| 1 | 90346 | 2.392144 | 310 | Anomaly | 30000 |
| 2 | 9202 | 2.392144 | 177 | Anomaly | 30000 |
| 3 | 86024 | 2.392144 | 84 | Anomaly | 30000 |
| 4 | 17388 | 1.391652 | 2 | Anomaly | 21000 |


##### **STEP 2:** Use OpenAI GPT-4o mini to assign each anomaly a risk level (low/medium/high), explain the reasoning behind it, and draft the alert message that will be sent to the risk analysts.

- At this point, we've already created a Key Vault in the Azure portal to store the OpenAI endpoint URL and the OpenAI API key.

In [None]:
# Function to generate the prompt to send to GPT-4o mini for each anomaly
def format_prompt(transaction):
    return f"""
        A transaction has been flagged by an anomaly detection system. 
        Details:
        - Anomaly: {transaction["Status"]}
        - Amount: KES{transaction["Amount"]}
        - Account No: {transaction["AccountNumber"]}

        Evaluate this transaction and assign a risk level (low, medium, high) with
        a short explanation. Provide a final alert message in plain English for a 
        fraud analyst.

        Return three fields in your output:
        1. risk_level
        2. explanation
        3. alert_message
    """

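Because the prompt is written as an indented triple-quoted string, every line reaches the model with leading whitespace. That is usually harmless, but a cleaner prompt is easy to produce. The sketch below is a hypothetical variant (`format_prompt_dedented` is not part of the notebook) that strips the shared indentation with `textwrap.dedent`; the sample transaction is made up for illustration and the prompt is shortened.

```python
import textwrap

def format_prompt_dedented(transaction):
    """Hypothetical variant of format_prompt that removes the shared indentation."""
    prompt = f"""
        A transaction has been flagged by an anomaly detection system.
        Details:
        - Anomaly: {transaction["Status"]}
        - Amount: KES{transaction["Amount"]}
        - Account No: {transaction["AccountNumber"]}
    """
    # dedent removes the common leading whitespace; strip drops the blank first and last lines.
    return textwrap.dedent(prompt).strip()

# Made-up transaction with the same keys the notebook's prompt uses.
sample = {"Status": "Anomaly", "Amount": 30000, "AccountNumber": "97346"}
rendered = format_prompt_dedented(sample)
print(rendered)
```
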
In [None]:
# Install the packages below. The pinned versions matter: the code that follows uses the pre-1.0 openai interface (openai.ChatCompletion).
!pip install openai==0.28.1 typing-extensions==4.5.0 pydantic==1.10.13 notebookutils

In [None]:
# Call GPT-4o mini
import json

import openai
import pandas as pd
from notebookutils.mssparkutils.credentials import getSecret


# Constants
KEY_VAULT_ENDPOINT = "https://pesatransactionskeyvault.vault.azure.net/"
AZURE_OPENAI_API_KEY = getSecret(KEY_VAULT_ENDPOINT, "openaiGPT4o-mini-key")
AZURE_OPENAI_ENDPOINT_URL = getSecret(KEY_VAULT_ENDPOINT, "openaiendpointURL")

openai.api_type = "azure"
openai.api_base = AZURE_OPENAI_ENDPOINT_URL
openai.api_version = "2025-03-01-preview"
openai.api_key = AZURE_OPENAI_API_KEY

def gpt4o_risk_evaluation(prompt):
    """Send one prompt to the Azure OpenAI deployment and return the parsed JSON reply as a dict."""
    response = openai.ChatCompletion.create(
        engine="gpt-4o-mini-kenya-hack",  # Azure deployment name
        messages=[
            {"role": "system", "content": """You are a fraud assistant. Always respond in valid JSON format with these exact keys:
            {
                "risk_level": "low/medium/high",
                "explanation": "your explanation here",
                "alert_message": "your alert message here"
            }"""},
            {"role": "user", "content": prompt},
        ],
        response_format={"type": "json_object"},
        temperature=0.7,
    )
    return json.loads(response.choices[0].message["content"])



def parse_gpt4_response(response):
    """Turn the parsed JSON reply (a dict) into a pandas Series with one entry per output column."""
    risk_level = response["risk_level"]
    explanation = response["explanation"]
    alert_message = response["alert_message"]
    
    return pd.Series({
        'RiskLevel': risk_level,
        'Explanation': explanation,
        'AlertMessage': alert_message
    })
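
The parser above trusts the model to return all three keys and a valid risk level. Since model output can drift from the requested format, a small validation step may help. The block below is a sketch under that assumption: `validate_risk_response`, `ALLOWED_LEVELS`, and `REQUIRED_KEYS` are hypothetical names, and the sample reply is invented.

```python
import json

ALLOWED_LEVELS = {"low", "medium", "high"}
REQUIRED_KEYS = ("risk_level", "explanation", "alert_message")

def validate_risk_response(raw_text):
    """Parse a JSON reply and check it has the expected keys and a known risk level."""
    data = json.loads(raw_text)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"Response is missing keys: {missing}")
    # Normalise casing and whitespace before comparing against the allowed levels.
    level = str(data["risk_level"]).strip().lower()
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"Unexpected risk level: {data['risk_level']!r}")
    data["risk_level"] = level
    return data

# Invented reply in the shape the system prompt asks for.
reply = '{"risk_level": "Medium", "explanation": "Amount is unusually high.", "alert_message": "Please review."}'
print(validate_risk_response(reply))
```

Raising a `ValueError` on a malformed reply makes the failure visible at the row that caused it, instead of surfacing later as a missing column.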

In [None]:
# Loop through the anomalies and evaluate each
"""
- The goal is a table that contains the original data plus additional columns for the risk level,
the explanation of why the row is an anomaly, and the alert message to be sent to analysts.
"""

response_series = df_anomalies.apply(
    lambda row: parse_gpt4_response(gpt4o_risk_evaluation(format_prompt(row))), 
    axis=1
)

# Add the parsed columns to your DataFrame
df_anomalies[['RiskLevel', 'Explanation', 'AlertMessage']] = response_series

# Show what we have so far:
df_anomalies.head()

| | AccountNumber | Scaled_Amount | Day | Status | Amount | RiskLevel | Explanation | AlertMessage |
|---|---|---|---|---|---|---|---|---|
| 0 | 97346 | 2.392144 | 128 | Anomaly | 30000 | medium | The transaction amount of KES30000 is signific... | Transaction flagged for review: KES30000 to ac... |
| 1 | 90346 | 2.392144 | 310 | Anomaly | 30000 | medium | The transaction amount of KES30000 is signific... | A transaction of KES30000 has been flagged as ... |
| 2 | 9202 | 2.392144 | 177 | Anomaly | 30000 | medium | The transaction amount of KES30000 is signific... | Transaction flagged for review: KES30000 on ac... |
| 3 | 86024 | 2.392144 | 84 | Anomaly | 30000 | medium | The transaction amount of KES30000 is notable ... | Transaction of KES30000 on Account No: 86024 h... |
| 4 | 17388 | 1.391652 | 2 | Anomaly | 21000 | medium | The transaction amount of KES21000 is signific... | The transaction of KES21000 from account numbe... |


In [None]:
# Save the results back to the lakehouse, in a new table

spark_df = spark.createDataFrame(df_anomalies)

spark_df.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .saveAsTable("Anomaly_Risk_Summary_v2")

##### **STEP 3:** Connect to Azure Blob Storage and upload a `.txt` file for each anomaly found (each file contains the alert message with the key details).

- At this point, a Storage Account has already been created on Azure, along with the blob container that we'll use to store the uploaded files.
- The final cell in this section logs whether the connection to the blob container succeeded, whether the anomaly files were uploaded, and, as a verification, how many `.txt` files are currently in the blob container (listing only the first few file names).

In [None]:
!pip install azure-storage-blob

In [None]:
## Save anomaly alerts to Blob Storage
# This cell uploads each alert message to Azure Blob Storage as a separate .txt file

import pandas as pd
from azure.storage.blob import BlobServiceClient
from datetime import datetime
from notebookutils.mssparkutils.credentials import getSecret

# Constants
KEY_VAULT_ENDPOINT = "https://pesatransactionskeyvault.vault.azure.net/"

# Configuration Section
STORAGE_ACCOUNT_NAME = getSecret(KEY_VAULT_ENDPOINT, "storageAccountName")
STORAGE_ACCOUNT_KEY = getSecret(KEY_VAULT_ENDPOINT, "storageAccountKey")
CONTAINER_NAME = "gpt4osummary"

# Timestamp shared by all files uploaded in this run
TIMESTAMP = datetime.now().strftime('%Y%m%d_%H%M%S')

# Create the connection to Azure Blob Storage
try:
    # Build the connection string from adjacent literals so that no stray whitespace ends up inside it
    connection_string = (
        "DefaultEndpointsProtocol=https;"
        f"AccountName={STORAGE_ACCOUNT_NAME};"
        f"AccountKey={STORAGE_ACCOUNT_KEY};"
        "EndpointSuffix=core.windows.net"
    )
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    
    # Get reference to container (will create if doesn't exist)
    container_client = blob_service_client.get_container_client(CONTAINER_NAME)
    if not container_client.exists():
        container_client.create_container()
    
    print(f"Successfully connected to Azure Blob Storage container: {CONTAINER_NAME}")
except Exception as e:
    print(f"Error connecting to Azure Blob Storage: {str(e)}")
    raise

# Convert Spark DataFrame to text files and upload
try:
    # First convert Spark DataFrame to Pandas
    pandas_df = spark_df.toPandas()
    
    # Ensure the required column exists
    if 'AlertMessage' not in pandas_df.columns:
        raise ValueError("DataFrame is missing required 'AlertMessage' column")
    
    # Upload each summary as a separate text file
    for index, row in pandas_df.iterrows():
        # Create filename with timestamp and index
        txt_filename = f"anomaly_summary_{TIMESTAMP}_{index}.txt"
        
        # Get the summary text
        summary_text = str(row['AlertMessage'])
        
        # Upload to blob storage
        blob_client = container_client.get_blob_client(txt_filename)
        blob_client.upload_blob(summary_text, overwrite=True)
        
        print(f"Uploaded: {txt_filename} (Size: {len(summary_text)} characters)")
    
    print(f"\nSuccessfully uploaded {len(pandas_df)} text files to container {CONTAINER_NAME}")

except Exception as e:
    print(f"Error uploading text files to blob storage: {str(e)}")
    raise

# Verification - list blobs in container (optional)
print("\nCurrent blobs in container:")
try:
    blob_list = container_client.list_blobs()
    txt_files = [blob for blob in blob_list if blob.name.endswith('.txt')]
    
    print(f"Found {len(txt_files)} text files:")
    for blob in txt_files[:5]:
        print(f"- {blob.name} (Size: {blob.size} bytes)")
    if len(txt_files) > 5:
        print(f"- ... and {len(txt_files)-5} more")
    
except Exception as e:
    print(f"Error listing blobs: {str(e)}")
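
The upload loop above derives each blob name from the run timestamp and the row index. Separating that naming step from the upload makes it checkable without Azure credentials. The block below is a sketch of that separation; `build_alert_files` is a hypothetical helper and the messages are invented, while the `anomaly_summary_{timestamp}_{index}.txt` pattern follows the notebook.

```python
from datetime import datetime

def build_alert_files(alert_messages, timestamp):
    """Map each blob name to its text payload, one entry per alert message."""
    return {
        f"anomaly_summary_{timestamp}_{index}.txt": str(message)
        for index, message in enumerate(alert_messages)
    }

# A fixed time keeps the example repeatable; the notebook uses datetime.now().
timestamp = datetime(2025, 1, 31, 9, 30, 0).strftime("%Y%m%d_%H%M%S")
files = build_alert_files(
    ["Transaction flagged for review.", "Unusual amount on account."],  # invented messages
    timestamp,
)

for name, text in files.items():
    print(f"{name}: {len(text.encode('utf-8'))} bytes")
```

Each name/text pair can then be passed to `container_client.get_blob_client(name).upload_blob(text, overwrite=True)`, as in the cell above.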