## 1. SWIFT Assignment

In this assignment, a Python script is developed to convert SWIFT messages into a structured dataframe. The SWIFT messages will be analyzed, and pertinent information such as sender, receiver, amount, currency, value date, reference, and any remarks will be extracted and transformed into a dataframe.

After converting the SWIFT messages, risk analyses will be conducted. These analyses may encompass various aspects, such as detecting suspicious transaction patterns, identifying potential compliance issues, or evaluating financial risks associated with the transactions.

The aim of this assignment is to create an automated process that is flexible and capable of handling various types of SWIFT messages. The script should be able to extract relevant information from the messages and subsequently perform risk analyses to ensure the integrity and compliance of the financial transactions.

In [476]:
import pandas as pd
import re
from datetime import datetime, timedelta
import numpy as np

The provided script includes a list of MT103 messages. Each message encapsulates details of a financial transaction. These messages serve as the input data for our script, which will convert them into a structured dataframe and conduct risk analyses.

In [477]:
# Provided data: List of MT103 messages
mt103_messages = [
    """
    {1:F01MYMBGB2L0XXX0000000000}{2:I103HBUKGB4BXXXN}{3:{108:MT103
    0001}}{4:
    :20:MT103 0001
    :23B:CRED
    :32A:210322USD5000,
    :50K:/DE98765432101234567890
    COMMERZBANK AG
    HAMBURG, GERMANY
    /COBADEHHXXX
    :52A:/COBADEHHXXX
    COMMERZBANK AG
    HAMBURG, GERMANY
    :53A:/MYMBGB2LXXX
    METRO BANK PLC
    LONDON, UNITED KINGDOM
    :57A:/HBUKGB4BXXX
    HSBC BANK PLC
    LONDON, UNITED KINGDOM
    :59:/GB57METR12345678901234
    NORDFISCH GMBH
    BODENSEE STR. 226
    22761 HAMBURG
    GERMANY
    :71A:OUR
    :71F:/BIC/HBUKGB4BXXX
    :71G:/INS/THIS IS A PAYMENT FOR TUNA SUPPLY
    -}
    """,
    # Add more MT103 messages here
    """
    {1:F01ABCBUS33AXXX0000000000}{2:I103HSBCHKHHHKXXXN}{3:{108:MT103
    0001}}{4:
    :20:MT103 0001
    :23B:CRED
    :32A:210322USD10000,
    :50K:/US12345678901234567890
    ABC INDUSTRIES
    123 MAIN STREET
    NEW YORK, NY 10001
    UNITED STATES
    /ABCBUS33XXX
    :52A:/ABCBUS33XXX
    ABC BANK
    NEW YORK, NY
    UNITED STATES
    :53A:/HSBCHKHHHKXXX
    HSBC HONG KONG
    1 QUEEN'S ROAD CENTRAL
    HONG KONG
    :54A:/CITIHKHX
    CITIBANK HONG KONG
    3 GARDEN ROAD
    CENTRAL, HONG KONG
    :56A:/ICBKCNBJGZU
    INDUSTRIAL AND COMMERCIAL BANK OF CHINA
    GUANGZHOU BRANCH
    76 HUANSHI ROAD WEST
    GUANGZHOU, CHINA
    :57A:/CITIUS33
    CITIBANK NA
    111 WALL STREET
    NEW YORK, NY 10043
    UNITED STATES
    :59:/CN123456789012345678
    XYZ SUPPLIERS
    123 HUANGPU ROAD
    SHANGHAI, CHINA
    :71A:OUR
    :71F:/BIC/HSBCHKHHHKXXX
    :71G:/MSG/PAYMENT FOR GOODS
    -}""",

    """
    {1:F01ABNANL2AXXX0000000000}{2:I103SCBLGB2LXXXXN}{3:{103:TGT}{108
    :MT103 0001}}{4:
    :20:MT103 0001
    :23B:CRED
    :32A:210322USD9899,
    :50A:/NL20ABNA0404875234
    ABNANL2A
    ABC SUPPLIERS BV
    AMSTERDAM, NETHERLANDS
    :56A:/SCBLGB2LXXX
    STANDARD CHARTERED BANK
    LONDON, UK
    :57A:/BNYMUS33XXX
    BNY MELLON
    NEW YORK, NY, US
    :59:/PASSNGLAXXX
    AFRICAN EXPORT-IMPORT BANK
    LAGOS, NIGERIA
    XYZ ENTERPRISES LTD
    LAGOS, NIGERIA
    :70:INV NO. 12345
    REF. 98765
    SUPPLY OF GOODS AS PER PURCHASE ORDER NO. 54321
    -}"""
    ]

The extract_value function returns the value associated with a given tag. It takes two parameters: tag, the field tag to look up (for example 20 or 32A), and message, the SWIFT message to search.

The function builds a regular expression from the tag and captures everything between that tag and the next field tag, or the end of the message if no further tag follows. When a match is found, the value is returned with leading and trailing whitespace stripped; otherwise an empty string is returned.

In [478]:
def extract_value(tag, message):
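    # Stop at the next field tag, with or without a letter suffix (e.g. ':59:' or ':71A:'), or at the end of the message
    regex_pattern = r'(?<=:' + re.escape(tag) + r':)(.*?)(?=:\d{2}[A-Z]?:|\Z)'
    match = re.search(regex_pattern, message, re.DOTALL)  # re.DOTALL makes '.' match newlines as well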
    if match:
        return match.group(1).strip()
    else:
        return ''
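
As a rough illustration of the tag lookup described above, the sketch below applies the same idea to a short, made-up message fragment. The helper name get_field and the sample text are assumptions for illustration only, not part of the assignment code.

```python
import re

def get_field(tag, message):
    # Capture everything after ':<tag>:' up to the next ':NN:' or ':NNA:' tag, or the end of the text
    pattern = r':' + re.escape(tag) + r':(.*?)(?=\n\s*:\d{2}[A-Z]?:|\Z)'
    match = re.search(pattern, message, re.DOTALL)
    return match.group(1).strip() if match else ''

# Hypothetical fragment, shaped like block 4 of an MT103
sample = """
:20:REF0001
:23B:CRED
:32A:210322USD5000,
:59:/GB57METR12345678901234
NORDFISCH GMBH
:70:INVOICE 42
"""

print(get_field('20', sample))   # REF0001
print(get_field('70', sample))   # INVOICE 42
print(repr(get_field('99Z', sample)))  # '' because the tag is absent
```

Multi-line fields such as tag 59 come back as a single string with embedded newlines, so the caller still has to split them into lines.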

An empty DataFrame is generated with columns designed to accommodate various transaction details extracted from the messages. This structure facilitates the integration of information derived from the SWIFT messages for further analysis.

In [479]:
# Create an empty DataFrame with columns as per the provided schema
columns = [
    "transaction_date",
    "transaction_id",
    "transaction_message",
    "transaction_currency",
    "transaction_amount",
    "transaction_type",
    "transaction_direction",
    "transaction_status",
    "instrument_type",
    "originator_full_name",
    "originator_first_name",
    "originator_account_number",
    "originator_middle_names_patronymic",
    "originator_last_name",
    "originator_address",
    "originator_country",
    "originator_branch_id",
    "originator_bic",
    "originator_fi_name",
    "originator_fi_country",
    "incoming_intermediary_fi_bic",
    "outgoing_intermediary_fi_bic",
    "beneficiary_full_name",
    "beneficiary_first_name",
    "beneficiary_middle_name_patronymic",
    "beneficiary_last_name",
    "beneficiary_address",
    "beneficiary_country",
    "beneficiary_account_number",
    "beneficiary_branch_id",
    "beneficiary_bic",
    "beneficiary_fi_name",
    "beneficiary_fi_country",
]

empty_df = pd.DataFrame(columns=columns)
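
A repeated name in a column list yields two columns with the same label, which makes later lookups ambiguous. The sketch below shows one possible sanity check to run before building the DataFrame. The helper find_duplicates and the shortened column list are hypothetical and only serve as an example.

```python
from collections import Counter

def find_duplicates(names):
    # Count every name and keep those that occur more than once
    counts = Counter(names)
    return [name for name, count in counts.items() if count > 1]

# Hypothetical, shortened schema for illustration
example_columns = [
    "transaction_id",
    "originator_account_number",
    "originator_bic",
    "originator_account_number",
]

print(find_duplicates(example_columns))  # ['originator_account_number']
```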

The function extract_intermediary_bic_codes reads tags 56A and 57A of an MT103 message. It captures the BIC on the first line of each tag and stores both in a dictionary for further processing, leaving a value as None when the tag is absent.

In [480]:
def extract_intermediary_bic_codes(mt103_message):
    # Dictionary to store the found BIC codes
    bic_codes = {
        "incoming_intermediary_fi_bic": None,
        "outgoing_intermediary_fi_bic": None
    }
    
    # Extract BIC from tag 56A
    match_56a = re.search(r':56A:/(.+)', mt103_message)
    if match_56a:
        bic_codes["incoming_intermediary_fi_bic"] = match_56a.group(1).strip()
    
    # Extract BIC from tag 57A
    match_57a = re.search(r':57A:/(.+)', mt103_message)
    if match_57a:
        bic_codes["outgoing_intermediary_fi_bic"] = match_57a.group(1).strip()
    
    return bic_codes


The function extract_originator_fi_country returns the country of the financial institution that originates the transaction. It tries tag 50K first and falls back to tag 50A, taking the text after the last comma of the address line as the country. If neither tag yields a country, the function returns None.

In [481]:
def extract_originator_fi_country(mt103_message):
    # Attempt to process tag 50K first
    match_50k = re.search(r':50K:/.+?\n(.+?)/[A-Z]{2}[0-9A-Z]{9,11}', mt103_message, re.DOTALL)
    if match_50k:
        content_after_50k = match_50k.group(1)
        lines = content_after_50k.strip().split('\n')
        last_line = lines[-1] if lines else ""
        parts = last_line.split(',')
        country_50k = parts[-1].strip() if parts else None
        if country_50k:
            return country_50k

    # If no country is extracted from 50K, try tag 50A
    match_50a = re.search(r':50A:/[A-Z]{2}[0-9A-Z]{0,30}\n(?:.*\n)*?.*,\s*(.*)', mt103_message)
    if match_50a:
        country_50a = match_50a.group(1).strip()
        return country_50a


This function, extract_originator_country, aims to extract the country of origin from an MT103 message. It first attempts to process tag 50K, followed by tag 50A if no country information is found. If neither tag provides a country, it returns None.

In [482]:
def extract_originator_country(mt103_message):
    # Attempt to process tag 50K first
    match_50k = re.search(r':50K:/.+?\n(.+?)/[A-Z]{2}[0-9A-Z]{9,11}', mt103_message, re.DOTALL)
    if match_50k:
        content_after_50k = match_50k.group(1)
        lines = content_after_50k.strip().split('\n')
        last_line = lines[-1] if lines else ""
        parts = last_line.split(',')
        country_50k = parts[-1].strip() if parts else None
        if country_50k:
            return country_50k

    # If no country is extracted from 50K, try tag 50A
    match_50a = re.search(r':50A:/[A-Z]{2}[0-9A-Z]{0,30}\n(?:.*\n)*?.*,\s*(.*)', mt103_message)
    if match_50a:
        country_50a = match_50a.group(1).strip()
        return country_50a

    # If neither tag yields a country, return None
    return None



This function, extract_originator_bic, is designed to extract the originator's BIC code from an MT103 message. It first attempts to retrieve the BIC from tag 50K, followed by tag 50A if the BIC is not found. If neither tag contains the BIC, it returns None.

In [483]:
def extract_originator_bic(mt103_message):
    # First attempt to extract the BIC code from tag 50K
    match_50k = re.search(r':50K:/.+?\n.+?/([A-Z]{6}[A-Z2-9][A-NP-Z0-9]{2}(?:[A-Z0-9]{3})?)([A-Z]{2})?', mt103_message, re.DOTALL)
    if match_50k:
        # Basic BIC code
        originator_bic_50k = match_50k.group(1)
        # Optional: Check if there are two additional letters and append them to the BIC code
        extra_letters_50k = match_50k.group(2) if match_50k.lastindex >= 2 else ""
        originator_bic_50k += extra_letters_50k
        return originator_bic_50k

    # If no BIC code is found in tag 50K, try tag 50A
    bic_name_matches_50a = re.findall(r':50A:/(.+?)\n(.+?)\n', mt103_message)
    if bic_name_matches_50a:
        originator_bic_50a = bic_name_matches_50a[0][1].strip()  # [0][1] to retrieve the BIC code
        return originator_bic_50a

    return None
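
Before an extracted code is stored, it can be checked against the standard BIC layout. The sketch below is one possible check, assuming the ISO 9362 structure of a 4-letter institution code, a 2-letter country code, a 2-character location code and an optional 3-character branch code. The names BIC_PATTERN and looks_like_bic are hypothetical.

```python
import re

# Assumed ISO 9362 layout: institution (4 letters), country (2 letters),
# location (2 alphanumerics), optional branch (3 alphanumerics)
BIC_PATTERN = re.compile(r'[A-Z]{4}[A-Z]{2}[A-Z0-9]{2}(?:[A-Z0-9]{3})?')

def looks_like_bic(candidate):
    # fullmatch ensures the whole string fits the layout, not just a prefix
    return BIC_PATTERN.fullmatch(candidate) is not None

for code in ["COBADEHHXXX", "ABNANL2A", "HSBCHKHHHKXXX"]:
    print(code, looks_like_bic(code))
# COBADEHHXXX True
# ABNANL2A True
# HSBCHKHHHKXXX False
```

A failed check does not say what is wrong with the value, only that it has neither 8 nor 11 characters in the expected positions, which may be enough to flag a message for manual review.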


The function extract_originator_branch_id derives the originator's branch ID from the BIC in an MT103 message. It looks for the BIC in tag 50K first and, if none is found, in tag 50A. When the BIC is longer than eight characters, the last three characters are returned as the branch ID; for an eight-character BIC the function returns "head office". If no BIC is found, it returns None.

In [484]:
def extract_originator_branch_id(mt103_message):
    # First attempt to extract the BIC code from tag 50K
    match_50k = re.search(r':50K:/.+?\n(.+?)/([A-Z]{6}[A-Z2-9][A-NP-Z0-9]{2}(?:[A-Z0-9]{3})?)([A-Z]{2})?', mt103_message, re.DOTALL)
    if match_50k:
        originator_bic_50k = match_50k.group(2).strip()
        # Check if there are additional letters after the BIC code and append them
        extra_letters_50k = match_50k.group(3) if match_50k.lastindex >= 3 else ""
        originator_bic_50k += extra_letters_50k
        branch_id = originator_bic_50k[-3:]  # Keep only the last three characters of the BIC code
        if len(originator_bic_50k) > 8:
            return branch_id
        else:
            return "head office"

    # If no BIC code is found in tag 50K, try tag 50A
    bic_name_matches_50a = re.findall(r':50A:/(.+?)\n(.+?)\n', mt103_message)
    if bic_name_matches_50a:
        originator_bic_50a = bic_name_matches_50a[0][1].strip()  # [0][1] to retrieve the BIC code
        # The second line of tag 50A already holds the complete BIC, so nothing needs to be appended
        branch_id = originator_bic_50a[-3:]  # Keep only the last three characters of the BIC code
        if len(originator_bic_50a) > 8:
            return branch_id
        else:
            return "head office"

    return None
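
The branch rule can also be written directly in terms of BIC length. The sketch below is a simplified variant, not the notebook's function: it assumes that an 11-character BIC ending in XXX refers to the primary office, and it leaves codes of any other length undecided. The name branch_label is hypothetical.

```python
def branch_label(bic):
    # 8-character BICs carry no branch code
    if len(bic) == 8:
        return "head office"
    # 11-character BICs end in a 3-character branch code;
    # 'XXX' is treated here as the primary office (assumption)
    if len(bic) == 11:
        branch = bic[-3:]
        return "head office" if branch == "XXX" else branch
    # Any other length is not a well-formed BIC, so no branch is derived
    return None

print(branch_label("ABNANL2A"))       # head office
print(branch_label("COBADEHHXXX"))    # head office
print(branch_label("ICBKCNBJGZU"))    # GZU
print(branch_label("HSBCHKHHHKXXX"))  # None
```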


In [485]:
# A counter to keep track of the number of messages.
message_count = 0

def get_transaction_status():
    global message_count  # Use the global counter
    message_count += 1  # Increase the counter by 1

    # Check the value of the counter and return the status
    if message_count <= 2:
        return "accepted"
    else:
        return "rejected"


This function, extract_beneficiary_bic, is aimed at extracting the beneficiary's BIC code from an MT103 message. It initially tries to retrieve the BIC from tag 53A and, if unsuccessful, it attempts tag 59. If the BIC is found, it is returned; otherwise, None is returned.

In [486]:
def extract_beneficiary_bic(mt103_message):
    # First attempt to extract the BIC code from tag 53A
    match_53a = re.search(r':53A:/(.+)', mt103_message)
    if match_53a:
        beneficiary_bic_53a = match_53a.group(1).strip()
        return beneficiary_bic_53a

    # If no BIC code is found in tag 53A, try tag 59
    # Adjust the search pattern to also capture any additional letters after the BIC code
    match_59 = re.search(r':59:/([A-Z]{6}[A-Z2-9][A-NP-Z0-9]{2}(?:[A-Z0-9]{3})?)([A-Z]{2})?', mt103_message)
    if match_59:
        # Basic BIC code
        beneficiary_bic_59 = match_59.group(1)
        # Check if there are two additional letters and append them to the BIC code
        extra_letters_59 = match_59.group(2) if match_59.lastindex >= 2 else ""
        beneficiary_bic_59 += extra_letters_59
        return beneficiary_bic_59

    return None

This function, extract_beneficiary_branch_id, is designed to extract the branch ID of the beneficiary from an MT103 message. It first attempts to retrieve the branch ID from tag 53A, and if unsuccessful, it tries tag 59. If the branch ID is found, it is returned; otherwise, None is returned.

In [487]:
def extract_beneficiary_branch_id(mt103_message):
    # First attempt to extract the branch ID from tag 53A
    match_53a = re.search(r':53A:/.+?\n.+?/([A-Z]{6}[A-Z2-9][A-NP-Z0-9]{2}(?:[A-Z0-9]{3})?)([A-Z]{2})?', mt103_message, re.DOTALL)
    if match_53a:
        # Basic branch ID
        beneficiary_branch_id_53a = match_53a.group(1)
        # Optional: Check if there are two additional letters and append them to the branch ID
        extra_letters_53a = match_53a.group(2) if match_53a.lastindex >= 2 else ""
        beneficiary_branch_id_53a += extra_letters_53a
        return beneficiary_branch_id_53a

    # If no branch ID is found in tag 53A, try tag 59
    match_59 = re.search(r':59:/.+?\n.+?/([A-Z]{6}[A-Z2-9][A-NP-Z0-9]{2}(?:[A-Z0-9]{3})?)([A-Z]{2})?', mt103_message, re.DOTALL)
    if match_59:
        # Basic branch ID
        beneficiary_branch_id_59 = match_59.group(1)
        # Optional: Check if there are two additional letters and append them to the branch ID
        extra_letters_59 = match_59.group(2) if match_59.lastindex >= 2 else ""
        beneficiary_branch_id_59 += extra_letters_59
        return beneficiary_branch_id_59

    return None

This second definition of extract_beneficiary_branch_id replaces the one above when the notebook is run, because both share the same name. It keeps only the last three characters of the code on the first line of the tag: it tries tag 53A first and, if that tag is absent, tag 59. If neither tag is present, None is returned.

In [488]:
def extract_beneficiary_branch_id(mt103_message):
    # First attempt to extract the branch ID from tag 53A, focusing on the last three characters
    match_53a = re.search(r':53A:/.+?([A-Z0-9]{3})\b', mt103_message)
    if match_53a:
        beneficiary_branch_id_53a = match_53a.group(1)
        return beneficiary_branch_id_53a

    # If no branch ID is found in tag 53A, try tag 59, also focusing on the last three characters
    match_59 = re.search(r':59:/.+?([A-Z0-9]{3})\b', mt103_message)
    if match_59:
        beneficiary_branch_id_59 = match_59.group(1)
        return beneficiary_branch_id_59

    return None


This function, extract_beneficiary_fi_name, is designed to extract the financial institution name from an MT103 message, primarily focusing on the second line of tags 53A or 59. It first attempts to extract the FI name from tag 53A, and if not found, it tries tag 59. If the FI name is found, it is returned; otherwise, None is returned.

In [489]:

def extract_beneficiary_fi_name(mt103_message):
    # First attempt to extract the FI name from tag 53A, focusing on the second line
    fi_name_matches_53a = re.findall(r':53A:.+?\n(.+?)\n', mt103_message, re.DOTALL)
    if fi_name_matches_53a:
        beneficiary_fi_name_53a = fi_name_matches_53a[0].strip()
        return beneficiary_fi_name_53a

    # If no FI name is found in tag 53A, try tag 59, also focusing on the second line
    fi_name_matches_59 = re.findall(r':59:.+?\n(.+?)\n', mt103_message, re.DOTALL)
    if fi_name_matches_59:
        beneficiary_fi_name_59 = fi_name_matches_59[0].strip()
        return beneficiary_fi_name_59

    return None


This function, extract_beneficiary_fi_country, is designed to extract the country of the beneficiary's financial institution from an MT103 message. It first attempts to locate the country information within tag 53A, and if not found, it searches within tag 59. If the country is found, it is returned.

In [490]:
def extract_beneficiary_fi_country(mt103_message):
    # Search for the entire content of tag 53A
    match_53a = re.search(r':53A:[^\n]*\n(.*?)(?=:|$)', mt103_message, re.DOTALL)
    if match_53a:
        # Split the content into lines and take the last line
        lines_53a = match_53a.group(1).strip().split('\n')
        last_line_53a = lines_53a[-1].strip()
        # Take the text after the last comma of the last line, which should be the country
        country_53a = last_line_53a.split(',')[-1].strip()
        if country_53a:
            return country_53a

    # Search for the entire content of tag 59 if tag 53A yields no result
    match_59 = re.search(r':59:[^\n]*\n(.*?)(?=:|$)', mt103_message, re.DOTALL)
    if match_59:
        # Split the content into lines and take the last line
        lines_59 = match_59.group(1).strip().split('\n')
        last_line_59 = lines_59[-1].strip()
        # Take the text after the last comma of the last line, which should be the country
        country_59 = last_line_59.split(',')[-1].strip()
        if country_59:
            return country_59

    return None
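
The "last line, text after the last comma" rule used above can be tried out on its own. The sketch below is a minimal stand-alone version that works on the text of a single tag. The name country_from_block is hypothetical and the inputs are shortened examples.

```python
def country_from_block(block):
    # Keep only non-empty lines, stripped of surrounding whitespace
    lines = [line.strip() for line in block.strip().split('\n') if line.strip()]
    if not lines:
        return None
    # The country is assumed to be whatever follows the last comma on the last line
    return lines[-1].split(',')[-1].strip() or None

print(country_from_block("METRO BANK PLC\n    LONDON, UNITED KINGDOM"))  # UNITED KINGDOM
print(country_from_block("BNY MELLON\n    NEW YORK, NY, US"))            # US
print(country_from_block(""))                                            # None
```

The rule returns the country exactly as written in the message, so "UK", "US" and "UNITED KINGDOM" are not normalised to a common code.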



This function, extract_beneficiary_name, aims to extract the beneficiary's name from an MT103 message. It searches for the name within the lines following the ':59:' tag, returning the first non-empty line encountered. If no name is found, it returns an empty string.

In [491]:
def extract_beneficiary_name(mt103_message):
    lines = mt103_message.split('\n')
    for i, line in enumerate(lines):
        if ':59:' in line:
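            # Look at up to four lines after ':59:'; slicing avoids an IndexError near the end of the message
            for name_line in lines[i + 1:i + 5]:
                name_line = name_line.strip()
                if name_line:
                    return name_line  # Return the first non-empty line after ':59:'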
    return ""  # Return an empty string if no name is found


In this function, the full name is split into its constituent parts: first name, middle name (if present), and last name.

In [492]:
# This function splits a full name into first name, middle name (if present), and last name.
def split_beneficiary_name(full_name):
    parts = full_name.split()
    beneficiary_first_name = parts[0] if parts else ""
    beneficiary_last_name = parts[-1] if len(parts) > 1 else ""
    beneficiary_middle_name_patronymic = " ".join(parts[1:-1]) if len(parts) > 2 else ""
    return beneficiary_first_name, beneficiary_middle_name_patronymic, beneficiary_last_name
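
To see how the splitting rule behaves on the kind of names that occur in these messages, the sketch below repeats the same logic under the hypothetical name split_name and runs it on a few inputs.

```python
def split_name(full_name):
    # First token is the first name, last token the last name, anything in between the middle part
    parts = full_name.split()
    first = parts[0] if parts else ""
    last = parts[-1] if len(parts) > 1 else ""
    middle = " ".join(parts[1:-1]) if len(parts) > 2 else ""
    return first, middle, last

print(split_name("NORDFISCH GMBH"))              # ('NORDFISCH', '', 'GMBH')
print(split_name("AFRICAN EXPORT-IMPORT BANK"))  # ('AFRICAN', 'EXPORT-IMPORT', 'BANK')
print(split_name(""))                            # ('', '', '')
```

For company names the resulting "first" and "last" names carry little meaning, so these columns are mainly useful when the beneficiary is a natural person.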

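The function determine_transaction_direction identifies the direction of a transaction by comparing the message with a known originator and a known beneficiary. It returns "o" (outgoing) if the known originator appears in tag 50K, "i" (incoming) if the known beneficiary appears in tag 59, and "io" when neither matches.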

In [493]:
# Function to Determine Transaction Direction
def determine_transaction_direction(mt103_message, known_originator, known_beneficiary):
  
    originator_field = extract_value('50K', mt103_message)
    beneficiary_field = extract_value('59', mt103_message)
    
    if known_originator.lower() in originator_field.lower():
        return "o"  # Outgoing
    elif known_beneficiary.lower() in beneficiary_field.lower():
        return "i"  # Incoming
    else:
        return "io"  # Undetermined
    
# Example usage:
known_originator = "Industries"
known_beneficiary = "Suppliers"
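
The matching rule itself can be illustrated without a full message. The sketch below applies the same case-insensitive substring test to plain name strings. The function name direction is hypothetical, and the second call pairs an originator and a beneficiary that do not occur together in the sample messages.

```python
def direction(originator_text, beneficiary_text, known_originator, known_beneficiary):
    # The originator is checked first, so it wins when both names match
    if known_originator.lower() in originator_text.lower():
        return "o"
    if known_beneficiary.lower() in beneficiary_text.lower():
        return "i"
    return "io"

print(direction("ABC INDUSTRIES", "XYZ SUPPLIERS", "Industries", "Suppliers"))   # o
print(direction("COMMERZBANK AG", "XYZ SUPPLIERS", "Industries", "Suppliers"))   # i
print(direction("COMMERZBANK AG", "NORDFISCH GMBH", "Industries", "Suppliers"))  # io
```

Because an empty string is a substring of every string, an empty known_originator would classify every message as outgoing, so both reference names should be non-empty.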

The function refined_extract_tag_59 takes a list of messages and extracts information from the ':59:' tag of each one. It reads the account number, address, and country from the tag's content, and handles the case where the last line holds both a city and a country separated by a comma. The extracted data is returned as a list of dictionaries.

In [494]:
def refined_extract_tag_59(messages):
    extracted_data = []
    for message in messages:
        matches = re.findall(r':59:\/([^\n]+)\n([^:]+)', message, re.DOTALL)
        for match in matches:
            account_number, info = match
            # Strip every line so that indentation does not end up in the address
            lines = [line.strip() for line in info.strip().split('\n')]
            # We extract the last line as country and the preceding lines (except the first one, which is the name) as address
            country = lines[-1]
            address = ", ".join(lines[1:-1])
            # If the country line contains a comma, it's assumed that it includes city and country, so we split them
            if ',' in country:
                address_part, country = country.rsplit(',', 1)
                country = country.strip()
                # Avoid a leading ", " when there are no other address lines
                address = ", ".join(filter(None, [address, address_part.strip()]))
            extracted_data.append({
                'beneficiary_account_number': account_number,
                'beneficiary_address': address,
                'beneficiary_country': country
            })
    return extracted_data
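
The following worked example shows the tag 59 parsing on a single field. It is a simplified sketch for one message rather than a list, and it assumes the layout account number, name, address lines, then a last line holding either a country or "CITY, COUNTRY". The name parse_tag_59 is hypothetical.

```python
import re

def parse_tag_59(message):
    # Account number on the first line, then everything up to the next colon
    match = re.search(r':59:/([^\n]+)\n([^:]+)', message)
    if not match:
        return None
    lines = [line.strip() for line in match.group(2).strip().split('\n')]
    country = lines[-1]
    address_lines = lines[1:-1]  # skip the name on the first line
    if ',' in country:
        # Assumed "CITY, COUNTRY" layout on the last line
        city, country = (part.strip() for part in country.rsplit(',', 1))
        address_lines.append(city)
    return {
        'beneficiary_account_number': match.group(1).strip(),
        'beneficiary_address': ", ".join(address_lines),
        'beneficiary_country': country,
    }

sample = """:59:/CN123456789012345678
    XYZ SUPPLIERS
    123 HUANGPU ROAD
    SHANGHAI, CHINA
    :71A:OUR"""

print(parse_tag_59(sample))
# {'beneficiary_account_number': 'CN123456789012345678',
#  'beneficiary_address': '123 HUANGPU ROAD, SHANGHAI',
#  'beneficiary_country': 'CHINA'}
```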

We initially created a dictionary based on sources found online, categorizing MT messages into their corresponding instrument types. However, the dictionary may not be exhaustive. Our aim was to demonstrate how we could categorize MT messages using this approach.

In [495]:
import re

# Dictionary with MT codes and their corresponding instrument types
mt_code_to_instrument_type = {
    "MT102": "wire",
    "MT103": "wire",
    "MT110": "check",
    "MT111": "check",
    "MT191": "other",
    "MT200": "ach/lcy_transfers",
    "MT202": "ach/lcy_transfers",
    "MT203": "ach/lcy_transfers",
    "MT204": "ach/lcy_transfers",
    "MT205": "ach/lcy_transfers",
    "MT210": "ach/lcy_transfers",
    "MT300": "securities",
    "MT306": "securities",
    "MT320": "securities",
    "MT330": "securities",
    "MT340": "securities",
    "MT341": "securities",
    "MT350": "securities",
    "MT360": "securities",
    "MT361": "securities",
    "MT362": "securities",
    "MT364": "securities",
    "MT365": "securities",
    "MT400": "cash",
    "MT410": "cash",
    "MT412": "cash",
    "MT416": "cash",
    "MT420": "cash",
    "MT422": "cash",
    "MT430": "cash",
    "MT517": "securities",
    "MT518": "securities",
    "MT540": "securities",
    "MT541": "securities",
    "MT542": "securities",
    "MT543": "securities",
    "MT592": "securities",
    "MT598": "securities",
    "MT643": "precious_metal",
    "MT644": "precious_metal",
    "MT645": "precious_metal",
    "MT646": "precious_metal",
    "MT649": "precious_metal",
    "MT700": "other",
    "MT701": "other",
    "MT705": "other",
    "MT707": "other",
    "MT710": "other",
    "MT720": "other",
    "MT730": "other",
    "MT732": "other",
    "MT734": "other",
    "MT740": "other",
    "MT742": "other",
    "MT747": "other",
    "MT750": "other",
    "MT752": "other",
    "MT754": "other",
    "MT756": "other",
    "MT760": "other",
    "MT767": "other",
    "MT768": "other",
    "MT900": "other",
    "MT910": "other",
    "MT920": "other",
    "MT940": "other",
    "MT950": "other",
}

# Function to extract the MT code and instrument type from the message
def extract_mt_code_and_instrument_type(message):
    # Search for the {2:} section of the message to extract the MT code
    mt_code_match = re.search(r'\{2:[IO](\d{3})', message)
    if mt_code_match:
        # If there's a match, format the MT code
        mt_code = "MT" + mt_code_match.group(1)
    else:
        # If no match, we cannot identify the MT code
        mt_code = "Unknown"

    # Use the extracted MT code to retrieve the instrument type from the dictionary
    instrument_type = mt_code_to_instrument_type.get(mt_code, "Unknown")
    return mt_code, instrument_type
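
A short usage sketch of the header lookup, using the application header of the first sample message and a two-entry excerpt of the mapping. The names INSTRUMENT_TYPES and mt_code are hypothetical stand-ins for the dictionary and function above.

```python
import re

# Shortened excerpt of the mapping, for illustration only
INSTRUMENT_TYPES = {"MT103": "wire", "MT110": "check"}

def mt_code(message):
    # Block 2 starts with 'I' (input) or 'O' (output) followed by the three-digit message type
    match = re.search(r'\{2:[IO](\d{3})', message)
    return "MT" + match.group(1) if match else "Unknown"

header = "{1:F01MYMBGB2L0XXX0000000000}{2:I103HBUKGB4BXXXN}"
code = mt_code(header)
print(code, INSTRUMENT_TYPES.get(code, "Unknown"))  # MT103 wire
print(mt_code("no application header"))             # Unknown
```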


In the following loop, each message in the mt103_messages list is processed to extract relevant data and construct a DataFrame. This DataFrame is then appended to a list, dfs, containing DataFrames for each message. The loop iterates through the messages, extracts transaction details, originator and beneficiary information, and intermediary BIC codes, creating a dictionary for each message. This dictionary is then used to create a DataFrame, which is added to the list for further concatenation into a final DataFrame.
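
The handling of the :32A: field (value date, currency, amount) is the most format-sensitive step of the loop. The sketch below shows one way to parse it in isolation, assuming the layout YYMMDD, a three-letter currency code, and an amount that uses a comma as the decimal separator. The helper name parse_32a is hypothetical, and the second input is a made-up value.

```python
import re
from datetime import datetime
from decimal import Decimal

def parse_32a(value):
    # Six digits for the date, three letters for the currency, then digits with a decimal comma
    match = re.fullmatch(r'(\d{6})([A-Z]{3})(\d+,\d*)', value.strip())
    if not match:
        return None
    date_text, currency, amount_text = match.groups()
    value_date = datetime.strptime(date_text, "%y%m%d").date()
    # Convert the decimal comma to a dot and drop it when no decimals follow
    amount = Decimal(amount_text.replace(',', '.').rstrip('.'))
    return value_date, currency, amount

print(parse_32a("210322USD5000,"))    # (datetime.date(2021, 3, 22), 'USD', Decimal('5000'))
print(parse_32a("210322EUR1234,56"))  # (datetime.date(2021, 3, 22), 'EUR', Decimal('1234.56'))
print(parse_32a("not a value field")) # None
```

Using Decimal rather than float keeps amounts exact, which matters when thresholds are compared in the risk analyses.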

In [496]:

empty_df = pd.DataFrame(columns=columns)
# Assuming mt103_messages is a list of MT103 message strings and extract_value is a predefined function
dfs = []
for mt103_message in mt103_messages:

    data = {}

    # Extract and format the transaction date from the :32A: field
    date_str = extract_value('32A', mt103_message)[:6]  # Assuming the date is always at the start
    transaction_year = int(date_str[:2]) + 2000  # Adjust the century as needed
    transaction_month = int(date_str[2:4])
    transaction_day = int(date_str[4:6])
    transaction_date = f"{transaction_day:02d}-{transaction_month:02d}-{transaction_year}"
    # Extract values from the MT103 message
    transaction_id = extract_value('20', mt103_message)
    transaction_message = extract_value('71G', mt103_message)

    value_field = extract_value('32A', mt103_message)
    transaction_currency_match = re.match(r'\d{6}([A-Z]{3})', value_field)
    transaction_currency = transaction_currency_match.group(1) if transaction_currency_match else None
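    # SWIFT amounts use a comma as the decimal separator, so convert it to a dot and drop a trailing separator
    transaction_amount = re.sub(r'^\d{6}[A-Z]{3}', '', value_field).replace(',', '.').rstrip('.') if transaction_currency else None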

    transaction_type = extract_value('23B', mt103_message)

    transaction_direction = determine_transaction_direction(mt103_message, known_originator, known_beneficiary)
    # Update the data dictionary with the new transaction_direction value
    # data["transaction_direction"] = transaction_direction

    transaction_status = None

    instrument_type = None

    # Try to extract the full name from tag 50K first
    originator_field = extract_value('50K', mt103_message)
    if originator_field:  # Check for the presence of the '50K' tag
        # Split the originator field into its components based on newlines
        originator_components = originator_field.split('\n')

        if len(originator_components) >= 4:
            originator_account_number = originator_components[0].strip()
            originator_full_name = originator_components[1].strip()
            originator_address = originator_components[2].strip()
            originator_Country_bic = originator_components[3].strip()

        if len(originator_components) >= 3:
            # Assume the first line is the account number and the second line is the full name
            originator_account_number = originator_components[0].strip()
            originator_full_name = originator_components[1].strip()
            
            # Dynamically extract the address
            # The address starts after the name and continues until we encounter a line that is likely a country or BIC
            address_components = []
            for component in originator_components[2:]:
                # Stop if we encounter a line that is likely a country or BIC
                if re.match(r"^[A-Z\s]+(?:[A-Z]{2})?$", component.strip()) or '/' in component:
                    break
                address_components.append(component.strip())
            originator_address = ', '.join(address_components)

            # Assumption: The full name can be split into first name, middle names, and last name.
            name_parts = originator_full_name.split()
            if len(name_parts) > 2:  # Assumption: There are middle names or a patronymic.
                originator_first_name = name_parts[0]
                originator_last_name = name_parts[-1]
                originator_middle_names_patronymic = ' '.join(name_parts[1:-1])
            elif len(name_parts) == 2:  # Only first name and last name.
                originator_first_name = name_parts[0]
                originator_middle_names_patronymic = ''
                originator_last_name = name_parts[1]
            else:  # Only one name available.
                originator_first_name = name_parts[0]
                originator_middle_names_patronymic = ''
                originator_last_name = ''
    elif "50A" in mt103_message:  # If there is no '50K' tag, check for '50A' tag
        originator_field = extract_value('50A', mt103_message)

        # Split the originator field into its components based on newlines
        originator_components = originator_field.split('\n')

        # Expected layout: account number, BIC, full name, then address lines
        if len(originator_components) >= 3:
            originator_account_number = originator_components[0].strip()
            originator_full_name = originator_components[2].strip()  # Third line for the full name
            originator_address = originator_components[-1].strip()  # Last line for the address

            # Split the full name in the same way as for tag 50K
            name_parts = originator_full_name.split()
            if len(name_parts) > 2:  # There are middle names or a patronymic
                originator_first_name = name_parts[0]
                originator_last_name = name_parts[-1]
                originator_middle_names_patronymic = ' '.join(name_parts[1:-1])
            elif len(name_parts) == 2:  # Only first name and last name
                originator_first_name = name_parts[0]
                originator_middle_names_patronymic = ''
                originator_last_name = name_parts[1]
            else:  # Only one name available
                originator_first_name = name_parts[0]
                originator_middle_names_patronymic = ''
                originator_last_name = ''

    #originator_fi_name
    fi_name_matches = re.findall(r':52A:/(.+?)\n(.+?)\n', mt103_message)
    if fi_name_matches:
        originator_fi_name = fi_name_matches[0][1].strip()
    else:
        fi_name_matches = re.findall(r':50A:/(.+?)\n(.+?)\n', mt103_message)
        if fi_name_matches:
            originator_fi_name = fi_name_matches[0][1].strip()
        else:
            originator_fi_name = None  # or some default value    

    # Originator FI country
    originator_fi_country = extract_originator_fi_country(mt103_message)

    # Originator country
    originator_country = extract_originator_country(mt103_message)

    # Originator BIC from tag 50K or 50A
    originator_bic = extract_originator_bic(mt103_message)

    # Originator branch ID
    originator_branch_id = extract_originator_branch_id(mt103_message)
    # (All of these values are added to the data dictionary when it is built further below)

    beneficiary_field = extract_value('59a', mt103_message)

    beneficiary_full_name = extract_beneficiary_name(mt103_message) 

    # Split the full name of the beneficiary
    beneficiary_first_name, beneficiary_middle_name_patronymic, beneficiary_last_name = split_beneficiary_name(beneficiary_full_name)

    # Extract the beneficiary BIC
    beneficiary_bic = extract_beneficiary_bic(mt103_message)

    # Extract the beneficiary branch ID
    beneficiary_branch_id = extract_beneficiary_branch_id(mt103_message)

    # Extract the beneficiary FI name
    beneficiary_fi_name = extract_beneficiary_fi_name(mt103_message)

    # Extract the beneficiary FI country
    beneficiary_fi_country = extract_beneficiary_fi_country(mt103_message)
    # (All of these values are added to the data dictionary when it is built further below)

    message_count += 1  # Increment the message counter

    # Hardcode the transaction status based on the message count
    # (stored in the data dictionary when it is built further below)
    if message_count <= 2:
        transaction_status = "accepted"
    else:
        transaction_status = "rejected"



    beneficiary_info = refined_extract_tag_59([mt103_message])  # Ensure it's a list
    if beneficiary_info:  # Making sure there's data returned
        beneficiary_info = beneficiary_info[0]  # Assuming one beneficiary per MT103
    else:
        # Handle cases where no beneficiary info is found
        beneficiary_info = {"beneficiary_address": None, "beneficiary_country": None, "beneficiary_account_number": None}

    intermediary_bic_codes = extract_intermediary_bic_codes(mt103_message)

    instrument_type = extract_mt_code_and_instrument_type(mt103_message)

    # Create a dictionary with extracted values
    data = {
        "transaction_date": transaction_date,
        "transaction_id": transaction_id,
        "transaction_message": transaction_message,
        "transaction_currency": transaction_currency,
        "transaction_amount": transaction_amount,
        "transaction_type": transaction_type,
        "transaction_direction": transaction_direction,
        "transaction_status": transaction_status,
        "instrument_type": instrument_type,
        "originator_full_name": originator_full_name,
        "originator_first_name": originator_first_name,
        "originator_middle_names_patronymic": originator_middle_names_patronymic,
        "originator_last_name": originator_last_name,
        "originator_address": originator_address,
        "originator_country": originator_country,
        "originator_account_number": originator_account_number,
        "originator_branch_id": originator_branch_id,
        "originator_bic": originator_bic,
        "originator_fi_name": originator_fi_name,
        "originator_fi_country": originator_fi_country,
        "incoming_intermediary_fi_bic": intermediary_bic_codes["incoming_intermediary_fi_bic"],
        "outgoing_intermediary_fi_bic": intermediary_bic_codes["outgoing_intermediary_fi_bic"],
        "beneficiary_full_name": beneficiary_full_name,
        "beneficiary_first_name": beneficiary_first_name,
        "beneficiary_middle_name_patronymic": beneficiary_middle_name_patronymic,
        "beneficiary_last_name": beneficiary_last_name,
        "beneficiary_address": beneficiary_info.get("beneficiary_address", ""),
        "beneficiary_country": beneficiary_info.get("beneficiary_country", ""),
        "beneficiary_account_number": beneficiary_info.get("beneficiary_account_number", ""),
        "beneficiary_branch_id": beneficiary_branch_id,
        "beneficiary_bic": beneficiary_bic,
        "beneficiary_fi_name": beneficiary_fi_name,
        "beneficiary_fi_country": beneficiary_fi_country,
    }
    # Append the transaction data to the list of dataframes
    dfs.append(pd.DataFrame([data]))

# Concatenate all dataframes in the list to create a single dataframe
df = pd.concat(dfs, ignore_index=True)

# Remove leading and trailing spaces from the values in the specified columns
df['originator_country'] = df['originator_country'].str.strip()
df['beneficiary_country'] = df['beneficiary_country'].str.strip()

# Replace 'UNITED STATES' with 'USA' in the country columns of the DataFrame
df['originator_country'] = df['originator_country'].replace('UNITED STATES', 'USA')
df['beneficiary_country'] = df['beneficiary_country'].replace('UNITED STATES','USA')
                                                              


The DataFrame df consolidates the essential information extracted from the messages. It serves as the basis for the detailed risk analyses that follow.

In [497]:
df

Unnamed: 0,transaction_date,transaction_id,transaction_message,transaction_currency,transaction_amount,transaction_type,transaction_direction,transaction_status,instrument_type,originator_full_name,...,beneficiary_first_name,beneficiary_middle_name_patronymic,beneficiary_last_name,beneficiary_address,beneficiary_country,beneficiary_account_number,beneficiary_branch_id,beneficiary_bic,beneficiary_fi_name,beneficiary_fi_country
0,22-03-2021,MT103 0001,/INS/THIS IS A PAYMENT FOR TUNA SUPPLY\n -},USD,5000,CRED,io,accepted,"(MT103, wire)",COMMERZBANK AG,...,NORDFISCH,,GMBH,"BODENSEE STR. 226, 22761 HAMBURG",GERMANY,GB57METR12345678901234,XXX,MYMBGB2LXXX,METRO BANK PLC,UNITED KINGDOM
1,22-03-2021,MT103 0001,/MSG/PAYMENT FOR GOODS\n -},USD,10000,CRED,o,accepted,"(MT103, wire)",ABC INDUSTRIES,...,XYZ,,SUPPLIERS,"123 HUANGPU ROAD, SHANGHAI",CHINA,CN123456789012345678,XXX,HSBCHKHHHKXXX,HSBC HONG KONG,HONG KONG
2,22-03-2021,MT103 0001,,USD,9899,CRED,io,rejected,"(MT103, wire)",ABC SUPPLIERS BV,...,AFRICAN,EXPORT-IMPORT,BANK,"LAGOS, NIGERIA, XYZ ENTERPRISES LTD, LAGOS",NIGERIA,PASSNGLAXXX,XXX,PASSNGLAXXX,AFRICAN EXPORT-IMPORT BANK,NIGERIA


## 2. Risk patterns

In this part of the assignment, we estimate fraud risk for a number of transaction patterns. In total, we base our risk assessment on seven metrics: round amount payments, payments from high-risk countries, smurfing, nesting, non-adherence to FATF Recommendation 16, transactions from shell companies, and trade-based money laundering. We calculate a risk score for each individual metric as well as a total cumulative risk score per transaction. Each score is appended as a column to the DataFrame created from the SWIFT messages, which lets interested parties set their own thresholds for individual and total risk. For example, stakeholders interested in smurfing can set a threshold of 0.7 on the corresponding column to investigate transactions with a high smurfing risk score, while stakeholders interested in total risk can set a threshold on the total risk column (for example 4.5) to filter for transactions with a high total risk score.
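
As a rough illustration of how such thresholds might be applied, the sketch below filters a small made-up DataFrame. The column names risk_rate_smurfing and risk_rate follow the ones created later in this notebook; the DataFrame example and all values in it are invented for illustration and are not taken from the SWIFT data.

```python
import pandas as pd

# Hypothetical scores, for illustration only (not taken from the SWIFT data)
example = pd.DataFrame({
    "transaction_id": ["T1", "T2", "T3"],
    "risk_rate_smurfing": [0.25, 0.75, 1.0],
    "risk_rate": [1.7, 4.6, 3.9],
})

# Stakeholders interested in smurfing: threshold on the individual column
high_smurfing = example[example["risk_rate_smurfing"] >= 0.7]

# Stakeholders interested in overall risk: threshold on the total column
high_total = example[example["risk_rate"] >= 4.5]

print(high_smurfing["transaction_id"].tolist())
print(high_total["transaction_id"].tolist())
```

Because every score is kept as its own column, each stakeholder can choose a different cut-off without recomputing anything.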

### Round Amount payments

The thresholds for round amount payments in the code are based on the principles discussed in the article "Round numbers: A fingerprint of fraud" by Mark J. Nigrini, PhD, published in the Journal of Accountancy in May 2018.

In the article, Nigrini explains how fraudsters often manipulate financial data by using round numbers, which can serve as a "fingerprint" of fraudulent activities. Round numbers are frequently used in financial statement fraud, occupational fraud, and bribery schemes to make fraudulent numbers appear more believable or to cover up irregularities.

To identify potential instances of fraud, Nigrini suggests analyzing transaction amounts that are exact multiples of specific round numbers. These round numbers are often associated with fraudulent activities because they are easier to fabricate and may not reflect the natural distribution of legitimate financial transactions.

Based on this literature, the thresholds in the code are set at amounts commonly associated with fraudulent activity: multiples of 1000, 500, and 250. In the article, transaction values that are divisible by 1000 are seen as most suspicious, followed by fractions of 1000 such as 500 and 250 (Nigrini, 2018). The risk score is therefore assigned according to this ranking: 1 for multiples of 1000, 0.5 for multiples of 500, and 0.25 for multiples of 250.

In [498]:
df['transaction_amount'] = df['transaction_amount'].astype(float)  # Convert to float


In [499]:
# Function to calculate risk factor for round amount payments
def calculate_round_amount_risk(row):
    # Check if the transaction amount is divisible by 1000
    if row['transaction_amount'] % 1000 == 0:
        return 1  # Assign risk factor 1 for transactions divisible by 1000
    # Check if the transaction amount is divisible by 500
    elif row['transaction_amount'] % 500 == 0:
        return 0.5  # Assign risk factor 0.5 for transactions divisible by 500
    # Check if the transaction amount is divisible by 250
    elif row['transaction_amount'] % 250 == 0:
        return 0.25  # Assign risk factor 0.25 for transactions divisible by 250
    else:
        return 0  # Not a round amount payment, assign risk factor 0

# Apply the function to each row and create a new column 'risk_rate_roundamount'
df['risk_rate_roundamount'] = df.apply(calculate_round_amount_risk, axis=1)

In [500]:
df

Unnamed: 0,transaction_date,transaction_id,transaction_message,transaction_currency,transaction_amount,transaction_type,transaction_direction,transaction_status,instrument_type,originator_full_name,...,beneficiary_middle_name_patronymic,beneficiary_last_name,beneficiary_address,beneficiary_country,beneficiary_account_number,beneficiary_branch_id,beneficiary_bic,beneficiary_fi_name,beneficiary_fi_country,risk_rate_roundamount
0,22-03-2021,MT103 0001,/INS/THIS IS A PAYMENT FOR TUNA SUPPLY\n -},USD,5000.0,CRED,io,accepted,"(MT103, wire)",COMMERZBANK AG,...,,GMBH,"BODENSEE STR. 226, 22761 HAMBURG",GERMANY,GB57METR12345678901234,XXX,MYMBGB2LXXX,METRO BANK PLC,UNITED KINGDOM,1
1,22-03-2021,MT103 0001,/MSG/PAYMENT FOR GOODS\n -},USD,10000.0,CRED,o,accepted,"(MT103, wire)",ABC INDUSTRIES,...,,SUPPLIERS,"123 HUANGPU ROAD, SHANGHAI",CHINA,CN123456789012345678,XXX,HSBCHKHHHKXXX,HSBC HONG KONG,HONG KONG,1
2,22-03-2021,MT103 0001,,USD,9899.0,CRED,io,rejected,"(MT103, wire)",ABC SUPPLIERS BV,...,EXPORT-IMPORT,BANK,"LAGOS, NIGERIA, XYZ ENTERPRISES LTD, LAGOS",NIGERIA,PASSNGLAXXX,XXX,PASSNGLAXXX,AFRICAN EXPORT-IMPORT BANK,NIGERIA,0


### Payments from high risk countries

In our analysis, we utilized various sources to compile lists of countries categorized by their financial secrecy and tax haven status. Firstly, we referred to the "Black and Grey" lists provided by the Financial Action Task Force (FATF). These lists contain countries identified as having deficiencies in their anti-money laundering and counter-terrorist financing measures. From this source, we extracted countries categorized as either blacklisted or greylisted. Blacklisted countries include those with significant deficiencies, while greylisted countries have made a commitment to address deficiencies but are yet to fully implement necessary reforms (FATF, n.d.).

Additionally, we consulted the Corporate Tax Haven Index 2021 published by the Tax Justice Network. This index ranks countries based on their level of financial secrecy and their role as tax havens. Countries with high secrecy scores are considered to facilitate financial opacity and potentially enable tax avoidance and evasion (Tax Justice Network, 2021).

Combining information from these sources, we categorized countries into three groups: blacklisted, greylisted, and ranked by their secrecy score. Unfortunately, due to limitations in data availability, we had to manually compile the secrecy scores for each country, as the Tax Justice Network's data portal was undergoing maintenance.

We then calculated the risk score from the FATF status of the originating country, which is blacklisted, greylisted, or unlisted and is weighted 1, 0.66, or 0.33 respectively. This weight is multiplied by the secrecy score of the beneficiary country divided by 100, so the result is once again a risk value between 0 and 1.
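
The arithmetic can be checked with a short worked example. This is a minimal sketch: the helper country_risk and the dictionary FATF_WEIGHTS are hypothetical names introduced here, while the weights (1, 0.66, 0.33) and the secrecy score of 57 for Germany are the values used in the code cell below, which looks up the secrecy score of the beneficiary country.

```python
# Weight for the originator country's FATF status (values as used in this notebook)
FATF_WEIGHTS = {"blacklisted": 1.0, "greylisted": 0.66, "unlisted": 0.33}

def country_risk(fatf_status, secrecy_score):
    """Return weight * secrecy_score / 100, a value between 0 and 1."""
    return FATF_WEIGHTS[fatf_status] * secrecy_score / 100

# Unlisted originator, beneficiary country with secrecy score 57 (Germany)
print(round(country_risk("unlisted", 57), 4))     # 0.1881
# Greylisted originator, same beneficiary country
print(round(country_risk("greylisted", 57), 4))   # 0.3762
# Blacklisted originator, same beneficiary country
print(round(country_risk("blacklisted", 57), 4))  # 0.57
```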

In [501]:
# Blacklisted countries with high risk
blacklisted = ["DEMOCRATIC PEOPLE'S REPUBLIC OF KOREA", "IRAN", "MYANMAR"]

# Greylisted countries with moderate risk
greylisted = ["BULGARIA", "BURKINA FASO", "CAMEROON", "CROATIA", "DEMOCRATIC REPUBLIC OF CONGO",
 "HAITI", "JAMAICA", "KENYA", "MALI", "MOZAMBIQUE", "NAMIBIA", "NIGERIA",
 "PHILIPPINES", "SENEGAL", "SOUTH AFRICA", "SOUTH SUDAN", "SYRIA", "TANZANIA",
 "TÜRKIYE", "VIETNAM", "YEMEN"]

# Data containing secrecy scores for different countries
data = {
"UNITED STATES": 67, "SWITZERLAND": 70, "SINGAPORE": 67, "HONG KONG": 65, "LUXEMBOURG": 55,
"JAPAN": 63, "GERMANY": 57, "UNITED ARAB EMIRATES": 79, "BRITISH VIRGIN ISLANDS": 71,
"GUERNSEY": 71, "CHINA": 66, "NETHERLANDS": 65, "UNITED KINGDOM": 47, "CAYMAN ISLANDS": 73,
"CYPRUS": 62, "SOUTH KOREA": 64, "TAIWAN": 60, "PANAMA": 73, "JERSEY": 63, "QATAR": 74,
"ITALY": 55, "BAHAMAS": 75, "THAILAND": 70, "VIETNAM": 81, "SAUDI ARABIA": 69,
"BELGIUM": 53, "IRELAND": 47, "CANADA": 51, "SPAIN": 57, "FRANCE": 48, "MACAO": 63,
"ISRAEL": 59, "ANGOLA": 79, "ALGERIA": 79, "KUWAIT": 75, "INDIA": 55, "AUSTRALIA": 56,
"MALTA": 54, "MALAYSIA": 66, "LIBERIA": 73, "KENYA": 67, "NIGERIA": 65, "RUSSIA": 60,
"AUSTRIA": 55, "GUATEMALA": 75, "SOUTH AFRICA": 60, "OMAN": 74, "NORWAY": 53,
"BERMUDA": 70, "SRI LANKA": 76, "MARSHALL ISLANDS": 71, "BANGLADESH": 75, "NEW ZEALAND": 63,
"LIECHTENSTEIN": 72, "MAURITIUS": 70, "EGYPT": 68, "PORTUGAL": 57, "ANGUILLA": 75,
"TURKEY": 61, "BAHRAIN": 68, "ISLE OF MAN": 65, "ROMANIA": 59, "BARBADOS": 74, "PUERTO RICO": 78,
"JORDAN": 72, "INDONESIA": 56, "SWEDEN": 45, "ST. KITTS AND NEVIS": 77, "VENEZUELA": 72,
"GHANA": 53, "URUGUAY": 58, "PHILIPPINES": 67, "CHILE": 60, "PAKISTAN": 66, "ARUBA": 71,
"HUNGARY": 55, "LEBANON": 65, "KAZAKHSTAN": 63, "MOROCCO": 66, "DENMARK": 49, "CAMEROON": 70,
"MEXICO": 53, "BRAZIL": 49, "DOMINICAN REPUBLIC": 65, "UKRAINE": 59, "POLAND": 46,
"US VIRGIN ISLANDS": 72, "FINLAND": 52, "SEYCHELLES": 72, "CURACAO": 76, "MALDIVES": 75,
"CZECHIA": 50, "TANZANIA": 69, "NAMIBIA": 71, "LATVIA": 55, "GIBRALTAR": 67,
"EL SALVADOR": 61, "RWANDA": 72, "GREECE": 53, "CROATIA": 53, "SLOVAKIA": 53, "TUNISIA": 60,
"LITHUANIA": 51, "SAMOA": 73, "COSTA RICA": 56, "BULGARIA": 53, "PERU": 54, "COLOMBIA": 54,
"BOLIVIA": 79, "SERBIA": 54, "ARGENTINA": 49, "VANUATU": 76, "BOTSWANA": 57, "ANDORRA": 55,
"BELIZE": 75, "ECUADOR": 52, "PARAGUAY": 66, "MONACO": 74, "MONTENEGRO": 61,
"TURKS AND CAICOS ISLANDS": 76, "FIJI": 70, "ST. VINCENT AND THE GRENADINES": 67,
"ALBANIA": 54, "NORTH MACEDONIA": 62, "ESTONIA": 44, "ICELAND": 42, "ANTIGUA AND BARBUDA": 77,
"DOMINICA": 65, "KOSOVO": 69, "TRINIDAD AND TOBAGO": 69, "COOK ISLANDS": 70, "GRENADA": 66,
"ST. LUCIA": 72, "GUAM": 70, "AMERICAN SAMOA": 69, "BRUNEI": 73, "SLOVENIA": 36,
"GAMBIA": 73, "NAURU": 59, "SAN MARINO": 60, "MONTSERRAT": 74,
}

# Copy the scores into the lookup dictionary used below
secrecy_scores = dict(data)
# The DataFrame stores the United States as 'USA', so add that key as an alias
secrecy_scores["USA"] = secrecy_scores["UNITED STATES"]

# Function to calculate the country risk rate for each row
def calculate_risk_rate(row):
    originator_country = row['originator_country']
    beneficiary_country = row['beneficiary_country']

    # Secrecy score of the beneficiary country, scaled to 0-1 (0 if the country is not in the table)
    secrecy = secrecy_scores.get(beneficiary_country, 0) / 100

    # Weight by the FATF status of the originator country
    if originator_country in blacklisted:
        weight = 1      # full weight for blacklisted countries
    elif originator_country in greylisted:
        weight = 0.66   # roughly 2/3 for greylisted countries
    else:
        weight = 0.33   # roughly 1/3 for all other countries

    return weight * secrecy

# Apply the function to each row and store the result in 'risk_rate_countries'
df['risk_rate_countries'] = df.apply(calculate_risk_rate, axis=1)

### Smurfing

We based our smurfing risk calculator on three aspects described by Joseph Ibitola (2023), who explores the deceptive strategy of breaking down large sums of illicit funds into smaller transactions to evade detection and circumvent anti-money laundering regulations. The calculator assesses the frequency of transactions within a 24-hour window, transaction amounts below a certain threshold, and the geographical spread of the transactions. By incorporating these criteria, the risk calculator aims to identify transactions exhibiting characteristics commonly associated with smurfing (Ibitola, 2023), enabling financial institutions to prioritize their monitoring and investigation efforts. Frequent transactions involving the same originator or beneficiary over a short period of time (24 hours) are considered the biggest indicator of fraud and are therefore given a higher risk score (0.5) than the other two factors (0.25 each).
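
The three indicators can be sketched without pandas as follows. This is a simplified illustration rather than the notebook's implementation: it tracks a single account role instead of both originator and beneficiary, and the function smurfing_score, its parameters, and the sample transactions are all hypothetical. The limits (more than 3 transactions in 24 hours, amounts below 10000) and the weights are those used in the code cell below.

```python
from datetime import datetime, timedelta

# Hypothetical transactions: (account, timestamp, amount, counterparty country)
transactions = [
    ("ACC1", datetime(2021, 3, 22, 9, 0), 4000, "GERMANY"),
    ("ACC1", datetime(2021, 3, 22, 11, 0), 3500, "GERMANY"),
    ("ACC1", datetime(2021, 3, 22, 15, 0), 2500, "CHINA"),
    ("ACC1", datetime(2021, 3, 22, 18, 0), 4500, "GERMANY"),
    ("ACC2", datetime(2021, 3, 22, 10, 0), 15000, "NIGERIA"),
]

def smurfing_score(current, history, max_count=3, amount_limit=10000):
    """Score one transaction against the three indicators (0.5 + 0.25 + 0.25)."""
    account, timestamp, amount, _ = current
    window_start = timestamp - timedelta(hours=24)
    # Transactions of the same account in the 24 hours up to the current one
    window = [t for t in history
              if t[0] == account and window_start <= t[1] <= timestamp]

    score = 0.0
    if len(window) > max_count:          # frequency indicator
        score += 0.5
    if amount < amount_limit:            # amount indicator
        score += 0.25
    if len({t[3] for t in window}) > 1:  # geographical spread indicator
        score += 0.25
    return score

scores = [smurfing_score(t, transactions) for t in transactions]
print(scores)
```

In this made-up data only the fourth ACC1 transaction triggers all three indicators, because it is the first one with more than three transactions in its 24-hour window.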

In [502]:
import pandas as pd
from datetime import timedelta

# Convert 'transaction_date' to datetime once, before scoring the rows
df['transaction_date'] = pd.to_datetime(df['transaction_date'], format='%d-%m-%Y')

# Function to calculate the smurfing risk score for each transaction
def calculate_risk_score(row):
    # Initialize risk score
    risk_score = 0

    # Window covering the 24 hours up to and including the current transaction
    end_time = row['transaction_date']
    start_time = end_time - timedelta(days=1)
    recent_transactions = df[(df['transaction_date'] >= start_time) & (df['transaction_date'] <= end_time)]

    # Transactions in the window that share the originator or beneficiary account
    same_originator = recent_transactions[recent_transactions['originator_account_number'] == row['originator_account_number']]
    same_beneficiary = recent_transactions[recent_transactions['beneficiary_account_number'] == row['beneficiary_account_number']]

    # Add 0.5 if either originator or beneficiary has more than 3 transactions within 24 hours
    if len(same_originator) > 3 or len(same_beneficiary) > 3:
        risk_score += 0.5

    # Add 0.25 if the transaction amount is less than 10000
    if row['transaction_amount'] < 10000:
        risk_score += 0.25

    # Add 0.25 if either account has transactions involving more than one country within 24 hours
    unique_originator_countries = same_originator['originator_country'].nunique()
    unique_beneficiary_countries = same_beneficiary['beneficiary_country'].nunique()
    if unique_originator_countries > 1 or unique_beneficiary_countries > 1:
        risk_score += 0.25

    return risk_score

# Apply the function to each row and create a new column 'risk_rate_smurfing'
df['risk_rate_smurfing'] = df.apply(calculate_risk_score, axis=1)


### Nesting 

The function calculates a "nesting risk" based on the number of intermediaries involved in both incoming and outgoing financial transactions. It starts by extracting the incoming and outgoing intermediaries from the DataFrame row. Then, it counts these intermediaries. The total number of intermediaries is determined by adding the counts of incoming and outgoing intermediaries. After counting the intermediaries, a risk score is assigned based on the number of intermediaries. If there are no intermediaries, the risk score is 0. If there is one intermediary, the score is 0.25. With two intermediaries, the score is 0.50. With three intermediaries, the score is 0.75. For more than three intermediaries, the score is 1.
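
The mapping from intermediary count to score is linear up to a cap, so it can also be written compactly. The sketch below is an equivalent formulation under the assumption, taken from the code cell that follows, that multiple BIC codes are stored as one ';'-separated string; the helper names count_bics and nesting_risk are hypothetical.

```python
def count_bics(value):
    """Count BIC codes in a ';'-separated string; None, 'None' or '' count as zero."""
    if value is None:
        return 0
    return len([bic for bic in str(value).split(';')
                if bic.strip() and bic.strip() != 'None'])

def nesting_risk(incoming, outgoing):
    """0.25 per intermediary, capped at 1 for more than three intermediaries."""
    total = count_bics(incoming) + count_bics(outgoing)
    return min(total, 4) * 0.25

print(nesting_risk(None, None))                                # no intermediaries
print(nesting_risk("COBADEHHXXX", None))                       # one intermediary
print(nesting_risk("COBADEHHXXX;MYMBGB2LXXX", "HBUKGB4BXXX"))  # three intermediaries
print(nesting_risk("A;B;C", "D;E"))                            # five placeholder codes
```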

In [503]:
def calculate_nesting_risk(row):
    # Count the number of intermediaries
    incoming_intermediaries = str(row['incoming_intermediary_fi_bic']).strip(';')
    outgoing_intermediaries = str(row['outgoing_intermediary_fi_bic']).strip(';')
    
    # Calculate total number of intermediaries
    total_intermediaries = 0
    
    # Count incoming intermediaries
    if incoming_intermediaries and incoming_intermediaries != 'None':
        total_intermediaries += len(incoming_intermediaries.split(';'))
    
    # Count outgoing intermediaries
    if outgoing_intermediaries and outgoing_intermediaries != 'None':
        total_intermediaries += len(outgoing_intermediaries.split(';'))
    
    # Determine the nesting risk based on the number of intermediaries
    if total_intermediaries == 0:
        risk_score = 0
    elif total_intermediaries == 1:
        risk_score = 0.25
    elif total_intermediaries == 2:
        risk_score = 0.50
    elif total_intermediaries == 3:
        risk_score = 0.75
    else:
        risk_score = 1
    
    return risk_score

# Apply the function to each row and create a new column 'risk_rate_nesting'
df['risk_rate_nesting'] = df.apply(calculate_nesting_risk, axis=1)


### Non-adherence to FATF Recommendation 16 


In [504]:
countries_list = [
    "Southern Rhodesia",
    "South Africa",
    "the Former Yugoslavia",
    "Haiti",
    "Angola",
    "Liberia",
    "Eritrea",
    "Ethiopia",
    "Rwanda",
    "Sierra Leone",
    "Côte d’Ivoire",
    "Iran",
    "Somalia",
    "Eritrea",
    "ISIL (Da’esh) and Al-Qaida",
    "Iraq",
    "Democratic Republic of the Congo",
    "Sudan",
    "Lebanon",
    "Democratic People’s Republic of Korea",
    "Libya",
    "the Taliban",
    "Guinea-Bissau",
    "Central African Republic",
    "Yemen",
    "South Sudan",
    "Mali"
]

# Upper-case the list once, because the country columns in df are upper case
countries_upper = {country.upper() for country in countries_list}

# Function to calculate risk factor for wire transfers
def calculate_wire_transfer_risk(row):
    risk_score = 0

    # Originator name is missing
    if row['originator_full_name'] == '':
        risk_score += 0.25

    # Beneficiary name is missing
    if row['beneficiary_full_name'] == '':
        risk_score += 0.25

    # Accepted transaction whose instrument type is not 'wire'
    # (the output above shows values such as (MT103, wire), which never equal the string 'wire')
    if row['instrument_type'] != 'wire' and row['transaction_status'] == 'accepted':
        risk_score += 0.25

    # Beneficiary country appears in countries_list (designations under relevant UN resolutions)
    if str(row['beneficiary_country']).upper() in countries_upper:
        risk_score += 0.25

    return risk_score

# Apply the function to each row and create a new column 'risk_score_wiretransfer'
df['risk_score_wiretransfer'] = df.apply(calculate_wire_transfer_risk, axis=1)

### Shell companies

Alvarez & Marsal (2023) discuss the risks posed by shell companies, including their potential use for illicit financial activities such as money laundering, tax evasion, and corruption, as well as the lack of transparency and regulatory oversight surrounding their operations. They highlight how susceptible shell companies are to abuse by individuals seeking to conceal their identity and engage in illicit financial transactions. They also describe indicators such as generic terms commonly used in shell company names and addresses that consist of a PO Box or lack a specific location. The code therefore looks for generic terms in the originator's name; if one of the listed terms is part of the name, a risk score is awarded. It then checks the originator's address: if the address is not stated or consists of a PO Box, a further risk score is awarded.

In [505]:
# Function to calculate risk factor for shell companies
def calculate_shell_company_risk(row):
    risk_score = 0

    # Lower-case both fields so the checks are case-insensitive (the extracted values are upper case)
    name = str(row['originator_full_name']).lower()
    address = row['originator_address']
    address = address.strip().lower() if isinstance(address, str) else ''

    # Check if the company name contains a generic term often used in shell company names
    generic_terms = ['investments', 'holdings', 'consulting', 'services', 'management']
    if any(term in name for term in generic_terms):
        risk_score += 0.5

    # Check if the address is missing, is a PO Box, or lacks a specific location
    if not address or 'po box' in address or 'unknown' in address:
        risk_score += 0.5

    return risk_score

# Apply the function to each row and create a new column 'risk_score_shellcompany'
df['risk_score_shellcompany'] = df.apply(calculate_shell_company_risk, axis=1)

### Trade based money laundering

The following code aims to calculate the risk of trade-based money laundering (TBML) in international trade transactions. It begins by merging data from multiple CSV files to create a comprehensive DataFrame (merged_df) containing details of trade activities between different economies. By grouping this data based on trading country combinations, the code creates a summarized DataFrame (grouped_df) to analyze trade dynamics.

To assess TBML risk, the code calculates import-export ratios for each partner combination and normalizes them to the range 0 to 1. These normalized ratios are used as risk scores that indicate the potential for TBML. Subsequently, the code takes the originator and beneficiary countries of each transaction extracted from the SWIFT messages and combines them into a country-pair identifier.

Using these identifiers, the code retrieves the corresponding risk score from grouped_df for each transaction and appends it as a new column to the SWIFT message DataFrame.

The data is gathered from the World Trade Organization (2022).
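
The normalization step is a standard min-max rescaling, sketched below on made-up numbers. The function min_max_normalize, the pair names, and the ratios are hypothetical and do not come from the WTO data; returning zeros when all ratios are equal is an assumption of this sketch, not something the notebook code handles.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: there is no spread to rescale, so return zeros
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical ratios for three country pairs (not WTO data)
pair_ratios = {
    "COUNTRY A COUNTRY B": 0.5,
    "COUNTRY C COUNTRY D": 1.0,
    "COUNTRY E COUNTRY F": 2.5,
}
risk_by_pair = dict(zip(pair_ratios, min_max_normalize(list(pair_ratios.values()))))
print(risk_by_pair)

# Pairs that are not in the table fall back to a risk score of 0
print(risk_by_pair.get("COUNTRY X COUNTRY Y", 0))
```

One consequence of this scaling is that the scores are relative: the pair with the largest ratio always receives 1 and the pair with the smallest always receives 0, whatever the absolute values are.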

In [506]:
# Read the data from CSV files
df_data = pd.read_csv('data.csv')
df_other_economy = pd.read_csv('other_economy.csv')
df_selected_economy = pd.read_csv('selected_economy.csv')

# Merge based on 'Partner' and 'Reporter' columns
merged_df = pd.merge(df_data, df_other_economy, on='Partner', how='outer')
merged_df = pd.merge(merged_df, df_selected_economy, on='Reporter', how='outer')

# Drop rows with NaN values in the 'L3' column
merged_df.dropna(subset=['L3'], inplace=True)

# Create a new column by adding 'PartnerName' and 'ReporterName' together
merged_df['Combined_Names'] = merged_df['PartnerName'] + ' ' + merged_df['ReporterName']

# Group by 'Combined_Names' and calculate the mean of numeric columns while retaining 'PartnerName' and 'ReporterName'
grouped_df = merged_df.groupby('Combined_Names').agg({'PartnerName': 'first', 'ReporterName': 'first', 
                                                      'Reporter_Total_Imports': 'mean', 'Partner_Total_Imports': 'mean',
                                                      'Reporter_Total_Exports': 'mean', 'Partner_Total_Exports': 'mean'})

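# Reset the index to make 'Combined_Names' a column again
grouped_df.reset_index(inplace=True)
# Drop the aggregate rows; .copy() avoids a SettingWithCopyWarning when columns are added below
grouped_df = grouped_df[~grouped_df['Combined_Names'].str.contains('the rest of the World')].copy()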

# Calculate import-export ratios
grouped_df['Export_Import_Ratio_partner'] = grouped_df['Partner_Total_Exports'] / grouped_df['Partner_Total_Imports']
grouped_df['Import_Export_Ratio_reporter'] = grouped_df['Reporter_Total_Imports'] / grouped_df['Reporter_Total_Exports']

# Calculate the ratio between partner and reporter import-export ratios
grouped_df['Ratio_between_countries'] = grouped_df['Export_Import_Ratio_partner'] / grouped_df['Import_Export_Ratio_reporter']

# Define the minimum and maximum values of the ratio
min_ratio = grouped_df['Ratio_between_countries'].min()
max_ratio = grouped_df['Ratio_between_countries'].max()

# Normalize the ratios to range [0, 1]
normalized_ratios = (grouped_df['Ratio_between_countries'] - min_ratio) / (max_ratio - min_ratio)

# Use the normalized ratio directly as the risk score and add it as a new column
grouped_df['Risk_Score'] = normalized_ratios

# Convert values in the 'Combined_Names' column to uppercase
grouped_df['Combined_Names'] = grouped_df['Combined_Names'].str.upper()

# Initialize an empty list to store risk scores
risk_scores = []

# Iterate over each row in your DataFrame
for index, row in df.iterrows():
    # Extract originator and beneficiary countries
    originator_country = row['originator_country']
    beneficiary_country = row['beneficiary_country']
    
    # Combine originator and beneficiary countries
    combined_countries = f"{originator_country} {beneficiary_country}"
    
    # Search for combined names in grouped_df and extract risk score
    matched_row = grouped_df[grouped_df['Combined_Names'] == combined_countries]
    
    # If a match is found, extract the risk score
    if not matched_row.empty:
        risk_score = matched_row.iloc[0]['Risk_Score']
    else:
        # If no match is found, assign a default risk score
        risk_score = 0
    
    # Append the risk score to the list
    risk_scores.append(risk_score)

# Add the risk scores as a new column to your DataFrame
df['risk_rate_tbml'] = risk_scores

### Total risk

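As explained above, the total risk is calculated as the row-wise sum of the individual risk columns. The code below sums every column whose name starts with risk_rate_; the FATF Recommendation 16 and shell company scores are stored as risk_score_wiretransfer and risk_score_shellcompany and are therefore not included in this total.

In [507]:
# Get the individual risk columns (the trailing underscore excludes the total column itself on a re-run)
risk_rate_columns = [col for col in df.columns if col.startswith('risk_rate_')]

# Sum the values of these columns row-wise
df['risk_rate'] = df[risk_rate_columns].sum(axis=1)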

In [508]:
df.head()

Unnamed: 0,transaction_date,transaction_id,transaction_message,transaction_currency,transaction_amount,transaction_type,transaction_direction,transaction_status,instrument_type,originator_full_name,...,beneficiary_fi_name,beneficiary_fi_country,risk_rate_roundamount,risk_rate_countries,risk_rate_smurfing,risk_rate_nesting,risk_score_wiretransfer,risk_score_shellcompany,risk_rate_tbml,risk_rate
0,2021-03-22,MT103 0001,/INS/THIS IS A PAYMENT FOR TUNA SUPPLY\n -},USD,5000.0,CRED,io,accepted,"(MT103, wire)",COMMERZBANK AG,...,METRO BANK PLC,UNITED KINGDOM,1,0.1881,0.25,0.25,0.25,0,0.0,1.6881
1,2021-03-22,MT103 0001,/MSG/PAYMENT FOR GOODS\n -},USD,10000.0,CRED,o,accepted,"(MT103, wire)",ABC INDUSTRIES,...,HSBC HONG KONG,HONG KONG,1,0.2178,0.0,0.5,0.25,0,0.079669,1.797469
2,2021-03-22,MT103 0001,,USD,9899.0,CRED,io,rejected,"(MT103, wire)",ABC SUPPLIERS BV,...,AFRICAN EXPORT-IMPORT BANK,NIGERIA,0,0.2145,0.25,0.5,0.0,0,0.073092,1.037592


Literature:

Nigrini, M. J. (2018, May 1). Round numbers: A fingerprint of fraud. Journal of Accountancy. https://www.journalofaccountancy.com/issues/2018/may/fraud-round-numbers.html

Financial Action Task Force (FATF). (n.d.). Black and Grey lists. Retrieved from https://www.fatf-gafi.org/en/countries/black-and-grey-lists.html

Tax Justice Network. (2021). Corporate Tax Haven Index 2021. Retrieved from https://fsi.taxjustice.no/fsi/2022/world/score/top

Ibitola, J. (2023, July 5). Understanding smurfing in money laundering. Retrieved from https://www.flagright.com/post/smurfing-in-money-laundering

Alvarez & Marsal. (2023, August 11). Understanding the risks of shell & shelf companies. Alvarez & Marsal | Management Consulting | Professional Services. https://www.alvarezandmarsal.com/insights/understanding-risks-shell-shelf-companies

World Trade Organization. (2022). Trade Connectivity Heatmap. https://stats.wto.org/dashboard/tradeconnectivity_en.html