# <h1 style="text-align: left;font-size: 2em;">  <strong>EDA Notebook</strong></h1>

---

## <h2 style="font-size: 1.6em; font-weight: bold;"> Life Cycle of a Machine Learning Project </h2>

- Problem Definition
- Data Collection
- Data Cleaning / Preprocessing
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Model Training
- Evaluation
- Choose best model

---

## <h2 style="font-size: 1.6em; font-weight: bold;"> Understanding the Problem Statement </h2>

- **Goal:** Predict whether a transaction is fraudulent using transaction-level data.  
- **Source:** Simulated Nordic banking transactions (based on Nordea's format).

---

## <h2 style="font-size: 1.6em; font-weight: bold;"> Data Collection </h2>

- Simulated **10,000 transaction records** using Python.
- Fields are designed to resemble real-world banking transactions based on **Nordea Open Banking APIs**.

**Data Description**
| **Field Name**                  | **Description**                                                               |
|--------------------------------|------------------------------------------------------------------------------- |
| `transaction_id`               | Unique transaction ID                                                          |
| `booking_date`                 | When transaction was booked                                                    |
| `value_date`                   | When transaction value was transferred                                         |
| `transaction_date`             | Actual transaction execution date                                              |
| `payment_date`                 | Payment processing date                                                        |
| `amount`                       | Transaction amount                                                             |
| `currency`                     | Transaction currency                                                           |
| `from_account_id`              | Sender account ID                                                              |
| `from_account_name`            | Sender account name                                                            |
| `from_account_country`         | Sender country (critical for trade war checks)                                 |
| `from_account_business_type`   | Sender business type (e.g., Textile, Electronics)                              |
| `from_account_expected_turnover` | Expected annual turnover (€, $)                                              |
| `counterparty_account_id`      | Receiver account ID (if available)                                             |
| `counterparty_name`            | Receiver account name                                                          |
| `counterparty_country`         | Receiver country (critical for trade war checks)                               |
| `counterparty_bank_bic`        | Receiver bank BIC code                                                         |
| `counterparty_business_type`   | Receiver business type (if available)                                          |
| `narrative`                    | Transaction narrative or description                                           |
| `payment_purpose_code`         | Standard code explaining purpose (e.g., Invoice, Salary)                       |
| `fx_conversion_flag`           | Indicates if currency was converted (Y/N)                                      |
| `related_trade_invoice_id`     | Linked invoice ID (if available)                                               |
| `swift_message_type`           | SWIFT type if cross-border (e.g., MT103, MT202)                                |
| `transaction_status`           | Status of the transaction (e.g., billed, pending, failed)                      |
| `transaction_type_description` | Internal description (e.g., "salary payment")                                  |
| `end_to_end_identification`    | Chain ID for linked payments (useful for smurfing detection)                   |


## <h2 style="font-size: 1.6em; font-weight: bold;"> 1. Data Collection </h2>
Import Required Packages and Data

**Importing Pandas, NumPy, Matplotlib, Seaborn, Plotly Express, and the warnings library.**

In [95]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

**Import the CSV Data as Pandas DataFrame**

In [96]:
df = pd.read_csv('data/nordic_transactions_with_fraud.csv')
df.head()

Unnamed: 0,transaction_id,booking_date,value_date,transaction_date,payment_date,amount,currency,from_account_id,from_account_name,from_account_country,...,counterparty_business_type,narrative,payment_purpose_code,fx_conversion_flag,related_trade_invoice_id,swift_message_type,transaction_status,transaction_type_description,end_to_end_identification,is_fraud
0,6592b26c-95e1-49b2-ac85-07af117b117a,2024-05-10,2024-11-27,2024-10-27,2025-03-16,14915.24,SEK,GB98WTGY27137540331877,"Black, Martin and Osborne",SE,...,Electronics,enhance cross-media schemas,Unknown,N,,MT940,pending,Card Payment,,0
1,39682303-9436-4b0b-8263-c80a05af9c2c,2024-12-09,2025-03-08,2024-10-21,2025-03-26,199664.19,EUR,GB68PXAS43329104116740,Turner Ltd,FI,...,Shipping,expedite bleeding-edge web-readiness,Unknown,N,dce76dd3-3b5c-471b-89ab-e0e5899e20ba,MT202,failed,BG-LI-LÖN,,0
2,af5a4a9d-2e8b-436c-8149-efee00eb40c0,2024-08-22,2024-04-29,2025-03-17,2024-09-22,332447.25,EUR,GB65JSXU67625306714407,Jones-Benitez,FI,...,Electronics,incentivize e-business e-business,Invoice,N,,MT103,pending,Direct Debit,,0
3,c91d6f80-1820-4ad1-b5f1-4878cf8f4a16,2024-08-21,2025-01-21,2025-03-28,2025-03-18,184652.31,NOK,GB79KKPR14790262231975,Rivera and Sons,NO,...,Consulting,utilize open-source schemas,Unknown,N,,MT103,pending,Card Payment,5fb0f5cb-6ae9-4d5d-a620-8c13c728636d,0
4,9ae1a733-dc39-452b-af0b-76d2c14b7914,2024-10-21,2025-02-27,2024-09-20,2025-03-26,315372.76,EUR,GB07JQEO03033666209261,Hammond-Hobbs,FI,...,Electronics,reinvent front-end experiences,Consulting,Y,3302e624-735c-44a6-ab2c-07b67b70fd1d,MT103,pending,Wire Transfer,,0


**Shape of the dataset**

In [97]:
df.shape

(10300, 26)

We have a dataset with 10,300 records and 26 columns.

In [98]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10300 entries, 0 to 10299
Data columns (total 26 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   transaction_id                  10300 non-null  object 
 1   booking_date                    10300 non-null  object 
 2   value_date                      10300 non-null  object 
 3   transaction_date                10300 non-null  object 
 4   payment_date                    10300 non-null  object 
 5   amount                          10300 non-null  float64
 6   currency                        10300 non-null  object 
 7   from_account_id                 10300 non-null  object 
 8   from_account_name               10300 non-null  object 
 9   from_account_country            10300 non-null  object 
 10  from_account_business_type      9997 non-null   object 
 11  from_account_expected_turnover  10300 non-null  float64
 12  counterparty_account_id         

## <h2 style="font-size: 1.6em; font-weight: bold;"> 2. Data Wrangling </h2>
- Handled missing values
- Removed duplicate entries
- Verified and adjusted data types
- Filtered billed data
- Converted currency amounts using dynamic foreign exchange (FX) rates
- Generated dataset statistics

**Create a Copy of the Dataset**

In [99]:
df_copy= df.copy()

**2.1 Handled Missing Values**

In [100]:
# Percentage of missing values per column
df_copy.isnull().sum() * 100 / len(df_copy)

transaction_id                     0.000000
booking_date                       0.000000
value_date                         0.000000
transaction_date                   0.000000
payment_date                       0.000000
amount                             0.000000
currency                           0.000000
from_account_id                    0.000000
from_account_name                  0.000000
from_account_country               0.000000
from_account_business_type         2.941748
from_account_expected_turnover     0.000000
counterparty_account_id            0.000000
counterparty_name                  4.737864
counterparty_country               0.000000
counterparty_bank_bic              0.000000
counterparty_business_type         0.000000
narrative                          5.252427
payment_purpose_code               0.000000
fx_conversion_flag                 0.000000
related_trade_invoice_id          71.640777
swift_message_type                 0.000000
transaction_status              

Columns with missing values and the percentage missing in each:
- `narrative`                  5.3%
- `counterparty_name`          4.7%
- `from_account_business_type` 2.9%
- `related_trade_invoice_id`   71.6%
- `end_to_end_identification`  48.3%


Replace missing values with 0

In [101]:
df_copy.fillna(0,inplace=True)
df_copy.isnull().sum()

transaction_id                    0
booking_date                      0
value_date                        0
transaction_date                  0
payment_date                      0
amount                            0
currency                          0
from_account_id                   0
from_account_name                 0
from_account_country              0
from_account_business_type        0
from_account_expected_turnover    0
counterparty_account_id           0
counterparty_name                 0
counterparty_country              0
counterparty_bank_bic             0
counterparty_business_type        0
narrative                         0
payment_purpose_code              0
fx_conversion_flag                0
related_trade_invoice_id          0
swift_message_type                0
transaction_status                0
transaction_type_description      0
end_to_end_identification         0
is_fraud                          0
dtype: int64
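
Note that a blanket `fillna(0)` also writes the integer `0` into text columns such as `narrative` and `counterparty_name`. As one possible alternative, the sketch below fills by column type: a placeholder string for non-numeric columns and `0` for numeric ones. The toy frame and the `"Unknown"` placeholder are assumptions for illustration, not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_copy; the values are made up for illustration.
toy = pd.DataFrame({
    "narrative": ["invoice 17", None, "salary"],
    "counterparty_name": [None, "Turner Ltd", "Jones-Benitez"],
    "amount": [120.5, np.nan, 99.0],
})

# Percentage of missing values per column.
missing_pct = toy.isnull().mean() * 100

# Split columns into numeric and non-numeric (text) groups.
num_cols = toy.select_dtypes(include="number").columns
text_cols = toy.select_dtypes(exclude="number").columns

# Fill each group with a value of the matching type.
filled = toy.copy()
filled[text_cols] = filled[text_cols].fillna("Unknown")
filled[num_cols] = filled[num_cols].fillna(0)
```

Keeping text columns as text avoids mixed-type columns later, for example when counting categories or encoding features.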

**2.2 Removed Duplicate Entries**

In [102]:
df_copy.duplicated().sum()

500

The data has 500 duplicate records. Drop all duplicate entries.

In [103]:
df_copy.drop_duplicates(inplace=True)
df_copy.duplicated().sum()

0
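
`duplicated()` flags a row only when every column matches an earlier row, and `drop_duplicates()` keeps the first occurrence by default. A small sketch on a toy frame (the values are made up) shows both behaviors, plus the `subset` argument for the case where the transaction ID alone should decide what counts as a duplicate.

```python
import pandas as pd

toy = pd.DataFrame({
    "transaction_id": ["a1", "a2", "a1", "a3", "a2"],
    "amount": [10.0, 20.0, 10.0, 30.0, 25.0],
})

# Full-row duplicates: only the second ("a1", 10.0) row repeats an earlier row.
n_full_dupes = int(toy.duplicated().sum())

# Default keep="first" retains the first occurrence and the original index labels.
deduped = toy.drop_duplicates()

# Treat rows with the same transaction_id as duplicates, whatever the amount.
deduped_by_id = toy.drop_duplicates(subset="transaction_id")
```

Because the original index labels are retained, the later `info()` output reports `Index: 9800 entries, 0 to 10299` rather than a fresh `RangeIndex`; `reset_index(drop=True)` would renumber the rows if that is preferred.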

**2.3 Verified and adjusted data types**

Convert `booking_date`, `transaction_date`, `value_date`, and `payment_date` from object type to datetime format

In [104]:
df_copy['booking_date']=pd.to_datetime(df_copy['booking_date'])
df_copy['transaction_date']=pd.to_datetime(df_copy['transaction_date'])
df_copy['value_date']=pd.to_datetime(df_copy['value_date'])
df_copy['payment_date']=pd.to_datetime(df_copy['payment_date'])
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9800 entries, 0 to 10299
Data columns (total 26 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   transaction_id                  9800 non-null   object        
 1   booking_date                    9800 non-null   datetime64[ns]
 2   value_date                      9800 non-null   datetime64[ns]
 3   transaction_date                9800 non-null   datetime64[ns]
 4   payment_date                    9800 non-null   datetime64[ns]
 5   amount                          9800 non-null   float64       
 6   currency                        9800 non-null   object        
 7   from_account_id                 9800 non-null   object        
 8   from_account_name               9800 non-null   object        
 9   from_account_country            9800 non-null   object        
 10  from_account_business_type      9800 non-null   object        
 11  from_acc
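
The four conversions can also be written as one loop, and once the columns are datetimes they can be compared directly. The sketch below uses a toy frame with two of the date columns; `errors="coerce"` (which turns unparseable strings into `NaT`) and the date-ordering check are additions for illustration, not steps taken in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "booking_date": ["2024-05-10", "2024-08-22", "not a date"],
    "value_date": ["2024-11-27", "2024-04-29", "2025-01-21"],
})

date_cols = ["booking_date", "value_date"]
for col in date_cols:
    # Unparseable strings become NaT instead of raising an exception.
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

# Rows where the value date precedes the booking date.
# Comparisons involving NaT evaluate to False.
value_before_booking = toy["value_date"] < toy["booking_date"]
```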

**2.4 Filtered billed data**

In [105]:
df_copy=df_copy.loc[df_copy['transaction_status']=='billed']
df_copy.shape

(3466, 26)

After filtering for billed transactions, the DataFrame contains 3,466 records.

**2.5 Convert currency**

In [139]:
import pandas as pd
from datetime import datetime, timedelta

# Currencies present in the data
currencies = df_copy['currency'].unique()

# Use yesterday's date for the rate table
single_date = datetime.today() - timedelta(days=1)
# Read the exchange rate table (base currency is EUR)
df_rates = pd.read_html(f'https://www.xe.com/currencytables/?from=EUR&date={single_date.strftime("%Y-%m-%d")}')[0]

# Build rate_dict: currency -> EUR per unit.
# Currencies missing from the table fall back to a rate of 1.0.
rate_dict = {
    cur: df_rates[df_rates['Currency'] == cur]['EUR per unit'].values[0]
    if cur in df_rates['Currency'].values else 1.0
    for cur in currencies
}
# Convert amounts to EUR by multiplying by EUR per unit
df_copy['amount_eur'] = df_copy.apply(
    lambda x: float(x['amount']) if x['currency'] == 'EUR'
    else float(x['amount']) * rate_dict.get(x['currency'], 1.0),
    axis=1
)

df_copy[['amount', 'currency','amount_eur']]


Unnamed: 0,amount,currency,amount_eur
10,321196.70,EUR,321196.700000
16,307485.88,SEK,28049.142782
23,314987.93,DKK,42196.769133
25,407620.24,NOK,34578.630419
29,119791.23,SEK,10927.465399
...,...,...,...
10287,327131.56,EUR,327131.560000
10290,499091.61,SEK,45527.592455
10291,499316.70,SEK,45548.125371
10292,400309.01,SEK,36516.553471
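
"EUR per unit" is the price of one unit of a currency in EUR, so converting an amount to EUR is a multiplication by that rate. The same step can be vectorized with `Series.map`; the sketch below does that and raises on currencies missing from the rate table instead of falling back to a rate of 1.0. The rates are hard-coded placeholder values, not actual quotes, and the helper name `to_eur` is hypothetical.

```python
import pandas as pd

# Placeholder "EUR per unit" rates; illustrative values, not real quotes.
eur_per_unit = {"EUR": 1.0, "SEK": 0.09, "NOK": 0.085, "DKK": 0.134}

toy = pd.DataFrame({
    "amount": [100.0, 1000.0, 200.0],
    "currency": ["EUR", "SEK", "DKK"],
})


def to_eur(frame: pd.DataFrame, rates: dict) -> pd.Series:
    """Return amounts converted to EUR, failing loudly on unknown currencies."""
    rate = frame["currency"].map(rates)  # NaN where no rate is known
    unknown = frame.loc[rate.isna(), "currency"].unique()
    if len(unknown) > 0:
        raise KeyError(f"No rate for currencies: {sorted(unknown)}")
    return frame["amount"] * rate


toy["amount_eur"] = to_eur(toy, eur_per_unit)
```

Failing on an unknown currency is a design choice: a silent rate of 1.0 would treat, say, a SEK amount as if it were already in EUR and inflate it by roughly an order of magnitude.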


In [25]:
import urllib.request

try:
    # Riksbank SWEA API: cross rate between the SEKAUDPMI and SEKBGNPMI series,
    # here with both a start date and an end date in the path.
    url = "https://api.riksbank.se/swea/v1/CrossRates/SEKAUDPMI/SEKBGNPMI/2025-04-29/2025-04-29"

    # Request headers
    hdr = {
        'Cache-Control': 'no-cache',
    }

    req = urllib.request.Request(url, headers=hdr)

    req.get_method = lambda: 'GET'
    response = urllib.request.urlopen(req)
    print(response.getcode())  # HTTP status code
    print(response.read())     # raw JSON payload as bytes
except Exception as e:
    print(e)

200
b'[{"date":"2025-04-29","value":1.10161}]'


In [24]:
import urllib.request

try:
    # Same cross-rate request, but with only a start date in the path;
    # the output below then also includes a later observation.
    url = "https://api.riksbank.se/swea/v1/CrossRates/SEKAUDPMI/SEKBGNPMI/2025-04-29"

    # Request headers
    hdr = {
        'Cache-Control': 'no-cache',
    }

    req = urllib.request.Request(url, headers=hdr)

    req.get_method = lambda: 'GET'
    response = urllib.request.urlopen(req)
    print(response.getcode())  # HTTP status code
    print(response.read())     # raw JSON payload as bytes
except Exception as e:
    print(e)

200
b'[{"date":"2025-04-29","value":1.10161},{"date":"2025-04-30","value":1.09889}]'
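
The response body is a JSON array of objects with `date` and `value` keys, returned as bytes. A minimal sketch of decoding it into a date-to-rate mapping follows; it uses the payload printed above as a literal so no network call is needed, and the helper name `parse_rates` is hypothetical.

```python
import json

# Payload copied from the output above; in the notebook it would come
# from response.read().
raw = b'[{"date":"2025-04-29","value":1.10161},{"date":"2025-04-30","value":1.09889}]'


def parse_rates(payload: bytes) -> dict:
    """Decode the JSON array into a {date string: rate} mapping."""
    return {row["date"]: row["value"] for row in json.loads(payload)}


rates = parse_rates(raw)
```

A mapping keyed by date makes it straightforward to look up the rate for a given transaction date instead of applying a single day's rate to every row.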
