# Transaction Risk Analysis Code

**This notebook implements an anomaly detection system that uses the Local Outlier Factor (LOF) algorithm to flag potentially fraudulent transactions in a simulated financial dataset.**

#### Key Points About This Approach

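- It combines transaction-level features with per-customer behavior features
- LOF is well suited to detecting local anomalies: points whose local density is much lower than that of their nearest neighbors in feature space. The per-customer features are what allow it to pick up transactions that are unusual for a specific customer's pattern
- The code handles edge cases (missing customer IDs, single-transaction customers)
- The contamination parameter of 0.01 tells the model to flag about 1% of transactions as outliers; it is an assumed fraud rate, not one estimated from the data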
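
To make the last two points concrete, here is a minimal sketch on synthetic 2-D data (not the bank dataset). The cluster positions, the seed, and all variable names are assumptions chosen for illustration. It plants one point just outside a tight cluster and checks that LOF ranks it as the most anomalous, while `contamination` determines how many points are flagged in total.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# A tight cluster, a loose cluster, and one point just outside the tight one
tight = rng.normal(loc=0.0, scale=0.1, size=(200, 2))
loose = rng.normal(loc=10.0, scale=2.0, size=(200, 2))
planted = np.array([[1.0, 1.0]])  # close in absolute terms, far relative to the tight cluster's spread
X = np.vstack([tight, loose, planted])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)               # -1 = outlier, 1 = inlier
scores = lof.negative_outlier_factor_     # lower = more anomalous

n_flagged = int((labels == -1).sum())
print("flagged:", n_flagged, "of", len(X))
print("planted point label:", labels[-1])
```

The planted point is much closer to its cluster than many loose-cluster points are to theirs, yet it is the one LOF scores lowest, because its neighborhood is far denser than it is.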

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('synthetic_bank_transfers.csv')
df

| | Transaction Type | Sender Name | Sender Account | Recipient Name | Recipient Account | Amount | Timestamp |
|---|---|---|---|---|---|---|---|
| 0 | Wire Transfer | Deborah Stuart | GB25YFSJ32955239077145 | Daniel Mullen | GB06JCLA52666027047459 | 6158.14 | 2022-06-19 05:15:21.098123 |
| 1 | Check | Jason Wagner | GB61ZBQL97930805722596 | Michael Roberts | GB83SHNZ43420904498234 | 5336.63 | 2020-11-02 16:34:49.373628 |
| 2 | Check | Tina Davis | GB05COMK46550155019277 | Kayla Johns | GB74RQHI02704652515696 | 6916.41 | 2022-03-24 06:17:17.523424 |
| 3 | Wire Transfer | Matthew Herrera | GB42DZCA70665166124949 | Thomas Guzman | GB40RDOB66268771119965 | 6607.40 | 2020-03-15 14:31:01.927106 |
| 4 | Check | Eric Adams | GB93FDXA69268202738226 | Kayla Cowan | GB79UKKK91772426185654 | 9285.42 | 2021-03-18 14:35:56.603051 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 5025 | Check | Kathleen Carr | GB39JGHL63974355550775 | Jennifer Acosta | GB44OVTN32634517404775 | 943.63 | 2022-12-26 03:47:10.229748 |
| 5026 | Deposit | Jennifer Lewis | GB80HCSH29479959432011 | Thomas Thompson | GB65DMCU14967511343462 | 3959.86 | 2022-04-09 09:42:55.043718 |
| 5027 | Deposit | Jeffrey Barry | GB50UUEB70881547566660 | Brandon Walters | GB29OLCR28467598078146 | 4197.70 | 2023-09-14 05:43:01.557469 |
| 5028 | Check | Derrick Vazquez | GB95RVHB66145642195021 | Carol Williams | GB71ZKBO96054061575689 | 6164.83 | 2022-07-06 05:51:10.609749 |


In [3]:
print(df.info())
print(df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5030 entries, 0 to 5029
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Transaction Type   5030 non-null   object 
 1   Sender Name        5030 non-null   object 
 2   Sender Account     5030 non-null   object 
 3   Recipient Name     5030 non-null   object 
 4   Recipient Account  5030 non-null   object 
 5   Amount             5030 non-null   float64
 6   Timestamp          5030 non-null   object 
dtypes: float64(1), object(6)
memory usage: 275.2+ KB
None
             Amount
count   5030.000000
mean    5286.773097
std     3604.408641
min      100.950000
25%     2700.607500
50%     5262.165000
75%     7608.537500
max    49211.290000


## Clean Data

This includes:

- Datetime Conversion: Converts Timestamp to datetime so that features like hour, day, and month can be extracted.
- Missing Values Check: Counts missing values per column. The output below shows none.
- Feature Engineering: Extracts features from the Timestamp column: the hour of the transaction, the day of the week, the month, and a weekend flag.
- Dropping Irrelevant Columns: Removes Sender Name and Recipient Name, as they are unlikely to contribute to fraud detection.
- Encoding Categorical Variables: Uses one-hot encoding (via `pd.get_dummies`) to convert Transaction Type into numerical features.
- Standardizing Numerical Features: Rescales the Amount column to zero mean and unit variance so that large raw amounts do not dominate distance calculations (important for distance-based methods such as DBSCAN and LOF).

In [4]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# 1. Convert 'Timestamp' column to datetime
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

# 2. Check for missing values
print("Missing values:\n", df.isnull().sum())

# 3. Feature engineering (extracting time-based features)
df['Transaction Hour'] = df['Timestamp'].dt.hour
df['Transaction Day'] = df['Timestamp'].dt.dayofweek  # Monday=0, Sunday=6
df['Transaction Month'] = df['Timestamp'].dt.month
df['Weekend'] = df['Transaction Day'].apply(lambda x: 1 if x >= 5 else 0)

# 4. Drop 'Sender Name' and 'Recipient Name' (if not needed for modeling)
df.drop(columns=['Sender Name', 'Recipient Name'], inplace=True)

# 5. Encode 'Transaction Type'
df = pd.get_dummies(df, columns=['Transaction Type'], drop_first=True)

# 6. Standardize 'Amount'
scaler = StandardScaler()
df['Amount'] = scaler.fit_transform(df[['Amount']])

# 7. Check data types
print("\nData info after preprocessing:")
print(df.info())

# Use 'Sender Account' as the customer identifier (assumes one account per customer)
df['CustomerID'] = df['Sender Account']

Missing values:
 Transaction Type     0
Sender Name          0
Sender Account       0
Recipient Name       0
Recipient Account    0
Amount               0
Timestamp            0
dtype: int64

Data info after preprocessing:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5030 entries, 0 to 5029
Data columns (total 10 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   Sender Account                  5030 non-null   object        
 1   Recipient Account               5030 non-null   object        
 2   Amount                          5030 non-null   float64       
 3   Timestamp                       5030 non-null   datetime64[ns]
 4   Transaction Hour                5030 non-null   int32         
 5   Transaction Day                 5030 non-null   int32         
 6   Transaction Month               5030 non-null   int32         
 7   Weekend                         5030 non-null   int64

**Customer Identification**

In [None]:
# 1. First ensure we have a CustomerID column
if 'CustomerID' not in df.columns:
    # Use 'Sender Account' as CustomerID if available
    if 'Sender Account' in df.columns:
        df['CustomerID'] = df['Sender Account']
    # Alternatively use another identifier if available
    elif 'AccountID' in df.columns:
        df['CustomerID'] = df['AccountID']
    else:
        # If no customer identifier exists, create a dummy one
        print("Warning: No customer identifier found - using transaction index as CustomerID")
        df['CustomerID'] = df.index.astype(str)

# 2. Verify the CustomerID exists and has proper values
print("\nCustomerID sample values:")
print(df['CustomerID'].head())
print("\nNumber of unique customers:", df['CustomerID'].nunique())

**Basic Feature Engineering**

In [None]:
# 3. Now proceed with the feature engineering
transaction_features = df[[
    'Amount',
    'Transaction Hour',
    'Transaction Day',
    'Weekend'
]].copy()

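# Add time since the customer's previous transaction, in hours.
# Sort by timestamp within each customer first so that diff() compares
# consecutive transactions; the result keeps the original index, so it
# aligns with df on assignment.
df['TimeSinceLast'] = (
    df.sort_values(['CustomerID', 'Timestamp'])
      .groupby('CustomerID')['Timestamp']
      .diff()
      .dt.total_seconds() / 3600
)
transaction_features['TimeSinceLast'] = df['TimeSinceLast'].fillna(24)  # Default 24h for first transactions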
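
As a small worked example of the per-customer time gap, the sketch below uses a hypothetical four-row frame (`toy` and its values are made up for illustration). The rows are deliberately out of chronological order, so the sketch sorts by timestamp within each customer before calling `diff()`; the result is indexed like the original frame and can be assigned back directly.

```python
import pandas as pd

# Hypothetical transactions, intentionally not in chronological order
toy = pd.DataFrame({
    'CustomerID': ['A', 'A', 'B', 'A'],
    'Timestamp': pd.to_datetime([
        '2022-01-01 12:00',
        '2022-01-01 09:00',
        '2022-01-02 08:00',
        '2022-01-01 18:00',
    ]),
})

# Hours since the same customer's previous transaction (NaN for the first one)
toy['TimeSinceLast'] = (
    toy.sort_values(['CustomerID', 'Timestamp'])
       .groupby('CustomerID')['Timestamp']
       .diff()
       .dt.total_seconds() / 3600
)

# Same default as in the notebook: 24 hours for a customer's first transaction
toy['TimeSinceLast_filled'] = toy['TimeSinceLast'].fillna(24)
print(toy)
```

Customer A's 12:00 transaction gets a gap of 3 hours (from 09:00) and the 18:00 one a gap of 6 hours; both first transactions fall back to the 24-hour default.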

**Customer Behavior Features**

Customer behavior features are computed only when at least one customer has more than one transaction; otherwise they are all set to 0.

The aggregation groups by customer to create features such as:
- Average and standard deviation of transaction amounts
- Average transaction hour and its variability
- Weekend transaction ratio

In [None]:
# 4. Create customer behavior features only if we have multiple transactions per customer
if df['CustomerID'].nunique() < len(df):
    customer_behavior = df.groupby('CustomerID').agg({
        'Amount': ['mean', 'std', 'count'],
        'Transaction Hour': ['mean', 'std'],
        'Weekend': 'mean'
    })
    customer_behavior.columns = ['_'.join(col).strip() for col in customer_behavior.columns.values]
    
    # Calculate additional metrics
    customer_behavior['Amount_coef_var'] = customer_behavior['Amount_std'] / (customer_behavior['Amount_mean'] + 1e-6)
    customer_behavior['Hour_consistency'] = customer_behavior['Transaction Hour_std'] / (customer_behavior['Transaction Hour_mean'] + 1e-6)
    
    # Select features to join
    customer_features = customer_behavior[[
        'Amount_mean',
        'Amount_std',
        'Amount_coef_var',
        'Hour_consistency',
        'Weekend_mean'
    ]].rename(columns={'Weekend_mean': 'Weekend_ratio'})
    
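    # transaction_features has no CustomerID column, so look up each row's
    # customer through df (same index) and attach the per-customer aggregates
    per_row = df[['CustomerID']].join(customer_features, on='CustomerID').drop(columns='CustomerID')
    transaction_features = transaction_features.join(per_row)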
else:
    print("Warning: Only one transaction per customer - skipping customer behavior features")
    transaction_features['Amount_mean'] = 0
    transaction_features['Amount_std'] = 0
    transaction_features['Amount_coef_var'] = 0
    transaction_features['Hour_consistency'] = 0
    transaction_features['Weekend_ratio'] = 0

# Fill NA values
transaction_features.fillna(0, inplace=True)
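
Since every customer in this dataset has a single transaction, the aggregation branch above never runs here. The sketch below shows what it would produce on a hypothetical frame in which one customer has several transactions; `toy`, `profile`, and the values are assumptions for illustration, and only three of the aggregates are reproduced.

```python
import pandas as pd

# Hypothetical data: customer A has three transactions, customer B has one
toy = pd.DataFrame({
    'CustomerID': ['A', 'A', 'A', 'B'],
    'Amount': [100.0, 120.0, 110.0, 5000.0],
    'Weekend': [0, 1, 0, 1],
})

# One row per customer
profile = toy.groupby('CustomerID').agg(
    Amount_mean=('Amount', 'mean'),
    Amount_std=('Amount', 'std'),      # NaN for single-transaction customers
    Weekend_ratio=('Weekend', 'mean'),
)

# Broadcast the per-customer profile back onto each transaction row
per_row = toy[['CustomerID']].join(profile, on='CustomerID').drop(columns='CustomerID')
features = toy[['Amount']].join(per_row).fillna(0)
print(features)
```

Each of customer A's rows carries the same mean (110) and standard deviation (10), while customer B's standard deviation is undefined for a single transaction and is filled with 0.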

**LOF Analysis**

Applies LOF algorithm with:
- Number of neighbors set to minimum of 20 or (number of samples - 1)
- Contamination rate of 1% (expecting ~1% outliers)

In [18]:
# 5. Proceed with LOF analysis
from sklearn.preprocessing import RobustScaler
from sklearn.neighbors import LocalOutlierFactor

scaler = RobustScaler()
X_lof = scaler.fit_transform(transaction_features)

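lof = LocalOutlierFactor(n_neighbors=min(20, len(X_lof)-1), contamination=0.01)
# Note: fit_predict returns labels (-1 = outlier, 1 = inlier), not continuous
# scores, so 'LOF_Score' holds a label despite its name
df['LOF_Score'] = lof.fit_predict(X_lof)
df['LOF_Outlier'] = (df['LOF_Score'] == -1).astype(int)

# Display results
print(f"\nDetected {df['LOF_Outlier'].sum()} outlier transactions ({df['LOF_Outlier'].mean():.2%})")
print("\nTop 5 riskiest transactions:")
# All outliers share the label -1, so nsmallest returns the first five
# flagged rows in index order rather than a ranking by risk
print(df.nsmallest(5, 'LOF_Score')[['Timestamp', 'CustomerID', 'Amount', 'LOF_Score']])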


CustomerID sample values:
0    GB25YFSJ32955239077145
1    GB61ZBQL97930805722596
2    GB05COMK46550155019277
3    GB42DZCA70665166124949
4    GB93FDXA69268202738226
Name: CustomerID, dtype: object

Number of unique customers: 5030

Detected 51 outlier transactions (1.01%)

Top 5 riskiest transactions:
                     Timestamp              CustomerID     Amount  LOF_Score
135 2021-01-26 08:31:18.161286  GB11GJBK55102819390890   5.697275         -1
141 2020-12-24 03:23:31.925294  GB52VOFI12376628773685   1.245897         -1
257 2022-04-28 07:33:00.782881  GB82QLJW69004109738010  -1.434046         -1
332 2022-12-05 22:31:18.056673  GB18EUSO29562851939941   2.012662         -1
369 2021-03-14 01:10:58.072148  GB90GFRK74601695010029  12.187543         -1
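
`fit_predict` returns only the labels -1 and 1, which is why every row above shows an `LOF_Score` of -1 and the five rows cannot be ordered by risk. scikit-learn also exposes a continuous score through the fitted estimator's `negative_outlier_factor_` attribute (lower means more anomalous). The following is a minimal sketch on random data, not the bank dataset; the column names `LOF_Label` and `LOF_Factor` and the three planted points are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)  # arbitrary seed
X_demo = rng.normal(size=(500, 3))

# Plant three obvious outliers, each far from the bulk in a different direction
X_demo[0] = [8.0, 0.0, 0.0]
X_demo[1] = [0.0, -9.0, 0.0]
X_demo[2] = [0.0, 0.0, 10.0]

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X_demo)  # -1 / 1 labels only

demo = pd.DataFrame({
    'LOF_Label': labels,
    'LOF_Factor': lof.negative_outlier_factor_,  # continuous; lower = more anomalous
})

# Rank by the continuous factor instead of the label
top = demo.nsmallest(3, 'LOF_Factor')
print(top)
```

In the notebook, the same idea would mean storing `lof.negative_outlier_factor_` in its own column after fitting and sorting on that column.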


## Conclusion

While this implementation successfully identified structural outliers, true fraud detection requires richer transaction history and labeled data. Future work should focus on expanding the dataset, refining features, and validating results against known fraud cases to build a more robust risk-scoring system.

The analysis flagged 51 transactions (1.01%) as outliers, a share fixed in advance by `contamination=0.01`. The model's inputs included:

- Transaction Amount (standardized, so a negative value means a below-average amount, not a reversal; the smallest raw amount in the dataset is 100.95)
- Time of Transaction (several of the flagged transactions shown occurred late at night or early in the morning)
- Weekend vs. Weekday Activity
- Time Since Last Transaction (constant at the 24-hour default here, because every customer has only one transaction)
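
Because `Amount` was standardized during cleaning, its sign says only whether a transaction is above or below the mean. The sketch below maps standardized values back to currency units with `StandardScaler.inverse_transform`. The five amounts are illustrative values based on the summary statistics printed earlier, and `amount_scaler` is a hypothetical name (the notebook reuses `scaler` for the later `RobustScaler`, so a separate variable would be needed there).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative raw amounts (all positive), one per row
amounts = np.array([[100.95], [2700.61], [5262.17], [7608.54], [49211.29]])

amount_scaler = StandardScaler()
scaled = amount_scaler.fit_transform(amounts)

# Below-average amounts become negative after standardization
print(scaled.ravel())

# inverse_transform recovers the original currency values
restored = amount_scaler.inverse_transform(scaled)
print(restored.ravel())
```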

### Key Findings

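1. Data Limitations:
- The dataset contained only one transaction per customer (5,030 unique sender accounts across 5,030 rows), preventing the model from learning customer-specific behavioral patterns.
- Without historical data, the model could only detect anomalies relative to the dataset as a whole (e.g., extreme amounts, unusual times) rather than customer-specific deviations.

2. Detected Anomalies:
- Because `LOF_Score` stores the -1/1 label, the five transactions printed are the first five flagged rows by index, not a ranking by risk.
- One has a below-average amount (standardized value -1.43); all raw amounts are positive, so this is not a refund.
- Some are late-night/early-morning transactions (e.g., 01:10 and 03:23).
- Two have unusually high amounts compared to the dataset’s distribution (standardized values of 5.70 and 12.19).

3. Model Performance:
- The LOF algorithm flagged 1.01% of transactions. This follows directly from the contamination=0.01 parameter and is not evidence of accuracy.
- Without labeled fraud cases, we cannot yet measure precision or recall (i.e., how many flagged transactions are truly fraudulent, and how many fraudulent transactions were flagged).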
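
Once labeled cases exist, precision and recall could be computed along the following lines. This is a sketch with made-up arrays: `is_fraud` stands in for ground-truth labels that the current dataset does not have, and `lof_outlier` for the model's 0/1 flags.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical ground truth (1 = confirmed fraud) and model flags (1 = flagged)
is_fraud    = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
lof_outlier = np.array([0, 0, 1, 0, 0, 0, 1, 1, 0, 0])

# Precision: share of flagged transactions that are truly fraudulent
precision = precision_score(is_fraud, lof_outlier)
# Recall: share of fraudulent transactions that were flagged
recall = recall_score(is_fraud, lof_outlier)

print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this toy case two of the three flagged transactions are fraudulent and two of the three fraudulent transactions are flagged, so both metrics equal 2/3.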