# Project Directory Structure

In [2]:
# Create project directory structure 
!mkdir data
!mkdir data\raw
!mkdir data\processed
!mkdir data\synthetic
!mkdir notebooks
!mkdir src
!mkdir docs
!mkdir docs\technical
!mkdir docs\business
!mkdir dashboard
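
The `mkdir` commands above use Windows path separators (`data\raw`), so they would not produce the intended nested folders on macOS or Linux. A portable alternative is sketched below with the standard-library `pathlib`. The `create_project_tree` helper and its `root` parameter are illustrative names, not part of the original project.

```python
from pathlib import Path

# Folder layout taken from the mkdir commands above.
PROJECT_DIRS = [
    "data/raw",
    "data/processed",
    "data/synthetic",
    "notebooks",
    "src",
    "docs/technical",
    "docs/business",
    "dashboard",
]


def create_project_tree(root="."):
    """Create the project folders under `root` and return their paths."""
    created = []
    for rel in PROJECT_DIRS:
        path = Path(root) / rel
        # parents=True builds intermediate folders such as data/ and docs/;
        # exist_ok=True makes the call safe to re-run.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_project_tree()` from the project root would build the same tree as the shell commands, and re-running it does not raise an error if the folders already exist.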

### Initialize Git repository

In [5]:
!git init

Initialized empty Git repository in C:/Users/Obaidullah/Videos/AML/.git/


### Create README.md

In [8]:
%%writefile README.md
# Investment Banking AML Detection System

## Overview
This project implements an advanced Anti-Money Laundering (AML) detection system focused on the investment banking sector. It uses data analytics and machine learning to identify suspicious patterns related to:

- Trade-Based Money Laundering (TBML)
- Casino-based money laundering schemes
- Complex financing and leasing arrangements

## Project Structure
- `notebooks/`: Jupyter notebooks for analysis and model development
- `data/`: Financial transaction datasets and synthetic patterns
- `src/`: Supporting Python modules
- `dashboard/`: Interactive visualization dashboard
- `docs/`: Technical and business documentation

## Getting Started
1. Clone this repository
2. Install requirements: `pip install -r requirements.txt`
3. Run notebooks in sequence starting with `01_data_preparation.ipynb`

## Features
- Transaction anomaly detection
- Network analysis of financial relationships
- Risk scoring system
- Interactive visualization dashboard
- Narrative-driven case studies

## Technologies
- Python, Pandas, Scikit-learn
- NetworkX for relationship analysis
- Plotly & Dash for visualization
- Jupyter Notebooks for development

Writing README.md


### Requirements

In [11]:
%%writefile requirements.txt
pandas==2.0.0
numpy==1.24.3
scikit-learn==1.2.2
matplotlib==3.7.1
seaborn==0.12.2
plotly==5.14.1
dash==2.9.3
networkx==3.1
jupyter==1.0.0
nbformat>=5.7.0
requests

Writing requirements.txt
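
Because the notebook later installs packages with an unpinned `pip install`, the environment can drift from the versions listed above. The sketch below is one way to compare the pins against what is actually installed, using only the standard library. `parse_requirements` and `report_installed` are hypothetical helper names, and the parser handles only the simple `name==version` and `name>=version` lines used in this file.

```python
import re
from importlib import metadata

# Matches lines such as "pandas==2.0.0" or "nbformat>=5.7.0".
_REQ_LINE = re.compile(r"^([A-Za-z0-9_.\-]+)\s*(==|>=)\s*([\w.]+)$")


def parse_requirements(text):
    """Return {package: (operator, version)} for simple pinned lines."""
    parsed = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        match = _REQ_LINE.match(line)
        if match:
            name, op, version = match.groups()
            parsed[name.lower()] = (op, version)
    return parsed


def report_installed(requirements):
    """Pair each requirement with the installed version, or None if absent."""
    report = {}
    for name, (op, wanted) in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = {"wanted": f"{op}{wanted}", "installed": installed}
    return report
```

A typical use would be `report_installed(parse_requirements(open("requirements.txt").read()))`, then scanning the result for entries whose installed version is `None` or differs from the pin.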


### Create .gitignore file

In [14]:
%%writefile .gitignore
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# Jupyter Notebook
.ipynb_checkpoints

# Virtual Environment
venv/
ENV/

# Data files
data/raw/*.csv
data/raw/*.xlsx
data/processed/*.csv
data/synthetic/*.csv
*.csv
*.xlsx
*.parquet

# Credentials
.env
config.ini
*credentials*

# OS specific
.DS_Store
Thumbs.db

Writing .gitignore


# AML Detection System: Data Preparation

This notebook covers Phase 1 of our AML Detection System project focused on investment banking, including:

1. Data acquisition from public sources
2. Exploratory data analysis
3. Synthetic data generation for money laundering patterns

## Project Overview

We're building an AML detection system for investment banking that uncovers advanced money laundering schemes through visualization and narrative-driven insights. The system will focus on detecting patterns such as Trade-Based Money Laundering (TBML), casino exploitation, and complex financing arrangements.

## Setup and Dependencies

First, we'll install and import all the necessary libraries for our analysis.

# Install required packages

In [22]:
!pip install pandas numpy matplotlib seaborn scikit-learn networkx requests plotly



Now let's import all the libraries we'll need for data acquisition and processing.

# Import libraries

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import json
import os
import random
from datetime import datetime, timedelta
import networkx as nx
from sklearn.preprocessing import StandardScaler
import plotly.express as px
import plotly.graph_objects as go

# Set random seed for reproducibility

In [29]:
np.random.seed(42)
random.seed(42)
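
`np.random.seed` seeds NumPy's legacy global state, so any other code that draws from that state shifts the sequence. NumPy's documentation recommends the `Generator` API for new code. A minimal sketch follows; the `rng` name is illustrative, and the functions in this notebook would need to accept and use such a generator instead of calling `np.random.*` directly.

```python
import numpy as np

# A dedicated generator keeps the random stream independent of global state.
rng = np.random.default_rng(42)

# Two generators built from the same seed produce identical draws.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
sample_a = rng_a.poisson(10, size=5)
sample_b = rng_b.poisson(10, size=5)
print(np.array_equal(sample_a, sample_b))  # True
```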

# Configure visualizations

In [32]:
plt.style.use('ggplot')
sns.set(style="whitegrid")
%matplotlib inline

### Create Directory Structure

We'll create the necessary directory structure for our project to store the data we'll be working with.

In [52]:
# Get the current working directory
current_dir = os.getcwd()
print(f"Current working directory: {current_dir}")

# Create necessary directories with absolute paths
data_dir = os.path.join(current_dir, 'data')
raw_dir = os.path.join(data_dir, 'raw')
processed_dir = os.path.join(data_dir, 'processed')
synthetic_dir = os.path.join(data_dir, 'synthetic')

os.makedirs(raw_dir, exist_ok=True)
os.makedirs(processed_dir, exist_ok=True)
os.makedirs(synthetic_dir, exist_ok=True)

print(f"Created directories:")
print(f"- Raw data: {raw_dir}")
print(f"- Processed data: {processed_dir}")
print(f"- Synthetic data: {synthetic_dir}")

Current working directory: C:\Users\Obaidullah\Videos\AML
Created directories:
- Raw data: C:\Users\Obaidullah\Videos\AML\data\raw
- Processed data: C:\Users\Obaidullah\Videos\AML\data\processed
- Synthetic data: C:\Users\Obaidullah\Videos\AML\data\synthetic


## 1. Data Acquisition

In this section, we'll acquire financial transaction data from public sources, particularly focusing on FINTRAC (Financial Transactions and Reports Analysis Centre of Canada) data.

### 1.1 Download Function for FINTRAC Data

First, we'll define a function to download data from the Open Government Portal. This function will:
- Check if the file already exists locally
- Download the file if it doesn't exist
- Return the data as a DataFrame

In [40]:
def download_fintrac_data(url, save_path):
    """
    Downloads FINTRAC data from the specified URL and saves it to the given path.
    
    Args:
        url (str): URL to download the data from
        save_path (str): Path where the data will be saved
    
    Returns:
        pandas.DataFrame: The downloaded data as a DataFrame
    """
    # Check if file already exists
    if os.path.exists(save_path):
        print(f"Loading existing file from {save_path}")
        return pd.read_csv(save_path)
    
    # Download the file
    print(f"Downloading data from {url}")
    response = requests.get(url)
    
    if response.status_code == 200:
        # Save the file
        with open(save_path, 'wb') as f:
            f.write(response.content)
        print(f"Data saved to {save_path}")
        
        # Return the data as a DataFrame
        return pd.read_csv(save_path)
    else:
        print(f"Failed to download data: {response.status_code}")
        return None
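
The function above sends the request without a timeout and signals failure by printing a message and returning `None`. The sketch below shows one stricter variant under stated assumptions: `fetch_bytes` and `cached_download` are hypothetical names, the 30-second timeout is an arbitrary choice, and the `fetch` parameter exists only so the caching logic can be exercised without network access. It returns the saved path rather than a DataFrame.

```python
import os


def fetch_bytes(url, timeout=30):
    """Download `url` and return the body, raising on HTTP errors."""
    import requests  # imported lazily so the cache path works offline

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turns 404 and similar into an exception
    return response.content


def cached_download(url, save_path, fetch=fetch_bytes):
    """Return `save_path`, downloading the file first if it is missing."""
    if os.path.exists(save_path):
        return save_path  # reuse the local copy

    content = fetch(url)
    # Create the parent folder if needed, then write the file.
    os.makedirs(os.path.dirname(os.path.abspath(save_path)), exist_ok=True)
    with open(save_path, "wb") as f:
        f.write(content)
    return save_path
```

The returned path can be passed to `pd.read_csv`. With this design a failed download surfaces as an exception (a subclass of `requests.exceptions.RequestException`) that the caller can catch, rather than a `None` that must be checked.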

### 1.2 Download FINTRAC Data

Now we'll download two key FINTRAC datasets:
1. Suspicious Transaction Reports
2. Large Cash Transaction Reports

Note: The URLs below are placeholders and would need to be replaced with actual URLs from the Open Government Portal in a real implementation.

In [44]:
# URLs for FINTRAC data (example URLs - replace with actual FINTRAC data URLs)
fintrac_suspicious_transactions_url = "https://open.canada.ca/data/dataset/5e756e0c-d254-49d6-9813-a82c61b762dc/resource/f3bcc5db-68e2-454a-82a8-73bdfd99ef4e/download/fintrac-suspicious-transaction-reports.csv"
fintrac_large_cash_transactions_url = "https://open.canada.ca/data/dataset/5e756e0c-d254-49d6-9813-a82c61b762dc/resource/bafaaf6a-5b0c-4eef-9652-cad254b8d6a9/download/fintrac-large-cash-transaction-reports.csv"

# Attempt to download the data
try:
    suspicious_transactions_df = download_fintrac_data(
        fintrac_suspicious_transactions_url,
        os.path.join(raw_dir, 'fintrac_suspicious_transactions.csv')
    )

    large_cash_transactions_df = download_fintrac_data(
        fintrac_large_cash_transactions_url,
        os.path.join(raw_dir, 'fintrac_large_cash_transactions.csv')
    )
    
    # Check if data was successfully downloaded
    if suspicious_transactions_df is not None and large_cash_transactions_df is not None:
        print("Successfully downloaded FINTRAC data!")
    else:
        print("Failed to download one or more FINTRAC datasets.")
        print("Will proceed with synthetic data generation.")
except Exception as e:
    print(f"Error downloading FINTRAC data: {e}")
    print("Will proceed with synthetic data generation.")

Downloading data from https://open.canada.ca/data/dataset/5e756e0c-d254-49d6-9813-a82c61b762dc/resource/f3bcc5db-68e2-454a-82a8-73bdfd99ef4e/download/fintrac-suspicious-transaction-reports.csv
Failed to download data: 404
Downloading data from https://open.canada.ca/data/dataset/5e756e0c-d254-49d6-9813-a82c61b762dc/resource/bafaaf6a-5b0c-4eef-9652-cad254b8d6a9/download/fintrac-large-cash-transaction-reports.csv
Failed to download data: 404
Failed to download one or more FINTRAC datasets.
Will proceed with synthetic data generation.


### 1.3 Generate Synthetic FINTRAC-like Data

Because both placeholder URLs returned 404 errors, and real FINTRAC data may be inaccessible or limited in any case, we'll create a synthetic dataset that mimics the structure and patterns of real FINTRAC data. This is particularly useful for development and testing.

Our synthetic data will include:
- Temporal information (year, month)
- Sector categories (banking, securities, etc.)
- Report types (STR, LCTR, etc.)
- Geographic information (provinces, postal code prefixes)
- Transaction counts and amounts

First, let's define a function to generate synthetic FINTRAC-like data:

In [47]:
def generate_synthetic_fintrac_data(n_records=1000):
    """
    Generates synthetic FINTRAC-like data for suspicious transaction reports.
    
    Args:
        n_records (int): Number of records to generate
        
    Returns:
        pandas.DataFrame: DataFrame containing synthetic FINTRAC-like data
    """
    # Define possible values for categorical columns
    sectors = [
        'Banking', 'Securities Dealers', 'Money Services Businesses', 
        'Life Insurance', 'Real Estate', 'Casinos', 'Accountants', 'Notaries'
    ]
    
    report_types = [
        'Suspicious Transaction Report', 'Large Cash Transaction Report',
        'Electronic Funds Transfer Report', 'Casino Disbursement Report'
    ]
    
    provinces = [
        'Ontario', 'Quebec', 'British Columbia', 'Alberta', 'Manitoba',
        'Saskatchewan', 'Nova Scotia', 'New Brunswick', 'Newfoundland and Labrador',
        'Prince Edward Island', 'Northwest Territories', 'Nunavut', 'Yukon'
    ]
    
    # Generate data with realistic distributions
    # Banking and Securities sectors have higher representation
    # STRs and LCTRs are more common than other report types
    data = {
        'Year': np.random.choice(range(2015, 2025), n_records),
        'Month': np.random.choice(range(1, 13), n_records),
        'Sector': np.random.choice(sectors, n_records, p=[0.35, 0.2, 0.15, 0.1, 0.1, 0.05, 0.03, 0.02]),
        'Report_Type': np.random.choice(report_types, n_records, p=[0.4, 0.3, 0.2, 0.1]),
        'Province': np.random.choice(provinces, n_records),
        'Count': np.random.poisson(10, n_records),  # Number of reports follows Poisson distribution
        'Amount': np.random.exponential(10000, n_records) * np.random.lognormal(0, 0.5, n_records),  # Transaction amounts follow heavy-tailed distribution
    }
    
    # Convert to DataFrame
    df = pd.DataFrame(data)
    
    # Create Date column
    df['Date'] = pd.to_datetime(df[['Year', 'Month']].assign(Day=1))
    
    # Round amounts
    df['Amount'] = df['Amount'].round(2)
    
    # Additional metadata - postal code prefixes
    df['Postal_Code_Prefix'] = np.random.choice([f"{l1}{d1}{l2}" for l1 in "ABCEGHJKLMNPRSTVXY" for d1 in range(10) for l2 in "ABCEGHJKLMNPRSTVWXYZ"], n_records)
    
    return df
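
A quick way to confirm that the weights passed to `np.random.choice` behave as intended is to check that each probability vector sums to 1 and to compare sampled shares against the weights. The sketch below also checks the mean of the `Amount` construction: because the exponential and lognormal factors are drawn independently, the expected value is 10000 × exp(0.5² / 2), roughly 11,331. The separate `Generator`, its seed, and the 200,000-draw sample size are assumptions made only to keep the check stable; they are not part of the function above.

```python
import numpy as np

sectors = [
    'Banking', 'Securities Dealers', 'Money Services Businesses',
    'Life Insurance', 'Real Estate', 'Casinos', 'Accountants', 'Notaries'
]
sector_probs = [0.35, 0.2, 0.15, 0.1, 0.1, 0.05, 0.03, 0.02]
report_probs = [0.4, 0.3, 0.2, 0.1]

# Sampling with p= fails if the probabilities do not sum to 1.
assert np.isclose(sum(sector_probs), 1.0)
assert np.isclose(sum(report_probs), 1.0)

rng = np.random.default_rng(0)  # separate generator, seed chosen arbitrarily
n = 200_000

# Observed share of each sector should be close to its weight.
draws = rng.choice(sectors, size=n, p=sector_probs)
observed = {s: float(np.mean(draws == s)) for s in sectors}

# Amount = exponential(scale=10000) * lognormal(mean=0, sigma=0.5).
amounts = rng.exponential(10000, n) * rng.lognormal(0, 0.5, n)
expected_mean = 10000 * np.exp(0.5 ** 2 / 2)

print(f"Banking share: {observed['Banking']:.3f} (weight 0.35)")
print(f"Mean amount: {amounts.mean():.0f} (expected {expected_mean:.0f})")
```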

Now, let's generate 5,000 synthetic records, enough to explore patterns and train models in later phases.

In [54]:
# Generate synthetic FINTRAC data
synthetic_fintrac_df = generate_synthetic_fintrac_data(5000)

# Save the synthetic data with absolute path
synthetic_file_path = os.path.join(synthetic_dir, 'synthetic_fintrac_data.csv')
synthetic_fintrac_df.to_csv(synthetic_file_path, index=False)

print(f"Generated {len(synthetic_fintrac_df)} synthetic FINTRAC records")
print(f"Data saved to: {synthetic_file_path}")

# Verify the file exists
if os.path.exists(synthetic_file_path):
    print(f"File successfully created! File size: {os.path.getsize(synthetic_file_path) / 1024:.2f} KB")
else:
    print("Warning: File was not created successfully")

# Display the first few rows
synthetic_fintrac_df.head()

Generated 5000 synthetic FINTRAC records
Data saved to: C:\Users\Obaidullah\Videos\AML\data\synthetic\synthetic_fintrac_data.csv
File successfully created! File size: 446.89 KB


| | Year | Month | Sector | Report_Type | Province | Count | Amount | Date | Postal_Code_Prefix |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021 | 4 | Securities Dealers | Suspicious Transaction Report | Newfoundland and Labrador | 14 | 54477.58 | 2021-04-01 | B2M |
| 1 | 2023 | 1 | Money Services Businesses | Large Cash Transaction Report | Northwest Territories | 6 | 15190.09 | 2023-01-01 | N4T |
| 2 | 2015 | 7 | Life Insurance | Large Cash Transaction Report | Nova Scotia | 7 | 8596.86 | 2015-07-01 | E4Y |
| 3 | 2017 | 9 | Banking | Suspicious Transaction Report | Northwest Territories | 10 | 13296.63 | 2017-09-01 | T5K |
| 4 | 2022 | 12 | Banking | Electronic Funds Transfer Report | Saskatchewan | 8 | 11140.46 | 2022-12-01 | K0Z |
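
The `Date` column in the rows above comes from the `pd.to_datetime(df[['Year', 'Month']].assign(Day=1))` line in the generator. As a small illustration of how that assembly works, the sketch below rebuilds the dates from the `Year` and `Month` values of those five rows; the `sample` frame is a hand-copied excerpt, not new data.

```python
import pandas as pd

# Year and Month values copied from the five rows shown above.
sample = pd.DataFrame({
    'Year': [2021, 2023, 2015, 2017, 2022],
    'Month': [4, 1, 7, 9, 12],
})

# pd.to_datetime accepts a DataFrame with year, month and day columns
# and assembles one timestamp per row; Day=1 pins each date to the
# first of the month.
sample['Date'] = pd.to_datetime(sample[['Year', 'Month']].assign(Day=1))

print(sample['Date'].dt.strftime('%Y-%m-%d').tolist())
# ['2021-04-01', '2023-01-01', '2015-07-01', '2017-09-01', '2022-12-01']
```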
