# Detecting Money Laundering Patterns Across Global Financial Transactions: ETL Process and Statistical Analysis

## Objectives

This project explores a dataset of 10,000 records describing international financial transactions. The dataset contains categorical and numerical information about each transaction, including country of origin, the legality of the transaction, transaction amount, industry, number of shell companies involved, tax haven country, and money laundering risk score.

The objective is to determine which variable, or group of variables, in the dataset best indicates whether a transaction is illegal. To do this, we will:

- Carry out a complete ETL (Extract, Transform, Load) process to clean and prepare the data.

- Run statistical tests to evaluate four hypotheses.

- Use these insights to suggest money laundering risk reduction strategies.



---

1. **Import Required Libraries**

I will begin by importing the necessary Python libraries for data handling and exploration.

In [27]:
import pandas as pd
import numpy as np
import os

# Step 0. Print the current working directory to confirm where the notebook runs relative to the dataset file, so pandas can be pointed at the correct CSV location
print(os.getcwd())

/Users/nataliewaugh/Documents/DataCode/Detecting_Money_Laundering_Patterns-/jupyter_notebooks
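The working directory printed above sits in a `jupyter_notebooks` folder next to a `data` folder. As a minimal sketch, the dataset location could be built relative to the notebook folder instead of being typed as an absolute path. The helper name `dataset_path` is hypothetical, the file name is taken from the prose in the next step, and the layout (notebooks one level below the project root) is an assumption based on the paths shown in this notebook.

```python
from pathlib import Path


def dataset_path(notebook_dir: Path, filename: str = "Money_Laundering_Dataset.csv") -> Path:
    """Return <project root>/data/<filename>.

    Assumes the notebook folder sits one level below the project root
    and that the CSV lives in a sibling folder called 'data'.
    """
    project_root = notebook_dir.resolve().parent
    return project_root / "data" / filename


# Build the path from wherever the notebook is currently running
csv_path = dataset_path(Path.cwd())
print(csv_path)
```

A path built this way keeps working when the project is cloned to another machine, which an absolute `/Users/...` path does not.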


2. **Load the Dataset**

I will load the CSV file named Money_Laundering_Dataset.csv from the local directory.

In [28]:
# Step 1. Load the dataset
data = pd.read_csv('/Users/nataliewaugh/Documents/DataCode/Detecting_Money_Laundering_Patterns-/data/Money_ Laundering_Dataset.csv')

# Step 2. Show the first few rows of the dataset
data.head()

Unnamed: 0,Transaction ID,Country,Amount (USD),Transaction Type,Date of Transaction,Person Involved,Industry,Destination Country,Reported by Authority,Source of Money,Money Laundering Risk Score,Shell Companies Involved,Financial Institution,Tax Haven Country
0,TX0000000001,Brazil,3267530.0,Offshore Transfer,2013-01-01 00:00:00,Person_1101,Construction,USA,True,Illegal,6,1,Bank_40,Singapore
1,TX0000000002,China,4965767.0,Stocks Transfer,2013-01-01 01:00:00,Person_7484,Luxury Goods,South Africa,False,Illegal,9,0,Bank_461,Bahamas
2,TX0000000003,UK,94167.5,Stocks Transfer,2013-01-01 02:00:00,Person_3655,Construction,Switzerland,True,Illegal,1,3,Bank_387,Switzerland
3,TX0000000004,UAE,386420.1,Cash Withdrawal,2013-01-01 03:00:00,Person_3226,Oil & Gas,Russia,False,Illegal,7,2,Bank_353,Panama
4,TX0000000005,South Africa,643378.4,Cryptocurrency,2013-01-01 04:00:00,Person_7975,Real Estate,USA,True,Illegal,1,9,Bank_57,Luxembourg
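As an optional variation on the load step, `pd.read_csv` can parse the date column while reading, so it arrives as a datetime rather than as text. The sketch below uses a two-row in-memory CSV built from the rows shown above (with most columns left out for brevity) so that it runs without the real file. It is an illustration, not the loading code used in this notebook.

```python
import io

import pandas as pd

# Two rows copied from the preview above, with only four of the columns kept
csv_text = """Transaction ID,Country,Amount (USD),Date of Transaction
TX0000000001,Brazil,3267530.0,2013-01-01 00:00:00
TX0000000002,China,4965767.0,2013-01-01 01:00:00
"""

# parse_dates converts the named column to datetime while loading
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date of Transaction"])
print(sample.dtypes)
```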


In [26]:
# Step 3. Display summary of dataset structure, including column names, non-null counts, and data types.
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 13 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Country                      10000 non-null  object 
 1   Amount (USD)                 10000 non-null  float64
 2   Transaction Type             10000 non-null  object 
 3   Date of Transaction          10000 non-null  object 
 4   Person Involved              10000 non-null  object 
 5   Industry                     10000 non-null  object 
 6   Destination Country          10000 non-null  object 
 7   Reported by Authority        10000 non-null  bool   
 8   Source of Money              10000 non-null  object 
 9   Money Laundering Risk Score  10000 non-null  int64  
 10  Shell Companies Involved     10000 non-null  int64  
 11  Financial Institution        10000 non-null  object 
 12  Tax Haven Country            10000 non-null  object 
dtypes: bool(1), float64(1), int64(2), object(9)

3. **Clean the Dataset**

I will check the dataset for irregularities that may hinder further analysis, e.g. spelling mistakes and duplication. Monetary values will be rounded to 0 decimal places.
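Before looking at individual columns, a quick overall check for missing cells and fully duplicated rows can be useful. The sketch below runs on a small made-up frame (`toy` is a stand-in, not the real data). The same two calls could be applied to `data`.

```python
import pandas as pd

# Toy frame with made-up rows standing in for `data`
toy = pd.DataFrame({
    "Country": ["Brazil", "China", "Brazil", None],
    "Amount (USD)": [3267530.0, 4965767.0, 3267530.0, 94167.5],
})

# Number of missing cells in each column
missing_per_column = toy.isna().sum()

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(toy.duplicated().sum())

print(missing_per_column)
print(duplicate_rows)
```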

In [35]:
# Step 4. Drop columns that are not needed for analysis
data.drop(columns=['Transaction ID'], inplace=True)

In [33]:
# Step 5. Check for duplicate values in 'Person Involved' column
print(data['Person Involved'].duplicated().sum())

3680


In [42]:
print(data['Person Involved'].unique())

['Person_1101' 'Person_7484' 'Person_3655' ... 'Person_6348' 'Person_4171'
 'Person_3267']


Initially, I considered removing the `'Person Involved'` column because it contains coded identifiers. However, the check above finds 3,680 repeated entries (each occurrence after the first is counted, so the 10,000 rows contain 6,320 distinct identifiers), which shows that this column carries useful information. These repeats can help identify individuals involved in multiple transactions, which is important for detecting suspicious patterns or behaviors.
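One way to follow up on repeated identifiers is to count transactions and total amount per person and keep only those who appear more than once. This is a minimal sketch on made-up rows. The names `per_person` and `repeat_people` are illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Made-up rows using the same column names as the dataset
toy = pd.DataFrame({
    "Person Involved": ["Person_1", "Person_2", "Person_1", "Person_3", "Person_1"],
    "Amount (USD)": [100.0, 250.0, 75.0, 40.0, 10.0],
})

# Count transactions and sum amounts for each person
per_person = (
    toy.groupby("Person Involved")["Amount (USD)"]
    .agg(transactions="count", total_amount="sum")
    .sort_values("transactions", ascending=False)
)

# Keep only people who appear in more than one transaction
repeat_people = per_person[per_person["transactions"] > 1]
print(repeat_people)
```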

In [41]:
# Step 6. Check for duplicate values in 'Financial Institution' column
print(data['Financial Institution'].duplicated().sum())
data['Financial Institution'].value_counts()

9501


Financial Institution
Bank_81     36
Bank_260    36
Bank_100    35
Bank_120    34
Bank_438    33
            ..
Bank_199    11
Bank_169    11
Bank_269    11
Bank_249     9
Bank_133     9
Name: count, Length: 499, dtype: int64

The exploration above shows that the `'Financial Institution'` column contains heavy repetition: 499 distinct banks appear across the 10,000 transactions, with counts ranging from 9 to 36 per bank. Therefore, I will retain this column to analyze whether there is any association between the financial institution and the legality of the transactions.
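A first look at how institution relates to legality could be a cross-tabulation of `'Financial Institution'` against `'Source of Money'`, normalised so each row shows the share of legal and illegal transactions for that bank. The sketch below uses made-up rows and is only a descriptive starting point. It is not a statistical test, and with as few as 9 transactions for some banks the shares would be noisy.

```python
import pandas as pd

# Made-up rows using the same column names as the dataset
toy = pd.DataFrame({
    "Financial Institution": ["Bank_1", "Bank_1", "Bank_2", "Bank_2", "Bank_2", "Bank_3"],
    "Source of Money": ["Illegal", "Legal", "Illegal", "Illegal", "Legal", "Legal"],
})

# Row-normalised crosstab: each row sums to 1 across Legal/Illegal
shares = pd.crosstab(
    toy["Financial Institution"], toy["Source of Money"], normalize="index"
)

# Share of illegal transactions per bank, highest first
illegal_share = shares["Illegal"].sort_values(ascending=False)
print(illegal_share)
```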

In [43]:
# Step 7. Check for irregularities in 'Industry' column
print(data['Industry'].unique())

['Construction' 'Luxury Goods' 'Oil & Gas' 'Real Estate' 'Arms Trade'
 'Casinos' 'Finance']


The `'Industry'` column contains seven industries and shows no irregularities.

In [45]:
# Step 8. Check for irregularities in remaining columns
print(data['Reported by Authority'].unique())
print(data['Source of Money'].unique())
print(data['Destination Country'].unique())
print(data['Tax Haven Country'].unique())
 

[ True False]
['Illegal' 'Legal']
['USA' 'South Africa' 'Switzerland' 'Russia' 'Brazil' 'UK' 'India' 'China'
 'Singapore' 'UAE']
['Singapore' 'Bahamas' 'Switzerland' 'Panama' 'Luxembourg'
 'Cayman Islands']


The remaining columns do not show any irregularities in their entries. There are 10 distinct destination countries and six tax haven countries, and `'Reported by Authority'` and `'Source of Money'` each take exactly two values. Overall, the data quality in these columns appears to be good.
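Reading the unique values by eye works for short lists like these. For longer ones, a small helper can flag labels that differ only by case or surrounding whitespace. The function `near_duplicate_labels` below is a hypothetical helper written for illustration, shown on made-up values rather than the real columns.

```python
import pandas as pd


def near_duplicate_labels(series: pd.Series) -> dict:
    """Group labels that become identical after trimming and lower-casing."""
    labels = pd.Series(series.dropna().unique()).astype(str)
    normalised = labels.str.strip().str.lower()
    # Collect the original spellings that share one normalised form
    groups = labels.groupby(normalised).agg(list)
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}


# Made-up values containing two inconsistent spellings
toy = pd.Series(["USA", "usa ", "UK", "Brazil", "brazil"])
print(near_duplicate_labels(toy))
```

An empty dictionary means no such near-duplicates were found, which is the result expected for the clean columns printed above.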

In [47]:
# Step 9. Round the monetary values to 0 decimal places
data['Amount (USD)'] = data['Amount (USD)'].round(0)
print(data['Amount (USD)'].head())
  

0    3267530.0
1    4965767.0
2      94168.0
3     386420.0
4     643378.0
Name: Amount (USD), dtype: float64
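As the output shows, `round(0)` removes the decimals but keeps the `float64` dtype, so values still print with a trailing `.0`. If whole-number display is wanted, the rounded column could additionally be cast to an integer type. The sketch below illustrates this on three amounts taken from the preview rows. The cast assumes there are no missing values, which matches the `info()` summary above.

```python
import pandas as pd

# Three amounts taken from the preview rows
amounts = pd.Series([3267530.0, 94167.5, 386420.1], name="Amount (USD)")

# Rounding alone leaves the dtype as float64
rounded = amounts.round(0)

# Casting to int64 drops the trailing .0 (requires no missing values)
as_int = rounded.astype("int64")

print(rounded.dtype, as_int.dtype)
print(as_int.tolist())
```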


4. **Time-Shift the Data**

The transaction dates in the dataset range from 2013 onwards. For the purpose of this analysis, the dates have been shifted forward by 10 years to align with the current timeline (2023 and beyond). This adjustment helps in contextualizing the data to present-day conditions without altering the relative timing of transactions, allowing for more relevant insights while maintaining data integrity.

In [49]:
# Convert the date column to datetime format to allow for analysis
data['Date of Transaction'] = pd.to_datetime(data['Date of Transaction'])

# Shift the dates forward by 10 years
data['Date of Transaction'] = data['Date of Transaction'] + pd.DateOffset(years=10)
print(data['Date of Transaction'].head())

0   2023-01-01 00:00:00
1   2023-01-01 01:00:00
2   2023-01-01 02:00:00
3   2023-01-01 03:00:00
4   2023-01-01 04:00:00
Name: Date of Transaction, dtype: datetime64[ns]
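One detail of `pd.DateOffset(years=10)` is worth knowing when reusing this step on other data: it is a calendar shift, so the time of day is kept, but a 29 February source date lands on 28 February when the target year is not a leap year. The sketch below illustrates this with one timestamp from the dataset and one made-up leap-day timestamp.

```python
import pandas as pd

# First value is from the dataset; the leap-day value is made up for illustration
dates = pd.to_datetime(pd.Series(["2013-01-01 04:00:00", "2016-02-29 00:00:00"]))

# Calendar-based shift of ten years
shifted = dates + pd.DateOffset(years=10)
print(shifted.tolist())
```

Weekdays also change under a ten-year shift, which would matter for any later analysis by day of week.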


---


# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder is created
except Exception as e:
  print(e)
