# Decision Tree

1. **Aims and Objectives**

Following the exploratory data analysis and visualisations developed in Tableau, this section applies a supervised machine learning approach to predict the likelihood that a financial transaction is fraudulent. The model is trained on the pre-processed, labeled dataset prepared in the `etl_process.ipynb` notebook. Using classification techniques, the goal is to identify key patterns and risk indicators that may signal potential money laundering or financial crime activity before the transaction occurs.

The workflow I will use for this supervised learning task is:
- Split the dataset into train and test sets
- Fit the pipeline
- Evaluate the model
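The three steps above can be sketched end to end on synthetic data. This is only an illustration of the split, fit, evaluate loop: the data comes from `make_classification` (an assumption, not the money laundering dataset), and a bare decision tree stands in for the full pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 500 rows, 6 numeric features, binary label.
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=101)

# 1. Split the dataset into train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=101
)

# 2. Fit the model (a bare decision tree stands in for the full pipeline).
clf = DecisionTreeClassifier(random_state=101)
clf.fit(X_tr, y_tr)

# 3. Evaluate on the held-out test set.
test_accuracy = accuracy_score(y_te, clf.predict(X_te))
print(f"Test accuracy: {test_accuracy:.3f}")
```

Scoring on the held-out test set, rather than the training set, gives an estimate of how the model behaves on transactions it has not seen.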

In [5]:
# Step 0. Import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

import warnings
warnings.filterwarnings('ignore')

2. **Import the data**


In [6]:
# Step 1. Load the dataset
data = pd.read_csv('/Users/nataliewaugh/Documents/DataCode/Detecting_Money_Laundering_Patterns-/data/cleaned_money_laundering_datasetrevised.csv')

# Step 2. Show the first few rows of the dataset
data.head()

Unnamed: 0,Origin Country,Amount (USD),Transaction Type,Date of Transaction,Person Involved,Industry,Destination Country,Reported by Authority,Source of Money,Money Laundering Risk Score,Shell Companies Involved,Financial Institution,Tax Haven Country,Domestic or Cross-Border,Origin Country Category,Destination Tax Haven Flag
0,Brazil,3267530.0,Offshore Transfer,2023-01-01 00:00:00,Person_1101,Construction,USA,True,Illegal,6,1,Bank_40,Singapore,Cross-Border,Upper-Middle / Partial Regulated,Non-Tax Haven
1,China,4965767.0,Stocks Transfer,2023-01-01 01:00:00,Person_7484,Luxury Goods,South Africa,False,Illegal,9,0,Bank_461,Bahamas,Cross-Border,Upper-Middle / Partial Regulated,Non-Tax Haven
2,UK,94168.0,Stocks Transfer,2023-01-01 02:00:00,Person_3655,Construction,Switzerland,True,Illegal,1,3,Bank_387,Switzerland,Cross-Border,High Income / Regulated,Tax Haven
3,UAE,386420.0,Cash Withdrawal,2023-01-01 03:00:00,Person_3226,Oil & Gas,Russia,False,Illegal,7,2,Bank_353,Panama,Cross-Border,High Income / Tax Haven,Non-Tax Haven
4,South Africa,643378.0,Cryptocurrency,2023-01-01 04:00:00,Person_7975,Real Estate,USA,True,Illegal,1,9,Bank_57,Luxembourg,Cross-Border,Upper-Middle / Regulated,Non-Tax Haven
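Before modeling, it can help to check which columns hold text and how the target classes are distributed. The sketch below does this on a three-column subset copied from the five rows shown above; the name `sample` is hypothetical, and the same two calls could be run on `data` itself.

```python
import pandas as pd

# Three columns copied from the first five rows shown above (illustrative subset only).
sample = pd.DataFrame({
    "Amount (USD)": [3267530.0, 4965767.0, 94168.0, 386420.0, 643378.0],
    "Transaction Type": [
        "Offshore Transfer", "Stocks Transfer", "Stocks Transfer",
        "Cash Withdrawal", "Cryptocurrency",
    ],
    "Source of Money": ["Illegal"] * 5,
})

# Columns that hold text rather than numbers.
text_columns = sample.select_dtypes(exclude="number").columns.tolist()
print(text_columns)

# Share of each class in the target column.
class_share = sample["Source of Money"].value_counts(normalize=True)
print(class_share)
```

The `Unnamed: 0` column in the output above is most likely a row index that was written to the CSV on export; if so, it carries no predictive information and could be dropped before training.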


As this is a classification problem, I will use the `DecisionTreeClassifier` from scikit-learn.

In [8]:
#Step 3 Import the Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier

The target variable is `Source of Money`. I will split the data into train and test sets, holding out 20% of the rows for testing.

In [11]:
# Step 4. Import the train_test_split function
from sklearn.model_selection import train_test_split

# Step 5. Separate the features from the target and split both into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Source of Money'], axis=1),
    data['Source of Money'],
    test_size=0.2,
    random_state=101
)

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

* Train set: (8000, 15) (8000,) 
* Test set: (2000, 15) (2000,)
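If the target classes turn out to be imbalanced, one option is to pass `stratify` to `train_test_split` so that both splits keep the same class ratio. The sketch below uses a toy frame with an assumed 70/30 mix and an assumed second label, `Legal`; neither is taken from the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (assumption): 70 "Illegal" and 30 "Legal" rows, not the real class mix.
toy = pd.DataFrame({
    "Amount (USD)": range(100),
    "Source of Money": ["Illegal"] * 70 + ["Legal"] * 30,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop(columns=["Source of Money"]),
    toy["Source of Money"],
    test_size=0.2,
    random_state=101,
    stratify=toy["Source of Money"],  # keep the class ratio in both splits
)

train_counts = y_tr.value_counts().to_dict()
test_counts = y_te.value_counts().to_dict()
print(train_counts)
print(test_counts)
```

Without `stratify`, a random split can leave the test set with a noticeably different class mix from the training set, which makes the evaluation harder to interpret.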


I will apply feature scaling, perform feature selection, and build a model using `DecisionTreeClassifier`. To keep results reproducible across different environments, I set a fixed `random_state`. All three steps (scaling, feature selection, and modeling) are combined into a single scikit-learn `Pipeline` for a cleaner and more efficient workflow.

In [12]:
# Step 6. Import the Pipeline class
from sklearn.pipeline import Pipeline

# Step 7. Import feature scaling
from sklearn.preprocessing import StandardScaler

# Step 8. Import feature selection
from sklearn.feature_selection import SelectFromModel

# Step 9. Import the ML algorithm (already imported in Step 3; repeated so this cell runs on its own)
from sklearn.tree import DecisionTreeClassifier


def pipeline_decision_tree_moneylaundering():
    # Chain scaling, tree-based feature selection, and the final classifier
    pipeline = Pipeline([
        ("feat_scaling", StandardScaler()),
        ("feat_selection", SelectFromModel(DecisionTreeClassifier(random_state=101))),
        ("model", DecisionTreeClassifier(random_state=101)),
    ])

    return pipeline

pipeline_decision_tree_moneylaundering()
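One caveat: `StandardScaler` only accepts numeric input, and the feature table shown earlier contains text columns such as `Transaction Type`. A possible way to handle this, sketched below as an assumption rather than as the notebook's actual approach, is to put a `ColumnTransformer` in front of the pipeline that scales numeric columns and one-hot encodes the rest. The function name `pipeline_with_encoding` and the toy rows and labels are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier


def pipeline_with_encoding():
    # Scale numeric columns; one-hot encode every non-numeric column.
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), make_column_selector(dtype_include=np.number)),
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         make_column_selector(dtype_exclude=np.number)),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("feat_selection", SelectFromModel(DecisionTreeClassifier(random_state=101))),
        ("model", DecisionTreeClassifier(random_state=101)),
    ])


# Made-up rows and labels, for illustration only.
toy_X = pd.DataFrame({
    "Amount (USD)": [1000.0, 250000.0, 4000.0, 900000.0, 12000.0, 700000.0],
    "Transaction Type": [
        "Cash Withdrawal", "Offshore Transfer", "Cash Withdrawal",
        "Cryptocurrency", "Stocks Transfer", "Offshore Transfer",
    ],
})
toy_y = ["Legal", "Illegal", "Legal", "Illegal", "Legal", "Illegal"]

fitted = pipeline_with_encoding().fit(toy_X, toy_y)
predictions = fitted.predict(toy_X)
print(predictions)
```

On the real dataset, high-cardinality text columns such as `Person Involved` or `Date of Transaction` would expand into a very large number of one-hot columns, so they would probably need to be dropped or engineered into other features first.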