<a href="https://colab.research.google.com/github/Skalwalker/AntiMoneyLaundering/blob/main/anti_money_laundering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IBM Transactions for Anti Money Laundering

This project analyzes the «IBM Transactions for Anti Money Laundering» dataset, published on [Kaggle](https://www.kaggle.com/datasets/ealtman2019/ibm-transactions-for-anti-money-laundering-aml) and released under the Community Data License Agreement – Sharing – Version 1.0. The dataset contains several CSV files, each combining a different data size with a different share of illicit transactions.

## About

This project is a partial requirement for the courses "Algorithms for Massive Datasets" and "Statistical Methods for ML" in the master's degree program in Computer Science at Università degli Studi di Milano.

- **Author:** Renato Avellar Nobre
- **Matriculation Number:** 984405
- **Exam Project Year:** 22/23

### Disclaimer

"I declare that this material, which I now submit for assessment, is entirely my own work and has not been taken from the work of others, save and to the extent that such work has been cited and acknowledged within the text of my work. I understand that plagiarism, collusion, and copying are grave and serious offences in the university and accept the penalties that would be imposed should I engage in plagiarism, collusion or copying. This assignment, or any part of it, has not been previously submitted by me or any other person for assessment on this or any other course of study."


# Overview

The task is to implement a system that predicts whether a transaction is illicit, using the attribute "Is Laundering" as the label. Classification should be done with a random forest, and the project is organized as follows.

1. A sequential, from-scratch implementation of the learning algorithm for a decision tree should be provided and tested on one or more subsets of the dataset that fit in main memory.

2. Mock-up code that uses Spark to process a dataset and distribute the creation of the individual trees in a random forest should be proposed. In particular, each tree should be built by giving different data to each worker, subsampling both the rows (i.e., labeled objects) and the columns (i.e., attributes) of the overall dataset. For row subsampling, you may consider bootstrap sampling, in which labeled objects are sampled with replacement, so the same object can occur more than once in the resulting dataset. Distributing the creation of a single decision tree is not required: for this task you are free to use the implementation from point 1 or the one already available in scikit-learn.
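
The row and column subsampling described in point 2 can be illustrated with a small, Spark-free sketch. It is not the project's implementation: the function name `sample_tree_inputs` and its parameters are hypothetical, and it only produces the index sets that one tree would train on. In a Spark mock-up, each (rows, columns) pair could plausibly be handed to a different worker.

```python
import random


def sample_tree_inputs(n_rows, n_cols, n_features, seed):
    """Pick the row and column indices one tree of the forest would train on.

    Hypothetical helper for illustration only; it is not part of the project.
    """
    rng = random.Random(seed)
    # Bootstrap sampling: draw n_rows indices WITH replacement, so the same
    # labeled object can appear more than once in the sample.
    row_idx = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Attribute subsampling: draw n_features distinct column indices.
    col_idx = sorted(rng.sample(range(n_cols), n_features))
    return row_idx, col_idx


# One (rows, columns) pair per tree, each built from a different seed.
plans = [sample_tree_inputs(n_rows=10, n_cols=6, n_features=3, seed=s)
         for s in range(4)]
for rows, cols in plans:
    print(len(set(rows)), "distinct rows out of", len(rows), "| columns:", cols)
```

Using a separate seed per tree keeps each sample reproducible while still giving every tree a different view of the data.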




## Before we start...

Upload your Kaggle API token (the `kaggle.json` file) by running the cell below. You can generate the token from your [Kaggle account settings](https://www.kaggle.com/settings).

In [2]:
from google.colab import files

uploaded = files.upload()

for fn, content in uploaded.items():
    print(f'User uploaded file "{fn}" with length {len(content)} bytes')

# Move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json
User uploaded file "kaggle.json" with length 66 bytes


# Setup

In [6]:
import os

data_folders = ["01_raw", "02_intermediate", "03_primary", "04_feature", "05_model_input",
                "06_models", "07_model_output", "08_reporting"]

try:
    # Create the top-level data folder, then one subfolder per pipeline stage.
    os.makedirs("data", exist_ok=True)
    for folder_name in data_folders:
        os.makedirs(os.path.join("data", folder_name), exist_ok=True)
    print("Directories created successfully")
except OSError as error:
    print(f"Directories could not be created: {error}")


Directories created successfully


## Fetching Files from Kaggle

In [3]:
!kaggle datasets download -d ealtman2019/ibm-transactions-for-anti-money-laundering-aml

Downloading ibm-transactions-for-anti-money-laundering-aml.zip to /content
100% 7.42G/7.42G [01:39<00:00, 80.3MB/s]


In [7]:
!unzip ibm-transactions-for-anti-money-laundering-aml.zip -d ./data/01_raw/

Archive:  ibm-transactions-for-anti-money-laundering-aml.zip
  inflating: ./data/01_raw/HI-Large_Patterns.txt  
  inflating: ./data/01_raw/HI-Large_Trans.csv  
  inflating: ./data/01_raw/HI-Medium_Patterns.txt  
  inflating: ./data/01_raw/HI-Medium_Trans.csv  
  inflating: ./data/01_raw/HI-Small_Patterns.txt  
  inflating: ./data/01_raw/HI-Small_Trans.csv  
  inflating: ./data/01_raw/LI-Large_Patterns.txt  
  inflating: ./data/01_raw/LI-Large_Trans.csv  
  inflating: ./data/01_raw/LI-Medium_Patterns.txt  
  inflating: ./data/01_raw/LI-Medium_Trans.csv  
  inflating: ./data/01_raw/LI-Small_Patterns.txt  
  inflating: ./data/01_raw/LI-Small_Trans.csv  


In [8]:
!rm ibm-transactions-for-anti-money-laundering-aml.zip
!rm ./data/01_raw/*.txt

## Installations

In [None]:
!pip install --upgrade ipykernel
!pip install kaggle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Imports

# 0. Data Engineering Preparation

# 1. Random Forest

# 2. Distributed Workers