# Anti-Money Laundering and Fraud Prediction
This notebook walks through models that can be built to predict and detect money laundering or fraud in financial datasets.
The data comes from Kaggle:
- [IBM Transactions for Anti Money Laundering (AML) | Kaggle](https://www.kaggle.com/code/alexisbcook/creating-your-own-notebooks/tutorial)

The datasets cover transactions up to 2013 and vary in size and in the level of detail captured from financial entities.

## Packages to install
> Make sure pip is up to date before installing these packages:
>> `python.exe -m pip install --upgrade pip`

> To install scikit-learn (sklearn), use the command below:
>> `pip install scikit-learn`

> To install Seaborn, use the command below:
>> `pip install seaborn`

> To install Plotly, use the command below:
>> `pip install plotly`

NOTE: You will need to download Python version 3.11 from the Microsoft Store for this to work.

In [None]:
%pip install --upgrade pip
%pip install scikit-learn
%pip install seaborn
%pip install plotly
%pip install tensorflow
%pip install pandas
%pip install pyodbc
%pip install openpyxl
%pip install nbformat

## Library Imports
Run the cell below to import the libraries used throughout the notebook.

In [None]:
import warnings
import numpy as np # linear algebra breakdown
import pandas as pd # data processing, CSV files input/output
import matplotlib.pyplot as plt # graph plotting
import seaborn as sns 

import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import iplot

from numpy import percentile
from mpl_toolkits.mplot3d import Axes3D
from scipy import stats
from scipy.stats import trim_mean

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import average_precision_score

warnings.filterwarnings('ignore')
%matplotlib inline

## Dataset files

With the libraries above loaded, pandas can import the files to be analysed.
pandas can read a file directly from a URL; the cell below pulls a CSV file from a OneDrive download link.

In [None]:
import pandas as pd

#url ='https://onedrive.live.com/download?cid=4A3B8562A2CB78B3&resid=4A3B8562A2CB78B3%21158025&authkey=ACHHRI10a-phyio&em=2'
url ='https://onedrive.live.com/download?cid=4A3B8562A2CB78B3&resid=4A3B8562A2CB78B3%21158163&authkey=AFMTDEy7Ff3oCL8'
#url ='https://onedrive.live.com/download?cid=4A3B8562A2CB78B3&resid=4A3B8562A2CB78B3%21158165&authkey=APswAhpzUuOs4KQ'

#df = pd.read_excel(url)
df = pd.read_csv(url)
df.info()
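
As a minimal sketch of what `pd.read_csv` produces, the example below parses a small in-memory CSV instead of the hosted file. The three rows are invented purely for illustration; only the column names are taken from the cells in this notebook.

```python
import io

import pandas as pd

# Hypothetical sample rows; these values are NOT taken from the real dataset.
sample_csv = io.StringIO(
    "Amount Received,Amount Paid,Payment Format,Is Laundering\n"
    "100.50,100.50,Cheque,0\n"
    "2500.00,2500.00,ACH,1\n"
    "75.25,75.25,Cash,0\n"
)

# read_csv accepts a URL, a file path, or any file-like object such as StringIO
sample_df = pd.read_csv(sample_csv)

sample_df.info()         # column names, dtypes and non-null counts
print(sample_df.head())  # first rows of the parsed table
```

Because `read_csv` treats a URL and a file-like object the same way, the same checks (`info()`, `head()`, `shape`) apply to the real file once it has been downloaded.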

In [None]:
df.head()


In [None]:
df.tail()

In [None]:
df.shape

### Overview of dataset
This section shows the shape of the dataset and a preview of its first rows.

In [None]:
print("Shape of dataset:", df.shape)
print("Overview of the data:")
print(df.head())


In [None]:
df['Is Laundering'] = df['Is Laundering'].replace({'Yes': 1, 'No':0})

In [None]:
df_subset = df

In [None]:
df_subset.shape

In [None]:
print("Data types of columns", df.dtypes)

In [None]:
print("Description of the dataset:")
df.info()

In [None]:
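# Counts laundering vs. non-laundering transactions and their share of the dataset
# Total amount paid per class (summing the label column itself would only count the fraud rows)
amount = pd.to_numeric(df['Amount Paid'], errors='coerce').groupby(df['Is Laundering']).sum()
fraud, unfraud = len(df[df['Is Laundering'] == 1]), len(df[df['Is Laundering'] == 0])
fraud_perc, unfraud_perc = (fraud/len(df)) * 100, (unfraud/len(df))*100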

Loss = pd.DataFrame({'Fraud' : ['Fraud', 'No Fraud'], 'Total Amount' : [amount[1], amount[0]], 'Freq.' : [fraud, unfraud], '% perc.' : [fraud_perc, unfraud_perc]})

Loss = Loss.set_index('Fraud')
Loss

In [None]:
# Initialise the lists to store categorical and numerical features
catfeat = []
numfeat = []

for i in df.columns:
    # Numeric dtypes go to numfeat; everything else (e.g. text columns) is treated as categorical
    if pd.api.types.is_numeric_dtype(df[i]):
        numfeat.append(i)
    else:
        catfeat.append(i)
print(f'The number of Categorical (object) Features : {len(catfeat)}')
print(f'The number of Numerical Features : {len(numfeat)}')
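
The same split can be sketched without an explicit loop using `DataFrame.select_dtypes`. The small frame below is hypothetical and only stands in for `df`; the names `demo`, `cat_cols` and `num_cols` are introduced here for illustration.

```python
import pandas as pd

# Hypothetical frame standing in for df; values are invented for illustration.
demo = pd.DataFrame({
    "Amount Paid": [100.5, 2500.0, 75.25],
    "Payment Format": ["Cheque", "ACH", "Cash"],
    "Is Laundering": [0, 1, 0],
})

# Numeric columns (ints and floats)
num_cols = demo.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric is treated as categorical here
cat_cols = demo.select_dtypes(exclude="number").columns.tolist()

print(f"Categorical features: {cat_cols}")
print(f"Numerical features: {num_cols}")
```

Note that the label column `Is Laundering` counts as numeric once it has been mapped to 0 and 1, so it may need to be removed from the numerical list before modelling.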

In [None]:
print(f'Number of missing values : {df.isnull().sum().sum()}')

In [None]:
# This scans the data for any duplicates that are within the datasets
namedfeat = ['Amount Received', 'Amount Paid', 'Payment Format']
for i in df[namedfeat]:
    if(df[i].duplicated().sum() > 0): print(f'{i} has {df[i].duplicated().sum()} duplicates') 

df[namedfeat].describe().T
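
Repeated values within a single column (for example, the same amount appearing twice) are expected in transaction data and are not the same thing as a fully duplicated record. A minimal sketch of the difference, using a hypothetical frame with invented values:

```python
import pandas as pd

# Hypothetical transactions; values are invented for illustration.
demo = pd.DataFrame({
    "Amount Paid": [100.0, 100.0, 250.0, 100.0],
    "Payment Format": ["ACH", "ACH", "Cash", "Cheque"],
})

# Repeated values counted column by column (what the loop above reports)
per_column = {col: int(demo[col].duplicated().sum()) for col in demo.columns}

# Rows that repeat an earlier row across ALL columns
full_rows = int(demo.duplicated().sum())

print(per_column)
print(f"Fully duplicated rows: {full_rows}")
```

If the goal is to drop repeated records, `DataFrame.drop_duplicates()` works at the row level in the same way as `demo.duplicated()`.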

## Description of the data variables

### Percentage Scores
This section computes how often each value occurs, as a count and as a percentage, for the `Amount Received`, `Amount Paid` and `Is Laundering` columns.

In [None]:
# Counts how often each received amount occurs, and its percentage share
class_counts_with = df['Amount Received'].value_counts()
class_counts_percentage_with = df['Amount Received'].value_counts(normalize=True) * 100

# Counts how often each paid amount occurs, and its percentage share
class_counts_dep = df['Amount Paid'].value_counts()
class_counts_percentage_dep = df['Amount Paid'].value_counts(normalize=True) * 100

print("Amount Received: \n", class_counts_with)
print("Amount Paid: \n", class_counts_dep)

In [None]:
# Counts laundering vs. non-laundering transactions, and their percentage share
class_counts_lau = df['Is Laundering'].value_counts()
class_counts_percentage_lau = df['Is Laundering'].value_counts(normalize=True) * 100

print("Is Laundering: \n", class_counts_lau)
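
The percentage variables in the cells above are computed but never displayed. One way to show counts and percentages side by side is sketched below; the label values are hypothetical, and `labels` and `summary` are names introduced for this example.

```python
import pandas as pd

# Hypothetical labels: 1 = laundering, 0 = not laundering.
labels = pd.Series([0, 0, 0, 1, 0, 0, 0, 0, 1, 0], name="Is Laundering")

# Both Series share the same index (the class values), so they align by class
summary = pd.DataFrame({
    "count": labels.value_counts(),
    "percent": labels.value_counts(normalize=True) * 100,
})

print(summary)
```

A table like this makes class imbalance easy to spot before any model is trained.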

### Scatter Plot Graphs
This section plots the data as scatter plots, in both 2D and 3D.

In [None]:
import plotly.express as px
import plotly.graph_objects as go

px.scatter(df, x ='Amount Received', y ='Amount Paid', color ='Payment Format')

In [None]:
import plotly.express as px
import plotly.graph_objects as go

px.scatter(df, x ='Payment Format', y ='Amount Paid', color ='Is Laundering')

In [None]:
import plotly.express as px
import plotly.graph_objects as go

px.scatter_3d(df, x ='Amount Received', y ='Amount Paid', z='Payment Format', color ='Payment Currency')

In [None]:
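import matplotlib.pyplot as plt

# The file is a CSV that is already loaded into df, so reuse it instead of re-reading it
# Count the transactions flagged as laundering for each payment format
laundering_by_format = df.groupby('Payment Format')['Is Laundering'].sum()

# Plot
plt.bar(laundering_by_format.index, laundering_by_format.values)
plt.xlabel('Payment Format')
plt.ylabel('Laundering transactions')

plt.show()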

In [None]:
df_subset['Amount Paid'] = pd.to_numeric(df_subset['Amount Paid'], errors='coerce')
df_subset.boxplot(column='Amount Paid', by='Is Laundering')

plt.title('Box plot graph for Payments vs Fraud')
plt.show()

## Data Preprocessing
This section prepares the data so that training and test sets can be created.

In [None]:
import matplotlib.pyplot as plt

# Reuse df (already loaded from the CSV) rather than re-reading the file

# Separate the payment formats (X) from the laundering counts per format (y)
counts = df.groupby('Payment Format')['Is Laundering'].sum()
X = counts.index
y = counts.values

# Plot
bars = plt.bar(X, y)
plt.bar_label(bars)

plt.show()
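
The training and test sets mentioned at the start of this section could be created along the following lines. This is a sketch under stated assumptions, not the notebook's own pipeline: the data is synthetic, the 80/20 split, the one-hot encoding and the scaling step are illustrative choices, and `demo`, `num_cols` and `scaler` are names introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df: 200 rows, 20 of them labelled as laundering.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Amount Paid": rng.uniform(10, 5000, n),
    "Amount Received": rng.uniform(10, 5000, n),
    "Payment Format": rng.choice(["ACH", "Cash", "Cheque"], n),
    "Is Laundering": np.r_[np.ones(20, dtype=int), np.zeros(n - 20, dtype=int)],
})

# Features: one-hot encode the categorical column; target: the label column
X = pd.get_dummies(demo.drop(columns="Is Laundering"), columns=["Payment Format"])
y = demo["Is Laundering"]

# stratify=y keeps the laundering ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scaler on the training data only, then apply it to both splits
num_cols = ["Amount Paid", "Amount Received"]
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

print(X_train.shape, X_test.shape)
print("Laundering share in train:", y_train.mean(), "test:", y_test.mean())
```

Fitting the scaler on the training split alone avoids leaking information from the test split into the model, and stratifying matters when the laundering class is rare.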