# Anti-Money Laundering and Fraud Prediction
This notebook breaks down the models that can be built to predict and detect money laundering or fraud in financial datasets.
The datasets come from Kaggle:
- [Credit Card Fraud Detection | Kaggle](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud?datasetId=310&sortBy=voteCount)
- [Fake Bills | Kaggle](https://www.kaggle.com/datasets/alexandrepetit881234/fake-bills)

These datasets vary in size and in the level of detail captured from financial entities, with records dating up to 2013.

## Packages to install
> Make sure pip is up to date before installing these packages:
>> `python.exe -m pip install --upgrade pip`

> To install the scikit-learn (sklearn) package, use the command below:
>> `pip install scikit-learn`

> To install the Seaborn package, use the command below:
>> `pip install seaborn`

> To install the Plotly package, use the command below:
>> `pip install plotly`

NOTE: You will need to install Python 3.11 from the Microsoft Store for this to work.
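
To confirm that the installs above succeeded before running the notebook, a quick check along these lines may help. This is a minimal sketch: the helper name `missing_packages` and the `REQUIRED` list are illustrative, not part of the original notebook.

```python
import importlib.util

# Import names (not pip names) for the packages installed above
REQUIRED = ["sklearn", "seaborn", "plotly"]


def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("Missing packages:", missing if missing else "none")
```

Note that scikit-learn is installed as `scikit-learn` but imported as `sklearn`, which is why the list uses import names.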

## Library Imports
To get started, run the cell below (click its play button) to import the packages.

In [None]:
import numpy as np # linear algebra breakdown
import pandas as pd # data processing, CSV files input/output
import matplotlib.pyplot as plt # graph plotting
import seaborn as sns 

import plotly.express as px
import plotly.graph_objects as go

from numpy import percentile
from mpl_toolkits.mplot3d import Axes3D
from scipy import stats
from scipy.stats import trim_mean

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import average_precision_score

## Dataset files

With the packages above imported, Pandas can load the files that need to be analysed.
Pandas can read a file directly from a URL, for example from the GitHub repository used for this work.

In [None]:
# This block would pull in several sources at once, but the files do not share one format and the /tree/ links point to GitHub web pages rather than raw CSV, so it throws errors
#urls = ["https://github.com/jono120/fictional-octo-potato/raw/main/transaction_data/fake_bills.csv", "https://github.com/Jono120/fictional-octo-potato/tree/main/transaction_data/bank.csv", "https://github.com/Jono120/fictional-octo-potato/tree/main/transaction_data/train_trd.csv"]
#df = pd.read_csv("https://github.com/jono120/fictional-octo-potato/raw/main/transaction_data/fake_bills.csv")
#df = pd.read_csv("https://github.com/Jono120/fictional-octo-potato/tree/main/transaction_data/bank.csv")
#df = pd.read_csv("https://github.com/Jono120/fictional-octo-potato/tree/main/transaction_data/train_trd.csv")

#dfs = [pd.read_csv(url) for url in urls]
#df = pd.concat(dfs)
#print(df.head())
#df.info()
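
One likely reason the commented-out block fails is that two of the links use GitHub's `/tree/` page form, while `pd.read_csv` needs the `/raw/` form that the working `fake_bills.csv` link already uses. A minimal sketch of a converter follows; `to_raw_url` is a hypothetical helper, and whether the other files actually exist at the converted paths has not been checked here.

```python
def to_raw_url(url: str) -> str:
    """Convert a github.com '/tree/' or '/blob/' file URL to its '/raw/' form."""
    for marker in ("/tree/", "/blob/"):
        if marker in url:
            # Replace only the first occurrence, which is the page-type segment
            return url.replace(marker, "/raw/", 1)
    return url  # already a raw URL, or not a recognised GitHub page URL


urls = [
    "https://github.com/jono120/fictional-octo-potato/raw/main/transaction_data/fake_bills.csv",
    "https://github.com/Jono120/fictional-octo-potato/tree/main/transaction_data/bank.csv",
]
raw_urls = [to_raw_url(u) for u in urls]
print(raw_urls)
```

Even with working links, files in different formats are probably better loaded into separate DataFrames, each with its own `sep`, than concatenated into one.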

In [None]:
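# The file is semicolon-delimited, so the separator is set explicitly
df = pd.read_csv("https://github.com/jono120/fictional-octo-potato/raw/main/transaction_data/fake_bills.csv", sep=';')

# Split by label, then fill missing values with each group's own column medians
df_false = df.loc[df['is_genuine'] == False]
df_true = df.loc[df['is_genuine'] == True]
df_true = df_true.fillna(df_true.median(numeric_only=True))
df_false = df_false.fillna(df_false.median(numeric_only=True))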
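
The per-class median fill in the cell above can also be done on a single DataFrame, without splitting it first. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) to show the idea with `groupby(...).transform("median")`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bank-note data; values are made up for illustration
toy = pd.DataFrame({
    "is_genuine": [True, True, True, False, False, False],
    "margin_low": [4.0, np.nan, 4.2, 5.0, 5.4, np.nan],
})

# For every row, look up the median of margin_low within that row's class
group_median = toy.groupby("is_genuine")["margin_low"].transform("median")

# Fill each missing value with the median of its own class
toy["margin_low"] = toy["margin_low"].fillna(group_median)
print(toy)
```

Keeping the data in one frame means the filled values are available to later cells that read from `df`, rather than only from `df_true` and `df_false`.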

In [None]:
print("Shape of dataset:", df.shape)
print("Overview of the data:" )
print(df.head())
df.info()

In [None]:
df_true.describe().round(2)

In [None]:
df_false.describe().round(2)

In [None]:
catfeat = []
numfeat = []

for i in df.columns:
    # Columns with dtype 'object' are treated as categorical
    if df[i].dtype == 'object':
        catfeat.append(i)
    else:
        numfeat.append(i)
print(f'The number of Object Features : {len(catfeat)}')
print(f'The number of Numerical Features : {len(numfeat)}')
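
The loop above puts every non-object column in the numerical list, which would include a boolean label such as `is_genuine`. As a hedged alternative, `select_dtypes` can separate the three kinds explicitly. The toy frame and its column names below are assumptions for illustration only.

```python
import pandas as pd

# Toy frame with one column of each kind; names and values are illustrative only
toy = pd.DataFrame({
    "label": ["a", "b", "a"],
    "is_genuine": [True, False, True],
    "length": [113.1, 111.9, 112.8],
})

# "number" covers integer and float columns but not booleans
num_cols = toy.select_dtypes(include="number").columns.tolist()
bool_cols = toy.select_dtypes(include="bool").columns.tolist()
other_cols = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print(num_cols, bool_cols, other_cols)
```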

In [None]:
print(f'Number of missing values : {df.isnull().sum().sum()}')

# Summarise the non-numeric columns (object or boolean dtype), if there are any
object_cols = df.select_dtypes(include=['object', 'bool']).columns
for i in object_cols:
    print("column name : {}".format(i))
    print("Number of unique values of ", i, ":{}".format(df[i].nunique()))
    print("Value counts of ", i, "are below: \n{}".format(df[i].value_counts()))
    print("----")

In [None]:
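# 'Time' and 'Amount' belong to the credit card dataset; skip any column missing from df
namedfeat = [c for c in ['Time', 'Amount'] if c in df.columns]
for i in namedfeat:
    if df[i].duplicated().sum() > 0:
        print(f'{i} Number of Duplicated : {df[i].duplicated().sum()}')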

## Scatter Plot Graphs
This section shows the data as scatter plots, in both 3D and 2D.

In [None]:
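# Each figure needs .show() to render when a cell builds more than one
px.scatter_3d(df, x='margin_low', y='length', z='height_left', color='is_genuine').show()
# z columns below are an assumed choice; any third numeric column works
px.scatter_3d(df, x='length', y='margin_low', z='margin_up', color='is_genuine').show()
px.scatter_3d(df, x='length', y='margin_up', z='height_left', color='is_genuine').show()
px.scatter_3d(df, x='length', y='height_left', z='margin_low', color='is_genuine').show()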

In [None]:
# px.scatter is 2D only and takes no z argument
px.scatter(df, x='margin_low', y='length', color='is_genuine').show()
px.scatter(df, x='length', y='margin_low', color='is_genuine').show()
px.scatter(df, x='length', y='margin_up', color='is_genuine').show()
px.scatter(df, x='length', y='height_left', color='is_genuine').show()

## Enrich Data
This section uses the TensorFlow platform (install it with `pip install tensorflow`) to build a small Keras neural network that classifies bills as genuine or fake.

In [None]:
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.optimizers import Adam

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(40, input_dim=6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

In [None]:
x = df.iloc[0:, 1:].values
y1 = df['is_genuine']
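
Since `StandardScaler` and `train_test_split` are imported earlier, a plausible next step is to split and scale the features before training. The sketch below runs on synthetic data shaped like `x` and `y1` (6 features, binary label); the `_demo` names and the split settings are assumptions. With the real `x`, any missing values still in `df` would need to be filled first.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for x and y1: 100 rows, 6 features, binary label
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 6))
y_demo = np.array([0, 1] * 50)

# Hold out 20% for testing, keeping the class ratio the same in both parts
x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both parts
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
print(x_train.shape, x_test.shape)
```

Fitting the scaler on the training rows only keeps information from the test rows out of the preprocessing step.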