# Anti-Money Laundering and Fraud Prediction
This notebook gives an overview of models that can be built to predict and detect money laundering or fraud in financial datasets.
The datasets come from Kaggle:
- [Credit Card Fraud Detection | Kaggle](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud?datasetId=310&sortBy=voteCount)
- [Fake Bills | Kaggle](https://www.kaggle.com/datasets/alexandrepetit881234/fake-bills)

These datasets contain data dating up to 2013, with varying levels of detail captured from financial entities.

## Packages to install
> Make sure pip is up to date before installing these packages:
>> `python.exe -m pip install --upgrade pip`

> To install the scikit-learn (sklearn) package use the below command:
>> `pip install scikit-learn`

> To install Seaborn packages use the below command:
>> `pip install seaborn`

> To install the Plotly packages use the below command:
>> `pip install plotly`

> To install the TensorFlow package use the below command:
>> `pip install tensorflow`

NOTE: These instructions assume Python 3.11, which on Windows can be downloaded from the Microsoft Store.
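
Once the installs above have run, a quick check can confirm which packages are actually available. The snippet below is a minimal sketch using the standard library's `importlib.metadata`; the `REQUIRED` list and the `installed_versions` helper are illustrative names, not part of the original notebook.

```python
from importlib import metadata

# Distribution names as passed to `pip install` above.
REQUIRED = ["scikit-learn", "seaborn", "plotly", "tensorflow"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` needs its `pip install` command run again before the import cell will work.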

## Library Imports
To get started, run the cell below to import the required packages.

In [None]:
import warnings
import numpy as np # numerical arrays and linear algebra
import pandas as pd # data processing, CSV files input/output
import matplotlib.pyplot as plt # graph plotting
import seaborn as sns # statistical plotting

import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import iplot

from numpy import percentile
from mpl_toolkits.mplot3d import Axes3D
from scipy import stats
from scipy.stats import trim_mean

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import average_precision_score

warnings.filterwarnings('ignore')
%matplotlib inline

## Dataset files

With the libraries above imported, pandas can load the files that need to be analysed.
pandas can read the files directly from a URL, in this case from GitHub.

In [None]:
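# This block would pull in several sources at once, but the files do not share the same format, so it throws errors.
# The bank.csv and train_trd.csv links below also use /tree/ instead of /raw/, so they may return a web page rather than CSV data.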
#urls = ["https://github.com/jono120/fictional-octo-potato/raw/main/transaction_data/fake_bills.csv", "https://github.com/Jono120/fictional-octo-potato/tree/main/transaction_data/bank.csv", "https://github.com/Jono120/fictional-octo-potato/tree/main/transaction_data/train_trd.csv"]
#df = pd.read_csv("https://github.com/jono120/fictional-octo-potato/raw/main/transaction_data/fake_bills.csv")
#df = pd.read_csv("https://github.com/Jono120/fictional-octo-potato/tree/main/transaction_data/bank.csv")
#df = pd.read_csv("https://github.com/Jono120/fictional-octo-potato/tree/main/transaction_data/train_trd.csv")

#dfs = [pd.read_csv(url) for url in urls]
#df = pd.concat(dfs)
#print(df.head())
#df.info()
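
The commented-out block above runs into trouble because the files differ in format: `fake_bills.csv` is semicolon-separated (see the `sep = ';'` in the next cell), while `bank.csv` uses commas. One way to handle this is to pass the separator alongside each source. The sketch below assumes a hypothetical `load_sources` helper and uses tiny in-memory stand-ins with made-up values instead of the real URLs, so it runs without network access.

```python
import io
import pandas as pd

def load_sources(sources):
    """Read each (name, file-or-URL, separator) triple and stack the results.

    A `source` column records where each row came from, since the files
    do not share the same columns.
    """
    frames = []
    for name, location, sep in sources:
        frame = pd.read_csv(location, sep=sep)
        frame["source"] = name
        frames.append(frame)
    # Columns missing from one file are filled with NaN in the combined frame.
    return pd.concat(frames, ignore_index=True, sort=False)

# In-memory stand-ins for the real files (made-up values).
bills_csv = io.StringIO("is_genuine;length\nTrue;113.1\nFalse;111.8\n")
bank_csv = io.StringIO("Account No,BALANCE AMT\n1001,250.0\n")

combined = load_sources([
    ("fake_bills", bills_csv, ";"),
    ("bank", bank_csv, ","),
])
print(combined)
```

With real data, each in-memory buffer would be replaced by the file's `raw` URL. Because the datasets describe different things (bank notes versus bank transactions), stacking them is mostly useful for inspection; modelling would normally treat each file separately.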

In [None]:
# df = pd.read_csv("https://github.com/jono120/fictional-octo-potato/raw/main/transaction_data/fake_bills.csv", sep = ';')
df = pd.read_csv("https://github.com/Jono120/fictional-octo-potato/raw/main/transaction_data/bank.csv")

# Checking the file structure
df.info()

### Overview of dataset
This shows the shape of the dataset, its first rows and the column information.

In [None]:
print("Shape of dataset:", df.shape)
print("Overview of the data:")
print(df.head())
df.info()

In [None]:
print("Data types of columns:\n", df.dtypes)

In [None]:
print("Description of the dataset:")
# df.info() returns None, so describe() is used for the summary statistics
df.describe().round(2)

In [None]:
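# Scans the data for the percentages of fraud vs no fraud
# NOTE: this logic expects a binary label column (1 = fraud, 0 = no fraud); 'BALANCE AMT' appears to be a stand-in and is not a true fraud label
amount = df.groupby('BALANCE AMT')['BALANCE AMT'].sum()
fraud, unfraud = len(df[df['BALANCE AMT'] == 1]), len(df[df['BALANCE AMT'] == 0])
fraud_perc, unfraud_perc = (fraud/len(df)) * 100, (unfraud/len(df))*100

# .get() avoids a KeyError when no row has a value of exactly 1 or 0
Loss = pd.DataFrame({'Fraud' : ['Fraud', 'No Fraud'], 'Total Amount' : [amount.get(1, 0), amount.get(0, 0)], 'Freq.' : [fraud, unfraud], '% perc.' : [fraud_perc, unfraud_perc]})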

Loss = Loss.set_index('Fraud')
Loss
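
The summary table above only makes sense when the dataset has a binary fraud label. As a minimal sketch of the intended calculation, the example below uses a made-up frame with hypothetical `is_fraud` and `amount` columns (neither exists in `bank.csv`) and a hypothetical `class_summary` helper.

```python
import pandas as pd

def class_summary(data, label_col, amount_col):
    """Summarise total amount, row count and share for each label value."""
    grouped = data.groupby(label_col)[amount_col]
    summary = pd.DataFrame({
        "Total Amount": grouped.sum(),
        "Freq.": grouped.size(),
    })
    summary["% perc."] = summary["Freq."] / len(data) * 100
    return summary

# Made-up transactions; `is_fraud` and `amount` are hypothetical column names.
toy = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 40.0, 500.0],
    "is_fraud": [0, 0, 0, 0, 1],
})
summary = class_summary(toy, "is_fraud", "amount")
print(summary)
```

Grouping by the label rather than indexing `amount[1]` and `amount[0]` directly means the table still builds when one of the classes is absent.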

In [None]:
# Initialise the lists to store categorical and numerical features
catfeat = []
numfeat = []

for i in df.columns:
    # Columns stored as strings have the 'object' dtype
    if df[i].dtype == 'object':
        catfeat.append(i)
    else:
        numfeat.append(i)
print(f'The number of Object Features : {len(catfeat)}')
print(f'The number of Numerical Features : {len(numfeat)}')
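
The same split can be written without an explicit loop using `DataFrame.select_dtypes`. The sketch below uses a made-up three-column frame; the column names mirror ones in `bank.csv`, but the values are invented for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "TRANSACTION DETAILS": ["ATM", "TRANSFER", "ATM"],
    "BALANCE AMT": [100.0, 250.5, 90.0],
    "Account No": [1, 2, 3],
})

# Everything that is not numeric is treated as categorical here.
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()
num_cols = toy.select_dtypes(include="number").columns.tolist()

print(f"Categorical: {cat_cols}")
print(f"Numerical: {num_cols}")
```

Selecting by `number` rather than by `object` keeps the split stable if text columns are stored with a dedicated string dtype.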

In [None]:
print(f'Number of missing values : {df.isnull().sum().sum()}')

In [None]:
# Scans the named columns for values that occur more than once
namedfeat = [' WITHDRAWAL AMT ', ' DEPOSIT AMT ', 'BALANCE AMT']
for i in df[namedfeat]:
    if(df[i].duplicated().sum() > 0): print(f'{i} has {df[i].duplicated().sum()} duplicates') 

df[namedfeat].describe().T
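
Note that the loop above counts repeated values within a single column, which is expected for amounts (many transactions share the same value). Fully duplicated rows are a separate check. The sketch below shows the difference on a made-up frame.

```python
import pandas as pd

# Made-up rows: the second and third rows are identical.
toy = pd.DataFrame({
    "DEPOSIT AMT": [100.0, 100.0, 100.0, 50.0],
    "BALANCE AMT": [200.0, 300.0, 300.0, 350.0],
})

# Repeated values in one column (every occurrence after the first).
repeated_values = toy["DEPOSIT AMT"].duplicated().sum()

# Rows where every column matches an earlier row.
repeated_rows = toy.duplicated().sum()

deduplicated = toy.drop_duplicates()
print(repeated_values, repeated_rows, len(deduplicated))
```

Whether duplicated rows should be dropped depends on the data: two genuinely separate transactions can look identical if the dataset has no unique transaction identifier.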

## Description of the data variables

### Percentage Scores
This gives a breakdown of how often each value occurs, as a count and as a percentage, for the two columns referenced.

In [None]:
# Counts how often each withdrawal amount occurs, as a count and as a percentage
class_counts_with = df[' WITHDRAWAL AMT '].value_counts()
class_counts_percentage_with = df[' WITHDRAWAL AMT '].value_counts(normalize=True) * 100

# Counts how often each deposit amount occurs, as a count and as a percentage
class_counts_dep = df[' DEPOSIT AMT '].value_counts()
class_counts_percentage_dep = df[' DEPOSIT AMT '].value_counts(normalize=True) * 100

print("Withdrawal amounts: \n", class_counts_with)
print("Withdrawal amounts (%): \n", class_counts_percentage_with)
print("Deposit amounts: \n", class_counts_dep)
print("Deposit amounts (%): \n", class_counts_percentage_dep)
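
For a continuous column such as an amount, `value_counts` on the raw values tends to produce one entry per distinct amount. Grouping the amounts into bands first often gives a more readable percentage breakdown. The sketch below uses `pd.cut` on made-up amounts; the band edges are arbitrary assumptions chosen for illustration.

```python
import pandas as pd

# Made-up amounts for illustration only.
amounts = pd.Series([5.0, 20.0, 75.0, 150.0, 900.0, 40.0, 60.0, 10.0])

# Band edges are arbitrary; each band includes its upper edge.
bins = [0, 50, 100, 1000]
labels = ["0-50", "50-100", "100-1000"]
banded = pd.cut(amounts, bins=bins, labels=labels)

# Percentage of rows that fall in each band, in band order.
share = banded.value_counts(normalize=True).sort_index() * 100
print(share)
```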

### Count Plot graph
This creates a bar graph showing how often each value of the chosen column occurs in the dataset.

In [None]:
sns.countplot(x=' WITHDRAWAL AMT ', data=df)

# Add title, x-axis, y-axis labels for the graph
plt.title("Distributions of the targets")
plt.xlabel(" WITHDRAWAL AMT ")
plt.ylabel("Count")

# Display the plot
plt.show()

class_counts = df[' WITHDRAWAL AMT '].value_counts()
print(class_counts)

### Scatter Plot Graphs
This displays the data as a 2D scatter plot of withdrawal amounts against deposit amounts, coloured by transaction details.

In [None]:
import plotly.express as px
import plotly.graph_objects as go

px.scatter(df, x =' WITHDRAWAL AMT ', y =' DEPOSIT AMT ', color ='TRANSACTION DETAILS')
#px.scatter(df, x='length', y='margin_low', color='is_genuine')
#px.scatter(df, x='length', y='margin_up', color='is_genuine')
#px.scatter(df, x='length', y='height_left', color='is_genuine')

## Data Enrichments
This section uses TensorFlow to define a small neural network for analysing the data.

In [None]:
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.optimizers import Adam

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(40, input_dim=6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
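
The model above has one hidden layer of 40 ReLU units fed by 6 input features, followed by a single sigmoid output, for a total of 6*40 + 40 + 40 + 1 = 321 trainable parameters. To make the arithmetic concrete, the sketch below reproduces the forward pass in NumPy. It illustrates the calculation only and is not how Keras implements it; random weights stand in for trained ones.

```python
import numpy as np

def relu(z):
    """Element-wise max(z, 0)."""
    return np.maximum(z, 0.0)

def sigmoid(z):
    """Squash values into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w1, b1, w2, b2):
    """Dense(40, relu) followed by Dense(1, sigmoid)."""
    hidden = relu(x @ w1 + b1)        # shape (n_samples, 40)
    return sigmoid(hidden @ w2 + b2)  # shape (n_samples, 1)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 6))  # 5 made-up samples with 6 features each
w1, b1 = rng.normal(scale=0.1, size=(6, 40)), np.zeros(40)
w2, b2 = rng.normal(scale=0.1, size=(40, 1)), np.zeros(1)

probs = forward(x, w1, b1, w2, b2)
n_params = w1.size + b1.size + w2.size + b2.size
print(probs.shape, n_params)
```

Each output is a value between 0 and 1, which is why a sigmoid output suits a binary target such as fraud versus no fraud.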

In [None]:
x = df.iloc[0:, 1:].values
y = df['TRANSACTION DETAILS']

## Data Preprocessing
This prepares the data so that training and test sets can be created.

In [None]:
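# Split the dataset into features (X) and target (y)
# The target column is dropped from X so that it is not also used as a feature
X = df.drop(' DEPOSIT AMT ', axis=1)
y = df[' DEPOSIT AMT ']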

### Training and Testing Splits
This section builds the training and testing splits from the dataset.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y)
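
The `stratify` argument expects discrete class labels: it keeps the proportion of each class the same in the training and test sets. With a continuous target such as a raw amount, most values occur only once and scikit-learn raises a `ValueError`. The sketch below shows stratification on made-up data with a hypothetical binary label in which 10% of rows are positive.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))        # made-up features
y = np.array([1] * 10 + [0] * 90)    # hypothetical label, 10% positive

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Both splits keep the 10% positive rate.
print(y_train.mean(), y_test.mean())
```

Setting `random_state` makes the split reproducible between runs, which helps when comparing models.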

### Data Transformations
The section below standardises the named numeric features in the training and test sets, ready for further graphing and modelling.

In [None]:
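from sklearn.preprocessing import StandardScaler

# Scale only the named columns still present in X; fit on the training data and reuse that fit for the test data
scale_cols = [c for c in namedfeat if c in X_train.columns]
scaler = StandardScaler()
X_train[scale_cols] = scaler.fit_transform(X_train[scale_cols])
X_test[scale_cols] = scaler.transform(X_test[scale_cols])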
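
Fitting a separate scaler on the test set would scale the two sets with different means and standard deviations, and would let test-set statistics influence preprocessing. A common approach is to fit once on the training data and apply the same transformation to the test data. The sketch below demonstrates this on made-up arrays.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=100.0, scale=20.0, size=(200, 2))  # made-up training features
test = rng.normal(loc=100.0, scale=20.0, size=(50, 2))    # made-up test features

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training mean and std

# The test data is scaled with the training statistics, not its own.
expected = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, expected))
```

The scaled test set will not have exactly zero mean and unit variance, which is expected: it is measured against the training distribution.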