# Anti-Money Laundering and Fraud Prediction
This notebook walks through building machine-learning models that predict and detect money laundering or fraud in financial datasets.
The datasets come from Kaggle:
- [Credit Card Fraud Detection | Kaggle](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud?datasetId=310&sortBy=voteCount)
- [Fake Bills | Kaggle](https://www.kaggle.com/datasets/alexandrepetit881234/fake-bills)

These datasets vary in size and cover data up to 2013, with differing levels of detail captured from financial entities.

## Packages to install
Make sure pip is up to date before installing the packages below:
```
python.exe -m pip install --upgrade pip
```

To install the scikit-learn (sklearn) package, use the command below:
```
pip install scikit-learn
```

To install the Seaborn package, use the command below:
```
pip install seaborn
```

To install the Plotly package, use the command below:
```
pip install plotly
```
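
As a quick sanity check after installing, the sketch below reports which of the packages are visible to the current interpreter. The helper name `installed_versions` and the `REQUIRED` list are illustrative assumptions, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip above, plus the other libraries imported later.
REQUIRED = ["numpy", "pandas", "matplotlib", "scikit-learn", "seaborn", "plotly"]


def installed_versions(names):
    """Return {name: version string, or None if the package is not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:15s} {ver or 'MISSING'}")
```

Any package reported as `MISSING` can be installed with the matching `pip install` command above.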

## Library Imports
To get started, run the cell below to import the required packages.

In [39]:
import warnings

import numpy as np               # numerical arrays and linear algebra
import pandas as pd              # data processing, CSV file input/output
import matplotlib.pyplot as plt  # static plotting
import seaborn as sns            # statistical plots built on matplotlib

import plotly.express as px      # interactive plots
import plotly.graph_objects as go

from numpy import percentile
from mpl_toolkits.mplot3d import Axes3D  # 3D axes for matplotlib

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

warnings.filterwarnings('ignore')  # silence library warnings to keep the output readable

%matplotlib inline

## Dataset files

With the libraries imported, the files to be analysed can be loaded with Pandas.
Pandas can read CSV files directly from a URL, for example from the GitHub repository used in this work.

In [40]:
#urls = ["https://github.com/jono120/fictional-octo-potato/raw/main/transaction_data/fake_bills.csv", "https://github.com/Jono120/fictional-octo-potato/tree/main/transaction_data/bank.csv", "https://github.com/Jono120/fictional-octo-potato/tree/main/transaction_data/train_trd.csv"]
#df = pd.read_csv("https://github.com/jono120/fictional-octo-potato/raw/main/transaction_data/fake_bills.csv")
#df = pd.read_csv("https://github.com/Jono120/fictional-octo-potato/tree/main/transaction_data/bank.csv")
#df = pd.read_csv("https://github.com/Jono120/fictional-octo-potato/tree/main/transaction_data/train_trd.csv")

#dfs = [pd.read_csv(url) for url in urls]
#df = pd.concat(dfs)
#print(df.head())
#df.info()
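
Two of the commented-out URLs above point at GitHub `tree` pages, which serve an HTML page rather than the CSV itself, so `pd.read_csv` would most likely fail to parse them. The sketch below shows one way to rewrite such a link into its `raw` form. `to_raw_url` is a hypothetical helper, not part of the original notebook, and whether those files actually exist in the repository is not verified here.

```python
from urllib.parse import urlsplit, urlunsplit


def to_raw_url(url):
    """Rewrite a github.com /blob/ or /tree/ file link to the /raw/ form.

    URLs that do not match the owner/repo/kind/... layout are returned unchanged.
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    is_github_file = (
        parts.netloc == "github.com"
        and len(segments) > 3
        and segments[2] in ("blob", "tree")
    )
    if not is_github_file:
        return url
    segments[2] = "raw"
    new_path = "/" + "/".join(segments)
    return urlunsplit((parts.scheme, parts.netloc, new_path, parts.query, parts.fragment))


print(to_raw_url(
    "https://github.com/Jono120/fictional-octo-potato/tree/main/transaction_data/bank.csv"
))
# https://github.com/Jono120/fictional-octo-potato/raw/main/transaction_data/bank.csv
```

The rewritten URL has the same shape as the `fake_bills.csv` link that the notebook loads successfully.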

In [41]:
df_full = pd.read_csv("https://github.com/jono120/fictional-octo-potato/raw/main/transaction_data/fake_bills.csv", sep=';')  # file is semicolon-delimited

df_false = df_full.loc[~df_full['is_genuine']]  # counterfeit notes
df_true = df_full.loc[df_full['is_genuine']]    # genuine notes
df_true = df_true.fillna(df_true.median(numeric_only=True))     # fill gaps with the class median
df_false = df_false.fillna(df_false.median(numeric_only=True))
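
The cell above fills missing values with the median of each class separately, so a gap in a counterfeit note is never filled with a value typical of genuine notes. A minimal sketch of the same idea on a single frame, using `groupby(...).transform("median")`, is shown here on made-up toy values. This is an alternative formulation, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bank-note data: one missing margin_low per class.
toy = pd.DataFrame({
    "is_genuine": [True, True, True, False, False, False],
    "margin_low": [4.0, np.nan, 4.2, 5.0, 5.4, np.nan],
})

# Median of each class, broadcast back to the original row order.
class_median = toy.groupby("is_genuine")["margin_low"].transform("median")

# Only the NaN cells are replaced; observed values are left untouched.
toy["margin_low"] = toy["margin_low"].fillna(class_median)

print(toy)
```

Keeping everything in one frame avoids having to concatenate `df_true` and `df_false` again before modelling.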

In [42]:
print("Shape of dataset:", df_full.shape)
print("Overview of the data:")
print(df_full.head())
df_full.info()


Shape of dataset: (1500, 7)
Overview of the data:
   is_genuine  diagonal  height_left  height_right  margin_low  margin_up  length
0        True    171.81       104.86        104.95        4.52       2.89  112.83
1        True    171.46       103.36        103.66        3.77       2.99  113.09
2        True    172.69       104.48        103.50        4.40       2.94  113.16
3        True    171.36       103.91        103.94        3.62       3.01  113.51
4        True    171.73       104.28        103.46        4.04       3.48  112.54
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   is_genuine    1500 non-null   bool   
 1   diagonal      1500 non-null   float64
 2   height_left   1500 non-null   float64
 3   height_right  1500 non-null   float64
 4   margin_low    1463 non-null   float64
 5   margin_up     1500 non-null   float64
 6   length        1500 non-null   float64
dtypes: bool(1), float64(6)
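
The `info()` output shows that `margin_low` has 1463 non-null values out of 1500, so 37 rows (about 2.5%) are missing that measurement. A small sketch of how such gaps can be counted per column follows. The toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gap as margin_low; values are invented.
toy = pd.DataFrame({
    "diagonal": [171.8, 171.5, 172.7, 171.4],
    "margin_low": [4.5, np.nan, 4.4, np.nan],
})

missing = toy.isna().sum()   # number of missing cells per column
share = missing / len(toy)   # fraction of rows affected per column

print(pd.DataFrame({"missing": missing, "share": share}))
```

Running the same two lines on `df_full` should report the 37 missing `margin_low` values implied by the output above.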

In [43]:
df_true.describe()

| | diagonal | height_left | height_right | margin_low | margin_up | length |
|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 171.98708 | 103.94913 | 103.80865 | 4.11592 | 3.05213 | 113.20243 |
| std | 0.300441 | 0.300231 | 0.29157 | 0.31446 | 0.18634 | 0.359552 |
| min | 171.04 | 103.14 | 102.82 | 2.98 | 2.27 | 111.76 |
| 25% | 171.79 | 103.74 | 103.61 | 3.91 | 2.93 | 112.95 |
| 50% | 171.99 | 103.95 | 103.81 | 4.11 | 3.05 | 113.205 |
| 75% | 172.2 | 104.14 | 104.0 | 4.33 | 3.18 | 113.46 |
| max | 172.92 | 104.86 | 104.95 | 5.04 | 3.74 | 114.44 |


In [44]:
df_false.describe()

| | diagonal | height_left | height_right | margin_low | margin_up | length |
|---|---|---|---|---|---|---|
| count | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 |
| mean | 171.90116 | 104.19034 | 104.14362 | 5.21552 | 3.35016 | 111.63064 |
| std | 0.306861 | 0.223758 | 0.270878 | 0.549086 | 0.180498 | 0.615543 |
| min | 171.04 | 103.51 | 103.43 | 3.82 | 2.92 | 109.49 |
| 25% | 171.69 | 104.04 | 103.95 | 4.84 | 3.22 | 111.2 |
| 50% | 171.91 | 104.18 | 104.16 | 5.19 | 3.35 | 111.63 |
| 75% | 172.0925 | 104.3325 | 104.32 | 5.59 | 3.4725 | 112.03 |
| max | 173.01 | 104.88 | 104.95 | 6.9 | 3.91 | 113.85 |
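
One way to read the two `describe()` tables above side by side is to express the gap between the class means in units of standard deviation. The sketch below does this with Cohen's d using a pooled standard deviation. That measure is chosen here for illustration and is not part of the original notebook. The means, standard deviations and counts are copied from the tables, and the `margin_low` figures reflect the median-filled data.

```python
from math import sqrt

# feature: (mean, std), copied from the two describe() tables above.
GENUINE = {
    "diagonal": (171.98708, 0.300441),
    "height_left": (103.94913, 0.300231),
    "height_right": (103.80865, 0.29157),
    "margin_low": (4.11592, 0.31446),
    "margin_up": (3.05213, 0.18634),
    "length": (113.20243, 0.359552),
}
FAKE = {
    "diagonal": (171.90116, 0.306861),
    "height_left": (104.19034, 0.223758),
    "height_right": (104.14362, 0.270878),
    "margin_low": (5.21552, 0.549086),
    "margin_up": (3.35016, 0.180498),
    "length": (111.63064, 0.615543),
}
N_GENUINE, N_FAKE = 1000, 500


def cohens_d(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Standardised mean difference (a minus b) using the pooled standard deviation."""
    pooled_var = ((n_a - 1) * std_a**2 + (n_b - 1) * std_b**2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / sqrt(pooled_var)


effect = {
    name: cohens_d(*GENUINE[name], N_GENUINE, *FAKE[name], N_FAKE)
    for name in GENUINE
}

# Largest absolute separation first.
ranked = sorted(effect.items(), key=lambda item: abs(item[1]), reverse=True)
for name, d in ranked:
    print(f"{name:13s} d = {d:+.2f}")
```

On these summary figures `length` and `margin_low` separate the two classes most strongly, while `diagonal` barely differs, which suggests where a classifier is likely to find most of its signal.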
