<a href="https://colab.research.google.com/github/Tayes06/financial-anomaly-detection/blob/main/fraud_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to this notebook
## Fraud detection and prevention
Through this notebook, we will detect and predict fraud in financial transactions.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


We will build two ML models: one to detect fraud and the other to predict fraud in financial transactions.

The dataset we are going to use is the PaySim dataset, available on Kaggle at https://www.kaggle.com/datasets/mtalaltariq/paysim-data. \
I've downloaded it, and we'll explore it throughout this notebook.

### Imports

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Load the dataset
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/bank-fraud-detection/paysim_dataset_for_prediction.csv")

# Show first lines
print(df.head())

   step      type    amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0     1   PAYMENT   9839.64  C1231006815       170136.0       160296.36   
1     1   PAYMENT   1864.28  C1666544295        21249.0        19384.72   
2     1  TRANSFER    181.00  C1305486145          181.0            0.00   
3     1  CASH_OUT    181.00   C840083671          181.0            0.00   
4     1   PAYMENT  11668.14  C2048537720        41554.0        29885.86   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud  
0  M1979787155             0.0             0.0        0               0  
1  M2044282225             0.0             0.0        0               0  
2   C553264065             0.0             0.0        1               0  
3    C38997010         21182.0             0.0        1               0  
4  M1230701703             0.0             0.0        0               0  


In [None]:
df.shape

(6362620, 11)

### Enrichment of the dataset
Let's add more columns to give more meaning to our dataset. These columns are not part of PaySim; they are simulated or derived here.  
**device_type:** tracks the type of device used for the transaction.  
**location:** the place from which the transaction is made.  
**ip_address:** used to track suspicious users.  
**time_of_day:** one of *Nuit* (night), *Matin* (morning), *Après-midi* (afternoon), or *Soir* (evening).  
**rapid_transactions:** number of transactions made by the same customer within an hour.

* Adding *device_type* column

In [None]:
np.random.seed(42)  # Fix the seed so every run gives the same results

# Simulate the type of device used for the transaction
device_types = ['Mobile', 'Desktop', 'ATM']
df['device_type'] = np.random.choice(device_types, size=len(df))

# Show the distribution of device types
print(df['device_type'].value_counts())

device_type
Desktop    2121683
ATM        2121611
Mobile     2119326
Name: count, dtype: int64


* Adding *location* column

In [None]:
cities = ['Douala', 'Yaoundé', 'Paris', 'Limoges', 'Abuja', 'Toronto', 'Guangzhou', 'Bafoussam', 'Abijan']
df['location'] = np.random.choice(cities, size=len(df))

# Show how often each city occurs
print(df['location'].value_counts())

location
Abijan       708859
Abuja        707246
Yaoundé      707004
Paris        706873
Guangzhou    706746
Douala       706648
Limoges      706637
Bafoussam    706350
Toronto      706257
Name: count, dtype: int64


* ip_address: not simulated in this notebook

* Adding *time_of_day* column

In [None]:
def assign_time(step):
    """Map a simulation step (one step = one hour) to a period of the day."""
    hour = step % 24  # hour of day, 0-23
    if hour < 6:
        return 'Nuit'        # night
    elif hour < 12:
        return 'Matin'       # morning
    elif hour < 18:
        return 'Après-midi'  # afternoon
    else:
        return 'Soir'        # evening

df['time_of_day'] = df['step'].apply(assign_time)

# Check how transactions are distributed across the periods of the day
print(df['time_of_day'].value_counts())

time_of_day
Après-midi    2689784
Soir          2365669
Matin         1194562
Nuit           112605
Name: count, dtype: int64
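
Calling `apply` row by row on more than six million rows is slow. A vectorized alternative can be sketched with `pd.cut` on the hour of day. The small frame `demo` below is hypothetical and only serves to illustrate the idea; the labels are the same as in `assign_time`.

```python
import pandas as pd

# Hypothetical toy data: 'step' counts hours since the start of the simulation
demo = pd.DataFrame({'step': [1, 5, 6, 11, 12, 17, 18, 23, 24, 30]})

hour = demo['step'] % 24  # hour of day, 0-23

# Left-closed bins: [0, 6), [6, 12), [12, 18), [18, 24)
demo['time_of_day'] = pd.cut(
    hour,
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=['Nuit', 'Matin', 'Après-midi', 'Soir'],
).astype(str)

print(demo)
```

With `right=False` each bin includes its lower bound and excludes its upper bound, which matches the `<` comparisons in `assign_time`.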


* Adding *rapid_transactions* column  
  📌 Count, as a running total per customer, the transactions made less than one hour (one step) after that customer's previous one.

In [None]:
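# Per customer: flag each transaction whose gap to the previous one is below
# one step (same hour), then keep a running total of those flags.
df['rapid_transactions'] = (df.groupby('nameOrig')['step']
                             .diff()        # steps since the customer's previous transaction
                             .fillna(999)   # first transaction: no previous one, use a large gap
                             .lt(1)         # True when the gap is less than one step
                             .astype(int)
                             .groupby(df['nameOrig'])
                             .cumsum())     # running count per customer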

print(df[['nameOrig', 'rapid_transactions']].head(20))

       nameOrig  rapid_transactions
0   C1231006815                   0
1   C1666544295                   0
2   C1305486145                   0
3    C840083671                   0
4   C2048537720                   0
5     C90045638                   0
6    C154988899                   0
7   C1912850431                   0
8   C1265012928                   0
9    C712410124                   0
10  C1900366749                   0
11   C249177573                   0
12  C1648232591                   0
13  C1716932897                   0
14  C1026483832                   0
15   C905080434                   0
16   C761750706                   0
17  C1237762639                   0
18  C2033524545                   0
19  C1670993182                   0
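
The first 20 rows all show 0 because each of these customers appears for the first time. To see what the chain computes when a customer repeats, here is a minimal sketch on a hypothetical toy frame (`toy`, with made-up customers `A` and `B`):

```python
import pandas as pd

# Hypothetical toy data: customer A transacts three times in step 1, then in step 2
toy = pd.DataFrame({
    'nameOrig': ['A', 'B', 'A', 'A', 'B', 'A'],
    'step':     [1,   1,   1,   1,   3,   2],
})

same_step = (toy.groupby('nameOrig')['step']
                .diff()        # steps since this customer's previous transaction
                .fillna(999)   # first transaction per customer: no previous one
                .lt(1)         # True when the gap is below one step (same hour)
                .astype(int))

# Running total of flagged transactions, computed separately for each customer
toy['rapid_transactions'] = same_step.groupby(toy['nameOrig']).cumsum()

print(toy)
```

Customer `A` ends with a count of 2 (the second and third transactions in step 1), and the count does not reset when `A` transacts again in step 2. Customer `B` stays at 0 because its two transactions are two steps apart.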


In [None]:
print(df.head())

   step      type    amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0     1   PAYMENT   9839.64  C1231006815       170136.0       160296.36   
1     1   PAYMENT   1864.28  C1666544295        21249.0        19384.72   
2     1  TRANSFER    181.00  C1305486145          181.0            0.00   
3     1  CASH_OUT    181.00   C840083671          181.0            0.00   
4     1   PAYMENT  11668.14  C2048537720        41554.0        29885.86   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud  \
0  M1979787155             0.0             0.0        0               0   
1  M2044282225             0.0             0.0        0               0   
2   C553264065             0.0             0.0        1               0   
3    C38997010         21182.0             0.0        1               0   
4  M1230701703             0.0             0.0        0               0   

  device_type location time_of_day  rapid_transactions  
0         ATM  Limoges        Nuit       

In [None]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 15 columns):
 #   Column              Dtype  
---  ------              -----  
 0   step                int64  
 1   type                object 
 2   amount              float64
 3   nameOrig            object 
 4   oldbalanceOrg       float64
 5   newbalanceOrig      float64
 6   nameDest            object 
 7   oldbalanceDest      float64
 8   newbalanceDest      float64
 9   isFraud             int64  
 10  isFlaggedFraud      int64  
 11  device_type         object 
 12  location            object 
 13  time_of_day         object 
 14  rapid_transactions  int64  
dtypes: float64(5), int64(4), object(6)
memory usage: 728.1+ MB
None


#### Data preprocessing
Delete unnecessary columns  
We are going to delete the columns that are not relevant for training the model:  
* nameOrig, nameDest: account identifiers (they don't carry useful information).  
* isFraud: we remove it because Isolation Forest is unsupervised.  
* isFlaggedFraud: can be used later for analytics, but not directly in training.  

In [None]:
df = df.drop(columns=['nameOrig', 'nameDest', 'isFraud', 'isFlaggedFraud'])
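
The evaluation step later reloads the CSV to recover `isFraud`. One possible alternative, sketched here on a hypothetical toy frame (`toy`, `features` and `y_true` are illustrative names, not part of the notebook), is to set the labels aside before dropping them:

```python
import pandas as pd

# Hypothetical toy data with the same kind of columns as the notebook
toy = pd.DataFrame({
    'nameOrig': ['C1', 'C2', 'C3'],
    'amount': [10.0, 250.0, 99.5],
    'isFraud': [0, 1, 0],
})

# Keep the ground truth aside, then drop it from the training features
y_true = toy['isFraud'].copy()
features = toy.drop(columns=['nameOrig', 'isFraud'])

print(features.columns.tolist())
print(y_true.tolist())
```

The labels stay aligned with the features as long as the rows are not reordered or filtered afterwards.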

#### Encoding of categorical variables
The following columns are categorical and must be converted to numeric values:  
* type (type of transaction)  
* device_type (device used)  
* location (transaction location)  
* time_of_day (period of the day)  

Let's use *LabelEncoder* to encode them.

In [None]:
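from sklearn.preprocessing import LabelEncoder

# Fit one encoder per column and keep it, so the codes can be mapped back later
encoders = {}
for col in ['type', 'device_type', 'location', 'time_of_day']:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])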
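
To make the encoding concrete, here is a small sketch of what `LabelEncoder` does on a handful of transaction types. The list `types` is a hypothetical sample, not the full column.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of values from the 'type' column
types = ['PAYMENT', 'TRANSFER', 'CASH_OUT', 'PAYMENT']

enc = LabelEncoder()
codes = enc.fit_transform(types)

# classes_ holds the sorted unique labels; each value is replaced by its index there
mapping = {label: code for code, label in enumerate(enc.classes_)}

print(codes.tolist())
print(mapping)
print(enc.inverse_transform([0]))  # map a code back to its label
```

The integer codes follow alphabetical order, so they suggest an ordering that the categories do not really have. That is usually tolerable for tree-based models such as Isolation Forest, but it is a design choice worth keeping in mind.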

#### Normalization of numerical variables
To prevent some variables from dominating others (e.g. amount vs. rapid_transactions), we use *StandardScaler* to standardize the amounts, the balances, and rapid_transactions.

In [None]:
from sklearn.preprocessing import StandardScaler

# Columns to standardize (zero mean, unit variance)
num_cols = ['amount', 'oldbalanceOrg', 'newbalanceOrig',
            'oldbalanceDest', 'newbalanceDest', 'rapid_transactions']

scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
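
As a quick illustration of what `StandardScaler` does, the sketch below standardizes two hypothetical columns with very different ranges (the array `X` is made up for the example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is in the hundreds, column 1 is a small count
X = np.array([[100.0, 0.0],
              [200.0, 1.0],
              [300.0, 2.0]])

scaled = StandardScaler().fit_transform(X)

# Each column is shifted to mean 0 and rescaled to standard deviation 1
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
print(scaled)
```

After scaling, both columns contain the same values, because each one is expressed as a distance from its own mean in units of its own standard deviation.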

#### Training the Isolation Forest model
1. Model initialization and training

In [None]:
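from sklearn.ensemble import IsolationForest

# Define the model; contamination is the expected share of anomalies (2%)
model = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)

# Train on the feature columns only, so that re-running this cell
# does not feed a previous 'anomaly' column back into the model
features = df.drop(columns=['anomaly'], errors='ignore')
predictions = model.fit_predict(features)  # -1 = anomaly, 1 = normal

# Recode the result: 1 = anomaly (suspected fraud), 0 = normal
df['anomaly'] = (predictions == -1).astype(int)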
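
The role of `contamination` is easier to see on a small synthetic example. The sketch below uses made-up two-dimensional data (`X`, generated with NumPy) and is only meant to illustrate the behaviour of `fit_predict`, not to reproduce the notebook's results.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (assumption): 980 points around 0 and 20 points far away around 8
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(980, 2)),
               rng.normal(8, 1, size=(20, 2))])

iso = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
pred = iso.fit_predict(X)          # -1 = anomaly, 1 = inlier

# Same recoding as in the notebook: 1 = anomaly, 0 = normal
flags = (pred == -1).astype(int)

print("share flagged:", flags.mean())
```

The share of flagged points stays close to the `contamination` value, because the decision threshold is chosen from the training scores so that roughly that fraction falls below it.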

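#### Evaluation of detected anomalies
1. Check how anomalies are distributed  

We'll check whether the model flags a realistic percentage of transactions as anomalies. With `contamination = 0.02`, about 2% of the rows are flagged by construction.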

In [None]:
print(df['anomaly'].value_counts())

anomaly
0    6235367
1     127253
Name: count, dtype: int64
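
To put these counts in perspective, the short calculation below compares the share of flagged rows with the share of actual frauds. All three numbers are copied from outputs in this notebook (the fraud count is the support of class 1 in the classification report).

```python
flagged = 127_253       # rows with anomaly == 1
total = 6_362_620       # number of rows in the dataset
actual_fraud = 8_213    # rows with isFraud == 1

flagged_rate = flagged / total
fraud_rate = actual_fraud / total
ratio = flagged / actual_fraud

print(f"flagged: {flagged_rate:.2%}")
print(f"actual fraud: {fraud_rate:.2%}")
print(f"flagged rows per actual fraud: {ratio:.1f}")
```

The model flags roughly fifteen times more rows than there are frauds, so even a perfect ranking could not reach a high precision at this contamination level.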


2. Compare with the actual fraudulent transactions

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# Compare the detected anomalies with the 'isFraud' column
# Reload only 'isFraud' from the dataset, since it was dropped from df earlier
df_eval = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/bank-fraud-detection/paysim_dataset_for_prediction.csv", usecols=['isFraud'])
print(confusion_matrix(df_eval['isFraud'], df['anomaly']))
print(classification_report(df_eval['isFraud'], df['anomaly']))

[[6228156  126251]
 [   7211    1002]]
              precision    recall  f1-score   support

           0       1.00      0.98      0.99   6354407
           1       0.01      0.12      0.01      8213

    accuracy                           0.98   6362620
   macro avg       0.50      0.55      0.50   6362620
weighted avg       1.00      0.98      0.99   6362620
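
The figures for class 1 in the report can be derived directly from the confusion matrix above. A short worked calculation, using only the four counts printed by `confusion_matrix`:

```python
import numpy as np

# Rows = true class (0, 1), columns = predicted class (0, 1)
cm = np.array([[6228156, 126251],
               [7211, 1002]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of flagged rows that are real frauds
recall = tp / (tp + fn)      # share of real frauds that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Less than 1% of the flagged transactions are real frauds, and about 12% of the real frauds are caught. The 0.98 accuracy mostly reflects the fact that almost all transactions are legitimate.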



#### Data visualisation
1. Show suspicious transactions

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10,6))
sns.scatterplot(x=df.index, y=df['amount'], hue=df['anomaly'], palette={1:"red", 0:"blue"})
plt.title("Normal transactions vs. detected anomalies")
plt.show()

#### Export the model
Export the trained Isolation Forest model for further use.

In [None]:
import joblib

# Save the model
joblib.dump(model, 'isolation_forest_model.pkl')

print("✅ Model saved successfully!")
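
For completeness, here is a minimal sketch of how a saved model can be loaded and reused. It trains a small stand-in model on random data and writes it to a temporary folder. The names `model_demo`, `X_train`, `X_new` and the file name are assumptions for illustration, not the notebook's actual model or data.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

# Stand-in model trained on random data (assumption, for illustration only)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
model_demo = IsolationForest(n_estimators=50, random_state=42).fit(X_train)

# Save to a temporary location, then load it back
path = os.path.join(tempfile.mkdtemp(), 'isolation_forest_demo.pkl')
joblib.dump(model_demo, path)
loaded = joblib.load(path)

# The loaded model scores new rows exactly like the original one
X_new = rng.normal(size=(5, 3))
pred_new = loaded.predict(X_new)   # -1 = anomaly, 1 = normal
print(pred_new)
```

New transactions must go through the same encoding and scaling, with the same column order, before being passed to `predict`. The notebook saves only the model, so saving the fitted scaler and encoders alongside it would likely be needed for real reuse.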