# Basic model

This Jupyter notebook contains a basic analysis of the fraud detection dataset.

In [23]:
import pandas as pd
from pathlib import Path

## Loading the dataset (from parquet file)

In [None]:
# "parquet" or "csv" or "pickle"
import_choice = "parquet"

In [None]:
DF_PATH = Path(f"../data/fraud_detection.{import_choice}")

In [None]:
match import_choice:
    case "csv":
        df = pd.read_csv(DF_PATH)
    case "pickle":
        df = pd.read_pickle(DF_PATH)
    case "parquet":
        df = pd.read_parquet(DF_PATH)
    case _:
        raise ValueError("Unsupported file format")
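
Depending on the chosen format, `DateTransaction` may not come back with the same dtype: Parquet and pickle normally preserve datetime columns, while `pd.read_csv` returns them as strings unless told otherwise. Below is a minimal sketch of normalizing the column after loading, using a small in-memory CSV as a stand-in for the real file. The helper name `ensure_datetime` is hypothetical and not part of the notebook.

```python
import io

import pandas as pd


def ensure_datetime(frame: pd.DataFrame, column: str = "DateTransaction") -> pd.DataFrame:
    """Return `frame` with `column` as datetime64, converting a copy if needed."""
    if pd.api.types.is_datetime64_any_dtype(frame[column]):
        return frame  # already a datetime column, nothing to do
    out = frame.copy()
    out[column] = pd.to_datetime(out[column])
    return out


# Toy stand-in for the CSV export (illustrative rows only).
csv_text = "Montant,DateTransaction\n20.0,2017-02-01 07:32:14\n57.64,2017-02-01 07:47:38\n"
raw = pd.read_csv(io.StringIO(csv_text))   # DateTransaction is read as text here
fixed = ensure_datetime(raw)               # DateTransaction is now datetime64
```

With the real data this would amount to `df = ensure_datetime(df)` right after the `match` block, so that later comparisons against `pd.Timestamp` behave the same for all three formats.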

In [26]:
df.head()

| Unnamed: 0 | ZIBZIN | IDAvisAutorisationCheque | FlagImpaye | Montant | DateTransaction | CodeDecision | VerifianceCPT1 | VerifianceCPT2 | VerifianceCPT3 | D2CB | ... | TauxImpNb_RB | TauxImpNB_CPM | EcartNumCheq | NbrMagasin3J | DiffDateTr1 | DiffDateTr2 | DiffDateTr3 | CA3TRetMtt | CA3TR | Heure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A013010004908126703060931 | 78643044 | 0 | 20.0 | 2017-02-01 07:32:14 | 1 | 0 | 0 | 0 | 551 | ... | 37.186668 | 52.076034 | 0 | 1 | 4.0 | 4.0 | 4.0 | 20.0 | 0.0 | |
| 1 | A013011306908024927155000 | 78643045 | 0 | 20.0 | 2017-02-01 07:43:37 | 1 | 0 | 0 | 0 | 551 | ... | 48.844716 | 52.076034 | 1 | 2 | 1.797685 | 4.0 | 4.0 | 28.61 | 8.61 | |
| 2 | A013010002908283134592527 | 78643046 | 0 | 57.64 | 2017-02-01 07:47:38 | 1 | 0 | 0 | 0 | 549 | ... | 73.11828 | 52.076034 | 0 | 1 | 4.0 | 4.0 | 4.0 | 57.64 | 0.0 | |
| 3 | A011010002908105209831316 | 78643047 | 0 | 54.29 | 2017-02-01 07:48:48 | 0 | 1 | 1 | 1 | 267 | ... | 110.056926 | 53.554234 | 0 | 1 | 4.0 | 4.0 | 4.0 | 54.29 | 0.0 | |
| 4 | A013010041908000125652029 | 78643048 | 0 | 26.9 | 2017-02-01 08:13:27 | 1 | 0 | 0 | 0 | 549 | ... | 45.368313 | 52.076034 | 1 | 1 | 1.997106 | 4.0 | 4.0 | 59.15 | 32.25 | |


## Train/test split  
From the project description:  

Note: the **CodeDecision** variable must not be used for prediction, because this information is **acquired post-transaction**. It can, however, be used during the learning phase, for example to analyse the data.  

The dataset covers **10 months of transactions**, from **"2017-02-01" to "2017-11-30"**.  
The sets are defined as follows:  

• **Train**: transactions that took place between **"2017-02-01" and "2017-08-31"**.  
• **Test**: transactions that took place between **"2017-09-01" and "2017-11-30"**.

-> So we split the data chronologically: the first 7 months form the train set and the last 3 months form the test set.

In [15]:
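# Boolean masks for the two periods defined in the project description.
# Note: a bare date such as '2017-08-31' is compared as 2017-08-31 00:00:00,
# so transactions later on the last day of each period are not selected.
train_index = (df['DateTransaction'] >= '2017-02-01') & (df['DateTransaction'] <= '2017-08-31')
test_index = (df['DateTransaction'] >= '2017-09-01') & (df['DateTransaction'] <= '2017-11-30')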

train = df[train_index]
test = df[test_index]

del train_index, test_index
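
A bound such as `'2017-08-31'` stands for midnight at the start of that day, so an inclusive `<=` comparison leaves out any transaction that happens later on the last day of the period. The sketch below uses toy timestamps (illustrative only, not taken from the dataset) to contrast that with a half-open interval that ends at the first day of the following month.

```python
import pandas as pd

# Toy timestamps around the train/test boundary (illustrative only).
toy = pd.DataFrame({
    "DateTransaction": pd.to_datetime([
        "2017-08-30 21:58:00",
        "2017-08-31 10:15:00",  # falls on the last day of the train period
        "2017-09-01 07:34:17",
    ])
})

start = pd.Timestamp("2017-02-01")

# Inclusive upper bound at midnight: the 2017-08-31 10:15 row is left out.
inclusive = toy[(toy["DateTransaction"] >= start)
                & (toy["DateTransaction"] <= pd.Timestamp("2017-08-31"))]

# Half-open interval [2017-02-01, 2017-09-01): the whole last day is kept.
half_open = toy[(toy["DateTransaction"] >= start)
                & (toy["DateTransaction"] < pd.Timestamp("2017-09-01"))]

print(len(inclusive), len(half_open))
```

Whether the real dataset contains transactions on 2017-08-31 or 2017-11-30 has to be checked on the data itself; the toy frame only illustrates how the two bounds behave.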

### Checking the split:

In [22]:
# checking train:
print(f"Train set from {train['DateTransaction'].min()} to {train['DateTransaction'].max()}")
assert train['DateTransaction'].min() >= pd.Timestamp('2017-02-01')
assert train['DateTransaction'].max() <= pd.Timestamp('2017-08-31')

# checking test:
print(f"Test set from {test['DateTransaction'].min()} to {test['DateTransaction'].max()}")
assert test['DateTransaction'].min() >= pd.Timestamp('2017-09-01')
assert test['DateTransaction'].max() <= pd.Timestamp('2017-11-30')

# checking the proportion:
print(f"Train set size: {len(train)}")
print(f"Test set size: {len(test)}")
print(f"proportion: {len(train) / (len(train) + len(test)):.2%} / {len(test) / (len(train) + len(test)):.2%}")

Train set from 2017-02-01 07:32:14 to 2017-08-30 21:58:00
Test set from 2017-09-01 07:34:17 to 2017-11-29 21:48:56
Train set size: 3888468
Test set size: 737067
proportion: 84.07% / 15.93%
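
Following the remark in the project description, `CodeDecision` should not be fed to a predictive model because it is only known after the transaction. Below is a minimal sketch of separating features from the target while leaving that column out, on a toy frame with a few of the dataset's column names. Treating `FlagImpaye` as the target is an assumption based on its name, the toy values are made up, and the helper `split_features_target` is hypothetical.

```python
import pandas as pd

TARGET = "FlagImpaye"                # assumed target column
POST_TRANSACTION = ["CodeDecision"]  # known only after the transaction


def split_features_target(frame: pd.DataFrame):
    """Return (X, y), dropping the target and post-transaction columns from X."""
    X = frame.drop(columns=[TARGET, *POST_TRANSACTION])
    y = frame[TARGET]
    return X, y


# Toy frame with made-up values, for illustration only.
toy = pd.DataFrame({
    "Montant": [20.0, 57.64, 54.29],
    "CodeDecision": [1, 1, 0],
    "FlagImpaye": [0, 0, 1],
})
X, y = split_features_target(toy)
print(list(X.columns), f"positive rate: {y.mean():.2%}")
```

Applied to the notebook's frames, the same helper would be called once on `train` and once on `test`, which also gives a quick way to compare the positive rate of the two periods.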
