## Fraud Detection using Logistic Regression

In [1]:
import numpy as np
import pandas as pd

import statsmodels.api as sm

### Load the data

In [2]:
df = pd.read_csv('../../Datasets/fraud_dataset.csv')
df.head()

| | transaction_id | duration | day | fraud |
|---|---|---|---|---|
| 0 | 28891 | 21.3026 | weekend | False |
| 1 | 61629 | 22.932765 | weekend | False |
| 2 | 53707 | 32.694992 | weekday | False |
| 3 | 47812 | 32.784252 | weekend | False |
| 4 | 43455 | 17.756828 | weekend | False |


### Understand the data

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8793 entries, 0 to 8792
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   transaction_id  8793 non-null   int64  
 1   duration        8793 non-null   float64
 2   day             8793 non-null   object 
 3   fraud           8793 non-null   bool   
dtypes: bool(1), float64(1), int64(1), object(1)
memory usage: 214.8+ KB


In [4]:
df.describe()

| | transaction_id | duration |
|---|---|---|
| count | 8793.0 | 8793.0 |
| mean | 55243.38451 | 29.704626 |
| std | 21792.120147 | 7.464452 |
| min | 17301.0 | 0.215113 |
| 25% | 36454.0 | 25.211787 |
| 50% | 55420.0 | 29.92316 |
| 75% | 74131.0 | 34.532567 |
| max | 92828.0 | 60.412763 |


### Convert categories into numeric format

In [5]:
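# Encode the two-level 'day' column as a 0/1 indicator (1 = weekday)
df['weekday'] = df['day'].map({'weekend': 0, 'weekday': 1})
# Encode the boolean target as 0/1
df['fraud'] = df['fraud'].map({False: 0, True: 1})

# The original text column is no longer needed
df.drop('day', axis=1, inplace=True)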

In [6]:
df.head()

| | transaction_id | duration | fraud | weekday |
|---|---|---|---|---|
| 0 | 28891 | 21.3026 | 0 | 0 |
| 1 | 61629 | 22.932765 | 0 | 0 |
| 2 | 53707 | 32.694992 | 0 | 1 |
| 3 | 47812 | 32.784252 | 0 | 0 |
| 4 | 43455 | 17.756828 | 0 | 0 |
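One thing to watch with `Series.map` is that any value missing from the mapping silently becomes `NaN`. The sketch below shows the same encoding with a check for unmapped categories. The `toy` frame is a hypothetical stand-in for the real dataset, and using `astype(int)` for the boolean column is an alternative to the dictionary mapping, not what the notebook itself does.

```python
import pandas as pd

# Hypothetical toy frame mirroring the 'day' and 'fraud' columns above.
toy = pd.DataFrame({
    "day": ["weekend", "weekday", "weekend"],
    "fraud": [False, False, True],
})

# Series.map returns NaN for any value missing from the mapping,
# so verify that every row was mapped before dropping the source column.
weekday = toy["day"].map({"weekend": 0, "weekday": 1})
assert weekday.notna().all(), "unmapped category in 'day'"

toy["weekday"] = weekday.astype(int)
toy["fraud"] = toy["fraud"].astype(int)  # False -> 0, True -> 1
toy = toy.drop(columns="day")

print(toy)
```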


### Fitting the model

In [7]:
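# sm.Logit does not add an intercept automatically, so add a constant column
df['intercept'] = 1
# Model fraud (0/1) as a function of duration and the weekday indicator
logit_mod = sm.Logit(df['fraud'], df[['intercept', 'duration', 'weekday']])
logit_res = logit_mod.fit()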

Optimization terminated successfully.
         Current function value: 0.002411
         Iterations 16


In [8]:
logit_res.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | fraud | No. Observations: | 8793.0 |
| Model: | Logit | Df Residuals: | 8790.0 |
| Method: | MLE | Df Model: | 2.0 |
| Date: | Wed, 21 Jul 2021 | Pseudo R-squ.: | 0.9633 |
| Time: | 16:46:57 | Log-Likelihood: | -21.2 |
| converged: | True | LL-Null: | -578.1 |
| Covariance Type: | nonrobust | LLR p-value: | 1.39e-242 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 9.8709 | 1.944 | 5.078 | 0.000 | 6.061 | 13.681 |
| duration | -1.4637 | 0.290 | -5.039 | 0.000 | -2.033 | -0.894 |
| weekday | 2.5465 | 0.904 | 2.816 | 0.005 | 0.774 | 4.319 |


In [9]:
# Exponentiate the duration and weekday coefficients to get odds ratios
np.exp(-1.4637), np.exp(2.5465)

(0.2313785882117941, 12.762357271496972)

In [10]:
# Reciprocal of the duration odds ratio: the effect of a 1-unit decrease
1/np.exp(-1.4637)

4.321921089278333
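The same exponentiation can be applied to the confidence bounds in the summary table to get an interval for each odds ratio. The following is a minimal sketch that copies the coefficients and bounds from the table by hand; the `summary` and `odds_ratios` names are illustrative and not part of the notebook.

```python
import math

# (coef, lower bound, upper bound) copied from the summary table above.
summary = {
    "duration": (-1.4637, -2.033, -0.894),
    "weekday": (2.5465, 0.774, 4.319),
}

# exp is monotonically increasing, so exponentiating the bounds of a
# coefficient interval gives the bounds of the odds-ratio interval.
odds_ratios = {
    name: tuple(math.exp(v) for v in (coef, low, high))
    for name, (coef, low, high) in summary.items()
}

for name, (ratio, low, high) in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

Neither interval contains 1, which is consistent with both coefficients having p-values below 0.05 in the summary.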

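**NOTE:**

The exponentiated coefficients are odds ratios, so they describe multiplicative changes in the odds of fraud rather than in its probability.

* The odds of fraud are 12.76 times as high on weekdays as on weekends, holding all else constant.
* For each 1-unit increase in duration, the odds of fraud are multiplied by 0.23, holding all else constant.
* Equivalently, for each 1-unit decrease in duration, the odds of fraud are multiplied by 4.32, holding all else constant.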
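To make the distinction between odds and probability concrete, the sketch below plugs the fitted coefficients into the logistic function and compares weekday and weekend predictions at a few durations. The helper names and the duration values are illustrative assumptions, chosen only to show the effect; they are not taken from the dataset.

```python
import math

# Coefficients copied from the fitted model summary above.
INTERCEPT, B_DURATION, B_WEEKDAY = 9.8709, -1.4637, 2.5465


def fraud_probability(duration, weekday):
    """Predicted probability of fraud from the logistic model."""
    linear = INTERCEPT + B_DURATION * duration + B_WEEKDAY * weekday
    return 1.0 / (1.0 + math.exp(-linear))


def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


# Arbitrary illustrative durations, not taken from the data.
for duration in (6.0, 10.0, 14.0):
    p_weekend = fraud_probability(duration, weekday=0)
    p_weekday = fraud_probability(duration, weekday=1)
    odds_ratio = odds(p_weekday) / odds(p_weekend)
    prob_ratio = p_weekday / p_weekend
    print(f"duration={duration}: odds ratio {odds_ratio:.2f}, "
          f"probability ratio {prob_ratio:.2f}")
```

The odds ratio stays fixed at `exp(2.5465)` for every duration, while the ratio of probabilities changes with duration and only approaches the odds ratio when fraud is rare.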