### Fitting Logistic Regression

In this notebook, I fit a logistic regression model to predict whether a transaction is fraudulent.

To get started, let's import the libraries and take a quick look at the dataset.

In [15]:
import numpy as np
import pandas as pd
import statsmodels.api as sm


df = pd.read_csv('./fraud_dataset.csv')
df.head()

|   | transaction_id | duration | day | fraud |
|---|---|---|---|---|
| 0 | 28891 | 21.3026 | weekend | False |
| 1 | 61629 | 22.932765 | weekend | False |
| 2 | 53707 | 32.694992 | weekday | False |
| 3 | 47812 | 32.784252 | weekend | False |
| 4 | 43455 | 17.756828 | weekend | False |


`1.` As you can see, two columns need to be changed to dummy variables. I replace each of these columns with its dummy version, using 1 for `weekday` and `True`, and 0 otherwise.

In [16]:
# Replace the 'day' and 'fraud' columns with 0/1 dummy variables.

df['day'] = df['day'].replace({'weekday': 1, 'weekend': 0})
df['fraud'] = df['fraud'].replace({True: 1, False: 0})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8793 entries, 0 to 8792
Data columns (total 4 columns):
transaction_id    8793 non-null int64
duration          8793 non-null float64
day               8793 non-null int64
fraud             8793 non-null int64
dtypes: float64(1), int64(3)
memory usage: 274.9 KB


In [18]:
# Proportion of fraudulent transactions (the mean of a 0/1 column, here 107/8793):
df['fraud'].mean()

0.012168770612987604

In [24]:
# Average duration of fraudulent transactions:
avg_duration_fraud = df[df['fraud'] == 1]['duration'].mean()

print("The average duration for fraudulent transactions is:", avg_duration_fraud)

The average duration for fraudulent transactions is: 4.62424737062


In [28]:
# Proportion of weekday transactions (the mean of a 0/1 column, here 3036/8793):
df['day'].mean()

0.3452746502900034

In [29]:
#The average duration for non-fraudulent transactions:
avg_duration_nfraud = df[df['fraud'] == 0]['duration'].mean()

print("The average duration for non-fraudulent transactions is:", avg_duration_nfraud)

The average duration for non-fraudulent transactions is: 30.0135831325
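
The four summary numbers above can also be computed in a more compact way. The following is a minimal sketch on a small, made-up DataFrame (the `toy` values are hypothetical and only mimic the layout of the encoded data); it relies on the fact that the mean of a 0/1 column is the proportion of ones.

```python
import pandas as pd

# Hypothetical toy data in the same layout as the encoded DataFrame.
toy = pd.DataFrame({
    'duration': [30.0, 28.0, 5.0, 32.0, 4.0],
    'day':      [0, 1, 1, 0, 1],   # 1 = weekday, 0 = weekend
    'fraud':    [0, 0, 1, 0, 1],   # 1 = fraud, 0 = not fraud
})

# The mean of a 0/1 column is the proportion of ones.
fraud_rate = toy['fraud'].mean()
weekday_rate = toy['day'].mean()

# Average duration for each fraud class in a single call.
mean_duration = toy.groupby('fraud')['duration'].mean()

print(fraud_rate, weekday_rate)
print(mean_duration)
```

On the real data, the same calls on `df` should reproduce the proportions and average durations reported in the cells above.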


`2.` Now that I have dummy variables, I fit a logistic regression model to predict whether a transaction is fraudulent using both `day` and `duration`, not forgetting an intercept!

In [30]:
#Adding an intercept column to the DataFrame.
df['intercept'] = 1

In [31]:
# Defining the independent and dependent variables
X = df[['intercept', 'day', 'duration']]
y = df['fraud']

# Fitting a logistic regression model
logit_model = sm.Logit(y, X)
result = logit_model.fit()

# Displaying the summary of the fitted model
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.002411
         Iterations 16
                        Results: Logit
Model:              Logit            No. Iterations:   16.0000
Dependent Variable: fraud            Pseudo R-squared: 0.963  
Date:               2023-06-02 10:14 AIC:              48.4009
No. Observations:   8793             BIC:              69.6460
Df Model:           2                Log-Likelihood:   -21.200
Df Residuals:       8790             LL-Null:          -578.10
Converged:          1.0000           Scale:            1.0000 
---------------------------------------------------------------
            Coef.   Std.Err.     z     P>|z|    [0.025   0.975]
---------------------------------------------------------------
intercept   9.8709    1.9438   5.0783  0.0000   6.0613  13.6806
day         2.5465    0.9043   2.8160  0.0049   0.7741   4.3188
duration   -1.4637    0.2905  -5.0389  0.0000  -2.0331  -0.8944
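
As a sanity check on the summary above, the fit statistics can be recomputed by hand from the two log-likelihoods. The sketch below assumes McFadden's pseudo R-squared (1 - LL/LL-Null) and the usual AIC and BIC definitions with three estimated parameters. The inputs are copied from the printed table, so small rounding differences against the reported values are expected.

```python
import math

# Values copied from the summary table.
n_obs = 8793
n_params = 3          # intercept, day, duration
ll_model = -21.200    # Log-Likelihood
ll_null = -578.10     # LL-Null

# McFadden's pseudo R-squared: 1 - LL_model / LL_null
pseudo_r2 = 1 - ll_model / ll_null

# AIC = 2k - 2*LL, BIC = k*ln(n) - 2*LL
aic = 2 * n_params - 2 * ll_model
bic = n_params * math.log(n_obs) - 2 * ll_model

# The "Current function value" is the average negative log-likelihood.
mean_neg_ll = -ll_model / n_obs

print(round(pseudo_r2, 3), round(aic, 2), round(bic, 2), round(mean_neg_ll, 6))
```

These agree with the reported Pseudo R-squared, AIC, BIC, and current function value to within the rounding of the printed log-likelihood.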



>Both `duration` (p < 0.0001) and `day` (p = 0.0049) have p-values below 0.05, so both are statistically significant.

> - On weekdays, the odds of fraud are 12.76 times the odds on weekends, holding duration constant.
>     - The exponentiated coefficient for `day` is exp(2.5465) = 12.76.
> - For each minute less spent on the transaction, the odds of fraud are multiplied by 4.32, holding the day of week constant.
>     - This is the reciprocal of the exponentiated coefficient for `duration`: 1/exp(-1.4637) = exp(1.4637) = 4.32.
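
The odds ratios quoted above can be reproduced directly from the coefficient table. The sketch below uses only the reported coefficients, not the fitted `result` object, and adds a small helper, `fraud_probability` (a hypothetical name introduced here for illustration), that turns the fitted log-odds into a predicted probability via the inverse logit.

```python
import math

# Coefficients copied from the summary table.
coef = {'intercept': 9.8709, 'day': 2.5465, 'duration': -1.4637}

# Odds ratio for weekday vs. weekend, holding duration constant.
or_weekday = math.exp(coef['day'])

# Odds ratio for a one-unit decrease in duration, holding day constant.
or_shorter = math.exp(-coef['duration'])

def fraud_probability(day, duration):
    """Predicted probability of fraud from the fitted log-odds.

    log-odds = intercept + b_day * day + b_duration * duration
    """
    log_odds = (coef['intercept']
                + coef['day'] * day
                + coef['duration'] * duration)
    return 1 / (1 + math.exp(-log_odds))

print(round(or_weekday, 2), round(or_shorter, 2))  # 12.76 4.32
```

Note that these ratios apply to the odds, p / (1 - p), not to the probability itself. Comparing `fraud_probability(1, d)` with `fraud_probability(0, d)` at a fixed duration `d` gives odds that differ by exactly the factor `or_weekday`, while the probabilities differ by a smaller factor.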