# Detecting Fraud with Logistic Regression

### Fitting a logistic regression model to a dataset in order to predict whether a transaction is a fraud or not.

**Logistic regression is a method for predicting the value of a dichotomous dependent variable *Y* from a set of explanatory variables, which can be either quantitative or qualitative.**

***Y* is the qualitative response, which describes the outcome of a random event.**
***Y* can take only two values, 0 or 1.**
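
As a reminder, the standard form of the model can be written as follows. The symbols $p$, $\beta_j$ and $x_j$ are introduced here only for illustration and are not used elsewhere in this notebook: $p$ is the probability that $Y = 1$, the $x_j$ are the explanatory variables and the $\beta_j$ are the coefficients to be estimated.

$$
p = P(Y = 1 \mid x_1, \dots, x_k) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
\qquad\Longleftrightarrow\qquad
\log\frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
$$

The right-hand form shows that the model is linear in the log-odds, which is why exponentiated coefficients are read as odds ratios.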

**In this project we analyse a dataset of transactions to determine whether their duration and the day of the week on which they were made affect the likelihood that a transaction turns out to be a fraud.**

**To fit the logistic regression we will first use the *statsmodels* library, then *scikit-learn*.**

### Option 1 - using statsmodels

In [2]:
import numpy as np
import pandas as pd
import statsmodels.api as sm


df = pd.read_csv('fraud_dataset.csv')
df.head()

| | transaction_id | duration | day | fraud |
|---|---|---|---|---|
| 0 | 28891 | 21.3026 | weekend | False |
| 1 | 61629 | 22.932765 | weekend | False |
| 2 | 53707 | 32.694992 | weekday | False |
| 3 | 47812 | 32.784252 | weekend | False |
| 4 | 43455 | 17.756828 | weekend | False |


In [3]:
# Convert the 'day' and 'fraud' columns into dummy variables
df['weekday'] = pd.get_dummies(df['day'])['weekday']
df[['not_fraud','fraud']] = pd.get_dummies(df['fraud'])
df = df.drop('not_fraud', axis=1)  # we only need n-1 dummy columns per categorical column

df.head()

| | transaction_id | duration | day | fraud | weekday |
|---|---|---|---|---|---|
| 0 | 28891 | 21.3026 | weekend | 0 | 0 |
| 1 | 61629 | 22.932765 | weekend | 0 | 0 |
| 2 | 53707 | 32.694992 | weekday | 0 | 1 |
| 3 | 47812 | 32.784252 | weekend | 0 | 0 |
| 4 | 43455 | 17.756828 | weekend | 0 | 0 |


In [4]:
# Fit the logistic regression model
# sm.Logit does not add a constant term on its own, so we add an intercept column explicitly
df['intercept'] = 1
logit_mod = sm.Logit(df['fraud'], df[['intercept', 'weekday', 'duration']])
results = logit_mod.fit()
results.summary()

  return np.sum(np.log(self.cdf(q*np.dot(X,params))))


Optimization terminated successfully.
         Current function value: inf
         Iterations 16




| | | | |
|---|---|---|---|
| Dep. Variable: | fraud | No. Observations: | 8793 |
| Model: | Logit | Df Residuals: | 8790 |
| Method: | MLE | Df Model: | 2 |
| Date: | Thu, 10 Jan 2019 | Pseudo R-squ.: | inf |
| Time: | 15:56:35 | Log-Likelihood: | -inf |
| converged: | True | LL-Null: | 0.0 |
| | | LLR p-value: | 1.0 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 9.8709 | 1.944 | 5.078 | 0.000 | 6.061 | 13.681 |
| weekday | 2.5465 | 0.904 | 2.816 | 0.005 | 0.774 | 4.319 |
| duration | -1.4637 | 0.290 | -5.039 | 0.000 | -2.033 | -0.894 |


#### Both weekday and duration are statistically significant factors (p-value < 0.05)
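
To make the significance columns less of a black box, here is a minimal sketch of how the `z`, `P>|z|` and confidence-interval columns relate to `coef` and `std err`, assuming the usual Wald test with a standard normal reference distribution. The helper `wald_test` is a hypothetical name written for this illustration, not a statsmodels function, and the inputs are the rounded values printed in the summary table, so the results can differ from the table in the last digit.

```python
import math

# Coefficients and standard errors copied (rounded) from the summary table.
estimates = {
    "intercept": (9.8709, 1.944),
    "weekday": (2.5465, 0.904),
    "duration": (-1.4637, 0.290),
}


def wald_test(coef, std_err, z_crit=1.959964):
    """Return the z statistic, two-sided p-value and 95% confidence interval.

    Hypothetical helper for illustration; z_crit is the 97.5% quantile
    of the standard normal distribution.
    """
    z = coef / std_err
    # Two-sided p-value under a standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    ci = (coef - z_crit * std_err, coef + z_crit * std_err)
    return z, p_value, ci


for name, (coef, std_err) in estimates.items():
    z, p_value, (low, high) = wald_test(coef, std_err)
    print(f"{name:<10} z={z:7.3f}  p={p_value:.3f}  95% CI=[{low:.3f}, {high:.3f}]")
```

A factor is called significant at the 5% level when its p-value is below 0.05, which is equivalent to its 95% confidence interval not containing zero.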


#### Interpretation of the coefficients

In [5]:
np.exp(2.5465)  # exponentiation of coefficient for weekday

12.762357271496972

#### If the transaction happens on a weekday, the odds that it is a fraud are 12.76 times the odds for a transaction made on a weekend, ceteris paribus

In [6]:
np.exp(-1.4637)  # exponentiation of coefficient for duration ==> result <1

0.2313785882117941

In [7]:
1/np.exp(-1.4637)    # as simple exponentiation gave result <1, we take the reciprocal

4.321921089278333

#### For each one-unit decrease in duration, the odds that a transaction is a fraud are multiplied by 4.32, ceteris paribus
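
Exponentiated coefficients act on the odds p / (1 - p), not directly on the probability p. The sketch below illustrates the difference using the coefficients from the summary table. The function names and the duration of 8 time units are assumptions chosen purely for illustration.

```python
import math

# Coefficients copied from the statsmodels summary table.
INTERCEPT, B_WEEKDAY, B_DURATION = 9.8709, 2.5465, -1.4637


def fraud_probability(duration, weekday):
    """Predicted P(fraud) from the fitted logit; weekday is 1 (weekday) or 0 (weekend)."""
    log_odds = INTERCEPT + B_WEEKDAY * weekday + B_DURATION * duration
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


duration = 8.0  # hypothetical duration, chosen only for illustration
p_weekend = fraud_probability(duration, weekday=0)
p_weekday = fraud_probability(duration, weekday=1)

print(f"P(fraud | weekend) = {p_weekend:.4f}")
print(f"P(fraud | weekday) = {p_weekday:.4f}")
print(f"odds ratio         = {odds(p_weekday) / odds(p_weekend):.2f}")
print(f"probability ratio  = {p_weekday / p_weekend:.2f}")
```

The odds ratio equals `np.exp(2.5465)` whatever duration is plugged in, whereas the ratio of the two probabilities is smaller and changes with the duration.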

### Option 2 - using [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [8]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score
from sklearn.model_selection import train_test_split
np.random.seed(42)

df = pd.read_csv('fraud_dataset.csv')
df.head()

| | transaction_id | duration | day | fraud |
|---|---|---|---|---|
| 0 | 28891 | 21.3026 | weekend | False |
| 1 | 61629 | 22.932765 | weekend | False |
| 2 | 53707 | 32.694992 | weekday | False |
| 3 | 47812 | 32.784252 | weekend | False |
| 4 | 43455 | 17.756828 | weekend | False |


In [9]:
# Change 'day' and 'fraud' columns into dummy variables
df['weekday'] = pd.get_dummies(df['day'])['weekday']
df[['not_fraud','fraud']] = pd.get_dummies(df['fraud'])
df = df.drop('not_fraud', axis=1) 

df.head()

| | transaction_id | duration | day | fraud | weekday |
|---|---|---|---|---|---|
| 0 | 28891 | 21.3026 | weekend | 0 | 0 |
| 1 | 61629 | 22.932765 | weekend | 0 | 0 |
| 2 | 53707 | 32.694992 | weekday | 0 | 1 |
| 3 | 47812 | 32.784252 | weekend | 0 | 0 |
| 4 | 43455 | 17.756828 | weekend | 0 | 0 |


#### Defining the dependent and independent variables for the model

In [10]:
y = df['fraud'] # define response column
X = df[['duration','weekday']] # defining independent variables
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # splitting dataset in train and test

#### Fitting the model on the training set and predicting y on the test set

In [11]:
log_mod = LogisticRegression()  
log_mod.fit(X_train, y_train) # fitting model on training set
y_preds = log_mod.predict(X_test)  # predicting y on test set

#### Metrics

In [12]:
precision_score(y_test, y_preds)  # PRECISION: share of predicted frauds that are real frauds (penalises false positives)

0.9444444444444444

In [13]:
recall_score(y_test, y_preds)   # RECALL: share of real frauds that the model detected (penalises false negatives)

1.0

In [14]:
accuracy_score(y_test, y_preds)   # ACCURACY: share of all transactions labelled correctly (positives and negatives alike)

0.9994314951677089

In [15]:
confusion_matrix(y_test, y_preds)  

array([[1741,    1],
       [   0,   17]], dtype=int64)

#### The confusion matrix shows that on the test set (20% of the entire dataset) the model correctly classified 1741 of the 1742 non-fraudulent transactions (one was wrongly labelled as fraud) and all 17 of the 17 frauds.
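
As a cross-check, the three metrics can be recomputed by hand from the four cells of the confusion matrix. This is a minimal sketch in plain Python. The variable names are chosen for illustration, and the layout assumed is the one scikit-learn documents for `confusion_matrix`: rows are true classes, columns are predicted classes.

```python
# Cells of the confusion matrix reported for the test set:
# [[true negatives, false positives],
#  [false negatives, true positives]]
tn, fp = 1741, 1
fn, tp = 0, 17
total = tn + fp + fn + tp

precision = tp / (tp + fp)     # predicted frauds that are real frauds
recall = tp / (tp + fn)        # real frauds that were detected
accuracy = (tp + tn) / total   # all transactions labelled correctly

# Accuracy of a trivial baseline that labels every transaction as not fraud.
baseline_accuracy = (tn + fp) / total

print(f"precision         = {precision:.4f}")
print(f"recall            = {recall:.4f}")
print(f"accuracy          = {accuracy:.4f}")
print(f"baseline accuracy = {baseline_accuracy:.4f}")
```

Because only 17 of the 1759 test transactions are frauds, the trivial baseline already reaches an accuracy above 99%, so precision and recall are the more informative metrics for this dataset.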