### Fitting Logistic Regression

In this first notebook, you will fit a logistic regression model to predict whether a transaction is fraudulent.

To get started, let's import the libraries and take a quick look at the dataset.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm


df = pd.read_csv('fraud_dataset.csv')
df.head(10)

Unnamed: 0,transaction_id,duration,day,fraud
0,28891,21.3026,weekend,False
1,61629,22.932765,weekend,False
2,53707,32.694992,weekday,False
3,47812,32.784252,weekend,False
4,43455,17.756828,weekend,False
5,21351,21.704516,weekday,False
6,86349,15.420331,weekend,False
7,47299,33.245462,weekday,False
8,90271,27.827033,weekend,False
9,75021,39.492428,weekend,False


`1.` As you can see, two columns (`day` and `fraud`) need to be changed to dummy variables.  Replace each of these columns with its dummy version, using 1 for `weekday` and for `True`, and 0 otherwise.  Use the first quiz to answer a few questions about the dataset.

In [2]:
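# get_dummies returns its columns in sorted order: weekday/weekend, then False/True.
# dtype=int gives plain 0/1 integers.
df[['weekday', 'weekend']] = pd.get_dummies(df['day'], dtype=int)
df[['no_fraud', 'fraud']] = pd.get_dummies(df['fraud'], dtype=int)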

df.head()

Unnamed: 0,transaction_id,duration,day,fraud,weekday,weekend,no_fraud
0,28891,21.3026,weekend,0,0,1,1
1,61629,22.932765,weekend,0,0,1,1
2,53707,32.694992,weekday,0,1,0,1
3,47812,32.784252,weekend,0,0,1,1
4,43455,17.756828,weekend,0,0,1,1
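
The assignment above relies on `pd.get_dummies` returning its columns in sorted order (`weekday` before `weekend`, `False` before `True`). A minimal sketch on a made-up three-row frame (the `toy` data is hypothetical, not taken from `fraud_dataset.csv`) shows that ordering, along with an equivalent encoding that does not depend on it:

```python
import pandas as pd

# Hypothetical toy data, only for illustrating the encoding.
toy = pd.DataFrame({
    'day': ['weekend', 'weekday', 'weekend'],
    'fraud': [False, True, False],
})

# get_dummies orders its output columns by sorted category value.
day_dummies = pd.get_dummies(toy['day'], dtype=int)
print(list(day_dummies.columns))

# Equivalent encoding that does not rely on column order.
toy['weekday'] = (toy['day'] == 'weekday').astype(int)
toy['fraud'] = toy['fraud'].astype(int)
print(toy)
```

The explicit comparison is a little more verbose, but it keeps working if a new category ever appears in `day`.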


In [3]:
# The proportion of fraudulent transactions.
df.fraud.mean()

0.012168770612987604

In [4]:
# The average duration by fraud status (1 = fraudulent).
# Selecting 'duration' before taking the mean avoids averaging the non-numeric 'day' column.
df.groupby('fraud')['duration'].mean()

fraud
0    30.013583
1     4.624247
Name: duration, dtype: float64

In [5]:
# The proportion of weekday transactions.
df.weekday.mean()

0.3452746502900034

In [6]:
# The proportion of weekend transactions.
df.weekend.mean()

0.6547253497099966

`2.` Now that you have dummy variables, fit a logistic regression model to predict whether a transaction is fraudulent using both day and duration.  Don't forget an intercept!  Use the second quiz below to make sure you fit the model correctly.

In [7]:
df['intercept'] = 1

logit_mod = sm.Logit(df['fraud'], df[['intercept', 'duration', 'weekday']])
results = logit_mod.fit()
results.summary()

Optimization terminated successfully.
         Current function value: inf
         Iterations 16


  return 1/(1+np.exp(-X))
  return np.sum(np.log(self.cdf(q*np.dot(X,params))))


Dep. Variable:,fraud,No. Observations:,8793.0
Model:,Logit,Df Residuals:,8790.0
Method:,MLE,Df Model:,2.0
Date:,"Tue, 23 Mar 2021",Pseudo R-squ.:,inf
Time:,20:07:59,Log-Likelihood:,-inf
converged:,True,LL-Null:,0.0
Covariance Type:,nonrobust,LLR p-value:,1.0

,coef,std err,z,P>|z|,[0.025,0.975]
intercept,9.8709,1.944,5.078,0.000,6.061,13.681
duration,-1.4637,0.290,-5.039,0.000,-2.033,-0.894
weekday,2.5465,0.904,2.816,0.005,0.774,4.319
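
The `inf` values in the summary, together with the two code fragments printed above it (`1/(1+np.exp(-X))` and `np.log(...)`), are consistent with a numerical problem while evaluating the log-likelihood: if the argument passed to the logistic function is very negative, `np.exp(-X)` overflows, the logistic CDF evaluates to 0, and its log is `-inf`. This is a reading of the output, not something the notebook states. A small NumPy sketch of the effect, with made-up inputs, alongside a more stable way to compute the same quantity:

```python
import numpy as np

# Hypothetical linear-predictor values; the last one is extreme on purpose.
x = np.array([0.0, -30.0, -800.0])

with np.errstate(over='ignore', divide='ignore'):
    # Naive form, matching the fragments in the output above.
    naive = np.log(1 / (1 + np.exp(-x)))

# Stable form: log(sigmoid(x)) = -log(1 + exp(-x)) = -logaddexp(0, -x)
stable = -np.logaddexp(0, -x)

print(naive)   # last entry is -inf
print(stable)  # last entry is -800.0
```

The coefficient estimates themselves can still be usable in such a case, but likelihood-based figures such as `Pseudo R-squ.` and `LLR p-value` should not be trusted when the log-likelihood is reported as `-inf`.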


### Interpretation

- The p-values for `duration` and `weekday` suggest that both are statistically significant predictors of whether a transaction is fraudulent.
- In most cases we are not interested in interpreting the `intercept`.
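
The `z` and `P>|z|` columns in the summary can be reproduced approximately from `coef` and `std err`: `z` is their ratio, and the p-value is the two-sided tail probability of a standard normal. The sketch below uses the rounded values printed in the table, so small differences from the reported `z` are expected. The helper name `two_sided_p` is made up for this illustration.

```python
import math

# (coef, std err) as printed in the summary table (rounded).
summary = {
    'intercept': (9.8709, 1.944),
    'duration': (-1.4637, 0.290),
    'weekday': (2.5465, 0.904),
}

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

for name, (coef, se) in summary.items():
    z = coef / se
    print(f'{name}: z = {z:.3f}, p = {two_sided_p(z):.3f}')
```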

In [10]:
# We need to exponentiate the coefficients of the explanatory variables.
np.exp(-1.4637), np.exp(2.5465)

(0.2313785882117941, 12.762357271496972)

In [12]:
1/np.exp(-1.4637)

4.321921089278333
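
Why exponentiate? Under the usual logistic regression model (stated here in its standard textbook form, with $p$ the probability of fraud and $\beta_0, \beta_1, \beta_2$ standing for the `intercept`, `duration`, and `weekday` coefficients), the coefficients are additive on the log-odds scale, so a one-unit change multiplies the odds by $e^{\beta}$:

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1\,\text{duration} + \beta_2\,\text{weekday}
\quad\Longrightarrow\quad
\frac{\text{odds}(\text{duration}+1)}{\text{odds}(\text{duration})} = e^{\beta_1},
\qquad
\frac{\text{odds}(\text{weekday}=1)}{\text{odds}(\text{weekday}=0)} = e^{\beta_2}
```

With the fitted values, $e^{-1.4637} \approx 0.231$ and $e^{2.5465} \approx 12.76$. The reciprocal $1/e^{-1.4637} = e^{1.4637} \approx 4.32$ is the odds multiplier for a one-unit decrease in duration.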

### Interpretation

- On weekdays, the odds of fraud are 12.76 times the odds on weekends, holding all other variables constant.
- For each 1-unit increase in duration, the odds of fraud are multiplied by 0.23, holding all other variables constant.
- For each 1-unit **decrease** in duration, the odds of fraud are multiplied by 4.32, holding all other variables constant.
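
These multipliers apply to the odds of fraud, not directly to the probability. A short sketch using the rounded coefficients from the summary (the function names `fraud_probability` and `odds`, and the choice of `duration=10`, are made up for this illustration) makes the distinction concrete:

```python
import math

# Rounded coefficients from the summary table.
B0, B_DURATION, B_WEEKDAY = 9.8709, -1.4637, 2.5465

def fraud_probability(duration, weekday):
    """Predicted probability of fraud from the logistic model."""
    log_odds = B0 + B_DURATION * duration + B_WEEKDAY * weekday
    return 1 / (1 + math.exp(-log_odds))

def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

p_weekend = fraud_probability(duration=10, weekday=0)
p_weekday = fraud_probability(duration=10, weekday=1)

# The odds ratio equals exp(coef) regardless of the chosen duration...
print(odds(p_weekday) / odds(p_weekend), math.exp(B_WEEKDAY))
# ...but the ratio of probabilities does not.
print(p_weekday / p_weekend)
```

When the predicted probabilities are small, the two ratios are close, which is why "times as likely" is often used loosely; they diverge as the probabilities grow.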