### Fitting Logistic Regression

In this first notebook, you will fit a logistic regression model to a dataset in order to predict whether a transaction is fraudulent.

To get started, let's import the libraries and take a quick look at the dataset.

In [16]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import math as m

df = pd.read_csv('./fraud_dataset.csv')
df.head()

|   | transaction_id | duration | day | fraud |
|---|---|---|---|---|
| 0 | 28891 | 21.3026 | weekend | False |
| 1 | 61629 | 22.932765 | weekend | False |
| 2 | 53707 | 32.694992 | weekday | False |
| 3 | 47812 | 32.784252 | weekend | False |
| 4 | 43455 | 17.756828 | weekend | False |


`1.` As you can see, two columns need to be changed to dummy variables.  Replace each of these columns with its dummy version, using 1 for `weekday` and `True`, and 0 otherwise.  Use the first quiz to answer a few questions about the dataset.

In [7]:
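# One-hot encode both columns; dtype=int keeps the dummies as 0/1 rather than booleans
day = pd.get_dummies(df['day'], dtype=int)
fraud = pd.get_dummies(df['fraud'], dtype=int)
df['fraud'] = fraud[True]        # 1 if the transaction is fraud, 0 otherwise
df['weekday'] = day['weekday']   # 1 for weekday, 0 for weekend
df['intercept'] = 1              # constant column for the model intercept
df.head()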


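|   | transaction_id | duration | day | fraud | weekday | intercept |
|---|---|---|---|---|---|---|
| 0 | 28891 | 21.3026 | weekend | 0 | 0 | 1 |
| 1 | 61629 | 22.932765 | weekend | 0 | 0 | 1 |
| 2 | 53707 | 32.694992 | weekday | 0 | 1 | 1 |
| 3 | 47812 | 32.784252 | weekend | 0 | 0 | 1 |
| 4 | 43455 | 17.756828 | weekend | 0 | 0 | 1 |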
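
The same 0/1 columns can also be built without the intermediate `get_dummies` frames, by comparing directly and casting to integers. The following is a minimal sketch rather than the notebook's own method; `toy` is a small hypothetical frame that only mirrors the `day` and `fraud` columns shown above.

```python
import pandas as pd

# Hypothetical toy data (not from fraud_dataset.csv) with the same column types.
toy = pd.DataFrame({
    'day': ['weekend', 'weekday', 'weekend'],
    'fraud': [False, True, False],
})

# A boolean comparison cast to int is 1 where the condition holds, 0 otherwise.
toy['weekday'] = (toy['day'] == 'weekday').astype(int)
toy['fraud'] = toy['fraud'].astype(int)

# Constant column so the model can estimate an intercept.
toy['intercept'] = 1

print(toy)
```

For a two-level column this gives the same result as picking one column out of `pd.get_dummies`, and it makes the reference level (here `weekend`) explicit.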


In [8]:
df.fraud.mean()

0.012168770612987604

In [9]:
df.query('fraud == 1')['duration'].mean()

4.624247370615658

In [10]:
df.weekday.mean()

0.3452746502900034

In [11]:
df.query('fraud == 0')['duration'].mean()

30.013583132522584

`2.` Now that you have dummy variables, fit a logistic regression model to predict whether a transaction is fraud using both day and duration.  Don't forget an intercept!  Use the second quiz below to check that you fit the model correctly.

In [13]:
mod = sm.Logit(df['fraud'], df[['weekday', 'duration', 'intercept']])
res = mod.fit()
print(res.summary())

Optimization terminated successfully.
         Current function value: inf
         Iterations 16
                           Logit Regression Results                           
Dep. Variable:                  fraud   No. Observations:                 8793
Model:                          Logit   Df Residuals:                     8790
Method:                           MLE   Df Model:                            2
Date:                Sat, 18 Apr 2020   Pseudo R-squ.:                     inf
Time:                        21:29:59   Log-Likelihood:                   -inf
converged:                       True   LL-Null:                        0.0000
Covariance Type:            nonrobust   LLR p-value:                     1.000
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
weekday        2.5465      0.904      2.816      0.005       0.774       4.319
duration      -1.4637      0.290 
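
For reference, a logistic regression with these three columns has the standard form below, where $p$ is the probability that a transaction is fraud. The symbols $\beta_0$, $\beta_1$, and $\beta_2$ are notation introduced here for the `intercept`, `weekday`, and `duration` coefficients; they do not appear in the notebook itself.

```latex
\log\left(\frac{p}{1-p}\right)
  = \beta_0 + \beta_1\,\mathrm{weekday} + \beta_2\,\mathrm{duration}
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\mathrm{weekday} + \beta_2\,\mathrm{duration})}}
```

Each coefficient is a change in the log-odds of fraud for a one-unit change in its variable, holding the other variable fixed. Exponentiating a coefficient therefore gives a multiplicative change in the odds.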

In [18]:
# Odds multiplier for weekday relative to weekend (exponentiated weekday coefficient)
m.exp(2.5465)

12.762357271496972

In [19]:
# Sign flipped on the duration coefficient: odds multiplier for a one-unit decrease in duration
m.exp(1.4637)

4.321921089278333
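
These two values are easier to read as odds ratios. Taking the printed coefficients at face value, the sketch below shows how they are obtained and why the second cell uses `1.4637` rather than `-1.4637`. The variable names are illustrative and not part of the notebook.

```python
import math

# Coefficients copied from the regression summary above.
coef_weekday = 2.5465
coef_duration = -1.4637

# exp(coefficient) is the multiplicative change in the odds of fraud
# for a one-unit increase in the variable, holding the other fixed.
odds_ratio_weekday = math.exp(coef_weekday)
odds_ratio_duration = math.exp(coef_duration)

# A ratio below 1 is often easier to state as its reciprocal:
# the change in odds for a one-unit *decrease* in duration.
odds_ratio_duration_decrease = 1 / odds_ratio_duration

print(f"weekday vs. weekend:       {odds_ratio_weekday:.2f}x the odds of fraud")
print(f"one-unit longer duration:  {odds_ratio_duration:.2f}x the odds of fraud")
print(f"one-unit shorter duration: {odds_ratio_duration_decrease:.2f}x the odds of fraud")
```

Read this way, a weekday transaction has roughly 12.76 times the odds of fraud of a weekend transaction at the same duration, and each one-unit decrease in duration multiplies the odds of fraud by roughly 4.32 for the same day type.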