<a href="https://colab.research.google.com/github/DonnaVakalis/nano/blob/master/Fitting_Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Fitting Logistic Regression

In this first notebook, you will fit a logistic regression model to a dataset in order to predict whether a transaction is fraudulent.

To get started, let's import the libraries and take a quick look at the dataset.

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import math

df = pd.read_csv('./fraud_dataset.csv')



`1.` As you can see, two columns need to be changed to dummy variables.  Replace each of them with its dummy version, using 1 for `weekday` and for `True`, and 0 otherwise.  Use the first quiz to answer a few questions about the dataset.

In [None]:
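# constant column so the model can estimate an intercept
df['intercept'] = 1
# one indicator column per category; get_dummies returns the categories in sorted order
df[['weekday','weekend']] = pd.get_dummies(df['day'])
df[['False','True']] = pd.get_dummies(df['fraud'])
df.head()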

Unnamed: 0,transaction_id,duration,day,fraud,intercept,weekday,weekend,False,True
0,28891,21.3026,weekend,False,1,0,1,1,0
1,61629,22.932765,weekend,False,1,0,1,1,0
2,53707,32.694992,weekday,False,1,1,0,1,0
3,47812,32.784252,weekend,False,1,0,1,1,0
4,43455,17.756828,weekend,False,1,0,1,1,0
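
As an alternative to the `get_dummies` assignment above, the sketch below builds only the two indicator columns the model needs by comparing against the category of interest. It runs on a small made-up frame rather than `fraud_dataset.csv`, and the column name `is_fraud` is a hypothetical choice, not one used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy data mirroring the 'day' and 'fraud' columns of the dataset.
toy = pd.DataFrame({
    "day": ["weekend", "weekday", "weekend", "weekday"],
    "fraud": [False, False, True, False],
})

# 1 when the transaction happened on a weekday, 0 otherwise.
toy["weekday"] = (toy["day"] == "weekday").astype(int)

# 1 when the transaction is fraudulent, 0 otherwise.
toy["is_fraud"] = toy["fraud"].astype(int)

print(toy)
```

Writing the comparison explicitly avoids depending on the order in which `get_dummies` returns its columns.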


In [None]:
# proportion of fraudulent transactions
df['True'].mean()

0.012168770612987604

In [None]:
# average duration of fraudulent transactions
df[df['True']==1]['duration'].mean()

4.6242473706156568

In [None]:
# proportion of transactions that occur on a weekday
df['weekday'].mean()

0.34527465029000343

In [None]:
# average duration of non-fraudulent transactions
df[df['True']==0]['duration'].mean()

30.013583132522555
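
The four summaries above can also be computed in one pass. The following is a minimal sketch on made-up data (the values are illustrative, not taken from the dataset) that uses `groupby` to get the average duration for both classes at once.

```python
import pandas as pd

# Hypothetical toy data; the real notebook reads these columns from the CSV.
toy = pd.DataFrame({
    "duration": [30.0, 28.0, 5.0, 32.0],
    "fraud": [False, False, True, False],
    "day": ["weekend", "weekday", "weekday", "weekend"],
})

# The mean of a boolean column is the proportion of True values.
fraud_rate = toy["fraud"].mean()
weekday_share = (toy["day"] == "weekday").mean()

# Average duration for each class, keyed by the fraud flag.
mean_duration = toy.groupby("fraud")["duration"].mean().to_dict()

print(fraud_rate, weekday_share, mean_duration)
```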

`2.` Now that you have dummy variables, fit a logistic regression model to predict whether a transaction is fraudulent using both day and duration.  Don't forget an intercept!  Use the second quiz below to ensure you fit the model correctly.
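
For reference, the model being fit can be written as follows. The symbols $p$, $\beta_0$, $\beta_1$ and $\beta_2$ are notation assumed here for illustration; they do not appear elsewhere in the notebook.

$$
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \cdot \text{weekday} + \beta_2 \cdot \text{duration},
\qquad p = P(\text{fraud} = 1)
$$

Here $\beta_0$ corresponds to the `intercept` column, and each coefficient is a change in the log-odds of fraud for a one-unit increase in its predictor.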

In [None]:
model = sm.Logit(df['True'], df[['intercept','weekday','duration']])

In [None]:
results = model.fit()

Optimization terminated successfully.
         Current function value: inf
         Iterations 16
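
A function value of `inf` usually points to a numerical problem rather than a meaningful fit statistic. One plausible cause, stated here as an assumption and not confirmed by the output, is floating-point overflow when the log-likelihood is evaluated naively. The sketch below uses made-up linear-predictor values `z` to contrast a naive evaluation with a stable one based on `np.logaddexp`; both function names are hypothetical.

```python
import numpy as np

def naive_loglike(z, y):
    # log(sigmoid(q * z)) with q = +1 for y = 1 and q = -1 for y = 0.
    q = 2 * y - 1
    with np.errstate(over="ignore", divide="ignore"):
        return np.sum(np.log(1 / (1 + np.exp(-q * z))))

def stable_loglike(z, y):
    # Identity: log(sigmoid(q * z)) = -log(1 + exp(-q * z)) = -logaddexp(0, -q * z).
    q = 2 * y - 1
    return -np.sum(np.logaddexp(0, -q * z))

# Hypothetical linear predictors; the first is extreme enough to overflow exp().
z = np.array([-800.0, 2.0, -3.0])
y = np.array([1, 1, 0])

print(naive_loglike(z, y))   # overflows to -inf
print(stable_loglike(z, y))  # stays finite
```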




In [None]:
results.summary2()



Model:,Logit,No. Iterations:,16.0
Dependent Variable:,True,Pseudo R-squared:,
Date:,2020-09-07 19:05,AIC:,inf
No. Observations:,8793,BIC:,inf
Df Model:,2,Log-Likelihood:,-inf
Df Residuals:,8790,LL-Null:,-inf
Converged:,1.0000,Scale:,1.0

,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
intercept,9.8709,1.9438,5.0783,0.0000,6.0613,13.6806
weekday,2.5465,0.9043,2.8160,0.0049,0.7741,4.3188
duration,-1.4637,0.2905,-5.0389,0.0000,-2.0331,-0.8944


In [None]:
# an equivalent way of writing it: name the predictors and the response first
Xtrain = df[['intercept', 'weekday', 'duration']]
ytrain = df[['True']]

# build the model and fit it in one step
log_reg = sm.Logit(ytrain, Xtrain).fit()
log_reg.summary2()

Optimization terminated successfully.
         Current function value: inf
         Iterations 16




Model:,Logit,No. Iterations:,16.0
Dependent Variable:,True,Pseudo R-squared:,
Date:,2020-09-07 19:05,AIC:,inf
No. Observations:,8793,BIC:,inf
Df Model:,2,Log-Likelihood:,-inf
Df Residuals:,8790,LL-Null:,-inf
Converged:,1.0000,Scale:,1.0

,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
intercept,9.8709,1.9438,5.0783,0.0000,6.0613,13.6806
weekday,2.5465,0.9043,2.8160,0.0049,0.7741,4.3188
duration,-1.4637,0.2905,-5.0389,0.0000,-2.0331,-0.8944
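
To see what these coefficients imply for a single transaction, the sketch below plugs them into the logistic function. The coefficient values are copied from the summary table; the helper name `fraud_probability` and the two example transactions are hypothetical.

```python
import math

# Coefficients copied from the summary table.
coef = {"intercept": 9.8709, "weekday": 2.5465, "duration": -1.4637}

def fraud_probability(weekday, duration):
    """Predicted probability of fraud for one transaction."""
    z = coef["intercept"] + coef["weekday"] * weekday + coef["duration"] * duration
    # Branch on the sign of z so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))

# A short weekday transaction versus a long weekend transaction.
p_short_weekday = fraud_probability(weekday=1, duration=5.0)
p_long_weekend = fraud_probability(weekday=0, duration=30.0)
print(p_short_weekday, p_long_weekend)
```

Under these coefficients, the short weekday transaction gets a probability close to 1 and the long weekend transaction a probability close to 0.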


In [None]:
# odds ratio for weekday: how many times higher the odds of fraud are
# on a weekday than on a weekend, holding duration constant
math.exp(2.5465)

12.762357271496972

In [None]:
# odds ratio for each one-minute decrease in duration, holding day constant
1/math.exp(-1.4637)

4.321921089278333
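
The same exponentiation applies to the confidence bounds, which gives an interval for each odds ratio. This is a minimal sketch using the estimates and 95% bounds from the summary table; the dictionary names are hypothetical.

```python
import math

# (coefficient, lower 95% bound, upper 95% bound) copied from the summary table.
summary = {
    "weekday": (2.5465, 0.7741, 4.3188),
    "duration": (-1.4637, -2.0331, -0.8944),
}

# exp() is monotonic, so exponentiating the bounds gives the odds-ratio interval.
odds_ratios = {
    name: tuple(math.exp(value) for value in values)
    for name, values in summary.items()
}

for name, (estimate, lower, upper) in odds_ratios.items():
    print(f"{name}: {estimate:.4f} (95% CI {lower:.4f} to {upper:.4f})")

# Odds ratio for a one-unit decrease in duration is the reciprocal.
per_unit_decrease = 1 / odds_ratios["duration"][0]
```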