### Fitting Logistic Regression

In this first notebook, you will fit a logistic regression model to predict whether a transaction is fraudulent.

To get started, let's import the libraries and take a quick look at the dataset.

In [63]:
import numpy as np
import pandas as pd
import statsmodels.api as sm


df = pd.read_csv('02-Dataset/fraud_dataset.csv')
df.head()

Unnamed: 0,transaction_id,duration,day,fraud
0,28891,21.3026,weekend,False
1,61629,22.932765,weekend,False
2,53707,32.694992,weekday,False
3,47812,32.784252,weekend,False
4,43455,17.756828,weekend,False


`1.` As you can see, two columns need to be converted to dummy variables.  Replace each of them with its dummy version, using 1 for `weekday` and `True`, and 0 otherwise.  Use the first quiz to answer a few questions about the dataset.

In [64]:
# Copying the original df.
df_new = df.copy()

In [72]:
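# Convert the `day` categorical variable to dummies.
# get_dummies returns the columns in sorted order (weekday, weekend);
# dtype=int keeps them as 0/1 rather than booleans.
df_new[['weekday','weekend']] = pd.get_dummies(df['day'], dtype=int)

# Convert the `fraud` boolean variable to dummies (sorted order: False, True).
df_new[['no_fraud','fraud_bin']] = pd.get_dummies(df['fraud'], dtype=int)

# Drop the redundant dummies; each baseline category is implied by a 0.
df_new = df_new.drop(['weekend','no_fraud'], axis = 1)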

# Printing the first 5 rows.
df_new.head()

Unnamed: 0,transaction_id,duration,day,fraud,weekday,fraud2,intercept,fraud_bin
0,28891,21.3026,weekend,False,0,0,1,0
1,61629,22.932765,weekend,False,0,0,1,0
2,53707,32.694992,weekday,False,1,0,1,0
3,47812,32.784252,weekend,False,0,0,1,0
4,43455,17.756828,weekend,False,0,0,1,0
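The same 0/1 encoding can also be written with direct comparisons, which avoids depending on the column order that `pd.get_dummies` returns. A minimal sketch, assuming a small hypothetical frame (`toy`, with made-up rows) in place of the real dataset:

```python
import pandas as pd

# Hypothetical stand-in for the fraud dataset: same column names, made-up rows.
toy = pd.DataFrame({
    "duration": [21.3, 32.7, 4.6],
    "day": ["weekend", "weekday", "weekday"],
    "fraud": [False, False, True],
})

# 1 for weekday, 0 otherwise.
toy["weekday"] = (toy["day"] == "weekday").astype(int)

# 1 for True (fraud), 0 otherwise.
toy["fraud_bin"] = toy["fraud"].astype(int)

print(toy[["day", "weekday", "fraud", "fraud_bin"]])
```

Because each new column is built from an explicit condition, it is easy to see which category maps to 1.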


In [73]:
# Proportion of Fraudulent Transactions.
print("Proportion of fraudulent transactions:", sum(df_new.fraud)/len(df_new.fraud))

Proportion of fraudulent transactions: 0.012168770612987604


In [74]:
# Average Duration for fraudulent transactions.
print("Avg Duration for Fraudulent Transactions: ", df_new[df_new.fraud].duration.mean())

Avg Duration for Fraudulent Transactions:  4.624247370615657


In [75]:
# Proportion of transactions that occur on weekdays.
print("Proportion Weekday transactions: ", sum(df_new.weekday)/len(df_new.weekday))

Proportion Weekday transactions:  0.3452746502900034


In [76]:
# Average Duration for non fraudulent transactions.
print("Avg Duration for Non-fraudulent Transactions: ", df_new[np.logical_not(df_new.fraud)].duration.mean())

Avg Duration for Non-fraudulent Transactions:  30.013583132522555


In [77]:
# Proportion of fraudulent transactions that occur on weekdays.
print("Proportion of Fraud in Weekday: ", sum(df_new[df_new.fraud].weekday)/len(df_new[df_new.fraud].weekday))

Proportion of Fraud in Weekday:  0.7383177570093458


`2.` Now that you have dummy variables, fit a logistic regression model to predict whether a transaction is fraudulent using both day and duration.  Don't forget an intercept!  Use the second quiz below to make sure you fit the model correctly.

In [85]:
# Adding intercept
df_new['intercept'] = 1

# Creating the object.
lm = sm.Logit(df_new['fraud_bin'], df_new[['intercept','duration', 'weekday']])

# Calculating the coefficients.
results = lm.fit()

# Printing the summary.
results.summary()

Optimization terminated successfully.
         Current function value: inf
         Iterations 16


  return 1/(1+np.exp(-X))
  return np.sum(np.log(self.cdf(q*np.dot(X,params))))


Dep. Variable:,fraud_bin,No. Observations:,8793.0
Model:,Logit,Df Residuals:,8790.0
Method:,MLE,Df Model:,2.0
Date:,"Fri, 11 Jan 2019",Pseudo R-squ.:,inf
Time:,15:09:00,Log-Likelihood:,-inf
converged:,True,LL-Null:,0.0
,,LLR p-value:,1.0

,coef,std err,z,P>|z|,[0.025,0.975]
intercept,9.8709,1.944,5.078,0.000,6.061,13.681
duration,-1.4637,0.290,-5.039,0.000,-2.033,-0.894
weekday,2.5465,0.904,2.816,0.005,0.774,4.319


### Interpretation

|   |coef|std err|z  |P>z|0.025 | 0.975|
|:-:|:-: |:-:    |:-:|:-:|:-:   |:-:   |
|intercept|9.8709|1.944|5.078|0.000|6.061|13.681|
|duration|-1.4637|0.290|-5.039|0.000|-2.033|-0.894|
|weekday|2.5465|0.904|2.816|0.005|0.774|4.319|

In logistic regression, the coefficients are on the log-odds scale, so each `coef` must be exponentiated before it is interpreted; the result is an odds ratio.
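The reason for exponentiating follows from the form of the model. Writing $p$ for the probability of fraud and $\beta_0$, $\beta_1$, $\beta_2$ for the intercept, `duration`, and `weekday` coefficients (notation introduced here for illustration, not taken from the notebook), the model is linear in the log-odds, and the ratio of odds between a weekday and a weekend at the same duration reduces to a single exponentiated coefficient:

```latex
\[
\log\frac{p}{1-p} = \beta_0 + \beta_1\,\text{duration} + \beta_2\,\text{weekday}
\]
\[
\frac{\text{odds}(\text{weekday}=1)}{\text{odds}(\text{weekday}=0)}
  = \frac{e^{\beta_0 + \beta_1\,\text{duration} + \beta_2}}
         {e^{\beta_0 + \beta_1\,\text{duration}}}
  = e^{\beta_2}
\]
```

The same cancellation applies to a one-unit change in `duration`, which multiplies the odds by $e^{\beta_1}$.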

On weekdays, the odds of fraud are 12.76 times the odds on weekends, holding duration constant.

In [101]:
np.exp(2.5465)

12.762357271496972
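The confidence bounds reported for `weekday` can be moved to the odds scale in the same way, which shows how uncertain the odds ratio is. A minimal sketch, with the values typed in by hand from the summary table rather than read from `results`:

```python
import numpy as np

# Values copied by hand from the summary table for the weekday coefficient.
coef, ci_low, ci_high = 2.5465, 0.774, 4.319

# Exponentiate the estimate and both interval bounds to reach the odds scale.
odds_ratio = np.exp(coef)
or_low, or_high = np.exp(ci_low), np.exp(ci_high)

print(f"odds ratio: {odds_ratio:.2f}, 95% interval: ({or_low:.2f}, {or_high:.2f})")
```

Because the exponential function is increasing, the transformed bounds still bracket the transformed estimate, but the interval is no longer symmetric around it.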

For each one-minute decrease in transaction duration, the odds of fraud are multiplied by 4.32 (the reciprocal of the exponentiated coefficient), holding the day of the week constant.

In [102]:
np.exp(-1.4637)

0.2313785882117941

In [87]:
1/np.exp(-1.4637)

4.321921089278333
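To connect these odds ratios to predicted probabilities, the fitted coefficients can be plugged into the logistic function directly. The sketch below is an illustration only: the coefficients are typed in by hand from the summary table, and `fraud_probability` and the example transactions are hypothetical names and values that do not appear in the notebook.

```python
import numpy as np

# Coefficients copied by hand from the summary table.
b_intercept, b_duration, b_weekday = 9.8709, -1.4637, 2.5465

def fraud_probability(duration, weekday):
    """Predicted probability of fraud from the linear log-odds."""
    log_odds = b_intercept + b_duration * duration + b_weekday * weekday
    return 1.0 / (1.0 + np.exp(-log_odds))

def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

# Hypothetical transactions: a short weekday one and a long weekend one.
p_short = fraud_probability(duration=5.0, weekday=1)
p_long = fraud_probability(duration=30.0, weekday=0)
print(p_short, p_long)

# Odds at two durations one unit apart on the same day.
ratio = odds(fraud_probability(5.0, 1)) / odds(fraud_probability(6.0, 1))
print(ratio)
```

The ratio of odds between the two durations matches `1/np.exp(-1.4637)`, while the change in probability depends on where on the curve the transaction sits.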