# Logistic Regression

- We saw how multiple linear regression (MLR) uses several explanatory variables, both quantitative and categorical, at the same time to predict a quantitative response variable

- Here we will predict a categorical response instead

- Logistic regression predicts one of two possible outcomes

## Fitting Logistic Regression

- We had no constraints on the predicted response in MLR

- Here we bound the prediction to a probability between 0 and 1

- We first encode the two labels as 0 and 1

- The model predicts the log odds instead of predicting the response itself

$$ \log{(\frac{p}{1-p})} = b_0 + b_1x_1 + b_2x_2 + ... $$

- we can solve this for the probability directly

    - $$ p = \frac{e^{b_0 + b_1x_1 + b_2x_2 + ... }}    {1 + e^{b_0 + b_1x_1 + b_2x_2 + ...}}$$
    
- this function of $b_0 + b_1x_1 + b_2x_2 + ...$ is the sigmoid, which maps any value to a number between 0 and 1

- $p$ here is the probability of the event occurring, while $1-p$ is the probability of it not occurring

    - so $\frac{p}{1-p}$ is the ratio of the probability of an event occurring to the probability of it not occurring
    
    - this ratio is the odds
    
    - we model the log of the odds so that the linear combination can take any value while the predicted probability of a success stays between 0 and 1
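
The link between log odds and probability can be checked numerically. The sketch below is only an illustration: the function names `sigmoid` and `logit` and the example values in `z` are assumptions, not part of the notebook. It uses the equivalent form $\frac{1}{1 + e^{-z}}$ of the sigmoid.

```python
import numpy as np

def sigmoid(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Map a probability p in (0, 1) to its log odds."""
    return np.log(p / (1.0 - p))

# Arbitrary example log-odds values (assumed for illustration).
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
p = sigmoid(z)

print(p)         # probabilities, all strictly between 0 and 1
print(logit(p))  # recovers z up to floating-point error
```

Because the two functions are inverses, any value of the linear combination corresponds to exactly one valid probability.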

In [189]:
import math  # math.exp is used later to exponentiate coefficients
import numpy as np
import pandas as pd
import statsmodels.api as sm

In [190]:
df = pd.read_csv("./data-2/fraud_dataset.csv")

In [191]:
df.head()

Unnamed: 0,transaction_id,duration,day,fraud
0,28891,21.3026,weekend,False
1,61629,22.932765,weekend,False
2,53707,32.694992,weekday,False
3,47812,32.784252,weekend,False
4,43455,17.756828,weekend,False


In [192]:
df[["no_fraud", "fraud"]] = pd.get_dummies(df["fraud"])

In [193]:
df = df.drop("no_fraud", axis=1)

In [194]:
df.head()

Unnamed: 0,transaction_id,duration,day,fraud
0,28891,21.3026,weekend,0
1,61629,22.932765,weekend,0
2,53707,32.694992,weekday,0
3,47812,32.784252,weekend,0
4,43455,17.756828,weekend,0


In [195]:
df['intercept'] = 1
logit_model = sm.Logit(df['fraud'], df[['intercept', 'duration']])
results = logit_model.fit()
results.summary()

Optimization terminated successfully.
         Current function value: inf
         Iterations 16


  return 1/(1+np.exp(-X))
  return np.sum(np.log(self.cdf(q*np.dot(X,params))))


0,1,2,3
Dep. Variable:,fraud,No. Observations:,8793.0
Model:,Logit,Df Residuals:,8791.0
Method:,MLE,Df Model:,1.0
Date:,"Tue, 14 Apr 2020",Pseudo R-squ.:,inf
Time:,12:34:05,Log-Likelihood:,-inf
converged:,True,LL-Null:,0.0
Covariance Type:,nonrobust,LLR p-value:,1.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
intercept,10.3827,1.756,5.912,0.000,6.940,13.825
duration,-1.3404,0.237,-5.649,0.000,-1.805,-0.875


In [196]:
df["weekday"] = df.day == "weekday"

In [197]:
df['weekday'] = df.weekday.apply(lambda x: 1 if x else 0)

In [198]:
porp_fraud = df[df['fraud'] == 1].count()[0] / df.shape[0]

In [199]:
porp_not_fraud = df[df['fraud'] != 1].count()[0] / df.shape[0]

In [200]:
avg_dur_fraud = df[df['fraud'] == 1].duration.mean()

In [201]:
avg_dur_not_fraud = df[df['fraud'] != 1].duration.mean()

In [202]:
weekday_porp = df[df["weekday"] == 1].count()[0] / df.shape[0]

In [203]:
porp_fraud

0.012168770612987604

In [204]:
porp_not_fraud

0.9878312293870124

In [205]:
# Compare with a tolerance; exact float equality can fail from rounding
assert np.isclose(porp_fraud + porp_not_fraud, 1)

In [206]:
avg_dur_fraud, avg_dur_not_fraud

(4.624247370615657, 30.013583132522555)

In [207]:
weekday_porp

0.3452746502900034

In [208]:
df.head()

Unnamed: 0,transaction_id,duration,day,fraud,intercept,weekday
0,28891,21.3026,weekend,0,1,0
1,61629,22.932765,weekend,0,1,0
2,53707,32.694992,weekday,0,1,1
3,47812,32.784252,weekend,0,1,0
4,43455,17.756828,weekend,0,1,0


In [209]:
df['intercept'] = 1
df['intercept_weekday'] = 1
logit_model = sm.Logit(df['fraud'],
                       df[['intercept',
                           'duration',
                           'weekday']])
results = logit_model.fit()
results.summary()

Optimization terminated successfully.
         Current function value: inf
         Iterations 16


  return 1/(1+np.exp(-X))
  return np.sum(np.log(self.cdf(q*np.dot(X,params))))


0,1,2,3
Dep. Variable:,fraud,No. Observations:,8793.0
Model:,Logit,Df Residuals:,8790.0
Method:,MLE,Df Model:,2.0
Date:,"Tue, 14 Apr 2020",Pseudo R-squ.:,inf
Time:,12:34:05,Log-Likelihood:,-inf
converged:,True,LL-Null:,0.0
Covariance Type:,nonrobust,LLR p-value:,1.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
intercept,9.8709,1.944,5.078,0.000,6.061,13.681
duration,-1.4637,0.290,-5.039,0.000,-2.033,-0.894
weekday,2.5465,0.904,2.816,0.005,0.774,4.319


- We don't interpret the intercept here

- First we exponentiate each coefficient to get the multiplicative change in the odds

In [210]:
duration_coef = -1.4637
mult_change_duration = math.exp(duration_coef)
mult_change_duration

0.2313785882117941

In [211]:
weekday_coef = 2.5465
mult_change_weekday = math.exp(weekday_coef)
mult_change_weekday

12.762357271496972

- On weekdays, the odds of fraud are 12.76 times the odds on weekends, holding all else constant

- For each 1 unit increase in duration, the odds of fraud are multiplied by 0.23, holding all else constant

- if the multiplicative change is less than 1, we can take the reciprocal to express the decrease

In [212]:
1 / mult_change_duration 

4.321921089278333

- For each 1 unit decrease in duration, the odds of fraud are 4.32 times as high, holding all else constant

## Interpreting Results

- The p-value helps us understand whether a particular variable is statistically significant in predicting whether a transaction is fraud or not

- We need to exponentiate each coefficient

    - For quantitative variables:
    
        - for every 1 unit increase in $x_1$ we expect the odds of a 1 to change by a multiplicative factor of $e^{b_1}$, holding all other variables constant
        
    - For categorical variables:
    
        - when in category $x_1$ we expect the odds of a 1 to change by a multiplicative factor of $e^{b_1}$ compared to the baseline, holding all other variables constant
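
The two interpretation rules can be sketched in a few lines. The coefficients are copied from the fraud model summary above; the helper name `odds_multiplier` and the `coefs` dictionary are assumptions made for this illustration.

```python
import math

# Coefficients copied from the fraud model summary above.
coefs = {"duration": -1.4637, "weekday": 2.5465}

def odds_multiplier(coef):
    """Multiplicative change in the odds for a one-unit increase."""
    return math.exp(coef)

for name, b in coefs.items():
    mult = odds_multiplier(b)
    if mult >= 1:
        print(f"{name}: odds multiplied by {mult:.2f}")
    else:
        # A multiplier below 1 is a decrease; its reciprocal is the factor
        # by which the odds grow for a one-unit *decrease*.
        print(f"{name}: odds multiplied by {mult:.2f} "
              f"(reciprocal {1 / mult:.2f})")
```

For the categorical `weekday` variable the "one-unit increase" is the move from the baseline (weekend) to weekday.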
        
            

In [213]:
df = pd.read_csv('./data-3/admissions.csv')

In [214]:
df.head()

Unnamed: 0,admit,gre,gpa,prestige
0,0,380,3.61,3
1,1,660,3.67,3
2,1,800,4.0,1
3,1,640,3.19,4
4,0,520,2.93,4


In [215]:
df.prestige = df.prestige.astype(str)

In [216]:
df.prestige.value_counts()

2    148
3    121
4     67
1     61
Name: prestige, dtype: int64

In [217]:
# One 0/1 indicator column per prestige level: prestige_1 ... prestige_4
df = pd.get_dummies(df, columns=["prestige"], prefix="prestige", dtype=int)

In [218]:
df.head()

Unnamed: 0,admit,gre,gpa,prestige_1,prestige_2,prestige_3,prestige_4
0,0,380,3.61,0,0,1,0
1,1,660,3.67,0,0,1,0
2,1,800,4.0,1,0,0,0
3,1,640,3.19,0,0,0,1
4,0,520,2.93,0,0,0,1


In [219]:
df.value_counts()

AttributeError: 'DataFrame' object has no attribute 'value_counts'

In [221]:
df.head()

Unnamed: 0,admit,gre,gpa,prestige_1,prestige_2,prestige_3,prestige_4
0,0,380,3.61,0,0,1,0
1,1,660,3.67,0,0,1,0
2,1,800,4.0,1,0,0,0
3,1,640,3.19,0,0,0,1
4,0,520,2.93,0,0,0,1


In [223]:
df['intercept'] = 1
df['intercept_prestige'] = 1

logit_model = sm.Logit(df['admit'],
                       df[['intercept',
                           'gre',
                           'gpa',
                            'prestige_2',
                          'prestige_3',
                          'prestige_4']])
results = logit_model.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


0,1,2,3
Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,391.0
Method:,MLE,Df Model:,5.0
Date:,"Tue, 14 Apr 2020",Pseudo R-squ.:,0.08166
Time:,12:36:39,Log-Likelihood:,-227.82
converged:,True,LL-Null:,-248.08
Covariance Type:,nonrobust,LLR p-value:,1.176e-07

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
intercept,-3.8769,1.142,-3.393,0.001,-6.116,-1.638
gre,0.0022,0.001,2.028,0.043,7.44e-05,0.004
gpa,0.7793,0.333,2.344,0.019,0.128,1.431
prestige_2,-0.6801,0.317,-2.146,0.032,-1.301,-0.059
prestige_3,-1.3387,0.345,-3.882,0.000,-2.015,-0.663
prestige_4,-1.5534,0.417,-3.721,0.000,-2.372,-0.735


In [235]:
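# Prestige in our model is associated with a decrease in the
# odds of admission, and the decrease is larger for less
# prestigious schools. These are odds with gre and gpa at 0,
# so only the ratios between levels are meaningful.
bl = -3.8769  # intercept; prestige_1 is the baseline level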
prestige = np.array([math.exp(bl),
            math.exp(bl-0.6801),
            math.exp(bl-1.3387),
            math.exp(bl-1.5534)])
1 / prestige

array([ 48.27433244,  95.29715909, 184.1222612 , 228.21770047])

In [242]:
prestige[0] / prestige[3]

4.727516444398727

In [244]:
prestige[0] / prestige[2]

3.8140819745031704

In [243]:
prestige[0] / prestige[1]

1.974075129873389

In [246]:
gre = math.exp(0.0022)
gre

1.0022024217756431

In [247]:
gpa = 0.7793
math.exp(gpa)

2.1799457692483717
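
The coefficients can also be turned back into predicted probabilities with the sigmoid. This is a rough sketch, not the notebook's method: the function `predict_admit_probability` and the applicant values (GRE 600, GPA 3.5) are hypothetical, and the coefficients are copied by hand from the admissions summary above.

```python
import math

# Coefficients copied from the admissions model summary above.
coef = {
    "intercept": -3.8769,
    "gre": 0.0022,
    "gpa": 0.7793,
    "prestige_2": -0.6801,
    "prestige_3": -1.3387,
    "prestige_4": -1.5534,
}

def predict_admit_probability(gre, gpa, prestige):
    """Predicted probability of admission for one applicant.

    prestige is 1, 2, 3, or 4; level 1 is the baseline and adds nothing.
    """
    log_odds = coef["intercept"] + coef["gre"] * gre + coef["gpa"] * gpa
    if prestige != 1:
        log_odds += coef[f"prestige_{prestige}"]
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical applicant: same scores, different school prestige.
for level in (1, 2, 3, 4):
    print(level, round(predict_admit_probability(600, 3.5, level), 3))
```

The ratio of the odds between two prestige levels stays the same whatever the GRE and GPA values are, which is why the ratios computed above do not depend on the applicant.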

## Model Diagnostic

- How well is our logistic regression model doing at predicting correct labels?

    - most common measure is accuracy
    
        - `accuracy = number of correct labels / number of rows`
        
- Accuracy isn't as useful in certain cases

    - like if we have large class imbalances in dataset
    
        - lots of not fraud labels but only a few fraud labels
    
- We will view other metrics to see how our logistic model is doing
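
A small sketch of why accuracy misleads under class imbalance. The labels here are synthetic (an assumption for illustration), chosen to roughly match the 1.2% fraud rate seen in the fraud data.

```python
import numpy as np

# Synthetic labels (assumed): 1,000 transactions, 12 of them fraud.
y_true = np.zeros(1000, dtype=int)
y_true[:12] = 1

# A "model" that ignores its inputs and always predicts not fraud.
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)  # correct labels / number of rows
frauds_caught = int(np.sum((y_pred == 1) & (y_true == 1)))

print(accuracy)       # 0.988
print(frauds_caught)  # 0
```

The accuracy looks excellent even though not a single fraud case is detected.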

## Confusion Matrix

- Two by Two matrix

- We count what we got correct and incorrect

\begin{matrix}
pred/actual & pos & neg\\
pos & a & b\\
neg & c & d
\end{matrix}

- b is a false alarm (a false positive)

- we can move the decision boundary of our classifier by changing its parameters or its probability cutoff, which trades one kind of error for the other

- in a larger confusion matrix (more than two classes), we want the bulk of the data on the main diagonal because those are the correct identifications
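
The four cells can be counted directly. The sketch below assumes that "shifting the classifier" means moving the probability cutoff. The function name `confusion_counts`, the labels, and the predicted probabilities are all made up for illustration.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (a, b, c, d) in the layout of the matrix above.

    a: predicted pos, actually pos    b: predicted pos, actually neg
    c: predicted neg, actually pos    d: predicted neg, actually neg
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    a = int(np.sum((y_pred == 1) & (y_true == 1)))
    b = int(np.sum((y_pred == 1) & (y_true == 0)))
    c = int(np.sum((y_pred == 0) & (y_true == 1)))
    d = int(np.sum((y_pred == 0) & (y_true == 0)))
    return a, b, c, d

# Made-up labels and predicted probabilities, for illustration only.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
p_hat = np.array([0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05])

for threshold in (0.5, 0.25):
    y_pred = (p_hat >= threshold).astype(int)
    print(threshold, confusion_counts(y_true, y_pred))
# 0.5  -> (2, 1, 1, 4)
# 0.25 -> (3, 2, 0, 3)
```

Lowering the cutoff removes the missed positive (c goes from 1 to 0) at the cost of one more false alarm (b goes from 1 to 2).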

## Recall and Precision

- Recall :=

    - correctly predicted positives / all actual positives (the `pos` column of the matrix above, $a / (a + c)$)
    
    - True Positive / (True Positive + False Negative)

- Precision := 

    - correctly predicted positives / all predicted positives (the `pos` row of the matrix above, $a / (a + b)$)
    
    - True Positive / (True Positive + False Positive)
    
- True Positive: predicted positive and truly positive 

- False Positive: predicted positive but truly negative

    - analogous to a Type I error: rejecting the null in favor of the alternative when the null is true

- False Negative: predicted negative but truly positive

    - analogous to a Type II error: failing to reject the null when the alternative is true
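
Both metrics follow directly from the confusion matrix counts. This is a minimal sketch; the function names and the counts are made up for illustration.

```python
def precision(tp, fp):
    """Of everything predicted positive, the fraction truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything truly positive, the fraction predicted positive."""
    return tp / (tp + fn)

# Made-up counts, for illustration only.
tp, fp, fn = 8, 2, 4

print(precision(tp, fp))  # 0.8
print(recall(tp, fn))     # 8 / 12, about 0.667
```

Here the classifier is right 80% of the time it raises an alarm, but it finds only two thirds of the true positives.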

## Stats to ML

- Statistics: determine relationships and understand the driving mechanisms. Are relationships due to chance?

- ML (supervised): work to predict as well as possible. Often without regard to why it works well.