
## Setup

Python

In [2]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf



R

In [1]:
%load_ext rpy2.ipython

In [None]:
%%R
install.packages("car")

In [4]:
%%R
library(car)

R[write to console]: Loading required package: carData



Julia

In [None]:
%%bash
wget -q https://julialang-s3.julialang.org/bin/linux/x64/1.7/julia-1.7.2-linux-x86_64.tar.gz
tar zxvf julia-1.7.2-linux-x86_64.tar.gz
## python's module
pip install julia

In [6]:
import julia
julia.install(julia = "/content/julia-1.7.2/bin/julia")
from julia import Julia
jl = Julia(runtime="/content/julia-1.7.2/bin/julia",compiled_modules=False)
%load_ext julia.magic


Precompiling PyCall...
Precompiling PyCall... DONE
PyCall is installed and built successfully.

PyCall is setup for non-default Julia runtime (executable) `/content/julia-1.7.2/bin/julia`.
To use this Julia runtime, PyJulia has to be initialized first by
    from julia import Julia
    Julia(runtime='/content/julia-1.7.2/bin/julia')


Initializing Julia interpreter. This may take some time...




In [None]:
%%julia
using Pkg
Pkg.add("StatsBase")
Pkg.add("GLM")
Pkg.add("DataFrames")
Pkg.add("CategoricalArrays")
Pkg.add("CSV")

In [8]:
%%julia
using StatsBase
using GLM
using CategoricalArrays
using Statistics
using CSV
using DataFrames

## Solutions

### First example using aggregated data

Using Python. We need to create an intercept column because we will use matrix notation instead of a formula.

In [9]:
df = pd.DataFrame({"gender": ["females", "males"], "bought":[243, 48], "notbought": [30,240]})
df["intercept"] = 1
df["males"] = np.where(df["gender"] == "males", 1, 0)
df

    gender  bought  notbought  intercept  males
0  females     243         30          1      0
1    males      48        240          1      1


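Unfortunately, statsmodels cannot fit this model and raises a `PerfectSeparationError`. The model is saturated (two groups, two parameters), so the fitted proportions reproduce the observed ones exactly, and statsmodels treats such a perfect fit as perfect separation (see https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression).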

In [10]:
m1 = sm.GLM(np.asarray(df[['bought', 'notbought']]), 
             np.asarray(df[['intercept', "males"]]), 
             family=sm.families.Binomial()).fit()

print(m1.summary())

  scale = np.dot(wresid, wresid) / df_resid


PerfectSeparationError: ignored
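Because the example has only two groups, the fit can also be reproduced without statsmodels. Below is a minimal sketch of Newton-Raphson (which coincides with Fisher scoring for the logit link) using only the standard library. The function name `fit_grouped_logit` and the fixed number of iterations are illustrative assumptions, not part of the original notebook.

```python
import math

def fit_grouped_logit(successes, failures, x, n_iter=25):
    """Fit logit(p) = b0 + b1 * x to grouped binomial counts by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Accumulate the score vector and the 2x2 information matrix.
        g0 = g1 = 0.0
        i00 = i01 = i11 = 0.0
        for s, f, xi in zip(successes, failures, x):
            n = s + f
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = s - n * p            # observed minus expected successes
            w = n * p * (1.0 - p)    # binomial variance weight
            g0 += r
            g1 += r * xi
            i00 += w
            i01 += w * xi
            i11 += w * xi * xi
        det = i00 * i11 - i01 * i01
        # Newton step: beta += I^{-1} g, using the explicit 2x2 inverse.
        b0 += (i11 * g0 - i01 * g1) / det
        b1 += (i00 * g1 - i01 * g0) / det
    return b0, b1

# Counts from the table above; x = 1 for males, 0 for females.
b0, b1 = fit_grouped_logit([243, 48], [30, 240], [0, 1])
print(round(b0, 4), round(b1, 4))  # 2.0919 -3.7013
```

For this saturated model the estimates equal `log(243/30)` and `log((48/240)/(243/30))`, so the iteration is only needed as an illustration of how the fitting algorithm works.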

Using R -- works well

In [11]:
%%R
df1 <- data.frame(gender = c("females", "males"), bought = c(243, 48),  notbought=c(30,240))
     
m1 <- glm(cbind(bought, notbought) ~ gender,  data = df1, family = binomial())
     
summary(m1)


Call:
glm(formula = cbind(bought, notbought) ~ gender, family = binomial(), 
    data = df1)

Deviance Residuals: 
[1]  0  0

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.0919     0.1935   10.81   <2e-16 ***
gendermales  -3.7013     0.2499  -14.81   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3.2833e+02  on 1  degrees of freedom
Residual deviance: 4.7962e-14  on 0  degrees of freedom
AIC: 14.659

Number of Fisher Scoring iterations: 3



Odds ratios

In [12]:
%%R
exp(coef(m1))

(Intercept) gendermales 
 8.10000000  0.02469136 
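The two exponentiated coefficients can be checked directly against the counts. The sketch below recomputes them by hand; the dictionary names are only illustrative.

```python
# Counts taken from the aggregated table in this section.
bought = {"females": 243, "males": 48}
notbought = {"females": 30, "males": 240}

# Odds of buying within each group: successes divided by failures.
odds = {g: bought[g] / notbought[g] for g in bought}

# Odds ratio of males relative to females (females are the reference level).
odds_ratio = odds["males"] / odds["females"]

print(odds["females"])           # baseline odds, corresponds to exp(Intercept)
print(round(odds_ratio, 8))      # corresponds to exp(gendermales)
print(round(1 / odds_ratio, 1))  # same effect expressed as females vs. males
```

Read this way, the odds of buying are 8.1 for females, and the odds for males are about 40 times smaller.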


Confidence intervals for odds ratios.

In [13]:
%%R
exp(confint(m1))

R[write to console]: Waiting for profiling to be done...



                 2.5 %      97.5 %
(Intercept) 5.64054025 12.07688821
gendermales 0.01488326  0.03970901


Using Julia -- we need to create new variables: `total` and `share`.

In [14]:
%%julia
df = DataFrame(:gender => ["females", "males"],
               :bought => [243, 48],
               :notbought => [30,240])
df.total = df.bought + df.notbought
df.share = df.bought ./ df.total
df

<PyCall.jlwrap 2×5 DataFrame
 Row │ gender   bought  notbought  total  share
     │ String   Int64   Int64      Int64  Float64
─────┼─────────────────────────────────────────────
   1 │ females     243         30    273  0.89011
   2 │ males        48        240    288  0.166667>

We can estimate this model by providing `share` in the formula and `total` in the `wts` argument (case weights).

In [15]:
%%julia
m1 = glm(@formula(share ~ gender), df, Binomial(), LogitLink(), wts = df.total)
m1

<PyCall.jlwrap StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Binomial{Float64}, LogitLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}

share ~ 1 + gender

Coefficients:
───────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error       z  Pr(>|z|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)     2.09186    0.19351    10.81    <1e-26    1.71259    2.47114
gender: males  -3.7013     0.249892  -14.81    <1e-48   -4.19108   -3.21152
───────────────────────────────────────────────────────────────────────────>
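The `share` plus `wts` specification carries the same information as the two count columns, because `bought = share * total`. The sketch below (function names are hypothetical) checks that the binomial log-likelihood kernel is the same in both forms, which is why the Julia estimates agree with the R ones.

```python
import math

bought = [243, 48]
notbought = [30, 240]
total = [b + n for b, n in zip(bought, notbought)]
share = [b / t for b, t in zip(bought, total)]

def loglik_counts(p):
    """Binomial log-likelihood kernel written with success/failure counts."""
    return sum(b * math.log(pi) + n * math.log(1 - pi)
               for b, n, pi in zip(bought, notbought, p))

def loglik_weighted(p):
    """Same kernel written with proportions and case weights."""
    return sum(t * (s * math.log(pi) + (1 - s) * math.log(1 - pi))
               for t, s, pi in zip(total, share, p))

# Any candidate probabilities give the same value, so the maximizer is identical.
p_try = [0.7, 0.3]
print(math.isclose(loglik_counts(p_try), loglik_weighted(p_try)))
```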

In [None]:
%%julia
exp.(coef(m1))

array([8.09999998, 0.02469136])

Confidence intervals for odds ratios

In [None]:
%%julia
exp.(confint(m1))

array([[ 5.5433061 , 11.83589694],
       [ 0.01512993,  0.04029518]])
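These intervals differ slightly from the R ones. R's `confint()` on a `glm` profiles the likelihood (hence the "Waiting for profiling to be done..." message), whereas the Julia numbers are consistent with Wald intervals, i.e. estimate ± z × standard error on the log-odds scale. A minimal sketch of the Wald calculation, assuming the standard large-sample standard errors for a 2×2 table:

```python
import math
from statistics import NormalDist

# Closed-form estimates for the saturated two-group model (counts from above).
b0 = math.log(243 / 30)                   # log odds for females
b1 = math.log((48 / 240) / (243 / 30))    # log odds ratio, males vs. females
se0 = math.sqrt(1 / 243 + 1 / 30)
se1 = math.sqrt(1 / 243 + 1 / 30 + 1 / 48 + 1 / 240)

z = NormalDist().inv_cdf(0.975)           # about 1.96 for a 95% interval

def wald_ci_odds(beta, se):
    """Wald interval on the log-odds scale, exponentiated to the odds scale."""
    return math.exp(beta - z * se), math.exp(beta + z * se)

print(wald_ci_odds(b0, se0))
print(wald_ci_odds(b1, se1))
```

The results agree with the Julia output above to roughly four significant digits. With samples of this size, the Wald and profile-likelihood intervals are close but not identical.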

### Skills example

Python solution

In [39]:
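df = pd.read_csv("https://raw.githubusercontent.com/DepartmentOfStatisticsPUE/cda-2022/main/data/count-data.csv")
# Note: astype() returns a new Series and the result is not assigned here,
# so occup1 and woj remain numeric in the models below.
df["occup1"].astype("category")
df["woj"].astype("category")
df.head(n=4)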

       id  year  occup1  woj  nace  technical  math  artistic  computer  cognitive  managerial  interpersonal  individual  physical  availability  office  total_skills
0  626307  2014       5    4     M          0     0         1         0          0           0              0           1         0             0       0             2
1  626305  2014       5   12     M          0     0         0         0          0           0              0           0         0             0       0             0
2  617154  2014       7   14     C          0     0         0         0          0           0              0           0         0             0       0             0
3  617155  2014       7   14     C          0     0         0         0          0           0              0           0         0             0       0             0
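`Series.astype` returns a new object instead of modifying the column in place, so a bare `df["occup1"].astype("category")` has no lasting effect. The sketch below shows the difference on a small made-up frame (`toy` is not part of the original data).

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({"occup1": [5, 5, 7, 3]})

toy["occup1"].astype("category")                   # result discarded, column unchanged
dtype_before = str(toy["occup1"].dtype)

toy["occup1"] = toy["occup1"].astype("category")   # assignment makes it stick
dtype_after = str(toy["occup1"].dtype)

print(dtype_before, dtype_after)
```

In the formula interface, wrapping the variable as `C(occup1)` is another way to request categorical coding without changing the data frame.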


In [41]:
m1 = smf.glm(formula="computer ~ occup1", data=df, family=sm.families.Binomial()).fit()
print(m1.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:               computer   No. Observations:                12914
Model:                            GLM   Df Residuals:                    12912
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -7971.4
Date:                Thu, 12 May 2022   Deviance:                       15943.
Time:                        07:44:12   Pearson chi2:                 1.27e+04
No. Iterations:                     4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.3232      0.043      7.585      0.0

In [43]:
np.exp(m1.params)

Intercept    1.381502
occup1       0.750175
dtype: float64
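Since `occup1` enters this model as a single numeric variable, `exp(occup1)` is a multiplier applied once per unit increase in the occupation code. A short sketch of what that implies, using the rounded values printed above (the helper name is hypothetical):

```python
# Exponentiated coefficients copied from the output above.
odds_at_zero = 1.381502   # exp(Intercept): odds when occup1 == 0
odds_ratio = 0.750175     # exp(occup1): multiplier per one-unit increase

def predicted_probability(occup1):
    """Convert the fitted odds at a given occupation code into a probability."""
    odds = odds_at_zero * odds_ratio ** occup1
    return odds / (1 + odds)

for code in (1, 5, 9):
    print(code, round(predicted_probability(code), 3))
```

This forces the odds to shrink by the same factor between any two consecutive codes, which is a strong assumption for occupation groups. The R and Julia solutions avoid it by treating `occup1` as a factor.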

In [44]:
m2 = smf.glm(formula="computer ~ nace", data=df, family=sm.families.Binomial()).fit()
print(m2.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:               computer   No. Observations:                12914
Model:                            GLM   Df Residuals:                    12897
Model Family:                Binomial   Df Model:                           16
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -7716.3
Date:                Thu, 12 May 2022   Deviance:                       15433.
Time:                        07:45:01   Pearson chi2:                 1.29e+04
No. Iterations:                     5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.8258      0.033    -25.274      0.0

In [45]:
np.exp(m2.params)

Intercept    0.437906
nace[T.D]    0.494111
nace[T.E]    1.392435
nace[T.F]    1.156625
nace[T.G]    1.405560
nace[T.H]    1.042510
nace[T.I]    0.551868
nace[T.J]    6.364767
nace[T.K]    1.076878
nace[T.L]    0.500788
nace[T.M]    1.065066
nace[T.N]    1.188164
nace[T.O]    1.735531
nace[T.P]    0.239873
nace[T.Q]    0.399629
nace[T.R]    0.377453
nace[T.S]    1.573142
dtype: float64

R solution

In [20]:
%%R
df <- read.csv("https://raw.githubusercontent.com/DepartmentOfStatisticsPUE/cda-2022/main/data/count-data.csv")
df$occup1 <- as.factor(df$occup1)
df$woj <- as.factor(df$woj)
head(df)

      id year occup1 woj nace technical math artistic computer cognitive
1 626307 2014      5   4    M         0    0        1        0         0
2 626305 2014      5  12    M         0    0        0        0         0
3 617154 2014      7  14    C         0    0        0        0         0
4 617155 2014      7  14    C         0    0        0        0         0
5 632044 2014      3  24    C         0    0        0        0         0
6 613019 2014      3  14    K         0    0        0        0         0
  managerial interpersonal individual physical availability office total_skills
1          0             0          1        0            0      0            2
2          0             0          0        0            0      0            0
3          0             0          0        0            0      0            0
4          0             0          0        0            0      0            0
5          0             1          0        1            0      0            2
6        

In [18]:
%%R
m1 <- glm(formula = computer ~ occup1, data=df, family=binomial, subset=occup1!=6)
summary(m1)


Call:
glm(formula = computer ~ occup1, family = binomial, data = df, 
    subset = occup1 != 6)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.1861  -0.8265  -0.7982   1.1706   2.6557  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.80494    0.07202 -11.177  < 2e-16 ***
occup12      0.82089    0.07884  10.412  < 2e-16 ***
occup13     -0.09359    0.08311  -1.126   0.2601    
occup14      0.82540    0.10150   8.132 4.22e-16 ***
occup15     -0.17562    0.08184  -2.146   0.0319 *  
occup17     -1.77324    0.16086 -11.023  < 2e-16 ***
occup18     -2.36964    0.30325  -7.814 5.53e-15 ***
occup19     -2.69156    0.51184  -5.259 1.45e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 16601  on 12906  degrees of freedom
Residual deviance: 15413  on 12899  degrees of freedom
AIC: 15429

Number of Fisher Scoring iterations: 5
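The summary figures are linked by simple arithmetic. For ungrouped 0/1 responses the saturated log-likelihood is zero, so the residual deviance equals −2 × log-likelihood. A small check using the values printed above:

```python
# Figures copied from the R summary above.
null_deviance = 16601
residual_deviance = 15413
n_parameters = 8            # intercept plus seven occup1 dummies

# AIC = -2 * log-likelihood + 2 * k, and here -2 * log-likelihood is the deviance.
aic = residual_deviance + 2 * n_parameters

# Likelihood-ratio statistic against the intercept-only model.
lr_stat = null_deviance - residual_deviance
lr_df = 12906 - 12899

print(aic, lr_stat, lr_df)  # 15429 1188 7
```

The likelihood-ratio statistic of 1188 on 7 degrees of freedom is far above the 5% critical value of the chi-square distribution with 7 degrees of freedom (about 14.07), so occupation as a whole is clearly associated with the outcome.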



In [19]:
%%R
m1 |> coef() |> exp()

(Intercept)     occup12     occup13     occup14     occup15     occup17 
 0.44711538  2.27252041  0.91065474  2.28279292  0.83894016  0.16978259 
    occup18     occup19 
 0.09351468  0.06777486 
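Odds ratios for a factor are always relative to the reference level (`occup1 == 1` here). Multiplying each by the baseline odds and converting to probabilities can make them easier to read. A sketch using the values printed above:

```python
# Exponentiated coefficients copied from the R output above.
baseline_odds = 0.44711538          # occup1 == 1 (reference level)
odds_ratios = {
    2: 2.27252041, 3: 0.91065474, 4: 2.28279292, 5: 0.83894016,
    7: 0.16978259, 8: 0.09351468, 9: 0.06777486,
}

def to_probability(odds):
    """Map odds to a probability."""
    return odds / (1 + odds)

probabilities = {1: to_probability(baseline_odds)}
for level, ratio in odds_ratios.items():
    # Odds for a level = baseline odds times that level's odds ratio.
    probabilities[level] = to_probability(baseline_odds * ratio)

for level in sorted(probabilities):
    print(level, round(probabilities[level], 3))
```

Because the model contains a single factor, these fitted probabilities are simply the observed shares of `computer == 1` within each occupation group.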


In [21]:
%%R
m2 <- update(m1, . ~ nace)
summary(m2)


Call:
glm(formula = computer ~ nace, family = binomial, data = df, 
    subset = occup1 != 6)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6319  -0.8754  -0.8530   1.3892   2.1695  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.82380    0.03268 -25.206  < 2e-16 ***
naceD       -0.70695    0.09637  -7.336 2.20e-13 ***
naceE        0.32910    0.25585   1.286  0.19833    
naceF        0.14355    0.14277   1.006  0.31465    
naceG        0.33848    0.07168   4.722 2.33e-06 ***
naceH        0.03968    0.15552   0.255  0.79862    
naceI       -0.59640    0.20949  -2.847  0.00441 ** 
naceJ        1.84883    0.07092  26.069  < 2e-16 ***
naceK        0.07211    0.06064   1.189  0.23436    
naceL       -0.69352    0.15954  -4.347 1.38e-05 ***
naceM        0.06209    0.06498   0.955  0.33935    
naceN        0.17046    0.09211   1.851  0.06423 .  
naceO        0.54936    0.30610   1.795  0.07270 .  
naceP       -1.42960    0.21274  -

In [22]:
%%R
m2 |> coef() |> exp()

(Intercept)       naceD       naceE       naceF       naceG       naceH 
  0.4387622   0.4931470   1.3897188   1.1543690   1.4028188   1.0404764 
      naceI       naceJ       naceK       naceL       naceM       naceN 
  0.5507919   6.3523519   1.0747778   0.4998111   1.0640558   1.1858463 
      naceO       naceP       naceQ       naceR       naceS 
  1.7321455   0.2394053   0.3988493   0.3767172   1.5700734 


In [23]:
%%R
m3 <- update(m1, . ~ . + nace + woj)
car::Anova(m3)

Analysis of Deviance Table (Type II tests)

Response: computer
       LR Chisq Df Pr(>Chisq)    
occup1   857.47  7  < 2.2e-16 ***
nace     837.46 16  < 2.2e-16 ***
woj       85.27 15  7.486e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Julia solution

In [25]:
%%julia
df = CSV.read(download("https://raw.githubusercontent.com/DepartmentOfStatisticsPUE/cda-2022/main/data/count-data.csv"), DataFrame)
df.occup1 = categorical(df.occup1)
df.woj = categorical(df.woj)
first(df,2)

<PyCall.jlwrap 2×17 DataFrame
 Row │ id      year   occup1  woj   nace     technical  math   artistic  computer  cognitive  managerial  interpersonal  individual  physical  availability  office  total_skills
     │ Int64   Int64  Cat…    Cat…  String1  Int64      Int64  Int64     Int64     Int64      Int64       Int64          Int64       Int64     Int64         Int64   Int64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 626307   2014  5       4     M                0      0         1         0          0           0              0           1         0             0       0             2
   2 │ 626305   2014  5       12    M                0      0         0         0          0           0              0           0         0             0       0             0>

In [34]:
%%julia
df2 = subset(df, :occup1 => ByRow(x -> x != 6))
m1 = glm(@formula(computer ~ occup1), df2, Binomial())

<PyCall.jlwrap StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Binomial{Float64}, LogitLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}

computer ~ 1 + occup1

Coefficients:
────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error       z  Pr(>|z|)  Lower 95%   Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)  -0.804939    0.0720194  -11.18    <1e-28  -0.946094  -0.663783
occup1: 2     0.82089     0.0788412   10.41    <1e-24   0.666364   0.975415
occup1: 3    -0.0935914   0.0831086   -1.13    0.2601  -0.256481   0.0692985
occup1: 4     0.8254      0.101501     8.13    <1e-15   0.626462   1.02434
occup1: 5    -0.175616    0.0818407   -2.15    0.0319  -0.336021  -0.0152111
occup1: 7    -1.77324     0.160864   -11.02    <1e-27  -2.08852   -1.45795
occup1: 8    -2.36964     0.303248    -7.81    <

In [35]:
%%julia
m1 |> coef .|> exp

array([0.44711538, 2.27252041, 0.91065474, 2.28279292, 0.83894016,
       0.16978259, 0.09351468, 0.06777486])

In [36]:
%%julia
m2 = glm(@formula(computer ~ nace), df2, Binomial())

<PyCall.jlwrap StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Binomial{Float64}, LogitLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}

computer ~ 1 + nace

Coefficients:
────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error       z  Pr(>|z|)   Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)  -0.823798    0.0326821  -25.21    <1e-99  -0.887853   -0.759742
nace: D      -0.706948    0.0963684   -7.34    <1e-12  -0.895827   -0.518069
nace: E       0.329101    0.255848     1.29    0.1983  -0.172352    0.830555
nace: F       0.143554    0.142766     1.01    0.3146  -0.136262    0.42337
nace: G       0.338484    0.0716757    4.72    <1e-05   0.198002    0.478965
nace: H       0.0396787   0.155523     0.26    0.7986  -0.265141    0.344499
nace: I      -0.596398    0.209486    -2.85  

In [38]:
%%julia
m2 |> coef .|> exp

array([0.43876221, 0.493147  , 1.3897188 , 1.15436902, 1.40281878,
       1.04047642, 0.55079188, 6.35235195, 1.07477784, 0.49981115,
       1.06405578, 1.18584634, 1.73214551, 0.23940535, 0.39884929,
       0.37671716, 1.57007341])