
## Setup

Python

In [8]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf



R

In [1]:
%load_ext rpy2.ipython

In [None]:
%%R
install.packages("car")

In [None]:
%%R
library(car)

Julia

In [None]:
%%bash
wget -q https://julialang-s3.julialang.org/bin/linux/x64/1.7/julia-1.7.2-linux-x86_64.tar.gz
tar zxvf julia-1.7.2-linux-x86_64.tar.gz
## python's module
pip install julia

In [38]:
import julia
julia.install(julia = "/content/julia-1.7.2/bin/julia")
from julia import Julia
jl = Julia(runtime="/content/julia-1.7.2/bin/julia",compiled_modules=False)
%load_ext julia.magic


Precompiling PyCall...
Precompiling PyCall... DONE
PyCall is installed and built successfully.

PyCall is setup for non-default Julia runtime (executable) `/content/julia-1.7.2/bin/julia`.
To use this Julia runtime, PyJulia has to be initialized first by
    from julia import Julia
    Julia(runtime='/content/julia-1.7.2/bin/julia')


Initializing Julia interpreter. This may take some time...




In [None]:
%%julia
using Pkg
Pkg.add("StatsBase")
Pkg.add("GLM")
Pkg.add("DataFrames")
Pkg.add("CategoricalArrays")
Pkg.add("CSV")

In [40]:
%%julia
using StatsBase
using GLM
using CategoricalArrays
using Statistics
using CSV
using DataFrames

## Solutions

### First example using aggregated data

Using Python. We need to create an intercept column by hand because we will pass the design matrix directly instead of using a formula.

In [31]:
df = pd.DataFrame({"gender": ["females", "males"], "bought":[243, 48], "notbought": [30,240]})
df["intercept"] = 1
df["males"] = np.where(df["gender"] == "males", 1, 0)
df

    gender  bought  notbought  intercept  males
0  females     243         30          1      0
1    males      48        240          1      1


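Unfortunately, statsmodels cannot fit this model in Python: it raises a `PerfectSeparationError` (for background on perfect separation see https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression). The data are not actually separated, since both groups contain buyers and non-buyers. The model is saturated, with two parameters for two aggregated observations, so the fitted proportions reproduce the observed ones exactly and the residual degrees of freedom are zero; statsmodels appears to treat this exact fit as perfect separation.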

In [36]:
m1 = sm.GLM(np.asarray(df[['bought', 'notbought']]), 
             np.asarray(df[['intercept', "males"]]), 
             family=sm.families.Binomial()).fit()

print(m1.summary())

  scale = np.dot(wresid, wresid) / df_resid


PerfectSeparationError: ignored
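
Because the model is saturated, its maximum likelihood estimates can be checked by hand without any fitting routine. The sketch below is not part of the original notebook: it assumes the intercept equals the log-odds of buying for females, the `males` coefficient equals the log odds ratio, and the standard errors follow the usual reciprocal-count formula for a 2x2 table. All variable names are illustrative.

```python
import math

# Counts from the aggregated table used in this notebook
bought_f, notbought_f = 243, 30    # females
bought_m, notbought_m = 48, 240    # males

# Log-odds of buying within each group
logit_f = math.log(bought_f / notbought_f)
logit_m = math.log(bought_m / notbought_m)

# Saturated model: intercept = female log-odds, slope = log odds ratio
intercept = logit_f
slope = logit_m - logit_f

# Standard errors from the reciprocal-count formula for a 2x2 table
se_intercept = math.sqrt(1 / bought_f + 1 / notbought_f)
se_slope = math.sqrt(
    1 / bought_f + 1 / notbought_f + 1 / bought_m + 1 / notbought_m
)

print(f"intercept = {intercept:.4f} (SE {se_intercept:.4f})")
print(f"males     = {slope:.4f} (SE {se_slope:.4f})")
```

These values should agree, up to rounding, with the estimates reported by R and Julia in this notebook.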

Using R -- `glm()` fits the same model without problems.

In [33]:
%%R
df1 <- data.frame(gender = c("females", "males"), bought = c(243, 48),  notbought=c(30,240))
     
m1 <- glm(cbind(bought, notbought) ~ gender,  data = df1, family = binomial())
     
summary(m1)


Call:
glm(formula = cbind(bought, notbought) ~ gender, family = binomial(), 
    data = df1)

Deviance Residuals: 
[1]  0  0

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.0919     0.1935   10.81   <2e-16 ***
gendermales  -3.7013     0.2499  -14.81   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3.2833e+02  on 1  degrees of freedom
Residual deviance: 4.7962e-14  on 0  degrees of freedom
AIC: 14.659

Number of Fisher Scoring iterations: 3



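Odds ratios. The exponentiated intercept is the odds of buying for females (the reference group); the exponentiated `gendermales` coefficient is the odds ratio of males versus females.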

In [6]:
%%R
exp(coef(m1))

(Intercept) gendermales 
 8.10000000  0.02469136 
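
To make the exponentiated coefficients easier to interpret, the odds can be converted back to probabilities. This is a minimal sketch that takes the rounded coefficients from the R summary as given; `odds_to_prob` is a hypothetical helper introduced only for illustration.

```python
import math

# Rounded coefficients as reported in the R summary of this notebook
intercept, slope = 2.0919, -3.7013

odds_f = math.exp(intercept)           # odds of buying for females
odds_m = math.exp(intercept + slope)   # odds of buying for males


def odds_to_prob(odds):
    """Convert odds p / (1 - p) back to the probability p."""
    return odds / (1 + odds)


p_f = odds_to_prob(odds_f)
p_m = odds_to_prob(odds_m)

print(f"P(bought | females) = {p_f:.4f}")
print(f"P(bought | males)   = {p_m:.4f}")
```

The recovered probabilities should match the observed shares 243/273 and 48/288, as expected for a saturated model.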


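Confidence intervals for the odds ratios. For a `glm` fit, `confint()` computes profile-likelihood intervals, which is why R reports that it is waiting for profiling.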

In [47]:
%%R
exp(confint(m1))

R[write to console]: Waiting for profiling to be done...



                 2.5 %      97.5 %
(Intercept) 5.64054025 12.07688821
gendermales 0.01488326  0.03970901


Using Julia -- we need to create two new variables: `total` (the group size) and `share` (the proportion who bought).

In [41]:
%%julia
df = DataFrame(:gender => ["females", "males"],
               :bought => [243, 48],
               :notbought => [30,240])
df.total = df.bought + df.notbought
df.share = df.bought ./ df.total
df

<PyCall.jlwrap 2×5 DataFrame
 Row │ gender   bought  notbought  total  share
     │ String   Int64   Int64      Int64  Float64
─────┼─────────────────────────────────────────────
   1 │ females     243         30    273  0.89011
   2 │ males        48        240    288  0.166667>

We can estimate this model by providing `share` as the response in the formula and `total` through the `wts` argument (case weights).

In [42]:
%%julia
m1 = glm(@formula(share ~ gender), df, Binomial(), LogitLink(), wts = df.total)
m1

<PyCall.jlwrap StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Binomial{Float64}, LogitLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}

share ~ 1 + gender

Coefficients:
───────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error       z  Pr(>|z|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)     2.09186    0.19351    10.81    <1e-26    1.71259    2.47114
gender: males  -3.7013     0.249892  -14.81    <1e-48   -4.19108   -3.21152
───────────────────────────────────────────────────────────────────────────>
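
The `share` plus case-weights formulation can be sketched in plain Python with a small Newton (Fisher scoring) loop. This is an illustrative re-implementation under simplifying assumptions (one binary predictor, a hand-solved 2x2 system, start at zero), not GLM.jl's actual algorithm; `fit_grouped_logit` is a hypothetical name.

```python
import math

# Aggregated data: proportion who bought and group size per gender
x = [0, 1]                       # 0 = females, 1 = males
share = [243 / 273, 48 / 288]
total = [273, 288]


def fit_grouped_logit(x, share, total, max_iter=25, tol=1e-12):
    """Newton iterations for a weighted binomial logit with one predictor."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Fitted probabilities under the current coefficients
        p = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Score vector: weights times (observed share - fitted probability)
        r = [n * (s - pi) for n, s, pi in zip(total, share, p)]
        g0 = sum(r)
        g1 = sum(ri * xi for ri, xi in zip(r, x))
        # Information matrix X' W X with W = total * p * (1 - p)
        w = [n * pi * (1 - pi) for n, pi in zip(total, p)]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        # Solve the 2x2 linear system for the Newton step
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


b0, b1 = fit_grouped_logit(x, share, total)
print(f"intercept = {b0:.4f}, males = {b1:.4f}")
```

Multiplying each share by its total recovers the counts, so this weighted fit maximizes the same likelihood as the two-column `cbind(bought, notbought)` response used in R.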

In [45]:
%%julia
exp.(coef(m1))

array([8.09999998, 0.02469136])

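Confidence intervals for the odds ratios. The values below coincide with Wald intervals (the `Lower 95%` and `Upper 95%` columns of the coefficient table, exponentiated), so they differ slightly from the profile-likelihood intervals that R's `confint()` returns.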

In [46]:
%%julia
exp.(confint(m1))

array([[ 5.5433061 , 11.83589694],
       [ 0.01512993,  0.04029518]])
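
The Julia intervals can be reproduced as Wald intervals, `exp(coef ± z * SE)`. The sketch below copies the coefficients and standard errors from the Julia coefficient table in this notebook and assumes a 95% normal quantile; the dictionary names are illustrative.

```python
import math
from statistics import NormalDist

# Estimates and standard errors copied from the Julia coefficient table
coef = {"(Intercept)": 2.09186, "gender: males": -3.7013}
se = {"(Intercept)": 0.19351, "gender: males": 0.249892}

# Two-sided 95% normal quantile (about 1.96)
z = NormalDist().inv_cdf(0.975)

wald_ci = {}
for name in coef:
    # Build the interval on the log-odds scale, then exponentiate
    lower = math.exp(coef[name] - z * se[name])
    upper = math.exp(coef[name] + z * se[name])
    wald_ci[name] = (lower, upper)
    print(f"{name}: [{lower:.5f}, {upper:.5f}]")
```

Exponentiating the endpoints rather than applying a delta-method standard error keeps the interval inside the positive range that odds and odds ratios must occupy.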