# Logistic Regression with Weights in Python

Adapted from this [Stack Overflow answer](https://stackoverflow.com/questions/62742387/how-to-use-weights-in-a-logistic-regression).

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns

In [2]:
data = sns.load_dataset("iris")
data.species.value_counts()

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64

###### Recode to binary (versicolor / not versicolor)

In [3]:
data['species'] = (data['species'] == "versicolor").astype(int)
data.species.value_counts()

0    100
1     50
Name: species, dtype: int64

###### Unweighted Logistic Regression

In [4]:
fit = smf.glm("species ~ sepal_length + sepal_width + petal_length + petal_width",
              family=sm.families.Binomial(), data=data).fit()
print(fit.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                species   No. Observations:                  150
Model:                            GLM   Df Residuals:                      145
Model Family:                Binomial   Df Model:                            4
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -72.535
Date:                Thu, 09 Jun 2022   Deviance:                       145.07
Time:                        12:08:57   Pearson chi2:                     134.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.2635
Covariance Type:            nonrobust                                         
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        7.3785      2.499      2.952   

###### Showing `smf.logit()` is equivalent

In [5]:
fit = smf.logit("species ~ sepal_length + sepal_width + petal_length + petal_width", data=data).fit()
print(fit.summary())

Optimization terminated successfully.
         Current function value: 0.483566
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                species   No. Observations:                  150
Model:                          Logit   Df Residuals:                      145
Method:                           MLE   Df Model:                            4
Date:                Thu, 09 Jun 2022   Pseudo R-squ.:                  0.2403
Time:                        12:08:57   Log-Likelihood:                -72.535
converged:                       True   LL-Null:                       -95.477
Covariance Type:            nonrobust   LLR p-value:                 2.603e-09
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        7.3785      2.499      2.952      0.003       2.480      12.277
sepal_length    -0.2454

###### Weighted Logistic Regression

In [6]:
# Illustrative weights: 1 for rows 0-29, 2 for rows 30-59, ..., 5 for rows 120-149
wts = np.repeat(np.arange(1, 6), 30)

In [7]:
# freq_weights treats row i as wts[i] identical observations
fit = smf.glm("species ~ sepal_length + sepal_width + petal_length + petal_width",
              family=sm.families.Binomial(), data=data, freq_weights=wts).fit()
print(fit.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                species   No. Observations:                  150
Model:                            GLM   Df Residuals:                      445
Model Family:                Binomial   Df Model:                            4
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -224.46
Date:                Thu, 09 Jun 2022   Deviance:                       448.93
Time:                        12:08:57   Pearson chi2:                     414.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.5623
Covariance Type:            nonrobust                                         
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        8.7146      1.444      6.036   

In [8]:
type(wts)

numpy.ndarray

In [9]:
wts

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5])
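
These weights also account for the weighted summary reporting 150 observations but 445 residual degrees of freedom: the residual degrees of freedom appear to be computed from the sum of the frequency weights rather than the row count. The arithmetic, as a small sketch (the names `effective_n` and `n_params` are introduced here for illustration):

```python
import numpy as np

wts = np.repeat(np.arange(1, 6), 30)

n_rows = len(wts)                  # rows actually present in the data
effective_n = int(wts.sum())       # 30 * (1 + 2 + 3 + 4 + 5)
n_params = 5                       # intercept + 4 predictors
df_resid = effective_n - n_params  # matches "Df Residuals" in the weighted summary

print(n_rows, effective_n, df_resid)
```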