# AccelerateAI: Logistic Regression

## Multinomial Logit

### Import Libraries and Load Dataset

In [1]:
import pandas as pd 
import numpy as np

import statsmodels.api as sm

**Dataset: American National Election Survey 1996**

Number of observations - 944
Number of variables - 10

Variable name definitions:

        popul - Census place population in 1000s
        TVnews - Number of times per week that respondent watches TV news.
        PID - Party identification of respondent.
            0 - Strong Democrat
            1 - Weak Democrat
            2 - Independent-Democrat
            3 - Independent-Independent
            4 - Independent-Republican
            5 - Weak Republican
            6 - Strong Republican
        age - Age of respondent.
        educ - Education level of respondent
            1 - 1-8 grades
            2 - Some high school
            3 - High school graduate
            4 - Some college
            5 - College degree
            6 - Master's degree
            7 - PhD
        income - Income of household
            1  - None or less than $2,999
            2  - $3,000-$4,999
            3  - $5,000-$6,999
            4  - $7,000-$8,999
            5  - $9,000-$9,999
            6  - $10,000-$10,999
            7  - $11,000-$11,999
            8  - $12,000-$12,999
            9  - $13,000-$13,999
            10 - $14,000-$14,999
            11 - $15,000-$16,999
            12 - $17,000-$19,999
            13 - $20,000-$21,999
            14 - $22,000-$24,999
            15 - $25,000-$29,999
            16 - $30,000-$34,999
            17 - $35,000-$39,999
            18 - $40,000-$44,999
            19 - $45,000-$49,999
            20 - $50,000-$59,999
            21 - $60,000-$74,999
            22 - $75,000-$89,999
            23 - $90,000-$104,999
            24 - $105,000 and over
        vote - Expected vote
            0 - Clinton
            1 - Dole
        The following 3 variables all take the values:
            1 - Extremely liberal
            2 - Liberal
            3 - Slightly liberal
            4 - Moderate
            5 - Slightly conservative
            6 - Conservative
            7 - Extremely Conservative
        selfLR - Respondent's self-reported political leanings from "Left"
            to "Right".
        ClinLR - Respondent's impression of Bill Clinton's political
            leanings from "Left" to "Right".
        DoleLR  - Respondent's impression of Bob Dole's political leanings
            from "Left" to "Right".
        logpopul - log(popul + .1)


Dataset ref: https://www.statsmodels.org/dev/datasets/generated/anes96.html

In [2]:
# Load the bundled ANES 1996 data and append an intercept column after the regressors.
anes_data = sm.datasets.anes96.load()
anes_exog = anes_data.exog
anes_exog = sm.add_constant(anes_exog, prepend=False)

In [3]:
anes_data.exog_name

['logpopul', 'selfLR', 'age', 'educ', 'income']

In [4]:
anes_data.endog_name

'PID'

In [5]:
df = pd.DataFrame.from_records(sm.datasets.anes96.load().data)

df.head(5)

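|   | popul | TVnews | selfLR | ClinLR | DoleLR | PID | age | educ | income | vote | logpopul |
|---|-------|--------|--------|--------|--------|-----|-----|------|--------|------|----------|
| 0 | 0.0 | 7.0 | 7.0 | 1.0 | 6.0 | 6.0 | 36.0 | 3.0 | 1.0 | 1.0 | -2.302585 |
| 1 | 190.0 | 1.0 | 3.0 | 3.0 | 5.0 | 1.0 | 20.0 | 4.0 | 1.0 | 0.0 | 5.24755 |
| 2 | 31.0 | 7.0 | 2.0 | 2.0 | 6.0 | 1.0 | 24.0 | 6.0 | 1.0 | 0.0 | 3.437208 |
| 3 | 83.0 | 4.0 | 3.0 | 4.0 | 5.0 | 1.0 | 28.0 | 6.0 | 1.0 | 0.0 | 4.420045 |
| 4 | 640.0 | 7.0 | 5.0 | 6.0 | 4.0 | 0.0 | 68.0 | 6.0 | 1.0 | 0.0 | 6.461624 |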
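
The variable list defines `logpopul` as log(popul + .1). As a quick sanity check, the sketch below recomputes it for the five rows shown by `df.head(5)`; the values are copied by hand from that output, so the comparison only holds up to the printed precision.

```python
import numpy as np

# popul and logpopul for the first five respondents, copied from df.head(5).
popul = np.array([0.0, 190.0, 31.0, 83.0, 640.0])
logpopul = np.array([-2.302585, 5.24755, 3.437208, 4.420045, 6.461624])

# The 0.1 offset keeps the log finite for places with popul == 0.
recomputed = np.log(popul + 0.1)

print(recomputed.round(6))
print("matches printed logpopul:", np.allclose(recomputed, logpopul, atol=1e-5))
```

The first row shows why the offset is needed: a population of 0 maps to log(0.1), roughly -2.3, instead of negative infinity.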


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 944 entries, 0 to 943
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   popul     944 non-null    float64
 1   TVnews    944 non-null    float64
 2   selfLR    944 non-null    float64
 3   ClinLR    944 non-null    float64
 4   DoleLR    944 non-null    float64
 5   PID       944 non-null    float64
 6   age       944 non-null    float64
 7   educ      944 non-null    float64
 8   income    944 non-null    float64
 9   vote      944 non-null    float64
 10  logpopul  944 non-null    float64
dtypes: float64(11)
memory usage: 81.2 KB


### Fit the Multinomial Logit Model

In [7]:
mlogit_mod = sm.MNLogit(anes_data.endog, anes_exog)
mlogit_res = mlogit_mod.fit()

Optimization terminated successfully.
         Current function value: 1.548647
         Iterations 7


In [8]:
print(mlogit_res.summary())

                          MNLogit Regression Results                          
Dep. Variable:                    PID   No. Observations:                  944
Model:                        MNLogit   Df Residuals:                      908
Method:                           MLE   Df Model:                           30
Date:                Sun, 11 Sep 2022   Pseudo R-squ.:                  0.1648
Time:                        15:31:47   Log-Likelihood:                -1461.9
converged:                       True   LL-Null:                       -1750.3
Covariance Type:            nonrobust   LLR p-value:                1.822e-102
     PID=1       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
logpopul      -0.0115      0.034     -0.336      0.736      -0.079       0.056
selfLR         0.2977      0.094      3.180      0.001       0.114       0.481
age           -0.0249      0.007     -3.823      0.0

In [9]:
print(mlogit_res.params)

                 0         1         2         3         4          5
logpopul -0.011536 -0.088751 -0.105967 -0.091557 -0.093285  -0.140881
selfLR    0.297714  0.391669  0.573451  1.278772  1.346962   2.070080
age      -0.024945 -0.022898 -0.014851 -0.008681 -0.017904  -0.009433
educ      0.082491  0.181043 -0.007152  0.199828  0.216939   0.321926
income    0.005197  0.047874  0.057575  0.084498  0.080958   0.108894
const    -0.373402 -2.250913 -3.665584 -7.613843 -7.060478 -12.105751
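
Each column of this table holds the coefficients of one equation. Judging by the summary, column 0 is the equation for PID=1, so PID=0 acts as the base category with its linear predictor fixed at zero. The sketch below turns the printed coefficients into class probabilities for the first respondent in `df.head(5)` by applying a softmax. The numbers are copied by hand from the outputs above, so the result should match `mlogit_res.predict` only up to rounding.

```python
import numpy as np

# Coefficients copied from mlogit_res.params: rows are regressors,
# columns 0..5 are the equations for PID=1..6 (PID=0 is the base category).
params = np.array([
    [-0.011536, -0.088751, -0.105967, -0.091557, -0.093285,  -0.140881],  # logpopul
    [ 0.297714,  0.391669,  0.573451,  1.278772,  1.346962,   2.070080],  # selfLR
    [-0.024945, -0.022898, -0.014851, -0.008681, -0.017904,  -0.009433],  # age
    [ 0.082491,  0.181043, -0.007152,  0.199828,  0.216939,   0.321926],  # educ
    [ 0.005197,  0.047874,  0.057575,  0.084498,  0.080958,   0.108894],  # income
    [-0.373402, -2.250913, -3.665584, -7.613843, -7.060478, -12.105751],  # const
])

# First respondent: logpopul, selfLR, age, educ, income, const.
x = np.array([-2.302585, 7.0, 36.0, 3.0, 1.0, 1.0])

# Linear predictors for PID=0..6; the base category's predictor is 0.
eta = np.concatenate(([0.0], x @ params))

# Softmax, shifted by the maximum for numerical stability.
probs = np.exp(eta - eta.max())
probs /= probs.sum()

print(probs.round(3))
print("most likely PID:", probs.argmax())
```

For this respondent, who reports `selfLR` = 7, the largest probability falls on PID=6 (Strong Republican), which is also the PID recorded in the data.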


In [10]:
mlogit_res.llnull

-1750.346710709092
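
The headline numbers in the summary can be reproduced by hand from `llnull` and the optimizer output. The sketch below assumes that the "Current function value" reported during fitting is the negative log-likelihood divided by the number of observations, which is consistent with the figures shown here. All inputs are copied from the outputs above.

```python
nobs = 944
n_equations = 6    # PID=1..6, each compared against the base category PID=0
n_regressors = 6   # five covariates plus the constant

neg_avg_loglike = 1.548647     # "Current function value" from the default fit
llnull = -1750.346710709092    # mlogit_res.llnull (intercept-only model)

# Log-likelihood of the fitted model (assumption: function value = -llf / nobs).
llf = -neg_avg_loglike * nobs

# McFadden's pseudo R-squared and the likelihood-ratio statistic.
pseudo_r2 = 1 - llf / llnull
llr = 2 * (llf - llnull)

# Degrees of freedom: slopes only for the model, all parameters for the residual.
df_model = n_equations * (n_regressors - 1)
df_resid = nobs - n_equations * n_regressors

print(f"Log-Likelihood ~ {llf:.1f}")
print(f"Pseudo R-squ.  ~ {pseudo_r2:.4f}")
print(f"LLR statistic  ~ {llr:.1f} on {df_model} degrees of freedom")
print(f"Df Residuals   = {df_resid}")
```

These values line up with the summary table: a log-likelihood of about -1461.9, a pseudo R-squared of 0.1648, 30 model degrees of freedom, and 908 residual degrees of freedom.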

### Alternate Solvers

The `method` argument of `fit` determines which solver from `scipy.optimize` is used. Passing `'bfgs'` selects the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm.

In [11]:
mlogit_res = mlogit_mod.fit(method="bfgs", maxiter=250)
print(mlogit_res.summary())

Optimization terminated successfully.
         Current function value: 1.548647
         Iterations: 111
         Function evaluations: 117
         Gradient evaluations: 117
                          MNLogit Regression Results                          
Dep. Variable:                    PID   No. Observations:                  944
Model:                        MNLogit   Df Residuals:                      908
Method:                           MLE   Df Model:                           30
Date:                Sun, 11 Sep 2022   Pseudo R-squ.:                  0.1648
Time:                        15:31:47   Log-Likelihood:                -1461.9
converged:                       True   LL-Null:                       -1750.3
Covariance Type:            nonrobust   LLR p-value:                1.822e-102
     PID=1       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
logpopul      -0.0115      0.034   

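Passing `'powell'` selects the modified Powell's method, which uses no derivatives. On this model it needs far more function evaluations and stops at a slightly lower log-likelihood (-1466.5 versus -1461.9).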

In [12]:
mlogit_res = mlogit_mod.fit(method="powell", maxiter=250)
print(mlogit_res.summary())

Optimization terminated successfully.
         Current function value: 1.553526
         Iterations: 30
         Function evaluations: 14156
                          MNLogit Regression Results                          
Dep. Variable:                    PID   No. Observations:                  944
Model:                        MNLogit   Df Residuals:                      908
Method:                           MLE   Df Model:                           30
Date:                Sun, 11 Sep 2022   Pseudo R-squ.:                  0.1621
Time:                        15:31:49   Log-Likelihood:                -1466.5
converged:                       True   LL-Null:                       -1750.3
Covariance Type:            nonrobust   LLR p-value:                1.457e-100
     PID=1       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
logpopul      -0.0242      0.034     -0.706      0.480      -0.091   