<a href="https://colab.research.google.com/github/550tealeaves/DATA-70500-working-with-data/blob/main/HW4_LinearModels2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Models, part II: Logistic Regression

We'll use the global social indicators data to develop a logistic regression model and practice interpreting the results.

First, we'll import the libraries we'll need for this model.


In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import seaborn as sb
import math


Here's some information about where these social indicators were created: https://hdr.undp.org/data-center/composite-indices


Next, we'll read in the data sources and create the DataFrame with all of our variables.

In [2]:
#Load datasets and use na_values to note missing values
OriginalOrders = pd.read_csv('https://raw.githubusercontent.com/550tealeaves/DATA-70500-working-with-data/refs/heads/main/datasets/original_orders.csv', index_col='outfit.id', na_values=[np.nan])
OriginalOrders.head()

outfit.id,customer.id,rentalPeriod.start,rentalPeriod.end
outfit.923f3fd476b5450b9582d1f525604546,3945,5/25/2018,5/28/2018
outfit.8c8e922e228ba03f,4088,8/29/2019,9/2/2019
outfit.96f152543e7668ae,4360,8/10/2018,8/13/2018
outfit.ddba05a5ced34fa1ab3a0722c05bb11a,4697,6/14/2018,6/19/2018
outfit.5ef01d4dc15243fb854ca797716fd663,3890,8/24/2019,8/27/2019


In [3]:
SpotRentals = pd.read_csv('https://raw.githubusercontent.com/550tealeaves/DATA-70500-working-with-data/refs/heads/main/datasets/spot_rentals.csv', index_col='outfit.id', na_values=[np.nan])
SpotRentals.head()

outfit.id,rentalPeriod.start,rentalPeriod.end
outfit.028bd28ce1184e1283d20ae44694bdb8,12/16/2022,12/19/2022
outfit.5349817f34194e5f975f8d51af28c4ba,1/31/2024,3/31/2024
outfit.268878c3e3e24d98bbc6e9770e2eb44c,10/19/2023,10/22/2023
outfit.40fd217f4ee74939aaf0422cf1478f6e,11/19/2020,11/23/2020
outfit.c1e2e74fb2d84e07bec3185521fbea5b,6/29/2022,7/4/2022


In [4]:
ThirdChance = pd.read_csv('https://raw.githubusercontent.com/550tealeaves/DATA-70500-working-with-data/refs/heads/main/datasets/third_chance.csv', index_col='outfit.id', na_values=[np.nan])
ThirdChance.head()

outfit.id,date_added,name,brand,owner,condition,condition_desctription,retail_price,tc_price,sold
outfit.00731d8db5504c64af534f9f43a7f061,,L Heaston Skirt Hushed Violet,Samsøe Samsøe,FJONG,5 - Damaged,Flekker som ikke går bort.,800.0,50.0,True
outfit.00fa8d1fd5804fe78ac284c2e0c28b4e,6/10/2022,L Navy Buffalo Skirt,Holzweiler,FJONG,5 - Damaged,Small black dots on shoulders and back (see im...,4700.0,1175.0,False
outfit.0106cbd7e23f41da92e3081146b4ad03,10/31/2023,Juliette Skirt Red Line,Samsøe & Samsøe,FJONG,5 - Damaged,Store flekker i front,2900.0,580.0,False
outfit.0113e4422a3f408f8fc9a94d2861474f,8/18/2022,Sand Poddle99 Blazer,Holzweiler,Holzweiler,5 - Damaged,Lite Hull både foran og bak. Kan sys? eventuel...,2400.0,300.0,False
outfit.0118b605888840348f80965c2d8865cf,1/10/2023,XS Wabi Embroidered Bright White Shirt,FWSS,FWSS,5 - Damaged,,,,False


In [5]:
# Combine the datasets column-wise with concat: rows are aligned on the outfit.id index
# rather than stacked, so each frame's index must contain unique values.
VintageClothes = pd.concat([OriginalOrders, SpotRentals, ThirdChance], axis=1)
VintageClothes.info(verbose=True)

InvalidIndexError: Reindexing only valid with uniquely valued Index objects
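
The error above typically appears when `pd.concat(..., axis=1)` has to align frames whose index contains repeated labels, for example an outfit that was rented more than once. The following is a minimal sketch on made-up frames (the names `orders` and `listing` and their values are hypothetical, not the notebook's data) showing one way to check for duplicates and remove them before combining. Keeping only the first row per `outfit.id` is an assumption made for illustration; aggregating repeat rentals may suit the analysis better.

```python
import pandas as pd

# Hypothetical toy frames: 'outfit.a' appears twice in orders.
orders = pd.DataFrame(
    {"customer.id": [3945, 4088, 4360]},
    index=pd.Index(["outfit.a", "outfit.a", "outfit.b"], name="outfit.id"),
)
listing = pd.DataFrame(
    {"brand": ["Holzweiler", "FWSS"]},
    index=pd.Index(["outfit.a", "outfit.c"], name="outfit.id"),
)

frames = {"orders": orders, "listing": listing}

# Report which frames have repeated index labels.
for name, frame in frames.items():
    print(name, "has a unique index:", frame.index.is_unique)

# Keep the first row for each label (an illustrative choice), then align column-wise.
deduped = [frame[~frame.index.duplicated(keep="first")] for frame in frames.values()]
combined = pd.concat(deduped, axis=1)
print(combined)
```

Outfits that appear in only one frame get missing values in the other frame's columns.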

Now, we'll compute a binary variable that will be our dependent variable, Y. Then, we'll identify the relevant independent variables and put them in a new DataFrame, X. At that point, we can compute the model.

In [None]:
# GII is a composite score between 0 and 1: a higher score means more gender inequality, a lower score means greater gender equality
GlobalIndicatorsTotal['Gender Inequality Index (GII)'].describe()

Unnamed: 0,Gender Inequality Index (GII)
count,155.0
mean,0.365884
std,0.191457
min,0.016
25%,0.184
50%,0.385
75%,0.5245
max,0.744


In [None]:
# Turn the index into a binary variable: low-GII nations versus everyone else
# Use .loc with a Boolean condition to recode the matching rows
GlobalIndicatorsTotal['GII Binary'] = 0 # create the new variable
GlobalIndicatorsTotal.loc[GlobalIndicatorsTotal['Gender Inequality Index (GII)'] < 0.19, ['GII Binary']] = 1 # Nations with a low gender inequality score (roughly the lowest quartile), that is, the highest gender equality, are coded 1
GlobalIndicatorsTotal['GII Binary'].describe() # every country is now either 0 or 1

Unnamed: 0,GII Binary
count,188.0
mean,0.207447
std,0.406561
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0
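
Note that the GII column has 155 non-missing values while the binary variable has 188, so countries with a missing GII end up coded 0 because a comparison with `NaN` is always false. The sketch below, on a small made-up Series (the values are illustrative assumptions), shows one way to keep those cases missing instead.

```python
import numpy as np
import pandas as pd

# Hypothetical GII scores, including one missing value.
gii = pd.Series([0.016, 0.184, 0.385, np.nan, 0.744], name="Gender Inequality Index (GII)")

# Compare against the cutoff, then restore NaN wherever the original score was missing.
binary = (gii < 0.19).astype(float).where(gii.notna())
print(binary.tolist())
```

Whether missing scores should be dropped or treated as 0 is a substantive choice; the point here is only to make that choice explicit.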


In [None]:
#binary variables are typically 0 and 1
GlobalIndicatorsTotal['GII Binary']

Country,GII Binary
Norway,1
Australia,1
Switzerland,1
Denmark,1
Netherlands,1
...,...
Burundi,0
Chad,0
Eritrea,0
Central African Republic,0


In [None]:
# Fit a logistic regression model for the probability of being in the low-GII (high gender equality) group, which is coded 1
# The model uses 5 candidate independent variables
Y = GlobalIndicatorsTotal['GII Binary']
X = GlobalIndicatorsTotal[['Percent Representation in Parliament', 'Population with Secondary Education (Female)', 'Labour Force Participation Rate (Female)', 'Life Expectancy at Birth', 'Gross National Income (GNI) per Capita']]
model0 = sm.Logit(Y, X, missing='drop').fit()
print(model0.summary())

Optimization terminated successfully.
         Current function value: 0.321133
         Iterations 8
                           Logit Regression Results                           
Dep. Variable:             GII Binary   No. Observations:                  156
Model:                          Logit   Df Residuals:                      151
Method:                           MLE   Df Model:                            4
Date:                Wed, 02 Oct 2024   Pseudo R-squ.:                  0.4289
Time:                        00:07:16   Log-Likelihood:                -50.097
converged:                       True   LL-Null:                       -87.724
Covariance Type:            nonrobust   LLR p-value:                 1.760e-15
                                                   coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------------------------------
Percent Representation in Parliament    

- The LLR p-value is well below 0.05, so the model fits significantly better than a null model with no predictors.
- Pseudo R-squared (a goodness-of-fit measure) is about 0.43. It is not a share of explained variance like R-squared in OLS, so treat it only as a rough indicator of fit.
- The coefficients are on the log-odds scale, which is hard to interpret directly, so we convert them to odds by exponentiating.

- Labour force participation rate and life expectancy at birth have negative coefficients, that is, odds ratios below 1.

In [None]:
# The coefficients are log-odds, so we exponentiate them (take anti-logs) to interpret them as odds ratios.
# For the negative coefficients, it is useful to take the inverse of the result and interpret it in the opposite direction (that is, the odds of not being in the high gender equality group).
# You can also change the increment of change in X, as is the case here with the parameter for GNI per capita; I changed the increment to $1000 instead of $1.
# .iloc makes the positional lookup explicit; plain integer keys on a labelled Series are deprecated in pandas.
print(math.exp(model0.params.iloc[0]), math.exp(model0.params.iloc[1]), 1/math.exp(model0.params.iloc[2]), 1/math.exp(model0.params.iloc[3]), math.exp(1000*model0.params.iloc[4]))

1.113047941683358 1.0677546839634509 1.0600888084476103 1.086297381643493 1.0540509355600345
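
The two transformations used in that cell, inverting an odds ratio below 1 and rescaling the increment of X, can be checked by hand. The sketch below starts from the rounded odds ratios that this notebook prints with `np.exp(model0.params)`, so the results agree with the cell output only up to rounding.

```python
import numpy as np

# Rounded odds ratios as printed by np.exp(model0.params) in this notebook.
or_labour = 0.943317    # Labour Force Participation Rate (Female)
or_life = 0.920558      # Life Expectancy at Birth
or_gni = 1.000053       # GNI per capita, per $1

# Inverting an odds ratio below 1 gives the odds ratio for the opposite outcome.
inv_labour = 1 / or_labour
inv_life = 1 / or_life

# exp(1000 * b) equals exp(b) ** 1000, so a $1000 increment is the per-dollar ratio raised to 1000.
gni_per_1000 = or_gni ** 1000

# Percent change in the odds is (odds ratio - 1) * 100.
pct_change_gni = (gni_per_1000 - 1) * 100

print(round(inv_labour, 3), round(inv_life, 3), round(gni_per_1000, 3))
```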




### **Summary**
Holding the other predictors constant, the odds ratios can be read as follows:
- For each one-percentage-point increase in representation in parliament, the odds of being in the high equality group are multiplied by 1.11.
- For each one-percentage-point increase in the female population with secondary education, the odds of being in the high equality group are multiplied by 1.07.
- For each one-percentage-point increase in female labour force participation, the odds of not being in the high equality group are multiplied by 1.06.
- For each one-year increase in life expectancy at birth, the odds of not being in the high equality group are multiplied by 1.09.
- For each $1000 increase in GNI per capita, the odds of being in the high equality group are multiplied by 1.05.

In [None]:
print(np.exp(model0.params)) #These are expressed as odds ratios

Percent Representation in Parliament            1.113048
Population with Secondary Education (Female)    1.067755
Labour Force Participation Rate (Female)        0.943317
Life Expectancy at Birth                        0.920558
Gross National Income (GNI) per Capita          1.000053
dtype: float64


- Each odds ratio is the factor by which the odds of being in the group coded 1 change for a one-unit increase in that independent variable, holding the others constant.

In [None]:
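# Logistic regression results are often reported as marginal effects, because the relationship between X and the predicted probability is a curve, not a straight line
model0_marginals = model0.get_margeff() # These are the average marginal effects, that is, the average slope across all observations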
print(model0_marginals.summary())

        Logit Marginal Effects       
Dep. Variable:             GII Binary
Method:                          dydx
At:                           overall
                                                  dy/dx    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------------------------------
Percent Representation in Parliament             0.0109      0.002      4.922      0.000       0.007       0.015
Population with Secondary Education (Female)     0.0066      0.001      5.889      0.000       0.004       0.009
Labour Force Participation Rate (Female)        -0.0059      0.002     -2.653      0.008      -0.010      -0.002
Life Expectancy at Birth                        -0.0084      0.001     -6.576      0.000      -0.011      -0.006
Gross National Income (GNI) per Capita        5.335e-06   1.13e-06      4.728      0.000    3.12e-06    7.55e-06


In [None]:
# Marginal effects evaluated at the median of each predictor: each dy/dx is the change in predicted probability per one-unit change in X
model0_marginals = model0.get_margeff(at='median') # It is often more useful to get estimates of the effect sizes at particular values for the factors
print(model0_marginals.summary())

        Logit Marginal Effects       
Dep. Variable:             GII Binary
Method:                          dydx
At:                            median
                                                  dy/dx    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------------------------------
Percent Representation in Parliament             0.0055      0.002      2.711      0.007       0.002       0.009
Population with Secondary Education (Female)     0.0034      0.001      3.074      0.002       0.001       0.006
Labour Force Participation Rate (Female)        -0.0030      0.001     -2.608      0.009      -0.005      -0.001
Life Expectancy at Birth                        -0.0043      0.002     -2.647      0.008      -0.007      -0.001
Gross National Income (GNI) per Capita        2.708e-06    1.1e-06      2.458      0.014    5.48e-07    4.87e-06
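
The marginal effects at the median differ from the average marginal effects because the slope of the logistic curve depends on where it is evaluated: for a logit model, dy/dx equals the coefficient times p(1 - p). The sketch below checks that identity numerically. The coefficient `beta` and the two linear-predictor values are illustrative assumptions, not estimates from `model0`.

```python
import numpy as np

def logistic(z):
    """Map a linear predictor z to a probability."""
    return 1 / (1 + np.exp(-z))

beta = 0.107  # hypothetical log-odds coefficient

def marginal_effect(z):
    """Analytic slope of the probability with respect to x at linear predictor z."""
    p = logistic(z)
    return beta * p * (1 - p)

def numeric_slope(z, h=1e-6):
    """Central finite difference: a change of h in x shifts z by beta * h."""
    return (logistic(z + beta * h) - logistic(z - beta * h)) / (2 * h)

# The slope is steepest at p = 0.5 (z = 0) and flatter away from it.
for z in (0.0, 2.0):
    print(z, marginal_effect(z), numeric_slope(z))
```

The same coefficient therefore implies a larger change in probability for a country near p = 0.5 than for one whose predicted probability is already close to 0 or 1.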


In [None]:
mdn_rep = 0.0055*10 # dy/dx is a change in probability, so scale it by the increment rather than exponentiating
print("With all predictors at their medians, an increase of ten percentage points in representation in parliament raises the probability of being a high equality nation by about", f"{mdn_rep:.3f}.")

With all predictors at their medians, an increase of ten percentage points in representation in parliament raises the probability of being a high equality nation by about 0.055.


In [None]:
model0_pred = model0.pred_table()
print(model0_pred) # Correct predictions are on the diagonal of the 2d array.

[[109.   8.]
 [ 12.  27.]]


In [None]:
correct_i = 109 / (109 + 8) # The proportion of correct predictions of 0.
correct_j = 27 / (27 + 12) # The proportion of correct predictions of 1.
print(correct_i, correct_j)

0.9316239316239316 0.6923076923076923
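
The same proportions can be computed directly from the prediction table rather than typed in by hand, along with the overall accuracy and a baseline for comparison. This sketch copies the table printed above and assumes the `pred_table()` layout of observed classes in rows and predicted classes in columns.

```python
import numpy as np

# Prediction table from the output above: rows = observed (0, 1), columns = predicted (0, 1).
table = np.array([[109.0, 8.0],
                  [12.0, 27.0]])

# Share of each observed class that was predicted correctly.
per_class = np.diag(table) / table.sum(axis=1)

# Overall accuracy: correct predictions (the diagonal) over all cases.
accuracy = np.trace(table) / table.sum()

# Baseline: accuracy from always predicting the more common observed class.
baseline = table.sum(axis=1).max() / table.sum()

print(per_class, accuracy, baseline)
```

Comparing accuracy with the baseline shows how much the predictors add beyond simply guessing the majority class.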


## Activity

1. Find and read into a DataFrame a suitable dataset. You may use the global social indicators data from the example here. You may need to combine files, as shown here.

2. Identify a dependent variable to explain. Create a binary variable of this measure, if needed. Explain why you chose this variable or recoded it the way you did.

3. Build a model to explain the DV. You can use the odds (anti-log of the coefficients) or the marginal effects to test for the unique effects of each predictor.

4. Explain the results.