<h1 style="text-align: center; color: purple;" markdown="1">Econ 320 Python Lab Regression Analysis and Qualitative Regressors PART2</h1>
<h2 style="text-align: center; color: purple;" markdown="1">Handout 11</h2>

Many variables of interest are qualitative rather than quantitative: gender, race, marital status, level of education, occupation, region, etc. Qualitative information is usually represented in regressions as binary or dummy variables, which can only take the value zero or one.

**The set up**

In [90]:
import wooldridge as woo
import numpy as np
import pandas as pd

import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats

from stargazer.stargazer import Stargazer
from IPython.core.display import HTML
pd.options.display.float_format = '{:.3f}'.format

## Categorical variables


When estimating a linear regression in Python using **statsmodels**, you can easily treat any variable as categorical by wrapping it in the function `C()` in the formula. The **ols** function will add *g-1* dummy variables if the variable has *g* categories. The first category is left out by default and serves as the reference category.

When a categorical variable has many categories, you have to choose a reference category: the omitted category that avoids perfect collinearity. By default the first category is left out in Python, but `C()` accepts a second argument that sets a new reference group `somegroup`: **Treatment("somegroup")**.
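As a minimal sketch of the default and the `Treatment` reference, the example below uses a small made-up data set (the `region` variable, its three levels, and the simulated `y` are assumptions for illustration only, not part of the ACS data).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: a three-category variable and a simulated outcome
rng = np.random.default_rng(0)
toy = pd.DataFrame({"region": np.repeat(["north", "south", "west"], 20)})
shift = toy["region"].map({"north": 0.0, "south": 1.0, "west": 2.0})
toy["y"] = 5.0 + shift + rng.normal(size=len(toy))

# Default: the first category ("north") is the reference, so g - 1 = 2 dummies appear
default_ref = smf.ols("y ~ C(region)", data=toy).fit()
print(default_ref.params.index.tolist())

# Treatment("west") makes "west" the reference category instead
new_ref = smf.ols('y ~ C(region, Treatment("west"))', data=toy).fit()
print(new_ref.params.index.tolist())
```

In both fits the model has an intercept plus two dummies; only the category that is left out changes, and each dummy coefficient is read relative to that omitted group.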

The code below shows how our categorical variables are used.

* Table of categories and frequencies for two factor variables gender and occupation:
* What type of variable is occupation
* Regression with dummies for many categories from a categorical variable 

In [91]:
acs=pd.read_csv("usa_00003.csv")
# Keep only the variables used in this handout
acs = acs[["AGE", "SEX","MARST","CITIZEN", "EDUC", "EDUCD", "LOOKING", "FTOTINC"]]
# show info and description of variables. 
acs.info()
acs.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3239553 entries, 0 to 3239552
Data columns (total 8 columns):
 #   Column   Dtype
---  ------   -----
 0   AGE      int64
 1   SEX      int64
 2   MARST    int64
 3   CITIZEN  int64
 4   EDUC     int64
 5   EDUCD    int64
 6   LOOKING  int64
 7   FTOTINC  int64
dtypes: int64(8)
memory usage: 197.7 MB


Unnamed: 0,AGE,SEX,MARST,CITIZEN,EDUC,EDUCD,LOOKING,FTOTINC
count,3239553.0,3239553.0,3239553.0,3239553.0,3239553.0,3239553.0,3239553.0,3239553.0
mean,42.224,1.51,3.583,0.293,6.267,65.107,1.859,580076.789
std,23.803,0.5,2.308,0.801,3.248,32.319,1.215,2121902.922
min,0.0,1.0,1.0,0.0,0.0,1.0,0.0,-15200.0
25%,22.0,1.0,1.0,0.0,5.0,50.0,1.0,40000.0
50%,43.0,2.0,4.0,0.0,6.0,64.0,3.0,79000.0
75%,62.0,2.0,6.0,0.0,10.0,101.0,3.0,141700.0
max,96.0,2.0,6.0,3.0,11.0,116.0,3.0,9999999.0


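Pay attention to the averages here, total family income for example: the mean of `FTOTINC` is about 580,077 while the median is 79,000, because the code 9999999 (N/A) is being averaged in as if it were an income.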

In [92]:
# the value 9999999 is the code for N/A: rename the column and replace that code with NaN
acs=acs.rename(columns={"FTOTINC": "familyIncome"})
acs["familyIncome"] = acs["familyIncome"].replace(9999999, np.nan)
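To see why the recode matters, here is a small sketch with made-up incomes (the five values are hypothetical; only the 9999999 code comes from the data above). It compares the mean before and after replacing the code with `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes; 9999999 is the code for "not applicable"
income = pd.Series([40000, 79000, 141700, 9999999, 9999999], dtype="float64")

# The N/A code is averaged in as if it were a real income
mean_with_codes = income.mean()

# After the recode, pandas skips NaN when computing the mean
cleaned = income.replace(9999999, np.nan)
mean_cleaned = cleaned.mean()

print(mean_with_codes, mean_cleaned, cleaned.isna().sum())
```

The cleaned mean uses only the three observed incomes, which is the behaviour we want before taking logs or running regressions.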

In [None]:
# Let's use value_counts() to check our SEX, MARST and EDUC variables

print(acs.SEX.value_counts())
print(acs.MARST.value_counts())
print(acs.EDUC.value_counts())


SEX
2    1651221
1    1588332
Name: count, dtype: int64
MARST
1    1338592
6    1326614
4     296419
5     175626
2      60316
3      41986
Name: count, dtype: int64
EDUC
6     957679
10    507242
7     378171
11    320738
1     235229
8     217713
2     208489
0     189602
5      82793
4      74787
3      67110
Name: count, dtype: int64


In [94]:
# Value counts for MARST
acs.MARST.value_counts()

MARST
1    1338592
6    1326614
4     296419
5     175626
2      60316
3      41986
Name: count, dtype: int64

In [95]:
acs.CITIZEN.value_counts()

CITIZEN
0    2831596
2     213640
3     163153
1      31164
Name: count, dtype: int64

In [96]:
# define function dummy
# Define criteria, return 1 , else return 0
    
def dummyHS(x):
    if x>=6:
        return 1
    else: 
        return 0

acs['HS']=acs["EDUC"].apply(dummyHS)
acs[['EDUC','HS']].head(10)   
acs[['EDUC', 'HS']].value_counts(dropna=False)

EDUC  HS
6     1     957679
10    1     507242
7     1     378171
11    1     320738
1     0     235229
8     1     217713
2     0     208489
0     0     189602
5     0      82793
4     0      74787
3     0      67110
Name: count, dtype: int64
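The same dummy can also be built without a row-by-row function. The sketch below, on a few hypothetical `EDUC` codes, compares the `apply` approach from the cell above with a vectorized comparison; on a data set with millions of rows the vectorized form is usually much faster.

```python
import pandas as pd

# Hypothetical EDUC codes on both sides of the cutoff
educ = pd.Series([0, 3, 5, 6, 7, 10, 11], name="EDUC")

def dummyHS(x):
    # 1 if the education code is 6 or higher, else 0
    return 1 if x >= 6 else 0

hs_apply = educ.apply(dummyHS)          # row by row, as in the cell above
hs_vector = (educ >= 6).astype(int)     # one comparison over the whole column

print(hs_vector.tolist())
```

Both versions give the same 0/1 column, so the choice is about speed and readability rather than results.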

## Boolean Variables

To store qualitative yes-or-no information, Python uses **Boolean variables**. Instead of being transformed into 0/1 dummy variables, they can be used directly as regressors; in the output their coefficient is then named `varname[T.True]`. These variables are treated such that **True=1** and **False=0**.

Below we take the `CITIZEN` variable, recode it as a boolean variable, and later include it in a regression.

In [97]:
display(acs.CITIZEN.value_counts())
# Create the boolean variable
# Define a criterion in parentheses and save the result in a new variable
# In this case, not a citizen (CITIZEN == 3)
acs['noncitizen'] = (acs['CITIZEN'] ==3  )

acs['noncitizen'].value_counts()


CITIZEN
0    2831596
2     213640
3     163153
1      31164
Name: count, dtype: int64

noncitizen
False    3076400
True      163153
Name: count, dtype: int64
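As a rough check that a boolean regressor behaves like a 0/1 dummy, the sketch below fits the same model twice on simulated data (the variables `code`, `flag`, `flag01` and `y` are made up for illustration).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: a four-valued code and an outcome that shifts when code == 3
rng = np.random.default_rng(1)
toy = pd.DataFrame({"code": rng.integers(0, 4, size=200)})
toy["flag"] = toy["code"] == 3            # boolean regressor
toy["flag01"] = toy["flag"].astype(int)   # the same information as 0/1
toy["y"] = 1.0 + 0.5 * toy["flag01"] + rng.normal(size=200)

bool_fit = smf.ols("y ~ flag", data=toy).fit()
int_fit = smf.ols("y ~ flag01", data=toy).fit()

# The boolean term carries the [T.True] suffix; the estimates coincide
print(bool_fit.params.index.tolist())
print(bool_fit.params.iloc[1], int_fit.params.iloc[1])
```

Only the label of the coefficient differs between the two fits; the slope itself is the same number.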

In [98]:
freq_MARST= pd.crosstab(acs['MARST'], columns='count')
freq_MARST

col_0,count
MARST,Unnamed: 1_level_1
1,1338592
2,60316
3,41986
4,296419
5,175626
6,1326614


# Regression with many categories

When working with categorical variables, polynomials or other specifications, the influence of one variable is captured by several regressors. In our example below, the effect of marital status is captured by five dummy variables: one for each of the six categories except the reference category.

Our model is of the form:

$$log(familyIncome) = \beta_0 + \beta_1*education + \beta_2*sex + \beta_3*age + \beta_4*marriedns + \beta_5*separated + \beta_6*divorced + \beta_7*widowed + \beta_8*single + u $$

Here *married* (`MARST` = 1) is the reference category, so it is left out of the equation.


To add this to your regression, the simplest way is to use **`C(varname)`**. This converts the variable into a categorical instead of a numerical one and creates a dummy variable for each category except the reference.

When you want to add variables that are arithmetic operations on other variables, you can add them directly with **`I(formula)`** instead of creating a separate variable. This is useful for creating dummies inside the regression; see the example of the sex variable, where `I(SEX-1)` turns the 1/2 coding into 0/1.
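A minimal sketch of `I()` on a made-up data set follows (the eight observations are hypothetical). Without `I()`, the formula parser would read `SEX - 1` as "include `SEX` and drop the intercept", so the wrapper is what makes the subtraction an arithmetic operation.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data with SEX coded 1/2, as in the ACS extract
toy = pd.DataFrame({
    "SEX": [1, 2, 1, 2, 2, 1, 2, 1],
    "y":   [10.0, 12.0, 11.0, 15.0, 13.0, 9.0, 14.0, 10.0],
})

# I(SEX - 1) is evaluated as arithmetic, giving a 0/1 dummy inside the formula
fit = smf.ols("y ~ I(SEX - 1)", data=toy).fit()

# With a single 0/1 regressor the slope equals the difference in group means
group_means = toy.groupby("SEX")["y"].mean()
mean_gap = group_means.loc[2] - group_means.loc[1]

print(fit.params)
print(mean_gap)
```

The intercept is the mean of the `SEX == 1` group and the slope is the gap between the two groups, which is the usual reading of a dummy coefficient.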




In [99]:
# directly using categorical variables in regression formula:
m7 = smf.ols(formula ="np.log(familyIncome) ~ HS +  AGE + I(SEX-1)", 
                      data=acs[acs["familyIncome"]>0]).fit()
m8 = smf.ols(formula ="np.log(familyIncome) ~ HS +  AGE +  I(SEX-1)+ C(MARST)", 
                      data=acs[acs["familyIncome"]>0]).fit()

# Create a vector with the variable list to organize the output of the results
variable_names = m8.params.index.tolist()
print(variable_names)

['Intercept', 'C(MARST)[T.2]', 'C(MARST)[T.3]', 'C(MARST)[T.4]', 'C(MARST)[T.5]', 'C(MARST)[T.6]', 'HS', 'AGE', 'I(SEX - 1)']


In [100]:
# print regression table:
ms = Stargazer([m7,m8])

# set the title, column labels and row order, then render once
ms.title('Regression on Wages')
ms.custom_columns(['All', 'With Marital Status'], [1, 1])
ms.covariate_order(['Intercept', 'HS', 'I(SEX - 1)', 'AGE', 'C(MARST)[T.2]', 'C(MARST)[T.3]', 'C(MARST)[T.4]', 'C(MARST)[T.5]', 'C(MARST)[T.6]'])
HTML(ms.render_html())


0,1,2
,,
,Dependent variable: np.log(familyIncome),Dependent variable: np.log(familyIncome)
,,
,All,With Marital Status
,(1),(2)
,,
Intercept,11.197***,11.910***
,(0.001),(0.002)
HS,0.260***,0.082***
,(0.002),(0.002)


### Choosing a new reference category


In [101]:
# rerun regression with different reference category:
reg_newref = smf.ols(formula='np.log(familyIncome) ~ HS +  I(SEX-1) + AGE + noncitizen + '
                             'C(MARST, Treatment(6))', 
                     data=acs[acs["familyIncome"]>0]).fit()


# print regression table:
m9s = Stargazer([reg_newref])
HTML(m9s.render_html())



0,1
,
,Dependent variable: np.log(familyIncome)
,
,(1)
,
AGE,-0.010***
,(0.000)
"C(MARST, Treatment(6))[T.1]",0.730***
,(0.002)
"C(MARST, Treatment(6))[T.2]",0.130***


# Creating the Dummies Separately
If you want to compare only one category with everything else, see below.

In [102]:
# Convert 'MARST' to categorical type
acs['MARST'] = acs['MARST'].astype('category')

# Create dummy variables for MARST
marst_dummies = pd.get_dummies(acs['MARST'], prefix='MARST')

# Concatenate the dummy variables with the original dataframe
acs = pd.concat([acs, marst_dummies], axis=1)

# Print the column info of the dataframe with the dummy variables
print(acs.info())
# BE VERY CAREFUL WITH THIS CELL: IF YOU RUN IT MORE THAN ONCE,
# YOU WILL ADD THE SAME DUMMY COLUMNS AGAIN

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3239553 entries, 0 to 3239552
Data columns (total 16 columns):
 #   Column        Dtype   
---  ------        -----   
 0   AGE           int64   
 1   SEX           int64   
 2   MARST         category
 3   CITIZEN       int64   
 4   EDUC          int64   
 5   EDUCD         int64   
 6   LOOKING       int64   
 7   familyIncome  float64 
 8   HS            int64   
 9   noncitizen    bool    
 10  MARST_1       bool    
 11  MARST_2       bool    
 12  MARST_3       bool    
 13  MARST_4       bool    
 14  MARST_5       bool    
 15  MARST_6       bool    
dtypes: bool(7), category(1), float64(1), int64(7)
memory usage: 222.4 MB
None
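One way to make the dummy-creation step safe to rerun is sketched below. The helper `add_dummies` is hypothetical (it is not part of pandas), and the five-row frame stands in for `acs`; the idea is simply to drop dummy columns left over from an earlier run before concatenating.

```python
import pandas as pd

# Hypothetical stand-in for the acs data
toy = pd.DataFrame({"MARST": [1, 6, 4, 1, 2]})

def add_dummies(df, column, prefix):
    """Hypothetical helper: add dummies for `column`, replacing any created earlier."""
    dummies = pd.get_dummies(df[column], prefix=prefix)
    # Drop dummy columns left over from a previous run before concatenating
    leftovers = [name for name in dummies.columns if name in df.columns]
    df = df.drop(columns=leftovers)
    return pd.concat([df, dummies], axis=1)

toy = add_dummies(toy, "MARST", "MARST")
toy = add_dummies(toy, "MARST", "MARST")   # a second run does not duplicate columns
print(toy.columns.tolist())
```

Because the helper returns a new frame instead of appending blindly, running the cell twice leaves exactly one copy of each dummy column.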


In [103]:
acs["MARST_1"].value_counts()

MARST_1
False    1900961
True     1338592
Name: count, dtype: int64

In [104]:
m10 = smf.ols(formula='np.log(familyIncome) ~ HS +  I(SEX-1) + AGE + MARST_1 + MARST_1:I(SEX-1) ',
              data=acs[acs["familyIncome"]>0]).fit()


# print regression table:
m10s = Stargazer([m10])
HTML(m10s.render_html())

0,1
,
,Dependent variable: np.log(familyIncome)
,
,(1)
,
AGE,-0.010***
,(0.000)
HS,0.072***
,(0.002)
I(SEX - 1),-0.064***
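The `:` operator in the formula above adds only the interaction term. As a sketch on simulated data (the 0/1 variables `female` and `married` and the coefficients used to simulate `y` are assumptions), the example below shows that writing the main effects plus `a:b` is the same model as the shorthand `a*b`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical 0/1 regressors and a simulated outcome with an interaction effect
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "female": rng.integers(0, 2, size=300),
    "married": rng.integers(0, 2, size=300),
})
toy["y"] = (1.0 + 0.3 * toy["married"] - 0.2 * toy["female"]
            + 0.4 * toy["married"] * toy["female"] + rng.normal(size=300))

# Main effects written out, plus the interaction term
colon = smf.ols("y ~ female + married + married:female", data=toy).fit()
# Shorthand: married*female expands to married + female + married:female
star = smf.ols("y ~ married*female", data=toy).fit()

print(colon.params)
print(star.params)
```

The interaction coefficient measures how the effect of `married` differs between the two values of `female`; it is zero when the marriage effect is the same for both groups.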


# Numeric variables into categories

Sometimes we need to turn numerical variables into categories because a linear relation with the dependent variable seems implausible or the interpretation is inconvenient, or simply because we want a different interpretation.

In [105]:
lawsch85 = woo.dataWoo('lawsch85')[['rank', 'lsalary', 'LSAT', 'GPA', 'libvol', 'cost']]
lawsch85.info()
lawsch85['rank'].describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   rank     156 non-null    int64  
 1   lsalary  148 non-null    float64
 2   LSAT     150 non-null    float64
 3   GPA      149 non-null    float64
 4   libvol   155 non-null    float64
 5   cost     150 non-null    float64
dtypes: float64(5), int64(1)
memory usage: 7.4 KB


count   156.000
mean     83.750
std      50.064
min       1.000
25%      40.750
50%      83.500
75%     125.500
max     175.000
Name: rank, dtype: float64

In the example below, the variable `rank` is the rank of the law school, a number between 1 and 175. We would like to compare schools across the groups in the list below.

|School Rank | 
|-----------| 
|top 10 |
|11-20 |
|21-30 |
|31-50 |
|51-100 | 
|above 100 | 


In the code below we create a variable for these categories. First define the cut points, and then create a new categorical variable based on these cut points using `pd.cut`.
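Before applying this to the data, it may help to see how `pd.cut` treats values that sit exactly on a cut point. The sketch below uses the same cut points and labels on a few hypothetical ranks; by default each bin is closed on the right, so rank 10 belongs to the first group and rank 11 to the second.

```python
import pandas as pd

cutpts = [0, 10, 20, 30, 50, 100, 175]
labels = ['top 10', '(10,20]', '(20,30]', '(30,50]', '(50,100]', '(100,175]']

# Hypothetical ranks sitting on and next to the cut points
ranks = pd.Series([1, 10, 11, 50, 51, 175])

# Default right=True: each bin includes its upper edge but not its lower edge
groups = pd.cut(ranks, bins=cutpts, labels=labels)
print(groups.tolist())
```

Starting the cut points at 0 rather than 1 matters for the same reason: with a lower edge of 1, the school ranked first would fall outside every bin and become missing.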

In [106]:
# define cut points for the rank:
cutpts = [0, 10, 20, 30, 50, 100, 175]

# create categorical variable containing ranges for the rank:
lawsch85['rc'] = pd.cut(lawsch85['rank'], bins=cutpts,
                       labels=['top 10', '(10,20]', '(20,30]',
                                '(30,50]', '(50,100]', '(100,175]'])

# display frequencies:
freq = pd.crosstab(lawsch85['rc'], columns='count')
freq

col_0,count
rc,Unnamed: 1_level_1
top 10,10
"(10,20]",12
"(20,30]",9
"(30,50]",17
"(50,100]",46
"(100,175]",62


Estimate the following equation, where *rankcat* is the categorical rank variable (`rc` in the code): $$ log(salary)= \beta_0 +\beta_1*rankcat + \beta_2*LSAT + \beta_3*GPA + \beta_4*log(libvol) + \beta_5*log(cost)$$ But first, set the reference category for the school ranking.

>  Choose the reference category inside the formula with `C(rc, Treatment(...))`. The code below uses the "top 10" group as the reference.

In [107]:
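# run regression:
# note: in lawsch85, lsalary is already log(salary), so np.log(lsalary) takes the log a second time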
reg1 = smf.ols(formula='np.log(lsalary) ~ rank + LSAT + GPA + np.log(libvol)+ np.log(cost)',data=lawsch85).fit()
reg2 = smf.ols(formula='np.log(lsalary) ~ C(rc, Treatment("top 10")) +'
              'LSAT + GPA + np.log(libvol)+ np.log(cost)',
              data=lawsch85).fit()
# Create a vector with the variable list to organize the output of the results
variable_names = reg2.params.index.tolist()
print(variable_names)

['Intercept', 'C(rc, Treatment("top 10"))[T.(10,20]]', 'C(rc, Treatment("top 10"))[T.(20,30]]', 'C(rc, Treatment("top 10"))[T.(30,50]]', 'C(rc, Treatment("top 10"))[T.(50,100]]', 'C(rc, Treatment("top 10"))[T.(100,175]]', 'LSAT', 'GPA', 'np.log(libvol)', 'np.log(cost)']


In [108]:
# try using chatgpt to create a dictionary with the variables from reg2 to improve the names for your regression output

ms = Stargazer([reg1,reg2])

ms.title('Regression on Ranking')
ms.custom_columns(['All', 'With Categories'], [1, 1])
# ms.covariate_order(['Intercept', 'LSAT', 'GPA', 'np.log(libvol)', 'np.log(cost)', 'rank',
#                    'C(rc, Treatment("top 10"))[T.(10,20]]', 'C(rc, Treatment("top 10"))[T.(20,30]]', 
#                    'C(rc, Treatment("top 10"))[T.(30,50]]', 'C(rc, Treatment("top 10"))[T.(50,100]]', 
#                    'C(rc, Treatment("top 10"))[T.(100,175]]'])
#ms.rename_covariates(new_coefficient_names)

HTML(ms.render_html())

0,1,2
,,
,Dependent variable: np.log(lsalary),Dependent variable: np.log(lsalary)
,,
,All,With Categories
,(1),(2)
,,
"C(rc, Treatment(""top 10""))[T.(10,20]]",,-0.008**
,,(0.004)
"C(rc, Treatment(""top 10""))[T.(100,175]]",,-0.064***
,,(0.005)
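The long term names in the table come straight from the formula. One possible way to build the renaming dictionary mentioned in the cell above is sketched here; the helper `tidy_name` and the label style are assumptions, and the list of names is copied from the printed `variable_names`. The resulting dictionary is what the commented-out `ms.rename_covariates(new_coefficient_names)` line would receive.

```python
import re

# Term names as printed from reg2.params.index.tolist()
variable_names = [
    'Intercept',
    'C(rc, Treatment("top 10"))[T.(10,20]]',
    'C(rc, Treatment("top 10"))[T.(20,30]]',
    'C(rc, Treatment("top 10"))[T.(30,50]]',
    'C(rc, Treatment("top 10"))[T.(50,100]]',
    'C(rc, Treatment("top 10"))[T.(100,175]]',
    'LSAT', 'GPA', 'np.log(libvol)', 'np.log(cost)',
]

def tidy_name(name):
    """Hypothetical helper: turn a formula term into a readable label."""
    match = re.search(r'\[T\.(.+)\]$', name)
    if match:                       # dummy created by C(...): keep only the category
        return f"Rank {match.group(1)}"
    match = re.fullmatch(r'np\.log\((.+)\)', name)
    if match:                       # logged regressor: drop the np. prefix
        return f"log({match.group(1)})"
    return name

new_coefficient_names = {name: tidy_name(name) for name in variable_names}
for old, new in new_coefficient_names.items():
    print(f"{old!r} -> {new!r}")
```

Building the dictionary from the model's own term list avoids typing the long names by hand, which is where most renaming mistakes come from.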


# Categorical dependent variables 

When you have a categorical (binary) dependent variable, you can use a regular OLS model, which is then a linear probability model (LPM), or you can use a logit or probit model.

The Python code for these last two models is:

# Estimate logit model:

Your y variable is binary 0 or 1 

>`reg_logit = smf.logit(formula='y ~ x1 + x2 + ...+ xn',
                      data=mydata)`

`disp=0` avoids printing out information during the estimation:

>`results_logit = reg_logit.fit(disp=0)`


# Estimate probit model:
>`reg_probit = smf.probit(formula='y ~ x1 + x2 + ...+ xn',
                      data=mydata)
results_probit = reg_probit.fit(disp=0)`
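The templates above can be tried on simulated data. The sketch below is only an illustration: `mydata`, `x1`, `x2` and the coefficients used to generate `y` are made up, and the LPM is included for comparison.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: a binary outcome generated from a latent logistic index
rng = np.random.default_rng(3)
n = 500
mydata = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
latent = 0.5 + 1.0 * mydata["x1"] - 0.5 * mydata["x2"] + rng.logistic(size=n)
mydata["y"] = (latent > 0).astype(int)

# Linear probability model, logit and probit on the same formula
results_lpm = smf.ols("y ~ x1 + x2", data=mydata).fit()
results_logit = smf.logit("y ~ x1 + x2", data=mydata).fit(disp=0)
results_probit = smf.probit("y ~ x1 + x2", data=mydata).fit(disp=0)

# Fitted probabilities from logit and probit always lie between 0 and 1
p_logit = results_logit.predict(mydata)
p_probit = results_probit.predict(mydata)
p_lpm = results_lpm.predict(mydata)

print(results_logit.params)
print(results_probit.params)
print(p_lpm.min(), p_lpm.max())
```

The logit and probit coefficients are on different scales and are not directly comparable with the LPM slopes; comparing fitted probabilities or marginal effects is the safer way to line the three models up.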

In [109]:
!jupyter nbconvert --to html H11_320Lab_QualitativePart2.ipynb




&nbsp;
<hr />
<p style="font-family:palatino; text-align: center;font-size: 15px">ECON320 Python Programming Laboratory</p>
<p style="font-family:palatino; text-align: center;font-size: 15px">Professor <em> Paloma Lopez de mesa Moyano</em></p>
<p style="font-family:palatino; text-align: center;font-size: 15px"><span style="color: #6666FF;"><em>paloma.moyano@emory.edu</em></span></p>

<p style="font-family:palatino; text-align: center;font-size: 15px">Department of Economics</p>
<p style="font-family:palatino; text-align: center; color: #012169;font-size: 15px">Emory University</p>

&nbsp;