In [None]:
import stata_setup, os
if os.name == 'nt':
    stata_setup.config('C:/Program Files/Stata17/','mp')
else:
    stata_setup.config('/usr/local/stata17','mp')

## Household Survey Data - March 2009 Current Population Survey

The Current Population Survey (CPS) is a monthly survey of about 57,000 U.S. households conducted by the Bureau of the Census for the Bureau of Labor Statistics. This extract from the March 2009 survey keeps individuals with non-allocated variables who were employed full time (defined as having worked at least 36 hours per week for at least 48 weeks in the past year) and excludes those in the military. The sample has 50,742 individuals.

<blockquote>
Hansen, B. 2022. <em>Econometrics</em>. Princeton University Press 1st edition.
</blockquote>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="https://www.ssc.wisc.edu/~bhansen/econometrics/">https://www.ssc.wisc.edu/~bhansen/econometrics/</a>

In [None]:
%%stata
use ../Data/cps09mar, clear
describe

In [None]:
%%stata
vl set

In [None]:
%%stata
vl list (_all), sort

The following code <em>moves</em> variables from `vluncertain` to either the continuous set or the categorical set:

In [None]:
%%stata
quietly vl move (age education hour) vlcontinuous
quietly vl move (race) vlcategorical
vl list (_all), sort

## Effect of Traffic-Related Air Pollution on Attention in Primary School Children

This real-world dataset includes children’s performance on a test of reaction time, levels of nitrogen dioxide (NO2) pollution, the children’s physical and socioeconomic characteristics, and some other environmental factors. The data were collected and analyzed by

<blockquote>
Sunyer, J., E. Suades-González, R. García-Esteban, I. Rivas, J. Pujol, M. Alvarez-Pedrerol, J. Forns, X. Querol, and X. Basagaña. 2017. Traffic-related air pollution and attention in primary school children: Short-term association. <em>Epidemiology</em> 28: 181–189.
</blockquote>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="https://doi.org/10.1097/EDE.0000000000000603">https://doi.org/10.1097/EDE.0000000000000603</a>

Our interest is in how levels of nitrogen dioxide in the classroom affect the children’s performance on the test, while adjusting for other factors. We will focus on two _outcomes_ from the Attention Network Test (ANT):

👉🏼 Reaction time (continuous)

👉🏼 Omissions (count)

In [None]:
%%stata
use ../Data/breathe, clear
describe

Our goal is to create two lists of control covariates, that is, independent variables. One list will contain continuous control covariates and the other will contain categorical control covariates. Why not just one list? Because we want each categorical variable to enter our model as a set of indicator variables, one for each level (distinct value) of the variable. To expand a categorical variable into indicator variables for its levels, we must prefix it with ```i.```, for example, ```i.grade```.
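As a minimal sketch of why two lists are convenient, the Python helper below assembles a Stata command string in which only the categorical controls receive the ```i.``` prefix. The function name `build_regress_command`, the choice of `regress`, and every variable name other than `grade` are assumptions made for illustration; check the real names against the output of `describe`.

```python
def build_regress_command(outcome, continuous, categorical):
    """Return a Stata regress command string.

    Continuous controls are passed through unchanged; categorical
    controls get the i. prefix so Stata expands them into indicators.
    """
    terms = list(continuous) + [f"i.{name}" for name in categorical]
    return f"regress {outcome} " + " ".join(terms)


# Placeholder variable names (assumptions, not taken from the dataset).
continuous_controls = ["age", "no2_class"]
categorical_controls = ["grade", "sex"]

command = build_regress_command("react", continuous_controls, categorical_controls)
print(command)  # regress react age no2_class i.grade i.sex
```

Once Stata is configured, a string built this way could be passed to `stata.run(command)` from `pystata`.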

In [None]:
%%stata
vl set

In [None]:
%%stata
display "$vlcategorical"

## College Proximity

In an influential paper, David Card (1995) suggested that if a potential student lives close to a college, this reduces the cost of attendance and thereby raises the likelihood that the student will attend college. However, college proximity does not directly affect a student’s skills or abilities, so it should not have a direct effect on his or her market wage. These considerations suggest that college proximity can be used as an <em>instrument</em> for education in a wage regression. Card used data from the National Longitudinal Survey of Young Men (NLSYM) for 1976. We drop observations for which `wage` is missing. The remaining sample has 3,010 observations.

<blockquote>
Card, D. 1995. Using geographic variation in college proximity to estimate the return to schooling, in L. N. Christofides, E. K. Grant, and R. Swidinsky (editors) <em>Aspects of Labor Market
Behavior: Essays in Honour of John Vanderkamp</em>, University of Toronto Press.  
</blockquote>    

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="https://doi.org/10.3386/w4483">https://doi.org/10.3386/w4483</a>

In [None]:
%%stata
use ../Data/Card1995, clear
describe

## Determinants of Wages

The log of married women’s wages (```lwage```) is modeled as a function of their experience (```exper```), the square of their experience, and their years of education (```educ```). Collectively, these are called _exogenous_ covariates.

<blockquote>
Mroz, T. A. 1987. The sensitivity of an empirical model of married women’s hours of work to economic and statistical
assumptions. <em>Econometrica</em> 55: 765–799.    
</blockquote>    

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="https://doi.org/10.2307/1911029">https://doi.org/10.2307/1911029</a>

As is customary, education is treated as an endogenous variable. The reasoning is that we cannot measure innate ability, and ability is likely to influence both education level and income. Some disciplines refer to this as unobserved confounding rather than endogeneity. Either way, you cannot just run a regression of wages on education and experience and learn anything about the true effect of education on wages.

You need more information from variables that you presume are not affected by the woman’s unmeasured ability. Let’s call them __instruments__. They also must not belong in the model for wages. We will use the women’s mothers’ education (```motheduc```), their fathers’ education (```fatheduc```), and their husbands’ education (```huseduc```) as instruments for the woman’s education. The instruments are also required to be _exogenous_, but we will just call them instruments.
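The argument above can be illustrated with a small simulation. This is a sketch on simulated data, not the Mroz data: it uses a single instrument instead of three, a simple covariance-ratio IV estimator, and hypothetical names throughout (`simulate`, `ability`, `instrument`). The true effect of education is set to 0.5 by assumption. Under this design the OLS slope converges to about 0.83 (0.5 plus a bias of 1/3) in large samples, while the IV estimate converges to 0.5.

```python
import random


def covariance(u, v):
    """Sample covariance of two equal-length sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / (n - 1)


def simulate(n=20000, beta=0.5, seed=12345):
    """Simulate (instrument, education, log wage) with unobserved ability."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        ability = rng.gauss(0, 1)      # unobserved; drives education and wages
        instrument = rng.gauss(0, 1)   # independent of ability by construction
        educ = instrument + ability + rng.gauss(0, 1)
        lwage = beta * educ + ability + rng.gauss(0, 1)
        z.append(instrument)
        x.append(educ)
        y.append(lwage)
    return z, x, y


z, x, y = simulate()

# OLS slope of lwage on educ: biased upward because ability is omitted.
ols_est = covariance(x, y) / covariance(x, x)

# IV estimate: uses only the variation in educ induced by the instrument.
iv_est = covariance(z, y) / covariance(z, x)

print(f"true effect: 0.50  OLS: {ols_est:.3f}  IV: {iv_est:.3f}")
```

The IV estimate is only as good as the two assumptions stated above: the instrument must be unrelated to ability and must not enter the wage equation directly.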

In [None]:
%%stata
use ../Data/mroz, clear
* Base list of exogenous covariates
vl create exogbase = (exper age husage kidslt6 kidsge6 city)
* Base list of instruments for education
vl create instbase = (motheduc fatheduc huseduc)

In [None]:
from pystata import stata
stata.run('describe')