# Part 1 - Cross-Section Data

## Importing the libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


## Load data

In [4]:
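# Whitespace-delimited file without a header row; the raw string avoids an invalid-escape warning for '\s'
df = pd.read_csv('../data/HPRICE2.raw', sep=r'\s+', header=None)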


### Show first 5 rows of the dataframe

In [5]:
df.head()

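|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24000 | 0.006 | 5.38 | 6.57 | 4.09 | 1 | 29.6 | 15.3 | 4.98 | 10.08581 | 1.682688 | 5.69036 |
| 1 | 21599 | 0.027 | 4.69 | 6.42 | 4.97 | 2 | 24.2 | 17.8 | 9.14 | 9.980402 | 1.545433 | 5.488938 |
| 2 | 34700 | 0.027 | 4.69 | 7.18 | 4.97 | 2 | 24.2 | 17.8 | 4.03 | 10.4545 | 1.545433 | 5.488938 |
| 3 | 33400 | 0.032 | 4.58 | 7.0 | 6.06 | 3 | 22.2 | 18.7 | 2.94 | 10.41631 | 1.521699 | 5.402678 |
| 4 | 36199 | 0.069 | 4.58 | 7.15 | 6.06 | 3 | 22.2 | 18.7 | 5.33 | 10.49679 | 1.521699 | 5.402678 |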


### Assign Column Names

In [8]:
# Assign column names, fetched from the DES file 
column_names = [
    "price",      # median housing price, $
    "crime",      # crimes committed per capita
    "nox",        # nitrous oxide, parts per 100 mill.
    "rooms",      # avg number of rooms per house
    "dist",       # weighted distance to 5 employment centers
    "radial",     # accessibility index to radial highways
    "proptax",    # property tax per $1000
    "stratio",    # average student-teacher ratio
    "lowstat",    # % of people 'lower status'
    "lprice",     # log(price)
    "lnox",       # log(nox)
    "lproptax"    # log(proptax)
]

# Assign these column names to the DataFrame
df.columns = column_names

### Show first 5 rows of the dataframe

In [9]:
df.head()

|   | price | crime | nox | rooms | dist | radial | proptax | stratio | lowstat | lprice | lnox | lproptax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24000 | 0.006 | 5.38 | 6.57 | 4.09 | 1 | 29.6 | 15.3 | 4.98 | 10.08581 | 1.682688 | 5.69036 |
| 1 | 21599 | 0.027 | 4.69 | 6.42 | 4.97 | 2 | 24.2 | 17.8 | 9.14 | 9.980402 | 1.545433 | 5.488938 |
| 2 | 34700 | 0.027 | 4.69 | 7.18 | 4.97 | 2 | 24.2 | 17.8 | 4.03 | 10.4545 | 1.545433 | 5.488938 |
| 3 | 33400 | 0.032 | 4.58 | 7.0 | 6.06 | 3 | 22.2 | 18.7 | 2.94 | 10.41631 | 1.521699 | 5.402678 |
| 4 | 36199 | 0.069 | 4.58 | 7.15 | 6.06 | 3 | 22.2 | 18.7 | 5.33 | 10.49679 | 1.521699 | 5.402678 |


## Questions

1. State the fundamental hypothesis under which the Ordinary Least Squares (OLS) estimators are unbiased.



2. Show that under this assumption the OLS estimators are indeed unbiased.

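A sketch of the standard argument for question 2, written in matrix notation. The symbols $y$, $X$, $\beta$ and $u$ are notation assumed here rather than taken from the course material, and the argument assumes the linear model $y = X\beta + u$ with $X'X$ invertible and the zero conditional mean assumption $E[u \mid X] = 0$:

$$
\hat{\beta} = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + u) = \beta + (X'X)^{-1}X'u
$$

$$
E[\hat{\beta} \mid X] = \beta + (X'X)^{-1}X'\,E[u \mid X] = \beta
$$

Taking expectations over $X$ (law of iterated expectations) then gives $E[\hat{\beta}] = \beta$.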


3. Explain the sample selection bias with an example from the course.



4. Explain the omitted variable bias with an example from the course.


5. Explain the problem of multicollinearity. Is it a problem in this dataset?


6. Create three categories of `nox` levels (low, medium, high), corresponding to the following percentile ranges: 0-25%, 25-75%, 75-100%.

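One possible way to build the categories of question 6 is to cut `nox` at its 25th and 75th percentiles with `pd.cut`. This is a minimal sketch on synthetic data: the frame `demo` and its random `nox` values are stand-ins invented for illustration, and the column name `nox_cat` is hypothetical. On the real data the same steps would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the `nox` column (assumption: real values come from HPRICE2.raw)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"nox": rng.uniform(3.8, 8.7, size=200)})

# Cut points at the 25th and 75th percentiles of nox
q25, q75 = demo["nox"].quantile([0.25, 0.75])

# Right-closed bins: up to q25 is "low", above q25 up to q75 is "medium", above q75 is "high"
demo["nox_cat"] = pd.cut(
    demo["nox"],
    bins=[-np.inf, q25, q75, np.inf],
    labels=["low", "medium", "high"],
)

# Number of observations per category
print(demo["nox_cat"].value_counts())
```

Explicit cut points are used here (rather than a quantile-based helper) so that the unequal 25/50/25 split is visible in the code.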


7. Compute for each category of `nox` level the average median price and comment on your results.

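For question 7, a group-wise mean can be computed with `groupby`. The sketch below again uses synthetic data: `demo`, `nox_cat`, and the made-up relationship between `price` and `nox` are assumptions chosen only so the example runs on its own, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption: not the HPRICE2 values)
rng = np.random.default_rng(0)
nox = rng.uniform(3.8, 8.7, size=200)
price = 40000 - 3000 * nox + rng.normal(0, 2000, size=200)
demo = pd.DataFrame({"nox": nox, "price": price})

# Same categories as in question 6: cut at the 25th and 75th percentiles
q25, q75 = demo["nox"].quantile([0.25, 0.75])
demo["nox_cat"] = pd.cut(
    demo["nox"],
    bins=[-np.inf, q25, q75, np.inf],
    labels=["low", "medium", "high"],
)

# Average price within each nox category
mean_price = demo.groupby("nox_cat", observed=True)["price"].mean()
print(mean_price.round(0))
```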


8. Produce a scatter plot with the variable `price` on the y-axis and the variable `nox` on the x-axis. Is this a ceteris paribus effect?



9. Run a regression of `price` on a constant, `crime`, `nox`, `rooms`, and `proptax`. Comment on the histogram of the residuals. Interpret all coefficients.

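The regression of question 9 can be sketched with plain NumPy least squares. Note that this swaps in `np.linalg.lstsq` in place of whatever regression library the course uses, and that the data are synthetic: the frame `demo` and the coefficients of its data-generating process are invented for illustration only. The residual histogram is computed numerically with `np.histogram`; passing `residuals` to `plt.hist` would draw it.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data with a known, hypothetical data-generating process
rng = np.random.default_rng(1)
n = 500
demo = pd.DataFrame({
    "crime": rng.exponential(3.0, n),
    "nox": rng.uniform(3.8, 8.7, n),
    "rooms": rng.normal(6.3, 0.7, n),
    "proptax": rng.uniform(20, 70, n),
})
demo["price"] = (
    -20000 - 150 * demo["crime"] - 600 * demo["nox"]
    + 8000 * demo["rooms"] - 90 * demo["proptax"]
    + rng.normal(0, 3000, n)
)

# Design matrix: a constant plus the four regressors
regressors = ["crime", "nox", "rooms", "proptax"]
X = np.column_stack([np.ones(n), demo[regressors].to_numpy()])
y = demo["price"].to_numpy()

# OLS estimates and residuals
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

coef = pd.Series(beta_hat, index=["const"] + regressors)
print(coef.round(2))

# Bin counts and bin edges of the residual histogram
counts, edges = np.histogram(residuals, bins=20)
```

In a level-level specification like this one, each slope is read as the change in `price` (in dollars) associated with a one-unit change in the regressor, holding the others fixed.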


10. Run a regression of `lprice` on a constant, `crime`, `nox`, `rooms`, and `proptax`. Interpret all coefficients.



11. Run a regression of `lprice` on a constant, `crime`, `lnox`, `rooms`, and `lproptax`. Interpret all coefficients.



12. In the specification of question 9, test the hypothesis $H_0: \beta_{nox} = 0$ vs. $H_1: \beta_{nox} \neq 0$ at the 1% level using the p-value of the test.



13. In the specification of question 9, test the hypothesis $H_0: \beta_{crime} = \beta_{proptax}$ at the 10% level.



14. In the specification of question 9, test the hypothesis $H_0: \beta_{nox} = 0, \beta_{proptax} = 0$ at the 10% level.

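A joint exclusion restriction such as the one in question 14 can be tested by comparing the sum of squared residuals (SSR) of the restricted and unrestricted models. The sketch below is one way to compute the F statistic under the assumption of homoskedastic errors; the helper `ssr` is a hypothetical name, and the data are synthetic rather than the HPRICE2 values.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

# Synthetic stand-in data (hypothetical data-generating process)
rng = np.random.default_rng(2)
n = 500
crime = rng.exponential(3.0, n)
nox = rng.uniform(3.8, 8.7, n)
rooms = rng.normal(6.3, 0.7, n)
proptax = rng.uniform(20, 70, n)
price = (-20000 - 150 * crime - 600 * nox + 8000 * rooms
         - 90 * proptax + rng.normal(0, 3000, n))

const = np.ones(n)
X_u = np.column_stack([const, crime, nox, rooms, proptax])  # unrestricted model
X_r = np.column_stack([const, crime, rooms])                # nox and proptax dropped under H0

q = 2                # number of restrictions
k = X_u.shape[1]     # parameters in the unrestricted model
ssr_u, ssr_r = ssr(X_u, price), ssr(X_r, price)

# F = [(SSR_r - SSR_u) / q] / [SSR_u / (n - k)]
F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))
print(f"F({q}, {n - k}) = {F:.2f}")
```

$H_0$ is rejected at the 10% level when `F` exceeds the 90th percentile of the $F(q, n-k)$ distribution; if SciPy is available, `scipy.stats.f.sf(F, q, n - k)` gives the corresponding p-value.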


15. In the specification of question 9, test the hypothesis $H_0: \beta_{nox} = -500, \beta_{proptax} = -100$ at the 10% level using the p-value of the test.



16. In the specification of question 9, test the hypothesis that all coefficients are the same for observations with low levels of `nox` vs. medium and high levels of `nox`.

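Question 16 describes a Chow-type test: the pooled regression is compared with separate regressions on the two groups. A minimal sketch under homoskedasticity follows, using synthetic data and assuming that "low" means `nox` at or below its 25th percentile; the helper `ssr` and the variable names are hypothetical.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

# Synthetic stand-in data (hypothetical data-generating process)
rng = np.random.default_rng(3)
n = 500
crime = rng.exponential(3.0, n)
nox = rng.uniform(3.8, 8.7, n)
rooms = rng.normal(6.3, 0.7, n)
proptax = rng.uniform(20, 70, n)
price = (-20000 - 150 * crime - 600 * nox + 8000 * rooms
         - 90 * proptax + rng.normal(0, 3000, n))

X = np.column_stack([np.ones(n), crime, nox, rooms, proptax])
k = X.shape[1]

# Group indicator: low nox vs. medium and high nox
low = nox <= np.quantile(nox, 0.25)

# Restricted model: one set of coefficients for everyone
ssr_pooled = ssr(X, price)
# Unrestricted model: separate coefficients in each group
ssr_split = ssr(X[low], price[low]) + ssr(X[~low], price[~low])

# k restrictions (all coefficients equal across groups), n - 2k residual degrees of freedom
F = ((ssr_pooled - ssr_split) / k) / (ssr_split / (n - 2 * k))
print(f"Chow F({k}, {n - 2 * k}) = {F:.2f}")
```

Because the synthetic data are generated with identical coefficients in both groups, this particular `F` should typically be small; on real data the statistic is compared with the relevant $F(k, n-2k)$ critical value.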


17. Repeat the test of question 16 but now assuming that only the coefficients of `nox` and `proptax` can change between the two groups of observations. State and test $H_0$.
