# Foundations of Computational Economics #17

by Fedor Iskhakov, ANU

<img src="_static/img/dag3logo.png" style="width:256px;">

## Linear regression using Pandas and Numpy

<img src="_static/img/lab.png" style="width:64px;">

<img src="_static/img/youtube.png" style="width:65px;">

[https://youtu.be/LafDXp28IRE](https://youtu.be/LafDXp28IRE)

Description: Using Numpy and Pandas to estimate a simple linear regression.

### Linear regression

Recall the classic linear regression model with data in columns of
$ (X,y) $, where $ X $ contains the independent variables and $ y $ is
the dependent variable.
The parameter vector to be estimated is $ \beta $, and we assume that
the errors follow $ \varepsilon \sim N(0, \sigma) $:

$$
y = X \beta + \varepsilon \quad \quad \varepsilon \sim N(0, \sigma)
$$

Let $ \hat{\beta} $ denote the estimate of the parameters $ \beta $.
To find it, we minimize the sum of squares of the residuals
$ e = y - X \hat{\beta} $, i.e. $ e'e \longrightarrow_{\hat{\beta}} \min $,
which leads to the well known OLS formula

$$
\hat{\beta} = (X'X)^{-1} X' y
$$
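
The step from the minimization problem to this formula can be sketched as follows, assuming $ X'X $ is invertible (that is, $ X $ has full column rank). Expanding the sum of squared residuals gives

$$
e'e = (y - X \hat{\beta})'(y - X \hat{\beta}) = y'y - 2 \hat{\beta}' X' y + \hat{\beta}' X'X \hat{\beta}
$$

and setting the derivative with respect to $ \hat{\beta} $ to zero gives the normal equations

$$
\frac{\partial \, e'e}{\partial \hat{\beta}} = -2 X' y + 2 X'X \hat{\beta} = 0
\quad \Longrightarrow \quad
X'X \hat{\beta} = X' y
$$

Multiplying both sides by $ (X'X)^{-1} $ yields the OLS formula above.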

The standard error of the regression (labelled MSE in the report printed by the code in this notebook) is calculated as $ s = \sqrt{\frac{1}{n-k} e'e} $,
where $ n $ is the number of observations and $ k $ is the number of parameters (elements in $ \beta $).

The variance-covariance matrix of the estimates is given by $ \hat{\Sigma} = s^2 (X'X)^{-1} $.
The square roots of the diagonal elements of this matrix are the standard deviations of the estimates, and give us a measure of the accuracy of the estimated parameters.
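
One way to see what these standard deviations measure is a small simulation. The sketch below is only an illustration under assumptions that are not part of the lecture: the design matrix, the parameter values, and the error standard deviation `sigma` are made up, and `sigma` is treated as known rather than estimated by $ s $. It re-estimates the model on many simulated samples and compares the spread of the estimates with the square roots of the diagonal of $ \sigma^2 (X'X)^{-1} $.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# hypothetical design: a constant plus one regressor, n = 50 observations
n, sigma = 50, 0.5
X = np.column_stack((np.ones(n), np.linspace(0, 1, n)))
beta_true = np.array([1.0, 2.0])  # made-up parameter values

xxinv = np.linalg.inv(X.T @ X)  # inv(X'X)
# theoretical standard deviations of the estimates when sigma is known
theory_std = sigma * np.sqrt(np.diag(xxinv))

# re-estimate the model on many simulated samples with the same X
reps = 5000
eps = rng.normal(scale=sigma, size=(n, reps))  # one column of errors per sample
Y = (X @ beta_true)[:, np.newaxis] + eps       # n x reps matrix of simulated y
betas = xxinv @ X.T @ Y                        # k x reps matrix of OLS estimates
simulated_std = betas.std(axis=1)              # spread of estimates across samples

print('theoretical std:', theory_std)
print('simulated std:  ', simulated_std)
```

The two printed vectors should be close to each other, and the agreement improves as `reps` grows.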

[William Greene “Econometric Analysis”](https://books.google.com.au/books?id=LWQuAAAAQBAJ&dq=greene%20econometric%20analysis)

In [None]:
import numpy as np
def ols(X,y,addConstant=True,verbose=True):
    '''Return the OLS estimates and their variance-covariance matrix for the given data X,y
    When addConstant is True, constant is added to X
    When verbose is True, a report is printed
    '''
    pass

In [None]:
# test on small dataset
X = np.array([[5, 3],
              [2, 3],
              [3, 1],
              [2, 8],
              [4.5, 2.5],
              [2.5, 1.5],
              [4.3, 4.2],
              [0.5, 3.5],
              [1, 5],
              [3, 8]])
truebeta = np.array([1.234,-0.345])[:,np.newaxis]  # column vector
y = X @ truebeta + 2.5 + np.random.normal(size=(X.shape[0],1),scale=0.2)

beta,S=ols(X,y)
beta,S=ols(X,y,addConstant=False)

In [None]:
# test with one dimensional arrays
X = np.array([1,2,3,4,5,6,7,8,9,10])
y = np.array([9.4,8.1,7.7,6.3,5.7,4.4,3.0,2.1,1.1,0.8])

beta,S=ols(X,y)
beta,S=ols(X,y,addConstant=False)

In [None]:
import numpy as np
def ols(X,y,addConstant=True,verbose=True):
    '''Return the OLS estimates and their variance-covariance matrix for the given data X,y
    When addConstant is True, constant is added to X
    When verbose is True, a report is printed
    '''
    X = np.asarray(X,dtype=float)  # accept lists, NumPy arrays and pandas objects
    y = np.asarray(y,dtype=float).squeeze()  # we are better off if y is one-dimensional
    if X.ndim==1:
        X = X[:,np.newaxis]  # treat a one-dimensional X as a single column
    if addConstant:
        X = np.hstack((np.ones((X.shape[0],1)),X))  # prepend a column of ones
    n,k = X.shape  # number of observations and parameters (including the constant)
    xxinv = np.linalg.inv(X.T@X)  # inv(X'X)
    beta = xxinv @ X.T@y  # OLS estimates
    e = y - X@beta  # residuals
    s2 = e@e / (n-k)  # estimate of the error variance
    Sigma = s2*xxinv  # variance-covariance matrix of the estimates
    if verbose:
        # report the estimates
        print('Number of observations: {:d}\nNumber of parameters: {:d}'.format(n,k))
        print('Parameter estimates (std in brackets)')
        for b,s in zip(beta,np.sqrt(np.diag(Sigma))):
            print('{:10.5f} ({:10.5f})'.format(b,s))
        print('MSE = {:1.5f}\n'.format(np.sqrt(s2)))
    return beta,Sigma
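
As a cross-check that does not depend on the `ols` function, the same estimates can be obtained without forming the inverse explicitly. The sketch below uses synthetic data (the sample size, seed, and parameter values are made up for illustration) and compares the textbook formula with `np.linalg.solve` applied to the normal equations and with the least squares solver `np.linalg.lstsq`. Avoiding the explicit inverse is generally the numerically safer choice when $ X'X $ is close to singular.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, synthetic data for illustration only
n = 200
X = np.column_stack((np.ones(n), rng.normal(size=(n, 2))))  # constant + 2 regressors
beta_true = np.array([2.5, 1.234, -0.345])                  # made-up parameters
y = X @ beta_true + rng.normal(scale=0.2, size=n)

# textbook formula with an explicit inverse
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y
# solve the normal equations X'X b = X'y without forming the inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)
# least squares solver working directly on X and y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_inv)
print(beta_solve)
print(beta_lstsq)
```

On well-conditioned data such as this, all three approaches should agree up to floating point error.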

### Data on median wages

**The Economic Guide To Picking A College Major**

Data dictionary available at

[https://github.com/fivethirtyeight/data/tree/master/college-majors](https://github.com/fivethirtyeight/data/tree/master/college-majors)

In [None]:
import pandas as pd
# same data as in video 15
data = pd.read_csv('./_static/data/recent-grads.csv')

In [None]:
data.info()

In [None]:
data.head(n=15)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
data.plot(x='ShareWomen', y='Median', kind='scatter', figsize=(10, 8), color='red')
plt.xlabel('Share of women')
plt.ylabel('Median salary')
# add a linear regression line to the plot

In [None]:
print(data[['Median','ShareWomen']].isnull().sum()) # check if there are NaNs in the data!
data1 = data[['Median','ShareWomen']].dropna()  # drop NaNs
data1.plot(x='ShareWomen', y='Median', kind='scatter', figsize=(10, 8), color='red')
plt.xlabel('Share of women')
plt.ylabel('Median salary')
# add a linear regression line to the plot
b,_ = ols(X=data1['ShareWomen'].to_numpy(),y=data1['Median'].to_numpy(),verbose=False)  # use data1: NaNs in data would make the estimates NaN
fn = lambda x: b[0]+b[1]*x
xx = np.linspace(0,1,100)
plt.plot(xx,fn(xx),color='navy',linewidth=3)
plt.show()
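
For a single regressor, the fitted line can also be cross-checked with `np.polyfit`, which fits a polynomial by least squares and returns the coefficients from the highest degree down (slope first, then intercept, for degree 1). The sketch below is an illustration on made-up data, not on the college majors dataset: the variables `share` and `median` and the numbers generating them are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)       # fixed seed, synthetic data
share = rng.uniform(0, 1, size=100)  # hypothetical regressor between 0 and 1
median = 55000 - 30000 * share + rng.normal(scale=5000, size=100)  # made-up outcome

# degree-1 polynomial fit: coefficients are returned as (slope, intercept)
slope, intercept = np.polyfit(share, median, deg=1)

# the same line from the normal equations with a constant in the first column
X = np.column_stack((np.ones_like(share), share))
b = np.linalg.solve(X.T @ X, X.T @ median)

print('polyfit:          intercept = {:.2f}, slope = {:.2f}'.format(intercept, slope))
print('normal equations: intercept = {:.2f}, slope = {:.2f}'.format(b[0], b[1]))
```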

In [None]:
# create fraction variables
data.drop(index=data[data['Total']==0].index,inplace=True)  # drop zero Totals
data.drop(index=data[data['Employed']==0].index,inplace=True)  # drop zero Employed
data['Employment rate'] = data['Employed'] / data['Total']
data['Fulltime rate'] = data['Full_time'] / data['Employed']
data2 = data[['Median','ShareWomen','Employment rate','Fulltime rate']].dropna()  # drop NaNs
y = data2['Median']/1000  # rescale salary
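
The same preparation can be written without modifying the original table in place. The following is a minimal sketch on a tiny made-up DataFrame that only borrows the column names from the dataset; the names `df`, `mask` and `clean` are hypothetical. Note one difference: the mask also excludes rows with missing values in `Total` or `Employed`, whereas the cell above leaves those for the later `dropna()`.

```python
import pandas as pd

# tiny made-up table with the same column names as the dataset
df = pd.DataFrame({
    'Total':     [100, 0, 250, 80],
    'Employed':  [80, 0, 0, 60],
    'Full_time': [60, 0, 0, 45],
    'Median':    [40000, 30000, 35000, 50000],
})

# keep only rows where both denominators are positive
mask = (df['Total'] > 0) & (df['Employed'] > 0)
clean = df.loc[mask].copy()  # copy so that the new columns do not touch df

# fraction variables, safe from division by zero after filtering
clean['Employment rate'] = clean['Employed'] / clean['Total']
clean['Fulltime rate'] = clean['Full_time'] / clean['Employed']
print(clean)
```

Working on a filtered copy keeps the raw data available, which helps when a notebook cell is run more than once.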

In [None]:
# run the full model
ols(data2[['ShareWomen','Employment rate','Fulltime rate']],y);

#### Further learning resources

- Regression analysis using the `sklearn` library
  [https://datascience.quantecon.org/applications/regression.html](https://datascience.quantecon.org/applications/regression.html)  