# Econometrics Assignment 1
## Question 5

Consider the NLS-Y data set (US National Longitudinal Survey of Youth), which covers a sample of young men surveyed in 1980. The file `wage.dat` provides data on RNS, MRT, SMSA, MED, KWW, IQ, AGE, S, EXPR, TENURE, LW. These variables are defined as follows:

- `RNS`: dummy for residency in the southern states
- `MRT`: dummy for marital status
- `SMSA`: dummy for residency in the metropolitan areas
- `MED`: mother's education in years
- `KWW`: score on the "Knowledge of the World of Work" test
- `IQ`: IQ score
- `AGE`: age of the individual
- `S`: completed years of schooling
- `EXPR`: experience in years
- `TENURE`: tenure in years
- `wage`: monthly wage

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_excel('wage.xls')
df.head()

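| | RNS | MRT | SMSA | MED | KWW | IQ | AGE | S | EXPR | TENURE | wage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 1 | 8 | 35 | 93 | 31 | 12 | 10.635 | 2 | 768.930054 |
| 1 | 0 | 1 | 1 | 14 | 41 | 119 | 37 | 18 | 11.367 | 16 | 807.545959 |
| 2 | 0 | 1 | 1 | 14 | 46 | 108 | 33 | 14 | 11.035 | 9 | 824.683777 |
| 3 | 0 | 1 | 1 | 12 | 32 | 96 | 32 | 12 | 13.089 | 7 | 650.017944 |
| 4 | 0 | 1 | 1 | 6 | 27 | 74 | 34 | 11 | 14.402 | 5 | 562.280029 |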


### a) 
What is the average and the median wage of the individuals in the sample? What is the average and the median log wage of the individuals in the sample?

In [3]:
df['wage'].describe()

count     758.000000
mean      999.976870
std       411.326834
min       115.468758
25%       714.083557
50%       948.139008
75%      1202.009613
max      3077.891357
Name: wage, dtype: float64

The average wage is `999.976`, the median wage is `948.139`.

In [4]:
df['wage_log'] = np.log(df['wage'])
df.head()

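| | RNS | MRT | SMSA | MED | KWW | IQ | AGE | S | EXPR | TENURE | wage | wage_log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 1 | 8 | 35 | 93 | 31 | 12 | 10.635 | 2 | 768.930054 | 6.645 |
| 1 | 0 | 1 | 1 | 14 | 41 | 119 | 37 | 18 | 11.367 | 16 | 807.545959 | 6.694 |
| 2 | 0 | 1 | 1 | 14 | 46 | 108 | 33 | 14 | 11.035 | 9 | 824.683777 | 6.715 |
| 3 | 0 | 1 | 1 | 12 | 32 | 96 | 32 | 12 | 13.089 | 7 | 650.017944 | 6.477 |
| 4 | 0 | 1 | 1 | 6 | 27 | 74 | 34 | 11 | 14.402 | 5 | 562.280029 | 6.332 |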


In [5]:
df['wage_log'].describe()

count    758.000000
mean       6.826555
std        0.409927
min        4.749000
25%        6.571000
50%        6.854500
75%        7.091750
max        8.032000
Name: wage_log, dtype: float64

The average log wage is `6.826`, the median log wage is `6.854`.

#### Note that "average of wage log" is different from "log of wage average"
In short, the "median of wage log" is the same as the "log of wage median": log is strictly increasing, so it preserves the ordering of the observations and the middle observation stays in the middle. With an even number of observations, as here (758), the median is the average of the two middle values, so the two quantities can still differ by a tiny amount.

For the average it is different, because log is concave: it compresses large values more than small ones. Since every observation is shrunk by a different amount, taking logs and taking the average do not commute, so the "average of wage log" differs from the "log of wage average". By Jensen's inequality, the average of the logs can never exceed the log of the average.

In [6]:
np.log(df['wage'].describe()['mean'])

6.907732148272591

In [7]:
np.log(df['wage'].describe()['50%'])

6.854501123961609

The "log of wage average" is `6.907` while the "average of wage log" is `6.826`.

The "log of wage median" and "median of wage log" are both `6.854`.
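The comparison above can be reproduced on a tiny sample. The sketch below uses made-up wages (not taken from `wage.xls`) and illustrative variable names; it only shows the direction of the inequality for the mean and how the median behaves for odd and even sample sizes.

```python
import numpy as np

# Toy wages (made-up values, NOT from wage.xls); odd count, so the median is one observation.
wages = np.array([400.0, 650.0, 800.0, 1000.0, 3000.0])

# Mean: log is concave, so averaging the logs gives less than the log of the average.
mean_of_log = np.log(wages).mean()
log_of_mean = np.log(wages.mean())

# Median: log preserves ordering, so the middle observation is the same one either way.
median_of_log = np.median(np.log(wages))
log_of_median = np.log(np.median(wages))

# With an even count the median averages the two middle values,
# so the log of the median and the median of the logs drift apart slightly.
wages_even = np.append(wages, 1200.0)
gap_even = np.log(np.median(wages_even)) - np.median(np.log(wages_even))

print(mean_of_log, log_of_mean)      # first value is the smaller one
print(median_of_log, log_of_median)  # agree up to floating-point precision
print(gap_even)                      # small positive number
```

The right tail (the `3000.0` entry) is what drives the gap between the two means: the larger the spread of wages, the further the log of the average sits above the average of the logs.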

### b) 
What is the median wage of the individuals for the subsample of individuals who live in metropolitan areas? Please use two different approaches to calculate this result.

In [8]:
df[df['SMSA']==1].describe()

| | RNS | MRT | SMSA | MED | KWW | IQ | AGE | S | EXPR | TENURE | wage | wage_log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 540.0 | 540.0 | 540.0 | 540.0 | 540.0 | 540.0 | 540.0 | 540.0 | 540.0 | 540.0 | 540.0 | 540.0 |
| mean | 0.261111 | 0.890741 | 1.0 | 10.964815 | 37.109259 | 104.303704 | 32.992593 | 13.835185 | 11.237693 | 7.303704 | 1058.846712 | 6.889537 |
| std | 0.439648 | 0.312253 | 0.0 | 2.662992 | 7.204224 | 13.480846 | 3.051801 | 2.208237 | 4.195172 | 5.119027 | 415.443952 | 0.398763 |
| min | 0.0 | 0.0 | 1.0 | 0.0 | 13.0 | 54.0 | 28.0 | 9.0 | 0.769 | 0.0 | 115.468758 | 4.749 |
| 25% | 0.0 | 1.0 | 1.0 | 9.0 | 33.0 | 96.0 | 30.0 | 12.0 | 8.3535 | 3.0 | 788.198624 | 6.66975 |
| 50% | 0.0 | 1.0 | 1.0 | 12.0 | 38.0 | 105.0 | 33.0 | 13.0 | 10.757 | 7.0 | 1000.244751 | 6.908 |
| 75% | 1.0 | 1.0 | 1.0 | 12.0 | 42.0 | 114.0 | 36.0 | 16.0 | 14.40625 | 11.0 | 1250.126465 | 7.131 |
| max | 1.0 | 1.0 | 1.0 | 18.0 | 56.0 | 145.0 | 38.0 | 18.0 | 22.045 | 22.0 | 3077.891357 | 8.032 |


The median wage of individuals living in metropolitan areas is `1000.24` (the `50%` row of the `wage` column).
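The question asks for two approaches, and the cell above reads the median off `describe()`. As a sketch of how the same number could be cross-checked, the block below computes a subsample median in two other ways: a boolean mask followed by `Series.median()`, and a `groupby` on the dummy. It runs on a small made-up frame named `toy` (an assumption for illustration, not the assignment data); on the real data the same expressions would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for df (made-up values); only the SMSA and wage columns matter here.
toy = pd.DataFrame({
    "SMSA": [1, 0, 1, 1, 0, 1],
    "wage": [900.0, 500.0, 1100.0, 700.0, 650.0, 1300.0],
})

# Approach 1: select the metropolitan rows with a boolean mask, then take the median.
median_mask = toy.loc[toy["SMSA"] == 1, "wage"].median()

# Approach 2: group by the dummy and pick the SMSA == 1 group.
median_groupby = toy.groupby("SMSA")["wage"].median().loc[1]

print(median_mask, median_groupby)  # both print 1000.0 for this toy frame
```

The `groupby` version also returns the median for the non-metropolitan group in the same call, which is convenient when both subsamples are of interest.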

### c) 
What is the covariance of IQ and S?

In [9]:
np.cov(df['IQ'], df['S'])

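array([[185.46806586,  15.5060456 ],
       [ 15.5060456 ,   4.90486332]])

`np.cov` returns the full covariance matrix: the diagonal entries are the variances of IQ and S, and the off-diagonal entry is their covariance. The covariance of IQ and S is `15.506`.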

Answer this question again for the subsample of individuals whose net income (measured here by `wage`) is lower than the 90th percentile value of net income.

In [10]:
np.percentile(df['wage'], 90)

1499.66943359375

In [11]:
df2 = df[df['wage']<np.percentile(df['wage'], 90)]
np.cov(df2['IQ'], df2['S'])

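array([[183.19460966,  13.98931822],
       [ 13.98931822,   4.60740165]])

For the subsample with wages below the 90th percentile (`1499.669`), the covariance of IQ and S is `13.989`.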
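For reference, the scalar covariance can be pulled out directly instead of reading it from the matrix. The sketch below uses a made-up frame named `toy` (an assumption for illustration, not the assignment data) and compares the off-diagonal entry of `np.cov` with `Series.cov`; both divide by n - 1 by default. It also shows the percentile cutoff computed with `Series.quantile`, which uses the same linear interpolation as `np.percentile` by default.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df (made-up values, NOT from wage.xls).
toy = pd.DataFrame({
    "IQ":   [90, 100, 110, 120, 130],
    "S":    [10, 12, 12, 16, 18],
    "wage": [500.0, 700.0, 900.0, 1100.0, 3000.0],
})

# Full sample: off-diagonal entry of the 2x2 matrix vs. the pandas scalar.
cov_numpy = np.cov(toy["IQ"], toy["S"])[0, 1]
cov_pandas = toy["IQ"].cov(toy["S"])

# Subsample: keep rows strictly below the 90th percentile of wage.
cutoff = toy["wage"].quantile(0.9)
sub = toy[toy["wage"] < cutoff]
cov_sub = sub["IQ"].cov(sub["S"])

print(cov_numpy, cov_pandas)  # same value from both libraries
print(cutoff, len(sub), cov_sub)
```

Computing the cutoff once and storing it in a name such as `cutoff` keeps the filter readable and avoids evaluating the percentile twice.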