# Python for 240A
Author: _Ingrid Haegele_ (inha@berkeley.edu)

This template gives you an overview of how to get started using Python for Econ 240A.

Python is a powerful tool for applied work in economics. In my research, I use it a lot for data cleaning and preprocessing (e.g. correcting typos, using natural language processing to classify text). It is especially handy because, in contrast to Stata, it allows you to have multiple dataframes open at the same time. Once you have cleaned your data, Python offers a convenient way to write your own regression commands or define the likelihood function you would like to estimate. People also use it interchangeably with MATLAB for simulations and structural work. Another great application is machine learning: Python has an easy-to-use machine learning library (scikit-learn) that lets you implement many machine learning tools.

We will get a lot of practice with these things in the problem sets this semester.

## Downloading and Installing Python

I recommend using Jupyter notebooks for the problem sets in this class, because they offer a very easy way to combine headings, text and plots with your code. You can simply save your Jupyter notebooks as HTML files and hand them in as your problem set solutions without any extra work.

First, download the Anaconda distribution from https://www.anaconda.com/download/. Make sure to download the most recent version of Python (Python 3.6).
Second, after installing Anaconda, you can easily run Jupyter notebooks as described here: http://jupyter.org/install.html.

## Python Resources
First, you should have a look at the Python notebooks on Bryan Graham's website: https://github.com/bryangraham/Ec240a/tree/master/Ec240a_Fall2016/Python. This code is a great example for most of the things you will have to do in your problem sets and is a good way to start absorbing how Python works. 


Another excellent resource for learning Python can be found here: https://lectures.quantecon.org/py/.

If you get really stuck with Python, consider reaching out to the D-LAB. They offer not only free Python classes but also free consulting sessions. You can reach out to one of their consultants and they will usually be able to meet with you within a couple of days. Find more information here: http://dlab.berkeley.edu/consulting.


## Example Notebook
The following code shows you the most important features of a Python Notebook. 

### Loading Modules
In order to use modules, we need to first import them. The following describes some fundamental modules:

NumPy provides fundamental computing capabilities similar to MATLAB.

Pandas allows us to conveniently manage data.

PyPlot provides a MATLAB-like plotting environment.

StatsModels provides some statistical functions.

We typically designate an abbreviation for a module. For instance, we use 'np' for the NumPy module. When NumPy is imported as np, we can call the functions that belong to NumPy by typing np.function. For instance, np.transpose(X) will yield the transpose of X (you first need to define X, of course).

In [19]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
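
As a minimal sketch of the `np.function` pattern described above, the cell below defines a small matrix and transposes it. The matrix `X` and its values are made up for illustration and are not part of the course data.

```python
import numpy as np

# X is a hypothetical 2x3 example matrix (values chosen only for illustration)
X = np.array([[1, 2, 3],
              [4, 5, 6]])

# Call a NumPy function through the alias: np.<function>
X_t = np.transpose(X)

print(X.shape)    # dimensions of X
print(X_t.shape)  # dimensions of the transpose
print(X_t)
```

The attribute `X.T` gives the same result as `np.transpose(X)` for a 2-dimensional array, so you will see both forms in other people's code.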


### Switching between Code and Text
You can choose the cell type for each cell.
For example, Markdown cells are used for body text and headings. To set the heading level, start the line with one or more '#' characters: one '#' gives a top-level heading, '##' a second-level heading, and so on.

Code cells contain the Python code. You can run a specific cell by clicking the Run button or pressing Shift+Enter. While a cell is running, an asterisk appears next to it (`In [*]`). Especially when you perform large matrix operations, your code might run for a while.

### Load Data, User-Defined Functions, Run Commands, Plots
For specific examples, please have a look at the above-mentioned Python code on Bryan Graham's GitHub page.

### Python and Arrays
In contrast to MATLAB, where the basic data type is the matrix, the standard format in Python (through NumPy) is the array. For more information, have a look at https://docs.scipy.org/doc/numpy-dev/user/numpy-for-matlab-users.html.
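
One practical consequence of working with arrays rather than matrices is that `*` multiplies element by element, while matrix multiplication uses the `@` operator. The sketch below illustrates the difference on a small example; the arrays are made up for illustration.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
I2 = np.eye(2)  # 2x2 identity matrix

# Element-wise product: each entry of A times the matching entry of I2
elementwise = A * I2

# Matrix product: rows of A times columns of I2, which returns A itself
matprod = A @ I2

print(elementwise)
print(matprod)
```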

When you write your code, make sure to check which type of input a function requires, for example 1-dimensional or 2-dimensional array. 

Here is an example of how you can define and check arrays. 

In [20]:
# Generating a 2x2 array
A = np.array([[1,2],[3,4]])

print("Array A")
print(A)
print("Dimensions of A")
print(A.shape)

Array A
[[1 2]
 [3 4]]
Dimensions of A
(2, 2)


In [21]:
# Generating a 1-dimensional array
B = np.array([1,2])
print("Array B")
print(B)
print("Dimensions of B")
print(B.shape)


Array B
[1 2]
Dimensions of B
(2,)


In [22]:
# Re-shaping 1-dim to get 2-dim (2x1) array 
C = B.reshape(-1,1)
print("Array C")
print(C)
print("Dimensions of C")
print(C.shape)

Array C
[[1]
 [2]]
Dimensions of C
(2, 1)
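
The distinction between the 1-dimensional `B` and the 2-dimensional `C` matters as soon as you combine arrays. The following sketch reuses the arrays defined above and prints the shapes that result from a few common operations.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([1, 2])      # 1-dimensional, shape (2,)
C = B.reshape(-1, 1)      # 2-dimensional column, shape (2, 1)

AB = A @ B   # matrix times 1-dimensional array gives a 1-dimensional result
AC = A @ C   # matrix times a 2x1 column gives a 2x1 column

print(AB.shape)
print(AC.shape)

# Adding a (2,) array and a (2, 1) array does not raise an error:
# NumPy broadcasts both to a 2x2 array, which is rarely what you intended
S = B + C
print(S.shape)
```

Because broadcasting is silent, checking `.shape` before and after such operations is a cheap way to catch mistakes.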


### Loading and Preparing Data

Depending on the format of your data (e.g. xlsx, csv), you will need to use the matching import function. Make sure to insert the right directory at the beginning.

In [23]:
# Specify the directory where data file is located
workdir =  '/Users/ingridhagele/Dropbox/My_Documents_Dropbox/Teaching/2018/Additional material/'

In [24]:
# Read in the csv data as dataframe
data = pd.read_csv(workdir+'NLSY79.csv') 
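
As a rough sketch of the same loading step without relying on a file on disk, the cell below reads csv text from memory and builds a file path with `os.path.join`. The csv contents and the folder name `data` are hypothetical; for other formats, pandas provides analogous functions such as `pd.read_excel` and `pd.read_stata`.

```python
import io
import os
import pandas as pd

# Hypothetical csv contents standing in for a file on disk
csv_text = "PID_79,AFQT\n1,45.2\n2,\n3,78.9\n"
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)         # rows and columns that were read
print(toy.isna().sum())  # empty csv fields are read in as missing values (NaN)

# os.path.join inserts the right path separator for your operating system
workdir_example = "data"  # hypothetical folder name
path = os.path.join(workdir_example, "NLSY79.csv")
print(path)
```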


In [25]:
# Use the head command to see the first rows 
data.head()

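| | PID_79 | HHID_79 | core_sample | sample_wgts | month_born | year_born | live_with_mom_at_14 | live_with_dad_at_14 | single_mom_at_14 | usborn | ... | weeks_worked_2001 | weeks_worked_2003 | weeks_worked_2005 | weeks_worked_2007 | weeks_worked_2009 | weeks_worked_2011 | NORTH_EAST_79 | NORTH_CENTRAL_79 | SOUTH_79 | WEST_79 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 602156.31 | 9 | 58 | 1.0 | 1.0 | 0.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2 | 2 | 1 | 816100.38 | 1 | 59 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 18.0 | 52.0 | 52.0 | 52.0 | 52.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 3 | 3 | 1 | 572996.38 | 8 | 61 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | NaN | 43.0 | 0.0 | NaN | 52.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 4 | 3 | 1 | 604567.88 | 8 | 62 | 1.0 | 0.0 | 0.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 5 | 5 | 1 | 764753.0 | 7 | 59 | 1.0 | 1.0 | 0.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 | 0.0 |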


In [26]:
# Use describe to get basic summary statistics
data.describe()

| | PID_79 | HHID_79 | core_sample | sample_wgts | month_born | year_born | live_with_mom_at_14 | live_with_dad_at_14 | single_mom_at_14 | usborn | ... | weeks_worked_2001 | weeks_worked_2003 | weeks_worked_2005 | weeks_worked_2007 | weeks_worked_2009 | weeks_worked_2011 | NORTH_EAST_79 | NORTH_CENTRAL_79 | SOUTH_79 | WEST_79 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 12686.0 | 12686.0 | 12686.0 | 9763.0 | 12686.0 | 12686.0 | 12667.0 | 12667.0 | 12667.0 | 12685.0 | ... | 7724.0 | 7661.0 | 7654.0 | 7757.0 | 7565.0 | 7301.0 | 11466.0 | 11466.0 | 11466.0 | 11466.0 |
| mean | 6343.5 | 6337.116191 | 0.769589 | 337046.7 | 6.494561 | 60.340848 | 0.927686 | 0.706718 | 0.172417 | 0.9311 | ... | 41.550363 | 40.19175 | 40.400706 | 40.258863 | 38.123331 | 37.953842 | 0.199546 | 0.253271 | 0.366736 | 0.180447 |
| std | 3662.277092 | 3657.983694 | 0.421113 | 222352.2 | 3.407233 | 2.249621 | 0.259017 | 0.455284 | 0.377757 | 0.253294 | ... | 18.994106 | 20.190597 | 19.999255 | 20.129455 | 21.230304 | 21.923177 | 0.399677 | 0.434904 | 0.481935 | 0.384576 |
| min | 1.0 | 1.0 | 0.0 | 40317.0 | 1.0 | 57.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 3172.25 | 3171.0 | 1.0 | 123113.0 | 4.0 | 58.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 42.0 | 38.0 | 38.0 | 38.0 | 24.0 | 19.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 6343.5 | 6338.5 | 1.0 | 259991.0 | 7.0 | 60.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | 52.0 | 52.0 | 52.0 | 52.0 | 52.0 | 52.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 9514.75 | 9505.0 | 1.0 | 536356.6 | 9.0 | 62.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | 52.0 | 52.0 | 52.0 | 52.0 | 52.0 | 52.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| max | 12686.0 | 12686.0 | 1.0 | 1131376.0 | 12.0 | 64.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 52.0 | 52.0 | 52.0 | 52.0 | 52.0 | 52.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [27]:
# Make a new variable that averages over original variables
# Calculate average earnings across the 1997, 1999, 2001 and 2003 calendar years
data['Earnings'] = data[["real_earnings_1997", "real_earnings_1999",
                         "real_earnings_2001", "real_earnings_2003"]].mean(axis=1)

In [28]:
# Make a new dataframe that only contains a subset of the original data
data_subsample = data[["PID_79","HHID_79","Earnings","HGC_Age28","AFQT"]] 

In [29]:
# In order to deal with missing values, restrict data further to only contain complete cases (drop missing values)
data_subsample = data_subsample.dropna()
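
Both the row-wise average and the complete-case restriction above depend on how pandas treats missing values. The sketch below shows the default behavior on a toy dataframe; the values are made up, and only two of the four earnings columns are used to keep the example short.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values; column names mirror the notebook
toy = pd.DataFrame({
    "real_earnings_1997": [10.0, np.nan, np.nan, 40.0],
    "real_earnings_1999": [20.0, 30.0, np.nan, 60.0],
    "AFQT":               [55.0, 70.0, 80.0, np.nan],
})
earn_cols = ["real_earnings_1997", "real_earnings_1999"]

# By default, mean(axis=1) averages over the non-missing years in each row;
# the result is NaN only if every year is missing
toy["Earnings"] = toy[earn_cols].mean(axis=1)

# skipna=False instead returns NaN as soon as one year is missing
strict = toy[earn_cols].mean(axis=1, skipna=False)

# dropna() keeps only rows without any missing value in the selected columns
complete = toy[["Earnings", "AFQT"]].dropna()

# subset= restricts the check to particular columns
earn_only = toy.dropna(subset=["Earnings"])

print(toy["Earnings"])
print(strict)
print(len(toy), len(complete), len(earn_only))
```

Printing the number of rows before and after `dropna()` is a simple way to see how much of the sample the complete-case restriction removes.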

In [30]:
# Add a constant to the dataframe
data['Constant'] = 1 

### Run a first Least Squares Fit

In [31]:
# Be a good labor economist and transform earnings to log earnings

# Drop units with zero earnings (copy so the new column is added to an independent dataframe)
data_subsample = data_subsample[data_subsample.Earnings!=0].copy()
# Compute log earnings
data_subsample['LogEarn'] = np.log(data_subsample.Earnings) 
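
To see why units with zero earnings are dropped before taking logs, the following sketch applies `np.log` to a toy earnings column that contains a zero. The values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical earnings, including one unit with zero earnings
toy = pd.DataFrame({"Earnings": [0.0, 1000.0, 50000.0]})

# log(0) is minus infinity; errstate only silences the NumPy warning here
with np.errstate(divide="ignore"):
    naive = np.log(toy["Earnings"])
print(naive)

# Keep strictly positive earnings, then take logs
positive = toy[toy["Earnings"] > 0].copy()
positive["LogEarn"] = np.log(positive["Earnings"])
print(positive)
```

An infinite value in the dependent variable would break the least squares fit, which is why the restriction comes first.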

In [32]:
Y    = data_subsample['LogEarn'] 
X    = data_subsample['HGC_Age28']

short_reg=sm.OLS(Y,sm.add_constant(X)).fit(cov_type='HC0')
print(short_reg.summary())


                            OLS Regression Results                            
Dep. Variable:                LogEarn   R-squared:                       0.122
Model:                            OLS   Adj. R-squared:                  0.121
Method:                 Least Squares   F-statistic:                     957.9
Date:                Thu, 11 Oct 2018   Prob (F-statistic):          3.19e-198
Time:                        10:46:53   Log-Likelihood:                -11173.
No. Observations:                7616   AIC:                         2.235e+04
Df Residuals:                    7614   BIC:                         2.236e+04
Df Model:                           1                                         
Covariance Type:                  HC0                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          8.0651      0.072    111.491      0.0
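
As a sketch of what writing your own regression command can look like, the cell below computes least squares coefficients and heteroskedasticity-robust (HC0) standard errors with NumPy alone. It uses simulated data, not the NLSY79 sample, and the variable names and parameter values are assumptions made for illustration.

```python
import numpy as np

# Simulated data (hypothetical): true intercept 1.0, true slope 2.0,
# with an error variance that depends on x
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
u = rng.normal(size=n) * (1 + 0.5 * np.abs(x))
y = 1.0 + 2.0 * x + u

# Regressor matrix with a constant in the first column
X = np.column_stack([np.ones(n), x])

# OLS coefficients: (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y

# HC0 covariance: (X'X)^{-1} [sum_i e_i^2 x_i x_i'] (X'X)^{-1}
resid = y - X @ beta
meat = X.T @ (X * resid[:, None] ** 2)
V_hc0 = XtX_inv @ meat @ XtX_inv
se = np.sqrt(np.diag(V_hc0))

print("coefficients:", beta)
print("HC0 std. errors:", se)
```

On the same data, these numbers should agree with `sm.OLS(y, X).fit(cov_type='HC0')`, which makes the statsmodels output a convenient check on hand-written code.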

### Debugging
Especially when you are new to Python, debugging is a challenge. A useful trick is to use the print command a lot. You can print not only an object but also its shape, which helps a lot in detecting situations in which dimensions do not match.
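
The sketch below shows a typical case in which printing shapes reveals a problem that does not raise an error. All arrays are made up for illustration.

```python
import numpy as np

X = np.ones((5, 2))               # 5 observations, 2 regressors
beta = np.array([0.5, 1.5])       # coefficient vector, shape (2,)
y = np.arange(5).reshape(-1, 1)   # outcome stored as a 5x1 column

fitted = X @ beta
print(y.shape, fitted.shape)      # the two shapes differ

# Subtracting a (5,) array from a (5, 1) array broadcasts to 5x5
resid_wrong = y - fitted
print(resid_wrong.shape)

# Flatten y first so that both arrays are 1-dimensional
resid = y.ravel() - fitted
print(resid.shape)
```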

In [33]:
print(data['Earnings'])

0                  NaN
1             0.000000
2           317.007690
3         17661.857000
4                  NaN
5         57623.856667
6         34636.081000
7         36207.417000
8         48700.858250
9                  NaN
10        39993.469500
11                 NaN
12        56102.037000
13        72579.809000
14            0.000000
15        51168.929500
16        63720.794500
17       186308.023250
18            0.000000
19        69094.201250
20       108999.235500
21        37403.759000
22                 NaN
23        46192.547000
24        23720.794250
25                 NaN
26         7195.303250
27        25216.545000
28        70488.850000
29        10665.782250
             ...      
12656              NaN
12657              NaN
12658     26911.149500
12659              NaN
12660              NaN
12661              NaN
12662     68920.605500
12663              NaN
12664              NaN
12665              NaN
12666     35692.692000
12667              NaN
12668      