# Lecture 4 - Intro to Python for Econometrics

## Software

### What is Python?

Python is a high-level, interpreted programming language that has gained massive popularity in the last decade. Its simple and readable syntax makes it relatively easy to learn, while its many libraries and modules allow for advanced functionality.

Python is a versatile language that can be used for various applications, from web development to data analysis, scientific computing, machine learning, and more. In econometrics, Python's powerful data analysis libraries, such as NumPy, Pandas, Scikit-learn and Statsmodels, make it an ideal tool for statistical analysis and modelling.

### What is a Jupyter Notebook?

A Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualisations, and narrative text. The term "notebook" references the traditional notebook used in labs or research to keep track of one's work and observations.

Jupyter notebooks support over 40 programming languages, including Python, and are widely used in data cleaning and transformation, numerical simulation, statistical modelling, data visualisation, machine learning, and more.

Each notebook is made up of a series of cells. These cells can contain either code or markdown text. The markdown cells are used for text and include formatted text, images, hyperlinks, LaTeX equations, etc. The code cells contain the actual code that you want to run. You can run each cell individually, and it will display any output directly below the cell.

### What is Binder?

Binder is a free, open-source online service that allows you to turn a GitHub repository into a collection of interactive notebooks. It's powered by BinderHub, an open-source tool that deploys the Binder service in the cloud. This makes it an excellent tool for sharing reproducible research, as others can view your work and interact with the code and data.

In this class, we'll use Binder to host our Python notebooks. This means you can run and experiment with the code provided without needing to install Python or any of the necessary libraries on your local machine.

### 1. Python Basics
#### Outline
1. Understanding Variables
2. Help and Documentation
3. Numbers in Python
4. Text in Python
5. True and False in Python

#### 1.1 Understanding Variables
Variable assignment associates a value with a variable name.

Below, we assign the value “Hello World” to the variable `x`.

In [2]:
x = "Hello World"

Once we have assigned a value to a variable, Python will remember this variable as long as the current session of Python is still running.

Notice how writing x into the prompt below outputs the value “Hello World”.

In [3]:
x

'Hello World'
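
As a small sketch of how assignment behaves, the example below rebinds `x` to a new value after copying the old one into a second variable. The name `greeting` is hypothetical and used only for illustration.

```python
# Assign a value, then rebind the same name to a different value.
x = "Hello World"
greeting = x   # `greeting` now refers to the same string as `x`
x = 42         # rebinding `x` does not affect `greeting`

print(greeting)  # Hello World
print(x)         # 42
```

A variable is just a name attached to a value, so the same name can later point to a value of a different kind.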

#### Help
We can figure out what a function does by asking for help.

In Jupyter notebooks, this is done by placing a `?` after the function name (without using parentheses) and evaluating the cell.

For example, we can ask for help on the print function by writing `print?`.

Depending on how you launched Jupyter, this will do one of the following:

- JupyterLab: display the help in text below the cell.
- Classic Jupyter Notebooks: display a new panel at the bottom of your screen. You can exit this panel by hitting the escape key or clicking the x at the top right of the panel.


In [4]:
print?

Docstring:
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)

Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file:  a file-like object (stream); defaults to the current sys.stdout.
sep:   string inserted between values, default a space.
end:   string appended after the last value, default a newline.
flush: whether to forcibly flush the stream.
Type:      builtin_function_or_method
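
The `?` suffix is specific to Jupyter and IPython. Outside a notebook, a roughly equivalent approach is the built-in `help` function, or reading the function's docstring directly. The sketch below assumes a standard Python interpreter; the exact help text varies between Python versions.

```python
# `help` is a built-in function and works in any Python session.
help(print)

# The same text is stored on the function as its docstring.
doc = print.__doc__
print(doc.splitlines()[0])  # first line of the docstring
```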

#### Objects and Types
Everything in Python is an object.

Objects are “things” that contain 1) data and 2) functions that can operate on the data.

Sometimes we refer to the functions inside an object as methods.

We can investigate what data is inside an object and which methods it supports by typing `.` after that particular variable, then hitting `TAB`.

It should then list data and method names to the right of the variable name like this:

![Auto Complete list](img/obj_list.png)

We often refer to this as “tab completion”.

Let’s do this below. Keep scrolling down the list until you find the method `split`.



In [7]:
# Type a period after `x`, press TAB, and scroll down to find `split`.
# Once you find it, press enter and add () to the end of it, like split()
x.split()

['Hello', 'World']
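
Tab completion is interactive, but the same information can be listed in code. The sketch below uses the built-in `dir` function to collect the attribute and method names of `x`. The variable `public_names` is a hypothetical name for this example.

```python
x = "Hello World"

# Names that do not start with an underscore are the public ones.
public_names = [name for name in dir(x) if not name.startswith("_")]

print("split" in public_names)  # True
print(x.split())                # ['Hello', 'World']
print(x.upper())                # HELLO WORLD
```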

We often want to identify what kind of object some value is, called its “type”.

A “type” is an abstraction that defines a set of behaviors for any “instance” of that type. For example, `2.0` and `3.0` are instances of `float`, and `float` has a particular set of common behaviors.

In particular, the type determines:

- the available data for any “instance” of the type (where each instance may have different values of the data).
- the methods that can be applied to the object and its data.

We can find the type of a value by using the `type` function.

The `type` function takes a single argument and outputs the type of that argument.



In [None]:
type(2)

In [8]:
type("Hello World")

str

In [9]:
type([1, 2, 3])

list
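
To illustrate the idea that instances of one type share behavior, the sketch below prints the type of a few values and checks that two floats have the same type. The list `values` and the name `same_type` are made up for this example.

```python
values = [2, 2.0, "Hello World", [1, 2, 3], True]

# Print each value next to the name of its type.
for v in values:
    print(repr(v), "->", type(v).__name__)

# Two floats share one type, and therefore the same methods.
same_type = type(2.0) is type(3.0)
print(same_type)               # True
print(isinstance(2.0, float))  # True
print((2.0).is_integer())      # True
```

`isinstance` is usually the more convenient check when you only need to know whether a value belongs to a given type.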

In [7]:
import numpy as np              # Import numpy library as np
import pandas as pd             # Import pandas library as pd
import matplotlib.pyplot as plt # Import matplotlib.pyplot library as plt

from sklearn.linear_model import (
    LassoCV,                    # Import LassoCV
    RidgeCV                     # Import RidgeCV
)


from sklearn.preprocessing import PolynomialFeatures
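
`PolynomialFeatures`, imported above, expands a set of columns into all products of those columns up to a chosen degree. As a rough sketch of what that expansion contains, the example below counts the terms for five predictors and degree 4 using only the standard library. It assumes the default settings, where a constant term is included, and it does not call scikit-learn itself.

```python
from itertools import combinations_with_replacement
from math import comb

features = ["rooms", "age", "lland", "larea", "lintst"]
degree = 4

# Each term is a multiset of features with 0 to `degree` elements.
# The empty multiset corresponds to the constant (bias) term.
terms = [
    combo
    for d in range(degree + 1)
    for combo in combinations_with_replacement(features, d)
]

print(len(terms))                            # 126
print(comb(len(features) + degree, degree))  # 126, the closed-form count
print(terms[:7])                             # constant, the 5 raw features, then rooms*rooms
```

With only 142 observations, 126 candidate predictors is a large number, which is one reason a penalised estimator is used in this lecture.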

In [6]:
# load and prepare the data
url = "http://fmwww.bc.edu/ec-p/data/wooldridge/kielmc.dta"
data = pd.read_stata(url)

# keep if year==1981
data = data[data['year'] == 1981]

# View the first 5 rows of the data
data.head()

| | year | age | agesq | nbh | cbd | intst | lintst | price | rooms | area | ... | lprice | y81 | larea | lland | y81ldist | lintstsq | nearinc | y81nrinc | rprice | lrprice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 179 | 1981.0 | 81.0 | 6561.0 | 4.0 | 4000.0 | 1000.0 | 6.9078 | 49000.0 | 6.0 | 1554.0 | ... | 10.79958 | 1.0 | 7.348588 | 8.823206 | 9.375855 | 47.717701 | 1.0 | 1.0 | 37634.410156 | 10.53567 |
| 180 | 1981.0 | 71.0 | 5041.0 | 4.0 | 3000.0 | 2000.0 | 7.6009 | 52000.0 | 5.0 | 1575.0 | ... | 10.859 | 1.0 | 7.36201 | 8.156223 | 9.220291 | 57.773682 | 1.0 | 1.0 | 39938.550781 | 10.5951 |
| 181 | 1981.0 | 31.0 | 961.0 | 4.0 | 3000.0 | 2000.0 | 7.6009 | 68000.0 | 6.0 | 3304.0 | ... | 11.12726 | 1.0 | 8.102889 | 9.837935 | 9.230143 | 57.773682 | 1.0 | 1.0 | 52227.339844 | 10.86336 |
| 182 | 1981.0 | 41.0 | 1681.0 | 4.0 | 3000.0 | 2000.0 | 7.6009 | 54000.0 | 6.0 | 1700.0 | ... | 10.89674 | 1.0 | 7.438384 | 8.922658 | 9.323669 | 57.773682 | 1.0 | 1.0 | 41474.660156 | 10.63284 |
| 183 | 1981.0 | 31.0 | 961.0 | 4.0 | 4000.0 | 2000.0 | 7.6009 | 70000.0 | 6.0 | 1454.0 | ... | 11.15625 | 1.0 | 7.282073 | 8.612503 | 9.375855 | 57.773682 | 1.0 | 1.0 | 53763.441406 | 10.89235 |
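
The row filter used above, `data[data['year'] == 1981]`, works in two steps: the comparison builds a column of True/False values, and indexing with that column keeps only the True rows. The sketch below shows this on a toy DataFrame. The values and the names `toy`, `mask` and `subset` are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data, made up for this example only.
toy = pd.DataFrame({
    "year": [1978, 1981, 1981, 1978],
    "price": [60000, 49000, 52000, 40000],
})

mask = toy["year"] == 1981  # boolean Series, one entry per row
subset = toy[mask]          # keeps the rows where the mask is True

print(mask.tolist())          # [False, True, True, False]
print(subset.index.tolist())  # [1, 2]
```

The original row labels are preserved after filtering, which is why the first row shown by `data.head()` above is labelled 179 rather than 0.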


In [8]:
# Xlevels is the array (matrix) with the basic features (predictors) in levels.
Xlevels = data[["rooms", "age", "lland", "larea", "lintst"]]
# Dimensions of this array:
print(np.shape(Xlevels))

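# Outcome (dependent variable) as a 1-D NumPy array.
# to_numpy() replaces Series.ravel(), which is deprecated in recent pandas.
y = data["lprice"].to_numpy()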
print(np.shape(y))

# Means
print(np.mean(Xlevels,axis=0))
print(np.mean(y,axis=0))
# SDs
print(np.std(Xlevels,axis=0))
print(np.std(y,axis=0))

# What would happen if we asked for the mean of X without specifying the axis?
print(np.mean(Xlevels))
# What would happen if we asked for the mean of X for axis=1?
print(np.mean(Xlevels,axis=1))


######### MODEL ##########
# Various models and specifications below. Uncomment to choose one.

# Lasso
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html
# Older sklearn code wrote "normalize=True", which passes the normalize argument by name rather than by position.
# normalize is like standardize but without dividing by the sample size.
# nb: normalize was removed from sklearn in release 1.2, so it is not passed here.
model = LassoCV(fit_intercept=True,cv=5)

# Ridge
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html
# model = RidgeCV(fit_intercept=True,cv=5)

######### PREDICTORS ########
# If using just the basic variables in levels, X is just Xlevels.
# For a polynomial, use PolynomialFeatures.

# Basic model - just the raw variables in levels.
# X = Xlevels

# Polynomial including interactions.
# The argument to PolynomialFeatures is the max degree of the polynomial.
poly = PolynomialFeatures(degree=4)
X = poly.fit_transform(Xlevels)

######### ESTIMATE #########
print("Training Model ... ")
model.fit(X,y)
print("Complete.")

######### RESULTS ##########

# Penalty hyperparameter (called alpha in sklearn, called lambda in the lectures).
model.alpha_

# R-squared (called score).
# For LassoCV:
# model.score(X,y)
# For RidgeCV:
# model.best_score_

# Estimated coefficients and intercept (returned separately):
b = model.coef_
print("Coefficients: ", b)
model.intercept_

# Dimension of X?
print("Shape ", np.shape(X))
# How many selected?
print("nonzero count: ", np.count_nonzero(b))

########## PREDICTED VALUES ########

# Predicted values:
yhatvalues = model.predict(X)

# Residuals:
ehatvalues = y - yhatvalues

# In-sample RMSE = SD of residuals (no DOF adjustment); squaring it gives the MSE.
np.std(ehatvalues)

(142, 5)
(142,)
rooms      6.591549
age       13.978873
lland     10.278893
larea      7.655591
lintst     9.450821
dtype: float32
11.629019
rooms      0.823552
age       23.852375
lland      0.717999
larea      0.355933
lintst     0.715446
dtype: float32
0.38854474
9.5911455
179    22.015919
180    19.823826
181    12.508345
182    14.192389
183    12.099095
         ...    
316    11.273715
317     7.202640
318    10.698859
319    10.271528
320     6.984193
Length: 142, dtype: float32
Training Model ... 


  model = cd_fast.enet_coordinate_descent(

Complete.
Coefficients:  [ 0.0000000e+00  0.0000000e+00 -0.0000000e+00  0.0000000e+00
  0.0000000e+00  0.0000000e+00  0.0000000e+00 -0.0000000e+00
  0.0000000e+00  0.0000000e+00  0.0000000e+00 -0.0000000e+00
 -0.0000000e+00 -0.0000000e+00 -0.0000000e+00  0.0000000e+00
  0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
  0.0000000e+00  0.0000000e+00 -0.0000000e+00  0.0000000e+00
  0.0000000e+00  0.0000000e+00 -0.0000000e+00 -0.0000000e+00
 -0.0000000e+00 -0.0000000e+00  0.0000000e+00  0.0000000e+00
  0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
 -0.0000000e+00 -0.0000000e+00 -0.0000000e+00 -0.0000000e+00
 -0.0000000e+00 -0.0000000e+00 -0.0000000e+00 -0.0000000e+00
 -0.0000000e+00 -0.0000000e+00  0.0000000e+00  0.0000000e+00
  0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
  0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
  0.0000000e+00 -0.0000000e+00  0.0000000e+00  0.0000000e+00
  0.0000000e+00 -0.0000000e+00 -0.0000000e+00 -0.0000000e+00



0.3319131
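
The final number above is the standard deviation of the in-sample residuals. As a hedged sketch of how that quantity relates to the mean squared error, the example below fits plain least squares with an intercept (not the lasso used above) to synthetic data. The data and all variable names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, made up for this example only.
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)

# Least-squares fit with an intercept column.
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef

mse = np.mean(resid ** 2)  # mean squared error
rmse = np.sqrt(mse)        # root mean squared error
sd = np.std(resid)         # ddof=0, i.e. no degrees-of-freedom adjustment

print(np.isclose(rmse, sd))  # True, because the residuals average to zero
```

When the residuals have mean zero, their standard deviation equals the root mean squared error, and the MSE is its square.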