# Machine Learning and Prediction in Python

As noted in the Stata part of the lab, Python has several powerful machine learning libraries. In this part of the lab, we use the `scikit-learn` library to carry out the same kind of machine learning tasks.

`scikit-learn` makes these tasks straightforward: the library is well documented, and its documentation includes many examples that show how to use it.

We will replicate the Stata part of the lab using `scikit-learn`, with the same data and the same tasks.
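Before turning to the housing data, the basic `scikit-learn` workflow (construct an estimator, call `fit`, then read the fitted attributes) can be sketched on synthetic data. Everything named `*_demo` below is made up for illustration and is not part of the lab; only the first three true coefficients are nonzero, so the lasso should shrink most of the others toward zero.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data (illustrative only): 200 observations, 10 predictors,
# of which only the first three enter the true model.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
beta_demo = np.zeros(10)
beta_demo[:3] = [2.0, -1.0, 0.5]
y_demo = X_demo @ beta_demo + rng.normal(scale=0.1, size=200)

# Construct the estimator, then fit it. LassoCV picks the penalty by
# 5-fold cross-validation.
demo_model = LassoCV(cv=5).fit(X_demo, y_demo)

print("Chosen penalty (alpha):", demo_model.alpha_)
print("Coefficients:", demo_model.coef_)
print("In-sample R-squared:", demo_model.score(X_demo, y_demo))
```

The same three steps (construct, `fit`, inspect) are used with the real data in the cells that follow.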

In [None]:
import numpy as np              # Import numpy library as np
import pandas as pd             # Import pandas library as pd
import matplotlib.pyplot as plt # Import matplotlib.pyplot library as plt

from sklearn.linear_model import (
    LassoCV,                    # Import LassoCV
    RidgeCV                     # Import RidgeCV
)


from sklearn.preprocessing import PolynomialFeatures

# load and prepare the data
url = "http://fmwww.bc.edu/ec-p/data/wooldridge/kielmc.dta"
data = pd.read_stata(url)

# keep if year==1981
data = data[data['year'] == 1981]

# View the first 5 rows of the data
data.head()





| | year | age | agesq | nbh | cbd | intst | lintst | price | rooms | area | ... | lprice | y81 | larea | lland | y81ldist | lintstsq | nearinc | y81nrinc | rprice | lrprice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 179 | 1981.0 | 81.0 | 6561.0 | 4.0 | 4000.0 | 1000.0 | 6.9078 | 49000.0 | 6.0 | 1554.0 | ... | 10.79958 | 1.0 | 7.348588 | 8.823206 | 9.375855 | 47.717701 | 1.0 | 1.0 | 37634.410156 | 10.53567 |
| 180 | 1981.0 | 71.0 | 5041.0 | 4.0 | 3000.0 | 2000.0 | 7.6009 | 52000.0 | 5.0 | 1575.0 | ... | 10.859 | 1.0 | 7.36201 | 8.156223 | 9.220291 | 57.773682 | 1.0 | 1.0 | 39938.550781 | 10.5951 |
| 181 | 1981.0 | 31.0 | 961.0 | 4.0 | 3000.0 | 2000.0 | 7.6009 | 68000.0 | 6.0 | 3304.0 | ... | 11.12726 | 1.0 | 8.102889 | 9.837935 | 9.230143 | 57.773682 | 1.0 | 1.0 | 52227.339844 | 10.86336 |
| 182 | 1981.0 | 41.0 | 1681.0 | 4.0 | 3000.0 | 2000.0 | 7.6009 | 54000.0 | 6.0 | 1700.0 | ... | 10.89674 | 1.0 | 7.438384 | 8.922658 | 9.323669 | 57.773682 | 1.0 | 1.0 | 41474.660156 | 10.63284 |
| 183 | 1981.0 | 31.0 | 961.0 | 4.0 | 4000.0 | 2000.0 | 7.6009 | 70000.0 | 6.0 | 1454.0 | ... | 11.15625 | 1.0 | 7.282073 | 8.612503 | 9.375855 | 57.773682 | 1.0 | 1.0 | 53763.441406 | 10.89235 |
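The next cell summarizes the predictors with `np.mean` and `np.std` and asks what changes when the `axis` argument is omitted or set to 1. A minimal sketch on a small NumPy array shows the three cases. The array `a` is illustrative and is not part of the lab data, and when the input is a pandas DataFrame rather than an array, the result without `axis` may depend on the pandas version.

```python
import numpy as np

# Illustrative 2 x 3 array: 2 rows (observations), 3 columns (variables).
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

overall = np.mean(a)          # no axis: mean of all six entries
by_col = np.mean(a, axis=0)   # axis=0: collapse rows, one mean per column
by_row = np.mean(a, axis=1)   # axis=1: collapse columns, one mean per row

print(overall)  # 3.5
print(by_col)   # [2.5 3.5 4.5]
print(by_row)   # [2. 5.]
```

For a data matrix with observations in rows, `axis=0` gives the usual per-variable summary statistics, while `axis=1` averages across different variables within each observation, which is rarely meaningful.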


In [2]:
# Xlevels is the array (matrix) with the basic features (predictors) in levels.
Xlevels = data[["rooms", "age", "lland", "larea", "lintst"]]
# Dimensions of this array:
print(np.shape(Xlevels))

# Outcome (dependent variable)
y = data["lprice"].to_numpy()   # 1-D NumPy array (Series.ravel() is deprecated in recent pandas)
print(np.shape(y))

# Means
print(np.mean(Xlevels,axis=0))
print(np.mean(y,axis=0))
# SDs
print(np.std(Xlevels,axis=0))
print(np.std(y,axis=0))

# What would happen if we asked for the mean of X without specifying the axis?
print(np.mean(Xlevels))
# What would happen if we asked for the mean of X for axis=1?
print(np.mean(Xlevels,axis=1))


######### MODEL ##########
# Various models and specifications below. Uncomment to choose one.

# Lasso
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html
# Older examples pass normalize=True as a keyword argument here. That option
# centered each predictor and divided it by its L2 norm (not by its standard deviation).
# nb: normalize was deprecated in sklearn 1.0 and removed in release 1.2, so it is
# omitted below and the predictors are not rescaled before the penalty is applied.
model = LassoCV(fit_intercept=True,cv=5)

# Ridge
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html
# (normalize=True is not accepted by sklearn 1.2 or later, so it is dropped here.)
# model = RidgeCV(fit_intercept=True,cv=5)

######### PREDICTORS ########
# If using just the basic variables in levels, X is just X levels.
# For a polynomial, use PolynomialFeatures.

# Basic model - just the raw variables in levels.
# X = Xlevels

# Polynomial including interactions.
# The argument to PolynomialFeatures is the max degree of the polynomial.
poly = PolynomialFeatures(degree=4)
X = poly.fit_transform(Xlevels)

######### ESTIMATE #########
print("Training Model ... ")
model.fit(X,y)
print("Complete.")

######### RESULTS ##########

# Penalty hyperparameter (called alpha in sklearn, called lambda in the lectures).
model.alpha_

# R-squared (called score).
# In-sample R-squared (works for both LassoCV and RidgeCV):
# model.score(X,y)
# For RidgeCV, the cross-validated score of the chosen alpha:
# model.best_score_

# Estimated coefficients and intercept (returned separately):
b = model.coef_
print("Coefficients: ", b)
model.intercept_

# Dimension of X?
print("Shape ", np.shape(X))
# How many selected?
print("nonzero count: ", np.count_nonzero(b))

########## PREDICTED VALUES ########

# Predicted values:
yhatvalues = model.predict(X)

# Residuals:
ehatvalues = y - yhatvalues

# In-sample RMSE = SD of residuals (no DOF adjustment)
np.std(ehatvalues)

(142, 5)
(142,)
rooms      6.591549
age       13.978873
lland     10.278893
larea      7.655591
lintst     9.450821
dtype: float32
11.629019
rooms      0.823552
age       23.852375
lland      0.717999
larea      0.355933
lintst     0.715446
dtype: float32
0.38854474
9.5911455
179    22.015919
180    19.823826
181    12.508345
182    14.192389
183    12.099095
         ...    
316    11.273715
317     7.202640
318    10.698859
319    10.271528
320     6.984193
Length: 142, dtype: float32
Training Model ... 


  model = cd_fast.enet_coordinate_descent(

Complete.
Coefficients:  [ 0.0000000e+00  0.0000000e+00 -0.0000000e+00  0.0000000e+00
  0.0000000e+00  0.0000000e+00  0.0000000e+00 -0.0000000e+00
  0.0000000e+00  0.0000000e+00  0.0000000e+00 -0.0000000e+00
 -0.0000000e+00 -0.0000000e+00 -0.0000000e+00  0.0000000e+00
  0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
  0.0000000e+00  0.0000000e+00 -0.0000000e+00  0.0000000e+00
  0.0000000e+00  0.0000000e+00 -0.0000000e+00 -0.0000000e+00
 -0.0000000e+00 -0.0000000e+00  0.0000000e+00  0.0000000e+00
  0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
 -0.0000000e+00 -0.0000000e+00 -0.0000000e+00 -0.0000000e+00
 -0.0000000e+00 -0.0000000e+00 -0.0000000e+00 -0.0000000e+00
 -0.0000000e+00 -0.0000000e+00  0.0000000e+00  0.0000000e+00
  0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
  0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00
  0.0000000e+00 -0.0000000e+00  0.0000000e+00  0.0000000e+00
  0.0000000e+00 -0.0000000e+00 -0.0000000e+00 -0.0000000e+00



0.3319131
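Because the `normalize` option is no longer available, one common way to put the polynomial terms on a comparable scale before the penalty is applied is to standardize them inside a pipeline. The sketch below is an assumption about how that could be done, not the lab's own specification: it uses synthetic data (`X_raw`, `y_syn`), a degree-2 polynomial, and a larger `max_iter` chosen for illustration. Swapping in `Xlevels` and `y` from the cell above would be the natural next step.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (illustrative only): 150 observations, 3 raw predictors.
rng = np.random.default_rng(1)
X_raw = rng.normal(loc=5.0, scale=2.0, size=(150, 3))
y_syn = (1.0 + 0.5 * X_raw[:, 0] - 0.2 * X_raw[:, 1] ** 2
         + rng.normal(scale=0.1, size=150))

# Expand to polynomial terms, standardize each term, then fit the lasso.
# include_bias=False drops the constant column; LassoCV fits its own intercept.
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LassoCV(cv=5, max_iter=50_000),
)
pipe.fit(X_raw, y_syn)

# make_pipeline names each step after its lowercased class name.
lasso = pipe.named_steps["lassocv"]
resid = y_syn - pipe.predict(X_raw)
rmse = np.sqrt(np.mean(resid ** 2))

print("alpha:", lasso.alpha_)
print("nonzero count:", np.count_nonzero(lasso.coef_))
print("in-sample RMSE:", rmse)
```

Since the lasso is fit with an intercept, the in-sample residuals average to zero, so this RMSE coincides with `np.std(resid)`, the quantity reported at the end of the cell above.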