# Intro to patsy

[patsy](https://patsy.readthedocs.io/en/latest/index.html) lets us build modeling tables for sklearn or statsmodels using a formula syntax that mirrors R (`y ~ x`). It has a couple of advantages:

1. Patsy makes data transformations easy, e.g. handling categorical variables, adding interaction terms, or applying arbitrary Python transformations. It also "knows how to apply ‘the same’ transformation used on original data to new data".
2. Formula syntax makes our models more transparent and easier to document than having transformations listed and computed separately. The formula strings could, for example, be specified in a kernel parameters file.
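
As a rough illustration of the second point, a formula can live in a configuration file as plain text and be handed to patsy unchanged. This is only a sketch: the JSON content, the `params` dictionary, and the toy column names (`y`, `x1`, `x2`) are made up for the example.

```python
import json

import pandas as pd
from patsy import dmatrices

# Hypothetical contents of a parameters file; the formula is just a string.
params_text = '{"formula": "y ~ x1 + x2 - 1"}'
params = json.loads(params_text)

# Toy data with made-up column names.
toy = pd.DataFrame({
    "y": [1.0, 2.0, 3.0],
    "x1": [0.5, 1.5, 2.5],
    "x2": [3.0, 2.0, 1.0],
})

# The same call works for any formula string read from the file.
y_toy, X_toy = dmatrices(params["formula"], toy, return_type="dataframe")
print(list(X_toy.columns))
```

Changing the model then only requires editing the formula string, not the code that builds the matrices.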

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

from patsy import dmatrices, dmatrix, build_design_matrices

In [2]:
df = (pd.read_csv("data/imports-85.data", header=None, 
                 names=['symbol', 'normalized_losses', 'make', 'fuel_type',
                       'aspiration', 'num_doors', 'body_style', 'drive_wheels',
                       'engine_location', 'wheel_base', 'length', 'width',
                       'height', 'curb_weight', 'engine_type', 'num_cylinders',
                       'engine_size', 'fuel_system', 'bore', 'stroke',
                       'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg',
                       'highway_mpg', 'price'])
     .query("(price != '?') & (horsepower != '?') & (num_doors != '?')"))
df['price'] = df['price'].astype('float')
df['horsepower'] = df['horsepower'].astype('float')

In [3]:
df.head()

,symbol,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000,21,27,13495.0
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000,21,27,16500.0
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154.0,5000,19,26,16500.0
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102.0,5500,24,30,13950.0
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115.0,5500,18,22,17450.0


# Basic use

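* One-hot encode `make`
* `engine_size` (no transform)
* log of `horsepower`
* mean of `city_mpg` and `highway_mpg` (wrapped in `I()` so that `+` and `/` are read as arithmetic rather than formula operators)
* no intercept (`-1`)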

In [4]:
formula = "price ~ make + engine_size + np.log(horsepower) + I((city_mpg + highway_mpg)/2) -1"
y, X = dmatrices(formula, df, return_type='dataframe')

In [5]:
X.head()

,make[alfa-romero],make[audi],make[bmw],make[chevrolet],make[dodge],make[honda],make[isuzu],make[jaguar],make[mazda],make[mercedes-benz],...,make[plymouth],make[porsche],make[saab],make[subaru],make[toyota],make[volkswagen],make[volvo],engine_size,np.log(horsepower),I((city_mpg + highway_mpg) / 2)
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,130.0,4.70953,24.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,130.0,4.70953,24.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,152.0,5.036953,22.5
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,109.0,4.624973,27.0
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,136.0,4.744932,20.0


In [6]:
# y is a one-column DataFrame; flatten it to the 1-D array that LassoCV expects
reg = linear_model.LassoCV(cv=5).fit(X, y.values.ravel())

# Stateful transforms

Stateful transforms are useful for learning a transformation on the training data and then applying the same transformation to the test set.

In [7]:
df_train, df_test = train_test_split(df, test_size=0.2)

Center price using the mean of the training data, then apply the same transformation to the test data.

In [8]:
formula = """center(price) ~ make + engine_size + np.log(horsepower) 
                             + I((city_mpg + highway_mpg)/2) -1"""
y_train, X_train = dmatrices(formula, df_train, return_type='dataframe')

In [9]:
# build_design_matrices returns a list with one matrix per design_info passed in
X_test = build_design_matrices([X_train.design_info], df_test, return_type='dataframe')
y_test = build_design_matrices([y_train.design_info], df_test, return_type='dataframe')
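
To see what "stateful" means here, the toy sketch below centers a column on a three-row training frame and then reuses the stored design information on a new row. The frames `train` and `new` and the column `x` are invented for the illustration; the point is that the new value is shifted by the training mean, not by its own mean.

```python
import pandas as pd
from patsy import dmatrix, build_design_matrices

# Toy data: the training mean of x is 2.0.
train = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
new = pd.DataFrame({"x": [10.0]})

# center(x) memorizes the mean of the data it is first built on.
X_train_demo = dmatrix("center(x) - 1", train, return_type="dataframe")

# Reusing design_info applies the memorized mean to the new data.
(X_new_demo,) = build_design_matrices(
    [X_train_demo.design_info], new, return_type="dataframe"
)

print(X_train_demo["center(x)"].tolist())  # training values minus 2.0
print(X_new_demo["center(x)"].tolist())    # 10.0 minus the training mean
```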

In [10]:
# flatten the one-column y_train DataFrame to the 1-D array that LassoCV expects
reg = linear_model.LassoCV(cv=5).fit(X_train, y_train.values.ravel())

In [11]:
reg.predict(X_test[0])

array([ -5011.6406727 ,  -2490.84407756,   7455.0676381 ,  -6074.39662516,
         2590.99423286,  -2025.74067599, -10320.69491944,   9555.73146738,
        -5312.25857592,  -5166.67513989,   5621.86486527,   8580.55402855,
         5621.86486527,   2393.35188775,  -6074.39662516,  -6892.17683902,
        -8399.91363299,  -9465.03234324,  17356.00871805,     52.43783544,
        -7157.27513769,  -9017.68874395,  -2135.80450747,  10813.76700717,
         2590.99423286,  -3771.3649352 ,  -2025.74067599,  -1893.19152666,
        10813.76700717,  -6892.17683902, -12163.34872342,  -6074.39662516,
         2238.31742056,  -1205.59770435,  -3926.39940238,    850.09548923,
        -1940.52492016,  -2910.97684342,   2613.47955071,  -1055.28875273])
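
Because the response was `center(price)`, these predictions are deviations from the mean training price rather than prices in dollars. A minimal sketch of getting back to the original scale, on toy data, is shown below. The frames and column names are made up, and ordinary least squares via `np.linalg.lstsq` stands in for `LassoCV` purely to keep the example small.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices, build_design_matrices

# Toy data where price = 100 + 10 * x exactly.
train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                      "price": [110.0, 120.0, 130.0, 140.0]})
test = pd.DataFrame({"x": [5.0]})

# The response is centered with the training mean (125.0 here).
y_tr, X_tr = dmatrices("center(price) ~ x", train, return_type="dataframe")
(X_te,) = build_design_matrices([X_tr.design_info], test, return_type="dataframe")

# Stand-in model: plain least squares on the centered response.
coef, *_ = np.linalg.lstsq(X_tr.values, y_tr.values.ravel(), rcond=None)
centered_pred = X_te.values @ coef

# Add the training mean back to express predictions as prices.
price_pred = centered_pred + train["price"].mean()
print(price_pred)
```

In the notebook's own variables, the analogous step would be adding `df_train['price'].mean()` to the output of `reg.predict(X_test[0])`.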

# Other cool functions

## Interaction terms
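Patsy picks the encoding of an interaction term from the types of the features involved: a categorical-by-numeric interaction yields one numeric column per category level, while a categorical-by-categorical interaction yields one indicator column per combination of levels.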

In [12]:
formula = "fuel_type:horsepower + fuel_type:num_doors -1"
X = dmatrix(formula, df, return_type='dataframe')
X.head()

,fuel_type[diesel]:num_doors[four],fuel_type[gas]:num_doors[four],fuel_type[diesel]:num_doors[two],fuel_type[gas]:num_doors[two],fuel_type[diesel]:horsepower,fuel_type[gas]:horsepower
0,0.0,0.0,0.0,1.0,0.0,111.0
1,0.0,0.0,0.0,1.0,0.0,111.0
2,0.0,0.0,0.0,1.0,0.0,154.0
3,0.0,1.0,0.0,0.0,0.0,102.0
4,0.0,1.0,0.0,0.0,0.0,115.0


## Apply your own functions

First, a one-variable function that right-censors the price at a maximum of $30,000:

In [17]:
print("before censoring: ", df['price'].max())

def my_censor(ser):
    return np.minimum(ser, 3e4)

formula = "my_censor(price) ~ horsepower -1"
y, X = dmatrices(formula, df, return_type='dataframe')

print("after censoring: ", y.max().values[0])

before censoring:  45400.0
after censoring:  30000.0


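We can also apply more complex functions like this one, which averages two series (e.g. mpg) and returns high, medium, and low categories. Note that patsy then automatically one-hot encodes the categories. Because this formula keeps the intercept, the first level (`high`) becomes the reference category and gets no column of its own.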

In [18]:
def threshold_var(ser1, ser2, low_thresh=23, high_thresh=30):
    return (pd.concat([ser1, ser2],axis=1)
            .apply(np.mean, axis=1)
            .apply(lambda x: 'low' if x < low_thresh else ('medium' if x < high_thresh else 'high'))
           )

formula = "threshold_var(city_mpg, highway_mpg, low_thresh=25)"
X = dmatrix(formula, df, return_type='dataframe')
X.head()

,Intercept,"threshold_var(city_mpg, highway_mpg, low_thresh=25)[T.low]","threshold_var(city_mpg, highway_mpg, low_thresh=25)[T.medium]"
0,1.0,1.0,0.0
1,1.0,1.0,0.0
2,1.0,1.0,0.0
3,1.0,0.0,1.0
4,1.0,1.0,0.0
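
The row-wise `apply` calls in `threshold_var` can be slow on large tables. A vectorized variant might look like the sketch below. The function name `threshold_var_vectorized` and the toy frame are hypothetical, introduced only for this illustration, and the thresholds follow the same rules as `threshold_var`.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

def threshold_var_vectorized(ser1, ser2, low_thresh=23, high_thresh=30):
    # Average the two inputs element-wise, then bucket the averages.
    avg = (np.asarray(ser1) + np.asarray(ser2)) / 2
    # np.select uses the first condition that is True for each element.
    return np.select([avg < low_thresh, avg < high_thresh],
                     ['low', 'medium'], default='high')

# Toy data with averages of 24, 27 and 45.
toy = pd.DataFrame({'city_mpg': [21, 24, 40], 'highway_mpg': [27, 30, 50]})

labels = threshold_var_vectorized(toy['city_mpg'], toy['highway_mpg'], low_thresh=25)
print(list(labels))

# The returned strings are treated as categorical, as with threshold_var.
X_toy = dmatrix("threshold_var_vectorized(city_mpg, highway_mpg, low_thresh=25)",
                toy, return_type='dataframe')
print(X_toy.shape)
```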
