**DS 301: Applied Data Modeling and Predictive Analysis**

**Lecture 11 – Support Vector Machine**

# Scikit-Learn Pipeline

Nok Wongpiromsarn, 21 September 2020

**Description:** This notebook shows how to perform regularized polynomial regression, as in Step 1.4 of C04-PolynomialRegression, using a Scikit-Learn `Pipeline`.
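
For reference, the model fit by the degree-2 pipeline in this notebook can be written out as below. This is a sketch in notation introduced here, not taken from C04-PolynomialRegression: $x$ is `YearBuilt`, $z_1$ and $z_2$ are the standardized versions of $x$ and $x^2$, and $\alpha$ is the `alpha` argument of `Ridge`.

$$
\hat{y} = w_0 + w_1 z_1 + w_2 z_2, \qquad
\min_{w_0, w_1, w_2} \; \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 + \alpha \left(w_1^2 + w_2^2\right)
$$

The intercept $w_0$ is left out of the penalty, which is how scikit-learn's `Ridge` treats it.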

**Set up the training and test sets with house-price data**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

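# Load the house-price data; YearBuilt is the single feature, SalePrice the target
df = pd.read_csv("datasets/house-price.csv")
X = df[['YearBuilt']].values  # 2-D array of shape (n_samples, 1), as scikit-learn expects
y = df['SalePrice']

# Hold out 20% of the rows as a test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)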

**Regularized polynomial regression using Pipeline**

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

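# Chain the steps so a single fit/predict call runs them in order:
# degree-2 polynomial features -> standardization -> ridge regression
model = Pipeline([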
    ("poly_features", PolynomialFeatures(degree=2, include_bias=False)),
    ("std_scaler", StandardScaler()),
    ("regul_reg", Ridge(alpha=0.05, solver="cholesky")),
])
model.fit(X_train, y_train)
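
To make explicit what the pipeline does, the sketch below fits the same three steps twice, once through `Pipeline` and once by hand, and checks that the predictions agree. It uses synthetic data as a stand-in for the house-price file, and the names `X_demo`, `y_demo`, `pipe`, and `X_new` are made up for this illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

# Synthetic stand-in for the house-price data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(1900, 2010, size=(50, 1))
y_demo = 1000.0 * (X_demo[:, 0] - 1900) + rng.normal(0, 5000, size=50)

# Same three steps as the pipeline above
pipe = Pipeline([
    ("poly_features", PolynomialFeatures(degree=2, include_bias=False)),
    ("std_scaler", StandardScaler()),
    ("regul_reg", Ridge(alpha=0.05, solver="cholesky")),
])
pipe.fit(X_demo, y_demo)

# Manual version: fit_transform each transformer in turn, then fit the final estimator
poly = PolynomialFeatures(degree=2, include_bias=False)
scaler = StandardScaler()
ridge = Ridge(alpha=0.05, solver="cholesky")
X_poly = poly.fit_transform(X_demo)
X_scaled = scaler.fit_transform(X_poly)
ridge.fit(X_scaled, y_demo)

# At prediction time the transformers only transform; nothing is refit
X_new = np.array([[1950.0], [2000.0]])
manual_pred = ridge.predict(scaler.transform(poly.transform(X_new)))
pipe_pred = pipe.predict(X_new)

# Check that both routes give the same predictions
same = np.allclose(manual_pred, pipe_pred)
print(same)
```

The practical benefit of the pipeline is that the scaler and polynomial expansion are fit on the training data only and then reused unchanged at prediction time, without having to track each step by hand.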

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = sqrt(mse)  # RMSE is the square root of MSE; reuse mse instead of recomputing it
print("MSE: {}".format(mse))
print("RMSE: {}".format(rmse))

**Include categorical column**

In [None]:
# Keep X as a DataFrame so the ColumnTransformer below can select columns by name
X = df[['YearBuilt', 'MSZoning']]
y = df['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

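# Numeric column: degree-2 polynomial features, then standardization
num_transformer = Pipeline([
    ("poly_features", PolynomialFeatures(degree=2, include_bias=False)),
    ("std_scaler", StandardScaler()),
])

# Route each column to its own transformer. By default OneHotEncoder raises an error
# on categories not seen during fit; handle_unknown="ignore" would tolerate them.
preprocessor = ColumnTransformer(transformers=[
    ('num', num_transformer, ['YearBuilt']),
    ('cat', OneHotEncoder(), ['MSZoning'])
])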

# Full model: column-wise preprocessing followed by ridge regression
model = Pipeline([
    ("preprocessor", preprocessor),
    ("regul_reg", Ridge(alpha=0.05, solver="cholesky")),
])
model.fit(X_train, y_train)
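
As a rough illustration of what the preprocessor hands to `Ridge`, the sketch below runs the same kind of `ColumnTransformer` on a tiny made-up frame and inspects the output. The frame `toy`, the names `num_steps` and `prep`, and the zoning values are assumptions for this example. The encoder here is constructed with `handle_unknown="ignore"`, which makes an unseen category encode as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Toy frame with the same column names as above (values are made up for illustration)
toy = pd.DataFrame({
    "YearBuilt": [1950, 1975, 2000, 2005],
    "MSZoning": ["RL", "RM", "RL", "FV"],
})

num_steps = Pipeline([
    ("poly_features", PolynomialFeatures(degree=2, include_bias=False)),
    ("std_scaler", StandardScaler()),
])
prep = ColumnTransformer(transformers=[
    ("num", num_steps, ["YearBuilt"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["MSZoning"]),
])

out = prep.fit_transform(toy)
# Convert to a dense array if the result came back sparse
out = out.toarray() if hasattr(out, "toarray") else out
print(out.shape)  # 2 numeric columns + 3 one-hot columns

# A category not seen during fit ("RH" here) encodes as all zeros
unseen = pd.DataFrame({"YearBuilt": [1990], "MSZoning": ["RH"]})
row = prep.transform(unseen)
row = row.toarray() if hasattr(row, "toarray") else row
print(row[0, 2:])
```

The numeric block comes first and the one-hot block second, following the order of the `transformers` list.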

In [None]:
# Evaluate the model with the categorical column on the held-out test set
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = sqrt(mse)  # reuse mse instead of recomputing it
print("MSE: {}".format(mse))
print("RMSE: {}".format(rmse))