# **Pipelines**

Pipelines keep preprocessing and modeling code organized by bundling both steps into a single object.

Building a pipeline takes three steps:
* Define the preprocessing steps
* Define the model
* Create and evaluate the pipeline

In [3]:
# import modules/libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
print('Modules Loaded')

Modules Loaded


## **Prepare data**

In [6]:
# Load data
data = pd.read_csv("/home/vosti/machine_learning/csvs/melb_data.csv")

# Define features and targets
X = data.drop(['Price'], axis=1)
y = data.Price

# Divide data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

# Get low-cardinality categorical columns (fewer than 10 unique values)
cat_cols = [col for col in X_train.columns if X_train[col].nunique() < 10 
           and X_train[col].dtype == "object"]

# Get numerical columns
num_cols = [col for col in X_train.columns if X_train[col].dtype in ['int64', 'float64']]

# Keep only the selected categorical and numerical columns
my_cols = cat_cols + num_cols
new_X_train = X_train[my_cols].copy()
new_X_val = X_val[my_cols].copy()

## **Define Preprocessing**

Numerical columns with *missing values* are imputed, while *categorical* columns are imputed and then one-hot encoded.

We use `ColumnTransformer` to bundle together handling of numerical and categorical data.

In [8]:
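# Numerical data: fill missing values with a constant
# (fill_value defaults to 0 for numerical data)
num_trans = SimpleImputer(strategy="constant")

# Categorical data: impute the most frequent value, then one-hot encode
cat_trans = Pipeline(steps=[('imp', SimpleImputer(strategy="most_frequent")), 
                           ('ohe', OneHotEncoder(handle_unknown='ignore'))])

# Bundle the handling of numerical and categorical data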
pre_processor = ColumnTransformer(transformers=[('imputation', num_trans, num_cols),
                                               ('cat', cat_trans, cat_cols)])
pre_processor

ColumnTransformer(transformers=[('imputation',
                                 SimpleImputer(strategy='constant'),
                                 ['Rooms', 'Distance', 'Postcode', 'Bedroom2',
                                  'Bathroom', 'Car', 'Landsize', 'BuildingArea',
                                  'YearBuilt', 'Lattitude', 'Longtitude',
                                  'Propertycount']),
                                ('cat',
                                 Pipeline(steps=[('imp',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('ohe',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['Type', 'Method', 'Regionname'])])
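
To make the effect of the preprocessor concrete, here is a minimal sketch on a four-row toy frame. The values and the `toy_*` names are invented for illustration and are not part of the Melbourne data. It uses the same transformers as above, applied to one numerical and one categorical column.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame (hypothetical values) with one missing entry per column
toy = pd.DataFrame({
    "Rooms": [2.0, np.nan, 4.0, 3.0],
    "Type": ["h", "u", "h", np.nan],
})

# Same transformers as in the notebook, applied to the toy columns
toy_num_trans = SimpleImputer(strategy="constant")
toy_cat_trans = Pipeline(steps=[
    ("imp", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])
toy_pre = ColumnTransformer(transformers=[
    ("imputation", toy_num_trans, ["Rooms"]),
    ("cat", toy_cat_trans, ["Type"]),
])

toy_out = toy_pre.fit_transform(toy)
# The result may be sparse or dense; convert to a dense array for printing
toy_out = toy_out.toarray() if hasattr(toy_out, "toarray") else np.asarray(toy_out)
print(toy_out)
```

The first output column is `Rooms`, with the missing value filled with 0. The remaining two columns are the one-hot encoding of `Type`, where the missing entry has been filled with the most frequent category, `h`.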

## **Define Model**

In [9]:
model = RandomForestRegressor(n_estimators=100, random_state=0)

## **Create and Evaluate Pipeline**

We bundle the preprocessing and modeling steps together using a `Pipeline`.
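
As a rough illustration of what the bundling saves, the sketch below compares a pipeline with the equivalent manual steps on synthetic data. The data, the `toy_pipeline` name, and the smaller `n_estimators=10` are assumptions chosen to keep the example fast; they are not taken from the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Synthetic data (hypothetical): 40 rows, 3 features, some values missing
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 3))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=40)
X_toy[::7, 1] = np.nan

X_tr, X_te = X_toy[:30], X_toy[30:]
y_tr = y_toy[:30]

# Manual approach: fit the imputer, transform, fit the model,
# then remember to transform the validation data the same way
imp = SimpleImputer(strategy="constant")
model_manual = RandomForestRegressor(n_estimators=10, random_state=0)
model_manual.fit(imp.fit_transform(X_tr), y_tr)
manual_pred = model_manual.predict(imp.transform(X_te))

# Pipeline approach: one fit call and one predict call
toy_pipeline = Pipeline(steps=[
    ("preprocess", SimpleImputer(strategy="constant")),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])
toy_pipeline.fit(X_tr, y_tr)
pipe_pred = toy_pipeline.predict(X_te)

print(np.allclose(manual_pred, pipe_pred))
```

Both approaches should produce the same predictions. The pipeline simply applies the fitted preprocessing to whatever data is passed to `predict`, so there is no separate transform step to forget.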

In [13]:
# Create the pipeline
my_pipeline = Pipeline(steps=[('preprocess', pre_processor), ('model', model)])

# Fit the pipeline on the training data
my_pipeline.fit(new_X_train, y_train)

# Predict on the validation data
predict = my_pipeline.predict(new_X_val)

# Get the mean absolute error (MAE)
score = mean_absolute_error(y_val, predict)
print("The MAE is: ", score)

The MAE is:  160679.18917034855
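
For reading the score: the MAE is the average absolute difference between predictions and true values, in the same units as the target (here, price). A small sketch with made-up numbers, comparing a manual calculation against `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true values and predictions
y_true_toy = np.array([100.0, 200.0, 300.0])
y_pred_toy = np.array([110.0, 190.0, 330.0])

# Absolute errors are 10, 10 and 30; the MAE is their mean
manual_mae = np.mean(np.abs(y_true_toy - y_pred_toy))
library_mae = mean_absolute_error(y_true_toy, y_pred_toy)
print(manual_mae, library_mae)
```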
