Example from a YouTube tutorial by Greg Hogg on how to create data pipelines for the California Housing price dataset.

link to the video: https://www.youtube.com/watch?v=xIqX1dqcNbY&t=60s

This notebook contains:
1. Notes I wrote on the definition of each scikit-learn class used
2. The actual implementation of those classes on the dataset

The dataset Greg processes is a simplified version of the California Housing dataset. In my own attempt at processing the same dataset, I went beyond Greg's implementation: my pipeline one-hot encodes the 'ocean_proximity' column and imputes the missing values in the 'total_bedrooms' column, neither of which is covered in Greg's tutorial.

In [None]:
'''Note on StandardScaler implementation'''
#from sklearn.preprocessing import StandardScaler, MinMaxScaler, FunctionTransformer
1) StandardScaler 

#StandardScaler standardizes each feature: it rescales the feature so that it has a mean of zero and a standard deviation of one.
#This puts all features on the same scale, so that no single feature dominates the learning process merely because of its larger magnitude.

#The transformation performed by StandardScaler can be expressed mathematically as:

#z = (x − μ) / σ

#where x is the original feature value, μ is the mean of the feature, σ is its standard deviation, and z is the standardized feature value.

1.1)transform(X, copy=None)
#Perform standardization by centering and scaling.

# X: the data to scale along the features axis.
# Returns X_tr: the transformed array.

1.2)fit_transform(X, y=None, **fit_params)
#Fit to data, then transform it.

#Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.

1.3)fit(X, y=None, sample_weight=None)
#Compute the mean and standard deviation to be used for later scaling.
# Returns the fitted scaler.
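
The formula in the notes above can be checked by hand. The following is a minimal sketch on a made-up array (`X_demo` and its values are illustrative and not part of the housing data); it compares `StandardScaler` with a manual z-score.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values): 4 samples, 2 features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0],
                   [4.0, 400.0]])

scaler = StandardScaler()
scaler.fit(X_demo)                    # learns the per-column mean and standard deviation
X_scaled = scaler.transform(X_demo)   # applies z = (x - mean) / std column by column

# Manual z-score. np.std defaults to the population standard deviation (ddof=0),
# which is the one StandardScaler uses.
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

Calling `fit_transform(X_demo)` on a fresh scaler gives the same array as calling `fit` followed by `transform`.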



In [None]:
'''Note on Column Transformer implementation'''
2) ColumnTransformer(transformers, *, remainder='drop', sparse_threshold=0.3, n_jobs=None, transformer_weights=None, verbose=False, verbose_feature_names_out=True, force_int_remainder_cols=True)
#use ColumnTransformer when the features in a dataframe need different preprocessing

#transformers argument = list of tuples

#List of (name, transformer, columns) tuples specifying the transformer objects to be applied to subsets of the data.
#example: 
ct = ColumnTransformer(
    [("norm1", Normalizer(norm='l1'), [0, 1]),
     ("norm2", Normalizer(norm='l1'), slice(2, 4))])

2.1)fit_transform()
#Fit all transformers, transform the data and concatenate results.
X_trans = ct.fit_transform(X) 
# X is dataframe 


The steps of the YouTube tutorial are followed one by one below.


In [5]:
'''Previewing the California Housing Dataset'''

import pandas as pd
import numpy as np
rawdata_df = pd.read_csv('housing.csv')
rawdata_df  #not yet imputed and not yet one hot encoded


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,INLAND
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,INLAND
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,INLAND
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,INLAND


In [6]:
'''checking missing values in each column'''
rawdata_df.isna().sum()
#need to use imputer


longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

In [7]:
'''Combining use of Imputer with Column Transformer'''
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
# deep=True (the default) copies both the data and the index, so changes to the copy do not affect rawdata_df
freshdata_df = rawdata_df.copy(deep=True)

ct = ColumnTransformer(
    [('imputer', SimpleImputer(strategy='mean'), ['total_bedrooms'])])

# remainder='drop' (the default) means only the imputed column is returned, with shape (n_rows, 1)
freshdata_df['total_bedrooms'] = ct.fit_transform(freshdata_df).ravel()


freshdata_df.isna().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
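
The cell above relies on the ColumnTransformer returning only the imputed column. A small sketch on made-up values (the frame `toy` is hypothetical and only borrows two column names from the housing data) shows the shape of that output and the effect of the mean strategy.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Made-up frame: one column with gaps, one complete column
toy = pd.DataFrame({'total_bedrooms': [100.0, np.nan, 300.0, np.nan],
                    'households': [90.0, 150.0, 280.0, 200.0]})

toy_ct = ColumnTransformer(
    [('imputer', SimpleImputer(strategy='mean'), ['total_bedrooms'])])

# Only the listed column is returned because remainder defaults to 'drop'
filled = toy_ct.fit_transform(toy)
print(filled.shape)

# Flatten the single column before writing it back into the frame
toy['total_bedrooms'] = filled.ravel()
print(toy['total_bedrooms'].tolist())
```

The two missing entries are replaced by the mean of the observed values in that column, and the other column is left as it was.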

In [None]:
'''Ocean proximity column data exploration'''

ozn_p = rawdata_df['ocean_proximity']
ozn_p

print(pd.get_dummies(ozn_p))
print(rawdata_df.groupby('ocean_proximity').count())

       <1H OCEAN  INLAND  ISLAND  NEAR BAY  NEAR OCEAN
0          False   False   False      True       False
1          False   False   False      True       False
2          False   False   False      True       False
3          False   False   False      True       False
4          False   False   False      True       False
...          ...     ...     ...       ...         ...
20635      False    True   False     False       False
20636      False    True   False     False       False
20637      False    True   False     False       False
20638      False    True   False     False       False
20639      False    True   False     False       False

[20640 rows x 5 columns]
                 longitude  latitude  housing_median_age  total_rooms  \
ocean_proximity                                                         
<1H OCEAN             9136      9136                9136         9136   
INLAND                6551      6551                6551         6551   
ISLAND                

In [None]:
'''One hot encode Ocean Proximity'''
dummies = pd.get_dummies(ozn_p)  # get_dummies already returns a DataFrame
# Concatenate the one-hot encoded columns with the imputed DataFrame, dropping the original categorical column

df_encoded_imputed = pd.concat([freshdata_df.drop(['ocean_proximity'], axis=1), dummies], axis=1)

df_encoded_imputed

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,False,False,False,True,False
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,False,False,False,True,False
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,False,False,False,True,False
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,False,False,False,True,False
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,False,True,False,False,False
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,False,True,False,False,False
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,False,True,False,False,False
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,False,True,False,False,False
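
`pd.get_dummies` derives its columns from whatever categories are present in the frame it is given. As a hedged alternative, not used in this notebook, scikit-learn's `OneHotEncoder` remembers the categories it saw during `fit`. The sketch below uses made-up category values; `train_cat` and `new_cat` are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up category values; 'ISLAND' appears only in the second frame
train_cat = pd.DataFrame({'ocean_proximity': ['INLAND', 'NEAR BAY', 'INLAND']})
new_cat = pd.DataFrame({'ocean_proximity': ['NEAR BAY', 'ISLAND']})

# handle_unknown='ignore' encodes unseen categories as an all-zero row
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_cat)

# transform returns a sparse matrix by default; convert it for printing
encoded_new = encoder.transform(new_cat).toarray()
print(encoder.categories_[0])
print(encoded_new)
```

The encoded output always has one column per category learned in `fit`, so later data is encoded with the same column layout.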


In [11]:
'''Whole dataset processing: one-hot encode the categorical column and impute missing values before splitting and training'''

def encode_impute(X):
    """Return a copy of X with total_bedrooms mean-imputed and ocean_proximity one-hot encoded."""
    df = X.copy(deep=True)
    ct = ColumnTransformer(
        [('imputer', SimpleImputer(strategy='mean'), ['total_bedrooms'])])
    # the transformer returns a single column of shape (n_rows, 1); flatten it before assignment
    df['total_bedrooms'] = ct.fit_transform(df).ravel()
    # get_dummies already returns a DataFrame with one indicator column per category
    dummies = pd.get_dummies(df['ocean_proximity'])
    # replace the categorical column with its one-hot encoded columns
    return pd.concat([df.drop(['ocean_proximity'], axis=1), dummies], axis=1)



F = encode_impute(rawdata_df)
F

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,False,False,False,True,False
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,False,False,False,True,False
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,False,False,False,True,False
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,False,False,False,True,False
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,False,True,False,False,False
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,False,True,False,False,False
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,False,True,False,False,False
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,False,True,False,False,False
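
`encode_impute` computes the imputation mean from the whole dataset, so rows that later land in the test set contribute to it. The effect is probably small here, but a stricter variant would fit the imputer on the training rows only. The sketch below shows that idea on made-up values; `train_part` and `test_part` are hypothetical names, and the split is done by position to keep the example deterministic.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values, split by position into a "training" and a "test" part
toy = pd.DataFrame({'total_bedrooms': [100.0, 200.0, np.nan, 400.0, np.nan, 600.0]})
train_part, test_part = toy.iloc[:4], toy.iloc[4:]

# The mean is learned from the training rows only
imputer = SimpleImputer(strategy='mean').fit(train_part[['total_bedrooms']])

# The same learned mean fills the gaps in both parts
train_filled = imputer.transform(train_part[['total_bedrooms']]).ravel()
test_filled = imputer.transform(test_part[['total_bedrooms']]).ravel()

print(imputer.statistics_)
print(test_filled)
```

The value 600.0 in the test part has no influence on the mean used for imputation.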


In [13]:
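'''Splitting the dataset into independent variables (X) and the dependent variable (Y)'''
# select the target by name rather than by position, which would break if the column order changed
X, Y = F.drop(['median_house_value'], axis=1), F['median_house_value'].to_numpy()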
display(pd.DataFrame(Y))
pd.DataFrame(X)

Unnamed: 0,0
0,452600.0
1,358500.0
2,352100.0
3,341300.0
4,342200.0
...,...
20635,78100.0
20636,77100.0
20637,92300.0
20638,84700.0


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,False,False,False,True,False
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,False,False,False,True,False
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,False,False,False,True,False
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,False,False,False,True,False
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,False,True,False,False,False
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,False,True,False,False,False
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,False,True,False,False,False
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,False,True,False,False,False


In [None]:
'''Splitting Dataset'''
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=104,test_size=0.20, shuffle=True)

X_train.shape, y_train.shape, X_test.shape, y_test.shape
X_train

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
18507,-122.01,36.99,29.0,227.0,45.0,112.0,41.0,6.4469,False,False,False,False,True
8529,-118.35,33.90,13.0,2887.0,853.0,2197.0,800.0,2.8777,True,False,False,False,False
16031,-122.45,37.72,46.0,1406.0,235.0,771.0,239.0,4.7143,False,False,False,True,False
4130,-118.19,34.13,52.0,2012.0,458.0,1314.0,434.0,3.9250,True,False,False,False,False
9036,-117.78,34.58,6.0,10263.0,1864.0,6163.0,1781.0,3.8803,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
14180,-117.06,32.72,31.0,2669.0,514.0,1626.0,499.0,3.1923,False,False,False,False,True
7896,-118.07,33.86,17.0,3666.0,562.0,2104.0,579.0,5.6818,True,False,False,False,False
6310,-118.02,34.04,27.0,5640.0,1001.0,3538.0,978.0,5.0650,True,False,False,False,False
17113,-122.18,37.47,37.0,2848.0,328.0,852.0,327.0,13.3670,False,False,False,True,False
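
The split sizes and the role of `random_state` can be checked without the housing file. This sketch uses placeholder arrays with the same number of rows as the dataset (20640); the arrays `X_idx` and `y_idx` are made up and hold only row indices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 20640  # row count of the housing DataFrame shown earlier
X_idx = np.arange(n_rows).reshape(-1, 1)   # placeholder features
y_idx = np.arange(n_rows)                  # placeholder targets

# Same arguments as the split in this notebook
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_idx, y_idx, random_state=104, test_size=0.20, shuffle=True)

# Repeating the call with the same random_state reproduces the same split
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_idx, y_idx, random_state=104, test_size=0.20, shuffle=True)

print(Xa_train.shape, Xa_test.shape)
print(np.array_equal(Xa_test, Xb_test))
```

With `test_size=0.20`, one fifth of the rows go to the test set and the rest to the training set.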


In [15]:
display(X_train.iloc[:,:2])
X_train.iloc[:,2:8]

Unnamed: 0,longitude,latitude
18507,-122.01,36.99
8529,-118.35,33.90
16031,-122.45,37.72
4130,-118.19,34.13
9036,-117.78,34.58
...,...,...
14180,-117.06,32.72
7896,-118.07,33.86
6310,-118.02,34.04
17113,-122.18,37.47


Unnamed: 0,housing_median_age,total_rooms,total_bedrooms,population,households,median_income
18507,29.0,227.0,45.0,112.0,41.0,6.4469
8529,13.0,2887.0,853.0,2197.0,800.0,2.8777
16031,46.0,1406.0,235.0,771.0,239.0,4.7143
4130,52.0,2012.0,458.0,1314.0,434.0,3.9250
9036,6.0,10263.0,1864.0,6163.0,1781.0,3.8803
...,...,...,...,...,...,...
14180,31.0,2669.0,514.0,1626.0,499.0,3.1923
7896,17.0,3666.0,562.0,2104.0,579.0,5.6818
6310,27.0,5640.0,1001.0,3538.0,978.0,5.0650
17113,37.0,2848.0,328.0,852.0,327.0,13.3670


In [17]:
'''Scaling large and negative numbers'''

from sklearn.preprocessing import StandardScaler, MinMaxScaler, FunctionTransformer
from copy import deepcopy

#std_clear = StandardScaler().fit(x_train[:, :2])
#min_max_scaler = MinMaxScaler().fit(x_train[:, 2:])

'''
Rule of thumb followed here:
StandardScaler is used for the coordinate columns (longitude, latitude), where values can be negative
MinMaxScaler is used for the remaining numeric columns, whose values are all positive
'''


'''
The scalers must be fitted on the training set only, so that the distribution of the test set does not leak into the model.
If a scaler is fitted on the full dataset before splitting, statistics computed from the test set are used to transform the training set, and the model is then trained on data that already carries test set information.

In general: fit the scaler on the training set, then apply that same fitted scaler to the test set.
'''

# fit on the training set only; the learned statistics are reused for every dataset transformed later
std_scaler = StandardScaler().fit(X_train.iloc[:, :2])        # longitude, latitude
min_max_scaler = MinMaxScaler().fit(X_train.iloc[:, 2:8])     # the six remaining numeric columns


def preprocessor(X):# just a function that uses both of the scalers
    A = np.copy(X)
    A[:, :2] = std_scaler.transform(X.iloc[:, :2])
    A[:, 2:8] = min_max_scaler.transform(X.iloc[:,2:8])
    return A

preprocessor(X_test)

array([[0.8930958618385856, -0.9790190791964893, 0.6078431372549019, ...,
        False, False, False],
       [2.480966455011275, -0.37155903926614825, 0.7843137254901961, ...,
        False, False, False],
       [-1.4337836866471898, 0.9835441267323072, 0.9999999999999999, ...,
        False, True, False],
       ...,
       [1.2725869469993505, -1.3341495640788423, 0.8235294117647058, ...,
        False, False, False],
       [0.923055684351279, -0.9743463096585645, 0.27450980392156865, ...,
        False, False, False],
       [-0.2154175711310296, 1.5442764712833938, 0.43137254901960786,
        ..., False, False, False]], dtype=object)

In [18]:
preprocess_transformer = FunctionTransformer(preprocessor)
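
The same column-wise scaling could also be written as a ColumnTransformer instead of a hand-written function. This is a sketch of that alternative, not the approach used in this notebook, and it runs on made-up rows: the frame `demo` only mimics the layout of the housing features, and `scaling_ct` is a hypothetical name.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up rows that only mimic the layout of the housing features
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'longitude': rng.uniform(-124.0, -114.0, 10),
    'latitude': rng.uniform(32.0, 42.0, 10),
    'total_rooms': rng.uniform(100.0, 5000.0, 10),
    'INLAND': rng.integers(0, 2, 10).astype(bool),
})
demo_train, demo_test = demo.iloc[:8], demo.iloc[8:]

scaling_ct = ColumnTransformer(
    [('std', StandardScaler(), ['longitude', 'latitude']),
     ('minmax', MinMaxScaler(), ['total_rooms'])],
    remainder='passthrough')   # leave the one-hot column untouched

# Fit on the training rows only, then reuse the fitted scalers on the test rows
scaling_ct.fit(demo_train)
train_scaled = scaling_ct.transform(demo_train)
test_scaled = scaling_ct.transform(demo_test)

print(train_scaled[:, :2].mean(axis=0))
print(test_scaled.shape)
```

A transformer like this can be placed directly as a Pipeline step, in which case it is refitted on whatever data the pipeline's `fit` receives.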

In [None]:
'''Notes on Pipeline class'''
3) Pipeline
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
pipe.fit(X_train, y_train).score(X_test, y_test)

3.1)fit(X, y=None, **params)
#Fit all the transformers one after the other and sequentially transform the data. Finally, fit the transformed data using the final estimator.

# X : iterable
#Training data. Must fulfill input requirements of first step of the pipeline.

# y : iterable, default=None
#Training targets. Must fulfill label requirements for all steps of the pipeline.
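
To illustrate what `fit` does according to the notes above, the sketch below compares a two-step Pipeline with the same steps carried out by hand. The data is synthetic and noise-free (`X_toy`, `y_toy`, and the coefficients are made up), and a regressor is used in place of the SVC from the notes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with a known linear relationship
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + 3.0

# Pipeline: fit the scaler, transform X, then fit the final estimator
pipe = Pipeline([('scaler', StandardScaler()), ('reg', LinearRegression())])
pipe.fit(X_toy, y_toy)

# The same sequence written out manually
manual_scaler = StandardScaler().fit(X_toy)
manual_model = LinearRegression().fit(manual_scaler.transform(X_toy), y_toy)
manual_preds = manual_model.predict(manual_scaler.transform(X_toy))

print(np.allclose(pipe.predict(X_toy), manual_preds))
print(pipe.score(X_toy, y_toy))
```

`predict` and `score` on the pipeline apply the already fitted scaler before calling the final estimator, so the scaler is never refitted on data passed to them.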

In [19]:
'''Creating Pipeline connecting Model and data processor'''
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

p1 = Pipeline([('Scaler', preprocess_transformer),
               ('Linear REgression', LinearRegression())]) #'Scaler' is just a name
p1

In [23]:
from sklearn.metrics import mean_absolute_error

def fit_and_print(p, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test):
    p.fit(X_train, y_train)
    train_preds = p.predict(X_train)
    test_preds = p.predict(X_test)
    # mean_absolute_error expects (y_true, y_pred); MAE is symmetric, so the values are unchanged
    print('Training error: ' + str(mean_absolute_error(y_train, train_preds)))
    print('Test error:     ' + str(mean_absolute_error(y_test, test_preds)))

fit_and_print(p1)

Training error: 49492.9835876938
Test error:     50934.82703488372


In [22]:
# KNeighborsRegressor actually performs better than LinearRegression on both the training and the test set
# however, its training error is noticeably lower than its test error, which suggests some overfitting

from sklearn.neighbors import KNeighborsRegressor
p2 = Pipeline([('Scaler', preprocess_transformer),
               ('KNR Regression', KNeighborsRegressor())]) #'Scaler' is just a name
p2

fit_and_print(p2)

Training error: 28179.792514534885
Test error:     37045.204312015514
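
One way to probe the gap between training and test error is to vary `n_neighbors` (p2 uses the default of 5) and watch how both errors move. The sketch below does this on synthetic data, not on the housing data, so its numbers are not comparable with the ones above; all names and values in it are made up.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: a smooth signal plus noise
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(300, 2))
y_syn = np.sin(X_syn[:, 0]) + 0.1 * X_syn[:, 1] + rng.normal(scale=0.3, size=300)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=104)

errors = {}
for k in (1, 5, 25):
    knn = KNeighborsRegressor(n_neighbors=k).fit(Xs_train, ys_train)
    train_mae = mean_absolute_error(ys_train, knn.predict(Xs_train))
    test_mae = mean_absolute_error(ys_test, knn.predict(Xs_test))
    errors[k] = (train_mae, test_mae)
    print(k, round(train_mae, 3), round(test_mae, 3))
```

With `n_neighbors=1` every training point is its own nearest neighbor, so the training error is zero while the test error is not. Larger values of `n_neighbors` average over more points and usually narrow the gap between the two errors.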
