
# Notebook implementing selective scaling and pipelines 

**Loading Data**

In [None]:
import pandas as pd
import numpy as np

In [None]:
train_df = pd.read_csv('/content/sample_data/california_housing_train.csv')
test_df = pd.read_csv('/content/sample_data/california_housing_test.csv')

In [None]:
test_df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.05 | 37.37 | 27.0 | 3885.0 | 661.0 | 1537.0 | 606.0 | 6.6085 | 344700.0 |
| 1 | -118.3 | 34.26 | 43.0 | 1510.0 | 310.0 | 809.0 | 277.0 | 3.599 | 176500.0 |
| 2 | -117.81 | 33.78 | 27.0 | 3589.0 | 507.0 | 1484.0 | 495.0 | 5.7934 | 270500.0 |
| 3 | -118.36 | 33.82 | 28.0 | 67.0 | 15.0 | 49.0 | 11.0 | 6.1359 | 330000.0 |
| 4 | -119.67 | 36.33 | 19.0 | 1241.0 | 244.0 | 850.0 | 237.0 | 2.9375 | 81700.0 |


**Splitting the training and test sets into features and target, and converting them to NumPy arrays**

In [None]:
X_train , y_train = train_df.to_numpy()[:,:-1], train_df.to_numpy()[:,-1]
X_test , y_test = test_df.to_numpy()[:,:-1], test_df.to_numpy()[:,-1]

X_train.shape , y_train.shape , X_test.shape , y_test.shape

((17000, 8), (17000,), (3000, 8), (3000,))
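The positional slice above relies on `median_house_value` being the last column. A minimal sketch of the same split done by column name, using a small made-up frame in place of the CSV files (the `split_features_target` helper and `demo_df` are hypothetical, not part of the original notebook):

```python
import numpy as np
import pandas as pd

def split_features_target(df, target="median_house_value"):
    """Return (X, y) as NumPy arrays, selecting the target column by name."""
    X = df.drop(columns=[target]).to_numpy()
    y = df[target].to_numpy()
    return X, y

# Tiny illustrative frame; the values are copied from the test_df.head() output.
demo_df = pd.DataFrame({
    "longitude": [-122.05, -118.30],
    "latitude": [37.37, 34.26],
    "median_income": [6.6085, 3.5990],
    "median_house_value": [344700.0, 176500.0],
})

X_demo, y_demo = split_features_target(demo_df)
print(X_demo.shape, y_demo.shape)
```

Selecting the target by name keeps the split correct even if the column order of the file changes.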

**Here we apply selective scaling: a `StandardScaler` on the first two columns (longitude and latitude) and a `MinMaxScaler` on the remaining six. Scaling different features in different ways may improve predictions, and it is a preprocessing option worth keeping in mind when scaling your original data.**

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, FunctionTransformer

# Fit each scaler once, on the training data only.
std_scaler = StandardScaler().fit(X_train[:, :2])    # longitude, latitude
min_max_scaler = MinMaxScaler().fit(X_train[:, 2:])  # remaining six features

def Preprocessor(X):
  # Work on a copy so the caller's array is left unchanged.
  A = np.copy(X)
  A[:, :2] = std_scaler.transform(X[:, :2])
  A[:, 2:] = min_max_scaler.transform(X[:, 2:])
  return A

In [None]:
preprocess_transformer = FunctionTransformer(Preprocessor)
preprocess_transformer

FunctionTransformer(accept_sparse=False, check_inverse=True,
                    func=<function Preprocessor at 0x7f64e7021830>,
                    inv_kw_args=None, inverse_func=None, kw_args=None,
                    validate=False)
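As an alternative to a hand-written function, scikit-learn's `ColumnTransformer` can express the same idea of scaling column groups differently. The sketch below is not the notebook's method; it uses synthetic data (`X_demo`) as a stand-in for `X_train`, and the name `selective_scaler` is chosen here for illustration:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for X_train: 8 numeric columns, as in the housing data.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 8))

# Columns 0-1 are standardized, columns 2-7 are min-max scaled.
# Output columns follow the transformer order, which matches the input order here.
selective_scaler = ColumnTransformer([
    ("standard", StandardScaler(), [0, 1]),
    ("minmax", MinMaxScaler(), [2, 3, 4, 5, 6, 7]),
])

X_scaled = selective_scaler.fit_transform(X_demo)

print(X_scaled.shape)
print(X_scaled[:, :2].std(axis=0))   # approximately 1 after standardization
print(X_scaled[:, 2:].max(axis=0))   # approximately 1 after min-max scaling
```

One difference in design: a `ColumnTransformer` fits its scalers inside `fit`, so when it is a pipeline step the scalers are re-fit on whatever data the pipeline is fit on, whereas the function above reuses scalers fitted up front on `X_train`.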

# Implementing our pipelines using Linear Regression / KNN regression / Random Forest regression and the preprocessor defined above

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

pipe = Pipeline([('Scaler',preprocess_transformer),
                 ('Linear Regression',LinearRegression())])
pipe

Pipeline(memory=None,
         steps=[('Scaler',
                 FunctionTransformer(accept_sparse=False, check_inverse=True,
                                     func=<function Preprocessor at 0x7f64e7021830>,
                                     inv_kw_args=None, inverse_func=None,
                                     kw_args=None, validate=False)),
                ('Linear Regression',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)

In [None]:
from sklearn.metrics import mean_absolute_error

def fit_and_print(p, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test):
  # Fit the pipeline, then report the mean absolute error on both sets.
  p.fit(X_train, y_train)
  train_predictions = p.predict(X_train)
  test_predictions = p.predict(X_test)
  # mean_absolute_error expects (y_true, y_pred).
  print("Training error: " + str(mean_absolute_error(y_train, train_predictions)))
  print("Test error: " + str(mean_absolute_error(y_test, test_predictions)))

In [None]:
fit_and_print(pipe)

Training error: 50795.85711786371
Test error: 50352.228257942894


In [None]:
from sklearn.neighbors import KNeighborsRegressor as KNR

pipe2 = Pipeline([('Scaler',preprocess_transformer),
                 ('KNN Regression',KNR(n_neighbors=7))])

fit_and_print(pipe2)

Training error: 30045.80900840336
Test error: 35865.41276190476


In [None]:
from sklearn.ensemble import RandomForestRegressor as RFR

# Use a new name so the KNN pipeline above is not overwritten.
pipe3 = Pipeline([('Scaler',preprocess_transformer),
                 ('RFR Regression',RFR(n_estimators=10,max_depth=7))])

fit_and_print(pipe3)

Training error: 41315.28320125289
Test error: 44332.475437123634


# Conclusion

* Selective scaling is useful when different groups of columns call for different kinds of preprocessing.
* A `Pipeline` makes it quick to train, test, and evaluate several models behind the same preprocessing step and to compare them, as in the three pipeline cells above. In this run, KNN regression had the lowest test error (about 35,865), followed by the random forest (about 44,332) and linear regression (about 50,352).
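The comparison across the three pipeline cells can also be written as a single loop. The following is a minimal sketch on synthetic data rather than the housing arrays, with a plain `StandardScaler` standing in for the selective preprocessor; the names `candidates` and `results` are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, linearly generated regression data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 0.5, 1.0]) + rng.normal(scale=0.1, size=300)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

# Same three model families and hyperparameters as in the cells above.
candidates = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=7),
    "forest": RandomForestRegressor(n_estimators=10, max_depth=7, random_state=0),
}

results = {}
for name, model in candidates.items():
    # Each model gets its own pipeline with an identical scaling step.
    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    pipe.fit(X_tr, y_tr)
    results[name] = mean_absolute_error(y_te, pipe.predict(X_te))

# Print models from lowest to highest test MAE.
for name, mae in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: test MAE = {mae:.3f}")
```

Because the synthetic target is linear in the features, the ranking here says nothing about the housing data; the point is only the loop structure.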

