# ML PIPELINE

1. **What are Pipelines and Why Use Them?**
   - A pipeline in scikit-learn chains several data processing and modeling steps into a single estimator.
   - Instead of applying each step separately, we call `fit` and `predict` once on the pipeline, and it runs every step in order.
   - Because the transformers are fitted only on the data passed to `fit`, pipelines help prevent data leakage from the test data, and they keep the code modular and easy to read.

2. **Building the First Pipeline (`pipe`):**
   - We start by importing the necessary libraries for pipelines, data preprocessing (e.g., scaling), and a logistic regression model for classification.
   - We create a list of steps that we want to perform sequentially in the pipeline.
   - The first step is feature scaling using `StandardScaler`, which ensures that all features have similar scales, making the model training more effective.
   - The second step is the `LogisticRegression` model, which we will use for classification tasks.
   - We then create the pipeline, which will apply these two steps in sequence.

3. **Visualizing the Pipeline (`pipe`):**
   - We can visualize the pipeline as a diagram, showing the steps and their connections using the `set_config` function.
   - This visual representation helps us understand how the data flows through the pipeline during training and prediction.

4. **Preparing Data and Splitting into Training and Testing Sets:**
   - We generate a synthetic dataset (`X` and `y`) using `make_classification`.
   - To evaluate our model's performance, we split the data into training (`X_train`, `y_train`) and testing (`X_test`, `y_test`) sets.

5. **Fitting and Predicting with the First Pipeline (`pipe`):**
   - We fit the pipeline to the training data (`X_train`, `y_train`), which means we apply the feature scaling and train the logistic regression model.
   - Then, we use the fitted pipeline to make predictions on the test data (`X_test`) and store the predictions in `y_pred`.

6. **Adding Dimensionality Reduction and Another Model (`pipe_1`):**
   - In the second pipeline (`pipe_1`), we keep the scaling step and change the rest:
      - `PCA` (Principal Component Analysis) is added after scaling and reduces the data to 5 components.
      - `SVC` (Support Vector Classifier) replaces logistic regression as the final classification step.
   - We then fit `pipe_1` to the training data and make predictions on the test data, just as we did with the first pipeline.

7. **Custom Pipelines (`pipe` and `pipe22`):**
   - We create custom pipelines (`pipe` and `pipe22`) with explicitly named steps using `Pipeline`.
   - `pipe` chains feature scaling, PCA for dimensionality reduction, and `SVC` for classification.
   - `pipe22` tries to place two models (`SVC` and `LogisticRegression`) in the final step as a list. A pipeline step must be a single estimator, so this pipeline cannot be fitted; to compare models, the step itself can be treated as a hyperparameter in `GridSearchCV`.

8. **Using GridSearchCV for Hyperparameter Tuning:**
   - We use `GridSearchCV` to search for the best combination of hyperparameters (settings) for our pipeline `pipe`.
   - Hyperparameters are values set before training the model, and `GridSearchCV` tries different combinations to find the best ones for our data.

9. **ColumnTransformer for Data Preprocessing:**
   - We introduce the concept of `ColumnTransformer` to handle different preprocessing steps for different subsets of columns in our dataset.
   - For example, we use `StandardScaler` and `SimpleImputer` for numerical features and `OneHotEncoder` for categorical features.
   - The `ColumnTransformer` allows us to apply these preprocessing steps efficiently.

10. **Creating a Final Pipeline (`final_pipeline`):**
   - We create a `final_pipeline` that combines our `preprocessor` (from the `ColumnTransformer`) with a `LinearRegression` model for regression tasks.
   - This final pipeline will handle both data preprocessing and regression modeling in a single chain.

In summary, this code demonstrates the power of pipelines in scikit-learn for organizing data preprocessing and modeling tasks. Pipelines help maintain code clarity, modularity, and reusability while making it easier to switch between different preprocessing techniques and models. Additionally, the code introduces the concept of `ColumnTransformer`, which allows different preprocessing steps to be applied to different subsets of columns in the dataset.
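The data-leakage point can be illustrated with cross-validation. The following is a minimal sketch, not part of the original notebook: the names `X_demo`, `y_demo`, and `leak_free`, and the `random_state` value, are assumptions made for illustration. When a pipeline is passed to `cross_val_score`, the scaler is refitted on the training part of each fold, so the held-out part never influences the scaling statistics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; random_state=0 is an illustrative choice
X_demo, y_demo = make_classification(n_samples=200, random_state=0)

leak_free = Pipeline([
    ("standard_scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# In every fold, the whole pipeline (scaler included) is fitted on the
# training part only and then evaluated on the held-out part
scores = cross_val_score(leak_free, X_demo, y_demo, cv=5)
print(scores.mean())
```

Scaling the full dataset before cross-validation would instead let statistics from the held-out folds leak into training.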

In [None]:
from sklearn.pipeline import Pipeline
#feature scaling
from sklearn.preprocessing import StandardScaler
#model
from sklearn.linear_model import LogisticRegression


In this section, we import necessary libraries for building and using pipelines, performing feature scaling (StandardScaler), and creating a logistic regression model (LogisticRegression).

In [None]:
# define the steps we want to perform sequentially
# steps must be a list of (name, estimator) tuples
steps = [('standard_scaler', StandardScaler()),
         ('model', LogisticRegression())]


Here, we define the steps that will be performed sequentially in the pipeline. A pipeline is created by combining different steps, and each step is represented as a tuple with a unique key (name) and a corresponding value (the transformer or estimator). The first step is 'standard_scaler', which is the StandardScaler, used for feature scaling. The second step is 'model', which represents the LogisticRegression classifier.

In [None]:
steps

[('standard_scaler', StandardScaler()), ('model', LogisticRegression())]

In [None]:
pipe = Pipeline(steps)
pipe

This creates a pipeline using the specified steps.

In [None]:
# we can also visualize the pipeline as a diagram using set_config
from sklearn import set_config


This sets the configuration to display the pipeline as a diagram.

In [None]:
set_config(display='diagram')
pipe

In [None]:
#create dataset
from sklearn.datasets import make_classification
X,y = make_classification(n_samples=200)


This code generates a synthetic dataset using make_classification function from scikit-learn. It creates X (features) and y (labels) with 200 samples.

In [None]:
X.shape


(200, 20)

In [None]:
y.shape


(200,)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Here, the dataset is split into training and testing sets using the train_test_split function from scikit-learn. The test set size is set to 20% of the total data, and the random_state is set to 42 for reproducibility.

In [None]:
X_test.shape

(40, 20)

In [None]:
X_train

array([[-0.16827243,  0.69198102,  0.60023108, ..., -1.321927  ,
        -0.65778593,  1.33888266],
       [-1.02158819, -0.55658644, -1.37159501, ..., -1.97535954,
        -1.42007791,  0.99230753],
       [-0.44353936,  0.7582759 ,  1.51359576, ...,  0.00881065,
        -0.44989678, -0.24634754],
       ...,
       [-0.51960886,  0.13240342,  0.70674036, ..., -2.28136236,
        -0.68849353,  0.86341474],
       [-0.07037283, -1.005115  ,  0.12810784, ...,  2.10258443,
        -0.21297212, -1.16245068],
       [ 0.09218244,  3.78074509,  1.69214273, ...,  2.00234207,
        -0.34538136, -2.40010907]])

In [None]:
pipe.fit(X_train, y_train)


The pipeline is fitted to the training data (X_train and y_train) using the fit method. The StandardScaler learns each feature's mean and standard deviation from X_train and scales it, and the LogisticRegression model is then trained on the scaled data.

In [None]:
# during prediction the pipeline only calls transform on the scaler (no refitting), then predict on the model
y_pred = pipe.predict(X_test)


The pipeline is used to predict the labels for the test set (X_test) using the predict method. The LogisticRegression model makes predictions based on the scaled features.
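To make the fit and predict behaviour concrete, the sketch below compares a pipeline with the equivalent manual steps. It is an illustration only: the `_demo` names and the `random_state` used for data generation are assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Pipeline version: one fit call, one predict call
pipe_demo = Pipeline([
    ("standard_scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe_demo.fit(X_tr, y_tr)
pipe_pred = pipe_demo.predict(X_te)

# Manual version: fit_transform on the training data,
# transform only (no refitting) on the test data
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
model = LogisticRegression().fit(X_tr_scaled, y_tr)
manual_pred = model.predict(scaler.transform(X_te))

print(np.array_equal(pipe_pred, manual_pred))
```

Both versions perform the same computation; the pipeline simply removes the risk of forgetting a step or calling `fit_transform` on the test data by mistake.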

In [None]:
y_pred


array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1])

In [None]:
from sklearn.decomposition import PCA
from sklearn.svm import SVC


In [None]:
steps = [('standard_scaler', StandardScaler()),
 ('PCA', PCA(n_components=5)),
 ('SVC', SVC())]
steps


[('standard_scaler', StandardScaler()),
 ('PCA', PCA(n_components=5)),
 ('SVC', SVC())]

In [None]:
pipe_1 = Pipeline(steps)
pipe_1

A new pipeline, pipe_1, is created with additional steps - PCA for dimensionality reduction and an SVM classifier (SVC).

In [None]:
#now perform every task included in the pipeline
pipe_1.fit(X_train, y_train)


The new pipeline pipe_1 is fitted to the training data, and predictions are made on the test data using the predict method.

In [None]:
pipe_1.predict(X_test)


array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1])

In [None]:
pipe_1['standard_scaler'].fit_transform(X_train)


array([[-0.13938456,  0.68210871,  0.47639023, ..., -1.06877275,
        -0.53186466,  0.96702681],
       [-0.99701531, -0.67059607, -1.44328664, ..., -1.55392961,
        -1.29388704,  0.70830569],
       [-0.41604344,  0.75393294,  1.36559901, ..., -0.08073426,
        -0.32404904, -0.21636028],
       ...,
       [-0.49249761,  0.0758593 ,  0.58008265, ..., -1.78112883,
        -0.5625614 ,  0.61208619],
       [-0.04098991, -1.15653434,  0.01675331, ...,  1.47383893,
        -0.08720817, -0.90023864],
       [ 0.12238736,  4.02849249,  1.53942391, ...,  1.39941157,
        -0.21957059, -1.82416058]])

Steps can be accessed by name. Here fit_transform is called on the StandardScaler step of pipe_1, which refits the scaler on X_train and returns the scaled training data.

In [None]:
pipe_1[0]


In [None]:
# steps can also be accessed by position: pipe_1[0] is the first step
pipe_1[0].fit_transform(X_train)


array([[-0.13938456,  0.68210871,  0.47639023, ..., -1.06877275,
        -0.53186466,  0.96702681],
       [-0.99701531, -0.67059607, -1.44328664, ..., -1.55392961,
        -1.29388704,  0.70830569],
       [-0.41604344,  0.75393294,  1.36559901, ..., -0.08073426,
        -0.32404904, -0.21636028],
       ...,
       [-0.49249761,  0.0758593 ,  0.58008265, ..., -1.78112883,
        -0.5625614 ,  0.61208619],
       [-0.04098991, -1.15653434,  0.01675331, ...,  1.47383893,
        -0.08720817, -0.90023864],
       [ 0.12238736,  4.02849249,  1.53942391, ...,  1.39941157,
        -0.21957059, -1.82416058]])

In [None]:
pipe_1[:1]


In [None]:
pipe_1[:2]

In [None]:
pipe_1.steps[0]

('standard_scaler', StandardScaler())
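The access patterns shown above (by name, by position, by slice, and through `steps`) can be summarised in one small sketch. The `_demo` names and the generated data are assumptions for illustration. Unlike the `fit_transform` calls above, it uses `transform`, which reuses the already fitted steps instead of refitting them.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=200, random_state=0)

pipe_demo = Pipeline([
    ("standard_scaler", StandardScaler()),
    ("PCA", PCA(n_components=5)),
    ("SVC", SVC()),
]).fit(X_demo, y_demo)

# Indexing by name and by position returns the same fitted object
by_name = pipe_demo["standard_scaler"]
by_position = pipe_demo[0]
same_object = by_name is by_position

# Slicing returns a new Pipeline made of the selected steps
preprocessing_only = pipe_demo[:2]
reduced = preprocessing_only.transform(X_demo)  # scaled, then projected by PCA

print(same_object, type(preprocessing_only).__name__, reduced.shape)
```

Slicing off the final estimator like this is a convenient way to inspect what the model actually receives as input.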

In [None]:
# make_pipeline builds a pipeline without explicit step names
from sklearn.pipeline import make_pipeline


In [None]:
make_pipeline(StandardScaler(), LogisticRegression())

This code creates a pipeline with the `make_pipeline` function, without specifying step names explicitly. Each step is named automatically after its class, in lowercase (for example `standardscaler`).
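The automatically generated step names can be inspected through the `steps` attribute. A minimal sketch (the variable names `auto_named` and `step_names` are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

auto_named = make_pipeline(StandardScaler(), LogisticRegression())

# make_pipeline names each step after its class, in lowercase
step_names = [name for name, _ in auto_named.steps]
print(step_names)
```

These generated names are the prefixes to use later in `set_params` or in a parameter grid, for example `logisticregression__C`.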

In [None]:
from sklearn.svm import SVC

In [None]:
pipe = Pipeline(steps = [
 ('sc', StandardScaler()),
 ('pca', PCA()),
 ('clf', SVC())
 ])
pipe

This creates a new pipeline `pipe` with named steps. The first step is feature scaling using `StandardScaler`, the second step is dimensionality reduction using `PCA`, and the final step is classification using the `SVC` (Support Vector Classifier) model.

In [None]:
pipe.set_params(clf__C = 10) # sets the C parameter of the SVC step to 10

In [None]:
pipe.set_params(clf__kernel = 'linear')

These lines set parameters for the SVC step in the pipeline. clf__C=10 sets the 'C' parameter of the SVC model to 10, and clf__kernel='linear' sets the kernel of the SVC model to 'linear'.
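The `<step name>__<parameter>` convention works in both directions: `set_params` writes nested parameters and `get_params` reads them back. A small sketch under the same step names as above (`pipe_demo` and `params` are illustrative names):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe_demo = Pipeline(steps=[
    ("sc", StandardScaler()),
    ("pca", PCA()),
    ("clf", SVC()),
])

# Several nested parameters can be set in one call
pipe_demo.set_params(clf__C=10, clf__kernel="linear")

# get_params exposes every parameter of every step under the same naming scheme
params = pipe_demo.get_params()
print(params["clf__C"], params["clf__kernel"])

# The change is applied to the estimator object held in the step
print(pipe_demo.named_steps["clf"].C)
```

Listing `pipe_demo.get_params().keys()` is a quick way to find the exact names that a parameter grid expects.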

In [None]:
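# note: a step must be a single estimator; a list of estimators is not a valid step
pipe22 = Pipeline(steps = [
 ('sc', StandardScaler()),
 ('pca', PCA()),
 ('clf', [SVC(), LogisticRegression()])
 ])
pipe22

This attempts to create a pipeline pipe22, similar to pipe, whose last step is a list of two classifiers (SVC and LogisticRegression). Each pipeline step must be a single transformer or estimator, so scikit-learn rejects this pipeline with an error, at the latest when fit is called. It is shown here only to illustrate the limitation.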
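Since a pipeline step holds a single estimator, one simple way to compare two classifiers behind the same preprocessing is to build one pipeline per candidate. The sketch below is an assumption-laden illustration, not the notebook's method: the data, the `candidates` dictionary, and the `_demo` names are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# One candidate estimator per pipeline; each step is a single object
candidates = {"svc": SVC(), "logreg": LogisticRegression()}

scores = {}
for label, estimator in candidates.items():
    pipe_demo = Pipeline(steps=[
        ("sc", StandardScaler()),
        ("pca", PCA()),
        ("clf", estimator),
    ])
    pipe_demo.fit(X_tr, y_tr)
    scores[label] = pipe_demo.score(X_te, y_te)  # mean accuracy on the test set

print(scores)
```

Treating the `clf` step as a hyperparameter in `GridSearchCV` achieves the same comparison with cross-validation instead of a single split.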

In [None]:
from sklearn.model_selection import GridSearchCV


In [None]:
# keys follow the pattern <step name>__<parameter name>
param_grid = dict(pca__n_components = [2,4,6],
 clf__C = [2,5,8])
grid_search_pipe = GridSearchCV(pipe,param_grid)
grid_search_pipe

This code sets up a GridSearchCV object to perform a grid search on the pipeline pipe with different values for the number of components in PCA (pca__n_components) and the regularization parameter 'C' of the SVC model (clf__C). The search itself runs only when fit is called on grid_search_pipe.
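Constructing the `GridSearchCV` object does not run anything; the search happens in `fit`. The following sketch shows a complete run under stated assumptions: the data, the `_demo` and `search` names, and `cv=5` are illustrative, and the PCA parameter is spelled `pca__n_components`, which is the name scikit-learn's `PCA` uses.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pipe_demo = Pipeline(steps=[
    ("sc", StandardScaler()),
    ("pca", PCA()),
    ("clf", SVC()),
])

# 3 x 3 = 9 parameter combinations, each evaluated with 5-fold cross-validation
param_grid = {"pca__n_components": [2, 4, 6], "clf__C": [2, 5, 8]}
search = GridSearchCV(pipe_demo, param_grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.best_score_)
print(search.score(X_te, y_te))  # the refitted best pipeline, scored on held-out data
```

After fitting, `search.best_estimator_` is a full pipeline refitted on all of the training data with the best combination found.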

In [None]:

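# pca__n_components applies only when the pca step is a PCA, so the grid is split in two
param_grid = [
 dict(clf = [LogisticRegression(), SVC()], pca = [PCA()],
  clf__C = [2,4,6], pca__n_components = [2,4,6]),
 dict(clf = [LogisticRegression(), SVC()], pca = ['passthrough'],
  clf__C = [2,4,6])]
GridSearchCV(pipe, param_grid)

This grid treats whole steps as hyperparameters. The clf step is either LogisticRegression or SVC, and the pca step is either a PCA or 'passthrough', which skips the step. clf__C works for both classifiers because both have a C parameter, while pca__n_components is searched only in the dictionary where the pca step is an actual PCA.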

In [None]:
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

In this section, we import additional libraries needed for data preprocessing and feature engineering.

In [None]:
from sklearn.compose import ColumnTransformer

In [None]:
#supplying the scaling and encoding techniques directly
col_transformer = ColumnTransformer(
 [
 ('sc', StandardScaler(), [0,1]),
 ('imp', SimpleImputer(strategy='mean'), [0,1]),
 ('ohe', OneHotEncoder(), [2,3])
 ])
col_transformer

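Here, we define a ColumnTransformer to apply different transformations to different columns in the data. It specifies three transformers:

1. 'sc': StandardScaler is applied to columns with indices 0 and 1 (assuming these are numerical features).
2. 'imp': SimpleImputer with strategy 'mean' is applied to columns with indices 0 and 1 (filling missing values with the mean).
3. 'ohe': OneHotEncoder is applied to columns with indices 2 and 3 (assuming these are categorical features).

Note that the transformers of a ColumnTransformer run independently on the original columns, and their outputs are concatenated side by side. Columns 0 and 1 therefore appear twice in the output, once scaled and once imputed; they are not imputed and then scaled. To apply both operations in sequence, put them in a Pipeline and pass that pipeline as a single transformer.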
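A small experiment shows how a `ColumnTransformer` combines its transformers. This is a sketch on a hypothetical toy DataFrame (the column names and values are made up); it compares the transformer defined above with a variant, named `chained` here, in which imputation and scaling are chained in a pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: two numeric columns with gaps, two categorical columns
df = pd.DataFrame({
    "height": [1.6, np.nan, 1.8, 1.7],
    "weight": [60.0, 72.0, np.nan, 65.0],
    "city": ["A", "B", "A", "B"],
    "level": ["low", "mid", "high", "low"],
})

# Same layout as above: 'sc' and 'imp' both receive the original columns 0 and 1
parallel = ColumnTransformer([
    ("sc", StandardScaler(), [0, 1]),
    ("imp", SimpleImputer(strategy="mean"), [0, 1]),
    ("ohe", OneHotEncoder(), [2, 3]),
])

# Variant: impute first, then scale, as one transformer
chained = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), [0, 1]),
    ("ohe", OneHotEncoder(), [2, 3]),
])

parallel_out = parallel.fit_transform(df)
chained_out = chained.fit_transform(df)

# parallel: 2 scaled + 2 imputed + 5 one-hot columns; chained: 2 + 5 columns
print(parallel_out.shape, chained_out.shape)
# The scaled copies in the parallel version still contain the missing values
print(np.isnan(parallel_out[:, :2]).any(), np.isnan(chained_out).any())
```

The chained variant is usually what is intended when a numeric column needs both imputation and scaling.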

In [None]:
make_pipeline(col_transformer, LogisticRegression())

This creates a pipeline with the ColumnTransformer followed by a LogisticRegression model.

In [None]:
#another way
t = [('sc', StandardScaler(), [0,1]),
 ('imp', SimpleImputer(strategy='mean'), [0,1]),
 ('ohe', OneHotEncoder(), [2,3])]
transformers = ColumnTransformer(transformers=t)
#make_pipeline
pip = make_pipeline(transformers, LogisticRegression())
pip

Here, we define the steps of the ColumnTransformer separately as a list of tuples and then create a new pipeline pip that includes both the ColumnTransformer and the LogisticRegression model.

In [None]:

t = [('ohe', OneHotEncoder(), [2,3])]
ColumnTransformer(transformers=t, remainder='passthrough') # by default remaind
 # transformers argument should be a list of tuples

This code creates a new ColumnTransformer with only the 'ohe' step, which applies OneHotEncoder to columns with indices 2 and 3. The remainder argument is set to 'passthrough', meaning that the remaining columns are passed through without any transformation and appended to the output. With the default, remainder='drop', columns that are not listed in any transformer are discarded.
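The effect of `remainder` is easiest to see from the output shapes. The sketch below uses a hypothetical four-column DataFrame (names and values are made up for illustration) and compares the default behaviour with `remainder='passthrough'`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: two numeric and two categorical columns
df = pd.DataFrame({
    "height": [1.6, 1.7, 1.8, 1.7],
    "weight": [60.0, 72.0, 80.0, 65.0],
    "city": ["A", "B", "A", "B"],
    "level": ["low", "mid", "high", "low"],
})

t = [("ohe", OneHotEncoder(), [2, 3])]

# Default remainder='drop': only the one-hot encoded columns survive
dropped = ColumnTransformer(transformers=t).fit_transform(df)

# remainder='passthrough': the untouched columns are appended after the encoded ones
kept = ColumnTransformer(transformers=t, remainder="passthrough").fit_transform(df)

print(dropped.shape, kept.shape)
```

Here `city` has two categories and `level` has three, so the encoder produces five columns; passthrough adds the two numeric columns for a total of seven.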

In [None]:
#numerical processing
steps = [('imputation', SimpleImputer(strategy='mean'))]
num_pipe = Pipeline(steps)
num_pipe

This code creates a Pipeline for numerical data processing, which includes only the 'imputation' step using SimpleImputer with the 'mean' strategy.

In [None]:
#categorical processing
steps = [('imputation', SimpleImputer(strategy='most_frequent')),
 ('ohe', OneHotEncoder())]
#we can also fill with any other constant
#SimpleImputer(fill_value='missing', strategy='constant')
cat_pipe = Pipeline(steps)
cat_pipe

This code creates a Pipeline for categorical data processing, which includes two steps: 'imputation' using SimpleImputer with the 'most_frequent' strategy (filling missing values with the most frequent value) and 'ohe' using OneHotEncoder for one-hot encoding categorical variables.

In [None]:
preprocessor = ColumnTransformer(
 [
 ('categorical', cat_pipe, ['gender', 'qualification']),
 ('numerical', num_pipe, ['age'])
 ]
)
preprocessor

This ColumnTransformer defines the preprocessing steps for the entire dataset. It applies the cat_pipe pipeline to the 'gender' and 'qualification' columns (categorical data) and the num_pipe pipeline to the 'age' column (numerical data).

In [None]:
final_pipeline = make_pipeline(preprocessor, LinearRegression())
final_pipeline

Finally, we create the final_pipeline, which consists of the preprocessor (the ColumnTransformer for data preprocessing) followed by the LinearRegression model for regression tasks.

In [None]:
Pipeline(
 [
 ('t',preprocessor),
 ('m', LinearRegression())
 ])

The line `Pipeline([ ('t', preprocessor), ('m', LinearRegression()) ])` creates a new pipeline that includes two steps:

1. `'t'` Step: The `preprocessor` object, which is a `ColumnTransformer` with specific data preprocessing steps for different subsets of columns, is added as the first step of the pipeline. This means that the data will be preprocessed using the specified transformations before proceeding to the next step.

2. `'m'` Step: The `LinearRegression()` model is added as the second step of the pipeline. This step represents the linear regression model that will be used for regression tasks.

This pipeline is equivalent to `final_pipeline`, except that the step names (`'t'` and `'m'`) are chosen explicitly. It combines the data preprocessing (`preprocessor`) and the regression model (`LinearRegression()`) into a single estimator: calling `fit` preprocesses the training data and trains the model on the result, and calling `predict` applies the same fitted preprocessing to new data before predicting.
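The notebook builds this pipeline but never fits it. The following sketch shows an end-to-end run on made-up data: the column names match the ones used above, while the values, the target `y_demo`, and the `handle_unknown="ignore"` setting are assumptions added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data with missing values in every column
df_demo = pd.DataFrame({
    "gender": ["M", "F", np.nan, "F", "M", "F"],
    "qualification": ["BSc", "MSc", "BSc", np.nan, "PhD", "MSc"],
    "age": [25.0, 32.0, np.nan, 41.0, 29.0, 35.0],
})
y_demo = np.array([30.0, 45.0, 33.0, 50.0, 60.0, 47.0])  # made-up regression target

num_pipe = Pipeline([("imputation", SimpleImputer(strategy="mean"))])
cat_pipe = Pipeline([
    ("imputation", SimpleImputer(strategy="most_frequent")),
    # handle_unknown="ignore" (an addition) avoids errors on categories unseen during fit
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("categorical", cat_pipe, ["gender", "qualification"]),
    ("numerical", num_pipe, ["age"]),
])

final_pipeline = make_pipeline(preprocessor, LinearRegression())
final_pipeline.fit(df_demo, y_demo)          # imputes, encodes, then trains the model
predictions = final_pipeline.predict(df_demo)  # reuses the fitted preprocessing

print(predictions.shape)
```

Because the columns are selected by name, the input must be a pandas DataFrame that contains `gender`, `qualification`, and `age`.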