# What is a Pipeline in Machine Learning?
** A machine learning pipeline is a step-by-step workflow that automates the process of building and deploying a model. It combines all necessary steps — from data preprocessing to model training — into a single, repeatable object.
In scikit-learn, Pipeline is a tool that allows you to chain multiple processing steps (like scaling, encoding, and modeling) together so they can be treated as one unit.**

## Typical Pipeline Steps:
- Preprocessing
- Handling missing values
- Feature scaling (e.g., StandardScaler)
- Encoding categorical data (e.g., OneHotEncoder)
- Feature Selection or Dimensionality Reduction (optional)
- Model Training
- Fitting a machine learning algorithm (e.g., LogisticRegression, RandomForest)
## Why Use a Pipeline?
- Simplifies Code
- Avoids Data Leakage
- Makes Cross-validation Safer
- Improves Reproducibility

In [1]:
import numpy as np
import pandas as pd

In [76]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline,make_pipeline
from sklearn.feature_selection import SelectKBest,chi2
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

In [28]:
df=pd.read_csv(r"C:\Users\Asus\Downloads\train.csv")

In [29]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [30]:
df.drop(['PassengerId','Name','Ticket','Cabin'],axis=1,inplace=True)

In [31]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(df.drop(columns=['Survived']),
                                              df['Survived'],
                                              test_size=0.2,
                                              random_state=42)

In [32]:
X_train.isnull().sum()

Pclass        0
Sex           0
Age         140
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

In [42]:
# imputation transformer
trf1=ColumnTransformer([
    ('impute_age',SimpleImputer(),[2]),
    ('impute_embarked',SimpleImputer(strategy='most_frequent'),[6])
],remainder='passthrough')

In [51]:
# one hot Encoding
trf2=ColumnTransformer([
    ("ohe_sex_embarked",OneHotEncoder(handle_unknown='ignore',sparse_output=False),[1,6])
],remainder='passthrough')

In [52]:
# Scaling
trf3=ColumnTransformer([
    ('scale',MinMaxScaler(),slice(0,10))
])

In [53]:
# Feature selection
trf4 = SelectKBest(score_func=chi2,k=8)

In [77]:
# train the model
#trf5 = DecisionTreeClassifier()
trf5=RandomForestClassifier(n_estimators=100)

# Create Pipeline

In [78]:
pipe = Pipeline([
    ('trf1',trf1),
    ('trf2',trf2),
    ('trf3',trf3),
    ('trf4',trf4),
    ('trf5',trf5)
])


In [79]:
# Alternate Syntax
pipe = make_pipeline(trf1,trf2,trf3,trf4,trf5)

In [80]:
# train
pipe.fit(X_train,y_train)

In [81]:
# Predict
y_pred = pipe.predict(X_test)

In [82]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.6256983240223464