# What is a Pipeline in Machine Learning?


A machine learning project usually involves several steps, such as:

Cleaning the data (for example, filling in missing values)

Transforming the data (for example, scaling numeric columns and encoding categories as numbers)

Training the model

Making predictions

Running each step by hand is slow and error-prone: it is easy to skip a step, or to apply a transformation to the training data but not to the test data.

A pipeline is a tool that chains these steps in a fixed order, so a single call runs all of them.



# Why use a Pipeline?
It makes your code cleaner and easier to read.

It makes sure every step runs in the right order, both when training and when predicting.

# How does a Pipeline work?
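Think of it like an assembly line:

First step: data cleaning and preparation

Second step: model training

Then: use the trained model to make predictions on new data

You create one pipeline object that connects these steps. In scikit-learn, every step except the last transforms the data, and the last step is the model.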
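
As a minimal sketch of the idea above, the block below chains one preparation step and one model. The toy arrays `X_toy` and `y_toy` and the step names `scale` and `model` are made up for illustration; `Pipeline`, `StandardScaler`, and `LogisticRegression` are standard scikit-learn classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one numeric feature, two well-separated classes.
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Each step is a (name, object) pair; the steps run in the order listed.
toy_pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),      # preparation step
    ("model", LogisticRegression()),  # final step: the model
])

# fit() fits the scaler, transforms the data, then fits the model.
toy_pipeline.fit(X_toy, y_toy)

# predict() reuses the fitted scaler on new data before predicting.
toy_pred = toy_pipeline.predict(np.array([[1.5], [11.5]]))
print(toy_pred)
```

The same two calls, `fit` and `predict`, are all that is needed however many preparation steps the pipeline contains.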

# ✅ Step 1: Import all necessary libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression


# ✅ Step 2: Create a small sample dataset
df = pd.DataFrame({
    'age': [22, 38, 26, 35, None],
    'fare': [7.25, 71.83, 7.92, 53.1, None],
    'sex': ['male', 'female', 'female', 'female', 'male'],
    'embarked': ['S', 'C', 'Q', 'S', None],
    'survived': [0, 1, 1, 1, 0]
})
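
Before filling anything in, it can help to check which columns actually have gaps. The sketch below repeats the sample data under the illustrative name `sample_df` so that it runs on its own, and counts the missing cells per column.

```python
import pandas as pd

# Copy of the Step 2 sample data so this block runs on its own.
sample_df = pd.DataFrame({
    'age': [22, 38, 26, 35, None],
    'fare': [7.25, 71.83, 7.92, 53.1, None],
    'sex': ['male', 'female', 'female', 'female', 'male'],
    'embarked': ['S', 'C', 'Q', 'S', None],
    'survived': [0, 1, 1, 1, 0]
})

# isna() marks missing cells; sum() counts them per column.
missing_counts = sample_df.isna().sum()
print(missing_counts)
```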


# ✅ Step 3: Fill missing values (mean for numbers, most frequent for text)
# Assign the result back instead of calling fillna(..., inplace=True) on a
# selected column: pandas warns about that form and it stops working in pandas 3.0.
df['age'] = df['age'].fillna(df['age'].mean())
df['fare'] = df['fare'].fillna(df['fare'].mean())
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])


# ✅ Step 4: Split into input (X) and target (y)
X = df[['age', 'fare', 'sex', 'embarked']]
y = df['survived']


# ✅ Step 5: Split into training and testing data
# With only 5 rows, test_size=0.2 holds out a single row for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ✅ Step 6: Choose numeric and categorical columns
numeric_features = ['age', 'fare']
categorical_features = ['sex', 'embarked']

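# ✅ Step 7: Define transformers
# StandardScaler rescales each numeric column to zero mean and unit variance.
numeric_transformer = StandardScaler()
# OneHotEncoder turns each category into its own 0/1 column;
# handle_unknown='ignore' encodes categories not seen during fit as all zeros.
categorical_transformer = OneHotEncoder(handle_unknown='ignore')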


# ✅ Step 8: Combine both transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])


# ✅ Step 9: Create full pipeline (preprocessing + model)
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])
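
Step 3 fills missing values by hand before the pipeline runs. An alternative, sketched here on the assumption that imputation should be learned from the training data like any other step, is to put `SimpleImputer` inside the pipeline. The names `raw_df`, `numeric_steps`, `categorical_steps`, and `impute_pipeline` are illustrative, and missing values are written as `np.nan` so that the imputer recognizes them.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Sample data with the missing values left in place (np.nan instead of None).
raw_df = pd.DataFrame({
    'age': [22, 38, 26, 35, np.nan],
    'fare': [7.25, 71.83, 7.92, 53.1, np.nan],
    'sex': ['male', 'female', 'female', 'female', 'male'],
    'embarked': ['S', 'C', 'Q', 'S', np.nan],
    'survived': [0, 1, 1, 1, 0],
})
X_raw = raw_df[['age', 'fare', 'sex', 'embarked']]
y_raw = raw_df['survived']

# Numeric columns: fill with the mean, then scale.
numeric_steps = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])

# Categorical columns: fill with the most frequent value, then one-hot encode.
categorical_steps = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

impute_preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_steps, ['age', 'fare']),
    ('cat', categorical_steps, ['sex', 'embarked']),
])

impute_pipeline = Pipeline(steps=[
    ('preprocessor', impute_preprocessor),
    ('classifier', LogisticRegression()),
])

# The pipeline accepts the raw data directly, gaps included.
impute_pipeline.fit(X_raw, y_raw)

# A new row with a missing age is filled with the mean learned during fit.
new_passenger = pd.DataFrame({
    'age': [np.nan], 'fare': [30.0], 'sex': ['female'], 'embarked': ['S'],
})
new_pred = impute_pipeline.predict(new_passenger)
print(new_pred)
```

With this layout the fill values are computed only from the rows passed to `fit`, so rows held out for testing do not influence them.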


# ✅ Step 10: Train the pipeline on training data
pipeline.fit(X_train, y_train)


# ✅ Step 11: Predict on test data
y_pred = pipeline.predict(X_test)


# ✅ Step 12: Print the predictions
print("Prediction:", y_pred)


Prediction: [1]
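
A prediction on its own does not say how good the model is. As a rough sketch, the block below rebuilds the same data and pipeline under illustrative names (`eval_df`, `eval_pipeline`) and compares the prediction with the true label using `accuracy_score`, then prints the class probabilities from `predict_proba`. With a single test row the accuracy can only be 0 or 1, so this shows the mechanics rather than a meaningful score.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Rebuild the walkthrough's data so this block runs on its own.
eval_df = pd.DataFrame({
    'age': [22, 38, 26, 35, None],
    'fare': [7.25, 71.83, 7.92, 53.1, None],
    'sex': ['male', 'female', 'female', 'female', 'male'],
    'embarked': ['S', 'C', 'Q', 'S', None],
    'survived': [0, 1, 1, 1, 0]
})
eval_df['age'] = eval_df['age'].fillna(eval_df['age'].mean())
eval_df['fare'] = eval_df['fare'].fillna(eval_df['fare'].mean())
eval_df['embarked'] = eval_df['embarked'].fillna('S')

X_eval = eval_df[['age', 'fare', 'sex', 'embarked']]
y_eval = eval_df['survived']
X_tr, X_te, y_tr, y_te = train_test_split(
    X_eval, y_eval, test_size=0.2, random_state=42)

eval_pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['age', 'fare']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex', 'embarked']),
    ])),
    ('classifier', LogisticRegression()),
])
eval_pipeline.fit(X_tr, y_tr)

test_pred = eval_pipeline.predict(X_te)
test_proba = eval_pipeline.predict_proba(X_te)   # one column per class
test_accuracy = accuracy_score(y_te, test_pred)  # share of correct predictions

print("Accuracy:", test_accuracy)
print("Class probabilities:", test_proba)
```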
