# **Pipeline in Machine Learning**

In machine learning, a pipeline is a sequence of data processing steps that are chained together to automate and streamline the machine learning workflow. A pipeline allows you to combine multiple data preprocessing and model training steps into a single object, making it easier to organize and manage your machine learning code.

> **Here are the key components of a pipeline:**

**`Data Preprocessing Steps:`**
Pipelines typically start with data preprocessing steps, such as feature scaling, feature encoding, handling missing values, or dimensionality reduction. These steps ensure that the data is in the appropriate format and quality for model training.
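As a rough illustration of chained preprocessing, the sketch below fills a missing value and then scales the features. The toy array and the choice of `SimpleImputer` with `StandardScaler` are assumptions made for this example, not steps prescribed by the text.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with one missing value (illustrative data only).
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0]])

# Each step receives the output of the previous one.
preprocess = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),  # replace NaN with the column mean
    ("scaler", StandardScaler()),                 # rescale to zero mean, unit variance
])

X_clean = preprocess.fit_transform(X)
print(X_clean)
```

A pipeline made only of transformers, like this one, can itself be used as the first step of a larger pipeline that ends in a model.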

**`Model Training:`**
After the data preprocessing steps, the pipeline includes the training of a machine learning model. This can be a classifier for classification tasks, a regressor for regression tasks, or any other type of model depending on the problem at hand.

**`Model Evaluation:`**
Once the model is trained, its performance is evaluated. In scikit-learn, evaluation is not a step inside the `Pipeline` object; the fitted pipeline is instead passed to metric functions or cross-validation utilities, which treat it like any other estimator.
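One possible way to evaluate a pipeline is to hand it to `cross_val_score`, as sketched here. The synthetic dataset from `make_classification` and the `LogisticRegression` model are stand-ins chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# The whole pipeline is refitted on each training fold, so the scaler
# never sees the held-out fold while learning its statistics.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("mean accuracy:", scores.mean())
```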

**`Predictions:`**
After the model has been evaluated, the pipeline can make predictions on new, unseen data. Calling `predict` applies the preprocessing steps, with the parameters learned during training, to the new data before generating predictions.
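The following sketch suggests what happens at prediction time: the pipeline's output is compared with transforming and predicting by hand using the same fitted steps. The tiny one-feature dataset and the step names `scaler` and `classifier` are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny, clearly separable training set (illustrative only).
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
pipe.fit(X_train, y_train)

# New, unseen samples.
X_new = np.array([[1.5], [11.5]])
pipeline_pred = pipe.predict(X_new)

# Doing the same thing manually with the fitted steps.
scaler = pipe.named_steps["scaler"]
classifier = pipe.named_steps["classifier"]
manual_pred = classifier.predict(scaler.transform(X_new))

print(pipeline_pred, manual_pred)
```

Both calls should agree, since `pipe.predict` runs `transform` on every intermediate step before calling `predict` on the final one.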


> **The main advantages of using pipelines in machine learning are:**

**`Simplified Workflow:`** Pipelines provide a clean and organized structure for defining and managing the sequence of steps involved in machine learning tasks. This makes it easier to understand, modify, and reproduce the workflow.

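**`Avoiding Data Leakage:`** When a pipeline is fitted, its preprocessing steps learn their parameters (for example, imputation values or scaling statistics) from the training data only, and those fitted steps are then applied unchanged to the test data. This prevents information from the test set from leaking into training, which could otherwise produce overly optimistic results.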
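To make the leakage point concrete, this sketch checks where the scaler's statistics come from after fitting a pipeline on the training split. The random data and the `LogisticRegression` model are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Random features with a simple label rule (illustrative data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
pipe.fit(X_train, y_train)  # only the training split is seen here

# The scaler's mean was learned from X_train, not from the full dataset.
learned_mean = pipe.named_steps["scaler"].mean_
print(np.allclose(learned_mean, X_train.mean(axis=0)))

# The test split is only transformed with those training statistics.
test_accuracy = pipe.score(X_test, y_test)
```

Fitting the scaler on the full `X` before splitting would instead let test-set statistics influence training.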

**`Streamlined Model Deployment:`** Pipelines allow you to encapsulate the entire workflow, including data preprocessing and model training, into a single object. This simplifies the deployment of your machine learning model, as the same pipeline can be applied to new data without the need to reapply each individual step.
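One common way to ship a fitted pipeline is to serialize it with `joblib`, as in this sketch. The file name `pipeline.joblib`, the temporary directory, and the toy data are placeholders picked for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data (illustrative only).
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
pipe.fit(X, y)

# Save preprocessing and model together, then load them back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.joblib")  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored object carries the fitted scaler and classifier.
restored_pred = restored.predict(X)
print(restored_pred)
```

A saved pipeline should generally be loaded with the same scikit-learn version that produced it, and only from a trusted source.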

**`Hyperparameter Tuning:`** Pipelines can be combined with techniques like grid search or randomized search for hyperparameter tuning. This allows you to efficiently explore different combinations of hyperparameters for your models.
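A minimal sketch of tuning through a pipeline with `GridSearchCV` follows. Parameters of a step are addressed as `<step name>__<parameter>`; the grid values, the synthetic data, and the step name `classifier` are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Keys use the "<step>__<parameter>" convention to reach inside the pipeline.
param_grid = {
    "classifier__n_estimators": [10, 50],
    "classifier__max_depth": [2, None],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Because the search refits the whole pipeline on every fold, the preprocessing is tuned and validated together with the model.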

----
**Summary:**


Pipelines are a powerful tool for managing and automating the machine learning workflow, promoting code reusability, consistency, and efficiency. They streamline the development and deployment of machine learning models, making it easier to iterate and experiment with different approaches.

```python
# Import the libraries
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Load the dataset
df = sns.load_dataset('titanic')

# Split the data into features (X) and target (y)
X = df[['pclass', 'sex', 'age', 'fare', 'embarked']]
y = df['survived']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Columns handled by each branch of the column transformer
numeric_cols = ['age', 'fare']
cat_cols = ['sex', 'pclass', 'embarked']

# Numeric columns: fill missing values with the median
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

# Categorical columns: fill missing values with the most frequent
# category, then one-hot encode
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine both branches into a single preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', cat_transformer, cat_cols)
    ])

# Create a pipeline with the preprocessor and a random forest classifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Fit the pipeline on X_train and y_train
pipeline.fit(X_train, y_train)

# Predict on the test set through the pipeline
y_pred = pipeline.predict(X_test)

# Compute the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('accuracy is :', accuracy)
```

```
accuracy is : 0.7988826815642458
```
