# **Pipeline in Machine Learning** 

In machine learning, a pipeline is a sequence of data processing steps that are chained together to automate and streamline the machine learning workflow. A pipeline allows you to combine multiple data processing and model training steps into a single object, making it easier to organize and manage your machine learning code. 

## Summary:

Overall, pipelines are a powerful tool for managing and automating the machine learning workflow, promoting code reusability, consistency, and efficiency. They help streamline the development and deployment of machine learning models, making it easier to iterate and experiment with different approaches.

Here's an example of using a pipeline on the Titanic dataset to preprocess the data, train a model and make predictions: 


In [14]:
# Import libraries
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the titanic dataset
df = sns.load_dataset('titanic')

# Select the features and target variables
X = df[['pclass', 'sex', 'age', 'embarked']]
y = df['survived']

# Split the data into training and testing sets 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore')), 
    ('model', RandomForestClassifier(random_state=42))
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Make predictions on the test data
y_pred = pipeline.predict(X_test)

# Calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

Accuracy: 0.7932960893854749


## Explanation:

In this example, we start by loading the Titanic dataset from seaborn using sns.load_dataset('titanic'). We then select the relevant features and the target variable (survived) to train our model. Next, we split the data into training and test sets using train_test_split from scikit-learn.

The pipeline is created using the Pipeline class from scikit-learn. It consists of three steps:

1. Data preprocessing step: The SimpleImputer is used to handle missing values by replacing them with the most frequent value in each column. 
   
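2. Feature encoding step: The OneHotEncoder converts categorical variables such as sex and embarked into binary features. Because it is applied to the entire feature matrix here, pclass and age are one-hot encoded as well.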

3. Model training step: The RandomForestClassifier is used as the machine learning model for classification. 

We then fit the pipeline on the training data using pipeline.fit(X_train, y_train). Afterward, we make predictions on the test data using pipeline.predict(X_test).
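
In the pipeline above, the OneHotEncoder receives every column, so the numeric columns are encoded as categories too. If you would rather encode only sex and embarked, one option is scikit-learn's ColumnTransformer. The sketch below is a minimal illustration, not part of the original example: it uses a small made-up DataFrame with the same column names, and the median strategy for the numeric columns is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up sample that mimics the Titanic feature columns (not real data)
X_demo = pd.DataFrame({
    'pclass': [1, 3, 2, 3, 1, 2],
    'sex': ['male', 'female', 'female', 'male', 'female', 'male'],
    'age': [22.0, np.nan, 35.0, 4.0, 58.0, np.nan],
    'embarked': ['S', 'C', np.nan, 'S', 'Q', 'S'],
})
y_demo = [0, 1, 1, 0, 1, 0]

categorical_cols = ['sex', 'embarked']
numeric_cols = ['pclass', 'age']

# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical_steps = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
])

# Numeric columns: fill gaps with the median (an assumption for this sketch)
preprocessor = ColumnTransformer([
    ('categorical', categorical_steps, categorical_cols),
    ('numeric', SimpleImputer(strategy='median'), numeric_cols),
])

demo_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(random_state=42)),
])

demo_pipeline.fit(X_demo, y_demo)

# 2 columns for sex + 3 for embarked + 2 numeric columns passed through
transformed = demo_pipeline.named_steps['preprocessor'].transform(X_demo)
print(transformed.shape)  # (6, 7)
```

The whole preprocessor still behaves as a single pipeline step, so fit and predict are called exactly as before.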

## **Hyperparameter tuning in pipeline**

Hyperparameter tuning in a pipeline involves optimizing the hyperparameters of the different steps in the pipeline to find the combination that maximizes the model's performance. Here's an example of hyperparameter tuning in a pipeline and selecting the best model on the Titanic dataset:

In [16]:
# Import the libraries 
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the titanic dataset
df = sns.load_dataset('titanic')

# Select the features and target variables
X = df[['pclass', 'sex', 'age', 'embarked']]
y = df['survived']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ('model', RandomForestClassifier(random_state=42))
])

# Define the hyperparameters to tune
hyperparameters = {
    'model__n_estimators':[100, 200, 300],
    'model__max_depth':[None, 5, 10],
    'model__min_samples_split':[2, 5, 10]
}
# Perform grid search cross-validation 
grid_search = GridSearchCV(pipeline, hyperparameters, cv=5)

# Fit the grid search on the training data
grid_search.fit(X_train, y_train)

# Get the best model 
best_model = grid_search.best_estimator_

# Make predictions on the test data using the best model
y_pred = best_model.predict(X_test)

# Calculate accuracy score 
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

# Print the best hyperparameters 
print('Best Hyperparameters:', grid_search.best_params_)

Accuracy: 0.7988826815642458
Best Hyperparameters: {'model__max_depth': 10, 'model__min_samples_split': 2, 'model__n_estimators': 100}
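
The keys in the grid above follow the `<step name>__<parameter>` convention, where the step name is the label given to that step in the Pipeline. A short sketch of how to list the available names and set them directly (the variable name demo_pipeline is only for illustration):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

demo_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ('model', RandomForestClassifier(random_state=42)),
])

# Every tunable parameter is exposed as '<step name>__<parameter name>'
model_params = sorted(
    name for name in demo_pipeline.get_params() if name.startswith('model__')
)
print(model_params)

# The same naming works for any step, not only the model
demo_pipeline.set_params(imputer__strategy='constant', model__n_estimators=200)
print(demo_pipeline.named_steps['imputer'].strategy)   # constant
print(demo_pipeline.named_steps['model'].n_estimators)  # 200
```

Because preprocessing steps are addressable in the same way, a grid can also include entries such as `'imputer__strategy'` next to the model parameters.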


## **Selecting best model in Pipeline**

To select the best model among several candidates, you can build one pipeline per model and compare them using cross-validation and evaluation metrics. Here's an example of how to accomplish this on the Titanic dataset:

In [21]:
# Import libraries 
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier 
from sklearn.metrics import accuracy_score

# Load the Titanic dataset from seaborn 
df = sns.load_dataset('titanic')

# Select the features and target variables
X = df[['pclass', 'sex', 'age', 'embarked']]
y = df['survived']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a list of models to evaluate 
models = [
    ('Random Forest', RandomForestClassifier(random_state=42)),
    ('Gradient Boosting', GradientBoostingClassifier(random_state=42))
]

# Initialize trackers for the best model and its accuracy
best_model = None
best_accuracy = 0.0

# Iterate over the models and evaluate their performance 
for name, model in models:
    # Create a pipeline for each model 
    pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ('model', model)

    ])
    # Perform cross-validation 
    scores = cross_val_score(pipeline, X_train, y_train, cv=5)  

    # Get the average accuracy
    mean_accuracy = scores.mean()

    # Fit the pipeline on the training data 
    pipeline.fit(X_train, y_train)

    # Make predictions on the test data 
    y_pred = pipeline.predict(X_test)

    # Calculate the accuracy
    accuracy = accuracy_score(y_test, y_pred)

    # Print the performance metrics 
    print('Model:', name)
    print('Cross-validation Accuracy:', mean_accuracy)
    print('Test Accuracy:', accuracy)
    print()

    # Check if the current model has the best test accuracy so far
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = pipeline

# Retrieve the best model
print('Best Model:', best_model)


Model: Random Forest
Cross-validation Accuracy: 0.7963656062247612
Test Accuracy: 0.7932960893854749

Model: Gradient Boosting
Cross-validation Accuracy: 0.8062247611543386
Test Accuracy: 0.8044692737430168

Best Model: Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                ('encoder', OneHotEncoder(handle_unknown='ignore')),
                ('model', GradientBoostingClassifier(random_state=42))])
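
The two cross-validation means above differ by roughly one percentage point, so it can help to look at how much the individual fold scores vary before declaring a winner. cross_val_score returns one score per fold, which makes this easy to check. The sketch below uses synthetic data from make_classification instead of the Titanic dataset, so its numbers are not comparable to the output above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption for this sketch, not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

candidates = {
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
}

# Keep the per-fold scores so both the mean and the spread can be reported
results = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results[name] = fold_scores
    print(f'{name}: mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}')
```

If the gap between two means is smaller than the standard deviation across folds, the ranking may simply reflect how the data happened to be split.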


## **Add more models**

In [22]:
# Import libraries 
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Titanic dataset from seaborn 
df = sns.load_dataset('titanic')

# Select the features and target variables
X = df[['pclass', 'sex', 'age', 'embarked']]
y = df['survived']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a list of models to evaluate 
models = [
    ('Random Forest', RandomForestClassifier(random_state=42)),
    ('Gradient Boosting', GradientBoostingClassifier(random_state=42)),
    ('Support Vector Classifier', SVC(random_state=42)),
    ('Logistic Regression', LogisticRegression(random_state=42)),
]

# Initialize trackers for the best model and its accuracy
best_model = None
best_accuracy = 0.0

# Iterate over the models and evaluate their performance 
for name, model in models:
    # Create a pipeline for each model 
    pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ('model', model)

    ])
    # Perform cross-validation 
    scores = cross_val_score(pipeline, X_train, y_train, cv=5)  

    # Get the average accuracy
    mean_accuracy = scores.mean()

    # Fit the pipeline on the training data 
    pipeline.fit(X_train, y_train)

    # Make predictions on the test data 
    y_pred = pipeline.predict(X_test)

    # Calculate the accuracy
    accuracy = accuracy_score(y_test, y_pred)

    # Print the performance metrics 
    print('Model:', name)
    print('Cross-validation Accuracy:', mean_accuracy)
    print('Test Accuracy:', accuracy)
    print()

    # Check if the current model has the best test accuracy so far
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = pipeline

# Retrieve the best model
print('Best Model:', best_model)


Model: Random Forest
Cross-validation Accuracy: 0.7963656062247612
Test Accuracy: 0.7932960893854749

Model: Gradient Boosting
Cross-validation Accuracy: 0.8062247611543386
Test Accuracy: 0.8044692737430168

Model: Support Vector Classifier
Cross-validation Accuracy: 0.8146163695459471
Test Accuracy: 0.7821229050279329

Model: Logistic Regression
Cross-validation Accuracy: 0.7823106470993795
Test Accuracy: 0.776536312849162

Best Model: Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                ('encoder', OneHotEncoder(handle_unknown='ignore')),
                ('model', GradientBoostingClassifier(random_state=42))])
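
In the output above, the Support Vector Classifier has the highest cross-validation accuracy (about 0.815), yet Gradient Boosting is reported as best because the loop compares test accuracy. A common alternative is to pick the winner by cross-validation score and evaluate on the test set only once at the end. The sketch below shows that variant on synthetic data from make_classification (an assumption so that it runs without downloading Titanic); max_iter=1000 for LogisticRegression is also an assumption, added to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

models = [
    ('Random Forest', RandomForestClassifier(random_state=42)),
    ('Gradient Boosting', GradientBoostingClassifier(random_state=42)),
    ('Support Vector Classifier', SVC(random_state=42)),
    ('Logistic Regression', LogisticRegression(random_state=42, max_iter=1000)),
]

# Rank the candidates using the training data only
cv_means = {
    name: cross_val_score(model, X_tr, y_tr, cv=5).mean()
    for name, model in models
}
best_name = max(cv_means, key=cv_means.get)

# Refit the winner on the full training set, then score the test set once
best_model = dict(models)[best_name].fit(X_tr, y_tr)
test_accuracy = accuracy_score(y_te, best_model.predict(X_te))

print('Selected by cross-validation:', best_name)
print('Test accuracy:', test_accuracy)
```

Keeping the test set out of the selection step means the final test accuracy remains an unbiased estimate of how the chosen model performs on unseen data.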


Practice writing for loops like the ones above. Experts use them routinely, and fluency with them is part of what separates intermediate practitioners from professionals.