**Pipeline in Machine Learning**
In machine learning, a pipeline is a sequence of data processing steps that are chained together to automate and streamline the machine learning workflow. A pipeline allows you to combine multiple data preprocessing and model training steps into a single object, making it easier to organize and manage your machine learning code.\
**Here are the key components of a pipeline:**\
**Data Preprocessing Steps:** Pipelines typically start with data preprocessing steps, such as feature scaling, feature encoding, handling missing values, or dimensionality reduction. These steps ensure that the data is in the appropriate format and quality for model training.\
**Model Training:** After the data preprocessing steps, the pipeline includes the training of a machine learning model. This can be a classifier for classification tasks, a regressor for regression tasks, or any other type of model depending on the problem at hand.\
**Model Evaluation:** Once the model is trained, the pipeline often incorporates steps for evaluating its performance. This may involve metrics calculation, cross-validation, or any other evaluation technique to assess the model's effectiveness.\
**Predictions:** After the model has been evaluated, the pipeline allows you to make predictions on new, unseen data using the trained model. This step applies the same\ preprocessing steps used during training to the new data before generating predictions.
**Simplified Workflow:** Pipelines provide a clean and organized structure for defining and managing the sequence of steps involved in machine learning tasks. This makes it easier to understand, modify, and reproduce the workflow.\
**Avoiding Data Leakage:** Pipelines ensure that data preprocessing steps are applied consistently to both the training and testing data, preventing data leakage that could lead to biased or incorrect results.\
**Streamlined Model Deployment:** Pipelines allow you to encapsulate the entire workflow, including data preprocessing and model training, into a single object. This simplifies the deployment of your machine learning model, as the same pipeline can be applied to new data without the need to reapply each individual step.\
**Hyperparameter Tuning:** Pipelines can be combined with techniques like grid search or randomized search for hyperparameter tuning. This allows you to efficiently explore different combinations of hyperparameters for your models.



In [61]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


In [62]:
titanic = sns.load_dataset('titanic')
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [63]:
X=titanic[["pclass", "sex", "age","fare", "embarked"]]
y=titanic["survived"]

In [64]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [65]:
numeric_features = ["age", "fare"]
categorical_features = ["pclass", "sex", "embarked"]

In [66]:
numeric_transformer=Pipeline(steps=[("Imputer", SimpleImputer(strategy="median"))])

categorical_transformer=Pipeline(steps=[("Imputer", SimpleImputer(strategy="most_frequent")),
                                        ("encoder", OneHotEncoder())])


In [67]:
preprocessor=ColumnTransformer(transformers=[("num", numeric_transformer, numeric_features),
                                             ("cat", categorical_transformer, categorical_features)])

In [68]:
pipeline=Pipeline(steps=[("preprocessor", preprocessor), 
                         ("model", RandomForestClassifier(n_estimators=100, random_state=42))])

In [69]:
pipeline.fit(X_train, y_train)

In [70]:
y_pred=pipeline.predict(X_test)

In [71]:
# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.7821229050279329


**Hyperparamter tunning in pipeline**

In [72]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [73]:
titanic = sns.load_dataset('titanic')
titanic.head()
X=titanic[["pclass", "sex", "age","fare", "embarked"]]
y=titanic["survived"]

In [74]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [75]:
pipeline=Pipeline([
    ("Imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ("model", RandomForestClassifier(random_state=42))
])

In [79]:
# Define the hyperparameter grid
hyperparameters = {
    'model__n_estimators': [100, 200, 300, 500],
    'model__max_depth': [None, 5, 10, 30],
    'model__min_samples_split': [2, 5, 10, 15]
}

In [80]:
grid_search=GridSearchCV(pipeline, hyperparameters,cv=5)

In [81]:
grid_search.fit(X_train, y_train)

In [82]:
y_pred=grid_search.predict(X_test)

In [83]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8212290502793296


In [84]:
print(f"Best parameters: {grid_search.best_params_}")


Best parameters: {'model__max_depth': 30, 'model__min_samples_split': 5, 'model__n_estimators': 100}


**Selecting best model in Pipeline**

In [95]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [86]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [89]:
X=titanic[["pclass","sex","age","fare","embarked"]]
y=titanic["survived"]

In [90]:
# Split the data into a training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [96]:
models=[("Random forest", RandomForestClassifier(random_state=42)), 
        ("Gradient boosting", GradientBoostingClassifier(random_state=43)),
        ("XGBoost", XGBClassifier(random_state=42))]

In [97]:
best_model=None
best_accuracy=0

In [113]:
for name, model in models:
    pipeline=Pipeline([
        ("Imputer", SimpleImputer(strategy="most_frequent")),
         ("encoder", OneHotEncoder(handle_unknown="ignore")),
            ("model", model)
    ])

In [100]:
scores=cross_val_score(pipeline, X_train, y_train, cv=5)

In [101]:
mean_accuracy=scores.mean()

In [102]:
pipeline.fit(X_train, y_train)

In [103]:
y_pred=pipeline.predict(X_test)

In [109]:
accuracy = accuracy_score(y_test, y_pred)

In [116]:
print("model:", name)
print("mean accuracy:", mean_accuracy)
print("accuracy:", accuracy)
print()


# Check if the current model has the best accuracy
if accuracy > best_accuracy:
    best_accuracy = accuracy
    best_model = pipeline

# Retrieve the best model
print("Best Model:", best_model)

model: XGBoost
mean accuracy: 0.8076233625529401
accuracy: 0.7932960893854749

Best Model: Pipeline(steps=[('Imputer', SimpleImputer(strategy='most_frequent')),
                ('encoder', OneHotEncoder(handle_unknown='ignore')),
                ('model',
                 XGBClassifier(base_score=None, booster=None, callbacks=None,
                               colsample_bylevel=None, colsample_bynode=None,
                               colsample_bytree=None, device=None,
                               early_stopping_rounds=None,
                               enable_categorical=False, eval_metric=None,
                               feature_types=No...ow_policy=None,
                               importance_type=None,
                               interaction_constraints=None, learning_rate=None,
                               max_bin=None, max_cat_threshold=None,
                               max_cat_to_onehot=None, max_delta_step=None,
                               max_depth=N

# Add more models in the same code #

In [117]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [118]:
# Load the Titanic dataset from Seaborn
titanic_data = sns.load_dataset('titanic')

# Select features and target variable
X = titanic_data[['pclass', 'sex', 'age', 'fare', 'embarked']]
y = titanic_data['survived']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [119]:
models=[("Random Forest", RandomForestClassifier(random_state=42)),
        ("Gradient Boosting", GradientBoostingClassifier(random_state=42)),
        ("SVM", SVC(random_state=42)),
        ("Logistic Regression", LogisticRegression(random_state=42))]

In [122]:
best_model = None
best_accuracy = 0.0

In [120]:
for name, model in models:
    pipeline=Pipeline([
        ("Imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ("model", model)
    ])

In [124]:
 # Perform cross-validation
scores = cross_val_score(pipeline, X_train, y_train, cv=5)
    
    # Calculate mean accuracy
mean_accuracy = scores.mean()
    
    # Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

In [125]:
# Make predictions on the test data
y_pred = pipeline.predict(X_test)
    
    # Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)

In [126]:
print("Model:", name)
print("Cross-validation Accuracy:", mean_accuracy)
print("Test Accuracy:", accuracy)
print()

Model: Logistic Regression
Cross-validation Accuracy: 0.7977839062346105
Test Accuracy: 0.8100558659217877



In [129]:
# Check if the current model has the best accuracy
if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = pipeline

# Retrieve the best model
print("Best Model:", best_model)

Best Model: Pipeline(steps=[('Imputer', SimpleImputer(strategy='most_frequent')),
                ('encoder', OneHotEncoder(handle_unknown='ignore')),
                ('model', LogisticRegression(random_state=42))])
