<div id="header">
    <p style="color:#6a66bd; text-align:center; font-weight:bold; font-family:verdana; font-size:20px;">Sklearn Pipeline
    </p>
</div>

---

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>What is Pipeline?</strong>
<br>
• Pipelines are a powerful way to streamline your machine learning workflow by organizing the steps of data processing and model training into a single object.
<br>
• A pipeline in Scikit-learn chains multiple processing steps together, allowing for clean and efficient code.
<br>
<br>
<strong>Key Components of a Pipeline</strong>
<br>
➩ <strong>Transformers:</strong> These modify the input data.
<br>
• Common transformers include:
<br>
→ StandardScaler: Standardizes each feature by subtracting its mean and dividing by its standard deviation, so the result has mean 0 and standard deviation 1.
<br>
→ MinMaxScaler: Scales features to a given range, [0, 1] by default.
<br>
→ OneHotEncoder: Encodes each categorical feature as a set of binary (0/1) columns, one per category.
<br>
→ PCA (Principal Component Analysis): Reduces the dimensionality of the data by projecting it onto the directions of greatest variance.
<br>
➩ <strong>Estimators:</strong> These are algorithms that learn from the data.
<br>
• Common estimators include:
<br>
→ LogisticRegression: A linear model for classification (binary or multiclass).
<br>
→ RandomForestClassifier: An ensemble method that combines many decision trees.
<br>
→ Support Vector Machines (SVM): Used for classification and regression tasks.
</div>
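As a minimal sketch of the idea above, the following chains one transformer and one estimator. The synthetic dataset from `make_classification` and the step names `scaler` and `model` are assumptions made for illustration; they are not part of this notebook's Titanic workflow.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),     # transformer: learns mean/std, then rescales
    ("model", LogisticRegression()),  # estimator: learns from the rescaled data
])

# fit() fits the scaler, transforms X_demo, then fits the model on the result
demo_pipe.fit(X_demo, y_demo)

# predict() re-applies the already-fitted scaler before calling the model
demo_pred = demo_pipe.predict(X_demo[:5])
print(demo_pred.shape)  # (5,)
```

Because the scaler lives inside the pipeline, the same scaling learned during `fit` is applied automatically at prediction time.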

In [31]:
# Importing Libraries
import numpy as np
import pandas as pd

In [32]:
# Reading CSV File
df = pd.read_csv('titanic.csv')
df.sample(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
802,803,1,1,"Carter, Master. William Thornton II",male,11.0,1,2,113760,120.0,B96 B98,S
642,643,0,3,"Skoog, Miss. Margit Elizabeth",female,2.0,3,2,347088,27.9,,S
136,137,1,1,"Newsom, Miss. Helen Monypeny",female,19.0,0,2,11752,26.2833,D47,S
532,533,0,3,"Elias, Mr. Joseph Jr",male,17.0,1,1,2690,7.2292,,C
299,300,1,1,"Baxter, Mrs. James (Helene DeLaudeniere Chaput)",female,50.0,0,1,PC 17558,247.5208,B58 B60,C


In [33]:
# Dropping unnecessary columns
df.drop(columns=['PassengerId','Name','Ticket','Cabin'], inplace=True)

In [34]:
# Sample of the DataFrame
df.sample(5)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
229,0,3,female,,3,1,25.4667,S
650,0,3,male,,0,0,7.8958,S
247,1,2,female,24.0,0,2,14.5,S
248,1,1,male,37.0,1,1,52.5542,S
578,0,3,female,,1,0,14.4583,C


In [35]:
# Null values in the DataFrame
df.isna().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Train Test Split</strong>
<br>
The train-test split is a common technique in machine learning for evaluating model performance. It involves dividing your dataset into two parts:
<br>
• <strong>Training Set:</strong> Used to train the model.
<br>
• <strong>Testing Set:</strong> Used to evaluate the model's performance on unseen data.
<br>
<br>
<strong>Parameters</strong>
<br>
• <strong>arrays:</strong> One or more arrays of the same length, passed as positional arguments (e.g., the features and the target).
<br>
• <strong>test_size:</strong> A float between 0 and 1 sets the proportion of the dataset to include in the test split (e.g., 0.2 for 20%); an integer sets the absolute number of test samples.
<br>
• <strong>random_state:</strong> Seeds the shuffling applied to the data before the split; pass any integer to get the same split on every run.
<br>
• <strong>shuffle:</strong> A boolean that indicates whether to shuffle the data before splitting (True by default).
</div>
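The parameters above can be tried on a small toy array. This is only a sketch: the arrays `X_demo` and `y_demo` and the seed value 42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(24).reshape(12, 2)  # 12 rows, 2 features
y_demo = np.array([0, 1] * 6)          # toy binary target

# test_size=0.25 holds out 3 of the 12 rows; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
print(X_tr.shape, X_te.shape)  # (9, 2) (3, 2)

# Same seed, same split
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
print(np.array_equal(X_te, X_te_again))  # True

# shuffle=False keeps the original order, so the last rows become the test set
_, X_te_ordered, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.25, shuffle=False
)
print(X_te_ordered[:, 0].tolist())  # [18, 20, 22]
```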

In [36]:
# Importing train_test_split
from sklearn.model_selection import train_test_split

In [37]:
# Defining Features and Target Variables
X = df.iloc[:,1:]
y = df['Survived']

In [38]:
# Splitting the Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [39]:
# Shape of Training and Testing Set
print(X_train.shape, X_test.shape)

(623, 7) (268, 7)


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>What is ColumnTransformer?</strong>
<br>
• The ColumnTransformer in scikit-learn is a powerful tool for applying different preprocessing steps to different subsets of the features in a dataset.
<br>
• It allows you to handle mixed data types (numerical, categorical, etc.) in a structured way.
<br>
• The ColumnTransformer was introduced in scikit-learn version 0.20, which was released in September 2018.
</div>

In [40]:
# Importing ColumnTransformer
from sklearn.compose import ColumnTransformer
# Importing SimpleImputer
from sklearn.impute import SimpleImputer
# Importing OneHotEncoder and MinMaxScaler
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
# Importing DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

In [41]:
# ColumnTransformer Object for SimpleImputer
# Fills null values: Age (column 2 of X) with the mean, Embarked (column 6 of X) with the most frequent value
tnf1 = ColumnTransformer(transformers=[
    ('age_imputer', SimpleImputer(), [2]),
    ('embarked_imputer', SimpleImputer(strategy='most_frequent'), [6])
], remainder='passthrough')

In [42]:
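# ColumnTransformer Object for Feature Encoding using OneHotEncoder
# tnf1 outputs its transformed columns first and the passthrough columns after them,
# so the column order is now: Age, Embarked, Pclass, Sex, SibSp, Parch, Fare.
# Embarked is therefore at index 1 and Sex at index 3 (index 6 is Fare).
tnf2 = ColumnTransformer(transformers=[
    ('sex_embarked_encoding', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), [1,3])
], remainder='passthrough')

In [43]:
# ColumnTransformer Object for MinMaxScaler
# After tnf2 there are 10 columns (3 Embarked + 2 Sex one-hot columns, then Age, Pclass,
# SibSp, Parch, Fare), so slice(0,10) scales all of them
tnf3 = ColumnTransformer(transformers=[
    ('scale', MinMaxScaler(), slice(0,10))
])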

In [44]:
# DecisionTreeClassifier Object
tnf4 = DecisionTreeClassifier()

In [45]:
# Importing Pipeline
from sklearn.pipeline import Pipeline

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• To visualize your Scikit-learn pipeline as a diagram, you can use the set_config function from the sklearn library.
<br>
• The diagram shows each component of your pipeline, making it easier to understand the flow of data through each step.
</div>

In [46]:
# Display Pipelines as an HTML Diagram in the Notebook
from sklearn import set_config
set_config(display='diagram')

In [47]:
# Creating Pipeline Object
pipe = Pipeline([
    ('tnf1',tnf1),
    ('tnf2',tnf2),
    ('tnf3',tnf3),
    ('tnf4',tnf4)
])

In [48]:
# Fitting the Pipeline on Training Data
# Call fit when the last step of the Pipeline is a model (estimator)
# If the Pipeline contains only transformers, call fit_transform on the training data instead
pipe.fit(X_train, y_train)
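
For contrast with the cell above, here is a small sketch of the transformer-only case, where `fit_transform` is the natural call. The toy array and the step names `impute` and `scale` are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy numeric column with one missing value (hypothetical values)
X_toy = np.array([[1.0], [np.nan], [3.0]])

prep = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # NaN -> mean of 1 and 3, i.e. 2
    ("scale", MinMaxScaler()),                   # rescale to the range [0, 1]
])

# There is no final model, so fit_transform returns the transformed training data
X_prepared = prep.fit_transform(X_toy)
print(X_prepared.ravel().tolist())  # [0.0, 0.5, 1.0]
```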

In [49]:
# Making Predictions from Testing Data
y_pred = pipe.predict(X_test)

In [50]:
# Importing Accuracy Score
from sklearn.metrics import accuracy_score

In [51]:
# Accuracy Score of the Pipe
# The value changes from run to run because the split above is not seeded with random_state
accuracy_score(y_test, y_pred)

0.6194029850746269

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
• Pickle is commonly used for saving and loading trained models, along with any preprocessing steps you may have applied.
<br>
• This lets you persist your model's state so that it can be reused later without retraining, which saves time and computational resources.
</div>
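A save-and-load round trip can be sketched as follows. The synthetic data, the small two-step pipeline, and the file name `demo_pipe.pkl` are assumptions for illustration; the file is written to a temporary directory so nothing is left behind.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a small pipeline, used only for illustration
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
demo_pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("tree", DecisionTreeClassifier(random_state=0)),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipe.pkl")
    with open(path, "wb") as f:   # 'with' closes the file even if dump fails
        pickle.dump(demo_pipe, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline makes the same predictions as the original
same = bool((restored.predict(X_demo) == demo_pipe.predict(X_demo)).all())
print(same)  # True
```

Pickle files should only be loaded from sources you trust, and ideally with the same scikit-learn version that created them.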

In [52]:
# Exporting Model in Pickle File
import pickle
# The with-block closes the file once the dump is complete
with open('pipe.pkl','wb') as f:
    pickle.dump(pipe, f)

In [53]:
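# Test Input Data
# dtype=object keeps each value's own type; without it NumPy turns every value into a string
test = np.array([2,'male',50.0,0,0,10.5000,'S'], dtype=object).reshape(1,7)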
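
A short sketch of how NumPy treats a mixed-type row like the one above. The variable names `as_str` and `as_obj` are hypothetical and used only for this comparison.

```python
import numpy as np

row = [2, 'male', 50.0, 0, 0, 10.5, 'S']

as_str = np.array(row)                # mixed types: every value becomes a string
as_obj = np.array(row, dtype=object)  # each value keeps its Python type

print(as_str.dtype.kind)  # U (unicode strings)
print(as_obj.dtype.kind)  # O (Python objects)

# The age is the text '50.0' in the first array and the float 50.0 in the second
print(repr(as_str[2]), repr(as_obj[2]))
```

Numeric steps such as imputing or scaling are safer with the object array, since the numbers reach them as numbers rather than as text.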

In [54]:
# Loading the Model from Pickle File
import pickle
# The with-block closes the file once the pipeline is loaded
with open('pipe.pkl','rb') as f:
    pipe = pickle.load(f)

In [55]:
# Making Prediction
pipe.predict(test)

array([0])