# What are Pipeline?
A pipeline is a sequence of steps or operations that are executed in a specific order to accomplish a particular task or to process data. In computer science, pipelines are commonly used to process large amounts of data efficiently and to automate repetitive tasks. Basically, they are a mechanism in the sklanr library(not an algorithm) which chains together multiple steps so that the output of each step is used as the input to the next step.

Pipelines are used in a wide variety of applications, including data processing, machine learning, and software development. For example, in data processing, a pipeline might be used to clean and normalize data, extract features, and then train a machine learning model on the processed data. It makes it easy to apply the smae preprocessing to train and test.

Some of the advantages of using pipelines include:

1. Efficiency: Pipelines can help automate repetitive tasks and reduce the time it takes to process large amounts of data.

2. Consistency: Pipelines ensure that the same set of operations are performed on all data, which can help improve the accuracy and reliability of results.

3. Modularity: Pipelines are often designed as a series of modular steps, which makes it easier to modify or update individual steps without affecting the rest of the pipeline.

4. Scalability: Pipelines can be designed to handle large amounts of data, making them a scalable solution for processing big data.

Overall, pipelines are a powerful tool for processing data and automating tasks, and they offer several advantages over other methods of data processing.

# Lets observe how would a project look without the implementaion of piplines on a titanic dataset

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

In [2]:
df = pd.read_csv("train.csv") # Note that in this notebook I will only be utilizing the piplines neglecting the model performance

In [3]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
 df=df.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1) # NO USE

In [5]:
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


In [6]:
# Step 1 - train test split
X_train, X_test, y_train, y_test = train_test_split(df.drop("Survived", axis=1), df.iloc[:,0], test_size=0.2, random_state=42)

In [7]:
X_train

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
331,1,male,45.5,0,0,28.5000,S
733,2,male,23.0,0,0,13.0000,S
382,3,male,32.0,0,0,7.9250,S
704,3,male,26.0,1,0,7.8542,S
813,3,female,6.0,4,2,31.2750,S
...,...,...,...,...,...,...,...
106,3,female,21.0,0,0,7.6500,S
270,1,male,,0,0,31.0000,S
860,3,male,41.0,2,0,14.1083,S
435,1,female,14.0,1,2,120.0000,S


In [8]:
y_train

331    0
733    0
382    0
704    0
813    0
      ..
106    1
270    0
860    0
435    1
102    0
Name: Survived, Length: 712, dtype: int64

In [9]:
df.isnull().sum() # THere are missing values in the data, without handling them we cant move forward

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

In [27]:
# Step-2 Applying Imputaion
si_age = SimpleImputer() # Filling the missing value in age with mean
si_embarked = SimpleImputer(strategy="most_frequent") # Filing the missing value in emabred with the most common occurance

X_train_age = si_age.fit_transform(X_train[['Age']])
X_train_embarked = si_embarked.fit_transform(X_train[['Embarked']])

X_test_age = si_age.transform(X_test[['Age']])
X_test_embarked = si_embarked.transform(X_test[['Embarked']])

In [28]:
# Step-3 One Hot Encoding
ohe_sex = OneHotEncoder(sparse=False,handle_unknown='ignore')
ohe_embarked = OneHotEncoder(sparse=False,handle_unknown='ignore')

X_train_sex = ohe_sex.fit_transform(X_train[['Sex']])
X_train_embarked = ohe_embarked.fit_transform(X_train_embarked)

X_test_sex = ohe_sex.transform(X_test[['Sex']])
X_test_embarked = ohe_embarked.transform(X_test_embarked)

In [29]:
ohe_train_embarked

array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       ...,
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.]])

In [33]:
X_train_rem = X_train.drop(["Sex", "Age", "Embarked"], axis=1) # Dropping these cuz I aldy have numpy array for these
X_test_rem = X_test.drop(["Sex", "Age", "Embarked"], axis=1) # Dropping these cuz I aldy have numpy array for these

In [34]:
# Noe joining all the columns
X_train_transformed = np.concatenate((X_train_rem,X_train_age,X_train_sex,X_train_embarked),axis=1)
X_test_transformed = np.concatenate((X_test_rem,X_test_age,X_test_sex,X_test_embarked),axis=1)

In [39]:
X_train_transformed.shape

(712, 10)

In [40]:
X_test_transformed.shape

(179, 10)

In [41]:
clf = DecisionTreeClassifier()
clf.fit(X_train_transformed, y_train)

In [44]:
y_pred=clf.predict(X_test_transformed)

In [45]:
from sklearn.metrics import accuracy_score
accuracy_score(y_pred,y_test)

0.7821229050279329

In [48]:
import pickle
pickle.dump(ohe_sex,open("models/ohe_sex.pkl", "wb"))
pickle.dump(ohe_embarked,open("models/ohe_embarked.pkl", "wb"))
pickle.dump(clf,open("models/clf.pkl", "wb"))

# THis is the preprocessing done by me on the backend if I am to deploy this then for every input each of the above mentioned step should be handled indivisually to preict if user inputed config survied or not which will be done in 1.2