
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('/content/train_titanic.csv')

In [3]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
df.drop(columns=['PassengerId','Name','Ticket','Cabin'],inplace=True)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


In [5]:
# Step 1: Perform train test split
X=df.drop(columns=['Survived'])
y=df['Survived']

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [6]:
X_train.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
331,1,male,45.5,0,0,28.5,S
733,2,male,23.0,0,0,13.0,S
382,3,male,32.0,0,0,7.925,S
704,3,male,26.0,1,0,7.8542,S
813,3,female,6.0,4,2,31.275,S


In [7]:
y_train.head()

Unnamed: 0,Survived
331,0
733,0
382,0
704,0
813,0


In [8]:
df.isnull().sum()

Column,Missing values
Survived,0
Pclass,0
Sex,0
Age,177
SibSp,0
Parch,0
Fare,0
Embarked,2


Here, 'Age' has 177 missing values and 'Embarked' has 2, so we'll fill them using SimpleImputer.

**Applying Imputation**

In [9]:
# Import the necessary class
from sklearn.impute import SimpleImputer

# Initialize imputers
# For numerical columns like 'Age'
si_age = SimpleImputer()

# For categorical columns like 'Embarked'
si_embarked = SimpleImputer(strategy='most_frequent')

In [10]:
# Fit and transform 'Age' on training data
X_train_age = si_age.fit_transform(X_train[['Age']])
# Transform 'Age' on test data
X_test_age = si_age.transform(X_test[['Age']])

# Fit and transform 'Embarked' on training data
X_train_embarked = si_embarked.fit_transform(X_train[['Embarked']])
# Transform 'Embarked' on test data
X_test_embarked = si_embarked.transform(X_test[['Embarked']])

In [11]:
# X_train_age
# SimpleImputer with strategy='mean' (the default, used by si_age) fills missing values with the mean of the column.

In [12]:
# X_train_embarked
# SimpleImputer with strategy='most_frequent' fills missing values with the most frequent value (the mode) of the column.
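
To make the two strategies concrete, here is a minimal sketch on a toy DataFrame. The values are hypothetical and are not read from the Titanic file; the names `toy`, `mean_imputer` and `mode_imputer` are only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column values (hypothetical, not read from the Titanic file)
toy = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 40.0],
    "Embarked": ["S", "C", "S", np.nan],
})

mean_imputer = SimpleImputer(strategy="mean")
mode_imputer = SimpleImputer(strategy="most_frequent")

# The missing age becomes the mean of 20, 30 and 40
age_filled = mean_imputer.fit_transform(toy[["Age"]])
# The missing port becomes the most frequent value, "S"
embarked_filled = mode_imputer.fit_transform(toy[["Embarked"]])

print(age_filled.ravel())
print(embarked_filled.ravel())

# The learned fill values are stored on the fitted imputers
print(mean_imputer.statistics_, mode_imputer.statistics_)
```

The fill values are learned in `fit` and only applied in `transform`, which is why the notebook calls `fit_transform` on the training data and plain `transform` on the test data.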

Here, "Sex" and "Embarked" are categorical columns, so they need to be encoded into numerical format before being fed into a machine learning model.

**Applying One-hot Encoding**

In [13]:
# Import the necessary class
from sklearn.preprocessing import OneHotEncoder

# Initialize encoders
ohe_sex = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe_embarked = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Fit on training data and transform
X_train_sex = ohe_sex.fit_transform(X_train[['Sex']])
X_train_embarked = ohe_embarked.fit_transform(X_train_embarked)

# Transform test data
X_test_sex = ohe_sex.transform(X_test[['Sex']])
X_test_embarked = ohe_embarked.transform(X_test_embarked)

In [14]:
# X_train_sex

In [15]:
# X_train_embarked

NOTE: 'Sex' and 'Embarked' are not encoded together because 'Embarked' had missing values, which we filled earlier. The imputed result is saved in a new array (X_train_embarked), while the original dataset still contains the missing values in 'Embarked'.

We have three new separate arrays for 'Sex', 'Embarked', and 'Age'. The rest of the data remains unchanged.

I'll drop the columns 'Sex', 'Embarked', and 'Age' from the train and test datasets, since their transformed arrays are already prepared. Then I'll concatenate those arrays with the remaining columns (Pclass, SibSp, Parch, Fare).

In [16]:
X_train_rem = X_train.drop(columns=['Sex','Age','Embarked'])
X_train_rem

Unnamed: 0,Pclass,SibSp,Parch,Fare
331,1,0,0,28.5000
733,2,0,0,13.0000
382,3,0,0,7.9250
704,3,1,0,7.8542
813,3,4,2,31.2750
...,...,...,...,...
106,3,0,0,7.6500
270,1,0,0,31.0000
860,3,2,0,14.1083
435,1,1,2,120.0000


In [17]:
X_test_rem = X_test.drop(columns=['Sex','Age','Embarked'])

In [18]:
X_train_transformed = np.concatenate((X_train_rem,X_train_age,X_train_sex,X_train_embarked),axis=1)
X_test_transformed = np.concatenate((X_test_rem,X_test_age,X_test_sex,X_test_embarked),axis=1)

# Since we haven't used ColumnTransformer here, we have to manually concatenate the features.

In [19]:
X_train_transformed.shape

(712, 10)

In [20]:
X_test_transformed.shape

(179, 10)

Now X_train_transformed and X_test_transformed are ready. Data preprocessing is complete, and the feature set is prepared for model training.
Next, we’ll train a model using the DecisionTreeClassifier algorithm.

In [21]:
# Step 1: Import the Decision Tree classifier from scikit-learn
from sklearn.tree import DecisionTreeClassifier

# Step 2: Create the model
clf_model = DecisionTreeClassifier()

# Step 3: Train the model
clf_model.fit(X_train_transformed, y_train)

In [22]:
# Step 4: Make prediction on the test data by model
y_pred = clf_model.predict(X_test_transformed)
y_pred

array([0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 1])

For each passenger in the test set, the model predicts either **0** or **1**, the two values of the 'Survived' column:
- **0** means the passenger is predicted not to have survived.
- **1** means the passenger is predicted to have survived.

In [23]:
# Step 5: Evaluate the Model
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

Model Accuracy: 0.7932960893854749
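
Accuracy is simply the fraction of predictions that match the true labels. A tiny sketch with hypothetical labels (`y_true` and `y_hat` are made up for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 4 of the 5 predictions match
y_true = np.array([0, 1, 1, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1])

# Fraction of positions where prediction and truth agree
manual = (y_true == y_hat).mean()
print(manual, accuracy_score(y_true, y_hat))
```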


We want to deploy this model on a web application that includes a **form** where users can enter details for a new passenger.

When a user enters the details, our model will take that input and predict whether the passenger will survive or not.

But before deploying the model, we use the **pickle** module to save the trained model so that it can be reused later without the need for retraining each time.

**01. Import the `pickle` module**

import pickle

**02. Save (serialize) the model as a pickle file**

pickle.dump(model_name, open('path/filename.pkl', 'wb'))

        OR (better practice using `with` statement)

with open('path/filename.pkl', 'wb') as file:
    pickle.dump(model_name, file)

In [24]:
import pickle

# Save the trained classifier model
with open('models_clf.pkl', 'wb') as file:
    pickle.dump(clf_model, file)

# Save the OneHotEncoder for the 'Sex' column
with open('models_ohe_sex.pkl', 'wb') as file:
    pickle.dump(ohe_sex, file)

# Save the OneHotEncoder for the 'Embarked' column
with open('models_ohe_embarked.pkl', 'wb') as file:
    pickle.dump(ohe_embarked, file)


**Why do we save encoders like ohe_sex, ohe_embarked along with trained model?**

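A new user will enter raw values such as "male" for Sex or "S" (Southampton) for Embarked. The form must send the code "S" rather than the full port name, because the encoder only knows the categories it saw during training.

Since a machine learning model only works with numbers, we need to convert these categorical values into numerical format using the same OneHotEncoder objects that were fitted during training.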

This ensures the input data is transformed in the exact same way, maintaining consistency with the model’s expectations.

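Strictly speaking, the fitted SimpleImputer objects should be saved as well, so that missing values in new input are filled the same way as during training. We skipped that here to keep things short, because in this case I will be manually entering all the values as the user.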
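
If the imputer were saved too, it could look roughly like the sketch below. This is not what the notebook does: the `bundle` dictionary is a hypothetical convention, the toy ages are made up, and `pickle.dumps`/`pickle.loads` are used only so the example does not need to write a file.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy training column (hypothetical values)
train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
si_age = SimpleImputer(strategy="mean").fit(train)

# Hypothetical bundle: one dictionary holding fitted preprocessing objects
bundle = {"si_age": si_age}
blob = pickle.dumps(bundle)      # serialize to bytes
restored = pickle.loads(blob)    # deserialize again

# A new passenger whose age was left blank in the form
new_passenger = pd.DataFrame({"Age": [np.nan]})
filled = restored["si_age"].transform(new_passenger)
print(filled)
```

The restored imputer fills the blank age with the mean it learned from the training column, not with anything computed from the new input.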

After running the above pickle.dump(...) commands, these three files will be created in your Google Colab working directory.

You can find them in the left-side "Files" panel in Google Colab.

These are **binary files**, so you cannot read them in a text editor, but you can load them later in your code.

**To use these pickle files later, you'll need to load the data back into your program.**

# How to **load and use a pickle file**

**01. Import the `pickle` module**

import pickle

**02. Load (deserialize) the model from the pickle file**

*Open the pickle file in read-binary mode ('rb') and load the data using pickle.load().*

model_name = pickle.load(open('path/filename.pkl', 'rb'))

    OR (better practice using `with` statement)

with open('path/filename.pkl', 'rb') as file:
    model_name = pickle.load(file)

**Note:**
Before making predictions, you must apply the **same preprocessing steps** (like filling missing values, encoding categories, scaling, etc.) to the new input data, just as you did during training. Only after that, pass the processed data to the model for prediction.

**03. Prepare your input features**

Combine the preprocessed features into a single feature vector (e.g., using np.concatenate). Now you can use this prepared input to make predictions with your machine learning model.

**04. Make predictions**

Use the loaded model's predict() method with your new input data.

model_name.predict(new_data)
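
The save and load steps can be tried end to end with a small sketch. The training data here is a tiny hypothetical set, the file name `toy_model.pkl` is made up, and a temporary directory is used so nothing is left behind.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny hypothetical training set: the label equals the first feature
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])
model = DecisionTreeClassifier(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    # Save (serialize) the model
    with open(path, "wb") as file:
        pickle.dump(model, file)
    # Load (deserialize) it back
    with open(path, "rb") as file:
        restored = pickle.load(file)

# The restored model gives the same predictions as the original
same = np.array_equal(model.predict(X), restored.predict(X))
print(same)
```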


In [25]:
import pickle
import numpy as np

In [26]:
# Load the classifier model
with open('models_clf.pkl', 'rb') as file:
    clf_model = pickle.load(file)

# Load the one-hot encoder for 'Sex'
with open('models_ohe_sex.pkl', 'rb') as file:
    ohe_sex = pickle.load(file)

# Load the one-hot encoder for 'Embarked'
with open('models_ohe_embarked.pkl', 'rb') as file:
    ohe_embarked = pickle.load(file)


In [27]:
# Assumed user input
# Column order: Pclass | Sex | Age | SibSp | Parch | Fare | Embarked
test_input = np.array([[2, 'male', 31.0, 0, 0, 10.5, 'S']], dtype=object).reshape(1,7)

In [28]:
test_input

array([[2, 'male', 31.0, 0, 0, 10.5, 'S']], dtype=object)

In [29]:
# Extract and encode 'sex' feature (column 1)
test_input_sex = ohe_sex.transform(test_input[:, 1].reshape(1, -1))
test_input_sex

array([[0., 1.]])

In [30]:
# Extract and encode 'embarked' feature (last column)
test_input_embarked = ohe_embarked.transform(test_input[:, -1].reshape(1, -1))
test_input_embarked

array([[0., 0., 1.]])

In [31]:
# Extract 'age' feature (column 2)
test_input_age = test_input[:, 2].reshape(1, -1)
test_input_age

array([[31.0]], dtype=object)

In [32]:
test_input_rem=test_input[:,[0,3,4,5]]
test_input_rem

array([[2, 0, 0, 10.5]], dtype=object)

In [33]:
# Concatenate all features into one array
test_input_transformed= np.concatenate((test_input_rem,test_input_age,test_input_sex,test_input_embarked), axis=1)
test_input_transformed

array([[2, 0, 0, 10.5, 31.0, 0.0, 1.0, 0.0, 0.0, 1.0]], dtype=object)

In [34]:
test_input_transformed.shape

(1, 10)

Here, you must keep the same feature order when combining the data as was used during feature engineering: the remaining columns (Pclass, SibSp, Parch, Fare), then Age, then Sex, then Embarked.

For the prediction to be meaningful, new user input must be encoded and formatted exactly like the training and test data (for example, the train-test dataset has 10 columns here). Before predicting, process and arrange the user input in the same way as the original data so that the model interprets each column correctly.

In [35]:
# Make a prediction using your trained model
prediction = clf_model.predict(test_input_transformed)
print("Prediction:", prediction)

Prediction: [1]


A prediction of 1 means the model predicts that this passenger will survive.

*Hence, manual preprocessing (without a pipeline) is fine for small tests but risky for real projects. It’s error-prone, hard to scale, and can lead to inconsistent results.*

**Best Practice:**
*Use scikit-learn's `Pipeline` and `ColumnTransformer` to handle preprocessing. They make the process automatic, consistent, and easier to manage in real projects.*
