This notebook builds a scikit-learn Pipeline that cleans the data with SimpleImputer and OneHotEncoder and then trains a RandomForestClassifier model.

# Importing the required tools

In [1]:
import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV , RandomizedSearchCV

# Importing the data

In [2]:
df = pd.read_csv('train.csv')

In [3]:
df = df.sample(frac=1)

In [4]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
104,105,0,3,"Gustafsson, Mr. Anders Vilhelm",male,37.0,2,0,3101276,7.925,,S
84,85,1,2,"Ilett, Miss. Bertha",female,17.0,0,0,SO/C 14885,10.5,,S
823,824,1,3,"Moor, Mrs. (Beila)",female,27.0,0,1,392096,12.475,E121,S
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
332,333,0,1,"Graham, Mr. George Edward",male,38.0,0,1,PC 17582,153.4625,C91,S


Shuffling the data

In [5]:
df = df.sample(frac=1)

In [6]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
729,730,0,3,"Ilmakangas, Miss. Pieta Sofia",female,25.0,1,0,STON/O2. 3101271,7.925,,S
853,854,1,1,"Lines, Miss. Mary Conover",female,16.0,0,1,PC 17592,39.4,D28,S
362,363,0,3,"Barbara, Mrs. (Catherine David)",female,45.0,0,1,2691,14.4542,,C
198,199,1,3,"Madigan, Miss. Margaret ""Maggie""",female,,0,0,370370,7.75,,Q
100,101,0,3,"Petranec, Miss. Matilda",female,28.0,0,0,349245,7.8958,,S


In [7]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 729 to 345
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB


In [9]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


The distribution of numerical feature values across the samples:

This helps us determine, among other early insights, how representative the training dataset is of the actual problem domain.

The training set has 891 samples, about 40% of the actual number of passengers on board the Titanic (2,224).

Survived is a categorical feature with values 0 or 1.

About 38% of the samples survived, roughly in line with the actual survival rate of 32%.

Most passengers (> 75%) did not travel with parents or children.

Nearly 30% of the passengers had siblings and/or a spouse aboard.

Fares varied significantly, with a few passengers (< 1%) paying as much as $512.

Few passengers (< 1%) were elderly, in the 65-80 age range.
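Shares like these can be computed directly instead of being read off the quantiles. The following is a minimal sketch on a small made-up frame standing in for `df`; the name `toy` and its values are illustrative assumptions, not rows from train.csv.

```python
import pandas as pd

# Hypothetical stand-in for df; only the column names match the real data.
toy = pd.DataFrame({
    "Survived": [0, 1, 0, 1, 0],
    "Parch":    [0, 0, 1, 0, 0],
    "SibSp":    [1, 0, 0, 0, 2],
})

# The mean of a 0/1 or boolean Series is the fraction of ones / True values.
survival_rate = toy["Survived"].mean()
share_no_parch = (toy["Parch"] == 0).mean()
share_with_sibsp = (toy["SibSp"] > 0).mean()

print(f"survived: {survival_rate:.0%}")
print(f"no parents/children aboard: {share_no_parch:.0%}")
print(f"siblings/spouse aboard: {share_with_sibsp:.0%}")
```

Replacing `toy` with `df` would give the corresponding figures for the training set.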

Convert the data type of "Fare" from float to integer (astype(int) truncates the fractional part)

In [10]:
df.Fare = df.Fare.astype(int)

The "Survived" and "Name" columns are dropped from the feature set because:

"Survived" is the target that we have to predict on the test dataset.

"Name" does not play a significant role in modelling.

In [11]:
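# Features: everything except the target and the free-text name
X = df.drop(["Survived", "Name"], axis=1)
# Target: 1 if the passenger survived, 0 otherwise
Y = df["Survived"]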

Checking for missing values in the dataset

In [12]:
X.isna().sum()

PassengerId      0
Pclass           0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [13]:
Y

729    0
853    1
362    0
198    1
100    0
      ..
224    1
480    0
886    0
372    0
345    1
Name: Survived, Length: 891, dtype: int64

# Implementing the pipeline

Notice that we use separate pipelines for the categorical features, the Embarked feature and the numerical features.
This is because missing categorical values are filled with the placeholder "missing", while missing Embarked values are filled with "S", the most common value in that column.
Numeric features are only imputed and do not need the one-hot encoder. Columns not listed in any transformer (PassengerId, SibSp, Parch and Fare) are dropped, since ColumnTransformer uses remainder="drop" by default.

In [14]:
np.random.seed(0)


# Define different features and transformer pipeline
categorical_features = ["Sex" , "Pclass" , "Cabin" , "Ticket"]
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])

Embarked_features = ["Embarked"]
Embarked_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="S")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])

numeric_features = ["Age"]
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))
])

# Setup preprocessing steps (fill missing values, then convert to numbers)
preprocessor = ColumnTransformer(
                    transformers=[
                        ("cat", categorical_transformer, categorical_features),
                        ("num", numeric_transformer, numeric_features),
                        ("emb" , Embarked_transformer , Embarked_features)
                    ])

# Creating a preprocessing and modelling pipeline
model = Pipeline(steps=[("preprocessor", preprocessor),
                        ("model", RandomForestClassifier())])
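To see what the preprocessing step actually produces, it can help to run the same kind of transformer on a tiny frame. The sketch below is illustrative only: `toy`, `cat`, `num` and `ct` are hypothetical names, the three rows are made up, and `sparse_threshold=0` is set purely so the result prints as a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical three-row frame; "Fare" is deliberately not given to any transformer.
toy = pd.DataFrame({
    "Sex": pd.Series(["male", "female", np.nan], dtype=object),
    "Age": [22.0, np.nan, 30.0],
    "Fare": [7.25, 71.28, 8.05],
})

# Categorical: fill NaN with the string "missing", then one-hot encode.
cat = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
# Numeric: fill NaN with the column mean.
num = Pipeline(steps=[("imputer", SimpleImputer(strategy="mean"))])

ct = ColumnTransformer(
    transformers=[("cat", cat, ["Sex"]), ("num", num, ["Age"])],
    sparse_threshold=0,  # always return a dense array
)
out = ct.fit_transform(toy)

print(out.shape)  # three one-hot columns for Sex plus one for Age
print(out)
```

The output has four columns (female, male, missing, Age) and no trace of `Fare`, which is how unlisted columns behave under the default `remainder="drop"`.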

In [15]:
model.fit(X,Y)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Sex', 'Pclass', 'Cabin',
                                                   'Ticket']),
                                                 ('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer())]),
                                                  ['Age']),
                

Evaluating the model on the training data

In [16]:
model.score(X,Y)

0.9988776655443322
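A training accuracy this close to 1.0 mostly shows that a random forest can memorise the rows it was fitted on; it says little about unseen passengers. `cross_val_score` is imported at the top of the notebook but never used. The sketch below shows how the two numbers can differ, using synthetic data from `make_classification` as an assumption in place of the Titanic frame.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, deliberately noisy data (flip_y mislabels a share of the rows).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Accuracy on the same rows the forest was trained on.
train_acc = clf.fit(X_toy, y_toy).score(X_toy, y_toy)

# Accuracy on held-out folds: each fold is scored by a model that never saw it.
cv_acc = cross_val_score(clf, X_toy, y_toy, cv=5).mean()

print(f"training accuracy:  {train_acc:.3f}")
print(f"5-fold CV accuracy: {cv_acc:.3f}")
```

In this notebook the equivalent call would be `cross_val_score(model, X, Y, cv=5)`; because `model` is a Pipeline, the imputers and encoders are refitted inside each fold.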

# Randomized Search CV

In [17]:
pipe_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "model__criterion": ["gini", "entropy"],
    "model__n_estimators": [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
    "model__max_depth": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    # Note: "auto" behaved like "sqrt" for classifiers; it was deprecated in
    # scikit-learn 1.1 and later removed, so drop it when running on a recent version.
    "model__max_features": ["auto", "sqrt", "log2"],
    "model__min_samples_split": np.arange(2, 8, 2),
    "model__min_samples_leaf": np.arange(1, 10, 1)
}

In [18]:
gs_model = RandomizedSearchCV(model, pipe_grid, cv=5, n_iter=200, verbose=2 , n_jobs=-1 , random_state=0)
gs_model.fit(X,Y)

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('preprocessor',
                                              ColumnTransformer(transformers=[('cat',
                                                                               Pipeline(steps=[('imputer',
                                                                                                SimpleImputer(fill_value='missing',
                                                                                                              strategy='constant')),
                                                                                               ('onehot',
                                                                                                OneHotEncoder(handle_unknown='ignore'))]),
                                                                               ['Sex',
                                                                                'Pclass',
                                 

In [19]:
gs_model.best_params_

{'preprocessor__num__imputer__strategy': 'median',
 'model__n_estimators': 400,
 'model__min_samples_split': 6,
 'model__min_samples_leaf': 1,
 'model__max_features': 'log2',
 'model__max_depth': 100,
 'model__criterion': 'entropy'}
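It is worth knowing how much of the search space those 200 candidates cover. A rough sketch using `ParameterGrid` to count the combinations; the dictionary `space` repeats the values of `pipe_grid` so the block runs on its own, and the names `space`, `n_total` and `n_sampled` are introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same values as pipe_grid, written compactly.
space = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "model__criterion": ["gini", "entropy"],
    "model__n_estimators": list(range(200, 2001, 200)),
    "model__max_depth": list(range(10, 101, 10)) + [None],
    "model__max_features": ["auto", "sqrt", "log2"],
    "model__min_samples_split": np.arange(2, 8, 2),
    "model__min_samples_leaf": np.arange(1, 10, 1),
}

# ParameterGrid enumerates the full Cartesian product; len() gives its size.
n_total = len(ParameterGrid(space))
n_sampled = 200  # n_iter passed to RandomizedSearchCV

print(n_total)                        # 35640
print(f"{n_sampled / n_total:.2%}")   # 0.56%
```

With well under 1% of the combinations tried, the reported best parameters are best read as a reasonable starting region for a finer search, not as an optimum.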

# GridSearch CV

In [20]:
pipe_grid_2 = {
    "preprocessor__num__imputer__strategy": ["median"],
    "model__criterion": ["gini"],
    "model__n_estimators": [400,500],
    "model__max_depth": [20,25,30],
    "model__max_features": ["sqrt"],
    "model__min_samples_split": [2],
    "model__min_samples_leaf": [1,2]
}

In [21]:
gs_model = GridSearchCV(model, pipe_grid_2, cv=5, verbose=True , n_jobs=-1 )
gs_model.fit(X,Y)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('cat',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer(fill_value='missing',
                                                                                                        strategy='constant')),
                                                                                         ('onehot',
                                                                                          OneHotEncoder(handle_unknown='ignore'))]),
                                                                         ['Sex',
                                                                          'Pclass',
                                                                          'Cabin',
          

In [22]:
gs_model.best_params_

{'model__criterion': 'gini',
 'model__max_depth': 25,
 'model__max_features': 'sqrt',
 'model__min_samples_leaf': 1,
 'model__min_samples_split': 2,
 'model__n_estimators': 400,
 'preprocessor__num__imputer__strategy': 'median'}
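`best_params_` alone does not say how good the winning combination is or how close the others came. A fitted search also exposes `best_score_` and `cv_results_`. The sketch below shows how they might be inspected, using a small synthetic dataset and a hypothetical two-by-two grid (`toy_grid`) in place of the notebook's pipeline.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the block runs without train.csv.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

toy_grid = {"n_estimators": [20, 40], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), toy_grid, cv=5)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.3f}")

# One row per candidate, with the mean and spread of its fold scores.
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))
```

Applied to this notebook, `gs_model.best_score_` would give the mean 5-fold accuracy of the selected pipeline, a more honest figure than the training score.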

# Importing test data

In [23]:
df_test = pd.read_csv("test.csv")

In [24]:
x_test = df_test.drop(["Name"] , axis=1)
y_preds = gs_model.predict(x_test)

In [25]:
survived = y_preds
survived

array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

In [26]:
# Named "submission" rather than "dict" so the built-in dict is not shadowed
submission = {"PassengerId": x_test["PassengerId"],
              "Survived": y_preds}

In [27]:
final = pd.DataFrame(submission)

In [28]:
final

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,0
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0


In [29]:
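# index=False keeps the row index out of the file, leaving only PassengerId and Survived
final.to_csv("results.csv", index=False)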