<a href="https://www.kaggle.com/code/erkanhatipoglu/titanic-on-function-transformers-and-pipelines?scriptVersionId=106641894" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction  <a id='introduction'></a>

This is starter code for those who want to use function transformers within sklearn pipelines. We will use the Titanic dataset for this purpose.
Kagglers interested in using early_stopping_rounds and cross-validation with pipelines may refer to my notebook [Housing Prices: Pipeline Starter Code](https://www.kaggle.com/erkanhatipoglu/housing-prices-pipeline-starter-code).

Kagglers interested in using grid search may refer to my notebook [Housing Prices: GridSearchCV Example](https://www.kaggle.com/erkanhatipoglu/housing-prices-gridsearchcv-example).

Kagglers interested in more advanced subjects of sklearn pipelines may refer to my notebook [Introduction to Sklearn Pipelines with Titanic](https://www.kaggle.com/erkanhatipoglu/introduction-to-sklearn-pipelines-with-titanic).

Thank you for reading.



# Table of Contents
* [Introduction](#introduction)
* [Helper Functions](#functions)
* [Loading Data](#loading)
* [Function Transformers](#functiontransformers) 
* [Preprocessing](#preprocessing) 
* [Cross-validation](#cross-validation)    
* [Prediction](#prediction) 
* [Saving and submission](#saving)  
* [References](#references)

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
pd.set_option('display.max_rows', None)

/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv


# Helper Functions   <a id='functions'></a>

<div class="alert alert-block alert-info">
<b>Tip:</b> We will use some utility functions throughout the notebook. Collecting them in one place is a good idea, making the code more organized.
</div>

In [2]:
def save_file (predictions):
    """Save submission file."""
    # Save test predictions to file
    output = pd.DataFrame({'PassengerId': sample_sub_file.PassengerId,
                       'Survived': predictions})
    output.to_csv('submission.csv', index=False)
    print ("Submission file is saved")
    
def transform_age(df):
    '''Transform the 'Age' column of the Titanic dataset into a category.

    Passengers younger than 16 are labeled "Child". Passengers with a missing
    age are labeled "Child" only if their title is "Master". Everyone else is
    labeled "Adult". The 'Name' column is dropped afterward.'''
    # Make a copy to avoid changing original data
    X_temp = df.copy()

    # Extract the title: the last word before the first '.' in the name
    # (e.g. "Mr" from "Ekstrom, Mr. Johan")
    title = X_temp['Name'].str.split('.').str[0].str.split().str[-1]

    # A known age below 16 means child; a missing age falls back to the title
    age = X_temp['Age'].astype(float)
    is_child = (age < 16) | (age.isna() & (title == 'Master'))

    # Replace the numeric Age with the categorical one and drop Name
    X_temp.drop(['Age', 'Name'], axis=1, inplace=True)
    X_temp['Age'] = np.where(is_child, 'Child', 'Adult')
    return X_temp

def transform_family(df):
    '''Calculate the family size by summing the 'Parch' and 'SibSp' columns
    into a new 'Fcount' column. 'Parch' and 'SibSp' are dropped afterward.'''
    # Make a copy to avoid changing original data
    X_temp = df.copy()

    # Create the Fcount column as the element-wise sum of Parch and SibSp
    X_temp['Fcount'] = (X_temp['Parch'] + X_temp['SibSp']).astype('int64')
    X_temp.drop(['Parch', 'SibSp'], axis=1, inplace=True)

    return X_temp

print("Functions loaded")

Functions loaded
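
The title lookup inside `transform_age` can be hard to picture from the code alone. As a small sketch (the `names` series below is just three names copied from the training data for illustration, and the variable names are assumptions), the title is the last word before the first period in the name:

```python
import pandas as pd

# Three passenger names taken from the Titanic training data
names = pd.Series([
    "Ekstrom, Mr. Johan",
    'Yrois, Miss. Henriette ("Mrs Harbeck")',
    "Harris, Mrs. Henry Birkhardt (Irene Wallach)",
])

# Keep the text before the first '.', then take its last word
titles = names.str.split(".").str[0].str.split().str[-1]
print(titles.tolist())  # ['Mr', 'Miss', 'Mrs']
```

Note that only the text before the first period is used, so the "Mrs" inside the parentheses of the second name is ignored.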


# Loading Data   <a id='loading'></a>

<div class="alert alert-block alert-info">
We will start by loading the data. After loading, we will drop the Ticket, Cabin, and Embarked columns since we will not use them. Next, we will split the training data into training and validation sets.
</div> 

In [3]:
# Loading data
train_data = pd.read_csv('/kaggle/input/titanic/train.csv', index_col='PassengerId')
test_data = pd.read_csv('/kaggle/input/titanic/test.csv', index_col='PassengerId')
sample_sub_file = pd.read_csv('/kaggle/input/titanic/gender_submission.csv')

# Make a copy to avoid changing original data
X = train_data.copy()
y = X.Survived
X_test = test_data.copy()

# Remove target from predictors
X.drop(['Survived'], axis=1, inplace=True)
print("['Survived'] column dropped from training data!")

# Remove Ticket, Cabin, Embarked columns. We will not use them.
cols_dropped = ["Ticket", "Cabin", "Embarked"]
X.drop(cols_dropped, axis = 1, inplace = True)
X_test.drop(cols_dropped, axis = 1, inplace = True)
print("{} dropped from both training and test data!".format(cols_dropped))

print("\nShape of training data: {}".format(X.shape))
print("Shape of target: {}".format(y.shape))
print("Shape of test data: {}".format(X_test.shape))
print("Shape of submission data: {}".format(sample_sub_file.shape))

# Split the data for validation
X_train, X_valid, y_train, y_valid = train_test_split(X,y, random_state=2)

print("\nShape of X_train data: {}".format(X_train.shape))
print("Shape of X_valid: {}".format(X_valid.shape))
print("Shape of y_train: {}".format(y_train.shape))
print("Shape of y_valid: {}".format(y_valid.shape))

print("\nFiles Loaded")

['Survived'] column dropped from training data!
['Ticket', 'Cabin', 'Embarked'] dropped from both training and test data!

Shape of training data: (891, 7)
Shape of target: (891,)
Shape of test data: (418, 7)
Shape of submission data: (418, 2)

Shape of X_train data: (668, 7)
Shape of X_valid: (223, 7)
Shape of y_train: (668,)
Shape of y_valid: (223,)

Files Loaded


In [4]:
X_train.head()

| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| 200 | 2 | Yrois, Miss. Henriette ("Mrs Harbeck") | female | 24.0 | 0 | 0 | 13.0 |
| 130 | 3 | Ekstrom, Mr. Johan | male | 45.0 | 0 | 0 | 6.975 |
| 91 | 3 | Christmann, Mr. Emil | male | 29.0 | 0 | 0 | 8.05 |
| 231 | 1 | Harris, Mrs. Henry Birkhardt (Irene Wallach) | female | 35.0 | 1 | 0 | 83.475 |
| 127 | 3 | McMahon, Mr. Martin | male | NaN | 0 | 0 | 7.75 |


# Function Transformers  <a id='functiontransformers'></a> 

<div class="alert alert-block alert-info">First, let's see how to define a function transformer. We can then apply the function transformers to our dataset to see the result.
</div>

In [5]:
# Define the custom transformers for the pipeline
age_transformer = FunctionTransformer(transform_age)
family_transformer = FunctionTransformer(transform_family)

In [6]:
X_temp = age_transformer.fit_transform(X)
X_temp = family_transformer.fit_transform(X_temp)

In [7]:
X_temp[5:10]

| PassengerId | Pclass | Sex | Fare | Age | Fcount |
|---|---|---|---|---|---|
| 6 | 3 | male | 8.4583 | Adult | 0 |
| 7 | 1 | male | 51.8625 | Adult | 0 |
| 8 | 3 | male | 21.075 | Child | 4 |
| 9 | 3 | female | 11.1333 | Adult | 2 |
| 10 | 2 | female | 30.0708 | Child | 1 |
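
A `FunctionTransformer` has nothing to learn during `fit`; it simply calls the wrapped function in `transform`. As a minimal sketch on a toy frame (the `add_family_count` helper, its `new_col` parameter, and the three rows of data are hypothetical and not part of this notebook), extra arguments can be forwarded to the function with `kw_args`:

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def add_family_count(df, new_col="Fcount"):
    """Hypothetical helper: sum Parch and SibSp into `new_col`, then drop them."""
    out = df.copy()
    out[new_col] = out["Parch"] + out["SibSp"]
    return out.drop(columns=["Parch", "SibSp"])

# Toy data made up for illustration only
toy = pd.DataFrame({"Parch": [0, 1, 2],
                    "SibSp": [1, 0, 3],
                    "Fare": [7.25, 71.28, 21.08]})

# kw_args forwards keyword arguments to the wrapped function
toy_transformer = FunctionTransformer(add_family_count,
                                      kw_args={"new_col": "FamilySize"})
toy_out = toy_transformer.fit_transform(toy)

print(toy_out.columns.tolist())        # ['Fare', 'FamilySize']
print(toy_out["FamilySize"].tolist())  # [1, 1, 5]
```

Because the helper works on a copy, the original `toy` frame still has its `Parch` and `SibSp` columns after the call.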


# Preprocessing   <a id='preprocessing'></a>

<div class="alert alert-block alert-info">Although we have already defined the function transformers above, we will start from scratch and redefine them in this part for pipelines. This is for the convenience of those who want to copy and paste the code.
</div>

In [8]:
# Define transformers

# Define the custom transformers for the pipeline
age_transformer = FunctionTransformer(transform_age)
family_transformer = FunctionTransformer(transform_family)

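# Define transformer for categorical columns using a pipeline
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; on newer
# versions, use OneHotEncoder(drop='first', sparse_output=False) instead
cat_cols = ["Sex", "Age", "Pclass"]
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(drop='first', sparse=False))])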

# Define column transformer for categorical data
column_transformer = ColumnTransformer(transformers=[('cat', categorical_transformer, cat_cols)], remainder='passthrough')

In [9]:
# Define Model
model = XGBClassifier(seed=42)

In [10]:
# Define preprocessor
preprocessor = Pipeline(steps=[('age', age_transformer),
                              ('family', family_transformer),
                              ('column', column_transformer)])

# Make a copy to avoid changing original data 
X_valid_eval=X_valid.copy()

# Preprocessing of validation data
X_valid_eval = preprocessor.fit(X_train, y_train).transform (X_valid_eval)

# Display the number of remaining columns after transformation 
print("We have", X_valid_eval.shape[1], "features left")

We have 6 features left


In [11]:
# Create and Evaluate the Pipeline
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

In [12]:
# Make copies to avoid changing original data
X_cv = X.copy()
X_sub = X_test.copy()

# Cross-validation <a id='cross-validation'></a>

In [13]:
# Cross-validation
scores = cross_val_score(my_pipeline, X_cv, y,
                              cv=5,
                              scoring='accuracy')

print("Accuracy scores:\n", scores)
print("Accuracy mean: {}".format(scores.mean()))
print("Accuracy std: {}".format(scores.std()))

Accuracy scores:
 [0.79888268 0.79775281 0.84831461 0.84269663 0.85393258]
Accuracy mean: 0.8283158621555458
Accuracy std: 0.024752313171335374
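
Because the whole pipeline is passed to `cross_val_score`, the preprocessing steps are refit on the training folds of every split, so nothing from the held-out fold leaks into the encoders. With an integer `cv` and a classifier, scikit-learn uses stratified folds without shuffling. If you prefer to control the splitter yourself, a sketch could look like the following; it uses synthetic data and a logistic regression as a stand-in rather than the Titanic pipeline, and all names in it are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, for illustration only
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Stand-in pipeline: scaling followed by a simple classifier
toy_pipeline = Pipeline(steps=[("scale", StandardScaler()),
                               ("model", LogisticRegression())])

# Explicit splitter: stratified, shuffled, and reproducible
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
toy_scores = cross_val_score(toy_pipeline, X_toy, y_toy,
                             cv=cv, scoring="accuracy")

print("Accuracy per fold:", toy_scores)
print("Accuracy: {:.3f} +/- {:.3f}".format(toy_scores.mean(), toy_scores.std()))
```

The same `cv` object could be passed to the `cross_val_score` call above in place of `cv=5`.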


# Prediction   <a id='prediction'></a>

In [14]:
# Preprocessing of training data, fit model 
my_pipeline.fit(X_cv, y)

# Get predictions
preds = my_pipeline.predict(X_sub)

# Saving and submission   <a id='saving'></a>

In [15]:
# Use predefined utility function
save_file(preds)

Submission file is saved


# References   <a id='references'></a>
* [10-simple-hacks-to-speed-up-your-data-analysis - Parul Pandey](https://www.kaggle.com/parulpandey/10-simple-hacks-to-speed-up-your-data-analysis)
* [Dataset Transformations - Scikit-learn](https://scikit-learn.org/stable/data_transforms.html)
* [Intermediate Machine Learning Course - Pipelines](https://www.kaggle.com/alexisbcook/pipelines)
* [Kaggle Learn](https://www.kaggle.com/learn/overview)