Progress list:

- ~~Develop basic pipeline that can process data~~
    - ~~NaN handling for categorical and numerical columns of DataFrame~~
    - ~~Model building - simple, defined RandomForestRegression to start with~~
- Enhance pipeline with cross-validation/gridsearch
- Improve feature engineering using insights
    - Handling of Ticket data (Ticket_num, Ticket_pre)
    - Name splitting (Title, First_name, Surname, Other_names, Maiden_names)
    - 'Is_alone' column, for Parch & SibSp == 0
    - Familial_rel column, try to work out role within family (e.g. Father, Mother, Grandfather, etc.)
    - Age inference (How best to do this?)
        - First step: view known age distributions for each title (see Explore_data) 
        - For 'Master' title passengers, use median of other 'Master's
        - May be able to simply use median ages for groups within Familial_rel column, if not, split data by Pclass and then do so
        - Split data by Pclass, 
            - then separate out Is_alone and allocate median to these passengers  
- Implement tests to check pipeline handling is going as expected
    - Test whether the output model is better than the baseline: doing nothing to the data (other than removing NaN values)
    - Test whether the data in the transformed columns is in the expected format

[Custom pipeline transformations](https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65)

- Idea is to create a custom transformer class that handles the numerical or categorical columns named within in it using a specified method
- 'where will I find these base classes that come with most of the methods I need to write my transformer class on top of? Fret not. Scikit-Learn provides us with two great base classes, [TransformerMixin](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html) and [BaseEstimator](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html). Inheriting from TransformerMixin ensures that all we need to do is write our fit and transform methods and we get fit_transform for free.'

For my use case:

In [134]:
import re
import numpy as np 
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline 

#Custom Transformer that extracts columns passed as argument to its constructor 
class FeatureSelector(BaseEstimator, TransformerMixin):
    #Class Constructor 
    def __init__(self, feature_names):
        self._feature_names = feature_names 
    
    #Return self nothing else to do here    
    def fit( self, X, y = None):
        return self 
    
    #Method that describes what we need this transformer to do
    def transform(self, X, y = None):
        return X[self._feature_names] 
    
# Custom transformer that:
# Extracts ticket_num from the Ticket column
# ~~Create a numerical column for 'shared_exact_ticket' if ticket_num is not unique
# ~~Create a numerical column for 'shared_adjacent_ticket' if ticket_num +/- 1 is not unique
# splits 'Name' column into Title, First_name, Surname, Maiden_first_name and Maiden_surname
# Creates a 'num_cabins' column by counting the number of spaces in the 'Cabin' column, Nan => 0
# ~~Create a 'Maiden_fam_aboard' column if Maiden_surname matches any instance in Surname column

## ~~ issues: not working with whole dataset, so will likely skew results! e.g. if shared ticket is in another part of the data

class CategoricalTransformer(BaseEstimator, TransformerMixin):
    # Return self nothing else to do here
    def __init__(self):
        return self
        
    # Return self nothing else to do here
    def fit(self, X, y = None):
        return self

    # Helper function to extract number of cabins (if NaN, 0)
    def get_num_cabins(self, obj):
        try:
            return str(obj).count(' ') + 1
        except:
            return 0
        
    # Helper function to extract surname from 'Name' column
    def get_surname(self, obj):
        str(obj).split(sep = ", ")[0]
        
    # Helper function to extract title from 'Name' column
    def get_title(self, obj):
        str(obj).split(sep = ", ")[1].split(sep = ". ")[0] # or re.search(r"(?<=, ).+?(?=\. )", str(obj)).group(1)
        
#     # Helper function to extract first name from 'Name' column
#     def get_first_name(self, obj):
#         Result = re.search(r"(?<=\. ).+?(?= )", str(obj))[0]
#         if Result:
#             return Result
        
#     # Helper function to extract maiden surname name from 'Name' column
#     def get_maiden_surname(self, obj):
#         Result = re.search(r"(?<= )[a-zA-Z]+?(?=\))", str(obj))[0]
#         if Result:
#             return Result
        
#     # Helper function to extract maiden surname name from 'Name' column
#     def get_maiden_first_name(self, obj):
#         Result = re.search(r"(?<=\()[a-zA-Z]+?(?= )", str(obj))[0]
#         if Result:
#             return Result
        
    # Helper function that gets the ticket number from 'Ticket' column
    def get_ticket_num(self, obj):
        re.search(r"(?=(?:\D*\d))([a-zA-Z0-9]*$)", str(obj))
        
    #Transformer method we wrote for this transformer 
    def transform(self, X , y = None):
        #Depending on constructor argument add num_cabins
        #using the helper functions written above 
        X.loc[:, 'Num_cabins'] = X['Cabin'].apply(self.get_num_cabins) 
        
        X.loc[:, 'Ticket_num'] = X['Ticket'].apply(self.get_ticket_num) 
                
        X.loc[:, 'Title']             = X['Name'].apply(self.get_title)
#         X.loc[:, 'First_name']        = X['Name'].apply(self.get_first_name)
        X.loc[:, 'Surname']           = X['Name'].apply(self.get_surname)
#         X.loc[:, 'Maiden_first_name'] = X['Name'].apply(self.get_maiden_first_name)
#         X.loc[:, 'Maiden_surname']    = X['Name'].apply(self.get_maiden_surname)
        
        #Drop unnecessary Name column 
        X = X.drop('Name', axis = 1 )
            
        #returns numpy array
        return X.values 

# Custom transformer we wrote to engineer features (create an 'Is_alone'
# column if SibSp & Parch are both 0
# passed as boolen arguements to its constructor
class NumericalTransformer(BaseEstimator, TransformerMixin):
    #Class Constructor
    def __init__( self, Is_alone = True ):
        self._Is_alone = Is_alone
        
    #Return self, nothing else to do here
    def fit( self, X, y = None ):
        return self 
    
    #Custom transform method we wrote that creates aformentioned features and drops redundant ones 
    def transform(self, X, y = None):
        #Check if needed 
        if self._Is_alone:
            #create new column
            X.loc[:,'Is_alone'] = np.where((X['SibSp'] == 0) & (X['Parch'] == 0), 1, 0)
        
        #returns numpy array
        return X.values 
            
# Cardinality: number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
# categorical_features = [cname for cname in X.columns if X[cname].nunique() < 20 and X[cname].dtype == "object"]
categorical_features = ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

# Select numerical columns
# numerical_features = [cname for cname in X.columns if X[cname].dtype in ['int64', 'float64']]
numerical_features = ['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

#Defining the steps in the categorical pipeline 
categorical_pipeline = Pipeline(steps = [
    ('cat_selector', FeatureSelector(categorical_features)),
    ('cat_transformer', CategoricalTransformer()),
    ('imputer', SimpleImputer(strategy = 'constant', 
                               fill_value = 'None')),
    ('one_hot_encoder', OneHotEncoder(sparse = False))
])
    
#Defining the steps in the numerical pipeline     
numerical_pipeline = Pipeline(steps = [
    ('num_selector', FeatureSelector(numerical_features)),
    ('num_transformer', NumericalTransformer()),
    ('imputer', SimpleImputer(strategy = 'median')),
    ('std_scaler', StandardScaler())
])

#Combining numerical and categorical piepline into one full big pipeline horizontally 
#using FeatureUnion
full_pipeline = FeatureUnion(transformer_list = [
    ('categorical_pipeline', categorical_pipeline),
    ('numerical_pipeline', numerical_pipeline)
])

TypeError: __init__() should return None, not 'CategoricalTransformer'

In [70]:
SimpleImputer?

In [141]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Import data
train_data = pd.read_csv('Data/train.csv')
test_data = pd.read_csv('Data/test.csv')

y = train_data['Survived'].copy()

X = train_data.drop('Survived', axis = 1).copy()
X_test = test_data.copy()

# Divide data into training and validation subsets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, 
                                                                train_size=0.8, test_size=0.2,
                                                                random_state=0)

# #The full pipeline as a step in another pipeline with an estimator as the final step
# full_pipeline_m = Pipeline(steps = [
#     ('full_pipeline', full_pipeline),
#     ('model', RandomForestRegressor())
# ])

#Can call fit on it just like any other pipeline
full_pipeline_m.fit(X_train, y_train)

# #Can predict with it like any other pipeline
# y_pred = full_pipeline_m.predict(X_valid) 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the document

ValueError: Expected 2D array, got scalar array instead:
array=nan.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

In [157]:
import re
import numpy as np 
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline 

#Custom Transformer that extracts columns passed as argument to its constructor 
class FeatureSelector(BaseEstimator, TransformerMixin):
    #Class Constructor 
    def __init__(self, feature_names):
        self._feature_names = feature_names 
    
    #Return self nothing else to do here    
    def fit( self, X, y = None):
        return self 
    
    #Method that describes what we need this transformer to do
    def transform(self, X, y = None):
        return X[self._feature_names] 
    
# Custom transformer that:
# Extracts ticket_num from the Ticket column
# ~~Create a numerical column for 'shared_exact_ticket' if ticket_num is not unique
# ~~Create a numerical column for 'shared_adjacent_ticket' if ticket_num +/- 1 is not unique
# splits 'Name' column into Title, First_name, Surname, Maiden_first_name and Maiden_surname
# Creates a 'num_cabins' column by counting the number of spaces in the 'Cabin' column, Nan => 0
# ~~Create a 'Maiden_fam_aboard' column if Maiden_surname matches any instance in Surname column

## ~~ issues: not working with whole dataset, so will likely skew results! e.g. if shared ticket is in another part of the data

class CategoricalTransformer(BaseEstimator, TransformerMixin):
    # Return self nothing else to do here
    def __init__(self):
        pass
        
    # Return self nothing else to do here
    def fit(self, X, y = None):
        return self

    # Helper function to extract number of cabins (if NaN, 0)
    def get_num_cabins(self, obj):
        try:
            return str(obj).count(' ') + 1
        except:
            return 0
                        
    #Transformer method we wrote for this transformer 
    def transform(self, X , y = None):
        #Depending on constructor argument add num_cabins
        #using the helper functions written above 
        X.loc[:, 'Num_cabins'] = X['Cabin'].apply(self.get_num_cabins) 
                    
        #returns numpy array
        return X.values 

# Custom transformer we wrote to engineer features (create an 'Is_alone'
# column if SibSp & Parch are both 0
# passed as boolen arguements to its constructor
class NumericalTransformer(BaseEstimator, TransformerMixin):
    #Class Constructor
    def __init__(self):
        pass
        
    #Return self, nothing else to do here
    def fit( self, X, y = None ):
        return self 
    
    #Custom transform method we wrote that creates aformentioned features and drops redundant ones 
    def transform(self, X, y = None):
        X.loc[:,'Is_alone'] = np.where((X['SibSp'] == 0) & (X['Parch'] == 0), 1, 0)
        return X.values
            
# Cardinality: number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
# categorical_features = [cname for cname in X.columns if X[cname].nunique() < 20 and X[cname].dtype == "object"]
categorical_features = ['Sex', 'Cabin']

# Select numerical columns
# numerical_features = [cname for cname in X.columns if X[cname].dtype in ['int64', 'float64']]
numerical_features = ['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

#Defining the steps in the categorical pipeline 
categorical_pipeline = Pipeline(steps = [
    ('cat_selector', FeatureSelector(categorical_features)),
    ('cat_transformer', CategoricalTransformer()),
    ('imputer', SimpleImputer(strategy = 'constant', 
                               fill_value = 'None')),
    ('one_hot_encoder', OneHotEncoder(sparse = False))
])
    
#Defining the steps in the numerical pipeline     
numerical_pipeline = Pipeline(steps = [
    ('num_selector', FeatureSelector(numerical_features)),
    ('num_transformer', NumericalTransformer()),
    ('imputer', SimpleImputer(strategy = 'median'))
])

#Combining numerical and categorical piepline into one full big pipeline horizontally 
#using FeatureUnion
full_pipeline = FeatureUnion(transformer_list = [
    ('categorical_pipeline', categorical_pipeline),
    ('numerical_pipeline', numerical_pipeline)
])

#The full pipeline as a step in another pipeline with an estimator as the final step
full_pipeline_m = Pipeline(steps = [
    ('full_pipeline', full_pipeline),
    ('model', RandomForestRegressor())
])

In [160]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Import data
train_data = pd.read_csv('Data/train.csv')
test_data = pd.read_csv('Data/test.csv')

y = train_data['Survived'].copy()

X = train_data.drop('Survived', axis = 1).copy()
X_test = test_data.copy()

# Divide data into training and validation subsets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, 
                                                                train_size=0.8, test_size=0.2,
                                                                random_state=0)


#Can call fit on it just like any other pipeline
full_pipeline_m.fit(X_train, y_train)

# #Can predict with it like any other pipeline
# y_pred = full_pipeline_m.predict(X_valid) 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the document

ValueError: Found unknown categories ['C87', 'B37', 'C7', 'E34', 'B42', 'C85', 'C106', 'D10 D12', 'B102', 'C62 C64', 'C54', 'C47', 'D45', 'B78', 'B41', 'F G73', 'C83', 'D49', 'D9', 'B50'] in column 1 during transform