# Assignment 5

We will develop a classification pipeline to predict whether a Titanic passenger survived. Go to the Kaggle page for the Titanic data and download the training and testing data sets. (Verification: the training file has 891 data points and the testing file has 418.)

# 1. [70 pts] 
Preprocess the data, impute missing values as you see fit, and remove features that seem useless.

In [1]:
import numpy as np
import pandas as pd


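# Training set: features plus the `Survived` label
X_train = pd.read_csv('datasets/train.csv')
# NOTE: despite its name, `y` holds the Kaggle test set (features only, no `Survived` column)
y = pd.read_csv('datasets/test.csv')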


print('Training\n')
print(X_train.describe())
print(X_train.dtypes)
print(X_train.head())

print('\n\n')

print('Testing\n')
print(y.describe())
print(y.dtypes)
print(y.head())

Training

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  
PassengerId      int64
Survived         int64
Pclass           int64
Na

In [2]:
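def print_columns_uniques(df_name, df):
    '''Print the total, non-null count, and null percentage of each column.'''
    for c in df.columns:
        total = len(df[c])
        non_null = df[c].count()
        null_pct = (1 - non_null / total) * 100
        print(f'[{df_name}]\tcolumn: "{c}" \t total cells: { total } \t Non-Null count: { non_null } \t Null%: { null_pct }\n')
    print('\n\n')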

In [3]:
print_columns_uniques('y', y)

print_columns_uniques('X_train', X_train)

[y]	column: "PassengerId" 	 total cells: 418 	 Non-Null count: 418 	 Null%: 0.0

[y]	column: "Pclass" 	 total cells: 418 	 Non-Null count: 418 	 Null%: 0.0

[y]	column: "Name" 	 total cells: 418 	 Non-Null count: 418 	 Null%: 0.0

[y]	column: "Sex" 	 total cells: 418 	 Non-Null count: 418 	 Null%: 0.0

[y]	column: "Age" 	 total cells: 418 	 Non-Null count: 332 	 Null%: 20.574162679425832

[y]	column: "SibSp" 	 total cells: 418 	 Non-Null count: 418 	 Null%: 0.0

[y]	column: "Parch" 	 total cells: 418 	 Non-Null count: 418 	 Null%: 0.0

[y]	column: "Ticket" 	 total cells: 418 	 Non-Null count: 418 	 Null%: 0.0

[y]	column: "Fare" 	 total cells: 418 	 Non-Null count: 417 	 Null%: 0.23923444976076125

[y]	column: "Cabin" 	 total cells: 418 	 Non-Null count: 91 	 Null%: 78.22966507177034

[y]	column: "Embarked" 	 total cells: 418 	 Non-Null count: 418 	 Null%: 0.0




[X_train]	column: "PassengerId" 	 total cells: 891 	 Non-Null count: 891 	 Null%: 0.0

[X_train]	column: "Survived" 	 total
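
A more compact view of the same null percentages can be sketched with `isna().mean()`. This is an alternative, not the function used above; the helper name `null_summary` and the small frame `toy` are hypothetical, and the rows are made up for illustration.

```python
import numpy as np
import pandas as pd


def null_summary(df):
    # Hypothetical helper: one row per column with its non-null count and null percentage
    return pd.DataFrame({
        'non_null': df.count(),
        'null_pct': df.isna().mean() * 100,
    }).sort_values('null_pct', ascending=False)


# Made-up frame for illustration only; the real data comes from the Kaggle files
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Sex': ['male', 'female', 'female', 'male'],
})

summary = null_summary(toy)
print(summary)
```

Sorting by `null_pct` puts the columns that most need imputation at the top.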

In [17]:
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def get_last_name(name):
    '''
        Assumes name is in the format
        <Last Name>, <rest of name>
    '''
    return name.lower().strip().split(',')[0]

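def drop_cols(df, cols):
    '''Return a copy of df without the given columns.'''
    return df.drop(columns=cols)

# Not yet used in the pipeline below: replaces `Name` with the extracted last name
column_transformer = ColumnTransformer(
    transformers=[
        # ColumnTransformer expects a transformer object, so wrap get_last_name
        ('last_name', FunctionTransformer(lambda df: df.apply(lambda col: col.map(get_last_name))), ['Name'])
    ],
    remainder='passthrough'  # Keep other columns unchanged
)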


# Custom transformer for specific column imputation
class SpecificColumnImputer(BaseEstimator, TransformerMixin):
    '''
        Transformer that fills in missing values in a single column
        (applied to the Cabin column below)
    '''
    def __init__(self, column, strategy='constant', fill_value=None):
        self.column = column
        self.strategy = strategy
        self.fill_value = fill_value

    def fit(self, X, y=None):
        # Learn the fill statistic here so that transform() never refits on new data
        self.imputer_ = SimpleImputer(strategy=self.strategy, fill_value=self.fill_value)
        self.imputer_.fit(X[[self.column]])
        return self

    def transform(self, X):
        X = X.copy()
        X[[self.column]] = self.imputer_.transform(X[[self.column]])
        return X

    # fit_transform is inherited from TransformerMixin (fit, then transform)

# Create a pipeline with the custom transformer
columns_to_drop=['PassengerId', 'Ticket']
numeric_columns = [c for c in X_train.columns if X_train[c].dtypes != 'object' and c not in columns_to_drop]
pipeline = Pipeline([
    ('drop_cols', FunctionTransformer(lambda df: df.drop(columns_to_drop, axis=1) )),
    # (ColumnTransformer(get_last_name) ), 
    ('specific_imputer', SpecificColumnImputer(column='Cabin', strategy='constant', fill_value='probably_rough')),
    # ('other_imputer', SimpleImputer(strategy='mean'))  
])

# Fit and transform the DataFrame using the pipeline
X_imputed = pipeline.fit_transform(X_train)

print(numeric_columns)

['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']


### 1 Answer
I will be using a RandomForestClassifier. Investigating the nulls in each column of each dataset gives an idea of where imputation is needed. The steps I took to preprocess the data are:

- Split `X_train` into `X_train` and `y_train`
- `Ticket` is a ticket number and seems useless. There may be something to extract from it, but it's not obvious. Drop the column.
- `PassengerId` is an ID column. Perhaps lower numbers meant higher-class fares, but we don't have that detail. Drop the column.
- Perhaps "LastName" has something to do with survival. It can be extracted by taking the part of `Name` to the left of the first ",". This is easy to do, so we will do it.
- `Cabin` is mostly null. To me this indicates that the column is important and that most people were in an unnamed cabin. For that I will define a custom imputer.
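
The first and fourth steps above (splitting off the label and extracting the last name) can be sketched as follows. This is a minimal sketch on a small made-up DataFrame that only borrows the Kaggle column names; the names `toy`, `X`, and `y_train` are hypothetical, and the rows are not real passengers.

```python
import pandas as pd

# Made-up rows for illustration only; the real data comes from datasets/train.csv
toy = pd.DataFrame({
    'Survived': [0, 1, 1],
    'Name': ['Smith, Mr. John', 'Jones, Mrs. Mary (Ann Lee)', 'Smith, Miss. Jane'],
    'Age': [22.0, 38.0, None],
})

# Separate the label from the features so it cannot leak into preprocessing
y_train = toy['Survived']
X = toy.drop(columns=['Survived'])

# Everything to the left of the first comma, lower-cased, is treated as the last name
X['LastName'] = X['Name'].str.lower().str.strip().str.split(',').str[0]

print(X['LastName'].tolist())
```

Lower-casing keeps differently capitalized spellings of the same family name in one group.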

# TODO:
- Fill unknown numeric features with `mean`
- Fill unknown object features with `most_frequent`
- Apply feature scaling to `Age`, `Fare`
- Apply Random Forest Classifier
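
One way the TODO items could fit together is sketched below. This is a sketch under stated assumptions, not the finished solution: the rows are made up, the names `X_toy`, `y_toy`, `preprocess`, and `model` are hypothetical, and the one-hot encoding step is an addition that the list does not mention (the forest needs numeric input).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration only; column names follow the Kaggle files
X_toy = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 2, 3],
    'Sex': ['male', 'female', 'female', 'female', 'male', 'male'],
    'Age': [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    'Fare': [7.25, 71.28, 7.92, 53.10, 13.00, np.nan],
    'Embarked': ['S', 'C', 'S', np.nan, 'S', 'Q'],
})
y_toy = pd.Series([0, 1, 1, 1, 0, 0], name='Survived')

scaled_cols = ['Age', 'Fare']        # mean-imputed, then standardized
numeric_cols = ['Pclass']            # mean-imputed only
object_cols = ['Sex', 'Embarked']    # most-frequent-imputed, then one-hot encoded

preprocess = ColumnTransformer(transformers=[
    ('scaled', Pipeline([
        ('impute', SimpleImputer(strategy='mean')),
        ('scale', StandardScaler()),
    ]), scaled_cols),
    ('numeric', SimpleImputer(strategy='mean'), numeric_cols),
    ('object', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        # Assumption: one-hot encode the categories so the forest gets numeric input
        ('encode', OneHotEncoder(handle_unknown='ignore')),
    ]), object_cols),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('forest', RandomForestClassifier(n_estimators=50, random_state=0)),
])

model.fit(X_toy, y_toy)
predictions = model.predict(X_toy)
print(predictions.shape)
```

Keeping the imputers and scaler inside the pipeline means their statistics are learned from the training data only and then reused unchanged on the test set.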

In [14]:
for c in X_imputed.columns:
    print(c, X_imputed[c].dtypes)

Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Fare float64
Cabin object
Embarked object
