Thoughts:
Age is potentially a strong predictor of survival, but many entries have a null age. We could see which columns correlate with age and fit a model to predict the missing ages.

In [None]:
import sys
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from dython.nominal import associations, identify_nominal_columns

sys.path.append("/home/andrew/PycharmProjects/PyTorch")
from src.kaggle_api import get_dataset
from src.estimator_comparison import test_estimators
from src.gen import train_test_from_null, get_xy_from_dataframe

# Whether to run intensive grid searches (True) or simple fits (False)
intensive = True

Load the dataset and show the breakdown of null values per column:

In [None]:
data_path = get_dataset("titanic")
raw_train_data = pd.read_csv(data_path / "train.csv")
raw_test_data = pd.read_csv(data_path / "test.csv")

print(raw_train_data.info())
print(raw_test_data.info())

Let's also see how many rows contain null values, and the breakdown of these per column
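
A minimal sketch of how these counts could be produced. The `toy` frame and its values are illustrative assumptions standing in for `raw_train_data` and `raw_test_data`; only standard pandas calls are used.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (values are illustrative only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number of null values in each column
nulls_per_column = toy.isnull().sum()

# Number of rows that contain at least one null value
rows_with_nulls = int(toy.isnull().any(axis=1).sum())

print(nulls_per_column)
print(rows_with_nulls)
```

The same two expressions should work unchanged on the real frames.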

Most of the missing values across both training and test datasets come from Cabin and Age. Let's combine the datasets and inspect further.

In [None]:
raw_comb_data = pd.concat([raw_train_data, raw_test_data], ignore_index=True)
print(raw_comb_data['Cabin'].value_counts())
print(raw_comb_data['Age'].value_counts())

Both factors may correlate to survivability, but with so many missing Cabin entries it makes sense to remove it for now.
NOTE: Strip Cabin to letter only and see if there's a correlation/connection between fare/class/cabin/ticket, it may be that lower class cabins are not recorded etc.
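
One way the idea in the NOTE above might be explored is sketched here. The `Deck` column name and the toy values are assumptions for illustration; nothing here has been run on the real data, so it says nothing about what the actual relationship looks like.

```python
import pandas as pd

# Toy stand-in (values are illustrative only)
toy = pd.DataFrame({
    "Cabin": ["C85", None, "E46", "B57 B59 B63", None],
    "Pclass": [1, 3, 1, 1, 3],
})

# Keep only the leading deck letter; missing cabins stay null
toy["Deck"] = toy["Cabin"].str.extract(r"^([A-Za-z])", expand=False)

# Share of missing Cabin entries per passenger class
missing_by_class = toy["Cabin"].isnull().groupby(toy["Pclass"]).mean()

print(toy["Deck"].tolist())
print(missing_by_class)
```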

Before removing the Cabin column, let's inspect some other columns:

In [None]:
print(raw_comb_data["Ticket"].value_counts())
raw_comb_data[["Ticket", "Name"]]

The Ticket column is very messy and contains duplicates, so it is unlikely that much can be obtained from it, and surely there is no correlation between name and survival!

But what about a correlation between name (or more specifically title) and age? This could be a useful predictor, let's try and extract titles using regex:

In [None]:
#print(list(comb_data["Name"]))
raw_comb_data["Title"] = raw_comb_data["Name"].str.extract(r",\s?(\w*).{1}")
raw_comb_data["Title"].value_counts()
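
To see what the regex above captures, here is a small sketch on made-up strings that follow the dataset's "Surname, Title. Given names" layout. The names are hypothetical. The pattern takes the first run of word characters after the comma, so a multi-word title keeps only its first word.

```python
import pandas as pd

# Hypothetical names in the same "Surname, Title. Given names" layout
names = pd.Series([
    "Smith, Mr. John",
    "Smith, Mrs. Jane",
    "Jones, Miss. Anna",
    "Jones, Master. Tom",
    "Doe, the Countess. of Somewhere",
])

# Same pattern as in the cell above; expand=False returns a Series
titles = names.str.extract(r",\s?(\w*).{1}", expand=False)

print(titles.tolist())
# → ['Mr', 'Mrs', 'Miss', 'Master', 'the']
```

Cases like the last one are among the outliers that need handling afterwards.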

That worked well! There are just a few outliers to work with. Some of these can be rectified easily by looking at the "Sex" column; for example, a male with the title Dr or Rev can be called "Mr" for our purposes. Others will require a little more thought:

In [None]:
comb_data = raw_comb_data.copy()
# Keep the titles extracted above; only the outliers are replaced below

replace_male = (comb_data["Sex"] == "male") & (~comb_data["Title"].isin(["Mr", "Master"]))
comb_data.loc[replace_male, "Title"] = "Mr"
comb_data.loc[replace_male & (comb_data["Age"] < 18), "Title"] = "Master"

replace_female = (comb_data["Sex"] == "female") & (~comb_data["Title"].isin(["Miss", "Mrs"]))
comb_data.loc[replace_female, "Title"] = "Miss"
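# Compare against 0 so the mask is boolean (a bitwise OR of the raw counts is not)
has_family = (comb_data["SibSp"] > 0) | (comb_data["Parch"] > 0)
comb_data.loc[replace_female & (comb_data["Age"] > 18) & has_family, "Title"] = "Mrs"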

print(comb_data["Title"].value_counts())
comb_data

Finally, there are a couple of null values left outside the Age column, so let's fill them with reasonable values.

TODO: This should be done separately for train and test, build a preprocessing pipeline and apply to both individually.

In [None]:
comb_data["Fare"] = comb_data["Fare"].fillna(comb_data["Fare"].mean())
# mode() returns a Series, so take its first (most frequent) value
comb_data["Embarked"] = comb_data["Embarked"].fillna(comb_data["Embarked"].mode()[0])

comb_data = comb_data.reset_index(drop=True)
comb_data
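
The TODO above could be approached as in the sketch below: the fill values are computed from the training rows only and then applied to both sets. The `train` and `test` frames and their values are toy assumptions, not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test frames (values are illustrative only)
train = pd.DataFrame({"Fare": [7.0, 9.0, np.nan], "Embarked": ["S", "S", None]})
test = pd.DataFrame({"Fare": [np.nan, 30.0], "Embarked": [None, "C"]})

# Learn the fill values from the training data only
fare_mean = train["Fare"].mean()
embarked_mode = train["Embarked"].mode()[0]
fill_values = {"Fare": fare_mean, "Embarked": embarked_mode}

# Apply the same learned values to each set individually
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(train_filled)
print(test_filled)
```

Computing the statistics on the training rows alone keeps information from the test set out of the preprocessing step.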

This is our baseline dataset for predicting both passenger age and survival. It currently spans both the training and test datasets, since we want to use as much data as possible to build the age model.

First, let's look at age, dropping unnecessary columns:
TODO: Explain why each is dropped

In [None]:
age_data = comb_data.copy()
age_data = age_data.drop(["PassengerId", "Cabin", "Ticket", "Name", "Survived", "Fare", "Embarked", "Sex"], axis=1)
age_data

We need to encode our Title column as numeric values, so let's do that first:
NOTE: Is this appropriate? pd.factorize assigns arbitrary integer codes, which imply an ordering that titles do not have. One-hot encode instead?

In [None]:
age_data["Title"] = pd.factorize(age_data["Title"])[0]
age_data

Now we can have a look at how our columns correlate:

In [None]:
age_data.corr(method='pearson')

It looks like Title does in fact have a high correlation to Age! Let's have a more visual look at this:

In [None]:
assoc_func = lambda data, nom_col: associations(
    data,
    nominal_columns=nom_col,
    numerical_columns=None,
    mark_columns=False,
    nom_nom_assoc="cramer",
    num_num_assoc="pearson",
    cramers_v_bias_correction=False,
    nan_strategy="drop_samples",
    ax=None,
    figsize=None,
    annot=True,
    fmt='.2f',
    cmap=None,
    sv_color='silver',
    cbar=True,
    vmax=1.0,
    vmin=None,
    plot=True,
    compute_only=False,
    clustering=False,
    title=None,
    filename=None
)

correl = assoc_func(age_data, "auto")

Let's try again, this time specifying the categorical columns explicitly. Note that the Sex column was already dropped from age_data above: it is fully determined by Title, which gives more information with respect to age.

In [None]:
cat_cols = identify_nominal_columns(age_data)
print(cat_cols)

nom_features = ["Pclass", "Title"]
assoc_func(age_data, nom_features)

In [None]:
age_target = "Age"
features = [c for c in age_data.columns if c != age_target]
print(features)

for f in features:
    g = sns.FacetGrid(age_data, col=f)
    g.map_dataframe(sns.histplot, x=age_target)

The data becomes very sparse with increasing SibSp and Parch, so let's combine the higher values into a single group:
NOTE: How about combining features overall?

In [None]:
age_data["SibSp"] = age_data["SibSp"].clip(upper=3)
age_data["Parch"] = age_data["Parch"].clip(upper=2)

for f in features:
    g = sns.FacetGrid(age_data, col=f)
    g.map_dataframe(sns.histplot, x=age_target)

scikit-learn estimators generally expect numeric inputs, so categorical data could be converted to one-hot vectors. For k categories we would return k-1 columns, since a row of all zeros then points to the baseline category.

In [None]:
# age_data_numerical = pd.get_dummies(age_data, columns=["Title"], drop_first=True)
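
A short sketch of what the commented-out call would produce, on a toy frame whose values are assumptions. With `drop_first=True`, the first category in sorted order ("Master" here) becomes the baseline and is represented by all zeros.

```python
import pandas as pd

# Toy stand-in with string titles (values are illustrative only)
toy = pd.DataFrame({
    "Title": ["Mr", "Mrs", "Miss", "Master", "Mr"],
    "Pclass": [3, 1, 3, 3, 2],
})

# k = 4 titles become k - 1 = 3 indicator columns
encoded = pd.get_dummies(toy, columns=["Title"], drop_first=True)
title_cols = [c for c in encoded.columns if c.startswith("Title_")]

print(encoded.columns.tolist())
# → ['Pclass', 'Title_Miss', 'Title_Mr', 'Title_Mrs']
```

Note that this assumes Title still holds strings; after `pd.factorize` the column holds integer codes, and the dummy columns would be named after those codes instead.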

We don't need age to be predicted precisely to the year. Instead, we could simplify our model by turning our current continuous age regression problem into an age band classification problem.

To do this, we have to band or "bin" our existing age data. We do not want to define these bands arbitrarily, but a reasonable starting point is to band them by frequency, so that each band holds roughly the same number of passengers:

In [None]:
bins = 5
# age_group_labels = [f"Group{i}" for i in range(bins)]
age_data["Age"], bin_bounds = pd.qcut(age_data["Age"], q=bins, precision=0, labels=False, retbins=True)

# One (lower, upper) pair per bin: n + 1 boundaries define n bins
age_bins = {i: bin_bounds[i:i+2] for i in range(len(bin_bounds) - 1)}
print(age_bins)
age_data["Age"]
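
To make the frequency-based banding concrete, here is a minimal sketch on ten made-up ages plus one missing value. The ages are assumptions for illustration. `pd.qcut` places the boundaries at quantiles, so each band receives the same number of values here, and the missing age stays missing.

```python
import numpy as np
import pandas as pd

# Ten illustrative ages and one missing value
ages = pd.Series([2.0, 9.0, 18.0, 21.0, 24.0, 28.0, 31.0, 36.0, 45.0, 60.0, np.nan])

# labels=False returns integer band codes; retbins=True also returns the boundaries
codes, bounds = pd.qcut(ages, q=5, labels=False, retbins=True)

# Map each band code to its (lower, upper) boundary pair
band_lookup = {i: (bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)}

print(codes.value_counts().sort_index())
print(band_lookup)
```

Five bands need six boundaries, which is why the lookup is built from consecutive pairs.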

Finally, let's split the age dataset into train and test based on which rows do not have age specified. Then we can start making predictions!

In [None]:
train_age_data, age_test_data = train_test_from_null(age_data, age_target)
X_age, y_age = get_xy_from_dataframe(train_age_data, age_target)
age_test_data, _ = get_xy_from_dataframe(age_test_data, age_target)
X_age
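
The helpers `train_test_from_null` and `get_xy_from_dataframe` come from the local `src.gen` module, whose code is not shown here. The sketch below is only a guess at equivalent behaviour, using the hypothetical names `split_on_null` and `split_xy` so as not to suggest it is the actual implementation.

```python
import numpy as np
import pandas as pd


def split_on_null(df, target):
    """Hypothetical helper: rows with a known target vs rows where it is null."""
    known = df[df[target].notnull()]
    unknown = df[df[target].isnull()]
    return known, unknown


def split_xy(df, target):
    """Hypothetical helper: separate the feature columns from the target column."""
    return df.drop(columns=[target]), df[target]


# Toy stand-in for age_data (values are illustrative only)
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Title": [0, 1, 2, 0],
    "Age": [3.0, np.nan, 1.0, np.nan],
})

known, unknown = split_on_null(toy, "Age")
X_known, y_known = split_xy(known, "Age")
X_unknown, _ = split_xy(unknown, "Age")

print(X_known)
print(X_unknown)
```

Because the original index is preserved, predictions for the unknown rows can later be placed back at the right positions.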

In [None]:
candidate_models = [
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    ExtraTreesClassifier(),
    GradientBoostingClassifier(),
    AdaBoostClassifier()
]

test_estimators(X_age, y_age, models=candidate_models, type_filter="classifier")

In [None]:
"""
clf = DecisionTreeClassifier()

hyperparams = {
    "criterion": ['gini', 'entropy'],
    "max_depth": range(2, 16),
    "min_samples_split": range(2, 10),
    "min_samples_leaf": range(1, 5)
}
"""

chosen_clf = GradientBoostingClassifier(loss="log_loss", criterion="friedman_mse", n_estimators=50)

age_clf = clone(chosen_clf)

hyperparams = {
    "learning_rate": [0.01, 0.025, 0.05, 0.075, 0.1],
    "min_samples_split": np.linspace(0.1, 0.5, 4),
    "min_samples_leaf": np.linspace(0.1, 0.5, 4),
    "max_depth": [5, 8],
    "subsample":[0.6, 0.8, 0.95, 1.0],
}

if not intensive:
    age_clf.fit(X_age, y_age)
    age_pred = age_clf.predict(age_test_data)
    disp = age_clf
else:
    age_cv = GridSearchCV(age_clf, param_grid=hyperparams, cv=10, n_jobs=-1, verbose=2)
    age_cv.fit(X_age, y_age)
    print("model score: %.3f" % age_cv.best_score_)
    age_pred = age_cv.predict(age_test_data)
    disp = age_cv

disp

In [None]:
age_test_data = age_test_data.assign(Age=age_pred)

all_age_data = pd.concat([train_age_data, age_test_data]).sort_index()
all_age_data.info()

Great! We have our age predictions, now let's go back to our baseline dataset and make another copy for our survival prediction:

Let's fill the Age column, clip the SibSp and Parch columns again, and re-encode the Title and Embarked columns:

In [None]:
survive_data = comb_data.copy()
survive_data = survive_data.drop(["PassengerId", "Cabin", "Ticket", "Name", "Sex"], axis=1)
survive_data["Title"] = pd.factorize(survive_data["Title"])[0]
survive_data["Embarked"] = pd.factorize(survive_data["Embarked"])[0]
survive_data["SibSp"] = survive_data["SibSp"].clip(upper=3)
survive_data["Parch"] = survive_data["Parch"].clip(upper=2)
survive_data["Age"] = all_age_data["Age"]
survive_data

In [None]:
target = "Survived"
features = [c for c in survive_data.columns if c != target]

train_data, test_data = train_test_from_null(survive_data, target)

X_train, y_train = get_xy_from_dataframe(train_data, target)
X_test, _ = get_xy_from_dataframe(test_data, target)
X_train.info()
X_test.info()

In [None]:
test_estimators(X_train, y_train, models=candidate_models, type_filter="classifier")

In [None]:
# Use the same estimator
clf = clone(chosen_clf)

if not intensive:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    disp = clf
else:
    cv = GridSearchCV(clf, param_grid=hyperparams, cv=10, n_jobs=-1, verbose=2)
    cv.fit(X_train, y_train)
    print("model score: %.3f" % cv.best_score_)
    y_pred = cv.predict(X_test)
    disp = cv

test_data = test_data.copy()
test_data[target] = y_pred
test_data

Last but not least, we must save our prediction in CSV format, providing only the passenger ID and binary Survived columns:

In [None]:
# Get correct format
test_data.index += 1
test_data[target] = test_data[target].astype(int)

# Write out
test_data.to_csv(data_path / "initial_prediction.csv", columns=[target], index=True, index_label="PassengerId")
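
The index shift above works because the combined frame has a zero-based index, while PassengerId starts at 1. The sketch below reproduces the same steps on two made-up predictions written to an in-memory buffer. The index values 891 and 892 are assumed to be the positions of the first two test rows after the 891 training rows.

```python
import io

import pandas as pd

# Two illustrative predictions at zero-based positions 891 and 892
preds = pd.DataFrame({"Survived": [0.0, 1.0]}, index=[891, 892])

# Shift to one-based PassengerId and cast the float predictions to int
preds.index += 1
preds["Survived"] = preds["Survived"].astype(int)

# Write to a buffer instead of a file so the result can be inspected
buffer = io.StringIO()
preds.to_csv(buffer, columns=["Survived"], index=True, index_label="PassengerId")
csv_lines = buffer.getvalue().splitlines()

print(csv_lines)
# → ['PassengerId,Survived', '892,0', '893,1']
```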

Great! We can now upload this to Kaggle and see how our prediction scores.

Let's review the steps we took here:
1. Inspected the data to understand columns that should be dropped, filled, or reduced.
2. Combined our train and test datasets and copied the result for Age predictions.
3. Created an additional column Title from Name data, using the Age and Sex columns to replace outliers.
4. Filled the small number of missing values in the Embarked and Fare columns.
5. Dropped unnecessary columns from the Age dataset.
6. Encoded string data from Title column.
7. Clipped sparse data from SibSp and Parch columns.
8. Binned Age values since a continuous age range is too granular with respect to survivability.
9. Predicted missing Age rows, casting them back to the original dataset.
10. Carried out a similar process to calculate Survived data.

Some notes on this method:
- Is the Title column the best we can do? We have a Title for young men (Master), but Miss covers women of all ages! Furthermore, surely being a Miss or Mrs does not determine your fate!
- Combining the train and test sets to predict the Age column is not best practice; this should be done on both sets of data independently.
- We went through each data wrangling/feature engineering step individually. Can we automate this?
- We chose Age bins based on frequency. How do we know this is the best approach?
- Embarked and ... columns are not ordinal features. For decision trees it does not matter, but for numerical estimators they should be one-hot encoded instead.

So, how do we move forward? What we would like is something that can carry out all the necessary steps to prepare both our training and testing data, with sufficient modularity to allow these steps to be replaced or updated as we discover more meaningful ways to represent the data. This is where pipelines come in. Let's build one using scikit-learn in our next notebook!
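
As a preview, here is a minimal sketch of what such a pipeline might look like. The column choices, toy values, and step names are assumptions for illustration, and the next notebook may well take a different approach. The point is that imputation and encoding are learned during `fit` and then reapplied automatically at `predict` time.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumed column groups for this sketch
numeric_cols = ["Fare", "SibSp"]
categorical_cols = ["Embarked"]

# Each step can be swapped out without touching the others
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", GradientBoostingClassifier(n_estimators=10, random_state=0)),
])

# Toy training data (values are illustrative only)
X_toy = pd.DataFrame({
    "Fare": [7.3, 71.3, np.nan, 53.1, 8.1, 8.5],
    "SibSp": [1, 1, 0, 1, 0, 0],
    "Embarked": ["S", "C", "S", np.nan, "S", "Q"],
})
y_toy = [0, 1, 1, 1, 0, 0]

model.fit(X_toy, y_toy)

# The fill values and encoding learned from the training rows are reused here
X_new = pd.DataFrame({"Fare": [np.nan], "SibSp": [0], "Embarked": ["S"]})
pred = model.predict(X_new)
print(pred)
```

Because the preprocessing lives inside the estimator, it can also be passed to `GridSearchCV` as a single object, so each cross-validation fold fits its own imputer and encoder.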