Thoughts:
Age is potentially a strong predictor of survival, but many entries have a null age. We could check which columns correlate with age, then fit a model to predict the missing ages.

In [None]:
import sys
import numpy as np
import pandas as pd
from dython.nominal import associations, identify_nominal_columns

sys.path.append("/home/andrew/PycharmProjects/PyTorch")
from src.kaggle_api import get_dataset

Load in dataset and show info

In [None]:
data_path = get_dataset("titanic")
train_data = pd.read_csv(data_path / "train.csv")
test_data = pd.read_csv(data_path / "test.csv")
# ignore_index avoids duplicate row labels in the combined frame
comb_data = pd.concat([train_data, test_data], axis=0, ignore_index=True)

# DataFrame.info() prints its summary itself and returns None, so no print() is needed
train_data.info()
test_data.info()

First, let's remove any columns that should have no impact on survivability

In [None]:
train_data.drop(["PassengerId", "Ticket"], axis=1, inplace=True)
comb_data.drop(["PassengerId", "Ticket"], axis=1, inplace=True)

Next, let's see how many rows contain null values, and the breakdown of these per column

In [None]:
tot_null = train_data.isna().any(axis=1).sum()
col_null = train_data.isna().sum()

print(f"Total number of null rows is {tot_null}")
print(f"Breakdown per column is: \n{col_null}")

Age may be an important factor, so we will come back to its missing values later. First, let's look at the Cabin column.

In [None]:
comb_data['Cabin'].value_counts()

This does not look too helpful with so many null values, so let's drop it.

In [None]:
train_data.drop(["Cabin"], axis=1, inplace=True)
comb_data.drop(["Cabin"], axis=1, inplace=True)

In [None]:
train_data[train_data.isna().any(axis=1)]

Let's get straight to the point: which factors correlate with survival?
This may not be an apples-to-apples comparison, since some columns are categorical and Pearson correlation only applies to numeric ones.

In [None]:
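# numeric_only=True skips the text columns (Name, Sex, Embarked), which recent pandas versions refuse to correlate
train_data.corr(method="pearson", numeric_only=True)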

Let's look specifically at age data. Dropping rows with null ages would remove a lot of data, so instead we want to get some insights and try to fill the age column.

In [None]:
# Names look like "Surname, Title. Given names"; capture the first word after the comma as the title
comb_data["Title"] = comb_data["Name"].str.extract(r",\s?(\w*).{1}")
comb_data.drop("Name", axis=1, inplace=True)
comb_data["Title"].value_counts()
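
To make the extraction pattern concrete, here is a minimal sketch that applies the same regular expression with the standard re module to a few names in the dataset's "Surname, Title. Given names" format. The helper name extract_title is hypothetical and used only for illustration; like Series.str.extract, it keeps the first match in each string.

```python
import re

# Same pattern as in the cell above: a comma, optional whitespace,
# a run of word characters (captured), then any single character.
TITLE_PATTERN = re.compile(r",\s?(\w*).{1}")


def extract_title(name):
    """Return the first word after the comma, or None if nothing matches."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None


# Sample names in the dataset's format (illustrative selection)
sample_names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]

titles = [extract_title(name) for name in sample_names]
print(titles)
```

A multi-word title such as "the Countess" is cut down to its first word, which is one reason the rarer titles need cleaning up afterwards.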

Collapse the rarer titles into the four common ones (Mr, Master, Miss, Mrs), using sex, age and family columns as a guide

In [None]:
replace_male = (comb_data["Sex"] == "male") & (~comb_data["Title"].isin(["Mr", "Master"]))
comb_data.loc[replace_male, "Title"] = "Mr"
comb_data.loc[replace_male & (comb_data["Age"] < 18), "Title"] = "Master"

replace_female = (comb_data["Sex"] == "female") & (~comb_data["Title"].isin(["Miss", "Mrs"]))
comb_data.loc[replace_female, "Title"] = "Miss"
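# Compare with 0 explicitly so the mask is boolean rather than a bitwise OR of integer counts
has_family = (comb_data["SibSp"] > 0) | (comb_data["Parch"] > 0)
comb_data.loc[replace_female & (comb_data["Age"] > 18) & has_family, "Title"] = "Mrs"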

# Only the last expression in a cell is displayed, so print the counts explicitly
print(comb_data["Title"].value_counts())
comb_data

In [None]:
# Keep only the training columns that still exist in comb_data (this drops Name)
train_data = train_data.filter(comb_data.columns)
cat_cols = identify_nominal_columns(train_data)
print(cat_cols)

In [None]:
def assoc_func(data, nom_col):
    # Thin wrapper around dython's associations() with the options used in this notebook
    return associations(
        data,
        nominal_columns=nom_col,
        numerical_columns=None,
        mark_columns=False,
        nom_nom_assoc="cramer",
        num_num_assoc="pearson",
        cramers_v_bias_correction=False,
        nan_strategy="drop_samples",
        ax=None,
        figsize=None,
        annot=True,
        fmt=".2f",
        cmap=None,
        sv_color="silver",
        cbar=True,
        vmax=1.0,
        vmin=None,
        plot=True,
        compute_only=False,
        clustering=False,
        title=None,
        filename=None,
    )

correl = assoc_func(comb_data, "auto")
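
The nom_nom_assoc="cramer" option above refers to Cramér's V, a measure of association between two categorical columns that ranges from 0 to 1. As a rough sketch of what that statistic computes, the following pure-Python version implements the textbook formula without bias correction. It is an illustration under that assumption, not dython's actual implementation; the function name cramers_v and the toy values are made up for the example.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V (no bias correction) for two equal-length categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # observed count for each (x, y) pair
    row_totals = Counter(xs)
    col_totals = Counter(ys)

    # Pearson chi-squared statistic over every cell of the contingency table
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    if k == 0:
        return 0.0  # a constant column carries no association
    return math.sqrt(chi2 / (n * k))


# Toy data: each title implies exactly one sex
sex = ["male", "male", "female", "female"]
title = ["Mr", "Master", "Miss", "Mrs"]
v_determined = cramers_v(sex, title)

# Toy data: the two columns are independent of each other
a = ["x", "x", "y", "y"]
b = ["p", "q", "p", "q"]
v_independent = cramers_v(a, b)

print(v_determined, v_independent)
```

When one column fully determines the other, as Title does for Sex in the toy data, the statistic reaches its maximum of 1.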

Let's try again, this time specifying the categorical columns explicitly. We can also drop the Sex column now, since it is fully determined by Title, and Title carries more information about age.

NOTE: we may get less noise if we use Sex instead of Title for the survival model.

In [None]:
train_data.drop("Sex", axis=1, inplace=True)
comb_data.drop("Sex", axis=1, inplace=True)
# Sex has just been dropped, so it must not be listed as a nominal column
nom_col = ["Survived", "Pclass", "Embarked", "Title"]
correl = assoc_func(comb_data, nom_col)
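
As a possible next step toward the age-filling idea from the opening notes, here is a minimal sketch that fills missing ages with the median age of each Title group. This is a simple baseline standing in for a fitted model, not a method taken from this notebook; the helper fill_age_by_group and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as comb_data (values are made up)
toy = pd.DataFrame(
    {
        "Title": ["Mr", "Mr", "Mr", "Master", "Master", "Miss", "Miss", "Mrs"],
        "Age": [30.0, 40.0, np.nan, 4.0, np.nan, 20.0, np.nan, 35.0],
    }
)


def fill_age_by_group(df, group_col="Title", age_col="Age"):
    """Return a copy of df with missing ages replaced by the group median."""
    out = df.copy()
    # Median age per group, broadcast back to one value per row
    group_median = out.groupby(group_col)[age_col].transform("median")
    out[age_col] = out[age_col].fillna(group_median)
    # Fall back to the overall median for groups with no known age at all
    out[age_col] = out[age_col].fillna(out[age_col].median())
    return out


filled = fill_age_by_group(toy)
print(filled)
```

The same call could be applied to comb_data once the Title column has been cleaned. A regression model on the columns most associated with Age would be the natural refinement.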