# Introduction

The authors are very new to machine learning and artificial intelligence and have, on average, less than two years of Python experience, so this notebook may be helpful for newcomers like them. Compared to other good guides, the notebook is disorganised and confusing. We think this reflects a difficulty that people at a similar level may experience, so we left the content as unrefined as possible.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

# Data analysis
import numpy as np  # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import random as rnd

# Visualisation
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Common Model Helpers
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# ML Algorithms
from sklearn import ensemble, gaussian_process, linear_model, naive_bayes, neighbors, svm, tree, discriminant_analysis

# Performance measurement
from time import perf_counter

# Load data for training
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")       # Raw data for training
validation_data = pd.read_csv("/kaggle/input/titanic/test.csv")   # Test data for validation

TF_RANDOM_SEED = 1234

def format_percentile(value, total):
    return float(f"{(value / total) * 100:.2f}")

# Data preprocessing

Before training the model, we preprocess the data to increase efficiency and obtain more accurate results. In particular, checking for inaccurate or missing data is very important. In this example there are 12 columns, and it would be unpleasant to watch our model fail to predict because values are missing in columns that seem non-vital to the decision. Therefore, we analyse the data first and find as many correlations as possible.

## Type definitions at a glance

The data in each column can be summarised as:

| Column name   | Meaning                         | Data series                      | Semantics                    | Categorised as               |
|---------------|---------------------------------|----------------------------------|------------------------------|------------------------------|
| `PassengerId` | Passenger identifier            | \[`0`, `1`, ... `Infinity`]      | n/a                          | Non-negative Integer         |
| `Survived`    | Survival                        | \[`0`, `1`]                      | `0`: False, `1`: True        | Binary                       |
| `Pclass`      | Passenger class                 | \[`1`, `2`, `3`]                 | `1`: 1st, `2`: 2nd, `3`: 3rd | Enum                         |
| `Name`        | Passenger name                  | n/a                              | n/a                          | String                       |
| `Sex`         | Passenger gender                | \[`male`, `female` ]             | ...                          | Enum                         |
| `Age`         | Passenger age                   | > `0`, may be fractional         | n/a                          | Positive rational Number     |
| `SibSp`       | No. of Siblings/Spouses Aboard  | \[`0`, `1`, ... `Infinity`]      | n/a                          | Non-negative Integer         |
| `Parch`       | No. of Parents/Children Aboard  | \[`0`, `1`, ... `Infinity`]      | n/a                          | Non-negative Integer         |
| `Ticket`      | Ticket Number                   | n/a                              | n/a                          | String                       |
| `Fare`        | Passenger fare                  | \[`0.0`, `0.1`, ... `Infinity` ] | n/a                          | Non-negative rational Number |
| `Cabin`       | Passenger cabin                 | n/a                              | n/a                          | String                       |
| `Embarked`    | Port of Embarkation             | \[`C`, `Q`, `S` ]                | Capital letter of city name  | Enum                         |

## Data exploration

Since a first glance shows missing values in the training data, and they make the model difficult to train, we have to find them first.

In [None]:
def find_missing_columns(data_frame, columns):
    # Find records with missing data(all records that contains "NaN" in any column)
    records_with_nan = data_frame[data_frame.isnull().any(axis=1)]
    total_count = len(data_frame)
    
    # Iterate over each column to find and display missing data
    for column in columns:
        missing_records = data_frame[data_frame[column].isnull()]
        missing_count = len(missing_records)
        
        if missing_count > 0:
            missing_rate = format_percentile(missing_count, total_count)
            print(f"Records that \"{column}\" column is missing: {missing_count} of total {total_count} ({missing_rate}%)")
            # passenger_ids = ", ".join(map(str, missing_records["PassengerId"].to_list()))
            # print(f"PassengerIds: {passenger_ids}")

print("In training data:")
find_missing_columns(train_data, ["Ticket", "Name", "PassengerId", "Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Cabin", "Embarked"])

print()

print("In validation data:")
find_missing_columns(validation_data, ["Ticket", "Name", "PassengerId", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Cabin", "Embarked"])

We didn't expect the test data to be incomplete as well. This means real input may have missing values in any column, so we must work out a completion strategy before training.
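
The per-column loop above can also be written with vectorised pandas calls. The following is a minimal sketch on a small made-up frame (`toy`, its values and the helper name `summarise_missing` are hypothetical, not taken from the Titanic files): `isnull().sum()` counts missing cells per column and `isnull().mean()` gives the missing ratio.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv / test.csv
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})

def summarise_missing(data_frame):
    """Return one row per column that has at least one missing value."""
    summary = pd.DataFrame({
        "missing_count": data_frame.isnull().sum(),
        "missing_rate_percent": (data_frame.isnull().mean() * 100).round(2),
    })
    return summary[summary["missing_count"] > 0]

missing_summary = summarise_missing(toy)
print(missing_summary)
```

Returning a DataFrame instead of printing makes it easy to compare the training and test files side by side.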

## Data completion

Based on these results, we look for values that can be interpolated or removed, to make the data cleaner and, we expect, the model more effective.

### Optimisation

Let's look at some data that seems safe to remove.

- `PassengerId` only distinguishes records, so it does not affect the survival of passengers. However, it is useful for identifying records during analysis, so we remove it just before training.
- It seems that no one boarded without a `Ticket`. Also, the passenger information that could be inferred from the `Ticket`, such as name, cabin class, fare, and family status, is already well recorded in other columns, so it seems safe to ignore it.

What about handling ambiguous data?

- `Cabin` indicates the location of the cabin, and intuitively it affects the survival rate, so it should be very important information. However, too many records lack `Cabin`. Moreover, while `Age`, `Embarked`, and `Fare` can be expressed as numbers, `Cabin` is difficult to interpolate properly without also obtaining the Titanic's blueprint. Therefore, we boldly remove `Cabin` from the training data. It will also be ignored at inference time.

In [None]:
train_data = train_data.drop(columns=["Ticket", "Cabin"])  # PassengerId is subject to removal just before training

### Interpolation



`Age` is a positive number, `Embarked` is an enum, and `Fare` is a rational number. All three can be expressed numerically, so we can apply interpolation methods to them using statistical tools such as the mean, standard deviation, and variance. However, we have not yet decided on an interpolation method, so let's find more correlations first and then choose a proper strategy.

In [None]:
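# Create the figure first, then adjust the spacing of that figure.
# wspace/hspace are fractions of the average axes size, so small values are enough.
plt.figure(figsize=[24, 10])
plt.subplots_adjust(wspace=0.3, hspace=0.4)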

# Draws a histogram of a column that contains discrete values such as integers or binaries.
# Complex data like nominal values (enums in this case) cannot be plotted directly.
def draw_survival_histogram(subplot_coordinates, data, column, x_axis_ticks=None, x_axis_labels=None):
    plt.subplot(*subplot_coordinates)
    plt.hist(x = [data[data["Survived"]==1][column], data[data["Survived"]==0][column]], 
             stacked=True,
             color = ['g','r'],
             label = ["Survived","Dead"])
    plt.title(f"{column} Histogram by Survival")
    plt.xlabel(column)
    plt.ylabel("# of Passengers")
    if x_axis_ticks is not None and x_axis_labels is not None:
        plt.xticks(x_axis_ticks, x_axis_labels)
    elif x_axis_ticks is not None:
        plt.xticks(x_axis_ticks)
    plt.legend()
    
draw_survival_histogram([2, 3, 1], train_data, "Sex",)
draw_survival_histogram([2, 3, 2], train_data, "Age",)
draw_survival_histogram([2, 3, 3], train_data, "Pclass", x_axis_ticks = [1, 2, 3])
draw_survival_histogram([2, 3, 4], train_data, "Fare")

# Nominal values with more than 2 variations cannot be plotted, so we have to re-encode them first.
train_data["Embarked_Code"] = train_data["Embarked"].map(lambda it: {"C": 1, "Q": 2, "S": 3}.get(it, 4))
draw_survival_histogram([2, 3, 5], train_data, "Embarked_Code", x_axis_ticks = [1, 2, 3, 4], x_axis_labels = ["C", "Q", "S", "Unknown"])
train_data = train_data.drop(columns=["Embarked_Code"])  # Only required for plotting, not necessary afterwards

Looking at the histograms above, we can conclude that a record with these characteristics:

- Female (`Sex` = female)
- High fare (`Fare` > 50.0)
- Higher passenger class (`Pclass` = 1)
- Embarked at C (`Embarked` = C)
- Young (`Age` < 10)

tends to have a high probability of survival. In other words, "a female passenger who embarked at C, is younger than 10, has high social status and could afford a higher fare is more likely to survive".

For missing values that can be expressed numerically, such as `Age` and `Fare`, it seems better to use the median rather than the arithmetic mean. The following graph shows the arithmetic mean and median of each column by survival status, and the median shows less deviation. In statistics the median is a common choice when robust results are wanted, so selecting the median seems to be a good strategy.

In [None]:
plt.figure(figsize=[24,4])

def draw_mean_and_median_by_survived(subplot_coordinates, data, column):
    # Use the `data` argument rather than the global `train_data`
    summary_table = data.groupby("Survived")[column].agg(["mean", "median"])

    # Plot as a grouped bar chart (mean and median side by side)
    labels = ["Not Survived (0)", "Survived (1)"]
    mean_values = summary_table["mean"]
    median_values = summary_table["median"]

    # Create the bar positions
    x = range(len(labels))  # Positions for each group
    bar_width = 0.35  # Width of each bar

    plt.subplot(*subplot_coordinates)
    plt.bar([pos - bar_width / 2 for pos in x], mean_values, width=bar_width, label="Mean", color="skyblue")
    plt.bar([pos + bar_width / 2 for pos in x], median_values, width=bar_width, label="Median", color="orange")
    
    # Add labels and title
    plt.xticks(x, labels)  # Set x-axis labels
    plt.xlabel("Survival Status")
    plt.ylabel(column)
    plt.title(f"{column} mean and median")
    plt.legend()

draw_mean_and_median_by_survived([1, 2, 1], train_data, "Age")
draw_mean_and_median_by_survived([1, 2, 2], train_data, "Fare")
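
To see why the median is considered more robust, here is a small sketch with made-up fares (the numbers are hypothetical and not taken from the data set): a single very expensive ticket moves the mean a lot but barely changes the median.

```python
import pandas as pd

# Hypothetical fares: mostly cheap tickets
fares = pd.Series([7.0, 8.0, 8.0, 10.0, 13.0])
# Same data plus one extreme value
fares_with_outlier = pd.concat([fares, pd.Series([500.0])], ignore_index=True)

# How far each statistic moves when the outlier is added
mean_shift = fares_with_outlier.mean() - fares.mean()
median_shift = fares_with_outlier.median() - fares.median()

print(f"mean:   {fares.mean():.2f} -> {fares_with_outlier.mean():.2f}")
print(f"median: {fares.median():.2f} -> {fares_with_outlier.median():.2f}")
```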

Thus we fill missing `Age` and `Fare` values with their medians as follows.

In [None]:
def fill_null_by_median(data, column):
    data[column] = data[column].fillna(data[column].median())

fill_null_by_median(train_data, "Age")
fill_null_by_median(train_data, "Fare")

However, `Embarked` is a nominal value, so taking its median is meaningless. Therefore, we take a heuristic approach and copy the value from another record that has the same `Survived`, `Sex` and `Pclass` values and the closest `Age`.

In [None]:
def fill_missing_embarked(row, train_data):
    # If Embarked is already present, return it
    if pd.notna(row['Embarked']):
        return row['Embarked']
    
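    # Step 1: Only consider records with Embarked values, and filter by the same
    # Survived value when that column exists (the test file has no Survived column)
    filtered_data = train_data[train_data['Embarked'].notna()]
    if 'Survived' in row.index:
        filtered_data = filtered_data[filtered_data['Survived'] == row['Survived']]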
    
    # Step 2: Filter by same Sex value
    filtered_data = filtered_data[filtered_data['Sex'] == row['Sex']]
    
    # Step 3: Filter by same Pclass value
    filtered_data = filtered_data[filtered_data['Pclass'] == row['Pclass']]
    
    # If no matches found after filtering, return most common Embarked value
    if filtered_data.empty:
        return train_data['Embarked'].mode()[0]
    
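    # Step 4: Sort by Age difference if Age is available
    # (assign() returns a new frame, which avoids writing into a slice of train_data)
    if pd.notna(row['Age']):
        filtered_data = filtered_data.assign(AgeDiff=(filtered_data['Age'] - row['Age']).abs())
        filtered_data = filtered_data.sort_values('AgeDiff')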
    
    # Step 5: Return the Embarked value of the first matching record
    return filtered_data.iloc[0]['Embarked']

train_data['Embarked'] = train_data.apply(lambda row: fill_missing_embarked(row, train_data), axis=1)
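
A simpler alternative to the nearest-record heuristic, shown here only as a sketch and not as the method used in this notebook, is to fill each missing `Embarked` with the most frequent port among passengers of the same `Pclass` and `Sex`. The toy frame and the helper name `fill_embarked_by_group_mode` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing port
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3],
    "Sex":      ["female", "female", "female", "male", "male", "male"],
    "Embarked": ["C", "C", np.nan, "S", "Q", "S"],
})

def fill_embarked_by_group_mode(data):
    # Fallback when a whole group has no known port
    overall_mode = data["Embarked"].mode().iloc[0]

    def group_mode(values):
        modes = values.mode()  # mode() ignores NaN by default
        return modes.iloc[0] if not modes.empty else overall_mode

    # transform() broadcasts each group's mode back to the rows of that group
    per_group = data.groupby(["Pclass", "Sex"])["Embarked"].transform(group_mode)
    return data["Embarked"].fillna(per_group)

filled = fill_embarked_by_group_mode(toy)
print(filled.tolist())
```

This version ignores `Age` and `Survived`, so it also works unchanged on the test file, at the cost of being coarser than the row-by-row search.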

Then we confirm that no record contains a `NaN`.

In [None]:
train_data.info()

## Data conversion

Since most machine learning algorithms work best with numerical values, and to reduce the risk of overfitting, it is better to do some conversions.

### Feature aggregation (Feature composition)

In some cases, creating new information by combining multiple similar features might capture complex patterns that individual features miss, and decrease model complexity. In this problem, we can derive a new feature named `FamilySize` from `SibSp` and `Parch`. After aggregation, we can remove `SibSp` and `Parch` for simplicity.

In [None]:
def fill_family_size(data):
    # +1 for the passenger him/herself (vectorised column arithmetic instead of a row-wise apply)
    data["FamilySize"] = data["SibSp"] + data["Parch"] + 1
    return data.drop(['Parch', 'SibSp'], axis=1)

train_data = fill_family_size(train_data)

train_data.info()

### One-hot encoding (Dummy variable method)

Before training, we have to convert nominal values (`enum`s) into numbers. However, we must not map those nominal values directly to single numerical values; they must be represented in a different way. This is called the **dummy variable** or **one-hot encoding** method. Without it, there would be serious disadvantages such as:

- Most ML algorithms will simply fail to run on raw nominal values because they require numerical input.
- Reduced model performance because:
  * A decision tree algorithm may split incorrectly on numerically encoded categories.
  * A neural network may incorrectly weigh the importance of different categories.
- Model misinterpretation such as:
  * female(1) is greater than male(0)
  * With red(1), green(2), blue(3), the difference between blue and red equals green.

Therefore we should encode nominal values as dummy variables, as in the following table:

| Original   | One-hot encoded fields                   |
|------------|------------------------------------------|
| `Sex`      | `Sex_female`, `Sex_male`                 |
| `Embarked` | `Embarked_C`, `Embarked_Q`, `Embarked_S` |


In [None]:
def one_hot_encode(data, columns, prefixes):
    return pd.get_dummies(data, columns=columns, prefix=prefixes, dtype="int64")

train_data = one_hot_encode(train_data, columns=["Sex", "Embarked"], prefixes=["Sex", "Embarked"])

train_data.info()

### Binning (Bucketing)

We also consider binning our data. **Binning** (or **bucketing**) is a technique that transforms continuous numerical values into discrete categories or groups. For example, `Fare` could be binned as:

- 0-100 Low fare
- 100-200 Medium fare
- 200-300 High fare
- 300+ Very high fare

#### Why?

1. Simplifies non-linear relationships and helps capture broader patterns that might be missed in raw data. For example, a model trained on binned data treats fares of 8.5 and 8.6 as the same value when they fall into the same bin. Without binning, the model treats them as different values, which could degrade performance.
2. Reduces model complexity, since a continuous range of numbers is reduced to a finite set of categories.
3. Reduces the effect of noise and outliers. Extreme values can skew statistical analysis; binning places them in a category, which prevents individual extreme values from dominating the model.
4. Can highlight meaningful thresholds or group data in ways that help the model learn better patterns.

#### Advantages

1. Model simplicity.
    - Simplifies the feature space by reducing continuous data into categories.
    - Helps less flexible algorithms, like linear regression, perform better by reducing noise.

2. Improved robustness & Noise reduction.
    - Mitigates the impact of outliers, as extreme values are grouped into a bin rather than being treated individually.

3. Pattern discovery & Better interpretability.
    - Humans can more easily understand "age groups" or "fare ranges" compared to raw continuous values.

#### Disadvantages

1. Loss of Granular Information.
    - Binning reduces the granularity of the data, potentially discarding important information about the relationship between the feature and the target. For example, "Fare = $10.01" and "Fare = $19.99" might be placed in the same bin, even though they could have different effects on the target (`Survived` in this case).

2. Biased model.
    - The choice of the number of bins and their ranges can introduce bias. For instance, is splitting "Age" into 5 bins the best choice? What about 10?

3. Loss of Distribution Detail.
    - Continuous variables often have rich distributions that may carry meaningful patterns. Binning simplifies these patterns, potentially leading to suboptimal results.

#### In action

- Use Binning if:
    * Using algorithms that perform better on categorical or ordinal data
    * To reduce the impact of outliers
    * To get better data interpretability

- Avoid binning if:
    * Extremely precise measurements are required
    * In domains requiring fine-grained analysis
    * Data distribution is already uniform
    * Using algorithms like neural networks or gradient boosting that handle continuous features naturally.

None of the reasons to avoid binning listed above applies to our data, and intuitively this problem could be solved by a decision tree algorithm, so we proceed with binning. According to the histograms we drew earlier, `Fare` and `Age` have the following distinct features:

| Column | Features                                                                               |
|--------|----------------------------------------------------------------------------------------|
| `Fare` | <ul><li>Most passengers paid less than 50, and deaths are concentrated in that range.</li> <li>There seem to be no fares in the ranges (150-200) and (300-480).</li></ul> |
| `Age`  | Divided into 10-year units, we can observe significant fluctuation in the 20s, 30s and 40s. |

There are two common ways of binning in pandas:

- `qcut()`: Equal-frequency bins (Quantile binning - each bin has similar number of samples)
- `cut()`: Equal-width bins (Fixed interval binning - each bin covers same range of values)
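
The difference between the two is easiest to see on skewed data. The sketch below uses a made-up series (hypothetical values, not Titanic fares) and counts how many samples land in each bin.

```python
import pandas as pd

# Hypothetical skewed values: many small, a few large
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 50, 100])

equal_width = pd.cut(values, bins=2)    # two intervals covering the same range of values
equal_frequency = pd.qcut(values, q=2)  # two intervals holding the same number of samples

# Count samples per bin, ordered from the lowest bin to the highest
width_counts = equal_width.value_counts().sort_index().tolist()
frequency_counts = equal_frequency.value_counts().sort_index().tolist()

print("cut :", width_counts)
print("qcut:", frequency_counts)
```

With equal-width bins almost every sample falls into the first bin, while quantile bins stay balanced, which is why skewed columns are often a better fit for `qcut`.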

Given these trends and the data distribution, we apply `qcut` to `Fare` and `cut` to `Age` as follows:

In [None]:
def bin_ages_and_fares(data):
    age_edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, np.inf]
    # With right=False each bin includes its left edge and excludes its right edge
    age_labels = ["[0, 10)", "[10, 20)", "[20, 30)", "[30, 40)", "[40, 50)", "[50, 60)", "[60, 70)", "[70, 80)", "80+"]
    age_bins = pd.cut(data["Age"].astype(int), bins=age_edges, labels=age_labels, right=False)

    # Fares could be roughly grouped in 5, as we can see in the histogram above.
    fare_bins = pd.qcut(data["Fare"], 5)

    # Encode bins as ordinal values since they can be compared.
    # The categorical codes follow the bin order, whereas LabelEncoder sorts
    # string labels alphabetically and would put "80+" before "[0, 10)".
    data["AgeBins_Code"] = age_bins.cat.codes
    data["FareBins_Code"] = fare_bins.cat.codes
    data = data.drop(["Age", "Fare"], axis=1)

    return data

train_data = bin_ages_and_fares(train_data)

train_data.info()
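
One caveat: `qcut` computes its edges from whatever data it receives, so calling `bin_ages_and_fares` separately on the training and test files can map the same fare to different bin codes. A possible fix, sketched below with made-up fares, is to learn the edges once with `retbins=True` and apply them everywhere with `cut`. The variable names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical fares for the two files
train_fares = pd.Series([5.0, 7.0, 8.0, 10.0, 13.0, 26.0, 30.0, 52.0, 80.0, 250.0])
test_fares = pd.Series([6.0, 9.0, 27.0, 600.0])

# Learn the quantile edges on the training fares only
_, edges = pd.qcut(train_fares, 5, labels=False, retbins=True)

# Widen the outer edges so unseen, more extreme fares still fall into a bin
edges = edges.copy()
edges[0], edges[-1] = -np.inf, np.inf

# Apply the same edges to both sets; labels=False returns integer bin codes
train_codes = pd.cut(train_fares, bins=edges, labels=False)
test_codes = pd.cut(test_fares, bins=edges, labels=False)

print(train_codes.tolist())
print(test_codes.tolist())
```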

We also drop `Name` for model simplicity, since `Name` does not seem to affect survival. (EDIT: this judgement might be inaccurate, because in our data `Name` contains a social title, which could be a strong hint for age (`Age`) and social status (`Pclass`).) `PassengerId` is removed as well, as we noted previously. After this step, `train_data` consists only of numerical values and is ready for ML algorithms that handle numerical input.

In [None]:
train_data = train_data.drop(['PassengerId', 'Name'], axis=1)

In [None]:
train_data.describe(include = "all")

In [None]:
train_data.sample(10)

## Split training and testing data

If we use all of `train_data` for learning as it is and evaluate on the same data, the model could show 100% accuracy for `train_data`. In other words, it can act as if it had memorised the answers for `train_data`. This is called the overfitting problem. In this challenge 891 records are given for learning, so even if we split them at a certain ratio using `sklearn`'s `train_test_split`, it will not be a big problem because enough data remains for learning.

To check whether a model is overfitted, record the loss on the training split and on the validation split for each learning epoch, and find the point where the difference between the two losses begins to increase. The larger the gap, the more the model is overfitted.
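
For models that are not trained in epochs, such as tree-based classifiers, a comparable check is to compare accuracy on the training split with accuracy on the held-out split. The following sketch uses synthetic data from `make_classification` (not the Titanic data) and a decision tree with and without a depth limit; the sample sizes, noise level and seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification problem (20% of labels flipped)
features, labels = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
x_tr, x_te, y_tr, y_te = train_test_split(features, labels, test_size=0.25, random_state=0)

gaps = {}
for depth in (2, None):  # None lets the tree grow until it fits the training split
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(x_tr, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(x_tr))
    val_acc = accuracy_score(y_te, model.predict(x_te))
    gaps[depth] = train_acc - val_acc
    print(f"max_depth={depth}: train={train_acc:.2f}, validation={val_acc:.2f}")
```

A large gap between the two accuracies for the unrestricted tree is the same overfitting signal as a widening loss gap.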

In [None]:
target = train_data["Survived"]
train_data = train_data.drop(["Survived"], axis=1)
x_train, x_val, y_train, y_val = train_test_split(train_data, target, test_size = 0.25, random_state = TF_RANDOM_SEED)

print(f"x_train shape: {x_train.shape}")
print(f"x_val   shape: {x_val.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_val   shape: {y_val.shape}")

# Training

After our long data preprocessing journey, we're finally ready for model training. Since we're completely new to the AI/ML field, we have no experience in choosing the most suitable ML algorithm for this kind of problem. So we list many of the classifiers `sklearn` offers and set up a small tool to evaluate which one works best for this challenge.

In [None]:
ml_algorithms = [
    # Discriminant Analysis
    discriminant_analysis.LinearDiscriminantAnalysis(),
    discriminant_analysis.QuadraticDiscriminantAnalysis(),

    # Ensemble Methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),

    # Gaussian Processes
    gaussian_process.GaussianProcessClassifier(),
    
    # Generalised Linear Models
    linear_model.LogisticRegressionCV(),
    linear_model.PassiveAggressiveClassifier(),
    linear_model.RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    linear_model.Perceptron(),

    # Naive Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),
    
    # Nearest Neighbor
    neighbors.KNeighborsClassifier(),
    
    # Support Vector Machine
    svm.SVC(probability=True),
    svm.NuSVC(probability=True),
    svm.LinearSVC(),
    
    # Trees    
    tree.DecisionTreeClassifier(),
    tree.ExtraTreeClassifier(),
]

##
# Performs training
##
table = pd.DataFrame(columns = ["Name", "Parameters", "Time Elapsed", "Accuracy"])

row_index = 0
for alg in ml_algorithms:
    # Reflection
    alg_name = alg.__class__.__name__
    table.loc[row_index, "Name"] = alg_name
    table.loc[row_index, "Parameters"] = str(alg.get_params())

    start_time = perf_counter()
    alg.fit(x_train, y_train)
    end_time = perf_counter()
    table.loc[row_index, "Time Elapsed"] = end_time - start_time
    y_pred = alg.predict(x_val)
    # accuracy_score expects (y_true, y_pred)
    table.loc[row_index, "Accuracy"] = round(accuracy_score(y_val, y_pred) * 100, 2)
    row_index += 1
    
# Print table
sorted_table = table.sort_values(by=["Accuracy", "Time Elapsed"], ascending=[False, True])
sorted_table

Running this cell prints an annoying warning, but we don't have enough knowledge to solve it right now, so we ignore it.

According to the evaluation results above, the `RandomForestClassifier` algorithm gives the best results and takes less time than `LogisticRegressionCV`. Therefore, we will submit the final result using `RandomForestClassifier`.
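
A single 75/25 split gives each model only one score, and with a few hundred validation rows small differences can be down to chance. A possible refinement, sketched here on synthetic data rather than the Titanic set, is k-fold cross-validation with `cross_val_score`, which reports a mean and a spread per model. The two candidate models and their parameters are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed training data
features, labels = make_classification(n_samples=400, n_features=8, random_state=0)

candidates = {
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

cv_scores = {}
for name, model in candidates.items():
    # 5 folds: every row is used for validation exactly once
    scores = cross_val_score(model, features, labels, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```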

# Submit prediction results

Now we transform `validation_data` so that it has the same columns and types as `train_data` before prediction.

In [None]:
validation_data_for_predict = validation_data.drop(columns=["Ticket", "Cabin"])
fill_null_by_median(validation_data_for_predict, "Age")
fill_null_by_median(validation_data_for_predict, "Fare")
validation_data_for_predict["Embarked"] = validation_data_for_predict.apply(lambda row: fill_missing_embarked(row, validation_data_for_predict), axis=1)
validation_data_for_predict = fill_family_size(validation_data_for_predict)
validation_data_for_predict = one_hot_encode(validation_data_for_predict, columns=["Sex", "Embarked"], prefixes=["Sex", "Embarked"])
validation_data_for_predict = bin_ages_and_fares(validation_data_for_predict)
validation_data_for_predict = validation_data_for_predict.drop(['PassengerId', 'Name'], axis=1)

print("train_data information:")
train_data.info()
print()
print("validation_data_for_predict information:")
validation_data_for_predict.info()
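
Because `pd.get_dummies` only creates columns for categories it actually sees, the transformed test frame can end up with different columns than the training frame. One defensive step, sketched below on toy frames (hypothetical names and values), is to reindex the test frame to the training columns before predicting.

```python
import pandas as pd

# Hypothetical frames: the test frame never contains "C" or "Q"
train_toy = pd.DataFrame({"Pclass": [1, 3, 2], "Embarked": ["C", "Q", "S"]})
test_toy = pd.DataFrame({"Pclass": [3, 1], "Embarked": ["S", "S"]})

train_encoded = pd.get_dummies(train_toy, columns=["Embarked"], dtype="int64")
test_encoded = pd.get_dummies(test_toy, columns=["Embarked"], dtype="int64")

# Add missing dummy columns as 0, drop unexpected ones, and match the column order
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_encoded.columns))
print(list(test_aligned.columns))
```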

In [None]:
model = ensemble.RandomForestClassifier()
model.fit(x_train, y_train)
predictions = model.predict(validation_data_for_predict)

output = pd.DataFrame({ "PassengerId" : validation_data["PassengerId"], "Survived": predictions })
output.to_csv("/kaggle/working/gender_submission.csv", index=False)

# Further improvements

After this implementation, more features could be extracted, such as:

1. Age estimation by parsing `Name`: we can guess ages from titles in `Name` such as
   - 'Countess', 'Lady', 'Sir'
   - 'Mlle', 'Ms', 'Miss'
   - 'Mme', 'Mrs'
   - 'Mr'
   - 'Master'
2. Passengers with a non-null `Cabin` tend to have a higher survival rate

Applying these features might improve model performance.
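
Both ideas can be sketched in a few lines. The example below assumes names follow a `Surname, Title. Given names` pattern; the toy rows, the regular expression and the `Title` / `HasCabin` column names are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows written in the assumed "Surname, Title. Given names" pattern
toy = pd.DataFrame({
    "Name": [
        "Doe, Mr. John",
        "Roe, Miss. Jane",
        "Poe, Master. Tim",
        "Moe, the Countess. of (Ann Example)",
    ],
    "Cabin": [np.nan, np.nan, np.nan, "B77"],
})

# The title is the word between the comma and the first period,
# skipping an optional leading "the"
toy["Title"] = toy["Name"].str.extract(r",\s*(?:the\s+)?([A-Za-z]+)\.", expand=False)

# 1 if a cabin was recorded, 0 otherwise
toy["HasCabin"] = toy["Cabin"].notna().astype(int)

print(toy[["Title", "HasCabin"]])
```

Rare titles could then be grouped as in the list above before one-hot encoding them.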

Also it would be better to reorganise the ToC as:

```
H1. Introduction
H1. Data Preprocessing - Feature Engineering
  H2. Data exploration
    H3. Finding missing values out
    H3. Planning to fill missing values
  H2. Data conversion
    H3. Feature aggregation
    H3. One-hot encoding
    H3. Binning
H1. Training
```

It would also be better to add a hyperparameter tuning process to this material.
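
As a starting point for such tuning, the sketch below runs a small grid search with `GridSearchCV` on synthetic data (not the Titanic set). The grid values are arbitrary assumptions chosen to keep the search fast, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed training data
features, labels = make_classification(n_samples=300, n_features=8, random_state=0)

# A deliberately small, arbitrary grid
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation for every parameter combination
    scoring="accuracy",
)
search.fit(features, labels)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is a model refitted on all the data with the best parameters, which could replace the default `RandomForestClassifier()` used for the submission.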