# Requirement #1
Requirement 1. Write an analysis report on titanic_dataset.py

## Titanic dataset
`TitanicDataset` stores the training samples and their labels.
The samples are used during training to update the model's parameters (weights and biases).
`__getitem__(idx)` returns one sample at a time as a dict with the keys `input` and `target`.
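
To illustrate how dict-style samples behave once they are batched, here is a minimal sketch. `ToyDataset` and its three rows are hypothetical stand-ins, not part of the original file; the sketch only assumes that the default `DataLoader` collation stacks each dict key separately.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical stand-in that mirrors the dict-style samples of TitanicDataset."""

    def __init__(self, X, y):
        self.X = torch.FloatTensor(X)
        self.y = torch.LongTensor(y)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return {'input': self.X[idx], 'target': self.y[idx]}


toy = ToyDataset([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0, 1, 0])
sample = toy[0]  # one sample: a dict holding two tensors
batch = next(iter(DataLoader(toy, batch_size=2)))  # each dict key is stacked into a batch

print(sample['input'].shape, batch['input'].shape, batch['target'].shape)
```

A single sample has shape `[2]`, while a batch of two has shape `[2, 2]`, which is why the training loop can index a batch with the same `'input'` and `'target'` keys.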

In [3]:
import os
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader, random_split


class TitanicDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.FloatTensor(X)
        self.y = torch.LongTensor(y)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        feature = self.X[idx]  # data sample
        target = self.y[idx]  # label
        return {'input': feature, 'target': target}

    def __str__(self):
        return "Data Size: {0}, Input Shape: {1}, Target Shape: {2}".format(
            len(self.X), self.X.shape, self.y.shape
        )

## Titanic test dataset
`TitanicTestDataset` stores the test samples only; the rows from `test.csv` carry no `Survived` labels.
After training, these samples are fed to the model to produce predictions, which cannot be scored locally without labels.
`__getitem__(idx)` returns one sample at a time as a dict with the key `input`.

In [4]:
class TitanicTestDataset(Dataset):
    def __init__(self, X):
        self.X = torch.FloatTensor(X)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        feature = self.X[idx]
        return {'input': feature}

    def __str__(self):
        return "Data Size: {0}, Input Shape: {1}".format(
            len(self.X), self.X.shape
        )


## Dataset preprocessing
The raw dataset needs to be processed before it's fed into the model.
`PATH` points to the directory that contains `train.csv` and `test.csv`.

### Parsing CSV into a dataframe
First, we have `pandas` read both files and parse them into dataframes.
Next, we concatenate the two dataframes into a single dataframe called `all_df`.
Missing data points are then filled in by the chain of `get_preprocessed_dataset_N` helpers described below.

### Extracting train features
```python
train_X = all_df[~all_df["Survived"].isnull()].drop("Survived", axis=1).reset_index(drop=True)
```
`Series.isnull()` yields a boolean Series that marks which values are `NaN`.
The unary operator `~` negates that mask, so indexing `all_df` with it keeps only the records whose `Survived` value is present, i.e. the rows that came from `train.csv`.
Subsequently, `DataFrame.drop("Survived", axis=1)` drops the label column, and `DataFrame.reset_index(drop=True)` discards the old index and creates a fresh one.
The test features, `test_X`, are extracted in the same way but without the negation.
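
The masking steps can be tried on a toy frame. This is a small sketch with made-up values, in which the rows with a `NaN` label play the role of the unlabeled test rows; the `toy_*` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame: rows with a NaN label stand in for the rows that came from test.csv.
toy_df = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Survived": [1.0, np.nan, 0.0, np.nan],
})

is_unlabeled = toy_df["Survived"].isnull()  # boolean Series, one flag per row
toy_train_X = toy_df[~is_unlabeled].drop("Survived", axis=1).reset_index(drop=True)
toy_test_X = toy_df[is_unlabeled].drop("Survived", axis=1).reset_index(drop=True)

print(toy_train_X)
print(toy_test_X)
```

Without `reset_index(drop=True)`, the training rows would keep their original positions (0 and 2) as index labels.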

### Extracting train labels
```python
train_y = train_df["Survived"]
```
Indexing `train_df` with `["Survived"]` selects the label column as a Series, which we name `train_y`.

### Constructing datasets
```python
dataset = TitanicDataset(train_X.values, train_y.values)
```
We now build instances of the `TitanicDataset` and `TitanicTestDataset` classes.
The dataframes are converted to `numpy.ndarray` via `.values` before they are passed to the constructors.
Once we have a `TitanicDataset`, we call `torch.utils.data.random_split()` to split the samples into `train_dataset` (80%) and `validation_dataset` (20%).
The latter is used to validate the performance of the model that we train later.
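
As a quick check of the split sizes, the following sketch splits a placeholder dataset of 891 samples. It assumes a PyTorch version that accepts fractional lengths in `random_split` (1.13 or later); the all-zero `TensorDataset` and the seed are arbitrary choices for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Placeholder data with the same size as the training set: 891 samples, 11 features.
toy_dataset = TensorDataset(torch.zeros(891, 11), torch.zeros(891, dtype=torch.long))

generator = torch.Generator().manual_seed(0)  # fixed seed so the split is reproducible
train_part, validation_part = random_split(toy_dataset, [0.8, 0.2], generator=generator)

print(len(train_part), len(validation_part))
```

The two parts hold 713 and 178 samples. Passing a seeded `generator` is optional; without it, each run assigns different samples to each part.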

In [5]:
def get_preprocessed_dataset():
    PATH = os.path.join(os.path.pardir, "_00_data", "0_titanic")
    train_data_path = os.path.join(PATH, "train.csv")
    test_data_path = os.path.join(PATH, "test.csv")

    train_df = pd.read_csv(train_data_path)
    test_df = pd.read_csv(test_data_path)

    all_df = pd.concat([train_df, test_df], sort=False)
    all_df = get_preprocessed_dataset_1(all_df)
    all_df = get_preprocessed_dataset_2(all_df)
    all_df = get_preprocessed_dataset_3(all_df)
    all_df = get_preprocessed_dataset_4(all_df)
    all_df = get_preprocessed_dataset_5(all_df)
    all_df = get_preprocessed_dataset_6(all_df)

    train_X = all_df[~all_df["Survived"].isnull()].drop("Survived", axis=1).reset_index(drop=True)
    train_y = train_df["Survived"]

    test_X = all_df[all_df["Survived"].isnull()].drop("Survived", axis=1).reset_index(drop=True)

    dataset = TitanicDataset(train_X.values, train_y.values)
    print(dataset)
    train_dataset, validation_dataset = random_split(dataset, [0.8, 0.2])
    test_dataset = TitanicTestDataset(test_X.values)
    print(test_dataset)

    return train_dataset, validation_dataset, test_dataset


## Imperfect data samples
Third-party data often contains missing values.
In that case, we have to estimate the missing values as well as we can.

### Missing fare values
For each missing value, we assume that `Fare` equals the average fare of the passenger's `Pclass`.
To find the average fare of each class, select the two columns `Pclass` and `Fare` and group by `Pclass` to produce `Fare_mean`.
Then, merge `Fare_mean` into `all_df` so that a `Fare_mean` column is appended to the dataframe.
Finally, substitute each missing value with the corresponding average.
Note that `Fare_mean` is not dropped afterwards, so it remains one of the 11 input features.
Mass assignment can be achieved by using `DataFrame.loc` with the following arguments:
- `(all_df["Fare"].isnull())` : select every index where `Fare` is missing
- `"Fare"` : select the column whose label is `Fare` 
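
The group, merge, and fill steps can be followed on a toy frame. This is a sketch with made-up fares; `toy_df`, `fare_mean`, and `missing` are hypothetical names.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Fare": [80.0, 100.0, 8.0, np.nan, 10.0],
})

# Mean fare per class; NaN values are skipped by mean().
fare_mean = toy_df[["Pclass", "Fare"]].groupby("Pclass").mean().reset_index()
fare_mean.columns = ["Pclass", "Fare_mean"]

# A left merge keeps every passenger row and appends the class mean to it.
toy_df = pd.merge(toy_df, fare_mean, on="Pclass", how="left")

# Row selector: missing fares. Column selector: "Fare".
missing = toy_df["Fare"].isnull()
toy_df.loc[missing, "Fare"] = toy_df.loc[missing, "Fare_mean"]

print(toy_df)
```

The third-class passenger with the missing fare receives 9.0, the mean of 8.0 and 10.0.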

In [6]:
def get_preprocessed_dataset_1(all_df):
    # Fill missing Fare values with the mean Fare of each Pclass
    Fare_mean = all_df[["Pclass", "Fare"]].groupby("Pclass").mean().reset_index()
    Fare_mean.columns = ["Pclass", "Fare_mean"]
    all_df = pd.merge(all_df, Fare_mean, on="Pclass", how="left")
    all_df.loc[(all_df["Fare"].isnull()), "Fare"] = all_df["Fare_mean"]

    return all_df


### Dividing passenger names
Passenger names can be split into `family_name`, `honorific`, and `name`.
We pass the regular expression `[,.]` to `Series.str.split()` to split each name into 3 columns
and use `Series.str.strip()` to strip the surrounding whitespace from each part. Then, we append the new columns to the `all_df` dataframe and return it.
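
Here is a minimal sketch of the split on two sample name strings in the dataset's `family name, honorific. given names` layout. The `names` Series is a stand-in for the `Name` column.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
])

# Split on the first comma and the first period only (n=2 caps the number of splits).
name_df = names.str.split("[,.]", n=2, expand=True)
name_df.columns = ["family_name", "honorific", "name"]

# Strip the whitespace that is left over around each part.
name_df = name_df.apply(lambda column: column.str.strip())

print(name_df)
```

Capping the splits with `n=2` matters because given names may contain further periods or commas, which then stay inside the `name` column.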

In [7]:
def get_preprocessed_dataset_2(all_df):
    # Split Name into three columns and concatenate them back onto all_df
    name_df = all_df["Name"].str.split("[,.]", n=2, expand=True)
    name_df.columns = ["family_name", "honorific", "name"]
    name_df["family_name"] = name_df["family_name"].str.strip()
    name_df["honorific"] = name_df["honorific"].str.strip()
    name_df["name"] = name_df["name"].str.strip()
    all_df = pd.concat([all_df, name_df], axis=1)

    return all_df


### Missing age values
Now, we use the honorific part of each passenger's name to estimate their age.
We first take the `honorific` and `Age` columns and calculate the median age for each honorific; despite its name, `honorific_age_mean` holds medians.
Next, we assign to the `DataFrame.columns` property to rename the column labels.
Again, we merge the new column into `all_df` and substitute every missing age with the corresponding `honorific_age_mean` value.
Since we don't need this column outside this function, we drop it with `DataFrame.drop()`.
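
For comparison, the same fill can be written without the merge. The sketch below is an alternative formulation, not the approach used in the original function: it relies on `groupby(...).transform("median")`, which returns one median per row, and uses made-up ages.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "honorific": ["Mr", "Mr", "Mr", "Miss", "Miss"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan],
})

# One rounded median per row, aligned with toy_df's index.
median_by_honorific = toy_df.groupby("honorific")["Age"].transform("median").round()

# fillna only touches the rows where Age is missing.
toy_df["Age"] = toy_df["Age"].fillna(median_by_honorific)

print(toy_df)
```

Since no helper column is added to the frame, there is nothing to drop afterwards.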

In [8]:
def get_preprocessed_dataset_3(all_df):
    # Fill missing Age values with the median Age of each honorific
    honorific_age_mean = all_df[["honorific", "Age"]].groupby("honorific").median().round().reset_index()
    honorific_age_mean.columns = ["honorific", "honorific_age_mean", ]
    all_df = pd.merge(all_df, honorific_age_mean, on="honorific", how="left")
    all_df.loc[(all_df["Age"].isnull()), "Age"] = all_df["honorific_age_mean"]
    all_df = all_df.drop(["honorific_age_mean"], axis=1)

    return all_df


## Enhancing the features
New features can be derived from existing ones which may improve the overall sample quality.

### Family count
We sum the `Parch` and `SibSp` columns to obtain the `family_num` count for each passenger.

### Alone flag
```python
# index/row selector: all_df["family_num"] == 0
# label/column selector: "alone"
all_df.loc[all_df["family_num"] == 0, "alone"] = 1
```
`DataFrame.loc` assigns `alone = 1` to every passenger whose `family_num` is 0.
The remaining rows are left as `NaN` and are filled with 0 afterwards.
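
The two derived features can be checked on a toy frame. This sketch uses made-up counts and also shows a one-step equivalent of the flag, which is an alternative and not what `get_preprocessed_dataset_4` does.

```python
import pandas as pd

toy_df = pd.DataFrame({"Parch": [0, 1, 0], "SibSp": [0, 1, 2]})
toy_df["family_num"] = toy_df["Parch"] + toy_df["SibSp"]

# Two steps, as in get_preprocessed_dataset_4: flag the loners, then fill the rest with 0.
toy_df.loc[toy_df["family_num"] == 0, "alone"] = 1
toy_df["alone"] = toy_df["alone"].fillna(0)

# One-step equivalent: cast the boolean mask to float.
alone_one_step = (toy_df["family_num"] == 0).astype(float)

print(toy_df)
```

Both variants yield the same column; the cast avoids the intermediate `NaN` values.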

### Redundant columns
```python
all_df = all_df.drop(["PassengerId", "Name", "family_name", "name", "Ticket", "Cabin"], axis=1)
```
There are some features which are deemed irrelevant to predict the outcome.
We can drop those columns to get a compact dataset.

In [9]:
def get_preprocessed_dataset_4(all_df):
    # Add a new family-size column (family_num)
    all_df["family_num"] = all_df["Parch"] + all_df["SibSp"]

    # Add a new travelling-alone column (alone)
    all_df.loc[all_df["family_num"] == 0, "alone"] = 1
    all_df["alone"] = all_df["alone"].fillna(0)  # assign back instead of a chained inplace fill

    # Drop columns that are not needed for training
    all_df = all_df.drop(["PassengerId", "Name", "family_name", "name", "Ticket", "Cabin"], axis=1)

    return all_df


### Miscellaneous honorific names
To reduce the number of variants, we replace every uncommon honorific with `other`.
By passing the negated chain of boolean expressions as the row selector of `DataFrame.loc`, we select
every passenger whose honorific is not one of the 4 common variants (`Mr`, `Miss`, `Mrs`, and `Master`).
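
The chain of `==` comparisons can also be expressed with `Series.isin()`. The sketch below shows that alternative on a few made-up honorifics; it is not the form used in the original function.

```python
import pandas as pd

honorifics = pd.Series(["Mr", "Dr", "Miss", "Rev", "Master", "Mrs"])
common = ["Mr", "Miss", "Mrs", "Master"]

# Keep the value where it is one of the common variants, otherwise use "other".
reduced = honorifics.where(honorifics.isin(common), "other")

print(reduced.tolist())
```

Here `Dr` and `Rev` collapse into `other`, while the four common variants are left untouched.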

### Missing embarked places
```python
all_df["Embarked"] = all_df["Embarked"].fillna("missing")
```
Next, we use `Series.fillna()` to replace every missing `Embarked` value with the string `missing`.

In [10]:
def get_preprocessed_dataset_5(all_df):
    # Reduce the number of distinct honorific values
    all_df.loc[
        ~(
                (all_df["honorific"] == "Mr") |
                (all_df["honorific"] == "Miss") |
                (all_df["honorific"] == "Mrs") |
                (all_df["honorific"] == "Master")
        ),
        "honorific"
    ] = "other"
    all_df["Embarked"] = all_df["Embarked"].fillna("missing")  # assign back instead of a chained inplace fill

    return all_df


### Encoding category values
A categorical feature here is a column whose dtype is `object`, i.e. one that holds strings.
Those values must be encoded as numbers because a PyTorch tensor holds numeric data of a single dtype.
In the following code, `LabelEncoder` maps each distinct value of a category to an integer.
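
To see what the encoder produces, here is a small sketch on a made-up `Embarked`-like column. It assumes scikit-learn is installed; the `embarked` Series is a hypothetical stand-in.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

embarked = pd.Series(["S", "C", "Q", "missing", "S"])

encoder = LabelEncoder().fit(embarked)  # learns the sorted set of distinct values
codes = encoder.transform(embarked)     # replaces each value by its position in that set

print(list(encoder.classes_))
print(codes.tolist())
```

The classes are sorted as `C`, `Q`, `S`, `missing`, so the codes come out as `[2, 0, 1, 3, 2]`. The integers carry no meaning beyond identity, although a model may still treat them as ordered quantities.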

In [11]:
def get_preprocessed_dataset_6(all_df):
    # Convert categorical variables to numeric values with LabelEncoder
    category_features = all_df.columns[all_df.dtypes == "object"]
    from sklearn.preprocessing import LabelEncoder
    for category_feature in category_features:
        le = LabelEncoder()
        if all_df[category_feature].dtypes == "object":
            le = le.fit(all_df[category_feature])
            all_df[category_feature] = le.transform(all_df[category_feature])

    return all_df


## Modeling a neural network
`MyModel` is our custom neural network, a subclass of `nn.Module`.
A custom module is expected to meet the following criteria:
- its constructor calls the parent class's `__init__()`
- its class body defines a `forward()` method

### Container
`nn.Sequential` is the main container of this model. It feeds each module's output into the next module's input, creating a chain of layers.
Hence, we don't need to call each module ourselves; the container takes care of it.
All `forward()` has to do is call the container: `self.model(x)`.
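
The chaining behaviour can be verified by comparing the container's output with a manual loop over its layers. This is a minimal sketch with an arbitrary seed and a shortened, hypothetical layer stack.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only to make the run repeatable
chain = nn.Sequential(nn.Linear(11, 30), nn.ReLU(), nn.Linear(30, 2))

x = torch.randn(4, 11)  # a batch of 4 samples with 11 features each
out_container = chain(x)

# Manual equivalent: feed each layer's output into the next layer.
out_manual = x
for layer in chain:
    out_manual = layer(out_manual)

print(torch.equal(out_container, out_manual))
```

Both paths yield identical tensors, which is why `forward()` needs just one call.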

### Anatomy
The model has 2 hidden layers, each a fully connected layer of 30 units followed by a ReLU activation.
The sizes of the input and output layers are determined by `n_input` and `n_output`.
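
As a rough size check, the sketch below rebuilds the same layer stack and counts its trainable parameters, assuming the values `n_input=11` and `n_output=2` that this report uses. The names `layers`, `param_counts`, and `total` are hypothetical.

```python
from torch import nn

n_input, n_output = 11, 2
layers = nn.Sequential(
    nn.Linear(n_input, 30), nn.ReLU(),
    nn.Linear(30, 30), nn.ReLU(),
    nn.Linear(30, n_output),
)

# Each Linear layer contributes a weight matrix and a bias vector.
param_counts = [p.numel() for p in layers.parameters()]
total = sum(param_counts)

print(param_counts, total)
```

The counts are (11 x 30 + 30) + (30 x 30 + 30) + (30 x 2 + 2) = 1352 parameters in total; the ReLU layers have none.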

In [12]:
from torch import nn


class MyModel(nn.Module):
    def __init__(self, n_input, n_output):
        super().__init__()

        self.model = nn.Sequential(
            nn.Linear(n_input, 30),
            nn.ReLU(),
            nn.Linear(30, 30),
            nn.ReLU(),
            nn.Linear(30, n_output),
        )

    def forward(self, x):
        x = self.model(x)
        return x

## Model evaluation
The purpose of this function is to show how the model produces predictions for the test set.
The model is instantiated here with fresh random weights and has not been trained, so the predictions are not meaningful yet.
We set `n_input=11` because the preprocessed Titanic dataset has 11 features.
We set `n_output=2` so that the model produces 2 outputs for each sample.
These are unnormalized scores (logits): the first belongs to the class "died" and the second to the class "survived".

```python
prediction_batch = torch.argmax(output_batch, dim=1)
```
This gives a `prediction_batch` vector in which each element is the predicted class of one sample, i.e. the index of the larger score.
- 0 = the passenger is predicted to have died in the sinking
- 1 = the passenger is predicted to have survived
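
The sketch below applies `argmax` to a hand-written batch of scores, and additionally `softmax` for readers who want probabilities. The score values are made up, and the `softmax` step is an illustration that is not part of the original function.

```python
import torch

# Three samples, two scores each: column 0 = "died", column 1 = "survived".
output_batch = torch.tensor([[2.0, 0.5],
                             [-1.0, 1.0],
                             [0.1, 0.4]])

prediction_batch = torch.argmax(output_batch, dim=1)  # index of the larger score per row
probabilities = torch.softmax(output_batch, dim=1)    # each row now sums to 1

print(prediction_batch.tolist())
print(probabilities)
```

The predictions are `[0, 1, 1]`. `softmax` preserves the ordering within each row, so taking `argmax` of the probabilities gives the same classes.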

In [13]:
def test(test_data_loader):
    print("[TEST]")
    batch = next(iter(test_data_loader))
    print("{0}".format(batch['input'].shape))  # [418, 11]
    my_model = MyModel(n_input=11, n_output=2)  # the class defined above; weights are untrained
    output_batch = my_model(batch['input'])  # [418, 2]
    prediction_batch = torch.argmax(output_batch, dim=1)  # [418]
    for idx, prediction in enumerate(prediction_batch, start=892):
        print(idx, prediction.item())


## Model usage

### Run output
```
Data Size: 891, Input Shape: torch.Size([891, 11]), Target Shape: torch.Size([891])
Data Size: 418, Input Shape: torch.Size([418, 11])
train_dataset: 713, validation_dataset.shape: 178, test_dataset: 418
################################################## 1
0 - tensor([ 1.0000,  1.0000, 32.0000,  0.0000,  0.0000, 30.5000,  0.0000, 87.5090,
         4.0000,  0.0000,  1.0000]): 1
1 - tensor([ 3.0000,  0.0000, 22.0000,  3.0000,  1.0000, 25.4667,  2.0000, 13.3029,
         1.0000,  4.0000,  0.0000]): 0
2 - tensor([ 3.0000,  1.0000, 26.0000,  2.0000,  0.0000,  8.6625,  2.0000, 13.3029,
         2.0000,  2.0000,  0.0000]): 0
3 - tensor([ 3.0000,  0.0000, 23.0000,  0.0000,  0.0000,  7.5500,  2.0000, 13.3029,
         1.0000,  0.0000,  1.0000]): 1
4 - tensor([ 2.0000,  1.0000, 32.0000,  2.0000,  0.0000, 73.5000,  2.0000, 21.1792,
         2.0000,  2.0000,  0.0000]): 0
5 - tensor([ 3.0000,  1.0000, 17.0000,  0.0000,  0.0000,  8.6625,  2.0000, 13.3029,
         2.0000,  0.0000,  1.0000]): 0
6 - tensor([ 3.0000,  1.0000, 32.0000,  1.0000,  0.0000, 15.8500,  2.0000, 13.3029,
         2.0000,  1.0000,  0.0000]): 0
7 - tensor([ 3.0000,  1.0000, 47.0000,  0.0000,  0.0000,  7.2500,  2.0000, 13.3029,
         2.0000,  0.0000,  1.0000]): 0
8 - tensor([ 3.0000,  1.0000, 28.0000,  0.0000,  0.0000, 22.5250,  2.0000, 13.3029,
         2.0000,  0.0000,  1.0000]): 0
9 - tensor([  1.0000,   1.0000,  49.0000,   1.0000,   1.0000, 110.8833,   0.0000,
         87.5090,   2.0000,   2.0000,   0.0000]): 0
...
700 - tensor([ 1.0000,  1.0000, 25.0000,  1.0000,  0.0000, 55.4417,  0.0000, 87.5090,
         2.0000,  1.0000,  0.0000]): 1
701 - tensor([ 2.0000,  1.0000, 54.0000,  0.0000,  0.0000, 26.0000,  2.0000, 21.1792,
         2.0000,  0.0000,  1.0000]): 0
702 - tensor([ 3.0000,  0.0000, 36.0000,  0.0000,  2.0000, 15.2458,  0.0000, 13.3029,
         3.0000,  2.0000,  0.0000]): 0
703 - tensor([ 3.0000,  0.0000, 22.0000,  0.0000,  0.0000,  7.7500,  1.0000, 13.3029,
         1.0000,  0.0000,  1.0000]): 1
704 - tensor([ 2.0000,  1.0000, 28.0000,  0.0000,  0.0000, 13.0000,  2.0000, 21.1792,
         2.0000,  0.0000,  1.0000]): 0
705 - tensor([ 1.0000,  0.0000, 51.0000,  1.0000,  0.0000, 77.9583,  2.0000, 87.5090,
         3.0000,  1.0000,  0.0000]): 1
706 - tensor([ 3.0000,  0.0000, 22.0000,  0.0000,  0.0000,  7.8792,  1.0000, 13.3029,
         1.0000,  0.0000,  1.0000]): 1
707 - tensor([ 3.0000,  1.0000, 29.0000,  0.0000,  0.0000,  9.4833,  2.0000, 13.3029,
         2.0000,  0.0000,  1.0000]): 0
708 - tensor([ 3.0000,  1.0000, 29.0000,  0.0000,  0.0000,  7.8958,  2.0000, 13.3029,
         2.0000,  0.0000,  1.0000]): 0
709 - tensor([ 3.0000,  0.0000, 26.0000,  0.0000,  0.0000,  7.9250,  2.0000, 13.3029,
         1.0000,  0.0000,  1.0000]): 1
710 - tensor([ 3.0000,  0.0000, 18.0000,  1.0000,  0.0000, 17.8000,  2.0000, 13.3029,
         3.0000,  1.0000,  0.0000]): 0
711 - tensor([ 3.0000,  1.0000, 18.0000,  1.0000,  0.0000,  6.4958,  2.0000, 13.3029,
         2.0000,  1.0000,  0.0000]): 0
712 - tensor([ 3.0000,  1.0000, 24.0000,  0.0000,  0.0000,  8.0500,  2.0000, 13.3029,
         2.0000,  0.0000,  1.0000]): 0
################################################## 2
[TRAIN]
0 - torch.Size([16, 11]): torch.Size([16])
1 - torch.Size([16, 11]): torch.Size([16])
2 - torch.Size([16, 11]): torch.Size([16])
3 - torch.Size([16, 11]): torch.Size([16])
4 - torch.Size([16, 11]): torch.Size([16])
5 - torch.Size([16, 11]): torch.Size([16])
6 - torch.Size([16, 11]): torch.Size([16])
7 - torch.Size([16, 11]): torch.Size([16])
8 - torch.Size([16, 11]): torch.Size([16])
9 - torch.Size([16, 11]): torch.Size([16])
...
40 - torch.Size([16, 11]): torch.Size([16])
41 - torch.Size([16, 11]): torch.Size([16])
42 - torch.Size([16, 11]): torch.Size([16])
43 - torch.Size([16, 11]): torch.Size([16])
44 - torch.Size([9, 11]): torch.Size([9])
[VALIDATION]
0 - torch.Size([16, 11]): torch.Size([16])
1 - torch.Size([16, 11]): torch.Size([16])
2 - torch.Size([16, 11]): torch.Size([16])
3 - torch.Size([16, 11]): torch.Size([16])
4 - torch.Size([16, 11]): torch.Size([16])
5 - torch.Size([16, 11]): torch.Size([16])
6 - torch.Size([16, 11]): torch.Size([16])
7 - torch.Size([16, 11]): torch.Size([16])
8 - torch.Size([16, 11]): torch.Size([16])
9 - torch.Size([16, 11]): torch.Size([16])
10 - torch.Size([16, 11]): torch.Size([16])
11 - torch.Size([2, 11]): torch.Size([2])
################################################## 3
[TEST]
torch.Size([418, 11])
892 1
893 1
894 1
895 1
896 1
897 1
898 1
899 1
...
1300 1
1301 0
1302 1
1303 0
1304 1
1305 1
1306 1
1307 1
1308 1
1309 1
```

In [14]:
if __name__ == "__main__":
    # Parse csv files into datasets
    train_dataset, validation_dataset, test_dataset = get_preprocessed_dataset()

    # train_dataset: 713, validation_dataset: 178, test_dataset: 418
    # 713 + 178 = 891 (sample count in train.csv)
    print("train_dataset: {0}, validation_dataset.shape: {1}, test_dataset: {2}".format(
        len(train_dataset), len(validation_dataset), len(test_dataset)
    ))
    print("#" * 50, 1)

    for idx, sample in enumerate(train_dataset):
        # sample index - features: label
        print("{0} - {1}: {2}".format(idx, sample['input'], sample['target']))

    print("#" * 50, 2)

    # DataLoaders for batch training
    train_data_loader = DataLoader(dataset=train_dataset, batch_size=16, shuffle=True)
    validation_data_loader = DataLoader(dataset=validation_dataset, batch_size=16, shuffle=True)
    test_data_loader = DataLoader(dataset=test_dataset, batch_size=len(test_dataset))

    # Start training process (no logic yet)
    print("[TRAIN]")
    for idx, batch in enumerate(train_data_loader):
        print("{0} - {1}: {2}".format(idx, batch['input'].shape, batch['target'].shape))

    # Start validation process (no logic yet)
    print("[VALIDATION]")
    for idx, batch in enumerate(validation_data_loader):
        print("{0} - {1}: {2}".format(idx, batch['input'].shape, batch['target'].shape))

    print("#" * 50, 3)

    # Start model testing
    test(test_data_loader)


Data Size: 891, Input Shape: torch.Size([891, 11]), Target Shape: torch.Size([891])
Data Size: 418, Input Shape: torch.Size([418, 11])
train_dataset: 713, validation_dataset.shape: 178, test_dataset: 418
################################################## 1
0 - tensor([  1.0000,   0.0000,  58.0000,   0.0000,   1.0000, 153.4625,   2.0000,
         87.5090,   3.0000,   1.0000,   0.0000]): 1
1 - tensor([ 1.0000,  1.0000, 58.0000,  0.0000,  0.0000, 29.7000,  0.0000, 87.5090,
         2.0000,  0.0000,  1.0000]): 0
2 - tensor([ 1.0000,  1.0000, 44.0000,  2.0000,  0.0000, 90.0000,  1.0000, 87.5090,
         4.0000,  2.0000,  0.0000]): 0
3 - tensor([ 1.0000,  0.0000, 52.0000,  1.0000,  1.0000, 93.5000,  2.0000, 87.5090,
         3.0000,  2.0000,  0.0000]): 1
4 - tensor([ 3.0000,  1.0000, 19.0000,  0.0000,  0.0000,  0.0000,  2.0000, 13.3029,
         2.0000,  0.0000,  1.0000]): 0
5 - tensor([  1.0000,   0.0000,  23.0000,   3.0000,   2.0000, 263.0000,   2.0000,
         87.5090,   1.0000,   5.000

# Requirement #2
Requirement 2. Write training code for the Titanic deep learning model and try changing the activation function

## Revised code
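The revised code below uses Weights & Biases (`wandb`) to track the training process, along with `argparse` to read CLI inputs.
Note that the listing imports `argparse` and reads `wandb.config`, but the argument parsing and the `wandb.init()` call are not shown.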
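
One way the CLI inputs might be collected is sketched below. The flag names mirror the `wandb.config` fields that the listing reads (`epochs`, `batch_size`, `learning_rate`). The `parse_config` helper, the default values, and the project name are assumptions made for illustration, not taken from the original code.

```python
import argparse


def parse_config(argv=None):
    """Parse hypothetical CLI flags into a plain dict of hyperparameters."""
    parser = argparse.ArgumentParser(description="Titanic training run")
    parser.add_argument("--epochs", type=int, default=1000)           # assumed default
    parser.add_argument("--batch_size", type=int, default=16)         # assumed default
    parser.add_argument("--learning_rate", type=float, default=1e-3)  # assumed default
    return vars(parser.parse_args(argv))


# Passing an explicit list keeps the sketch runnable inside a notebook.
config = parse_config(["--epochs", "200", "--learning_rate", "0.01"])
print(config)

# With the wandb package installed and a login configured, the dict could then be
# handed over so that wandb.config.epochs etc. become available:
# import wandb
# wandb.init(project="titanic", config=config)  # project name is a placeholder
```

The config has to be initialized before `get_dataloaders()` and `get_model_and_optimizer()` are called, since both read `wandb.config`.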

In [15]:
import os
import pandas as pd
import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader, random_split
import wandb
import argparse


class TitanicDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.FloatTensor(X)
        self.y = torch.LongTensor(y)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        feature = self.X[idx]
        target = self.y[idx]
        return {'input': feature, 'target': target}

    def __str__(self):
        return "Data Size: {0}, Input Shape: {1}, Target Shape: {2}".format(
            len(self.X), self.X.shape, self.y.shape
        )


class TitanicTestDataset(Dataset):
    def __init__(self, X):
        self.X = torch.FloatTensor(X)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        feature = self.X[idx]
        return {'input': feature}

    def __str__(self):
        return "Data Size: {0}, Input Shape: {1}".format(
            len(self.X), self.X.shape
        )


class TitanicModel(nn.Module):
    def __init__(self, n_input, n_output):
        super().__init__()

        self.model = nn.Sequential(
            nn.Linear(n_input, 30),
            nn.ReLU(),
            nn.Linear(30, 30),
            nn.ReLU(),
            nn.Linear(30, n_output),
        )

    def forward(self, x):
        x = self.model(x)
        return x


def get_model_and_optimizer():
    model = TitanicModel(n_input=11, n_output=2)
    optimizer = optim.SGD(model.parameters(), lr=wandb.config.learning_rate)

    return model, optimizer


def get_dataloaders():
    PATH = os.path.join(os.path.pardir, "_00_data", "0_titanic")
    train_data_path = os.path.join(PATH, "train.csv")
    test_data_path = os.path.join(PATH, "test.csv")
    train_df = pd.read_csv(train_data_path)
    test_df = pd.read_csv(test_data_path)
    all_df = pd.concat([train_df, test_df], sort=False)
    all_df = preprocess(all_df)

    train_X = all_df[~all_df["Survived"].isnull()].drop("Survived", axis=1).reset_index(drop=True)
    train_y = train_df["Survived"]

    test_X = all_df[all_df["Survived"].isnull()].drop("Survived", axis=1).reset_index(drop=True)

    dataset = TitanicDataset(train_X.values, train_y.values)
    train_dataset, validation_dataset = random_split(dataset, [0.8, 0.2])
    test_dataset = TitanicTestDataset(test_X.values)

    train_data_loader = DataLoader(train_dataset, batch_size=wandb.config.batch_size, shuffle=True)
    validation_data_loader = DataLoader(validation_dataset, batch_size=len(validation_dataset))
    test_data_loader = DataLoader(test_dataset, batch_size=len(test_dataset))

    return train_data_loader, validation_data_loader, test_data_loader


def preprocess(dataset):
    print("Preprocessing dataset...")
    # adjust fare
    Fare_mean = dataset[["Pclass", "Fare"]].groupby("Pclass").mean().reset_index()
    Fare_mean.columns = ["Pclass", "Fare_mean"]
    dataset = pd.merge(dataset, Fare_mean, on="Pclass", how="left")
    dataset.loc[(dataset["Fare"].isnull()), "Fare"] = dataset["Fare_mean"]
    
    # adjust name
    name_df = dataset["Name"].str.split("[,.]", n=2, expand=True)
    name_df.columns = ["family_name", "honorific", "name"]
    name_df["family_name"] = name_df["family_name"].str.strip()
    name_df["honorific"] = name_df["honorific"].str.strip()
    name_df["name"] = name_df["name"].str.strip()
    dataset = pd.concat([dataset, name_df], axis=1)
    
    # adjust age
    honorific_age_mean = dataset[["honorific", "Age"]].groupby("honorific").median().round().reset_index()
    honorific_age_mean.columns = ["honorific", "honorific_age_mean", ]
    dataset = pd.merge(dataset, honorific_age_mean, on="honorific", how="left")
    dataset.loc[(dataset["Age"].isnull()), "Age"] = dataset["honorific_age_mean"]
    dataset = dataset.drop(["honorific_age_mean"], axis=1)
    
    # derive columns
    dataset["family_num"] = dataset["Parch"] + dataset["SibSp"]
    dataset.loc[dataset["family_num"] == 0, "alone"] = 1
    dataset["alone"] = dataset["alone"].fillna(0)  # assign back instead of a chained inplace fill
    
    # drop redundant columns
    dataset = dataset.drop(["PassengerId", "Name", "family_name", "name", "Ticket", "Cabin"], axis=1)
    
    # reduce honorific variants
    dataset.loc[
        ~(
                (dataset["honorific"] == "Mr") |
                (dataset["honorific"] == "Miss") |
                (dataset["honorific"] == "Mrs") |
                (dataset["honorific"] == "Master")
        ),
        "honorific"
    ] = "other"
    dataset["Embarked"] = dataset["Embarked"].fillna("missing")  # assign back instead of a chained inplace fill
    
    # encode category samples
    category_features = dataset.columns[dataset.dtypes == "object"]
    from sklearn.preprocessing import LabelEncoder
    for category_feature in category_features:
        le = LabelEncoder()
        if dataset[category_feature].dtypes == "object":
            le = le.fit(dataset[category_feature])
            dataset[category_feature] = le.transform(dataset[category_feature])
            
    return dataset


def train(model, optimizer, train_data_loader, validation_data_loader):
    print("Training model...")
    
    n_epochs = wandb.config.epochs
    loss_fn = nn.CrossEntropyLoss()  # built-in loss for [B, 2] logits with integer class targets of shape [B]
    next_print_epoch = 100

    for epoch in range(1, n_epochs + 1):
        loss_train = 0.0
        num_trains = 0
        for train_batch in train_data_loader:
            output_train = model(train_batch['input'])
            loss = loss_fn(output_train, train_batch['target'])
            loss_train += loss.item()
            num_trains += 1

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        loss_validation = 0.0
        num_validations = 0
        with torch.no_grad():
            for validation_batch in validation_data_loader:
                output_validation = model(validation_batch['input'])
                loss = loss_fn(output_validation, validation_batch['target'])
                loss_validation += loss.item()
                num_validations += 1

        wandb.log({
            "Epoch": epoch,
            "Training loss": loss_train / num_trains,
            "Validation loss": loss_validation / num_validations
        })

        if epoch >= next_print_epoch:
            print(
                f"Epoch {epoch}, "
                f"Training loss {loss_train / num_trains:.4f}, "
                f"Validation loss {loss_validation / num_validations:.4f}"
            )
            next_print_epoch += 100


def test(model, test_data_loader):
    print("Testing model...")
    batch = next(iter(test_data_loader))
    print("{0}".format(batch['input'].shape))
    model.eval()  # use the trained model instead of a freshly initialized one
    with torch.no_grad():
        output_batch = model(batch['input'])
    prediction_batch = torch.argmax(output_batch, dim=1)
    for idx, prediction in enumerate(prediction_batch, start=892):  # test PassengerId starts at 892
        print(idx, prediction.item())


if __name__ == "__main__":
    train_data_loader, validation_data_loader, test_data_loader = get_dataloaders()
    model, optimizer = get_model_and_optimizer()

    train(model, optimizer, train_data_loader, validation_data_loader)
    test(model, test_data_loader)


Preprocessing dataset...


TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.