In [1]:
import numpy as np
import pandas as pd

# classifier for the Titanic model and regressor for imputing missing Age values
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# library for target encoding
from category_encoders import TargetEncoder

# GridSearchCV to find optimal parameters for our model
from sklearn.model_selection import GridSearchCV

In [2]:
train_data = pd.read_csv("./train.csv")
test_data = pd.read_csv("./test.csv")

# Filling Missing Values

We fill the missing values of the features that we are going to use to train the model.

Inspecting both train and test with `DataFrame.info()` shows that  
we have to impute missing values for Age, Cabin, Embarked and Fare.

## Pattern between two variables

Before anything else, let's look at how I find the variables that show a pattern with the target variable. I use
those features/variables for imputation as well as for building the model.

1. Between two categorical variables: I use mutual information, specifically `normalized_mutual_info_score` from sklearn, and
   check whether it returns a significant value. If it does, I use that variable.

2. Between a categorical variable and a numerical variable: If the categorical variable has two levels (meaning it can take
   only two values, e.g. True and False), I use the point-biserial correlation coefficient. If the categorical variable has
   more levels, I use the ANOVA F-value, `f_classif` from sklearn, to decide whether to use it.

3. Between two numerical variables: I don't have a dedicated measure for this. I could use correlation, but this dataset
   has only two numerical variables, Age and Fare, so I checked for a pattern by plotting a scatter
   plot between them.
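
The measures from the list above can be sketched roughly as follows. This is only a sketch on made-up toy arrays, not the real Titanic columns, and it assumes `scipy.stats.pointbiserialr` as the point-biserial implementation (the notebook does not say which one was used).

```python
import numpy as np
from scipy.stats import pointbiserialr
from sklearn.feature_selection import f_classif
from sklearn.metrics import normalized_mutual_info_score

# Toy data, invented for illustration only.
sex = np.array([0, 0, 0, 0, 1, 1, 1, 1])        # two-level categorical
survived = np.array([0, 0, 0, 1, 1, 1, 1, 0])   # two-level categorical target
pclass = np.array([1, 2, 3, 1, 2, 3, 1, 2])     # three-level categorical
fare = np.array([7.2, 8.1, 7.9, 30.0, 71.3, 53.1, 26.0, 8.5])  # numerical

# 1. Categorical vs categorical: normalized mutual information, in [0, 1].
nmi = normalized_mutual_info_score(sex, survived)

# 2a. Two-level categorical vs numerical: point-biserial correlation.
r, p_value = pointbiserialr(survived, fare)

# 2b. Multi-level categorical vs numerical: ANOVA F-value.
#     f_classif expects the numerical column as X and the categories as y.
f_values, p_values = f_classif(fare.reshape(-1, 1), pclass)

print(nmi, r, f_values[0])
```

What counts as a "significant" value is still a judgment call; the sketch only shows how the three numbers are obtained.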

## Impute Missing Values

### Embarked

I filled the missing values of Embarked with the mode of the Embarked data, because I didn't find any pattern between Embarked
and any other variable.

### Fare

To fill the missing value of Fare, I tried something similar to nearest neighbours: I look for the rows whose
selected features match the selected features of the row with the missing fare. To select those features, I went through every other
feature, compared the measures from the methods above, and cross-checked them against the scatter plot of that feature with Fare.

These are the features that correspond most strongly to Fare: `"Pclass", "SibSp", "Parch", "Embarked", "Ticket_Binned"`  
For `Ticket_Binned`, I first extracted the number from the ticket, then binned the numbers so that a new bin starts whenever the number of digits or the first digit changes.
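
The binning rule can be sketched in plain Python. The helper name `bin_ticket_numbers` and the ticket numbers are made up for illustration; the label of each bin is assumed to be its smallest ticket number, which is what the notebook's own binning function does further down.

```python
def bin_ticket_numbers(numbers):
    """Map each ticket number to the smallest number that shares
    its digit count and its first digit."""
    smallest = {}   # (digit count, first digit) -> smallest number seen
    label_of = {}   # ticket number -> bin label
    for num in sorted(set(numbers)):
        key = (len(str(num)), str(num)[0])
        # Numbers arrive in ascending order, so the first one seen
        # for a key is the smallest member of that bin.
        smallest.setdefault(key, num)
        label_of[num] = smallest[key]
    return label_of


# Toy ticket numbers, invented for illustration.
labels = bin_ticket_numbers([3085, 3701, 4133, 12460, 19950, 2003])
print(labels[3701])   # 3085: four digits, first digit 3
print(labels[19950])  # 12460: five digits, first digit 1
```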

Finally, I filled the missing value of Fare like this:

Features of the row with the missing value:

```
PassengerId                    1044
Pclass                            3
Name             Storey, Mr. Thomas
Sex                            male
Age                            60.5
SibSp                             0
Parch                             0
Ticket                         3701
Fare                            NaN
Cabin                           NaN
Embarked                          S
isFemale                          0
Ticket_Num                     3701
Ticket_Binned                  3085
```

Nearest Neighbour Mean:

```python
res = test_dataset[
    (test_dataset["Pclass"] == 3)
    & (test_dataset["SibSp"] == 0)
    & (test_dataset["Parch"] == 0)
    & (test_dataset["Embarked"] == "S")
    & (test_dataset["Ticket_Num"] < 4000)
    & (test_dataset["Ticket_Num"] > 3000)
]

res_non_nan = res[~res["Fare"].isna()]["Fare"]
test_dataset.loc[152, "Fare"] = res_non_nan.mean()
```

The result is the same as the one given by sklearn's multiple imputation.

Rather than the above, I am using a succinct version that does the same thing in one line:

```python
fare_list = ["Pclass", "SibSp", "Parch", "Embarked", "Ticket_Binned"]
test_dataset["Fare"] = test_dataset["Fare"].fillna(test_dataset.groupby(fare_list)["Fare"].transform("mean"))
```
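
To see what the one-liner does, here is a small sketch on a made-up frame with only two grouping columns. It also shows the limitation of this approach: a row whose group has no known Fare at all stays NaN.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration.
toy = pd.DataFrame({
    "Pclass":   [3, 3, 3, 1, 1, 2],
    "Embarked": ["S", "S", "S", "C", "C", "Q"],
    "Fare":     [8.0, 10.0, np.nan, 80.0, 100.0, np.nan],
})

# Mean Fare of each (Pclass, Embarked) group, aligned row by row with `toy`.
group_mean = toy.groupby(["Pclass", "Embarked"])["Fare"].transform("mean")

# Only the missing entries are replaced; known fares are left untouched.
toy["Fare_filled"] = toy["Fare"].fillna(group_mean)
print(toy)
```

Row 2 gets the mean of its group (8.0 and 10.0), while row 5 is alone in its group and therefore remains NaN.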

### Age

As with Fare, I first selected the features in the same way and then tried the nearest neighbour mean, but found that a few entries with missing Age have no other entries with similar features. So I still had NaN at those places. In the end, I had to use `RandomForestRegressor` with the selected features to fill those missing values.

One more thing I want to mention: I thought the missing values might themselves mean something. In those rows the name and other details were filled in but Age was missing, which isn't normal. So I first created a feature that is 1 where Age is present and 0 where Age is missing, then calculated `normalized_mutual_info_score` between it and Survived. The value was very close to zero, which settled the question: the fact that Age is missing carries no information about survival.
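
The missingness check can be sketched like this. The frame is a toy example built so that the indicator is exactly independent of Survived, and the column name `Age_Known` is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import normalized_mutual_info_score

# Toy frame, invented for illustration: Age is missing in every second row,
# and survival is split evenly within both the known and the missing rows.
toy = pd.DataFrame({
    "Age":      [22.0, np.nan, 38.0, np.nan, 35.0, np.nan, 54.0, np.nan],
    "Survived": [0, 0, 1, 1, 0, 0, 1, 1],
})

# 1 where Age is present, 0 where it is missing.
toy["Age_Known"] = toy["Age"].notna().astype(int)

score = normalized_mutual_info_score(toy["Age_Known"], toy["Survived"])
print(score)
```

On the real data the score will not be exactly zero, so "very close to zero" remains a judgment call.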

### Cabin

I am not filling the missing values for this feature. At first I tried to relate it to the ticket number but didn't find any relation. The Cabin feature has values such as "A25 B35", which made it difficult for me to represent the cabin in some other way: I needed to show both deck A and deck B, but couldn't work out how. Lastly, there are a lot of missing values, so I gave up on this feature.

As above, I also checked whether a missing Cabin value carries any significance and got an MI score of 0.068. The score indicates a very weak relation with Survived. The score is higher than that of the Embarked feature, but Embarked combined with Age gave a lot of information. I will explain that later.

In [3]:
def ticket_stuff(dataset):
    dataset["Ticket_Num"] = dataset["Ticket"].apply(lambda x: 0 if x == "LINE" else int(x.split(" ")[-1]))

    # binning the ticket numbers
    bins = []
    first_digit, num_digit = 0, 0
    numbers_ticket = sorted(pd.unique(dataset["Ticket_Num"]))

    for num in numbers_ticket:
        num_str = str(num)
        if (len(num_str) != num_digit) or (num_str[0] != first_digit):
            bins.append([])
            first_digit = num_str[0]
            num_digit = len(num_str)
        bins[-1].append(num)

    def cate_bins(x, bins):
        res = 0
        for bin in bins:
            if x in bin:
                res = bin[0]
                break
        return res

    dataset["Ticket_Binned"] = dataset["Ticket_Num"].apply(lambda x: cate_bins(x, bins))
    return


def fill_missing_values(dataset):
    # Imputing Embarked
    dataset["Embarked"] = dataset["Embarked"].fillna(dataset["Embarked"].mode().iloc[0])

    ticket_stuff(dataset)

    # Imputing Fare
    fare_list = ["Pclass", "SibSp", "Parch", "Embarked", "Ticket_Binned"]
    dataset["Fare"] = dataset["Fare"].fillna(dataset.groupby(fare_list)["Fare"].transform("mean"))

    # Imputing Age
    age_feature_list = ["Pclass", "SibSp", "Parch", "Ticket_Num", "Fare"]
    non_nan = dataset[~dataset["Age"].isna()]
    X_train = non_nan[age_feature_list]
    y_train = non_nan["Age"]
    model = RandomForestRegressor(n_estimators=100, max_depth=5)
    model.fit(X_train, y_train)
    X_test = dataset[dataset["Age"].isna()][age_feature_list]
    predictions = model.predict(X_test)
    dataset.loc[(dataset["Age"].isna()), "Age"] = predictions

    return

# Feature Selection for Model

The features I am going to use for the model are:

```python
features = [
    "FamilySize",
    "Pclass",
    "isFemale",
    "Title_Targeted",
    "Age",
    "Ticket_Item_Targeted",
    "Embarked_Targeted",
]
```

These were filtered in two stages. Stage 1 uses the measures above and keeps the features that show enough of a pattern with Survived.
Stage 2 checks the effect on the CV score of including or excluding the feature. The CV score is calculated with a base `RandomForestClassifier`, meaning one with default parameters.

For example:

```python
from sklearn.model_selection import cross_val_score

another_features = [
    "Pclass",
    "isFemale",
    "Title_Coded",
    "Embarked_Coded",
    "Fare",
    "isAlone",
    "FamilySize",
    "Ticket_Num"
]

another_X_dataset = dataset[another_features]
another_Y_dataset = dataset["Survived"]

another_model = RandomForestClassifier(random_state=42)

cv_results = cross_val_score(another_model, another_X_dataset, another_Y_dataset, n_jobs=-1)
```
The `cv_results` are then checked again with `Ticket_Num` removed from the feature list. If the CV result decreases without the feature, the feature is selected for the model.
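
The with/without comparison can be wrapped in a small helper. This is a sketch on synthetic data: the helper name `cv_mean`, the `Noise` column and the toy target are all made up for illustration, and the default 5-fold CV of `cross_val_score` is assumed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def cv_mean(frame, feature_names, target="Survived"):
    """Mean CV accuracy of a base random forest on the given features."""
    model = RandomForestClassifier(random_state=42)
    return cross_val_score(model, frame[feature_names], frame[target]).mean()


# Synthetic data: the target is fully determined by `isFemale`,
# while `Noise` carries no information.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "isFemale": rng.integers(0, 2, size=200),
    "Noise": rng.normal(size=200),
})
toy["Survived"] = toy["isFemale"]

with_feature = cv_mean(toy, ["isFemale", "Noise"])
without_feature = cv_mean(toy, ["Noise"])

# A positive difference means the score drops when the feature is removed.
print(with_feature - without_feature)
```

On real data the differences are much smaller than in this toy case, so small changes in the score should be read with some caution.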

I know this method only looks at one feature at a time, but some features combine with other features to provide more information, like Age and Pclass as shown in this [article](https://towardsdatascience.com/predicting-the-survival-of-titanic-passengers-30870ccc7e8).

So I apply the same method to both features at the same time to get the score. Now, coming to the selected features:

`FamilySize` is derived by adding the values of `Parch` and `SibSp`. It contributed a lot to the decision trees and also gave a higher mutual info score than either of them individually. I got the idea for this feature from the article above.

`isFemale` is just the integer-coded `Sex` feature of the dataset. `isFemale` and `Pclass` had the highest MI (mutual info) scores among the categorical variables and also improved the CV score, so they were selected.

The `Age` feature also contributed to the decision trees, and I selected it based on the measures above and the improvement in CV score. This feature also combines with other features to give a lot of information, as shown in the article above.

Before moving on: as you can see, the rest of the features have names ending in Targeted. This is because these features are categorical variables, and I used target encoding to make them usable by the random forest classifier, since sklearn's random forest only accepts numerical variables. I could have used simple integer encoding, but it worsened the score, which makes sense because it destroys the information that the variables are categorical.
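
In its plainest form, target encoding replaces each category with the mean of the target within that category. The sketch below shows this on a made-up frame using pandas only. Note that `TargetEncoder` from `category_encoders` additionally smooths each category mean towards the overall mean, so its output will generally differ from these raw means.

```python
import pandas as pd

# Toy frame, invented for illustration.
toy = pd.DataFrame({
    "Title":    ["Mr", "Mr", "Mr", "Mrs", "Mrs", "Miss"],
    "Survived": [0, 0, 1, 1, 1, 0],
})

# Mean survival per title: this is the unsmoothed target encoding.
title_means = toy.groupby("Title")["Survived"].mean()

# Replace every title with the mean survival of its category.
toy["Title_Targeted"] = toy["Title"].map(title_means)
print(toy)
```

Here `Mr` becomes 1/3, `Mrs` becomes 1.0 and `Miss` becomes 0.0. The single `Miss` row shows why smoothing is useful: a raw mean based on one passenger is not reliable.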

For example:

Suppose I have a categorical variable with 4 levels (meaning 4 different values), Red: 0, Green: 1, Yellow: 2, Blue: 3.  
If a decision tree splits it at 1.5, then one side has the colours Red and Green and the other side has Yellow and Blue.

This treats the categorical variable as purely numerical, whereas the split should have been something like `value == Green`: one branch holds Green and the other branch holds the rest of the colours. It is like using True or False as the branch condition, as shown in this [video](https://youtu.be/_L39rN6gz7Y?feature=shared).

This also means that integer encoding makes little difference for a categorical variable with 3 or fewer levels. For example, take a categorical variable with 3 levels, Red: 0, Green: 1 and Blue: 2. A split at 0.5 can be read as choosing Red, since values below 0.5 are Red and all other colours are on the other side. Likewise, a split at 1.5 can be read as choosing Blue. The middle level, Green, cannot be isolated by a single threshold and takes two successive splits.
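
The argument can be checked by listing every partition a single threshold can produce on integer codes. This is a plain-Python sketch with made-up helper names; it does not involve the real decision tree.

```python
def threshold_partitions(levels):
    """Left-hand groups reachable by one threshold split when the
    levels are integer coded 0..n-1 in the given order."""
    return [set(levels[:cut]) for cut in range(1, len(levels))]


def is_one_vs_rest(left, levels):
    """True if the split isolates exactly one level on either side."""
    return len(left) == 1 or len(left) == len(levels) - 1


three = ["Red", "Green", "Blue"]
four = ["Red", "Green", "Yellow", "Blue"]

for levels in (three, four):
    for left in threshold_partitions(levels):
        right = set(levels) - left
        print(sorted(left), "|", sorted(right),
              "one-vs-rest:", is_one_vs_rest(left, levels))
```

With 3 levels every reachable split isolates a single colour (Red or Blue), although the middle colour Green is never isolated by one split. With 4 levels the split at 1.5 groups Red and Green against Yellow and Blue, which is not a one-vs-rest split.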

Thus Embarked could also be integer encoded, but for the sake of consistency I chose target encoding for every categorical variable with more than 2 levels.

Coming back to the selected features:

`Title_Targeted` is just the title that I extracted from the `Name` feature, encoded with target encoding. It contributed a lot to the decision trees, as can be seen from `feature_importances_`, and it also passed the two criteria above.

I got the `Ticket_Item_Targeted` feature from this [kaggle notebook](https://www.kaggle.com/code/gusthema/titanic-competition-w-tensorflow-decision-forests). It is just the alphabetic prefix before the number in the `Ticket` feature. I used it, saw that it improved the score, and so selected it.

`Embarked_Targeted` by itself didn't pass the first criterion, as its MI score was very low. But using it seems to improve the model score. As also shown in the article, it combines with the other feature `isFemale` and improves the score.

In [4]:
def preprocess(dataset):
    # first filling the missing values
    fill_missing_values(dataset)

    # creating new features
    dataset["isFemale"] = (dataset["Sex"] == "female") * 1
    dataset["Title"] = dataset["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
    # replacing same meaning title
    dataset["Title"] = dataset["Title"].replace('Mlle', 'Miss')
    dataset["Title"] = dataset["Title"].replace('Mme', 'Mrs')
    dataset["FamilySize"] = dataset["SibSp"] + dataset["Parch"]

    def item_ticket(x):
        val = x.split(" ")
        if len(val) > 1:
            return val[0]
        return np.nan

    dataset["Ticket_Item"] = dataset["Ticket"].apply(item_ticket)
    dataset["Ticket_Item"] = dataset["Ticket_Item"].fillna("None")

    return

In [5]:
preprocess(train_data)
preprocess(test_data)

In [6]:
# Now encoding those categorical variables

encoding_features = ["Title", "Ticket_Item", "Embarked"]

te_model = TargetEncoder(cols=encoding_features)
train_transformed = te_model.fit_transform(train_data[encoding_features], train_data["Survived"])

train_data["Title_Targeted"] = train_transformed["Title"]
train_data["Ticket_Item_Targeted"] = train_transformed["Ticket_Item"]
train_data["Embarked_Targeted"] = train_transformed["Embarked"]

# encoding those of test data
test_transformed = te_model.transform(test_data[encoding_features])

test_data["Title_Targeted"] = test_transformed["Title"]
test_data["Ticket_Item_Targeted"] = test_transformed["Ticket_Item"]
test_data["Embarked_Targeted"] = test_transformed["Embarked"]

In [7]:
# now training the model

features = [
    "FamilySize",
    "Pclass",
    "isFemale",
    "Title_Targeted",
    "Age",
    "Ticket_Item_Targeted",
    "Embarked_Targeted",
]

X_train = train_data[features]
y_train = train_data["Survived"]

test_input = test_data[features]

Now I can just train the model like this:

```python
model = RandomForestClassifier(
    random_state=42,
    n_estimators=200,
    min_samples_split=10,
    max_features='sqrt',
    max_depth=5,
)

model.fit(X_train, y_train)
```

But here I chose those parameters out of nowhere, so I don't think they are optimal for our random forest. To solve this, why not iterate over all possible combinations of parameters and choose the one that gives the best CV score? `GridSearchCV` does exactly that.
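
To get a feel for how much work "all possible combinations" means, the combinations can be counted with `itertools.product`. The grid values below are copied from the next cell, and 3 folds are assumed as in that cell's `cv=3`.

```python
from itertools import product

# Same grid as in the GridSearchCV cell of this notebook.
param_grid = {
    "n_estimators": [100, 200, 300, 400],
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Every combination takes one value from each list.
combinations = list(product(*param_grid.values()))
n_folds = 3

print(len(combinations))            # 288 candidates
print(len(combinations) * n_folds)  # 864 fits
```

These numbers match the "Fitting 3 folds for each of 288 candidates, totalling 864 fits" message printed by the grid search.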

In [8]:
param_grid = {
    'n_estimators': [100, 200, 300, 400],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

rf = RandomForestClassifier(random_state=42)

random_search = GridSearchCV(rf, param_grid, cv=3, verbose=1, n_jobs=-1)

random_search.fit(X_train, y_train)

Fitting 3 folds for each of 288 candidates, totalling 864 fits


In [9]:
# This will give us best params

random_search.best_params_
# and we can also see the cv score `random_search.best_score_`

{'max_depth': 5,
 'max_features': 'sqrt',
 'min_samples_leaf': 4,
 'min_samples_split': 2,
 'n_estimators': 100}

With the parameters above we could retrain the model, but there is no need: the `random_search` object already holds the model trained with them, so we can just use that.

A note to myself: `GridSearchCV` does nothing more than check the CV score for every possible combination of parameters and then train the model with the best parameters. Previously I was confused about why training separately with the best parameters gave a different CV score than the best estimator held by `random_search`. But I was wrong: I hadn't set random state 42 when training the model separately and thus got a slightly different result.
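
The point about the random state can be reproduced on synthetic data. This is a sketch, not the notebook's actual run: the data comes from `make_classification`, the grid is deliberately tiny, and the manual check assumes the same `cv=3` as the search.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic data, invented for illustration.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"max_depth": [3, 5], "n_estimators": [20, 40]},
    cv=3,
)
search.fit(X, y)

# Rebuild the winning model by hand with the SAME random_state and the SAME cv.
same_seed = RandomForestClassifier(random_state=42, **search.best_params_)
manual_score = cross_val_score(same_seed, X, y, cv=3).mean()

print(manual_score, search.best_score_)
print(np.isclose(manual_score, search.best_score_))
```

If `random_state` is left out of the manual model, or a different number of folds is used, the two scores will usually differ slightly.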

In [10]:
model = random_search.best_estimator_

# predictions from model
predictions = model.predict(test_input)
print(predictions)

[0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1 0 1 0 0 0 0 0 1 0 0 0 1
 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 1 0
 1 0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 0 0 1 0 0 1 0 0 1 1 0 0 1 1 1 0 0 1 0 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 1 0 1
 0 1 0 0 0 0 0 0 0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0
 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0
 1 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
 1 0 0 0 0 0 1 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0
 0 1 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0 0 0
 0 1 0 1 1 1 0 1 0 0 1]


In [11]:
# let's check the feature importance

feature_importance = pd.DataFrame({
    "Feature": features,
    "Importance": model.feature_importances_,
}).sort_values(by="Importance", ascending=False)

print(feature_importance)

                Feature  Importance
3        Title_Targeted    0.378384
2              isFemale    0.237516
1                Pclass    0.152615
0            FamilySize    0.080961
4                   Age    0.067910
5  Ticket_Item_Targeted    0.056936
6     Embarked_Targeted    0.025678
