# Afeka - ML3 - Titanic

Noam Levi  
205530611  
[Kaggle Profile](https://www.kaggle.com/noamlevi)

## Introduction

We will work with the [Titanic dataset](https://www.kaggle.com/c/titanic/data) from the [Kaggle competition](https://www.kaggle.com/c/titanic).  

This notebook continues the work from `Assignment1`.

---

Roadmap:
- [Data Exploration](#Data-Exploration) and data visualization - *from `Assignment1`*
- [Data Cleaning](#Data-Cleaning), handling missing data in our df using different methods - *from `Assignment1`*
- [Feature Engineering](#Feature-Engineering), creating and choosing the right features for a better ML model - *from `Assignment1`*
- [Training & Model Comparing](#Training-&-Model-Comparing), fitting a linear model and plotting the loss for different hyperparameters
- [Testing](#Testing)
- [Summary](#Summary)
- [References](#References)

## Imports

In [None]:
from IPython.display import display, Markdown
import random
import math
import pandas as pd
# from pandas_profiling import ProfileReport
import sweetviz as sv
from tqdm import tqdm
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sns
from scipy import stats
from sklearn import metrics
from sklearn.metrics import make_scorer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer #, StandardScaler, Normalizer, LabelEncoder, OneHotEncoder
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeavePOut, KFold, GridSearchCV, cross_validate #, train_test_split
# from sklearn.feature_selection import RFE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB as GaussianNBC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, BaggingClassifier

sns.set_style("darkgrid")
# plt.style.use("fivethirtyeight")

We join `train` and `test` into a single `data` df, so that cleaning and feature engineering are applied to both sets at once.

In [None]:
index_col = 'PassengerId'
train = pd.read_csv('../data/train.csv', index_col=index_col)
test = pd.read_csv('../data/test.csv', index_col=index_col)

ntrain = train.shape[0]
ntest = test.shape[0]
data = pd.concat([train, test]).reset_index(drop=True)
# data

# train = pd.get_dummies(data)[:ntrain]
# test = pd.get_dummies(data)[:ntest]
# test.index = itest
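
The join-then-split round trip can be sketched on toy data. This is a minimal illustration only: `train_toy`, `test_toy` and their values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature versions of train/test, indexed by PassengerId.
train_toy = pd.DataFrame(
    {'Survived': [0, 1], 'Age': [22.0, 38.0]},
    index=pd.Index([1, 2], name='PassengerId'),
)
test_toy = pd.DataFrame(
    {'Age': [34.5]},
    index=pd.Index([892], name='PassengerId'),
)

n_train = len(train_toy)

# Stack both frames; the test rows get NaN in the target column.
combined = pd.concat([train_toy, test_toy]).reset_index(drop=True)

# Split back by position and restore the original test index.
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:].drop(columns=['Survived'])
test_part.index = test_toy.index

print(combined)
print(test_part)
```

Because the index is reset after the join, the split has to be positional, which is why the row count of the training set is stored before concatenating.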

In [None]:
# global pipeline
pipeline = Pipeline([(
    'none',
    FunctionTransformer(func=None)
)])
# use pipeline.steps.append() for adding steps

In [None]:
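# funcs = [(func1, name1), func2, func3, ...]

def appendToPipeline(pipeline, funcs):
    # names of the steps that are already in the pipeline
    names = [n for (n, _) in pipeline.steps]

    for item in funcs:
        # each item is either a (func, name) pair or a bare function
        if isinstance(item, (tuple, list)):
            func, name = item[0], item[1]
        else:
            func, name = item, None

        # fall back to the function's own name
        if name is None:
            name = func.__name__

        # skip steps that were already added (e.g. when a cell is re-run)
        if name not in names:
            pipeline.steps.append((
                name,
                FunctionTransformer(func)
            ))
            names.append(name)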
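
How function-based steps chain inside a `Pipeline` can be sketched as follows. This is a minimal example on a toy frame; `add_one`, `double` and `toy_pipeline` are hypothetical names used only for illustration.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def add_one(df):
    # hypothetical step: work on a copy so the input stays untouched
    df = df.copy()
    df['x'] = df['x'] + 1
    return df

def double(df):
    # hypothetical step: runs on the output of the previous step
    df = df.copy()
    df['x'] = df['x'] * 2
    return df

# Start with one step, then append another one later.
toy_pipeline = Pipeline([('add_one', FunctionTransformer(add_one))])
toy_pipeline.steps.append(('double', FunctionTransformer(double)))

toy = pd.DataFrame({'x': [1, 2, 3]})
result = toy_pipeline.fit_transform(toy)

print(result['x'].tolist())
```

The steps run in the order they were appended, so the order of the functions passed to the pipeline matters whenever one step depends on a column created by another.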

## Data Exploration

To start things off, we can use the ~~pandas_profiling~~ sweetviz library to get an overview of the entire training dataset.

In [None]:
# data.head()

In [None]:
report = sv.analyze(train, target_feat='Survived')
report.show_notebook()

In [None]:
sns.set_style('darkgrid')

display(data.head())

Let's take a look at the target value's distribution.

In [None]:
survived_per_sex = pd.DataFrame()
survived_per_sex['Sex'] = ['female','male']*2
survived_per_sex['Survived'] = ['Survived']*2 + ['Died']*2
survived_per_sex['Amount'] = (
    list(train[train['Survived']==1].groupby('Sex').agg('count')['Survived']) +
    list(train[train['Survived']==0].groupby('Sex').agg('count')['Survived'])
)

# survived_per_sex
plt.subplots(figsize=(10,6))
ax = sns.barplot(x='Sex', y='Amount', hue='Survived', order=['male', 'female'], data=survived_per_sex)
ax.set_title('Survivors by Sex', fontsize=16)

For the complete data exploration, see `Assignment1`'s notebook.

## Data Cleaning

We will copy the cleaning step from `Assignment1`'s notebook.

In [None]:
# def NAs(data):
data_na = (data.drop(columns=['Survived']).isnull().sum() / len(data)) * 100
# data_na = (data_dummy.drop(columns=['Survived']).isnull().sum() / len(data)) * 100
data_na = data_na.drop(data_na[data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing %': data_na})
missing_data.head()
# return missing_data

# NAs(train).head()

In [None]:
def handle_cabin(df):
    df = df.copy()
    # mark unknown cabins with a placeholder letter
    df['Cabin'] = df['Cabin'].fillna('X')
    return df

def fix_na(df):
    df = df.copy()
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    df['Fare'] = df['Fare'].fillna(df['Fare'].median()) # the test data has missing values.
    return df

appendToPipeline(pipeline, [handle_cabin, fix_na])
data_filled = pipeline.fit_transform(data)
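
The imputation rules used above (median for numeric columns, mode for `Embarked`, a placeholder for `Cabin`) can be checked on a toy frame. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with the same kinds of gaps as the real data.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 30.0, 40.0],
    'Embarked': ['S', 'C', None, 'S'],
    'Cabin': ['C85', None, None, 'E46'],
})

filled = toy.copy()
# numeric column: fill with the median of the observed values
filled['Age'] = filled['Age'].fillna(filled['Age'].median())
# categorical column: fill with the most frequent value
filled['Embarked'] = filled['Embarked'].fillna(filled['Embarked'].mode()[0])
# cabin: fill with a placeholder so "unknown" becomes its own category
filled['Cabin'] = filled['Cabin'].fillna('X')

print(filled)
```

Assigning the result of `fillna` back to the column avoids relying on `inplace=True` on a column selection, which recent pandas versions warn about.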

In [None]:
data_filled.head()

In [None]:
# data_filled.info()

## Feature Engineering

We will copy the feature engineering step from `Assignment1`'s notebook.

In [None]:
def str_to_cat(df):
    df = df.copy()
    cols = ['Embarked','Sex', 'Pclass']
    for col in cols:
        df[col] = df[col].astype('category').cat.codes.astype('category')
    return df

def bin_cols(df):
    df = df.copy()
    df['Age'] = pd.cut(df['Age'], 10).cat.codes.astype('category')
    df['Fare'] = pd.qcut(df['Fare'], 13, duplicates='drop').cat.codes.astype('category')
    return df

def add_title_col(df):
    df = df.copy()
    global titles
    df['Title'] = \
        df['Name'].str \
        .split(', ', expand=True)[1].str \
        .split(' ', expand=True)[0] \
        .apply(lambda val: val.strip('.'))
    
    df['Title'] = df['Title'].replace('Mlle', 'Miss')
    df['Title'] = df['Title'].replace('Ms', 'Miss')
    df['Title'] = df['Title'].replace('Mme', 'Mrs')
    # df['Title'] = df['Title'].replace('Mrs', 'Miss') # keep miss instead of mrs
    
    titles = list(df['Title'].value_counts().where(df['Title'].value_counts() > 10).dropna().index)
    df['Title'] = df['Title'].apply(lambda title: title if title in titles else 'Else').astype('category')
    titles = list(df['Title'].cat.categories) # keep track of all titles in the df for later use

    # Encoding `Title` as categorical int
    df['Title'] = df['Title'].astype('category').cat.codes.astype('category')
    df.drop(columns='Name', inplace=True)
    return df

def add_deck_col(df):
    df = df.copy()
    global decks
    df['Deck'] = df['Cabin'].apply( lambda val: str(val)[0].upper() )
    df.drop(columns=['Cabin'], inplace=True)
    
    # check if `decks` is already defined
    try:
        decks
    except NameError:
        # keep track of the deck letters seen in the df for later use
        decks = list(df['Deck'].astype('category').cat.categories)

    df['Deck'] = df['Deck'].astype('category').cat.codes.astype('category')
    return df

def handle_ticket(df):
    df = df.copy()
    df['Ticket_Frequency'] = df.groupby('Ticket')['Ticket'].transform('count')
    df.drop(columns=['Ticket'], inplace=True)
    return df

def add_family_col(df):
    def family_map(size):
        if size < 2:
            return 'Alone'
        elif size <= 4:
            return 'Small'
        elif size <= 6:
            return 'Medium'
        else: # size > 6
            return 'Large'

    df = df.copy()
    df['FamilySize'] = 1 + df['SibSp'] + df['Parch']
    df['FamilySize'] = df['FamilySize'].apply(family_map)
    df['FamilySize'] = df['FamilySize'].astype('category').cat.codes.astype('category')
    df.drop(columns=['SibSp', 'Parch'], inplace=True)
    return df

def get_dummies(df):
    df = df.copy()
    return pd.get_dummies(df)
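
The title extraction in `add_title_col` can be illustrated on a few names. Note that this sketch swaps the chained `split` calls for a single regular expression, and the `common` list is a hypothetical stand-in for the frequency threshold used above.

```python
import pandas as pd

# A few names in the dataset's "Surname, Title. Given names" layout.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
    'Sagesse, Mlle. Emma',
    'Uruchurtu, Don. Manuel E',
])

# Capture the text between the comma and the first period.
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

# Merge spelling variants into the common titles.
titles = titles.replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})

# Hypothetical list of frequent titles; everything else becomes 'Else'.
common = ['Mr', 'Mrs', 'Miss']
titles = titles.where(titles.isin(common), 'Else')

print(titles.tolist())
```

Grouping rare titles into a single bucket keeps the number of dummy columns small after one-hot encoding.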


In [None]:
appendToPipeline(
    pipeline,
    [
        str_to_cat,
        bin_cols,
        add_title_col,
        add_deck_col,
        handle_ticket,
        add_family_col,
        get_dummies
    ]
)

In [None]:
data_dummy = pipeline.fit_transform(data.copy())
data_dummy.head()

In [None]:
data_dummy.info()

Now we can split our `data` back into `train` and `test`.

In [None]:
# train_dummy = pd.get_dummies(data_filled)[:ntrain]
# test_dummy = pd.get_dummies(data_filled)[:ntest]
# test_dummy.index = test.index

# train_no_dummy = data_filled[:ntrain]
# test_no_dummy = data_filled[:ntest]

data_dummy = pipeline.fit_transform(data)
train_dummy = data_dummy[:ntrain]
test_dummy = data_dummy[ntrain:].drop(columns=['Survived'])
test_dummy.index = test.index

display(train_dummy.head())

In [None]:
train_dummy.info()

## Training & Model Comparing

For hyperparameter testing, we will try feature sets of different sizes.  

The number of features controls the model's complexity (it is labelled as the polynomial degree in the code and plots below), which in turn affects the model's flexibility.

In [None]:
def gen_k(start, step, length):
    lst = []
    for i in range(length):
        lst += [start + (step*i)]
    return lst
    
    # def inner(start, step, size):
    #     curr = start
    #     while size != 0:
    #         yield curr
    #         curr += step
    #         size -= 1
    # return [i for i in inner(start, step, size)]
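
For reference, the list built by `gen_k` appears to match what Python's built-in `range` produces for a non-zero step. A small sketch, where `gen_k_range` is a hypothetical helper name:

```python
def gen_k(start, step, length):
    # same logic as the helper defined above
    lst = []
    for i in range(length):
        lst += [start + (step * i)]
    return lst

def gen_k_range(start, step, length):
    # hypothetical equivalent built on range(); assumes step != 0
    return list(range(start, start + step * length, step))

k = gen_k_range(start=5, step=4, length=13)
print(k[0], k[-1], len(k))
```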

In [None]:
# k = gen_k(start=5, step=4, length=13)
# k

In [None]:
# Survived correlation matrix
train_dummy = pipeline.fit_transform(train)
corr = train_dummy.corr()
feats = []

# k = number of different feature sets to choose (polynomial degree)
k = gen_k(start=5, step=4, length=13) # => [5, 9, ..., 53]

# picking the top correlated features, not including the target
for n in k:
    # cols = np.abs(corr).nlargest(n+1, 'SalePrice')['SalePrice'].index.tolist()
    cols = corr.nlargest(n+1, 'Survived')['Survived'].index.tolist()
    cols.remove('Survived')
    feats += [cols]
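
The selection rule can be sketched on a toy frame. The helper `top_k_features` and the column values are hypothetical. The sketch also shows the absolute-value variant hinted at in the commented-out line, since ranking by signed correlation places strongly negative features last.

```python
import pandas as pd

# Hypothetical data: one feature follows the target, one mirrors it,
# and one is only weakly related to it.
toy = pd.DataFrame({
    'Survived':  [0, 1, 1, 0, 1, 0],
    'is_female': [0, 1, 1, 0, 1, 0],
    'is_male':   [1, 0, 0, 1, 0, 1],
    'noise':     [1, 1, 0, 0, 1, 0],
})

def top_k_features(df, target, k, use_abs=False):
    # correlation of every column with the target, target itself excluded
    corr = df.corr()[target].drop(target)
    if use_abs:
        # rank by strength of the relationship, ignoring its sign
        corr = corr.abs()
    return corr.nlargest(k).index.tolist()

signed = top_k_features(toy, 'Survived', 2)
absolute = top_k_features(toy, 'Survived', 2, use_abs=True)

print(signed)
print(absolute)
```

With signed ranking, the perfectly anti-correlated `is_male` column loses to the weak `noise` column; with absolute ranking it is selected.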

In [None]:
feats[4]

~~Let's try the sklearn backwards feature selection method, `RFE`.~~

The feature selection based on correlation to the target worked much better than `RFE`.

In [None]:
# # est = SGDRegressor(learning_rate='optimal')
# # est = SGDRegressor()
# est = LinearRegression(normalize=True)

# data = pipeline.fit_transform(train)
# y = data['SalePrice']
# X = data.drop(columns=['SalePrice'])
# feats = []

# # k = number of diffrent feature sets to choose (polynomial degree)
# k = gen_k(start=10, step=7, length=4) # => [10, 17, 24, 31, 38, 45, 52, 59, 66, 73]

# for n in tqdm(k):
#     rfe = RFE(
#         estimator = est,
#         n_features_to_select = n
#     )
#     rfe.fit_transform(X, y)
#     # feats += [ X.columns[rfe.support_].tolist() ]
#     feats += [ X.columns[rfe.support_ == False].tolist() ]

# print('done')

In [None]:
# feats[0]

In [None]:
# plot one of the different heatmaps we chose in the last cell
plt.figure(figsize=(24,11))
sns.set(font_scale=1.25)

# i = random.choice(range(len(feats)))

cols = ['Survived'] + feats[4]
# cm = np.corrcoef(train_dummy[cols].values.T)
hm = sns.heatmap(
    train_dummy[cols].corr(),
    cbar = True,
    annot = True,
    square = True,
    fmt = '.2f',
    cmap = 'coolwarm',
    yticklabels = cols,
    xticklabels = cols,
    # title = f'Top {len(feats)-1} Correlated Features',
    annot_kws = {'size': 11}
)

hm.set(title=f'Top {len(cols)-1} Correlated Features')
plt.show()

In [None]:
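def lrmse(y_true, y_pred, squared):
    # MSE between log-transformed values; log1p keeps zero-valued targets (Survived == 0) finite
    mse = metrics.mean_squared_error(np.log1p(y_true), np.log1p(y_pred))
    # squared=True -> log-MSE, squared=False -> log-RMSE
    return mse if squared else np.sqrt(mse)

# greater_is_better=False makes sklearn negate the loss, hence the `neg_` prefixes used later
lrmse_socrer = make_scorer(lrmse, greater_is_better=False, squared=False)
lmse_socrer = make_scorer(lrmse, greater_is_better=False, squared=True)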
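
The sign convention of `make_scorer` can be sketched in isolation. This is a minimal example on made-up numbers; `rmse` and `neg_rmse_scorer` are hypothetical names, and a plain `LinearRegression` is used only to have a fitted model at hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error

def rmse(y_true, y_pred):
    # plain root mean squared error (a loss: lower is better)
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# greater_is_better=False tells sklearn this is a loss, so the scorer negates it
neg_rmse_scorer = make_scorer(rmse, greater_is_better=False)

# Hypothetical toy regression data.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 4.0])
model = LinearRegression().fit(X, y)

score = neg_rmse_scorer(model, X, y)  # negative (or zero)
loss = -score                         # flip the sign to recover the loss

print(score, loss)
```

This is why the scores are negated again before they are averaged and plotted further down.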

In [None]:
### ESTIMATORS ###
estimators = []
# estimators += [SGDRegressor(learning_rate='optimal')]
# estimators += [SGDRegressor()]
# estimators += [LinearRegression(normalize=True)]
# estimators += [{'model':Ridge, 'kws':{'normalize':True, 'alpha':0.5}}]
estimators += [{'model':Ridge, 'kws':{'normalize':True, 'alpha':0.29}}]
estimators += [{'model':Ridge, 'kws':{'normalize':True, 'alpha':0.2}}]
# estimators += [{'model':Ridge, 'kws':{'normalize':True, 'alpha':0.05}}]
# estimators += [ElasticNet(normalize=True, alpha=0.5)]
# estimators += [Lasso(normalize=True, alpha=0.5)]
# estimators += [Lasso(normalize=True, alpha=0.2)]
### ESTIMATORS ###

### CV METHODS ###
# l1o = LeavePOut(p=1)
# fivefold = KFold(n_splits=5, shuffle=True, random_state=101)
tenfold = KFold(n_splits=10, shuffle=True, random_state=101)
### CV METHODS ###

scores = {}
# scores = dict()

for d in estimators:
    model = d['model']
    kws = d['kws']
    est = model(**kws)
    d['est'] = est

    est_name = ""
    for i in str(est.__repr__).strip("<>'").split(' ')[-2:]:
        est_name += i.strip('of ')
    scores[est_name] = []

    for cols in tqdm(feats):
        data = pipeline.fit_transform(train)
        X = data[cols]
        y = data['Survived']
        res = cross_validate(
            X = X,
            y = y,
            estimator = est,
            cv = tenfold.split(X),
            # cv = fivefold.split(X),
            # cv = l1o.split(X),
            return_train_score = True,
            return_estimator = True,
            scoring = {
                'neg_LRMSE': lrmse_socrer,
                'neg_LMSE': lmse_socrer,
                'neg_MSE': 'neg_mean_squared_error',
                # 'neg_MAE': 'neg_mean_absolute_error',
                # 'neg_MSLE': 'neg_mean_squared_log_error'
            }
        )
        res['feats'] = cols
        scores[est_name] += [res]

print('\n\ndone')

In [None]:
est_names = list(scores.keys())
print('estimators:\n', est_names)
print('\nscores:\n', list(scores[est_names[0]][0].keys()))

In [None]:
# defs for easier access to loss function
# MSE = 'neg_mean_squared_error'
# LRMSE = 'neg_log_root_mean_squared_error'
# MAE = 'neg_mean_absolute_error'
# MSLE = 'neg_mean_squared_log_error'

MSE = 'MSE'
LRMSE = 'LRMSE'
LMSE = 'LMSE'

# losses = [MSE, RMSE, MAE, MSLE]
# losses = [MSE, LRMSE, MAE, MSLE]
losses = [MSE, LRMSE, LMSE]

Now we need to choose the best of the estimators we tried.  

Let's compare them in a clean df; each score in this df is the mean of all the scores that estimator got.

In [None]:
cols = [f'test_{loss}' for loss in losses]
estimators_df = pd.DataFrame(columns=cols, index=est_names)

for est in est_names:
    s = scores[est]
    scores_df = pd.DataFrame(s)

    for loss in losses:
        estimators_df.loc[est, f'test_{loss}'] = np.mean(-scores_df[f'test_neg_{loss}'].apply(np.mean))

estimators_df

In [None]:
# we'll choose the best estimator

# loss = [MSE, 'MSE']
# loss = [LRMSE, 'LRMSE']
loss = [LMSE, 'LMSE']

best_score = np.min(estimators_df[f'test_{loss[0]}'])
best_est = estimators_df[ estimators_df[f'test_{loss[0]}']==best_score ].index[0]

print('the best estimator is:\n', best_est)

Let's choose one of the models and take a look at what we got so far.  

We can plot the predictions against the true values of the target.  
In a perfect world we would get the 45° line ($f(x)=x$ ➜ $y_{pred}=y_{true}$).

In [None]:
plt.figure(figsize=(15,10))

def plot_predictions(data, feats, model, title=None, axis=None):
    if title is None:
        title = f'deg={len(feats)}'

    # true target values vs. the model's predictions on the same rows
    df = pd.DataFrame({
        'y_true': data['Survived'],
        'y_pred': model.predict(data[feats])
    })
    g = sns.regplot(
        x = 'y_true',
        y = 'y_pred',
        data = df,
        line_kws = {'color': '#B55D60'},
        scatter_kws = {'edgecolor': 'white'},
        ax = axis
    )
    g.set(title = title)
    return g

s = random.choice(scores[best_est])
cols = s['feats']
model = random.choice(s['estimator'])
train_dummy = pipeline.fit_transform(train)

plot_predictions(train_dummy, cols, model, title=f'Estimator = {best_est}\nPoly_Degree = {len(cols)}')
plt.show()

In [None]:
fig, axs = plt.subplots(nrows=math.ceil(len(k)/3), ncols=3, sharex=True, sharey=True) ##sharex=True, sharey=True##
fig.set_size_inches(17, 22)
# plt.figure(figsize=(16,16))
fig.text(0.5, 0.04, 'y_true', ha='center')
fig.text(0.04, 0.5, 'y_pred', va='center', rotation='vertical')
axs = axs.flatten().tolist()

train_dummy = pipeline.fit_transform(train)

for s in tqdm(scores[best_est]):
    cols = s['feats']
    model = random.choice(s['estimator'])

    g = plot_predictions(train_dummy, cols, model, title=f'Estimator = {best_est}\nPoly_Degree = {len(cols)}', axis=axs.pop(0))
    g.set(xlabel='', ylabel='')

fig.tight_layout()
plt.show()

In [None]:
scores_df = pd.DataFrame(scores[best_est]).drop(columns=['fit_time', 'score_time', 'estimator'])

for loss in losses:
    scores_df[f'test_{loss}'] = -scores_df[f'test_neg_{loss}'].apply(np.mean)
    scores_df[f'train_{loss}'] = -scores_df[f'train_neg_{loss}'].apply(np.mean)
    scores_df = scores_df.drop(columns=[f'test_neg_{loss}', f'train_neg_{loss}'])

# scores_df[f'test_{RMSE}'] = np.sqrt(scores_df[f'test_{MSE}'])
# scores_df[f'train_{RMSE}'] = np.sqrt(scores_df[f'train_{MSE}'])

scores_df['deg'] = scores_df['feats'].apply(len)

scores_df

In [None]:
def plot_loss(loss, x, y, data=None):
    # fall back to the global scores_df at call time, not at definition time
    if data is None:
        data = scores_df
    sns.lineplot(
        x = x,
        y = f'{y}_{loss[0]}',
        data = data,
        # data = np.log(scores_df[['deg', f'{y}_{loss[0]}']]),
        marker = 'o'
    )

    # plt.title('Err Plot')
    plt.xlabel('Degree of Polynomial')
    plt.ylabel(f'Loss = {loss[1]}')

In [None]:
plt.figure(figsize=(15,8))
# loss = [MSE, 'MSE']
# loss = [LRMSE, 'LRMSE']
loss = [LMSE, 'LMSE']
# loss = [MAE, 'MAE']
# loss = [MSLE, 'MSLE']

plot_loss(
    loss,
    x = 'deg',
    y = 'train',
    data = scores_df
)

plot_loss(
    loss,
    x = 'deg',
    y = 'test',
    data = scores_df
)

plt.title(f'Estimator = {best_est}')
plt.legend(['train', 'test'], fontsize='large')
plt.show()

## Testing

From the graph above we can read off the number of features that gives the lowest test loss.  

Let's choose that model and try to predict the actual test data from Kaggle.

In [None]:
# loss = MSE
loss = LMSE

# get the best score's feature list
best_score = min(scores_df[f'test_{loss}'])
i = scores_df[scores_df[f'test_{loss}'] == best_score].index.tolist()[0]
cols = scores[best_est][i]['feats']

# get a clean train & test DFs
train_dummy = pipeline.fit_transform(train)
test_dummy = pipeline.fit_transform(test)

# train the model on the entire train set
# for i,est in enumerate(estimators):
for d in estimators:
    est = d['est']
    if best_est in str(est.__repr__):
        break
# est = estimators[i]
model = est.fit(train_dummy[cols], train_dummy['Survived'])

# predict the test set
y_pred = model.predict(test_dummy[cols])

pred = pd.DataFrame(
    y_pred,
    columns = ['Survived'],
    index = test.index
)

display(pred)
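
A regression-style estimator returns continuous values, while the Kaggle Titanic submission expects a 0/1 `Survived` column per `PassengerId`. One way to bridge the gap is to threshold the predictions. The sketch below assumes a cut-off of 0.5, which is an illustrative choice and not something tuned in this notebook; the prediction values and ids are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical continuous predictions and the matching passenger ids.
y_pred = np.array([0.12, 0.81, 0.50, -0.05, 1.07])
passenger_ids = pd.Index([892, 893, 894, 895, 896], name='PassengerId')

# Assumed cut-off: values at or above 0.5 count as "survived".
threshold = 0.5
submission = pd.DataFrame(
    {'Survived': (y_pred >= threshold).astype(int)},
    index=passenger_ids,
)

print(submission)
# submission.to_csv('pred.csv') would then write PassengerId and Survived columns
```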

In [None]:
# Get the fitted parameters used by the function
plt.figure(figsize=(16,8))

(mu, sigma) = ( round(item, 2) for item in stats.norm.fit(pred) )

display(Markdown(rf'$\mu$ = {mu}'))
display(Markdown(rf'$\sigma$ = {sigma}'))

# plot the distribution
sns.distplot(
    pred,
    kde_kws = {'color': '#4C4C4C'}
)
plt.legend(
    [rf'y_pred ~ $N(\mu=${mu}, $\sigma=${sigma}$)$'],
    fontsize = 'x-large'
)
# plt.ylabel('Frequency')
plt.title('Prediction Distribution')

In [None]:
# # loss = MSE
# loss = LMSE

# # get the best score's feature list
# best_score = min(scores_df[f'test_{loss}'])
# i = scores_df[scores_df[f'test_{loss}'] == best_score].index.tolist()[0]
# cols = scores[best_est][i]['feats']

# # get a clean train & test DFs
# # train_dummy = pipeline.fit_transform(train)
# test_dummy = pipeline.fit_transform(test)

# # train the model on the entire train set
# # for i,est in enumerate(estimators):
# # for est in estimators:
# #     if best_est in str(est.__repr__):
# #         break
# # est = estimators[i]
# # model = est.fit(train_dummy[cols], train_dummy['SalePrice'])

# # scores[best_est][i]['estimator']

# # predict the test set
# preds = pd.DataFrame(columns=[i for i in range(len(scores[best_est][i]['estimator']))])
# for i,model in  enumerate(scores[best_est][i]['estimator']):
#     preds[i] = model.predict(test_dummy[cols])
# preds

# y_pred = []
# for i in preds.index:
#     y_pred += [np.mean(preds.iloc[i])]

# pred = pd.DataFrame(
#     y_pred,
#     columns = ['SalePrice'],
#     index = test.index
# )

# display(pred)

In [None]:
pred.to_csv('pred.csv')

## Summary

### Screenshots

![submissions](./screenshots/submissions.png)  

![leaderboards](./screenshots/leaderboards.png)

### Conclusions

## References

1. 