In [None]:
!pip install autoviz
!pip install category_encoders

# Dataset Description

## Overview
The data has been split into two groups:
1. **Training set** (`train.csv`)
2. **Test set** (`test.csv`)

The **training set** should be used to build your machine learning models. For the training set, we provide the outcome (also known as the *ground truth*) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The **test set** should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include `gender_submission.csv`, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
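
The rule behind `gender_submission.csv` can be sketched in a few lines. This is only an illustration on a toy frame: `toy_test` and its three rows are made-up stand-ins for `test.csv`, not the real data.

```python
import pandas as pd

# Toy stand-in for test.csv (assumed values; the real file has more rows and columns).
toy_test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# Baseline rule: every female passenger survives (1), everyone else does not (0).
baseline = pd.DataFrame({
    "PassengerId": toy_test["PassengerId"],
    "Survived": (toy_test["Sex"] == "female").astype(int),
})
print(baseline)
```

A submission file needs exactly these two columns, so any model output can be checked against this shape before it is saved.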

---

## Data Dictionary

| Variable | Definition                                   | Key                                     |
|----------|----------------------------------------------|-----------------------------------------|
| survival | Survival                                     | 0 = No, 1 = Yes                         |
| pclass   | Ticket class                                 | 1 = 1st, 2 = 2nd, 3 = 3rd               |
| sex      | Sex                                          |                                         |
| age      | Age in years                                 |                                         |
| sibsp    | # of siblings/spouses aboard the Titanic     |                                         |
| parch    | # of parents/children aboard the Titanic     |                                         |
| ticket   | Ticket number                                |                                         |
| fare     | Passenger fare                               |                                         |
| cabin    | Cabin number                                 |                                         |
| embarked | Port of Embarkation                          | C = Cherbourg, Q = Queenstown, S = Southampton |

---

## Variable Notes

- **pclass**: A proxy for socio-economic status (SES)
  - 1st = Upper
  - 2nd = Middle
  - 3rd = Lower

- **age**: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

- **sibsp**: The dataset defines family relations in this way:
  - Sibling = brother, sister, stepbrother, stepsister
  - Spouse = husband, wife (mistresses and fiancés were ignored)

- **parch**: The dataset defines family relations in this way:
  - Parent = mother, father
  - Child = daughter, son, stepdaughter, stepson

  Some children traveled only with a nanny, therefore `parch=0` for them.
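
The notes above suggest a few engineered features. The sketch below is one possible reading, not something this notebook uses later: the column names `FamilySize`, `IsAlone`, and `AgeEstimated` are hypothetical, and the toy rows are invented.

```python
import pandas as pd

# Toy passengers (assumed values) with the raw columns described above.
toy = pd.DataFrame({
    "SibSp": [1, 0, 3],
    "Parch": [0, 0, 2],
    "Age": [22.0, 28.5, 0.75],
})

# FamilySize (hypothetical name): siblings/spouses + parents/children + the passenger.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# IsAlone (hypothetical name): 1 when no relatives are recorded aboard.
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

# AgeEstimated (hypothetical name): ages of at least 1 that end in .5 are estimates;
# fractional ages below 1 are real infant ages and are not flagged.
toy["AgeEstimated"] = ((toy["Age"] >= 1) & (toy["Age"] % 1 == 0.5)).astype(int)
print(toy)
```

Because children traveling with a nanny have `parch=0`, a flag like `IsAlone` may mislabel them, so it is best treated as a rough signal.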


# Business Understanding

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

# Data Understanding

In [None]:
# import libraries
import numpy as np
import pandas as pd 
pd.set_option('display.max_columns', 500)
import joblib # export model
from datetime import datetime # track processing time

import category_encoders as ce # binary encoding

# machine learning
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder
from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

#automate EDA
from autoviz.AutoViz_Class import AutoViz_Class

# plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# load the data
df_train = pd.read_csv('./input/titanic/train.csv')
df_test = pd.read_csv('./input/titanic/test.csv')
df_train.head()

In [None]:
# check tail
df_train.tail()

In [None]:
# check pclass
df_train['Pclass'].unique() 

In [None]:
# check sex
df_train['Sex'].unique() 

In [None]:
# check age
df_train['Age'].unique() 

In [None]:
# check sibsp
df_train['SibSp'].unique() 

In [None]:
# check cabin
df_train['Cabin'].unique() 

In [None]:
# check column dtypes and non-null counts
df_train.info()

In [None]:
# check missing value
df_train.isnull().sum()

In [None]:
# check missing values
df_test.isnull().sum()

In [None]:
# check uniqueness
df_train.nunique()

In [None]:
df_test.nunique()

# Data Preparation

In [None]:
# remove Cabin, Name, Ticket, PassengerId
df_train = df_train.drop(['Cabin','Name','Ticket','PassengerId'], axis=1)
df_train.head()

In [None]:
# remove Cabin, Name, Ticket, PassengerId
df_test = df_test.drop(['Cabin','Name','Ticket','PassengerId'], axis=1)
df_test.head()

In [None]:
#check null train
df_train.isnull().sum()

In [None]:
#check null test
df_test.isnull().sum()

In [None]:
# train summary
df_train.describe()

In [None]:
# test summary
df_test.describe()

In [None]:
# manual EDA -> age
plt.hist(df_train['Age'])

In [None]:
# manual EDA -> SibSp
plt.hist(df_train['SibSp'])

In [None]:
# manual EDA -> parch
plt.hist(df_train['Parch'])

In [None]:
# manual EDA -> fare
plt.scatter(df_train['Fare'], df_train['Survived'])
plt.xlabel("Fare")
plt.ylabel("Survived")
plt.show()

In [None]:
# mean imputation of Age on the train set
df_train['Age'] = df_train['Age'].fillna(df_train['Age'].mean())
df_train.info()

In [None]:
# mean imputation of Age on the test set
df_test['Age'] = df_test['Age'].fillna(df_test['Age'].mean())
df_test.info()

In [None]:
# mean imputation of Fare on the test set
df_test['Fare'] = df_test['Fare'].fillna(df_test['Fare'].mean())
df_test.info()
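
The cells above fill each split with its own mean. A common alternative is to compute the statistic on the training data only and reuse it for the test data, so both splits are treated identically. The sketch below shows that variant on invented toy frames; it is not what this notebook does, and `train_age_mean` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Toy frames (assumed values) with missing ages in both splits.
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0]})

# Compute the mean on the training split only; NaN values are skipped by default.
train_age_mean = toy_train["Age"].mean()

# Reuse the same training statistic for both splits.
toy_train["Age"] = toy_train["Age"].fillna(train_age_mean)
toy_test["Age"] = toy_test["Age"].fillna(train_age_mean)
print(toy_test)
```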

In [None]:
# manual EDA (Exploratory Data Analysis)-> age
plt.hist(df_train['Age'])

In [None]:
# manual EDA -> age (test set, after imputation)
plt.hist(df_test['Age'])

    pd.get_dummies: pandas function that turns a categorical column into multiple binary columns.

    df_test.Sex: The column of the test DataFrame to transform.

    prefix='sex': Adds the prefix "sex" to the name of each generated dummy column.
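
A minimal sketch of the `pd.get_dummies` call described above, run on an invented three-row Series rather than the real column:

```python
import pandas as pd

# Toy categorical column (assumed values).
sex = pd.Series(["male", "female", "female"], name="Sex")

# One binary column per category, each named "<prefix>_<category>".
dummies = pd.get_dummies(sex, prefix="sex")
print(dummies.columns.tolist())  # ['sex_female', 'sex_male']
```

Each row has exactly one active column, so one of the two columns is redundant for a binary category like this.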



In [None]:
# sex -> getdummy on train
sex_dummy = pd.get_dummies(df_train.Sex, prefix='sex')
df_train = pd.concat([df_train, sex_dummy], axis=1)
df_train.head()

In [None]:
# sex -> getdummy on test
sex_dummy = pd.get_dummies(df_test.Sex, prefix='sex')
df_test = pd.concat([df_test, sex_dummy], axis=1)
df_test.head()

In [None]:
# sibsp -> getdummy on train
sibsp_dummy = pd.get_dummies(df_train.SibSp, prefix='SibSp')
df_train = pd.concat([df_train, sibsp_dummy], axis=1)
df_train.head()

In [None]:
# sibsp -> getdummy on test
sibsp_dummy = pd.get_dummies(df_test.SibSp, prefix='SibSp')
df_test = pd.concat([df_test, sibsp_dummy], axis=1)
df_test.head()

In [None]:
# parch -> getdummy on train
parch_dummy = pd.get_dummies(df_train.Parch, prefix='Parch')
df_train = pd.concat([df_train, parch_dummy], axis=1)
df_train.head()

In [None]:
# parch -> getdummy on test
parch_dummy = pd.get_dummies(df_test.Parch, prefix='Parch')
df_test = pd.concat([df_test, parch_dummy], axis=1)
df_test.head()

In [None]:
# embarked -> getdummy on train
emb_dummy = pd.get_dummies(df_train.Embarked, prefix='Embarked')
df_train = pd.concat([df_train, emb_dummy], axis=1)
df_train.head()

In [None]:
# embarked -> getdummy on test
emb_dummy = pd.get_dummies(df_test.Embarked, prefix='Embarked')
df_test = pd.concat([df_test, emb_dummy], axis=1)
df_test.head()

In [None]:
# drop redundant data on train
df_train.drop(['Sex','SibSp','Parch','Embarked'], axis=1, inplace=True)
df_train.head()

In [None]:
# drop redundant data on test
df_test.drop(['Sex','SibSp','Parch','Embarked'], axis=1, inplace=True)
df_test.head()

In [None]:
# check train
df_train.info()

In [None]:
# check test
df_test.info()

In [None]:
# check train
df_train.columns

In [None]:
# check column
df_test.columns

In [None]:
X_train_columns = ['Survived', 'Pclass', 'Age', 'Fare', 'sex_female', 'sex_male',
       'SibSp_0', 'SibSp_1', 'SibSp_2', 'SibSp_3', 'SibSp_4', 'SibSp_5',
       'SibSp_8', 'Parch_0', 'Parch_1', 'Parch_2', 'Parch_3', 'Parch_4',
       'Parch_5', 'Parch_6', 'Embarked_C', 'Embarked_Q', 'Embarked_S']

In [None]:
X_test_columns = ['Pclass', 'Age', 'Fare', 'sex_female', 'sex_male', 'SibSp_0', 'SibSp_1',
       'SibSp_2', 'SibSp_3', 'SibSp_4', 'SibSp_5', 'SibSp_8', 'Parch_0',
       'Parch_1', 'Parch_2', 'Parch_3', 'Parch_4', 'Parch_5', 'Parch_6',
       'Parch_9', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
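
The two column lists above differ: the test set has a `Parch_9` column that the training set lacks, because dummy columns are created only for values that actually occur. One way to guard against such mismatches is to reindex the test dummies to the training columns. The sketch below uses invented toy frames and is not part of the original pipeline.

```python
import pandas as pd

# Toy frames (assumed values): the test split contains a category unseen in training.
toy_train = pd.DataFrame({"Parch": [0, 1, 2]})
toy_test = pd.DataFrame({"Parch": [0, 9]})

train_dummies = pd.get_dummies(toy_train["Parch"], prefix="Parch", dtype=int)
test_dummies = pd.get_dummies(toy_test["Parch"], prefix="Parch", dtype=int)

# Align test to the training layout: categories missing from test become all-zero
# columns, and categories unseen in training (Parch_9 here) are dropped.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```

In this notebook only a subset of columns is selected later, so the mismatch is harmless here, but alignment matters whenever the full matrix is passed to a model.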

In [None]:
# TRAIN
# Min Max scaler -> X
X_train = df_train
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_train = pd.DataFrame(X_train)

X_train.columns = X_train_columns
X_train.head()

In [None]:
# TEST
# Min Max scaler -> X
X_test = df_test
scaler = MinMaxScaler()
X_test = scaler.fit_transform(X_test)
X_test = pd.DataFrame(X_test)

X_test.columns = X_test_columns
X_test.head()
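
The cells above fit a separate `MinMaxScaler` on each split, so the same raw value can map to different scaled values in train and test. A commonly used alternative, sketched here on invented toy data, fits the scaler on the training split and only calls `transform` on the test split.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frames (assumed values) with the same columns in both splits.
toy_train = pd.DataFrame({"Age": [0.0, 40.0, 80.0], "Fare": [0.0, 50.0, 100.0]})
toy_test = pd.DataFrame({"Age": [20.0, 60.0], "Fare": [25.0, 200.0]})

scaler = MinMaxScaler()

# Learn min and max from the training split only.
train_scaled = pd.DataFrame(scaler.fit_transform(toy_train), columns=toy_train.columns)

# Apply the training min and max to the test split.
test_scaled = pd.DataFrame(scaler.transform(toy_test), columns=toy_test.columns)
print(test_scaled)
```

With this approach test values can fall outside the 0 to 1 range when they exceed the training extremes, which is expected behavior rather than an error.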

In [None]:
# checkpoint: save the preprocessed train and test data
X_train.to_csv('df_pos_train.csv', encoding='utf-8', index=False)
X_test.to_csv('df_pos_test.csv', encoding='utf-8', index=False)

# Exploratory Data Analysis

In [None]:
# load the preprocessed training data
df_pos_train = pd.read_csv('df_pos_train.csv')
df_pos_train.head()

In [None]:
# check shape
df_pos_train.shape

In [None]:
# check correlation
df_pos_train.corr()

In [None]:
# auto EDA with autoviz library
AV = AutoViz_Class()
AV.AutoViz('df_pos_train.csv', depVar='Survived')

# Feature Selection Algorithm

In [None]:
# initialize num_feats
num_feats = 5

In [None]:
# split dependent and independent variables
y = df_pos_train['Survived']
X = df_pos_train.drop('Survived', axis=1)

In [None]:
# 1. PEARSON CORRELATION (filter methods)
def cor_selector(X, y,num_feats):
    cor_list = []
    feature_name = X.columns.tolist()
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
    # replace NaN with 0
    cor_list = [0 if np.isnan(i) else i for i in cor_list]
    # feature name
    cor_feature = X.iloc[:,np.argsort(np.abs(cor_list))[-num_feats:]].columns.tolist()
    # selection mask: True if the feature is selected, False otherwise
    cor_support = [True if i in cor_feature else False for i in feature_name]
    return cor_support, cor_feature
cor_support, cor_feature = cor_selector(X, y,num_feats)
print(str(len(cor_feature)), 'selected features')

print("pearson correlation")
print(cor_feature)

In [None]:
# 2. CHI SQUARE FEATURES (filter methods)
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
X_norm = MinMaxScaler().fit_transform(X)

chi_selector = SelectKBest(chi2, k=num_feats)
chi_selector.fit(X_norm, y)
chi_support = chi_selector.get_support()
chi_feature = X.loc[:,chi_support].columns.tolist()
print(str(len(chi_feature)), 'selected features')

print("chi feature")
print(chi_feature)

In [None]:
# 3. RECURSIVE FEATURE ELIMINATION (wrapper methods)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe_selector = RFE(estimator=LogisticRegression(solver='lbfgs'), n_features_to_select=num_feats, step=10, verbose=5)
rfe_selector.fit(X_norm, y)

rfe_support = rfe_selector.get_support()
rfe_feature = X.loc[:,rfe_support].columns.tolist()
print(str(len(rfe_feature)), 'selected features')

In [None]:
# 4. LASSO: SELECT FROM MODEL (embedded methods)
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

from sklearn.preprocessing import MinMaxScaler
X_norm = MinMaxScaler().fit_transform(X)

embeded_lr_selector = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear'), max_features=num_feats)
embeded_lr_selector.fit(X_norm, y)

embeded_lr_support = embeded_lr_selector.get_support()
embeded_lr_feature = X.loc[:,embeded_lr_support].columns.tolist()
print(str(len(embeded_lr_feature)), 'selected features')

print("lasso model")
print(embeded_lr_feature)

In [None]:
# 5. TREE BASED SELECT FROM MODEL
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

embeded_rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=10, max_depth=6), max_features=num_feats)
embeded_rf_selector.fit(X, y)

embeded_rf_support = embeded_rf_selector.get_support()
embeded_rf_feature = X.loc[:,embeded_rf_support].columns.tolist()
print(str(len(embeded_rf_feature)), 'selected features')


print("random forest")
print(embeded_rf_feature)

In [None]:
# OVERALL
pd.set_option('display.max_rows', None)
feature_name = X.columns.tolist()
# put all selections together
feature_selection_df = pd.DataFrame({'Feature':feature_name, 'Pearson':cor_support, 'Chi-2':chi_support, 'RFE':rfe_support, 'LASSO':embeded_lr_support,
                                    'Random Forest':embeded_rf_support})
# count how many selectors picked each feature (boolean columns only)
feature_selection_df['Total'] = feature_selection_df[['Pearson', 'Chi-2', 'RFE', 'LASSO', 'Random Forest']].sum(axis=1)

# show the top num_feats features by number of votes
feature_selection_df = feature_selection_df.sort_values(['Total', 'Feature'], ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df) + 1)
print(feature_selection_df.head(num_feats))


# Train Test Split

In [None]:
# keep only the selected features
X_filtered = X[['sex_female', 'Pclass', 'sex_male', 'SibSp_1', 'Age']]

In [None]:
# split into train and test dataset
# train and test ratio => 80:20
X_train, X_val, y_train, y_val = train_test_split(X_filtered, y, test_size=0.2, random_state=0)

In [None]:
print('X_train shape: ', X_train.shape)
print('y_train shape: ', y_train.shape)
print('X_val shape: ', X_val.shape)
print('y_val shape: ', y_val.shape)

# Modeling

In [None]:
# XGBoost
from xgboost import XGBClassifier

xgbModel = XGBClassifier(learning_rate=0.01, max_depth=4, n_estimators=300, seed=0)

In [None]:
# train xgboost
xgbModel.fit(X_train, y_train)

y_xgb_pred_train = xgbModel.predict(X_train)
y_xgb_pred_val = xgbModel.predict(X_val)

In [None]:
# evaluate xgboost
from sklearn.metrics import accuracy_score

print("XGBoost Training accuracy: ", accuracy_score(y_train, y_xgb_pred_train))
print("XGBoost Validation accuracy: ", accuracy_score(y_val, y_xgb_pred_val))

In [None]:
# RandomForest
RFModel = RandomForestClassifier(criterion='gini',
                                           n_estimators=1750,
                                           max_depth=7,
                                           min_samples_split=6,
                                           min_samples_leaf=6,
                                           max_features='sqrt',
                                           oob_score=True,
                                           random_state=42,
                                           n_jobs=-1,
                                           verbose=1) 

In [None]:
# train random forest
RFModel.fit(X_train, y_train)


y_rf_pred_train = RFModel.predict(X_train)
y_rf_pred_val = RFModel.predict(X_val)

In [None]:
# evaluate random forest
from sklearn.metrics import accuracy_score

print("RF Training accuracy: ", accuracy_score(y_train, y_rf_pred_train))
print("RF Validation accuracy: ", accuracy_score(y_val, y_rf_pred_val))

In [None]:
# confusion matrix for the best model (RF in this case)
from sklearn.metrics import confusion_matrix

print("\nConfusion Matrix for RF model\n")
cm = confusion_matrix(y_val, y_rf_pred_val)
print(cm)
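
Accuracy alone hides how the errors are distributed. As a small worked example on invented labels (not this notebook's results), the four cells of a binary confusion matrix can be unpacked and turned into accuracy, precision, and recall:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (assumed values): 1 = survived, 0 = did not survive.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1, 0, 1, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted survivors who really survived
recall = tp / (tp + fn)     # share of real survivors the model found
print(accuracy, precision, recall)
```

In the notebook the same arithmetic would apply to `cm` computed from `y_val` and `y_rf_pred_val`.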

In [None]:
# confusion matrix visualization
ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax, fmt='g',cmap='Blues') #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')

plt.show()

In [None]:
# RF is the best model so far, so retrain it once more on the full training data
RFModel.fit(X_filtered, y)

# Evaluation

In this case, Random Forest achieves better accuracy than XGBoost, with an <b>84.26% training score</b> and an <b>81% validation score</b>.
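
A single 80:20 split can give a noisy estimate on a dataset of this size. One way to check how stable the score is would be k-fold cross-validation with `cross_val_score`, which is imported at the top of this notebook but not used. The sketch below runs on synthetic data from `make_classification` and a smaller forest than the one above, so its numbers say nothing about the Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 200 rows, 5 features, binary target.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Smaller forest than in the notebook to keep the example fast.
model = RandomForestClassifier(n_estimators=50, max_depth=7, random_state=42)

# Five folds: each fold is held out once while the model trains on the other four.
scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")
print("mean accuracy:", scores.mean(), "std:", scores.std())
```

In this notebook the equivalent call would pass `RFModel`, `X_filtered`, and `y` in place of the toy objects.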

# Submit Answer

In [None]:
# get passengerId to fit the submission format
passengerId = pd.read_csv('./input/titanic/test.csv')
passengerId = passengerId['PassengerId']
passengerId.head()

In [None]:
# load ready test data
df_submit = pd.read_csv('df_pos_test.csv')
df_submit = df_submit[['sex_female', 'Pclass', 'sex_male', 'SibSp_1', 'Age']]
df_submit.head()

In [None]:
# create prediction
# pass the DataFrame itself so the feature names match those seen during fit
preds = RFModel.predict(df_submit)
preds

In [None]:
# create df format
df = {'PassengerId': passengerId.to_numpy(), 'Survived': preds}
df_predictions = pd.DataFrame(df)
df_predictions.head(10)

In [None]:
# convert Survived from float to int
df_predictions['Survived']=df_predictions['Survived'].astype(int)
df_predictions.head()

In [None]:
# save output, then submit
df_predictions.to_csv('final_answer.csv', encoding='utf-8', index=False)