## Problem Definition

The sinking of the RMS Titanic is one of the most infamous tragedies in history. On 15 April 1912, during its maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew on board. The tragedy shocked the international community and prompted better safety regulations for ships and sea travel.

One reason the sinking claimed so many lives was that there were not enough lifeboats for the passengers and crew. Although some luck was involved in surviving the sinking, certain groups were more likely to survive than others, notably women, children, and upper-class passengers.

In this assignment, we ask you to complete an analysis of what kinds of people survived the Titanic tragedy. In particular, we ask you to apply <i>machine learning</i> to predict which passengers survived.

## Requirements

1. The teaching team has prepared data that has already been explored and cleaned, so it is ready to be fed into a model
2. The teaching team has prepared 2 models that were already <i>trained</i> with the <i>Random Forest</i> and <i>AdaBoost</i> algorithms
3. Build an FCNN (fully connected neural network) model to strengthen your analysis and predictions
4. Build an <i>Ensemble</i> model that combines the <i>Random Forest</i>, <i>AdaBoost</i>, and <i>FCNN</i> models. You may choose either <i>Majority Vote</i> or <i>Stacking</i>
5. Create a table comparing the performance of the four models you end up with

## Library Import

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split

## Import Dataset

In [2]:
df = pd.read_csv("./data/titanic.csv")
df.shape

(891, 8)

Below is a description of each column (the dataset itself capitalizes the names, e.g. Survived, Pclass):
- survival = Whether the passenger survived (0 = No, 1 = Yes)
- pclass = Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- sex = Gender
- Age = Age in years
- sibsp = Number of siblings/spouses travelling with the passenger aboard the ship
- parch = Number of parents/children travelling with the passenger aboard the ship
- fare = Fare paid by the passenger
- cabin = Cabin number (not included in the cleaned data shown below)
- embarked = Port where the passenger boarded

In [3]:
df.head()

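| | Age | Embarked | Fare | Parch | Pclass | Sex | SibSp | Survived |
|---|---|---|---|---|---|---|---|---|
| 0 | 22 | S | 7.25 | 0 | 3 | male | 1 | 0.0 |
| 1 | 38 | C | 71.2833 | 0 | 1 | female | 1 | 1.0 |
| 2 | 26 | S | 7.925 | 0 | 3 | female | 0 | 1.0 |
| 3 | 35 | S | 53.1 | 0 | 1 | female | 1 | 1.0 |
| 4 | 35 | S | 8.05 | 0 | 3 | male | 0 | 0.0 |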


## Preprocessing

In [4]:
numeric_data = df.loc[:, ["Age", "Fare", "Parch", "SibSp"]].copy()
categorical_data = df.loc[:, ["Embarked", "Pclass", "Sex"]].copy()

In [5]:
# Standard Scaling numerical data
sc = StandardScaler()
numeric_data = sc.fit_transform(numeric_data)
numeric_data.shape

(891, 4)
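
As a rough illustration of what the scaling step computes, the sketch below standardizes a handful of values by hand. It assumes only the five `Age` values visible in `df.head()`, so the numbers will not match the notebook output, which is fitted on all 891 rows.

```python
import numpy as np

# First five Age values from df.head(); illustrative only, not the full column
age = np.array([22.0, 38.0, 26.0, 35.0, 35.0])

# StandardScaler subtracts the column mean and divides by the
# population standard deviation (ddof=0)
mean = age.mean()
std = age.std(ddof=0)
age_scaled = (age - mean) / std

print(age_scaled)
print(age_scaled.mean())  # approximately 0
print(age_scaled.std())   # approximately 1
```

After scaling, each numeric column has roughly zero mean and unit variance, which tends to help gradient-based models such as the FCNN.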

In [6]:
# One-hot encode the categorical data with pd.get_dummies
embarked_dummy = pd.get_dummies(categorical_data["Embarked"], prefix="embarked")
categorical_data = categorical_data.drop(["Embarked"], axis=1)
categorical_data = pd.concat([categorical_data, embarked_dummy], axis=1)

pclass_dummy = pd.get_dummies(categorical_data["Pclass"], prefix="pclass")
categorical_data = categorical_data.drop(["Pclass"], axis=1)
categorical_data = pd.concat([categorical_data, pclass_dummy], axis=1)

sex_dummy = pd.get_dummies(categorical_data["Sex"], prefix="sex")
categorical_data = categorical_data.drop(["Sex"], axis=1)
categorical_data = pd.concat([categorical_data, sex_dummy], axis=1)

categorical_data.head()

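| | embarked_C | embarked_Q | embarked_S | pclass_1 | pclass_2 | pclass_3 | sex_female | sex_male |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 3 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |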
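
To make the encoding above concrete, here is a minimal sketch that builds the three `embarked_*` columns by hand with NumPy. It assumes the `Embarked` values of the first five rows from `df.head()` and lists the categories explicitly, since `Q` does not occur in those rows.

```python
import numpy as np

# Embarked values of the first five passengers from df.head()
embarked = np.array(["S", "C", "S", "S", "S"])

# All ports in the data, listed explicitly because Q is absent from these rows
categories = np.array(["C", "Q", "S"])

# Compare every value against every category: result has shape (5, 3)
one_hot = (embarked[:, None] == categories[None, :]).astype(int)
print(one_hot)
```

Each row contains exactly one 1, in the column of the port that passenger boarded at.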


In [7]:
# Combine the preprocessed numerical data & categorical data
df = df.loc[:, ["Age", "Fare", "Parch", "SibSp", "Survived"]].copy()
df.loc[:, "Age":"SibSp"] = numeric_data
df = pd.concat([df, categorical_data], axis=1)
df.head()

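| | Age | Fare | Parch | SibSp | Survived | embarked_C | embarked_Q | embarked_S | pclass_1 | pclass_2 | pclass_3 | sex_female | sex_male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.510769 | -0.502445 | -0.473674 | 0.432793 | 0.0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0.579769 | 0.786845 | -0.473674 | 0.432793 | 1.0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | -0.238134 | -0.488854 | -0.473674 | -0.474545 | 1.0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 3 | 0.375293 | 0.42073 | -0.473674 | 0.432793 | 1.0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 4 | 0.375293 | -0.486337 | -0.473674 | -0.474545 | 0.0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |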


In [8]:
# Train test split
X = df.drop(["Survived"], axis=1)
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=123)
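
For orientation, the arithmetic below estimates how many rows end up in each split. It assumes that scikit-learn rounds a fractional `test_size` up and assigns the remaining rows to training, which is our understanding of `train_test_split` rather than something shown in this notebook.

```python
import math

n_samples = 891   # rows reported by df.shape
test_size = 0.1   # value passed to train_test_split

# Assumption: the test portion is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Checking `X_train.shape` and `X_test.shape` in the notebook is the reliable way to confirm these counts.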

## #1 Model - Random Forest Classifier

In [9]:
filename = "./model/rf.sav"
# Use a context manager so the file handle is closed after loading
with open(filename, "rb") as f:
    rf_model = pickle.load(f)
rf_model

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
            oob_score=False, random_state=123, verbose=1, warm_start=False)

## #2 Model - AdaBoost Classifier

AdaBoost is one of many boosting algorithms.

In [10]:
filename = "./model/ad.sav"
# Use a context manager so the file handle is closed after loading
with open(filename, "rb") as f:
    ab_model = pickle.load(f)
ab_model

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.1, n_estimators=1000, random_state=123)

## #3 Model - FCNN (deep learning) (50 points)

Build an FCNN model with Keras that predicts the target (survival) for the case above. You may build the <i>neural network</i> with 2 <i>hidden layers</i>.

NB: Do not forget to <i>import</i> the <i>Keras library</i> first
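
To show the shape of such a network without prescribing a solution, here is a NumPy sketch of the forward pass of a fully connected network with two hidden layers. It is not the Keras model the task asks for. The layer widths (16 and 8), the ReLU and sigmoid activations, and the random weights are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 12            # columns of X after preprocessing
hidden1, hidden2 = 16, 8   # hypothetical widths, not prescribed by the assignment

# Random weights stand in for the values that training would learn
W1, b1 = rng.normal(size=(n_features, hidden1)) * 0.1, np.zeros(hidden1)
W2, b2 = rng.normal(size=(hidden1, hidden2)) * 0.1, np.zeros(hidden2)
W3, b3 = rng.normal(size=(hidden2, 1)) * 0.1, np.zeros(1)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    """Two hidden layers with ReLU, then one sigmoid unit giving P(survived)."""
    h1 = relu(X @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    return sigmoid(h2 @ W3 + b3)

X_demo = rng.normal(size=(5, n_features))  # five made-up passengers
proba = forward(X_demo)
print(proba.shape)
```

In Keras the same structure would typically be a `Sequential` model with two `Dense` hidden layers and a single sigmoid output unit, trained with binary cross-entropy.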

## #4 Model - Ensemble (40 points)

Build an <i>Ensemble</i> model by combining models #1, #2, and #3. You may use the <i>Majority Vote</i> method (we have provided a <i>Majority Voter</i> class) or the <i>Stacking</i> method

In [11]:
# Majority Voter Class

from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.preprocessing import LabelEncoder
from sklearn.base import clone
from sklearn.pipeline import _name_estimators
import numpy as np

try:
    # Bundled only with older scikit-learn releases
    from sklearn.externals import six
except ImportError:
    # Standalone package exposing the same iteritems helper
    import six


class MajorityVoteClassifier(BaseEstimator, 
                             ClassifierMixin):
    """ A majority vote ensemble classifier

    Parameters
    ----------
    classifiers : array-like, shape = [n_classifiers]
      Different classifiers for the ensemble

    vote : str, {'classlabel', 'probability'} (default='classlabel')
      If 'classlabel' the prediction is based on the argmax of
        class labels. Else if 'probability', the argmax of
        the sum of probabilities is used to predict the class label
        (recommended for calibrated classifiers).

    weights : array-like, shape = [n_classifiers], optional (default=None)
      If a list of `int` or `float` values are provided, the classifiers
      are weighted by importance; Uses uniform weights if `weights=None`.

    """
    def __init__(self, classifiers, vote='classlabel', weights=None):

        self.classifiers = classifiers
        self.named_classifiers = {key: value for key, value
                                  in _name_estimators(classifiers)}
        self.vote = vote
        self.weights = weights

    def fit(self, X, y):
        """ Fit classifiers.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Matrix of training samples.

        y : array-like, shape = [n_samples]
            Vector of target class labels.

        Returns
        -------
        self : object

        """
        if self.vote not in ('probability', 'classlabel'):
            raise ValueError("vote must be 'probability' or 'classlabel'"
                             "; got (vote=%r)"
                             % self.vote)

        if self.weights and len(self.weights) != len(self.classifiers):
            raise ValueError('Number of classifiers and weights must be equal'
                             '; got %d weights, %d classifiers'
                             % (len(self.weights), len(self.classifiers)))

        # Use LabelEncoder to ensure class labels start with 0, which
        # is important for np.argmax call in self.predict
        self.lablenc_ = LabelEncoder()
        self.lablenc_.fit(y)
        self.classes_ = self.lablenc_.classes_
        self.classifiers_ = []
        for clf in self.classifiers:
            fitted_clf = clone(clf).fit(X, self.lablenc_.transform(y))
            self.classifiers_.append(fitted_clf)
        return self

    def predict(self, X):
        """ Predict class labels for X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Matrix of training samples.

        Returns
        ----------
        maj_vote : array-like, shape = [n_samples]
            Predicted class labels.
            
        """
        if self.vote == 'probability':
            maj_vote = np.argmax(self.predict_proba(X), axis=1)
        else:  # 'classlabel' vote

            #  Collect results from clf.predict calls
            predictions = np.asarray([clf.predict(X)
                                      for clf in self.classifiers_]).T

            maj_vote = np.apply_along_axis(
                                      lambda x:
                                      np.argmax(np.bincount(x,
                                                weights=self.weights)),
                                      axis=1,
                                      arr=predictions)
        maj_vote = self.lablenc_.inverse_transform(maj_vote)
        return maj_vote

    def predict_proba(self, X):
        """ Predict class probabilities for X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.

        Returns
        ----------
        avg_proba : array-like, shape = [n_samples, n_classes]
            Weighted average probability for each class per sample.

        """
        probas = np.asarray([clf.predict_proba(X)
                             for clf in self.classifiers_])
        avg_proba = np.average(probas, axis=0, weights=self.weights)
        return avg_proba

    def get_params(self, deep=True):
        """ Get classifier parameter names for GridSearch"""
        if not deep:
            return super(MajorityVoteClassifier, self).get_params(deep=False)
        else:
            out = self.named_classifiers.copy()
            for name, step in six.iteritems(self.named_classifiers):
                for key, value in six.iteritems(step.get_params(deep=True)):
                    out['%s__%s' % (name, key)] = value
            return out

In [None]:
# Example usage of MajorityVoteClassifier

mjv_model = MajorityVoteClassifier(classifiers=[rf_model, ab_model, fcnn_model])

"""
    The models that make up the Majority Voter have already been trained, so they
    do not need to be trained again. Note that predict() reads attributes
    (classifiers_, lablenc_) that the class only assigns in fit().
"""

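In the class above, `predict()` uses `classifiers_` and `lablenc_`, which are created in `fit()`, and `fit()` clones and refits every classifier. For models that are already trained, one possible alternative is to vote directly on their predictions. The sketch below assumes made-up prediction arrays for six passengers and an FCNN that outputs one sigmoid probability per passenger.

```python
import numpy as np

# Hypothetical outputs for six test passengers (made-up values)
rf_pred = np.array([1, 0, 1, 0, 1, 0])
ab_pred = np.array([1, 0, 0, 0, 1, 1])
fcnn_proba = np.array([[0.9], [0.2], [0.7], [0.6], [0.4], [0.1]])

# A sigmoid output is a probability, so threshold it to obtain class labels
fcnn_pred = (fcnn_proba.ravel() >= 0.5).astype(int)

# Stack to shape (n_samples, n_models), then take the most frequent label per row
votes = np.column_stack([rf_pred, ab_pred, fcnn_pred])
majority = np.apply_along_axis(
    lambda row: np.bincount(row, minlength=2).argmax(), 1, votes
)
print(majority)
```

If a model returns its labels as floats (the `Survived` column is stored as 0.0/1.0), cast them with `.astype(int)` first, because `np.bincount` only accepts integers.
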
## Model Performance Comparison (10 points)

Include a table comparing the performance of the models (you may use the F1 score).
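
As a sketch of how such a table could be produced, the snippet below computes the F1 score for the positive class by hand and prints one row per model. The labels and predictions are made up purely to show the layout. In the notebook you would pass `y_test` and each model's predictions on `X_test`, or use `sklearn.metrics.f1_score` instead of the hypothetical helper `f1_score_binary`.

```python
import numpy as np

def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (1 = survived): 2*TP / (2*TP + FP + FN)."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up labels and predictions, only to show the table layout
y_true_demo = [1, 0, 1, 1, 0, 0]
demo_predictions = {
    "Random Forest": [1, 0, 1, 0, 1, 0],
    "AdaBoost": [1, 0, 0, 0, 1, 1],
    "FCNN": [1, 0, 1, 1, 0, 0],
    "Ensemble": [1, 0, 1, 0, 1, 0],
}

print(f"{'Model':<15}{'F1':>6}")
for name, pred in demo_predictions.items():
    print(f"{name:<15}{f1_score_binary(y_true_demo, pred):>6.3f}")
```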