<h1 style="text-align:center">My Titanic Approach (Top 5%)</h1>

<div style="text-align:center;"><img src="https://upload.wikimedia.org/wikipedia/commons/6/6e/St%C3%B6wer_Titanic.jpg" /></div>

# Overview

**Context:** 
> The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

**About the Data:**

<ul>
    <li>survival: Survival</li>
        <ul>
            <li>0 = No</li>
            <li>1 = Yes </li>
        </ul>
    <li>pclass: A proxy for socio-economic status (SES)</li>
        <ul>
            <li>1 = 1st (Upper)</li>
            <li>2 = 2nd (Middle)</li>
            <li>3 = 3rd (Lower)</li>
        </ul>
    <li>sex: Sex</li>
    <li>age: Age in years. Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5</li>
    <li>sibsp: # of siblings / spouses aboard the Titanic. The dataset defines family relations in this way:</li>
        <ul>
            <li>Sibling = brother, sister, stepbrother, stepsister</li>
            <li>Spouse = husband, wife (mistresses and fiancés were ignored)</li>
        </ul>
    <li>parch: # of parents / children aboard the Titanic. The dataset defines family relations in this way:</li>
        <ul>
            <li>Parent = mother, father</li>
            <li>Child = daughter, son, stepdaughter, stepson</li>
            <li>Some children travelled only with a nanny, therefore parch=0 for them.</li>
        </ul>
    <li>ticket: Ticket number</li>
    <li>fare: Passenger fare</li>
    <li>cabin: Cabin number</li>
    <li>embarked: Port of Embarkation</li>
        <ul>
            <li>C = Cherbourg</li>
            <li>Q = Queenstown</li>
            <li>S = Southampton</li>
        </ul>
</ul> 


# Imports

In [None]:
# Data Processing
import numpy as np 
import pandas as pd 

# Data Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style='whitegrid')

# Modeling
from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import roc_auc_score

from sklearn.model_selection import RandomizedSearchCV

# Getting the data

In [None]:
df_train = pd.read_csv("/kaggle/input/titanic/train.csv")
df_test = pd.read_csv("/kaggle/input/titanic/test.csv")

# Exploring the data

In [None]:
df_train

In [None]:
df_test

### Target Value: Survived

In [None]:
b = sns.countplot(x='Survived', data=df_train)
b.set_title("Survived Distribution");

### Pclass

In [None]:
b = sns.countplot(x='Pclass', data=df_train)
b.set_title("Pclass Distribution");

In [None]:
pd.crosstab(df_train['Survived'], df_train['Pclass']).plot(kind="bar", figsize=(10,6))

plt.title("Survived distribution for Pclass")
plt.xlabel("0 = Not Survived, 1 = Survived")
plt.ylabel("Count")
plt.legend(["Pclass 1", "Pclass 2", "Pclass 3"])
plt.xticks(rotation=0);

Here we can see that more Pclass 1 passengers survived than died, while far more Pclass 3 passengers died than survived. Pclass 2 is split relatively evenly.
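
To put numbers on this observation, one option is to compute the survival rate per class. The sketch below uses a small made-up frame (`toy`, not the real `df_train`) purely to show the pattern; the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for df_train (made-up rows, NOT the real Titanic data)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column within a group is the survival rate of that group
survival_rate = toy.groupby("Pclass")["Survived"].mean()
print(survival_rate)
```

Running the same `groupby` on `df_train` should give the rates behind the bar chart above.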

### Sex

In [None]:
b = sns.countplot(x='Sex', data=df_train)
b.set_title("Sex Distribution");

In [None]:
pd.crosstab(df_train['Survived'], df_train['Sex']).plot(kind="bar", figsize=(10,6))

plt.title("Survived distribution for Sex")
plt.xlabel("0 = Not Survived, 1 = Survived")
plt.ylabel("Count")
# crosstab sorts its columns alphabetically, so "female" comes first
plt.legend(["female", "male"])
plt.xticks(rotation=0);

We can see that the majority of female passengers survived and the majority of male passengers died.
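
As a rough check of such a statement, `pd.crosstab` can also return shares instead of counts. The sketch below again uses a made-up frame (`toy`) rather than `df_train`. It also shows that the crosstab columns come out in sorted order, which matters whenever legend labels are typed by hand.

```python
import pandas as pd

# Toy stand-in for df_train (made-up rows, NOT the real Titanic data)
toy = pd.DataFrame({
    "Sex":      ["male", "female", "male", "female", "male", "female"],
    "Survived": [0, 1, 0, 1, 1, 0],
})

# Column labels are sorted, so "female" comes before "male"
counts = pd.crosstab(toy["Survived"], toy["Sex"])
print(list(counts.columns))

# normalize="columns" divides each column by its total,
# giving the share of survivors / non-survivors within each sex
shares = pd.crosstab(toy["Survived"], toy["Sex"], normalize="columns")
print(shares)
```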

### Age

In [None]:
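# distplot is deprecated in recent seaborn versions; histplot is its replacement
b = sns.histplot(df_train['Age'], kde=True, stat="density")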
b.set_title("Age Distribution");

In [None]:
b = sns.boxplot(y = 'Age', data = df_train)
b.set_title("Age Distribution");

In [None]:
b = sns.boxplot(y='Age', x='Survived', data=df_train)
b.set_title("Age Distribution for Survived");

### SibSp

In [None]:
b = sns.countplot(x='SibSp', data=df_train)
b.set_title("SibSp Distribution");

In [None]:
# Show the contingency table itself (counts of Survived per SibSp value)
pd.crosstab(df_train['Survived'], df_train['SibSp'])

### Parch

In [None]:
df_train['Parch'].value_counts()

In [None]:
b = sns.countplot(x='Parch', data=df_train)
b.set_title("Parch Distribution");

In [None]:
# Show the contingency table itself (counts of Survived per Parch value)
pd.crosstab(df_train['Survived'], df_train['Parch'])

### Fare

In [None]:
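# distplot is deprecated in recent seaborn versions; histplot is its replacement
b = sns.histplot(df_train['Fare'], kde=True, stat="density")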
b.set_title("Fare Distribution");

In [None]:
b = sns.boxplot(y = 'Fare', data = df_train)
b.set_title("Fare Distribution");

In [None]:
b = sns.boxplot(y='Fare', x='Survived', data=df_train)
b.set_title("Fare Distribution for Survived");

### Embarked

In [None]:
df_train['Embarked'].value_counts()

In [None]:
b = sns.countplot(x='Embarked', data=df_train)
b.set_title("Embarked Distribution");

In [None]:
pd.crosstab(df_train['Survived'], df_train['Embarked']).plot(kind="bar", figsize=(10,6))

plt.title("Survived distribution for Embarked")
plt.xlabel("0 = Not Survived, 1 = Survived")
plt.ylabel("Count")
plt.legend(["C", "Q", "S"])
plt.xticks(rotation=0);

## Handling NaN values

**Where do we have NaN values?**

In [None]:
df_train.isna().sum()

In [None]:
df_test.isna().sum()

**Let's replace the NaN values in `Age` with the mean value.**

In [None]:
df_train['Age'] = df_train['Age'].fillna(df_train['Age'].mean())
df_test['Age'] = df_test['Age'].fillna(df_test['Age'].mean())
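
The cell above fills each frame with its own mean. A common alternative is to compute the statistic on the training data only and reuse it for the test data, so that both frames are imputed with the same value. This is a minimal sketch on made-up frames (`toy_train`, `toy_test`), not a change to the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for df_train / df_test (made-up values)
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0]})

# Compute the statistic once, on the training data only
# (Series.mean skips NaN by default)
train_age_mean = toy_train["Age"].mean()

# Reuse the same value for both frames
toy_train["Age"] = toy_train["Age"].fillna(train_age_mean)
toy_test["Age"] = toy_test["Age"].fillna(train_age_mean)
```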

**Let's replace the NaN values in `Cabin` with "Missing".**

In [None]:
df_train['Cabin'] = df_train['Cabin'].fillna("Missing")
df_test['Cabin'] = df_test['Cabin'].fillna("Missing")

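**Let's get rid of the rows with NaN in `Embarked` in `df_train`.**

In [None]:
# Only `Embarked` still contains NaN values at this point, so restrict the drop to it
df_train = df_train.dropna(subset=['Embarked'])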

**Let's replace the NaN value in `Fare` in `df_test` with the mean value.**

In [None]:
df_test['Fare'] = df_test['Fare'].fillna(df_test['Fare'].mean())

**Let's check if there are any NaN values left.**

In [None]:
df_train.isna().sum()

In [None]:
df_test.isna().sum()

In [None]:
df_train.shape

In [None]:
df_test.shape

**It looks like everything worked!**

# Cleaning the data

In [None]:
df_train.head()

In [None]:
df_test.head()

Let's get rid of the `Name` column for now:

In [None]:
# `columns=` already selects the axis, so `axis=1` is redundant
df_train = df_train.drop(columns=['Name'])
df_test = df_test.drop(columns=['Name'])

Let's map `Sex` to 0 for `male` and 1 for `female`:

In [None]:
sex_mapping = {
    'male': 0,
    'female': 1
}

# Plain column assignment replaces the column, so it gets an integer dtype
df_train['Sex'] = df_train['Sex'].map(sex_mapping)
df_test['Sex'] = df_test['Sex'].map(sex_mapping)

Let's get rid of `Ticket` for now:

In [None]:
df_train = df_train.drop(columns=['Ticket'])
df_test = df_test.drop(columns=['Ticket'])

Let's get rid of `Cabin` for now:

In [None]:
df_train = df_train.drop(columns=['Cabin'])
df_test = df_test.drop(columns=['Cabin'])

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
df_test['Embarked'].value_counts()

Let's use one-hot encoding for `Embarked`, since it is a nominal variable:

In [None]:
df_train = pd.get_dummies(df_train, prefix_sep="__",
                              columns=['Embarked'])
df_test = pd.get_dummies(df_test, prefix_sep="__",
                              columns=['Embarked'])
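
Encoding the two frames separately only yields matching columns if every category appears in both. That seems to be the case here, given the `value_counts()` check above, but it is not guaranteed in general. A possible safeguard is to reindex the test frame to the training columns. This is a minimal sketch on made-up frames (`toy_train`, `toy_test`), where the test frame deliberately lacks port `C`.

```python
import pandas as pd

# Toy stand-ins (made-up rows); the test frame has no "C" on purpose
toy_train = pd.DataFrame({"Fare": [7.25, 71.3, 8.05], "Embarked": ["S", "C", "Q"]})
toy_test = pd.DataFrame({"Fare": [7.9, 9.0], "Embarked": ["S", "Q"]})

train_enc = pd.get_dummies(toy_train, prefix_sep="__", columns=["Embarked"])
test_enc = pd.get_dummies(toy_test, prefix_sep="__", columns=["Embarked"])

# Give the test frame exactly the training columns, in the same order;
# dummy columns that are missing in the test frame are filled with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```

In the real notebook the training frame also holds `Survived`, so the reindex would use the feature columns only.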

Let's check if everything worked:

In [None]:
df_train.head()

In [None]:
df_test.head()

# Modeling

In [None]:
# Everything except target variable
X = df_train.drop("Survived", axis=1)

# Target variable
y = df_train['Survived'].values

In [None]:
# Random seed for reproducibility
np.random.seed(42)

# Split into train & test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2) 
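
The split above relies on the global NumPy seed. A variant that some people prefer passes `random_state` directly and stratifies on the target, so that both parts keep roughly the same share of survivors. The sketch below uses synthetic data (`X_toy`, `y_toy`) as an assumption-laden illustration and does not replace the split used here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 40% positive class
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 60 + [1] * 40)

# random_state makes the split reproducible without touching the global seed;
# stratify keeps the class ratio (approximately) equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```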

In [None]:
# Put models in a dictionary
models = {"KNN": KNeighborsClassifier(),
          "Logistic Regression": LogisticRegression(max_iter=10000), 
          "Random Forest": RandomForestClassifier(),
          "SVC" : SVC(probability=True),
          "DecisionTreeClassifier" : DecisionTreeClassifier(),
          "AdaBoostClassifier" : AdaBoostClassifier(),
          "GradientBoostingClassifier" : GradientBoostingClassifier(),
          "GaussianNB" : GaussianNB(),
          "LinearDiscriminantAnalysis" : LinearDiscriminantAnalysis(),
          "QuadraticDiscriminantAnalysis" : QuadraticDiscriminantAnalysis()}

# Create function to fit and score models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models : a dict of different Scikit-Learn machine learning models
    X_train : training data
    X_test : testing data
    y_train : labels associated with training data
    y_test : labels associated with test data
    """
    # Random seed for reproducible results
    np.random.seed(42)
    # Make a dictionary to keep model scores
    model_scores = {}
    # Loop through models
    for name, model in models.items():
        # Fit the model to the data
        model.fit(X_train, y_train)
        # Predicting target values
        y_pred = model.predict(X_test)
        # Evaluate the model and store its ROC AUC score in model_scores
        #model_scores[name] = model.score(X_test, y_test)
        model_scores[name] = roc_auc_score(y_test, y_pred)
    return model_scores

In [None]:
model_scores = fit_and_score(models=models,
                             X_train=X_train,
                             X_test=X_test,
                             y_train=y_train,
                             y_test=y_test)
model_scores

`GradientBoostingClassifier` has the best ROC AUC score on the hold-out split.
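
The scores above are computed by passing hard class predictions to `roc_auc_score`. ROC AUC is usually computed from continuous scores such as `predict_proba`, because the curve is built by sweeping a threshold over them. The sketch below compares both variants on synthetic data from `make_classification` (an assumption, not the Titanic data), with a single logistic regression as the example model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (NOT the Titanic data)
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Variant 1: hard 0/1 predictions, as in fit_and_score above
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

# Variant 2: probability of the positive class (column 1 of predict_proba)
auc_from_proba = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(auc_from_labels, auc_from_proba)
```

The two numbers generally differ, so model rankings based on one variant need not carry over to the other.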

# Predict for df_test

In [None]:
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)

In [None]:
y_pred = gbc.predict(df_test)

In [None]:
y_pred

In [None]:
sub = pd.read_csv('/kaggle/input/titanic/gender_submission.csv')
sub.head()

In [None]:
sub['Survived'] = y_pred
sub.to_csv("results_titanic.csv", index=False)
sub.head()
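
An alternative to overwriting the sample file is to build the submission directly from the `PassengerId` column, which ties each prediction to its passenger explicitly. This is a minimal sketch with made-up ids and predictions (`toy_test`, `toy_pred`, `toy_sub` are illustrative names, not objects from this notebook).

```python
import numpy as np
import pandas as pd

# Toy stand-ins for df_test and y_pred (made-up values)
toy_test = pd.DataFrame({"PassengerId": [892, 893, 894]})
toy_pred = np.array([0, 1, 0])

# One row per passenger, with the two columns the competition expects
toy_sub = pd.DataFrame({
    "PassengerId": toy_test["PassengerId"],
    "Survived": toy_pred.astype(int),
})

# Without a path, to_csv returns the CSV text instead of writing a file
csv_text = toy_sub.to_csv(index=False)
print(csv_text)
```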

**Work in Progress**