<a href="https://colab.research.google.com/github/Jarmos-san/Titanic/blob/master/TitanicCompetition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **[Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic)**
---

## **Description**

#### **The Challenge**

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

*In this challenge, we ask you to build a predictive model that answers the question: “**what sorts of people were more likely to survive?**” using passenger data (i.e. name, age, gender, socio-economic class, etc.).*

#### **What Data Will I Use in This Competition?**

In this competition, you’ll gain access to two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is titled `train.csv` and the other is titled `test.csv`.

`train.csv` contains the details of a subset of the passengers on board (891 to be exact) and, importantly, reveals whether they survived or not, also known as the “ground truth”.

The `test.csv` dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes.

Using the patterns you find in the `train.csv` data, predict whether the other 418 passengers on board (found in `test.csv`) survived.

#### **Evaluation**

1. **Goal**:

  It is your job to predict if a passenger survived the sinking of the Titanic or not.
  For each passenger in the test set, you must predict a 0 or 1 value for the `Survived` variable.
2. **Metric**:

  Your score is the percentage of passengers you correctly predict. This is known as [accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision#In_binary_classification).
3. **Submission File Format**:

  You should submit a CSV file with exactly 418 entries plus a header row. Your submission will show an error if it has extra columns (beyond `PassengerId` and `Survived`) or extra rows.

  The file should have exactly 2 columns:
  - PassengerId (sorted in any order)
  - Survived (contains your binary predictions: `1` for survived, `0` for deceased)
    
    ```
    PassengerId,Survived
    892,0
    893,1
    894,0
    Etc.
    ```

  You can download an example submission file (`gender_submission.csv`) on the Data page.
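
The metric and file format described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `accuracy` and `build_submission` are hypothetical, the toy labels are made up for the example, and the three `PassengerId` values are the ones from the sample file above. The real submission needs 418 rows, which is why `expected_rows` is a parameter here.

```python
import pandas as pd


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    matches = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def build_submission(passenger_ids, predictions, expected_rows=418):
    """Assemble the two-column submission table and check its shape."""
    submission = pd.DataFrame({
        "PassengerId": list(passenger_ids),
        "Survived": [int(p) for p in predictions],
    })
    if len(submission) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(submission)}")
    if not submission["Survived"].isin([0, 1]).all():
        raise ValueError("Survived must contain only 0 or 1")
    return submission


# Toy labels (illustrative only): three of the four predictions are correct.
toy_truth = [0, 1, 1, 0]
toy_preds = [0, 1, 0, 0]
toy_accuracy = accuracy(toy_truth, toy_preds)

# Three-row toy submission; the real file must contain 418 rows.
toy_submission = build_submission([892, 893, 894], [0, 1, 0], expected_rows=3)
toy_csv_text = toy_submission.to_csv(index=False)  # header row plus one row per passenger
print(toy_accuracy)
print(toy_csv_text)
```

Writing the table with `index=False` keeps the pandas row index out of the file, so it has exactly the two required columns.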

## **Data** 

#### **Overview**

The data has been split into two groups:
- training set (`train.csv`)
- test set (`test.csv`)

The **training set** should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The **test set** should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include `gender_submission.csv`, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
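
The "all and only female passengers survive" rule behind `gender_submission.csv` can be reproduced with a one-line comparison. The sketch below is an illustration, not the code Kaggle used: the function name `gender_baseline` is hypothetical, the `Sex` values are toy data, and the column names `PassengerId` and `Sex` are assumed to match the headers in `test.csv`.

```python
import pandas as pd

# Toy rows standing in for test.csv; the Sex values are illustrative, not real data.
toy_test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})


def gender_baseline(passengers):
    """Predict 1 (survived) for every female passenger and 0 for everyone else."""
    return pd.DataFrame({
        "PassengerId": passengers["PassengerId"],
        "Survived": (passengers["Sex"] == "female").astype(int),
    })


baseline = gender_baseline(toy_test)
print(baseline)
```

A baseline like this is a useful yardstick: a trained model is only worth submitting if it is more accurate than the gender rule.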

#### **Data Dictionary**

| Variable |          Definition          |                      Key                      |
|----------|------------------------------|-----------------------------------------------|
| survival | Survival                     | `0`=No,`1`=Yes                                |
| pclass   | Ticket Class                 | `1`=1st,`2`=2nd,`3`=3rd                       |
| sex      | Gender                       |                                               |
| age      | Age in Years                 |                                               |
| sibsp    | # of Siblings/Spouses aboard |                                               |
| parch    | # of Parents/Children aboard |                                               |
| ticket   | Ticket Number                |                                               |
| fare     | Ticket Fare                  |                                               |
| cabin    | Cabin Number                 |                                               |
| embarked | Port of Embarkation          | `C`=Cherbourg,`Q`=Queenstown,`S`=Southampton  |

#### **Variable Notes**

- `pclass`: A proxy for socio-economic status (SES)
  
  - 1st = Upper
  - 2nd = Middle
  - 3rd = Lower

- `age`: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

- `sibsp`: The dataset defines family relations in this way...
  - `Sibling` = brother, sister, stepbrother, stepsister
  - `Spouse` = husband, wife (mistresses and fiancés were ignored)

- `parch`: The dataset defines family relations in this way...
  - Parent = mother, father
  - Child = daughter, son, stepdaughter, stepson
  - Some children travelled only with a nanny, therefore parch=0 for them.
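
The notes above lend themselves to simple feature engineering. The sketch below shows two possibilities on toy data. The new column names `FamilySize` and `AgeEstimated` are hypothetical, the values are made up, and the capitalised column names (`Age`, `SibSp`, `Parch`) are assumed to follow the CSV headers rather than the lowercase names in the data dictionary.

```python
import pandas as pd

# Toy passengers (illustrative values only); None marks a missing age.
toy = pd.DataFrame({
    "Age": [0.75, 28.5, 30.0, None],
    "SibSp": [1, 0, 1, 0],
    "Parch": [2, 0, 0, 0],
})

# Family size = siblings/spouses + parents/children + the passenger themselves.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Estimated ages have the form xx.5. Ages below 1 are fractional for a
# different reason, so they are excluded; missing ages evaluate to False.
toy["AgeEstimated"] = (toy["Age"] >= 1) & (toy["Age"] % 1 == 0.5)

print(toy)
```

Because children travelling only with a nanny have `parch=0`, a `FamilySize` of 1 does not always mean the passenger was an adult travelling alone.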

## Preparing the Notebook

In [1]:
# Mount GDrive to current Colab instance
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

# Setup Kaggle creds to the Colab Instance
import os
os.environ['KAGGLE_CONFIG_DIR']='/content/gdrive/My Drive/Kaggle'

# Changing directory to download competition data to the ../Titanic/Data folder
%cd /content/gdrive/My Drive/Titanic/Data

# Downloading the required data
!kaggle competitions download -c titanic

# Changing back to working directory
%cd /content/gdrive/My Drive/Titanic

Mounted at /content/gdrive
/content/gdrive/My Drive/Titanic/Data
Downloading gender_submission.csv to /content/gdrive/My Drive/Titanic/Data
  0% 0.00/3.18k [00:00<?, ?B/s]
100% 3.18k/3.18k [00:00<00:00, 925kB/s]
Downloading test.csv to /content/gdrive/My Drive/Titanic/Data
  0% 0.00/28.0k [00:00<?, ?B/s]
100% 28.0k/28.0k [00:00<00:00, 3.75MB/s]
Downloading train.csv to /content/gdrive/My Drive/Titanic/Data
  0% 0.00/59.8k [00:00<?, ?B/s]
100% 59.8k/59.8k [00:00<00:00, 6.47MB/s]
/conten
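
The download step only works if the Kaggle CLI can find an API token. The CLI looks for a file named `kaggle.json` in the directory given by `KAGGLE_CONFIG_DIR`, so a quick check before downloading can save a confusing error. The helper name `kaggle_credentials_present` is hypothetical, and the demonstration uses a temporary directory with an empty placeholder file rather than the real Drive folder or a real token.

```python
import os
import tempfile


def kaggle_credentials_present(config_dir):
    """Return True if a kaggle.json token file exists in config_dir."""
    return os.path.isfile(os.path.join(config_dir, "kaggle.json"))


# Demonstration in a throwaway directory, with a placeholder instead of a real token.
with tempfile.TemporaryDirectory() as demo_dir:
    before = kaggle_credentials_present(demo_dir)
    with open(os.path.join(demo_dir, "kaggle.json"), "w") as token:
        token.write("{}")
    after = kaggle_credentials_present(demo_dir)

print(before, after)
```

In the notebook, the same check could be run against `os.environ['KAGGLE_CONFIG_DIR']` right after it is set.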

## Importing Necessary Libraries

- **Pandas** *v0.25.3*
- **Seaborn** *v0.10.0*
- **Missingno** *v0.4.2*
- **Numpy** *v1.17.5*

In [2]:
# Importing the required libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import missingno

# import sklearn
# from sklearn.impute import SimpleImputer

# Checking library versions
print(f'Pandas v{pd.__version__}')
print(f'Seaborn v{sns.__version__}')
print(f'Missingno v{missingno.__version__}')
print(f'Numpy v{np.__version__}')
# print(f'Sklearn v{sklearn.__version__}')

%matplotlib inline
sns.set(style='darkgrid')

Pandas v0.25.3
Seaborn v0.10.0
Missingno v0.4.2
Numpy v1.17.5


## Dataset Overview

In [9]:
!ls /content/gdrive/My\ Drive/Titanic/Data

gender_submission.csv  test.csv  train.csv


In [0]:
PATH = '/content/gdrive/My Drive/Titanic/Data/'

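def loadDataFrame(data):
    '''
    Loads the DataFrame & returns a copy of it for processing.

    Arguments:
    data = Filename of a CSV file inside PATH (e.g. 'train.csv')

    Returns:
    A copy of the DataFrame for processing.
    '''
    # Read the CSV from the data folder and hand back an independent copy
    return pd.read_csv(f'{PATH}{data}').copy()

TrainDF = loadDataFrame('train.csv')
TestDF = loadDataFrame('test.csv')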
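
Once the data is loaded, a first overview usually covers the table's shape and the number of missing values per column. The sketch below shows the idea on a tiny inline CSV so that it runs without the Drive folder. The rows are toy data, not taken from `train.csv`, and the variable names are hypothetical.

```python
import io

import pandas as pd

# Tiny inline CSV standing in for train.csv (illustrative values only).
toy_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22\n"
    "2,1,1,female,\n"
    "3,1,3,female,26\n"
)
toy_df = pd.read_csv(toy_csv)

# Shape gives (rows, columns); isna().sum() counts missing values per column.
n_rows, n_cols = toy_df.shape
missing_per_column = toy_df.isna().sum()

print(f"{n_rows} rows, {n_cols} columns")
print(missing_per_column)
```

The same two calls can be applied to `TrainDF` and `TestDF` to see which columns will need imputation.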