📑 Original notebook: [Titanic Data Science Solutions](https://www.kaggle.com/startupsci/titanic-data-science-solutions)

# Titanic Data Science Solutions


### This notebook is a companion to the book [Data Science Solutions](https://www.amazon.com/Data-Science-Solutions-Startup-Workflow/dp/1520545312). 

The notebook walks us through a typical workflow for solving data science competitions at sites like Kaggle.

There are several excellent notebooks for studying data science competition entries. However, many skip some of the explanation of how the solution was developed, because they are written by experts for experts. The objective of this notebook is to follow a step-by-step workflow, explaining each step and the rationale for every decision we take during solution development.

## Workflow stages

The competition solution workflow goes through seven stages described in the Data Science Solutions book.

1. Question or problem definition.
2. Acquire training and testing data.
3. Wrangle, prepare, cleanse the data.
4. Analyze, identify patterns, and explore the data.
5. Model, predict and solve the problem.
6. Visualize, report, and present the problem solving steps and final solution.
7. Supply or submit the results.

The workflow indicates the general sequence in which each stage may follow the other. However, there are use cases with exceptions.

- We may combine multiple workflow stages. We may analyze by visualizing data.
- Perform a stage earlier than indicated. We may analyze data before and after wrangling.
- Perform a stage multiple times in our workflow. The visualize stage may be used multiple times.
- Drop a stage altogether. We may not need the supply stage to productize or service-enable our dataset for a competition.


## Question and problem definition

Competition sites like Kaggle define the problem to solve or questions to ask while providing the datasets for training your data science model and testing the model results against a test dataset. The question or problem definition for Titanic Survival competition is [described here at Kaggle](https://www.kaggle.com/c/titanic).

> Knowing from a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our model determine based on a given test dataset not containing the survival information, if these passengers in the test dataset survived or not.

We may also want to develop some early understanding about the domain of our problem. This is described on the [Kaggle competition description page here](https://www.kaggle.com/c/titanic). Here are the highlights to note.

- On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This translates to a 32% survival rate.
- One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.
- Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

## Workflow goals

The data science solutions workflow solves for seven major goals.

**Classifying.** We may want to classify or categorize our samples. We may also want to understand the implications or correlation of different classes with our solution goal.

**Correlating.** One can approach the problem based on available features within the training dataset. Which features within the dataset contribute significantly to our solution goal? Statistically speaking, is there a [correlation](https://en.wikiversity.org/wiki/Correlation) between a feature and the solution goal? As the feature values change, does the solution state change as well, and vice versa? This can be tested for both numerical and categorical features in the given dataset. We may also want to determine correlation among features other than survival for subsequent goals and workflow stages. Correlating certain features may help in creating, completing, or correcting features.

**Converting.** For modeling stage, one needs to prepare the data. Depending on the choice of model algorithm one may require all features to be converted to numerical equivalent values. So for instance converting text categorical values to numeric values.

**Completing.** Data preparation may also require us to estimate any missing values within a feature. Model algorithms may work best when there are no missing values.

**Correcting.** We may also analyze the given training dataset for errors or possibly inaccurate values within features and try to correct these values or exclude the samples containing the errors. One way to do this is to detect any outliers among our samples or features. We may also completely discard a feature if it is not contributing to the analysis or may significantly skew the results.

**Creating.** Can we create new features based on an existing feature or a set of features, such that the new feature follows the correlation, conversion, and completeness goals?

**Charting.** How to select the right visualization plots and charts depending on nature of the data and the solution goals.

In [None]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

## Acquire data

The Python Pandas package helps us work with our datasets. We start by acquiring the training and testing datasets into Pandas DataFrames. We also combine these datasets to run certain operations on both datasets together.

In [None]:
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')
combine = [train_df, test_df]

### Assumptions based on data analysis

We arrive at the following assumptions based on data analysis done so far. We may validate these assumptions further before taking appropriate actions.

**Correlating.**

We want to know how well each feature correlates with Survival. We want to do this early in our project and match these quick correlations with modelled correlations later in the project.

**Completing.**

1. We may want to complete Age feature as it is definitely correlated to survival.
2. We may want to complete the Embarked feature as it may also correlate with survival or another important feature.

**Correcting.**

1. Ticket feature may be dropped from our analysis as it contains high ratio of duplicates (22%) and there may not be a correlation between Ticket and survival.
2. Cabin feature may be dropped as it is highly incomplete or contains many null values both in training and test dataset.
3. PassengerId may be dropped from training dataset as it does not contribute to survival.
4. Name feature is relatively non-standard and may not contribute directly to survival, so it may be dropped.

**Creating.**

1. We may want to create a new feature called Family based on Parch and SibSp to get total count of family members on board.
2. We may want to engineer the Name feature to extract Title as a new feature.
3. We may want to create a new feature for Age bands. This turns a continuous numerical feature into an ordinal categorical feature.
4. We may also want to create a Fare range feature if it helps our analysis.

**Classifying.**

We may also add to our assumptions based on the problem description noted earlier.

1. Women (Sex=female) were more likely to have survived.
2. Children (Age<?) were more likely to have survived. 
3. The upper-class passengers (Pclass=1) were more likely to have survived.

## Wrangle data

We have collected several assumptions and decisions regarding our datasets and solution requirements. So far we did not have to change a single feature or value to arrive at these. Let us now execute our decisions and assumptions for correcting, creating, and completing goals.

### Correcting by dropping features

This is a good starting goal to execute. <u>By dropping features</u> we are dealing with fewer data points. This speeds up our notebook and eases the analysis.

- Based on our assumptions and decisions *we want to drop the Cabin (correcting #2) and Ticket (correcting #1) features.*

Note that where applicable we perform operations on both training and testing datasets together to stay consistent.

In [None]:
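# Write code so that the output looks like the following
# Before {shape of train_df}, {shape of test_df}, {shape of combine[0]}, {shape of combine[1]}

# Drop the 'Ticket' and 'Cabin' columns from train_df and save the result

# Drop the 'Ticket' and 'Cabin' columns from test_df and save the result

# Reassign combine to a list made up of train_df and test_df


# Write code so that the output looks like the following
# After {shape of train_df}, {shape of test_df}, {shape of combine[0]}, {shape of combine[1]}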
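
One possible way to complete the cell above, sketched on a hypothetical toy DataFrame instead of the Kaggle CSV files (the sample rows are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy frames standing in for data/train.csv and data/test.csv
train_df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Ticket": ["A/5 21171", "PC 17599", "113803"],
    "Cabin": [None, "C85", "C123"],
})
test_df = train_df.drop(columns=["Survived"]).copy()
combine = [train_df, test_df]

print("Before", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)

# drop() returns a new frame, so rebind both names and rebuild the list
train_df = train_df.drop(["Ticket", "Cabin"], axis=1)
test_df = test_df.drop(["Ticket", "Cabin"], axis=1)
combine = [train_df, test_df]

print("After", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)
```

Rebuilding `combine` matters here because the old list still points at the frames that contain Ticket and Cabin.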


### Creating new feature extracting from existing

We want to analyze **if Name feature can be engineered to extract titles and test correlation between titles and survival**, before dropping Name and PassengerId features.

In the following code we extract the Title feature using regular expressions. The RegEx pattern `(\w+\.)` matches <u>the first word which ends with a dot character within the Name feature.</u> With a single capture group, the `expand=False` flag returns a Series rather than a DataFrame.

**Observations.**

When we plot Title, Age, and Survived, we note the following observations.

- Most titles band Age groups accurately. For example: Master title has Age mean of 5 years.
- Survival among Title Age bands varies slightly.
- Certain titles mostly survived (Mme, Lady, Sir) or did not (Don, Rev, Jonkheer).

**Decision.**

- We decide to retain the new Title feature for model training.

In [None]:
# For each DataFrame in combine, create a new column Title.
# Title is the word ending with '.' extracted from the Name column
# hint : https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html

# Print a frequency table of Title by sex for train_df
# hint : https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html
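
A minimal sketch of the extraction, run on a few sample names in a hypothetical toy DataFrame. The pattern ` ([A-Za-z]+)\.` is an assumption that differs slightly from the one quoted above: it keeps the dot outside the capture group so that the stored title is `Mr` rather than `Mr.`, which matches the keys used when titles are mapped to numbers later on.

```python
import pandas as pd

# Hypothetical toy frame; the real notebook reads these columns from train.csv
train_df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
    ],
    "Sex": ["male", "female", "female", "male"],
})
combine = [train_df]

for dataset in combine:
    # One capture group with expand=False gives a Series of titles
    dataset["Title"] = dataset["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Frequency table of titles against sex
title_by_sex = pd.crosstab(train_df["Title"], train_df["Sex"])
print(title_by_sex)
```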


We can replace many titles with a more common name or classify them as `Rare`.

In [None]:
# For each DataFrame in combine,
# convert the values of the Title column as follows
# 'Lady','Countess','Capt','Col','Don','Dr','Major','Rev', 'Sir','Jonkheer','Dona' -> 'Rare'
# 'Mlle' -> 'Miss'
# 'Ms' -> 'Miss'
# 'Mme' -> 'Mrs'

# Compute the mean of Survived for each Title in train_df.
# hint : groupby
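
The replacement rules listed in the cell above could be applied as follows. This is a sketch on hypothetical toy data, not the notebook's reference answer:

```python
import pandas as pd

# Hypothetical toy frame with a few common and uncommon titles
train_df = pd.DataFrame({
    "Title": ["Mr", "Mlle", "Ms", "Mme", "Rev", "Dr", "Miss"],
    "Survived": [0, 1, 1, 1, 0, 1, 1],
})
combine = [train_df]

rare_titles = ["Lady", "Countess", "Capt", "Col", "Don", "Dr", "Major",
               "Rev", "Sir", "Jonkheer", "Dona"]

for dataset in combine:
    # Collapse uncommon titles into one bucket, then normalize French forms
    dataset["Title"] = dataset["Title"].replace(rare_titles, "Rare")
    dataset["Title"] = dataset["Title"].replace(
        {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}
    )

# Mean survival per title
survival_by_title = (
    train_df[["Title", "Survived"]].groupby("Title", as_index=False).mean()
)
print(survival_by_title)
```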


We can convert the categorical titles to ordinal.

In [None]:
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

# For each DataFrame in combine,
# convert the values of the Title column to ordinal data using title_mapping
# If a Title value is NaN, convert it to 0.



train_df.head()
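
A short sketch of the mapping step, assuming a hypothetical toy frame that includes one missing title:

```python
import pandas as pd

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

# Hypothetical toy frame; None stands in for a name with no recognizable title
train_df = pd.DataFrame({"Title": ["Mr", "Miss", None, "Rare"]})
combine = [train_df]

for dataset in combine:
    # map() yields NaN for anything not in the dictionary; fillna(0) covers those
    dataset["Title"] = dataset["Title"].map(title_mapping).fillna(0).astype(int)

print(train_df)
```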

Now we can safely drop the Name feature from training and testing datasets. We also do not need the PassengerId feature in the training dataset.

In [None]:
# Drop the Name and PassengerId columns from train_df


# Drop the Name column from test_df


# Reassign combine to a list made up of train_df and test_df



train_df.shape, test_df.shape

🔥 **Which feature was created in the steps above? What criteria were used to create it? Why do you think the author created this feature?**
> (Write your answer here)

### Converting a categorical feature

Now we can convert features which contain strings to numerical values. This is required by most model algorithms. Doing so will also help us in achieving the feature completing goal.

Let us start by converting the Sex feature to numeric values where female=1 and male=0. Later sections refer to this encoded feature as Gender.

In [None]:
# For each DataFrame in combine,
# convert the values of the Sex column so that female becomes 1 and male becomes 0.



train_df.head()
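
As a sketch, the conversion can be done with a dictionary lookup. The toy frame below is hypothetical:

```python
import pandas as pd

# Hypothetical toy frame
train_df = pd.DataFrame({"Sex": ["male", "female", "female"]})
combine = [train_df]

for dataset in combine:
    # Overwrite the string labels with their numeric codes
    dataset["Sex"] = dataset["Sex"].map({"female": 1, "male": 0}).astype(int)

print(train_df)
```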

🔥 **In the step above, the author converts string data to numbers. Why was this done?**
> (Write your answer here)

### Completing a numerical continuous feature

Now we should start estimating and completing features with ***missing or null values.*** We will first do this for the Age feature.

We can consider three methods to complete a numerical continuous feature.

1. A simple way is to generate random numbers between mean and [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation).

2. More accurate way of guessing missing values is to use other correlated features. In our case we note correlation among Age, Gender, and Pclass. Guess Age values using [median](https://en.wikipedia.org/wiki/Median) values for Age across sets of Pclass and Gender feature combinations. So, median Age for Pclass=1 and Gender=0, Pclass=1 and Gender=1, and so on...

3. Combine methods 1 and 2. So instead of guessing age values based on median, use random numbers between mean and standard deviation, based on sets of Pclass and Gender combinations.

Methods 1 and 3 will introduce random noise into our models. The results from multiple executions might vary. We will prefer method 2.

In [None]:
# Plot histograms of Age for each combination of passenger class (Pclass) and sex (Sex).
# → histogram of ages for Pclass == 1 and Sex == female
#   histogram of ages for Pclass == 1 and Sex == male
#   histogram of ages for Pclass == 2 and Sex == female
#   histogram of ages for Pclass == 2 and Sex == male
#   histogram of ages for Pclass == 3 and Sex == female
#   histogram of ages for Pclass == 3 and Sex == male
# Use 20 bins for each histogram.

<details>
<summary>💬 answer 1</summary>
<div markdown="1">       

```python
# grid = sns.FacetGrid(train_df, col='Pclass', hue='Gender')
grid = sns.FacetGrid(train_df, row='Pclass', col='Sex', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend()
plt.show()
```
    
</div>
</details>

<details>
<summary>💬 answer 2</summary>
<div markdown="1">       

```python
fig, axes = plt.subplots(3, 2, figsize=(6, 6), sharey=True, sharex=True)
axes = [y for x in axes for y in x]

for i, ax in enumerate(axes):
    pclass = i // 2 + 1
    sex = i % 2  # Sex was converted earlier: 0 = male, 1 = female

    # Drop missing ages so that only observed values are binned
    ages = train_df[(train_df['Pclass'] == pclass) &
                    (train_df['Sex'] == sex)]['Age'].dropna()
    ax.hist(ages, bins=20, color='blue', alpha=.5)
    ax.set_title(f'Pclass = {pclass} | Sex = {sex}')

plt.tight_layout()
plt.show()
```
    
</div>
</details>

Let us start by preparing an empty array to contain guessed Age values based on Pclass x Gender combinations.

In [None]:
# Create guess_ages, an array of shape (2, 3) filled with zeros, to hold one value per Gender x Pclass combination.


Now we iterate over Sex (0 or 1) and Pclass (1, 2, 3) to calculate guessed values of Age for the six combinations.

In [None]:
print('[Before]')
print(train_df.tail())

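# For each DataFrame in combine
for dataset in combine:

    # Store in guess_ages[i, j] the median Age of the rows in dataset
    # where Sex is i and Pclass is j+1.
    # Note) Make sure NaN values are excluded when computing the median.

    # For rows in dataset where Sex is i and Pclass is j+1 and Age is NaN,
    # assign guess_ages[i, j] to Age.
    pass  # placeholder so the cell runs before the exercise is completed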


print('\n[After]')
print(train_df.tail())
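
Method 2 can be sketched as below. The toy DataFrame is hypothetical, and skipping combinations that have no observed ages is an assumption made so the example runs on such a small sample:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with Sex already encoded as 0/1
train_df = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [40.0, 50.0, np.nan, 20.0, 24.0, np.nan],
})
combine = [train_df]

# Rows index Sex (0, 1); columns index Pclass (1, 2, 3)
guess_ages = np.zeros((2, 3))

for dataset in combine:
    for i in range(2):
        for j in range(3):
            in_group = (dataset["Sex"] == i) & (dataset["Pclass"] == j + 1)
            known_ages = dataset.loc[in_group, "Age"].dropna()
            if known_ages.empty:
                continue  # nothing to estimate from in this toy sample
            guess_ages[i, j] = known_ages.median()
            # Fill only the missing ages inside this Sex x Pclass group
            missing = in_group & dataset["Age"].isnull()
            dataset.loc[missing, "Age"] = guess_ages[i, j]

print(train_df)
```

Because the medians are computed per dataset inside the loop, the test set would be completed from its own observed ages rather than from the training set.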

Let us create Age bands and determine correlations with Survived.

In [None]:
# Split Age in train_df into 5 bands and store each row's band in a new column AgeBand.
# hint : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html


# Compute the mean of Survived for each AgeBand and sort by AgeBand in ascending order.
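
A sketch of the banding step using `pd.cut`, which splits the observed range into equal-width intervals. The ages and outcomes below are hypothetical toy values:

```python
import pandas as pd

# Hypothetical toy frame spanning ages 0 to 80
train_df = pd.DataFrame({
    "Age": [0, 10, 20, 30, 40, 50, 60, 70, 80, 80],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0, 0, 1],
})

# Five equal-width bands over the observed age range
train_df["AgeBand"] = pd.cut(train_df["Age"], 5)

# observed=False keeps every band in the result, even an empty one
survival_by_band = (
    train_df[["AgeBand", "Survived"]]
    .groupby("AgeBand", as_index=False, observed=False)
    .mean()
    .sort_values(by="AgeBand", ascending=True)
)
print(survival_by_band)
```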


Let us replace Age with ordinals based on these bands.

In [None]:
# For each DataFrame in combine,
# convert Age to ordinal data using the AgeBand values obtained above.
# Ages in the lowest band become 0 and ages in the highest band become 4.


train_df.head()
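
One way to turn ages into band numbers is to reuse the band edges computed from the training ages. This sketch uses `pd.cut` with `labels=False` instead of hand-written thresholds, which is a substitution made here for brevity; the toy values are hypothetical:

```python
import pandas as pd

# Hypothetical toy frames; test ages fall inside the training age range
train_df = pd.DataFrame({"Age": [0.0, 10.0, 20.0, 35.0, 50.0, 80.0]})
test_df = pd.DataFrame({"Age": [5.0, 33.0, 70.0]})
combine = [train_df, test_df]

# retbins=True also returns the band edges derived from the training ages
_, edges = pd.cut(train_df["Age"], 5, retbins=True)

for dataset in combine:
    # labels=False returns the band position: 0 = youngest band, 4 = oldest
    dataset["Age"] = pd.cut(dataset["Age"], bins=edges, labels=False).astype(int)

print(train_df["Age"].tolist(), test_df["Age"].tolist())
```

An age outside the training range would come back as NaN and make `astype(int)` fail, so real data may need the outer edges widened first.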

We can now remove the AgeBand feature.

In [None]:
# Remove the AgeBand column from train_df.

# Reassign combine to a list made up of train_df and test_df


train_df.head()

🔥 **In the steps above, the author converted continuous numerical data into categorical data. Why was this done?**
> (Write your answer here)

### Create new feature combining existing features

We can create a new feature for FamilySize which combines Parch and SibSp. This will enable us to drop Parch and SibSp from our datasets.

In [None]:
# Create a new column FamilySize holding the number of family members.


# Compute the mean of Survived for each FamilySize and sort by Survived in descending order.


We can create another feature called IsAlone.

In [None]:
# Create a new column IsAlone indicating whether the passenger is traveling alone.
# IsAlone = 1 if traveling alone, IsAlone = 0 if accompanied.

# Print the mean of Survived for each IsAlone value.
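
Both derived features can be sketched together. Adding 1 so that FamilySize counts the passenger as well is an assumption here, and the toy rows are hypothetical:

```python
import pandas as pd

# Hypothetical toy frame
train_df = pd.DataFrame({
    "SibSp": [1, 0, 0, 3],
    "Parch": [0, 0, 2, 1],
    "Survived": [0, 1, 1, 0],
})
combine = [train_df]

for dataset in combine:
    # Siblings/spouses plus parents/children, plus the passenger (assumed +1)
    dataset["FamilySize"] = dataset["SibSp"] + dataset["Parch"] + 1
    # A family of one means the passenger travels alone
    dataset["IsAlone"] = (dataset["FamilySize"] == 1).astype(int)

survival_by_alone = (
    train_df[["IsAlone", "Survived"]].groupby("IsAlone", as_index=False).mean()
)
print(survival_by_alone)
```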


Let us drop Parch, SibSp, and FamilySize features in favor of IsAlone.

In [None]:
# Remove Parch, SibSp, and FamilySize from train_df.


# Remove Parch, SibSp, and FamilySize from test_df.


# Reassign combine to a list made up of train_df and test_df


train_df.head()

We can also create an artificial feature combining Pclass and Age.

In [None]:
# Create a new column Age*Class by multiplying Age and Pclass.


train_df.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)

🔥 **In the steps above, the author created two new columns, IsAlone and Age\*Class. What does each column mean, and why do you think the author chose to create these two?**
> (Write your answer here)

### Completing a categorical feature

The Embarked feature takes S, Q, C values based on port of embarkation. *Our training dataset has two missing values.* We simply <u>fill these with the most common occurrence.</u>

In [None]:
# Find the most frequent port of embarkation in the Embarked column of train_df
# and store it in freq_port
# hint : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mode.html


freq_port

In [None]:
# For each DataFrame in combine,
# replace NaN values in Embarked with freq_port.


# Compute the mean of Survived for each Embarked value and sort by Survived in descending order.
train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)
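
A sketch of the two cells above on a hypothetical toy frame with one missing port:

```python
import pandas as pd

# Hypothetical toy frame; None marks a missing port of embarkation
train_df = pd.DataFrame({
    "Embarked": ["S", "C", None, "S", "Q", "S"],
    "Survived": [0, 1, 1, 0, 1, 1],
})
combine = [train_df]

# mode() returns a Series because ties are possible; take the first entry
freq_port = train_df["Embarked"].dropna().mode()[0]

for dataset in combine:
    dataset["Embarked"] = dataset["Embarked"].fillna(freq_port)

print(freq_port, train_df["Embarked"].tolist())
```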

🔥 **In the steps above, what value did the author use to replace the missing ports of embarkation?**
> (Write your answer here)

### Converting categorical feature to numeric

We can now convert the completed Embarked feature to numeric port codes.

In [None]:
# For each DataFrame in combine,
# convert Embarked to 0 if it is S, 1 if it is C, and 2 if it is Q.
# The newly assigned values should have type int.



train_df.head()

### Quick completing and converting a numeric feature

We can now complete the Fare feature for the single missing value in the test dataset, using the median of this feature. We do this in a single line of code.

Note that we are not creating an intermediate new feature or doing any further analysis for correlation to guess the missing value, as we are replacing only a single value. The completion goal achieves the desired requirement for the model algorithm to operate on non-null values.

We may also want to round off the fare to two decimals as it represents currency.

In [None]:
# Replace NaN values in Fare of test_df with the median of Fare.



test_df.head()
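
The single-line completion might look like this, shown on hypothetical toy fares:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing fare
test_df = pd.DataFrame({"Fare": [7.25, np.nan, 71.28, 8.05]})

# median() skips NaN by default, so the gap does not affect the estimate
test_df["Fare"] = test_df["Fare"].fillna(test_df["Fare"].median())

print(test_df)
```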

We can create FareBand.

In [None]:
# Split Fare in train_df into 4 bands and store each row's band in a new column FareBand.


# Compute the mean of Survived for each FareBand and sort by FareBand in ascending order.
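
The cell above does not say how the four bands should be chosen. This sketch assumes quantile-based bands via `pd.qcut`, so each band holds roughly the same number of passengers; `pd.cut` would give equal-width bands instead. The fares are hypothetical toy values:

```python
import pandas as pd

# Hypothetical toy frame with a skewed fare distribution
train_df = pd.DataFrame({
    "Fare": [5.0, 7.0, 8.0, 10.0, 14.0, 20.0, 30.0, 80.0],
    "Survived": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Four quantile bands (assumption: equal-frequency rather than equal-width)
train_df["FareBand"] = pd.qcut(train_df["Fare"], 4)

survival_by_fare = (
    train_df[["FareBand", "Survived"]]
    .groupby("FareBand", as_index=False, observed=False)
    .mean()
    .sort_values(by="FareBand", ascending=True)
)
print(survival_by_fare)
```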


Convert the Fare feature to ordinal values based on the FareBand.

In [None]:
# For each DataFrame in combine,
# convert Fare to ordinal data using the FareBand values obtained above.
# Fares in the lowest band become 0,
# and fares in the highest band become 3.




# Remove the FareBand column from train_df.


# Reassign combine to a list made up of train_df and test_df

    
train_df.head(10)
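
A sketch of the Fare conversion, assuming quantile edges taken from the training fares and opened at both ends so that a test fare outside the training range still lands in a band. All values are hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; 100.0 lies above every training fare
train_df = pd.DataFrame({"Fare": [5.0, 7.0, 8.0, 10.0, 14.0, 20.0, 30.0, 80.0]})
test_df = pd.DataFrame({"Fare": [6.0, 12.0, 100.0]})
combine = [train_df, test_df]

# Quantile edges from the training fares (assumed equal-frequency bands)
_, edges = pd.qcut(train_df["Fare"], 4, retbins=True)
edges = np.asarray(edges, dtype=float).copy()
# Open the outer edges so out-of-range fares are not turned into NaN
edges[0], edges[-1] = -np.inf, np.inf

for dataset in combine:
    # labels=False returns the band position: 0 = cheapest band, 3 = most expensive
    dataset["Fare"] = pd.cut(dataset["Fare"], bins=edges, labels=False).astype(int)

print(train_df["Fare"].tolist(), test_df["Fare"].tolist())
```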

And the test dataset.

In [None]:
test_df.head(10)

## References

This notebook has been created based on great work done solving the Titanic competition and other sources.

- [A journey through Titanic](https://www.kaggle.com/omarelgabry/titanic/a-journey-through-titanic)
- [Getting Started with Pandas: Kaggle's Titanic Competition](https://www.kaggle.com/c/titanic/details/getting-started-with-random-forests)
- [Titanic Best Working Classifier](https://www.kaggle.com/sinakhorami/titanic/titanic-best-working-classifier)