📑 Original notebook: [Titanic Data Science Solutions](https://www.kaggle.com/startupsci/titanic-data-science-solutions)

# Titanic Data Science Solutions


### This notebook is a companion to the book [Data Science Solutions](https://www.amazon.com/Data-Science-Solutions-Startup-Workflow/dp/1520545312). 

The notebook walks us through a typical workflow for solving data science competitions at sites like Kaggle.

There are several excellent notebooks for studying data science competition entries. However, many skip some of the explanation of how the solution was developed, because they are written by experts for experts. The objective of this notebook is to follow a step-by-step workflow, explaining each step and the rationale for every decision we make during solution development.

## Workflow stages

The competition solution workflow goes through seven stages described in the Data Science Solutions book.

1. Question or problem definition.
2. Acquire training and testing data.
3. Wrangle, prepare, cleanse the data.
4. Analyze, identify patterns, and explore the data.
5. Model, predict and solve the problem.
6. Visualize, report, and present the problem solving steps and final solution.
7. Supply or submit the results.

The workflow indicates the general sequence in which each stage may follow the other. However, there are use cases with exceptions.

- We may combine multiple workflow stages. We may analyze by visualizing data.
- We may perform a stage earlier than indicated. We may analyze data before and after wrangling.
- We may perform a stage multiple times in our workflow. The visualize stage may be used multiple times.
- We may drop a stage altogether. We may not need the supply stage to productize or service-enable our dataset for a competition.


## Question and problem definition

Competition sites like Kaggle define the problem to solve or questions to ask while providing the datasets for training your data science model and testing the model results against a test dataset. The question or problem definition for Titanic Survival competition is [described here at Kaggle](https://www.kaggle.com/c/titanic).

> Knowing from a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our model determine based on a given test dataset not containing the survival information, if these passengers in the test dataset survived or not.

We may also want to develop some early understanding about the domain of our problem. This is described on the [Kaggle competition description page here](https://www.kaggle.com/c/titanic). Here are the highlights to note.

- On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This translates to a 32% survival rate.
- One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.
- Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
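
The 32% figure in the first highlight follows directly from the two numbers quoted there. A minimal sketch of the arithmetic (the variable names are illustrative, not part of the original notebook):

```python
# Figures quoted in the competition description above.
total_on_board = 2224
deaths = 1502

# Everyone who did not die is counted as a survivor.
survivors = total_on_board - deaths
survival_rate = survivors / total_on_board

print(f"survivors: {survivors}")
print(f"survival rate: {survival_rate:.1%}")
```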

## Workflow goals

The data science solutions workflow solves for seven major goals.

**Classifying.** We may want to classify or categorize our samples. We may also want to understand the implications or correlation of different classes with our solution goal.

**Correlating.** One can approach the problem based on available features within the training dataset. Which features within the dataset contribute significantly to our solution goal? Statistically speaking, is there a [correlation](https://en.wikiversity.org/wiki/Correlation) between a feature and the solution goal? As the feature values change, does the solution state change as well, and vice versa? This can be tested both for numerical and categorical features in the given dataset. We may also want to determine correlation among features other than survival for subsequent goals and workflow stages. Correlating certain features may help in creating, completing, or correcting features.

**Converting.** For the modeling stage, one needs to prepare the data. Depending on the choice of model algorithm, one may need to convert all features to numerical equivalents, for instance converting text categorical values to numeric values.
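
As a rough illustration of the converting goal, the sketch below maps text categories to integers with pandas. The toy DataFrame and the specific integer codes are assumptions made for this example only; they are not taken from the notebook.

```python
import pandas as pd

# Toy sample for illustration, not the real Titanic data.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# The integer codes are an arbitrary choice for this sketch.
sex_codes = {"female": 1, "male": 0}
port_codes = {"S": 0, "C": 1, "Q": 2}

# Replace each text category with its numeric code.
df["Sex"] = df["Sex"].map(sex_codes).astype(int)
df["Embarked"] = df["Embarked"].map(port_codes).astype(int)

print(df)
```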

**Completing.** Data preparation may also require us to estimate any missing values within a feature. Model algorithms may work best when there are no missing values.
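
One simple way to complete a feature is to fill gaps with a summary statistic. The sketch below uses the median for a numeric column and the most frequent value for a categorical one. Both choices, and the toy data, are assumptions for illustration rather than the approach this notebook commits to.

```python
import numpy as np
import pandas as pd

# Toy sample with gaps, not the real Titanic data.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "S"],
})

# Summary statistics computed from the non-missing values only.
age_median = df["Age"].median()
port_mode = df["Embarked"].mode()[0]

# Fill the missing entries with those statistics.
df["Age"] = df["Age"].fillna(age_median)
df["Embarked"] = df["Embarked"].fillna(port_mode)

print(df)
```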

**Correcting.** We may also analyze the given training dataset for errors or possibly inaccurate values within features and try to correct these values or exclude the samples containing the errors. One way to do this is to detect any outliers among our samples or features. We may also completely discard a feature if it is not contributing to the analysis or may significantly skew the results.
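
The text does not say how outliers should be detected. One common convention, assumed here purely for illustration, is to flag values lying more than 1.5 interquartile ranges outside the quartiles. The fare values below are made up.

```python
import pandas as pd

# Made-up fares for illustration; one value is deliberately extreme.
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.33], name="Fare")

# Interquartile range and the conventional 1.5 * IQR fences.
q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are treated as outlier candidates.
outliers = fares[(fares < lower) | (fares > upper)]
print(outliers)
```

Whether a flagged value is an error or a genuine extreme still needs a human decision.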

**Creating.** Can we create new features based on an existing feature or a set of features, such that the new feature follows the correlation, conversion, and completeness goals?

**Charting.** How to select the right visualization plots and charts depending on nature of the data and the solution goals.

In [None]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

## Acquire data

The Python pandas package helps us work with our datasets. We start by loading the training and testing datasets into pandas DataFrames. We also combine these datasets so that we can run certain operations on both together.

In [None]:
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')
combine = [train_df, test_df]

## Analyze by describing data

Pandas also helps describe the datasets, answering the following questions early in our project.

**Which features are available in the dataset?**

We note the feature names so that we can directly manipulate or analyze them. These feature names are described on the [Kaggle data page here](https://www.kaggle.com/c/titanic/data).

In [None]:
# [Explain what this code does]
print(train_df.columns.values)

🔥 **What does each feature mean?**

- PassengerId : 
- Survived : 
- Pclass : 
- Name : 
- Sex : 
- Age : 
- SibSp : 
- Parch : 
- Ticket : 
- Fare : 
- Cabin : 
- Embarked : 

**Which features are categorical?**

These values classify the samples into sets of similar samples. Within categorical features are the values nominal, ordinal, ratio, or interval based? Among other things this helps us select the appropriate plots for visualization.

- Categorical: Survived, Sex, and Embarked. Ordinal: Pclass.

**Which features are numerical?**

These values change from sample to sample. Within numerical features are the values discrete, continuous, or timeseries based? Among other things this helps us select the appropriate plots for visualization.

- Continuous: Age, Fare. Discrete: SibSp, Parch.

In [None]:
# [Explain what this code does]
train_df.head()

**Which features are mixed data types?**

Numerical and alphanumeric data within the same feature. These are candidates for the correcting goal.

- Ticket is a mix of numeric and alphanumeric data types. Cabin is alphanumeric.

**Which features may contain errors or typos?**

This is harder to review for a large dataset, however reviewing a few samples from a smaller dataset may just tell us outright, which features may require correcting.

- Name feature may contain errors or typos as there are several ways used to describe a name including titles, round brackets, and quotes used for alternative or short names.

In [None]:
# [Explain what this code does]
train_df.tail()

**Which features contain blank, null or empty values?**

These will require completing.

- Cabin > Age > Embarked features contain a number of null values in that order for the training dataset.
- Cabin > Age are incomplete in case of test dataset.
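
An ordering like the one above can be read off by counting nulls per column. The sketch below shows the idea on a toy DataFrame. The data is invented, so the counts do not match the real training set.

```python
import numpy as np
import pandas as pd

# Toy sample with gaps, not the real Titanic data.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count nulls per column and list the most incomplete columns first.
null_counts = df.isnull().sum().sort_values(ascending=False)
print(null_counts)
```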

**What are the data types for various features?**

This helps us during the converting goal.

- Seven features are integer or floats. Six in case of test dataset.
- Five features are strings (object).
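
Besides reading the `info()` output, the numeric and text columns can be listed programmatically. This is a small sketch using `select_dtypes` on a toy DataFrame. The column subset is chosen only for illustration.

```python
import pandas as pd

# Toy sample with two numeric and two text columns.
df = pd.DataFrame({
    "PassengerId": [1, 2],
    "Age": [22.0, 38.0],
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Sex": ["male", "female"],
})

# Integer and float columns count as "number"; everything else is text here.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
text_cols = df.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("text:", text_cols)
```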

In [None]:
# [Explain what this code does]
train_df.info()
print('_'*40)
# [Explain what this code does]
test_df.info()

**What is the distribution of numerical feature values across the samples?**

This helps us determine, among other early insights, how representative the training dataset is of the actual problem domain.

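🔥 **Which data in the code output were the items below derived from?**

- Total samples are 891 or 40% of the actual number of passengers on board the Titanic (2,224).
    > **Example)** The problem definition above mentions that the total number of passengers and crew aboard the Titanic was 2,224. According to the summary statistics of the train data, the count of PassengerId is 891, so the sample covers 891 of those 2,224 people. The figure of 40% comes from 891/2224 ≈ 0.4006.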
- Survived is a categorical feature with 0 or 1 values.
    >
- Around 38% of samples survived, representative of the actual survival rate of 32%.
    > 
- Most passengers (> 75%) did not travel with parents or children.
    >
- Nearly 30% of the passengers had siblings and/or spouse aboard.
    >
- Fares varied significantly with few passengers (<1%) paying as high as $512.
    >
- Few elderly passengers (<1%) within age range 65-80.
    >

In [None]:
# [Explain what this code does]
train_df.describe()

# Review survived rate using `percentiles=[.61, .62]` knowing our problem description mentions 38% survival rate.
# Review Parch distribution using `percentiles=[.75, .8]`
# SibSp distribution `[.68, .69]`
# Age and Fare `[.1, .2, .3, .4, .5, .6, .7, .8, .9, .99]`
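
The hints in the cell above suggest passing custom percentiles to `describe()`. The sketch below shows what that looks like on a synthetic DataFrame built for this example (38 survivors out of 100, and 76 of 100 passengers with Parch equal to 0). These proportions are assumptions chosen to echo the text, not the real data.

```python
import pandas as pd

# Synthetic data: 62 non-survivors then 38 survivors; mostly Parch == 0.
df = pd.DataFrame({
    "Survived": [0] * 62 + [1] * 38,
    "Parch": [0] * 76 + [1] * 14 + [2] * 10,
})

# Extra rows such as "61%" and "80%" appear in the summary table.
summary = df.describe(percentiles=[.61, .62, .75, .8])
print(summary)
```

Reading where a 0/1 column flips from 0 to 1 between adjacent percentiles gives a quick estimate of the share of zeros without a separate calculation.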

🔥 **What does the information obtained in this step tell us? Why is this step needed?**
> 

**What is the distribution of categorical features?**

- Names are unique across the dataset (count=unique=891)
- Sex variable has two possible values with 65% male (top=male, freq=577/count=891).
- Cabin values have several duplicates across samples. Alternatively several passengers shared a cabin.
- Embarked takes three possible values. S port used by most passengers (top=S)
- Ticket feature has high ratio (22%) of duplicate values (unique=681).
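
The notebook does not state how the duplicate ratio is computed. One plausible definition, assumed here, is the share of rows beyond the first occurrence of each value, that is `1 - unique / count`. The ticket strings below are a made-up sample.

```python
import pandas as pd

# Made-up ticket values with some repeats.
tickets = pd.Series(
    ["A/5 21171", "PC 17599", "113803", "113803",
     "347082", "347082", "347082", "349909"],
    name="Ticket",
)

count = tickets.count()      # number of non-null rows
unique = tickets.nunique()   # number of distinct values

# Assumed definition: rows that repeat an earlier value, as a share of all rows.
duplicate_ratio = 1 - unique / count
print(f"count={count}, unique={unique}, duplicate ratio={duplicate_ratio:.1%}")
```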

In [None]:
# [Explain what this code does]
train_df.describe(include=['O'])

### Assumptions based on data analysis

We arrive at the following assumptions based on the data analysis done so far. We may validate these assumptions further before taking appropriate actions.

**Correlating.**

We want to know <u>**how well does each feature correlate with Survival**</u>. We want to do this early in our project and match these quick correlations with modelled correlations later in the project.

**Completing.**

1. We may want to complete *Age feature* as it is definitely correlated to survival.
2. We may want to complete the *Embarked feature* as it may also correlate with survival or another important feature.

**Correcting.**

1. *Ticket feature* may be dropped from our analysis as it contains high ratio of duplicates (22%) and there may not be a correlation between Ticket and survival.
    - What information is the claim that Ticket has many duplicates based on?
        > 
2. Cabin feature may be dropped as it is highly incomplete or contains many null values both in training and test dataset.
3. PassengerId may be dropped from training dataset as it does not contribute to survival.
4. Name feature is relatively non-standard, may not contribute directly to survival, so may be dropped.
    - Why can we judge that PassengerId and Name do not contribute to survival?
        > 

**Creating.**

1. We may want to create a new feature called Family based on Parch and SibSp to get total count of family members on board.
2. We may want to engineer the Name feature to extract Title as a new feature.
3. We may want to create a new feature for Age bands. This turns a continuous numerical feature into an ordinal categorical feature.
4. We may also want to create a Fare range feature if it helps our analysis.
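
The first three ideas in this list can be sketched with pandas as follows. The column names `FamilySize`, `Title`, and `AgeBand`, the choice to count the passenger themselves in the family size, the regular expression, and the band edges are all assumptions made for illustration.

```python
import pandas as pd

# Three rows in the style of the Name column, used as a toy sample.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
    ],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Age": [22.0, 38.0, 26.0],
})

# Family size: siblings/spouses + parents/children + the passenger (assumed).
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Title: the word that directly precedes the first period in the name.
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Age bands: integer codes for fixed 16-year ranges (edges are arbitrary).
df["AgeBand"] = pd.cut(df["Age"], bins=[0, 16, 32, 48, 64, 80], labels=False)

print(df[["FamilySize", "Title", "AgeBand"]])
```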

**Classifying.**

We may also add to our assumptions based on the problem description noted earlier.

1. Women (Sex=female) were more likely to have survived.
2. Children (Age<?) were more likely to have survived. 
3. The upper-class passengers (Pclass=1) were more likely to have survived.

## Analyze by pivoting features

To confirm some of our observations and assumptions, we can quickly analyze our feature correlations by <u>pivoting features</u> against each other. We can only do so at this stage for features which do not have any empty values. It also makes sense doing so only for features which are categorical (Sex), ordinal (Pclass) or discrete (SibSp, Parch) type.

🔥 **Which data in the code output were the items below derived from?**

- **Pclass** We observe a significant correlation (>0.5) between Pclass=1 and Survived (classifying #3). We decide to include this feature in our model.
    > 
- **Sex** We confirm the observation during problem definition that Sex=female had very high survival rate at 74% (classifying #1).
    >
- **SibSp and Parch** These features have zero correlation for certain values. It may be best to derive a feature or a set of features from these individual features (creating #1).
    >

In [None]:
# [Explain what this code does]
train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean()

In [None]:
# [Explain what this code does]
train_df[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
# [Explain what this code does]
train_df[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
# [Explain what this code does]
train_df[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)

## Analyze by visualizing data

Now we can continue confirming some of our assumptions using visualizations for analyzing the data.

### Correlating numerical features

Let us start by understanding correlations between numerical features and our solution goal (Survived).

**A histogram chart** is useful for analyzing continuous numerical variables like Age where banding or ranges will help identify useful patterns. The histogram can indicate distribution of samples using automatically defined bins or equally ranged bands. This helps us answer questions relating to specific bands (Did infants have better survival rate?)

Note that the y-axis in histogram visualizations represents the count of samples or passengers.

**Observations.**<br>
🔥 **Which data in the code output were the items below derived from?**
- Infants (Age <=4) had high survival rate.
    >
- Oldest passengers (Age = 80) survived.
    >
- Large number of 15-25 year olds did not survive.
    >
- Most passengers are in 15-35 age range.
    >

**Decisions.**

This simple analysis confirms our assumptions as decisions for subsequent workflow stages.

- We should consider Age (our assumption classifying #2) in our model training.
- Complete the Age feature for null values (completing #1).
- We should band age groups (creating #3).

In [None]:
# [Explain what this code does]
g = sns.FacetGrid(train_df, col='Survived')
g.map(plt.hist, 'Age', bins=20)

### Correlating numerical and ordinal features

We can combine multiple features for identifying correlations using a single plot. This can be done with numerical and categorical features which have numeric values.

**Observations.**

🔥 **Which data in the code output were the items below derived from?**
- Pclass=3 had most passengers, however most did not survive. Confirms our classifying assumption #3.
    >
- Infant passengers in Pclass=2 and Pclass=3 mostly survived. Further qualifies our classifying assumption #2.
    >
- Most passengers in Pclass=1 survived. Confirms our classifying assumption #3.
    >
- Pclass varies in terms of Age distribution of passengers.
    >

**Decisions.**

- Consider Pclass for model training.

In [None]:
# [Explain what this code does]
grid = sns.FacetGrid(train_df, col='Pclass', hue='Survived')
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();

🎁 Code written with matplotlib that does the same thing as the cell above

In [None]:
# [Explain what this code does]
fig, axes = plt.subplots(1, 3, figsize=(10, 3), sharey=True)

for i, ax in enumerate(axes):
    # [Explain what this code does]
    age_survived = train_df[train_df['Pclass'] == i+1]

    # [Explain what this code does]
    ax.hist(age_survived[age_survived['Survived']==0]['Age'], bins=20, color='blue', alpha=.5)

    # [Explain what this code does]
    ax.hist(age_survived[age_survived['Survived']==1]['Age'], bins=20, color='orange', alpha=.5)

    # [Explain what this code does]
    ax.set_title(f'Pclass = {i+1}')

# [Explain what this code does]
fig.legend(['0', '1'], loc='center right')
plt.show()

In [None]:
# [Explain what this code does]
grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();

🎁 Code written with matplotlib that does the same thing as the cell above

In [None]:
# [Explain what this code does]
fig, axes = plt.subplots(3, 2, figsize=(6, 6), sharey=True, sharex=True)
# [Explain what this code does]
axes = [y for x in axes for y in x]

for i, ax in enumerate(axes):
    # [Explain what this code does]
    pclass = i//2 + 1
    survived = i % 2

    # [Explain what this code does]
    age_pclass_survived = train_df[(train_df['Pclass'] == pclass) &
                                   (train_df['Survived'] == survived)]['Age']
    # [Explain what this code does]
    ax.hist(age_pclass_survived, bins=20, color='blue', alpha=.5)
    # [Explain what this code does]
    ax.set_title(f'Pclass = {pclass} | Survived = {survived}')

# [Explain what this code does]
plt.tight_layout()
plt.show()

### Correlating categorical features

Now we can correlate categorical features with our solution goal.

**Observations.**

🔥 **Which data in the code output were the items below derived from?**

- Female passengers had much better survival rate than males. Confirms classifying (#1).
    > 
- Exception in Embarked=C where males had higher survival rate. This could be a correlation between Pclass and Embarked and in turn Pclass and Survived, not necessarily direct correlation between Embarked and Survived.
    > 
- Males had better survival rate in Pclass=3 when compared with Pclass=2 for C and Q ports. Completing (#2).
    > 
- Ports of embarkation have varying survival rates for Pclass=3 and among male passengers. Correlating (#1).
    > 
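
The second observation suggests that Embarked may act through Pclass. One way to probe this is to tabulate the class mix per port. The sketch below uses `pd.crosstab` on an invented sample, so the proportions say nothing about the real data.

```python
import pandas as pd

# Invented sample: port C skews to first class, port Q to third class.
df = pd.DataFrame({
    "Embarked": ["C", "C", "C", "S", "S", "S", "Q", "Q"],
    "Pclass":   [1,   1,   2,   3,   3,   1,   3,   3],
})

# Each row sums to 1: the share of each class among passengers from that port.
port_by_class = pd.crosstab(df["Embarked"], df["Pclass"], normalize="index")
print(port_by_class)
```

If one port is dominated by a class with a high survival rate, a port-level difference in survival may simply reflect that class mix.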

**Decisions.**

- Add Sex feature to model training.
- Complete and add Embarked feature to model training.

In [None]:
# [Explain what this code does]
grid = sns.FacetGrid(train_df, row='Embarked', height=2.2, aspect=1.6)

# [Explain what this code does]
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep', order=[1,2,3], hue_order=["male", "female"])
grid.add_legend()
plt.show()

🎁 Code written with matplotlib that does something similar to the cell above

In [None]:
fig, axes = plt.subplots(3, 1, figsize=(4, 8), sharey=True, sharex=True)
embarked = ['S', 'C', 'Q']

for i, ax in enumerate(axes):
    # [Explain what this code does]
    pclass_embarked_survived = train_df[train_df['Embarked'] == embarked[i]]

    # [Explain what this code does]
    survived_rate_male = pclass_embarked_survived[pclass_embarked_survived['Sex']=='male'][['Survived', 'Pclass']].groupby('Pclass').mean()

    # [Explain what this code does]
    survived_rate_female = pclass_embarked_survived[pclass_embarked_survived['Sex']=='female'][['Survived', 'Pclass']].groupby('Pclass').mean()

    # [Explain what this code does]
    ax.plot(survived_rate_male, color='blue')
    ax.plot(survived_rate_female, color='orange')
    ax.set_title(f'Embarked = {embarked[i]}')

fig.legend(['male', 'female'], bbox_to_anchor=(1.2, 0.5))
plt.show()

### Correlating categorical and numerical features

We may also want to correlate categorical features (with non-numeric values) and numeric features. We can consider correlating Embarked (Categorical non-numeric), Sex (Categorical non-numeric), Fare (Numeric continuous), with Survived (Categorical numeric).

**Observations.**

🔥 **Which data in the code output were the items below derived from?**

- Higher fare paying passengers had better survival. Confirms our assumption for creating (#4) fare ranges.
    >
- Port of embarkation correlates with survival rates. Confirms correlating (#1) and completing (#2).
    >

**Decisions.**

- Consider banding Fare feature.
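
A possible way to band Fare is by quantiles, so that each band holds roughly the same number of passengers. The sketch below uses `pd.qcut` with four bands on made-up values. The number of bands, the `FareBand` column name, and the data are assumptions for illustration.

```python
import pandas as pd

# Made-up fares and outcomes for illustration.
df = pd.DataFrame({
    "Fare": [7.25, 7.9, 8.05, 13.0, 26.0, 30.0, 71.28, 512.33],
    "Survived": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Quartile-based bands encoded as integers 0 (cheapest) to 3 (most expensive).
df["FareBand"] = pd.qcut(df["Fare"], 4, labels=False)

# Mean of the 0/1 Survived column per band is the survival rate of that band.
rate_by_band = df.groupby("FareBand")["Survived"].mean()
print(rate_by_band)
```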

In [None]:
# [Explain what this code does]
grid = sns.FacetGrid(train_df, row='Embarked', col='Survived', height=2.2, aspect=1.6)
grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, ci=None, order=['female', 'male'])
grid.add_legend()
plt.show()

🎁 Code written with matplotlib that does something similar to the cell above

In [None]:
fig, axes = plt.subplots(3, 2, figsize=(6, 8), sharey=True, sharex=True)
axes = [y for x in axes for y in x]
Embarked = ['S', 'C', 'Q']

for i, ax in enumerate(axes):
    # [Explain what this code does]
    embarked = Embarked[i//2]
    survived = i % 2

    # [Explain what this code does]
    survived_embarked = train_df[(train_df['Embarked'] == embarked) &
                                 (train_df['Survived'] == survived)][['Sex', 'Fare']]

    # [Explain what this code does]
    fare_mean = survived_embarked.groupby('Sex').mean()

    # [Explain what this code does]
    ax.bar(fare_mean.index, fare_mean['Fare'], alpha=.5)

    # [Explain what this code does]
    ax.set_title(f'Embarked = {embarked} | Survived = {survived}')

    # [Explain what this code does]
    ax.set_xlabel('Sex')

    # [Explain what this code does]
    ax.set_ylabel('Fare')

plt.tight_layout()
plt.show()

## References

This notebook has been created based on great work done solving the Titanic competition and other sources.

- [A journey through Titanic](https://www.kaggle.com/omarelgabry/titanic/a-journey-through-titanic)
- [Getting Started with Pandas: Kaggle's Titanic Competition](https://www.kaggle.com/c/titanic/details/getting-started-with-random-forests)
- [Titanic Best Working Classifier](https://www.kaggle.com/sinakhorami/titanic/titanic-best-working-classifier)