# Exploratory Data Analysis
This notebook takes a preliminary look at the full training data from the Titanic Kaggle competition. I will look for entries that need cleaning, develop strategies for replacing missing data, and use visualizations to uncover patterns.
## Titanic Dataset - Kaggle
(Source: https://www.kaggle.com/competitions/titanic/data)

In [2]:
import pandas as pd

# Read data
X = pd.read_csv('data/train.csv')

# Select numeric and categorical columns by dtype
numeric_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categoric_cols = X.select_dtypes(include=['object']).columns.tolist()

In [3]:
X.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Summary Stats for Numeric and Categoric Data

In [7]:
X.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [8]:
X[categoric_cols].describe()

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Braund, Mr. Owen Harris",male,347082,B96 B98,S
freq,1,577,7,4,644


### Order of embarkation

1. Southampton
2. Cherbourg
3. Queenstown

I am considering making 'Embarked' an ordinal variable, because the ship's route gives the ports of embarkation a natural order. My theory is that the order in which a person embarked could affect where they were lodged on the ship, and therefore their chance of survival.

Further analysis and research are needed. It would be worth investigating the relationships between 'Embarked', 'Cabin', and 'Ticket', as they could be related to each other.
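
A minimal sketch of how the ordinal idea could be tried. The `toy` frame is made-up data standing in for `X`, and the `embark_order` mapping and the `EmbarkedOrder` column name are my own assumptions, not part of the dataset. It encodes each port by its position on the route and then checks how often a cabin is recorded per port, as a first look at the 'Embarked' and 'Cabin' relationship.

```python
import pandas as pd

# Toy stand-in for the training data; the values are illustrative only.
toy = pd.DataFrame({
    'Embarked': ['S', 'C', 'Q', 'S', None],
    'Cabin': ['C85', None, None, 'E46', 'B28'],
})

# Hypothetical mapping: port code -> position on the ship's route.
embark_order = {'S': 1, 'C': 2, 'Q': 3}

# Missing ports stay missing (NaN) after the mapping.
toy['EmbarkedOrder'] = toy['Embarked'].map(embark_order)

# Share of passengers with a recorded cabin, per port of embarkation.
cabin_known_by_port = toy['Cabin'].notnull().groupby(toy['Embarked']).mean()

print(toy)
print(cabin_known_by_port)
```

Running the same two steps on `X` instead of `toy` would leave the two null 'Embarked' entries as NaN, so they would still need to be handled separately.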

## Null Values

In [11]:
X.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

### Cabin

There is a significant portion of null values for 'Cabin':

In [19]:
cabin_null_portion = X['Cabin'].isnull().sum()/len(X['Cabin'].index)
print(f'Cabin null values: {cabin_null_portion}')

Cabin null values: 0.7710437710437711


In [61]:
X['Cabin'].nunique()

147

In [45]:
X.drop_duplicates(subset=['Cabin'])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
...,...,...,...,...,...,...,...,...,...,...,...,...
857,858,1,1,"Daly, Mr. Peter Denis",male,51.0,0,0,113055,26.5500,E17,S
867,868,0,1,"Roebling, Mr. Washington Augustus II",male,31.0,0,0,PC 17590,50.4958,A24,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


There may be a relationship between 'Pclass' and the null values in 'Cabin':

In [70]:
for pclass in X['Pclass'].unique():
    nulls = X[X['Pclass']== pclass].Cabin.isnull().sum()
    print(f'Pclass level: {pclass} has {nulls} null values for Cabin')

Pclass level: 3 has 479 null values for Cabin
Pclass level: 1 has 40 null values for Cabin
Pclass level: 2 has 168 null values for Cabin
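
Raw null counts depend on how many passengers are in each class, so a per-class proportion may be easier to compare. A minimal sketch, using a made-up `toy` frame in place of `X` (the values are illustrative only, and `cabin_null_rate` is a name I am introducing here):

```python
import pandas as pd

# Toy stand-in for the training data; the values are illustrative only.
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3, 3, 3],
    'Cabin': ['C85', None, None, 'F33', None, None, None, 'G6'],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the share of null cabins within each class.
cabin_null_rate = toy['Cabin'].isnull().groupby(toy['Pclass']).mean()

print(cabin_null_rate)
```

Swapping `toy` for `X` would give the same comparison on the real training data.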


It appears that lower classes have their cabin number recorded much less frequently than upper classes. I could impute the nulls with the most frequent value for each class, but that would create dependence between 'Pclass' and 'Cabin'.

In [77]:
for pclass in X['Pclass'].unique():
    valuecount = X[X['Pclass']== pclass].Cabin.value_counts()
    print(f'Pclass level: {pclass} value counts: \n{valuecount}\n')

Pclass level: 3 value counts: 
G6       4
F G73    2
E121     2
F E69    1
E10      1
F G63    1
F38      1
Name: Cabin, dtype: int64

Pclass level: 1 value counts: 
B96 B98        4
C23 C25 C27    4
C22 C26        3
E24            2
E67            2
              ..
E34            1
C7             1
C54            1
E36            1
C148           1
Name: Cabin, Length: 133, dtype: int64

Pclass level: 2 value counts: 
F33     3
E101    3
F2      3
D       3
F4      2
D56     1
E77     1
Name: Cabin, dtype: int64



The cabins on the Titanic appear to be labeled so that the first letter indicates the deck. The topmost deck is 'A' and the bottom deck is 'G'.
<img src="Images/Olympic_&_Titanic_cutaway_diagram.png" align="middle" width="350" />
(Source: https://en.wikipedia.org/wiki/First-class_facilities_of_the_Titanic)

Because of this labeling system, a cabin's deck seems likely to be a strong indicator of survival chance. Passengers on deck 'G' would likely have had significant difficulty reaching the lifeboats on the upper decks. According to Wikipedia, nearly all of the passengers who did not make it into a lifeboat did not survive, so lifeboat accessibility is likely a strong predictor of survival.
(Source: https://en.wikipedia.org/wiki/Sinking_of_the_Titanic)

I am leaning toward making an ordinal 'Deck' variable based on the first letter of the 'Cabin' value. There may be further information in the number following the deck letter, so I will leave that unaltered for now.
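
A minimal sketch of the 'Deck' idea, again on a made-up `toy` frame rather than `X`. The `deck_order` mapping and the `DeckOrdinal` column name are assumptions of mine. Note that entries such as 'F G73' start with one letter but contain another, so taking the first character is a simplification.

```python
import pandas as pd

# Toy stand-in for the 'Cabin' column; the values are illustrative only.
toy = pd.DataFrame({'Cabin': ['C85', 'F G73', 'B96 B98', None, 'D']})

# First character of the cabin string; null cabins stay null.
toy['Deck'] = toy['Cabin'].str[0]

# Hypothetical ordinal encoding: 'A' (topmost) -> 1, ..., 'G' (lowest) -> 7.
deck_order = {letter: rank for rank, letter in enumerate('ABCDEFG', start=1)}

# Any letter outside the mapping, and any null, becomes NaN.
toy['DeckOrdinal'] = toy['Deck'].map(deck_order)

print(toy)
```

The original 'Cabin' column is left untouched, so the number after the deck letter remains available for later analysis.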

As for the null values of 'Cabin', I am considering several options:

1. Impute 'Deck' with the most frequent value of 'Deck' for that class
2. Impute 'Deck' with the most frequent value of 'Deck' for all passengers
3. Impute 'Deck' with a constant 'missing_value'
4. Search for other patterns to derive 'Deck' value based on combinations of other features

My main concern is creating dependence between features, which should be avoided.
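
The first three options can be sketched as follows. This is a rough illustration on made-up data, assuming a 'Deck' column has already been derived from 'Cabin'. The helper `fill_with_mode` and the result names are hypothetical. Option 4 is not covered, since it depends on patterns I have not found yet.

```python
import pandas as pd

# Toy stand-in with an already-derived 'Deck' column; values are illustrative only.
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 3, 3, 3],
    'Deck': ['C', 'C', None, 'F', None, 'G', None, None],
})


def fill_with_mode(s):
    """Fill nulls in s with its most frequent value, if it has one."""
    modes = s.mode()
    return s.fillna(modes.iloc[0]) if not modes.empty else s


# Option 1: most frequent deck within each passenger class.
deck_by_class = toy.groupby('Pclass')['Deck'].transform(fill_with_mode)

# Option 2: most frequent deck across all passengers.
deck_overall = toy['Deck'].fillna(toy['Deck'].mode().iloc[0])

# Option 3: a constant placeholder that marks the value as missing.
deck_constant = toy['Deck'].fillna('missing_value')

print(pd.DataFrame({
    'by_class': deck_by_class,
    'overall': deck_overall,
    'constant': deck_constant,
}))
```

In this sketch, option 1 makes every filled 'Deck' value a function of 'Pclass', which is exactly the dependence I want to avoid, while option 3 keeps the fact that the value was missing visible to the model.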