# Titanic - Machine Learning Kaggle Competition

## Exploratory Data Analysis on Titanic Train Dataset
- In this Python lab, I perform exploratory data analysis to become more familiar with the train dataset.
- I use pandas for the data analysis and matplotlib for any plots or graphs needed.

## Variable Notes from Kaggle:
### pclass: 
- A proxy for socio-economic status (SES)
- 1st = Upper
- 2nd = Middle
- 3rd = Lower

### Age: 
- Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.

### sibsp: 
- The dataset defines family relations in this way...
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)

### parch: 
- The dataset defines family relations in this way...
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
- Some children travelled only with a nanny, therefore parch=0 for them.

### Import Modules
- First, let's import all the modules we will be using.

In [232]:
import pandas as pd
import numpy as np

## Train Pandas Dataframe
- Let's read the train.csv file into a pandas DataFrame and run some helpful DataFrame functions to get better acquainted with the data.

In [233]:
df = pd.read_csv("../titanic_data/train.csv")
print(f"rows: {df.shape[0]}, columns: {df.shape[1]}")

# Show first 20 records
df.head(20)

rows: 891, columns: 12


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [234]:
# Display a helpful description of the data
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


### Socio-Economic Patterns
- In the description of the data, the median Pclass is 3, so at least half of the passengers travelled in the lower class, and the 25th percentile is 2, so fewer than a quarter travelled in the upper class.
- Let's start by analysing the passengers by their socio-economic status on the ship. The goal is to find out whether socio-economic class affects a passenger's survival rate.

In [235]:
# Split passengers into upper, middle, and lower class
upper_df = df[df.Pclass == 1]
middle_df = df[df.Pclass == 2]
lower_df = df[df.Pclass == 3]

# Extract count of all class passengers
upper_count = upper_df.shape[0]
middle_count = middle_df.shape[0]
lower_count = lower_df.shape[0]

print(f"Count of All Upper Class passengers: {upper_count}")
print(f"Count of All Middle Class passengers: {middle_count}")
print(f"Count of All Lower Class passengers: {lower_count}")

# Percentage of survivors (1) and non-survivors (0) within each class
upper_survival_odd = upper_df["Survived"].value_counts(normalize=True) *100
middle_survival_odd = middle_df["Survived"].value_counts(normalize=True) *100
lower_survival_odd = lower_df["Survived"].value_counts(normalize=True) *100

print(f"\nUpper Class survival odds: {upper_survival_odd}")
print(f"\nMiddle Class survival odds: {middle_survival_odd}")
print(f"\nLower Class survival odds: {lower_survival_odd}")

Count of All Upper Class passengers: 216
Count of All Middle Class passengers: 184
Count of All Lower Class passengers: 491

Upper Class survival odds: Survived
1    62.962963
0    37.037037
Name: proportion, dtype: float64

Middle Class survival odds: Survived
0    52.717391
1    47.282609
Name: proportion, dtype: float64

Lower Class survival odds: Survived
0    75.763747
1    24.236253
Name: proportion, dtype: float64
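
The three per-class filters above can also be written as a single `groupby`. The following is a minimal sketch on a small made-up frame (the `toy` data is hypothetical, not taken from train.csv). Because `Survived` is coded 0/1, its mean within a class is that class's survival rate.

```python
import pandas as pd

# Hypothetical toy data standing in for train.csv (values are made up)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Survived is 0/1, so the mean within each class is the survival rate
summary = (
    toy.groupby("Pclass")["Survived"]
    .agg(passengers="size", survival_pct="mean")
)
summary["survival_pct"] = (summary["survival_pct"] * 100).round(1)
print(summary)
```

This returns one row per class with the passenger count and the survival percentage side by side, which makes the classes easier to compare than three separate `value_counts` printouts.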


### Sex Patterns
- The previous cell shows that upper class passengers had the highest survival rate (about 63%), followed by middle class (about 47%) and then lower class (about 24%).
- Let's dive deeper into the surviving population of all 3 classes and see whether sex played a part in the survival of a passenger.

In [236]:
# First, let's extract all passengers that survived for all 3 classes
upper_survived_df = upper_df[upper_df.Survived == 1]
middle_survived_df = middle_df[middle_df.Survived == 1]
lower_survived_df = lower_df[lower_df.Survived == 1]

# Now, let's compute the share of each sex among the survivors of each class.
upper_sex_survived_odd = upper_survived_df["Sex"].value_counts(normalize=True) * 100
middle_sex_survived_odd = middle_survived_df["Sex"].value_counts(normalize=True) * 100
lower_sex_survived_odd = lower_survived_df["Sex"].value_counts(normalize=True) * 100

print(f"Upper Class Sex Survival: {upper_sex_survived_odd}")
print(f"Middle Class Sex Survival: {middle_sex_survived_odd}")
print(f"Lower Class Sex Survival: {lower_sex_survived_odd}")

# Create a df for each class and sex
upper_female_s_df = upper_survived_df[upper_survived_df.Sex == "female"]
upper_male_s_df = upper_survived_df[upper_survived_df.Sex == "male"]
middle_female_s_df = middle_survived_df[middle_survived_df.Sex == "female"]
middle_male_s_df = middle_survived_df[middle_survived_df.Sex == "male"]
lower_female_s_df = lower_survived_df[lower_survived_df.Sex == "female"]
lower_male_s_df = lower_survived_df[lower_survived_df.Sex == "male"]

Upper Class Sex Survival: Sex
female    66.911765
male      33.088235
Name: proportion, dtype: float64
Middle Class Sex Survival: Sex
female    80.45977
male      19.54023
Name: proportion, dtype: float64
Lower Class Sex Survival: Sex
female    60.504202
male      39.495798
Name: proportion, dtype: float64
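
The percentages above describe the make-up of the survivors (what share of the survivors were female), which is a different quantity from the survival rate of each sex. A minimal sketch of the distinction, on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: 2 women and 8 men in one class (values are made up)
toy = pd.DataFrame({
    "Sex":      ["female"] * 2 + ["male"] * 8,
    "Survived": [1, 1] + [1, 1, 1, 0, 0, 0, 0, 0],
})

# Share of each sex among survivors (the quantity computed in the cell above)
share_of_survivors = (
    toy.loc[toy.Survived == 1, "Sex"].value_counts(normalize=True) * 100
)

# Survival rate within each sex: survivors of that sex / all passengers of that sex
survival_rate_by_sex = toy.groupby("Sex")["Survived"].mean() * 100

print(share_of_survivors)
print(survival_rate_by_sex)
```

In this toy example men are the majority of survivors (60%) even though women survive at a much higher rate (100% versus 37.5%), simply because men outnumber women. Both views are useful, but only the second one is a survival rate.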


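### Age Patterns
- The results of the previous cell show that females made up the majority of survivors in all 3 classes: about 67% in the upper class, 80% in the middle class, and 61% in the lower class.
- These figures are each sex's share of the survivors, not a survival rate per sex. A survival rate would divide by all passengers of that sex in the class.
- Let's see whether age could also have played a part in the survival of a passenger.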

In [237]:
# Let's start by separating the male and female survivors into different age groups.
# e.g: young (age <= 30), middle_age (30 < age <= 60), old (age > 60)
upper_young_f_df = upper_female_s_df[upper_female_s_df.Age <= 30]
upper_young_m_df = upper_male_s_df[upper_male_s_df.Age <= 30]
upper_middle_f_df = upper_female_s_df[upper_female_s_df.Age.between(30,60, inclusive="right")]
upper_middle_m_df = upper_male_s_df[upper_male_s_df.Age.between(30,60, inclusive="right")]
upper_old_f_df = upper_female_s_df[upper_female_s_df.Age > 60]
upper_old_m_df = upper_male_s_df[upper_male_s_df.Age > 60]

#Middle class
middle_young_f_df = middle_female_s_df[middle_female_s_df.Age <= 30]
middle_young_m_df = middle_male_s_df[middle_male_s_df.Age <= 30]
middle_middle_f_df = middle_female_s_df[middle_female_s_df.Age.between(30,60, inclusive="right")]
middle_middle_m_df = middle_male_s_df[middle_male_s_df.Age.between(30,60, inclusive="right")]
middle_old_f_df = middle_female_s_df[middle_female_s_df.Age > 60]
middle_old_m_df = middle_male_s_df[middle_male_s_df.Age > 60]

#Lower class
lower_young_f_df = lower_female_s_df[lower_female_s_df.Age <= 30]
lower_young_m_df = lower_male_s_df[lower_male_s_df.Age <= 30]
lower_middle_f_df = lower_female_s_df[lower_female_s_df.Age.between(30,60, inclusive="right")]
lower_middle_m_df = lower_male_s_df[lower_male_s_df.Age.between(30,60, inclusive="right")]
lower_old_f_df = lower_female_s_df[lower_female_s_df.Age > 60]
lower_old_m_df = lower_male_s_df[lower_male_s_df.Age > 60]


print(f"Upper Class Female between age of 0 and 30 Survival Odds: {(upper_young_f_df.shape[0] / upper_female_s_df.shape[0]) * 100}")
print(f"Upper Class Female between age of 31 and 60 Survival Odds: {(upper_middle_f_df.shape[0] / upper_female_s_df.shape[0]) * 100}")
print(f"Upper Class Female Older than 60 Survival Odds: {(upper_old_f_df.shape[0] / upper_female_s_df.shape[0]) * 100}")
print(f"Upper Class Male between age of 0 and 30 Survival Odds: {(upper_young_m_df.shape[0] / upper_male_s_df.shape[0]) * 100}")
print(f"Upper Class Male between age of 31 and 60 Survival Odds: {(upper_middle_m_df.shape[0] / upper_male_s_df.shape[0]) * 100}")
print(f"Upper Class Male Older than 60 Survival Odds: {(upper_old_m_df.shape[0] / upper_male_s_df.shape[0]) * 100}")

print(f"\nMiddle Class Female between age of 0 and 30 Survival Odds: {(middle_young_f_df.shape[0] / middle_female_s_df.shape[0]) * 100}")
print(f"Middle Class Female between age of 31 and 60 Survival Odds: {(middle_middle_f_df.shape[0] / middle_female_s_df.shape[0]) * 100}")
print(f"Middle Class Female Older than 60 Survival Odds: {(middle_old_f_df.shape[0] / middle_female_s_df.shape[0]) * 100}")
print(f"Middle Class Male between age of 0 and 30 Survival Odds: {(middle_young_m_df.shape[0] / middle_male_s_df.shape[0]) * 100}")
print(f"Middle Class Male between age of 31 and 60 Survival Odds: {(middle_middle_m_df.shape[0] / middle_male_s_df.shape[0]) * 100}")
print(f"Middle Class Male Older than 60 Survival Odds: {(middle_old_m_df.shape[0] / middle_male_s_df.shape[0]) * 100}")


print(f"\nLower Class Female between age of 0 and 30 Survival Odds: {(lower_young_f_df.shape[0] / lower_female_s_df.shape[0]) * 100}")
print(f"Lower Class Female between age of 31 and 60 Survival Odds: {(lower_middle_f_df.shape[0] / lower_female_s_df.shape[0]) * 100}")
print(f"Lower Class Female Older than 60 Survival Odds: {(lower_old_f_df.shape[0] / lower_female_s_df.shape[0]) * 100}")
print(f"Lower Class male between age of 0 and 30 Survival Odds: {(lower_young_m_df.shape[0] / lower_male_s_df.shape[0]) * 100}")
print(f"Lower Class male between age of 31 and 60 Survival Odds: {(lower_middle_m_df.shape[0] / lower_male_s_df.shape[0]) * 100}")
print(f"Lower Class male Older than 60 Survival Odds: {(lower_old_m_df.shape[0] / lower_male_s_df.shape[0]) * 100}")

Upper Class Female between age of 0 and 30 Survival Odds: 36.26373626373626
Upper Class Female between age of 31 and 60 Survival Odds: 51.64835164835166
Upper Class Female Older than 60 Survival Odds: 2.197802197802198
Upper Class Male between age of 0 and 30 Survival Odds: 28.888888888888886
Upper Class Male between age of 31 and 60 Survival Odds: 57.77777777777777
Upper Class Male Older than 60 Survival Odds: 2.2222222222222223

Middle Class Female between age of 0 and 30 Survival Odds: 58.57142857142858
Middle Class Female between age of 31 and 60 Survival Odds: 38.57142857142858
Middle Class Female Older than 60 Survival Odds: 0.0
Middle Class Male between age of 0 and 30 Survival Odds: 58.82352941176471
Middle Class Male between age of 31 and 60 Survival Odds: 23.52941176470588
Middle Class Male Older than 60 Survival Odds: 5.88235294117647

Lower Class Female between age of 0 and 30 Survival Odds: 55.55555555555556
Lower Class Female between age of 31 and 60 Survival Odds: 8.3333

## Alternate Syntax
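- Let's compute the survival rate for every class, sex, and age group in a more concise way with `groupby`. Here `survival_rate` is the mean of `Survived` within each group and `group_size` is the number of passengers in it.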

In [238]:
df["class_group"] = df.Pclass.map({1: "Upper", 2: "Middle", 3: "Lower"})
df["age_group"] = pd.cut(
    df.Age, 
    bins = [-np.inf, 30, 60, np.inf], 
    labels = ["young", "middle", "old"],
    right=True)

grouped_df = (
    df.groupby(["class_group", "Sex", "age_group"], observed=False)
    .agg(survival_rate=('Survived', 'mean'), group_size=('Survived', 'size'))
    .reset_index()
)

new_df = df.merge(grouped_df, on=["class_group", "Sex", "age_group"], how="left")
new_df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,class_group,age_group,survival_rate,group_size
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Lower,young,0.16763,173.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Upper,middle,0.979167,48.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Lower,young,0.506329,79.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Upper,middle,0.979167,48.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Lower,middle,0.118421,76.0
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,Lower,,,
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,Upper,middle,0.412698,63.0
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,Lower,young,0.16763,173.0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,Lower,young,0.506329,79.0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,Middle,young,0.931818,44.0
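
How `pd.cut` treats the boundary ages and missing values can be checked on a few hand-picked ages (these values are illustrative, not taken from the dataset). With `right=True` each bin includes its right edge, so age 30 is "young" and age 60 is "middle". A missing Age falls into no bin, which is why row 5 in the table above has no age_group, survival_rate, or group_size.

```python
import numpy as np
import pandas as pd

# Hand-picked ages around the bin edges, plus one missing value
ages = pd.Series([0.42, 30.0, 30.5, 60.0, 60.5, np.nan])

groups = pd.cut(
    ages,
    bins=[-np.inf, 30, 60, np.inf],
    labels=["young", "middle", "old"],
    right=True,  # intervals are (-inf, 30], (30, 60], (60, inf)
)
print(pd.DataFrame({"Age": ages, "age_group": groups}))
```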


## NaN Values
- First, let's check which columns in the dataset have NaN values.
- Second, let's do some analysis to decide how to fill the NaN values in each column.
- Finally, let's produce a dataframe in which all NaN values are replaced.

In [239]:
# Extract na counts of all columns
na_counts = df.isna().sum().sort_values(ascending=False)
print(na_counts.head(15))

Cabin          687
Age            177
age_group      177
Embarked         2
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Ticket           0
Fare             0
class_group      0
dtype: int64


### NaN Results
- The previous results show that Cabin, Age, and Embarked are the only original columns with NaN values. age_group is derived from Age, so it has the same 177 missing values.
- Cabin has the most NaN values (687), followed by Age (177) and then Embarked (2).
- Let's try to find patterns for these 3 columns that will allow us to fill them as accurately as possible.

## NaN Embarked
- Now let's replace all NaN values in the Embarked column with an appropriate value.

In [240]:
# Fill Embarked with the most common port of embarkation for the passenger's class and sex
embarked_by_class_gender = (
    df.groupby(["class_group", "Sex", "Embarked"], observed=True)
    .size()
    .reset_index(name='n')
)

top_embarkments = (
    embarked_by_class_gender
    .sort_values(["class_group", "Sex", "n", "Embarked"], ascending=[True, True, False, True])
    .drop_duplicates(subset=["class_group", "Sex"], keep="first")
    .rename(columns={'Embarked': 'top_embarked', 'n': 'top_count'})
    .reset_index(drop=True)
)

df = df.merge(top_embarkments, on=["class_group", "Sex"], how="left")
df.Embarked = df.Embarked.fillna(df.top_embarked)
df.drop(columns=["top_count", "top_embarked"], inplace=True)
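
The same "most common value per group" fill can also be sketched with `groupby(...).transform`, which avoids the temporary merge columns. This is only an alternative illustration on made-up data. The `toy` frame and the `most_common` helper are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    "class_group": ["Upper", "Upper", "Upper", "Lower", "Lower", "Lower"],
    "Sex":         ["female", "female", "female", "male", "male", "male"],
    "Embarked":    ["C", "C", np.nan, "S", "Q", "S"],
})

def most_common(values: pd.Series):
    """Return the most frequent non-null value; ties resolve to the smallest value."""
    modes = values.mode()  # mode() ignores NaN and returns its results sorted
    return modes.iloc[0] if not modes.empty else np.nan

# Broadcast each group's most common port back onto that group's rows
group_mode = toy.groupby(["class_group", "Sex"])["Embarked"].transform(most_common)
toy["Embarked"] = toy["Embarked"].fillna(group_mode)
print(toy)
```

The tie-breaking rule (smallest value wins) matches the merge-based cell above, which sorts by `Embarked` ascending after the count.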

## NaN Age
- Now let's replace all NaN values in the Age column with an appropriate value.

In [241]:
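# Fill Age from the ages observed for the passenger's class, sex, and port of embarkation.
# Because Age is sorted before n below, keep="first" selects the smallest observed Age in each group,
# not the most frequent one; sorting by n (descending) ahead of Age would select the most common age instead.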
age_by_class_gender = (
    df.groupby(["class_group", "Sex", "Embarked", "Age"], observed=True)
    .size()
    .reset_index(name='n')
)

top_ages = (
    age_by_class_gender
    .sort_values(["class_group", "Sex", "Embarked", "Age", "n"], ascending=[True, True, True, True, False])
    .drop_duplicates(subset=["class_group", "Sex", "Embarked"], keep="first")
    .rename(columns={"Age": "top_age", "n": "top_count"})
    .reset_index(drop=True)
)

df = df.merge(top_ages, on=["class_group", "Sex", "Embarked"], how="left")
df.Age = df.Age.fillna(df.top_age)
df.drop(columns=["top_count", "top_age"], inplace=True)

# Let's reapply the Age bins to remove NaNs from age_group as well
df["age_group"] = pd.cut(
    df.Age, 
    bins = [-np.inf, 30, 60, np.inf], 
    labels = ["young", "middle", "old"],
    right=True)

# Let's also recalculate the survival rate
grouped_df = (
    df.groupby(["class_group", "Sex", "age_group"], observed=False)
    .agg(survival_rate=('Survived', 'mean'), group_size=('Survived', 'size'))
    .reset_index()
)

df = df.merge(grouped_df, on=["class_group", "Sex", "age_group"], how="left")
df.head(10)

#print(df.isna().sum())

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,class_group,age_group,survival_rate,group_size
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Lower,young,0.142322,267
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Upper,middle,0.979167,48
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Lower,young,0.53719,121
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Upper,middle,0.979167,48
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Lower,middle,0.118421,76
5,6,0,3,"Moran, Mr. James",male,2.0,0,0,330877,8.4583,,Q,Lower,young,0.142322,267
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,Upper,middle,0.412698,63
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,Lower,young,0.142322,267
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,Lower,young,0.53719,121
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,Middle,young,0.934783,46
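
Which row `drop_duplicates(keep="first")` retains depends entirely on the preceding sort order. A minimal sketch on a made-up counts table (the `counts` frame and its values are hypothetical) shows the two orderings side by side:

```python
import pandas as pd

# Hypothetical counts of each Age within one group (values are made up)
counts = pd.DataFrame({
    "group": ["A", "A", "A"],
    "Age":   [2.0, 24.0, 30.0],
    "n":     [1, 5, 3],
})

# Age sorted before n: the first row per group is the smallest age
youngest = (
    counts.sort_values(["group", "Age", "n"], ascending=[True, True, False])
    .drop_duplicates(subset=["group"], keep="first")
)

# n (descending) sorted before Age: the first row per group is the most frequent age
most_frequent = (
    counts.sort_values(["group", "n", "Age"], ascending=[True, False, True])
    .drop_duplicates(subset=["group"], keep="first")
)

print(youngest["Age"].iloc[0], most_frequent["Age"].iloc[0])
```

With these toy counts the first ordering picks age 2.0 and the second picks age 24.0, so the choice of sort keys changes the imputed value substantially.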


## NaN Cabin
- Now let's replace all NaN values in the Cabin column with an appropriate value.

In [242]:
# A passenger's cabin assignment was most likely affected by the passenger's socio-economic class and age.
# To look for a pattern, let's group cabins by socio-economic class, using only the first letter of the cabin.
df.Cabin = df.Cabin.str.strip().str[0]


# Let's extract the passenger count per Cabin, grouped by socio-economic class, age group, and sex
cabin_by_class = (
    df.groupby(["class_group", "age_group", "Sex", "Cabin"], observed=True, dropna=False)
    .size()
    .reset_index(name='n')
)

keys = ['class_group', 'age_group', 'Sex']
valid_cabins = cabin_by_class[cabin_by_class['Cabin'].notna()].copy()

top_cabins_per_group = (
    valid_cabins
      .sort_values(keys + ['n', 'Cabin'], ascending=[True, True, True, False, True])
      .drop_duplicates(subset=keys, keep='first')
      .rename(columns={'Cabin': 'top_cabin', 'n': 'top_count'})
      .reset_index(drop=True)
)

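# Fallback: most common cabin letter per class, summing the counts over age group and sex
top_cabin_per_class = (
    valid_cabins
    .groupby(["class_group", "Cabin"], observed=True)["n"].sum()
    .reset_index()
    .sort_values(["class_group", "n", "Cabin"], ascending=[True, False, True])
    .drop_duplicates(subset=["class_group"], keep="first")
    .rename(columns={'Cabin': 'top_cabin', 'n': 'top_count'})
    .reset_index(drop=True)
)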

# fill NaN Cabin with the group's most common cabin
df = df.merge(top_cabins_per_group[keys + ['top_cabin']], on=keys, how='left')
df['Cabin'] = df['Cabin'].fillna(df['top_cabin'])
df.drop(columns=['top_cabin', 'top_count'], errors='ignore', inplace=True)

if df.Cabin.isna().sum() > 0:
    df = df.merge(top_cabin_per_class[['class_group', 'top_cabin']], on='class_group', how='left')
    df['Cabin'] = df['Cabin'].fillna(df['top_cabin'])
    df.drop(columns=['top_cabin', 'top_count'], errors='ignore', inplace=True)

#df.head(10)
print(df.isna().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin            0
Embarked         0
class_group      0
age_group        0
survival_rate    0
group_size       0
dtype: int64
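
The first step of the Cabin cell, `str.strip().str[0]`, can be checked on a few sample values. In this sketch the first two strings are cabins that appear in the table at the top of the notebook, while the multi-cabin string with leading whitespace is a made-up example.

```python
import numpy as np
import pandas as pd

# "C85" and "C123" appear in the data above; " B57 B59" is a hypothetical value
cabins = pd.Series(["C85", "C123", " B57 B59", np.nan])

# Strip surrounding whitespace, then keep only the first character
deck = cabins.str.strip().str[0]
print(deck.tolist())
```

Missing cabins stay NaN after this step, so they are still available for the group-based fill that follows.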


# Results
- We added 4 engineered features to the dataset: **class_group**, **age_group**, **survival_rate**, and **group_size**.
- **class_group** and **age_group** can be left out. I do not think they will add value to the model, because they represent much the same information as Pclass and Age in a different format.
- I believe **survival_rate** and **group_size** will be the most beneficial of these 4 engineered features.
  - **survival_rate** is the proportion of passengers who survived within a group defined by socio-economic status, sex, and age group.
  - **group_size** is the number of passengers in that group. The idea is that this feature, in combination with survival_rate, gives a better picture of a passenger's chance of survival. For example, if the survival rate for a group is 100% but the group contains only 1 passenger, passing the survival rate to the model without the group size can be misleading.

## Next Steps
- Now let's train the model with and without these 4 engineered features.
- Remember that the statistics behind these features should be calculated on the **TRAIN** dataset only and then applied to the val and test datasets. Do not recompute them from the val or test data, to avoid **data leakage**.
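
One way to follow this rule is to learn the group statistics from the training split and attach them to the other splits with a merge. The sketch below uses made-up `train` and `val` frames and groups by `Pclass` and `Sex` only, to keep it short. The fallback for groups that never occur in the training split (overall training survival rate, group size 0) is an assumption of this sketch, not something the notebook prescribes.

```python
import pandas as pd

keys = ["Pclass", "Sex"]

# Hypothetical train/validation splits (values are made up)
train = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 1],
})
val = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Sex":    ["female", "male", "male"],
})

# Learn the group statistics from the training split only
group_stats = (
    train.groupby(keys)
    .agg(survival_rate=("Survived", "mean"), group_size=("Survived", "size"))
    .reset_index()
)

# Attach them to the validation rows by key; no validation labels are used
val_features = val.merge(group_stats, on=keys, how="left")

# Assumed fallback for groups unseen in train: overall train rate, size 0
val_features["survival_rate"] = val_features["survival_rate"].fillna(train["Survived"].mean())
val_features["group_size"] = val_features["group_size"].fillna(0).astype(int)
print(val_features)
```

The same `group_stats` table would be merged onto the test set, so every split sees statistics derived from the training rows alone.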
