<a href="https://www.kaggle.com/code/emiliocorts/titanic-eda-w-altair-predictions-with-pipeline?scriptVersionId=102390515" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Surviving the Titanic

#### Column info:

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import altair as alt


from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler, LabelBinarizer


from sklearn.compose import ColumnTransformer

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score



In [2]:
train = pd.read_csv("/kaggle/input/titanic/train.csv")
test = pd.read_csv("/kaggle/input/titanic/test.csv")

## Let's explore the dataset

In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [4]:
test.info()
test_id = test.PassengerId


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


In [5]:
train.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [6]:
train.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [7]:
train.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [8]:
test.isna().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

### Conclusions

We have both categorical columns and a lot of missing values. Let's see what we can do about that.

The train dataset has 891 entries, and 687 of them (about 77%) have no value in the 'Cabin' column, which makes it hard to work with. Things are no better in the test dataset, where 327 of 418 rows (about 78%) are missing 'Cabin'.

I've decided to drop this column in both datasets and to fill the missing values of the other columns with their mean (numeric columns) or mode (categorical columns).
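The fill strategy described above can be sketched directly in pandas. This is a minimal illustration on a made-up frame: the `toy` data and the helper name `fill_missing` are assumptions for the example, not part of the notebook, which does its imputation later inside a scikit-learn pipeline.

```python
import numpy as np
import pandas as pd


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with their mean and other columns with their mode."""
    out = df.copy()
    for col in out.columns:
        if out[col].isna().sum() == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            # mode() ignores missing values; take the first mode on ties
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Hypothetical rows, only to show the behaviour
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Embarked": ["S", None, "S"],
})
filled = fill_missing(toy)
print(filled)
```

Computing the mean and mode on the whole frame like this is fine for exploration, but for modelling the statistics should come from the training rows only, which is what fitting an imputer inside a pipeline gives you.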

###  EDA

In [9]:
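# Restrict to numeric columns; recent pandas versions raise an error on string columns otherwise
train.select_dtypes(include=np.number).corr()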

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
PassengerId,1.0,-0.005007,-0.035144,0.036847,-0.057527,-0.001652,0.012658
Survived,-0.005007,1.0,-0.338481,-0.077221,-0.035322,0.081629,0.257307
Pclass,-0.035144,-0.338481,1.0,-0.369226,0.083081,0.018443,-0.5495
Age,0.036847,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067
SibSp,-0.057527,-0.035322,0.083081,-0.308247,1.0,0.414838,0.159651
Parch,-0.001652,0.081629,0.018443,-0.189119,0.414838,1.0,0.216225
Fare,0.012658,0.257307,-0.5495,0.096067,0.159651,0.216225,1.0
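To read the matrix with the target in mind, it can help to pull out only the `Survived` row and sort it by absolute value. The sketch below re-enters that row by hand (rounded to three decimals) instead of loading the CSV, so it is an illustration rather than a cell from the notebook.

```python
import pandas as pd

# Correlations with Survived, copied (rounded) from the matrix above
corr_with_survived = pd.Series({
    "PassengerId": -0.005,
    "Pclass": -0.338,
    "Age": -0.077,
    "SibSp": -0.035,
    "Parch": 0.082,
    "Fare": 0.257,
})

# Rank features by the strength of the linear relationship, ignoring sign
order = corr_with_survived.abs().sort_values(ascending=False).index
ranked = corr_with_survived.reindex(order)
print(ranked)
```

Pclass comes first and Fare second. Keep in mind these are linear correlations on numeric columns only, so `Sex` does not appear here at all.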


### Little tweak

Before moving on, we'll add an Age group column which will help us improve data visualizations.

In [10]:
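def checkAgeGroup(age):
    # Note: the labels do not match the cut-offs exactly. '12-17' covers
    # ages 12 to under 17, '18-19' covers 17 to under 20, '30-59' covers
    # 20 to under 59, and '60+' covers 59 and over.
    # A missing age (NaN) fails every comparison and also returns '60+'.
    if age < 12:
        return '0-11'
    elif age < 17:
        return '12-17'
    elif age < 20:
        return '18-19'
    elif age < 59:
        return '30-59'
    else:
        return '60+'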
        

In [11]:
train['Age_Group'] = [checkAgeGroup(row) for row in train.Age]
train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_Group
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,30-59
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,30-59
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,30-59
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,30-59
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,30-59
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,30-59
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,18-19
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,60+
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,30-59


In [12]:
train.Age_Group.value_counts()

30-59    522
60+      205
0-11      68
18-19     64
12-17     32
Name: Age_Group, dtype: int64
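One thing to keep in mind when reading these counts: in `checkAgeGroup` a missing age fails every comparison and lands in the final `else`, so the 205 passengers in `60+` include the 177 rows with no recorded age. A hedged alternative, not what the notebook ran, is `pd.cut`, which keeps missing ages as missing and makes the bin edges explicit. The edges and labels below are assumptions chosen for illustration, and using them would change the counts shown above.

```python
import numpy as np
import pandas as pd

# Assumed bin edges: [0, 12), [12, 18), [18, 30), [30, 60), [60, inf)
bins = [0, 12, 18, 30, 60, np.inf]
labels = ["0-11", "12-17", "18-29", "30-59", "60+"]

# Hypothetical ages, including one missing value
ages = pd.Series([0.75, 14.0, 19.0, 35.0, 71.0, np.nan], name="Age")

# right=False makes each bin closed on the left, so 12.0 falls in "12-17"
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_group.tolist())
```

The result is an ordered categorical, which also means charts can sort the groups by age instead of alphabetically.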

As you'd expect, ticket class and fare matter most: in the correlation matrix above, Survived correlates most strongly with Pclass (-0.34) and then with Fare (0.26).

In [13]:
train.Sex.value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [14]:
girls = train[train.Sex == 'female']
guys = train[train.Sex == 'male']


In [15]:
girls[girls['Survived'] == 1]


# We can see that 233 out of 314 or 74% of girls made it.

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_Group
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,30-59
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,30-59
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,30-59
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,30-59
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,12-17
...,...,...,...,...,...,...,...,...,...,...,...,...,...
874,875,1,2,"Abelson, Mrs. Samuel (Hannah Wizosky)",female,28.0,1,0,P/PP 3381,24.0000,,C,30-59
875,876,1,3,"Najib, Miss. Adele Kiamie ""Jane""",female,15.0,0,0,2667,7.2250,,C,12-17
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C,30-59
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0000,,S,30-59


In [16]:
# We can see that only 109 out of 577, or about 19%, of men made it out.

guys[guys.Survived ==1 ]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_Group
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S,60+
21,22,1,2,"Beesley, Mr. Lawrence",male,34.0,0,0,248698,13.0000,D56,S,30-59
23,24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5000,A6,S,30-59
36,37,1,3,"Mamee, Mr. Hanna",male,,0,0,2677,7.2292,,C,60+
55,56,1,1,"Woolner, Mr. Hugh",male,,0,0,19947,35.5000,C52,S,60+
...,...,...,...,...,...,...,...,...,...,...,...,...,...
838,839,1,3,"Chip, Mr. Chang",male,32.0,0,0,1601,56.4958,,S,30-59
839,840,1,1,"Marechal, Mr. Pierre",male,,0,0,11774,29.7000,C47,C,60+
857,858,1,1,"Daly, Mr. Peter Denis",male,51.0,0,0,113055,26.5500,E17,S,30-59
869,870,1,3,"Johnson, Master. Harold Theodor",male,4.0,1,1,347742,11.1333,,S,0-11


In [17]:
alt.Chart(train).mark_circle().encode(
    x = 'Age',
    y = 'Fare',
    color = 'Survived:N',
    column = 'Age_Group:O'
)

### Money Money Money 
The privilege of money shows more the older you are. The fare doesn't significantly affect survival for the younger passengers, but as they get older its influence becomes more obvious, until it comes full circle in the 60+ age group, where it no longer seems to have an impact.

In [18]:
alt.Chart(train).mark_circle().encode(
    x = 'Age',
    y = 'Fare',
    color = 'Survived:N',
    column = 'Sex:N'
)

In [19]:
survived = train.groupby(['Sex','Survived']).size().reset_index(name='count_survived')
survived

Unnamed: 0,Sex,Survived,count_survived
0,female,0,81
1,female,1,233
2,male,0,468
3,male,1,109


In [20]:
alt.Chart(survived).mark_bar().encode(
    x='Survived:N', 
    y=alt.Y('count_survived:Q',title='Number of passengers'),
    color='Sex:N',        
    column='Sex'        
).properties(
    width=150,
    height=250
)



### No male privilege here. 

From the plots above we can see that being a woman gave a big boost to your chances of surviving the disaster: 74% of women made it out. Men had a rougher time, as only about 19% of them survived the Titanic.
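The percentages quoted above follow from the counts table. As a quick check, the sketch below re-enters those four counts by hand and divides each by its group total; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the `survived` table above
counts = pd.DataFrame({
    "Sex": ["female", "female", "male", "male"],
    "Survived": [0, 1, 0, 1],
    "count_survived": [81, 233, 468, 109],
})

# Share of each outcome within its sex: count / total passengers of that sex
totals = counts.groupby("Sex")["count_survived"].transform("sum")
counts["share"] = (counts["count_survived"] / totals).round(3)
print(counts)
```

On the raw frame the same rates come from `train.groupby('Sex')['Survived'].mean()`, since `Survived` is coded as 0/1.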

In [21]:
pSurvived = train.groupby(['Pclass','Survived']).size().reset_index(name='count_survived')
pSurvived

Unnamed: 0,Pclass,Survived,count_survived
0,1,0,80
1,1,1,136
2,2,0,97
3,2,1,87
4,3,0,372
5,3,1,119


In [22]:
train.Pclass.value_counts()


3    491
1    216
2    184
Name: Pclass, dtype: int64

In [23]:
print(girls[girls.Pclass == 1].shape)
print(guys[guys.Pclass == 1].shape)


# 94 women and 122 men in Pclass 1

(94, 13)
(122, 13)


In [24]:
alt.Chart(train).mark_bar().encode(
    alt.Y('Pclass:N'),
    alt.X('count(Pclass):Q',stack="normalize"),
    color = 'Survived:N',
    column = 'Sex:N'
)

We can observe that Pclass 1 passengers had a bigger chance of seeing their families again. Even if you were a man, you'd have had a better chance than men in the other classes. Women in Pclass 2 fared far better than their male counterparts. For both sexes alike, Pclass 3 was the worst.

In [25]:
alt.Chart(pSurvived).mark_bar().encode(
    x='Survived:N', 
    y=alt.Y('count_survived:Q',title='Number of passengers'),
    color='Pclass:N',        
    column='Pclass:N'        
).properties(
    width=150,
    height=250
)



Being in Pclass 1 gave you the best chance of making it: 136 of 216 passengers (63%) survived. Pclass 2 was close to a 50/50 chance (87 of 184, or 47%), and Pclass 3 gave you the lowest chance of survival (119 of 491, or 24%).
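The per-class survival rates can be derived from the `pSurvived` counts by pivoting them into one row per class and normalising each row. This is a sketch with the six counts re-entered by hand; `p_counts`, `wide` and `rates` are names made up for the example.

```python
import pandas as pd

# Counts copied from the `pSurvived` table above
p_counts = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Survived": [0, 1, 0, 1, 0, 1],
    "count_survived": [80, 136, 97, 87, 372, 119],
})

# One row per class, one column per outcome
wide = p_counts.pivot(index="Pclass", columns="Survived", values="count_survived")

# Divide each row by its total to get the share of each outcome per class
rates = wide.div(wide.sum(axis=1), axis=0).round(3)
print(rates)
```

The column labelled `1` holds the survival rate of each class, which makes the drop from first to third class easy to read off.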


#### Conclusions

### Who had the biggest chance of making it?

This one was an easy answer as being a woman gave you a solid 74% chance of making it out alive. Especially if you were in Pclass 1.

## Predicting Survival

I'll drop the Age_Group column from the train DataFrame, since it was only needed for the plots.


In [26]:
train.drop('Age_Group',axis = 'columns',inplace = True)

In [27]:
train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Next step is to select which columns to remove globally.

In [28]:
features_drop = ['PassengerId','Name','Ticket','Cabin']

We drop the columns

In [29]:
train = train.drop(features_drop, axis = 'columns')
test = test.drop(features_drop,axis='columns')

Divide the columns into numerical and nominal data.

In [30]:
numerical = train.select_dtypes(include=np.number).columns.tolist()
numerical.remove('Survived')
nominal = train.select_dtypes(exclude = np.number).columns.tolist()

In [31]:
numerical

['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

In [32]:
nominal

['Sex', 'Embarked']

#### Data split to train our model

In [33]:
X = train[numerical + nominal]
y = train['Survived']

In [34]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state = 42)
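The split above is purely random. Because survivors are the minority class, one common variation, not used in this notebook, is to pass `stratify=y` so that both parts keep roughly the same survival rate. The sketch below uses synthetic labels with the same class balance as the training set and a placeholder feature; `X_demo` and `y_demo` are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the training set's class balance:
# 342 survivors (233 women + 109 men) out of 891 passengers
y_demo = np.array([1] * 342 + [0] * 549)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification the two survival rates differ only by rounding
print(len(y_tr), len(y_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Whether this matters in practice depends on the split; with 891 rows a random split is usually close to balanced already.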

### Pipeline time 

In [35]:
nominal_pipeline = Pipeline([
    ('imputer',SimpleImputer(strategy='most_frequent')),
    ('encoder',OneHotEncoder(drop='first'))
])

numeric_pipeline = Pipeline([
    ('imputer',SimpleImputer(strategy='mean')),
    ('scaler',StandardScaler())
])

preprocessing_pipeline = ColumnTransformer([
    ('nominal_preprocessor', nominal_pipeline,nominal),
    ('numeric_preprocessor', numeric_pipeline,numerical)
])
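To see what the preprocessing step produces, here is a self-contained sketch that rebuilds the same transformer and applies it to a handful of made-up rows. The `toy` frame and its values are assumptions for the example; the column names and pipeline steps mirror the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

nominal = ["Sex", "Embarked"]
numerical = ["Pclass", "Age", "SibSp", "Parch", "Fare"]

preprocessor = ColumnTransformer([
    ("nominal", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(drop="first")),
    ]), nominal),
    ("numeric", Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), numerical),
])

# Hypothetical rows covering both sexes and all three ports, with some gaps
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Embarked": ["S", "C", np.nan, "Q", "S"],
    "Pclass": [3, 1, 3, 2, 3],
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0],
    "SibSp": [1, 1, 0, 0, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 51.86],
})

transformed = preprocessor.fit_transform(toy)
# The result may be sparse or dense depending on its density
dense = transformed.toarray() if hasattr(transformed, "toarray") else transformed

# 1 column for Sex + 2 for Embarked (first level dropped) + 5 numeric = 8
print(dense.shape)
```

Because `drop='first'` removes one level per nominal column, a two-level column such as `Sex` ends up as a single 0/1 indicator.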

#### Which model to use?

In [36]:


knn = KNeighborsClassifier(n_neighbors=3)
arbol  = DecisionTreeClassifier(criterion="entropy", random_state=42)
bosque = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=42)
svm = SVC(kernel="linear", random_state=42)

list_models =[knn,arbol,bosque,svm]



In [37]:

def check_model_stats(list_models):
    # Fit each candidate inside the full pipeline and report hold-out metrics
    for model in list_models:
        complete_pipeline = Pipeline([
                ('preprocessor',preprocessing_pipeline),
                ('classifier', model)
            ])
        complete_pipeline.fit(X_train,y_train)
        y_pred_test = complete_pipeline.predict(X_test)

        print(model , ' : Accuracy Test data:',accuracy_score(y_test, y_pred_test))
        print(model , ' : Precision - Test data: ',precision_score(y_test, y_pred_test)) # Of those predicted as survivors, how many actually survived
        print(model , ' : Recall - Test data: ',recall_score(y_test, y_pred_test)) # Of the actual survivors, how many were predicted as survivors
        print(model , ' : f1 - Test data: ',f1_score(y_test, y_pred_test)) # Harmonic mean of precision and recall






check_model_stats(list_models)


KNeighborsClassifier(n_neighbors=3)  : Accuracy Test data: 0.8268156424581006
KNeighborsClassifier(n_neighbors=3)  : Precision - Test data:  0.8115942028985508
KNeighborsClassifier(n_neighbors=3)  : Recall - Test data:  0.7567567567567568
KNeighborsClassifier(n_neighbors=3)  : f1 - Test data:  0.7832167832167832
DecisionTreeClassifier(criterion='entropy', random_state=42)  : Accuracy Test data: 0.7932960893854749
DecisionTreeClassifier(criterion='entropy', random_state=42)  : Precision - Test data:  0.7466666666666667
DecisionTreeClassifier(criterion='entropy', random_state=42)  : Recall - Test data:  0.7567567567567568
DecisionTreeClassifier(criterion='entropy', random_state=42)  : f1 - Test data:  0.7516778523489932
RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=42)  : Accuracy Test data: 0.8268156424581006
RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=42)  : Precision - Test data:  0.8028169014084507
RandomForestClassifier(crite

##### Model Performance 
We can see that the KNeighborsClassifier had the best score overall (accuracy 0.83 and F1 0.78 on the held-out split). We'll use it for our submission to the competition.
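A single 80/20 split can be noisy, and `cross_val_score` is imported at the top of the notebook but never called. As a hedged sketch of how it could be used to compare models, the block below runs 5-fold cross-validation on a reduced pipeline. The data is randomly generated (it is not the Titanic data), so the scores themselves mean nothing; only the pattern is of interest.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 200

# Synthetic stand-in for the training frame (random values, not Titanic data)
X_demo = pd.DataFrame({
    "Sex": rng.choice(["male", "female"], size=n),
    "Pclass": rng.integers(1, 4, size=n),
    "Fare": rng.uniform(5, 100, size=n),
})
y_demo = rng.integers(0, 2, size=n)

pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("nominal", OneHotEncoder(drop="first"), ["Sex"]),
        ("numeric", StandardScaler(), ["Pclass", "Fare"]),
    ])),
    ("classifier", KNeighborsClassifier(n_neighbors=3)),
])

# Five folds; the preprocessing is refit on the training part of each fold
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

Passing the whole pipeline, rather than pre-transformed data, keeps the imputation and scaling statistics from leaking across folds.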

In [38]:
complete_pipeline_knn = Pipeline([
                ('preprocessor',preprocessing_pipeline),
                ('classifier', knn)
            ])
complete_pipeline_knn.fit(X_train,y_train)
# Predict on the competition test set (it has no 'Survived' column)
y_pred_test = complete_pipeline_knn.predict(test)

In [39]:
def download_output(y_pred, name):
    # Build the submission frame from the predictions passed in,
    # not from the global y_pred_test
    output = pd.DataFrame({'PassengerId': test_id,
                           'Survived': y_pred})
    output.to_csv(name, index=False)

In [40]:
download_output(y_pred_test,'submission.csv')