# Surviving the Titanic

#### Column info:

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

In [136]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import altair as alt


from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler, LabelBinarizer


from sklearn.compose import ColumnTransformer

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score



In [137]:
train = pd.read_csv("/kaggle/input/titanic/train.csv")
test = pd.read_csv("/kaggle/input/titanic/test.csv")

## Let's explore the dataset

In [138]:
train.info()

In [139]:
test.info()
test_id = test.PassengerId


In [140]:
train.tail()

In [141]:
train.dtypes

In [142]:
train.isna().sum()

In [143]:
test.isna().sum()

### Conclusions

We can see that we have categorical values and a lot of missing values as well. Let's see what we can do about that.

The train dataset has 891 entries, and more than half of the values in the 'Cabin' column are null, which makes that column hard to work with. Things are no better in the test dataset, which has 418 rows and 327 null values in 'Cabin'.

I've decided to drop this column in both datasets and to fill the missing values of the other columns with their mean or mode, depending on their dtype.
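The fill-by-dtype idea can be sketched in plain pandas. This is only an illustration on a made-up toy frame (the names `toy` and `filled` are not part of the notebook); the notebook itself does the imputation later inside a scikit-learn pipeline.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", np.nan, "S"],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Drop the mostly-empty column, then fill the rest depending on dtype.
filled = toy.drop(columns=["Cabin"])
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Numeric column: fill gaps with the column mean.
        filled[col] = filled[col].fillna(filled[col].mean())
    else:
        # Non-numeric column: fill gaps with the most frequent value.
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```

Computing the mean or mode on the full frame is fine for exploration, but for modelling it is safer to learn those values from the training split only, which is what the pipeline approach takes care of.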

### EDA

In [144]:
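# Restrict to numeric columns; recent pandas versions raise an error on text columns here
train.select_dtypes(include=np.number).corr()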

### Little tweak

Before moving on, we'll add an Age group column which will help us improve data visualizations.

In [145]:
def checkAgeGroup(age):
    # Missing ages get their own label instead of falling into '60+'
    if pd.isna(age):
        return 'Unknown'
    elif age < 12:
        return '0-11'
    elif age < 18:
        return '12-17'
    elif age < 30:
        return '18-29'
    elif age < 60:
        return '30-59'
    else:
        return '60+'
        

In [146]:
train['Age_Group'] = [checkAgeGroup(row) for row in train.Age]
train

In [147]:
train.Age_Group.value_counts()
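As an alternative to a hand-written function, the same binning could be done with `pd.cut`. This is a minimal sketch on a made-up series, assuming the intended bands are 0-11, 12-17, 18-29, 30-59 and 60+; `ages`, `bins`, `labels` and `age_group` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up ages chosen to sit on the band boundaries, plus one missing value.
ages = pd.Series([0.5, 11.0, 12.0, 17.9, 18.0, 29.0, 30.0, 59.0, 60.0, np.nan])

# right=False makes the bins left-closed: [0, 12), [12, 18), [18, 30), [30, 60), [60, inf)
bins = [0, 12, 18, 30, 60, np.inf]
labels = ["0-11", "12-17", "18-29", "30-59", "60+"]
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

print(age_group.value_counts(dropna=False))
```

With `pd.cut`, a missing age stays missing rather than being assigned to one of the bands, and the result is an ordered categorical, which keeps the groups in a sensible order in plots.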

Going back to the correlation matrix: as you'd expect, Fare is the column with the strongest positive correlation with Survived.

In [148]:
train.Sex.value_counts()

In [149]:
girls = train[train.Sex == 'female']
guys = train[train.Sex == 'male']


In [150]:
girls[girls['Survived'] == 1]


# We can see that 233 out of 314 women, or 74%, made it.

In [151]:
# We can see that only 109 out of 577 men, or about 19%, made it out.

guys[guys.Survived == 1]

In [152]:
alt.Chart(train).mark_circle().encode(
    x = 'Age',
    y = 'Fare',
    color = 'Survived:N',
    column = 'Age_Group:O'
)

### Money Money Money 
The privilege of money shows more the older you are. Fare doesn't seem to have a significant impact on survival for the younger passengers, but as passengers get older its influence becomes more obvious. It then comes full circle in the 60+ age group, where fare no longer seems to have an impact.

In [153]:
alt.Chart(train).mark_circle().encode(
    x = 'Age',
    y = 'Fare',
    color = 'Survived:N',
    column = 'Sex:N'
)

In [154]:
survived = train.groupby(['Sex','Survived'])[['Survived']].agg(count_survived=('Survived', 'count')).reset_index()
survived

In [155]:
alt.Chart(survived).mark_bar().encode(
    x='Survived:N', 
    y=alt.Y('count_survived:Q',title='Number of passengers'),
    color='Sex:N',        
    column='Sex'        
).properties(
    width=150,
    height=250
)



### No male privilege here. 

From the plots above we can see that being a woman gave a big boost to your chances of surviving the disaster (74% of women made it out). Men had a rougher time: only about 19% of them survived the Titanic.
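These percentages can be checked without counting rows by hand. The sketch below rebuilds just the two relevant columns from the counts quoted in the comments above (233 of 314 women and 109 of 577 men survived); `toy` and `rate_by_sex` are illustrative names, and on the real data the same `groupby` would run directly on `train`.

```python
import pandas as pd

# Rebuild Sex and Survived from the counts quoted above.
sex = ["female"] * 314 + ["male"] * 577
survived = [1] * 233 + [0] * (314 - 233) + [1] * 109 + [0] * (577 - 109)
toy = pd.DataFrame({"Sex": sex, "Survived": survived})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex.round(3))
```

This gives roughly 0.742 for women and 0.189 for men.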

In [156]:
pSurvived = train.groupby(['Pclass','Survived'])[['Survived']].agg(count_survived=('Survived', 'count')).reset_index()
pSurvived

In [157]:
train.Pclass.value_counts()


In [158]:
print(girls[girls.Pclass == 1].shape)
print(guys[guys.Pclass == 1].shape)


# 94 women and 122 men in Pclass 1

In [159]:
alt.Chart(train).mark_bar().encode(
    alt.Y('Pclass:N'),
    alt.X('count(Pclass):Q',stack="normalize"),
    color = 'Survived:N',
    column = 'Sex:N'
)

We can observe that Pclass 1 passengers had a better chance of seeing their families again; even as a man, you'd have had a better chance than your male peers in the other classes. Women in Pclass 2 had it far better than their male counterparts. For both sexes, Pclass 3 was the worst.

In [160]:
alt.Chart(pSurvived).mark_bar().encode(
    x='Survived:N', 
    y=alt.Y('count_survived:Q',title='Number of passengers'),
    color='Pclass:N',        
    column='Pclass:N'        
).properties(
    width=150,
    height=250
)



We can see that being in Pclass 1 gave you a bigger chance of making it, being in Pclass 2 gave you roughly a 50/50 chance, and being in Pclass 3 gave you the lowest chance of survival.
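To read these chances as numbers rather than bar heights, a normalized cross-tabulation is one option. The rows below are made up purely to show the shape of the result (they are not the Titanic counts); on the real data the same call would take `train["Pclass"]` and `train["Survived"]`.

```python
import pandas as pd

# Made-up rows, only to illustrate the technique.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# normalize="index" turns each row of counts into row-wise proportions,
# so column 1 holds the survival rate for each class.
rates = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print(rates)
```

Passing a list of two columns, such as sex and class, as the first argument would give the rate per combination in the same way.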


#### Conclusions

### Who had the biggest chance of making it?

This one was an easy answer: being a woman gave you a solid 74% chance of making it out alive, especially if you were in Pclass 1.

## Predicting Survival

I'll drop the Age_Group column from my train df.


In [161]:
train.drop('Age_Group',axis = 'columns',inplace = True)

In [162]:
train.columns

Next step is to select which columns to remove globally.

In [163]:
features_drop = ['PassengerId','Name','Ticket','Cabin']

Then we drop them from both datasets.

In [164]:
train = train.drop(features_drop, axis = 'columns')
test = test.drop(features_drop,axis='columns')

Divide the columns into numerical vs nominal data.

In [165]:
numerical = train.select_dtypes(include=np.number).columns.tolist()
numerical.remove('Survived')
nominal = train.select_dtypes(exclude = np.number).columns.tolist()

In [166]:
numerical

In [167]:
nominal

#### Data split to train our model

In [168]:
X = train[numerical + nominal]
y = train['Survived']

In [169]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state = 42)

### Pipeline time 

In [170]:
nominal_pipeline = Pipeline([
    ('imputer',SimpleImputer(strategy='most_frequent')),
    ('encoder',OneHotEncoder(drop='first'))
])

numeric_pipeline = Pipeline([
    ('imputer',SimpleImputer(strategy='mean')),
    ('scaler',StandardScaler())
])

preprocessing_pipeline = ColumnTransformer([
    ('nominal_preprocessor', nominal_pipeline,nominal),
    ('numeric_preprocessor', numeric_pipeline,numerical)
])
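To see what such a preprocessor actually produces, here is a small sketch that applies the same kind of `ColumnTransformer` to a made-up two-column frame. The frame `toy` and the names `toy_preprocessor` and `out` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up frame with one nominal and one numeric column, each with a gap.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", np.nan],
    "Age": [20.0, 30.0, np.nan, 40.0],
})

toy_preprocessor = ColumnTransformer([
    ("nominal", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(drop="first")),
    ]), ["Sex"]),
    ("numeric", Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), ["Age"]),
])

out = toy_preprocessor.fit_transform(toy)
# The result may be sparse or dense depending on its density; make it dense for printing.
out = out.toarray() if hasattr(out, "toarray") else np.asarray(out)
print(out)
```

The missing `Sex` is filled with the most frequent value and encoded as a single indicator column (the first category is dropped), while the missing `Age` is filled with the mean and therefore ends up at 0 after scaling.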

#### Which model to use?

In [171]:


knn = KNeighborsClassifier(n_neighbors=3)
arbol  = DecisionTreeClassifier(criterion="entropy", random_state=42)
bosque = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=42)
svm = SVC(kernel="linear", random_state=42)

list_models =[knn,arbol,bosque,svm]



In [172]:

def check_model_stats(list_models):
    for model in list_models:
        complete_pipeline = Pipeline([
                ('preprocessor',preprocessing_pipeline),
                ('classifier', model)
            ])
        complete_pipeline.fit(X_train,y_train)
        # Score on the held-out split only
        y_pred_test = complete_pipeline.predict(X_test)

        print(model , ' : Accuracy - Test data: ',accuracy_score(y_test, y_pred_test))
        print(model , ' : Precision - Test data: ',precision_score(y_test, y_pred_test)) # Of those predicted positive, how many really were positive
        print(model , ' : Recall - Test data: ',recall_score(y_test, y_pred_test)) # Of the actual positives, how many were predicted positive
        print(model , ' : f1 - Test data: ',f1_score(y_test, y_pred_test)) # Harmonic mean of precision and recall






check_model_stats(list_models)


##### Model Performance 
We can see that the KNeighborsClassifier had the best score overall. We'll use it for our submission to the competition.
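A single 80/20 split can make the ranking between models depend on which rows happened to land in the test set. As a hedged sketch of a more stable comparison, the block below uses `cross_val_score` on synthetic stand-in data from `make_classification`; `X_demo`, `y_demo`, `candidates` and `cv_scores` are illustrative names, and in the notebook the pipeline would wrap `preprocessing_pipeline` and be scored on `X` and `y` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Titanic dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

candidates = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "tree": DecisionTreeClassifier(criterion="entropy", random_state=42),
}

cv_scores = {}
for name, model in candidates.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("classifier", model)])
    # 5-fold CV: each fold is held out once, and the scaler is refit on the other four.
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Because the preprocessing sits inside the pipeline, nothing from the held-out fold leaks into the imputation or scaling step.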

In [173]:
complete_pipeline_knn = Pipeline([
                ('preprocessor',preprocessing_pipeline),
                ('classifier', knn)
            ])
complete_pipeline_knn.fit(X_train,y_train)
y_pred = complete_pipeline_knn.predict(X_train)
y_pred_test = complete_pipeline_knn.predict(test)

In [174]:
def download_output(y_pred, name):
    # Use the predictions passed in, not the global y_pred_test
    output = pd.DataFrame({'PassengerId': test_id,
                           'Survived': y_pred})
    output.to_csv(name, index=False)

In [175]:
download_output(y_pred_test,'submission.csv')