# Machine Learning with the Titanic disaster

Getting familiar with machine learning using the [Titanic dataset from Kaggle](https://www.kaggle.com/competitions/titanic).

We begin by importing the required libraries and functions for ML.

In [95]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

## Importing and cleaning the data

In [96]:
# df = pd.read_csv('/kaggle/input/titanic/train.csv')  # If you are running from Kaggle notebooks
df = pd.read_csv('./Data/train.csv')                   # In my case I worked locally
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


The dataset contains information about the passengers. Some columns can reasonably be assumed to play no part in a passenger's chance of survival: I speculate that `PassengerId`, `Name`, `Ticket` and `Cabin` have no effect. The next cell drops `Name`, `Ticket` and `Cabin`; `PassengerId` stays in the frame but is not used as a feature.

In [97]:
df = df.drop(columns=['Name', 'Ticket', 'Cabin'])  # keyword form; passing the axis positionally is deprecated


We continue by analysing the missing values in our data set.

In [98]:
print('Nº of NaN by column\n{}'.format(df.isna().sum(axis=0)))

n_isna = df.isna().sum(axis=0).max()
pc_isna = n_isna / df.shape[0] * 100

print(f'There are at least {n_isna} passengers with NaN values -> {pc_isna:.1f} %')

Nº of NaN by column
PassengerId      0
Survived         0
Pclass           0
Sex              0
Age            177
SibSp            0
Parch            0
Fare             0
Embarked         2
dtype: int64
There are at least 177 passengers with NaN values -> 19.9 %


We cannot just drop ~20% of our already small training dataset. Instead, we fill in the missing values: each missing `Age` (a float column) is replaced by the mean age of the remaining passengers, and the two missing `Embarked` values (a categorical column) are replaced by the most frequent port.

In [99]:
def cleanDf(Df):
    """Clean the dataset from NaNs. Categorical and integer values are replaced
    by the mode of the column (the most repeated value), while floats are replaced
    by the mean of the column.

    Args:
        Df (DataFrame): A Titanic dataframe

    Returns:
        Dataframe: The same dataframe clean from NaNs
    """
    for column in Df.columns:
        tp = Df[column].dtype
        if tp == 'object' or tp == 'int64':
            Df[column] = Df[column].fillna(Df[column].mode()[0]) 
        else:
            Df[column] = Df[column].fillna(Df[column].mean())
    return Df

df = cleanDf(df)
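
To see what this imputation rule does, here is a minimal sketch on a made-up frame (the values are invented for illustration and are not taken from the Titanic data). It follows the same idea as `cleanDf`: float columns get the mean, every other column gets the mode. The dtype check uses `is_float_dtype` rather than comparing dtype names, which is an assumption made here so the sketch does not depend on how pandas stores strings.

```python
import pandas as pd
from pandas.api.types import is_float_dtype

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Age': [20.0, None, 40.0],      # float column -> NaN filled with the mean (30.0)
    'Embarked': ['S', None, 'S'],   # categorical column -> NaN filled with the mode ('S')
    'Pclass': [3, 1, 3],            # integer column without NaNs -> unchanged
})

filled = toy.copy()  # work on a copy so the original frame is left untouched
for column in filled.columns:
    if is_float_dtype(filled[column]):
        filled[column] = filled[column].fillna(filled[column].mean())
    else:
        filled[column] = filled[column].fillna(filled[column].mode()[0])

print(filled)
```

Unlike `cleanDf`, the sketch copies the frame first, so the caller's data is not modified in place.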

## Exploring the data

**TODO** This section is still pending because I plan to use this data to create a dashboard.

## Normalization/Transformation of data

Now we transform the data so the model can use it. First we separate the categorical columns from the numerical ones, then we build an encoder for each group: one-hot encoding for the categorical columns and min-max scaling for the numerical ones.

In [100]:
def encode(X, num_enc=None, cat_enc=None):
    """Returns the Titanic dataset encoded. Numerical variables are normalized
    and categorical variables are one-hot encoded. Also returns the encoders
    for later use.

    Args:
        X (DataFrame): Dataframe with the Titanic dataset
        num_enc (MinMaxScaler, optional): Fitted scaler to reuse. If None, a
            new one is fitted on X.
        cat_enc (OneHotEncoder, optional): Fitted encoder to reuse. If None, a
            new one is fitted on X.

    Returns:
        tuple: (DataFrame, numerical_encoder, categorical_encoder)
    """
    X_cat = X.loc[:, ['Sex', 'Embarked']]
    if cat_enc is None:
        cat_enc = OneHotEncoder().fit(X_cat)

    X_cat_enc = cat_enc.transform(X_cat).toarray()

    X_num = X.loc[:, ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']]
    if num_enc is None:
        num_enc = MinMaxScaler().fit(X_num)

    X_num_enc = num_enc.transform(X_num)

    X_enc = pd.concat([pd.DataFrame(X_num_enc), pd.DataFrame(X_cat_enc)], axis=1)
    return X_enc, num_enc, cat_enc

X_enc, num_enc, cat_enc = encode(df)
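
As a rough illustration of what the two encoders produce, the sketch below applies them to a tiny invented frame (the `Sex` and `Fare` values are made up). It also shows why the fitted encoders are returned: reusing a fitted scaler applies the minimum and maximum learned from the data it was fitted on.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Fare': [10.0, 20.0, 30.0],
})

# One-hot encoding: one column per category, in sorted order (female, male)
toy_cat_enc = OneHotEncoder().fit(toy[['Sex']])
sex_encoded = toy_cat_enc.transform(toy[['Sex']]).toarray()

# Min-max scaling: (x - min) / (max - min), so the fitted data lands in [0, 1]
toy_num_enc = MinMaxScaler().fit(toy[['Fare']])
fare_scaled = toy_num_enc.transform(toy[['Fare']])

# Reusing the fitted scaler on unseen data keeps the original min and max,
# so a fare above the fitted maximum maps to a value greater than 1
unseen_scaled = toy_num_enc.transform(pd.DataFrame({'Fare': [40.0]}))

print(sex_encoded)
print(fare_scaled.ravel())
print(unseen_scaled.ravel())
```

This is the reason `encode` is later called with `num_enc` and `cat_enc` passed in: the test set is transformed with the parameters fitted on the training set.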

## Fitting and testing the model

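Since we want the model to give a yes/no answer (survived or not), we can use a logistic regression model. We fit it on the encoded training data, then clean the test set and encode it with the encoders that were fitted on the training data.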

In [101]:
log_model = LogisticRegression().fit(X_enc, df['Survived'])

test = cleanDf(pd.read_csv('./Data/test.csv'))

test_enc = encode(test, num_enc=num_enc, cat_enc=cat_enc)[0]

We then use the fitted model to predict the survival of the passengers in the test set.

In [102]:
surv_pred = log_model.predict(test_enc)
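
For intuition on what `predict` returns, the following sketch fits a logistic regression on synthetic one-feature data (invented for illustration) and compares the hard labels with the estimated probabilities. For a binary model, the predicted label corresponds to thresholding the class-1 probability at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic one-feature data, invented for illustration only
X_toy = np.array([[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression().fit(X_toy, y_toy)

X_new = np.array([[0.05], [0.95]])
proba = toy_model.predict_proba(X_new)[:, 1]  # estimated probability of class 1
labels = toy_model.predict(X_new)             # hard 0/1 labels

# The hard labels agree with thresholding the probability at 0.5
thresholded = (proba > 0.5).astype(int)
print(proba, labels, thresholded)
```

If the probabilities themselves are of interest, for example to rank passengers by estimated survival chance, `predict_proba` can be called on `test_enc` in the same way.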

Finally, we create the CSV file to submit to Kaggle.

In [103]:
# Build the submission frame directly and let pandas write the file
results = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': surv_pred})
results.to_csv('./surv_prediction.csv', index=False)

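When submitted to Kaggle, the resulting CSV scores an accuracy of 0.76555, meaning about 76.6% of the test passengers are classified correctly.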
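
Accuracy here is the fraction of passengers whose predicted label matches the true one. The test labels are not part of `test.csv`, so the score cannot be reproduced locally; the sketch below only shows the calculation on invented labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented labels, for illustration only
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1])

# Accuracy = number of correct predictions / number of predictions
manual_accuracy = (y_true == y_pred).mean()
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, sklearn_accuracy)
```

Three of the four invented predictions are correct, so both calculations give 0.75.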