# Introduction/Project Overview:
In this notebook, I will present my solution and analysis of the Titanic dataset. This is a very famous dataset that can be found on [kaggle](https://www.kaggle.com/c/titanic/data). The dataset contains demographics of the Titanc passengers, incluiding who survived and who did not. The goal is to build a model that can correctly classify new examples (check who will survive or not). Throughout this notebook I will visualize the data, explain some data preprocessing techniques, construct and evaluate models and analyze the results. 

#### Data Exploration & Preprocessing:  
I will go over the dataset, analyzing its various features, checking for missing values, and gaining insights into the distribution of variables. Prior to building the models, I will preprocess the data by handling missing values, encoding categorical variables, and scaling numerical features to ensure good model performance.

#### Model Building & Evaluation:
In this notebook I will try implement several models to try and correctly classify passengers who survied. This is a supervised learning task as we are given the labels of who survived and who did not. For this project the models I have chosen are logistic regression, decision trees, random forests, support vector machines and neural networks. For each of these modules I will evalute their peformance using f1score, confusion matrices, and overall accuracy. 

#### Conclusion: 
Finally, I will interpret the results of the models, identifying significant factors that contribute to passenger survival prediction and discussing potential areas for model improvement. The Titanic dataset is a good challange to test your knowledege on machine learning. This will serve as a good test for me to keep learning and testing my skills.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Data Exploration
As mentioned earlier I got the dataset from kaggle. The link to that can be found above. The download came with two csv files. One for the training set and one for the test set. Since I have it locally on my computer I can eassily access the data as shown below. Some of the first steps we will do before creating a model is to see what our data looks like.

In [2]:
# read train and test sets
train = pd.read_csv('./train.csv') 
test = pd.read_csv('./test.csv') 

We loaded in the data into pandas dataframes and now we want to see what our data looks like. What does vairables does it contain and what data types, etc. 

In [3]:
train.info() # get info on our train 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


From above we see that we have a total of 891 samples. This is a good ammount as it is a big enohgh ammount for the model to learn from but not too large where we would require lots of computing power and time for training. We also see that we have 12 columns.`Age` has 177 missing values which is a big ammount. `Cabin` has a lot of missing values too. 

What we will do now is try and fill in the missing values for age so lets try to explore our data first to see what it contains. 

In [4]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Lets count our missing values

In [5]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

As we saw `Age` has 177 missing values. `Cabin` has 2 however it might not be very useful when making a prediction on who survives the titanic. Lets breakdown each column, what they mean and if we will keep them for constructing our model. 

`PassengderId`: This column is not useful because its just a id assigned by the dataset. We will drop this column before training. 

`Survived`: This column is our labels and is important since this is a supervised machine learning problem. If we wanted to go with a clustering/unsupervised learning we could drop this. For the purpose of the notebook we will be keeping this. 

`Pclass`: On kaggle it says that this columns serves "A proxy for socio-economic status". Where 1 is upper, 2 is middle and 3 is upper. This will be important for our model

`Name`: This is simply the name of the passenger. This could be important as some family last names could mean that they are from a wealthier family and therfore might have a higher chance of surviving. We can also use this to approximate age. For this notebook we will most likely drop it. 

`Sex/Age`: The sex of the passenger. Since women and children were  prioritized in the case of an emergency this would helpful to determined who would survive. 

`SibSp`: In the kaggle description of the dataset it says that sibsp is "# of siblings / spouses aboard the Titanic". This could be helpful. 

`Parch`: In the kaggle description of the dataset it says that parch is "# of parents / children aboard the Titanic". This will also be helpful. 

`Ticket`: Simply the ticket ID so we can remove this.

`Fare`: How much they paid for their ticket. This can be useful as maybe workers did not pay for ticket and upper class people payed for their tickets. We will keep this column. 

`Cabin`: This is the cabin they were staying at. This could be useful but there are several missing values in this column so for now we will ignore it. 

`Embarked`: The Location of where they embarked. This could be important to determine who survived. For examples people who embarked at a certain location might be workers and others might be upper class familiies. This could be useful. 

The columns we will be dropping are `PassengerID`, `Ticket`, and `Cabin`. We will keep `Name` for now because it can help us with filling in missing age values. 

In [6]:
# drop the columns PassengerId, Name, Ticket, Cabin
train.drop(['PassengerId', 'Ticket', 'Cabin'], axis=1, inplace=True)

In [7]:
train = train.dropna(subset=['Embarked']) # drop null values in embarked

In [8]:
train.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      0
dtype: int64

In [9]:
train.info() # get info on our train data after dropping columns

<class 'pandas.core.frame.DataFrame'>
Index: 889 entries, 0 to 890
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  889 non-null    int64  
 1   Pclass    889 non-null    int64  
 2   Name      889 non-null    object 
 3   Sex       889 non-null    object 
 4   Age       712 non-null    float64
 5   SibSp     889 non-null    int64  
 6   Parch     889 non-null    int64  
 7   Fare      889 non-null    float64
 8   Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(3)
memory usage: 69.5+ KB


Now that we got rid of the columns that we did not need, we are going to fill in the missing values for `Age` using the `Name` column. Most names contain a `Mr` or  `Mrs` in them. These people would be older because they are married so what we can do is get the average age of people with `Mr` or `Mrs` in their name and use that for their missing `Age` value.  

If they don't have that in their name and their `Parch` is zero then their age will be 0. This is because they are not a parent or child and in the kaggle datset info we are told that "Some children travelled only with a nanny, therefore parch=0 for them."

Lastly if they don't have `Mr` or `Mrs` in their name then we will get the avarage age of people who don't and use that for their missing `Age` value. 

In [10]:
def get_age_averages(df):
    total_mrs = 0
    total_mr = 0
    sum_mrs = 0
    sum_mr = 0
    for index, row in df.iterrows():
        if not np.isnan(row['Age']):
            if 'mrs' in row['Name'].lower() or 'miss' in row['Name'].lower():
                sum_mrs += row['Age']
                total_mrs += 1
            elif 'mr' in row['Name'].lower():
                sum_mr += float(row['Age'])
                total_mr += 1
    return round(sum_mrs/total_mrs, 2), round(sum_mr/total_mrs, 2)

In [11]:
def fill_age(df, mean_age_mrs, mean_age_mr):
    for index, row in df.iterrows():
        if np.isnan(row['Age']):
            if 'mrs' in row['Name'].lower() or 'miss' in row['Name'].lower():
                df.loc[index, 'Age'] = mean_age_mrs
            elif 'mr' in row['Name'].lower():
                df.loc[index, 'Age'] = mean_age_mr
            elif row['Parch'] ==0:
                df.loc[index, 'Age'] = 0
    return df

In [12]:
mean_age_mrs, mean_age_mr = get_age_averages(train) # get averages for mrs and mr

In [13]:
train_age_filled = fill_age(train.copy(deep=True), mean_age_mrs, mean_age_mr) # fill the age column if nan

In [14]:
train_age_filled.info() # info on data with filling in missing values

<class 'pandas.core.frame.DataFrame'>
Index: 889 entries, 0 to 890
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  889 non-null    int64  
 1   Pclass    889 non-null    int64  
 2   Name      889 non-null    object 
 3   Sex       889 non-null    object 
 4   Age       885 non-null    float64
 5   SibSp     889 non-null    int64  
 6   Parch     889 non-null    int64  
 7   Fare      889 non-null    float64
 8   Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(3)
memory usage: 101.7+ KB


We have filled most of the missing age values but we still have some missing in our dataset. So we are going to fill them in using the average of age of our original dataset.

In [15]:
train_age_filled = train_age_filled.fillna(train.loc[:, 'Age'].mean()) # fill missing values with average age

In [16]:
train_age_filled.drop(['Name'], axis=1, inplace=True) # drop name for both dataframes
train.drop(['Name'], axis=1, inplace=True)

In [17]:
train_age_filled.info() # info on age filled dataset

<class 'pandas.core.frame.DataFrame'>
Index: 889 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  889 non-null    int64  
 1   Pclass    889 non-null    int64  
 2   Sex       889 non-null    object 
 3   Age       889 non-null    float64
 4   SibSp     889 non-null    int64  
 5   Parch     889 non-null    int64  
 6   Fare      889 non-null    float64
 7   Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 94.8+ KB


We now have all the data that we need. We also got rid of the `Name` column because we don't really need it anymore. For this notebook I will be using the dataframe where we filled the missing values `train_age_filled` and the one without it `train`. We will see if filling in the values has any affect on the model performence. Below we are going to check and see if our dataset balanced. Because we are trying to classify `Survived` we need to make sure we have a good ammount of examples of each class `1` or `0`. If not our model will only learn from one of the classes more than the other. 

In [18]:
train['Survived'].value_counts() # seems relatively balanced

Survived
0    549
1    340
Name: count, dtype: int64

Another step before we get to modeling is we need to encode the `Sex` and `Embarked` column because they are currently a string and that is not very useful to us right now.

In [19]:
train = pd.get_dummies(train, prefix=['Sex', 'Embarked'], dtype=float) # encode the columns
train_age_filled = pd.get_dummies(train_age_filled, prefix=['Sex', 'Embarked'], dtype=float)

Now we should normalize the `Age` and `Fare` columns. 

In [20]:
from sklearn.preprocessing import StandardScaler

In [21]:
scaler = StandardScaler() 
train['Age'] = scaler.fit_transform(train[['Age']])
train['Fare'] = scaler.fit_transform(train[['Fare']])
train_age_filled['Age'] = scaler.fit_transform(train_age_filled[['Age']])
train_age_filled['Fare'] = scaler.fit_transform(train_age_filled[['Fare']])

In [22]:
train = train.dropna(subset=['Age']) # drop null age values in train dataframe

In [23]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 712 entries, 0 to 890
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    712 non-null    int64  
 1   Pclass      712 non-null    int64  
 2   Age         712 non-null    float64
 3   SibSp       712 non-null    int64  
 4   Parch       712 non-null    int64  
 5   Fare        712 non-null    float64
 6   Sex_female  712 non-null    float64
 7   Sex_male    712 non-null    float64
 8   Embarked_C  712 non-null    float64
 9   Embarked_Q  712 non-null    float64
 10  Embarked_S  712 non-null    float64
dtypes: float64(7), int64(4)
memory usage: 66.8 KB


In [24]:
train_age_filled.info()

<class 'pandas.core.frame.DataFrame'>
Index: 889 entries, 0 to 890
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    889 non-null    int64  
 1   Pclass      889 non-null    int64  
 2   Age         889 non-null    float64
 3   SibSp       889 non-null    int64  
 4   Parch       889 non-null    int64  
 5   Fare        889 non-null    float64
 6   Sex_female  889 non-null    float64
 7   Sex_male    889 non-null    float64
 8   Embarked_C  889 non-null    float64
 9   Embarked_Q  889 non-null    float64
 10  Embarked_S  889 non-null    float64
dtypes: float64(7), int64(4)
memory usage: 115.6 KB


We now have all the data as we want it. We encoded the `Sex` and `Embarked` columns. We filled missing `Age` values. We have two dataframes to test if filling missing values was worth it. Now we need do apply the same transformations that we did to the training set to the test set. Then we can start modeling.

In [25]:
test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [26]:
test = test.dropna(subset=['Age', 'Fare']) # drop null Age and Fare values in test dataframe

In [27]:
test.drop(['PassengerId', 'Ticket', 'Cabin', 'Name'], axis=1, inplace=True) # drop columns in test dataframe

In [28]:
X_test = pd.get_dummies(test, prefix=['Sex', 'Embarked'], dtype=float) # encode the columns

In [29]:
X_test['Age'] = scaler.fit_transform(X_test[['Age']])
X_test['Fare'] = scaler.fit_transform(X_test[['Fare']])

At this point I realised that we dont have labels for the test set. So I am just going to model and evaluate on the training dataset and leave the predictions for the test. 

# Model Building

As mentioned in the project description, I was going to use logistic regression, random forests and neural networks for this dataset. But since there are no labels for the test dataset I am just going to use random forests since I think it will be the best peforming. I'm also going to just use the data where the values are filled and no longer compare them since I can't. 

In [30]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix

In [31]:
y = train_age_filled.pop('Survived') 
X = train_age_filled # filled missing values

In [32]:
clf = RandomForestClassifier(max_depth=10, random_state=0) # creating the random forests classifier

In [33]:
clf.fit(X, y) # fitting the data

In [35]:
y_pred = clf.predict(X) # getting predictions on training data

In [36]:
accuracy_f = accuracy_score(y, y_pred) # getting accuracy on training data

In [38]:
conf_matrix_f = confusion_matrix(y, y_pred)
print(f"Accuracy: {accuracy_f}")
print("Confusion Matrix:")
print(conf_matrix_f)

Accuracy: 0.9415073115860517
Confusion Matrix:
[[538  11]
 [ 41 299]]


In [41]:
clf.predict(X) # predictions on the test set

array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,

# Conclusion
So ideally I would have checked the data in the `test.csv` to see that the labels are missing. Using the training set to evaluate is not really a good measurement. However seeing that it did not get a .99 or higher on the accuracy might be a good thing because we know its not overfitting. To confirm this we could use the test if we had the labels. Some future improvments things for next time are to check both training and test set. So in the end this notebook serves more as a pandas excercise to get use to using all its features and was really helpful honestly. 