# Introduction/Project Overview:
In this notebook, I will present my solution and analysis of the Titanic dataset. This is a very famous dataset that can be found on [kaggle](https://www.kaggle.com/c/titanic/data). The dataset contains demographics of the Titanc passengers, incluiding who survived and who did not. The goal is to build a model that can correctly classify new examples (check who will survive or not). Throughout this notebook I will visualize the data, explain some data preprocessing techniques, construct and evaluate models and analyze the results. 

#### Data Exploration & Preprocessing:  
I will go over the dataset, analyzing its various features, checking for missing values, and gaining insights into the distribution of variables. Prior to building the models, I will preprocess the data by handling missing values, encoding categorical variables, and scaling numerical features to ensure good model performance.

#### Model Building & Evaluation:
In this notebook I will try implement several models to try and correctly classify passengers who survied. This is a supervised learning task as we are given the labels of who survived and who did not. For this project the models I have chosen are logistic regression, decision trees, random forests, support vector machines and neural networks. For each of these modules I will evalute their peformance using f1score, confusion matrices, and overall accuracy. 

#### Conclusion: 
Finally, I will interpret the results of the models, identifying significant factors that contribute to passenger survival prediction and discussing potential areas for model improvement. The Titanic dataset is a good challange to test your knowledege on machine learning. This will serve as a good test for me to keep learning and testing my skills.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt

# Data Exploration & Preprocessing:
As mentioned earlier I got the dataset from kaggle. The link to that can be found above. The download came with two csv files. One for the training set and one for the test set. Since I have it locally on my computer I can eassily access the data as shown below. Some of the first steps we will do before creating a model is to see what our data looks like.

In [2]:
# read train and test sets
train = pd.read_csv('./train.csv') 
test = pd.read_csv('./test.csv') 

We loaded in the data into pandas dataframes and now we want to see what our data looks like. What does vairables does it contain and what data types, etc. 

In [3]:
train.info() # get info on our train 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The block above gives us a lot of information. For starters it tells us we have a total of 891 samples. This is a good ammount as it is a big enohgh ammount for the model to learn from but not too large where we would require lots of computing power and time for training. We also see that we have 12 columns.`Age` has 177 missing values which is a big ammount. `Cabin` has a lot of missing values too. Lastly it seems we have 5 categorical columns and 7 numerical columns. 

Lets count our missing values and then look more into our data and see what will be useful for training. 

In [4]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

As we saw `Age` has 177 missing values. `Cabin` has 2 however it might not be very useful when making a prediction on who survives the titanic but we will try to fill it. Lets breakdown each column, what they mean and if we will keep them for constructing our model. 

`PassengderId`: This column is not useful because its just a id assigned by the dataset. We will drop this column before training. 

`Survived`: This column is our labels and is important since this is a supervised machine learning problem. If we wanted to go with a clustering/unsupervised learning we could drop this. For the purpose of the notebook we will be keeping this. 

`Pclass`: On kaggle it says that this columns serves "A proxy for socio-economic status". Where 1 is upper, 2 is middle and 3 is lower. This will be important for our model

`Name`: This is simply the name of the passenger. This could be important as some family last names could mean that they are from a wealthier family and therfore might have a higher chance of surviving. We can also use this to approximate age. For this notebook we will most likely drop it. 

`Sex/Age`: The sex of the passenger. Since women and children were  prioritized in the case of an emergency this would helpful to determined who would survive. 

`SibSp`: In the kaggle description of the dataset it says that sibsp is "# of siblings / spouses aboard the Titanic". This could be helpful. 

`Parch`: In the kaggle description of the dataset it says that parch is "# of parents / children aboard the Titanic". This will also be helpful. 

`Ticket`: Simply the ticket ID so we can remove this.

`Fare`: How much they paid for their ticket. This can be useful as maybe workers did not pay for ticket and upper class people payed for their tickets. We will keep this column. 

`Cabin`: This is the cabin they were staying at. This could be useful but there are several missing values in this column so for now we will ignore it. 

`Embarked`: The Location of where they embarked. This could be important to determine who survived. For examples people who embarked at a certain location might be workers and others might be upper class familiies. This could be useful. 

Knowing how many missing values we have will be important because we will want to know if its worth filling or dropping. For example `Cabin` we will most likely drop because there is so many missing values. `Age` seems like a very important feature and not many are missing. Lets move on and get some information on the numerical columns that we have. 

In [5]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


The block above gives us a lot of information. For example we can see that the proportion of people that survived was `38%`. The mean age was `29`, the oldest person was 80 years old. Laslty average fare was `32.20`.

Now lets compare `Age`,`SibSp`,`Parch`, and `Fare` by Survival to see any relationships.

In [6]:
pd.pivot_table(train, index = 'Survived', values = ['Age','SibSp','Parch','Fare'])

Unnamed: 0_level_0,Age,Fare,Parch,SibSp
Survived,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,30.626179,22.117887,0.32969,0.553734
1,28.34369,48.395408,0.464912,0.473684


We can see that the average `Age` of people that did not survive was `30.6` while the average `Age` of those who did was `28.34`. Although not a huge difference it makes sense that people who survived are younger as they prioritized women and children in case of emergencies. Additionally a intersting relationship here is `Fare`. For those who did not survive their average `Fare` was `22.11` while those who did survive was `48.39`. This could mean that those who payed larger ammounts of money for their fare came from wealthy families and had priority. This tells us that `Fare` could be an important feature. 

Moving on lets do the same our categorical columns. Lets start by counting the number of survivors in `Pclass` which is socioeconomic class. We add `values=PassengerId` so that we count each passanger if we do not add that we count all the survivors in `pclass` within each column. 

In [7]:
pd.pivot_table(train, index = 'Survived', columns = 'Pclass', values='PassengerId', aggfunc ='count')

Pclass,1,2,3
Survived,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,80,97,372
1,136,87,119


From the table above we see that most who died are in the lower class which is `Pclass` 3 with 372 people dead. Most survivors is `Pclass` 1 which is to be expected as that is the upper class. In the middle class we have a almost even number of survivors and dead people. This will for sure be helpful in determining who sill survive. 

In [8]:
pd.pivot_table(train, index = 'Survived', columns = 'Sex', values='PassengerId', aggfunc ='count')

Sex,female,male
Survived,Unnamed: 1_level_1,Unnamed: 2_level_1
0,81,468
1,233,109


In this table above we see that most survivors were female. Most cassualties were male. This is pretty straight forward because women and children were prioritized in a emergency. 

In [9]:
pd.pivot_table(train, index = 'Survived', columns = 'Embarked', values='PassengerId', aggfunc ='count')

Embarked,C,Q,S
Survived,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,75,47,427
1,93,30,217


Lastly we have embarked whcih for the most part depending on location is relatively balanced. However in location `S` most people did not survive. This could be because maybe most lower class or workers embarked at this location. Either way it will be helpful in determining who will not survive.

Now that we have explored some of our data and filled in some missing values, lets fill in the `Age` missing values. First lets start by dropping columns. We will drop `PassengerId`, `Ticket`, `Cabin` as we explained above that they will not be as helpful.

In [10]:
# drop the columns PassengerId, Name, Ticket, Cabin
train.drop(['PassengerId', 'Ticket', 'Cabin'], axis=1, inplace=True)

Now lets fill in the missing emabarked values which we saw from above is just two. 

In [11]:
train[train['Embarked'].isnull()] # get info in missing embarked rows

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
61,1,1,"Icard, Miss. Amelie",female,38.0,0,0,80.0,
829,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,80.0,


These two people survived, are part of the upper class and are female. Lets get all the women who are upper class, fare of 80 or greater and survived. Then we can count where they embarked. 

In [12]:
upper_class_women_survived = train[(train['Sex'] == 'female') & (train['Pclass'] == 1) & (train['Fare'] >= 80)  & (train['Survived'] == 1)]

In [13]:
upper_class_women_survived['Embarked'].value_counts()

Embarked
C    24
S    21
Q     1
Name: count, dtype: int64

The table abve says that women who were upper class, fare greater than 80 and survived embarked at `C` or `S`. Most survivors came from `S` with a total of 217. `C` has second most at `93`. With this information we will fill in the missing values witj `S`. This was a lot for 2 missing values but is good pracice. 

In [14]:
train['Embarked'] = train['Embarked'].fillna('C')

Now that we got rid of the columns that we did not need, we are going to fill in the missing values for `Age` using the `Name` column. Most names contain a `Mr` or  `Mrs` in them. These people would be older because they are married so what we can do is get the average age of people with `Mr` or `Mrs` in their name and use that for their missing `Age` value.  

If they don't have that in their name and their `Parch` is zero then their age will be 0. This is because they are not a parent or child and in the kaggle datset info we are told that "Some children travelled only with a nanny, therefore parch=0 for them."

Lastly if they don't have `Mr` or `Mrs` in their name then we will get the avarage age of people who don't and use that for their missing `Age` value. 

In [15]:
def get_age_averages(df):
    total_mrs = 0
    total_mr = 0
    sum_mrs = 0
    sum_mr = 0
    for index, row in df.iterrows():
        if not np.isnan(row['Age']):
            if 'mrs' in row['Name'].lower() or 'miss' in row['Name'].lower():
                sum_mrs += row['Age']
                total_mrs += 1
            elif 'mr' in row['Name'].lower():
                sum_mr += float(row['Age'])
                total_mr += 1
    return round(sum_mrs/total_mrs, 2), round(sum_mr/total_mrs, 2)

In [16]:
def fill_age(df, mean_age_mrs, mean_age_mr):
    for index, row in df.iterrows():
        if np.isnan(row['Age']):
            if 'mrs' in row['Name'].lower() or 'miss' in row['Name'].lower():
                df.loc[index, 'Age'] = mean_age_mrs
            elif 'mr' in row['Name'].lower():
                df.loc[index, 'Age'] = mean_age_mr
            elif row['Parch'] == 0:
                df.loc[index, 'Age'] = 0
    return df

In [17]:
mean_age_mrs, mean_age_mr = get_age_averages(train) # get averages for mrs and mr

In [18]:
fill_age(train, mean_age_mrs, mean_age_mr) # fill the age column if nan

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,"Braund, Mr. Owen Harris",male,22.00,1,0,7.2500,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.00,1,0,71.2833,C
2,1,3,"Heikkinen, Miss. Laina",female,26.00,0,0,7.9250,S
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.00,1,0,53.1000,S
4,0,3,"Allen, Mr. William Henry",male,35.00,0,0,8.0500,S
...,...,...,...,...,...,...,...,...,...
886,0,2,"Montvila, Rev. Juozas",male,27.00,0,0,13.0000,S
887,1,1,"Graham, Miss. Margaret Edith",female,19.00,0,0,30.0000,S
888,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,27.84,1,2,23.4500,S
889,1,1,"Behr, Mr. Karl Howell",male,26.00,0,0,30.0000,C


In [19]:
train.info() # info on data with filling in missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       887 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Fare      891 non-null    float64
 8   Embarked  891 non-null    object 
dtypes: float64(2), int64(4), object(3)
memory usage: 62.8+ KB


There are still some missing age values so we are going to fill them in using the average of age of our original dataset.

In [20]:
train = train.fillna(train.loc[:, 'Age'].mean()) # fill missing values with average age

In [21]:
train.drop(['Name'], axis=1, inplace=True) # drop name column

We now have all the data that we need. We also got rid of the `Name` column because we don't really need it anymore. Below we are going to check and see if our dataset balanced. Because we are trying to classify `Survived` we need to make sure we have a good ammount of examples of each class `1` or `0`. If not our model will only learn from one of the classes more than the other. 

In [22]:
train['Survived'].value_counts() # seems slightly unbalanced but should be fine

Survived
0    549
1    342
Name: count, dtype: int64

Another step before we get to modeling is we need to encode the `Pclass`, `Sex`, and `Embarked`.

In [23]:
train = pd.get_dummies(train, columns=['Pclass', 'Sex', 'Embarked'], prefix=['Pclass', 'Sex', 'Embarked'], dtype=float) # encode the columns

Now we should normalize the `Age` and `Fare` columns. 

In [24]:
from sklearn.preprocessing import StandardScaler

In [25]:
scaler = StandardScaler() 
train['Age'] = scaler.fit_transform(train[['Age']])
train['Fare'] = scaler.fit_transform(train[['Fare']])

In [26]:
train = train.dropna(subset=['Age']) # drop null age values in train dataframe

In [27]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    891 non-null    int64  
 1   Age         891 non-null    float64
 2   SibSp       891 non-null    int64  
 3   Parch       891 non-null    int64  
 4   Fare        891 non-null    float64
 5   Pclass_1    891 non-null    float64
 6   Pclass_2    891 non-null    float64
 7   Pclass_3    891 non-null    float64
 8   Sex_female  891 non-null    float64
 9   Sex_male    891 non-null    float64
 10  Embarked_C  891 non-null    float64
 11  Embarked_Q  891 non-null    float64
 12  Embarked_S  891 non-null    float64
dtypes: float64(10), int64(3)
memory usage: 90.6 KB


We now have all the data as we want it. We encoded the `Pclass`, `Sex` and `Embarked` columns. We filled missing `Age` values. We have two dataframes to test if filling missing values was worth it. Now we need do apply the same transformations that we did to the training set to the test set. Then we can start modeling.

In [28]:
test_copy = test.copy() # for kaggle submission later
test_copy = test.iloc[:, :1] 

In [29]:
test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [30]:
fill_age(test, mean_age_mrs, mean_age_mr) # fill the age column using average values from ealier

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.50,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.00,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.00,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.00,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.00,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,50.51,0,0,A.5. 3236,8.0500,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.00,0,0,PC 17758,108.9000,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.50,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,3,"Ware, Mr. Frederick",male,50.51,0,0,359309,8.0500,,S


In [31]:
test = test.fillna(train.loc[:, 'Fare'].mean()) # since its just one missing value we will fill with mean of train

In [32]:
test.drop(['PassengerId', 'Ticket', 'Cabin', 'Name'], axis=1, inplace=True) # drop columns in test dataframe

In [33]:
X_test = pd.get_dummies(test, columns=['Pclass', 'Sex', 'Embarked'], prefix=['Pclass', 'Sex', 'Embarked'], dtype=float) # encode the columns

In [34]:
X_test['Age'] = scaler.fit_transform(X_test[['Age']])
X_test['Fare'] = scaler.fit_transform(X_test[['Fare']])

In [35]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Age         418 non-null    float64
 1   SibSp       418 non-null    int64  
 2   Parch       418 non-null    int64  
 3   Fare        418 non-null    float64
 4   Pclass_1    418 non-null    float64
 5   Pclass_2    418 non-null    float64
 6   Pclass_3    418 non-null    float64
 7   Sex_female  418 non-null    float64
 8   Sex_male    418 non-null    float64
 9   Embarked_C  418 non-null    float64
 10  Embarked_Q  418 non-null    float64
 11  Embarked_S  418 non-null    float64
dtypes: float64(10), int64(2)
memory usage: 39.3 KB


At this point I realised that we dont have labels for the test set. So I am just going to model and evaluate on the training dataset and leave the predictions for the test. 

# Model Building

As mentioned in the project description, I was going to use logistic regression, random forests and neural networks for this dataset. But since there labels dont come with the test dataset I am just going to use random forests and a neural network since I think it will be the best peforming. I will then print the training accuracy and submit the predictions into kaggle. After that I will use whichever has a highest training accuracy. 

In [36]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV

In [37]:
y_train = train.pop('Survived') 
X_train = train 

In [38]:
rf = RandomForestClassifier(random_state=42) # a classifer to use grid search on

In [39]:
param_grid = { # the parameters to test
    'n_estimators': [100, 200],  
    'max_depth': [10, 20, None],  
    'min_samples_split': [2, 5, 10], 
    'min_samples_leaf': [1, 2, 4],  
    'bootstrap': [True, False]  
} 

In [40]:
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2, scoring='accuracy')

In [41]:
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


In [42]:
grid_search.best_params_ # print best parameters

{'bootstrap': True,
 'max_depth': 20,
 'min_samples_leaf': 2,
 'min_samples_split': 5,
 'n_estimators': 200}

In [43]:
clf = RandomForestClassifier(n_estimators=200,
                             max_depth=10,
                             bootstrap=True,
                             min_samples_leaf=1,
                             min_samples_split=5,
                             random_state=42) # create the classifier

In [44]:
clf.fit(X_train, y_train) # fitting the data

In [45]:
y_pred = clf.predict(X_train) # getting predictions on training data

In [46]:
accuracy = accuracy_score(y_train, y_pred) # getting accuracy on training data

In [47]:
conf_matrix = confusion_matrix(y_train, y_pred)
print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(conf_matrix)

Accuracy: 0.9147025813692481
Confusion Matrix:
[[530  19]
 [ 57 285]]


The accuracy score is 0.91 which is not bad. Its not terribly low but not very high either so its not underfitting or overfitting. Now we can get the predictions and then just submit them. 

In [48]:
predictions = clf.predict(X_test) # predictions on the test set

In [49]:
output = pd.DataFrame({'PassengerId': test_copy.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)

# Conclusion
So ideally I would have checked the data in the `test.csv` to see that the labels are missing. Using the training set to evaluate is not really a good measurement. However seeing that it did not get a .99 on the accuracy might be a good thing because we know its not overfitting. To confirm this we could use the test if we had the labels. Some future improvments things for next time are to check both training and test set. So in the end this notebook serves more as a pandas excercise to get use to using all its features and was really helpful honestly. Additionally I got 0.77751 on kaggle for accuracy so not a terrible experiment.  