# Introduction/Project Overview:
In this notebook, I present my solution to and analysis of the Titanic dataset, a very famous dataset that can be found on [Kaggle](https://www.kaggle.com/c/titanic/data). The dataset contains demographics of the Titanic passengers, including who survived and who did not. The goal is to build a model that can correctly classify new examples (predict whether a passenger survived). Throughout this notebook I will visualize the data, explain some data preprocessing techniques, construct and evaluate models, and analyze the results.

#### Data Exploration & Preprocessing:  
I will go over the dataset, analyzing its various features, checking for missing values, and gaining insights into the distribution of variables. Prior to building the models, I will preprocess the data by handling missing values, encoding categorical variables, and scaling numerical features to ensure good model performance.

#### Model Building & Evaluation:
In this notebook I will implement several models to classify which passengers survived. This is a supervised learning task because we are given labels for who survived and who did not. The models I have chosen are random forests and support vector machines. For each of these models I will evaluate performance using confusion matrices and training accuracy. I chose to leave out logistic regression, decision trees and naive Bayes. I left out logistic regression because of its linearity assumption: I don't think there is a linear relationship between our features and labels, and that would really limit it. I left out decision trees because I expect random forests to perform better.

#### Conclusion: 
Finally, I will interpret the results of the models, identifying significant factors that contribute to passenger survival prediction and discussing potential areas for model improvement. The Titanic dataset is a good challenge for testing your knowledge of machine learning, and it will serve as a good way for me to keep learning and testing my skills. The submission will be posted on Kaggle to get a score.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt

# Data Exploration & Preprocessing:
As mentioned earlier, I got the dataset from Kaggle; the link can be found above. The download came with two CSV files, one for the training set and one for the test set. Since I have them locally on my computer, I can easily access the data as shown below. One of the first steps before creating a model is to see what our data looks like.

In [2]:
# read train and test sets
train = pd.read_csv('./train.csv') 
test = pd.read_csv('./test.csv') 

We loaded the data into pandas DataFrames and now we want to see what it looks like: which variables it contains, what their data types are, and so on.

In [3]:
train.info() # get info on our train 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The block above gives us a lot of information. For starters, it tells us we have a total of 891 samples. This is a good amount: big enough for the model to learn from, but not so large that training would require lots of computing power and time. We also see that we have 12 columns. `Age` has 177 missing values, which is a big amount. `Cabin` has a lot of missing values too. Lastly, we have 5 non-numeric (`object`) columns and 7 numerical columns.

Let's count our missing values and then look more closely at our data to see what will be useful for training.

In [4]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
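
Raw counts are easier to judge as percentages. From the output above, `Age` is missing in 177 of 891 rows (about 19.9%), `Cabin` in 687 (about 77.1%) and `Embarked` in 2 (about 0.2%). The sketch below shows one way to produce such a report; `missing_report` is a hypothetical helper, not part of the original notebook, and the toy frame uses made-up values.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return the count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        'missing': counts,
        'percent': (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have gaps
    return report[report['missing'] > 0].sort_values('missing', ascending=False)

# Toy frame standing in for `train` (values are made up for illustration)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})
report = missing_report(toy)
print(report)
```

Calling `missing_report(train)` on the real training set should list `Cabin`, `Age` and `Embarked` in that order.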

As we saw, `Age` has 177 missing values, `Cabin` has 687 and `Embarked` has 2. `Embarked` might not be very useful for predicting who survives the Titanic, but we will try to fill it. Let's break down each column: what it means and whether we will keep it for constructing our model.

`PassengerId`: This column is not useful because it is just an ID assigned by the dataset. We will drop this column before training.

`Survived`: This column holds our labels and is important since this is a supervised machine learning problem. If we wanted to go with a clustering/unsupervised approach we could drop it. For the purpose of this notebook we will keep it.

`Pclass`: On Kaggle this column is described as "A proxy for socio-economic status", where 1 is upper, 2 is middle and 3 is lower. This will be important for our model.

`Name`: This is simply the name of the passenger. It could be important, as some family last names could indicate a wealthier family and therefore a higher chance of surviving. We can also use it to approximate age. For this notebook we will not use the full name, only the title it contains.

`Sex`/`Age`: The sex and age of the passenger. Since women and children were prioritized in an emergency, these will be helpful in determining who survived.

`SibSp`: The Kaggle description of the dataset says that `SibSp` is "# of siblings / spouses aboard the Titanic". This could be helpful.

`Parch`: The Kaggle description of the dataset says that `Parch` is "# of parents / children aboard the Titanic". This will also be helpful.

`Ticket`: Simply the ticket ID, so we can remove this.

`Fare`: How much they paid for their ticket. This can be useful, as workers may not have paid for a ticket while upper-class people paid for theirs. We will keep this column.

`Cabin`: This is the cabin they were staying in. This could be useful, but 687 of the 891 values are missing, so for now we will ignore it.

`Embarked`: The location where they embarked. This could be important in determining who survived. For example, people who embarked at a certain location might be workers and others might be upper-class families. This could be useful.

Knowing how many missing values we have is important because it tells us whether a column is worth filling or dropping. For example, we will most likely drop `Cabin` because so many values are missing; filling them might have little to no effect on training. `Age` seems like a very important feature and about 80% of its values are present, so it is worth filling. Let's move on and get some information on the numerical columns that we have.

In [5]:
train.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


The block above gives us a lot of information. For example, we can see that the proportion of people who survived was about `38%`. The mean age was about `29.7` and the oldest person was 80 years old. Lastly, the average fare was `32.20`.

Now let's compare `Age`, `SibSp`, `Parch`, and `Fare` by survival to see any relationships.

In [6]:
pd.pivot_table(train, index = 'Survived', values = ['Age','SibSp','Parch','Fare'])

| Survived | Age | Fare | Parch | SibSp |
|---|---|---|---|---|
| 0 | 30.626179 | 22.117887 | 0.32969 | 0.553734 |
| 1 | 28.34369 | 48.395408 | 0.464912 | 0.473684 |


We can see that the average `Age` of people who did not survive was `30.6`, while the average `Age` of those who did was `28.3`. Although not a huge difference, it makes sense that survivors were younger, as women and children were prioritized in emergencies. An additional interesting relationship here is `Fare`: those who did not survive paid an average `Fare` of `22.12`, while those who survived paid `48.40`. This could mean that those who paid more for their fare came from wealthy families and had priority. This tells us that `Fare` could be an important feature.

Moving on, let's do the same for our categorical columns. Let's start by counting survivors and non-survivors within each `Pclass`, which is socioeconomic class. We pass `values='PassengerId'` so that each passenger is counted once; without it, `pivot_table` would produce a separate count for every remaining column.

In [7]:
pd.pivot_table(train, index = 'Survived', columns = 'Pclass', values='PassengerId', aggfunc ='count')

| Survived | Pclass=1 | Pclass=2 | Pclass=3 |
|---|---|---|---|
| 0 | 80 | 97 | 372 |
| 1 | 136 | 87 | 119 |


From the table above we see that most of those who died were in the lower class, `Pclass` 3, with 372 deaths. Most survivors were in `Pclass` 1 (136), which is to be expected as that is the upper class. In the middle class the numbers of survivors and deaths are almost even. This will for sure be helpful in determining who will survive.
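
Because the three classes have very different sizes, survival rates are easier to compare than raw counts. The sketch below rebuilds the count table from the numbers shown above and normalizes each column; it is an illustration, not a cell from the original notebook.

```python
import pandas as pd

# Counts copied from the Survived x Pclass pivot table above
counts = pd.DataFrame(
    {1: [80, 136], 2: [97, 87], 3: [372, 119]},
    index=pd.Index([0, 1], name='Survived'),
)
counts.columns.name = 'Pclass'

# Divide each column by its total to get per-class proportions
rates = counts / counts.sum(axis=0)
survival_rate = rates.loc[1]  # row for Survived == 1
print(survival_rate.round(3))
```

This gives roughly 63% survival in first class, 47% in second and 24% in third. On the real data, `train.groupby('Pclass')['Survived'].mean()` should produce the same rates directly.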

In [8]:
pd.pivot_table(train, index = 'Survived', columns = 'Sex', values='PassengerId', aggfunc ='count')

| Survived | Sex=female | Sex=male |
|---|---|---|
| 0 | 81 | 468 |
| 1 | 233 | 109 |


In the table above we see that most survivors were female and most casualties were male. This is pretty straightforward because women and children were prioritized in an emergency.

In [9]:
pd.pivot_table(train, index = 'Survived', columns = 'Embarked', values='PassengerId', aggfunc ='count')

| Survived | Embarked=C | Embarked=Q | Embarked=S |
|---|---|---|---|
| 0 | 75 | 47 | 427 |
| 1 | 93 | 30 | 217 |


Lastly we have `Embarked`, which for the most part is relatively balanced across locations. However, at location `S` most people did not survive. This could be because most lower-class passengers or workers embarked at this location. Either way, it will be helpful in determining who will not survive.

Below we check whether our dataset is balanced. Because we are trying to classify `Survived`, we need to make sure we have a good number of examples of each class, 1 and 0. If the dataset is not balanced, our model will learn more from one class than from the other.

In [10]:
train['Survived'].value_counts() # seems slightly unbalanced but should be fine

Survived
0    549
1    342
Name: count, dtype: int64
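
The class counts also give a useful reference point: a model that always predicts the majority class would already be right for every non-survivor. The sketch below computes that baseline from the counts printed above; it is an added illustration rather than part of the original analysis.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
class_counts = pd.Series({0: 549, 1: 342}, name='count')

proportions = class_counts / class_counts.sum()
baseline_accuracy = proportions.max()  # accuracy of always predicting the majority class
print(proportions.round(3))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model trained later should beat this baseline of about 0.616 to be considered useful.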

Now that we have explored some of our data, let's fill in some missing values and work on some feature engineering. First we are going to combine the train and test sets so that all the processing can be done at once.

In [11]:
train['train_test'] = 1 # distinguish between sets 
test['train_test'] = 0
test['Survived'] = np.nan # fill survived with nan for test set (np.NaN was removed in NumPy 2.0)
data = pd.concat([train,test]) # join 
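
One side effect of stacking two frames with `pd.concat` is that each keeps its own row labels, so the combined index contains repeats (the later `data.info()` output shows `Index: 1309 entries, 0 to 417`). The sketch below uses small made-up frames to show the effect and the `ignore_index=True` alternative; the `toy_*` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test frames (values are made up)
toy_train = pd.DataFrame({'Fare': [7.25, 71.28, 8.05], 'Survived': [0, 1, 0]})
toy_test = pd.DataFrame({'Fare': [7.83, 7.00]})

toy_train['train_test'] = 1
toy_test['train_test'] = 0
toy_test['Survived'] = np.nan

stacked = pd.concat([toy_train, toy_test])                        # keeps labels 0, 1, 2, 0, 1
renumbered = pd.concat([toy_train, toy_test], ignore_index=True)  # labels 0..4

print(stacked.index.is_unique)     # False: labels repeat across the two sets
print(renumbered.index.is_unique)  # True

# Splitting back relies on the flag column, not on the index
train_part = renumbered[renumbered['train_test'] == 1]
test_part = renumbered[renumbered['train_test'] == 0]
```

The notebook's boolean-mask operations work with either form; a unique index mainly matters for label-based lookups such as `data.loc[152]`.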

We previously saw that there are two missing values for `Embarked` in the training set. There might be more in test set but either way lets fill them. 

In [12]:
# check embarked missing values
data[data['Embarked'].isnull()]

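| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | train_test |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 62 | 1.0 | 1 | Icard, Miss. Amelie | female | 38.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN | 1 |
| 829 | 830 | 1.0 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN | 1 |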


There are still just two. These two passengers survived, are part of the upper class and are female. Let's get all the women who are upper class, paid a fare of 80 or greater and survived. Then we can count where they embarked.

In [13]:
upper_class_women_survived = data[(data['Sex'] == 'female') & (data['Pclass'] == 1) & (data['Fare'] >= 80) & (data['Survived'] == 1)]

In [14]:
upper_class_women_survived['Embarked'].value_counts()

Embarked
C    24
S    21
Q     1
Name: count, dtype: int64

The table above shows that most upper-class women who paid a fare of 80 or more and survived embarked at `C` (24) or `S` (21). Across all passengers, most survivors embarked at `S` (217), with `C` second at 93. Since `C` is the most common port within this specific group, we will fill in the missing values with `C`.

In [15]:
data['Embarked'] = data['Embarked'].fillna('C')

Let's move on to filling the missing `Age` values. As we saw earlier, a good number are missing in the training set. To fill them we are going to use the `Name` column. Most names contain `Mr`, `Mrs` or some other title. These people would tend to be older because they are married or have some sort of status. Titles could also help the model because people with certain titles might have a higher chance of survival.

What we are going to do is calculate the average age of the passengers sharing each title, then use those values to fill in the missing ones. For example, we will calculate the average age of every `Mrs` with a known age, then use that average to fill in the age for any other `Mrs`.

If they don't have that in their name and their `Parch` is zero then their age will be 0. This is because they are not a parent or child, and in the Kaggle dataset info we are told that "Some children travelled only with a nanny, therefore parch=0 for them."

Lastly, if they don't have `Mr` or `Mrs` in their name, we will take the average age of people who don't and use that for their missing `Age` value. First we extract each person's title and then encode it.

In [16]:
# strip their names to just title
data['Name'] = data['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
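
To see what the lambda above does, the sketch below applies the same split logic to the two names shown earlier in the `Embarked` check. The function name `extract_title` is hypothetical; the expression inside it is the one used in the cell.

```python
def extract_title(name):
    """Return the text between the first comma and the following period."""
    return name.split(',')[1].split('.')[0].strip()

# Two names taken from the rows shown earlier
names = [
    'Icard, Miss. Amelie',
    'Stone, Mrs. George Nelson (Martha Evelyn)',
]
titles = [extract_title(n) for n in names]
print(titles)  # ['Miss', 'Mrs']
```

The approach assumes every name follows the `Surname, Title. Given names` pattern; a name without a comma would raise an `IndexError`.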

Below we one-hot encode the columns `Pclass`, `Sex`, `Embarked`, `SibSp`, `Parch` and `Name`.

In [17]:
data = pd.get_dummies(data, columns=['Pclass', 'Sex', 'Embarked', 'SibSp', 'Parch', 'Name'], prefix=['Pclass', 'Sex', 'Embarked', 'SibSp', 'Parch', 'Name'], dtype=float) # encode the columns

In [18]:
data.info() # check our new columns

<class 'pandas.core.frame.DataFrame'>
Index: 1309 entries, 0 to 417
Data columns (total 48 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   PassengerId        1309 non-null   int64  
 1   Survived           891 non-null    float64
 2   Age                1046 non-null   float64
 3   Ticket             1309 non-null   object 
 4   Fare               1308 non-null   float64
 5   Cabin              295 non-null    object 
 6   train_test         1309 non-null   int64  
 7   Pclass_1           1309 non-null   float64
 8   Pclass_2           1309 non-null   float64
 9   Pclass_3           1309 non-null   float64
 10  Sex_female         1309 non-null   float64
 11  Sex_male           1309 non-null   float64
 12  Embarked_C         1309 non-null   float64
 13  Embarked_Q         1309 non-null   float64
 14  Embarked_S         1309 non-null   float64
 15  SibSp_0            1309 non-null   float64
 16  SibSp_1            1309 non-nu

We have a lot of new columns. The last 18 columns are the passengers' titles, so we will select those to compute the averages. Now all that's left is to fill in the missing values for `Age`.

In [19]:
def mean_ages(data, columns):
    """Fill missing Age values in place with the mean age of passengers sharing the same title."""
    for col in columns: # iterate over the title (Name_*) columns
        missing = data['Age'].isnull() & (data[col] == 1) # rows with this title and no age
        if missing.any():
            mean_age = data.loc[data['Age'].notnull() & (data[col] == 1), 'Age'].mean() # mean known age for this title
            data.loc[missing, 'Age'] = mean_age # fill missing ages for people with the same title

In [20]:
cols = data.iloc[:, -18:] # select the last 18 columns because they are name columns

In [21]:
mean_ages(data, cols) # fill the values
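
The same title-based fill can be sketched more compactly with `groupby` and `transform`, assuming the titles are still available as a single column rather than one-hot encoded. The `Title` column and the values below are hypothetical and only illustrate the technique; this is not the method used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy data with a raw title column (hypothetical `Title` column; values are made up)
toy = pd.DataFrame({
    'Title': ['Mr', 'Mr', 'Mr', 'Mrs', 'Mrs', 'Master'],
    'Age': [30.0, np.nan, 40.0, 35.0, np.nan, 4.0],
})

# Mean age per title, broadcast back to every row of that title
title_mean = toy.groupby('Title')['Age'].transform('mean')
toy['Age'] = toy['Age'].fillna(title_mean)
print(toy)
```

A title whose ages are all missing would stay `NaN` with either approach, so it is worth re-checking `isnull().sum()` afterwards, as the next cell does.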

In [22]:
data.isnull().sum() # check for missing values 

PassengerId             0
Survived              418
Age                     0
Ticket                  0
Fare                    1
Cabin                1014
train_test              0
Pclass_1                0
Pclass_2                0
Pclass_3                0
Sex_female              0
Sex_male                0
Embarked_C              0
Embarked_Q              0
Embarked_S              0
SibSp_0                 0
SibSp_1                 0
SibSp_2                 0
SibSp_3                 0
SibSp_4                 0
SibSp_5                 0
SibSp_8                 0
Parch_0                 0
Parch_1                 0
Parch_2                 0
Parch_3                 0
Parch_4                 0
Parch_5                 0
Parch_6                 0
Parch_9                 0
Name_Capt               0
Name_Col                0
Name_Don                0
Name_Dona               0
Name_Dr                 0
Name_Jonkheer           0
Name_Lady               0
Name_Major              0
Name_Master 

We have one missing `Fare` value that I did not notice until now, so before we move on we must fill it. Let's see if we can get some info on this missing value.

In [23]:
data[data['Fare'].isnull()]

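| | PassengerId | Survived | Age | Ticket | Fare | Cabin | train_test | Pclass_1 | Pclass_2 | Pclass_3 | ... | Name_Master | Name_Miss | Name_Mlle | Name_Mme | Name_Mr | Name_Mrs | Name_Ms | Name_Rev | Name_Sir | Name_the Countess |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 152 | 1044 | NaN | 60.5 | 3701 | NaN | NaN | 0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |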


We see that the `Age` is 60.5 and the passenger has a `Mr` title. Let's see what the `Pclass` is.

In [24]:
data[data['Fare'].isnull()]['Pclass_1']

152    0.0
Name: Pclass_1, dtype: float64

In [25]:
data[data['Fare'].isnull()]['Pclass_2']

152    0.0
Name: Pclass_2, dtype: float64

Alright so the `Pclass` must be 3. We can use this information to fill in the fare value. 

In [26]:
lower_class_older_mr = data[(data['Pclass_3'] == 1) & (data['Age'] >= 60) & (data['Name_Mr'] == 1)] # third-class passengers titled Mr, aged 60 or older

In [27]:
data['Fare'] = data['Fare'].fillna(lower_class_older_mr['Fare'].mean()) # mean() skips the missing fare

We are almost able to start the modeling process, but before we do that we must transform our numerical features, `Fare` and `Age`. We first apply a log transform to reduce skew, then standardize each feature with `StandardScaler` so that it has zero mean and unit variance.

In [28]:
from sklearn.preprocessing import StandardScaler

In [29]:
scaler = StandardScaler() 
data['Age'] = np.log(data['Age']+1)
data['Fare'] = np.log(data['Fare']+1)

In [30]:
data['Age'] = scaler.fit_transform(data[['Age']])
data['Fare'] = scaler.fit_transform(data[['Fare']])
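
The sketch below illustrates what the two steps above do to a skewed column, using made-up fare values. It shows that `StandardScaler` centers the values at zero with unit standard deviation rather than squeezing them into a fixed range.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up, right-skewed fares for illustration
fares = np.array([[0.0], [7.25], [14.45], [31.0], [512.33]])

log_fares = np.log1p(fares)  # same as np.log(fares + 1)
scaled = StandardScaler().fit_transform(log_fares)

print(scaled.ravel().round(3))
print(scaled.mean().round(6), scaled.std().round(6))  # approximately 0 and 1
print(scaled.min(), scaled.max())                      # values fall outside [0, 1]
```

Note that the notebook fits the scaler on the combined train and test rows. A stricter alternative, stated here only as an option, would be to fit on the training rows and apply `transform` to the test rows.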

In [31]:
data.drop(['Ticket', 'Cabin'], axis=1, inplace=True)

In [32]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1309 entries, 0 to 417
Data columns (total 46 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   PassengerId        1309 non-null   int64  
 1   Survived           891 non-null    float64
 2   Age                1309 non-null   float64
 3   Fare               1309 non-null   float64
 4   train_test         1309 non-null   int64  
 5   Pclass_1           1309 non-null   float64
 6   Pclass_2           1309 non-null   float64
 7   Pclass_3           1309 non-null   float64
 8   Sex_female         1309 non-null   float64
 9   Sex_male           1309 non-null   float64
 10  Embarked_C         1309 non-null   float64
 11  Embarked_Q         1309 non-null   float64
 12  Embarked_S         1309 non-null   float64
 13  SibSp_0            1309 non-null   float64
 14  SibSp_1            1309 non-null   float64
 15  SibSp_2            1309 non-null   float64
 16  SibSp_3            1309 non-nu

We have done all our data processing. We can now separate the data and begin training.

In [33]:
X_train = data[data['train_test'] == 1].drop(['train_test'], axis =1) # select train set
y_train = data[data['train_test']==1]['Survived'] # select labels

In [34]:
X_test = data[data['train_test'] == 0].drop(['train_test', 'Survived'], axis =1) # select test set

In [35]:
X_train.drop(['PassengerId', 'Survived'], axis=1, inplace=True) # drop columns from train set

In [36]:
passenger_id = X_test.pop('PassengerId') # will need this for kaggle submission

In [37]:
# X_test.drop(['Survived'], axis=1, inplace=True) # drop columns from test set

# Model Building

For the models I chose random forests and SVM. For both I will do a simple grid search to find the best parameters and use those to train the final model. I will then print the training accuracy and submit the predictions on Kaggle.

In [38]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV

In [39]:
rf = RandomForestClassifier(random_state=42) # a classifier to use grid search on

In [40]:
rf_param_grid = {'n_estimators': [100,200],
                'max_depth': [10, 20],
                'min_samples_leaf': [2,5, 10],
                'min_samples_split': [2,3],
                'bootstrap': [True, False]}

In [41]:
rf_grid_search = GridSearchCV(estimator=rf, param_grid=rf_param_grid, cv=5, n_jobs=-1, verbose=2, scoring='accuracy')

In [42]:
rf_grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
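
The numbers in the log line above follow directly from the grid: every combination of parameter values is one candidate, and each candidate is fitted once per fold. The sketch below checks this with scikit-learn's `ParameterGrid`, using a copy of the grid defined earlier.

```python
from sklearn.model_selection import ParameterGrid

rf_param_grid = {'n_estimators': [100, 200],
                 'max_depth': [10, 20],
                 'min_samples_leaf': [2, 5, 10],
                 'min_samples_split': [2, 3],
                 'bootstrap': [True, False]}

n_candidates = len(ParameterGrid(rf_param_grid))  # 2 * 2 * 3 * 2 * 2
n_fits = n_candidates * 5                         # cv=5 folds per candidate
print(n_candidates, n_fits)  # 48 240
```

Counting candidates this way before calling `fit` gives a quick sense of how long a search will take.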


In [43]:
rf_grid_search.best_params_ # print best parameters

{'bootstrap': True,
 'max_depth': 20,
 'min_samples_leaf': 2,
 'min_samples_split': 2,
 'n_estimators': 200}
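
Rather than copying the best parameters by hand into a new classifier, the fitted search object can supply them. The sketch below uses a small synthetic dataset and a reduced grid so that it runs quickly; neither is the notebook's actual data or grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; not the Titanic features
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={'n_estimators': [10, 20], 'max_depth': [3, 5]},
    cv=3,
    scoring='accuracy',
)
search.fit(X, y)

best_rf = search.best_estimator_  # already refit on all of X, y (refit=True is the default)
print(search.best_params_)
print(f"Mean cross-validated accuracy: {search.best_score_:.3f}")
```

`best_score_` is the mean accuracy on the held-out folds, so it is a less optimistic estimate than accuracy measured on the training data itself.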

In [44]:
rf_clf = RandomForestClassifier(bootstrap=True,
                                max_depth=20,
                                min_samples_leaf=2,
                                min_samples_split=2,
                                n_estimators=100, # note: the grid search above selected 200
                                random_state=42)

In [45]:
rf_clf.fit(X_train, y_train) # fitting the data

In [46]:
rf_y_pred = rf_clf.predict(X_train) # getting predictions on training data

In [47]:
rf_accuracy = accuracy_score(y_train, rf_y_pred) # getting accuracy on training data

In [49]:
rf_conf_matrix = confusion_matrix(y_train, rf_y_pred)
print(f"Training Accuracy: {rf_accuracy}")
print("Confusion Matrix:")
print(rf_conf_matrix)

Training Accuracy: 0.8832772166105499
Confusion Matrix:
[[524  25]
 [ 79 263]]


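Our training accuracy is about 0.883 (787 of 891 passengers classified correctly). The confusion matrix shows 25 non-survivors predicted as survivors and 79 survivors predicted as non-survivors.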
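
Accuracy alone hides how the errors are split between the two classes. The sketch below derives precision and recall for the survivor class from the confusion matrix printed above; the numbers are copied from that output and the calculation is an added illustration.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
conf = np.array([[524, 25],
                 [79, 263]])
tn, fp, fn, tp = conf.ravel()

accuracy = (tp + tn) / conf.sum()
precision = tp / (tp + fp)  # of predicted survivors, how many actually survived
recall = tp / (tp + fn)     # of actual survivors, how many were found
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

On the training data the model is more precise (about 0.913) than it is sensitive (about 0.769), meaning it misses survivors more often than it wrongly predicts survival.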

In [50]:
from sklearn.svm import SVC

In [51]:
svm_param_grid = {
    'C': [0.1, 1, 10, 100],  # Regularization parameter
    'gamma': ['scale', 'auto', 0.1, 1],  # Kernel coefficient (ignored by the linear kernel)
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid']  # Kernel types
}

In [52]:
svm_grid_search = GridSearchCV(estimator=SVC(), param_grid=svm_param_grid, cv=5, n_jobs=-1, verbose=1)

In [None]:
svm_grid_search.fit(X_train, y_train)

In [None]:
svm_grid_search.best_params_

In [None]:
svm_clf = SVC(**svm_grid_search.best_params_) # reuse the parameters selected by the grid search

In [None]:
svm_clf.fit(X_train, y_train)

In [None]:
svm_y_pred = svm_clf.predict(X_train) 

In [None]:
svm_accuracy = accuracy_score(y_train, svm_y_pred) # getting accuracy on training data

In [None]:
svm_conf_matrix = confusion_matrix(y_train, svm_y_pred)
print(f"Training Accuracy: {svm_accuracy}")
print("Confusion Matrix:")
print(svm_conf_matrix)

In [None]:
predictions = rf_clf.predict(X_test) # predictions on the test set, using the random forest fitted above

In [None]:
output = pd.DataFrame({'PassengerId': passenger_id, 'Survived': predictions.astype(int)})

In [None]:
output.to_csv('submission.csv', index=False)

# Conclusion
Ideally I would have checked `test.csv` at the start and noticed that its labels are missing. Evaluating on the training set is not a good measure of how the model generalizes. Still, the fact that training accuracy did not reach 0.99 suggests the model is not severely overfitting; confirming this would require labels for the test set. A future improvement is to inspect both the training and test sets before starting. In the end this notebook served more as a pandas exercise to get used to its features, and it was really helpful. Additionally, I got 0.77751 accuracy on Kaggle, so not a terrible experiment, although that score is noticeably below the training accuracy of about 0.883, so the model does lose some accuracy on unseen data.
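
One way to get an estimate of generalization without test labels is to hold out part of the labeled training data for validation. The sketch below shows the idea on synthetic data; the dataset, split size and model settings are assumptions for illustration, not results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the Titanic features
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Keep 20% of the labeled rows aside; stratify preserves the class balance
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_fit, y_fit)
train_acc = accuracy_score(y_fit, clf.predict(X_fit))
val_acc = accuracy_score(y_val, clf.predict(X_val))
print(f"training accuracy:   {train_acc:.3f}")
print(f"validation accuracy: {val_acc:.3f}")
```

Applied to this notebook, the same split on `X_train` and `y_train` would give a validation accuracy that can be compared with the Kaggle score.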