# ⛴ Titanic Machine Learning Survival Predictions 🚢

## Goals 🥅

- Our goal is to build a model that can predict whether a passenger survives, based on the features given.

## Project Planning 🌱
When starting a project, I like to outline the steps I plan to take. Below is the rough outline I created for this project.

### Plan 📝
1. Understand the shape of the data
  - Histograms and Boxplots
2. Data Cleaning
  - Value Counts
  - Missing Data
3. Data Exploration
  - Correlation between metrics
  - Explore Interesting Themes
    - Wealthy survive? 
    - By location
    - Age Scatterplot with ticket price
    - Young and wealthy Variables? 
    - Total spent? 
4. Feature Engineering
  - Preprocess Data together or use a transformer?
5. Data Preprocessing for Model
  - Label Test and Train set. 
6. Basic Model Building 
  - Model Baseline
7. Model Tuning
8. Ensemble Model Building
9. Results

## Import some libraries 📚📚


In [16]:
# For the Data Cleaning, Exploration and Manipulation
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [17]:
train_df = pd.read_csv('../data/train.csv')
test_df = pd.read_csv('../data/test.csv')
gender_submission_df = pd.read_csv('../kaggle_submissions/gender_submission.csv')

In [18]:
test_df.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


In [19]:
# make sure to check the shape
# the test data needs the same feature columns as the training data
train_df.shape

(891, 12)

In [20]:
test_df.shape

(418, 11)

## Clean the Data

In [21]:
train_df = train_df.drop_duplicates()


In [22]:
train_df.shape

(891, 12)

In [23]:
# we can check the fraction of null data per column by dividing the null count by the number of rows.
# sort the values in descending order to see where we need to focus the most.
(train_df.isnull().sum()/len(train_df)).sort_values(ascending=False)

Cabin          0.771044
Age            0.198653
Embarked       0.002245
PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
dtype: float64
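
The same per-column fraction can be written a little more compactly. This is a minimal sketch on a made-up frame (`toy_df` and its values are illustrative, not the Titanic data), relying on the fact that the mean of a boolean mask is the share of `True` values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# mean() of a boolean mask is the fraction of True values,
# so this equals toy_df.isnull().sum() / len(toy_df)
null_fraction = toy_df.isnull().mean().sort_values(ascending=False)
print(null_fraction)
```

Because the denominator is taken from the frame itself, this form cannot accidentally divide by the length of a different DataFrame.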

In [24]:
# we need to make the data uniform
# a majority of the null data is in the cabin column. 
# drop the cabin column

train_df = train_df.drop(columns="Cabin")
test_df = test_df.drop(columns="Cabin")

In [25]:
# cabin should now be gone from both data sets
train_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | S |


In [26]:
test_df.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | S |


In [27]:
# because the remaining null fractions are low, we can impute the missing values.
(train_df.isnull().sum()/len(train_df)).sort_values(ascending=False)



Age            0.198653
Embarked       0.002245
PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
dtype: float64

In [28]:
# divide by len(test_df) so the fractions refer to the test set itself
(test_df.isnull().sum()/len(test_df)).sort_values(ascending=False)

Age            0.205742
Fare           0.002392
PassengerId    0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Embarked       0.000000
dtype: float64

In [29]:
test_df['Fare'].value_counts()

7.7500     21
26.0000    19
13.0000    17
8.0500     17
7.8958     11
           ..
7.8208      1
8.5167      1
78.8500     1
52.0000     1
22.3583     1
Name: Fare, Length: 169, dtype: int64

In [30]:
passenger_id = test_df['PassengerId']
passenger_id

0       892
1       893
2       894
3       895
4       896
       ... 
413    1305
414    1306
415    1307
416    1308
417    1309
Name: PassengerId, Length: 418, dtype: int64

### Imputing Embarked

In [31]:
# Embarked has the lowest null count, so filling with the most frequent value is a reasonable option
from sklearn.impute import SimpleImputer
impute_embarked = SimpleImputer(strategy='most_frequent')
train_df[['Embarked']] = impute_embarked.fit_transform(train_df[["Embarked"]])

In [32]:
# Impute the fare for the test set:
impute_fare = SimpleImputer(strategy='most_frequent')
test_df[['Fare']] = impute_fare.fit_transform(test_df[['Fare']])
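
To make the choice of strategy concrete, here is a small sketch on made-up values (`toy_df` is illustrative). It uses `most_frequent` for a categorical column, as in the cells above, and swaps in `median` for the continuous fare column as one possible alternative; that swap is an assumption for illustration, not what this notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one categorical and one continuous column (made-up values)
toy_df = pd.DataFrame({
    "Embarked": pd.Series(["S", "C", "S", np.nan, "Q"], dtype=object),
    "Fare": [7.25, 8.05, np.nan, 8.05, 512.33],
})

# The mode is the natural fill value for a categorical column
embarked_imputer = SimpleImputer(strategy="most_frequent")
toy_df[["Embarked"]] = embarked_imputer.fit_transform(toy_df[["Embarked"]])

# For a skewed continuous column the median is a common alternative,
# because a single very large fare does not pull it upward
fare_imputer = SimpleImputer(strategy="median")
toy_df[["Fare"]] = fare_imputer.fit_transform(toy_df[["Fare"]])

print(toy_df)
```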

### Imputing Age

In [33]:
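# age is a little different
# using Nearest Neighbors would be a better choice for this one.
# note: with 'Age' as the only column there are no other features to measure distance on,
# so KNNImputer falls back to the column mean here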
from sklearn.impute import KNNImputer
impute_age = KNNImputer(n_neighbors=8)
train_df[['Age']] = impute_age.fit_transform(train_df[['Age']])


In [34]:
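# note: fit_transform refits the imputer on the test set; impute_age.transform would reuse the fit from train_df
test_df[['Age']] = impute_age.fit_transform(test_df[['Age']])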
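
For the neighbours to mean anything, `KNNImputer` needs other columns to measure distance on. The sketch below uses a made-up frame (`toy_df`, with illustrative values) to compare the single-column call with a multi-column one. In practice the helper columns would usually be scaled first so that no single feature dominates the distance; that step is left out here to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data: the last passenger has no age but resembles rows 2 and 3 (made-up values)
toy_df = pd.DataFrame({
    "Age":    [4.0, 6.0, 40.0, 44.0, np.nan],
    "Pclass": [3, 3, 1, 1, 1],
    "Fare":   [8.0, 9.0, 80.0, 90.0, 85.0],
})

# Only 'Age': no feature is available to compute distances,
# so the missing value is filled with the column mean
single = KNNImputer(n_neighbors=2).fit_transform(toy_df[["Age"]])

# With 'Pclass' and 'Fare' included, the two nearest rows supply the age
multi = KNNImputer(n_neighbors=2).fit_transform(toy_df[["Age", "Pclass", "Fare"]])

print("single-column fill:", single[4, 0])
print("multi-column fill: ", multi[4, 0])
```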

### Quick Check

In [35]:
# make sure that there are no nulls left. 
(train_df.isnull().sum()/len(train_df)).sort_values(ascending=False)

PassengerId    0.0
Survived       0.0
Pclass         0.0
Name           0.0
Sex            0.0
Age            0.0
SibSp          0.0
Parch          0.0
Ticket         0.0
Fare           0.0
Embarked       0.0
dtype: float64

In [36]:
train_df.shape

(891, 11)

In [37]:
(test_df.isnull().sum()/len(test_df)).sort_values(ascending=False)

PassengerId    0.0
Pclass         0.0
Name           0.0
Sex            0.0
Age            0.0
SibSp          0.0
Parch          0.0
Ticket         0.0
Fare           0.0
Embarked       0.0
dtype: float64

In [38]:
test_df.shape

(418, 10)

## Train data set testing!

### Target and features

In [39]:
# our target is the 'Survived' column. 
# the remaining columns are the features. 

y = train_df['Survived']
X = train_df.drop(columns=['Survived'])

### Holdout Method

In [40]:
# now we split the data into a training set and a held-out test set. 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y)
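
Without a seed, every run of the split above produces different rows and therefore different scores. A hedged sketch of a reproducible, stratified split follows; the toy data and the specific `test_size` and `random_state` values are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and target: 12 non-survivors and 8 survivors (made-up data)
toy_X = pd.DataFrame({"Fare": range(20)})
toy_y = pd.Series([0] * 12 + [1] * 8, name="Survived")

# random_state makes the split repeatable; stratify keeps the class ratio
# the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)

print(len(X_tr), len(X_te), y_tr.mean(), y_te.mean())
```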

In [41]:
X_train.dtypes

PassengerId      int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Embarked        object
dtype: object

In [42]:
# only integers and floats can be used for calculation. 

X_train_num = X_train.select_dtypes(include=['int64','float64'])
X_test_num = X_test.select_dtypes(include=['int64','float64'])

### Scale the Features

In [43]:
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

In [44]:
from sklearn.compose import ColumnTransformer

In [45]:
standard_features = ["Age"]
robust_features = ["Fare"]
minmax_features = ["Pclass", "SibSp", "Parch"]

In [46]:
# it's easier to do it all at once
# set_output(transform='pandas') keeps the result a pandas DataFrame
scalers = ColumnTransformer([
    ("standard_scaler", StandardScaler(), standard_features),
    ("robust_scaler", RobustScaler(), robust_features),  
    ("minmax_scaler", MinMaxScaler(), minmax_features),      
]).set_output(transform='pandas')

scalers

In [47]:
# fit the scalers on the training split only, then apply the same scaling to the test split
X_train_num_scaled = scalers.fit_transform(X_train_num)
X_test_num_scaled = scalers.transform(X_test_num)

### Encoding Categorical Variables

In [48]:
X_train_cat = X_train.select_dtypes(exclude = ["int64", "float64"])
X_test_cat = X_test.select_dtypes(exclude = ["int64", "float64"])

In [49]:
X_train_cat.head()

| | Name | Sex | Ticket | Embarked |
|---|---|---|---|---|
| 684 | Brown, Mr. Thomas William Solomon | male | 29750 | S |
| 144 | Andrew, Mr. Edgardo Samuel | male | 231945 | S |
| 218 | Bazzani, Miss. Albina | female | 11813 | C |
| 512 | McGough, Mr. James Robert | male | PC 17473 | S |
| 725 | Oreskovic, Mr. Luka | male | 315094 | S |


In [50]:
X_train_cat.dtypes

Name        object
Sex         object
Ticket      object
Embarked    object
dtype: object

In [51]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output = False,
                   drop = "if_binary",
                   handle_unknown = 'ignore').set_output(transform="pandas")
ohe.fit(X_train_cat)


In [52]:
X_train_cat_encoded = ohe.transform(X_train_cat)
X_test_cat_encoded = ohe.transform(X_test_cat)



## One Hot Encoding

In [53]:
scalers = ColumnTransformer([
    ("standard_scaler", StandardScaler(), standard_features),
    ("robust_scaler", RobustScaler(), robust_features),  
    ("minmax_scaler", MinMaxScaler(), minmax_features),      
]).set_output(transform='pandas')
scalers

In [54]:
ohe = OneHotEncoder(sparse_output = False,
                   drop = "if_binary",
                   handle_unknown = 'ignore').set_output(transform='pandas')

In [55]:
from sklearn.compose import make_column_selector

In [56]:
preprocessor = ColumnTransformer([
    ("scalers", scalers, make_column_selector(dtype_include = ["int64", "float64"])),
    ("encoder", ohe, ['Sex', 'Embarked'])
]).set_output(transform='pandas')

preprocessor

In [57]:
preprocessor.fit(X_train)



In [58]:
X_train_preprocessed = preprocessor.transform(X_train)
X_test_preprocessed = preprocessor.transform(X_test)

In [59]:
X_train_preprocessed.shape

(668, 9)

In [60]:
train_df.shape

(891, 11)

## Make Baseline Model

In [61]:
# Finding the average
ave_survived = y_train.mean()
ave_survived


0.38622754491017963

In [62]:
# seeing the length of the column
len(y_test)

223

In [63]:
y_test.head()

502    0
379    0
381    1
854    0
118    0
Name: Survived, dtype: int64

In [64]:
# this is just the baseline
y_pred_baseline = pd.Series([ave_survived]*len(y_test))
y_pred_baseline

0      0.386228
1      0.386228
2      0.386228
3      0.386228
4      0.386228
         ...   
218    0.386228
219    0.386228
220    0.386228
221    0.386228
222    0.386228
Length: 223, dtype: float64

In [65]:
# Finding the RMSE (the square root of the mean squared error)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
rmse = np.sqrt(mean_squared_error(y_test,y_pred_baseline))
print("Root Mean Squared Error:", rmse)

Root Mean Squared Error: 0.4846480171742031


In [66]:
#Finding the Mean Absolute Error
mae = mean_absolute_error(y_test, y_pred_baseline)
print("Mean Absolute Error:", mae)

Mean Absolute Error: 0.47193952901372144


In [67]:
#Finding the r2 Score
r2 = r2_score(y_test, y_pred_baseline)
print("R-squared:", r2)

R-squared: -0.0003881076306138098
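
The baseline above is scored with regression metrics, which are hard to compare with the accuracy that classifiers report by default. As a complementary sketch, a majority-class baseline can be scored with accuracy instead; `DummyClassifier` is a swapped-in technique here, and the toy labels are made up.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels; 0 is the majority class in the training part
toy_y_train = np.array([0, 0, 0, 1, 1, 0, 1, 0])
toy_y_test = np.array([0, 1, 0, 0, 1])

# DummyClassifier ignores the features, so a placeholder column is enough
toy_X_train = np.zeros((len(toy_y_train), 1))
toy_X_test = np.zeros((len(toy_y_test), 1))

# Always predict the most frequent training class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_X_train, toy_y_train)

baseline_acc = accuracy_score(toy_y_test, baseline.predict(toy_X_test))
print("Baseline accuracy:", baseline_acc)
```

Any real model should beat this number to be worth keeping.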


## Testing Different Models 🚀🧑‍🚀

In [68]:
# We need a logistic regression model to do classification
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

# Fit the Model
model.fit(X_train_preprocessed, y_train)

# Evaluate the model on the validation set
# y_test_pred = model.predict(X_test_preprocessed)
# test_r2 = r2_score(y_test, y_test_pred)
# print("Validation R-squared:", test_r2)

In [69]:
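# Performing cross-validation
# note: cross_val_score refits a fresh copy of the model on each fold of the data it is given,
# so these scores come from the held-out split alone, not from the fit above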
from sklearn.model_selection import cross_validate, cross_val_score
cv_scores = cross_val_score(model,X_test_preprocessed, y_test, cv=20, verbose=True)
cv_scores

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    0.2s finished


array([0.66666667, 0.75      , 0.91666667, 0.72727273, 0.90909091,
       0.63636364, 0.81818182, 0.90909091, 0.72727273, 0.81818182,
       0.72727273, 0.81818182, 0.90909091, 1.        , 1.        ,
       0.63636364, 0.81818182, 0.63636364, 0.81818182, 0.90909091])

In [70]:
mean_cv_score = np.mean(cv_scores)
print("Mean Cross-Validation Score:", mean_cv_score)

Mean Cross-Validation Score: 0.8075757575757576
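
A more conventional arrangement is to cross-validate on the training split and keep the held-out split for one final check. The sketch below uses synthetic data from `make_classification` as a stand-in for the preprocessed features; the sample size, fold count, and seeds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the preprocessed features and target
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, random_state=0, stratify=y_syn
)

model = LogisticRegression()

# Cross-validate on the training split only; each fold refits a clone of the model
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring="accuracy")
print("Mean CV accuracy:", cv_scores.mean())

# Fit once on the full training split, then score once on the held-out split
model.fit(X_tr, y_tr)
holdout_acc = model.score(X_te, y_te)
print("Held-out accuracy:", holdout_acc)
```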


In [71]:
# Other Models I want to try for this classification.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

models = [DecisionTreeClassifier(), 
          RandomForestClassifier(), 
          GradientBoostingClassifier(),
          LogisticRegression()]

models_names = ['decision_tree',
                'random_forest',
                'gradient_boosting',
                'logistic']

In [72]:
# Comparing models
different_test_scores = []

for model_name, model in zip(models_names, models):

    model.fit(X_train_preprocessed, y_train)
    different_test_scores.append(np.mean(cross_val_score(model, X_test_preprocessed, y_test)))
    

comparing_regression_models = pd.DataFrame(list(zip(models_names, different_test_scores)),
                                                columns =['model_name', 'test_score'])

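# note: round() with no digits rounds each score to the nearest whole number,
# so any score above 0.5 is displayed as 1.0
round(comparing_regression_models.sort_values(by = "test_score", ascending = False))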

| | model_name | test_score |
|---|---|---|
| 1 | random_forest | 1.0 |
| 3 | logistic | 1.0 |
| 2 | gradient_boosting | 1.0 |
| 0 | decision_tree | 1.0 |
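
The identical 1.0 values are a display effect of rounding. A short sketch with a made-up score shows the difference between the built-in `round()` with no digits and `DataFrame.round` with an explicit number of decimals.

```python
import pandas as pd

# Made-up score table with a single model
scores = pd.DataFrame({"model_name": ["logistic"], "test_score": [0.8076]})

# Built-in round() with no digits rounds numeric columns to whole numbers
rounded_default = round(scores)

# Passing the number of decimals keeps the scores distinguishable
rounded_three = scores.round(3)

print(rounded_default)
print(rounded_three)
```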


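## Initial Results 🤔


1. The scores above are accuracy (the default metric `cross_val_score` uses for classifiers), not R2, and the 1.0 values in the comparison table come from `round()` rounding every score to the nearest whole number rather than from perfect predictions. The unrounded 20-fold cross-validation for logistic regression averaged about 0.81, so even a simple logistic regression on the scaled numerical features plus the encoded `Sex` and `Embarked` columns classifies roughly four out of five passengers in the held-out split correctly.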

## Making predictions with models

In [73]:
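# Preprocess the test data
# note: fitting again here means the scalers and encoder use test-set statistics;
# calling only preprocessor.transform(test_df) would reuse the fit from X_train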
preprocessor.fit(test_df)

test_data_preprocessed = preprocessor.transform(test_df)



In [74]:
test_data_preprocessed

| | scalers__standard_scaler__Age | scalers__robust_scaler__Fare | scalers__minmax_scaler__Pclass | scalers__minmax_scaler__SibSp | scalers__minmax_scaler__Parch | encoder__Sex_male | encoder__Embarked_C | encoder__Embarked_Q | encoder__Embarked_S |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.334993 | -0.281005 | 1.0 | 0.000 | 0.000000 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1.325530 | -0.316176 | 1.0 | 0.125 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 2.514175 | -0.202184 | 0.5 | 0.000 | 0.000000 | 1.0 | 0.0 | 1.0 | 0.0 |
| 3 | -0.259330 | -0.245660 | 1.0 | 0.000 | 0.000000 | 1.0 | 0.0 | 0.0 | 1.0 |
| 4 | -0.655545 | -0.091902 | 1.0 | 0.125 | 0.111111 | 0.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | 0.000000 | -0.271640 | 1.0 | 0.000 | 0.000000 | 1.0 | 0.0 | 0.0 | 1.0 |
| 414 | 0.691586 | 4.006002 | 0.0 | 0.000 | 0.000000 | 0.0 | 1.0 | 0.0 | 0.0 |
| 415 | 0.651965 | -0.305572 | 1.0 | 0.000 | 0.000000 | 1.0 | 0.0 | 0.0 | 1.0 |
| 416 | 0.000000 | -0.271640 | 1.0 | 0.000 | 0.000000 | 1.0 | 0.0 | 0.0 | 1.0 |


In [75]:
#Logistic Regression

model = LogisticRegression()

model.fit(X_train_preprocessed,y_train)

y_predic_log = model.predict(test_data_preprocessed)
print(f"Predictions for the Logistic Regression model are as follows:{y_predic_log}")

Predictions for the Logistic Regression model are as follows:[0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 0 1 0 0 0 0 0 1 1 1 0 1
 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 1 1 1 0 1 1
 1 1 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
 1 1 1 1 0 0 1 1 1 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0
 0 0 1 0 0 1 0 0 1 1 0 1 1 0 1 0 0 1 0 0 1 1 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1
 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0
 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 1 1 0 1 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0
 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 1 0 1 1 0
 0 1 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 0 0
 1 1 1 1 1 1 0 1 0 0 0]


## Make it into a CSV File for submission

In [81]:
test_df["Survived"] = y_predic_log
test_df['PassengerId'] = passenger_id
log_prediction_dataset = test_df[['PassengerId',"Survived"]]

In [82]:
# Specify the file path
file_path = '../kaggle_submissions/logistic_regressor_predictions_2.csv'

log_prediction_dataset.to_csv(file_path, index = False)
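
Before uploading, it can help to reload the file and confirm it has the expected layout. The sketch below is a hypothetical helper (`check_submission` is not part of any library) and assumes the submission should contain exactly the two columns written above, one row per test passenger, with `Survived` restricted to 0 or 1. It round-trips a toy frame through an in-memory buffer instead of a real file.

```python
import io

import pandas as pd


def check_submission(df, expected_rows):
    """Return a list of problems found in a submission frame (hypothetical helper)."""
    problems = []
    if list(df.columns) != ["PassengerId", "Survived"]:
        problems.append("columns must be PassengerId, Survived")
        return problems
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(df)}")
    if df["PassengerId"].duplicated().any():
        problems.append("duplicate PassengerId values")
    if not df["Survived"].isin([0, 1]).all():
        problems.append("Survived must be 0 or 1")
    return problems


# Toy submission written to and read back from an in-memory CSV
toy_submission = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

problems = check_submission(reloaded, expected_rows=3)
print(problems)
```

For the real file, `expected_rows` would be the number of rows in `test_df`, which is 418 here.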