# Titanic Machine Learning Competition
*Using Logistic Regression to predict passenger survival*

### Imports

In [275]:
from sklearn.linear_model import LogisticRegression
import pandas as pd 
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt 
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.feature_selection import RFE
from statistics import median
import statsmodels.api as sm
from sklearn import preprocessing

In [276]:
titanic_test = pd.read_csv("C:/Users/james/OneDrive/Desktop/Python/Datasets for Projects/Titanic Prediction/test.csv", converters = {'Sex': lambda x: int(x == 'male')})
titanic_train = pd.read_csv("C:/Users/james/OneDrive/Desktop/Python/Datasets for Projects/Titanic Prediction/train.csv", converters = {'Sex': lambda x: int(x == 'male')})

### Preprocessing

First, I checked how many missing (NaN) values each column of the training set contains. The Age variable has 177 missing values, which I impute further down. I am not going to worry about the missing values in Cabin or Embarked, as I will not be using either variable.

In [277]:
nan_count = titanic_train.isna().sum()
print(nan_count)

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
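Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration and are not the Titanic data); the same two calls should work on `titanic_train`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_train (values are invented).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# isna().sum() counts missing cells per column; isna().mean() gives the fraction.
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "share": toy.isna().mean(),
})

# Keep only columns that actually have gaps, worst first.
summary = summary[summary["missing"] > 0].sort_values("share", ascending=False)
print(summary)
```

A column with a very high missing share (as Cabin appears to have here) is usually a candidate for dropping, while a moderate share (as with Age) is more often imputed.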


Next I am dropping the variables I won't be using.

In [278]:
titanic_train = titanic_train.drop(['Name', 'Ticket', 'Cabin', 'Parch', 'SibSp', 'Embarked'], axis = 1)
titanic_test = titanic_test.drop(['Name', 'Ticket', 'Cabin', 'Parch', 'SibSp', 'Embarked'], axis = 1)

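To deal with the missing values in the Age variable, I replace them with the median age of the respective dataset. After that, I standardize the feature columns (Pclass, Sex, Age, Fare) of the training and test sets with `preprocessing.scale`, leaving Survived and PassengerId unscaled.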

In [279]:
titanic_train['Age'] = titanic_train['Age'].fillna(titanic_train['Age'].median())
titanic_test['Age'] = titanic_test['Age'].fillna(titanic_test['Age'].median())

preprocessed_test = pd.DataFrame(preprocessing.scale(titanic_test.drop(['PassengerId'], axis = 1)))
preprocessed_test['PassengerId'] = titanic_test['PassengerId']
preprocessed_test.columns = ['Pclass', 'Sex', 'Age', 'Fare', 'PassengerId']

preprocessed_train = pd.DataFrame(preprocessing.scale(titanic_train.drop(['Survived', 'PassengerId'], axis = 1)))
preprocessed_train['Survived'] = titanic_train['Survived']
preprocessed_train['PassengerId'] = titanic_train['PassengerId']
preprocessed_train.columns = ['Pclass', 'Sex', 'Age', 'Fare', 'Survived', 'PassengerId']
preprocessed_train
preprocessed_test

| | Pclass | Sex | Age | Fare | PassengerId |
|---|---|---|---|---|---|
| 0 | 0.873482 | 0.755929 | 0.386231 | -0.497811 | 892 |
| 1 | 0.873482 | -1.322876 | 1.371370 | -0.512660 | 893 |
| 2 | -0.315819 | 0.755929 | 2.553537 | -0.464532 | 894 |
| 3 | 0.873482 | 0.755929 | -0.204852 | -0.482888 | 895 |
| 4 | 0.873482 | -1.322876 | -0.598908 | -0.417971 | 896 |
| ... | ... | ... | ... | ... | ... |
| 413 | 0.873482 | 0.755929 | -0.204852 | -0.493856 | 1305 |
| 414 | -1.505120 | -1.322876 | 0.740881 | 1.312180 | 1306 |
| 415 | 0.873482 | 0.755929 | 0.701476 | -0.508183 | 1307 |
| 416 | 0.873482 | 0.755929 | -0.204852 | -0.493856 | 1308 |
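The cell above imputes and scales the training and test sets independently, so each set is centered on its own median, mean, and standard deviation. A common alternative is to learn those statistics from the training set only and reuse them on the test set. The sketch below shows that pattern with scikit-learn's `SimpleImputer` and `StandardScaler`; the frames `toy_train` and `toy_test` are hypothetical stand-ins, and this is one possible variant rather than what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the Titanic feature columns.
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0],
                          "Fare": [10.0, 20.0, 30.0, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0],
                         "Fare": [25.0, 5.0]})

# Median imputation followed by standardization, chained in one object.
prep = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# fit_transform learns the median, mean, and std from the training rows only.
train_scaled = pd.DataFrame(prep.fit_transform(toy_train), columns=toy_train.columns)

# transform reuses the training statistics; nothing is re-estimated on test data.
test_scaled = pd.DataFrame(prep.transform(toy_test), columns=toy_test.columns)

print(train_scaled)
print(test_scaled)
```

With this setup, a value means the same thing in both sets: a scaled Age of 0 in the test set corresponds to the training mean, not the test mean.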


### Feature Selection

Next, I run feature selection using recursive feature elimination (RFE) with a logistic regression estimator. The final variables selected are Pclass (passenger class) and Sex.

In [280]:
x = preprocessed_train.columns.drop(['Survived', 'PassengerId'])
y = ['Survived']
logreg = LogisticRegression()
print(x)
rfe = RFE(logreg)
rfe = rfe.fit(preprocessed_train[x], preprocessed_train[y].values.ravel())
print(rfe.support_)
print(rfe.ranking_)


Index(['Pclass', 'Sex', 'Age', 'Fare'], dtype='object')
[ True  True False False]
[1 1 2 3]
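When `n_features_to_select` is not given, scikit-learn's `RFE` keeps half of the features, which is why two of the four columns come back as selected here. The sketch below makes that choice explicit on hypothetical synthetic data (`X_toy`, `y_toy`, and the column names are invented for illustration), so the number of retained features is a visible decision rather than a default.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data: only columns "a" and "b" influence the outcome.
rng = np.random.default_rng(0)
n = 400
X_toy = pd.DataFrame(rng.normal(size=(n, 4)), columns=["a", "b", "c", "d"])
logits = 2.0 * X_toy["a"] - 2.0 * X_toy["b"]
y_toy = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# State the number of features to keep explicitly instead of relying on the default.
selector = RFE(LogisticRegression(), n_features_to_select=2)
selector.fit(X_toy, y_toy)

# support_ is a boolean mask over columns; ranking_ is 1 for every kept feature.
selected = list(X_toy.columns[selector.support_])
print(selected)
print(selector.ranking_)
```

A ranking of 1 only means "kept"; features ranked 2 and 3 were eliminated last and first respectively, which says nothing on its own about whether they would still improve the model.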


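Before fitting the final model, I check the significance of the selected variables with a statsmodels logit model. Neither variable has a p-value above 0.05 in the summary below, so we are good to fit the model.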

In [281]:
X_train = preprocessed_train.drop(['Survived', 'Age', 'PassengerId', 'Fare'], axis = 1)
y_train = preprocessed_train['Survived']
X_test = preprocessed_test.drop(['Age', 'PassengerId', 'Fare'], axis = 1)
logit_model = sm.Logit(y_train, X_train)
result = logit_model.fit()
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.494933
         Iterations 6
                         Results: Logit
Model:              Logit            Method:           MLE       
Dependent Variable: Survived         Pseudo R-squared: 0.257     
Date:               2023-09-25 13:14 AIC:              885.9712  
No. Observations:   891              BIC:              895.5559  
Df Model:           1                Log-Likelihood:   -440.99   
Df Residuals:       889              LL-Null:          -593.33   
Converged:          1.0000           LLR p-value:      3.1429e-68
No. Iterations:     6.0000           Scale:            1.0000    
-------------------------------------------------------------------
          Coef.    Std.Err.      z       P>|z|     [0.025    0.975]
-------------------------------------------------------------------
Pclass   -0.7593     0.0857    -8.8560   0.0000   -0.9274   -0.5913
Sex      -1.2680     0.0897   -14.1304   0.0000   -1.4439 

### Model Fitting and Predicting

In [282]:
logreg = LogisticRegression(solver='lbfgs', max_iter=100)

logreg.fit(X_train, y_train)

y_train_predictions = logreg.predict(X_train)
y_test_predictions = logreg.predict(X_test)
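The training predictions are computed above but not evaluated, and the test set has no labels to score against. A minimal sketch of how the fit could be checked is shown below: training accuracy via `accuracy_score`, plus 5-fold cross-validation as a rougher estimate of out-of-sample accuracy. The data (`X_toy`, `y_toy`) is hypothetical; in the notebook, `X_train` and `y_train` would take their place.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data standing in for X_train / y_train.
rng = np.random.default_rng(0)
n = 300
X_toy = rng.normal(size=(n, 2))
y_toy = (X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Same estimator settings as in the cell above.
model = LogisticRegression(solver="lbfgs", max_iter=100)
model.fit(X_toy, y_toy)

# Accuracy on the data the model was fit on (optimistic by construction).
train_acc = accuracy_score(y_toy, model.predict(X_toy))

# 5-fold cross-validation: each fold is scored by a model that never saw it.
cv_scores = cross_val_score(
    LogisticRegression(solver="lbfgs", max_iter=100),
    X_toy, y_toy, cv=5, scoring="accuracy",
)

print("training accuracy:", train_acc)
print("mean CV accuracy:", cv_scores.mean())
```

A large gap between the two numbers would suggest overfitting; with only two features and a linear model, the gap is usually small.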


### Outputting Predictions to File

With the model fit and the test-set predictions generated, I write them to a submission file to turn in :)

In [283]:
# Build the submission frame directly from named columns
predictions = pd.DataFrame({
    'PassengerId': preprocessed_test['PassengerId'],
    'Survived': y_test_predictions,
})
predictions.to_csv('submission.csv', index = False)
