# Supervised ML - Predict Titanic Survival

## Logistic Regression - classification with binary response variable

- In logistic regression, the outcome (dependent variable) takes only a limited number of possible values, so it is used when the response variable is ``categorical``. Linear regression, by contrast, is used when the response variable is ``continuous`` (e.g., predicting height from weight).

- Mathematically, a binary logistic model has a dependent variable with two possible values, such as pass/fail, win/lose, alive/dead or healthy/sick. These are represented by an indicator variable whose two values are labeled "0" and "1". In this notebook the indicator is the Survived column.
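
To make the 0/1 labeling concrete, here is a minimal sketch of how a logistic model turns a linear score into a probability and then into a label. The scores are made-up illustrative values, not outputs of the Titanic model, and `sigmoid` is a helper defined here only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (intercept + weighted sum of the features)
scores = np.array([-2.0, 0.5, 2.0])
probabilities = sigmoid(scores)

# Label "1" when the probability exceeds 0.5, otherwise "0"
labels = (probabilities > 0.5).astype(int)

print(probabilities)
print(labels)
```

A negative score gives a probability below 0.5 (label 0), and a positive score gives a probability above 0.5 (label 1).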

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

from sklearn.linear_model import LogisticRegression

### Load Training Dataset - Titanic_TrainingDataset.csv

In [2]:
train=pd.read_csv("Titanic_TrainingDataset.csv")
train.shape

(891, 12)

In [3]:
train.head(5)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


### Select the columns needed to build the model and clean up the data

In [4]:
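# Take an explicit copy so that later column assignments modify df itself, not a view of train
df = train[['Survived','Pclass','Sex','Age','Fare']].copy()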

In [5]:
df.head(5)

| | Survived | Pclass | Sex | Age | Fare |
|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 7.25 |
| 1 | 1 | 1 | female | 38.0 | 71.2833 |
| 2 | 1 | 3 | female | 26.0 | 7.925 |
| 3 | 1 | 1 | female | 35.0 | 53.1 |
| 4 | 0 | 3 | male | 35.0 | 8.05 |


In [6]:
# Encode Sex numerically: male -> 1, female -> 0
df["Sex"] = df["Sex"].apply(lambda sex: 1 if sex == "male" else 0)

In [7]:
df.head(5) 

| | Survived | Pclass | Sex | Age | Fare |
|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 22.0 | 7.25 |
| 1 | 1 | 1 | 0 | 38.0 | 71.2833 |
| 2 | 1 | 3 | 0 | 26.0 | 7.925 |
| 3 | 1 | 1 | 0 | 35.0 | 53.1 |
| 4 | 0 | 3 | 1 | 35.0 | 8.05 |
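
The `apply` call with an if/else treats every value other than "male" as 0, including typos or missing entries. A possible alternative, sketched here on made-up values, is `Series.map` with an explicit dictionary, which turns unmapped values into NaN so they stay visible. The "unknown" entry is a hypothetical stand-in for an unexpected category.

```python
import pandas as pd

# Illustrative values only; "unknown" stands in for an unexpected category
sex = pd.Series(["male", "female", "unknown"])

# apply with if/else: anything that is not "male" silently becomes 0
via_apply = sex.apply(lambda s: 1 if s == "male" else 0)

# map with an explicit dictionary: unmapped values become NaN
via_map = sex.map({"male": 1, "female": 0})

print(via_apply.tolist())
print(via_map.isnull().sum())
```

Counting the NaN values after `map` is a quick check that every category was encoded as intended.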


In [8]:
df.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
Fare          0
dtype: int64

In [None]:
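# Handle missing Age values by filling them with the median age
df["Age"] = df["Age"].fillna(df["Age"].median())
# Fare has no missing values in the training data, so the same treatment is not needed here
#df["Fare"] = df["Fare"].fillna(df["Fare"].median())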

In [None]:
df.isnull().sum()
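
As a small illustration of the median fill used for Age, the sketch below runs the same two calls on a made-up series. `Series.median()` skips NaN values by default, so the median is computed from the known ages only.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries
ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan])

# median() ignores NaN, so it is computed from 22, 26 and 35
median_age = ages.median()

# Replace every missing entry with that median
filled = ages.fillna(median_age)

print(median_age)
print(filled.tolist())
```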

In [None]:
X_train = df.drop("Survived", axis=1) #predictors from training 
Y_train = df["Survived"]

In [None]:
X_train.head()

In [None]:
Y_train.head()


### Train the model using sklearn.linear_model.LogisticRegression, one of the many algorithms available in sklearn

https://scikit-learn.org/stable/

In [None]:
# Define the logistic regression model
logreg = LogisticRegression()

# Fit the model to the training data: X must be a 2D matrix of features and y a 1D vector of labels.
# fit() returns the fitted estimator itself, so titanicmodel and logreg refer to the same object.
titanicmodel = logreg.fit(X_train, Y_train)

In [None]:
# Our model is trained now
type(titanicmodel)
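
A fitted `LogisticRegression` exposes its learned weights through the `coef_` and `intercept_` attributes. The sketch below shows how they could be inspected. It fits a separate model on a tiny made-up dataset (`toy_X`, `toy_y` and `toy_model` are hypothetical names), so the numbers it prints say nothing about the real Titanic data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset with the same four feature columns as the notebook
toy_X = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 1],
    "Sex":    [0, 1, 0, 1, 0, 1, 1, 0],
    "Age":    [38, 54, 27, 35, 26, 22, 40, 35],
    "Fare":   [71.3, 51.9, 11.1, 13.0, 7.9, 7.3, 8.1, 53.1],
})
toy_y = pd.Series([1, 0, 1, 0, 1, 0, 0, 1])  # made-up labels

toy_model = LogisticRegression(max_iter=1000).fit(toy_X, toy_y)

# coef_ has shape (1, n_features) for a binary problem; pair it with column names
coefficients = pd.Series(toy_model.coef_[0], index=toy_X.columns)

print(coefficients)
print(toy_model.intercept_)
```

A positive coefficient pushes the prediction toward class 1 as the feature grows, and a negative one pushes it toward class 0. The same two attributes are available on `titanicmodel` once it has been fitted.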

## Let's predict the outcome using the model

In [None]:
data = {'Pclass': [3, 2], 'Sex': [1, 0],'Age': [40, 35], 'Fare': [7.98, 8.99]}
passengers = pd.DataFrame(data)
passengers

In [None]:
predictions = titanicmodel.predict(passengers)
print(predictions)
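
`predict` returns only the 0/1 label. When the confidence behind each label matters, `predict_proba` returns the estimated probability of each class. The sketch below is self-contained and uses a model fitted on made-up data (`toy_X`, `toy_y` and `toy_model` are hypothetical names), so its probabilities are illustrative only.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set, used only so this example runs on its own
toy_X = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 1],
    "Sex":    [0, 1, 0, 1, 0, 1, 1, 0],
    "Age":    [38, 54, 27, 35, 26, 22, 40, 35],
    "Fare":   [71.3, 51.9, 11.1, 13.0, 7.9, 7.3, 8.1, 53.1],
})
toy_y = pd.Series([1, 0, 1, 0, 1, 0, 0, 1])
toy_model = LogisticRegression(max_iter=1000).fit(toy_X, toy_y)

# Two new passengers, in the same column order used for training
new_passengers = pd.DataFrame(
    {"Pclass": [3, 2], "Sex": [1, 0], "Age": [40, 35], "Fare": [7.98, 8.99]}
)

# One row per passenger; columns follow toy_model.classes_, here [P(0), P(1)]
proba = toy_model.predict_proba(new_passengers)
labels = toy_model.predict(new_passengers)

print(toy_model.classes_)
print(proba)
print(labels)
```

Each row of `proba` sums to 1, and the predicted label is the class with the larger probability.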

### Load Test Dataset - Titanic_TestDataset.csv

- Notice that the test dataset has no Survived column; that is the value we want to predict.
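
Because the test file carries no true labels, accuracy cannot be measured on it. One common way to estimate it anyway is to hold back part of the labeled training data. The sketch below shows the idea on synthetic data. The variable names and the labeling rule are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labeled training data
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.integers(0, 2, size=n),
    "Age": rng.uniform(1, 70, size=n),
    "Fare": rng.uniform(5, 100, size=n),
})
# Made-up labeling rule so that the toy problem is learnable
y = (X["Sex"] == 0).astype(int)

# Hold back 25% of the labeled rows for validation
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)

# Accuracy on rows the model never saw during fitting
val_accuracy = accuracy_score(y_val, model.predict(X_val))
print(val_accuracy)
```

With the real data, `X` and `y` would be `X_train` and `Y_train`, and the held-back rows would play the role of a labeled test set.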

In [None]:
test=pd.read_csv("Titanic_TestDataset.csv")
test.shape

In [None]:
# Let's look at the test data
test.head()

### Select the columns needed to test the model and clean up the data

In [None]:
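# Take an explicit copy so that later column assignments modify X_test itself, not a view of test
X_test = test[['Pclass','Sex','Age','Fare']].copy()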

In [None]:
X_test["Sex"] = X_test["Sex"].apply(lambda sex:1 if sex=="male" else 0)

In [None]:
X_test.isnull().sum()

In [None]:
X_test["Age"] = X_test["Age"].fillna(X_test["Age"].median())

In [None]:
# Drop any rows that still contain a missing value in the remaining columns
X_test = X_test.dropna()
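
`dropna()` removes whole rows, so any passenger with a remaining missing value gets no prediction at all. If every passenger needs one, filling the gaps is a possible alternative. The sketch below contrasts the two on a made-up frame. The values are illustrative and not taken from the test file.

```python
import numpy as np
import pandas as pd

# Made-up feature frame with one missing Age and one missing Fare
X = pd.DataFrame({
    "Age":  [30.0, np.nan, 40.0],
    "Fare": [7.25, 8.05, np.nan],
})

# Option 1: drop every row that has at least one missing value
dropped = X.dropna()

# Option 2: fill each column's gaps with that column's median
filled = X.fillna(X.median())

print(len(dropped))
print(filled)
```

Dropping keeps only fully observed rows, while filling keeps all rows at the cost of introducing estimated values.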

In [None]:
X_test.isnull().sum()

In [None]:
X_test.head(5)

## Predict survival for Test dataset - "Batch mode"

In [None]:
# Predict class labels for samples in test dataset
Y_predicted = titanicmodel.predict(X_test)

In [None]:
Y_predicted[2:15]

In [None]:
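# Attach the predictions as a new column for inspection. X_test now has five columns,
# so it can no longer be passed to titanicmodel.predict() without dropping SurvivedPred first.
X_test['SurvivedPred'] = Y_predicted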

In [None]:
X_test.head(15)

In [None]:
X_test.sort_values(["Sex", "SurvivedPred"], ascending=False)
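
Writing the predictions into `X_test` changes the feature frame itself. One way to keep the features reusable is `DataFrame.assign`, which returns a new frame and leaves the original untouched. The sketch below uses made-up rows and placeholder labels. `features`, `predicted` and `results` are hypothetical names.

```python
import pandas as pd

# Made-up feature rows and placeholder labels standing in for model output
features = pd.DataFrame(
    {"Pclass": [3, 1], "Sex": [1, 0], "Age": [40.0, 35.0], "Fare": [7.98, 53.1]}
)
predicted = [0, 1]

# assign() returns a copy with the extra column; features is not modified
results = features.assign(SurvivedPred=predicted)

print(list(features.columns))
print(results.sort_values(["Sex", "SurvivedPred"], ascending=False))
```

The `results` frame can be sorted and inspected freely, while `features` keeps exactly the columns the model was trained on.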