# TITANIC #

## EXPERIMENT 3-4 ##

### Logistic Regression Model ###

*Anirudh Roy*

*UID: 19BCS6136*

*Date: 22-02-21*

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

# 1. Collect and understand the data

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd
import seaborn as sns

In [3]:
titanic = pd.read_csv(r"G:\4TH SEMESTER\train.csv")

In [4]:
titanic.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [5]:
titanic.shape

(891, 12)

Variable Description
---
Survived: Survived (1) or died (0);  this is the target variable  
Pclass: Passenger's class (1st, 2nd or 3rd class)    
Name: Passenger's name  
Sex: Passenger's sex  
Age: Passenger's age  
SibSp: Number of siblings/spouses aboard  
Parch: Number of parents/children aboard  
Ticket: Ticket number  
Fare: Fare  
Cabin: Cabin  
Embarked: Port of embarkation

In [6]:
titanic.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [7]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


# 2. Process the Data
Categorical variables must be converted to numerical variables before the model can be fitted.

### Transform the Embarkment port

There are three ports: C: Cherbourg, Q: Queenstown, S: Southampton

In [8]:
ports = pd.get_dummies(titanic.Embarked , prefix='Embarked')
ports.head()

Unnamed: 0,Embarked_C,Embarked_Q,Embarked_S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1


Now the feature Embarked (a category) has been transformed into three binary features.  
E.g. Embarked_C = 1 means the passenger embarked at Cherbourg, 0 means they did not.  
Finally, the three new binary features replace the original one in the data frame:

In [9]:
titanic = titanic.join(ports)
titanic.drop(['Embarked'], axis=1, inplace=True)
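
One detail worth checking: `titanic.info()` reports only 889 non-null values for Embarked. By default `pd.get_dummies` does not create a column for missing values, so those two passengers end up with zeros in all three dummy columns. A minimal sketch on a made-up four-row Series (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy Series with one missing port (illustrative values only)
embarked = pd.Series(["S", "C", None, "Q"], name="Embarked")

dummies = pd.get_dummies(embarked, prefix="Embarked").astype(int)

# Number of active dummy columns per row: 1 for a known port, 0 for a missing one
active = dummies.sum(axis=1).tolist()
print(dummies)
print(active)
```

Passing `dummy_na=True` to `pd.get_dummies` would add an explicit indicator column for the missing values instead.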

### Transform the gender feature
This transformation is easier because the feature is already binary (male or female, this was 1912).
There is no need to create separate dummy columns; a mapping is enough:

In [10]:
ports = pd.get_dummies(titanic.Sex, prefix='Sex', drop_first=True)  # alternative one-column encoding; the result is not used below

In [11]:
titanic.Sex = titanic.Sex.map({'male':0, 'female':1})
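
The `get_dummies(..., drop_first=True)` call and the `map` call encode the same information with opposite polarity: the dummy column marks males, while the mapping marks females. A small sketch on toy values (not the real data):

```python
import pandas as pd

sex = pd.Series(["male", "female", "female", "male"], name="Sex")  # toy values

# Option 1: explicit mapping, as used in this notebook (female -> 1)
mapped = sex.map({"male": 0, "female": 1})

# Option 2: one dummy column; drop_first removes the alphabetically first level ("female")
dummy = pd.get_dummies(sex, prefix="Sex", drop_first=True).astype(int)

print(mapped.tolist())             # female flagged as 1
print(dummy["Sex_male"].tolist())  # male flagged as 1
```

With `map`, any value that is missing from the dictionary becomes NaN, so it is worth checking for unexpected labels first.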

In [12]:
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_C,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.2500,,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.9250,,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1000,C123,0,0,1
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.0500,,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",0,27.0,0,0,211536,13.0000,,0,0,1
887,888,1,1,"Graham, Miss. Margaret Edith",1,19.0,0,0,112053,30.0000,B42,0,0,1
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",1,,1,2,W./C. 6607,23.4500,,0,0,1
889,890,1,1,"Behr, Mr. Karl Howell",0,26.0,0,0,111369,30.0000,C148,1,0,0


## Extract the target variable
Create an X data frame with the input features and a y Series with the target (Survived).

In [13]:
y = titanic.Survived.copy() # copy “y” column values out

In [14]:
X = titanic.drop(['Survived'], axis=1) # then, drop y column

### Drop not so important features
For the first model, we ignore the free-text and identifier features, which are unlikely to add much signal without further processing.

In [15]:
X.drop(['Cabin'], axis=1, inplace=True) 

In [16]:
X.drop(['Ticket'], axis=1, inplace=True)

In [17]:
X.drop(['Name'], axis=1, inplace=True) 
X.drop(['PassengerId'], axis=1, inplace=True)

In [18]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Pclass      891 non-null    int64  
 1   Sex         891 non-null    int64  
 2   Age         714 non-null    float64
 3   SibSp       891 non-null    int64  
 4   Parch       891 non-null    int64  
 5   Fare        891 non-null    float64
 6   Embarked_C  891 non-null    uint8  
 7   Embarked_Q  891 non-null    uint8  
 8   Embarked_S  891 non-null    uint8  
dtypes: float64(2), int64(4), uint8(3)
memory usage: 44.5 KB


All features are now numeric, ready for regression.  
But there is still some processing to do.

## Check if there are any missing values

In [19]:
X.isnull().values.any()

True

In [20]:
X[pd.isnull(X).any(axis=1)]  # check which rows have NaNs

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S
5,3,0,,0,0,8.4583,0,1,0
17,2,0,,0,0,13.0000,0,0,1
19,3,1,,0,0,7.2250,1,0,0
26,3,0,,0,0,7.2250,1,0,0
28,3,1,,0,0,7.8792,0,1,0
...,...,...,...,...,...,...,...,...,...
859,3,0,,0,0,7.2292,1,0,0
863,3,1,,8,2,69.5500,0,0,1
868,3,0,,0,0,9.5000,0,0,1
878,3,0,,0,0,7.8958,0,0,1
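
A per-column count locates missing values more directly than scanning rows. The sketch below uses a made-up three-row frame; on the real `X` the equivalent call would be `X.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Toy frame with missing ages (illustrative values only)
toy = pd.DataFrame({
    "Pclass": [3, 2, 3],
    "Age": [np.nan, 30.0, np.nan],
    "Fare": [8.4583, 13.0, 7.225],
})

missing_per_column = toy.isnull().sum()

# Show only the columns that actually contain NaNs
print(missing_per_column[missing_per_column > 0])
```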


True: there are missing values (NaN) in the data, and a quick look reveals that they are all in the Age feature.  
One option is to remove the feature; another is to fill the missing values with a fixed number or with the average age.

In [21]:
X['Age'] = X['Age'].fillna(X['Age'].mean())  # replace NaN with average age (assignment avoids a chained in-place fill)

In [22]:
X.isnull().values.any()

False

Now all missing values have been filled.  
Logistic regression would otherwise fail on missing values.
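
Note that the mean above is computed over all rows, before the train/validation split, so validation ages contribute slightly to the imputed value. A stricter variant, sketched here on toy data with the hypothetical names `train_ages` and `valid_ages`, computes the mean on the training portion only and reuses it for the validation portion:

```python
import numpy as np
import pandas as pd

train_ages = pd.Series([22.0, 38.0, np.nan, 35.0])  # toy training ages
valid_ages = pd.Series([np.nan, 54.0])              # toy validation ages

train_mean = train_ages.mean()  # NaN is skipped by default

train_filled = train_ages.fillna(train_mean)
valid_filled = valid_ages.fillna(train_mean)  # reuse the training statistic

print(train_mean)
print(valid_filled.tolist())
```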

## Split the dataset into training and validation

The **training** set will be used to build the machine learning model. The model is fitted on features such as the passengers’ gender and class, together with the known survival flag.

The **validation** set is used to see how well the model performs on unseen data. For each passenger in the validation set, I use the trained model to predict whether or not they survived the sinking of the Titanic, and then compare the prediction with the actual survival flag.

In [23]:
from sklearn.model_selection import train_test_split
# 80% go into the training set, 20% into the validation set
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=7)
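
With 891 rows and `test_size=0.2`, this split yields 712 training and 179 validation rows. Since only about 38% of passengers survived, one optional refinement (not used in this notebook) is the `stratify` argument, which keeps the survival rate similar in both parts. A sketch on synthetic labels with the same class balance as the dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 891 rows, 342 survivors (the dataset's 38% rate)
X_toy = np.arange(891).reshape(-1, 1)
y_toy = np.array([1] * 342 + [0] * 549)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=7, stratify=y_toy
)

print(len(X_tr), len(X_va))
print(round(y_tr.mean(), 3), round(y_va.mean(), 3))
```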

# 3. Modelling

## Get a baseline
A baseline is always useful to see whether the trained model behaves significantly better than an easy-to-obtain reference, such as a random guess or a simple heuristic like "all and only female passengers survived". In this case, after quickly looking at the training dataset, where the survival outcome is present, I am going to use the following:

In [24]:
def simple_heuristic(titanicDF):
    '''
    Predict whether each passenger survived or perished.
    The passenger is predicted to have survived if:
    1) the passenger is female, or
    2) the passenger is in first class (high socioeconomic status) AND under 18
    '''

    predictions = [] # a list
    
    for passenger_index, passenger in titanicDF.iterrows():
          
        if passenger['Sex'] == 1:
                    # female
            predictions.append(1)  # survived
        elif passenger['Age'] < 18 and passenger['Pclass'] == 1:
                    # male but minor and rich
            predictions.append(1)  # survived
        else:
            predictions.append(0) # everyone else perished

    return predictions
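
The same rule can also be expressed without an explicit loop, using boolean masks. This is a sketch with the hypothetical name `simple_heuristic_vectorized`; it assumes the `Sex`, `Age` and `Pclass` columns are encoded as above.

```python
import pandas as pd

def simple_heuristic_vectorized(df):
    """Predict 1 (survived) for females and for first-class passengers under 18."""
    is_female = df["Sex"] == 1
    is_rich_minor = (df["Age"] < 18) & (df["Pclass"] == 1)
    return (is_female | is_rich_minor).astype(int)

# Toy passengers: a woman, a first-class boy, a third-class boy, a first-class man
toy = pd.DataFrame({
    "Sex": [1, 0, 0, 0],
    "Age": [30, 10, 10, 40],
    "Pclass": [3, 1, 3, 1],
})
toy_predictions = simple_heuristic_vectorized(toy)
print(toy_predictions.tolist())
```

Because the result is a Series aligned with the input index, its accuracy against a label Series such as `y_valid` can be computed as `(predictions == y_valid).mean()`.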

Let's see how this simple algorithm will behave on the validation dataset and we will keep that number as our baseline:

In [25]:
simplePredictions = simple_heuristic(X_valid)
correct = sum(simplePredictions == y_valid)
print ("Baseline: ", correct/len(y_valid))

Baseline:  0.7318435754189944


Baseline: a simple algorithm predicts correctly 73% of validation cases.  
Now let's see if the model can do better.

##  Logistic Regression

We will use a simple logistic regression that takes all the features in X and fits a linear decision boundary.
This is done using the LogisticRegression class in scikit-learn.

In [26]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

In [27]:
model.fit(X_train, y_train)

LogisticRegression()
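
Because warnings are silenced at the top of this notebook, a solver convergence warning would go unnoticed here. A common precaution, shown as an optional sketch on synthetic data rather than as part of this experiment, is to standardise the features and raise `max_iter`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales (loosely like Age and Fare)
X_toy = np.column_stack([rng.normal(30, 14, 200), rng.normal(32, 50, 200)])
y_toy = (X_toy[:, 1] > 32).astype(int)  # label depends on the second feature only

# Scaling happens inside the pipeline, so it is fitted on the training data only
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_toy, y_toy)
toy_accuracy = scaled_model.score(X_toy, y_toy)
print(toy_accuracy)
```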

# 4. Evaluate the model

In [28]:
model.score(X_train, y_train)

0.8089887640449438

In [29]:
model.score(X_valid, y_valid)

0.7541899441340782

Two things:
- the score on the training set (0.81) is higher than on the validation set (0.75), which may indicate some overfitting, i.e. the model may not generalise well, e.g. to other shipwrecks.
- the score on the validation set is better than the baseline, so the model adds some value at minimal cost (logistic regression is not computationally expensive, at least not for smaller datasets).
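
To put the two validation scores in perspective, it helps to convert them back to passenger counts. The sketch assumes 179 validation rows, which is what `test_size=0.2` gives for 891 rows; the standard-error formula is the usual binomial approximation and is an addition, not part of the original analysis.

```python
import math

n_valid = 179                      # assumed size of the validation set
baseline_acc = 0.7318435754189944  # printed baseline score
model_acc = 0.7541899441340782     # printed validation score

baseline_correct = round(baseline_acc * n_valid)
model_correct = round(model_acc * n_valid)
extra_correct = model_correct - baseline_correct

# Binomial standard error of an accuracy estimated on n_valid cases
std_err = math.sqrt(model_acc * (1 - model_acc) / n_valid)

print(baseline_correct, model_correct, extra_correct)
print(round(std_err, 3))
```

On this particular split the model gets four more passengers right than the baseline, a gap smaller than one standard error, so the improvement should be read with some caution.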

An advantage of logistic regression (e.g. against a neural network) is that it's easily interpretable.  It can be written as a math formula:

In [30]:
model.intercept_ # the fitted intercept

array([1.71949523])

In [31]:
model.coef_  # the fitted coefficients

array([[-1.00639809e+00,  2.79552733e+00, -4.29591281e-02,
        -3.69125751e-01,  2.09471761e-03,  1.01140873e-03,
         8.30808328e-01,  3.85795216e-01,  2.44784579e-01]])

Which means that the formula is:  
$$ P(\text{survive}) = \frac{1}{1+e^{-\text{logit}}} $$  
  
where the logit is:  
  
$$ \text{logit} = \beta_{0} + \beta_{1}\cdot x_{1} + \dots + \beta_{n}\cdot x_{n} $$ 
  
where $\beta_{0}$ is the model intercept and the other beta parameters are the model coefficients from above, each multiplied by the related feature:  
  
$$ \text{logit} = 1.7195 - 1.0064 \cdot \text{Pclass} + \dots + 0.2448 \cdot \text{Embarked}_{S} $$ 
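
As a worked example, the formula can be evaluated by hand for a single passenger, using the intercept and coefficients printed by `model.intercept_` and `model.coef_` above. The feature order is assumed to follow the column order of `X`.

```python
import math

features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare",
            "Embarked_C", "Embarked_Q", "Embarked_S"]
intercept = 1.71949523
coefs = [-1.00639809e+00, 2.79552733e+00, -4.29591281e-02,
         -3.69125751e-01, 2.09471761e-03, 1.01140873e-03,
         8.30808328e-01, 3.85795216e-01, 2.44784579e-01]

# First row of the dataset: Mr. Owen Harris Braund (male, 22, third class, Southampton)
passenger = [3, 0, 22.0, 1, 0, 7.25, 0, 0, 1]

# logit = intercept + sum of coefficient * feature value
logit = intercept + sum(b * x for b, x in zip(coefs, passenger))
p_survive = 1 / (1 + math.exp(-logit))

print(round(logit, 3), round(p_survive, 3))
```

The model assigns this passenger a survival probability below 10%, which agrees with his recorded outcome (Survived = 0).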

# 5. Iterate on the model
The model could be improved, for example transforming the excluded features above or creating new ones (e.g. I could extract titles from the names which could be another indication of the socio-economic status).
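
As an illustration of the title idea, names of the form "Surname, Title. Given names" can be parsed with a regular expression. The pattern below is an assumption that fits the names shown in this notebook; unusual names may need extra handling.

```python
import pandas as pd

# A few names as they appear in the dataset
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Nasser, Mrs. Nicholas (Adele Achem)",
])

# Capture the text between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())
```

The extracted titles could then be one-hot encoded in the same way as the embarkation port.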

The correlation matrix may give us an understanding of which variables are important.

In [33]:
titanic.corr()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S
PassengerId,1.0,-0.005007,-0.035144,-0.042939,0.036847,-0.057527,-0.001652,0.012658,-0.001205,-0.033606,0.022148
Survived,-0.005007,1.0,-0.338481,0.543351,-0.077221,-0.035322,0.081629,0.257307,0.16824,0.00365,-0.15566
Pclass,-0.035144,-0.338481,1.0,-0.1319,-0.369226,0.083081,0.018443,-0.5495,-0.243292,0.221009,0.08172
Sex,-0.042939,0.543351,-0.1319,1.0,-0.093254,0.114631,0.245489,0.182333,0.082853,0.074115,-0.125722
Age,0.036847,-0.077221,-0.369226,-0.093254,1.0,-0.308247,-0.189119,0.096067,0.036261,-0.022405,-0.032523
SibSp,-0.057527,-0.035322,0.083081,0.114631,-0.308247,1.0,0.414838,0.159651,-0.059528,-0.026354,0.070941
Parch,-0.001652,0.081629,0.018443,0.245489,-0.189119,0.414838,1.0,0.216225,-0.011069,-0.081228,0.063036
Fare,0.012658,0.257307,-0.5495,0.182333,0.096067,0.159651,0.216225,1.0,0.269335,-0.117216,-0.166603
Embarked_C,-0.001205,0.16824,-0.243292,0.082853,0.036261,-0.059528,-0.011069,0.269335,1.0,-0.148258,-0.778359
Embarked_Q,-0.033606,0.00365,0.221009,0.074115,-0.022405,-0.026354,-0.081228,-0.117216,-0.148258,1.0,-0.496624
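
To read such a matrix more easily, one can pull out the `Survived` column and rank the other features by absolute correlation. The sketch below uses only three columns from the first six rows shown earlier, so its numbers differ from the full table.

```python
import pandas as pd

# Survived, Sex (female = 1) and Pclass for the first six passengers
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Sex":      [0, 1, 1, 1, 0, 0],
    "Pclass":   [3, 1, 3, 1, 3, 3],
})

# Correlation of each feature with the target, strongest first
ranked = (
    sample.corr()["Survived"]
    .drop("Survived")
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```

In the full matrix above, Sex (0.54) and Pclass (-0.34) are likewise the features most strongly correlated with survival.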


The resulting score is **0.75119**  
Note that the score on the validation set has been a good predictor!