<h1> 3. Predicting Titanic Survival Using Logistic Regression </h1>
<h3><b> Preprocessing Steps:</b></h3>
<ul>
    <li>Handle missing values (e.g., fill missing ages with median).</li>
    <li>Encode categorical variables (e.g., one-hot encoding for embarked and gender).</li>
    <li>Standardize numerical features.</li>
</ul>
<h3><b> Task:</b> Implement logistic regression to predict survival on the Titanic and evaluate the model using ROC-AUC. </h3>



In [102]:
# Importing Libraries
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler   # For one-hot encoding, standardization
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

In [79]:
# Loading the dataset
titanic_dataset = pd.read_csv('Datasets\\Titanic.csv')
print(titanic_dataset.shape, '\n')
titanic_dataset.head()

(891, 12) 



,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [80]:
# Printing the basic statistics of the data
titanic_dataset.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [81]:
# Printing information of dataset
titanic_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


<h2>Data Preprocessing</h2>

<h3>1. Handling Missing Values</h3>

In [82]:
# Checking the missing value in the data
titanic_dataset.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

-> 'Age' has 177 missing values and 'Embarked' has 2, so we impute them with the median and the mode, respectively. 'Cabin' is missing in 687 of 891 rows (more than 75%), so it is better to remove this feature entirely.
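-> As a quick check of the proportions behind this decision, the sketch below recomputes the share of missing values per feature. The counts are copied from the output above; the dictionary and variable names are only illustrative.

```python
# Missing-value counts copied from the isnull().sum() output above
n_rows = 891
missing_counts = {'Age': 177, 'Cabin': 687, 'Embarked': 2}

# Share of missing values per feature, in percent
missing_pct = {name: 100 * count / n_rows for name, count in missing_counts.items()}

for name, pct in missing_pct.items():
    print(f'{name}: {pct:.1f}% missing')

# On the DataFrame itself, the same figures should come from:
#   titanic_dataset.isnull().mean() * 100
```

Roughly a fifth of 'Age' is missing, which is still reasonable to impute, while 'Cabin' is mostly empty.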

In [83]:
# Imputing the 'Age' feature with its median value
titanic_dataset.fillna({'Age': titanic_dataset['Age'].median()}, inplace=True)

# Imputing the 'Embarked' feature with its mode (most frequent value)
titanic_dataset.fillna({'Embarked': titanic_dataset['Embarked'].mode()[0]}, inplace=True)

# Removing the 'Cabin' feature
titanic_dataset.drop('Cabin', axis=1, inplace=True)

In [84]:
# Checking missing values after imputation
print(titanic_dataset.shape, '\n')
titanic_dataset.isnull().sum()

(891, 11) 



PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

-> So the missing values have been imputed using basic imputation techniques.

<h3>2. Encoding Categorical Variables</h3>

In [85]:
# Identifying the non-numeric categorical variables in the dataset
print('Categorical Variables before removing\n', titanic_dataset.select_dtypes(include=['object']).columns)

# Name and Ticket are not useful for determining the survival of passengers, so we remove these columns
titanic_dataset.drop(['Name', 'Ticket'], axis=1, inplace=True)

Categorical Variables before removing
 Index(['Name', 'Sex', 'Ticket', 'Embarked'], dtype='object')


In [86]:
# After removing
categorical_features = titanic_dataset.select_dtypes(include=['object']).columns
print('\nCategorical Variables after removing\n', categorical_features)

# Printing categories in each feature
for feature in categorical_features:
    print('\nFeature:', feature)
    print(titanic_dataset[feature].value_counts())


Categorical Variables after removing
 Index(['Sex', 'Embarked'], dtype='object')

Feature: Sex
Sex
male      577
female    314
Name: count, dtype: int64

Feature: Embarked
Embarked
S    646
C    168
Q     77
Name: count, dtype: int64


In [87]:
# Applying one hot encoding
encoder = OneHotEncoder(sparse_output=False, drop='first')

# Encoding the 'Sex' and 'Embarked' feature
encoded_features = encoder.fit_transform(titanic_dataset[categorical_features])
encoded_features_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(input_features=categorical_features))

# Drop the original categorical features and concatenate the encoded DataFrame
titanic_dataset = titanic_dataset.drop(categorical_features, axis=1)
titanic_dataset = pd.concat([titanic_dataset, encoded_features_df], axis=1)

titanic_dataset.head()


,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S
0,1,0,3,22.0,1,0,7.25,1.0,0.0,1.0
1,2,1,1,38.0,1,0,71.2833,0.0,0.0,0.0
2,3,1,3,26.0,0,0,7.925,0.0,0.0,1.0
3,4,1,1,35.0,1,0,53.1,0.0,0.0,1.0
4,5,0,3,35.0,0,0,8.05,1.0,0.0,1.0


In [88]:
# Checking the datatypes of each feature
titanic_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Age          891 non-null    float64
 4   SibSp        891 non-null    int64  
 5   Parch        891 non-null    int64  
 6   Fare         891 non-null    float64
 7   Sex_male     891 non-null    float64
 8   Embarked_Q   891 non-null    float64
 9   Embarked_S   891 non-null    float64
dtypes: float64(5), int64(5)
memory usage: 69.7 KB


-> Each object feature has been replaced by binary indicator columns. Because drop='first' is used, a feature with k categories yields k-1 columns ('Sex' gives Sex_male; 'Embarked' gives Embarked_Q and Embarked_S), and the new columns have dtype float64. So, the categorical variables have been successfully encoded.

<h3>3. Standardize numerical features</h3>

In [92]:
# Separating the numerical features
numerical_features = titanic_dataset.drop(['Sex_male', 'Embarked_Q', 'Embarked_S'], axis=1)   # Dropping encoded features
numerical_features.drop(['PassengerId', 'Survived'], axis=1, inplace=True)   # PassengerId is not a useful feature and survived is target variable
numerical_features.head()

,Pclass,Age,SibSp,Parch,Fare
0,3,22.0,1,0,7.25
1,1,38.0,1,0,71.2833
2,3,26.0,0,0,7.925
3,1,35.0,1,0,53.1
4,3,35.0,0,0,8.05


In [93]:
# Applying the standardization (z scores method)
scaler = StandardScaler()
numerical_features_standardized = scaler.fit_transform(numerical_features)

# Converting the standardized features to dataframe
numerical_features_standardized = pd.DataFrame(numerical_features_standardized, columns=numerical_features.columns)
numerical_features_standardized

,Pclass,Age,SibSp,Parch,Fare
0,0.827377,-0.565736,0.432793,-0.473674,-0.502445
1,-1.566107,0.663861,0.432793,-0.473674,0.786845
2,0.827377,-0.258337,-0.474545,-0.473674,-0.488854
3,-1.566107,0.433312,0.432793,-0.473674,0.420730
4,0.827377,0.433312,-0.474545,-0.473674,-0.486337
...,...,...,...,...,...
886,-0.369365,-0.181487,-0.474545,-0.473674,-0.386671
887,-1.566107,-0.796286,-0.474545,-0.473674,-0.044381
888,0.827377,-0.104637,0.432793,2.008933,-0.176263
889,-1.566107,-0.258337,-0.474545,-0.473674,-0.044381


In [94]:
# Printing the basic statistics of the standardized data
numerical_features_standardized.describe().round(2)

,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,891.0,891.0
mean,-0.0,0.0,0.0,0.0,0.0
std,1.0,1.0,1.0,1.0,1.0
min,-1.57,-2.22,-0.47,-0.47,-0.65
25%,-0.37,-0.57,-0.47,-0.47,-0.49
50%,0.83,-0.1,-0.47,-0.47,-0.36
75%,0.83,0.43,0.43,-0.47,-0.02
max,0.83,3.89,6.78,6.97,9.67


-> So, the features have been standardized with StandardScaler. Each feature now has a mean of approximately 0 and a standard deviation of approximately 1.
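-> Standardization replaces each value x with z = (x - mean) / std. The sketch below checks this by hand against StandardScaler, using the first five 'Age' values from the table above as a toy input (so the numbers differ from the full-dataset result). It also shows that StandardScaler uses the population standard deviation (ddof=0), whereas describe() reports the sample standard deviation (ddof=1), which is why the std row above is only 1.0 after rounding.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five 'Age' values from the dataset, as a single-column array
ages = np.array([[22.0], [38.0], [26.0], [35.0], [35.0]])

# Manual z-scores using the population standard deviation (ddof=0)
manual_z = (ages - ages.mean(axis=0)) / ages.std(axis=0)

# The same transformation with StandardScaler
scaled = StandardScaler().fit_transform(ages)
print(np.allclose(manual_z, scaled))

# Sample standard deviation (ddof=1) of the scaled values is sqrt(n / (n - 1)),
# slightly above 1; for n = 891 this is about 1.0006
sample_std = scaled.std(axis=0, ddof=1)[0]
print(sample_std)
```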

In [95]:
# Replacing the original numerical columns with their standardized versions
titanic_dataset.drop(columns=numerical_features.columns, inplace=True)
titanic_dataset = pd.concat([titanic_dataset, numerical_features_standardized], axis=1)
titanic_dataset.head()

,PassengerId,Survived,Sex_male,Embarked_Q,Embarked_S,Pclass,Age,SibSp,Parch,Fare
0,1,0,1.0,0.0,1.0,0.827377,-0.565736,0.432793,-0.473674,-0.502445
1,2,1,0.0,0.0,0.0,-1.566107,0.663861,0.432793,-0.473674,0.786845
2,3,1,0.0,0.0,1.0,0.827377,-0.258337,-0.474545,-0.473674,-0.488854
3,4,1,0.0,0.0,1.0,-1.566107,0.433312,0.432793,-0.473674,0.42073
4,5,0,1.0,0.0,1.0,0.827377,0.433312,-0.474545,-0.473674,-0.486337


-> Now the dataset has been completely preprocessed and is ready for model training.
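-> Note that the median, mode, and scaling statistics above are computed on all 891 rows, so when the data is later split into train and test sets, the test rows have already influenced the preprocessing. One common alternative is to wrap the same steps in a scikit-learn Pipeline so they are fitted on the training split only. The following is a minimal sketch of that idea, not the approach used in this notebook: it runs on synthetic stand-in data (generated at random, not the real Titanic file), and the names demo, preprocess, and model are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for a few Titanic columns (NOT the real data)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'Pclass': rng.integers(1, 4, n),
    'Age': rng.normal(30, 14, n).clip(1, 80),
    'Fare': rng.exponential(30, n),
    'Sex': rng.choice(['male', 'female'], n),
    'Embarked': rng.choice(['S', 'C', 'Q'], n),
})
demo.loc[rng.choice(n, 30, replace=False), 'Age'] = np.nan  # simulate missing ages
# Synthetic target loosely tied to 'Sex' so the model has some signal to learn
demo['Survived'] = ((demo['Sex'] == 'female') ^ (rng.random(n) < 0.2)).astype(int)

numeric_cols = ['Pclass', 'Age', 'Fare']
categorical_cols = ['Sex', 'Embarked']

# Same three preprocessing steps as above, expressed as one transformer
preprocess = ColumnTransformer([
    ('num', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]), numeric_cols),
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(drop='first')),
    ]), categorical_cols),
])

model = Pipeline([('prep', preprocess), ('clf', LogisticRegression())])

X_demo = demo.drop(columns='Survived')
y_demo = demo['Survived']
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# fit() learns medians, modes, means, and stds from the training rows only
model.fit(X_tr, y_tr)
demo_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print('ROC-AUC on synthetic data:', demo_auc)
```

The printed score only reflects the synthetic data and says nothing about the Titanic results; the point of the sketch is where the preprocessing statistics are learned.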

<h2>Model Training</h2>

In [98]:
# Separating features and target variable
X = titanic_dataset.drop(['PassengerId', 'Survived'], axis=1)
Y = titanic_dataset['Survived']

# Splitting the dataset into train and test data in 80/20 ratio
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [99]:
# Initializing and fitting the logistic regression model
lr_model = LogisticRegression()
lr_model.fit(X_train, Y_train)

In [111]:
# Predicting the probability of the positive class (Survived = 1)
Y_pred = lr_model.predict_proba(X_test)[:,1]

<h2>Model Evaluation</h2>

<h3>1. ROC-AUC Score</h3>

In [112]:
# Calculating the ROC-AUC score for the model
roc_auc = roc_auc_score(Y_test, Y_pred)
print('ROC-AUC Score:', roc_auc)

ROC-AUC Score: 0.8827541827541827


<p>-> <b>predict_proba</b> is used instead of <b>predict</b> to obtain probability scores needed for calculating the <b>ROC-AUC</b>, which measures the model's ability to discriminate between classes across various thresholds. The <b>ROC-AUC</b> value quantifies the overall performance, with a higher score indicating better model discrimination between positive and negative classes.</p>

<p>-> If we use <b>predict</b> instead of <b>predict_proba</b>, the ROC-AUC score drops from 0.883 to 0.800, because hard 0/1 labels correspond to a single threshold and discard the ranking information contained in the probabilities. </p>
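-> The effect of passing hard labels instead of probabilities can be reproduced on a small example. The labels and probabilities below are hypothetical values chosen for illustration, and the 0.5 cut-off mimics what predict does by default for a binary logistic regression.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities (illustrative only)
y_true = np.array([0, 0, 1, 1, 0, 1])
proba = np.array([0.10, 0.40, 0.35, 0.80, 0.55, 0.60])

# Hard labels, as a 0.5 cut-off would produce them
hard_labels = (proba >= 0.5).astype(int)

auc_from_proba = roc_auc_score(y_true, proba)         # uses the full ranking
auc_from_labels = roc_auc_score(y_true, hard_labels)  # only two distinct scores

print('AUC from probabilities:', auc_from_proba)
print('AUC from hard labels:  ', auc_from_labels)
```

In this toy case the hard labels give a lower score because all predictions on the same side of the cut-off become ties. The size and direction of the gap depend on the data.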

-> A ROC-AUC score of 0.88 means that a randomly chosen survivor receives a higher predicted probability than a randomly chosen non-survivor about 88% of the time. This indicates strong discriminatory power and suggests that the model ranks passengers well by survival probability on this test split.
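-> ROC-AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; pairwise_auc is a hypothetical helper written for illustration, and the input values are made up.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores, including one tie between a positive and a negative
toy_labels = [1, 0, 1, 0, 1]
toy_scores = [0.9, 0.3, 0.6, 0.6, 0.2]

toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Here 3.5 of the 6 positive/negative pairs are ranked correctly. Calling roc_auc_score on the same inputs should give the same value, since both compute the same quantity.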

<hr>