# Logistic Regression Example - Titanic Dataset

Predict survival from title, passenger class, gender, port of embarkation and fare band

Steps
* Load data into pandas
* Clean data: select the columns of interest and remove any rows with missing values
* Encode features (convert string columns into numbers, as the model requires): one-hot encoding for the nominal columns, ordinal encoding for passenger class
* Encode the label column (Died -> 0, Survived -> 1)
* Split data into training and test sets
* Build a logistic regression model, fit it on the training data and predict on the test data
* Evaluate model with a confusion matrix

In [48]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, classification_report

In [49]:
titanic_url = 'https://raw.githubusercontent.com/MarkWilcock/CourseDatasets/main/Misc%20Datasets/Titanic%20Passenger.csv'
df = pd.read_csv(titanic_url) # read the data
df.head(2) # show the first 2 rows

Unnamed: 0,Passenger Id,Survival,Surname,Other Names,Title,Passenger Class,Gender,Embarked,FareBand,FamilySize,Age (bins),Age,Adult Or Child,Is Age Missing
0,1,Died,Brewe,Arthur Jackson,Dr,1st,male,Cherbourg,30 - above,1,,,Not Known,Missing
1,1,Survived,Fleming,Margaret,Miss,1st,female,Cherbourg,30 - above,1,,,Not Known,Missing


In [50]:
df_slim = df.loc[:, ['Survival', 'Title','Passenger Class','Gender', 'Embarked', 'FareBand']]
df_slim.columns = ['survival', 'title','pass_class','gender', 'embarked', 'fareband']
df_slim.head(2)

Unnamed: 0,survival,title,pass_class,gender,embarked,fareband
0,Died,Dr,1st,male,Cherbourg,30 - above
1,Survived,Miss,1st,female,Cherbourg,30 - above
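
The steps at the top mention removing rows with missing values. A minimal sketch of how that could look is below. It uses a hypothetical toy frame (`toy`, with made-up values) rather than the real dataset, so the counts it prints say nothing about the Titanic data itself.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_slim; values are illustrative only
toy = pd.DataFrame({
    'survival': ['Died', 'Survived', 'Died'],
    'embarked': ['Cherbourg', None, 'Southampton'],
    'fareband': ['30 - above', '0 - 10', '10 - 20'],
})

# Count missing values per column before deciding what to drop
missing_counts = toy.isna().sum()
print(missing_counts)

# Drop every row that has at least one missing value, then renumber the index
toy_clean = toy.dropna(axis=0, how='any').reset_index(drop=True)
print(toy_clean)
```

On the real data the same two calls would be applied to `df_slim` before encoding, since the encoders used later do not accept missing values in string columns without extra configuration.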


Encode the nominal categorical columns with a one-hot encoder and the passenger class with an ordinal encoder

In [51]:
#category_columns = ['title', 'gender', 'embarked', 'fareband', 'pass_class']
category_columns = ['title', 'gender', 'embarked', 'fareband']
categorical_encoders = OneHotEncoder(sparse_output=False)

In [52]:
ordinal_columns =  ['pass_class']
pass_class_values = ['1st', '2nd', '3rd']
#fareband_values = ['0 - 10', '10 - 20', '20 - 30', '30 - above']
ordinal_encoders = OrdinalEncoder(categories=[pass_class_values]) 

In [53]:
ct = ColumnTransformer(transformers = [
        ('cat', categorical_encoders, category_columns),
        ('ord', ordinal_encoders, ordinal_columns)
        ], 
        remainder = 'drop')
ct.set_output(transform='pandas')
# X is the standard name for the transformed data of features (independent variables)
X = ct.fit_transform(df_slim)

In [54]:
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df_slim.loc[:, 'survival'])
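
`LabelEncoder` assigns codes in sorted order of the class names, which is why Died maps to 0 and Survived to 1. A quick way to confirm the mapping, sketched here on a hypothetical list of labels rather than the real column:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels; the real ones come from df_slim['survival']
labels = ['Died', 'Survived', 'Survived', 'Died']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original names, indexed by their numeric code
mapping = {str(name): int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns numeric predictions back into readable names
decoded = encoder.inverse_transform([0, 1])
print(decoded)
```

The same `inverse_transform` call can be applied to the model's predictions later on to report them as Died or Survived instead of 0 or 1.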

In [55]:
# Split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
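
It can be worth checking the size and class balance of each part after splitting. The sketch below does this on hypothetical toy arrays (`X_toy`, `y_toy`). It also passes `stratify`, an optional argument that the cell above does not use, to keep the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 70 of class 0 and 30 of class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 70 + [1] * 30)

# stratify=y_toy is a variation on the notebook's call, not what it uses
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))      # rows in each part
print(y_tr.mean(), y_te.mean())  # share of class 1 in each part
```

Without `stratify` the split is purely random, so the share of survivors in the test set can drift from the share in the full data, especially on small datasets.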

In [56]:
# Build and fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
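
`predict` returns hard 0/1 labels. The fitted model also exposes class probabilities and coefficients, which can help when interpreting it. A minimal sketch on hypothetical toy data (`X_toy`, `y_toy`, `toy_model` are illustrative names, not part of the notebook):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one binary feature that mostly agrees with the label
X_toy = np.array([[0], [0], [0], [1], [1], [1], [0], [1]])
y_toy = np.array([0, 0, 0, 1, 1, 1, 1, 0])

toy_model = LogisticRegression()
toy_model.fit(X_toy, y_toy)

# One row per sample, one column per class, in the order of classes_
proba = toy_model.predict_proba(X_toy)
print(toy_model.classes_)
print(proba[:2])

# A positive coefficient means the feature raises the odds of class 1
print(toy_model.coef_, toy_model.intercept_)
```

On the real model, `model.coef_[0]` lines up with the columns of `X`, so pairing it with `X.columns` shows which encoded features push the prediction towards Survived.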


In [57]:
#  Evaluate using standard metrics
print('Classification Report\n',  classification_report(y_test, predictions))
print(f'f1 score\n {f1_score(y_test, predictions):3.3f}')


Classification Report
               precision    recall  f1-score   support

           0       0.85      0.85      0.85       114
           1       0.74      0.74      0.74        65

    accuracy                           0.81       179
   macro avg       0.79      0.79      0.79       179
weighted avg       0.81      0.81      0.81       179

f1 score
 0.738


Understand how well the model is performing with a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix). Rows are the actual classes (0 = Died, 1 = Survived) and columns are the predicted classes.


In [58]:
confusion_matrix(y_test, predictions)

array([[97, 17],
       [17, 48]], dtype=int64)
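
The figures in the classification report can be recovered from this matrix by hand. The sketch below copies the four counts shown above and recomputes accuracy, precision, recall and F1 for the Survived class, using scikit-learn's layout of rows as actual and columns as predicted.

```python
import numpy as np

# Counts copied from the confusion matrix output above
cm = np.array([[97, 17],
               [17, 48]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # of those predicted Survived, how many did
recall = tp / (tp + fn)           # of those who survived, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy  {accuracy:.3f}')
print(f'precision {precision:.3f}')
print(f'recall    {recall:.3f}')
print(f'f1        {f1:.3f}')
```

Because the model makes the same number of false positives and false negatives here (17 each), precision and recall for the Survived class coincide, and so F1 equals both.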