In [157]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [159]:
# Read data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
combine = [train_df, test_df]

## Analyze by describing data (initial inspection)

In [162]:
# Which features are available in the dataset?
print(train_df.columns.values)

['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']


In [164]:
# preview the data
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [166]:
train_df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


### Features Data Types
* Categorical: Survived, Sex, and Embarked. Ordinal: Pclass.
* Numerical (continuous): Age, Fare. Discrete: SibSp, Parch.
* Mixed data types: Ticket (numeric and alphanumeric), Cabin (alphanumeric).

In [169]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [171]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.1+ KB


### Which features contain blank, null or empty values?
* In the training set, Cabin, Age, and Embarked contain null values, in that order of severity.
* In the test set, Cabin, Age, and Fare are incomplete.
* Training set: Age has 714 non-null values out of 891, so 177 are missing. Cabin is mostly missing (204 non-null, so 687 missing). Embarked has 889 non-null, so 2 are missing.
* Test set: Age has 332 non-null values out of 418 (86 missing), Cabin has 91 non-null (327 missing), and Fare has 417 non-null (1 missing).
* These gaps need to be handled before modeling.
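
The counts above can also be computed directly rather than read off `info()`. The following is a minimal sketch on a made-up toy frame (the `toy` DataFrame and its values are assumptions for illustration, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# Count and share of missing values per column, largest first.
missing = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "share": toy.isnull().mean().round(2),
}).sort_values("missing", ascending=False)

print(missing)
```

Running the same two expressions on `train_df` and `test_df` should reproduce the numbers listed above.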

In [242]:
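# What is the distribution of numerical feature values across the samples?
# (This output was captured after the preprocessing steps below, so it includes
# FamilySize, IsAlone, numeric Sex, and the imputed Age values.)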
train_df.describe()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,FamilySize,IsAlone
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,0.383838,2.308642,0.647587,29.39413,0.523008,0.381594,32.204208,1.904602,0.602694
std,0.486592,0.836071,0.47799,13.270911,1.102743,0.806057,49.693429,1.613459,0.489615
min,0.0,1.0,0.0,0.42,0.0,0.0,0.0,1.0,0.0
25%,0.0,2.0,0.0,21.0,0.0,0.0,7.9104,1.0,0.0
50%,0.0,3.0,1.0,30.0,0.0,0.0,14.4542,1.0,1.0
75%,1.0,3.0,1.0,35.0,1.0,0.0,31.0,2.0,1.0
max,1.0,3.0,1.0,80.0,8.0,6.0,512.3292,11.0,1.0


In [246]:
survival_rate = train_df['Survived'].mean()
print(f"Survival rate: {survival_rate:.2%}")

Survival rate: 38.38%



### Observation
- The 891 training samples are about 40% of the actual number of passengers on board the Titanic (2,224).
- Around 38% of the samples survived, roughly in line with, though higher than, the actual survival rate of 32%.
- Fares varied widely, with a few passengers (<1%) paying as much as $512.

## Data Preprocessing and Feature Engineering
- In this section, we will clean the data and create/modify features to make them more useful for our models. The main steps will be:
1. **Handle missing values:**
    - Embarked: Fill the two missing Embarked values with the most common port (since only 2 are missing).
    - Fare: In the test set, if there's a missing Fare value (we should confirm), fill it with a reasonable value (like median Fare).
    - Age: There are quite a few missing ages. We will fill these with a strategy – for example, using the median age for certain groups of passengers (like by Title or Pclass/Sex combination).
    - Cabin: This has a lot of missing data. For simplicity, we will drop this feature in our model or perhaps create a new feature like “HasCabin” (1 or 0 indicating if Cabin is known). Cabin might carry some information (e.g., passengers with cabins could be higher class), but since it's mostly missing, a common approach is to drop it.
  
2. **Feature engineering:**
    - Title Extraction: From the Name, extract the title (Mr, Mrs, Miss, Master, etc.), which can be informative (for example, “Master” denotes young boys, while “Miss” and “Mrs” imply female). We can then group rare titles into a few broader categories.
    - Family Size: Combine SibSp and Parch to create a new feature FamilySize = SibSp + Parch + 1 (the +1 is the passenger themselves). This helps capture whether the passenger was alone or with family. We might also derive IsAlone (a boolean indicating no family aboard, i.e., FamilySize=1).
    - Convert categorical to numeric: Convert Sex to a numeric value (e.g., female=0, male=1) and Embarked to numeric. Similarly, convert Title and possibly Pclass if we treat Pclass as categorical.
    - Drop unused columns: We can drop Ticket (it’s not obviously useful for prediction) and Name (since we will use Title instead). We will also drop PassengerId from the training set (it’s just an identifier). However, we will keep PassengerId from the test set separately, as it’s needed for submission.
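
The "HasCabin" alternative mentioned under Cabin is not used in this notebook, but it could be sketched as follows. The `toy` frame is an assumption for illustration; only the `Cabin` column name comes from the dataset:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Cabin column; values are illustrative only.
toy = pd.DataFrame({"Cabin": [np.nan, "C85", np.nan, "C123"]})

# 1 if a cabin is recorded, 0 if the value is missing.
toy["HasCabin"] = toy["Cabin"].notna().astype(int)

print(toy)
```

This keeps the "cabin known or not" signal while still allowing the sparse Cabin column itself to be dropped.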

In [182]:
# Handling Missing Embarked
# Fill missing Embarked in train set with the mode (most common value)
print("Missing Embarked in train before:", train_df['Embarked'].isnull().sum())
train_df['Embarked'] = train_df['Embarked'].fillna(train_df['Embarked'].mode()[0])  # the most frequent value, 'S'
print("Missing Embarked in train after:", train_df['Embarked'].isnull().sum())

Missing Embarked in train before: 2
Missing Embarked in train after: 0


In [184]:
# Handling Missing Fare
# Check and fill missing Fare in test set
# (Many scikit-learn estimators, LogisticRegression among them, can't handle NaN values.
# If test.csv has even one missing value in a feature column such as Fare,
# predict() can raise an error.)
print("Missing Fare in test before:", test_df['Fare'].isnull().sum())
if test_df['Fare'].isnull().any():
    test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())  # using median
print("Missing Fare in test after:", test_df['Fare'].isnull().sum())

Missing Fare in test before: 1
Missing Fare in test after: 0


In [186]:
# Extract Title from Name in both train and test
# Extracting the title (Mr, Mrs, Miss, etc.) from the Name can be useful. 
# For example, it can help us approximate ages for missing values (Master implies child, etc.) 
# and also serve as an additional categorical feature.
import re

def extract_title(name):
    # Regex: find word ending with a dot following the comma (like "Mr.", "Mrs.", etc.)
    match = re.search(r',\s*([^\.]+)\.', name)
    if match:
        return match.group(1).strip()
    return ""

# Create Title column
train_df['Title'] = train_df['Name'].apply(extract_title)
test_df['Title'] = test_df['Name'].apply(extract_title)

# Let's see unique titles and their counts in train set
print("Unique titles in train:", train_df['Title'].unique())
print(train_df['Title'].value_counts())

Unique titles in train: ['Mr' 'Mrs' 'Miss' 'Master' 'Don' 'Rev' 'Dr' 'Mme' 'Ms' 'Major' 'Lady'
 'Sir' 'Mlle' 'Col' 'Capt' 'the Countess' 'Jonkheer']
Title
Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Major             2
Col               2
the Countess      1
Capt              1
Ms                1
Sir               1
Lady              1
Mme               1
Don               1
Jonkheer          1
Name: count, dtype: int64


In [188]:
# We see a variety of titles. The four most common are Mr, Miss, Mrs, Master.
# There are some rare ones (Lady, Sir, Col, etc.) that we group into two broader categories, "Royalty" and "Officer".
# Also note: 'Mlle' and 'Ms' can be considered the same as 'Miss', and 'Mme' (French "Madame") is similar to 'Mrs'.
# We will map these accordingly for simplicity.

# Simplify titles by grouping rare titles
title_replacements = {
    "Mlle": "Miss",
    "Ms": "Miss",
    "Mme": "Mrs",
    "Lady": "Royalty", "Countess": "Royalty", "the Countess": "Royalty",
    "Sir": "Royalty", "Don": "Royalty", "Jonkheer": "Royalty",
    "Major": "Officer", "Col": "Officer", "Capt": "Officer", "Rev": "Officer", "Dr": "Officer"
}
train_df['Title'] = train_df['Title'].replace(title_replacements)
test_df['Title']  = test_df['Title'].replace(title_replacements)

# After replacement, check the unique titles
print("Titles after simplification:", train_df['Title'].unique())
print(train_df['Title'].value_counts())

Titles after simplification: ['Mr' 'Mrs' 'Miss' 'Master' 'Royalty' 'Officer']
Title
Mr         517
Miss       185
Mrs        126
Master      40
Officer     18
Royalty      5
Name: count, dtype: int64


- Our Title categories are now standardized to Mr, Mrs, Miss, Master, Officer, and Royalty (where "Officer" covers military ranks, clergy, and doctors, and "Royalty" covers nobility and honorifics).
- This Title feature can be very useful both for missing age imputation and as a predictor itself.
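
The replacement map only covers titles seen in the training data, so a title that appears only in the test set would pass through unchanged. A minimal sketch of a vectorized extraction with a fallback is shown below. The `names` series and the `"Other"` fallback are assumptions for illustration, not part of the notebook's pipeline:

```python
import pandas as pd

# Illustrative names; the last one carries a title outside the six categories.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Oliva y Ocana, Dona. Fermina",
])

# Same pattern as extract_title: the text between the comma and the next dot.
titles = names.str.extract(r",\s*([^\.]+)\.", expand=False).str.strip()

# Hypothetical fallback: anything outside the known categories becomes "Other".
known = {"Mr", "Mrs", "Miss", "Master", "Officer", "Royalty"}
titles = titles.where(titles.isin(known), "Other")

print(titles.tolist())
```

In the real pipeline the fallback would be applied after `title_replacements`, so that mapped titles such as "Mlle" keep their intended category.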

In [191]:
# Imputing Missing Age
# Age is missing for many passengers (~20% in train). Instead of dropping those records (which would lose information), we'll fill the missing ages with reasonable guesses. 
# A common approach is to use median ages of groups.
# We can use Title as a grouping key for median age, since titles correlate with age (e.g., Master are young boys, Miss/Mrs are generally younger/older women, etc.). 
# Let's compute the median age for each Title group and fill accordingly:

# Compute median age for each Title group from the training data
title_age_median = train_df.groupby('Title')['Age'].median()
print("Median ages by Title:")
print(title_age_median)

# Function to impute age based on Title
def impute_age(row):
    if pd.isnull(row['Age']):
        return title_age_median[row['Title']]
    else:
        return row['Age']

# Apply to both train and test
train_df['Age'] = train_df.apply(impute_age, axis=1)
test_df['Age']  = test_df.apply(impute_age, axis=1)

# Verify no missing Ages remain
print("Missing Age in train after imputation:", train_df['Age'].isnull().sum())
print("Missing Age in test after imputation:", test_df['Age'].isnull().sum())

Median ages by Title:
Title
Master      3.5
Miss       21.0
Mr         30.0
Mrs        35.0
Officer    50.0
Royalty    40.0
Name: Age, dtype: float64
Missing Age in train after imputation: 0
Missing Age in test after imputation: 0


We group by Title in the training set to get a median age for each title. The output above gives Master 3.5 years, Miss 21, Mr 30, Mrs 35, Officer 50, and Royalty 40. These medians fill the missing Age values for passengers with the corresponding Title, in both train and test, which leaves 0 missing ages in either set.
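
The row-wise `apply` works, but looking up `title_age_median[row['Title']]` would raise a `KeyError` if a passenger with a missing age had a title absent from the training medians. A possible vectorized variant with a fallback is sketched below. The toy frames and the `fill_age` helper are hypothetical names introduced for illustration:

```python
import numpy as np
import pandas as pd

# Toy data; ages and titles are made up for illustration.
train = pd.DataFrame({
    "Title": ["Mr", "Mr", "Miss", "Miss", "Master"],
    "Age": [30.0, 40.0, 20.0, np.nan, 4.0],
})
test = pd.DataFrame({
    "Title": ["Mr", "Miss", "Dona"],
    "Age": [np.nan, 18.0, np.nan],
})

# Medians come from the training data only, as in the cell above.
medians = train.groupby("Title")["Age"].median()
overall = train["Age"].median()

def fill_age(df):
    """Fill missing Age with the per-title median, then the overall median."""
    per_title = df["Title"].map(medians)  # NaN for titles unseen in train
    return df["Age"].fillna(per_title).fillna(overall)

train["Age"] = fill_age(train)
test["Age"] = fill_age(test)

print(train["Age"].tolist())
print(test["Age"].tolist())
```

The second `fillna` is the design choice here: an unseen title falls back to the overall training median instead of raising an error.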

In [194]:
# Creating FamilySize and IsAlone
# Let's create the FamilySize feature and also an IsAlone indicator:

# Create FamilySize feature
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1
test_df['FamilySize']  = test_df['SibSp'] + test_df['Parch'] + 1

# Create IsAlone feature (1 if no family on board, i.e., FamilySize == 1)
train_df['IsAlone'] = (train_df['FamilySize'] == 1).astype(int)
test_df['IsAlone']  = (test_df['FamilySize'] == 1).astype(int)

print(train_df[['SibSp','Parch','FamilySize','IsAlone']].head(5))

   SibSp  Parch  FamilySize  IsAlone
0      1      0           2        0
1      1      0           2        0
2      0      0           1        1
3      1      0           2        0
4      0      0           1        1


The output shows, for example, that passenger 1 (Braund, Mr. Owen Harris) had SibSp=1 and Parch=0, so FamilySize=2 and IsAlone=0 (he had one family member aboard). Passenger 3 (Heikkinen, Miss. Laina) had SibSp=0 and Parch=0, so FamilySize=1 and IsAlone=1 (she was alone). FamilySize can capture the effect of having family: some analyses suggest that having 1-3 family members could slightly improve survival chances (someone to help), while very large families might reduce them (harder to get everyone on a lifeboat). The IsAlone feature directly captures whether a passenger travelled solo. We will let the model figure out whether these are useful.

### Dropping and Converting Columns
Now we prepare the final data for modeling:
- Drop columns that we don't need in our feature set: PassengerId, Name, Ticket, and Cabin (Cabin is dropped because most of its values are missing). We have already extracted Title from Name, so these can go.
- Convert categorical features (Sex, Embarked, Title, and possibly Pclass) to numeric form, using either one-hot encoding (dummy variables) or label encoding. For linear models like logistic regression, one-hot encoding is safer because it avoids implying an ordinal relationship; for tree-based models, label encoding is fine. Here we one-hot encode Embarked and Title and use a simple binary mapping for Sex. Pclass is ordinal (1, 2, 3), so we leave it as a number; one-hot encoding it would also work and might help logistic regression slightly if the effect of class is not linear.

In [198]:
# Preserve PassengerId for test set results
train_passenger_id = train_df['PassengerId']
test_passenger_id = test_df['PassengerId']

# Drop columns that won't be used as features
drop_cols = ['PassengerId','Name','Ticket','Cabin']
train_df.drop(columns=drop_cols, inplace=True)
test_df.drop(columns=drop_cols, inplace=True)

# Convert 'Sex' to numeric (female=0, male=1)
train_df['Sex'] = train_df['Sex'].map({'female': 0, 'male': 1}).astype(int)
test_df['Sex']  = test_df['Sex'].map({'female': 0, 'male': 1}).astype(int)

# One-hot encode Embarked and Title
train_df = pd.get_dummies(train_df, columns=['Embarked','Title'], drop_first=True)
test_df  = pd.get_dummies(test_df,  columns=['Embarked','Title'], drop_first=True)

print("Features in train set after encoding:", list(train_df.columns))
print("Features in test set after encoding:", list(test_df.columns))

Features in train set after encoding: ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'FamilySize', 'IsAlone', 'Embarked_Q', 'Embarked_S', 'Title_Miss', 'Title_Mr', 'Title_Mrs', 'Title_Officer', 'Title_Royalty']
Features in test set after encoding: ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'FamilySize', 'IsAlone', 'Embarked_Q', 'Embarked_S', 'Title_Master', 'Title_Miss', 'Title_Mr', 'Title_Mrs', 'Title_Officer']


We use `pd.get_dummies` with `drop_first=True` to avoid the dummy variable trap (for k categories it creates k-1 dummy columns). Because train and test are encoded separately, their dummy columns can differ whenever a category appears in only one of them, and the output above shows exactly that: the training set has `Title_Royalty` but no `Title_Master`, while the test set has `Title_Master` but no `Title_Royalty`. The test set has no Royalty titles and contains a title value that sorts before Master and is not covered by `title_replacements`, so `drop_first` removed a different baseline column in each frame. A common way to prevent this is to concatenate train and test before dummy encoding. Here we instead align the test columns to the training columns:

In [201]:
# Align the test set columns with train set columns (add any missing dummy columns in test)
for col in train_df.columns:
    if col not in test_df.columns:
        test_df[col] = 0
# Select the training feature columns in training order (Survived is not in test).
# Note that this also drops dummy columns that exist only in test.
test_df = test_df[train_df.columns.drop('Survived')]
test_df

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,FamilySize,IsAlone,Embarked_Q,Embarked_S,Title_Miss,Title_Mr,Title_Mrs,Title_Officer,Title_Royalty
0,3,1,34.5,0,0,7.8292,1,1,True,False,False,True,False,False,0
1,3,0,47.0,1,0,7.0000,2,0,False,True,False,False,True,False,0
2,2,1,62.0,0,0,9.6875,1,1,True,False,False,True,False,False,0
3,3,1,27.0,0,0,8.6625,1,1,False,True,False,True,False,False,0
4,3,0,22.0,1,1,12.2875,3,0,False,True,False,False,True,False,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,3,1,30.0,0,0,8.0500,1,1,False,True,False,True,False,False,0
414,1,0,39.0,0,0,108.9000,1,1,False,False,False,False,False,False,0
415,3,1,38.5,0,0,7.2500,1,1,False,True,False,True,False,False,0
416,3,1,30.0,0,0,8.0500,1,1,False,True,False,True,False,False,0
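
The loop-and-select alignment can also be written with `DataFrame.reindex`. The sketch below uses toy frames (the `train`, `test`, `train_enc`, and `test_enc` names and their values are assumptions for illustration) and encodes without `drop_first`, so that no category silently becomes the baseline in one frame but not the other:

```python
import pandas as pd

# Toy frames: "Royalty" appears only in train, "Dona" only in test.
train = pd.DataFrame({"Title": ["Master", "Mr", "Royalty"]})
test = pd.DataFrame({"Title": ["Dona", "Master", "Mr"]})

# One dummy column per category, stored as 0/1 integers.
train_enc = pd.get_dummies(train, columns=["Title"], dtype=int)
test_enc = pd.get_dummies(test, columns=["Title"], dtype=int)

# Keep exactly the training columns: missing ones are filled with 0,
# columns that exist only in test are dropped.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
print(test_enc)
```

Keeping all k dummy columns makes them linearly dependent, which the default regularized `LogisticRegression` and tree models tolerate; if `drop_first=True` is preferred, encoding the concatenated train and test frames together keeps the baseline consistent.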


## Exploratory Analysis of Key Features
Before training models, let's take a quick look at how some features relate to survival, to build our intuition:
- Sex: We expect gender to be a strong predictor (the famous "women and children first" policy). Let's confirm the survival rates by gender.
- Pclass: Socio-economic status might have influenced survival (first class had better access to lifeboats).
- FamilySize: Check if being alone vs with family had any effect.

In [204]:
# We'll do a couple of quick calculations:
# Survival rates by gender
survival_by_sex = train_df.groupby('Sex')['Survived'].mean()
print("Survival rate by Sex:")
print(survival_by_sex)

# Survival rates by Pclass
survival_by_class = train_df.groupby('Pclass')['Survived'].mean()
print("\nSurvival rate by Pclass:")
print(survival_by_class)

# Survival rates by IsAlone
survival_by_isalone = train_df.groupby('IsAlone')['Survived'].mean()
print("\nSurvival rate by IsAlone (0=not alone, 1=alone):")
print(survival_by_isalone)

Survival rate by Sex:
Sex
0    0.742038
1    0.188908
Name: Survived, dtype: float64

Survival rate by Pclass:
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64

Survival rate by IsAlone (0=not alone, 1=alone):
IsAlone
0    0.505650
1    0.303538
Name: Survived, dtype: float64


Interpreting these results:
- By Sex: Sex=0 (female) is ~0.74 and Sex=1 (male) is ~0.19, so about 74% of females in the training set survived versus only about 19% of males. This aligns with expectations and confirms that Sex is a very important feature.
- By Pclass: Pclass 1 is ~0.63, Pclass 2 is ~0.47, and Pclass 3 is ~0.24. First-class passengers had the best survival chances, while third-class passengers fared worst.
- By IsAlone: IsAlone=0 (passenger had family on board) is ~0.51 and IsAlone=1 (alone) is ~0.30, so passengers with at least one family member aboard survived at a higher rate than those travelling alone.
  
These quick analyses confirm known patterns: female and higher-class passengers survived at higher rates, and having family might have helped somewhat. Our features Sex, Pclass, FamilySize/IsAlone, etc., are capturing these patterns, which our models can leverage. 
Now we can proceed to modeling.

## Model Training and Evaluation

We'll train two models as required:
1. Logistic Regression – A simple linear model for classification. This will be our baseline.
2. Random Forest – An ensemble of decision trees, a more powerful model that can capture nonlinear relationships.

We will train both on the processed training data and evaluate their performance using accuracy. Since Kaggle’s metric for this competition is accuracy (percentage of correct predictions), that will be our focus. We will also be mindful of overfitting: a model that performs extremely well on training data but poorly on unseen data is overfit. We will use a validation split from the training data to check performance.

### Prepare data for modeling
- Separate features (X) and target (y) from the training DataFrame. Also, ensure we have the same feature columns in test set.


In [209]:
# Split train_df into X (features) and y (target)
X = train_df.drop('Survived', axis=1)
y = train_df['Survived']

# Ensure X and test_df have the same columns
print("Train features shape:", X.shape)
print("Test features shape:", test_df.shape)
print("Columns difference:", set(X.columns) - set(test_df.columns))

Train features shape: (891, 15)
Test features shape: (418, 15)
Columns difference: set()


### Train/Test Split for validation
- We'll hold out a portion of the training data to simulate a validation set (since the test labels are unknown). Let's use 20% of the training data for validation.

In [212]:
from sklearn.model_selection import train_test_split

# Split data: 80% train, 20% validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set size:", X_train.shape[0], "Validation set size:", X_val.shape[0])

Training set size: 712 Validation set size: 179


We use `random_state=42` for reproducibility (without it, every run produces a different split and therefore different results). Now we have X_train, y_train for model training, and X_val, y_val for evaluating the model's performance on unseen data.
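
One optional refinement, not used in this notebook, is to pass `stratify=y` so the validation set keeps roughly the same share of survivors as the full data. A minimal sketch on synthetic data follows (the `X_demo` and `y_demo` arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 38% positives (similar to the survival rate).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 38 + [0] * 62)

# stratify keeps the class proportions similar in both parts of the split.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Positive share overall:   ", y_demo.mean())
print("Positive share validation:", y_va.mean())
```

With a fixed `random_state`, repeating the call returns the same split, which is what makes the accuracy figures in the following cells reproducible.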

### Model 1: Logistic Regression
- We train a logistic regression model using `sklearn.linear_model.LogisticRegression`. We may need to specify a solver and possibly increase the maximum iterations if it doesn’t converge by default (since we have a moderate number of features after one-hot encoding).


In [216]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize Logistic Regression
logreg = LogisticRegression(max_iter=1000, random_state=42)
logreg.fit(X_train, y_train)

# Predict on validation set
y_pred_val_logreg = logreg.predict(X_val)
acc_logreg = accuracy_score(y_val, y_pred_val_logreg)
print(f"Logistic Regression accuracy on validation set: {acc_logreg:.4f}")

Logistic Regression accuracy on validation set: 0.8156


In [218]:
# We can also check the accuracy on the training set to see if the model overfits:
# Check training accuracy (to see if there's overfitting)
y_pred_train_logreg = logreg.predict(X_train)
acc_logreg_train = accuracy_score(y_train, y_pred_train_logreg)
print(f"Logistic Regression accuracy on training set: {acc_logreg_train:.4f}")

Logistic Regression accuracy on training set: 0.8385


Logistic regression is not very flexible, so its training accuracy (0.8385) is close to its validation accuracy (0.8156). A gap this small suggests the model is not overfitting badly; a much higher training accuracy would be a concern.

### Model 2: Random Forest
Now train a Random Forest classifier. We'll use `sklearn.ensemble.RandomForestClassifier`. We can start with default parameters (100 trees, etc.). We will monitor its performance on validation to compare with logistic regression.

In [222]:
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Predict on validation set
y_pred_val_rf = rf_clf.predict(X_val)
acc_rf = accuracy_score(y_val, y_pred_val_rf)
print(f"Random Forest accuracy on validation set: {acc_rf:.4f}")

# Training accuracy for comparison
y_pred_train_rf = rf_clf.predict(X_train)
acc_rf_train = accuracy_score(y_train, y_pred_train_rf)
print(f"Random Forest accuracy on training set: {acc_rf_train:.4f}")

Random Forest accuracy on validation set: 0.8380
Random Forest accuracy on training set: 0.9803


The training accuracy for Random Forest is very high (0.9803) because a forest of deep trees can nearly memorize the training data. The gap between training and validation accuracy (0.9803 vs 0.8380) indicates some overfitting. However, Random Forest has built-in mechanisms (bagging, feature randomness) that help it generalize, so even though it fits the training data almost perfectly, it still performs well on validation. We could tune the hyperparameters to reduce overfitting (for example, limiting max depth or using fewer features per split), but since the validation accuracy is already better than the baseline's, we accept it for now. In practice, cross-validation and hyperparameter tuning (grid search) could improve this further.
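
To illustrate how limiting tree depth affects the train/validation gap, here is a minimal sketch on synthetic data. The dataset from `make_classification`, the depth value 4, and the `results` dictionary are assumptions for illustration; the numbers it prints say nothing about the Titanic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, slightly noisy classification problem.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

results = {}
for depth in (None, 4):  # None = fully grown trees, 4 = shallow trees
    forest = RandomForestClassifier(
        n_estimators=100, max_depth=depth, random_state=42
    )
    forest.fit(X_tr, y_tr)
    # score() returns mean accuracy on the given data.
    results[depth] = (forest.score(X_tr, y_tr), forest.score(X_va, y_va))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, "
          f"val={results[depth][1]:.3f}")
```

Shallower trees typically lower the training accuracy and narrow the gap; whether validation accuracy improves depends on the data, which is why the depth is usually chosen by cross-validation.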

In our case, logistic regression had lower variance (underfit a bit, but stable), and Random Forest had higher variance (potential to overfit). The Random Forest’s higher accuracy on validation suggests it's capturing more signal from the data, so we'll likely use it for the final model. 

Model selection: we have two models:
- Logistic Regression: accuracy ≈ 81.6% on validation.
- Random Forest: accuracy ≈ 83.8% on validation.

Random Forest performs better, so we choose it for our final predictions on the test set. On a validation set of 179 passengers the gap amounts to only about 4 passengers, so the advantage is modest. (If logistic had been close and we preferred simplicity, we might choose logistic, but here the forest wins out.) 
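
To put the difference between the two validation accuracies in perspective, the short calculation below converts them back to counts of correctly classified passengers and adds a rough binomial standard error. The standard-error formula is a common approximation and an assumption here, not something the notebook itself computes:

```python
import math

n_val = 179          # validation set size reported above
acc_logreg = 0.8156  # validation accuracies copied from the outputs above
acc_rf = 0.8380

# Convert accuracies back to numbers of correct predictions.
correct_logreg = round(acc_logreg * n_val)
correct_rf = round(acc_rf * n_val)
print("Correct predictions:", correct_logreg, "vs", correct_rf)

# Rough binomial standard error of a single accuracy estimate.
std_err = math.sqrt(acc_rf * (1 - acc_rf) / n_val)
print(f"Approximate standard error: {std_err:.3f}")
```

The two models differ by 4 validation passengers, and the accuracy gap of about 0.022 is smaller than the approximate standard error of roughly 0.03, so a single 80/20 split cannot separate them with much confidence.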

Before moving on, let's ensure our Random Forest isn't missing any feature that was one-hot encoded differently between train and test. We aligned columns, so it should be fine. 

But we can also look at feature importance from the Random Forest to see what it found most predictive:

In [226]:
# Feature importances from Random Forest
importances = pd.Series(rf_clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print("Top 5 feature importances in Random Forest:")
print(importances.head(5))

Top 5 feature importances in Random Forest:
Fare        0.250759
Age         0.213139
Title_Mr    0.124932
Sex         0.113282
Pclass      0.071523
dtype: float64


## Final Prediction on Test Set
Now that we have our chosen model (Random Forest), we will train it on the entire training dataset (all 891 samples) to maximize the data used, and then predict on the test dataset of 418 passengers.

Why retrain on full train data? Because previously we held out 20% for validation. For the final model, we can use all available training data (since we validated already) to potentially improve the model’s performance a bit with more data. This is a common practice: tune/validate with a split or cross-validation, then train final model on all data for deployment.

In [229]:
# Retrain Random Forest on full training data
final_model = RandomForestClassifier(n_estimators=100, random_state=42)
final_model.fit(X, y)

# Predict on test data
test_predictions = final_model.predict(test_df)

# Create the submission DataFrame
submission = pd.DataFrame({
    "PassengerId": test_passenger_id,
    "Survived": test_predictions.astype(int)
})
print(submission.head(10))

   PassengerId  Survived
0          892         0
1          893         0
2          894         0
3          895         1
4          896         0
5          897         0
6          898         0
7          899         0
8          900         1
9          901         0


In [231]:
# Save to CSV
submission.to_csv("submission.csv", index=False)
print("\nSubmission file saved: submission.csv")


Submission file saved: submission.csv


## Conclusion

We have built a complete pipeline for the Titanic prediction challenge, from data preprocessing to model training and submission. Along the way, we handled missing data, created new features (Title, FamilySize), and saw how certain characteristics (being female, in first class, etc.) strongly affect survival chances. We started with a simple Logistic Regression as a baseline and then moved to a Random Forest which gave us better accuracy. 

For a beginner, this project illustrates the end-to-end process of a machine learning task:
- understanding the data,
- cleaning and feature engineering,
- trying different models,
- avoiding overfitting,
- and preparing a submission.

With this foundation, you can experiment with more sophisticated techniques to further improve your score. Good luck and happy Kaggling!

### (Optional) But is this solution optimal?
- Not yet. It's a great educational baseline, but not leaderboard-crushing. For true optimization, you could:
    - Use cross-validation instead of a fixed 80/20 split
    - Do feature scaling for logistic regression
    - Try more powerful models like XGBoost or LightGBM
    - Use grid search or RandomizedSearchCV to tune hyperparameters
    - Build ensembles (e.g. average predictions from multiple models)
    - Do stacking or use feature selection techniques
    - Engineer more nuanced features (e.g., Ticket prefixes, Cabin letters, Age buckets, etc.)
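
Three of the suggestions above (cross-validation, feature scaling for logistic regression, and grid search) can be sketched together with standard scikit-learn tools. The example below runs on synthetic data; the dataset, the parameter grid, and the variable names are assumptions for illustration rather than tuned choices for the Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed feature matrix X and target y.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)

# Scaling inside a pipeline, so each CV fold is scaled using only its training part.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg_scores = cross_val_score(
    scaled_logreg, X_demo, y_demo, cv=5, scoring="accuracy"
)
print(f"Logistic regression 5-fold accuracy: {logreg_scores.mean():.3f}")

# Small, illustrative grid over two Random Forest hyperparameters.
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 3]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)
print("Best forest parameters:", search.best_params_)
print(f"Best forest 5-fold accuracy: {search.best_score_:.3f}")
```

On the real data, `X_demo` and `y_demo` would be replaced by the notebook's `X` and `y`, and the averaged fold scores would give a steadier comparison than a single validation split.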