Train the chosen model (Logistic Regression) on the full training data and submit predictions for the test set.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, FunctionTransformer

In [2]:
# Load Data
train_df = pd.read_csv("/kaggle/input/titanic/train.csv")
test_df  = pd.read_csv("/kaggle/input/titanic/test.csv")

In [3]:
#Feature engineering

def feature_engineering(X):

    # Work on a copy to avoid modifying the original dataframe
    X_processed = X.copy()
    
    # 1. Family Size: sum of siblings/spouses and parents/children + yourself
    X_processed["FamilySize"] = X_processed["SibSp"] + X_processed["Parch"] + 1

    # 2. Is Alone: 1 if the passenger has no family on board, 0 otherwise
    X_processed["IsAlone"] = (X_processed["FamilySize"] == 1).astype(int)

    # 3. Title from Name: extract the text between the comma and the first period.
    # The capture group ([^\.]+) matches one or more non-period characters,
    # which pulls out titles like "Mr", "Mrs", "Master", etc.
    X_processed["Title"] = X_processed["Name"].str.extract(r",\s*([^\.]+)\.", expand=False).str.strip()

    # Optional: Map rare titles to a 'Rare' category to help the model generalize
    # This prevents overfitting on titles that only appear once or twice
    rare_titles = ['Don', 'Rev', 'Dr', 'Major', 'Lady', 'Sir', 'Col', 'Capt', 'the Countess', 'Jonkheer']
    X_processed["Title"] = X_processed["Title"].replace(rare_titles, 'Rare')
    X_processed["Title"] = X_processed["Title"].replace(['Mlle', 'Ms'], 'Miss')
    X_processed["Title"] = X_processed["Title"].replace('Mme', 'Mrs')

    # Drop columns that are no longer needed (since we extracted their info)
    # This prevents the model from trying to process the raw 'Name' string
    # drop() returns a new DataFrame, so the result must be assigned back
    X_processed = X_processed.drop(columns=['Name', 'SibSp', 'Parch'])

    return X_processed

# validate=False ensures the pipeline passes a DataFrame, not a NumPy array
# The function itself is passed as an object (it is not called here)
feature_engineer_transformer = FunctionTransformer(feature_engineering, validate=False)
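
To make the title extraction concrete, here is a minimal sketch that applies the same regular expression to a few names. Three of the names come from the `test_df.head()` output later in this notebook. The fourth is a made-up name, added only to show a multi-word title being mapped to 'Rare'.

```python
import pandas as pd

names = pd.Series([
    "Kelly, Mr. James",
    "Wilkes, Mrs. James (Ellen Needs)",
    "Hirvonen, Mrs. Alexander (Helga E Lindqvist)",
    "Example, the Countess. of (Jane Doe)",  # hypothetical name for illustration
])

# Same pattern as in feature_engineering: capture everything between
# the comma and the first period, then strip surrounding whitespace.
titles = names.str.extract(r",\s*([^\.]+)\.", expand=False).str.strip()
print(titles.tolist())

# Rare titles collapse into a single category, as in the function above.
mapped = titles.replace(["the Countess"], "Rare")
print(mapped.tolist())
```

Running the pattern on a handful of names like this is a quick way to confirm that the capture group stops at the first period and does not swallow the given name.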


In [5]:
# Define target & features
X = train_df.drop(columns=["Survived"])
y = train_df["Survived"]

In [6]:
#Pre-processing

X_temp = feature_engineering(X)

#Identify column types
numeric_features = X_temp.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X_temp.select_dtypes(include=["object"]).columns

# Apply imputation & StandardScaler
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())  # to fix the Convergence Warning
])

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore"))
    ]
)

# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)
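
The effect of `handle_unknown="ignore"` in the categorical pipeline can be seen on a toy example. This is only a sketch: the category values below are chosen for illustration and are not read from the Titanic files.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Categories seen at fit time (illustrative values).
train_titles = np.array([["Mr"], ["Mrs"], ["Miss"]])
# At predict time, one known category and one the encoder never saw.
new_titles = np.array([["Mrs"], ["Dona"]])

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_titles)

# .toarray() converts the default sparse output to a dense array for printing.
encoded = encoder.transform(new_titles).toarray()
print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)                 # the unseen category becomes a row of zeros
```

The unseen value does not raise an error. It simply contributes no signal, because every one-hot column for that feature is 0.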

In [7]:
# Logistic regression model
model = LogisticRegression(max_iter=1000)  # upper limit on solver iterations (scikit-learn's default is 100)

# Full pipeline
clf = Pipeline(
    steps=[
        ("feature_engineer", feature_engineer_transformer),
        ("preprocessor", preprocessor),
        ("model", model)
    ]
)

In [9]:
clf.fit(X, y)

In [10]:
test_df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [11]:
# Prediction
test_preds = clf.predict(test_df)

In [12]:
# Create submission file
submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": test_preds
})

In [13]:
# save
submission.to_csv("submission.csv", index=False)

In [14]:
check_submission = pd.read_csv('submission.csv')

# View the first 10 rows
check_submission.head(10)

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1
5,897,0
6,898,1
7,899,0
8,900,1
9,901,0
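
Beyond eyeballing the first rows, a few automated checks can catch a malformed file before uploading. The helper below is a hypothetical sketch, not part of the notebook: `validate_submission` and its `expected_rows` parameter are names introduced here, and the checks assume the competition expects exactly the two columns written above.

```python
import pandas as pd

def validate_submission(df, expected_rows):
    """Return a list of problems found in a submission frame (hypothetical helper)."""
    problems = []
    if list(df.columns) != ["PassengerId", "Survived"]:
        problems.append(f"unexpected columns: {list(df.columns)}")
        return problems  # the remaining checks need both columns
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if df["PassengerId"].duplicated().any():
        problems.append("duplicate PassengerId values")
    if not df["Survived"].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    return problems

# Stand-in for the real file: the first three rows of the output above.
demo = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
print(validate_submission(demo, expected_rows=3))

# A deliberately broken copy: an invalid label and a wrong row count.
bad = demo.assign(Survived=[0, 2, 0])
print(validate_submission(bad, expected_rows=4))
```

In the notebook itself the call would presumably look like `validate_submission(check_submission, expected_rows=len(test_df))`.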


# How to submit
Click Save Version. Once the version has saved successfully, click the version number. In the window that opens, click the three dots (...) next to the successful version and select 'Open in Viewer'. In the viewer, open the Output tab, find your file, and click the 'Submit to Competition' button next to the file name.

## Submission Summary

- Model: Logistic Regression
- Feature set: FamilySize, IsAlone, Title + baseline
- Validation strategy: Stratified 5-fold CV
- Mean CV accuracy: 0.8384
- Public leaderboard score: 0.77272
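
The cross-validation code is not shown in this section, so the following is only a sketch of how a stratified 5-fold estimate like the one above is typically computed. It uses synthetic data as a stand-in for the Titanic features. The `shuffle=True` and `random_state` settings are assumptions, not the values used for the reported score.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Titanic set).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Passing the whole pipeline to cross_val_score means the scaler is
# re-fit on the training part of every fold, which avoids leakage.
demo_clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(demo_clf, X_demo, y_demo, cv=cv, scoring="accuracy")
print(f"Mean CV accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
```

With the notebook's objects, the equivalent call would be `cross_val_score(clf, X, y, cv=cv, scoring="accuracy")`.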

Observations:
- Difference between CV and leaderboard
  A CV score of 0.8384 against an LB score of 0.7727 means your model performs noticeably better on your local training folds than on the unseen data provided by the competition. A small drop is expected, but a gap of about 6.6 percentage points suggests your local evaluation is too optimistic.

- Possible reasons for difference
  - Data leakage in CV: Leakage is a common cause of an optimistic CV score. It happens when anything learned from the data (imputation medians, scaling statistics, encoder categories) is fit on the entire training set before splitting for CV. "FamilySize", "IsAlone" and the "Title" rare-mapping are computed row by row and never use the target, and since you used a Pipeline with FunctionTransformer, this risk is minimized.
  - Small dataset variance: The Titanic test set is very small. On a set of ~400 samples, each passenger is worth about 0.25 percentage points, so just 3-4 passengers being classified differently can swing your LB score by about 1%.
  - Categorical encoding shift: Your OneHotEncoder(handle_unknown='ignore') is safe, but if the test set contains Titles or Embarkation points that were not seen during fitting, they are encoded as all zeros. The model gets no signal from that feature for those rows, which leads to less accurate predictions.
  - Overfitting the "Title" feature: Features like "Title" are highly predictive but can be brittle. If your mapping of "Rare" titles doesn't align with how those same titles behave in the test set, the model's accuracy will drop.

- Confidence in generalization
  - Current confidence: Moderate-Low. While your CV variance (+/- 0.0067) was low, the gap to the LB shows that your model is not yet generalizing well to new data.
  - The "Logistic Regression" reality: LogReg is a linear model. It may be struggling to capture non-linear survival thresholds (e.g., Age 0-5 is highly predictive of survival, but Age 5-15 is not; a linear model has to fit Age with a single coefficient, i.e. one straight line in the log-odds).
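
The small-test-set point above can be made concrete with a rough binomial approximation. This is a sketch under stated assumptions: it treats each test passenger as an independent trial, and it takes `n_test = 418` as the number of scored rows. The public leaderboard may be computed on only part of the test file, which would make the noise larger.

```python
import math

def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples rows."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

cv_score = 0.8384   # mean CV accuracy reported above
lb_score = 0.77272  # public leaderboard score reported above
n_test = 418        # assumption: number of rows the LB score is based on

se = accuracy_standard_error(lb_score, n_test)
gap = cv_score - lb_score
per_passenger = 1.0 / n_test  # accuracy change from flipping one prediction

print(f"Standard error of LB score: {se:.4f}")
print(f"CV-to-LB gap:               {gap:.4f} ({gap / se:.1f} standard errors)")
print(f"One passenger is worth:     {per_passenger:.4f}")
```

Under these assumptions the standard error is roughly 0.02, so the observed gap is about three standard errors. Sampling noise alone is therefore unlikely to explain the whole difference, which fits the other reasons listed above.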
