# 3 Data Preparation    

In [None]:
from pathlib import Path
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

from sklearn import metrics



# Find project root (folder that contains "data")
def get_project_root():
    p = Path.cwd()
    while not (p / "data").exists() and p != p.parent:
        p = p.parent
    return p

PROJECT_ROOT = get_project_root()
RAW_DATA_DIR = PROJECT_ROOT / "data" / "raw"

filename = "titanic1.csv"
input_path = RAW_DATA_DIR / filename
print("Reading from:", input_path)  # optional but useful
df = pd.read_csv(input_path)





# 3.1 Select Data

For modeling, I keep the passenger attributes that could influence survival: Age, Fare, Sex, family relations (SibSp, Parch), class (Pclass) and port of embarkation (Embarked), together with the target Survived.

In this phase, I prepare the data for modeling:
* Removing redundant data such as the zero columns and the passenger ID
* Adding values to fields with one 
* Filling empty fields with the most frequent attribute of the category

# 3.2 Clean Data

#### Drop zeroes

In [None]:
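# Drop the placeholder columns "zero" and "zero.0" ... "zero.18" in one call
zero_cols = ["zero"] + [f"zero.{i}" for i in range(19)]
df = df.drop(columns=zero_cols, errors="ignore")

# Sanity check: list any zero* columns that are still present (expected: none)
col = df.filter(regex=r"(?i)^zero(\.\d+)?$").columns
print("Remaining zero* columns:", list(col))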

I removed the zero* columns because they contain only constant zeros and therefore carry no predictive information for survival. Excluding them reduces noise and improves interpretability without affecting model validity.
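
Before dropping columns by name, it can be worth confirming that they really hold nothing but zeros. The following is a minimal sketch on a made-up toy frame (the `toy` data and the names `zero_like`, `all_zero` and `cleaned` are assumptions for illustration, not part of the notebook):

```python
import pandas as pd

# Toy frame standing in for the raw data; the column names mimic the zero* pattern.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "zero": [0, 0, 0],
    "zero.1": [0, 0, 0],
    "Fare": [7.25, 71.28, 7.92],
})

# Select the columns named "zero" or "zero.<number>".
zero_like = toy.filter(regex=r"^zero(\.\d+)?$").columns

# Keep only those that contain nothing but zeros.
all_zero = [c for c in zero_like if (toy[c] == 0).all()]

# Drop the confirmed all-zero columns.
cleaned = toy.drop(columns=all_zero)
print(list(cleaned.columns))
```

A column that matches the name pattern but contains other values would stay in the data, which guards against removing information by accident.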

#### Drop Passengerid

In [None]:
df = df.drop(columns='Passengerid', errors="ignore")

I excluded Passengerid because it is an identifier rather than a passenger attribute. It carries no causal or behavioral information that could affect the probability of survival.

#### Rename 2urvived to Survived

In [None]:
df = df.rename(columns={"2urvived": "Survived"})

I renamed 2urvived to Survived so that the target column has a clear name.

# 3.3 Construct Data

#### Fill empty fields with the most frequent attribute of the category

In [None]:
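# Count the rows with a missing Embarked value before imputing
print("Missing Embarked values:", df["Embarked"].isna().sum())

# Replace missing values with the most frequent category (the mode)
mode_embarked = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(mode_embarked)
df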

I filled the missing Embarked values with the most common Embarked category because it is a categorical column and only 2 rows were missing, so the imputation should have a negligible effect on the predicted probability of survival. This also keeps all passengers in the dataset instead of deleting rows.

A final check for remaining empty fields:
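
To make the effect of mode imputation concrete, here is a small sketch on invented values (the `toy` frame is an assumption for illustration; the real column may contain different codes):

```python
import numpy as np
import pandas as pd

# Six made-up passengers, two of them without an Embarked value.
toy = pd.DataFrame({"Embarked": [2.0, 0.0, 2.0, np.nan, 1.0, np.nan]})

n_missing = toy["Embarked"].isna().sum()

# mode() ignores NaN by default, so it returns the most frequent known category.
mode_value = toy["Embarked"].mode()[0]

toy["Embarked"] = toy["Embarked"].fillna(mode_value)

print("Filled", n_missing, "missing values with", mode_value)
print(toy["Embarked"].value_counts())
```

The number of rows is unchanged; only the missing entries take the value of the most frequent category.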

In [None]:
df.isna().sum()

#### Possible future features: FamilySize and IsAlone

In [None]:
#df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
#df["IsAlone"] = (df["FamilySize"] == 1).astype(int)


# 3.4 Format Data


### One-hot encoding

Right now my dataset uses ordinal encoding for categorical features: each category is stored as a number, for example Embarked is 0/1/2. The problem is that this can create a fake order between categories, so the model may treat 2 as “more” than 1 even though ports have no ranking. This is especially an issue for Logistic Regression, because it treats numeric values as ordered quantities.

I convert the nominal categorical features to **one-hot encoding**, where each category becomes its own 0/1 column (e.g., Embarked_0, Embarked_1, Embarked_2). This prevents the model from assuming an order between the ports and keeps the meaning of the input correct and consistent.

In [None]:
# One-hot encode Embarked 
df = pd.get_dummies(df, columns=['Embarked'], prefix='Embarked')
df
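
As a quick illustration of what the encoding produces, the sketch below applies `pd.get_dummies` to four invented rows (the `toy` frame is an assumption; `dtype=int` is only used so the output shows 0/1 instead of booleans):

```python
import pandas as pd

# Four made-up passengers with ordinally coded ports.
toy = pd.DataFrame({"Embarked": [0, 1, 2, 1]})

# Each distinct code becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["Embarked"], prefix="Embarked", dtype=int)
print(encoded)
```

Every row has exactly one indicator set to 1, so no port is numerically "larger" than another.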

### Model to work with


Titanic is a supervised learning problem, specifically binary classification:

* Supervised because the data has labels: Survived = 0 or 1.
* Binary classification because there are exactly two outcomes.
* Tabular data: rows = passengers, columns = features.

Based on this, I can choose models suited to supervised binary classification on tabular data.

* **Logistic Regression** is usually the first model people try for a yes-or-no prediction. It is quick to train, gives consistent results, and is easy to understand: it tells you which passenger details push the chance of survival up or down. It works best when each detail affects survival in a fairly straightforward way, for example, being in a higher class helps, being older can hurt.

* **Decision Tree** predicts survival using a flowchart of simple if-then questions (e.g., Sex?, Pclass?, Age threshold?). It splits passengers into smaller groups so that each group contains people with similar outcomes, and the final prediction is the most common outcome in the last group. Trees are useful because they can capture clear rules and feature combinations, but a single tree can overfit by becoming too detailed, so its depth and minimum leaf size should be limited.

* **Random Forest** builds many decision trees, each trained on a different random sample of the passengers and a random subset of features. The final prediction combines the trees' votes. This reduces the chance that the model memorizes noise from the training data, so it usually generalizes better and produces more reliable results than a single decision tree on tabular datasets.

* **Gradient Boosting** builds a strong model by adding many small decision trees one at a time. It starts with a simple first tree that gives predictions. Then it checks where the model is wrong and trains the next tree to focus mainly on fixing those mistakes. Each new tree is added to the previous ones, so the final prediction is the sum of all trees' contributions. Because it improves step by step, Gradient Boosting often achieves very high accuracy on tabular data, but it is harder to follow how the outcome was computed.

All of these models require numeric input features, so at training time the dataset must not contain raw text categories or missing values. I do not use class-balanced variants of the models by default, because the categories in the Titanic data naturally differ in size and survival rate.
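
The statement that Logistic Regression shows which details push survival up or down can be sketched on synthetic data. Everything below is an assumption for illustration: the two features, the effect sizes in `logit` and the sample size are invented, and the numbers are not results from the Titanic data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for two passenger attributes (NOT the real Titanic data).
is_female = rng.integers(0, 2, size=n)   # 0 = male, 1 = female
pclass = rng.integers(1, 4, size=n)      # 1, 2 or 3

# Assumed ground truth: being female raises the odds, a higher class number lowers them.
logit = -0.5 + 2.5 * is_female - 0.8 * (pclass - 2)
survived = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_demo = np.column_stack([is_female, pclass])
model = LogisticRegression(max_iter=1000).fit(X_demo, survived)

# The sign of each coefficient shows the direction of the effect.
for feature, coef in zip(["is_female", "pclass"], model.coef_[0]):
    direction = "raises" if coef > 0 else "lowers"
    print(f"{feature}: coefficient {coef:+.2f} ({direction} the predicted chance)")
```

A positive coefficient means the feature increases the predicted chance of survival, a negative one means it decreases it. Tree ensembles do not offer such a direct reading, which is the trade-off described above.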

### Alternative models considered

1) K-Nearest Neighbors (KNN)

* Why it could work: simple non-parametric classifier; can capture non-linear boundaries.
* Why I don't prioritize it: sensitive to feature scaling and irrelevant features; performance can be unstable on mixed-type tabular data; less interpretable for stakeholders than tree rules or Logistic Regression coefficients.

2) Support Vector Machine (SVM)

* Why it could work: strong classifier, especially with non-linear kernels (RBF).
* Why I don't prioritize it: needs careful scaling and tuning (C, gamma), can be slower, and is harder to explain than Logistic Regression or trees; probability outputs require extra steps.

3) Naive Bayes

* Why it could work: fast, good baseline in some domains.
* Why I don't prioritize it: relies on strong independence assumptions between features (unlikely for Titanic: Pclass, Fare and Embarked are correlated), so it often underperforms.

4) Neural Networks (MLP)

* Why it could work: flexible; can model complex interactions.
* Why I don't prioritize it: overkill for a small tabular dataset; needs more tuning and regularization; less interpretable and less stable for the scope of this assignment.

5) Extremely Randomized Trees (ExtraTrees)

* Why it could work: often comparable to Random Forest, sometimes faster or stronger.
* Why I don't prioritize it: it works very similarly to Random Forest, so it adds less educational value than including Gradient Boosting as a different ensemble strategy.
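
The scaling requirement mentioned for KNN and SVM is usually handled by putting the scaler and the classifier into one pipeline. The sketch below uses synthetic data from `make_classification`, and the names `X_demo`, `y_demo` and `knn` are assumptions for illustration; it is not part of the models evaluated in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (NOT the Titanic data).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# The scaler is fitted on the training data only and then applied to the test data.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_tr, y_tr)

acc = knn.score(X_te, y_te)
print("Test accuracy:", round(acc, 4))
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training, which is one of the extra steps that make these models more work than the tree-based options.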



### Target and Input

After cleaning the dataset and handling missing values, I prepared it for modeling by separating the target from the input features. Survived is the target variable y, and the remaining passenger attributes form the feature matrix X. I then split the data into training and test sets. The training set is used to fit the models, while the test set is kept for evaluating performance on unseen passengers. I used a stratified split to keep the proportion of survivors and non-survivors similar in both sets, and a fixed random state to ensure the results are reproducible.

In [None]:
from sklearn.model_selection import train_test_split

y = df["Survived"].astype(int)
X = df.drop(columns=["Survived"], errors="ignore")

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
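
What the stratified split does can be checked on a small made-up label vector. The arrays `X_demo` and `y_demo` below are assumptions for illustration (30 positives and 70 negatives), not the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up passengers: 30 survivors (1) and 70 non-survivors (0).
y_demo = np.array([1] * 30 + [0] * 70)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both parts keep the 30% share of survivors.
print("Train share:", y_tr.mean(), "Test share:", y_te.mean())
```

Without `stratify`, the share of survivors in a 20-row test set could drift away from 30% by chance, which would make the test metrics harder to compare.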




# 4 Modeling

The Titanic dataset is tabular with a mix of numeric and categorical variables, so categorical features must be encoded appropriately. The goal is to predict survival (0/1), which is a supervised binary classification task. Therefore, I selected classification models suitable for structured data and compared an interpretable baseline (Logistic Regression), an interpretable non-linear model (Decision Tree), and two ensemble tree methods (Random Forest and Gradient Boosting, the latter using both XGBoost and scikit-learn's GradientBoostingClassifier) to balance explainability and predictive performance.

### Measuring the outcome

The outcome will be measured using these metrics:

* Accuracy (share of correct predictions)
* ROC-AUC (overall ability to separate the classes)
* F1-score (balance between precision and recall)
* Confusion matrix (which errors the model makes)
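
To show how these metrics relate to each other, here is a small worked sketch on eight invented predictions (the arrays `y_true`, `y_pred` and `y_score` are assumptions for illustration, not model output from this notebook):

```python
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

# Eight made-up passengers: true labels, predicted labels and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.3, 0.6, 0.2, 0.8, 0.7, 0.4, 0.9]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Precision = 3/4 and recall = 3/4, so F1 = 0.75.
f1 = f1_score(y_true, y_pred)
print("F1:", f1)

# 15 of the 16 (survivor, non-survivor) pairs are ranked correctly: 15/16 = 0.9375.
auc = roc_auc_score(y_true, y_score)
print("ROC-AUC:", auc)
```

The confusion matrix shows one false positive and one false negative. F1 is computed from the hard predictions, while ROC-AUC only looks at how the probabilities rank survivors against non-survivors.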

In [None]:
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier


def evaluate(model, name):
    """Fit the model on the training set and print its test-set metrics."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Probability of class 1; every model used below supports predict_proba
    proba = model.predict_proba(X_test)[:, 1]
    print(f"\n{name}")
    print("Accuracy:", round(accuracy_score(y_test, pred), 4))
    print("ROC-AUC :", round(roc_auc_score(y_test, proba), 4))
    print("F1      :", round(f1_score(y_test, pred), 4))


# Fit and evaluate each candidate model on the same train/test split
evaluate(LogisticRegression(max_iter=4000), "Logistic Regression")

evaluate(DecisionTreeClassifier(random_state=42, max_depth=5, min_samples_leaf=5), "Decision Tree")

evaluate(LogisticRegression(max_iter=4000, class_weight="balanced"), "Logistic Regression (balanced)")

evaluate(RandomForestClassifier(random_state=42, n_estimators=300, max_depth=None), "Random Forest")

evaluate(
    XGBClassifier(
        random_state=42, n_estimators=300, learning_rate=0.05, max_depth=4,
        subsample=0.8, colsample_bytree=0.8, eval_metric="logloss",
    ),
    "XGBoost (Gradient Boosting)",
)

evaluate(GradientBoostingClassifier(n_estimators=100, learning_rate=0.05), "GradientBoostingClassifier")




Gradient Boosting