# Titanic Survival Prediction — Custom Decision Tree and Random Forest

This project has three parts. The first two models are built **from scratch** (without using scikit-learn classifiers); the third uses scikit-learn:

1. **Part 1 — Custom Decision Tree**
2. **Part 2 — Custom Random Forest**
3. **Part 3 — SVM**

We train them on the **Titanic dataset** to predict passenger survival.  
We also analyze model accuracy and compare results.


**Import libraries**

In [None]:
import pandas as pd
from typing import Any, Dict, Optional, Tuple, Union
from sklearn.model_selection import train_test_split

### **Part 1 — Custom Decision Tree**

### Load and Prepare Titanic Dataset

In [None]:
# Load training data
data = pd.read_csv("titanic.csv")

print(data.head())
print(data.info())
print(data.isnull().sum())



   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
<c

The Titanic dataset contains passenger information such as age, class, gender, fare, and survival status. Some columns contain missing values and categorical data that need to be handled properly.

---

## **Task: Implement Data Preprocessing Functions**

Write Python functions (without using advanced preprocessing libraries like `sklearn.preprocessing`) that perform the following steps manually.

---

### *Step 1: Handle Missing Values*

Write functions to:

1. Replace missing values in the **Age** column with the **median age** of all passengers.  
2. Replace missing values in the **Embarked** column with the **most common embarkation port (mode)**.
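As a minimal sketch of this step, the snippet below imputes a toy frame. The values are made up for illustration and are not rows from the Titanic file. It assigns the filled column back to the frame rather than calling `fillna(..., inplace=True)` on a selected column.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values (not taken from the Titanic file)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
})

# Median of the observed ages (NaN is skipped by default)
median_age = toy["Age"].median()
# mode() returns a Series because ties are possible, so take the first entry
mode_embarked = toy["Embarked"].mode().iloc[0]

# Assign the filled column back to the frame
toy["Age"] = toy["Age"].fillna(median_age)
toy["Embarked"] = toy["Embarked"].fillna(mode_embarked)

print(toy)
```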

---

### *Step 2: Drop Irrelevant Columns*

Remove the following columns that are not directly useful for survival prediction:

1. **Cabin** – too many missing values  
2. **Ticket** – non-numeric and not informative  
3. **Name** – mostly unique values  
4. **PassengerId** – identifier, not a feature

---

### *Step 3: Encode Categorical Variables*

Since machine learning models require numerical data, we need to convert categorical variables into numeric format manually.

#### Encode the `Sex` column:
- `male` → `0`  
- `female` → `1`

#### Encode the `Embarked` column:
- `C` → `0`  
- `Q` → `1`  
- `S` → `2`

This manual encoding ensures that categorical features can be interpreted correctly by machine learning algorithms, which typically work only with numerical inputs.
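The mappings above can be tried on a couple of toy columns first. This is a small sketch with invented values; it uses `Series.map`, which looks each value up in the dictionary and yields `NaN` for anything that is not a key.

```python
import pandas as pd

# Toy columns with made-up values
sex = pd.Series(["male", "female", "female", "male"])
embarked = pd.Series(["S", "C", "Q", "S"])

sex_mapping = {"male": 0, "female": 1}
embarked_mapping = {"C": 0, "Q": 1, "S": 2}

# Dictionary lookup per value; values missing from the mapping become NaN
sex_encoded = sex.map(sex_mapping)
embarked_encoded = embarked.map(embarked_mapping)

print(sex_encoded.tolist())       # [0, 1, 1, 0]
print(embarked_encoded.tolist())  # [2, 0, 1, 2]
```

Note that integer codes give `Embarked` an artificial order (C < Q < S), which a threshold-based tree will treat like any other numeric ordering.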


In [None]:
# Handle missing value
def impute_missing_values(df: pd.DataFrame) -> pd.DataFrame:
    """
    Handles missing values in the Age and Embarked columns.
    - Replaces missing 'Age' values with the median of the column.
    - Replaces missing 'Embarked' values with the mode.
    """
    print("\n--- Step 1: Handle Missing Values by imputation ---")

    # Impute Age with the median, which is robust to outliers in continuous data.
    median_age = df['Age'].median()
    df['Age'] = df['Age'].fillna(median_age)  # assign back instead of inplace on a column selection
    print(f"Filled missing Age values with the median: {median_age:.2f}")

    # Impute Embarked with the mode, the natural central-tendency measure for categorical data.
    # mode() can return several values when counts tie, so .iloc[0] takes the first one.
    mode_embarked = df['Embarked'].mode().iloc[0]
    df['Embarked'] = df['Embarked'].fillna(mode_embarked)
    print(f"Filled missing Embarked values with the mode: {mode_embarked}")

    return df

# Drop unused columns
def drop_columns(df: pd.DataFrame, columns_to_drop: list) -> pd.DataFrame:
    """
    Removes columns that are considered irrelevant for survival prediction.
    """
    print("\n--- Step 2: Drop Irrelevant Columns ---")
    df.drop(columns=columns_to_drop, axis=1, inplace=True)
    print(f"Dropped columns: {columns_to_drop}")

    return df

# Encode categorical features manually
def encode_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    """
    Manually converts specified categorical columns into numerical format.
    - Sex: male -> 0, female -> 1
    - Embarked: C -> 0, Q -> 1, S -> 2
    """
    print("\n--- Step 3: Encode Categorical Variables ---")

    # Encode 'Sex' (map() does an explicit dictionary lookup per value)
    sex_mapping = {'male': 0, 'female': 1}
    df['Sex'] = df['Sex'].map(sex_mapping)
    print(f"Encoded 'Sex'column using mapping: {sex_mapping}")

    # Encode 'Embarked' (missing values must already be imputed, otherwise they stay NaN)
    embarked_mapping = {'C': 0, 'Q': 1, 'S': 2}
    df['Embarked'] = df['Embarked'].map(embarked_mapping)
    print(f"Encoded 'Embarked'column using mapping: {embarked_mapping}")

    return df


Split data into train and test

In [None]:
X = data.drop('Survived', axis=1)
y = data['Survived'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## **Task: Implement Core Functions for Decision Tree Splitting**

In this section, we will implement the key mathematical components that allow a **Decision Tree** to decide where to split data during training. We will write three main functions: `entropy`, `information_gain`, and `best_split`.

---

### **Step 1: Compute Entropy**
- The entropy function measures the impurity or uncertainty in a dataset.  
- Formula:  
  $$
  H(y) = -\sum_i p_i \log_2(p_i)
  $$
  where $p_i$ is the probability of each class label.
- Use NumPy operations to calculate probabilities.
- Add a small constant (e.g., `1e-9`) inside the log to avoid numerical errors.
- Test on simple arrays, e.g., `entropy([0, 0, 1, 1])`.
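As a quick check of the formula, here is a small standalone sketch. The helper name `entropy_bits` is hypothetical and chosen only so it does not clash with the `entropy` function required by the task.

```python
import numpy as np

def entropy_bits(labels):
    """Shannon entropy (base 2) of a 1-D collection of class labels."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    # The small constant keeps log2 away from zero
    return float(-np.sum(p * np.log2(p + 1e-9)))

balanced = entropy_bits([0, 0, 1, 1])  # two equally likely classes: close to 1 bit
pure = entropy_bits([1, 1, 1, 1])      # a single class: close to 0 bits

print(f"balanced: {balanced:.6f}")
print(f"pure:     {abs(pure):.6f}")
```

Because of the added constant, both results are only approximately 1 and 0; comparing with a tolerance is safer than testing for exact equality.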

---

### **Step 2: Compute Information Gain**
- Information Gain (IG) quantifies how much entropy decreases after a split.
- Formula:  
  $$
  IG = H(\text{parent}) - \frac{n_{left}}{n}H(\text{left}) - \frac{n_{right}}{n}H(\text{right})
  $$
- Implement this using the `entropy()` function.
- The higher the IG, the better the feature/threshold for splitting.
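To see the formula at work, the sketch below compares a perfect split with an uninformative one on four labels. The helper names (`entropy_bits`, `info_gain`) are hypothetical stand-ins for the functions the task asks for.

```python
import numpy as np

def entropy_bits(labels):
    """Shannon entropy (base 2) of a 1-D array of class labels."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return float(-np.sum(p * np.log2(p + 1e-9)))

def info_gain(parent, left, right):
    """IG = H(parent) - (n_left/n) H(left) - (n_right/n) H(right)."""
    n = len(parent)
    return (entropy_bits(parent)
            - len(left) / n * entropy_bits(left)
            - len(right) / n * entropy_bits(right))

parent = np.array([0, 0, 1, 1])

# Perfect split: children [0, 0] and [1, 1] are both pure
perfect = info_gain(parent, parent[:2], parent[2:])
# Uninformative split: children [0, 1] and [0, 1] are as mixed as the parent
useless = info_gain(parent, parent[[0, 2]], parent[[1, 3]])

print(f"perfect split gain: {perfect:.4f}")
print(f"useless split gain: {abs(useless):.4f}")
```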

---

### **Step 3: Find the Best Split**
- The `best_split()` function finds which feature and threshold produce the **maximum information gain**.
- Loop through all features (or a given subset) and:
  - For **numerical features**:
    - Try splitting at all unique threshold values.
    - Create boolean masks for left (`<= t`) and right (`> t`) splits.
  - For **categorical features**:
    - Split data based on equality (`== val`) vs. inequality (`!= val`).
- Skip invalid splits (e.g., when one side is empty).
- Return:
  - `best_feature`: the most informative feature,
  - `best_threshold`: the split value,
  - `best_gain`: the highest information gain achieved.
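The two kinds of masks described above could be produced by one small helper. This is a sketch under the assumption that numeric columns use a threshold split and every other dtype uses an equality split; `split_masks` is a hypothetical name, and the toy values are invented.

```python
import pandas as pd

def split_masks(column: pd.Series, value):
    """Return (left_mask, right_mask) for one candidate split value."""
    if pd.api.types.is_numeric_dtype(column):
        left = column <= value   # numerical feature: threshold split
    else:
        left = column == value   # categorical feature: equality split
    return left, ~left

toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Embarked": ["S", "C", "S", "S"],
})

fare_left, fare_right = split_masks(toy["Fare"], 7.92)
port_left, port_right = split_masks(toy["Embarked"], "C")

print(fare_left.tolist())  # [True, False, True, False]
print(port_left.tolist())  # [False, True, False, False]
```

Since the preprocessing step already encodes `Sex` and `Embarked` as integers, a threshold-only search is also a workable choice for this dataset.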

---

### **Expected Output**
- `entropy()` returns 0 when all samples belong to one class.
- `information_gain()` returns higher values for better splits.
- `best_split()` identifies the optimal feature and threshold for partitioning the data.

Implement the functions below step by step.


In [None]:
import numpy as np
from typing import Tuple, Optional, Union, Dict, Any, List  #list and dict are used in the next cell

# A small constant to prevent log(0)
EPSILON = 1e-9

def entropy(y):
    """
    Compute the entropy of a label array y.
    Formula: H(y) = -Σ p_i * log2(p_i)
    """
    if len(y) == 0:
        return 0.0

    # Count how often each class label occurs
    class_labels, counts = np.unique(y, return_counts=True)

    # Calculate probabilities p_i
    probabilities = counts / len(y)
    # H(y) = -sum(p_i * log2(p_i)); EPSILON guards against log2(0)
    entropy_value = -np.sum(probabilities * np.log2(probabilities + EPSILON))

    return entropy_value


def information_gain(y, y_left, y_right):
    """
    Compute the information gain of a split.
    IG = H(parent) - (n_left/n)*H(left) - (n_right/n)*H(right)
    """
    n = len(y)
    n_left = len(y_left)
    n_right = len(y_right)

    if n_left == 0 or n_right == 0:
        # Avoid non-splits
        return 0.0

    # Calculate entropy for the parent node
    parent_entropy = entropy(y)

    # Calculate weighted entropy for the child nodes
    weighted_child_entropy = (n_left / n) * entropy(y_left) + \
                             (n_right / n) * entropy(y_right)

    # Calculate Information Gain
    ig = parent_entropy - weighted_child_entropy

    return ig


def best_split(X, y, feature_subset=None):
    """
    Find the best feature and threshold that maximize information gain.

    Parameters:
        X (DataFrame): Feature dataset
        y (Series or array): Target labels
        feature_subset (list): Optional subset of features for random forest

    Returns:
        best_feature (str): Feature name with best split
        best_threshold (float/str): Threshold or category value
        best_gain (float): Information gain of best split
    """
    best_gain = -1.0
    best_feature = None
    best_threshold = None

    features = feature_subset if feature_subset is not None else X.columns

    for feature_name in features:
        feature_data = X[feature_name]

        # Sort unique values
        unique_thresholds = np.sort(feature_data.unique())

        for threshold in unique_thresholds:
            # 1. Create split masks for numerical features
            mask_left = feature_data <= threshold
            mask_right = feature_data > threshold

            # Check if split is valid (both sides must have samples)
            if np.sum(mask_left) == 0 or np.sum(mask_right) == 0:
                continue

            # 2. Split the target array y
            y_left = y[mask_left]
            y_right = y[mask_right]

            # 3. Calculate Information Gain
            current_gain = information_gain(y, y_left, y_right)

            # 4. Update best split if current gain is higher
            if current_gain > best_gain:
                best_gain = current_gain
                best_feature = feature_name
                best_threshold = threshold

    return best_feature, best_threshold, best_gain


## **Task: Implement the Custom Decision Tree Classifier**

In this task, we will implement our own **Decision Tree** algorithm from scratch, without using libraries like `sklearn.tree`.  
This implementation will rely on the previously defined helper functions: `entropy`, `information_gain`, and `best_split`.

---

### **Class Overview**

We will create a `DecisionTree` class with the following methods:

---

### **1. `__init__()`**
- Initializes the tree’s hyperparameters:
  - `max_depth`: the maximum allowed depth of the tree.  
  - `min_samples_split`: minimum number of samples required to make a split.  
  - `feature_subsample`: number of randomly selected features (used in Random Forests).  
- The tree itself will be stored as a nested dictionary (`self.tree`).

---

### **2. `fit()`**
- Recursively builds the decision tree using **Information Gain** as the splitting criterion.
- **Base cases**:
  - If there are no samples left (`len(y) == 0`) → return `0`.
  - If all samples belong to the same class → return that class label.
  - If maximum depth is reached or too few samples remain → return the most common class (majority vote).
- **Recursive case**:
  1. Select the best feature and threshold using `best_split()`.
  2. Split the dataset into left and right subsets based on that threshold.
  3. Recursively call `fit()` on each subset to build subtrees.
- The final tree should be stored in `self.tree` as nested dictionaries of the form `{'feature': ..., 'threshold': ..., 'gain': ..., 'left': ..., 'right': ...}`.


---

### **3. `predict_one()`**
- Predict the class for a **single data point**.
- Traverse the decision tree recursively:
  - For numeric thresholds: use `<=` for the left branch, `>` for the right.
  - For categorical values: use `==` for the left branch, `!=` for the right.
- Stop recursion when a leaf node (class label) is reached.
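The traversal can be illustrated on a hand-written tree in the nested-dictionary layout. The features, thresholds, and gains below are invented for the example, and `walk` is a hypothetical helper that handles numeric thresholds only.

```python
# Hand-written tree; leaves are plain class labels (all values are made up)
toy_tree = {
    "feature": "Sex", "threshold": 0, "gain": 0.2,
    "left": {                      # reached when Sex <= 0
        "feature": "Age", "threshold": 10.0, "gain": 0.05,
        "left": 1,                 # Age <= 10
        "right": 0,                # Age > 10
    },
    "right": 1,                    # reached when Sex > 0
}

def walk(node, sample):
    """Follow the <= / > branches until a non-dictionary leaf is reached."""
    while isinstance(node, dict):
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node

print(walk(toy_tree, {"Sex": 0, "Age": 35.0}))  # 0
print(walk(toy_tree, {"Sex": 1, "Age": 35.0}))  # 1
print(walk(toy_tree, {"Sex": 0, "Age": 4.0}))   # 1
```

A loop is used here instead of recursion; both visit the same path from the root to a leaf.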

---

### **4. `predict()`**
- Predict the class for **an entire dataset** (`X`).
- Apply `predict_one()` to each row using a loop or list comprehension.
- Return predictions as a NumPy array.

---

### **Expected Behavior**
- The `fit()` method should build a decision tree automatically.
- The `predict()` method should output the correct class predictions for given input samples.
- You should be able to check accuracy by comparing predicted and actual labels.

---

### **Example Usage (after implementation)**

```python
tree = DecisionTree(max_depth=4)
tree.fit(X_train, y_train)
predictions = tree.predict(X_test)
```



In [None]:
import numpy as np
from collections import Counter
from sklearn.metrics import accuracy_score

class DecisionTree:
    def __init__(self, max_depth=5, min_samples_split=2, feature_subsample=None):
        """
        Initialize the Decision Tree parameters.
        """
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.feature_subsample = feature_subsample
        self.tree: Optional[Dict[str, Any]] = None

    def _majority_vote(self, y: np.ndarray) -> int:
        """Returns the most common class label in array y."""
        if len(y) == 0:
            return 0
        # Counter().most_common(1) returns [ (label, count) ]
        return Counter(y).most_common(1)[0][0]

    def _get_feature_subset(self, all_features: pd.Index) -> List[str]:
        """Selects a random subset of features."""
        if self.feature_subsample is None or self.feature_subsample >= len(all_features):
            return all_features.tolist()

        return np.random.choice(all_features, size=self.feature_subsample, replace=False).tolist()


    def fit(self, X, y, depth=0):
        """
        Recursively build the decision tree using information gain.
        """
        # Base cases

        # Base case 1: all samples belong to the same class
        if len(np.unique(y)) == 1:
            return y[0]

        # Base case 2: maximum depth reached or too few samples to split
        if depth >= self.max_depth or len(y) < self.min_samples_split:
            return self._majority_vote(y)

        # Ensure relevant columns are numeric before finding best split
        # This is a safeguard in case preprocessing wasn't fully applied or consistent
        X_processed = X.copy()
        for col in ['Age', 'Fare', 'Pclass', 'SibSp', 'Parch', 'Sex', 'Embarked']:
             if col in X_processed.columns:
                # Attempt conversion to numeric, coercing errors to NaN and then handling NaNs if any remain
                X_processed[col] = pd.to_numeric(X_processed[col], errors='coerce')


        # Recursive Case: Find Best Split
        features = self._get_feature_subset(X_processed.columns)
        feature, threshold, gain = best_split(X_processed, y, feature_subset=features)

        # Base Case 3:
        if feature is None or gain <= EPSILON:
            return self._majority_vote(y)

        # 1. Split Data
        # Use the processed data for splitting
        mask_left = X_processed[feature] <= threshold
        mask_right = ~mask_left  # complement of the left mask (NaN rows also go right)
        X_left, y_left = X_processed[mask_left], y[mask_left]
        X_right, y_right = X_processed[mask_right], y[mask_right]


        # 2. Recursively build subtrees
        left_subtree = self.fit(X_left, y_left, depth + 1)
        right_subtree = self.fit(X_right, y_right, depth + 1)

        # 3. Build Node Dictionary
        node = {
            'feature': feature,
            'threshold': threshold,
            'gain': gain,
            'left': left_subtree,
            'right': right_subtree
        }

        # Store the tree structure only at the root call
        if depth == 0:
            self.tree = node
            return self
        else:
            return node


    def predict_one(self, x, node=None):
        """
        Predict class label for a single sample.
        """
        if node is None:
            node = self.tree
            if node is None:
                raise ValueError("Tree not fitted. Call the function fit() first.")

        if not isinstance(node, dict):  # If the node is not a dictionary, it's a leaf node (class label)
            return node

        feature = node['feature']
        threshold = node['threshold']

        value = x[feature]
        # Thresholds come from NumPy arrays, and np.int64 is not an instance of
        # Python int, so the NumPy scalar types are checked explicitly.
        is_numeric = isinstance(threshold, (int, float, np.integer, np.floating))

        if is_numeric:
            # Coerce the value to numeric; unparseable values become NaN
            value = pd.to_numeric(value, errors='coerce')

            # Missing values are routed to the right branch
            if pd.isna(value):
                return self.predict_one(x, node['right'])

            # Numerical split
            if value <= threshold:
                return self.predict_one(x, node['left'])
            return self.predict_one(x, node['right'])

        # Categorical split (equality-based)
        if value == threshold:
            return self.predict_one(x, node['left'])
        return self.predict_one(x, node['right'])


    def predict(self, X):
        """
        Predict class labels for all rows in dataset X.
        """
        if self.tree is None:
            # Check if the tree was actually fitted
            raise ValueError("Tree not fitted. Call fit() first.")

        # Apply predict_one to each row of the input DataFrame X
        predictions = np.array([self.predict_one(X.iloc[i]) for i in range(len(X))])
        return predictions

### **Train and Evaluate the Custom Decision Tree**

The following cell trains, tests, and evaluates the implemented `DecisionTree` class.


In [None]:
# Apply preprocessing steps to X_train and X_test
X_train_processed = impute_missing_values(X_train.copy())
X_train_processed = drop_columns(X_train_processed, ['Cabin', 'Ticket', 'Name', 'PassengerId'])
X_train_processed = encode_categoricals(X_train_processed)

X_test_processed = impute_missing_values(X_test.copy())
X_test_processed = drop_columns(X_test_processed, ['Cabin', 'Ticket', 'Name', 'PassengerId'])
X_test_processed = encode_categoricals(X_test_processed)

tree = DecisionTree(max_depth=4)
tree.fit(X_train_processed, y_train)
y_pred_tree = tree.predict(X_test_processed)

acc_tree = np.mean(y_pred_tree == y_test)
print(f"Custom Decision Tree Accuracy: {acc_tree:.4f}")

compare_tree = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_tree})
compare_tree.head(10)


--- Step 1: Handle Missing Values by imputation ---
Filled missing Age values with the median: 28.00
Filled missing Embarked values with the mode: S

--- Step 2: Drop Irrelevant Columns ---
Dropped columns: ['Cabin', 'Ticket', 'Name', 'PassengerId']

--- Step 3: Encode Categorical Variables ---
Encoded 'Sex'column using mapping: {'male': 0, 'female': 1}
Encoded 'Embarked'column using mapping: {'C': 0, 'Q': 1, 'S': 2}

--- Step 1: Handle Missing Values by imputation ---
Filled missing Age values with the median: 29.00
Filled missing Embarked values with the mode: S

--- Step 2: Drop Irrelevant Columns ---
Dropped columns: ['Cabin', 'Ticket', 'Name', 'PassengerId']

--- Step 3: Encode Categorical Variables ---
Encoded 'Sex'column using mapping: {'male': 0, 'female': 1}
Encoded 'Embarked'column using mapping: {'C': 0, 'Q': 1, 'S': 2}



Custom Decision Tree Accuracy: 0.4134


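|   | Actual | Predicted |
|---|--------|-----------|
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 2 | 0 | 1 |
| 3 | 1 | 1 |
| 4 | 1 | 1 |
| 5 | 1 | 1 |
| 6 | 1 | 1 |
| 7 | 0 | 1 |
| 8 | 1 | 1 |
| 9 | 1 | 1 |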


## **Part 2 — Custom Random Forest**

In this section, we will implement our own **Random Forest Classifier** from scratch, without using libraries like `sklearn.ensemble`.

A Random Forest combines multiple Decision Trees to form an ensemble model that improves prediction accuracy and reduces overfitting.

---

### **Instructions**

1. **Initialize Parameters (`__init__`)**
   - `n_estimators`: Number of decision trees to train.
   - `max_depth`: Maximum depth of each decision tree.
   - `min_samples_split`: Minimum number of samples required to split a node.
   - `feature_subsample_ratio`: Fraction of features to randomly select for each tree (e.g., 0.7 = 70% of features per tree).

   Store all trees in a list called `self.trees`.

---

2. **Train the Model (`fit`)**
   - For each tree:
     - Create a **bootstrap sample** by randomly selecting rows from the dataset **with replacement**.
     - Randomly choose a subset of features using the ratio `feature_subsample_ratio`.
     - Train a new instance of your `DecisionTree` class on the sampled data.
   - Append each trained tree to the list `self.trees`.
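To get a feel for bootstrap sampling, the sketch below draws one sample of row positions with replacement and measures how many distinct rows it contains. The seeded generator and the sample size are assumptions made for repeatability; they are not part of the task.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
n_samples = 1000

# Draw n_samples row positions with replacement
idx = rng.choice(n_samples, size=n_samples, replace=True)

# Some rows appear several times, others not at all
unique_fraction = np.unique(idx).size / n_samples
print(f"Fraction of distinct rows in the bootstrap sample: {unique_fraction:.3f}")
```

On average roughly 63% of the original rows (1 - 1/e) appear in a bootstrap sample of the same size, which is why each tree ends up seeing a different view of the data.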

---

3. **Make Predictions (`predict`)**
   - Each trained decision tree makes predictions for all samples in `X`.
   - Combine predictions from all trees using **majority voting** (most common class).
   - Return the final predicted labels as a NumPy array.
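Majority voting can be sketched on a small matrix of made-up predictions, where each row holds one tree's output and each column is one sample.

```python
from collections import Counter

import numpy as np

# Rows are trees, columns are samples (made-up predictions)
votes = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
])

# For each sample (column), keep the label that most trees predicted
final = np.array([
    Counter(votes[:, j]).most_common(1)[0][0]
    for j in range(votes.shape[1])
])

print(final)  # [1 1 1 0]
```

With an even number of trees a tie between two classes is possible, so an odd `n_estimators` is a common choice for binary problems.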

---

### **Expected Behavior**
- The Random Forest should generally achieve **equal or better accuracy** than your single Decision Tree.
- The `fit()` and `predict()` methods should work seamlessly with your previously defined `DecisionTree` implementation.

---

### **Goal**
By completing this part, we will understand how ensemble learning improves model robustness through randomness and aggregation.


Implement Custom Random Forest Class




In [None]:
import numpy as np
from collections import Counter

class RandomForest:
    def __init__(self, n_estimators=10, max_depth=5, min_samples_split=2, feature_subsample_ratio=0.7):
        """
        Initialize the Random Forest parameters.
        """
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.feature_subsample_ratio = feature_subsample_ratio
        self.trees: List[DecisionTree] = []


    def fit(self, X, y):
        """
        Train multiple Decision Trees using bootstrapped samples and random feature subsets.
        """
        n_samples = len(X)
        n_features = len(X.columns)

        # Number of features each tree may consider when splitting (at least one)
        feature_subsample_count = max(1, int(n_features * self.feature_subsample_ratio))

        self.trees = []

        for _ in range(self.n_estimators):
            # 1. Create a bootstrap sample (sampling rows with replacement)
            bootstrap_indices = np.random.choice(n_samples, size=n_samples, replace=True)
            # Use iloc and .copy() for reliable sampling and type conversion prevention
            X_sample = X.iloc[bootstrap_indices].copy()
            y_sample = y[bootstrap_indices]

            # 2. Train a new DecisionTree instance
            tree = DecisionTree(
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                feature_subsample=feature_subsample_count
            )

            tree.fit(X_sample, y_sample)

            # 3. Append the trained tree
            self.trees.append(tree)


    def predict(self, X):
        """
        Predict class labels for all samples using majority voting from all trained trees.
        """
        if not self.trees:
            raise ValueError("Random Forest not fitted. Call fit() first.")

        # Get predictions from all individual trees (Shape: n_estimators x n_samples)
        predictions_all = np.array([tree.predict(X) for tree in self.trees])

        # Combine predictions using majority voting across all trees for each sample
        final_predictions = np.array([
            Counter(predictions_all[:, i]).most_common(1)[0][0]
            for i in range(predictions_all.shape[1])
        ])

        return final_predictions



### **Train and Evaluate the Custom Random Forest**

Do **not** change the following cell.  
It will be used to train, test, and evaluate your implemented `RandomForest` class automatically.


In [None]:
forest = RandomForest(n_estimators=10, max_depth=5)
forest.fit(X_train_processed, y_train)
y_pred_forest = forest.predict(X_test_processed)

acc_forest = np.mean(y_pred_forest == y_test)
print(f"Custom Random Forest Accuracy: {acc_forest:.4f}")

compare_forest = pd.DataFrame({
    'Actual': y_test,
    'Tree_Pred': y_pred_tree,
    'Forest_Pred': y_pred_forest
})
compare_forest.head(7)


Custom Random Forest Accuracy: 0.7318


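|   | Actual | Tree_Pred | Forest_Pred |
|---|--------|-----------|-------------|
| 0 | 1 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 1 | 1 | 1 |
| 4 | 1 | 1 | 1 |
| 5 | 1 | 1 | 1 |
| 6 | 1 | 1 | 1 |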


### **Comparison**

The following cell compares the **Custom Decision Tree** and **Custom Random Forest** implementations by printing their respective accuracies.


In [None]:
print(f"Decision Tree Accuracy: {acc_tree:.4f}")
print(f"Random Forest Accuracy: {acc_forest:.4f}")


Decision Tree Accuracy: 0.4134
Random Forest Accuracy: 0.7318


**Analytical Question:**  
Based on the accuracy results of the Decision Tree and Random Forest models, explain **why the Random Forest might perform better or worse** than a single Decision Tree.  
Discuss the roles of **ensemble learning**, **variance reduction**, and **overfitting** in your explanation.


**Answer:**
In this run, the Random Forest (accuracy 0.7318) performed much better than the single Decision Tree (0.4134). The improvement comes from ensemble learning and the variance reduction it provides.

A Random Forest combines the predictions of many Decision Trees, each trained on a different bootstrap sample of the data and on random subsets of the features.
This is ensemble learning: each tree may be imperfect, but their combined majority vote tends to be more accurate than any single tree.

A single Decision Tree is a high-variance model: a small change in the training data can lead to a very different tree.
Because the forest aggregates many trees, each trained on a random sample, it reduces variance without significantly increasing bias.

A single Decision Tree also tends to overfit, because a deep tree memorizes patterns specific to the training set.
Random Forest limits overfitting in two ways: bootstrap sampling (each tree sees a different resample of the data) and random feature sampling (each tree considers a random subset of features at each split).
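The benefit of voting can be made concrete with a short calculation. The sketch assumes that every voter is correct with the same probability `p` and that voters err independently; real trees in a forest are correlated, so this is an idealized upper bound on the effect, not a description of the models above. `majority_accuracy` is a hypothetical helper.

```python
from math import comb

def majority_accuracy(n_voters: int, p: float) -> float:
    """Probability that more than half of n independent voters are correct."""
    need = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p**k * (1 - p) ** (n_voters - k)
        for k in range(need, n_voters + 1)
    )

# Each voter alone is right 60% of the time
for n in (1, 5, 11, 51):
    print(f"{n:>2} voters: {majority_accuracy(n, 0.6):.3f}")
```

Under these assumptions the majority becomes more reliable as voters are added, which is the intuition behind bootstrap sampling and feature subsampling: both make the trees less correlated, so their errors cancel more often.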



---
## **Part 3 — Support Vector Machines (SVM) with Scikit-Learn**

In this final part, we will explore another powerful supervised learning algorithm: the **Support Vector Machine (SVM)**. Unlike Decision Trees, which make splits by partitioning data, an SVM's goal is to find the optimal hyperplane or decision boundary that best separates the classes.

We will use `scikit-learn`'s implementation to train an SVM on the same preprocessed Titanic data and compare its performance to the from-scratch models.

### **Task: Train and Evaluate SVM Classifiers**

You will use `scikit-learn`'s `SVC` (Support Vector Classifier). We will experiment with two key hyperparameters: the `kernel` and the regularization parameter `C`.

#### **Understanding SVM Kernels**
The **kernel** allows the SVM to find complex, non-linear decision boundaries. We will use two types:
* **`linear` kernel:** This attempts to separate the data with a single straight line (or a flat plane in higher dimensions). It's fast and works well if the data is linearly separable. Think of it as using a ruler to divide two groups of points.
* **`rbf` kernel (Radial Basis Function):** This is the default and more flexible kernel. It can create complex, curved decision boundaries by mapping the data to a higher-dimensional space. Think of it as using a flexible wire to enclose different groups of points.

#### **Understanding the Regularization Parameter `C`**
The **`C` parameter** controls the trade-off between keeping the margin wide and classifying every training point correctly; in effect, it trades training error against generalization to unseen data.
* A **low `C`** makes the decision boundary smoother and the margin wider. The model is more tolerant of misclassifying a few training points, which can lead to better **generalization** (less overfitting).
* A **high `C`** aims to classify every training example correctly, which can lead to a more complex boundary and a narrower margin. This might **overfit** the training data.
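One way to observe this trade-off is to compare training and test accuracy for several `C` values. The sketch below uses synthetic data from `make_classification` as a stand-in, so the numbers it prints say nothing about the Titanic results; the sample size and feature count are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (not the Titanic features)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

results = {}
for C_value in (0.1, 1, 10):
    model = SVC(kernel="rbf", C=C_value)
    model.fit(X_tr, y_tr)
    # score() returns mean accuracy on the given data
    results[C_value] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for C_value, (train_acc, test_acc) in results.items():
    print(f"C={C_value}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A growing gap between training and test accuracy as `C` increases would point to overfitting, while low accuracy on both would point to underfitting.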

**Instructions:**
1.  Use the same preprocessed data from Part 1 (`X_train_processed`, `y_train`, `X_test_processed`, `y_test`).
2.  Train one `SVC` model with a `linear` kernel as a baseline.
3.  Train three separate `SVC` models with an `rbf` kernel, using `C` values of `0.1`, `1`, and `10`.
4.  For each of the four models, calculate and print its accuracy on the test set.

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Train and evaluate an SVM with a linear kernel
svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train_processed, y_train)
y_pred_linear = svm_linear.predict(X_test_processed)
acc_linear = accuracy_score(y_test, y_pred_linear)
print(f"SVM (Linear Kernel) Accuracy: {acc_linear:.4f}")


# 2. Train and evaluate SVMs with an RBF kernel and different C values
rbf_accuracies = {}  # <-- Define this dictionary
for C_value in [0.1, 1, 10]:
    svm_rbf = SVC(kernel='rbf', C=C_value, random_state=42)
    svm_rbf.fit(X_train_processed, y_train)
    y_pred_rbf = svm_rbf.predict(X_test_processed)
    acc_rbf = accuracy_score(y_test, y_pred_rbf)
    rbf_accuracies[C_value] = acc_rbf  # <-- Store the accuracy in the dictionary
    print(f"SVM (RBF Kernel, C={C_value}) Accuracy: {acc_rbf:.4f}")

SVM (Linear Kernel) Accuracy: 0.7821
SVM (RBF Kernel, C=0.1) Accuracy: 0.6536
SVM (RBF Kernel, C=1) Accuracy: 0.6592
SVM (RBF Kernel, C=10) Accuracy: 0.7095


### **Final Comparison and Analysis**

Now, let's compare the performance of all the models you've worked with.

In [None]:
#Replace the accuracy variables with your defined variables for SVM accuracies

print(f"Custom Decision Tree Accuracy: {acc_tree:.4f}")
print(f"Custom Random Forest Accuracy: {acc_forest:.4f}")
print(f"Scikit-learn Linear SVM Accuracy: {acc_linear:.4f}")
for c, acc in rbf_accuracies.items():
    print(f"Scikit-learn RBF SVM Accuracy (C={c}): {acc:.4f}")

Custom Decision Tree Accuracy: 0.4134
Custom Random Forest Accuracy: 0.7318
Scikit-learn Linear SVM Accuracy: 0.7821
Scikit-learn RBF SVM Accuracy (C=0.1): 0.6536
Scikit-learn RBF SVM Accuracy (C=1): 0.6592
Scikit-learn RBF SVM Accuracy (C=10): 0.7095


**Analytical Question:**
1.  Looking at the RBF SVM results, how did changing the `C` parameter affect the model's accuracy? Based on the explanation above, what does your result suggest about the trade-off between a wide margin and fitting the training data for this problem?
2.  Compared to your Random Forest, what is one advantage and one disadvantage of using an SVM? (Consider factors like interpretability, training time, and prediction performance).

**Answer:**
1. As `C` increases, test accuracy rises: 0.6536 (`C=0.1`) → 0.6592 (`C=1`) → 0.7095 (`C=10`).
A low `C` (0.1) makes the SVM prioritize a wider margin and tolerate some misclassifications. This guards against overfitting but can underfit, which lowers accuracy here.
A high `C` (10) makes the SVM penalize misclassified training points more heavily (narrower margin), so it fits the data more closely. On this dataset that reduces bias and improves accuracy, which suggests the low-`C` models were underfitting.

2. Advantage: with the RBF kernel, an SVM can form smooth, non-linear decision boundaries, and the linear SVM reached the highest test accuracy in this notebook (0.7821, against 0.7318 for the Random Forest).
Disadvantage: an SVM is harder to interpret than tree-based models, whose splits can be read as explicit rules, and kernel SVM training can become slow on large datasets. The Random Forest also outperformed every RBF SVM tried here (best 0.7095).