# 📘 Model 3: Decision Tree From Scratch **with Train-Test Split**

This notebook implements a **Decision Tree Classifier from scratch** (no scikit-learn), **including a from-scratch Train-Test Split**.

We will:
- Load and clean the cardiovascular dataset
- Drop unwanted columns (`Unnamed: 0`, `id`)
- Remove rows with null values and duplicate rows
- Split the data into **train and test sets** (from scratch)
- Implement a decision tree using **Entropy & Information Gain**
- Evaluate using **Accuracy, Confusion Matrix, Precision, Recall, F1**

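> Algorithm used: **ID3-style Decision Tree** (entropy and information gain), built with binary `>=` threshold splits instead of ID3's multiway categorical splits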


## Step 1: Load Dataset and Basic Cleaning

In [2]:
import pandas as pd

df = pd.read_csv('/Users/kunj/ML-DL/Cardio_ML_Project/data/cleaned_cardio_data.csv')

print('Initial shape:', df.shape)
df.head()

Initial shape: (68682, 18)


Unnamed: 0.1,Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,smoke,alco,active,cardio,age_years,BMI,cholesterol_2,cholesterol_3,gluc_2,gluc_3
0,0,0,-0.434357,Male,0.452992,-0.847822,-0.994437,-0.141911,0,0,1,0,50,21.96712,False,False,False,False
1,1,1,0.30924,Female,-1.063175,0.760996,0.799656,0.906111,0,0,1,1,55,34.927679,False,True,False,False
2,2,2,-0.24633,Female,0.07395,-0.707925,0.201625,-1.189932,0,0,0,1,51,23.507805,False,True,False,False
3,3,3,-0.746384,Male,0.579339,0.55115,1.397687,1.954132,0,0,1,1,48,28.710479,False,False,False,False
4,4,4,-0.806764,Female,-1.063175,-1.267513,-1.592468,-2.237953,0,0,0,0,47,23.011177,False,False,False,False


###  Remove unwanted columns, null values, and duplicates

In [3]:
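cols_to_drop = ['Unnamed: 0', 'id']

# errors='ignore' already skips names that are not present,
# so no manual filtering of cols_to_drop is needed.
df = df.drop(columns=cols_to_drop, errors='ignore')

df = df.dropna()            # remove rows with any null value
df = df.drop_duplicates()   # remove exact duplicate rows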

print('After cleaning shape:', df.shape)

After cleaning shape: (68658, 16)


## Step 2: Create Feature Matrix **X** and Target **y**

In [4]:
target = 'cardio'

X = df.drop(columns=[target])
y = df[target]

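# The tree code reads the label from row[-1], so the target must be the LAST
# column. In df, `cardio` sits in the middle and `gluc_3` is last, so build the
# rows from X with y appended instead of using df.values directly.
data = pd.concat([X, y], axis=1).values.tolist()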

print('Features shape:', X.shape)
print('Target shape:', y.shape)

Features shape: (68658, 15)
Target shape: (68658,)


## Step 3: Train-Test Split (From Scratch, No sklearn)

In [5]:
import random

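def train_test_split(data, test_ratio=0.2, seed=None):
    """Shuffle a copy of `data` and split it into (train, test) lists.

    Pass an integer `seed` to make the split reproducible; the default
    (None) gives a different shuffle on every run, as before.
    """
    shuffled = data[:]                       # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_ratio)
    test_data = shuffled[:test_size]
    train_data = shuffled[test_size:]
    return train_data, test_data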

train_data, test_data = train_test_split(data, test_ratio=0.2)

print('Train size:', len(train_data))
print('Test size:', len(test_data))

Train size: 54927
Test size: 13731
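
A random split does not guarantee that both parts keep the same class balance. As a rough sanity check, the sketch below compares label proportions before and after a split. The helper name `label_distribution` and the toy rows are hypothetical (they are not part of the notebook), and the label is assumed to be the last element of each row.

```python
import random
from collections import Counter

def label_distribution(rows):
    """Fraction of rows per label, reading the label from the last column."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return {label: count / total for label, count in counts.items()}

# Hypothetical toy data: 70 rows of class 0 and 30 rows of class 1.
toy = [[i, 0] for i in range(70)] + [[i, 1] for i in range(30)]

rng = random.Random(0)          # fixed seed so the shuffle is repeatable
shuffled = toy[:]
rng.shuffle(shuffled)
test_rows, train_rows = shuffled[:20], shuffled[20:]

print('All  :', label_distribution(toy))
print('Train:', label_distribution(train_rows))
print('Test :', label_distribution(test_rows))
```

If the train and test proportions differ noticeably from the full data, a stratified split (splitting each class separately) is one option to consider.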


## Step 4: Decision Tree Algorithm Implementation

### Entropy

In [6]:
import math

def entropy(rows):
    label_count = {}
    for row in rows:
        label = row[-1]
        label_count[label] = label_count.get(label, 0) + 1

    ent = 0.0
    total = len(rows)

    for lbl in label_count:
        p = label_count[lbl] / total
        ent += -p * math.log2(p)
    return ent
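
As a quick sanity check on the function above: entropy is 0 bits when every row has the same label and 1 bit for a 50/50 two-class mix. The sketch below repeats an equivalent `entropy` so that it runs on its own and applies it to hypothetical toy rows (not taken from the dataset).

```python
import math

def entropy(rows):
    """Shannon entropy (in bits) of the labels stored in the last column."""
    counts = {}
    for row in rows:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    total = len(rows)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical toy rows shaped [feature, label]
pure = [[1.0, 1], [2.0, 1], [3.0, 1]]
balanced = [[1.0, 0], [2.0, 1], [3.0, 0], [4.0, 1]]
skewed = [[1.0, 0], [2.0, 0], [3.0, 0], [4.0, 1]]

print(entropy(pure))      # single class: no uncertainty
print(entropy(balanced))  # 50/50 mix: the maximum for two classes
print(entropy(skewed))    # 75/25 mix: somewhere in between
```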

###  Split dataset on a feature

In [7]:
def split_data(rows, col, value):
    true_rows, false_rows = [], []
    for row in rows:
        if row[col] >= value:
            true_rows.append(row)
        else:
            false_rows.append(row)
    return true_rows, false_rows

###  Information Gain

In [8]:
def info_gain(left, right, current_entropy):
    p = float(len(left)) / (len(left) + len(right))
    return current_entropy - p * entropy(left) - (1 - p) * entropy(right)
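
To make the weighting concrete, the sketch below computes the information gain of two candidate thresholds on six hypothetical toy rows. It repeats `entropy`, `split_data`, and `info_gain` in compact form so the cell is self-contained; the toy data and the `gains` dictionary are illustrative only.

```python
import math

def entropy(rows):
    counts = {}
    for row in rows:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    total = len(rows)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def split_data(rows, col, value):
    true_rows = [r for r in rows if r[col] >= value]
    false_rows = [r for r in rows if r[col] < value]
    return true_rows, false_rows

def info_gain(left, right, current_entropy):
    p = len(left) / (len(left) + len(right))
    return current_entropy - p * entropy(left) - (1 - p) * entropy(right)

# Hypothetical toy rows shaped [x, label]: labels flip between x = 3 and x = 4.
rows = [[1, 0], [2, 0], [3, 0], [4, 1], [5, 1], [6, 1]]
parent = entropy(rows)

gains = {}
for threshold in (2, 4):
    true_rows, false_rows = split_data(rows, 0, threshold)
    gains[threshold] = info_gain(true_rows, false_rows, parent)
    print(f'x >= {threshold}: gain = {gains[threshold]:.4f}')
```

The threshold at 4 separates the classes perfectly and recovers the full parent entropy, while the threshold at 2 leaves a mixed branch and gains much less.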

###  Find Best Split

In [9]:
def find_best_split(rows):
    best_gain = 0
    best_col = None
    best_value = None
    current_entropy = entropy(rows)
    n_features = len(rows[0]) - 1

    for col in range(n_features):
        values = set([row[col] for row in rows])

        for val in values:
            true_rows, false_rows = split_data(rows, col, val)
            if len(true_rows) == 0 or len(false_rows) == 0:
                continue

            gain = info_gain(true_rows, false_rows, current_entropy)

            if gain > best_gain:
                best_gain, best_col, best_value = gain, col, val

    return best_gain, best_col, best_value
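
The search above rescans every row for each distinct value of each column. For continuous columns such as the standardized `age` or `BMI`, almost every row can contribute a distinct value, so the cost per column grows roughly quadratically with the number of rows. One common alternative, which is not part of this notebook, sorts a column once and sweeps through it while updating label counts. The sketch below is a minimal version of that idea under the same `>=` convention; `best_threshold_for_column` and `_entropy_from_counts` are hypothetical names, and tie-breaking between equally good splits may differ from `find_best_split`.

```python
import math
from collections import Counter

def _entropy_from_counts(counts, total):
    """Entropy in bits from a label -> count mapping (zero counts are skipped)."""
    return sum(-(c / total) * math.log2(c / total) for c in counts.values() if c)

def best_threshold_for_column(rows, col):
    """Return (gain, value) for the best `row[col] >= value` split of one column.

    Sorts the rows once, then moves them one at a time from the "true" side
    (>= threshold) to the "false" side (< threshold), updating label counts.
    """
    ordered = sorted(rows, key=lambda r: r[col])
    n = len(ordered)
    true_counts = Counter(r[-1] for r in ordered)   # everything starts on the true side
    false_counts = Counter()
    parent = _entropy_from_counts(true_counts, n)

    best_gain, best_value = 0.0, None
    for i in range(1, n):
        label = ordered[i - 1][-1]
        true_counts[label] -= 1
        false_counts[label] += 1
        # A threshold is only meaningful where the column value changes.
        if ordered[i][col] == ordered[i - 1][col]:
            continue
        n_false, n_true = i, n - i
        gain = (parent
                - n_true / n * _entropy_from_counts(true_counts, n_true)
                - n_false / n * _entropy_from_counts(false_counts, n_false))
        if gain > best_gain:
            best_gain, best_value = gain, ordered[i][col]
    return best_gain, best_value

# Hypothetical toy rows shaped [x, label]
toy_rows = [[1.0, 0], [2.0, 0], [3.0, 0], [4.0, 1], [5.0, 1], [6.0, 1]]
print(best_threshold_for_column(toy_rows, 0))
```

Calling this once per column and keeping the best result would give the same kind of `(gain, col, value)` triple that `find_best_split` returns.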

###  Decision Node Class

In [10]:
class DecisionNode:
    def __init__(self, col=None, value=None, true_branch=None, false_branch=None, label=None):
        self.col = col
        self.value = value
        self.true_branch = true_branch
        self.false_branch = false_branch
        self.label = label

###  Build Tree Recursively

In [11]:
def build_tree(rows, depth=0, max_depth=None):
    # Majority label of this node, used whenever we stop and create a leaf.
    labels = [row[-1] for row in rows]
    leaf = DecisionNode(label=max(set(labels), key=labels.count))

    # Check the depth limit first so we do not pay for a split search we discard.
    if max_depth is not None and depth >= max_depth:
        return leaf

    gain, col, value = find_best_split(rows)

    # No split reduces entropy (pure node or indistinguishable rows): make a leaf.
    if gain == 0:
        return leaf

    true_rows, false_rows = split_data(rows, col, value)

    true_branch = build_tree(true_rows, depth+1, max_depth)
    false_branch = build_tree(false_rows, depth+1, max_depth)

    return DecisionNode(col, value, true_branch, false_branch)

###  Prediction

In [12]:
def classify(row, node):
    if node.label is not None:
        return node.label

    value = row[node.col]

    if value >= node.value:
        return classify(row, node.true_branch)
    else:
        return classify(row, node.false_branch)
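
To see how the node structure and `classify` fit together, the sketch below builds a small tree by hand and walks rows through it. `DecisionNode` and `classify` are repeated so the cell runs on its own; `tree_depth`, `count_leaves`, `print_tree`, and the toy tree itself are hypothetical helpers for inspection, not part of the original notebook.

```python
class DecisionNode:
    """Same fields as the notebook's DecisionNode, repeated so this runs alone."""
    def __init__(self, col=None, value=None, true_branch=None, false_branch=None, label=None):
        self.col = col
        self.value = value
        self.true_branch = true_branch
        self.false_branch = false_branch
        self.label = label

def classify(row, node):
    """Walk from the root to a leaf, following the `>=` test at each node."""
    if node.label is not None:
        return node.label
    branch = node.true_branch if row[node.col] >= node.value else node.false_branch
    return classify(row, branch)

def tree_depth(node):
    """Number of decision levels below this node (a leaf has depth 0)."""
    if node.label is not None:
        return 0
    return 1 + max(tree_depth(node.true_branch), tree_depth(node.false_branch))

def count_leaves(node):
    """Number of leaf nodes in the subtree rooted at this node."""
    if node.label is not None:
        return 1
    return count_leaves(node.true_branch) + count_leaves(node.false_branch)

def print_tree(node, indent=''):
    """Print the tree as nested questions, one level of indent per depth."""
    if node.label is not None:
        print(f'{indent}Predict {node.label}')
        return
    print(f'{indent}Is column {node.col} >= {node.value}?')
    print(f'{indent}--> True:')
    print_tree(node.true_branch, indent + '    ')
    print(f'{indent}--> False:')
    print_tree(node.false_branch, indent + '    ')

# Hypothetical hand-built tree over rows shaped [smoke, BMI, label].
toy_tree = DecisionNode(
    col=1, value=30.0,
    true_branch=DecisionNode(label=1),
    false_branch=DecisionNode(
        col=0, value=1,
        true_branch=DecisionNode(label=1),
        false_branch=DecisionNode(label=0),
    ),
)

print_tree(toy_tree)
print('Depth:', tree_depth(toy_tree), 'Leaves:', count_leaves(toy_tree))
print(classify([0, 34.9, None], toy_tree))  # BMI >= 30, so the true branch is taken
print(classify([0, 22.0, None], toy_tree))  # BMI < 30 and smoke < 1, so both tests fail
```

Because leaves are detected with `label is not None`, a leaf that predicts class `0` is still treated as a leaf.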

## Step 5: Train on Training Data

In [None]:
tree = build_tree(train_data)
print('Decision Tree trained on training data.')

## Step 6: Test Accuracy

In [None]:
def accuracy(rows, tree):
    correct = 0
    for row in rows:
        pred = classify(row, tree)
        if pred == row[-1]:
            correct += 1
    return correct / len(rows)

test_acc = accuracy(test_data, tree) * 100
print('Test Accuracy = {:.2f}%'.format(test_acc))
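
An accuracy figure is easier to interpret next to a trivial baseline. The sketch below computes the accuracy of always predicting the most common training label; `majority_baseline_accuracy` and the toy rows are hypothetical and assume the label is the last element of each row.

```python
from collections import Counter

def majority_baseline_accuracy(train_rows, test_rows):
    """Accuracy of always predicting the most common training label."""
    majority_label = Counter(row[-1] for row in train_rows).most_common(1)[0][0]
    hits = sum(1 for row in test_rows if row[-1] == majority_label)
    return hits / len(test_rows)

# Hypothetical toy rows shaped [feature, label]
toy_train = [[0.1, 0], [0.4, 0], [0.7, 1], [0.9, 0]]
toy_test = [[0.2, 0], [0.8, 1], [0.5, 0], [0.6, 1]]

baseline = majority_baseline_accuracy(toy_train, toy_test)
print('Baseline accuracy = {:.2f}%'.format(baseline * 100))
```

A tree whose test accuracy is not clearly above this baseline has learned little beyond the class prior.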

## Step 7: Confusion Matrix + Precision + Recall + F1 (From Scratch)

In [None]:
def confusion_metrics(data, tree):
    TP = TN = FP = FN = 0

    for row in data:
        actual = row[-1]
        predicted = classify(row, tree)

        if actual == 1 and predicted == 1:
            TP += 1
        elif actual == 0 and predicted == 0:
            TN += 1
        elif actual == 0 and predicted == 1:
            FP += 1
        else:
            # Remaining case for binary 0/1 labels: actual == 1 and predicted == 0
            FN += 1

    return TP, TN, FP, FN

TP, TN, FP, FN = confusion_metrics(test_data, tree)

print('TP:', TP, 'TN:', TN, 'FP:', FP, 'FN:', FN)

precision = TP / (TP + FP) if (TP + FP) != 0 else 0
recall = TP / (TP + FN) if (TP + FN) != 0 else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) != 0 else 0

print('Precision = {:.4f}'.format(precision))
print('Recall = {:.4f}'.format(recall))
print('F1 Score = {:.4f}'.format(f1))
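
The three formulas above can be checked by hand on small numbers. The sketch below wraps them in a hypothetical helper, `precision_recall_f1`, and applies it to made-up counts (not results from this notebook). It also confirms the equivalent form F1 = 2TP / (2TP + FP + FN).

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion counts, returning 0.0 on empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Made-up counts for illustration only.
tp, fp, fn = 40, 10, 20
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print('Precision = {:.4f}'.format(precision))   # 40 / (40 + 10)
print('Recall = {:.4f}'.format(recall))         # 40 / (40 + 20)
print('F1 Score = {:.4f}'.format(f1))
print('F1 (count form) = {:.4f}'.format(2 * tp / (2 * tp + fp + fn)))
```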