# ðŸ“˜ Model 2 â€“ Decision Tree (From Scratch)

This notebook implements a **Decision Tree Classifier** from scratch **without using any inbuilt machine learning libraries** like scikit-learn.

We will:
- Load and clean the Cardiovascular dataset
- Remove unwanted columns (`Unnamed: 0`, `id`)
- Remove missing and duplicate values
- Create features **X** and target **y**
- Implement **Entropy & Information Gain**
- Build a Decision Tree recursively
- Evaluate accuracy

> **Algorithm used:** ID3 Decision Tree (Entropy based)


## ðŸ”¹ Step 1 â€” Load Dataset & Basic Cleaning

In [1]:
import pandas as pd

# Load your cleaned dataset (change filename if needed)
df = pd.read_csv('/Users/kunj/ML-DL/Cardio_ML_Project/data/cleaned_cardio_data.csv')

print('Initial shape:', df.shape)
df.head()

Initial shape: (68682, 18)


Unnamed: 0.1,Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,smoke,alco,active,cardio,age_years,BMI,cholesterol_2,cholesterol_3,gluc_2,gluc_3
0,0,0,-0.434357,Male,0.452992,-0.847822,-0.994437,-0.141911,0,0,1,0,50,21.96712,False,False,False,False
1,1,1,0.30924,Female,-1.063175,0.760996,0.799656,0.906111,0,0,1,1,55,34.927679,False,True,False,False
2,2,2,-0.24633,Female,0.07395,-0.707925,0.201625,-1.189932,0,0,0,1,51,23.507805,False,True,False,False
3,3,3,-0.746384,Male,0.579339,0.55115,1.397687,1.954132,0,0,1,1,48,28.710479,False,False,False,False
4,4,4,-0.806764,Female,-1.063175,-1.267513,-1.592468,-2.237953,0,0,0,0,47,23.011177,False,False,False,False


### ðŸ§¹ Remove unwanted, null and duplicate values

In [2]:
# Remove unique and unwanted columns (edit if your names differ)
cols_to_drop = ['Unnamed: 0', 'id']

df = df.drop(columns=[c for c in cols_to_drop if c in df.columns], errors='ignore')

# Remove missing values
df = df.dropna()

# Remove duplicate rows
df = df.drop_duplicates()

print('After cleaning:', df.shape)
df.head()

After cleaning: (68658, 16)


Unnamed: 0,age,gender,height,weight,ap_hi,ap_lo,smoke,alco,active,cardio,age_years,BMI,cholesterol_2,cholesterol_3,gluc_2,gluc_3
0,-0.434357,Male,0.452992,-0.847822,-0.994437,-0.141911,0,0,1,0,50,21.96712,False,False,False,False
1,0.30924,Female,-1.063175,0.760996,0.799656,0.906111,0,0,1,1,55,34.927679,False,True,False,False
2,-0.24633,Female,0.07395,-0.707925,0.201625,-1.189932,0,0,0,1,51,23.507805,False,True,False,False
3,-0.746384,Male,0.579339,0.55115,1.397687,1.954132,0,0,1,1,48,28.710479,False,False,False,False
4,-0.806764,Female,-1.063175,-1.267513,-1.592468,-2.237953,0,0,0,0,47,23.011177,False,False,False,False


## ðŸ”¹ Step 2 â€” Create Feature Matrix X and Target y

In [3]:
# Target variable
target = 'cardio'

# Separate X and y
X = df.drop(columns=[target])
y = df[target]

# Combine X and y back into a list of rows for custom implementation
data = df.values.tolist()

columns = list(df.columns)

print('X shape:', X.shape)
print('y shape:', y.shape)

X shape: (68658, 15)
y shape: (68658,)


## ðŸ”¹ Step 3 â€” Decision Tree Algorithm Overview
We implement the Decision Tree using:

- **Entropy** to measure impurity
- **Information Gain** to select the best split
- **Recursive Tree Building**
- Leaf nodes predict majority class

We do **NOT** use scikit-learn â€“ everything is built manually.

### âœ… Entropy Function

In [4]:
import math

def entropy(rows):
    label_count = {}
    for row in rows:
        label = row[-1]  # last column = target
        label_count[label] = label_count.get(label, 0) + 1
    
    ent = 0.0
    total = len(rows)
    for lbl in label_count:
        p = label_count[lbl] / total
        ent += -p * math.log2(p)
    return ent

# Quick test
entropy(data)

0.38829257920670057

### âœ… Function to Split Data on a Feature

In [5]:
def split_data(rows, col, value):
    true_rows, false_rows = [], []
    for row in rows:
        if row[col] >= value:
            true_rows.append(row)
        else:
            false_rows.append(row)
    return true_rows, false_rows

### âœ… Information Gain Calculation

In [6]:
def info_gain(left, right, current_entropy):
    p = float(len(left)) / (len(left) + len(right))
    return current_entropy - p * entropy(left) - (1 - p) * entropy(right)

### âœ… Find Best Split (Best Feature and Threshold)

In [7]:
def find_best_split(rows):
    best_gain = 0
    best_col = None
    best_value = None
    current_entropy = entropy(rows)
    n_features = len(rows[0]) - 1  # last column is label

    for col in range(n_features):
        values = set([row[col] for row in rows])

        for val in values:
            true_rows, false_rows = split_data(rows, col, val)
            if len(true_rows) == 0 or len(false_rows) == 0:
                continue

            gain = info_gain(true_rows, false_rows, current_entropy)

            if gain > best_gain:
                best_gain, best_col, best_value = gain, col, val

    return best_gain, best_col, best_value

### âœ… Decision Tree Node Class

In [8]:
class DecisionNode:
    def __init__(self, col=None, value=None, true_branch=None, false_branch=None, label=None):
        self.col = col
        self.value = value
        self.true_branch = true_branch
        self.false_branch = false_branch
        self.label = label

### âœ… Build Decision Tree Recursively

In [9]:
def build_tree(rows):
    gain, col, value = find_best_split(rows)

    # Stop condition â€” no information gain
    if gain == 0:
        labels = [row[-1] for row in rows]
        label = max(set(labels), key=labels.count)
        return DecisionNode(label=label)

    true_rows, false_rows = split_data(rows, col, value)

    true_branch = build_tree(true_rows)
    false_branch = build_tree(false_rows)

    return DecisionNode(col, value, true_branch, false_branch)

### âœ… Prediction Function

In [10]:
def classify(row, node):
    if node.label is not None:
        return node.label

    value = row[node.col]

    if value >= node.value:
        return classify(row, node.true_branch)
    else:
        return classify(row, node.false_branch)

## ðŸ”¹ Step 4 â€” Train Decision Tree Model

In [11]:
tree = build_tree(data)
print('Decision Tree Trained Successfully!')

Decision Tree Trained Successfully!


## ðŸ”¹ Step 5 â€” Evaluate Model Accuracy

In [12]:
def accuracy(rows, tree):
    correct = 0
    for row in rows:
        pred = classify(row, tree)
        if pred == row[-1]:
            correct += 1
    return correct / len(rows)

acc = accuracy(data, tree)
print('Model Accuracy =', acc)

Model Accuracy = 1.0


In [13]:
acc = accuracy(data, tree) * 100
print("Model Accuracy = {:.2f}%".format(acc))

Model Accuracy = 100.00%


In [None]:
#This is common because:
#	â€¢	decision trees easily overfit
#	â€¢	you evaluated on same training data (not test split)

