# 01 — Modeling: PlayTennis 🌤️🎾

<p align="left">
  <img alt="ID3 Algorithm" src="https://img.shields.io/badge/ID3-Decision%20Tree-0A81D1">
  <img alt="Status" src="https://img.shields.io/badge/Notebook-Modeling-1e90ff">
</p>

> <strong>Purpose</strong>: Implement entropy & information gain, build the ID3 decision tree from scratch, validate on the PlayTennis dataset, and visualize results.  
> <strong>Author</strong>: <span style="color:#FF6B6B"><b>Noëlla Buti</b></span>

---

### 🛠️ Workflow
1. 📥 Load dataset (`playtennis.csv`)  
2. 🔢 Compute entropy & information gain  
3. 🌳 Build ID3 decision tree (from scratch)  
4. 👀 Pretty-print the tree & visualize with Graphviz  
5. ✅ Evaluate predictions & accuracy on training data  

<details>
  <summary><b>📁 Artifacts (click to expand)</b></summary>

- Tree visualization (Drive):  
  <code>/content/drive/MyDrive/id3-census-income/reports/assets/playtennis_tree.png</code>  
- Notebook path (Drive):  
  <code>/content/drive/MyDrive/id3-census-income/notebooks/01_playtennis.ipynb</code>  
</details>

---

### 🚦 Results Snapshot
- **Entropy(PlayTennis)** ≈ 0.94  
- **Best split**: Outlook (IG ≈ 0.247)  
- **Tree structure**:
```
[Outlook]
├─ Overcast → Yes
├─ Rain → [Wind: Strong → No, Weak → Yes]
└─ Sunny → [Humidity: High → No, Normal → Yes]
```

- **Accuracy (train)** = 100%  

> 💡 **Tip:** Classic ID3 splits only on categorical features, so continuous features must be binned into categories before training.
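
As a rough illustration of the binning mentioned in the tip, the sketch below uses `pd.cut` to turn numeric temperatures into the same `Cool` / `Mild` / `Hot` labels that PlayTennis uses. The readings and cut points are hypothetical and chosen only for the example; the PlayTennis CSV is already categorical and needs no such step.

```python
import pandas as pd

# Hypothetical continuous readings (not part of the PlayTennis data).
temps_f = pd.Series([64, 68, 72, 75, 81, 85])

# Assumed, illustrative cut points:
# (-inf, 69] -> Cool, (69, 79] -> Mild, (79, inf) -> Hot
binned = pd.cut(
    temps_f,
    bins=[float("-inf"), 69, 79, float("inf")],
    labels=["Cool", "Mild", "Hot"],
)

print(binned.tolist())  # ['Cool', 'Cool', 'Mild', 'Mild', 'Hot', 'Hot']
```

The choice of cut points changes which splits ID3 can find, so in practice they are worth treating as a modeling decision rather than a formatting step.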

In [1]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/

import os
OUT_DIR = "id3-census-income/reports/assets"
os.makedirs(OUT_DIR, exist_ok=True)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive


## 1. Setup and Data

In [2]:
import pandas as pd
import numpy as np

pt = pd.read_csv("data/raw/playtennis.csv")
pt = pt.drop(columns=["Day"], errors="ignore")
pt.head()

Unnamed: 0,Outlook,Temperature,Humidity,Wind,PlayTennis
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
3,Rain,Mild,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes


## 2. Entropy and Information Gain

In [3]:
from collections import Counter
import math

def entropy(series: pd.Series) -> float:
    counts = Counter(series)
    n = sum(counts.values())
    if n == 0: return 0.0
    return -sum((c/n) * math.log2(c/n) for c in counts.values() if c)

def information_gain(df: pd.DataFrame, feature: str, target: str) -> float:
    H = entropy(df[target])
    weights = df[feature].value_counts(normalize=True)
    cond = sum(w * entropy(df[df[feature]==v][target]) for v, w in weights.items())
    return H - cond

## 3. Check Information Gain per Feature

In [4]:
target = "PlayTennis"
features = [c for c in pt.columns if c != target]

print("Entropy of target:", round(entropy(pt[target]), 3))
for f in features:
    ig = information_gain(pt, f, target)
    print(f"Information Gain({f}) = {ig:.3f}")

Entropy of target: 0.94
Information Gain(Outlook) = 0.247
Information Gain(Temperature) = 0.029
Information Gain(Humidity) = 0.152
Information Gain(Wind) = 0.048
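
The two headline numbers above can be checked by hand from class counts alone. The sketch below assumes the CSV is the standard 14-row PlayTennis table (9 `Yes`, 5 `No`; Sunny 2/3, Overcast 4/0, Rain 3/2); those counts are an assumption, since only the first five rows are displayed in this notebook. The helper `H` is a hypothetical name used just for this check.

```python
import math

def H(*counts):
    """Shannon entropy in bits of a label distribution given as raw counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

# Assumed (Yes, No) counts from the standard PlayTennis table.
total = (9, 5)
by_outlook = {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)}

h_target = H(*total)
# Conditional entropy: each subset's entropy weighted by its share of the rows.
h_cond = sum(sum(c) / sum(total) * H(*c) for c in by_outlook.values())
ig_outlook = h_target - h_cond

print(f"{h_target:.3f} {ig_outlook:.3f}")  # 0.940 0.247
```

The Overcast subset is pure (4 `Yes`, 0 `No`), so it contributes zero conditional entropy, which is a large part of why Outlook wins the first split.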


## 4. Minimal ID3 (Categorical, no pruning)

In [5]:
def majority_label(series: pd.Series):
    return series.mode().iloc[0]

def best_feature_by_ig(df: pd.DataFrame, target: str) -> str:
    feats = [c for c in df.columns if c != target]
    gains = {f: information_gain(df, f, target) for f in feats}
    return max(gains, key=gains.get)

def build_id3(df: pd.DataFrame, target: str, max_depth=None, depth=0):
    if len(df[target].unique()) == 1:
        return df[target].iloc[0]
    if max_depth is not None and depth >= max_depth:
        return majority_label(df[target])
    feats_left = [c for c in df.columns if c != target]
    if not feats_left:
        return majority_label(df[target])

    f = best_feature_by_ig(df, target)
    node = {f: {}}
    for v in sorted(df[f].dropna().unique()):
        sub = df[df[f] == v]
        node[f][v] = build_id3(sub.drop(columns=[f]), target, max_depth, depth+1) if not sub.empty else majority_label(df[target])
    return node

# Build the tree
tree = build_id3(pt.copy(), target)
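
`build_id3` returns a nested dict of the form `{feature: {value: subtree_or_label}}`, with plain labels as leaves. As a small sketch of how that structure can be inspected, the literal below is transcribed from the tree shown in this notebook, and the two helpers (`tree_depth`, `count_leaves`) are hypothetical names introduced only for this example.

```python
# Tree literal transcribed from the structure printed in this notebook.
example_tree = {
    "Outlook": {
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    }
}

def tree_depth(node):
    """Number of feature tests on the longest root-to-leaf path."""
    if not isinstance(node, dict):
        return 0
    (_, branches), = node.items()
    return 1 + max(tree_depth(sub) for sub in branches.values())

def count_leaves(node):
    """Number of label leaves in the tree."""
    if not isinstance(node, dict):
        return 1
    (_, branches), = node.items()
    return sum(count_leaves(sub) for sub in branches.values())

print(tree_depth(example_tree), count_leaves(example_tree))  # 2 5
```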

## 5. Pretty-print Tree

In [6]:
def print_tree(tree, indent=""):
    if not isinstance(tree, dict):
        print(indent + f"→ {tree}"); return
    (feat, branches), = tree.items()
    print(indent + f"[{feat}]")
    for val, subtree in branches.items():
        print(indent + f" ├─ {val}")
        print_tree(subtree, indent + " │   ")

print_tree(tree)

[Outlook]
 ├─ Overcast
 │   → Yes
 ├─ Rain
 │   [Wind]
 │    ├─ Strong
 │    │   → No
 │    ├─ Weak
 │    │   → Yes
 ├─ Sunny
 │   [Humidity]
 │    ├─ High
 │    │   → No
 │    ├─ Normal
 │    │   → Yes


## 6. Predict Function + Quick Sanity Check

In [7]:
def _collect_leaves(node):
    leaves = []
    stack = [node]
    while stack:
        cur = stack.pop()
        if isinstance(cur, dict):
            (feat, branches), = cur.items()
            stack.extend(branches.values())
        else:
            leaves.append(cur)
    return leaves

def predict_one(tree, row: dict):
    node = tree
    while isinstance(node, dict):
        (feat, branches), = node.items()
        val = row.get(feat)
        if val in branches:
            node = branches[val]
        else:
            # Unseen feature value: fall back to the majority label among this subtree's leaves.
            leaves = _collect_leaves(node)
            return pd.Series(leaves).mode().iloc[0]
    return node

def predict(tree, X: pd.DataFrame):
    return [predict_one(tree, r.to_dict()) for _, r in X.iterrows()]

y_true = pt[target].tolist()
y_pred = predict(tree, pt.drop(columns=[target]))
acc = (pd.Series(y_true) == pd.Series(y_pred)).mean()
print("Training accuracy:", acc)

Training accuracy: 1.0
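
Training accuracy says nothing about unseen days, so it can help to trace a single new row through the tree. The sketch below is self-contained and deliberately simplified: `classify` is a hypothetical helper that returns a caller-supplied default for an unseen feature value, whereas the notebook's `predict_one` falls back to a majority label. The example rows are made up for illustration.

```python
# Tree literal transcribed from the structure printed in this notebook.
example_tree = {
    "Outlook": {
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    }
}

def classify(node, row, default=None):
    """Follow branches until a leaf; return `default` if a value has no branch."""
    while isinstance(node, dict):
        (feat, branches), = node.items()
        value = row.get(feat)
        if value not in branches:
            return default
        node = branches[value]
    return node

# Hypothetical day that is not guaranteed to appear in the training data.
new_day = {"Outlook": "Sunny", "Temperature": "Cool",
           "Humidity": "Normal", "Wind": "Strong"}

print(classify(example_tree, new_day))              # Yes
print(classify(example_tree, {"Outlook": "Fog"}))   # None
```

Note that `Temperature` and `Wind` are never consulted for a Sunny day: only the features on the path from root to leaf affect the prediction.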


## 7. Graphviz export (PNG/PDF)

In [8]:
from graphviz import Digraph

def draw_dict_tree(tree):
    dot = Digraph()
    counter = [0]  # shared counter so every node gets a unique Graphviz id

    def _add(node, parent=None, edge_label=None):
        # A unique id per node keeps a feature that appears in several
        # subtrees from collapsing into a single shared graph node.
        nid = f"n{counter[0]}"; counter[0] += 1
        if isinstance(node, dict):
            (feat, branches), = node.items()
            dot.node(nid, feat, shape="ellipse")
        else:
            branches = {}
            dot.node(nid, str(node), shape="box")
        if parent is not None:
            dot.edge(parent, nid, label=str(edge_label))
        for val, sub in branches.items():
            _add(sub, parent=nid, edge_label=val)

    _add(tree)
    return dot

dot = draw_dict_tree(tree)
dot.format = "png"
dot.render(f"{OUT_DIR}/playtennis_tree", cleanup=True)
print(f"Saved tree visualization to {OUT_DIR}/playtennis_tree.png")

Saved tree visualization to id3-census-income/reports/assets/playtennis_tree.png
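
When the `graphviz` Python package or system binaries are unavailable, the same picture can be described as plain DOT text and rendered elsewhere. The sketch below is one possible way to do that in pure Python; `to_dot` is a hypothetical helper, and the tree literal is transcribed from the structure printed in this notebook.

```python
def to_dot(tree):
    """Return DOT source for a nested-dict tree, giving every node a unique id."""
    lines = ["digraph Tree {"]
    counter = 0

    def add(node, parent=None, edge_label=None):
        nonlocal counter
        nid = f"n{counter}"
        counter += 1
        if isinstance(node, dict):
            (feat, branches), = node.items()
            lines.append(f'  {nid} [label="{feat}", shape=ellipse];')
        else:
            branches = {}
            lines.append(f'  {nid} [label="{node}", shape=box];')
        if parent is not None:
            lines.append(f'  {parent} -> {nid} [label="{edge_label}"];')
        for val, sub in branches.items():
            add(sub, nid, val)

    add(tree)
    lines.append("}")
    return "\n".join(lines)

example_tree = {
    "Outlook": {
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    }
}

dot_source = to_dot(example_tree)
print(dot_source)
```

The resulting text can be saved to a `.gv` file and rendered with the Graphviz command-line tool, for example `dot -Tpng tree.gv -o tree.png`. Labels are not escaped here, so feature values containing double quotes would need extra handling.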
