# **SML Project: Binary Tree Predictors**

*   **Author:** Matteo Onger
*   **Date:** October 2024

**Dataset documentation**:
*   [Secondary Mushroom](https://archive.ics.uci.edu/dataset/848/secondary+mushroom+dataset)

## VM Setup

In [None]:
# install dataset package
!pip install ucimlrepo

# download repository
!git clone -b dev https://github.com/MatteoOnger/SML_Project.git

# set working directory
%cd /content/SML_Project/

## Code

In [2]:
# ---- LIBRARIES ----
import logging

from ucimlrepo import fetch_ucirepo

from bintreepredictor import BinTreePredictor
from data import DataSet

In [3]:
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", force=True)

In [19]:
# fetch the Secondary Mushroom dataset (UCI id 848) and keep the full table
mushroom = fetch_ucirepo(id=848)
mushroom_df = mushroom.data.original

In [5]:
# optional: drop every column that contains at least one missing value
#mushroom_df = mushroom_df.drop(columns=mushroom_df.columns[mushroom_df.isna().any()])

In [21]:
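# hold-out split: 80% of the rows for training, the remaining 20% for testing
train_df = mushroom_df.sample(frac=0.8, random_state=0)
test_df = mushroom_df.drop(train_df.index)

# wrap both splits, passing "class" as the label column
train_ds = DataSet(train_df, "class")
test_ds = DataSet(test_df, "class")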

mushroom_df.head()

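| | class | cap-diameter | cap-shape | cap-surface | cap-color | does-bruise-or-bleed | gill-attachment | gill-spacing | gill-color | stem-height | ... | stem-root | stem-surface | stem-color | veil-type | veil-color | has-ring | ring-type | spore-print-color | habitat | season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | 15.26 | x | g | o | f | e | | w | 16.95 | ... | s | y | w | u | w | t | g | | d | w |
| 1 | p | 16.6 | x | g | o | f | e | | w | 17.99 | ... | s | y | w | u | w | t | g | | d | u |
| 2 | p | 14.07 | x | g | o | f | e | | w | 17.8 | ... | s | y | w | u | w | t | g | | d | w |
| 3 | p | 14.17 | f | h | e | f | e | | w | 15.77 | ... | s | y | w | u | w | t | p | | d | w |
| 4 | p | 14.64 | x | h | o | f | e | | w | 16.53 | ... | s | y | w | u | w | t | p | | d | w |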
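
The 80/20 hold-out split used in the cell above can be sketched without pandas. This is only an illustration of the idea (shuffle the row indices with a fixed seed, take the first 80% for training, keep the rest for testing); `holdout_split` is a hypothetical helper, not part of the project, and it will not select the same rows as `DataFrame.sample(frac=0.8, random_state=0)`.

```python
import random


def holdout_split(n_rows, train_frac=0.8, seed=0):
    """Return disjoint (train, test) lists of row indices (hypothetical helper)."""
    rng = random.Random(seed)          # fixed seed makes the split reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = round(n_rows * train_frac)
    # Everything not drawn for training ends up in the test set.
    return sorted(indices[:n_train]), sorted(indices[n_train:])


train_idx, test_idx = holdout_split(10)
print(len(train_idx), len(test_idx))
```

Because the test rows are exactly the ones left out of the training sample, the two sets never overlap, which is what makes the test accuracy an honest estimate.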


In [23]:
tree = BinTreePredictor("zero-one", "mode", "entropy", "max_nodes", 500, max_thresholds=5)
# fit() returns the training error, so the accuracy is its complement
train_err = tree.fit(train_ds)

print(f"accuracy:{1 - train_err}")

2024-09-19 13:07:29,374 - INFO - bintreepredictor - BinTreePredictor_id:0 - split:(leaf:0, feat:ring-type, threshold:z) - info_gain:0.0297
2024-09-19 13:07:33,016 - INFO - bintreepredictor - BinTreePredictor_id:0 - split:(leaf:2, feat:stem-color, threshold:w) - info_gain:0.0295
2024-09-19 13:07:37,731 - INFO - bintreepredictor - BinTreePredictor_id:0 - split:(leaf:6, feat:gill-attachment, threshold:p) - info_gain:0.0408
2024-09-19 13:07:41,762 - INFO - bintreepredictor - BinTreePredictor_id:0 - split:(leaf:13, feat:stem-root, threshold:c) - info_gain:0.2401
2024-09-19 13:07:45,733 - INFO - bintreepredictor - BinTreePredictor_id:0 - split:(leaf:28, feat:cap-shape, threshold:f) - info_gain:0.1828
2024-09-19 13:07:50,517 - INFO - bintreepredictor - BinTreePredictor_id:0 - split:(leaf:58, feat:cap-surface, threshold:e) - info_gain:0.2528
2024-09-19 13:07:54,413 - INFO - bintreepredictor - BinTreePredictor_id:0 - split:(leaf:118, feat:cap-shape, threshold:s) - info_gain:0.1556
2024-09-19 13

accuracy:1.0
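
Each log line above reports the `info_gain` of the chosen split. As a rough sketch of what such a number could mean under the `"entropy"` criterion, the snippet below computes the entropy reduction of a binary split on a categorical feature. It assumes the split tests `value == threshold`, which the log format suggests but does not confirm; `entropy` and `info_gain` are hypothetical helpers, not the internals of `BinTreePredictor`, and the toy data is made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(values, labels, threshold):
    """Entropy reduction from splitting on `value == threshold` (assumed test)."""
    left = [y for x, y in zip(values, labels) if x == threshold]
    right = [y for x, y in zip(values, labels) if x != threshold]
    total = len(labels)
    # Weighted average of the child entropies after the split.
    children = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(labels) - children


# Toy data (invented for illustration): one categorical feature, binary class.
ring_type = ["z", "z", "f", "f", "e", "e"]
classes = ["p", "p", "e", "e", "e", "p"]
gain = info_gain(ring_type, classes, "z")
print(f"info_gain:{gain:.4f}")
```

A gain of zero means the split leaves the class mix unchanged in both children, while a gain equal to the parent entropy means both children are pure.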


In [25]:
# predict() returns the predictions together with the test error
pred, test_err = tree.predict(test_ds)

print(f"accuracy:{1 - test_err}")

accuracy:0.9995087604388406
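
The accuracy printed above is the complement of the error returned by the predictor. Assuming that error is the average zero-one loss (as the `"zero-one"` argument suggests), the relationship can be sketched as follows; `zero_one_loss` is a hypothetical helper and the labels are invented.

```python
def zero_one_loss(y_true, y_pred):
    """Fraction of predictions that differ from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mistakes = sum(t != p for t, p in zip(y_true, y_pred))
    return mistakes / len(y_true)


# Toy labels (invented): one of the five predictions is wrong.
y_true = ["p", "e", "e", "p", "e"]
y_pred = ["p", "e", "p", "p", "e"]
err = zero_one_loss(y_true, y_pred)
print(f"accuracy:{1 - err}")
```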
