# **SML Project: Binary Tree Predictors**

*   **Author:** Matteo Onger
*   **Date:** October 2024

**Dataset documentation**:
*   [Secondary Mushroom](https://archive.ics.uci.edu/dataset/848/secondary+mushroom+dataset)

## VM Setup

In [1]:
# install dataset package
!pip install ucimlrepo

# download repository
!git clone -b dev https://github.com/MatteoOnger/SML_Project.git

# set working directory
%cd /content/SML_Project/

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7
Cloning into 'SML_Project'...
remote: Enumerating objects: 136, done.
remote: Counting objects: 100% (136/136), done.
remote: Compressing objects: 100% (98/98), done.
remote: Total 136 (delta 80), reused 79 (delta 37), pack-reused 0 (from 0)
Receiving objects: 100% (136/136), 32.24 KiB | 4.03 MiB/s, done.
Resolving deltas: 100% (80/80), done.
/content/SML_Project


## Code

In [2]:
# ---- LIBRARIES ----
import logging

from ucimlrepo import fetch_ucirepo

from bintreepredictor import BinTreePredictor
from data import DataSet

In [3]:
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", force=True)

In [28]:
mushroom_df = fetch_ucirepo(id=848).data.original
mushroom_df.head()

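|  | class | cap-diameter | cap-shape | cap-surface | cap-color | does-bruise-or-bleed | gill-attachment | gill-spacing | gill-color | stem-height | ... | stem-root | stem-surface | stem-color | veil-type | veil-color | has-ring | ring-type | spore-print-color | habitat | season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | 15.26 | x | g | o | f | e |  | w | 16.95 | ... | s | y | w | u | w | t | g |  | d | w |
| 1 | p | 16.6 | x | g | o | f | e |  | w | 17.99 | ... | s | y | w | u | w | t | g |  | d | u |
| 2 | p | 14.07 | x | g | o | f | e |  | w | 17.8 | ... | s | y | w | u | w | t | g |  | d | w |
| 3 | p | 14.17 | f | h | e | f | e |  | w | 15.77 | ... | s | y | w | u | w | t | p |  | d | w |
| 4 | p | 14.64 | x | h | o | f | e |  | w | 16.53 | ... | s | y | w | u | w | t | p |  | d | w |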


In [None]:
# optional (disabled): drop every column that contains at least one missing value
# mushroom_df = mushroom_df.drop(columns=mushroom_df.columns[mushroom_df.isna().any()])

In [None]:
train_df = mushroom_df.sample(frac=0.8, random_state=0)
test_df = mushroom_df.drop(train_df.index)

train_ds = DataSet(train_df, "class")
test_ds = DataSet(test_df, "class")

train_df.head()

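|  | class | cap-diameter | cap-shape | cap-surface | cap-color | does-bruise-or-bleed | gill-attachment | gill-spacing | gill-color | stem-height | ... | stem-root | stem-surface | stem-color | veil-type | veil-color | has-ring | ring-type | spore-print-color | habitat | season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 60661 | e | 6.02 | o |  | n | f | f | f | f | 5.0 | ... |  |  | n |  |  | f | f |  | d | s |
| 23699 | p | 5.1 | x |  | b | f | x |  | w | 6.32 | ... |  |  | w |  |  | f | f |  | d | a |
| 60152 | p | 8.75 | o |  | e | f | f | f | f | 3.15 | ... |  | g | n |  |  | f | f |  | d | s |
| 57970 | p | 3.34 | o | l | g | f | f | f | f | 0.0 | ... | f | f | f |  |  | f | f |  | d | a |
| 47739 | p | 4.85 | b | t | n | f |  |  | n | 10.7 | ... |  |  | w |  |  | t |  | k | g | a |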
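
The split above samples 80% of the rows for training and keeps the remaining rows for testing by dropping the sampled index labels. As a minimal sanity check, the sketch below repeats the same two calls on a small stand-in frame (`toy_df` and its values are hypothetical, not taken from the mushroom dataset) and confirms that the two parts are disjoint and together cover every row. It assumes the index labels are unique, which is what `drop` relies on here.

```python
import pandas as pd

# Hypothetical stand-in for mushroom_df: 10 rows with a default integer index.
toy_df = pd.DataFrame({
    "class": list("pepepepepe"),
    "cap-diameter": [float(i) for i in range(10)],
})

# Same split as in the notebook: 80% sampled for training, the rest for testing.
toy_train = toy_df.sample(frac=0.8, random_state=0)
toy_test = toy_df.drop(toy_train.index)

# Index labels shared by both parts (should be empty for a clean split).
overlap = toy_train.index.intersection(toy_test.index)

# Every original row should appear in exactly one of the two parts.
covered = sorted(toy_train.index.tolist() + toy_test.index.tolist())

print(len(toy_train), len(toy_test), len(overlap))
```

If the index contained duplicate labels, `drop` would remove every row sharing a sampled label, so the test set could end up smaller than expected.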


In [None]:
tree = BinTreePredictor("zero-one", "mode", "entropy", "max_nodes", 50, max_thresholds=5)
train_err = tree.fit(train_ds)

print(f"accuracy:{1 - train_err}")
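
The cell above passes `"entropy"` and `max_thresholds=5` to `BinTreePredictor`, but the class internals are not shown in this notebook. The following is only a generic sketch of how an entropy-based threshold split is commonly scored on a numerical feature. All function names, the evenly spaced candidate thresholds, and the toy data are assumptions made for illustration and may differ from the project's implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_gain(values, labels, threshold):
    """Information gain of the binary test `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


def best_threshold(values, labels, max_thresholds=5):
    """Pick the best of `max_thresholds` evenly spaced candidates (an assumption)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (max_thresholds + 1)
    candidates = [lo + step * (i + 1) for i in range(max_thresholds)]
    return max(candidates, key=lambda t: split_gain(values, labels, t))


# Hypothetical toy data: cap diameters with edible ("e") / poisonous ("p") labels.
diameters = [2.0, 3.0, 4.0, 10.0, 12.0, 14.0]
classes = ["e", "e", "e", "p", "p", "p"]

threshold = best_threshold(diameters, classes)
gain = split_gain(diameters, classes, threshold)
print(threshold, gain)
```

Limiting the number of candidate thresholds trades split quality for speed: fewer candidates mean fewer entropy evaluations per feature at each node.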

In [None]:
pred, test_err = tree.predict(test_ds)

print(f"accuracy:{1 - test_err}")

accuracy:0.8070247257245784
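
The accuracy printed above is computed as one minus the error returned by `predict`. Assuming that error is the average zero-one loss (as the `"zero-one"` argument suggests), the relationship can be sketched as follows. The helper `zero_one_error` and the label lists are hypothetical and are not part of the project's code.

```python
def zero_one_error(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")
    mistakes = sum(t != p for t, p in zip(y_true, y_pred))
    return mistakes / len(y_true)


# Hypothetical labels: one of five predictions is wrong.
y_true = ["p", "e", "p", "e", "p"]
y_pred = ["p", "e", "e", "e", "p"]

err = zero_one_error(y_true, y_pred)
acc = 1 - err  # accuracy is the complement of the zero-one error
print(f"error: {err}, accuracy: {acc}")
```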
