In [4]:
#missing values


Some columns have a very high percentage of missing values. The options for handling them are:
1. Drop columns: we would have to compare the accuracy of the model with and without those columns.
2. Imputation: not an option, because it would cause specific values to become overweighted in the building of the tree.
3. Treat null as a separate category: this is an option because, while it can suffer from the same problem as #2 above, there is a good possibility that certain values are more likely to be null for certain species. Hence, missingness could be a factor in the prediction (let's verify this in EDA).
4. Surrogate split: learn a little more about this and consider implementing it in the tree.
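
Option 3 rests on the idea that missingness itself may carry signal. A minimal sketch of how this could be checked during EDA is shown here. The column name `gill-spacing` and the labels `e`/`p` follow the dataset, but the rows are invented for illustration and the helper column `status` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "gill-spacing": ["c", np.nan, "d", np.nan, "c", np.nan],
    "class":        ["e", "p",    "e", "p",    "p", "e"],
})

# Label each row by whether the feature is missing (hypothetical helper column)
toy["status"] = np.where(toy["gill-spacing"].isna(), "missing", "present")

# Share of each class within the missing and the present rows
share = pd.crosstab(toy["status"], toy["class"], normalize="index")
print(share)
```

If the class shares differ clearly between the `missing` and `present` rows on the real data, that would support keeping null as its own category.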

In [5]:
from scripts.final.DecisionTree import DecisionTree
from scripts.final.utils import *
import pandas as pd
import numpy as np
from ucimlrepo import fetch_ucirepo #for importing data
from summarytools import dfSummary

In [6]:
# fetch dataset 
secondary_mushroom = fetch_ucirepo(id=848) 
  
# data (as pandas dataframes) 
X_loaded = secondary_mushroom.data.features 
y_loaded = secondary_mushroom.data.targets

In [7]:
X = X_loaded.copy()
y = y_loaded.copy()

In [8]:
dfSummary(X)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,cap-diameter [float64],Mean (sd) : 6.7 (5.3) min < med < max: 0.4 < 5.9 < 62.3 IQR (CV) : 5.1 (1.3),"2,571 distinct values",,0 (0.0%)
2,cap-shape [object],1. x 2. f 3. s 4. b 5. o 6. p 7. c,"26,934 (44.1%) 13,404 (21.9%) 7,164 (11.7%) 5,694 (9.3%) 3,460 (5.7%) 2,598 (4.3%) 1,815 (3.0%)",,0 (0.0%)
3,cap-surface [object],1. nan 2. t 3. s 4. y 5. h 6. g 7. d 8. e 9. k 10. i 11. other,"14,120 (23.1%) 8,196 (13.4%) 7,608 (12.5%) 6,341 (10.4%) 4,974 (8.1%) 4,724 (7.7%) 4,432 (7.3%) 2,584 (4.2%) 2,303 (3.8%) 2,225 (3.6%) 3,562 (5.8%)",,"14,120 (23.1%)"
4,cap-color [object],1. n 2. y 3. w 4. g 5. e 6. o 7. r 8. u 9. p 10. k 11. other,"24,218 (39.7%) 8,543 (14.0%) 7,666 (12.6%) 4,420 (7.2%) 4,035 (6.6%) 3,656 (6.0%) 1,782 (2.9%) 1,709 (2.8%) 1,703 (2.8%) 1,279 (2.1%) 2,058 (3.4%)",,0 (0.0%)
5,does-bruise-or-bleed [object],1. f 2. t,"50,479 (82.7%) 10,590 (17.3%)",,0 (0.0%)
6,gill-attachment [object],1. a 2. d 3. nan 4. x 5. p 6. e 7. s 8. f,"12,698 (20.8%) 10,247 (16.8%) 9,884 (16.2%) 7,413 (12.1%) 6,001 (9.8%) 5,648 (9.2%) 5,648 (9.2%) 3,530 (5.8%)",,"9,884 (16.2%)"
7,gill-spacing [object],1. nan 2. c 3. d 4. f,"25,063 (41.0%) 24,710 (40.5%) 7,766 (12.7%) 3,530 (5.8%)",,"25,063 (41.0%)"
8,gill-color [object],1. w 2. n 3. y 4. p 5. g 6. f 7. o 8. k 9. r 10. e 11. other,"18,521 (30.3%) 9,645 (15.8%) 9,546 (15.6%) 5,983 (9.8%) 4,118 (6.7%) 3,530 (5.8%) 2,909 (4.8%) 2,375 (3.9%) 1,399 (2.3%) 1,066 (1.7%) 1,977 (3.2%)",,0 (0.0%)
9,stem-height [float64],Mean (sd) : 6.6 (3.4) min < med < max: 0.0 < 6.0 < 33.9 IQR (CV) : 3.1 (2.0),"2,226 distinct values",,0 (0.0%)
10,stem-width [float64],Mean (sd) : 12.1 (10.0) min < med < max: 0.0 < 10.2 < 103.9 IQR (CV) : 11.4 (1.2),"4,630 distinct values",,0 (0.0%)


###Missing Values

Most of the columns do not have any missing values, and none of the numeric columns do. However, missing values are particularly prominent in:
1. spore-print-color: 89.6%
2. veil-type: 94.8%
3. veil-color: 87.9%
4. stem-root: 84.4%

Other columns with missing values include:
5. stem-surface: 62.4%
6. gill-spacing: 41%
7. cap-surface: 23.1%
8. gill-attachment: 16.2%
9. ring-type: 4%

Since very few values are present in the first 4 columns listed above (each is at least 84% missing), they are unlikely to provide useful information. Hence, we drop these columns.
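
The percentages above can be derived programmatically. The following is a minimal sketch on an invented toy frame; the 0.8 cut-off is an assumption chosen only because it separates the four dropped columns from the rest, not a value used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration; column names mirror the dataset
toy = pd.DataFrame({
    "veil-type": [np.nan, np.nan, np.nan, np.nan, np.nan, "u"],  # 5 of 6 missing
    "ring-type": ["f", "f", np.nan, "e", "f", "f"],              # 1 of 6 missing
    "cap-shape": ["x", "f", "s", "x", "b", "f"],                 # none missing
})

THRESHOLD = 0.8  # hypothetical cut-off for "too sparse to be useful"

# Fraction of missing values per column
missing_share = toy.isna().mean()

# Columns whose missing fraction exceeds the cut-off
to_drop = missing_share[missing_share > THRESHOLD].index.tolist()
reduced = toy.drop(columns=to_drop)

print(missing_share.round(3))
print("dropping:", to_drop)
```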


In [9]:
X = X.drop(['spore-print-color', 'veil-type', 'veil-color', 'stem-root'], axis=1)

We consider mode imputation for the other columns. However, research shows that mode imputation doesn't increase the predictive power of classification models:
https://www.sciencedirect.com/science/article/pii/S2352914823002289#:~:text=Mode%20Imputation%3A%20This%20is%20one,of%20variance%20in%20the%20variable.

Besides, in these cases mode imputation could severely bias the data. For example, imputing the mode value a for the missing values in the gill-attachment column would leave roughly double the number of observations with the a attachment type as with the d attachment type, whereas in the observed data a has only about 20% more observations than d.
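
The size of this distortion can be worked out directly from the counts in the gill-attachment row of the summary table. The variable names below are illustrative.

```python
# Counts taken from the gill-attachment row of the summary table
count_a, count_d, count_missing = 12_698, 10_247, 9_884

# Ratio of a to d in the observed data
ratio_observed = count_a / count_d

# Ratio of a to d if every missing value were imputed as the mode, a
ratio_imputed = (count_a + count_missing) / count_d

print(f"observed a/d ratio:              {ratio_observed:.2f}")  # 1.24
print(f"a/d ratio after mode imputation: {ratio_imputed:.2f}")   # 2.20
```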

We also consider dropping all rows that contain a missing value in any column, but this would lead to excessive loss of data.

Thus, the method we adopt for the other fields is to treat missing values as a separate category.

To do this, we convert all categorical columns to string, which turns the missing values into the string 'nan'.

In [10]:
# Cast categorical (object) columns to str so that NaN becomes the string 'nan'
for col in X.select_dtypes(include='object').columns:
    X[col] = X[col].astype(str)

In [11]:
dfSummary(X)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,cap-diameter [float64],Mean (sd) : 6.7 (5.3) min < med < max: 0.4 < 5.9 < 62.3 IQR (CV) : 5.1 (1.3),"2,571 distinct values",,0 (0.0%)
2,cap-shape [object],1. x 2. f 3. s 4. b 5. o 6. p 7. c,"26,934 (44.1%) 13,404 (21.9%) 7,164 (11.7%) 5,694 (9.3%) 3,460 (5.7%) 2,598 (4.3%) 1,815 (3.0%)",,0 (0.0%)
3,cap-surface [object],1. nan 2. t 3. s 4. y 5. h 6. g 7. d 8. e 9. k 10. i 11. other,"14,120 (23.1%) 8,196 (13.4%) 7,608 (12.5%) 6,341 (10.4%) 4,974 (8.1%) 4,724 (7.7%) 4,432 (7.3%) 2,584 (4.2%) 2,303 (3.8%) 2,225 (3.6%) 3,562 (5.8%)",,0 (0.0%)
4,cap-color [object],1. n 2. y 3. w 4. g 5. e 6. o 7. r 8. u 9. p 10. k 11. other,"24,218 (39.7%) 8,543 (14.0%) 7,666 (12.6%) 4,420 (7.2%) 4,035 (6.6%) 3,656 (6.0%) 1,782 (2.9%) 1,709 (2.8%) 1,703 (2.8%) 1,279 (2.1%) 2,058 (3.4%)",,0 (0.0%)
5,does-bruise-or-bleed [object],1. f 2. t,"50,479 (82.7%) 10,590 (17.3%)",,0 (0.0%)
6,gill-attachment [object],1. a 2. d 3. nan 4. x 5. p 6. e 7. s 8. f,"12,698 (20.8%) 10,247 (16.8%) 9,884 (16.2%) 7,413 (12.1%) 6,001 (9.8%) 5,648 (9.2%) 5,648 (9.2%) 3,530 (5.8%)",,0 (0.0%)
7,gill-spacing [object],1. nan 2. c 3. d 4. f,"25,063 (41.0%) 24,710 (40.5%) 7,766 (12.7%) 3,530 (5.8%)",,0 (0.0%)
8,gill-color [object],1. w 2. n 3. y 4. p 5. g 6. f 7. o 8. k 9. r 10. e 11. other,"18,521 (30.3%) 9,645 (15.8%) 9,546 (15.6%) 5,983 (9.8%) 4,118 (6.7%) 3,530 (5.8%) 2,909 (4.8%) 2,375 (3.9%) 1,399 (2.3%) 1,066 (1.7%) 1,977 (3.2%)",,0 (0.0%)
9,stem-height [float64],Mean (sd) : 6.6 (3.4) min < med < max: 0.0 < 6.0 < 33.9 IQR (CV) : 3.1 (2.0),"2,226 distinct values",,0 (0.0%)
10,stem-width [float64],Mean (sd) : 12.1 (10.0) min < med < max: 0.0 < 10.2 < 103.9 IQR (CV) : 11.4 (1.2),"4,630 distinct values",,0 (0.0%)


We now have 16 columns and no missing values.

###Encoding Values
We consider two encoding methods:
1. One-hot encoding: inappropriate for this dataset, which consists mainly of categorical columns, because it would create too many columns.
2. Label encoding: this would work for this kind of dataset, but we have to ensure that the numeric codes are not treated as if they were ordinal. An evaluation of the dataset shows that none of the fields can be considered ordinal, so we want our decision tree to treat each category separately. Numeric encoding would make it difficult to recognize this during training. So while label encoding could make the tree run faster, the results could be inaccurate if the tree treats the numeric codes as ordered.

Hence, to allow correct handling of categorical variables, we do not encode the predictors. However, we can encode the target variable y numerically, where this is not a concern.
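
To illustrate the concern with label encoding, here is a small sketch on invented values. The functions `threshold_split` and `equality_split` are hypothetical stand-ins for a numeric-style and a categorical-style split, not the methods of our `DecisionTree` class.

```python
# Toy categorical column with invented values
values = ["x", "f", "s", "b", "f", "x"]

# Label encoding: arbitrary integer codes assigned in sorted order
codes = {label: i for i, label in enumerate(sorted(set(values)))}
encoded = [codes[v] for v in values]

def threshold_split(column, thr):
    """Numeric-style split: indices of rows with value <= thr."""
    return [i for i, v in enumerate(column) if v <= thr]

def equality_split(column, category):
    """Categorical-style split: indices of rows equal to the category."""
    return [i for i, v in enumerate(column) if v == category]

# Splitting on the code of 'f' also pulls in 'b', only because of code order
left_numeric = threshold_split(encoded, codes["f"])
# Splitting on the raw category isolates 'f' alone
left_categorical = equality_split(values, "f")

print(codes)
print(left_numeric, left_categorical)
```

The threshold split groups `b` together with `f` purely because of how the codes happen to be ordered, which is the artificial ordinal structure we want to avoid.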

In [12]:
#encode y into 0s and 1s. 
y_mapping = encode_labels(y)
print(y_mapping)

{'class': {'e': 0, 'p': 1}}


Thus our data assigns the value of 1 when the mushroom is poisonous and 0 when it is not.

###Splitting the data

Having decided on the set of features to train on, we split our data into train and test sets to begin training.

In [13]:
#split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Now we train the tree using each of the three splitting methods: entropy, Gini impurity, and training error.
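
For reference, the three criteria can be written as impurity functions of a node's class share. This is a sketch of the standard binary-class definitions, where `p` is assumed to be the fraction of poisonous samples in a node; the implementation inside our `DecisionTree` class may differ in detail.

```python
import math

def entropy(p):
    """Binary entropy in bits of a node with positive-class share p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node has no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def gini(p):
    """Gini impurity of a binary node: 1 - p^2 - (1 - p)^2 = 2p(1 - p)."""
    return 2 * p * (1 - p)

def zero_one_error(p):
    """Error rate when the node predicts its majority class."""
    return min(p, 1 - p)

# All three are 0 for pure nodes and largest at p = 0.5
for p in (0.0, 0.2, 0.5):
    print(p, round(entropy(p), 3), round(gini(p), 3), round(zero_one_error(p), 3))
```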


Model 1: Entropy

In [15]:
#training the decision tree
entropy_model = DecisionTree(split_using='entropy', max_depth=10)
entropy_model.fit(X_train, y_train)


In [18]:
#performance of entropy model
entropy_pred = entropy_model.predict(X_test)
print(accuracy(y_test, entropy_pred))
print(precision(y_test, entropy_pred))
print(recall(y_test, entropy_pred))

Accuracy: 0.8505813001473719
Precision: 0.9733811591466868
Recall: 0.7538011695906432
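
As a reminder of what these three numbers measure, here is a minimal sketch of the metrics computed from confusion counts, with 1 (poisonous) as the positive class. The function names are illustrative and are not the ones imported from `scripts.final.utils`; the labels are invented.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def summarize(y_true, y_pred):
    """Compute accuracy, precision and recall from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Toy labels, invented for illustration: one poisonous sample is missed
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]
acc, prec, rec = summarize(y_true, y_pred)
print(acc, prec, rec)
```

High precision with lower recall, as in the toy example, means that predicted-poisonous labels are mostly right but some poisonous mushrooms are missed.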


Model 2: Gini impurity

In [19]:
#training the decision tree
gini_model = DecisionTree(split_using='gini', max_depth=10)
gini_model.fit(X_train, y_train)

In [20]:
#performance of gini model
gini_pred = gini_model.predict(X_test)
print(accuracy(y_test, gini_pred))
print(precision(y_test, gini_pred))
print(recall(y_test, gini_pred))

Accuracy: 0.9042901588341248
Precision: 0.9544798845968905
Recall: 0.8706140350877193


Model 3: Train Error

Next we adopt the training error, measured with the zero-one loss, as the splitting criterion.
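
One known property of the zero-one error as a splitting criterion is that it can fail to distinguish between candidate splits that Gini impurity ranks differently. The sketch below shows this on invented class counts; the function names are illustrative and not taken from our `DecisionTree` class, and the example is not meant as an explanation of the results in this notebook.

```python
def gini(pos, neg):
    """Gini impurity of a node with the given class counts."""
    p = pos / (pos + neg)
    return 2 * p * (1 - p)

def zero_one_error(pos, neg):
    """Share of the node misclassified when predicting its majority class."""
    return min(pos, neg) / (pos + neg)

def weighted(impurity, children):
    """Size-weighted average impurity over the child nodes of a split."""
    total = sum(pos + neg for pos, neg in children)
    return sum((pos + neg) / total * impurity(pos, neg) for pos, neg in children)

# Two candidate splits of a parent node with 400 positives and 400 negatives
split_a = [(300, 100), (100, 300)]
split_b = [(200, 400), (200, 0)]  # second child is pure

for name, split in [("A", split_a), ("B", split_b)]:
    err = weighted(zero_one_error, split)
    gin = weighted(gini, split)
    print(f"split {name}: zero-one error = {err:.3f}, gini = {gin:.3f}")
```

Both splits have the same weighted zero-one error, while Gini impurity prefers split B because it produces a pure child.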

In [21]:
#training the decision tree
train_error_model = DecisionTree(split_using='train_error', max_depth=10)
train_error_model.fit(X_train, y_train)

In [22]:
#performance of train_error model
train_error_pred = train_error_model.predict(X_test)
print(accuracy(y_test, train_error_pred))
print(precision(y_test, train_error_pred))
print(recall(y_test, train_error_pred))

Accuracy: 0.8078434583265106
Precision: 0.8805692021006268
Recall: 0.7599415204678363


In [23]:
#training errors of each model
entropy_train = entropy_model.predict(X_train)
gini_train = gini_model.predict(X_train)
train_error_train = train_error_model.predict(X_train)
print(f"entropy train error: {zero_one_loss(y_train, entropy_train)}")
print(f"gini train error: {zero_one_loss(y_train, gini_train)}")
print(f"train_error train error: {zero_one_loss(y_train, train_error_train)}")

entropy train error: 0.1510387882509467
gini train error: 0.09415617644048715
train_error train error: 0.1857332923958653


The training errors are within about one percentage point of the test errors (one minus the test accuracy), which suggests that none of the three models overfit under the max_depth=10 stopping criterion.
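
The comparison can be made explicit by converting each test accuracy to a test error and subtracting the training error. The numbers below are copied from the outputs above; the dictionary layout is only for illustration.

```python
# Training errors and test accuracies copied from the cell outputs above
results = {
    "entropy":     {"train_error": 0.1510387882509467,  "test_accuracy": 0.8505813001473719},
    "gini":        {"train_error": 0.09415617644048715, "test_accuracy": 0.9042901588341248},
    "train_error": {"train_error": 0.1857332923958653,  "test_accuracy": 0.8078434583265106},
}

gaps = {}
for name, r in results.items():
    test_error = 1 - r["test_accuracy"]
    # Positive gap: the model does worse on unseen data than on training data
    gaps[name] = test_error - r["train_error"]
    print(f"{name}: test error = {test_error:.4f}, gap = {gaps[name]:+.4f}")
```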

Next, we repeat the tests, this time adopting a different stopping criterion.

Finally, we perform hyperparameter tuning to optimize the model on the max depth stopping criterion.