**Individual Assignment 1 Task 2**

Name: Ryan Hong Yang Tan

UOW ID: 8560341

Reading the CSV file into a DataFrame and checking its length

In [3]:
import pandas as pd
import numpy as np
import random
import math
from google.colab import drive
drive.mount("/content/drive")

# Reading the csv file
data = pd.read_csv('/content/drive/My Drive/secondary_data.csv', sep=';')
# Checking if the data size is correct
print(len(data), '\n')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
61069 



**Pre-processing:**
Checking for null values. Null values in categorical columns are filled with the column mode, while null values in continuous columns are filled with the column mean.

In [4]:
# Checking for null values
print('Null values present before pre processing:')
print(data.isnull().sum())

# Pre-processing
def preprocessing(dataName):
  # Iterate over the columns of the frame passed in, not the global `data`
  for col in dataName.columns:
    # Categorical data are cleaned using mode
    if dataName[col].dtype == 'object':
      # Fills null data with the mode by selecting the mode value of the column
      dataName[col] = dataName[col].fillna(dataName[col].mode()[0])
    # Continuous data are cleaned using mean
    else:
      # Fills null data with the mean value
      dataName[col] = dataName[col].fillna(dataName[col].mean())
  return dataName

# Running the pre processing function
data = preprocessing(data)

# Checking for null values
print('\nNull values present after pre processing:')
print(data.isnull().sum())

Null values present before pre processing:
class                       0
cap-diameter                0
cap-shape                   0
cap-surface             14120
cap-color                   0
does-bruise-or-bleed        0
gill-attachment          9884
gill-spacing            25063
gill-color                  0
stem-height                 0
stem-width                  0
stem-root               51538
stem-surface            38124
stem-color                  0
veil-type               57892
veil-color              53656
has-ring                    0
ring-type                2471
spore-print-color       54715
habitat                     0
season                      0
dtype: int64

Null values present after pre processing:
class                   0
cap-diameter            0
cap-shape               0
cap-surface             0
cap-color               0
does-bruise-or-bleed    0
gill-attachment         0
gill-spacing            0
gill-color              0
stem-height             0
stem-width 
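The imputation rule used above can be illustrated on a toy frame. This is a minimal sketch with made-up values, not the notebook's data; it assumes `pd.api.types.is_numeric_dtype` as the type check instead of the `dtype == 'object'` comparison, which is a substitution made here so the example does not depend on how a given pandas version stores strings.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one categorical and one continuous column
toy = pd.DataFrame({
    "cap-surface": ["s", "s", None, "g"],
    "stem-width": [10.0, np.nan, 20.0, 30.0],
})

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        # Continuous column: fill nulls with the mean of the observed values
        toy[col] = toy[col].fillna(toy[col].mean())
    else:
        # Categorical column: fill nulls with the most frequent value
        toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

In this toy frame the missing `cap-surface` becomes `s` (the mode) and the missing `stem-width` becomes 20.0 (the mean of 10, 20 and 30).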

**Selecting features to be used**: the selected features are cap-diameter, stem-height, cap-shape, cap-color and does-bruise-or-bleed

The class column has to be included as well, as it records whether a mushroom is poisonous or edible

In [5]:
features = ['cap-diameter', 'stem-height', 'cap-shape', 'cap-color', 'does-bruise-or-bleed']
classData = 'class'
data = data[features + [classData]]

**Binning and Encoding**

In [6]:
# Binning function
def binning(col, binCount):
  # One integer label per bin, so the function also works when binCount is not 3
  return pd.qcut(col, binCount, labels=list(range(binCount)))

# Performing binning on continuous features
data['stem-height'] = binning(data['stem-height'], 3)
data['cap-diameter'] = binning(data['cap-diameter'], 3)

# Encoding categorical features
for cols in data.select_dtypes(include='object').columns:
  data[cols] = data[cols].astype('category').cat.codes

# Checking if all values are binned/encoded
print(data)

      cap-diameter stem-height  cap-shape  cap-color  does-bruise-or-bleed  \
0                2           2          6          6                     0   
1                2           2          6          6                     0   
2                2           2          6          6                     0   
3                2           2          2          1                     0   
4                2           2          6          6                     0   
...            ...         ...        ...        ...                   ...   
61064            0           0          5         11                     0   
61065            0           0          2         11                     0   
61066            0           0          5         11                     0   
61067            0           0          2         11                     0   
61068            0           0          5         11                     0   

       class  
0          1  
1          1  
2          1  
3  

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['stem-height'] = binning(data['stem-height'], 3)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['cap-diameter'] = binning(data['cap-diameter'], 3)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[cols] = data[cols].astype('category').cat.codes
A value is trying to be set on a copy of a
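The warning above typically appears when columns are assigned on a DataFrame that pandas treats as a slice of another one, such as the result of `data[features + [classData]]`. The sketch below, on a toy frame with made-up values, shows one common way to avoid it by taking an explicit `.copy()` before binning and encoding. The names `full` and `subset` are hypothetical, and whether the warning is emitted at all depends on the pandas version.

```python
import pandas as pd

# Toy frame (hypothetical values) standing in for the full dataset
full = pd.DataFrame({
    "cap-diameter": [1.0, 5.0, 9.0, 13.0, 17.0, 21.0],
    "veil-type": ["u"] * 6,
    "class": ["p", "e", "p", "e", "p", "e"],
})

# .copy() makes the column subset an independent DataFrame, so the
# assignments below are not made on a slice of `full`
subset = full[["cap-diameter", "class"]].copy()

# Quantile binning into three equally populated bins labelled 0, 1, 2
subset["cap-diameter"] = pd.qcut(subset["cap-diameter"], 3, labels=[0, 1, 2])

# Integer codes for the categorical column (categories are sorted, so e=0, p=1)
subset["class"] = subset["class"].astype("category").cat.codes

print(subset)
```

The original frame `full` is left untouched, and the binned and encoded values live only in `subset`.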

**Splitting** the dataset into training and post-pruning datasets

In [7]:
# Splitting function
def train_test_split(data, test_size):
  if isinstance(test_size, float):
    test_size = round(test_size * len(data))

  indices = data.index.tolist()
  test_indices = random.sample(population=indices, k=test_size)

  test_data = data.loc[test_indices]
  train_data = data.drop(test_indices)

  return train_data, test_data

# Splitting the data into training(2/3) and post-pruning(1/3)
random.seed(42)
train_data, post_prune_data = train_test_split(data, 0.33)

# Checking size of each data set, they should add up to the full dataset size (61069 in this case)
print('Training data size:', len(train_data))
print('Post-pruning data size:', len(post_prune_data))

Training data size: 40916
Post-pruning data size: 20153
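The sizes above follow from rounding 0.33 × 61069 to 20153 rows for the held-out set, which leaves 40916 for training. A minimal sketch of the same index-sampling idea on a toy frame is shown below. The helper name `split_by_index` and its `seed` parameter are hypothetical, and it uses a local `random.Random` instance rather than the module-level `random.seed(42)` used in the notebook.

```python
import random
import pandas as pd

def split_by_index(frame, test_size, seed=42):
    # A float test_size is treated as a fraction of the rows
    if isinstance(test_size, float):
        test_size = round(test_size * len(frame))
    rng = random.Random(seed)
    # Sample row labels without replacement for the held-out set
    test_indices = rng.sample(frame.index.tolist(), k=test_size)
    return frame.drop(test_indices), frame.loc[test_indices]

toy = pd.DataFrame({"x": range(100)})
train, held_out = split_by_index(toy, 0.33)

# The two parts should be disjoint and together cover every row
print(len(train), len(held_out))
print(set(train.index).isdisjoint(held_out.index))

# Same rounding as for the full dataset of 61069 rows
print(round(0.33 * 61069))
```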


**Calculate entropy functions**

In [8]:
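def calculate_entropy(data):
    # Class labels are stored in the last column
    label_col = data[:, -1]
    _, counts = np.unique(label_col, return_counts=True)
    # Relative frequency of each class
    prob = counts / counts.sum()
    # Shannon entropy in bits
    entropy = sum(prob * -np.log2(prob))
    return entropy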

def calculate_child_entropy(left, right):
    total = len(left) + len(right)
    return ((len(left) / total) * calculate_entropy(left)) + ((len(right) / total) * calculate_entropy(right))
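The entropy computed above is $H = -\sum_i p_i \log_2 p_i$, where $p_i$ is the fraction of rows in class $i$. A short worked example on made-up label arrays may help. The helper `entropy_of_labels` is hypothetical and takes a 1-D label array directly instead of a full data array.

```python
import numpy as np

def entropy_of_labels(labels):
    # Count each class, convert to probabilities, then apply -sum(p * log2(p))
    _, counts = np.unique(labels, return_counts=True)
    prob = counts / counts.sum()
    return float(np.sum(prob * -np.log2(prob)))

balanced = entropy_of_labels(np.array([0, 0, 1, 1]))  # two classes, 50/50
pure = entropy_of_labels(np.array([1, 1, 1, 1]))      # a single class
skewed = entropy_of_labels(np.array([0, 1, 1, 1]))    # 25/75 split

print(balanced, pure, skewed)
```

A 50/50 node has entropy 1 bit, a pure node has entropy 0, and the 25/75 node falls in between at roughly 0.811.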

**Get possible split function**

In [9]:
def get_possible_splits(data):
    possible_splits = {}
    n_rows, n_cols = data.shape
    for col in range(n_cols - 1):
        # Every distinct value in a feature column is a candidate split value
        possible_splits[col] = np.unique(data[:, col])
    return possible_splits

**Split data function**

In [10]:
def split_data(data, split_col, split_val):
    split_col_val = data[:, split_col]
    left = data[split_col_val == split_val]
    right = data[split_col_val != split_val]
    return left, right

**Determine Best split function**

In [11]:
def determine_best_split(data, possible_splits):
    parent_entropy = calculate_entropy(data)
    best_entropy = parent_entropy
    best_split_col = None
    best_split_val = None

    for col_index in possible_splits:
        for value in possible_splits[col_index]:
            left, right = split_data(data, col_index, value)
            if len(left) == 0 or len(right) == 0:
                continue
            child_entropy = calculate_child_entropy(left, right)
            if child_entropy < best_entropy:
                best_split_col = col_index
                best_split_val = value
                best_entropy = child_entropy

    return best_split_col, best_split_val
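To make the selection rule concrete, the sketch below evaluates every equality split on a toy array (made-up values, one feature column followed by the label column) and reports the weighted child entropy and the resulting information gain. The local `entropy` helper mirrors the entropy calculation but is redefined here so the example is self-contained.

```python
import numpy as np

def entropy(rows):
    # Entropy of the label column (last column) in bits
    _, counts = np.unique(rows[:, -1], return_counts=True)
    prob = counts / counts.sum()
    return float(np.sum(prob * -np.log2(prob)))

# Columns: feature value, class label
toy = np.array([[0, 0], [0, 0], [1, 1], [1, 1], [2, 1], [2, 0]])

parent = entropy(toy)
weighted = {}
for value in np.unique(toy[:, 0]):
    left = toy[toy[:, 0] == value]    # rows where feature == value
    right = toy[toy[:, 0] != value]   # all remaining rows
    total = len(left) + len(right)
    weighted[int(value)] = ((len(left) / total) * entropy(left)
                            + (len(right) / total) * entropy(right))

# min() returns the first minimum, matching the strict `<` comparison
# that keeps the earliest of several equally good splits
best_value = min(weighted, key=weighted.get)
gain = parent - weighted[best_value]

print(weighted)
print(best_value, gain)
```

Here the splits on values 0 and 1 tie at about 0.541 bits, the split on value 2 leaves the entropy at 1 bit, and the first of the tied splits is kept.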

**Function to check if we have reached a leaf node**

In [12]:
def check_leaf(data):
    label_col = data[:, -1]
    classes = np.unique(label_col)
    return len(classes) == 1

**Classify data functions**

In [13]:
def classify_data(data):
    label_col = data[:, -1]
    classes, counts = np.unique(label_col, return_counts=True)
    index = counts.argmax()
    classification = classes[index]
    return classification

def classify_one(test_row, tree):
    question = list(tree.keys())[0]
    feature_name, _, value = question.split(" ")
    if str(test_row[feature_name]) == value:
        answer = tree[question][0]
    else:
        answer = tree[question][1]

    if not isinstance(answer, dict):
        return answer
    else:
        return classify_one(test_row, answer)

**Function to build the decision tree**

In [14]:
def decision_tree_classifier(data, max_depth, min_samples, counter=0):
    global COL_HEADERS
    if counter == 0:
        COL_HEADERS = data.columns
        data = data.values

    if check_leaf(data) or (len(data) < min_samples) or (counter == max_depth):
        classification = classify_data(data)
        return classification

    else:
        counter += 1
        possible_splits = get_possible_splits(data)
        split_col, split_val = determine_best_split(data, possible_splits)

        if split_col is None:
            return classify_data(data)

        left, right = split_data(data, split_col, split_val)

        feature_name = COL_HEADERS[split_col]
        question = f"{feature_name} = {split_val}"

        sub_tree = {question: []}

        yes_answer = decision_tree_classifier(left, max_depth, min_samples, counter)
        no_answer = decision_tree_classifier(right, max_depth, min_samples, counter)

        if yes_answer == no_answer:
            sub_tree = yes_answer
        else:
            sub_tree[question].append(yes_answer)
            sub_tree[question].append(no_answer)

        return sub_tree
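The function above returns either a plain class label (a leaf) or a nested dictionary of the form `{question: [yes_answer, no_answer]}`, where the question is a string such as `"cap-shape = 2"`. The sketch below walks a small hand-written tree of that shape. The tree contents and the iterative `classify` helper are made up for illustration and are not output from the notebook.

```python
def classify(row, tree):
    # Leaves are plain labels; internal nodes are {question: [yes, no]}
    while isinstance(tree, dict):
        question = next(iter(tree))
        feature_name, _, value = question.split(" ")
        yes_branch, no_branch = tree[question]
        # Compare as strings, because the question stores the value as text
        tree = yes_branch if str(row[feature_name]) == value else no_branch
    return tree

# Hypothetical two-level tree
toy_tree = {"cap-shape = 2": [1, {"cap-color = 6": [0, 1]}]}

first = classify({"cap-shape": 2, "cap-color": 11}, toy_tree)
second = classify({"cap-shape": 6, "cap-color": 6}, toy_tree)
third = classify({"cap-shape": 6, "cap-color": 11}, toy_tree)
print(first, second, third)
```

Because the question is split on single spaces, this representation assumes that feature names contain no spaces, which holds for the hyphenated column names used here.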

**Function to predict data with tree built**

In [15]:
def predict(test_data, tree):
    output = []
    for _, row in test_data.iterrows():
        output.append(classify_one(row, tree))
    return output

**Calculate accuracy function**

In [16]:
def calculate_accuracy(test_data, tree):
    correct = 0
    for _, row in test_data.iterrows():
        true_label = row['class']
        pred_label = classify_one(row, tree)
        correct += (pred_label == true_label)
    return correct / len(test_data)
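Accuracy here is the fraction of rows whose predicted label equals the true label. A minimal sketch with made-up predictions and labels, using a vectorised NumPy comparison in place of the row-by-row loop:

```python
import numpy as np

# Hypothetical predictions and true labels for four rows
predicted = np.array([1, 0, 1, 1])
actual = np.array([1, 0, 0, 1])

# The mean of a boolean array is the fraction of True entries
accuracy = float(np.mean(predicted == actual))
print(accuracy)
```

Three of the four toy predictions match, so the accuracy is 0.75.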

**Testing out the model**


1.   Building the tree
2.   Testing the accuracy



In [17]:
decision_tree = decision_tree_classifier(train_data, max_depth=3, min_samples=2)
# output = predict(post_prune_data, decision_tree)
accuracy = calculate_accuracy(post_prune_data, decision_tree)
print(accuracy)

0.6244727832084553


**Tuning the depth of the tree by testing a range of max_depth values**

In [18]:
# List of values to try for max_depth:
max_depth_range = list(range(1, 25))
highest_accuracy_depth = 0
highest_accuracy_for_depth = 0


for depth in max_depth_range:
  tree = decision_tree_classifier(train_data, max_depth = depth, min_samples = 2)
  score = calculate_accuracy(post_prune_data, tree)
  print(f'Depth {depth}: {score}')
  if score > highest_accuracy_for_depth:
    highest_accuracy_for_depth = score
    highest_accuracy_depth = depth

print(f'Highest accuracy: {highest_accuracy_for_depth} at depth {highest_accuracy_depth}')

Depth 1: 0.5690964124447973
Depth 2: 0.5973800426735474
Depth 3: 0.6244727832084553
Depth 4: 0.6450652508311417
Depth 5: 0.6479432342579269
Depth 6: 0.666153922492929
Depth 7: 0.6781124398352603
Depth 8: 0.6907656428323327
Depth 9: 0.6948345159529599
Depth 10: 0.7020790949238327
Depth 11: 0.7172629385203195
Depth 12: 0.7274847417257977
Depth 13: 0.7419738996675433
Depth 14: 0.7539820374137846
Depth 15: 0.7638068773879819
Depth 16: 0.7724904480722473
Depth 17: 0.7770059048280653
Depth 18: 0.7775021088671662
Depth 19: 0.7787426189649184
Depth 20: 0.783655038952017
Depth 21: 0.7848459286458592
Depth 22: 0.783655038952017
Depth 23: 0.783655038952017
Depth 24: 0.783655038952017
Highest accuracy: 0.7848459286458592 at depth 21


The results show that accuracy increases with every additional level up to a peak of 0.7848 at depth 21, then drops slightly to 0.7837 and stays at that value for depths 22 to 24

**Tuning the min sample of the tree by testing a range of min_samples values**

As accuracy peaked at max_depth = 21, max_depth is fixed at 21 for this step

In [21]:
# List of values to try for min_size:
min_size_range = list(range(1, 50))
highest_accuracy_min_size = 0
highest_accuracy_for_min_size = 0

for size in min_size_range:
    tree = decision_tree_classifier(train_data, max_depth = 21, min_samples = size)
    score = calculate_accuracy(post_prune_data, tree)
    print(f'Min sample {size}: {score}')
    if score > highest_accuracy_for_min_size:
        highest_accuracy_for_min_size = score
        highest_accuracy_min_size = size

print(f'Highest accuracy: {highest_accuracy_for_min_size} at min_sample {highest_accuracy_min_size}')

Min sample 1: 0.7848459286458592
Min sample 2: 0.7848459286458592
Min sample 3: 0.7848459286458592
Min sample 4: 0.7848459286458592
Min sample 5: 0.7848459286458592
Min sample 6: 0.7848459286458592
Min sample 7: 0.7846474470302188
Min sample 8: 0.7846474470302188
Min sample 9: 0.7846474470302188
Min sample 10: 0.7846474470302188
Min sample 11: 0.7846474470302188
Min sample 12: 0.7846474470302188
Min sample 13: 0.7845482062223986
Min sample 14: 0.7844985858184885
Min sample 15: 0.7844985858184885
Min sample 16: 0.784200863395028
Min sample 17: 0.784200863395028
Min sample 18: 0.784200863395028
Min sample 19: 0.784200863395028
Min sample 20: 0.7841016225872078
Min sample 21: 0.7841016225872078
Min sample 22: 0.7841016225872078
Min sample 23: 0.7841016225872078
Min sample 24: 0.7839527613754776
Min sample 25: 0.7839527613754776
Min sample 26: 0.7839527613754776
Min sample 27: 0.7839527613754776
Min sample 28: 0.7839527613754776
Min sample 29: 0.7839527613754776
Min sample 30: 0.7839527613

The results show that accuracy starts at its peak (0.7848) at min_sample 1 and remains constant up to min_sample 6 before decreasing. However, to ensure that an ample amount of data is used, I have decided to set min_sample to 5
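The tie described above can be checked directly from the printed scores. The sketch below copies the first eight accuracies from the output above into a dictionary (the variable names are hypothetical) and lists every min_sample value that reaches the peak.

```python
# Accuracies for min_sample 1 to 8, copied from the output above
scores = {
    1: 0.7848459286458592,
    2: 0.7848459286458592,
    3: 0.7848459286458592,
    4: 0.7848459286458592,
    5: 0.7848459286458592,
    6: 0.7848459286458592,
    7: 0.7846474470302188,
    8: 0.7846474470302188,
}

best_score = max(scores.values())
# Every min_sample value whose accuracy equals the peak
tied_sizes = [size for size, score in scores.items() if score == best_score]
print(tied_sizes)
```

Any value from 1 to 6 gives the same accuracy on the post-pruning data, so the chosen value of 5 lies within the tied range.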

In [24]:
final_decision_tree = decision_tree_classifier(train_data, max_depth=21, min_samples=5)
final_accuracy = calculate_accuracy(post_prune_data, final_decision_tree)
print(final_accuracy)

0.7848459286458592
