# Data-Mining Course (EECS 6412)
# Assignment (II): Decision Tree Classifier Implementation in Python





## Objective: Implement a Decision Tree classifier in Python to gain a deeper understanding of its working principles.

**Overall Instructions:**


*   Your task is to implement a Decision Tree classifier in Python.
*   The implementation has been broken down into multiple subfunctions, each with accompanying hints. Your goal is to complete the code for each function.
* You are only allowed to use the **pandas** and **numpy** libraries for this assignment. Some functions from Pandas have been provided for your convenience in the initial section, and you may use them if you feel they are necessary.
* Each part of your solution will be graded separately; however, the sections are interrelated. It is crucial that your code is well-documented with comments explaining each part of your implementation.
* Please be aware that your responses will be thoroughly reviewed to ensure originality. Plagiarized or copied work will result in penalties.


**- Please skip the following descriptions and move directly to the Questions section if you are familiar with reading CSV files with Pandas library.**



---


## Please write your full name/names and student IDs here:




*   Full Name:
*   Student ID:




---



## Dataset Description for "Car Acceptability" Classification:
Your code should be general and must work on any tabular dataset that combines categorical and numerical attributes. For this example, you have been provided with three datasets for training, testing, and validation. Please download the datasets from [here](https://drive.google.com/drive/folders/1uSr0rvbp2dExYRTDxL5vAouLXH1fJsOH?usp=sharing). These samples represent the decisions of car experts regarding the acceptability of cars. The experts have categorized the cars into one of four classes: "unacceptable," "acceptable," "good," or "very good," based on five categorical features and one numerical feature.

# Features:

* **'BUYING':** An integer representing the purchase price of the car.

* **'MAINTENANCE':** This feature indicates how high the car's maintenance cost is, and it is categorized into four classes: 'vhigh' (very high), 'high', 'med' (medium), or 'low'.

* **'DOORS':** This feature indicates the number of doors each car has: '2', '3', '4', or '5more' (5 or more doors).

* **'PERSONS':** This feature determines the car's capacity in terms of the number of persons it can accommodate and is categorized as '2', '4', or 'more'.

* **'LUG_BOOT':** This feature represents the size of the car's luggage boot (trunk) and is categorized as 'small', 'med' (medium), or 'big'.

* **'SAFETY':** This feature provides an estimate of the car's safety level and is categorized as 'low', 'med' (medium), or 'high'.

* **'CLASS':** This is the target variable. It indicates the acceptance level of the car and is categorized as 'unacc' (unacceptable), 'acc' (acceptable), 'good', or 'vgood' (very good).

**Please note that in this example the "CLASS" attribute is located at the last column of the tabular datasets**




---


## Accessing the Datasets:
To access and read datasets from Google Drive in Google Colab using the Pandas library, you can follow these steps:

1.   Upload CSV Files to Google Drive: First, ensure that you've uploaded the CSV files (training, test, and validation datasets) to your Google Drive. You can create a folder for your project and upload the files there.


2.   Mount Google Drive in Google Colab: Mount your Google Drive using the following code:


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive




3.   Access and Read Data using Pandas: You can access your CSV files in the mounted Google Drive directory. For example, if your CSV files are located in a folder named "data_mining_assignment2/assignment_grads/Fall2025/datasets" in your Google Drive, you can read them as follows:




In [None]:
import pandas as pd

# Define the file paths for your CSV files
train_csv_path = '/content/drive/MyDrive/data_mining_assignment2/assignment_grads/Fall2025/datasets/data_train_cn.csv'
test_csv_path = '/content/drive/MyDrive/data_mining_assignment2/assignment_grads/Fall2025/datasets/data_test_cn.csv'
validation_csv_path = '/content/drive/MyDrive/data_mining_assignment2/assignment_grads/Fall2025/datasets/data_validation_cn.csv'

# Read the data into Pandas DataFrames
train_df = pd.read_csv(train_csv_path)
test_df = pd.read_csv(test_csv_path)
validation_df = pd.read_csv(validation_csv_path)



4.   See Some Samples with head() Function:





In [None]:
# See the first 5 samples in the training dataset
print("Samples in the Training Dataset:")
print(train_df.head())

# See the first 5 samples in the test dataset
print("\nSamples in the Test Dataset:")
print(test_df.head())

Samples in the Training Dataset:
   BUYING MAINTENANCE  DOORS PERSONS LUG_BOOT SAFETY  CLASS
0      37       vhigh      4       2      med    low  unacc
1      58        high      2    more    small   high  unacc
2      30         med  5more    more      med    low  unacc
3      89       vhigh      4    more    small   high  unacc
4      10         low      3       2      big   high  unacc

Samples in the Test Dataset:
   BUYING MAINTENANCE  DOORS PERSONS LUG_BOOT SAFETY  CLASS
0      17       vhigh      3    more    small    med  unacc
1      24        high  5more       4    small   high    acc
2      10       vhigh      3    more      big    med    acc
3      73         med      2       4    small   high    acc
4      27        high      3       4    small    med  unacc




5.   Access Feature Names using columns Attribute:






In [None]:
# Get the feature names (column names) of the training dataset
feature_names = train_df.columns
print("Feature Names:")
print(feature_names)


Feature Names:
Index(['BUYING', 'MAINTENANCE', 'DOORS', 'PERSONS', 'LUG_BOOT', 'SAFETY',
       'CLASS'],
      dtype='object')


6. Access Each Column as a Series:

In [None]:
# Access the 'BUYING' column as a Series using square bracket notation
buying_price = train_df['BUYING']
print(buying_price.head())

# Access the 'MAINTENANCE' column:
maintenance_cost = train_df['MAINTENANCE']
print(maintenance_cost.head())

# Access the 'CLASS' column in the training dataset as a Series
labels = train_df['CLASS']
print(labels.head())

0    37
1    58
2    30
3    89
4    10
Name: BUYING, dtype: int64
0    vhigh
1     high
2      med
3    vhigh
4      low
Name: MAINTENANCE, dtype: object
0    unacc
1    unacc
2    unacc
3    unacc
4    unacc
Name: CLASS, dtype: object


7. Use value_counts() function to  find the number of samples for each distinct value for a particular column:

In [None]:
print("Counts of each distinct value in 'MAINTENANCE':")
print(maintenance_cost.value_counts())

Counts of each distinct value in 'MAINTENANCE':
MAINTENANCE
high     313
med      299
low      298
vhigh    290
Name: count, dtype: int64


---
---

---






# Questions
---

## - Part 1: Check Terminal Node Condition:
(Q.1. **5 Marks**): In the first step, we need to check if a node containing a DataFrame is a terminal node or it needs further splitting. Implement a function called `'check_if_terminal'` to do this task.

Function Requirements:

Input:


*   `'dataframe'`: the DataFrame corresponding to a node.

*   `'threshold'`: Proportion threshold for the majority class.



Calculate the proportion of samples with the majority class label.

If the proportion ≥ threshold, return "Leaf" as flag.

If the proportion < threshold, return "Internal" as the flag.

In addition to the flag, the function must return the majority class ("acc", "unacc", "good", or "vgood").
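
The proportion check described above can be illustrated on a toy label column. This is only a sketch, not the required function: the label values and the 0.7 threshold are made up for illustration, and it relies solely on the pandas `value_counts(normalize=True)` call.

```python
import pandas as pd

# Hypothetical toy labels for a single node (not taken from the assignment data)
toy_labels = pd.Series(["unacc", "unacc", "unacc", "acc", "good"])

# Relative frequency of each class, sorted from most to least frequent
proportions = toy_labels.value_counts(normalize=True)

majority_class = proportions.index[0]  # most frequent label
majority_share = proportions.iloc[0]   # its share of the node's samples

toy_threshold = 0.7  # assumed value, for illustration only
flag = "Leaf" if majority_share >= toy_threshold else "Internal"
print(flag, majority_class, majority_share)
```

Here the majority class covers 3 of 5 samples (0.6), which is below the assumed threshold, so the node would be flagged as "Internal".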

In [None]:
def check_if_terminal(dataframe, threshold):
  # Get all attribute names from the DataFrame
  all_attrs = dataframe.columns

  # Select the last attribute as the class attribute
  class_attrs = all_attrs[-1]

  # Extract the labels (values of the class attribute)
  labels = dataframe[class_attrs]
  #...............................................
  # write the rest here







  # output flag must be a string (whether "Internal" or "Leaf")
  # majority_class must be a string indicating the majority label of the samples in the node
  #................................................
  return flag, majority_class

In [None]:
# Check your implementation on training dataframe:
flag, majority_class = check_if_terminal(train_df, 0.9)
print("the node type is {}".format(flag))
print("the majority class of the node is {}".format(majority_class))


---

## - Part 2: Gini Function:
(Q.2. **5 Marks**): In order to split a node in a decision tree based on the Gini index criterion, we need to calculate the Gini impurity of the samples. Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly classified if it were randomly labeled according to the distribution of labels in the set.

Task: Write a Python function called `'gini'` that takes the CLASS column of the dataframe, denoted as "labels," and returns the Gini impurity as the output.
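
As a reminder, for class proportions p_1, ..., p_K the Gini impurity is 1 - (p_1^2 + ... + p_K^2). The following worked example evaluates this formula on a hypothetical label column; the labels are invented for illustration, and the snippet is a sketch of the arithmetic rather than the graded `gini` function.

```python
import numpy as np
import pandas as pd

# Hypothetical labels: 3 x "unacc", 2 x "acc", 1 x "good"
toy_labels = pd.Series(["acc", "acc", "unacc", "unacc", "unacc", "good"])

# Class proportions p_k
p = toy_labels.value_counts(normalize=True).to_numpy()

# Gini impurity: 1 - sum_k p_k^2 = 1 - (9 + 4 + 1) / 36
toy_gini = 1.0 - np.sum(p ** 2)
print(toy_gini)
```

A pure node (a single class) gives an impurity of 0, and the value grows as the classes become more evenly mixed.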


In [None]:
import numpy as np

def gini(labels):
  # write your code here








  return gini_impurity


In [None]:
# Check your implementation on training dataframe:
labels = train_df["CLASS"]
gini_value = gini(labels)
print("gini index of the node is {}".format(gini_value))



---

##  - Part 3: Gini Gain Calculation:

(Q.3. **15 Marks**): In this step, you are required to implement a function named **`gini_gain`** that computes the **Gini gain** (i.e., the reduction in Gini impurity) obtained by splitting samples in the `'CLASS'` column (referred to as `'labels'`) based on a specific attribute column (denoted as `'x'`). Both `'labels'` and `'x'` are columns in a given DataFrame. You should utilize the **`gini`** function implemented in Part 2 to help compute the Gini gain.

#### Key Requirements:
- **Categorical Attributes**: If the attribute column `'x'` is categorical, the function should calculate and return the Gini gain achieved by splitting based on the unique values of the attribute.
- **Numerical Attributes**: If the attribute column `'x'` is numerical, the function should not only compute the Gini gain but also determine the optimal split point for the attribute. The function should return both the Gini gain and the corresponding optimal split point.
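
For the categorical case, the gain is the parent impurity minus the size-weighted average of the child impurities. The sketch below works this out on invented data; `toy_gini` is a stand-in helper defined here only to keep the example self-contained, and it is not meant as the reference solution.

```python
import numpy as np
import pandas as pd

def toy_gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    p = labels.value_counts(normalize=True).to_numpy()
    return 1.0 - np.sum(p ** 2)

# Hypothetical attribute column and labels (six samples)
x = pd.Series(["2", "2", "4", "4", "more", "more"])
labels = pd.Series(["unacc", "unacc", "acc", "unacc", "acc", "acc"])

# Weighted impurity of the children, one child per distinct attribute value
children_gini = 0.0
for val in x.unique():
    child = labels[x == val]
    children_gini += len(child) / len(labels) * toy_gini(child)

toy_gain = toy_gini(labels) - children_gini
print(toy_gain)
```

In this toy case the parent impurity is 0.5, only the "4" child is impure (0.5, weight 2/6), and the gain is therefore 0.5 - 1/6 = 1/3.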

#### Notes:
- The function should be versatile enough to handle both categorical and numerical attributes.
- You may find it helpful to sort the numerical attribute values to determine potential split points.
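
One possible way to search for a numerical split point is sketched below. It assumes that candidate thresholds are the midpoints between consecutive sorted unique values, which is a common convention but not something this assignment prescribes; the data and the `toy_gini` helper are hypothetical.

```python
import numpy as np
import pandas as pd

def toy_gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    p = labels.value_counts(normalize=True).to_numpy()
    return 1.0 - np.sum(p ** 2)

# Hypothetical numerical attribute and labels
x = pd.Series([10, 20, 30, 40])
labels = pd.Series(["unacc", "unacc", "acc", "acc"])

# Candidate thresholds: midpoints between consecutive sorted unique values
values = np.sort(x.unique())
candidates = (values[:-1] + values[1:]) / 2.0

best_gain, best_split = -1.0, None
for split in candidates:
    left, right = labels[x <= split], labels[x > split]
    weighted = (len(left) * toy_gini(left) + len(right) * toy_gini(right)) / len(labels)
    gain = toy_gini(labels) - weighted
    if gain > best_gain:
        best_gain, best_split = gain, split
print(best_gain, best_split)
```

With these four samples the threshold 25.0 separates the two classes perfectly, so it yields the largest gain (0.5).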




In [None]:
def gini_gain(x, labels):
  # Get the number of samples in the dataset
  num_samples = len(labels)


  # Calculate the gini index of the entire dataset (parent gini)
  parent_gini = gini(labels)

  # Determine the attribute type based on its dtype
  if x.dtype == "object":
    attr_type = "categorical"
  else:
    attr_type = "numerical"

  # Calculate gini gain based on the attribute type
  if attr_type == "categorical":
    # For categorical attributes, calculate gini gain
    # by considering each unique value separately
    values = x.unique()
    gini_list = []
    portion_list = []
    for val in values:
      pass






  else:
    pass
    # For numerical attributes, calculate gini gain
    # by considering different split points














  # Calculate gini gain as the difference between parent and child gini index
  ginigain = parent_gini - childs_gini

  # Return gini gain and the split point (split_point must be None for categorical attributes)
  return ginigain, split_point

In [None]:
# Check your implementation for training dataframe on "PERSONS" attribute:
labels = train_df["CLASS"]
x = train_df["PERSONS"]
ginigain, split_point = gini_gain(x, labels)
print("gini gain of the node in splitting over PERSONS attribute is {}".format(ginigain))
#split point is None here
##############################################

# Check your implementation for training dataframe on "BUYING" attribute:
labels = train_df["CLASS"]
x = train_df["BUYING"]
ginigain, split_point = gini_gain(x,labels)
print("gini gain of the node in splitting over BUYING attribute is {} and the splitting point is {}".format(ginigain, split_point))


---


## - Part 4: Selecting the Best Attribute for Splitting
(Q.4. **5 Marks**): In this part, you are tasked with implementing a function called `'select_attribute'`. This function takes a parent DataFrame, referred to as `'parent_data'`, along with a list of splittable attributes, denoted by `'remaining_attrs'`, and returns a string with the name of the attribute that yields the highest Gini gain after splitting. In addition, if the selected attribute is numerical, the function must also return the best split point; otherwise, it must return None for the split point. You may use the function written in Part 3.



In [None]:
def select_attribute(parent_data, remaining_attrs):
  all_attrs = parent_data.columns
  # Extract the class attribute:
  class_attr = all_attrs[-1]

  # Extract the labels (target values) from the parent data
  labels = parent_data[class_attr]



  # Loop through independent attributes and calculate their gini gains
  # ....................................................
  # write the rest here









  # ....................................................
  # if the selected attribute is categorical, return None for the sel_split_point
  return sel_attr, sel_split_point


In [None]:
# Check your implementation on training dataframe:
remaining_attrs = list(train_df.columns[:-1])
sel_attr, sel_split_point = select_attribute(train_df, remaining_attrs)
print("the best attribute for splitting the node is {}".format(sel_attr))




---
## - Part 5: Splitting the Nodes at Each Tree Level

(Q.5. **25 Marks**): In this part, you will implement a crucial piece of the decision tree by creating a Python function called `'data_split'`. The purpose of this function is to split each parent node's DataFrame into child DataFrames based on the best attribute, i.e., the one that yields the highest Gini gain. You may use the helper functions that you have already implemented in previous sections.


**Instructions:**
* Write a function called `'data_split'` to split all the nodes in level "n" and to generate all the children nodes in level "n+1".

* Perform node splitting in a systematic manner, progressing level by level. This entails creating all nodes at level n+1 by dividing all nodes eligible for splitting at level n. Refer to the example below for clarification:

![Image](https://drive.google.com/uc?export=download&id=1kIOCkYaxUJMEKumBP6RxOLriQQY2Wlqx)
 As depicted in the illustration, at level 1, there is a single node designated as "node_1_1," denoting the first node of the first level. This node has been split into three nodes, identified as "node_2_1," "node_2_2," and "node_2_3," denoting the first, second, and third nodes of the second level. Please adhere to this notation for naming each node.

* Imagine a dictionary named `'dataframe_dict'`, where the "keys" correspond to the node names at a specific splitting level, and the "values" represent the associated dataframes. To illustrate, for level 1, the `'dataframe_dict'` would consist of a single key, "node_1_1," with the corresponding value being the primary dataframe:
                dataframe_dict = {"node_1_1": the main dataframe}
In this example, following the execution of the `'data_split'` function, the `'dataframe_dict'` dictionary should be replaced with a dictionary containing three entries, as demonstrated below:
      dataframe_dict = {
                          "node_2_1": dataframe_2_1,
                          "node_2_2": dataframe_2_2,
                          "node_2_3": dataframe_2_3
                        }


* Similarly, consider another dictionary called `'remaining_attrs'` with "keys" representing the nodes' names, and "values" representing the splittable attributes for each node. In this example, the dictionary at the first level would be:

      remaining_attrs = {"node_1_1": ['BUYING', 'MAINTENANCE', 'DOORS', 'PERSONS', 'LUG_BOOT', 'SAFETY']}
but after running the function `'data_split'`, it would be updated to a dictionary with three key-value pairs:

     
      remaining_attrs = {
                         "node_2_1": ['BUYING', 'MAINTENANCE', 'DOORS', 'LUG_BOOT', 'SAFETY'],
                         "node_2_2": ['BUYING', 'MAINTENANCE', 'DOORS', 'LUG_BOOT', 'SAFETY'],
                         "node_2_3": ['BUYING', 'MAINTENANCE', 'DOORS', 'LUG_BOOT', 'SAFETY']
                         }

Please note that once we've performed a split on a categorical attribute such as "PERSONS" and generated the children nodes in the subsequent level, we are no longer permitted to split on the same categorical attribute within that branch of the tree. It's important to emphasize that this restriction doesn't apply to numerical attributes.

In the context of this example, this means that the `'remaining_attrs'` dictionary is updated to a three-element dictionary, where none of the nodes in this specific branch have the "PERSONS" attribute as a splittable option anymore.
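
The rule above (drop a categorical attribute after splitting on it, keep a numerical one) can be sketched as follows. The helper name `child_attrs` and its `is_categorical` argument are hypothetical and only serve this illustration; how you track attribute types in your own `data_split` is up to you.

```python
# Splittable attributes of a hypothetical parent node
parent_attrs = ['BUYING', 'MAINTENANCE', 'DOORS', 'PERSONS', 'LUG_BOOT', 'SAFETY']

def child_attrs(parent_attrs, sel_attr, is_categorical):
    # Hypothetical helper: attributes that remain splittable in the children
    attrs = list(parent_attrs)  # copy so the parent's list is not mutated
    if is_categorical:
        attrs.remove(sel_attr)  # a categorical attribute is used at most once per branch
    return attrs

after_persons = child_attrs(parent_attrs, 'PERSONS', is_categorical=True)
after_buying = child_attrs(parent_attrs, 'BUYING', is_categorical=False)
print(after_persons)
print(after_buying)
```

Copying the list before modifying it matters because sibling nodes would otherwise share, and silently alter, the same list object.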

* Consider the `'tree_model'` as a list containing three dictionaries: `'tree_connectivity'`, `'node_labels'`, and `'node_types'`:

            tree_model = [tree_connectivity, node_labels, node_types]

where `'tree_connectivity'` is a dictionary representing the connections between parent and child nodes. The `'node_types'` and `'node_labels'` are also dictionaries containing the type ("Leaf" or "Internal") and the majority class of each node, respectively. Your `'data_split'` function must take the `'tree_model'` generated up to level "n" and update it to the model up to level "n+1" after splitting. See the image below for this example.
The `'tree_model'` at level 1 is:

<img src="https://drive.google.com/uc?export=download&id=1GSJzh4CNE298LFXQR86LYpEfj4Q883-S"  width=400>


After running the `'data_split'` function, the tree_model will be updated up to level 2 as follows:

![Image](https://drive.google.com/uc?export=download&id=1Y3sGXHBpQtPVpMh8UnVlO0AYXvHNTGfo)


**Therefore**: You must write the function `'data_split'`, which takes `'dataframe_dict'`, `'remaining_attrs'`, `'tree_model'`, `'level'`, and `'threshold'` as input and updates `'dataframe_dict'`, `'remaining_attrs'`, and `'tree_model'` up to level "level+1". The function must also return a boolean flag `'stop_train'`, which must be True if no child node is generated at this level; otherwise, it must be False. Here, the input `'threshold'` is the majority-class threshold for checking whether a node is a "Leaf" node or an "Internal" node.

**To complete the function**:
* Loop through the nodes in "dataframe_dict" in the current "level". For each node, check if it's an "Internal" node and if so, find the best attribute for splitting. Create child nodes and finally update all the variables.









In [None]:
def data_split(dataframe_dict, remaining_attrs, tree_model, level, threshold):
    """
    Splits data at each node and updates the decision tree model.

    Args:
    dataframe_dict (dict): DataFrame of each node at the current level.
    remaining_attrs (dict): Remaining attributes for each node.
    tree_model (list): Tree structure (connectivity, labels, types).
    level (int): Current depth level.
    threshold (float): Class purity threshold for pre-pruning.

    Returns:
    tree_model (list): Updated tree structure.
    stop_train (bool): Indicates whether to stop the training.
    dataframe_dict (dict): Updated data for child nodes.
    remaining_attrs (dict): Updated attributes for child nodes.
    """

    # Unpack tree model
    [tree_connectivity, node_labels, node_types] = tree_model

    # Initialize child node counter
    child_ind = 1

    dataframe_dict_new = {}
    remaining_attrs_new = {}

    # Iterate over parent nodes
    for key in dataframe_dict.keys():
        parent_data = dataframe_dict[key]
        candidate_attrs = remaining_attrs[key].copy()

        # Check if the node is "Internal"
        if node_types[key] == "Internal":
            # TODO: Select best attribute (use select_attribute function)


            # TODO: Handle distinct values or split points
            # (Hint: use .unique() for categorical, split into <= and > for numerical)

            child_dict = {}

            # TODO: Loop through distinct values or split points
            for value in []:  # Replace with appropriate logic
                child_name = f"node_{level + 1}_{child_ind}"

                # TODO: Create mask for filtering data


                # TODO: Store filtered data
                # child_data = parent_data[mask]
                # dataframe_dict_new[child_name] = child_data

                # TODO: Check if child node is terminal (use check_if_terminal)


                # TODO: Update node labels, types, and tree connectivity


                child_ind += 1

            # tree_connectivity[key] = child_dict

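    # TODO: Update dataframe_dict and remaining_attrs with the child-level
    # dictionaries (dataframe_dict_new and remaining_attrs_new)
    # Return updated tree model and stop flag (True when no child was created)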
    stop_train = (child_ind == 1)

    return tree_model, stop_train, dataframe_dict, remaining_attrs


In [None]:
# Now Check your implementation on training dataframe:
# Initializing
threshold = 0.9

tree_connectivity = {}

flag, majority_class = check_if_terminal(train_df, threshold)

node_types = {"node_1_1": flag}
node_labels = {"node_1_1": majority_class}

# Create an initial tree_model
tree_model = [tree_connectivity, node_labels, node_types]

# Create an initial dataframe_dict
dataframe_dict = {"node_1_1": train_df}

# Create an initial remaining_attrs
independent_attrs = list(train_df.columns[:-1])
remaining_attrs = {"node_1_1": independent_attrs}

# Set level to 1
level = 1


# Update tree model
tree_model, stop_train, dataframe_dict, remaining_attrs = data_split(dataframe_dict, remaining_attrs, tree_model, 1, threshold)
tree_model, stop_train, dataframe_dict, remaining_attrs = data_split(dataframe_dict, remaining_attrs, tree_model, 2, threshold)
tree_model, stop_train, dataframe_dict, remaining_attrs = data_split(dataframe_dict, remaining_attrs, tree_model, 3, threshold)


[tree_connectivity, node_labels, node_types] = tree_model

print("\n tree connectivity:")
print(tree_connectivity)

print("\n node labels:")
print(node_labels)

print("\n node types:")
print(node_types)

print("\n remaining attributes are:")
print(remaining_attrs)






---

## - Part 6: Training the Decision Tree

(Q.6. **10 Marks**): Now, let's create a function called `'tree_train'` to train the decision tree. This function begins by initializing the tree model and dataframe dictionary using the root node named "node_1_1." It then iteratively updates these structures as it progresses through the tree, continuing until no further child nodes are generated. The process starts at level 1, and with each iteration, the level is incremented. Importantly, make sure to utilize the `'data_split'` function, which you have previously implemented, to assist in the tree construction. Ultimately, the function must return the fully trained tree model.




In [None]:
def tree_train(training_data, threshold):
  # Initializing
  tree_connectivity = {}

  flag, majority_class = check_if_terminal(training_data, threshold)

  node_types = {"node_1_1": flag}
  node_labels = {"node_1_1": majority_class}

  # Create a tree_model list to store connectivity, node labels, and node types
  tree_model = [tree_connectivity, node_labels, node_types]

  # Create a dataframe_dict with the initial training data and associate it with the root node
  dataframe_dict = {"node_1_1": training_data}

  # Create a remaining_attrs dictionary with all the independent attributes and associate it with the root node
  indp_attrs = list(training_data.columns[:-1])
  remaining_attrs = {"node_1_1": indp_attrs}

  # Initialize the level of the tree to 1
  level = 1

  # Continue tree construction until a stopping condition is met
  # ...............................................................
  # write the rest here
  # write a loop and exit it when the terminating criterion is met (stop_train is True)








  #.....................................................................
  return tree_model

In [None]:
# Check your implementation on training dataframe:
tree_model = tree_train(train_df, 0.9)
[tree_connectivity, node_labels, node_types] = tree_model

print("\n tree connectivity:")
print(tree_connectivity)

print("\n node labels:")
print(node_labels)

print("\n node types:")
print(node_types)




---

## - Part 7: Prediction by the Decision Tree
(Q.7. **15 Marks**): Following the completion of decision tree training, the next step is to implement the prediction process through the trained tree structure. To achieve this, we need to create a function named `'tree_prediction'`. This function takes two inputs: a test dataframe containing the samples to be predicted and the trained decision tree. It returns the predicted labels generated by the decision tree as a Pandas Series (a single DataFrame column).

In [None]:
def tree_prediction(testing_data, tree_model):
    # Initialize a list to store the predicted labels
    pred_labels = []

    # Unpack the tree_model list into three separate variables: tree_connectivity, node_labels, and node_types
    [tree_connectivity, node_labels, node_types] = tree_model

    # Iterate through each sample in the testing_data
    for i in range(len(testing_data)):
        # Get a sample from the testing dataset
        sample = testing_data.iloc[i]  # positional access, independent of the index labels

        # Start at the root node, which is always named "node_1_1"
        current_node = "node_1_1"

        # Begin a loop to traverse the decision tree until a leaf node is reached
        #........................................................................
        # write the rest here:




















        # find the node label and append it to the pred_labels list
    #.............................................................................
    # Return the predicted labels as a Pandas Series aligned with the input rows
    return pd.Series(pred_labels, index=testing_data.index)


In [None]:
# Check your implementation on training dataframe:
tree_model = tree_train(train_df, 0.9)
pred_labels = tree_prediction(train_df, tree_model)
print(pred_labels)



---

## Part 8: Evaluating the Model
(**10 Marks**)
* (Q.8-a) In the final step of this assignment, you'll apply the decision tree learning process. Start by training the decision tree on the training dataset using the `'tree_train'` function, setting the terminating threshold to 0.9. Next, use the `'tree_prediction'` function, as previously implemented, to generate predictions for both the training and testing datasets. Then compare these predicted labels with the actual ground-truth labels to compute and report the accuracy for both the training and testing datasets.

* (Q.8-b) Now, repeat the process with a different terminating threshold, specifically 0.7, and once again calculate and report the accuracy for the training and testing datasets. Finally, compare and contrast the results obtained with the two threshold values (0.9 and 0.7).
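
Accuracy here means the fraction of samples whose predicted label equals the ground-truth label. The sketch below computes it on two invented label columns; converting to NumPy arrays is one way to compare values position by position regardless of the Series indexes, and it is offered as an illustration rather than the required `accuracy_cal` implementation.

```python
import pandas as pd

# Hypothetical ground-truth and predicted labels for four samples
true_labels = pd.Series(["unacc", "acc", "good", "unacc"])
pred_labels = pd.Series(["unacc", "acc", "unacc", "unacc"])

# Element-wise comparison by position, then the mean of the boolean matches
matches = true_labels.to_numpy() == pred_labels.to_numpy()
toy_acc = matches.mean()
print(toy_acc)
```

Three of the four predictions match, so the toy accuracy is 0.75.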

In [None]:
#.................................
# write the rest here:
def accuracy_cal(true_labels, pred_labels):
  # write your code here



  return acc


In [None]:
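# Train, predict, and report accuracies for each terminating threshold
class_attr = train_df.columns[-1]
for threshold in (0.9, 0.7):
  tree_model = tree_train(train_df, threshold)
  train_acc = accuracy_cal(train_df[class_attr], tree_prediction(train_df, tree_model))
  test_acc = accuracy_cal(test_df[class_attr], tree_prediction(test_df, tree_model))
  print("\n train accuracy with threshold= {} is: {}".format(threshold, train_acc))
  print("test accuracy with threshold= {} is: {}".format(threshold, test_acc))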




---


## Part 9: Post-Pruning (Reduced Error Pruning):

(Q.9. **10 Marks**): Post-pruning involves simplifying a fully grown decision tree by removing unnecessary internal nodes that do not contribute to better performance. The goal is to reduce overfitting while maintaining or improving accuracy on unseen data.

- Implement the reduced error post-pruning algorithm.
  - The algorithm should take a fully grown tree (represented by `tree_model = [tree_connectivity, node_labels, node_types]`) and a validation dataset.
  
- Follow this process:
  - Start from the bottom of the tree.
  - For each internal node, check whether converting that node to a leaf improves validation accuracy.
  - If converting the node to a leaf increases validation accuracy, keep it as a leaf. Otherwise, retain it as an internal node.

- Progressively scan all nodes from the bottom to the top of the tree.
  - You do not need to remove the connections in `tree_connectivity`; simply change the node type to "Leaf".
  - Scanning will stop once a terminal (leaf) node is reached.

- After post-pruning the tree, evaluate the pruned tree on the test dataset and report the test accuracy.



In [None]:
# define the function here
def post_pruning(tree_model, validation_df):
    [tree_connectivity, node_labels, node_types] = tree_model
    #write your code here









    return tree_model

In [None]:
# call the function here
all_attrs = train_df.columns
class_attr = all_attrs[-1]
tree_model = tree_train(train_df, 1)
tree_model_new = post_pruning(tree_model, validation_df)
test_pred_label = tree_prediction(test_df, tree_model_new)
test_acc = accuracy_cal(test_df[class_attr], test_pred_label)
print("test accuracy with threshold= 1 is: {}".format(test_acc))