
In [None]:
%%writefile EC_F_PES2UG23CS338_Lab3.py

import torch
def get_entropy_of_dataset(tensor: torch.Tensor):
    """
    Calculate the entropy of the entire dataset.
    Formula: Entropy = -Σ(p_i * log2(p_i)) where p_i is the probability of class i

    Args:
        tensor (torch.Tensor): Input dataset as a tensor, where the last column is the target.

    Returns:
        float: Entropy of the dataset.
    """
    # The target is the last column; class probabilities are its relative frequencies.
    target_col = tensor[:, -1]
    unique_classes, counts = torch.unique(target_col, return_counts=True)
    probabilities = counts.float() / target_col.size(0)
    # Guard against log2(0); torch.unique never returns a zero count, so this is a safety net.
    probabilities = probabilities[probabilities > 0]
    entropy = -torch.sum(probabilities * torch.log2(probabilities))
    return entropy.item()

def get_avg_info_of_attribute(tensor: torch.Tensor, attribute: int):
    """
    Calculate the average information (weighted entropy) of an attribute.
    Formula: Avg_Info = Σ((|S_v|/|S|) * Entropy(S_v)) where S_v is subset with attribute value v.

    Args:
        tensor (torch.Tensor): Input dataset as a tensor.
        attribute (int): Index of the attribute column.

    Returns:
        float: Average information of the attribute.
    """
    # Weighted average of the entropy of each subset that shares one attribute value.
    attribute_col = tensor[:, attribute]
    unique_values, counts = torch.unique(attribute_col, return_counts=True)
    total_count = tensor.size(0)
    avg_info = 0.0
    for value, count in zip(unique_values, counts):
        subset = tensor[attribute_col == value]
        subset_entropy = get_entropy_of_dataset(subset)
        # count.item() keeps avg_info a plain float, so an empty tensor returns 0.0
        # instead of failing on float.item().
        avg_info += (count.item() / total_count) * subset_entropy
    return avg_info

def get_information_gain(tensor: torch.Tensor, attribute: int):
    """
    Calculate Information Gain for an attribute.
    Formula: Information_Gain = Entropy(S) - Avg_Info(attribute)

    Args:
        tensor (torch.Tensor): Input dataset as a tensor.
        attribute (int): Index of the attribute column.

    Returns:
        float: Information gain for the attribute (rounded to 4 decimals).
    """
    # Gain = entropy of the whole dataset minus the weighted entropy after the split.
    entropy = get_entropy_of_dataset(tensor)
    avg_info = get_avg_info_of_attribute(tensor, attribute)
    information_gain = entropy - avg_info
    return round(information_gain, 4)

def get_selected_attribute(tensor: torch.Tensor):
    """
    Select the best attribute based on highest information gain.

    Returns a tuple with:
    1. Dictionary mapping attribute indices to their information gains
    2. Index of the attribute with highest information gain

    Example: ({0: 0.123, 1: 0.768, 2: 1.23}, 2)

    Args:
        tensor (torch.Tensor): Input dataset as a tensor.

    Returns:
        tuple: (dict of attribute index -> information gain, index of best attribute)
    """
    # Compute the information gain of every attribute, then pick the highest.
    information_gains = {}
    for attribute in range(tensor.size(1) - 1):  # Exclude target column
        information_gains[attribute] = get_information_gain(tensor, attribute)
    best_attribute = max(information_gains, key=information_gains.get)
    return information_gains, best_attribute

Writing EC_F_PES2UG23CS338_Lab3.py
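
As a hand-checkable illustration of the formulas in the docstrings above, the following sketch recomputes entropy and information gain in plain Python on a hypothetical four-row dataset. The names `entropy`, `information_gain`, and `toy_rows` are illustrative assumptions and are not part of the submitted file.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute):
    """Entropy of the target (last column) minus the weighted entropy after the split."""
    targets = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        # Collect the target labels of every row that shares this attribute value.
        groups.setdefault(row[attribute], []).append(row[-1])
    avg_info = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(targets) - avg_info


# Hypothetical toy data: two encoded attributes and a binary target in the last column.
toy_rows = [
    [0, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
]

gain_attr0 = information_gain(toy_rows, 0)  # attribute 0 separates the classes perfectly
gain_attr1 = information_gain(toy_rows, 1)  # attribute 1 carries no information
print(round(gain_attr0, 4), round(gain_attr1, 4))
```

Attribute 0 recovers the full 1 bit of target entropy, while attribute 1 recovers none, so a selection step like `get_selected_attribute` would choose attribute 0 here.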


In [None]:
!python3 test.py --ID EC_F_PES2UG23CS338_Lab3 --data mushrooms.csv

Running tests with PYTORCH framework
 target column: 'class' (last column)
Original dataset info:
Shape: (8124, 23)
Columns: ['cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat', 'class']

First few rows:

cap-shape: ['x' 'b' 's' 'f' 'k'] -> [5 0 4 2 3]

cap-surface: ['s' 'y' 'f' 'g'] -> [2 3 0 1]

cap-color: ['n' 'y' 'w' 'g' 'e'] -> [4 9 8 3 2]

class: ['p' 'e'] -> [1 0]

Processed dataset shape: torch.Size([8124, 23])
Number of features: 22
Features: ['cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 's

In [None]:
!python3 test.py --ID EC_F_PES2UG23CS338_Lab3 --data Nursery.csv

Running tests with PYTORCH framework
 target column: 'class' (last column)
Original dataset info:
Shape: (12960, 9)
Columns: ['parents', 'has_nurs', 'form', 'children', 'housing', 'finance', 'social', 'health', 'class']

First few rows:

parents: ['usual' 'pretentious' 'great_pret'] -> [2 1 0]

has_nurs: ['proper' 'less_proper' 'improper' 'critical' 'very_crit'] -> [3 2 1 0 4]

form: ['complete' 'completed' 'incomplete' 'foster'] -> [0 1 3 2]

class: ['recommend' 'priority' 'not_recom' 'very_recom' 'spec_prior'] -> [2 1 0 4 3]

Processed dataset shape: torch.Size([12960, 9])
Number of features: 8
Features: ['parents', 'has_nurs', 'form', 'children', 'housing', 'finance', 'social', 'health']
Target: class
Framework: PYTORCH
Data type: <class 'torch.Tensor'>

DECISION TREE CONSTRUCTION DEMO
Total samples: 12960
Training samples: 10368
Testing samples: 2592

Constructing decision tree using training data...

🌳 Decision tree construction completed using PYTORCH!

📊 OVERALL PERFORMANCE METR

In [None]:
!python3 test.py --ID EC_F_PES2UG23CS338_Lab3 --data tictactoe.csv

Running tests with PYTORCH framework
 target column: 'Class' (last column)
Original dataset info:
Shape: (958, 10)
Columns: ['top-left-square', 'top-middle-square', 'top-right-square', 'middle-left-square', 'middle-middle-square', 'middle-right-square', 'bottom-left-square', 'bottom-middle-square', 'bottom-right-square', 'Class']

First few rows:

top-left-square: ['x' 'o' 'b'] -> [2 1 0]

top-middle-square: ['x' 'o' 'b'] -> [2 1 0]

top-right-square: ['x' 'o' 'b'] -> [2 1 0]

Class: ['positive' 'negative'] -> [1 0]

Processed dataset shape: torch.Size([958, 10])
Number of features: 9
Features: ['top-left-square', 'top-middle-square', 'top-right-square', 'middle-left-square', 'middle-middle-square', 'middle-right-square', 'bottom-left-square', 'bottom-middle-square', 'bottom-right-square']
Target: Class
Framework: PYTORCH
Data type: <class 'torch.Tensor'>

DECISION TREE CONSTRUCTION DEMO
Total samples: 958
Training samples: 766
Testing samples: 192

Constructing decision tree using tra

1. Performance Comparison
Nursery Dataset:
o Largest dataset (12,960 rows, 8 features) with five target classes; several
attributes relate strongly to the target.
o Accuracy: Approximately 90% is achievable with a decision tree, though the
five class labels leave more room for error than a binary target.
o Precision/Recall/F1: Lower than for Mushrooms because the class
distribution is imbalanced.

Mushrooms Dataset:
o Clean, binary classification (edible vs poisonous).
o Certain features like odor split the dataset almost perfectly.
o Accuracy: Close to 100%.
o Precision/Recall/F1: Near-perfect, since splits strongly correlate with the
class.

Tic-Tac-Toe Dataset:
o Medium-sized, binary target.
o Accuracy: Around 85–90%. The outcome is a deterministic function of the
board, but each winning line spans three squares, so splits on a single square
capture it only indirectly.
o Precision/Recall/F1: Balanced, but the tree sometimes misclassifies unusual
board states.
Ranking by performance: Mushrooms performs best, while Nursery and Tic-Tac-Toe
are roughly on par.
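
The link between class imbalance and lower precision/recall/F1 can be seen in a small sketch. It assumes macro averaging (the averaging used by test.py is not visible in the truncated output), and `classification_metrics` and the toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 (assumed averaging)."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        # A class that is never predicted (or never present) scores 0 by convention here.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(classes)
    return {
        "accuracy": accuracy,
        "precision_macro": sum(precisions) / n,
        "recall_macro": sum(recalls) / n,
        "f1_macro": sum(f1s) / n,
    }


# Hypothetical imbalanced case: 8 samples of class 0, 2 of class 1,
# and a model that always predicts the majority class.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Accuracy is 0.8 in this toy case, but macro recall is only 0.5 because the minority class is never predicted, which is the pattern described for the Nursery dataset.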


2. Tree Characteristics Analysis
Tree Depth:
o Nursery: Deep. Depth around 10+.
o Mushrooms: Shallow, since features like odor separate almost immediately.
Depth approximately 4–6.
o Tic-Tac-Toe: Medium depth, depending on win conditions.

Number of Nodes:
o Nursery: High, because several attributes take three to five values and the
target has five classes.
o Mushrooms: Low, since 1–2 key features decide most splits.
o Tic-Tac-Toe: Medium, correlating with possible winning states.

Most Important Features:
o Nursery: health is the dominant split, followed by has_nurs and parents;
finance carries little information gain.
o Mushrooms: odor, gill size, spore print color.
o Tic-Tac-Toe: Central square, followed by corners.

Tree Complexity:
o Nursery: High complexity.
o Mushrooms: Low complexity.
o Tic-Tac-Toe: Medium complexity.
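
Depth and node count, as compared above, can be measured with two short recursive functions. This is a sketch under an assumed tree representation: an internal node is a dict with an `"attribute"` index and a `"children"` mapping from attribute value to subtree, and a leaf is a bare class label. The structure actually built by test.py is not shown, so this layout is hypothetical.

```python
def tree_depth(node):
    """Number of split levels on the longest root-to-leaf path (a lone leaf has depth 0)."""
    if not isinstance(node, dict):
        return 0
    return 1 + max((tree_depth(child) for child in node["children"].values()), default=0)


def count_nodes(node):
    """Total number of internal nodes and leaves."""
    if not isinstance(node, dict):
        return 1
    return 1 + sum(count_nodes(child) for child in node["children"].values())


# Hypothetical tree: split on attribute 4, then on attribute 7 along one branch.
toy_tree = {
    "attribute": 4,
    "children": {
        0: 0,  # leaf: class 0
        1: {"attribute": 7, "children": {0: 1, 1: 0}},
    },
}

depth = tree_depth(toy_tree)
nodes = count_nodes(toy_tree)
print(depth, nodes)
```

For the toy tree this gives a depth of 2 and 5 nodes (2 internal, 3 leaves).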


3. Dataset-Specific Insights
Nursery Dataset:
• Feature Importance: health dominates the splits, followed by has_nurs and
parents.
• Class Distribution: Imbalanced (some decisions like “recommend” are rare).
• Decision Patterns: “health = not_recom → not_recom” separates one class at the root.
• Overfitting: Risk is higher because multi-valued attributes and five classes
produce many small leaves; pruning is necessary.
Mushrooms Dataset:
• Feature Importance: Odor is the single strongest indicator (almost perfect
split).
• Class Distribution: Balanced (edible vs poisonous).
• Decision Patterns: “Foul odor means poisonous” emerges early.
• Overfitting: Minimal, since dataset is clean and separable.
Tic-Tac-Toe Dataset:
• Feature Importance: Center cell is most predictive.
• Class Distribution: Moderately imbalanced; roughly two thirds of the boards
are labelled positive.
• Decision Patterns: “X in center and X in corner implies positive outcome.”
• Overfitting: Possible if tree memorizes board positions rather than general
patterns.
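
The class-distribution statements above can be checked with a few lines of code. The sketch below uses made-up label counts purely for illustration (the label names are taken from the Nursery output, the counts are not), and `class_distribution` and `imbalance_ratio` are hypothetical helper names.

```python
from collections import Counter


def class_distribution(labels):
    """Share of each class label in the dataset."""
    total = len(labels)
    return {label: count / total for label, count in Counter(labels).items()}


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


# Hypothetical counts, not the real Nursery distribution.
toy_labels = ["not_recom"] * 5 + ["priority"] * 4 + ["recommend"] * 1

shares = class_distribution(toy_labels)
ratio = imbalance_ratio(toy_labels)
print(shares, ratio)
```

A ratio close to 1 indicates a balanced target; a large ratio signals that minority classes may be hard to predict.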


4. Comparative Analysis Report
a) Algorithm Performance:
• Highest Accuracy: Mushrooms dataset, because of strong attribute-class
correlation.
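• Dataset Size Effect: Nursery (12,960 rows) takes the longest to train and
yields the deepest tree. Mushrooms (8,124 rows) splits cleanly and efficiently.
Tic-Tac-Toe (958 rows) is small and manageable.
• Role of Features: The number of features alone does not determine complexity.
Mushrooms has the most features (22) yet the simplest tree, because a few
decisive features such as odor settle most cases; Nursery has only 8 features
but needs a deeper, more complex tree.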

b) Data Characteristics Impact:
• Class Imbalance: Nursery suffers from imbalance, which makes some minority
classes harder to predict. Mushrooms is balanced and gives strong results.
Tic-Tac-Toe is moderately imbalanced.
• Feature Types: All three datasets use multi-valued categorical features, so
every split is multi-way. Mushrooms and Tic-Tac-Toe have binary targets, which
makes clean separation easier; Nursery combines multi-valued features with five
classes, giving more complex splits and a higher overfitting risk.
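
Why multi-valued attributes raise the overfitting risk can be illustrated with an extreme case: an attribute with one distinct value per row reaches the maximum information gain even though it cannot generalize. The sketch below is a plain-Python illustration on hypothetical data; `info_gain` and the column names are assumptions.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def info_gain(values, labels):
    """Information gain of one attribute column with respect to the labels."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


# Hypothetical columns for eight rows.
labels = [0, 1, 0, 1, 0, 1, 0, 1]
binary_attr = [0, 0, 0, 0, 1, 1, 1, 1]  # two values, unrelated to the label
row_id_attr = [0, 1, 2, 3, 4, 5, 6, 7]  # one distinct value per row

gain_binary = info_gain(binary_attr, labels)
gain_row_id = info_gain(row_id_attr, labels)
print(round(gain_binary, 4), round(gain_row_id, 4))
```

The row-identifier column scores a gain of 1.0 against 0.0 for the binary column, because every subset it creates contains a single row and is therefore trivially pure.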

c) Practical Applications:
• Nursery: Admission decision support systems. Interpretable, but may need
pruning for real-world use.
• Mushrooms: Food safety classification. High accuracy, easily interpretable rules.
• Tic-Tac-Toe: Game AI explanation. Good interpretability for explaining strategies.


Interpretability Advantages:
• Nursery: Explains how financial/social factors affect admission.
• Mushrooms: Clear, human-readable rules.
• Tic-Tac-Toe: Explains why certain moves are critical.


Improvements:
• Nursery: Use tree pruning, and possibly replace the single tree with a
Random Forest, to reduce overfitting.
• Mushrooms: Already near-perfect, little improvement needed. Could compress
rules.
• Tic-Tac-Toe: Use feature engineering to reduce depth.


Observation:
• Mushrooms has the highest accuracy and the simplest tree.
• Nursery has the largest and most complex tree, with the highest risk of overfitting.
• Tic-Tac-Toe has medium complexity and interpretable decision paths.