### 2. Finite State Machine Generalization: 
#### (a)  Implement a program that automatically creates a set of if-then clauses from the training table of a binary dataset of your choice. Implement different strategies to minimize the number of if-then clauses. Document your strategies, the number of resulting conditional clauses, and the accuracy achieved.

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier # Decision Tree creates a set of if-then clauses
from sklearn.metrics import accuracy_score

In [2]:
encoder = OneHotEncoder()
def fit_measure_test(tree, X, y, print_if_statements=True):
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    
    # Fit the tree model
    tree.fit(X_train, y_train)
    
    feature_names = encoder.get_feature_names_out()

    if_statement_count = 0
    # Loop through the tree to print and count if statements
    for i in range(tree.tree_.node_count):
        if tree.tree_.children_left[i] != -1: 
            feature = feature_names[tree.tree_.feature[i]]
            threshold = tree.tree_.threshold[i]
            if print_if_statements:
                print(f"Node {i}: if {feature} <= {threshold}")
            if_statement_count += 1

    

    # Predict and calculate accuracy
    y_pred = tree.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    if print_if_statements:
        print(f"\nNumber of if statements in decision tree: {if_statement_count}")
        print(f"\n\nAccuracy: {accuracy}")

    return if_statement_count, accuracy



In [3]:
# Default Decision Tree
default_tree = DecisionTreeClassifier(random_state=0) 

# I optimized my tree on the mushroom dataset by minimizing the number of if-then clauses until the accuracy was negatively impacted 
my_optimized_tree = DecisionTreeClassifier(max_depth=5, random_state=0)


In [4]:
# !pip install ucimlrepo

from ucimlrepo import fetch_ucirepo

# Import binary mushroom dataset from https://archive.ics.uci.edu/dataset/73/mushroom 
mushroom = fetch_ucirepo(id=73)
X = mushroom.data.features
y = mushroom.data.targets
# X in not in numbers, so we need to one hot encode
X_encoded = encoder.fit_transform(X)

In [5]:
# Trying with default DecisionTreeClassifier
default_size, default_acc = fit_measure_test(default_tree, X_encoded, y)


Node 0: if odor_n <= 0.5
Node 1: if stalk-root_c <= 0.5
Node 2: if stalk-surface-below-ring_y <= 0.5
Node 3: if odor_l <= 0.5
Node 4: if odor_a <= 0.5
Node 9: if stalk-surface-below-ring_s <= 0.5
Node 12: if spore-print-color_r <= 0.5
Node 13: if stalk-surface-below-ring_y <= 0.5
Node 14: if cap-surface_g <= 0.5
Node 15: if cap-shape_c <= 0.5
Node 16: if gill-size_b <= 0.5
Node 17: if population_c <= 0.5
Node 23: if ring-type_p <= 0.5

Number of if statements in decision tree: 13


Accuracy: 1.0


In [6]:
# Trying with my optimized DecisionTreeClassifier
my_optimized_tree_size, my_optimized_tree_acc = fit_measure_test(my_optimized_tree, X_encoded, y)

Node 0: if odor_n <= 0.5
Node 1: if stalk-root_c <= 0.5
Node 2: if stalk-surface-below-ring_y <= 0.5
Node 3: if odor_l <= 0.5
Node 4: if odor_a <= 0.5
Node 9: if stalk-surface-below-ring_s <= 0.5
Node 12: if spore-print-color_r <= 0.5
Node 13: if stalk-surface-below-ring_y <= 0.5
Node 14: if cap-surface_g <= 0.5
Node 15: if cap-shape_c <= 0.5
Node 19: if stalk-root_b <= 0.5

Number of if statements in decision tree: 11


Accuracy: 1.0


In [7]:
print(f'Reduced the number of if-then clauses in tree from {default_size:.3f} to {my_optimized_tree_size:.3f}, Ratio: {default_size/my_optimized_tree_size:.3f}, while maintaining SAME accuracy: {default_acc:.3f} = {my_optimized_tree_acc:.3f}')

Reduced the number of if-then clauses in tree from 13.000 to 11.000, Ratio: 1.182, while maintaining SAME accuracy: 1.000 = 1.000


#### (b) Use the algorithms developed in (a) on different datasets. Again, observe how your choices make a difference.

In [8]:
# Import binary raisin dataset from https://archive.ics.uci.edu/dataset/850/raisin
raisin = fetch_ucirepo(id=850) 
  
X = raisin.data.features 
y = raisin.data.targets 

In [9]:
default_size, default_acc = fit_measure_test(default_tree, X, y)

Node 0: if cap-shape_c <= 424.3491516113281
Node 1: if cap-surface_f <= 1122.4955444335938
Node 2: if cap-shape_k <= 0.8710394203662872
Node 3: if cap-shape_k <= 0.8191300630569458
Node 4: if cap-shape_x <= 0.7351054549217224
Node 5: if cap-shape_x <= 0.7346012592315674
Node 6: if cap-surface_f <= 1006.4865112304688
Node 7: if cap-shape_x <= 0.6815016269683838
Node 8: if cap-shape_x <= 0.6812853217124939
Node 9: if cap-shape_s <= 50049.5
Node 11: if cap-shape_s <= 57456.5
Node 12: if cap-surface_f <= 916.6060180664062
Node 13: if cap-shape_b <= 48691.0
Node 16: if cap-shape_b <= 53680.0
Node 18: if cap-shape_k <= 0.7868431210517883
Node 23: if cap-shape_b <= 62397.0
Node 24: if cap-surface_f <= 772.1489868164062
Node 25: if cap-shape_b <= 40552.5
Node 29: if cap-shape_c <= 344.65069580078125
Node 32: if cap-shape_b <= 67346.0
Node 33: if cap-shape_c <= 370.31199645996094
Node 34: if cap-shape_s <= 65380.0
Node 38: if cap-shape_x <= 0.6463422477245331
Node 40: if cap-surface_f <= 1029.8

In [10]:
my_optimized_tree_size, my_optimized_tree_acc = fit_measure_test(my_optimized_tree, X, y)

Node 0: if cap-shape_c <= 424.3491516113281
Node 1: if cap-surface_f <= 1122.4955444335938
Node 2: if cap-shape_k <= 0.8710394203662872
Node 3: if cap-shape_k <= 0.8191300630569458
Node 4: if cap-shape_x <= 0.7351054549217224
Node 7: if cap-shape_f <= 216.9763641357422
Node 10: if cap-shape_k <= 0.8890786170959473
Node 13: if cap-shape_x <= 0.7291818559169769
Node 14: if cap-shape_k <= 0.7634775638580322
Node 15: if cap-shape_f <= 267.3236083984375
Node 18: if cap-shape_x <= 0.6716567873954773
Node 21: if cap-surface_f <= 1131.0899658203125
Node 23: if cap-shape_c <= 378.50506591796875
Node 26: if cap-shape_c <= 463.8451232910156
Node 27: if cap-shape_b <= 87000.5
Node 28: if cap-surface_f <= 1180.7925415039062
Node 29: if cap-shape_b <= 83993.0
Node 33: if cap-shape_k <= 0.7769868075847626
Node 34: if cap-shape_x <= 0.6314263343811035
Node 37: if cap-shape_c <= 447.36480712890625
Node 40: if cap-surface_f <= 2022.8614501953125
Node 41: if cap-surface_f <= 1196.0665283203125
Node 42: i

In [11]:
print(f'Reduced the number of if-then clauses in tree from {default_size:.3f} to {my_optimized_tree_size:.3f}, Ratio: {default_size/my_optimized_tree_size:.3f}, while maintaining roughly the SAME accuracy: {default_acc:.3f} ≈ {my_optimized_tree_acc:.3f}')

Reduced the number of if-then clauses in tree from 85.000 to 25.000, Ratio: 3.400, while maintaining roughly the SAME accuracy: 0.833 ≈ 0.883


#### (c) Finally, use the programs developed in (a) on a completely random dataset, generated artificially. Vary your strategies but also the number of input columns as well as the number of instances. How many if-then clauses do you need?

In [12]:
import numpy as np
np.random.seed(1)

def generate_equal_classes(n_full, num_classes):
    if n_full % num_classes != 0:
        raise ValueError("n_full is not evenly divisible by num_classes")
    
    # Number of instances per class
    n_per_class = n_full // num_classes
    
    # Generate and shuffle array
    classes = np.concatenate([np.full(n_per_class, i) for i in range(num_classes)])
    np.random.shuffle(classes)
    
    return classes.tolist()

In [13]:
n_full = 256
D = 8
num_classes = 2 # Binary classification

X = np.random.rand(n_full, D)
y = generate_equal_classes(n_full, num_classes)

In [14]:
default_size, default_acc = fit_measure_test(default_tree, X, y)

Node 0: if cap-shape_x <= 0.6881290972232819
Node 1: if cap-shape_x <= 0.03581613302230835
Node 2: if cap-shape_f <= 0.7104642689228058
Node 4: if cap-surface_g <= 0.12292011454701424
Node 7: if cap-shape_k <= 0.1572951152920723
Node 8: if cap-shape_s <= 0.3836905211210251
Node 9: if cap-shape_x <= 0.6371204853057861
Node 12: if cap-surface_g <= 0.9645165801048279
Node 13: if cap-shape_x <= 0.06159835867583752
Node 15: if cap-shape_x <= 0.2226150631904602
Node 16: if cap-shape_x <= 0.149416945874691
Node 21: if cap-surface_g <= 0.010023790411651134
Node 23: if cap-shape_x <= 0.49194470047950745
Node 24: if cap-shape_x <= 0.2906492054462433
Node 25: if cap-surface_g <= 0.9383832514286041
Node 26: if cap-shape_x <= 0.1292453333735466
Node 27: if cap-shape_x <= 0.11197086796164513
Node 28: if cap-shape_b <= 0.4062817245721817
Node 29: if cap-surface_f <= 0.8671679198741913
Node 32: if cap-shape_k <= 0.9295432865619659
Node 34: if cap-shape_k <= 0.9607635140419006
Node 38: if cap-shape_c <

In [15]:
my_optimized_tree_size, my_optimized_tree_acc = fit_measure_test(my_optimized_tree, X, y)

Node 0: if cap-shape_x <= 0.6881290972232819
Node 1: if cap-shape_x <= 0.03581613302230835
Node 2: if cap-shape_f <= 0.7104642689228058
Node 4: if cap-surface_g <= 0.12292011454701424
Node 7: if cap-shape_k <= 0.1572951152920723
Node 8: if cap-shape_s <= 0.3836905211210251
Node 9: if cap-shape_x <= 0.6371204853057861
Node 12: if cap-surface_g <= 0.9645165801048279
Node 15: if cap-surface_g <= 0.010023790411651134
Node 17: if cap-shape_x <= 0.49194470047950745
Node 20: if cap-surface_g <= 0.9467024803161621
Node 21: if cap-shape_f <= 0.8739853501319885
Node 22: if cap-shape_x <= 0.909618079662323
Node 23: if cap-shape_c <= 0.06482357904314995
Node 26: if cap-surface_g <= 0.16831564158201218
Node 29: if cap-shape_b <= 0.3684689998626709
Node 30: if cap-surface_f <= 0.25721919327042997

Number of if statements in decision tree: 17


Accuracy: 0.4423076923076923


In [16]:
print(f'Reduced the number of if-then clauses in tree from {default_size:.3f} to {my_optimized_tree_size:.3f}, Ratio: {default_size/my_optimized_tree_size:.3f}, while maintaining roughly the SAME accuracy: {default_acc:.3f} ≈ {my_optimized_tree_acc:.3f}\n This is pretty much best guess')

Reduced the number of if-then clauses in tree from 51.000 to 17.000, Ratio: 3.000, while maintaining roughly the SAME accuracy: 0.423 ≈ 0.442
 This is pretty much best guess


#### Behavior on Random Data:
- While my optimized model is smaller, its not exactly performing better!
- Both models are essentially best guess
- This means that both models are way larger than needed.

To prove this lets make the **smallest tree possible**

In [17]:

smallest_possible_tree = DecisionTreeClassifier(max_depth=1, random_state=0)
smallest_size, smallest_acc = fit_measure_test(smallest_possible_tree, X, y, print_if_statements=True)

print(f"\nNotice how the accuracy doesn't meaningly change: Default Accuracy={default_acc:.3f}, Optimized Accuracy={my_optimized_tree_acc:.3f}, Smallest Accuracy={smallest_acc:.3f}")


Node 0: if cap-shape_x <= 0.6881290972232819

Number of if statements in decision tree: 1


Accuracy: 0.4230769230769231

Notice how the accuracy doesn't meaningly change: Default Accuracy=0.423, Optimized Accuracy=0.442, Smallest Accuracy=0.423


#### Experimenting with different training table sizes

In [18]:
# 
def run_experiment(D: int, n_full: int):
    n_default_size = 0
    n_default_acc = 0
    n_my_optimized_tree_size = 0
    n_my_optimized_tree_acc = 0
    n_smallest_size = 0
    n_smallest_acc = 0
    
    n_experiments = 50 # Run experiment 50 times 
    for i in range(n_experiments):
        X = np.random.rand(n_full, D) 
        y = generate_equal_classes(n_full, num_classes=2) # Binary classification
        default_size, default_acc = fit_measure_test(default_tree, X, y, print_if_statements=False)
        n_default_size += default_size
        n_default_acc += default_acc
        
        my_optimized_tree_size, my_optimized_tree_acc = fit_measure_test(my_optimized_tree, X, y, print_if_statements=False)
        n_my_optimized_tree_size += my_optimized_tree_size
        n_my_optimized_tree_acc += my_optimized_tree_acc

        smallest_size, smallest_tree_acc = fit_measure_test(smallest_possible_tree, X, y, print_if_statements=False)
        n_smallest_size += smallest_size
        n_smallest_acc += smallest_tree_acc
    
    default_size = n_default_size/n_experiments
    default_acc = n_default_acc/n_experiments

    my_optimized_tree_size = n_my_optimized_tree_size/n_experiments
    my_optimized_tree_acc = n_my_optimized_tree_acc/n_experiments

    smallest_size = n_smallest_size/n_experiments
    smallest_tree_acc = n_smallest_acc/n_experiments

    print(f'Sizes: Avg Default Size={default_size}, Avg Optimized Size={my_optimized_tree_size}, Avg Smallest Size={smallest_size}')
    print(f'Accuracy: Avg Default Accuracy={default_acc}, Avg Optimized Accuracy={my_optimized_tree_acc}, Avg Smallest Accuracy={smallest_tree_acc}\n')


    

In [19]:
run_experiment(D=2, n_full=4)
run_experiment(D=4, n_full=16) 
run_experiment(D=8, n_full=256)

Sizes: Avg Default Size=1.12, Avg Optimized Size=1.12, Avg Smallest Size=1.0
Accuracy: Avg Default Accuracy=0.34, Avg Optimized Accuracy=0.34, Avg Smallest Accuracy=0.34

Sizes: Avg Default Size=3.36, Avg Optimized Size=3.36, Avg Smallest Size=1.0
Accuracy: Avg Default Accuracy=0.455, Avg Optimized Accuracy=0.455, Avg Smallest Accuracy=0.44

Sizes: Avg Default Size=47.86, Avg Optimized Size=15.78, Avg Smallest Size=1.0
Accuracy: Avg Default Accuracy=0.4992307692307694, Avg Optimized Accuracy=0.5011538461538462, Avg Smallest Accuracy=0.49192307692307674



#### Conclusion:
- **6.2 (a) Developed an Optimized Decision Tree on the Mushroom Dataset**
    - Kept decreasing the max depth until accuracy was negatively impacted
    
- **6.2 (b) Tested Optimized Decision Tree on the the more complex Raisin Dataset**
    - The optimizations made on the Mushroom Dataset carried over to the Raisin Dataset allowing Better accuracy with less if-then statements

- **6.2 (c) Tested Optimized Decision Tree on Random Data**
    - Immediately noticed that **both** the optimized and default trees archived **best guess accuracy**
    - Theorized that a model with only one decision would perform the same, so built the **smallest_possible_tree**
    - **Tested**: *Default, Optimized, and Smallest Trees* on varying sizes of random data, all performed effectively the SAME, thus proving my theory 
    - Did observe that as Dimensions (D) grow, the **accuracy of all three** improved up to best guess (50%)
