# UTSA CS 3793/5233: Assignment-3

**Anderson - Kyle - (jvh640)**






## Learning Objectives

Implement 2 different machine learning algorithms
*   Stochastic Gradient Descent
*   ID3 Decision Tree



## Description

This assignment focuses on **machine learning**, specifically the implementation of two algorithms: Stochastic Gradient Descent and the ID3 decision tree.
The assignment is divided into two sections, each for one unique ML algorithm.

The base structure and comments describe what should be done. You may use libraries that support your implementation. However, you **CANNOT** use a library that provides a complete implementation of these ML algorithms. You may reuse pieces of code found online, but please cite the source properly.


## Import Libraries

Write all the import statements here, for both algorithm implementations. As mentioned before, you cannot use any premade ML libraries.

In [1]:
# import all required libraries
import numpy as np # Performs computations in C, so it has better performance than the native math lib and has multidimensional array support
import pandas as pd

In [2]:
# Assume that the data files are in the following folder -- THIS WILL BE USED BY THE TA
basePath = "/content/drive/My Drive/Colab Notebooks/Artificial Intelligence/Data/"


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Stochastic Gradient Descent

In this section, you will implement the Stochastic Gradient Descent algorithm. The training is for a **binary classification** task i.e. each instance will have a class value of 0 or 1. Also, assume that you are given **all binary-valued attributes** and that there are **no missing values** in the train or test data.
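
As a rough illustration of the per-instance update this section asks for, the sketch below applies one sigmoid-based weight update to a single made-up instance. The names `sigmoid` and `sgd_step` and the sample values are hypothetical and not part of the assignment; the bias is assumed to be handled by prepending a constant 1 to the attribute vector.

```python
import math

def sigmoid(h):
    """Logistic activation: maps any real h into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-h))

def sgd_step(weights, x, y, learning_rate):
    """Return the weights after one stochastic update on instance (x, y).

    x is assumed to already include a leading 1 for the bias term.
    """
    h = sum(w_i * x_i for w_i, x_i in zip(weights, x))
    z = sigmoid(h)
    # Move each weight in proportion to the prediction error and its input
    return [w_i + learning_rate * (y - z) * x_i for w_i, x_i in zip(weights, x)]

# Hypothetical instance: bias input 1, two binary attributes, class label 1
new_weights = sgd_step([0.0, 0.0, 0.0], [1, 1, 0], 1, 0.05)
print(new_weights)
```

Because the weights start at zero, the model outputs 0.5 for this instance, so only the weights whose input is 1 move, each by `learning_rate * 0.5`.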


## Algorithm

(40 points)

Following are the data files that will be provided to you for the gradient descent algorithm implementation.

*   Training file - 'gd-train.dat'
*   Testing file - 'gd-test.dat'

In these files, only non-space characters are relevant. The first line contains the attribute names. All the other lines are different example instances to be used for the algorithm. Each column holds values of the attributes, whereas the last column holds the class label for that instance.
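
The file layout described above can be read with plain string splitting. The following is a minimal sketch that parses an in-memory string rather than the real data files; the function name `parse_dat` and the sample rows are made up for illustration.

```python
def parse_dat(text):
    """Split whitespace-delimited text into attribute names, feature rows, and labels."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    # Every column except the last is an attribute; the last is the class label
    features = [[int(token) for token in row[:-1]] for row in rows]
    labels = [int(row[-1]) for row in rows]
    return header[:-1], features, labels

# Made-up sample in the described layout (not taken from gd-train.dat)
sample = """A1 A2 A3 class
1 0 1 1
0 0 1 0
"""
names, X, y = parse_dat(sample)
print(names, X, y)
```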

Write the code in the following code blocks; the structure is provided. Instructions on the steps to follow are provided as comments.



In [4]:
# Data file name variables
train = basePath + "gd-train.dat"
test = basePath + "gd-test.dat"


In [5]:
# Read the training and testing data files
def read_instances(path):
  """Return one {attribute_name: int_value} dict per well-formed data row."""
  instances = []
  with open(path, 'r') as data_file:
    # The first line holds the attribute names; the last name is the class label
    legend = data_file.readline().split()
    for line in data_file:
      tokens = line.split()
      # Skip blank or malformed rows instead of appending an empty instance
      if len(tokens) != len(legend):
        continue
      instances.append({name: int(token) for name, token in zip(legend, tokens)})
  return instances

sgd_training_data = read_instances(train)
sgd_testing_data = read_instances(test)


In [6]:
# Activation Function - implement Sigmoid
def activation_function(h):
    # given 'h' compute and return 'z' based on the activation function implemented
    z = 1 / (1 + np.exp(-h))
    return z


In [7]:
# Train the model using the given training dataset and the learning rate
# return the "weights" learnt for the perceptron - note that this implementation stores the weight associated with the bias as the first entry
def train(train_data, learning_rate=0.05):
    # initialize weights to 0: one per attribute plus one for the bias
    # (an instance holds every attribute plus the class label, so its length equals that count)
    weights = [0.0] * len(train_data[0])
    # go through each training data instance
    for instance in train_data:
        # get 'x' as one multi-variate data instance and 'y' as the ground truth class label
        instance_values = list(instance.values())
        y = instance_values[-1]
        x = instance_values[:-1]
        # Insert the constant input 1 for the bias at the front of 'x'
        x.insert(0, 1)
        # obtain h(x)
        h = 0
        for i in range(len(x)):
          h += x[i] * weights[i]
        # call the activation function with 'h' as parameter to obtain 'z'
        z = activation_function(h)
        # update all weights individually using learning_rate, (y-z), and the corresponding 'x'
        for i in range(len(weights)):
          weights[i] = weights[i] + learning_rate * (y - z) * x[i]
    # return the final learnt weights
    return weights

In [8]:
# Test the model (weights learnt) using the given test dataset
# return the accuracy value as a fraction between 0 and 1
def test(test_data, weights, threshold):
    # go through each testing data instance
    correct_predictions = 0
    for instance in test_data:
        # get 'x' as one multi-variate data instance and 'y' as the ground truth class label
        instance_values = list(instance.values())
        y = instance_values[-1]
        x = instance_values[:-1]
        # Insert the constant input 1 for the bias at the front of 'x'
        x.insert(0, 1)
        # obtain h(x)
        h = 0
        for i in range(len(x)):
          h += x[i] * weights[i]
        # call the activation function with 'h' as parameter to obtain 'z'
        z = activation_function(h)
        # use 'threshold' to convert 'z' to either 0 or 1 so as to match to the ground truth binary labels
        # (a value >= threshold is classified as 1, otherwise 0)
        prediction = 1 if z >= threshold else 0
        # compare the thresholded 'z' with 'y' to count the correctly classified instances
        if prediction == y:
          correct_predictions += 1
    accuracy = correct_predictions / len(test_data)
    # return the accuracy value for the given test dataset
    return accuracy


In [9]:
# Gradient Descent function
def gradient_descent(df_train, df_test, learning_rate=0.05, threshold=0.5):
    # call the train function to train the model and obtain the weights
    weights = train(df_train, learning_rate)
    # call the test function with the training dataset to obtain the training accuracy
    training_accuracy = test(df_train, weights,threshold)
    # call the test function with the testing dataset to obtain the testing accuracy
    testing_accuracy = test(df_test, weights, threshold)
    # return (trainAccuracy, testAccuracy)
    return (training_accuracy, testing_accuracy)


In [10]:
# Threshold of 0.5 will be used to classify the instance for the test. If the value is >= 0.5, classify as 1 or else 0.
threshold = 0.5


In [11]:
# Main algorithm loop
# Loop through all the different learning rates [0.05, 1]
learning_rates = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1]
for lr in learning_rates:
  # For each learning rate selected, call the gradient descent function to obtain the train and test accuracy values
  training_accuracy, testing_accuracy = gradient_descent(sgd_training_data, sgd_testing_data, lr, threshold)
  # Print both the accuracy values as "Accuracy for LR of 0.1 on Training set = x %" OR "Accuracy for LR of 0.1 on Testing set = x %"
  # (the accuracies are fractions in [0, 1], so scale by 100 before printing them as percentages)
  print(f"Accuracy for LR of {lr} on Training set = {training_accuracy * 100:.2f} %")
  print(f"Accuracy for LR of {lr} on Testing set = {testing_accuracy * 100:.2f} %\n")


Accuracy for LR of 0.05 on Training set = 68.00 %
Accuracy for LR of 0.05 on Testing set = 72.25 %

Accuracy for LR of 0.1 on Training set = 68.00 %
Accuracy for LR of 0.1 on Testing set = 71.75 %

Accuracy for LR of 0.15 on Training set = 68.00 %
Accuracy for LR of 0.15 on Testing set = 71.75 %

Accuracy for LR of 0.2 on Training set = 69.00 %
Accuracy for LR of 0.2 on Testing set = 71.50 %

Accuracy for LR of 0.25 on Training set = 69.00 %
Accuracy for LR of 0.25 on Testing set = 71.50 %

Accuracy for LR of 0.3 on Training set = 69.00 %
Accuracy for LR of 0.3 on Testing set = 71.25 %

Accuracy for LR of 0.35 on Training set = 69.00 %
Accuracy for LR of 0.35 on Testing set = 70.50 %

Accuracy for LR of 0.4 on Training set = 70.00 %
Accuracy for LR of 0.4 on Testing set = 70.75 %

Accuracy for LR of 0.45 on Training set = 69.00 %
Accuracy for LR of 0.45 on Testing set = 69.00 %

Accuracy for LR of 0.5 on Training set = 69.00 %
Accuracy for LR of 0.5 on Testing set = 69.50 %

Accuracy for LR of 0.55 on Training set = 69.00 %
Accuracy for LR of 0.55 on Testing

## Extra Credit - Accuracy Plots

(05 points)

Use the above accuracy results on the training and testing data and write code to plot the graphs as mentioned in the code block below.
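
One possible way to lay out these plots is sketched below. It assumes matplotlib is available in the notebook environment; the helper names `learning_rate_grid` and `plot_accuracies`, and the `train_acc` / `test_acc` lists mentioned afterwards, are hypothetical and would need to be filled from the learning-rate loop.

```python
def learning_rate_grid(start=0.05, stop=1.0, step=0.05):
    """Learning rates from start to stop inclusive, rounded to avoid float drift."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]

def plot_accuracies(rates, train_acc, test_acc):
    """Draw two side-by-side line plots: training and testing accuracy vs. learning rate."""
    # Imported lazily so the helper above works without matplotlib installed
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    panels = zip(axes, (train_acc, test_acc), ("Training data", "Testing data"))
    for ax, values, title in panels:
        ax.plot(rates, values, marker="o")
        ax.set_title(title)
        ax.set_xlabel("Learning rate")
        ax.set_ylabel("Accuracy")
    fig.tight_layout()
    plt.show()

rates = learning_rate_grid()
print(rates)
```

Calling `plot_accuracies(rates, train_acc, test_acc)` with two accuracy lists of the same length as `rates` would then draw one graph per dataset.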



In [None]:
# Plot the graphs for accuracy results.
# There will be 2 graphs - one for training data and the other for testing data
# For each graph,
    # X-axis will be the learning rate going from 0.05 to 1 in increments of 0.05
    # Y-axis will be the accuracy values at the selected learning rate.



# ID3 Decision Tree

In this section, you will implement the ID3 Decision Tree algorithm. The training is for a **binary classification** task i.e. each instance will have a class value of 0 or 1. Also, assume that there are **no missing values** in the train or test data.


## Algorithm

(85 points)

Following are the data files that will be provided to you for the ID3 algorithm implementation.

*   Training file - 'id3-train.dat'
*   Testing file - 'id3-test.dat'

In these files, only non-space characters are relevant. The first line contains the attribute names. All the other lines are example instances to be used for the algorithm. Each column holds values of the attributes, whereas the last column holds the class label for that instance.

In a decision tree, if you reach a leaf node but still have examples that belong to different classes, then choose the most frequent class (among the instances at the leaf node). If you reach a leaf node in the decision tree and have no examples left or the examples are equally split among multiple classes, then choose the class that is most frequent in the entire training set. You do not need to implement pruning. Also, don’t forget to use logarithm base 2 when computing entropy and set (0 log 0) to 0.
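
The entropy convention and the leaf-labelling rules above can be sketched with the standard library alone. The function names below (`entropy`, `information_gain`, `leaf_label`) and the list-of-rows data layout are assumptions made for illustration, not the structure the assignment prescribes.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits; absent classes are never visited, so 0 log 0 counts as 0."""
    total = len(labels)
    if total == 0:
        return 0.0
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """Entropy of labels minus the size-weighted entropy after splitting on one attribute."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute_index], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

def leaf_label(labels, training_majority):
    """Most frequent class at the leaf; fall back to the training-set majority on a tie or no examples."""
    counts = Counter(labels).most_common()
    if not counts or (len(counts) > 1 and counts[0][1] == counts[1][1]):
        return training_majority
    return counts[0][0]

# Made-up data: the single attribute separates the two classes perfectly
print(information_gain([[0], [0], [1], [1]], [0, 0, 1, 1], 0))
```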

Write the code in the following code blocks; the structure is provided. Instructions on the steps to follow are provided as comments. The code should output the following 3 things:

*   Print the Decision Tree created, in the following example format:

    ```
    attr1 = 0 :
        attr2 = 0 :
            attr3 = 0 : 1 -- 4
            attr3 = 1 : 0 -- 9
        attr2 = 1 :
            attr4 = 0 : 0 -- 2
            attr4 = 1 : 1 -- 10
    attr1 = 1 :
        attr2 = 1 : 1 -- 17

    ```

*   Accuracy on the Training data = x %
*   Accuracy on the Test data = x %
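
To make the requested tree format concrete, here is a small sketch that renders a nested-dict tree in that layout. The dictionary structure (`"attribute"`, `"children"`, `"label"`, `"count"`) and the function name `format_tree` are hypothetical choices for this example only.

```python
def format_tree(node, depth=0):
    """Return the lines of an indented text rendering of a nested-dict decision tree."""
    lines = []
    indent = "    " * depth
    for value, child in sorted(node["children"].items()):
        prefix = f"{indent}{node['attribute']} = {value} :"
        if "label" in child:
            # Leaf: append the predicted class and the number of training instances
            lines.append(f"{prefix} {child['label']} -- {child['count']}")
        else:
            lines.append(prefix)
            lines.extend(format_tree(child, depth + 1))
    return lines

# Hypothetical two-level tree; leaves are {"label": ..., "count": ...}
tree = {
    "attribute": "attr1",
    "children": {
        0: {"attribute": "attr2", "children": {
            0: {"label": 1, "count": 4},
            1: {"label": 0, "count": 9},
        }},
        1: {"label": 1, "count": 17},
    },
}
print("\n".join(format_tree(tree)))
```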





In [None]:
# Create the data structure: one node of the decision tree
class Node:
  def __init__(self):
    self.__label = None
    self.__children = {}
    self.__is_leaf = False
    self.__class_label = None
    self.__total_instances = 0

  def get_label(self):
    return self.__label

  def get_children(self):
    return self.__children

  def get_is_leaf(self):
    return self.__is_leaf

  def get_class_label(self):
    return self.__class_label

  def get_total_instances(self):
    return self.__total_instances

  def set_label(self, label):
    self.__label = label

  def set_children(self, children):
    self.__children = children

  def set_is_leaf(self, is_leaf):
    self.__is_leaf = is_leaf

  def set_class_label(self, class_label):
    self.__class_label = class_label

  def set_total_instances(self, total_instances):
    self.__total_instances = total_instances

  def add_child(self, value, child_node):
    self.get_children().update({value: child_node})


  def __str__(self, level=0):
    ret = ""
    if self.get_is_leaf():
        ret += f" : {self.get_class_label()} -- {self.get_total_instances()}"
    ret += "\n"

    for value, child_node in sorted(self.get_children().items()):
        ret += f"{'   ' * level} {self.get_label()} = {value}"
        ret += child_node.__str__(level + 1)

    return ret

# Data file name variables
train = basePath + "id3-train.dat"
test = basePath + "id3-test.dat"



In [None]:
# Pseudocode for the ID3 algorithm. Use this to create function(s).
def ID3(all_data, data, root, attributesRemaining):
  class_count = data['class'].value_counts()

  # If all the instances have only one class label
  if data['class'].nunique() == 1:
    # Make this as the leaf node and use the label as the class value of the node and return the updated tree
    root.set_is_leaf(True)
    root.set_class_label(data['class'].unique()[0])
    root.set_total_instances(data.shape[0])
    return

  # If you reach a leaf node in the decision tree and have no examples left or the examples are equally split among multiple classes
  elif data.empty or ((data['class'].nunique() != 1) and (class_count[0] == class_count[1])):
    root.set_is_leaf(True)
    all_class_count = all_data['class'].value_counts()
    # Choose and the class that is most frequent in the entire training set and return the updated tree
    root.set_class_label(all_class_count.idxmax())
    root.set_total_instances(data.shape[0])
    return

  # If you reached a leaf node but still have examples that belong to different classes (there are no remaining attributes to be split)
  elif not attributesRemaining and (data['class'].nunique() != 1):
    # Assign the most frequent class among the instances at the leaf node and return the updated tree
    root.set_is_leaf(True)
    root.set_class_label(class_count.idxmax())
    root.set_total_instances(data.shape[0])
    return

  # Entropy of the class labels at the current node.
  # The probabilities are taken over the instances at this node, not over the whole training set.
  curr_class_count = data['class'].value_counts()
  curr_instances = data.shape[0]
  curr_false = curr_class_count.get(0, 0) / curr_instances
  curr_true = curr_class_count.get(1, 0) / curr_instances
  curr_entropy = 0
  # Treat 0 * log2(0) as 0 by skipping classes that do not occur
  if curr_false > 0:
    curr_entropy += (-curr_false * np.log2(curr_false))
  if curr_true > 0:
    curr_entropy += (-curr_true * np.log2(curr_true))

  aig = {}

  for attribute in attributesRemaining:
    weighted_avg_entropy = 0
    # Attributes are assumed to be binary-valued (0 or 1)
    for value in (0, 1):
      subset_classes = data[data[attribute] == value]['class']
      subset_instances = subset_classes.shape[0]
      # An empty subset contributes nothing to the weighted average (and would otherwise divide by zero)
      if subset_instances == 0:
        continue
      subset_entropy = 0
      # value_counts() only lists classes that occur, so log2 never receives 0
      for count in subset_classes.value_counts():
        p = count / subset_instances
        subset_entropy += (-p * np.log2(p))
      weighted_avg_entropy += (subset_instances / curr_instances) * subset_entropy

    # Information gain of splitting the current node on this attribute
    ig = curr_entropy - weighted_avg_entropy
    aig.update({attribute: ig})

  best_attribute = max(aig, key=aig.get)
  # Build a new list without the chosen attribute (avoids removing items from a list while iterating over it)
  new_attributes_remaining = [attribute for attribute in attributesRemaining if attribute != best_attribute]

  root.set_label(best_attribute)

  # Not Leaf

  for value in data[best_attribute].unique():

    # Split the tree using the best attribute and recursively call the ID3 function using DFS to fill the sub-tree
    child_node = Node()
    new_data = data[data[best_attribute] == value].drop(columns=[best_attribute])

    ID3(all_data, new_data, child_node, new_attributes_remaining)
    root.add_child(value, child_node)

  return



In [None]:
# Following is the base code structure. Feel free to change the code structure as you see fit, maybe even create more functions.
# Read the first line in the training data file, to get the number of attributes
# Read all the training instances and the ground truth class labels.
train_instances_df = pd.read_csv(train,delimiter='\t')
test_instances_df = pd.read_csv(test,delimiter='\t')

attributes = list(train_instances_df.columns)
# Remove class from attributes list
attributes.pop(-1)

# Create the decision tree by implementing the ID3 algorithm. Pseudocode provided above.
tree_root = Node()
ID3(train_instances_df, train_instances_df.copy(), tree_root, attributes)

# Print the tree in the example format mentioned.
print(tree_root)

# Use the above created tree to predict the output label of each instance,
# compare it with the ground truth class label and calculate the accuracy accordingly
def tree_accuracy(root, instances_df):
  correct = 0
  for index, instance in instances_df.iterrows():
    # Walk from the root down to a leaf, following the branch that matches the instance's attribute value
    curr_node = root
    while not curr_node.get_is_leaf():
      node_value = instance[curr_node.get_label()]
      curr_node = curr_node.get_children().get(node_value)
    if curr_node.get_class_label() == instance['class']:
      correct += 1
  return correct / instances_df.shape[0]

# Predict the training data and print the accuracy as "Accuracy on the Training data = x %"
training_acc = tree_accuracy(tree_root, train_instances_df)
print(f"Accuracy on the Training data = {training_acc * 100:.2f} %")

# Predict the testing data and print the accuracy as "Accuracy on the Test data = x %"
test_acc = tree_accuracy(tree_root, test_instances_df)
print(f"Accuracy on the Test data = {test_acc * 100:.2f} %")




 attr5 = 0
    attr6 = 0
       attr2 = 0
          attr1 = 0
             attr4 = 0
                attr3 = 0 : 0 -- 9
                attr3 = 1 : 0 -- 12
             attr4 = 1
                attr3 = 0 : 0 -- 13
                attr3 = 1 : 0 -- 14
          attr1 = 1
             attr4 = 0
                attr3 = 0 : 0 -- 15
                attr3 = 1 : 0 -- 10
             attr4 = 1 : 0 -- 30
       attr2 = 1
          attr4 = 0
             attr3 = 0 : 0 -- 25
             attr3 = 1
                attr1 = 0 : 0 -- 11
                attr1 = 1 : 0 -- 18
          attr4 = 1
             attr1 = 0
                attr3 = 0 : 0 -- 10
                attr3 = 1 : 0 -- 18
             attr1 = 1
                attr3 = 0 : 0 -- 12
                attr3 = 1 : 0 -- 17
    attr6 = 1
       attr4 = 0
          attr2 = 0
             attr3 = 0
                attr1 = 0 : 0 -- 11
                attr1 = 1 : 0 -- 13
             attr3 = 1
                attr1 = 0 : 0 -- 9
                attr1

## Extra Credit - Learning Curve

(05 points)

Instead of taking the entire training data (all 800 instances), loop through to select 'x' instances in the increments of 40 (i.e. 40, 80, 120, and so on). For each selected number 'x', randomly pick the example instances from the training data and call the ID3 function to create the decision tree. Calculate the accuracy of the created ID3 tree on the Test data file. Plot the corresponding graph, aka Learning Curve.
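
The sampling loop described above might be organised as in the sketch below. It uses a stand-in majority-class learner purely to exercise the loop, so it is not ID3; the names `learning_curve`, `fit_majority`, and `predict_majority`, the fixed random seed, and the made-up rows are all assumptions for illustration.

```python
import random

def learning_curve(train_rows, test_rows, fit, predict, step=40, seed=0):
    """Return (sizes, accuracies in %) for models trained on random subsets of growing size."""
    rng = random.Random(seed)
    sizes, accuracies = [], []
    for size in range(step, len(train_rows) + 1, step):
        subset = rng.sample(train_rows, size)  # sampling without replacement
        model = fit(subset)
        # The class label is assumed to be the last entry of each row
        correct = sum(1 for row in test_rows if predict(model, row) == row[-1])
        sizes.append(size)
        accuracies.append(100.0 * correct / len(test_rows))
    return sizes, accuracies

# Stand-in learner (majority class) used only to exercise the loop; it is NOT ID3
def fit_majority(rows):
    ones = sum(row[-1] for row in rows)
    return 1 if 2 * ones >= len(rows) else 0

def predict_majority(model, row):
    return model

# Made-up data: each row is [attribute, class]
train_rows = [[i % 2, i % 2] for i in range(800)]
test_rows = [[i % 2, i % 2] for i in range(100)]
sizes, accuracies = learning_curve(train_rows, test_rows, fit_majority, predict_majority)
print(sizes[:3], accuracies[:3])
```

Swapping `fit` and `predict` for functions that build and query an ID3 tree would give the accuracy values to plot against `sizes`.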


In [None]:
# Loop through to select the number of instances 'x' in increments of 40
# For each 'x',
    # Randomly select 'x' instances
    # Create the ID3 decision tree using those instances
    # Calculate the accuracy of the ID3 tree created on the Test data

# Plot the learning curve using the accuracy values
    # X-axis will be the number of training instances used for creating the tree
    # Y-axis will be the accuracy in % on the Test data



# Submission Instructions

1.   Complete all tasks above - **File MUST contain the output for ALL cells**
2.   Export this notebook as .ipynb
      (File > Download as ipynb)
3.   Upload the .ipynb file on Blackboard

## Rubric

*   (40 points) Gradient Descent Algorithm
*   (05 points) Extra Credit - GD Accuracy Plots
*   (85 points) ID3 Algorithm
*   (05 points) Extra Credit - ID3 Learning Curve
