<a href="https://colab.research.google.com/github/Edisuism/Machine_Learning/blob/master/Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2


On this page, I will develop a machine learning algorithm from scratch and then test it on a sample dataset.

Steps involved:
- Find a sample dataset
- Choose a learner algorithm and note down its input/output
- Preprocess the sample dataset to meet the input requirements and split it into test/training sets
- Code the learner algorithm **(only this needs to be coded; preprocessing can use libraries such as scikit-learn)**
- Train the algorithm and then apply it to the test set

# Introduction

The algorithm I have decided to build from scratch is an ID3 decision tree. As input, the algorithm takes a set of examples, each described by its attributes, such as a flower and its colour and size. As output, it estimates the value of the target variable it has been tasked with predicting, based on the attributes given. In the case of the flower, it could determine what type of flower it is based on its attributes. This works by first choosing the attribute that has the most impact on the target variable as the root node, and then repeating the process with the remaining attributes on each branch, achieving a finer level of classification each time.


# Exploration

One challenge I find in creating this algorithm is figuring out how to turn the theory into practical, functional code, such as calculating the entropy with respect to each attribute and its effect on the target variable, while ensuring the process repeats for each layer of the decision tree.


My current plan for the project is to split the dataframe into a training set and a test set so that I can test the algorithm without any additional data.


As for the algorithm itself, I will calculate the entropy of the target variable of the dataset and then the entropy of every attribute (the weighted average entropy of the target variable within each value of that attribute). The information gain of each attribute is found by subtracting the attribute's entropy from the entropy of the target variable. After calculating the information gain for each attribute, the algorithm will choose the attribute with the highest value as the root node and then cull all pure results from the training dataframe. That is, if the result is always to play tennis when the outlook is overcast, all of those rows will be removed from the table. The algorithm will then recalculate the entropy of the target variable and of each attribute on the remaining rows, as it did to find the root node. This process of calculating the entropy of the target variable and the information gain of the attributes will repeat until the data is pure.
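
To make the calculation concrete, here is a small sketch that computes the entropy of a target column and the information gain of two attributes using only the standard library. The six rows and the helper names `entropy` and `information_gain` are made up for illustration; they are not the rows of the assignment's dataset, nor the functions defined later in this notebook.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    before = entropy([row[target] for row in rows])
    after = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        after += (len(subset) / len(rows)) * entropy(subset)
    return before - after

# Hypothetical toy data (assumed for this example only).
rows = [
    {"outlook": "Sunny", "wind": "Weak", "play": "No"},
    {"outlook": "Sunny", "wind": "Strong", "play": "No"},
    {"outlook": "Overcast", "wind": "Weak", "play": "Yes"},
    {"outlook": "Overcast", "wind": "Strong", "play": "Yes"},
    {"outlook": "Rain", "wind": "Weak", "play": "Yes"},
    {"outlook": "Rain", "wind": "Strong", "play": "No"},
]

print(entropy([row["play"] for row in rows]))     # 3 Yes / 3 No, so exactly 1 bit
print(information_gain(rows, "outlook", "play"))  # only the Rain branch stays mixed
print(information_gain(rows, "wind", "play"))     # both branches stay mixed, so less gain
```

In this toy table `outlook` has the larger gain, so an ID3-style learner would pick it as the root node.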


However, what happens when the tree has already reached the end of its possibilities (noting that the tree will not go back up a level to try other possibilities)? I am currently thinking of simply semi-randomising its estimate of the target variable based on the proportions of the remaining results. For example, if at the end of the tree there is still 1 result saying to play tennis and 1 result saying not to play tennis, then the answer will have a 50% chance of being either play or not play.
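
One possible way to implement this fallback is sketched below. It assumes the leftover rows at a dead end are available as a plain list of class labels; the function name `random_leaf_prediction` is hypothetical, and a simple majority vote would be a deterministic alternative.

```python
import random
from collections import Counter

def random_leaf_prediction(labels, rng=random):
    """Pick a class at random, weighted by how often it occurs in labels."""
    counts = Counter(labels)
    classes = list(counts.keys())
    weights = list(counts.values())
    # choices() draws with replacement according to the given weights.
    return rng.choices(classes, weights=weights, k=1)[0]

# One 'Yes' and one 'No' left at the end of a branch: each has a 50% chance.
rng = random.Random(0)  # seeded only so the demonstration is repeatable
draws = [random_leaf_prediction(["Yes", "No"], rng) for _ in range(1000)]
print(Counter(draws))
```

Over many draws the two classes should appear roughly equally often, while a leaf with three "Yes" rows and one "No" row would return "Yes" about 75% of the time.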

# Methodology

Import data


In [0]:
import numpy as np
import pandas as pd
from numpy import log2 as log
import random
import pprint

url = 'https://raw.githubusercontent.com/Edisuism/Machine_Learning/master/play_tennis.csv'
df = pd.read_csv(url)
print (len(df))
df.head()

Split into training and testing sets

In [0]:
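def train_test_split(df, test_size):
  # test_size is a fraction of the rows (e.g. 0.3); convert it to a row count
  test_size = round(test_size * len(df))
  indices = df.index.tolist()
  # Sample row labels without replacement for the test set
  test_indices = random.sample(population=indices, k=test_size)
  test_df = df.loc[test_indices]
  train_df = df.drop(test_indices)  # every row not sampled is used for training

  return train_df, test_df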



In [0]:
train_df, test_df = train_test_split(df, test_size = 0.3)

In [0]:
print (len(train_df))
train_df.head()

In [0]:
print (len(test_df))
test_df.head()

Find Entropy of Class


In [0]:
def class_entropy(df):
  target = df.keys()[-1]  # the class label is assumed to be the last column
  entropy_class = 0
  values = df[target].unique()
  for value in values:
      # Proportion of rows that belong to this class
      fraction = df[target].value_counts()[value]/len(df[target])
      entropy_class += -fraction*np.log2(fraction)
  return entropy_class

class_entropy(train_df)

Find Entropy and Information Gain of Attributes

In [0]:
def attribute_entropy(df,target_attribute):
  target = df.keys()[-1]  # the class label is assumed to be the last column
  attribute = target_attribute
  eps = np.finfo(float).eps  # guards against a zero denominator and log2(0)
  target_variables = df[target].unique()
  variables = df[attribute].unique()
  entropy_attribute = 0
  for variable in variables:
      subset = df[df[attribute] == variable]  # rows where the attribute takes this value
      den = len(subset)
      entropy_each_feature = 0
      for target_variable in target_variables:
          num = len(subset[subset[target] == target_variable])
          fraction = num/(den+eps)
          entropy_each_feature += -fraction*log(fraction+eps)  # entropy within this value
      fraction2 = den/len(df)
      entropy_attribute += fraction2*entropy_each_feature  # weight by the share of rows
  return entropy_attribute

print (attribute_entropy(train_df, 'outlook'))
print (attribute_entropy(train_df, 'temp'))
print (attribute_entropy(train_df, 'humidity'))
print (attribute_entropy(train_df, 'wind'))

Find Highest Info Gain Attribute

In [0]:
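def calculate_winner(df):
  IG = []
  attributes = df.keys()[1:-1]  # skip column 0 (just an index) and the last column (the class)
  base_entropy = class_entropy(df)
  for key in attributes:
    IG.append(base_entropy - attribute_entropy(df, key))  # information gain of this attribute

  return attributes[np.argmax(IG)]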

calculate_winner(train_df)

Splitting function

In [0]:
def split_table(df, node, value):
  return df[df[node] == value].reset_index(drop=True)

Build Tree

In [0]:
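def buildTree(df,tree=None):
    target = df.keys()[-1]            # the class label is the last column
    node = calculate_winner(df)       # attribute with the highest information gain
    attValue = np.unique(df[node])    # distinct values of that attribute
    if tree is None:
        tree = {}
    tree[node] = {}  # also initialised when a tree is passed in, avoiding a KeyError

    for value in attValue:
        subtable = split_table(df, node, value)
        clValue, counts = np.unique(subtable[target],return_counts=True)

        if len(counts)==1:
            # Pure subset: store the class label as a leaf
            tree[node][value] = clValue[0]
        else:
            # Mixed subset: grow a subtree from the remaining rows.
            # Note: if identical rows carry different labels, no split can
            # purify them and this recursion will not terminate.
            tree[node][value] = buildTree(subtable)

    return tree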
  
  

In [0]:
decision_tree = buildTree(train_df)
pprint.pprint(decision_tree)

Prediction

In [0]:
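def predict(inst,tree):
    # Each (sub)tree is a dict with a single key: the attribute tested at this node
    node = next(iter(tree))
    value = inst[node]
    # Raises KeyError if this value never appeared for the attribute during training
    subtree = tree[node][value]

    if isinstance(subtree, dict):
        return predict(inst, subtree)  # keep descending
    return subtree                     # a leaf: the predicted class label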

In [0]:
decision_tree.keys()

In [0]:
inst = test_df.iloc[1]
inst

In [0]:
prediction = predict(inst,decision_tree)
prediction

In [0]:
def accuracy(actual, predicted):
  # actual and predicted are equal-length sequences of class labels
  correct = 0
  for a, p in zip(actual, predicted):
    if a == p:
      correct += 1
  return correct / float(len(actual)) * 100.0
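
To tie prediction and accuracy together, the sketch below scores a tree over several rows. It is self-contained and uses a hand-written toy tree in the same nested-dictionary shape that `buildTree` produces; the names `predict_or_default` and `percent_correct`, the toy rows, and the fallback label are all assumptions for illustration. The fallback covers attribute values that never appeared during training, which would otherwise cause a `KeyError`.

```python
def predict_or_default(inst, tree, default=None):
    """Walk a nested-dict decision tree; return default for unseen attribute values."""
    while isinstance(tree, dict):
        node = next(iter(tree))          # the attribute tested at this level
        branches = tree[node]
        value = inst[node]
        if value not in branches:
            return default               # value never seen while building the tree
        tree = branches[value]
    return tree                          # a leaf: the predicted class label

def percent_correct(actual, predicted):
    """Percentage of positions where the two sequences agree."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual) * 100.0

# Hypothetical tree and rows, not output from the notebook.
toy_tree = {"outlook": {"Overcast": "Yes",
                        "Sunny": {"humidity": {"High": "No", "Normal": "Yes"}},
                        "Rain": {"wind": {"Weak": "Yes", "Strong": "No"}}}}
toy_rows = [
    {"outlook": "Sunny", "humidity": "High", "wind": "Weak", "play": "No"},
    {"outlook": "Rain", "humidity": "High", "wind": "Strong", "play": "Yes"},
    {"outlook": "Overcast", "humidity": "Normal", "wind": "Weak", "play": "Yes"},
    {"outlook": "Foggy", "humidity": "High", "wind": "Weak", "play": "No"},
]

predicted = [predict_or_default(row, toy_tree, default="No") for row in toy_rows]
actual = [row["play"] for row in toy_rows]
print(predicted)
print(percent_correct(actual, predicted))
```

With the notebook's DataFrames, the same loop could presumably run over the rows yielded by `test_df.iterrows()`, since each row supports lookup by column name.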




Notes:
- Gini index = how much uncertainty (impurity) there is in a node
- Information gain = how much uncertainty is removed by splitting a node
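
These notes can be made concrete with a short comparison. The sketch below computes Gini impurity and entropy for the same class counts; both are zero for a pure node and largest for an even split. The tree in this notebook is built with entropy, so the `gini` helper is included only for comparison, and both function names are illustrative.

```python
import math

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits, skipping empty classes to avoid log2(0)."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# Class counts for an even split, a mixed node, and a pure node.
for counts in ([7, 7], [9, 5], [14, 0]):
    print(counts, round(gini(counts), 3), round(entropy(counts), 3))
```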

# Evaluation

# Conclusion

# Ethical