# Decision Tree algorithm from scratch

A decision tree is a **classification algorithm** that represents all the possible outcomes of a decision as a tree-shaped graph. It is very similar to the way we work through decisions manually with tree diagrams.

Building this algorithm from scratch takes a number of steps. For the implementation in this notebook, a very simple dataset is used.

In [1]:
sample_data=[
    ['Green',3,'Mango'],
    ['Yellow',3,'Mango'],
    ['Red',1,'Grape'],
    ['Red',1,'Grape'],
    ['Yellow',3,'Lemon']
]

Next, we name the data columns. These headers will be useful when handling the sample dataset.

In [2]:
header=['color','diameter','label']

To implement this algorithm, lots of functions and classes are required. Let’s define them one by one. 

First, we define a function that returns the **unique values** of a column. The **unique_vals** function takes two parameters:
- **rows** is a list of lists, where each inner list holds the values of one row.
- **col** is an integer holding the index of the column whose unique values we want.

In [3]:
def unique_vals(rows,col):
    # collect the distinct values found in the column at index col
    return {row[col] for row in rows}
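
As a quick check of the function described above, the following sketch calls it on the sample data. The data and an equivalent definition of `unique_vals` are repeated so the snippet runs on its own; the variable names `colors` and `diameters` are only illustrative.

```python
# Sample data copied from the notebook so this snippet is self-contained
sample_data = [
    ['Green', 3, 'Mango'],
    ['Yellow', 3, 'Mango'],
    ['Red', 1, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3, 'Lemon'],
]

def unique_vals(rows, col):
    # collect the distinct values found in the column at index col
    return {row[col] for row in rows}

colors = unique_vals(sample_data, 0)     # column 0 holds the colors
diameters = unique_vals(sample_data, 1)  # column 1 holds the diameters

# Sets are unordered, so sort before printing for a stable display
print(sorted(colors))     # ['Green', 'Red', 'Yellow']
print(sorted(diameters))  # [1, 3]
```

Because the result is a set, duplicates such as the two `'Red'` rows collapse into a single entry.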

Next, we count how many times each label occurs. **class_count** takes a single parameter, **rows**, which is a list of lists holding the row data. Each label name and its count are stored in the **counts** dictionary, which the function returns.

In [4]:
def class_count(rows):
    counts={}
    for row in rows: 
        label=row[-1] # in this dataset the label is always the last column
        if label not in counts:
            counts[label]=0
        counts[label]+=1
    return counts
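
The sketch below runs the counting function on the sample data. It repeats the data and the function so it can be executed on its own. The comparison with `collections.Counter` from the standard library is an addition for illustration, not something the notebook uses.

```python
from collections import Counter

# Sample data copied from the notebook so this snippet is self-contained
sample_data = [
    ['Green', 3, 'Mango'],
    ['Yellow', 3, 'Mango'],
    ['Red', 1, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3, 'Lemon'],
]

def class_count(rows):
    counts = {}
    for row in rows:
        label = row[-1]  # the label is assumed to be the last column
        if label not in counts:
            counts[label] = 0
        counts[label] += 1
    return counts

counts = class_count(sample_data)
print(counts)  # {'Mango': 2, 'Grape': 2, 'Lemon': 1}

# Illustrative alternative: Counter produces the same tally in one line
same_counts = Counter(row[-1] for row in sample_data)
print(dict(same_counts) == counts)  # True
```

Writing the loop by hand keeps the from-scratch spirit of the notebook, while `Counter` is a reasonable shortcut once the idea is clear.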

Then we need a function to check whether a value is **numeric**. The **is_numeric** function does this. Its **value** parameter is the value whose data type we want to check. The function relies on **isinstance(value, data type)**, a built-in Python function that returns **True** if *value* is an instance of *data type* and **False** otherwise.

In [6]:
def is_numeric(value):
    return isinstance(value,(int,float))
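
A few calls make the behavior concrete. The sketch below uses an equivalent definition of `is_numeric` and also points out one subtlety: in Python, `bool` is a subclass of `int`, so booleans count as numeric here. The helper `is_numeric_strict` is a hypothetical variant added only for illustration and is not part of the notebook.

```python
def is_numeric(value):
    # True for ints and floats (and, as a side effect, for booleans)
    return isinstance(value, (int, float))

def is_numeric_strict(value):
    # Hypothetical variant, not in the notebook: rejects booleans explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)

print(is_numeric(3))             # True
print(is_numeric(2.5))           # True
print(is_numeric('Red'))         # False
print(is_numeric(True))          # True, because bool is a subclass of int
print(is_numeric_strict(True))   # False
```

For the sample dataset this distinction does not matter, since its columns contain only strings and integers.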

**Partitioning** is an important stage in building the tree: it divides the dataset into subsets. To do so, we need **questions**. The following **class** is written for that.

The **__init__** method stores two attributes, **column** and **value**. These are the **column index** and the **feature value** respectively. Actual dataset values are compared against them.

Next, the **match** method takes one row of the dataset (*example*) and compares its value at the *column* index with the *value* stored in the question. Two conditions are used here:
- If the value is *numeric*, it should be greater than or equal to the *value* defined in the class.
- If the value is *not numeric*, it should be equal to the *value* defined in the class.

Finally, the **__repr__** method returns the question in a readable form, using the column name from **header**.


In [7]:
class Question:
    
    def __init__(self,column,value):
        self.column=column
        self.value=value
        
    def match(self,example):
        # compare the feature value in an example with the feature value in this question
        val=example[self.column]
        if is_numeric(val):
            return val>=self.value
        else:
            return val==self.value
        
    def __repr__(self):
        # just a helper to represent question in readable format
        condition='=='
        if is_numeric(self.value):
            condition='>='
        return 'Is %s %s %s?'%(header[self.column],condition,str(self.value))
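
To see the class in action, the sketch below builds two questions and applies one of them to the sample data. It repeats the definitions it needs so it runs on its own. The final split into `true_rows` and `false_rows` is only an illustration of how a question could divide the rows, written here as plain list comprehensions; it is an assumption about the partitioning step, not code taken from the notebook.

```python
header = ['color', 'diameter', 'label']

sample_data = [
    ['Green', 3, 'Mango'],
    ['Yellow', 3, 'Mango'],
    ['Red', 1, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3, 'Lemon'],
]

def is_numeric(value):
    return isinstance(value, (int, float))

class Question:
    def __init__(self, column, value):
        self.column = column
        self.value = value

    def match(self, example):
        # numeric features use >=, all other features use ==
        val = example[self.column]
        if is_numeric(val):
            return val >= self.value
        return val == self.value

    def __repr__(self):
        condition = '>=' if is_numeric(self.value) else '=='
        return 'Is %s %s %s?' % (header[self.column], condition, str(self.value))

q_diameter = Question(1, 3)
q_color = Question(0, 'Green')
print(q_diameter)  # Is diameter >= 3?
print(q_color)     # Is color == Green?

print(q_diameter.match(sample_data[0]))  # True  (diameter 3)
print(q_diameter.match(sample_data[2]))  # False (diameter 1)

# Illustrative split of the rows by one question
true_rows = [row for row in sample_data if q_diameter.match(row)]
false_rows = [row for row in sample_data if not q_diameter.match(row)]
print(len(true_rows), len(false_rows))  # 3 2
```

Note that `__repr__` reads the module-level `header` list, so the class as written depends on that global being defined before a question is printed.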