**Aim:** Write a program to demonstrate the decision-tree-based ID3 algorithm.

**Theory:** In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan that generates a decision tree from a dataset. The ID3 algorithm begins with the original set S as the root node. On each iteration, it goes through every unused attribute of S and calculates the entropy H(S) or the information gain IG(S) for that attribute. It then selects the attribute with the smallest entropy (or, equivalently, the largest information gain). The set S is then partitioned by the selected attribute into subsets of the data, and the algorithm recurses on each subset using only the attributes not yet selected.
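
For reference, the two quantities compared at each split can be written out as below. These are the standard textbook definitions rather than anything specific to this program; the symbols p_c (the proportion of examples in S belonging to class c) and S_v (the subset of S in which attribute A takes the value v) are introduced here for illustration.

```latex
H(S) = -\sum_{c} p_c \log_2 p_c
\qquad
IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)
```

Since H(S) is the same for every candidate attribute, maximising IG(S, A) amounts to minimising the weighted entropy of the subsets.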

**Code:**

Dataset used: **Iris Dataset**<br>
Importing libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import random
from pprint import pprint

In [62]:
df=pd.read_csv('datasets_19_420_Iris.csv')
df=df.drop('Id',axis=1)

df.head()

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [63]:
df=df.rename(columns={"Species":"label"})
df.head()

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,label
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [56]:
attribute=['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']

Checking for the minimum and maximum value of each Attribute

In [57]:
def printfun(attribute):

    for col in attribute:
        print("{} \n MIN:{} MAX:{}".format(col,df[col].min(),df[col].max()))
    return

In [34]:
printfun(attribute)

SepalLengthCm 
 MIN:4.3 MAX:7.9
SepalWidthCm 
 MIN:2.0 MAX:4.4
PetalLengthCm 
 MIN:1.0 MAX:6.9
PetalWidthCm 
 MIN:0.1 MAX:2.5


Computing the interval boundaries for each attribute, since each continuous attribute will be divided into 4 equal-width categories to make it discrete

In [58]:
def printintervals(attribute):
    
    for col in attribute:
        diff=(df[col].max()-df[col].min())/4
        num=df[col].min()
        print("{} \n".format(col)) 
        for i in range(4):
            k=num
            num += diff
            print("{} - {}".format(k,num))
        print("\n")    

In [59]:
printintervals(attribute)

SepalLengthCm 

4.3 - 5.2
5.2 - 6.1000000000000005
6.1000000000000005 - 7.000000000000001
7.000000000000001 - 7.900000000000001


SepalWidthCm 

2.0 - 2.6
2.6 - 3.2
3.2 - 3.8000000000000003
3.8000000000000003 - 4.4


PetalLengthCm 

1.0 - 2.475
2.475 - 3.95
3.95 - 5.425000000000001
5.425000000000001 - 6.9


PetalWidthCm 

0.1 - 0.7
0.7 - 1.2999999999999998
1.2999999999999998 - 1.9
1.9 - 2.5




Making Every continuous attribute Discrete 

In [64]:
def SepalLengthcat(val):
    
    if val>=4.3 and val<=5.2:
        return 1  
    elif val>5.2 and val<=6.1000000000000005:
        return 2
    elif val>6.1000000000000005 and val<=7.000000000000001:
        return 3
    else :
        return 4
    
    
    
# Write each category back with .loc to avoid chained assignment on a copy
for counter,val in enumerate(df['SepalLengthCm']):
    df.loc[counter,'SepalLengthCm']=int(SepalLengthcat(val))
    
df['SepalLengthCm'].value_counts()    


2.0    50
1.0    45
3.0    43
4.0    12
Name: SepalLengthCm, dtype: int64

In [65]:
def SepalWidthcat(val):
    
    if val>=2.0 and val<=2.6:
        return 1  
    elif val>2.6 and val<=3.2:
        return 2
    elif val>3.2 and val<=3.8000000000000003:
        return 3
    else :
        return 4
# Write each category back with .loc to avoid chained assignment on a copy
for counter,val in enumerate(df['SepalWidthCm']):
    df.loc[counter,'SepalWidthCm']=int(SepalWidthcat(val))
    
df['SepalWidthCm'].value_counts()    


2.0    84
3.0    36
1.0    24
4.0     6
Name: SepalWidthCm, dtype: int64

In [66]:
def Petallengthcat(val):
    
    if val>=1.0 and val<=2.475:
        return 1  
    elif val>2.475 and val<=3.95:
        return 2
    elif val>3.95 and val<=5.425000000000001:
        return 3
    else :
        return 4
    
# Write each category back with .loc to avoid chained assignment on a copy
for counter,val in enumerate(df['PetalLengthCm']):
    df.loc[counter,'PetalLengthCm']=int(Petallengthcat(val))
    
df['PetalLengthCm'].value_counts()     


3.0    61
1.0    50
4.0    28
2.0    11
Name: PetalLengthCm, dtype: int64

In [67]:
def Petalwidthcat(val):
    
    if val>=0.1 and val<=0.7:
        return 1  
    elif val>0.7 and val<=1.2999999999999998:
        return 2
    elif val>1.2999999999999998 and val<=1.9:
        return 3
    else :
        return 4
# Write each category back with .loc to avoid chained assignment on a copy
for counter,val in enumerate(df['PetalWidthCm']):
    df.loc[counter,'PetalWidthCm']=int(Petalwidthcat(val))
    
df['PetalWidthCm'].value_counts()        


3.0    56
1.0    50
4.0    29
2.0    15
Name: PetalWidthCm, dtype: int64
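
The four category functions above differ only in their hard-coded boundaries. As a minimal sketch, they could be folded into one helper that derives the boundaries from the minimum and maximum of the column. The name `equal_width_category` and its parameters are hypothetical, not part of the original program. Because the boundaries are computed as `lo + k * width` instead of by repeated addition, a value lying exactly on a boundary may fall on the other side of it by floating-point rounding. pandas also provides `pd.cut` for this kind of equal-width binning.

```python
def equal_width_category(val, lo, hi, n_bins=4):
    """Return the 1-based equal-width bin of val within [lo, hi].

    Hypothetical helper: upper boundaries are inclusive, mirroring the
    `<=` comparisons in the hand-written functions.
    """
    width = (hi - lo) / n_bins
    for k in range(1, n_bins):
        if val <= lo + k * width:
            return k
    return n_bins  # everything above the last inner boundary


# Petal width spans 0.1 to 2.5 according to the min/max printed earlier.
print(equal_width_category(0.2, 0.1, 2.5))  # 1
print(equal_width_category(2.3, 0.1, 2.5))  # 4
```

With such a helper, each column would need only its own `min()` and `max()` instead of a separate function.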

In [68]:
df.head()

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,label
0,1.0,3.0,1.0,1.0,Iris-setosa
1,1.0,2.0,1.0,1.0,Iris-setosa
2,1.0,2.0,1.0,1.0,Iris-setosa
3,1.0,2.0,1.0,1.0,Iris-setosa
4,1.0,3.0,1.0,1.0,Iris-setosa


In [92]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

Label Encoding of the Target Feature is done

In [93]:
le=LabelEncoder()
df['label']=le.fit_transform(df['label'])


In [95]:
df['label'].value_counts()

2    50
1    50
0    50
Name: label, dtype: int64
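
The integer codes matter later, because the leaves of the tree are printed as 0, 1 and 2. `LabelEncoder` numbers the classes in sorted order of their names, so here 0 is Iris-setosa, 1 is Iris-versicolor and 2 is Iris-virginica. The snippet below is a plain-Python sketch of that mapping on a few made-up rows, not the scikit-learn implementation itself.

```python
labels = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

# Sort the distinct class names and number them from 0.
classes = sorted(set(labels))
mapping = {name: code for code, name in enumerate(classes)}
encoded = [mapping[name] for name in labels]

print(mapping)
print(encoded)  # [2, 0, 1, 0]
```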

In [96]:
X=df.copy()
y=df.copy()

In [141]:
X=X.drop(columns='label')

In [142]:
y=y.drop(columns=attribute)

In [144]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

In [137]:
# Join features and labels column-wise so every training row keeps its label
x_final_df= pd.concat([X_train,y_train],axis=1)

# ID3 Implementation

In [85]:
import math
from collections import Counter

In [71]:
def entropy(probs):
    '''
    Takes a list of probabilities and calculates their entropy
    '''
    
    return sum( [-prob*math.log(prob, 2) for prob in probs] )

In [1]:
def entropy_of_list(a_list):
    '''
    Takes a list of items with discrete values 
    and returns the entropy for those items.
    '''
    # Tally Up:
    cnt = Counter(x for x in a_list)
    
    # Convert to Proportion
    num_instances = len(a_list)*1.0
    probs = [x / num_instances for x in cnt.values()]
    
    # Calculate Entropy:
    return entropy(probs)
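
To get a feel for the values these two functions produce, the sketch below re-declares compact versions of them and evaluates three small lists. A list with a single class has entropy 0, two equally likely classes give 1 bit, and three equally likely classes give log2(3), roughly 1.585 bits. Note that a probability of exactly 0 would make `math.log` raise an error; this cannot happen here because only classes that actually occur are counted.

```python
import math
from collections import Counter


def entropy(probs):
    # Shannon entropy in bits; every probability must be > 0.
    return sum(-p * math.log(p, 2) for p in probs)


def entropy_of_list(a_list):
    counts = Counter(a_list)
    total = len(a_list)
    return entropy(c / total for c in counts.values())


print(entropy_of_list([0, 0, 0, 0]))   # one class only
print(entropy_of_list([0, 0, 1, 1]))   # two equally likely classes
print(entropy_of_list([0, 1, 2]))      # three equally likely classes
```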

In [125]:
def infogain(df,attribute,target):
    '''
    Takes a DataFrame of attributes, and quantifies the entropy of a target
    attribute after performing a split along the values of another attribute.
    '''
    df_split= df.groupby(attribute)
    
    numofobv=len(df.index)*1.0
    df_agg_ent=df_split.agg({target:[entropy_of_list,lambda x:len(x)/numofobv]})[target]
    df_agg_ent.columns = ['Entropy', 'PropObservations']
    new_entropy = sum( df_agg_ent['Entropy'] * df_agg_ent['PropObservations'] )
    old_entropy = entropy_of_list(df[target])
    return old_entropy-new_entropy
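
The same calculation can be followed by hand on a toy example. The sketch below uses a list of dicts as a stand-in for the DataFrame; the names `info_gain`, `petal` and `sepal` and the four rows are made up for illustration. An attribute that separates the classes perfectly recovers the full parent entropy (gain of 1 bit here), while an attribute unrelated to the label yields a gain of 0.

```python
import math
from collections import Counter, defaultdict


def entropy_of_list(items):
    total = len(items)
    return sum(-(c / total) * math.log(c / total, 2)
               for c in Counter(items).values())


def info_gain(rows, attribute, target):
    """rows is a list of dicts; a stand-in for the DataFrame."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    # Entropy of each subset, weighted by the share of rows it holds.
    weighted = sum(len(g) / len(rows) * entropy_of_list(g)
                   for g in groups.values())
    return entropy_of_list([row[target] for row in rows]) - weighted


rows = [
    {"petal": 1, "sepal": 1, "label": 0},
    {"petal": 1, "sepal": 2, "label": 0},
    {"petal": 2, "sepal": 1, "label": 1},
    {"petal": 2, "sepal": 2, "label": 1},
]
print(info_gain(rows, "petal", "label"))  # petal separates the classes fully
print(info_gain(rows, "sepal", "label"))  # sepal tells us nothing here
```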
    

In [129]:
def buildtree(df,target,attribute,default_class=None):
    '''
    Builds the decision tree recursively.
    '''
    cnt=Counter(x for x in df[target])
    
    # If only one target value (a single species) is present, return that class.
    
    if len(cnt)==1:
        return list(cnt.keys())[0]
    
    elif df.empty or (not attribute):
        return default_class
        
    else:
        indexofmax= list(cnt.values()).index(max(cnt.values()))
        defaultclass= list(cnt.keys())[indexofmax]
        
        gain=[infogain(df,attr,target) for attr in attribute]
        indexofmax=gain.index(max(gain))
        best_attr=attribute[indexofmax]
        
        tree={best_attr:{}}
        remaining_attr=[i for i in attribute if i!=best_attr]
        
        
        for attr_val,data_subset in df.groupby(best_attr):
            subtree=buildtree(data_subset ,target,remaining_attr,defaultclass)
            tree[best_attr][attr_val]= subtree
            
        return tree    

In [131]:
target='label'
tree=buildtree(x_final_df,target,attribute)
pprint(tree)

{'PetalWidthCm': {1.0: 0,
                  2.0: 1,
                  3.0: {'PetalLengthCm': {2.0: 1,
                                          3.0: {'SepalLengthCm': {1.0: 2,
                                                                  2.0: {'SepalWidthCm': {1.0: 1,
                                                                                         2.0: 1,
                                                                                         3.0: 1}},
                                                                  3.0: {'SepalWidthCm': {1.0: 1,
                                                                                         2.0: 1}}}},
                                          4.0: 2}},
                  4.0: 2}}
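
The nested dictionary can be hard to read at depth. One way to inspect it, sketched below, is to flatten it into one rule per leaf. The function `tree_to_rules` is a hypothetical helper, and `small_tree` is a simplified, hand-copied fragment of the printed tree (the deeper `3.0` branch of `PetalLengthCm` is left out), used only to keep the example short.

```python
def tree_to_rules(tree, path=()):
    """Yield (conditions, label) pairs for every leaf of a nested-dict tree."""
    if not isinstance(tree, dict):
        yield path, tree  # reached a leaf: tree is a class label
        return
    attribute = next(iter(tree))  # each node dict has exactly one key
    for value, subtree in tree[attribute].items():
        yield from tree_to_rules(subtree, path + ((attribute, value),))


# Simplified fragment of the printed tree, for illustration only.
small_tree = {"PetalWidthCm": {1.0: 0,
                               2.0: 1,
                               3.0: {"PetalLengthCm": {2.0: 1, 4.0: 2}},
                               4.0: 2}}

for conditions, label in tree_to_rules(small_tree):
    rule = " and ".join(f"{a} == {v}" for a, v in conditions)
    print(f"if {rule}: predict {label}")
```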


Function to make Prediction

In [147]:
def classify(instance, tree, default=None):
    attribute = list(tree.keys())[0]
    if instance[attribute] in tree[attribute].keys():
        result = tree[attribute][instance[attribute]]
        if isinstance(result, dict): # this is a subtree, so descend further
            # pass the default along so a miss at any depth returns it
            return classify(instance, result, default)
        else:
            return result # this is a label
    else:
        return default # attribute value not seen on this branch in training
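
A short usage sketch may make the lookup easier to follow. It re-declares a compact version of the function that passes `default` along on every recursive call, so that a miss at any depth returns the fallback, and runs it on a hand-written tree. The `small_tree` dictionary and the two instances are made up for illustration; plain dicts are used in place of DataFrame rows.

```python
def classify(instance, tree, default=None):
    attribute = next(iter(tree))
    branches = tree[attribute]
    if instance[attribute] not in branches:
        return default  # value never seen on this branch during training
    result = branches[instance[attribute]]
    if isinstance(result, dict):
        return classify(instance, result, default)  # descend into the subtree
    return result


small_tree = {"PetalWidthCm": {1.0: 0,
                               3.0: {"PetalLengthCm": {2.0: 1, 4.0: 2}}}}

seen = {"PetalWidthCm": 3.0, "PetalLengthCm": 4.0}
unseen = {"PetalWidthCm": 3.0, "PetalLengthCm": 1.0}

print(classify(seen, small_tree, default="NA"))    # 2
print(classify(unseen, small_tree, default="NA"))  # NA
```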


In [149]:
X_test=X_test.copy()  # work on an explicit copy so the new column is not set on a slice
X_test['predicted_label']=X_test.apply(classify,axis=1,args=(tree,'NA'))


In [150]:
X_test.head()

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,predicted_label
73,2.0,2.0,3.0,2.0,1.0
18,2.0,3.0,1.0,1.0,0.0
118,4.0,1.0,4.0,4.0,2.0
78,2.0,2.0,3.0,3.0,1.0
76,3.0,2.0,3.0,3.0,1.0


In [151]:
print("Accuracy is "+ str(sum(X_test['predicted_label']==y_test['label'])/(1.0*len(X_test.index))))

Accuracy is 0.9130434782608695
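
A single accuracy figure does not show which classes are confused or how many rows fell back to the default. As a sketch, the (true, predicted) pairs can be tallied with a `Counter`. The labels below are hypothetical and are not the results of this program; `"NA"` stands for a row the tree could not classify, which counts as an error.

```python
from collections import Counter

# Hypothetical labels for illustration only; not the notebook's results.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, "NA"]

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Count each (true, predicted) pair to see where the errors fall.
confusion = Counter(zip(y_true, y_pred))

print(accuracy)
for (t, p), n in sorted(confusion.items(), key=str):
    print(f"true={t} predicted={p}: {n}")
```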
