## MATH2319/ MATH2387 Machine Learning
#### Semester 1, 2022
### Take-Home Machine Learning Assessment - Q2
### Pragati Patidar (S3858702)

## Q2
* Build a simple decision tree with depth 1 using this dataset for predicting the price (categorical) target feature using the Entropy split criterion.
* Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression.


In [1]:
#importing required packages
import pandas as pd
pd.set_option('display.max_columns', None) 
from io import StringIO
import numpy as np
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import pyplot

In [2]:
#reading the diamonds data from local repository
df=pd.read_csv("THA_diamonds.csv")
seed=12

In [3]:
#displaying 10 random observations from the diamonds dataset
df.sample(10)

Unnamed: 0,cut,color,depth,price,carat
57,Good,F,62.9,low,0.5
146,Good,F,61.1,medium,0.72
134,Good,F,64.1,medium,0.72
180,Fair,I,56.1,high,0.96
28,Good,D,60.1,low,0.51
27,Fair,I,55.3,low,0.64
154,Good,D,63.3,medium,0.71
48,Good,F,60.6,low,0.54
9,Fair,F,64.9,low,0.5
18,Fair,F,64.6,low,0.5


### Part A
The dataset for this question has 2 numerical descriptive features, carat and depth.

Discretize these 2 features separately as "category_1", "category_2", and "category_3" respectively using the equal-frequency binning technique.
Display the first 10 rows of the entire set of descriptive features after discretization of these two features.
After this discretization, all features in your dataset will be categorical (which we will assume to be "nominal categorical").
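
As a quick illustration of equal-frequency binning, the sketch below applies `pd.qcut` to a small made-up series. The values in `toy` are hypothetical and are not taken from `THA_diamonds.csv`; the point is only that `q=3` places roughly the same number of observations in each bin.

```python
import pandas as pd

# Hypothetical toy values, not taken from THA_diamonds.csv
toy = pd.Series([0.50, 0.51, 0.54, 0.64, 0.71, 0.72, 0.80, 0.96, 1.01])

# q=3 asks for three bins that hold (roughly) the same number of rows
binned = pd.qcut(toy, q=3, labels=['category_1', 'category_2', 'category_3'])

# Count how many values landed in each bin
bin_counts = binned.value_counts().sort_index()
print(bin_counts)
```

With nine distinct values and three bins, each bin receives three values, which is what distinguishes equal-frequency binning from equal-width binning.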

In [4]:
# Discretizing variable carat into three levels: 'catagory1', 'catagory2', 'catagory3'
df = df.copy()
df['carat'] = pd.qcut(df['carat'], 
                              q=3, 
                              labels=['catagory1', 'catagory2', 'catagory3'])


In [5]:
# Discretizing variable depth into three levels: 'catagory1', 'catagory2', 'catagory3'
df = df.copy()
df['depth'] = pd.qcut(df['depth'], 
                              q=3, 
                              labels=['catagory1', 'catagory2', 'catagory3'])


In [6]:
# Top 10 rows of the df:
df.head(10)

Unnamed: 0,cut,color,depth,price,carat
0,Good,D,catagory2,low,catagory1
1,Fair,F,catagory3,low,catagory1
2,Good,I,catagory1,low,catagory1
3,Good,F,catagory1,low,catagory1
4,Fair,F,catagory3,low,catagory1
5,Fair,F,catagory3,low,catagory1
6,Good,D,catagory2,low,catagory1
7,Good,D,catagory2,low,catagory1
8,Good,D,catagory2,low,catagory1
9,Fair,F,catagory3,low,catagory1


###  Part B 
Compute the impurity of the price target feature.
Let's calculate the entropy for the parent node (the full dataset) so that we can later see how much uncertainty the tree can reduce by splitting on each descriptive feature.

* The idea with entropy is that the more heterogenous and impure a feature is, the higher the entropy. Conversely, the more homogenous and pure a feature is, the lower the entropy.
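
To make the two extremes concrete, here is a minimal sketch that evaluates the entropy formula, H = -sum(p * log2(p)), on two made-up targets. The helper name `entropy` and both series are assumptions for illustration only. A target with a single level has zero entropy, while four equally likely levels give log2(4) = 2 bits, the maximum for a four-level target such as price.

```python
import numpy as np
import pandas as pd

def entropy(feature):
    # Relative frequency of each level, then H = -sum(p * log2(p))
    probs = feature.value_counts(normalize=True)
    return float(-np.sum(probs * np.log2(probs)))

# Hypothetical toy targets, not taken from the diamonds data
pure = pd.Series(['low'] * 8)
balanced = pd.Series(['low', 'medium', 'high', 'premium'] * 2)

pure_entropy = entropy(pure)          # one level only: no uncertainty
balanced_entropy = entropy(balanced)  # four equally likely levels

print(pure_entropy, balanced_entropy)
```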

In [7]:
# define the function using the formula of entropy
def compute_impurity(feature, impurity_criterion):
    """
    This function calculates impurity of a feature.
    Supported impurity criteria: 'entropy','gini'
    input: feature (this needs to be a Pandas series)
    output: feature impurity
    """
    probs = feature.value_counts(normalize=True)
    
    if impurity_criterion == 'entropy':
        impurity = -1 * np.sum(np.log2(probs) * probs)
    elif impurity_criterion == 'gini':
        impurity = 1 - np.sum(np.square(probs))
    else:
        raise ValueError('Unknown impurity criterion')
        
    return(round(impurity, 3))                        
       
                  

In [8]:
# applying the function to the target variable price:
target_entropy = compute_impurity(df['price'], 'entropy')
target_entropy 

1.716

The entropy of the target variable price is 1.716.

###  Part C
* Determining the root node for your decision tree.
* The root node is the node that starts the tree. In a decision tree, it tests the descriptive feature that best splits the data. Intermediate nodes are nodes where features are tested but which are not the final (leaf) nodes where predictions are made.

In [9]:
# Let's see what the partitions look like for the carat feature and
# what the corresponding calculations are using the entropy split criterion.
for level in df['carat'].unique():
    print('level name:', level)
    df_feature_level = df[df['carat'] == level]
    print('corresponding data partition:')
    print(df_feature_level)
    print('partition target feature impurity:', compute_impurity(df_feature_level['price'], 'entropy'))
    print('partition weight:', str(len(df_feature_level)) + '/' + str(len(df)))
    print('====================')


level name: catagory1
corresponding data partition:
     cut color      depth price      carat
0   Good     D  catagory2   low  catagory1
1   Fair     F  catagory3   low  catagory1
2   Good     I  catagory1   low  catagory1
3   Good     F  catagory1   low  catagory1
4   Fair     F  catagory3   low  catagory1
..   ...   ...        ...   ...        ...
79  Fair     F  catagory1   low  catagory1
80  Good     D  catagory2   low  catagory1
81  Good     F  catagory3   low  catagory1
83  Good     D  catagory2   low  catagory1
86  Good     D  catagory1   low  catagory1

[71 rows x 5 columns]
partition target feature impurity: -0.0
partition weight: 71/212
level name: catagory2
corresponding data partition:
      cut color      depth    price      carat
11   Good     I  catagory2      low  catagory2
15   Good     I  catagory2      low  catagory2
16   Good     I  catagory2      low  catagory2
27   Fair     I  catagory1      low  catagory2
34   Fair     I  catagory1      low  catagory2
..    ... 

##### We define a function called comp_feature_information_gain() that calculates the information gain of splitting on a descriptive feature, using compute_impurity() with the entropy criterion.



In [10]:
# defining a function for calculating _information_gain:

def comp_feature_information_gain(df, target, descriptive_feature, split_criterion):
    """
    This function calculates information gain for splitting on 
    a particular descriptive feature for a given dataset
    and a given impurity criteria.
    Supported split criterion: 'entropy'
    """
    
    print('target feature:', target)
    print('descriptive_feature:', descriptive_feature)
    print('split criterion:', split_criterion)
            
    target_entropy = compute_impurity(df[target], split_criterion)

    # we define two lists below:
    # entropy_list to store the entropy of each partition
    # weight_list to store the relative number of observations in each partition
    entropy_list = list()
    weight_list = list()
    
    # loop over each level of the descriptive feature
    # to partition the dataset with respect to that level
    # and compute the entropy and the weight of the level's partition
    for level in df[descriptive_feature].unique():
        df_feature_level = df[df[descriptive_feature] == level]
        entropy_level = compute_impurity(df_feature_level[target], split_criterion)
        entropy_list.append(round(entropy_level, 3))
        weight_level = len(df_feature_level) / len(df)
        weight_list.append(round(weight_level, 3))

    print('impurity of partitions:', entropy_list)
    print('weights of partitions:', weight_list)

    feature_remaining_impurity = np.sum(np.array(entropy_list) * np.array(weight_list))
    print('remaining impurity:', feature_remaining_impurity)
    
    information_gain = target_entropy - feature_remaining_impurity
    print('information gain:', information_gain)
    
    print('====================')

    return(information_gain)

In [11]:
#Now that our function has been defined, we will call it for each descriptive feature in the dataset.
#We call it using the entropy split criterion.

split_criterion = 'entropy'
for feature in df.drop(columns='price').columns:
    feature_info_gain = comp_feature_information_gain(df, 'price', feature, split_criterion)
    
    

target feature: price
descriptive_feature: cut
split criterion: entropy
impurity of partitions: [1.68, 1.78]
weights of partitions: [0.717, 0.283]
remaining impurity: 1.7083
information gain: 0.00770000000000004
target feature: price
descriptive_feature: color
split criterion: entropy
impurity of partitions: [1.657, 1.445, 1.833]
weights of partitions: [0.269, 0.434, 0.297]
remaining impurity: 1.617264
information gain: 0.09873599999999993
target feature: price
descriptive_feature: depth
split criterion: entropy
impurity of partitions: [1.517, 1.749, 1.74]
weights of partitions: [0.349, 0.316, 0.335]
remaining impurity: 1.6650170000000002
information gain: 0.05098299999999978
target feature: price
descriptive_feature: carat
split criterion: entropy
impurity of partitions: [-0.0, 1.365, 1.529]
weights of partitions: [0.335, 0.373, 0.292]
remaining impurity: 0.9556129999999998
information gain: 0.7603870000000001


We observe that, with the entropy split criterion, the highest information gain (0.760) occurs with the "carat" feature, so carat is the optimal split. The "color" feature is a distant second (0.099).

This is the split at the root node of the corresponding decision tree. In subsequent splits, the above procedure is repeated with the subset of the entire dataset in the current branch until the termination condition is reached.
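
As a hand check, the carat figures printed above can be recombined directly: the remaining impurity is the weighted sum of the partition entropies, and the information gain is the target entropy minus that remainder. The sketch below only re-does this arithmetic with the rounded values from the output; the variable names are chosen here for illustration.

```python
import numpy as np

# Rounded values copied from the printed output for the carat feature
target_entropy = 1.716
partition_entropies = np.array([0.0, 1.365, 1.529])
partition_weights = np.array([0.335, 0.373, 0.292])

# remainder = sum over partitions of (weight * entropy)
remainder = float(np.sum(partition_entropies * partition_weights))

# information gain = entropy before the split - remainder after the split
information_gain = target_entropy - remainder

print(round(remainder, 4), round(information_gain, 4))
```

Because the inputs are already rounded to three decimals, the result agrees with the printed remainder and gain only up to that rounding.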

In [12]:
# inserting data into table:
# target feature price; remainder and info_gain are taken from the output above
df_splits = pd.DataFrame(columns=['split','remainder', 'info_gain','is_optimal'])
df_splits.loc[len(df_splits)]=['color',1.6173,  0.0987, False]
df_splits.loc[len(df_splits)]= ['cut',1.7083, 0.0077,False]
df_splits.loc[len(df_splits)]= ['depth',1.6650,0.05098, False]
df_splits.loc[len(df_splits)]= ['carat',0.9556, 0.76038,True]
#
df_splits

Unnamed: 0,split,remainder,info_gain,is_optimal
0,color,1.6173,0.0987,False
1,cut,1.7083,0.0077,False
2,depth,1.665,0.05098,False
3,carat,0.9556,0.76038,True


### Part D:
Assuming the carat descriptive feature is at the root node
(NOTE: This feature may or may not be the optimal root node, but you will just assume it is). 
Under this assumption, you will make predictions for the price target variable.
*  Calculating the probability of each price level (low, medium, high, premium) within each carat partition.


In [13]:
# Let's see what the partitions look like for this feature.
for level in df['carat'].unique():
    print('level name:', level)
    df_feature_level = df[df['carat'] == level]
    print('corresponding data partition:')
    print(df_feature_level)

# For each partition, compute the impurity, the weight, and the probability
# of each price level, counting only the rows inside that partition.
for level in df['carat'].unique():
    df_feature_level = df[df['carat'] == level]
    price_probs = df_feature_level['price'].value_counts(normalize=True)
    print('level name:', level)
    print('partition target feature impurity:', compute_impurity(df_feature_level['price'], 'entropy'))
    print('partition weight:', str(len(df_feature_level)) + '/' + str(len(df)))
    for price_level in ['low', 'medium', 'high', 'premium']:
        # a price level absent from the partition has probability 0
        print('price_probability:', price_level, price_probs.get(price_level, 0.0))
    print('====================')

level name: catagory1
corresponding data partition:
     cut color      depth price      carat
0   Good     D  catagory2   low  catagory1
1   Fair     F  catagory3   low  catagory1
2   Good     I  catagory1   low  catagory1
3   Good     F  catagory1   low  catagory1
4   Fair     F  catagory3   low  catagory1
..   ...   ...        ...   ...        ...
79  Fair     F  catagory1   low  catagory1
80  Good     D  catagory2   low  catagory1
81  Good     F  catagory3   low  catagory1
83  Good     D  catagory2   low  catagory1
86  Good     D  catagory1   low  catagory1

[71 rows x 5 columns]
level name: catagory2
corresponding data partition:
      cut color      depth    price      carat
11   Good     I  catagory2      low  catagory2
15   Good     I  catagory2      low  catagory2
16   Good     I  catagory2      low  catagory2
27   Fair     I  catagory1      low  catagory2
34   Fair     I  catagory1      low  catagory2
..    ...   ...        ...      ...        ...
172  Good     D  catagory1  

In [14]:
# For the leaf prediction, we choose the price level with the highest probability in each partition.
# The probabilities are computed separately for catagory1, catagory2 and catagory3.

In [15]:
# inserting data into table:
# target feature price, root node carat:
df_pred = pd.DataFrame(columns=['leaf_condition','low_price_prob', 'medium_price_prob','high_price_prob', 
                                'premium_price_prob', 'leaf_prediction'])
df_pred.loc[len(df_pred)]=['carat==catagory1', 1.0,0.0,0.0,0.0,'low']
df_pred.loc[len(df_pred)]=['carat==catagory2', 0.28 ,0.69,0.10,0.01,'medium']
df_pred.loc[len(df_pred)]=['carat==catagory3', 0.0,0.42,0.37,0.21,'medium']
df_pred


Unnamed: 0,leaf_condition,low_price_prob,medium_price_prob,high_price_prob,premium_price_prob,leaf_prediction
0,carat==catagory1,1.0,0.0,0.0,0.0,low
1,carat==catagory2,0.28,0.69,0.1,0.01,medium
2,carat==catagory3,0.0,0.42,0.37,0.21,medium
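
The maximum-probability rule for leaf predictions can be sketched with `DataFrame.idxmax`, which returns the column label holding the largest value in each row. The frame below is an illustration only: it copies the probabilities from rows 0 and 2 of the table above, and the names `leaf_probs` and `leaf_prediction` are assumptions rather than part of the original notebook.

```python
import pandas as pd

# Probabilities copied from rows 0 and 2 of the table above
leaf_probs = pd.DataFrame(
    {
        'low': [1.0, 0.0],
        'medium': [0.0, 0.42],
        'high': [0.0, 0.37],
        'premium': [0.0, 0.21],
    },
    index=['row 0', 'row 2'],
)

# For each leaf, predict the price level with the highest probability
leaf_prediction = leaf_probs.idxmax(axis=1)
print(leaf_prediction)
```

Applied to these probabilities, the rule selects low for row 0 and medium for row 2, since 0.42 is the largest value in that row.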
