# Machine Learning

This notebook builds a simple decision tree of depth 1 on the given dataset to predict the categorical target feature `price`, using entropy as the split criterion.

#### Author: Achintya Gupta 



In [1]:
# importing necessary libraries

import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

In [2]:
Data = pd.read_csv("THA_diamonds.csv")
Data

Unnamed: 0,cut,color,depth,price,carat
0,Good,D,63.6,low,0.44
1,Fair,F,64.2,low,0.45
2,Good,I,60.4,low,0.50
3,Good,F,56.8,low,0.45
4,Fair,F,64.3,low,0.45
...,...,...,...,...,...
207,Good,F,63.7,premium,0.96
208,Fair,D,57.5,premium,0.90
209,Fair,F,64.7,premium,0.90
210,Good,I,58.2,premium,0.93


### Part A

**Discretizing** `carat` **and** `depth`

In [3]:
Data1 = Data.copy()

# Discretizing carat and depth
Data1['carat'] = pd.qcut(Data1['carat'], q=3, labels=['small', 'medium', 'large'])
Data1['depth'] = pd.qcut(Data1['depth'], q=3, labels=['shallow', 'middle', 'deep'])

`carat` and `depth` are discretized using **equal-frequency binning** (`pd.qcut` with `q=3`, which places roughly the same number of observations in each bin). `carat` is categorized as small, medium and large, whereas `depth` is categorized as shallow, middle and deep.
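
To make the difference between equal-frequency and equal-width binning concrete, here is a minimal pure-Python sketch. The `carats` values and the helper `bin_counts` are hypothetical and are not taken from `THA_diamonds.csv`; the quantile cut points only illustrate the idea behind `pd.qcut`, while the evenly spaced edges illustrate what `pd.cut` would do.

```python
from bisect import bisect_left
from statistics import quantiles

# Hypothetical carat values, skewed towards small stones (illustration only).
carats = [0.40, 0.42, 0.44, 0.45, 0.50, 0.55, 0.60, 0.90, 0.96]
labels = ["small", "medium", "large"]


def bin_counts(values, edges):
    """Count values per bin, using right-inclusive bins split at the given inner edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_left(edges, v)] += 1
    return counts


# Equal-frequency: inner edges are the tertiles of the data.
freq_edges = quantiles(carats, n=3, method="inclusive")

# Equal-width: inner edges split the range [min, max] into three equal intervals.
lo, hi = min(carats), max(carats)
width_edges = [lo + (hi - lo) * i / 3 for i in (1, 2)]

freq_counts = bin_counts(carats, freq_edges)
width_counts = bin_counts(carats, width_edges)

print("equal-frequency:", dict(zip(labels, freq_counts)))
print("equal-width:    ", dict(zip(labels, width_counts)))
```

With skewed data, equal-frequency bins stay balanced while equal-width bins can leave some levels nearly empty, which matters here because the partition weights enter the information gain directly.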

In [4]:
# Displaying first 10 observations after discretizing
Data1.head(10)

Unnamed: 0,cut,color,depth,price,carat
0,Good,D,middle,low,small
1,Fair,F,deep,low,small
2,Good,I,shallow,low,small
3,Good,F,shallow,low,small
4,Fair,F,deep,low,small
5,Fair,F,deep,low,small
6,Good,D,middle,low,small
7,Good,D,middle,low,small
8,Good,D,middle,low,small
9,Fair,F,deep,low,small


### Part B

**Calculating the impurity of the** `price` **feature.**

In [5]:
# Probability distribution of the price
prob = Data1['price'].value_counts(normalize=True)
prob

low        0.438679
medium     0.349057
high       0.146226
premium    0.066038
Name: price, dtype: float64

In [6]:
# Calculating impurity using entropy
entropy = -1 * np.sum(np.log2(prob) * prob)
entropy

1.7160130346557048

Using **Shannon's model of entropy**, the impurity of `price` is calculated as `1.7160130346557048` bits.
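
As a cross-check, the same value can be reproduced without pandas. This is a small sketch that assumes the class counts implied by the proportions above over the 212 rows of the dataset (for example, 93 / 212 = 0.438679); the helper name `entropy_from_counts` is introduced here for illustration.

```python
from math import log2

# Class counts reconstructed from the reported proportions over 212 rows.
price_counts = {"low": 93, "medium": 74, "high": 31, "premium": 14}


def entropy_from_counts(counts):
    """Shannon entropy in bits of a class distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


price_entropy = entropy_from_counts(price_counts.values())
print(round(price_entropy, 3))
```

Empty classes are skipped because the term `p * log2(p)` is taken to be 0 when `p` is 0.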

### Part C

**Determining the root node of the decision tree.**

In [7]:
# Storing price as target
target = Data1.price.values

The function `compute_entropy()` computes the entropy of a feature, rounded to three decimal places.

In [8]:
# Creating a function to calculate entropy of a feature
def compute_entropy(feature):
    prob = feature.value_counts(normalize=True)
    impurity = -1 * np.sum(np.log2(prob) * prob)
    return(round(impurity, 3))

Another function, `feature_info_gain()`, calculates the impurity and weight of each partition, the remaining impurity, and the information gain for a given feature. The feature with the highest information gain becomes the root node of the decision tree.
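
In symbols, the quantities computed by this function can be sketched as follows. The notation ($H$, $\mathrm{rem}$, $\mathrm{IG}$, $\mathcal{D}$, $t$, $d$) is introduced here for illustration and does not appear in the code: $t$ is the target feature, $d$ a descriptive feature, and $\mathcal{D}_{d=l}$ the rows of the dataset $\mathcal{D}$ where $d$ takes the level $l$.

$$
\begin{aligned}
H(t, \mathcal{D}) &= -\sum_{k \in \mathrm{levels}(t)} P(t = k)\,\log_2 P(t = k) \\
\mathrm{rem}(d, \mathcal{D}) &= \sum_{l \in \mathrm{levels}(d)} \frac{|\mathcal{D}_{d=l}|}{|\mathcal{D}|}\; H(t, \mathcal{D}_{d=l}) \\
\mathrm{IG}(d, \mathcal{D}) &= H(t, \mathcal{D}) - \mathrm{rem}(d, \mathcal{D})
\end{aligned}
$$

The fraction in the second line is the weight of a partition, and $H(t, \mathcal{D}_{d=l})$ is its impurity.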

In [9]:
# Creating a function for calculating information gain, remaining impurity, impurity of partitions and weight 
# of partitions

def feature_info_gain(df, target, feature):
    print('target feature:', target)
    print('descriptive_feature:', feature)
    print('impurity: entropy')
    
    target_entropy = compute_entropy(df[target])
    
    entropy_list = list()
    weight_list = list()
    
    for level in df[feature].unique():
        df_feature_level = df[df[feature] == level]
        
        # impurity of partitions
        entropy_level = compute_entropy(df_feature_level[target])
        entropy_list.append(round(entropy_level, 3))
        
        # weight of partitions
        weight_level = len(df_feature_level) / len(df)
        weight_list.append(round(weight_level, 3))
        
    print('impurity of partitions:', entropy_list)
    print('weights of partitions:', weight_list)

    # remaining impurity
    remaining_impurity = np.sum(np.array(entropy_list) * np.array(weight_list)) 
    remaining_impurity = remaining_impurity.round(3)
    print('remaining impurity:', remaining_impurity)
    
    # information gain
    information_gain = target_entropy - remaining_impurity
    information_gain = information_gain.round(3)
    print('information gain:', information_gain)
    print("\n")
    print('===============================================')
    print("\n")

    return(information_gain)


    
    

In [10]:
# Calculating information gain, remaining impurity, impurity of partitions and weight of partitions for each feature
# Dropping the target column `price` so that only the descriptive features are evaluated
for feature in Data1.drop(columns='price').columns:
    info_gain = feature_info_gain(Data1, 'price', feature)

target feature: price
descriptive_feature: cut
impurity: entropy
impurity of partitions: [1.68, 1.78]
weights of partitions: [0.717, 0.283]
remaining impurity: 1.708
information gain: 0.008




target feature: price
descriptive_feature: color
impurity: entropy
impurity of partitions: [1.657, 1.445, 1.833]
weights of partitions: [0.269, 0.434, 0.297]
remaining impurity: 1.617
information gain: 0.099




target feature: price
descriptive_feature: depth
impurity: entropy
impurity of partitions: [1.517, 1.749, 1.74]
weights of partitions: [0.349, 0.316, 0.335]
remaining impurity: 1.665
information gain: 0.051




target feature: price
descriptive_feature: carat
impurity: entropy
impurity of partitions: [-0.0, 1.365, 1.529]
weights of partitions: [0.335, 0.373, 0.292]
remaining impurity: 0.956
information gain: 0.76






We observe that the information gain is highest for the `carat` feature. Therefore, the **root node** is `carat`.

In [11]:
# Creating the table
df_splits = pd.DataFrame(columns=['split', 'remainder', 'info_gain','is_optimal'])
df_splits.loc[len(df_splits)] = ['carat', 0.956, 0.76, True]
df_splits.loc[len(df_splits)] = ['color', 1.617, 0.099, False]
df_splits.loc[len(df_splits)] = ['depth', 1.665, 0.051, False]
df_splits.loc[len(df_splits)] = ['cut', 1.708, 0.008, False]
df_splits

Unnamed: 0,split,remainder,info_gain,is_optimal
0,carat,0.956,0.76,True
1,color,1.617,0.099,False
2,depth,1.665,0.051,False
3,cut,1.708,0.008,False


Since `carat` has the highest information gain (0.76) and the lowest remaining impurity (0.956), it is the optimal feature for the root node of the decision tree.
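
The `carat` row can be verified by hand. The sketch below assumes the per-level `price` counts that are printed in the Part D partition output, and recomputes the remaining impurity and information gain in plain Python; the names `entropy`, `partitions` and `overall_counts` are chosen here for illustration.

```python
from math import log2


def entropy(counts):
    """Shannon entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


# price counts within each carat level (copied from the Part D partition output)
partitions = {
    "small": [71],              # low
    "medium": [48, 22, 8, 1],   # medium, low, high, premium
    "large": [26, 23, 13],      # medium, high, premium
}
overall_counts = [93, 74, 31, 14]  # low, medium, high, premium over all rows

n_rows = sum(sum(counts) for counts in partitions.values())

# remaining impurity: partition entropies weighted by partition size
remainder = sum(sum(counts) / n_rows * entropy(counts) for counts in partitions.values())
info_gain = entropy(overall_counts) - remainder

print("remaining impurity:", round(remainder, 3))
print("information gain:", round(info_gain, 3))
```

Unlike `feature_info_gain()`, this version rounds only when printing, so it avoids accumulating rounding error from the intermediate weights and entropies.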

### Part D

**Making predictions for the** `price` **target variable using** `carat` **as the root node**

The data is partitioned by the levels of `carat`, and the probability of each `price` level within each partition is computed using the code below.

In [12]:
# Partitioning the carat feature with respect to the price target feature
for level in Data1['carat'].unique():
    print('Level name:', level)
    df_feature_level = Data1[Data1['carat'] == level]
    print('Data partition corresponding to price:')
    print(df_feature_level['price'].value_counts())
    print('Target feature impurity of partition:', compute_entropy(df_feature_level['price']))
    print('Weight of partition:', str(len(df_feature_level)) + '/' + str(len(Data1)))

    # Calculating probability for all levels of price
    print("Probability of each level of price:")
    print(df_feature_level['price'].value_counts(normalize=True).round(3))
    print("\n")
    print('=================================================')
    print("\n")

Level name: small
Data partition corresponding to price:
low    71
Name: price, dtype: int64
Target feature impurity of partition: -0.0
Weight of partition: 71/212
Probability of each level of price:
low    1.0
Name: price, dtype: float64




Level name: medium
Data partition corresponding to price:
medium     48
low        22
high        8
premium     1
Name: price, dtype: int64
Target feature impurity of partition: 1.365
Weight of partition: 79/212
Probability of each level of price:
medium     0.608
low        0.278
high       0.101
premium    0.013
Name: price, dtype: float64




Level name: large
Data partition corresponding to price:
medium     26
high       23
premium    13
Name: price, dtype: int64
Target feature impurity of partition: 1.529
Weight of partition: 62/212
Probability of each level of price:
medium     0.419
high       0.371
premium    0.210
Name: price, dtype: float64






For each leaf, the predicted `price` is the level with the highest probability in that partition.

In [13]:
# Creating the table
df_pred = pd.DataFrame(columns=['leaf_condition', 'low_price_prob', 'medium_price_prob','high_price_prob','premium_price_prob','leaf_prediction_price'])
df_pred.loc[len(df_pred)] = ['small', 1.0, 0, 0, 0,'low']
df_pred.loc[len(df_pred)] = ['medium', 0.278, 0.608, 0.101, 0.013 , 'medium']
df_pred.loc[len(df_pred)] = ['large', 0, 0.419, 0.371, 0.210, 'medium']
df_pred

Unnamed: 0,leaf_condition,low_price_prob,medium_price_prob,high_price_prob,premium_price_prob,leaf_prediction_price
0,small,1.0,0.0,0.0,0.0,low
1,medium,0.278,0.608,0.101,0.013,medium
2,large,0.0,0.419,0.371,0.21,medium
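
The prediction table can also be derived from the partition counts instead of being typed by hand. The following is a minimal sketch, assuming the counts printed in the partition output above; `partitions`, `price_levels` and `leaf_rows` are names chosen for illustration.

```python
# price counts within each carat level (copied from the partition output above)
partitions = {
    "small": {"low": 71},
    "medium": {"medium": 48, "low": 22, "high": 8, "premium": 1},
    "large": {"medium": 26, "high": 23, "premium": 13},
}
price_levels = ["low", "medium", "high", "premium"]

leaf_rows = []
for leaf, counts in partitions.items():
    total = sum(counts.values())
    # probability of each price level in this leaf (0 if the level is absent)
    probs = {lvl: round(counts.get(lvl, 0) / total, 3) for lvl in price_levels}
    # majority-class prediction; ties go to the first level listed in price_levels
    prediction = max(probs, key=probs.get)
    leaf_rows.append((leaf, probs, prediction))

for leaf, probs, prediction in leaf_rows:
    print(leaf, probs, prediction)
```

Note that the `large` leaf predicts `medium` with a probability of only 0.419, so a depth-1 tree never predicts `high` or `premium` on this data.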
