# Attribute selection with information gain
### Case workbook

Source: [F. Provost, T. Fawcett, "Data Science for Business"](https://data-science-for-biz.com/)

Dataset source: [Mushroom Data Set](https://archive.ics.uci.edu/ml/datasets/Mushroom)

Problem outline: given a dataset whose instances are described by attributes and a target variable, determine which attribute is the most informative for estimating the value of the target variable.

Problem type: classification

Dataset values: categorical

Target variable: edible (e), poisonous (p)

Splitting criterion: [Information gain](https://en.wikipedia.org/wiki/Information_gain_in_decision_trees)
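
For reference, the standard definitions behind this criterion are sketched here. The symbols are chosen for illustration and do not come from the source: $S$ is the full set of instances, $A$ an attribute, $S_v$ the subset of instances where $A = v$, and $p_c$ the proportion of class $c$ in a set.

$$H(S) = -\sum_{c} p_c \log_2 p_c$$

$$IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)$$

In the code cells of this workbook, `parent_entropy` plays the role of $H(S)$, `val_weights` holds the ratios $|S_v| / |S|$, and `segment_entropy` holds $H(S_v)$.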


<img src="https://images.pexels.com/photos/3100522/pexels-photo-3100522.jpeg?cs=srgb&dl=pexels-katalin-rhorv%C3%A1t-3100522.jpg&fm=jpg" width="400" height="900"></img>
<br>
_Photo by Katalin RHorvát from Pexels_



In [None]:
## IMPORT SECTION ##
####################

## Data utils

import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy.stats import entropy

## Other

from pathlib import Path

In [None]:
## Read dataset ##
##################

mushroom_set = pd.read_csv('../input/mushroom-classification/mushrooms.csv')
mushroom_set.head()

In [None]:
## Functions ##
###############

def weight_values(df):
    '''
    Constructs a dataframe containing normalized value counts for each 
    input dataframe's attribute.
    :df: pd.DataFrame
    :return: pd.DataFrame
    '''
    
    output = pd.DataFrame()

    for col in df.columns:
        coldf = pd.DataFrame()

        ## Get relative counts of values
        coldf[col] = df[col].value_counts(normalize=True)

        ## Construct hierarchical index on coldf
        coldf = pd.concat([coldf], keys=[col], names=['attribute','value'])
        
        coldf.rename(columns={col:'val_weights'}, inplace = True)
        
        ## Concat to output dataframe
        output = pd.concat([output, coldf])
    

    return output


def weight_segments(s_df, w_df, l_name = 'class'):
    '''
    Performs a segmentation of categorical attributes and calculates label weights for each attribute's value.
    :s_df: source dataframe pd.DataFrame()
    :w_df: dataframe containing value weights for each attribute pd.DataFrame()
    :l_name: label name str
    :return: pd.DataFrame() 
    '''
    
    class_w_df = pd.DataFrame()
    
    ## Perform segmentation
    ## For each attribute in weight dataframe
    for attribute in w_df.index.levels[0]:

        ## For each value in weight dataframe
        for val in w_df.loc[attribute].index:

            ## Calculate label weights
            col_class_weights = s_df.loc[s_df[attribute]==val][l_name].value_counts(normalize=True)
            ## Cast pd.Series into pd.DataFrame
            col_class_weights = pd.DataFrame(col_class_weights)
            ## Transpose pd.DataFrame
            col_class_weights = col_class_weights.T
            ## Set categorical values as index
            col_class_weights['value'] = val
            col_class_weights.set_index('value',inplace=True)
            ## Construct hierarchical index: one key for the single frame,
            ## since 'value' is already the inner index level
            col_class_weights = pd.concat([col_class_weights], keys=[attribute], names=['attribute','value'])
            ## Join pd.DataFrame to output pd.Dataframe
            class_w_df = pd.concat([class_w_df,col_class_weights])

    return class_w_df

In [None]:
## Construct sub dataframes
weight_df = weight_values(mushroom_set)
segmented_df = weight_segments(mushroom_set, weight_df)

## Construct dataframe for entropy calculation
entropy_df = segmented_df.join(weight_df)
entropy_df.fillna(0, inplace = True)
entropy_df

In [None]:
## Calculate parent entropy parameter
parent_entropy = entropy(
    [entropy_df.loc[('class','p')]['val_weights'], 
     entropy_df.loc[('class','e')]['val_weights']],
    base=2
    )

In [None]:
## Calculate the label entropy within each segment (attribute value),
## then weight it by the segment's share of rows
entropy_df['segment_entropy'] = entropy([entropy_df['p'], entropy_df['e']], base=2)
entropy_df['weighted_entropy'] = entropy_df['val_weights'] * entropy_df['segment_entropy']
entropy_df
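
The cell above relies on `scipy.stats.entropy` reducing a 2-D input along its first axis, so stacking the `p` and `e` columns yields one entropy value per row of `entropy_df`. A minimal sketch of that behaviour follows; the three segments and their proportions are made up for illustration.

```python
import numpy as np
from scipy.stats import entropy

# Toy class proportions for three hypothetical segments (one entry per segment)
p = np.array([0.5, 1.0, 0.25])   # share of class "p" in each segment
e = np.array([0.5, 0.0, 0.75])   # share of class "e" in each segment

# The stacked input has shape (2, 3); entropy reduces along axis 0 by default,
# so every column (segment) gets its own entropy value.
segment_entropy = entropy([p, e], base=2)
print(segment_entropy)
```

An evenly mixed segment has an entropy of 1 bit and a pure segment has 0, which is why pure segments contribute nothing to the weighted sum.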

In [None]:
## Construct a list of attribute, information gain tuples
ig_list = [(attribute, parent_entropy - entropy_df.loc[attribute]['weighted_entropy'].sum()) # calculate information gain
           for attribute 
           in entropy_df.index.levels[0] 
           if len(entropy_df.loc[attribute]) > 1 and attribute != 'class'] # exclude single-valued attributes and the label column

In [None]:
## Construct information gain dataframe
ig_df = pd.DataFrame(ig_list,
            columns=['Attribute', 'Information Gain'])

ig_df.set_index('Attribute', inplace = True)
ig_df
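
As a sanity check on the numbers above, the same quantity can be computed more compactly with `groupby`. This is only a sketch under stated assumptions: `information_gain` is a hypothetical helper, not part of the original workbook, and the toy frame is invented for illustration.

```python
import pandas as pd
from scipy.stats import entropy

def information_gain(df, attribute, label="class"):
    """Information gain of `attribute` with respect to `label` (hypothetical helper)."""
    # Entropy of the label distribution before splitting
    parent = entropy(df[label].value_counts(normalize=True), base=2)
    # Share of rows falling into each attribute value
    weights = df[attribute].value_counts(normalize=True)
    # Entropy of the label distribution within each attribute value
    child = df.groupby(attribute)[label].apply(
        lambda s: entropy(s.value_counts(normalize=True), base=2)
    )
    # Parent entropy minus the weighted average of the child entropies
    return parent - (weights * child).sum()

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "class": ["p", "p", "e", "e"],
    "odor":  ["f", "f", "n", "n"],   # separates the classes perfectly
    "shape": ["x", "b", "x", "b"],   # carries no information about the class
})

print(information_gain(toy, "odor"))
print(information_gain(toy, "shape"))
```

Applied to `mushroom_set`, a call such as `information_gain(mushroom_set, 'odor')` should agree with the corresponding row of `ig_df` if both computations are correct.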

In [None]:
## Visualize results

%matplotlib inline

ax = ig_df.plot(kind='bar', title='Information gain by mushroom attribute', color='#beaed4', figsize=[12,6])
ax.patch.set_facecolor('#386cb0')
ax.patch.set_alpha(0.8)