<font color="red" size="6">Filter Methods</font>
<p><font color="Yellow" size="4">6_Information_Gain</font>

*Information Gain (IG) is a key concept in machine learning. Decision tree algorithms such as ID3 use it to choose the feature to split the data on. Information Gain measures the reduction in entropy (uncertainty about the target) after a dataset is split on a particular feature.*
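
The definition above can be written compactly. The following is a sketch in standard notation; the symbols are introduced here for illustration and are not taken from the original text: $S$ is the dataset, $A$ a feature, $S_v$ the subset of rows where $A = v$, and $p_c$ the fraction of rows in $S$ that belong to class $c$.

$$H(S) = -\sum_{c} p_c \log_2 p_c, \qquad IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$

With base-2 logarithms both quantities are measured in bits. A feature that splits the data into perfectly pure subsets has $IG(S, A) = H(S)$, and a feature unrelated to the target has $IG(S, A) = 0$.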

In [2]:
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

# Create a simple dataset
data = {
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast', 'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
    'PlayTennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
}

df = pd.DataFrame(data)

# Function to calculate the entropy (in bits) of a pandas Series of class labels
def entropy(labels):
    # Relative frequency of each class label
    probabilities = labels.value_counts(normalize=True)
    return -np.sum(probabilities * np.log2(probabilities))

# Function to calculate information gain
def information_gain(df, feature, target):
    # Calculate the entropy of the whole dataset
    original_entropy = entropy(df[target])
    
    # Calculate the weighted average entropy after the split based on the feature
    feature_values = df[feature].unique()
    weighted_entropy = 0
    for value in feature_values:
        subset = df[df[feature] == value]
        weighted_entropy += (len(subset) / len(df)) * entropy(subset[target])
    
    # Information Gain is the reduction in entropy
    return original_entropy - weighted_entropy

# Calculate Information Gain for the 'Outlook' feature
ig_outlook = information_gain(df, 'Outlook', 'PlayTennis')
print(f"Information Gain for 'Outlook': {ig_outlook}")


Information Gain for 'Outlook': 0.24674981977443933
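
As a filter method, Information Gain is normally computed for every candidate feature so that the features can be ranked. The sketch below repeats the same PlayTennis data and scores all four features. The names `ig_scores` and `target_col` are illustrative choices, and the helper functions are rewritten compactly so that the block runs on its own.

```python
import numpy as np
import pandas as pd

# Same PlayTennis data as in the cell above, repeated so this block is self-contained
df = pd.DataFrame({
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast',
                'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
                    'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal',
                 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
             'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
    'PlayTennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
                   'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
})

def entropy(labels):
    """Shannon entropy, in bits, of a Series of class labels."""
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(frame, feature, target):
    """Entropy of the target minus the weighted entropy within each feature value."""
    weights = frame[feature].value_counts(normalize=True)
    conditional = sum(
        w * entropy(frame.loc[frame[feature] == value, target])
        for value, w in weights.items()
    )
    return entropy(frame[target]) - conditional

target_col = 'PlayTennis'

# Score every non-target column and sort from most to least informative
ig_scores = pd.Series(
    {col: information_gain(df, col, target_col) for col in df.columns if col != target_col}
).sort_values(ascending=False)

print(ig_scores)
```

On this data the ranking is Outlook, Humidity, Wind, Temperature, which is why ID3 would split on Outlook first. A filter step would keep the top-ranked features or those above a chosen threshold; the cutoff is a modelling decision, not something the score itself dictates.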


<b><font color="orange">ALTERNATE METHOD:</font></b>
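The scikit-learn library does not provide a function named Information Gain, but the Information Gain of a feature is the mutual information between that feature and the target, so scikit-learn's mutual information estimators can be used for feature selection instead. For classification tasks, mutual_info_classif estimates the mutual information between each feature and the target variable. Its scores differ from the hand-written function above in two ways: they are reported in nats (natural logarithm) rather than bits, and continuous features such as the Iris measurements are handled by a nearest-neighbor estimator that involves randomness, so the values are estimates and can vary slightly between runs unless random_state is set.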

In [4]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

# Load the Iris dataset
iris = load_iris()
X = iris.data  # Features
y = iris.target  # Target variable

# Calculate mutual information between each feature and the target
mi = mutual_info_classif(X, y)

# Display the results
feature_names = iris.feature_names
mi_df = pd.DataFrame({
    'Feature': feature_names,
    'Mutual Information': mi
}).sort_values(by='Mutual Information', ascending=False)

print(mi_df)


             Feature  Mutual Information
2  petal length (cm)            0.999752
3   petal width (cm)            0.978626
0  sepal length (cm)            0.524433
1   sepal width (cm)            0.261887
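
To connect the two methods, the sketch below runs mutual_info_classif on the categorical PlayTennis data from the first example and converts the scores from nats to bits. It assumes integer codes from `pd.factorize` are an acceptable encoding and passes `discrete_features=True` so that every column is treated as categorical; the names `mi_bits` and `features` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Same PlayTennis data as in the first example, repeated so this block is self-contained
df = pd.DataFrame({
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast',
                'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
                    'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal',
                 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
             'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
    'PlayTennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
                   'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
})

features = ['Outlook', 'Temperature', 'Humidity', 'Wind']

# Integer-encode each categorical column; the codes are arbitrary labels, not magnitudes
X = np.column_stack([pd.factorize(df[col])[0] for col in features])
y = pd.factorize(df['PlayTennis'])[0]

# discrete_features=True tells scikit-learn to treat every column as categorical
mi_nats = mutual_info_classif(X, y, discrete_features=True, random_state=0)

# Dividing by ln(2) converts nats to bits, the unit used by the entropy-based function
mi_bits = pd.Series(mi_nats / np.log(2), index=features).sort_values(ascending=False)

print(mi_bits)
```

For discrete features and a discrete target, the converted scores should agree with the hand-computed Information Gain up to floating-point error (about 0.247 bits for Outlook). The Iris scores above are not directly comparable, since they come from the continuous-feature estimator and are expressed in nats.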
