<a href="https://colab.research.google.com/github/Clonlyfan/Statistics-and-more/blob/main/decision_tree.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
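
For reference, the quantities computed in the cell below can be written as follows. The notation is an assumption introduced here and does not appear in the notebook: $S$ is the dataset, $p_c$ is the fraction of rows in $S$ whose `Play Tennis` value is class $c$, and $S_v$ is the subset of rows where attribute $A$ takes the value $v$.

$$
\begin{aligned}
H(S) &= -\sum_{c} p_c \log_2 p_c \\
G(S) &= 1 - \sum_{c} p_c^2 \\
\mathrm{IG}(S, A) &= H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v) \\
\mathrm{GiniGain}(S, A) &= G(S) - \sum_{v} \frac{|S_v|}{|S|}\, G(S_v)
\end{aligned}
$$

With $k$ distinct classes, $0 \le H(S) \le \log_2 k$ and $0 \le G(S) \le 1 - 1/k$, so a two-class target has at most 1 bit of entropy and a Gini impurity of at most 0.5. Both gains are non-negative whenever the subsets $S_v$ partition $S$.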

In [6]:
import pandas as pd
import numpy as np
import math

def calculate_entropy(data, target='Play Tennis'):
    """Calculates the Shannon entropy (in bits) of the target column."""
    if len(data) == 0:
        return 0
    # Relative frequency of each class label in the target column
    probabilities = data[target].value_counts(normalize=True)
    # Skip zero probabilities, for which log2 is undefined
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return entropy

def calculate_gini_impurity(data, target='Play Tennis'):
    """Calculates the Gini impurity of the target column."""
    if len(data) == 0:
        return 0
    probabilities = data[target].value_counts(normalize=True)
    # 1 minus the sum of squared class probabilities
    gini = 1 - sum(p**2 for p in probabilities)
    return gini

def _calculate_gain(data, attribute, impurity_fn):
    """Impurity of data minus the size-weighted impurity of its subsets."""
    weighted_impurity = 0

    # One subset per distinct value of the attribute
    for value in data[attribute].unique():
        subset = data[data[attribute] == value]
        proportion = len(subset) / len(data)
        weighted_impurity += proportion * impurity_fn(subset)

    return impurity_fn(data) - weighted_impurity

def calculate_information_gain(data, attribute):
    """Calculates the information gain of splitting data on a given attribute."""
    return _calculate_gain(data, attribute, calculate_entropy)

def calculate_gini_gain(data, attribute):
    """Calculates the Gini gain of splitting data on a given attribute."""
    return _calculate_gain(data, attribute, calculate_gini_impurity)

# Load the data from the Excel file
file_path = "/content/gamedayornot.xlsx"
try:
    df = pd.read_excel(file_path)
except FileNotFoundError:
    # Stop the cell here: without df, every later step would fail with a NameError.
    raise SystemExit(f"Error: File not found at {file_path}")

# Calculate initial entropy and Gini impurity of the entire dataset
initial_entropy = calculate_entropy(df)
initial_gini = calculate_gini_impurity(df)

print(f"Initial Entropy of the dataset: {initial_entropy:.4f}")
print(f"Initial Gini Impurity of the dataset: {initial_gini:.4f}\n")

features = ['Outlook', 'Temperature', 'Humidity', 'Wind']

# Calculate Information Gain for each feature
print("Information Gain for each feature:")
for feature in features:
    info_gain = calculate_information_gain(df, feature)
    print(f"Information Gain ({feature}): {info_gain:.4f}")
print("\n")

# Calculate Gini Gain for each feature
print("Gini Gain for each feature:")
for feature in features:
    gini_gain = calculate_gini_gain(df, feature)
    print(f"Gini Gain ({feature}): {gini_gain:.4f}")
print("\n")

# Detailed Explanation:

print("Detailed Explanation:\n")

print("1. Initial Entropy and Gini Impurity:")
print("   - **Entropy** measures the impurity or randomness in the target variable ('Play Tennis') of the entire dataset. A higher entropy value indicates more disorder, meaning the classes (Yes/No) are more mixed.")
print(f"     - Initial Entropy: {initial_entropy:.4f}")
print("   - **Gini Impurity** is another measure of impurity. It represents the probability of misclassifying a randomly chosen instance if it were randomly labeled according to the class distribution in the dataset. A higher Gini impurity also indicates more disorder.")
print(f"     - Initial Gini Impurity: {initial_gini:.4f}\n")

print("2. Information Gain:")
print("   - **Information Gain** quantifies the reduction in entropy achieved by splitting the dataset based on a particular feature. It tells us how much more 'organized' the target variable becomes after partitioning the data according to the values of that feature.")
print("   - For each feature ('Outlook', 'Temperature', 'Humidity', 'Wind'):")
print("     - We calculate the entropy of the 'Play Tennis' outcome for each unique value within that feature (e.g., for 'Outlook': Sunny, Overcast, Rainy).")
print("     - Then, we calculate a weighted average of these entropies, where the weights are the proportion of instances belonging to each value of the feature.")
print("     - The Information Gain is the difference between the initial entropy of the dataset and this weighted average entropy. A higher Information Gain suggests that the feature is more effective in classifying the 'Play Tennis' outcome.")
print("   - Based on the calculated Information Gains:")
for feature in features:
    info_gain = calculate_information_gain(df, feature)
    print(f"     - Information Gain ({feature}): {info_gain:.4f}")
print("\n")

print("3. Gini Gain:")
print("   - **Gini Gain** is analogous to Information Gain but uses Gini impurity instead of entropy. It measures the reduction in Gini impurity achieved by splitting the dataset on a particular feature.")
print("   - Similar to Information Gain, a higher Gini Gain for a feature indicates that splitting on that feature leads to a greater reduction in impurity in the resulting subsets, making it a potentially good feature for splitting in a decision tree.")
print("   - For each feature:")
for feature in features:
    gini_gain = calculate_gini_gain(df, feature)
    print(f"     - Gini Gain ({feature}): {gini_gain:.4f}")
print("\n")

print("In the context of building a decision tree:")
print("- The feature with the **highest Information Gain** (if using entropy as the splitting criterion) or the **highest Gini Gain** (if using Gini impurity) would typically be chosen as the root node of the tree.")
print("- This is because these features provide the most information about the target variable and lead to the most homogeneous (pure) child nodes after the split.")
print("- The process would then be recursively applied to the child nodes until a stopping criterion is met (e.g., all instances in a node belong to the same class, or a maximum tree depth is reached).")
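
The recursive procedure described in the last three print statements can be sketched as follows. This is a minimal ID3-style illustration, not the notebook's own code: the helper names (`entropy`, `information_gain`, `build_tree`), the `max_depth` default, and the six toy rows are all assumptions made for the example, and the toy rows are not the contents of `gamedayornot.xlsx`.

```python
import math

import pandas as pd

TARGET = "Play Tennis"  # same target column name the notebook uses


def entropy(data):
    """Shannon entropy (bits) of the target column."""
    probs = data[TARGET].value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(data, attribute):
    """Entropy of data minus the size-weighted entropy of its subsets."""
    weighted = sum(
        len(subset) / len(data) * entropy(subset)
        for _, subset in data.groupby(attribute)
    )
    return entropy(data) - weighted


def build_tree(data, features, max_depth=3):
    """Recursively split on the highest-gain feature; leaves are class labels."""
    majority = data[TARGET].mode().iloc[0]
    # Stopping criteria: pure node, no features left, or depth exhausted.
    if data[TARGET].nunique() == 1 or not features or max_depth == 0:
        return majority
    best = max(features, key=lambda f: information_gain(data, f))
    remaining = [f for f in features if f != best]
    return {
        best: {
            value: build_tree(subset, remaining, max_depth - 1)
            for value, subset in data.groupby(best)
        }
    }


# Hypothetical toy rows, NOT the contents of gamedayornot.xlsx.
toy = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Overcast"],
    "Wind": ["Weak", "Strong", "Weak", "Weak", "Strong", "Strong"],
    TARGET: ["No", "No", "Yes", "Yes", "No", "Yes"],
})

tree = build_tree(toy, ["Outlook", "Wind"])
print(tree)
```

On these toy rows, `Outlook` has the higher information gain and becomes the root. Its `Sunny` and `Overcast` branches are already pure, so only the `Rainy` branch is split again, on `Wind`. Swapping `entropy` for a Gini impurity function would give the Gini-based variant of the same procedure.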

Initial Entropy of the dataset: 1.2638
Initial Gini Impurity of the dataset: 0.5408

Information Gain for each feature:
Information Gain (Outlook): 0.5441
Information Gain (Temperature): 0.4441
Information Gain (Humidity): 0.4975
Information Gain (Wind): 0.3532


Gini Gain for each feature:
Gini Gain (Outlook): -0.0314
Gini Gain (Temperature): 0.1704
Gini Gain (Humidity): 0.2234
Gini Gain (Wind): 0.1380


Detailed Explanation:

1. Initial Entropy and Gini Impurity:
   - **Entropy** measures the impurity or randomness in the target variable ('Play Tennis') of the entire dataset. A higher entropy value indicates more disorder, meaning the classes (Yes/No) are more mixed.
     - Initial Entropy: 1.2638
   - **Gini Impurity** is another measure of impurity. It represents the probability of misclassifying a randomly chosen instance if it were randomly labeled according to the class distribution in the dataset. A higher Gini impurity also indicates more disorder.
     - Initial Gini Impurity