
# C4.5 Algorithm Overview

C4.5, developed by Ross Quinlan, extends his earlier ID3 algorithm to address several of its limitations. It handles both continuous and categorical attributes, tolerates missing values, and prunes the tree after it is built.

## Key Features of C4.5:
- **Handling Different Types of Data:** C4.5 can process datasets with numerical and categorical attributes.
- **Pruning:** To prevent overfitting, C4.5 prunes the trees it generates.
- **Missing Values:** The algorithm can manage missing values without requiring prior imputation.
- **Rule Derivation:** C4.5 can produce sets of if-then rules from the decision trees for easier interpretation.
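
To make the first feature above concrete: one common way to handle a numeric attribute is to sort its values and score a binary split `value <= t` at each candidate threshold. The sketch below is pure Python and is separate from the notebook's own code. The helper names `label_entropy` and `best_threshold`, the toy measurements, and the choice of midpoints as candidate thresholds are all assumptions made for illustration.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best split `value <= threshold`.

    Candidate thresholds are midpoints between consecutive distinct sorted
    values (an assumption of this sketch).
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = label_entropy([lab for _, lab in pairs])
    best_t, best_gain = None, 0.0
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        gain = (base
                - (len(left) / n) * label_entropy(left)
                - (len(right) / n) * label_entropy(right))
        if gain > best_gain:
            best_t, best_gain = (lo + hi) / 2, gain
    return best_t, best_gain


# Hypothetical toy data: six petal lengths with two classes
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9]
species = [0, 0, 0, 1, 1, 1]
threshold, gain = best_threshold(petal_length, species)
print(threshold, gain)
```

On this toy data the two classes are perfectly separated by a threshold between 1.5 and 4.5, so a single binary split removes all of the entropy.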

## How C4.5 Works:
1. **Entropy and Information Gain:** C4.5 begins by calculating the entropy to measure dataset impurity and then computes the information gain for each attribute.
2. **Gain Ratio:** C4.5 uses the gain ratio, which normalizes information gain by the intrinsic information of an attribute, to choose the best attribute for splitting the data.
3. **Tree Generation:** The algorithm recursively splits the data based on the attribute with the highest gain ratio, forming a decision tree.
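4. **Post-Pruning:** After building the tree, C4.5 applies error-based pruning: it replaces a subtree with a leaf when a pessimistic (upper-bound) estimate of the error rate, computed from the training data itself, does not get worse. No separate validation set is required.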
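
Steps 1 and 2 can be illustrated on a tiny categorical example. The sketch below is pure Python; the function names `label_entropy` and `gain_ratio_toy` and the four-row dataset are hypothetical and chosen only for illustration. It compares a useful attribute with an ID-like attribute that has a different value in every row.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of a sequence of values."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio_toy(attribute_values, labels):
    """Return (information_gain, split_information, gain_ratio) for one attribute."""
    n = len(labels)
    # Group the class labels by attribute value
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Entropy remaining after the split, weighted by group size
    weighted = sum(len(g) / n * label_entropy(g) for g in groups.values())
    info_gain = label_entropy(labels) - weighted
    # Split information is the entropy of the attribute's own value distribution
    split_info = label_entropy(attribute_values)
    ratio = info_gain / split_info if split_info > 0 else 0.0
    return info_gain, split_info, ratio


# Hypothetical toy data
outlook = ["sunny", "sunny", "rain", "rain"]
row_id = [1, 2, 3, 4]
play = ["no", "no", "yes", "yes"]

print(gain_ratio_toy(outlook, play))  # gain 1.0, split info 1.0, ratio 1.0
print(gain_ratio_toy(row_id, play))   # gain 1.0, split info 2.0, ratio 0.5
```

Both attributes reach the same information gain, but the ID-like attribute splits the data into more, smaller groups, so its split information is larger and its gain ratio is lower. This is the bias toward many-valued attributes that the gain ratio is meant to correct.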

## Advantages of C4.5:
- **Reduced Overfitting:** Pruning helps make the model more generalizable.
- **Flexibility:** Its ability to handle various types of data makes C4.5 versatile for different machine learning tasks.
- **Interpretability:** The decision trees or rules are straightforward to understand.
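
The pruning step behind the first advantage can be sketched as follows. This is a minimal illustration, assuming the normal-approximation upper confidence bound that textbook descriptions of C4.5 commonly use, with `z = 0.69` standing in for a 25% confidence level. Quinlan's own program may compute the bound differently in detail. The function names and the leaf counts are hypothetical.

```python
import math


def pessimistic_error(errors, n, z=0.69):
    """Upper confidence bound on the error rate of a leaf.

    errors: training samples misclassified at the leaf
    n:      training samples reaching the leaf
    z:      normal deviate for the chosen confidence level (assumed 0.69)
    """
    f = errors / n
    numerator = (f + z * z / (2 * n)
                 + z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n)))
    return numerator / (1 + z * z / n)


def should_prune(node_errors, leaf_stats, z=0.69):
    """Decide whether to replace a subtree with a single leaf.

    node_errors: errors made if the whole subtree became one majority-class leaf
    leaf_stats:  list of (errors, n) pairs, one per leaf of the subtree
    """
    total_n = sum(n for _, n in leaf_stats)
    # Estimated error of the subtree: size-weighted average over its leaves
    subtree_estimate = sum(n / total_n * pessimistic_error(e, n, z)
                           for e, n in leaf_stats)
    # Estimated error if the subtree is collapsed into one leaf
    collapsed_estimate = pessimistic_error(node_errors, total_n, z)
    return collapsed_estimate <= subtree_estimate


# Hypothetical subtree with three leaves covering 6, 2 and 6 training samples
leaves = [(2, 6), (1, 2), (2, 6)]
print(round(pessimistic_error(2, 6), 2))
print(should_prune(node_errors=5, leaf_stats=leaves))
```

Small leaves receive a large penalty because the bound widens as `n` shrinks, which is why a subtree made of several tiny leaves can look worse than one larger leaf even when its raw training error is lower.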

## Use Cases:
C4.5 is suitable for classification tasks where interpretability is crucial, such as medical diagnosis, credit scoring, and customer segmentation.


## Import necessary libraries

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

## Loading the Iris dataset to fit and test our model later

In [2]:
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Split the dataset into a training set and a testing set
# test_size=0.2 means 20% of the data will be used for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Converting the Sklearn dataset to a DataFrame

In [3]:
train_data = pd.DataFrame(X_train, columns=feature_names)
train_data['class'] = y_train

test_data = pd.DataFrame(X_test, columns=feature_names)
test_data['class'] = y_test

In [4]:
train_data

index,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),class
0,4.6,3.6,1.0,0.2,0
1,5.7,4.4,1.5,0.4,0
2,6.7,3.1,4.4,1.4,1
3,4.8,3.4,1.6,0.2,0
4,4.4,3.2,1.3,0.2,0
...,...,...,...,...,...
115,6.1,2.8,4.0,1.3,1
116,4.9,2.5,4.5,1.7,2
117,5.8,4.0,1.2,0.2,0
118,5.8,2.6,4.0,1.2,1


In [5]:
test_data

index,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),class
0,6.1,2.8,4.7,1.2,1
1,5.7,3.8,1.7,0.3,0
2,7.7,2.6,6.9,2.3,2
3,6.0,2.9,4.5,1.5,1
4,6.8,2.8,4.8,1.4,1
5,5.4,3.4,1.5,0.4,0
6,5.6,2.9,3.6,1.3,1
7,6.9,3.1,5.1,2.3,2
8,6.2,2.2,4.5,1.5,1
9,5.8,2.7,3.9,1.2,1


## Defining the entropy of a dataset:

In [6]:
def entropy(target_col):
    # Shannon entropy (in bits) of the class labels in target_col
    _, counts = np.unique(target_col, return_counts=True)
    probabilities = counts / counts.sum()
    return -np.sum(probabilities * np.log2(probabilities))

## Defining the gain ratio of an attribute:

In [7]:
def gain_ratio(data, split_attribute_name, target_name="class"):
    # Entropy of the class labels before splitting
    total_entropy = entropy(data[target_name])

    # Distinct values of the split attribute and their frequencies.
    # Each distinct value becomes its own branch, so numeric attributes are
    # treated as categorical here (no threshold search).
    vals, counts = np.unique(data[split_attribute_name], return_counts=True)
    weights = counts / counts.sum()

    # Entropy remaining after the split, weighted by branch size
    weighted_entropy = np.sum([
        w * entropy(data.loc[data[split_attribute_name] == v, target_name])
        for v, w in zip(vals, weights)
    ])

    # Information gain: reduction in entropy achieved by the split
    information_gain = total_entropy - weighted_entropy

    # Split information: entropy of the branch sizes themselves
    split_information = -np.sum(weights * np.log2(weights))

    # Gain ratio; defined as 0 when the attribute takes a single value
    return information_gain / split_information if split_information != 0 else 0

## And finally defining the C4.5 algorithm:

In [8]:
def C45(data, originaldata, features, target_attribute_name="class", parent_node_class=None):
    # Stopping criteria. The empty-data check comes first because the
    # purity check below indexes into the (possibly empty) set of labels.
    if len(data) == 0:
        # No samples reached this branch: use the majority class of the original data
        labels, counts = np.unique(originaldata[target_attribute_name], return_counts=True)
        return labels[np.argmax(counts)]
    elif len(np.unique(data[target_attribute_name])) <= 1:
        # All samples share one class: return it as a leaf
        return np.unique(data[target_attribute_name])[0]
    elif len(features) == 0:
        # No attributes left to split on: return the parent's majority class
        return parent_node_class
    else:
        # Majority class at this node, passed down as a fallback for its children
        labels, counts = np.unique(data[target_attribute_name], return_counts=True)
        parent_node_class = labels[np.argmax(counts)]

        # Select the feature with the highest gain ratio
        item_values = [gain_ratio(data, feature, target_attribute_name) for feature in features]
        best_feature = features[np.argmax(item_values)]

        # The tree is a nested dict: {feature: {value: subtree_or_label}}
        tree = {best_feature: {}}
        features = [f for f in features if f != best_feature]

        # Grow one branch per distinct value of the chosen feature
        for value in np.unique(data[best_feature]):
            sub_data = data[data[best_feature] == value]
            subtree = C45(sub_data, originaldata, features, target_attribute_name, parent_node_class)
            tree[best_feature][value] = subtree

        return tree
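
Because the tree built by `C45` is a nested dictionary of the form `{feature: {value: subtree_or_label}}`, it can be flattened into if-then rules by walking every path from the root to a leaf. The sketch below assumes that dictionary layout; the function name `tree_to_rules` and the values in `toy_tree` are hypothetical. It only lists the paths and does not simplify the resulting rules.

```python
def tree_to_rules(tree, conditions=()):
    """Flatten a nested-dict decision tree into (conditions, label) pairs.

    Each condition is a (feature, value) tuple collected along the path
    from the root to a leaf.
    """
    if not isinstance(tree, dict):
        # Leaf reached: emit the conditions gathered so far and the class label
        return [(list(conditions), tree)]
    rules = []
    for feature, branches in tree.items():
        for value, subtree in branches.items():
            rules.extend(tree_to_rules(subtree, conditions + ((feature, value),)))
    return rules


# Hypothetical tree in the same nested-dict layout
toy_tree = {
    "petal width (cm)": {
        0.2: 0,
        1.3: 1,
        1.8: {"petal length (cm)": {4.8: 1, 5.1: 2}},
    }
}

for conds, label in tree_to_rules(toy_tree):
    clause = " AND ".join(f"{feat} == {val}" for feat, val in conds)
    print(f"IF {clause} THEN class = {label}")
```

Each printed line corresponds to one leaf, so the number of rules equals the number of leaves in the tree.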

## Function to predict for any input sample

In [9]:
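def predict(query, tree, default=1):
    # Find the feature tested at this node and follow the branch that
    # matches the query's value for it.
    for key in query:
        if key in tree:
            try:
                result = tree[key][query[key]]
            except KeyError:
                # Feature value never seen during training
                return default

            if isinstance(result, dict):
                # Internal node: keep descending, preserving the caller's default
                return predict(query, result, default)
            return result
    # No feature in the query matches this node
    return default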

## Function to test the prediction accuracy of the tree

In [10]:
def test(data, tree):
    # Turn each row's features (every column except the last, "class") into a dict
    queries = data.iloc[:, :-1].to_dict(orient="records")

    # Predict each sample, falling back to class 1 for unseen feature values
    predicted = np.array([predict(query, tree, 1.0) for query in queries])

    # Calculate the prediction accuracy
    print('The prediction accuracy is: ',(np.sum(predicted == data["class"].to_numpy())/len(data))*100,'%')

## Create the decision tree using the C4.5 algorithm

In [12]:
decision_tree = C45(train_data, train_data, train_data.columns[:-1])

## Test the decision tree's accuracy on the test data

In [13]:
test(test_data, decision_tree)

The prediction accuracy is:  96.66666666666667 %
