
1) What is a Decision Tree, and how does it work in the context of classification?

- A Decision Tree is a supervised machine learning algorithm that models decisions or classifications as a tree-like flowchart, branching from a root node into internal nodes (conditions) and terminal leaf nodes (final class labels). It classifies data by splitting it into homogeneous subsets based on feature values using criteria like Gini impurity or Information Gain.

i)  Structure: It consists of a root node (start), branches (outcomes of decisions), internal nodes (attribute tests), and leaf nodes (predicted class).

ii) Process: The algorithm starts at the root and poses questions about the data features, splitting the dataset based on the answer.

iii) Splitting: It recursively partitions the data, choosing features that best separate the classes to minimize uncertainty (using metrics like Gini or Entropy).

iv) Stopping Criteria: The splitting continues until the node becomes "pure" (all samples belong to one class) or another stopping criterion, such as maximum depth, is met.

v) Result: A new, unseen data point follows the path of decisions to a leaf node, which assigns the final classification.
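The traversal described in the steps above can be sketched in a few lines. This is only an illustration: the tree is written by hand, and the feature names, thresholds, and the `classify` helper are hypothetical rather than learned from data.

```python
# A hand-written toy tree (hypothetical thresholds, NOT learned from data).
# Internal nodes test one feature against a threshold; leaves hold a class label.
toy_tree = {
    "feature": "petal_length", "threshold": 2.5,   # root node
    "left": {"label": "setosa"},                   # leaf node
    "right": {                                     # internal node
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}


def classify(node, sample):
    """Follow the decision path from the root until a leaf is reached."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


sample = {"petal_length": 4.8, "petal_width": 1.5}
prediction = classify(toy_tree, sample)
print(prediction)  # versicolor
```

The sample fails the root test (4.8 > 2.5), goes right, passes the second test (1.5 <= 1.7), and lands in the "versicolor" leaf.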

2) Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
- Gini Impurity and Entropy are metrics that measure node impurity (how mixed the classes are) in Decision Trees and are used to pick the best split. Gini measures the probability of misclassifying a randomly drawn sample that is labeled according to the node's class distribution (\(1-\sum p_{i}^{2}\)), while Entropy measures disorder or information uncertainty (\(-\sum p_{i}\log _{2}(p_{i})\)). Both drive the algorithm toward pure nodes (impurity = 0) by favoring the split that most reduces impurity; with Entropy, this reduction is called information gain.

- Gini Impurity and Entropy Concepts

i) Gini Impurity: Measures how often a randomly chosen element would be mislabeled if it were labeled at random according to the distribution of labels in the subset. It is generally slightly cheaper to compute because it does not involve logarithms. It is the default metric used by the CART algorithm.

ii) Entropy: A measure of disorder or uncertainty in the node. For binary classification it ranges from 0 (perfectly pure) to 1 (50/50 split); with \(k\) classes the maximum is \(\log _{2}(k)\). It is used to quantify information gain, the reduction in uncertainty achieved by a split.
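As a quick check of the two formulas, the sketch below computes both measures from per-class sample counts. The helper names `gini` and `entropy` and the example counts are illustrative assumptions, not part of any library.

```python
import math


def gini(counts):
    """Gini impurity 1 - sum(p_i^2), computed from per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy -sum(p_i * log2(p_i)), written as sum(p_i * log2(1 / p_i)).

    Empty classes are skipped, since p * log2(p) tends to 0 as p tends to 0.
    """
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


# Made-up class counts for illustration
print(gini([10, 0]), entropy([10, 0]))  # pure node: both are 0.0
print(gini([5, 5]), entropy([5, 5]))    # 50/50 binary node: 0.5 and 1.0
print(gini([5, 5, 5]), entropy([5, 5, 5]))  # three balanced classes
```

For the three-class case, Entropy reaches \(\log _{2}(3)\approx 1.585\), which shows that the upper bound of 1 applies only to binary problems.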

- Impact on Decision Tree Splits

i) Selection Criterion: The decision tree algorithm calculates the impurity of the child nodes for every candidate split (each feature and each candidate threshold).

ii) Minimizing Impurity: The algorithm selects the split that yields the lowest weighted average impurity of the child nodes, whether measured by Gini or by Entropy. With Entropy, this is the same as choosing the highest Information Gain.

iii) Resulting Structure: Both metrics guide the tree to make splits that increase the homogeneity of the nodes.
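The selection step can be sketched for a single numeric feature as follows. This is a simplified illustration, not scikit-learn's actual implementation: the `best_threshold` helper is hypothetical, the data are made up, and candidate thresholds are assumed to be the midpoints between consecutive distinct values.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Return (threshold, weighted Gini) for the best split of one feature.

    Assumption: candidates are midpoints between consecutive distinct values.
    """
    distinct = sorted(set(values))
    best = (None, float("inf"))
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Weight each child's impurity by its share of the samples
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (t, weighted)
    return best


# Made-up toy data for illustration
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "virginica"]

threshold, impurity = best_threshold(petal_length, species)
print(threshold, round(impurity, 4))  # 3.0 0.2222
```

The split at 3.0 isolates all three "setosa" samples in a pure left child, so only the right child contributes impurity (3/6 × 4/9 ≈ 0.2222).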

4) What is Information Gain in Decision Trees, and why is it important for choosing the best split?
-   Information Gain (IG) in decision trees measures the reduction in entropy (impurity or uncertainty) within a dataset after it is split based on a specific attribute. It is calculated as the difference between the parent node's entropy and the weighted average of the child nodes' entropy. High IG indicates that a feature effectively organizes data into purer subsets.

- Why Information Gain is Important for Splitting

i) Optimal Splitting: It identifies the attribute that best separates data into distinct classes, creating the most informative split at each node.

ii) Reduced Uncertainty: By choosing the feature with the highest Information Gain, the model minimizes impurity (maximizes homogeneity) in the resulting child nodes.

iii) Decision Tree Construction: It provides a mathematical, objective criterion for choosing the best split. ID3 uses Information Gain directly, C4.5 uses a normalized variant (gain ratio), and CART typically uses Gini impurity in the same role.

- Key Concepts

i) Entropy (\(H\)): Measures the impurity or randomness of a node. \(H=0\) indicates a pure node (all samples belong to one class), while higher values indicate higher uncertainty.
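The calculation described above (parent entropy minus the weighted average of the child entropies) can be sketched as follows. The helper names and the class counts are made up for illustration.

```python
import math


def entropy(counts):
    """Entropy of a node, computed from per-class sample counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = sum(parent_counts)
    weighted_child_entropy = sum(
        (sum(child) / n) * entropy(child) for child in children_counts
    )
    return entropy(parent_counts) - weighted_child_entropy


# Hypothetical parent node with 5 samples of each class
parent = [5, 5]
ig_perfect = information_gain(parent, [[5, 0], [0, 5]])  # children are pure
ig_weak = information_gain(parent, [[3, 2], [2, 3]])     # children stay mixed

print(round(ig_perfect, 4))  # 1.0
print(round(ig_weak, 4))     # 0.029
```

The first split removes all uncertainty, while the second barely changes the class mix, so a tree builder using this criterion would prefer the first.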

In [None]:

"""6) Write a Python program to:

Load the Iris Dataset

Train a Decision Tree Classifier using the Gini criterion

Print the model's accuracy and feature importances"""

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Split the dataset into training and testing sets
# We use a fixed random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Train a Decision Tree Classifier using the Gini criterion
# The 'criterion' parameter is set to 'gini' by default, but explicitly included for clarity
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

# 3. Print the model's accuracy and feature importances

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}\n")

# Print the feature importances
print("Feature Importances:")
importances = model.feature_importances_
for name, importance in zip(feature_names, importances):
    print(f"- {name}: {importance:.2f}")

Model Accuracy: 1.00

Feature Importances:
- sepal length (cm): 0.00
- sepal width (cm): 0.02
- petal length (cm): 0.89
- petal width (cm): 0.09

In [None]:
""" 7) Write a Python program to:

Load the Iris Dataset

Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree."""

# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def train_and_compare_decision_trees():
    """
    Loads the Iris dataset, trains two Decision Tree classifiers (one with max_depth=3 and one fully grown),
    and compares their accuracies.
    """

    # 1. Load the Iris Dataset
    # The dataset is loaded as a Bunch object, a dictionary-like container
    iris = load_iris()
    X = iris.data  # Features
    y = iris.target # Target labels

    # 2. Split the dataset into training and testing sets
    # We use a 70/30 split for demonstration
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # 3. Train a Decision Tree Classifier with max_depth=3
    # Setting random_state ensures reproducibility
    tree_depth_3 = DecisionTreeClassifier(max_depth=3, random_state=42)
    tree_depth_3.fit(X_train, y_train)

    # 4. Train a fully-grown tree. By default, DecisionTreeClassifier keeps splitting
    # until all leaves are pure or contain fewer than min_samples_split samples.
    full_tree = DecisionTreeClassifier(random_state=42)
    full_tree.fit(X_train, y_train)

    # 5. Make predictions
    y_pred_depth_3 = tree_depth_3.predict(X_test)
    y_pred_full_tree = full_tree.predict(X_test)

    # 6. Compare Accuracies
    accuracy_depth_3 = accuracy_score(y_test, y_pred_depth_3)
    accuracy_full_tree = accuracy_score(y_test, y_pred_full_tree)

    print(f"Accuracy of the tree with max_depth=3: {accuracy_depth_3:.4f}")
    print(f"Accuracy of the fully-grown tree: {accuracy_full_tree:.4f}")

    # Optional: print the actual depth of the fully grown tree
    print(f"Actual depth of the fully-grown tree: {full_tree.tree_.max_depth}")

if __name__ == "__main__":
    train_and_compare_decision_trees()

Accuracy of the tree with max_depth=3: 1.0000
Accuracy of the fully-grown tree: 1.0000
Actual depth of the fully-grown tree: 6
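To see what a depth-limited tree actually looks like, scikit-learn's `export_text` can print the learned rules as indented text. The following is a minimal sketch; for brevity it fits on the full Iris data rather than on the 70/30 training split used above, which is an assumption of this example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Assumption: fit on all 150 samples, only to inspect the tree's structure
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
shallow_tree.fit(iris.data, iris.target)

# Each indented line is a feature test; "class:" lines are the leaf nodes
rules = export_text(shallow_tree, feature_names=list(iris.feature_names))
print(rules)

print("Depth:", shallow_tree.get_depth())
print("Number of leaves:", shallow_tree.get_n_leaves())
```

Running the same calls on a tree trained without `max_depth` would show the extra levels that the depth limit prunes away.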


In [None]:
"""8) Write a Python program to:

Load the California Housing dataset from sklearn

Train a Decision Tree Regressor

Print the Mean Squared Error (MSE) and feature importances."""

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import pandas as pd

# 1. Load the California Housing dataset
print("Loading California Housing dataset...")
# Fetching the dataset returns a Bunch object, which behaves like a dictionary
california_housing = fetch_california_housing(as_frame=True)

# Separate features (X) and target variable (y)
X = california_housing.data
y = california_housing.target
feature_names = california_housing.feature_names

print(f"Dataset loaded. X shape: {X.shape}, y shape: {y.shape}")

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Data split into training ({X_train.shape[0]} samples) and testing ({X_test.shape[0]} samples) sets.")

# 2. Train a Decision Tree Regressor
print("\nTraining a Decision Tree Regressor...")
# Initialize the regressor
dt_regressor = DecisionTreeRegressor(random_state=42)

# Fit the model to the training data
dt_regressor.fit(X_train, y_train)
print("Model training complete.")

# 3. Print the Mean Squared Error (MSE) and feature importances

# Make predictions on the test set
y_pred = dt_regressor.predict(X_test)

# Calculate the Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"\nMean Squared Error (MSE) on test set: {mse:.4f}")

# Get and print feature importances
# The feature_importances_ attribute returns the importance scores
importances = dt_regressor.feature_importances_

# Create a pandas Series for better visualization
feature_importance_series = pd.Series(importances, index=feature_names).sort_values(ascending=False)

print("\nFeature Importances:")
print(feature_importance_series)

Loading California Housing dataset...
Dataset loaded. X shape: (20640, 8), y shape: (20640,)
Data split into training (16512 samples) and testing (4128 samples) sets.

Training a Decision Tree Regressor...
Model training complete.

Mean Squared Error (MSE) on test set: 0.4952

Feature Importances:
MedInc        0.528509
AveOccup      0.130838
Latitude      0.093717
Longitude     0.082902
AveRooms      0.052975
HouseAge      0.051884
Population    0.030516
AveBedrms     0.028660
dtype: float64


In [None]:
""" 9) Write a Python program to:

Load the Iris Dataset

Tune the Decision Tree's max_depth and min_samples_split using GridSearchCV

Print the best parameters and the resulting model accuracy"""

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# 1. Load the Iris Dataset
iris = load_iris()
X = iris.data  # Features
y = iris.target  # Target variable

# Split the data into training and testing sets so the tuned model can be evaluated on held-out data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Tune the Decision Tree's max depth and min_samples_split using GridSearchCV
# Define the Decision Tree model
dtc = DecisionTreeClassifier(random_state=42)

# Define the grid of hyperparameters to search
param_grid = {
    'max_depth': [2, 3, 4, 5, 6],
    'min_samples_split': [2, 3, 4, 5, 10]
}

# Set up GridSearchCV
# cv=5 means 5-fold cross-validation is used
grid_search = GridSearchCV(estimator=dtc, param_grid=param_grid, cv=5, scoring='accuracy', refit=True)

# Fit the GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# 3. Print the best parameters and the resulting model accuracy
# The best parameters found by the search
print(f"Best Parameters found by GridSearchCV: {grid_search.best_params_}")

# The best cross-validation score (accuracy)
print(f"Best Cross-Validation Accuracy: {grid_search.best_score_:.4f}")

# Evaluate the best model on the held-out test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy of the best model on the test set: {test_accuracy:.4f}")

Best Parameters found by GridSearchCV: {'max_depth': 4, 'min_samples_split': 10}
Best Cross-Validation Accuracy: 0.9429
Accuracy of the best model on the test set: 1.0000
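To get a feel for how much work the search above does, the sketch below enumerates the same grid with plain Python. It only counts candidates and does not train any models; the variable names are illustrative.

```python
from itertools import product

# Same grid and fold count as in the cell above
param_grid = {
    "max_depth": [2, 3, 4, 5, 6],
    "min_samples_split": [2, 3, 4, 5, 10],
}
cv_folds = 5

# Every combination of one value per hyperparameter is a candidate
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds  # each candidate is fitted once per fold

print(f"Candidates: {n_candidates}")              # Candidates: 25
print(f"Cross-validation fits: {n_fits}")         # Cross-validation fits: 125
```

With `refit=True`, one more fit follows on the whole training set using the best combination, so the cost of a grid search grows multiplicatively with every hyperparameter added.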
