1. What is a Decision Tree, and how does it work?


A decision tree is a flowchart-like structure used in machine learning and decision analysis to predict outcomes or make decisions by recursively splitting data based on attributes. It works by starting with a root node, branching out based on conditions (internal nodes), and ultimately reaching leaf nodes representing the final prediction or decision.

2. What are impurity measures in Decision Trees?


In decision trees, impurity measures quantify how "mixed" the classes are within a node, with common metrics including Gini impurity and entropy, used to guide splitting decisions and build the tree.

3. What is the mathematical formula for Gini Impurity?


The mathematical formula for Gini Impurity, used to measure the impurity of a set of data in decision trees, is 1 - Σ (p<sub>i</sub><sup>2</sup>), where p<sub>i</sub> represents the probability of an instance belonging to class i.
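
As a quick check of the formula, the sketch below computes Gini impurity from per-class counts. The helper name `gini_from_counts` and the example counts are illustrative assumptions, not part of the original notebook.

```python
def gini_from_counts(class_counts):
    """Return 1 - sum(p_i^2) for a node with the given per-class counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0  # convention chosen here: an empty node counts as pure
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


pure_node = gini_from_counts([10, 0])      # every sample in one class
balanced_node = gini_from_counts([5, 5])   # two equally likely classes
three_way = gini_from_counts([4, 4, 4])    # three equally likely classes
print(pure_node, balanced_node, three_way)
```

A pure node scores 0, and the score grows as the classes become more evenly mixed.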

4. What is the mathematical formula for Entropy?


For a node whose samples fall into classes with proportions p<sub>i</sub>, entropy is H = - Σ p<sub>i</sub> log<sub>2</sub>(p<sub>i</sub>), where the sum runs over all classes and terms with p<sub>i</sub> = 0 are taken as 0. Entropy is 0 for a pure node and largest when all classes are equally likely.

5. What is Information Gain, and how is it used in Decision Trees?


Information Gain measures the reduction in entropy (uncertainty) of a dataset after splitting it based on a particular feature, and it's used in decision trees to determine the best feature to split on at each node.
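
The calculation can be sketched in a few lines: information gain is the parent node's entropy minus the size-weighted entropy of its children. The function names and the example split below are hypothetical and only illustrate the arithmetic.

```python
import math


def entropy(class_counts):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)) over non-empty classes."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n_parent = sum(parent_counts)
    weighted_child_entropy = sum(
        (sum(child) / n_parent) * entropy(child) for child in children_counts
    )
    return entropy(parent_counts) - weighted_child_entropy


# Hypothetical split: 10 positives and 10 negatives divided into two children
parent = [10, 10]
children = [[8, 2], [2, 8]]
gain = information_gain(parent, children)
print(round(gain, 4))
```

When a tree is built, this quantity is evaluated for each candidate split, and the split with the largest gain is chosen.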

6. What is the difference between Gini Impurity and Entropy?

Gini Impurity:
Definition:
Gini impurity measures the probability of incorrectly classifying a randomly chosen element if it were randomly labeled according to the class distribution in the dataset.
Calculation:
It's calculated as 1 minus the sum of the squared probabilities of each class.
Range:
For two classes, Gini impurity ranges from 0 to 0.5, with 0 representing a pure node (all data points belong to the same class) and 0.5 representing an even split. With k classes, the maximum is 1 - 1/k.
Computational Complexity:
Gini impurity is computationally simpler and faster to calculate than entropy because it involves simple arithmetic operations rather than logarithms.
Use in Decision Trees:
Gini impurity is used by the CART (Classification and Regression Tree) algorithm for classification trees.




Entropy:
Definition: Entropy, rooted in information theory, measures the amount of uncertainty or randomness in a dataset.
Calculation: Entropy is calculated as the negative sum of the probabilities of each class multiplied by the logarithm of their probabilities.
Range: For two classes and a base-2 logarithm, entropy ranges from 0 to 1, with 0 representing a pure node and 1 representing an even split. With k classes, the maximum is log<sub>2</sub>(k).
Computational Complexity: Entropy is more computationally intensive because it involves logarithms.
Use in Decision Trees: Entropy is used in algorithms like C4.5.

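7. What is the mathematical explanation behind Decision Trees?


Decision trees are built by greedy recursive partitioning. At each node, the algorithm scores candidate splits with a splitting criterion (e.g., Gini impurity or information gain) and chooses the split that most reduces the size-weighted impurity of the child nodes. Repeating this on each child until a stopping rule is met yields a tree structure that predicts outcomes.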

8. What is Pre-Pruning in Decision Trees?

Pre-pruning, also known as early stopping, prevents overfitting by halting the tree's growth before it is fully grown. Typical stopping rules limit the maximum depth, require a minimum number of samples to split a node or form a leaf, or require a minimum impurity decrease.

9. What is Post-Pruning in Decision Trees?



Post-pruning in decision trees involves removing branches or nodes after the tree has been fully grown, aiming to simplify the model and improve its generalization ability by preventing overfitting.

10. What is the difference between Pre-Pruning and Post-Pruning?


Pre-pruning (early stopping) halts tree growth while the tree is being built, whereas post-pruning (for example, reduced error pruning or cost-complexity pruning) grows the full tree first and then removes branches to improve generalization.
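
The contrast can be sketched with scikit-learn, where growth limits such as `max_depth` act as pre-pruning and `ccp_alpha` applies cost-complexity post-pruning. The Iris data, the split, the specific limits, and the choice of a mid-path alpha are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

# Pre-pruning: growth is capped while the tree is being built
pre_pruned = DecisionTreeClassifier(max_depth=2, min_samples_leaf=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then collapse weak branches via ccp_alpha
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
# Arbitrary mid-path alpha, clipped at 0 to guard against rounding noise
alpha = max(0.0, path.ccp_alphas[len(path.ccp_alphas) // 2])
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
post_pruned.fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre", pre_pruned), ("post", post_pruned)]:
    print(name, model.get_n_leaves(), round(model.score(X_test, y_test), 3))
```

In practice the alpha would usually be chosen by cross-validation, not taken from the middle of the pruning path.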

11. What is a Decision Tree Regressor?




A Decision Tree Regressor is a supervised machine learning algorithm that uses a tree-like model to predict continuous (numeric) target variables by recursively splitting data based on features that best reduce prediction error.

12. What are the advantages and disadvantages of Decision Trees?




Advantages:
Interpretability:
Decision trees are easy to understand and visualize, making it simpler to explain their predictions.
Minimal Data Preparation:
They typically need less preprocessing than many other algorithms: features do not have to be scaled, and some implementations can handle missing values directly.
Handles Both Numerical and Categorical Data:
Decision trees can effectively process both types of data without requiring separate handling.
Feature Selection and Importance:
Decision trees can identify and rank the importance of different features, helping to understand which factors are most influential in making predictions.
Non-Parametric:
They don't make assumptions about the underlying data distribution, making them versatile for various datasets.
Robust to Outliers:
Decision trees are less influenced by outliers compared to some other algorithms.
Can Handle Imbalanced Data:
Decision trees can be applied to classification tasks with imbalanced datasets, where one class significantly outnumbers the others, although they usually need class weighting or resampling to avoid favoring the majority class.




Disadvantages:


Overfitting:
Decision trees can become overly complex and learn the training data too well, leading to poor generalization on unseen data.
High Variance:
Small changes in the training data can lead to significant changes in the tree structure and predictions, resulting in high variance.
Bias Towards Dominant Classes:
In classification tasks with imbalanced datasets, decision trees can be biased towards the dominant class, leading to poor predictive performance for minority classes.
Can Be Computationally Expensive:
Building and using large decision trees can be computationally intensive, especially for large datasets.
Difficult to Prune Effectively:
Finding the optimal size for a pruned tree can be challenging, and improper pruning can lead to underfitting.

13. How does a Decision Tree handle missing values?




Decision trees can handle missing values in several ways, including using surrogate splits, treating missing values as a separate category, or distributing instances with missing values to child nodes.

14. How does a Decision Tree handle categorical features?



Decision trees can handle categorical features, either through direct processing or by first encoding them numerically using methods like one-hot encoding or label encoding, depending on the algorithm and implementation.
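
A minimal sketch of the encoding route, assuming scikit-learn's `DecisionTreeClassifier` (which expects numeric inputs) and a made-up toy table; the column names and values are hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data with one categorical and one numeric feature
toy = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red", "blue"],
    "size_cm": [1.0, 2.5, 3.0, 2.0, 1.5, 3.5],
    "label": [0, 1, 1, 1, 0, 1],
})

# One-hot encode the categorical column so every input is numeric
X_cat = pd.get_dummies(toy[["color", "size_cm"]], columns=["color"], dtype=float)
y_cat = toy["label"]

cat_tree = DecisionTreeClassifier(random_state=0).fit(X_cat, y_cat)
print(list(X_cat.columns))
print(cat_tree.predict(X_cat))
```

One-hot encoding avoids implying an order between categories, which label encoding would introduce.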

15. What are some real-world applications of Decision Trees?




Decision trees find applications across diverse fields, including healthcare (diagnosing conditions), finance (loan approval, fraud detection), and business (marketing, customer segmentation), by helping to analyze data, make predictions, and simplify complex decision-making processes.

**Practical**

16. Write a Python program to train a Decision Tree Classifier on the Iris dataset and print the model accuracy.

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt






def importdata():
    balance_data = pd.read_csv(
        'https://archive.ics.uci.edu/ml/machine-learning-' +
        'databases/balance-scale/balance-scale.data',
        sep=',', header=None)

    # Displaying dataset information
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)
    print("Dataset: ", balance_data.head())

    return balance_data
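
A minimal sketch of the task as stated, assuming scikit-learn's bundled Iris data (`load_iris`), an 80/20 stratified split, and `random_state=42`. These choices and the name `iris_clf` are illustrative and not taken from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

# Train with default settings; random_state only fixes tie-breaking
iris_clf = DecisionTreeClassifier(random_state=42)
iris_clf.fit(X_train, y_train)

# Accuracy on the held-out 20% of the data
accuracy = accuracy_score(y_test, iris_clf.predict(X_test))
print(f"Accuracy: {accuracy:.3f}")
```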


In [None]:
# 17. Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the
#feature importances




# Simulating some data
n = 1000

shares_y1 = [
    0,
    0.1,
    0.2,
    0.3,
    0.4,
    0.5,
    0.6,
    0.7,
    0.8,
    0.9,
    1
]

y1_counts = [x * n for x in shares_y1]
y2_counts = [n - x for x in y1_counts]

y = list(zip(y1_counts, y2_counts))



y



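# Getting the GINI impurities for such data.
# gini_impurity is a standalone helper for two-class counts; it is assumed to
# match the undefined Node.GINI_impurity that this cell originally called.
def gini_impurity(y1_count, y2_count):
    total = y1_count + y2_count
    if total == 0:
        return 0.0
    p1 = y1_count / total
    p2 = y2_count / total
    return 1 - (p1 ** 2 + p2 ** 2)


ginis = [gini_impurity(x[0], x[1]) for x in y]

plt.plot(shares_y1, ginis, '-o')
plt.xlabel("Share of first class in the whole dataset")
plt.ylabel("GINI impurity")
plt.show()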
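
For the feature-importance part of the question, a short sketch follows. It assumes the Iris data and fits on the full dataset, since only the importances are inspected; the name `gini_tree` is hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# criterion="gini" is scikit-learn's default, set explicitly for clarity
gini_tree = DecisionTreeClassifier(criterion="gini", random_state=42)
gini_tree.fit(iris.data, iris.target)

# Importances are normalized impurity reductions and sum to 1
for name, importance in zip(iris.feature_names, gini_tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```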

In [None]:
#18.Write a Python program to train a Decision Tree Classifier using Entropy as the splitting criterion and print the
#model accuracy.



def train_using_entropy(X_train, X_test, y_train):

    # Decision tree with entropy
    clf_entropy = DecisionTreeClassifier(
        criterion="entropy", random_state=100,
        max_depth=3, min_samples_leaf=5)

    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy
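
To complete the task with an accuracy figure, the sketch below trains an entropy-based tree with the same hyperparameters as above and scores it on a held-out split. The Iris data and the 70/30 split are assumptions; the block is self-contained and does not call `train_using_entropy`.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=100
)

# Same settings as the function above, with entropy as the split criterion
entropy_clf = DecisionTreeClassifier(
    criterion="entropy", random_state=100, max_depth=3, min_samples_leaf=5
)
entropy_clf.fit(X_train, y_train)

entropy_accuracy = accuracy_score(y_test, entropy_clf.predict(X_test))
print(f"Accuracy (entropy): {entropy_accuracy:.3f}")
```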

In [None]:
#19.Write a Python program to train a Decision Tree Regressor on a housing dataset and evaluate using Mean
#Squared Error (MSE).



from sklearn.metrics import mean_squared_error

# Given values
Y_true = [1,1,2,2,4]  # Y_true = Y (original values)

# calculated values
Y_pred = [0.6,1.29,1.99,2.69,3.4]  # Y_pred = Y'

# Calculation of Mean Squared Error (MSE)
mean_squared_error(Y_true,Y_pred)

0.21606
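
A sketch of the regressor itself, evaluated with MSE. To stay runnable offline it uses synthetic housing-style data, so every feature, coefficient, and price below is made up; a real housing table could be swapped in for `X_house` and `price`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a housing table (all values are made up)
rng = np.random.default_rng(0)
n_homes = 500
area_sqm = rng.uniform(40, 250, n_homes)
bedrooms = rng.integers(1, 6, n_homes)
age_years = rng.uniform(0, 60, n_homes)
price = (1500 * area_sqm + 10000 * bedrooms - 800 * age_years
         + rng.normal(0, 15000, n_homes))

X_house = np.column_stack([area_sqm, bedrooms, age_years])
Xh_train, Xh_test, yh_train, yh_test = train_test_split(
    X_house, price, test_size=0.2, random_state=0
)

# max_depth=5 is an arbitrary cap to limit overfitting
regressor = DecisionTreeRegressor(max_depth=5, random_state=0)
regressor.fit(Xh_train, yh_train)

mse = mean_squared_error(yh_test, regressor.predict(Xh_test))
# Baseline: always predict the mean training price
baseline_mse = mean_squared_error(yh_test, np.full_like(yh_test, yh_train.mean()))
print(f"Tree MSE: {mse:.0f}  (mean-only baseline: {baseline_mse:.0f})")
```

Comparing against the mean-only baseline gives the raw MSE value some context.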

In [None]:
#20.Write a Python program to train a Decision Tree Classifier and visualize the tree using graphviz



%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from sklearn import tree


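data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df.head()

# Feature and class names used to label the exported tree
fn = data.feature_names
cn = list(data.target_names)

# Fit the classifier to draw; default settings and random_state=0 are assumptions,
# since clf, fn and cn were not defined anywhere in this notebook
clf = DecisionTreeClassifier(random_state=0)
clf.fit(data.data, data.target)

# Write the tree in DOT format; render with e.g. `dot -Tpng tree.dot -o tree.png`
tree.export_graphviz(clf,
                     out_file="tree.dot",
                     feature_names=fn,
                     class_names=cn,
                     filled=True)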


"""
tree.export_graphviz(clf,
                     out_file="treeRotated.dot",
                     feature_names = fn,
                     class_names=cn,
                     rotate = True,
                     filled = True)
"""



In [24]:
#21.Write a Python program to train a Decision Tree Classifier with a maximum depth of 3 and compare its
#accuracy with a fully grown tree.



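# Build `scores` with a cross-validated search over max_depth.
# Assumption: Iris data and depths 1-10 stand in for the undefined `scores`.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=100)

depth_search = GridSearchCV(DecisionTreeClassifier(random_state=100),
                            {"max_depth": range(1, 11)},
                            cv=5, scoring="accuracy",
                            return_train_score=True)
depth_search.fit(X_train, y_train)
scores = depth_search.cv_results_

plt.figure()
plt.plot(scores["param_max_depth"],
         scores["mean_train_score"],
         label="training accuracy")
plt.plot(scores["param_max_depth"],
         scores["mean_test_score"],
         label="cross-validation accuracy")
plt.xlabel("max_depth")
plt.ylabel("Accuracy")
plt.legend()
plt.show()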
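
The direct comparison the question asks for can be sketched as follows, assuming the Iris data and an 80/20 stratified split; the names `shallow_tree` and `full_tree` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

# Depth-limited tree versus a tree grown until its leaves are pure
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

shallow_acc = shallow_tree.score(X_test, y_test)
full_acc = full_tree.score(X_test, y_test)
print(f"max_depth=3: depth {shallow_tree.get_depth()}, accuracy {shallow_acc:.3f}")
print(f"fully grown: depth {full_tree.get_depth()}, accuracy {full_acc:.3f}")
```

On a small, clean dataset like Iris the two scores are often close; the gap tends to matter more on noisier data.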

In [None]:
#22.Write a Python program to train a Decision Tree Classifier using min_samples_split=5 and compare its
#accuracy with a default tree.


# GridSearchCV to find the optimal min_samples_leaf
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: an Iris split stands in for the undefined X_train and y_train
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=100)

# specify number of folds for k-fold CV
n_folds = 5

# parameters to build the model on
parameters = {'min_samples_leaf': range(1, 40, 3)}

# instantiate the model
dtree = DecisionTreeClassifier(criterion="gini",
                               random_state=100)

# fit the grid search on training data; the name tree_cv avoids shadowing sklearn.tree
tree_cv = GridSearchCV(dtree, parameters,
                       cv=n_folds,
                       scoring="accuracy", return_train_score=True)
tree_cv.fit(X_train, y_train)

# scores of GridSearch CV
scores = tree_cv.cv_results_
pd.DataFrame(scores).head()
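
A sketch of the comparison named in the question, with `min_samples_split=5` against the default of 2. The Iris data, the split, and the variable names are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

# Default tree: a node may be split as soon as it holds 2 samples
default_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
# Constrained tree: a node needs at least 5 samples before it can be split
split5_tree = DecisionTreeClassifier(min_samples_split=5, random_state=42).fit(X_train, y_train)

default_acc = accuracy_score(y_test, default_tree.predict(X_test))
split5_acc = accuracy_score(y_test, split5_tree.predict(X_test))
print(f"default:             {default_acc:.3f} ({default_tree.get_n_leaves()} leaves)")
print(f"min_samples_split=5: {split5_acc:.3f} ({split5_tree.get_n_leaves()} leaves)")
```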

In [None]:
#23.Write a Python program to apply feature scaling before training a Decision Tree Classifier and compare its
#accuracy with unscaled data




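from io import StringIO

import pydotplus
from IPython.display import Image
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Assumption: the Iris data and this split stand in for the features,
# X_train and X_test variables that were not defined for this cell.
iris = load_iris()
features = iris.feature_names
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=100)

iristree1 = DecisionTreeClassifier(criterion="entropy",
                                   random_state=100,
                                   max_depth=4,
                                   min_samples_leaf=3,
                                   min_samples_split=2)
iristree1.fit(X_train, y_train)

# accuracy score on the held-out split
print(iristree1.score(X_test, y_test))

# plotting the tree (requires the pydotplus package and a Graphviz install)
dot_data = StringIO()
export_graphviz(iristree1, out_file=dot_data, feature_names=features,
                filled=True, rounded=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())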
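
For the scaling comparison in the question, a minimal sketch using `StandardScaler` follows; the Iris data, the split, and the variable names are assumptions. Because a tree splits on per-feature thresholds, rescaling a feature generally does not change which samples fall on each side of a split, so the two accuracies are expected to match or differ only marginally.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

# Fit the scaler on the training split only to avoid leaking test statistics
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

unscaled_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
scaled_tree = DecisionTreeClassifier(random_state=42).fit(X_train_scaled, y_train)

unscaled_acc = unscaled_tree.score(X_test, y_test)
scaled_acc = scaled_tree.score(X_test_scaled, y_test)
print(f"unscaled: {unscaled_acc:.3f}")
print(f"scaled:   {scaled_acc:.3f}")
```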

In [None]:
#29 Write a Python program to train a Decision Tree Classifier and visualize the confusion matrix using seaborn?






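import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: Iris stands in for the X and y this cell used without defining them
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the model
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
# Generate predictions for the held-out data
y_pred = clf.predict(X_test)
# Get the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)
print(cf_matrix)

# Draw the confusion matrix as an annotated heatmap
sns.heatmap(cf_matrix, annot=True)
plt.show()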

In [None]:
#30. Write a Python program to train a Decision Tree Classifier and use GridSearchCV to find the optimal values
#for max_depth and min_samples_split.


from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Define the Decision Tree model
model = DecisionTreeClassifier()

# Create the hyperparameter grid
param_grid = {
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

# Set up the GridSearchCV with cross-validation
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the GridSearchCV model
grid_search.fit(X_train, y_train)

# Print the best parameters and best accuracy
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Accuracy:", grid_search.best_score_)