##Assignment Question

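1. What is a Decision Tree, and how does it work?

=> A Decision Tree is a non-parametric supervised learning algorithm that can be used for both classification and regression tasks. It works by recursively splitting the data on feature values, choosing at each node the split that most reduces impurity, until a stopping condition is met. The result is a tree-like structure in which each leaf gives a prediction.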

2. What are impurity measures in Decision Trees?


=> Impurity measures in Decision Trees are used to determine how homogeneous a set of data is with respect to the target variable. In other words, they quantify how mixed the classes are within a node. The goal of the decision tree algorithm is to find splits that reduce the impurity of the resulting child nodes.

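3. What is the mathematical formula for Gini Impurity?

=> Gini Impurity: $Gini = 1 - \sum_{i=1}^{C} p(i)^2$, or equivalently $\sum_{i=1}^{C} p(i) (1 - p(i))$, where $C$ is the number of classes and $p(i)$ is the proportion of samples belonging to class $i$ in the node. It measures the probability of misclassifying a randomly chosen element from the node if it were randomly labeled according to the distribution of classes in the node. A Gini impurity of 0 means the node is perfectly pure (all samples belong to the same class).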
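
As a rough illustration, the Gini impurity of a node can be computed directly from its class labels. This is a minimal sketch in plain Python; the function name `gini_impurity` and the toy label lists are assumptions made for the example, not part of any library.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p(i)^2) over the class proportions found in `labels`."""
    total = len(labels)
    if total == 0:
        return 0.0  # convention chosen here: an empty node counts as pure
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

# Toy nodes (illustrative labels only)
pure_node = ["a", "a", "a", "a"]
mixed_node = ["a", "a", "b", "b"]

print(gini_impurity(pure_node))   # 0.0 (all samples in one class)
print(gini_impurity(mixed_node))  # 0.5 (two classes, evenly mixed)
```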

4. What is the mathematical formula for Entropy?

=> Entropy: $Entropy = -\sum_{i=1}^{C} p(i) \log_2 p(i)$, where $C$ is the number of classes and $p(i)$ is the proportion of samples belonging to class $i$ in the node. It is a measure of the randomness or disorder in a set of data. In the context of decision trees, it measures the uncertainty in a node based on the distribution of classes. A lower entropy value indicates less uncertainty and a purer node.
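
The entropy of a node can be sketched the same way. The helper name `entropy` and the toy labels below are assumptions for illustration; the code sums $p(i) \log_2(1/p(i))$, which equals $-p(i) \log_2 p(i)$.

```python
import math
from collections import Counter

def entropy(labels):
    """Return the entropy (in bits) of the class distribution in `labels`."""
    total = len(labels)
    if total == 0:
        return 0.0  # convention chosen here: an empty node has no uncertainty
    proportions = [n / total for n in Counter(labels).values()]
    # p * log2(1 / p) is the same quantity as -p * log2(p)
    return sum(p * math.log2(1 / p) for p in proportions)

# Toy nodes (illustrative labels only)
print(entropy(["a", "a", "a", "a"]))  # 0.0 (no uncertainty)
print(entropy(["a", "a", "b", "b"]))  # 1.0 (maximum for two classes)
```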

5. What is Information Gain, and how is it used in Decision Trees?

=> Information Gain is a metric used in Decision Trees to determine the effectiveness of a split. It quantifies how much the entropy (or impurity) of the data is reduced after splitting a node on a particular feature: the entropy of the parent node minus the weighted average entropy of the child nodes. At each node, the algorithm chooses the split with the highest Information Gain.
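
A small sketch may make the calculation concrete: Information Gain is the parent's entropy minus the size-weighted entropy of the children. The function names and the toy "yes"/"no" labels are assumptions made for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of the class distribution in `labels`."""
    total = len(labels)
    if total == 0:
        return 0.0
    proportions = [n / total for n in Counter(labels).values()]
    return sum(p * math.log2(1 / p) for p in proportions)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Toy data: 4 "yes" and 4 "no" samples in the parent node
parent = ["yes"] * 4 + ["no"] * 4
perfect_split = [["yes"] * 4, ["no"] * 4]
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(information_gain(parent, perfect_split))  # 1.0 (all uncertainty removed)
print(information_gain(parent, useless_split))  # 0.0 (children as mixed as the parent)
```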

6. What is the difference between Gini Impurity and Entropy?

=> Both Gini Impurity and Entropy are impurity measures used in Decision Trees to evaluate the homogeneity of a node. While they serve the same purpose, there are some key differences:

Mathematical Formula:

* Gini Impurity: $\sum_{i=1}^{C} p(i) (1 - p(i))$, where $C$ is the number of classes and $p(i)$ is the proportion of samples belonging to class $i$ in the node.
* Entropy: $-\sum_{i=1}^{C} p(i) \log_2 p(i)$, with the same $C$ and $p(i)$.

Range: For a two-class node, Gini Impurity ranges from 0 to 0.5, while Entropy ranges from 0 to 1.

Computation: Gini Impurity avoids the logarithm, so it is slightly cheaper to compute. In practice, the two measures usually lead to very similar trees.
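
To compare the two measures side by side, the sketch below evaluates both for a two-class node as the proportion `p` of one class varies. The function names are assumptions for illustration; for two classes, $\sum p(i)(1 - p(i))$ simplifies to $2p(1 - p)$.

```python
import math

def binary_gini(p):
    """Gini impurity of a two-class node where one class has proportion p."""
    return 2 * p * (1 - p)

def binary_entropy(p):
    """Entropy (in bits) of a two-class node where one class has proportion p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node has zero entropy
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Both measures are 0 for a pure node and peak at p = 0.5
for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p={p:.1f}  gini={binary_gini(p):.3f}  entropy={binary_entropy(p):.3f}")
```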

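7. What is the mathematical explanation behind Decision Trees?

=> These models work by splitting the data into subsets based on feature values. Each split acts as a decision rule, and each leaf node gives a prediction. Mathematically, at each node the algorithm evaluates candidate splits and picks the one that most reduces an impurity measure (Gini Impurity or Entropy for classification, variance or MSE for regression), weighting each child node by its share of the samples. Repeating this recursively creates a tree-like structure that is easy to interpret and visualize.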
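
The split search at a single node can be sketched as follows. This is a simplified illustration for one numeric feature using weighted Gini impurity, not the implementation used by any particular library; the function names and the toy values are assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`."""
    distinct = sorted(set(values))
    best = (None, float("inf"))
    # Candidate thresholds are midpoints between consecutive distinct values
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Toy feature values and labels (illustrative only)
feature = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
target = ["A", "A", "A", "B", "B", "B"]

threshold, score = best_threshold(feature, target)
print(threshold, score)  # 3.0 0.0 (the midpoint between 1.5 and 4.5 separates the classes)
```

A full tree repeats this search over every feature at every node and then recurses into the two child subsets.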

8. What is Pre-Pruning in Decision Trees?

=> There are two main types of decision tree pruning: Pre-Pruning and Post-Pruning. Pre-pruning stops the growth of the tree before it becomes too complex, for example by limiting the maximum depth or by requiring a minimum number of samples to split a node.

9. What is Post-Pruning in Decision Trees?

=> Post-pruning consists of constructing the full tree and then eliminating nodes (or whole subtrees) that do not contribute significantly to predictive power. This is usually accomplished by methods such as cost-complexity pruning.

10. What is the difference between Pre-Pruning and Post-Pruning?

=> Pre-pruning prunes the tree during its construction. At each node, the method assesses whether dividing the node further will improve overall performance on the validation data. If not, the node is made a leaf without further division.

Post-pruning constructs the full tree first and then eliminates nodes that do not contribute significantly to predictive power. This is usually accomplished by methods such as cost-complexity pruning.
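
The contrast can be seen by comparing tree sizes in scikit-learn. In this sketch, pre-pruning is expressed through growth limits (`max_depth`, `min_samples_leaf`) and post-pruning through `ccp_alpha`; the specific values 2, 5, and 0.02 are arbitrary assumptions chosen for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unrestricted tree: grows until every leaf is pure
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruning: growth is stopped early by limits set before training
pre_pruned = DecisionTreeClassifier(
    max_depth=2, min_samples_leaf=5, random_state=0
).fit(X, y)

# Post-pruning: the tree is grown fully, then cut back by cost-complexity pruning
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

for name, model in [("full", full), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name}: nodes={model.tree_.node_count}, depth={model.get_depth()}")
```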

11. What is a Decision Tree Regressor?

=> A Decision Tree Regressor is a machine learning model used for predicting continuous values. It works by splitting the data into subsets based on the values of the input features, creating a tree-like structure where each internal node represents a decision based on a feature, and each leaf node represents a predicted value.
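
To make the idea of a leaf holding a predicted value concrete, here is a sketch of a one-split regression tree (a "stump") in plain Python. It picks the threshold that minimizes the total squared error and predicts the mean target of each side. The function names and toy data are assumptions for illustration, and the code assumes at least two distinct feature values.

```python
def mean(values):
    return sum(values) / len(values)

def squared_error(values):
    """Sum of squared deviations from the mean of `values`."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def fit_stump(xs, ys):
    """Return (threshold, left_value, right_value) minimizing total squared error."""
    distinct = sorted(set(xs))
    best = None
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        error = squared_error(left) + squared_error(right)
        if best is None or error < best[0]:
            # Each leaf predicts the mean target of the samples it contains
            best = (error, threshold, mean(left), mean(right))
    return best[1:]

def predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

# Toy data (illustrative only): two clearly separated groups
xs = [1, 2, 3, 10, 11, 12]
ys = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]

stump = fit_stump(xs, ys)
print(stump)               # (6.5, 11.0, 31.0)
print(predict(stump, 2))   # 11.0
print(predict(stump, 50))  # 31.0
```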

12. What are the advantages and disadvantages of Decision Trees?

=> Advantages:

* Easy to understand and interpret: Decision trees provide a clear visual representation of the decision-making process, making them easy to understand for both technical and non-technical audiences.
* Can handle both numerical and categorical data: Decision trees can handle different types of data without requiring extensive preprocessing.
* Require little data preparation: Unlike some other algorithms, decision trees do not require feature scaling or normalization.
* Non-parametric: They don't make assumptions about the underlying data distribution.
* Can model non-linear relationships: Decision trees can capture complex non-linear relationships between features and the target variable.

Disadvantages:

* Prone to overfitting: Decision trees can easily overfit the training data, especially when they are grown to a large depth. This can lead to poor performance on unseen data.
* Instability: Small changes in the data can lead to significant changes in the tree structure.
* Bias towards features with more levels: Decision trees may favor features with a larger number of categories or continuous values, as they can potentially create more splits and thus appear to offer higher Information Gain.
* Can create biased trees: If there is a dominant class in the dataset, the tree may become biased towards that class.
* Difficult to handle missing values: While some strategies exist, handling missing values can be challenging in decision trees.
* Computational cost: Building a large decision tree can be computationally expensive, especially for large datasets.

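13. How does a Decision Tree handle missing values?

=> Handling missing values in Decision Trees can be approached in a few ways, although it is not always straightforward or built into every Decision Tree implementation. Common approaches include imputing the missing values before training (for example with the mean, median, or most frequent value), treating "missing" as its own category, and using surrogate splits, as in the original CART algorithm.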
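
One implementation-independent approach is to fill in missing values before the tree sees them. The sketch below assumes scikit-learn and uses mean imputation purely as an example strategy; the toy array is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data with missing entries marked as NaN (illustrative only)
X = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [7.0, np.nan],
    [8.0, 9.0],
    [2.0, 1.0],
    [9.0, 8.0],
])
y = np.array([0, 0, 1, 1, 0, 1])

# The imputer replaces each NaN with its column mean before the tree is trained
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    DecisionTreeClassifier(random_state=0),
)
model.fit(X, y)

# The same imputation is applied automatically at prediction time
predictions = model.predict(np.array([[np.nan, 8.5]]))
print(predictions)
```

Putting the imputer inside the pipeline keeps the fill values tied to the training data, so no information from new samples leaks into them.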

14. How does a Decision Tree handle categorical features?

=> Decision Trees handle categorical features by considering the different categories as potential splits. The way this is done can vary slightly depending on whether the categorical feature is nominal (categories with no inherent order) or ordinal (categories with a natural order).
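
In scikit-learn specifically, tree estimators expect numeric input, so categorical columns are typically encoded first. The sketch below assumes a toy dataset with one nominal column (color) and one ordinal column (size); the column meanings, category order, and labels are all made up for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Column 0: nominal (no order). Column 1: ordinal (small < medium < large).
X = np.array([
    ["red", "small"],
    ["blue", "large"],
    ["green", "medium"],
    ["red", "large"],
], dtype=object)
y = [0, 1, 0, 1]

encoder = ColumnTransformer([
    # Nominal: one binary column per category, so no order is implied
    ("nominal", OneHotEncoder(handle_unknown="ignore"), [0]),
    # Ordinal: integers that preserve the stated category order
    ("ordinal", OrdinalEncoder(categories=[["small", "medium", "large"]]), [1]),
])

model = make_pipeline(encoder, DecisionTreeClassifier(random_state=0))
model.fit(X, y)
print(model.predict(X))
```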

15. What are some real-world applications of Decision Trees?

=> Decision Trees are a versatile algorithm with applications in various real-world scenarios across different domains. Here are some examples:

Medical Diagnosis: Decision trees can be used to assist in diagnosing diseases based on patient symptoms and medical history. The tree can help doctors make decisions about potential diagnoses by guiding them through a series of questions.

Credit Risk Assessment: Financial institutions use decision trees to assess the creditworthiness of loan applicants. The tree can evaluate factors like income, debt-to-income ratio, and credit history to predict the likelihood of default.

##Practical Questions

16. Write a Python program to train a Decision Tree Classifier on the Iris dataset and print the model accuracy

In [None]:
# Code for the above question.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier()

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print("Model Accuracy:", accuracy)

17. Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances


In [None]:
# Code for the above question.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(criterion='gini')

clf.fit(X_train, y_train)

feature_importances = clf.feature_importances_

for feature, importance in zip(iris.feature_names, feature_importances):
    print(f"{feature}: {importance}")

18. Write a Python program to train a Decision Tree Classifier using Entropy as the splitting criterion and print the model accuracy


In [None]:
# Code for the above question.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(criterion='entropy')

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print("Model Accuracy:", accuracy)


19. Write a Python program to train a Decision Tree Regressor on a housing dataset and evaluate using Mean Squared Error (MSE)


In [None]:
# Code for the above question.
# Uses the California housing dataset (load_boston is no longer available in scikit-learn >= 1.2).
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

housing = fetch_california_housing()
X = housing.data
y = housing.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

regressor = DecisionTreeRegressor()

regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error:", mse)


20. Write a Python program to train a Decision Tree Classifier and visualize the tree using graphviz

In [1]:
# Code for the above question.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split
import graphviz

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier()

clf.fit(X_train, y_train)

dot_data = export_graphviz(clf, out_file=None,
                           feature_names=iris.feature_names,
                           class_names=iris.target_names,
                           filled=True, rounded=True,
                           special_characters=True)
graph = graphviz.Source(dot_data)
graph.render("iris_decision_tree")
graph.view()


'iris_decision_tree.pdf'

21. Write a Python program to train a Decision Tree Classifier with a maximum depth of 3 and compare its accuracy with a fully grown tree


In [None]:
# Code for the above question.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf_full_tree = DecisionTreeClassifier()
clf_limited_depth = DecisionTreeClassifier(max_depth=3)

clf_full_tree.fit(X_train, y_train)
clf_limited_depth.fit(X_train, y_train)


y_pred_full_tree = clf_full_tree.predict(X_test)
y_pred_limited_depth = clf_limited_depth.predict(X_test)

accuracy_full_tree = accuracy_score(y_test, y_pred_full_tree)
accuracy_limited_depth = accuracy_score(y_test, y_pred_limited_depth)

print("Accuracy with Full Tree:", accuracy_full_tree)
print("Accuracy with Limited Depth Tree:", accuracy_limited_depth)

22. Write a Python program to train a Decision Tree Classifier using min_samples_split=5 and compare its accuracy with a default tree


In [None]:
# Code for the above question.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf_default = DecisionTreeClassifier()
clf_min_samples_split_5 = DecisionTreeClassifier(min_samples_split=5)

clf_default.fit(X_train, y_train)
clf_min_samples_split_5.fit(X_train, y_train)

y_pred_default = clf_default.predict(X_test)
y_pred_min_samples_split_5 = clf_min_samples_split_5.predict(X_test)

accuracy_default = accuracy_score(y_test, y_pred_default)
accuracy_min_samples_split_5 = accuracy_score(y_test, y_pred_min_samples_split_5)

print("Accuracy with Default Tree:", accuracy_default)
print("Accuracy with min_samples_split=5:", accuracy_min_samples_split_5)

23. Write a Python program to apply feature scaling before training a Decision Tree Classifier and compare its accuracy with unscaled data


In [None]:
# Code for the above question.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf_unscaled = DecisionTreeClassifier()
clf_scaled = DecisionTreeClassifier()

clf_unscaled.fit(X_train, y_train)
clf_scaled.fit(X_train_scaled, y_train)

y_pred_unscaled = clf_unscaled.predict(X_test)
y_pred_scaled = clf_scaled.predict(X_test_scaled)

accuracy_unscaled = accuracy_score(y_test, y_pred_unscaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

print("Accuracy with Unscaled Data:", accuracy_unscaled)
print("Accuracy with Scaled Data:", accuracy_scaled)

24. Write a Python program to train a Decision Tree Classifier using One-vs-Rest (OvR) strategy for multiclass classification


In [None]:
# Code for the above question.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

base_clf = DecisionTreeClassifier()

ovr_clf = OneVsRestClassifier(base_clf)

ovr_clf.fit(X_train, y_train)

y_pred = ovr_clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print("Model Accuracy with One-vs-Rest strategy:", accuracy)

25. Write a Python program to train a Decision Tree Classifier and display the feature importance scores

In [None]:
# Code for the above question.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier()

clf.fit(X_train, y_train)

feature_importances = clf.feature_importances_

for feature, importance in zip(iris.feature_names, feature_importances):
    print(f"{feature}: {importance}")

26. Write a Python program to train a Decision Tree Regressor with max_depth=5 and compare its performance with an unrestricted tree


In [None]:
# Code for the above question.
# Uses the California housing dataset (load_boston is no longer available in scikit-learn >= 1.2).
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

housing = fetch_california_housing()
X = housing.data
y = housing.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

regressor_full_tree = DecisionTreeRegressor()
regressor_limited_depth = DecisionTreeRegressor(max_depth=5)

regressor_full_tree.fit(X_train, y_train)
regressor_limited_depth.fit(X_train, y_train)

y_pred_full_tree = regressor_full_tree.predict(X_test)
y_pred_limited_depth = regressor_limited_depth.predict(X_test)

mse_full_tree = mean_squared_error(y_test, y_pred_full_tree)
mse_limited_depth = mean_squared_error(y_test, y_pred_limited_depth)

print("Mean Squared Error with Full Tree:", mse_full_tree)
print("Mean Squared Error with Limited Depth Tree:", mse_limited_depth)

27. Write a Python program to train a Decision Tree Classifier, apply Cost Complexity Pruning (CCP), and visualize its effect on accuracy


In [None]:
# Code for the above question.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

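# Candidate alphas are taken from the cost-complexity pruning path of the training data
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas.clip(min=0)  # guard against tiny negative values from rounding
ccp_accuracies = []

for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    clf.fit(X_train, y_train)
    ccp_accuracies.append(clf.score(X_test, y_test))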

plt.plot(ccp_alphas, ccp_accuracies)

plt.xlabel('CCP Alpha')
plt.ylabel('Accuracy')
plt.title('Effect of CCP Alpha on Accuracy')
plt.show()

28. Write a Python program to train a Decision Tree Classifier and evaluate its performance using Precision, Recall, and F1-Score


In [5]:
# Code for the above question.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create an instance of the DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
# Stored as `f1` so the imported f1_score function is not overwritten
f1 = f1_score(y_test, y_pred, average='weighted')


print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)


Precision: 1.0
Recall: 1.0
F1-Score: 1.0


29. Write a Python program to train a Decision Tree Classifier and visualize the confusion matrix using seaborn.

In [None]:
# Code for the above question.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier()

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

cm = confusion_matrix(y_test, y_pred)

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()


30. Write a Python program to train a Decision Tree Classifier and use GridSearchCV to find the optimal values for max_depth and min_samples_split.

In [None]:
# Code for the above question.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

iris = load_iris()
X = iris.data

y = iris.target

param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

clf = DecisionTreeClassifier()

grid_search = GridSearchCV(clf, param_grid, cv=5)

grid_search.fit(X, y)

best_params = grid_search.best_params_

print("Best Parameters:", best_params)
