# **Decision Tree | Assignment**

*Assignment Code: DA-AG-012*

**Question 1: What is a Decision Tree, and how does it work in the context of classification?**

A decision tree is like a flowchart that splits the data step by step using simple rules. At the top is the root node, which branches according to feature values until we reach a leaf node that gives the class label. For example, to classify a fruit as Apple or Orange, the first split can be on color (Red or Orange): if the color is Red, go left; if it is Orange, go right. The next split might check size or sweetness. At the end, the leaf node says "Apple" or "Orange".

In classification, a decision tree works by finding the best question (split) at each step. It scores candidate splits with an impurity measure such as Gini or Entropy and chooses the one that reduces impurity the most, so the data in each branch becomes purer (mostly one class). This continues until a stopping condition is met, such as reaching the maximum depth or having no useful split left.

The main advantage is that it is simple to understand, since it reads like if-else conditions. For example, a doctor can use a decision tree to decide whether a patient has flu: check fever (yes/no), then cough (yes/no), then give the result. The disadvantage is that the tree can grow very complex and overfit the data.

In short, a decision tree for classification is a supervised ML algorithm that learns rules from data and arranges them in a tree structure to predict labels.
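
The if-else feel of a trained tree can be made concrete with a hand-written version of the fruit example. This is only an illustrative sketch: the function name `classify_fruit`, the `sweetness` score, and its threshold of 7 are made up here, whereas a real tree would learn its rules from data.

```python
def classify_fruit(color: str, sweetness: int) -> str:
    """Walk a tiny hand-written tree from the root to a leaf."""
    # Root node: the first question is about color.
    if color == "red":
        return "Apple"  # leaf node
    # Internal node on the other branch: ask a second question.
    # The threshold 7 is a hypothetical value chosen for illustration.
    if sweetness >= 7:
        return "Orange"  # leaf node
    return "Apple"  # leaf node


print(classify_fruit("red", 5))
print(classify_fruit("orange", 8))
```

Prediction is just a walk from the root to a leaf, which is why such a model is easy to explain to non-technical readers.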

**Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?**

When building a decision tree, we need to decide how to split the data at each step. For that we use impurity measures. The two most common are Gini Impurity and Entropy. Both tell us how "mixed" the classes are in a node.

Gini Impurity: It measures how often a randomly chosen sample would be misclassified if we assigned its label at random according to the distribution of labels in the node. The formula is $G = 1 - \sum_i p_i^2$, where $p_i$ is the proportion of class $i$ in the node. If the node is pure (all samples in the same class), Gini = 0. The more mixed the node, the higher the value.

Entropy: This comes from information theory. The formula is $H = -\sum_i p_i \log_2 p_i$. If all samples in the node are the same class, entropy = 0 (no uncertainty). If the classes are equally distributed, entropy is at its maximum.

How they impact the split:
When the tree is splitting, it tries to reduce impurity. It checks each feature and each possible cut point, then calculates the impurity before and after the split. The split that reduces impurity the most is chosen. With Gini, the algorithm picks the split that gives the lowest weighted Gini impurity across the resulting child nodes. With Entropy, it picks the split with the highest information gain (reduction in entropy).

Example: Suppose we classify fruits into Apple and Orange. If a node has 10 apples and 0 oranges, its impurity is 0, so it is perfectly pure. But if a node has 5 apples and 5 oranges, impurity is at its highest for two classes (Gini = 0.5, entropy = 1), so the tree will try to split further, maybe on color, to separate them better.

So Gini and Entropy serve the same purpose: making nodes purer. Gini is slightly cheaper to compute because it avoids the logarithm, while entropy has its roots in information theory. In practice the resulting trees are often similar.
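
Both measures can be checked on the fruit example with a few lines of plain Python. This is a minimal sketch: the helper names `gini` and `entropy` are chosen here for illustration, and they take class counts rather than probabilities.

```python
from math import log2


def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy in bits; classes with zero samples contribute nothing."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


pure_node = [10, 0]   # 10 apples, 0 oranges
mixed_node = [5, 5]   # 5 apples, 5 oranges

print("Pure node :", gini(pure_node), entropy(pure_node))
print("Mixed node:", gini(mixed_node), entropy(mixed_node))
```

For two classes, Gini peaks at 0.5 and entropy at 1 bit, both at the 50/50 mix, so the two measures rank most splits in a similar order.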

**Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**

Decision trees can easily become too big and complicated, which causes overfitting. To avoid that we use pruning. There are two types: pre-pruning and post-pruning.

Pre-pruning means we stop the tree from growing too much during the building process itself. For example, we can set max depth = 5, or require each node to have at least 10 samples before it can be split. Once a node hits one of these limits, it is not split any further. The advantage of pre-pruning is that it saves training time and keeps the tree simple from the start, which also limits overfitting early. Example: if we are building a tree to predict whether a student passes or fails, we can stop splitting when only a few students remain in a branch, because that split won't generalize well.

Post-pruning means we first allow the tree to grow fully (a big tree with many branches) and afterwards cut the branches that are not useful or are too specific. This is usually done by checking performance on a validation set. The advantage of post-pruning is that the tree can explore more possible splits first, and then we remove only the unnecessary ones, so it often gives better accuracy than pre-pruning, at the cost of extra training time. Example: in a medical dataset, the tree may create small branches for only 1-2 patients, and pruning removes those branches since they don't improve accuracy.

In short, pre-pruning stops growth before the tree becomes too complex, while post-pruning cuts down extra complexity after the full tree is built.
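
Both styles can be tried in scikit-learn, as in the rough sketch below. Pre-pruning uses the `max_depth` and `min_samples_split` limits mentioned above. For post-pruning, scikit-learn offers cost-complexity pruning through `ccp_alpha`, which is a different technique from the validation-set branch removal described in the text; here a validation split is used only to pick the pruning strength. The Iris data and the 70/30 split are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: the limits are enforced while the tree is being grown.
pre_pruned = DecisionTreeClassifier(
    max_depth=5, min_samples_split=10, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree first, then prune it back.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative round-off
    candidate = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    score = candidate.fit(X_train, y_train).score(X_val, y_val)
    if score >= best_score:  # on ties, keep the larger alpha (smaller tree)
        best_alpha, best_score = alpha, score

post_pruned = DecisionTreeClassifier(
    ccp_alpha=best_alpha, random_state=42
).fit(X_train, y_train)

print("Nodes in full tree  :", full_tree.tree_.node_count)
print("Nodes in pre-pruned :", pre_pruned.tree_.node_count)
print("Nodes in post-pruned:", post_pruned.tree_.node_count)
```

Note the cost difference: pre-pruning trains one tree, while this post-pruning loop trains one tree per candidate `ccp_alpha`.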

**Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?**

Information Gain is a score that tells us how good a split is. It measures how much "uncertainty" or impurity is removed when we split a node. If the split makes the groups purer (closer to a single class), the information gain is high.

The formula is:

$$\text{Information Gain} = \text{impurity before split} - \sum_{k} w_k \cdot \text{impurity of branch } k$$

where $w_k$ is the weight of branch $k$ (the fraction of the node's samples that fall into it), so the sum is the weighted impurity after the split.

For example, suppose a node has 50% apples and 50% oranges. That node is very mixed, so its impurity is high. If we split on color and get one branch with 100% apples and the other with 100% oranges, the impurity after the split is 0, so the information gain is at its maximum. That's why the tree will choose that split.

Importance: A decision tree has to decide which feature and which value to split on at every step. It doesn't choose randomly. It looks at all possible splits, calculates the information gain for each, and picks the split with the highest gain. This way the tree learns rules that separate the data most efficiently at each step.

Without information gain (or Gini), the tree would have no way of knowing which split is better. It might create useless splits and classify poorly. So information gain is key because it makes sure the tree reduces impurity step by step and becomes more accurate.

Example: If we classify students as pass/fail, splitting on "study hours" might give higher information gain than splitting on "shoe size", so the tree will pick "study hours" as the first rule.
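
The apple/orange example can be worked through numerically. The sketch below uses entropy as the impurity measure. The helper names and the second, deliberately uninformative split are assumptions added for illustration.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a node, given its class counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent impurity minus the sample-weighted impurity of the children."""
    n = sum(parent)
    weighted_after = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted_after


parent = [5, 5]  # 5 apples, 5 oranges

# Split on color: each branch ends up with a single class.
gain_color = information_gain(parent, [[5, 0], [0, 5]])

# Hypothetical uninformative split: both branches stay 50/50.
gain_useless = information_gain(parent, [[3, 3], [2, 2]])

print("Gain from color split  :", gain_color)
print("Gain from useless split:", gain_useless)
```

The tree would prefer the color split because it removes all of the parent's uncertainty, while the second split removes none.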

**Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?**

Decision trees are used in many real-life areas because they are simple and easy to understand. A few common applications are:

- Healthcare: Doctors can use decision trees to predict whether a patient has a disease, for example by checking symptoms like fever, cough, and blood pressure step by step to reach a conclusion.
- Finance: Banks use decision trees to decide whether to approve a loan. They look at features like salary, past credit history, and existing debts.
- Marketing: Companies use decision trees to find which customers are likely to buy a product, for example by splitting customers by age, income, and previous purchase history.
- Education: Universities can use decision trees to predict whether a student will pass or fail based on attendance, assignments, and exam marks.

Advantages:

- Very easy to understand, even for non-technical people. A tree reads like if-else rules, so managers and doctors can follow it.
- No need to scale or normalize the data. It works on raw features directly.
- Can handle both categorical data (like yes/no, male/female) and numerical data (like age, salary).

Limitations:

- Trees can easily overfit, especially when they grow deep and memorize the training data instead of learning patterns.
- Small changes in the data can change the structure of the tree a lot, which means low stability.
- For continuous features, a tree's predictions are not as smooth as those of regression models, because it creates step-like decision boundaries.

So in short, decision trees are very useful because of their simplicity and explainability, but they must be controlled with pruning or ensemble methods (like random forest) to avoid overfitting.
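
The "no scaling needed" advantage can be checked with a small experiment: train one tree on raw features and one on standardized features, then compare their predictions. This is a rough sketch that assumes the Iris data and a 70/30 split. Because standardization preserves the ordering of values within each feature, the two trees are expected to make the same predictions here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Tree trained on the raw feature values
raw_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Tree trained on standardized features (scaler fitted on training data only)
scaler = StandardScaler().fit(X_train)
scaled_tree = DecisionTreeClassifier(random_state=42).fit(
    scaler.transform(X_train), y_train
)

raw_pred = raw_tree.predict(X_test)
scaled_pred = scaled_tree.predict(scaler.transform(X_test))

# Fraction of test samples on which the two trees agree
agreement = (raw_pred == scaled_pred).mean()
print("Share of identical predictions:", agreement)
```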

**Question 6: Write a Python program to: Load the Iris Dataset, Train a Decision Tree Classifier using the Gini criterion, Print the model's accuracy and feature importances**

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris into a DataFrame so the feature names stay attached
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a tree that uses Gini impurity to choose its splits
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Evaluate on the held-out set; importances follow the column order of X
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Feature Importances:", clf.feature_importances_)
```

```text
Accuracy: 1.0
Feature Importances: [0.         0.01911002 0.89326355 0.08762643]
```


**Question 7: Write a Python program to: Load the Iris Dataset, Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.**

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Reuses X_train, X_test, y_train, y_test from the Question 6 split

# Tree limited to three levels of splits
model_depth3 = DecisionTreeClassifier(max_depth=3, random_state=42)
model_depth3.fit(X_train, y_train)
y_pred_depth3 = model_depth3.predict(X_test)
print("Accuracy with depth=3:", accuracy_score(y_test, y_pred_depth3))

# Fully grown tree (no depth limit)
model_full = DecisionTreeClassifier(random_state=42)
model_full.fit(X_train, y_train)
y_pred_full = model_full.predict(X_test)
print("Accuracy with full tree:", accuracy_score(y_test, y_pred_full))
```

```text
Accuracy with depth=3: 1.0
Accuracy with full tree: 1.0
```
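
Both trees score 1.0 on this single 45-sample test split, so the split alone cannot tell them apart. One possible follow-up, sketched here as an assumption and not as part of the original answer, is to compare the two settings with 5-fold cross-validation over the whole dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=42),
    "full tree": DecisionTreeClassifier(random_state=42),
}

# Mean accuracy over 5 folds for each setting
cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: mean CV accuracy = {scores.mean():.3f} (std {scores.std():.3f})")
```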


**Question 8: Write a Python program to: Load the Boston Housing Dataset, Train a Decision Tree Regressor, Print the Mean Squared Error (MSE) and feature importances**

```python
# Note: load_boston was removed in scikit-learn 1.2, so the California
# Housing dataset is used here in place of Boston Housing.
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd

# Load the data into a DataFrame so the feature names stay attached
california = fetch_california_housing()
X_house = pd.DataFrame(california.data, columns=california.feature_names)
y_house = pd.Series(california.target)

# Hold out 30% of the samples for testing
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(X_house, y_house, test_size=0.3, random_state=42)

# Fit an unconstrained regression tree
reg_model = DecisionTreeRegressor(random_state=42)
reg_model.fit(X_train_h, y_train_h)

# Evaluate on the held-out set; importances follow the column order of X_house
y_pred_h = reg_model.predict(X_test_h)
print("Mean Squared Error:", mean_squared_error(y_test_h, y_pred_h))
print("Feature Importances:", reg_model.feature_importances_)
```

```text
Mean Squared Error: 0.5280096503174904
Feature Importances: [0.52345628 0.05213495 0.04941775 0.02497426 0.03220553 0.13901245
 0.08999238 0.08880639]
```


**Question 9: Write a Python program to: Load the Iris Dataset, Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV, Print the best parameters and the resulting model accuracy**

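```python
from sklearn.model_selection import GridSearchCV

# Reuses X_train, y_train and DecisionTreeClassifier from the earlier cells

# Candidate values for the two hyperparameters being tuned
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 3, 4, 5, 10]
}

# 5-fold cross-validated search over every combination in the grid
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

# best_score_ is the mean cross-validated accuracy on the training data,
# not the accuracy on the held-out test set
print("Best Parameters:", grid.best_params_)
print("Best Accuracy:", grid.best_score_)
```

```text
Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Best Accuracy: 0.9428571428571428
```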
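
The score printed above is the mean cross-validated accuracy from the search. To also report how the tuned model does on unseen data, the refitted best estimator can be scored on the held-out test set. The following is a self-contained sketch that assumes the same split and grid as above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 3, 4, 5, 10],
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

# With the default refit=True, best_estimator_ has been retrained on all of
# X_train using the best parameters, so it can be scored on the test set.
test_accuracy = accuracy_score(y_test, grid.best_estimator_.predict(X_test))

print("Best Parameters:", grid.best_params_)
print("Mean CV Accuracy:", grid.best_score_)
print("Test Accuracy:", test_accuracy)
```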


**Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to:**

1. Handle the missing values
2. Encode the categorical features
3. Train a Decision Tree model
4. Tune its hyperparameters
5. Evaluate its performance

And describe what business value this model could provide in the real-world setting.
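
The steps listed in this question could be wired together in scikit-learn roughly as follows. This is a minimal sketch on synthetic data. The column names (`age`, `blood_pressure`, `smoker`, `has_disease`), the rule that generates the label, the imputation strategies, and the parameter grid are all assumptions made up for illustration, not a recommended clinical setup.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the patient data (hypothetical columns)
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
df["has_disease"] = ((df["age"] > 50) & (df["smoker"] == "yes")).astype(int)

# Knock out some values to mimic missing data
df.loc[rng.choice(n, 40, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker"]

# Steps 1 and 2: impute missing values, then one-hot encode categoricals
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Step 3: decision tree on top of the preprocessing
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X = df.drop(columns="has_disease")
y = df["has_disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Step 4: tune hyperparameters with cross-validation
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_leaf": [1, 5, 20],
}
grid = GridSearchCV(model, param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set
test_pred = grid.predict(X_test)
print("Best Parameters:", grid.best_params_)
print(classification_report(y_test, test_pred))
```

Keeping imputation and encoding inside the pipeline means they are fitted only on the training folds during the search, which avoids leaking information from the validation folds.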