# Decision Tree

**Q1.** What is a Decision Tree, and how does it work in the context of
classification?

 A Decision Tree is a machine learning model that is used for making predictions by following a step-by-step questioning process. It is structured like a tree, where each internal node represents a question or test on a feature, each branch represents the outcome of that question, and each leaf node represents the final decision or prediction.

 How it works in classification:
1. Start at the root: The algorithm looks at all features (like age, income, gender, etc.).
2. Choose the best feature: It picks the feature that best separates the data into classes (using measures like Gini index or Entropy/Information Gain).
3. Split the data: It divides the dataset into branches based on that feature.
4. Repeat: For each branch, it again finds the best feature and splits further.
5. Stop: This continues until:
   - All data in a branch belongs to one class, OR
   - The tree reaches a stopping condition (like max depth).
6. Prediction: For a new input, the model follows the path of decisions down the tree until it reaches a leaf, and then assigns the class.
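
The prediction step (step 6) can be sketched in a few lines of plain Python. This is only an illustration: the tree below is written by hand, and its feature names and thresholds are hypothetical rather than learned from data.

```python
# A hand-written toy tree (hypothetical thresholds, not learned from data).
# Internal nodes hold a feature name and a threshold; leaves hold a class label.
toy_tree = {
    "feature": "petal_length",
    "threshold": 2.5,
    "left": {"label": "setosa"},          # petal_length <= 2.5
    "right": {                            # petal_length > 2.5
        "feature": "petal_width",
        "threshold": 1.7,
        "left": {"label": "versicolor"},  # petal_width <= 1.7
        "right": {"label": "virginica"},  # petal_width > 1.7
    },
}


def predict_one(node, sample):
    """Walk from the root to a leaf, answering one question per internal node."""
    while "label" not in node:
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]


sample = {"petal_length": 4.8, "petal_width": 1.5}
prediction = predict_one(toy_tree, sample)
print(prediction)
```

Each loop iteration corresponds to one question on the path from the root to a leaf, so the cost of a prediction grows with the depth of the tree, not with the size of the training set.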


**Q2.** Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

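- Gini Impurity:

 1. It is the probability of mislabeling a randomly chosen sample from the node if its label is drawn at random from the node's class distribution: $G = 1 - \sum_k p_k^2$, where $p_k$ is the fraction of samples in class $k$.
 2. Value is 0 when all data in that node belongs to one class (perfectly pure).
 3. The more evenly mixed the classes are, the higher the Gini value; for two classes the maximum is 0.5.

- Entropy:

 1. It tells us how much disorder or uncertainty is in a node: $H = -\sum_k p_k \log_2 p_k$.
 2. Value is 0 when the node is pure (all same class).
 3. The value is higher when classes are more evenly mixed; for two classes the maximum is 1, reached at a 50/50 mix.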

- How they affect splits in a Decision Tree:

 1. When building the tree, the algorithm checks different features and decides where to split the data.
 2. It chooses the split that makes the child nodes purer (closer to one class only).
 3. Gini and Entropy are just two different ways to measure impurity, but both usually lead to similar splits.
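
As a small illustration of the two measures, the sketch below computes both from a list of class labels using only the standard library. The function names `gini` and `entropy` are chosen here for the example and are not part of any library.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class fractions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p) over the classes present."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


pure_node = ["a", "a", "a", "a"]
mixed_node = ["a", "a", "b", "b"]

print("pure :", gini(pure_node), entropy(pure_node))
print("mixed:", gini(mixed_node), entropy(mixed_node))
```

Both functions return 0 for the pure node and reach their two-class maximum on the evenly mixed node, which is why they tend to rank candidate splits in a similar order.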

**Q3.** What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

- Pre-Pruning (also called Early Stopping):
 1. The tree stops growing early.
 2. We set some conditions before training (like maximum depth, minimum samples per leaf, or minimum information gain).
 3. If a split would violate one of these conditions, the node is not split further and becomes a leaf.

- Advantage:
 1. Saves time and memory because the tree doesn’t grow too big.
 2. Example: In real-time fraud detection, a small tree with limited depth gives faster predictions.

- Post-Pruning (also called Pruning after training):
 1. The tree is grown fully first, and then unnecessary branches are cut back.
 2. This is done by checking which branches don’t improve accuracy much (often using a validation set).
- Advantage:
 1. Often produces a simpler model that generalizes better, because branches that only fit noise in the training data are removed.
 2. Example: In medical diagnosis, post-pruning makes the model more generalizable, avoiding overly specific rules.
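
The two approaches can be compared with scikit-learn. The sketch below is one possible setup, not the only one: pre-pruning uses `max_depth` and `min_samples_leaf`, and post-pruning uses scikit-learn's minimal cost-complexity pruning (`ccp_alpha`), with the pruning strength picked on a held-out validation split. The split sizes and parameter values are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Pre-pruning: limits are fixed before training and stop growth early.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then try increasing pruning strengths.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    candidate = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    candidate.fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score >= best_score:  # ties go to the larger alpha, i.e. the smaller tree
        best_alpha, best_score = alpha, score

post_pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha)
post_pruned.fit(X_train, y_train)

print("full tree nodes  :", full_tree.tree_.node_count)
print("pre-pruned nodes :", pre_pruned.tree_.node_count)
print("post-pruned nodes:", post_pruned.tree_.node_count)
```

Pre-pruning never builds the large tree at all, while post-pruning pays the cost of growing it and then keeps only the branches that hold up on the validation data.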


**Q4.** What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

 - Information Gain in Decision Trees is a measure of how much uncertainty or impurity in the data is reduced after splitting it using a particular feature. It is calculated as the difference between the entropy of the parent node and the weighted average entropy of the child nodes after the split, where each child is weighted by the fraction of the parent's samples it receives. In simple terms, Information Gain shows how well a feature separates the data into pure groups, where each group contains mostly a single class. It is important because, at each step, a Decision Tree must decide which feature to use for splitting, and the feature with the highest Information Gain is chosen. Each split therefore makes the child nodes more homogeneous than their parent. Because the choice is greedy (made one node at a time), it usually gives a good tree but does not guarantee the best possible one.
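
A worked example may make the calculation concrete. The sketch below uses made-up labels and hypothetical helper names (`entropy`, `information_gain`); it compares a split that separates the classes perfectly with one that leaves every child as mixed as the parent.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# Split A sends every "yes" one way and every "no" the other way.
perfect_split = [["yes"] * 5, ["no"] * 5]
# Split B leaves both children with the same 50/50 mix as the parent.
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "yes", "no", "no", "no"]]

gain_perfect = information_gain(parent, perfect_split)
gain_useless = information_gain(parent, useless_split)
print("perfect split gain:", gain_perfect)
print("useless split gain:", gain_useless)
```

The tree-building algorithm would pick split A, since it removes all of the parent's uncertainty, whereas split B removes none.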

**Q5.** What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

- Real-World Applications of Decision Trees:
 1. Medical Diagnosis – predicting whether a patient has a disease based on symptoms.
 2. Fraud Detection – identifying fraudulent transactions in banking or insurance.
 3.	Customer Churn Prediction – checking if a customer is likely to leave a service.
 4. Credit Scoring & Loan Approval – deciding whether to approve a loan based on applicant details.
 5. Marketing & Sales – segmenting customers and recommending products.
 6.	Manufacturing – quality control and fault detection in production.
 7. Education – predicting student performance based on attendance, study habits, etc.

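- Advantages of Decision Trees:
 1. Easy to understand and interpret – works like a flowchart of questions.
 2. No need for data scaling – splits depend only on the ordering of feature values, so normalization is unnecessary.
 3. Handles both numerical and categorical data in principle (scikit-learn's trees require categorical features to be encoded as numbers first).
 4. Can be trained on small datasets, although small samples make overfitting more likely.
 5. Fast prediction time – once built, the tree is quick to use.
- Limitations of Decision Trees:
 1. Prone to overfitting – can grow too complex and memorize the data.
 2. Unstable – small changes in data may create a very different tree.
 3. Biased towards features with more levels (many unique values), because they offer more candidate splits.
 4. Usually less accurate than ensemble methods like Random Forests or Gradient Boosting.
 5. Coarse for continuous predictions – a regression tree outputs one constant value per leaf, so a very shallow tree can produce only a few distinct values and cannot extrapolate beyond the training targets.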
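
The last limitation is easy to see on synthetic data. The sketch below fits a depth-2 regression tree to a straight line (data invented for the example) and counts how many different values it can predict.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a simple continuous target, y = 2x on the range [0, 10].
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 2.0 * X.ravel()

shallow = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
preds = shallow.predict(X)

# A depth-2 tree has at most 2**2 = 4 leaves, hence at most 4 distinct outputs.
n_distinct = len(np.unique(preds))
print("distinct predicted values:", n_distinct)

# Far outside the training range, the prediction stays at the last leaf's value.
far_pred = shallow.predict([[100.0]])[0]
print("prediction at x=100:", far_pred, "| largest training prediction:", preds.max())
```

A deeper tree gives a finer staircase, but the prediction for inputs beyond the training range still stays flat at the value of the outermost leaf.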

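Dataset Info:
- Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).
- Boston Housing Dataset for regression tasks (sklearn.datasets.load_boston() or provided CSV). Note: load_boston() was removed in scikit-learn 1.2, so Q8 below uses the California Housing dataset (sklearn.datasets.fetch_california_housing()) instead.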



In [1]:
#Q6. Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier using the Gini criterion
# ● Print the model’s accuracy and feature importances

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Decision Tree Accuracy:", accuracy)

print("Feature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")



Decision Tree Accuracy: 1.0
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


In [2]:
#Q7. Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
# a fully-grown tree.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

tree_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_limited.fit(X_train, y_train)
y_pred_limited = tree_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

tree_full = DecisionTreeClassifier(random_state=42)
tree_full.fit(X_train, y_train)
y_pred_full = tree_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

print("Accuracy with max_depth=3:", accuracy_limited)
print("Accuracy with full tree  :", accuracy_full)


Accuracy with max_depth=3: 1.0
Accuracy with full tree  : 1.0


In [3]:
#Q8. Write a Python program to:
# ● Load the California Housing dataset from sklearn
# ● Train a Decision Tree Regressor
# ● Print the Mean Squared Error (MSE) and feature importances

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

housing = fetch_california_housing()
X = housing.data
y = housing.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

print("Feature Importances:")
for feature, importance in zip(housing.feature_names, regressor.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Mean Squared Error (MSE): 0.495235205629094
Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


In [4]:
#Q9. Write a Python program to:
# ● Load the Iris Dataset
# ● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
# ● Print the best parameters and the resulting model accuracy

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 3, 4, 5, 10]
}

dt = DecisionTreeClassifier(random_state=42)

grid_search = GridSearchCV(estimator=dt, param_grid=param_grid,
                           cv=5, scoring="accuracy", n_jobs=-1)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", best_params)
print("Model Accuracy on Test Set:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy on Test Set: 1.0


**Q10.** Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.
Explain the step-by-step process you would follow to:
* Handle the missing values
* Encode the categorical features
* Train a Decision Tree model
* Tune its hyperparameters
* Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

 **Step-by-Step Process:**

1. Handling Missing Values
The first step would be to check how many values are missing in each feature. If a feature has only a small number of missing entries, I would fill them with the mean (for numerical features), the median (if there are outliers), or the mode (for categorical features). The fill values should be computed from the training data only and then applied to the test data, so that no information leaks from the test set. If a feature has too many missing values and carries little predictive value, I might drop it to avoid adding noise to the model.
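
A minimal sketch of this step, assuming scikit-learn's `SimpleImputer` and a tiny made-up table (the column names `age` and `smoker` are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training records; column names and values are invented.
train = pd.DataFrame({
    "age": pd.Series([34.0, np.nan, 51.0, 47.0]),
    "smoker": pd.Series(["no", "yes", np.nan, "no"], dtype=object),
})

# Median for the numeric column, most frequent value for the categorical one.
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

# fit_transform learns the fill values from the training data only.
age_filled = num_imputer.fit_transform(train[["age"]])
smoker_filled = cat_imputer.fit_transform(train[["smoker"]])

print(age_filled.ravel())
print(smoker_filled.ravel())
```

For test data, the already fitted imputers would be applied with `transform` rather than `fit_transform`, so the fill values still come from the training set.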

2. Encoding the Categorical Features
Since scikit-learn's Decision Trees only accept numeric input, I need to convert categorical variables into numeric form. For categorical variables with no natural order (like “Male” or “Female”), I would use One-Hot Encoding. For features with a natural order (like “Low”, “Medium”, “High”), I would use ordinal encoding with the order stated explicitly (Low = 0, Medium = 1, High = 2); a plain label encoder assigns codes alphabetically, which would place “High” before “Low”. This ensures the tree can split the data properly without misunderstanding the categories.
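
The two encodings could look like this with scikit-learn's `OneHotEncoder` and `OrdinalEncoder`. The columns `blood_type` and `severity` are hypothetical examples, not fields from a real dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical columns; values are invented for the example.
df = pd.DataFrame(
    {
        "blood_type": ["A", "B", "O", "A"],              # no natural order
        "severity": ["Low", "High", "Medium", "Low"],    # natural order
    },
    dtype=object,
)

# One-hot: one 0/1 column per category, so no order is implied.
onehot = OneHotEncoder(handle_unknown="ignore")
blood_encoded = onehot.fit_transform(df[["blood_type"]]).toarray()

# Ordinal: the order is passed explicitly instead of relying on alphabetical order.
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
severity_encoded = ordinal.fit_transform(df[["severity"]])

print(blood_encoded)
print(severity_encoded.ravel())
```

Setting `handle_unknown="ignore"` is a design choice for this sketch: a category that appears only at prediction time is encoded as all zeros instead of raising an error.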

3. Training a Decision Tree Model
Once the dataset is clean and encoded, I would split it into training and test sets, usually 80% for training and 20% for testing, stratified by the target so that both sets keep the same share of diseased patients. Then, I’d create a Decision Tree Classifier using scikit-learn and fit it on the training data. At this stage, I would start with default settings just to get a baseline performance.

4. Hyperparameter Tuning
Decision Trees are powerful but can easily overfit, so tuning is very important. I would use GridSearchCV or RandomizedSearchCV to try different values of parameters like max_depth (how deep the tree can grow), min_samples_split (minimum samples to split a node), and min_samples_leaf (minimum samples at a leaf). The goal is to find the best balance between accuracy and generalization.
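
One way to tie preprocessing, training, and tuning together is a scikit-learn `Pipeline` searched with `GridSearchCV`, so that imputation and encoding are refit inside every cross-validation fold. The sketch below runs on synthetic data: the columns, the rule that generates the label, the grid values, and the choice of recall as the scoring metric are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data; columns and the label rule are invented.
rng = np.random.default_rng(0)
n = 300
age = rng.integers(20, 80, size=n).astype(float)
smoker = rng.choice(["yes", "no"], size=n)
y = ((age > 55) | (smoker == "yes")).astype(int)
age[rng.random(n) < 0.1] = np.nan  # knock out roughly 10% of the ages
X = pd.DataFrame({"age": age, "smoker": pd.Series(smoker, dtype=object)})

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "tree__max_depth": [2, 3, 5, None],
    "tree__min_samples_split": [2, 10],
    "tree__min_samples_leaf": [1, 5, 10],
}

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

test_recall = search.score(X_test, y_test)
print("best parameters:", search.best_params_)
print("test recall    :", test_recall)
```

`RandomizedSearchCV` accepts the same pipeline and is a reasonable substitute when the grid becomes too large to search exhaustively.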

5. Evaluating Performance
To evaluate the model, I would not rely on accuracy alone but also look at precision, recall, and F1-score. In healthcare, false negatives (saying “no disease” when the patient actually has it) can be very risky, so recall on the diseased class deserves particular attention. I would also look at the confusion matrix to see how well the model is distinguishing between diseased and non-diseased patients. If needed, I might also use cross-validation for more reliable performance estimates.
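
To show how these metrics relate to the confusion matrix, the sketch below uses ten made-up labels (1 = disease, 0 = no disease) and scikit-learn's metric functions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels for ten patients: 1 = disease, 0 = no disease.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall, "f1:", f1)
```

Here accuracy is 0.8, yet one of the four diseased patients is missed (recall 0.75), which is the kind of gap that accuracy alone would hide.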

**Business Value in Real-World Setting**

- A Decision Tree model for disease prediction could be very valuable in healthcare. It can help doctors and hospitals quickly identify high-risk patients based on their health records, reducing the time needed for manual assessments. This means patients who are more likely to have the disease can be prioritized for further medical testing or treatment, leading to earlier interventions and potentially saving lives. From a business perspective, this model can also reduce healthcare costs by focusing resources on patients who need them the most and minimizing unnecessary tests for low-risk patients. Moreover, since Decision Trees are easy to explain, doctors and healthcare providers can trust and understand the reasoning behind each prediction, making the model more practical in real-world decision-making.