**Question 1: What is a Decision Tree, and how does it work in the context of
classification?**

A Decision Tree is a supervised machine learning algorithm that makes predictions by learning a sequence of simple rules from labeled data. It can also be used for regression, but in classification its job is to decide which category an example belongs to.

You can imagine a decision tree as a series of questions that lead to an answer. For example, if you were trying to decide whether someone will buy a computer, you might ask questions like:

Is the person older than 30?

Do they have a high income?

Are they a student?

Each question splits the data into smaller groups. At the top of the tree, you start with all the data, and as you move down, you keep dividing it based on the answers to these questions. The process continues until each group at the bottom (called a leaf node) represents a clear decision or class — for example, “Yes, they will buy a computer” or “No, they won't.”
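The idea of a tree as a chain of questions can be sketched as plain nested conditions. This is a hand-written toy, not a learned model: the function name, the order of the questions, and the answers at the leaves are all assumptions made for illustration.

```python
def will_buy_computer(age: int, high_income: bool, is_student: bool) -> str:
    """Toy decision tree written by hand; a real tree learns its questions from data."""
    if age > 30:               # root node: first question
        if high_income:        # internal node: second question
            return "Yes"       # leaf node
        return "No"            # leaf node
    # age <= 30
    if is_student:             # internal node on the other branch
        return "Yes"
    return "No"


print(will_buy_computer(age=45, high_income=True, is_student=False))   # Yes
print(will_buy_computer(age=22, high_income=False, is_student=False))  # No
```

Each `if` corresponds to an internal node and each `return` to a leaf, which is why a trained tree can be read like a flowchart.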

The algorithm decides which question to ask at each step by searching for the feature (and, for numeric features, the threshold) that best separates the classes. It scores each candidate split with a mathematical measure such as Information Gain or the Gini Index (also called Gini Impurity) and keeps the split with the best score.

In short, a decision tree works by breaking a complex decision into smaller, simpler ones. It's easy to understand and interpret, but one drawback is that it can sometimes make overly specific decisions if not properly controlled — a problem known as overfitting.

**Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?**

When a decision tree decides where to split the data, it needs a way to measure how “mixed” or “uncertain” a group of examples is. Two common ways to measure this are Gini Impurity and Entropy. Both tell the tree how pure or impure a particular node is — in other words, how similar or different the items in that group are.

Gini Impurity tells us how often we would label an item incorrectly if we picked it at random from the group and assigned it a label at random according to the group's class proportions. It equals 1 minus the sum of the squared class proportions.

If all items belong to one class (for example, all “Yes” or all “No”), the impurity is 0, meaning the group is perfectly pure.

If the group is half “Yes” and half “No,” the impurity reaches its two-class maximum of 0.5, because the group is as mixed as it can be.

The decision tree tries to split the data so that each side becomes as pure as possible, which means a lower weighted Gini Impurity across the resulting groups.
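As a minimal sketch of the definition, Gini Impurity can be computed as 1 minus the sum of squared class proportions. The helper name `gini_impurity` is chosen here for illustration and is not a library function.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_k ** 2) over the class proportions p_k in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


print(gini_impurity(["Yes"] * 10))              # 0.0 (perfectly pure)
print(gini_impurity(["Yes"] * 5 + ["No"] * 5))  # 0.5 (evenly mixed, two classes)
```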

Entropy works in a similar way, but it measures the amount of disorder or uncertainty in the data, usually in bits.

If a group contains only one class, the entropy is 0, meaning it’s perfectly ordered.

If the classes are evenly mixed, the entropy is at its maximum (1 bit for two classes) because the group is as uncertain as it can be.

The goal of the tree is to make splits that reduce entropy as much as possible; this reduction is called Information Gain.
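Entropy can be sketched the same way, as the sum of p_k * log2(1 / p_k) over the class proportions p_k. The helper name `entropy` is an illustrative choice, not a library function.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: sum of p_k * log2(1 / p_k) over the classes present."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return sum((count / n) * math.log2(n / count) for count in counts.values())


print(entropy(["Yes"] * 10))              # 0.0 (only one class)
print(entropy(["Yes"] * 5 + ["No"] * 5))  # 1.0 (evenly mixed, two classes)
```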

In simple terms, both Gini Impurity and Entropy help the decision tree figure out the best question to ask at each step. The algorithm picks the split that leaves the resulting groups as pure and easy to classify as possible. In practice the two measures usually choose similar splits; Gini is slightly cheaper to compute because it avoids logarithms.
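To show how an impurity measure drives a split, here is a small sketch that scans the candidate thresholds of one numeric feature and keeps the one with the lowest weighted Gini Impurity. The function names and the toy ages are assumptions for illustration; library implementations are far more optimized.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold fits between two identical values
        threshold = (low + high) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


ages = [22, 25, 28, 35, 41, 52]                 # made-up data
buys = ["No", "No", "No", "Yes", "Yes", "Yes"]
print(best_threshold(ages, buys))               # (31.5, 0.0)
```

A full tree builder repeats this search over every feature, takes the best split overall, and then recurses into the two resulting groups.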

**Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**

When a decision tree is built, it can sometimes become too detailed and start learning noise from the data instead of actual patterns — this problem is called overfitting. To prevent that, we use a process called pruning, which means cutting down parts of the tree that don't add much value.

There are two main ways to prune a decision tree: Pre-Pruning and Post-Pruning.

Pre-Pruning

Pre-Pruning (also known as early stopping) means stopping the tree from growing too much right from the start.
While the tree is being built, the algorithm checks certain conditions — like the maximum depth, minimum number of samples in a node, or minimum information gain — and stops splitting when those conditions are met.

Example:
If we tell the algorithm that the tree can have a maximum depth of 5, it will stop creating new branches after level 5, even if it could go deeper.

Practical advantage:
Pre-pruning saves both time and computing resources, since it prevents unnecessary growth of the tree. It also helps avoid overfitting early on.
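In scikit-learn, these stopping conditions are set as constructor arguments. The sketch below uses `max_depth`, `min_samples_leaf`, and `min_impurity_decrease`; the specific values are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruned tree: growth stops as soon as any of these limits is hit
pre_pruned = DecisionTreeClassifier(
    max_depth=3,                 # never go deeper than 3 levels
    min_samples_leaf=5,          # every leaf must keep at least 5 samples
    min_impurity_decrease=0.01,  # skip splits that barely reduce impurity
    random_state=42,
).fit(X, y)

# Unconstrained tree for comparison
unpruned = DecisionTreeClassifier(random_state=42).fit(X, y)

print("pre-pruned: depth", pre_pruned.get_depth(), "leaves", pre_pruned.get_n_leaves())
print("unpruned:   depth", unpruned.get_depth(), "leaves", unpruned.get_n_leaves())
```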

Post-Pruning

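Post-Pruning means letting the tree grow fully first and then removing the branches that don't improve performance on unseen data.
One approach (reduced-error pruning) tests each subtree on a validation dataset and replaces it with a leaf when doing so does not hurt validation accuracy. Another (cost-complexity pruning) adds a penalty for each leaf and removes the branches that contribute least relative to their size.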

Example:
If a certain branch only improves the accuracy by a tiny amount on the training data but not on new data, that branch is pruned away.

Practical advantage:
Post-pruning often generalizes better than early stopping because the tree first considers every split it can find, including splits that only pay off after a further split, and then removes only the branches that fail to help on held-out data.
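scikit-learn implements post-pruning as minimal cost-complexity pruning (the `ccp_alpha` parameter) rather than the branch-by-branch validation check described above, so the sketch below uses that variant. Choosing `ccp_alpha` by validation accuracy, and breaking ties toward the larger value, are assumptions made here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Grow the full tree, then ask it for the sequence of useful pruning strengths
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    candidate = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    candidate.fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (smaller tree)
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha).fit(X_train, y_train)
print("full tree leaves:  ", full_tree.get_n_leaves())
print("pruned tree leaves:", pruned.get_n_leaves())
print(f"validation accuracy of pruned tree: {best_score:.2f}")
```

Because the validation set is used to pick `ccp_alpha`, a real project would keep a separate test set for the final accuracy estimate.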

**Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?**

Information Gain is a concept used in decision trees to decide which feature should be used to split the data at each step. It tells us how much "information" or "certainty" a feature gives us about the target class when we use it to divide the data.

When a decision tree tries to make a split, its main goal is to make the resulting groups as pure as possible — meaning that each group should mostly contain data from one class. To measure this, the algorithm looks at the entropy (the amount of disorder or uncertainty) before and after the split.

Information Gain is the entropy of the parent group minus the weighted average entropy of the groups produced by the split, where each group is weighted by its share of the examples.
If splitting the data on a particular feature removes a lot of uncertainty, that feature gives high information gain and is considered a good choice for splitting.
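Written as a formula, with notation assumed here rather than taken from the text: S is the set of examples at the node, A the candidate feature, S_v the subset of S where A takes the value v, and p_k the fraction of examples in class k.

```latex
H(S) = -\sum_{k} p_k \log_2 p_k,
\qquad
\mathrm{IG}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```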

In simple words:
Information Gain tells the decision tree how much a particular question helps to separate the data clearly.

A feature with high Information Gain makes the groups more organized and easier to classify.

A feature with low Information Gain doesn't help much in making the groups clearer.

Example:
If you are predicting whether someone will buy a product, and splitting the data by "Income Level" creates two groups where one mostly buys and the other mostly doesn't, that feature gives high Information Gain. On the other hand, if splitting by "Hair Color" doesn't change the distribution much, it gives low Information Gain.
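The example can be made concrete with a small calculation. The counts below are invented to mirror the "Income Level" and "Hair Color" scenario, and the helper names are illustrative rather than library functions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Ten hypothetical customers: five buy, five do not
parent = ["buy"] * 5 + ["no"] * 5

# Splitting by income separates the classes fairly well
income_split = [["buy"] * 4 + ["no"] * 1, ["buy"] * 1 + ["no"] * 4]
# Splitting by hair color leaves both groups almost as mixed as before
hair_split = [["buy"] * 3 + ["no"] * 2, ["buy"] * 2 + ["no"] * 3]

print(round(information_gain(parent, income_split), 3))  # 0.278
print(round(information_gain(parent, hair_split), 3))    # 0.029
```

The tree would choose the income split here because it removes roughly ten times more uncertainty.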

**Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?**


Common Real-World Applications

Medical Diagnosis
Decision trees can help doctors predict whether a patient has a certain disease based on symptoms, test results, or medical history.
Example: Predicting if a tumor is benign or malignant.

Finance and Banking
Banks use decision trees to decide whether to approve a loan or credit card application by checking factors like income, age, and credit score.
Example: Loan approval and credit risk assessment.

Marketing and Customer Segmentation
Businesses use them to identify which type of customers are most likely to buy a product or respond to a campaign.
Example: Predicting customer churn or targeting the right audience.

Fraud Detection
Decision trees can classify transactions as normal or suspicious based on patterns in past data.
Example: Detecting fraudulent credit card transactions.

Manufacturing and Quality Control
Used to determine the causes of defects in production and to maintain consistent product quality.
Example: Predicting machine failure or identifying defective products.

Main Advantages

Easy to Understand:
The flowchart-like structure is simple to interpret, even for people without a technical background.

No Need for Data Scaling:
Decision trees don't require normalization or standardization of features.

Works with Both Types of Data:
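The algorithm itself can split on both categorical and numerical features, although some implementations (scikit-learn among them) expect categorical features to be encoded as numbers first.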

Captures Non-Linear Relationships:
They can model complex patterns without needing mathematical transformations.
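The claim about scaling can be checked with a quick experiment: train one tree on raw features and one on standardized features, then compare their test predictions. This is a sketch on Iris with an arbitrary split; because standardization preserves the ordering of values within each feature, the two trees are expected to agree on all or nearly all predictions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Learn the scaling from the training rows only
scaler = StandardScaler().fit(X_train)

raw_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
scaled_tree = DecisionTreeClassifier(random_state=42).fit(scaler.transform(X_train), y_train)

raw_pred = raw_tree.predict(X_test)
scaled_pred = scaled_tree.predict(scaler.transform(X_test))

agreement = float(np.mean(raw_pred == scaled_pred))
print(f"Share of test predictions that agree: {agreement:.2f}")
```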

Main Limitations

Overfitting:
Decision trees can become too complex and start fitting noise instead of real patterns, especially if not pruned properly.

Unstable:
A small change in the data can lead to a completely different tree structure.

Biased Towards Features with More Levels:
Features with many categories may dominate the splits, even if they aren’t the most important.

Less Accurate Alone:
A single decision tree may not be as accurate as ensemble methods like Random Forest or Gradient Boosted Trees.



In [1]:
# Question 6: Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier using the Gini criterion
# ● Print the model’s accuracy and feature importances
# (Include your Python code and output in the code box below.)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

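# Load the data: 150 flowers, 4 numeric features, 3 classes
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a tree that scores candidate splits with Gini impurity
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predict on the held-out rows and measure accuracy
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)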

print("Decision Tree Classifier (Gini) Results")
print("--------------------------------------")
print(f"Model Accuracy: {accuracy:.2f}")
print("Feature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Decision Tree Classifier (Gini) Results
--------------------------------------
Model Accuracy: 1.00
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772
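Since interpretability is one of the main reasons to use a decision tree, it can help to print the learned rules as text. A minimal sketch, assuming the same split and settings as the cell above; `export_text` is part of `sklearn.tree`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

clf = DecisionTreeClassifier(criterion="gini", random_state=42).fit(X_train, y_train)

# Render the fitted tree as indented if/else style rules
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```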


In [2]:
# Question 7: Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
# a fully-grown tree.
# (Include your Python code and output in the code box below.)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pre-pruned tree: depth capped at 3 levels
tree_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_limited.fit(X_train, y_train)
y_pred_limited = tree_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# Fully grown tree: max_depth defaults to None, so it splits until the leaves are pure
tree_full = DecisionTreeClassifier(random_state=42)
tree_full.fit(X_train, y_train)
y_pred_full = tree_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

print("Decision Tree Accuracy Comparison")
print("--------------------------------")
print(f"Limited Tree (max_depth=3) Accuracy: {accuracy_limited:.2f}")
print(f"Fully Grown Tree Accuracy: {accuracy_full:.2f}")



Decision Tree Accuracy Comparison
--------------------------------
Limited Tree (max_depth=3) Accuracy: 1.00
Fully Grown Tree Accuracy: 1.00
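Since both trees reach the same test accuracy here, it can be informative to compare their sizes as well. The sketch below repeats the setup from the cell above and reports depth, leaf count, and train/test accuracy using `get_depth()` and `get_n_leaves()`; it makes no claim about the exact numbers.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

limited = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

for name, tree in [("max_depth=3", limited), ("fully grown", full)]:
    print(
        f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"train acc={tree.score(X_train, y_train):.2f}, "
        f"test acc={tree.score(X_test, y_test):.2f}"
    )
```

If the smaller tree matches the larger one on the test set, the extra depth is not buying anything on this split, which is the usual argument for preferring the simpler model.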


In [3]:
# Question 8: Write a Python program to:
# ● Load the Boston Housing Dataset
# ● Train a Decision Tree Regressor
# ● Print the Mean Squared Error (MSE) and feature importances
# (Include your Python code and output in the code box below.)
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

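# load_boston was removed in scikit-learn 1.2, so the data is fetched from OpenML instead
boston = fetch_openml(name="boston", version=1, as_frame=True)
X = boston.data
y = boston.target

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a regression tree; each leaf predicts the mean target of its training rows
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Mean squared error on the held-out rows (lower is better)
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)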

print("Decision Tree Regressor Results")
print("--------------------------------")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print("Feature Importances:")
for feature, importance in zip(X.columns, regressor.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Decision Tree Regressor Results
--------------------------------
Mean Squared Error (MSE): 10.42
Feature Importances:
CRIM: 0.0513
ZN: 0.0034
INDUS: 0.0058
CHAS: 0.0000
NOX: 0.0271
RM: 0.6003
AGE: 0.0136
DIS: 0.0707
RAD: 0.0019
TAX: 0.0125
PTRATIO: 0.0110
B: 0.0090
LSTAT: 0.1933


In [4]:
# Question 9: Write a Python program to:
# ● Load the Iris Dataset
# ● Tune the Decision Tree's max_depth and min_samples_split using
# GridSearchCV
# ● Print the best parameters and the resulting model accuracy
# (Include your Python code and output in the code box below.)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

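# Candidate values to try; None means no depth limit
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5]
}

# 5-fold cross-validation on the training set for every parameter combination
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# best_estimator_ is refit on the whole training set; score it on the held-out test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)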

print("Decision Tree Hyperparameter Tuning Results")
print("-------------------------------------------")
print("Best Parameters:", grid_search.best_params_)
print(f"Model Accuracy: {accuracy:.2f}")



Decision Tree Hyperparameter Tuning Results
-------------------------------------------
Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy: 1.00


**Question 10: Imagine you're working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.**

Step 1: Handling Missing Values
First, I'd carefully look through the dataset to see where the missing values are. For numerical columns, I'd fill in the missing values with the mean or median of that column, depending on how the data is distributed. For categorical columns, I'd replace missing values with the most frequent category.
If some columns have too many missing values (say, more than 40-50%), I might even consider dropping them if they don't add much value.
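A minimal sketch of this step with pandas, on a made-up table: the column names, values, and the 50% drop threshold are all assumptions for illustration. In a real project the fill values would be computed from the training split only and then reused on the test split.

```python
import numpy as np
import pandas as pd

# Hypothetical patient records with gaps
df = pd.DataFrame({
    "age": [34.0, np.nan, 58.0, 45.0],
    "cholesterol": [180.0, 220.0, np.nan, 200.0],
    "smoker": ["no", "yes", None, "no"],
    "rare_marker": [np.nan, np.nan, np.nan, 1.2],
})

# Drop columns where more than half of the values are missing
missing_share = df.isna().mean()
df = df.drop(columns=missing_share[missing_share > 0.5].index)

# Numeric columns: fill with the median (robust to outliers)
for col in ["age", "cholesterol"]:
    df[col] = df[col].fillna(df[col].median())

# Categorical column: fill with the most frequent category
df["smoker"] = df["smoker"].fillna(df["smoker"].mode()[0])

print(df)
```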

Step 2: Encoding Categorical Features
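Next, since many decision tree implementations (including scikit-learn's) expect numeric inputs, I'd convert all categorical data into numerical form.
For columns with only two values or a natural order (like “Gender” or “Smoker/Non-Smoker” when each has two categories), I'd use Label Encoding.
For unordered columns with several categories (like “City” or “Blood Type”), I'd use One-Hot Encoding so that the tree doesn't read an arbitrary numeric order into the categories. If a column has very many distinct values, I'd first group the rare ones to keep the number of new columns manageable.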

Step 3: Training the Decision Tree Model
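After cleaning and encoding the data, I'd split the dataset into training and testing sets, typically 80% for training and 20% for testing. In practice I'd make this split before fitting any imputer or encoder, so that the preprocessing is learned from the training data alone. Then I'd create a Decision Tree Classifier, train it on the training data, and check how well it predicts on the test data.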

Step 4: Tuning Hyperparameters
To make sure the model doesn't overfit (memorize) or underfit (miss patterns), I'd tune hyperparameters like:

max_depth (how deep the tree can go)

min_samples_split (minimum samples needed to split a node)

criterion (whether to use “gini” or “entropy”)

I'd use GridSearchCV to try every combination of these parameters with cross-validation on the training data and keep the combination with the best average validation score.
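Steps 1 to 4 can be combined into a single scikit-learn `Pipeline`, so that imputation and encoding are re-fit inside every cross-validation fold. The sketch below runs on synthetic data: the column names, the rule that generates the label, and the grid values are all invented for illustration, and only a numeric column has missing values to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data (not real medical data)
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 80, size=n).astype(float),
    "bmi": rng.normal(27.0, 4.0, size=n),
    "smoker": rng.choice(["no", "yes"], size=n),
})
y = ((df["age"] > 55) | (df["smoker"] == "yes")).astype(int)  # made-up label rule
df.loc[rng.random(n) < 0.1, "bmi"] = np.nan  # remove about 10% of one numeric column

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=42, stratify=y
)

# Impute numeric columns, one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "bmi"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["smoker"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# The "tree__" prefix routes each parameter to the pipeline step named "tree"
param_grid = {
    "tree__max_depth": [2, 3, 5, None],
    "tree__min_samples_split": [2, 10, 20],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

test_f1 = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print(f"Held-out F1: {test_f1:.2f}")
```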

Step 5: Evaluating the Model
Once the best model is ready, I'd evaluate it using metrics like:

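Accuracy: the fraction of all predictions that are correct

Precision & Recall: precision is the share of patients flagged as having the disease who actually have it, and recall is the share of patients with the disease that the model catches (missing a sick patient is usually the costlier error, so recall matters most here)

F1-Score: the harmonic mean of precision and recall

Confusion Matrix: a table of correct and incorrect predictions for each class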

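If the dataset is imbalanced (for example, very few patients have the disease), accuracy alone can be misleading. I'd rebalance the training data with techniques like SMOTE (Synthetic Minority Over-sampling Technique) or class weights, and I'd report the AUC-ROC score alongside precision and recall to better understand performance.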
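These metrics are all available in `sklearn.metrics`. The sketch below uses a synthetic imbalanced dataset from `make_classification`, and it swaps in `class_weight="balanced"` instead of SMOTE, because SMOTE lives in the separate imbalanced-learn package. The class names "healthy" and "disease" are labels chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly 10% positive cases
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# class_weight="balanced" makes errors on the rare class count more during training
clf = DecisionTreeClassifier(max_depth=4, class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

cm = confusion_matrix(y_test, y_pred)
print(cm)  # rows are true classes, columns are predicted classes
print(classification_report(y_test, y_pred, target_names=["healthy", "disease"]))

auc = roc_auc_score(y_test, y_score)
print(f"AUC-ROC: {auc:.3f}")
```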

Business Value in the Real World
In a healthcare setting, such a model could be incredibly valuable. It could help doctors quickly identify high-risk patients and prioritize medical attention. It could also assist in early detection, helping hospitals manage resources better and improving patient outcomes.
Ultimately, the model wouldn't replace doctors — it would support them by providing data-driven insights, saving time, and potentially saving lives.