**Question 1: What is a Decision Tree, and how does it work in the context of
classification?**

Ans. A Decision Tree is a supervised machine learning algorithm used for classification (and, with numeric targets, regression). For classification, it works by recursively splitting the dataset into subsets based on the values of input features. Each internal node in the tree represents a test on a feature, each branch represents an outcome of that test, and each leaf node represents a class label.

The model uses rules like "if-else" conditions to make decisions and classify data step-by-step down the tree.

**Example:**
To predict if a customer will buy a product:

* If Age > 30

  * If Income > 50k → Predict: Buy
  * Else → Predict: Don’t Buy
* Else → Predict: Don’t Buy
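
The example tree above can be written as nested conditions. This is only a hand-coded sketch: the function name `predict_purchase` is hypothetical, and "50k" is assumed to mean an income of 50,000. A trained tree would learn its thresholds from data rather than have them written by hand.

```python
def predict_purchase(age: float, income: float) -> str:
    """Hand-written version of the example tree (thresholds are assumed)."""
    if age > 30:                 # root node: test on Age
        if income > 50_000:      # internal node: test on Income
            return "Buy"         # leaf node
        return "Don't Buy"       # leaf node
    return "Don't Buy"           # leaf node


print(predict_purchase(age=35, income=60_000))
print(predict_purchase(age=35, income=40_000))
print(predict_purchase(age=25, income=90_000))
```

Each call follows exactly one path from the root to a leaf, which is why a prediction can be explained as a short chain of conditions.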

**Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?**

Ans. **Gini Impurity** and **Entropy** are impurity measures used to decide how a decision tree splits data at each node.

* **Gini Impurity** measures the probability that a randomly chosen element from the node would be misclassified if it were labeled at random according to the class distribution of that node.

  * Formula: *Gini = 1 − Σ (pᵢ)²*, where *pᵢ* is the probability of class *i*.

* **Entropy** measures the amount of disorder or uncertainty in the data.

  * Formula: *Entropy = −Σ (pᵢ × log₂(pᵢ))*

**Impact on Splits:**

* Both measures help find the best feature (and, for numeric features, the best threshold) to split on by evaluating how "pure" the resulting subsets will be.
* Lower weighted impurity in the child nodes means a better split.
* The decision tree selects the split that gives the **highest information gain** (for entropy) or the **largest reduction in Gini impurity**.
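
To make the two formulas concrete, here is a minimal sketch that evaluates both for a few two-class distributions. The helper names `gini` and `entropy` and the example distributions are illustrative choices, not part of any library.

```python
from math import log2


def gini(probs):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Entropy in bits: sum(p_i * log2(1 / p_i)); classes with p = 0 are skipped."""
    return sum(p * log2(1 / p) for p in probs if p > 0)


examples = [
    ("pure node", [1.0, 0.0]),
    ("80/20 node", [0.8, 0.2]),
    ("50/50 node", [0.5, 0.5]),
]
for name, probs in examples:
    print(f"{name}: gini={gini(probs):.3f}, entropy={entropy(probs):.3f}")
```

Both measures are 0 for a pure node and reach their maximum for an even split (0.5 for Gini and 1 bit for entropy when there are two classes), so they usually rank candidate splits similarly.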

**Q3. What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**

**Ans.** **Pre-Pruning (Early Stopping):** Pre-pruning stops the growth of the decision tree during training. This is done by setting constraints such as a maximum tree depth, a minimum number of samples required to split a node, or a minimum impurity reduction for a split.

**Advantage:** It reduces training time and limits overfitting by avoiding unnecessary complexity from the start.

**Post-Pruning (Pruning After Full Growth):** Post-pruning lets the tree grow fully and then removes branches that do not contribute significantly to the model’s performance. The amount of pruning is usually chosen with a validation set or cross-validation.

**Advantage:** It improves the model’s generalization by simplifying a complex tree, which can lead to better performance on unseen data. Because every split is grown before anything is removed, it is also less likely than early stopping to discard a split that only pays off deeper in the tree.
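
A rough sketch of both approaches in scikit-learn follows. Pre-pruning uses constructor constraints; post-pruning is shown here through cost-complexity pruning (`ccp_alpha`), which is one specific post-pruning technique and not the only one. The constraint values and the mid-range alpha are arbitrary illustrative choices; in practice alpha would be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: constraints stop growth while the tree is being built.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=10, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree first, then compute candidate pruning strengths.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Illustrative only: pick a mid-range alpha instead of tuning it.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(
    ccp_alpha=alpha, random_state=42
).fit(X_train, y_train)

for name, tree in [
    ("full", full_tree),
    ("pre-pruned", pre_pruned),
    ("post-pruned", post_pruned),
]:
    print(
        f"{name}: leaves={tree.get_n_leaves()}, "
        f"test accuracy={tree.score(X_test, y_test):.2f}"
    )
```

Comparing the leaf counts shows how much each strategy simplifies the tree relative to the unconstrained one.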

**Q4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?**

**Ans.** Information Gain is a key concept used in building Decision Trees. It measures how much “information” a feature provides about the target class by quantifying the reduction in entropy (uncertainty or impurity) when a dataset is split based on that feature.

The formula for Information Gain (IG) when splitting a dataset $D$ on attribute $A$ is:

$$
IG(D, A) = Entropy(D) - \sum_{v \in Values(A)} \frac{|D_v|}{|D|} \cdot Entropy(D_v)
$$

Where:

* $Entropy(D)$ is the entropy of the entire dataset.
* $D_v$ is the subset of data for which attribute $A$ has value $v$.
* $|D_v| / |D|$ is the proportion of data in $D_v$.
* $Entropy(D_v)$ is the entropy of subset $D_v$.

**Importance of Information Gain:**

* It helps select the best attribute to split the data at each node in the decision tree.
* A higher Information Gain indicates that the feature better separates the data into distinct classes.
* It leads to more accurate and efficient decision trees by reducing impurity and improving classification performance.

**Example:**
In a dataset used to predict whether a person will play tennis, if the feature "Outlook" results in the highest Information Gain, then the decision tree will split on "Outlook" first, as it provides the most useful information for classification.
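
The calculation can be worked through by hand. The sketch below assumes the class counts of the classic 14-row play-tennis toy dataset (9 "yes" and 5 "no" overall; Sunny 2/3, Overcast 4/0, Rain 3/2); these counts are not given in the text above and are used only for illustration. The helper names are hypothetical.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a node described by its per-class counts."""
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent_counts)
    weighted = sum(
        (sum(child) / total) * entropy(child) for child in children_counts
    )
    return entropy(parent_counts) - weighted


# Assumed (yes, no) counts from the textbook play-tennis example.
parent = [9, 5]
outlook = {"Sunny": [2, 3], "Overcast": [4, 0], "Rain": [3, 2]}

ig_outlook = information_gain(parent, list(outlook.values()))
print(f"Entropy(D)     = {entropy(parent):.3f}")
print(f"IG(D, Outlook) = {ig_outlook:.3f}")
```

With these counts the parent entropy is about 0.940 bits and the gain from splitting on "Outlook" is about 0.247 bits. The "Overcast" branch is already pure, so it contributes zero entropy to the weighted sum.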

**Q5. What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?**

**Ans:** **Real-World Applications of Decision Trees:**

1. **Medical Diagnosis:**
   Used to diagnose diseases based on symptoms, test results, and patient history.

2. **Loan Approval and Credit Scoring:**
   Used by banks to decide loan approvals based on factors like credit score, income, and employment status.

3. **Marketing and Customer Segmentation:**
   Helps businesses identify target customer groups and personalize marketing strategies.

4. **Fraud Detection:**
   Detects unusual patterns in transactions that may indicate fraudulent activity.

5. **Manufacturing and Quality Control:**
   Used to identify causes of defects and improve product quality.

**Main Advantages of Decision Trees:**

* Easy to understand and interpret.
* Can handle both numerical and categorical data.
* No need for feature scaling or normalization.
* Can model non-linear relationships.
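* Some implementations can handle missing values directly (for example, through surrogate splits in CART); others require imputation first.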

**Main Limitations of Decision Trees:**

* Prone to overfitting, especially with deep trees.
* Sensitive to small changes in data (instability).
* Can be biased toward features with many categories.
* Single decision trees often have lower accuracy compared to ensemble methods.

In [1]:
'''Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances'''

# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Decision Tree Classifier with Gini criterion
model = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict on the test data
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Print feature importances
print("Feature Importances:")
for feature_name, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")

Model Accuracy: 1.00
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


In [2]:
'''Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.
'''

# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Decision Tree with max_depth=3
model_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
model_limited.fit(X_train, y_train)
y_pred_limited = model_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# Fully-grown Decision Tree (no depth limit)
model_full = DecisionTreeClassifier(random_state=42)
model_full.fit(X_train, y_train)
y_pred_full = model_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Print both accuracies
print(f"Accuracy with max_depth=3: {accuracy_limited:.2f}")
print(f"Accuracy with fully-grown tree: {accuracy_full:.2f}")

Accuracy with max_depth=3: 1.00
Accuracy with fully-grown tree: 1.00


In [3]:
'''
Question 8: Write a Python program to:
● Load the California Housing dataset from sklearn
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances
'''

# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Decision Tree Regressor
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.2f}")

# Print feature importances
print("Feature Importances:")
for name, importance in zip(housing.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.4f}")

Mean Squared Error (MSE): 0.50
Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


In [4]:
'''Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy'''

# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter grid to tune
param_grid = {
    'max_depth': [2, 3, 4, 5, 6, None],
    'min_samples_split': [2, 3, 4, 5]
}

# Initialize the Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Set up GridSearchCV
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best parameters from GridSearch
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Evaluate the best model on test data
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with Best Parameters: {accuracy:.2f}")

Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy with Best Parameters: 1.00


Question 10:
Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.

Explain the step-by-step process you would follow to:

● Handle the missing values

● Encode the categorical features

● Train a Decision Tree model

● Tune its hyperparameters

● Evaluate its performance

And describe what business value this model could provide in the real-world setting.

**Answer:**

**1. Handle Missing Values**

* **Understand the missingness:** First, analyze how data is missing. Is it random, or does it follow a pattern? This affects how you handle it.
* **Imputation:**

  * For numerical features, fill missing values with the mean or median, or use more advanced methods like K-Nearest Neighbors imputation.
  * For categorical features, fill missing values with the mode (most frequent value) or create a special category like "Unknown".
  * If missing values are too prevalent or critical, consider dropping those features or samples carefully.
* **Why it matters:** Many model implementations cannot handle missing data directly, so cleaning this up ensures your model sees complete, reliable inputs.
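
As a small illustration of the imputation options above, the sketch below uses scikit-learn's `SimpleImputer` on made-up data. The column meanings (age, blood pressure, smoker status) and all values are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns: age, systolic blood pressure.
numeric = np.array(
    [[45.0, 120.0], [np.nan, 135.0], [60.0, np.nan], [38.0, 110.0]]
)
# Hypothetical categorical column with one missing entry.
smoker = np.array([["yes"], ["no"], [np.nan], ["no"]], dtype=object)

# Numeric features: replace missing values with the column median.
median_imputer = SimpleImputer(strategy="median")
numeric_filled = median_imputer.fit_transform(numeric)

# Categorical feature: replace missing values with an explicit "Unknown" category.
unknown_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")
smoker_filled = unknown_imputer.fit_transform(smoker)

print(numeric_filled)
print(smoker_filled.ravel())
```

In a real project the imputers would be fitted on the training split only and then applied to the test split, so that test data does not influence the fill values.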

**2. Encode Categorical Features**

* **Identify categorical variables:** This could be patient gender, blood type, or any non-numeric info.
* **Encoding methods:**

  * For nominal categories without order (e.g., blood type), use One-Hot Encoding.
  * For ordinal categories (e.g., disease severity: mild, moderate, severe), use ordinal encoding, mapping them to a numeric scale that preserves their order.
* **Why it matters:** Most library implementations of Decision Trees, including scikit-learn’s, expect numeric input, so encoding transforms your data into a usable form.
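
A minimal sketch of the two encodings with scikit-learn follows. The sample values are made up, and the severity order passed to `OrdinalEncoder` is an assumption taken from the example in the text.

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical nominal and ordinal columns.
blood_type = [["A"], ["O"], ["B"], ["O"]]
severity = [["mild"], ["severe"], ["moderate"], ["mild"]]

# Nominal: one binary column per category (sorted alphabetically: A, B, O).
one_hot = OneHotEncoder(handle_unknown="ignore")
blood_encoded = one_hot.fit_transform(blood_type).toarray()

# Ordinal: integers that follow the explicitly stated order.
ordinal = OrdinalEncoder(categories=[["mild", "moderate", "severe"]])
severity_encoded = ordinal.fit_transform(severity)

print(one_hot.categories_)
print(blood_encoded)
print(severity_encoded.ravel())
```

Passing the category order explicitly matters for the ordinal column: without it, the encoder would sort the labels alphabetically, which only happens to match the intended order in this example.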

**3. Train a Decision Tree Model**

* **Split the data:** Use an 80-20 or 70-30 split between training and test sets, or use cross-validation to ensure your results generalize.
* **Initialize the model:** Start with a default Decision Tree classifier.
* **Train on processed data:** Fit the model on your training data.
* **Why Decision Trees:** They handle mixed data types well, are interpretable (important in healthcare), and can capture nonlinear patterns.

**4. Tune Hyperparameters**

* **Key hyperparameters to tune:**

  * `max_depth`: controls tree complexity, balancing underfitting and overfitting.
  * `min_samples_split` and `min_samples_leaf`: control how many samples are needed to split a node or to form a leaf, which affects generalization.
  * `max_features`: the number of features to consider at each split.
* **Use GridSearchCV or RandomizedSearchCV:** Explore combinations systematically with cross-validation to find the best trade-off.
* **Why tuning matters:** Proper tuning reduces overfitting and underfitting, improving the model’s predictive power on unseen data.
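
One way to combine preprocessing and tuning is to wrap them in a single pipeline so that imputation is refitted inside every cross-validation fold. The sketch below is an assumption-heavy illustration: it uses synthetic data from `make_classification` as a stand-in for patient records, injects missing values at random, and optimizes recall; the grid values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data (not real): roughly 20% positive class.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42
)
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.05] = np.nan  # remove about 5% of the values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# The "tree__" prefix routes each parameter to the pipeline step named "tree".
param_grid = {
    "tree__max_depth": [3, 5, 8, None],
    "tree__min_samples_split": [2, 10, 20],
    "tree__min_samples_leaf": [1, 5, 10],
    "tree__max_features": [None, "sqrt"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Cross-validated recall: {search.best_score_:.2f}")
```

Scoring on recall reflects the healthcare priority of catching positive cases; a different metric could be passed through `scoring` if false positives are more costly.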

**5. Evaluate Performance**

* **Metrics:**

  * For classification, consider Accuracy, Precision, Recall, F1-score, and ROC-AUC.
  * In healthcare, Recall (sensitivity) is often critical: you want to catch as many patients with the disease as possible, even at the cost of some false positives.
* **Validation:** Use a separate test set or cross-validation to ensure your model performs reliably.
* **Interpretability:** Use feature importances and decision tree visualization to explain model decisions to clinicians and stakeholders.
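
The metrics and the interpretability step can be sketched as follows. The data is again synthetic (a stand-in, not real patient records), the class names and `feature_i` names are placeholders, and `max_depth=3` is chosen only to keep the printed tree short.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic, imbalanced binary data standing in for disease labels.
X, y = make_classification(
    n_samples=1000, n_features=6, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# Precision, recall, and F1 per class, then the two headline metrics.
print(classification_report(y_test, y_pred, target_names=["no disease", "disease"]))
print(f"Recall (sensitivity): {recall_score(y_test, y_pred):.2f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_score):.2f}")

# Text rendering of the learned rules, suitable for review with domain experts.
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
rules = export_text(model, feature_names=feature_names)
print(rules)
```

ROC-AUC is computed from predicted probabilities, while recall uses the hard class predictions, so the two can move differently when the decision threshold changes.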

**Business Value of This Model**

* **Early detection:** Predicting disease early means patients can receive timely treatment, improving outcomes and reducing healthcare costs.
* **Resource optimization:** Helps healthcare providers prioritize high-risk patients for screening or intervention, making better use of limited resources.
* **Personalized care:** Tailors monitoring and care plans based on individual risk, improving patient satisfaction and effectiveness.
* **Data-driven decisions:** Provides actionable insights backed by data, enabling the company to develop better products, policies, or outreach programs.
* **Trust and transparency:** Decision Trees’ interpretability supports building trust with clinicians, regulators, and patients, crucial in healthcare.