**Q1. What is a Decision Tree, and how does it work in the context of classification?**
- A Decision Tree is a supervised machine learning algorithm used for classification and regression tasks. In the context of classification, it's a flowchart-like structure that helps make decisions based on input features to predict a class label.
The algorithm follows a divide-and-conquer approach:
1. Feature Selection: The algorithm selects the best feature to split the dataset based on a criterion (like Gini Impurity, Entropy/Information Gain, or Chi-square).

2. Splitting: The data is split into subsets based on the selected feature's values.

3. Recursion: The process is repeated recursively for each subset until all data in a node belongs to the same class or a stopping criterion is met (e.g., maximum depth or minimum samples per node).

4. Prediction: To classify a new instance, you start at the root and follow the branches according to the feature values of the instance until you reach a leaf node with the predicted class.
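The prediction step can be sketched in a few lines of plain Python. The tree below is written by hand purely for illustration: the feature names, thresholds, and the nested-dict layout are assumptions, not the output of any training run.

```python
# A hand-written toy tree (thresholds are illustrative, not learned from data).
# Internal nodes test one feature against a threshold; leaves hold a class label.
toy_tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}

def predict(node, sample):
    """Walk from the root to a leaf, going left when value <= threshold."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

print(predict(toy_tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(toy_tree, {"petal_length": 4.7, "petal_width": 1.4}))  # versicolor
print(predict(toy_tree, {"petal_length": 5.9, "petal_width": 2.1}))  # virginica
```

Each call follows exactly one root-to-leaf path, so prediction cost grows with the depth of the tree rather than with the number of training samples.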

**Q2. Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?**
- An impurity measure quantifies how mixed the class labels are at a particular node.
1. Gini Impurity: Used in the CART algorithm.
Formula:

$$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2$$

Where:
- $p_i$ = proportion of instances belonging to class $i$
- $C$ = number of classes

Example: Suppose a node contains 3 samples:
- 2 of class A
- 1 of class B

Then:

$$\text{Gini} = 1 - (2/3)^2 - (1/3)^2 = 1 - 4/9 - 1/9 = 4/9 \approx 0.444$$

If all samples were of one class:

$$\text{Gini} = 1 - (1)^2 = 0 \quad \text{(pure)}$$

2. Entropy (Information Gain): Used in the ID3 and C4.5 algorithms.
Formula:

$$\text{Entropy} = -\sum_{i=1}^{C} p_i \log_2(p_i)$$
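As a quick check of the two formulas, here is a minimal sketch in plain Python. The helper names `gini` and `entropy` are chosen for this illustration and are not part of any library.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the classes present in labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: sum(p_i * log2(1 / p_i)), which equals -sum(p_i * log2(p_i))."""
    n = len(labels)
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())

node = ["A", "A", "B"]           # the 3-sample node from the example above
print(round(gini(node), 3))      # 0.444
print(round(entropy(node), 3))   # 0.918
print(gini(["A", "A", "A"]), entropy(["A", "A", "A"]))  # 0.0 0.0 for a pure node
```

Both measures are 0 for a pure node and largest when the classes are evenly mixed. Gini tops out at $1 - 1/C$ and entropy at $\log_2 C$.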

How they impact splits: at each node,
- The algorithm evaluates all possible splits for each feature.
- For each split, it calculates the weighted average impurity (Gini or Entropy) of the child nodes.
- It selects the split that minimizes the weighted impurity — i.e., the split that makes the data more pure.

The reduction in impurity achieved by a split is called:
- Gini Gain when Gini Impurity is used
- Information Gain when Entropy is used
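The split search described above can be sketched for a single numeric feature as follows. This is a simplified illustration: the helper names and the six-sample dataset are made up, and real implementations such as CART handle many features at once and are far more optimized.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Try a threshold midway between each pair of adjacent distinct values."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue  # no threshold fits between equal values
        t = (v1 + v2) / 2
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        score = weighted_gini(left, right)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical one-feature dataset
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = ["A", "A", "A", "B", "B", "B"]
threshold, impurity = best_threshold(values, labels)
parent = gini(labels)
print(threshold, impurity, parent - impurity)  # 3.5 0.0 0.5
```

The last printed value is the Gini Gain of the chosen split: the parent impurity (0.5) minus the weighted child impurity (0.0).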




**Q3. What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**
- The differences between Pre-Pruning and Post-Pruning in Decision Trees are as follows:

| Feature           | Pre-Pruning                   | Post-Pruning                          |
| ----------------- | ----------------------------- | ------------------------------------- |
| When Applied      | During tree building          | After full tree is built              |
| Control           | Stops growth early            | Trims unnecessary branches            |
| Complexity        | Simpler implementation        | Can be more computationally expensive |
| Risk              | May stop too early (underfit) | May overfit before pruning            |
| Practical Benefit | Faster training               | Better generalization accuracy        |

1. Pre-Pruning (a.k.a Early Stopping):- Pre-pruning stops the tree from growing once a condition is met during tree construction.
Practical Advantage: Faster training time — because it avoids growing unnecessary parts of the tree.
Example: If a split leads to very little information gain, you stop splitting further — even if you could.

2. Post-Pruning (a.k.a Pruning after Training):- Post-pruning allows the tree to fully grow, then removes branches that do not contribute much to accuracy.
Practical Advantage: Improves accuracy and generalization — by removing overfitting parts after evaluating the whole tree structure.
Example: After building the full tree, you test subtrees on a validation set. If pruning them doesn’t reduce accuracy, they are removed.
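Both strategies can be tried in scikit-learn. The sketch below assumes the Iris data and a 70/30 train/validation split chosen only for illustration. For post-pruning it uses scikit-learn's built-in cost-complexity pruning (`ccp_alpha`) and picks the pruning strength on the validation set, which is related to, but not the same as, the subtree-by-subtree validation test described above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Pre-pruning: growth stops as soon as one of the limits is reached.
pre = DecisionTreeClassifier(max_depth=2, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then list the pruning strengths (alphas)
# at which successive subtrees would be collapsed.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Refit one tree per alpha; clip at 0 to guard against tiny negative
# floating-point values in the path.
candidates = [
    DecisionTreeClassifier(ccp_alpha=max(alpha, 0.0), random_state=0).fit(X_train, y_train)
    for alpha in path.ccp_alphas
]
# Keep the tree with the best validation accuracy; iterating in reverse
# means ties go to the larger alpha, i.e. the smaller tree.
post = max(reversed(candidates), key=lambda tree: tree.score(X_val, y_val))

for name, tree in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name}: {tree.tree_.node_count} nodes, "
          f"validation accuracy {tree.score(X_val, y_val):.2f}")
```

The post-pruning branch fits one tree per candidate alpha, which illustrates why the table lists it as the more computationally expensive option.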

**Q4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?**
- Information Gain (IG) is a measure used to determine how well a feature splits the data into distinct classes in a decision tree. It quantifies the reduction in uncertainty (entropy) after splitting a dataset based on a particular feature.
It is important for choosing the best split because the split made at each node determines how quickly the tree separates the classes.
ID3 selects splits by information gain directly, while C4.5 uses gain ratio, a normalized form of it.

Information Gain tells us: "How much better are we at classifying the data if we split on this feature?"

The feature that maximizes information gain is selected for the split — because it best reduces disorder in the data.
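Written out, with $H$ denoting the entropy defined in Q2, the quantity being maximized is shown below. The symbols $S$ (samples at the node), $A$ (candidate feature), and $S_v$ (samples falling into branch $v$) are notation introduced here for illustration.

$$IG(S, A) = H(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)$$

As a worked example with made-up counts, take a node with 5 samples of class A and 5 of class B, and a split that sends 4 A to the left branch and 1 A plus 5 B to the right branch:

$$H(S) = 1, \qquad H(S_{\text{left}}) = 0, \qquad H(S_{\text{right}}) = -\tfrac{1}{6}\log_2\tfrac{1}{6} - \tfrac{5}{6}\log_2\tfrac{5}{6} \approx 0.650$$

$$IG = 1 - \left(\tfrac{4}{10} \cdot 0 + \tfrac{6}{10} \cdot 0.650\right) \approx 0.610$$

A split that left both branches as mixed as the parent would have $IG = 0$.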

**Q5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?**
- Real-World Applications of Decision Trees
1. Medical Diagnosis
Use case: Diagnosing diseases based on symptoms and patient history.
Example: Classifying whether a tumor is benign or malignant.
2. Credit Scoring and Loan Approval
Use case: Evaluating whether a customer is eligible for a loan.
Example: Splitting on income, employment status, credit history.
3. Fraud Detection
Use case: Identifying unusual transactions or behavior.
Example: Splitting on transaction amount, location, frequency.
4. Customer Churn Prediction
Use case: Predicting if a customer is likely to stop using a service.
Example: Splitting on service usage, complaints, tenure.
5. Marketing and Recommendation Systems
Use case: Segmenting customers and predicting product preferences.
Example: Predicting if a user will click on an ad based on behavior.

Advantages:

| Advantage                              | Description                                                      |
| -------------------------------------- | ---------------------------------------------------------------- |
| **Easy to Understand**                 | Intuitive, similar to decision-making processes humans use.      |
| **No Data Scaling Required**           | Works without normalization or standardization.                  |
| **Handles Both Data Types**            | Works with both numerical and categorical variables.             |
| **Feature Selection Built-in**         | Automatically selects the most important features during splits. |
| **Can Model Non-linear Relationships** | Captures complex decision boundaries.                            |

Limitations:

| Limitation                           | Description                                                              |
| ------------------------------------ | ------------------------------------------------------------------------ |
| **Overfitting**                      | Trees can grow very deep and memorize training data.                     |
| **Unstable**                         | Small changes in data can lead to a completely different tree structure. |
| **Biased to Dominant Features**      | Features with more levels may dominate the splits.                       |
| **Greedy Algorithm**                 | Makes locally optimal choices that may not lead to a globally best tree. |
| **Poor Performance with Noisy Data** | Sensitive to outliers and irrelevant features.                           |



**Dataset Info:**

● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or
provided CSV).

● Boston Housing Dataset for regression tasks
(sklearn.datasets.load_boston() or provided CSV). Note: load_boston() was removed in scikit-learn 1.2, so Q8 below uses the California Housing dataset instead.

**Q6. Write a Python program to:**

● **Load the Iris Dataset**

● **Train a Decision Tree Classifier using the Gini criterion**

● **Print the model’s accuracy and feature importances**

In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris data: 150 samples, 4 features, 3 classes
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree using the Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Impurity-based importances, normalized to sum to 1
print("\nFeature Importances:")
for name, importance in zip(feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.4f}")

Model Accuracy: 1.00

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772
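To see where these importances come from, the learned rules can be printed as text. The following is a sketch using `sklearn.tree.export_text` with the same split and settings as the cell above. It is an optional extra, not something the question asks for.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
clf = DecisionTreeClassifier(criterion="gini", random_state=42).fit(X_train, y_train)

# One line per node: internal nodes show "feature <= threshold",
# leaves show the predicted class index.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Features that never appear in a split condition receive an importance of 0.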


**Q7. Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.**

In [2]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

# Same 80/20 split as in Q6 so the results are comparable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Depth-limited tree (a form of pre-pruning)
tree_depth3 = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_depth3.fit(X_train, y_train)
y_pred_depth3 = tree_depth3.predict(X_test)
accuracy_depth3 = accuracy_score(y_test, y_pred_depth3)

# Fully grown tree: no depth limit, grows until leaves are pure or cannot be split
tree_full = DecisionTreeClassifier(random_state=42)
tree_full.fit(X_train, y_train)
y_pred_full = tree_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

print(f"Accuracy with max_depth=3: {accuracy_depth3:.2f}")
print(f"Accuracy with fully-grown tree: {accuracy_full:.2f}")

Accuracy with max_depth=3: 1.00
Accuracy with fully-grown tree: 1.00
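Both trees score 1.00 here, but the test split holds only 30 samples (20% of 150), so this single split cannot separate the two models. One possible follow-up, which goes beyond what the question asks, is to compare them with 5-fold cross-validation over the whole dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=42),
    "fully grown": DecisionTreeClassifier(random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # For classifiers, an integer cv uses stratified folds
    cv_scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean={cv_scores[name].mean():.3f}, std={cv_scores[name].std():.3f}")
```

Averaging over five folds gives a less noisy estimate than one 30-sample test set, although on a dataset as small and clean as Iris the two models may still be hard to tell apart.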


**Q8. Write a Python program to:
● Load the California Housing dataset from sklearn
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances.**

In [3]:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Downloads the data on first use and caches it locally
california = fetch_california_housing()
X = california.data
y = california.target  # median house value, in units of $100,000
feature_names = california.feature_names

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fully grown regression tree (no depth limit)
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

# MSE is in squared target units; take the square root for an error in $100,000s
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.4f}")

print("\nFeature Importances:")
for name, importance in zip(feature_names, regressor.feature_importances_):
    print(f"{name}: {importance:.4f}")

Mean Squared Error (MSE): 0.4952

Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


**Q9. Write a Python program to:**

● **Load the Iris Dataset**

● **Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV**

● **Print the best parameters and the resulting model accuracy**

In [4]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

# Keep a test set aside; the grid search only ever sees the training split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5 x 4 = 20 parameter combinations
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5]
}

dt = DecisionTreeClassifier(random_state=42)

# Each combination is scored with 5-fold cross-validation on the training split
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# best_estimator_ is refit on the whole training split with the best parameters
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid_search.best_params_)
print(f"Test Set Accuracy: {accuracy:.2f}")

Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Test Set Accuracy: 1.00
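The search reports a single winner, but several parameter combinations may score almost identically. A small sketch for inspecting the full ranking through `cv_results_` follows, repeating the setup so that it runs on its own:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {"max_depth": [2, 3, 4, 5, None], "min_samples_split": [2, 3, 4, 5]}
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring="accuracy"
)
grid_search.fit(X_train, y_train)

print(f"Best mean CV accuracy: {grid_search.best_score_:.3f}")

# Pair each combination with its rank and mean cross-validated score
results = grid_search.cv_results_
ranked = sorted(
    zip(results["rank_test_score"], results["mean_test_score"], results["params"]),
    key=lambda row: row[0],
)
for rank, mean_score, params in ranked[:5]:
    print(rank, f"{mean_score:.3f}", params)
```

When several rows share rank 1, a reasonable tie-breaker is to prefer the simplest tree among them.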


**Q10. Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.**

**Explain the step-by-step process you would follow to:**

● **Handle the missing values**

● **Encode the categorical features**

● **Train a Decision Tree model**

● **Tune its hyperparameters**

● **Evaluate its performance**

**And describe what business value this model could provide in the real-world setting.**

Answer:

1. Handle Missing Values

Why: Missing data can reduce model performance or cause errors during training.
Steps:

- Numerical features: Use mean, median, or model-based imputation

In [None]:
from sklearn.impute import SimpleImputer

# X_num is assumed to hold only the numerical columns
num_imputer = SimpleImputer(strategy='median')  # the median is robust to outliers
X_num = num_imputer.fit_transform(X_num)


- Categorical features: Use mode or a constant value like "Missing"

In [None]:
# X_cat is assumed to hold only the categorical columns
cat_imputer = SimpleImputer(strategy='most_frequent')  # fill with the mode
X_cat = cat_imputer.fit_transform(X_cat)


2. Encode Categorical Features
Why: Decision Trees in scikit-learn require numerical inputs.
Steps:
- Use One-Hot Encoding for nominal categories

In [None]:
from sklearn.preprocessing import OneHotEncoder

# scikit-learn >= 1.2 names this parameter sparse_output; older releases call it sparse
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X_cat_encoded = encoder.fit_transform(X_cat)


- Combine with numerical data

In [None]:
import numpy as np
X_final = np.hstack((X_num, X_cat_encoded))


3. Train a Decision Tree Model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split first so the model never sees the test rows during training
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)


4. Tune Hyperparameters
Use GridSearchCV to tune key parameters:
Common Hyperparameters:
- max_depth
- min_samples_split
- min_samples_leaf
- criterion (gini or entropy)



In [None]:
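from sklearn.model_selection import GridSearchCV

# Candidate values for the hyperparameters listed above
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

# 5-fold cross-validation on the training data only; for a disease screen,
# scoring='recall' or 'f1' may be a better target than plain accuracy
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best configuration, refit on the full training set
best_model = grid_search.best_estimator_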


5. Evaluate Model Performance
Use multiple metrics:

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_pred = best_model.predict(X_test)

# Precision, recall and F1 use their binary defaults here, which assumes
# the disease class is encoded as 1 and the healthy class as 0
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
# Rows are true classes, columns are predicted classes
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
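The five steps can also be combined into one scikit-learn `Pipeline`, so that imputation and encoding are fit on the training folds only and no statistics leak from held-out rows. The sketch below runs on a small synthetic table. The column names, the data, the label rule, and the choice of `scoring="recall"` (on the assumption that a missed case costs more than a false alarm) are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the patient table (all columns and values are made up)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
y = ((df["age"] > 50) & (df["smoker"] == "yes")).astype(int)  # toy label rule
df.loc[rng.choice(n, 20, replace=False), "age"] = np.nan       # inject missing values
df.loc[rng.choice(n, 20, replace=False), "smoker"] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Steps 1 and 2: impute and encode, separately per column type
numeric = Pipeline([("impute", SimpleImputer(strategy="median"))])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "blood_pressure"]),
    ("cat", categorical, ["smoker"]),
])

# Step 3: preprocessing and the tree form a single estimator
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Step 4: the whole pipeline is refit inside every CV fold
search = GridSearchCV(
    model,
    {"tree__max_depth": [3, 5, None], "tree__min_samples_leaf": [1, 5]},
    cv=5,
    scoring="recall",
)
search.fit(X_train, y_train)

# Step 5: evaluate once on the untouched test set
print(search.best_params_)
print(classification_report(y_test, search.predict(X_test)))
```

The fitted `search` object applies the same imputation and encoding to new patients automatically when `predict` is called, so preprocessing cannot drift between training and deployment.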


Business Value:

| Value Area                 | Description                                                          |
| -------------------------- | -------------------------------------------------------------------- |
| **Early Detection**        | Helps detect diseases at early stages, improving treatment outcomes. |
| **Decision Support**       | Assists doctors with evidence-based risk assessment.                 |
| **Resource Optimization**  | Prioritizes testing or referrals for high-risk patients.             |
| **Cost Reduction**         | Avoids unnecessary procedures for low-risk cases.                    |
| **Compliance & Reporting** | Provides auditable decision logic for regulatory compliance.         |
| **Scalability**            | Automates screening in large populations with minimal manual effort. |
