***Decision Tree | Assignment***

# Question 1:  What is a Decision Tree, and how does it work in the context of classification? 
Answer:  
- Definition:
A Decision Tree is a machine learning algorithm that is used for classification and regression tasks. In classification, it is used to predict the class (category) of a given input based on its features.

- It works like a flowchart:

Each internal node represents a condition on a feature (example: "Is age > 18?").

Each branch represents the outcome of that condition (Yes/No).

Each leaf node gives the final decision or class label.

- Working in classification:

At each node, the algorithm chooses the feature (and split condition) that best separates the classes, using a measure like Gini Index or Information Gain, and divides the data into smaller subsets.

This splitting continues until every node contains a single class or some stopping condition (for example, a maximum depth) is reached.

To classify a new example, the tree is followed from the root node to a leaf node by answering the conditions step by step.

The class label of the reached leaf node is given as the prediction.
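
The root-to-leaf walk described above can be sketched in a few lines of Python. This is only a minimal illustration: the tree is written by hand as nested dictionaries, and the names `toy_tree`, `predict_one`, and the features `age` and `income` are made up for the example (a real library learns the tree from data).

```python
# Hypothetical hand-built tree: internal nodes hold a feature name and a
# threshold, leaf nodes hold a class label. All names are illustrative.
toy_tree = {
    "feature": "age",
    "threshold": 18,
    "left": {"label": "minor"},  # branch taken when age <= 18
    "right": {                   # branch taken when age > 18
        "feature": "income",
        "threshold": 30000,
        "left": {"label": "adult_low_income"},
        "right": {"label": "adult_high_income"},
    },
}


def predict_one(node, sample):
    """Follow the conditions from the root until a leaf is reached."""
    while "label" not in node:
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]


print(predict_one(toy_tree, {"age": 15, "income": 0}))
print(predict_one(toy_tree, {"age": 40, "income": 52000}))
```

The first sample stops at the first leaf, while the second one answers two conditions before reaching its label.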

# Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree? 
Answer:
In a Decision Tree, we need to decide where to split the data. To do this, we measure how “impure” or “mixed” the classes are in a node. Two common impurity measures are Gini Impurity and Entropy.

- Gini Impurity

Formula:

$$\text{Gini} = 1 - \sum_{i} p_i^2$$

where $p_i$ is the probability of class $i$ in the node.

Gini tells us how often a randomly chosen element would be misclassified if we label it randomly according to the class distribution.

Range: 0 (pure, only one class) to 0.5 (for 2 classes, completely mixed).

Decision Trees (like CART) often use Gini because it needs no logarithm and is therefore slightly faster to calculate.
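
As a small worked example, the Gini formula can be computed directly from a list of class labels. The function name `gini_impurity` is just an illustrative helper, not part of any library.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the classes present in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


print(gini_impurity(["a", "a", "a", "a"]))  # pure node
print(gini_impurity(["a", "a", "b", "b"]))  # two classes, evenly mixed
```

The pure node gives 0 and the evenly mixed two-class node gives 0.5, which matches the range stated above.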

- Entropy (Information Gain)

Formula:

$$\text{Entropy} = -\sum_{i} p_i \log_2(p_i)$$

Entropy measures the uncertainty in the node.

Range: 0 (pure) to 1 (for 2 classes equally mixed).

When splitting, Decision Trees calculate Information Gain, which is the reduction in entropy after the split.
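
The entropy formula can be checked the same way on a list of labels. This is a minimal sketch, and `entropy` is an illustrative helper name.

```python
import math
from collections import Counter


def entropy(labels):
    """Return -sum(p_i * log2(p_i)) over the classes present in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return sum(-(count / n) * math.log2(count / n) for count in counts.values())


pure = entropy(["yes", "yes", "yes", "yes"])
mixed = entropy(["yes", "yes", "no", "no"])
print(pure, mixed)
```

A node with one class has zero entropy, and a two-class node split 50/50 has entropy 1.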

- Impact on splits:

Both Gini and Entropy try to create “pure” child nodes (nodes with mostly one class).

Gini tends to make splits that isolate the most frequent class quickly.

Entropy is more sensitive to class distribution and can give slightly different splits.

In practice, both often give similar results, but Gini is computationally simpler.

# Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each. 
Answer: 
| **Aspect**       | **Pre-Pruning (Early Stopping)**                      | **Post-Pruning**                                    |
| ---------------- | ----------------------------------------------------- | --------------------------------------------------- |
| **When applied** | During tree building (stops splitting early)          | After the tree is fully grown                       |
| **How it works** | Uses conditions like max depth, min samples, min gain | Removes branches that do not improve accuracy       |
| **Goal**         | Prevent the tree from becoming too complex            | Simplify a fully grown tree without losing accuracy |
| **Advantage**    | Saves time & computation, avoids very large trees     | Usually gives better accuracy and generalization    |
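
Both approaches can be tried in scikit-learn. The sketch below is only an illustration on the Iris data: pre-pruning uses `max_depth` and `min_samples_leaf`, and post-pruning uses cost-complexity pruning through `ccp_alpha`. Picking the alpha from the middle of the pruning path is an arbitrary choice for the example; in practice it would be selected with cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fully grown tree, used as the reference.
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)

# Pre-pruning: limits are set before training, so growth stops early.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
)
pre_pruned.fit(X_train, y_train)

# Post-pruning: compute the candidate alphas of the full tree, then refit
# with one of them so that weak branches are cut back.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)
# Arbitrary choice for illustration: take the middle alpha of the path.
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, model in [
    ("full", full_tree),
    ("pre-pruned", pre_pruned),
    ("post-pruned", post_pruned),
]:
    print(name, "leaves:", model.get_n_leaves(),
          "test accuracy:", model.score(X_test, y_test))
```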


# Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split? 
Answer:
Information Gain (IG):

Information Gain measures the reduction in uncertainty (entropy) after splitting the data on an attribute.

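Formula:

$$IG(S, A) = \text{Entropy}(S) - \sum_{v} \frac{|S_v|}{|S|} \times \text{Entropy}(S_v)$$

where $S$ is the dataset and $S_v$ is the subset of $S$ obtained after splitting on attribute $A$ (one subset for each value $v$ of $A$).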

- Meaning:

High IG means the split gives more “pure” subsets (less mixed classes).

Low IG means the split does not help much in separating classes.

- Importance in Decision Trees:

Decision Trees use IG to choose the best attribute for splitting at each node.

The attribute with the highest Information Gain is selected because it reduces uncertainty the most.

This helps the tree make more accurate decisions and avoid useless splits.
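
A small worked example makes the difference between a high-IG and a low-IG split concrete. The helper names `entropy` and `information_gain` and the toy labels are assumptions made for this sketch.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return sum(-(count / n) * math.log2(count / n) for count in counts.values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# Split that separates the classes completely.
perfect_split = [["yes"] * 5, ["no"] * 5]
# Split whose children are as mixed as the parent.
useless_split = [["yes", "yes", "no", "no"],
                 ["yes", "yes", "yes", "no", "no", "no"]]

ig_perfect = information_gain(parent, perfect_split)
ig_useless = information_gain(parent, useless_split)
print(ig_perfect, ig_useless)
```

The first split removes all uncertainty (IG of 1), while the second leaves the children just as mixed as the parent (IG of 0), so a tree would prefer the first one.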

# Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations? 
Answer:
1. Real-world applications of Decision Trees:

| **Application**             | **Example**                                       |
| --------------------------- | ------------------------------------------------- |
| Medical Diagnosis           | Predicting if a patient has a disease             |
| Credit Risk / Loan Approval | Approving or rejecting loan applications          |
| Customer Churn Prediction   | Predicting if a customer will leave a service     |
| Spam Email Detection        | Classifying emails as spam or not spam            |
| Marketing & Sales           | Deciding which customers to target for a campaign |

2. Main Advantages:

Easy to understand and interpret (like a flowchart)

Can handle both numerical and categorical data

No need for much data preprocessing

Can reveal important features automatically
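
The interpretability advantage can be seen by printing a trained tree as text rules. This is a minimal sketch using scikit-learn's `export_text` on the Iris data; the shallow `max_depth=2` is chosen here only to keep the printed rules short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules easy to read.
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# Each indented line is a condition, each "class:" line is a leaf.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)

# Importance scores show which features the tree actually used.
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.4f}")
```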

3. Main Limitations:

Can overfit if the tree is too deep

Sensitive to small changes in data

Not always the most accurate compared to ensemble methods (like Random Forest or XGBoost)

Can be biased if some classes dominate

In [2]:
# Question 6: Write a Python program to: ● Load the Iris Dataset ● Train a Decision Tree Classifier using the Gini criterion ● Print the model’s accuracy and feature importances
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data   # Features
y = iris.target # Labels

# 2. Split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Train Decision Tree Classifier with Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# 4. Predictions
y_pred = clf.predict(X_test)

# 5. Model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# 6. Feature importances
print("\nFeature Importances:")
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.4f}")


Model Accuracy: 1.0

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876


In [4]:
# Question 7: Write a Python program to: ● Load the Iris Dataset ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Train a Decision Tree with max_depth=3
tree_limited = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree_limited.fit(X_train, y_train)
y_pred_limited = tree_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# 4. Train a fully-grown Decision Tree (no max_depth)
tree_full = DecisionTreeClassifier(criterion="gini", random_state=42)
tree_full.fit(X_train, y_train)
y_pred_full = tree_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# 5. Print accuracies
print("Accuracy of Decision Tree with max_depth=3:", accuracy_limited)
print("Accuracy of fully-grown Decision Tree:", accuracy_full)
 

Accuracy of Decision Tree with max_depth=3: 1.0
Accuracy of fully-grown Decision Tree: 1.0


In [8]:
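# Question 8: Write a Python program to: ● Load the Boston Housing Dataset ● Train a Decision Tree Regressor ● Print the Mean Squared Error (MSE) and feature importances
# Note: load_boston was removed from scikit-learn in version 1.2, so this solution
# uses the California Housing dataset instead.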

# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# 1. Load California Housing dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target

# 2. Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Train Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# 4. Predictions
y_pred = regressor.predict(X_test)

# 5. Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# 6. Feature importances
print("\nFeature Importances:")
for name, importance in zip(housing.feature_names, regressor.feature_importances_):
    print(f"{name}: {importance:.4f}")



Mean Squared Error: 0.5280096503174904

Feature Importances:
MedInc: 0.5235
HouseAge: 0.0521
AveRooms: 0.0494
AveBedrms: 0.0250
Population: 0.0322
AveOccup: 0.1390
Latitude: 0.0900
Longitude: 0.0888


In [10]:
# Question 9: Write a Python program to: ● Load the Iris Dataset ● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV ● Print the best parameters and the resulting model accuracy 
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# 1. Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Define Decision Tree and parameter grid
dt = DecisionTreeClassifier(random_state=42)
param_grid = {
    'max_depth': [1, 2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5]
}

# 4. Apply GridSearchCV
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# 5. Best parameters
print("Best Parameters:", grid_search.best_params_)

# 6. Evaluate best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the tuned model:", accuracy)


Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
Accuracy of the tuned model: 1.0


# Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to: ● Handle the missing values ● Encode the categorical features ● Train a Decision Tree model ● Tune its hyperparameters ● Evaluate its performance. And describe what business value this model could provide in the real-world setting.
- Step 1: Handle Missing Values

Identify missing values in the dataset using .isnull() or .info().

Impute missing values:

For numerical features, use mean or median.

For categorical features, use mode (most frequent value) or a special category like “Unknown”.

Optional: Drop rows or columns if too many values are missing.
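
The imputation steps above might look like this on a tiny table. The patient records, column names, and values are invented for the sketch. In a real project the imputer should be fitted on the training split only, so that test data does not leak into the statistics.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical patient records (made-up columns and values).
df = pd.DataFrame({
    "age": [34, np.nan, 58, 45],
    "blood_pressure": [120.0, 135.0, np.nan, 128.0],
    "smoker": ["no", "yes", np.nan, "no"],
})

# Count missing values per column.
print(df.isnull().sum())

num_cols = ["age", "blood_pressure"]

# Numerical features: replace missing values with the column median.
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])

# Categorical feature: use a special "Unknown" category.
df["smoker"] = df["smoker"].fillna("Unknown")

print(df)
```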

- Step 2: Encode Categorical Features

Convert categorical variables into numbers because Decision Trees in most libraries require numeric input.

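Use Label (ordinal) Encoding if the feature is ordinal (has order). In scikit-learn, OrdinalEncoder is the class meant for input features, while LabelEncoder is meant for the target column.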

Use One-Hot Encoding if the feature is nominal (no order).

In [13]:
#example:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder


- Step 3: Train a Decision Tree Model

Split the dataset into train and test sets using train_test_split.

Train a Decision Tree Classifier on the training data:

In [16]:
# Example (X_train and y_train are assumed to come from the train_test_split step described above)
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)


- Step 4: Tune Hyperparameters

Use GridSearchCV or RandomizedSearchCV to find the best settings:

max_depth → controls tree depth

min_samples_split → min samples to split a node

min_samples_leaf → min samples at a leaf

criterion → “gini” or “entropy”

Tuning these settings helps limit overfitting and usually improves performance on unseen data.
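
Since GridSearchCV is already shown in Question 9, here is a minimal sketch of the RandomizedSearchCV option over the four settings listed above. The data comes from `make_classification` as a stand-in for a real patient dataset, and the candidate values and the `f1` scoring choice are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data used only as a placeholder for real patient data.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Candidate values are illustrative, not recommended defaults.
param_distributions = {
    "max_depth": [3, 5, 8, None],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5, 10],
    "criterion": ["gini", "entropy"],
}

# Try 20 random combinations with 5-fold cross-validation.
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    scoring="f1",
    random_state=42,
)
search.fit(X_train, y_train)

print("Best Parameters:", search.best_params_)
print("Best CV F1:", search.best_score_)
print("Test F1:", search.score(X_test, y_test))
```

Random search only tries a sample of the combinations, so it is usually cheaper than a full grid when there are many settings to tune.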

- Step 5: Evaluate Performance

Use the held-out test set to check accuracy.

Accuracy alone can be misleading when the disease is rare, so use additional metrics like:

Precision → correct positive predictions / all predicted positive

Recall → correct positive predictions / all actual positive

F1-score → balance of precision and recall

Confusion matrix → shows true/false positives/negatives
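
A small worked example shows how these metrics are calculated from confusion-matrix counts. The counts below are invented for illustration (1000 patients, 60 of them with the disease).

```python
# Hypothetical confusion-matrix counts (made-up numbers).
tp = 40   # sick patients correctly flagged
fp = 10   # healthy patients wrongly flagged
fn = 20   # sick patients that were missed
tn = 930  # healthy patients correctly cleared

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")
print(f"Accuracy:  {accuracy:.3f}")
```

With these counts the accuracy is 0.97 even though one third of the sick patients are missed (recall of about 0.667), which is why recall matters in a disease-detection setting.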

In [19]:
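# Example (the output below has three classes and 45 test samples, so it appears to come from the Iris split used in the earlier cells, not from healthcare data)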
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 1.0
[[19  0  0]
 [ 0 13  0]
 [ 0  0 13]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45



- Step 6: Business Value

Early disease detection → helps doctors take timely actions.

Resource optimization → prioritize patients at higher risk.

Reduce healthcare costs → prevent serious complications by early intervention.

Personalized care → target treatment based on predicted risk.

This model can save lives, improve patient care, and reduce costs by identifying high-risk patients before the disease progresses.