**Question1:**  What is a Decision Tree, and how does it work in the context of
classification?

 -  A Decision Tree is a supervised machine learning algorithm used for both classification and regression, but it’s most commonly used for classification tasks. It has a flowchart-like structure in which:

 - Each internal node represents a test on a feature (e.g., “Age > 30?”).

 - Each branch represents an outcome of the test (Yes/No).

 - Each leaf node represents a final class label (e.g., “Approved” or “Rejected”).

 - The algorithm repeatedly splits the dataset into smaller groups based on these conditions, until each group is as pure (dominated by a single class) as possible.

 **How it works**

 **1.** Select the best feature to split the data — using criteria like:

 - Gini Index

 - Entropy / Information Gain

**2.** Split the dataset based on that feature’s values.

**3.** Repeat the process recursively for each subset.

**4.** Stop when:

 - All samples in a node belong to one class, or

 - No further improvement can be made.
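
The four steps above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements trees: it assumes numeric features, threshold splits of the form `feature <= t`, and Gini impurity. The helper names (`best_split`, `build_tree`, `predict`) and the toy loan data are made up for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Step 1: return the (feature, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)  # a split must beat the parent node
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # not a real split
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels):
    """Steps 2-4: split, recurse, and stop when no split improves impurity."""
    split = best_split(rows, labels)
    if split is None:  # pure node, or no further improvement
        return Counter(labels).most_common(1)[0][0]  # leaf holds the majority class
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left]),
        "right": build_tree([r for r, _ in right], [y for _, y in right]),
    }

def predict(tree, row):
    """Follow the tests from the root until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Toy data (made up): [age, income in thousands] -> loan decision
rows = [[25, 30], [45, 80], [35, 60], [50, 20], [23, 90], [40, 40]]
labels = ["Rejected", "Approved", "Approved", "Rejected", "Approved", "Rejected"]

tree = build_tree(rows, labels)
print(tree)  # {'feature': 1, 'threshold': 40, 'left': 'Rejected', 'right': 'Approved'}
print(predict(tree, [30, 70]))  # Approved
```

On this toy data a single test on income separates the classes, so the recursion stops after one split because both child nodes are pure.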

 ---


**Question2:** Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

-- **Gini Impurity**

 - Measures how impure a node is.

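 - Formula: $\text{Gini} = 1 - \sum_{i} p_i^2$, where $p_i$ is the fraction of samples in the node that belong to class $i$.

 - Gini = 0 → pure node; Gini = 0.5 → evenly mixed node with two classes (for $k$ classes the maximum is $1 - 1/k$).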

 - Lower Gini means better split.

-- **Entropy**

 - Measures uncertainty in a node.

 - Formula: $\text{Entropy} = -\sum_{i} p_i \log_2 p_i$

 - Entropy = 0 → pure; Entropy = 1 → most impure for two classes (for $k$ classes the maximum is $\log_2 k$).

 - The chosen split is the one that gives the highest information gain (the largest reduction in entropy).

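-- **Impact on Split**

 - At each node, the Decision Tree evaluates candidate splits on every feature.

 - It chooses the split whose child nodes have the lowest weighted impurity (measured with Gini or Entropy).

 - Gini → slightly cheaper to compute (no logarithm); it is the default criterion in sklearn.

 - Entropy → the basis of Information Gain; in practice both criteria usually produce similar trees.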
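
Both measures are easy to compute by hand. The snippet below is a small sketch, not library code, that evaluates them on three made-up label lists; the function names are chosen for this example only.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Fraction of samples belonging to each class present in the node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p ** 2 for p in class_probabilities(labels))

def entropy(labels):
    """Entropy in bits: sum of -p * log2(p) over the classes present."""
    return sum(-p * math.log2(p) for p in class_probabilities(labels))

nodes = {
    "pure": ["A", "A", "A", "A"],
    "two classes, 50/50": ["A", "A", "B", "B"],
    "three classes, equal": ["A", "B", "C"],
}
for name, labels in nodes.items():
    print(name, round(gini(labels), 3), round(entropy(labels), 3))
# pure                  -> Gini 0.0,   Entropy 0.0
# two classes, 50/50    -> Gini 0.5,   Entropy 1.0
# three classes, equal  -> Gini 0.667, Entropy 1.585
```

The third case shows why 0.5 and 1 are only the two-class maxima: with three equally likely classes both measures go higher.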

 ---

**Question3:** What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

| Feature          | **Pre-Pruning (Early Stopping)**                                             | **Post-Pruning (e.g., Reduced Error Pruning)**                                         |
| ---------------- | ---------------------------------------------------------------------------- | -------------------------------------------------------------------------------------- |
| **When applied** | During tree building                                                         | After the full tree is built                                                           |
| **How it works** | Stops growing the tree early using conditions (e.g., max depth, min samples) | Grows the full tree, then removes branches that do not improve validation performance  |
| **Goal**         | Prevent overfitting early                                                    | Simplify an overly complex tree                                                        |
| **Computation**  | Faster (less training time)                                                  | Slower (needs the full tree first)                                                     |
| **Advantage**    | Saves training time and limits overfitting                                   | Often gives a simpler model that generalizes better                                    |


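**Pre-Pruning Advantage:** Training is faster because the tree stops growing as soon as a limit (such as maximum depth) is reached.

**Post-Pruning Advantage:** Pruning decisions are made after the full tree has been grown, so branches are removed only when they do not help performance, which often yields a simpler model that generalizes better.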
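
The contrast can be sketched with scikit-learn on the Iris data. Two assumptions to flag: scikit-learn does not provide Reduced Error Pruning, so post-pruning is shown here with its minimal cost-complexity pruning (`ccp_alpha`) instead, and the `ccp_alpha` value is picked arbitrarily from the middle of the pruning path rather than tuned by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Baseline: fully grown tree, no pruning of any kind
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning: growth stops as soon as a limit is hit
pre = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: compute the cost-complexity path of the full tree,
# then refit with one alpha from it (arbitrary choice for illustration)
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name}: nodes={model.tree_.node_count}, "
          f"depth={model.get_depth()}, test accuracy={model.score(X_test, y_test):.3f}")
```

In a real project the value of `ccp_alpha` would normally be selected with cross-validation, for example by passing the values in `path.ccp_alphas` to `GridSearchCV`.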

---


**Question4:** What is Information Gain in Decision Trees, and why is it important for choosing the best split?

 -  Information Gain is a measure used in Decision Trees to determine which feature provides the most useful information for classifying data. It is based on the concept of Entropy, which measures the level of impurity or disorder in a dataset.
 -  When a dataset is split based on a feature, the Information Gain tells us how much entropy decreases as a result of that split — in other words, how much more organized or pure the data becomes. A higher Information Gain means that the feature helps to make the data more homogeneous (pure), and thus, it is considered a better feature for splitting.
 - So, in Decision Trees, at each node, the algorithm selects the feature with the highest Information Gain to make the best possible split and build an effective model.
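
As a worked illustration, Information Gain is the parent node's entropy minus the size-weighted average entropy of its child nodes. The sketch below compares two hypothetical splits of the same ten labels; the data and function names are made up for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["Yes"] * 5 + ["No"] * 5

# Split A separates the classes perfectly
split_a = [["Yes"] * 5, ["No"] * 5]
# Split B leaves both children almost as mixed as the parent
split_b = [["Yes"] * 3 + ["No"] * 2, ["Yes"] * 2 + ["No"] * 3]

ig_a = information_gain(parent, split_a)
ig_b = information_gain(parent, split_b)
print(f"Split A: {ig_a:.3f}")  # 1.000
print(f"Split B: {ig_b:.3f}")  # 0.029
```

The tree would pick split A, because it removes all of the parent's uncertainty while split B removes almost none.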

 ---

**Question5:** What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

 --  **Real-World Applications**

 - **Banking:** Loan approval or credit risk prediction.

 - **Healthcare:** Disease diagnosis based on symptoms.

 - **Marketing:** Predicting customer churn or purchase behavior.

 - **Finance:** Fraud detection.

 - **Education:** Predicting student performance.

 -- **Advantages**

 - Easy to understand and visualize.

 - Works with both numerical and categorical data (in sklearn, categorical features must first be encoded as numbers).

 - No need for feature scaling.

 - Can handle non-linear relationships.

 -- **Limitations**

 - Prone to overfitting if not pruned.

 - Unstable: small changes in the data can produce a very different tree.

 - Biased towards features with more categories, because they offer more candidate splits.

 - Often less accurate than ensemble methods (like Random Forest).

 ---


**Question6:** Write a Python program to:

 - Load the Iris Dataset

 - Train a Decision Tree Classifier using the Gini criterion

 -  Print the model’s accuracy and feature importances



```python
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree Classifier using Gini criterion
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Print accuracy
print("Model Accuracy:", accuracy_score(y_test, y_pred))

# Print feature importances
print("Feature Importances:")
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

```text
Model Accuracy: 1.0
Feature Importances:
sepal length (cm): 0.000
sepal width (cm): 0.019
petal length (cm): 0.893
petal width (cm): 0.088
```


**Question7:** Write a Python program to:
 - Load the Iris Dataset
 -  Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

```python
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree with max_depth = 3
model_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
model_limited.fit(X_train, y_train)
y_pred_limited = model_limited.predict(X_test)
acc_limited = accuracy_score(y_test, y_pred_limited)

# Train fully-grown Decision Tree (no depth limit)
model_full = DecisionTreeClassifier(random_state=42)
model_full.fit(X_train, y_train)
y_pred_full = model_full.predict(X_test)
acc_full = accuracy_score(y_test, y_pred_full)

# Print accuracies
print("Accuracy (max_depth=3):", acc_limited)
print("Accuracy (fully-grown tree):", acc_full)
```

```text
Accuracy (max_depth=3): 1.0
Accuracy (fully-grown tree): 1.0
```

Both trees reach the same accuracy on this test split, so limiting the depth to 3 costs no accuracy here.


**Question8:** Write a Python program to:

 - Load the Boston Housing Dataset
 - Train a Decision Tree Regressor
 - Print the Mean Squared Error (MSE) and feature importances


Note: `load_boston` was removed from scikit-learn in version 1.2, so this solution uses the California Housing dataset as a replacement.

```python
# Import libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset (replacement for Boston)
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree Regressor
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# Print feature importances
print("\nFeature Importances:")
for name, importance in zip(housing.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

```text
Mean Squared Error (MSE): 0.5280096503174904

Feature Importances:
MedInc: 0.523
HouseAge: 0.052
AveRooms: 0.049
AveBedrms: 0.025
Population: 0.032
AveOccup: 0.139
Latitude: 0.090
Longitude: 0.089
```


**Question9:** Write a Python program to:
 - Load the Iris Dataset
 - Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
 - Print the best parameters and the resulting model accuracy

```python
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the model
model = DecisionTreeClassifier(random_state=42)

# Define parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5, 6]
}

# Use GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best parameters and best model
print("Best Parameters:", grid_search.best_params_)

# Predict using best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy with Best Parameters:", accuracy)
```

```text
Best Parameters: {'max_depth': 4, 'min_samples_split': 6}
Model Accuracy with Best Parameters: 1.0
```


**Question10:** Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.

Explain the step-by-step process you would follow to:
 -  Handle the missing values
 -  Encode the categorical features
 - Train a Decision Tree model
 - Tune its hyperparameters
 - Evaluate its performance

And describe what business value this model could provide in the real-world setting.

 -- **1. Handle Missing Values**

 - Check which features have missing data.

 - For numerical columns, fill missing values with median or mean.

 - For categorical columns, fill with most frequent value or a new category like "Unknown".

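 - Use SimpleImputer from sklearn to handle this automatically. Fit the imputer on the training data only (for example inside a Pipeline) so that test-set statistics do not leak into training.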

 -- **2. Encode Categorical Features**

 - Convert categorical data into numbers so the model can understand.

 - Use One-Hot Encoding for nominal features and Ordinal Encoding if categories have an order.

 - In sklearn, OneHotEncoder and OrdinalEncoder do the encoding, and ColumnTransformer applies each one to the right columns.

 -- **3. Train the Decision Tree**

 - Split the dataset into training and testing sets using train_test_split().

 - Train a DecisionTreeClassifier (e.g., criterion='gini' or 'entropy').

 - Fit the model on training data and test on unseen data to check basic accuracy.

 -- **4. Tune Hyperparameters**

 - Use GridSearchCV or RandomizedSearchCV to find best values for parameters like:

   - max_depth

   - min_samples_split

   - min_samples_leaf

 - This helps to prevent overfitting and improve model generalization.

 -- **5. Evaluate Model Performance**

 - Use metrics like accuracy, precision, recall, F1-score, and ROC-AUC.

 - For medical predictions, recall (sensitivity) is often most important — we don’t want to miss patients who actually have the disease.
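
Steps 1 to 5 can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption made for illustration: the column names, the synthetic values, and the toy target are invented, and the parameter grid is only an example. Recall is used as the tuning metric, in line with the point above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (all columns and values are made up)
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east"], n),
})
# Toy target loosely tied to age and smoking, plus random noise
risk = (df["age"] > 55).astype(int) + (df["smoker"] == "yes").astype(int)
y = ((risk + rng.integers(0, 2, n)) >= 2).astype(int)
# Remove roughly 10% of the values in two columns to simulate missing data
df.loc[rng.random(n) < 0.1, "blood_pressure"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker", "region"]

# Steps 1 and 2: impute, then one-hot encode the categorical columns
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: the tree sits at the end of the pipeline, so preprocessing
# is fitted on training folds only
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=42
)

# Step 4: tune the tree's hyperparameters, optimizing recall
param_grid = {
    "tree__max_depth": [2, 3, 4, 5],
    "tree__min_samples_split": [2, 10, 20],
    "tree__min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set
y_pred = search.predict(X_test)
print("Best parameters:", search.best_params_)
print("Test recall:", recall_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

Keeping the imputers and the encoder inside the pipeline means each cross-validation fold learns its fill values and categories from its own training portion, which avoids leaking information from the validation data.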

 -- **6. Business Value**

 - Helps in early disease detection, improving patient outcomes.

 - Supports doctors in making faster and data-driven decisions.

 - Saves time and healthcare costs by identifying high-risk patients early.

 - A well-tuned Decision Tree is interpretable, so medical professionals can trust and understand the model’s reasoning.

 ---