Question 1: What is a Decision Tree, and how does it work in the context of classification?
- What is a Decision Tree?
  - A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. In classification, it is used to predict categorical outcomes (e.g., "Yes"/"No", "Spam"/"Not Spam", "Disease"/"No Disease"). It resembles a tree-like structure where :
    - Internal nodes represent decision rules (e.g., Age < 30?)
    - Branches represent outcomes of these rules (e.g., Yes/No)
    - Leaf nodes represent final class labels
- How it Works in Classification :
  - Start at the Root Node : Begin with the entire dataset.
  - Choose the Best Feature to Split : Use a splitting criterion such as the Gini index, entropy-based Information Gain, or the Chi-Square statistic to find the feature (and threshold) that best separates the data into distinct classes.
  - Split the Dataset : Based on the selected feature, split the dataset into subsets.
  - Repeat Recursively : Apply the same process to each subset, building branches and further nodes until :
    - All data in a node belongs to one class, or
    - A maximum depth is reached, or
    - No further information gain can be achieved
  - Make Prediction : To classify a new input, traverse the tree from root to leaf, following decisions based on input features.
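
The prediction step above can be illustrated with a small hand-built tree. This is only a sketch: the features (`age`, `income`), thresholds, and labels are hypothetical and are not learned from any data.

```python
# A tiny hand-built tree stored as nested dicts (hypothetical rules, for illustration only).
# Internal nodes hold a feature, a threshold, and two branches; leaves hold a class label.
toy_tree = {
    "feature": "age",
    "threshold": 30,
    "yes": {"label": "No"},              # branch taken when age < 30
    "no": {
        "feature": "income",
        "threshold": 50000,
        "yes": {"label": "No"},          # branch taken when income < 50000
        "no": {"label": "Yes"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, answering one decision rule per internal node."""
    while "label" not in node:
        branch = "yes" if sample[node["feature"]] < node["threshold"] else "no"
        node = node[branch]
    return node["label"]


print(predict(toy_tree, {"age": 25, "income": 80000}))  # No
print(predict(toy_tree, {"age": 45, "income": 80000}))  # Yes
```

Each call follows exactly one root-to-leaf path, so the cost of a prediction depends on the depth of the tree rather than on the size of the training set.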

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
- What Are Impurity Measures?
  - In a Decision Tree, impurity measures help determine how “mixed” or impure a node is. A node is pure if all its samples belong to a single class.
  - The two most common impurity measures are :
    - Gini Impurity
    - Entropy (used with Information Gain)
  - These are used to evaluate the quality of a split. A split is good if it reduces impurity — i.e., results in more “pure” child nodes.
1. Gini Impurity
  - Definition :
    - $\text{Gini}(D) = 1 - \sum_{i=1}^{C} p_i^2$
      - Where :
        - $C$ = number of classes
        - $p_i$ = proportion of samples in class $i$
  - Interpretation : Ranges from 0 (pure) to (1 - 1/C)
     - If all records in a node belong to one class → Gini = 0 (ideal case)
     - The lower the Gini, the better the split
  - Example : If a node contains :
     - 60% Class A → $p_1 = 0.6$
     - 40% Class B → $p_2 = 0.4$
     - Gini $= 1 - (0.6^2 + 0.4^2) = 1 - (0.36 + 0.16) = 0.48$
2. Entropy (Information Gain)
  - Definition :
    - $\text{Entropy}(D) = -\sum_{i=1}^{C} p_i \log_2(p_i)$
    - Measures the amount of uncertainty in the dataset
    - Used to calculate Information Gain
    - Information Gain = Entropy (Parent Node) - Weighted Entropy (Children)
  - Interpretation :
    - Entropy is 0 when node is pure
    - Higher entropy means more disorder/mixed classes
  - Example : Same class proportions :
    - $p_1 = 0.6$, $p_2 = 0.4$
    - Entropy $= -(0.6 \log_2 0.6 + 0.4 \log_2 0.4) \approx 0.971$
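
The two worked examples can be checked with a few lines of Python. This is a minimal sketch; the helper names `gini` and `entropy` are introduced here for illustration and are not part of any library.

```python
import math


def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p ** 2 for p in proportions)


def entropy(proportions):
    """Entropy in bits; classes with zero proportion contribute nothing."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)


node = [0.6, 0.4]  # 60% Class A, 40% Class B
print(f"Gini    = {gini(node):.3f}")     # 0.480
print(f"Entropy = {entropy(node):.3f}")  # 0.971
```
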
- Impact on Decision Tree Splits
  - When building a tree : For each feature, the algorithm tries all possible split points
  - It calculates Gini or Entropy for the resulting child nodes
  - Best split = the one that minimizes the weighted Gini of the child nodes or, when entropy is used, maximizes Information Gain
  - This process repeats recursively to grow the tree
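
The split search described above can be sketched for a single numeric feature. The data and the helper names (`gini_from_labels`, `best_threshold`) are hypothetical, and real libraries use faster implementations; the sketch only shows the idea of scoring every candidate threshold by the weighted Gini of its children.

```python
from collections import Counter


def gini_from_labels(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Try the midpoint between consecutive distinct values and return
    (threshold, weighted_gini) for the split with the lowest weighted Gini."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal values cannot be separated by a threshold
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        weighted = (len(left) * gini_from_labels(left)
                    + len(right) * gini_from_labels(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical toy data: ages and whether the person bought a product.
ages = [22, 25, 28, 35, 40, 52]
bought = ["No", "No", "No", "Yes", "Yes", "Yes"]
print(best_threshold(ages, bought))  # (31.5, 0.0)
```

A full tree builder would run this search over every feature, keep the best feature and threshold pair, and then recurse on the two child subsets.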

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
- What is Pruning in Decision Trees? : Pruning is the process of reducing the size of a decision tree by removing unnecessary branches that may lead to overfitting.
- There are two main types of pruning:
1. Pre-Pruning (Early Stopping)
- Definition : Stops the tree from growing too deep during construction.
- How it works : The algorithm does not split a node if :
  - The number of instances is below a threshold (e.g., min_samples_split)
  - The depth exceeds a limit (max_depth)
  - The information gain is too small
- Practical Advantage :
  - Faster training and simpler trees
  - Useful when working with large datasets or limited computing resources.
2. Post-Pruning (e.g., Reduced Error Pruning)
- Definition : The tree is fully grown first, then unnecessary branches are pruned back.
- How it works : After building the tree:
  - Evaluate subtrees on a validation set
  - Replace a subtree with a leaf if doing so does not reduce prediction accuracy on the validation set
- Practical Advantage :
  - Better generalization and more accurate models
  - Useful when you want to minimize overfitting after tree construction.
- Pre-Pruning vs. Post-Pruning (Comparison)
1. When It Is Applied :
  - Pre-Pruning : Applied during the construction of the decision tree.
  - Post-Pruning : Applied after the full tree has been built.
2. Purpose :
  - Pre-Pruning : To stop the tree early to avoid overfitting.
  - Post-Pruning : To cut back the overfitted parts of a fully grown tree.
3. How It Works  :
  - Pre-Pruning : Uses conditions like max_depth, min_samples_split, or a minimum gain threshold (min_impurity_decrease in scikit-learn) to prevent further splits.
  - Post-Pruning: Evaluates each subtree using a validation set and prunes if it doesn't improve performance.
4. Risk of Underfitting :
  - Pre-Pruning : High risk — might underfit the data if stopped too early.
  - Post-Pruning : Low risk — starts with a complex tree, then simplifies only where needed.
5. Computation Time :
  - Pre-Pruning : Faster, as it avoids growing unnecessary branches.
  - Post-Pruning : Slower, as it builds a full tree and then evaluates multiple pruning options.
6. Model Complexity :
  - Pre-Pruning : Produces simpler and smaller trees.
  - Post-Pruning : Allows complex trees initially, but results in a balanced, optimized tree.
7. Control Parameters
  - Pre-Pruning : Controlled by hyperparameters like max_depth, min_samples_leaf, etc.
  - Post-Pruning : Controlled using pruning algorithms (like cost-complexity pruning) and a validation set.
8. Typical Use Case :
  - Pre-Pruning : Preferred when working with large datasets or time constraints.
  - Post-Pruning : Preferred when model accuracy and generalization are the top priorities.
9. Example in scikit-learn :
  - Pre-Pruning : Use parameters like DecisionTreeClassifier(max_depth=3)
  - Post-Pruning : Use ccp_alpha (Cost Complexity Pruning) for pruning after full growth.
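
As a rough illustration of post-pruning with `ccp_alpha`, the sketch below grows a full tree on Iris, asks scikit-learn for the candidate pruning strengths, and keeps the one that scores best on a held-out validation split. The 70/30 split and the tie-breaking rule (prefer the smaller tree) are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Grow the full tree and compute the candidate pruning strengths (alphas).
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit one tree per alpha and keep the one that scores best on the validation set.
# Alphas are visited in increasing order, so ties go to the smaller tree.
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values from rounding
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score >= best_score:
        best_alpha, best_score = alpha, score

pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha).fit(X_train, y_train)
print("Nodes in full tree  :", full_tree.tree_.node_count)
print("Nodes in pruned tree:", pruned_tree.tree_.node_count)
print(f"Validation accuracy : {best_score:.2f}")
```

In practice the alpha is often chosen with cross-validation (for example by putting `ccp_alpha` in a `GridSearchCV` grid) rather than with a single validation split.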

Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?
- What is Information Gain? : Information Gain (IG) is a metric used in Decision Trees to measure the effectiveness of an attribute (feature) in splitting the dataset into pure subsets (i.e., subsets where most or all instances belong to the same class). It is based on Entropy, which measures the impurity or disorder of the data.
- Formula for Information Gain :
  - $\text{Information Gain} = \text{Entropy}(\text{Parent}) - \sum_{i} \frac{n_i}{n} \, \text{Entropy}(\text{Child}_i)$
    - Where :
      - $n$ = total samples in parent node
      - $n_i$ = samples in child node $i$
- Entropy is calculated using the class distribution in each node.
- Why Is It Important? : Information Gain tells us how much “information” a feature gives us about the class. A higher Information Gain means that the split on that feature reduces uncertainty the most.
  - In simple terms :
    - Higher IG = Better split = More pure subsets = Better tree decisions
  - How It Helps in Choosing the Best Split :
    - For each feature : Try all possible splits (like thresholds for numerical data or categories for categorical data). Calculate the Information Gain from each split. Choose the feature and threshold that result in the highest Information Gain.
  - This helps the tree pick the most informative feature first, which tends to produce purer nodes in fewer splits.
- Example : Suppose you're classifying "Play Tennis" based on "Outlook":
  - Entropy before split = 0.94
  - After splitting on "Outlook", weighted entropy = 0.69
  - Information Gain = 0.94−0.69=0.25
  - So, splitting on "Outlook" gives an Information Gain of 0.25, which means this feature reduces uncertainty significantly.
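
The numbers in this example can be reproduced under one assumption: that they come from the classic 14-row Play Tennis table (9 "Yes", 5 "No"), with Outlook splitting it into Sunny (2 Yes, 3 No), Overcast (4 Yes, 0 No), and Rain (3 Yes, 2 No). Those counts are not stated in the text above, so treat them as illustrative.

```python
import math


def entropy(counts):
    """Entropy (in bits) of a node given its per-class sample counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Assumed (Yes, No) counts from the classic Play Tennis table.
parent = (9, 5)
children = {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)}

n = sum(parent)
# Each child's entropy is weighted by its share of the parent's samples.
weighted = sum(sum(c) / n * entropy(c) for c in children.values())
info_gain = entropy(parent) - weighted

print(f"Entropy before split : {entropy(parent):.2f}")  # 0.94
print(f"Weighted entropy     : {weighted:.2f}")         # 0.69
print(f"Information Gain     : {info_gain:.2f}")        # 0.25
```
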

Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?
- Real-World Applications of Decision Trees
  - Medical Diagnosis : Predicting whether a patient has a disease based on symptoms, age, test results, etc. Example: Diagnosing diabetes or heart disease.
  - Loan Approval in Banking : Classifying loan applications as “approved” or “rejected” based on income, credit score, and history.
  - Customer Churn Prediction : Identifying which customers are likely to leave a service based on usage, feedback, and support interactions.
  - Fraud Detection : Detecting fraudulent transactions using features like transaction amount, location, frequency, etc.
  - Retail (Product Recommendation & Inventory Management) : Classifying customers based on buying patterns to suggest products or manage stock.
  - Manufacturing : Predicting equipment failure or quality control based on sensor data or machine logs.
  - Education : Predicting student performance or drop-out risk using attendance, grades, and engagement metrics.
- Advantages of Decision Trees
  - Easy to Understand and Interpret : Resembles human decision-making; can be visualized and explained easily.
  - Handles Both Numerical and Categorical Data : Works well with mixed-type features without needing complex preprocessing.
  - No Need for Feature Scaling : Unlike SVM or KNN, normalization or standardization is not required.
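  - Can Handle Missing Values : Some implementations can manage missing data during training (e.g., surrogate splits in CART, or scikit-learn's tree estimators from version 1.3 onward); with others, missing values must be imputed first.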
  - Works Well for Small to Medium Datasets : Especially when data has clear patterns or rules.
- Limitations of Decision Trees
  - Prone to Overfitting : Without pruning, decision trees can become too complex and memorize training data.
  - Unstable to Small Changes : Small variations in data can lead to completely different trees (high variance).
  - Biased Toward Features with More Levels : Features with many unique values may dominate splits.
  - Limited Predictive Power Alone : Often less accurate than ensemble methods like Random Forest or Gradient Boosting.

Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.
- You’re a data scientist for a healthcare company using a dataset with mixed data types (numerical + categorical) and missing values to predict disease presence. Here's how you'd proceed:
- Step 1 : Handle the Missing Values
  - Understand the Missingness
    - Identify how much data is missing and why (MCAR, MAR, MNAR).
    - Use .isnull().sum() in pandas to check missing counts.
  - Numerical Features
    - Use mean/median imputation:
      - from sklearn.impute import SimpleImputer
      - num_imputer = SimpleImputer(strategy='median')
      - df[numerical_cols] = num_imputer.fit_transform(df[numerical_cols])
  - Categorical Features
    - Use most frequent (mode) or a special category like "Unknown":
      - cat_imputer = SimpleImputer(strategy='most_frequent')
      - df[categorical_cols] = cat_imputer.fit_transform(df[categorical_cols])
- Step 2 : Encode the Categorical Features
  - For Decision Trees, ordinal (integer) encoding of the categories usually works fine (they don't require one-hot):
     - from sklearn.preprocessing import OrdinalEncoder
     - encoder = OrdinalEncoder()
     - df[categorical_cols] = encoder.fit_transform(df[categorical_cols])
  - Alternatively, use pd.get_dummies() if categories are unordered and not too many :
     - df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
- Step 3 : Train a Decision Tree Model
  - Split the data into train and test sets:
    - from sklearn.model_selection import train_test_split
    - X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
  - Train the Decision Tree:
    - from sklearn.tree import DecisionTreeClassifier
    - model = DecisionTreeClassifier(random_state=42)
    - model.fit(X_train, y_train)
- Step 4 : Tune Hyperparameters
  - Use GridSearchCV to find the best combination :
    - from sklearn.model_selection import GridSearchCV
    - param_grid = {
        - 'max_depth': [3, 5, 10, None],
        - 'min_samples_split': [2, 10, 20],
        - 'min_samples_leaf': [1, 5, 10],
        - 'criterion': ['gini', 'entropy']
    - }
    - grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
    - grid_search.fit(X_train, y_train)
    - best_model = grid_search.best_estimator_
- Step 5 : Evaluate Model Performance
  - Make predictions :
    - y_pred = best_model.predict(X_test)
  - Metrics :
    - from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
    - print("Accuracy:", accuracy_score(y_test, y_pred))
    - print("Precision:", precision_score(y_test, y_pred))
    - print("Recall:", recall_score(y_test, y_pred))
    - print("F1 Score:", f1_score(y_test, y_pred))
    - print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
- Business Value in Real-World Healthcare
  - Early Disease Detection : Helps doctors flag high-risk patients early, even before symptoms appear.
  - Resource Optimization : Hospitals can prioritize tests or treatments for patients flagged as high-risk.
  - Improved Patient Outcomes : Reduces disease progression with early intervention.
  - Cost Savings : Reduces unnecessary diagnostic procedures for low-risk patients.
  - Data-Driven Decisions : Enables personalized treatment plans using insights from historical data.
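
Steps 1 to 5 can also be combined into a single scikit-learn `Pipeline`, so that the imputers and the encoder are fitted only on the training folds during cross-validation. The sketch below is one possible arrangement, not a definitive implementation: the data is synthetic and every column name (`age`, `blood_pressure`, `smoker`, `sex`) is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the patient data (all column names are hypothetical).
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "sex": rng.choice(["F", "M"], n),
})
y = ((df["age"] > 50) & (df["smoker"] == "yes")).astype(int)
# Knock out roughly 10% of two columns to mimic missing values.
df.loc[rng.random(n) < 0.1, "age"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

numerical_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker", "sex"]

# Steps 1 and 2: impute, then encode, separately per column type.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numerical_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
    ]), categorical_cols),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Step 3: hold out a test set before any fitting happens.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42)

# Step 4: tune the tree; the "tree__" prefix routes parameters to the pipeline step.
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Step 5: evaluate the refitted best pipeline on the untouched test set.
print("Best parameters:", search.best_params_)
print(classification_report(y_test, search.predict(X_test)))
```

Keeping preprocessing inside the pipeline means the test set and the validation folds never influence the imputed medians, modes, or category codes.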



In [1]:
"""Dataset Info:
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or
provided CSV).
● Boston Housing Dataset for regression tasks
(sklearn.datasets.load_boston() or provided CSV).
Question 6:   Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances """
#Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

#Load the Iris dataset
iris = load_iris()
X = iris.data                   # Feature matrix
y = iris.target                 # Target labels
feature_names = iris.feature_names

#Split the data into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Create and train the Decision Tree Classifier using Gini criterion
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

#Predict on the test set
y_pred = model.predict(X_test)

#Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

#Print feature importances
print("\nFeature Importances:")
for name, importance in zip(feature_names, model.feature_importances_):
    print(f"{name}: {importance:.4f}")


Model Accuracy: 1.00

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


In [2]:
"""Question 7:  Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree. """
#Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

#Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

#Split into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Train Decision Tree with max_depth = 3 (Pre-Pruned)
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_tree.fit(X_train, y_train)
pruned_preds = pruned_tree.predict(X_test)
pruned_accuracy = accuracy_score(y_test, pruned_preds)

#Train Fully-grown Decision Tree (no depth limit)
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
full_preds = full_tree.predict(X_test)
full_accuracy = accuracy_score(y_test, full_preds)

#Print both accuracies
print(f"Accuracy of Pruned Tree (max_depth=3): {pruned_accuracy:.2f}")
print(f"Accuracy of Fully-grown Tree        : {full_accuracy:.2f}")


Accuracy of Pruned Tree (max_depth=3): 1.00
Accuracy of Fully-grown Tree        : 1.00


In [3]:
"""Question 8: Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances"""
#Import necessary libraries
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

#Load the Boston Housing dataset (via OpenML, since load_boston was removed in scikit-learn 1.2)
boston = fetch_openml(name="boston", version=1, as_frame=True)
X = boston.data
y = boston.target
feature_names = X.columns

#Split into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

#Predict and calculate Mean Squared Error (MSE)
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.2f}")

#Print feature importances
print("\nFeature Importances:")
for name, importance in zip(feature_names, regressor.feature_importances_):
    print(f"{name}: {importance:.4f}")


Mean Squared Error (MSE): 10.42

Feature Importances:
CRIM: 0.0513
ZN: 0.0034
INDUS: 0.0058
CHAS: 0.0000
NOX: 0.0271
RM: 0.6003
AGE: 0.0136
DIS: 0.0707
RAD: 0.0019
TAX: 0.0125
PTRATIO: 0.0110
B: 0.0090
LSTAT: 0.1933


In [4]:
"""Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy"""
#Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

#Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

#Split dataset into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set up the Decision Tree model
dtree = DecisionTreeClassifier(random_state=42)

#Define the hyperparameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 4, 6, 8]
}

#Use GridSearchCV to find the best parameters
grid_search = GridSearchCV(estimator=dtree,
                           param_grid=param_grid,
                           cv=5,
                           scoring='accuracy',
                           n_jobs=-1)

#Fit GridSearchCV on training data
grid_search.fit(X_train, y_train)

#Get the best parameters and best model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

#Evaluate the best model on test set
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

#Print results
print("Best Parameters:", best_params)
print(f"Test Set Accuracy: {accuracy:.2f}")


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Test Set Accuracy: 1.00
