# Theory & Practical Questions

Question 1: What is a Decision Tree, and how does it work in the context of
classification?

Answer: A Decision Tree is a supervised learning algorithm that predicts a target by recursively splitting the data on feature values. In classification, each internal node tests one feature (for example, whether a measurement is below a threshold), each branch corresponds to an outcome of that test, and each leaf node assigns a class label, typically the majority class of the training samples that reach it. We can think of it as asking a series of yes/no questions that guide us toward the correct category.
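
To make the "series of yes/no questions" concrete, here is a minimal sketch that fits a shallow tree on Iris and prints its rules as text. The max_depth=2 and random_state=0 settings are assumptions chosen only to keep the printout short and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree so the printed rules stay short (illustrative choice)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each "|---" line is one threshold test; lines containing "class:" are leaves
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the printout from top to bottom follows one path of questions from the root to a leaf.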

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

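Answer: Gini Impurity and Entropy both measure how mixed the classes are in a node. Gini is the probability of mislabeling a randomly chosen sample if we label it at random according to the node's class proportions, while Entropy measures the amount of randomness or disorder in those proportions. Both are zero for a pure node and largest when the classes are evenly mixed. When building a Decision Tree, we evaluate the candidate splits and choose the one that reduces impurity the most, so these measures decide which feature and threshold to split on. In practice the two usually lead to similar trees.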
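
As a rough illustration, both measures can be computed directly from class proportions p_k using the standard formulas Gini = 1 - sum(p_k^2) and Entropy = -sum(p_k * log2(p_k)). The helper names gini and entropy below are hypothetical and not part of scikit-learn.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

pure = ["a", "a", "a", "a"]    # a single class: no impurity
mixed = ["a", "a", "b", "b"]   # two classes, evenly mixed: maximum impurity

print("pure :", gini(pure), entropy(pure))
print("mixed:", gini(mixed), entropy(mixed))
```

For two classes, Gini tops out at 0.5 and Entropy at 1 bit, which is why the two criteria rank most splits in the same order.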

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

Answer: Pre-pruning means we stop the tree from growing too deep by setting limits up front, such as maximum depth or minimum samples per split, which helps us avoid overfitting early. Post-pruning, on the other hand, lets the tree grow fully and then trims back branches that add little predictive value. A key advantage of pre-pruning is saving time and computation, because the unnecessary branches are never built. A key advantage of post-pruning is that it sees the whole tree before deciding what to cut, so it can keep a useful split that sits below a weak one, which often gives a model that generalizes better.
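
The contrast can be sketched with scikit-learn, assuming cost-complexity pruning (the ccp_alpha parameter) as the post-pruning method. The specific limits and the choice of the middle alpha on the pruning path are arbitrary picks for illustration, not tuned values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Pre-pruning: limits set up front stop growth early
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=4, random_state=0)
pre_pruned.fit(X_train, y_train)

# Fully grown reference tree
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Post-pruning: compute the candidate alphas for the full tree, then refit with one
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = max(path.ccp_alphas[len(path.ccp_alphas) // 2], 0.0)  # arbitrary middle value
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
post_pruned.fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name:12s} leaves={model.get_n_leaves():2d} test accuracy={model.score(X_test, y_test):.3f}")
```

In a real project the alpha would normally be selected by cross-validation rather than picked from the middle of the path.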

Question 4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

Answer: Information Gain measures how much uncertainty or impurity is reduced when we split the data using a particular feature. It is the entropy of the parent node minus the weighted average entropy of the child nodes, where each child is weighted by its share of the samples. The higher the gain, the better the feature is at separating the classes. It is important because it lets us compare candidate splits on a common scale and choose the most informative one at each node, which tends to produce smaller and more accurate trees.
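
A small worked example may help. The sketch below computes Information Gain for two candidate splits of the same eight labels. The function names entropy and information_gain are hypothetical helpers written for this illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(parent, left, right):
    """Parent entropy minus the sample-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Split A separates the classes completely
gain_a = information_gain(parent, parent[:4], parent[4:])

# Split B leaves each child as mixed as the parent
gain_b = information_gain(parent, parent[[0, 1, 4, 5]], parent[[2, 3, 6, 7]])

print(f"Split A gain: {gain_a:.3f}")
print(f"Split B gain: {gain_b:.3f}")
```

Split A removes all of the parent's one bit of uncertainty, while split B removes none, so a tree builder would pick split A.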

Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

Answer: Decision Trees are widely used in areas like medical diagnosis, credit risk assessment, fraud detection, and customer segmentation. Their main advantages are that they are easy to understand, interpret, and visualize, and that they need little preprocessing such as feature scaling. However, a single tree can easily overfit, especially on noisy data, and it is unstable: small changes in the training set can produce a very different tree. Techniques like pruning, or combining many trees in an ensemble, help reduce these problems.

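Dataset Info:
  * Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).
  * Boston Housing Dataset for regression tasks (sklearn.datasets.load_boston() or provided CSV). Note: load_boston() was removed in scikit-learn 1.2, so Question 8 below uses the California Housing dataset instead.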

Question 6: Write a Python program to:
* Load the Iris Dataset
* Train a Decision Tree Classifier using the Gini criterion
* Print the model’s accuracy and feature importances
* (Include your Python code and output in the code box below.)

In [5]:
# Answer:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load Iris into a DataFrame so the feature names travel with the data
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Separate features and labels
X = df.drop('target', axis=1)
y = df['target']

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Train with the Gini criterion (also scikit-learn's default)
classifier = DecisionTreeClassifier(criterion='gini')
classifier.fit(X_train, y_train)

# Evaluate on the held-out samples
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Importances follow the column order of data.feature_names
print("Feature Importances:", classifier.feature_importances_)



Accuracy: 0.9555555555555556
Feature Importances: [0.02146947 0.02146947 0.57196476 0.38509631]
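
The importance array is printed in the order of data.feature_names (sepal length, sepal width, petal length, petal width), so the two large values above belong to the petal measurements. A possible way to make that explicit is to label the values with a pandas Series, as sketched below. The random_state=0 on the classifier is an added assumption for repeatability, so the exact numbers may differ from the output above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=1
)

# random_state is an assumption added here so repeated runs give the same tree
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X_train, y_train)

# Attach feature names to the importances and sort from most to least important
importances = pd.Series(clf.feature_importances_, index=data.feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```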


Question 7: Write a Python program to:
* Load the Iris Dataset
* Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.
* (Include your Python code and output in the code box below.)

In [9]:
# Answer:
import pandas as pd  # needed for pd.DataFrame below
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

classifier_limited = DecisionTreeClassifier(max_depth=3)
classifier_limited.fit(X_train, y_train)
y_pred_limited = classifier_limited.predict(X_test)
acc_limited = accuracy_score(y_test, y_pred_limited)

classifier_full = DecisionTreeClassifier()
classifier_full.fit(X_train, y_train)
y_pred_full = classifier_full.predict(X_test)
acc_full = accuracy_score(y_test, y_pred_full)

print(f"Accuracy with max_depth=3: {acc_limited}")
print(f"Accuracy with fully grown tree: {acc_full}")


Accuracy with max_depth=3: 0.9555555555555556
Accuracy with fully grown tree: 0.9555555555555556
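
Both trees score the same on this single 45-sample test split, which is too small to separate them reliably. One option, sketched here as an assumption rather than part of the original answer, is to compare the two settings with 5-fold cross-validation on the full dataset. The random_state=0 value is likewise an illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

results = {}
for label, depth in [("max_depth=3", 3), ("fully grown", None)]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # Five accuracy scores, one per held-out fold
    results[label] = cross_val_score(model, X, y, cv=5, scoring="accuracy")

for label, scores in results.items():
    print(f"{label:12s} mean accuracy={scores.mean():.3f} (std={scores.std():.3f})")
```

If the means are close, the shallower tree is usually the better choice because it is simpler to interpret.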


Question 8: Write a Python program to:
* Load the California Housing dataset from sklearn
* Train a Decision Tree Regressor
* Print the Mean Squared Error (MSE) and feature importances
* (Include your Python code and output in the code box below.)

In [12]:
# Answer:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd

housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['target'] = housing.target

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

mse = mean_squared_error(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print("Feature Importances:", regressor.feature_importances_)


Mean Squared Error (MSE): 0.48688516601820087
Feature Importances: [0.51029809 0.05208805 0.02900928 0.0266508  0.02703983 0.13941315
 0.1091124  0.10638838]
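
MSE is expressed in squared target units, which makes it hard to read directly. The California Housing target is the median house value in units of $100,000, so taking the square root gives an error in the same units as the target. The short calculation below reuses the MSE value printed above.

```python
import math

mse = 0.48688516601820087  # MSE value copied from the output above

# RMSE is in the same units as the target (hundreds of thousands of dollars)
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:.3f} (about ${rmse * 100_000:,.0f})")
```

An RMSE of roughly 0.70 means the unpruned tree's typical error is on the order of $70,000 per district.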


Question 9: Write a Python program to:
* Load the Iris Dataset
* Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
* Print the best parameters and the resulting model accuracy
* (Include your Python code and output in the code box below.)

In [14]:
# Answer:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
import pandas as pd


iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target


X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5]
}

dt = DecisionTreeClassifier()

grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with Best Parameters: {accuracy}")


Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
Model Accuracy with Best Parameters: 0.9555555555555556


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
* Handle the missing values
* Encode the categorical features
* Train a Decision Tree model
* Tune its hyperparameters
* Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

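Answer:
* Handle Missing Values: Split the data into training and testing sets first, then fit the imputation on the training set only so that no information leaks from the test set. For numerical features, we can use mean or median imputation (median is more robust to outliers), and for categorical features, we can fill missing values with the most frequent category.
* Encode Categorical Features: scikit-learn's Decision Trees require numeric input, so apply One-Hot Encoding to nominal features and ordinal encoding to features with a natural order. Plain Label Encoding on a nominal feature imposes an arbitrary order that the tree may split on.
* Train the Decision Tree: Train a Decision Tree model on the training data, keeping the test set untouched until the final evaluation.
* Tune Hyperparameters: Use GridSearchCV or RandomizedSearchCV with cross-validation to find the best values for parameters like max_depth, min_samples_split, and criterion.
* Evaluate the Model: Check performance on the test set using metrics like accuracy, precision, recall, F1-score, and ROC-AUC. Disease datasets are often imbalanced, so accuracy alone can be misleading, and recall deserves particular attention because a missed diagnosis is usually the costlier error.
* Business Value: This model can help healthcare professionals predict diseases early, identify high-risk patients, and support better decision-making. Because a Decision Tree's rules can be read and checked by clinicians, its predictions are easier to trust and audit. It can reduce manual workload, improve efficiency, and contribute to better patient outcomes and cost savings for the company.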
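
The steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption: the records are synthetic, the column names (age, blood_pressure, smoker, region) are hypothetical, and the disease label is generated by a made-up rule. Scoring on recall is also an illustrative choice, not a requirement.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records; all column names are hypothetical
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east"], n),
})
# Made-up labeling rule, used only so the example has something to learn
y = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)

# Remove roughly 10% of two columns to simulate missing values
df.loc[rng.random(n) < 0.1, "age"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker", "region"]

# Steps 1 and 2: impute, then one-hot encode the categorical columns
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: preprocessing and model in one pipeline, so imputation is fit on training folds only
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=1
)

# Step 4: tune hyperparameters with 5-fold cross-validation
param_grid = {
    "tree__max_depth": [2, 3, 5, None],
    "tree__min_samples_split": [2, 10, 20],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

# Step 5: evaluate once on the held-out test set
y_pred = search.predict(X_test)
auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {auc:.3f}")
```

Wrapping the imputers and encoder inside the pipeline means GridSearchCV refits them on each training fold, which avoids leaking information from the validation folds.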