Q1) What is a Decision Tree, and how does it work in the context of
classification?

Ans1)

    A Decision Tree is a supervised machine learning algorithm that is widely used for classification problems. It works by recursively splitting a dataset into smaller subsets based on feature values, forming a tree-like structure of decisions. At each internal node, the algorithm selects the feature (and, for numerical features, the threshold) that best splits the data into groups that are as homogeneous as possible with respect to the target variable. Criteria such as Information Gain (based on Entropy) or the Gini Index are used to determine the “best” split. The branches represent the outcomes of a decision, while the leaf nodes represent the final classification labels.

    For example, in a customer dataset, a decision tree might first split on “Age,” then on “Income,” and finally on “Purchase History” to decide whether a customer will buy a product or not. The step-by-step splitting continues until the data is divided into pure groups or until a stopping condition is reached. Decision trees are easy to interpret and visualize, making them helpful for understanding how classification decisions are made. However, they can sometimes overfit the data, which is why techniques like pruning or ensemble methods (Random Forests, Gradient Boosted Trees) are often used to improve performance.
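
    To make the customer example concrete, a trained tree can be read as nested if/else rules. The sketch below is written by hand rather than learned from data, and the function name, thresholds, and labels are all hypothetical values chosen only for illustration.

```python
# Hypothetical hand-written tree for the customer example.
# Every threshold and label here is made up for illustration;
# a real tree would learn them from training data.

def will_buy(age: int, income: float, past_purchases: int) -> str:
    """Follow one root-to-leaf path and return that leaf's class label."""
    if age >= 30:                       # root node: split on Age
        if income >= 50_000:            # internal node: split on Income
            if past_purchases >= 1:     # internal node: split on Purchase History
                return "yes"            # leaf
            return "no"                 # leaf
        return "no"                     # leaf
    return "no"                         # leaf


print(will_buy(age=45, income=60_000, past_purchases=2))  # yes
print(will_buy(age=22, income=60_000, past_purchases=5))  # no
```

    Each prediction follows exactly one path from the root to a leaf, which is why a single decision can be explained by listing the conditions along that path.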

Q2) Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

Ans2)

    Gini Impurity and Entropy are two common impurity measures used in decision trees to decide the best feature for splitting the data.

    Gini Impurity measures how often a randomly chosen element from the node would be incorrectly classified if it were labeled at random according to the node’s class distribution. For class proportions p_1, ..., p_k it is Gini = 1 - (p_1^2 + ... + p_k^2). A Gini value of 0 means the node is pure (all samples belong to one class), while the maximum, 1 - 1/k, is reached when all k classes are equally frequent (0.5 for two classes).

    Entropy, from information theory, measures the level of uncertainty or disorder in the data: Entropy = -(p_1 log2 p_1 + ... + p_k log2 p_k). If all samples in a node belong to the same class, entropy is 0, but if the k classes are equally mixed, entropy is at its maximum of log2(k) (1 bit for two classes).

    During tree construction, the algorithm tries to split the data in a way that reduces impurity the most. This reduction is called Information Gain when using entropy and Gini Gain when using Gini Impurity. In practice, both measures usually lead to similar trees. Gini is slightly cheaper to compute because it avoids the logarithm, while entropy tends to be somewhat more sensitive to changes in the proportions of rare classes. By reducing impurity at each step, the decision tree becomes better at separating the classes, which generally helps it make accurate predictions.
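
    As a rough illustration of the two measures, the following sketch computes both for a list of class labels using only the standard library. The helper names (class_proportions, gini_impurity, entropy) are my own and are not taken from any library; the functions assume a non-empty list of labels.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the fraction of samples in each class (assumes labels is non-empty)."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini_impurity(labels):
    """Gini = 1 - sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits = sum of -p * log2(p) over the classes present."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))


pure = ["a", "a", "a", "a"]
balanced = ["a", "a", "b", "b"]
skewed = ["a", "a", "a", "b"]

for name, node in [("pure", pure), ("balanced", balanced), ("skewed", skewed)]:
    print(f"{name}: gini={gini_impurity(node):.3f}, entropy={entropy(node):.3f}")
```

    The pure node scores 0 on both measures and the balanced two-class node scores 0.5 (Gini) and 1.0 (entropy), matching the bounds described above; the skewed node falls in between.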
  
Q3) What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

Ans3)

    Pre-pruning and Post-pruning are techniques used to prevent overfitting in decision trees.

    Pre-pruning (Early Stopping): The tree growth is stopped early before it becomes too complex. Conditions like maximum depth, minimum samples per split, or minimum information gain are set to control the tree size.

    Advantage: Saves computation time and prevents overly complex trees right from the start.

    Post-pruning (Pruning after Full Growth): The tree is allowed to grow fully, and then branches that do not improve accuracy on held-out data are cut back. This is done using validation data, statistical measures, or a complexity penalty (as in cost-complexity pruning).

    Advantage: Produces a simpler tree that generalizes better while still considering all possible splits during training.
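
    The two strategies can be sketched with scikit-learn on the Iris data. This is one possible setup, not the only way to prune: pre-pruning is expressed through max_depth and min_samples_split, and post-pruning through cost-complexity pruning (ccp_alpha), with the pruning strength picked on a validation split. The 70/30 split, the specific limits, and the tie-breaking rule are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Pre-pruning: growth is limited while the tree is being built.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=10, random_state=0
).fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute candidate pruning strengths.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    candidate = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    score = candidate.fit(X_train, y_train).score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (simpler tree)
        best_alpha, best_score = alpha, score

post_pruned = DecisionTreeClassifier(
    ccp_alpha=best_alpha, random_state=0
).fit(X_train, y_train)

print("Pre-pruned depth:", pre_pruned.get_depth())
print("Full tree leaves:", full_tree.get_n_leaves())
print("Post-pruned leaves:", post_pruned.get_n_leaves())
print("Validation accuracy after post-pruning:", best_score)
```

    Pre-pruning needs only one fit, while the post-pruning loop refits the tree once per candidate alpha, which reflects the computation-versus-thoroughness trade-off described above.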

Q4) What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

Ans4)

    Information Gain is a measure used in decision trees to determine which feature provides the best split at a node. It is based on the concept of entropy, which measures the level of disorder or uncertainty in a dataset. When the data is split on a particular feature, the weighted average entropy of the resulting subsets (each subset weighted by its share of the samples) is compared with the entropy of the original dataset. Information Gain is the reduction in entropy after the split: IG = Entropy(parent) - sum over children of (n_child / n_parent) * Entropy(child).

    It is important because a higher information gain means the feature creates purer groups (more homogeneous with respect to the target class), which improves the decision tree’s ability to classify data accurately. By always choosing the split with the highest information gain, the decision tree grows greedily: each step removes as much uncertainty as possible at that node, although this does not guarantee the best tree overall.
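
    A small worked example may help here. The sketch below computes information gain for two candidate splits of the same ten-sample node; the function names and the toy labels are assumptions made for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a non-empty list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# Split A separates the classes completely.
split_a = [["yes"] * 5, ["no"] * 5]
# Split B leaves both children as mixed as the parent.
split_b = [["yes", "yes", "no", "no"], ["yes", "yes", "yes", "no", "no", "no"]]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(f"Split A gain: {gain_a:.3f}")
print(f"Split B gain: {gain_b:.3f}")
```

    Split A removes all of the parent's 1 bit of uncertainty, while Split B removes none, so a tree using this criterion would choose Split A.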

Q5) What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

Ans5)

    Decision trees are widely used in real-world applications because they are simple to understand and effective for both classification and regression tasks. Some common applications include:

    Business & Marketing: Predicting customer churn, segmenting customers, or deciding whether a customer will respond to a marketing campaign.

    Finance: Assessing credit risk, loan approval, and fraud detection.

    Healthcare: Diagnosing diseases, recommending treatments, and predicting patient outcomes.

    Engineering & Manufacturing: Detecting faults, quality control, and process optimization.

    Advantages: Decision trees are easy to interpret and visualize, require little data preprocessing (for example, no feature scaling), and can in principle handle both categorical and numerical data, although some implementations, such as scikit-learn’s, require categorical features to be encoded as numbers first. They also capture non-linear relationships and interactions between features.

    Limitations: They are prone to overfitting, especially with deep trees, and can be unstable, as small changes in data may produce a very different tree. A single decision tree also tends to be less accurate than ensemble methods like Random Forests or Gradient Boosted Trees.


In [1]:
#Q6) Write a Python program to:
#● Load the Iris Dataset
#● Train a Decision Tree Classifier using the Gini criterion
#● Print the model’s accuracy and feature importances

# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data   # Features
y = iris.target # Labels

# Split dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Print feature importances
print("Feature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Model Accuracy: 1.0
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


In [2]:
#Q7) Write a Python program to:
#● Load the Iris Dataset
#● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree with max_depth=3
tree_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_limited.fit(X_train, y_train)
y_pred_limited = tree_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# Train a fully-grown Decision Tree
tree_full = DecisionTreeClassifier(random_state=42)
tree_full.fit(X_train, y_train)
y_pred_full = tree_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Print accuracy comparison
print(f"Accuracy of tree with max_depth=3: {accuracy_limited:.4f}")
print(f"Accuracy of fully-grown tree: {accuracy_full:.4f}")


Accuracy of tree with max_depth=3: 1.0000
Accuracy of fully-grown tree: 1.0000
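
Both trees reach 1.0 on this single 30-sample test split, so the split alone cannot tell them apart. One possible follow-up, sketched here as an addition to the original exercise, is to compare the two settings with 5-fold cross-validation over the whole dataset; the choice of 5 folds and the dictionary names are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=42),
    "fully grown": DecisionTreeClassifier(random_state=42),
}

# Average accuracy over 5 folds gives a steadier estimate than one split.
cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: mean accuracy = {scores.mean():.4f} (std = {scores.std():.4f})")
```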


In [3]:
#Q8) Write a Python program to:
#● Load the California Housing dataset from sklearn
#● Train a Decision Tree Regressor
#● Print the Mean Squared Error (MSE) and feature importances

# Import necessary libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
california = fetch_california_housing()
X = california.data
y = california.target

# Split dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict on the test set
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.4f}")

# Print feature importances
print("Feature Importances:")
for feature, importance in zip(california.feature_names, regressor.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Mean Squared Error (MSE): 0.4952
Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


In [4]:
#Q9) Write a Python program to:
#● Load the Iris Dataset
#● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
#● Print the best parameters and the resulting model accuracy

# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10, 15]
}

# Initialize Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Use GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Evaluate the best model on test data
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with best parameters: {accuracy:.4f}")


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy with best parameters: 1.0000


Q10) Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

Ans10)

    In a healthcare setting, predicting whether a patient has a certain disease requires careful handling of the dataset, especially when it contains mixed data types and missing values. The process can be broken into the following steps:

    Handle missing values: Impute numerical features with the mean or median and categorical features with the mode, or use a more advanced method such as K-Nearest Neighbors imputation. The imputation statistics should be learned from the training data only, so that no information from the test set leaks into the model.

    Encode categorical features: Convert categorical features into numerical form so that the Decision Tree implementation can process them. One-hot encoding suits categories without a natural order, while ordinal encoding suits categories that have one.

    Train the model: Fit a Decision Tree classifier on the cleaned and encoded training data. It works with the mix of numerical and encoded categorical variables and provides interpretable results.

    Tune hyperparameters: Adjust parameters such as max_depth, min_samples_split, and min_samples_leaf using cross-validated search techniques like GridSearchCV or RandomizedSearchCV.

    Evaluate performance: Assess the model on held-out data using precision, recall, F1-score, and the confusion matrix in addition to accuracy. Disease prediction often involves imbalanced classes, where accuracy alone can look high even if the model misses most positive cases.

    The business value of such a model in a healthcare setting is significant: it can help clinicians identify high-risk patients early, prioritize testing and treatments, optimize resource allocation, and reduce healthcare costs. By providing data-driven insights, the model supports proactive decision-making, improves patient outcomes, and enhances operational efficiency in hospitals or clinics.
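
    The workflow described in this answer can be sketched end to end with scikit-learn. Everything about the data below is an assumption: the column names (age, blood_pressure, smoker, region, disease), the synthetic values, and the amount of missingness are invented so the example runs on its own, and the F1 scoring choice and parameter grid are just one reasonable setup.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a patient table (all columns are hypothetical).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east"], n),
})
risk = (df["age"] > 55).astype(int) + (df["smoker"] == "yes").astype(int)
df["disease"] = (risk + rng.normal(0, 0.5, n) > 1.2).astype(int)

# Inject missing values into one numerical and one categorical column.
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker", "region"]
X = df[numeric_cols + categorical_cols]
y = df["disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Steps 1 and 2: impute, then one-hot encode the categorical columns.
# Inside a Pipeline, imputers and encoders are fit on training folds only.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Step 3: the Decision Tree is the final stage of the pipeline.
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Step 4: cross-validated grid search over the tree's hyperparameters.
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_split": [2, 10],
    "tree__min_samples_leaf": [1, 5],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set with class-aware metrics.
y_pred = search.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print("Best parameters:", search.best_params_)
print(cm)
print(classification_report(y_test, y_pred))
```

    Wrapping the preprocessing and the tree in a single Pipeline keeps the imputation and encoding inside each cross-validation fold, which is one way to avoid leaking test information into the tuned model.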