
1. What is a Decision Tree, and how does it work in the context of classification?

- A Decision Tree is a supervised model that recursively splits the data using feature-based rules. Each internal node tests a condition such as “feature ≤ threshold” and sends each sample to the matching child, choosing the test that most reduces impurity. Splitting continues until a node is pure or a stopping rule (for example a maximum depth or a minimum number of samples) is reached. For classification, a leaf predicts the majority class of the training samples that reached it. Decision Trees are interpretable and can model nonlinear decision boundaries.
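To make the "walk from root to leaf" idea concrete, here is a minimal sketch of prediction with a hand-written tree. The nested-dictionary layout, the thresholds, and the class names are all made up for illustration; this is not how scikit-learn stores its trees.

```python
# A hand-written toy tree; thresholds and class names are hypothetical.
toy_tree = {
    "feature": 0, "threshold": 2.5,
    "left": {"leaf": "class_a"},
    "right": {
        "feature": 1, "threshold": 1.7,
        "left": {"leaf": "class_b"},
        "right": {"leaf": "class_c"},
    },
}


def predict_one(node, x):
    """Walk from the root to a leaf, going left when x[feature] <= threshold."""
    while "leaf" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]


for sample in ([1.4, 0.2], [4.5, 1.5], [5.8, 2.2]):
    print(sample, "->", predict_one(toy_tree, sample))
```

Each prediction only evaluates the conditions along one path, which is why the rules behind a single prediction are easy to read off.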

2. Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

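- Gini Impurity is the probability that a randomly chosen sample in a node would be misclassified if it were labeled at random according to the node’s class distribution: 1 − Σ p_k². Entropy measures the uncertainty of that distribution: −Σ p_k log₂ p_k. Both are zero when a node is perfectly pure and largest when the classes are evenly mixed. During tree construction, each candidate split is scored by the weighted impurity of its child nodes, and the split with the largest impurity reduction is chosen. This makes the child nodes purer, which generally improves accuracy on the training data; in practice the two measures usually lead to very similar splits.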
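The two measures can be computed directly from the class counts in a node. The following is a small illustrative sketch in plain Python; the helper names `gini` and `entropy` are chosen here and are not scikit-learn functions.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: minus the sum of p * log2(p) over the classes present."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


pure_node = ["a", "a", "a", "a"]
mixed_node = ["a", "a", "b", "b"]

print("pure :", gini(pure_node), entropy(pure_node))
print("mixed:", gini(mixed_node), entropy(mixed_node))
```

For two evenly mixed classes the Gini impurity reaches its two-class maximum of 0.5 and the entropy reaches 1 bit, while a pure node scores zero on both.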

3. What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

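- Pre-pruning stops tree growth early through constraints such as max_depth, min_samples_split, or min_samples_leaf, which limits overfitting and reduces training time. Post-pruning first grows the full tree and then removes branches that contribute little, for example with cost-complexity pruning (ccp_alpha in scikit-learn). Practical advantage of pre-pruning: it is cheap, because the discarded branches are never built. Practical advantage of post-pruning: it decides what to cut after seeing the whole tree, so it is less likely to stop just before a useful split deeper down.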
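A minimal sketch of both approaches on Iris, assuming scikit-learn 0.22 or later (where `ccp_alpha` is available). Cost-complexity pruning is only one form of post-pruning, and the alpha picked below is an arbitrary middle value for illustration; in practice it would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Pre-pruning: cap the depth before training starts.
pre = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune it back with cost-complexity pruning.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
# Arbitrary middle alpha (illustration only); clamp at 0 to guard against tiny negative values.
alpha = max(0.0, path.ccp_alphas[len(path.ccp_alphas) // 2])
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "nodes:", model.tree_.node_count,
          "test accuracy:", model.score(X_test, y_test))
```

Comparing the node counts shows how much each strategy simplifies the tree relative to the fully grown one.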

4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?

- Information Gain is the entropy of the parent node minus the weighted average entropy of its child nodes after a split. A high Information Gain means the split separates the classes well. At each node the tree picks the split with the highest gain, so the most informative features are used first, which tends to give shallower and more accurate trees.
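As a worked example, the sketch below computes Information Gain for two candidate splits of the same ten-sample node. The helper names and the toy labels are illustrative assumptions, not part of any library.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# Split A separates the classes completely.
gain_a = information_gain(parent, ["yes"] * 5, ["no"] * 5)

# Split B leaves both children as mixed as the parent.
gain_b = information_gain(parent, ["yes"] * 2 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3)

print("gain of split A:", gain_a)
print("gain of split B:", gain_b)
```

Split A recovers the full 1 bit of parent entropy, while split B gains nothing, so a tree builder using this criterion would choose split A.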

5. What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

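- Decision Trees are used in fraud detection, medical diagnosis, credit scoring, churn prediction, and recommendation systems. They are easy to interpret and can work with both numerical and categorical data (scikit-learn’s implementation requires categorical features to be encoded as numbers first). However, they overfit easily when grown without constraints and have high variance: small changes in the training data can produce a very different tree. Ensembles such as Random Forests reduce this variance by averaging many trees, at the cost of some interpretability.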

Question 6: Write a Python program to:
* Load the Iris Dataset
* Train a Decision Tree Classifier using the Gini criterion
* Print the model’s accuracy and feature importances





In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris data: 150 samples, 4 features, 3 classes
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 25% for testing; stratify=y keeps the class proportions equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Train a tree that scores splits with Gini impurity (also scikit-learn's default)
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Report test accuracy and each feature's share of the total impurity reduction
print("Accuracy:", accuracy_score(y_test, pred))
print("Feature Importances:")
for name, imp in zip(iris.feature_names, clf.feature_importances_):
    print(name, ":", imp)


Accuracy: 0.8947368421052632
Feature Importances:
sepal length (cm) : 0.013393924898349679
sepal width (cm) : 0.020090887347524518
petal length (cm) : 0.9198866667893217
petal width (cm) : 0.04662852096480414


Question 7: Write a Python program to:
* Load the Iris Dataset
* Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.


In [2]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Reuses X_train, X_test, y_train, y_test from the Question 6 cell (Iris split)

# Depth-limited tree versus a tree grown until every leaf is pure
clf3 = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_full = DecisionTreeClassifier(random_state=42)

clf3.fit(X_train, y_train)
clf_full.fit(X_train, y_train)

pred3 = clf3.predict(X_test)
pred_full = clf_full.predict(X_test)

# Compare both models on the same held-out test set
print("Accuracy (max_depth=3):", accuracy_score(y_test, pred3))
print("Accuracy (full tree):", accuracy_score(y_test, pred_full))


Accuracy (max_depth=3): 0.8947368421052632
Accuracy (full tree): 0.8947368421052632

Both trees reach the same test accuracy on this split, so capping the depth at 3 gives a simpler tree without losing accuracy here.


Question 8: Write a Python program to:
* Load the California Housing dataset from sklearn
* Train a Decision Tree Regressor
* Print the Mean Squared Error (MSE) and feature importances

In [3]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load the California Housing data (downloaded and cached on first use)
data = fetch_california_housing()
X, y = data.data, data.target

# Hold out 25% for testing; no stratification because the target is continuous
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit a fully grown regression tree
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)
pred = reg.predict(X_test)

# Report test-set MSE and each feature's share of the total error reduction
print("MSE:", mean_squared_error(y_test, pred))
print("Feature Importances:")
for name, imp in zip(data.feature_names, reg.feature_importances_):
    print(name, ":", imp)


MSE: 0.5285224061284108
Feature Importances:
MedInc : 0.5262413969849339
HouseAge : 0.050926206129984955
AveRooms : 0.048154956928807016
AveBedrms : 0.02803899237580427
Population : 0.03691354728127817
AveOccup : 0.13491387033351493
Latitude : 0.08801244866407874
Longitude : 0.08679858130159805


Question 9: Write a Python program to:
* Load the Iris Dataset
* Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
* Print the best parameters and the resulting model accuracy

In [5]:
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Reload Iris and redo the split: the Question 8 cell overwrote X, y and the
# train/test variables with the California Housing data
iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Candidate values to try; max_depth=None lets the tree grow fully
params = {
    "max_depth": [None, 2, 3, 4, 5, 6],
    "min_samples_split": [2, 5, 10]
}

# 5-fold cross-validated search over all 18 combinations, using all CPU cores
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid=params,
    cv=5,
    n_jobs=-1)

grid.fit(X_train, y_train)

# best_estimator_ is refit on the whole training set with the best parameters
best_model = grid.best_estimator_
pred = best_model.predict(X_test)

print("Best Parameters:", grid.best_params_)
print("Accuracy:", accuracy_score(y_test, pred))

Best Parameters: {'max_depth': None, 'min_samples_split': 5}
Accuracy: 0.9210526315789473


## Question 10: Healthcare Disease Prediction with Decision Trees

Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to:

*   Handle the missing values
*   Encode the categorical features
*   Train a Decision Tree model
*   Tune its hyperparameters
*   Evaluate its performance

And describe what business value this model could provide in the real-world setting.

### Step-by-Step Process:

1.  **Data Loading and Initial Exploration:**
    *   Load the dataset into a pandas DataFrame.
    *   Perform initial data exploration to understand the structure, identify data types, and get a sense of the extent of missing values and categorical features.

2.  **Handling Missing Values:**
    *   **Identify Missing Values:** Determine which features have missing values and the percentage of missing data in each.
    *   **Choose Imputation Strategy:** Select appropriate strategies based on the data type and distribution. Common methods include:
        *   **Mean/Median Imputation:** For numerical features. Use the mean for normally distributed data and the median for skewed data.
        *   **Mode Imputation:** For categorical features.
        *   **K-Nearest Neighbors (KNN) Imputation:** Impute missing values based on the values of the k-nearest neighbors.
        *   **Model-Based Imputation:** Use a model to predict missing values based on other features.
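    *   **Implement Imputation:** Apply the chosen imputation methods to fill in the missing values. Fit the imputers on the training data only (that is, after the split described in step 4) and reuse the fitted imputers on the validation and test data, so that no information from held-out data leaks into preprocessing.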

3.  **Encoding Categorical Features:**
    *   **Identify Categorical Features:** Determine which features are categorical (nominal or ordinal).
    *   **Choose Encoding Strategy:** Select appropriate encoding methods:
        *   **One-Hot Encoding:** For nominal features where there is no inherent order. Creates new binary columns for each category.
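        *   **Label (Ordinal) Encoding:** For ordinal features where there is a natural order. Assigns an integer to each category. In scikit-learn, `OrdinalEncoder` is intended for features, while `LabelEncoder` is intended for the target. Avoid it for nominal features, because a Decision Tree will split on the arbitrary integer order as if it were meaningful.
        *   **Target Encoding:** Encodes each category as the mean of the target variable for that category. Useful for high-cardinality features, but because it uses the target it must be fit on training data only, otherwise it leaks label information and overfits.
    *   **Implement Encoding:** Apply the chosen encoding methods to convert categorical features into a numerical format that the Decision Tree can use, fitting the encoders on the training data only.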

4.  **Splitting the Data:**
    *   Split the dataset into training, validation (optional but recommended for hyperparameter tuning), and testing sets. A common split is 70-80% for training, 10-15% for validation, and 10-15% for testing. If tuning uses cross-validation on the training set, a separate validation set is not needed. Stratify the split if the target variable is imbalanced, so that each set keeps the original class distribution.

5.  **Training a Decision Tree Model:**
    *   Import the `DecisionTreeClassifier` from scikit-learn.
    *   Instantiate the model. Start with default hyperparameters or values based on domain knowledge or initial exploration.
    *   Train the model using the training data (`fit(X_train, y_train)`).

6.  **Hyperparameter Tuning:**
    *   **Identify Key Hyperparameters:** Focus on hyperparameters that control the complexity of the tree, such as:
        *   `max_depth`: The maximum depth of the tree.
        *   `min_samples_split`: The minimum number of samples required to split an internal node.
        *   `min_samples_leaf`: The minimum number of samples required to be at a leaf node.
        *   `criterion`: The function to measure the quality of a split (Gini impurity or entropy).
    *   **Choose Tuning Method:**
        *   **Grid Search:** Exhaustively searches over a specified range of hyperparameter values.
        *   **Random Search:** Randomly samples hyperparameter values from a specified distribution. Often more efficient than grid search for large search spaces.
        *   **Cross-Validation:** Not a search method in itself, but the way each candidate is scored. Using k-fold cross-validation during tuning gives a more robust estimate of the model's performance for each set of hyperparameters than a single validation split.
    *   **Implement Tuning:** Use `GridSearchCV` or `RandomizedSearchCV` from scikit-learn to find the best combination of hyperparameters based on a chosen evaluation metric (e.g., accuracy, precision, recall, F1-score, AUC, depending on the business problem and class imbalance).
    *   **Select Best Model:** Choose the hyperparameters with the best score on the validation set (or in cross-validation) and refit the model with them on the full training set; this is the final model.

7.  **Evaluating Model Performance:**
    *   Evaluate the performance of the best model on the unseen test set using appropriate evaluation metrics. For disease prediction, common metrics include:
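        *   **Accuracy:** Overall percentage of correct predictions. It can be misleading when the disease is rare, because predicting "no disease" for everyone already scores high.
        *   **Precision:** Of all patients predicted to have the disease, what percentage actually have it? (High precision means few false positives.)
        *   **Recall (Sensitivity):** Of all patients who actually have the disease, what percentage were correctly identified? (High recall means few false negatives, which usually matters most when a missed diagnosis is costly.)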
        *   **F1-Score:** The harmonic mean of precision and recall, providing a balance between the two.
        *   **AUC (Area Under the ROC Curve):** Measures the model's ability to distinguish between positive and negative classes.
        *   **Confusion Matrix:** A table summarizing the prediction results, showing true positives, true negatives, false positives, and false negatives.
    *   Analyze the results in the context of the business problem to understand the model's strengths and weaknesses.
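The steps above can be sketched end to end as a single scikit-learn pipeline. Everything about the data here is an assumption made for illustration: the records are synthetic, the column names (`age`, `blood_pressure`, `smoker`, `region`, `has_disease`) are hypothetical, and the label rule is invented. Recall is used as the tuning metric on the assumption that missed diagnoses are the costlier error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records; every column name is hypothetical.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(125, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east"], n),
})
# Invented label rule plus noise, so that there is a signal to learn.
risk = (0.04 * (df["age"] - 55) + 0.03 * (df["blood_pressure"] - 125)
        + (df["smoker"] == "yes"))
df["has_disease"] = (risk + rng.normal(0, 1, n) > 0.5).astype(int)
# Blank out about 10% of two columns to mimic missing values.
df.loc[rng.random(n) < 0.1, "blood_pressure"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

X = df.drop(columns="has_disease")
y = df["has_disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Imputation and encoding live inside the pipeline, so they are fit on
# training folds only and never see held-out data.
numeric = ["age", "blood_pressure"]
categorical = ["smoker", "region"]
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Tune tree complexity with 5-fold cross-validation, optimizing recall.
search = GridSearchCV(
    model,
    param_grid={
        "tree__max_depth": [3, 5, None],
        "tree__min_samples_leaf": [1, 5, 20],
    },
    scoring="recall",
    cv=5,
)
search.fit(X_train, y_train)

# Evaluate the refit best model once on the untouched test set.
pred = search.predict(X_test)
auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print("Best parameters:", search.best_params_)
print(classification_report(y_test, pred))
print("ROC AUC:", auc)
```

Keeping preprocessing inside the `Pipeline` is the design choice that matters most here: `GridSearchCV` then refits the imputers and encoder within each fold, which avoids the leakage that arises when they are fit on the full dataset before splitting.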

### Business Value in a Real-World Setting:

A Decision Tree model for disease prediction in a healthcare company could provide significant business value:

*   **Early Detection and Intervention:** Identifying patients at high risk of developing a disease allows for earlier intervention, potentially leading to better patient outcomes, reduced treatment costs, and improved quality of life.
*   **Resource Allocation:** The model can help healthcare providers prioritize resources by identifying patients who require more immediate attention or specialized care.
*   **Personalized Treatment Plans:** Understanding the factors that contribute to disease risk for individual patients can help in developing more personalized and effective treatment plans.
*   **Cost Reduction:** By enabling early detection and targeted interventions, the model can help reduce the overall cost of healthcare by preventing the progression of diseases and minimizing the need for more expensive treatments later on.
*   **Improved Patient Management:** The model can support healthcare professionals in making more informed decisions about patient management, leading to improved efficiency and effectiveness of care delivery.
*   **Research and Insights:** The feature importances from the Decision Tree can highlight which factors are most strongly associated with the disease. These are associations rather than proof of cause, but they can suggest directions for further research into the disease mechanisms.
*   **Risk Stratification:** Patients can be stratified into different risk categories based on the model's predictions, allowing for tailored monitoring and preventative measures.

