In [None]:
Question 1: What is a Decision Tree, and how does it work in the context of
classification?

ANSWER:
What is a Decision Tree?
A Decision Tree is a supervised machine learning algorithm used for classification and regression tasks.
In classification, it is used to assign a class label (e.g., spam or not spam, yes or no, disease A or disease B).
It resembles a tree-like structure:
Root Node → represents the entire dataset and the first decision to be made.
Internal Nodes → represent conditions or tests on features (e.g., "Is age > 30?").
Branches → outcomes of the decision (Yes/No or True/False).
Leaf Nodes → represent the final class labels (e.g., "Approve loan" or "Reject loan").

How does it work in Classification?
1. Start with all data at the root node.
2. Select the best attribute (feature) to split on using a criterion like:
   * Information Gain (based on Entropy)
   * Gini Impurity
   * Chi-Square

   The goal is to create groups that are as pure (homogeneous) as possible.
3. Split the dataset into subsets based on the chosen feature.
   Example: Feature "Age > 30" → Yes branch, No branch.
4. Repeat recursively on each subset until:
   * All samples in a node belong to the same class, or
   * No further meaningful splits can be made, or
   * A stopping condition is reached (like maximum depth or minimum samples per leaf).
5. Assign class labels to the leaf nodes based on the majority class of data points in that node.
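
The procedure above can be observed on a very small example. The following is a minimal sketch, assuming scikit-learn is installed; the toy age/income data, the labels, and the feature names are made up for illustration and are not part of the original answer. `export_text` prints the learned tests as nested rules, so the root test, the branches, and the leaf classes are all visible.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [age, income in thousands]; label 1 = approve, 0 = reject
X = [[25, 30], [35, 60], [45, 80], [22, 20], [52, 90], [28, 25]]
y = [0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)

# Each internal node is a threshold test on one feature; each leaf holds a class
rules = export_text(tree, feature_names=["age", "income"])
print(rules)

# Classify a new, unseen applicant by following the tests from root to leaf
prediction = tree.predict([[40, 70]])
print("Predicted class for age=40, income=70:", prediction[0])
```

Because this toy data is separable with a single threshold, the tree stops after one split: every leaf is already pure, which is the first stopping condition listed above.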

In [None]:
Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

ANSWER:


Gini Impurity and Entropy are two measures of impurity used in decision trees to determine how good a split is.
Both aim to capture how mixed the classes are within a node, since a perfectly pure node is one where all data points belong to the same class.

Gini Impurity reflects the probability of misclassifying a randomly chosen element if it were assigned a class label at random according to the class proportions in the node.
A node with Gini equal to zero is completely pure, while higher values indicate greater class mixing.

Entropy, on the other hand, comes from information theory and measures the amount of uncertainty or disorder in the class distribution.
A node with low entropy is highly ordered, meaning its samples mostly belong to one class, whereas high entropy indicates more uniform mixing across classes.

When building a decision tree, the algorithm evaluates the candidate splits at each node and chooses the one that most reduces impurity.
If entropy is used, the split that maximizes information gain (the reduction in entropy from parent to children) is chosen.
If Gini is used, the split that minimizes the weighted Gini impurity of the child nodes is preferred. Both criteria usually lead to very similar trees, and the differences are subtle. Gini is slightly cheaper to compute because it needs no logarithm. A commonly cited rule of thumb is that Gini tends to isolate the most frequent class in its own branch, whereas entropy tends to produce somewhat more balanced partitions, but this is a tendency rather than a guarantee.

In essence, both impurity measures guide the tree toward purer child nodes, but they differ slightly in how they value class distributions. This difference can influence the structure and depth of the resulting tree, even if the final classification accuracy is often similar.
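
To make the two measures concrete, here is a small sketch that computes them from class counts using the standard definitions, Gini = 1 - Σ p_k² and Entropy = -Σ p_k log2(p_k), where p_k is the proportion of class k in the node. The function names and the example counts are illustrative choices, not taken from any library.

```python
import math

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits; classes with zero count contribute nothing."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical two-class nodes, given as [count of class A, count of class B]
examples = {"pure": [10, 0], "skewed": [9, 1], "mixed": [5, 5]}
for name, counts in examples.items():
    print(f"{name}: gini={gini(counts):.3f}, entropy={entropy(counts):.3f}")
```

Both measures are zero for the pure node and reach their two-class maximum (0.5 for Gini, 1 bit for entropy) at the evenly mixed node, which is why they usually rank candidate splits in a similar order.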



In [None]:
Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.
ANSWER:
Pre-pruning and post-pruning are two strategies used to prevent decision trees from growing too complex and overfitting the training data,
but they differ in when the stopping or simplification happens.

In pre-pruning, the tree growth is restricted *during* the construction phase itself.
Before splitting a node, the algorithm checks constraints such as the maximum depth,
the minimum number of samples required in a node, or whether the impurity reduction from a potential split is large enough.
If a split would violate one of these constraints, it is not made and the node is left as a leaf.

A practical advantage of pre-pruning is that it makes the training process more efficient because the algorithm avoids exploring unnecessary
branches of the tree.

In post-pruning, on the other hand, the tree is allowed to grow fully until it either perfectly classifies
the training data or cannot be split further. Once the complete tree is built, it is then simplified by
removing branches or subtrees that do not improve generalization. This pruning step is usually guided by performance
on a validation set or by applying statistical tests to check whether further splits add meaningful predictive power.
A practical advantage of post-pruning is that it often produces more accurate and reliable trees because it lets
the algorithm first capture the patterns in the training data before deciding which parts are noise and should be removed.
A pre-pruned tree, by contrast, may stop before a useful split that only pays off one level deeper.
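
Both strategies can be tried in scikit-learn. The sketch below is one possible illustration, not the only way to prune: pre-pruning is expressed through `max_depth` and `min_samples_leaf`, while post-pruning uses scikit-learn's minimal cost-complexity pruning (`ccp_alpha`), which is a different technique from the validation-set and statistical-test approaches described above. The specific parameter values and the choice of the middle alpha are arbitrary assumptions for demonstration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: growth is limited while the tree is being built
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then compute the pruning path
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Pick one alpha from the path for illustration only;
# in practice it would be selected by cross-validation
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(name, "| nodes:", model.tree_.node_count, "| test accuracy:", model.score(X_test, y_test))
```

Comparing the node counts shows how much each strategy simplifies the tree; whether accuracy changes depends on the dataset and the chosen settings.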




In [None]:
Question 4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

ANSWER:
Information gain is a concept from information theory that is used in decision trees to evaluate how good a particular split is.
It measures the reduction in uncertainty or impurity about the target class that results from partitioning the data based on a given attribute.
In other words, it tells us how much “information” about the class labels we gain by knowing the value of a certain feature.

Mathematically, information gain is defined as the difference between the entropy of the parent node and
the weighted average entropy of the child nodes after the split. If a split results in child nodes that are much purer than the parent,
the information gain will be high; if the classes remain mixed, the gain will be low.

The importance of information gain lies in its role as the criterion for selecting the best attribute at each step in building the decision tree.
By choosing the split with the highest information gain, the algorithm makes the largest available reduction in uncertainty at that node and pushes
the data closer to pure class separation. This strategy is greedy: it optimizes each split locally and does not guarantee the globally best tree,
but in practice it guides the tree toward effective classification.
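
The definition can be written as IG = H(parent) - Σ (n_k / n) · H(child_k), where H is entropy, n is the number of samples in the parent, and n_k is the number in child k. The sketch below applies it to two hypothetical splits of a node with 5 "yes" and 5 "no" samples; the helper names and the counts are illustrative assumptions.

```python
import math

def entropy(counts):
    """Shannon entropy in bits, computed from a list of class counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(
        (sum(child) / total) * entropy(child) for child in children_counts
    )
    return entropy(parent_counts) - weighted

parent = [5, 5]                    # 5 "yes", 5 "no"
perfect_split = [[5, 0], [0, 5]]   # each child is pure
weak_split = [[3, 2], [2, 3]]      # children are almost as mixed as the parent

gain_perfect = information_gain(parent, perfect_split)
gain_weak = information_gain(parent, weak_split)
print(f"Perfect split gain: {gain_perfect:.3f}")
print(f"Weak split gain:    {gain_weak:.3f}")
```

The perfect split recovers the full 1 bit of parent entropy, while the weak split gains only a small fraction of it, so a tree builder using this criterion would choose the first split.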



In [None]:
Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

ANSWER:

### **Real-World Applications of Decision Trees**

* **Medical Diagnosis** → Classifying patients based on symptoms and test results.
* **Finance** → Credit scoring, loan approval, fraud detection.
* **Marketing** → Customer segmentation, churn prediction, targeted advertising.
* **Business Operations** → Risk assessment and decision-making support.
* **Machine Learning** → Basis for ensemble models like Random Forests and Gradient Boosting.

---

### **Advantages of Decision Trees**

* Easy to **understand and interpret**, even for non-technical users.
* Requires **little data preprocessing** (no need for scaling or normalization).
* Handles both **categorical and numerical features** naturally.
* Can capture **non-linear relationships** between features and outcomes.

---

### **Limitations of Decision Trees**

* **Prone to overfitting** if not pruned or regularized.
* **Unstable**: small changes in data can drastically change the tree structure.
* May have **lower predictive accuracy** compared to advanced models.
* Struggles with **high-dimensional datasets** unless used in ensembles.




In [1]:
'''
Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances

'''


# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data        # Features
y = iris.target      # Target labels

# 2. Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# 4. Make predictions
y_pred = clf.predict(X_test)

# 5. Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# 6. Print feature importances
print("\nFeature Importances:")
for feature_name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")


Model Accuracy: 1.0

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


In [2]:
'''
Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.
'''
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train a Decision Tree with max_depth=3
tree_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_limited.fit(X_train, y_train)

# 4. Train a fully-grown Decision Tree (no max_depth limit)
tree_full = DecisionTreeClassifier(random_state=42)
tree_full.fit(X_train, y_train)

# 5. Make predictions
y_pred_limited = tree_limited.predict(X_test)
y_pred_full = tree_full.predict(X_test)

# 6. Evaluate accuracy
accuracy_limited = accuracy_score(y_test, y_pred_limited)
accuracy_full = accuracy_score(y_test, y_pred_full)

print("Accuracy of Decision Tree with max_depth=3:", accuracy_limited)
print("Accuracy of Fully-grown Decision Tree:", accuracy_full)


Accuracy of Decision Tree with max_depth=3: 1.0
Accuracy of Fully-grown Decision Tree: 1.0


In [4]:
'''
Question 8: Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances

'''
# Import required libraries
# Note: load_boston() was removed in scikit-learn 1.2 because of ethical concerns about the dataset.
# fetch_openml(name="boston", version=1) is used here instead.

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import pandas as pd

# 1. Load the Boston Housing dataset
boston = fetch_openml(name="boston", version=1, as_frame=True)
X = boston.data
y = boston.target

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# 4. Make predictions
y_pred = regressor.predict(X_test)

# 5. Compute Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# 6. Print feature importances
print("\nFeature Importances:")
for feature_name, importance in zip(X.columns, regressor.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")


Mean Squared Error (MSE): 10.416078431372549

Feature Importances:
CRIM: 0.0513
ZN: 0.0034
INDUS: 0.0058
CHAS: 0.0000
NOX: 0.0271
RM: 0.6003
AGE: 0.0136
DIS: 0.0707
RAD: 0.0019
TAX: 0.0125
PTRATIO: 0.0110
B: 0.0090
LSTAT: 0.1933


In [5]:
'''
Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy
'''

# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Define the Decision Tree Classifier
dtree = DecisionTreeClassifier(random_state=42)

# 4. Define the hyperparameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10]
}

# 5. Perform GridSearchCV
grid_search = GridSearchCV(
    estimator=dtree,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)

# 6. Get best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# 7. Evaluate the model with best parameters
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy with Best Parameters:", accuracy)





Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy with Best Parameters: 1.0


In [6]:
'''Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

ANSWER:

### **1. Handle Missing Values**

The first step is to identify and handle missing data.

* For **numerical features**, missing values can be imputed using the **mean, median, or a model-based approach**. Median is often preferred for healthcare data to reduce the influence of outliers.
* For **categorical features**, missing values can be imputed using the **most frequent category** or a special “Unknown” category.
* It is important to perform imputation **after splitting the data** into training and testing sets to avoid data leakage.

---

### **2. Encode Categorical Features**

Decision Trees can handle categorical features in some implementations, but many libraries require numeric encoding:

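* **Label (Ordinal) Encoding**: Convert categories into integer labels (useful if ordinal relationships exist). In scikit-learn, `OrdinalEncoder` does this for features; `LabelEncoder` is intended for the target column.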
* **One-Hot Encoding**: Convert categories into binary dummy variables (common for non-ordinal categories).

Encoding ensures the tree can process the data correctly while preserving information.

---

### **3. Train a Decision Tree Model**

* Select a **Decision Tree Classifier** because the target is categorical (disease present/absent).
* Split the dataset into **training and testing sets**.
* Fit the tree on the training data.
* Start with default hyperparameters initially to get a baseline model.

---

### **4. Tune Hyperparameters**

To improve performance and prevent overfitting:

* Use **GridSearchCV** or **RandomizedSearchCV** for cross-validated hyperparameter tuning.
* Important parameters include:

  * `max_depth`: Limits tree depth to avoid overfitting.
  * `min_samples_split`: Minimum samples required to split a node.
  * `min_samples_leaf`: Minimum samples required in a leaf node.
  * `criterion`: Gini impurity or entropy.
* Select hyperparameters that **maximize cross-validated accuracy or F1-score**.

---

### **5. Evaluate Model Performance**

* Evaluate using the **test set** to measure generalization.
* Metrics to consider:

  * **Accuracy**: Overall correctness.
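  * **Precision and Recall**: Recall is especially important in healthcare because a false negative means a missed diagnosis; precision shows how many of the flagged patients actually have the disease.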
  * **F1-score**: Balances precision and recall.
  * **ROC-AUC**: Measures ability to discriminate between classes.
* Optionally, inspect **feature importances** to understand which factors contribute most to disease prediction.
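
To show how these metrics relate to false negatives and false positives, here is a minimal sketch on made-up labels (1 = disease, 0 = no disease). The label vectors are hypothetical and only serve to make the arithmetic visible.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical ground truth and predictions, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
```

In this example 7 of 10 predictions are correct, yet recall is only 0.5 because two of the four sick patients are missed, which is why accuracy alone can be misleading in a medical setting.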

---

### **6. Business Value of the Model**

* **Early Detection**: Identifies patients at high risk before symptoms worsen, enabling timely intervention.
* **Resource Allocation**: Helps hospitals prioritize testing and treatment for high-risk patients.
* **Cost Reduction**: Reduces unnecessary tests for low-risk patients.
* **Decision Support**: Provides actionable insights to doctors, improving treatment planning.
* **Public Health Insights**: Aggregated predictions can guide preventive care and policy decisions.

---
The following Python workflow sketches the healthcare disease prediction scenario using a Decision Tree.
It covers missing value handling, encoding categorical features, training, hyperparameter tuning, and evaluation.

'''


# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# --- 1. Load dataset (example CSV) ---
# Replace 'healthcare_data.csv' with your actual dataset
data = pd.read_csv('healthcare_data.csv')

# Assume 'target' is the disease column (1 = disease, 0 = no disease)
X = data.drop('target', axis=1)
y = data['target']

# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object', 'category']).columns
numerical_cols = X.select_dtypes(include='number').columns  # all integer and float widths

# --- 2. Preprocessing: Handle missing values and encode categorical features ---
numerical_transformer = SimpleImputer(strategy='median')  # Impute numerical missing values with median
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Fill missing categorical values
    ('onehot', OneHotEncoder(handle_unknown='ignore'))      # One-hot encode categories
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols)
])

# --- 3. Split into training and testing sets ---
# stratify=y keeps the disease/no-disease ratio the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# --- 4. Create pipeline with Decision Tree ---
dtree = DecisionTreeClassifier(random_state=42)
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', dtree)])

# --- 5. Hyperparameter tuning using GridSearchCV ---
param_grid = {
    'classifier__max_depth': [3, 5, 7, None],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
    'classifier__criterion': ['gini', 'entropy']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# --- 6. Best parameters and model ---
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

best_model = grid_search.best_estimator_

# --- 7. Evaluate the model ---
y_pred = best_model.predict(X_test)

print("\nEvaluation Metrics:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1-Score:", f1_score(y_test, y_pred))
# ROC-AUC for binary classification
y_pred_prob = best_model.predict_proba(X_test)[:,1]
print("ROC-AUC:", roc_auc_score(y_test, y_pred_prob))



FileNotFoundError: [Errno 2] No such file or directory: 'healthcare_data.csv'