QUESTION 1
A Decision Tree is a supervised machine learning algorithm used for both classification and regression; this answer focuses on classification. It recursively partitions a dataset into smaller subsets based on feature values, forming a tree-like structure. The process starts at the root node, which holds the entire training set. At each internal node the algorithm picks the feature and split point that score best under a criterion such as Information Gain or Gini impurity. Each branch represents one outcome of that test, and splitting continues until it reaches leaf nodes, which hold the final class labels. A new sample is classified by following its feature values from the root down to a leaf. Decision Trees are popular because they are simple, interpretable, and resemble rule-based human decision-making.
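
The root-to-leaf traversal can be sketched with a small hand-written tree. This is only an illustration: the dictionary layout, the feature names, and the thresholds are assumptions chosen for readability, not the structure scikit-learn uses internally.

```python
# A toy tree: internal nodes test one feature against a threshold,
# leaves carry a class label. All names and thresholds are illustrative.
toy_tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}

def classify(node, sample):
    """Follow the sample from the root down to a leaf and return its label."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

print(classify(toy_tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(classify(toy_tree, {"petal_length": 5.1, "petal_width": 2.0}))  # virginica
```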

QUESTION 2
Gini Impurity:

1) Gini impurity measures how often a randomly chosen sample would be incorrectly classified if it were labeled at random according to the class distribution in a node.

2) The value ranges from 0 (pure node: only one class) to 0.5 (maximum impurity in binary classification, reached when both classes are equally frequent). With K classes the maximum is 1 - 1/K.

3) The lower the Gini value, the purer the node.
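
As a minimal sketch, Gini impurity can be computed as 1 minus the sum of squared class proportions in a node. The function name `gini_impurity` is an illustrative choice, not a library API.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k in a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["a", "a", "a", "a"]))  # pure node: 0.0
print(gini_impurity(["a", "a", "b", "b"]))  # balanced binary node: 0.5
```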

Entropy:

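1) Entropy is an information-theoretic measure of the uncertainty in a node's class distribution.

2) The value is 0 when the node is pure (all samples belong to the same class) and grows as the classes become more mixed, reaching 1 bit for two equally frequent classes when computed with a base-2 logarithm.

3) Entropy is used to calculate Information Gain, which measures how much uncertainty is reduced after a split.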
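
A short sketch of the entropy calculation, assuming a base-2 logarithm so the result is in bits. The helper name `entropy` is illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: sum of -p_k * log2(p_k) over class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["a", "a", "a", "a"]))  # pure node: 0.0
print(entropy(["a", "a", "b", "b"]))  # balanced binary node: 1.0
```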

Impact on Splits in Decision Trees:

1) Both Gini Impurity and Entropy guide the tree in selecting the best feature and split point.

2) The algorithm evaluates the candidate splits at each node and chooses the one that gives the largest reduction in impurity (highest purity gain).

3) In practice, Gini is slightly cheaper to compute because it needs no logarithm, while Entropy comes directly from information theory; the two usually produce very similar trees.
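
The split search can be sketched for a single numeric feature: try the midpoints between consecutive distinct values and keep the threshold with the largest drop in size-weighted Gini impurity. This is a simplified illustration under that assumption, not scikit-learn's actual implementation; the data values and the name `best_threshold` are made up.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, impurity decrease) for the best split on one feature."""
    n = len(labels)
    parent = gini(labels)
    distinct = sorted(set(values))
    best = (None, 0.0)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2  # candidate threshold between two neighbouring values
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        if parent - weighted > best[1]:
            best = (t, parent - weighted)
    return best

petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]  # illustrative values
species = ["s", "s", "s", "v", "v", "v"]
print(best_threshold(petal_length, species))  # (3.0, 0.5)
```

A full tree builder would repeat this search over every feature and then recurse into the two child nodes.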

QUESTION 3

Pre-Pruning (Early Stopping):

Pre-pruning stops the tree growth early by applying conditions such as maximum depth, minimum samples per split, or minimum information gain.

It prevents the tree from becoming too complex.

Advantage: Reduces training time and the risk of overfitting by keeping the tree simpler. The trade-off is that growth may stop before a useful deeper split is found.
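
In scikit-learn, these stopping conditions correspond roughly to constructor parameters of `DecisionTreeClassifier`. The sketch below compares an unconstrained tree with a pre-pruned one on Iris; the specific limit values are arbitrary assumptions, and `min_impurity_decrease` is used here as the closest available stand-in for a minimum-gain threshold.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unconstrained tree: grows until every leaf is pure.
full = DecisionTreeClassifier(random_state=42).fit(X, y)

# Pre-pruned tree: growth stops as soon as any limit is hit (values are illustrative).
pre_pruned = DecisionTreeClassifier(
    max_depth=3,                 # maximum depth of the tree
    min_samples_split=10,        # minimum samples required to split a node
    min_impurity_decrease=0.01,  # minimum impurity reduction required for a split
    random_state=42,
).fit(X, y)

print("full:      ", full.get_depth(), "levels,", full.get_n_leaves(), "leaves")
print("pre-pruned:", pre_pruned.get_depth(), "levels,", pre_pruned.get_n_leaves(), "leaves")
```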

Post-Pruning (Pruning After Full Growth):

Post-pruning allows the tree to grow fully and then removes branches that add little predictive power, often using validation data.

It reduces complexity after observing the complete structure.

Advantage: Often generalizes better than pre-pruning, because each branch is judged after the full tree has been grown and only branches that do not improve validation performance are removed.
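
One way to post-prune in scikit-learn is minimal cost-complexity pruning via the `ccp_alpha` parameter. The sketch below is an assumption about how this could be done, not the only post-pruning method: it grows a full tree, takes the candidate alphas from `cost_complexity_pruning_path`, and keeps the alpha that scores best on a held-out validation split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Candidate pruning strengths derived from the fully grown tree.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    score = tree.fit(X_train, y_train).score(X_val, y_val)
    if score >= best_score:  # on ties prefer the larger alpha (smaller tree)
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
pruned.fit(X_train, y_train)
print("chosen ccp_alpha:", best_alpha, "| leaves:", pruned.get_n_leaves())
```

With more data, cross-validation over the candidate alphas would be a more robust choice than a single validation split.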

QUESTION 4

Information Gain:
Information Gain is a measure used in Decision Trees to decide the best feature for splitting the data. It is based on Entropy and shows how much uncertainty (impurity) is reduced after a split.

Importance:

A higher Information Gain means the split gives a clearer separation of classes.

The Decision Tree selects the feature with the maximum Information Gain at each step.

Because the choice is greedy, impurity drops as quickly as possible at each step, although this does not guarantee the globally optimal tree.
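
Information Gain can be sketched as the parent node's entropy minus the size-weighted entropy of its child nodes. The function names and the toy labels below are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a non-empty list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5
perfect_split = [["yes"] * 5, ["no"] * 5]
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "yes", "no", "no", "no"]]

print(information_gain(parent, perfect_split))  # 1.0: all uncertainty removed
print(information_gain(parent, useless_split))  # 0.0: children as mixed as the parent
```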

QUESTION 5

**Real-World Applications of Decision Trees:**

* **Finance:** Credit scoring, loan approval, fraud detection.
* **Healthcare:** Disease diagnosis, treatment planning.
* **Marketing:** Customer segmentation, predicting churn, product recommendations.
* **Operations:** Risk analysis, decision support systems.
* **Education:** Predicting student performance or dropout risk.

**Advantages:**

1. Simple to understand and interpret (like human decision-making).
2. Can handle both numerical and categorical data (in scikit-learn, categorical features must first be encoded as numbers).
3. Requires little data preprocessing (no need for scaling/normalization).
4. Works well for feature selection by identifying important variables.

**Limitations:**

1. Prone to **overfitting** if not pruned.
2. Can be **unstable** (small data changes may create a different tree).
3. Impurity-based splitting is biased toward features with many distinct values.
4. Often less accurate than ensemble methods like Random Forests.




In [1]:
#QUESTION 6
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Predict on test data
y_pred = clf.predict(X_test)

# Print model accuracy
print("Model Accuracy:", accuracy_score(y_test, y_pred))

# Print feature importances
print("Feature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Model Accuracy: 1.0
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876


In [2]:
#QUESTION 7
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Decision Tree with max_depth = 3
clf_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_limited.fit(X_train, y_train)
y_pred_limited = clf_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# Fully grown Decision Tree (no depth limit)
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Print results
print("Decision Tree with max_depth=3 Accuracy:", accuracy_limited)
print("Fully-grown Decision Tree Accuracy:", accuracy_full)


Decision Tree with max_depth=3 Accuracy: 1.0
Fully-grown Decision Tree Accuracy: 1.0


In [3]:
#QUESTION 8
# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Decision Tree Regressor
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Predict on test data
y_pred = reg.predict(X_test)

# Calculate and print Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# Print feature importances
print("Feature Importances:")
for feature, importance in zip(housing.feature_names, reg.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Mean Squared Error (MSE): 0.5280096503174904
Feature Importances:
MedInc: 0.5235
HouseAge: 0.0521
AveRooms: 0.0494
AveBedrms: 0.0250
Population: 0.0322
AveOccup: 0.1390
Latitude: 0.0900
Longitude: 0.0888


In [4]:
#QUESTION 9
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define the Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Define the grid of hyperparameters
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10, 20]
}

# Set up GridSearchCV
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best parameters and the best model (GridSearchCV already refits it on the full training set by default)
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Predict on test data and calculate accuracy
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", best_params)
print("Model Accuracy with Best Parameters:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Model Accuracy with Best Parameters: 1.0


QUESTION 10

**Step-by-Step Process for Building a Decision Tree Model in Healthcare**

1. **Handle Missing Values:**

   * **Identify missing data** in the dataset.
   * For **numerical features**, impute missing values using the **mean or median**.
   * For **categorical features**, use the **mode** or a special category like “Unknown.”
   * Optionally, use more advanced techniques like **KNN imputation** for better accuracy.

2. **Encode Categorical Features:**

   * Convert categorical variables into numerical format, because scikit-learn's Decision Trees require numeric input.
   * For **nominal categories**, use **One-Hot Encoding**.
   * For **ordinal categories**, map them to **integer values** representing their order.

3. **Train a Decision Tree Model:**

   * Split the dataset into **training and testing sets**.
   * Initialize a **Decision Tree Classifier**, choosing a criterion like **Gini** or **Entropy**.
   * Fit the model on the **training data**.

4. **Tune Hyperparameters:**

   * Use **GridSearchCV** or **RandomizedSearchCV** to find the best parameters:

     * `max_depth` (controls tree depth)
     * `min_samples_split` (minimum samples to split a node)
     * `min_samples_leaf` (minimum samples in a leaf)
     * `criterion` (Gini or Entropy)
   * Helps reduce **overfitting** and improves generalization.

5. **Evaluate Model Performance:**

   * Use metrics suitable for classification:

     * **Accuracy**: Overall correctness.
     * **Precision & Recall**: Recall matters most when missing a disease case (false negative) is costly; precision matters when false alarms (false positives) are costly.
     * **F1-Score**: Balance between precision and recall.
     * **Confusion Matrix**: To understand types of errors.
   * Optionally, use **cross-validation** to ensure stability.

6. **Business Value in Real-World Healthcare:**

   * **Early detection of diseases** helps in timely treatment and improves patient outcomes.
   * **Resource optimization:** Hospitals can allocate resources efficiently to high-risk patients.
   * **Decision support for doctors:** Provides data-driven insights to complement medical expertise.
   * **Reducing healthcare costs** by focusing preventive care on the patients most likely to develop the disease.
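
Steps 1 to 5 can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption for illustration: the patient records are synthetic, the column names are made up, and the target rule is arbitrary. Recall is used as the tuning metric on the assumption that false negatives are the costlier error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records; every column and value is made up.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(130, 20, n),
    "smoker": rng.choice(["yes", "no"], n),
    "sex": rng.choice(["F", "M"], n),
})
y = ((df["age"] > 60) & (df["blood_pressure"] > 130)).astype(int)

# Knock out some values to simulate missing data.
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker", "sex"]

# Steps 1 and 2: impute missing values, then one-hot encode nominal features.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: preprocessing and the tree live in one pipeline to avoid data leakage.
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=42
)

# Step 4: tune hyperparameters with cross-validation.
param_grid = {
    "tree__max_depth": [2, 3, 4, 5],
    "tree__min_samples_leaf": [1, 5, 10],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
y_pred = search.predict(X_test)
print("Best parameters:", search.best_params_)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

Keeping imputation and encoding inside the pipeline means they are fitted only on the training folds during cross-validation, so the test set stays unseen until the final evaluation.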



