**Decision Tree | Assignment**

**Question 1:** What is a Decision Tree, and how does it work in the context of classification?

  - A Decision Tree is a supervised learning algorithm used for both classification and regression. In classification, it works by splitting the dataset into smaller groups based on feature values, forming a tree-like structure.

  - Each internal node represents a test on a feature (for example, “age > 30?”), each branch is the outcome of the test (Yes/No), and each leaf node represents the final class label.

  - The tree is built using measures like Gini impurity or Entropy, which decide the best feature to split on at each step. The process continues until the data is well separated or a stopping condition (like maximum depth) is reached.
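The node, branch, and leaf structure described above can be seen by printing a small fitted tree. This is a minimal sketch, assuming scikit-learn is installed; the choice of the Iris dataset and `max_depth=2` is only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree so the printed rules stay readable.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Each "|---" line is a test on a feature; each "class:" line is a leaf.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented test in the output is an internal node, and each `class:` line is a leaf holding the predicted label.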

**Question 2:** Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

  - Decision Trees use impurity measures to decide which feature to split on at each step. The two common measures are Gini Impurity and Entropy.

  - Gini Impurity: Measures how often a randomly chosen sample would be misclassified if it were labeled at random according to the class distribution in a node.

    - Value = 0 → all samples belong to one class (pure node).

    - Higher value → more mixed classes.

  - Entropy: Measures the disorder or uncertainty in a node.

    - Value = 0 → node is pure.

    - Higher value → node contains more mixed classes.

  - Impact on splits: When building a tree, the algorithm looks for the split that reduces impurity the most.

    - Using Gini or Entropy, the best split creates child nodes that are purer than the parent node.

    - This helps the tree separate the classes effectively on the training data.
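Both measures can be computed directly from the class proportions in a node. The sketch below uses the standard definitions (Gini = 1 − Σ pᵢ², Entropy = −Σ pᵢ log₂ pᵢ); the helper names `gini` and `entropy` and the toy label lists are made up for illustration.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Return the proportion of each class among the labels in a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """Entropy in bits: sum of -p * log2(p) over the classes present."""
    return sum(-p * math.log2(p) for p in class_probabilities(labels))


pure_node = ["a", "a", "a", "a"]
mixed_node = ["a", "a", "b", "b"]

print("pure :", gini(pure_node), entropy(pure_node))
print("mixed:", gini(mixed_node), entropy(mixed_node))
```

The pure node scores 0 on both measures, while the evenly mixed two-class node reaches the two-class maximum of each (0.5 for Gini, 1 bit for Entropy).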

**Question 3:** What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.


  - Pruning is a way to control the growth of a Decision Tree so it does not become too complex. There are two types:

  - Pre-Pruning (Early Stopping): Here, we stop the tree from growing too deep right at the training stage. For example, we can set limits like maximum depth, minimum samples per leaf, or minimum information gain.

  - Advantage: It saves time and reduces the risk of overfitting because the tree never grows unnecessarily large.

  - Post-Pruning (Pruning After Training): In this method, we first grow the tree fully and then remove branches that do not add much value. The idea is to simplify the model while keeping accuracy close to the original.

  - Advantage: It can keep useful splits that pre-pruning would have cut off early, because the tree is allowed to explore all splits before being simplified. A split that looks weak on its own may enable strong splits below it.
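The two approaches can be compared side by side in scikit-learn. This is a sketch under stated assumptions: pre-pruning is shown with `max_depth` and `min_samples_leaf`, post-pruning with minimal cost-complexity pruning (`ccp_alpha`), and the alpha picked here (the middle of the pruning path) is an arbitrary illustrative choice, not a tuned value.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: growth limits are enforced while the tree is being built.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute the candidate alphas
# along its cost-complexity pruning path.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Arbitrary pick for illustration; in practice alpha would be chosen
# by cross-validation. max() guards against tiny negative rounding.
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post.fit(X_train, y_train)

print("Leaves (full):", full.get_n_leaves())
print("Leaves (pre-pruned):", pre.get_n_leaves())
print("Leaves (post-pruned):", post.get_n_leaves())
```

Larger values of `ccp_alpha` remove more branches, so the post-pruned tree never has more leaves than the fully grown one.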

**Question 4:** What is Information Gain in Decision Trees, and why is it important for choosing the best split?

  - Information Gain measures how much uncertainty or impurity is reduced after splitting a node in a Decision Tree.

It is calculated as:

  - **Information Gain = Impurity of parent node - (Weighted sum of impurities of child nodes)**

  - A high Information Gain means the split creates purer child nodes.

  - A low Information Gain means the split does not help much in separating classes.

  - Why it is important: Decision Trees use Information Gain to choose the best feature and value to split on. By selecting the split with the highest Information Gain at each node, the tree gets the purest child nodes available at that step. This choice is greedy, so it works well in practice but does not guarantee the best possible tree overall.
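The formula above can be worked through on a tiny example. This sketch uses Entropy as the impurity measure; the function names and the toy "yes"/"no" labels are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of the class labels in one node."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely.
perfect_split = [["yes"] * 5, ["no"] * 5]

# A split whose children are just as mixed as the parent.
useless_split = [["yes", "yes", "no", "no"], ["yes"] * 3 + ["no"] * 3]

print("Perfect split gain:", information_gain(parent, perfect_split))
print("Useless split gain:", information_gain(parent, useless_split))
```

The perfect split removes all of the parent's 1 bit of entropy, while the useless split removes none, so the tree would choose the first.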


**Question 5:** What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

  - Real-world applications:

    - Banking: Predicting whether a customer will default on a loan.

    - Healthcare: Diagnosing diseases based on patient symptoms and test results.

    - E-commerce: Predicting whether a customer will buy a product or respond to a marketing campaign.

    - Finance: Fraud detection by classifying transactions as normal or suspicious.

  - Advantages:

    - Easy to understand and visualize; even non-technical people can follow the decisions.

    - Can handle both numerical and categorical data.

    - No need to scale or normalize features.

  - Limitations:

    - Can overfit if the tree grows too deep.

    - Small changes in data can lead to completely different trees.

    - Usually less accurate than ensemble methods like Random Forest or Gradient Boosting.



In [1]:
''' Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances '''

# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset (built-in in sklearn)
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)  # features
y = data.target  # labels

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Classifier using Gini criterion
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate and print accuracy
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)

# Print feature importances
print("\nFeature Importances:")
for name, importance in zip(X.columns, model.feature_importances_):
    print(f"{name}: {importance:.4f}")


Accuracy: 1.0

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


In [2]:
''' Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree. '''

# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree with max_depth=3
tree_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_limited.fit(X_train, y_train)
y_pred_limited = tree_limited.predict(X_test)
acc_limited = accuracy_score(y_test, y_pred_limited)

# Train fully-grown Decision Tree (no depth limit)
tree_full = DecisionTreeClassifier(random_state=42)
tree_full.fit(X_train, y_train)
y_pred_full = tree_full.predict(X_test)
acc_full = accuracy_score(y_test, y_pred_full)

# Print accuracies
print("Accuracy with max_depth=3:", acc_limited)
print("Accuracy with fully-grown tree:", acc_full)

Accuracy with max_depth=3: 1.0
Accuracy with fully-grown tree: 1.0
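
Both trees score 1.0 here, but the test split holds only 30 samples, so a single split cannot show a difference between them. One way to get a steadier comparison is cross-validation over the whole dataset. The sketch below is an illustrative addition, assuming 5-fold cross-validation is acceptable; the exact scores may vary with the scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

cv_results = {}
for label, depth in [("max_depth=3", 3), ("fully grown", None)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # Five accuracy scores, one per held-out fold.
    scores = cross_val_score(tree, X, y, cv=5, scoring="accuracy")
    cv_results[label] = scores
    print(f"{label}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

If the two means are within about one standard deviation of each other, the shallower tree is usually the safer choice because it is simpler.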


In [3]:
''' Question 8: Write a Python program to:
● Load the California Housing dataset from sklearn
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances '''

import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Regressor
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate and print Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# Print feature importances
print("\nFeature Importances:")
for name, importance in zip(X.columns, model.feature_importances_):
    print(f"{name}: {importance:.4f}")

Mean Squared Error (MSE): 0.495235205629094

Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829
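
The MSE above is in squared target units, which is hard to interpret directly. Assuming the target is the median house value expressed in units of $100,000 (as the scikit-learn dataset description states), taking the square root gives an error in the target's own units. The MSE value below is copied from the output above rather than recomputed.

```python
import math

# MSE copied from the printed output of the regressor above.
mse = 0.495235205629094

# RMSE is in the same units as the target (assumed: $100,000 per unit).
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:.4f} target units (about ${rmse * 100_000:,.0f})")
```

Under that assumption, a typical prediction error is roughly 0.70 target units, which is easier to reason about than the squared figure.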


In [4]:
''' Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy '''

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set up Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    'max_depth': [1, 2, 3, 4, 5],
    'min_samples_split': [2, 3, 4, 5]
}

# Apply GridSearchCV
grid = GridSearchCV(dt, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Best parameters and accuracy
best_params = grid.best_params_
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print("Best Parameters:", best_params)
print("Accuracy with best parameters:", acc)

Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Accuracy with best parameters: 1.0
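
The test accuracy above comes from 30 samples, so it is worth also looking at the cross-validated scores that GridSearchCV used to pick the parameters. The sketch below repeats the same search and inspects `best_score_` and `cv_results_`; it is an illustrative addition, and the exact numbers may differ between scikit-learn versions.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {
    "max_depth": [1, 2, 3, 4, 5],
    "min_samples_split": [2, 3, 4, 5],
}
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring="accuracy"
)
grid.fit(X_train, y_train)

# Mean cross-validated accuracy of the winning parameter combination.
print("Best mean CV accuracy:", round(grid.best_score_, 4))

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(grid.cv_results_)
columns = ["param_max_depth", "param_min_samples_split",
           "mean_test_score", "std_test_score", "rank_test_score"]
print(results.sort_values("rank_test_score")[columns].head())
```

When several combinations have nearly the same mean score, the simpler setting (smaller `max_depth`) is often a reasonable pick.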


In [6]:
''' Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting. '''

'''Answer:

Handle Missing Values:

First, check which columns have missing values.

For numerical features, fill missing values with the mean or median.

For categorical features, fill missing values with the mode (most common category) or use a special category like “Unknown”.

This lets the model process every row without errors. Compute the fill values on the training set only, so that no information leaks from the test set.

Encode Categorical Features:

Convert categorical features into numerical form since Decision Trees in sklearn require numeric input.

For features with no order, use one-hot encoding.

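For ordinal features, use ordinal encoding (for example, sklearn's OrdinalEncoder) so that the integer codes follow the category order. LabelEncoder is intended for the target column, not for input features.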

Train a Decision Tree Model:

Split the dataset into training and test sets.

Train a Decision Tree Classifier, using a criterion like Gini or Entropy.

Initially, you can use default parameters to see basic performance.

Tune Hyperparameters:

Use GridSearchCV or RandomizedSearchCV to find the best parameters.

Important parameters include:

max_depth → limits tree depth to prevent overfitting

min_samples_split → minimum samples required to split a node

min_samples_leaf → minimum samples required at a leaf node

max_features → number of features considered for each split

This helps make the model both accurate and simple.

Evaluate Performance:

Use metrics like accuracy, precision, recall, and F1-score.

In healthcare, recall is especially important because missing a patient with a disease is risky.

Use a confusion matrix to see false positives and false negatives.

Business Value:

This model can support early detection of diseases, allowing doctors to intervene sooner.

It can prioritize high-risk patients for further tests.

Helps the healthcare company reduce costs by avoiding unnecessary tests and focusing resources efficiently.

The model can also identify important risk factors, giving insights into which features contribute most to the disease. '''
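
The steps above can be sketched as a single scikit-learn pipeline. This is a minimal sketch, not a definitive implementation: the patient data is synthetic, the column names (`age`, `blood_pressure`, `smoker`) and the labeling rule are hypothetical, and the parameter grid is deliberately small. Recall is used as the tuning metric because the answer above singles it out for healthcare.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (hypothetical columns and rule).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
y = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)

# Inject missing values into one numeric and one categorical column.
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker"]

# Median for numeric gaps; an explicit "Unknown" category, then one-hot.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

pipe = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Preprocessing is refit inside each CV fold, which avoids leakage.
search = GridSearchCV(
    pipe,
    {"tree__max_depth": [2, 3, 4, 5], "tree__min_samples_leaf": [1, 5, 10]},
    cv=5,
    scoring="recall",
)
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
test_recall = recall_score(y_test, y_pred)
print("Best parameters:", search.best_params_)
print("Test recall:", round(test_recall, 3))
print(classification_report(y_test, y_pred))
```

Putting imputation and encoding inside the pipeline means the fill values and categories are learned from the training folds only, which matches the leakage concern in the missing-value step.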
