# Decision Tree Assignment


1.  What is a Decision Tree, and how does it work in the context of
classification?
    - A Decision Tree is a supervised machine learning algorithm commonly used for classification tasks. It works by breaking down a dataset into smaller and smaller subsets through a series of decisions, forming a tree-like structure. Each internal node of the tree represents a test on a feature, each branch represents the outcome of that test, and each leaf node corresponds to a class label or prediction. The algorithm selects the best feature to split the data at each step using measures such as Information Gain or Gini Impurity, aiming to create the most distinct separation between classes. This process continues recursively until the data in a node belong to the same class or a stopping condition like maximum depth is reached. In classification, the final prediction is made by following the path of conditions down the tree until reaching a leaf node, which assigns the class label. Decision Trees are easy to interpret and visualize, but they can overfit the training data if not properly pruned or regularized.
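The prediction step described above can be sketched in a few lines of plain Python. This is only an illustration: the tree is hand-built rather than learned, and the feature names and thresholds (`petal_length`, `petal_width`, 2.5, 1.7) are assumed values chosen to resemble an Iris-style tree.

```python
# Hypothetical hand-built tree: internal nodes test one feature against a
# threshold, leaf nodes hold a class label. Nothing here is learned from data.
tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}


def predict(node, sample):
    """Follow the test at each internal node until a leaf is reached."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


print(predict(tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(tree, {"petal_length": 4.5, "petal_width": 1.4}))  # versicolor
print(predict(tree, {"petal_length": 5.8, "petal_width": 2.2}))  # virginica
```

Each call walks exactly one root-to-leaf path, which is why a fitted tree can be read as a set of if/else rules.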

2. Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?
  - In Decision Trees, impurity measures are used to decide how to split the data at each node so that the resulting subsets are as pure (homogeneous) as possible. Two of the most common impurity measures are Gini Impurity and Entropy. Gini Impurity measures how often a randomly chosen sample from a node would be incorrectly classified if it were randomly labeled according to the class distribution in that node. A Gini value of 0 means the node is pure (all samples belong to one class), while higher values indicate more impurity.
  Entropy, on the other hand, comes from information theory. It measures the amount of uncertainty (or disorder) in a node. Entropy is 0 when the node is pure and reaches its maximum when the classes are equally mixed. When building a Decision Tree, these measures determine how the splits are chosen. The algorithm evaluates the candidate features and thresholds, then selects the split that gives the greatest reduction in impurity. This reduction is called Information Gain when entropy is used, or simply the decrease in Gini Impurity when Gini is used. In essence, both measures aim to make the resulting child nodes as pure as possible, which improves the tree’s ability to classify new data accurately.
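As a rough illustration of the two measures, the sketch below computes them from a list of class labels using the standard definitions: Gini is 1 minus the sum of squared class proportions, and entropy is the negative sum of each proportion times its base-2 logarithm. The helper names `gini` and `entropy` and the toy label lists are assumptions made for this example.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_k ** 2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: sum(p_k * log2(1 / p_k)), i.e. -sum(p_k * log2(p_k))."""
    n = len(labels)
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())


pure = ["a", "a", "a", "a"]      # one class only
mixed = ["a", "a", "b", "b"]     # two classes, evenly mixed
skewed = ["a"] * 9 + ["b"]       # mostly one class

print(gini(pure), entropy(pure))                          # 0.0 0.0
print(gini(mixed), entropy(mixed))                        # 0.5 1.0
print(round(gini(skewed), 3), round(entropy(skewed), 3))  # 0.18 0.469
```

Both measures are zero for the pure node and largest for the evenly mixed one, so they usually rank candidate splits in a similar order.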

3. What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.
   - Pre-pruning and Post-pruning are two techniques used to prevent Decision Trees from overfitting.

Pre-pruning (Early Stopping): In pre-pruning, the tree-building process is stopped early, before it grows into a fully complex tree. This is done by setting conditions such as maximum depth, minimum number of samples per node, or minimum information gain required to split. The idea is to stop splitting when further divisions are unlikely to provide significant improvement.

Practical Advantage: It reduces training time and computational cost since the tree is kept shallow and simple.

Post-pruning (Pruning after Full Growth): In post-pruning, the tree is first allowed to grow fully, possibly leading to overfitting, and then it is pruned back by removing branches that do not contribute significantly to accuracy. This is often done using validation data or cost-complexity pruning to balance accuracy and simplicity.

Practical Advantage: It often generalizes better to unseen data than pre-pruning, because a split that looks weak on its own is kept long enough to reveal useful splits beneath it, and only the branches that fail to improve validation performance are removed.
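The two approaches can be compared in scikit-learn roughly as follows. This is a sketch under assumptions: the pre-pruning limits (`max_depth=3`, `min_samples_leaf=5`) are arbitrary example values, and the post-pruned tree picks its `ccp_alpha` by accuracy on a single held-out validation split, which is one simple selection scheme among several.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: growth stops as soon as one of the limits is hit.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre.fit(X_train, y_train)

# Reference: a fully grown tree with no limits.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Post-pruning: compute the cost-complexity pruning path of the full tree,
# then keep the alpha that scores best on the validation split.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    candidate = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    score = candidate.fit(X_train, y_train).score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (simpler tree)
        best_alpha, best_score = alpha, score

post = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42)
post.fit(X_train, y_train)

print("pre-pruned leaves: ", pre.get_n_leaves())
print("full tree leaves:  ", full.get_n_leaves())
print("post-pruned leaves:", post.get_n_leaves(), "with ccp_alpha =", best_alpha)
```

The loop refits one tree per candidate alpha, which shows the extra cost of post-pruning compared with simply capping the depth up front.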

4. What is Information Gain in Decision Trees, and why is it important for
choosing the best split?
   - Information Gain (IG) is a metric used in Decision Trees to decide which feature and threshold should be chosen for splitting the dataset at a node. It is based on the concept of entropy from information theory, which measures the impurity or uncertainty in a dataset. Information Gain is the entropy of the parent node minus the weighted average entropy of the child nodes, so it measures how much “uncertainty” is removed by splitting on a particular feature. A higher IG means the split produces more homogeneous subsets (purer nodes). Information Gain is crucial because it gives the algorithm a way to rank candidate splits at each step of tree construction. Without it (or a similar measure such as Gini Impurity), the tree would have no criterion for deciding which splits organize the data better by class. Choosing the split with the highest IG at each node is a greedy strategy: it does not guarantee the best possible tree overall, but it tends to separate the classes in few splits, which keeps the tree compact.
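A small worked example may help here. The sketch below computes Information Gain as the parent entropy minus the size-weighted entropy of the children; the function names and the toy "yes"/"no" labels are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# Split A separates the classes completely.
split_a = [["yes"] * 5, ["no"] * 5]
# Split B leaves both children as mixed as the parent.
split_b = [["yes", "yes", "no", "no"], ["yes", "yes", "yes", "no", "no", "no"]]

print(round(information_gain(parent, split_a), 3))  # 1.0
print(round(information_gain(parent, split_b), 3))  # 0.0
```

A tree builder using entropy would pick split A, since it removes all of the parent's uncertainty while split B removes none.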

5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?
   - Decision Trees are widely used in real-world applications such as medical diagnosis in healthcare, credit scoring and fraud detection in finance, customer segmentation and churn prediction in marketing, and quality control in manufacturing. Their main advantages are that they are simple to interpret, can handle both numerical and categorical data without feature scaling, and work well even with smaller datasets. However, they also have limitations, such as a tendency to overfit if not pruned, instability with small changes in data, bias toward features with many categories, and generally lower accuracy compared to ensemble methods like Random Forests.


In [None]:
'''6. Dataset Info:
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or
provided CSV).
● Boston Housing Dataset for regression tasks
(sklearn.datasets.load_boston() or provided CSV).
Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances'''
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print("Model Accuracy:", accuracy_score(y_test, y_pred))
print("Feature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")




Model Accuracy: 1.0
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876


In [None]:
'''7. Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.'''


from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_limited.fit(X_train, y_train)
y_pred_limited = clf_limited.predict(X_test)
acc_limited = accuracy_score(y_test, y_pred_limited)

clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
acc_full = accuracy_score(y_test, y_pred_full)

print("Accuracy with max_depth=3:", acc_limited)
print("Accuracy with fully-grown tree:", acc_full)


Accuracy with max_depth=3: 1.0
Accuracy with fully-grown tree: 1.0


In [None]:
'''Question 8: Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances'''

from sklearn.datasets import fetch_openml
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

boston = fetch_openml(name="boston", version=1, as_frame=True)
X, y = boston.data, boston.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error:", mse)
print("Feature Importances:")
for feature, importance in zip(X.columns, regressor.feature_importances_):
    print(f"{feature}: {importance:.4f}")



Mean Squared Error: 11.588026315789474
Feature Importances:
CRIM: 0.0585
ZN: 0.0010
INDUS: 0.0099
CHAS: 0.0003
NOX: 0.0071
RM: 0.5758
AGE: 0.0072
DIS: 0.1096
RAD: 0.0016
TAX: 0.0022
PTRATIO: 0.0250
B: 0.0119
LSTAT: 0.1900


In [None]:
'''Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy'''
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid_search.best_params_)
print("Accuracy with Best Model:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Accuracy with Best Model: 1.0


10. Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.
   - 1. Handle Missing Values: Identify which features have missing data and impute them, using the mean or median for numerical features and the mode or an “Unknown” category for categorical features. Fit the imputer on the training data only, so that statistics from the test set do not leak into the model.

     2. Encode Categorical Features: Apply label (ordinal) encoding to ordinal variables and one-hot encoding to nominal variables.

     3. Train Decision Tree: Split the data into training and testing sets, initialize a Decision Tree classifier, and fit it to the training data.

     4. Hyperparameter Tuning: Use GridSearchCV or RandomizedSearchCV to optimize max_depth, min_samples_split, min_samples_leaf, and criterion.

     5. Evaluate Performance: Assess accuracy, precision, recall, F1-score, and ROC-AUC, and analyze the confusion matrix and feature importances. For disease prediction, recall deserves particular attention because a missed case is usually costlier than a false alarm.

     6. Business Value: The model enables earlier disease detection, gives clinicians interpretable rules, helps allocate resources, and reduces unnecessary tests, which improves patient outcomes and operational efficiency.
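The steps above could be wired together in scikit-learn roughly as follows. This is a sketch, not a reference implementation: the patient data is synthetic, every column name (`age`, `blood_pressure`, `smoker`, `region`) is hypothetical, and the parameter grid is deliberately small. Putting the imputers and encoder inside a `Pipeline` means they are refit on each training fold, which keeps test data out of the preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records; all column names are hypothetical.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east"], n),
})
y = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)  # made-up target

# Knock out roughly 10% of two columns to simulate missing values.
df.loc[rng.random(n) < 0.1, "blood_pressure"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker", "region"]

# Steps 1 and 2: impute, then one-hot encode the categorical columns.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: preprocessing and the tree in one estimator.
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=42
)

# Step 4: tune the tree's hyperparameters with cross-validation.
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_leaf": [1, 5, 10],
    "tree__criterion": ["gini", "entropy"],
}
grid = GridSearchCV(model, param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
print("Best parameters:", grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
print("ROC-AUC:", auc)
```

Scoring the search on F1 rather than accuracy is a design choice for an imbalanced target; a recall-oriented scorer could be substituted if missed cases are the main concern.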