# Decision Tree Assignment


## Theoretical

**Q1. What is a Decision Tree, and how does it work?**

A Decision Tree is a flowchart-like model used for classification and regression. It recursively splits the dataset into subsets based on the values of input features. At each internal node, the algorithm selects the feature and threshold that best split the data according to a criterion such as Gini Impurity or Entropy (for classification) or variance reduction (for regression). Each leaf node holds the final prediction.

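**Q2. What are impurity measures in Decision Trees?**

Impurity measures quantify how mixed the class labels are within a node: a node is pure (impurity 0) when all of its samples belong to one class. The tree prefers the split that reduces impurity the most. Common measures include Gini Impurity and Entropy.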

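**Q3. What is the mathematical formula for Gini Impurity?**

$$Gini = 1 - \sum_{i=1}^{n} p_i^2$$ where $p_i$ is the proportion of samples belonging to class $i$ in a node and $n$ is the number of classes. A pure node has Gini 0; the maximum, $1 - 1/n$, is reached when all classes are equally frequent.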
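As a rough illustration of the formula, the sketch below computes Gini Impurity from a list of class labels in plain Python. The function name `gini_impurity` and the toy label lists are illustrative assumptions, not part of any library.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a sequence of class labels (hypothetical helper)."""
    n = len(labels)
    if n == 0:
        return 0.0
    # p_i is the share of samples in class i; Gini = 1 - sum(p_i^2)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


pure = gini_impurity(["a", "a", "a", "a"])   # single class, no impurity
mixed = gini_impurity(["a", "a", "b", "b"])  # two equally frequent classes
print("pure node:", pure)
print("50/50 node:", mixed)
```

For two equally frequent classes the value is $1 - (0.5^2 + 0.5^2) = 0.5$, the largest Gini a binary node can have.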

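**Q4. What is the mathematical formula for Entropy?**

$$Entropy = -\sum_{i=1}^{n} p_i \log_2(p_i)$$ where $p_i$ is the proportion of samples belonging to class $i$ in a node and $n$ is the number of classes, with the convention $0 \log_2 0 = 0$. A pure node has Entropy 0; the maximum, $\log_2 n$, is reached when all classes are equally frequent.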
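The same calculation can be sketched for Entropy. This is a minimal plain-Python version; the function name `entropy` and the example label lists are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels (hypothetical helper)."""
    n = len(labels)
    if n == 0:
        return 0.0
    # Only classes that actually occur are counted, so log2(0) never arises
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


pure = entropy(["a", "a", "a"])
balanced = entropy(["a", "a", "b", "b"])
uniform3 = entropy(["a", "b", "c"])
print("pure node:", pure)
print("two balanced classes:", balanced)
print("three balanced classes:", uniform3)
```

With three equally frequent classes the result equals $\log_2 3$, the upper bound for a three-class node.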

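**Q5. What is Information Gain, and how is it used in Decision Trees?**

Information Gain is the reduction in impurity after a dataset is split on an attribute. It is calculated as the difference between the impurity of the parent node and the weighted impurity of the child nodes:

$$IG = I(\text{parent}) - \sum_{k} \frac{N_k}{N} \, I(\text{child}_k)$$

where $I$ is the impurity measure (Gini Impurity or Entropy), $N$ is the number of samples in the parent node, and $N_k$ is the number of samples in child $k$. At each node the tree evaluates the candidate splits and picks the one with the highest Information Gain.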
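To make the parent-minus-weighted-children calculation concrete, here is a small sketch in plain Python. It uses Gini as the impurity measure by default; the helper names `gini` and `information_gain` and the toy labels are illustrative assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, children, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


parent = [0, 0, 0, 0, 1, 1, 1, 1]
# A weak split leaves both children mixed; a perfect split makes both pure
weak_gain = information_gain(parent, [[0, 0, 0, 1], [0, 1, 1, 1]])
perfect_gain = information_gain(parent, [[0, 0, 0, 0], [1, 1, 1, 1]])
print("weak split gain:", weak_gain)
print("perfect split gain:", perfect_gain)
```

The perfect split recovers the full parent impurity (0.5 here), while the weak split only removes part of it, which is why the tree would prefer the former.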

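**Q6. What is the difference between Gini Impurity and Entropy?**

Both are measures of impurity, and both are 0 for a pure node. Gini is simpler and slightly faster to compute because it avoids the logarithm; it ranges from 0 to $1 - 1/n$, whereas Entropy ranges from 0 to $\log_2 n$. Entropy is sometimes reported to produce slightly more balanced splits, but in practice the two criteria usually select very similar trees.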

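**Q7. What is the mathematical explanation behind Decision Trees?**

Decision Trees build a model greedily: at each node they evaluate the candidate splits and select the one that maximizes Information Gain, which is equivalent to minimizing the weighted impurity of the child nodes. The same procedure is applied recursively to each child until a stopping condition is met, such as a maximum depth, a minimum number of samples in a node, or the node becoming pure.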
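The greedy split selection can be sketched for a single numeric feature: try a threshold between each pair of adjacent distinct values and keep the one with the lowest weighted Gini. This is a simplified illustration, not how any particular library implements it; `best_threshold` and the toy petal measurements are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_impurity) of the best split `value <= threshold`."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))  # no split yet: impurity of the parent node
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]  # made-up values for illustration
species = ["setosa"] * 3 + ["versicolor"] * 3
threshold, impurity = best_threshold(petal_length, species)
print("best threshold:", threshold, "weighted impurity:", impurity)
```

A full tree builder would run this search over every feature, split the data on the winner, and recurse into both halves until a stopping condition applies.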

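**Q8. What is Pre-Pruning in Decision Trees?**

Pre-Pruning (early stopping) halts tree growth before the tree fits the training data perfectly. It uses constraints such as a maximum depth (`max_depth`), a minimum number of samples required to split a node (`min_samples_split`), or a minimum number of samples per leaf (`min_samples_leaf`).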
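A short sketch of pre-pruning with scikit-learn, comparing an unconstrained tree with one limited by `max_depth`, `min_samples_split`, and `min_samples_leaf`. The specific limit values are arbitrary choices for illustration, and both trees are fitted on the whole Iris dataset only to inspect their size.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Unconstrained tree: grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_iris, y_iris)

# Pre-pruned tree: growth stops as soon as any limit is reached
pre_pruned = DecisionTreeClassifier(
    max_depth=3,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=0,
).fit(X_iris, y_iris)

print("full tree:  depth", full_tree.get_depth(), "leaves", full_tree.get_n_leaves())
print("pre-pruned: depth", pre_pruned.get_depth(), "leaves", pre_pruned.get_n_leaves())
```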

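**Q9. What is Post-Pruning in Decision Trees?**

Post-Pruning lets the tree grow fully and then removes branches that do not improve generalization. A common method is cost-complexity pruning, which trades training error against the number of leaves through a penalty parameter (`ccp_alpha` in scikit-learn).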

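**Q10. What is the difference between Pre-Pruning and Post-Pruning?**

Pre-Pruning limits overfitting during training by stopping growth early; it is cheap, but it may stop before a useful deeper split is found. Post-Pruning removes overfit branches after the full tree has been trained; it costs more computation, but each branch is judged with its whole subtree in view.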

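**Q11. What is a Decision Tree Regressor?**

A Decision Tree Regressor is a decision tree used for regression tasks, that is, for predicting continuous values. Splits are chosen to minimize an error measure such as mean squared error within the child nodes, and each leaf predicts the mean target value of the training samples that reach it.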
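As a small illustration, the sketch below fits a depth-1 regression tree to made-up data with two clearly separated groups. With scikit-learn's default squared-error criterion, each leaf returns the mean target of its training samples; the arrays are toy values, not taken from any dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Two groups of points: targets near 6 for small x, near 21 for large x
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([5.0, 6.0, 7.0, 20.0, 21.0, 22.0])

# A single split (max_depth=1) separates the two groups
stump = DecisionTreeRegressor(max_depth=1).fit(X_toy, y_toy)
pred_low, pred_high = stump.predict(np.array([[2.5], [11.5]]))

print("prediction for x=2.5: ", pred_low)   # mean of 5, 6, 7
print("prediction for x=11.5:", pred_high)  # mean of 20, 21, 22
```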

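**Q12. What are the advantages and disadvantages of Decision Trees?**

- Advantages: Easy to interpret and visualize, handles both numerical and categorical data, needs little data preparation (no feature scaling), non-parametric.
- Disadvantages: Prone to overfitting when grown without constraints, and unstable, since small variations in the training data can produce a very different tree.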

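**Q13. How does a Decision Tree handle missing values?**

It depends on the implementation. Some implementations use surrogate splits, which fall back to a correlated feature when the primary split feature is missing, and some skip features with missing values when evaluating a split. Other options are to impute missing values before training or, in libraries that support it, to learn at each split which branch missing values should follow.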
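When the tree implementation at hand does not handle missing values itself, a common workaround is to impute them before training. A minimal sketch, assuming scikit-learn's `SimpleImputer` with a mean strategy and a made-up toy array:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data with missing entries marked as NaN (values are invented)
X_missing = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [2.0, 2.5],
    [8.0, 9.0],
    [7.0, np.nan],
    [9.0, 8.0],
])
y_missing = np.array([0, 0, 0, 1, 1, 1])

# The imputer replaces each NaN with the column mean learned during fit,
# then the tree is trained on the completed data
clf = make_pipeline(
    SimpleImputer(strategy="mean"),
    DecisionTreeClassifier(random_state=0),
)
clf.fit(X_missing, y_missing)

# The same imputation is applied automatically at prediction time
preds = clf.predict(np.array([[8.5, np.nan]]))
print("predicted class:", preds[0])
```

Keeping the imputer inside the pipeline means the fill values are learned from the training data only, which avoids leaking test-set statistics.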

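**Q14. How does a Decision Tree handle categorical features?**

In principle, a categorical feature can be split directly by sending one subset of its categories to each branch, and some implementations do this natively. Others, including scikit-learn, require numeric input, so categorical features must first be encoded, typically as dummy (one-hot) variables or ordinal codes.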
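One way to feed a categorical feature to a scikit-learn tree is one-hot (dummy) encoding. A minimal sketch, assuming a single made-up `color` column and using `OneHotEncoder` inside a pipeline:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# One categorical column; the label is 1 only for "red" (invented data)
X_color = np.array(
    [["red"], ["green"], ["blue"], ["red"], ["green"], ["blue"]], dtype=object
)
y_color = np.array([1, 0, 0, 1, 0, 0])

# Each category becomes its own 0/1 column before the tree sees the data;
# handle_unknown="ignore" encodes unseen categories as all zeros
clf = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(random_state=0),
)
clf.fit(X_color, y_color)

pred = clf.predict(np.array([["red"]], dtype=object))
print("prediction for 'red':", pred[0])
```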

**Q15. What are some real-world applications of Decision Trees?**

Typical applications include loan approval, medical diagnosis, customer segmentation, and fraud detection.

## Practical

**Q1. Train a Decision Tree Classifier on the Iris dataset and print the model accuracy**

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

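# Load Iris and hold out 20% of the samples for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fix random_state so the fitted tree is reproducible between runs
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))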

**Q2. Train a Decision Tree Classifier using Gini Impurity and print the feature importances**

In [None]:
model = DecisionTreeClassifier(criterion='gini')
model.fit(X_train, y_train)
print("Feature Importances:", model.feature_importances_)

**Q3. Train a Decision Tree Classifier using Entropy and print the model accuracy**

In [None]:
model = DecisionTreeClassifier(criterion='entropy')
model.fit(X_train, y_train)
print("Accuracy (Entropy):", model.score(X_test, y_test))

**Q4. Train a Decision Tree Regressor on a housing dataset and evaluate using Mean Squared Error**

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Note: X, y and the train/test names below overwrite the Iris variables from Q1
data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))

**Q5. Train a Decision Tree Classifier and visualize the tree using graphviz**

In [None]:
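from sklearn.tree import export_graphviz
import graphviz

# Fit on Iris explicitly: X_train/y_train were overwritten by the housing data in Q4
iris = load_iris()
tree_clf = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)

# Pass the feature and class names as plain lists of strings
dot_data = export_graphviz(tree_clf, out_file=None,
                           feature_names=list(iris.feature_names),
                           class_names=list(iris.target_names),
                           filled=True, rounded=True)
graph = graphviz.Source(dot_data)
graph.render("decision_tree")  # Saves to decision_tree.pdf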

**Q6. Train a Decision Tree Classifier with max_depth=3 and compare accuracy with a full tree**

In [None]:
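# X_train/y_train currently hold the housing split from Q4, whose continuous
# targets a classifier cannot fit, so rebuild the Iris split first
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model1 = DecisionTreeClassifier(max_depth=3, random_state=42)
model1.fit(X_train, y_train)
print("Accuracy (depth=3):", model1.score(X_test, y_test))

model2 = DecisionTreeClassifier(random_state=42)
model2.fit(X_train, y_train)
print("Accuracy (full tree):", model2.score(X_test, y_test))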

**Q7. Train a Decision Tree Classifier using min_samples_split=5 and compare its accuracy**

In [None]:
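model = DecisionTreeClassifier(min_samples_split=5, random_state=42)
model.fit(X_train, y_train)
print("Accuracy (min_samples_split=5):", model.score(X_test, y_test))
# Baseline with the default min_samples_split=2, for comparison
baseline = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("Accuracy (min_samples_split=2):", baseline.score(X_test, y_test))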

**Q8. Apply feature scaling before training a Decision Tree Classifier and compare accuracy**

In [None]:
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training split only, and reuse the same split for
# both models so the comparison is like-for-like
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model_scaled = DecisionTreeClassifier(random_state=42)
model_scaled.fit(X_train_s, y_train)
print("Accuracy with scaling:", model_scaled.score(X_test_s, y_test))

# Tree splits depend only on the ordering of feature values, so scaling
# should leave the accuracy essentially unchanged
model_unscaled = DecisionTreeClassifier(random_state=42)
model_unscaled.fit(X_train, y_train)
print("Accuracy without scaling:", model_unscaled.score(X_test, y_test))

**Q9. Train a Decision Tree Classifier using One-vs-Rest (OvR) strategy for multiclass classification**

In [None]:
from sklearn.multiclass import OneVsRestClassifier
ovr_model = OneVsRestClassifier(DecisionTreeClassifier())
ovr_model.fit(X_train, y_train)
print("OvR Accuracy:", ovr_model.score(X_test, y_test))

**Q10. Train a Decision Tree Classifier and display the feature importance scores**

In [None]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
print("Feature Importances:", model.feature_importances_)

**Q11. Train a Decision Tree Regressor with max_depth=5 and compare its performance**

In [None]:
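# Use a dedicated housing split so the classification variables stay intact
X_h, y_h = fetch_california_housing(return_X_y=True)
Xh_train, Xh_test, yh_train, yh_test = train_test_split(X_h, y_h, random_state=42)

reg1 = DecisionTreeRegressor(max_depth=5, random_state=42)
reg1.fit(Xh_train, yh_train)
print("MSE (depth=5):", mean_squared_error(yh_test, reg1.predict(Xh_test)))

reg2 = DecisionTreeRegressor(random_state=42)
reg2.fit(Xh_train, yh_train)
print("MSE (unrestricted):", mean_squared_error(yh_test, reg2.predict(Xh_test)))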

**Q12. Train a Decision Tree Classifier, apply Cost Complexity Pruning (CCP), and visualize its effect**

In [None]:
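import matplotlib.pyplot as plt
# Candidate alphas at which subtrees of the full tree would be pruned away
path = model.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
test_acc = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=42).fit(X_train, y_train)
    test_acc.append(clf.score(X_test, y_test))
    print(f"Accuracy with ccp_alpha={ccp_alpha:.5f}: {test_acc[-1]:.4f}")

plt.plot(ccp_alphas, test_acc, marker="o", drawstyle="steps-post")
plt.xlabel("ccp_alpha")
plt.ylabel("Test accuracy")
plt.show()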

**Q13. Train a Decision Tree Classifier and evaluate using Precision, Recall, and F1-Score**

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

y_pred = model.predict(X_test)
print("Precision:", precision_score(y_test, y_pred, average='macro'))
print("Recall:", recall_score(y_test, y_pred, average='macro'))
print("F1 Score:", f1_score(y_test, y_pred, average='macro'))

**Q14. Train a Decision Tree Classifier and visualize the confusion matrix using seaborn**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, model.predict(X_test))
sns.heatmap(cm, annot=True, fmt="d")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

**Q15. Train a Decision Tree Classifier and use GridSearchCV for tuning**

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [3, 5, 7], 'min_samples_split': [2, 5, 10]}
grid = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=3)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Best Score:", grid.best_score_)