
## Theoretical Questions

1. **What is a Decision Tree, and how does it work?**  
   A Decision Tree is a supervised learning algorithm used for classification and regression tasks. It splits data based on feature values using a tree-like structure where each internal node represents a decision based on an attribute, each branch represents an outcome of that decision, and each leaf node represents a class label or a continuous value (in regression).

2. **What are impurity measures in Decision Trees?**  
   Impurity measures quantify how mixed a dataset is at a particular node. Common impurity measures include **Gini Impurity**, **Entropy**, and **Misclassification Rate**.

3. **What is the mathematical formula for Gini Impurity?**  
   \[
   \text{Gini} = 1 - \sum_{i=1}^{C} p_i^2
   \]
   where \( p_i \) is the proportion of samples at the node that belong to class \( i \), and \( C \) is the number of classes.
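
As a quick illustration of the Gini formula, here is a minimal sketch in Python. The helper name `gini_impurity` is made up for this example and is not part of any library.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a 1-D sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()            # class proportions p_i
    return float(1.0 - np.sum(p ** 2))

print(gini_impurity([0, 0, 0, 0]))       # pure node -> 0.0
print(gini_impurity([0, 0, 1, 1]))       # evenly mixed two-class node -> 0.5
```

A pure node has impurity 0, and the value grows as the classes become more evenly mixed.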

4. **What is the mathematical formula for Entropy?**  
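   \[
   \text{Entropy} = - \sum_{i=1}^{C} p_i \log_2 p_i
   \]
   where \( p_i \) is the proportion of samples at the node that belong to class \( i \), and \( C \) is the number of classes. Classes with \( p_i = 0 \) contribute nothing to the sum.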
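
The entropy formula can be checked numerically with a small sketch like the one below. The helper name `entropy` is an assumption for this example, and only classes that actually occur in the node are counted, so the logarithm is always defined.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                     # class proportions p_i (all > 0)
    return float(np.sum(p * np.log2(1.0 / p)))    # same as -sum(p_i * log2(p_i))

print(entropy([0, 0, 1, 1]))    # two equally likely classes -> 1.0
print(entropy([0, 1, 2, 3]))    # four equally likely classes -> 2.0
```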

5. **What is Information Gain, and how is it used in Decision Trees?**  
   Information Gain measures the reduction in impurity (classically, entropy) achieved by splitting a node. It is calculated as the impurity of the parent node minus the weighted average impurity of the child nodes, where each child is weighted by its share of the parent's samples. At each node the tree prefers the split with the highest Information Gain.
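
The calculation can be sketched in a few lines of Python. The function names `entropy` and `information_gain` are illustrative, and entropy is assumed as the impurity measure.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(parent, children):
    """Parent entropy minus the sample-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [0, 0, 0, 0, 1, 1, 1, 1]
print(information_gain(parent, [[0, 0, 0, 0], [1, 1, 1, 1]]))  # perfect split -> 1.0
print(information_gain(parent, [[0, 0, 0, 1], [0, 1, 1, 1]]))  # mixed children, roughly 0.19
```

The perfect split removes all uncertainty, while the second split leaves both children mixed and therefore gains much less.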

6. **What is the difference between Gini Impurity and Entropy?**  
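   - **Gini Impurity** is the probability that a randomly chosen sample at the node would be misclassified if it were labeled at random according to the node's class distribution. It needs no logarithm, so it is slightly cheaper to compute.  
   - **Entropy** measures the uncertainty (information disorder) of the class distribution in bits. For a two-class node it peaks at 1, whereas Gini peaks at 0.5. In practice the two criteria usually produce similar trees.  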

7. **What is the mathematical explanation behind Decision Trees?**  
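   Decision Trees are built by recursive binary splitting. At each node, the algorithm evaluates candidate splits (a feature together with a threshold) and greedily picks the one that most reduces impurity, that is, the split that maximizes **Information Gain** (with Entropy) or minimizes the weighted **Gini Impurity** of the child nodes. The same procedure is then applied to each resulting subset until a stopping criterion (like minimum samples per leaf or maximum depth) is met or the node is pure.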
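
To make one step of this search concrete, the sketch below scans every feature and every midpoint threshold and keeps the split with the lowest weighted Gini impurity. It is a simplified illustration, not how any particular library implements it, and the names `gini` and `best_split` are made up for this example.

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def best_split(X, y):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    n_samples, n_features = X.shape
    best = (None, None, gini(y))          # baseline: impurity with no split
    for j in range(n_features):
        values = np.unique(X[:, j])       # sorted distinct values of feature j
        # Candidate thresholds: midpoints between consecutive distinct values
        for t in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            if score < best[2]:
                best = (j, float(t), score)
    return best

X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]])
y = np.array([0, 0, 1, 1])
feature, threshold, score = best_split(X, y)
print(feature, threshold, score)          # feature 0 at threshold 2.5 separates the classes
```

A full tree builder would call `best_split` again on the left and right subsets until a stopping criterion is reached.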

8. **What is Pre-Pruning in Decision Trees?**  
   Pre-pruning stops tree growth early using constraints such as **maximum depth, minimum samples per split, or minimum impurity decrease** to avoid overfitting.
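
In scikit-learn, these constraints correspond to the `max_depth`, `min_samples_split`, and `min_impurity_decrease` parameters. The sketch below compares an unconstrained tree with a pre-pruned one on the Iris data. The specific values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unconstrained tree: grows until leaves are pure or cannot be split further
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruned tree: a node is not split once any constraint would be violated
pruned_tree = DecisionTreeClassifier(
    max_depth=3,
    min_samples_split=10,
    min_impurity_decrease=0.01,
    random_state=0,
).fit(X, y)

print("Full tree:  depth", full_tree.get_depth(), "leaves", full_tree.get_n_leaves())
print("Pre-pruned: depth", pruned_tree.get_depth(), "leaves", pruned_tree.get_n_leaves())
```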

9. **What is Post-Pruning in Decision Trees?**  
   Post-pruning involves growing the tree fully and then pruning back branches that do not improve accuracy using techniques like **cost-complexity pruning (CCP)**.
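
For reference, cost-complexity pruning is commonly stated as minimizing a penalized objective. The formulation below is the standard textbook one and is added here as background, not taken from the answer above:

\[
R_\alpha(T) = R(T) + \alpha \, |\widetilde{T}|
\]

where \( R(T) \) is the total impurity (or misclassification rate) of the leaves of tree \( T \), \( |\widetilde{T}| \) is the number of leaves, and \( \alpha \ge 0 \) is the complexity parameter. Larger values of \( \alpha \) penalize leaves more heavily and therefore favor smaller subtrees. In scikit-learn, \( \alpha \) is set through the `ccp_alpha` parameter.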

10. **What is the difference between Pre-Pruning and Post-Pruning?**  
    - **Pre-Pruning** stops tree growth early based on predefined constraints.  
    - **Post-Pruning** first allows full tree growth and then removes unnecessary branches.

11. **What is a Decision Tree Regressor?**  
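    A Decision Tree Regressor is a Decision Tree model used for regression tasks. Instead of predicting class labels, each leaf predicts a continuous value, typically the mean of the training targets that fall into that leaf. Splits are chosen to minimize an error measure such as the Mean Squared Error (MSE) within the child nodes.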
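
A tiny example makes the leaf behavior visible. The toy data below is invented for illustration. With `max_depth=1` the tree makes a single split, and each side predicts the mean of its training targets.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Two well-separated groups of points (toy data for illustration)
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 2.0, 3.0, 20.0, 22.0, 24.0])

# A depth-1 tree (a "stump") makes one split, giving two leaves
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)

print(stump.predict([[2.5]]))    # mean of [1, 2, 3]    -> [2.]
print(stump.predict([[11.5]]))   # mean of [20, 22, 24] -> [22.]
```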

12. **What are the advantages and disadvantages of Decision Trees?**  
    **Advantages:**  
    - Simple and interpretable  
    - Requires minimal data preprocessing  
    - Works with both numerical and categorical data  
    - Handles non-linearity well  

    **Disadvantages:**  
    - Prone to overfitting  
    - Sensitive to small data variations  
    - Greedy splitting may lead to suboptimal solutions  

13. **How does a Decision Tree handle missing values?**  
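    - Some implementations skip samples whose value is missing for the feature being evaluated and compute the split from the available data.  
    - CART-style implementations can use surrogate splits (backup splits on other features that stand in for the primary split when its value is missing).  
    - Missing values can be imputed before training, for example with the mean or the most frequent value.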
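
The imputation option can be sketched with scikit-learn's `SimpleImputer` placed in front of the tree in a pipeline. The toy data is invented for this example, and mean imputation is just one possible strategy (`"most_frequent"` and `"median"` are also available).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data with missing entries marked as NaN
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [8.0, 9.0]])
y = np.array([0, 0, 1, 1])

# Replace each NaN with its column mean, then fit the tree on the filled data
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    DecisionTreeClassifier(random_state=0),
)
model.fit(X, y)

# The same imputation is applied automatically at prediction time
print(model.predict([[np.nan, 2.5]]))   # -> [0]
```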

14. **How does a Decision Tree handle categorical features?**  
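    - Some implementations can split on categorical values directly (practical when a feature has only a few categories).  
    - Implementations that expect numeric input, such as scikit-learn's trees, need the categories encoded first, for example with one-hot encoding or ordinal encoding.  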
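
A minimal sketch of the encoding route, assuming scikit-learn and a made-up single-column color feature:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy categorical feature: only "red" belongs to class 1
X = np.array([["red"], ["green"], ["blue"], ["green"], ["red"], ["blue"]])
y = np.array([1, 0, 0, 0, 1, 0])

# One-hot encode the categories, then fit the tree on the encoded columns
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(random_state=0),
)
model.fit(X, y)

print(model.predict([["red"], ["blue"]]))   # -> [1 0]
```

One-hot encoding avoids implying an order between categories. Ordinal encoding is more compact but imposes an arbitrary ordering that the tree will split on.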

15. **What are some real-world applications of Decision Trees?**  
    - **Medical diagnosis** (predicting disease based on symptoms).  
    - **Finance** (credit risk assessment).  
    - **Customer segmentation** (identifying target groups).  
    - **Fraud detection** (detecting fraudulent transactions).  
    - **Recommendation systems** (suggesting products based on user behavior).  



## Practical Questions

### **16. Train a Decision Tree Classifier on the Iris dataset and print the model accuracy**
```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Train model
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)
```
**Expected Output:**
```
Model Accuracy: 1.0 (or close to 0.95-1.0)
```

---

### **17. Train a Decision Tree Classifier using Gini Impurity and print feature importances**
```python
clf = DecisionTreeClassifier(criterion="gini")
clf.fit(X_train, y_train)

# Print feature importances
print("Feature Importances:", clf.feature_importances_)
```
**Expected Output:**
```
Feature Importances: [0.02, 0.01, 0.57, 0.40]  # Values may vary
```

---

### **18. Train a Decision Tree Classifier using Entropy and print model accuracy**
```python
clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)
```
**Expected Output:**
```
Model Accuracy: 1.0
```

---

### **19. Train a Decision Tree Regressor on a housing dataset and evaluate using MSE**
```python
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load dataset (housing-specific names keep the Iris split and y_pred from above intact)
housing = fetch_california_housing()
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)

# Train model
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train_h, y_train_h)

# Predict and evaluate
y_pred_h = regressor.predict(X_test_h)
mse = mean_squared_error(y_test_h, y_pred_h)
print("Mean Squared Error:", mse)
```
**Expected Output:**
```
Mean Squared Error: ~0.5-1.5 (varies based on train-test split)
```

---

### **20. Train a Decision Tree Classifier and visualize the tree using Graphviz**
```python
from sklearn.tree import export_graphviz
import graphviz

# Export the fitted tree to DOT format
dot_data = export_graphviz(
    clf,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),  # a plain list works across scikit-learn versions
    filled=True,
)

# Render to decision_tree.pdf and open it in the default viewer
graph = graphviz.Source(dot_data)
graph.render("decision_tree", view=True)
```
**Expected Output:**  
A visual tree structure saved as `decision_tree.pdf` or displayed.

---

### **21. Train a Decision Tree Classifier with max depth = 3 and compare accuracy**
```python
clf_depth3 = DecisionTreeClassifier(max_depth=3)
clf_depth3.fit(X_train, y_train)
y_pred_depth3 = clf_depth3.predict(X_test)

accuracy_depth3 = accuracy_score(y_test, y_pred_depth3)
accuracy_full = accuracy_score(y_test, clf.predict(X_test))

print("Accuracy with max_depth=3:", accuracy_depth3)
print("Accuracy with fully grown tree:", accuracy_full)
```
**Expected Output:**
```
Accuracy with max_depth=3: ~0.90-0.95
Accuracy with fully grown tree: ~1.0
```

---

### **22. Train a Decision Tree Classifier with `min_samples_split=5` and compare accuracy**
```python
clf_split5 = DecisionTreeClassifier(min_samples_split=5)
clf_split5.fit(X_train, y_train)
y_pred_split5 = clf_split5.predict(X_test)

accuracy_split5 = accuracy_score(y_test, y_pred_split5)
accuracy_default = accuracy_score(y_test, clf.predict(X_test))

print("Accuracy with min_samples_split=5:", accuracy_split5)
print("Accuracy with default tree:", accuracy_default)
```
**Expected Output:**
```
Accuracy with min_samples_split=5: ~0.95-1.0
Accuracy with default tree: 1.0
```

---

### **23. Apply feature scaling before training a Decision Tree and compare accuracy**
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

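# Use the same settings and seed for both models so that only the scaling differs
clf_scaled = DecisionTreeClassifier(random_state=42)
clf_scaled.fit(X_train_scaled, y_train)
clf_unscaled = DecisionTreeClassifier(random_state=42)
clf_unscaled.fit(X_train, y_train)

accuracy_scaled = accuracy_score(y_test, clf_scaled.predict(X_test_scaled))
accuracy_unscaled = accuracy_score(y_test, clf_unscaled.predict(X_test))

print("Accuracy with feature scaling:", accuracy_scaled)
print("Accuracy without scaling:", accuracy_unscaled)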
```
**Expected Output:**
```
Accuracy with feature scaling: Similar to unscaled (Decision Trees are insensitive to scaling)
```

---

### **24. Train a Decision Tree Classifier using One-vs-Rest (OvR) strategy for multiclass classification**
```python
from sklearn.multiclass import OneVsRestClassifier

ovr_clf = OneVsRestClassifier(DecisionTreeClassifier())
ovr_clf.fit(X_train, y_train)

y_pred_ovr = ovr_clf.predict(X_test)
accuracy_ovr = accuracy_score(y_test, y_pred_ovr)

print("OvR Decision Tree Accuracy:", accuracy_ovr)
```
**Expected Output:**
```
OvR Decision Tree Accuracy: ~0.95-1.0
```

---

### **25. Train a Decision Tree Classifier and display feature importance scores**
```python
print("Feature Importances:", clf.feature_importances_)
```
**Expected Output:**
```
Feature Importances: [0.02, 0.03, 0.55, 0.40]  # Values may vary
```

---

### **26. Train a Decision Tree Regressor with max_depth=5 and compare performance**
```python
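# Same housing split as in question 19, re-created under its own names
# so this cell does not depend on (or overwrite) the Iris variables
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)

regressor_depth5 = DecisionTreeRegressor(max_depth=5, random_state=42)
regressor_depth5.fit(X_train_h, y_train_h)

y_pred_depth5 = regressor_depth5.predict(X_test_h)
mse_depth5 = mean_squared_error(y_test_h, y_pred_depth5)

print("MSE with max_depth=5:", mse_depth5)
print("MSE with unrestricted tree:", mse)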
```
**Expected Output:**
```
MSE with max_depth=5: ~0.8-1.2
MSE with unrestricted tree: ~0.5-1.5
```

---

### **27. Train a Decision Tree Classifier, apply Cost Complexity Pruning (CCP), and visualize effect on accuracy**
```python
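import matplotlib.pyplot as plt
# Compute candidate alphas with the same (default Gini) settings used for the pruned trees
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
accuracies = []
for alpha in ccp_alphas:
    clf_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)
    accuracies.append(accuracy_score(y_test, clf_pruned.predict(X_test)))
    print(f"Alpha: {alpha}, Accuracy: {accuracies[-1]}")

# Plot test accuracy against alpha to visualize the effect of pruning
plt.plot(ccp_alphas, accuracies, marker="o")
plt.gca().set(xlabel="ccp_alpha", ylabel="Test accuracy")
plt.show()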
```
**Expected Output:**
```
Alpha: 0.0, Accuracy: 1.0
Alpha: 0.01, Accuracy: ~0.95
Alpha: 0.1, Accuracy: ~0.90
```

---

### **28. Train a Decision Tree Classifier and evaluate performance using Precision, Recall, and F1-Score**
```python
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
```
**Expected Output:**
```
Precision: ~1.0
Recall: ~1.0
F1-Score: ~1.0
```

---

### **29. Train a Decision Tree Classifier and visualize the confusion matrix using Seaborn**
```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')

plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
```
**Expected Output:**  
A heatmap showing the confusion matrix.

---

### **30. Train a Decision Tree Classifier and use GridSearchCV to find optimal `max_depth` and `min_samples_split`**
```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validated Accuracy:", grid_search.best_score_)
```
**Expected Output:**
```
Best Parameters: {'max_depth': 5, 'min_samples_split': 2}  # Varies based on data
Best Cross-Validated Accuracy: ~0.95-1.0
```

---

