# Decision Tree – Assignment Q&A

### Q1. What is a Decision Tree, and how does it work?

**Answer:**
A Decision Tree is a flowchart-like model in which each internal node tests a feature, each branch corresponds to an outcome of that test, and each leaf node holds a class label or a numeric output. To make a prediction, a sample starts at the root and follows the branches that match its feature values until it reaches a leaf.
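
As a rough illustration of the node, branch, and leaf structure, the sketch below fits a tree on a tiny made-up dataset and prints its rules as text. The data and the names `X_toy`, `y_toy`, `toy_tree`, `hours_studied`, and `hours_slept` are hypothetical and chosen only for this example.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: [hours_studied, hours_slept] -> 1 (pass) or 0 (fail)
X_toy = [[1, 4], [2, 8], [3, 5], [6, 7], [7, 4], [8, 8]]
y_toy = [0, 0, 0, 1, 1, 1]

toy_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Each indented line is a test at an internal node; "class:" lines are leaves
print(export_text(toy_tree, feature_names=["hours_studied", "hours_slept"]))

# A new sample is routed from the root to a leaf
print(toy_tree.predict([[5, 6]]))
```

Only `hours_studied` separates the two classes cleanly here, so the tree needs a single test to classify the toy data.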

### Q2. What are impurity measures in Decision Trees?

**Answer:**
Impurity measures quantify how mixed the classes are within a node's subset of the data; a node that contains a single class has impurity 0. The two most common measures are Gini Impurity and Entropy.

### Q3. What is the mathematical formula for Gini Impurity?

**Answer:**
Gini = 1 - ∑ (p_i)², where p_i is the proportion of samples in the node that belong to class i and the sum runs over all classes.
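
A minimal sketch of this formula in plain Python follows. The helper name `gini_impurity` is hypothetical, and class proportions are estimated from label counts.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini = 1 - sum(p_i^2), with p_i estimated from label counts."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["yes"] * 5))               # pure node: 0.0
print(gini_impurity(["yes"] * 5 + ["no"] * 5))  # evenly mixed, two classes: 0.5
```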

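### Q4. What is the mathematical formula for Entropy?

**Answer:**
Entropy = - ∑ p_i * log₂(p_i), where p_i is the proportion of samples in the node that belong to class i; terms with p_i = 0 are taken to be 0.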
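
The same idea can be sketched in code. The helper name `entropy` is hypothetical; because only classes that actually occur are counted, the p_i = 0 case never reaches the logarithm.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)), with p_i estimated from label counts."""
    n = len(labels)
    if n == 0:
        return 0.0
    proportions = [count / n for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in proportions)

print(entropy(["yes"] * 5))               # pure node: entropy is zero
print(entropy(["yes"] * 5 + ["no"] * 5))  # evenly mixed, two classes: 1.0
```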

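### Q5. What is Information Gain, and how is it used in Decision Trees?

**Answer:**
Information Gain is the reduction in impurity produced by a split: the parent node's entropy minus the size-weighted average entropy of the child nodes. The same calculation with Gini Impurity is usually called Gini gain. At each node, the tree chooses the feature (and threshold) with the largest gain.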
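
A small worked sketch of the calculation, using entropy as the impurity measure. The helper names `entropy` and `information_gain` and the toy label lists are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4
perfect_split = [["yes"] * 4, ["no"] * 4]                # children are pure
useless_split = [["yes", "yes", "no", "no"]] * 2         # children mirror the parent

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

A split that leaves the class mix unchanged gains nothing, which is why the tree prefers features whose child nodes are purer than the parent.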

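### Q6. What is the difference between Gini Impurity and Entropy?

**Answer:**
Both are impurity metrics and both equal 0 for a pure node. For two classes, Gini peaks at 0.5 and Entropy at 1 when the classes are evenly mixed. Entropy requires computing logarithms, so it can be slightly slower; Gini is cheaper to compute and often yields similar splits.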

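### Q7. What is the mathematical explanation behind Decision Trees?

**Answer:**
A Decision Tree is built greedily. At each node, the algorithm evaluates candidate splits and picks the one with the highest Information Gain (or the largest Gini reduction), then repeats the process on each child subset until the node is pure or a stopping condition is met.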
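
The split search at a single node can be sketched as follows for one numeric feature, using weighted Gini impurity as the score. This is a simplified illustration, not how any particular library implements it; the names `gini`, `best_threshold`, `ages`, and `bought` are hypothetical.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # baseline: impurity without splitting
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (threshold, score)
    return best

ages = [22, 25, 30, 35, 40, 52]
bought = [0, 0, 0, 1, 1, 1]
print(best_threshold(ages, bought))  # (32.5, 0.0)
```

A full tree builder would run this search over every feature, split on the winner, and recurse into the two child subsets.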

### Q8. What is Pre-Pruning in Decision Trees?

**Answer:**
Pre-Pruning stops the tree from growing further once a condition is met, such as a maximum depth or a minimum number of samples required to split a node.

### Q9. What is Post-Pruning in Decision Trees?

**Answer:**
Post-Pruning grows the full tree first and then removes branches that contribute little to predictive performance, typically judged on validation data or with a complexity penalty.
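
One way to try post-pruning in scikit-learn is minimal cost-complexity pruning through the `ccp_alpha` parameter. The sketch below uses the built-in breast cancer dataset purely as a stand-in; the variable names are hypothetical, and in practice the final alpha would be chosen with cross-validation rather than by reading test accuracy.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Grow the full tree, then ask it for the candidate pruning strengths
full_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
path = full_tree.cost_complexity_pruning_path(X_tr, y_tr)

leaf_counts = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha).fit(X_tr, y_tr)
    leaf_counts.append(pruned.get_n_leaves())
    print(f"alpha={alpha:.5f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_te, y_te):.3f}")
```

Larger values of `ccp_alpha` remove more branches, so the leaf count shrinks as alpha grows.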

### Q10. What is the difference between Pre-Pruning and Post-Pruning?

**Answer:**
Pre-Pruning halts tree growth early, which is cheaper but can stop before a useful split is found; Post-Pruning builds the full tree and then removes nodes that add little.

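### Q11. What is a Decision Tree Regressor?

**Answer:**
A Decision Tree Regressor is a tree model that predicts continuous values instead of class labels. Each leaf typically predicts the mean target value of the training samples that fall into it.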
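
As a small illustration, the sketch below fits a depth-1 regression tree on made-up numbers. With scikit-learn's default squared-error criterion, each leaf predicts the mean of the training targets it contains. The names `X_r`, `y_r`, and `stump` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data with two obvious groups of target values
X_r = np.array([[1], [2], [3], [10], [11], [12]])
y_r = np.array([1.0, 1.2, 0.8, 5.0, 5.5, 4.5])

stump = DecisionTreeRegressor(max_depth=1).fit(X_r, y_r)

# Each prediction is the mean target of the leaf the sample lands in
print(stump.predict([[2], [11]]))
```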

### Q12. What are the advantages and disadvantages of Decision Trees?

**Answer:**
**Advantages:** Interpretable; no feature scaling required.
**Disadvantages:** Prone to overfitting; unstable, since small changes in the data can produce a very different tree.

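### Q13. How does a Decision Tree handle missing values?

**Answer:**
It depends on the implementation. Some handle missing values with surrogate splits, which fall back on a correlated feature when the primary one is missing; others ignore the affected rows or require the values to be imputed before training.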
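
When the implementation at hand needs complete data, one common workaround is to impute before training. The sketch below uses scikit-learn's `SimpleImputer` with the median strategy on a made-up table; the frame `toy` and its columns are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up table with one missing value per column
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50, 62, np.nan, 58],
})

# Replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

In a real workflow the imputer should be fitted on the training split only and then applied to the test split.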

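### Q14. How does a Decision Tree handle categorical features?

**Answer:**
Some implementations split on categories directly. Others, including scikit-learn's trees, expect numeric input, so categorical columns must be encoded first (for example with one-hot or ordinal encoding).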
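
A minimal sketch of the encoding route, assuming one-hot encoding with `pandas.get_dummies`. The frame `toy` and its columns `color`, `size_cm`, and `label` are made up for this example.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

toy = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red", "blue"],
    "size_cm": [1.0, 2.5, 3.0, 2.0, 1.5, 3.5],
    "label": [0, 1, 1, 1, 0, 1],
})

# One-hot encode the categorical column; numeric columns pass through unchanged
X_enc = pd.get_dummies(toy[["color", "size_cm"]], columns=["color"], dtype=int)
print(list(X_enc.columns))

clf_toy = DecisionTreeClassifier(random_state=0).fit(X_enc, toy["label"])
print(clf_toy.score(X_enc, toy["label"]))  # training accuracy on the toy data
```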

### Q15. What are some real-world applications of Decision Trees?

**Answer:**
Loan approval, fraud detection, disease diagnosis, and customer segmentation.

## Practical Implementation of Decision Trees

### Q16. Import necessary libraries and load dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# "your_dataset.csv" is a placeholder; replace it with the path to your own CSV file
df = pd.read_csv("your_dataset.csv")
df.head()

### Q17. Explore the dataset – check for nulls and datatypes.

In [None]:
df.info()  # info() prints its summary itself; wrapping it in print() adds a stray "None"
print(df.isnull().sum())

### Q18. Split data into features and target variable.

In [None]:
# 'target' is a placeholder; use the name of the label column in your dataset
X = df.drop('target', axis=1)
y = df['target']

### Q19. Perform train-test split.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Q20. Train a Decision Tree Classifier.

In [None]:
clf = DecisionTreeClassifier(random_state=42)  # fixed seed for reproducible results
clf.fit(X_train, y_train)

### Q21. Evaluate the classifier on test data.

In [None]:
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))

### Q22. Visualize the trained Decision Tree.

In [None]:
plt.figure(figsize=(16,8))
# Pass feature names as a plain list; some scikit-learn versions reject a pandas Index
plot_tree(clf, filled=True, feature_names=list(X.columns), class_names=True)
plt.show()

### Q23. Use Gini and Entropy as splitting criteria and compare performance.

In [None]:
# Same seed for both models so that only the criterion differs
clf_gini = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_entropy = DecisionTreeClassifier(criterion='entropy', random_state=42)

clf_gini.fit(X_train, y_train)
clf_entropy.fit(X_train, y_train)

print('Gini Accuracy:', accuracy_score(y_test, clf_gini.predict(X_test)))
print('Entropy Accuracy:', accuracy_score(y_test, clf_entropy.predict(X_test)))

### Q24. Implement Pre-Pruning (limit depth and min samples).

In [None]:
clf_pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=10)
clf_pruned.fit(X_train, y_train)
print('Pruned Accuracy:', accuracy_score(y_test, clf_pruned.predict(X_test)))
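
The values `max_depth=3` and `min_samples_split=10` above are fixed by hand. One possible way to pick them from data is a cross-validated grid search, sketched below on the built-in iris dataset as a stand-in for your own data. The grid values and variable names are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Candidate pre-pruning settings (illustrative values, not recommendations)
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_split": [2, 10, 20],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```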

### Q25. Evaluate model performance using accuracy, precision, recall, and F1-score.

In [None]:
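from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# average='binary' assumes a two-class target; use 'macro' or 'weighted' for multiclass
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred, average='binary'))
print('Recall:', recall_score(y_test, y_pred, average='binary'))
print('F1 Score:', f1_score(y_test, y_pred, average='binary'))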

### Q26. Plot feature importance from Decision Tree.

In [None]:
importances = clf.feature_importances_
features = X.columns
sns.barplot(x=importances, y=features)
plt.title('Feature Importances')
plt.show()

### Q27. Export the Decision Tree as a DOT file.

In [None]:
from sklearn.tree import export_graphviz
# Writes tree.dot to the working directory; feature names are passed as a plain list
export_graphviz(clf, out_file='tree.dot', feature_names=list(X.columns), class_names=True, filled=True)

### Q28. Train and evaluate a Decision Tree Regressor.

In [None]:
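# Assumes y is a continuous numeric target; use a regression dataset if 'target' holds class labels
reg = DecisionTreeRegressor(max_depth=3, random_state=42)
reg.fit(X_train, y_train)
print('Regressor R^2:', reg.score(X_test, y_test))  # score() returns R^2 for regressors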

### Q29. Visualize the regression tree.

In [None]:
plt.figure(figsize=(16,8))
plot_tree(reg, filled=True, feature_names=list(X.columns))
plt.title('Regression Tree')
plt.show()

### Q30. Save and load a trained Decision Tree model.

In [None]:
import joblib
joblib.dump(clf, 'decision_tree_model.pkl')
loaded_model = joblib.load('decision_tree_model.pkl')
print('Loaded model score:', loaded_model.score(X_test, y_test))