
# **Question 1: What is a Decision Tree, and how does it work in the context of classification?**

# **Ans1.**
# Decision Tree Overview
A Decision Tree is a type of supervised learning algorithm used for both classification and regression tasks. In the context of classification, a Decision Tree is a tree-like model of decisions that leads to a class label.

How Decision Trees Work for Classification
1. Tree Structure : A Decision Tree consists of nodes (decision points) and edges (outcomes of decisions). The top node is the root node. Each internal node tests one feature (for example, whether its value falls below a threshold), each edge is one outcome of that test, and each leaf node holds a class label.

2. Splitting Data : At each internal node, the algorithm decides how to split the data based on a feature and a threshold. The goal is to split the data into subsets that are more "pure" in terms of class labels.

3. Decision Criteria : Common criteria for deciding splits include Gini impurity and information gain (based on entropy).
    - Gini Impurity : Measures the impurity of a node. A lower Gini impurity indicates a more pure node.
    - Information Gain : Measures the reduction in entropy (or increase in purity) after a split.

4. Stopping Criteria : Tree growth stops when a stopping criterion is met, such as when all instances in a node belong to the same class, when a maximum depth is reached, or when the number of instances in a node falls below a threshold.

5. Prediction : To classify a new instance, you start at the root node and follow the tree based on the feature values of the instance until you reach a leaf node. The class label of the leaf node is the predicted class.

Example of Decision Tree in Classification
Consider a classification problem to predict whether a person is likely to buy a computer based on age and income. A Decision Tree might split first on age (e.g., <=30 vs. >30), then on income for one of those branches. The final leaf nodes would give the predicted class (buy or not buy).
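
The prediction walk described above can be sketched with a small hand-built tree. The age and income thresholds and the leaf labels below are hypothetical values chosen only for illustration; they are not learned from any data.

```python
# Hypothetical hand-built tree for the "buys a computer" example.
# Internal nodes test one feature against a threshold; leaves hold a class label.
tree = {
    "feature": "age", "threshold": 30,
    "left": {  # age <= 30
        "feature": "income", "threshold": 50000,
        "left": {"label": "not buy"},   # income <= 50000
        "right": {"label": "buy"},      # income > 50000
    },
    "right": {"label": "buy"},          # age > 30
}


def predict(node, sample):
    """Follow the tree from the root until a leaf is reached."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


print(predict(tree, {"age": 25, "income": 30000}))  # not buy
print(predict(tree, {"age": 25, "income": 80000}))  # buy
print(predict(tree, {"age": 45, "income": 20000}))  # buy
```

A trained tree works the same way at prediction time; the difference is that the features and thresholds are chosen by the learning algorithm rather than written by hand.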

Advantages and Considerations
- Advantages : Decision Trees are easy to interpret and visualize. They can handle both numerical and categorical data.
- Considerations : Decision Trees can overfit if grown too deep or left unpruned. They are also unstable: small changes in the training data can produce a noticeably different tree.

# **Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?**


# **Ans2.**
# Gini Impurity and Entropy as Impurity Measures
In Decision Trees, both Gini impurity and Entropy are used as criteria to decide how to split the data at each node. They measure the "impurity" or "disorder" of a node in terms of the class labels.

Gini Impurity
- Definition : Gini impurity for a node is calculated as \(Gini = 1 - \sum_{i=1}^{C} p_i^2\), where \(p_i\) is the proportion of class \(i\) in the node, and \(C\) is the number of classes.
- *Interpretation*: Gini impurity ranges from 0 (pure node, all instances belong to one class) to \(1 - \frac{1}{C}\) (for a node with equal distribution among \(C\) classes).
- Use in Decision Trees : The algorithm chooses splits to minimize Gini impurity, leading to more pure child nodes.

Entropy
- Definition : Entropy for a node is calculated as \(Entropy = - \sum_{i=1}^{C} p_i \log_2(p_i)\), where \(p_i\) is the proportion of class \(i\).
- *Interpretation*: Entropy measures the uncertainty or disorder in the node. Lower entropy means more purity.
- Use in Decision Trees : Splits are chosen to maximize information gain, which is the reduction in entropy from the parent node to the child nodes.

Impact on Splits in a Decision Tree
- Both Gini impurity and Entropy guide the splitting process by evaluating the quality of a split.
- The goal is to create child nodes that are purer (lower Gini impurity or lower Entropy) than the parent node.
- Difference in Practice : Both usually lead to similar trees. Gini impurity is slightly cheaper to compute because it avoids the logarithm, and it is the default criterion in scikit-learn. Entropy (via information gain) is more commonly associated with the ID3 and C4.5 algorithms.

Example
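Consider a node with classes A (40%), B (30%), C (30%).
- Gini impurity = \(1 - (0.4^2 + 0.3^2 + 0.3^2) = 1 - 0.34 = 0.66\), close to the three-class maximum of \(1 - \frac{1}{3} \approx 0.667\).
- Entropy = \(-0.4\log_2(0.4) - 0.3\log_2(0.3) - 0.3\log_2(0.3) \approx 1.571\), close to the three-class maximum of \(\log_2(3) \approx 1.585\).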

Both metrics would guide the Decision Tree to split this node in a way that increases purity in the child nodes.
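
The two formulas can be checked numerically. The sketch below is a plain-Python illustration with helper names (`gini`, `entropy`) chosen here for convenience; it is not how any particular library implements them.

```python
import math


def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p ** 2 for p in proportions)


def entropy(proportions):
    """Shannon entropy in bits; classes with zero proportion contribute nothing."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)


node = [0.4, 0.3, 0.3]  # class proportions for A, B, C
print(f"Gini impurity: {gini(node):.3f}")  # Gini impurity: 0.660
print(f"Entropy: {entropy(node):.3f}")     # Entropy: 1.571
```

Both values sit near their three-class maximum, which reflects how mixed this node is.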


# **Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**

# **Ans3.**
# Pre-Pruning vs. Post-Pruning in Decision Trees
- Pre-Pruning (Early Stopping) : Stops the growth of the Decision Tree early based on certain criteria like maximum depth, minimum number of samples to split a node, or minimum number of samples in a leaf. The tree is not fully grown.
- Post-Pruning : Grows the Decision Tree to its maximum size, then removes branches that do not improve generalization. Common methods are reduced-error pruning and cost-complexity pruning.

Practical Advantages
1. Pre-Pruning :
    - Advantage : Can be computationally more efficient as it stops growing the tree early, avoiding unnecessary computations for branches that wouldn't contribute much.
2. Post-Pruning :
    - Advantage : Often yields a better tree, because pruning decisions are made with the whole tree in view. A split that looks weak on its own but enables strong splits further down is kept, whereas pre-pruning might have stopped before reaching it.
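
The difference can be illustrated with scikit-learn on the Iris data. This is a minimal sketch: the specific limits (`max_depth=3`, `min_samples_leaf=5`) and the mid-path choice of `ccp_alpha` are arbitrary assumptions for demonstration; in practice the pruning strength would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pre-pruning: growth stops as soon as one of the limits is hit.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, compute its cost-complexity pruning path,
# then refit with one of the candidate alphas.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = max(path.ccp_alphas[len(path.ccp_alphas) // 2], 0.0)  # arbitrary mid-path value
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name}: {model.tree_.node_count} nodes, test accuracy {model.score(X_test, y_test):.2f}")
```

Larger values of `ccp_alpha` remove more of the tree, so the post-pruned model never has more nodes than the full one.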

# **Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?**

# **Ans4.**
# Information Gain in Decision Trees
- Definition : Information Gain is a measure of the reduction in entropy (or uncertainty) in the target variable due to a split based on a particular feature. It's calculated as \(Information\ Gain = Entropy(parent) - \sum \frac{N_{child}}{N_{parent}} Entropy(child)\).
- Purpose in Decision Trees : Information Gain is used to decide which feature to use for splitting at a node. The feature with the highest Information Gain is chosen because it reduces uncertainty the most.

Importance for Choosing the Best Split
- Reducing Uncertainty : By maximizing Information Gain, the Decision Tree algorithm chooses splits that most effectively reduce uncertainty about the class labels, leading to more pure child nodes.
- Feature Selection at Each Node : At each node, calculating Information Gain for each feature helps decide which feature best splits the data for classification.

Example
If a split based on a feature reduces entropy significantly (high Information Gain), it means that split is very effective in separating the classes.
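
A small worked example makes the formula concrete. The class counts below are hypothetical, and the helper names (`entropy_from_counts`, `information_gain`) are chosen here only for illustration.

```python
import math


def entropy_from_counts(counts):
    """Entropy in bits of a node described by its per-class sample counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n_parent = sum(parent)
    weighted = sum(sum(child) / n_parent * entropy_from_counts(child) for child in children)
    return entropy_from_counts(parent) - weighted


parent = [5, 5]              # hypothetical node: 5 positives, 5 negatives
children = [[4, 1], [1, 4]]  # hypothetical split into two child nodes
print(f"Information gain: {information_gain(parent, children):.3f}")  # Information gain: 0.278
```

The parent has entropy 1.0 and each child has entropy of about 0.722, so the split gains about 0.278 bits. A split that left both children with the same 50/50 mix would gain nothing.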

# **Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?**

# **Dataset Info:**
# **● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).**

# **● Boston Housing Dataset for regression tasks (sklearn.datasets.load_boston() or provided CSV).**

# **Ans5.**
# Common Real-World Applications of Decision Trees
1. Classification Tasks : Decision Trees are used in areas like credit scoring (predicting loan defaults), medical diagnosis (classifying diseases based on symptoms and tests), and customer segmentation.
2. Regression Tasks : They can predict continuous outcomes like housing prices (as in the Boston Housing Dataset) or stock prices.

Main Advantage
- Interpretability : Decision Trees are easy to interpret and visualize. The tree structure allows for straightforward understanding of how decisions are made based on feature values.

Main Limitations
- Overfitting : Decision Trees can overfit the training data if not pruned or if too deep, leading to poor generalization on unseen data.
- Instability : Small changes in the data can lead to a very different tree being generated.

Example with Given Datasets
- Iris Dataset (Classification) : A Decision Tree can classify iris flowers into one of three species based on features like sepal length, sepal width, petal length, and petal width.
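- Boston Housing Dataset (Regression) : A Decision Tree can predict housing prices based on features like crime rate, number of rooms, etc. Note that load_boston() was removed in scikit-learn 1.2, so recent versions need the provided CSV or an alternative such as the California Housing dataset.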
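
The interpretability advantage can be seen by printing a fitted tree as text rules. This is a minimal sketch using scikit-learn's `export_text`; the shallow `max_depth=2` is an assumption made only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short enough to read at a glance.
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# Render the learned splits as indented if/else style rules.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one test on a feature, and each `class:` line is a leaf, so the full decision path for any flower can be read directly from the output.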

# **Question 6: Write a Python program to:**
# **● Load the Iris Dataset**
# **● Train a Decision Tree Classifier using the Gini criterion**
# **● Print the model’s accuracy and feature importances.**

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy
print(f"Model Accuracy: {accuracy:.2f}")

# Print feature importances
print("Feature Importances:")
for feature_name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")

Model Accuracy: 1.00
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


# **Question 7: Write a Python program to:**
# **● Load the Iris Dataset**
# **● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.**


In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model 1: Decision Tree with max_depth=3
clf_depth3 = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_depth3.fit(X_train, y_train)
y_pred_depth3 = clf_depth3.predict(X_test)
accuracy_depth3 = accuracy_score(y_test, y_pred_depth3)

# Model 2: Fully grown Decision Tree (no max_depth)
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Print the accuracy comparison
print(f"Accuracy with max_depth=3: {accuracy_depth3:.2f}")
print(f"Accuracy with fully-grown tree: {accuracy_full:.2f}")

Accuracy with max_depth=3: 1.00
Accuracy with fully-grown tree: 1.00


# **Question 8: Write a Python program to:**
# **● Load the California Housing dataset from sklearn**
# **● Train a Decision Tree Regressor**
# **● Print the Mean Squared Error (MSE) and feature importances**

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict and calculate Mean Squared Error
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

# Print the Mean Squared Error
print(f"Mean Squared Error (MSE): {mse:.2f}")

# Print feature importances
print("Feature Importances:")
for name, importance in zip(data.feature_names, regressor.feature_importances_):
    print(f"{name}: {importance:.4f}")

Mean Squared Error (MSE): 0.50
Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


# **Question 9: Write a Python program to:**
# **● Load the Iris Dataset**

# **● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV**

# **● Print the best parameters and the resulting model accuracy**


In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 4, 6, 8]
}

# Create the Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Set up GridSearchCV
grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best estimator and predict on test data
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print the best parameters and accuracy
print("Best Parameters:", grid_search.best_params_)
print(f"Model Accuracy: {accuracy:.2f}")

Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy: 1.00


# **Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.**

# **Explain the step-by-step process you would follow to:**

# **● Handle the missing values**

# **● Encode the categorical features**

# **● Train a Decision Tree model**

# **● Tune its hyperparameters**

# **● Evaluate its performance**

# **And describe what business value this model could provide in the real-world setting.**

# **Ans10.**
As a Data Scientist at a healthcare company, your goal is to predict whether a patient has a certain disease using a large dataset that includes both numerical and categorical features, and contains missing values.
1. Handling Missing Values

Identify missing data:

Use .isnull().sum() to understand where missing values exist.

Imputation strategies:

For numerical features: Impute using the mean or median depending on distribution.

For categorical features: Impute using the mode (most frequent value).

Consider more advanced imputation (e.g., KNNImputer, or IterativeImputer, which is still experimental in scikit-learn and must be enabled explicitly) if the missingness pattern is complex.
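
The basic strategies above can be sketched as follows. The DataFrame, its column names, and its values are hypothetical and exist only to make the example runnable. In a real project the imputer would be fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical patient records; column names and values are made up.
df = pd.DataFrame({
    "age": [34, np.nan, 58, 45],
    "blood_pressure": [120.0, 135.0, np.nan, 110.0],
    "smoker": ["yes", "no", np.nan, "no"],
})

print(df.isnull().sum())  # number of missing values per column

# Numerical columns: fill gaps with the column median.
num_cols = ["age", "blood_pressure"]
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])

# Categorical column: fill gaps with the most frequent value (the mode).
df["smoker"] = df["smoker"].fillna(df["smoker"].mode()[0])

print(df)
```

The median is used here rather than the mean because it is less affected by outliers, which are common in clinical measurements.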

2. Encoding Categorical Features

Use One-Hot Encoding for low-cardinality categorical variables (e.g., gender, region).

Use Ordinal Encoding if the feature has a logical order (e.g., severity: low, medium, high).

For high-cardinality variables, consider Target Encoding or Frequency Encoding.

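Fit each encoder on the training set only, then apply the same fitted encoder to the test set, so that both sets share one mapping and no test information leaks into training.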

3. Train a Decision Tree Model

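Split the data into training and testing sets using train_test_split() (typically 70/30 or 80/20), passing stratify=y so both sets keep the same disease prevalence. In practice this split comes before fitting any imputer or encoder, so that preprocessing learns from the training set only.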

Train the model:


In [None]:
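from sklearn.tree import DecisionTreeClassifier

# X_train and y_train are the imputed, encoded training features and labels
# produced by the preprocessing and split steps above
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)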

4. Tune Hyperparameters

Use GridSearchCV or RandomizedSearchCV to find the best combination of hyperparameters.

Important parameters to tune:

max_depth: Limits tree depth to avoid overfitting.

min_samples_split: Minimum samples required to split a node.

min_samples_leaf: Minimum samples required at a leaf node.

criterion: Either "gini" or "entropy".

Example:

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_

5. Evaluate Performance

Use relevant classification metrics:

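Accuracy: Share of all predictions that are correct. It can be misleading when the disease is rare, because predicting "no disease" for everyone already scores high.

Precision and Recall: Crucial in medical settings. Recall shows how many true cases are caught (few false negatives), while precision shows how many flagged patients actually have the disease (few false positives).

F1-Score: Harmonic mean of precision and recall, useful when both error types matter.

ROC-AUC Score: Measures how well the predicted probabilities rank patients with the disease above those without it, independent of any single threshold.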

Example:

In [None]:
from sklearn.metrics import classification_report, roc_auc_score

y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1]))

Business Value in Real-World Setting
Early Detection: Helps identify at-risk patients early for timely intervention.

Improved Outcomes: Enables doctors to tailor treatment plans, improving patient health.

Resource Optimization: Focuses tests and treatments on the right patients, reducing waste.

Cost Savings: Avoids unnecessary diagnostic procedures, reducing expenses for both patients and the hospital.

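Compliance and Reporting: Because a Decision Tree's rules can be read directly, its predictions are easier to audit and to explain to clinicians and regulators, which supports regulatory requirements and data transparency.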