**Question 1: What is a Decision Tree, and how does it work in the context of classification?**

->A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. It works by recursively splitting the dataset into smaller, more homogeneous groups based on feature values, forming a tree-like structure of decision nodes and leaf nodes.

How it works in classification:
1. Root Node: The tree starts with the root node, representing the entire dataset.
2. Splitting: At each node, the algorithm chooses the feature and threshold that best separate the classes, as scored by a criterion such as Gini Impurity or Entropy (Information Gain).
3. Decision Nodes: Internal nodes represent conditions on features (for example, petal length <= 2.45 cm).
4. Leaf Nodes: Endpoints that represent the predicted class.
5. Recursive Partitioning: This process repeats until a stopping condition is met (e.g., the node is pure, a maximum depth is reached, or too few samples remain to split).
6. Prediction: For a new instance, the tree is traversed from root to leaf based on feature values, leading to a final class label.
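
The steps above can be seen on a small fitted tree. The sketch below is illustrative only: the choice of `max_depth=2` and `random_state=0` is an assumption made to keep the printed tree short, not part of the original answer. It prints the learned rules and then follows one sample from the root to its leaf.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree so the printed structure stays readable
# (max_depth=2 is an illustrative choice).
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Text rendering: each indented line is a decision node condition,
# and each "class:" line is a leaf node.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# Prediction for one sample: decision_path lists the nodes visited
# on the way from the root (node 0) to the leaf.
sample = iris.data[:1]
node_ids = tree.decision_path(sample).indices
predicted = iris.target_names[tree.predict(sample)[0]]
print("Visited nodes:", node_ids, "-> predicted class:", predicted)
```

The first sample in the Iris data is a setosa flower, so the traversal should end in a leaf labeled with that class.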


**Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?**

->*Gini Impurity*

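Definition: Measures the probability that a randomly chosen sample in a node would be incorrectly classified if it were labeled at random according to the class distribution in that node. For class proportions p_i in the node, Gini = 1 − Σ p_i². It is 0 for a pure node.

*Entropy*

Definition: Measures the level of uncertainty or disorder in the node. Derived from information theory: Entropy = −Σ p_i log2(p_i). It is 0 for a pure node and largest when the classes are evenly mixed. Information Gain is the reduction in entropy achieved by a split.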

*Impact on Decision Tree Splits:*

* Goal: At each split, the Decision Tree chooses the feature and threshold that reduce impurity the most.
1. With Gini: The algorithm picks the split whose child nodes have the lowest weighted average Gini impurity.
2. With Entropy: The algorithm picks the split with the highest Information Gain (reduction in entropy from the parent node to its children).

Practical Note: Both usually produce similar trees, but Gini is slightly faster to compute because it avoids the logarithm.
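
As a minimal sketch of the two measures, the helper functions below (`gini` and `entropy` are names chosen here for illustration, not scikit-learn functions) compute each impurity from the class labels in a node.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over the classes present."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

pure_node = [0, 0, 0, 0]      # one class only
mixed_node = [0, 0, 1, 1]     # evenly mixed, two classes
skewed_node = [0, 0, 0, 1]    # mostly one class

for name, node in [("pure", pure_node), ("mixed", mixed_node), ("skewed", skewed_node)]:
    print(f"{name:>6}: gini={gini(node):.3f}  entropy={entropy(node):.3f}")
```

Both measures are 0 for the pure node and peak for the evenly mixed node (0.5 for Gini, 1.0 bit for entropy with two classes), which is why they tend to rank candidate splits similarly.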

**Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**

->*Pre-Pruning:*

1. Stops the tree from growing beyond a certain limit during the building process (e.g., max_depth, min_samples_split, min_samples_leaf).
2. Applied while building the tree.
3. Prevents overfitting early by restricting complexity.

Advantage: Saves computation time and memory by avoiding unnecessary splits from the start.

*Post-Pruning:*

1. Grows the full tree first, then removes branches that do not improve performance enough to justify their complexity (e.g., cost-complexity pruning).
2. Applied after the full tree has been built.
3. Removes unnecessary complexity after seeing the full tree.

Advantage: Lets the tree evaluate deep splits before simplifying, so it can keep a useful split that follows a weak one, which can lead to better accuracy in some cases.
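
A rough sketch of both approaches in scikit-learn follows. Pre-pruning is expressed through growth limits such as `max_depth`, and post-pruning through cost-complexity pruning (`ccp_alpha`). The specific limits and the mid-range alpha picked here are arbitrary assumptions for illustration; in practice alpha would normally be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Unpruned reference tree.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning: growth is limited while the tree is being built.
pre = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: compute the candidate alphas from the fully grown tree,
# then refit with one of them (arbitrary mid-range value, for illustration).
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name:>11}: leaves={model.get_n_leaves()}, "
          f"test accuracy={model.score(X_test, y_test):.3f}")
```

Comparing the leaf counts shows how much each strategy simplifies the tree relative to the unpruned one.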

**Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?**

->Information Gain (IG) measures the reduction in impurity (uncertainty) in a dataset after splitting it on a specific feature. It is calculated using Entropy as the impurity measure: IG = Entropy(parent) − weighted average Entropy(children).

It is important for choosing the best split because:
1. The higher the Information Gain, the more effective the split is at reducing uncertainty.
2. With the entropy criterion, Decision Trees choose the feature and threshold with the highest Information Gain because it leads to purer child nodes.
3. This greedy process means the tree tends to learn the most informative and discriminative patterns first.
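
To make the calculation concrete, here is a small sketch that scores three hand-made candidate splits of the same parent node. The helper names `entropy` and `information_gain` and the toy label lists are assumptions made for this example.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of the class labels in one node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [0, 0, 0, 0, 1, 1, 1, 1]              # entropy = 1 bit

perfect_split = ([0, 0, 0, 0], [1, 1, 1, 1])   # both children pure
weak_split = ([0, 0, 0, 1], [0, 1, 1, 1])      # children still mixed
useless_split = ([0, 0, 1, 1], [0, 0, 1, 1])   # children look like the parent

for name, split in [("perfect", perfect_split),
                    ("weak", weak_split),
                    ("useless", useless_split)]:
    print(f"{name:>7}: IG = {information_gain(parent, split):.3f}")
```

The perfect split recovers the full 1 bit of parent entropy, the useless split gains nothing, and the weak split falls in between, so a tree using the entropy criterion would pick the perfect split.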



**Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?**

->Applications: Used in customer segmentation, medical diagnosis, fraud detection, credit risk assessment, manufacturing quality control, and recommendation systems.

Advantages: Easy to interpret, handles numerical and categorical data, needs no feature scaling, can work with missing data in some implementations, and models non-linear relationships.

Limitations: Prone to overfitting, unstable under small changes in the data, biased toward majority classes on imbalanced datasets, less precise for continuous outputs (predictions are piecewise constant), and relies on greedy splitting that may miss the global optimum.



In [1]:
"""
Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances

"""
import warnings
warnings.filterwarnings('ignore')

from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

from sklearn.model_selection import train_test_split
# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

from sklearn.tree import DecisionTreeClassifier
# Initialize the Decision Tree Classifier with Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)
from sklearn.metrics import accuracy_score
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Model Accuracy:", accuracy)
print("Feature Importances:", clf.feature_importances_)
print("Feature Names:", iris.feature_names)

Model Accuracy: 1.0
Feature Importances: [0.         0.01667014 0.90614339 0.07718647]
Feature Names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [2]:
"""
Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

"""
import warnings
warnings.filterwarnings('ignore')

from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

from sklearn.model_selection import train_test_split
# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Fully-grown decision tree (no depth limit)
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
y_pred_full = full_tree.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Decision tree with max_depth = 3
limited_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
limited_tree.fit(X_train, y_train)
y_pred_limited = limited_tree.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# Print results
print("Fully-grown Tree Accuracy:", accuracy_full)
print("Max Depth = 3 Tree Accuracy:", accuracy_limited)

Fully-grown Tree Accuracy: 1.0
Max Depth = 3 Tree Accuracy: 1.0


In [3]:
"""
Question 8: Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances

"""
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load dataset from URL
url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
df = pd.read_csv(url)

# Separate features and target
X = df.drop("medv", axis=1)  # medv = median value of owner-occupied homes
y = df["medv"]

# Split dataset into training & testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Decision Tree Regressor
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Feature Importances
feature_importances = pd.DataFrame({
    "Feature": X.columns,
    "Importance": model.feature_importances_
}).sort_values(by="Importance", ascending=False)

print("\nFeature Importances:")
print(feature_importances)

Mean Squared Error: 10.42

Feature Importances:
    Feature  Importance
5        rm    0.600326
12    lstat    0.193328
7       dis    0.070688
0      crim    0.051296
4       nox    0.027148
6       age    0.013617
9       tax    0.012464
10  ptratio    0.011012
11        b    0.009009
2     indus    0.005816
1        zn    0.003353
8       rad    0.001941
3      chas    0.000002


In [4]:
"""
Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
● Print the best parameters and the resulting model accuracy

"""
import warnings
warnings.filterwarnings('ignore')

from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

from sklearn.model_selection import train_test_split, GridSearchCV
# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Define parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5]
}

# Initialize the Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    cv=5,               # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1
)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Best parameters
best_params = grid_search.best_params_

# Evaluate the best model on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", best_params)
print("Model Accuracy with Best Parameters:", accuracy)

Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy with Best Parameters: 1.0


**Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.**

Explain the step-by-step process you would follow to:
* Handle the missing values
* Encode the categorical features
* Train a Decision Tree model
* Tune its hyperparameters
* Evaluate its performance

And describe what business value this model could provide in the real-world setting.

->
1. Handle Missing Values
* Numerical Features: Replace missing values with the median or mean (the median is safer for skewed data).
* Categorical Features: Replace missing values with the most frequent category or use a special “Unknown” label.
* Tooling: SimpleImputer from sklearn automates these strategies; KNNImputer is a more advanced option that fills gaps using similar rows. Fit the imputer on the training data only to avoid data leakage.

2. Encode Categorical Features
* Nominal Features: Use One-Hot Encoding (pd.get_dummies() or OneHotEncoder).
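* Ordinal Features: Map categories to integers in their natural order (e.g., OrdinalEncoder with an explicit category order); LabelEncoder is intended for target labels rather than input features.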
* Ensure the encoding is consistent for training and future prediction data.

3. Train a Decision Tree Model
* Split the dataset into training and testing sets (e.g., 80%-20%), stratifying on the disease label so both sets keep a similar class balance.
* Use DecisionTreeClassifier from sklearn with an initial set of parameters (e.g., criterion='gini').

4. Tune Hyperparameters
* Use GridSearchCV or RandomizedSearchCV to find optimal max_depth, min_samples_split, and min_samples_leaf.
* Apply cross-validation (e.g., 5-fold) so the chosen settings do not overfit a single train/validation split and the performance estimate is more robust.

5. Evaluate Performance
* For classification: Check accuracy, precision, recall, F1-score, and ROC-AUC.
* Use a confusion matrix to understand class-level performance.
* If the dataset is imbalanced, focus on recall or F1-score for the disease-positive class.

6. Business Value in Real-World Setting:
* Faster Diagnosis: Provides doctors with quick and consistent decision support.
* Early Detection: Identifies high-risk patients earlier, enabling timely intervention.
* Resource Optimization: Helps hospitals prioritize patients who need urgent tests or treatment.
* Cost Reduction: Reduces unnecessary diagnostic tests for low-risk patients.
* Scalability: Can be applied to large patient datasets across multiple hospitals.
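
Steps 1 to 5 can be sketched as a single scikit-learn pipeline. Everything about the data below is an assumption made for illustration: the column names (`age`, `blood_pressure`, `smoker`), the synthetic label rule, and the small parameter grid are hypothetical stand-ins for a real patient dataset, not a recommended configuration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (hypothetical columns and label rule).
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
y = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)

# Inject missing values into one numeric and one categorical column.
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker"]

# Steps 1-2: impute and encode inside the pipeline, so the statistics
# are learned from training folds only.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: the model; class_weight="balanced" is one way to offset imbalance.
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: tune with 5-fold cross-validation, scoring on F1 for the positive class.
param_grid = {
    "tree__max_depth": [2, 3, 5, None],
    "tree__min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Step 5: evaluate the tuned pipeline on the held-out test set.
print("Best parameters:", search.best_params_)
print(classification_report(y_test, search.predict(X_test)))
```

Keeping imputation and encoding inside the pipeline means the same fitted transformations are applied to future patient records, which addresses the consistency point in step 2.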
