# Questions

Question 1: What is a Decision Tree, and how does it work in the context of
classification?
  - A Decision Tree is a supervised machine learning algorithm.
  - It can be used for both classification and regression problems, but it is most often used for classification.
  - It separates data into classes according to their features: each internal node tests one feature, each branch is an outcome of that test, and each leaf assigns a class.
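
To make the "tests on features" idea concrete, here is a minimal sketch that fits a shallow tree on the Iris data and prints its rules as text. The depth limit of 2 is an assumption chosen only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree (max_depth=2 is an illustrative choice)
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# export_text renders the learned splits as nested if/else style rules
rules = export_text(tree, feature_names=iris.feature_names)
print(rules)
```

Each indented line in the printout is one feature test, and each `class:` line is a leaf that assigns a label.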

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
  - Gini impurity is the probability of misclassifying a randomly chosen object if it were labelled at random according to the class distribution in the node.
  - Entropy is a measure of the disorder or uncertainty in a group.
  - Both guide the splits: the tree chooses the feature and threshold that produce the largest reduction in impurity.
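
Both measures can be computed by hand from the class counts in a node. The sketch below uses the standard formulas (Gini = 1 - sum of squared class proportions, entropy = sum of p * log2(1/p)); the function names and the toy label lists are made up for illustration.

```python
import math
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over the classes."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

pure = ["yes", "yes", "yes", "yes"]   # only one class: no impurity
mixed = ["yes", "yes", "no", "no"]    # 50/50 split: maximum impurity for two classes

print(gini_impurity(pure), entropy(pure))    # 0.0 0.0
print(gini_impurity(mixed), entropy(mixed))  # 0.5 1.0
```

A split is good when the child nodes look more like `pure` than like `mixed`.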

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
  - Pre-pruning means stopping the tree from growing too much while it is being built, for example by limiting its depth.
  - It is useful when you need faster training and simpler trees.
  - Post-pruning means building the full tree first, then cutting back branches that don’t help much with accuracy.
  - It is useful when you need better accuracy and generalization.
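
A rough sketch of both ideas in scikit-learn: pre-pruning through growth limits such as `max_depth`, and post-pruning through cost-complexity pruning (`ccp_alpha`). The specific limits and picking the middle alpha from the pruning path are assumptions for illustration; in practice alpha would normally be chosen with cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: growth is restricted while the tree is being built
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune it back with cost-complexity pruning
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice: take the middle alpha from the pruning path
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

print("Leaves (full tree):  ", full_tree.get_n_leaves())
print("Leaves (pre-pruned): ", pre_pruned.get_n_leaves())
print("Leaves (post-pruned):", post_pruned.get_n_leaves())
print("Test accuracy (pre-pruned): ", pre_pruned.score(X_test, y_test))
print("Test accuracy (post-pruned):", post_pruned.score(X_test, y_test))
```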

Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?
  - Information gain is the amount of information gained (the reduction in entropy) by splitting a parent node into child nodes.
  - It is important for choosing the best split because it helps in:
    - Finding the best feature to split on.
    - Improving accuracy by creating nodes that are as pure as possible.
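
As a small worked example, information gain can be computed as the parent's entropy minus the size-weighted entropy of the children. The helper names and toy labels below are hypothetical and only meant to show the arithmetic.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4

# A split that separates the classes completely
perfect_split = [["yes"] * 4, ["no"] * 4]
# A split whose children are just as mixed as the parent
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

The tree would prefer the first split, since it removes all the uncertainty in the parent node.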

Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?
  - Real-world examples of Decision Trees:
    - Healthcare - Classify whether a patient has a certain disease based on symptoms and test results.
    - Agriculture - Classify crop types or predict plant diseases using weather and soil data.
    - Quality Control - Detect whether a product is defective or not based on sensor data.
    - Marketing - Identify which type of customers are likely to buy a product.
  - Main advantages:
    - Easy to interpret and visualize.
    - Need little data preparation (for example, no feature scaling).
  - Main limitations:
    - Prone to overfitting when the tree grows deep.
    - Small changes in the data can produce a very different tree.

In [4]:
# Question 6: Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier using the Gini criterion
# ● Print the model’s accuracy and feature importances

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the data
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

# Train a tree that splits on Gini impurity
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Show how much each feature contributed to the splits
print("\nFeature Importances:")
for feature_name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")


Model Accuracy: 0.96

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.5517
petal width (cm): 0.4293


In [7]:
# Question 7: Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

# Load the data
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fully-grown tree: no depth limit
full_tree = DecisionTreeClassifier(criterion='gini', random_state=5)
full_tree.fit(X_train, y_train)

# Depth-limited tree: at most 3 levels of splits
limited_tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
limited_tree.fit(X_train, y_train)

# Predict with both trees on the same test set
y_pred_full = full_tree.predict(X_test)
y_pred_limited = limited_tree.predict(X_test)

accuracy_full = accuracy_score(y_test, y_pred_full)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

print(f"Accuracy of Fully-Grown Tree: {accuracy_full:.2f}")
print(f"Accuracy of Tree with max_depth=3: {accuracy_limited:.2f}")


Accuracy of Fully-Grown Tree: 1.00
Accuracy of Tree with max_depth=3: 1.00


In [9]:
# Question 8: Write a Python program to:
# ● Load the Boston Housing Dataset
# ● Train a Decision Tree Regressor
# ● Print the Mean Squared Error (MSE) and feature importances

# load_boston was removed from scikit-learn (version 1.2), so the data is fetched from OpenML instead

from sklearn.datasets import fetch_openml
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Download the Boston Housing data from OpenML as a pandas DataFrame
boston = fetch_openml(name="boston", version=1, as_frame=True)
X = boston.data
y = boston.target

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a regression tree that minimizes squared error
regressor = DecisionTreeRegressor(criterion='squared_error', random_state=42)
regressor.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = regressor.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.2f}")

# Show how much each feature contributed to the splits
print("\nFeature Importances:")
for feature_name, importance in zip(X.columns, regressor.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")


Mean Squared Error (MSE): 11.59

Feature Importances:
CRIM: 0.0585
ZN: 0.0010
INDUS: 0.0099
CHAS: 0.0003
NOX: 0.0071
RM: 0.5758
AGE: 0.0072
DIS: 0.1096
RAD: 0.0016
TAX: 0.0022
PTRATIO: 0.0250
B: 0.0119
LSTAT: 0.1900


In [10]:
# Question 9: Write a Python program to:
# ● Load the Iris Dataset
# ● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
# ● Print the best parameters and the resulting model accuracy

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# Load the data
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 30% of the samples for the final test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

dt = DecisionTreeClassifier(random_state=42)

# Candidate values for the two hyperparameters being tuned
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5, 10]
}

# 5-fold cross-validated search over every combination in the grid
grid_search = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# The search only sees the training data
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

print("Best Parameters Found:")
print(best_params)

# Evaluate the best model on the held-out test set
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"\nModel Accuracy with Best Parameters: {accuracy:.2f}")


Best Parameters Found:
{'max_depth': 4, 'min_samples_split': 10}

Model Accuracy with Best Parameters: 1.00


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
  - Step 1: Handling Missing Values
    - Identify missing data.
    - Impute missing values (for example, the median for numeric columns and the most frequent value for categorical columns).
  - Step 2: Encoding Categorical Features
    - Use One-Hot Encoding to encode categorical data.
  - Step 3: Training a Decision Tree Model
    - Split the dataset into training and testing sets.
    - Create a DecisionTreeClassifier and train it on the training data.
  - Step 4: Hyperparameter Tuning
    - Use GridSearchCV or RandomizedSearchCV with cross-validation to find the best combination of hyperparameters.
  - Step 5: Model Evaluation
    - Evaluate using appropriate metrics: accuracy, precision, recall, and F1-score.
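
The five steps can be sketched as one scikit-learn pipeline. Everything about the data here is an assumption: the column names (`age`, `blood_pressure`, `smoker`, `has_disease`), the synthetic values, and the parameter grid are hypothetical stand-ins for the real patient dataset, so this is only an outline of the workflow and not a finished solution.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the real patient records
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "age": rng.integers(20, 80, size=n).astype(float),
    "blood_pressure": rng.normal(120, 15, size=n),
    "smoker": rng.choice(["yes", "no"], size=n),
})
df["has_disease"] = ((df["age"] > 50) & (df["smoker"] == "yes")).astype(int)

# Knock out some values to imitate missing data
df.loc[rng.choice(n, size=20, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, size=20, replace=False), "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker"]

# Steps 1 and 2: impute missing values, then one-hot encode the categorical columns
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Step 3: preprocessing and the tree live in one pipeline, fitted on training data only
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X = df.drop(columns="has_disease")
y = df["has_disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Step 4: cross-validated search over an illustrative grid
search = GridSearchCV(
    model,
    param_grid={"tree__max_depth": [2, 3, 5, None], "tree__min_samples_leaf": [1, 5, 10]},
    cv=5,
    scoring="f1",
)
search.fit(X_train, y_train)

# Step 5: precision, recall, and F1-score on the held-out test set
print(search.best_params_)
print(classification_report(y_test, search.predict(X_test)))
```

Keeping the imputer and encoder inside the pipeline means they are refitted on each cross-validation fold, so no information from the validation or test data leaks into preprocessing.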