## DECISION TREE ASSIGNMENT

1. What is a Decision Tree, and how does it work in the context of classification?

  - A decision tree is a type of machine learning model that is used for classification and prediction. It works like a flowchart, where each internal node represents a condition or test on a feature, branches represent possible outcomes, and leaf nodes show the final decision or class label. The model is built by recursively splitting the data: at each node it picks the feature and threshold that separate the classes best, as measured by a criterion such as Gini impurity or information gain. In classification, the tree assigns new data to a class by following the path of conditions from the root until it reaches a leaf. It’s simple to understand and visualize, making it popular for explaining decisions.
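To make the flowchart idea concrete, here is a minimal sketch that trains a shallow tree on the Iris data and prints its learned rules as text. The `max_depth=2` setting and the choice of the first sample are assumptions made only to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Train a deliberately shallow tree so the printed rules stay readable
# (max_depth=2 is an illustrative choice, not a recommended setting).
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Each indented line is a test at an internal node; "class:" lines are leaves.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# Classifying a sample means following those tests from the root to a leaf.
sample = iris.data[:1]
predicted_class = iris.target_names[tree.predict(sample)[0]]
print("Predicted class for the first sample:", predicted_class)
```

Reading the printed rules from top to bottom shows the same path the model follows when it classifies a new flower.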

2. Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

  - Gini Impurity and Entropy are two ways to measure how mixed or “impure” a dataset is when building a decision tree. Gini Impurity shows the chance of misclassifying a randomly chosen element if it were labeled at random according to the distribution of classes in the node. Entropy, on the other hand, comes from information theory and measures the amount of uncertainty or disorder in the data. For class proportions p_i, Gini impurity is 1 − Σ p_i² and entropy is −Σ p_i log2(p_i); both are 0 for a pure node and largest when the classes are evenly mixed. When splitting, the decision tree looks for the split that reduces impurity the most, meaning the resulting groups are more pure and consistent. Lower impurity after a split means each child node is dominated by one class, so the predictions made at those nodes are more reliable. In practice the two measures usually lead to very similar trees.
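The two measures can be computed by hand in a few lines. This is a small sketch using only the standard library; the helper names (`class_proportions`, `gini_impurity`, `entropy`) and the toy label lists are made up for illustration.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Return the fraction of samples belonging to each class."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over the classes."""
    return sum(p * log2(1 / p) for p in class_proportions(labels))


# Toy nodes (illustrative data): one pure, one evenly mixed.
pure_node = ["yes", "yes", "yes", "yes"]
mixed_node = ["yes", "yes", "no", "no"]

print("Pure node  -> Gini:", gini_impurity(pure_node), "Entropy:", entropy(pure_node))
print("Mixed node -> Gini:", gini_impurity(mixed_node), "Entropy:", entropy(mixed_node))
```

The pure node scores zero on both measures, while the evenly mixed two-class node reaches the maximum for each (0.5 for Gini, 1 bit for entropy).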

3. What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

  - Pre-pruning and post-pruning are techniques used to avoid overfitting in decision trees. **Pre-pruning** stops the tree from growing too deep by setting limits like maximum depth, minimum samples per split, or minimum leaf size. This saves time and keeps the model simpler. **Post-pruning**, on the other hand, first allows the tree to grow fully and then trims back branches that don’t add much predictive power. A key advantage of pre-pruning is faster training since the tree is controlled from the start, while post-pruning can generalize better because it evaluates the full tree before simplifying it, so a useful split deeper in the tree is not cut off too early.
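As a rough sketch of both approaches in scikit-learn: pre-pruning is expressed through growth limits such as `max_depth` and `min_samples_leaf`, and post-pruning is available as cost-complexity pruning through `ccp_alpha`. The specific limits and the way `alpha` is picked below (the middle value of the pruning path) are arbitrary assumptions for illustration; in practice `alpha` would normally be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: limit growth while the tree is being built.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree first, then prune with cost-complexity pruning.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice only: take the middle alpha from the pruning path.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(
    X_train, y_train
)

for name, model in [
    ("Full tree", full_tree),
    ("Pre-pruned", pre_pruned),
    ("Post-pruned", post_pruned),
]:
    print(
        f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}, "
        f"test accuracy={model.score(X_test, y_test):.3f}"
    )
```

Comparing depth and leaf counts next to accuracy shows how much simpler each pruned tree is than the fully grown one.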


4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?

  - Information Gain tells us how much “clarity” we get after splitting the data using a feature. It is calculated as the entropy of the parent node minus the weighted average entropy of the child nodes created by the split. A feature with high information gain separates the data better, making the groups more pure. It is important because the decision tree greedily chooses, at every node, the split with the highest gain, which gives the most useful separation and leads to more accurate results.
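The calculation can be sketched by hand as parent entropy minus the size-weighted entropy of the children. The function names and the two toy splits below are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return sum((c / total) * log2(total / c) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = len(parent)
    weighted_child_entropy = sum(
        (len(child) / total) * entropy(child) for child in children
    )
    return entropy(parent) - weighted_child_entropy


# Toy parent node with an even class mix (illustrative data).
parent = ["yes"] * 4 + ["no"] * 4

# A split that separates the classes completely.
perfect_split = [["yes"] * 4, ["no"] * 4]
# A split that leaves both children as mixed as the parent.
poor_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

gain_perfect = information_gain(parent, perfect_split)
gain_poor = information_gain(parent, poor_split)
print("Gain of perfect split:", gain_perfect)
print("Gain of poor split:", gain_poor)
```

A tree comparing these two candidate splits would pick the first one, because it removes all of the uncertainty in the parent node while the second removes none.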

5. What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

  - Decision trees are used in many real-life areas. For example, banks use them to check if someone is likely to repay a loan, doctors use them to help in diagnosing diseases, and businesses use them to predict customer behavior. Their main advantages are that they are easy to understand, explain, and visualize, even for non-technical people. However, they also have limitations: trees can easily become too complex and overfit the data, and small changes in the training data can sometimes change the whole tree structure.






In [1]:
"""Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances"""

# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train a Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# 4. Make predictions
y_pred = clf.predict(X_test)

# 5. Print model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# 6. Print feature importances
print("Feature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")



Model Accuracy: 1.0
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


In [2]:
"""Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree."""


# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train Decision Tree with max_depth=3
clf_depth3 = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_depth3.fit(X_train, y_train)

# 4. Train a fully-grown Decision Tree (no depth limit)
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)

# 5. Make predictions
y_pred_depth3 = clf_depth3.predict(X_test)
y_pred_full = clf_full.predict(X_test)

# 6. Print and compare accuracies
accuracy_depth3 = accuracy_score(y_test, y_pred_depth3)
accuracy_full = accuracy_score(y_test, y_pred_full)

print("Decision Tree (max_depth=3) Accuracy:", accuracy_depth3)
print("Decision Tree (Fully-grown) Accuracy:", accuracy_full)


Decision Tree (max_depth=3) Accuracy: 1.0
Decision Tree (Fully-grown) Accuracy: 1.0


In [3]:
"""Question 8: Write a Python program to:
● Load the California Housing dataset from sklearn
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances"""

# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# 1. Load the California Housing dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target

# 2. Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train a Decision Tree Regressor
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# 4. Make predictions
y_pred = reg.predict(X_test)

# 5. Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)

# 6. Print feature importances
print("\nFeature Importances:")
for feature, importance in zip(housing.feature_names, reg.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Mean Squared Error (MSE): 0.495235205629094

Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


In [4]:
"""Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy"""

# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Define the Decision Tree and parameter grid
clf = DecisionTreeClassifier(random_state=42)
param_grid = {
    'max_depth': [1, 2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5]
}

# 4. Perform GridSearchCV
grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# 5. Get the best parameters and make predictions
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# 6. Print results
accuracy = accuracy_score(y_test, y_pred)
print("Best Parameters:", best_params)
print("Model Accuracy with Best Parameters:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy with Best Parameters: 1.0



  - If I were working as a data scientist for a healthcare company and needed to predict whether a patient has a certain disease, here’s how I would approach it:

     **1. Handle missing values:** I would first look at the dataset to see which columns have missing information. For numbers like age or blood pressure, I’d fill the gaps with the average or median values, computed from the training data only so that no information leaks from the test set. For categories like gender or blood type, I’d either use the most common value or create a separate “Unknown” category.

     **2. Encode categorical features:** Since scikit-learn’s decision trees require numeric input, I’d convert categorical features into numerical form. For example, unordered categories like blood type can use one-hot encoding, and ordered categories like disease stages can use ordinal encoding, which keeps their natural order.

     **3. Train a Decision Tree model:** I’d split the data into training and testing sets and train a decision tree classifier. The tree would learn patterns in the data to separate patients with the disease from those without.

     **4. Tune hyperparameters:** To make sure the model isn’t too simple or too complicated, I’d test different settings like the tree’s maximum depth or minimum samples per split using GridSearchCV or RandomizedSearchCV. This helps the model perform well on new, unseen patients.

     **5. Evaluate performance:** I’d check the model’s accuracy, precision, recall, F1-score, and maybe ROC-AUC. In healthcare, it’s especially important to correctly identify sick patients, so I’d pay particular attention to recall, since a false negative means a sick patient is missed.

     **Business value:** This model could help doctors catch diseases earlier, prioritize patients who need urgent care, and make better treatment decisions. It could also save costs by avoiding unnecessary tests for low-risk patients while improving overall patient outcomes.
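The steps above can be sketched end to end with a scikit-learn pipeline. Everything about the data here is an assumption: the column names (`age`, `blood_pressure`, `blood_type`, `has_disease`), the synthetic values, and the rule that generates the label are all made up so the example runs without a real patient dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (hypothetical columns and values).
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "blood_type": pd.Series(rng.choice(["A", "B", "AB", "O"], n), dtype=object),
})
# Made-up labeling rule, used only to give the tree something to learn.
df["has_disease"] = ((df["age"] > 60) & (df["blood_pressure"] > 120)).astype(int)
# Knock out some values to simulate missing data.
df.loc[rng.choice(n, 40, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "blood_type"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["blood_type"]

# Steps 1 and 2: impute and encode inside the pipeline, so the statistics
# are learned from the training folds only.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Step 3: decision tree on top of the preprocessing.
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X = df.drop(columns="has_disease")
y = df["has_disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: tune depth and split size, optimizing recall to limit false negatives.
param_grid = {
    "tree__max_depth": [2, 3, 4, 5],
    "tree__min_samples_split": [2, 5, 10],
}
search = GridSearchCV(model, param_grid, scoring="recall", cv=5)
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out patients.
y_pred = search.predict(X_test)
recall = recall_score(y_test, y_pred)
print("Best parameters:", search.best_params_)
print("Recall on the test set:", recall)
print(classification_report(y_test, y_pred))
```

Putting the imputers and the encoder inside the pipeline is a design choice: it keeps the preprocessing and the tree tuned together, and each cross-validation fold only sees statistics computed from its own training portion.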