'''

Question 1:
What is a Decision Tree, and how does it work in the context of classification?

Answer:
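A Decision Tree is a supervised machine learning algorithm used for classification and regression.
In classification, it recursively splits the dataset into smaller subsets based on feature values.
Each internal node represents a decision rule (a test on one feature), each branch represents
an outcome of that rule, and each leaf node represents a class label.

Training starts at the root node. At each node the algorithm selects the feature and threshold
whose split most reduces an impurity measure such as Gini Impurity or Entropy, then repeats the
process on each child until a stopping condition is reached (for example, the node is pure or
the maximum depth is hit). To classify a new sample, the tree follows the matching branches from
the root down to a leaf and returns that leaf's class.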

'''
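'''
Illustration for Question 1:
The node, branch, and leaf structure described above can be inspected directly.
This is a minimal sketch, assuming scikit-learn is installed; the Iris dataset,
max_depth=2, and random_state=0 are choices made here only to keep the printed
rules short.
'''
```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short enough to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Lines containing a comparison (e.g. "<=") are decision rules at internal nodes;
# lines containing "class:" are leaves.
rules = export_text(tree, feature_names=iris.feature_names)
print(rules)
```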

'''


Question 2:
Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

Answer:
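Gini Impurity is the probability that a randomly chosen sample from a node would be misclassified
if it were labeled at random according to the node's class distribution:
Gini = 1 - sum(p_k^2), where p_k is the fraction of samples in class k.
A lower Gini value indicates higher purity; a node containing a single class has Gini = 0.

Entropy measures the uncertainty of the class distribution in a node:
Entropy = -sum(p_k * log2(p_k)).
Lower entropy means the node is purer; a single-class node has Entropy = 0.

At each node, the Decision Tree evaluates candidate splits and picks the one with the largest
size-weighted decrease in impurity, which yields purer child nodes. The two measures usually
lead to similar trees; Gini is slightly cheaper to compute because it needs no logarithm.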

'''
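'''
Illustration for Question 2:
Both impurity measures can be computed by hand from class counts.
This is a small sketch using NumPy; the helper names (class_probabilities,
gini, entropy) and the toy label lists are made up for this example.
'''
```python
import numpy as np

def class_probabilities(labels):
    """Fraction of samples belonging to each class that is present."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2)."""
    p = class_probabilities(labels)
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Entropy in bits: sum(p_k * log2(1 / p_k)), equal to -sum(p_k * log2(p_k))."""
    p = class_probabilities(labels)
    return float(np.sum(p * np.log2(1.0 / p)))

pure = [0, 0, 0, 0]    # one class only
mixed = [0, 0, 1, 1]   # two classes, evenly split

print(gini(pure), entropy(pure))    # 0.0 0.0
print(gini(mixed), entropy(mixed))  # 0.5 1.0
```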
'''


Question 3:
What is the difference between Pre-Pruning and Post-Pruning in Decision Trees?
Give one practical advantage of using each.

Answer:
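Pre-Pruning stops tree growth early by setting limits such as maximum depth, minimum samples
per split, or minimum samples per leaf.
Its advantage is faster training and a smaller tree, which reduces overfitting.

Post-Pruning lets the tree grow fully and then removes branches that add little predictive value
(for example, cost-complexity pruning).
Its advantage is that pruning decisions are made with the whole tree available, so useful deep
splits that an early stop would have missed can be kept, which often generalizes better.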
'''
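'''
Illustration for Question 3:
The two pruning styles can be compared side by side in scikit-learn.
This is a sketch under assumptions: pre-pruning is shown with max_depth, and
post-pruning with cost-complexity pruning (ccp_alpha). The mid-range alpha is an
arbitrary choice for illustration; in practice it would be picked by cross-validation.
'''
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: cap the depth before training starts.
pre_pruned = DecisionTreeClassifier(max_depth=3, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then compute its pruning path.
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Arbitrary illustrative choice: a mid-range alpha from the path.
# max(0.0, ...) guards against tiny negative values from floating-point error.
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned),
                    ("post-pruned", post_pruned)]:
    print(name, "leaves:", model.get_n_leaves(),
          "test accuracy:", model.score(X_test, y_test))
```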
'''

Question 4:
What is Information Gain in Decision Trees, and why is it important for choosing the best split?

Answer:
Information Gain measures the reduction in entropy after splitting the dataset on a feature:
IG = Entropy(parent) - sum((n_child / n_parent) * Entropy(child)).
A higher Information Gain means the split leaves less uncertainty about the class label.

It is important because the tree uses it to rank candidate splits and choose the one
that best separates the data into pure classes.
'''
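'''
Illustration for Question 4:
Information Gain is the parent's entropy minus the size-weighted entropy of the children.
The sketch below computes it for two candidate splits; the function names and the
toy labels are invented for this example.
'''
```python
import numpy as np

def entropy(labels):
    """Entropy in bits of the class distribution in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [0, 0, 0, 1, 1, 1]
perfect_split = [[0, 0, 0], [1, 1, 1]]  # each child holds a single class
poor_split = [[0, 0, 1], [0, 1, 1]]     # children are almost as mixed as the parent

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, poor_split))     # about 0.08
```

A tree using the entropy criterion would prefer the first split, since it removes
all uncertainty about the class.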

'''

Question 5:
What are some common real-world applications of Decision Trees,
and what are their main advantages and limitations?

Answer:
Decision Trees are used in healthcare diagnosis, fraud detection, credit scoring,
customer churn prediction, and recommendation systems.

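Advantages include easy interpretation, support for both numerical and categorical features
(categorical features may need encoding, depending on the library), and minimal preprocessing,
since no feature scaling is required.

Limitations include overfitting when the tree grows deep, instability (small changes in the
training data can produce a very different tree), and bias toward the majority class on
imbalanced datasets.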


'''


'''

Question 6:
Write a Python program to load the Iris Dataset, train a Decision Tree Classifier
using the Gini criterion, and print accuracy and feature importances.

Answer:
'''
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

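# Load the Iris dataset (150 samples, 4 features, 3 classes)
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with the Gini criterion; no random_state is set, so results can vary slightly between runs
model = DecisionTreeClassifier(criterion="gini")
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
# Importances follow the order of iris.feature_names
print("Feature Importances:", model.feature_importances_)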




# Output:
# Accuracy: 1.0
# Feature Importances: [0.03334028 0.         0.88947325 0.07718647]


'''


Question 7:
Write a Python program to train a Decision Tree Classifier with max_depth=3
and compare its accuracy with a fully grown tree.

Answer:
'''
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Reuses the Iris train/test split (X_train, X_test, y_train, y_test) from Question 6
full_tree = DecisionTreeClassifier(random_state=42)
limited_tree = DecisionTreeClassifier(max_depth=3, random_state=42)

full_tree.fit(X_train, y_train)
limited_tree.fit(X_train, y_train)

print("Full Tree Accuracy:", accuracy_score(y_test, full_tree.predict(X_test)))
print("Max Depth=3 Accuracy:", accuracy_score(y_test, limited_tree.predict(X_test)))



# Output:
# Full Tree Accuracy: 1.0
# Max Depth=3 Accuracy: 1.0


'''


Question 8:
Write a Python program to load a housing dataset,
train a Decision Tree Regressor,
and print the Mean Squared Error (MSE) and feature importances.


Answer:
'''
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California Housing Dataset (replacement for Boston Housing)
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Decision Tree Regressor
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Output results
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("Feature Importances:", model.feature_importances_)


# Output:
# Mean Squared Error: 0.495235205629094
# Feature Importances: [0.52850909 0.05188354 0.05297497 0.02866046 0.03051568 0.13083768
#  0.09371656 0.08290203]


'''


Question 9:
Write a Python program to tune max_depth and min_samples_split
using GridSearchCV and print best parameters and accuracy.

Answer:
'''
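from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Reload Iris: the Question 8 cell rebound X_train and y_train to the housing data,
# whose continuous target a classifier cannot fit.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_split": [2, 5, 10]
}

# 5-fold cross-validated search over all 12 parameter combinations
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
# best_score_ is the mean cross-validated accuracy on the training data, not test accuracy
print("Best Accuracy:", grid.best_score_)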



# Output:
# Best Parameters: {'max_depth': 7, 'min_samples_split': 2}
# Best Accuracy: 0.9416666666666668


'''


Question 10:
Explain the step-by-step process to build a Decision Tree model
for disease prediction and describe its business value.

Answer:
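First, handle missing values using the mean, median, or mode.
Second, encode categorical features using Ordinal (Label) or One-Hot Encoding.
Third, split the data into training and test sets and train a Decision Tree model.
Fit the imputation and encoding steps on the training set only, so that no information
leaks from the test set.
Fourth, tune hyperparameters such as max_depth and min_samples_split using GridSearchCV.
Finally, evaluate performance on the test set using accuracy, precision, recall, or ROC-AUC.
Recall matters most when a missed diagnosis is costlier than a false alarm.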

Business value includes early disease detection, better decision-making,
reduced healthcare costs, and improved patient outcomes.
'''
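'''
Illustration for Question 10:
The steps above can be sketched as a single scikit-learn pipeline.
Everything about the data here is hypothetical: the column names (age,
blood_pressure, smoker), the synthetic values, and the rule that generates the
label are invented so the example runs on its own. It assumes pandas and
scikit-learn are installed, and it uses recall as the tuning metric as one
possible choice.
'''
```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic patient records (not real data).
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
# Made-up label rule so the tree has a signal to learn.
y = ((df["age"] > 50) & (df["smoker"] == "yes")).astype(int)
# Blank out some values to imitate missing measurements.
df.loc[rng.choice(n, 30, replace=False), "blood_pressure"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker"]

# Steps 1 and 2: impute missing values, then one-hot encode categorical features.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Step 3: split first; the pipeline then fits preprocessing on training data only.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: tune hyperparameters with cross-validation.
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_split": [2, 5, 10],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
test_recall = recall_score(y_test, search.predict(X_test))
print("Best Parameters:", search.best_params_)
print("Test Recall:", test_recall)
```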
