#  Assignment


#### Question 1: What is a Decision Tree, and how does it work in the context of classification?
##### Ans.
A Decision Tree is a supervised machine learning algorithm used for both classification and regression. In classification tasks, it predicts the class label of an instance by learning simple if-else rules from the data.

How it Works:

1. Root Node:\
The process starts with the entire dataset at the root. The algorithm chooses the feature (and split point) that best separates the classes, using a criterion such as Gini Impurity or Information Gain.

2. Splitting:\
The data is divided into subsets based on feature values. Each decision splits the dataset into smaller, more homogeneous groups.

3. Decision Nodes & Branches:\
Internal nodes represent decisions based on features, and branches represent the outcomes of those decisions.

4. Leaf Nodes:\
Splitting continues until a stopping condition is reached (e.g., all samples belong to one class, or no further improvement is possible). Each leaf node assigns a final class label, typically the majority class of the training samples that reach it.

Example (Binary Classification):

If we want to classify whether a person buys a laptop (Yes/No) based on Age and Income:
* Root Node: Is Age > 30?
   * If No → Predict "No"
   * If Yes → Move to next decision
* Decision Node: Is Income > 50,000?
   * If Yes → Predict "Yes"
   * If No → Predict "No"
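
The example tree above can be written directly as nested if-else rules. This is only a hand-written sketch: the function name `predict_buys_laptop` is made up for illustration, and the thresholds are the ones from the example rather than anything learned from data.

```python
def predict_buys_laptop(age: float, income: float) -> str:
    """Hand-coded version of the example tree (hypothetical helper)."""
    # Root node: Is Age > 30?
    if age <= 30:
        return "No"
    # Decision node: Is Income > 50,000?
    if income > 50_000:
        return "Yes"
    return "No"


print(predict_buys_laptop(age=25, income=80_000))  # No (fails the age test)
print(predict_buys_laptop(age=40, income=60_000))  # Yes
print(predict_buys_laptop(age=40, income=30_000))  # No
```

A trained Decision Tree produces rules of the same shape, but it picks the features and thresholds itself from the training data.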

#### Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
##### Ans.
In Decision Trees, impurity measures help determine how well a feature splits the data. The goal is to create nodes that are as pure as possible, meaning most or all samples in a node belong to the same class. Two common measures are Gini Impurity and Entropy.

1. Gini Impurity

* Measures the probability of misclassifying a randomly chosen sample if it were labeled at random according to the class distribution in the node.
* Formula:\
Gini = 1 - Σ (p_i²)
where p_i is the proportion of class i in the node.
* A Gini value of 0 means the node is pure (all samples belong to one class).

2. Entropy
* Measures the uncertainty or disorder in the node.
* Formula:\
Entropy = - Σ (p_i * log2(p_i))
* Entropy = 0 indicates a pure node, and higher values indicate more mixed classes.
* Information Gain is calculated as the reduction in entropy after a split.

Impact on Splits
* Both measures guide the tree to select splits that create purer subsets.
* Gini is slightly cheaper to compute because it avoids the logarithm.
* Entropy requires a logarithm for each class proportion; in practice the two criteria usually select the same or very similar splits.
* At each node, the chosen split is the one that maximizes the reduction in size-weighted child impurity (for entropy, the Information Gain), which produces purer child nodes.
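
Both formulas are short enough to compute by hand. The following is a minimal sketch using only the standard library; the helper names `gini` and `entropy` and the toy label lists are assumptions made for this illustration.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: sum(-p_i * log2(p_i)) over classes present in the node."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


pure = ["Yes"] * 4                  # every sample has the same class
mixed = ["Yes", "Yes", "No", "No"]  # 50/50 binary node

print(gini(pure), entropy(pure))    # both are zero for a pure node
print(gini(mixed), entropy(mixed))  # 0.5 and 1.0, the binary maxima
```

For two classes, Gini peaks at 0.5 and entropy at 1.0, both at a 50/50 mix, which is why the two measures tend to rank candidate splits similarly.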

#### Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
##### Ans.
Pre-Pruning (Early Stopping)
* The tree's growth is stopped early based on certain conditions before it becomes overly complex.

* Common stopping criteria:

   * Maximum depth (max_depth)
   * Minimum samples to split (min_samples_split)
   * Minimum samples in a leaf (min_samples_leaf)
   * Minimum impurity decrease (min_impurity_decrease)
* Practical advantage: Saves computation time and helps limit overfitting by keeping the model simpler from the start.

Post-Pruning (Prune After Full Growth)
* The tree is grown to its maximum size (or nearly so), then branches that provide little predictive power are removed.
* Techniques:
   * Reduced Error Pruning (evaluate on validation set and remove unhelpful branches)
   * Cost Complexity Pruning (ccp_alpha in scikit-learn)
* Practical advantage: Allows the model to initially capture complex patterns and then simplifies it for better generalization, often leading to improved accuracy on unseen data.
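
A rough sketch of both approaches with scikit-learn on the Iris data. The specific settings here (`max_depth=3`, `min_samples_leaf=5`, and taking the middle alpha from the pruning path) are arbitrary choices for illustration; in practice the alpha would normally be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Pre-pruning: constrain growth before training starts.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, list the candidate alphas for
# cost-complexity pruning, then refit with one of them.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
# Illustrative choice only: an alpha from the middle of the path,
# clipped at zero to guard against floating-point noise.
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post.fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name}: {model.tree_.node_count} nodes, "
          f"test accuracy {model.score(X_test, y_test):.4f}")
```

Comparing the node counts shows how much each strategy simplifies the tree relative to the fully grown one.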



#### Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?
##### Ans.
Information Gain (IG) is a measure used in Decision Trees to quantify how much a feature improves the purity of a node. It calculates the reduction in entropy (uncertainty or disorder) after splitting the dataset based on a feature.

Formula:

Information Gain = Entropy(Parent) - Σ ( (size of child / size of parent) × Entropy(Child) )


* Entropy(Parent): Entropy of the dataset before the split.
* Entropy(Child): Entropy of each subset created after the split.

Importance in Choosing Splits:

* Information Gain helps the tree decide which feature to split on at each node.
* The feature that produces the highest Information Gain (i.e., maximally reduces uncertainty) is chosen.
* This makes the resulting subsets more homogeneous, which generally leads to a tree that classifies data more accurately.

Example:
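If the parent node has entropy 0.9 and splitting on Feature A reduces the weighted child entropy to 0.4 (IG = 0.9 - 0.4 = 0.5), while Feature B reduces it only to 0.6 (IG = 0.9 - 0.6 = 0.3), the tree will choose Feature A for the split because it gives the higher Information Gain.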
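
The formula can be checked on a small made-up node. In this sketch the helper names and the two candidate splits (`split_a`, `split_b`) are hypothetical; they are not taken from any dataset in this assignment.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy(parent) minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["Yes"] * 5 + ["No"] * 5
# Hypothetical split A separates the classes well; split B barely helps.
split_a = [["Yes"] * 4 + ["No"], ["Yes"] + ["No"] * 4]
split_b = [["Yes"] * 3 + ["No"] * 2, ["Yes"] * 2 + ["No"] * 3]

print(f"IG(A) = {information_gain(parent, split_a):.3f}")  # 0.278
print(f"IG(B) = {information_gain(parent, split_b):.3f}")  # 0.029
```

Split A would be preferred here because its children are much closer to pure, so the weighted entropy drops further.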



#### Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?
##### Ans.
Common Real-World Applications:

1. Medical Diagnosis: Helps predict whether a patient has a particular disease using symptoms and test results.
2. Credit Scoring: Classifies loan applicants as “low risk” or “high risk” based on their financial background.
3. Fraud Detection: Identifies unusual or suspicious transactions in financial systems.
4. Customer Churn Prediction: Predicts which customers are likely to stop using a service.
5. Product Recommendation: Suggests products to users based on their past behavior and preferences.
6. Regression Problems: Can be used to estimate continuous values such as house prices or sales forecasts.

Main Advantages:
* Interpretability: Decision Trees generate easy-to-read, human-understandable rules.
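* Versatility: Can handle both numerical and categorical data without complex preprocessing, although some implementations (including scikit-learn's) require categorical features to be encoded as numbers first.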
* No Feature Scaling Required: Works without normalization or standardization of data.
* Ability to Model Non-Linear Relationships: Can capture complex patterns in the data.
* Feature Importance: Highlights which variables have the most influence on predictions.


Main Limitations:

* Overfitting: Deep or fully grown trees may memorize the training data, reducing performance on new data.
* High Variance: Small changes in the training dataset can result in very different trees.
* Bias Toward High-Cardinality Features: Features with many unique values may dominate the splits.
* Lower Accuracy Compared to Ensemble Methods: Often outperformed by techniques like Random Forests or Gradient Boosting on complex datasets.

In [2]:
# Question 6: Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier using the Gini criterion
# ● Print the model’s accuracy and feature importances

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Split dataset into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Initialize Decision Tree Classifier with Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Get feature importances
importances = clf.feature_importances_

# Print results
print(f"Accuracy of Decision Tree: {accuracy:.4f}\n")
print("Feature Importances:")
for name, importance in zip(feature_names, importances):
    print(f"  {name}: {importance:.4f}")

Accuracy of Decision Tree: 0.9333

Feature Importances:
  sepal length (cm): 0.0000
  sepal width (cm): 0.0286
  petal length (cm): 0.5412
  petal width (cm): 0.4303


In [5]:
# Question 7: Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
# a fully-grown tree.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fully-grown tree (no depth limit)
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
acc_full = accuracy_score(y_test, clf_full.predict(X_test))

# Tree with max_depth=3
clf_md3 = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_md3.fit(X_train, y_train)
acc_md3 = accuracy_score(y_test, clf_md3.predict(X_test))

# Print results
print(f"Accuracy of fully-grown tree: {acc_full:.4f}")
print(f"Accuracy of max_depth=3 tree: {acc_md3:.4f}")

Accuracy of fully-grown tree: 0.9333
Accuracy of max_depth=3 tree: 0.9778


In [7]:
# Question 8: Write a Python program to:
# ● Load the California Housing dataset from sklearn
# ● Train a Decision Tree Regressor
# ● Print the Mean Squared Error (MSE) and feature importances

from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names

# Split dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize and train Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict on the test set
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)

# Print results
print(f"Mean Squared Error (MSE) on test data: {mse:.4f}\n")

# Print feature importances
print("Feature Importances:")
for name, importance in zip(feature_names, regressor.feature_importances_):
    print(f" - {name}: {importance:.4f}")


Mean Squared Error (MSE) on test data: 0.4952

Feature Importances:
 - MedInc: 0.5285
 - HouseAge: 0.0519
 - AveRooms: 0.0530
 - AveBedrms: 0.0287
 - Population: 0.0305
 - AveOccup: 0.1308
 - Latitude: 0.0937
 - Longitude: 0.0829


In [8]:
# Question 9: Write a Python program to:
# ● Load the Iris Dataset
# ● Tune the Decision Tree’s max_depth and min_samples_split using
# GridSearchCV
# ● Print the best parameters and the resulting model accuracy

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the Decision Tree classifier
dt = DecisionTreeClassifier(random_state=42)

# Set up the grid of parameters to search
param_grid = {
    'max_depth': [2, 3, 4, 5, 6, None],
    'min_samples_split': [2, 5, 10]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(dt, param_grid, cv=5, n_jobs=-1)

# Fit GridSearch to the training data
grid_search.fit(X_train, y_train)

# Best parameters found
best_params = grid_search.best_params_

# Evaluate the best estimator on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print the results
print(f"Best parameters: {best_params}")
print(f"Model accuracy on test set: {accuracy:.4f}")

Best parameters: {'max_depth': 4, 'min_samples_split': 2}
Model accuracy on test set: 1.0000


#### Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to:
* Handle the missing values
* Encode the categorical features
* Train a Decision Tree model
* Tune its hyperparameters
* Evaluate its performance

And describe what business value this model could provide in the real-world setting.
##### Ans.
1. Handle Missing Values
* Analyze missingness: Examine which data points are missing and whether they are missing at random or follow a pattern. This informs the strategy for handling them.
* Imputation methods:
   * Numerical features: Fill missing values with mean, median, or use advanced techniques like K-Nearest Neighbors imputation.
   * Categorical features: Replace missing values with the most frequent category (mode) or assign a new category such as “Unknown.”
* Dropping data: If certain features or rows have too many missing values and cannot be reliably imputed, consider dropping them carefully.
* Why it matters: Many machine learning implementations cannot handle missing data directly, so cleaning ensures reliable inputs for the model.
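
A minimal sketch of median and mode imputation using only the standard library. The records and the column names (`age`, `blood_type`) are invented for illustration; with real data the fill values should be computed on the training split only, so that no information leaks from the test set.

```python
from statistics import median, mode

# Hypothetical patient records; None marks a missing value.
records = [
    {"age": 34, "blood_type": "A"},
    {"age": None, "blood_type": "O"},
    {"age": 58, "blood_type": None},
    {"age": 45, "blood_type": "O"},
]


def impute(rows, column, strategy):
    """Fill missing entries of `column` with the observed median or mode."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed) if strategy == "median" else mode(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


cleaned = impute(records, "age", "median")       # numerical: median
cleaned = impute(cleaned, "blood_type", "mode")  # categorical: most frequent
print(cleaned[1]["age"], cleaned[2]["blood_type"])  # 45 O
```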

2. Encode Categorical Features
* Identify categorical variables: Examples include patient gender, blood type, or disease severity.
* Encoding methods:
   * Nominal categories (no inherent order, e.g., blood type): Use One-Hot Encoding.
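   * Ordinal categories (ordered, e.g., disease severity): Use ordinal encoding, mapping each level to a number that preserves the order (e.g., mild = 0, moderate = 1, severe = 2).
* Why it matters: Common implementations such as scikit-learn's Decision Tree require numeric inputs. Encoding transforms categorical features into a format the model can use.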
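
The two encodings can be illustrated without any library. The values below and the names `one_hot` and `SEVERITY_ORDER` are assumptions for this sketch; a real project would more likely use an encoder class from its ML library.

```python
# Hypothetical categorical columns for three patients.
blood_types = ["A", "O", "AB"]
severities = ["mild", "severe", "moderate"]


def one_hot(values):
    """One column per category (sorted for a stable order), 1 where it matches."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]


# Ordinal encoding: the order is stated explicitly, not inferred from the data.
SEVERITY_ORDER = {"mild": 0, "moderate": 1, "severe": 2}

categories, encoded = one_hot(blood_types)
print(categories)  # ['A', 'AB', 'O']
print(encoded)     # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
print([SEVERITY_ORDER[s] for s in severities])  # [0, 2, 1]
```

One-hot encoding avoids implying an order among blood types, while the explicit mapping keeps the order that severity levels genuinely have.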

3. Train a Decision Tree Model

* Data splitting: Use an 80-20 or 70-30 split for training and testing, or employ cross-validation to estimate how well the model generalizes.
* Initialize the model: Start with a default Decision Tree classifier.
* Train the model: Fit it on the processed training data.
* Why Decision Trees: They work with both numerical and encoded categorical features, are interpretable (important in healthcare), and capture non-linear relationships.

4. Tune Hyperparameters
* Important hyperparameters:
   * max_depth - controls tree complexity and balances underfitting vs. overfitting.
   * min_samples_split & min_samples_leaf - determine the minimum samples required to split a node or form a leaf, affecting generalization.
   * max_features - controls the number of features considered for each split.

* Tuning method: Use GridSearchCV or RandomizedSearchCV with cross-validation to find the best combination of parameters.
* Why it matters: Proper tuning improves predictive performance and reduces the risk of overfitting or underfitting.
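
Steps 1 to 4 can be combined in one scikit-learn pipeline, so that imputation and encoding are refit inside each cross-validation fold. This is a sketch on synthetic data: the column names, the toy target, the grid values, and the choice of recall as the scoring metric are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data (all names and values are made up).
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_type": rng.choice(["A", "B", "AB", "O"], n),
})
y = (df["age"] > 50).astype(int)  # toy target driven by age
df.loc[::10, "age"] = np.nan      # inject some missing values

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["blood_type"]),
])
pipe = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Step names prefix the parameter names: "<step>__<param>".
param_grid = {
    "tree__max_depth": [2, 4, None],
    "tree__min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="recall")
search.fit(df, y)

print(search.best_params_)
print(f"Best cross-validated recall: {search.best_score_:.3f}")
```

Keeping the preprocessing inside the pipeline means the search never sees statistics computed from the held-out fold.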

5. Evaluate Performance

* Metrics: Accuracy, Precision, Recall, F1-score, ROC-AUC.
* Healthcare priority: Recall (sensitivity) is often most important to ensure patients with the disease are detected, even if false positives occur.
* Validation: Test on a separate set or use cross-validation for reliable performance.
* Interpretability: Feature importance and tree visualization help explain model decisions to clinicians and stakeholders.
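
To make the metrics concrete, here is a small sketch that computes precision, recall, and F1 from raw labels. The function name and the label lists are made up for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive class from raw labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels: 1 = has the disease, 0 = does not.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision, recall, f1 = classification_metrics(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.60 recall=0.75 f1=0.67
```

In this toy case the model is 70% accurate overall but still misses one of the four sick patients, which is exactly what recall exposes.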

Business Value in a Real-World Setting

* Early detection: Enables timely treatment, improving patient outcomes and reducing costs.
* Resource optimization: Helps prioritize high-risk patients for interventions, making healthcare delivery more efficient.
* Personalized care: Supports tailored monitoring and treatment plans based on individual risk profiles.
* Data-driven insights: Informs product development, policies, and patient outreach strategies.
* Trust and transparency: The interpretability of Decision Trees builds confidence among clinicians, regulators, and patients, which is crucial in healthcare environments.