## Decision Tree Assignment-1

In [1]:
# Q1. Describe the decision tree classifier algorithm and how it works to make predictions.

# Ans:

# Decision Tree Classifier Algorithm
# A Decision Tree Classifier is a supervised learning algorithm used for classification tasks. It works by
# recursively splitting the dataset on feature values to build a tree-like model that predicts the target class.

# How a Decision Tree Works:

# Start with the Entire Dataset:
# The root node contains all the training data.

# Feature Selection & Splitting:
# The algorithm selects the feature (and split point) that most reduces impurity, using a splitting criterion like:
    # Gini Impurity (default in Scikit-Learn)
    # Entropy (Information Gain)
# The data is divided into subsets according to that split, forming child nodes.

# Recursive Partitioning:
# The process repeats on each child node, splitting further on whichever feature gives the best split for that subset.

# Stopping Conditions:
# The tree stops growing when:
# All instances in a node belong to the same class.
# A stopping condition is met (e.g., max depth, min samples per leaf).

# Making Predictions:
# For a new data point, the decision tree traverses from the root to a leaf node based on feature values.
# The leaf node contains the predicted class.
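
# The steps above can be sketched with scikit-learn. This is a minimal illustration, not a tuned model:
# the Iris dataset, max_depth=2, and random_state=0 are assumptions chosen only to keep the printed tree small.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load a small built-in dataset (an assumption for illustration only).
iris = load_iris()
X, y = iris.data, iris.target

# Gini is the default criterion; max_depth=2 keeps the printed tree short.
clf = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
clf.fit(X, y)

# Print the learned if/else rules, from the root down to the leaves.
print(export_text(clf, feature_names=list(iris.feature_names)))

# Predict by walking one sample from the root to a leaf.
sample = X[:1]
predicted_class = clf.predict(sample)[0]
print("Predicted class:", iris.target_names[predicted_class])
```

# The printed rules show each split as a threshold on one feature, which is the root-to-leaf path a new point follows.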

In [2]:
# Q2. Provide a step-by-step explanation of the mathematical intuition behind decision tree classification.

# Ans:

# A Decision Tree Classifier splits data based on features to create a tree-like model. It uses measures like 
# Gini Impurity or Entropy to select the best feature at each step.

# Step 1: Selecting the Best Feature for Splitting:
# The goal is to find the feature that best separates the classes. We use impurity measures such as:
# Gini Impurity
# Entropy (Information Gain)

# Gini Impurity
# Gini measures the probability that a randomly chosen element would be misclassified if it were labeled at random
# according to the class distribution of the node.
# Gini(D) = 1 - ∑(pi^2) , where pi = Proportion of class i in dataset D, i = [1 to c] and c = number of classes

# Example Calculation
# If a dataset has 80% Pass (1) and 20% Fail (0):
# Gini = 1 - (0.8 ^ 2 + 0.2 ^ 2) = 1 - (0.64 + 0.04) = 0.32
# A lower Gini value means better purity; a value of 0 means every sample in the node belongs to one class.
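
# The Pass/Fail calculation above can be checked with a few lines of Python. The function name gini_impurity
# is a hypothetical helper for this sketch, not a library call.

```python
def gini_impurity(class_probabilities):
    """Return 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p ** 2 for p in class_probabilities)

# 80% Pass and 20% Fail, as in the example above.
print(round(gini_impurity([0.8, 0.2]), 2))

# A pure node and a perfectly mixed binary node, for comparison.
print(gini_impurity([1.0, 0.0]))
print(gini_impurity([0.5, 0.5]))
```

# For two classes the value ranges from 0 (pure) to 0.5 (evenly mixed).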

# Entropy (Information Gain)
# Entropy measures the disorder (class uncertainty) in a dataset.
# Entropy(D) = -(p_positive)*log2(p_positive) - (p_negative)*log2(p_negative)
# Information Gain (IG) tells us how much entropy is reduced when splitting on a feature A:
# IG(S, A) = H(S) - ∑ (|Sv|/|S|) * H(Sv)
# where H(S) is the entropy of the node being split, v ranges over the values of feature A,
# and Sv is the subset of S for which A takes the value v.
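
# A small sketch of the entropy and information gain formulas, using only the standard library.
# The helper names and the toy labels are assumptions made for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent_labels, child_label_groups):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(parent_labels)
    weighted_child_entropy = sum(
        (len(group) / total) * entropy(group) for group in child_label_groups
    )
    return entropy(parent_labels) - weighted_child_entropy

parent = [1, 1, 0, 0]                    # evenly mixed node
perfect_split = [[1, 1], [0, 0]]         # each child is pure
useless_split = [[1, 0], [1, 0]]         # each child is as mixed as the parent

print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

# The perfect split removes all of the parent's entropy, while the useless split removes none, so the tree would prefer the first.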

# Step 2: Recursive Splitting
# The tree recursively splits nodes using the best feature.
# The process stops when:
# Nodes are pure (contain only one class).
# A stopping criterion (e.g. max depth) is met.

# Step 3: Making Predictions
# For a new data point, the model:
# Traverses the tree based on feature values.
# Reaches a leaf node with a class label.

# Step 4: Pruning (Avoiding Overfitting)
# Pre-Pruning: Stop growth early by limiting tree depth or requiring a minimum number of samples per leaf.
# Post-Pruning: Grow the full tree, then remove subtrees that have little impact on accuracy.
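
# Both pruning styles can be sketched with scikit-learn. Pre-pruning uses max_depth and min_samples_leaf;
# post-pruning here uses cost-complexity pruning (ccp_alpha), which is one post-pruning technique among several.
# The dataset, the parameter values, and the choice of the middle alpha are assumptions for illustration;
# in practice alpha would usually be chosen by cross-validation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unpruned tree: grows until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruning: constrain growth while the tree is being built.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X_train, y_train)

# Post-pruning: compute the candidate alphas, then refit with one of them.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(name, "leaves:", tree.get_n_leaves(), "test accuracy:", round(tree.score(X_test, y_test), 3))
```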

In [3]:
# Q3. Explain how a decision tree classifier can be used to solve a binary classification problem.

# Ans:

# Using a Decision Tree Classifier for Binary Classification
# A Decision Tree Classifier is an effective algorithm for binary classification, where the target variable 
# has only two classes (e.g. "Yes" vs "No"). It works by recursively splitting the dataset into homogeneous groups
# using a set of decision rules.

# Step 1: Data Preparation:
# Collect labeled training data with features (independent variables) and a binary target variable.

# Step 2: Selecting the Best Feature for Splitting:
# The tree chooses a feature to split on using Gini Impurity or Entropy.

# Step 3: Recursive Splitting:
# The algorithm repeats the process, splitting nodes further until:
# Nodes become pure (all samples belong to one class).
# A stopping criterion is met (e.g. max depth, min samples per leaf).

# Step 4: Making Predictions
# For a new data point:
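# Start at the root node and test the feature used at that node.
# Follow the matching branch to a child node and repeat until a leaf node is reached.
# The class stored at the leaf (one of the two classes) is the prediction.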

# Step 5: Preventing Overfitting
# To avoid overfitting, we can:
# Limit tree depth (e.g. max_depth=3).
# Set a minimum number of samples per leaf.
# Use pruning techniques.
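
# To make the splitting step concrete for a binary problem, here is a minimal sketch of a one-split tree
# (a decision stump) that picks a threshold by weighted Gini impurity. The study-hours data and the helper
# names are hypothetical and exist only for this example.

```python
def gini(labels):
    """Gini impurity for binary labels encoded as 0 and 1."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_threshold(values, labels):
    """Try midpoints between sorted unique values; return (threshold, weighted Gini)."""
    unique_values = sorted(set(values))
    best = (None, float("inf"))
    for low, high in zip(unique_values, unique_values[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

hours_studied = [1, 2, 3, 4, 6, 7, 8, 9]
passed = [0, 0, 0, 0, 1, 1, 1, 1]        # 1 = Pass, 0 = Fail

threshold, impurity = best_threshold(hours_studied, passed)
print("Split at hours <=", threshold, "with weighted Gini", impurity)

def predict(hours):
    """Traverse the one-split tree: the left leaf predicts Fail, the right leaf predicts Pass."""
    return 1 if hours > threshold else 0

print(predict(2.5), predict(7.5))
```

# A full tree would apply the same search to every feature at every node and recurse into each child.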

In [4]:
# Q4. Discuss the geometric intuition behind decision tree classification and how it can be used to make
# predictions.

# Ans:

# Geometric Intuition Behind Decision Tree Classification:
# A Decision Tree Classifier partitions the feature space into axis-aligned rectangular regions, because each
# split tests a single feature against a threshold. This forms a hierarchical decision boundary, where each split
# divides a region into two smaller subregions.

# How Decision Trees Make Predictions Geometrically:
# A new data point is placed into the feature space.
# It follows the splitting rules (decision boundaries) to reach a region.
# The majority class of the training samples in that region is assigned as the prediction.
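
# The geometric view can be sketched as nested threshold tests over two features. The thresholds (5.0, 2.0, 4.0)
# and the class assigned to each region are hypothetical; a trained tree would learn them from data.

```python
def classify(x1, x2):
    """Return (region, predicted_class) for a point in a 2-D feature space."""
    if x1 <= 5.0:                 # vertical boundary at x1 = 5
        if x2 <= 2.0:             # horizontal boundary at x2 = 2, left half only
            return "x1 <= 5 and x2 <= 2", 0
        return "x1 <= 5 and x2 > 2", 1
    if x2 <= 4.0:                 # horizontal boundary at x2 = 4, right half only
        return "x1 > 5 and x2 <= 4", 1
    return "x1 > 5 and x2 > 4", 0

for point in [(3.0, 1.0), (3.0, 3.0), (7.0, 3.0), (7.0, 6.0)]:
    region, label = classify(*point)
    print(point, "->", region, "-> class", label)
```

# Each return statement corresponds to one rectangle in the plane, and every point falls into exactly one of them.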

In [5]:
# Q5. Define the confusion matrix and describe how it can be used to evaluate the performance of a
# classification model.

# Ans:

# A confusion matrix is a table that summarizes the performance of a classification model. It shows the 
# counts of true positive, true negative, false positive, and false negative predictions. It's a powerful tool 
# for understanding not just how well a model is doing, but where it's making mistakes.   

# Summary of confusion matrix:
# The confusion matrix provides a much more detailed picture of model performance than simple accuracy.

# Overall Accuracy:  While not the only important metric, we can calculate accuracy from the confusion matrix:
# Accuracy = (TP + TN) / (TP + TN + FP + FN)

# Precision:  Out of all the instances the model predicted as positive, how many were actually positive?
# Precision = TP / (TP + FP)

# Recall (Sensitivity or True Positive Rate): Out of all the actual positive instances, how many did the model 
# correctly identify?
# Recall = TP / (TP + FN)

# Specificity (True Negative Rate): Out of all the actual negative instances, how many did the model correctly identify?
# Specificity = TN / (TN + FP)

# F1-Score: The harmonic mean of precision and recall. Useful when we want to balance precision and recall, 
# especially in imbalanced datasets.
# F1-Score = 2 * (Precision * Recall) / (Precision + Recall)

# Understanding Errors: The confusion matrix helps us understand the types of errors our model is making.
# Are we getting a lot of false positives or false negatives? This information is crucial for improving our model.

# For example:   

# High FP: The model is too eager to predict positive. We might need to raise the classification threshold
# or add more informative features.
# High FN: The model is missing a lot of actual positives. We might need to lower the classification threshold,
# use a different model, or gather more data.
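
# As a rough sketch, the four confusion matrix counts can be tallied directly from true and predicted labels.
# The helper confusion_counts and the toy labels below are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally TP, FP, TN and FN for a binary problem."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return counts

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

c = confusion_counts(y_true, y_pred)
accuracy = (c["TP"] + c["TN"]) / len(y_true)
specificity = c["TN"] / (c["TN"] + c["FP"])

print(c)
print("Accuracy:", accuracy, "Specificity:", specificity)
```

# Looking at FP and FN separately shows which kind of mistake dominates, which accuracy alone hides.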

In [6]:
# Q6. Provide an example of a confusion matrix and explain how precision, recall, and F1 score can be
# calculated from it.

# Ans:

# Example: Let’s say we have a model predicting whether an email is Spam (1) or Not Spam (0).
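# Suppose the confusion matrix counts are TP = 50, FN = 10, FP = 5 and TN = 100.

# Precision: TP / (TP + FP) = 50 / (50 + 5) = 0.909 = 90.9%
# Recall: TP / (TP + FN) = 50 / (50 + 10) = 0.833 = 83.3%
# F1-Score: 2 * (Precision * Recall) / (Precision + Recall) = 2*TP / (2*TP + FP + FN) = 100 / 115 = 0.870 = 87.0%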
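
# The same arithmetic in Python, using the counts from the spam example (TP = 50, FN = 10, FP = 5, TN = 100).
# The F1 line applies the harmonic-mean formula to the precision and recall computed here.

```python
tp, fn, fp, tn = 50, 10, 5, 100

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-Score:  {f1:.3f}")
```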

In [7]:
# Q7. Discuss the importance of choosing an appropriate evaluation metric for a classification problem and
# explain how this can be done.

# Ans:

# 1. Understand the Classification Problem Type
# Binary Classification: Two possible classes (e.g. spam vs. not spam).
# Multi-class Classification: More than two classes (e.g. digit recognition 0-9).
# Imbalanced Classification: When one class significantly outnumbers others (e.g. fraud detection).

# 2. Consider the Consequences of False Positives & False Negatives
# If False Positives (FP) are costly (e.g. spam filtering, where a legitimate email is lost), then we need to focus on Precision.
# If False Negatives (FN) are costly (e.g. disease screening or fraud detection, where a real case is missed), then we need to focus on Recall.
# If both FP and FN matter equally, we can use F1-score (harmonic mean of Precision & Recall).

# 3. Choose the Right Metric Based on the Objective
# Accuracy: Good for balanced datasets but misleading for imbalanced datasets.
# Precision (Positive Predictive Value): Useful when FP needs to be minimized.
# Recall (Sensitivity, True Positive Rate): Important when FN needs to be minimized.
# F1-Score: A balanced metric when Precision and Recall are both important.
# ROC-AUC (Receiver Operating Characteristic - Area Under Curve): Measures the model's ability to 
# distinguish between classes, useful for probabilistic classifiers.
# PR-AUC (Precision-Recall Area Under Curve): Often more informative than ROC-AUC when the positive class is rare.
# Log Loss: Measures how well the predicted probabilities match actual classes, useful for probabilistic classifiers.
# Cohen's Kappa & Matthews Correlation Coefficient (MCC): More robust for imbalanced datasets.

# 4. Check Business Goals & Interpretability
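# In medical screening, high recall is usually crucial, because a missed case is costly.
# In spam filtering, high precision is usually preferred, because a lost legitimate email is costly.
# In fraud detection, PR-AUC or MCC may be preferable, because fraud cases are rare.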

# Conclusion:
# For balanced datasets, Accuracy or F1-score can work well.
# For imbalanced datasets, Precision, Recall, F1-score, or MCC are better.
# When probabilities matter, we can use ROC-AUC, PR-AUC, or Log Loss.
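
# A quick sketch of why accuracy can mislead on imbalanced data. The 95/5 class split and the
# "always predict negative" model are hypothetical, chosen to make the effect obvious.

```python
# 5 positives (e.g. fraud) and 95 negatives.
y_true = [1] * 5 + [0] * 95

# A trivial model that never predicts the positive class.
y_pred = [0] * 100

correct = sum(actual == predicted for actual, predicted in zip(y_true, y_pred))
accuracy = correct / len(y_true)

tp = sum(a == 1 and p == 1 for a, p in zip(y_true, y_pred))
fn = sum(a == 1 and p == 0 for a, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

print("Accuracy:", accuracy)
print("Recall:  ", recall)
```

# The model looks strong on accuracy yet finds none of the positive cases, which is why recall, F1-score,
# or MCC are worth reporting alongside accuracy when classes are imbalanced.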

In [8]:
# Q8. Provide an example of a classification problem where precision is the most important metric, and
# explain why.

# Ans:

# Confirmatory Cancer Diagnosis Before Treatment (Positive = Cancer, Negative = No Cancer)

# Why Does Precision Matter?
# A False Positive (FP) means diagnosing a healthy person with cancer, leading to unnecessary anxiety,
# costly treatments, and side effects.
# High precision means fewer false positives, so most patients flagged as positive actually have cancer.
# This applies to the confirmatory stage that triggers treatment; an initial screening test would usually favor
# recall instead, so that few real cases are missed.
# Ideal scenario: We want high precision so that only those truly having cancer are diagnosed as positive.

In [9]:
# Q9. Provide an example of a classification problem where recall is the most important metric, and explain why.

# Ans:

# Email Spam Detection (Positive = Spam, Negative = Not Spam)

# Why Does Recall Matter?
# A False Negative (FN) means missing an actual spam email, which could result in phishing attacks or malware exposure.
# High recall ensures most spam emails are correctly detected, even if some harmless emails get flagged as spam (FPs).
# This holds when missed spam is the costlier error, for example when flagged mail goes to a quarantine folder
# that users can still review.
# Ideal scenario: We prioritize high recall to catch as much spam as possible, even if it means filtering some
# legitimate emails.
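
# Precision and recall usually trade off through the classification threshold. The sketch below uses
# hypothetical spam scores and labels to show how a strict threshold favors precision and a lenient one
# favors recall. Returning 0.0 when a denominator is zero is a convention assumed for this example.

```python
def precision_recall(scores, labels, threshold):
    """Compute (precision, recall) when scores >= threshold are predicted positive."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical model scores (probability of spam) and true labels (1 = spam).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

strict = precision_recall(scores, labels, threshold=0.75)
lenient = precision_recall(scores, labels, threshold=0.25)

print("Strict threshold 0.75  -> precision, recall:", strict)
print("Lenient threshold 0.25 -> precision, recall:", lenient)
```

# Choosing the threshold is therefore part of choosing the metric: a recall-first application accepts the
# extra false positives that come with the lenient setting.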