## Q1. Describe the decision tree classifier algorithm and how it works to make predictions.


A decision tree is a graphical representation of the different options for solving a problem that shows how different factors are related. It has a hierarchical tree structure that starts with one main question at the top, called the root node, which branches out into different possible outcomes, where:

* <b>Root Node</b>: The starting point, which represents the entire dataset.
* <b>Branches</b>: The lines that connect nodes. They show the flow from one decision to the next.
* <b>Internal Nodes</b>: Points where decisions are made based on the input features.
* <b>Leaf Nodes</b>: The terminal nodes at the end of branches, which represent final outcomes or predictions.

<br>

![image.png](attachment:image.png)

<br>


<b>How Do Decision Trees Work?</b>

A decision tree starts with a main question at the root node. This question is derived from the features of the dataset and serves as the starting point for decision-making.

From the root node, the tree asks a series of yes/no questions. Each question is designed to split the data into subsets based on a specific attribute. For example, if the first question is “Is it raining?”, the answer determines which branch of the tree to follow: “Yes” leads down one path and “No” leads down another.

This branching continues through a sequence of decisions. As you follow each branch, further questions break the data into smaller groups. This step-by-step process continues until no more helpful questions remain.

At the end of a branch you reach a leaf, which holds the final outcome or decision. It could be a classification (like “spam” or “not spam”) or a numeric prediction (such as an estimated price).
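The traversal described above can be sketched in a few lines of Python. The tree, its questions, and its outcomes below are hypothetical and hand-built for illustration; a real classifier learns them from data.

```python
# Hypothetical hand-built tree: internal nodes ask a yes/no question,
# leaves hold the final prediction.
tree = {
    "question": "is_raining",
    "yes": {"leaf": "take umbrella"},
    "no": {
        "question": "is_cloudy",
        "yes": {"leaf": "take umbrella"},
        "no": {"leaf": "no umbrella"},
    },
}


def predict(node, sample):
    """Follow branches from the root until a leaf is reached."""
    while "leaf" not in node:
        answer = "yes" if sample[node["question"]] else "no"
        node = node[answer]
    return node["leaf"]


print(predict(tree, {"is_raining": False, "is_cloudy": True}))
```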

## Q2. Provide a step-by-step explanation of the mathematical intuition behind decision tree classification.


Decision trees are one of the most intuitive and interpretable machine learning models. They mimic human decision-making processes by breaking down decisions into a series of simple if-else statements. In this article, we’ll explore the mathematical intuition behind decision trees and their implementation, focusing on key concepts like entropy, Gini index, and information gain.

Let’s start with a simple example using if-else statements to understand the concept of a decision tree.

    age = 18

    if age <= 8:
        print("Access Denied")
    elif age > 8 and age < 18:
        print("Child mode is ON")
    else:
        print("Total Access is granted")


This logic can be represented as a binary tree:

<br>

![image.png](attachment:image.png)

<br>

Root Node: The initial condition (age ≤ 8)

Internal Nodes: The subsequent conditions (age > 8 and age < 18).

Leaf Nodes: The final decisions (e.g., “Child mode is ON”).

In machine learning, we perform similar operations on datasets, where we have several features and an output label. The goal is to split the data based on these features to predict an outcome.

<b>How Does a Decision Tree Work?</b>

Given a dataset with multiple features, we iteratively split the data to form a tree structure. Each split is based on a feature, and the objective is to create pure splits where the data in each subset is as homogeneous as possible.

<b>Leaf Node:</b>

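A leaf node is a terminal node where no further splitting takes place. Ideally the data in a leaf is entirely homogeneous (either all “yes” or all “no”), which is called a pure split. In practice a leaf may still contain a mix of classes if splitting stops early, in which case it predicts the majority class.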

<b>Evaluating the Purity of a Split: Entropy and Gini Index:</b>

To mathematically evaluate the purity of a split, we use metrics like entropy and Gini impurity.

<b>Entropy</b>

Entropy measures the disorder or uncertainty in the data. For binary classification, the entropy H(S) is given by:

![image-2.png](attachment:image-2.png)

* For binary classification, entropy values range between 0 and 1.
* A pure split results in an entropy of 0.
* A maximally impure split (a 50/50 class mix) results in an entropy of 1.

When both class probabilities are 0.5, the entropy is maximized at 1, indicating maximum impurity. The entropy curve is concave (an inverted U), peaking at this point of maximum uncertainty and falling to 0 at probabilities of 0 and 1.
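As a quick check of these values, here is a minimal sketch of binary entropy, assuming the standard definition H = -p log2(p) - (1 - p) log2(1 - p), where p is the proportion of the positive class. The function name is chosen for illustration.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a two-class node with positive-class proportion p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node has no uncertainty; avoids log2(0)
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(binary_entropy(p), 3))
```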

<b>Gini Impurity</b>:

Gini impurity is another metric for assessing the purity of a split.

It is calculated as:

![image-3.png](attachment:image-3.png)

* For binary classification, Gini impurity values range from 0 to 0.5.
* A pure split results in a Gini impurity of 0.
* A maximally impure split (a 50/50 class mix) results in a Gini impurity of 0.5.
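The same check for Gini impurity, assuming the standard two-class definition G = 1 - (p² + (1 - p)²). The function name is illustrative.

```python
def binary_gini(p):
    """Gini impurity of a two-class node with positive-class proportion p."""
    return 1.0 - (p ** 2 + (1 - p) ** 2)


for p in (0.0, 0.25, 0.5, 1.0):
    print(p, binary_gini(p))
```

Unlike entropy, this needs no logarithm, which is the source of the efficiency advantage mentioned in the comparison that follows.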

<b>Entropy vs. Gini Impurity</b>

* For larger datasets, Gini impurity is often preferred due to its computational efficiency.
* Both metrics serve the same purpose but might lead to different decision paths.


<b>Selecting the Best Feature: Information Gain</b>

To determine which feature should be used to split the data, we use a concept called information gain. Information gain is the difference between the entropy of the dataset before the split and the weighted sum of the entropies of the subsets after the split, where each subset is weighted by its share of the samples.

![image-4.png](attachment:image-4.png)

A higher information gain indicates a better split.
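A minimal sketch of this calculation for a two-way split, assuming entropy as the impurity measure. The labels and helper names are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely recovers all the entropy.
perfect = information_gain(parent, ["yes"] * 5, ["no"] * 5)

# A split whose children have the same class mix as the parent gains nothing.
useless = information_gain(
    parent, ["yes", "yes", "no", "no"], ["yes"] * 3 + ["no"] * 3
)
print(perfect, useless)
```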

<b>Handling Overfitting: Post-Pruning and Pre-Pruning:</b>

Decision trees are prone to overfitting, especially with noisy or complex datasets. Overfitting occurs when the model captures noise instead of the underlying pattern, leading to poor generalization on new data. To combat this, we use pruning techniques.

<b>Post-Pruning</b>

* Build the entire tree first, then remove branches that add little value.
* This is typically done with a pruning method such as cost-complexity pruning, whose strength is tuned as a hyperparameter after the tree is built.

Application: Best for smaller datasets.

<b>Pre-Pruning</b>

* Apply constraints during the tree-building process to limit its growth.
* Techniques like setting a maximum depth or minimum samples per leaf node are used.

Application: Ideal for larger datasets.
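Both approaches can be sketched with scikit-learn. This assumes a synthetic dataset from make_classification and arbitrary parameter values (max_depth=3, min_samples_leaf=5, ccp_alpha=0.01) chosen only for illustration; in practice they would be tuned, for example by cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only so the example is self-contained.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Unconstrained tree: grows until every leaf is pure.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruning: constraints applied while the tree is being built.
pre = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X, y)

# Post-pruning: cost-complexity pruning removes weak branches
# after the full tree has been grown.
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "depth:", model.get_depth(), "leaves:", model.get_n_leaves())
```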

## Q3. Explain how a decision tree classifier can be used to solve a binary classification problem.


Decision trees are a common type of machine learning model used for binary classification tasks. The natural structure of a binary tree lends itself well to predicting a “yes” or “no” target. The tree is traversed sequentially, evaluating the truth of each logical statement until the final prediction is reached. Examples of classification tasks that can use decision trees include predicting whether a student will pass or fail an exam, whether an email is spam or not, and whether a transaction is fraudulent or legitimate.

Let's assume the data in the dataset contains features commonly used in determining admission to masters’ degree programs, such as GRE, GPA, and letters of recommendation.

As a first step, we will create a binary class (1 = admission likely, 0 = admission unlikely) from the chance of admit: a chance greater than 80% is considered likely. The remaining data columns will be used as predictors.

<b>Fitting and Predicting</b>

We will use scikit-learn's tree module to create, train, predict with, and visualize a decision tree classifier. The syntax is the same as for other models in scikit-learn: once an instance of the model class is created with dt = DecisionTreeClassifier(), .fit() can be used to fit the model on the training set. After fitting, .predict() (and .predict_proba()) and .score() can be called to generate predictions and score the model on the test data.

As with other scikit-learn models, only numeric data can be used (categorical variables and nulls must be handled prior to model fitting). In this case, our categorical features have already been transformed and no missing values are present in the data set.

Two methods are available to visualize the tree within the tree module. The first uses plot_tree to graphically represent the decision tree. The second uses export_text to list the rules behind the splits in the decision tree. Other packages, such as graphviz, offer more visualization options but may require additional installations and will not be covered here.
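The workflow can be sketched as follows. The admissions dataset itself is not included here, so this sketch substitutes synthetic data from make_classification, and the feature names gre, cgpa, and lor are placeholders attached to that synthetic data, not the real columns.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the admissions data (assumption for illustration).
X, y = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0, random_state=42
)
feature_names = ["gre", "cgpa", "lor"]  # placeholder names

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Instantiate and fit the classifier on the training set.
dt = DecisionTreeClassifier(max_depth=2, random_state=42)
dt.fit(x_train, y_train)

# Predict classes and class probabilities, then score on the test set.
y_pred = dt.predict(x_test)
y_proba = dt.predict_proba(x_test)
accuracy = dt.score(x_test, y_test)
print("test accuracy:", accuracy)

# Text view of the learned rules; plot_tree(dt) would draw the same tree
# graphically but requires matplotlib.
rules = export_text(dt, feature_names=feature_names)
print(rules)
```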

<br>

![image.png](attachment:image.png)

<br>

<b>Split Criteria</b>

For a classification task, the default split criterion is Gini impurity, which gives us a measure of how “impure” the groups are. At the root node, the first split is chosen as the one that maximizes the information gain, i.e. decreases the weighted Gini impurity the most. Our tree has already been built for us, but how was the split cgpa<=8.845 determined? cgpa is a continuous variable, which adds an extra complication, as the split can occur at ANY value of cgpa.

To verify, we will use the defined functions gini and info_gain. By running gini(y_train), we get the same Gini impurity value as printed in the tree at the root node, 0.443.

Next, we are going to verify how the split on cgpa was determined. We will use info_gain over ALL values of cgpa to determine the information gain when split on each value. This is stored in a table and sorted, and voila, the top value for the split is cgpa<=8.845! This is also done for every other feature (and for those continuous ones, every value), to find the top split overall.

<br>

![image-2.png](attachment:image-2.png)

<br>
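The helper functions gini and info_gain are not shown in this answer, so the versions below are a hypothetical reconstruction of what they might look like, applied to a tiny made-up cgpa column. Candidate thresholds are taken as midpoints between consecutive sorted values, which is an assumption about how such a search is usually done.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def info_gain(values, labels, threshold):
    """Reduction in Gini impurity when splitting on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty gains nothing
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


def best_threshold(values, labels):
    """Try every midpoint between consecutive distinct values; keep the best."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(candidates, key=lambda t: info_gain(values, labels, t))


# Made-up data: admission becomes likely above a cgpa of roughly 8.85.
cgpa = [7.9, 8.2, 8.5, 8.8, 8.9, 9.1, 9.4]
admit = [0, 0, 0, 0, 1, 1, 1]

best = best_threshold(cgpa, admit)
print("best threshold:", best, "gain:", info_gain(cgpa, admit, best))
```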

This process is repeated at every node until splitting yields no further information gain, at which point the tree is complete. To make a prediction, a sample traverses the tree, following the appropriate split at each node until it reaches a leaf node, and is then assigned the majority class of that leaf (or the weighted majority).

## Q4. Discuss the geometric intuition behind decision tree classification and how it can be used to make predictions.


A decision tree can be visualized as a hierarchical structure of binary splits, where each node represents a decision point based on a specific feature from the input data. The tree starts at the root node, which represents the entire dataset, and makes a binary split at each internal node by comparing one feature against a threshold. Geometrically, each such split is a decision boundary perpendicular to that feature's axis, so the tree as a whole partitions the feature space into axis-aligned rectangular regions, one per leaf. A new sample is predicted by finding the region it falls into and assigning the class of that leaf. This geometric representation makes it easy to interpret how the model reaches its decisions, which is a large part of why decision trees are effective and interpretable models for classification tasks.
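To make the geometry concrete, here is a sketch of a depth-two tree over two features. The thresholds (5.0 and 2.0) and class names are hypothetical; the point is that each leaf corresponds to a rectangular region of the (x1, x2) plane.

```python
def predict(x1, x2):
    """Hand-written depth-two tree over two features (hypothetical thresholds)."""
    if x1 <= 5.0:
        return "A"  # region: everything left of the vertical line x1 = 5
    if x2 <= 2.0:
        return "B"  # region: x1 > 5 and below the horizontal line x2 = 2
    return "A"      # region: x1 > 5 and x2 > 2


# Each point is classified by the rectangle it falls into.
for point in [(3.0, 10.0), (7.0, 1.0), (7.0, 5.0)]:
    print(point, "->", predict(*point))
```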

## Q5. Define the confusion matrix and describe how it can be used to evaluate the performance of a classification model.


A confusion matrix is a simple table that shows how well a classification model is performing by comparing its predictions to the actual results. It breaks down the predictions into four categories: correct predictions for both classes (true positives and true negatives) and incorrect predictions (false positives and false negatives). This helps you understand where the model is making mistakes, so you can improve it.

Each cell of the matrix counts the test instances that fall into one combination of actual class and predicted class.

* True Positive (TP): The model correctly predicted a positive outcome (the actual outcome was positive).
* True Negative (TN): The model correctly predicted a negative outcome (the actual outcome was negative).
* False Positive (FP): The model incorrectly predicted a positive outcome (the actual outcome was negative). Also known as a Type I error.
* False Negative (FN): The model incorrectly predicted a negative outcome (the actual outcome was positive). Also known as a Type II error.
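The four counts can be tallied directly from the actual and predicted labels. The label lists below are made up for illustration, with 1 treated as the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative (Type I)
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive (Type II)
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(y_true, y_pred))
```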

<br>


A confusion matrix helps you see how well a model is working by showing correct and incorrect predictions. It also helps calculate key measures like accuracy, precision, and recall, which give a better idea of performance, especially when the data is imbalanced.

1. <b>Accuracy</b>:
Accuracy measures how often the model’s predictions are correct overall. It gives a general idea of how well the model is performing. However, accuracy can be misleading, especially with imbalanced datasets where one class dominates. For example, a model that predicts the majority class correctly most of the time might have high accuracy but still fail to capture important details about other classes.

![image.png](attachment:image.png)

2. <b>Precision</b>:
Precision focuses on the quality of the model’s positive predictions. It tells us how many of the instances predicted as positive are actually positive. Precision is important in situations where false positives need to be minimized, such as detecting spam emails or fraud.

![image-2.png](attachment:image-2.png)

3. <b>Recall</b>:
Recall measures how well the model identifies all actual positive cases. It shows the proportion of true positives detected out of all the actual positive instances. High recall is essential when missing positive cases has significant consequences, such as in medical diagnoses.

![image-3.png](attachment:image-3.png)

4. <b>F1-Score</b>:
F1-score combines precision and recall into a single metric to balance their trade-off. It provides a better sense of a model’s overall performance, particularly for imbalanced datasets. The F1 score is helpful when both false positives and false negatives are important, though it assumes precision and recall are equally significant, which might not always align with the use case.

![image-4.png](attachment:image-4.png)

5. <b>Specificity</b>:
Specificity is another important metric in the evaluation of classification models, particularly in binary classification. It measures the ability of a model to correctly identify negative instances. Specificity is also known as the True Negative Rate. Formula is given by:

![image-5.png](attachment:image-5.png)

6. <b>Type 1 and Type 2 error</b>:

* <b>Type 1 Error</b>:

   A Type 1 Error occurs when the model incorrectly predicts a positive instance, but the actual instance is negative. This is also known as a false positive. Type 1 Errors affect the precision of a model, which measures the accuracy of positive predictions.

   ![image-6.png](attachment:image-6.png)

* <b>Type 2 Error</b>:

   A Type 2 Error occurs when the model fails to predict a positive instance, even though it is actually positive. This is also known as a false negative. Type 2 Errors impact the recall of a model, which measures how well the model identifies all actual positive cases.

   ![image-7.png](attachment:image-7.png)


<br>
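The metrics above can all be derived from the four confusion-matrix counts. This sketch assumes the standard definitions; the counts passed in at the end are made up for illustration.

```python
def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a metric is undefined."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, tn, fp, fn):
    """Compute common metrics from confusion-matrix counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "specificity": safe_div(tn, tn + fp),
    }


metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```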

Source: https://www.geeksforgeeks.org/confusion-matrix-machine-learning/

## Q6. Provide an example of a confusion matrix and explain how precision, recall, and F1 score can be calculated from it.


Accuracy can be deceptive when dealing with imbalanced data. Here, we will look at the confusion matrix and its associated terms, which look confusing but are simple. The confusion matrix, precision, recall, and F1 score give better intuition about prediction results than accuracy alone. To keep the concepts clear, we limit the discussion to binary classification.

It is a matrix of size 2×2 for binary classification with actual values on one axis and predicted on another.

![image.png](attachment:image.png)

Let’s understand the confusing terms in the confusion matrix: true positive, true negative, false negative, and false positive with an example.

EXAMPLE

A machine learning model is trained to predict tumor in patients. The test dataset consists of 100 people.

![image-2.png](attachment:image-2.png)


* True Positive (TP): the model correctly predicts the positive class (prediction and actual are both positive). In the above example, 10 people who have tumors are predicted positive by the model.
* True Negative (TN): the model correctly predicts the negative class (prediction and actual are both negative). In the above example, 60 people who don’t have tumors are predicted negative by the model.
* False Positive (FP): the model wrongly predicts the positive class (predicted positive, actual negative). In the above example, 22 people are predicted as having a tumor although they don’t have one. FP is also called a Type I error.
* False Negative (FN): the model wrongly predicts the negative class (predicted negative, actual positive). In the above example, 8 people who have tumors are predicted negative. FN is also called a Type II error.

With the help of these four values, we can calculate the True Positive Rate (TPR), False Positive Rate (FPR), True Negative Rate (TNR), and False Negative Rate (FNR).

![image-3.png](attachment:image-3.png)

Even if the data is imbalanced, these rates let us judge whether the model is working well. The values of TPR and TNR should be high, and FPR and FNR should be as low as possible.
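Applying the four rates to the tumor example (TP = 10, TN = 60, FP = 22, FN = 8), and assuming the standard definitions, gives:

```python
# Counts from the tumor example above.
tp, tn, fp, fn = 10, 60, 22, 8

tpr = tp / (tp + fn)  # share of actual positives that were caught
fnr = fn / (fn + tp)  # share of actual positives that were missed
tnr = tn / (tn + fp)  # share of actual negatives correctly cleared
fpr = fp / (fp + tn)  # share of actual negatives wrongly flagged

print(f"TPR={tpr:.3f} FNR={fnr:.3f} TNR={tnr:.3f} FPR={fpr:.3f}")
```

Here the model misses almost half of the actual tumor cases (FNR of about 0.44), which the overall accuracy of 0.70 does not reveal.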

With the help of TP, TN, FN, and FP, other performance metrics can be calculated.

<b>Precision, Recall</b>:

Both precision and recall are crucial for information retrieval, where the positive class matters more than the negative class. Why?

While searching something on the web, the model does not care about something irrelevant and not retrieved (this is the true negative case). Therefore only TP, FP, FN are used in Precision and Recall.

<b>Precision</b>:

Of all the instances predicted positive, what fraction is truly positive?

![image-4.png](attachment:image-4.png)

The precision value lies between 0 and 1.

<b>Recall</b>:

Of all the actual positives, what fraction is predicted positive? It is the same as the TPR (true positive rate).

![image-5.png](attachment:image-5.png)

How are precision and recall useful? Let’s see through examples.

EXAMPLE 1- Credit card fraud detection

![image-6.png](attachment:image-6.png)

We do not want to miss any fraudulent transactions. Therefore, we want false negatives to be as low as possible. In these situations, we can accept lower precision, but recall should be high. Similarly, in medical applications, we don’t want to miss any patient, so we focus on having a high recall.

So far, we have discussed when recall is more important than precision. But when is precision more important than recall?

EXAMPLE 2 — Spam detection

![image-7.png](attachment:image-7.png)

In spam detection, it is acceptable if some spam mail remains undetected (false negative), but what if we miss a critical email because it is classified as spam (false positive)? In this situation, false positives should be as low as possible. Here, precision is more vital than recall.

When comparing different models, it will be difficult to decide which is better (high precision and low recall or vice-versa). Therefore, there should be a metric that combines both of these. One such metric is the F1 score.

<b>F1 Score</b>:

It is the harmonic mean of precision and recall. It takes both false positive and false negatives into account. Therefore, it performs well on an imbalanced dataset.

![image-8.png](attachment:image-8.png)

F1 score gives the same weightage to recall and precision.
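As a worked check, the tumor example above (TP = 10, FP = 22, FN = 8) gives the following values, assuming the standard definitions of precision, recall, and F1:

$$
\text{Precision} = \frac{TP}{TP + FP} = \frac{10}{10 + 22} = 0.3125,
\qquad
\text{Recall} = \frac{TP}{TP + FN} = \frac{10}{10 + 8} \approx 0.556
$$

$$
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
    = \frac{20}{20 + 22 + 8} = 0.40
$$

The accuracy for the same example is (10 + 60) / 100 = 0.70, so the F1 score of 0.40 gives a much less flattering picture of how the model handles the positive class.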

There is also a weighted variant, the F-beta score, in which recall and precision can be given different weights. As discussed in the previous section, different problems weight recall and precision differently.

![image-9.png](attachment:image-9.png)

Beta represents how many times recall is more important than precision. If the recall is twice as important as precision, the value of Beta is 2.
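A short sketch of the weighted score, assuming the standard F-beta definition (1 + β²) · P · R / (β² · P + R) and reusing the precision and recall from the tumor example (TP = 10, FP = 22, FN = 8):

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 favours recall, beta < 1 favours precision."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


precision = 10 / (10 + 22)
recall = 10 / (10 + 8)

f1 = f_beta(precision, recall, beta=1.0)  # equal weight
f2 = f_beta(precision, recall, beta=2.0)  # recall counts twice as much
print(round(f1, 3), round(f2, 3))
```

Because recall is higher than precision in this example, weighting recall more heavily (beta = 2) raises the score.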


<b>Conclusion</b>:

The confusion matrix, precision, recall, and F1 score provide better insight into predictions than accuracy alone. Applications of precision, recall, and F1 score include information retrieval, word segmentation, named entity recognition, and many more.

## Q7. Discuss the importance of choosing an appropriate evaluation metric for a classification problem and explain how this can be done.


Understanding how well a machine learning model will perform on unseen data is the main purpose of evaluation metrics. Accuracy is a reasonable way to evaluate classification models on balanced datasets, but on imbalanced data it can be misleading, so metrics such as precision, recall, F1, or ROC/AUC give a more reliable picture. The right choice depends on which type of error (false positive or false negative) is more costly for the problem at hand.

A ROC curve is not a single number but a whole curve, which gives nuanced detail about the classifier's behavior across decision thresholds. That also makes it hard to compare many ROC curves quickly, which is why the curve is often summarized by the area under it (AUC).
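One way to see what AUC summarizes: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python on made-up scores. It is a pairwise illustration, not an efficient implementation.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))
```

An AUC of 1.0 means every positive outranks every negative, while 0.5 is no better than random ordering.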

## Q8. Provide an example of a classification problem where precision is the most important metric, and explain why.


Say that, as the product manager of a spam detection feature, you decide that the cost of a false positive error is high. You can interpret the error cost as a negative user experience due to misprediction. You want to ensure that the user never misses an important email because it is incorrectly labeled as spam. As a result, you want to minimize false positive errors.

In this case, precision is a good metric to evaluate and optimize for. A higher precision score indicates that the model makes fewer false positive predictions. It is more likely to be correct whenever it predicts a positive outcome.

Like every optimization, it comes at a cost. In this case, some spam emails might be left undetected and will make their way into the users’ inboxes. However, you might decide the cost of this type of error (false negative) is more tolerable. Users can still flag these emails as spam manually. However, whenever you automatically put emails in the spam folder and hide them from the user, they better be genuinely spam. 
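This trade-off is often controlled through the decision threshold applied to the model's predicted spam probability. The scores and labels below are made up (1 = spam); the sketch shows that raising the threshold removes the false positive at the cost of letting more spam through.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are flagged as spam."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

default = precision_recall_at(y_true, scores, threshold=0.5)
strict = precision_recall_at(y_true, scores, threshold=0.8)
print("threshold 0.5:", default)
print("threshold 0.8:", strict)
```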

## Q9. Provide an example of a classification problem where recall is the most important metric, and explain why.

For example, if an ML model points to possible medical conditions, detects dangerous objects in security screening, or raises alarms about potentially expensive fraud, missing a positive case might be very costly. In this scenario, you might prefer to be overly cautious and manually review more of the instances the model flags as suspicious.

In other words, you would treat false negative errors as more costly than false positives. If that is the case, you can optimize for recall and consider it the primary metric.
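One simple way to optimize for recall is to pick the decision threshold from a recall target rather than using a fixed cut-off. The sketch below uses made-up scores and labels (1 = positive case) and hypothetical helper names; it returns the highest threshold that still catches the required share of positives.

```python
def recall_at(y_true, scores, threshold):
    """Recall when scores >= threshold are flagged as positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    return tp / (tp + fn)  # assumes at least one actual positive


def highest_threshold_for_recall(y_true, scores, target):
    """Scan thresholds from strict to lenient; stop once the target is met."""
    for threshold in sorted(set(scores), reverse=True):
        if recall_at(y_true, scores, threshold) >= target:
            return threshold
    return min(scores)


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

threshold = highest_threshold_for_recall(y_true, scores, target=1.0)
flagged = sum(1 for s in scores if s >= threshold)
print("threshold:", threshold, "cases sent for review:", flagged)
```

Catching every positive here means reviewing five cases, one of which is a false alarm, which is the extra manual effort described above.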