# Theoretical Questions

**Question 1:**

What is a Decision Tree, and how does it work in the context of
classification?

**Answer 1:**

A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks.

In classification, it splits a dataset into subsets based on feature values, using a tree-like model of decisions. Each internal node represents a test on a feature, each branch corresponds to an outcome of the test, and each leaf node holds a class label.

The tree recursively splits the data so that samples with the same class label are grouped together, making predictions by traversing the path from root to leaf according to feature values of the instance being classified.

A Decision Tree works by partitioning the data through a series of questions or rules:

1. Start at the Root Node: The process begins with the entire dataset at the top of the tree, known as the "root node".

2. Splitting the Data: The algorithm evaluates different features (attributes) to find the best way to split the data into two or more homogeneous subsets. The goal is to maximize the purity of the resulting subsets, meaning that each subset contains data points belonging primarily to a single class. Common metrics used to determine the "best" split include Information Gain, Gini Impurity, and the Chi-squared statistic.

3. Creating Decision Nodes: The feature that provides the optimal split becomes a "decision node" (or internal node). Branches extending from this node represent the possible outcomes of the test or different values of that feature.

4. Repeating the Process (Recursion): Steps 2 and 3 are repeated recursively for each new subset. This process continues until a stopping criterion is met, such as:

 - All data points in a subset belong to the same class.
 - No more features are available for splitting.
 - The maximum allowed depth of the tree is reached.

5. Reaching Leaf Nodes: When a stopping criterion is met, the final nodes are called "leaf nodes" (or terminal nodes). Each leaf node holds the final class label or decision outcome for the path that leads to it.

6. Making Predictions: To classify a new data point, the algorithm starts at the root and follows the branches that match the point's feature values until it reaches a leaf node. The class label of that leaf node is the predicted class for the data point.

Effectively, a decision tree asks a series of "if-then-else" questions to navigate from the general population to a specific classification, making it a highly intuitive and easily interpretable model.
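The root-to-leaf traversal described above can be sketched in a few lines of plain Python. This is only an illustration: the tree below is built by hand, and its feature names and thresholds are made up rather than learned from data.

```python
# A hand-built toy tree. Feature names and thresholds are hypothetical,
# chosen only to illustrate traversal; nothing here is learned from data.
toy_tree = {
    "feature": "petal_length",
    "threshold": 2.5,
    "left": {"label": "setosa"},  # branch taken when value <= threshold
    "right": {                    # branch taken when value > threshold
        "feature": "petal_width",
        "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, applying the test at each internal node."""
    while "label" not in node:
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]


print(predict(toy_tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(toy_tree, {"petal_length": 5.1, "petal_width": 2.0}))  # virginica
```

Each internal node is one "if-then-else" question, and the leaf that is reached supplies the prediction.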



**Question 2:**

Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

**Answer 2:**

Gini Impurity and Entropy are impurity measures used in decision trees to determine the best split at each node. Both metrics quantify the "mixedness" or "disorder" of data within a node, with a score of 0 indicating a "pure" node (all data points belong to one class) and higher scores indicating greater impurity. Decision trees use these measures to find splits that maximize the reduction in impurity; when entropy is the measure, this reduction is called information gain.

Gini Impurity measures the probability of an element being incorrectly classified if it were randomly assigned a label based on the class distribution of the node.

A Gini Impurity of 0 means a perfectly pure node, while a value of 0.5 (for a two-class problem) indicates maximum impurity, where the classes are split equally.

The algorithm selects the split that results in the lowest weighted Gini Impurity across the child nodes. Lower impurity means the split is better at separating classes.

Entropy quantifies the amount of uncertainty or disorder in the dataset. It measures how much a node's class distribution deviates from a pure distribution.

A node with an Entropy of 0 is pure, and the highest value (for example, 1 for a two-class problem with equal class probabilities) represents maximum impurity.

Application in trees: The algorithm chooses the split that provides the greatest reduction in entropy from the parent node to the child nodes. This reduction is known as "information gain".
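As a rough sketch, both measures can be computed from the class labels in a node. The helper names `gini` and `entropy` are illustrative, and entropy is taken in bits (base-2 logarithm), which is what gives the maximum of 1 for a balanced two-class node.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_k ** 2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)) over the class proportions."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n) for count in Counter(labels).values())


pure = ["a", "a", "a", "a"]   # a single class: both measures are zero
mixed = ["a", "a", "b", "b"]  # balanced two-class node: both measures are at their maximum

print(gini(pure), entropy(pure))
print(gini(mixed), entropy(mixed))  # 0.5 1.0
```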

Impact on Decision Tree Splits

The primary goal of both measures is to guide the decision tree to make the most effective splits at each step.

The algorithm evaluates all possible splits for all features and chooses the one that results in the greatest reduction in impurity (highest information gain).

This process is repeated recursively at each child node until a stopping criterion is met, such as the nodes becoming pure or reaching a predefined maximum depth.

Both measures are effective, but Gini Impurity is generally faster to compute because it avoids logarithmic calculations.
For most datasets, the results from both are very similar, making Gini Impurity a popular choice, especially for large datasets where computational speed is a priority.
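To make the split search concrete, here is a minimal sketch for a single numeric feature. It assumes candidate thresholds are the midpoints between consecutive distinct values and scores each one by the size-weighted Gini Impurity of the two children. The function name `best_threshold` and the toy data are illustrative, and real implementations are considerably more optimized.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best binary split on one feature.

    If no candidate improves on the unsplit node, the result is
    (None, gini(labels)).
    """
    distinct = sorted(set(values))
    best = (None, gini(labels))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_threshold(values, labels))  # (6.5, 0.0)
```

A full tree builder would run this search over every feature at a node, keep the best (feature, threshold) pair, and recurse on the two resulting subsets.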



**Question 3:**

What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

**Answer 3:**

Pre-pruning halts the tree-growing process early by setting limits during training (such as a maximum depth or a minimum number of samples per split). Post-pruning first grows a complete tree and then removes nodes or branches that do not provide an empirical improvement, often judged by validation set performance.

A practical advantage of pre-pruning is computational efficiency: it avoids the overhead of building a full tree, which helps with large datasets. A key advantage of post-pruning is that it usually makes better pruning decisions and guards against overfitting more accurately, because it evaluates the fully constructed tree before trimming unnecessary branches.

Pre-pruning

• Definition: Also called early stopping, it involves setting constraints before or during the tree's growth to prevent it from becoming too complex.
• How it works: The algorithm is stopped from creating new branches if certain conditions are met, such as a maximum depth or a minimum number of samples required in a leaf node.
• Practical advantage: Computational efficiency. It is faster because it does not need to build the full, potentially large, tree before starting the pruning process.

Post-pruning

• Definition: Also known as backward pruning, it involves building a complete, unpruned tree first and then trimming it afterward.
• How it works: Branches are removed or converted into a leaf node if they do not contribute significantly to the model's accuracy, often by checking performance on a held-out validation set or with cross-validation.
• Practical advantage: Potentially more accurate pruning. It can make better decisions about removing branches because it considers the entire structure of the fully grown tree, avoiding the risk of cutting off potentially useful parts too early.
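The two approaches can be compared with a short scikit-learn sketch. Here pre-pruning uses `max_depth` and `min_samples_leaf`, while post-pruning uses scikit-learn's minimal cost-complexity pruning (`ccp_alpha`), which is one specific post-pruning technique rather than the validation-set procedure described above. The `ccp_alpha` value is picked arbitrarily from the pruning path for illustration; in practice it would normally be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pre-pruning: growth stops as soon as one of the limits is reached.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune it back with cost-complexity pruning.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
# Arbitrary illustrative choice: take the middle alpha of the pruning path.
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name}: nodes={model.tree_.node_count}, test accuracy={model.score(X_test, y_test):.3f}")
```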



**Question 4:**

What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

**Answer 4:**

Information Gain measures the change in entropy after a dataset is split on an attribute. It quantifies how much uncertainty is reduced by a split, calculated as the difference between the entropy of the parent node and the weighted sum of entropies of child nodes. Information Gain is crucial because the split with the highest value leads to the purest (most informative) child nodes and is thus selected for the next split, driving the construction of an effective, accurate tree.

In other words, information gain is the expected reduction in entropy (disorder) after a dataset is split on a particular attribute. Decision trees pick the split that maximizes it, which yields the purest achievable child nodes. Two pieces go into the calculation:

 - Entropy: Entropy is a measure of a dataset's impurity or randomness. A dataset with high entropy is mixed with different classes, while a dataset with low entropy is more pure and contains fewer mixed classes.

 - Calculation: Information gain is the entropy of the parent node minus the weighted average entropy of the child nodes. A higher information gain means the split is more effective.

Why it's important for choosing the best split:

 - Feature selection: Decision trees use information gain to select the best feature to split on at each node.

 - Maximizes purity: The algorithm chooses the feature with the highest information gain, as this feature provides the most information about the target variable and creates the purest possible subsets of data.

 - Guides the tree's structure: By consistently choosing the split with the highest information gain at each step, the tree is built in a way that efficiently separates the data into more homogeneous groups, leading to more accurate predictions.
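The calculation can be sketched directly from the definition: parent entropy minus the size-weighted average entropy of the children. The function names and the toy labels below are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n) for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely removes all uncertainty.
perfect_split = [["yes"] * 5, ["no"] * 5]

# A split whose children keep the parent's 50/50 mix removes none.
uninformative_split = [["yes", "yes", "no", "no"], ["yes", "yes", "yes", "no", "no", "no"]]

print(information_gain(parent, perfect_split))        # 1.0
print(information_gain(parent, uninformative_split))  # 0.0
```

Between these two candidate splits, the tree would choose the first, since it has the higher information gain.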



**Question 5:**

What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

**Answer 5:**

Common applications of decision trees include medical diagnosis, fraud detection, loan approval, and customer churn prediction. The main advantages are their ease of interpretation and visualization, ability to handle both numerical and categorical data, and minimal data preprocessing requirements. However, a key limitation is their tendency to overfit the training data, especially with complex trees.

Common applications:

 - Healthcare: Diagnosing diseases by analyzing patient symptoms, medical history, and test results to guide treatment plans.

 - Fraud detection: Identifying fraudulent transactions based on historical patterns.

 - Loan approval: Assessing credit risk by evaluating factors like credit score, income, and loan history.

 - Customer churn prediction: Identifying customers likely to leave based on behavior and purchase history.

 - Customer segmentation: Grouping customers for targeted marketing campaigns.

 - Education: Predicting student performance based on attendance and past grades.

 - Retail: Predicting sales trends for inventory management.

 - Telecommunications: Predicting customer churn.

Advantages:

 - Easy to interpret: The logic is straightforward and can be visualized, making it easy for non-experts to understand.

 - Little data preparation: They require less data cleaning and can handle both numerical and categorical data without much preprocessing.

 - Handles missing values: Some implementations (for example, those that use surrogate splits) can work with incomplete data without needing imputation.

 - Versatile: Can be used for both classification and regression tasks.

Limitations:

 - Overfitting: Decision trees can become very complex, leading them to fit the training data too closely and perform poorly on new data. This is a significant problem, especially with small datasets.

 - Instability: Small changes in the data can lead to a completely different tree being generated.

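 - Bias: Impurity-based split criteria such as information gain can favor features with many distinct levels, even when those features are not truly more predictive.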

 - Difficult for complex relationships: For tasks with complex relationships, decision trees may not be the most accurate model.



# Practical Questions

In [None]:
# 6

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Loading data
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Decision Tree with Gini
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Accuracy
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

# Feature importances
print("Feature importances:", clf.feature_importances_)


Accuracy: 1.0
Feature importances: [0.         0.01667014 0.90614339 0.07718647]


In [None]:
# 7

# Fully-grown tree
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
acc_full = accuracy_score(y_test, clf_full.predict(X_test))

# max_depth=3
clf_pruned = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_pruned.fit(X_train, y_train)
acc_pruned = accuracy_score(y_test, clf_pruned.predict(X_test))

print("Accuracy full tree:", acc_full)
print("Accuracy max_depth=3:", acc_pruned)


Accuracy full tree: 1.0
Accuracy max_depth=3: 1.0


In [None]:
# 8

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

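# Load the Boston housing data from the CMU StatLib archive.
# In this file each record spans two physical rows, so the rows are re-joined below.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Even rows hold the first 11 features; the first 2 columns of odd rows hold the rest.
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
# The third column of the odd rows is the target (median home value).
target = raw_df.values[1::2, 2]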

X, y = data, target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("Feature importances:", reg.feature_importances_)

Mean Squared Error: 10.416078431372549
Feature importances: [5.12956739e-02 3.35270585e-03 5.81619171e-03 2.27940651e-06
 2.71483790e-02 6.00326256e-01 1.36170630e-02 7.06881622e-02
 1.94062297e-03 1.24638653e-02 1.10116089e-02 9.00872742e-03
 1.93328464e-01]


In [None]:
# 9

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


# Loading the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

# Defining the parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
    'min_samples_split': [2, 5, 10]
}

# Instantiating a Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Setting up GridSearchCV
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

print("Parameter grid defined and GridSearchCV set up successfully.")

grid_search.fit(X_train, y_train)
print("Grid search completed.")

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# Get the best estimator from GridSearchCV
best_clf = grid_search.best_estimator_

# Making predictions on the test set using the best estimator
y_pred_test = best_clf.predict(X_test)

# Calculating the accuracy on the test set
test_accuracy = accuracy_score(y_test, y_pred_test)

print(f"Test set accuracy with best parameters: {test_accuracy:.4f}")

X_train shape: (120, 4)
X_test shape: (30, 4)
y_train shape: (120,)
y_test shape: (30,)
Parameter grid defined and GridSearchCV set up successfully.
Grid search completed.
Best parameters: {'max_depth': 4, 'min_samples_split': 2}
Best cross-validation score: 0.9417
Test set accuracy with best parameters: 1.0000


**Question 10:**

Imagine you're working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world setting.

**Answer 10:**

 - Data Loading and Initial Exploration: Loading the dataset and understanding its structure, identifying data types, and getting an initial overview of missing values and categorical features.

 - Handling Missing Values: Choosing a strategy based on the nature and extent of the missingness, such as imputation (mean or median for numerical features, mode for categorical ones) or removal of the affected rows or columns.

 - Encoding Categorical Features: Converting categorical features into a numerical format suitable for machine learning models. This could involve one-hot encoding, label encoding, or other relevant techniques, considering the cardinality of features.

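 - Splitting Data into Training and Testing Sets: Dividing the dataset into training and testing sets to prepare for model development and unbiased evaluation. Imputers and encoders should be fitted on the training set only and then applied to the test set, so that no information from the test data leaks into training.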

 - Training a Decision Tree Model: Instantiating a Decision Tree Classifier and training it on the training data, initially with default parameters.

 - Tuning Hyperparameters with GridSearchCV: Defining a parameter grid for key Decision Tree hyperparameters (e.g., max_depth, min_samples_split, criterion) and using GridSearchCV with cross-validation to find the best-performing combination.

 - Evaluating Model Performance: Evaluating the tuned Decision Tree model on the unseen test set using appropriate classification metrics such as accuracy, precision, recall, F1-score, and ROC AUC, given the healthcare context.

 - Describing Business Value: Such a model helps automate disease risk prediction, which enables earlier diagnosis and intervention, helps prioritize patient care, reduces manual workload, and improves decision-making and resource allocation across the healthcare company.
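The steps above could be wired together roughly as follows. This is a sketch under several assumptions: the data is synthetic, the column names (`age`, `blood_pressure`, `smoker`) and the labeling rule are invented for illustration, and recall is used as the tuning metric on the assumption that missing a sick patient is the costlier error. Putting the imputers and the encoder inside a pipeline means they are re-fitted on the training portion of each cross-validation fold.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: columns and the labeling rule are hypothetical.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": pd.Series(rng.choice(["yes", "no"], n), dtype=object),
})
y = ((df["age"] > 50) & (df["smoker"] == "yes")).astype(int)

# Inject missing values so the imputers have something to do.
df.loc[rng.choice(n, 30, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 30, replace=False), "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker"]

# Median imputation for numeric columns; mode imputation plus one-hot for categoricals.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Tune the tree inside the pipeline; "tree__" routes parameters to that step.
param_grid = {"tree__max_depth": [2, 4, 6], "tree__min_samples_split": [2, 10]}
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(classification_report(y_test, search.predict(X_test)))
```

On real patient data the grid, the scoring metric, and the imputation strategies would all need to be revisited with domain experts.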
