Explain the following terms related to decision trees

1) Root node

2) Leaf nodes

3) Entropy

4) Gini Index

5) Information gain

6) Parameters: max_depth, min_samples_leaf

7) Pruning

8) Advantages and disadvantages of decision tree

9) Compare decision tree and Random Forest

## How does the Decision Tree algorithm Work?

- A decision tree is a flowchart-like structure used for decision making and prediction. 

- It is a supervised learning algorithm used for both classification and regression problems. 

The basic steps for creating a decision tree are as follows:

1. Select the best attribute for splitting the records, using an Attribute Selection Measure (ASM).

2. Make that attribute a decision node and break the dataset into smaller subsets.

3. Build the tree by repeating this process recursively for each child node, selecting the best attribute for each one.

4. The recursion stops at a node when a stopping condition holds: all samples in the node belong to the same class, no attributes remain to split on, or a limit such as the maximum depth is reached. That node becomes a leaf node, which has no children.

5. Assign a prediction to each leaf node: the majority class for classification, or the mean target value for regression.

6. Once the model is built, it predicts the target variable for new data by traversing the tree from the root to a leaf node.

7. The tree can be pruned to avoid overfitting by removing branches that do not provide much information gain.

8. The final tree structure can be visualized for easy interpretation.
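
The steps above can be sketched with scikit-learn. This is a minimal illustration, not a reference implementation: the Iris dataset, the 70/30 split, and `max_depth=3` are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumed example data: the Iris dataset bundled with scikit-learn.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

# Building: fit() selects splits recursively until a stopping rule is met.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Prediction: each new sample is routed from the root down to a leaf.
predictions = clf.predict(X_test)

# Visualization: a plain-text rendering of the fitted tree.
tree_text = export_text(clf, feature_names=list(iris.feature_names))
print(tree_text)
```

`export_text` prints one line per node, with indentation showing the depth, so the path from the root to any leaf can be read directly.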

#### __1. Root Node__

- The root node is the topmost node in a decision tree. It has no parent and is the ancestor of every other node.

- It represents the entire dataset and is the starting point for the tree. 

- The root node is responsible for splitting the dataset into smaller subsets, which are then represented by the child nodes.

#### __2. Leaf Nodes__

- A leaf node, also known as a terminal node, is the end point of a decision tree. 

- It represents a final decision or prediction and has no child nodes. 

- Once a leaf node is reached, the decision tree can stop traversing further down the tree and make a prediction based on the value assigned to the leaf node. 
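
To make the root and leaf nodes concrete, the sketch below inspects a fitted scikit-learn tree through its `tree_` attribute. The dataset and `max_depth=2` are assumptions for illustration; the node layout shown (root at index 0, `-1` marking a missing child) is specific to scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

tree = clf.tree_
# scikit-learn stores the root at index 0 and marks a missing child with -1,
# so a node whose left child is -1 is a leaf.
is_leaf = tree.children_left == -1

root_samples = int(tree.n_node_samples[0])  # the root sees the whole training set
n_leaves = int(is_leaf.sum())
leaf_ids = [i for i in range(tree.node_count) if is_leaf[i]]

print("samples at root:", root_samples)
print("number of leaves:", n_leaves)
print("leaf node ids:", leaf_ids)
```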

## Attribute Selection Measures:

Entropy, Gini index, and information gain are all measures used to evaluate the quality of a split in a decision tree. 

These measures are used to determine which attribute should be selected as the root node, or internal node, of the tree.

#### __1. Entropy:__

- The entropy is a measure of the impurity of the sample.

- The entropy is calculated by summing, over all classes in the sample, the product of each class's probability and the log of that probability, and then negating the sum.

- The entropy is 0 when all the samples in a node belong to the same class. It reaches its maximum when the samples are equally divided among the classes: 1 for two classes, and log2(k) for k classes.

- The lower the entropy, the more pure the sample.

- The entropy is calculated using the formula:

- Entropy = - ∑ (p(i) * log2(p(i))) where p(i) is the proportion of the samples of class i in the sample set.
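
The entropy formula can be written as a short function. This is a sketch in plain Python; the function name `entropy` and the toy label lists are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    counts = Counter(labels)
    # -sum(p * log2(p)) rewritten as sum(p * log2(1 / p)), with p = c / n.
    return sum((c / n) * math.log2(n / c) for c in counts.values())

pure = ["yes"] * 6                  # a single class
balanced = ["yes"] * 3 + ["no"] * 3  # two classes, equally divided
three_way = ["a", "b", "c"] * 2      # three classes, equally divided

print(entropy(pure), entropy(balanced), entropy(three_way))
```

Classes that do not occur in the sample never enter the sum, which avoids evaluating log2(0).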

#### __2. Gini index:__

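- The Gini index is a measure of the probability that a randomly chosen sample would be classified incorrectly if it were labeled at random according to the class distribution in the node.

- The Gini index is calculated by subtracting from 1 the sum of the squared probabilities of all classes in the sample.

- The Gini index is 0 when all the samples in a node belong to the same class. It reaches its maximum when the samples are equally divided among the classes: 0.5 for two classes, and 1 - 1/k for k classes, so it approaches but never reaches 1.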

- The lower the Gini index, the more pure the sample.

- The Gini index is calculated using the formula:

- Gini Index = 1 - ∑ (p(i)^2) where p(i) is the proportion of the samples of class i in the sample set.
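
The Gini formula translates directly into code. This is a sketch in plain Python; the function name `gini` and the toy label lists are assumptions for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini index of a sequence of class labels: 1 - sum(p_i ** 2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

pure = ["yes"] * 6                  # a single class
balanced = ["yes"] * 3 + ["no"] * 3  # two classes, equally divided
three_way = ["a", "b", "c"] * 2      # three classes, equally divided

print(gini(pure), gini(balanced), gini(three_way))
```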

#### __3. Information gain:__

- Information gain is the decrease in impurity after a dataset is split on an attribute. Strictly, the term refers to the decrease in entropy; the same calculation with the Gini index is sometimes called Gini gain.

- The attribute with the highest information gain is chosen for the split at the current node, starting with the root node.

- The Information gain is calculated by subtracting the weighted average of the entropy or Gini index of the child nodes from the entropy or Gini index of the parent node.

- The Information gain is calculated using the formula:

- Information Gain = Entropy(parent) - ∑ (w(i) * Entropy(i)) or Information Gain = Gini(parent) - ∑ (w(i) * Gini(i)) where w(i) is the fraction of the parent's samples that fall into the ith child node and Entropy(i) or Gini(i) is the entropy or Gini index of the ith child node.
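
A worked example may help here. The sketch below computes entropy-based information gain for two hypothetical splits of the same parent node; the function names and the toy labels are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy(parent) minus the weighted average entropy of the children.

    Each child is weighted by w(i) = len(child) / len(parent).
    """
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely.
perfect_split = [["yes"] * 5, ["no"] * 5]
# A split whose children keep the parent's class mix.
useless_split = [["yes"] * 2 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]

gain_perfect = information_gain(parent, perfect_split)
gain_useless = information_gain(parent, useless_split)
print(gain_perfect, gain_useless)
```

The first split removes all impurity, so its gain equals the parent's entropy; the second leaves the class proportions unchanged in both children, so it gains nothing. An attribute selection step would prefer the first.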


#### In summary, 

- Entropy, Gini index, and Information gain are all measures used to evaluate the quality of a split in a decision tree. 

- They are used to determine which attribute should be selected as the root node, or internal node, of the tree. 

- Each measure has its own formula. Entropy and the Gini index score the impurity of a node (lower means purer), while information gain scores a split: the attribute whose split yields the highest gain is selected for the node.


## __Parameters__

The decision tree algorithm has several parameters that can be adjusted to optimize the performance of the model. The examples below use scikit-learn's `DecisionTreeClassifier`. These parameters include:

#### __1. Criterion:__
- The criterion parameter is used to specify the impurity measure used to evaluate the quality of a split.

- Its most common values are "gini" and "entropy". Recent scikit-learn versions also accept "log_loss", which is equivalent to "entropy".

- Gini is the default criterion and it uses the Gini index as the impurity measure.

- Entropy uses the entropy as the impurity measure.

    `from sklearn.tree import DecisionTreeClassifier`
    
    `clf = DecisionTreeClassifier(criterion='entropy')`

#### __2. Splitter:__

- The splitter parameter is used to specify the strategy used to choose the split at each node.

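- It can have two values, "best" and "random".

- "best" is the default: at each node it evaluates the candidate thresholds of the considered features and picks the split that scores best under the criterion.

- "random" draws a random threshold for each considered feature and picks the best of these random splits, which is faster and adds randomness to the tree.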

    `from sklearn.tree import DecisionTreeClassifier`
    
    `clf = DecisionTreeClassifier(splitter='random')`

#### __3. Max_depth:__

- The max_depth parameter is used to specify the maximum depth of the tree.

- The depth of a tree is the length of the longest path from the root to a leaf, so a tree consisting of only a root node has depth 0.

- The default value is None, which means nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

    `from sklearn.tree import DecisionTreeClassifier`
    
    `clf = DecisionTreeClassifier(max_depth=12)`

#### __4. Min_samples_split:__

- The min_samples_split parameter is used to specify the minimum number of samples required to split an internal node.

- The default value is 2, which means an internal node must have at least 2 samples to be split.

    `from sklearn.tree import DecisionTreeClassifier`
    
    `clf = DecisionTreeClassifier(min_samples_split=5)`

#### __5. Min_samples_leaf:__

- The min_samples_leaf parameter is used to specify the minimum number of samples required to be at a leaf node.

- The default value is 1, which means a leaf node must have at least 1 sample.

    `from sklearn.tree import DecisionTreeClassifier`
    
    `clf = DecisionTreeClassifier(min_samples_leaf=3)`

#### __6. Max_features:__

- The max_features parameter is used to specify the maximum number of features considered for splitting at each node.

- It can take an integer, float, or a string value.

- The default value is None, which means that max_features=n_features, where n_features is the number of features in the input dataset. The older "auto" setting has been removed from recent scikit-learn versions.

    `from sklearn.tree import DecisionTreeClassifier`
    
    `clf = DecisionTreeClassifier(max_features='sqrt')`


### In summary, 
- __The decision tree algorithm has several parameters that can be adjusted to optimize the performance of the model. These parameters include Criterion, Splitter, Max_depth, Min_samples_split, Min_samples_leaf, and Max_features. Each parameter has its own role, and its value can be tuned to achieve better performance on the dataset.__
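
One common way to choose values for these parameters is a cross-validated grid search. The sketch below is an illustration under assumptions: the Iris dataset, the candidate values in `param_grid`, and 5-fold cross-validation are arbitrary choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical candidate values for four of the parameters described above.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 5, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 3, 5],
}

# Every combination is scored with 5-fold cross-validation.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best mean CV accuracy:", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a tree refit on the full data with the winning combination.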

## __Pruning: Getting an Optimal Decision tree__

__Pruning is a technique used to reduce the complexity of a decision tree by removing branches that do not contribute much to the accuracy of the model. This helps in preventing overfitting, which occurs when the decision tree becomes too complex and starts memorizing the training data.__

There are two main types of pruning methods:

#### __1. Pre-pruning:__

- This method is used to set a stopping condition for the tree-growing process. 
- This can be done by setting a threshold for the maximum tree depth, the minimum number of samples required to split a node, or the minimum number of samples required to be at a leaf node. 
- The tree-growing process stops when one of these conditions is met.

    `from sklearn.tree import DecisionTreeClassifier`

    `clf = DecisionTreeClassifier(max_depth=5, min_samples_split=20, min_samples_leaf=5)`
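
The effect of these stopping conditions can be checked by comparing a fully grown tree with a pre-pruned one. This is a sketch; the breast cancer dataset bundled with scikit-learn is an assumption made only to have some data to fit.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# No stopping conditions: the tree grows until its leaves are pure.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# The same pre-pruning settings as in the snippet above.
pre_pruned = DecisionTreeClassifier(
    max_depth=5, min_samples_split=20, min_samples_leaf=5, random_state=0
).fit(X, y)

# Smallest number of training samples found in any leaf of the pre-pruned tree.
leaf_mask = pre_pruned.tree_.children_left == -1
smallest_leaf = int(pre_pruned.tree_.n_node_samples[leaf_mask].min())

print("full tree:       depth", full.get_depth(), "leaves", full.get_n_leaves())
print("pre-pruned tree: depth", pre_pruned.get_depth(), "leaves", pre_pruned.get_n_leaves())
print("smallest leaf in pre-pruned tree:", smallest_leaf)
```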

#### __2. Post-pruning:__
- This method is used to remove branches from the tree after it has been fully grown. 
- One popular method for post-pruning is called reduced error pruning. 
- This method works by removing a branch from the tree and evaluating the impact on the accuracy of the model on a held-out validation set. 
- If the accuracy does not decrease, the branch is removed permanently.

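    `from sklearn.tree import DecisionTreeClassifier`

    `from sklearn.metrics import accuracy_score`

    `clf = DecisionTreeClassifier(random_state=0)`

    `clf.fit(X_train, y_train)`

    `## Evaluate the accuracy of the model before pruning`

    `accuracy = accuracy_score(y_test, clf.predict(X_test))`

    `print('Accuracy before pruning: ', accuracy)`

    `## Post-prune by refitting with cost-complexity pruning (ccp_alpha is an illustrative value)`

    `pruned_clf = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01)`

    `pruned_clf.fit(X_train, y_train)`

    `## Evaluate the accuracy of the model after pruning`

    `accuracy = accuracy_score(y_test, pruned_clf.predict(X_test))`

    `print('Accuracy after pruning: ', accuracy)`

In this example, we first train a decision tree on the training data and evaluate its accuracy on the test data (X_train, y_train, X_test, and y_test are assumed to exist already). scikit-learn's DecisionTreeClassifier has no prune() method and does not implement reduced error pruning; the post-pruning it provides is minimal cost-complexity pruning, controlled by the ccp_alpha parameter. We therefore fit a second tree with ccp_alpha=0.01, an illustrative value, which removes the subtrees whose impurity reduction does not justify their added complexity. Finally, we evaluate the accuracy again to see whether pruning improved the performance of the model.

Note that this is a simplified example: in real-world scenarios, ccp_alpha should be chosen on a separate validation set (or by cross-validation) rather than on the test set.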
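
As a sketch of post-pruning with a validation set, the example below uses scikit-learn's minimal cost-complexity pruning, which is a different technique from reduced error pruning. The dataset, the 70/30 split, and the tie-breaking rule are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# The fully grown tree, kept for comparison.
unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Candidate alphas: the values at which the full tree would lose a subtree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    candidate = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    candidate.fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    # Alphas are sorted ascending, so ">=" prefers the smaller tree on ties.
    if score >= best_score:
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha)
pruned.fit(X_train, y_train)

print("chosen ccp_alpha:", best_alpha)
print("leaves before/after:", unpruned.get_n_leaves(), pruned.get_n_leaves())
print("validation accuracy before/after:",
      unpruned.score(X_val, y_val), pruned.score(X_val, y_val))
```

Because the validation set was used to pick `ccp_alpha`, a final accuracy estimate would need a third, untouched test set.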

#### __Advantages of Decision Trees:__

1. __Easy to understand and interpret:__ Decision trees are easy to understand and interpret, even for people with little or no knowledge of data science. The tree structure of the model makes it easy to visualize the decisions and the corresponding outcomes.

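2. __Handles both categorical and numerical data:__ Decision trees can, in principle, handle both categorical and numerical data, which makes them a versatile tool for data analysis. Note that scikit-learn's implementation expects numeric input, so categorical features must be encoded first.

3. __Able to handle large datasets:__ Decision trees scale to large datasets, and prediction is fast because each sample follows only one path from the root to a leaf. They are less affected by irrelevant dimensions than distance-based methods, but they are not immune to the curse of dimensionality: with many features and few samples, a tree can still overfit.

4. __Able to handle missing data:__ Some decision tree implementations handle missing values directly, for example through surrogate splits; others require missing values to be imputed before training.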

5. __Able to handle irrelevant features:__ Decision trees are robust to irrelevant features, which means that they can still make accurate predictions even if some of the features in the dataset are irrelevant to the outcome.

#### __Disadvantages of Decision Trees:__

1. __Overfitting:__ Decision trees are prone to overfitting, especially when the tree becomes too complex. Overfitting occurs when the tree memorizes the training data and is not able to generalize well to new data.

2. __Instability:__ Decision trees can be sensitive to small changes in the data, which can result in a completely different tree being generated. This makes the model less stable and less reliable.

3. __Bias:__ Decision trees can be biased towards features with more levels or towards the classes that have more samples.

4. __Piecewise-constant predictions:__ Decision trees split continuous variables with thresholds, so no binning is required, but the resulting predictions are step functions. Trees therefore approximate smooth relationships coarsely and cannot extrapolate beyond the range of the training data.

5. __Computational cost:__ The cost of training a decision tree grows with the number of samples and features, which can make frequent retraining on large datasets expensive. Prediction with a trained tree, by contrast, is fast.

In general, decision trees are a powerful tool for data analysis and prediction, but it is important to be aware of their limitations and to use appropriate techniques to avoid overfitting, such as pruning or ensemble methods.
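
As an illustration of the ensemble route, the sketch below cross-validates a single decision tree and a Random Forest on the same data. The dataset, the number of trees, and 5-fold cross-validation are assumptions; which model scores higher depends on the data, so no outcome is asserted here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A single, fully grown decision tree.
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

# A Random Forest: many trees trained on bootstrap samples with random
# feature subsets, whose votes are combined.
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5
)

print("decision tree: mean %.3f, std %.3f" % (tree_scores.mean(), tree_scores.std()))
print("random forest: mean %.3f, std %.3f" % (forest_scores.mean(), forest_scores.std()))
```

The standard deviation across folds gives a rough view of the instability discussed above: a lower spread suggests the model is less sensitive to which samples it was trained on.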