### ***Decision tree*** ### 
A decision tree is a supervised learning algorithm used for classification and regression tasks.

It works by recursively splitting the data into subsets based on feature values, creating a tree-like structure where each internal node represents a decision based on a feature, each branch represents an outcome of that decision, and each leaf node represents a class label (for classification) or a continuous value (for regression).

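**Key Terminology**

Root Node → the topmost node; it holds all the training samples and makes the first split (on the feature and threshold that score best under the split criterion).

Splitting → dividing a node's data into subsets based on a feature value.

Leaf/Terminal Node → a node that is not split further; it gives the final prediction (class/value).

Decision Node → an internal node that is split further into child nodes.

Depth → the number of levels of splits, i.e. the length of the longest path from the root to a leaf.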
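To make these terms concrete, here is a minimal sketch that fits a small tree and reads the terminology off the fitted object. The choice of the Iris dataset and of `max_depth=2` is an assumption made only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative data only: the Iris dataset that ships with scikit-learn
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

depth = tree.get_depth()              # longest root-to-leaf path
n_leaves = tree.get_n_leaves()        # leaf/terminal nodes
n_nodes = tree.tree_.node_count       # all nodes, including the root
n_decision_nodes = n_nodes - n_leaves # nodes that split further (root included)
root_feature = tree.tree_.feature[0]  # index of the feature used at the root split

print(f"depth={depth}, leaves={n_leaves}, decision nodes={n_decision_nodes}")
print(f"root node splits on feature index {root_feature}")
```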

**How Does It Decide Splits?**

The tree needs a rule to decide which feature to split on:

***Classification Trees (CART):***

Use Gini Impurity or Entropy (Information Gain).

***Regression Trees:***

Use Variance Reduction or Mean Squared Error (MSE).

***Heterogeneity*** is when a node contains samples from multiple classes (an impure node).

***Homogeneity*** is when a node contains samples from a single class (a pure node).

***Entropy*** measures the uncertainty or impurity in a dataset. In decision trees, it helps determine the best feature to split on by calculating the information gain from each potential split. The feature that results in the highest information gain (i.e., the greatest reduction in entropy) is chosen for the split.

Entropy is calculated as:
```math
\text{Entropy}(S) = -\sum_{x} p(x) \log_2 p(x)
```
where p(x) is the proportion of instances belonging to class x in dataset S.
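As a quick check of the formula, the following sketch computes entropy from a list of class labels. The function name `entropy` and the toy label lists are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    # p(x) = count of class x / total samples; sum -p(x) * log2(p(x)) over classes
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

pure_node = entropy(["yes", "yes", "yes", "yes"])      # single class
balanced_node = entropy(["yes", "yes", "no", "no"])    # two classes, 50/50
skewed_node = entropy(["yes", "yes", "yes", "no"])     # two classes, 75/25

print(pure_node, balanced_node, round(skewed_node, 3))
```

A pure node has zero entropy, a 50/50 binary node has the binary maximum, and the skewed node falls in between.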

***Information Gain (IG)*** is the reduction in entropy after a dataset is split on a feature. It quantifies how much uncertainty in the target variable is reduced by knowing the value of the feature.
```math
IG = \text{Entropy}(\text{Parent}) - \sum_{k} \frac{n_k}{n} \, \text{Entropy}(\text{Child}_k)
```
Where:
- Entropy(Parent) is the entropy of the node before the split.
- Entropy(Child_k) is the entropy of the k-th child node after the split.
- n_k / n is the fraction of the parent's samples that falls into child k, so the subtracted term is the weighted average entropy of the children.

***Weighted child entropy*** (for a binary split)
```math
\text{Weighted child entropy} = \frac{n_{\text{left}}}{n} \, \text{Entropy}(\text{left}) + \frac{n_{\text{right}}}{n} \, \text{Entropy}(\text{right})
```
where n is the total number of samples in the parent node,
n_left is the number of samples in the left child,
n_right is the number of samples in the right child.
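The information gain of a binary split can be sketched directly from the weighted child entropy. The helper names `entropy` and `information_gain` and the toy labels below are illustrative assumptions, not part of any library.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy(parent) minus the sample-weighted entropy of the two children."""
    n = len(parent)
    weighted_child_entropy = (
        (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    )
    return entropy(parent) - weighted_child_entropy

parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely
perfect_gain = information_gain(parent, ["yes"] * 5, ["no"] * 5)

# A split whose children have the same 50/50 class mix as the parent
useless_gain = information_gain(
    parent, ["yes", "yes", "no", "no"], ["yes"] * 3 + ["no"] * 3
)

print(perfect_gain, useless_gain)
```

The perfect split removes all uncertainty, while the second split leaves the class mix unchanged and so gains nothing.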

***Gini Impurity*** serves the same purpose as entropy (it measures how mixed the classes in a node are) but is cheaper to compute because it needs no logarithm.
```math
\text{Gini} = 1 - \sum_{i} p_i^2
```
where p_i is the proportion of samples belonging to class i in the node.

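Gini Impurity ranges from 0 (pure node) to 0.5 (maximum impurity for binary classification); for k classes the maximum is 1 - 1/k.

Entropy ranges from 0 (pure node) to 1 for binary classification; for k classes the maximum is log2(k), reached when all classes are equally likely.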
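The Gini formula and its range can be checked with a few lines of Python. The function name `gini` and the toy labels are assumptions made for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

pure_gini = gini(["a", "a", "a", "a"])   # one class only
binary_gini = gini(["a", "a", "b", "b"]) # two classes, 50/50
three_class_gini = gini(["a", "b", "c"]) # three equally likely classes

print(pure_gini, binary_gini, round(three_class_gini, 3))
```

With two balanced classes the impurity reaches the binary maximum of 0.5; with three balanced classes it is higher, which shows that the upper bound depends on the number of classes.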

***How Splits Work in Regression Trees***

Unlike classification trees (Gini or Entropy), regression trees try to predict a continuous value.
So, the split criterion is based on variance reduction or Mean Squared Error (MSE).

Variance Reduction (or Reduction in MSE)

When considering a split:

1. Compute MSE before split (parent node).

2. Compute weighted MSE after split (child nodes):
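```math
\text{MSE}_{\text{split}} = \frac{n_{\text{left}} \cdot \text{MSE}_{\text{left}} + n_{\text{right}} \cdot \text{MSE}_{\text{right}}}{n_{\text{left}} + n_{\text{right}}}
```

```math
\text{Variance Reduction} = \text{MSE}_{\text{parent}} - \text{MSE}_{\text{split}}
```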
3. The algorithm tries all possible splits on all features.

4. Choose the split that gives maximum variance reduction (i.e., reduces error the most).
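The steps above can be sketched for a single numeric feature as follows. This is a simplified illustration, not scikit-learn's implementation: the names `mse` and `best_split`, the use of midpoints between sorted unique values as candidate thresholds, and the toy data are all assumptions. A full tree would repeat this search over every feature and then recurse into the children.

```python
def mse(values):
    """MSE of predicting the node mean for every sample (equals the variance)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(x, y):
    """Return (threshold, variance_reduction) of the best split on one feature."""
    parent_mse = mse(y)
    best_threshold, best_reduction = None, 0.0
    unique = sorted(set(x))
    # Candidate thresholds: midpoints between consecutive unique feature values
    for lo, hi in zip(unique, unique[1:]):
        threshold = (lo + hi) / 2
        left = [t for f, t in zip(x, y) if f <= threshold]
        right = [t for f, t in zip(x, y) if f > threshold]
        split_mse = (len(left) * mse(left) + len(right) * mse(right)) / len(y)
        reduction = parent_mse - split_mse
        if reduction > best_reduction:
            best_threshold, best_reduction = threshold, reduction
    return best_threshold, best_reduction

# Toy data: the target jumps from 5 to 20 between x = 3 and x = 10
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]

threshold, reduction = best_split(x, y)
print(threshold, reduction)
```

On this toy data the best threshold lies between the two groups, and both children become constant, so the split removes the entire parent MSE.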

```python
# Implementation of Decision Tree models with scikit-learn
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Create Decision Tree Regressor model
# max_depth is a hyperparameter that limits tree depth to control overfitting
# random_state is set for reproducibility: every run of the code gives the same results
regressor = DecisionTreeRegressor(max_depth=5, random_state=42)

# Fit the model to the training data
# (X_train and y_train are assumed to be defined earlier in the notebook)
regressor.fit(X_train, y_train)

# Create Decision Tree Classifier model (fit it the same way, on a categorical target)
classifier = DecisionTreeClassifier(max_depth=5, random_state=42)
```
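For a version that runs on its own, here is a minimal sketch that trains and scores both model types. The synthetic datasets from `make_regression` and `make_classification`, the 75/25 split, and the variable names are assumptions chosen only so the example is self-contained.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# --- Regression on synthetic data (illustrative only) ---
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_reg, y_reg, test_size=0.25, random_state=42
)
regressor = DecisionTreeRegressor(max_depth=5, random_state=42)
regressor.fit(X_train, y_train)
reg_mse = mean_squared_error(y_test, regressor.predict(X_test))

# --- Classification on synthetic data (illustrative only) ---
X_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=42)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_clf, y_clf, test_size=0.25, stratify=y_clf, random_state=42
)
classifier = DecisionTreeClassifier(max_depth=5, random_state=42)
classifier.fit(Xc_train, yc_train)
clf_accuracy = accuracy_score(yc_test, classifier.predict(Xc_test))

print(f"regression test MSE: {reg_mse:.2f}")
print(f"classification test accuracy: {clf_accuracy:.2f}")
```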

***Combined pipeline checklist for Decision Trees (supports both classification & regression)***

1. Load data: pd.read_csv / read_parquet / from SQL; set index; parse dates when needed.  
2. Initial EDA: df.head(), df.shape, df.dtypes, df.isnull().sum(), df.describe(), value_counts() for categorical cols, inspect target distribution (hist / countplot) and pairwise correlations for numeric features.  
3. Determine task: classification if target is categorical / small integer classes; regression if continuous.  
4. Preprocess:  
    - Missing values: impute (median/mean for numeric, mode/constant for categorical) or model-based imputation.  
    - Convert dtypes: to_numeric, to_datetime, categorical dtype for categories.  
    - Outliers: clip, transform (log), or model-robust approaches.  
5. Encoding:  
    - Classification: one-hot for nominal, ordinal encoding for ordered categories.  
    - Regression: same encodings; consider target encoding for high-cardinality features with caution.  
6. Feature engineering / selection: create domain features, interactions, select with feature importance or model-based selection.  
7. Split data: train / val / test (stratify for classification when appropriate).  
8. Choose model:  
    - Classification: sklearn.tree.DecisionTreeClassifier (consider class_weight='balanced' for imbalanced data).  
    - Regression: sklearn.tree.DecisionTreeRegressor.  
    - Common hyperparams: max_depth, min_samples_leaf, min_samples_split, max_features, random_state.  
9. Train on training set.  
10. Evaluate:  
     - Classification metrics: accuracy, precision, recall, F1, ROC AUC, confusion matrix.  
     - Regression metrics: MSE, RMSE, MAE, R², residual analysis.  
11. Cross-validation: use cross_val_score or cross_validate with appropriate scoring (scoring='neg_root_mean_squared_error' or 'roc_auc', etc.).  
12. Hyperparameter tuning: GridSearchCV / RandomizedSearchCV (or Bayesian optimizers) with cv and appropriate scoring.  
13. Regularization/pruning: tune max_depth, min_samples_leaf, ccp_alpha (cost-complexity pruning).  
14. Diagnostics: inspect residuals (regression), calibration & confusion matrix (classification), and feature_importances_.  
15. Visualization: sklearn.tree.plot_tree or export_graphviz; prediction vs actual plots; partial dependence plots.  
16. Persist model: joblib.dump / pickle; encapsulate preprocessing + model into a pipeline (sklearn.pipeline.Pipeline).  
17. Deployment & monitoring: serve model, monitor drift, retrain when performance degrades.  
18. Optional helper to choose model programmatically:
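One possible shape for such a helper is sketched below. The function name `choose_tree_model` and its defaults are hypothetical; the sketch relies on scikit-learn's `type_of_target` to inspect the target. Note that `type_of_target` labels integer-valued targets as classes, so a regression target made of whole numbers (for example, counts) would need an explicit override.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.utils.multiclass import type_of_target

def choose_tree_model(y, max_depth=5, random_state=42):
    """Hypothetical helper: pick a classifier or regressor from the target type."""
    target_type = type_of_target(y)
    if target_type in ("binary", "multiclass"):
        return DecisionTreeClassifier(max_depth=max_depth, random_state=random_state)
    if target_type == "continuous":
        return DecisionTreeRegressor(max_depth=max_depth, random_state=random_state)
    raise ValueError(f"Unsupported target type: {target_type}")

# Toy targets, for illustration only
class_model = choose_tree_model(np.array([0, 1, 1, 0]))
label_model = choose_tree_model(np.array(["cat", "dog", "bird", "cat"]))
value_model = choose_tree_model(np.array([0.5, 1.7, 2.3, 3.1]))

print(type(class_model).__name__, type(label_model).__name__, type(value_model).__name__)
```

The returned estimator is unfitted, so it can be dropped into a `Pipeline` or passed to `GridSearchCV` like any other scikit-learn model.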
