# Decision Tree

- Used for both classification and regression
- It is the core building block of ensemble learning (e.g. random forests, gradient boosting)
- Works like nested if-else statements
- The structure is like a tree

### General
- Example: predict whether to take an umbrella or not -> [Factors] -> is it cloudy or not, what does the weather forecast say?
- Pruning: *removing extra questions that aren't helpful, so the tree doesn't get too complicated*
- Which factor to choose as the root node:
    - Classification:
        - Gini impurity [lower value is better] (*a measure of how often a randomly chosen data point would be misclassified*)
        - Information gain [greater value is better] (*based on ENTROPY, a measure of impurity*)
    - Regression:
        - MSE (mean squared error)
        - MAE (mean absolute error)
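
To make the information gain criterion concrete, here is a minimal sketch in plain Python. The function names `entropy` and `information_gain` and the tiny umbrella dataset are made up for illustration; they are not part of scikit-learn.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical umbrella data: the label is whether an umbrella was taken
took_umbrella = ["yes", "yes", "yes", "no", "no", "no"]
cloudy_branch = ["yes", "yes", "yes"]  # labels of the rows where it was cloudy
clear_branch = ["no", "no", "no"]      # labels of the rows where it was clear

gain = information_gain(took_umbrella, [cloudy_branch, clear_branch])
print(gain)
```

Here the split on "cloudy" separates the two classes completely, so the gain equals the full entropy of the parent node (1 bit for a 50/50 class mix).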

### Mathematical-classification
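- calculate the total Gini impurity of the dataset: $\text{Gini} = 1 - \sum_i p_i^2$, where $p_i$ is the proportion of class $i$
- calculate the Gini impurity of each child node for every candidate split
- calculate the weighted Gini of each split (child impurities weighted by child size)
- choose the best split, i.e. the one with the lowest weighted Gini [Gini = 0 means a perfect/pure split]; a branch stops growing once its node is pure
- repeat all the steps on each resulting subset until every node is pure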
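
The steps above can be sketched in plain Python. This is only an illustration under simple assumptions: the features are boolean, the toy rows are invented, and the helper names (`gini`, `weighted_gini`, `split_score`) are hypothetical rather than taken from any library.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_i ** 2) over the class proportions p_i."""
    n = len(labels)
    if n == 0:
        return 0.0  # an empty node contributes no impurity
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(children):
    """Size-weighted average Gini impurity of the child nodes of a split."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * gini(child) for child in children)


# Hypothetical toy data: (cloudy, forecast_rain, took_umbrella)
rows = [
    (True, True, "yes"),
    (True, False, "yes"),
    (True, True, "yes"),
    (False, True, "no"),
    (False, False, "no"),
    (False, False, "no"),
]
feature_names = ["cloudy", "forecast_rain"]


def split_score(rows, feature_index):
    """Weighted Gini after splitting the rows on one boolean feature."""
    left = [r[-1] for r in rows if r[feature_index]]
    right = [r[-1] for r in rows if not r[feature_index]]
    return weighted_gini([left, right])


scores = {name: split_score(rows, i) for i, name in enumerate(feature_names)}
best = min(scores, key=scores.get)  # lowest weighted Gini wins

print(gini([r[-1] for r in rows]))  # impurity before splitting
print(scores)
print(best)
```

In this toy data, splitting on `cloudy` yields two pure nodes (weighted Gini of 0), so it would be chosen as the root and the tree would stop there.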

### Mathematical-regression
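- calculate the total MSE of the target: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (\bar{x} - x_i)^2$, where $\bar{x}$ is the mean of the values $x_i$
- calculate the MSE for each feature (the computer tries thresholds for each feature and divides the data at them)
- calculate the weighted MSE of each split
- choose the best split (lowest weighted MSE)
- repeat the steps on each resulting subset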
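
A rough sketch of the threshold search for a single feature, in plain Python. It assumes candidate thresholds are the midpoints between consecutive sorted feature values; the data and the helper names `mse` and `best_threshold` are invented for illustration.

```python
def mse(values):
    """Mean squared error of values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_threshold(xs, ys):
    """Try midpoints between sorted unique x values.

    Returns (threshold, weighted MSE); threshold is None if no split helps.
    """
    unique = sorted(set(xs))
    best = (None, mse(ys))  # baseline: the node left unsplit
    for lo, hi in zip(unique, unique[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical data: one feature, with the target jumping between x = 3 and x = 10
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

threshold, score = best_threshold(xs, ys)
print(mse(ys))           # MSE before splitting
print(threshold, score)  # best threshold and its weighted MSE
```

With several features, the same search would be run per feature and the feature/threshold pair with the lowest weighted MSE would be kept.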

## Classification

In [1]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [2]:
iris = load_iris()
df_iris = pd.DataFrame(iris.data, columns = iris.feature_names)
df_iris["target"] = iris.target

df_iris.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [4]:
x = df_iris.drop("target", axis=1)
y = df_iris["target"]

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state = 42)

In [5]:
clf = DecisionTreeClassifier(random_state = 42)
clf.fit(x_train,y_train)

In [7]:
y_pred = clf.predict(x_test)
accuracy = accuracy_score(y_test,y_pred)

In [8]:
accuracy

1.0
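
Since a fitted tree is essentially a set of nested if-else rules, it can be printed as text. The sketch below uses scikit-learn's `export_text`; it refits a separate classifier so the block is self-contained, and `max_depth=2` is an assumption made here only to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# max_depth=2 is chosen here only to keep the printed rules short
clf_small = DecisionTreeClassifier(max_depth=2, random_state=42)
clf_small.fit(x_train, y_train)

# Each indented "|---" line is one question; "class:" lines are leaves
rules = export_text(clf_small, feature_names=list(iris.feature_names))
print(rules)
```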

## Regression

In [13]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd

california = fetch_california_housing()
df_california = pd.DataFrame(data = california.data, columns = california.feature_names)
df_california["target"] = california.target

df_california.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [14]:
x = df_california.drop("target", axis = 1)
y = df_california["target"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 2)


In [15]:
reg = DecisionTreeRegressor(random_state = 1)
reg.fit(x_train, y_train)

In [17]:
y_pred = reg.predict(x_test)

mse = mean_squared_error(y_test, y_pred)
r2= r2_score(y_test, y_pred)

In [18]:
print(f"MSE: {mse} and R2: {r2}")

MSE: 0.568170735711749 and R2: 0.578203082611473


### Hyperparameters-Regressor

- criterion: {“squared_error”, “friedman_mse”, “absolute_error”, “poisson”}, default=“squared_error”
[use “absolute_error” when the data has outliers]

- splitter: {“best”, “random”}, default=“best”

- max_depth: int, default=None
[The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.]

- max_features: int, float or {“sqrt”, “log2”}, default=None
[The number of features considered at each split; acts as a kind of feature selection parameter.]

- ccp_alpha: non-negative float, default=0.0
[Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed.]
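
As a rough illustration of `ccp_alpha`, the sketch below compares a fully grown regressor with a pruned one. The dataset (`load_diabetes`, picked only because it ships with scikit-learn and needs no download) and the choice of the middle alpha from the pruning path are arbitrary assumptions, not a tuning recommendation.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Demo data bundled with scikit-learn; not the California housing data
x_demo, y_demo = load_diabetes(return_X_y=True)
xd_train, xd_test, yd_train, yd_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=2
)

# Fully grown tree: the default ccp_alpha=0.0 means no pruning
full = DecisionTreeRegressor(random_state=1).fit(xd_train, yd_train)

# Candidate alphas for minimal cost-complexity pruning, in increasing order
path = full.cost_complexity_pruning_path(xd_train, yd_train)
alpha = float(max(path.ccp_alphas[len(path.ccp_alphas) // 2], 0.0))

pruned = DecisionTreeRegressor(random_state=1, ccp_alpha=alpha)
pruned.fit(xd_train, yd_train)

print(full.get_n_leaves(), pruned.get_n_leaves())  # leaf counts
print(full.score(xd_test, yd_test), pruned.score(xd_test, yd_test))  # test R2
```

In practice the alpha would usually be chosen by cross-validation over `path.ccp_alphas` rather than by position in the list.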

### Hyperparameters-Classifier

- criterion: {“gini”, “entropy”, “log_loss”}, default=“gini”
- the other hyperparameters are the same as for the regressor

### Pros 
- easy to understand (like if-else)
- no feature scaling required [no standardization]
- can handle non-linear data
- can do feature selection (via feature importances)
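
One way to see the feature selection point is the `feature_importances_` attribute of a fitted tree. A small sketch on the iris data; fitting on the full dataset and the name `clf_full` are choices made here only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf_full = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)

# Importances are normalised to sum to 1; a higher value means the feature
# contributed more impurity reduction across the tree's splits
ranked = sorted(
    zip(iris.feature_names, clf_full.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Features with an importance near zero are candidates for removal, although importances from a single tree can be unstable.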

### Cons
- prone to overfitting, especially when the tree is grown without a depth limit or pruning
- computationally expensive