# Decision Trees
- A decision tree is a non-parametric supervised learning algorithm, which is utilized for both classification and regression tasks. 
- It has a hierarchical (top-down) tree structure, which consists of a root node, branches, internal nodes and leaf nodes.
- There are three types of nodes:
1. root node: represents the entire population or sample
2. decision node: evaluates a condition to split the data into more homogeneous subsets
3. leaf node: terminal node that is not split further
- Decision tree learning employs a divide-and-conquer strategy, conducting a greedy search to identify the optimal split points within a tree.
- Greedy means the best split is chosen at each step, without looking ahead to later splits.
- This process of splitting is then repeated in a top-down, recursive manner until all, or the majority of, records have been classified under specific class labels.
- **Recursive binary splitting**: at every decision node the predictor space is split into two segments/regions; splitting stops when user-defined criteria are met.
- Whether or not all data points end up in homogeneous sets depends largely on the complexity of the decision tree. As a tree grows in size, too little data falls within a given subtree. This is known as **data fragmentation**, and it can often lead to **overfitting**.
- As a result, decision trees have a preference for small trees, which is consistent with the **principle of parsimony** in Occam’s Razor; that is, “entities should not be multiplied beyond necessity.”
- To reduce complexity and prevent overfitting, **pruning** is usually employed; this process removes branches that split on features with low importance. Subtrees are replaced by leaf nodes.

Pruning methods:
- Hold-out test (fast and simple)
- Cost-complexity pruning
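
The two pruning methods listed above can be combined in a short sketch. It assumes scikit-learn 0.22 or later (which provides `cost_complexity_pruning_path` and the `ccp_alpha` parameter) and uses the iris data only as a convenient example; the 70/30 split and `random_state=0` are arbitrary choices, not part of the notes above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Hold-out set (assumed 30 %) used to compare the pruned trees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Candidate pruning strengths (alphas), derived from the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)

results = []
for alpha in path.ccp_alphas:
    alpha = max(alpha, 0.0)  # guard against tiny negative values from rounding
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    results.append((alpha, tree.get_n_leaves(), tree.score(X_test, y_test)))

for alpha, n_leaves, acc in results:
    print(f"alpha={alpha:.4f}  leaves={n_leaves}  hold-out accuracy={acc:.3f}")
```

Larger `ccp_alpha` values prune more aggressively, so the number of leaves shrinks as alpha grows; one would typically pick the alpha with the best hold-out (or cross-validated) score.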

- The model’s fit can then be evaluated through the process of **cross-validation**. 
- Another way that decision trees can maintain their accuracy is by forming an ensemble via a **random forest** algorithm; such an ensemble tends to predict more accurately, particularly when the individual trees are uncorrelated with each other.
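
As a rough illustration of the two points above, the following sketch cross-validates a single tree and a random forest on the iris data. The choice of 5 folds, 100 trees and `random_state=0` are assumptions made for the example; no particular ranking of the two models is implied.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validation; for classifiers the default score is accuracy.
    cv_scores[name] = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {cv_scores[name].mean():.3f}")
```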

## Continuous / Regression DT
- predicts a continuous target
- does not assume a linear relationship between features and target; predictions are piecewise constant over the regions
- The output of each terminal/leaf node is the **mean response** of the training observations in that node; new data points are predicted from that mean.
- In regression trees the splitting criterion is the sum of squared errors, $SSE = \sum_i (y_i - \bar{y})^2$, used as the loss function.
- At every stage in the regression tree the region is split in two, such that the SSE is minimized.
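
A minimal sketch of one such greedy step, assuming a single numeric feature and NumPy: every midpoint between neighbouring feature values is tried as a threshold, and the one with the smallest combined SSE of the two regions wins. The helper names `sse` and `best_split` and the toy data are hypothetical, chosen only for illustration.

```python
import numpy as np

def sse(y):
    """Sum of squared errors of y around its mean (0 for an empty region)."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(x, y):
    """Return (threshold, total_sse) of the best binary split on one feature."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold can separate identical values
        threshold = (x[i] + x[i - 1]) / 2.0
        total = sse(y[:i]) + sse(y[i:])  # left region + right region
        if total < best_sse:
            best_threshold, best_sse = threshold, total
    return best_threshold, best_sse

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
threshold, total_sse = best_split(x, y)
print(threshold, round(total_sse, 3))
```

A full regression tree would apply `best_split` to every feature, keep the best one, and then recurse into the two resulting regions until a stopping criterion is met.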

## Classification / Categorical DT
- predicts categorical class labels (binary or multi-class)
- Classification trees typically use the Gini index as the loss function for finding the best split.
- The output of each terminal node is the **mode response** (the most frequent class among its training observations); new values are predicted from that mode.
- Recursive splitting in a classification tree splits regions in two according to a user-defined metric, for example the Gini index G.
- **Gini index**, also known as Gini impurity, measures the probability of misclassifying a randomly chosen element if it were labeled at random according to the class distribution in the node.
- If all the elements of a node belong to a single class, the node is called **pure**.
- The Gini index lies in the interval [0, 1), where
  - 0 expresses purity: all the elements belong to a single class.
  - values approaching 1 indicate elements spread evenly across many classes; for $K$ equally frequent classes the maximum is $1 - 1/K$.
  - 0.5 is the maximum for two classes and corresponds to an equal distribution of elements over both.
- The Gini Index tends to have a preference for larger partitions and hence can be computationally intensive.
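
The Gini values described above can be checked with a small sketch using the standard definition $G = 1 - \sum_k p_k^2$, where $p_k$ is the share of class $k$ in a node. The function names `gini_impurity` and `split_gini` are hypothetical helpers for this example, not part of any library.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity G = 1 - sum_k p_k^2 for a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Size-weighted Gini impurity of a binary split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

print(gini_impurity(["a", "a", "a", "a"]))  # pure node
print(gini_impurity(["a", "a", "b", "b"]))  # two balanced classes
print(gini_impurity(["a", "b", "c"]))       # three balanced classes
print(split_gini(["a", "a"], ["b", "b"]))   # split into two pure children
```

When choosing a split, the candidate with the lowest size-weighted impurity of the two child nodes is preferred.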

### Advantages: 
- DTs can process non-linear data,
- are easy to interpret,
- are graphically representable,
- and require less data preparation

### Disadvantages: 
- very non-robust 
- sensitive to training data
- globally optimal tree not guaranteed (a consequence of the greedy search)

### Assumptions: 
- root node = entire training set
- predictive features are either categorical or, if continuous, binned prior to model deployment
- records are distributed recursively based on the values of the attributes



In [1]:
import os
os.environ["path"]

'C:\\Users\\Lillian\\Anaconda3;C:\\Users\\Lillian\\Anaconda3\\Library\\mingw-w64\\bin;C:\\Users\\Lillian\\Anaconda3\\Library\\usr\\bin;C:\\Users\\Lillian\\Anaconda3\\Library\\bin;C:\\Users\\Lillian\\Anaconda3\\Scripts;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Users\\Lillian\\AppData\\Local\\Microsoft\\WindowsApps;'

In [2]:
# Append the Graphviz binaries to PATH so that pydotplus can find the `dot` executable
os.environ["path"] += os.pathsep + "C:\\Program Files (x86)\\Graphviz2.38\\bin"

In [3]:
os.environ["path"]

'C:\\Users\\Lillian\\Anaconda3;C:\\Users\\Lillian\\Anaconda3\\Library\\mingw-w64\\bin;C:\\Users\\Lillian\\Anaconda3\\Library\\usr\\bin;C:\\Users\\Lillian\\Anaconda3\\Library\\bin;C:\\Users\\Lillian\\Anaconda3\\Scripts;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Users\\Lillian\\AppData\\Local\\Microsoft\\WindowsApps;;C:\\Program Files (x86)\\Graphviz2.38\\bin'

In [4]:
import sklearn.datasets as datasets
import pandas as pd
from sklearn import metrics

### Preparing the data

In [5]:
iris=datasets.load_iris()

df = pd.DataFrame(iris.data, columns= iris.feature_names)

y = pd.DataFrame(iris.target)

y.columns = ['labels']

df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [6]:
y.labels.value_counts()

2    50
1    50
0    50
Name: labels, dtype: int64

### The decision tree model

In [7]:
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier()

dtree.fit(df, y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
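
The tree above is fitted on all 150 rows, so it cannot be scored on unseen data. A possible way to evaluate it, sketched here with the `metrics` module imported earlier, is to hold out part of the data. The 30 % test size, the stratified split and `random_state=42` are assumptions for this example, and `clf` is a separate model so the `dtree` object above stays untouched.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

# Fit on the training part only, then predict the held-out rows.
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = metrics.accuracy_score(y_test, y_pred)
confusion = metrics.confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted
print(accuracy)
print(confusion)
```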

In [None]:
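from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

# Write the fitted tree in Graphviz DOT format to an in-memory buffer
dot_data = StringIO()

export_graphviz(dtree, out_file=dot_data, filled=True, rounded=True, special_characters=True)

# Parse the DOT source and render it to a PNG for inline display
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

Image(graph.create_png())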
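
If Graphviz is not installed, the tree can also be inspected as plain text. The sketch below assumes scikit-learn 0.21 or later, which provides `export_text`; the `max_depth=2` limit is only there to keep the printed rules short and is not used in the model above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# A shallow tree (assumed max_depth=2) keeps the printed rules readable.
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# One line per node: split conditions for decision nodes, predicted class for leaves.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```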