# Decision Trees
A decision tree is a tree-shaped model that reaches a decision by checking a sequence of conditions on the input.

**Node:**

A decision tree is made up of several kinds of nodes:

**1. Root Node:** The root node represents the entire dataset and is the starting point of the tree. In the movie example, the
first node, which checks whether the movie is a Hollywood movie, is the
root node from which the entire tree grows.

**2. Leaf Node:** A leaf node is an end node of the tree that cannot be split any further.
In the movie example, `watch movie` and `Don’t watch` are leaf nodes.

**3. Parent/Child Nodes:** A node that splits into further nodes is the parent of those nodes. The
nodes produced by a split are the child nodes of the node they came from.

![Decision Tree](images/decision-tree.png)

**Branches:**

Branches are the arrows that connect the nodes. They represent the flow from the root node down to a leaf node.
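The node and branch terminology can be sketched in a few lines of Python. This is only an illustration: the `Node` class, the `decide` helper, and the second condition ("Is the rating good?") are hypothetical and not taken from the figure. Only the Hollywood check and the two leaf labels come from the movie example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # For an internal node this is the condition; for a leaf it is the decision.
    label: str
    # Branches: maps an answer ("yes"/"no") to the child node it leads to.
    children: dict = field(default_factory=dict)

    def is_leaf(self) -> bool:
        return not self.children


def decide(node: Node, answers: dict) -> str:
    """Follow branches from the root until a leaf node is reached."""
    while not node.is_leaf():
        node = node.children[answers[node.label]]
    return node.label


# Leaf nodes: they cannot be split further.
watch = Node("watch movie")
dont_watch = Node("Don't watch")

# Hypothetical intermediate node: child of the root, parent of two leaves.
rating = Node("Is the rating good?", {"yes": watch, "no": dont_watch})

# Root node: the first condition that is checked.
root = Node("Is it a Hollywood movie?", {"yes": rating, "no": dont_watch})

print(decide(root, {"Is it a Hollywood movie?": "yes", "Is the rating good?": "yes"}))
```

Each dictionary entry in `children` plays the role of a branch, so walking the dictionary from `root` mirrors following the arrows in the diagram.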

**How to select an attribute to split a node:**

We use a splitting criterion to choose the attribute that best partitions the data.

Here are the most widely used criteria for selecting the attribute to split on:

## Information Gain

To choose the attribute that tells us the most about the data, we split on the attribute that gives the highest information gain. Information gain is calculated using a metric called `entropy`.

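$$\text{Information from attribute} = \sum_{x} p(x) \cdot \text{Entropy}(x)$$

where $x$ ranges over the values of the attribute, $p(x)$ is the fraction of rows with value $x$, and $\text{Entropy}(x)$ is the entropy of those rows.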

### Entropy
Entropy is the degree of randomness in a dataset. It measures the impurity, or disorder, of the class labels.

$$\text{Entropy} = -\sum_{y} p(y) \log_2 p(y)$$

where $p(y)$ is the proportion of rows that belong to class $y$.

$$\text{Information Gain} = \text{Total entropy} - \text{Information from attribute after splitting}$$
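The entropy and information gain formulas can be worked through on a tiny example. The sketch below uses only the standard library; the function names `entropy` and `information_gain` are illustrative, and the five rows mirror the first five rows of the salary dataset used later in this document.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy = -sum(p(y) * log2(p(y))) over the classes present in labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(attribute_values, labels):
    """Total entropy minus the weighted entropy of the groups after splitting."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    after_split = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - after_split


job = ["sales executive", "sales executive", "business manager",
       "business manager", "computer programmer"]
degree = ["bachelors", "masters", "bachelors", "masters", "bachelors"]
target = [0, 0, 1, 1, 0]

gain_job = information_gain(job, target)
gain_degree = information_gain(degree, target)

print(f"total entropy: {entropy(target):.3f}")
print(f"gain(job):     {gain_job:.3f}")
print(f"gain(degree):  {gain_degree:.3f}")
```

On these five rows every `job` group is pure, so splitting on `job` removes all of the entropy, while splitting on `degree` removes very little. Under the information gain criterion, `job` would be chosen for the first split.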

## Gini Index
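The Gini index, or Gini impurity, measures the probability that a randomly selected sample would be misclassified if it were labeled at random according to the class distribution in the node. A lower value means a purer node.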
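Gini impurity is commonly written as one minus the sum of the squared class proportions. A minimal standard-library sketch follows; the function name `gini_impurity` is illustrative, and the label list mirrors the first five target values of the salary dataset used later in this document.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini = 1 - sum(p(y)^2) over the classes present in labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


mixed = [0, 0, 1, 1, 0]   # three rows of class 0, two of class 1
pure = [1, 1]             # a node that contains a single class

print(f"mixed node: {gini_impurity(mixed):.2f}")
print(f"pure node:  {gini_impurity(pure):.2f}")
```

A pure node scores 0, and for two classes the impurity peaks at 0.5 when the classes are evenly balanced.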

## Classification Problem Solved Using a Decision Tree Classifier

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv("data/salaries.csv")
df.head()

|   | company | job | degree | salary_more_then_100k |
|---|---------|-----|--------|-----------------------|
| 0 | google | sales executive | bachelors | 0 |
| 1 | google | sales executive | masters | 0 |
| 2 | google | business manager | bachelors | 1 |
| 3 | google | business manager | masters | 1 |
| 4 | google | computer programmer | bachelors | 0 |


## Data Pre-processing

In [23]:
df.isna().sum()

company                  0
job                      0
degree                   0
salary_more_then_100k    0
dtype: int64

In [20]:
features = df.drop('salary_more_then_100k',axis='columns')
features.head()

|   | company | job | degree |
|---|---------|-----|--------|
| 0 | google | sales executive | bachelors |
| 1 | google | sales executive | masters |
| 2 | google | business manager | bachelors |
| 3 | google | business manager | masters |
| 4 | google | computer programmer | bachelors |


In [19]:
target = df['salary_more_then_100k']
target.head()

0    0
1    0
2    1
3    1
4    0
Name: salary_more_then_100k, dtype: int64

## Label Encoding the 3 columns

In [11]:
# le = LabelEncoder: maps each category in a column to an integer code
from sklearn.preprocessing import LabelEncoder
le_company = LabelEncoder()
le_job = LabelEncoder()
le_degree = LabelEncoder()

In [25]:
features['company_n'] = le_company.fit_transform(features['company'])
features['job_n'] = le_job.fit_transform(features['job'])
features['degree_n'] = le_degree.fit_transform(features['degree'])
features

|   | company | job | degree | company_n | job_n | degree_n |
|---|---------|-----|--------|-----------|-------|----------|
| 0 | google | sales executive | bachelors | 2 | 2 | 0 |
| 1 | google | sales executive | masters | 2 | 2 | 1 |
| 2 | google | business manager | bachelors | 2 | 0 | 0 |
| 3 | google | business manager | masters | 2 | 0 | 1 |
| 4 | google | computer programmer | bachelors | 2 | 1 | 0 |
| 5 | google | computer programmer | masters | 2 | 1 | 1 |
| 6 | abc pharma | sales executive | masters | 0 | 2 | 1 |
| 7 | abc pharma | computer programmer | bachelors | 0 | 1 | 0 |
| 8 | abc pharma | business manager | bachelors | 0 | 0 | 0 |
| 9 | abc pharma | business manager | masters | 0 | 0 | 1 |
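`LabelEncoder` assigns integer codes in sorted order of the distinct values, which explains the `job_n` codes in the table. The sketch below recovers that mapping from a fitted encoder; it rebuilds the `job` values inline so it runs on its own, and the `mapping` variable is only illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# The distinct job values that appear in the table, in their original order.
jobs = ["sales executive", "sales executive", "business manager",
        "business manager", "computer programmer"]

le_job = LabelEncoder()
codes = le_job.fit_transform(jobs)

# classes_ holds the sorted distinct values; the position is the integer code.
mapping = {str(name): code for code, name in enumerate(le_job.classes_)}
print(mapping)

# inverse_transform turns a code back into the original category.
print(le_job.inverse_transform([1])[0])
```

The integer codes imply an ordering that the categories do not really have. A decision tree can still split on them, but the codes should be read as identifiers rather than as quantities.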


In [28]:
features_n = features.drop(['company','job','degree'],axis='columns')
features_n.head()

|   | company_n | job_n | degree_n |
|---|-----------|-------|----------|
| 0 | 2 | 2 | 0 |
| 1 | 2 | 2 | 1 |
| 2 | 2 | 0 | 0 |
| 3 | 2 | 0 | 1 |
| 4 | 2 | 1 | 0 |


In [30]:
from sklearn import tree
model = tree.DecisionTreeClassifier()
model.fit(features_n, target)

DecisionTreeClassifier()

In [31]:
# Accuracy on the same rows the model was trained on, not on held-out data
model.score(features_n, target)

1.0

In [39]:
# Prediction is as follows: 0 = No, 1 = Yes
# Input order is [company_n, job_n, degree_n]: google = 2, computer programmer = 1, bachelors = 0, masters = 1

print(f'Is salary of Google, Computer Programmer, Bachelors degree > 100 k?  Answer --> {model.predict([[2,1,0]])}')

print(f'Is salary of Google, Computer Programmer, Masters degree > 100 k ?   Answer --> {model.predict([[2,1,1]])}')

Is salary of Google, Computer Programmer, Bachelors degree > 100 k?  Answer --> [0]
Is salary of Google, Computer Programmer, Masters degree > 100 k ?   Answer --> [1]
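Hard-coding integer codes such as `[2, 1, 0]` is easy to get wrong. One alternative is to reuse the fitted encoders to translate readable values into codes at prediction time. The sketch below is an assumption-laden illustration, not the notebook's own code: it trains only on the five rows shown by `df.head()`, drops `company` because all five rows share the same value, and introduces a hypothetical helper `predict_salary`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# First five rows of the salaries data, as displayed by df.head().
df = pd.DataFrame({
    "job": ["sales executive", "sales executive", "business manager",
            "business manager", "computer programmer"],
    "degree": ["bachelors", "masters", "bachelors", "masters", "bachelors"],
    "salary_more_then_100k": [0, 0, 1, 1, 0],
})

# One fitted encoder per column, kept so the same mapping is reused later.
encoders = {col: LabelEncoder().fit(df[col]) for col in ["job", "degree"]}
X = pd.DataFrame({f"{col}_n": enc.transform(df[col]) for col, enc in encoders.items()})
y = df["salary_more_then_100k"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)


def predict_salary(job: str, degree: str) -> int:
    """Encode readable values with the fitted encoders, then predict (0 = No, 1 = Yes)."""
    row = pd.DataFrame({
        "job_n": encoders["job"].transform([job]),
        "degree_n": encoders["degree"].transform([degree]),
    })
    return int(model.predict(row)[0])


prediction = predict_salary("computer programmer", "bachelors")
print(prediction)
```

Passing a value the encoder has never seen raises a `ValueError`, which surfaces typos immediately instead of silently predicting for the wrong category.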
