# <h1 style="color:purple">Decision Tree</h1>

1. Mixed Data Types: In principle, decision trees can handle both numerical and categorical features without extensive preprocessing such as one-hot encoding. Note that scikit-learn's `DecisionTreeClassifier` still expects numeric input, which is why the categorical columns are label-encoded later in this notebook.

2. Non-linear Relationships: They can capture non-linear relationships between the features and the target variable, because they model the data as a series of conditional splits.
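
To make the point about non-linear relationships concrete, here is a minimal sketch on made-up toy data (the values of `X` and `y` and the name `toy_model` are illustrative assumptions, not part of the salary dataset). The class is 1 only in the middle of the range, so no single threshold separates the classes, yet two conditional splits do:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy, made-up data: the class is 1 only when x lies in the middle of the range,
# so a single threshold on x cannot separate the two classes.
X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 1, 1, 0, 0]

toy_model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Print the learned rules as text: a chain of "x <= threshold" conditions.
print(export_text(toy_model, feature_names=["x"]))

# Predict one point inside the middle band and one outside it.
print(toy_model.predict([[2.5], [0.2]]))
```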

---
Consider the dataset below:

<img src="./assets/images/Company_Salaries.png" width="1000">

The dataset mixes several categorical features: company, job, and degree.

---

 <h3 style="color:purple">Let's build a decision tree to find out which employees have salaries greater than 100k</h3>

  ![alt text](assets/images/Decision_Tree.png)

For Facebook the node has zero impurity (every employee falls into the same class), so its entropy is 0. The Google and ABC Pharma nodes are still mixed, so we have to split them further.
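
As a quick check of why zero impurity means zero entropy, here is the standard Shannon entropy formula, where $p_i$ is the fraction of samples in the node that belong to class $i$ (the symbols $H$, $S$, and $p_i$ are notation assumed here, not taken from the figures):

$$H(S) = -\sum_{i} p_i \log_2 p_i$$

For a pure node such as Facebook, one class has $p = 1$, so

$$H = -(1 \cdot \log_2 1) = 0$$

whereas a node split evenly between the two classes gives

$$H = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1$$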

---


 <h3 style="color:purple">So we will split the tree further based on more features:</h3>      

  ![alt text](assets/images/Tree_Split_Based_on_Features.png)


---



 <h3 style="color:purple">How do we choose the order of features on which to split the decision tree?</h3>

 ![alt text](assets/images/Good_Split_Evaluation.png)


---

<h2 style="color:blue">Entropy vs. Information Gain</h2>

1. Entropy:     
    - High Entropy: A dataset has high entropy when it has a mix of different classes, leading to high uncertainty or disorder. For example, a dataset with an equal mix of red and blue balls has high entropy.        

    - Low Entropy: A dataset has low entropy when it is mostly made up of a single class, leading to low uncertainty or disorder. For example, a dataset with only red balls has zero entropy.        


        <img src="./assets/images/High_Vs_Low_Entropy.png" width="600">      



2. Information Gain:     
    - High Information Gain: A split on a feature that results in a significant reduction of entropy (i.e., creates very pure child nodes) has high information gain. This is the ideal split because it separates the data more effectively.      
    - Low Information Gain: A split that does not significantly reduce entropy, and the child nodes are still quite mixed, has low information gain. This means the feature was not very useful for splitting.


       <img src="./assets/images/High_Vs_Low_InformationGain.png" width="600">   
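
The two ideas above can be sketched in a few lines of plain Python. This is an illustrative sketch, not how scikit-learn computes things internally; the helper names `entropy` and `information_gain` and the red/blue ball lists are assumptions made for the example:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# An equal mix of red and blue balls: maximum entropy for two classes.
parent = ["red"] * 5 + ["blue"] * 5

# A split that separates the colours completely (high information gain).
pure_split = [["red"] * 5, ["blue"] * 5]

# A split that leaves both children almost as mixed as the parent (low gain).
mixed_split = [["red"] * 3 + ["blue"] * 2, ["red"] * 2 + ["blue"] * 3]

print("parent entropy:", entropy(parent))
print("gain of pure split:", information_gain(parent, pure_split))
print("gain of mixed split:", information_gain(parent, mixed_split))
```

When choosing which feature to split on, the feature whose split yields the largest information gain is preferred.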

---

<h2 style="color:blue">Gini Impurity</h2>    

Definition: Gini impurity is another metric used to measure the impurity or randomness of a node in a decision tree. It is the probability that a randomly selected instance from the node would be misclassified if it were labeled at random according to the node's class distribution.
 - High Gini Impurity: Indicates a node that is more mixed with different classes.
 - Low Gini Impurity: Indicates a node that is more pure. A Gini value of 0 means the node is perfectly pure (all instances belong to one class).

    <img src="./assets/images/Gini_Impurity.jpg" width="300">  
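
As a rough illustration of the definition, Gini impurity can be computed as one minus the sum of squared class proportions. The helper name `gini_impurity` and the sample label lists below are assumptions made for this sketch:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

mixed_node = [0, 0, 0, 1, 1, 1]   # evenly mixed between two classes
pure_node = [1, 1, 1, 1]          # only one class present

print("mixed node:", gini_impurity(mixed_node))
print("pure node:", gini_impurity(pure_node))
```

For two classes the value ranges from 0 (pure) to 0.5 (evenly mixed).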

In [49]:
import pandas as pd
df = pd.read_csv('./assets/files/salaries.csv')
df.head()

|   | company | job | degree | salary_more_then_100k |
|---|---|---|---|---|
| 0 | google | sales executive | bachelors | 0 |
| 1 | google | sales executive | masters | 0 |
| 2 | google | business manager | bachelors | 1 |
| 3 | google | business manager | masters | 1 |
| 4 | google | computer programmer | bachelors | 0 |


In [50]:
inputs = df.drop('salary_more_then_100k', axis='columns')   # Input features: every column of the original dataset except the target.
target = df['salary_more_then_100k']                        # Target: the column we want to predict.

In [51]:
inputs

|   | company | job | degree |
|---|---|---|---|
| 0 | google | sales executive | bachelors |
| 1 | google | sales executive | masters |
| 2 | google | business manager | bachelors |
| 3 | google | business manager | masters |
| 4 | google | computer programmer | bachelors |
| 5 | google | computer programmer | masters |
| 6 | abc pharma | sales executive | masters |
| 7 | abc pharma | computer programmer | bachelors |
| 8 | abc pharma | business manager | bachelors |
| 9 | abc pharma | business manager | masters |


In [52]:
target

0     0
1     0
2     1
3     1
4     0
5     1
6     0
7     0
8     0
9     1
10    1
11    1
12    1
13    1
14    1
15    1
Name: salary_more_then_100k, dtype: int64

**Convert 'company', 'job', 'degree' column values to numeric using encoding**

In [53]:
from sklearn.preprocessing import LabelEncoder
le_company = LabelEncoder()
le_job = LabelEncoder()
le_degree = LabelEncoder()

In [54]:
inputs['company_n'] = le_company.fit_transform(inputs['company'])
inputs['job_n'] = le_job.fit_transform(inputs['job'])
inputs['degree_n'] = le_degree.fit_transform(inputs['degree'])
inputs

|   | company | job | degree | company_n | job_n | degree_n |
|---|---|---|---|---|---|---|
| 0 | google | sales executive | bachelors | 2 | 2 | 0 |
| 1 | google | sales executive | masters | 2 | 2 | 1 |
| 2 | google | business manager | bachelors | 2 | 0 | 0 |
| 3 | google | business manager | masters | 2 | 0 | 1 |
| 4 | google | computer programmer | bachelors | 2 | 1 | 0 |
| 5 | google | computer programmer | masters | 2 | 1 | 1 |
| 6 | abc pharma | sales executive | masters | 0 | 2 | 1 |
| 7 | abc pharma | computer programmer | bachelors | 0 | 1 | 0 |
| 8 | abc pharma | business manager | bachelors | 0 | 0 | 0 |
| 9 | abc pharma | business manager | masters | 0 | 0 | 1 |


**So <e style="color:purple;">'google' = 2, 'facebook' = 1, and 'abc pharma' = 0</e> in the company_n column**

**And <e style="color:purple;">'sales executive' = 2, 'computer programmer' = 1, and 'business manager' = 0</e> in the job_n column**

**And <e style="color:purple;">'masters' = 1, and 'bachelors' = 0</e> in the degree_n column**
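
Rather than reading the mapping off the table, it can be inspected on a fitted encoder through its `classes_` attribute. The sketch below uses a separate, hypothetical `demo_encoder` fitted on the three company names so that it runs on its own; the same attribute is available on `le_company`, `le_job`, and `le_degree`:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-alone encoder, fitted on the three company names.
demo_encoder = LabelEncoder()
codes = demo_encoder.fit_transform(["google", "abc pharma", "facebook", "google"])

# classes_ holds the unique values in sorted order; the position is the code.
print(list(demo_encoder.classes_))
print(list(codes))

# inverse_transform maps a code back to the original string.
print(demo_encoder.inverse_transform([2]))
```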


**Drop the unnecessary columns and use the encoded columns**

**Remember that the labels assigned by LabelEncoder are not random: each encoder sorts the unique values and numbers them in that order (alphabetically for strings)**

In [55]:
inputs_n = inputs.drop(['company', 'job', 'degree'], axis='columns')
inputs_n

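|   | company_n | job_n | degree_n |
|---|---|---|---|
| 0 | 2 | 2 | 0 |
| 1 | 2 | 2 | 1 |
| 2 | 2 | 0 | 0 |
| 3 | 2 | 0 | 1 |
| 4 | 2 | 1 | 0 |
| 5 | 2 | 1 | 1 |
| 6 | 0 | 2 | 1 |
| 7 | 0 | 1 | 0 |
| 8 | 0 | 0 | 0 |
| 9 | 0 | 0 | 1 |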


In [56]:
from sklearn import tree
model = tree.DecisionTreeClassifier()

In [57]:
model.fit(inputs_n, target)  # Not splitting the data into training and testing data for simplicity here. We can do that if needed.

model.criterion              # It uses 'gini' criterion by default. We can change it to 'entropy' if needed. Check the parameters of DecisionTreeClassifier for more details.

'gini'
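
To switch the criterion, pass `criterion="entropy"` when constructing the classifier. The sketch below is an assumption-laden stand-in for the notebook's data: `X` and `y` hold only the first ten encoded rows and targets displayed above, and the names `gini_model` and `entropy_model` are illustrative:

```python
from sklearn import tree

# First ten encoded rows as displayed above: [company_n, job_n, degree_n].
X = [[2, 2, 0], [2, 2, 1], [2, 0, 0], [2, 0, 1], [2, 1, 0],
     [2, 1, 1], [0, 2, 1], [0, 1, 0], [0, 0, 0], [0, 0, 1]]
y = [0, 0, 1, 1, 0, 1, 0, 0, 0, 1]

# Same data, two split criteria.
gini_model = tree.DecisionTreeClassifier(random_state=0).fit(X, y)
entropy_model = tree.DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

print(gini_model.criterion, entropy_model.criterion)
print(gini_model.score(X, y), entropy_model.score(X, y))
```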

In [58]:
model.score(inputs_n, target) # The score is 1.0 because the model is evaluated on the same data it was trained on.

1.0
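
A perfect score on the training data says little about unseen data. A minimal sketch of holding out a test set with `train_test_split` follows; it assumes only the first ten encoded rows displayed above, and the 30% test size, `random_state=0`, and the name `split_model` are arbitrary choices for illustration. With so few rows the resulting score is very noisy:

```python
from sklearn import tree
from sklearn.model_selection import train_test_split

# First ten encoded rows as displayed above: [company_n, job_n, degree_n].
X = [[2, 2, 0], [2, 2, 1], [2, 0, 0], [2, 0, 1], [2, 1, 0],
     [2, 1, 1], [0, 2, 1], [0, 1, 0], [0, 0, 0], [0, 0, 1]]
y = [0, 0, 1, 1, 0, 1, 0, 0, 0, 1]

# Hold out 30% of the rows; the model never sees them during fit().
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

split_model = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print(len(X_train), len(X_test))
print(split_model.score(X_test, y_test))  # accuracy on the held-out rows
```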

**The decision tree's model.predict() method returns an array of predicted class labels for the input samples.**

**Our target column is 'salary_more_then_100k', which has two values: 0 meaning no and 1 meaning yes.**

**So if model.predict() on a single-row 2-D input returns 0 in the array, the sample is predicted to belong to class 0.**

In [59]:
model.predict([[2, 2, 1]])  # Predicting if a person, working in Google (encoded as 2), with Sales Executive Job (encoded as 2) and having Masters degree (encoded as 1) makes more than 100k salary or not.



array([0])

In [60]:
model.predict([[2, 1, 1]])  # Predicting if a person working in Google (encoded as 2), with Computer Programmer Job (encoded as 1), and having Masters degree (encoded as 1) makes more than 100k salary or not.



array([1])
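
Typing the encoded numbers by hand is error-prone, so one option is to let the fitted encoders translate raw strings before predicting. The sketch below is self-contained and therefore rebuilds a small model from only the ten rows displayed earlier (the Facebook rows are not shown, so the integer codes here differ from the notebook's); the `encoders` dictionary and the `predict_salary` helper are hypothetical names introduced for illustration:

```python
import pandas as pd
from sklearn import tree
from sklearn.preprocessing import LabelEncoder

# Only the ten rows displayed in the notebook output (Facebook rows are not shown).
rows = [
    ("google", "sales executive", "bachelors", 0),
    ("google", "sales executive", "masters", 0),
    ("google", "business manager", "bachelors", 1),
    ("google", "business manager", "masters", 1),
    ("google", "computer programmer", "bachelors", 0),
    ("google", "computer programmer", "masters", 1),
    ("abc pharma", "sales executive", "masters", 0),
    ("abc pharma", "computer programmer", "bachelors", 0),
    ("abc pharma", "business manager", "bachelors", 0),
    ("abc pharma", "business manager", "masters", 1),
]
df = pd.DataFrame(rows, columns=["company", "job", "degree", "salary_more_then_100k"])

features = ["company", "job", "degree"]

# One fitted encoder per categorical column, kept so it can be reused at predict time.
encoders = {col: LabelEncoder().fit(df[col]) for col in features}
X = pd.DataFrame({col: encoders[col].transform(df[col]) for col in features})
y = df["salary_more_then_100k"]

model = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

def predict_salary(company, job, degree):
    """Hypothetical helper: encode the raw strings, then return the predicted class."""
    raw = {"company": company, "job": job, "degree": degree}
    row = pd.DataFrame({col: encoders[col].transform([raw[col]]) for col in features})
    return int(model.predict(row)[0])

print(predict_salary("google", "sales executive", "masters"))
print(predict_salary("google", "computer programmer", "masters"))
```

Passing a value the encoder has never seen raises an error, which is usually preferable to silently predicting from a wrong code.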