### Splitting data by asking questions
#### Applications: Classification and Regression

<table><tr>
<td> 
  <p align="center" style="padding: 10px">
    <img alt="DecisionTree1" src="images/DT1.png" width="400" height="400">
    <br>
    <em style="color: grey">Good Decision Tree</em>
  </p> 
</td>
<td> 
  <p align="center">
    <img alt="DecisionTree2" src="images/DT2.png" width="450" height="450">
    <br>
    <em style="color: grey">Bad Decision Tree</em>
  </p> 
</td>
</tr></table>

1. We need to pick a good first question for the root of our tree out of 5 given candidates
    1. Is it raining?
    2. Is it cold outside (temperature check)?
    3. Am I hungry?
    4. Is there a red car outside?
    5. Is it Monday?
2. Pick the question that gives the highest accuracy, or the lowest Gini index or entropy
3. $\color{red}{\text{Always pick the best available question at each node (greedy algorithm). This does not guarantee that we get the best possible tree.}}$ <br>
Note: This is a quick process (each node requires only a linear search over the candidate questions) and works very well most of the time. A possible future improvement is to relax this greedy property
4. $\color{red}{\text{Alternative: build all possible decision trees and pick the best one}}$ <br>
Note: If the dataset has many features, the number of possible decision trees is very large, and going through all of them would be very slow
5. In classification the leaves hold classes, and in regression the leaves hold values. The model's prediction is obtained by traversing the tree downward from the root to a leaf

### Understand the algorithm by solving the following problem
#### Decision Tree as Classifier:
#### Problem: Recommend apps to users according to what they are likely to download

**Apps to be recommended (Target):**
1. Atom Count - An app that counts the number of atoms in your body
2. Beehive Finder - An app that maps your location and finds the closest beehives
3. Check Mate Mate - An app for finding Australian chess players <br>

<img alt="DecisionTree3" src="images/DT3.png" width="400" align="center"/>

**Dataset:** <br>

| Platform | Age | App |
| :- | -: | :-: |
| iPhone | 15 | Atom Count
| iPhone | 25 | Check Mate Mate
| Android | 32 | Beehive Finder
| iPhone | 35 | Check Mate Mate
| Android | 12 | Atom Count
| Android | 14 | Atom Count

Which app would you recommend to each of the following three customers? <br>
• Customer 1: a 13-year-old iPhone user <br>
• Customer 2: a 28-year-old iPhone user <br>
• Customer 3: a 34-year-old Android user <br>

**_Human Solution:_**<br>
Customer 1: a 13-year-old iPhone user. To this customer, we should recommend Atom Count,
because it seems (looking at the three customers in their teens) that young people tend to download
Atom Count. <br>
Customer 2: a 28-year-old iPhone user. To this customer, we should recommend Check Mate
Mate, because the two adult iPhone users in the dataset (aged 25 and 35) both downloaded
Check Mate Mate. <br>
Customer 3: a 34-year-old Android user. To this customer, we should recommend Beehive Finder,
because there is one Android user in the dataset who is 32 years old, and they downloaded Beehive
Finder. <br>

#### Solution: Building an app-recommendation algorithm

1. Simplify the problem by converting the numerical feature Age into a categorical feature (Young or Adult) <br>

| Platform | Age | App |
| :- | -: | :-: |
| iPhone | Young | Atom Count
| iPhone | Adult | Check Mate Mate
| Android | Adult | Beehive Finder
| iPhone | Adult | Check Mate Mate
| Android | Young | Atom Count
| Android | Young | Atom Count
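
The conversion in step 1 can be sketched in a few lines of Python. The notes do not state the age cutoff, so the value `AGE_CUTOFF = 20` below is an assumption (any cutoff above 15 and up to 25 reproduces the table), and the names `dataset`, `age_group`, and `categorical` are illustrative.

```python
# Raw dataset from the first table: (platform, age, app).
dataset = [
    ("iPhone", 15, "Atom Count"),
    ("iPhone", 25, "Check Mate Mate"),
    ("Android", 32, "Beehive Finder"),
    ("iPhone", 35, "Check Mate Mate"),
    ("Android", 12, "Atom Count"),
    ("Android", 14, "Atom Count"),
]

# Assumed cutoff: the notes do not give one. Any value in (15, 25]
# produces the same Young/Adult labels for this dataset.
AGE_CUTOFF = 20


def age_group(age, cutoff=AGE_CUTOFF):
    """Map a numerical age to the categorical label 'Young' or 'Adult'."""
    return "Young" if age < cutoff else "Adult"


# Same rows, with the numerical age replaced by its category.
categorical = [(platform, age_group(age), app) for platform, age, app in dataset]

for row in categorical:
    print(row)
```

With this cutoff the three users aged 12, 14, and 15 become Young and the users aged 25, 32, and 35 become Adult, which matches the table.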

2. Classifier1: Question: Does the user use an iPhone or an Android? (splits the users into two groups, iPhone and Android) - the most downloaded app in each group is the recommended output

3. Classifier2: Question: Is the user young or adult? (splits the users into two groups, Young and Adult) - the most downloaded app in each group is the recommended output

4. Calculate **accuracy** for all classifiers (the higher the accuracy, the better the classifier)
<img alt="DecisionTree4" src="images/DT4.png" width="600" align="center"/>
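
As a rough check of step 4, the accuracy of each one-question classifier can be computed directly from the categorical table. This is a minimal sketch: the helper name `split_accuracy` and the tuple layout of `rows` are illustrative choices, not part of the original notes.

```python
from collections import Counter

# Categorical dataset: (platform, age group, app).
rows = [
    ("iPhone", "Young", "Atom Count"),
    ("iPhone", "Adult", "Check Mate Mate"),
    ("Android", "Adult", "Beehive Finder"),
    ("iPhone", "Adult", "Check Mate Mate"),
    ("Android", "Young", "Atom Count"),
    ("Android", "Young", "Atom Count"),
]


def split_accuracy(rows, feature_index):
    """Accuracy of a classifier that asks one question and then
    predicts the majority app within each resulting group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature_index], []).append(row[-1])
    # In each group, the rows that hold the majority app are classified correctly.
    correct = sum(Counter(apps).most_common(1)[0][1] for apps in groups.values())
    return correct / len(rows)


platform_accuracy = split_accuracy(rows, 0)  # Classifier1
age_accuracy = split_accuracy(rows, 1)       # Classifier2
print(f"Classifier1 (platform): {platform_accuracy:.3f}")
print(f"Classifier2 (age): {age_accuracy:.3f}")
```

Splitting by platform classifies 4 of the 6 users correctly, while splitting by age classifies 5 of 6 correctly, so the age question scores higher on accuracy.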

5. Calculate the **Gini impurity index** - a measure of diversity (the lower the Gini index, the better the classifier: a low Gini index means the elements are similar, a high one means they are different) <br>
$\begin{aligned}
\text{Gini impurity index} &= P(\text{picking two different elements}) \\
&= 1 - P(\text{picking two similar elements}) \\
&= 1 - p_{1}^{2} - p_{2}^{2} - p_{3}^{2} \quad \text{(since three apps exist in the dataset)}
\end{aligned}$
    Classifier1 (by platform):
        - Left Leaf (iPhone):{A,C,C}
        - Right leaf (Android):{A,A,B}
    Classifier2 (by age):
        - Left leaf (young):{A,A,A}
        - Right leaf (adult):{B,C,C}
    The gini indices of sets {A,C,C},{A,A,B}, and {B,C,C} are all the same: $1-(\frac{2}{3})^{2}-(\frac{1}{3})^{2}-0=0.444$ <br>
    The gini index of set {A,A,A} is $1-(\frac{3}{3})^{2}-0-0=0$ (gini index of pure set is always 0) <br>
    Classifier1: Average Gini index = $\frac{0.444+0.444}{2} = 0.444$ <br>
    Classifier2: Average Gini index = $\frac{0.444+0}{2} = 0.222$ (both leaves hold three elements, so the simple average equals the size-weighted average)
<img alt="DecisionTree5" src="images/DT5.png" width="600" align="center"/>

*NOTE: The Gini impurity index should not be confused with the Gini coefficient. The Gini coefficient is used in statistics to calculate the income or wealth inequality in countries. In this book, whenever we talk about the Gini index, we are referring to the Gini impurity index.* <br>
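
The Gini calculations in step 5 can be reproduced with a short sketch. The function names `gini` and `average_gini` are illustrative. Note one assumption: `average_gini` weights each leaf by its size, which is a common convention; for this dataset both leaves have three elements, so it gives the same result as the simple average used in the notes.

```python
from collections import Counter


def gini(labels):
    """Gini impurity index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def average_gini(leaves):
    """Size-weighted average Gini index over the leaves of a split
    (assumption: weighting by leaf size; equal to the simple average
    when all leaves have the same size, as they do here)."""
    total = sum(len(leaf) for leaf in leaves)
    return sum(len(leaf) / total * gini(leaf) for leaf in leaves)


# Leaves written with the app initials A, B, C, as in the notes.
platform_leaves = [list("ACC"), list("AAB")]  # Classifier1: iPhone, Android
age_leaves = [list("AAA"), list("BCC")]       # Classifier2: young, adult

print(f"Classifier1 average Gini: {average_gini(platform_leaves):.3f}")
print(f"Classifier2 average Gini: {average_gini(age_leaves):.3f}")
```

The age split has the lower average Gini index (0.222 against 0.444), which agrees with the hand calculation.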

6. Calculate **Entropy** - another measure of diversity (the lower the entropy, the more homogeneous the set) <br>
$\;Entropy = -p_{1}\log_{2}(p_{1})-p_{2}\log_{2}(p_{2})-\cdots-p_{n}\log_{2}(p_{n})$ <br>
    Classifier1 (by platform):
        - Left Leaf (iPhone):{A,C,C}
        - Right leaf (Android):{A,A,B}
    Classifier2 (by age):
        - Left leaf (young):{A,A,A}
        - Right leaf (adult):{B,C,C}
    The entropies of sets {A,C,C},{A,A,B}, and {B,C,C} are all the same: $-\frac{2}{3}\log_{2}(\frac{2}{3})-\frac{1}{3}\log_{2}(\frac{1}{3})-0=0.918$ <br>
    The entropy of set {A,A,A} is $-\frac{3}{3}\log_{2}(\frac{3}{3})-0-0=-\log_{2}(1)=0$ (entropy of a pure set is always 0) <br>
    Classifier1: Average entropy = $\frac{0.918+0.918}{2} = 0.918$ <br>
    Classifier2: Average entropy = $\frac{0.918+0}{2} = 0.459$
<img alt="DecisionTree6" src="images/DT6.png" width="600" align="center"/>
    Thus, again we conclude that the second split is better, because it has a lower average entropy.
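
The entropy figures in step 6 can be checked the same way. This is a minimal sketch with illustrative names (`entropy`, `average_entropy`); as an assumption, the average is weighted by leaf size, which coincides with the simple average here because both leaves hold three elements.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits: sum of -p * log2(p) over the classes present.
    Classes with zero count are skipped, matching the '- 0' terms in the notes."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n) for count in Counter(labels).values())


def average_entropy(leaves):
    """Size-weighted average entropy over the leaves of a split
    (assumption: weighting by leaf size; same as the simple average here)."""
    total = sum(len(leaf) for leaf in leaves)
    return sum(len(leaf) / total * entropy(leaf) for leaf in leaves)


platform_leaves = [list("ACC"), list("AAB")]  # Classifier1: iPhone, Android
age_leaves = [list("AAA"), list("BCC")]       # Classifier2: young, adult

print(f"Classifier1 average entropy: {average_entropy(platform_leaves):.3f}")
print(f"Classifier2 average entropy: {average_entropy(age_leaves):.3f}")
```

The values come out to roughly 0.918 for the platform split and 0.459 for the age split, matching the hand calculation.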
    

<table><tr>
<td> 
  <p align="left" style="padding: 10px">
    <img alt="DecisionTree1" src="images/DT1.png" width="400" height="400" align="left">
    <br>
    <em style="color: grey">Good Decision Tree</em>
  </p> 
</td>
</tr></table>
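
Putting the pieces together, the greedy procedure described at the start of these notes (pick the best question, then repeat on each impure leaf) can be sketched for the app dataset as follows. This is a minimal illustration, not a full implementation: it assumes categorical features, uses the size-weighted Gini index as the splitting criterion, and the names `build_tree`, `predict`, and `FEATURES` are hypothetical. The three customers are labeled Young or Adult under the assumption that the age cutoff lies between 15 and 25.

```python
from collections import Counter

# Column index of each feature in a row: (platform, age group, app).
FEATURES = {"platform": 0, "age": 1}

rows = [
    ("iPhone", "Young", "Atom Count"),
    ("iPhone", "Adult", "Check Mate Mate"),
    ("Android", "Adult", "Beehive Finder"),
    ("iPhone", "Adult", "Check Mate Mate"),
    ("Android", "Young", "Atom Count"),
    ("Android", "Young", "Atom Count"),
]


def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(rows, col):
    """Size-weighted average Gini index of the groups produced by splitting on col."""
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row[-1])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())


def build_tree(rows, features):
    labels = [row[-1] for row in rows]
    # Stop when the node is pure or no questions remain: the leaf holds the majority app.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: choose the question with the lowest average Gini index.
    best = min(features, key=lambda name: weighted_gini(rows, features[name]))
    col = features[best]
    remaining = {name: c for name, c in features.items() if name != best}
    branches = {}
    for value in sorted({row[col] for row in rows}):
        subset = [row for row in rows if row[col] == value]
        branches[value] = build_tree(subset, remaining)
    return {"question": best, "col": col, "branches": branches}


def predict(tree, sample):
    """Traverse the tree downward until a leaf (an app name) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][sample[tree["col"]]]
    return tree


tree = build_tree(rows, FEATURES)

# Customers from the problem statement, with ages 13, 28, 34 mapped to groups.
customers = [("iPhone", "Young"), ("iPhone", "Adult"), ("Android", "Adult")]
predictions = [predict(tree, customer) for customer in customers]
print(predictions)
```

On this dataset the sketch asks about age first, then about platform for the adult users, and its three recommendations agree with the human solution: Atom Count, Check Mate Mate, and Beehive Finder.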