# Week 4 Tutorial/Computer Lab {-}


### Unit Convenor & Lecturer {-}
[George Milunovich](https://www.georgemilunovich.com)  
[george.milunovich@mq.edu.au](mailto:george.milunovich@mq.edu.au)

---

1. Read Week 4 Short-Answer Sample Questions (5 min)
2. Practice Quiz (15 min)
3. Choosing where to split a decision tree question
4. Python Exercise

---



**Decision Tree Split**

- Decide between the tree splits depicted below based on information gain computed using: 1) Classification Error, 2) Entropy, 3) Gini Impurity

<img src="images/image6.jpg" alt="Drawing" style="width: 350px;"/>

**Python Exercise**

Credit scorecards are used as a risk control method in the financial industry. Personal information submitted by credit card applicants is used to predict the probability of future defaults. The bank uses such data to decide whether to issue a credit card to the applicant.


| Feature Name | Explanation | Additional Remarks |
|--------------|-----------|-----------|
| ID | Randomly allocated client number | |
| Income | Annual income | |
| Gender | Applicant's gender | Male = 0, Female = 1 |
| Car | Car ownership | Yes = 1, No = 0 |
| Children | Number of children | |
| Real Estate | Real estate ownership | Yes = 1, No = 0 |
| Days Since Birth | No. of days | Counted backwards from the current day (0); -1 means yesterday |
| Days Employed | No. of days | Counted backwards from the current day (0); a positive value means the person is currently unemployed |
| Payment Default | Whether a client has overdue credit card payments | Yes = 1, No = 0 |

<br>

- Import the credit_data.xlsx file from data folder into a pandas DataFrame named df.
- What are the dimensions of the dataset?
- How many unique values are there in the "ID" column?
- Delete duplicate rows from df according to ID (keep the first occurrence of each duplicate row).
- How many rows are left in the dataframe? (answer in the Markdown box below in a full sentence)
- Reset the index in `df` using an appropriate function from `pandas` so that the new index runs consecutively from 0 to the number of rows minus one (make sure to delete the old index). Why do we need to do this?
- How many positive values of Days Employed are there? (answer in Markdown)
- Replace the positive values of Days Employed with 0 (zero) in df, and check that the operation was performed successfully.

---
---


## Exercise 1 

- Decide between the tree splits depicted below based on information gain computed using: 1) Classification Error, 2) Entropy, 3) Gini Impurity

<img src="images/image6.jpg" alt="Drawing" style="width: 350px;"/>

<hr style="width:25%;margin-left:0;"> 



### Solution 

- Deciding between two alternative tree splits based on information gain
- For a more detailed solution see the "Decision Tree Question.pdf" file in the Week 4 zip folder

Computing Information Gain  

$IG(D_p, f)=I(D_p)-\frac{N_\text{left}}{N_\text{p}}I(D_\text{left}) - \frac{N_\text{right}}{N_\text{p}}I(D_\text{right})$

- $I$ - Impurity measure
- $f$ - feature to perform the split, e.g. age or education level
- $D_p$ - dataset of the parent node
- $N_p$ - number of training examples at the parent node
- $D_j$ - dataset of the jth child node
- $N_j$ - number of training examples in the jth child node



Here $p(i|t)$ denotes the proportion of the examples at node $t$ that belong to class $i$; each impurity measure below is a function of these proportions.


Let's consider splitting a **parent node which has (40, 40)** examples, i.e. 40 examples from class 0 and 40 examples from class 1, in two different ways:


<img src="images/image6.jpg" alt="Drawing" style="width: 350px;"/>


Parent Node:
- $N_p = 80$
- $p(i=0|D_p)=\frac{40}{80}=0.5$
- $p(i=1|D_p)=\frac{40}{80}=0.5$


A: 
- Left node: (30, 10) -> $N_L=40$, $p(i=0|D_L)=\frac{30}{40}$, $p(i=1|D_L)=\frac{10}{40}$
- Right node: (10, 30) -> $N_R=40$, $p(i=0|D_R)=\frac{10}{40}$, $p(i=1|D_R)=\frac{30}{40}$
    
B: 
- Left node: (20, 40) -> $N_L=60$, $p(i=0|D_L)=\frac{20}{60}$, $p(i=1|D_L)=\frac{40}{60}$
- Right node: (20, 0) -> $N_R=20$, $p(i=0|D_R)=\frac{20}{20}$, $p(i=1|D_R)=\frac{0}{20}$




Now we compare the two splits A and B using the three impurity measures.

- Note that split B is purer: its right child node (20, 0) contains examples from one class only.

**Classification Error** $I_E = 1 - \max_i\, p(i|t)$

<img src="images/image7.jpg" alt="Drawing" style="width: 350px;"/>

- IG = 0.25 under both scenarios
- Make sure you can do these computations
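
The classification-error computation can be written out step by step from the node counts. This is a worked sketch that uses only the information-gain formula and the (40, 40) parent node given above.

$$I_E(D_p) = 1 - \max(0.5,\, 0.5) = 0.5$$

Split A:

$$I_E(D_\text{left}) = 1 - \tfrac{30}{40} = 0.25, \qquad I_E(D_\text{right}) = 1 - \tfrac{30}{40} = 0.25$$

$$IG_A = 0.5 - \tfrac{40}{80}(0.25) - \tfrac{40}{80}(0.25) = 0.25$$

Split B:

$$I_E(D_\text{left}) = 1 - \tfrac{40}{60} = \tfrac{1}{3}, \qquad I_E(D_\text{right}) = 1 - \tfrac{20}{20} = 0$$

$$IG_B = 0.5 - \tfrac{60}{80}\cdot\tfrac{1}{3} - \tfrac{20}{80}\cdot 0 = 0.25$$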

<!-- ![](images/image7.jpg) -->

<!-- ![](images/image8.jpg) -->

**Entropy** $I_H=-[p(i=1)\log_2 p(i=1) + p(i=0)\log_2 p(i=0)]$

<img src="images/image9.jpg" alt="Drawing" style="width: 350px;"/>

- Entropy favours B split (IG = 0.31) over A split (IG = 0.19)


<!-- ![](images/image9.jpg) -->


**Gini Impurity** $I_G=p(i=1)\,(1-p(i=1)) + p(i=0)\,(1-p(i=0))$
    
<img src="images/image8.jpg" alt="Drawing" style="width: 350px;"/>

- Gini impurity favours B split (IG ≈ 0.167) over A split (IG = 0.125)
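
The hand computations for all three impurity measures can be checked with a few lines of Python. The sketch below is one possible way to do it, using only the standard library; the helper names (`proportions`, `information_gain`, and so on) are made up for this illustration and do not come from the lecture material.

```python
from math import log2

def classification_error(p):
    """Classification error: 1 minus the largest class proportion."""
    return 1 - max(p)

def entropy(p):
    """Entropy in bits; classes with zero proportion contribute nothing."""
    return -sum(q * log2(q) for q in p if q > 0)

def gini(p):
    """Gini impurity: sum of p(1 - p) over the classes."""
    return sum(q * (1 - q) for q in p)

def proportions(counts):
    """Turn class counts such as (30, 10) into class proportions."""
    n = sum(counts)
    return [c / n for c in counts]

def information_gain(impurity, parent, left, right):
    """Parent impurity minus the size-weighted impurities of the children."""
    n_p = sum(parent)
    ig = impurity(proportions(parent))
    for child in (left, right):
        ig -= sum(child) / n_p * impurity(proportions(child))
    return ig

# Class counts (class 0, class 1) taken from the exercise
parent = (40, 40)
splits = {"A": ((30, 10), (10, 30)), "B": ((20, 40), (20, 0))}

results = {}
for measure in (classification_error, entropy, gini):
    for name, (left, right) in splits.items():
        ig = information_gain(measure, parent, left, right)
        results[(measure.__name__, name)] = ig
        print(f"{measure.__name__:>20}  split {name}: IG = {ig:.4f}")
```

Comparing the printed values for split A and split B under each measure shows which split that measure prefers.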


---
---

## Exercise 2



---

Import the credit_data.xlsx file from data folder into a pandas DataFrame named df.

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# ----------------------------------------------------

df = pd.read_excel('data/credit_data.xlsx')

```

---
What are the dimensions of the dataset?  Provide code and an answer in Markdown

```
df.shape  # (number of rows, number of columns)
```

------- add your text answer here ----------

---
How many unique values are there in the "ID" column? Provide code and an answer in Markdown

```
df["ID"].nunique()
```

------- add your text answer here ----------

---
Delete duplicate rows from df according to ID (keep the first occurrence of each duplicate row).

```
# keep='first' is the default; it is spelled out here to match the question
df.drop_duplicates(subset=['ID'], keep='first', inplace=True)
```


---
How many rows are left in the dataframe? Provide code and answer in Markdown

```
len(df)  # number of rows remaining after removing duplicates
```

------- add your text answer here ----------

---
Reset the index in `df` using an appropriate function from `pandas` so that the new index runs consecutively from 0 to the number of rows minus one (make sure to delete the old index). 

```
df.tail(20)

```

```
df.reset_index(drop=True, inplace = True)
```

```
df.tail(20)

```
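
To see why the reset matters, here is a small sketch on a made-up DataFrame. The `toy` data and its values are hypothetical and are not taken from credit_data.xlsx. After duplicates are dropped, the surviving rows keep their old labels, so the index has gaps until it is reset.

```python
import pandas as pd

# Hypothetical toy data, not taken from credit_data.xlsx
toy = pd.DataFrame({"ID": [101, 102, 101, 103, 102],
                    "Income": [50, 60, 55, 70, 65]})

# Keep the first occurrence of each ID; old index labels are retained
deduped = toy.drop_duplicates(subset=["ID"], keep="first")
print(deduped.index.tolist())   # labels of the surviving rows, with gaps

# drop=True discards the old labels instead of adding them as a column
reset = deduped.reset_index(drop=True)
print(reset.index.tolist())     # consecutive labels starting at 0

# Before the reset, the row at position 2 still carries the old label 3
print(deduped.loc[3, "ID"], deduped.iloc[2]["ID"])
```

With a consecutive index, label-based access (`.loc`) and position-based access (`.iloc`) refer to the same rows, which avoids surprises in later steps.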

---
How many positive values of Days Employed are there? (answer in Markdown)

```
# Summing the boolean mask counts the rows with a positive value
print((df['Days Employed'] > 0).sum())
```

------- add your text answer here ----------

---

Replace the positive values of Days Employed with 0 (zero) in df, and check that the operation was performed successfully.

```
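# Overwrite positive values (currently unemployed) with 0
df.loc[df['Days Employed'] > 0, 'Days Employed'] = 0
# Check: the count of positive values should now be 0
print((df['Days Employed'] > 0).sum())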

```
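
The count-then-replace pattern can be tried out on a small made-up column first. The sketch below assumes a hypothetical `toy` DataFrame (the values are invented for illustration) and stores the boolean mask in a variable so that the same condition is used for both counting and replacing.

```python
import pandas as pd

# Hypothetical values: negatives are days employed, positives mean unemployed
toy = pd.DataFrame({"Days Employed": [-1200, 500, -30, 90, 0]})

mask = toy["Days Employed"] > 0          # True where the value is positive
n_positive = int(mask.sum())             # count before the replacement

toy.loc[mask, "Days Employed"] = 0       # overwrite only the masked rows

n_remaining = int((toy["Days Employed"] > 0).sum())  # count after
print(n_positive, n_remaining)
```

A remaining count of zero confirms that the replacement worked; the negative values and existing zeros are left unchanged.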
