I solemnly swear that I have not discussed my assignment solutions with anyone in any way and the solutions I am submitting are my own personal work.

Full Name: Wei Zhang

Student ID: S3759607

## Question 1 - Part A

First, read the csv file into the notebook and check the first 5 rows.

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)

data1 = pd.read_csv('A2_Q1.csv', sep = ',')

In [2]:
data1.head()

Unnamed: 0,ID,Age,Education,Marital_Status,Occupation,Annual_Income
0,1,39,bachelors,never married,professional,high
1,2,50,doctorate,married,professional,mid
2,3,18,high school,never married,agriculture,low
3,4,30,bachelors,married,professional,mid
4,5,37,high school,married,agriculture,mid


The `Annual_Income` column is the target feature. We calculate the probability distribution of its income levels.

In [3]:
income = data1['Annual_Income']
income_probs = income.value_counts(normalize=True)
income_probs

mid     0.60
high    0.25
low     0.15
Name: Annual_Income, dtype: float64

One way to measure impurity is entropy. Recall that Shannon's model defines entropy as
$$H(x) := - \sum_{i=1}^{\ell} P(t=i) \times \log_{2}(P(t=i))$$
We then calculate the impurity of the `Annual_Income` feature using the entropy criterion.

In [4]:
entropy = -1 * np.sum(np.log2(income_probs) * income_probs)
entropy.round(3)

1.353
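
As a check on the value above, the same sum can be written out by hand using the three probabilities reported for `Annual_Income` (0.60, 0.25 and 0.15). Each term is rounded to three decimals here, so this is an approximate worked example rather than an exact derivation:

$$H = -\left(0.60 \log_{2} 0.60 + 0.25 \log_{2} 0.25 + 0.15 \log_{2} 0.15\right) \approx 0.442 + 0.500 + 0.411 = 1.353$$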

Another way to measure impurity is the Gini index. It is defined as
$$ \mbox{Gini}(x) := 1 - \sum_{i=1}^{\ell}P(t=i)^{2}$$
The impurity of the `Annual_Income` feature using the Gini index is calculated below.

In [5]:
gini = 1 - np.sum(np.square(income_probs))
gini.round(3)

0.555
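
The same result follows from substituting the three reported probabilities into the definition by hand:

$$\mbox{Gini} = 1 - \left(0.60^{2} + 0.25^{2} + 0.15^{2}\right) = 1 - \left(0.3600 + 0.0625 + 0.0225\right) = 0.555$$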

### Part B

The dataset is sorted by the continuous `Age` feature in ascending order and displayed below.

In [6]:
data1.sort_values(by='Age')

Unnamed: 0,ID,Age,Education,Marital_Status,Occupation,Annual_Income
2,3,18,high school,never married,agriculture,low
5,6,23,high school,never married,agriculture,low
12,13,23,bachelors,never married,agriculture,low
19,20,25,bachelors,married,transport,high
13,14,25,high school,married,professional,high
15,16,29,bachelors,never married,agriculture,mid
3,4,30,bachelors,married,professional,mid
9,10,33,high school,married,transport,mid
14,15,35,bachelors,married,agriculture,mid
10,11,36,high school,never married,transport,mid


By inspecting the `Age` and `Annual_Income` features in the sorted data, we choose four candidate `Age` thresholds: ≥24, ≥27, ≥38 and ≥42. These thresholds mark the ages at which the income level changes. We then use boolean masks to add four new columns to the dataset, each indicating whether a row's age meets the corresponding threshold.

In [7]:
mask24 = data1['Age'] >= 24
mask27 = data1['Age'] >= 27
mask38 = data1['Age'] >= 38
mask42 = data1['Age'] >= 42
data1['Age≥24'] = mask24
data1['Age≥27'] = mask27
data1['Age≥38'] = mask38
data1['Age≥42'] = mask42
data1.head()

Unnamed: 0,ID,Age,Education,Marital_Status,Occupation,Annual_Income,Age≥24,Age≥27,Age≥38,Age≥42
0,1,39,bachelors,never married,professional,high,True,True,True,False
1,2,50,doctorate,married,professional,mid,True,True,True,True
2,3,18,high school,never married,agriculture,low,False,False,False,False
3,4,30,bachelors,married,professional,mid,True,True,False,False
4,5,37,high school,married,agriculture,mid,True,True,False,False
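
The candidate thresholds above were picked by eye. One way they could be derived programmatically is sketched below: sort by the continuous feature and take the midpoint between adjacent values wherever the target level changes. The helper name `candidate_thresholds` is hypothetical and not part of the original notebook. The sample holds only the ten sorted rows displayed in Part B, so it can recover only the two lower thresholds.

```python
import pandas as pd

def candidate_thresholds(df, feature, target):
    """Midpoints between adjacent feature values where the target changes.

    Hypothetical helper for illustration; ties in the feature are skipped.
    """
    ordered = df.sort_values(by=feature)
    values = ordered[feature].tolist()
    labels = ordered[target].tolist()
    thresholds = set()
    for i in range(1, len(values)):
        # A boundary needs a change in the target and a gap in the feature
        if labels[i] != labels[i - 1] and values[i] != values[i - 1]:
            thresholds.add((values[i] + values[i - 1]) / 2)
    return sorted(thresholds)

# The ten sorted rows shown in the Part B output
sample = pd.DataFrame({
    'Age': [18, 23, 23, 25, 25, 29, 30, 33, 35, 36],
    'Annual_Income': ['low', 'low', 'low', 'high', 'high',
                      'mid', 'mid', 'mid', 'mid', 'mid'],
})

found = candidate_thresholds(sample, 'Age', 'Annual_Income')
print(found)
```

On this sample the function returns midpoints at 24 and 27, matching the first two thresholds chosen above. Rows that share an age but differ in income level would need extra care that this sketch does not attempt.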


Since the `ID` and `Age` columns are not used as splitting features in the impurity calculation, we remove them from the dataset.


In [8]:
data1_new = data1.drop(['ID', 'Age'], axis=1)
data1_new.head()

Unnamed: 0,Education,Marital_Status,Occupation,Annual_Income,Age≥24,Age≥27,Age≥38,Age≥42
0,bachelors,never married,professional,high,True,True,True,False
1,doctorate,married,professional,mid,True,True,True,True
2,high school,never married,agriculture,low,False,False,False,False
3,bachelors,married,professional,mid,True,True,False,False
4,high school,married,agriculture,mid,True,True,False,False


For convenience, we define a function called `compute_gini()` that calculates the impurity of a feature using Gini Index. We use this function to calculate the impurity of `Annual_Income` feature in the original dataset again.

In [9]:
def compute_gini(feature):
    probs = feature.value_counts(normalize=True)
    impurity = 1 - np.sum(np.square(probs))
    return round(impurity, 3)
compute_gini(income)

0.555

Next, we define a function that calculates the information gain of a given feature using the Gini index.

In [10]:
def comp_feature_infor_gain_gini(df, target, feature):
    """
    This function calculates information gain for splitting on a particular descriptive feature for
    a given dataset using Gini Index.
    """
    print('target feature: ', target)
    print('descriptive_feature: ', feature)
    
    target_gini = compute_gini(df[target])
    
    gini_list = list()
    weight_list = list()
    
    for value in df[feature].unique():
        df_feature_value = df[df[feature] == value]
        gini_value = compute_gini(df_feature_value[target])
        gini_list.append(round(gini_value, 3))
        weight = len(df_feature_value) / len(df)
        weight_list.append(round(weight, 3))
        
    print('impurity of partitions: ', gini_list)
    print('weights of partitions: ', weight_list)
    
    feature_remain_gini = round(np.sum(np.array(gini_list)*np.array(weight_list)), 3)
    print('remaining impurity: ', feature_remain_gini)
    
    information_gain = round(target_gini - feature_remain_gini, 3)
    print('information_gain: ', information_gain)
    
    print('=========')
    
    # Return the gain so that callers can store or compare it
    return information_gain

Now we call this function in a for-loop to calculate the information gain of each descriptive feature in the dataset.

In [11]:
for feature in data1_new.drop(columns = 'Annual_Income').columns:
    feature_info_gain = comp_feature_infor_gain_gini(data1_new, 'Annual_Income', feature)

target feature:  Annual_Income
descriptive_feature:  Education
impurity of partitions:  [0.531, 0.375, 0.625]
weights of partitions:  [0.4, 0.2, 0.4]
remaining impurity:  0.537
information_gain:  0.018
target feature:  Annual_Income
descriptive_feature:  Marital_Status
impurity of partitions:  [0.611, 0.42, 0.375]
weights of partitions:  [0.3, 0.5, 0.2]
remaining impurity:  0.468
information_gain:  0.087
target feature:  Annual_Income
descriptive_feature:  Occupation
impurity of partitions:  [0.5, 0.5, 0.278]
weights of partitions:  [0.4, 0.3, 0.3]
remaining impurity:  0.433
information_gain:  0.122
target feature:  Annual_Income
descriptive_feature:  Age≥24
impurity of partitions:  [0.415, 0.0]
weights of partitions:  [0.85, 0.15]
remaining impurity:  0.353
information_gain:  0.202
target feature:  Annual_Income
descriptive_feature:  Age≥27
impurity of partitions:  [0.32, 0.48]
weights of partitions:  [0.75, 0.25]
remaining impurity:  0.36
information_gain:  0.195
target feature:  Ann

Based on the output above, the highest information gain under the Gini index occurs at the age threshold of 24 (`Age≥24`), with a gain of 0.202.
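
To make the `Age≥24` figures concrete, the remaining impurity is the weighted sum of the partition impurities, and the information gain is the drop from the overall Gini impurity of 0.555. Using the partition impurities and weights printed above (already rounded to three decimals):

$$\mbox{rem}(\mbox{Age} \geq 24) = 0.85 \times 0.415 + 0.15 \times 0.0 \approx 0.353$$

$$\mbox{IG}(\mbox{Age} \geq 24) = 0.555 - 0.353 = 0.202$$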

In [12]:
df_splits = pd.DataFrame(columns = ['Split', 'Remainder', 'Information_Gain', 'Is_Optimal'])
df_splits.loc[len(df_splits)] = ['Education', 0.537, 0.018, False]
df_splits.loc[len(df_splits)] = ['Marital_Status', 0.468, 0.087, False]
df_splits.loc[len(df_splits)] = ['Occupation', 0.433, 0.122, False]
df_splits.loc[len(df_splits)] = ['Age_24', 0.353, 0.202, True]
df_splits.loc[len(df_splits)] = ['Age_27', 0.36, 0.195, False]
df_splits.loc[len(df_splits)] = ['Age_38', 0.529, 0.026, False]
df_splits.loc[len(df_splits)] = ['Age_42', 0.473, 0.082, False]
df_splits

Unnamed: 0,Split,Remainder,Information_Gain,Is_Optimal
0,Education,0.537,0.018,False
1,Marital_Status,0.468,0.087,False
2,Occupation,0.433,0.122,False
3,Age_24,0.353,0.202,True
4,Age_27,0.36,0.195,False
5,Age_38,0.529,0.026,False
6,Age_42,0.473,0.082,False
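
The summary table above is typed in by hand from the printed output. A minimal sketch of how such a table could be built directly from the data is shown below, which avoids copying errors. The helper names `gini` and `gini_split_table` are hypothetical and not part of the original notebook. The toy frame contains only the five rows shown by `data1_new.head()`, with the `Age≥24` column as the single age feature, so its numbers will not match the full-dataset table.

```python
import pandas as pd

def gini(series):
    """Unrounded Gini impurity of a categorical series."""
    probs = series.value_counts(normalize=True)
    return 1 - (probs ** 2).sum()

def gini_split_table(df, target):
    """One row per descriptive feature: remainder, gain and an optimal flag."""
    base = gini(df[target])
    rows = []
    for feature in df.columns.drop(target):
        # Weighted impurity of the partitions induced by this feature
        remainder = sum(
            len(part) / len(df) * gini(part[target])
            for _, part in df.groupby(feature)
        )
        rows.append({
            'Split': feature,
            'Remainder': round(remainder, 3),
            'Information_Gain': round(base - remainder, 3),
        })
    table = pd.DataFrame(rows)
    # Flag the split(s) with the largest gain
    table['Is_Optimal'] = (
        table['Information_Gain'] == table['Information_Gain'].max()
    )
    return table

# Five rows taken from the data1_new.head() output (illustration only)
toy = pd.DataFrame({
    'Education': ['bachelors', 'doctorate', 'high school',
                  'bachelors', 'high school'],
    'Marital_Status': ['never married', 'married', 'never married',
                       'married', 'married'],
    'Occupation': ['professional', 'professional', 'agriculture',
                   'professional', 'agriculture'],
    'Age≥24': [True, True, False, True, True],
    'Annual_Income': ['high', 'mid', 'low', 'mid', 'mid'],
})

table = gini_split_table(toy, 'Annual_Income')
print(table)
```

Unlike `compute_gini()`, this sketch rounds only the final values rather than each intermediate impurity, so results on the full dataset could differ from the table above in the last decimal place.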


### Part C

Assuming the `Education` feature is at the root node, the dataset is first split on its values.

In [13]:
data1['Education'].unique()

array(['bachelors', 'doctorate', 'high school'], dtype=object)

The dataset is then split into 3 subsets based on education level.

In [14]:
edu_bachelors = data1[data1['Education'] == 'bachelors']
edu_doctorate = data1[data1['Education'] == 'doctorate']
edu_hs = data1[data1['Education'] == 'high school']

In [15]:
edu_bachelors.head()

Unnamed: 0,ID,Age,Education,Marital_Status,Occupation,Annual_Income,Age≥24,Age≥27,Age≥38,Age≥42
0,1,39,bachelors,never married,professional,high,True,True,True,False
3,4,30,bachelors,married,professional,mid,True,True,False,False
8,9,46,bachelors,divorced,transport,mid,True,True,True,True
12,13,23,bachelors,never married,agriculture,low,False,False,False,False
14,15,35,bachelors,married,agriculture,mid,True,True,False,False


In [16]:
edu_doctorate.head()

Unnamed: 0,ID,Age,Education,Marital_Status,Occupation,Annual_Income,Age≥24,Age≥27,Age≥38,Age≥42
1,2,50,doctorate,married,professional,mid,True,True,True,True
7,8,40,doctorate,married,professional,high,True,True,True,False
11,12,45,doctorate,married,professional,mid,True,True,True,True
16,17,44,doctorate,divorced,transport,mid,True,True,True,True


In [17]:
edu_hs.head()

Unnamed: 0,ID,Age,Education,Marital_Status,Occupation,Annual_Income,Age≥24,Age≥27,Age≥38,Age≥42
2,3,18,high school,never married,agriculture,low,False,False,False,False
4,5,37,high school,married,agriculture,mid,True,True,False,False
5,6,23,high school,never married,agriculture,low,False,False,False,False
6,7,52,high school,divorced,transport,mid,True,True,True,True
9,10,33,high school,married,transport,mid,True,True,False,False


A for-loop is created to calculate the income distribution in each subset.

In [18]:
for df in [edu_bachelors, edu_doctorate, edu_hs]:
    probs = df['Annual_Income'].value_counts(normalize=True)
    print('Income probability')
    print(probs)

Income probability
mid     0.625
high    0.250
low     0.125
Name: Annual_Income, dtype: float64
Income probability
mid     0.75
high    0.25
Name: Annual_Income, dtype: float64
Income probability
mid     0.50
high    0.25
low     0.25
Name: Annual_Income, dtype: float64


According to the output above, each leaf predicts its most probable income level, which gives the following predictions.

In [19]:
df_prediction = pd.DataFrame(columns = ['Leaf_Condition', 'Low_Income_Prob', 'Mid_Income_Prob', 'High_Income_Prob', 'Leaf_Prediction'])
df_prediction.loc[len(df_prediction)] = ['Education==high school', 0.25, 0.50, 0.25, 'mid']
df_prediction.loc[len(df_prediction)] = ['Education==bachelors', 0.125, 0.625, 0.25, 'mid']
df_prediction.loc[len(df_prediction)] = ['Education==doctorate', 0, 0.75, 0.25, 'mid']
df_prediction

Unnamed: 0,Leaf_Condition,Low_Income_Prob,Mid_Income_Prob,High_Income_Prob,Leaf_Prediction
0,Education==high school,0.25,0.5,0.25,mid
1,Education==bachelors,0.125,0.625,0.25,mid
2,Education==doctorate,0.0,0.75,0.25,mid
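
The leaf probabilities and predictions above could also be computed in one step rather than typed in by hand. The sketch below uses `pd.crosstab` with `normalize='index'` to get the income distribution per education level, then takes the most probable level in each row. The toy data is an assumption: it is rebuilt from the reported proportions on the premise that the dataset has 20 rows (8 bachelors, 4 doctorate, 8 high school), not read from `A2_Q1.csv`.

```python
import pandas as pd

# Assumed counts per leaf, reconstructed from the reported proportions
counts = {
    'bachelors':   {'low': 1, 'mid': 5, 'high': 2},
    'doctorate':   {'low': 0, 'mid': 3, 'high': 1},
    'high school': {'low': 2, 'mid': 4, 'high': 2},
}
rows = [
    (education, income)
    for education, by_income in counts.items()
    for income, n in by_income.items()
    for _ in range(n)
]
toy = pd.DataFrame(rows, columns=['Education', 'Annual_Income'])

# Row-normalised contingency table: P(income | education)
probs = pd.crosstab(toy['Education'], toy['Annual_Income'], normalize='index')

# Each leaf predicts its most probable income level
predictions = probs.idxmax(axis=1)

print(probs)
print(predictions)
```

Note that `idxmax` returns the first maximum when two levels tie, so a leaf with tied probabilities would need an explicit tie-breaking rule.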


In [20]:
# Part B answer: the Age≥24 split gives the highest information gain and can therefore be used at the root node.
df_splits

Unnamed: 0,Split,Remainder,Information_Gain,Is_Optimal
0,Education,0.537,0.018,False
1,Marital_Status,0.468,0.087,False
2,Occupation,0.433,0.122,False
3,Age_24,0.353,0.202,True
4,Age_27,0.36,0.195,False
5,Age_38,0.529,0.026,False
6,Age_42,0.473,0.082,False



```{toctree}
:hidden:
:titlesonly:


S3759607_A2_Q2
WebScrapingExample
HealthNeedsAssessment2021
```
