# Discussion of Results

In this section, we investigated how to deal with missing and imbalanced data. We first carried out this investigation using single trees, and used the insights gained to guide our use of ensembles, namely random forests and gradient boosted trees. Below, we discuss the results, highlighting the challenges we faced and possible extensions.

## Imbalance
The imbalance in the data was challenging to deal with. Our models had very low precision, meaning that many of the individuals they predicted to be high-income were in fact low-income. In particular, many low-income individuals tended to be predicted as high-income because they were male. Using SMOTE did not help much: it gave only a marginal increase in the area under the ROC curve. In fact, the original SMOTE paper [1] suggests combining SMOTE with undersampling of the majority class to obtain better results, since SMOTE only generates synthetic samples for the minority class. Given the poor precision, this is something we could explore.
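
The combination of SMOTE with majority-class undersampling suggested in [1] can be sketched as follows. This is a minimal standard-library illustration, not the code used in our experiments: the function names (`smote_oversample`, `undersample`, `rebalance`) and the parameters `n_new`, `n_keep` and `k` are hypothetical, samples are assumed to be tuples of numeric features, and the minority class is assumed to contain at least two points.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Candidate neighbours: every other minority point, nearest first.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def undersample(majority, n_keep, rng=None):
    """Randomly keep n_keep majority samples, without replacement."""
    rng = rng or random.Random(0)
    return rng.sample(majority, n_keep)


def rebalance(majority, minority, n_new, n_keep, k=5, seed=0):
    """Oversample the minority class and undersample the majority class."""
    rng = random.Random(seed)
    new_minority = minority + smote_oversample(minority, n_new, k, rng)
    new_majority = undersample(majority, n_keep, rng)
    return new_majority, new_minority
```

For example, `rebalance(majority, minority, n_new=20, n_keep=45)` applied to 90 majority and 10 minority points would return 45 majority and 30 minority points. Resampling of this kind should only be applied to the training data, so that the test set keeps the original class balance.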

## Missingness
The results of the investigation using single trees surprised us: the different methods of dealing with missingness led to very similar test performance. Due to limited computational power, we could not use multiple imputation; this is one way in which our analysis could be extended.
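
To illustrate what the multiple imputation extension would involve, the sketch below creates several completed copies of a variable, computes an estimate on each, and pools the estimates using Rubin's rules. It makes strong simplifying assumptions: missing values are marked as `None`, each one is filled by a random draw from the observed values (a simple hot-deck scheme rather than a model-based imputer), and the quantity being estimated is just a mean. In our setting, one would instead fit a tree to each completed dataset and combine the resulting predictions. The function names are hypothetical.

```python
import random
import statistics


def impute_once(values, rng):
    """Fill each missing entry (None) with a random draw from the observed values."""
    observed = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(observed) for v in values]


def pooled_mean(values, m=20, seed=0):
    """Estimate the mean on m imputed copies and pool with Rubin's rules."""
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(m):
        completed = impute_once(values, rng)
        estimates.append(statistics.fmean(completed))
        # Squared standard error of the mean within this completed copy.
        variances.append(statistics.variance(completed) / len(completed))
    q_bar = statistics.fmean(estimates)       # pooled point estimate
    within = statistics.fmean(variances)      # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance
    total = within + (1 + 1 / m) * between    # total variance of q_bar
    return q_bar, total
```

The computational cost we were concerned about is visible here: every quantity has to be computed `m` times, which for tuned tree models means `m` full fits.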

## Tuning
Tuning the random forest and gradient boosted trees was challenging. There was a large number of hyperparameters, and therefore a large space to optimise over. To overcome this, we used randomised search over a grid with cross-validation. This could be extended by using an exhaustive search instead and defining a finer grid. It may also be of interest to look at more sophisticated optimisation methods, such as tree-structured Parzen estimators (see [2]), which use a Bayesian approach and a cheaper-to-evaluate surrogate of the loss function. Such methods have good empirical results in the literature [2] and would be a natural extension to our project.
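
The randomised search with cross-validation described above can be sketched as follows. This is a simplified standard-library illustration rather than our actual tuning code: `randomised_search`, `cv_score` and `k_fold_indices` are hypothetical names, and a small one-dimensional nearest-neighbour classifier (`fit_knn`, `accuracy`) stands in for the random forest and gradient boosted trees so that the example is self-contained.

```python
import random
import statistics


def k_fold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_score(params, X, y, fit, score, folds):
    """Mean validation score of one hyperparameter setting across the folds."""
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train], **params)
        scores.append(
            score(model, [X[i] for i in held_out], [y[i] for i in held_out])
        )
    return statistics.fmean(scores)


def randomised_search(space, X, y, fit, score, n_iter=10, k=5, seed=0):
    """Evaluate n_iter random settings drawn from `space`; return the best."""
    rng = random.Random(seed)
    folds = k_fold_indices(len(X), k, rng)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(options) for name, options in space.items()}
        s = cv_score(params, X, y, fit, score, folds)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score


# Stand-in model (an assumption for illustration): 1-D k-nearest neighbours.
def fit_knn(X, y, n_neighbours):
    return X, y, n_neighbours  # k-NN simply stores the training data


def accuracy(model, X, y):
    train_X, train_y, n = model
    correct = 0
    for x, label in zip(X, y):
        nearest = sorted(
            range(len(train_X)), key=lambda i: abs(train_X[i] - x)
        )[:n]
        votes = sum(train_y[i] for i in nearest)  # labels are 0 or 1
        correct += int((2 * votes > len(nearest)) == bool(label))
    return correct / len(X)
```

A call such as `randomised_search({"n_neighbours": [1, 3, 5, 7]}, X, y, fit_knn, accuracy)` returns the best setting found and its mean cross-validated accuracy. An exhaustive search would replace the random draws with a loop over every combination in `space` (for instance via `itertools.product`), at a cost that grows multiplicatively with the number of options per hyperparameter.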

## Choice of Test Set
We chose the test set such that it contained only complete data. This may have biased our results, though it is difficult to tell. It may be worth repeating the analysis using a fully random split stratified by sex, without requiring the test set to be complete. It may also be worth checking other choices of split and averaging the results to get a more robust view of performance, especially if the models are to be deployed for downstream tasks.
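
A sketch of the proposed check, a random split stratified by sex and repeated over several seeds with the results averaged, is given below. The names `stratified_split` and `repeated_evaluation` are hypothetical, rows are assumed to be dictionaries with a `"sex"` key, and `evaluate` is a placeholder for a function that fits a model on the training rows and returns a metric on the test rows.

```python
import random
import statistics
from collections import defaultdict


def stratified_split(rows, stratum_of, test_fraction=0.2, seed=0):
    """Random train/test split that keeps each stratum's share roughly equal."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[stratum_of(row)].append(row)
    train, test = [], []
    for members in by_stratum.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def repeated_evaluation(rows, stratum_of, evaluate, n_repeats=5, test_fraction=0.2):
    """Mean and standard deviation of a metric over differently seeded splits."""
    scores = [
        evaluate(*stratified_split(rows, stratum_of, test_fraction, seed))
        for seed in range(n_repeats)
    ]
    return statistics.fmean(scores), statistics.stdev(scores)
```

Here the stratum function would be `lambda row: row["sex"]`. Rows with missing values are not filtered out, so the test set is no longer guaranteed to be complete, and the standard deviation across repeats gives a rough indication of how sensitive the reported performance is to the choice of split.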

## Conclusion
We saw that the area under the ROC curve was similar across the different methods, but the random forests were more successful at classifying female individuals. For this reason, we chose the random forest as our tree-based classifier for this problem.

# References

[1] Chawla, Nitesh V., et al. "SMOTE: Synthetic Minority Over-sampling Technique." Journal of Artificial Intelligence Research 16 (2002): 321-357.

[2] Ozaki, Yoshihiko, et al. "Multiobjective Tree-Structured Parzen Estimator for Computationally Expensive Optimization Problems." Proceedings of the 2020 Genetic and Evolutionary Computation Conference. 2020.