# 1. Discussion of Results

We investigated how to deal with missing data and imbalanced data. We first performed this investigation using single trees, and then used the insights gleaned to guide our use of ensembles, namely random forests and gradient boosted trees. Below, we discuss the results, highlighting the challenges we faced and possible extensions.

## 1.1. Imbalance
The class imbalance in the data was challenging to deal with. Our models had very low precision, meaning that many of the individuals they predicted to be high-income were in fact low-income. In particular, many low-income individuals tended to be predicted as high-income because they were male. SMOTE did not help much with this: it gave only a marginal increase in the area under the ROC curve. The original paper on SMOTE (see [1]) suggests combining SMOTE with undersampling of the majority class to obtain better results, since SMOTE generates synthetic samples for the minority class only. Given the poor precision, this is something we could explore.
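To make the suggested combination concrete, the following is a minimal standard-library sketch of SMOTE-style oversampling followed by random undersampling of the majority class. It is not the code used in our analysis: the function names (`smote`, `undersample`, `rebalance`), the toy points, and the parameter values are all assumptions made for illustration.

```python
import random

def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic minority points by interpolating between a
    randomly chosen minority point and one of its k nearest minority
    neighbours. Assumes at least two minority points."""
    if rng is None:
        rng = random.Random(0)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [p for p in minority if p is not x]
        # Rank the other minority points by squared Euclidean distance to x.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbour)))
    return synthetic

def undersample(majority, n_keep, rng=None):
    """Keep a random subset of n_keep majority points."""
    if rng is None:
        rng = random.Random(0)
    return rng.sample(majority, n_keep)

def rebalance(majority, minority, n_new, n_keep, k=5, seed=0):
    """Oversample the minority class, then undersample the majority class."""
    rng = random.Random(seed)
    new_minority = list(minority) + smote(minority, n_new, k, rng)
    new_majority = undersample(majority, n_keep, rng)
    return new_majority, new_minority

# Toy data (hypothetical): 20 majority points and 4 minority points in 2D.
majority = [(float(i), float(i % 5)) for i in range(20)]
minority = [(0.0, 10.0), (1.0, 11.0), (2.0, 10.5), (1.5, 12.0)]
balanced_majority, balanced_minority = rebalance(
    majority, minority, n_new=6, n_keep=10, k=2
)
```

In this toy example both classes end up with 10 points. In practice the resampling should be applied to the training folds only, so that the test set keeps the original class balance.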

## 1.2. Missingness
The results of the investigation using single trees surprised us: the different methods of dealing with missingness led to very similar test performance. We used mode imputation for the missing occupation values, but this can be problematic. We would recommend recoding occupation as a numerical ordered factor and performing imputation on that scale. This would be slightly more robust, as we can "get in the right ballpark" instead of having a chance of being totally off. For further robustness, it would also be useful to use multiple imputation. Due to limitations in time and computational power, we could neither recode the occupations as an ordered factor nor use multiple imputation. These therefore remain ways in which the models can be extended.
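The difference between the two imputation strategies can be sketched as follows. This is an illustration only: the occupation labels and their ordering in `OCCUPATION_ORDER` are hypothetical, and imputing the median of the ordered codes is just one simple way of imputing on an ordered scale, not necessarily the one we would settle on.

```python
from collections import Counter
from statistics import median_low

# Hypothetical occupation categories in an assumed order (illustration only).
OCCUPATION_ORDER = ["manual", "clerical", "technical", "professional", "managerial"]
CODE = {name: i for i, name in enumerate(OCCUPATION_ORDER)}

def impute_mode(values):
    """Replace missing entries (None) with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def impute_ordered_median(values):
    """Encode categories as ordered integer codes and replace missing entries
    with the category at the (low) median of the observed codes."""
    codes = [CODE[v] for v in values if v is not None]
    fill = OCCUPATION_ORDER[median_low(codes)]
    return [fill if v is None else v for v in values]

# Toy column with one missing value at index 2.
occupations = ["manual", "manual", None, "technical",
               "professional", "managerial", "clerical"]

by_mode = impute_mode(occupations)
by_median = impute_ordered_median(occupations)
```

Here mode imputation fills the gap with the most common category, which sits at one end of the assumed scale, whereas the ordered approach fills it with a category nearer the middle of the observed values. A model-based imputation that predicts the ordered code from the other features would be a natural refinement.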

Alternatively, we can use a Bayesian model, which provides a way to deal with missing data in a probabilistic framework (see [3] for details). Markov chain Monte Carlo methods can be used to fit the model, using software such as JAGS [4] or Stan [5]. If this turns out to be complicated or time-consuming, integrated nested Laplace approximations (INLA) provide a way to do fast approximate Bayesian inference. INLA has an implementation in R [6], which has natural built-in ways to deal with missingness.

## 1.3. Tuning
Tuning the random forest and gradient boosted trees was challenging. There were many hyperparameters, and therefore a large space to optimise over. To overcome this challenge, we used randomised search over a grid with cross-validation. This can be extended by using an exhaustive search instead and defining a finer grid. It may also be of interest to look at more sophisticated optimisation methods, such as tree-structured Parzen estimators (see [2]), which use a Bayesian approach and a cheaper-to-evaluate surrogate for the loss function to perform the optimisation. Such methods have good empirical results in the literature [2], and would be a natural extension to our project.
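The idea behind randomised search with cross-validation can be sketched in plain Python as below. This is a simplified illustration rather than our actual tuning code: `k_fold_indices`, `randomised_search`, the hyperparameter grid, and in particular `toy_score` (a stand-in for fitting a model on the training fold and scoring it on the validation fold) are all assumptions.

```python
import random

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

def randomised_search(space, fit_and_score, n, n_iter=10, k=5, seed=0):
    """Sample n_iter random grid points and keep the one with the best
    mean cross-validated score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        scores = [fit_and_score(params, tr, va) for tr, va in k_fold_indices(n, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

def toy_score(params, train_idx, val_idx):
    # Hypothetical stand-in for "fit on train_idx, score on val_idx".
    # It ignores the indices and simply peaks at max_depth=8, n_estimators=200.
    return -abs(params["max_depth"] - 8) - abs(params["n_estimators"] - 200) / 100

space = {"max_depth": [2, 4, 8, 16], "n_estimators": [50, 100, 200, 400]}
best_params, best_score = randomised_search(space, toy_score, n=100, n_iter=20)
```

An exhaustive search would replace the random draws with a loop over every combination in `space`, which is why its cost grows quickly as the grid is refined.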

## 1.4. Choice of Test Set
We chose the test set such that it contained only complete data. This may have biased our results, though it is difficult to tell. It may be worth repeating the analysis using a fully random split stratified by sex, without ensuring that the test set has complete data. It may also be worth checking other choices of split and averaging the results to get a more robust view of the performance, especially if the models are to be deployed for downstream tasks.
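A random split stratified by sex, repeated over several seeds, could look like the sketch below. The record layout (dictionaries with a `"sex"` key), the function name `stratified_split`, and the toy data are assumptions for illustration. Rows with missing values are deliberately not filtered out of the test set.

```python
import random
from collections import defaultdict

def stratified_split(records, key, test_fraction=0.2, seed=0):
    """Randomly split records into train and test sets so that each value of
    `key` appears in the test set in roughly the same proportion as overall."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    train, test = [], []
    for members in groups.values():
        members = members[:]  # copy so the caller's data is not shuffled
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Toy data (hypothetical): 60 male and 40 female records.
records = [{"id": i, "sex": "Male" if i < 60 else "Female"} for i in range(100)]

train, test = stratified_split(records, key="sex", test_fraction=0.2, seed=0)

# Repeating the split with different seeds gives several train/test pairs
# whose evaluation metrics can then be averaged.
splits = [stratified_split(records, key="sex", seed=s) for s in range(5)]
```

Each split keeps the 60/40 balance between the sexes in both parts, so differences in performance across seeds reflect the random choice of rows rather than a shift in the sex distribution.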

# 2. Choice of Model

Recalling that our question is to understand how the proportion of high-income individuals varies by sex, we choose the random forest model. This is because it predicted the proportions for females very accurately, as shown by the last bar plot in the Random Forest section. The area under the ROC curve was 0.89. As mentioned above, this leaves room for improvement; in particular, the precision is low. The extensions mentioned above can help us obtain a better-performing model.

An advantage of random forests is that they can model arbitrary decision boundaries. They make no assumptions about the distributions of the features or target, and are therefore not affected by those distributions. Random forests use bagging in combination with a random choice of candidate features at each split. The majority vote of the resulting trees has lower variance than a single tree [7].
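The variance reduction can be made explicit in the simpler setting of averaging rather than voting. Assuming $B$ identically distributed tree predictions $T_1, \dots, T_B$, each with variance $\sigma^2$ and positive pairwise correlation $\rho$ (the symbols here are introduced for illustration), the variance of their average is

$$
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 .
$$

The second term shrinks as $B$ grows, and the random choice of features at each split lowers $\rho$, which reduces the first term. The majority vote used for classification behaves analogously, although this expression is stated for averages.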

# 3. Conclusion
In conclusion, while the random forest model shows promise in analysing income variation by sex, improving precision and implementing the proposed extensions will be crucial for achieving better predictive accuracy. This will ultimately help us gain deeper insights into income disparities across different demographic groups.

# 4. References

[1] Chawla, Nitesh V., et al. "SMOTE: synthetic minority over-sampling technique." Journal of Artificial Intelligence Research 16 (2002): 321-357.

[2] Ozaki, Yoshihiko, et al. "Multiobjective tree-structured Parzen estimator for computationally expensive optimization problems." Proceedings of the 2020 Genetic and Evolutionary Computation Conference. 2020.

[3] Ma, Zhihua, and Guanghui Chen. "Bayesian methods for dealing with missing data problems." Journal of the Korean Statistical Society 47 (2018): 297-313.

[4] Plummer, Martyn. "JAGS Version 3.3.0 user manual." (2012): 40.

[5] Carpenter, Bob, et al. "Stan: A probabilistic programming language." Journal of Statistical Software 76 (2017): 1-32.

[6] Rue, Håvard, et al. "Bayesian computing with INLA: a review." Annual Review of Statistics and Its Application 4.1 (2017): 395-421.

[7] Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. "The elements of statistical learning: data mining, inference, and prediction." (2017).