In [121]:
import pickle

import numpy as np
import pandas as pd

from sklearn.ensemble import GradientBoostingClassifier

### Results

The key findings from analysing and evaluating this dataset concern which types of census information best predict yearly income (`wage`). They are listed after the feature importance tables below.

We can look at our best-performing model to see how well our hypothesis about which features were most important holds up. The features we chose to train on were evidently useful, as the model performed relatively well on the test set. We can also look retrospectively at the model's `feature_importances_` attribute to see which features contributed most to what our best model learned:

In [123]:
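# Load the saved model; the context manager makes sure the file is closed.
with open('models/XGBoost-2023-01-20.pkl', 'rb') as model_file:
    XGBoost = pickle.load(model_file)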

with open("data/XGBoost_features.txt") as file:
    feature_cols = [line.rstrip() for line in file]

In [124]:
df = pd.DataFrame(zip(feature_cols, list(XGBoost.feature_importances_))).sort_values(by=1, ascending=False)
df.columns = ['feature', 'feature_importance']
df

| | feature | feature_importance |
|---|---|---|
| 3 | weeks worked in year_bins | 0.312762 |
| 0 | education | 0.306594 |
| 6 | age_bins | 0.074552 |
| 1 | sex | 0.066919 |
| 5 | dividends from stocks_bins | 0.063454 |
| ... | ... | ... |
| 8 | major industry code_ Armed Forces | 0.000022 |
| 21 | major industry code_ Not in universe or children | 0.000022 |
| 53 | detailed household summary in household_ Child... | 0.000000 |
| 55 | detailed household summary in household_ Group... | 0.000000 |


Totalling the one-hot encoded features:

In [125]:
feature_importances = {}

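# `features` is assumed to hold the raw feature names defined earlier in the notebook.
# Note: str.contains is a substring match, so a short name such as 'age'
# may also pick up other columns whose names contain it.
for feature in features:
    feature_importances[feature] = df.loc[df['feature'].str.contains(feature)]['feature_importance'].sum()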

In [126]:
df = pd.DataFrame(feature_importances.items()).sort_values(by=1, ascending=False)
df.columns = ['feature', 'feature_importance']
df

| | feature | feature_importance |
|---|---|---|
| 13 | weeks worked in year | 0.312762 |
| 1 | education | 0.306594 |
| 0 | age | 0.08822 |
| 6 | sex | 0.066919 |
| 8 | dividends from stocks | 0.063454 |
| 7 | capital gains | 0.046435 |
| 5 | major occupation code | 0.045798 |
| 11 | detailed household summary in household | 0.028103 |
| 9 | tax filer stat | 0.023092 |
| 4 | major industry code | 0.013883 |
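The total for `age` (0.08822) is larger than the importance of `age_bins` alone (0.074552), which suggests the substring match is also counting other columns whose names contain "age". A minimal sketch of a stricter alternative is to assign each column to the longest raw feature name that it starts with. The helper name `total_importances` and the sample values are illustrative assumptions, not part of the original notebook:

```python
from collections import defaultdict


def total_importances(column_importances, raw_features):
    """Sum per-column importances onto the raw feature each column starts with.

    column_importances: iterable of (column_name, importance) pairs, e.g.
        zip(feature_cols, model.feature_importances_).
    raw_features: raw feature names before binning / one-hot encoding.
    """
    totals = defaultdict(float)
    # Try longer names first so the most specific prefix wins.
    ordered = sorted(raw_features, key=len, reverse=True)
    for column, importance in column_importances:
        for feature in ordered:
            if column.startswith(feature):
                totals[feature] += importance
                break  # each column is counted at most once
    return dict(totals)


# Illustrative values only; 'wage per hour_bins' is a hypothetical column
# that a substring match on 'age' would wrongly include.
pairs = [
    ("age_bins", 0.07),
    ("education", 0.30),
    ("major industry code_ Armed Forces", 0.01),
    ("major industry code_ Retail trade", 0.02),
    ("wage per hour_bins", 0.05),
]
totals = total_importances(pairs, ["age", "education", "major industry code"])
print(totals)
```

Prefix matching relies on the encoded column names starting with the raw feature name, which holds for the columns shown above but would need checking for the full feature list.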


1. `education` & `age` - We saw that these features correlated highly with `wage`, which agrees with our prediction that they would contribute highly to the model.


2. Financial records - `dividends from stocks_bins` & `capital gains` were highly linked with `wage`. This makes intuitive sense, as people who benefit from dividends and capital gains are likely to be people who already earn a lot.


3. Type of employment - e.g. `major occupation code`, `class of worker`. These had a rather small impact on the model's learning compared with features such as `weeks worked in year` and `education`. This is somewhat surprising, as the type of job a person works and the industry they work in should have a big impact on the money they make; for example, someone running a company will likely earn a lot more than someone who doesn't work.


4. Features such as `sex`, which were not as highly correlated with `wage` in our correlation matrix, seem to be important to our model. This is not entirely unexpected considering the difference in the proportion of men and women earning above 50000.


5. `weeks worked in year` had the largest impact on the model. While this is not entirely unexpected, I did not think it would play such a large role. It makes sense, as full-time workers would be expected to have a lot more earning potential than those working part time.

Overall, it appears the model gave the heaviest weighting to `weeks worked in year`, `education` and `age`. This makes a lot of sense, as these are understandable features that most people would agree have a large impact on earnings. Other features surrounding type of job and financial records also played a role, although not as much as I had expected based on earlier analysis.

While there are probably more features in the dataset that would have been useful, I think choosing a few simple, interpretable features helped build a simple but effective model.

#### Recommendations & future improvements

A few thoughts and possible improvements for future models:


1. A few of the features I used were still somewhat correlated with one another. Applying PCA would give components that are uncorrelated with one another (though not necessarily fully independent). A drawback is that the features would be less interpretable, in the sense that we wouldn't know what effect the raw features have on the model. It might also be hard to apply to categorical data.


2. Explore a wider range of features. It would have been nice to train more models with more types of features, to better understand how good or poor each is as an indicator of one's `wage`.


3. I may have been able to condense the features even further, as some didn't contribute to the model's learning at all. This suggests that further binning or condensing of certain labels, or total removal of some features, may have helped.


4. Better hyperparameter tuning - I would have liked to explore the parameter space more thoroughly and to tune more parameters than I did, and also to run a randomized or grid search on the full dataset instead of a subset.


5. Have a better understanding of some of the features to train on. For example, I kept `detailed industry recode` as part of my features due to its correlation score, but it served no purpose for the model, as on further inspection it contains similar information to `major occupation code`. The same may be said for `detailed household and family stat` and `detailed household summary in household`.
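As a rough illustration of the randomized search mentioned in point 4, the sketch below runs scikit-learn's `RandomizedSearchCV` over a `GradientBoostingClassifier`. The synthetic data from `make_classification`, the parameter ranges, the `roc_auc` scoring and the small `n_iter` are all assumptions chosen to keep the example quick; they are not the settings used for the model in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the census data (assumption: binary target).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Illustrative search space; real ranges would depend on the dataset.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
    "subsample": [0.6, 0.8, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled parameter combinations
    cv=3,              # 3-fold cross-validation
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

On the full dataset, a larger `n_iter` and a wider search space would explore more of the parameter space at the cost of longer run time.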