# Income Prediction in APL using APLearn

In [1]:
]Import # ../../APLSource

In this notebook, we'll build machine learning (ML) models using APLearn to predict the annual income of individuals based on [census data gathered in 1994](https://www.cs.toronto.edu/~delve/data/adult/adultDetail.html). The features, such as age, occupation, and marital status, are factors that should intuitively impact one's salary, and the target is a binary variable denoting whether an individual makes over $50K/yr or not. The prerequisites for this tutorial are intermediate proficiency in APL and basic knowledge of ML. You're strongly encouraged to read the [reference guide](https://github.com/BobMcDear/aplearn?tab=readme-ov-file#available-methods) while walking through this notebook to gain a firmer grasp on each function and its arguments.

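We start by loading the dataset, stored as a CSV file called `adult.csv`. Dyalog APL provides a function, [`⎕CSV`](https://help.dyalog.com/19.0/Content/Language/System%20Functions/csv.htm), that allows for importing and exporting CSV data according to the user's needs. In the call below, `⍬` leaves the file encoding unspecified, `4` converts fields to numbers where possible and keeps the rest as text, and `1` indicates that the first row is a header, which is returned separately from the data. For more information concerning each argument, please refer to the official documentation.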

In [2]:
(data header)←⎕CSV 'adult.csv' ⍬ 4 1

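In our case, the target column comes last, but for simplicity's sake, it's best if we move it to the front. We do so by extracting its index among the columns (`header⍳⊂'income'`) and rotating (dyadic `⌽`) both the data and the header by that value. With index origin 0, the index of the last column is one less than the number of columns, so rotating left by it wraps the last column around to the front. We map this rotation over the data and the header using `¨` to avoid duplicate code.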

In [3]:
data header←(header⍳⊂'income')⌽¨data header
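
To see why the rotation lands the target in front, here is a minimal sketch on a made-up three-column table. The column names and values are invented for illustration, and the sketch assumes index origin 0, which the notebook's own index arithmetic implies.

```apl
⎕IO←0                                   ⍝ assumption: index origin 0, as in the notebook
header←'age' 'education' 'income'       ⍝ toy header, not the real column list
data←2 3⍴39 'Bachelors' '<=50K' 50 'Masters' '>50K'
i←header⍳⊂'income'                      ⍝ 2: index of the last column
data header←i⌽¨data header              ⍝ rotate columns and names left by i
header                                  ⍝ 'income' now comes first
```

Rotating a three-element vector left by 2 is the same as rotating it right by 1, which is why the last column ends up first while the others keep their relative order.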

Some ML methods, including the ones we'll see shortly, can't natively handle categorical variables (e.g., marital status, with possible values including married, single, widowed, and so on). This calls for [one-hot encoding](https://en.wikipedia.org/wiki/Categorical_variable), a process where every category in such columns is represented as a binary vector with a 1 in the position of that category and 0s elsewhere. For example, if a categorical variable takes on the values `A`, `B` and `C`, it would be one-hot encoded as `1 0 0`, `0 1 0`, `0 0 1` for each of those categories, respectively. One-hot encoding is available in APLearn as `PREPROC.ONE_HOT`. Its fit method, `PREPROC.ONE_HOT.fit`, takes the indices of the categorical columns and returns a _state_ (`st` for short) describing them, and its transformation method, `PREPROC.ONE_HOT.trans`, performs the actual encoding (i.e., replacing the categorical values with one-hot vectors) on the data using that state.

In [4]:
cat_ind←header⍳'workclass' 'education' 'marital-status' 'occupation' 'relationship' 'race' 'gender' 'native-country'
one_hot_st←data PREPROC.ONE_HOT.fit cat_ind
data←data PREPROC.ONE_HOT.trans one_hot_st
⍝ The last two lines may be alternatively chained:
⍝ data←data PREPROC.ONE_HOT.trans data PREPROC.ONE_HOT.fit cat_ind
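
For intuition only, the encoding itself can be sketched in plain APL with an outer product. This is not a description of how `PREPROC.ONE_HOT` is implemented, and `col`, `cats` and `onehot` are made-up names.

```apl
col←'ABCA'          ⍝ toy categorical column, one character per sample
cats←∪col           ⍝ distinct categories in order of first appearance: 'ABC'
onehot←col∘.=cats   ⍝ row i has a 1 in the column of sample i's category
onehot
⍝ 1 0 0
⍝ 0 1 0
⍝ 0 0 1
⍝ 1 0 0
```

For categories that are strings rather than single characters, `∘.≡` over a vector of character vectors plays the same role.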

Although the target variable is binary, it is represented as a string with values `<=50K` and `>50K`. ML models can't handle non-numerical inputs, so these two values need to be converted into integers; that is, `<=50K` becomes 0 and `>50K` becomes 1. `PREPROC.ORD` does just that, and its usage is very similar to that of `PREPROC.ONE_HOT`. One important note is that this transformation, like one-hot encoding, moves the encoded column(s) to last place.

In [5]:
ord_st←data PREPROC.ORD.fit 0
data←data PREPROC.ORD.trans ord_st
⍝ These two lines may be alternatively chained:
⍝ data←data PREPROC.ORD.trans data PREPROC.ORD.fit 0
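
The idea behind ordinal encoding can be sketched with dyadic `⍳`, which looks up each label's position among the distinct categories. This is a toy illustration under the assumption of index origin 0. How `PREPROC.ORD` decides which label receives which integer is up to its implementation, and here the order simply follows first appearance.

```apl
⎕IO←0                                    ⍝ assumption: index origin 0
labels←'<=50K' '>50K' '>50K' '<=50K'     ⍝ toy target column
cats←∪labels                             ⍝ '<=50K' '>50K' (order of first appearance)
codes←cats⍳labels                        ⍝ position of each label among the categories
codes                                    ⍝ 0 1 1 0
```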

When ML models are deployed in practice, they will almost certainly encounter data they weren't exposed to during training. To mimic this situation, we set aside a small part of the data (usually 20%), called the validation set, and train only on the remainder. During evaluation, we assess performance on the validation set as an approximation of how well the model will perform on unseen data. We split the data using `MISC.SPLIT.train_val`, whose right argument is the proportion of the dataset to include in the validation set.

In [6]:
train val←data MISC.SPLIT.train_val 0.2
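
As a rough sketch of what such a split involves, the rows can be shuffled with a random permutation and then divided by position. Whether `MISC.SPLIT.train_val` shuffles, and how it rounds the validation size, is not stated here, so both are assumptions of this toy version, as are all the names below.

```apl
⎕IO←0                      ⍝ assumption: index origin 0
toy←10 3⍴⍳30               ⍝ 10 made-up samples with 3 columns
n←≢toy
perm←n?n                   ⍝ random permutation of the row indices
n_val←⌊0.2×n               ⍝ 2 rows go to the validation set
val_toy←toy[n_val↑perm;]   ⍝ first n_val shuffled rows
train_toy←toy[n_val↓perm;] ⍝ the remaining rows
⍴¨train_toy val_toy        ⍝ shapes 8 3 and 2 3
```

Every row lands in exactly one of the two sets, so the model is never validated on a sample it was trained on.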

Next, we separate the independent variables from the target using `MISC.SPLIT.xy`, where the index of the target is provided as the right argument and the data as the left. After we encoded the target column as integers, it moved to last place, so its index is `¯1+≢⍉data`. Because the target is in the same position in both the training set and the validation set, we once again resort to `¨`, this time in conjunction with commute (`⍨`), to carry out the separation over both sets in one line.

In [7]:
(X_t y_t) (X_v y_v)←(¯1+≢⍉data) MISC.SPLIT.xy⍨¨train val
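
The `⍨¨` pattern may be easier to follow on toy data. In the sketch below, `xy` is a hypothetical stand-in written for this example (it is not `MISC.SPLIT.xy`), taking a table on the left and a column index on the right. Index origin 0 is assumed.

```apl
⎕IO←0                                  ⍝ assumption: index origin 0
a←3 4⍴⍳12                              ⍝ toy "training" table, target in the last column
b←2 4⍴100+⍳8                           ⍝ toy "validation" table
i←¯1+≢⍉a                               ⍝ 3: number of columns minus one
xy←{(⍺[;(⍳≢⍉⍺)~⍵]) (⍺[;⍵])}            ⍝ hypothetical: table xy index → features, target
(X_a y_a) (X_b y_b)←i xy⍨¨a b          ⍝ ⍨ swaps the arguments, ¨ maps over both tables
y_a                                    ⍝ 3 7 11
```

Because `i` is a scalar, `¨` pairs it with each table in turn, and `⍨` then moves the table to the left of `xy`, where that function expects it.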

There's one more step in our preprocessing workflow: Each column is currently on a different scale. For example, age can range from 17 to 90, educational number from 1 to 16, and so forth. However, some models train more effectively when the data is [normalized](https://en.wikipedia.org/wiki/Normalization_(statistics)), with a mean of 0 and a standard deviation of 1. This is done by subtracting each column's mean from it and then dividing by its standard deviation. The fit method of APLearn's normalization module, `PREPROC.NORM`, returns these means and standard deviations, which are then handed to the transformation method to actually normalize the data. Note that the statistics are computed on the training set alone and applied to both sets, so no information about the validation set leaks into training. We reuse the `⍨¨` pattern for brevity.

In [8]:
norm_st←X_t PREPROC.NORM.fit ⍬
X_t X_v←norm_st PREPROC.NORM.trans⍨¨X_t X_v
⍝ These two lines may be alternatively chained:
⍝ X_t X_v←(X_t PREPROC.NORM.fit ⍬)∘(PREPROC.NORM.trans⍨)¨X_t X_v
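
The arithmetic behind normalization can be sketched in a few lines of plain APL. This toy version uses the population form of the standard deviation (dividing by the number of rows). Which form `PREPROC.NORM` uses is not stated here, and all names below are made up.

```apl
X_toy←3 2⍴1 10 2 20 3 30                      ⍝ two columns on different scales
mean←(+⌿X_toy)÷≢X_toy                         ⍝ column means: 2 20
std←((+⌿(X_toy-⍤1⊢mean)*2)÷≢X_toy)*0.5        ⍝ column standard deviations (population form)
Z←(X_toy-⍤1⊢mean)÷⍤1⊢std                      ⍝ subtract the mean, then divide by the deviation
(+⌿Z)÷≢Z                                      ⍝ column means of Z: 0 0
```

The rank operator (`⍤1`) pairs each row of the matrix with the vector of per-column statistics, so both columns end up on the same scale despite starting a factor of ten apart.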

It's now time to train and evaluate some models. APLearn offers a unified interface to facilitate switching between various models. Models are trained using `X y MODEL.fit args`, which returns the learned state (i.e., parameters), and we make predictions via `X MODEL.pred st`. Often, we don't care about the state itself and would like to move on to predictions immediately following training, in which case it's shorter to chain these two operations together to get `X MODEL.pred X y MODEL.fit args`. 
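
To make the calling convention concrete, here is a hypothetical toy model, `MAJ`, that follows the same `fit`/`pred` shape. It is not part of APLearn: it merely memorizes the most frequent class and predicts it for every sample, and its data is invented.

```apl
MAJ←⎕NS ''                                  ⍝ hypothetical toy model, not part of APLearn
MAJ.fit←{X y←⍺ ⋄ u←∪y ⋄ ⊃u[⍒+/u∘.=y]}       ⍝ state: the most frequent class (⍵ is unused)
MAJ.pred←{(≢⍺)⍴⍵}                           ⍝ predict that class for every row of ⍺
X_tr←4 2⍴⍳8 ⋄ y_tr←0 0 1 0                  ⍝ made-up training data
X_te←2 2⍴⍳4                                 ⍝ made-up validation features
st←X_tr y_tr MAJ.fit ⍬                      ⍝ two-step form: keep the state
X_te MAJ.pred st                            ⍝ 0 0
X_te MAJ.pred X_tr y_tr MAJ.fit ⍬           ⍝ chained form, same result
```

The chained form works because APL evaluates right to left: the fit runs first, and its result becomes the right argument of `pred`.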

Classification models can be grouped into two types depending on their outputs:

* Soft outputs: These models return probabilities. For example, a [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) model might predict the probability of an individual earning at most $50K/yr to be 0.78. Conversely, the probability of their salary exceeding that would be 0.22. In APLearn, probabilities are returned as vectors whose _i_<sup>th</sup> element represents the probability associated with the _i_<sup>th</sup> target class.
* Hard outputs: These models return discrete class labels directly. For example, a [support vector machine classifier](https://en.wikipedia.org/wiki/Support_vector_machine) (SVC) would simply return 1 or 0 based on its estimate of the individual's income.
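
The difference between the two kinds of output can be illustrated with the logistic function on made-up scores. This is only a sketch of the general idea. How `SUP.LOG_REG` and `SUP.LIN_SVC` compute their outputs internally is not covered here, and `z`, `p`, `soft` and `hard` are hypothetical names.

```apl
z←0.5 ¯2                 ⍝ hypothetical linear scores for two individuals
p←÷1+*-z                 ⍝ logistic function: probability of class 1
soft←(1-p),⍪p            ⍝ one row per sample: P(class 0), P(class 1)
hard←p>0.5               ⍝ thresholding yields hard labels: 1 0
+/soft                   ⍝ each row of probabilities sums to 1
```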

As our evaluation metric, we'll be using [accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision#In_binary_classification) (`MISC.METRICS.acc`), which requires hard classifications and not probabilities. Therefore, the probabilities output by models of the first kind need to be converted into discrete classifications before calculating accuracy. The entry with the highest probability is selected as the prediction, and since classes are identified by indices, we must find the index of this maximum value. This procedure of deriving the _index_ of the maximum value in a vector (as opposed to the value itself) is called [arg max](https://en.wikipedia.org/wiki/Arg_max) and can be formulated in APL as `0⌷⍉⍒⍤1`. Here, for each sample (`⍤1`), we obtain the indices that would sort the probabilities in descending order (`⍒`) and pick the first element of every row (`0⌷⍉`, with index origin 0).
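
Here is the arg max expression applied step by step to a small matrix of made-up probabilities, assuming index origin 0 so that the classes are numbered 0 and 1.

```apl
⎕IO←0                                   ⍝ assumption: index origin 0
probs←3 2⍴0.78 0.22 0.4 0.6 0.1 0.9     ⍝ made-up probabilities, one row per sample
⍒⍤1⊢probs                               ⍝ per-row indices, largest probability first
⍝ 0 1
⍝ 1 0
⍝ 1 0
0⌷⍉⍒⍤1⊢probs                            ⍝ first column of that matrix: 0 1 1
```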

Putting all of this together, we see that model evaluation involves three steps:

1. Train on the training set and make predictions on the validation set: `y_hat←X_v MODEL.pred X_t y_t MODEL.fit args`.
2. Perform arg max if applicable: `y_hat←0⌷⍉⍒⍤1⊢y_hat`.
3. Compute accuracy: `acc←y_v MISC.METRICS.acc y_hat`.
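
For the last step, accuracy is simply the fraction of predictions that match the true labels. The dfn below, `toy_acc`, is a hypothetical stand-in for illustration. Whether `MISC.METRICS.acc` reports a fraction or a percentage is not stated here.

```apl
y_true←0 1 1 0 1                  ⍝ made-up validation labels
y_pred←0 1 0 0 1                  ⍝ made-up hard predictions
toy_acc←{(+/⍺=⍵)÷≢⍺}              ⍝ hypothetical stand-in: share of matching labels
y_true toy_acc y_pred             ⍝ 0.8 (4 of 5 correct)
```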

Let's test this out by training a logistic regression model, which outputs probabilities. Logistic regression's only hyperparameter is the regularization term.

In [9]:
log_reg_y_hat←X_v SUP.LOG_REG.pred X_t y_t SUP.LOG_REG.fit 0.01
log_reg_y_hat←0⌷⍉⍒⍤1⊢log_reg_y_hat
⎕←y_v MISC.METRICS.acc log_reg_y_hat

By dropping the second line, which performs arg max, we can adapt this for SVC, whose predictions are already hard labels. SVC has an extra hyperparameter, the learning rate, in addition to the regularization term.

In [10]:
lin_svc_y_hat←X_v SUP.LIN_SVC.pred X_t y_t SUP.LIN_SVC.fit 0.01 0.01
⎕←y_v MISC.METRICS.acc lin_svc_y_hat

The scores achieved by these models more or less match the [Python baseline](https://github.com/BobMcDear/aplearn/blob/main/examples/adults/python.ipynb). Small discrepancies or variations from run to run are to be expected. As an exercise, you can try improving on the top accuracy of ~85% by experimenting with other models, fine-tuning hyperparameters, or tweaking the preprocessing pipeline.