### Title: Pseudocode for Model Analysis

### Authors: Tahmeed Shafiq (tahmeed@lighthousereports.com)

### Last updated: 09/10/24

Pseudocode and planning for how to white box the model.

**Setup**:<br> 
Import modules as needed.<br>
Set `warnings.filterwarnings("ignore", category=FutureWarning)` to suppress dependency warnings.

**Unpickle**:<br>
Unpickle models before and after reweighing.<br>
Extract all keys and parameters in case we need them later.
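
A minimal sketch of this step, assuming each pickle file holds a dict-like bundle. The stand-in bundle, the `threshold` key and the file name `model_demo.pkl` are hypothetical and only there so the snippet runs on its own. The real files will also need the libraries the model was built with to be importable when unpickling.

```python
import pickle
import tempfile
from pathlib import Path


def load_bundle(path):
    """Unpickle one model file. Only do this with files from a trusted source."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def summarize_keys(bundle):
    """Map each top-level key to the type name of its value."""
    return {key: type(value).__name__ for key, value in bundle.items()}


# Stand-in bundle (hypothetical contents) so the sketch is self-contained.
demo_bundle = {"prep": ["placeholder pipeline"], "threshold": 0.5}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "model_demo.pkl"  # hypothetical file name
    with open(demo_path, "wb") as fh:
        pickle.dump(demo_bundle, fh)
    loaded = load_bundle(demo_path)

summary = summarize_keys(loaded)
print(summary)
```

Running `summarize_keys` on both the original and the reweighed bundle gives a quick way to confirm that the two files share the same structure.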

**Build pipeline**:<br>
Extract pipeline from `prep` key.<br>
Extract feature names and final class labels.<br>
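
One way this could look, assuming the object under `prep` behaves like a fitted scikit-learn pipeline, which exposes `feature_names_in_` and `classes_` when they are available. Whether the real object carries both attributes is an assumption to verify. The `SimpleNamespace` stand-in and the helper name `extract_pipeline_info` are hypothetical.

```python
from types import SimpleNamespace


def extract_pipeline_info(bundle):
    """Return (pipeline, feature_names, class_labels) from an unpickled bundle."""
    pipeline = bundle["prep"]
    # Either attribute may be missing on the real object, so fall back to None
    # instead of raising.
    feature_names = getattr(pipeline, "feature_names_in_", None)
    class_labels = getattr(pipeline, "classes_", None)
    return pipeline, feature_names, class_labels


# Stand-in object (NOT the real pipeline) carrying the two attributes.
demo_pipeline = SimpleNamespace(
    feature_names_in_=["has_partner", "has_medebewoner"],
    classes_=[0, 1],
)

pipeline, feature_names, class_labels = extract_pipeline_info({"prep": demo_pipeline})
print(feature_names, class_labels)
```

If either value comes back as `None`, the names and labels would have to be recovered from elsewhere in the bundle.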

**Features**:<br> 
Here's a translated table of the feature descriptions and importances. Where no range is given, we estimate.

| **Feature** | **Description** | **Given range** | **Estimated range** | **Feature importance** | **Feature importance (reweighed)** | **To be investigated for bias?** | **Bias notes** | **Note** | |
|--------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|---------------------|------------------------|--------------------------------|----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------|
| **deelnames_started_percentage_last_year** | Of all the applicant's participations (such as pathways and other instruments) in the year prior to this application, what percentage did he/she start?                                                                      |                 | 0-100               | 0.07317625572          | 0.073176                       | No                               | No direct link to sensitive attribute                                                                                                                                                                                                                                                | Should check if percentage is between 0-100 or 0-1.0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |                    |
| **at_least_one_address_in_amsterdam** | Number of active addresses of the requester. Can be several, e.g. residential address and shipping address. | | 0-3 | 0.0003825554705 | 0.000383 | No | Feature is an explicit policy rule for rejection | Is 0-3 reasonable? The description duplicates `active_address_count`, and the feature name suggests a yes/no flag, so check both description and range. | |
| **active_address_count** | Number of active addresses of the requester. Can be several, e.g. residential address and shipping address | | 0-3 | 3.05E-05 | 0.000030 | Yes | Social class, e.g. homeless or administrators. May be more common in certain population groups. | Is 0-3 reasonable? | |
| **days_since_last_relocation**             | Number of days since the applicant last moved house (based on BRP data provided). If there is no known address or only a mailing address, this feature will be populated as if the requester last moved a long time ago.<br> |                 | 0-3650              | 0.02227508473          | 0.022275                       | Yes                              | Homeless, possible migration background and socio-economic status.                                                                                                                                                                                                                   | How long is "a long time ago"?                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |                    |
| **days_since_last_dienst_end**             | Number of days since the applicant's last shift ended. There is a maximum of 365 days of review.                                                                                                                             | 0-365           |                     | 0.008268376334         | 0.008268                       | Yes                              | Socio-economic status - At the bottom of the labor market it is more difficult to find a permanent job, so these people may return more quickly.<br>Migration background - If you do not speak the language, you may not be able to find or keep a job. Labor market discrimination. |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |                    |
| **has_medebewoner** | Does the applicant have at least 1 co-occupant? (yes/no) | 0-1 | | 0.005575243894 | 0.005575 | No | Is related to many sensitive characteristics, but it is impossible to make a direct link to any of these characteristics:<br>Social class - If you do not have much money, you are more likely to have to share a house.<br>Age - Young people who do not yet live on their own, old people who rent out a room.<br>Nationality - Someone who moves here may not have family to live with here, so lives with a landlady or something similar.<br>Marital status - If you are single, you are more likely to live with 'strangers'.<br>Health - If you are not healthy, you may need a caregiver. | | |
| **avg_percentage_maatregel**               | This feature indicates the average discount percentage of all measures in the year prior to application. This gives an indication of how serious the offences were.                                                          |                 | 0-100               | 0.003545110815         | 0.003545                       | No                               | No direct link to sensitive attribute can be made.                                                                                                                                                                                                                                   | Check percentage range                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |                    |
| **total_vermogen** | Sum of the applicant's assets (from Socrates Assets). If unknown: an asset value of 0 is used. | | | 0.01292534192 | 0.012925 | No | Feature is an explicit policy rule for rejection. | Use income threshold to estimate | |
| **afspraken_no_show_count_last_year**      | Number of appointments with the applicant in the year preceding this application at which the applicant did not appear                                                                                                       | NA              | 0-15                | 0.0003162555345        | 0.012961                       | No                               | No direct link to sensitive attribute can be made.                                                                                                                                                                                                                                   | More than one a month                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |                    |
| **has_partner**                            | Does the applicant have a partner? (yes/no)                                                                                                                                                                                  | 0-1             |                     | -0.0003825554705       | -0.000383                      | No                               | Would be interesting, but insufficient data to be able to analyze. During monitoring / pilot we may have sufficient data for an analysis.                                                                                                                                            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |                    |
| **sum_inkomen_bruto_was_mean_imputed** | Flags whether the applicant's gross income was unknown, in which case the average gross income of all applications in the dataset is used instead (yes/no) | | | -0.000698811005 | -0.000699 | No | No direct link to sensitive attribute can be made. | | |
| **applied_for_same_product_last_year** | Has the applicant already applied for the same product in the year prior to the application that he/she is now applying for? (yes/no) | 0-1 | | -0.001662821754 | -0.001663 | Yes | Socio-economic status - At the bottom of the labor market it is more difficult to find a permanent job, so those people may return more quickly.<br>Migration background - If you do not speak the language, you may not be able to find or keep a job. Labor market discrimination. | | |
| **received_same_product_last_year**        | Has the applicant already received the same product in the year prior to the application that he/she is currently applying for? (yes/no)                                                                                     | 0-1             |                     | -0.005228500068        | -0.005229                      | Yes                              | Socio-economic status - At the bottom of the labor market it is more difficult to find a permanent job, so those people may return more quickly.<br>Migration background - If you do not speak the language, you may not be able to find or keep a job. Labor market discrimination. |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |                    |
| **afspraken_no_contact_count_last_year** | Number of appointments with the applicant in the year prior to this application where no contact could be made or the applicant did not respond | | 0-15 | 0.01296115356 | 0.000316 | No | No direct link to sensitive attribute. | More than one a month | |
| **sum_inkomen_bruto_value**                | Sum of the gross amounts of all the applicant's incomes (from Socrates Income)                                                                                                                                               |                 |                     | -0.0006325110689       | -0.000633                      | No                               | Feature is an explicit policy rule for rejection.                                                                                                                                                                                                                                    | Use income threshold to estimate                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |                    |


**Synthetic data**:<br>
Generate synthetic data using the feature range estimates above. Cross-reference the ranges with the partial dependence plots.
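
A rough sketch of the generation step, using only the given or estimated ranges from the feature table. Sampling each feature independently and uniformly is an assumption made for illustration; it ignores any correlation between features. The three features with no range in the table (`total_vermogen`, `sum_inkomen_bruto_value`, `sum_inkomen_bruto_was_mean_imputed`) are left out until estimates exist. The names `FEATURE_RANGES` and `make_synthetic_rows` are hypothetical.

```python
import random

# (low, high, kind) copied from the given/estimated ranges in the feature table.
FEATURE_RANGES = {
    "deelnames_started_percentage_last_year": (0, 100, "float"),
    "at_least_one_address_in_amsterdam": (0, 3, "int"),
    "active_address_count": (0, 3, "int"),
    "days_since_last_relocation": (0, 3650, "int"),
    "days_since_last_dienst_end": (0, 365, "int"),
    "has_medebewoner": (0, 1, "int"),
    "avg_percentage_maatregel": (0, 100, "float"),
    "afspraken_no_show_count_last_year": (0, 15, "int"),
    "has_partner": (0, 1, "int"),
    "applied_for_same_product_last_year": (0, 1, "int"),
    "received_same_product_last_year": (0, 1, "int"),
    "afspraken_no_contact_count_last_year": (0, 15, "int"),
}


def make_synthetic_rows(n_rows, ranges=FEATURE_RANGES, seed=0):
    """Draw n_rows rows, sampling every feature uniformly within its range."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    rows = []
    for _ in range(n_rows):
        row = {}
        for name, (low, high, kind) in ranges.items():
            if kind == "int":
                row[name] = rng.randint(low, high)  # inclusive on both ends
            else:
                row[name] = rng.uniform(low, high)
        rows.append(row)
    return rows


rows = make_synthetic_rows(100)
print(rows[0])
```

If the percentage features turn out to be stored as 0-1.0 rather than 0-100, only the corresponding entries in `FEATURE_RANGES` need to change.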

**Analysis**:<br>
We want to find the marginal effect of each feature on the model's predictions. We could do this using partial dependence plots: each one reduces the 15-dimensional model to a single feature by varying that feature and averaging the prediction over the others. Here's an [example](https://drive.google.com/file/d/1aOoQjqIxZ0RBCnsW68nlQPHQO7NCsJq4/view?usp=sharing).<br>
This plot shows the impact of varying the value of a feature that is *also* identified as a proxy for socioeconomic status in the feature table above.<br>
I think we could refine and replicate this analysis by:

> Plotting both label classifications instead of just one, so we can better gauge the impact on false positives and negatives
> 
> Plotting for both model versions
> 
> Only looking at the 5 features identified as proxies for socioeconomic status: `received_same_product_last_year`, `applied_for_same_product_last_year`, `days_since_last_dienst_end`, `days_since_last_relocation` and `active_address_count`. I don't think we'd gain much by plotting for features that aren't indicative of bias, even if they have a higher importance.
> 
> Their feature importance varies, so in the final viz we'll have to be careful not to imply that the most dramatic plot is the most influential. My hunch is that the best way to do that is to only include a plot for `days_since_last_relocation`, which is the second-most important feature in both models.
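
The partial dependence idea above can be sketched by hand: fix one feature to each value on a grid, leave every other feature as it is in the data, and average the predicted probability for a chosen class. The `toy_predict_proba` function below is a made-up stand-in, not the real model, and it assumes rows are passed as dicts; the real model would need a thin wrapper to accept them. Computing the curve for both class indices covers the "both label classifications" point.

```python
def partial_dependence(predict_proba, rows, feature, grid, class_index=1):
    """Average predicted probability of one class as `feature` sweeps over `grid`."""
    curve = []
    for value in grid:
        # Force the feature to `value` in every row; other features stay as-is.
        modified = [{**row, feature: value} for row in rows]
        probabilities = predict_proba(modified)
        mean_prob = sum(p[class_index] for p in probabilities) / len(probabilities)
        curve.append((value, mean_prob))
    return curve


def toy_predict_proba(rows):
    """Stand-in model (NOT the real one): class-1 probability grows with the feature."""
    output = []
    for row in rows:
        p1 = min(1.0, 0.1 + row["days_since_last_relocation"] / 3650 * 0.5)
        output.append([1.0 - p1, p1])
    return output


rows = [
    {"days_since_last_relocation": 100, "has_partner": 0},
    {"days_since_last_relocation": 2000, "has_partner": 1},
]
grid = [0, 1825, 3650]

curve_class1 = partial_dependence(
    toy_predict_proba, rows, "days_since_last_relocation", grid, class_index=1
)
curve_class0 = partial_dependence(
    toy_predict_proba, rows, "days_since_last_relocation", grid, class_index=0
)
print(curve_class1)
print(curve_class0)
```

Calling `partial_dependence` once per model version, with the same rows and grid, would give directly comparable curves for the original and reweighed models.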

**Run model**:<br>
Run model using `predict_proba`.
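
A small sketch of how the two model versions could be scored on the same rows. It relies only on the scikit-learn convention that `predict_proba` returns one row per sample with one column per class. `ConstantModel` is a hypothetical stand-in for the unpickled models, and the helper names are illustrative.

```python
class ConstantModel:
    """Stand-in with the same `predict_proba` output shape as a classifier."""

    def __init__(self, positive_probability):
        self.positive_probability = positive_probability

    def predict_proba(self, X):
        p = self.positive_probability
        return [[1.0 - p, p] for _ in X]


def class_scores(model, X, class_index=1):
    """Return the predicted probability of one class for every row of X."""
    return [row[class_index] for row in model.predict_proba(X)]


def mean_score_shift(model_before, model_after, X, class_index=1):
    """Average change in predicted probability from the first model to the second."""
    before = class_scores(model_before, X, class_index)
    after = class_scores(model_after, X, class_index)
    return sum(a - b for a, b in zip(after, before)) / len(before)


X = [[0, 1], [1, 0], [1, 1]]  # placeholder rows; the stand-in ignores their values
shift = mean_score_shift(ConstantModel(0.30), ConstantModel(0.25), X)
print(shift)
```

Which column corresponds to which label should be read from the model's class labels rather than assumed.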