# IN-STK5000/9000 - Adaptive methods for data-based decision making
## Credit Project

The code to reproduce experiments can be found [here](https://github.com/gsel9/ml-society-science).

#### Syed Moeen Ali Naqvi - Geir Severin Rakh Elvatun Langberg - Markus Sverdvik Heiervang  
***

### Part 1: Banker agent
In this notebook, we present and comment on the development of our banker model, compare it against the random banker, and document the implementation of the methods in the class.

### Task 1 - Implementing expected utility

Our action space $\mathcal{A}$ is binary: $\mathcal{A} = \{0, 1\} = \{a_1, a_2\} = \{ \text{refuse_loan}, \text{grant_loan} \}$


To calculate the expected utility, we consider the two actions: $a_1$, refusing the loan, and $a_2$, granting the loan. If the loan is granted, the outcome at the end of the lending period $n$ is either that the loan is fully repaid, $\omega_1$, or that it is not repaid, $\omega_2$. The utility of granting a loan of $m$ credits that is repaid is $m((1 + r)^n - 1)$, where $r$ is the interest rate, whereas the utility is $-m$ if the loan is not repaid. If the loan is refused, the utility is zero. Thus, given the probability of being credit-worthy, $P(\omega_1)$, the expected utility of granting the loan is


$$
    \mathbb{E}(U \mid a) = m((1 + r)^n - 1)P(\omega_1) - m(1 - P(\omega_1)).
$$

This calculation is implemented as follows


```Python
def expected_utility(self, x: pd.Series, action: int) -> float:

        if action:
            # Probability of being credit worthy.
            pi = self.predict_proba(x)

            return x["amount"] * ((1 + self.rate) ** x["duration"] - 1) * pi - x["amount"] * (1 - pi)

        return 0.0
```
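As a quick sanity check of the formula, the following standalone sketch evaluates the expected utility of granting a hypothetical loan. The function name `expected_utility_of_granting` and all numbers are illustrative assumptions and not part of `Group4Banker`.

```Python
def expected_utility_of_granting(amount: float, rate: float, duration: int, p_repaid: float) -> float:
    """Expected utility of granting a loan, given the probability of repayment."""
    gain_if_repaid = amount * ((1 + rate) ** duration - 1)
    loss_if_default = amount
    return gain_if_repaid * p_repaid - loss_if_default * (1 - p_repaid)


# Hypothetical loan: 1000 credits over 12 periods at 5 % interest per period.
for p_repaid in (0.5, 0.6, 0.9):
    utility = expected_utility_of_granting(1000, 0.05, 12, p_repaid)
    print(f"P(repaid) = {p_repaid:.1f}: expected utility = {utility:.2f}")
```

With these numbers the expected utility is negative for $P(\omega_1) = 0.5$ and positive for $P(\omega_1) = 0.6$, so the sign of the result depends strongly on the quality of the probability estimate.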

### Task 2 - Implementing the fit function

We use a random forest classifier to estimate the probability that an applicant is credit-worthy. A random forest (RF) constructs many individual decision trees at training time, and the predictions from all trees are pooled to make the final prediction. In the classical formulation the predicted class is the mode of the classes predicted by the trees, whereas scikit-learn averages the class probabilities predicted by the individual trees. Because they combine the results of many models into a final decision, random forests are referred to as ensemble techniques.
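To illustrate the pooling step, the sketch below combines hypothetical per-tree estimates of $P(\omega_1)$ in two ways: by majority vote over the class each tree would predict, and by averaging the probabilities. The numbers are made up, and the helper names are not part of any library.

```Python
from collections import Counter
from statistics import mean

# Hypothetical P(repaid) estimates from five trees for one applicant.
tree_probabilities = [0.9, 0.4, 0.45, 0.8, 0.45]


def majority_vote(probabilities, threshold=0.5):
    """Each tree votes for a class; the most common class wins."""
    votes = [int(p > threshold) for p in probabilities]
    return Counter(votes).most_common(1)[0][0]


def averaged_probability(probabilities):
    """Average the per-tree probabilities into a single estimate."""
    return mean(probabilities)


print("Majority vote:", majority_vote(tree_probabilities))
print("Averaged probability:", averaged_probability(tree_probabilities))
```

The two rules can disagree, as they do for these numbers, which is one reason an averaged probability is a more natural input to an expected-utility calculation than a hard vote.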

We are using scikit-learn to implement the classifier. We have included optional hyper-parameter tuning before fitting the model.

Following is the code for fit():

```Python
    def fit(self, X: pd.DataFrame, y: pd.Series) -> None:
        if self.optimize:
            # Find the optimal hyper-parameters with a cross-validated grid search.
            param_grid = [{
                'bootstrap' : [True],
                'max_features' : list(range(10,20,1)),
                'max_depth' : list(range(10,100,10)),
                'n_estimators' : list(range(25,150,25))
            }]

            grid_search = GridSearchCV(
                estimator = RandomForestClassifier(), param_grid = param_grid, cv = 5
            )
            grid_search.fit(X, y)
            self.classifier = RandomForestClassifier(
                random_state=self.random_state, **grid_search.best_params_
            )
        else:
            self.classifier = RandomForestClassifier(
                n_estimators=100,
                random_state=self.random_state,
                class_weight="balanced"
            )
            
        self.classifier.fit(X,y)

```

The method predict_proba() checks that fit() has been called beforehand and returns the predicted probability of the loan being repaid.

Following is the code for predict_proba():

```Python
    def predict_proba(self, x: pd.Series) -> float:
        if not hasattr(self, "classifier"):
            raise ValueError("This Group4Banker instance is not fitted yet. Call 'fit' "
                             "with appropriate arguments before using this method.")

        x_reshaped = np.reshape(x.to_numpy(), (1,-1))

        return self.classifier.predict_proba(x_reshaped)[0][0]
```

We are assuming that the labelling process is correct and the labels represent the ground truth. 

### Task 3 - Get best action


Assuming that we are maximising utility, a general function would be

$$
\text{best_action}(x) = \underset{a \in \mathcal{A}}{\text{argmax}} \  \mathbb{E}(U \mid a)
$$

but since our action space is binary, it can be expressed as

$$  
\text{best_action}(x) = \begin{cases}
    1,& \text{if } \mathbb{E}(U \mid a=1) > 0\\
    0,              & \text{otherwise}
\end{cases}
$$

We can translate this into Python code as follows:

```Python
def get_best_action(self, x: pd.Series) -> int:
        return int(self.expected_utility(x, 1) > 0)
```
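Because $\mathbb{E}(U \mid a=1) = m\left((1 + r)^n P(\omega_1) - 1\right)$, the rule above is equivalent to granting the loan whenever $P(\omega_1) > (1 + r)^{-n}$, assuming $m > 0$. A minimal sketch of this threshold, with hypothetical helper names that are not part of `Group4Banker`:

```Python
def break_even_probability(rate: float, duration: int) -> float:
    """Repayment probability at which the expected utility of granting is zero."""
    return (1 + rate) ** (-duration)


def best_action_from_probability(p_repaid: float, rate: float, duration: int) -> int:
    """Return 1 (grant) if the expected utility of granting is positive, else 0 (refuse)."""
    return int(p_repaid > break_even_probability(rate, duration))


# Longer durations lower the threshold, since the potential gain grows with duration.
for duration in (6, 12, 24, 48):
    print(duration, round(break_even_probability(0.05, duration), 3))
```

This view makes explicit that the decision depends on the loan only through the rate and the duration, and not on the amount.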

### Task 4 - Documenting the banker  

For this part, we interact with the Group4Banker in the cells below.  
Before measuring the performance, we run a series of unit tests to assert that each method works for a few cases.

In [1]:
from group4_banker import Group4Banker
import numpy as np

# seed for reproducibility
np.random.seed(42)

In [2]:
from test_group4_banker import TestGroup4Banker
import unittest
# We'll need these arguments when running the tests in jupyter notebook
unittest.main(argv=["first-arg-is-ignored"], exit=False)

.....
----------------------------------------------------------------------
Ran 5 tests in 2.956s

OK


<unittest.main.TestProgram at 0x7f8fa948e110>

We rewrote the TestLending script into a command-line interface so that we can customize the program's parameters.   
It also displays the progress of the training, since fitting the classifier might take some time.  

From the output below, we can observe that our banker achieves a higher average utility and return on investment than the RandomBanker.

In [3]:
!python3 TestLendingV2.py ../../data/credit/D_valid.csv --n-tests 100 --seed 12 --interest-rate 0.05

r=0.05, n_tests=100, seed=12

Testing on class: RandomBanker ...
100%|█████████████████████████████████████████| 100/100 [00:08<00:00, 12.08it/s]
Results:
	Average utility: 62194332770.39753
	Average return on investment: 1662784.8621222847

Testing on class: Group4Banker ...
100%|█████████████████████████████████████████| 100/100 [03:12<00:00,  1.85s/it]
Results:
	Average utility: 154030042326.10663
	Average return on investment: 3563445.2297234936


Let's see how our banker performs on the training set

In [4]:
!python3 TestLendingV2.py ../../data/credit/D_train.csv --n-tests 100 --seed 11 --interest-rate 0.05

r=0.05, n_tests=100, seed=11

Testing on class: RandomBanker ...
100%|█████████████████████████████████████████| 100/100 [00:06<00:00, 15.39it/s]
Results:
	Average utility: 286821119298.73846
	Average return on investment: 4944685.923886744

Testing on class: Group4Banker ...
100%|█████████████████████████████████████████| 100/100 [03:04<00:00,  1.76s/it]
Results:
	Average utility: 490355107143.02966
	Average return on investment: 11223170.155701187


### Part 2: Critical evaluation of banker agent

1. Is it possible to ensure that your policy maximises revenue? How can you take into account
the uncertainty due to the limited and/or biased data? What if you have to decide for credit
for thousands of individuals and your model is wrong? How should you take that type of
risk into account?

In general, we cannot ensure that the policy maximises revenue because the probability estimates $P(\omega_1)$ and $P(\omega_2)$ about the outcome of events $\omega_1$ and $\omega_2$ might be biased. However, from a theoretical point of view where we assume $P(\omega_1)$ and $P(\omega_2)$ are true, to maximize revenue we desire 

$$
    \mathbb{E}(U \mid a) > 0 \Rightarrow m((1 + r)^n - 1)P(\omega_1) > mP(\omega_2) \Leftrightarrow (1 + r)^n - 1 > \frac{P(\omega_2)}{P(\omega_1)}, \quad P(\omega_1) > 0
$$

To select parameters $n$ and $r$ ensuring $\mathbb{E}(U \mid a) > 0$, we use $P(\omega_1) + P(\omega_2) = 1$, so that the condition above becomes $(1 + r)^n > 1 / P(\omega_1)$. We can then either consider

$$
    n > \frac{-\ln P(\omega_1)}{\ln (1 + r)}, \quad r, P(\omega_1) > 0
$$

for fixed $r$, or 

$$
r > \exp\left ( \frac{-\ln P(\omega_1)}{n} \right ) - 1, \quad n, P(\omega_1) > 0
$$

for fixed $n$.
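As a numerical check, note that $\mathbb{E}(U \mid a) = m\left((1 + r)^n P(\omega_1) - 1\right)$, so the expected utility is positive exactly when $(1 + r)^n > 1 / P(\omega_1)$. The sketch below solves this condition for $n$ and for $r$ and verifies that the expected utility vanishes at the break-even values. The function names and the example numbers are illustrative assumptions.

```Python
import math


def min_duration(rate: float, p_repaid: float) -> float:
    """Duration n above which granting has positive expected utility, for fixed rate r."""
    return -math.log(p_repaid) / math.log(1 + rate)


def min_rate(duration: float, p_repaid: float) -> float:
    """Interest rate r above which granting has positive expected utility, for fixed n."""
    return p_repaid ** (-1 / duration) - 1


def expected_utility(amount, rate, duration, p_repaid):
    """Expected utility of granting a loan of the given amount."""
    return amount * ((1 + rate) ** duration - 1) * p_repaid - amount * (1 - p_repaid)


# Hypothetical applicant with P(repaid) = 0.7 and a 5 % rate per period.
n_star = min_duration(0.05, 0.7)
print(f"Break-even duration: {n_star:.2f} periods")
print(f"Break-even rate for n = 12: {min_rate(12, 0.7):.4f}")
print(f"Expected utility at break-even duration: {expected_utility(1000, 0.05, n_star, 0.7):.6f}")
```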

---


To obtain a model performance measure that is more robust than a single point estimate, one can use a re-sampling technique, such as cross-validation or bootstrapping, to derive an empirical distribution of performance. In our approach, we use bootstrapping to predict the optimal actions on several random subsets of the test data. This produces a matrix $\mathbf{\hat{Y}} \in \left \{ a_0, a_1 \right \}^{N \times M}$ of predicted actions on $N$ bootstrap samples of size $M$. Note that the predicted actions correspond to the ground truths $\mathbf{Y}$ in the test data.
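A minimal sketch of the bootstrap idea, assuming a hypothetical list of per-applicant utilities rather than our actual test data: each bootstrap sample is drawn with replacement, and the spread of the resampled means gives an empirical distribution of the performance measure. The function name and the numbers are illustrative.

```Python
import random
from statistics import mean


def bootstrap_means(values, n_samples, sample_size, seed=0):
    """Return the mean of each of `n_samples` bootstrap samples drawn with replacement."""
    rng = random.Random(seed)
    return [
        mean(rng.choices(values, k=sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical utilities obtained on a test set (made-up numbers).
utilities = [120.0, -500.0, 80.0, 0.0, 310.0, -200.0, 45.0, 0.0, 150.0, 95.0]

means = sorted(bootstrap_means(utilities, n_samples=1000, sample_size=len(utilities)))
lower, upper = means[25], means[974]  # approximate 95 % percentile interval
print(f"Mean utility: {mean(utilities):.1f}, bootstrap interval: [{lower:.1f}, {upper:.1f}]")
```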


--- 

We use a Bayesian approach to evaluate the effect of an increasing number of samples on the model by comparing the posterior probability of rejecting $a_0$ (grant loan) when $a_1$ (don't grant loan) is true, to a ground truth derived from the test data. That is, we evaluate the model's ability to reject a loan that is unlikely to be repaid. Writing $H_0$ for the hypothesis that $a_0$ is the correct action, the reference rejection rate $\hat{\mu}_0$ of $H_0$ is computed from the bootstrap samples of the test data as

$$
\hat{\mu}_0 = \frac{1}{NM} \sum_{n, m} \mathbb{I}(Y_{n, m} = 0)
$$

Moreover, assuming the $Y_{n, m}$ are i.i.d., we include a $1 - \delta$ confidence interval around $\hat{\mu}_0$ by using Hoeffding's inequality. With probability at least $1 - \delta$,
$$
    \left | \hat{\mu}_0 - \mathbb{E}\left \{ \hat{\mu}_0 \right \} \right | \leq \sqrt{\frac{\ln (2 / \delta)}{2NM}} \Rightarrow \mathbb{E}\left \{ \hat{\mu}_0 \right \} \in \left [ \hat{\mu}_0 - \sqrt{\frac{\ln (2 / \delta)}{2NM}}, \hat{\mu}_0 + \sqrt{\frac{\ln (2 / \delta)}{2NM}} \right ],
$$
bounding the expected value of $\hat{\mu}_0$. 
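For reference, Hoeffding's inequality for the mean of $n$ i.i.d. variables taking values in $[0, 1]$ gives an interval half-width of $\sqrt{\ln(2/\delta) / (2n)}$ that holds with probability at least $1 - \delta$. A small sketch of this computation, with illustrative sample sizes and hypothetical function names:

```Python
import math


def hoeffding_half_width(n_samples: int, delta: float) -> float:
    """Half-width of the 1 - delta Hoeffding interval for a mean of values in [0, 1]."""
    return math.sqrt(math.log(2 / delta) / (2 * n_samples))


def hoeffding_interval(mu_hat: float, n_samples: int, delta: float):
    """Interval around the empirical mean, clipped to [0, 1]."""
    half_width = hoeffding_half_width(n_samples, delta)
    return max(0.0, mu_hat - half_width), min(1.0, mu_hat + half_width)


# The interval shrinks at rate 1 / sqrt(n) as more samples are used.
for n_samples in (50, 500, 5000):
    print(n_samples, round(hoeffding_half_width(n_samples, delta=0.05), 4))
```

The bound does not depend on the value of the estimate itself, which makes it conservative compared with intervals that use the empirical variance.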

In experiments, we ran 200 iterations of 50 bootstrap samples for $\delta = 0.05$. The code to reproduce our experiments can be found in `action_sensitivity.py`. 

In the following figure, we compare the probability of rejecting $H_0$ given the model estimates to the reference rate $\hat{\mu}_0$ derived from the test data. The figure shows that the probability of granting a loan that will not be repaid converges to $\hat{\mu}_0$ as the number of bootstrap samples increases.

In [5]:
from IPython.display import IFrame

IFrame("reject_null_hypo.pdf", width=800, height=550)

To further enhance model performance, we can update the prediction model that provides the probability estimates by adding more *Decision Trees* to the *Random Forest*. The additional trees are trained on the test data after the predictions have been made. This mechanism could be implemented as follows:

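```Python
def adaptive_predict_proba(model, X, y, n_extra_estimators=10):
    """Predict on new data, then grow the forest using that data.

    `model` is assumed to be a fitted RandomForestClassifier, and `y` holds
    the outcomes observed for `X` after the decisions were made.
    """
    # Predict before updating, so the new labels cannot leak into the predictions.
    y_pred = model.predict_proba(X)

    # With warm_start=True, fit() keeps the existing trees and trains only
    # the additional ones, here on the newly observed data.
    model.set_params(n_estimators=model.n_estimators + n_extra_estimators,
                     warm_start=True)
    model.fit(X, y)

    return y_pred
```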

### 2. Does the existence of this database raise any privacy concerns? 

The database contains information about individuals, such as sex, age, personal status, employment status, property and housing status, which can be used to identify individuals when linked with other publicly available databases (e.g., social media, tax data, health records, voter data). If the data is publicly available, a linkage attack can be used to identify many individuals in this database.
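To make the linkage risk concrete, the sketch below counts how many records are unique on a set of quasi-identifiers. Such records can be matched one-to-one against an external database that contains the same attributes together with a name. The records and column names are hypothetical and not taken from our data.

```Python
from collections import Counter

# Hypothetical records; only the quasi-identifiers are shown.
records = [
    {"sex": "female", "age": 34, "housing": "rent", "job": "skilled"},
    {"sex": "male", "age": 34, "housing": "own", "job": "skilled"},
    {"sex": "male", "age": 34, "housing": "own", "job": "skilled"},
    {"sex": "female", "age": 61, "housing": "own", "job": "unskilled"},
]
quasi_identifiers = ("sex", "age", "housing", "job")


def unique_records(rows, keys):
    """Return the rows whose combination of quasi-identifiers occurs exactly once."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if counts[tuple(row[k] for k in keys)] == 1]


exposed = unique_records(records, quasi_identifiers)
print(f"{len(exposed)} of {len(records)} records are unique on {quasi_identifiers}")
```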

If the database were secret (and known only to the bank) but the credit decisions were public, the overall privacy of the data would increase. However, an attacker could still infer information from the published credit decisions, depending on how the decisions are published. For example, if the credit decisions include the amount and duration of the loan, and an attacker already knows certain attributes of an individual, it may be possible to identify that individual.


The training data of the people in the database can be protected by adopting a randomized response mechanism (i.e. the global privacy model). For that, we return the true credit decision only with some probability $p < 1$, and a randomized decision otherwise. The calculation of each response does not depend on the query. 
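A minimal sketch of randomized response for a binary credit decision, assuming the variant that reports the true decision with probability $p$ and the opposite decision otherwise. For $1/2 < p < 1$, this variant is $\epsilon$-differentially private with $\epsilon = \ln\frac{p}{1 - p}$. The function names and the example decisions are illustrative.

```Python
import math
import random


def randomized_response(true_decision: int, p_truth: float, rng: random.Random) -> int:
    """Report the true binary decision with probability p_truth, otherwise flip it."""
    return true_decision if rng.random() < p_truth else 1 - true_decision


def epsilon_of(p_truth: float) -> float:
    """Privacy level of the flipping mechanism, valid for 0.5 < p_truth < 1."""
    return math.log(p_truth / (1 - p_truth))


rng = random.Random(0)
decisions = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # hypothetical true decisions
published = [randomized_response(d, p_truth=0.8, rng=rng) for d in decisions]
print("Published decisions:", published)
print(f"epsilon = {epsilon_of(0.8):.3f}")
```

A value of $p$ closer to $1/2$ gives a smaller $\epsilon$ and therefore stronger privacy, at the cost of less informative published decisions.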

Whether an applicant is new to our database or already has a history with the credit system, the information they provide is sensitive and needs to be randomized. For that, we can transform each attribute independently by adding noise to it (i.e. the local privacy model). This can be represented as:

$$
\pi(a \mid x) = \prod_{i} \pi(a_i \mid x_i)
$$
   
where $x = (x_1, \ldots, x_n)$ represents the complete data and $a = (a_1, \ldots, a_n)$ represents the mechanism's output. If each $\pi(a_i \mid x_i)$ is $\epsilon / n$-differentially private, the method is $\epsilon$-differentially private.

We use the local privacy model for our private decision making mechanism. For the numerical features in our data, we use the Laplace mechanism, which is defined as: 

$$
\pi(a \mid x) = \mathrm{Laplace}(f(x), \lambda)
$$ 

for any function $f: \mathcal{X} \to \mathbb{R}$, where the Laplace density is defined as: 

$$
\rho(\omega \mid \mu, \lambda) = \frac{1}{2\lambda}\exp\left(-\frac{|\omega - \mu|}{\lambda}\right)
$$
with mean $\mu$ and variance $2\lambda^2$.
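A small sketch of Laplace noise using only the standard library. It relies on the fact that the difference of two independent exponential variables with mean $\lambda$ follows a Laplace distribution with location 0 and scale $\lambda$. The function names and the example value are assumptions for illustration.

```Python
import random
from statistics import mean, pvariance


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from Laplace(0, scale)."""
    # expovariate takes the rate, i.e. 1 / mean.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def laplace_mechanism(value: float, scale: float, rng: random.Random) -> float:
    """Return the value perturbed with Laplace noise of the given scale."""
    return value + laplace_noise(scale, rng)


rng = random.Random(0)
scale = 2.0
samples = [laplace_noise(scale, rng) for _ in range(100_000)]
print(f"Empirical mean: {mean(samples):.3f} (theoretical value 0)")
print(f"Empirical variance: {pvariance(samples):.3f} (theoretical value {2 * scale ** 2})")
print("Noisy age:", laplace_mechanism(35, scale, rng))
```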

For the categorical features, we use randomized response: with a fixed probability, a value is replaced by a category drawn uniformly at random from the observed categories.

Our implementation of randomize_data() is as follows:


```Python

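import numpy as np


def randomize_data(df, numerical_features, categorical_features, probability, laplace_delta):
    """Return a randomized copy of `df` (assumed to be a pandas DataFrame).

    Categorical values are resampled with the given probability, and Laplace
    noise with scale `laplace_delta`, in units of the column's standard
    deviation, is added to the numerical features.
    """
    df_copy = df.copy()

    for column_name in categorical_features:
        # Select the rows whose value is replaced by a random category.
        random_index = np.random.choice(
            a=[True, False],
            size=len(df_copy),
            p=[probability, 1 - probability]
        )
        new_datapoints = np.random.choice(
            np.unique(df_copy[column_name]),
            size=random_index.sum()
        )
        # Assign through .loc to avoid modifying a temporary copy of the column.
        df_copy.loc[random_index, column_name] = new_datapoints

    for column_name in numerical_features:
        temp_col = df_copy[column_name]
        # Scale the Laplace noise by the standard deviation of the column.
        noise = np.random.laplace(0, laplace_delta, size=temp_col.size)
        noise *= temp_col.std()
        df_copy[column_name] = temp_col + noise
    return df_copy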
```

Given a utility function $U(q, a, x)$ of a response $a$ to a query $q$ for a user $x$, the exponential mechanism selects the response $a$ with probability

$$
\pi(a \mid x) = \frac{e^{\epsilon U(q,a,x) / \Delta U}}{\sum_{a'}e^{\epsilon U(q,a',x) / \Delta U}}
$$

where $\Delta U = \sup_{x N x'} |U(q,a,x) - U(q,a,x')|$ is the sensitivity of the utility, with the supremum taken over neighbouring $x$ and $x'$. The lower the value of $\epsilon$, the more randomized the result.
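The following sketch evaluates the exponential-mechanism probabilities for a hypothetical utility table over two responses and then samples a response. The utilities, the sensitivity `delta_u` and the function names are assumptions for illustration.

```Python
import math
import random


def exponential_mechanism_probs(utilities, epsilon, delta_u):
    """Response probabilities proportional to exp(epsilon * U / delta_u)."""
    # Subtract the maximum before exponentiating for numerical stability.
    max_u = max(utilities.values())
    weights = {a: math.exp(epsilon * (u - max_u) / delta_u) for a, u in utilities.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def sample_response(probs, rng):
    """Draw one response according to the given probabilities."""
    responses = list(probs)
    return rng.choices(responses, weights=[probs[a] for a in responses], k=1)[0]


# Hypothetical utilities of the two actions for one applicant.
utilities = {"refuse_loan": 0.0, "grant_loan": 1.0}
for epsilon in (0.0, 1.0, 10.0):
    print(epsilon, exponential_mechanism_probs(utilities, epsilon, delta_u=1.0))
print(sample_response(exponential_mechanism_probs(utilities, 1.0, 1.0), random.Random(0)))
```

For $\epsilon = 0$ both responses are equally likely, and as $\epsilon$ grows the mechanism concentrates on the response with the highest utility.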

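To quantify the loss in utility due to privacy, we can use the difference between the utility obtained on the original data and on the privatized data (here $\Delta U$ denotes this loss, not the sensitivity used in the exponential mechanism):

$$
\Delta U = \text{Utility}_{x, \text{original}} - \text{Utility}_{x, \text{private}}
$$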




### 3. Is our banker model fair?
* What interesting features does it consider?   

Looking at the dataset, the features we would expect to be necessary are credit history, 
amount and lending period. 
However, some of the features describe details about the person
that might be only loosely connected to the repayment of a loan. 
The features age, phone, foreign and sex/status are all taken into account when the model makes its evaluation, which means that it can develop biases on these grounds.
We therefore have to discuss which biases are fair, and which we might want to avoid.

The sex/status attribute is defined as follows:
     
```console
Attribute 9:  (qualitative)
	      Personal status and sex
	      A91 : male   : divorced/separated
	      A94 : male   : married/widowed
	      A92 : female : divorced/separated/married
	      A93 : male   : single
	      A95 : female : single    
```
(from the dataset description in data/credit/german.doc)

What is interesting here is that there are more categories for males than for females.
Male + divorced/separated is a different category from male + married/widowed, but for females these categories are merged into one. Why is that? Could it be to compensate for missing data? Or does it encode assumptions about how marital status affects the likelihood of repaying a loan?
    
    
* Consider the model with respect to group fairness and conditional independence:
    Does it evaluate on fair grounds?  Can this model be subject to discrimination?  
    
Because we are dealing with loans and maximizing revenue, this model has to discriminate on some grounds, such as social class (e.g. job/salary). 
The ability to take out a loan directly affects your opportunities in society, so if we strive for equal opportunity with respect to, say, gender, it is important to ensure that the model has no bias with respect to gender alone. Since this attribute is part of the data we train on, the model can pick up some bias with respect to sex, even if it is very small. If we want to remove this bias, we should drop the information about sex from the dataset and instead only use the categorical features married, divorced, widowed and single.  
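One way to quantify group fairness is to compare the rate at which loans are granted across groups. The sketch below computes this rate per group and the largest gap between groups. The decisions and group labels are hypothetical, and the helper names are not part of our code base.

```Python
from collections import defaultdict


def grant_rates(decisions, groups):
    """Fraction of granted loans (decision == 1) within each group."""
    granted = defaultdict(int)
    total = defaultdict(int)
    for decision, group in zip(decisions, groups):
        granted[group] += decision
        total[group] += 1
    return {group: granted[group] / total[group] for group in total}


def parity_gap(decisions, groups):
    """Largest difference in grant rate between any two groups."""
    rates = grant_rates(decisions, groups).values()
    return max(rates) - min(rates)


# Hypothetical decisions (1 = grant) and group membership.
decisions = [1, 0, 1, 1, 0, 1, 1, 0]
groups = ["male", "male", "male", "male", "female", "female", "female", "female"]
print(grant_rates(decisions, groups))
print("Parity gap:", parity_gap(decisions, groups))
```

A gap of zero corresponds to equal grant rates across groups. A non-zero gap is not by itself evidence of unfair treatment, since the groups may differ in other relevant attributes.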


* What can you say about the fairness with respect to Individual Independence and Meritocracy?

We have to discuss which criteria should make someone "qualify" for a loan. Other than that, if we only consider relevant attributes, the model should in turn make a meritocratic evaluation.

To test whether the biases can affect the decision making, we compare the estimated probabilities for single women with those for single men.

In [6]:
!python3 test_biases.py --seed 43

Proportion of returns for single males in the train set:
 0.6980392156862745
Proportion of returns for single females in the train set:
 0.6730769230769231 

Estimated probability for single male: 0.54 
Estimated probabiltiy for single female: 0.56 
Absolute difference 0.020000000000000018 

Estimated probability for single male: 0.59 
Estimated probabiltiy for single female: 0.61 
Absolute difference 0.020000000000000018 

Estimated probability for single male: 0.82 
Estimated probabiltiy for single female: 0.83 
Absolute difference 0.010000000000000009 



The difference in the probability estimates is very small, but still present.  
We can create an edge case where the decisions would be different.

In [7]:
from test_biases import get_trained_model, get_encoded_features
import pandas as pd
decision_maker = get_trained_model(0.05)
categories = get_encoded_features()

values = [26.5, 25938, 7, 2, 38, 0, 6, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 
          0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

x_male = pd.Series(dict(zip(categories, values)))
x_female = x_male.copy()
x_male["marital status_3"], x_female["marital status_5"] = 1, 1

print("Best action for single male:", decision_maker.get_best_action(x_male[categories]))
print("Best action for single female:", decision_maker.get_best_action(x_female[categories]))

Best action for single male: 0
Best action for single female: 1


As we can see here, although all the other features are identical, our banker will grant the loan to the single female, but not to the single male. So our model can discriminate on the basis of sex, although this is unlikely to happen in practice.