# Assignment 2 IN-STK5000, First deadline
## Group 10, Tobias Opsahl, Alva Hørlyk, Ece Centinogly

# a)
One simple way to protect the private information of the individuals is to just hide direct identifiers. However this is generally insufficient as attackers may have other identifying information. This information, combined with the information in the database, can reveal identities. 
<br>
Another method is k-anonymization, where k-1 people are indistinguishable from each other (with respect to quasi-identifiers) in the database. Columns with personal information, like name and date of birth are removed, and the rest of the information is generalized. For instance can a variable like age be categorical with different age-groups. Even though k-anonymization is an improvement from simply removing direct identifiers, an attacker with enough imformation can still infer something about the individuals.
<br>
If we assume that an attacker can have a lot of side-information, it is better to use differential privacy. For instance, we can use the Laplace mechanism, where Laplace distributed noise is added in the model. How much noise we add determines how private the result is. We can randomly chose a fraction of the data and add noise to it. This way, even if the data was publicly available, one would not be certain if the it really was true. The goal would obviously add a fraction of noise that makes the data private enough, but do not lower the predictions significantly. 
<br>
In this task the policy is released and can be used by the public. Then the data have to be anonymized before $\pi(a|x)$ is obtained. This can be done using a local privacy model, where independent Laplace noise $\omega_{i}$ is added to each individual. We have $y_{i}=x_{i}+\omega_{i}$ and use it to get $a=n^{-1}\sum_{i=1}^{n}y_{i}$.

# b) 
Here we assume that the analysts can be trusted with private information, so only the result made available for the public have to be privatized. Then we can use a centralized privacy model. We obtain $\pi(a|x)$ with $a=n^{-1}\sum_{i=1}^{n}x_{i}+\omega$. We do not need to privatize the data, just a bit of the decisions of the model we fitted on it. 

In other words, we can add noise to the actions after fitting the model. Without changing any of the observations in the population, we can fit a model that decides actions, and then add a bit of noise to the actions. This is so it one could not figure out personal data based on the action we chose, which might happen if the model picks ups a simple pattern in the data. 

For the actual implementation, we use both approaches. First, we add noise to the data itself and fit a model. Then we try to fit a model first, then add noise to the results. 
For the first approach, we implement functions for adding noise to our data. For the binary data, the function randomize() choses a ratio "1-theta" from a column, and changes it with a coinflip (50-50 chance of 0 and 1). For the continious variables, we replace the coinflip with a new drawing from the distribution. This approach is not as robust as desired, because generally we do not know the underlying population. We would like to change this for laplace or exponential noice by the next deadline. 

We then loop over all the columns in our population and add noise one by one. The model is fitted on the privatized data. 


For the second approach, we simply fit the model on the data, then sends the outcome columns to the functions to add noise. Note that the way we do this is a little too simple, we add noise the same way. This is a little bit unfortunate because we then will have individuals that do not receive treatments, and someone who recieves multiple, so we will change this later. 

# c) 

Let us now try to implement a policy, and see how the utility is affected by the privacy.
The policy is a simple linear regression model, that is based on a random division of data. We first make a new population, and draw actions randomly (from the RandomPolicy class). We thereby divide into the groups which have gotten treatment 1, 2 or 3. The reward (utility for each person) is calculated, and a linear model is then fitted on each of the three groups. Finally, for each person in the new population, the three models predicts the utility, and the treatment corresponding to the model that calculates the highest utility is chosen. 

We have not changed the utility function, but plan to edit it so the "features" also affects it. If we compare two groups A and B, where A has a much higher amount of symptoms than B, but only slightly more symptoms in the outcomes, then B will have a higher utility then A. This is not satisfactory, since the treatment ratio was much better in group A. Therefore, we aim to make an utility function that is caclulated on the ratio between symptoms before and after the treatment. We do not plan to make the treatment weight in on the calculation. For now however, it is simply kept as it came with the code. 

The policy is very simple, so we plan to implement a method with higher predictive accuricy by the next deadline. Since we need an iterative model that is fitted on the residuals of the last model, we plan to use boosting. 

Here is the code for the simple policy. Please already note that we do not believe our model have any real predictive power. If we change the seed, the coefficient will change greatly, much more than the change of coefficients from group to group. We therefore believe our model is fitted on noise, or that the noise is bigger than the impact of the treatment. To not get too much noise from irrelevant variables, we simply remove the genes. We have to explore if this is a wise desicion more, but it makes the model fitting a little bit more stable. The treatment does seem to be pretty irrelevant. We will try to investigate this more later, but for now, here is what we got:

In [1]:
import numpy as np
import pandas as pd
from aux_file import symptom_names
import simulator
from IPython import embed
from sklearn.linear_model import LinearRegression

In [2]:
class Policy:
    """ A policy for treatment/vaccination. """
    def __init__(self, n_actions, action_set):
        """ Initialise.
        Args:
        n_actions (int): the number of actions
        action_set (list): the set of actions
        """
        self.n_actions = n_actions
        self.action_set = action_set
        print("Initialising policy with ", n_actions, "actions")
        print("A = {", action_set, "}")
    ## Observe the features, treatments and outcomes of one or more individuals
    def observe(self, features, action, outcomes):
        pass 
          
    def get_utility(self, features, action, outcome):
        """ Obtain the empirical utility of the policy on a set of one or more people. 
        If there are t individuals with x features, and the action
        
        Args:
        features (t*|X| array)
        actions (t*|A| array)
        outcomes (t*|Y| array)
        Returns:
        Empirical utility of the policy on this data.
      
        Here the utiliy is defined in terms of the outcomes obtained only, ignoring both the treatment and the previous condition.
        """

        utility = 0
        utility -= 0.2 * sum(outcome[:,symptom_names['Covid-Positive']])
        utility -= 0.1 * sum(outcome[:,symptom_names['Taste']])
        utility -= 0.1 * sum(outcome[:,symptom_names['Fever']])
        utility -= 0.1 * sum(outcome[:,symptom_names['Headache']])
        utility -= 0.5 * sum(outcome[:,symptom_names['Pneumonia']])
        utility -= 0.2 * sum(outcome[:,symptom_names['Stomach']])
        utility -= 0.5 * sum(outcome[:,symptom_names['Myocarditis']])
        utility -= 1.0 * sum(outcome[:,symptom_names['Blood-Clots']])
        utility -= 100.0 * sum(outcome[:,symptom_names['Death']])
        return utility
        
    def get_reward(self, features, actions, outcome):
        
        rewards = np.zeros(len(outcome))
        for t in range(len(features)):
            utility = 0
            utility -= 0.2 * outcome[t,symptom_names['Covid-Positive']]
            utility -= 0.1 * outcome[t,symptom_names['Taste']]
            utility -= 0.1 * outcome[t,symptom_names['Fever']]
            utility -= 0.1 * outcome[t,symptom_names['Headache']]
            utility -= 0.5 * outcome[t,symptom_names['Pneumonia']]
            utility -= 0.2 * outcome[t,symptom_names['Stomach']]
            utility -= 0.5 * outcome[t,symptom_names['Myocarditis']]
            utility -= 1.0 * outcome[t,symptom_names['Blood-Clots']]
            utility -= 100.0 * outcome[t,symptom_names['Death']]
            rewards[t] = utility
        return rewards

    def get_action(self, features):
        """Get actions for one or more people. 
        This is done by making a random policy with 3 treatments,
        then fitting a linear model on each of the 3 subgroups.
        The action is then calculated by which of the three models that predicts
        the highest utility for each individual. 
        """
        n_population = features.shape[0]
        model1, model2, model3 = self.linear_model(n_population)
    
        actions = np.zeros([n_population, self.n_actions])
        pred1 = model1.predict(self.feature_select(features))
        pred2 = model2.predict(self.feature_select(features))
        pred3 = model3.predict(self.feature_select(features))
        for t in range(n_population):
    
            if pred1[t] >= pred2[t] and pred1[t] >= pred3[t]:
                actions[t, 0] = 1
            elif pred2[t] >= pred1[t] and pred2[t] >= pred3[t]:
                actions[t, 1] = 1
            elif pred3[t] >= pred1[t] and pred3[t] >= pred2[t]:
                actions[t, 2] = 1
    
        return actions
    
    def linear_model(self, n_population):
        """
        Fit a linear model on random data. The data is first randomly generated
        and a random policy is made. We then divide the data by the different
        treatments given (which was random), and fit one linear model on each data.
        """
        population = simulator.Population(128, 3, 3)
        treatment_policy = RandomPolicy(3, list(range(3))) # make sure to add -1 for 'no vaccine'
        X = population.generate(n_population)
        A = treatment_policy.get_action(X)
        U = population.treat(list(range(n_population)), A)
        x_data = self.feature_select(X)
        x_data1 = x_data[A[:, 0] == 1] # Action 1
        x_data2 = x_data[A[:, 1] == 1] # Action 2
        x_data3 = x_data[A[:, 2] == 1] # Action 3
        y_data1 = treatment_policy.get_reward(x_data1, 0, U[A[:, 0] == 1])
        y_data2 = treatment_policy.get_reward(x_data2, 0, U[A[:, 1] == 1])
        y_data3 = treatment_policy.get_reward(x_data3, 0, U[A[:, 2] == 1])
                
        linear_model_test1 = LinearRegression()
        linear_model_test2 = LinearRegression()
        linear_model_test3 = LinearRegression()

        model1 = linear_model_test1.fit(x_data1, y_data1)
        model2 = linear_model_test2.fit(x_data2, y_data2)
        model3 = linear_model_test3.fit(x_data3, y_data3)

        return model1, model2, model3
        
    def feature_select(self, X):
        """
        Chooses some columns in X. For now, we just omit the genes
        """
        df = add_feature_names(X)
        temp1 = df.iloc[:, :13]
        temp2 = df.iloc[:, -9:-3]
        return np.asmatrix(temp1.join(temp2))



Here is the Random Policy provided with the code:

In [3]:
class RandomPolicy(Policy):
    """ This is a purely random policy!"""

    def get_utility(self, features, action, outcome):
        """Here the utiliy is defined in terms of the outcomes obtained only, ignoring both the treatment and the previous condition.
        """
        actions = self.get_action(features)
        utility = 0
        utility -= 0.2 * sum(outcome[:,symptom_names['Covid-Positive']])
        utility -= 0.1 * sum(outcome[:,symptom_names['Taste']])
        utility -= 0.1 * sum(outcome[:,symptom_names['Fever']])
        utility -= 0.1 * sum(outcome[:,symptom_names['Headache']])
        utility -= 0.5 * sum(outcome[:,symptom_names['Pneumonia']])
        utility -= 0.2 * sum(outcome[:,symptom_names['Stomach']])
        utility -= 0.5 * sum(outcome[:,symptom_names['Myocarditis']])
        utility -= 1.0 * sum(outcome[:,symptom_names['Blood-Clots']])
        utility -= 100.0 * sum(outcome[:,symptom_names['Death']])
        return utility
    
    def get_action(self, features):
        """Get a completely random set of actions, but only one for each individual.
        If there is more than one individual, feature has dimensions t*x matrix, otherwise it is an x-size array.
        
        It assumes a finite set of actions.
        Returns:
        A t*|A| array of actions
        """

        n_people = features.shape[0]
        ##print("Acting for ", n_people, "people");
        actions = np.zeros([n_people, self.n_actions])
        for t in range(features.shape[0]):
            action = np.random.choice(self.action_set)
            if (action >= 0):
                actions[t,action] = 1
            # embed()
            
        return actions

And finally the privitazing functions, along with some other help functions. We plan to integrate these in the class for later, but they are for now kept outside the class. 

In [4]:
def add_feature_names(X):
    """
    This functions simply makes X to a dataframe and adds the column names, 
    so it is easier to work with.
    """
    features_data = pd.DataFrame(X)
    # features =  ["Covid-Recovered", "Age", "Gender", "Income", "Genome", "Comorbidities", "Vaccination status"]
    features = []
    # features += ["Symptoms" + str(i) for i in range(1, 11)]
    features += ["Covid-Recovered", "Covid-Positive", "No-Taste/Smell", "Fever", 
                 "Headache", "Pneumonia", "Stomach", "Myocarditis", 
                 "Blood-Clots", "Death"]
    features += ["Age", "Gender", "Income"]
    features += ["Genome" + str(i) for i in range(1, 129)]
    # features += ["Comorbidities" + str(i) for i in range(1, 7)]
    features += ["Asthma", "Obesity", "Smoking", "Diabetes", 
                 "Heart disease", "Hypertension"]
    features += ["Vaccination status" + str(i) for i in range(1, 4)]
    features_data.columns = features
    return features_data
    
def add_action_names(actions):
    """
    Add names for actions. Converts array to pandas DataFrame.
    """
    df = pd.DataFrame(actions)
    names = ["Action" + str(i) for i in range(1, len(actions.shape[0]) + 1)]
    df.columns = names
    return df

def add_outcome_names(outcomes):
    """
    Add names for the outcomes. Converts array to pandas DataFrame.
    """
    df = pd.DataFrame(outcomes)
    df.columns = ["Covid-Recovered", "Covid-Positive", "No-Taste/Smell", "Fever", 
                  "Headache", "Pneumonia", "Stomach", "Myocarditis", 
                  "Blood-Clots", "Death"]
    return df
    
def privatize(X, theta):
    """
    Adds noice to the data, column by column. The continious and discreet 
    columns are treated differently. 
    """
    df = add_feature_names(X).copy()
    df["Age"] = randomize_age(df["Age"], theta)
    df["Income"] = randomize_income(df["Income"], theta)
    for column in df.columns:
        if column != "Age" or column != "Income":
            df[column] = randomize(df[column], theta)
    return np.asarray(df)

def privatize_actions(A, theta):
    """
    Adds noise to the actions chosen bu the model. This is currently done
    a little bit primitive, since person no longer receives exactly one
    treatment.
    """
    A1 = A.copy()
    for i in range(A1.shape[1]):
        A1[:, i] = randomize(A1[:, i], theta)
    return A1
    
def randomize(a, theta):
    """
    Randomize a single column. Simply add a cointoss to "theta" amount of the data
    """
    coins = np.random.choice([True, False], p=(theta, (1-theta)), size=a.shape)
    noise = np.random.choice([0, 1], size=a.shape)
    response = np.array(a)
    response[~coins] = noise[~coins]
    return response 
    
def randomize_income(a, theta):
    """
    Randomize by drawing from the same population again
    """
    coins = np.random.choice([True, False], p=(theta, (1-theta)), size=a.shape)
    noise = np.random.gamma(1,10000, size=a.shape)
    response = np.array(a)
    response[~coins] = noise[~coins]
    return response 
    
def randomize_age(a, theta):
    """
    Randomize by drawing from the same population again
    """
    coins = np.random.choice([True, False], p=(theta, (1-theta)), size=a.shape)
    noise = np.random.gamma(3,11, size=a.shape)
    response = np.array(a)
    response[~coins] = noise[~coins]
    return response

# d)

Let us try to explore what the privatizing results in. We first see what happens when we add some noise to the data before we fit the model. We add different amount of noise, with theta values in [0.99, 0.95, 0.9, 0.8, 0.7, 0.6, 0.5]. This takes a couple seconds to run, and excuse the messy output. 

In [5]:
np.random.seed(57)
n_genes = 128
n_vaccines = 3
n_treatments = 3
n_population = 10000
population = simulator.Population(n_genes, n_vaccines, n_treatments)
treatment_policy = Policy(n_treatments, list(range(n_treatments)))
X = population.generate(n_population)
np.random.seed(57)
A = treatment_policy.get_action(X)
np.random.seed(57)
U = population.treat(list(range(n_population)), A)
utility = treatment_policy.get_utility(X, A, U)

thetas = [0.99, 0.95, 0.9, 0.8, 0.7, 0.6, 0.5]
utility_list = np.zeros(len(thetas)+1)
utility_list[0] = utility
for i in range(len(thetas)):
    print(i)
    X_priv = privatize(X, thetas[i])
    np.random.seed(57)
    A_priv = treatment_policy.get_action(X_priv)
    np.random.seed(57)
    U_priv = population.treat(list(range(n_population)), A_priv)
    utility_list[i+1] = treatment_policy.get_utility(X_priv, A_priv, U_priv)

Initialising policy with  3 actions
A = { [0, 1, 2] }




Initialising policy with  3 actions
A = { [0, 1, 2] }
0
Initialising policy with  3 actions
A = { [0, 1, 2] }




1
Initialising policy with  3 actions
A = { [0, 1, 2] }




2
Initialising policy with  3 actions
A = { [0, 1, 2] }




3
Initialising policy with  3 actions
A = { [0, 1, 2] }




4
Initialising policy with  3 actions
A = { [0, 1, 2] }




5
Initialising policy with  3 actions
A = { [0, 1, 2] }




6
Initialising policy with  3 actions
A = { [0, 1, 2] }




In [6]:
utility_list

array([-1914.2, -1914.9, -1916.7, -1920.8, -1926.4, -1929.7, -1831.7,
       -2033.4])

Let us discuss the results. We first get a little but lower utility as we add more noise (lower theta). This seems to be in line with our model. However, when we add a lot of noise, the utility changes by a lot, and even gets higher. We assume that this is because the model is fitted on a lot of noise, so it is not actually explaining much. Simply randomizing all the data might give a higher utility at some time. However, the first 5 theta values are gradually increasing, but not by much, which is in line with the theory. This means that an adition in privacy reduces the prediction by a little amount. This reduction is so little that it can be absolutely worth to change for privacy. However, if the seed is changed, the values might as well. One would have to run this on many seeds, but that takes time, and we wish to do a more detailed analysis on a better model in rather than this. 

Let us look at what happens when we add noise to the actions, and not the population. 

In [7]:
utility_list2 = np.zeros(len(thetas) + 1)
utility_list2[0] = utility
for i in range(len(thetas)):
    np.random.seed(57)
    A_noise = privatize_actions(A, thetas[i])
    U_noise = population.treat(list(range(n_population)), A_noise)
    utility_list2[i+1] = treatment_policy.get_utility(X, A_noise, U_noise)

In [8]:
utility_list2

array([-1914.2, -2512.3, -2511.9, -2513.2, -2511.4, -2414.8, -2118.7,
       -2018.2])

As in the last part, it is hard to trust the results, but adding a bit of noise significantly changed the utility to the worse. Here it seems like the first approach was better. However, we will not look more into this until we have refined our utility function and model.

# Questions

We have some questions about the assigmnet that we hope to get some guidance on, before the next deadline. 

1. How is the model meant to be increamantably updated? We assume that we start with a simple model, calculate the utility, then use stochastic gradiant descent to find out what to update in our model. However, how is this done? Which kind of model can we update in this way? We wanted to use boosting, since it is an iterative process of additave models fit on the residuals, but we do not know how to actually find a library that implements it in a way with the utility function like this. 

2. What is meant to be in the observe() function? Is this what we did in the reward instead, or is it different?

3. When will the "historical data" for the last deadline be available? Is it sufficient to calculate the error bounds with bootstrapping? How is the "improved policy" meant to be found, simply by using the new data, or finding a better model?

4. The code from the population-generation produces some warnings. How can we get rid of them?