# Exercise 2

## Data exploration

Before we dive into fitting a model for prediction in exercise 2, let us explore the data a bit. 

We formalize the problem as follows: we want to use the explanatory variables (genes, age, income, comorbidities, treatment, and symptoms before treatment) to predict the response, which is the symptoms after the treatment. More specifically, we are going to predict whether a person with a certain symptom is likely to still have that symptom after a given treatment. In other words, we assume that the dataset "treatment_features" contains the symptoms _before_ the treatment, that the table "treatment_actions" contains whether the patient got treatment 1, treatment 2, both, or none, and that the table "treatment_outcomes" contains the symptoms after the treatment. This might be a wrong interpretation, since it implies that dead people are treated and that some people are resurrected by the treatment. However, it is our best interpretation, so we assume that these cases are mistakes in the dataset.

During the initial data analysis, we make an important observation: patients who do not have a symptom before the treatment never have that symptom afterwards, regardless of whether they received treatment 1, treatment 2, both, or none. If we assume this is always the case, the model will become more accurate. However, this assumption may only hold for our dataset, which would give us an artificially low test error. Therefore, we look at both cases: fitting the models on the whole dataset, and fitting them only on the patients who have the symptom before the treatment. We will predict one symptom at a time, meaning that we will have a separate model for each symptom. How we fit the models is elaborated later, in the "Fitting a model" section, after the initial data analysis. First, let us look at how many of the observations actually have a positive response.

In [72]:
import numpy as np 
import pandas as pd 
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.neighbors import KNeighborsClassifier  # used by knn_model() below

In [54]:
# Loading data

In [55]:
def init_features(data):
    """
    Initialize names for observation features and treatment features
    
    Symptoms (10 bits): Covid-Recovered, Covid-Positive, No-Taste/Smell, 
        Fever, Headache, Pneumonia, Stomach, Myocarditis, Blood-Clots, Death
    Age (integer)
    Gender (binary)
    Income (floating)
    Genome (128 bits)
    Comorbidities (6 bits): Asthma, Obesity, Smoking, Diabetes, Heart disease, Hypertension
    Vaccination status (3 bits): 0 for unvaccinated, 1 for receiving a specific vaccine for each bit
    """
    features_data = pd.read_csv(data)
    # features =  ["Covid-Recovered", "Age", "Gender", "Income", "Genome", "Comorbidities", "Vaccination status"]
    features = []
    # features += ["Symptoms" + str(i) for i in range(1, 11)]
    features += ["Covid-Recovered", "Covid-Positive", "No-Taste/Smell", "Fever", 
                 "Headache", "Pneumonia", "Stomach", "Myocarditis", 
                 "Blood-Clots", "Death"]
    features += ["Age", "Gender", "Income"]
    features += ["Genome" + str(i) for i in range(1, 129)]
    # features += ["Comorbidities" + str(i) for i in range(1, 7)]
    features += ["Asthma", "Obesity", "Smoking", "Diabetes", 
                 "Heart disease", "Hypertension"]
    features += ["Vaccination status" + str(i) for i in range(1, 4)]
    features_data.columns = features
    return features_data

In [56]:
def init_actions():
    actions = pd.read_csv("treatment_actions.csv")
    actions.columns = ["Treatment1", "Treatment2"]
    return actions 

In [57]:
def init_outcomes():
    """
    Initialize outcome data
    
    Post-Treatment Symptoms (10 bits): Past-Covid (Ignore), Covid+ (Ignore), 
    No-Taste/Smell, Fever, Headache, Pneumonia, Stomach, Myocarditis, 
    Blood-Clots, Death
    """
    outcomes = pd.read_csv("treatment_outcomes.csv")
    outcome_names = ["Past-Covid", "Covid+", "No-Taste/Smell", "Fever", "Headache", 
                      "Pneumonia", "Stomach", "Myocarditis", "Blood-Clots", "Death"]
    outcomes.columns = outcome_names
    return outcomes

In [58]:
# Fix dataset

In [59]:
observation_features = init_features("observation_features.csv")
data_obs = observation_features
actions = init_actions()
outcomes = init_outcomes()
treatment_features = init_features("treatment_features.csv")
data_treat = treatment_features
# The task said to ignore the two first columns
outcomes = outcomes.iloc[:, 2:]

outcome_names_new = [i + "_after" for i in outcomes.columns] # We want to specify that this is an outcome 
outcomes.columns = outcome_names_new

treatment = data_treat.join(actions).join(outcomes)
tmp1 = treatment.iloc[:, 0:13]
tmp2 = treatment.iloc[:, 141:]
# The three datasets for ex. 2 in one dataset, where all genes are omitted
treat_no_genes = tmp1.join(tmp2)

num_features = ["Age", "Income"]
num_df = treat_no_genes[num_features]
scaled_num_df = (num_df - num_df.mean()) / num_df.std()

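# Note: this is an alias, not a copy, so treat_no_genes itself is scaled too
treat_no_genes_scaled = treat_no_genes
# Assign by column name instead of by position; Age is an integer column,
# so writing floats into it by position can trigger dtype warnings in pandas
treat_no_genes_scaled[num_features] = scaled_num_df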

# Remove the column "Covid-Positive" (because everyone has covid)
treat_no_genes = treat_no_genes.drop(columns="Covid-Positive")

In [60]:
# Looking at different treatments

In [61]:
# People with only treatment 1, 211 people
treat_1 = treat_no_genes[(treat_no_genes["Treatment1"] == 1) & (treat_no_genes["Treatment2"] == 0)]
# People with only treatment 2, 211 people
treat_2 = treat_no_genes[(treat_no_genes["Treatment2"] == 1) & (treat_no_genes["Treatment1"] == 0)]
# People with both treatments, 240 people
treat_both = treat_no_genes[(treat_no_genes["Treatment1"] == 1) & (treat_no_genes["Treatment2"] == 1)]
# People with no treatments, 215 people
treat_none = treat_no_genes[(treat_no_genes["Treatment1"] == 0) & (treat_no_genes["Treatment2"] == 0)]

In [62]:
# Number of people with different symptoms after

In [63]:
#print(f"Number of people with different symtoms, total people is {treat_no_genes.shape[0]}")
#print("--------------------------------------------------------------")
#for s in outcomes.columns:
#    print(f"People with symptom {s}: ", treat_no_genes[treat_no_genes[s] == 1].shape[0])

In [64]:
# People with symptom before treatment compared to people with symptom after treatment

In [65]:
print(f"Number of people with different symptoms before and after treatment, total people is {treat_no_genes.shape[0]}")
print("-" * 90)
for sb, sa in zip(treat_no_genes.columns[1:9], outcomes.columns):
    print(f"People with symptom {sb} before treatment: ", treat_no_genes[treat_no_genes[sb] == 1].shape[0])
    print(f"People with symptom {sa} after treatment: ", treat_no_genes[treat_no_genes[sa] == 1].shape[0])
    print("-" * 60)

Number of people with different symptoms before and after treatment, total people is 877
------------------------------------------------------------------------------------------
People with symptom No-Taste/Smell before treatment:  49
People with symptom No-Taste/Smell_after after treatment:  23
------------------------------------------------------------
People with symptom Fever before treatment:  24
People with symptom Fever_after after treatment:  18
------------------------------------------------------------
People with symptom Headache before treatment:  7
People with symptom Headache_after after treatment:  1
------------------------------------------------------------
People with symptom Pneumonia before treatment:  34
People with symptom Pneumonia_after after treatment:  19
------------------------------------------------------------
People with symptom Stomach before treatment:  5
People with symptom Stomach_after after treatment:  3
----------------------------------------

From this table we can make two important observations: 1. The treatments seem to be somewhat effective, since fewer people have symptoms after the treatment than before. 2. Few of the patients actually experienced symptoms. This is an important observation, since it means that fitting the model will be difficult. For example, only 5 people had stomach symptoms before treatment, and only 3 after. This is out of 877 observations, so we have to work to get a model that does not simply predict "0" all the time. We will get back to this point.
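
To make the class imbalance concrete, here is a minimal sketch that builds a response vector with the same balance as the Stomach outcome (3 positives out of 877, taken from the table above) and scores a "model" that always predicts 0. The arrays are constructed for illustration only and are not read from the dataset.

```python
import numpy as np

n_patients = 877        # total observations, from the table above
n_stomach_after = 3     # patients with stomach symptoms after treatment

# Response vector with the same class balance as the Stomach outcome
y_true = np.zeros(n_patients, dtype=int)
y_true[:n_stomach_after] = 1

# A "model" that always predicts 0
y_pred = np.zeros(n_patients, dtype=int)

accuracy = (y_pred == y_true).mean()
false_negatives = int(((y_pred == 0) & (y_true == 1)).sum())

print(f"Accuracy of always predicting 0: {accuracy:.4f}")
print(f"False negatives: {false_negatives} of {n_stomach_after} positives")
```

The accuracy is very high even though every positive case is missed, which is why plain accuracy is a poor yardstick here and why we later weigh false negatives separately.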

Now let us check whether this differs between the treatment groups:

In [66]:
print(f"Percentages of people experiencing symptoms before and after treatment, in each treatment group")
print("-" * 90)
for sb, sa in zip(treat_1.columns[1:9], outcomes.columns):
    t1_b = treat_1[treat_1[sb] == 1].shape[0]/len(treat_1)
    t2_b = treat_2[treat_2[sb] == 1].shape[0]/len(treat_2)
    tb_b = treat_both[treat_both[sb] == 1].shape[0]/len(treat_both)
    tn_b = treat_none[treat_none[sb] == 1].shape[0]/len(treat_none)
    t1_a = treat_1[treat_1[sa] == 1].shape[0]/len(treat_1)
    t2_a = treat_2[treat_2[sa] == 1].shape[0]/len(treat_2)
    tb_a = treat_both[treat_both[sa] == 1].shape[0]/len(treat_both)
    tn_a = treat_none[treat_none[sa] == 1].shape[0]/len(treat_none)
    print(f"{sb} before treatment: Treatment1: {t1_b*100:.4f}, \
Treatment2: {t2_b*100:.4f}, Both Treatments: {tb_b*100:.4f}, No treatment: {tn_b*100:.4f}")
    print(f"{sa} after treatment: Treatment1: {t1_a*100:.4f}, \
Treatment2: {t2_a*100:.4f}, Both Treatments: {tb_a*100:.4f}, No treatment: {tn_a*100:.4f}")
    print("-" * 60)

Percentages of people experiencing symptoms before and after treatment, in each treatment group
------------------------------------------------------------------------------------------
No-Taste/Smell before treatment: Treatment1: 5.6872, Treatment2: 3.3175, Both Treatments: 7.0833, No treatment: 6.0465
No-Taste/Smell_after after treatment: Treatment1: 3.7915, Treatment2: 0.9479, Both Treatments: 0.0000, No treatment: 6.0465
------------------------------------------------------------
Fever before treatment: Treatment1: 4.7393, Treatment2: 1.8957, Both Treatments: 1.2500, No treatment: 3.2558
Fever_after after treatment: Treatment1: 2.8436, Treatment2: 1.4218, Both Treatments: 0.8333, No treatment: 3.2558
------------------------------------------------------------
Headache before treatment: Treatment1: 0.9479, Treatment2: 0.9479, Both Treatments: 0.8333, No treatment: 0.4651
Headache_after after treatment: Treatment1: 0.0000, Treatment2: 0.0000, Both Treatments: 0.0000, No treatment:

One more thing to check is whether there are people who do not have a symptom before the treatment but get it afterwards.

In [67]:
symptom_names = ["No-Taste/Smell", "Fever", "Headache", "Pneumonia", "Stomach", "Myocarditis", "Blood-Clots", "Death"]
for symptom in symptom_names:
    no_symptom = treat_no_genes[treat_no_genes[symptom] == 0.0]
    symptom_after = no_symptom[no_symptom[f"{symptom}_after"] == 1.0]
    print(f" Without symptom: {symptom}, {no_symptom.shape[0]}, # gets symptom after treatment {symptom_after.shape[0]}")

 Without symptom: No-Taste/Smell, 828, # gets symptom after treatment 0
 Without symptom: Fever, 853, # gets symptom after treatment 0
 Without symptom: Headache, 870, # gets symptom after treatment 0
 Without symptom: Pneumonia, 843, # gets symptom after treatment 0
 Without symptom: Stomach, 872, # gets symptom after treatment 0
 Without symptom: Myocarditis, 864, # gets symptom after treatment 0
 Without symptom: Blood-Clots, 843, # gets symptom after treatment 0
 Without symptom: Death, 867, # gets symptom after treatment 0


And the answer is no; no symptom-negative persons get sick after the treatment. This is an important observation for the model fitting. The details will be specified later. 

From these tables (mainly the second-to-last one) we can establish a few things. Firstly, the group that got no treatment shows no reduction in symptoms from before to after, which makes sense. From this observation, and the fact that the treatments reduce the number of symptoms or keep it the same in every case, we can establish that the treatments are somewhat effective. For headache, every treated person loses their headache, whichever treatment they got. Treatment1 treats all of the Blood-Clots cases and many of the Pneumonia cases. For the rest of the cases, the treatments are either somewhat effective or not effective at all. This is good to note, because if our final model is good, it should pick up some of this. Fitting the model will be difficult, however, since we have very few positives in the response. Finally, we have observations that got both treatments and are "resurrected", meaning that they were dead before the treatment but not afterwards. This is not to mention that dead people get treated at all, which is the case for 10 patients. This might be an interpretation mistake on our side, but since we could not figure it out, we assume it is a mistake in the dataset.
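
The percentages above are computed over everyone in a treatment group. A complementary view, sketched below, is the share of patients who had a symptom before treatment and no longer have it afterwards, per treatment group. The helper name `recovery_rate` and the toy data frame are hypothetical and only serve to illustrate the idea; the column names follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy data, NOT the real dataset: one symptom before and after
# treatment, plus the two treatment indicators
toy = pd.DataFrame({
    "Fever":       [1, 1, 1, 1, 1, 1, 0, 0],
    "Fever_after": [1, 0, 0, 0, 1, 1, 0, 0],
    "Treatment1":  [1, 1, 0, 0, 0, 0, 1, 0],
    "Treatment2":  [0, 0, 1, 1, 0, 0, 1, 0],
})

def recovery_rate(df, symptom):
    """
    Share of the patients with "symptom" before treatment who no longer
    have it afterwards, for each (Treatment1, Treatment2) combination.
    """
    had_symptom = df[df[symptom] == 1]
    had_symptom = had_symptom.assign(recovered=1 - had_symptom[f"{symptom}_after"])
    return had_symptom.groupby(["Treatment1", "Treatment2"])["recovered"].mean()

rates = recovery_rate(toy, "Fever")
print(rates)
```

Conditioning on having the symptom beforehand makes the groups easier to compare, since the groups do not start with the same share of symptomatic patients.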

## Fitting a model

Problem specification.

In order to test our models, we use cross-validation. We made a pipeline for testing different models with different error penalties. To keep it general, cross_validate() takes the arguments "parameter1" and "parameter2". The first one is passed to the model, so it can for example be "k" for KNN or "lambda" for Lasso. The second one is passed to the error calculation. It is the threshold that decides what counts as a positive or a negative prediction when the model predicts continuous values between 0 and 1 (or slightly below or above). The obvious choice would be "0.5", but since we have far more negative than positive observations, it might be beneficial to lower this threshold. Finally, "penalty_factor" determines how much we weigh a false negative against a false positive. This should be tuned so that our model gives a satisfactory outcome. The right value depends on how "bad" a false negative is compared to a false positive, which is not up to us to determine. Therefore, we will look at many cases. The reason for writing our own functions is all of the flexibility just mentioned. We did not find this in existing functions, even though it surely exists somewhere.
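
As a small worked example of how the threshold and the penalty factor interact, the sketch below scores five made-up predictions. The function `weighted_error` is a hypothetical, vectorized stand-in written for this illustration, and the scores are invented; it is not the function used in the rest of the notebook.

```python
import numpy as np

def weighted_error(scores, y_true, threshold=0.5, penalty_factor=1):
    """
    Counts false negatives (weighted by penalty_factor) and false positives
    (weighted by 1), where a score at or above "threshold" is a positive.
    """
    scores = np.asarray(scores, dtype=float)
    y_true = np.asarray(y_true, dtype=int)
    predicted_positive = scores >= threshold
    false_negatives = int(np.sum(~predicted_positive & (y_true == 1)))
    false_positives = int(np.sum(predicted_positive & (y_true == 0)))
    return penalty_factor * false_negatives + false_positives

# Made-up continuous predictions and true labels
scores = [0.05, 0.40, 0.35, 0.45, 0.60]
y_true = [0, 0, 1, 1, 1]

# Threshold 0.5: two positives are missed, no false positives
err_default = weighted_error(scores, y_true, threshold=0.5)
# Threshold 0.3: all positives are found, at the cost of one false positive
err_low_threshold = weighted_error(scores, y_true, threshold=0.3)
# Threshold 0.5 with false negatives weighted five times as much
err_penalized = weighted_error(scores, y_true, threshold=0.5, penalty_factor=5)

print(err_default, err_low_threshold, err_penalized)
```

Lowering the threshold trades false negatives for false positives, and the penalty factor decides how favourable that trade looks.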

Here is the actual code: 

In [91]:
def cross_validate(data, response, model, test_function, k_fold=10, parameter1=1, parameter2=0.5, penalty_factor=1):
    """
    Cross-validates "model" on "data", according to the error given by
    "test_function". "response" is the name of the response column that we
    are predicting. "parameter1" is passed to the model, while "parameter2"
    and "penalty_factor" are passed to "test_function". Returns the error
    summed over all folds.
    """
    n = len(data)
    # Assign the fold labels 0, ..., k_fold - 1 in turn, then shuffle them so
    # that each row lands in a random fold of (almost) equal size
    cv_indexes = np.arange(n) % k_fold
    np.random.shuffle(cv_indexes)
    error = 0
    for k in range(k_fold):
        train = data[cv_indexes != k]
        test = data[cv_indexes == k]
        # Fit on the training folds and predict the held-out fold
        predictions = model(train, test, response, parameter1)
        error += test_function(predictions, test, response, parameter2=parameter2, penalty_factor=penalty_factor)
    return error
    
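def penalized_error(outcomes, test, response, penalty_factor=5, parameter2=0.5):
    """
    Categorical error, where false negatives are weighted with
    "penalty_factor", and false positives are weighted with 1.
    "outcomes" are the predictions for "test", and "parameter2" is the
    threshold at or above which a prediction counts as positive.
    """
    error = 0
    for pred, exact in zip(outcomes, test[response]):
        if pred < parameter2:
            # Predicted negative: an error only if the true value is 1
            error += exact * penalty_factor
        else:
            # Predicted positive: an error only if the true value is 0
            error += (1 - exact)
    return error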
    
def knn_model(data, test, response, parameter1=5):
    """
    To be used in "cross_validate()". "data" is the training data that the KNN
    model is fitted on, with data[response] as the Y data. The function returns
    the predictions made on "test". "parameter1" is the number of neighbours.
    """
    k = parameter1
    y_data = data[response]
    x_data = data.drop(columns=[response]) # Check that it does not delete y_data
    test_data = test.drop(columns=[response])
    model = KNeighborsClassifier(n_neighbors=k)
    fit_model = model.fit(x_data, y_data)
    predictions = fit_model.predict(test_data)
    return predictions

def lasso_model(data, test, response, parameter1=1):
    """
    Lasso model for cross_validate(). "parameter1" is lambda, the
    regularization strength (called "alpha" in scikit-learn).
    """
    alpha = parameter1
    y_data = data[response]
    x_data = data.drop(columns=[response]) # Check that it does not delete y_data
    test_data = test.drop(columns=[response])
    model = Lasso(alpha=alpha)
    fit_model = model.fit(x_data, y_data)
    predictions = fit_model.predict(test_data)
    # print(predictions)
    return predictions

def linear_model(data, test, response, parameter1=None):
    y_data = data[response].copy()
    x_data = data.drop(columns=[response])
    test_data = test.drop(columns=[response])
    model = LinearRegression()
    fit_model = model.fit(x_data, y_data)
    predictions = fit_model.predict(test_data)

    return predictions

A primitive approach to fitting the model would be to send in all 877 observations. It is primitive because we just established that a patient who did not experience a symptom before the treatment does not experience it afterwards. We also know that the no-treatment group does not change, so their symptoms before will be the same after the non-existing treatment (which might be interpreted as a short time period). Since we consider this a good assumption based on the data analysis, we can "hard-code" it into the model, and only predict for patients who experience the symptom before the treatment. However, we will try the primitive way first, as it might give us some information about how the model and the testing procedure work.

Let us simply fit a linear model on all our data, and cross validate. Note that we send in unreasonably many explanatory variables, but we will fix this later. 

In [95]:
np.random.seed(57)
for symptom in symptom_names:
    symptom_index = f"{symptom}_after"
    print(f"{symptom}: Amount in group: {sum(treat_no_genes[symptom])}")
    cv = cross_validate(treat_no_genes, symptom_index, linear_model, penalized_error, parameter2=0.5, penalty_factor=1)
    print(f"CV for linear_model: {cv}")
    print()

No-Taste/Smell: Amount in group: 49.0
CV for linear_model: 19.0

Fever: Amount in group: 24.0
CV for linear_model: 6.0

Headache: Amount in group: 7.0
CV for linear_model: 1.0

Pneumonia: Amount in group: 34.0
CV for linear_model: 16.0

Stomach: Amount in group: 5.0
CV for linear_model: 3.0

Myocarditis: Amount in group: 13.0
CV for linear_model: 9.0

Blood-Clots: Amount in group: 34.0
CV for linear_model: 15.0

Death: Amount in group: 10.0
CV for linear_model: 2.0



This seems promising. The cross-validation error is lower than the total number of cases for every symptom, which suggests that the linear model does not overfit badly. In addition, since the error is lower, we actually get some true positives. However, let us look at the number of false negatives against false positives. A primitive way to do this is to set penalty_factor to 1000, so that each false negative adds 1000 to the error and each false positive adds 1. We can then read the integer part of the error divided by 1000 as the number of false negatives, and the error modulo 1000 as the number of false positives.

In [96]:
np.random.seed(57)
for symptom in symptom_names:
    symptom_index = f"{symptom}_after"
    print(f"{symptom}: Amount in group: {sum(treat_no_genes[symptom])}")
    cv = cross_validate(treat_no_genes, symptom_index, linear_model, penalized_error, parameter2=0.5, penalty_factor=1000)
    print(f"CV for linear_model: {cv}")
    print()

No-Taste/Smell: Amount in group: 49.0
CV for linear_model: 16003.0

Fever: Amount in group: 24.0
CV for linear_model: 6.0

Headache: Amount in group: 7.0
CV for linear_model: 1000.0

Pneumonia: Amount in group: 34.0
CV for linear_model: 1015.0

Stomach: Amount in group: 5.0
CV for linear_model: 1002.0

Myocarditis: Amount in group: 13.0
CV for linear_model: 6003.0

Blood-Clots: Amount in group: 34.0
CV for linear_model: 12003.0

Death: Amount in group: 10.0
CV for linear_model: 2.0
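
Because penalized_error() multiplies each false negative by penalty_factor and counts each false positive as 1, the totals above can be split into the two error types. The following sketch does this for the printed values; the dictionary simply copies the numbers from the output above.

```python
# CV errors copied from the output above (penalty_factor=1000)
cv_errors = {
    "No-Taste/Smell": 16003.0,
    "Fever": 6.0,
    "Headache": 1000.0,
    "Pneumonia": 1015.0,
    "Stomach": 1002.0,
    "Myocarditis": 6003.0,
    "Blood-Clots": 12003.0,
    "Death": 2.0,
}
penalty_factor = 1000

decoded = {}
for symptom, error in cv_errors.items():
    # Quotient: false negatives, remainder: false positives
    false_negatives, false_positives = divmod(int(error), penalty_factor)
    decoded[symptom] = (false_negatives, false_positives)
    print(f"{symptom}: {false_negatives} false negatives, {false_positives} false positives")
```

The split is only unambiguous while the number of false positives stays below penalty_factor, which holds here since there are 877 observations in total.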



Since the model produces false positives for many of the symptoms, we can easily improve it using the previous observations. If a patient did not experience the relevant symptom before the treatment, the prediction should be 0, which rules out false positives for these patients. This can be hardcoded into the model.
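
One possible way to hardcode such a rule is to post-process the predictions, as in the sketch below. The helper `apply_symptom_rule` and the toy data are hypothetical and only illustrate the idea; this is not necessarily how the rule ends up being implemented later.

```python
import numpy as np
import pandas as pd

def apply_symptom_rule(predictions, test, symptom):
    """
    Sets the prediction to 0 for every patient in "test" who did not have
    "symptom" before the treatment. Other predictions are left unchanged.
    """
    predictions = np.asarray(predictions, dtype=float).copy()
    no_symptom_before = (test[symptom] == 0).to_numpy()
    predictions[no_symptom_before] = 0.0
    return predictions

# Hypothetical toy example: patients 0 and 2 had no fever before treatment
toy_test = pd.DataFrame({"Fever": [0, 1, 0, 1]})
raw_predictions = [0.7, 0.6, 0.1, 0.2]

adjusted = apply_symptom_rule(raw_predictions, toy_test, "Fever")
print(adjusted)
```

A wrapper like this could be applied to the output of any of the model functions before the error is computed, using the symptom name that corresponds to the "_after" response.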

Let us verify the role of the symptom before treatment by looking at the coefficients. We fit the linear model on the whole dataset:

In [114]:
y_data = treat_no_genes["No-Taste/Smell_after"]
x_data = treat_no_genes.drop(columns="No-Taste/Smell_after")

# Use a new name here, so that we do not overwrite the linear_model() function above
lin_reg = LinearRegression()
fit_model = lin_reg.fit(x_data, y_data)
coefficients = pd.DataFrame({"Feature": x_data.columns, "Coefficient": fit_model.coef_})
coefficients

| | Feature | Coefficient |
|---|---|---|
| 0 | Covid-Recovered | 0.005962 |
| 1 | No-Taste/Smell | 0.472467 |
| 2 | Fever | -0.085558 |
| 3 | Headache | 0.028219 |
| 4 | Pneumonia | -0.025346 |
| 5 | Stomach | 0.015724 |
| 6 | Myocarditis | 0.020841 |
| 7 | Blood-Clots | -0.017383 |
| 8 | Death | 0.024377 |
| 9 | Age | -0.003282 |


As we see, all of the coefficients are small compared to the one for No-Taste/Smell. This means that the linear regression managed to pick up the correct variable, in line with our observation. However, most of the other variables seem to be noise, so let us try to remove them.