In [61]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import binom
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.neighbors import KNeighborsClassifier
np.random.seed(57)

# Note: This is not the final version. It will be uploaded monday 25.10.2021 evening. 

# Exercise 1
## a)
In this section, we explore whether the genes, age and comorbidities can predict any of the symptoms. The method is a simple correlation check. We outline it briefly here and give the specifics together with the code below.

We want to find which of the explanatory variables (age, genes and comorbidities) have a high correlation with each of the responses (symptoms). We compute all of these correlations and select the variables whose absolute correlation exceeds a threshold. We choose the threshold in advance, based on a calculation: assuming the columns are independent of the response, we work out the largest absolute correlation we would expect to see in most cases. We then test the procedure on synthetic data, generated so that we know which columns are correlated with the response while the rest are random. Once we have checked that the selection method works on this data, we run it on the actual data.

This approach assumes that if a variable is related to a response, the correlation between them will be large. The drawback of this assumption is that a set of variables can be jointly related to the response without any single variable being correlated with it on its own. Testing for such joint effects, however, requires more computing power.
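The drawback described above can be illustrated with a small toy example. This is only a sketch on made-up data (the names `gene_a`, `gene_b` and `symptom` are hypothetical and not taken from the dataset): the symptom is the exclusive-or of two binary genes, so each gene alone is uncorrelated with it, while the pair determines it completely.

```python
import itertools
import math

def pearson(xs, ys):
    """Pearson correlation using population (divide-by-n) moments."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

# Hypothetical toy data: every combination of two binary genes, repeated.
rows = list(itertools.product([0, 1], repeat=2)) * 250
gene_a = [a for a, _ in rows]
gene_b = [b for _, b in rows]
symptom = [a ^ b for a, b in rows]  # present iff exactly one gene is set

cor_a = pearson(gene_a, symptom)      # single gene against the symptom
cor_b = pearson(gene_b, symptom)
cor_joint = pearson([a ^ b for a, b in rows], symptom)  # combined feature
print(cor_a, cor_b, cor_joint)
```

A univariate correlation threshold would discard both genes here, even though together they predict the symptom perfectly.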

Now, let's get into the code and the details. Firstly, we made a function for reading the data and naming the columns accordingly. 


In [3]:
def init_features(data):
        """
        Initialize names for observation features and treatment features
        """
        features_data = pd.read_csv(data)
        features = []
        features += ["Covid-Recovered", "Covid-Positive",
            "No-Taste/Smell",  "Fever", "Headache", "Pneumonia",
            "Stomach", "Myocarditis", "Blood-Clots", "Death"]
        features += ["Age", "Gender", "Income"]
        features += ["Genome" + str(i) for i in range(1, 129)]
        features += ["Asthma", "Obesity", "Smoking", "Diabetes", 
                     "Heart disease", "Hypertension"]
        features += ["Vaccination status" + str(i) for i in range(1, 4)]
        features_data.columns = features
        return features_data

We also made functions for calculating the correlation and for the subset selection based on it. We simply test every column in the dataframe against the response and select it iff the absolute value of its correlation exceeds some threshold.

In [4]:
def correlation(col1, col2):
    """
    Calculates the correlation (pearson correlation) between col1 and col2.
    Cor(X, Y) = Sum (x_i - mu_x) (y_i - mu_y) / (std(X) * std(Y) * n)
    Divides by n and not (n-1), as some functions do. 
    """
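    col1 = np.asarray(col1, dtype=float)
    col2 = np.asarray(col2, dtype=float)
    mean1 = np.mean(col1)
    mean2 = np.mean(col2)
    # Vectorized sum of cross-deviations; also avoids shadowing the built-in sum()
    cov_sum = np.sum((col1 - mean1) * (col2 - mean2))
    cor = cov_sum / (np.std(col1) * np.std(col2) * len(col1))
    return cor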

def correlation_select(data, response, correlation_threshold=0.01):
    """
    In:
        data (pd.DataFrame): ((m, n)) sized frame of explanatory variables.
        response (np.array): (m) sized array of the response.
        correlation_threshold (float): Threshold for when the absolute
            correlation is high enough for a variable to be chosen.
    Out:
        selected_columns (list): List of the indexes of the columns that are
            chosen, with the corresponding correlation. [[1, cor1], [2, cor2], ... ]
            
    Feature selection based on univariate correlation between a column and the
    response. Looks at each column in "data" independently and calculates
    the correlation between it and the response. Iff its absolute value is 
    over "correlation_threshold" it is chosen. 
    """
    selected_columns = []
    data = data.to_numpy() # This runs a bit faster
    for i in range(data.shape[1]):
        # cor = correlation(response, data[:, i])
        cor = np.corrcoef(response, data[:, i])[1, 0] # Correlation, vectorized
        if abs(cor) > correlation_threshold:
            selected_columns.append([i, cor])
    return selected_columns

We will now explore what to use as a threshold for the subset selection. The dataset has 100000 rows and about 150 columns. Even if a column is drawn from a distribution independent of the response, the calculated correlation will still differ slightly from 0. Therefore, we want to find a correlation that is _very unlikely_ to be produced by random data. Hypothesis tests often use the 95% or 99% level, but we do not consider this sufficient. Since we test over 100 columns, a threshold that random data exceeds 1% of the time would still lead us to expect about one column to be chosen purely by chance. Our threshold therefore needs to be stricter. We decided to use the correlation that only 0.1% of random columns would reach, although the exact figure of 0.1% was chosen arbitrarily.
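The arithmetic behind this argument can be sketched as follows. The column count of 150 is the approximate figure from the text, and treating the columns as independent of each other is an assumption made only for this illustration.

```python
n_columns = 150  # approximate number of columns, as stated in the text

false_positive_summary = {}
for tail_prob in (0.05, 0.01, 0.001):
    # Expected number of purely random columns that pass the threshold
    expected_false = n_columns * tail_prob
    # Chance that at least one random column passes (assumes independent columns)
    p_at_least_one = 1 - (1 - tail_prob) ** n_columns
    false_positive_summary[tail_prob] = (expected_false, p_at_least_one)
    print(f"tail {tail_prob}: expected {expected_false:.2f}, "
          f"P(at least one) = {p_at_least_one:.3f}")
```

At the 1% level more than one random column is expected to be selected, while at the 0.1% level the expected number drops well below one.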

Now we need to find out how high a correlation the most correlated 0.1% of random columns are expected to have. We assume that the response and the variables are binary with means of approximately 0.5. For each row, the explanatory variable then has a 50% chance of being equal to the response. Using the cumulative distribution function of the binomial distribution, we find the number of matching entries to expect:

In [5]:
print(binom.cdf(k=49511, n=100000, p=0.5))

0.0010022415200593084


In other words, in the random case only 0.1% of the columns match the response in 49511 rows or fewer and, by symmetry, only 0.1% match in 100000 - 49511 = 50489 rows or more. Now we calculate the correlation this corresponds to:

In [6]:
n = 100000
k = 49511
col1 = np.zeros(n)
col2 = np.zeros(n) 
# I want the mean to be close to 0.5, and columns equal in n - k inputs. 
for i in range(int(n/2)):
    col1[i] = 1
    col2[2*i] = 1
for i in range(int(n/2 - k)):
    col1[2*i] = 0

diff = abs(col1-col2)
correlation(col1, col2)

-0.009780467754231225

For simplicity and a little extra safety, we round up to an absolute value of 0.01.
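As a rough cross-check of this number, one can use a normal approximation instead of the exact binomial calculation. This is a sketch under the same assumption of balanced binary columns, not the method used above: for two binary columns that both have mean exactly 0.5 and agree in a fraction a of the rows, the Pearson correlation is 2a - 1.

```python
import math
from statistics import NormalDist

n = 100_000
tail_prob = 0.001

# Binomial(n, 0.5) has mean n/2 and standard deviation sqrt(n)/2,
# so the 0.1% quantile is approximately n/2 + z * sqrt(n)/2.
z = NormalDist().inv_cdf(tail_prob)
k_approx = n / 2 + z * math.sqrt(n) / 2

# For balanced binary columns agreeing in a fraction a of rows, r = 2a - 1,
# which simplifies to z / sqrt(n) here.
r_approx = 2 * (k_approx / n) - 1

print(f"approximate k: {k_approx:.1f}")
print(f"approximate correlation: {r_approx:.5f}")
```

Both numbers land close to the values found above (k = 49511 and a correlation of roughly 0.0098 in absolute value).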

Note that there are some weaknesses in our method. The means are not exactly 0.5, so the calculation will be slightly off. However, the 0.1% level is fairly strict, so a correlation above 0.01 is unlikely to be purely random. 

In our data, the means of several columns are far from 0.5. Let us see what they actually are. 

In [7]:
data = init_features("observation_features.csv") # Initialization of data
genomes = data.iloc[:, 13:141] # Columns corresponding to Genomes
age = data.iloc[:, 10] # Age
comorbidities = data.iloc[:, 141:147] # All of comorbidities
symptoms = data.iloc[:, :10]
vaccines = data.iloc[:, -3:]
df = pd.DataFrame(age).join(genomes.join(comorbidities))
responses = symptoms
vaccine_status = data.iloc[:, -3:] # Columns corresponding to vaccines
vaccines = vaccine_status.join(symptoms) # For 1b). 

In [8]:
for i in range(10):
    print(f"{df.columns[i]}: {df.iloc[:, i].mean():.4f}")
for symptom in symptoms.columns:
    print(f"{symptom}: {symptoms[symptom].mean():.4f}")

Age: 33.0295
Genome1: 0.5013
Genome2: 0.5007
Genome3: 0.5007
Genome4: 0.5005
Genome5: 0.5013
Genome6: 0.4984
Genome7: 0.5024
Genome8: 0.4985
Genome9: 0.4987
Covid-Recovered: 0.0491
Covid-Positive: 0.2203
No-Taste/Smell: 0.0141
Fever: 0.0585
Headache: 0.0321
Pneumonia: 0.0094
Stomach: 0.0028
Myocarditis: 0.0036
Blood-Clots: 0.0099
Death: 0.0033


To avoid cluttering the output further, I do not print all of the variables, but a clear trend appears if we do. The genes are centered around a mean of 0.5. The comorbidities are not, and the symptoms definitely are not. This makes the correlation higher for columns that are similar in 50489 rows. There is much we could have checked, but let us try one column with mean 0.5 (representing a gene) and one with mean around 0.25 (representing the Covid-Positive column). 

In [9]:
col1 = np.zeros(n)
col2 = np.zeros(n)
for i in range(int(n/2)):
    col1[i] = 1
for i in range(int(n/4)):
    col2[i*4] = 1

for i in range(int(n/2 - k)):
    col1[4*i] = 0

diff = abs(col1-col2)
print(np.mean(diff))
# 0.50489
print(correlation(col1, col2))

0.50489
-0.016940267072052973


As we see, the absolute value of the correlation is higher, about 0.017. For columns with means even further from 0.5 it will be higher still, but I will use 0.017 to be sure to catch all of the relevant columns (we will see later that no column really meets even this requirement, so a higher threshold would not change much). For the synthetic data, however, the means are 0.5, so there I will use 0.01 to confirm that our method works.

Now we can proceed with the actual selection, first by testing on correlated data. We made our own data generator. It takes the dimensions of the data, the number of columns that should be correlated with the response, and how strongly they should be correlated. The response consists of random binary entries with an expected mean of 0.5. The comments should make the code fairly readable. 

In [10]:
def create_correlated_data(num_col, num_cor, num_row, prob=0.5):
    """
    In:
        num_col (int): Number of columns in the matrix
        num_cor (int): Number of the columns that should be correlated with 
            the response. num_cor <= num_col.
        num_row (int): Number of observations.
        prob (float): Probability for correlated columns to be equal to the
            response. If not, value is random.
    Out:
        data (pd.DataFrame): ((num_row, num_col)) matrix of data, where the 
            num_cor first columns are correlated with the response.
        response (pd.Series): (num_row) size series of the response, which is
            randomly chosen 0 or 1 for each input. 
            
    Creates correlated data. The response is randomly chosen 0 or 1 with a
    probability of 0.5. Then a matrix of size (num_row, num_col) is created, 
    where the first num_cor columns are correlated with the response, and the 
    rest (num_col - num_cor) is randomly generated. 
    
    For each of the correlated columns, they are chosen equal to the response
    with a probability of "prob". If not, they are randomly chosen 0 or 1 with 
    a probability of 0.5. 
    """
    response = np.random.randint(2, size=num_row) # Random response
    data = np.zeros((num_row, num_col))
    for i in range(num_cor): # Fill in value for matrix
        for j in range(num_row):
            coin_flip = np.random.uniform()
            if coin_flip < prob: 
                data[j, i] = response[j] # Correlated column sets equal to response
            else: 
                coin_flip = np.random.uniform() # Correlated column is set random
                if coin_flip < 0.5:
                    data[j, i] = 0
                else:
                    data[j, i] = 1
    for i in range(num_cor, num_col): # The rest of the columns are random
        data[:, i] = np.random.randint(2, size=num_row)
    return pd.DataFrame(data), pd.Series(response)
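To get a feeling for what `prob` means in the generator above, here is a small standalone simulation. It is a sketch that mimics the generator for a single column using only the standard library; the function name `simulate_correlation` and the sample size are hypothetical choices. Under the generator's scheme the expected correlation of a correlated column with the response works out to `prob` itself, because the covariance is `prob * 0.25` and both variances are 0.25.

```python
import math
import random

def simulate_correlation(prob, num_row, seed=0):
    """Correlation between a random binary response and one column that
    copies the response with probability prob and is random otherwise."""
    rng = random.Random(seed)
    response = [rng.randint(0, 1) for _ in range(num_row)]
    column = [y if rng.random() < prob else rng.randint(0, 1) for y in response]
    mx = sum(column) / num_row
    my = sum(response) / num_row
    cov = sum((x - mx) * (y - my) for x, y in zip(column, response)) / num_row
    sx = math.sqrt(sum((x - mx) ** 2 for x in column) / num_row)
    sy = math.sqrt(sum((y - my) ** 2 for y in response) / num_row)
    return cov / (sx * sy)

cor_half = simulate_correlation(prob=0.5, num_row=20_000)
cor_tenth = simulate_correlation(prob=0.1, num_row=20_000)
print(cor_half, cor_tenth)
```

With `prob=0.5` the correlated columns therefore sit far above a threshold of 0.01, so the synthetic experiment mainly confirms that strong signals are found and that purely random columns are rejected.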

Now for the actual experiment. We make a dataset of 100000 rows and 150 columns, of which the first 10 are correlated with the response, and set the probability in the data generator to 0.5. 

In [11]:
data, response = create_correlated_data(150, 10, 100000, prob=0.5)

In [12]:
cor_list = correlation_select(data, response, 0.01)

In [13]:
np.asarray(cor_list)[:, 0]

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

This means that our selection chose the first 10 columns and nothing else, exactly as we wanted. Let us now run the method on the actual data:

In [14]:
correlation_list = []
for i in range(len(symptoms.columns)):
    # print(f"Symptom: {responses.columns[i]}")
    correlation_list.append(correlation_select(df, responses.iloc[:, i], 0.017))

In [15]:
correlation_list # Make output nice

[[], [], [], [], [], [], [], [], [], []]

### Conclusion for 1a)

As we see, no variable has a 'high enough' correlation with any of the symptoms to be relevant, according to our last definition of relevant. If one runs the test with 0.01 instead of 0.017, about 10 variables have a correlation between 0.01 and 0.015. By this method, we conclude that age, genes and comorbidities do not predict the symptoms in any meaningful way. There is some correlation, but we consider it noise. If we only wanted to predict, and not to explain, we might still have used some of the variables. In that case we would not care about the statistical insignificance, only about whether the test error decreased. Using some of the variables would probably still be better than guessing, but not by enough to trust that the variables have explanatory power. 

The big drawback here is that we only test one variable at a time. We might expect a cluster of genes, rather than an individual gene, to have predictive power. However, we were not able to find that either. Since this is a big drawback, we do not consider this method to rule out the possibility that the genes are relevant to the symptoms. We do, however, have some evidence that the genes (and comorbidities) are seemingly unrelated to the symptoms. 

## b)

We will now explore the efficacy of the vaccines. The vaccinated patients are tested against the unvaccinated ones, and we also test the three vaccines individually. We look at two questions: 1. How likely is a patient to be Covid-Positive, given that they have or have not received a vaccine? 2. How likely is a person to die from Covid (be both Covid-Positive and experience Death), given that they have or have not received a vaccine? This is formalized as a test on two binomial proportions, whose test statistic is approximately standard normal for large data sets (both groups are larger than 40, as they are here). The problem is then to test whether there is a statistically significant difference between the vaccinated and the unvaccinated group and, if so, how big the difference is (is it clinically significant?). 

Note that we do not cover Covid-Recovered in b) and c), because we were not able to interpret it. We do not know whether the recovery happened before or after the vaccination, or a mix of both. We therefore concluded that the danger of using it incorrectly was worse than simply leaving it out. We still consider this a possible weakness of our analysis.

We start by initializing the data sets. 

In [16]:
# Note that vaccine1, vaccine2 and vaccine3 are disjoint sets. 
vaccine1 = vaccines[vaccines["Vaccination status1"] == 1] # Taken first vaccine
vaccine2 = vaccines[vaccines["Vaccination status2"] == 1]
vaccine3 = vaccines[vaccines["Vaccination status3"] == 1]
no_vaccine = vaccines[(vaccines["Vaccination status1"] == 0.0) &
                      (vaccines["Vaccination status2"] == 0.0) &
                      (vaccines["Vaccination status3"] == 0.0)]
any_vaccine = vaccines[(vaccines["Vaccination status1"] == 1) |
                       (vaccines["Vaccination status2"] == 1) |
                       (vaccines["Vaccination status3"] == 1)]

We now define our hypothesis tests. As given on page 521 in _Modern Mathematical Statistics with Applications (Devore, Berk)_, the statistic \\[ z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n} + \frac{1}{m}\right)}} \\] with 
\\[\hat{p} = \frac{X + Y}{n + m}\\] is approximately standard normally distributed. Here n and m are the sizes of group 1 and group 2, respectively, X and Y are the numbers of positives in the two groups, and \\(\hat{p}_1 = X/n\\) and \\(\hat{p}_2 = Y/m\\). We can therefore use this statistic to test our hypothesis. 

Our null hypothesis is \\[ H_0: p_1 = p_2\\] where \\(p_1\\) is the proportion in a vaccinated group and \\(p_2\\) is the proportion in the unvaccinated group. We reject this hypothesis if the test statistic reaches 2.58 in absolute value, which corresponds to a 99% confidence level. We chose this level because we have many observations: a test statistic between 1.96 (corresponding to the 95% level) and 2.58 would imply that the variance is fairly high or that the difference between the proportions is fairly small. 

Note that the way the test is set up, a positive z-value means that the vaccinated group has more cases of something, while a negative value means it has fewer. Here we are therefore mainly interested in negative values, since they mean that the vaccines are "working": fewer people become Covid-Positive and die. In the next task, 1c), however, we will look at positive z-values, since they correspond to an _increased risk of side effects_. 
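Before applying the test to the data, the statistic can be tried on small made-up counts. The sketch below is a standalone version of the formula above using only the standard library; the function name `two_proportion_z` and the counts are hypothetical, and the two-sided p-value is an addition for illustration.

```python
import math
from statistics import NormalDist

def two_proportion_z(x, n, y, m):
    """Pooled two-proportion z statistic and two-sided p-value.

    x, y: number of positives in group 1 and group 2
    n, m: sizes of group 1 and group 2
    """
    p1, p2 = x / n, y / m
    p_hat = (x + y) / (n + m)  # pooled proportion under H0
    se = math.sqrt(p_hat * (1 - p_hat) * (1 / n + 1 / m))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Made-up example: 30 of 100 positives in group 1, 50 of 100 in group 2.
z, p_value = two_proportion_z(30, 100, 50, 100)
print(f"z = {z:.4f}, p = {p_value:.4f}")
```

In this made-up example group 1 has fewer positives, so z is negative, and its absolute value exceeds the cutoff of 2.58.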

Finally, here are some functions for calculating the tests:

In [17]:
def vaccine_sick_test(df1, df2):
    """
    Hypothesis test on df1 and df2 of whether you are more likely to get Covid 
    in group df1 than in group df2.
    """
    n1 = len(df1) # Number of people in each population
    n2 = len(df2)
    s1 = len(df1[df1["Covid-Positive"] == 1]) # Number of "positives" in each groups
    s2 = len(df2[df2["Covid-Positive"] == 1])
    p1 = s1/n1 # Ratio of Covid-Positives over whole group
    p2 = s2/n2
    p_hat = (s1 + s2)/(n1 + n2) # Parameter for test statistic
    z_value = (p1 - p2)/np.sqrt((p_hat*(1-p_hat)*(1/n1 + 1/n2)))
    return p1, p2, z_value

def vaccine_death_test(df1, df2):
    """
    Hypothesis test on if you are more likely to die from covid in df1 than df2.
    We only look at people with covid in both groups.
    """
    n1 = len(df1[df1["Covid-Positive"] == 1]) # Number of Covid-Positive people in each group
    n2 = len(df2[df2["Covid-Positive"] == 1])
    s1 = len(df1[(df1["Covid-Positive"] == 1) & (df1["Death"] == 1)]) # Covid deaths in each group
    s2 = len(df2[(df2["Covid-Positive"] == 1) & (df2["Death"] == 1)])
    p1 = s1/n1 # Ratio of Covid-Deaths among the Covid-Positive
    p2 = s2/n2
    p_hat = (s1 + s2)/(n1 + n2) # Parameter for test statistic
    z_value = (p1 - p2)/np.sqrt((p_hat*(1-p_hat)*(1/n1 + 1/n2)))
    return p1, p2, z_value

def get_df_name(df):
    """
    Getting name of a dataframe.
    """
    name =[x for x in globals() if globals()[x] is df][0]
    return name
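The `globals()` lookup above returns the first global name bound to the object, which can be ambiguous when several names refer to the same DataFrame (a loop variable, for instance). A possible alternative, sketched here with plain lists as hypothetical stand-ins for the DataFrames, is to keep the labels in a dictionary and iterate over its items.

```python
# Plain lists stand in for the DataFrames to keep the sketch self-contained.
group_a = [1, 0, 1]
alias_of_a = group_a  # a second global name bound to the same object

# Explicit labels: no lookup through globals() is needed.
labelled_groups = {"any_vaccine": group_a, "vaccine1": [0, 1]}

report_lines = []
for label, group in labelled_groups.items():
    report_lines.append(f"Testing group {label} ({len(group)} rows)")
print("\n".join(report_lines))
```

The label printed for each group is then exactly the dictionary key, regardless of how many other names point at the same object.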

In [18]:
groups = [any_vaccine, vaccine1, vaccine2, vaccine3]

for group in groups:
    p1_s, p2_s, z_s = vaccine_sick_test(group, no_vaccine)
    p1_d, p2_d, z_d = vaccine_death_test(group, no_vaccine)
    print()
    print(f"Testing group {get_df_name(group)} against non-vaccinated")
    print(f"Percentages of people who got covid:")
    print(f"    Vaccinated: {p1_s*100:.4f}, Unvaccinated: {p2_s*100:.4f}")
    print(f"    z_value: {z_s}")
    print(f"Percentages of Covid-Positive persons that died:")
    print(f"    Vaccinated: {p1_d*100:.4f}, Unvaccinated: {p2_d*100:.4f}")
    print(f"    z_value: {z_d}")


Testing group any_vaccine against non-vaccinated
Percentages of people who got covid:
    Vaccinated: 19.8556, Unvaccinated: 25.2822
    z_value: -20.288224483052876
Percentages of Covid-Positive persons that died:
    Vaccinated: 0.2352, Unvaccinated: 2.7959
    z_value: -16.054053528345055

Testing group vaccine1 against non-vaccinated
Percentages of people who got covid:
    Vaccinated: 22.4264, Unvaccinated: 25.2822
    z_value: -7.654631648367728
Percentages of Covid-Positive persons that died:
    Vaccinated: 0.1354, Unvaccinated: 2.7959
    z_value: -10.58646105278172

Testing group vaccine2 against non-vaccinated
Percentages of people who got covid:
    Vaccinated: 20.0781, Unvaccinated: 25.2822
    z_value: -14.161658124564054
Percentages of Covid-Positive persons that died:
    Vaccinated: 0.2492, Unvaccinated: 2.7959
    z_value: -9.582089209626343

Testing group vaccine3 against non-vaccinated
Percentages of people who got covid:
    Vaccinated: 17.1234, Unvaccinated: 25.2

Let us try to interpret the results. Firstly, all of the tests are highly statistically significant. The Covid-Positive statistic is in most cases even larger in absolute value than the death statistic, which is expected since there are far fewer deaths than covid cases. Even the smallest z-value is 7.65 in absolute value, which is still extremely significant. But what do the results mean? To answer that, let us compare the ratios of the means for the different groups:

In [19]:
groups = [any_vaccine, vaccine1, vaccine2, vaccine3]


In general, we see that the vaccines are much better at preventing deaths than at preventing covid cases. The share of Covid-Positive patients is only about 13-48% higher among the unvaccinated than in the vaccinated groups (a reduction of roughly 11-32% for the vaccinated), whereas covid-related deaths are reduced by a factor of 7 to 20. We also see that the vaccines work in different ways. Vaccine1 gives the largest decrease in deaths but the smallest decrease in sickness. Vaccine3 works the other way around: it decreases sickness by about a third, but its decrease in deaths is less than half of that of vaccine1. Finally, vaccine2 lies between these two. 
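The two ways of expressing the difference in Covid-Positive rates can be recomputed from the percentages printed above. This is a sketch that hard-codes those printed values; the variable names are hypothetical.

```python
# Covid-Positive percentages as printed in the output above
unvaccinated_rate = 25.2822
vaccinated_rates = {"vaccine1": 22.4264, "vaccine2": 20.0781, "vaccine3": 17.1234}

reduction = {}  # relative reduction among the vaccinated
excess = {}     # how much higher the unvaccinated rate is
for name, rate in vaccinated_rates.items():
    reduction[name] = 1 - rate / unvaccinated_rate
    excess[name] = unvaccinated_rate / rate - 1
    print(f"{name}: reduction {reduction[name]*100:.1f}%, "
          f"unvaccinated excess {excess[name]*100:.1f}%")
```

The same pair of rates gives a noticeably different percentage depending on which group is used as the baseline, so it is worth stating the baseline whenever a percentage is quoted.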

However, this death comparison is slightly unfair to vaccine3, because its decrease in deaths appears artificially small compared to the other vaccines. Since this vaccine is better at preventing covid in the first place, the number of people with covid in the group is lower. This makes the denominator smaller and the share of covid patients who die higher. One could avoid this problem by also looking at the total death toll in the groups, not just the strictly covid-related deaths. One would expect some people to die from causes unrelated to covid as well; however, it turns out that the unvaccinated group has no non-covid deaths, which makes this comparison somewhat awkward. Let us try it anyway: I do the same death test, but now I divide by the size of the whole group, not just the Covid-Positive patients.

In [20]:
def death_test(df1, df2):
    n1 = len(df1) # Number of people in each population
    n2 = len(df2)
    s1 = len(df1[(df1["Covid-Positive"] == 1) & (df1["Death"] == 1)]) 
    s2 = len(df2[(df2["Covid-Positive"] == 1) & (df2["Death"] == 1)])
    p1 = s1/n1 # Ratio of Covid-Deaths over whole group
    p2 = s2/n2
    p_hat = (s1 + s2)/(n1 + n2) # Parameter for test statistic
    z_value = (p1 - p2)/np.sqrt((p_hat*(1-p_hat)*(1/n1 + 1/n2)))
    return p1, p2, z_value

for group in groups:
    p1, p2, z = death_test(group, no_vaccine)
    print(f"Percentages of people who got died:")
    print(f"    Vaccinated {get_df_name(group)}: {p1*100:.4f}, Unvaccinated: {p2*100:.4f}")
    print(f"    z_value: {z}")

Percentages of people who got died:
    Vaccinated any_vaccine: 0.0467, Unvaccinated: 0.7069
    z_value: -18.370370501529564
Percentages of people who got died:
    Vaccinated vaccine1: 0.0304, Unvaccinated: 0.7069
    z_value: -11.219292170093974
Percentages of people who got died:
    Vaccinated vaccine2: 0.0500, Unvaccinated: 0.7069
    z_value: -10.881324677154156
Percentages of people who got died:
    Vaccinated vaccine3: 0.0594, Unvaccinated: 0.7069
    z_value: -10.75217133950315


We now see that the differences in deaths are not that big, at least not between vaccine2 and vaccine3. Vaccine1, however, is somewhat better at preventing deaths. We will see in the next part that vaccine2 has a high non-covid death toll, so if we considered all deaths, vaccine2 would have about twice as many deaths as vaccine1 and vaccine3. 

From this we conclude that the vaccine efficacy is highly statistically significant, but that the vaccines have a much bigger impact on covid-related deaths than on preventing covid in the first place. The vaccines have different strengths and weaknesses. To choose which vaccine to prefer, one would have to decide whether preventing covid or preventing covid-related deaths is the goal, and we are not going to make this decision. 

Let us consider the weaknesses of this approach. We did not consider all of the data: the genes and comorbidities were left out. This was partly because they did not seem very relevant in a), but also because it would be a lot more work, as the number of possible combinations is high. Nevertheless, there could be a causal relationship between a comorbidity and the outcome of a specific vaccine. Other methods to consider are logistic regression and Bayesian credible intervals. 
We also did not consider Covid-Recovered, as mentioned at the beginning of b). This might have removed some useful patterns in the data.

## c)

For this section, we explore whether the vaccines are likely to cause any side effects. We formalize this as how likely one is to _not_ be Covid-Positive but still have one of the other symptoms. We compare each vaccinated group against the group that is not vaccinated, using the same hypothesis test as in b). Again, we do not consider Covid-Recovered, for the same reason as in b). 

In [21]:
def side_effect_test(df1, df2, symptom):
    """
    In: 
        df1: (df) DataFrame of population1
        df2: (df) DataFrame of population2
        symptom: (str) Column name corresponding to the symptom to be tested.
    Out:
        p1: (float) Share of people in df1 that have the symptom without
            being Covid-Positive
        p2: (float) Share of people in df2 that have the symptom without
            being Covid-Positive
        z_value: (float) z-value according to the hypothesis test (see below). 
    Tests if there is a significant increase in the probability of getting the 
    "symptom" as a side effect in df1 compared to df2, or vice versa. A side 
    effect is defined as having a symptom, but not being Covid-Positive. 
    
    Hypothesis test (approximated standard normal for binomial data, page 521
    in "Modern Mathematical Statistics with Applications", Devore, Berk). 
    z = (p1 - p2)/sqrt(p_hat (1-p_hat) (1/n + 1/m))
    where p1 and p2 are the ratios between positives and size of sample 1 and 2, 
    respectively, n and m are the sizes of sample 1 and 2, respectively, and 
    p_hat is (X + Y)/(n + m), where X and Y are the positives in sample 1 and 2, 
    once again respectively. 
    """
    sample1 = df1[df1["Covid-Positive"] == 0] 
    sample2 = df2[df2["Covid-Positive"] == 0]
    n1 = len(df1) # Number of people in each population (Covid-Positive included)
    n2 = len(df2)
    s1 = len(sample1[sample1[symptom] == 1]) # Number of non-Covid-Positive people having the symptom
    s2 = len(sample2[sample2[symptom] == 1])
    p1 = s1/n1 # Share of the whole group with the symptom but without Covid
    p2 = s2/n2
    p_hat = (s1 + s2)/(n1 + n2) # Parameter for test statistic
    z_value = (p1 - p2)/np.sqrt((p_hat*(1-p_hat)*(1/n1 + 1/n2)))
    return p1, p2, z_value

We start by looking at the vaccinated patients as a whole group, and then divide them by specific vaccine. We prespecify that we are interested in z-values larger than 2.58, since with this much data, z-values between 1.96 and 2.58 would correspond to a fairly small difference or a fairly high variance. 
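Since eight symptoms are tested at once, one could also ask how the cutoff would change under a correction for multiple comparisons. The sketch below uses a Bonferroni correction, which is not part of the analysis here and is shown only as an assumption-laden illustration; the function name `bonferroni_z` is hypothetical.

```python
from statistics import NormalDist

def bonferroni_z(alpha, num_tests):
    """Two-sided z cutoff when alpha is split evenly over num_tests tests."""
    return NormalDist().inv_cdf(1 - alpha / (2 * num_tests))

single_cutoff = bonferroni_z(0.01, 1)  # the usual two-sided 99% cutoff
eight_cutoff = bonferroni_z(0.01, 8)   # eight symptoms tested together
print(f"one test: {single_cutoff:.3f}, eight tests: {eight_cutoff:.3f}")
```

With a single test the function reproduces the cutoff of about 2.58 used above, and the cutoff rises as the number of simultaneous tests grows.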

In [22]:
# Side effects for vaccinated vs un-vaccinated
symptom_names = ["No-Taste/Smell", "Fever", "Headache", "Pneumonia", "Stomach", 
                 "Myocarditis", "Blood-Clots", "Death"]
for symptom in symptom_names: 
    p1, p2, z_value = side_effect_test(any_vaccine, no_vaccine, symptom)
    print(f"Symptom {symptom}: Vaccinated {p1*100:.4f}, unvaccinated: {p2*100:.4f}")
    print(f"    The z-value is {z_value:.4f}")

Symptom No-Taste/Smell: Vaccinated 0.0650, unvaccinated: 0.0425
    The z-value is 1.4787
Symptom Fever: Vaccinated 7.9366, unvaccinated: 0.3797
    The z-value is 54.1827
Symptom Headache: Vaccinated 4.4711, unvaccinated: 0.7868
    The z-value is 33.4855
Symptom Pneumonia: Vaccinated 0.1067, unvaccinated: 0.1074
    The z-value is -0.0318
Symptom Stomach: Vaccinated 0.1851, unvaccinated: 0.1923
    The z-value is -0.2580
Symptom Myocarditis: Vaccinated 0.1718, unvaccinated: 0.0350
    The z-value is 6.2005
Symptom Blood-Clots: Vaccinated 0.1951, unvaccinated: 0.0824
    The z-value is 4.5118
Symptom Death: Vaccinated 0.0284, unvaccinated: 0.0000
    The z-value is 3.3693


The most prominent side effects are fever and headache, which occur about 20 and 6 times as often among the vaccinated, respectively. No-Taste/Smell shows a 50% increase, but since there are few cases (0.065%), it is not statistically significant. Pneumonia and Stomach actually have slightly fewer cases, but since the z-values are so close to zero, this is most likely caused by random fluctuation. We also see increases for myocarditis and blood clots, of about 5 times and 2 times. The result for death is a little strange. There is a statistically significant increase in deaths, but there are no non-covid deaths in the unvaccinated group. One would expect some people to die from causes other than covid, so the data might be unreliable here. In all, there are quite a few side effects, but fortunately most of them are fever and headache, which are not severe. I would, however, be somewhat concerned about the deaths. 

Now let us look at the individual groups. 

In [23]:
groups = [vaccine1, vaccine2, vaccine3]
for symptom in symptom_names: 
    print()
    for i in range(3): 
        p1, p2, z_value = side_effect_test(groups[i], no_vaccine, symptom)
        print(f"Symptom {symptom}: Vaccine{i+1} {p1*100:.4f}, unvaccinated: {p2*100:.4f}")
        print(f"    The z-value is {z_value:.4f}")


Symptom No-Taste/Smell: Vaccine1 0.0759, unvaccinated: 0.0425
    The z-value is 1.6639
Symptom No-Taste/Smell: Vaccine2 0.0700, unvaccinated: 0.0425
    The z-value is 1.4018
Symptom No-Taste/Smell: Vaccine3 0.0495, unvaccinated: 0.0425
    The z-value is 0.3833

Symptom Fever: Vaccine1 7.6374, unvaccinated: 0.3797
    The z-value is 50.7963
Symptom Fever: Vaccine2 7.8051, unvaccinated: 0.3797
    The z-value is 51.5053
Symptom Fever: Vaccine3 8.3589, unvaccinated: 0.3797
    The z-value is 53.7220

Symptom Headache: Vaccine1 4.2666, unvaccinated: 0.7868
    The z-value is 29.0434
Symptom Headache: Vaccine2 4.4079, unvaccinated: 0.7868
    The z-value is 29.9186
Symptom Headache: Vaccine3 4.7334, unvaccinated: 0.7868
    The z-value is 31.8206

Symptom Pneumonia: Vaccine1 0.0911, unvaccinated: 0.1074
    The z-value is -0.5873
Symptom Pneumonia: Vaccine2 0.1501, unvaccinated: 0.1074
    The z-value is 1.4144
Symptom Pneumonia: Vaccine3 0.0791, unvaccinated: 0.1074
    The z-value is 

Interesting enough, there is high varriance in the different vaccine groups, so we see clear strenght and weaknesses between the vaccines. I will not cover every point, but we see some important differences. Vaccine2 has a much higher death toll, and vaccine3 has the lowest, so low that it is not statisticly significant. Vaccine2 also has a very high blood-clots occurence, while vaccine1 has an insignificant increase and 3 actually has less cases than for non-vaccinated. For Myocardis, vaccine1 and 2 has insigificant changes, but vaccine3 has a very significant 12 times increase. Stomach, Taste and Pneumia are still insignifficant. There are slight varriations to Fever and Headache, but I do not consider them clinicly significant (among the different vaccine groups, they are definitly clinicly significant between vaccinated and non-vaccinated). 

Altough there are some strength and weaknesses to each vaccines, and a medical proffesional should do conclusion and not us, it appears that vaccine3 might be the best vaccine. It decreases the covid cases by the most, and does not have statisticly significant increase in deaths, which is obviously the worst side-effect. Blood clots can be severe, and vaccine3 has not increase here, as opposed to the other vaccines. The weakness of vaccine3 is that it has a clear increase in the risk of myocarditis. I assume that based on these finding, vaccine3 will be the vaccine of choice for patient who do not have a personal increased risk related to myocarditis, decided by a medical professional. Vaccine2 has an substanial increase in the very severe death and blood clots, so I would advise not to use it. 

The weaknesses of this method are about the same as in b). We do not consider all of the data, nor Covid-Recovered. 

# Exercise 2

## Data exploration

Before we dive into fitting a model for prediction in exercise 2, let us explore the data a bit. 

We formalize the problem as follows: We want to use the explanatory variables (genes, age, income, comorbidities, treatment and symptoms before treatment) to predict the response, which is the symptoms after the treatment. More specifically, we are going to look at predicting whether a person with a certain symptom is likely to still have the symptom after a given treatment. In other words, we assume that the dataset "treatment_features" contains symptoms _before_ the treatment, the table "treatment_actions" contains whether they got Treatment1, Treatment2, both or none, and the table "treatment_outcomes" contains the symptoms after the treatment. This might be a wrong interpretation, since it implies that dead people are treated and that some people are resurrected by the treatment. However, it is our best interpretation, so we assume it is a mistake in the dataset. 

During the initial data analysis, we make an important observation. Patients who do not have a symptom before the treatment (whether they get one treatment, both, or none) never have it afterwards. If we assume this is always the case, the model will become more accurate. However, this assumption may only hold for our dataset, giving us an artificially low test error. Therefore, we look at both cases: using the whole dataset to fit the models, and using only the patients who have the symptom beforehand. We will predict one symptom at a time, meaning we will have a different model for each symptom. The details of how we fit the models are elaborated later, in the "Fitting a model" section, after the initial data analysis. First, let us look at how many of the observations actually have a positive response. 

In [25]:
# Loading data

In [26]:
def init_features(data):
    """
    Initialize names for observation features and treatment features
    
    Symptoms (10 bits): Covid-Recovered, Covid-Positive, No-Taste/Smell, 
        Fever, Headache, Pneumonia, Stomach, Myocarditis, Blood-Clots, Death
    Age (integer)
    Gender (binary)
    Income (floating)
    Genome (128 bits)
    Comorbidities (6 bits): Asthma, Obesity, Smoking, Diabetes, Heart disease, Hypertension
    Vaccination status (3 bits): 0 for unvaccinated, 1 for receiving a specific vaccine for each bit
    """
    features_data = pd.read_csv(data)
    # features =  ["Covid-Recovered", "Age", "Gender", "Income", "Genome", "Comorbidities", "Vaccination status"]
    features = []
    # features += ["Symptoms" + str(i) for i in range(1, 11)]
    features += ["Covid-Recovered", "Covid-Positive", "No-Taste/Smell", "Fever", 
                 "Headache", "Pneumonia", "Stomach", "Myocarditis", 
                 "Blood-Clots", "Death"]
    features += ["Age", "Gender", "Income"]
    features += ["Genome" + str(i) for i in range(1, 129)]
    # features += ["Comorbidities" + str(i) for i in range(1, 7)]
    features += ["Asthma", "Obesity", "Smoking", "Diabetes", 
                 "Heart disease", "Hypertension"]
    features += ["Vaccination status" + str(i) for i in range(1, 4)]
    features_data.columns = features
    return features_data

In [27]:
def init_actions():
    actions = pd.read_csv("treatment_actions.csv")
    actions.columns = ["Treatment1", "Treatment2"]
    return actions 

In [28]:
def init_outcomes():
    """
    Initialize outcome data
    
    Post-Treatment Symptoms (10 bits): Past-Covid (Ignore), Covid+ (Ignore), 
    No-Taste/Smell, Fever, Headache, Pneumonia, Stomach, Myocarditis, 
    Blood-Clots, Death
    """
    outcomes = pd.read_csv("treatment_outcomes.csv")
    outcome_names = ["Past-Covid", "Covid+", "No-Taste/Smell", "Fever", "Headache", 
                      "Pneumonia", "Stomach", "Myocarditis", "Blood-Clots", "Death"]
    outcomes.columns = outcome_names
    return outcomes

In [29]:
# Fix dataset

In [30]:
observation_features = init_features("observation_features.csv")
data_obs = observation_features
actions = init_actions()
outcomes = init_outcomes()
treatment_features = init_features("treatment_features.csv")
data_treat = treatment_features
# The task said to ignore the two first columns
outcomes = outcomes.iloc[:, 2:]

outcome_names_new = [i + "_after" for i in outcomes.columns] # We want to specify that this is an outcome 
outcomes.columns = outcome_names_new

treatment = data_treat.join(actions).join(outcomes)
tmp1 = treatment.iloc[:, 0:13]
tmp2 = treatment.iloc[:, 141:]
# The three datasets for ex. 2 in one dataset, where all genes are omitted
treat_no_genes = tmp1.join(tmp2)

num_features = ["Age", "Income"]
num_df = treat_no_genes[num_features]
scaled_num_df = (num_df - num_df.mean()) / num_df.std()

# Note: this is an alias, not a copy, so the scaling below also modifies treat_no_genes
treat_no_genes_scaled = treat_no_genes
treat_no_genes_scaled.iloc[:, 10] = scaled_num_df.iloc[:,0]  # Age
treat_no_genes_scaled.iloc[:, 12] = scaled_num_df.iloc[:,1]  # Income

# Remove column "Covid-Positive" (because everyone has covid)
tmp1 = treat_no_genes.iloc[:, 0]
tmp2 = treat_no_genes.iloc[:, 2:]
treat_no_genes = pd.DataFrame(tmp1).join(tmp2)

In [31]:
# Looking at different treatments

In [32]:
# People with only treatment 1, 211 people
treat_1 = treat_no_genes[(treat_no_genes["Treatment1"] == 1) & (treat_no_genes["Treatment2"] == 0)]
# People with only treatment 2, 211 people
treat_2 = treat_no_genes[(treat_no_genes["Treatment2"] == 1) & (treat_no_genes["Treatment1"] == 0)]
# People with both treatments, 240 people
treat_both = treat_no_genes[(treat_no_genes["Treatment1"] == 1) & (treat_no_genes["Treatment2"] == 1)]
# People with no treatments, 215 people
treat_none = treat_no_genes[(treat_no_genes["Treatment1"] == 0) & (treat_no_genes["Treatment2"] == 0)]

In [33]:
# Number of people with different symptoms after

In [34]:
#print(f"Number of people with different symtoms, total people is {treat_no_genes.shape[0]}")
#print("--------------------------------------------------------------")
#for s in outcomes.columns:
#    print(f"People with symptom {s}: ", treat_no_genes[treat_no_genes[s] == 1].shape[0])

In [35]:
# People with symptom before treatment compared to people with symptom after treatment

In [36]:
print(f"Number of people with different symtoms before and after treatment, total people is {treat_no_genes.shape[0]}")
print("-" * 90)
for sb, sa in zip(treat_no_genes.columns[1:9], outcomes.columns):
    print(f"People with symptom {sb} before treatment: ", treat_no_genes[treat_no_genes[sb] == 1].shape[0])
    print(f"People with symptom {sa} after treatment: ", treat_no_genes[treat_no_genes[sa] == 1].shape[0])
    print("-" * 60)

Number of people with different symtoms before and after treatment, total people is 877
------------------------------------------------------------------------------------------
People with symptom No-Taste/Smell before treatment:  49
People with symptom No-Taste/Smell_after after treatment:  23
------------------------------------------------------------
People with symptom Fever before treatment:  24
People with symptom Fever_after after treatment:  18
------------------------------------------------------------
People with symptom Headache before treatment:  7
People with symptom Headache_after after treatment:  1
------------------------------------------------------------
People with symptom Pneumonia before treatment:  34
People with symptom Pneumonia_after after treatment:  19
------------------------------------------------------------
People with symptom Stomach before treatment:  5
People with symptom Stomach_after after treatment:  3
----------------------------------------

From this table we can make two important observations: 1. The treatments seem to be somewhat effective, since fewer people have symptoms after the treatment than before. 2. Few of the patients actually experienced symptoms. This is an important observation, since it means that fitting the model will be difficult. For example, only 5 persons had stomach symptoms before treatment, and only 3 after. This is out of 877 observations, so we have to work to get a model that does not simply predict "0" all the time. We are going to get back to this point. 
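
To make the imbalance concrete, the sketch below computes the error of a trivial baseline that always predicts 0. The counts come from the table above (3 positives for Stomach after treatment, out of 877 patients); the helper name `always_zero_error` is ours for illustration and is not part of the notebook's pipeline.

```python
import numpy as np

def always_zero_error(y_true, penalty_factor=1):
    """Error of a model that predicts 0 for everyone.

    Every positive is missed (a false negative) and costs penalty_factor.
    """
    y_true = np.asarray(y_true)
    return penalty_factor * int(y_true.sum())

# Toy response mimicking "Stomach" after treatment: 3 positives among 877 patients
n_patients, n_positive = 877, 3
y = np.zeros(n_patients, dtype=int)
y[:n_positive] = 1

errors = always_zero_error(y)
accuracy = 1 - errors / n_patients
print(f"Always-zero baseline: {errors} errors, accuracy {accuracy:.4f}")
```

Any useful model has to beat this baseline, which is why plain accuracy is a poor yardstick for such rare responses.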

Now let us see whether this differs among the treatment groups:

In [37]:
print(f"Percentages of people experiencing symptoms before and after treatment, in each treatment group")
print("-" * 90)
for sb, sa in zip(treat_1.columns[1:9], outcomes.columns):
    t1_b = treat_1[treat_1[sb] == 1].shape[0]/len(treat_1)
    t2_b = treat_2[treat_2[sb] == 1].shape[0]/len(treat_2)
    tb_b = treat_both[treat_both[sb] == 1].shape[0]/len(treat_both)
    tn_b = treat_none[treat_none[sb] == 1].shape[0]/len(treat_none)
    t1_a = treat_1[treat_1[sa] == 1].shape[0]/len(treat_1)
    t2_a = treat_2[treat_2[sa] == 1].shape[0]/len(treat_2)
    tb_a = treat_both[treat_both[sa] == 1].shape[0]/len(treat_both)
    tn_a = treat_none[treat_none[sa] == 1].shape[0]/len(treat_none)
    print(f"{sb} before treatment: Treatment1: {t1_b*100:.4f}, \
Treatment2: {t2_b*100:.4f}, Both Treatments: {tb_b*100:.4f}, No treatment: {tn_b*100:.4f}")
    print(f"{sa} after treatment: Treatment1: {t1_a*100:.4f}, \
Treatment2: {t2_a*100:.4f}, Both Treatments: {tb_a*100:.4f}, No treatment: {tn_a*100:.4f}")
    print("-" * 60)

Percentages of people experiencing symptoms before and after treatment, in each treatment group
------------------------------------------------------------------------------------------
No-Taste/Smell before treatment: Treatment1: 5.6872, Treatment2: 3.3175, Both Treatments: 7.0833, No treatment: 6.0465
No-Taste/Smell_after after treatment: Treatment1: 3.7915, Treatment2: 0.9479, Both Treatments: 0.0000, No treatment: 6.0465
------------------------------------------------------------
Fever before treatment: Treatment1: 4.7393, Treatment2: 1.8957, Both Treatments: 1.2500, No treatment: 3.2558
Fever_after after treatment: Treatment1: 2.8436, Treatment2: 1.4218, Both Treatments: 0.8333, No treatment: 3.2558
------------------------------------------------------------
Headache before treatment: Treatment1: 0.9479, Treatment2: 0.9479, Both Treatments: 0.8333, No treatment: 0.4651
Headache_after after treatment: Treatment1: 0.0000, Treatment2: 0.0000, Both Treatments: 0.0000, No treatment:

One more thing to check is whether there are people who do not have a symptom before treatment, but get it afterwards. 

In [38]:
symptom_names = ["No-Taste/Smell", "Fever", "Headache", "Pneumonia", "Stomach", "Myocarditis", "Blood-Clots", "Death"]
for symptom in symptom_names:
    no_symptom = treat_no_genes[treat_no_genes[symptom] == 0.0]
    symptom_after = no_symptom[no_symptom[f"{symptom}_after"] == 1.0]
    print(f" Without symptom: {symptom}, {no_symptom.shape[0]}, # gets symptom after treatment {symptom_after.shape[0]}")

 Without symptom: No-Taste/Smell, 828, # gets symptom after treatment 0
 Without symptom: Fever, 853, # gets symptom after treatment 0
 Without symptom: Headache, 870, # gets symptom after treatment 0
 Without symptom: Pneumonia, 843, # gets symptom after treatment 0
 Without symptom: Stomach, 872, # gets symptom after treatment 0
 Without symptom: Myocarditis, 864, # gets symptom after treatment 0
 Without symptom: Blood-Clots, 843, # gets symptom after treatment 0
 Without symptom: Death, 867, # gets symptom after treatment 0


And the answer is no; no symptom-negative persons get sick after the treatment. This is an important observation for the model fitting. The details will be specified later. 
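
A compact way to run the same kind of check is a contingency table of the symptom before and after treatment. The sketch below uses a small made-up dataframe, not the real data; the column names only mirror the notebook's naming.

```python
import pandas as pd

# Hypothetical toy data: 1 = has the symptom, 0 = does not
toy = pd.DataFrame({
    "Fever":       [0, 0, 1, 1, 1, 0],
    "Fever_after": [0, 0, 1, 0, 1, 0],
})

# Rows: status before treatment, columns: status after treatment
table = pd.crosstab(toy["Fever"], toy["Fever_after"])
print(table)

# Number of patients who were symptom-free before but have the symptom afterwards
new_cases = int(((toy["Fever"] == 0) & (toy["Fever_after"] == 1)).sum())
print("New cases after treatment:", new_cases)
```

A zero in the "before = 0, after = 1" cell corresponds to the pattern seen in the output above.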

From these tables (mainly the second-to-last one) we can establish a few things. Firstly, the group that got no treatment shows no reduction in symptoms, which makes sense. From this observation, and the fact that the treatments reduce the number of symptoms or keep it the same in every case, we can establish that the treatments are somewhat effective. For headache, every treated person, regardless of treatment, loses their headache. Treatment1 treats all of the Blood-Clots cases and many of the Pneumonia cases. For the rest of the cases, the treatments are either somewhat effective or not effective at all. This is good to note, because if we consider our final model to be good, it should pick up some of this. Fitting the model will be difficult, however, since we have very few positives in the response. Finally, there are observations that got both treatments and are "resurrected", meaning that they were dead before the treatment but not afterwards. This is not to mention that dead people get treated at all, which is the case for 10 patients. This might be an interpretation mistake on our side, but since we could not figure it out, we assume it is a mistake in the dataset. 

## Fitting a model

Problem specification.

In order to test our model, we use cross-validation. We made a pipeline for testing different models with different error penalties. For the sake of generality, cross_validate() takes the arguments "parameter1" and "parameter2". The first one is passed to the model, so it can for example be "k" for KNN or "lambda" for Lasso. The second one is passed to the error calculation. It is the threshold used when the model predicts continuously between 0 and 1 (or a little under and over), so that we can adjust what counts as a positive or a negative prediction. The obvious choice would be "0.5", but since we have far more negative than positive observations, it might be beneficial to lower this threshold. Finally, "penalty_factor" determines how heavily a false negative is weighted relative to a false positive. This should be tuned so that our model gives a satisfactory outcome. This depends on how "bad" a false negative is compared to a false positive, which is not up to us to determine. Therefore, we will look at many cases. The reason for writing our own functions is all of the flexibility just mentioned. We did not find this in existing functions, even though it surely exists somewhere. 
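
The interplay between the threshold ("parameter2") and "penalty_factor" can be illustrated on a handful of made-up predictions. This is a minimal sketch, not the notebook's implementation: `weighted_error` is a hypothetical vectorized helper that uses the convention from the `penalized_error` docstring (false negatives cost `penalty_factor`, false positives cost 1).

```python
import numpy as np

def weighted_error(pred, exact, threshold=0.5, penalty_factor=1):
    """Count classification errors for thresholded predictions.

    False negatives cost penalty_factor, false positives cost 1.
    """
    pred = np.asarray(pred, dtype=float)
    exact = np.asarray(exact, dtype=int)
    predicted_positive = pred >= threshold
    false_neg = int(np.sum(~predicted_positive & (exact == 1)))
    false_pos = int(np.sum(predicted_positive & (exact == 0)))
    return penalty_factor * false_neg + false_pos

# Made-up continuous predictions and true labels
pred = [0.10, 0.35, 0.45, 0.60, 0.05, 0.30]
exact = [0, 1, 1, 1, 0, 0]

for threshold in (0.5, 0.3):
    err = weighted_error(pred, exact, threshold=threshold, penalty_factor=5)
    print(f"threshold={threshold}: error={err}")
```

In this toy example, lowering the threshold trades two expensive false negatives for one cheap false positive, which is the effect the threshold parameter is meant to give control over.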

Here is the actual code: 

In [48]:
def cross_validate(data, response, model, test_function, k_fold=10, parameter1=1, parameter2=0.5, penalty_factor=1):
    """
    Crossvalidates "model" on "data", according to the error given by 
    "test_function". "response" is the response that we are predicting.
    """
    n = len(data)
    cv_indexes = np.zeros(n)
    counter = 0
    for i in range(n):
        cv_indexes[i] = counter
        counter = (counter + 1) % k_fold
    np.random.shuffle(cv_indexes)
    error = 0
    # embed()
    for k in range(k_fold):
        train = data[cv_indexes != k]
        test = data[cv_indexes == k]
        predictions = model(train, test, response, parameter1)
        # predict on outcomes
        error += test_function(predictions, test, response, parameter2=parameter2, penalty_factor=penalty_factor)
    return error
    
def penalized_error(outcomes, test, response, penalty_factor=5, parameter2=0.5):
    """
    Categorical error, where false negatives are weighted with 
    "penalty_factor", and false positives are weighted with 1. 
    """
    error = 0
    # embed()
    for pred, exact in zip(outcomes, test[response]):
        if pred < parameter2:
            error += exact * penalty_factor
        elif pred >= parameter2:
            error += (1 - exact)
    return error
    
def knn_model(data, test, response, parameter1=5):
    """
    To be used in "cross_validate()". Data is the X data that the KNN
    model is fitted on, data[response] is the Y data. The function return the 
    prediction done on "test". "parameter1" is the amount of neighbours. 
    """
    k = parameter1
    y_data = data[response]
    x_data = data.drop(columns=[response]) # Check that it does not delete y_data
    test_data = test.drop(columns=[response])
    model = KNeighborsClassifier(n_neighbors=k)
    fit_model = model.fit(x_data, y_data)
    predictions = fit_model.predict(test_data)
    return predictions

def lasso_model(data, test, response, parameter1=1):
    """
    parameter1 is lambda. Lasso model for cross_validate()
    """
    alpha = parameter1
    y_data = data[response]
    x_data = data.drop(columns=[response]) # Check that it does not delete y_data
    test_data = test.drop(columns=[response])
    model = Lasso(alpha=alpha)
    fit_model = model.fit(x_data, y_data)
    predictions = fit_model.predict(test_data)
    # print(predictions)
    return predictions

def linear_model(data, test, response, parameter1=None):
    y_data = data[response].copy()
    x_data = data.drop(columns=[response])
    test_data = test.drop(columns=[response])
    model = LinearRegression()
    fit_model = model.fit(x_data, y_data)
    predictions = fit_model.predict(test_data)

    return predictions

A primitive approach to fitting the model would be to send in all 877 observations. It is primitive because we just established that if one did not experience a symptom before the treatment, one would not experience it afterwards. We also know that the no-treatment group does not change, so their symptoms before will be the same after the non-existent treatment (which might be interpreted as a short time period). Since we consider this a good assumption based on the data analysis, we can "hard-code" it into the model and only predict for patients who experience the symptom before treatment. However, we will first try the primitive way, as it might give us some information about how the model and testing procedure work. 

Let us simply fit a linear model on all our data, and cross validate. Note that we send in unreasonably many explanatory variables, but we will fix this later. 

In [40]:
np.random.seed(57)
for symptom in symptom_names:
    symptom_index = f"{symptom}_after"
    print(f"{symptom}: Amount in group: {sum(treat_no_genes[symptom])}")
    cv = cross_validate(treat_no_genes, symptom_index, linear_model, penalized_error, parameter2=0.5, penalty_factor=1)
    print(f"CV for linear_model: {cv}")
    print()

No-Taste/Smell: Amount in group: 49.0
CV for linear_model: 19.0

Fever: Amount in group: 24.0
CV for linear_model: 6.0

Headache: Amount in group: 7.0
CV for linear_model: 1.0

Pneumonia: Amount in group: 34.0
CV for linear_model: 16.0

Stomach: Amount in group: 5.0
CV for linear_model: 3.0

Myocarditis: Amount in group: 13.0
CV for linear_model: 9.0

Blood-Clots: Amount in group: 34.0
CV for linear_model: 15.0

Death: Amount in group: 10.0
CV for linear_model: 2.0



This seems promising. The linear model does not seem to overfit, since the cross-validation error is lower than the total number of cases. In addition, we actually get some true positives, since the error is lower. However, let us look at the number of false positives against false negatives. A primitive way to do this is to set penalty_factor to 1000. Since penalized_error() applies penalty_factor to false negatives, we can read the integer quotient of the error divided by 1000 as the number of false negatives, and the error modulo 1000 as the number of false positives.

In [41]:
np.random.seed(57)
for symptom in symptom_names:
    symptom_index = f"{symptom}_after"
    print(f"{symptom}: Amount in group: {sum(treat_no_genes[symptom])}")
    cv = cross_validate(treat_no_genes, symptom_index, linear_model, penalized_error, parameter2=0.5, penalty_factor=1000)
    print(f"CV for linear_model: {cv}")
    print()

No-Taste/Smell: Amount in group: 49.0
CV for linear_model: 16003.0

Fever: Amount in group: 24.0
CV for linear_model: 6.0

Headache: Amount in group: 7.0
CV for linear_model: 1000.0

Pneumonia: Amount in group: 34.0
CV for linear_model: 1015.0

Stomach: Amount in group: 5.0
CV for linear_model: 1002.0

Myocarditis: Amount in group: 13.0
CV for linear_model: 6003.0

Blood-Clots: Amount in group: 34.0
CV for linear_model: 12003.0

Death: Amount in group: 10.0
CV for linear_model: 2.0
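
The two error types can also be counted separately instead of being decoded from a single number. The sketch below is one way this could look; `error_counts` is a hypothetical helper and the inputs are made up.

```python
import numpy as np

def error_counts(pred, exact, threshold=0.5):
    """Return (false_positives, false_negatives) for thresholded predictions."""
    predicted_positive = np.asarray(pred, dtype=float) >= threshold
    actual_positive = np.asarray(exact, dtype=int) == 1
    false_pos = int(np.sum(predicted_positive & ~actual_positive))
    false_neg = int(np.sum(~predicted_positive & actual_positive))
    return false_pos, false_neg

# Made-up predictions from a regression model and the true outcomes
pred = [0.9, 0.2, 0.7, 0.1, 0.4]
exact = [1, 0, 0, 1, 1]

fp, fn = error_counts(pred, exact)
print(f"False positives: {fp}, false negatives: {fn}")
```

Returning a pair avoids any ambiguity about which count ends up in the quotient and which in the remainder.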



Since the false-positive part of the error is not 0 for many of the responses, we can try to improve our model using the previous observation. If a patient did not experience the relevant symptom before the treatment, the prediction should be 0, so no false positives can occur for those patients. This can be hardcoded into the model.

Let us verify this by looking at the coefficients. We fit the linear model on the whole data:

In [42]:
y_data =  treat_no_genes["No-Taste/Smell_after"]
x_data = treat_no_genes.drop(columns="No-Taste/Smell_after")

linear_model_test = LinearRegression()
fit_model = linear_model_test.fit(x_data, y_data)
coefficients = pd.concat([pd.DataFrame(x_data.columns), pd.DataFrame(np.transpose(fit_model.coef_))], axis = 1)
coefficients

Unnamed: 0,0,0.1
0,Covid-Recovered,0.005962
1,No-Taste/Smell,0.472467
2,Fever,-0.085558
3,Headache,0.028219
4,Pneumonia,-0.025346
5,Stomach,0.015724
6,Myocarditis,0.020841
7,Blood-Clots,-0.017383
8,Death,0.024377
9,Age,-0.003282


As we see, all of the coefficients are small compared to the one for No-Taste/Smell. This means that the linear regression managed to pick up the correct variable, according to our observation. However, most of the other variables seem noisy, so let us try to remove them. 
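
When there are many explanatory variables, it can help to label the coefficients and sort them by absolute size. The following is a minimal sketch on synthetic data (the columns "x1" to "x3" and the response are made up), assuming a pandas version that supports the `key` argument of `sort_values`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: only "x1" influences the response, "x2" and "x3" are noise
X = pd.DataFrame(rng.integers(0, 2, size=(200, 3)), columns=["x1", "x2", "x3"])
y = 0.5 * X["x1"] + rng.normal(0, 0.05, size=200)

fit = LinearRegression().fit(X, y)

# Label each coefficient with its column name and sort by absolute size
coef_table = pd.Series(fit.coef_, index=X.columns).sort_values(key=abs, ascending=False)
print(coef_table)
```

The variable that drives the response ends up at the top of the table, which makes the comparison described above easier to read.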

In the next few cells we will try linear regression and KNN classification, first with "symptom (before), Treatment1, Treatment2" as explanatory variables, then with "Income, Age, Gender" added. 

In [67]:
# The linear fit above used the whole dataframe, now we try only a few explanatory variables

In [68]:
# Variables : symptom (before), Treatment1, Treatment2
# Method : linear regression
total_cv_lm1 = 0 # Total cv error, for all of the responses
np.random.seed(57)
for symptom in symptom_names:
    symptom_index = f"{symptom}_after"
    cols = [symptom, "Treatment1", "Treatment2", symptom_index]
    print(f"{symptom}: Amount in group: {sum(treat_no_genes[symptom])}")
    cv = cross_validate(treat_no_genes[cols], symptom_index, linear_model, penalized_error, parameter2=0.5, penalty_factor=1)
    print(f"CV for linear_model: {cv}")
    print()
    total_cv_lm1 += cv

No-Taste/Smell: Amount in group: 49.0
CV for linear_model: 19.0

Fever: Amount in group: 24.0
CV for linear_model: 6.0

Headache: Amount in group: 7.0
CV for linear_model: 1.0

Pneumonia: Amount in group: 34.0
CV for linear_model: 16.0

Stomach: Amount in group: 5.0
CV for linear_model: 3.0

Myocarditis: Amount in group: 13.0
CV for linear_model: 9.0

Blood-Clots: Amount in group: 34.0
CV for linear_model: 19.0

Death: Amount in group: 10.0
CV for linear_model: 2.0



In [69]:
# Variables : symptom (before), Treatment1, Treatment2
# Method : knn
np.random.seed(57)
total_cv_knn1 = 0 # Total cv error, for all of the responses
for symptom in symptom_names:
    symptom_index = f"{symptom}_after"
    cols = [symptom, "Treatment1", "Treatment2", symptom_index]
    print(f"{symptom}: Amount in group: {sum(treat_no_genes[symptom])}")
    cv = cross_validate(treat_no_genes[cols], symptom_index, knn_model, penalized_error, parameter2=0.5, penalty_factor=1)
    print(f"CV for KNN: {cv}")
    print()
    total_cv_knn1 += cv

No-Taste/Smell: Amount in group: 49.0
CV for KNN: 9.0

Fever: Amount in group: 24.0
CV for KNN: 10.0

Headache: Amount in group: 7.0
CV for KNN: 1.0

Pneumonia: Amount in group: 34.0
CV for KNN: 2.0

Stomach: Amount in group: 5.0
CV for KNN: 4.0

Myocarditis: Amount in group: 13.0
CV for KNN: 3.0

Blood-Clots: Amount in group: 34.0
CV for KNN: 10.0

Death: Amount in group: 10.0
CV for KNN: 4.0



In [70]:
# Same, but add "Income", "Age" and "Gender" as columns
# Method : linear regression
np.random.seed(57)
total_cv_lm2 = 0 # Total cv error, for all of the responses
for symptom in symptom_names:
    symptom_index = f"{symptom}_after"
    cols = [symptom, "Age", "Income", "Gender", "Treatment1", "Treatment2", symptom_index]
    print(f"{symptom}: Amount in group: {sum(treat_no_genes[symptom])}")
    cv = cross_validate(treat_no_genes[cols], symptom_index, linear_model, penalized_error, parameter2=0.5, penalty_factor=1)
    print(f"CV for linear_model: {cv}")
    print()
    total_cv_lm2 += cv

No-Taste/Smell: Amount in group: 49.0
CV for linear_model: 18.0

Fever: Amount in group: 24.0
CV for linear_model: 6.0

Headache: Amount in group: 7.0
CV for linear_model: 1.0

Pneumonia: Amount in group: 34.0
CV for linear_model: 15.0

Stomach: Amount in group: 5.0
CV for linear_model: 3.0

Myocarditis: Amount in group: 13.0
CV for linear_model: 10.0

Blood-Clots: Amount in group: 34.0
CV for linear_model: 18.0

Death: Amount in group: 10.0
CV for linear_model: 2.0



In [79]:
# Same, but add "Income", "Age" and "Gender" as columns
# Method : knn
np.random.seed(57)
total_cv_knn2 = 0 # Total cv error, for all of the responses
for symptom in symptom_names:
    symptom_index = f"{symptom}_after"
    cols = [symptom, "Age", "Income", "Gender", "Treatment1", "Treatment2", symptom_index]
    print(f"{symptom}: Amount in group: {sum(treat_no_genes[symptom])}")
    cv = cross_validate(treat_no_genes[cols], symptom_index, knn_model, penalized_error, parameter2=0.5, penalty_factor=1)
    print(f"CV for KNN: {cv}")
    print()
    total_cv_knn2 += cv # Total cv error, for all of the responses

No-Taste/Smell: Amount in group: 49.0
CV for KNN: 13.0

Fever: Amount in group: 24.0
CV for KNN: 11.0

Headache: Amount in group: 7.0
CV for KNN: 1.0

Pneumonia: Amount in group: 34.0
CV for KNN: 9.0

Stomach: Amount in group: 5.0
CV for KNN: 1.0

Myocarditis: Amount in group: 13.0
CV for KNN: 7.0

Blood-Clots: Amount in group: 34.0
CV for KNN: 12.0

Death: Amount in group: 10.0
CV for KNN: 5.0



In [82]:
print("Total cv error (for all responses)")
print(f"LM1: {total_cv_lm1}, KNN1: {total_cv_knn1}, LM2: {total_cv_lm2}, KNN2: {total_cv_knn2}")

Total cv error (for all responses)
LM1: 75.0, KNN1: 43.0, LM2: 73.0, KNN2: 59.0


Here, the suffix 1 refers to the models with three explanatory variables, and 2 to the models with six. We used 5 neighbours for KNN. Note that this is the absolute error; it could also be measured as a relative error or as a weighted error, depending on which response the medical expert is interested in. For example, wrongly predicting a death might be penalized more than a wrongly classified fever. However, we do not have the medical knowledge to discuss this, so we simply add all of the errors. When we tune the KNN parameter in a moment, we will also look at the relative error.  
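
If a medical expert supplied a severity weight per response, the per-symptom errors could be combined into a weighted total. The sketch below uses the LM1 errors printed above together with weights that are purely hypothetical placeholders, not medical recommendations.

```python
import pandas as pd

# Cross-validation errors per response for LM1, copied from the output above
cv_errors = pd.Series({
    "No-Taste/Smell": 19, "Fever": 6, "Headache": 1, "Pneumonia": 16,
    "Stomach": 3, "Myocarditis": 9, "Blood-Clots": 19, "Death": 2,
})

# Hypothetical severity weights (assumption for illustration only)
weights = pd.Series(1.0, index=cv_errors.index)
weights["Death"] = 10.0
weights["Blood-Clots"] = 3.0

unweighted_total = cv_errors.sum()
weighted_total = (cv_errors * weights).sum()
print(f"Unweighted: {unweighted_total}, weighted: {weighted_total}")
```

With all weights equal to 1 the weighted total reduces to the plain sum reported as LM1.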

As we see, KNN predicts better than the linear model (43 against 75 total errors with three explanatory variables). We also see that the linear model does about the same with the additional explanatory variables, but KNN does worse. We therefore conclude that "Age", "Gender" and "Income" behave like noise for prediction, and we will not continue to use them. 
Since KNN did pretty well, let us try to adjust the neighbours parameter. We also note that the CV error varies a bit with the seed, so we run the code 10 times. This cell takes about a minute to run.

In [94]:
# Selecting K for KNN 
# Variables : symptom (before), Treatment1, Treatment2
# Method : knn
np.random.seed(57)
k_lim = 10
total_cv_abs = np.zeros(k_lim) # Total cv error, for all of the responses, as a sum
total_cv_rel = np.zeros(k_lim) # Total cv error, for all the responses, relative to the amount of observations
for _ in range(10):
    for k in range(1, k_lim+1):
        for symptom in symptom_names:
            symptom_index = f"{symptom}_after"
            cols = [symptom, "Treatment1", "Treatment2", symptom_index]
            cv = cross_validate(treat_no_genes[cols], symptom_index, knn_model, penalized_error, parameter1 = k,
                                parameter2=0.5, penalty_factor=1)
            total_cv_abs[k-1] += cv
            total_cv_rel[k-1] += cv/sum(treat_no_genes[symptom])

In [97]:
total_cv_abs/10

array([40.2, 42.5, 40.1, 44. , 40.6, 49.7, 46.9, 49.6, 44.6, 49.5])

In [96]:
total_cv_rel

array([22.18288238, 24.62021809, 23.48425678, 24.8649506 , 23.85604626,
       29.96972635, 29.32876997, 30.19866793, 28.59602687, 30.22903007])

Note that these errors vary a bit, so do not trust the differences too much. Overall, a single neighbour gives the lowest error. However, one might get very high variance from using one neighbour, so 3 or 5 might be a safer choice. The difference in cross-validation error is probably not significant. 
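
To judge whether differences between values of k are meaningful, one option is to keep the error of every repeat instead of only the running sum, and then report mean and standard deviation per k. The sketch below only shows the bookkeeping: the array `cv_runs` is filled with random placeholder numbers, whereas in the notebook it would hold the totals returned by the cross_validate() calls.

```python
import numpy as np

rng = np.random.default_rng(57)
n_repeats, k_lim = 10, 10

# Placeholder values (assumption): one row per repeat, one column per k
cv_runs = rng.normal(loc=45, scale=4, size=(n_repeats, k_lim))

mean_error = cv_runs.mean(axis=0)
std_error = cv_runs.std(axis=0, ddof=1)  # sample standard deviation across repeats

for k, (m, s) in enumerate(zip(mean_error, std_error), start=1):
    print(f"k={k:2d}: mean {m:.1f}, std {s:.1f}")

# k with the lowest mean error (1-based)
best_k = int(np.argmin(mean_error)) + 1
```

If the standard deviations are of the same size as the gaps between the means, the choice of k cannot be settled by these runs alone.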

Let us try to take advantage of one of our earlier observations. 

In [28]:
# new knn model

In [33]:
def knn_model_new(data, test, response, parameter1=5):
    """
    To be used in "cross_validate()". Like "knn_model", but the KNN is fitted 
    only on the training rows where the symptom was present before treatment, 
    and the prediction is set to 0 for test rows without the symptom before. 
    "parameter1" is the number of neighbours. 
    """
    k = parameter1
    y_data = data[response]
    x_data = data.drop(columns=[response]) # Check that it does not delete y_data
    test_data = test.drop(columns=[response])
    model = KNeighborsClassifier(n_neighbors=k)
    symptom = response.replace("_after", "")
    fit_model = model.fit(x_data[x_data[symptom]==1], y_data[x_data[symptom]==1])
    predictions = fit_model.predict(test_data)
    test_data["preds"] = predictions
    test_data.loc[test_data[symptom] == 0, "preds"] = 0
    
    return test_data["preds"].values

In [42]:
# Variables : symptom (before), Treatment1, Treatment2
# Method : knn, the new version
np.random.seed(57)
for symptom in symptom_names:
    symptom_index = f"{symptom}_after"
    cols = [symptom, "Treatment1", "Treatment2", symptom_index]
    print(f"{symptom}: Amount in group: {sum(treat_no_genes[symptom])}")
    cv = cross_validate(treat_no_genes[cols], symptom_index, knn_model_new, penalized_error, parameter2=0.5, penalty_factor=1000)
    print(f"CV for KNN (new): {cv}")
    print()

No-Taste/Smell: Amount in group: 49.0
CV for KNN (new): 3008.0

Fever: Amount in group: 24.0
CV for KNN (new): 6.0

Headache: Amount in group: 7.0
CV for KNN (new): 1000.0

Pneumonia: Amount in group: 34.0
CV for KNN (new): 1007.0

Stomach: Amount in group: 5.0
CV for KNN (new): 3001.0

Myocarditis: Amount in group: 13.0
CV for KNN (new): 1001.0

Blood-Clots: Amount in group: 34.0
CV for KNN (new): 7000.0

Death: Amount in group: 10.0
CV for KNN (new): 2002.0



In [44]:
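def linear_model_new(data, test, response, parameter1=None):
    """
    Linear regression for cross_validate(). Fitted only on training rows where
    the symptom was present before treatment; predicts 0 for other test rows.
    """
    y_data = data[response].copy()
    x_data = data.drop(columns=[response])
    test_data = test.drop(columns=[response])
    model = LinearRegression()
    # Derive the "before" column from the response name instead of relying on
    # the global loop variable "symptom"
    symptom = response.replace("_after", "")
    fit_model = model.fit(x_data[x_data[symptom]==1], y_data[x_data[symptom]==1])
    predictions = fit_model.predict(test_data)
    test_data["preds"] = predictions
    test_data.loc[test_data[symptom] == 0, "preds"] = 0
    
    return test_data["preds"].values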

In [45]:
# Variables : symptom (before), Treatment1, Treatment2
# Method : linear regression, the new version
np.random.seed(57)
for symptom in symptom_names:
    symptom_index = f"{symptom}_after"
    cols = [symptom, "Treatment1", "Treatment2", symptom_index]
    print(f"{symptom}: Amount in group: {sum(treat_no_genes[symptom])}")
    cv = cross_validate(treat_no_genes[cols], symptom_index, linear_model_new, penalized_error, parameter2=0.5, penalty_factor=1)
    print(f"CV for linear_model: {cv}")
    print()

No-Taste/Smell: Amount in group: 49.0
CV for linear_model: 6.0

Fever: Amount in group: 24.0
CV for linear_model: 8.0

Headache: Amount in group: 7.0
CV for linear_model: 1.0

Pneumonia: Amount in group: 34.0
CV for linear_model: 2.0

Stomach: Amount in group: 5.0
CV for linear_model: 4.0

Myocarditis: Amount in group: 13.0
CV for linear_model: 1.0

Blood-Clots: Amount in group: 34.0
CV for linear_model: 6.0

Death: Amount in group: 10.0
CV for linear_model: 4.0




# Task 3

## Ethical concerns

We would like to address the ethical issues in this study. We assume that this study was done to investigate the effectiveness and possible side effects of vaccines, and that its results will benefit the society we live in. We also assume that the data was gathered from the private database of a hospital and belongs to patients, even if not everyone is sick with Covid or has symptoms.

## Consent to participate in the research

This is an observational study, in which the explanatory variables are not under the control of the researcher. An observational study uses observational data, that is, information recorded without the researcher intervening on the subjects of the research. In this study, the subjects of the research are patients. We do not have information about whether the data collection was ethical or lawful, or whether the subjects consented to be part of the study. In order to address the ethical issues, we therefore make some assumptions about the sampling methods and the subjects of the research, and consider three scenarios.

If the patients agreed to participate in the research, or were informed that their health data would later be used in research, were informed about the purposes and the procedure of the research, and gave their consent in a written, lawful way, then the data collection method is ethical and correct.

If someone hacked their way into a hospital database in order to conduct this research and gathered information without the consent of the individuals, then the patients' privacy and their right to safe health care have been violated, and the research is highly unethical and unlawful.

A third scenario is one where a hospital asks a group of researchers to do research with its data and gives them access to its private database. The results of this research can be published and used to benefit society, but the data used in the research must be kept confidential and cannot be published.

## Privacy of the personal information

The data does not contain directly identifying information such as name, family name or date of birth. This indicates that the privacy of the subjects was somewhat protected, as we are not able to trace records back to individuals simply by looking at the data. However, the data contains information about the age, gender and income of the patients. None of these identifies a person on its own, but a combination of such “innocent” attributes can be used to identify a patient. For example, if we have access to another dataset with information about the same patients, combining the two datasets, each of which contains only innocent data, can make it easy to trace records back to real people and thereby violate their privacy. If the data belongs to the database of a hospital in Europe and its handling complies with the GDPR, there should not be any privacy problem in this study, and the privacy of the patients is protected.
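
To make the re-identification risk concrete, one can count how many records share the same combination of age, gender and income. The sketch below uses a small hypothetical table (the values and column names are assumptions, not the study's data) and a helper, `group_sizes`, introduced only for illustration. A combination that occurs exactly once singles out one patient, who could then be matched against an external dataset containing the same attributes.

```python
import pandas as pd

# Hypothetical toy records; values and column names are illustrative assumptions.
records = pd.DataFrame({
    "Age": [34, 34, 71, 52, 52, 52],
    "Gender": [0, 0, 1, 1, 1, 0],
    "Income": [30000, 30000, 12000, 45000, 45000, 45000],
})


def group_sizes(df, quasi_identifiers):
    """Number of records sharing each combination of quasi-identifier values."""
    return df.groupby(quasi_identifiers).size()


sizes = group_sizes(records, ["Age", "Gender", "Income"])
k = int(sizes.min())               # size of the smallest group (the "k" in k-anonymity)
n_unique = int((sizes == 1).sum()) # combinations that single out exactly one record

print(f"Smallest group size: {k}")
print(f"Combinations identifying a single record: {n_unique}")
```

A smallest group size of 1 means that at least one patient is unique on these three attributes alone, even though no name or date of birth is stored.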

## Justice and fairness

An important ethical principle in research is justice. The researcher must be fair to the participants, and the participants' needs must always come first. The participants must be treated equally throughout the study. They must be chosen such that the research population represents our society, e.g. having 50% men and 50% women, and including people from different racial groups, minorities and cultures, and people with different socioeconomic or academic backgrounds. The researcher must also make sure that people are not included in the research only because they are easy to access, highly available, or belong to a vulnerable group, such as children, elderly people or mentally ill people, who are less able to decline an invitation to participate.

If the sample comes from a population that represents the society we live in, includes all genders, races and minorities, and the researcher does not take advantage of vulnerable groups, then we can say that the research is fair and just to its participants. If the researcher has gathered data from a population that belongs to a specific minority or a specific gender that is more accessible, or from people who are confined to an institution (e.g. prisons, mental health hospitals, children's homes, retirement homes), without respecting or protecting their autonomy, then the research is highly unfair to its participants.
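
One simple way to check the representativeness criterion above is to compare the share of each group in the sample with its share in a reference population. The following is a minimal sketch on hypothetical data: the sample values, the 50/50 reference shares, and the helper name `share_gap` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample with gender coded as 0/1, and assumed reference shares.
sample = pd.Series([0, 1, 1, 1, 0, 1, 1, 1, 0, 1], name="Gender")
reference_shares = pd.Series({0: 0.5, 1: 0.5})


def share_gap(sample, reference_shares):
    """Observed share of each group in the sample minus its reference share."""
    observed = sample.value_counts(normalize=True).reindex(
        reference_shares.index, fill_value=0.0
    )
    return observed - reference_shares


gap = share_gap(sample, reference_shares)
print(gap)
```

A gap far from zero for some group suggests that the group is over- or under-represented. With small samples such a gap can also arise by chance, so it is a starting point for closer inspection rather than proof of unfair sampling.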
