# 1 Heart Disease (15 points)

In this problem, you'll practice manipulating lists of lists by analyzing some (simulated) data of a clinical study on the relationship between diet and heart disease.

As our dataset, I've simulated some patient LDL cholesterol data based on the means and standard errors reported in a study that looked at the effects of the so-called DASH diet on several heart disease risk factors in diabetics, including blood lipids. (The DASH diet is "rich in fruits, vegetables, whole grains, low-fat dairy products, and low in saturated fat, total fat, cholesterol, refined grains, and sweets.") The actual raw data isn't available, but if you're curious, here's the study link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3005461/.

In the simulated data below, there are 44 patients, 22 in the control group who followed a "normal" diet, and 22 in the diet group who followed the DASH diet for 8 weeks. 

To control for variability in LDL-C measurements, each patient's LDL-C was measured three times before beginning the diet, and three times after the diet. These six measurements are listed in the `header` variable below.

Each group of 22 patients is stored as a list of lists. There is one row for each patient, and each row is a list of the three before-diet and three after-diet LDL-C measurements, as shown in the header line.

**Run the cell below to load the data.**

In [None]:
# Simulated LDL-C measurements for control and DASH diet groups.

header = ['before_1', 'before_2', 'before_3', 'after_1', 'after_2', 'after_3'] # just for information
control = [
 [99, 78, 100, 109, 90, 82],
 [100, 61, 75, 77, 68, 77],
 [180, 135, 170, 119, 107, 64],
 [143, 106, 119, 145, 112, 96],
 [113, 82, 89, 125, 135, 119],
 [159, 141, 147, 121, 90, 75],
 [118, 111, 82, 130, 143, 120],
 [144, 116, 133, 103, 109, 92],
 [128, 136, 133, 129, 145, 152],
 [136, 165, 108, 91, 70, 124],
 [106, 128, 134, 113, 75, 78],
 [121, 117, 147, 133, 149, 173],
 [124, 71, 83, 98, 74, 87],
 [137, 132, 160, 130, 146, 112],
 [153, 133, 140, 152, 136, 155],
 [143, 142, 119, 136, 128, 132],
 [94, 120, 80, 133, 110, 112],
 [102, 97, 124, 138, 94, 110],
 [101, 96, 119, 136, 107, 97],
 [122, 108, 102, 132, 131, 154],
 [139, 102, 138, 126, 101, 100],
 [101, 159, 153, 114, 114, 97]
]

diet = [
 [97, 121, 69, 83, 73, 76],
 [139, 120, 110, 113, 98, 121],
 [100, 126, 112, 92, 95, 75],
 [99, 116, 127, 150, 81, 120],
 [73, 114, 96, 97, 88, 112],
 [123, 156, 133, 54, 56, 77],
 [146, 145, 103, 71, 90, 59],
 [88, 89, 89, 115, 73, 117],
 [135, 156, 127, 138, 132, 104],
 [99, 94, 125, 86, 83, 112],
 [137, 111, 124, 88, 73, 87],
 [119, 110, 129, 119, 91, 111],
 [167, 112, 119, 112, 149, 147],
 [173, 129, 114, 101, 135, 112],
 [68, 71, 40, 126, 112, 122],
 [145, 161, 171, 114, 70, 93],
 [142, 144, 129, 129, 96, 95],
 [85, 88, 70, 128, 154, 131],
 [115, 135, 81, 138, 117, 118],
 [148, 124, 148, 101, 78, 118],
 [159, 148, 179, 96, 60, 90],
 [135, 104, 107, 93, 125, 94]
]

## Determine the changes in LDL-C for the control group

Your task is to calculate mean LDL-C before and after the diet, for each subject. In other words, you'll reduce the original data from six measurements per subject to two means per subject, stored as a new list of lists. **Perform the calculations for the control group in the cell below, and for the DASH group in the next problem cell.**

To help you write the code, here are the specific tasks, which you can accomplish in a single `for` loop:

1. For each patient, get the first three measurements (the before-diet measurements) and calculate the mean. You can use a **slice** to grab multiple values from the list of a patient's measurements, then calculate the mean of that slice.

2. Do the same with the last three (after-diet) measurements.

3. Append the before and after means (as a two-item list) to a list of lists called `control_ldl_means`.
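As a minimal sketch of the slice-and-mean pattern from the steps above, here it is applied to a single made-up row. The values and the names `toy_row`, `before_mean`, `after_mean`, and `toy_means` are illustrative assumptions, not part of the assignment data. Your solution still needs to loop over every patient.

```python
import statistics as st

# One hypothetical patient row: three before-diet values, then three after-diet values
toy_row = [100, 110, 120, 90, 95, 100]

before_mean = st.mean(toy_row[:3])  # slice of the first three items
after_mean = st.mean(toy_row[3:])   # slice of the last three items

toy_means = []                      # list of lists, one two-item list per patient
toy_means.append([before_mean, after_mean])
print(toy_means)
```

Note that `toy_row[:3]` stops just before index 3, so the two slices together cover all six measurements exactly once.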

If you'd like to plan your code with pseudocode first, feel free to create a blank cell below and write it there.

In [None]:
control_ldl_means = [] # list of lists of subject before and after means

# YOUR ANSWER HERE
print(control_ldl_means) # look at results, see if data structure is what you expected.

In [None]:
assert len(control_ldl_means) == 22
assert len(control_ldl_means[0]) == 2
assert 2651 < sum([patient[0] for patient in control_ldl_means]) < 2651.4

## Determine the changes in LDL-C for the DASH diet group

Now do the same thing for the diet group. This will be easy after completing the previous problem.

In [None]:
diet_ldl_means = [] # append lists of before and after means to this list

# YOUR ANSWER HERE

print(diet_ldl_means)

In [None]:
assert len(diet_ldl_means) == 22
assert len(diet_ldl_means[0]) == 2
assert 2622.6 < sum([patient[0] for patient in diet_ldl_means]) < 2622.7

# Calculate group means and standard deviations of before and after LDL-C levels

## (Working with columns in a list of lists)

At this point, you have summarized the replicate data as before and after diet means for each subject. These means are stored as a list of two-item lists. Think of the data in `control_ldl_means` and `diet_ldl_means` as tables with 22 rows (subjects) and two columns (LDL-C before/after means).


Next, calculate the population before and after means for each group.  In other words, calculate the mean of each column, for each group.

Because these data are stored as lists of lists, you'll use **list comprehensions** to perform calculations on the columns. Remember that list comprehensions in Python are essentially **concisely written `for` loops that build a new list from an existing list.** 

As a refresher, here is how to use a list comprehension to calculate the mean of the third column in a list of lists called `data`:

```python
# Method to calculate a column mean with list comprehension
# Assume variable data is a list of lists
import statistics as st

column_3 = [row[2] for row in data] # extract 3rd column values as a new list
st.mean(column_3) # calculate the mean
```

Think about the example above for a moment. Is there a way you could condense the last two lines into a single line of code? (In other words, can you find a way to dispense with the variable `column_3`?)

Using the above example, write code to calculate means and standard deviations for each column in `control_ldl_means` and `diet_ldl_means`. Save your answers to the following variables:

`mean_control_before`, `sd_control_before`, `mean_control_after`, `sd_control_after`, `mean_diet_before`, `sd_diet_before`, `mean_diet_after`, `sd_diet_after`. 

Also, don't forget that tab completion can save you a lot of typing. (Hit tab after typing the first few letters of an already existing variable name.)

Use `st.mean()` and `st.stdev()` from the statistics module for your calculations.
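To see what the two functions return, here is a small sketch on a made-up three-row table. The name `toy_table` and its values are assumptions for illustration only. Note that `st.stdev()` computes the *sample* standard deviation (it divides by n - 1), which is the appropriate choice when your subjects are a sample from a larger population.

```python
import statistics as st

# Hypothetical table: each row is [before_mean, after_mean] for one subject
toy_table = [[100, 90], [110, 95], [120, 100]]

before_col = [row[0] for row in toy_table]  # first column
after_col = [row[1] for row in toy_table]   # second column

mean_before = st.mean(before_col)
sd_before = st.stdev(before_col)  # sample standard deviation (n - 1 denominator)
mean_after = st.mean(after_col)
sd_after = st.stdev(after_col)

print(mean_before, sd_before)
print(mean_after, sd_after)
```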

In [None]:
import statistics as st
# YOUR ANSWER HERE

print(mean_control_before, sd_control_before)
print(mean_control_after, sd_control_after)
print(mean_diet_before, sd_diet_before)
print(mean_diet_after, sd_diet_after)

In [None]:
assert 120.515 < mean_control_before < 120.516
assert 21.649 < sd_control_before < 21.650
assert 114.121 < mean_control_after < 114.122
assert 21.427 < sd_control_after < 21.428
assert 119.212 < mean_diet_before < 119.213
assert 25.013 < sd_diet_before < 25.014
assert 102.484 < mean_diet_after < 102.485
assert 20.094 < sd_diet_after < 20.095

# Randomize control and diet labels

Does the DASH diet seem to work? How would you simulate the probability of seeing a reduction in LDL levels as large or larger than that observed in the diet group, if you assume the effect of the DASH diet was in actuality no different from the control diet?

One way would be to pool the data for the two patient groups, and randomly assign the labels 'control' and 'diet' to patients. You would then calculate the mean change in LDL-C levels for each group, etc.

You won't run a full simulation here. Instead, we'll just focus on how to write code that performs the label swap. One way is to do something similar to the Monty Hall 10-door game from `ps6`. In the function `simulate_monty_10`, you randomly picked 5 doors for Monty to open, and then created a list of the doors Monty *didn't* open. For our current problem, you want to randomly pick 22 subjects from a list and assign them to the diet group, while assigning the remaining 22 to the control group.

**INSTRUCTIONS: In the cell below, I've pooled the subject data into a single list of lists. Randomly assign subjects to either the control or DASH groups.** 

Here are the coding steps in more detail:

1. Generate a list of position indices for the list `pooled_subjects` (see example below).
2. Using `random.sample()`, randomly choose 22 indices from your list created in step 1. Assign those 22 indices to a list called `diet_indices`. (Don't forget to import the `random` module.)
3. Assign the remaining 22 indices to a list called `control_indices`. (Use a `for` loop to pick only indices **`not in`** `diet_indices`.)
4. Finally, iterate through `diet_indices` and append the corresponding row in `pooled_subjects` to a new list called `random_diet`. Do the same for the control indices, assigning the subject data to a list called `random_control`.

That completes the label swap. You could package this code into a function that you call repeatedly during 10,000 simulations.
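For step 2, it helps to know that `random.sample()` draws *without replacement*, so no index can be picked twice. Here is a minimal sketch on a toy list of six indices. The names `toy_indices` and `picked` are illustrative, and the fixed seed is only there to make the sketch repeatable (you would not normally seed a simulation this way).

```python
import random

random.seed(0)  # fixed seed, only so this sketch gives the same result each run

toy_indices = list(range(6))            # indices 0 through 5
picked = random.sample(toy_indices, 3)  # 3 distinct indices, chosen at random
print(picked)
```

Because the sample contains no repeats, every subject ends up in exactly one group once the remaining indices are assigned to the other group.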


**Generating list indices:** To randomly pick subjects from the list of pooled data, first create a list of *indices* representing the rows. (Just as in `simulate_monty_10`, where you randomly picked from door indices rather than picking from the list of actual prizes.) To get indices from a list, you can use `enumerate()` in a `for` loop, as in the Monty Hall code. If you *just* want indices though, here's a one-liner that does it:

```python
data = [5, 8, 2, 6, 1, 23] # create some data
data_indices = list(range(len(data))) # a list of the indices for the list 'data'
```

**NOTE**: If you can replace *all* `for` loops with list comprehensions in your code below, give yourself a big pat on the back. It will make your code much cleaner, though it requires writing a list comprehension that includes an `if` statement. For some examples of fancy list comprehensions, follow this link:
https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions
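As a rough illustration of a list comprehension with an `if` condition, here is a sketch on the small `data` list from the example above. The list `chosen` is picked by hand rather than randomly, and the names `chosen`, `remaining`, `chosen_values`, and `remaining_values` are assumptions for illustration.

```python
data = [5, 8, 2, 6, 1, 23]  # same toy data as in the index example
chosen = [0, 3, 4]          # hypothetical indices, picked by hand (not random here)

# Indices of `data` that are NOT in `chosen`
remaining = [i for i in range(len(data)) if i not in chosen]

# Values of `data` at the chosen and at the remaining positions
chosen_values = [data[i] for i in chosen]
remaining_values = [data[i] for i in remaining]

print(remaining, chosen_values, remaining_values)
```

The `if` clause at the end of the comprehension acts as a filter: an item is added to the new list only when the condition is true.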

In [None]:
pooled_subjects = [
 [99, 78, 100, 109, 90, 82],
 [100, 61, 75, 77, 68, 77],
 [180, 135, 170, 119, 107, 64],
 [143, 106, 119, 145, 112, 96],
 [113, 82, 89, 125, 135, 119],
 [159, 141, 147, 121, 90, 75],
 [118, 111, 82, 130, 143, 120],
 [144, 116, 133, 103, 109, 92],
 [128, 136, 133, 129, 145, 152],
 [136, 165, 108, 91, 70, 124],
 [106, 128, 134, 113, 75, 78],
 [121, 117, 147, 133, 149, 173],
 [124, 71, 83, 98, 74, 87],
 [137, 132, 160, 130, 146, 112],
 [153, 133, 140, 152, 136, 155],
 [143, 142, 119, 136, 128, 132],
 [94, 120, 80, 133, 110, 112],
 [102, 97, 124, 138, 94, 110],
 [101, 96, 119, 136, 107, 97],
 [122, 108, 102, 132, 131, 154],
 [139, 102, 138, 126, 101, 100],
 [101, 159, 153, 114, 114, 97],
 [97, 121, 69, 83, 73, 76],
 [139, 120, 110, 113, 98, 121],
 [100, 126, 112, 92, 95, 75],
 [99, 116, 127, 150, 81, 120],
 [73, 114, 96, 97, 88, 112],
 [123, 156, 133, 54, 56, 77],
 [146, 145, 103, 71, 90, 59],
 [88, 89, 89, 115, 73, 117],
 [135, 156, 127, 138, 132, 104],
 [99, 94, 125, 86, 83, 112],
 [137, 111, 124, 88, 73, 87],
 [119, 110, 129, 119, 91, 111],
 [167, 112, 119, 112, 149, 147],
 [173, 129, 114, 101, 135, 112],
 [68, 71, 40, 126, 112, 122],
 [145, 161, 171, 114, 70, 93],
 [142, 144, 129, 129, 96, 95],
 [85, 88, 70, 128, 154, 131],
 [115, 135, 81, 138, 117, 118],
 [148, 124, 148, 101, 78, 118],
 [159, 148, 179, 96, 60, 90],
 [135, 104, 107, 93, 125, 94]
]

# YOUR ANSWER HERE


In [None]:
assert len(random_diet) == 22
assert len(random_diet[3]) == 6
assert len(random_control) == 22
assert len(random_control[2]) == 6

# Calculate population means on the simulated groups

Now that you've randomly generated a control and diet group, calculate the before and after diet means for each group. You can reuse the code from above.

In the cell below, calculate the mean before and after values for each subject, averaging over the replicate measurements:

In [None]:
random_control_ldl_means = []
random_diet_ldl_means = [] # append lists of before and after means to this list

# YOUR ANSWER HERE

print(len(random_diet_ldl_means), len(random_control_ldl_means))

In [None]:
assert len(random_diet_ldl_means) == 22
assert len(random_control_ldl_means) == 22

Now calculate the before and after means for each group, just as you did for the original data above. Calculate only the means; skip the standard deviations. Save the results to the following variables:

`random_control_before, random_control_after, random_diet_before, random_diet_after`

Suggestion: Run the last three code cells a few times and see how variable the randomized means are, and how they compare with the original means. If you run this code a few times, are you likely to see an effect size similar to the original data?

In [None]:

# YOUR ANSWER HERE

print('Original data:')
print('Control before:', mean_control_before, 'Control after:', mean_control_after)
print('Diet before:', mean_diet_before, 'Diet after:', mean_diet_after)
print('Randomized data:')
print('Control before:', random_control_before, 'Control after:', random_control_after)
print('Diet before:', random_diet_before, 'Diet after:', random_diet_after)

In [None]:
assert 100 < random_control_before < 130
assert 100 < random_control_after < 130
assert 100 < random_diet_before < 130
assert 100 < random_diet_after < 130

**A final note:** Remember that in class we talked about the importance of copying lists, rather than just assigning existing lists to new ones. In the task above, you used the lists in `pooled_subjects` to create new lists, without copying them first. This means that `pooled_subjects` and `random_diet` (and `random_control`) are pointing to the same lists. If you change subject data in one list, it would also change in the other one.

For this task, that's not a problem, since you won't change any of the values in any of the lists inside the list of lists. So don't worry about this issue, but do be aware of it. A nice explanation of this list behavior, with illustrations, is here: https://www.python-course.eu/python3_deep_copy.php
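If you'd like to see the shared-list behavior for yourself, here is a small sketch using a made-up two-row list. The names `pooled`, `shared`, and `independent` are illustrative. It contrasts appending an existing row (which shares the inner list) with making a copy using `copy.deepcopy()` from the standard library.

```python
import copy

pooled = [[99, 78, 100], [100, 61, 75]]  # small hypothetical list of lists

shared = [pooled[0]]                 # new outer list, but the SAME inner list object
independent = copy.deepcopy(shared)  # new outer list AND new inner lists

pooled[0][0] = -1                    # modify the original inner list

print(shared[0][0])       # the change is visible through `shared`
print(independent[0][0])  # the deep copy still holds the original value
```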