In this notebook, we're going to talk about the importance of tabular data and why we need special tools to work with it.

# What Is Tabular Data?

Tabular data is essentially any data that is presented in a table or spreadsheet format, where each row represents an observation or record and each column represents a variable or attribute.

Patient records, blood test results, and logs of vital sign measurements are just some examples of tabular data in medicine. In clinical research, tabular data can also appear in the form of spreadsheets that hold information like demographics, treatment group assignments, and outcomes.

# Why Do We Need Something Fancy?

Knowing the basics of Python, we can actually store tabular data using what we know about lists.

Let's say we wanted to keep track of the following information:

| Patient Name | Sex    | Age |
|--------------|--------|-----|
| Alice        | female | 67  |
| Bob          | male   | 77  |
| Carol        | female | 75  |
| Dan          | male   | 83  |
| Eric         | male   | 64  |

There are two ways we could maintain this data.

## Parallel Lists

One option is to use ***parallel lists***. In this setup, each column is assigned its own list, and all of the lists must have the same length.

In [None]:
names = ["Alice", "Bob", "Carol", "Dan", "Eric"]
sexes = ["female", "male", "female", "male", "male"]
ages = [67, 77, 75, 83, 64]

If we want to grab a column from our table (i.e., a specific attribute from all of our patients), all we need to do is pick the corresponding list.

In [None]:
print('The first names of our patients are:', names)

If we want to get the information about a specific patient by their position in our table, we can do this by accessing all of the lists using the same index.

In [None]:
def print_patient(name, sex, age):
    # Note: the f-string below is shorthand for the following
    # (sep='' stops print from adding extra spaces between the pieces):
    # print('Name: ', name, ', Sex: ', sex, ', Age: ', age, sep='')
    print(f'Name: {name}, Sex: {sex}, Age: {age}')

In [None]:
patient_id = 0
print_patient(names[patient_id], sexes[patient_id], ages[patient_id])

If we want to get the information about a specific patient by a particular attribute (e.g., their name), we must first find the correct index by searching the corresponding list and then access all of the lists using that index.

In [None]:
patient_id = names.index("Dan")
print_patient(names[patient_id], sexes[patient_id], ages[patient_id])
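
One caveat with this approach is that `list.index` raises a `ValueError` when the value is not in the list, and it only reports the first match. Here is a minimal sketch of a more defensive lookup. The helper `find_patient_index` and the `lookup_names` list are hypothetical names introduced only for this illustration.

```python
# Hypothetical demo data, kept separate from the lists used elsewhere
lookup_names = ["Alice", "Bob", "Carol", "Dan", "Eric"]

def find_patient_index(name_list, name):
    """Return the index of the first patient called `name`, or None if absent."""
    try:
        return name_list.index(name)
    except ValueError:
        # list.index raises ValueError when the value is not found
        return None

dan_index = find_patient_index(lookup_names, "Dan")
zoe_index = find_patient_index(lookup_names, "Zoe")
print(dan_index, zoe_index)
```

Returning `None` is just one possible design choice; the caller then has to check for it before indexing into the other lists.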

If we want to add a new patient to our table, we can do this by appending the corresponding attributes to the end of each list.

In [None]:
names.append('Felicia')
sexes.append('female')
ages.append(86)
patient_id = -1
print_patient(names[patient_id], sexes[patient_id], ages[patient_id])
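
Because every new patient requires one `append` per list, it is easy to forget one of them. A small sketch of how we might bundle the three appends into a single function follows. The function `add_patient` and the `roster_*` lists are hypothetical names used only for this example.

```python
# Hypothetical demo data, kept separate from the lists used elsewhere
roster_names = ["Alice", "Bob"]
roster_sexes = ["female", "male"]
roster_ages = [67, 77]

def add_patient(name, sex, age):
    """Append one patient to all three roster lists in a single step."""
    roster_names.append(name)
    roster_sexes.append(sex)
    roster_ages.append(age)

add_patient("Felicia", "female", 86)
print(len(roster_names), len(roster_sexes), len(roster_ages))
```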

While this is an okay strategy for storing tabular data, it has many shortcomings. First and foremost, we must be diligent and always ensure that all of the lists have the same length, even if the data is actually supposed to be missing from our records. Otherwise, the data will be misaligned and we will not be able to access rows properly. Let's take our previous example and assume that Carol did not want to report her sex or age:

In [None]:
# Bad way of handling missing data
bad_names = ["Alice", "Bob", "Carol", "Dan", "Eric"]
bad_sexes = ["female", "male", "male", "male"]
bad_ages = [67, 77, 83, 64]

for patient_id in range(len(bad_names)):
    try:
        print_patient(bad_names[patient_id], bad_sexes[patient_id], bad_ages[patient_id])
    except IndexError:
        print('Index out of bounds!')

Not only do we get an `IndexError` for the last patient, but we also have a mismatch between patients and their data: Carol is printed with Dan's sex and age, and Dan is printed with Eric's. We said that Carol was the patient who did not report her sex or age, but because the parallel lists don't have the same length, we lose the mapping between attributes.
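
One way to catch this kind of problem early is to compare the lengths of the lists before using them. The sketch below uses a hypothetical helper, `check_same_length`, and its own copies of the misaligned data (`short_names`, `short_sexes`, `short_ages`). It also shows that `zip` stops at the shortest list, so the misalignment would not raise any error at all.

```python
def check_same_length(*columns):
    """Return True if every column (list) has the same number of entries."""
    lengths = {len(column) for column in columns}
    return len(lengths) <= 1

short_names = ["Alice", "Bob", "Carol", "Dan", "Eric"]
short_sexes = ["female", "male", "male", "male"]  # Carol's entry is missing
short_ages = [67, 77, 83, 64]                     # Carol's entry is missing

is_aligned = check_same_length(short_names, short_sexes, short_ages)
print(is_aligned)

# zip stops at the shortest list, so Carol is silently paired with Dan's data
rows = list(zip(short_names, short_sexes, short_ages))
print(rows[2])
```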

One way to fix this is by using placeholder values at the position we would expect Carol's data to be:

In [None]:
# Correct
good_names = ["Alice", "Bob", "Carol", "Dan", "Eric"]
good_sexes = ["female", "male", "undisclosed", "male", "male"]
good_ages = [67, 77, -1, 83, 64]

for patient_id in range(len(good_names)):
    print_patient(good_names[patient_id], good_sexes[patient_id], good_ages[patient_id])
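
Something to watch for with numeric placeholders such as `-1` is that they can silently distort calculations. The sketch below compares the average age computed with the `-1` placeholder against one computed with Python's `None` as the placeholder. Using `None` is an assumption made for this illustration, not something the example above does.

```python
ages_with_sentinel = [67, 77, -1, 83, 64]  # -1 marks Carol's missing age
ages_with_none = [67, 77, None, 83, 64]    # None marks Carol's missing age

# The sentinel is treated like a real age and drags the average down
naive_mean = sum(ages_with_sentinel) / len(ages_with_sentinel)

# With None, we have to skip the missing entries explicitly
known_ages = [age for age in ages_with_none if age is not None]
clean_mean = sum(known_ages) / len(known_ages)

print(naive_mean, clean_mean)
```

Either way, we have to remember which value means "missing" every time we touch the data.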

While this problem can be addressed, it isn't the only issue with storing tabular data in parallel lists. Doing operations that work across multiple attributes can be a challenge. Let's say we wanted to sort our data based on age rather than first name. One way we could do it is as follows:

In [None]:
# Zip the lists together
combined = list(zip(names, sexes, ages))
print(combined)

# Sort the combined list by the third element (the age) of each tuple
sorted_list = sorted(combined, key=lambda tup: tup[2])

# Unzip the sorted list to get the separate lists again
sorted_names, sorted_sexes, sorted_ages = zip(*sorted_list)

# Print the results
for patient_id in range(len(sorted_names)):
    print_patient(sorted_names[patient_id], sorted_sexes[patient_id], sorted_ages[patient_id])
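
Another possible approach, sketched here under the assumption that we want to keep the lists separate, is to sort the row positions by age and then reorder every list with that same ordering. The `idx_*` lists and the variable `order` are hypothetical names for this example.

```python
# Hypothetical demo data, kept separate from the lists used elsewhere
idx_names = ["Alice", "Bob", "Carol", "Dan", "Eric"]
idx_sexes = ["female", "male", "female", "male", "male"]
idx_ages = [67, 77, 75, 83, 64]

# Positions of the patients, ordered from youngest to oldest
order = sorted(range(len(idx_ages)), key=lambda i: idx_ages[i])

# Apply the same ordering to every list so the rows stay aligned
names_by_age = [idx_names[i] for i in order]
sexes_by_age = [idx_sexes[i] for i in order]
ages_by_age = [idx_ages[i] for i in order]

print(names_by_age)
print(ages_by_age)
```

Note that we still have to repeat the reordering step once per column, which is exactly the kind of bookkeeping that becomes tedious as tables grow.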

While it's possible to manually manipulate parallel lists for almost any cross-attribute operation, these operations come up often enough that parallel lists are not a great solution for storing tabular data.

## Nested Lists

Another option for storing tabular data with basic Python data structures is to use a ***nested list***. In this setup, each row is represented by a list with the same length, and we collect all of those lists into a single list.

In [None]:
patients = [["Alice", "female", 67],
            ["Bob", "male", 77],
            ["Carol", "female", 75],
            ["Dan", "male", 83],
            ["Eric", "male", 64]]

If we want to get the information about a specific patient by their position in our table, we can do this by simply indexing into the outer list.

In [None]:
def print_patient_list(patient):
    print(f'Name: {patient[0]}, Sex: {patient[1]}, Age: {patient[2]}')

In [None]:
patient_id = 0
print_patient_list(patients[patient_id])

If we want to get the information about a specific patient by a particular attribute (e.g., their name), we must iterate through the outer list and then find the inner list that has the correct information at the right position.

In [None]:
for i in range(len(patients)):
    patient = patients[i]
    if patient[0] == "Dan":
        print_patient_list(patient)

In [None]:
# This code does the same as the one above, just with fewer lines of code
for patient in patients:
    if patient[0] == "Dan":
        print_patient_list(patient)
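
If we need this kind of lookup repeatedly, we could wrap the loop in a function. The sketch below uses a hypothetical helper, `find_patient_by_name`, and a small demo table called `ward`. It returns the first matching row, or `None` when nobody has that name.

```python
# Hypothetical demo data, kept separate from the table used elsewhere
ward = [["Alice", "female", 67],
        ["Bob", "male", 77],
        ["Dan", "male", 83]]

def find_patient_by_name(rows, name):
    """Return the first row whose name matches, or None if there is none."""
    for row in rows:
        if row[0] == name:
            return row
    return None

dan_row = find_patient_by_name(ward, "Dan")
zoe_row = find_patient_by_name(ward, "Zoe")
print(dan_row, zoe_row)
```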

If we want to grab a column from our table (i.e., a specific attribute from all of our patients), we need to iterate through the elements in our outer list and grab the corresponding element from each inner list.

In [None]:
names = []
for patient in patients:
    names.append(patient[0])
print(names)
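
The same loop can be written more compactly as a list comprehension. This sketch uses its own small table, `table`, and hypothetical variable names so that it does not overwrite anything defined earlier.

```python
# Hypothetical demo data, kept separate from the table used elsewhere
table = [["Alice", "female", 67],
         ["Bob", "male", 77],
         ["Carol", "female", 75]]

# Grab position 0 (the name) or position 2 (the age) from every row
name_column = [row[0] for row in table]
age_column = [row[2] for row in table]

print(name_column)
print(age_column)
```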

If we want to add a new patient to our table, we can do this by appending a new list with the corresponding attributes.

In [None]:
patients.append(['Felicia', 'female', 86])
patient_id = -1
print_patient_list(patients[patient_id])

Again, this is an okay strategy for storing tabular data, but it has many shortcomings. Like before, we must be diligent and always ensure that each of the inner lists has the same length, even if the data is actually supposed to be missing from our records. Otherwise, the data will be misaligned and we will not be able to access rows properly. Let's take our previous example and assume that Carol did not want to report her sex or her age:

In [None]:
# Incorrect
bad_patients = [["Alice", "female", 67],
                ["Bob", "male", 77],
                ["Carol"],
                ["Dan", "male", 83],
                ["Eric", "male", 64]]

for patient_id in range(len(bad_patients)):
    try:
        print_patient_list(bad_patients[patient_id])
    except IndexError:
        print('Not enough data to print!')
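
Rather than waiting for an `IndexError`, we could check up front which rows are incomplete. This is a minimal sketch; the table `ragged_patients` and the constant `EXPECTED_COLUMNS` are hypothetical names for this example.

```python
# Hypothetical demo data with one incomplete row
ragged_patients = [["Alice", "female", 67],
                   ["Bob", "male", 77],
                   ["Carol"],
                   ["Dan", "male", 83]]

EXPECTED_COLUMNS = 3  # name, sex, age

# Collect the position and contents of every row with the wrong length
incomplete_rows = [(position, row)
                   for position, row in enumerate(ragged_patients)
                   if len(row) != EXPECTED_COLUMNS]

print(incomplete_rows)
```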

Again, we can fix this problem by using placeholder values where the data is missing:

In [None]:
# Correct
good_patients = [["Alice", "female", 67],
                 ["Bob", "male", 77],
                 ["Carol", "undisclosed", -1],
                 ["Dan", "male", 83],
                 ["Eric", "male", 64]]

for patient_id in range(len(good_patients)):
    print_patient_list(good_patients[patient_id])

Operations that work across multiple attributes are similarly awkward here. As before, let's try to sort our data based on age rather than first name. You might notice that our way of doing this for parallel lists essentially converted them into a nested list before doing the sorting.

In [None]:
# Sort the nested list by the third element (the age) of each inner list
sorted_patients = sorted(patients, key=lambda row: row[2])

# Print the results
for patient_id in range(len(sorted_patients)):
    print(f'{sorted_patients[patient_id]}')
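
As a variation, the standard library's `operator.itemgetter` can replace the `lambda` and also makes it easy to sort by more than one column. The sketch below sorts a small demo table, `sort_demo` (a hypothetical name), first by sex and then by age within each sex.

```python
from operator import itemgetter

# Hypothetical demo data, kept separate from the table used elsewhere
sort_demo = [["Alice", "female", 67],
             ["Bob", "male", 77],
             ["Carol", "female", 75],
             ["Dan", "male", 83],
             ["Eric", "male", 64]]

# itemgetter(1, 2) builds a (sex, age) key for each row
by_sex_then_age = sorted(sort_demo, key=itemgetter(1, 2))

# reverse=True sorts from oldest to youngest
oldest_first = sorted(sort_demo, key=itemgetter(2), reverse=True)

for row in by_sex_then_age:
    print(row)
```

We still have to remember that position 1 means "sex" and position 2 means "age", which is one of the things dedicated tabular data structures help with.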

# Summary

All in all, it is possible to represent tabular data as either a set of parallel lists or a nested list. However, common operations can be extremely tedious with either of these approaches, so people have created data structures specifically designed to handle tabular data.

For the rest of this session, we will talk about two popular libraries – `numpy` and `pandas` – that not only provide data structures for tabular data, but also provide a multitude of methods for handling such data.