Python for Data Analysis, Vilnius University, 2024

# HW1: writing custom functions

Each task consists of three cells:
1. A text cell containing the task description;
1. A code cell to complete the task (with a corresponding comment inside);
1. A code cell with `assert` statements (do not modify it!).

After providing your solution in the second cell, run both code cells one after another to self-check if the task was solved correctly.

There are 25 points to collect (5 per task), and 4 bonus points.

For this homework, **you cannot import any libraries - not even built-in ones!**

Don't hesitate to contact me or Martynas if you are stuck.

## Task 1. Move code into a function

This code calculates the rounded root-mean-square deviation (RMSD) between two lists of numbers (`values0` and `values1`). The result is stored in `round_rmsd`:

```python
values0 = [0.1, 1.2, 4.1, 7.7, 2.0, 8.2, 3.4]
values1 = [9.1, 1.2, 4.4, 7.6, 2.1, 8.8, 6.7]
differences = []

for i in range(len(values0)):
    d = abs(values0[i] - values1[i])
    differences.append(d ** 2)

msd = sum(differences) / len(differences)
rmsd = msd ** 0.5
round_rmsd = round(rmsd, 2)
```

- ❗️ _(2 points)_ Move this code into a function. It should have 2 parameters (two lists of numbers). It should return one value (rounded RMSD). Name the function `get_rmsd`.

- ❗️ _(1 point)_ Add type hints to the function.

- ❕ _(bonus, 2 points)_ Convert this function into a one-liner (in a new cell).

- ❗️ _(2 points)_ Add an `if-else` statement to the function to ensure that the given lists are not empty and have the same length; otherwise return `None` (do not raise any errors).

In [None]:
# 1 and 2.

def get_rmsd(values0: list[float], values1: list[float]) -> float:
  differences = []
  for i in range(len(values0)):
    d = abs(values0[i] - values1[i])
    differences.append(d ** 2)

  msd = sum(differences) / len(differences)
  rmsd = msd ** 0.5
  round_rmsd = round(rmsd, 2)
  return round_rmsd

# Even better typehints would be:
# get_rmsd(values0: list[float | int], values1: list[float | int]) -> float

assert get_rmsd([0.1, 1.2, 4.1, 7.7, 2.0, 8.2, 3.4], [9.1, 1.2, 4.4, 7.6, 2.1, 8.8, 6.7]) == 3.63
assert get_rmsd([1], [3]) == 2.0

In [None]:
# 3.
values0 = [0.1, 1.2, 4.1, 7.7, 2.0, 8.2, 3.4]
values1 = [9.1, 1.2, 4.4, 7.6, 2.1, 8.8, 6.7]

# parentheses around the division matter: `**` binds tighter than `/`
rmsd = round((sum([abs(v0 - v1) ** 2 for v0, v1 in zip(values0, values1)]) / len(values0)) ** 0.5, 2)
rmsd
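
The bonus item asks for the function itself as a one-liner, so the same expression can be wrapped in a function body. This is a minimal sketch; the name `get_rmsd_oneliner` is hypothetical and is used only to avoid clashing with `get_rmsd`:

```python
def get_rmsd_oneliner(values0: list[float], values1: list[float]) -> float:
    # mean of squared differences, then square root, then rounding to 2 decimals
    return round((sum((v0 - v1) ** 2 for v0, v1 in zip(values0, values1)) / len(values0)) ** 0.5, 2)
```

Squaring already removes the sign, so `abs()` is not needed here.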

In [None]:
# 4.
def get_rmsd(values0: list[float], values1: list[float]) -> float | None:
  if (len(values0) != len(values1)) or (len(values0) == 0):
    return None

  differences = []
  for i in range(len(values0)):
    d = abs(values0[i] - values1[i])
    differences.append(d ** 2)

  msd = sum(differences) / len(differences)
  rmsd = msd ** 0.5
  round_rmsd = round(rmsd, 2)
  return round_rmsd

assert get_rmsd([0.1, 1.2, 4.1, 7.7, 2.0, 8.2, 3.4], [9.1, 1.2, 4.4, 7.6, 2.1, 8.8, 6.7]) == 3.63
assert get_rmsd([1], [3]) == 2.0
assert get_rmsd([], []) == None
assert get_rmsd([1, 2, 3], [1, 2]) == None

## Task 2. Remove hardcoded parameters from a function

You are an [exobiology](https://en.wikipedia.org/wiki/Astrobiology) assistant. Your supervisor has sent you some sequences of extraterrestrial bacteria. These bacteria use a different genetic code, with bases denoted by the letters X, Z, W, Y, and Q, but have somewhat similar transcription rules:
- X is complementary to Z;
- W is complementary to Y;
- The extra rare base Q is always turned into W during the transcription process.

This is a function which transcribes human sequences:

```python
def transcribe(seq: str) -> str:
    result = ""
    rules = {"A": "U", "T": "A", "C": "G", "G": "C"}
    for base in seq:
        result += rules[base]
    return result
```

- ❗️ _(2 points)_ Update this function so that it can be used with the extraterrestrial sequences. _Do not leave the human transcription rules inside the function._

- ❕ _(bonus, 2 points)_ Make the function not case-sensitive.

- ❗️ _(1 point)_ Add the possibility (not a requirement) for the user to provide their own transcription rules. Ensure correct type hints.

- ❗️ _(2 points)_ Add a condition to check if the sequence can be transcribed with the current ruleset, otherwise return `None`.

In [None]:
# 1 and 2.
def transcribe(seq: str) -> str:
    result = ""
    rules = {"X": "Z", "Z": "X", "W": "Y", "Y": "W", "Q": "W"}  # 1
    seq = seq.upper()   # 2
    for base in seq:
        result += rules[base]
    return result

assert transcribe("ZYWXQYQXWYXZXZXZYXWYZXQYWWXXXZYZYXWYXZ") == "XWYZWWWZYWZXZXZXWZYWXZWWYYZZZXWXWZYWZX"
assert transcribe("YqQyyQ") == "WWWWWW"

In [None]:
# 3.
def transcribe(seq: str, rules: dict[str, str] | None = None) -> str:
    result = ""
    if rules is None:
        rules = {"X": "Z", "Z": "X", "W": "Y", "Y": "W", "Q": "W"}
    seq = seq.upper()
    for base in seq:
        result += rules[base]
    return result

assert transcribe("ZYWXQYQXWYXZXZXZYXWYZXQYWWXXXZYZYXWYXZ") == "XWYZWWWZYWZXZXZXWZYWXZWWYYZZZXWXWZYWZX"
assert transcribe("YqQyyQ") == "WWWWWW"
assert transcribe("ABCBDCABCBDABCD", {"A": "B", "B": "C", "C": "D", "D": "A"}) == "BCDCADBCDCABCDA"

In [None]:
# 4.
def transcribe(seq: str, rules: dict[str, str] | None = None) -> str | None:
    result = ""
    if rules is None:
        rules = {"X": "Z", "Z": "X", "W": "Y", "Y": "W", "Q": "W"}
    seq = seq.upper()
    for base in seq:
        if base not in rules:  # unknown base: the sequence cannot be transcribed
            return None
        result += rules[base]
    return result

assert transcribe("ZYWXQYQXWYXZXZXZYXWYZXQYWWXXXZYZYXWYXZ") == "XWYZWWWZYWZXZXZXWZYWXZWWYYZZZXWXWZYWZX"
assert transcribe("YqQyyQ") == "WWWWWW"
assert transcribe("ABCBDCABCBDABCD", {"A": "B", "B": "C", "C": "D", "D": "A"}) == "BCDCADBCDCABCDA"
assert transcribe("ZWYYZQAZWQZ") == None
assert transcribe("AAAAAAA", {"A": ""}) == ""

## Task 3. Edge cases

This function produces a list of ordered [Fibonacci numbers](https://www.britannica.com/science/Fibonacci-number) up to the provided number (including that number):

```python
def fibonacci_generator(upper_limit: int) -> list[int]:
    fibonacci_numbers = [0, 1]
    while True:
        next_number = fibonacci_numbers[-1] + fibonacci_numbers[-2]
        if next_number > upper_limit:
            return fibonacci_numbers
        fibonacci_numbers.append(next_number)
```

With `upper_limit=5`, the result is: `[0, 1, 1, 2, 3, 5]`.

However, this function returns unexpected results if `upper_limit` is less than 1.

- ❗️ _(2 points)_ Update this function, so that all numbers in the returned list **never** exceed `upper_limit`. The function should always take an integer and return a list.
- ❗️ _(3 points)_ Update this function again: add a new parameter which takes a list of two integers as the start of the Fibonacci sequence. Ensure that the output numbers still do not exceed `upper_limit`.

In [None]:
# 1.
def fibonacci_generator(upper_limit: int) -> list[int]:
    fibonacci_numbers = [0, 1]
    while True:
        next_number = fibonacci_numbers[-1] + fibonacci_numbers[-2]
        if next_number > upper_limit:
            return [x for x in fibonacci_numbers if x <= upper_limit]  # remove too big numbers
        fibonacci_numbers.append(next_number)

assert fibonacci_generator(5) == [0, 1, 1, 2, 3, 5]
assert fibonacci_generator(100) == [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
assert fibonacci_generator(1) == [0, 1, 1]
assert fibonacci_generator(0) == [0]
assert fibonacci_generator(-1) == []
assert fibonacci_generator(-2) == []
assert fibonacci_generator(-100) == []

In [None]:
# 2.
def fibonacci_generator(upper_limit: int, start_from: list[int] = [0, 1]) -> list[int]:
    fibonacci_numbers = start_from[:]

    # to make sure we won't enter an infinite loop:
    assert not all(x == 0 for x in start_from)  # two zeros would repeat forever
    # you can also check the length and types of the start_from list,
    # as well as negative upper_limit or start_from values, which can also lead to an infinite loop

    while True:
        next_number = fibonacci_numbers[-1] + fibonacci_numbers[-2]
        if next_number > upper_limit:
            return [x for x in fibonacci_numbers if x <= upper_limit]  # remove too big numbers
        fibonacci_numbers.append(next_number)

assert fibonacci_generator(5) == [0, 1, 1, 2, 3, 5]
assert fibonacci_generator(100) == [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
assert fibonacci_generator(1) == [0, 1, 1]
assert fibonacci_generator(0) == [0]
assert fibonacci_generator(-1) == []
assert fibonacci_generator(-2) == []
assert fibonacci_generator(-100) == []
assert fibonacci_generator(100, [5, 2]) == [5, 2, 7, 9, 16, 25, 41, 66]

## Task 4. Write a function from scratch

While you were busy with the Fibonacci generator, new data arrived from the exobiology lab: measurements of bacterial cell diameter. You need to prepare a function to calculate the median cell diameter.

- ❗️ _(3 points)_ Write a function to calculate the median for a list of floats. The function should always return either a single float or None.
  - Reminder: you cannot import any modules.

- ❗️ _(2 points)_ You were informed that some cells were measured incorrectly and should be dropped from the calculations. To mark them, the corresponding values in the dataset were set to zero. Update the median function so that it ignores zeros when calculating the median.


In [None]:
# 1 and 2.
def calculate_median(values: list[float]) -> float | None:
  filtered = [x for x in values if x != 0]  # for part 2
  if not filtered:
    return None

  # index into the sorted, filtered list (not the original `values`)
  sorted_values = sorted(filtered)
  middle = len(sorted_values) // 2
  if len(sorted_values) % 2:  # odd length
    return sorted_values[middle]
  else:  # even
    return (sorted_values[middle] + sorted_values[middle - 1]) / 2

assert calculate_median([1., 2., 3.]) == 2.0
assert calculate_median([1., 2., 3., 4.]) == 2.5
assert calculate_median([2.]) == 2.0
assert calculate_median([]) == None
assert calculate_median([1., 0., 2., 3.]) == 2.0
assert calculate_median([0., 0., 0.]) == None

## Task 5. Complex data pre-processing

The neighboring lab specializes in snails and has just received a dataset describing rare snails found in some obscure location. However, during a coffee break, they accidentally left the data file open. Their lab cat Helix jumped on a keyboard, made some biscuits, and saved the file. The lab does not have any backups of this data and now needs to remove all corrupted data from the file. They ask for your help, as the file is too big to filter manually.

The snail data is in a CSV file. Each row contains measurements of a single snail. There are five columns:
1. Snail identification number (positive integer)
2. Latin name for snail species (two words separated by a single space, first starts with a capital letter, all other letters are lowercase, can only contain latin letters)
3. Snail red color index (a [hexadecimal number](https://simple.wikipedia.org/wiki/Hexadecimal), consisting of two digits or A-F letters with minimum 00 and maximum FF)
4. Snail shell size (positive float)
5. Snail spirality index (float, can be both positive and negative but cannot be zero)

The lab uploaded the corrupted file to GitHub (a shame they didn't do it earlier). The file should be downloaded automatically when you run this code in a Notebook:

```bash
!wget https://raw.githubusercontent.com/Tallivm/vu-python/main/snail_data_corrupted.csv
```

- If downloading does not work, enter the link into a new browser window, copy the whole text (using Ctrl+A is highly advised), and save it into a new CSV file named "snail_data_corrupted.csv".

❗️ _(5 points)_ Write code which would go through the contents of the data file, detect all rows with any errors, print the number of such rows, then save a new file containing only correct rows named "snail_data_cleaned.csv".

In [None]:
!wget https://raw.githubusercontent.com/Tallivm/vu-python/main/snail_data_corrupted.csv

In [2]:
def str_is_hex(string: str) -> bool:
  try:
    int(string, 16)
    return True
  except ValueError:
    return False


def can_be_float(string: str) -> bool:
  try:
    float(string)
    return True
  except ValueError:
    return False


def validate_row(row: str) -> bool:
  elements = row.split(',')
  if (
      (len(elements) == 5) and
      # index
      elements[0].isdigit() and
      (int(elements[0]) > 0) and
      # name
      (elements[1].count(' ') == 1) and
      elements[1].replace(' ', '').isalpha() and
      elements[1].split(' ')[0].istitle() and
      elements[1].split(' ')[1].islower() and
      # color
      (len(elements[2]) == 2) and
      str_is_hex(elements[2]) and
      # shell size
      ('.' in elements[3]) and
      can_be_float(elements[3]) and
      (float(elements[3]) > 0) and
      # spirality index
      ('.' in elements[4]) and
      can_be_float(elements[4]) and
      (float(elements[4]) != 0)
    ):
    return True
  else:
    return False


def clean_file(input_filename: str, output_filename: str) -> None:
  with open(input_filename, 'r') as f:
    rows = f.read().splitlines()
  good_rows = [r for r in rows if validate_row(r)]
  with open(output_filename, 'w') as f:
    f.write('\n'.join(good_rows))
  print('Total rows:', len(rows), ' | Bad rows:', len(rows) - len(good_rows))

clean_file('snail_data_corrupted.csv', 'snail_data_cleaned.csv')
# there will be 9 bad rows

Total rows: 202  | Bad rows: 9
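
The helpers above are slightly more permissive than the column descriptions: `int(string, 16)` also accepts a leading sign (for example `'+F'`) and lowercase digits, and `str.isalpha()` accepts non-Latin letters. A stricter variant could be sketched as follows. The helper names are hypothetical, requiring uppercase A-F is an assumption based on the task text, and whether these checks change the number of bad rows for this particular file has not been verified:

```python
HEX_DIGITS = "0123456789ABCDEF"
LATIN_LOWER = "abcdefghijklmnopqrstuvwxyz"
LATIN_UPPER = LATIN_LOWER.upper()


def is_two_digit_hex(string: str) -> bool:
    # exactly two characters, each an uppercase hexadecimal digit (00..FF)
    return len(string) == 2 and all(c in HEX_DIGITS for c in string)


def is_latin_name(string: str) -> bool:
    # two non-empty words separated by a single space
    words = string.split(' ')
    if len(words) != 2 or not all(words):
        return False
    first, second = words
    # first letter is a capital Latin letter, everything else is lowercase Latin
    return (
        first[0] in LATIN_UPPER
        and all(c in LATIN_LOWER for c in first[1:])
        and all(c in LATIN_LOWER for c in second)
    )
```

These could replace the color and name conditions inside `validate_row`.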


One additional thing that can be checked is whether all indices are unique. For example, you can add these rows after `good_rows = [...]`:

```python
indices = [int(x.split(',')[0]) for x in good_rows]
unique_indices = set(indices)
unique_index_rows = []
for i in unique_indices:
    all_rows_with_i_index = [x for x in good_rows if int(x.split(',')[0]) == i]
    first_match = all_rows_with_i_index[0]
    unique_index_rows.append(first_match)
```

And later you would save `unique_index_rows` instead of `good_rows` into the output file.

There may be a more efficient way to do this, but this version may be the clearer one.
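
One possibly more efficient alternative is a single pass that remembers which indices have already been seen. This is a sketch only; the function name `drop_duplicate_indices` is hypothetical. Unlike iterating over a `set`, it also keeps the rows in their original order:

```python
def drop_duplicate_indices(rows: list[str]) -> list[str]:
    seen = set()        # indices encountered so far
    unique_rows = []    # first row for each index, in file order
    for row in rows:
        index = int(row.split(',')[0])
        if index not in seen:
            seen.add(index)
            unique_rows.append(row)
    return unique_rows
```

It would be called as `drop_duplicate_indices(good_rows)`, and its result saved to the output file.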