<h2><center>Week 10 - Exercises class</center></h2>
<h3><center>Programming for Data Science 2024</center></h3>

Throughout this class, I used the following resources:
- [The Good Research Code Handbook](https://goodresearch.dev/)
- [The Hitchhiker’s Guide to Python](https://docs.python-guide.org/)
- Computational Thinking by Karl Beecher

As you've seen in class, there are many ways of writing bad code, for example:
- Mysterious names: variables have names which don’t indicate their function
- Magic numbers: hard-coded values with unexplained meaning
- Duplicated code: large portions of duplicated code with small tweaks
- Uncontrolled side effects: code is written so that it’s unclear where and when variables are changed
- Large functions: big, unwieldy functions that do a little bit of everything
- High cyclomatic complexity: lots of nested ifs and for loops

We can already start by having an organized and documented repository, like in this [example repo](https://github.com/navdeep-G/samplemod), but that is not sufficient: we also need to write the code and functions themselves in an organized and scalable way.

### Abstraction & Decomposition

One way of thinking about this problem is through abstraction and decomposition, two complementary ways of tackling a coding task and of writing structured code. Following George Pólya's book *How to Solve It* (first published in 1945), the procedure should roughly look like this:
- understand the problem
- devise a plan
- execute the plan
- review and extend

Decomposition is a heuristic for tackling the second point: we seek to break a complex problem down into simpler parts that are easier to deal with. It can be thought of as recursively breaking down tasks:

<style>

figcaption {
  text-align: center;
  margin-top: 10px
}
</style>

<figure class="image">
  <img src="images/decomposition-strategy.png" style="width:100%">
  <figcaption >Image taken from Computational Thinking, by Karl Beecher</figcaption>
</figure>

By decomposing, we can also achieve the distinct but related objective of abstraction. Abstraction is a way of expressing an idea in a specific context while suppressing the details that are irrelevant in that context. In the case of code, we want a function to make clear what it does and show its main parts, while hiding unnecessary details.
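
As a minimal sketch of both ideas, consider a made-up task: compute the mean of some temperature readings stored as text. The function names and the input format are illustrative assumptions, not part of the course material. Each sub-task gets its own small function, and the top-level function reads like the plan itself while the details stay hidden in the helpers.

```python
def parse_readings(raw_text):
    """Turn a comma-separated string such as "20.5, 21.0" into a list of floats."""
    return [float(item) for item in raw_text.split(",") if item.strip()]


def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def summarize_readings(raw_text):
    """Top-level task: it only states the plan, the details live in the helpers."""
    readings = parse_readings(raw_text)
    return mean(readings)


print(summarize_readings("20.0, 22.0, 24.0"))  # 22.0
```

Each helper can be read, tested and replaced on its own, which is the practical payoff of decomposing the task.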

General guidelines for how to make code better could be:
- **Separate concerns**: one function does one thing, a module is a collection of related functions, etc. As a general rule, a function should fit on one screen, i.e. about 40 rows x 80 columns
- **Pure functions**: the inputs come from the arguments and the outputs are returned with the return statement. They are deterministic and stateless, i.e. they have no knowledge of their previous calls
- **No side effects**: functions should not cause "alterations" outside the flow from arguments to return value; examples are modifying a global, modifying an argument, doing I/O, printing to the console, etc.
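
To make the "modifying an argument" case concrete, here is a small sketch with two hypothetical functions (the names and the basket example are assumptions chosen for illustration). The first mutates the list it receives, the second returns a new list and leaves the input alone.

```python
def add_item_impure(item, basket):
    """Side effect: mutates the list passed in by the caller."""
    basket.append(item)
    return basket


def add_item_pure(item, basket):
    """No side effect: builds and returns a new list, leaving the input untouched."""
    return basket + [item]


original = ["apple"]
new_basket = add_item_pure("pear", original)
print(original)    # ['apple']
print(new_basket)  # ['apple', 'pear']

add_item_impure("pear", original)
print(original)    # ['apple', 'pear'] (the caller's list has changed)
```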

<figure class="image">
  <img src="images/good-vs-bad.svg" style="width:100%">
  <figcaption >Image taken from <a href="https://cicero.xyz/v3/remark/0.14.0/github.com/coderefinery/modular-code-development/master/talk.md/#10">CodeRefinery</a></figcaption>
</figure>

**Example 1** Let's take the following code. Would you say it is a good example? Why?

In [3]:
f_to_c_offset = 32.0
f_to_c_factor = 0.555555555
temp_c = 0.0

def fahrenheit_to_celsius_bad(temp_f):
    global temp_c
    temp_c = (temp_f - f_to_c_offset) * f_to_c_factor
    
fahrenheit_to_celsius_bad(temp_f=100.0)
print(temp_c)

37.77777774


How could it be rewritten?

In [4]:
# Solution

def fahrenheit_to_celsius(temp_f):
    temp_c = (temp_f - 32.0) * (5.0/9.0)
    return temp_c
temp_c = fahrenheit_to_celsius(temp_f=100.0)
print(temp_c)

37.77777777777778


The function is now *pure*:
- it doesn't modify a global variable (stateless)
- it just takes the input and returns the output
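
One practical benefit of a pure function is that it is easy to test: no globals need to be set up or inspected. The checks below are a minimal sketch; the reference points (freezing and boiling point of water) are chosen here for illustration.

```python
import math


def fahrenheit_to_celsius(temp_f):
    return (temp_f - 32.0) * (5.0 / 9.0)


# Known reference points: freezing and boiling point of water.
assert math.isclose(fahrenheit_to_celsius(32.0), 0.0)
assert math.isclose(fahrenheit_to_celsius(212.0), 100.0)

# Calling it twice with the same input always gives the same result.
assert fahrenheit_to_celsius(100.0) == fahrenheit_to_celsius(100.0)
```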

**Example 2** (taken from [The Good Research Code Handbook](https://goodresearch.dev/decoupled))

We want to write a piece of code that does three things:
- Loads a file
- Counts the words in the file
- Writes the counted words to an output file

One straightforward way to code this is in a single function:

In [None]:
def count_words_in_file(in_file, out_file):
    counts = {}
    with open(in_file, 'r') as f:
        for l in f:
            # Split words on spaces.
            W = l.lower().split(' ')
            for w in W:
                if w != '':
                    if w in counts:
                        counts[w] += 1
                    else:
                        counts[w] = 0

    with open(out_file, 'w') as f:
        for k in counts.keys():
            f.write( k + ","+ str(counts[k]) + "\n")

Is this a good way? Can you spot problems?

<details>
  <summary>Spoiler</summary>
  
  - Using one-character variable names.
  - Using lots of nested for and if statements (6 levels of indent)
  - Using custom string formatting
  - Mixing IO and computation
  
</details>

How could we improve the code?

In [None]:
# Improvement 1

# We can start by separating the function itself from the I/O operations

def count_words(text):
    # Split words on spaces.
    counts = {}
    W = text.lower().split(' ')
    for w in W:
        if w != '':
            if w in counts:
                counts[w] += 1
            else:
                counts[w] = 0

    return counts

def count_words_in_file(in_file, out_file):
    with open(in_file, 'r') as f:
        counts = count_words(f.read())

    with open(out_file, 'w') as f:
        for k in counts.keys():
            f.write( k + ","+ str(counts[k]) + "\n")

<details>
  <summary>Spoiler</summary>
  
  Now *count_words* is a pure function and it does not have the "side effect" of reading from and writing to a file.
  
</details>

In [5]:
# Improvement 2

# We can then assign more meaningful names, use an f-string to format
# the output lines, and properly document the code

def count_words(text):
    """Split words on spaces."""
    counts = {}
    words = text.lower().split(' ')
    for word in words:
        if word != '':
            if word in counts:
                counts[word] += 1
            else:
                counts[word] = 0

    return counts

def count_words_in_file(in_file, out_file):
    with open(in_file, 'r') as f:
        counts = count_words(f.read())

    with open(out_file, 'w') as f:
        for word, count in counts.items():
            f.write(f"{word},{count}\n")

There is actually another problem with this function: the counts are wrong (every word starts at 0 instead of 1), and newlines are not treated as word separators:

In [6]:
print(count_words("hello world"))
print(count_words("hello world\n\nhello"))

{'hello': 0, 'world': 0}
{'hello': 0, 'world\n\nhello': 0}


Resolving these problems will also (partially) resolve the indentation problems, which are caused by:
- checking whether there are already entries with the word as key in *counts*
- checking if there are empty words caused by multiple spaces close together

We can solve the first of these by using a *defaultdict*. This is an object that behaves like a normal dictionary, except when a key is missing: instead of raising a `KeyError`, a defaultdict creates the entry with a "default" value (for example 0 when it is built with `int`).

In [7]:
# A normal dictionary would throw an error here
normal_dict = {}
normal_dict["a"] += 1
normal_dict

KeyError: 'a'

In [2]:
import collections

# But a defaultdict does not
default_dict = collections.defaultdict(int)  # equivalent to collections.defaultdict(lambda: 0)
default_dict["a"] += 1
default_dict

defaultdict(int, {'a': 1})

For our problem:

In [3]:
def count_words(text):
    """Split words on spaces."""
    counts = collections.defaultdict(int)
    words = text.lower().split(' ')
    for word in words:
        if word != '':
            counts[word] += 1

    return counts

print(count_words("hello world"))

defaultdict(<class 'int'>, {'hello': 1, 'world': 1})


The remaining problems of newlines and nesting could be solved at once by using regular expressions, in this case *\s+*, which matches one or more consecutive whitespace characters (spaces, newlines, tabs, etc.).
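
A quick way to see the difference is to compare splitting on a single space with splitting on the *\s+* pattern. This is only a small demonstration on the test string used above.

```python
import re

text = "hello world\n\nhello"

# Splitting on a single space keeps the newlines inside a "word".
print(text.split(" "))         # ['hello', 'world\n\nhello']

# Splitting on runs of whitespace separates all three words.
print(re.split(r"\s+", text))  # ['hello', 'world', 'hello']
```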

Our final code might look like this:

In [11]:
import collections
import re

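def count_words(text):
    """Count lowercase words separated by any whitespace."""
    counts = collections.defaultdict(int)
    # Split on runs of whitespace; drop the empty strings produced by
    # leading or trailing whitespace (e.g. a final newline in a file).
    words = [word for word in re.split(r"\s+", text.lower()) if word]
    for word in words:
        counts[word] += 1
    return counts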

def count_words_in_file(in_file, out_file):
    with open(in_file, 'r') as f:
        counts = count_words(f.read())

    with open(out_file, 'w') as f:
        for word, count in counts.items():
            f.write(f"{word},{count}\n")

print(count_words("hello world\n\nhello"))

defaultdict(<class 'int'>, {'hello': 2, 'world': 1})


It's a nice improvement in readability!
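
As an aside, the standard library can do even more of the work. The sketch below swaps in a different technique from the one used above: `collections.Counter` instead of a defaultdict, and `str.split()` without arguments instead of a regular expression. The function name `count_words_with_counter` is made up for this example.

```python
import collections


def count_words_with_counter(text):
    """Count lowercase words separated by any whitespace."""
    # str.split() with no argument splits on runs of whitespace
    # and never returns empty strings.
    return collections.Counter(text.lower().split())


print(count_words_with_counter("hello world\n\nhello"))
# Counter({'hello': 2, 'world': 1})
```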

If you're interested in another great example, there's a [repository](https://github.com/fbaptiste/python-blog/tree/main/Idiomatic_Python/14_decomposition) implementing different levels of refactoring, with an accompanying [YouTube video](https://www.youtube.com/watch?v=AtcWP8LZoLo).

### Coding style & linters

Some examples of bad code, each followed by a better version:

In [None]:
def make_complex(*args):
    x, y = args
    return dict(**locals())

In [None]:
# Good code

# The previous version is needlessly complicated;
# here we name the arguments explicitly
def make_complex(x, y):
    return {'x': x, 'y': y}

In [None]:
valedictorian = max([(student.gpa, student.name) for student in graduates])

In [None]:
# Good code

# The previous code needlessly allocates a list of all (gpa, name) entries in memory
valedictorian = max((student.gpa, student.name) for student in graduates)
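
The snippet above assumes that `graduates` already exists. A self-contained sketch, with a hypothetical `Student` namedtuple and made-up data standing in for the real objects, shows both the result and the difference in memory between a list and a generator:

```python
import collections
import sys

# Hypothetical stand-in for the student objects used above.
Student = collections.namedtuple("Student", ["name", "gpa"])
graduates = [Student("Ada", 3.9), Student("Grace", 4.0), Student("Alan", 3.7)]

# max() compares the (gpa, name) tuples element by element, so the highest GPA wins.
valedictorian = max((student.gpa, student.name) for student in graduates)
print(valedictorian)  # (4.0, 'Grace')

# A generator object has a small, fixed size; a list grows with its contents.
numbers_list = [n for n in range(100_000)]
numbers_generator = (n for n in range(100_000))
print(sys.getsizeof(numbers_list) > sys.getsizeof(numbers_generator))  # True
```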

In [None]:
[print(x) for x in sequence]

In [None]:
# Good code

# Don't use a list comprehension just for its side effects.
for x in sequence:
    print(x)

In [14]:
# Bad code: tries to remove the elements greater than 3
a = [3, 4, 5]
for i in a:
    if i > 3:
        a.remove(i)

In [18]:
# Good code

# Don't remove items from a list while you are iterating through it.
# Use a list comprehension to create a new list object instead.
filtered_values = [value for value in a if value <= 3]
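
To see why removing items during iteration is a problem, it helps to run both versions on a fresh list. The variable name `values` is used here only to keep the two runs separate.

```python
values = [3, 4, 5]
for value in values:
    if value > 3:
        values.remove(value)  # shifts the remaining items one position to the left

# 5 moved into the slot the loop had already visited, so it was never checked.
print(values)  # [3, 5]

# Building a new list avoids the problem.
values = [3, 4, 5]
filtered_values = [value for value in values if value <= 3]
print(filtered_values)  # [3]
```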

We can (partially) automate the search for errors, warnings and bad style with linters, e.g. *pylint* or *pyflakes*. In IDEs like VSCode you can install one from the extension marketplace and see its suggestions on the bottom left.
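
As a small illustration, the snippet below runs without errors but contains two things that linters typically flag: an unused import and a variable that is assigned but never used. The file name `shapes.py` and the function are made up for this example, and the exact wording of the messages depends on the tool and its version.

```python
import os  # unused import: linters typically report this


def rectangle_area(width, height):
    perimeter = 2 * (width + height)  # assigned but never used: also reported
    return width * height


print(rectangle_area(2, 3))  # 6
```

Assuming the code is saved as `shapes.py`, running `pyflakes shapes.py` or `pylint shapes.py` from the command line should list both issues.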

### Good to have: Formatters

Formatters automatically rewrite your code to follow a consistent style, which takes care of many of the linter's style suggestions. A popular one is black: https://github.com/psf/black. You can install it with *pip install black* or directly from the VSCode extensions, where it can usually be set to format the code when a file is saved.
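
As a rough idea of what a formatter changes, here is a made-up function before and after formatting. The "after" version is what black would be expected to produce for this input; the result may differ slightly depending on the version and its configuration.

```python
# Before formatting (valid Python, but inconsistently spaced):
#
#   def scale(values,factor = 2):
#       return [ v*factor for v in values ]

# After formatting, black would be expected to produce:
def scale(values, factor=2):
    return [v * factor for v in values]


print(scale([1, 2, 3]))  # [2, 4, 6]
```

Only the layout changes; the behavior of the code stays the same.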