<img src="resources/cropped-SummerWorkshop_Header.png">  

<h1 align="center">Research Coding Workshop SWDB 2022 </h1> 
<h3 align="center">August 2022</h3> 

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

<h2>Introduction</h2>
    
<p>In this workshop, we'll go over some good coding practices that you can adopt to make your source code easier to debug and share with others. There are handy references that discuss coding practices geared towards research scientists, for example <a href="https://goodresearch.dev/">The Good Research Code Handbook</a>.

</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
    
<h2>Agile Manifesto</h2>
   
<ul> 
  <li> Individuals and interactions over processes and tools
  <li> Working software over comprehensive documentation
  <li> Customer collaboration over contract negotiation
  <li> Responding to change over following a plan
</ul>
    
<img src="resources/research_coding/dilbert_agile.png" style="width: 60%; height: 60%"/>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
    
<h2>Decoupled Code</h2>
   
A <a href="https://en.wikipedia.org/wiki/Code_smell">code smell</a> is a characteristic of source code revealed in a quick glance that indicates a deeper underlying issue or lack of clarity. We'll review a few of the more common ones that show up in research code.
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
    
<h4>Mysterious Names</h4>
   
<p>Avoid single-letter and ambiguous variable names as much as possible.

In [1]:
# This code works, but it's difficult to understand the meaning at first glance
l = ['John', 'Jane']
for i in l:
    print(i)

John
Jane


In [2]:
# It's a bit of an art, but variable names should be short yet descriptive
# Even counters in loops should try to convey their meaning
students = ['John', 'Jane']
for student in students:
    print(student)

John
Jane


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
    
<h4>Magic Numbers</h4>
   
<p>In most contexts, avoid hard-coding unnamed numerical constants. Like mysterious names, they make the software unclear. They also make your software rigid and difficult to update later. <a href="https://en.wikipedia.org/wiki/Magic_number_(programming)">Wikipedia</a> has a decent article on the topic.

In [3]:
student_test_scores = {'John': [85.0, 87.7], 'Jane': [98.0, 96.6]}

# This function assumes the number of tests will always be 2 and will give an incorrect result in other contexts.
def get_average_score(test_scores):
    return (test_scores[0] + test_scores[1]) / 2

janes_avg_score = get_average_score(student_test_scores['Jane'])
print("Jane's Score: ", janes_avg_score)

# Much better. Even though there are more lines of code, it is easier to understand and is flexible
def get_average_score(test_scores):
    num_of_tests = len(test_scores)
    running_total = 0
    for test_score in test_scores:
        running_total += test_score
    return running_total / num_of_tests

johns_avg_score = get_average_score(student_test_scores['John'])
print("John's Score: ", johns_avg_score)

Jane's Score:  97.3
John's Score:  86.35
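Magic numbers that cannot be derived from the data can be given a name instead. The following is a minimal sketch, assuming a hypothetical passing threshold; `PASSING_SCORE` and `has_passed` are illustrative names that are not part of the original example.

```python
# Hypothetical grading policy, named once so it can be changed in one place.
PASSING_SCORE = 70.0  # assumed threshold, chosen only for illustration


def get_average_score(test_scores):
    """Return the mean of a non-empty list of test scores."""
    return sum(test_scores) / len(test_scores)


def has_passed(test_scores, passing_score=PASSING_SCORE):
    """Return True when the average score meets the passing threshold."""
    return get_average_score(test_scores) >= passing_score


student_test_scores = {'John': [85.0, 87.7], 'Jane': [98.0, 96.6]}
for student, test_scores in student_test_scores.items():
    print(student, has_passed(test_scores))
```

A reader who sees `PASSING_SCORE` knows what the number means, and a change in policy only requires editing one line.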


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
    
<h4>Uncontrolled Side Effects</h4>
   
<p>As much as possible, avoid writing functions that mutate their inputs.

In [4]:
def first_student_by_name(students):
    first_student = students.pop(0)
    for student in students:
        first_student = student if student < first_student else first_student
    return first_student

print("List of students: ", students)
first_student = first_student_by_name(students)
print("First student by name: ", first_student)
print("List of students: ", students)
print("John is missing!")

List of students:  ['John', 'Jane']
First student by name:  Jane
List of students:  ['Jane']
John is missing!


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
 
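<p>Many Python objects, such as lists and dictionaries, are mutable, and a function receives a reference to the caller's object rather than a copy, so we need to be careful about how we access the elements. Try to write <a href="https://en.wikipedia.org/wiki/Pure_function">pure functions</a> as much as possible.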

In [5]:
students = ['John', 'Jane']

def first_student_by_name(students):
    first_student = students[0]
    for student in students:
        first_student = student if student < first_student else first_student
    return first_student

print("List of students: ", students)
first_student = first_student_by_name(students)
print("First student by name: ", first_student)
print("List of students: ", students)
print("Students are all there!")

List of students:  ['John', 'Jane']
First student by name:  Jane
List of students:  ['John', 'Jane']
Students are all there!


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
    
<h4>Embedded Configurations</h4>
   
<p>Avoid hard-coding configurations, such as I/O paths, directly in your source code.

In [6]:
import json

# This function is unlikely to work for anyone other than the original writer!
def get_results():
    with open('C:\\Users\\john.doe\\Documents\\results.json') as file:
        results = json.load(file)
    return results

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
   
<p>Instead, it's good practice to move configurations into a config file. Python has configparser packaged in the standard library.

In [7]:
# It's better to put the configuration loader into its own module
import configparser

config = configparser.ConfigParser()
config.read('resources/research_coding/config.ini')

results_file_location = config['PATHS']['results']
print(results_file_location)
# A different user only needs to modify the config.ini file for their environment

C:\Users\john.doe\Documents\results.json
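The contents of `config.ini` are not shown in this notebook. The sketch below assumes a layout that would produce the output above and parses it from a string, so it runs without the file on disk; `EXAMPLE_CONFIG_TEXT` is a hypothetical stand-in for the real file.

```python
import configparser

# Assumed contents of config.ini; the real file is not shown in this workshop.
EXAMPLE_CONFIG_TEXT = r"""
[PATHS]
results = C:\Users\john.doe\Documents\results.json
"""

config = configparser.ConfigParser()
# read_string() parses text directly, which avoids needing a file on disk here.
config.read_string(EXAMPLE_CONFIG_TEXT)

# fallback= supplies a default instead of raising when the option is missing.
results_file_location = config.get('PATHS', 'results', fallback=None)
print(results_file_location)
```

Using `config.get(..., fallback=...)` is one way to tolerate a missing entry. Indexing with `config['PATHS']['results']` raises a `KeyError` instead, which may be preferable when the setting is mandatory.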


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
    
<h2>Testing</h2>

<p> Testing is essential for maintaining healthy, long-lived code bases and for enabling safe contributions from collaborators. Tests can vary from small unit tests to complete integration tests. It's important to keep track of your test coverage (the percentage of source code that your tests exercise). Python has a unittest package that comes bundled with the standard library, although many people prefer to install pytest.

In [8]:
# Suppose we write an implementation of Newton's method to approximate a square root
def square_root_newton(input_number, 
                       tolerance = 1E-6,
                       max_count = 1E3):
    
    current_number = input_number
    iter_count = 0
 
    while (iter_count < max_count) :
        iter_count += 1
        root = 0.5 * (current_number + (input_number / current_number))
        if (abs(root - current_number) < tolerance):
            break
        current_number = root
 
    return root

In [9]:
# We can add tests to see if our method has some expected behavior.
import unittest

class TestSquareRootNewton(unittest.TestCase):
    def test_hundred(self):
        self.assertAlmostEqual(10.0, square_root_newton(100.0), delta=1E-6)  # absolute tolerance
    def test_negative_number(self):
        self.assertIsNone(square_root_newton(-100.0))

In [10]:
unittest.main(argv=[''], verbosity=2, exit=False)

test_hundred (__main__.TestSquareRootNewton) ... ok
test_negative_number (__main__.TestSquareRootNewton) ... FAIL

FAIL: test_negative_number (__main__.TestSquareRootNewton)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/ipykernel_29412/3980611700.py", line 8, in test_negative_number
    self.assertIsNone(square_root_newton(-100.0))
AssertionError: -0.12333895788218374 is not None

----------------------------------------------------------------------
Ran 2 tests in 0.004s

FAILED (failures=1)


<unittest.main.TestProgram at 0x7fe2ec138910>

In [11]:
# Drats! We forgot to handle negative numbers. We can add that functionality.
def square_root_newton(input_number, 
                       tolerance = 1E-6,
                       max_count = 1E3):
    if(input_number < 0):
        return None
    
    current_number = input_number
    iter_count = 0
 
    while (iter_count < max_count) :
        iter_count += 1
        root = 0.5 * (current_number + (input_number / current_number))
        if (abs(root - current_number) < tolerance):
            break
        current_number = root
 
    return root

In [12]:
unittest.main(argv=[''], verbosity=2, exit=False)

test_hundred (__main__.TestSquareRootNewton) ... ok
test_negative_number (__main__.TestSquareRootNewton) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.003s

OK


<unittest.main.TestProgram at 0x7fe2ee20a7f0>
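For comparison, the same two checks can be written in the plain-function style that pytest discovers. This is a sketch, not the workshop's own test suite. It repeats a compact copy of `square_root_newton` so that it is self-contained, and it uses only the standard library, so pytest is not needed to run it.

```python
import math


def square_root_newton(input_number, tolerance=1E-6, max_count=1000):
    """Approximate a square root with Newton's method (compact copy)."""
    if input_number < 0:
        return None
    current_number = input_number
    root = current_number
    for _ in range(max_count):
        root = 0.5 * (current_number + (input_number / current_number))
        if abs(root - current_number) < tolerance:
            break
        current_number = root
    return root


def test_hundred():
    # abs_tol is the largest absolute difference that still counts as equal.
    assert math.isclose(square_root_newton(100.0), 10.0, abs_tol=1E-6)


def test_negative_number():
    assert square_root_newton(-100.0) is None


# pytest would collect the test_* functions automatically when run on a file;
# calling them directly works too, and an AssertionError signals a failure.
test_hundred()
test_negative_number()
```

Saved in a file whose name starts with `test_`, these functions would be picked up by running `pytest` in the same directory.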

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
    
<h2>Logging</h2>

<p>Python has a logging framework built into the standard library. You can adjust the log level based on how much information you want conveyed to your user. Useful things to log include execution times and loop progression.

In [13]:
import logging
logging.basicConfig(level=logging.INFO)


# Let's modify our square root method to log helpful information
def square_root_newton(input_number, 
                       tolerance = 1E-6,
                       max_count = 1E3):
    if(input_number < 0):
        logging.info(f' input_number {input_number} is negative.')
        return None
    current_number = input_number
    iter_count = 0
    while (iter_count < max_count) :
        iter_count += 1
        root = 0.5 * (current_number + (input_number / current_number))
        logging.debug(f' Iteration count: {iter_count}.'
                      f' Approximate root: {root}.'
                      f' Previous root: {current_number}')
        if (abs(root - current_number) < tolerance):
            break
        current_number = root
    if(iter_count == max_count):
        logging.warning(f' Max iteration count, {max_count}, reached!')
    return root

In [14]:
square_root_newton(-100)

INFO:root: input_number -100 is negative.


In [15]:
# Let's set the tolerance and max count to low numbers
square_root_newton(100, tolerance=1E-12, max_count=5)



10.032578510960604

In [16]:
# We can set the logging level to debug to check if more information is available
logging.getLogger().setLevel(logging.DEBUG)

In [17]:
square_root_newton(100, tolerance=1E-12, max_count=5)

DEBUG:root: Iteration count: 1. Approximate root: 50.5. Previous root: 100
DEBUG:root: Iteration count: 2. Approximate root: 26.24009900990099. Previous root: 50.5
DEBUG:root: Iteration count: 3. Approximate root: 15.025530119986813. Previous root: 26.24009900990099
DEBUG:root: Iteration count: 4. Approximate root: 10.840434673026925. Previous root: 15.025530119986813
DEBUG:root: Iteration count: 5. Approximate root: 10.032578510960604. Previous root: 10.840434673026925


10.032578510960604

In [18]:
# Some useful information to log can be execution times.
from time import time

# Let's modify our square root method to log execution time
def square_root_newton(input_number, 
                       tolerance = 1E-6,
                       max_count = 1E3):
    start_time = time()
    if(input_number < 0):
        logging.info(f' input_number {input_number} is negative.')
        return None
    current_number = input_number
    iter_count = 0
    while (iter_count < max_count) :
        iter_count += 1
        root = 0.5 * (current_number + (input_number / current_number))
        logging.debug(f' Iteration count: {iter_count}.'
                      f' Approximate root: {root}.'
                      f' Previous root: {current_number}')
        if (abs(root - current_number) < tolerance):
            break
        current_number = root
    if(iter_count == max_count):
        logging.warning(f' Max iteration count, {max_count}, reached!')
    logging.debug(f' Program took {time()-start_time} seconds to compute.')
    return root

In [19]:
square_root_newton(1234567.89, tolerance=1E-12)

DEBUG:root: Iteration count: 1. Approximate root: 617284.445. Previous root: 1234567.89
DEBUG:root: Iteration count: 2. Approximate root: 308643.22249918996. Previous root: 617284.445
DEBUG:root: Iteration count: 3. Approximate root: 154323.611241495. Previous root: 308643.22249918996
DEBUG:root: Iteration count: 4. Approximate root: 77165.80555270887. Previous root: 154323.611241495
DEBUG:root: Iteration count: 5. Approximate root: 38590.90222559984. Previous root: 77165.80555270887
DEBUG:root: Iteration count: 6. Approximate root: 19311.44669490348. Previous root: 38590.90222559984
DEBUG:root: Iteration count: 7. Approximate root: 9687.688013525303. Previous root: 19311.44669490348
DEBUG:root: Iteration count: 8. Approximate root: 4907.562403157972. Previous root: 9687.688013525303
DEBUG:root: Iteration count: 9. Approximate root: 2579.563391246668. Previous root: 4907.562403157972
DEBUG:root: Iteration count: 10. Approximate root: 1529.0795345888944. Previous root: 2579.563391246668

1111.1111060555554
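Timing code can also be kept out of the function body. The sketch below is one possible approach, assuming a hypothetical decorator named `log_execution_time`; it uses `time.perf_counter`, which is intended for measuring short durations, and a toy function `sum_of_squares` that exists only to have something to time.

```python
import functools
import logging
from time import perf_counter

logging.basicConfig(level=logging.INFO)


def log_execution_time(function):
    """Hypothetical decorator that logs how long each call to `function` takes."""
    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        start_time = perf_counter()
        try:
            return function(*args, **kwargs)
        finally:
            # finally runs after a normal return and after an exception alike
            elapsed = perf_counter() - start_time
            logging.info(' %s took %.6f seconds.', function.__name__, elapsed)
    return wrapper


@log_execution_time
def sum_of_squares(count):
    """Toy workload used only to have something to time."""
    return sum(number * number for number in range(count))


sum_of_squares(100_000)
```

The decorated function keeps its original name and docstring thanks to `functools.wraps`, and the timing logic can be reused on any other function.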

In [20]:
# We can set the logging level back to info to suppress the debug logs
logging.getLogger().setLevel(logging.INFO)

In [21]:
square_root_newton(1234567.89, tolerance=1E-12)

1111.1111060555554
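The examples above log through the root logger. In larger projects it is common to create a named logger per module, so that levels can be adjusted for one module at a time. The following is a minimal sketch of that pattern; `count_iterations` is a hypothetical function used only to emit messages.

```python
import logging

# A named logger; __name__ is the usual choice so messages show their origin.
logger = logging.getLogger(__name__)


def count_iterations(max_count):
    """Hypothetical loop used only to demonstrate the different log levels."""
    for iter_count in range(1, max_count + 1):
        # %-style arguments are only formatted if the message is emitted.
        logger.debug('Iteration count: %d', iter_count)
    logger.info('Finished %d iterations', max_count)
    return max_count


logging.basicConfig(level=logging.INFO)
logger.setLevel(logging.INFO)    # debug messages are suppressed
count_iterations(3)
logger.setLevel(logging.DEBUG)   # debug messages are now shown
count_iterations(3)
```

Setting the level on `logger` affects only this module's messages, whereas `logging.getLogger().setLevel(...)` changes the root logger that other loggers inherit from.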

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
    
<h2>Error Handling</h2>

<p>You should catch and handle errors as best as possible. Let's return to our get_results function.

In [22]:
config = configparser.ConfigParser()

In [23]:
def get_results():
    results_file_location = config['PATHS']['results']
    with open(results_file_location) as file:
        results = json.load(file)
    return results

In [24]:
get_results()

KeyError: 'PATHS'

In [25]:
def get_results():
    try:
        results_file_location = config['PATHS']['results']
        with open(results_file_location) as file:
            results = json.load(file)
    except KeyError:
        # We can add a more informative message to a key error
        raise KeyError('Check that config.ini file is loaded and contains [PATHS] and results!')
    return results

In [26]:
get_results()

KeyError: 'Check that config.ini file is loaded and contains [PATHS] and results!'

In [27]:
# We need to read in a file
config.read('resources/research_coding/config.ini')

['resources/research_coding/config.ini']

In [28]:
get_results()

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\john.doe\\Documents\\results.json'

In [29]:
def get_results():
    try:
        results_file_location = config['PATHS']['results']
        with open(results_file_location) as file:
            results = json.load(file)
    except KeyError:
        # We can add a more informative message to a key error
        raise KeyError('Check that config.ini file is loaded and contains [PATHS] and results!')
    except FileNotFoundError:
        # We don't have to raise an exception. Maybe we want to set the results to None if the file doesn't exist.
        logging.warning(f' Results file not found at {results_file_location}. Defaulting to None.')
        results = None
    return results

In [30]:
get_results()
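When re-raising with a clearer message, it can help to keep the original error attached. The sketch below is one possible variation on `get_results`, not the workshop's version: it takes the config as an argument, and `ResultsConfigError` is a hypothetical exception class introduced for illustration.

```python
import configparser
import json
import logging


class ResultsConfigError(Exception):
    """Hypothetical exception for a missing or incomplete configuration."""


def get_results(config):
    """Load results from the path stored under [PATHS] results in config."""
    try:
        results_file_location = config['PATHS']['results']
    except KeyError as error:
        # "from error" keeps the original KeyError attached to the traceback.
        raise ResultsConfigError(
            'Check that config.ini is loaded and contains [PATHS] and results!'
        ) from error

    try:
        with open(results_file_location) as file:
            return json.load(file)
    except FileNotFoundError:
        logging.warning(' Results file not found at %s. Defaulting to None.',
                        results_file_location)
        return None


empty_config = configparser.ConfigParser()
try:
    get_results(empty_config)
except ResultsConfigError as error:
    print(f'Caught: {error} (caused by {error.__cause__!r})')
```

Splitting the two `try` blocks keeps each handler next to the only statement that can raise the error it handles, and passing `config` in as an argument avoids relying on a global variable.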

