# Week 10 Readings

Every programming language enforces certain rules that a program must follow to run and produce the output we expect. These rules make up the **syntax** and **semantics** of the language. **_Syntax_** is the set of rules governing how symbols are structured and arranged, which determines whether a line of code is valid. **_Semantics_** is the meaning of those symbols: how they are interpreted and what the code does when executed. In simple terms, syntax is the "grammar" and semantics the "meaning" of a programming language.

Compared to many other programming languages, Python has a simple, elegant syntax that makes it user-friendly and easy to read. When code does not follow that syntax, Python refuses to run it and reports a syntax error. We have all experienced this at some point in our programming journey, but syntax errors are less of a concern today because IDEs parse the code as we type and highlight incorrect syntax. Essentially, when a Python script fails, it is because of either incorrect syntax or an "exception" raised while the code was running. Error handling matters because we don't want our program to crash on errors we know about and expect; instead, it should handle them gracefully and tell the user what went wrong. We saw some examples of exceptions in class, along with a demo of handling errors. Let's quickly recap what we learned using a different example and build on it step by step.
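To make the difference between a syntax error and an exception concrete, here is a minimal sketch. It uses the built-in `compile()` and `exec()` functions only so that both kinds of failure can be caught and shown in a single cell; the two code strings are made-up examples.

```python
# Hypothetical snippets, used only for illustration
bad_syntax = "print('hello'"        # closing parenthesis is missing
bad_runtime = "result = 10 / 0"     # valid syntax, fails only when executed

# A syntax error is detected while Python parses the code, before anything runs
try:
    compile(bad_syntax, "<example>", "exec")
except SyntaxError:
    syntax_report = "SyntaxError caught at parse time"

# This snippet parses without complaint; the exception appears only at run time
code_object = compile(bad_runtime, "<example>", "exec")
try:
    exec(code_object)
except ZeroDivisionError as err:
    runtime_report = f"{type(err).__name__} caught at run time: {err}"

print(syntax_report)
print(runtime_report)
```

In a normal script you never call `compile()` yourself: Python parses the whole file first, which is why a syntax error anywhere in a script stops it before the first line runs.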

To illustrate these ideas, let's assume we are developing an analysis tool whose first step is reading and validating a FASTA file. A few checks we can think of are whether

* the FASTA file exists at the given path
* header lines start with ">"
* header lines are followed by one or more lines of sequence
* the sequences contain only valid bases. Let's assume the sequences can contain only 'A', 'a', 'T', 't', 'C', 'c', 'G', 'g', 'N' and 'n'.

To follow along, you can either copy `Vibrio_cholerae_N16961.fna` to your current working directory or download the files attached to the discussion post. If the file does not exist, or we provide an incorrect path, Python cannot open it and raises an exception. Let's see what that exception is:

In [1]:
#try to read a non-existent file
with open("dummy.fasta") as fin:
    fin.readlines()
print("This doesn't print")

FileNotFoundError: [Errno 2] No such file or directory: 'dummy.fasta'

In [2]:
#give a fake path
with open("/fake/path/Vibrio_cholerae_N16961.fna") as fin:
    lines = fin.readlines()
    print(lines[:10])
print("This doesn't print")

FileNotFoundError: [Errno 2] No such file or directory: '/fake/path/Vibrio_cholerae_N16961.fna'

Both attempts raised a `FileNotFoundError` exception. As we already know, we can use `try` and `except` to handle it.

Now, if we try to read the FASTA file from an invalid path, the program should no longer stop with an error; instead, it prints our message, that is, it "handles the error". Notice that the print statement after the block now gets executed.

In [3]:
try:
    with open("/fake/path/Vibrio_cholerae_N16961.fna") as fin:
        lines = fin.readlines()
        print(lines[:10])
except FileNotFoundError:
    print("File does not exist")
print("This still prints")

File does not exist
This still prints
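`FileNotFoundError` is one of several subclasses of `OSError` that `open()` can raise. As a sketch (the helper name `try_open` and the temporary paths are placeholders, not part of the course files), adding a broader `except OSError` after the specific handler also covers cases such as passing a directory instead of a file:

```python
import os
import tempfile

def try_open(path):
    """Return a short description of what happened when opening `path`."""
    try:
        with open(path) as fin:
            fin.readline()
    except FileNotFoundError:
        return "File does not exist"
    except OSError as err:   # e.g. IsADirectoryError or PermissionError
        return f"Could not read file ({type(err).__name__})"
    return "File opened"

with tempfile.TemporaryDirectory() as tmp_dir:
    missing = try_open(os.path.join(tmp_dir, "dummy.fasta"))
    directory = try_open(tmp_dir)   # a directory, not a file

print(missing)
print(directory)
```

The order of the `except` clauses matters: Python uses the first one that matches, so the more specific `FileNotFoundError` has to come before its parent class `OSError`.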


If we give the name of a file that exists at the correct path:

In [4]:
try:
    with open("Vibrio_cholerae_N16961.fna") as fin:
        lines = fin.readlines()
        print(lines[:10])
except FileNotFoundError:
    print("File does not exist")
print("This prints")

['>NZ_CP028827.1 Vibrio cholerae strain N16961 chromosome 1, complete sequence\n', 'GTGTCATCTTCGCTATGGTTGCAATGTTTGCAACGGCTTCAGGAAGAGCTACCTGCCGCAGAATTCAGTATGTGGGTGCG\n', 'TCCGCTTCAAGCGGAGCTCAATGACAATACTCTCACTTTATTCGCCCCGAACCGCTTTGTGTTGGATTGGGTACGCGATA\n', 'AGTACCTCAATAACATCAATCGTCTGCTGATGGAATTCAGTGGCAATGATGTGCCTAATTTGCGCTTTGAAGTGGGGAGC\n', 'CGCCCTGTGGTGGCGCCAAAACCCGCGCCTGTACGTACGGCTGCGGATGTCGCGGCGGAATCGTCGGCGCCTGCGCAATT\n', 'GGCGCAGCGTAAACCTATCCATAAAACCTGGGATGATGACAGTGCTGCGGCTGATATTACTCACCGCTCAAATGTGAACC\n', 'CGAAACACAAGTTCAACAACTTCGTGGAAGGTAAATCTAACCAGTTAGGTCTGGCCGCGGCTCGCCAAGTCTCTGATAAC\n', 'CCAGGTGCGGCGTATAACCCCCTCTTTTTGTATGGCGGCACCGGTTTGGGTAAAACGCACTTGCTGCATGCGGTGGGTAA\n', 'CGCGATTGTTGATAACAACCCGAACGCTAAAGTGGTGTACATGCACTCTGAGCGTTTCGTGCAAGACATGGTAAAAGCCC\n', 'TGCAGAACAACGCGATTGAAGAATTCAAACGCTACTATCGCAGTGTAGATGCCTTGTTGATCGACGATATTCAATTCTTT\n']
This prints


So far, we have used `if` and `else` statements to parse file contents, though not to validate a FASTA file. Let us now use conditionals to validate the content of the FASTA file:

In [5]:
#set containing valid bases
bases = {'A', 'a', 'T', 't', 'C', 'c', 'G', 'g', 'N', 'n'}
header_flag = False
validbase_flag = True

try:
    with open("Vibrio_cholerae_N16961.fna") as fin:
        lines = fin.readlines()
        for line in lines:
            line = line.strip()
            if line.startswith(">"): #header line
                header_flag = True
            else:                    #sequences
                if not header_flag:  #sequence line before header
                    break
                if not set(line).issubset(bases): #check for valid bases
                    validbase_flag = False
                    break

    if header_flag and validbase_flag:
        print("Valid FASTA")
    elif not header_flag:
        print("Invalid FASTA, no header line")
    elif not validbase_flag:
        print("Invalid FASTA, has invalid bases")
            
except FileNotFoundError:
    print("File does not exist")

Valid FASTA


If we instead give a FASTA file with, for example, invalid bases:

In [6]:
#set containing valid bases
bases = {'A', 'a', 'T', 't', 'C', 'c', 'G', 'g', 'N', 'n'}
header_flag = False
validbase_flag = True

try:
    with open("invalid_base.fna") as fin:
        lines = fin.readlines()
        for line in lines:
            line = line.strip()
            if line.startswith(">"): #header line
                header_flag = True
            else:                    #sequences
                if not header_flag:  #sequence line before header
                    break
                if not set(line).issubset(bases): #check for valid bases
                    validbase_flag = False
                    break

    if header_flag and validbase_flag:
        print("Valid FASTA")
    elif not header_flag:
        print("Invalid FASTA, no header line")
    elif not validbase_flag:
        print("Invalid FASTA, has invalid bases")
            
except FileNotFoundError:
    print("File does not exist")

Invalid FASTA, has invalid bases
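If the course files are not at hand, tiny stand-ins are enough to exercise both failure modes. The sketch below writes two such files into a scratch directory; the file names match the ones used in this reading, but the contents are invented for illustration and are not the real course files.

```python
import os
import tempfile

# Made-up contents for illustration; the real course files may differ
stand_ins = {
    "invalid_base.fna": ">seq1 hypothetical record\nACGTNNACGT\nACGTXACGT\n",
    "invalid_header.fna": "ACGTACGT\n>seq1 header appears too late\nACGT\n",
}

work_dir = tempfile.mkdtemp()   # scratch directory, so nothing gets overwritten
paths = {}
for name, content in stand_ins.items():
    paths[name] = os.path.join(work_dir, name)
    with open(paths[name], "w") as fout:
        fout.write(content)

print(paths["invalid_base.fna"])
print(paths["invalid_header.fna"])
```

Passing one of the printed paths to `open()` in the validation cells should then trigger the corresponding message.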


There are a couple of issues with the above code, not technical ones but matters of design. It is clumsy and hard to read: multiple conditional statements validate the FASTA contents and set boolean flags, and a second `if`-`else` block later evaluates those flags to decide whether the file is valid. Moreover, as we add more checks (for example, if we decide to allow newline characters), the code gets more complicated, because each check needs another boolean flag and another conditional to test it before we can decide if the file is valid.

So, is there a better way to do it? Yes, and that is using `try`-`except`! But we did not get any error, so why use `try`-`except`? The answer lies in two different approaches to handling errors:
* Think about all the scenarios in which our code might fail (here, an extensive list of checks to validate the FASTA file) and write conditionals to handle each of them.
* Let the errors happen and deal with them afterwards using `try`-`except`.

In the code above, we followed the first approach. Let's rewrite the code using the second approach and see what difference it makes:

In [7]:
#set containing valid bases
bases = {'A', 'a', 'T', 't', 'C', 'c', 'G', 'g', 'N', 'n'}
header_flag = False

try:
    with open("invalid_header.fna") as fin:
        lines = fin.readlines()
        for line in lines:
            line = line.strip()
            if line.startswith(">"): #header line
                header_flag = True
            else:                    #sequences
                if not header_flag:  #sequence line before header
                    raise ValueError("Invalid FASTA, no header line")
                if not set(line).issubset(bases): #invalid bases
                    raise ValueError("Invalid FASTA, has invalid bases")
            
except FileNotFoundError:
    print("File does not exist")
except ValueError as value_error:
    print(value_error)

Invalid FASTA, no header line


In [8]:
#set containing valid bases
bases = {'A', 'a', 'T', 't', 'C', 'c', 'G', 'g', 'N', 'n'}
header_flag = False

try:
    with open("invalid_base.fna") as fin:
        lines = fin.readlines()
        for line in lines:
            line = line.strip()
            if line.startswith(">"): #header line
                header_flag = True
            else:                    #sequences
                if not header_flag:  #sequence line before header
                    raise ValueError("Invalid FASTA, no header line")
                if not set(line).issubset(bases): #invalid bases
                    raise ValueError("Invalid FASTA, has invalid bases")
            
except FileNotFoundError:
    print("File does not exist")
except ValueError as value_error:
    print(value_error)

Invalid FASTA, has invalid bases


As we can see, the code is more concise and readable than when we just used conditional statements. Moreover, we were able to completely get rid of the `if`-`else` block for checking the flags to validate the FASTA contents.
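The list of checks at the start of this reading also asked that every header be followed by at least one sequence line, which the cells above do not test. One possible way to add that check in the same style is sketched below. The function name `validate_fasta_lines`, the `expecting_sequence` flag, the decision to skip blank lines, and the sample records are all illustrative assumptions.

```python
VALID_BASES = set("ATCGNatcgn")

def validate_fasta_lines(lines):
    """Raise ValueError on the first problem found; return None otherwise."""
    seen_header = False
    expecting_sequence = False          # True right after a header line
    for line in lines:
        line = line.strip()
        if not line:                    # assumption: blank lines are ignored
            continue
        if line.startswith(">"):
            if expecting_sequence:      # two headers in a row
                raise ValueError("Invalid FASTA, header without sequence")
            seen_header = True
            expecting_sequence = True
        else:
            if not seen_header:
                raise ValueError("Invalid FASTA, no header line")
            if not set(line).issubset(VALID_BASES):
                raise ValueError("Invalid FASTA, has invalid bases")
            expecting_sequence = False
    if not seen_header:
        raise ValueError("Invalid FASTA, no header line")
    if expecting_sequence:              # file ended right after a header
        raise ValueError("Invalid FASTA, header without sequence")

# Made-up records for illustration
samples = {
    "good": [">seq1\n", "ACGT\n", ">seq2\n", "NNAC\n"],
    "empty record": [">seq1\n", ">seq2\n", "ACGT\n"],
    "bad base": [">seq1\n", "ACGX\n"],
}
reports = {}
for label, sample_lines in samples.items():
    try:
        validate_fasta_lines(sample_lines)
        reports[label] = "Valid FASTA"
    except ValueError as value_error:
        reports[label] = str(value_error)
    print(f"{label}: {reports[label]}")
```

Adding the new check took one extra `raise` per failure case and no additional flag-inspection block, which is the readability benefit described above.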

In fact, programmers have names for these two approaches: the first, using conditionals, is called "Look Before You Leap" (LBYL), and the second, using `try`-`except`, goes by the name "Easier to Ask for Forgiveness than Permission" (EAFP). The [Python documentation](https://docs.python.org/3.6/glossary.html) defines the two strategies as follows:

>  Look Before You Leap: This coding style explicitly tests for pre-conditions before making calls or lookups. This style contrasts with the EAFP approach and is characterized by the presence of many if statements.


> Easier to ask for forgiveness than permission: This common Python coding style assumes the existence of valid keys or attributes and catches exceptions if the assumption proves false. This clean and fast style is characterized by the presence of many try and except statements. The technique contrasts with the LBYL style common to many other languages such as C.
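The "valid keys" wording in the glossary refers to situations like a dictionary lookup. A minimal side-by-side sketch, using a small made-up complement table and hypothetical function names:

```python
# Made-up lookup table for illustration
complement = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement_lbyl(base):
    # LBYL: test the pre-condition first, then do the lookup
    if base in complement:
        return complement[base]
    return "N"

def complement_eafp(base):
    # EAFP: attempt the lookup and handle the failure if it happens
    try:
        return complement[base]
    except KeyError:
        return "N"

print(complement_lbyl("A"), complement_eafp("A"))   # T T
print(complement_lbyl("X"), complement_eafp("X"))   # N N
```

Both functions return the same results; they differ only in whether the check happens before the lookup or as a reaction to its failure.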

The next obvious question is how to decide between `if`-`else` and `try`-`except`. Is one recommended over the other? A popular opinion is that EAFP is the more "Pythonic" way. However, Python is flexible and lets the developer choose either, depending on the situation. Some considerations for deciding which approach is better:
* Frequency of errors - How often do we expect to see the errors? If they are very common, handle them with conditional statements. If they are rare, use `try`-`except`. The reason is that raising and catching an exception has a cost, as the performance point below explains.
* Readability - As we just saw, EAFP generally makes code more readable than LBYL.
* Performance and optimization - A main reason EAFP is considered "Pythonic" is that exception handling in Python is fast and efficient. From [version 3.11](https://docs.python.org/3.11/whatsnew/3.11.html#misc) onwards, Python implements "zero-cost" exceptions, which means that a `try` statement costs almost nothing when no exception is raised. You can check the time comparisons of using `if`-`else` and `try`-`except` [here](https://stackoverflow.com/a/1835844).

In addition, Python has the necessary checks for potential problems built into the language itself and comes with various [built-in exceptions](https://docs.python.org/3/library/exceptions.html). We can also create [custom exceptions](https://martinxpn.medium.com/custom-exceptions-in-python-creating-custom-exceptions-59-100-days-of-python-4f26de8e851d) if Python does not have a built-in exception for our use case. If you are unfamiliar with object-oriented Python, the syntax in the linked article might look unfamiliar. You can revisit this section once object-oriented Python is covered in the coming few weeks.
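For a first impression of what a custom exception looks like, here is a minimal sketch. The names `InvalidFastaError` and `check_bases` are hypothetical; the only object-oriented part is the `class` line, which says that the new exception is a special kind of `ValueError`.

```python
class InvalidFastaError(ValueError):
    """Raised when FASTA content fails validation (hypothetical name)."""

def check_bases(line, valid_bases=frozenset("ATCGNatcgn")):
    # Collect every character that is not an allowed base
    unexpected = set(line) - valid_bases
    if unexpected:
        raise InvalidFastaError(f"has invalid bases: {sorted(unexpected)}")

try:
    check_bases("ACGTXACGTZ")
except InvalidFastaError as fasta_error:
    message = f"Invalid FASTA: {fasta_error}"
    print(message)   # Invalid FASTA: has invalid bases: ['X', 'Z']
```

Because `InvalidFastaError` derives from `ValueError`, an existing `except ValueError` handler would still catch it, so introducing the custom type does not break code written against the built-in one.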

There are some great examples for each of these considerations and more [here](https://realpython.com/python-lbyl-vs-eafp/#the-lbyl-and-eafp-coding-styles-in-python).
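To get a feel for the cost consideration on your own machine, the standard-library `timeit` module can time both styles when the error is absent and when it occurs. This is only a rough sketch with a made-up lookup table; the numbers depend on the Python version and the hardware, so no output is shown.

```python
import timeit

lookup = {"A": "T"}   # made-up table for illustration

def lbyl(key):
    if key in lookup:
        return lookup[key]
    return None

def eafp(key):
    try:
        return lookup[key]
    except KeyError:
        return None

runs = 100_000
timings = {
    "LBYL, key present": timeit.timeit(lambda: lbyl("A"), number=runs),
    "EAFP, key present": timeit.timeit(lambda: eafp("A"), number=runs),
    "LBYL, key missing": timeit.timeit(lambda: lbyl("X"), number=runs),
    "EAFP, key missing": timeit.timeit(lambda: eafp("X"), number=runs),
}
for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```

On most setups one would expect the "EAFP, key missing" case to be the slowest, since that is the only one in which an exception is actually raised and caught. That matches the rule of thumb above: prefer `try`-`except` when failures are rare.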

With that, let us return to our code and give it some finishing touches. As we briefly saw in class, we can use `else` and `finally` blocks with `try`-`except`. Statements in the `else` block are executed only if the `try` block does not raise an exception. Statements in the `finally` block, on the other hand, are executed whether or not an exception occurred in the `try` block. `finally` is useful when we want to log some information or clean up resources (for example, closing a file).

In [9]:
#set containing valid bases
bases = {'A', 'a', 'T', 't', 'C', 'c', 'G', 'g', 'N', 'n'}
header_flag = False

try:
    with open("Vibrio_cholerae_N16961.fna") as fin:
        lines = fin.readlines()
        for line in lines:
            line = line.strip()
            if line.startswith(">"): #header line
                header_flag = True
            else:                    #sequences
                if not header_flag:  #sequence line before header
                    raise ValueError("no header line")
                if not set(line).issubset(bases): #invalid bases
                    raise ValueError("has invalid bases")
            
except FileNotFoundError:
    print("FASTA file does not exist")
except ValueError as value_error:
    print(f"Invalid FASTA: {value_error}")
else:
    print("FASTA file is valid")
finally:
    print("FASTA file validation completed")

FASTA file is valid
FASTA file validation completed


Before we move on to the next topic, let us quickly touch upon another exception, `AssertionError`. It deserves a special mention because it should generally not be handled using `try`-`except`. Python raises `AssertionError` when a condition checked with the `assert` keyword turns out to be false. `assert` performs a sanity check and is a commonly used debugging and testing tool: if the condition is true, it does nothing and execution proceeds; if the condition is false, it raises an `AssertionError`. A failed assertion therefore signals a bug in the program rather than an error we expect at run time, so the right response is to fix the code, not to catch the exception with `except`. Let's look at a simple example to understand `assert`. The code below checks whether a number is even: it continues if the number is even and fails if it is odd:

In [10]:
num = 14
assert num % 2 == 0, f"Number is odd. ({num = })"
print("Success!")

Success!


In [11]:
num = 15
assert num % 2 == 0, f"Number is odd. ({num = })"
print("Success!")

AssertionError: Number is odd. (num = 15)

Notice that we also added a custom error message using an f-string; the `{num = }` form (available from Python 3.8) prints both the variable name and its value. You can read more about `assert` [here](https://dbader.org/blog/python-assert-tutorial).
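One more reason to keep `assert` for debugging only: Python removes `assert` statements entirely when it runs in optimized mode (`python -O`). The sketch below imitates that mode inside a single cell by passing `optimize=1` to the built-in `compile()`; the source string and variable names are made up for the demonstration.

```python
# Made-up snippet: the second line only runs if the assert does not stop it
source = "assert num % 2 == 0, 'Number is odd.'\noutcome = 'assert did not stop the program'"

# optimize=0 keeps assert statements, as a plain `python` run does
namespace = {"num": 15}
try:
    exec(compile(source, "<assert-demo>", "exec", optimize=0), namespace)
    kept = namespace["outcome"]
except AssertionError as err:
    kept = f"AssertionError: {err}"

# optimize=1 mimics `python -O`: assert statements are dropped at compile time
namespace = {"num": 15}
exec(compile(source, "<assert-demo>", "exec", optimize=1), namespace)
stripped = namespace["outcome"]

print(kept)       # AssertionError: Number is odd.
print(stripped)   # assert did not stop the program
```

This is why checks on user input, such as the FASTA validation above, are better written with `if` and `raise ValueError(...)` than with `assert`.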

## `logging` package

Continuing with our assumption of developing a tool from the first section, a good tool should have a good logging mechanism to log execution details, errors and useful information to help debug in case of errors. Comparing the error messages of Bash with Python should give you a fair idea of the importance of good error messages.

The `logging` module in Python's standard library has the tools required to implement a comprehensive logging mechanism. Its main component is the "logger", which lets us control what to log, at what level of detail, and where to store or send the logs. The default logger is called `root`. The module supports logging at five different levels:
* `DEBUG` - Detailed information to diagnose a problem, typically only of interest to a developer.
* `INFO` - Confirmation that things are working as expected.
* `WARNING` - An indication that something unexpected happened, but the program still executes.
* `ERROR` - Some functions of the program failed due to an error/bug.
* `CRITICAL` - A serious error, indicating that the program itself may be unable to continue running.

Note that by default only messages at the WARNING, ERROR and CRITICAL levels are logged (this can be changed, as we'll see in a later example). An example will make this clear:


In [12]:
import logging

logging.debug("This is for debugging")
logging.info("This is informational")
logging.warning("This is a warning")
logging.error("This is an error")
logging.critical("This is critical")

WARNING:root:This is a warning
ERROR:root:This is an error
CRITICAL:root:This is critical
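The filtering works because each level is just an integer, and a logger ignores messages whose level is below its threshold. A small sketch, using a separate logger with the hypothetical name `level_demo` so that the root logger is left untouched:

```python
import logging

# Each level name maps to a number; higher means more severe
level_values = {name: getattr(logging, name)
                for name in ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")}
for name, value in level_values.items():
    print(f"{name:<8} = {value}")

demo_logger = logging.getLogger("level_demo")   # hypothetical logger name
demo_logger.setLevel(logging.WARNING)           # same threshold as the root default

info_enabled = demo_logger.isEnabledFor(logging.INFO)
error_enabled = demo_logger.isEnabledFor(logging.ERROR)
print(info_enabled)    # False
print(error_enabled)   # True
```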


By default, the output shows the log level, followed by the logger name and the log message. We can add more details to make our logs useful. The `basicConfig()` function lets us modify the default behavior of the root logger. Let us use it to replace the logger name with a more useful attribute, such as the date and time of logging. All supported attributes are listed [here](https://docs.python.org/3/library/logging.html#logrecord-attributes).

NOTE: Calling `basicConfig()` only has an effect if the root logger has not been configured yet. The module-level logging functions (`logging.debug()`, `logging.error()` and so on) automatically call `basicConfig()` without arguments when the root logger has no handlers. Therefore, when we ran the logging functions above, `basicConfig()` was already executed with default arguments. To reconfigure it and run the cell below, you'll need to restart the notebook's kernel.
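If restarting the kernel is inconvenient, `basicConfig()` also accepts `force=True` (added in Python 3.8), which removes the root logger's existing handlers before applying the new configuration. A minimal sketch, which changes the log level rather than the format only because the level is easy to inspect:

```python
import logging

root = logging.getLogger()

# Start from a known state for the demonstration
logging.basicConfig(level=logging.WARNING, force=True)

# Ignored: the root logger already has a handler
logging.basicConfig(level=logging.DEBUG)
level_after_plain_call = root.level

# Applied: force=True discards the old handlers first
logging.basicConfig(level=logging.DEBUG, force=True)
level_after_forced_call = root.level

print(logging.getLevelName(level_after_plain_call))    # WARNING
print(logging.getLevelName(level_after_forced_call))   # DEBUG
```

Note that running this changes the root logger for the rest of the session, so later cells will log at the DEBUG level until it is reconfigured.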

In [1]:
import logging
logging.basicConfig(
     format="%(asctime)s - %(levelname)s - %(message)s",
     datefmt="%Y-%m-%d %H:%M" )

logging.error("This is an error")

2024-10-25 17:16 - ERROR - This is an error


You can write logs to a file instead of printing them to the terminal (similar to output redirection using `>` in Bash) by providing the file name in `basicConfig()`.

RESTART KERNEL

In [1]:
import logging
logging.basicConfig(
     format="%(asctime)s - %(levelname)s - %(message)s",
     datefmt="%Y-%m-%d %H:%M",
     filename="error.log",
     filemode="a") #open file in append mode

logging.error("This is an error written to file")

In [2]:
!cat error.log

2024-10-25 17:28 - ERROR - This is an error written to file
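`basicConfig()` with `filename` sends everything to the file only. When a tool should write to the terminal and to a file at the same time, one option is to attach two handlers to a logger. The sketch below assumes a hypothetical logger name `fasta_tool` and writes to a temporary location instead of `error.log`.

```python
import logging
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "validation.log")   # placeholder location

formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s",
                              datefmt="%Y-%m-%d %H:%M")

file_handler = logging.FileHandler(log_path, mode="a")   # append, like filemode="a"
file_handler.setFormatter(formatter)
console_handler = logging.StreamHandler()                # writes to stderr by default
console_handler.setFormatter(formatter)

tool_logger = logging.getLogger("fasta_tool")            # hypothetical logger name
tool_logger.setLevel(logging.INFO)
tool_logger.addHandler(file_handler)
tool_logger.addHandler(console_handler)

tool_logger.error("This is an error sent to both destinations")

# Detach and close the handlers so that re-running the cell does not duplicate output
for handler in (file_handler, console_handler):
    tool_logger.removeHandler(handler)
    handler.close()

with open(log_path) as log_file:
    log_text = log_file.read()
print(log_text)
```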


Now that we have some understanding of the `logging` module, let us apply it to the FASTA file validation example we were working on. First, let us test with a non-existent FASTA file. Setting `exc_info = True` captures the stack trace and includes it in the log message.

RESTART KERNEL

In [1]:
import logging

#set containing valid bases
bases = {'A', 'a', 'T', 't', 'C', 'c', 'G', 'g', 'N', 'n'}
header_flag = False

logging.basicConfig(
     format="%(asctime)s - %(levelname)s - %(message)s",
     datefmt="%Y-%m-%d %H:%M",
     level=logging.INFO) #enables INFO level logging

logging.info("Reading FASTA file validation")

try:
    with open("dummy.fna") as fin:
        lines = fin.readlines()
        for line in lines:
            line = line.strip()
            if line.startswith(">"): #header line
                header_flag = True
            else:                    #sequences
                if not header_flag:  #sequence line before header
                    raise ValueError("no header line")
                if not set(line).issubset(bases): #invalid bases
                    raise ValueError("has invalid bases")
            
except FileNotFoundError:
    logging.error("FASTA file does not exist", exc_info = True)
except ValueError as value_error:
    logging.error(f"Invalid FASTA: {value_error}")
else:
    logging.info("FASTA file is valid")
finally:
    logging.info("FASTA file validation completed")

2024-10-25 17:55 - INFO - Reading FASTA file validation
2024-10-25 17:55 - ERROR - FASTA file does not exist
Traceback (most recent call last):
  File "/tmp/ipykernel_368358/56002547.py", line 15, in <module>
    with open("dummy.fna") as fin:
         ^^^^^^^^^^^^^^^^^
  File "/home/anagha/mambaforge/envs/juppy312/lib/python3.12/site-packages/IPython/core/interactiveshell.py", line 286, in _modified_open
    return io_open(file, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'dummy.fna'
2024-10-25 17:55 - INFO - FASTA file validation completed
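Inside an `except` block, `logging.exception()` is a shorthand for `logging.error(..., exc_info=True)`. The sketch below uses the equivalent method on a separate logger (hypothetical name `exception_demo`) and captures the text in memory so it can be inspected; the missing file lives in a fresh temporary directory to guarantee that it does not exist.

```python
import io
import logging
import os
import tempfile

buffer = io.StringIO()                              # capture log text for inspection
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))

demo_logger = logging.getLogger("exception_demo")   # hypothetical logger name
demo_logger.addHandler(handler)
demo_logger.propagate = False                       # keep the root logger out of this demo

missing_path = os.path.join(tempfile.mkdtemp(), "dummy.fna")
try:
    with open(missing_path) as fin:
        fin.readlines()
except FileNotFoundError:
    # Logs at ERROR level and appends the traceback automatically
    demo_logger.exception("FASTA file does not exist")

demo_logger.removeHandler(handler)
captured = buffer.getvalue()
print(captured)
```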


In [2]:
import logging

#set containing valid bases
bases = {'A', 'a', 'T', 't', 'C', 'c', 'G', 'g', 'N', 'n'}
header_flag = False

logging.basicConfig(
     format="%(asctime)s - %(levelname)s - %(message)s",
     datefmt="%Y-%m-%d %H:%M",
     level=logging.INFO) #enables INFO level logging

logging.info("Reading FASTA file validation")

try:
    with open("invalid_header.fna") as fin:
        lines = fin.readlines()
        for line in lines:
            line = line.strip()
            if line.startswith(">"): #header line
                header_flag = True
            else:                    #sequences
                if not header_flag:  #sequence line before header
                    raise ValueError("no header line")
                if not set(line).issubset(bases): #invalid bases
                    raise ValueError("has invalid bases")
            
except FileNotFoundError:
    logging.error("FASTA file does not exist", exc_info = True)
except ValueError as value_error:
    logging.error(f"Invalid FASTA: {value_error}")
else:
    logging.info("FASTA file is valid")
finally:
    logging.info("FASTA file validation completed")

2024-10-25 17:56 - INFO - Reading FASTA file validation
2024-10-25 17:56 - ERROR - Invalid FASTA: no header line
2024-10-25 17:56 - INFO - FASTA file validation completed


In [3]:
import logging

#set containing valid bases
bases = {'A', 'a', 'T', 't', 'C', 'c', 'G', 'g', 'N', 'n'}
header_flag = False

logging.basicConfig(
     format="%(asctime)s - %(levelname)s - %(message)s",
     datefmt="%Y-%m-%d %H:%M",
     level=logging.INFO) #enables INFO level logging

logging.info("Reading FASTA file validation")

try:
    with open("Vibrio_cholerae_N16961.fna") as fin:
        lines = fin.readlines()
        for line in lines:
            line = line.strip()
            if line.startswith(">"): #header line
                header_flag = True
            else:                    #sequences
                if not header_flag:  #sequence line before header
                    raise ValueError("no header line")
                if not set(line).issubset(bases): #invalid bases
                    raise ValueError("has invalid bases")
            
except FileNotFoundError:
    logging.error("FASTA file does not exist", exc_info = True)
except ValueError as value_error:
    logging.error(f"Invalid FASTA: {value_error}")
else:
    logging.info("FASTA file is valid")
finally:
    logging.info("FASTA file validation completed")

2024-10-25 17:57 - INFO - Reading FASTA file validation
2024-10-25 17:57 - INFO - FASTA file is valid
2024-10-25 17:57 - INFO - FASTA file validation completed
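The three cells above differ only in the file name, so in a real tool the logic would typically live in a function. A possible sketch follows; the function name `validate_fasta`, the logger name `fasta_validator`, the log message wording and the tiny test file are assumptions made for illustration.

```python
import logging
import os
import tempfile

logger = logging.getLogger("fasta_validator")   # hypothetical logger name
VALID_BASES = set("ATCGNatcgn")

def validate_fasta(path):
    """Log the outcome of validating `path`; return True only if it is valid."""
    logger.info("Starting FASTA file validation: %s", path)
    header_flag = False
    try:
        with open(path) as fin:
            for line in fin:                     # read line by line, not all at once
                line = line.strip()
                if line.startswith(">"):         # header line
                    header_flag = True
                elif not header_flag:            # sequence line before header
                    raise ValueError("no header line")
                elif not set(line).issubset(VALID_BASES):
                    raise ValueError("has invalid bases")
    except FileNotFoundError:
        logger.error("FASTA file does not exist", exc_info=True)
        return False
    except ValueError as value_error:
        logger.error("Invalid FASTA: %s", value_error)
        return False
    else:
        logger.info("FASTA file is valid")
        return True
    finally:
        logger.info("FASTA file validation completed")

# Made-up files for a quick check
scratch = tempfile.mkdtemp()
good = os.path.join(scratch, "good.fna")
with open(good, "w") as fout:
    fout.write(">seq1\nACGTN\n")

results = [validate_fasta(good), validate_fasta(os.path.join(scratch, "missing.fna"))]
print(results)   # [True, False]
```

The INFO messages appear only if logging has been configured at the INFO level, as in the cells above; the `finally` block still runs even though the other branches return early.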


Even though it is possible to report information and errors with `print()` statements, hopefully you are now convinced that the `logging` module offers far more: log levels, timestamps, stack traces and output to files, all of which make your tool more user-friendly and informative. For more details about the `logging` module, refer to the [Python documentation](https://docs.python.org/3/library/logging.html#).