In [1]:
%%html
<style>
h1,h2,h3 {
    text-align: center;
}

.term {
    text-align: center;
    margin-top: 1em;
    margin-bottom: 1em;
}

.organizers {
    text-align: center;
    margin-left: 20%;
    margin-right: 20%;
    margin-bottom: 1em;
}

.presenter {
    text-decoration: underline;
}
</style>


# Python Programming for Machine Learning

<div class="term">Summer Term 2025</div>

<div class="organizers">
    <span class="presenter">Johannes Maeß</span>
</div>

<center><img src='images/python-logo-only.svg' width=250> </center>

## Python Modules

### Organizing code in modules

- Functions and variables defined in an interactive Python session are temporary and are lost when the interpreter exits.
- For longer programs, it is better to write the code in a text editor and run the file as a **script**.

- Python provides a way to organize code into reusable units called **modules**:
    - Modules are files containing definitions and statements that can be imported and used in scripts or interactive sessions. 
    - They help organize code and promote reusability. 
    - You create modules by saving Python code in files with a `.py` extension.
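The idea can be tried out end to end with a minimal sketch. The module name `demo_module` and its contents are made up for illustration, and the file is written to a temporary directory so that nothing in the working directory is touched.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module contents, chosen only for illustration.
module_source = (
    'GREETING = "hello"\n'
    "\n"
    "def shout(text):\n"
    "    return text.upper()\n"
)

# Saving Python code in a file with a .py extension creates a module.
tmp_dir = tempfile.mkdtemp()
Path(tmp_dir, "demo_module.py").write_text(module_source)

# Make the directory importable, then import the module by its file name.
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()
demo_module = importlib.import_module("demo_module")

result = demo_module.shout(demo_module.GREETING)
print(result)
```

In everyday use the file would simply live next to the script or notebook, and a plain `import demo_module` would be enough.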

### Single-file modules

- Our current working directory contains several files with a `.py` extension.

In [2]:
!tree -L 1

.
├── PyML-3.ipynb
├── calculator.py
├── images
├── my_linalg_project
├── scripts
├── vec_ops.py
└── vec_ops_n_script.py

4 directories, 4 files


- For example, the file `vec_ops.py` is a module containing simple vector operations.

- Print the contents of the `vec_ops.py` module.

In [3]:
pycat vec_ops.py

# vector_operations.py
import numpy as np

def _shape_check(vector1:np.array, vector2:np.array) -> None:
    if vector1.shape != vector2.shape:
        raise ValueError("Vectors must have the same shape")


def inner_product(vector1:np.array, vector2:np.array) -> float:
    _shape_check(vector1, vector2)
    return vector1 @ vector2


def elementwise_add(vector1:np.array, vector2:np.array) -> np.array:
    _shape_check(vector1, vector2)
    return vector1 + vector2


def elementwise_multiply(vector1:np.array, vector2:np.array) -> np.array:
    _shape_check(vector1, vector2)
    return vector1 * vector2


### Importing single-file modules

- To use a module, you import it into your script or session using the `import` keyword followed by the module name. 

In [4]:
import vec_ops

- **Note**: the functions defined in the `vec_ops` module are not added to the global scope

- Functions and variables defined in the module must be accessed using **dot notation**

In [5]:
import numpy as np

In [6]:
vector1 = np.array([1.0, 2.0, 3.0, 4.0])
vector2 = np.array([5.0, 6.0, 7.0, 8.0])

In [7]:
vec_ops.inner_product(vector1, vector2)

np.float64(70.0)

In [8]:
vec_ops.elementwise_add(vector1, vector2)

array([ 6.,  8., 10., 12.])

### Importing specific objects from modules

- However, one can also directly import functions and other objects.
- This adds the specific functions to the current (global) scope

In [9]:
from vec_ops import inner_product, elementwise_add

In [10]:
inner_product(vector1, vector2)

np.float64(70.0)

In [11]:
elementwise_add(vector1, vector2)

array([ 6.,  8., 10., 12.])

### Importing all objects from a module (NEVER DO THIS)

- With `from module import *`, it is possible to import all names of a module except those starting with an underscore (these are considered **private** by convention).
    - However, **this way of importing is STRONGLY discouraged** as it can lead to namespace pollution and confusion.

In [12]:
# commented out, because this is not cool!
# from vec_ops import *
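To make the namespace-pollution problem concrete, here is a small sketch using two standard-library modules that both define `sqrt`. A plain dictionary stands in for the global scope so that the star imports do not affect the current session; the variable names are made up for illustration.

```python
# A plain dict stands in for the global scope of a script or notebook.
namespace = {}

# The first star import brings in math.sqrt (among many other names).
exec("from math import *", namespace)
sqrt_after_math = namespace["sqrt"]

# The second star import silently replaces it with cmath.sqrt.
exec("from cmath import *", namespace)
sqrt_after_cmath = namespace["sqrt"]

print(sqrt_after_math.__module__)
print(sqrt_after_cmath.__module__)
print(sqrt_after_math is sqrt_after_cmath)
```

After the second import, the name `sqrt` refers to a different function than before, without any warning. Explicit imports such as `import math` or `from math import sqrt` avoid this kind of surprise.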

### Aliased imports

- If the module name is followed by `as`, then the name following `as` is bound directly to the imported module.

- This also works with `from` imports

In [13]:
import vec_ops as vector_operations
from vec_ops import elementwise_add as eadd

vector_operations.elementwise_add(vector1, vector2)
eadd(vector1, vector2)

array([ 6.,  8., 10., 12.])

### The `__main__` check

- Running a script executes all code in the file.
- To ensure that certain code runs only when the file is executed directly (and not when it is imported), we can check whether `__name__` (the current module name) is equal to `"__main__"`.

<!--  vec_ops_n_script.py -->
```
if __name__=="__main__":
    import sys 

    if len(sys.argv) != 3:
        print("Usage: python script.py vector1 vector2")
        sys.exit(1)

    try:
        vector1 = np.array([float(x) for x in sys.argv[1].split(',')])
        vector2 = np.array([float(x) for x in sys.argv[2].split(',')])
    except ValueError:
        print("Vectors must contain only numerical values separated by commas")
        sys.exit(1)

    print("Vector 1:", vector1)
    print("Vector 2:", vector2)

    print("Inner Product:", inner_product(vector1, vector2))
    print("Elementwise Addition:", elementwise_add(vector1, vector2))
    print("Elementwise Multiplication:", elementwise_multiply(vector1, vector2))
```

### The `__main__`: running vs. import

- Running the file will execute the code below the check (i.e., `if __name__ =="__main__":`)

In [14]:
!python vec_ops_n_script.py 1,2,3,4 5,6,7,8

Vector 1: [1. 2. 3. 4.]
Vector 2: [5. 6. 7. 8.]
Inner Product: 70.0
Elementwise Addition: [ 6.  8. 10. 12.]
Elementwise Multiplication: [ 5. 12. 21. 32.]


- However, when the file is imported, `__name__` is set to the module's name (here `vec_ops_n_script`) instead of `"__main__"`, so the code below the condition is not executed.

In [15]:
import vec_ops_n_script

vec_ops_n_script.elementwise_add(vector1, vector2)

array([ 6.,  8., 10., 12.])
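The value of `__name__` can also be inspected directly. The following sketch uses the standard-library module `math` as the imported module; the helper `describe_context` is hypothetical and only serves to spell out the decision that the check makes.

```python
import math

# An imported module reports its own name.
imported_name = math.__name__


def describe_context(module_name):
    """Return how a module with the given __name__ was started."""
    if module_name == "__main__":
        return "run as a script"
    return f"imported as module '{module_name}'"


print(describe_context(imported_name))
# __name__ of the current file; it is "__main__" when run directly.
print(describe_context(__name__))
```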

### Built-in modules

- The Python standard library comprises a vast collection of modules.
- It offers a wide range of functionality:
  - Handling common tasks like file I/O and regular expressions.
  - Specialized tasks like cryptography and networking.

| **Category**                     | **Modules and Descriptions**                                                                            |
|----------------------------------|---------------------------------------------------------------------------------------------------------|
| **File and Directory Access**    | `os`: Operating system interfaces. <br> `shutil`: High-level file operations (copying, moving, etc.). <br> `pathlib`: Object-oriented filesystem paths. <br> `glob`: Unix-style pathname pattern expansion. |
| **Data Persistence**             | `pickle`: Object serialization. <br> `json`: JSON encoding and decoding. <br> `csv`: CSV file reading and writing. |
| **Text Processing**              | `re`: Regular expression operations. <br> `string`: Common string operations. <br> `textwrap`: Text wrapping and filling. <br> `difflib`: Helpers for computing deltas. |
| **Data Compression and Archiving** | `zipfile`: Work with ZIP archives. <br> `gzip`, `bz2`, `lzma`: Interfaces for compression formats. |
| **Internet Protocols and Support** | `urllib`: URL handling utilities. <br> `http`: HTTP client and server modules. <br> `socket`: Low-level networking interface. <br> `smtplib`: SMTP client library. |
| **Date and Time**                | `datetime`: Basic date and time types. <br> `time`: Time access and conversions. <br> `calendar`: General calendar-related functions. |
| **Mathematics**                  | `math`: Mathematical functions. <br> `random`: Generate pseudo-random numbers. <br> `statistics`: Mathematical statistics functions. |
| **Concurrency and Parallelism**  | `threading`: Thread-based parallelism. <br> `multiprocessing`: Process-based parallelism. <br> `concurrent.futures`: High-level interface for launching parallel tasks. |
| **Cryptography and Security**    | `hashlib`: Secure hash and message digest algorithms. <br> `hmac`: Keyed-Hashing for Message Authentication. <br> `ssl`: TLS/SSL wrapper for socket objects. |
| **Debugging and Profiling**      | `pdb`: The Python Debugger. <br> `profile`, `cProfile`: Performance analysis tools. |
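As a brief illustration of how such modules combine, the sketch below uses `pathlib`, `json`, and `statistics` from the table, plus `tempfile` (also part of the standard library) for a scratch directory. The data and the file name are made up.

```python
import json
import statistics
import tempfile
from pathlib import Path

measurements = [1.5, 2.0, 2.5]  # made-up example data

# tempfile and pathlib: build a file path inside a scratch directory.
out_file = Path(tempfile.mkdtemp()) / "measurements.json"

# json: write the data to disk and read it back.
out_file.write_text(json.dumps({"values": measurements}))
loaded = json.loads(out_file.read_text())

# statistics: compute the mean of the restored values.
mean_value = statistics.mean(loaded["values"])
print(mean_value)
```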


### Investigating attributes using the `dir` built-in

- The built-in function `dir()` is used to find out which names a module defines.

In [16]:
dir(vec_ops_n_script)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_shape_check',
 'elementwise_add',
 'elementwise_multiply',
 'inner_product',
 'np']
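The result of `dir()` is an ordinary list of strings, so it can be filtered. A minimal sketch, using the standard-library module `math` so that it runs without the lecture files:

```python
import math

# Keep only the names that do not start with an underscore,
# i.e. the names intended for public use.
public_names = [name for name in dir(math) if not name.startswith("_")]

print(len(public_names))
print("sqrt" in public_names)
```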

### Packages: the way to organize collections of modules

- Python organizes modules using `packages`, which are hierarchical collections of modules.

- `setuptools` is a package development library that simplifies creating, distributing, and installing Python packages.

#### We'll create a *package for simple linear algebra* using the numpy package and analyze its structure.

```
my_linalg_project/
├── README.md
├── pyproject.toml
└── src
    └── small_linalg_package
        ├── __init__.py
        ├── matrix_operations.py
        └── vector_operations.py

3 directories, 5 files
```

- `my_linalg_project/`: This is the root directory of your *project*. It contains all the files and subdirectories related to your linear algebra package.

### Location of a package code and the crucial role of `__init__.py` files

```
my_linalg_project/
├── [...]
└── src
    └── small_linalg_package
        ├── __init__.py
        ├── matrix_operations.py
        └── vector_operations.py
```

- `src/`: A directory containing your package's source code.
  - This structure is recommended to avoid potential import issues and clarify the package code's location (separation from other package-relevant files).
  - *NOTE*: the `src` folder could also contain multiple packages, i.e., we would have a project with multiple packages inside. 

- `small_linalg_package/`: This is the actual package directory.
  - The name of this directory is the name of the package you are creating.
  - It contains the Python modules that make up your package, i.e., `matrix_operations.py` and `vector_operations.py`

- The **`__init__.py` file marks a directory as a Python package**, enabling the organization of modules into cohesive units.
  - This file can be empty or execute package initialization code, facilitating setup tasks like variable initialization or submodule imports.
  - It can also set the `__all__` variable to control what is exported when `from small_linalg_package import *` is used. (NOT RECOMMENDED!)
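The role of `__init__.py` can be sketched with a throwaway package. Everything here is hypothetical: the package name `demo_pkg`, the `scale` function, and the contents of `__init__.py` are illustrations, not the contents of `small_linalg_package`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a tiny package in a temporary directory.
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_pkg"  # hypothetical package name
pkg_dir.mkdir()

(pkg_dir / "vector_operations.py").write_text(
    "def scale(values, factor):\n"
    "    return [factor * v for v in values]\n"
)

# __init__.py runs when the package is imported: here it imports a
# submodule and declares the names exported by a star import.
(pkg_dir / "__init__.py").write_text(
    "from . import vector_operations\n"
    "__all__ = ['vector_operations']\n"
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()
demo_pkg = importlib.import_module("demo_pkg")

scaled = demo_pkg.vector_operations.scale([1, 2, 3], 2)
print(demo_pkg.__all__, scaled)
```

Because `__init__.py` imports the submodule, `demo_pkg.vector_operations` is available right after `import demo_pkg`.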

### Inform the user about the content of the package

- Our folder `my_linalg_project` contains a `README.md` file.

In [17]:
pycat my_linalg_project/README.md

# Small linear algebra package

A README file typically serves as an introduction to a project, providing essential information to users, contributors, and collaborators. Here's a general outline of what you might include:

- **Title**: Give your project a clear, concise title at the top of the README.
- **Description**: Briefly describe what the project does and its purpose.
- **Installation**: Provide instructions on installing and setting up the project. Include any dependencies and their installation steps.
- **Usage**: Explain how to use the project. Provide examples if applicable.
- **Configuration**: If configuration settings exist, explain how to modify them.
- **Contributing**: Outline guidelines for contributing to the project. Include information on how to report bugs, suggest improvements, and [...]

### How to specify installation information of a created package? 

- Python packages use configuration files to specify installation-related details.

- Main configuration files:
  - `pyproject.toml` (Modern approach for specifying project configurations.)
  - `setup.py` (Traditional approach still widely used, e.g., with [setuptools](https://setuptools.pypa.io/en/latest/))
  - The backend (build system) and metadata (e.g., package name, version, author, and description) are specified in both types of configuration files.
  - Additionally, dependencies and package contents are defined to ensure proper installation and usage.

- Our project directory contains the file `my_linalg_project/pyproject.toml`.

In [18]:
pycat my_linalg_project/pyproject.toml

[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = "small_linalg_package"
version = "0.0.1"
dependencies = [
     'importlib-metadata; python_version<="3.10"',
]


- Used by tools such as `setuptools` and `pip`.
- Contents:
  - Build system requirements
  - Metadata: Package name, Version, Author, Description, Dependencies, Other configurations

### How to install a created package?

- We can use `pip` to build and install the package described by `pyproject.toml`, with `setuptools` acting as the build backend.

In [19]:
!pip install ./my_linalg_project

Processing ./my_linalg_project
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: small_linalg_package
  Building wheel for small_linalg_package (pyproject.toml) ... done
  Created wheel for small_linalg_package: filename=small_linalg_package-0.0.1-py3-none-any.whl size=2548 sha256=11028f18a504b7db69043e4a24b29069e6835ec153630c1406497755bd6b51bf
  Stored in directory: /private/var/folders/3h/t42_2k2d0tqf3gl1_tjslcbr0000gn/T/pip-ephem-wheel-cache-_cduz0es/wheels/14/68/b8/e80e47d8e79338ae1438a5e061b1acd9f613f1180de9e3fde9
Successfully built small_linalg_package
Installing collected packages: small_linalg_package
Successfully installed small_linalg_package-0.0.1


- Adding the `-e` flag when installing a package with `pip` enables **editable** mode: instead of copying the files to the `site-packages` directory, the installation refers back to the project's source directory. Modifications to the package code therefore take effect the next time the package is imported, without reinstalling.

- Note the additional files added to the folder structure after installation. 

In [20]:
!tree my_linalg_project/

my_linalg_project/
├── README.md
├── build
│   ├── bdist.macosx-11.0-arm64
│   └── lib
│       └── small_linalg_package
│           ├── __init__.py
│           ├── matrix_operations.py
│           └── vector_operations.py
├── pyproject.toml
└── src
    ├── small_linalg_package
    │   ├── __init__.py
    │   ├── matrix_operations.py
    │   └── vector_operations.py
    └── small_linalg_package.egg-info
        ├── PKG-INFO
        ├── SOURCES.txt
        ├── dependency_links.txt
        ├── requires.txt
        └── top_level.txt

8 directories, 13 files


### Usage of the newly installed small linear algebra package

- Now we can import the code structured in the Python package.

In [21]:
import numpy as np 

from small_linalg_package import vector_operations, matrix_operations # Note: the name of our package.

In [22]:
vector1 = np.array([1, 2, 3, 4])
vector2 = np.array([5, 6, 7, 8])

matrix1 = np.stack([vector1, vector2])
matrix2 = 2 * matrix1

print(f"vector1:\n{vector1}\nvector2:\n{vector2}")
print(f"matrix1:\n{matrix1}\nmatrix2:\n{matrix2}")

vector1:
[1 2 3 4]
vector2:
[5 6 7 8]
matrix1:
[[1 2 3 4]
 [5 6 7 8]]
matrix2:
[[ 2  4  6  8]
 [10 12 14 16]]


In [23]:
inner_product = vector_operations.inner_product(vector1, vector2)

print(f"The inner product of vector1 and vector2 is {inner_product}")

The inner product of vector1 and vector2 is 70


In [24]:
matrix3 = matrix_operations.elementwise_multiply(matrix1, matrix2)

print(f"The elementwise multiplication of matrix1 and matrix2 is\n{matrix3}")

The elementwise multiplication of matrix1 and matrix2 is
[[  2   8  18  32]
 [ 50  72  98 128]]


## Docstrings and Pydoc

<!-- - what does docstring do?
- we can directly access docstring through pydoc
- pydoc can be directly accessed in python with `help` -->

### Letting the user know what a module, function, method, or class does

- Docstrings are string literals at the beginning of module, function, class, or method definitions, documenting their purpose, parameters, return values, and other relevant details.
- They serve as documentation, aiding developers in understanding code functionality without delving into implementation specifics.
- Let's see different docstring examples.

- Docstrings in a function

In [25]:
def greet(name):
    """ This function greets the user with the provided name.

    Parameters:
    name (str): The name of the person to greet.

    Returns:
    str: A greeting message.
    """
    return f"Hello, {name}! Welcome to our PyML class!"

In [26]:
greet('Pythoneer')

'Hello, Pythoneer! Welcome to our PyML class!'
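A docstring is stored on the object itself and can be read at runtime. The sketch below uses a made-up function `area`; `__doc__` returns the raw string, while `inspect.getdoc` returns a version with the indentation cleaned up.

```python
import inspect


def area(width, height):
    """Return the area of a rectangle.

    Parameters:
    width (float): The width of the rectangle.
    height (float): The height of the rectangle.

    Returns:
    float: The product of width and height.
    """
    return width * height


raw_docstring = area.__doc__             # the string exactly as stored
clean_docstring = inspect.getdoc(area)   # indentation normalized
summary_line = clean_docstring.splitlines()[0]

print(summary_line)
```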

- Docstrings in classes

In [27]:
pycat calculator.py

class Calculator:
    """ A simple calculator class. """

    def check_input(self, x):
        """ Check if the input is an integer.

        Parameters:
        x: The input value to be checked.

        Raises:
        TypeError: If x is not an integer.
        """
        if not isinstance(x, int):
            raise TypeError("Parameter 'x' must be an integer.")


    def add(self, x, y):
        """ Adds two numbers together.

        Parameters:
        x (int): The first number.
        y (int): The second number.

        Returns:
        int: The sum of x and y.

        Raises:
        TypeError: If either x or y is not an integer.
        """
        self.check_input(x)
        self.check_input(y)
        return x + y

### Using pydoc to retrieve documentation for Python code


- Pydoc is a documentation tool that extracts docstrings to generate user-friendly *documentation for Python code*.
- Pydoc is accessible from the command line using `pydoc` followed by the object name, or within Python using the `help()` function.

##### Retrieve documentation of our `greet` function. 

In [31]:
help(greet)

Help on function greet in module __main__:

greet(name)
    This function greets the user with the provided name.
    
    Parameters:
    name (str): The name of the person to greet.
    
    Returns:
    str: A greeting message.



##### Retrieve documentation for a class stored in a module

In [32]:
from calculator import Calculator
help(Calculator)

Help on class Calculator in module calculator:

class Calculator(builtins.object)
 |  A simple calculator class.
 |  
 |  Methods defined here:
 |  
 |  add(self, x, y)
 |      Adds two numbers together.
 |      
 |      Parameters:
 |      x (int): The first number.
 |      y (int): The second number.
 |      
 |      Returns:
 |      int: The sum of x and y.
 |      
 |      Raises:
 |      TypeError: If either x or y is not an integer.
 |  
 |  check_input(self, x)
 |      Check if the input is an integer.
 |      
 |      Parameters:
 |      x: The input value to be checked.
 |      
 |      Raises:
 |      TypeError: If x is not an integer.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables
 |  
 |  __weakref__
 |      list of weak references to the object



- Now we know how to use the calculator.

In [33]:
calc = Calculator()

calc.add(x=10, y=15)

25
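The text that `help()` prints can also be obtained as a string through the `pydoc` module, which is handy for saving or searching it. A minimal sketch with a made-up function `to_celsius`; passing `renderer=pydoc.plaintext` requests output without terminal formatting characters.

```python
import pydoc


def to_celsius(fahrenheit):
    """Convert a temperature from Fahrenheit to Celsius."""
    return (fahrenheit - 32) * 5 / 9


# render_doc returns the documentation text instead of printing it.
help_text = pydoc.render_doc(to_celsius, renderer=pydoc.plaintext)
print(help_text)
```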

### Different types of docstring styles

- There are different styles of docstrings which one could use. 
- The most prominent are:
    1. Google Style Docstrings
    2. Numpydoc
    3. reStructuredText (reST) Style Docstrings

In [35]:
# 1. Google Style Docstrings
def function_name(param1, param2):
    """
    This is a one-line summary of what the function does.

    Args:
        param1 (int): Description of param1.
        param2 (str): Description of param2.

    Returns:
        bool: Description of return value.
    """
    # Function implementation


# 2. Numpydoc
def function_name(param1, param2):
    """
    This is a one-line summary of what the function does.

    Parameters
    ----------
    param1 : int
        Description of param1.
    param2 : str
        Description of param2.

    Returns
    -------
    bool
        Description of return value.
    """
    # Function implementation

In [36]:
# 3. reStructuredText (reST) Style Docstrings
def function_name(param1, param2):
    """
    This is a one-line summary of what the function does.

    :param param1: Description of param1.
    :type param1: int
    :param param2: Description of param2.
    :type param2: str
    :return: Description of return value.
    :rtype: bool
    """
    # Function implementation


### Pros and cons of different docstring styles

| Docstring Style                    | Pros                                                                                               | Cons                                                                                                                                     |
|-----------------------------------|----------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------|
| **Google Style Docstrings**       | **Structured**: Follows a clear structure with sections like Args, Returns, Raises, Examples, etc., making it easy to read and understand.    | **Verbose**: Can become lengthy, especially for complex functions or methods.                                                              |
|                                   | **Popular**: Widely used, especially in the Python community, making it familiar to many developers. | **Non-standard**: While widely used, it's not an official standard, so there might be variations in how it's applied.                        |
|                                   | **Integrates with Tools**: Many tools and libraries support parsing and generating documentation from Google Style Docstrings.                  |  **Formatting**: Requires strict adherence to formatting conventions for consistency.                                                          |
| **Numpydoc**                      | **Scientific Community**: Popular in the scientific Python community, particularly for documenting NumPy and SciPy functions.                     | **Lengthy**: Like Google Style, it can become verbose for complex functions.                                                                 |
|                                   | **Structured**: Similar to Google Style, it provides a clear structure for documenting parameters, return values, etc.                           |  **Non-standard**: While widely used, it's not an official standard, so there might be variations in how it's applied.                        |
|                                   | **Sphinx Integration**: Compatible with Sphinx, making it suitable for projects using Sphinx for documentation generation.                       |  **Learning Curve**: Requires familiarity with reStructuredText markup, which might be a learning curve for some developers.                   |
| **reStructuredText (reST) Style Docstrings** | **Integration with Sphinx**: Fully compatible with Sphinx and other reStructuredText-based documentation tools.                                 | **Markup Complexity**: Requires knowledge of reStructuredText markup, which can be more complex than plain text.                              |
|                                   | **Clear Structure**: Follows a clear structure with parameters, types, and return values specified inline, making it easy to read.            | **Non-Pythonic**: Some developers may find the syntax less intuitive, especially if they're not familiar with reStructuredText.           |
|                                   | **Sphinx Directives**: Supports Sphinx directives for cross-referencing and linking to other documentation.                                     |                                                                                                                                          |


### How to create (publishable) documentation of Python Code?

- **Sphinx** is a documentation generation tool widely used in the Python community.
  - It automates the process of generating documentation from source code, making it easier to maintain and update. [Official website](https://www.sphinx-doc.org/en/master/). 

- Example: [numpy official documentation](https://numpy.org/doc/stable/index.html) created with Sphinx
    - [ndarray documentation](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)
 
- Example: [Quantus explainability package](https://quantus.readthedocs.io/en/latest/index.html) created with Sphinx
    - [Floating point discretisation](https://quantus.readthedocs.io/en/latest/docs_api/quantus.functions.discretise_func.html#quantus.functions.discretise_func.floating_points)
    - Directly specified in [source](https://github.com/understandable-machine-intelligence-lab/Quantus/blob/8ad10763f2ed670ae28059baa90bb4f2e7b9f3ab/quantus/functions/discretise_func.py#L15)!

- Key Features:
    - **reStructuredText (reST) Support**: Sphinx uses reStructuredText, a lightweight markup language, for documentation source files.
    - **Automatic Generation**: It can automatically generate documentation from source code, including module hierarchies, function signatures, and docstrings.
    - **Customizable Output**: Sphinx supports multiple output formats such as HTML, PDF, and ePub, allowing documentation to be published in various formats.
    - **Cross-Referencing**: Sphinx generates cross-references within the documentation, making it easier for users to navigate and understand the codebase.

- **Note**: `reStructuredText` is a markup language for creating structured text documents, while `reStructuredText Style docstring` refers to using `reStructuredText` syntax within Python docstrings for code documentation.

## Testing code: Crucial in Software Development

### What is code testing?

- Testing ensures code behaves as expected under different conditions.
- In lecture 2, we have seen **runtime testing by using Exceptions**
- Now, we focus on **code testing**

Python offers rich tools and libraries for effective testing.
- `unittest`: Built-in unit testing framework.
- `pytest`: Simple, scalable, and extensible testing framework.
- `doctest`: Testing through documentation.
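Since `doctest` is not covered further here, a minimal sketch of the idea may help: the examples written in a docstring are executed and their output is compared with the text that follows them. The function `square` is made up, and the finder/runner classes are used so that the result is available as a value rather than only printed.

```python
import doctest


def square(x):
    """Return x squared.

    >>> square(3)
    9
    >>> square(-2)
    4
    """
    return x * x


# Collect the examples from the docstring and execute them.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(square, name="square", globs={"square": square}):
    runner.run(test)

# summarize() reports how many examples were attempted and how many failed.
results = runner.summarize(verbose=False)
print(results.attempted, results.failed)
```

For a whole module, calling `doctest.testmod()` is the more common entry point.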

### Why is testing important?

- **Early Bug Detection**: Testing helps catch bugs and errors in your code before they become significant issues in production.

- **Maintaining Code Quality**: Writing tests encourages writing modular, maintainable, and understandable code.

- **Refactoring with Confidence**: With a comprehensive test suite, you can refactor your code confidently, because the tests will reveal whether existing functionality has been broken.

- **Documentation**: Tests serve as executable documentation, providing examples of how your code should be used.

### The Art of Test-driven Development (TDD)

- **Test-driven development (TDD)** is a software development approach in which you **write tests for your code before you actually write it**. 
    

1. **Write a Test**: Write a test case describing your function's desired behavior. This test should fail initially since you haven't implemented the function yet.

2. **Write the Code**: Implement the function to make the test pass. Keep the code as simple as possible to satisfy the test case.

3. **Refactor**: Once the test passes, you can refactor your code if necessary to improve readability, performance, or maintainability. Make sure all tests still pass after refactoring.

4. **Repeat**: Continue this cycle by writing additional tests for different scenarios and implementing the corresponding code until you've covered all desired functionality.

- Let's apply the TDD approach.
- Test functions use Python's built-in `assert` statement to check that conditions are true. If the condition is false, the test fails.

- **Task**: Write a test that checks the correct computation of a factorial.

In [37]:
def test_factorial():
    assert factorial(5) == 120, "Wrong answer, the factorial of 5 must be 120."
    assert factorial(10) == 3628800, "Wrong answer, the factorial of 10 must be 3628800."
    assert factorial(0) == 1, "Wrong answer, the factorial of 0 must be 1."

- Write a function satisfying the test.

In [38]:
def factorial(n):
    if n == 0:
        return 1
    result = 1
    for i in range(1, n + 1):
        result *= i
    return result

In [39]:
test_factorial()

### How to use `pytest`?

- The `pytest` framework makes it easy to write small, readable tests, and can scale to support complex functional testing for applications and libraries.
- Compared to our small test above, `pytest` provides additional helpers, for example for checking that a function raises an expected exception.

- Next, we want to extend our `factorial` function so that it validates its input.
- Again, we first define the test and then implement the function.
   - Note: we use `pytest.raises` to test that `factorial` raises the expected exception.

In [40]:
import pytest

def test_factorial():
    assert factorial(5) == 120, "Wrong answer, the factorial of 5 must be 120."
    assert factorial(10) == 3628800, "Wrong answer, the factorial of 10 must be 3628800."
    assert factorial(0) == 1, "Wrong answer, the factorial of 0 must be 1."

    # Test that factorial raises a ValueError for negative numbers.
    with pytest.raises(ValueError, match="Input must be a non-negative integer.") as exc_info:
        factorial(-1)

    with pytest.raises(TypeError, match="Input must be an integer.") as exc_info:
        factorial(1.5)

    with pytest.raises(TypeError, match="Input must be an integer.") as exc_info:
        factorial("this is a string")


In [41]:
import traceback
try:
    test_factorial()
except:
    traceback.print_exc()

Traceback (most recent call last):
  File "/var/folders/3h/t42_2k2d0tqf3gl1_tjslcbr0000gn/T/ipykernel_68708/3075588067.py", line 3, in <module>
    test_factorial()
  File "/var/folders/3h/t42_2k2d0tqf3gl1_tjslcbr0000gn/T/ipykernel_68708/43458582.py", line 9, in test_factorial
    with pytest.raises(ValueError, match="Input must be a non-negative integer.") as exc_info:
  File "/Users/johannes/miniconda3/envs/pyml3/lib/python3.11/site-packages/_pytest/python_api.py", line 1019, in __exit__
    fail(self.message)
  File "/Users/johannes/miniconda3/envs/pyml3/lib/python3.11/site-packages/_pytest/outcomes.py", line 178, in fail
    raise Failed(msg=reason, pytrace=pytrace)
Failed: DID NOT RAISE <class 'ValueError'>


- Modify the `factorial` function so that its input is checked.

In [42]:
def factorial(n):
    if not isinstance(n, int):
        raise TypeError("Input must be an integer.")
    if n < 0:
        raise ValueError("Input must be a non-negative integer.")
    if n == 0:
        return 1
    result = 1
    for i in range(1, n + 1):
        result *= i
    return result

In [43]:
test_factorial()

- **Note** that the test passed because the `factorial` function raised the exceptions as expected.

### How to include `pytest` tests in your Python package?

- We add the `test` folder at the same level as `src`, the documentation and installation files. 

```
my_linalg_project/
├── README.md
├── build
├── pyproject.toml
├── src
│   ├── small_linalg_package
│   │   ├── __init__.py
│   │   ├── matrix_operations.py
│   │   └── vector_operations.py
│   └── [...]
└── test
    ├── __init__.py
    ├── test_matrix_operations.py
    └── test_vector_operations.py
```

- **Test Discovery**: `pytest` automatically *finds and collects test functions and classes* from Python files that follow naming conventions. 
    - *Function Names*: Test functions should start with the word `test` (e.g., `test_factorial`)
    - *File Names*: Test files (`test/*.py`) should start with `test_` or end with `_test.py`. For example, `test_matrix_operations.py` or `matrix_operations_test.py`.

- To configure `pytest` for our package and tell it where the tests are located, we add the following lines to our `pyproject.toml`.

```
[...]


[tool.pytest.ini_options]
testpaths = [
    "test"
]
```

### More comments on `pytest` test functions.

**1. Parametrization**: Using `@pytest.mark.parametrize` as a decorator, one can run a test function with multiple sets of inputs. This reduces code duplication and makes tests more readable.

In [44]:
@pytest.mark.parametrize("n, expected", [
    (5, 120),
    (10, 3628800),
    (0, 1)
])
def test_factorial(n, expected):
    assert factorial(n) == expected
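
Conceptually, parametrization calls the test body once per tuple of arguments. The sketch below mimics this with a plain loop and without `pytest`; it is only an approximation, since `pytest` additionally collects and reports every tuple as a separate test. The names `cases` and `check_factorial` are hypothetical.

```python
def factorial(n):
    """Minimal stand-in for the `factorial` function defined above."""
    result = 1
    for i in range(1, n + 1):
        result *= i
    return result


# the same (n, expected) pairs as in the parametrized test
cases = [(5, 120), (10, 3628800), (0, 1)]


def check_factorial(n, expected):
    assert factorial(n) == expected


# roughly what parametrization does: one call of the test body per tuple
for n, expected in cases:
    check_factorial(n, expected)
    print(f"factorial({n}) == {expected}: passed")
```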

**2. Fixtures**: Functions decorated with `@pytest.fixture` set up the state or context for the tests. They can be used to initialize resources, such as database connections or test data, that can be shared across multiple test functions.

In [45]:
@pytest.fixture
def valid_inputs():
    return [
        (5, 120),
        (0, 1),
        (1, 1),
        (3, 6),
        (10, 3628800)
    ]

def test_factorial_valid_inputs(valid_inputs):
    for input_val, expected in valid_inputs:
        assert factorial(input_val) == expected

**3. Grouping Tests**: You can group related tests into classes

In [46]:
class TestFactorial:

    @pytest.mark.parametrize("n, expected", [
        (5, 120),
        (10, 3628800),
        (0, 1)
    ])
    def test_factorial(self, n, expected):
        assert factorial(n) == expected

    def test_factorial_invalid_inputs(self):
        with pytest.raises(ValueError, match="Input must be a non-negative integer.") as exc_info:
            factorial(-1)

        with pytest.raises(TypeError, match="Input must be an integer.") as exc_info:
            factorial(1.5)

        with pytest.raises(TypeError, match="Input must be an integer.") as exc_info:
            factorial("this is a string")

### Common types of code testing

- Common ways to test and check code include:

1. **Unit Testing**: Tests individual units or components of the software in isolation to verify their functionality.

2. **Integration Testing**: Tests the interactions and interfaces between different modules or components to ensure they work together correctly.

3. **Static Code Analysis**: Analyzes the source code without executing it to identify potential issues such as code duplication, dead code, and adherence to coding standards.

4. **Code Reviews**: Involves manual or automated reviews of the code by developers or peers to identify defects, improve code quality, and share knowledge.
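
- to make the difference between the first two types concrete, here is a minimal sketch with two hypothetical functions, `mean` and `center`, which are not part of the lecture's package
- the unit test checks one function in isolation, while the integration test checks that both functions work together

```python
def mean(values):
    """Hypothetical unit 1: arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def center(values):
    """Hypothetical unit 2: subtract the mean from every element."""
    offset = mean(values)
    return [value - offset for value in values]


def test_mean_unit():
    # unit test: checks `mean` in isolation
    assert mean([1.0, 2.0, 3.0]) == 2.0


def test_center_integration():
    # integration test: checks that `center` and `mean` work together,
    # the centered values must have zero mean
    centered = center([1.0, 2.0, 3.0])
    assert centered == [-1.0, 0.0, 1.0]
    assert mean(centered) == 0.0


# called directly here; with pytest both functions would be discovered
test_mean_unit()
test_center_integration()
```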

## Code Style and Readability

### Demonstration of horrible code style

- the quality of your code does not only depend on runtime or memory usage
- readable code leads to fewer bugs, higher coding efficiency, and better reproducibility
- imagine your co-worker gives you the following code for review (purely fictional algorithm)

In [47]:
def do_computation(x,z, K,i,lim=19):
    print('Initializing computation of the original foobarian algorithm for squared float mixing derived from Bar et al.\'s organic wooden theorem. Prepare for intense runtime and memory consumption.')
    if i<0: raise RuntimeError('failed!')
    f=lambda a,b: a**2+b;r=f(x,z) if i>3 else 0
    for k in range(4):
        for l in range(lim): r=f(x,r)+ f(r**.5,z) if l <i else f(r**.5-1,K)
    return r

do_computation(1,2,3,4,5)
    

Initializing computation of the original foobarian algorithm for squared float mixing derived from Bar et al.'s organic wooden theorem. Prepare for intense runtime and memory consumption.


307721.7434282789

- *please, never write code like this*

### Demonstration of code legibility

- observe how keeping a clean style makes the code much more legible (still a purely fictional algorithm)
- while this version seems much longer, it does the exact same computation
- a clean code style usually results in longer files, but there is no reason to write short code if nobody, including you in 2 weeks, can easily decipher what it is supposed to do

In [48]:
import logging

def foobarian_float_mixing(
    base, offset, alt_offset, counter, limit=19
):
    '''Computes the original foobarian algorithm for squared float
    mixing derived from Bar et al.'s (19XX) organic wooden theorem.
    High runtime and memory consumption.
    '''
    
    def square_and_shift(square_base, shift):
        '''Apply the square and shift.'''
        return square_base ** 2 + shift

    print(
        'Initializing computation of the original foobarian algorithm '
        'for squared float mixing derived from Bar et al.\'s organic '
        'wooden theorem. Prepare for intense runtime and memory '
        'consumption.'
    )
    
    if counter < 0:  # sanity check for counter
        raise RuntimeError(
            f'Argument \'counter\' cannot be negative! Was \'{counter}\'.'
        )

    result = 0
    if counter > 3:
        # initialize result for higher counts
        result = square_and_shift(base, offset)
    
    for _ in range(4):
        for lim in range(limit):
            # step of the foobarian float mixing
            if lim < counter:
                # limits of values below counter are computed
                # using Bar et al. Eq. (17)
                result = (
                    square_and_shift(base, result)
                    + square_and_shift(result ** .5, offset)
                )
            else:
                # Foo et al. (20XX) propose an alternative shift in the
                # initial phase
                result = square_and_shift(result ** .5 - 1, alt_offset) 
    return result

foobarian_float_mixing(1, 2, 3, 4, 5)

Initializing computation of the original foobarian algorithm for squared float mixing derived from Bar et al.'s organic wooden theorem. Prepare for intense runtime and memory consumption.


307721.7434282789

### Python Enhancement Proposal 8 (PEP8)

- [PEPs (Python Enhancement Proposals)](https://peps.python.org) are proposals to enhance Python
- some of them are implemented in the language, some are rejected, and some describe general conventions of Python, such as docstring conventions or code style
- [PEP8](https://peps.python.org/pep-0008/) is a general style guideline, derived from the original Python style guideline of Guido van Rossum (the creator of Python)
- while PEP8 allows for some custom design choices, it provides general recommendations on how to write legible Python

> One of Guido’s key insights is that code is read much more often than it is written. 

### Variable name conventions

- **only types** or **classes** start with a capital letter (built-in types such as `int` or `list` are the exception)

- classes and types are written in **`CamelCase`**

In [49]:
class CachedDict:
    pass

class SparseMatrix:
    pass

class StochasticGradientDescent:
    pass

- everything else (variables, functions, ...) is written in **`snake_case`**

In [50]:
instance_counter = 0
predicted_mean_value = 0

def create_list_of_zeros(size):
    return [0] * size

### Use meaningful variable names

- descriptive variable, function and class names enormously increase readability; can you guess what this class does?

In [51]:
class Alg:
    def __init__(self):
        self.a = []
    def set(self, i, x):
        for j, (y, z) in enumerate(self.a):
            if i < y:
                self.a.insert(j, (i, x))
                return self
        self.a.append((i, x))
        return self
    def get(self):
        return self.a.pop(0)

Alg().set(1, 'a').set(3, 'c').set(2, 'e').a

[(1, 'a'), (2, 'e'), (3, 'c')]

### Use meaningful variable names (cont'd)

- the following should be much easier to read

In [52]:
class PriorityQueue:
    def __init__(self):
        self.queue = []
    def add(self, priority, payload):
        for index, (priority_reference, _) in enumerate(self.queue):
            if priority < priority_reference:
                self.queue.insert(index, (priority, payload))
                return self
        self.queue.append((priority, payload))
        return self
    def pop_most_urgent(self):
        return self.queue.pop(0)

PriorityQueue().add(1, 'a').add(3, 'c').add(2, 'e').queue

[(1, 'a'), (2, 'e'), (3, 'c')]

### Use whitespaces to your advantage

- characters bunched up too close are difficult to read
- when assigning variables with `=` or using binary operators (`+`, `==`, `|`, ...), add a whitespace before and after the operator, except for the `=` of default values and keyword arguments
- always add a whitespace after a comma `,` except at the end of the line

In [53]:
pi = 3.141592653589793
radius = 1.5
circumference = 2 * pi * radius
if circumference / pi >= 4:
    print('That\'s a big circle!')

def get_dict_default(obj, key, default=13):
    return obj.get(key, default)

print(get_dict_default(obj={}, key='unavailable'))

13


### Use newlines to your advantage

- never use more than one empty line in functions, and separate file-level functions with two empty lines
- never more than a single newline at the end or the beginning of your file

In [54]:
def inner(left, right):
    '''Computes the inner product of two vectors'''
    result = 0
    for x, y in zip(left, right):
        result += x * y
    
    return result  # one empty line is okay!


def l2norm(vector):
    '''L2-norm that is also two empty lines below the inner product.'''
    result = 0
    for element in vector:
        result += element ** 2
    return result ** .5

### Break long function calls, definitions, and sequences after the comma

- very long function calls, definitions, and sequences can be very hard to read
- make sure your lines are not too long (PEP8 recommends at most 79 characters, 120 is usually still fine, more is usually too long)

In [55]:
dictionary = {
    'german': 'Ich weiß nicht, wie man das auf Deutsch sagt',
    'chinese': '我不知道中文该怎么说',
    'greek': 'Δεν ξέρω πώς να το πω αυτό στα ελληνικά',
    'turkish': 'Bunu Türkçe nasıl söyleyeceğimi bilmiyorum',
    'portuguese': 'Não sei como dizer isso em português',
    'japanese': '日本語でどう言えばいいのか分かりません'
}

def translate(
    word,
    language,
    urgency,
    confusedness,
):
    translation = dictionary.get(
        language,
        f'I do not know how to say that in {language.capitalize()}'
    )
    return f'{word}{confusedness * "?"} {translation}{urgency * " !"}'

translate(
    'supercalifragilisticexpialidocious',
    'japanese',
    urgency=2,
    confusedness=4,
)

'supercalifragilisticexpialidocious???? 日本語でどう言えばいいのか分かりません ! !'

### Comment your code, but only when it counts

- commenting your code when something is not trivial to understand is very good practice

- however, trivial comments of trivial code waste your and the code-reader's time
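
- as a small illustration, the hypothetical function below contains one comment of each kind: the first only repeats what the code already says, the second explains something that is not obvious from the code

```python
def count_even(values):
    # trivial comment, only repeats the code: set counter to zero
    counter = 0
    for value in values:
        # useful comment, explains intent: in Python, `value % 2` is 0
        # for negative even numbers as well, so no abs() is needed
        if value % 2 == 0:
            counter += 1
    return counter


print(count_even([-4, -1, 0, 3, 8]))
```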

### Tools to help you write clean code 

- `flake8` can help you check whether your code conforms to PEP8

- `pycodestyle` checks only the PEP8 style conventions, while `flake8` combines it with further checks (e.g., for unused imports), and is commonly used to enforce a specific code style

- there exist auto-formatters like `black`; however, try to write clean code to begin with, and use `black` only once you have got the gist of it

## Debugging
---

### Setup

- a highly important skill for any programmer is to properly debug code, especially when unexpected errors arise

- as an example, we implemented here `insert_sorted`, which inserts a sorted list into another sorted list

In [58]:
%run scripts/insert-broken.py
%pycat scripts/insert-broken.py

def append(elem, target):
    ''''Append an element to a list.'''
    target.apend(elem)


def searchsorted(insert, target):
    '''Find the indices for elements in insert if they were to be inserted into the sorted list `target`.
    Assume both insert and target are sorted.
    '''
    result = []
    for i, compared in enumerate(target):
        for elem in insert[len(result):]:
            if elem <= compared:
                append(i, result)
    result += (len(insert) - len(result)) * [len(target)]
    return result


def insert_sorted(elems, target):
    '''Insert list `elems` into sorted list `target`.'''
    elems = list(sorted(elems))
    result = list(target)
    indices = searchsorted(elems, target)
    for index, elem

### The Traceback

- running the code will produce an error, which provides us with a **Traceback**

In [59]:
insert_sorted([19, 25], [15, 17, 22])

AttributeError: 'list' object has no attribute 'apend'

- at the top, we see `AttributeError`: this is the **uncaught** exception that caused the termination of the program

- below, we see the 4 stack trace entries, each starting with the location of the code where the error occurred. The location includes the file and the line number where the error occurred in the respective frame

- at the very bottom, we see the raised exception `AttributeError` once again, with a specific description of the error, which is often already enough to fix it
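
- the same pieces of information can also be accessed programmatically with the standard `traceback` module; the following is a minimal sketch with two hypothetical functions `inner` and `outer`, which stand in for the nested calls of `insert_sorted`

```python
import traceback


def inner():
    # hypothetical failing call, mirroring the misspelled `apend` above
    [].apend(1)


def outer():
    inner()


try:
    outer()
except AttributeError as error:
    # one entry per frame, from the outermost call to the failing line
    entries = traceback.extract_tb(error.__traceback__)
    for entry in entries:
        print(f'{entry.filename}:{entry.lineno} in {entry.name}')
        print(f'    {entry.line}')
    # the exception type and its description, as in the last line
    print(f'{type(error).__name__}: {error}')
```

- each entry corresponds to one stack trace entry of the printed traceback, with the file, the line number, and the name of the function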

### The call-stack

- the **call stack** keeps track of the active function calls
- each function execution creates a new **frame**, which holds the context and variables of that function call
- during debugging, we can navigate through the different frames
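
- the frames of the call stack can be listed from within Python using `inspect.stack()`; the sketch below uses hypothetical stand-in functions (the `_demo` names are not part of the lecture's scripts) to show that each active call contributes one frame

```python
import inspect


def active_function_names():
    # inspect.stack() lists the frames of all active calls,
    # starting with the current (innermost) frame
    stack = inspect.stack()
    return [frame_info.function for frame_info in stack]


def searchsorted_demo():
    # hypothetical stand-in for an inner function call
    return active_function_names()


def insert_sorted_demo():
    # hypothetical stand-in for an outer function call
    return searchsorted_demo()


names = insert_sorted_demo()
print(names[:3])
```

- the first three names are the three nested calls, innermost first; the remaining entries belong to whatever called `insert_sorted_demo`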

<center><img src='images/call-stack.svg' width=30%></center>

### Post-mortem debugging with PDB

- we can attach python's **debugger** to the code to inspect the runtime environment right at the moment where the error occurred

- for scripts, we can run from the terminal `python -m pdb path/to/script.py` and enter `c` to start the program. It will automatically halt once an uncaught exception arises

- jupyter notebooks allow us to directly debug when an exception causes the program to halt, which can be enabled by calling `%pdb on`

In [60]:
%pdb on

Automatic pdb calling has been turned ON


### Post-mortem debugging with PDB (cont'd)

- here we see the call stack again, with a text input below
- try entering `u` or `d` followed by hitting `Enter` in the text input to go up or down the call stack respectively
- try printing variables using `p <variable-name>`, you can only access ones that are accessible in the current frame
- exit the debugger with `q`

In [61]:
insert_sorted([19, 25], [15, 17, 22])

AttributeError: 'list' object has no attribute 'apend'

> /Users/johannes/Repositories/pyml/lecture/lecture-02/scripts/insert-broken.py(3)append()
      1 def append(elem, target):
      2     ''''Append an element to a list.'''
----> 3     target.apend(elem)
      4
      5

ipdb>  l

      1 def append(elem, target):
      2     ''''Append an element to a list.'''
----> 3     target.apend(elem)
      4
      5
      6 def searchsorted(insert, target):
      7     '''Find the indices for elements in insert if they were to be inserted into the sorted list `target`.
      8     Assume both insert and target are sorted.
      9     '''
     10     result = []
     11     for i, compared in enumerate(target):

ipdb>  elem

2

ipdb>  target

[]

ipdb>  target.append(elem)
ipdb>  target

[2]

ipdb>  p elem

2

ipdb>  u

> /Users/johannes/Repositories/pyml/lecture/lecture-02/scripts/insert-broken.py(14)searchsorted()
     12         for elem in insert[len(result):]:
     13             if elem <= compared:
---> 14                 append(i, result)
     15     result += (len(insert) - len(result)) * [len(target)]
     16     return result

ipdb>  d

> /Users/johannes/Repositories/pyml/lecture/lecture-02/scripts/insert-broken.py(3)append()
      1 def append(elem, target):
      2     ''''Append an element to a list.'''
----> 3     target.apend(elem)
      4
      5

ipdb>  q


### Debugging non-crashing code

- let us use the fixed version to analyze non-crashing code

In [62]:
%run scripts/insert-fixed.py
%pycat scripts/insert-fixed.py

def append(elem, target):
    ''''Append an element to a list.'''
    target.append(elem)


def searchsorted(insert, target):
    '''Find the indices for elements in insert if they were to be inserted into the sorted list `target`.
    Assume both insert and target are sorted.
    '''
    result = []
    for i, compared in enumerate(target):
        for elem in insert[len(result):]:
            if elem <= compared:
                append(i, result)
    result += (len(insert) - len(result)) * [len(target)]
    return result


def insert_sorted(elems, target):
    '''Insert list `elems` into sorted list `target`.'''
    elems = list(sorted(elems))
    result = list(target)
    indices = searchsorted(elems, target)
    for index, ele

### How does Python find files and imports?

- in the following, we are going to set manual breakpoints
- we usually need to know the full path, or the path relative to our python-path to set breakpoints in files
- to make this process easier for the sake of this demonstration, we will add the local `scripts` directory to our python-path
- the python-path is a list of directories that python uses to find imports

In [63]:
import sys
print(sys.path)

['/Users/johannes/miniconda3/envs/pyml3/lib/python311.zip', '/Users/johannes/miniconda3/envs/pyml3/lib/python3.11', '/Users/johannes/miniconda3/envs/pyml3/lib/python3.11/lib-dynload', '', '/Users/johannes/miniconda3/envs/pyml3/lib/python3.11/site-packages']


- we can add a new directory to the python-path by simply inserting it into `sys.path`

In [64]:
from pathlib import Path
scripts_dir = str(Path().absolute() / 'scripts')
if scripts_dir not in sys.path:
    sys.path.insert(0, scripts_dir)

- we will analyze the fixed code in the following slides
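
- to see the effect of `sys.path` on imports in isolation, the following minimal sketch writes a hypothetical module `tiny_module.py` into a temporary directory, adds that directory to `sys.path`, and imports the module by name
- the module name and its content are made up for this illustration only

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as module_dir:
    # hypothetical module, written only for this demonstration
    Path(module_dir, 'tiny_module.py').write_text('ANSWER = 42\n')

    # without `module_dir` in sys.path, the import below would fail
    sys.path.insert(0, module_dir)
    try:
        importlib.invalidate_caches()
        tiny_module = importlib.import_module('tiny_module')
        print(tiny_module.ANSWER)
    finally:
        # clean up so that later imports are not affected
        sys.path.remove(module_dir)
```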

### Manually setting breakpoints

- to investigate bugs that do not crash the program, we can manually set breakpoints within pdb using `b`
- the `%debug` magic command, followed by a statement, runs that statement under the debugger from the beginning
- try adding a breakpoint inside the `searchsorted` function by writing `b searchsorted`
- try also adding a breakpoint by filename and line number by entering `b insert-fixed.py:13`
- run the code by entering `c`, this will continue until the first breakpoint
- we can step over lines with `n` and step into function calls with `s`
- we can disable and reenable breakpoints with `disable <bpnumber>` and `enable <bpnumber>`, or remove them with `clear <bpnumber>`; use `b` to list all breakpoints

In [65]:
%debug print(insert_sorted([19, 25], [15, 17, 22]))

NOTE: Enter 'c' at the ipdb>  prompt to continue execution.
> <string>(1)<module>()

ipdb>  breakpoints

*** NameError: name 'breakpoints' is not defined

ipdb>  b
ipdb>  b searchsorted

Breakpoint 1 at /Users/johannes/Repositories/pyml/lecture/lecture-02/scripts/insert-fixed.py:6

ipdb>  b

Num Type         Disp Enb   Where
1   breakpoint   keep yes   at /Users/johannes/Repositories/pyml/lecture/lecture-02/scripts/insert-fixed.py:6

ipdb>  b insert-fixed.py:13

Breakpoint 2 at /Users/johannes/Repositories/pyml/lecture/lecture-02/scripts/insert-fixed.py:13

ipdb>  b

Num Type         Disp Enb   Where
1   breakpoint   keep yes   at /Users/johannes/Repositories/pyml/lecture/lecture-02/scripts/insert-fixed.py:6
2   breakpoint   keep yes   at /Users/johannes/Repositories/pyml/lecture/lecture-02/scripts/insert-fixed.py:13

ipdb>  c

> /Users/johannes/Repositories/pyml/lecture/lecture-02/scripts/insert-fixed.py(10)searchsorted()
      8     Assume both insert and target are sorted.
      9     '''
---> 10     result = []
     11     for i, compared in enumerate(target):
     12         for elem in insert[len(result):]:

ipdb>  n

> /Users/johannes/Repositories/pyml/lecture/lecture-02/scripts/insert-fixed.py(11)searchsorted()
      9     '''
     10     result = []
---> 11     for i, compared in enumerate(target):
     12         for elem in insert[len(result):]:
2    13             if elem <= compared:

ipdb>  c

> /Users/johannes/Repositories/pyml/lecture/lecture-02/scripts/insert-fixed.py(13)searchsorted()
     11     for i, compared in enumerate(target):
     12         for elem in insert[len(result):]:
2--> 13             if elem <= compared:
     14                 append(i, result)
     15     result += (len(insert) - len(result)) * [len(target)]

ipdb>  u

> /Users/johannes/Repositories/pyml/lecture/lecture-02/scripts/insert-fixed.py(23)insert_sorted()
     21     elems = list(sorted(elems))
     22     result = list(target)
---> 23     indices = searchsorted(elems, target)
     24     for index, elem in zip(indices[::-1], elems[::-1]):
     25         result.insert(index, elem)

ipdb>  elems

[19, 25]

ipdb>  d

> /Users/johannes/Repositories/pyml/lecture/lecture-02/scripts/insert-fixed.py(13)searchsorted()
     11     for i, compared in enumerate(target):
     12         for elem in insert[len(result):]:
2--> 13             if elem <= compared:
     14                 append(i, result)
     15     result += (len(insert) - len(result)) * [len(target)]

ipdb>  elem

19

ipdb>  result

[]

ipdb>  c

> /Users/johannes/Repositories/pyml/lecture/lecture-02/scripts/insert-fixed.py(13)searchsorted()
     11     for i, compared in enumerate(target):
     12         for elem in insert[len(result):]:
2--> 13             if elem <= compared:
     14                 append(i, result)
     15     result += (len(insert) - len(result)) * [len(target)]

ipdb>  c

> /Users/johannes/Repositories/pyml/lecture/lecture-02/scripts/insert-fixed.py(13)searchsorted()
     11     for i, compared in enumerate(target):
     12         for elem in insert[len(result):]:
2--> 13             if elem <= compared:
     14                 append(i, result)
     15     result += (len(insert) - len(result)) * [len(target)]

ipdb>  c

> /Users/johannes/Repositories/pyml/lecture/lecture-02/scripts/insert-fixed.py(13)searchsorted()
     11     for i, compared in enumerate(target):
     12         for elem in insert[len(result):]:
2--> 13             if elem <= compared:
     14                 append(i, result)
     15     result += (len(insert) - len(result)) * [len(target)]

ipdb>  c

> /Users/johannes/Repositories/pyml/lecture/lecture-02/scripts/insert-fixed.py(13)searchsorted()
     11     for i, compared in enumerate(target):
     12         for elem in insert[len(result):]:
2--> 13             if elem <= compared:
     14                 append(i, result)
     15     result += (len(insert) - len(result)) * [len(target)]

ipdb>  disable 1

Disabled breakpoint 1 at /Users/johannes/Repositories/pyml/lecture/lecture-02/scripts/insert-fixed.py:6

ipdb>  c

> /Users/johannes/Repositories/pyml/lecture/lecture-02/scripts/insert-fixed.py(13)searchsorted()
     11     for i, compared in enumerate(target):
     12         for elem in insert[len(result):]:
2--> 13             if elem <= compared:
     14                 append(i, result)
     15     result += (len(insert) - len(result)) * [len(target)]

ipdb>  disable 2

Disabled breakpoint 2 at /Users/johannes/Repositories/pyml/lecture/lecture-02/scripts/insert-fixed.py:13

ipdb>  c

[15, 17, 19, 22, 25]


### Breakpoint conditions

- a very powerful feature of breakpoints are conditions, which pause execution only when an associated Python expression evaluates to `True`
- try setting a conditional breakpoint in the loop using `b insert-fixed.py:13, i==2`, which will only halt when `i` in that line is equal to 2
- condition expressions are evaluated every time the line is visited, which we can leverage for more advanced debugging
- for instance, we may use a `print` expression, which will never pause execution as its return value `None` evaluates to `False`; however, it will be evaluated every time its breakpoint location is visited
- try changing the condition of the breakpoint (list by `b`) using `condition <bpnumber> print(f'{i}: {elem} vs. {compared}; {result}')`, or by using a new breakpoint `b insert-fixed.py:13, print(f'{i}: {elem} vs. {compared}; {result}')`

In [None]:
%debug insert_sorted([1, 3, 7], [2, 4, 6])
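
- the `print` trick relies only on the truthiness of the condition's value, which can be checked outside the debugger
- the following is a simplified sketch, assuming that a condition is evaluated with the variables of the current frame; the helper `should_pause` and the dictionary `frame_locals` are hypothetical and do not reproduce the internals of pdb

```python
def should_pause(condition, frame_locals):
    """Hypothetical helper mimicking how a breakpoint condition is used:
    evaluate the expression with the frame's variables and pause only
    if the result is truthy."""
    return bool(eval(condition, {}, frame_locals))


# made-up values for the variables visible at the breakpoint
frame_locals = {'i': 2, 'elem': 19, 'compared': 22, 'result': []}

print(should_pause('i == 2', frame_locals))
print(should_pause('i == 0', frame_locals))
# print() returns None, which is falsy: the message is shown,
# but the debugger would not pause
print(should_pause("print(f'{i}: {elem} vs. {compared}; {result}')", frame_locals))
```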

### Getting comfortable with PDB

- here is a short, non-exhaustive overview over some of the commands we used

| command | description |
| - | - |
| `w[here]` | prints the traceback, and indicates with an arrow on which frame we currently are |
| `u[p]` | go up one frame in the traceback |
| `d[own]` | go down one frame in the traceback |
| `c[ontinue]` | resumes execution of the program, e.g., after hitting a breakpoint |
| `n[ext]` | proceeds to the next line |
| `s[tep]` | steps into the next function |
| `l[ist]` | prints the current line and context |
| `p <expression>` | evaluate and print a Python expression in the current context |
| `pp <expression>` | does the same as `p`, but the result is pretty-printed |
| `b[reak] [([filename:]lineno \| function) [, condition]]` | sets a breakpoint, with an optional condition, which when evaluating to `True`, will trigger the breakpoint |
| `disable <bpnumber>` | disables one or more breakpoints (use `clear <bpnumber>` to remove them) |
| `help` | lists the available commands, and details when providing it with a command name |
| `q[uit]` | exits PDB |



### Debugging with `breakpoint()` and Jupyter's graphical debugger

- another approach to run the debugger is calling `breakpoint()` inside the code, which when run in the terminal, will execute pdb
- inside jupyter notebooks, `breakpoint()` will trigger the graphical debugger instead
- jupyter notebook will **ignore** `breakpoint()` unless the **debugger** is enabled (bug-button on the top right) ![debug-button](images/debug-button.png)
- with the jupyter graphical debugger, breakpoints can be alternatively set by clicking left of the line number

In [67]:
def compute(offset, vector):
    result = []
    for elem in vector:
        result.append((elem - offset) ** 2)
    breakpoint()
    return sum(result)

compute(1, [1., 2., 3., 4.])

14.0
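
- under the hood, `breakpoint()` simply calls `sys.breakpointhook()`, which by default starts pdb; replacing the hook changes what `breakpoint()` does
- the following minimal sketch installs a hypothetical hook `recording_hook` that only records the call, so that the code runs through without pausing; it illustrates the mechanism and is not how jupyter handles `breakpoint()`

```python
import sys

calls = []


def recording_hook(*args, **kwargs):
    # hypothetical hook: record the call instead of starting a debugger
    calls.append('breakpoint reached')


def compute(offset, vector):
    result = []
    for elem in vector:
        result.append((elem - offset) ** 2)
    breakpoint()
    return sum(result)


original_hook = sys.breakpointhook
sys.breakpointhook = recording_hook
try:
    total = compute(1, [1., 2., 3., 4.])
finally:
    # restore the previous behaviour of breakpoint()
    sys.breakpointhook = original_hook

print(total, calls)
```

- setting the environment variable `PYTHONBREAKPOINT=0` disables all `breakpoint()` calls in a similar way, without touching the code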

### Jupyter's graphical debugger

- we may load the previous code and manually set breakpoints inside the cell
- notice on the right side, you will be able to see all local and global variables, the call stack and the breakpoints, the same concepts we have seen in pdb

In [68]:
# %load scripts/insert-fixed.py
def append(elem, target):
    ''''Append an element to a list.'''
    target.append(elem)


def searchsorted(insert, target):
    '''Find the indices for elements in insert if they were to be inserted into the sorted list `target`.
    Assume both insert and target are sorted.
    '''
    result = []
    for i, compared in enumerate(target):
        for elem in insert[len(result):]:
            if elem <= compared:
                append(i, result)
    result += (len(insert) - len(result)) * [len(target)]
    return result


def insert_sorted(elems, target):
    '''Insert list `elems` into sorted list `target`.'''
    elems = list(sorted(elems))
    result = list(target)
    indices = searchsorted(elems, target)
    for index, elem in zip(indices[::-1], elems[::-1]):
        result.insert(index, elem)
    return result

- try setting a few breakpoints inside the cell above, and then execute the cell below
- you can use the buttons next to `CALLSTACK` to continue (`c`), stop, step over (`n`), step in (`s`) or step out, similar to pdb

In [70]:
insert_sorted([1, 3, 7], [2, 4, 6])

[1, 2, 3, 4, 6, 7]

## Profiling
---

### Timing your code

- although your code may be running correctly, it can happen that it is too slow

- the process of identifying performance bottlenecks is called **profiling**

- a straightforward way to time some specific code directly from within jupyter notebooks is to use the `%timeit` magic command

### The `%timeit` magic command

- to demonstrate `%timeit`, take these three different implementations of prime number search, which we load by running our local `primes.py` file

In [71]:
%run scripts/primes.py

In [72]:
%timeit find_primes(1000)

1.47 ms ± 17 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [73]:
%timeit find_primes_faster(1000)

1.05 ms ± 12.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [74]:
%timeit sieve_of_eratosthenes(1000)

132 μs ± 804 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


- the `%timeit` magic command will adaptively run the program multiple times, and generate statistics over multiple runs
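
- outside of jupyter, similar measurements can be made with the standard `timeit` module, where the number of runs and loops is set by hand instead of adaptively
- a minimal sketch, in which `find_primes_demo` is a hypothetical stand-in because the content of `primes.py` is not shown here

```python
import timeit


def find_primes_demo(limit):
    """Hypothetical stand-in for the functions in `primes.py`."""
    primes = []
    for num in range(2, limit + 1):
        if all(num % i != 0 for i in range(2, int(num ** 0.5) + 1)):
            primes.append(num)
    return primes


# 7 runs with 10 loops each; each entry is the total time of one run
run_times = timeit.repeat(
    lambda: find_primes_demo(1000), repeat=7, number=10
)
per_loop = [run_time / 10 for run_time in run_times]
mean = sum(per_loop) / len(per_loop)
print(f'{mean * 1e6:.1f} µs per loop (mean of 7 runs, 10 loops each)')
```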

### Analysing your code

- the most basic way to identify bottlenecks is by static code analysis

- let us try to understand why the first implementation of finding all primes up to a limit is so slow
- we can use the `inspect` module to specifically get the source code of some function

In [75]:
import inspect
print(inspect.getsource(find_primes))

def find_primes(limit):
    primes = []
    for num in range(2, limit + 1):
        for i in range(2, int(num ** 0.5) + 1):
            if divides(num, i):
                break
        else:
            primes.append(num)
    return primes



- do you see the nested loop?
- the runtime in this case is $O(n\sqrt{n})$, where $n$ is the limit: we loop over each number up to $n$ and, for each number, test at most $\sqrt{n}$ candidate divisors
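- to make the bound tangible, the sketch below repeats the loop structure of `find_primes` but only counts the divisibility checks
- `count_divisibility_checks` is a hypothetical helper, and `num % i == 0` is assumed to be equivalent to `divides(num, i)`, whose source is not shown

```python
def count_divisibility_checks(limit):
    # same nested loops as `find_primes`, but we only count the checks
    checks = 0
    for num in range(2, limit + 1):
        for i in range(2, int(num ** 0.5) + 1):
            checks += 1
            if num % i == 0:  # assumed equivalent of `divides(num, i)`
                break
    return checks


for limit in (1000, 4000, 16000):
    checks = count_divisibility_checks(limit)
    bound = limit * limit ** 0.5
    print(f"limit={limit:>6}  checks={checks:>7}  n*sqrt(n)={bound:>10.0f}")
```

- the counts stay well below $n\sqrt{n}$ because composite numbers leave the inner loop at their smallest divisor; the bound is a worst case, not an exact count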

### Profiling

- `%timeit` and simple static code analysis are quite limited and may not easily lead us to bottlenecks

- Python comes with its own **profiling** module, which provides us with a detailed runtime analysis of any program

- the profiling module is called `cProfile` (a C implementation of the `profile` module)
- we can call the `run` function on a Python statement given as a string to create profiling statistics for this statement
- the second argument is the file in which we store the statistics

In [76]:
import cProfile

cProfile.run('find_primes(1000)', 'find_primes.profile') 
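
- if writing a statistics file is not desired, a `cProfile.Profile` object can collect the data in memory instead
- the following is a minimal sketch; `slow_sum` is a hypothetical workload standing in for `find_primes`, and the report is written to a string buffer instead of the cell output

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # hypothetical workload, standing in for `find_primes`
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()        # start collecting statistics
result = slow_sum(10_000)
profiler.disable()       # stop collecting statistics

# build the report directly from the profiler, without a file on disk
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.strip_dirs().sort_stats(pstats.SortKey.CUMULATIVE).print_stats(5)
print(buffer.getvalue())
```

- compared to `cProfile.run`, this form profiles ordinary code between `enable()` and `disable()` rather than a statement passed as a string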

### Profiling Statistics

- the `pstats` (profiling-stats) module provides us with functions to analyse the profiling results

- When printing the output of the statistics, we see several columns:
    - `ncalls`: The number of calls made to the function/method.
    - `tottime`: The total time spent in the function/method excluding time spent in calls to sub-functions.
    - `percall` (first occurrence): The average time spent in the function/method itself per call (tottime / ncalls).
    - `cumtime`: The cumulative time spent in the function/method, including time spent in calls to sub-functions.
    - `percall` (second occurrence): The average cumulative time spent in the function/method per call (cumtime / ncalls).
    - `filename:lineno(function)`: The filename, line number, and function name.
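
- the difference between `tottime` and `cumtime` can be illustrated with a small sketch
- `inner` and `outer` are hypothetical functions: `outer` does almost no work itself, so its `tottime` is small, while its `cumtime` includes all time spent in `inner`
- the sketch assumes Python 3.9 or newer, where `Stats.get_stats_profile()` is available

```python
import cProfile
import pstats


def inner(n):
    # does the actual work
    total = 0
    for i in range(n):
        total += i * i
    return total


def outer(repeats, n):
    # does almost nothing itself, but calls `inner` repeatedly
    results = []
    for _ in range(repeats):
        results.append(inner(n))
    return results


profiler = cProfile.Profile()
profiler.enable()
outer(5, 50_000)
profiler.disable()

# get_stats_profile() exposes the table columns as attributes
profile = pstats.Stats(profiler).get_stats_profile()
for name in ("outer", "inner"):
    entry = profile.func_profiles[name]
    print(f"{name}: ncalls={entry.ncalls} "
          f"tottime={entry.tottime} cumtime={entry.cumtime}")
```

- sorting by `tottime` points to functions that are slow by themselves, whereas sorting by `cumtime` points to the call paths on which most time is spent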


### Investigating Profiling Statistics

- we can create a `Stats` instance by supplying a file name to the constructor
- calling the `sort_stats` method allows us to sort by specific statistic columns
- the `print_stats` method will then print a formatted table

In [77]:
import pstats
stats = pstats.Stats('find_primes.profile')

stats.strip_dirs().sort_stats('ncalls').print_stats(10);

Thu Apr 24 13:54:15 2025    find_primes.profile

         5460 function calls in 0.008 seconds

   Ordered by: call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5288    0.002    0.000    0.002    0.000 primes.py:3(divides)
      168    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.008    0.008 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.006    0.006    0.008    0.008 primes.py:7(find_primes)
        1    0.000    0.000    0.008    0.008 <string>:1(<module>)




- `divides` checks whether one number is divisible by another and is called 5288 times; can we reduce the number of these calls?

### Faster way of finding primes

- the `find_primes_faster` function is slightly more optimized

In [78]:
print(inspect.getsource(is_prime))
print(inspect.getsource(find_primes_faster))

def is_prime(num):
    if num <= 1:
        return False
    if num <= 3:
        return True
    if divides(num, 2) or divides(num, 3):
        return False
    i = 5
    while i * i <= num:
        if divides(num, i) or divides(num, i + 2):
            return False
        i += 6
    return True

def find_primes_faster(limit):
    primes = []
    for num in range(2, limit + 1):
        if is_prime(num):
            primes.append(num)
    return primes



### Profiling faster way of finding primes

- we profile the function the same way

In [79]:
cProfile.run("find_primes_faster(1000)", 'find_primes_faster.profile')

stats = pstats.Stats('find_primes_faster.profile')

stats.strip_dirs().sort_stats('ncalls').print_stats(10);

Thu Apr 24 13:55:54 2025    find_primes_faster.profile

         4088 function calls in 0.005 seconds

   Ordered by: call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2917    0.001    0.000    0.001    0.000 primes.py:3(divides)
      999    0.003    0.000    0.004    0.000 primes.py:18(is_prime)
      168    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.005    0.005 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.001    0.001    0.005    0.005 primes.py:33(find_primes_faster)
        1    0.000    0.000    0.005    0.005 <string>:1(<module>)




- we can see that the number of calls to `divides` has roughly halved (from 5288 to 2917)
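
- the timings earlier also include `sieve_of_eratosthenes`, whose source is not shown here
- the following is a minimal sketch of the classic sieve, not necessarily what `primes.py` contains; `sieve_sketch` is a hypothetical name
- instead of testing each number for divisors, it crosses out the multiples of every prime, so no divisibility checks are needed at all

```python
def sieve_sketch(limit):
    """Return all primes up to and including `limit` (hypothetical sketch)."""
    if limit < 2:
        return []
    # is_candidate[n] stays True as long as n has not been crossed out
    is_candidate = [True] * (limit + 1)
    is_candidate[0] = is_candidate[1] = False
    num = 2
    while num * num <= limit:
        if is_candidate[num]:
            # smaller multiples were already crossed out by smaller primes
            for multiple in range(num * num, limit + 1, num):
                is_candidate[multiple] = False
        num += 1
    return [n for n, flag in enumerate(is_candidate) if flag]


print(sieve_sketch(30))  # → [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```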

### $$\textbf{Thank you for your attention.}$$