In [None]:
%%html
<style>
h1,h2,h3 {
    text-align: center;
}

.term {
    text-align: center;
    margin-top: 1em;
    margin-bottom: 1em;
}

.organizers {
    text-align: center;
    margin-left: 20%;
    margin-right: 20%;
    margin-bottom: 1em;
}

.presenter {
    text-decoration: underline;
}
</style>



# Python Programming for Machine Learning

## Winter Term 2025/26

## Lecture 5: Python Fundamentals III

### Presenter: Max Eissler

<center><img src='images/python-logo-only.svg' width=250> </center>

## What we'll cover today

This lecture covers advanced Python programming concepts essential for ML development:

- **Variable Scopes**

- **References**

- **Functional Programming**

- **Typing and Dataclasses**

- **Exception Handling**


## Scopes
---

### Recap: Code blocks through indentation

- code blocks are specified using **indentation**

In [None]:
def chaos(obj, target=None):
    if target is None:
        target = []
    if isinstance(obj, (int, float)):
        target.append(obj)    
    return target


print(chaos(1))



In [None]:
def example(obj=[]):
    obj.append(1)
    return obj


print(example())
print(example())

def example2(obj=None):
    if obj is None:
        obj = []
    obj.append(1)
    return obj


print(example2())
print(example2())



[1]
[1, 1]
[1]
[1]
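
The two pairs of outputs differ because a default value is evaluated only once, when the `def` statement runs, and is then shared between calls. The following is a minimal sketch that makes the shared object visible through the function's `__defaults__` attribute; the names `shared_default` and `fresh_default` are made up for this illustration.

```python
def shared_default(obj=[]):
    # The list is created once, when the function is defined
    obj.append(1)
    return obj


def fresh_default(obj=None):
    # None acts as a sentinel; a new list is created on every call
    if obj is None:
        obj = []
    obj.append(1)
    return obj


print(shared_default.__defaults__)  # ([],)
shared_default()
shared_default()
print(shared_default.__defaults__)  # ([1, 1],)

fresh_default()
print(fresh_default.__defaults__)   # (None,)
```

Using `None` as the default and creating the list inside the function body is the usual way to avoid the shared state.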


In [73]:
0 == None

False

### The global scope

- variables are only accessible in the *code block* in which they were created

- these *code blocks* are the **scope** of the variables

- variables defined in the top level of the file are in the **global** scope (i.e., everything that is not indented)

In [2]:
weather = 'rainy'
print(weather)

rainy


- we can inspect the global scope through the `globals` built-in, which returns a `dict` containing all global variables

- note that Jupyter notebooks introduce a lot of global variables

In [3]:
globals()

{'__name__': '__main__',
 '__doc__': 'Automatically created module for IPython interactive environment',
 '__package__': None,
 '__loader__': None,
 '__spec__': None,
 '__builtin__': <module 'builtins' (built-in)>,
 '__builtins__': <module 'builtins' (built-in)>,
 '_ih': ['', 'globals()', "weather = 'rainy'\nprint(weather)", 'globals()'],
 '_oh': {1: {...}},
 '_dh': [PosixPath('/Users/maxi/WORKSPACE/PHD/PYML/lecture/lecture-05')],
 'In': ['', 'globals()', "weather = 'rainy'\nprint(weather)", 'globals()'],
 'Out': {1: {...}},
 'get_ipython': <bound method InteractiveShell.get_ipython of <ipykernel.zmqshell.ZMQInteractiveShell object at 0x1062806b0>>,
 'exit': <IPython.core.autocall.ZMQExitAutocall at 0x106201af0>,
 'quit': <IPython.core.autocall.ZMQExitAutocall at 0x106201af0>,
 'open': <function _io.open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)>,
 '_': {...},
 '__': '',
 '___': '',
 '__vsc_ipynb_file__': '/Users/maxi/WORKSPACE/P

### Referencing global variables

- global variables may be accessed from lower scopes

In [4]:
weather = 'sunny'

def report():
    print(weather)


report()

sunny


- however, (re-)defining variables inside functions (classes, ...) will only define them in the **local** scope

In [5]:
def local_weather():
    weather = 'cloudy'
    print(f'Local weather: {weather}')


local_weather()
print(f'Global weather: {weather}')

Local weather: cloudy
Global weather: sunny


### The local scope

- local scopes are local to the enclosing function (or class) body, so nested functions get a local scope of their own

In [6]:
def local_weather():
    weather = 'cloudy'

    def local2_weather():
        weather = 'rainy'
        print(f'Double local weather: {weather}')

    local2_weather()
    print(f'Local weather: {weather}')


local_weather()
print(f'Global weather: {weather}')

Double local weather: rainy
Local weather: cloudy
Global weather: sunny


- similar to the global scope, local variables may be inspected using the `locals` built-in

In [7]:
def func():
    weather = 'funny'
    print(locals())


func()

{'weather': 'funny'}


In [8]:
# Is everyone awake? What will this print?
x = 10
def func():
    x = 20
    print(x)
func()
print(x)

20
10


### What counts as a (new) local scope?

- of the block statements, only `def` and `class` introduce new local scopes (lambdas and comprehensions also get a scope of their own)

In [9]:
def check_local():
    check = 1

    def my_func():
        if 'check' not in locals():
            print('def is local')
    my_func()

    class MyClass():
        if 'check' not in locals():
            print('class is local')
    if True:
        if 'check' not in locals():
            print('if is local')
    for _ in [1]:
        if 'check' not in locals():
            print('for is local')
    while True:
        if 'check' not in locals():
            print('while is local')
        break


check_local()

def is local
class is local


### For-loops leak their variables

- a particularity of Python is that since `for` does not create a new local scope, its variables remain accessible after the loop has ended

In [10]:
def check_for_leak():
    i = 40
    for i in range(25):
        pass
    print(i)


check_for_leak()

24
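
For contrast, the loop variable of a list comprehension does not leak. The following sketch (the function name `check_comprehension_leak` is made up for this example) runs a comprehension and a `for` loop with the same variable name inside one function.

```python
def check_comprehension_leak():
    i = 40
    squares = [i * i for i in range(3)]  # this i is private to the comprehension
    after_comprehension = i              # still 40
    for i in range(3):                   # this i is the function's own variable
        pass
    return squares, after_comprehension, i


print(check_comprehension_leak())  # ([0, 1, 4], 40, 2)
```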


### Inplace operations on global variables

- attempting to execute inplace operations (such as `+=`) on variables that belong to an outer scope is not directly possible, because the assignment makes the name local to the function

In [11]:
from traceback import print_exception

temperature = 21


def heat_up():
    temperature += 1


try:
    heat_up()
except UnboundLocalError as error:
    print_exception(error)

Traceback (most recent call last):
  File "/var/folders/xm/1kpql0vs6hlgvb9rrpnmsjr00000gn/T/ipykernel_88438/538375756.py", line 11, in <module>
    heat_up()
  File "/var/folders/xm/1kpql0vs6hlgvb9rrpnmsjr00000gn/T/ipykernel_88438/538375756.py", line 7, in heat_up
    temperature += 1
    ^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'temperature' where it is not associated with a value
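
The error comes from how Python classifies names: a name that is assigned anywhere in a function body is treated as local for the whole body. The sketch below illustrates this with two hypothetical functions, `peek` and `peek_then_assign`, and inspects the compiled code object to show which names are local.

```python
temperature = 21


def peek():
    return temperature + 1  # only reads the name: the global is found


def peek_then_assign():
    value = temperature     # raises: the assignment below makes the name local
    temperature = value + 1
    return temperature


print(peek())  # 22
try:
    peek_then_assign()
except UnboundLocalError as error:
    print(f'Error: {error}')

# The compiler has registered 'temperature' as a local variable
print('temperature' in peek_then_assign.__code__.co_varnames)  # True
```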


### Inplace operations on global variables (cont'd)

- to make inplace operations possible, we can explicitly indicate that we refer to a **global** variable

In [12]:
from traceback import print_exception

temperature = 21


def heat_up():
    global temperature
    temperature += 1
    print(locals()) # will remain empty


heat_up()
print(f'{temperature}°C?! Is it getting hot in here?')

{}
22°C?! Is it getting hot in here?


### Local re-assignment of global variables

- once we have indicated that we refer to a **global** variable, we can of course reassign the value

In [13]:
temperature = 21


def heat_up():
    global temperature
    temperature = 30


heat_up()
print(f'{temperature}°C?! Is it getting hot in here?')

30°C?! Is it getting hot in here?


### Local operations on non-local variables

- variables local to a higher block are subject to the same limitations

In [14]:
def simulate_temperature():
    temperature = 21

    def local_heat_up():
        temperature += 1

    try:
        local_heat_up()
    except UnboundLocalError as error:
        print_exception(error)


simulate_temperature()

Traceback (most recent call last):
  File "/var/folders/xm/1kpql0vs6hlgvb9rrpnmsjr00000gn/T/ipykernel_88438/549604822.py", line 8, in simulate_temperature
    local_heat_up()
  File "/var/folders/xm/1kpql0vs6hlgvb9rrpnmsjr00000gn/T/ipykernel_88438/549604822.py", line 5, in local_heat_up
    temperature += 1
    ^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'temperature' where it is not associated with a value


### Local operations on non-local variables (cont'd)

- we can indicate that we refer to a variable that is not local to the current block by using `nonlocal`

In [16]:
def simulate_temperature():
    temperature = 21

    def local_heat_up():
        nonlocal temperature
        temperature += 1

    local_heat_up()
    print(f'{temperature}°C?! Is it getting hot in here?')


simulate_temperature()
print(f'Outside is still {temperature}°C!')

22°C?! Is it getting hot in here?
Outside is still 30°C!


## References
---

- Recap: Python objects are passed by reference, and have an id, a type, and a value

In [17]:
string = 'hey there'
print(id(string))
print(type(string))

4573986608
<class 'str'>


In [18]:
a = [1]
b = [1]
print(a == b)
print(a is b)

True
False


### Garbage collection in Python

- Python does not require explicit deletion of objects
- objects simply cease to exist (are garbage collected) once they are no longer referenced by anything
- for instance, reassigning a variable removes the reference to its previous value

### Dunder method for deletion

- the `__del__` method of an object is called when the object is garbage collected

In [19]:
class DeleteMe:
    '''Says 'Goodbye!' when deleted.'''

    def __init__(self, name):
        self.name = name
        print(f'{self}: Hello!')

    def __repr__(self):
        return f'DeleteMe({self.name})'

    def __del__(self):
        print(f'{self}: Goodbye!')

- we will use this class to demonstrate the timing of the garbage collection

In [20]:
obj = DeleteMe('Mia')

DeleteMe(Mia): Hello!


In [21]:
obj = None

DeleteMe(Mia): Goodbye!


- references are also removed when the scope of variables ends

In [22]:
def party():
    host = DeleteMe('Ben')
    print(f'{host} in the house!')


print('Let\'s party!')
party()
print('Party over!')

Let's party!
DeleteMe(Ben): Hello!
DeleteMe(Ben) in the house!
DeleteMe(Ben): Goodbye!
Party over!


### Deleting Objects

- if we do not need a variable anymore, we can explicitly unset it using `del`

In [23]:
obj = DeleteMe('Ann')

DeleteMe(Ann): Hello!


In [24]:
del obj
try:  # will cause a NameError, we will talk about error handling later
    print(obj)
except NameError as error:
    print(f'Error: {error}')

DeleteMe(Ann): Goodbye!
Error: name 'obj' is not defined


Is this list still in memory?

In [26]:
a = [1, 2, 3]
b = a
del a
print(b)

[1, 2, 3]


deleting a variable will **not** cause deletion of the referenced object if it is still referenced elsewhere

In [27]:
obj = DeleteMe('Joe')
obj2 = obj
print('Deleting obj...')
del obj
print('Deleting obj2...')
del obj2

DeleteMe(Joe): Hello!
Deleting obj...
Deleting obj2...
DeleteMe(Joe): Goodbye!


### How does Python know when an object is not referenced anymore?

- Python keeps track of the number of references by using a **reference counter**

In [28]:
from sys import getrefcount  # do not rely on this function, it is for educational purposes only

In [29]:
def func():
    obj = DeleteMe('Amy')
    print('After creation:', getrefcount(obj))
    obj2 = obj
    print('New reference:', getrefcount(obj))
    del obj2
    print('Remove new reference:', getrefcount(obj))


func()

DeleteMe(Amy): Hello!
After creation: 2
New reference: 3
Remove new reference: 2
DeleteMe(Amy): Goodbye!


**Note**
- garbage collection is not always instantaneous
- there are some objects that are immortal, like `None` or small integers, which have very high reference counters

In [31]:
print(getrefcount(None))
del None
print(getrefcount(None))

SyntaxError: cannot delete None (4270500750.py, line 2)

### What counts as a reference?

- variables
- attributes
- membership in any container type 

In [32]:
class A: pass


def func():
    obj = DeleteMe('Leo')
    print(getrefcount(obj))
    var_ref = obj
    print(getrefcount(obj))
    A.obj = obj
    print(getrefcount(obj))
    con = (1, 2, obj)
    print(getrefcount(obj))
    del A.obj


func()

DeleteMe(Leo): Hello!
2
3
4
5
DeleteMe(Leo): Goodbye!


### Fail-case: objects referencing one another

- when objects reference each other and we lose access to them, reference counting alone will never garbage collect them

In [33]:
objid = None


def party():
    global objid
    obj = DeleteMe('Lea')
    obj.bff = DeleteMe('Max')
    obj.bff.bff = obj
    objid = id(obj)
    print(f'{obj} and {obj.bff} in the house!')


print('Let\'s party!')
party()
print('Party over!')


Let's party!
DeleteMe(Lea): Hello!
DeleteMe(Max): Hello!
DeleteMe(Lea) and DeleteMe(Max) in the house!
Party over!


- we would expect to see `Goodbye!` from both objects
- but the objects reference each other (through `.bff`), so their reference counter will never be 0
- since both objects are now inaccessible, we also cannot easily recover from this
- in simple cases, Python can automatically detect reference cycles and remove them (you may see the above objects be deleted at some point)
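
The automatic detection is done by the cycle detector in the standard `gc` module, which can also be triggered by hand. The following is a minimal sketch, assuming CPython; the class `Friend` is a hypothetical stand-in for `DeleteMe` that records deletions in a list instead of printing.

```python
import gc


class Friend:
    """Hypothetical stand-in for DeleteMe that records deletions in a list."""
    deleted = []

    def __init__(self, name):
        self.name = name
        self.bff = None

    def __del__(self):
        Friend.deleted.append(self.name)


def make_cycle():
    lea = Friend('Lea')
    max_ = Friend('Max')
    lea.bff = max_
    max_.bff = lea  # reference cycle: neither counter can drop to 0


make_cycle()
gc.collect()  # explicitly run the cycle detector
print(sorted(Friend.deleted))  # ['Lea', 'Max']
```

The order in which the two objects are finalized is not guaranteed, which is why the names are sorted before printing.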

### Manually recovering from reference cycles **(do not try at home!)**

- we can clean up this mess by recovering the object through its memory address 
- **we do not recommend executing the following code, as it can crash your Python if `objid` does not point to a valid memory address**

In [34]:
# import ctypes
# ctypes.cast(objid, ctypes.py_object).value.bff = None
# del objid

### Avoiding reference cycles

- to avoid this issue in the first place, **weak references** may be used

In [35]:
import weakref


def party():
    global objid
    obj = DeleteMe('Lea')
    obj2 = DeleteMe('Max')
    obj.bff = weakref.ref(obj2)
    obj2.bff = weakref.ref(obj)
    objid = id(obj)
    print(f'{obj.bff()} and {obj2.bff()} in the house!')


print('Let\'s party!')
party()
print('Party over!')

Let's party!
DeleteMe(Lea): Hello!
DeleteMe(Max): Hello!
DeleteMe(Max) and DeleteMe(Lea) in the house!
DeleteMe(Lea): Goodbye!
DeleteMe(Max): Goodbye!
Party over!


### Weak references and reference counters

- weak references do not count towards the reference counter of objects

In [36]:
def func():
    obj = DeleteMe('Eva')
    print(getrefcount(obj))
    obj2 = weakref.ref(obj)
    print(getrefcount(obj))
    obj3 = obj2()
    print(getrefcount(obj))
func()

DeleteMe(Eva): Hello!
2
2
3
DeleteMe(Eva): Goodbye!


- **Note**: resolving weakrefs will increase the reference counter
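
A weak reference also does not keep its target alive: once the last strong reference is gone, calling the weakref returns `None`. The sketch below assumes CPython, where the object is freed as soon as its reference counter reaches 0; the class `Guest` is made up for this example.

```python
import weakref


class Guest:
    """Hypothetical stand-in for DeleteMe, used only in this sketch."""

    def __init__(self, name):
        self.name = name


guest = Guest('Eva')
ref = weakref.ref(guest)
print(ref() is guest)  # True: the referent is still alive

del guest              # drop the only strong reference
print(ref() is None)   # True on CPython, where the object is freed immediately
```

Code that resolves a weak reference should therefore always check the result for `None` before using it.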



**Question: Can you think of programming patterns where you might need weak references?**

## Functional Programming in Python
---

### Example: Writing a data preprocessing pipeline

Assume we have some text data and want to predict sentiment (i.e. positive vs. negative).
We need to:
- Load the data from a file
- Preprocess the data
- Form training batches 

In [37]:
import csv


def preview_csv(filepath, rows=5):
    with open(filepath) as f:
        reader = csv.reader(f)
        headers = next(reader)
        print("\nHeaders:", headers)
        print("\nFirst", rows, "rows:")
        for i, row in enumerate(reader):
            if i >= rows:
                break
            print(row)


preview_csv("data/imdb_snippet.csv")


Headers: ['review', 'sentiment']

First 5 rows:
["One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I woul

#### Procedural code

In [38]:
import random
import re

with open("data/imdb_snippet.csv") as f:
    csv_reader = csv.reader(f)
    next(csv_reader) # skip headers
    data = [row for row in csv_reader]

texts, labels = zip(*data)

print(f"Preprocessing {len(data)} lines of data\n")
print(f"Raw: {texts[0][:70]}..., label: {labels[0]}\n")

# 1. Clean texts (lowercase)
cleaned_texts = []
for text in texts:
    text = text.lower()
    cleaned_texts.append(text)

print(f"Cleaned: {cleaned_texts[0][:70]}...\n")

# 2. Keep only letters and spaces, collapse extra spaces
regexed_texts = []
for text in cleaned_texts:
    text = re.sub(r"[^a-z ]", "", text)
    text = " ".join(text.split())
    regexed_texts.append(text)

print(f"Regex: {regexed_texts[0][:70]}...\n")


# 3. Split into words
word_lists = []
for text in regexed_texts:
    words = text.split()
    word_lists.append(words)

print(f"Words: {word_lists[0][:10]}...\n")

# 4. Randomly mask words
random_mask_percentage = 0.2
masked_texts = []
for words in word_lists:
    masked_words = []
    for word in words:
        if random.random() < random_mask_percentage:
            masked_words.append("[MASK]")
        else:
            masked_words.append(word)
    masked_texts.append(masked_words)

print(f"Masked: {masked_texts[0][:10]}...\n")

# 5. Create batches
batch_size = 2
batches = []
for i in range(0, len(masked_texts), batch_size):
    batch = {
        "text": masked_texts[i:i+batch_size],
        "label": labels[i:i+batch_size]
    }
    batches.append(batch)

print(f"Batch 1: \n text: {batches[0]['text']}\n label: {batches[0]['label']}\n")

DeleteMe(Lea): Goodbye!
DeleteMe(Max): Goodbye!
Preprocessing 138 lines of data

Raw: One of the other reviewers has mentioned that after watching just 1 Oz..., label: positive

Cleaned: one of the other reviewers has mentioned that after watching just 1 oz...

Regex: one of the other reviewers has mentioned that after watching just oz e...

Words: ['one', 'of', 'the', 'other', 'reviewers', 'has', 'mentioned', 'that', 'after', 'watching']...

Masked: ['[MASK]', 'of', '[MASK]', 'other', 'reviewers', 'has', 'mentioned', '[MASK]', '[MASK]', 'watching']...

Batch 1: 
 text: [['[MASK]', 'of', '[MASK]', 'other', 'reviewers', 'has', 'mentioned', '[MASK]', '[MASK]', 'watching', '[MASK]', 'oz', 'episode', '[MASK]', 'be', '[MASK]', 'they', '[MASK]', 'right', 'as', 'this', 'is', 'exactly', 'what', '[MASK]', '[MASK]', 'mebr', 'br', 'the', 'first', 'thing', 'that', 'struck', 'me', 'about', 'oz', 'was', 'its', 'brutality', '[MASK]', '[MASK]', '[MASK]', 'of', 'violence', 'which', 'set', 'in', 'right'

### Problems:
- Everything in one scope
- Repetitive
- Variables are hard-coded

We will come back to this code.

### Functional Programming

- is a programming paradigm
- you will hear about others (e.g. OOP)

#### Core principle: Separate Data from Logic

<center><img src='images/mermaid-program-struct.png'> </center>

- Use pure functions (same input -> same output)
- Compose simple functions that do simple things

Key Benefits:
- Easier to test (pure functions are predictable)
- Easier to parallelize (no shared state)
- More maintainable (functions do one thing)
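
To make the notion of a pure function concrete, here is a small sketch contrasting an impure and a pure variant of the same idea. The names `record_impure`, `record_pure` and `history` are invented for this illustration.

```python
history = []  # shared, mutable state


def record_impure(value):
    history.append(value)      # side effect: modifies a global list
    return len(history)        # result depends on earlier calls


def record_pure(values, value):
    return values + [value]    # builds a new list, leaves the input untouched


print(record_impure(1), record_impure(1))  # 1 2

start = []
print(record_pure(start, 1), record_pure(start, 1))  # [1] [1]
print(start)  # []
```

The impure version returns different results for the same argument, while the pure version can be tested with nothing but its inputs.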

### Recap: Functions as First-Class Objects

- in Python, functions are objects that can be:
  - passed as arguments
  - returned from other functions
  - assigned to variables
  - stored in data structures


In [None]:
def apply_twice(func, value):
    """Apply a function twice to a value."""
    return func(func(value))


def add_one(x):
    return x + 1


result = apply_twice(add_one, 10)
print(f"After applying add_one twice to 10: {result}")
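
The other three properties can be sketched in the same way. The dictionary `steps` and the helper `pick` below are hypothetical names chosen for this example.

```python
def add_one(x):
    return x + 1


def double(x):
    return x * 2


# Stored in a data structure
steps = {'increment': add_one, 'double': double}
for name, step in steps.items():
    print(f'{name}(10) = {step(10)}')


def pick(name):
    return steps[name]  # returned from another function


alias = pick('double')  # assigned to a variable
print(alias(21))  # 42
```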

### Closures

- when defining a function in another function, the inner function can access the outer function's variables

In [39]:
def make_greeter(greeting):
    """Creates a function that greets with a specific greeting"""
    def greet(name):
        return f"{greeting}, {name}!"
    return greet


# Create different greeting functions
say_hello = make_greeter("Hello")
say_aloha = make_greeter("Aloha")

print(say_hello("Alice"))  # "Hello, Alice!"
print(say_aloha("Bob"))    # "Aloha, Bob!"

Hello, Alice!
Aloha, Bob!


- we can even modify the outer function's variables:

In [41]:
def create_counter():
    count = 0  # This variable is "enclosed" in the closure
    
    def increment():
        nonlocal count  # Need nonlocal to modify enclosed variable
        count += 1
        return count
        
    return increment  # Return the inner function


counter = create_counter()
print(counter())  # 1
print(counter())  # 2
print(counter())  # 3

counter2 = create_counter()

# ATTENTION:
# While this is sometimes useful, it is also considered bad practice when following the functional programming paradigm.

1
2
3


##### Closures can do many things that custom classes are often used for, while avoiding complexity!

In [42]:
## OOP implementation
class Masker:
    def __init__(self, mask_token, mask_prob):
        self.mask_token = mask_token
        self.mask_prob = mask_prob

    def __call__(self, words):
        return [self.mask_token if random.random() < self.mask_prob else word for word in words]
    

masker = Masker(mask_token="[MASK]", mask_prob=0.2)
print(masker(["Hello", "world", "!"]))

# Equivalent functional implementation
def get_masker(mask_token, mask_prob):
    def masker(words):
        return [mask_token if random.random() < mask_prob else word for word in words]
    return masker


masker = get_masker(mask_token="[MASK]", mask_prob=0.2)
print(masker(["Hello", "world", "!"]))


['Hello', '[MASK]', '[MASK]']
['Hello', 'world', '!']


### Decorators

- decorators are functions that modify other functions (or classes)
- they can be applied using the special `@` syntax
- commonly used for:
  - logging
  - timing
  - input validation
  - access control

In [44]:
def log_calls(func):
    def wrapper(*args, **kwargs):
        print(f"Calling {func.__name__} with args={args}, kwargs={kwargs}")
        result = func(*args, **kwargs)
        print(f"{func.__name__} returned {result}")
        return result
    return wrapper

def multiply(a, b, c=5):
    return a * b

multiply_decorated = log_calls(multiply)

result = multiply_decorated(3, 4, c=5)

Calling multiply with args=(3, 4), kwargs={'c': 5}
multiply returned 12


In [45]:
# Equivalent to:
@log_calls
def multiply(a, b):
    return a * b

result = multiply(3, 4)

Calling multiply with args=(3, 4), kwargs={}
multiply returned 12
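
One side effect of decorating is that the decorated name now points to `wrapper`, so metadata such as `__name__` and the docstring belong to the wrapper. The standard library offers `functools.wraps` to copy them over. The sketch below compares both variants; `log_calls_plain` and `log_calls_wrapped` are illustrative names, and the logging itself is left out to keep the example short.

```python
from functools import wraps


def log_calls_plain(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def log_calls_wrapped(func):
    @wraps(func)  # copy __name__, __doc__, ... from func onto wrapper
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@log_calls_plain
def multiply(a, b):
    """Multiply two numbers."""
    return a * b


@log_calls_wrapped
def divide(a, b):
    """Divide a by b."""
    return a / b


print(multiply.__name__, multiply.__doc__)  # wrapper None
print(divide.__name__, divide.__doc__)      # divide Divide a by b.
```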


#### Expansion on question from the lecture about parametrizing decorators

In [None]:
# During the lecture there was a question on passing additional arguments to the decorator syntax
# In case we have a decorator function that takes additional arguments, like this (arguably contrived) example:
def log_calls(func, log_ntimes=1):
    def wrapper(*args, **kwargs):
        for _ in range(log_ntimes):
            print(f"Calling {func.__name__} with args={args}, kwargs={kwargs}")
        result = func(*args, **kwargs)
        for _ in range(log_ntimes):
            print(f"{func.__name__} returned {result}")
        return result
    return wrapper

# If we try to use the decorator syntax directly, we get an error:
@log_calls(log_ntimes=2)
def multiply(a, b):
    return a * b

result = multiply(3, 4)

In [None]:
# If we want to be able to pass custom arguments to the decorator syntax, we need to change our decorator function definition a little:
def log_calls(log_ntimes=1):
    def decorator(func):
        def wrapper(*args, **kwargs):
            for _ in range(log_ntimes):
                print(f"Calling {func.__name__} with args={args}, kwargs={kwargs}")
            result = func(*args, **kwargs)
            for _ in range(log_ntimes):
                print(f"{func.__name__} returned {result}")
            return result
        return wrapper
    return decorator

@log_calls(log_ntimes=2)
def multiply(a, b):
    return a * b

result = multiply(3, 4)

In [None]:
# The reason we need to do this is quite subtle: 
# With the bare name below, the @ syntax calls log_calls with the function defined underneath it and rebinds that function's name to the result.
def log_calls(func):
    def wrapper(*args, **kwargs):
        print(f"Calling {func.__name__} with args={args}, kwargs={kwargs}")
        result = func(*args, **kwargs)
        print(f"{func.__name__} returned {result}")
        return result
    return wrapper

@log_calls # <-- this points to a function that takes a function as an argument
def multiply(a, b):
    return a * b

result = multiply(3, 4)

In [None]:
# But if we write a bracketed version, we already evaluate the function
@log_calls() # <-- this already tries to evaluate the decorator function (which expects a function as an argument)
def multiply(a, b):
    return a * b

result = multiply(3, 4)


In [None]:
# So we need an additional layer of wrapping:
def log_calls():
    def decorator(func): # <-- this is the additional layer of wrapping
        def wrapper(*args, **kwargs):
            print(f"Calling {func.__name__} with args={args}, kwargs={kwargs}")
            result = func(*args, **kwargs)
            print(f"{func.__name__} returned {result}")
            return result
        return wrapper
    return decorator

@log_calls() # <-- this will now work: When we call this function, it evaluates to return the decorator function (which means we now have a function that takes a function as an argument to the @ syntax, just as before)
def multiply(a, b):
    return a * b

result = multiply(3, 4)

In [None]:
# Now we can add more custom arguments to the decorator:
def log_calls(log_ntimes=1):
    def decorator(func):
        def wrapper(*args, **kwargs):
            for _ in range(log_ntimes):
                print(f"Calling {func.__name__} with args={args}, kwargs={kwargs}")
            result = func(*args, **kwargs)
            for _ in range(log_ntimes):
                print(f"{func.__name__} returned {result}")
            return result
        return wrapper
    return decorator

@log_calls(log_ntimes=2)
def multiply(a, b):
    return a * b

result = multiply(3, 4)

#### END Expansion
Make sure you understand the code in this explanation

### The partial function

- `partial` creates a new function with some arguments fixed
- equivalent to writing a closure, but much shorter
- equivalent to writing a lambda function but arguably more readable
- import from `functools` module

More built-in higher order functions: https://docs.python.org/3/library/functools.html

In [46]:
from functools import partial

def power(base, exponent):
    return base ** exponent


# Create specialized functions
square = partial(power, exponent=2)
cube = partial(power, exponent=3)

# equivalent to
square = lambda x: power(x, 2)
cube = lambda x: power(x, 3)

print(f"Square of 4: {square(4)}")
print(f"Cube of 4: {cube(4)}")

Square of 4: 16
Cube of 4: 64


We can now make our "masker" example from above even simpler:

In [None]:
# This ....
def get_masker(mask_token, mask_prob):
    def masker(words):
        return [mask_token if random.random() < mask_prob else word for word in words]
    return masker
masker = get_masker(mask_token="[MASK]", mask_prob=0.2)

# ... can be replaced by this:
def mask_words(words, mask_token, mask_prob):
    return [mask_token if random.random() < mask_prob else word for word in words]

masker = partial(mask_words, mask_token="[MASK]", mask_prob=0.2)

# Another notable difference: a function created with partial can be pickled (as long as the wrapped function can be), while a closure cannot. This becomes important when using multiprocessing (which we often do when building data pipelines)
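
The pickling difference can be checked directly with the standard `pickle` module. The following is a minimal sketch: it wraps the built-in `int` rather than `mask_words`, and `make_parser` is a hypothetical closure factory. Both exception types are caught because the exact error raised for local functions differs between Python versions.

```python
import pickle
from functools import partial


def make_parser(base):
    # Closure: parse is a local function that pickle cannot look up by name
    def parse(text):
        return int(text, base=base)
    return parse


binary_partial = partial(int, base=2)
binary_closure = make_parser(2)

# The partial object survives a round trip through pickle
restored = pickle.loads(pickle.dumps(binary_partial))
print(restored('1010'))  # 10

try:
    pickle.dumps(binary_closure)
    closure_pickled = True
except (pickle.PicklingError, AttributeError) as error:
    closure_pickled = False
    print(f'{type(error).__name__}: closure could not be pickled')
```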

Let's apply these tools to the data pipeline. We...

- ... extract the processing steps into simple functions 
- ... write a decorator for debugging
- ... compose functions into a single "pipeline" function
- ... use partial to configure the parametrizable processing steps

In [None]:
from functools import partial
import csv
import re
import random
from typing import Callable


def debug(func: Callable) -> Callable:
    """Simple decorator to show intermediate results"""
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        # Show first 50 chars or 5 items of result
        preview = result[:50] if isinstance(result, str) else result[:5]
        print(f"{func.__name__}: {preview}...")
        return result
    return wrapper

#@debug
def clean_text(text: str) -> str:
    """Convert to lowercase and normalize spaces"""
    return " ".join(text.lower().split())


def remove_special_chars(text: str) -> str:
    """Keep only letters and spaces"""
    return re.sub(r"[^a-z ]", "", text)

#@debug
def split_words(text: str) -> list[str]:
    """Split text into words"""
    return text.split()

#@debug
def mask_words(words: list[str], mask_token: str, mask_prob: float) -> list[str]:
    """Randomly mask words with given probability and token"""
    return [mask_token if random.random() < mask_prob else word 
            for word in words]


def create_batches(texts: list[list[str]], labels: list[str], 
                  batch_size: int = 2) -> list[dict]:
    """Create batches from texts and labels"""
    return [
        {
            "text": texts[i:i+batch_size],
            "label": labels[i:i+batch_size]
        }
        for i in range(0, len(texts), batch_size)
    ]


def compose(*functions) -> Callable:
    """Compose multiple functions from left to right"""
    def composed_function(x):
        result = x
        for f in functions:
            result = f(result)
        return result
    return composed_function


def load_data(filepath: str) -> tuple[list, list]:
    """Load data from CSV file"""
    with open(filepath) as f:
        reader = csv.reader(f)
        next(reader)  # skip headers
        data = [row for row in reader]
    # Using list comprehensions instead of map
    texts = [row[0] for row in data]
    labels = [row[1] for row in data]
    return texts, labels


def process_text(filepath: str, mask_prob: float = 0.2, 
                batch_size: int = 2, mask_token: str = "[MASK]") -> list[dict]:
    texts, labels = load_data(filepath)
    
    masker = partial(mask_words, mask_token=mask_token, mask_prob=mask_prob)
    
    pipeline = compose(
        clean_text,
        remove_special_chars,
        split_words,
        masker
    )
    
    processed_texts = [pipeline(text) for text in texts]
    
    return create_batches(processed_texts, labels, batch_size)


# Process file
batches = process_text(
    filepath="data/imdb_snippet.csv",
    mask_prob=0.2,
    batch_size=2,
    mask_token="[MASK]"
)

# Show result
print("\nFirst batch:")
print(f"text: {batches[0]['text']}")
print(f"label: {batches[0]['label']}")

clean_text: one of the other reviewers has mentioned that afte...
clean_text: a wonderful little production. <br /><br />the fil...
clean_text: i thought this was a wonderful way to spend time o...
clean_text: basically there's a family where a little boy (jak...
clean_text: petter mattei's "love in the time of money" is a v...
clean_text: probably my all-time favorite movie, a story of se...
clean_text: i sure would like to see a resurrection of a up da...
clean_text: this show was an amazing, fresh & innovative idea ...
clean_text: encouraged by the positive comments about this fil...
clean_text: if you like original gut wrenching laughter you wi...
clean_text: phil the alien is one of those quirky films where ...
clean_text: i saw this movie when i was about 12 when it came ...
clean_text: so im not a big fan of boll's work but then again ...
clean_text: the cast played shakespeare.<br /><br />shakespear...
clean_text: this a fantastic movie of three prisoners who beco...
clean_text

## Custom types and dataclasses

### Recap: Type Annotations

- Python supports optional type hints
- helps with code documentation and tooling
- does not affect runtime behavior
- uses the `typing` module for complex types

In [None]:
from typing import Callable


def process_numbers(
    numbers: list[float],
    operation: Callable[[float], float],
    threshold: float | None = None
) -> list[float]:
    result = [operation(x) for x in numbers]
    if threshold is not None:
        result = [x for x in result if x > threshold]
    return result


# Example usage
nums = [1.0, 2.0, 3.0, 4.0]
result = process_numbers(
    numbers=nums,
    operation=lambda x: x * 2,
    threshold=5.0
)
print(f"Processed numbers: {result}")
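
That type hints do not affect runtime behavior can be seen in a small sketch: the interpreter stores the annotations but does not check them. The function `double` is made up for this example; a static type checker such as mypy would flag the second call, but Python itself runs it.

```python
def double(x: int) -> int:
    return x * 2


print(double(2))     # 4
print(double('ab'))  # abab (no error, although a str was passed)

# The hints are only stored as metadata on the function
print(double.__annotations__)  # {'x': <class 'int'>, 'return': <class 'int'>}
```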

### Custom Types
- you can create custom type definitions (type aliases) to give short names to complex type definitions

In [50]:
from typing import Literal, Callable

# Custom types for our pipeline
StringProcessingStep = Callable[[str], str]
TextType = str | list[str]
ModelType = Literal["bert", "gpt2", "roberta"]


In [51]:
ComplexType = tuple[int, str, list[str], Callable[[str, float], list[str]]]
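
A short sketch of how such an alias can be used in a signature. `apply_steps` is a hypothetical helper introduced only for illustration; the alias is repeated so that the snippet runs on its own.

```python
from typing import Callable

# Same alias as above, repeated so the snippet is self-contained
StringProcessingStep = Callable[[str], str]


def apply_steps(text: str, steps: list[StringProcessingStep]) -> str:
    """Apply each step to the text, in the given order."""
    for step in steps:
        text = step(text)
    return text


# str.strip and str.lower both take a str and return a str
steps: list[StringProcessingStep] = [str.strip, str.lower]
print(apply_steps("  Hello World  ", steps))
```

Without the alias, the signature would have to spell out `list[Callable[[str], str]]`, which gets hard to read once several such parameters appear.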

### Classes can be used as Custom Types (and Data Structures)
- while they can include both data and behaviour, we can also use classes just as a data structure

In [52]:
# To initialize the class, we need to define the __init__ method

class ProcessingConfig:
    def __init__(self, model: ModelType, mask_prob: float, batch_size: int, max_length: int | None):
        self.model = model
        self.mask_prob = mask_prob
        self.batch_size = batch_size
        self.max_length = max_length


config = ProcessingConfig(model="bert", mask_prob=0.15, batch_size=32, max_length=None)
print(config.model)  
print(config)

bert
<__main__.ProcessingConfig object at 0x110be1fa0>


To get a readable printout of an instance, we need to define the `__repr__` method:

In [53]:
class PrintableProcessingConfig(ProcessingConfig):
    def __repr__(self):
        return f"ProcessingConfig(model={self.model}, mask_prob={self.mask_prob}, batch_size={self.batch_size}, max_length={self.max_length})"
    

printable_config = PrintableProcessingConfig(model="bert", mask_prob=0.15, batch_size=32, max_length=None)
print(printable_config)

ProcessingConfig(model=bert, mask_prob=0.15, batch_size=32, max_length=None)


To compare instances by their values instead of by identity, we need to define the `__eq__` method:

In [55]:
class ComparableProcessingConfig(ProcessingConfig):
    def __eq__(self, other):
        if not isinstance(other, ProcessingConfig):
            return False
        return self.model == other.model and self.mask_prob == other.mask_prob and self.batch_size == other.batch_size and self.max_length == other.max_length


comparable_config = ComparableProcessingConfig(model="gpt2", mask_prob=0.15, batch_size=32, max_length=None)
print(comparable_config == printable_config)

False


### Built-In Methods for Classes ("dunder" methods)
- Python classes can define special behaviors using double underscore (double-under -> dunder) methods
- common examples:
  - `__init__`: initialization
  - `__str__`: string representation for users
  - `__repr__`: string representation for developers
  - `__eq__`: equality comparison
  - `__lt__`, `__gt__`: ordering comparisons
  - `__getitem__`: accessing elements of a sequence

Look up all the dunder methods here: https://www.pythonmorsels.com/every-dunder-method/

In [56]:
class TextBatch:
    def __init__(self, texts: list[str], labels: list[int]):
        if len(texts) != len(labels):
            raise ValueError("Number of texts and labels must match")
        self.texts = texts
        self.labels = labels
    
    def __repr__(self):
        """Developer-friendly string representation"""
        return f"TextBatch(texts={self.texts}, labels={self.labels})"
    
    def __len__(self):
        """Allow using len() on batch"""
        return len(self.texts)
    
    def __getitem__(self, index: int) -> tuple[str, int]:
        """Allow indexing into batch; returns one (text, label) pair"""
        return self.texts[index], self.labels[index]
    

batch = TextBatch(texts=["hello", "world"], labels=[0, 1])
print(batch)
print(len(batch))
print(batch[0])

TextBatch(texts=['hello', 'world'], labels=[0, 1])
2
('hello', 0)
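
The list above also mentions `__str__` and `__lt__`, which `TextBatch` does not define. Here is a minimal sketch of both, using a hypothetical `Score` class that is not part of the pipeline:

```python
class Score:
    """Hypothetical example class: a named evaluation score."""

    def __init__(self, name: str, value: float):
        self.name = name
        self.value = value

    def __repr__(self) -> str:
        # Unambiguous representation, aimed at developers
        return f"Score(name={self.name!r}, value={self.value!r})"

    def __str__(self) -> str:
        # Readable representation, used by print() and str()
        return f"{self.name}: {self.value:.1%}"

    def __lt__(self, other: "Score") -> bool:
        # sorted() and min() only need "less than"
        if not isinstance(other, Score):
            return NotImplemented
        return self.value < other.value


scores = [Score("bert", 0.91), Score("gpt2", 0.87)]
print(scores[0])       # uses __str__
print(scores)          # a list shows its elements with __repr__
print(min(scores))     # uses __lt__
print(sorted(scores))  # uses __lt__ as well
```

If a class defines only `__repr__`, `print` falls back to it, which is why the earlier examples work without `__str__`.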


### From Classes to Dataclasses
- regular classes often require lots of boilerplate code 
- dataclasses generate `__init__`, `__eq__`, and `__repr__` automatically
- since `__init__` is generated for us, custom validation or setup goes into `__post_init__`, which the generated `__init__` calls at the end
- intended primarily as data containers
- can be made immutable with `frozen=True`: assigning to an attribute of an instance after creation then raises an error

See the docs for full functionality: https://docs.python.org/3/library/dataclasses.html

In [57]:
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBatch:
    texts: list[list[str]]
    labels: list[str]
    
    def __post_init__(self):
        if len(self.texts) != len(self.labels):
            raise ValueError("Number of texts and labels must match")
    
    def __len__(self) -> int:
        return len(self.texts)
    
    def __getitem__(self, index: int) -> tuple[list[str], str]:
        return self.texts[index], self.labels[index]
        
        
print("Repr Example:", TextBatch([["text1"], ["text2"]], ["pos", "neg"]))
print("Eq Example:", TextBatch([["text1"]], ["pos"]) == TextBatch([["text1"]], ["pos"]))

Repr Example: TextBatch(texts=[['text1'], ['text2']], labels=['pos', 'neg'])
Eq Example: True


Careful: `frozen=True` only prevents re-assigning the attributes themselves. We can still mutate mutable attribute values in place:

In [58]:
batch = TextBatch(texts=[["text1"], ["text2"]], labels=["pos", "neg"])
batch.texts.append(["text3"])
print(batch)
try:
    batch.texts = (["text3"], ["text4"])
except Exception as e:
    print(f"Error: {e}")

TextBatch(texts=[['text1'], ['text2'], ['text3']], labels=['pos', 'neg'])
Error: cannot assign to field 'texts'
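
One possible way around this caveat is to store immutable containers such as tuples in a frozen dataclass, and to create modified copies with `dataclasses.replace`. The sketch below uses a hypothetical `FrozenBatch` class; it is an illustration, not the class used in the pipeline.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class FrozenBatch:
    """Hypothetical variant of TextBatch that stores tuples instead of lists."""
    texts: tuple[str, ...]
    labels: tuple[str, ...]


batch = FrozenBatch(texts=("text1", "text2"), labels=("pos", "neg"))

# Tuples have no append(), so the contents cannot be changed in place
print(hasattr(batch.texts, "append"))

# Re-assigning a field raises dataclasses.FrozenInstanceError
try:
    batch.texts = ("text3",)
except FrozenInstanceError as e:
    print(f"Error: {e}")

# replace() builds a new instance with some fields swapped out
relabeled = replace(batch, labels=("neg", "pos"))
print(batch)
print(relabeled)
```

The original instance stays untouched, which fits the functional style used earlier in this lecture.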


Let's apply dataclasses to our data pipeline example from above:

In [59]:
import csv
import random
import re
from dataclasses import dataclass
from functools import partial
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class ProcessingConfig:
    data_path: str = "data/imdb_snippet.csv"
    mask_token: str = "[MASK]"
    mask_prob: float = 0.15
    batch_size: int = 4

    def __post_init__(self):
        if self.batch_size <= 0:
            raise ValueError("Batch size must be positive")
        if not 0 <= self.mask_prob <= 1:
            raise ValueError("Mask probability must be between 0 and 1")
        if not Path(self.data_path).exists():
            raise FileNotFoundError(f"Data file does not exist: {self.data_path}")


@dataclass(frozen=True)
class DataBatch:
    texts: list[list[str]]
    labels: list[str]

    def __len__(self) -> int:
        return len(self.texts)
    
    def __str__(self) -> str:
        shortened_texts = [text[:10] + ["..."] for text in self.texts]
        return f"\nTexts: {shortened_texts}, \nLabels: {self.labels}"



def clean_text(text: str) -> str:
    """Convert to lowercase and normalize spaces"""
    return " ".join(text.lower().split())


def remove_special_chars(text: str) -> str:
    """Keep only letters and spaces"""
    return re.sub(r"[^a-z ]", "", text)


def split_words(text: str) -> list[str]:
    """Split text into words"""
    return text.split()


def mask_words(words: list[str], mask_token: str, mask_prob: float) -> list[str]:
    """Randomly mask words with given probability and token"""
    return [mask_token if random.random() < mask_prob else word 
            for word in words]


def create_batches(texts: list[list[str]], labels: list[str], 
                  batch_size: int = 2) -> list[DataBatch]:
    """Create batches from texts and labels"""
    return [
        DataBatch(texts=texts[i:i+batch_size], labels=labels[i:i+batch_size])
        for i in range(0, len(texts), batch_size)
    ]


def debug(func: Callable) -> Callable:
    """Simple decorator to show intermediate results"""
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        preview = result[:50] if isinstance(result, str) else result[:5]
        print(f"{func.__name__}: {preview}...")
        return result
    return wrapper


def compose(*functions: Callable) -> Callable:
    """Compose multiple functions from left to right"""
    def composed_function(x):
        result = x
        for func in functions:
            result = func(result)
        return result
    return composed_function


def load_data(filepath: str) -> tuple[list[str], list[str]]:
    """Load texts and labels from a two-column CSV file"""
    with open(filepath, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        data = list(reader)
    texts, labels = zip(*data)
    return list(texts), list(labels)


def process_batch(config: ProcessingConfig) -> list[DataBatch]:
    """Load, preprocess, and batch the data described by the config"""
    texts, labels = load_data(config.data_path)

    masker = partial(mask_words, mask_token=config.mask_token, mask_prob=config.mask_prob)
    
    # Create pipeline function
    pipeline = compose(
        clean_text,
        remove_special_chars,
        split_words,
        masker
    )
    
    # Use list comprehension instead of map
    processed_texts = [pipeline(text) for text in texts]
    
    return create_batches(processed_texts, labels, config.batch_size)


# Usage in pipeline
config = ProcessingConfig()
processed = process_batch(config)
print(f"First batch: {processed[0]}")
print(f"Batch size: {len(processed[0])}")
print(f"Config: {config}")

First batch: 
Texts: [['one', 'of', 'the', '[MASK]', 'reviewers', 'has', 'mentioned', 'that', 'after', 'watching', '...'], ['a', 'wonderful', 'little', 'production', 'br', 'br', 'the', 'filming', '[MASK]', 'is', '...'], ['i', 'thought', '[MASK]', 'was', '[MASK]', 'wonderful', 'way', 'to', 'spend', 'time', '...'], ['basically', 'theres', 'a', 'family', '[MASK]', '[MASK]', 'little', 'boy', 'jake', 'thinks', '...']], 
Labels: ['positive', 'positive', 'positive', 'negative']
Batch size: 4
Config: ProcessingConfig(data_path='data/imdb_snippet.csv', mask_token='[MASK]', mask_prob=0.15, batch_size=4)


## Exceptions and Exception Handling
---
<!-- - motivation: look before you leap (LBYL) VS easier to ask forgiveness than permission (EAFP)
- raise
- try, except

---

- what are the error types? TypeError, ValueError
- always be descriptive in the error types you raise!
- do not just `raise Exception` -->

### Encountering Errors in Python

- you may have introduced an error in Python before, for instance, when you mistyped a variable name

In [60]:
solution = 13
print(soltuion)

NameError: name 'soltuion' is not defined

- when errors occur in Python, **Exceptions** are raised

- unless the Exception is **caught**, your Python program will halt

### Catching Errors in Python

- if you expect an error to be caused in some situation from which you can recover, you can catch it

In [61]:
try:
    value = 1. / 0.
    print('You will never see this message!')
except ZeroDivisionError:
    # e.g., we know that we approach zero from the positive side
    value = float('inf')
print(value)

inf


- notice, however, that any code within the **try** statement, but after the error, is **not** executed

### Always keep code within `try` minimal

- consider the following case

In [62]:
divisor = 1.
try:
    value = 1. / divisor
    print(f'I hope the result does not cause {1. / 0.}!') 
except ZeroDivisionError:
    # uh-oh! is divisor really zero here?
    value = float('inf')
print(value)

inf


- since we do not know which expression caused the error, it is good practice to put the **minimal** expression that we expect to cause an error into the `try` block
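
Applied to the example above, a minimal version could look like the following sketch. `reciprocal_or_inf` is a hypothetical helper name chosen for illustration.

```python
def reciprocal_or_inf(divisor: float) -> float:
    """Return 1 / divisor, or infinity if the divisor is zero."""
    try:
        # only the expression we expect to fail lives inside `try`
        value = 1. / divisor
    except ZeroDivisionError:
        return float('inf')
    # everything else runs outside the try block,
    # so an unrelated error here would not be swallowed
    return value


print(reciprocal_or_inf(1.))
print(reciprocal_or_inf(0.))
```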

### Except will only catch matching exceptions

In [63]:
try:
    value = 1. / 0.
except ValueError:
    # never reached: a ZeroDivisionError does not match ValueError
    value = float('inf')
print(value)

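ZeroDivisionError: float division by zero

`Exception` is the base class that almost all built-in exceptions inherit from (only a few, such as `KeyboardInterrupt` and `SystemExit`, derive directly from `BaseException`), so `except Exception` catches nearly everything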

In [64]:
try:
    value = 1. / 0.
except Exception:
    # reached: ZeroDivisionError is a subclass of Exception
    value = float('inf')
print(value)

inf
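
Which `except` clauses match a given exception follows from the class hierarchy, and this can be inspected directly. A small sketch using only built-in exception classes:

```python
# The method resolution order lists a class followed by all of its base classes
hierarchy = [cls.__name__ for cls in ZeroDivisionError.__mro__]
print(hierarchy)

# `except ArithmeticError` or `except Exception` would catch a ZeroDivisionError ...
print(issubclass(ZeroDivisionError, ArithmeticError))
print(issubclass(ZeroDivisionError, Exception))

# ... but `except ValueError` would not
print(issubclass(ZeroDivisionError, ValueError))

# A few exceptions are deliberately not subclasses of Exception
print(issubclass(KeyboardInterrupt, Exception))
```

Catching `Exception` is therefore very broad. Prefer the most specific class that describes the error you can actually recover from.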


### Cleaning up with `try...finally`

- sometimes, we need to clean up in order to crash gracefully
- for instance, consider writing some log files, line-by-line

In [65]:
for divisor in [3, 2, 1, 0]:
    try:
        value = 1. / divisor
        print(value, end='')
    finally:
        print(',\n', end='')

0.3333333333333333,
0.5,
1.0,
,


ZeroDivisionError: float division by zero

- here, the *comma and newline* will be added, regardless of whether there was an error
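
A typical use of `finally` is releasing a resource such as an open file. The sketch below writes to a temporary log file whose path is generated by `tempfile` (an assumption made for illustration). Unlike the example above, it also catches the error so that the snippet runs to the end.

```python
import os
import tempfile

# Create a temporary log file; the path is generated, not part of the lecture data
fd, log_path = tempfile.mkstemp(suffix=".log")
os.close(fd)

log_file = open(log_path, "w")
try:
    for divisor in [2, 1, 0]:
        log_file.write(f"{1. / divisor}\n")
except ZeroDivisionError:
    log_file.write("division by zero\n")
finally:
    # runs whether or not an exception occurred
    log_file.close()

print(log_file.closed)
with open(log_path) as f:
    lines = f.read().splitlines()
print(lines)
os.remove(log_path)
```

For files, the `with` statement used in the last lines does the same cleanup automatically. An explicit `finally` is still useful for resources that do not support `with`.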

### Doing things when there was no error

- `try` also supports an `else`, which is executed when there was **no** exception
- for instance, we can print the computed value only when the division succeeded

In [66]:
for divisor in [3, 0, 2, 1]:
    try:
        value = 1. / divisor
    except ZeroDivisionError:
        print(float('inf'))
    else:
        print(value)

0.3333333333333333
inf
0.5
1.0
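
All four clauses can be combined in a single statement. The sketch below records the order in which they run; `describe_division` is a hypothetical helper used only for illustration.

```python
def describe_division(divisor: float) -> list[str]:
    """Record which clauses of the try statement run for a given divisor."""
    steps = []
    try:
        steps.append("try")
        1. / divisor
    except ZeroDivisionError:
        steps.append("except")   # only when the division failed
    else:
        steps.append("else")     # only when there was no exception
    finally:
        steps.append("finally")  # always
    return steps


print(describe_division(2))
print(describe_division(0))
```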


### Raising exceptions

- in your own code, you sometimes need to indicate that there was an error
- to do this, you use the `raise` keyword with an instance of a specific `Exception`

In [67]:
def divide(a, b):
    if b == 0:
        raise ZeroDivisionError("Division by zero is undefined.")
    return a / b


divide(1, 0)

ZeroDivisionError: Division by zero is undefined.

### Raising exceptions after encountering exceptions

- it may happen that an exception is raised from within an `except` block
- this will print both errors, connecting them with `During handling of the above exception, another exception occurred:`

In [None]:
try:
    raise NotImplementedError('Nothing here')
except NotImplementedError:
    # do stuff
    raise RuntimeError('Cannot compute without implementing first!')

### Clean exception chaining

- Purposefully raising an exception as a direct consequence of another should be done **explicitly**.
- This is possible by using the optional `from` clause, as follows:

In [None]:
def func():
    raise ConnectionError('Could not remember what TCP stands for!')


try:
    func()
except ConnectionError as exc:
    # do something
    raise RuntimeError('Failed to open database') from exc
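
With `from`, the original exception is stored on the new one as `__cause__`, so code that catches the outer error can still inspect what went wrong underneath. A minimal sketch, with the example above wrapped in a hypothetical `open_database` function so that it can be caught:

```python
def func():
    raise ConnectionError('Could not remember what TCP stands for!')


def open_database():
    try:
        func()
    except ConnectionError as exc:
        raise RuntimeError('Failed to open database') from exc


try:
    open_database()
except RuntimeError as err:
    # keep a reference: `err` is deleted at the end of the except block
    caught = err

print(repr(caught))
print(repr(caught.__cause__))
```

In the traceback, explicit chaining is reported as `The above exception was the direct cause of the following exception:`.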

### Raising the right exception

- it is important to be as specific as possible when raising your errors
- use the built-in exception types where possible

Some of the most common built-in exceptions you may encounter:

| **Error** | **Description** |
| - | - |
| `TypeError` | Occurs when an operation is performed on incompatible data types. |
| `ValueError` | Arises when a function receives an argument of the correct type but with an inappropriate value. |
| `IndexError` | Arises when trying to access an index that is out of range in a sequence. |
| `KeyError` | Occurs when a dictionary key is not found in the dictionary. |
| `AttributeError` | Happens when an attribute reference or assignment fails. |
| `FileNotFoundError` | Happens when a file or directory is requested but cannot be found. |
| `RuntimeError` | A general error that arose during runtime. Use it only if no more specific exception fits. |
| `NotImplementedError` | Raised when something is not (yet) implemented. |


There are many more. The full list can be found here: https://docs.python.org/3/library/exceptions.html
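
To get a feeling for these types, the following sketch triggers several of them on purpose and prints which exception was raised. The failing expressions are illustrative choices, not taken from the pipeline.

```python
import math

# Each entry pairs a description with a zero-argument function that fails
failing_calls = {
    "add int and str": lambda: 1 + "a",
    "sqrt of a negative number": lambda: math.sqrt(-1),
    "index out of range": lambda: [1, 2, 3][5],
    "missing dict key": lambda: {"a": 1}["b"],
    "missing attribute": lambda: "text".shape,
}

raised = {}
for description, call in failing_calls.items():
    try:
        call()
    except Exception as exc:  # broad on purpose: we want to see the type
        raised[description] = type(exc).__name__
        print(f"{description}: {type(exc).__name__}: {exc}")
```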

### Raising custom exceptions 

- if there is no exception that is specific enough, or you would like to catch only your own exceptions, you can create custom exceptions

In [None]:
class NegativeAmountError(Exception):
    """Custom exception for negative transaction amounts."""

    def __init__(self, amount):
        self.amount = amount
        self.message = f"Transaction amount {amount} is not positive."
        super().__init__(self.message)


def process_transaction(amount):
    """Process a transaction, raising NegativeAmountError if amount is not positive."""
    if amount <= 0:
        raise NegativeAmountError(amount)
    # Process transaction logic goes here
    print(f"Transaction of amount '{amount}' processed successfully.")

- custom exceptions are caught like any of the built-in ones

In [None]:
try:
    process_transaction(-100)  # Attempting to process a transaction with a negative amount
except NegativeAmountError as error:
    print("Error:", error.message)
    raise ValueError('Bad input value to process_transaction') from error

### Asserts

- An `assert` statement raises an `AssertionError` when its condition evaluates to false

In [None]:
def calculate_average(numbers: list[int | float]) -> float:
    """Calculate the average of a list of numbers."""
    assert all(isinstance(num, (int, float)) for num in numbers), "All elements must be numbers."
    return sum(numbers) / len(numbers)


print(calculate_average([1, 2, 3]))
print(calculate_average([1, 2, 3, 'a']))

In [None]:
# equivalent to:
def calculate_average(numbers: list[int | float]) -> float:
    if not all(isinstance(num, (int, float)) for num in numbers):
        raise AssertionError("All elements must be numbers.")
    return sum(numbers) / len(numbers)

When the condition of an assert evaluates to false, Python evaluates the expression to the right of the comma and uses the result as the message of the `AssertionError`.

In [None]:
import math


def calculate_average(numbers: list[int | float]) -> float:
    """Calculate the average of a list of numbers."""
    assert all(isinstance(num, (int, float)) for num in numbers), 3 + 5 / math.pi
    return sum(numbers) / len(numbers)


calculate_average([1, 2, 3])  # passes the assertion
calculate_average([1, 2, 3, 'a'])  # fails: the message expression is evaluated now

Use assert when:
- Checking internal assumptions in your code
- Writing tests
- Validating developer-controlled conditions that should never be false

Use raise when:
- Handling expected error conditions
- Validating user input
- Signaling problems that could occur during normal operation


Reason:
Assertions can be disabled (by running Python with the `-O` flag), while raised exceptions cannot. Assertions are meant to catch programming errors, not expected errors.
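
The effect of the `-O` flag can be observed by running the same one-line program twice in a subprocess. This is a minimal sketch; it assumes that `sys.executable` points to the interpreter running the notebook.

```python
import subprocess
import sys

code = "assert False, 'this assertion fails'"

# Normal run: the assertion fails and the interpreter exits with an error code
normal = subprocess.run(
    [sys.executable, "-c", code], capture_output=True, text=True
)

# Optimized run (-O): assert statements are skipped entirely
optimized = subprocess.run(
    [sys.executable, "-O", "-c", code], capture_output=True, text=True
)

print("normal exit code:", normal.returncode)
print("optimized exit code:", optimized.returncode)
```

Input validation written with `assert` silently disappears in the optimized run, which is why user-facing checks should use `raise`.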