# Recap: Useful Python Concepts

## Organizing Code

### Decorators

Decorators modify or extend the behavior of functions or methods without changing their code. In data engineering, decorators can be used for logging, memoization, or access control.

In [None]:
from datetime import datetime

def logger(func):
    def wrapper(*args, **kwargs):
        # Actions before
        print(f"Calling {func.__name__} with args {args} and kwargs {kwargs}")
        start = datetime.now()
        # Calling the decorated function
        result = func(*args, **kwargs)
        # Actions after
        end = datetime.now()
        print(f"Returned {result} in {end - start}")
        # Return the end result
        return result
    return wrapper

@logger
def add(a, b, rounding=2):
    return round(a + b, rounding)

add(1.5, 2)

In [None]:
@logger
def mult(a, b, rounding=2):
    return round(a * b, rounding)

mult(3, 2)
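One side effect of the pattern above is that the decorated function takes on the name and docstring of `wrapper`. A minimal sketch of how `functools.wraps` from the standard library avoids this; the names `plain_logger`, `wrapped_logger`, `add_plain` and `add_wrapped` are made up for illustration:

```python
from functools import wraps

def plain_logger(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

def wrapped_logger(func):
    @wraps(func)  # copies __name__, __doc__ and similar metadata onto wrapper
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@plain_logger
def add_plain(a, b):
    """Add two numbers."""
    return a + b

@wrapped_logger
def add_wrapped(a, b):
    """Add two numbers."""
    return a + b

print(add_plain.__name__)    # wrapper
print(add_wrapped.__name__)  # add_wrapped
print(add_wrapped.__doc__)   # Add two numbers.
```

Keeping the original metadata helps when reading logs, tracebacks, or generated documentation.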

### Decorators with Arguments

Decorators with arguments allow you to pass extra information to your decorator, making them more flexible. This is useful in data engineering for dynamically setting behaviors like caching policies or operation modes.

In [None]:
def multiplier(factor):
    def decorator(func):
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs) * factor
        return wrapper
    return decorator

@multiplier(5)
def add(a, b):
    return a + b

print(add(1, 2))
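The same three-layer structure can carry an operation mode. The sketch below is a hypothetical `retry(times)` decorator, not a library API: it calls the function again when it raises, up to `times` attempts. The function `flaky_load` and the `calls` counter only simulate a source that fails twice before succeeding.

```python
from functools import wraps

def retry(times):
    """Hypothetical decorator factory: try the call up to `times` times."""
    if times < 1:
        raise ValueError("times must be at least 1")

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as error:
                    last_error = error
                    print(f"Attempt {attempt} of {times} failed: {error}")
            # Every attempt failed: re-raise the last error
            raise last_error
        return wrapper
    return decorator

calls = {"count": 0}  # simulates external state for the demo

@retry(times=3)
def flaky_load():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("source not ready")
    return "loaded"

print(flaky_load())  # two failed attempts are reported, then: loaded
```

Here `retry(times=3)` runs first and returns `decorator`, which is then applied to `flaky_load`.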

### Custom Context Managers

Custom context managers help in abstracting setup and teardown activities, making the code more readable and maintainable. In data engineering, they can manage database connections, temporary files, or other resources efficiently.

You have already used a built-in context manager:

```python
# Without context manager -- Don't do this!
file = open("somefile")
file.readline()
file.close()

# With context manager - Now we are sure the file gets closed!
with open("somefile") as file:
    file.readline()
```

So, to define your own:

In [None]:
from contextlib import contextmanager

@contextmanager
def managed_resource():
    print("Setup")
    try:
        yield
    finally:
        # Runs even if the with-block raises an exception
        print("Teardown")

with managed_resource():
    print("Do work")  # Setup -> Do work -> Teardown
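As a slightly more realistic sketch, the same idea can manage a database connection. This example assumes an in-memory SQLite database from the standard library; the function name `sqlite_connection` and the table `t` are made up for illustration. The `try`/`finally` makes sure the connection is closed even when the body of the `with` block raises.

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def sqlite_connection(path=":memory:"):
    conn = sqlite3.connect(path)  # setup
    try:
        yield conn                # the value bound by "as"
    finally:
        conn.close()              # teardown, also runs on errors

with sqlite_connection() as conn:
    conn.execute("CREATE TABLE t (n INTEGER)")
    conn.execute("INSERT INTO t VALUES (1), (2)")
    total = conn.execute("SELECT SUM(n) FROM t").fetchone()[0]

print(total)  # 3
```

After the `with` block, the connection is closed and can no longer be used for queries.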

## Efficient Coding

### Generator Expressions

Generator expressions provide a memory-efficient way to handle large data sets by yielding items one at a time, instead of loading all into memory. In data engineering, this is useful for streaming and transforming large data files or query results.

In [None]:
list_comp = [x**2 for x in range(10)]
for val in list_comp:
    print(val)

In [None]:
gen_exp = (x**2 for x in range(10))
for val in gen_exp:
    print(val)
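Both loops print the same values, so the difference is easy to miss. A rough way to see it is to compare container sizes with `sys.getsizeof`, which only measures the container itself and not the integers inside the list. The sketch also shows that a generator can be consumed only once; the variable names are illustrative.

```python
import sys

n = 100_000
squares_list = [x**2 for x in range(n)]  # all values exist in memory at once
squares_gen = (x**2 for x in range(n))   # values are produced on demand

print(sys.getsizeof(squares_list))  # grows with n
print(sys.getsizeof(squares_gen))   # small, independent of n

first_total = sum(squares_gen)   # consumes the generator
second_total = sum(squares_gen)  # nothing left, so this is 0
print(first_total, second_total)
```

If the values are needed more than once, either rebuild the generator or use a list.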

### Walrus Operator

The walrus operator (`:=`) assigns a value to a variable and evaluates to that value within a single expression. In data engineering tasks such as filtering or transformation, this avoids computing the same value twice.

#### Use case 1: simplifying an `if` construction

In [None]:
# No walrus here
tweet_limit = 50
some_tweet = "This is a tweet about the walrus " + "blah" * 50
diff = len(some_tweet) - tweet_limit
if diff <= 0:
    print(some_tweet)
else:
    print(some_tweet[:tweet_limit], f"[Truncated {diff} characters]")

In [None]:
# I am the walrus
tweet_limit = 50
some_tweet = "This is a tweet about the walrus " + "blah" * 50
if (diff := len(some_tweet) - tweet_limit) <= 0:
    print(some_tweet)
else:
    print(some_tweet[:tweet_limit], f"[Truncated {diff} characters]")

#### Use case 2: speeding up list comprehensions

In [None]:
%%time
from time import sleep
def slow_square(n):
    sleep(1)
    return n**2
slow_square(2)

In [None]:
%%time
filtered_data = [(n, slow_square(n)) for n in range(5)]
filtered_data

In [None]:
%%time
# Without walrus
filtered_data = [(n, slow_square(n)) for n in range(5) if slow_square(n) % 2]
filtered_data

In [None]:
%%time
# With walrus
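# The parentheses must close before "% 2"; otherwise n_squared would hold the remainder, not the square
filtered_data = [(n, n_squared) for n in range(5) if (n_squared := slow_square(n)) % 2]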
filtered_data
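A third common pattern, sketched here with an in-memory `io.BytesIO` stream standing in for a file or socket, is reading data in fixed-size chunks until nothing is left. The chunk size of 4 is arbitrary.

```python
import io

stream = io.BytesIO(b"abcdefghij")  # stand-in for a real file or socket
chunks = []

# read() returns b"" at the end of the stream, which is falsy and ends the loop
while (chunk := stream.read(4)):
    chunks.append(chunk)

print(chunks)  # [b'abcd', b'efgh', b'ij']
```

Without the walrus operator, the `read` call would have to appear twice: once before the loop and once at the end of the loop body.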

### `defaultdict`

`defaultdict` is a subclass of Python's `dict` that, when a missing key is looked up, calls a factory function (such as `int` or `list`) to create a default value, stores it under that key, and returns it. In data engineering, this is useful for building frequency counters, group-by operations, or adjacency lists, where the structure of the dictionary needs to be dynamic.

In [None]:
normal_dict = {}
normal_dict['non_existing_key']  # Raises KeyError: a plain dict has no default

In [None]:
from collections import defaultdict

default_dict = defaultdict(int)
default_dict['non_existing_key']

In [None]:
dd = defaultdict(list)
dd['key1'].append(1)
dd['key2'].append(2)
print(dd)

In [None]:
items = ['a', 'b', 'a', 'a', 'def']

counter = defaultdict(int)
for item in items:
    counter[item] += 1

counter

In [None]:
items = ['a', 'b', 'a', 'a', 'def']

counter_dict = {}
for item in items:
    counter_dict[item] = counter_dict.get(item, 0) + 1

counter_dict
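The group-by use case mentioned above can be sketched in the same way. The records below are made-up sample data. The last lines illustrate a detail that is easy to overlook: merely looking up a missing key inserts it.

```python
from collections import defaultdict

records = [
    ("nl", "Amsterdam"),
    ("be", "Brussels"),
    ("nl", "Utrecht"),
    ("be", "Ghent"),
]

cities_by_country = defaultdict(list)
for country, city in records:
    cities_by_country[country].append(city)

print(dict(cities_by_country))
# {'nl': ['Amsterdam', 'Utrecht'], 'be': ['Brussels', 'Ghent']}

# A lookup of a missing key creates the entry as a side effect
print("fr" in cities_by_country)  # False
cities_by_country["fr"]
print("fr" in cities_by_country)  # True
```

Use `in` or `.get()` to test for a key without creating it.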

## Advanced Classes

### `__repr__` Method

The `__repr__` method returns the "official" string representation of an object. Where possible, this string should look like a valid Python expression that, when passed to `eval()`, would create an object with the same internal state as the original. It is mainly intended for debugging and development.

In [None]:
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return f"Point({self.x}, {self.y})"

p = Point(2, 3)
print(repr(p))

In [None]:
z = Point(2, 3)
z
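The `eval()` round trip described above can be checked directly. This sketch redefines `Point` so that it runs on its own, uses `!r` so that string attributes would be quoted correctly, and passes an explicit namespace to `eval()`. Only use `eval()` on strings you trust.

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        # !r applies repr() to each attribute
        return f"Point({self.x!r}, {self.y!r})"

p = Point(2, 3)
text = repr(p)
clone = eval(text, {"Point": Point})  # rebuild an object from its repr

print(text)                                # Point(2, 3)
print((clone.x, clone.y) == (p.x, p.y))    # True
print(clone is p)                          # False: same state, different object
```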

### `__str__` Method

The `__str__` method returns a string that provides an "informal" or nicely printable representation of the object. This makes the object's printout more human-readable.

In [None]:
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return f"Point({self.x}, {self.y})"

    def __str__(self):
        return f"A point at x={self.x} and y={self.y}"

p = Point(2, 3)
print(p)
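Which of the two methods is called depends on the context. The sketch below repeats the class so that it runs on its own and shows the most common cases: `print()`, `str()` and plain f-strings use `__str__`, while `repr()`, the `!r` conversion and containers use `__repr__`.

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return f"Point({self.x}, {self.y})"

    def __str__(self):
        return f"A point at x={self.x} and y={self.y}"

p = Point(2, 3)
print(str(p))    # A point at x=2 and y=3
print(f"{p}")    # A point at x=2 and y=3
print(f"{p!r}")  # Point(2, 3)
print([p])       # [Point(2, 3)]
```

If a class defines only `__repr__`, `str()` falls back to it.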

### `__dict__`

The `__dict__` attribute contains an object's attributes.

In [None]:
p.__dict__

In [None]:
p.x

In [None]:
# p.x translates into:
p.__dict__['x']

In [None]:
# p.x = 100 translates into:
p.__dict__['x'] = 100
p

In [None]:
p.z = 5
p.__dict__

In [None]:
# If I make a typo, a new attribute is created
p.X = 5
p.__dict__

This means that an object's memory space cannot be fixed upfront, because the attributes can be modified or extended by changing the object's `__dict__`.
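One nuance worth a sketch: an instance's `__dict__` holds only the attributes stored on that instance. Class attributes live on the class and are found through normal attribute lookup. The `Sensor` class here is a made-up example, and `vars()` is the built-in way to reach `__dict__`.

```python
class Sensor:
    unit = "celsius"  # class attribute, shared by all instances

    def __init__(self, name):
        self.name = name  # instance attribute

s = Sensor("s1")
print(vars(s))                # {'name': 's1'}
print(vars(s) is s.__dict__)  # True
print(s.unit)                 # celsius (found on the class)

s.unit = "kelvin"             # creates an instance attribute that shadows the class one
print(vars(s))                # {'name': 's1', 'unit': 'kelvin'}
print(Sensor.unit)            # celsius (the class attribute is unchanged)
```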

### Slots

`__slots__` constrains an object's attributes to a fixed set, eliminating the memory overhead of the dynamic per-instance `__dict__`. This leads to more memory-efficient storage of objects. In a data engineering context, where you often work with large data sets or many instances of custom classes, using `__slots__` can noticeably reduce the memory footprint and speed up attribute access during data transformations.

In [None]:
from pympler.asizeof import asizeof  # third-party package: pip install pympler

class WithoutSlots:
    def __init__(self, name, age):
        self.name = name
        self.age = age

class WithSlots:
    __slots__ = ['name', 'age']
    def __init__(self, name, age):
        self.name = name
        self.age = age

obj1 = WithoutSlots('Alice', 30)
obj2 = WithSlots('Bob', 40)

print(f"Size without slots (obj1): {asizeof(obj1)} bytes")
print(f"Size with slots    (obj2): {asizeof(obj2)} bytes")

In [None]:
obj1.Age = 31
asizeof(obj1)

In [None]:
obj1.__dict__

In [None]:
obj2.Age = 41  # Raises AttributeError: 'Age' is not listed in __slots__

The **advantages** of `__slots__` in a nutshell:
- Lower memory use per instance
- Faster attribute lookup
- Protection against accidentally creating new attributes (for example through typos)
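The first and last points can be verified with the standard library alone. A minimal sketch, with the made-up classes `Plain` and `Slotted`:

```python
class Plain:
    def __init__(self, name):
        self.name = name

class Slotted:
    __slots__ = ("name",)

    def __init__(self, name):
        self.name = name

plain = Plain("Alice")
slotted = Slotted("Bob")

print(hasattr(plain, "__dict__"))    # True
print(hasattr(slotted, "__dict__"))  # False: no per-instance dict is allocated

plain.nmae = "typo"  # silently creates a new attribute
try:
    slotted.nmae = "typo"
except AttributeError as error:
    print(f"AttributeError: {error}")
```

Note that a subclass of a slotted class gets a `__dict__` again unless it defines `__slots__` itself.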

### Data Classes

Data classes in Python automatically generate special methods like `__init__`, `__repr__`, and `__eq__`. They make it easier to create classes for storing data. In data engineering, this simplifies the definition of complex data structures.

#### Without Data Classes:

In [None]:
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return f"Point(x={self.x}, y={self.y})"

    # def __eq__(self, other):
    #     return self.x == other.x and self.y == other.y


In [None]:
a = Point(1, 4)
a

In [None]:
b = Point(1, 4)
a == b

#### With Data Classes:

In [None]:
from dataclasses import dataclass

@dataclass
class DataPoint:
    x: int
    y: int

In [None]:
c = DataPoint(1, 4)
c

In [None]:
d = DataPoint(1, 4)
c == d
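The `@dataclass` decorator also accepts options. The sketch below uses a made-up `Reading` class to show three of them: `frozen=True` makes instances immutable, `order=True` generates comparison methods so instances can be sorted, and `field(default_factory=list)` gives every instance its own fresh list as default.

```python
from dataclasses import dataclass, field, FrozenInstanceError

@dataclass(frozen=True, order=True)
class Reading:
    timestamp: int
    value: float
    # compare=False keeps tags out of ==, < and similar comparisons
    tags: list = field(default_factory=list, compare=False)

r1 = Reading(1, 20.5)
r2 = Reading(2, 19.0, tags=["night"])

ordered = sorted([r2, r1])  # compares (timestamp, value)
print(ordered[0] is r1)     # True

try:
    r1.value = 0.0
except FrozenInstanceError:
    print("Reading is immutable")
```

A plain `tags: list = []` default is rejected by `dataclass` with a `ValueError`, precisely because the list would be shared between instances.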

### Abstract Base Classes (ABCs)

Abstract Base Classes declare methods and properties that every concrete subclass must implement. A subclass that leaves an abstract method unimplemented cannot be instantiated. This lets you set up a blueprint for other classes and ensures a consistent interface. In data engineering, they can be used to define interfaces for plug-and-play components in a pipeline.

In [None]:
from abc import ABC, abstractmethod

class DataProcessor(ABC):
    @abstractmethod
    def process(self, data):
        pass

class MyProcessor(DataProcessor):
    pass
    # def process(self, data):
    #     print(f"Processing {data}")


processor = MyProcessor()  # Raises TypeError: process() is not implemented
processor.process("some data")  # Never reached while process() is commented out

This fails with a `TypeError` because `MyProcessor` does not implement `process`. Uncomment the two lines that define the method to make it work.
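To illustrate the plug-and-play idea, here is a sketch of a tiny pipeline built on the same abstract class. The concrete classes `StripWhitespace` and `Uppercase` and the helper `run_pipeline` are hypothetical names chosen for this example.

```python
from abc import ABC, abstractmethod

class DataProcessor(ABC):
    @abstractmethod
    def process(self, data):
        """Return a transformed copy of data."""

class StripWhitespace(DataProcessor):
    def process(self, data):
        return [item.strip() for item in data]

class Uppercase(DataProcessor):
    def process(self, data):
        return [item.upper() for item in data]

def run_pipeline(data, processors):
    # Every step is guaranteed to offer process(), whatever its concrete class
    for processor in processors:
        data = processor.process(data)
    return data

result = run_pipeline([" a ", "b "], [StripWhitespace(), Uppercase()])
print(result)  # ['A', 'B']

try:
    DataProcessor()  # the abstract class itself cannot be instantiated
except TypeError as error:
    print(f"TypeError: {error}")
```

Because `run_pipeline` relies only on the `process` method, steps can be added, removed, or reordered without changing it.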

Data classes make it easier to manage data in a structured form, while ABCs ensure that certain classes adhere to a specific contract, making your data engineering pipelines more modular and easier to understand.