# Data Structures

**Related Notebooks**:
- Python [collections](https://colab.research.google.com/drive/1F48ctQ05BFDvyj_pq9hqDz0wtK7TnULE) module

# Python Basics: [Mutable vs Immutable Objects](https://towardsdatascience.com/https-towardsdatascience-com-python-basics-mutable-vs-immutable-objects-829a0cb1530a)

All data in a Python program is represented by `objects` or by `relations` between objects. Every object has an **`identity`**, a **`type`**, and a **`value`**.

**`Identity`**:

An object’s identity never changes once it has been created; you may think of it as the **object’s address in memory**. The `is` operator compares the identity of two objects; the `id()` function returns an integer representing an object’s identity.

**`Type`**:

An object’s type defines the values it can take and the operations it supports (e.g. “does it have a length?”). The `type()` function returns the type of an object. Like its identity, an object’s type is **unchangeable**.

**`Value`**:

The value of some objects can change. Objects whose value can change are said to be `mutable`; objects whose value is unchangeable once they are created are called `immutable`.

The mutability of an object is determined by its type.
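A small sketch of the three properties, using only the built-ins `id()`, `type()` and `is`. The variable names (`nums`, `alias`, `clone`) are illustrative and not part of the text above.

```python
# Illustrative names only: nums, alias, clone.
nums = [1, 2, 3]
alias = nums          # a second name bound to the same object
clone = list(nums)    # a new object with an equal value

same_identity = alias is nums                              # identity comparison
equal_not_same = (clone == nums) and (clone is not nums)   # equal value, distinct identity

id_before = id(nums)
nums.append(4)                          # lists are mutable: the value changes in place
identity_kept = id(nums) == id_before   # the identity does not change

type_name = type(nums).__name__         # the type stays fixed for the object's lifetime

print(same_identity, equal_not_same, identity_kept, type_name)
print(alias, clone)   # alias sees the append, clone does not
```

Because `alias` and `nums` name the same object, the `append` is visible through both; `clone` is a separate object and keeps its original value.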


---
**READING LIST:**

- **[Common Python Data Structures Guide](https://realpython.com/python-data-structures/)**
- The book **Fluent Python** covers this topic very well


# Mutable Data Types
- List
- Dictionary
- Set
- User-Defined Classes


## List - Dynamic Array



In [None]:
a = [1, 2, 3, 4, 5, 6]
id_1 = id(a)

In [None]:
a[0] = 100
print(a)
id_2 = id(a)

[100, 2, 3, 4, 5, 6]


In [None]:
# list is mutable: changing its value does not change the object's identity
id_1 == id_2

True

In [None]:
a.append(10)
id_3 = id(a)
a

[100, 2, 3, 4, 5, 6, 10]

In [None]:
id_3 == id_1

True

In [None]:
id(a + [11, 12]) == id_1  # list `+` will create a new object

False

`slice` operation returns a shallow copy of the list

In [None]:
# All slice operations return a new list containing the requested elements.
# This means that the following slice returns a shallow copy of the list:
b = a[:2]
b

[100, 2]

In [None]:
b[1] = 110
b

[100, 110]

In [None]:
# updating b doesn't update a
a

[100, 2, 3, 4, 5, 6, 10]

Assignment to `slices`
- A slice on the left of `=` is an in-place update
- A slice on the right of `=` creates a shallow copy

In [None]:
a = [1, 2, 3, 4, 5, 6]
c = a  # c is just a new reference to a
d = a[:]  # d is a shallow copy of a

In [None]:
a[2:4] = [20, 30, 40]  # slice assignment: a slice on the left of `=` modifies a in place
b = a[4:]

In [None]:
a

[1, 2, 20, 30, 40, 5, 6]

In [None]:
b[1] = 50
d[-1] = -1

print(f"a: {a}")
print(f"b: {b}")
print(f"c: {c}")
print(f"d: {d}")
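A self-contained variant of the cells above, with the expected results stated as assertions, followed by a nested-list case showing what "shallow" means. The names `nested` and `shallow` are illustrative.

```python
a = [1, 2, 3, 4, 5, 6]
c = a        # alias: same object as a
d = a[:]     # shallow copy: new outer list

a[2:4] = [20, 30, 40]   # slice on the left: updates a in place (length may change)
b = a[4:]               # slice on the right: a new list
b[1] = 50               # does not touch a
d[-1] = -1              # does not touch a

assert c is a and a == [1, 2, 20, 30, 40, 5, 6]
assert b == [40, 50, 6]
assert d == [1, 2, 3, 4, 5, -1]

# "Shallow" means only the outer list is copied; inner objects are shared.
nested = [[1, 2], [3, 4]]
shallow = nested[:]
shallow[0].append(99)   # mutates the inner list shared by both
shallow[1] = [0]        # rebinds a slot in the copy only

print(nested)
print(shallow)
```

The alias `c` follows every in-place change to `a`, while `b` and `d` are independent lists. With nested data, the copy is independent only at the top level.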

### Slice
ref: https://stackoverflow.com/a/509295/8280662

```python
a[start:stop]  # items start through stop-1
a[start:]      # items start through the rest of the array
a[:stop]       # items from the beginning through stop-1
a[:]           # a copy of the whole array

a[start:stop:step] # start through not past stop, by step

a[-1]    # last item in the array
a[-2:]   # last two items in the array
a[:-2]   # everything except the last two items

a[::-1]    # all items in the array, reversed
a[1::-1]   # the first two items, reversed
a[:-3:-1]  # the last two items, reversed
a[-3::-1]  # everything except the last two items, reversed
```

**Relation to `slice()` object**:

In the code above, the `:` notation (which is only valid within `[]`) is shorthand for passing a `slice()` object to the indexing operator `[]`, i.e.:

```python
a[start:stop:step]
```
is equivalent to:
```python
a[slice(start, stop, step)]
```
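Since a slice is an ordinary object, it can be stored under a name and reused. A brief sketch; the names `last_three` and `every_other` are illustrative:

```python
a = list(range(10))

last_three = slice(-3, None)         # same as a[-3:]
every_other = slice(None, None, 2)   # same as a[::2]

assert a[last_three] == a[-3:] == [7, 8, 9]
assert a[every_other] == a[::2] == [0, 2, 4, 6, 8]

# slice.indices(length) resolves None and negative values for a given length
resolved = last_three.indices(len(a))
print(resolved)   # (7, 10, 1)

# a named slice can also be the target of an assignment (in-place update)
a[last_three] = ["x", "y", "z"]
print(a)
```

Naming a slice can make repeated index ranges easier to read than raw `start:stop:step` literals.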

In [None]:
a = list(range(10))
print(a)
print(a[1::-1])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[1, 0]


### Index
`list.index(element, start, end)`
- `element` - the element to be searched
- `start` (optional) - start searching from this index
- `end` (optional) - search the element up to this index

In [None]:
alphabets = ['a', 'e', 'i', 'o', 'g', 'l', 'i', 'u']

# index of 'e' in alphabets
index = alphabets.index('e')   # 1
print('The index of e:', index)

# 'i' after the 4th index is searched
index = alphabets.index('i', 4)   # 6
print('The index of i:', index)

# 'i' between 3rd and 5th index is searched
index = alphabets.index('i', 3, 5)   # Error!
print('The index of i:', index)

The index of e: 1
The index of i: 6


ValueError: 'i' is not in list
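Because `list.index` raises `ValueError` when the element is missing from the searched range, callers often wrap it. A minimal sketch; `find_index` is a hypothetical helper, not a built-in:

```python
def find_index(items, target, start=0, end=None):
    """Hypothetical helper: like list.index, but returns -1 instead of raising."""
    if end is None:
        end = len(items)
    try:
        return items.index(target, start, end)
    except ValueError:
        return -1


alphabets = ['a', 'e', 'i', 'o', 'g', 'l', 'i', 'u']
print(find_index(alphabets, 'i', 4))      # found at or after index 4
print(find_index(alphabets, 'i', 3, 5))   # not in alphabets[3:5]
```

Returning `-1` is one possible convention; returning `None` or re-raising with more context would work equally well.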

## Linked list (User-Defined Classes)
Instances of this class are mutable, so a change made through any reference affects the original data structure.

### Singly Linked List

In [None]:
from typing import List


class ListNode:
    def __init__(self, val=0, next_node=None):
        self.val = val
        self.next = next_node


class SinglyLinkedList:
    """
    ✅ tested
    Besides the add_at_index approach written below, an add_at_node variant is also possible.

    Time complexity: O(1) for add_at_head. O(k) for get, add_at_index, and delete_at_index,
                    where k is the index of the element to get, add or delete. O(N) for add_at_tail.
    Space complexity:  O(1) for all operations.
    """
    def __init__(self):
        self.sentinel = ListNode()
        self.size = 0


    def get(self, index: int) -> int:
        """
        Get the value of the index^th node in the linked list. If the index is invalid, return -1.
        """
        # checking against self.size is a quick way to test whether the index is valid
        if index < 0 or index >= self.size:
            return -1

        curr = self.sentinel
        for _ in range(index + 1):
            curr = curr.next
        return curr.val


    def add_at_head(self, val: int) -> None:
        """
        Add a node of value val before the first element of the linked list.
        After the insertion, the new node will be the first node of the linked list.
        """
        self.add_at_index(index=0, val=val)


    def add_at_tail(self, val: int) -> None:
        """
        Append a node of value val as the last element of the linked list.
        """
        self.add_at_index(index=self.size, val=val)


    def add_at_index(self, index: int, val: int) -> None:
        """
        Add a node of value val before the index^th node in the linked list.
        If index equals the length of the linked list, the node will be appended to the end of the linked list.
        If index is greater than the length, the node will not be inserted.
        """
        if index > self.size:
            return

        if index < 0:
            raise ValueError("index cannot be negative.")

        self.size += 1

        # writing prev.next instead of self.sentinel.next on the right-hand side would fail:
        # the name "prev" is not defined until this assignment completes
        prev, curr = self.sentinel, self.sentinel.next
        new_node = ListNode(val=val)

        for _ in range(index):
            prev, curr = curr, curr.next
        prev.next, new_node.next = new_node, curr


    def delete_at_index(self, index: int) -> None:
        """
        Delete the index^th node in the linked list, if the index is valid.
        """
        if index < 0 or index >= self.size:
            return

        prev, curr = self.sentinel, self.sentinel.next

        for _ in range(index):
            prev, curr = curr, curr.next

        prev.next = curr.next
        self.size -= 1


    def show(self) -> List[int]:
        """
        show values of each node one by one in a list
        """
        res = []
        curr = self.sentinel.next
        while curr:
            res.append(curr.val)
            curr = curr.next
        return res


linked_list = SinglyLinkedList()
linked_list.add_at_head(10)
linked_list.add_at_tail(20)
linked_list.add_at_index(index=1, val=15)
print(linked_list.show())
print(linked_list.get(2))

print("=============")

linked_list.delete_at_index(1)
print(linked_list.show())
print(linked_list.get(2))
print(linked_list.get(1))

[10, 15, 20]
20
=============
[10, 20]
-1
20


In [None]:
# Depends on the step-by-step cells later in this section, where prev = a and curr = b
prev = curr  # this is reassignment of prev
curr = curr.next  # this is reassignment of curr

print(id(prev) == id(b))
print(id(curr) == id(c))

True
True


#### Quick implementation

to be deleted

In [None]:
# V1:
class Node:
    """construct a singly linked list"""
    def __init__(self, val, next_node=None):
        self.val = val
        self.next = next_node


class SinglyLinkedList:
    def __init__(self):
        self.head = Node(val=-1)  # sentinel

    def search(self, key):
        """
        Iterate through the whole linked list to see if key exists

        Time Complexity: O(N)
        Space Complexity: O(1)
        """
        curr_node = self.head
        while curr_node.next is not None:
            if curr_node.next.val == key:
                return True
            curr_node = curr_node.next
        return False

    def insert(self, key):
        """
        Insert a new node right at the start of the linked list

        Time Complexity: O(N)
        Space Complexity: O(1)
        """
        if self.search(key):
            print(f"key {key} already exists")
            return None
        self.head.next = Node(key, self.head.next)
        print(f"insert key {key} successfully")

    def remove(self, key):
        """
        Remove a node from the linked list

        Time Complexity: O(N)
        Space Complexity: O(1)
        """
        curr_node = self.head
        while curr_node.next is not None:
            if curr_node.next.val == key:
                curr_node.next = curr_node.next.next
                print(f"remove key {key} successfully")
                return None
            curr_node = curr_node.next

    def show(self):
        res = []
        curr_node = self.head
        while curr_node.next:
            res.append(curr_node.next.val)
            curr_node = curr_node.next

        return res

In [None]:
obj = SinglyLinkedList()

In [None]:
for i in range(5):
    obj.insert(i)

obj.show()

insert key 0 successfully
insert key 1 successfully
insert key 2 successfully
insert key 3 successfully
insert key 4 successfully


[4, 3, 2, 1, 0]

In [None]:
obj.remove(1)
obj.show()

remove key 1 successfully


[4, 3, 2, 0]

#### Breaking down each step

In [None]:
class Node:
    """construct a singly linked list"""
    def __init__(self, val, next_node=None):
        self.val = val
        self.next = next_node

In [None]:
# ┌─┐
# │a│
# └┬┘
# ┌▽┐
# │b│
# └┬┘
# ┌▽┐
# │c│
# └┬┘
# ┌▽┐
# │d│
# └─┘

sentinel = Node(-1)
a = Node(1)
b = Node(2)
c = Node(3)
d = Node(4)


sentinel.next = a
a.next = b
b.next = c
c.next = d

In [None]:
# Assignment copies the reference, not the object: both names point to the same node

prev = a
curr = b

print(id(prev) == id(a))
print(id(curr) == id(b))

True
True


#### A Circular Singly Linked List
Run the following block first; all the blocks below depend on it.



In [None]:
# ┌─────┐
# │node1│
# └△─┬──┘
#  │┌▽────┐
#  ││node2│
#  │└┬────┘
# ┌┴─▽──┐
# │node3│
# └─────┘


class Node:
    """construct a singly linked list"""
    def __init__(self, val, next_node=None):
        self.val = val
        self.next = next_node


# build circular linked list
node1 = Node(1, None)
node2 = Node(2, None)
node3 = Node(3, None)

node1.next = node2
node2.next = node3
node3.next = node1

# define 2 orphan nodes:
node_test1 = Node(4, None)
node_test2 = Node(4, None)

# get object identity
id1 = id(node1)
id2 = id(node2)
id3 = id(node3)
id_t1 = id(node_test1)
id_t2 = id(node_test2)

In [None]:
print(f"node1 -> {node1.next.val}")
print(f"node2 -> {node2.next.val}")
print(f"node3 -> {node3.next.val}")

node1 -> 2
node2 -> 3
node3 -> 1


In [None]:
# Different instances of the same class are two distinct objects
print(id_t1)
print(id_t2)
print(node_test1 == node_test2)  # distinct instances; the default == compares identity

140587172830992
140587172830288
False


In [None]:
# Changing an instance's attribute does not change the identity of the object (instance)
node_test1.next = node_test2
print(id(node_test1)==id_t1)
print(id(node_test2)==id_t2)

True
True


#### Add a new node

In [None]:
"""
Add a new node
"""

# ┌───────┐
# │ node1 │
# └△─────┬┘
#  │    ┌▽────┐
#  │    │node2│
#  │    └┬────┘
# ┌┴────┐│
# │node4││
# └△────┘│
# ┌┴─────▽─┐
# │ node3  │
# └────────┘

node4 = Node(4, node1)
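# node3.next used to refer to node1. This rebinds only the attribute node3.next to node4;
# it is a reassignment of that attribute, and the name node1 and its object are unchanged.
node3.next = node4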

id4 = id(node4)

In [None]:
# Only an attribute of the object changed, so the identities are the same
id3 == id(node3)
id1 == id(node1)

True

In [None]:
print(f"node1 -> {node1.next.val}")
print(f"node2 -> {node2.next.val}")
print(f"node3 -> {node3.next.val}")
print(f"node4 -> {node4.next.val}")
print("----------")
print(f"node1 -> {node4.next.next.val}")

node1 -> 2
node2 -> 3
node3 -> 4
node4 -> 1
----------
node1 -> 2


#### Delete a node

In [None]:
# ┌─────┐
# │node1│
# └△─┬──┘
#  │┌▽────┐
#  ││node2│
#  │└┬────┘
# ┌┴─▽──┐
# │node3│
# └─────┘

# Does not work: this only rebinds the name node4 to the node3 object (an alias).
# The original node4 object loses its name but is still linked from node3.next.
node4 = node3
print(id(node4) == id(node3))  # True; reassignment of node4

True


In [None]:
id4 == id(node3.next)

True

In [None]:
node3.next = node1

In [None]:
print(f"node1 -> {node1.next.val}")
print(f"node2 -> {node2.next.val}")
print(f"node3 -> {node3.next.val}")

node1 -> 2
node2 -> 3
node3 -> 1


#### Replace a Node
Goal: replace `node3` in the diagram with `node4`

In [None]:
# That is:
# ┌─────┐
# │node1│
# └△─┬──┘
#  │┌▽────┐
#  ││node2│
#  │└┬────┘
# ┌┴─▽──┐
# │node3│
# └─────┘
# becomes ==>
# ┌─────┐
# │node1│
# └△─┬──┘
#  │┌▽────┐
#  ││node2│
#  │└┬────┘
# ┌┴─▽──┐
# │node4│
# └─────┘

Wrong Implementation:

In [None]:
print("id3_old:", id3)

node4 = Node(4, node1)
# node3 now refers to the new object; the old object lost this name
# but still exists (it is still node2.next)
node3 = node4

print(id(node3) == id(node4))
print("id3_new:", id(node3))  # differs from id3_old

id3_old: 139631081064720
True
id3_new: 139631081215760


In [None]:
# The structure is still unchanged, as in the diagram below
# ┌─────┐
# │node1│
# └△─┬──┘
#  │┌▽────┐
#  ││node2│
#  │└┬────┘
# ┌┴─▽────────────────────┐
# │node3_without_reference│
# └───────────────────────┘


print(f"node1: {node1.val} -> node2: {node1.next.val}")
print(f"node2: {node2.val} -> node2.next: {node2.next.val}")
print(f"node2.next: {node2.next.val} -> node2.next.next: {node2.next.next.val}")
print("---------------------")
print(f"node3: {node3.val} -> node1: {node3.next.val}")

node1: 1 -> node2: 2
node2: 2 -> node2.next: 3
node2.next: 3 -> node2.next.next: 1
---------------------
node3: 4 -> node1: 1


Correct Version:

In [None]:
print("id3_old:", id3)

node4 = Node(4, node1)
node2.next = node4  # update the linkage

print(id(node2.next) == id(node4))
print("id4:", id(node2.next))  # differs from id3_old

id3_old: 139631080552464
True
id4: 139631080897872


In [None]:
# ┌─────┐
# │node1│
# └△─┬──┘
#  │┌▽────┐
#  ││node2│
#  │└┬────┘
# ┌┴─▽──┐
# │node4│
# └─────┘

print(f"node1: {node1.val} -> node2: {node1.next.val}")
print(f"node2: {node2.val} -> node2.next: {node2.next.val}")
print(f"node2.next: {node2.next.val} -> node2.next.next (node1): {node2.next.next.val}")

node1: 1 -> node2: 2
node2: 2 -> node2.next: 4
node2.next: 4 -> node2.next.next (node1): 1
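The correct version boils down to updating the predecessor's `next` pointer rather than rebinding a variable name. A minimal sketch of that step as a function; `replace_next` is a hypothetical helper, and detaching the old node is an extra choice made here:

```python
class Node:
    """A node of a singly linked list (same shape as the Node used above)."""
    def __init__(self, val, next_node=None):
        self.val = val
        self.next = next_node


def replace_next(pred, new_val):
    """Hypothetical helper: replace the node after `pred` with a new node holding `new_val`."""
    old = pred.next
    pred.next = Node(new_val, old.next)   # re-link through the predecessor
    old.next = None                       # detach the old node (optional)
    return old


# circular list: node1 -> node2 -> node3 -> node1
node1, node2, node3 = Node(1), Node(2), Node(3)
node1.next, node2.next, node3.next = node2, node3, node1

removed = replace_next(node2, 4)
print(node2.next.val, node2.next.next is node1, removed.val)
```

The name `node3` still refers to the removed object afterwards; only the links inside the list changed.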


### Doubly Linked List

In [None]:
class ListNode:
    def __init__(self, val=0, prev_node=None, next_node=None):
        self.val = val
        self.prev = prev_node
        self.next = next_node


class DoublyLinkedList:
    """ ✅ tested
    good node names: pred -> curr -> succ

    Time Complexity: O(1) for add_at_head & add_at_tail;
                     O(min(k, N-k)) for get, add_at_index, and delete_at_index;
                     where k is the index of the element to get, add or delete
    Space Complexity: O(1) for all operations
    """
    def __init__(self):
        self.head = ListNode()
        self.tail = ListNode()
        self.head.next, self.tail.prev = self.tail, self.head  # remember to link the two sentinels!

        self.size = 0


    def get(self, index: int) -> int:
        """
        Get the value of the index^th node in the linked list. If the index is invalid, return -1.
        Returning the node itself, rather than only its value, would also be possible.
        """
        if index < 0 or index >= self.size:
            return -1

        # choose fastest way: move from head or from tail
        if index + 1 <= self.size // 2:
            curr = self.head
            for _ in range(index + 1):
                curr = curr.next
        else:
            curr = self.tail
            for _ in range(self.size - index):  # sanity check: index = self.size - 1 takes one step back from the tail
                curr = curr.prev
        return curr.val


    def add_at_head(self, val: int) -> None:
        """
        Add a node of value val before the first element of the linked list.
        After the insertion, the new node will be the first node of the linked list.
        """
        self.add_at_index(0, val)


    def add_at_tail(self, val: int) -> None:
        """
        Append a node of value val as the last element of the linked list.
        """
        self.add_at_index(self.size, val)


    def add_at_index(self, index: int, val: int) -> None:
        """
        Add a node of value val before the index^th node in the linked list.
        If index equals the length of the linked list, the node will be appended to the end of the linked list.
        If index is greater than the length, the node will not be inserted.

        This part differs slightly from the singly linked list version.
        """
        if index < 0 or index > self.size:
            return

        new_node = ListNode(val=val)

        if index + 1 <= self.size // 2:
            curr = self.head  # curr node is the node being indexed
            for _ in range(index + 1):
                curr = curr.next
        else:
            curr = self.tail
            for _ in range(self.size - index):
                curr = curr.prev
        pred = curr.prev

        new_node.next = curr
        new_node.prev = pred
        pred.next = new_node
        curr.prev = new_node

        self.size += 1


    def delete_at_index(self, index: int) -> None:
        """
        Delete the index^th node in the linked list, if the index is valid.
        """
        if index >= self.size or index < 0:
            return

        if index + 1 <= self.size // 2:
            curr = self.head  # the node to be deleted
            for _ in range(index + 1):
                curr = curr.next
        else:
            curr = self.tail
            for _ in range(self.size - index):
                curr = curr.prev

        pred = curr.prev
        next_node = curr.next

        pred.next = next_node
        next_node.prev = pred

        self.size -= 1


    def show(self) -> List[int]:
        """
        show values of each node one by one in a list
        """
        val_nodes = []
        curr = self.head
        for _ in range(self.size):
            curr = curr.next
            val_nodes.append(curr.val)
        return val_nodes


doubly_linked_list = DoublyLinkedList()
doubly_linked_list.add_at_head(10)
doubly_linked_list.add_at_tail(30)
doubly_linked_list.add_at_index(1, 20)
print(doubly_linked_list.show())
print("============")
doubly_linked_list.delete_at_index(2)
print(doubly_linked_list.get(1))
print(doubly_linked_list.show())

[10, 20, 30]
============
20
[10, 20]


#### LRUCache by Doubly Linked List
[collections.OrderedDict/LRUCache](https://colab.research.google.com/drive/1F48ctQ05BFDvyj_pq9hqDz0wtK7TnULE#scrollTo=n72u157Wsf8R)

In [None]:
"""
The core requirement is dict + ordering, so an OrderedDict (HashMap + DoublyLinkedList) is enough.
1. Why HashMap + DoublyLinkedList?

A queue (deque) was considered at first, but moving an element from the middle of a queue to the head is O(N).
With an array, moving a node from the middle to the front is also O(N).
So the DoublyLinkedList remains the best solution.

- dict: key -> DLinkedNode
- DLinkedNode attributes: key, value, prev, next
"""

class DLinkedNode():
    def __init__(self, key=0, value=0, prev=None, _next=None):
        self.key = key
        self.value = value
        self.prev = prev
        self.next = _next


class LRUCache():
    """
    Implemented by a Doubly Linked List and a HashMap
    - DLinkedNode is the value of HashMap
    - DLinkedNode needs 2 key ops:
        1. always add new node from head
        2. delete last node
    """

    # These helpers operate on head & tail, so they cannot live inside the DLinkedNode class
    def _add_node(self, node):
        """
        Always add the new node right after head.

        Time: O(1)
        """
        # adding in new node
        node.prev = self.head
        node.next = self.head.next

        # reconnect self.head
        self.head.next.prev = node  # the former first node becomes the second
        self.head.next = node

    def _remove_node(self, node):
        """
        Remove an existing node from the linked list.

        Time Complexity: O(1)
        """
        prev = node.prev
        _next = node.next

        prev.next = _next
        _next.prev = prev

    def _move_to_head(self, node):
        """
        Move certain node in between to the head.

        Time: O(1)
        """
        self._remove_node(node)
        self._add_node(node)

    def _pop_tail(self):
        """
        Pop the current tail.

        Time: O(1)
        """
        res = self.tail.prev
        self._remove_node(res)
        return res

    def __init__(self, capacity):
        """
        :type capacity: int
        """
        self.cache = {}
        self.capacity = capacity

        # self.head: start of queue
        # self.tail: end of queue
        self.head, self.tail = DLinkedNode(), DLinkedNode()

        self.head.next = self.tail
        self.tail.prev = self.head


    def get(self, key):
        """
        :type key: int
        :rtype: int
        """
        node = self.cache.get(key, None)
        if not node:
            return -1

        # move the accessed node to the head;
        self._move_to_head(node)

        return node.value

    def put(self, key, value):
        """
        :type key: int
        :type value: int
        :rtype: None
        """
        node = self.cache.get(key)

        if not node:  # node is None
            new_node = DLinkedNode(key=key, value=value)

            self.cache[key] = new_node  # add key: newNode to cache
            self._add_node(new_node)

            if len(self.cache) > self.capacity:
                # pop the tail
                tail = self._pop_tail()
                del self.cache[tail.key]

        else:  # node exists and update its value
            node.value = value
            self._move_to_head(node)


In [None]:
lru = LRUCache(2)

lru.put(1, 1)
print(f"lru_cache: { {key: lru.cache[key].value for key in lru.cache} }")

lru_cache: {1: 1}


In [None]:
lru.put(2, 2)
print(f"lru_cache: { {key: lru.cache[key].value for key in lru.cache} }")

lru_cache: {1: 1, 2: 2}


In [None]:
val = 1
print(f"lru get {val}: {lru.get(val)}")
print(f"lru_cache: { {key: lru.cache[key].value for key in lru.cache} }")

lru get 1: 1
lru_cache: {1: 1, 2: 2}


In [None]:
lru.put(3, 3)
print(f"lru_cache: { {key: lru.cache[key].value for key in lru.cache} }")

lru_cache: {1: 1, 3: 3}


In [None]:
lru.get(3)

3

In [None]:
lru.get(2)

-1

In [None]:
lru.get(1)

1
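As the docstring above notes, `collections.OrderedDict` already combines a hash map with ordering, so the same behaviour can be sketched much more compactly. `OrderedDictLRU` is a hypothetical name; unlike the hand-written version, this sketch keeps the most recently used key at the end rather than at the head.

```python
from collections import OrderedDict


class OrderedDictLRU:
    """Hypothetical name; same get/put contract as the LRUCache above."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key):
        if key not in self.cache:
            return -1
        self.cache.move_to_end(key)   # mark as most recently used
        return self.cache[key]

    def put(self, key, value) -> None:
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used (front)


lru = OrderedDictLRU(2)
lru.put(1, 1)
lru.put(2, 2)
first = lru.get(1)   # touches key 1, so key 2 becomes the eviction candidate
lru.put(3, 3)        # evicts key 2
print(first, list(lru.cache.items()))
```

Both `move_to_end` and `popitem(last=False)` are O(1), which matches the complexity of the doubly linked list version.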

## HashSet & HashMap


- Keys of a `dict` must be **`hashable`**
- Elements of a `set` must be **`hashable`**
(for built-in types this roughly means **immutable**; exception: `User Defined Class instance`, which is mutable yet hashable by default)

ref: [https://stackoverflow.com/a/19371472](https://stackoverflow.com/a/19371472)

Therefore,
Values that are not `hashable`, that is, values containing `lists`, `dictionaries` or other `mutable` types (that are compared by value rather than by object identity) __may not__ be used as keys of `dict` or elements of `set`.

ref: [https://docs.python.org/3/library/stdtypes.html#mapping-types-dict](https://docs.python.org/3/library/stdtypes.html#mapping-types-dict)


>**`hashable`**: [https://docs.python.org/3/glossary.html#term-hashable](https://docs.python.org/3/glossary.html#term-hashable) <br>
Most of Python’s `immutable` built-in objects are `hashable`; `mutable` containers (such as `lists` or `dictionaries`) are not;
`immutable` containers (such as `tuples` and `frozensets`) are only `hashable` if their elements are `hashable`.
Objects which are instances of user-defined classes are `hashable` by default. They all compare unequal (except with themselves), and their hash value is derived from their `id()`.
>

In [None]:
set([1, 2, 3])

{1, 2, 3}

In [None]:
# list is not hashable (it is mutable)
set([[1,2,3], [4,5,6]])

TypeError: unhashable type: 'list'

In [None]:
{"1": 10, "2": 20}

{'1': 10, '2': 20}

In [None]:
# list is not hashable
{["1"]: 10, ["2"]: 20}

TypeError: unhashable type: 'list'

In [None]:
# set is not hashable
{{"1"}: 10, {"2"}: 20}

TypeError: unhashable type: 'set'
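The checks above can be wrapped in a small helper that reports hashability without raising. `is_hashable` is a hypothetical name, and the sketch simply relies on `hash()` raising `TypeError` for unhashable objects:

```python
def is_hashable(obj) -> bool:
    """Hypothetical helper: True if hash(obj) succeeds."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


checks = {
    "tuple of ints": is_hashable((1, 2, 3)),
    "tuple holding a list": is_hashable((1, [2, 3])),   # immutable container, unhashable element
    "frozenset": is_hashable(frozenset({1, 2})),
    "set": is_hashable({1, 2}),
    "list": is_hashable([1, 2]),
    "plain object": is_hashable(object()),              # hashable by default
}
for name, result in checks.items():
    print(f"{name}: {result}")
```

The tuple case shows that immutability of the container alone is not enough: every element must be hashable too.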

### Basic Python API

### **Immutable** vs **hashable**
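- Most immutable built-in objects are hashable, but immutable is not strictly a subset of hashable: a tuple containing a list is immutable yet unhashable
- Keys of a `dict` and elements of a `set` have to be `hashable`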
    - ie. `hash()` which calls `.__hash__()` under the hood should work

ref: https://www.programiz.com/python-programming/methods/built-in/hash

![picture](https://drive.google.com/uc?id=1r4tMlWBvSdsbVuQ2Rd_vS9gg477Ad7zG)
- As shown in the figure above, `lists` and `dictionaries` are not hashable.
- Chet: an instance of a user-defined class can be hashable without being immutable

In [None]:
hash("Chet")

8157800170281132188

In [None]:
class Person:
    def __init__(self, age: int, name: str) -> None:
        self.age = age
        self.name = name

    def __eq__(self, other) -> bool:
        """
        Optional: every object gets a default __eq__ that compares by identity.
        other: the object compared against self, expected to be another Person
        """
        if not isinstance(other, Person):
            return NotImplemented
        return self.age == other.age and self.name == other.name

    def __hash__(self) -> int:  # override __hash__()
        return hash((self.age, self.name))

person1 = Person(18, "chet")
person2 = Person(18, "chet")
print(hash(person1) == hash((18, "chet")))
print(person1 == person2)
print(person1 is person2)

True
True
False
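Defining `__eq__` and `__hash__` together is what lets equal instances behave as one key in a `set` or `dict`. A minimal sketch with hypothetical classes `Point` and `OnlyEq`; the second one illustrates that a class overriding `__eq__` without defining `__hash__` has its `__hash__` set to `None` and becomes unhashable:

```python
class Point:
    def __init__(self, x: int, y: int) -> None:
        self.x = x
        self.y = y

    def __eq__(self, other) -> bool:
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self) -> int:
        return hash((self.x, self.y))   # equal points must have equal hashes


class OnlyEq:
    """Overrides __eq__ only, so Python sets __hash__ to None."""
    def __init__(self, v) -> None:
        self.v = v

    def __eq__(self, other) -> bool:
        return isinstance(other, OnlyEq) and self.v == other.v


p1, p2 = Point(1, 2), Point(1, 2)
unique_points = {p1, p2}      # equal and same hash: stored once
lookup = {p1: "first"}
found = lookup[p2]            # p2 finds the entry stored under p1

try:
    hash(OnlyEq(1))
    only_eq_hashable = True
except TypeError:
    only_eq_hashable = False

print(len(unique_points), found, only_eq_hashable)
```

Hashing on fields only stays safe as long as those fields are not modified while the object sits in a `set` or `dict`.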


In [None]:
# dropping  __eq__()
class Person:
    def __init__(self, age: int, name: str) -> None:
        self.age = age
        self.name = name

    def __hash__(self) -> int:
        return hash((self.age, self.name))

person = Person(18, "chet")
hash(person) == hash((18, "chet"))

True

In [None]:
# dropping __hash__()

class Person:
    def __init__(self, age: int, name: str) -> None:
        self.age = age
        self.name = name

person = Person(18, "chet")
print(hash(person))
print(hash((18, "chet")))  # differs: the default hash is derived from id(person), not from the fields
print(hash(person) == hash((18, "chet")))

8731891874529
-260688521825975685
False


In [None]:
# Error: using a list as a dictionary key (lists are not hashable)
hash_map = dict()
key = [1, 2, 3]
hash_map[key] = "test_value"

TypeError: unhashable type: 'list'

### Instances of user-defined classes as dict key

In [None]:
# instances of user-defined classes are hashable by default.
# They all compare unequal (except with themselves),
# and their hash value is derived from their `id()`.

class Node:
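    def __init__(self, val=-1, neighbour=None):
        self.val = val
        # avoid a shared mutable default argument: each node gets its own list
        self.neighbour = neighbour if neighbour is not None else []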

    def __repr__(self):
        """return instance id"""
        return f"root{self.val}_{id(self)}"

root = Node(val=0)
hash_map = {}
hash_map[root] = "root_node"

In [None]:
hash_map.keys()

dict_keys([root0_140001097892752])

In [None]:
root_dup = Node(val=0)
hash_map[root_dup] = "root_node_dup"
hash_map

{root0_140000967021136: 'root_node_dup', root0_140001097892752: 'root_node'}

### function as `dict` key

In [None]:
hash_map = {}
hash_map[abs] = "abs"
hash_map

{<function abs>: 'abs'}

## Trie
Build the Trie object once and run searches against it afterwards.
The simplest implementation is a dictionary of dictionaries.

### Dict of Dict Implementation

In [None]:
class Trie:
    """Dictionary-of-dictionaries implementation, very straightforward"""
    def __init__(self):
        self.root = {}

    def insert(self, word: str):
        """
        Time Complexity:  O(M), where M is size of word
        Space Complexity: O(M)
        """
        curr = self.root
        for char in word:  # O(M)
            if char not in curr:
                curr[char] = {}
            curr = curr[char]
        curr["$"] = True

    def is_prefix(self, word: str):
        """
        Time Complexity:  O(M), where M is size of word
        Space Complexity: O(1)
        """
        curr = self.root
        for char in word:
            if char not in curr:
                return False
            curr = curr[char]
        return True

    def is_word(self, word: str):
        """
        Time Complexity:  O(M), where M is size of word
        Space Complexity: O(1)
        """
        curr = self.root
        for char in word:
            if char not in curr:
                return False
            curr = curr[char]
        return "$" in curr

In [None]:
inputs = ["Apple", "app", "China", "chita", "China Joy"]
inputs = [word.lower() for word in inputs]
print(f"inputs: {inputs}")

trie = Trie()
for word in inputs:  # avoid shadowing the built-in input()
    trie.insert(word)

trie.root

inputs: ['apple', 'app', 'china', 'chita', 'china joy']


{'a': {'p': {'p': {'$': True, 'l': {'e': {'$': True}}}}},
 'c': {'h': {'i': {'n': {'a': {' ': {'j': {'o': {'y': {'$': True}}}},
      '$': True}},
    't': {'a': {'$': True}}}}}}

In [None]:
trie.is_prefix("chii")

False

In [None]:
trie.is_word("china")

True

In [None]:
trie.is_word("ap")

False

In [None]:
trie.is_prefix("appl")

True
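A natural follow-up to `is_prefix` is listing every stored word under a prefix. The sketch below assumes the same dict-of-dict layout with `"$"` as the end-of-word marker; `build_trie` and `words_with_prefix` are hypothetical helper names, not methods of the class above.

```python
def build_trie(words):
    """Build a dict-of-dict trie; "$" marks the end of a word."""
    root = {}
    for word in words:
        curr = root
        for char in word:
            curr = curr.setdefault(char, {})
        curr["$"] = True
    return root


def words_with_prefix(root, prefix):
    """Return all stored words starting with `prefix`, sorted."""
    curr = root
    for char in prefix:            # walk down to the prefix node
        if char not in curr:
            return []
        curr = curr[char]

    results = []
    stack = [(curr, prefix)]       # iterative DFS below the prefix node
    while stack:
        node, path = stack.pop()
        for key, child in node.items():
            if key == "$":
                results.append(path)
            else:
                stack.append((child, path + key))
    return sorted(results)


trie_root = build_trie(["apple", "app", "china", "chita", "china joy"])
print(words_with_prefix(trie_root, "chi"))
print(words_with_prefix(trie_root, "app"))
```

Walking to the prefix node costs O(P); the traversal below it is proportional to the size of that subtree.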

### Class Implementation


In [None]:
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str):
        curr = self.root
        for char in word:
            if char not in curr.children:
                curr.children[char] = TrieNode()
            curr = curr.children[char]
        curr.is_word = True


    def search(self, word: str):
        curr = self.root
        for char in word:
            if char not in curr.children:
                return False
            curr = curr.children[char]
        return curr.is_word

    def startwith(self, prefix: str):
        curr = self.root
        for char in prefix:
            if char not in curr.children:
                return False
            curr = curr.children[char]
        return True


In [None]:
inputs = ["Apple", "app", "China", "chita", "China Joy"]
inputs = [word.lower() for word in inputs]
print(f"inputs: {inputs}")

trie = Trie()
for word in inputs:  # avoid shadowing the built-in input()
    trie.insert(word)

trie.root.children

inputs: ['apple', 'app', 'china', 'chita', 'china joy']


{'a': <__main__.TrieNode at 0x7f2501ae8810>,
 'c': <__main__.TrieNode at 0x7f2501ae8bd0>}

In [None]:
trie.startwith("chii")

False

In [None]:
trie.startwith("chi")

True

In [None]:
trie.search("china")

True

In [None]:
trie.search("ap")

False

### More Complex Trie Definitions

In [None]:
"""
Prefix & suffix search. Build the tries once so that later queries are fast.
"""

from typing import List, Optional


class Trie:
    def __init__(self) -> None:
        """
        Trie Structure here:
        {
            "char1": {"char2": {...}, "idx": {1}},
            "idx": {1, 2, ...}
        }
        """
        self.root = {}

    def _add_index(self, idx: int, node: dict):
        """
        Add idx to the set stored under the node's "idx" key; the node is a mutable dict, so it is updated in place

        Time Complexity: O(1)
        Space Complexity: O(1)
        """
        if "idx" not in node:
            node["idx"] = set()
        node["idx"].add(idx)

    def add_word(self, index: int, word: str) -> None:
        """
        Time Complexity:  O(M), M is max length of word
        Space Complexity: O(M)
        """
        curr = self.root
        self._add_index(index, curr)

        for char in word:  # O(M)
            if char not in curr:
                curr[char] = {}
            curr = curr[char]
            self._add_index(index, curr)

    def prefix_filter(self, prefix: str) -> set:
        """
        Time Complexity:  O(P), where P = len(prefix)
        Space Complexity: O(1)
        """
        curr = self.root
        for char in prefix:
            if char not in curr:
                return set()
            curr = curr[char]
        return curr["idx"]


class WordFilter:
    """
    Time Complexity: O(MN + QP), where N is the number of words, M the max word length,
                     Q the number of queries and P the prefix/suffix length
    Space Complexity: O(MN), size of tries
    """

    def __init__(self, words: List[str]):
        self.words = words  # keep a reference so f() does not rely on a global variable
        self.trie = Trie()
        self.trie_reverse = Trie()

        for idx, word in enumerate(words):  # Time: O(N)
            self.trie.add_word(idx, word)  # Time: O(M)
            self.trie_reverse.add_word(idx, word[::-1])  # Time: O(M)


    def f(self, prefix: str, suffix: str) -> int:
        res_prefix = self.trie.prefix_filter(prefix)
        res_suffix = self.trie_reverse.prefix_filter(suffix[::-1])
        # this step is also fairly costly: set intersection is
        # O(min(len(res_prefix), len(res_suffix))) on average
        common_indices = res_prefix & res_suffix
        if len(common_indices) == 0:
            return -1
        max_common_indices = max(common_indices)
        print(self.words[max_common_indices])
        return max_common_indices


# Your WordFilter object will be instantiated and called as such:
# obj = WordFilter(words)
# param_1 = obj.f(prefix,suffix)

In [None]:
words = ["apple", "appa", "appppple", "banana", "bambo", "bo"]
prefix = "b"
suffix = "o"

obj = WordFilter(words)
obj.f(prefix, suffix)

bo


5
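To make the nested-dictionary layout from the `Trie` docstring concrete, here is a minimal standalone sketch. The helper name `build_index_trie` is hypothetical; it only mirrors what `add_word` does above and is not part of the original classes.

```python
from typing import Dict, List


def build_index_trie(words: List[str]) -> Dict:
    """Nested-dict trie: every node stores the indices of the words passing through it."""
    root: Dict = {}
    for idx, word in enumerate(words):
        node = root
        node.setdefault("idx", set()).add(idx)  # the root sees every word
        for char in word:
            node = node.setdefault(char, {})  # create the child node on first use
            node.setdefault("idx", set()).add(idx)
    return root


demo_trie = build_index_trie(["bo", "bambo"])
print(demo_trie["b"]["idx"])       # indices of words starting with "b"
print(demo_trie["b"]["o"]["idx"])  # indices of words starting with "bo"
```

Because each node already holds the matching indices, a prefix query only has to walk down the prefix and return the set it finds there.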

# Immutable Data Types
- int
- float
- decimal
- bool
- string
- tuple
- range

The key point is that the value cannot change: any apparent change (e.g. extending a tuple) produces a new object.

## Tuple

In [None]:
list_val = [1, 2, 3]  # list
tuple_val = (10, 20, 30)  # tuple

In [None]:
print("list_val[0]:\t\t", list_val[0])
list_val[0] = 100
print("list_val (updated):\t", list_val)

print("tuple_val[0]:\t\t", tuple_val[0])
tuple_val[0] = 100  # tuple doesn't allow item assignment

list_val[0]:		 1
list_val (updated):	 [100, 2, 3]
tuple_val[0]:		 10


TypeError: 'tuple' object does not support item assignment

In [None]:
list_values = [1, 2, 3]
print(f"Old list_values (id): {id(list_values)}")
list_values += [4, 5, 6]
print(f"New list_values (id): {id(list_values)}")

print()

tuple_values = (1, 2, 3)
print(f"Old tuple_values (id): {id(tuple_values)}")
tuple_values += (4, 5, 6)  # this will be a new object
print(f"New tuple_values (id): {id(tuple_values)}")


Old list_values (id): 139631149472256
New list_values (id): 139631149472256

Old tuple_values (id): 139631177220656
New tuple_values (id): 139631177099760


## Int / String

In [None]:
number = 42
print(id(number))

number += 1
print(id(number))

94336872832800
94336872832832


In [None]:
text = "Data Science"
print(id(text))

text += " with Python"
print(id(text))

139631149045872
139631149067200


# [Queue](https://towardsdatascience.com/dive-into-queue-module-in-python-its-more-than-fifo-ce86c40944ef)
- It is especially useful in `threaded` programming when information must be exchanged safely between multiple threads.
- The CPython source code is worth reading: https://github.com/python/cpython/blob/3.10/Lib/queue.py

## FIFO
i.e. **first in, first out**

The `queue` module provides two `FIFO` queues, `queue.Queue()` and `queue.SimpleQueue()`. `queue.SimpleQueue()` was added in Python 3.7. There are 2 differences between them:

1. `SimpleQueue()` does not support task tracking for **threaded** programs: it has no `task_done()` or `join()`. Threaded programming is discussed later.
2. `SimpleQueue()` is an **unbounded** FIFO queue, while `Queue()` can have an **upper bound** (`maxsize`). In both classes, `get()` on an empty queue blocks until a new element is inserted. In `Queue()`, `put()` on a full queue also blocks until an element is removed; this never happens with `SimpleQueue()`. According to the Python docs, blocking can be disabled with `block=False`: `get()` then raises `queue.Empty` immediately on an empty queue, and `put()` raises `queue.Full` immediately on a full `Queue()`.

### `queue.Queue()`

`block=True` by default.

Since `Queue()` is designed for **multi-threaded** use, it also offers 2 methods that support task tracking, `Queue.task_done()` and `Queue.join()`:
- `Queue.task_done()` indicates that a task fetched from the queue has been processed; it is usually called after `get()`, once the work on that item is finished.
- `Queue.join()` is similar to `Thread.join()`: it blocks the calling thread until all the tasks in the queue have been processed.
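The interplay of `task_done()` and `join()` can be sketched with a single worker thread. This is only an illustration under a few assumptions: the names `task_queue`, `worker` and `results` are made up for this example, and the `None` sentinel is one common convention for stopping a worker, not something the `queue` module requires.

```python
import queue
import threading

task_queue = queue.Queue()
results = []  # only the single worker thread appends, so a plain list is enough here


def worker() -> None:
    while True:
        item = task_queue.get()  # blocks until an item is available
        if item is None:  # sentinel (assumption of this sketch): stop the worker
            task_queue.task_done()
            break
        results.append(item * item)
        task_queue.task_done()  # mark this item as fully processed


thread = threading.Thread(target=worker)
thread.start()

for i in range(5):
    task_queue.put(i)
task_queue.put(None)

task_queue.join()  # returns once task_done() has been called for every put()
thread.join()
print(sorted(results))
```

`join()` counts unfinished tasks rather than queued items, so forgetting a `task_done()` call would make it block forever.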

In [None]:
import queue

q = queue.Queue()  # don't set upper bound
for i in range(5):
    q.put(i)  # put item into the queue

while not q.empty():
    print(q.get())  # remove and return an item from the queue

0
1
2
3
4


In [None]:
q = queue.Queue(maxsize=3)  # set an upper bound

try:
    for i in range(5):
        q.put(i, block=False)
except queue.Full:
    print("Queue is Full with 3 items.")

try:
    for i in range(5):
        print(f"element {q.get(block=False)}")
except queue.Empty:
    print("Queue is already empty")

Queue is Full with 3 items.
element 0
element 1
element 2
Queue is already empty


### `queue.SimpleQueue()`
An unbounded FIFO queue without task tracking.

In [None]:
import queue
simple_q = queue.SimpleQueue()

for i in range(5):
    simple_q.put(i)

while not simple_q.empty():
    print(simple_q.get())

0
1
2
3
4


## LIFO
i.e. **last in, first out**, like a `stack`.

This is implemented in the `queue.LifoQueue()` class. The interface is the same as `queue.Queue()` except for the order in which elements are removed.

In [None]:
import queue

q = queue.LifoQueue()
for i in range(5):
    q.put(i)

while not q.empty():
    print(q.get())

4
3
2
1
0


In [None]:
q = queue.LifoQueue(maxsize=3)

try:
    for i in range(5):
        q.put(i, block=False)
except queue.Full:
    print("Queue is Full with 3 items")

try:
    for i in range(5):
        print(f"element {q.get(block=False)}")
except queue.Empty:
    print("Queue is already empty")

Queue is Full with 3 items
element 2
element 1
element 0
Queue is already empty


## Priority Queue
`queue.PriorityQueue` uses the `heapq` module internally in CPython.

In [None]:
import queue

q = queue.PriorityQueue()

for i in [4,1,3,2,0]:
    q.put(i)
while not q.empty():
    print(q.get())

0
1
2
3
4


The priority queue works not only with numbers but also with more complex types such as tuples or custom classes, as long as the objects are comparable. To make instances of a class comparable, you need to implement the rich comparison methods (at least `__lt__`). A simpler way is `@dataclass(order=True)`, which generates `__lt__`, `__le__`, `__gt__` and `__ge__` for you by comparing the fields in definition order.

In [None]:
from dataclasses import dataclass
from typing import Any
import queue

@dataclass(order=True)
class Item:
    key: int
    value: Any

q = queue.PriorityQueue()

for i in [Item(3,"leiden"), Item(1,"amsterdam"), Item(2,"rotterdam"), Item(1,"utrecht")]:
    q.put(i)
while not q.empty():
    print(q.get())

Item(key=1, value='amsterdam')
Item(key=1, value='utrecht')
Item(key=2, value='rotterdam')
Item(key=3, value='leiden')
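With `order=True`, ties on `key` fall through to comparing `value`, which fails if the values are not orderable. A possible way around this, sketched below, is to exclude the payload from comparisons with `field(compare=False)`. The class name `PrioritizedItem` and the dict payloads are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Any
import queue


@dataclass(order=True)
class PrioritizedItem:
    priority: int
    payload: Any = field(compare=False)  # excluded from ==, <, <=, >, >=


pq = queue.PriorityQueue()
pq.put(PrioritizedItem(2, {"city": "rotterdam"}))
pq.put(PrioritizedItem(1, {"city": "amsterdam"}))
pq.put(PrioritizedItem(1, {"city": "utrecht"}))  # dicts are not orderable, but they are never compared

first = pq.get()
print(first.priority, first.payload)
```

Items with equal priority come out in no guaranteed order, so code that needs a stable order should add an explicit tie-breaker field.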


# Heap
i.e. `min heap` & `max heap`


## Min Heap

In [1]:
import heapq

In [2]:
print("create a heap:")
h = [5, 7, 9, 1, 3]  # create a list
id_1 = id(h)

heapq.heapify(h)  # make list h a min heap
id_2 = id(h)
print(h)

create a heap:
[1, 3, 9, 7, 5]


In [3]:
type(h)  # still a list, but it now satisfies the heap invariant

list

In [4]:
h[0]   # first element of min heap

1

In [5]:
h[1]  # beyond h[0], the elements are not guaranteed to be in sorted order

3

In [6]:
h[2]

9

In [7]:
id_1 == id_2  # inplace updates

True
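The cells above show that only `h[0]` is guaranteed to be the minimum. What `heapify` does guarantee is the heap invariant: every element is less than or equal to its children at indices `2*k + 1` and `2*k + 2`. A small checker makes this visible; the function name `is_min_heap` is hypothetical and not part of `heapq`.

```python
import heapq


def is_min_heap(items: list) -> bool:
    """Check items[k] <= items[2*k + 1] and items[k] <= items[2*k + 2] for every k."""
    n = len(items)
    for k in range(n):
        for child in (2 * k + 1, 2 * k + 2):
            if child < n and items[k] > items[child]:
                return False
    return True


values = [5, 7, 9, 1, 3]
print(is_min_heap(values))  # the raw list violates the invariant
heapq.heapify(values)
print(is_min_heap(values))  # heapify restores it in place
print(values == sorted(values))  # a valid heap is not necessarily a sorted list
```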

In [None]:
# push values to heap
print("\npush values to heap:")
heapq.heappush(h, 10)
print(h)
heapq.heappush(h, -1)
print(h)


push values to heap:
[1, 3, 9, 7, 5, 10]
[-1, 3, 1, 7, 5, 10, 9]


In [None]:
# inspect min value in min heap (top element)
peek_val = h[0]
print(f"\ninspect min/top value in min heap:\n {peek_val}")


inspect min/top value in min heap:
 -1


In [None]:
print("\npop top values from min heap:")
heapq.heappop(h)


pop top values from min heap:


-1

In [None]:
# inspect min value in min heap (top element)
peek_val = h[0]
print(f"\ninspect min/top value in min heap:\n {peek_val}")
print(h)


inspect min/top value in min heap:
 1
[1, 3, 9, 7, 5, 10]


In [None]:
# get size of heap:
len(h)

6

In [None]:
h

[1, 3, 9, 7, 5, 10]

In [None]:
# get n smallest of heap
heapq.nsmallest(3, h)

[1, 3, 5]

In [None]:
# get n largest of heap
heapq.nlargest(3, h)

[10, 9, 7]

## Max Heap
`heapq` implements a min heap; to emulate a max heap, store `negated` values.

In [None]:
maxHeap = []
# heapify builds a min heap, so we store negated values to emulate a max heap
heapq.heapify(maxHeap)
# Push 1, 3 and 2; what is actually stored is -1, -3 and -2
heapq.heappush(maxHeap, 1*-1)
heapq.heappush(maxHeap, 3*-1)
heapq.heappush(maxHeap, 2*-1)
# All stored elements: [-3, -1, -2]

print("maxHeap: ",maxHeap)
# Peek at the maximum, i.e. the smallest stored value times -1
peekNum = maxHeap[0]
# Result: 3
print("peek number: ", peekNum*-1)

# Pop the maximum, i.e. the smallest stored value
popNum = heapq.heappop(maxHeap)
# Result: 3
print("pop number: ", popNum*-1)
# Peek at the maximum after removing 3; result: 2
print("peek number: ", maxHeap[0]*-1)

# All stored elements now: [-2, -1]
print("maxHeap: ",maxHeap)
# Number of elements, i.e. the size of the heap
size = len(maxHeap)
# Result: 2
print("maxHeap size: ", size)


maxHeap:  [-3, -1, -2]
peek number:  3
pop number:  3
peek number:  2
maxHeap:  [-2, -1]
maxHeap size:  2


## HeapSort

In [None]:
import heapq
from typing import Iterable

def heap_sort(x: Iterable) -> list:
    """
    sort in ascending order: min -> max
    """
    heap = list(x)  # work on a copy so that the caller's data is not emptied
    heapq.heapify(heap)
    sorted_x = []

    while heap:
        sorted_x.append(heapq.heappop(heap))

    return sorted_x


def heap_sort_inverse(x: Iterable) -> list:
    """
    sort in descending order: max -> min
    """
    heap = [-1 * i for i in x]  # negate the values to emulate a max heap
    heapq.heapify(heap)
    sorted_x = []

    while heap:
        sorted_x.append(heapq.heappop(heap)*-1)

    return sorted_x

In [None]:
import random

x = [random.randint(1, 20) for i in range(10)]
x

[5, 11, 10, 12, 14, 1, 14, 15, 18, 7]

In [None]:
heap_sort(x)

[1, 5, 7, 10, 11, 12, 14, 14, 15, 18]

In [None]:
x = [random.randint(1, 20) for i in range(10)]

heap_sort_inverse(x)

[19, 15, 15, 15, 14, 9, 8, 8, 4, 1]

## Priority Queue

`(priority, task)` tuples on the heap:

This works as long as no two tasks share a priority. On a tie, Python compares the tasks themselves, which raises `TypeError` in Python 3 if the tasks are not orderable (e.g. dicts).

The official docs give guidance on how to implement `priority queues` using `heapq`:

http://docs.python.org/library/heapq.html#priority-queue-implementation-notes

---

`Heap` elements can be `tuples`. This is useful for assigning comparison values (such as task priorities) alongside the main record being tracked:

eg. `(priority_score, task)`
```python
>>> from heapq import heappush, heappop
>>> h = []
>>> heappush(h, (5, 'write code'))
>>> heappush(h, (7, 'release product'))
>>> heappush(h, (1, 'write spec'))
>>> heappush(h, (3, 'create tests'))
>>> heappop(h)
(1, 'write spec')
```
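One way to avoid comparing the tasks on equal priorities, along the lines of the implementation notes linked above, is to add an increasing sequence number as the second tuple element. The sketch below assumes dict tasks and uses a hypothetical helper `add_task`.

```python
import heapq
import itertools

heap = []
counter = itertools.count()  # monotonically increasing tie-breaker


def add_task(priority: int, task: dict) -> None:
    # (priority, sequence number, task): the sequence number decides ties,
    # so the task objects themselves are never compared.
    heapq.heappush(heap, (priority, next(counter), task))


add_task(1, {"name": "write spec"})
add_task(1, {"name": "review spec"})  # same priority, and dicts are not orderable
add_task(0, {"name": "fix outage"})

order = [heapq.heappop(heap)[2]["name"] for _ in range(3)]
print(order)
```

A side effect of the counter is that tasks with the same priority are popped in insertion order.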

### [heapq.heapify](https://docs.python.org/3/library/heapq.html#heapq.heapify)
`heapq.heapify(x)`: Transform **list** `x` into a `heap`, in-place, in linear time `O(n)`.
- ✅ list of `list`
- ✅ list of `tuple`
- ✅ list of `Class`
- ❌ anything that is not a `list`, e.g. a `dict`

In [None]:
"""
Having a list of list or a list of tuple as heap elements
"""

import heapq

# e.g. get the least frequent character:
# list of list  ✅
pq_list = [[3, "a"], [4, "b"], [1, "c"]]
# list of tuple ✅
pq_tuple = [(3, "a"), (4, "b"), (1, "c")]

heapq.heapify(pq_list)
heapq.heapify(pq_tuple)

print(pq_list[0])
print(pq_tuple[0])

[1, 'c']
(1, 'c')


In [None]:
"""
Having a list of class as heap elements
Problem set: find all edges between any 2 vertices, and save edges to Priority Queue based on edges length
"""

class Edge:
    def __init__(self, p1, p2, dist):
        self.p1 = p1
        self.p2 = p2
        self.dist = dist

    def __lt__(self, other):
        """
        __lt__: `less than (<)` is a must have for sorting
        """
        return self.dist < other.dist

# coordinates of the points (vertices)
points = [[0, 0], [0, 1], [2, 2], [2, 4]]

pq = []
size = len(points)
for i in range(size):
    x1, y1 = points[i]
    for j in range(i+1, size):
        x2, y2 = points[j]
        # Manhattan distance between the two points
        dist = abs(x1 - x2) + abs(y1 - y2)
        edge = Edge(i, j, dist)
        pq.append(edge)

heapq.heapify(pq)  # list of Class ✅
pq

[<__main__.Edge at 0x7f006d2baf90>,
 <__main__.Edge at 0x7f006d2bac90>,
 <__main__.Edge at 0x7f006d2ba610>,
 <__main__.Edge at 0x7f006d2bafd0>,
 <__main__.Edge at 0x7f006d2bae50>,
 <__main__.Edge at 0x7f006d2bae90>]

In [None]:
print(pq[0].p1)
print(pq[0].p2)
print(pq[0].dist)

0
1
1


In [None]:
# heapify only accepts a list ❌; passing a dict raises TypeError
hash_table = {"chet": 150, "zoe": 100, "shen": "20" }
heapq.heapify(hash_table)

TypeError: heap argument must be a list

### [heapq.nlargest](https://docs.python.org/3/library/heapq.html#heapq.nlargest)
`heapq.nlargest(n, iterable, key=None)`

1. `iterable` can be:
    - ✅ `dict`
    - ✅ `Counter` (subclass of `dict`)
2. `key`, if provided, specifies a function of one argument that is used to extract a comparison key from each element in `iterable` (for example, `key=str.lower`).

The call is equivalent to `sorted(iterable, key=key, reverse=True)[:n]`.
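The equivalence with `sorted` can be checked on a small dictionary. The `scores` data below is made up for illustration.

```python
import heapq

scores = {"zoe": 3, "chet": 5, "wingwing": 1, "jojo": 4}

# Iterating a dict yields its keys; key=scores.get ranks each key by its value.
top_two = heapq.nlargest(2, scores, key=scores.get)
same_via_sorted = sorted(scores, key=scores.get, reverse=True)[:2]

print(top_two)
print(top_two == same_via_sorted)
```

`nlargest` is usually the better choice when `n` is small relative to the input, while `sorted` tends to win when `n` approaches the full length.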

In [None]:
# priority queue of dictionary

from collections import defaultdict
import heapq

k = 2
counter = defaultdict(int)
words = ["zoe", "zoe", "zoe", "chet", "chet", "chet", "wingwing"]

for word in words:
    counter[word] += 1

top_k = heapq.nlargest(k, counter.keys(), key=counter.get)
print(top_k)
print(counter)
print(counter.keys())
# same result here, but ties on the count are now broken explicitly by the word itself:
top_k = heapq.nlargest(k, counter.keys(), key=lambda x: (counter[x], x))  # descending order, so "zoe" comes before "chet"
top_k

['zoe', 'chet']
defaultdict(<class 'int'>, {'zoe': 3, 'chet': 3, 'wingwing': 1})
dict_keys(['zoe', 'chet', 'wingwing'])


['zoe', 'chet']

In [None]:
# priority queue of Counter

from collections import Counter
import heapq

# using Counter ✅
counter = Counter(["chet","chet","chet","zoe","zoe","wingwing"])
print(f"counter: {counter}")

k = 2
heapq.nlargest(k, counter.keys(), key=counter.get)

counter: Counter({'chet': 3, 'zoe': 2, 'wingwing': 1})


['chet', 'zoe']

### [heapq.nsmallest](https://docs.python.org/3/library/heapq.html#heapq.nsmallest)

`heapq.nsmallest(n, iterable, key=None)`

Return a list with the `n` smallest elements from the dataset defined by `iterable`.
1. `key`, if provided, specifies a function of one argument that is used to extract a comparison key from each element in `iterable` (for example, `key=str.lower`).

The call is equivalent to `sorted(iterable, key=key)[:n]`.

In [None]:
# priority queue of dictionary

from collections import defaultdict
import heapq

k = 2
counter = defaultdict(int)
words = ["zoe", "zoe", "zoe", "chet", "chet", "chet", "wingwing"]

for word in words:
    counter[word] += 1

print(counter)
print(counter.keys())

top_k = heapq.nsmallest(k, counter.keys(), key=lambda x: (-counter[x], x))  # ascending on (-count, word): highest count first, ties alphabetical
print(top_k)

defaultdict(<class 'int'>, {'zoe': 3, 'chet': 3, 'wingwing': 1})
dict_keys(['zoe', 'chet', 'wingwing'])
['chet', 'zoe']


# `dataclass`
`dataclass` was introduced in Python 3.7 and is used as a class decorator. Under the hood, it generates `__init__`, `__repr__`, `__eq__`, etc. for us.

- `dataclass` can be set to be **mutable** / **immutable**
- A **named tuple** behaves like a `tuple`, while a `dataclass` behaves more like a regular Python class: by default, its attributes are all mutable and can only be accessed by name, not by index.

- ref: [Understand how to use NamedTuple and Dataclass in Python](https://towardsdatascience.com/understand-how-to-use-namedtuple-and-dataclass-in-python-e82e535c3691)
- https://realpython.com/python-data-classes/

In [None]:
from dataclasses import dataclass

@dataclass
class Transaction:
    sender: str
    receiver: str
    date: str
    amount: float  # type hints are mandatory but these types are not enforced at runtime

record = Transaction(sender="jojo", receiver="xiaoxu", date="2020-06-08", amount=1.0)

print(record)
print(f"record.sender: {record.sender}")

# updating attribute of the data_class
record.sender = "gaga"
print(record)
print(f"record.sender: {record.sender}")

Transaction(sender='jojo', receiver='xiaoxu', date='2020-06-08', amount=1.0)
record.sender: jojo
Transaction(sender='gaga', receiver='xiaoxu', date='2020-06-08', amount=1.0)
record.sender: gaga


In [None]:
from dataclasses import dataclass
from typing import Any


@dataclass
class Document:
    title: str
    abstract: str  # type hints are mandatory but these types are not enforced at runtime
    year: Any  # Without a type hint, the field will not be a part of the data class.
               # However, if you do not want to add explicit types to your data class, use `typing.Any`

doc1 = Document(title="ICML", abstract=123, year=2022)
print(f"doc1: {doc1}")

doc2 = Document("NIPS", "lol", 2020)
print(f"doc2: {doc2}")


doc1: Document(title='ICML', abstract=123, year=2022)
doc2: Document(title='NIPS', abstract='lol', year=2020)


In [None]:
from dataclasses import dataclass
from typing import List

@dataclass
class PlayingCard:
    rank: str
    suit: str

@dataclass
class Deck:
    cards: List[PlayingCard]  # use other dataclass as types

queen_of_hearts = PlayingCard('Q', 'Hearts')
ace_of_spades = PlayingCard('A', 'Spades')
two_cards = Deck([queen_of_hearts, ace_of_spades])

print(f"{type(two_cards)}, {type(queen_of_hearts)}")
print(two_cards)

<class '__main__.Deck'>, <class '__main__.PlayingCard'>
Deck(cards=[PlayingCard(rank='Q', suit='Hearts'), PlayingCard(rank='A', suit='Spades')])


In [None]:
from dataclasses import asdict, dataclass, fields
from typing import List, Optional, Type

@dataclass
class AbstractSection:
    content_id: str
    section_title: Optional[str] = None

abs_sec = AbstractSection("ICML")
abs_sec

AbstractSection(content_id='ICML', section_title=None)

In [None]:
fields(abs_sec)[0]

Field(name='content_id',type=<class 'str'>,default=<dataclasses._MISSING_TYPE object at 0x7fe02b9a7d10>,default_factory=<dataclasses._MISSING_TYPE object at 0x7fe02b9a7d10>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),_field_type=_FIELD)

In [None]:
abs_sec_attr_name = fields(abs_sec)[0].name
abs_sec_attr_name

'content_id'

In [None]:
getattr(abs_sec, abs_sec_attr_name)

'ICML'

**Adding Methods**:

You already know that a `dataclass` is just a regular `class`.

That means that you can freely add your own `methods` to a `data class`. As an example, let us calculate the distance between one position and another, along the Earth’s surface.

In [None]:
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

@dataclass
class Position:
    name: str
    lon: float = 0.0
    lat: float = 0.0

    def distance_to(self, other):
        # it is a convention to use `other` to represent other instances
        r = 6371  # Earth radius in kilometers
        lam_1, lam_2 = radians(self.lon), radians(other.lon)
        phi_1, phi_2 = radians(self.lat), radians(other.lat)
        h = (sin((phi_2 - phi_1) / 2)**2
             + cos(phi_1) * cos(phi_2) * sin((lam_2 - lam_1) / 2)**2)
        return 2 * r * asin(sqrt(h))


oslo = Position('Oslo', 10.8, 59.9)
vancouver = Position('Vancouver', -123.1, 49.3)
oslo.distance_to(vancouver)

7181.7841229421165

## Default values

In [None]:
from dataclasses import dataclass

@dataclass
class Position:
    name: str
    lon: float = 0.0
    lat: float = 0.0

print(Position('Null Island'))
print(Position('Greenwich', lat=51.8))
print(Position('Vancouver', -123.1, 49.3))

Position(name='Null Island', lon=0.0, lat=0.0)
Position(name='Greenwich', lon=0.0, lat=51.8)
Position(name='Vancouver', lon=-123.1, lat=49.3)


**Advanced default values**:

In [None]:
RANKS = '2 3 4 5 6 7 8 9 10 J Q K A'.split()
SUITS = '♣ ♢ ♡ ♠'.split()  # Python supports writing source code in UTF-8 by default

from dataclasses import dataclass
from typing import List

@dataclass
class PlayingCard:
    rank: str
    suit: str

@dataclass
class Deck:
    cards: List[PlayingCard]  # use other dataclass as types

def make_french_deck():
    return [PlayingCard(r, s) for s in SUITS for r in RANKS]

make_french_deck()

[PlayingCard(rank='2', suit='♣'),
 PlayingCard(rank='3', suit='♣'),
 PlayingCard(rank='4', suit='♣'),
 PlayingCard(rank='5', suit='♣'),
 PlayingCard(rank='6', suit='♣'),
 PlayingCard(rank='7', suit='♣'),
 PlayingCard(rank='8', suit='♣'),
 PlayingCard(rank='9', suit='♣'),
 PlayingCard(rank='10', suit='♣'),
 PlayingCard(rank='J', suit='♣'),
 PlayingCard(rank='Q', suit='♣'),
 PlayingCard(rank='K', suit='♣'),
 PlayingCard(rank='A', suit='♣'),
 PlayingCard(rank='2', suit='♢'),
 PlayingCard(rank='3', suit='♢'),
 PlayingCard(rank='4', suit='♢'),
 PlayingCard(rank='5', suit='♢'),
 PlayingCard(rank='6', suit='♢'),
 PlayingCard(rank='7', suit='♢'),
 PlayingCard(rank='8', suit='♢'),
 PlayingCard(rank='9', suit='♢'),
 PlayingCard(rank='10', suit='♢'),
 PlayingCard(rank='J', suit='♢'),
 PlayingCard(rank='Q', suit='♢'),
 PlayingCard(rank='K', suit='♢'),
 PlayingCard(rank='A', suit='♢'),
 PlayingCard(rank='2', suit='♡'),
 PlayingCard(rank='3', suit='♡'),
 PlayingCard(rank='4', suit='♡'),
 PlayingCard

In [None]:
@dataclass
class Deck:  # Will NOT work
    cards: List[PlayingCard] = make_french_deck()
    # A plain list default would be shared by all instances of Deck: removing one card
    # from one Deck would remove it from every other Deck as well.
    # dataclass detects this mistake and refuses to create the class.

ValueError: mutable default <class 'list'> for field cards is not allowed: use default_factory

In [None]:
from dataclasses import dataclass, field
from typing import List

@dataclass
class Deck:
    cards: List[PlayingCard] = field(default_factory=make_french_deck)
    # default_factory is called once per instance, so every Deck gets its own list.

Deck()  # full deck of cards

Deck(cards=[PlayingCard(rank='2', suit='♣'), PlayingCard(rank='3', suit='♣'), PlayingCard(rank='4', suit='♣'), PlayingCard(rank='5', suit='♣'), PlayingCard(rank='6', suit='♣'), PlayingCard(rank='7', suit='♣'), PlayingCard(rank='8', suit='♣'), PlayingCard(rank='9', suit='♣'), PlayingCard(rank='10', suit='♣'), PlayingCard(rank='J', suit='♣'), PlayingCard(rank='Q', suit='♣'), PlayingCard(rank='K', suit='♣'), PlayingCard(rank='A', suit='♣'), PlayingCard(rank='2', suit='♢'), PlayingCard(rank='3', suit='♢'), PlayingCard(rank='4', suit='♢'), PlayingCard(rank='5', suit='♢'), PlayingCard(rank='6', suit='♢'), PlayingCard(rank='7', suit='♢'), PlayingCard(rank='8', suit='♢'), PlayingCard(rank='9', suit='♢'), PlayingCard(rank='10', suit='♢'), PlayingCard(rank='J', suit='♢'), PlayingCard(rank='Q', suit='♢'), PlayingCard(rank='K', suit='♢'), PlayingCard(rank='A', suit='♢'), PlayingCard(rank='2', suit='♡'), PlayingCard(rank='3', suit='♡'), PlayingCard(rank='4', suit='♡'), PlayingCard(rank='5', suit='♡
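That `default_factory` produces a fresh object per instance can be verified with a smaller example. The class `Hand` is hypothetical and exists only for this illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hand:  # hypothetical class, used only for this illustration
    cards: List[str] = field(default_factory=list)  # list() is called for each new instance


first_hand = Hand()
second_hand = Hand()
first_hand.cards.append("Q♡")

print(first_hand.cards)
print(second_hand.cards)  # unaffected: each instance owns its own list
```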

## Immutable dataclass

In [None]:
from dataclasses import dataclass

@dataclass(frozen=True)  # make dataclass immutable
class Transaction:
    sender: str
    receiver: str
    date: str
    amount: float

record = Transaction(sender="jojo", receiver="xiaoxu", date="2020-06-08", amount=1.0)
record.sender = "gaga"
# raises FrozenInstanceError, because the dataclass is defined as frozen (immutable)

FrozenInstanceError: cannot assign to field 'sender'
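When a frozen instance needs a changed field, a common approach is to build a modified copy with `dataclasses.replace`. The sketch below uses a hypothetical `Transfer` class that mirrors the frozen `Transaction` above.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Transfer:  # hypothetical name, mirrors the frozen Transaction above
    sender: str
    receiver: str
    amount: float


original = Transfer(sender="jojo", receiver="xiaoxu", amount=1.0)

try:
    original.sender = "gaga"
except FrozenInstanceError as err:
    print(f"assignment rejected: {err}")

updated = replace(original, sender="gaga")  # builds a new instance, original is untouched
print(original)
print(updated)
print(len({original, updated}))  # frozen dataclasses with the default eq=True are hashable
```

Being hashable means frozen instances can be used as dictionary keys or set members, which regular mutable dataclasses cannot do by default.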

## Inheritance
Data classes can be subclassed quite freely.

In [None]:
from dataclasses import dataclass

@dataclass
class Position:
    name: str
    lon: float
    lat: float

@dataclass
class Capital(Position):
    country: str
    # The country field of Capital is added after the three original fields in Position

Capital('Oslo', 10.8, 59.9, 'Norway')

Capital(name='Oslo', lon=10.8, lat=59.9, country='Norway')

Another thing to be aware of is how fields are ordered in a subclass. Starting with the base class, fields are ordered in the order in which they are first defined. If a field is redefined in a subclass, its order does not change. For example, if you define `Position` and `Capital` as follows, the fields of `Capital` are still ordered `name`, `lon`, `lat`, `country`, but the default value of `lat` becomes `40.0`:



In [None]:
from dataclasses import dataclass

@dataclass
class Position:
    name: str
    lon: float = 0.0
    lat: float = 0.0

@dataclass
class Capital(Position):
    country: str = 'Unknown'
    lat: float = 40.0

Capital('Madrid', country='Spain')

Capital(name='Madrid', lon=0.0, lat=40.0, country='Spain')
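The resulting field order can also be inspected directly with `dataclasses.fields`. This sketch repeats the two class definitions from the cell above so that it runs on its own.

```python
from dataclasses import dataclass, fields


@dataclass
class Position:
    name: str
    lon: float = 0.0
    lat: float = 0.0


@dataclass
class Capital(Position):
    country: str = "Unknown"
    lat: float = 40.0  # redefines the default, keeps the original position in the order


field_names = [f.name for f in fields(Capital)]
print(field_names)
print(Capital("Madrid").lat)
```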

# Copy Objects

## Copying Mutable Objects by Reference

In [None]:
values1 = [4, 5, 6]
values2 = values1
print("id(values1)==id(values2):", id(values1)==id(values2))

id(values1)==id(values2): True


In [None]:
# append value to the original list, referenced obj will also be updated
values1.append(7)
print('\n====== values1.append(7) ======')
print("values1 is values2:", values1 is values2)  # ie. print(id(values1)==id(values2))
print("values1:", values1)
print("values2:", values2)


====== values1.append(7) ======
values1 is values2: True
values1: [4, 5, 6, 7]
values2: [4, 5, 6, 7]


In [None]:
values2.append(10)
print('\n====== values2.append(10) ======')
print("values1 is values2:", values1 is values2)  # ie. print(id(values1)==id(values2))
print("values1:", values1)
print("values2:", values2)


====== values2.append(10) ======
values1 is values2: True
values1: [4, 5, 6, 7, 10]
values2: [4, 5, 6, 7, 10]
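Plain assignment never copies, as the cells above show. When an independent object is needed, the standard `copy` module offers shallow and deep copies. The following sketch, with made-up variable names, contrasts the three cases on a nested list.

```python
import copy

nested = [[1, 2], [3, 4]]

alias = nested                # same object, no copy at all
shallow = copy.copy(nested)   # new outer list, inner lists are shared
deep = copy.deepcopy(nested)  # new outer list and new inner lists

nested[0].append(99)

print(alias is nested)
print(shallow[0])  # sees the append through the shared inner list
print(deep[0])     # independent of the original
```

For flat lists of immutable values, a shallow copy (`copy.copy(values)`, `values.copy()` or `values[:]`) is already fully independent.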


## Copying Immutable Objects
Every time we try to update the value of an `immutable` object, a new object is created instead.

In [None]:
text1 = "Python"
text2 = text1
print("id(text1) == id(text2):\t", id(text1) == id(text2))

id(text1) == id(text2):	 True


In [None]:
text1 += " is awesome"
print("text1 is text2:\t\t", text1 is text2)

print("text1:", text1)
print("text2:", text2)

text1 is text2:		 False
text1: Python is awesome
text2: Python


## Mutable object inside an immutable container

In [None]:
skills = ["Programming", "Machine Learning", "Statistics"]
person1 = (129392130, skills)
person2 = person1

print(type(person1))
print(person1)

<class 'tuple'>
(129392130, ['Programming', 'Machine Learning', 'Statistics'])


In [None]:
skills[2] = "Maths"
print(person1)
print(person2)
print()

"The object is still considered immutable because when we talk about the mutability of a container only the identities of the contained objects are implied."


(129392130, ['Programming', 'Machine Learning', 'Maths'])
(129392130, ['Programming', 'Machine Learning', 'Maths'])



'The object is still considered immutable because when we talk about the mutability of a container only the identities of the contained objects are implied.'
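The quoted statement can be checked with `id()`: mutating the inner list changes neither the identity of the tuple nor the identity of the list it references, while rebinding a slot of the tuple is rejected. The variable names below are illustrative.

```python
skills_list = ["Programming", "Statistics"]
person = (129392130, skills_list)

tuple_id_before = id(person)
inner_id_before = id(person[1])

person[1].append("Maths")  # mutates the list, not the tuple

print(id(person) == tuple_id_before)     # the tuple is the same object
print(id(person[1]) == inner_id_before)  # and it still references the same list
print(person)

try:
    person[1] = ["something else"]  # rebinding a slot is what the tuple forbids
except TypeError as err:
    print(f"TypeError: {err}")
```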

In [None]:
person1 is person2

True

In [None]:
person2 += (2,)
person2

(129392130, ['Programming', 'Machine Learning', 'Maths'], 2)

In [None]:
unique_identifier = 42
age = 24
skills = ("Python", "pandas", "scikit-learn")

info = (unique_identifier, age, skills)

print(id(unique_identifier))
print(id(age))
print(info)

94336872832800
94336872832224
(42, 24, ('Python', 'pandas', 'scikit-learn'))


In [None]:
unique_identifier = 50
age += 1
skills += ("machine learning", "deep learning")

print(id(unique_identifier))
print(id(age))
print(info)

94336872833056
94336872832256
(42, 24, ('Python', 'pandas', 'scikit-learn'))


In [None]:
age = 27
print(f"id(age): {id(age)}")

age = age + 1
print(f"id(age=age+1): {id(age)}")

age += 1
print(f"id(age += 1): {id(age)}")

id(age): 94336872832320
id(age=age+1): 94336872832352
id(age += 1): 94336872832384


## `a += 1` vs `a = a + 1`
- `+=` calls the `__iadd__` [method](https://docs.python.org/3.8/reference/datamodel.html#object.__iadd__) (if it exists -- falling back on `__add__` if it doesn't exist)
- whereas `+` calls the `__add__` [method](https://docs.python.org/3.8/reference/datamodel.html#object.__add__) or the `__radd__` [method](https://docs.python.org/3.8/reference/datamodel.html#object.__radd__) in a few cases.

From an API perspective:
1. For `mutable` objects:
   - `__iadd__` is meant to modify the object in place and return that same object.
   - `__add__` should return a new instance.
2. For `immutable` objects: the result is always a new instance.
   - Built-in immutable types such as `int`, `str` and `tuple` do not define `__iadd__`, so `+=` falls back on `__add__` and rebinds the name to the new object.
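The dispatch described above can be observed with a small user-defined class. `Bag` is a hypothetical container written only for this illustration.

```python
class Bag:
    """Hypothetical container used only to show which method each operator calls."""

    def __init__(self, items):
        self.items = list(items)

    def __add__(self, other):
        # `a + b`: build and return a new Bag
        return Bag(self.items + other.items)

    def __iadd__(self, other):
        # `a += b`: mutate in place and return self so the name stays bound to it
        self.items.extend(other.items)
        return self


first = Bag([1, 2])
alias = first

first += Bag([3])  # __iadd__: same object, alias sees the change
same_after_iadd = first is alias
print(same_after_iadd, alias.items)

first = first + Bag([4])  # __add__: new object, alias is left behind
same_after_add = first is alias
print(same_after_add, alias.items, first.items)
```

If `__iadd__` were removed from `Bag`, `first += Bag([3])` would still work, but it would call `__add__` and rebind `first` to a new object.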

In [None]:
a = [1, 2, 3]
b = a
b += [1, 2, 3]  # += inplace modification
print(a)  # [1, 2, 3, 1, 2, 3]
print(b)  # [1, 2, 3, 1, 2, 3]

[1, 2, 3, 1, 2, 3]
[1, 2, 3, 1, 2, 3]


In [None]:
a = [1, 2, 3]
b = a
b = b + [1, 2, 3]  # new objects
print(a)  # [1, 2, 3]
print(b)  # [1, 2, 3, 1, 2, 3]

[1, 2, 3]
[1, 2, 3, 1, 2, 3]
