# Week 7 – Abstract Data Types
## 7.1 – Lists
### Array Lists
In this section we are going to create our own list data structures, starting with the array-backed list, sometimes simply called an array list. 

As we've mentioned countless times now, Python doesn't really support arrays. It does have a built-in module called `array`, which creates collections with a fixed data type. That is one aspect of a traditional array, but Python `array` collections can still grow to accommodate any number of elements. That ability to grow is the genuinely interesting part of the list data structure implementation.
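
Both behaviours are easy to see by trying the `array` module directly. This is only a small sketch: the variable names are mine, and `"i"` is the module's type code for a signed C integer.

```python
from array import array

# "i" is the type code for a signed C int, so every element must be an integer
numbers = array("i", [4, 9, 36])

# Unlike a traditional array, it grows on demand
numbers.append(17)
print(len(numbers))

# ...but the element type is fixed, so other types are rejected
try:
    numbers.append("hello")
    rejected = False
except TypeError:
    rejected = True

print("String rejected:", rejected)
```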

When you create an array in other languages you are required to pick a data type. The fixed type is important because the elements of an array are stored sequentially in memory, so we need to know how big each one is to access any element in $O(1)$ time. 

Example: if each element in an array is 4 bytes (an `int` in Java, and on most platforms in C++), and we know the memory location of the first element is $x$, we can find the memory location of the element at index $n$ (counting from zero) by computing $x+4n$. Languages that support pointers usually adjust for the size of the type for you.
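
We can check this arithmetic from Python, with some caveats. The sketch below assumes CPython and uses the `array` module's `buffer_info()` and `itemsize` together with `ctypes` to peek at raw memory; `element_address` is a helper name I've made up. Reading memory by address like this is purely for illustration and not something you would do in normal Python code.

```python
import ctypes
from array import array

numbers = array("i", [4, 9, 36, 17])

# buffer_info() gives the address of the first element and the element count
base_address, length = numbers.buffer_info()
item_size = numbers.itemsize  # bytes per element, usually 4 for "i"


def element_address(n):
    # x + (element size) * n, with n counted from zero
    return base_address + item_size * n


# Read the integer stored at the computed address of index 2
value = ctypes.c_int.from_address(element_address(2)).value
print(value)  # 36
```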

So how can we store data types that can be variable size, like strings? The answer is that we can create an array of *objects*, which is really an array of *pointers*. Each pointer is a fixed size, and the memory that contains the object itself can be managed to be any size we like.
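
Python lists work this way too: the list stores references, not the objects themselves. A rough way to see this, assuming CPython (where `sys.getsizeof` reports the size of the container only), is to compare two lists holding very different strings. The helper `make_strings` is just a name for this sketch.

```python
import sys


def make_strings(length):
    # Build a three-element list of strings, each `length` characters long
    return [letter * length for letter in "abc"]


short_strings = make_strings(1)
long_strings = make_strings(10_000)

# Each list holds three references, so the containers are the same size
print(sys.getsizeof(short_strings), sys.getsizeof(long_strings))

# The string objects themselves live elsewhere and do differ in size
print(sys.getsizeof(short_strings[0]), sys.getsizeof(long_strings[0]))
```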

That means, for my money, type is not what makes a list distinct from an array. The fact that a list can grow to accommodate any number of elements, however, is interesting. 

If I asked you to make a list (which grows) which uses an array to store its elements, you'd probably come to the right idea quickly enough, so let's just get straight to it:

In [1]:
class ArrayList:
    def __init__(self, initial_capacity=10):
        self.contents = [0] * initial_capacity
        self._size = 0
        self._capacity = initial_capacity
    
    def __is_full(self):
        # Full only when every slot of the backing array is in use
        return self._size == self._capacity
    
    def __grow_if_necessary(self, growth_factor=2):
        if self.__is_full():
            self._capacity *= growth_factor
            new_contents = [0] * self._capacity
            # Copy the existing elements into the larger array
            for i in range(self._size):
                new_contents[i] = self.contents[i]
                
            self.contents = new_contents
        
    def add(self, element):
        self.__grow_if_necessary()
        
        self.contents[self._size] = element
        self._size += 1
        
    def get(self, index):
        if 0 <= index < self._size:
            return self.contents[index]
        raise IndexError("Index out of bounds")
        
    def size(self):
        return self._size
        
        
my_list = ArrayList(initial_capacity=3)
my_list.add(4)
my_list.add(9)
my_list.add(36)
my_list.add(17)
my_list.add(10)
my_list.add(73)
my_list.add(12)
my_list.add(24)

for i in range(my_list.size()):
    print(f"item {i}: {my_list.get(i)}")
    
print()
print("Final size:", my_list._size)
print("Final capacity:", my_list._capacity)

item 0: 4
item 1: 9
item 2: 36
item 3: 17
item 4: 10
item 5: 73
item 6: 12
item 7: 24

Final size: 8
Final capacity: 12


Take a look at the code above – hopefully many of the concepts are clear from reading the implementation, though I will cover the highlights.

Basically, the list creates an array with some initial capacity. Ten is a fine default but we allow this to be specified manually in case you know you're creating a massive list. We keep a pointer which tells us where to place the next item: it obviously starts at zero. The code above gives this pointer the name `_size` because it happens to also be the number of elements in the list. The underscore is there to hint that it's for internal use (within the class) only, but it does not actually prevent external access. Two leading underscores, as on `__is_full`, go one step further: Python mangles the name to `_ArrayList__is_full`, which makes accidental external access harder, though still not impossible.
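
Here is a minimal sketch of the difference between one and two leading underscores. The `Counter` class and its method are invented for this illustration; the renaming of double-underscore names inside a class (called *name mangling*) is standard Python behaviour.

```python
class Counter:
    def __init__(self):
        self._size = 0  # single underscore: a convention only

    def __reset(self):  # double underscore: stored as _Counter__reset
        self._size = 0
        return "reset"


counter = Counter()
print(counter._size)  # works, though we are ignoring the hint

try:
    counter.__reset()  # no attribute with this exact name exists
    blocked = False
except AttributeError:
    blocked = True

print("Direct call blocked:", blocked)

# The method is still reachable under its mangled name
print(counter._Counter__reset())
```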

Every time we add a new element we check if there is space, and if there isn't, we grow the array. This is probably the section which requires the most explanation. First of all, here we've decided to double in size each time. This will work decently, but different implementations use different growth factors (Java's `ArrayList` grows by roughly 1.5 times, for example), and some use a more complicated formula than a single multiplier! Next is the important part: we create a new “array” of the new size, then copy all the elements across from the old one. There is no way to grow an array in place: we just have to allocate new memory, copy the contents, then free the old memory. In languages with arrays there would usually be a built-in function like `memcpy` or `arraycopy` that could do the copying for us, but it's just a for loop anyway.
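
Python's built-in list does something similar behind the scenes. The sketch below assumes CPython, where `sys.getsizeof` reflects the currently allocated capacity of a list; the exact byte counts and growth pattern are implementation details and vary between versions, so treat the output as indicative only.

```python
import sys

items = []
last_size = sys.getsizeof(items)
resize_count = 0

for i in range(64):
    items.append(i)
    current_size = sys.getsizeof(items)
    if current_size != last_size:
        # The backing array was reallocated with some spare capacity
        resize_count += 1
        print(f"length {len(items):2d}: {current_size} bytes")
        last_size = current_size

print("Resizes for 64 appends:", resize_count)
```

The number of resizes is much smaller than the number of appends, which is the whole point of growing by more than one slot at a time.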

As soon as we hit the line `self.contents = new_contents`, the old array (actually a list) is no longer being referenced by any variable in our code. This means its memory can be reclaimed automatically, a process broadly known as *garbage collection*. In CPython, an object is normally freed as soon as its reference count drops to zero, and a separate collector periodically scans for groups of objects that only reference each other. We can play around with data structures in Python without worrying about things like memory leaks, while still understanding the underlying concepts. 
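
One way to watch an object disappear is with a *weak reference*, which refers to an object without keeping it alive. This is a sketch that assumes CPython, where the old buffer is reclaimed as soon as its last reference goes away; other implementations may take longer. The `Buffer` subclass exists only because plain lists cannot be weakly referenced.

```python
import weakref


class Buffer(list):
    """A list subclass, needed because plain lists do not support weak references."""


contents = Buffer([4, 9, 36])
watcher = weakref.ref(contents)  # does not keep the object alive

new_contents = Buffer([0] * 6)
new_contents[:3] = contents  # copy the old elements across
contents = new_contents  # the old buffer now has no references

# In CPython the weak reference is already dead at this point
print(watcher() is None)
```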

If you are interested in learning more about how this bit actually works, I think you are ready to read about the [Python data model](https://docs.python.org/3/reference/datamodel.html), where this is all specified. Python is a specification (how things should work), which omits some implementation details. The implementation of Python we are using is called `CPython`, because some of the core code is written in C.

#### More List Features
The array list class above is a very basic implementation. Most lists contain additional methods for convenience, such as `insert(element, index)` or `remove(index)`. In the case of the array list, these both have the same basic mechanism.

To insert an item `element` into an array list at position `index`:
* Check the array is big enough, grow if necessary
* Starting at the last element in the list (at position `size-1`), work backwards moving each item one space to the right, until you get to `index`
* Once you have moved the item at `index` one space to the right, replace it with `element`
* Increase `size` by one

*(Note: you might be able to think of a slight optimisation if you have to grow at the same time as inserting)*

To remove an item at position `index`:
* Starting at position `index + 1` and moving forwards until you reach `size`, move each item one space to the left (so the item at position `index` gets overwritten)
* Decrease `size` by one
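
The shifting steps above can be sketched on a plain fixed-size buffer, separate from the class. The function names `insert_into` and `remove_from` are mine, and the sketch deliberately leaves out growing and bounds checks, which you will need to handle in the exercise below.

```python
def insert_into(buffer, size, index, element):
    """Shift items right from the end, then place element. Needs a spare slot."""
    for i in range(size - 1, index - 1, -1):
        buffer[i + 1] = buffer[i]
    buffer[index] = element
    return size + 1


def remove_from(buffer, size, index):
    """Shift items left over the removed slot."""
    for i in range(index + 1, size):
        buffer[i - 1] = buffer[i]
    return size - 1


buffer = [4, 9, 36, 17, 0, 0]  # four items in use, two spare slots
size = 4

size = insert_into(buffer, size, 1, 25)
print(buffer[:size])  # [4, 25, 9, 36, 17]

size = remove_from(buffer, size, 0)
print(buffer[:size])  # [25, 9, 36, 17]
```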

***Exercise:*** implement both of these as methods in the ArrayList class, along with any others you might be interested in trying (e.g. `.find(element)`, `.sort()`).

#### Complexity
Try to work out the complexity of each of the following operations on an array list, going by the implementations given above:
* Accessing an item (`get(index)`)
* Appending an item (`add(element)`)
* Inserting an item (`insert(element, index)`)
* Removing an item (`remove(index)`)

I'll reveal the answers further down the page!

#### Iterators
This gives us a chance to briefly mention another Python feature. If you try to use an array list object as the subject of a for-each loop, you will get an error:

In [2]:
for element in my_list:
    print(element)

TypeError: 'ArrayList' object is not iterable

The way the `for` syntax works in Python is using an `iterator`. This is a particular design of object which only exists to churn out one object at a time from a collection, and indicate when they are done (if they are ever done – they could generate objects infinitely).

For an object to be *iterable* it must have a method `.__iter__()` which returns an *iterator* object.

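An iterator object must implement `.__next__()`, which returns the next item each time it is called and raises `StopIteration` when there are no more. (Iterators also implement `.__iter__()`, returning themselves, so they can be used directly in a `for` loop.) We can write our own iterator classes if we want. We can also use a *generator* as an *iterator*.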
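
As a rough sketch of what a hand-written iterator class looks like, here is one that walks over any indexable sequence. The class name `ListIterator` and its attributes are made up for this example.

```python
class ListIterator:
    def __init__(self, items):
        self._items = items
        self._position = 0

    def __iter__(self):
        # An iterator returns itself, so it can be used directly in a for loop
        return self

    def __next__(self):
        if self._position >= len(self._items):
            raise StopIteration  # signals that there is nothing left
        item = self._items[self._position]
        self._position += 1
        return item


collected = [item for item in ListIterator([4, 9, 36])]
print(collected)  # [4, 9, 36]
```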

You've seen *generator expressions* before of the form `(x for x in …)`. It's also possible to write more complex generators as functions that use the keyword `yield`.
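
For comparison, here is a small generator function. The name `squares_up_to` is just an example; the point is that each `yield` hands back one value and pauses the function until the next value is requested.

```python
def squares_up_to(limit):
    n = 1
    while n * n <= limit:
        yield n * n  # pause here until the next value is asked for
        n += 1


squares = list(squares_up_to(40))
print(squares)  # [1, 4, 9, 16, 25, 36]
```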

For our ArrayList class, a generator expression is one simple way to add the `__iter__()` method and make it iterable. If you add this code to the definition, then the for-each syntax will work.

```
    def __iter__(self):
        return (self.contents[i] for i in range(self._size))
```

We don't have time to go into all of the details about iterators, generators, `itertools`, and so on, but I encourage you to read around (e.g. [this tutorial website](https://anandology.com/python-practice-book/iterators.html), or as always [the official documentation](https://docs.python.org/3/howto/functional.html)) if you are interested.

### Linked Lists
The array-backed list is broadly good enough to be used for most purposes, which is why Python uses it for the inbuilt list syntax.

Hopefully you came up with some ideas for the complexities of the various operations. Here are the answers:
* Accessing an item (`get(index)`) – $O(1)$, just accessing an array 
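* Appending an item (`add(element)`) – $O(1)$ *most of the time*, except when it needs to resize, which is $O(n)$; because the capacity doubles, resizes are rare enough that the average (*amortised*) cost per append is still $O(1)$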
* Inserting an item (`insert(element, index)`) – $O(n)$, must go through each element to move
* Removing an item (`remove(index)`) – $O(n)$, same as the above
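
To convince yourself that the occasional $O(n)$ resize doesn't ruin appending, you can count how many element copies a doubling strategy performs in total. This is a simplified model rather than the class above: `count_copies` is a made-up helper that only tracks sizes and starts from a capacity of one.

```python
def count_copies(appends, initial_capacity=1, growth_factor=2):
    """Count elements copied during resizes over a run of appends."""
    capacity = initial_capacity
    size = 0
    copies = 0
    for _ in range(appends):
        if size == capacity:
            copies += size  # every existing element is copied across
            capacity *= growth_factor
        size += 1
    return copies


for n in (10, 1_000, 100_000):
    copies = count_copies(n)
    print(f"{n} appends: {copies} copies, {copies / n:.2f} per append")
```

With doubling, the total number of copies stays below $2n$, so the average cost per append is bounded by a constant.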

If we need to store something in a list then we obviously want to access it again, and in many use cases, we'll want to access it multiple times. So prioritising for this complexity seems reasonable, and the array list is the obvious choice.

But there are occasions where we need to add and remove items from a list a lot. Consider the example of the *queue*. In a previous week we asked you to try to implement a queue with an array by using pointers for the head and tail. A *linear queue* gets stuck if you remove and add items. A *circular queue* avoids this problem by having the queue “wrap around” the array. This works fine for a fixed size queue, but if you were to try to grow the array you'd be in for a headache.
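
As a reminder, a fixed-size circular queue might look something like the sketch below. This is one possible design and not necessarily the one you wrote: it tracks the head index and the number of items (computing the tail from those) instead of storing two separate pointers, and all the names are my own.

```python
class CircularQueue:
    def __init__(self, capacity):
        self._items = [None] * capacity
        self._capacity = capacity
        self._head = 0  # index of the oldest item
        self._size = 0

    def enqueue(self, item):
        if self._size == self._capacity:
            raise OverflowError("Queue is full")
        # The modulo makes the tail wrap around to the start of the array
        tail = (self._head + self._size) % self._capacity
        self._items[tail] = item
        self._size += 1

    def dequeue(self):
        if self._size == 0:
            raise IndexError("Queue is empty")
        item = self._items[self._head]
        self._head = (self._head + 1) % self._capacity
        self._size -= 1
        return item


queue = CircularQueue(3)
queue.enqueue(4)
queue.enqueue(9)
queue.enqueue(36)
first = queue.dequeue()  # 4
queue.enqueue(17)  # wraps around into the slot freed by 4
remaining = [queue.dequeue() for _ in range(3)]
print(first, remaining)  # 4 [9, 36, 17]
```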

Enter another form of list, the *linked list*.

Here is an illustration of a linked list containing the values 4, 9, 36, 17:

<img src="./resources/linked_list.png" width=600 />

The arrows in this illustration represent *pointers* or *references*: each component references the next one in the list. The list itself is not really a single *thing*. These four components might be in totally different regions of memory.

You can think of each component as an *object*, and indeed, this is how we'll achieve the effect in Python. We need to refer to this list somehow, which is done with a reference to the first object: it is traditional to call this the “head” of the list.

Have a look at the code below. The class itself does barely anything, it is just being used like a `struct` in other languages, to combine two pieces of data into one, which also enables us to use references (like pointers).

In [3]:
class LinkedListNode:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node
        
head = LinkedListNode(4)
head.next = LinkedListNode(9)
head.next.next = LinkedListNode(36)
head.next.next.next = LinkedListNode(17)

Now, this is very *structural*, we haven't actually made a *list* yet that supports the operations, but it illustrates the basic idea. 

Let's think about how we append an item in general: we must start at the `head`, then check to see if `.next` is empty. If it is, we add the item. Otherwise, we move onto the `.next` node and repeat the process.

How about *accessing* an item? We start at zero (the `head`) then count as we go through the chain of `.next` references.

We could add these operations to the `LinkedListNode` class itself by leaning on recursion:

In [4]:
class LinkedListNode:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node
        
    def add(self, val):
        if self.next is None:
            self.next = LinkedListNode(val)
        else:
            self.next.add(val)
            
    def get(self, i):
        if i == 0:
            return self.value
        elif self.next is None:
            raise IndexError("i is greater than the length of the list")
        else:
            return self.next.get(i-1)
            
        
        
head = LinkedListNode(4)
head.add(9)
head.add(36)
head.add(17)

print(head.get(3))

17


But for this section, I would rather encapsulate all of the list operations inside a separate class which deals with all the references – this will make it easier to add some additional functionality later.

Now we can provide an interface which is more list-like:

In [5]:
class LinkedListNode:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

class LinkedList:
    def __init__(self):
        self.head = None
        
    def add(self, val):
        if self.head is None:
            self.head = LinkedListNode(val)
        else:
            ptr = self.head
            while ptr.next is not None:
                ptr = ptr.next
                
            ptr.next = LinkedListNode(val)
            
    def get(self, i):
        if self.head is None:
            raise IndexError("Empty list")
        else:
            ptr = self.head
            while ptr.next is not None and i > 0:
                i -= 1
                ptr = ptr.next
                
            if i == 0:
                return ptr.value
            else:
                raise IndexError("i is greater than the length of the list")
            
        
my_list = LinkedList()
my_list.add(4)
my_list.add(9)
my_list.add(36)
my_list.add(17)

print(my_list.get(3))

17


So now we have a list made of linked nodes. But why? Can you see any advantages?

Let's start by pointing out some somewhat obvious disadvantages. Accessing an element in the list now means we have to start at the beginning and count one by one! We've gone from $O(1)$ to $O(n)$. Likewise right now, adding items to the end of the list is also $O(n)$.

But what about adding items to the start of the list?

Suppose we want to insert the number 25. Here's what the list looks like now:

<img src="./resources/llinsert1.png" width=600 />

We can just make the `head` point at a new node object, which itself points at our old first node:

<img src="./resources/llinsert2.png" width=677 />

Unlike the array list, which is $O(n)$, inserting at the start of a standard linked list is $O(1)$. Removing the first element is just as easy.

In [6]:
class LinkedListNode:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

class LinkedList:
    def __init__(self):
        self.head = None
        
    def add(self, val):
        if self.head is None:
            self.head = LinkedListNode(val)
        else:
            ptr = self.head
            while ptr.next is not None:
                ptr = ptr.next
                
            ptr.next = LinkedListNode(val)
            
    def get(self, i):
        if self.head is None:
            raise IndexError("Empty list")
        else:
            ptr = self.head
            while ptr.next is not None and i > 0:
                i -= 1
                ptr = ptr.next
                
            if i == 0:
                return ptr.value
            else:
                raise IndexError("i is greater than the length of the list")
                
    def prepend(self, val):
        self.head = LinkedListNode(val, self.head)
        
    def remove_front(self):
        self.head = self.head.next
        
            
        
my_list = LinkedList()
my_list.add(4)
my_list.add(9)
my_list.add(36)
my_list.add(17)

my_list.remove_front()
my_list.prepend(25)

print(my_list.get(0))
print(my_list.get(1))

25
9


It's not very often you get to write methods like this which are literally a single line!

#### Doubly Linked List
We can also create a linked list where each item points in two directions: to the next node *and* to the previous node. This can be helpful for certain operations: if we want to get an item in the second half of the list, we might want to start at the end and work backwards to find the object. It's still $O(n)$ complexity but it's better in practice.

<img src="./resources/dlinked_list.png" width=800 />

Notice that I said we can *start at the end*. The list structure itself needs to keep track of the head *and the tail* of the list. But herein lies a huge advantage: if we do this, we will be able to add items to the end of the list in $O(1)$ time as well.

Have a close read of the code below, there is a lot to digest!

In [7]:
class DoublyLinkedListNode:
    def __init__(self, value, next_node=None, prev_node=None):
        self.value = value
        self.next = next_node
        self.prev = prev_node

    def __str__(self):
        return str(self.value)


class DoublyLinkedList:
    def __init__(self):
        self.head = None
        self.tail = None
        self.size = 0

    def add(self, val):
        self.tail = DoublyLinkedListNode(val, prev_node=self.tail)
        if self.size == 0:
            self.head = self.tail
        else:
            # update the .next field of the previous last element
            self.tail.prev.next = self.tail

        self.size += 1

    def prepend(self, val):
        self.head = DoublyLinkedListNode(val, next_node=self.head)
        if self.size == 0:
            self.tail = self.head
        else:
            # update the .prev field of the previous first element
            self.head.next.prev = self.head

        self.size += 1

    def get(self, i):
        if i < 0 or i >= self.size:
            raise IndexError(f"Invalid index {i} for list of size {self.size}")
        elif i < self.size / 2:
            ptr = self.head
            while i > 0:
                i -= 1
                ptr = ptr.next
        else:
            ptr = self.tail
            steps = self.size - i - 1
            while steps > 0:
                steps -= 1
                ptr = ptr.prev

        return ptr.value

    def remove_front(self):
        if self.size == 0:
            raise ValueError("Can't remove from empty list")
        elif self.size == 1:
            self.head = None
            self.tail = None
            self.size = 0
        else:
            self.head = self.head.next
            self.head.prev = None
            self.size -= 1

    def remove_end(self):
        if self.size == 0:
            raise ValueError("Can't remove from empty list")
        elif self.size == 1:
            self.head = None
            self.tail = None
            self.size = 0
        else:
            self.tail = self.tail.prev
            self.tail.next = None
            self.size -= 1

    def __iter__(self):
        def iterator():
            # Checking ptr itself (not ptr.next) also handles an empty list
            ptr = self.head
            while ptr is not None:
                yield ptr.value
                ptr = ptr.next

        return iterator()

    def __str__(self):
        return "[" + "⟷".join([str(item) for item in self]) + "]"


my_list = DoublyLinkedList()
my_list.add(4)
my_list.add(9)
my_list.add(36)
my_list.add(17)
my_list.prepend(25)
my_list.add(0)
my_list.prepend(13)

print(my_list)

my_list.remove_front()
my_list.remove_end()

print(my_list)

print(my_list.get(4))

[13⟷25⟷4⟷9⟷36⟷17⟷0]
[25⟷4⟷9⟷36⟷17]
17


The great thing about a doubly linked list is that we can use it as a queue of unlimited size, by adding items to one end and removing them from the other, both with $O(1)$ complexity. A structure that supports adding and removing at both ends has a special name: a double-ended queue or *deque*. And as it happens, Python has one built in, located [in the `collections` module](https://docs.python.org/3/library/collections.html#collections.deque).

If you ever find yourself looking for a data structure which prioritises inserting and removing items, especially at the ends, then the `deque` class is a great tool to reach for.
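
Here is a brief sketch of `deque` in use, showing the four end operations (`append`, `appendleft`, `pop`, `popleft`). The variable names are only for illustration.

```python
from collections import deque

queue = deque()
queue.append(4)  # add to the right-hand end
queue.append(9)
queue.append(36)
queue.appendleft(25)  # add to the left-hand end

first = queue.popleft()  # removes 25
last = queue.pop()  # removes 36

print(first, last, list(queue))  # 25 36 [4, 9]
```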

***Exercise:*** Add a method which allows you to insert elements into an arbitrary position within the doubly linked list, i.e. `.insert(element, index)`. To be more efficient, count from either the head or the tail depending on where the insertion point is.

***Activity:*** Worried about the storage requirements of the doubly linked list? Not unreasonable – you need to store two “links” for every actual element in the list. If the elements are small, that will be a big overhead. You can halve the overhead using an *XOR linked list*. Research this structure yourself online; you can [start here](https://en.wikipedia.org/wiki/XOR_linked_list).

## What Next?
Make sure you have spent significant time with the code on this page – the text is just overview, the implementation explains the rest itself. Maybe you can find some mistakes in it, totally possible!

Once you are done with the material and have tried the exercises, head back to Engage for the next section.