![Erudio logo](img/erudio-logo-small.png)

Data structures are ways of organizing data so that it can be accessed efficiently in a given situation. They are fundamental to every programming language, and programs are built around them. Python offers a somewhat gentler path to learning these fundamentals than many other programming languages.

# Built-In Data Structures

Data structures can be either built in or user-defined (a more advanced set of data structures). The complete collection of *data structures* in Python's `__builtins__` module includes `list`, `dict`, `set`, `frozenset`, `tuple`, `bytearray`, and perhaps `str`, `bytes`, and `complex`. The builtin types gradually shade from *data structures* into *data types*.  In my mind, `int`, `float`, and `bool` fall on the latter side of that divide, but the last three "structures" listed are somewhere in the middle.

We need not draw too fine a point with this distinction since every object in Python has *methods* and *attributes*.  Some of these objects still feel *scalar* in containing *a single value* while others feel like they are composed of many values, with specific relationships among those values also implied.

![DataStructures](img/datastructures.png)

Many additional data structures are available in the `collections` module, in numerous other Python standard library modules, and in numerous third-party libraries including NumPy and Pandas. We will be discussing advanced data structures in the next lesson. 

## Mutable and Immutable Data Structures

One of the important distinctions among Python data structures is their mutability.  For some structures, a user is able to change the values contained within the structure after creation; for others, a user can only create new objects (but the new objects might be based on prior ones).

Of the builtins, `list`, `dict`, `set`, and `bytearray` are mutable.  The remainder are immutable.  Importantly, NumPy arrays and Pandas DataFrames and Series that are discussed later in this course are mutable data structures.
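As a rough way to check this list for yourself, the sketch below tests a sample object of each builtin type against the mutable abstract base classes in `collections.abc`. The sample values are arbitrary, and the check only covers types registered with those base classes, so treat it as an illustration rather than a general mutability test.

```python
from collections.abc import MutableMapping, MutableSequence, MutableSet

# The three abstract base classes that the mutable builtin containers register with
MUTABLE_ABCS = (MutableSequence, MutableMapping, MutableSet)

# One arbitrary sample object per builtin type discussed above
samples = [
    [1, 2], {"a": 1}, {1, 2}, bytearray(b"ab"),     # expected: mutable
    (1, 2), frozenset({1, 2}), "ab", b"ab", 1 + 2j,  # expected: immutable
]

# Map each type name to whether its sample is an instance of a mutable ABC
mutability = {type(obj).__name__: isinstance(obj, MUTABLE_ABCS) for obj in samples}

for name, is_mutable in mutability.items():
    print(f"{name:10} mutable={is_mutable}")
```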

### List versus tuple

In some sense a list and a tuple are mutable and immutable versions of the same data structure.  Both allow numeric indexing, slicing, iteration, and share a few utility methods.  Specifically, both data structures are ordered collections of arbitrary Python objects, with a known length and $O(1)$ access time for elements by index (regardless of how large the structures are).  Only a tuple's length is fixed after creation; a list can grow or shrink.

In [2]:
mylist = [3, 1, 4, 1, 5]
mytup = (3, 1, 4, 1, 5)
len(mylist), len(mytup)

(5, 5)

In [3]:
print(mylist[1:4])
print(mytup[1:4])

[1, 4, 1]
(1, 4, 1)


In [4]:
sum_of_digitsL = 0
for digit in mylist:
    sum_of_digitsL += digit

sum_of_digitsT = 0
for digit in mytup:
    sum_of_digitsT += digit

print(sum_of_digitsL, sum_of_digitsT)

14 14


In [5]:
mylist.count(1), mytup.count(1)

(2, 2)

---

Despite the similarities in their interfaces, tuples are usually data we think of as "records" while lists are usually data we think of as "sequential items of similar data."  A list-of-tuples is a common compound data structure, but in later lessons we see how data frames often serve a similar purpose more powerfully.
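To make the "records" idea concrete, here is a minimal sketch of a list-of-tuples. The record layout `(name, value)` and the rounded values are illustrative choices, not something the lesson prescribes.

```python
# Each tuple is one record: (name, approximate value); the list holds similar records
constants = [("pi", 3.14159), ("e", 2.71828), ("phi", 1.61803)]

# Tuple unpacking in the loop gives each field of the record a name
for name, value in constants:
    print(f"{name:>3} = {value}")

# Pick the record with the largest value (the second field of each tuple)
largest = max(constants, key=lambda record: record[1])

# Collect just the first field from every record
names = [name for name, _ in constants]
print(largest, names)
```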

Lists have methods like `.append()`, `.extend()`, `.remove()` and the statement `del` to modify their items, whereas tuples do not.

In [6]:
mylist.extend([9, 2, 6, 5, 3])
mylist

[3, 1, 4, 1, 5, 9, 2, 6, 5, 3]

In [7]:
del mylist[0]
mylist

[1, 4, 1, 5, 9, 2, 6, 5, 3]

It's possible to construct new tuples based on previous ones, but the point to keep in mind is that they are distinct objects, allocated elsewhere in memory; lists, in contrast, can change while remaining the same object.
Tuples are unchangeable, so you cannot remove items from one, but you can delete the tuple completely.

In [10]:
del mytup[0]  # Raises TypeError; comment this line out to run the line below

# del mytup  # This works: it removes the name `mytup` entirely

In [11]:
mytup = (3, 1, 4, 1, 5)
newtup = mytup[1:] + (9, 2, 6, 5, 3)
newtup

(1, 4, 1, 5, 9, 2, 6, 5, 3)
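The difference in object identity can be checked directly. The sketch below uses `id()` and `is`; the variable names ending in `_same_object` are only for illustration.

```python
# A list modified in place keeps its identity
mylist = [3, 1, 4, 1, 5]
list_id_before = id(mylist)
mylist.extend([9, 2, 6])
list_is_same_object = id(mylist) == list_id_before

# "Extending" a tuple builds a new object and rebinds the name to it
mytup = (3, 1, 4, 1, 5)
original_tup = mytup
mytup = mytup + (9, 2, 6)
tuple_is_same_object = mytup is original_tup

print(list_is_same_object, tuple_is_same_object)
print(original_tup)  # the original tuple is untouched
```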

### Dict

A dictionary is a mapping between (hashable) keys and arbitrary Python objects.  As you dig further into Python, you'll find that almost everything is built around dictionaries (such as classes, modules, packages).  `dict` is somewhat unusual in the Python menagerie in not having a standard immutable twin.

Like most Python objects, `dict` comes with an assortment of useful methods, but also with some convenient syntax that is part of the language itself.

![dict](img/dict.png)

In [12]:
math_constants = {"pi": 3.141592653589793, "e": 2.718281828459045}
math_constants.values()

dict_values([3.141592653589793, 2.718281828459045])

In [13]:
math_constants['phi'] = (1 + 5**0.5) / 2
math_constants.items()

dict_items([('pi', 3.141592653589793), ('e', 2.718281828459045), ('phi', 1.618033988749895)])

In [14]:
math_constants

{'pi': 3.141592653589793, 'e': 2.718281828459045, 'phi': 1.618033988749895}

In [15]:
del math_constants['pi']
math_constants.items()

dict_items([('e', 2.718281828459045), ('phi', 1.618033988749895)])
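Besides indexing and `del`, a few methods and the `in` operator cover most everyday dictionary work. The following is a small sketch; the `tau` entry is added purely as an example.

```python
math_constants = {"e": 2.718281828459045, "phi": 1.618033988749895}

# Membership tests look at keys, not values
has_pi = "pi" in math_constants

# .get() returns None (or a default you supply) instead of raising KeyError
pi_or_none = math_constants.get("pi")

# .setdefault() inserts the key only if it is missing, and returns the stored value
tau = math_constants.setdefault("tau", 6.283185307179586)

# .pop() removes a key and hands back its value
removed = math_constants.pop("e")

# Iterating a dict yields its keys in insertion order
keys = list(math_constants)
print(has_pi, pi_or_none, tau, removed, keys)
```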

In [16]:
import sys
data = {"numbers": math_constants, 
        "nations": ["Indonesia", "Brasil", "Ghana"],
        "version": sys.version}
data

{'numbers': {'e': 2.718281828459045, 'phi': 1.618033988749895},
 'nations': ['Indonesia', 'Brasil', 'Ghana'],
 'version': '3.10.7 (tags/v3.10.7:6cc6b13, Sep  5 2022, 14:08:36) [MSC v.1933 64 bit (AMD64)]'}

In [17]:
sorted(data.items())

[('nations', ['Indonesia', 'Brasil', 'Ghana']),
 ('numbers', {'e': 2.718281828459045, 'phi': 1.618033988749895}),
 ('version',
  '3.10.7 (tags/v3.10.7:6cc6b13, Sep  5 2022, 14:08:36) [MSC v.1933 64 bit (AMD64)]')]

Notice where this next cell goes wrong (*immutable* and *hashable* aren't exactly the same thing, but at first pass it's fine to pretend they are):

In [18]:
{math_constants: "numbers", sys.version: "version", ["Indonesia", "Brasil", "Ghana"]: "nations"}

TypeError: unhashable type: 'dict'

#### Why did the above cell fail? Aren't dicts and lists valid dictionary keys?

Dict keys must be hashable, and Python's built-in mutable containers are not. A hash is meant to be computed from an object's value and must stay constant for the object's lifetime; if a key's contents changed after insertion, its hash would change too and the dictionary could no longer find it.
A list cannot be a key because elements can be added to or removed from it, and a dictionary cannot be a key for the same reason. In the cell above Python reports only the first offender it meets (the `dict`), but the list key would fail the same way.

**It's not impossible; there are definitely sneaky ways around this restriction, but they are not recommended.**
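A small sketch can show where hashability and immutability come apart. The helper `is_hashable` is a hypothetical name introduced here; it simply calls the builtin `hash()` and reports whether that raised `TypeError`.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds (hypothetical helper for illustration)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

results = {
    "tuple of ints": is_hashable((1, 2, 3)),
    "frozenset": is_hashable(frozenset({1, 2})),
    "list": is_hashable([1, 2, 3]),
    "dict": is_hashable({"a": 1}),
    # A tuple is immutable, but it is only hashable if everything inside it is
    "tuple containing a list": is_hashable((1, [2, 3])),
}

for description, ok in results.items():
    print(f"{description:25} hashable={ok}")

# Tuples of hashable items work fine as dictionary keys
grid = {(0, 0): "origin", (1, 0): "east"}
print(grid[(1, 0)])
```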

### Set versus frozenset

A parallel similar to that between `list` and `tuple` exists between `set` and `frozenset`. However, in the case of this pair, `set` is extremely commonly used, and `frozenset` somewhat rarely.

As with dictionary keys, elements of sets must be hashable.  One of the uncommon uses of `frozenset`, in fact, is to allow a kind of set to be an element within another set.

A set (frozen or mutable) contains a collection of Python objects, but without regard to their order or frequency.  Any given object simply is or is not in a set.

While lack of order or count is disadvantageous for some purposes, it is a large benefit in that checking for membership, and also adding an element, is an $O(1)$ operation on average.  For many operations, including the basic *set theoretic* operations one would expect, this efficiency is a very big win.  Let's define three sets of paint colors taken from a store catalog.

In [26]:
wall_colors = {
    "Snow drift", 
    "Wheat field", 
    "Sage leaf",
    "Rawhide",
    "Ash gray",
    "Slate stone",
    "Brick red",
}
trim_colors = {
    "Brick red",
    "Ash gray",
    "Grass green",
    "River blue",
}
outdoor_colors = {
    "Brick red",
    "Sage leaf",
    "Sunny day",
    "Sky blue",
    "Wheat field",
}

We can ask membership questions not only of a single set, but also of a union. For example:

In [27]:
"Pebble path" in (wall_colors | trim_colors)

False

In [28]:
"Rawhide" in (wall_colors | trim_colors)

True

Perhaps we'd like to ask for the intersection of the color sets:

In [29]:
wall_colors & trim_colors

{'Ash gray', 'Brick red'}

Conversely, we might ask which trim colors are not suitable for wall colors:

In [30]:
trim_colors - wall_colors

{'Grass green', 'River blue'}
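A few more set operations follow the same pattern. This sketch repeats two of the color sets so it can run on its own; the operators `^` and `<=` and the method `.isdisjoint()` are standard on both `set` and `frozenset`.

```python
wall_colors = {
    "Snow drift", "Wheat field", "Sage leaf", "Rawhide",
    "Ash gray", "Slate stone", "Brick red",
}
trim_colors = {"Brick red", "Ash gray", "Grass green", "River blue"}

# Symmetric difference: colors in exactly one of the two sets
either_not_both = wall_colors ^ trim_colors

# Subset test: every shared color is, by construction, a wall color
shared_is_subset = (wall_colors & trim_colors) <= wall_colors

# Disjointness: do the trim colors avoid these two wall colors entirely?
no_overlap = trim_colors.isdisjoint({"Snow drift", "Rawhide"})

print(sorted(either_not_both))
print(shared_is_subset, no_overlap)
```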

---

If we try to put sets inside sets, we get an exception.

In [31]:
sets_of_colors = {wall_colors, trim_colors, outdoor_colors}

TypeError: unhashable type: 'set'

---

But we could group frozensets this way:

In [32]:
sets_of_colors = {frozenset(wall_colors), frozenset(trim_colors), frozenset(outdoor_colors)}
sets_of_colors

{frozenset({'Brick red', 'Sage leaf', 'Sky blue', 'Sunny day', 'Wheat field'}),
 frozenset({'Ash gray',
            'Brick red',
            'Rawhide',
            'Sage leaf',
            'Slate stone',
            'Snow drift',
            'Wheat field'}),
 frozenset({'Ash gray', 'Brick red', 'Grass green', 'River blue'})}

Similarly, these frozensets could serve as dictionary keys, for example, associating sets of colors with the various paint stores that sell them.
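A sketch of that idea might look like the following. The store names are made up for illustration, and the palettes are small subsets of the colors defined earlier.

```python
# Keys are frozensets of color names; values are lists of (hypothetical) store names
stores_by_palette = {
    frozenset({"Ash gray", "Brick red"}): ["Hypothetical Paints Co."],
    frozenset({"Grass green", "River blue"}): ["Example Hardware", "Hypothetical Paints Co."],
}

# Element order does not matter when building the lookup key
lookup = stores_by_palette[frozenset(["Brick red", "Ash gray"])]
print(lookup)

# Find every store that carries a palette containing "River blue"
river_blue_stores = sorted(
    store
    for palette, stores in stores_by_palette.items()
    if "River blue" in palette
    for store in stores
)
print(river_blue_stores)
```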

### Strings, bytes, and bytearrays

Python does not have a concept of a "mutable string"; if you really wanted such a thing, you might use a list of characters (each of which is itself a string), and join them together when you needed a string.  You need to do this only uncommonly.  Operations like "adding" strings work fine, but they do so by creating brand new string objects.
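For the uncommon case where in-place edits really are wanted, the list-of-characters approach can be sketched like this; the example word is arbitrary.

```python
# Break a string into a mutable list of one-character strings
chars = list("gnosis")

# Edit in place: replace one character and append another
chars[0] = "G"
chars.append("!")

# Join back into an ordinary (immutable) string when one is needed
result = "".join(chars)
print(result)
```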

In [33]:
techne_episteme_gnosis = 'τέχνη ' + 'επιστήμη ' + 'γνσις'
#"τέχνη " + "ἐπιστήμη " + "γνῶσις"
techne_episteme_gnosis

'τέχνη επιστήμη γνσις'

Each of the shorter strings (Greek words for kinds of *knowledge* in this case) is immutable, but the objects are also short-lived because they are never individually bound to any Python names.

The crucial thing to understand is that immutable strings consist of Unicode code points and have a variety of useful methods.

In [34]:
techne_episteme_gnosis.upper()

'ΤΈΧΝΗ ΕΠΙΣΤΉΜΗ ΓΝΣΙΣ'

In [35]:
techne_episteme_gnosis.split()

['τέχνη', 'επιστήμη', 'γνσις']

In [36]:
techne_episteme_gnosis[6:14]

'επιστήμη'

There is, however, no single way of representing a string as bytes; rather, we need to **encode** it to get bytes.

In [37]:
techne_episteme_gnosis.encode('utf-8')

b'\xcf\x84\xce\xad\xcf\x87\xce\xbd\xce\xb7 \xce\xb5\xcf\x80\xce\xb9\xcf\x83\xcf\x84\xce\xae\xce\xbc\xce\xb7 \xce\xb3\xce\xbd\xcf\x83\xce\xb9\xcf\x82'

In [38]:
techne_episteme_gnosis.encode('iso-8859-7')  # older 8-bit ASCII extension for Greek

b'\xf4\xdd\xf7\xed\xe7 \xe5\xf0\xe9\xf3\xf4\xde\xec\xe7 \xe3\xed\xf3\xe9\xf2'

Absent specifying something else, encoding and decoding are assumed to be UTF-8, and you should avoid using other encodings unless you have a *very good reason*.

In [39]:
techne_episteme_gnosis.encode().decode()

'τέχνη επιστήμη γνσις'

"Enhanced ASCII" encodings are fragile. For example, encoding our Greek clause using such an encoding could be followed by later decoding it as incompatible Cyrillic (which is a silent failure, but produces gibberish—i.e. data corruption—in the new alphabet).  

In [40]:
techne_episteme_gnosis.encode('iso-8859-7').decode('iso-8859-5')

'єнїэч х№щѓєоьч уэѓщђ'

Failing quickly and noisily is nearly always better, since you know to fix the problem.

In [41]:
techne_episteme_gnosis.encode('iso-8859-7').decode()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 0: invalid continuation byte
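One way to act on a noisy failure is to catch the exception and retry with the encoding you believe is correct. The sketch below assumes we somehow know the bytes were really ISO-8859-7; in practice that knowledge has to come from outside the data. It also shows the `errors="replace"` option, which trades the exception for visible replacement characters.

```python
greek = "τέχνη"
raw = greek.encode("iso-8859-7")  # bytes in the older 8-bit Greek encoding

recovered = None
try:
    raw.decode("utf-8")  # fails: these bytes are not valid UTF-8
except UnicodeDecodeError as err:
    print("Decoding failed:", err.reason)
    # Assumption: we know from elsewhere that the data is ISO-8859-7
    recovered = raw.decode("iso-8859-7")

# errors="replace" never raises, but marks undecodable bytes with U+FFFD
lossy = raw.decode("utf-8", errors="replace")

print(recovered)
print(lossy)
```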

---

`bytes` and `bytearray` simply make no claim to be strings at all.  However, because many wire and on-disk protocols nonetheless intend bytes as strings, they have the `.decode()` method that we have already seen.  That method will fail, of course, when bytes are truly not interpretable as strings.  `bytes` also share many of the useful methods of `str` as a convenience.
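A short sketch of those shared methods follows. The header line is an arbitrary example; note that arguments to `bytes` methods must themselves be bytes, and that indexing a single position yields an integer rather than a length-one `bytes`.

```python
header = b"Content-Type: text/plain"

# Many str-like methods work on bytes, with bytes arguments
name, _, value = header.partition(b": ")
upper_name = name.upper()
starts_with_content = header.startswith(b"Content")

# Indexing one position gives an int; slicing gives bytes
first_byte = header[0]
first_slice = header[:1]

# Decode only when the bytes really are text
as_text = value.decode("ascii")

# Mixing str and bytes is an error rather than a silent conversion
try:
    header.startswith("Content")
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(upper_name, starts_with_content, first_byte, first_slice, as_text, mixing_allowed)
```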

Working directly with bytes as such is very often useful.  `bytes` are immutable, but `bytearray` is mutable.  Let's look at a file in the `data/` directory to illustrate this.  Students on Windows systems may not have the `file` utility, but I'll illustrate the file type detected here.

This file is used in a later lesson, but let's read in this file as bytes now, and manipulate it slightly as such.

In [43]:
with open("data/basement.gz", "rb") as fh:
    basement = fh.read()

print(basement[:20])

b'\x1f\x8b\x08\x08z\xbb\xfc@\x00\x03basement\x00t'


In [44]:
ba = bytearray(basement)
ba[10:18]

bytearray(b'basement')

We can modify the contents of a bytearray, and write the contents back to disk.

In [48]:
import os
ba[10:18] = b"driveway"
if not os.path.exists('tmp/'):
    os.makedirs('tmp/')
with open("tmp/driveway.gz", "wb") as fh:
    fh.write(ba)

If we had changed the length of the gzip-compressed file, or substituted anything other than the portion devoted to the stored filename, we'd get unpredictable results. Still, we've worked directly with a binary format easily enough, and more robust code would not be hard to write.
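One possible shape for more robust code is to let the standard library's `gzip` module rebuild the header, so the stored filename may be any length. This is a sketch working on in-memory data rather than the lesson's `data/basement.gz`; the helper name `gzip_with_name` and the payload are made up for illustration.

```python
import gzip
import io

def gzip_with_name(payload: bytes, stored_name: str) -> bytes:
    """Compress payload, recording stored_name in the gzip header (illustrative helper)."""
    buffer = io.BytesIO()
    # mtime=0 keeps the header timestamp fixed so results are reproducible
    with gzip.GzipFile(filename=stored_name, mode="wb", fileobj=buffer, mtime=0) as gz:
        gz.write(payload)
    return buffer.getvalue()

payload = b"some uncompressed content\n"
original = gzip_with_name(payload, "basement")
print(original[10:18])  # the stored filename begins at offset 10

# Rename by decompressing and recompressing; the new name can be longer
renamed = gzip_with_name(gzip.decompress(original), "driveway-and-garage")
print(renamed[10:29])
print(gzip.decompress(renamed) == payload)
```

Recompressing costs more than patching bytes in place, but it avoids any assumption about where the filename sits or how long it is.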

## The Boundary Between Scalars and Data Structures

In Python, even values that we usually think of as scalars have attributes and methods.  It might be natural to think that immutable values are scalar and mutable ones are data structures.  However, we've already seen examples of immutable objects that have quite intricate internal structure (for example, frozensets of frozensets).  Moreover, in many *other* programming languages, "simple" values like numbers and strings are mutable (although not in Python).

The better distinction, in my mind, is asking whether an object *contains* other values.  Unfortunately, the answer here is not obvious.  Take complex numbers:

In [49]:
cnum = 1.5 + 2.3j
cnum.real, cnum.imag

(1.5, 2.3)

In a straightforward way, a complex number contains both a real and imaginary component. Yes, these are immutable, but that's incidental.

In contrast, a `bool` is about as simple a value (a *singleton* even, but let's not go there) as one might want.

In [50]:
trueval = (2 + 2 == 4)
falseval = (4 // 2 == 3)
trueval, falseval

(True, False)

Fair enough, but even numbers, whether integers or floating point numbers, can cause us some uncertainty.

In [51]:
myint = 12345
myfloat = 1.2345
mysum = myint + myfloat
myint, myfloat, mysum

(12345, 1.2345, 12346.2345)

However, let's look at this (is an integer actually two values, as we thought of complex?):

In [52]:
myint.numerator, myint.denominator

(12345, 1)

And likewise, even non-complex numbers have an imaginary component (that happens to equal zero).

In [53]:
myfloat.imag, myfloat.real

(0.0, 1.2345)

However we wish to think of these *simple* values, they definitely have some interesting methods, as well as attributes.

In [54]:
myint.bit_count?

In [55]:
myint.bit_length?

In [56]:
myint.bit_count(), myint.bit_length()

(6, 14)
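The two results above can be cross-checked against the binary representation, and a few other methods on "simple" numbers are worth a look. This sketch uses only builtin methods; the particular values passed to them are arbitrary.

```python
myint = 12345
myfloat = 1.2345

# The binary digits explain both bit_count() and bit_length()
binary = bin(myint)          # a string starting with '0b'
ones = binary.count("1")     # number of set bits
digits = len(binary) - 2     # number of binary digits, ignoring the '0b' prefix

# An int can be laid out as raw bytes (here two bytes, most significant first)
int_bytes = myint.to_bytes(2, "big")

# Floats know their exact ratio of integers and whether they are whole numbers
half_ratio = (0.5).as_integer_ratio()
whole = (2.0).is_integer()
not_whole = myfloat.is_integer()

print(binary, ones, digits, int_bytes, half_ratio, whole, not_whole)
```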

## Comprehensions

Python provides a collection of handy shortcuts for creating its basic data structures.  These are called *comprehensions*, and express both conditionals and limited flow control within the same expressions that create data structures.

Lists, sets, and dicts may be created in this manner.  In all cases, we describe what should be contained within a data structure in terms of what is already inside one or more iterables (which includes data structures, but also sometimes includes other things).

In [57]:
reds_and_blues = [  # Construct a list
    color 
    for color in wall_colors | trim_colors | outdoor_colors 
    if "red" in color or "blue" in color
]
reds_and_blues

['River blue', 'Sky blue', 'Brick red']

In [58]:
reds_and_blues = {  # construct a set
    color
    for color in wall_colors | trim_colors | outdoor_colors
    if "red" in color or "blue" in color
}
reds_and_blues

{'Brick red', 'River blue', 'Sky blue'}

In [59]:
color_to_length = {
    color:len(color)
    for color in wall_colors | trim_colors | outdoor_colors
    if "red" in color or "blue" in color
}
color_to_length

{'River blue': 10, 'Sky blue': 8, 'Brick red': 9}
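Comprehensions can also draw on more than one iterable, and on iterables that are not data structures, such as `range` or `zip`. A minimal sketch follows; the `finishes` list is a made-up example.

```python
colors = ["Brick red", "Sky blue"]
finishes = ["matte", "gloss"]  # hypothetical paint finishes

# Two `for` clauses: every color paired with every finish, in nested-loop order
combos = [f"{color} ({finish})" for color in colors for finish in finishes]

# A range is an iterable but not a container of stored values
even_squares = {n: n**2 for n in range(10) if n % 2 == 0}

# zip pairs up items from two iterables position by position
finish_for_color = {color: finish for color, finish in zip(colors, finishes)}

print(combos)
print(even_squares)
print(finish_for_color)
```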

**Dictionary Comprehensions**

In [61]:

input_list = [1, 2, 3, 4, 5, 6, 7]
   
output_dict = {}
   
for var in input_list:
    if var % 2 != 0:
        output_dict[var] = var**3
    else:
        output_dict[var] = var**2
   
print("Output Dictionary:", output_dict)


Output Dictionary: {1: 1, 2: 4, 3: 27, 4: 16, 5: 125, 6: 36, 7: 343}


In [65]:
# The same logic written as a dict comprehension (here over the numbers 1 to 10)
number_range = 10
output_dict = {x: x**2 if x % 2 == 0 else x**3 for x in range(1, number_range + 1)}
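As a quick check that the loop form and the comprehension form agree when given the same input, here is a sketch that wraps each in a function. The function names are introduced only for this comparison.

```python
def cubes_and_squares_loop(numbers):
    """Odd numbers map to their cube, even numbers to their square (loop form)."""
    result = {}
    for n in numbers:
        if n % 2 != 0:
            result[n] = n**3
        else:
            result[n] = n**2
    return result

def cubes_and_squares_comprehension(numbers):
    """The same mapping, written as a dict comprehension."""
    return {n: n**2 if n % 2 == 0 else n**3 for n in numbers}

input_list = [1, 2, 3, 4, 5, 6, 7]
from_loop = cubes_and_squares_loop(input_list)
from_comprehension = cubes_and_squares_comprehension(input_list)

print(from_loop)
print(from_loop == from_comprehension)
```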