# Weird Data Structures

We normally think of and use data structures like lists: we have a bunch of stuff and we put it into some kind of order. In data science this will usually be the case (along with dataframes and another container, arrays). Some situations and problems, however, are better suited to a very different type of data structure. Here we'll look at a few data structures that are distinctly non-list-like.

## Unordered Collections - Sets and Dictionaries

Lists and tuples are ordered collections of elements. We can access elements by an index that refers to the position of the element in the collection. Sets and dictionaries are a different kind of data structure: they still hold objects, but we don't refer to those objects by order or position.

### Lies of New Python

As of Python 3.7, dictionaries are guaranteed to preserve insertion order: if we iterate over a dictionary, the items come back in the order they were added. This doesn't change how we think of them, as their benefits come from looking things up by key rather than by position. We still don't use indexes to refer to items by position, but the predictable order makes dictionaries more consistent and interchangeable with other data structures. Sets did not get the same change: the language makes no promise about the order in which a set's items come back, so we should never rely on it. So, we'll talk about these data structures as if they are unordered.

For dictionaries, then, the mentions of the order not being maintained are not technically true.
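A small sketch of what this means in practice (the variable names are made up for illustration). The dictionary reports its keys in the order they were inserted, while the comparisons show that order plays no part in whether two dictionaries or two sets are equal.

```python
# Dictionaries remember the order in which keys were inserted (Python 3.7+)
ages = {}
ages["carol"] = 41
ages["alice"] = 30
ages["bob"] = 25
print(list(ages))  # ['carol', 'alice', 'bob']

# Order is not part of what makes two dictionaries or two sets equal
print({"a": 1, "b": 2} == {"b": 2, "a": 1})  # True
print({1, 2, 3} == {3, 2, 1})                # True
```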

### Sets

Sets are unordered collections of unique elements, and those two properties are what set them apart from lists. We can make an empty set and add items, or we can create one from some other data structure (more on this in the dictionary bit).

![Python Sets](../../images/python-sets.jpg "Python Sets")

We don't use sets a lot in data science, but they can be useful for a few things. The most common is removing duplicates from a list, which is the same thing as finding its unique elements. Another is quickly checking whether an item is present in a collection.
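As a quick illustration of de-duplicating a list, here is a minimal sketch with made-up data. Converting to a set drops the repeats but makes no promise about order; `dict.fromkeys()` is one common alternative when the first-seen order matters.

```python
readings = [3, 1, 3, 2, 1, 3]

# set() drops duplicates, but makes no promise about order
unique_values = set(readings)
print(len(unique_values))     # 3
print(sorted(unique_values))  # [1, 2, 3]

# dict.fromkeys() also drops duplicates and keeps first-seen order
unique_in_order = list(dict.fromkeys(readings))
print(unique_in_order)        # [3, 1, 2]
```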

Sets themselves are mutable: we can add and remove items after we've made them. What we can't do is change an item in place, and every item must be an immutable (more precisely, hashable) object such as a number, a string, or a tuple; a list can't go inside a set. If we need a set that can never change, in the way a tuple can't, Python provides `frozenset`.
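To make the mutability question concrete, here is a short sketch with illustrative names: the set itself can grow and shrink, its elements have to be hashable, and `frozenset` is the built-in variant that cannot be changed.

```python
colors = {"red", "green"}
colors.add("blue")       # sets can grow...
colors.discard("red")    # ...and shrink
print(sorted(colors))    # ['blue', 'green']

# Elements must be hashable: a tuple is fine, a list is not
points = {(0, 0), (1, 2)}
try:
    points.add([3, 4])
except TypeError:
    print("a list cannot be a set element")

# frozenset is the immutable variant: it has no add() or remove()
frozen = frozenset(colors)
print(hasattr(frozen, "add"))  # False
```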

#### Set Operations

We can make sets using the `set()` constructor or using curly braces. We can make an empty one with `set()` and add to it (empty braces, `{}`, create a dictionary rather than a set), or make one directly from some other data structure like a list. Items can also be added and removed from a set, and we can check and see if an item is in a set. Note that when we make a set from a list, we automatically filter out all the duplicates.

In [445]:
# Sets
some_stuff = ["where", 23, "to", 4, "go", 5, "now", 23, 23, 23, "go", 23, "go "]

tmp_set = set(some_stuff)
tmp_set.add("hello")
tmp_set.add(2)
tmp_set.add("everyone")
tmp_set.add(4)
tmp_set.add("where")

print(tmp_set)

{'to', 2, 4, 5, 'hello', 'go ', 'everyone', 'now', 23, 'go', 'where'}


We can also create a set by listing the items with curly braces. 

In [446]:
other_set = {"hello", True, "everyone", 4, "purple"}
other_set

{4, True, 'everyone', 'hello', 'purple'}

##### Checking for Membership

We can ask if an element is in a set using the `in` operator.

Note that in the set printed earlier, our double "where" is gone and only one "where" remains, just as only one 23 remains. This is because sets can only contain unique elements.

In [447]:
"hello" in tmp_set

True

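We can remove things with `remove` or `discard`. The difference is what happens when the item isn't in the set: `remove` raises a `KeyError`, while `discard` quietly does nothing.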

In [448]:
tmp_set.remove(4)
tmp_set.discard("hello")
print(tmp_set)

{'to', 2, 5, 'go ', 'everyone', 'now', 23, 'go', 'where'}


In [449]:
# Check again
"hello" in tmp_set

False
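The two removal methods behave differently when the item is missing. A minimal sketch, using a throwaway set so the cells above are unaffected:

```python
letters = {"a", "b"}

letters.discard("z")     # missing item: silently does nothing
try:
    letters.remove("z")  # missing item: raises KeyError
except KeyError:
    print("remove() raised KeyError")

print(sorted(letters))   # ['a', 'b']
```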

#### Set Logic

Sets are an important part of set theory, a branch of discrete math that deals with collections of objects. We can use sets to do some basic set theory operations.
<ul>
<li> We can use the `|` operator to get the union of two sets: all the elements that are in either set.</li>
<li> We can use the `&` operator to get the intersection of two sets: all the elements that are in both sets.</li>
<li> We can use the `-` operator to get the difference of two sets: all the elements in the first set that are not in the second set.</li>
<li> We can use the `^` operator to get the symmetric difference of two sets: all the elements that are in one set or the other, but not both.</li>
<li> We can use the `<=` operator to ask if one set is a subset of another. It is true when all the elements in the first set are also in the second set.</li>
<li> We can use the `>=` operator to ask if one set is a superset of another. It is true when all the elements in the second set are also in the first set.</li>
</ul>

![Set Operations](../images/set_operations.png "Set Operations")

For data science work, these mathematical operations aren't super common. They can, however, simplify things that we would normally do with a loop, such as finding the unique elements of a collection or the elements that two data structures have in common.
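To show how a set operation can stand in for a loop, here is a small sketch with invented course lists. Both approaches find the names that appear in the two lists; the set version just says so more directly.

```python
enrolled_math = ["ana", "ben", "cy", "dee"]
enrolled_art = ["cy", "eli", "ana"]

# Loop version: students taking both courses
both_loop = []
for name in enrolled_math:
    if name in enrolled_art:
        both_loop.append(name)

# Set version: the intersection does the same job
both_set = set(enrolled_math) & set(enrolled_art)

print(sorted(both_loop))  # ['ana', 'cy']
print(sorted(both_set))   # ['ana', 'cy']

# Students in math but not in art
only_math = set(enrolled_math) - set(enrolled_art)
print(sorted(only_math))  # ['ben', 'dee']
```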

In [450]:
# Union
tmp_set | other_set

{2,
 23,
 4,
 5,
 True,
 'everyone',
 'go',
 'go ',
 'hello',
 'now',
 'purple',
 'to',
 'where'}

In [451]:
# Intersection
tmp_set & other_set

{'everyone'}

In [452]:
# Difference
tmp_set - other_set

{2, 23, 5, 'go', 'go ', 'now', 'to', 'where'}

In [453]:
# Symmetric difference
tmp_set ^ other_set

{2, 23, 4, 5, True, 'go', 'go ', 'hello', 'now', 'purple', 'to', 'where'}

In [454]:
# Subset
tmp_set <= other_set

False

In [455]:
# Superset
tmp_set >= other_set

False
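Both comparisons above came out `False`, so here is a sketch of a case where they hold, using two small made-up sets. It also shows the named method forms (`issubset`, `union`, `intersection`, `isdisjoint`), which accept any iterable, whereas the operators need a set on both sides.

```python
small = {1, 2}
big = {1, 2, 3, 4}

print(small <= big)          # True: every element of small is in big
print(big >= small)          # True: big contains all of small
print(small.issubset(big))   # True: method form of <=

# The named methods accept any iterable, not just sets
print(small.union([2, 5]) == {1, 2, 5})  # True
print(big.intersection([4, 9]))          # {4}
print(small.isdisjoint([7, 8]))          # True: nothing in common
```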

### Dictionaries

Dictionaries are unordered mappings for storing objects in key-value pairs. We'll deal with dictionaries more than sets or other varieties of data structures. Previously we saw how lists store objects in an ordered sequence; dictionaries use a key-value pairing instead. This key-value pair allows users to quickly grab objects without needing to know an index location. Dictionaries use curly braces and colons to signify the keys and their associated values.

![Dictionary vs. List](../../images/dict_list.png "Dictionary vs. List")

We can create a dictionary and build it, or we can create it from some other data structure or starting data. 

![Dictionary Creation](../../images/make_dict.jpg "Dictionary Creation")
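A few common ways to create a dictionary, as a hedged sketch with made-up course data: a literal, building up from empty, converting a list of pairs, zipping two parallel lists, and a dict comprehension.

```python
# Literal
grades = {"math": 90, "english": 75}

# Start empty and build it up
built = {}
built["math"] = 90
built["english"] = 75

# From a list of (key, value) pairs
from_pairs = dict([("math", 90), ("english", 75)])

# From two parallel lists
courses = ["math", "english"]
scores = [90, 75]
zipped = dict(zip(courses, scores))

# Dict comprehension
squares = {n: n * n for n in range(4)}

print(grades == built == from_pairs == zipped)  # True
print(squares)  # {0: 0, 1: 1, 2: 4, 3: 9}
```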

Accessing items in a dictionary is done with a similar syntax to that of lists, except that instead of using an index value, you use the key name. Just like in an actual dictionary, the lookup isn't based on a position or index, but on the "key" you provide.

In [456]:
# Sample Dictionary
d = {'key1':'value1','key2':'value2'}
d['key1'] # Call values by their key

'value1'

#### Dictionary Usage

Dictionaries are used quite frequently in Python programming; notably, it is common to pass dictionaries as arguments to functions. We declare a dictionary with curly braces, and then we can access the values by using the key name.

Some common dictionary methods and abilities are:
<ul>
<li> dict.keys() - returns a view of all keys in the dictionary (wrap it in list() to get a list).</li>
<li> dict.values() - returns a view of all values in the dictionary.</li>
<li> dict.items() - returns a view of all key-value pairs in the dictionary, each as a tuple.</li>
<li> in - checks if the key is in the dictionary.</li>
<li> del - deletes a key-value pair from the dictionary.</li>
</ul>

To add a new item to the dictionary we can simply assign a value to a new key. To remove an item from the dictionary we can use the del keyword, which looks a little odd because it is a statement rather than a method: `del d[key]`.

In [457]:
d.items() # Get all items

dict_items([('key1', 'value1'), ('key2', 'value2')])

In [458]:
d.values() # Get all values

dict_values(['value1', 'value2'])

In [459]:
d.keys() # Get all keys

dict_keys(['key1', 'key2'])

In [460]:
# Add a new key-value pair
d["new_value"] = "new_value"

# Delete key1 from d
del d["key1"]

d.items() 

dict_items([('key2', 'value2'), ('new_value', 'new_value')])

In [461]:
# Make it a list
list(d.values())

['value2', 'new_value']
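Two related tools are worth knowing alongside these: `get()`, which avoids an error when a key is missing, and `pop()`, which removes a key and hands back its value. A minimal sketch with an invented dictionary:

```python
inventory = {"apples": 3, "pears": 0}

# Indexing a missing key raises KeyError
try:
    inventory["kiwis"]
except KeyError:
    print("no kiwis key")

# get() returns None, or a default we choose, instead of raising
print(inventory.get("kiwis"))     # None
print(inventory.get("kiwis", 0))  # 0

# Assigning to an existing key overwrites its value
inventory["apples"] = 5

# pop() removes the key and returns the value it held
removed = inventory.pop("pears")
print(removed)    # 0
print(inventory)  # {'apples': 5}
```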

#### Dictionary Uses

Dictionaries are most useful when we have a collection of data that we want to access by name, rather than running through a sequence. If we compare them to a list for things like this, they are much easier to use. Rather than having to look through each item to see if the item we want is there, we can just ask for it by name and the dictionary will find it for us. In general, if we have a bunch of attributes that we want to associate with a single object, we can sensibly use a dictionary to store them.
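A small sketch of that comparison, with made-up contact data: the list version has to scan for a match, while the dictionary is simply asked for the key. The last lines show a dictionary holding several attributes of one object.

```python
# As a list of (name, email) pairs we have to scan for a match
contacts_list = [("ana", "ana@example.com"), ("ben", "ben@example.com")]
found = None
for name, email in contacts_list:
    if name == "ben":
        found = email
        break

# As a dictionary we just ask for it by name
contacts = dict(contacts_list)
print(found == contacts["ben"])  # True

# Several attributes of a single object
student = {"name": "John", "year": 2, "major": "statistics"}
print(student["major"])  # statistics
```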

#### Dictionary Looping

The dictionary is also iterable, or able to provide its items one-by-one. This means that we can loop through it using a for loop, without having to really adapt our loop at all (looping over the dictionary itself gives its keys; `items()` gives key-value pairs). This is one strength of the way things are designed as iterables in Python: we can create a function that loops through our data in a list, then use that same function on data that is stored in a dictionary, or a set, or any other iterable. Since dictionaries preserve insertion order as of Python 3.7, we can expect the order of the items to be consistent, even though we aren't explicitly using those positions for referencing items.
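To illustrate the shared iterable interface, here is a sketch of one function (`count_items` is a hypothetical name) that loops over whatever it is given. The same code works on a list, a set, a dictionary, and a string.

```python
def count_items(collection):
    """Count items by looping; works for any iterable."""
    total = 0
    for _ in collection:
        total += 1
    return total

print(count_items([10, 20, 30]))        # 3
print(count_items({10, 20, 30, 30}))    # 3 (the duplicate 30 is dropped)
print(count_items({"a": 1, "b": 2}))    # 2 (iterating a dict yields its keys)
print(count_items("hello"))             # 5
```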

<b>Note:</b> this is also one example of something we commonly see in the syntax of Python: unpacking several values at once. Each item that `d.items()` hands to the for loop is a single tuple holding a key and a value, and writing `for key, value in ...` unpacks that tuple into two variables. Functions use the same mechanism to return "more than one value": they return a tuple, and the caller can unpack it. Commonly the "main" value is the first, and others follow. If we don't need one of them, we still have to give it a name when unpacking, and by convention we use a throwaway name such as `_`. This is also an example where some simple design choices that we make can have positive or negative unintended impacts. By utilizing the common interface provided by the iterable data structures, our code can be more flexible and more easily reused.

In [462]:
for key, value in d.items():
    print(key, value)

key2 value2
new_value new_value


If we want the items themselves, as tuples, we can give the for loop a single variable. Each item then arrives as one (key, value) tuple; when we asked for the key and value above, that same tuple was unpacked into two separate variables. What `items()` yields is defined on the dictionary object itself, so we need to refer back to the documentation to see what is available.

In [463]:
for item in d.items():
    print(item)

('key2', 'value2')
('new_value', 'new_value')
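The unpacking used in these loops is the same mechanism behind "multiple return values" from a function. A minimal sketch, where `min_and_max` is a hypothetical helper written for this example:

```python
def min_and_max(values):
    # Returns a single tuple: (smallest, largest)
    return min(values), max(values)

# Take the result as one tuple...
result = min_and_max([4, 9, 1])
print(result)     # (1, 9)

# ...or unpack it into separate names
low, high = min_and_max([4, 9, 1])
print(low, high)  # 1 9

# By convention, _ marks a part we don't need
low, _ = min_and_max([4, 9, 1])

# The for loop over items() does the same unpacking for us
for pair in {"a": 1}.items():
    key, value = pair
    print(key, value)  # a 1
```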


## Exercise

Create a class called "StudentGraduation" that does the following:
<ul>
<li> Contains a dictionary of the courses a student has taken and the grade they received in each course. </li>
<li> Contains a method that allows a function call to add or update a grade for a course. If the course is not in the student's dictionary already, add it; if it is, update that record. </li>
<li> Contains a method that will calculate if the student can graduate. </li>
    <ul>
    <li> Consider them graduated if "math", "science", and "english" are all in their course list and they have a passing grade (at least 50%) in each. </li>
    </ul>
<li> Create a method that will print the student's transcript and GPA. </li>
<li> Bonus - add some error checking to not allow any courses that are not "math", "science", "english", "french", or "gym" to be added to the dictionary. Provide an error if an unacceptable course is added. </li>
</ul>

There is some ambiguity here, and that's ok: you can strategize and choose a good way to implement it. This exercise is good practice. In particular, you should think about both how to hold the data and how to allow access to it through useful methods. Remember, from the outside we are asking the "student graduation" object to do something for us; we don't care how it does it, we just want it done. When asking if the student can graduate, we shouldn't have to worry about how that is determined internally; we just want to know if they can graduate or not.

In [464]:
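# One possible solution

class StudentGraduation:

    # Class-level lists shared by every student
    requirements = ["math", "english", "science"]
    allowed = ["math", "english", "science", "french", "gym"]

    def __init__(self, name):
        self.name = name
        self.graduate = False
        self.courses = {}  # course name -> grade

    def add_course(self, course, grade):
        # Assigning to a key adds the course or overwrites the old grade
        if course in StudentGraduation.allowed:
            self.courses[course] = grade
        else:
            print("You can't take that course")

    def GPA(self):
        # Average grade; 0 when there are no courses, to avoid dividing by zero
        if not self.courses:
            return 0
        return sum(self.courses.values()) / len(self.courses)

    def checkGrad(self):
        # Every required course must be present with a grade of at least 50.
        # Recomputed on each call, so a lowered grade can undo graduation.
        self.graduate = all(
            self.courses.get(req, 0) >= 50
            for req in StudentGraduation.requirements
        )
        return self.graduate

    def __str__(self):
        # Build and return the transcript rather than printing inside __str__
        lines = ["Student: " + self.name]
        for course, grade in self.courses.items():
            lines.append(course + " " + str(grade))
        lines.append("GPA: " + str(self.GPA()))
        lines.append("Graduated: " + str(self.graduate))
        return "\n".join(lines)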

In [465]:
a = StudentGraduation("John")
a.add_course("math", 50)
a.add_course("math", 50)
a.add_course("math", 50)
a.add_course("math", 90)
a.add_course("english", 50)
print(a.checkGrad())

False


In [466]:
a.add_course("science", 50)
a.checkGrad()

True

In [467]:
a.add_course("Turkish", 50) 

You can't take that course


In [468]:
a.add_course("french", 40)

In [469]:
print(a)

Student: John
math 90
english 50
science 50
french 40
GPA: 57.5
Graduated: True


### This In That Out

There are a couple of other types of data structures that we'll look at briefly. They are commonly used to hold data that is waiting to be processed. Queues and stacks both restrict how we get at the data: we add items in one way only, and the structure itself decides which item comes out next.

#### Queues

A queue is a collection of objects that are inserted and removed according to the first-in, first-out (FIFO) principle. An excellent example of a queue is a line of people ordering at Burger King. New additions to the line are made at the back of the queue, while removal (or serving) happens at the front. In the queue data structure, the oldest element is removed first.

The example that follows the description of stacks shows the mechanics of a queue using Python's built-in `queue` module: we can add items to the back and get items from the front.

#### Stacks

Stacks are kind of the opposite of queues: they are last-in, first-out (LIFO). A stack is basically a pile of items; we stack them on and take off the top one. We'll look at an example of a stack when we look at recursion. Python has no built-in type named "stack", but a plain list works as one (`append` to push, `pop` to take off the top), the standard library offers `queue.LifoQueue` and `collections.deque`, or we can make our own.

In [470]:
import queue
q = queue.Queue()
q.put("item1")
q.put("item2")
q.put("item3")
q.put("item4")

q

<queue.Queue at 0x1116b7550>

In [471]:
q.get()

'item1'

In [472]:
q.put("item5")
q.get()

'item2'
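For comparison with the queue above, here is a sketch of the last-in, first-out behaviour of a stack. It uses `queue.LifoQueue` from the same standard library module, followed by a plain list used as a stack; the variable names are illustrative.

```python
import queue

# LifoQueue has the same put/get interface, but returns the newest item first
lifo = queue.LifoQueue()
lifo.put("item1")
lifo.put("item2")
lifo.put("item3")
print(lifo.get())    # item3
print(lifo.qsize())  # 2

# A plain list works as a stack too: append to push, pop to take the top
pile = ["item1", "item2", "item3"]
print(pile.pop())    # item3
print(pile)          # ['item1', 'item2']
```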

## Exercise

Implement a queue in your own class, using some other data structure internally to hold the data. We want to be able to add things to it and remove things, at a minimum.

<b>Note:</b> there are lots of ways to do this, and if you search for examples, many of them will try to operate more efficiently. Since a data structure is generally used over and over, potentially with lots of items, in many programs, optimizing it for speed is a pretty good use of time. Unless this is super easy for you, don't worry about that; the functions listed when we looked at lists should do the job.

In [473]:
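# Make a queue

class myQueue:

    def __init__(self):
        self.queue = []   # the front of the queue is index 0
        self.length = 0   # item count, kept in step with the list

    def pop(self):
        # Remove and return the oldest item; returns None if the queue is empty
        if self.length > 0:
            self.length -= 1
            return self.queue.pop(0)
    
    def push(self, item):
        # New items join at the back
        self.queue.append(item)
        self.length += 1
    
    def peek(self):
        # Look at the oldest item without removing it
        if self.length > 0:
            return self.queue[0]
    
    def size(self):
        return self.length
    
    def backdoorPeek(self):
        # Look at the newest item without removing it
        if self.length > 0:
            return self.queue[-1]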

In [474]:
tmp = myQueue()
tmp.push("item1")
tmp.push("item2")
tmp.push("item3")
tmp.pop()

'item1'

In [475]:
tmp.pop()
tmp.peek()

'item3'
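On the efficiency point from the note: `list.pop(0)` has to shift every remaining item forward, while `collections.deque` is designed for fast removal from either end. The class below is a hypothetical variant of the exercise solution (the name `DequeQueue` is invented here), offered as a sketch rather than the one right answer.

```python
from collections import deque

class DequeQueue:
    """Hypothetical variant of the exercise queue, backed by collections.deque."""

    def __init__(self):
        self._items = deque()

    def push(self, item):
        # New items join at the back
        self._items.append(item)

    def pop(self):
        # Remove and return the oldest item, or None if the queue is empty
        if self._items:
            return self._items.popleft()
        return None

    def peek(self):
        # Look at the oldest item without removing it
        return self._items[0] if self._items else None

    def size(self):
        return len(self._items)

dq = DequeQueue()
dq.push("item1")
dq.push("item2")
dq.push("item3")
print(dq.pop())   # item1
print(dq.peek())  # item2
print(dq.size())  # 2
```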

In [476]:
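class myStack:

    def __init__(self):
        self.stack = []   # the top of the stack is the end of the list
        self.length = 0
    
    def push(self, item):
        # New items go on top
        self.stack.append(item)
        self.length += 1
    
    def pop(self):
        # Remove and return the newest item; returns None if the stack is empty
        if self.length > 0:
            self.length -= 1
            return self.stack.pop()

    def size(self):
        return self.length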

In [477]:
s = myStack()
s.push("item1")
s.push("item2")
s.push("item3")
s.push("item4")
s.pop()

'item4'

In [478]:
s.push("item5")
s.pop()

'item5'

## 2+ Dimension Data Structures

One thing that we will mention now, and delve more into when we do neural networks, is that data structures can be in any number of dimensions, and can be nested, like lists of lists. A dataframe is naturally a two-dimensional structure, and we can create structures with two or more dimensions from other data structures.

When dealing with these structures, all the same concepts apply as when dealing with simple lists, only now we have to think about the dimensions. We can access elements by their index, but now we have to specify the index for each dimension.

In [479]:
listA = [1,2,3]
listB = [4,5,6]
listC = [7,8,9]

list_1 = [listA, listB, listC]
list_1

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

In [480]:
print(list_1[1])
print(list_1[1][2])
print(list_1[0:2])
print([list_1[0][2], list_1[1][2], list_1[2][2]])

[4, 5, 6]
6
[[1, 2, 3], [4, 5, 6]]
[3, 6, 9]
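Indexing each dimension by hand gets repetitive, so here is a sketch of some common patterns on the same kind of nested list (the name `grid` is made up): nested loops with `enumerate` to find a position, a list comprehension to pull out a column, and `zip(*grid)` to turn rows into columns.

```python
grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Nested loops: the outer loop walks rows, the inner loop walks values in a row
for row_index, row in enumerate(grid):
    for col_index, value in enumerate(row):
        if value == 6:
            print(row_index, col_index)  # 1 2

# A column is "the same position from every row"
last_column = [row[2] for row in grid]
print(last_column)  # [3, 6, 9]

# zip(*grid) pairs up the rows position by position, giving the columns
columns = [list(col) for col in zip(*grid)]
print(columns[0])   # [1, 4, 7]
```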


Dealing with multi-dimensional data structures is something that we semi-regularly need to do. It is also something that is easy to get wrong, and can be a major source of bugs. In general, we want to avoid this if it is possible. If we can use a dataframe, we should; if we can split the data into some more logical arrangement that avoids nesting the lists, we should consider that as well. 
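One classic example of such a bug, sketched with illustrative names: building a grid with `*` repeats a reference to the same inner list, so changing one "row" changes them all. Building each row separately avoids it.

```python
# Looks like a 3x3 grid, but all three rows are the SAME list object
bad = [[0] * 3] * 3
bad[0][0] = 1
print(bad)   # [[1, 0, 0], [1, 0, 0], [1, 0, 0]]

# A list comprehension creates a new inner list for every row
good = [[0] * 3 for _ in range(3)]
good[0][0] = 1
print(good)  # [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
```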