# PythonStudyGroup : Lists, Dicts, Sets
### Feb. 26th, 2019

Today we will be walking through a few examples in order to better understand lists, dictionaries and sets.

You may follow along by cloning and navigating to this repo.

`git clone https://github.com/ComBEE-UW-Madison/PythonStudyGroup.git && cd PythonStudyGroup/2019Spring/feb28`

Now open up this notebook using jupyter notebooks.

`jupyter notebook lists_dicts_sets_tutorial.ipynb`

If you do not have jupyter notebook installed you may [download and install it here](https://jupyter.org/install "Jupyter Notebook Install Page").

_Or_ you may follow along in your favorite text editor or interactive environment.

Now that we are situated, let's take a look at some of the data.

In [None]:
# The '!' allows jupyter notebooks to interpret the line as a bash command.
!ls data #bash command for looking in our directory (dir)

So, we have another directory titled `prokka_output` which we will get to shortly. Here is a brief description of our data dir's contents:

| Path | Description |
|-------:|-------------|
| metagenome.fna (file) | nucleotide fasta file of metagenomic scaffolds |
| autometa_test_data_clustered.tsv (file) | tab-separated file of metagenome-assembled genomes (MAGs) |
| prokka_output (dir) | folder of prokka annotations for cluster_DBSCAN_round4_1 |


We may peek into each of these files with the shell commands `head` or `less`.

For this tutorial, we'll use `head`

In [None]:
!head -n3 data/metagenome.fna

In [None]:
!head -n3 data/autometa_test_data_clustered.tsv

## Lists, Dicts, Sets Oh My!

So what are lists, dictionaries and sets?!?

All three of these are built-in python data structures.

In [None]:
# lists are commonly instantiated using the syntax
l = list()
l = []
# As you can see, if we test whether these two empty lists are equal we will get True
isSame = [] == list()
print('both constructors produce equal empty lists: {}'.format(isSame))
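
As a small aside, `==` compares the *contents* of two lists, not whether they are the same object. The sketch below (the names `a` and `b` are illustrative and not part of the notebook) contrasts equality, type, and identity:

```python
# Two ways of building an empty list
a = list()
b = []

print(a == b)              # True: equal contents
print(type(a) is type(b))  # True: both are of type list
print(a is b)              # False: two separate objects in memory
```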

### Accessing elements in a list

In [None]:
myList = list('abcdefg')
print(myList)

Each character in the list is known as a list ***element***.
We can access elements of the list by using the _index_ of the element.

The _index_ of the element is the position of the element in the list.
I will provide a simple example below.

In [None]:
myList[5]

Notice we received 'f' as the output! What is going on here?
The element at _index 5_ of the list is 'f', even though 'f' is the sixth letter. This is because lists in python are _0-indexed_: the first element sits at index 0. Try accessing the list with some other _indices_ by uncommenting each line one at a time and seeing what you receive as output.

In [None]:
# myList[0]
# myList[1]
# myList[2]
# myList[10]

Uh oh! What happened? We have received an IndexError. When accessing a list, you must use a position that is within the range of indices in the list. You can also count from the end of the list by using a negative index: `-1` is the last element, `-2` the second to last, and so on.

In [None]:
# myList[-1]
# myList[-2]
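
The indexing rules above can be sketched in one cell. This is only an illustration: the name `letters` is made up here so that `myList` is left untouched, and the `try`/`except` block is one possible way to handle an out-of-range index.

```python
letters = list('abcdefg')

# Negative indices count backwards from the end of the list
print(letters[-1])                # the last element
print(letters[len(letters) - 1])  # the same element via a positive index

# An out-of-range index raises IndexError, which we can catch
try:
    letters[10]
except IndexError as err:
    print('IndexError:', err)
```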

Finally, you may take list slices to access multiple elements at once.

In [None]:
# myList[3:6]
# myList[0:5:2]

Notice what happens when slicing the list? A slice takes the form

`list[inclusive:exclusive:step]`

In [None]:
# Taking the example from above:
includedIndex = 0
excludedIndex = 5
step = 2
myList[includedIndex:excludedIndex:step]
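
Each of the three slice parameters may also be left out, in which case python falls back to a default. A short sketch, using an illustrative list named `letters`:

```python
letters = list('abcdefg')

first_three = letters[:3]      # start defaults to 0
from_index_4 = letters[4:]     # stop defaults to the end of the list
every_other = letters[::2]     # step of 2 over the whole list
reversed_copy = letters[::-1]  # a negative step walks backwards

print(first_three)
print(from_index_4)
print(every_other)
print(reversed_copy)
```

Slicing always returns a new list, so `letters` itself is unchanged by any of these.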

### Feel free to play around with each of these parameters to see how the list is manipulated.

## Adding elements to a list

Now that we have the basic syntax figured out for constructing and accessing a list, let's look at how we may build a list.

We will first need to know what methods are available for the list data structure.

We can access these methods using the python built-in function `dir()`.

In [None]:
myList = ['ATCG','ACCCG','ATTTCA']
dir(myList)

Notice anything? The methods are actually given to us as a list!

Let's look through this list. The methods with the double underscore (***dunder***) are methods that python will use in the background, so let us ignore these. The methods we are curious about are the methods at the end of our list. As you can see, the _list_ already comes with a few built-in methods for manipulating the data.

We are going to be using _append_ to add elements to our list.

***Note: Methods refer to functions with the syntax dataStructure.method(), e.g. list.append()***

In your free time I encourage you to explore some of the list methods using the syntax `?dataStructure.method` or `??dataStructure.method` (note there are no parentheses after the method name).

- `?dataStructure.method` will provide you details on what arguments to supply to the method, if any!

- `??dataStructure.method` will display the source code and documentation associated with the method. (The source is only shown for objects written in python. For built-in methods such as `list.append`, which are implemented in C, you will see the same output as with `?`.)

In [None]:
# One example to check out how we will use append!
?myList.append

Notice using the append method will put a given element at the end of the list...

Here's an example:

In [None]:
len(myList)

In [None]:
myList.append('GGGGGGGGGGTTTTTGGGG')

Notice we have no output... but if we check the len of our list, we see it has gained another element!

In [None]:
print(myList)
print(len(myList))
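
One thing worth knowing about `append` is that it always adds exactly one element, even when that element is itself a list. The sketch below contrasts it with `extend`, another built-in list method that the notebook does not otherwise cover; the list `seqs` is made up for illustration.

```python
seqs = ['ATCG', 'ACCCG']

seqs.append('ATTTCA')        # adds one element to the end
print(len(seqs))             # 3

seqs.append(['GGG', 'TTT'])  # appends the list itself as ONE element
print(len(seqs))             # 4
print(seqs[-1])              # ['GGG', 'TTT']

seqs.extend(['CCC', 'AAA'])  # extend adds each element individually
print(len(seqs))             # 6
```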

## Dictionaries: A data structure with keys and values

Dictionaries are one of the most powerful data structures in python, and much of the language itself is built on them (for example, module namespaces and object attributes are stored in dictionaries)! We won't dive too deep into these details, but suffice it to say, dictionaries offer a powerful tool to manipulate data and to search for specific pieces of data.

Let's briefly go through instantiating a dictionary.

In [None]:
myDict = dict()
myDict = {}
isSame = {} == dict()
print('data structure types are the same: {}'.format(isSame))

Rather than starting with an _empty_ dictionary, we can construct a dictionary with keys and values.

In [None]:
# dict syntax {key:value, key2:value2, ...}
myDict = {'contigA':'AAACCCTTGG',
     'contigB':'TTCCGGACTG',
     'contigC':'TTCCGGAANNN'}
myDict

Once again, let's check out the dictionary built-in methods (...and ignore the _dunders_)

I encourage you to browse the methods and investigate as previously described with the `?` or `??` syntax.

In [None]:
dir(myDict)

## Accessing the dictionary's keys and values

We can access the dictionary's values by key. Like a set, a dictionary finds its keys by hashing rather than by position, in contrast to a list, whose elements are accessed by index. (Since python 3.7, dictionaries do remember the order in which keys were inserted, but you still cannot index them by position.) Hash-based lookup makes it fast to find data in the object. This is particularly important if you are working with large amounts of data.

It is common to retrieve a value from a dictionary by providing its key. You can think of this as a key that unlocks a door to the value behind it,
_or_ as looking up a word (the key) in a dictionary to find its translation (the value). This may sound a bit convoluted, so let me show you what I mean.

First the general case.

```
input: dict[key]
returns: value
```

In [None]:
myDict

In [None]:
# myDict['contigA']
# myDict['contigB']
# myDict['contigC']
# myDict['contigD']

Notice! Instead of the **IndexError** we saw when accessing an index that does not exist in a list, we now receive a **KeyError**. _This is important_, as you will need to provide an exact key if you want to retrieve a value from a dictionary.

A convenient way to view the keys, the values, or both together as ***[tuples](https://www.tutorialspoint.com/python/python_tuples.htm "Python Tuples Explanation")*** in your dictionary is...
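
If you are not sure whether a key exists, there are a few common ways to avoid an uncaught KeyError. The sketch below uses a small illustrative dictionary named `contigs` (not the notebook's `myDict`) and shows three options: testing with `in`, using `dict.get()`, and catching the exception.

```python
contigs = {'contigA': 'AAACCCTTGG', 'contigB': 'TTCCGGACTG'}

# Option 1: test for the key before looking it up
if 'contigD' in contigs:
    print(contigs['contigD'])
else:
    print('contigD is not a key')

# Option 2: dict.get() returns a default (None unless specified) instead of raising
missing = contigs.get('contigD')
fallback = contigs.get('contigD', 'no sequence')
print(missing, fallback)

# Option 3: catch the KeyError
try:
    contigs['contigD']
except KeyError as err:
    print('KeyError:', err)
```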

In [None]:
# myDict.keys()
# myDict.values()
# myDict.items()

Dictionaries, lists and sets are flexible data structures and can become quite complex.

Examples include numpy arrays, which are similar to lists of lists, and nested dictionaries, which are similar to some pandas data objects.

Dictionary values can be _another_ data structure, such as a list, set or even another dictionary!

Therefore, you can imagine how a lot of information and metadata could all be fit into one dictionary data structure with related objects categorized by keys and their metadata by their values.
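
To make this concrete, here is a minimal sketch of a nested dictionary. All of the names and values (`contigInfo`, the marker identifiers, the cluster name) are hypothetical and chosen only for illustration.

```python
# Hypothetical metadata, for illustration only
contigInfo = {
    'contigA': {
        'sequence': 'AAACCCTTGG',
        'markers': ['markerX', 'markerY'],  # a list as a value
        'clusters': {'cluster1'},           # a set as a value
    },
    'contigB': {
        'sequence': 'TTCCGGACTG',
        'markers': [],
        'clusters': set(),
    },
}

# Chain the keys to reach into the nested structure
firstMarker = contigInfo['contigA']['markers'][0]
seqLength = len(contigInfo['contigA']['sequence'])
print(firstMarker, seqLength)
```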

But we are getting a bit ahead of ourselves.


### Building a dictionary


Let's look at how we may build a dictionary of contig identifiers using `cluster_DBSCAN_round4_1.txt` from the `prokka_output` directory.

In [None]:
# Let's look in the directory to make sure we have the cluster_DBSCAN_round4_1.txt file
!ls data/prokka_output/

In [None]:
# Let's see the contents of the file
!cat data/prokka_output/cluster_DBSCAN_round4_1.txt

As you can see, the file format resembles that of **key:value** pairs, so let's parse these data into a dictionary data structure keyed by the description on the left of the colon.

1. We instantiate an empty dictionary.
2. We open our file and retrieve the key and value, line by line
3. Add retrieved key and value to the dictionary

In [None]:
# 1. instantiate dict
genomeDict = {}
# 2. open and read through file (First let's look at how python is viewing the lines)
filepath = 'data/prokka_output/cluster_DBSCAN_round4_1.txt'
with open(filepath) as filehandle:
    for line in filehandle:
        # repr() is a python built-in function that shows a string with its special characters made visible
        print(repr(line))
        # Notice the \n... This is the newline character, which marks the end of a line.
        break

We will remove this newline character (`\n`) and any surrounding whitespace before constructing our dictionary.

In [None]:
# Check the methods for manipulating a string...
# If you're unsure of the data type you're working with, you can use the python built-in function type()
type(line)
# dir(line)

In [None]:
line.strip('\n')

In [None]:
line.strip('\n').split(':')

Now we have our line split into the key and value we want. Notice, it is in a list structure! We can access each element and update our dictionary...

In [None]:
# 3. Add retrieved key:value pair to dictionary
elements = line.strip('\n').split(':')

# Note we can 'unpack' our list with variable assignment if we know we will always receive the same sized list.
# key, value = line.strip('\n').split(':')
# genomeDict[key] = value

# List accession as we have previously learned
genomeDict[elements[0]] = elements[1]
genomeDict
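
The strip, split and unpack steps can also be wrapped in a small helper. This is only a sketch: `parse_line` is a hypothetical function, and the example line is made up because the real file contents may differ. Passing `1` as the second argument to `split` limits it to a single split, so any later colons stay inside the value.

```python
def parse_line(line):
    """Split a 'key: value' line into a (key, value) tuple."""
    # Split on the first colon only, keeping later colons inside the value
    key, value = line.strip().split(':', 1)
    # Remove leftover whitespace around each piece
    return key.strip(), value.strip()

# Hypothetical line; the real file contents may differ
example = 'organism: Genus species: strain\n'
key, value = parse_line(example)
print(repr(key), repr(value))
```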

Notice we add keys into our dictionary by providing the key and the value to hold for the key. This value can be _whatever_ we like.

In [None]:
genomeDict['someKey'] = ['a','list','of','values']

print(genomeDict)

Let's try adding the same key but with a different value

In [None]:
genomeDict['someKey'] = 23

print(genomeDict)

Our value has changed to 23! As you can see, the keys of the dictionary ***must be unique***! Therefore, if a key is already in the dictionary, its value will be overwritten with the supplied value. So it is often important to check if the key is in the dictionary first, to avoid accidentally overwriting any of your data.

In [None]:
test1 = 'someKey' in genomeDict
test2 = 'someOtherKey' in genomeDict
test3 = 'somekey' in genomeDict

print('test1:%s' % test1)
print('test2:%s' % test2)
print('test3:%s' % test3)

Notice the keys _are_ case sensitive. Ensuring the exact key is provided can often save time when trying to construct and access data across multiple files where keys are shared.

We now have one of the key value pairs in our dictionary! Let's try it all together now...

In [None]:
# 1. instantiate dict
genomeDict = {}
# 2. open and read through the file, adding each key:value pair to the dict
filepath = 'data/prokka_output/cluster_DBSCAN_round4_1.txt'
with open(filepath) as filehandle:
    for line in filehandle:
        elements = line.strip().split(':')
        # We could add a check here to see if the key is already in the dict.
        genomeDict[elements[0]] = elements[1]
        # genomeDict.update({elements[0]:elements[1]})
        
genomeDict
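
The check for existing keys mentioned in the comment could look something like the sketch below. To keep it self-contained, an `io.StringIO` object stands in for the real file, and its lines are made up for illustration. Keeping the first value and recording the clash is just one possible policy.

```python
import io

# Stand-in for the real file; these lines are made up for illustration
fake_file = io.StringIO('contigs: 12\nbases: 34567\ncontigs: 99\n')

counts = {}
duplicates = []
for line in fake_file:
    key, value = line.strip().split(':', 1)
    key, value = key.strip(), value.strip()
    if key in counts:
        # Keep the first value and record the clash instead of overwriting
        duplicates.append(key)
        continue
    counts[key] = value

print(counts)
print('duplicate keys:', duplicates)
```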

### Let's talk about Sets

Sets are similar to lists _and_ dictionaries. Like the keys of a dictionary, the elements of a set must be unique, and a set is unordered. But sets are constructed as a collection of single elements (i.e. not constructed as key:value pairs)... Like a list!

So how do sets behave?
Let's start by instantiating a set data structure.

In [None]:
mySet = set()
mySet

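What gives? There is only one constructor for this? Not quite. `{}` creates an empty _dictionary_, so an empty set must be created with `set()`. Non-empty sets can also be written with curly braces, e.g. `{'a', 'b'}`, or constructed using _comprehensions_, but that's for another time.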

Let me first show you some behavior that you may think is weird at first. I will show you the simple fix and we can move on to methods...

In [None]:
mySet = set('contig1:value')
print(mySet)
mySet = set(['contig1','contig2','contig1','contig2','contig3'])
print(mySet)

Our first set is composed of single characters! `set()` iterates over whatever it is given, and iterating over a string yields its characters. If we want a set of whole strings, we need to wrap them in a list (`[]`) before passing them to `set()`.

Our second set's duplicate contigs have disappeared! Sets only contain unique elements, so if you add an element that the set already contains, nothing will change!

It is important to note that this is **not the same behavior as a list**, which will add any element regardless of whether it is already present.
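
The difference can be seen side by side in the sketch below, which adds the same names to a list and to a set. The variable names are illustrative only.

```python
contigList = []
contigSet = set()

for name in ['contig1', 'contig2', 'contig1']:
    contigList.append(name)  # a list keeps every element, duplicates included
    contigSet.add(name)      # a set silently ignores an element it already holds

print(len(contigList))  # 3
print(len(contigSet))   # 2

# A common idiom: remove duplicates from a list by passing it through set()
uniqueNames = sorted(set(contigList))
print(uniqueNames)      # ['contig1', 'contig2']
```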

In [None]:
mySet = set(['contig1:value'])
mySet

Now let's check out some methods for sets...

In [None]:
dir(mySet)


In [None]:
# mySet[0]

Notice we receive a **TypeError**! Set objects do _not_ support indexing! So how do we access the elements in a set?

One method is `s.pop()`, which removes an arbitrary element from the set and returns it. We can also iterate through the set, similar to the way we would a list object.

In [None]:
for elem in mySet:
    print(repr(elem))

As you can see, we can still access elements by ***iterating*** through the set...
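
The access routes mentioned above can be sketched together as follows. The set `names` is made up for illustration, and `next(iter(...))` and `sorted()` are offered here as two possible ways to look at elements without relying on indexing the set itself.

```python
names = {'contig1', 'contig2', 'contig3'}

# Peek at an element without removing it
peeked = next(iter(names))
print(len(names))   # still 3

# pop() removes and returns an arbitrary element
popped = names.pop()
print(len(names))   # now 2

# sorted() returns a list, so indexing works on the result
ordered = sorted(names | {popped})
print(ordered[0])   # contig1
```

Because a set is unordered, which element `pop()` or `next(iter(...))` hands back should not be relied upon.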

While I'm mentioning iteration: all three of these objects (dicts, lists, sets) are iterable objects.

This means they implement the `__iter__` method, which python calls for you when you loop over them with **for**.

In [None]:
for key in myDict:
    print('myDict',key)
    
for elem in myList:
    print('myList',elem)
    
for elem in mySet:
    print('mySet',elem)

Iteration is a powerful concept to grasp, and once you are able to put all of this syntax together, you can quickly and efficiently construct sets, lists, or dictionaries in only a few lines! (This is some foreshadowing to comprehensions)

But first let's return to our dictionary object and see how we may iterate through it another way, using the keys and items methods.

In [None]:
# Recall
myDict.items() # Returns a view of (key,value) tuples
# myDict.keys() # Returns a view of the dict keys

Let's take advantage of the items method and unpack the tuples while we iterate! Let's see how this is done below...

In [None]:
for key,value in myDict.items():
    print('here is my key: ',key)
    print('now the value... ',value)

Another method to retrieve values based on specific keys...

In [None]:
for k in myDict:
    val = myDict[k]
    print(k,val)
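
Iterating over `items()` is also a convenient way to build new objects from an existing dictionary. As a sketch, the cell below recreates the contig dictionary under the illustrative name `contigs`, then derives a dictionary of sequence lengths and a list of contigs that contain the ambiguous base 'N'.

```python
contigs = {'contigA': 'AAACCCTTGG',
           'contigB': 'TTCCGGACTG',
           'contigC': 'TTCCGGAANNN'}

# Map each contig name to the length of its sequence
lengths = {}
for name, seq in contigs.items():
    lengths[name] = len(seq)
print(lengths)

# Collect the names of contigs containing ambiguous bases ('N')
ambiguous = []
for name, seq in contigs.items():
    if 'N' in seq:
        ambiguous.append(name)
print(ambiguous)
```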

I've shown you a few different routes to access and manipulate your data...

Now let's look at how we may construct some of these objects with comprehensions!

***A final challenge exercise:***

Run `count_pfams.py` (I've placed the file in the example directory). Try to understand where some of these data objects are being used and how each is being employed due to its unique properties.

In [None]:
listComprehension = [elem for elem in range(0,5)]
setComprehension = {elem for elem in range(0,5)}
dictComprehension = {elem:elem+1 for elem in range(0,5)}
print('listComprehension: ',listComprehension)
print('setComprehension: ',setComprehension)
print('dictComprehension: ',dictComprehension)


# Extra! Generators!!!
# someStuff = (elem for elem in range(0,20))
# print(someStuff)
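
A comprehension is shorthand for a loop that builds a new object. The sketch below writes the same list both ways, with an added `if` filter, and then shows that a generator expression yields its values lazily and can only be consumed once. All names are illustrative.

```python
# A list built with an explicit loop...
squares_loop = []
for n in range(5):
    if n % 2 == 0:
        squares_loop.append(n * n)

# ...and the equivalent comprehension
squares_comp = [n * n for n in range(5) if n % 2 == 0]
print(squares_loop == squares_comp)

# A generator expression produces its values lazily; list() consumes it
gen = (n * n for n in range(5))
first_pass = list(gen)
second_pass = list(gen)  # the generator is now exhausted
print(first_pass)
print(second_pass)
```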

In [None]:
# Show count_pfams.py
!ls example/

In [None]:
!cat example/count_pfams.py

In [None]:
!head data/autometa_test_data_clustered.tsv

In [None]:
!python3 example/count_pfams.py

In [None]:
!python3 example/count_pfams.py DBSCAN_round2_2 cluster data/autometa_test_data_clustered.tsv example/count_pfams.output