# Dictionaries (again)
##### Modified from WISCONSIN Center for Sustainability

Remember, we can combine lists with labels (called keys) to make a dict. Each key goes together with its values in a {key:value} pair. You can think of a dict like a table, where the key is the column label and the values are the entries in that column. Values can be any type of data, and different keys can hold different types of values.

You can tell something is a dict and not a list because it uses curly brackets { } instead of square brackets [ ]. There is also always a colon : between each key and its values.
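
To make the "values can be any type" idea concrete, here is a small sketch. The dict `sample` and its keys are made up for illustration; `type()` just reports what kind of object Python thinks something is.

```python
# Hypothetical dict: each key holds a different type of value
sample = {'name': 'apples', 'count': 4, 'prices': [0.5, 0.75], 'in_stock': True}

print(type(sample))            # <class 'dict'>
print(type(sample['name']))    # <class 'str'>
print(type(sample['prices']))  # <class 'list'>

# A plain list, for comparison, uses square brackets and has no keys
print(type([1, 2, 3]))         # <class 'list'>
```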


In [5]:
# list format reminder
list3 = [16.0,12.0,23.7,18.2]

# dictionaries
dict1 = {'column1':[1,2,4,8,16,32]}
dict2 = {'Numbers':[1,2,3,4],'Fruit':['apples','bananas','oranges','lemons'],'Randoms':list3}
print(dict1)

{'column1': [1, 2, 4, 8, 16, 32]}


Ok, let's break down the printed dict a little more. The curly brackets { } tell us everything we printed is a dict. Inside that dict, we have a str that is the key. Then we have a colon : that tells us the values for the key are coming up. Next, we have square brackets [ ] surrounding the values. The square brackets indicate a list, and the values are elements in that list. 

If you want to know what your keys or values are without printing the whole dict, you can ask for just the keys or just the values.

In [6]:
# The keys in dict1:
print(dict1.keys())

# The corresponding values:
print(dict1.values())

dict_keys(['column1'])
dict_values([[1, 2, 4, 8, 16, 32]])
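
A few related checks can be handy when a dict has more than one key. The sketch below redefines dict2 from earlier so it runs on its own; `key_list` is just a name chosen here.

```python
dict2 = {'Numbers': [1, 2, 3, 4],
         'Fruit': ['apples', 'bananas', 'oranges', 'lemons'],
         'Randoms': [16.0, 12.0, 23.7, 18.2]}

# Turn the dict_keys object into an ordinary list
key_list = list(dict2.keys())
print(key_list)            # ['Numbers', 'Fruit', 'Randoms']

# How many {key:value} pairs are there?
print(len(dict2))          # 3

# Is a particular key present?
print('Fruit' in dict2)    # True
print('Veggies' in dict2)  # False
```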


Easy, right? Now, what if you want to know the values of a certain key? Try this statement: dict_name['key_name']:

In [7]:
print(dict1['column1'])

[1, 2, 4, 8, 16, 32]


Since the values in our dict are elements of a list, we can call a specific element using its index. But first we must indicate which column of data we want by using the key.

In [8]:
print(dict1['column1'][1])
# this code is saying find the 2nd element in column1 of dict1, remember indexing starts at 0

2
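
The same key-then-index pattern works for any key whose values are a list. Here is a short sketch using dict2 (redefined so the block runs by itself):

```python
dict2 = {'Numbers': [1, 2, 3, 4],
         'Fruit': ['apples', 'bananas', 'oranges', 'lemons'],
         'Randoms': [16.0, 12.0, 23.7, 18.2]}

print(dict2['Fruit'][2])      # oranges (3rd element of the Fruit list)
print(dict2['Randoms'][-1])   # 18.2 (last element of the Randoms list)
print(dict2['Numbers'][1:3])  # [2, 3] (a slice of the Numbers list)
```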


Now that we understand all the {key:value} pairs, we can use pandas (a Python package) to do fancy stuff to our dict.

# DataFrames!
Let's make a table using pandas!

Reminder from Lesson/Exercise 1: Python is a language, and inside every language there are different types of words. In English, we have parts of speech like nouns, verbs, adjectives, etc. In Python, we have *packages* like NumPy, SciPy, pandas, etc. Each package has a specific purpose: NumPy = matrix math, SciPy = science-based analysis, pandas = spreadsheet operations (like Excel). Each part of speech has many different categories and words. For example, we have abstract nouns, proper nouns, collective nouns, and common nouns, but they are all types of nouns. Likewise, a Python package like NumPy contains different *commands* that tell us what is happening in our Python sentence. 
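
As a rough illustration of the "package, then command" pattern, the sketch below imports two packages under their usual nicknames and calls one command from each. The list `values` is made up for the example, and it assumes NumPy and pandas are both installed.

```python
import numpy as np   # np is the conventional nickname for NumPy
import pandas as pd  # pd is the conventional nickname for pandas

values = [1, 2, 4, 8]

# package.command(input): the mean command from the NumPy package
print(np.mean(values))          # 3.75

# a pandas Series is a single labeled column; max() is one of its commands
print(pd.Series(values).max())  # 8
```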


In [14]:
import pandas as pd

df = pd.DataFrame(data=dict1) # pandas only needs to be imported once per notebook session
print(df)

   column1
0        1
1        2
2        4
3        8
4       16
5       32


df is our first table! In pandas, these are called DataFrames. We can make any DataFrame using the pd.DataFrame command and inputting a dictionary for the data argument. The numbers on the left side are the index. The default, if you don't specify the index when you make the DataFrame, is to number the rows starting at 0. We can change the column names or the index using pandas commands.
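
If you would rather pick the row labels yourself, pd.DataFrame also accepts an index argument. A minimal sketch, where the letters in `labels` are made-up row labels and dict1 is redefined so the block runs on its own:

```python
import pandas as pd

dict1 = {'column1': [1, 2, 4, 8, 16, 32]}

# Hypothetical row labels; there must be one label per row
labels = ['a', 'b', 'c', 'd', 'e', 'f']

df_labeled = pd.DataFrame(data=dict1, index=labels)
print(df_labeled)        # same column as before, but rows are labeled a-f
print(df_labeled.index)  # shows the labels we supplied instead of 0-5
```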

In [15]:
df1 = pd.DataFrame(data=dict2)
print(df1)

   Numbers    Fruit  Randoms
0        1   apples     16.0
1        2  bananas     12.0
2        3  oranges     23.7
3        4   lemons     18.2


In [16]:
df2 = df1.set_index('Numbers')
print(df2)

           Fruit  Randoms
Numbers                  
1         apples     16.0
2        bananas     12.0
3        oranges     23.7
4         lemons     18.2


Let's check in. Yes, in the example above, the first column of df1, named "Numbers", became the index of df2. But let's break this down a bit. 

First, we made a new DataFrame called df2. We said df2 is equal to df1, BUT with the values in column 1 ("Numbers") used as the index. This also means that the index label is the key from column 1 in df1. By making column 1 our index, we removed this column from the table. Now there are only two {key:value} data pairs in the DataFrame. 

When we say there are only two {key:value} pairs in the DataFrame, we mean that Numbers is no longer a key! It has become the index label. We can still call the Fruit and Randoms keys, but we cannot treat Numbers the same way.

In [18]:
print(df2['Fruit'])

Numbers
1     apples
2    bananas
3    oranges
4     lemons
Name: Fruit, dtype: object


In [19]:
print(df2['Numbers'])

KeyError: 'Numbers'

This long error message ends with "KeyError: 'Numbers'". This is saying 'Numbers' is not a key, so you can't use it to select those values. Instead, we must ask for the DataFrame's index.

In [20]:
print(df2.index)

Index([1, 2, 3, 4], dtype='int64', name='Numbers')
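
Once a column has become the index, its labels can be used to pick out rows. The sketch below shows one way to do that with .loc, and how reset_index() turns the index back into an ordinary column. The name `df_back` is just chosen for this example.

```python
import pandas as pd

dict2 = {'Numbers': [1, 2, 3, 4],
         'Fruit': ['apples', 'bananas', 'oranges', 'lemons'],
         'Randoms': [16.0, 12.0, 23.7, 18.2]}
df1 = pd.DataFrame(data=dict2)
df2 = df1.set_index('Numbers')

# Select the whole row whose index label is 2
print(df2.loc[2])

# Select one cell: row label 2, column 'Fruit'
print(df2.loc[2, 'Fruit'])   # bananas

# Move the index back into a regular column
df_back = df2.reset_index()
print(df_back['Numbers'])    # works again, because Numbers is a key once more
```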


Now let's change the column labels!

In [21]:
df3 = df1.rename(columns={'Numbers':'Friday','Fruit':'Monday','Randoms':'Thursday'})
print(df3)

   Friday   Monday  Thursday
0       1   apples      16.0
1       2  bananas      12.0
2       3  oranges      23.7
3       4   lemons      18.2


In the example above, we renamed all our columns using the rename command. We started with our original df1, which had 3 columns. Columns 1, 2, and 3 of df1 now have new names in df3: Friday, Monday, Thursday.
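
You don't have to rename every column at once. In this sketch only one column gets a new name (`Measurements` is a made-up label), and we check that df1 itself is left untouched because rename hands back a new DataFrame.

```python
import pandas as pd

dict2 = {'Numbers': [1, 2, 3, 4],
         'Fruit': ['apples', 'bananas', 'oranges', 'lemons'],
         'Randoms': [16.0, 12.0, 23.7, 18.2]}
df1 = pd.DataFrame(data=dict2)

# Rename just one column; columns not listed keep their names
df_partial = df1.rename(columns={'Randoms': 'Measurements'})

print(list(df_partial.columns))  # ['Numbers', 'Fruit', 'Measurements']
print(list(df1.columns))         # ['Numbers', 'Fruit', 'Randoms']
```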

We'll explore pandas in more detail in a later module. 

## Applying our skills to ArcGIS Pro

JUST ADD LINK TO PDF IF/WHEN TIME COMES






## Exercise 2.1: The Greeter
A. Write a program that asks for the user's name and tells them hello. Try it first in a cell but then write it to a file and execute it in the Terminal.

B. Add on to the script, making it ask for your year of birth. Make the script compute your age and print it in the following format: 
```
Since you were born in 1988, your age is 28.
```


## Exercise 2.2: Numbers, numbers everywhere
A. Take Exercise 1.8 and re-write it so that the two numbers are inputted into the program rather than directly set to a variable. Have it print the fraction and decimal to screen as normal. Figure out how to make this program fail - how many ways can you come up with?

**Exercise 1.8 Lazy...fractionator:** Set two variables **x** and **y** to any two numbers. Print out x/y in both fraction and decimal form.

Try it out by inputting some numbers - check for any errors and record them as you go. 

B. Write a script where you read five numbers in. Print out their sum and mean.

C. Enter two numbers and store these in two variables called **input1** and **input2**. Swap the values of the two variables so **input1** and **input2** switch values. However, do this in only one line of code and without introducing new variables.

D. CHALLENGE: First, ask the user to specify a number of digits (e.g. 1-digit numbers, 2-digit numbers, etc.). Then, ask the user to supply five numbers of that many digits separated by spaces. Parse these numbers out, and calculate sum and mean as in Exercise 2.2B. Try doing this without using any additional variables! (Use only the one that contains the string that you read in.)


## Exercise 2.3: Crazy strings
Reproduce the triple-quoted string from earlier
```python
s = '''hello "world", if that is your real name. 
That's World, to you'''
```
in just one string using single or double quotes. Make sure you have the line break and all the quotes and apostrophes in there!


## Exercise 2.4: New Data Structures

Add comments detailing what each line in the script below is doing. Then run these commands and make sure you are correct. 


In [None]:
L = [1,2,3] + [4,5,6]
print (L, L[:], L[:0], L[-2], L[-2:])
print ()

L.reverse()
print (L)
print ()

L.sort()
print (L)
print ()

idx = L.index(4)
print (idx)
print ()

print ({'a':1, 'b':2}['b'])
print ()

D = {'x':1, 'y':2, 'z':3}
D['w'] = 0
print (D['x'] + D['w'])
print (D.keys(), D.values(), 'z' in D)

## Exercise 2.5: Meet your classmates!
Take a moment to list FIVE of your friends' names. Make a list (called **friends**) containing their names. 

A. What happens when you try to index out of bounds (e.g. __friends[15]__)?

B. What about slicing out of bounds (e.g. __friends[-100:100]__, __friends[30:50]__, __friends[2:10]__)?

C. What happens when you try to extract a sequence in reverse--with the lower bound greater than the higher bound (e.g. __friends[3:1]__)? What happens when you try assigning __friends[3:1] = ['?']__? 

D. Add your own name to the middle of the list by indexing, then add my name to the list ('Mel') by using the __insert()__ method that we discussed in class. See what happens if you use indexing to add the string 'AHHH' to the list using the index __[3:5]__ or __[4:2]__.


## Exercise 2.6: Slicing
A. Set a variable **x** to the string "Yang Lab". Apply the function **list**. What happens to the string?

B. Now set the outcome of 2.6A to a new variable **y**. What happens in each of the cases below? 
```python
print ('1', y)
print ('2', y[2:])
print ('3', y[:4])
print ('4', y[1:3])
print ('5', y[-1:])
print ('6', y[-2:])
print ('7', y[:-1])
print ('8', y[1:4:2])
print ('9', y[::-1])
```

C. Take this list of numbers: 
```python
mylst = [0,1,2,3,4,5,6,7,8,9]
```
Using slicing, print a list of only even numbers, only odd numbers, and only numbers divisible by four. 


## Exercise 2.7: Make some friends
For each friend in Exercise 2.5, we'll make a list of their favorite food, activity, and the month they were born. Create a dictionary to store this information, making their names the keys and the list of information on them the values. 

A. What happens if you try to index a non-existent key (e.g. `print (D['Terry'])`)?

B. What happens if you try to assign to a non-existent key (e.g. `D['Terry'] = '?'`)?

C. How does this compare to out-of-bound assignments for lists?

D. Write the code to print the names of any classmates with a birthday in July (with no errors produced) - think about combining the **for loop** and the **if statement**. 


## Exercise 2.8: Sets, sets, sets!

This lesson, you learned about many ways to group data. Here is one more: sets! Like a set in mathematics, a Python set holds a bunch of elements with no repeats. To build a set, you can pass in a list, and any duplicates are removed automatically.
```python
myset = set([1,1,2,3,5,8])
print(myset)
# prints the five unique values 1, 2, 3, 5, 8 inside curly brackets;
# sets are unordered, so do not rely on the display order
```

Google "python Sets" and see if you can learn a bit more about them. They have many methods and we will learn a few. 

Take the following two sets:
```python
beijing_unis = {"Beida","CAS","Qinghua","Renmin"}
china_unis = {"Beida","CAS","Qinghua","Renmin","Jilin","Fudan"}
```
Can you figure out how to get the union of these two sets? The intersection? The Chinese universities that are not in Beijing? Look up the methods associated with Sets, and consider using the **??** option when using Jupyter notebooks to see details on different methods. 


## Exercise 2.9: Pulling it all together (Challenge!)
Your boss has asked you to do a small bioinformatics project on LeuT (pdb code 2Q6H), a bacterial leucine transporter that is related to the human neurotransmitter transporters targeted by antidepressants. To help you out, I am providing a script (see below) that will read in a file and save the protein sequence to a list called protSeq. I have commented the code, but there are many things in here you haven't learned yet.

HINT: Protein Data Bank (PDB) structure files are stored at http://www.rcsb.org/. Use the pdb code to find the protein structure. The structure file can be downloaded from Download Files >> PDB Format (don't use the one with the 'gz' option). __Move the structure file into your `resources/` directory - try refreshing your memory on Linux commands.__

So, using the code below, you can access the list of all amino acids stored as the list variable `protSeq` (you may need to rewrite the file path if you get an error and it says you can't find the file - it should work if you set it up like I did in your `resources/` folder). You can start by printing this variable and proving to yourself that the list contains the information you think it does. Then add to the code to answer the following questions:

A. How many total amino acids are in the protein?

B. Print out total count of each amino acid in alphabetical order.

C. Can you do part B with **sets** instead?

In [None]:
#initialize list to store sequence
protSeq = []
#open pdb file; the with block closes the file
#automatically once the loop is finished
with open('../resources/2q6h.pdb', 'r') as f1:
    #loop over lines in file
    #(named record so we don't overwrite Python's built-in next)
    for record in f1:
        #identify lines that contain sequences
        if record[:6] == 'SEQRES':
            #strip away white space and
            #convert line into list
            line = record.strip().split()
            #delete descriptor information
            #at beginning of each line
            del line[:4]
            #loop over amino acids in line
            for aa in line:
                #add to sequence list
                protSeq.append(aa)
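
To see what the script does to a single line, here is a small sketch that runs the same steps on one made-up SEQRES record. The string `sample` is invented for illustration and is not copied from the 2Q6H file; the real file has many such lines.

```python
# Made-up SEQRES record for illustration only (not taken from 2q6h.pdb)
sample = 'SEQRES   1 A    8  MET GLU VAL LYS ARG GLU HIS TRP\n'

# The first six characters identify the record type
print(sample[:6])    # SEQRES

# strip() removes the trailing newline, split() breaks the line on whitespace
fields = sample.strip().split()
print(fields[:4])    # ['SEQRES', '1', 'A', '8']

# Delete the four descriptor fields, leaving only the amino acid codes
del fields[:4]
print(fields)        # ['MET', 'GLU', 'VAL', 'LYS', 'ARG', 'GLU', 'HIS', 'TRP']
```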


## Exercise 2.10: Back to SGDPinfo.txt (Challenge!)

A. In Lesson 2, I gave you a Python list of the unique population IDs from `SGDPinfo.txt`. This script is how I obtained that list. Most of it uses things you haven't learned yet, but let's try to take the code and understand it a bit better. From Lesson 2, you should recognize some functions/methods in sections 2 and 4 of the code. Work through those lines, commenting as needed, to better understand what I did. 

B. Then, let's tweak the code to get other types of information. Edit the code to obtain the following:
1. The total number of individuals in this dataset. Note that SGDP_ID should be unique, though you can double-check this by adding a line to the code. 
2. The unique list of all countries 
3. The number of males and females (Due to some bad line coding, the indices may not be lined up correctly - try thinking of alternative ways of accessing that column - let me know if you're having difficulty.)

In [None]:
## 1. Open file and display the column information.
myfile=open("../resources/SGDPinfo.txt",encoding="Latin 1")
header=myfile.readline().split()
print ("This is the header in list form:", header)
print ()

## 2. Move information from the wanted column into the list mypopn
mypopn=[]
for line in myfile: #This you have not learned yet, but it basically allows you to loop over each line in the file. 
    x=line.split('\t') #You haven't learned this yet either, but look up the split method - what is it doing?
    if x[0]!="": #You haven't learned if statements like this, but you will in Lesson 3!
        mypopn.append(x[2]) 
        
## 3. Close the file
myfile.close()

## 4. Get the unique set of population IDs. 
myuniqpopn = list(dict.fromkeys(mypopn))
print (myuniqpopn)