# Lesson 2: Data Types and Introducing Packages  <a name='home' />

Last time we started using Python and began to learn about different object types. Today, we will dive deeper and explore what you can do with strings, before moving on to different ways of grouping (i.e., lists, tuples, dictionaries) and manipulating these data. We will introduce the topic of data frames and also start playing with some commonly used Python packages.

Table of Contents:
- <a href=#bookmark1>1. Strings, Floats, and Integers: A review.</a> 
- <a href=#bookmark2>2. Strings: A Deep Dive.</a> 
- <a href=#bookmark3>3. Exploring Functions with Strings.</a> 
- <a href=#bookmark4>4. Lists.</a> 
- <a href=#bookmark5>5. Tuples.</a> 
- <a href=#bookmark6>6. Dictionaries.</a> 
- <a href=#bookmark7>7. Packages, dataframes, and pandas.</a> 

## 1. Strings, Floats, and Integers: A review.  <a name='bookmark1' />

As a reminder from Lesson 1, strings, floats, and integers are commonly used data types in python.

In [None]:
# strings
str1 = 'apple'
str2 = 'PM2.5'
str3 = 'On Wednesdays, we wear pink.'

# integers
int1 = 12
int2 = 104382
int3 = -48

#floats
float1 = 0.0
float2 = 19.807
float3 = 2.4e-1

We can use Python to tell us what kind of object we have:

In [None]:
print(type(str1))

print(type(float3))

And we can (generally) convert data from one type to another:

In [None]:
print(int1)

str4 = str(int1)
print(str4)
print(type(str4))

In [None]:
int4 = int(str1)

The error message above is saying that we can't turn `str1` into an integer because 'apple' is not a valid base-10 number.

In [None]:
float4 = float(int1)
print(float4)
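The conversions above can be pushed a little further. Below is a minimal sketch using only built-in functions; the variable names (`numeric_text`, `truncated`, `converted`, `message`) are made up for illustration, and the `try`/`except` block is something we have not covered yet, so treat it as a preview rather than something you need to know now.

```python
# A string made of digits converts cleanly to an integer or a float
numeric_text = '12'
as_int = int(numeric_text)
as_float = float(numeric_text)
print(as_int, type(as_int))
print(as_float, type(as_float))

# Converting a float to an integer drops the decimal part (it does not round)
truncated = int(19.807)
print(truncated)

# A string that is not a number raises a ValueError, which we can catch
try:
    converted = int('apple')
except ValueError as err:
    converted = None
    message = str(err)
    print('Could not convert:', message)
```

Catching the error lets the rest of a cell keep running instead of stopping at the failed conversion.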

<a href=#home>Return to Top</a> 

## 2. Strings: A Deep Dive  <a name='bookmark2' />

### 2.1 Quotes

Let's explore a bit more what we can do with strings. In Lesson 1, we also learned how to concatenate a string - that is, how to put two strings together. For example:

In [None]:
var1="The Yang Lab " 
var2="is the best!" # Steph will allow it
var3 = var1+var2

print (var1)
print (var2)
print (var3)

Here, we will learn what else can be done with strings. As mentioned, strings are specified by wrapping a series of characters in quotes. These can be quotes of three different flavors. The first two, single (a.k.a. the apostrophe) and double, are familiar (although don't confuse the single quote (') with the backtick (`) -- the one that's probably above the tilde (~) on your keyboard).
Single and double quotes can more or less be used interchangeably, the only exception being which type of quote is allowed to appear inside the string. If the string itself is double-quoted, single quotes may appear inside the string, and vice versa:

```python
s1 = 'Hello, "World", if that is your real name.'
s2 ="That's World, to you, buddy."
```
Double quotes are present in `s1`, and a single quote appears in `s2`, but neither string could hold both kinds of quote without escaping. To use both single and double quotes in the same string, employ the triple quote, which is just three single quotes (or three double quotes) in a row, as shown in `s3`.

```python
s3 = '''Hello, "World", if that is your real name.
That's World, to you, buddy.'''
```

Note two aspects of the triple quotes: 1) Both single and double quotes can be used inside triple quotes. 2) Triple quoted strings can span multiple lines, and line breaks inside the quoted string are stored and faithfully displayed in the print operation.
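Here is a small sketch of both points. The strings and variable names are invented for illustration, and `splitlines()` is a string method we have not discussed; I only use it to count the lines.

```python
# Triple quotes can be built from three single quotes or three double quotes
s_single = '''She said "hi", didn't she?'''
s_double = """She said "hi", didn't she?"""
print(s_single == s_double)

# A line break typed inside a triple-quoted string is stored in the string
two_lines = '''first line
second line'''
print(two_lines)
print(len(two_lines.splitlines()))
```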

### 2.2. Escape characters

In [None]:
s1 = 'some\thing is missing'
print (s1)
s2 = "somethi\ng is broken"
print (s2)
s3 = '''something th\at will drive you b\an\an\as'''
print (s3)
 
s4 = r'\a solu\tio\n'
print (s4)
 
s5 = '\\another solu\\tio\\n'
print (s5)

This ugly mess is caused by escape characters. In python strings, several special characters ([bigger list here](https://chercher.tech/python-programming/python-special-characters)) can be preceded by a backslash "\" to produce special output, such as: ***a tab (\t), newline (\n) or even a bell noise (\a)*** (unfortunately, the bell noise does not seem to work from a remote computer like Spydur).

This is handy, since it means you can liberally pepper your strings with tabs and line breaks. In fact, lots of the data that biologists use are conveniently stored in files that are delimited by such tabs and line breaks; geography/envs data is generally comma delimited. These escape sequences might be a problem, however, say if you wanted to use a literal backslash in your string. Python offers two ways around this: the safest is to escape your escape, using a second backslash (see s5 above, '\\'). A fancier way involves a special kind of string, the raw string.

Raw strings start with r' and end with ', and treat every character in between as exactly what it looks like, with the exception of the single quote (which ends the string). If you do use raw strings, watch out for two catches:
1) You must still escape single quotes you want to appear inside the string, and the backslash you use to do so stays in the string.
2) The last character of the string cannot be a backslash, since this will escape the closing quote.

There are proper times and places for the use of the raw string method, but in general we recommend just escaping your backslashes.

As a final point on escapes: \' and \" provide a means to employ single quotes inside a single quoted string, and likewise double quotes in a double quoted string.
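A quick sketch of these escapes in action; the example strings are invented for illustration.

```python
# Escaping the quote character lets it appear inside a string of the same quote type
single = 'It\'s a single-quoted string'
double = "She said \"hello\""
print(single)
print(double)

# The escaped versions are identical to strings written with the other quote type
print(single == "It's a single-quoted string")
print(double == 'She said "hello"')

# A doubled backslash stores one literal backslash; a raw string gives the same result
print('C:\\new' == r'C:\new')
print(len('\\'))
```

The last line shows that `'\\'` is a single character, even though it takes two keystrokes to type.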

Try the following code - you should get a Syntax Error. Then, uncomment the second s6 and s7 lines (remove the '#' sign), and make the first s6 and s7 lines comments (add a '#' at the beginning).

In [None]:
s6 = r'don't do this'
#s6 = r'but there ain\'t a problem with this'
print (s6)

s7 = r'this is bad\'
#s7 = r'but this is okay\ '
print (s7)

### 2.3 Strings: as sequence type, index and slice (substring)

Strings are merely successions of characters, and python stores and operates on them as such. The official python lingo for something that can be operated on as an ordered series of sub-elements is a 'sequence'.

The first property of sequences we'll look at is indexing. Understanding how python indexing works applies beyond just string data types, but it's a good place to practice. 

Python indexing starts at 0. So the 'M' in Melinda is `name[0]`. 

In [None]:
name = 'Melinda A. Yang'
first_initial = name[0]
print(first_initial)
middle_initial = name[8]
print(middle_initial)

Using indexing, it's possible to pick out any number of individual characters in this manner and stitch them back together as substrings, but sequences can also be operated on in contiguous subsets of indices, called slices.

Slices look like indices, except with two numbers separated by a colon. The slice starts at the index of the first number, and ends at the index before the second. In the example above `name[0:7]` would be 'Melinda'. 

Many programming languages (C, Java) use the same 0-based indexing as Python. Others, such as R (and UNIX!), use 1-based indexing. Though both have their advantages and disadvantages, Python's 0-based 'half-open' indexing offers some elegance.

If you want to select the first n elements of a sequence, and the rest of the sequence, you can do this without fussing with any +/- 1.

This lets us do handy things like:
```python 
first = name[0:7]
last = name[11:]
```

In [None]:
name = 'Melinda A. Yang'
middle_initial = name[8]
print (middle_initial)
print (name[7:11])
first = name[:7] #can also put zero for name[0:7]
last = name[11:]
print (first)
print (last)
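The "no fussing with +/- 1" point about half-open slices can be checked directly. This is a small sketch reusing `name` from above; `n`, `first_n`, and `rest` are names I made up for the example.

```python
name = 'Melinda A. Yang'
n = 7

# Splitting at n: the first n characters plus the rest rebuild the original
first_n = name[:n]
rest = name[n:]
print(first_n + rest == name)

# The length of a slice is simply stop - start
print(len(name[7:11]) == 11 - 7)

# Negative indices count from the end of the sequence
print(name[-4:])
```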


### 2.4 Inserting and formatting variables in strings

The final string topic we'll discuss is the specific formatting and insertion of variables into strings. There are several methods, but only concatenation and the newest one in Python 3, f-strings, are shown here.

***Concatenation***

As you learned last lesson, concatenating strings is one method of inserting a variable into a string. 

```python
name = 'Melinda A. Yang'
middle_initial = name[8]
first = name[:7]
last = name[11:]

last_first=last + ", " + first + " " + middle_initial + "."
```

***String interpolation using f-string***

Another method python offers, called string interpolation, for injecting variables into strings is shown in the following:
```python
last_first = f'{last}, {first} {middle_initial}.' 
```
This handily replaces all those + operations with a very readable string, where the 'f' at the beginning before the quote indicates that anything you put in {} is a variable or expression you want to insert. Below I show two examples - one comparing to concatenation and one showing how you can change the format for floating numbers. 

This form of string interpolation, f-string, is faster than previous versions of string interpolation (using %s or the str.format method). For a review of different types, you can look [here](https://realpython.com/python-f-strings/). 


In [None]:
name = 'Melinda A. Yang'
middle_initial = name[8]
first = name[:7]
last = name[11:]

last_first1 = last + ", " + first + " " + middle_initial + "."
last_first2 = f'{last}, {first} {middle_initial}.' 
print (last_first1)
print (last_first2)

In [None]:
myint = 42
myfloat = 3.14159265
 
string = f'Other types of variables can be interpolated as strings like here: {myint}, and here: {myfloat}.'
print (string)
 
print () 

print (f'''To get 2 decimal places write {myfloat:.2f}, or to get 2 decimal places padded
to a total width of 5 characters, write {myfloat:05.2f} (notice that the '.' counts as a character).
To write seven characters, you can write: {myfloat:07.2f}. ''')
# Remember how we said returns are faithfully reproduced from triple quoted strings?

Here's one more example below - note that I can also put expressions into the curly braces (see `string1`). In `string2`, I show that you can pad with characters other than zero - note that for fill characters like 'X' or a space, I need to add the '>' (which right-aligns the number).

In [None]:
num = 22
den = 7
pi  = 3.14159265
string1 = f'pi is {num/den} or {num}/{den}'
string2 =  f'''
For 2 decimal places, write {pi:.2f}.
For 2 decimal places padded to a total width of 6 characters, write {pi: >6.2f}
    Notice that I have to include the space if I want to pad with spaces.
The spaces can be replaced with zeros this way: {pi:06.2f}.
Or you could even do this: {pi:X>6.2f}'''
print (string1)
print (string2)

## Knowledge Check 1: Strings
1. Create a string variable named `a` with the value "The New York Times". Then, concatenate "is an American newspaper based in New York City" to the variable `a`. Do so using three different options.
      - Option 1: use + operator
      - Option 2: string formatting
      - Option 3: f string

<a href=#home>Return to Top</a> 

## 3. Exploring Functions with Strings  <a name='bookmark3' />

We've all learned about functions in math class: a function is a relationship, or mapping, between one or more inputs and a set of outputs: f(x)=2x+3. In programming, a function is a self-contained block of code that encapsulates a specific task or a group of tasks. 

Here we're going to focus on the `input()` function. While there are several ways to gather data from the outside world, the simplest is to just ask. In python, a program asks for information from the user with the `input()` function, as demonstrated here:
```python
user = input("what's your name? ")
print (f'hello {user}!')
```
The `input()` function prompts the user by printing to the screen whatever value is given in the parentheses immediately following the input call, (in this case asking "what's your name?") and then waits for the user to type whatever they want for as long as they feel like, until the user hits enter. `input()` (which is a function, like `int()` or `float()`, a topic we'll talk a lot more about later) then takes everything up until the user hits enter and returns that as a string. Again, we'll talk more about this idea, but here all you need to know is that `input()` gives back the user's input as a string, and that gets saved to the variable user using the assignment operator (=). After taking this input, we just spit it right back out (employing the string interpolation trick we learned a few minutes ago).

In [None]:
user = input("what's your name? ")
print (f'hello {user}')
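Because `input()` always gives back a string, numbers typed by the user need converting before you do math with them. The sketch below hard-codes the reply (`raw_age` is a made-up stand-in for what `input()` would return) so the example runs without waiting for the keyboard.

```python
# Stand-in for: raw_age = input("how old are you? ")
raw_age = '27'
print(type(raw_age))

# Convert the string to an integer before doing arithmetic with it
age = int(raw_age)
print(f'next year you will be {age + 1}')

# Without conversion, * repeats the string instead of multiplying the number
print(raw_age * 2)
```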

<a href=#home>Return to Top</a> 

## 4. Lists  <a name='bookmark4' />

A ***list*** provides a way of storing an ordered series of values in a structure referenced by a single variable.

Lists have lots of really useful features. One is that they are __ordered__, which means the order of items in a list __does not change__ unless you change it (dictionaries, as we will see later, are accessed by key rather than by position). This means you can access individual items in a list or entire sections by indexing or slicing (like what you did for characters within a string).


In [None]:
NEstates = ['ME', 'MA', 'RI', 'VT', 'NH', 'CT']
print(NEstates)

If you want to see how long a list is without printing it and counting each element, you can use the `len()` function.

In [None]:
print(len(NEstates))

Each item in a list is called an element. Much like each character in a string, each element in a list has an index.

In [None]:
# to get the first three elements in your list:
NEstates[0:3]

In [None]:
# to get the last element of a list:
NEstates[-1]

### 4.1 Adding to a List
1. `append()` - adds a single object to the end of a list
2. `extend()` - joins a second list to a first list
3. **concatenation** (the `+` sign) - adding two lists together also joins a second list to a first list
4. `insert()` - allows you to add an object at a specific position in the list
5. **insert by slicing** - another method of inserting

Numbers 1, 2, and 4 are methods, which are essentially a subset of functions specifically attached to that object type. Many of these work directly upon the variable itself (a list of the available methods will pop up if you type the variable name, followed by "." and then press tab). They come in the format `object.method()`, where the object or variable is in front, followed by a '.', the name of the method, and parentheses (which take different objects or values inside, depending on the method).
```python
mylist.append(mynewelement)
```

In [None]:
mybases = ["Adenine","Guanine","Cytosine","Thymine"]
mybases_short   = ["A","G","C","T"]
mybases.append("Uracil")
mybases_short.append("U")
print ("append()")
print (mybases)
print (mybases_short) ##What if I had put ["U"] into the append method?
print ()

biologists=["Janaki Ammal", "Jennifer Doudna"]
morebiologists = ["Barbara McClintock","Flossie Wong-Staal"]
biologists.extend(morebiologists)
print ("extend()")
print (biologists)
print (morebiologists)
print ()

biologists=["Janaki Ammal", "Jennifer Doudna"]
morebiologists = ["Barbara McClintock","Flossie Wong-Staal"]
print ("concatenate")
print (biologists+morebiologists)
print (biologists)
print (morebiologists)
print ()

biologists.insert(1,"Ruby Hirose")
print ('insert()')
print (biologists)
print ()

biologists[2:2] = ["Ruth Ella Moore"]
print ('insert with slicing')
print (biologists)
print ()
biologists[1:1] = (morebiologists)
print (biologists)

In the above example, sometimes the variable itself was changed, while in other cases we had to specify a new variable. `append()`, `extend()`, `insert()` and **insert with slicing** directly acted upon the original variable, modifying it in place. **concatenate** or `+` did not affect the original variable, and to affect the old variable, you would have to reassign the new list to the old variable.
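A short sketch of that difference, reusing the lists from above (`combined` and `shortlist` are names I made up for the example):

```python
biologists = ["Janaki Ammal", "Jennifer Doudna"]
morebiologists = ["Barbara McClintock", "Flossie Wong-Staal"]

# + builds a new list; the original is untouched until we reassign
combined = biologists + morebiologists
print(len(biologists), len(combined))

# Reassigning the result to the old name updates the variable
biologists = biologists + morebiologists
print(len(biologists))

# extend() changes the list in place; no reassignment is needed
shortlist = ["Ruby Hirose"]
shortlist.extend(["Ruth Ella Moore"])
print(shortlist)
```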

***CAUTION!***
Be careful when using insertions. One of the most useful properties of lists is that you know the index, or position, of each element in the list. More complicated actions using lists often use information about the position, and careless use of insertions may result in assigning elements in the list to variables that you did not intend to assign.

Lastly, if you didn't realize - the above people included in the list are all famous female biologists who have been recognized for their immense academic contributions. If you don't recognize them, then before moving on, I suggest googling a few of them to learn about awesome women scientists (from the past and today)!

### 4.2 Multiplication

We take a moment here to consider the multiplication operator. For both strings and lists, this works exactly as multiplication should. 

For instance, 
```python
3*4 == 3+3+3+3 == 12
```
Then, 
```python
3*'a' == 'a'+'a'+'a' == 'aaa'
3*[0] == [0]+[0]+[0] == [0,0,0]
```

In [None]:
print(3*'a')
print(3*[0])

### 4.3 Shrinking a list

1. `del` - built-in statement that removes a particular item from the list
2. `pop()` - method that removes the last item from the list and returns that item
3. **slicing** - slice the list to retrieve only the subset you want (delete by omission)

In [None]:
ingredients = ['DNA polymerase', 'RNA polymerase', 'helicase', 'RNA primer', 
         'nucleotides', 'DNA ligase']

del ingredients[2]
print ("del: ")
print (ingredients)
print ()

ingredients.pop()
print ("pop(): ")
print (ingredients)
print ()

print ("slicing: ")
print (ingredients[:-1])

Above, I have a list of ingredients used in DNA replication, which I use to show three different methods of removing items from the list. 

However, I've made a mistake and added something used in transcription, a different cellular process - 'RNA polymerase'! What would you do to create a list that correctly shows ingredients for DNA replication (i.e. a list without RNA polymerase)? What about a list of only ingredients used in transcription, i.e. a list of only the single element 'RNA polymerase'?


Next, we consider what the methods return, and whether they change the original variable.

In [None]:
ingredients = ['DNA polymerase', 'RNA polymerase', 'helicase', 'RNA primer', 
         'nucleotides', 'DNA ligase']

myreturn = ingredients.append('topoisomerase')
print (ingredients)
print (myreturn)
print ()

myreturn = ingredients.extend(['activator protein','repressor protein'])
print (ingredients)
print (myreturn)
print ()

myreturn = ingredients.pop()
print (ingredients)
print (myreturn)

`append()` and `extend()` return "None", but `pop()` returns the last element of the list, which is removed from the list. Thus, different methods (and functions) will return different things. You can figure out for yourself what is returned, as well as more information about the method or function in one of two ways, searching the documentation online or using a nifty tool in Jupyter Notebook - adding a `?` to the end of the method/function. 

In [None]:
ingredients.extend?

In [None]:
ingredients.pop?

In [None]:
type?
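Knowing what a method returns helps you avoid a common slip: assigning the result of `append()` back to a variable. The sketch below uses a shortened ingredients list, and the names `wrong` and `removed` are made up for illustration.

```python
ingredients = ['DNA polymerase', 'helicase']

# Pitfall: append() returns None, so `wrong` holds None rather than the list
wrong = ingredients.append('DNA ligase')
print(wrong)

# The list itself was still modified in place
print(ingredients)

# pop() hands back the removed element, so capturing its return value is useful
removed = ingredients.pop()
print(removed)
print(ingredients)
```

If you had written `ingredients = ingredients.append('DNA ligase')`, the variable `ingredients` would now be `None` and the list would be lost.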

### 4.4 Changing lists in place
1. Overwriting the element in the list
2. Sort by the method `sort()` or the function `sorted()`
3. Reverse the order of the list using `reverse()` or **slicing**

First, let's make a list to start with - I've initialized a list of four zeros and then assigned values to some elements of the list. Note that if I run the below cell, the notebook 'remembers' my variables, unless I overwrite the old assignment. I could also erase this memory using 'Kernel-->Restart' in the bar above or clicking the circular arrow - in both cases, I'd have to then click Restart in the pop up. But for now, we want to retain the memory of what the variables are across each cell. 

In [None]:
brainsizes=4*[0]
print ("initialized list")
print (brainsizes)
print ()

mice_brain = 10
rat_brain = 20
human_brain = 500
brainsizes[2] = mice_brain
brainsizes[1] = rat_brain
brainsizes[3] = human_brain
print ("modified list")
print (brainsizes)
print ()

As you look at what each method or function in the below cells does to the `brainsizes` list, note whether you are using a method, a function, or neither. Also note which ones change the variable `brainsizes` itself, and which ones output a NEW list. To keep a new list, you must assign it to a new variable or, if you want `brainsizes` itself to be updated, assign it back to `brainsizes`.

In [None]:
print ('sorted')
print (sorted(brainsizes))
print (brainsizes)
print ()


In [None]:
print ('sort()')
brainsizes.sort()
print (brainsizes)
print ()


In [None]:
print ('reverse by slicing')
print (brainsizes[::-1])
print (brainsizes)
print ()


In [None]:
print ('reverse()')
brainsizes.reverse()
print (brainsizes)
print ()


In [None]:
myAtmosphere = ["N2","oxygen","argon"]
print ("Why is this not sorted alphabetically?")
print (myAtmosphere)
print (sorted(myAtmosphere))

The above are not sorted alphabetically because upper and lower case letters are not sorted with each other - upper case comes first, then lower case in the 'sorting'. 
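If you do want a true alphabetical sort, one option (a sketch, not the only way) is the `key` argument of `sorted()`, which we have not covered yet. Here `key=str.lower` tells Python to compare lower-cased copies of each item while leaving the items themselves unchanged.

```python
myAtmosphere = ["N2", "oxygen", "argon"]

# Default sort: every upper-case letter comes before every lower-case letter
default_order = sorted(myAtmosphere)
print(default_order)

# key=str.lower compares lower-cased copies, leaving the items unchanged
alphabetical = sorted(myAtmosphere, key=str.lower)
print(alphabetical)
```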

### 4.5 Characterizing Lists

Here, we will learn a few more things we can do with lists.

1) The built-in functions `len()`, `max()` and `min()` tell us how many items are in the list and the maximum and minimum values in the list.

2) The list method `index()` tells us where an item is in the list.

3) We can iterate over each item in the list and print it using the syntax `for x in mylist`:



In [None]:
print (brainsizes)
print ("# Elements =", len(brainsizes))
print ('Max =', max(brainsizes))
print ('Min =', min(brainsizes))

Above, how would you rewrite the strings to print using f-string formatting?

Note that you will learn a lot more about the **for loop** in the next lesson. For the example below, mainly note what you think is happening in the loop.

In [None]:
#iterate over list
for x in brainsizes: print (x)

In [None]:
##find index of element in list
print (brainsizes)
human_brain = 500
humanindex = brainsizes.index(human_brain)
print (humanindex)
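One thing to watch for: `index()` raises a `ValueError` if the item is not in the list. The sketch below checks membership with `in` first. The value `1400` is a hypothetical brain size that is not in the list, and the `if`/`else` block is a preview of something we cover later.

```python
brainsizes = [500, 20, 10, 0]

# `in` answers True or False without raising an error
print(20 in brainsizes)
print(1400 in brainsizes)

# Guard the lookup so a missing value does not stop the program
target = 1400  # hypothetical value that is not in the list
if target in brainsizes:
    position = brainsizes.index(target)
else:
    position = None
print(position)

# index() reports only the first match when a value is repeated
print([0, 10, 0].index(0))
```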

## Knowledge Check 2. Lists
Below is a list of all the unique population IDs in the `SGDPinfo.txt` file. Can you answer the following?

      - a. How many population IDs are included? 

      - b. How many populations begin with the letter 'A'?

In [None]:
mypopns=['BantuHerero', 'BantuKenya', 'BantuTswana', 'Biaka', 'Dinka', 'Esan', 'Gambian', 'Ju_hoan_North', 
         'Khomani_San', 'Luhya', 'Luo', 'Mandenka', 'Masai', 'Mbuti', 'Mende', 'Mozabite', 'Saharawi', 'Somali', 
         'Yoruba', 'Chane', 'Karitiana', 'Mayan', 'Mixe', 'Mixtec', 'Piapoco', 'Pima', 'Quechua', 'Surui', 'Zapotec', 
         'Aleut', 'Altaian', 'Chukchi', 'Eskimo_Chaplin', 'Eskimo_Naukan', 'Eskimo_Sireniki', 'Even', 'Itelman', 
         'Kyrgyz', 'Mansi', 'Mongola', 'Tlingit', 'Tubalar', 'Ulchi', 'Yakut', 'Ami', 'Atayal', 'Burmese', 
         'Cambodian', 'Dai', 'Daur', 'Han', 'Hezhen', 'Japanese', 'Kinh', 'Korean', 'Lahu', 'Miao', 'Naxi', 
         'Oroqen', 'She', 'Thai', 'Tu', 'Tujia', 'Uygur', 'Xibo', 'Yi', 'Australian', 'Bougainville', 'Dusun', 
         'Hawaiian', 'Igorot', 'Maori', 'Papuan', 'Balochi', 'Bengali', 'Brahmin', 'Brahui', 'Burusho', 'Hazara', 
         'Irula', 'Kalash', 'Kapu', 'Khonda_Dora', 'Kusunda', 'Madiga', 'Makrani', 'Mala', 'Pathan', 'Punjabi', 
         'Relli', 'Sindhi', 'Yadava', 'Abkhasian', 'Adygei', 'Albanian', 'Armenian', 'Basque', 'BedouinB', 
         'Bergamo', 'Bulgarian', 'Chechen', 'Czech', 'Druze', 'English', 'Estonian', 'Finnish', 'French', 
         'Georgian', 'Greek', 'Hungarian', 'Icelandic', 'Iranian', 'Iraqi_Jew', 'Jordanian', 'Lezgin', 
         'North_Ossetian', 'Orcadian', 'Palestinian', 'Polish', 'Russian', 'Samaritan', 'Sardinian', 'Spanish', 
         'Tajik', 'Turkish', 'Tuscan', 'Yemenite_Jew']

In [None]:
## Add your code here.

# A

# B

<a href=#home>Return to Top</a> 

## 5. Tuples  <a name='bookmark5' />

A tuple is essentially a list that you cannot change - it is immutable. You can index them, slice them, and add them together to make new tuples, but you cannot use `sort()` or `reverse()` on them, or delete or remove items from them. If you ever have a tuple that you want to change, you have to turn it into a list. Tuples have structure, lists have order.

Why would you want to use a tuple and not a list? Maybe you have an original dataset that you don't want modified. Tuples also use a little bit less memory than lists, and programs with tuples often run faster than programs with lists.

Geography/envs friends - we *rarely* use tuples. 

In [None]:
SNP = ('chrII', '378445')
print (type(SNP))
 
for i in SNP: print (i)
 
#Can we change an element in a tuple? Guess what might occur. 
#Then, try the following after uncommenting the lines to see if you were correct.
#SNP[0] = 'chrV'
#print (SNP)

In [None]:
#What if we first coerce the tuple to a list?
SNP = list(SNP)
print (type(SNP))
SNP[0] = 'chrV'
SNP = tuple(SNP)
print (SNP)

If your tuple only has one item, you need to use a comma to make it clear that the tuple is a tuple and not just a value in parentheses:
```python
tuple_A = ("Is this a tuple?")    ##This is a string
tuple_B = ("What about this?", )  ##This is a tuple with one element. 
```
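You can confirm this yourself with `type()` and `len()`, as in this short sketch:

```python
tuple_A = ("Is this a tuple?")    # parentheses only: still a string
tuple_B = ("What about this?", )  # trailing comma: a one-element tuple

print(type(tuple_A))
print(type(tuple_B))

# len() counts characters for the string but elements for the tuple
print(len(tuple_A))
print(len(tuple_B))
```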

Tuples are also handy for doing an in-place swap.

In [None]:
a = 1
b = 2
print (a,b)
a,b = b,a
print (a,b)

##Above is equivalent to:
mytuple = (b,a)
a,b = mytuple

<a href=#home>Return to Top</a> 

## 6. Dictionaries  <a name='bookmark6' />

There are two ways to think of dictionaries. 

The **first** way is you can imagine a dictionary as just that -- a dictionary. To retrieve information out of it, you look up a word, called a key, and you find information associated with that word, called the key's value.

To create a dictionary, you write each key-value pair as **key:value**, divide the pairs with commas, and surround the entire structure with curly braces.

In [None]:
#dr yang's dictionary
names = {'Jennifer':'Doudna', 'Flossie':'Wong-Staal', 'Barbara':'McClintock'}

The key is what you use to retrieve information. Thus, whereas in tuples and lists you used the index to grab a particular element, in a dictionary, you use the key. 

In [None]:
print ("Find value associated with key")
print (names["Flossie"])
print ()

print ("Add new key:value pair")
names["Ruth"]="Moore"
print (names)
print ()

print ("del")
del names['Flossie']
print (names)
print ()

print ("change in place")
names["Jennifer"] = "I refuse to share my last name"
print (names)
print ()

print ("Combine two dictionaries")
morenames={"Janaki":"Ammal", "Ruby":"Hirose", "Jennifer":"Doudna"}
names.update(morenames)
print (names)
print ()
print ("What happened to Jennifer when we used update()??")


What happens when you try the list method `pop()`? What about `sort()`? And `sorted()`?

The **second** way to think about dictionaries is a table where the key is the column label and the values are the entries in that column. Values can be anything, any type of data, and they don't have to be the same in a {key:value} pair.

In [None]:
# prof spera's dictionaries
dict1 = {'column1':[1,2,4,8,16,32]}
dict2 = {'Numbers':[1,2,3,4],'Fruit':['apples','bananas','oranges','lemons'],'Randoms':[16.0,12.0,23.7,18.2]}
print(dict1)

As stated above, the curly brackets **{ }** tell us everything we printed is a `dict`. Inside that `dict`, we have a `str` that is the **key**. Then we have a colon **:** that tells us the **values** for the key are coming up. Next, we have straight brackets **[ ]** surrounding the values. The straight brackets indicate a `list`, and the values are elements in that list.

If you want to know what your keys or values are without printing the whole dict, you can ask for them with the `keys()` and `values()` methods.

In [None]:
# The keys in dict1:
print(dict1.keys())

# The corresponding values:
print(dict1.values())
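The same calls work on a dictionary with several keys. Below is a minimal sketch using `dict2` from above; `column_names` is a name I made up, and wrapping the result in `list()` is optional but gives you an ordinary list you can index.

```python
dict2 = {'Numbers': [1, 2, 3, 4],
         'Fruit': ['apples', 'bananas', 'oranges', 'lemons'],
         'Randoms': [16.0, 12.0, 23.7, 18.2]}

# keys() returns a "view" object; list() turns it into a plain list
column_names = list(dict2.keys())
print(column_names)

# len() on a dictionary counts the key:value pairs
print(len(dict2))

# `in` checks whether a key is present (it does not search the values)
print('Fruit' in dict2)
print('apples' in dict2)
```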

Now, what if you want to know the values of a certain key/'column'? Try this statement: dict_name['key_name']:

In [None]:
print(dict1['column1'])

Since the values in our dict are elements of a list, we can call a specific element using its index. But first we must indicate which column of data we want by using the key.



In [None]:
print(dict1['column1'][1])
# this code is saying find the 2nd element in column1 of dict1, remember indexing starts at 0

## Knowledge Check 3: Lists and Dictionaries
For each of the lines below, add comments detailing what the line is doing. Then run these commands and make sure you are correct.
```python
L = [1,2,3] + [4,5,6]
print (L, L[:], L[:0], L[-2], L[-2:])
print ()

L.reverse()
print (L)
print ()

L.sort()
print (L)
print ()

idx = L.index(4)
print (idx)
print ()

print ({'a':1, 'b':2}['b'])
print ()

D = {'x':1, 'y':2, 'z':3}
D['w'] = 0
print (D['x'] + D['w'])
print (D.keys(), D.values(), 'z' in D)
```

In [None]:
# Add your code here

# Summary So Far...

__Lists are:__

1) ordered collections of arbitrary variables.

2) accessible by slicing.

3) can be grown or shrunk in place.

4) mutable (can be changed in place).

5) defined with list = [X,Y]

__Tuples are:__

1) like lists except they are immutable (cannot do #3 and #4 for lists)

2) defined with tuple = (X,Y)

__Dictionaries are:__

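1) collections of arbitrary variables stored as key:value pairs (insertion order is kept in Python 3.7 and later, but there is no positional index).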

2) accessible by keys.

3) can be grown or shrunk in place.

4) mutable.

5) defined with dict = {X:Y}

List methods include: __append()__, __extend()__, __insert()__, __pop()__, __sort()__, __reverse()__, __index()__

Dictionary methods include: __update()__, __keys()__, __values()__, __pop()__--but __pop()__ works a bit differently compared to how it's used in lists!

Built-in functions include: __sorted()__, __len()__, __max()__, __min()__, __type()__

You will use dictionaries and lists often in your coding. However, there is another data structure that you should know about to make your life a little easier: __Sets__. __Sets__ are unordered and unique bags of variables. You will learn some about them in your exercises.
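As a small preview of sets (a sketch only; the exercises go further), here is how duplicates disappear when a list becomes a set. The `bases` and `purines` values are invented for illustration.

```python
# A list may contain repeats; converting it to a set keeps one copy of each value
bases = ['A', 'G', 'A', 'T', 'G', 'A']
unique_bases = set(bases)
print(len(bases), len(unique_bases))

# Sets are written with curly braces but, unlike dictionaries, have no key:value pairs
purines = {'A', 'G'}
print('A' in purines)

# Sets have no positions, so sort them into a list when a fixed order matters
print(sorted(unique_bases))
```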

Now that we understand all the **{key:value}** pairs, we can use `pandas` (a Python package) to do fancy stuff to our dictionary.

<a href=#home>Return to Top</a> 

## 7. Packages, dataframes, and pandas.  <a name='bookmark7' />

**Until we figure out a package dependency thing - to do this section, just make sure you are using the qgis kernel.**
**This means, in the top ribbon, under the box that says 'Trusted', if it says 'Python 3', click on that, and select 'qgis' from the dropdown.**


### 7.1 Packages and modules
Python is a language, and inside every language there are different types of words. In English, we have parts of speech like nouns, verbs, adjectives, etc. In Python, we have packages like Numpy, Scipy, pandas, etc. Each package has a specific purpose: `numpy` = matrix math, `scipy` = science-based analysis, `pandas` = spreadsheet operations (like Excel). Each part of speech has many different categories and words. For example, we have abstract nouns, proper nouns, collective nouns, common nouns, but they are all types of nouns. In a Python package like Numpy, we have different commands that tell us what is happening in our Python sentence. Thus, another way of thinking of packages is as a set of Python code written by someone else with ready-to-go commands (i.e. functions) we might want to use related to the topic (e.g. `numpy` will have many functions and data types useful for matrix math).

The code below is an example of a script that uses the `numpy` package. `import numpy` allows us to access the package, so to use a `numpy` function, we add the `import numpy` line to tell Python to look for that package. That's all you need to know about `import` for now. 

In [None]:
import numpy

x = numpy.arange(5)
print(x)
type(x)

Numpy is all about matrices, lists, and math operations on those matrices and lists. In this example, I said `numpy.`, which indicates that I want to use a command from the `numpy` package. The command/function I want to use is `arange`, so I added that after. This function makes a range of numbers (stored in an array, described below), based on a number you provide in the parentheses. I said `(5)`, so Numpy makes a range of numbers that is 5 elements long, starting at 0. The `print(x)` statement in Line 4 shows us the result of this command, `[0 1 2 3 4]`.

Note that `numpy` creates a new data structure, known as an `array`. It is like a list, except that many `numpy` functions can be applied to it, allowing quick matrix algebra calculations. We aren't getting into those, but you can always use `type` to see the new data type. When printed, a numpy array differs from a list in that there are no commas separating the elements of the array. Like lists, arrays are shown with straight brackets, **[ ]**.
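One quick way to see that an array is not just a list is to multiply each by 2. This sketch assumes `numpy` is installed in your kernel; `plain_list` and `array` are names I made up for the example.

```python
import numpy

plain_list = [0, 1, 2, 3, 4]
array = numpy.arange(5)

# Multiplying a list repeats it (as in section 4.2)
print(plain_list * 2)

# Multiplying an array multiplies every element
print(array * 2)

# tolist() converts an array back to an ordinary Python list
print((array * 2).tolist())
```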

Back to the `import` statement - remember that Python doesn't automatically open all packages when you start a notebook or script. Instead, you need to tell Python which packages you want. You do this with the `import` statement. You can also give each package a nickname, like `pd` in the example below. Here are some more examples of importing packages.

In [None]:
import numpy as np
import pandas as pd
import scipy
from matplotlib import pyplot as plt

Wait, what happened with that last line?!

Matplotlib is a plotting package. But I didn't import all of matplotlib, I only imported one module from matplotlib called `pyplot`. This would be like me writing a sentence using only proper nouns. I could use any type of noun, but I'm only using the proper noun module in my sentence. Another way to write this import statement is below:

In [None]:
import matplotlib.pyplot as plt

Both examples do the same thing, so it totally depends on which you prefer!

Sometimes you will see a `from package import *` statement. This statement imports all the functions and classes from the selected package. This is actually a *bad* habit and you should not use this form of import statement. Instead, specify which modules you want explicitly or just import the entire package. But it is good to know what other people might do in their code.

In [None]:
# don't run this bad code
from matplotlib import * # bad code!

### 7.2 Data frames and pandas light
Let's make a data frame using pandas. 

We already imported pandas a bit earlier using `import pandas as pd` (scroll up a few cells), so you shouldn't need to import it again. But you can always add it at the top of your cell if it says your package is not defined. 

We also set up two dictionaries towards the end of section 6, pasted below. 

```python
dict1 = {'column1':[1,2,4,8,16,32]}
dict2 = {'Numbers':[1,2,3,4],'Fruit':['apples','bananas','oranges','lemons'],'Randoms':[16.0,12.0,23.7,18.2]}
```

If you ran those dictionaries, the following code will work. If you get an error saying `dict1` and `dict2` don't exist, then paste the above two dictionaries in your code as well to initialize them. 

In [None]:
dict1 = {'column1':[1,2,4,8,16,32]}
dict2 = {'Numbers':[1,2,3,4],'Fruit':['apples','bananas','oranges','lemons'],'Randoms':[16.0,12.0,23.7,18.2]}
df1 = pd.DataFrame(data=dict1) # we already imported pandas a bit earlier using import pandas as pd, so we don't need to do it again
print(df1)

`df1` is our first table. pandas works with a new data type (not lists, arrays, strings, dictionaries, or tuples) called a `DataFrame`. We can make a DataFrame using the `pd.DataFrame` command and passing a dictionary as the data argument. The list of numbers on the left side is the index. By default, if you don't specify the index when you make the DataFrame, the rows are numbered starting at 0. We can change the column names or the index using functions from the pandas package.
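As a quick aside on the default index, here is a small sketch that builds the same table twice, once with the default index and once with labels we pick ourselves. The names `df_default` and `df_labeled` and the letter labels are made up for this illustration.

```python
import pandas as pd

dict1 = {'column1': [1, 2, 4, 8, 16, 32]}

# no index given: pandas numbers the rows 0, 1, 2, ...
df_default = pd.DataFrame(data=dict1)

# index given: one label per row, in order
df_labeled = pd.DataFrame(data=dict1, index=['a', 'b', 'c', 'd', 'e', 'f'])

print(df_default.index.tolist())
print(df_labeled.index.tolist())
```

The list passed to `index` must have one label per row, otherwise pandas raises an error.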

In [None]:
df1 = pd.DataFrame(data=dict2)
print(df1)

In [None]:
df2 = df1.set_index('Numbers')
print(df2)

In the example above, the first column of `df1`, named "Numbers", became the index of `df2`. But let's break this down a bit.

First, we made a new DataFrame called `df2`. We said `df2` is equal to `df1`, BUT with the values from column 1 ("Numbers") used as the index. This also means that the index label is the key from column 1 in `df1`. By making column 1 our index, we removed this column from the table. Now there are only two **{key:value}** data pairs in the DataFrame.

When we say there are only two {key:value} pairs in the DataFrame, we mean that Numbers is no longer a key! It has become the index label. We can still call the Fruit and Randoms keys, but we cannot treat Numbers the same way.

In [None]:
print(df2['Fruit'])

In [None]:
print(df2['Numbers'])

This long error message ends with "KeyError: 'Numbers'". This is saying 'Numbers' is not a key, so you can't use it to select those values. Instead, we must call the DataFrame indexes.

In [None]:
print(df2.index)
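If you ever want the index back as a regular column, one option is `reset_index`. The sketch below rebuilds the small table so it runs on its own; the names ending in `_demo` and `df_back` are made up for this illustration.

```python
import pandas as pd

dict2 = {'Numbers': [1, 2, 3, 4],
         'Fruit': ['apples', 'bananas', 'oranges', 'lemons'],
         'Randoms': [16.0, 12.0, 23.7, 18.2]}
df1_demo = pd.DataFrame(data=dict2)
df2_demo = df1_demo.set_index('Numbers')

# the index labels can be used to pick out a row
fruit_for_2 = df2_demo.loc[2, 'Fruit']
print(fruit_for_2)

# reset_index turns the index back into an ordinary column
df_back = df2_demo.reset_index()
print(df_back.columns.tolist())
```

After `reset_index`, "Numbers" is a key again, so `df_back['Numbers']` works.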

Now let's change the column labels.

In [None]:
df3 = df1.rename(columns={'Numbers':'Numeros','Fruit':'Frutas','Randoms':'Temperature'})
print(df3)

### 7.3 pandas Applied

The pandas package is a Python package that specifically analyzes and manipulates data in 2D or 1D arrays. Basically, pandas is the go-to tool for looking at data in a table (2D) or list (1D) format. In an Excel file, a 2D array is called a spreadsheet. In pandas, these 2D arrays are called DataFrames. pandas also has a name for a 1D array: a Series. Certain commands only work on Series and not DataFrames, or vice versa, so it is important to know which kind of object/data type you are working with.
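Here is a minimal sketch of how to tell which of the two you are holding. It reuses the small fruit table from earlier; the names `table`, `one_column`, and `still_a_table` are made up for this illustration. Single brackets give back a Series, and double brackets give back a one-column DataFrame.

```python
import pandas as pd

dict2 = {'Numbers': [1, 2, 3, 4],
         'Fruit': ['apples', 'bananas', 'oranges', 'lemons'],
         'Randoms': [16.0, 12.0, 23.7, 18.2]}
table = pd.DataFrame(data=dict2)

one_column = table['Fruit']        # single brackets: a 1D Series
still_a_table = table[['Fruit']]   # double brackets: a 2D DataFrame with one column

print(type(one_column), one_column.shape)
print(type(still_a_table), still_a_table.shape)
```

When a command complains about the object type, checking `type(...)` and `.shape` like this is a quick first step.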

(If you level up to 3D data, like climate data, with a lat, lon, and time - you will primarily work with the `numpy` package). 

We've already imported pandas, but if you closed the jupyter notebook between here and above, you can re-import the package.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt # we'll be looking at our data, so we'll import pyplot here as well

Whether your data is in html, csv, json, or xlsx, pandas can open that! Here is a comprehensive list of all file types pandas can open and read: https://pandas.pydata.org/docs/reference/io.html

We're going to try opening a few different files. Each file name is listed below.

In [None]:
# names of files to open
excel_file = '/scratch/myang_shared/lab/PythonBootcamp/Sp24/lessons/Lesson2/Weather.xlsx'
csv_file = '/scratch/myang_shared/lab/PythonBootcamp/Sp24/lessons/Lesson2/temp.csv'
txt_file = '/scratch/myang_shared/lab/PythonBootcamp/Sp24/lessons/Lesson2/moisture.txt'

In [None]:
# use the read_ command to open the three files. Note: use a different command for excel versus csv/txt!
excel_data = pd.read_excel(excel_file)
csv_data = pd.read_csv(csv_file, sep=',') 
txt_data = pd.read_csv(txt_file,sep=',')
# the sep=',' argument refers to the separator between data, meaning that values in the file are separated by ",".


In [None]:
# check out each DataFrame by replacing the name below. 
# You don't need to use the print command here - in fact, Jupyter Notebook spits out DataFrames in a nice format.

excel_data

When you looked at the text file, you might have noticed some funny stuff in the first few rows. What happened there? If you open the file in a text editor, you'll see there are three lines of text before the data starts. Remember that we can take a quick look inside of the file using Linux commands; see the next cell, which uses `%%bash`. 

In [None]:
%%bash 
head /scratch/myang_shared/lab/PythonBootcamp/Sp24/lessons/Lesson2/moisture.txt

We don't want those three lines of text before the data. In order to open the text file without those three rows, we can tell pandas where to start reading in the data. It will skip those three rows of text for us.

In [None]:
# we use the header argument to name the row to start on.
# remember, Python starts counting at 0. So the 4th row is actually 3!
txt_data = pd.read_csv(txt_file,sep=',',header=3)

In [None]:
# check to make sure the DataFrame looks nice.
txt_data
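To see what `header=3` does without needing the lab files, here is a sketch that uses a made-up file held in memory with `io.StringIO`. The file contents and the names `fake_text`, `demo_header`, and `demo_skiprows` are assumptions for this illustration only. It also shows `skiprows=3`, which is another way to get the same table.

```python
import io
import pandas as pd

# a pretend text file: three lines of notes, then the column labels, then data
fake_text = (
    "Station notes line 1\n"
    "Station notes line 2\n"
    "Station notes line 3\n"
    "Dates,liqprec_mm\n"
    "2019-01-01 00:00,0.0\n"
    "2019-01-01 01:00,1.2\n"
)

# header=3: the column labels are on row 3 (the 4th line); rows above are ignored
demo_header = pd.read_csv(io.StringIO(fake_text), sep=',', header=3)

# skiprows=3: throw away the first three lines, then read as usual
demo_skiprows = pd.read_csv(io.StringIO(fake_text), sep=',', skiprows=3)

print(demo_header)
print(demo_header.equals(demo_skiprows))
```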

Another useful coding practice is to limit the number of rows pandas shows you. If you have a large dataset but just want to give it a quick check, you can, like with Linux, show just the first or last few rows using `.head()` or `.tail()`. Try it out below.

In [None]:
txt_data.tail(5)  #note the similarity of this command to our head and tail Linux commands!

We now have three DataFrames each with different types of weather data for the year 2019. But do we actually care about all these variables? And what if we want to compare variables between DataFrames?

In order to simplify things, we can just select the variables we care about. The next few code cells will walk you through the process of selecting columns from each existing DataFrame. After you select the data, you can put these together to make a new DataFrame.

In [None]:
# first, let's recap the variables we have by printing out the column labels for each DataFrame.
print('Excel Variables:')
print(excel_data.columns)
print('CSV Variables:')
print(csv_data.columns)
print('TXT Variables:')
print(txt_data.columns)

In [None]:
# choose the variables you care about, then use those column headers to select the data.
# you can select different variables here if you want, 
# but you'll need to make sure you follow these changes throughout the code.
var1 = excel_data['Dates']
var2 = excel_data['Sky Condition (oktas)']
var3 = csv_data['airtemp_degc']
var4 = txt_data['liqprec_mm']

In [None]:
# always check your data to make sure things worked properly.
var1

You now have four pandas Series. You might notice that Series look different than DataFrames. First of all, there is no column header! This is because a Series is 1D data - it is a list. You don't need a column header if you only have a list.

Rather than work with four different Series, let's make a new DataFrame. First, we'll make a dict with our four Series, then we'll use that dict to make a DataFrame.

Important note: we are able to do this because the Series we extracted have the same length! In practice, if the Series have different lengths, the missing spots will be filled with a filler value called NaN (more on this later).
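Here is a tiny sketch of that NaN filling, using two made-up Series of different lengths (the names `s_long`, `s_short`, and `demo` are only for this illustration). pandas lines the Series up by their index labels, so the shorter one gets a NaN wherever it has no matching label.

```python
import pandas as pd

s_long = pd.Series([1.0, 2.0, 3.0])   # index labels 0, 1, 2
s_short = pd.Series([10.0, 20.0])     # index labels 0, 1

demo = pd.DataFrame({'long': s_long, 'short': s_short})
print(demo)

# count how many spots were filled with NaN in the short column
n_missing = demo['short'].isna().sum()
print(n_missing)
```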

In [None]:
# make a dict.
data = {'Dates':var1,'Sky Cond (oktas)':var2,'Temp (˚C)':var3,'Precip (mm)':var4}

# use that dict to make a DataFrame.
df1 = pd.DataFrame(data=data)

In [None]:
# check the df to make sure it worked.
df1

Now that you have the data you want, this is a good time to save your data! You can save as any kind of file that pandas can open, but let's try .xlsx and .csv.



In [None]:
# we perform the to_csv and to_excel command on df1, then provide a file name and path for saving the data.
# specify index=False to remove the DataFrame indexes from the saved file. Just makes things look nicer.
# from openpyxl import Workbook
df1.to_csv('MyData.csv',index=False)
df1.to_excel('MyData.xlsx',index=False)

For CSV and TXT files, you should be able to look at them through Jupyter Notebook easily. Remember the convention is for CSV files to separate data by commas, and TXT files separate by spaces or tabs. An XLSX file is written in a special way to make it easy for data/formulas to pop up in Microsoft Excel. To view this new file, you will want to download the XLSX file onto your computer to view in Microsoft Excel.
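To see what `index=False` changes, here is a sketch that saves a small made-up table both ways and reads each file back. It writes into a temporary folder so nothing is left behind; the table and all names in it are assumptions for this illustration.

```python
import os
import tempfile
import pandas as pd

demo = pd.DataFrame({'Numbers': [1, 2, 3], 'Randoms': [16.0, 12.0, 23.7]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # save without the index
    path_no_index = os.path.join(tmp_dir, 'demo.csv')
    demo.to_csv(path_no_index, index=False)

    # save with the index (this is the default behavior)
    path_with_index = os.path.join(tmp_dir, 'demo_with_index.csv')
    demo.to_csv(path_with_index)

    # read both files back in
    reloaded = pd.read_csv(path_no_index)
    reloaded_with_index = pd.read_csv(path_with_index)

print(reloaded.columns.tolist())
print(reloaded_with_index.columns.tolist())
```

When the index is saved, it comes back as an extra unnamed column, which is usually not what you want for a plain row counter.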

#### Quality Control
Ok, so we have data. Now what do we do with it? First of all, we should probably remove all those NaNs. In Python, NaN means Not a Number. It is a placeholder for missing data. If there are a lot of NaNs in your DataFrame, it can be a pain to remove them one by one. Fortunately, pandas has an easy tool to remove NaNs!

In [None]:
# as a reminder, this is what the DataFrame looks like.
df1

In [None]:
# the dropna command will remove rows OR columns with NaNs. We must specify how we want the data handled. 
# we'll try it both ways below and see which makes sense!
df2 = df1.dropna(axis='index')
df3 = df1.dropna(axis='columns')

In [None]:
# look at the data. Which method removed rows and which method removed columns with NaNs? 
# which method do you think makes more sense in this case?
print(df2)
print('-------------------------------------------------------------')
print(df3)

It seems like we lost all the data in `df3`. This is because each column with meteorological data had NaNs in it, so the `dropna` command removed those columns. Instead, we want to use `df2` and the **axis='index'** option.
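The same behavior is easier to see on a tiny table. The sketch below uses made-up values (the names `demo`, `rows_dropped`, and `cols_dropped` are only for this illustration): every weather column has one NaN, so dropping columns leaves only the dates.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'Dates': ['d1', 'd2', 'd3'],
                     'Temp': [1.0, np.nan, 3.0],
                     'Precip': [0.0, 0.2, np.nan]})

# axis='index' drops every ROW that contains a NaN
rows_dropped = demo.dropna(axis='index')

# axis='columns' drops every COLUMN that contains a NaN
cols_dropped = demo.dropna(axis='columns')

print(rows_dropped)
print(cols_dropped)
```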

Let's take a quick step back. We made 3 DataFrames in this tutorial. Two of these DataFrames were made from the first DataFrame. Was this necessary? Why did we have to make new DataFrames?

pandas is really cool because it can let you edit a DataFrame in place OR make a copy. When you edit a DataFrame in place, this means you change something about the DataFrame without having to change the name or make a new DataFrame. Making a copy is when you basically duplicate the DataFrame, then make the edit to the new DataFrame.

To change whether you edit in place or make a copy, all you have to do is specify `inplace=True` in the keyword arguments. If we had set `inplace=True` in our dropna example, we would not have been able to test out both axis options. In this situation, it was really helpful to make a copy. But if we know for sure what we want to do, `inplace=True` can be handy because it saves us from keeping several versions of a big DataFrame around under different names.
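A short sketch of the difference, using a made-up one-column table (the names `demo`, `copy_version`, and `result` are only for this illustration). Note that with `inplace=True` the command returns `None`, so you should not assign its result to a variable you plan to use.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'Temp': [1.0, np.nan, 3.0]})

# default: dropna returns a new DataFrame and leaves demo untouched
copy_version = demo.dropna()
rows_before = len(demo)
print(rows_before, len(copy_version))

# inplace=True: demo itself is changed, and nothing is returned
result = demo.dropna(inplace=True)
print(result, len(demo))
```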

#### Sorting, indexing, and subselecting data  <a name='knowledgecheck3' />

Now that we have good data, let's explore it a little. Maybe I want to look at only hours with temperatures at or below 0 ˚C. I need to use the `.loc[condition]` command to subselect the data.

With the `loc` command, the order of information goes **[index, columns]**. The first slot chooses rows and the second slot chooses columns. A condition such as `df2['Temp (˚C)']<=0.0` is built from a column, but it goes in the first slot: it marks each row as True or False, and `loc` keeps the rows marked True. If you want all of the columns, you can leave the second slot out. This is what we will do in the next code cell.

In [None]:
# select only cold data with .loc[] command.
# the condition in the brackets says 'anywhere in df2 where the Temp column is less than or equal to 0.0 ˚C'.
cold = df2.loc[df2['Temp (˚C)']<=0.0]
cold
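To peek at what the condition inside the brackets actually is, here is a sketch on a small made-up table (the shortened column names `Temp` and `Precip` and all variable names are assumptions for this illustration). The condition is a Series of True/False values, one per row, and `loc` keeps the True rows. Adding a column label after a comma narrows the result to that column.

```python
import pandas as pd

demo = pd.DataFrame({'Temp': [-2.0, 0.0, 3.5, -0.5],
                     'Precip': [0.0, 1.0, 0.2, 0.0]})

# the condition on its own: True where the row is at or below 0.0
mask = demo['Temp'] <= 0.0
print(mask)

# rows only: keep every column for the True rows
cold_demo = demo.loc[mask]

# rows and columns: keep only the Precip column for the True rows
cold_precip = demo.loc[mask, 'Precip']

print(cold_demo)
print(cold_precip)
```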

In [None]:
# let's look at the Sky Conditions on these cold days. 
# a histogram .hist() is a great tool for quickly assessing the distribution of a variable.
# you could also look at precip by changing the column name.
plt.hist(cold['Sky Cond (oktas)'].values)
plt.xlabel('Sky Cover (oktas)')
plt.ylabel('Count')
plt.title('Distribution of Sky Cover on Cold Days')
plt.show()

In [None]:
# what about only warm days?
warm = df2.loc[df2['Temp (˚C)']>0.0]
warm

In [None]:
# are the sky conditions different?
plt.hist(warm['Sky Cond (oktas)'].values)
plt.xlabel('Sky Cover (oktas)')
plt.ylabel('Count')
plt.title('Distribution of Sky Cover on Warm Days')
plt.show()

pandas has some really helpful tricks when working with datetime objects, so we can select data based on the Dates column as well.



In [None]:
# first, set the Dates column as an index for the DataFrame.
df2_dates = df2.set_index(['Dates'])
df2_dates

Remember when I said the order of the loc command went [index,columns]? Now we will focus on the index conditions. Since we want all of the columns, we can skip this condition.

In [None]:
# now let's select summer months only.
# reminder: summer months are June, July, August (06,07,08).
summer = df2_dates.loc['2019-06-01':'2019-08-31'] 
summer

In [None]:
# and select fall dates to compare.
# fall is defined as September, October, November (09,10,11).
fall = df2_dates.loc['2019-09-01':'2019-11-30']
fall
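One detail worth knowing about slicing by date labels: both ends of the slice are included. The sketch below checks this on a made-up daily table for 2019 (the names `dates`, `demo`, and `summer_demo` are only for this illustration).

```python
import pandas as pd

# one row per day of 2019, with a DatetimeIndex
dates = pd.date_range('2019-01-01', '2019-12-31', freq='D')
demo = pd.DataFrame({'Temp': range(len(dates))}, index=dates)

# select June 1 through August 31; both end dates are kept
summer_demo = demo.loc['2019-06-01':'2019-08-31']

print(summer_demo.index[0], summer_demo.index[-1])
print(len(summer_demo))  # 30 + 31 + 31 days
```

This differs from slicing a list by position, where the end of the slice is left out.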

What is the average temperature in the summer compared to the fall? What about the max and min precipitation?

In [None]:
# take the average.
summer_ave = summer['Temp (˚C)'].mean()
fall_ave = fall['Temp (˚C)'].mean()

# find the min value.
summer_min = summer['Precip (mm)'].min()
fall_min = fall['Precip (mm)'].min()

# find the max value.
summer_max = summer['Precip (mm)'].max()
fall_max = fall['Precip (mm)'].max()

In [None]:
# print out the results. 
# we can put ints or floats into strings using a trick with %.
# the .2f means print the number as a float with two digits after the decimal point.
# you can also use .d or .E. Try both of these out by editing the print statements. What happened?
print('Summer average temp: %.2f' % summer_ave) 
print('Fall average temp: %.2f' % fall_ave)

In [None]:
print('Summer rainfall range: %.2f to %.2f' % (summer_min,summer_max))
print('Fall rainfall range: %.2f to %.2f' % (fall_min,fall_max))
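You may also run into f-strings in other people's code, which are another way to do the same formatting. The sketch below uses made-up numbers (`value`, `low`, and `high` are only for this illustration) and shows that the `%` style and the f-string style give the same text.

```python
value = 3.14159
low, high = -0.5, 12.345

# the % style used in this lesson
old_style = 'Average temp: %.2f' % value
old_range = 'Range: %.2f to %.2f' % (low, high)

# the same thing written as f-strings
new_style = f'Average temp: {value:.2f}'
new_range = f'Range: {low:.2f} to {high:.2f}'

print(old_style)
print(new_style)
print(old_range == new_range)
```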

Hey! We have negative numbers for our rainfall amount! That isn't possible. Well, we clearly missed something in our quality control. That's ok. We know how to fix that now using the loc command. 

#### More complex sorting and indexing
What if we want to find the average temperature of each month? pandas has a great command for that also:

In [None]:
# we use groupby to group the data according to month. Then we tell pandas to take the mean of each group.
month_aves = df2_dates.groupby(by=df2_dates.index.month).mean()
month_aves

Now we have the averages of everything for each month. What if we just want the temperature column? We can do this in two ways. 1) run groupby on the whole DataFrame then select the Temp column, or 2) run groupby on only the Temp column

In [None]:
month_aves = df2_dates.groupby(by=df2_dates.index.month).mean()
temp1 = month_aves['Temp (˚C)']
temp2 = df2_dates['Temp (˚C)'].groupby(by=df2_dates.index.month).mean()
print(temp1)
print('---------------------')
print(temp2)

Both methods produce the same result, so you can use either in the future!
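If `groupby` still feels like magic, here is a sketch on four made-up rows where the answer is easy to check by hand (the names `dates`, `demo`, and `monthly` are only for this illustration). The two January values average to 2.0 and the two February values average to 7.0.

```python
import pandas as pd

dates = pd.to_datetime(['2019-01-05', '2019-01-20', '2019-02-03', '2019-02-17'])
demo = pd.DataFrame({'Temp': [1.0, 3.0, 5.0, 9.0]}, index=dates)

# the month number of each row is what we group on
print(demo.index.month.tolist())

# rows with the same month number are pooled, then averaged
monthly = demo.groupby(by=demo.index.month).mean()
print(monthly)
```

The index of the result is the month number (1 for January, 2 for February), not a date.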

Now let's group by other features. What if we want to group based on hour? Let's try this by finding the maximum value at each hour.

In [None]:
hour_max = df2_dates.groupby(by=df2_dates.index.hour).max()

plt.plot(hour_max['Temp (˚C)'])
plt.xlabel('Hour')
plt.ylabel('Temperature (˚C)')
plt.title('Hourly Maximum Temperature')
plt.show()

Weird, why is the temperature coldest at 11 am? Because the time is in [UTC](https://www.utctime.net)! We could fix that by making the DatetimeIndex time-zone aware and then converting the time zone.

In [None]:
df2_localized = df2_dates.tz_localize(tz='UTC') # make the DataFrame time-zone aware
df3 = df2_localized.tz_convert(tz='America/New_York') # convert from UTC to Eastern time
# side note: tz_localize and tz_convert do not have inplace=True options.
# we have to make new dfs, so we store the results under new names (df2_localized and df3).

In [None]:
# check how the Dates info changed here: 
print("UTC  :",df2_localized.index[0:4].hour)
print("Eastern time:",df3.index[0:4].hour)
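Here is the same two-step time zone fix on four made-up July timestamps, so you can check the shift by hand (the names `idx`, `demo`, `utc`, and `eastern` are only for this illustration). In July, New York is on daylight time, which is 4 hours behind UTC.

```python
import pandas as pd

idx = pd.to_datetime(['2019-07-01 00:00', '2019-07-01 06:00',
                      '2019-07-01 12:00', '2019-07-01 18:00'])
demo = pd.DataFrame({'Temp': [18.0, 16.0, 24.0, 27.0]}, index=idx)

# step 1: tell pandas these times are in UTC (the clock values don't change)
utc = demo.tz_localize(tz='UTC')

# step 2: convert to New York time (the clock values shift)
eastern = utc.tz_convert(tz='America/New_York')

print(utc.index.hour.tolist())
print(eastern.index.hour.tolist())
```

Only the labels on the index move; the temperature values stay attached to the same moments in time.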

In [None]:
# try the hour groupby again
hour_max = df3.groupby(by=df3.index.hour).max()
plt.plot(hour_max['Temp (˚C)'])
plt.xlabel('Hour')
plt.ylabel('Temperature (˚C)')
plt.title('Hourly Maximum Temperature in Eastern Time')
plt.show()

<a href=#home>Return to Top</a> 

## Knowledge Check 4. Pandas, Sorting, Indexing, and Selecting. 

Below is code that has been adjusted from the <a href=#knowledgecheck3>'Sorting, indexing, and subselecting data' subsection</a>. Add one simple line of code below to remove negative liquid precip from df2 (from the "Precip (mm)" column), and work through understanding the rest of the code.

In [None]:
# look at the data before removing the negative precip values.
print(df2)

# find the min precip value.
annual_min_pcp = df2['Precip (mm)'].min()

# find the max precip value.
annual_max_pcp = df2['Precip (mm)'].max()

print('2019 minimum rainfall: %.2f mm/hr' % (annual_min_pcp))
print('2019 maximum rainfall: %.2f mm/hr' % (annual_max_pcp))

# we use groupby to group the data according to month. Then we tell pandas to take the mean of each group.
month_aves = df2_dates.groupby(by=df2_dates.index.month).mean()
month_aves
print(month_aves)

# this is the average pcp in mm/hr
pcp1 = month_aves['Precip (mm)']
# it would be more helpful to have it just mm/month
# there are better, more specific ways to do this to make sure you are doing the math 
# correctly and are using the actual number of days in each month to scale up mm/hr to mm
# per month for the month (Feb has 28 days, July has 31...but we'll do a rough average below)
pcp2 = pcp1*24*30 

print(len(pcp2))
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
plt.bar(months, pcp2, color ='teal', 
        width = 0.4)
plt.xlabel('Month')
plt.ylabel('Precipitation (mm)')
plt.title('Average(ish) monthly precipitation')
plt.show()
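The comments in the cell above mention that scaling by a flat 30 days is only a rough average. As one possible refinement, the sketch below uses the standard `calendar` module to look up the real number of days in each month of 2019. The hourly rates here are made-up placeholder values, not the lab data, and the variable names are only for this illustration.

```python
import calendar

# hypothetical average rates in mm/hr, one value per month (placeholder numbers)
mean_rate_mm_per_hr = [0.10] * 12

# actual number of days in each month of 2019
days_in_month = [calendar.monthrange(2019, month)[1] for month in range(1, 13)]

# mm/hr * 24 hr/day * days in that month = mm for the month
monthly_total_mm = [rate * 24 * days
                    for rate, days in zip(mean_rate_mm_per_hr, days_in_month)]

print(days_in_month)
print(monthly_total_mm)
```

With real data, the list of rates would come from the monthly averages, in the same January-to-December order.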

<a href=#home>Return to Top</a> 