## Selection by Labels
* We can extract data from the rows using the location (`.loc`) attribute
* Watch carefully....

In [None]:
students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Mark', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['U-M', 'MSU', 'U-M'])

* Two important considerations:
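1. The return value is a `Series` when the label matches exactly one row -- neat! (A label shared by several rows, like `'U-M'` here, gives back a `DataFrame` instead.)
2. `.loc` is **not** a function.

~~loc()~~ is not a thing; it's loc\[\]. Think of the DataFrame as a NumPy array and it will make more sense: you're just indexing into the array.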
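
Here's a minimal sketch of label lookup with square brackets, reusing the `students` data from above. The variable names `msu` and `um` are just for illustration. Notice that the type of the result depends on how many rows carry the label.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Mark', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['U-M', 'MSU', 'U-M'])

# Square brackets, not parentheses: we are indexing, not calling
msu = df.loc['MSU']   # one matching row -> Series
um = df.loc['U-M']    # two matching rows -> DataFrame

print(type(msu))
print(type(um))
```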

In [None]:
# reminder what our dataframe looks like
df

In [None]:
# we can use loc to index in (you saw this)
#df.loc["MSU"]
# we can also add the second dimension, column names, to the index


In [None]:
# what if we want two columns?
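
One way the two cells above could be filled in (a sketch, reusing the `students` data): give `.loc` a row label and a column label, and pass a list when you want more than one column. The variable names are only for illustration.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Mark', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['U-M', 'MSU', 'U-M'])

# Row label first, column label second
jack = df.loc['MSU', 'Name']        # a single value
um_names = df.loc['U-M', 'Name']    # 'U-M' matches two rows, so this is a Series

# A list of column labels selects several columns; ':' means "all rows"
names_scores = df.loc[:, ['Name', 'Score']]

print(jack)
print(um_names)
print(names_scores)
```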


## Other ways of slicing: by Index
* So, `.loc` allows us to index in both dimensions of the dataframe, and allows us to slice by both index and column.
* `.loc` has a sibling though: `.iloc`, which stands for integer location. It lets you select by row or column position (counting from 0) instead of by label.

In [None]:
df

In [None]:
# Oh, and slicing? Check ✅
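
A short sketch of `.iloc` and slicing on the same `students` data. One thing to watch: integer slices leave out the end position, like ordinary Python slices, while label slices with `.loc` include the end label.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Mark', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['U-M', 'MSU', 'U-M'])

first_row = df.iloc[0]        # row at position 0, as a Series
first_cell = df.iloc[0, 0]    # row 0, column 0
first_two = df.iloc[0:2]      # positions 0 and 1; position 2 is excluded

# Label slices are inclusive on both ends
name_to_class = df.loc[:, 'Name':'Class']

print(first_cell)
print(first_two)
print(name_to_class)
```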


* Great, we have a DataFrame: a two-dimensional data storage object with row indexes and column names.
* We can get data out a row or column at a time, or narrow down to specific row/column combinations.
* And we can pull data out using nice labels (strings!) or integer locations.

### An Aside (and a warning)...

I'm going to show you something I would encourage you to never use.

I mean, it looks nice, but it's really going to bite you later...

In [None]:
df

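pandas exposes each column as an attribute of the DataFrame (as long as the column name is a valid Python identifier and doesn't clash with an existing attribute or method), so dot notation can be used to index directly into the dataframe.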

See https://www.dataschool.io/pandas-dot-notation-vs-brackets/
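
To make the warning concrete, here is a small sketch on a made-up frame. The column names `Class Name` and `count` are hypothetical, picked to trigger two of the common failure modes.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Jack', 'Mark'],
                   'Class Name': ['Physics', 'Chemistry', 'Biology'],
                   'count': [1, 2, 3]})

# Works, because 'Name' is a valid identifier with no clash
same = df.Name.equals(df['Name'])

# 'Class Name' contains a space, so there is no way to write it with a dot;
# brackets are the only option
classes = df['Class Name']

# 'count' is also the name of a DataFrame method, so the dot gives you
# the method and not the column
dot_is_method = callable(df.count)
counts = df['count']

print(same, dot_is_method)
```

Brackets behave the same way for every column name, which is why they are the safer habit.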

Please, just forget you saw this.

### Indexing by Callable

In [None]:
import pandas as pd
# We'll load in our CSV file
df = pd.read_csv('datasets/Admission_Predict.csv', index_col=0)
# And we'll clean up a couple of poorly named columns like before
df.columns = [x.lower().strip() for x in df.columns]
# And we'll take a look at the results
df.head()

* Querying dataframes is all about boolean masking

* We can apply a mask in a couple of ways

* Of course, you don't have to make the mask object (and likely won't)!
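
A minimal sketch of a boolean mask. To keep it self-contained it uses a tiny made-up frame in place of the admissions CSV; the column names and the 0.7 threshold are assumptions for illustration, and the numbers are invented.

```python
import pandas as pd

# Toy stand-in for the admissions data (values are made up)
df = pd.DataFrame({'gre score': [337, 324, 316, 322],
                   'chance of admit': [0.92, 0.76, 0.72, 0.65]},
                  index=[1, 2, 3, 4])

# Comparing a column to a value gives a Series of True/False: the mask
admit_mask = df['chance of admit'] > 0.7

# Way 1: pass the mask to the indexing operator
selected = df[admit_mask]

# Way 2: skip the named mask and write the condition inline
inline = df[df['chance of admit'] > 0.7]

print(admit_mask)
print(selected)
```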

* We can also use the `where()` method. A subtle issue is that it keeps the original shape: rows that don't satisfy the mask are left in, filled with NaN.

In [None]:
df.where(admit_mask).head()

* The nice thing about `where()` is that it's easy to read
* Often you mix it together with `dropna()`

In [None]:
df.where(admit_mask).dropna().head()
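
Here is a sketch of what the two cells above do, on a made-up frame (the column names and threshold are assumptions). It also shows a side effect worth knowing: once NaN enters an integer column, the column becomes float.

```python
import pandas as pd

# Toy stand-in for the admissions data (values are made up)
df = pd.DataFrame({'gre score': [337, 324, 316, 322],
                   'chance of admit': [0.92, 0.76, 0.72, 0.65]},
                  index=[1, 2, 3, 4])
admit_mask = df['chance of admit'] > 0.7

# where() keeps every row; rows failing the mask become all-NaN
kept_shape = df.where(admit_mask)

# dropna() then removes the rows that contain NaN
dropped = kept_shape.dropna()

print(kept_shape)
print(dropped)
print(kept_shape['gre score'].dtype)
```

Keep in mind that `dropna()` also removes rows that had missing values before the mask was applied, so `df[admit_mask]` and `df.where(admit_mask).dropna()` are not always interchangeable.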

* Masks can be composites, and made up of several conditions

* The problem is, pandas doesn't know how to `and` two `Series` objects together, because Python doesn't let libraries overload `and`.
* PEP 335 proposed allowing that, but it was rejected: https://www.python.org/dev/peps/pep-0335/
* But it does know how to `&` them!

* But you need to watch out for order of operations: `&` binds more tightly than comparisons like `>`, so wrap each condition in parentheses.
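
A sketch of a composite mask on made-up data (the column name and thresholds are assumptions). It tries the two versions that fail before showing the one that works; the flag variables exist only so the failures can be observed without stopping the cell.

```python
import pandas as pd

# Toy stand-in for the admissions data (values are made up)
df = pd.DataFrame({'chance of admit': [0.92, 0.76, 0.72, 0.65]})
chance = df['chance of admit']

# `and` asks each Series for a single True/False, which pandas refuses to give
try:
    (chance > 0.7) and (chance < 0.9)
    and_failed = False
except ValueError:
    and_failed = True

# Without parentheses `&` is applied before the comparisons, so this is
# not the expression we meant, and it raises an error
try:
    chance > 0.7 & chance < 0.9
    unparenthesized_failed = False
except (TypeError, ValueError):
    unparenthesized_failed = True

# Parenthesize each condition, then combine element by element with &
both = (chance > 0.7) & (chance < 0.9)

print(and_failed, unparenthesized_failed)
print(df[both])
```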

* Finally, there are additional helper methods on dataframes and series to be aware of, such as `gt()` and `lt()`, which do the same job as `>` and `<`.
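
As one example of such helpers, comparison methods like `gt()` and `lt()` sidestep the parentheses problem, since method calls are evaluated before `&`. This is a sketch on made-up data, and the column name is an assumption.

```python
import pandas as pd

# Toy stand-in for the admissions data (values are made up)
df = pd.DataFrame({'chance of admit': [0.92, 0.76, 0.72, 0.65]})
chance = df['chance of admit']

# gt = "greater than", lt = "less than"
with_methods = chance.gt(0.7) & chance.lt(0.9)

# The operator version needs parentheses around each comparison
with_operators = (chance > 0.7) & (chance < 0.9)

print(df[with_methods])
```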

## More Indexing
* Let's go back to indexing dataframes; there's some neat stuff there.
* Remember that the index holds the row-level labels, and the column names are the column-level labels.
* We can swap columns and rows trivially (that's a transpose, `.T`).
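
A quick sketch of swapping rows and columns with the `.T` attribute, using the small `students` data from earlier in the notebook.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Mark', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['U-M', 'MSU', 'U-M'])

# After transposing, the old column names are the index
# and the old index labels are the column names
flipped = df.T

print(flipped)

# So .loc on the transposed frame looks things up by the old column names
names = flipped.loc['Name']
```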

* We saw that we can set the index with `set_index()`

In [None]:
# Of course, this didn't actually change our previous dataframe, right?
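
A sketch of that point on made-up data (the column names are assumptions): `set_index()` returns a new DataFrame and leaves the original alone unless you reassign the result.

```python
import pandas as pd

# Toy stand-in for the admissions data (values are made up)
df = pd.DataFrame({'gre score': [337, 324, 316],
                   'chance of admit': [0.92, 0.76, 0.72]},
                  index=[1, 2, 3])

# The column moves into the index of the *returned* frame
new_df = df.set_index('chance of admit')

print(df.index)       # still the original labels
print(new_df.index)   # now the 'chance of admit' values

# reset_index() goes the other way and turns the index back into a column
restored = new_df.reset_index()
```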


## Multilevel indexing
* We can create hierarchical indices, which is pretty neat
* Let's look at some (old) census data

In [None]:
import pandas as pd
df = pd.read_csv("datasets/census.csv")
df.head()

In [None]:
# In this data there are only two sumlevels
# ...so let's just get county-level data


In [None]:
# We can set a multilevel index just by passing a list of things we want to index on
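
A sketch of both steps on a tiny made-up frame. The column names (`SUMLEV`, `STNAME`, `CTYNAME`, `CENSUS2010POP`) and the use of `50` as the county-level code are assumptions about the layout of `census.csv`, and the population numbers are invented.

```python
import pandas as pd

# Toy stand-in for the census data (all numbers are made up)
census = pd.DataFrame({
    'SUMLEV': [40, 50, 50, 40, 50],
    'STNAME': ['Michigan', 'Michigan', 'Michigan', 'Ohio', 'Ohio'],
    'CTYNAME': ['Michigan', 'Washtenaw County', 'Wayne County',
                'Ohio', 'Franklin County'],
    'CENSUS2010POP': [1000, 300, 600, 900, 400],
})

# Keep only the county-level rows (assumed to be SUMLEV == 50)
counties = census[census['SUMLEV'] == 50]

# Passing a list of columns builds a two-level (state, county) index
counties = counties.set_index(['STNAME', 'CTYNAME'])

print(counties)
```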


* Querying gets, frankly, complex
* `df.loc[row, column]`
* But with a multiindex we can do
  * `df.loc[row index1, row index2]`

In [None]:
# It's a bit ambiguous; I recommend passing keys as a tuple instead
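
A sketch of tuple keys on a made-up two-level index (the column names are assumptions and the numbers are invented). A tuple makes it clear that both values belong to the row index, which leaves the second position of `.loc` free for the column.

```python
import pandas as pd

# Toy county-level data (all numbers are made up)
counties = pd.DataFrame({
    'STNAME': ['Michigan', 'Michigan', 'Ohio'],
    'CTYNAME': ['Washtenaw County', 'Wayne County', 'Franklin County'],
    'CENSUS2010POP': [300, 600, 400],
}).set_index(['STNAME', 'CTYNAME']).sort_index()

# One tuple = one full row key
row = counties.loc[('Michigan', 'Washtenaw County')]

# Tuple for the row, plain label for the column
pop = counties.loc[('Michigan', 'Washtenaw County'), 'CENSUS2010POP']

# A list of tuples selects several rows at once
two = counties.loc[[('Michigan', 'Washtenaw County'),
                    ('Michigan', 'Wayne County')]]

print(pop)
print(two)
```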


#### Which county has the largest population in Michigan?
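
One possible approach, sketched on made-up numbers, so the result below says nothing about the real census data. The column names are assumptions about `census.csv`. The idea is to select the state on the first index level, then ask for the label of the largest value with `idxmax()`.

```python
import pandas as pd

# Toy county-level data (all numbers are made up)
counties = pd.DataFrame({
    'STNAME': ['Michigan', 'Michigan', 'Ohio'],
    'CTYNAME': ['Washtenaw County', 'Wayne County', 'Franklin County'],
    'CENSUS2010POP': [300, 600, 400],
}).set_index(['STNAME', 'CTYNAME']).sort_index()

# Selecting on the first level only leaves a frame indexed by county name
michigan = counties.loc['Michigan']

# idxmax() returns the index label of the largest value
largest = michigan['CENSUS2010POP'].idxmax()

print(largest)
```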