# STATS 102
## Class 13 

In this notebook, we will do a quick review of the Pandas objects and associated operations. Topics include:

* #1 Creating Series, Dataframes
* #2 Data selection in Series, Dataframes
* #3 Operations in Pandas
* #4 Operating on Null Values (find, remove, fill)
* #5 Multi-index

<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">

*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*

*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*

# Introducing Pandas Objects

At the very basic level, **Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.**

Before we go any further, **let's introduce the three fundamental Pandas data structures:**

* ``Series``
* ``DataFrame``
* ``Index``

We will start our code sessions with the standard NumPy and Pandas imports:

In [1]:
import numpy as np
import pandas as pd

# #1 Creating Series, Dataframes

## The Pandas Series Object

A Pandas ``Series`` is a one-dimensional array of indexed data.
It can be created from a list or array as follows:

In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1.0]) # One-dimensional array
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

### ``Series`` as generalized NumPy array

**While the Numpy Array has an *implicitly defined* integer index used to access the values, the Pandas ``Series`` has an *explicitly defined* index associated with the values.**

In [3]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

And the item access works as expected:

In [4]:
data['b']

0.5

### Constructing Series objects

```python
>>> pd.Series(data, index=index)
```

where ``index`` is an optional argument, and ``data`` can be one of many entities.

For example, ``data`` can be a list or NumPy array, in which case ``index`` defaults to an integer sequence:

In [5]:
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

``data`` can be a scalar, which is repeated to fill the specified index:

In [6]:
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

``data`` can be a dictionary, in which case ``index`` defaults to the dictionary keys, kept in insertion order:

In [7]:
pd.Series({2:'a', 1:'b', 3:'c'})

2    a
1    b
3    c
dtype: object
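
In each case, the index can also be set explicitly if a different result is preferred. The short sketch below assumes that passing both a dictionary and an ``index`` keeps only the listed keys, in the listed order; the variable names ``letters`` and ``subset`` are illustrative and not part of the notebook:

```python
import pandas as pd

letters = {2: 'a', 1: 'b', 3: 'c'}

# Only the keys named in `index` are kept, in the order given
subset = pd.Series(letters, index=[3, 2])
print(subset)
```

Here the ``Series`` is populated only with the explicitly requested keys, so key ``1`` is left out.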

## The Pandas DataFrame Object

The next fundamental structure in Pandas is the ``DataFrame``.

### DataFrame as a generalized NumPy array
A ``DataFrame`` is an analog of a two-dimensional array with both flexible row indices and flexible column names.

You can think of a ``DataFrame`` as a sequence of aligned ``Series`` objects.

In [8]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict) # Construct series from dictionary
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

Next we build a ``population`` Series in the same way, and then use a dictionary to combine the two into a single two-dimensional object:

In [9]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
# Construct a Series from a Python dictionary
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [10]:
# Dataframe combines both Series
states = pd.DataFrame({'population': population,
                       'area': area}) 
states

Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


Like the ``Series`` object, the ``DataFrame`` has an ``index`` attribute that gives access to the index labels:

In [11]:
states.index # Rows

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

**Additionally**, the ``DataFrame`` has a ``columns`` attribute, which is an ``Index`` object holding the column labels:

In [12]:
states.columns # Columns

Index(['population', 'area'], dtype='object')

### Thus the ``DataFrame`` can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.

### Constructing DataFrame objects

A Pandas ``DataFrame`` can be constructed in a variety of ways.
Here we'll give you four examples.

#### #1 From a single Series object

A ``DataFrame`` is a collection of ``Series`` objects, and a single-column ``DataFrame`` can be constructed from a single ``Series``:

In [13]:
pd.DataFrame(population, columns=['population'])

Unnamed: 0,population
California,38332521
Texas,26448193
New York,19651127
Florida,19552860
Illinois,12882135


#### #2 From a list of dicts

Any list of dictionaries can be made into a ``DataFrame``.
We'll use a simple list comprehension to create some data:

In [14]:
data = [{'a': i, 'b': 2 * i} for i in range(3)]
pd.DataFrame(data)

Unnamed: 0,a,b
0,0,0
1,1,2
2,2,4


Even if some keys in the dictionary are missing, Pandas will fill them in with ``NaN`` (i.e., "not a number") values:

In [15]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

Unnamed: 0,a,b,c
0,1.0,2,
1,,3,4.0


#### #3 From a dictionary of Series objects

As we saw before, a ``DataFrame`` can be constructed from a dictionary of ``Series`` objects as well:

In [16]:
pd.DataFrame({'population': population,
              'area': area})

Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


#### #4 From a two-dimensional NumPy array

Given a two-dimensional array of data, we can create a ``DataFrame`` with any specified column and index names.

If omitted, an integer index will be used for each:

In [17]:
# Creating a 2 dimensional dataframe
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

Unnamed: 0,foo,bar
a,0.985717,0.053217
b,0.168369,0.117277
c,0.52248,0.73013


# #2 Data selection in Series, Dataframes

## Data Selection in Series

As we saw in the previous section, a ``Series`` object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.
If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.

In [18]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd']) # Define object with values and labels
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [19]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [20]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [21]:
# masking
data[(data > 0.3) & (data < 0.8)] # Compare values

b    0.50
c    0.75
dtype: float64

In [22]:
# fancy indexing with a label that may be missing
# (data[['a', 'e']] raises KeyError in current Pandas because 'e' is absent)
data.reindex(['a', 'e'])

a    0.25
e     NaN
dtype: float64

## IMPORTANT
Slicing may be the source of the most confusion. There are two possibilities:

1) When slicing with an explicit index (e.g., ``data['a':'c']``), the final index is *included* in the slice.

2) When slicing with an implicit index (e.g., ``data[0:2]``), the final index is *excluded* from the slice.
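
The two slicing rules can be checked side by side. This is a minimal sketch that rebuilds the ``data`` Series from the cells above; ``by_label`` and ``by_position`` are illustrative names:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

by_label = data['a':'c']   # label slice: the end label 'c' is included
by_position = data[0:2]    # positional slice: position 2 is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```

The label slice returns three rows while the positional slice returns two, which is exactly the asymmetry described in the list above.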

### Indexers: loc and iloc

These slicing and indexing conventions can be a source of confusion.
For example, if your ``Series`` has an explicit integer index, an indexing operation such as ``data[1]`` will use the explicit indices, while a slicing operation like ``data[1:3]`` will use the implicit Python-style index.

In [23]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

Because of this potential confusion in the case of integer indexes, Pandas provides some special *indexer* attributes that explicitly expose certain indexing schemes.
These are not functional methods, but attributes that expose a particular slicing interface to the data in the ``Series``.

First, the ``loc`` attribute allows indexing and slicing that always references the explicit index:

In [24]:
data.loc[1] # Refers to key (or explicit index)

'a'

In [25]:
data.loc[1:3] # Slicing using keys (or explicit index)

1    a
3    b
dtype: object

The ``iloc`` attribute allows indexing and slicing that always references the implicit Python-style index:

In [26]:
data.iloc[1] # Refers to implicit index

'b'

In [27]:
data.iloc[1:3] # Slice using implicit index 

3    b
5    c
dtype: object

## Data Selection in DataFrame

Recall that a ``DataFrame`` acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of ``Series`` structures **sharing the same index.**

These analogies can be helpful to keep in mind as we explore data selection within this structure.

In [28]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


The individual ``Series`` that make up the columns of the ``DataFrame`` can be accessed via dictionary-style indexing of the column name:

In [29]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [30]:
data.values

array([[  423967, 38332521],
       [  695662, 26448193],
       [  141297, 19651127],
       [  170312, 19552860],
       [  149995, 12882135]])

In [31]:
# Adding a new column to dataframe

# WOW this is easy!!!

data['density'] = data['pop'] / data['area']
data

Unnamed: 0,area,pop,density
California,423967,38332521,90.413926
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


Using the ``iloc`` indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the ``DataFrame`` index and column labels are maintained in the result:

In [32]:
data.iloc[:3, :2] # Use implicit index (first 3 rows, first 2 columns)

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127


Similarly, using the ``loc`` indexer we can index the underlying data in an array-like style but using the explicit index and column names:

In [33]:
data.loc[:'Illinois', :'pop'] # Indexing using "column names"

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


Any of the familiar NumPy-style data access patterns can be used within these indexers.
For example, in the ``loc`` indexer we can combine masking and fancy indexing as in the following:

In [34]:
# If data.density values greater than 100, 
# give me corresponding pop and density
# Slice based on values
data.loc[data.density > 100, ['pop', 'density']] 

Unnamed: 0,pop,density
New York,19651127,139.076746
Florida,19552860,114.806121


### Additional indexing conventions

First, while *indexing* refers to columns, *slicing* refers to rows:

In [35]:
data['Florida':'Illinois'] # Using explicit index

Unnamed: 0,area,pop,density
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


Similarly, direct masking operations are also interpreted row-wise rather than column-wise:

In [36]:
data[data.density > 100] # Masking using values

Unnamed: 0,area,pop,density
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121


# #3 Operations in Pandas

Pandas inherits ufuncs and the ability to do element-wise operations from NumPy, but adds a couple of useful twists:

1) for unary operations like negation and trigonometric functions, these ufuncs will *preserve index and column labels* in the output 

2) for binary operations such as addition and multiplication, Pandas will automatically *align indices* when passing the objects to the ufunc.

This means that keeping the context of data and combining data from different sources, both potentially error-prone tasks with raw NumPy arrays, become essentially foolproof ones with Pandas.

We will additionally see that there are well-defined operations between one-dimensional ``Series`` structures and two-dimensional ``DataFrame`` structures.

## Ufuncs: Index Preservation

Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas ``Series`` and ``DataFrame`` objects.
Let's start by defining a simple ``Series`` and ``DataFrame`` on which to demonstrate this:

In [37]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4)) 
# Create a series with random integers between 0 and 10, 4 values
ser

0    6
1    3
2    7
3    4
dtype: int64

In [38]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D']) # 3 rows, 4 columns
df

Unnamed: 0,A,B,C,D
0,6,9,2,6
1,7,4,3,7
2,7,2,5,4


If we apply a NumPy ufunc on either of these objects, the result will be another Pandas object *with the indices preserved:*

In [41]:
print(df)
np.sin(df * np.pi / 4)

   A  B  C  D
0  6  9  2  6
1  7  4  3  7
2  7  2  5  4


Unnamed: 0,A,B,C,D
0,-1.0,0.7071068,1.0,-1.0
1,-0.707107,1.224647e-16,0.707107,-0.7071068
2,-0.707107,1.0,-0.707107,1.224647e-16


## UFuncs: Index Alignment

For binary operations on two ``Series`` or ``DataFrame`` objects, Pandas **will align indices** in the process of performing the operation.

This is very convenient when working with incomplete data, as we'll see in some of the examples that follow.

In [42]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

In [69]:
# Let's see what happens when we divide these to compute the 
# population density:
print("\nPopulation\n",population)
print("\nArea\n",area)
print("\nDensity\n")
population / area # matches indices


Population
 California    38332521
Texas         26448193
New York      19651127
Name: population, dtype: int64

Area
 Alaska        1723337
Texas          695662
California     423967
Name: area, dtype: int64

Density



Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

The resulting array contains the **union of indices** of the two input arrays, which could be determined using the ``union()`` method of these indices (the ``|`` operator no longer performs a set union on ``Index`` objects in current Pandas):

In [44]:
area.index.union(population.index)

Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

Any item for which **one or the other** does not have an entry is marked with ``NaN``, or "Not a Number".

This index matching is implemented this way for any of Python's built-in arithmetic expressions; any missing values are filled in with NaN by default:

In [45]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
print(A)
print(B)
A + B # Adds values where indices match

0    2
1    4
2    6
dtype: int64
1    1
2    3
3    5
dtype: int64


0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64
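
If ``NaN`` is not the desired behavior, the method form of the operator accepts a ``fill_value`` that stands in for entries missing from one operand. A minimal sketch using the same ``A`` and ``B`` as above (``total`` is an illustrative name):

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Treat an entry missing from one operand as 0 instead of producing NaN
total = A.add(B, fill_value=0)
print(total)
```

Indices ``0`` and ``3`` now carry the value from whichever Series has them, rather than ``NaN``.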

### Index alignment in DataFrame

A similar type of alignment takes place for *both* columns and indices when performing operations on ``DataFrame``s:

In [70]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))
A

Unnamed: 0,A,B
0,13,17
1,8,1


In [71]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
B

Unnamed: 0,B,A,C
0,3,6,7
1,2,0,3
2,1,7,3


In [74]:
print(A)
print(B)
A + B

    A   B
0  13  17
1   8   1
   B  A  C
0  3  6  7
1  2  0  3
2  1  7  3


Unnamed: 0,A,B,C
0,19.0,20.0,
1,8.0,3.0,
2,,,


Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted.

The following table lists Python operators and their equivalent Pandas object methods:

| Python Operator | Pandas Method(s)                      |
|-----------------|---------------------------------------|
| ``+``           | ``add()``                             |
| ``-``           | ``sub()``, ``subtract()``             |
| ``*``           | ``mul()``, ``multiply()``             |
| ``/``           | ``truediv()``, ``div()``, ``divide()``|
| ``//``          | ``floordiv()``                        |
| ``%``           | ``mod()``                             |
| ``**``          | ``pow()``                             |
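
As a quick check of the table, an operator and its method counterpart should give the same aligned result. The sketch below uses two small hand-written frames (``left`` and ``right`` are illustrative, not the random frames above):

```python
import pandas as pd

left = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
right = pd.DataFrame([[10, 20, 30], [40, 50, 60]], columns=list('BAC'))

via_operator = left * right     # Python operator
via_method = left.mul(right)    # equivalent Pandas method

# equals() treats NaNs in matching positions as equal
print(via_operator.equals(via_method))
```

The method form is mainly useful when extra arguments such as ``fill_value`` or ``axis`` are needed.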


## Ufuncs: Operations Between DataFrame and Series

When performing operations between a ``DataFrame`` and a ``Series``, the index and column alignment is similarly maintained.
Operations between a ``DataFrame`` and a ``Series`` are similar to operations between a two-dimensional and one-dimensional NumPy array.
Consider one common operation, where we find the difference of a two-dimensional array and one of its rows:

In [86]:
A = rng.randint(10, size=(3, 4))
print(A)
print(A[0,:])

[[3 8 2 4]
 [2 6 4 8]
 [6 1 3 8]]
[3 8 2 4]


In [89]:
A - A[0,:] # Subtract first row

array([[ 0,  0,  0,  0],
       [-1, -2,  2,  4],
       [ 3, -7,  1,  4]])

According to NumPy's broadcasting rules (see [Computation on Arrays: Broadcasting](02.05-Computation-on-arrays-broadcasting.ipynb)), subtraction between a two-dimensional array and one of its rows is applied row-wise.

In Pandas, the convention similarly operates row-wise by default:

In [55]:
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0,:] # Implicit index first row

Unnamed: 0,Q,R,S,T
0,0,0,0,0
1,-1,-2,2,4
2,3,-7,1,4


If you would instead like to operate column-wise, you can use the object methods mentioned earlier, while specifying the ``axis`` keyword:

In [56]:
# From dataframe df, subtract the "R" column
print(df)
df.subtract(df['R'], axis=0) # subtract the "R" column from every column

   Q  R  S  T
0  3  8  2  4
1  2  6  4  8
2  6  1  3  8


Unnamed: 0,Q,R,S,T
0,-5,0,-6,-4
1,-4,0,-2,2
2,5,0,2,7


Note that these ``DataFrame``/``Series`` operations, like the operations discussed above, will automatically align  indices between the two elements:

In [57]:
# Take a slice from df using implicit index
halfrow = df.iloc[0, ::2] # Implicit index (first row, every other column)
halfrow

Q    3
S    2
Name: 0, dtype: int64

In [58]:
df - halfrow # remove half row from df (if missing value populate NaN)
# R and T have NaN because halfrow did not have any corresponding values

Unnamed: 0,Q,R,S,T
0,0.0,,0.0,
1,-1.0,,2.0,
2,3.0,,1.0,


### This preservation and alignment of indices and columns means that operations on data in Pandas will always maintain the data context, which prevents the types of silly errors that might come up when working with heterogeneous and/or misaligned data in raw NumPy arrays.

# #4 Operating on Null Values 

Finding, Removing, Filling

## Operating on Null Values

As we have seen, Pandas treats ``None`` and ``NaN`` as essentially interchangeable for indicating missing or null values.
To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures.
They are:

- ``isnull()``: Generate a boolean mask indicating missing values
- ``notnull()``: Opposite of ``isnull()``
- ``dropna()``: Return a filtered version of the data
- ``fillna()``: Return a copy of the data with missing values filled or imputed

We will conclude this section with a brief exploration and demonstration of these routines.

### Detecting null values
Pandas data structures have two useful methods for detecting null data: ``isnull()`` and ``notnull()``.
Either one will return a Boolean mask over the data. For example:

In [None]:
data = pd.Series([1, np.nan, 'hello', None])

In [None]:
data.isnull()

As mentioned in [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb), Boolean masks can be used directly as a ``Series`` or ``DataFrame`` index:

In [None]:
data[data.notnull()] # Use mask as index (show me not null elements)
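
Because the mask is Boolean, it can also be summed to count missing entries. A minimal sketch on the same ``data`` Series (``n_missing`` and ``present`` are illustrative names):

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 'hello', None])

# True counts as 1 when summed, so this counts the missing entries
n_missing = data.isnull().sum()

# Keep only the entries that are present
present = data[data.notnull()]

print(n_missing)
print(present.tolist())
```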

### Dropping null values

In addition to the masking used before, there are the convenience methods, ``dropna()``
(which removes NA values) and ``fillna()`` (which fills in NA values). For a ``Series``,
the result is straightforward:

In [None]:
data.dropna()

For a ``DataFrame``, there are more options.
Consider the following ``DataFrame``:

In [None]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

We **cannot drop single values** from a ``DataFrame``; we can only drop full rows or full columns.
Depending on the application, you might want one or the other, so ``dropna()`` gives a number of options for a ``DataFrame``.

### By default, ``dropna()`` will drop all rows in which *any* null value is present:

In [None]:
df.dropna()

Alternatively, you can drop NA values along a different axis; ``axis=1`` (i.e. operate on columns) drops all columns containing a null value:

In [None]:
df.dropna(axis='columns') # Drops all columns containing NaN

But this drops some good data as well; you might rather be interested in dropping rows or columns with **all** NA values, or a **majority** of NA values.
This can be specified through the ``how`` or ``thresh`` parameters, which allow fine control of the number of nulls to allow through.

The default is ``how='any'``, such that any row or column (depending on the ``axis`` keyword) containing a null value will be dropped.
You can also specify ``how='all'``, which will only drop rows/columns that are *all* null values:

In [None]:
df[3] = np.nan # Add a column with NaN values
df

In [None]:
df.dropna(axis='columns', how='all') # drop only if they are all NaN
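
The ``thresh`` parameter mentioned above sets a minimum number of non-null values a row or column needs in order to be kept. A minimal sketch, rebuilding the same ``df`` including its all-``NaN`` column (``kept`` is an illustrative name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan  # add a column with only NaN values

# Keep only the rows that have at least 3 non-null values
kept = df.dropna(axis='index', thresh=3)
print(kept)
```

Only the middle row survives, since the first and last rows each contain just two non-null values.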

### Filling null values

Sometimes rather than dropping NA values, you'd rather replace them with a valid value.
This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values.
You could do this in-place using the ``isnull()`` method as a mask, but because it is such a common operation Pandas provides the ``fillna()`` method, which returns a copy of the data with the null values replaced.

Consider the following ``Series``:

In [None]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

We can fill NA entries with a single value, such as zero:

In [None]:
data.fillna(0)

We can specify a forward-fill to propagate the previous value forward:

In [None]:
# forward-fill
data.ffill() # same effect as data.fillna(method='ffill'), which recent Pandas deprecates
# Forward fill means use the value at the previous index

Or we can specify a back-fill to propagate the next values backward:

In [None]:
# back-fill
data.bfill() # Use the value at the next index (replaces the deprecated fillna(method='bfill'))

For ``DataFrame``s, the options are similar, but we can also specify an ``axis`` along which the fills take place:

In [None]:
df

In [None]:
df.ffill(axis=1) # Forward fill using the previous value in the row

Notice that if a previous value is not available during a forward fill, the NA value remains.
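
The effect is easiest to see on a small frame. This sketch uses the original three-column ``df`` (without the extra all-``NaN`` column) and assumes the ``ffill()`` method is available in the installed Pandas version; ``filled`` is an illustrative name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# Fill along each row, left to right
filled = df.ffill(axis=1)
print(filled)
```

The ``NaN`` in the first row is replaced by the value to its left, while the ``NaN`` at the start of the last row has nothing to its left and so remains.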

# #5 Multi-Index

Up to this point we've been focused primarily on one-dimensional and two-dimensional data, stored in Pandas ``Series`` and ``DataFrame`` objects, respectively. Often it is useful to go beyond this and store higher-dimensional data, that is, data indexed by more than one or two keys.
We will talk about how to make use of *hierarchical indexing* (also known as *multi-indexing*) to incorporate **multiple index *levels* within a single index**. In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional ``Series`` and two-dimensional ``DataFrame`` objects.

In this section, we'll explore the direct creation of ``MultiIndex`` objects, considerations when indexing, slicing, and computing statistics across multiply indexed data, and useful routines for converting between simple and hierarchically indexed representations of your data.

The examples below assume the standard NumPy and Pandas imports from the start of this notebook.

## A Multiply Indexed Series

Let's start by considering how we might represent two-dimensional data within a one-dimensional ``Series``.
For concreteness, we will consider a series of data where each point has a character and numerical key.

In [None]:
# Define an index of tuples
tup_index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=tup_index)
pop

In [None]:
# Define a multilevel index based on tuples
index = pd.MultiIndex.from_tuples(tup_index)
index

In [None]:
# Let's reindex with this index
pop = pop.reindex(index)
# Then show me Series with Multi-index
pop

Here the first two columns of the ``Series`` representation show the multiple index values, while the third column shows the data.
Notice that some entries are missing in the first column: in this multi-index representation, any **blank entry indicates the same value as the line above it**.

Now to access all data for which the second index is 2010, we can simply use the Pandas slicing notation:

In [None]:
# Show me all the data whose second index level is 2010
pop[:, 2010] # Use Pandas slicing notation: all states, year 2010

### MultiIndex as extra dimension

You might notice something else here: we could easily have stored the same data using a simple ``DataFrame`` with index and column labels.
In fact, Pandas is built with this equivalence in mind. The ``unstack()`` method will quickly convert a multiply indexed ``Series`` into a conventionally indexed ``DataFrame``:

In [None]:
# Convert multi-indexed to conventionally indexed
pop_df = pop.unstack() # Define new unstacked dataframe
pop_df

Suppose we want to add another column of demographic data for each state at each year (say, population under 18); with a ``MultiIndex`` this is as easy as adding another column to the ``DataFrame``:

In [None]:
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df
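
The operations from section #3 work with hierarchical indices as well. As an illustration, the sketch below computes the fraction of people under 18 by state and year; the frame is rebuilt here so the block runs on its own, and ``f_u18`` is an illustrative name:

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'New York', 'Texas'],
                                    [2000, 2010]])
pop_df = pd.DataFrame({'total': [33871648, 37253956,
                                 18976457, 19378102,
                                 20851820, 25145561],
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]},
                      index=index)

# Element-wise division keeps the two-level index
f_u18 = pop_df['under18'] / pop_df['total']

# Unstack to view states as rows and years as columns
print(f_u18.unstack())
```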


### #1 The most straightforward way to construct a multiply indexed ``Series`` or ``DataFrame`` is to simply pass a list of two or more index arrays to the constructor. For example:

In [None]:
# Construct a multi-indexed DataFrame by passing a list of index arrays
# Values is a matrix of four rows and two columns
# Explicit indexes are two: [a,a,b,b] [1,2,1,2]
# Column labels are two "data1" and "data2"
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df

### #2 Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a ``MultiIndex`` by default:

In [None]:
# If you pass a dictionary with tuple keys, 
# it will be interpreted as multiindex
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)

## Explicit MultiIndex constructors

For more flexibility in how the index is constructed, you can instead use the class method constructors available on ``pd.MultiIndex``.
For example, as we did before, you can construct the ``MultiIndex`` from a simple list of arrays giving the index values within each level:

## #3 The ``from_arrays`` method

In [None]:
# Note the "from_arrays" class method of MultiIndex
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

## #4 You can construct it from a list of tuples giving the multiple index values of each point:

In [None]:
# Note the "from_tuples" class method of MultiIndex
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

## #5 You can even construct it from a Cartesian product of single indices:

In [None]:
# You could even take the product of the level arrays!!!
# You get: a1, a2, b1, b2
# Note the "from_product" class method of MultiIndex
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

## MultiIndex level names

Sometimes it is convenient to name the levels of the ``MultiIndex``.
This can be accomplished by:

1) passing the ``names`` argument to any of the above ``MultiIndex`` constructors, 

2) setting the ``names`` attribute of the index after the fact

In [None]:
# Ok, let's define the names of the "levels"
pop.index.names = ['state', 'year']
pop
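
The cell above uses the second option. The first option, passing ``names`` to a constructor, can be sketched as follows; the level names ``letter`` and ``number`` are made up for illustration:

```python
import pandas as pd

# Name the levels at construction time
named = pd.MultiIndex.from_product([['a', 'b'], [1, 2]],
                                   names=['letter', 'number'])
print(named.names)
```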

## MultiIndex for columns

In a ``DataFrame``, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well.
Consider the following, which is a mock-up of some (somewhat realistic) medical data:

In [None]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1) # round to one decimal point
data[:, ::2] *= 10 # Multiply the values in every other column by 10
data += 37 # Set baseline to 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

Here we see where the multi-indexing for both rows and columns can come in *very* handy.
This is fundamentally four-dimensional data, where the dimensions are the subject, the measurement type, the year, and the visit number.
With this in place we can, for example, index the top-level column by the person's name and get a full ``DataFrame`` containing just that person's information:

In [None]:
health_data['Guido'] # Show me records for "Guido"

## Indexing and Slicing a MultiIndex

Indexing and slicing on a ``MultiIndex`` is designed to be intuitive, and it helps if you think about the indices as added dimensions.
We'll first look at indexing multiply indexed ``Series``, and then multiply-indexed ``DataFrame``s.

### Multiply indexed Series

Consider the multiply indexed ``Series`` of state populations we saw earlier:

In [None]:
pop # Series with two indexes

We can access single elements by indexing with multiple terms:

In [None]:
pop['California', 2000] # Specify both labels to index specific value

Other types of indexing and selection (discussed in [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb)) work as well; for example, selection based on Boolean masks:

In [None]:
pop[pop > 22000000] # Use boolean masks (all states with pop > value)

Selection based on fancy indexing also works:

In [None]:
pop[['California', 'Texas']] # Fancy indexing

### Multiply indexed DataFrames

A multiply indexed ``DataFrame`` behaves in a similar manner.
Consider our toy medical ``DataFrame`` from before:

In [None]:
health_data

Remember that **columns are primary in a ``DataFrame``**, and the syntax used for multiply indexed ``Series`` applies to the columns.

For example, we can recover Guido's heart rate data with a simple operation:

In [None]:
health_data['Guido', 'HR'] # Columns primary in dataframe

Also, as with the single-index case, we can use the ``loc`` and ``iloc`` indexers introduced in [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb). For example:

In [None]:
# Slicing the salami (or is it Sushi????)
health_data.iloc[:2, :2] # First two rows and first two columns

In [None]:
# Give all data associated with Bob and Heart Rate
health_data.loc[:, ('Bob', 'HR')] # Use explicit indices all rows
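
Slicing inside these index tuples is less convenient, because a bare ``:`` is not valid inside a tuple. One option, not used elsewhere in this notebook, is the ``pd.IndexSlice`` helper. The sketch below rebuilds a mock ``health_data`` frame with a fixed seed (the values are random placeholders) and selects every subject's heart rate on the first visit of each year:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

rng = np.random.RandomState(0)  # fixed seed; mock values only
health_data = pd.DataFrame(np.round(rng.randn(4, 6), 1) + 37,
                           index=index, columns=columns)

idx = pd.IndexSlice
# Rows: any year, visit 1. Columns: any subject, type 'HR'.
first_visit_hr = health_data.loc[idx[:, 1], idx[:, 'HR']]
print(first_visit_hr)
```

This kind of slicing relies on the index being sorted, which is the case for indices built with ``from_product`` from sorted inputs.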

In [None]:
# Turn indices into columns using "reset_index"
# state and year become ordinary columns, giving a flat dataframe
print(pop)
pop_flat = pop.reset_index(name='population')
pop_flat

Often when working with data in the real world, the raw input data looks like this and it's useful to build a ``MultiIndex`` from the column values.
This can be done with the ``set_index`` method of the ``DataFrame``, which returns a multiply indexed ``DataFrame``:

In [None]:
print(pop_flat)
# Create multi-index from columns
pop_flat.set_index(['state', 'year'])
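
The two methods are inverses of each other, which the following sketch checks as a round trip. The Series is rebuilt with named levels so that the block is self-contained; ``restored`` is an illustrative name:

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'New York', 'Texas'],
                                    [2000, 2010]],
                                   names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Flatten: index levels become ordinary columns
pop_flat = pop.reset_index(name='population')

# Rebuild: columns become index levels again
restored = pop_flat.set_index(['state', 'year'])['population']

print(restored.index.equals(pop.index))
```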