# Table of Contents
* [Setup](#Setup)
* [The Index Object](#The-Index-Object)
	* [Changing the Index](#Changing-the-Index)
	* [Hierarchical Indexing](#Hierarchical-Indexing)
	* [Iterating over the MultiIndex](#Iterating-over-the-MultiIndex)
	* [MultiIndex .loc accessor](#MultiIndex-.loc-accessor)
	* [Multi-axis indexers](#Multi-axis-indexers)
* [Categorical data types](#Categorical-data-types)
	* [Overview of categoricals](#Overview-of-categoricals)
	* [Categorical Example: Decrease Memory Use](#Categorical-Example:-Decrease-Memory-Use)
		* [A slightly harder way to select IPA styles](#A-slightly-harder-way-to-select-IPA-styles)
	* [Categoricals: How it Works](#Categoricals:-How-it-Works)
* [Section Review](#Section-Review)


# Setup

In [None]:
# Import modules needed for this lesson
import pandas as pd

# Set Jupyter notebook display options
pd.options.display.max_rows = 6
pd.options.display.max_columns = 8
pd.options.display.width = 100

# Test the version of pandas: it should be 0.16 or greater
pd.__version__

# The Index Object

The pandas containers Series and DataFrame both have an associated Index
* the Index can be a simple column of integers, such as 0, 1, 2, ..., but that is the least useful
* the Index can be a column of labels, such as the strings "A", "B", "C"
* the Index can be **hierarchical** (aka "MultiIndex")

The `Index` is the leftmost column shown in the DataFrame output, and its values are printed in bold. In this DataFrame each row is indexed by a unique integer from 0 to 49999.

In [None]:
# Reread the data file, to start fresh
df = pd.read_csv('data/beer2.csv.gz',
                  index_col=0,
                  parse_dates=['time'])
df.head()

In [None]:
df.index

The values in the `Index` in this case are unique.
* In general, an Index need not be unique (examples to come...)
* There are many advantages to having a unique `Index`.
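
To make the difference concrete, here is a minimal sketch on a small made-up Series (the labels and values are hypothetical, not taken from the beer data). Selecting a duplicated label returns every matching row, while a unique label returns a single value.

```python
import pandas as pd

# Hypothetical toy data: the label "a" appears twice
s = pd.Series([10, 20, 30], index=["a", "b", "a"])

print(s.index.is_unique)   # False: "a" is duplicated

dup = s.loc["a"]           # a Series holding both "a" rows
single = s.loc["b"]        # a single scalar value
print(dup)
print(single)
```

With a unique Index, a label lookup always has exactly one answer, which is one reason a unique Index is easier to work with.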

In [None]:
# The current index is unique
df.index.is_unique

## Changing the Index

The `Index` of a DataFrame can be reset to any currently existing column or a custom Pandas `Index` object.
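
Both options can be sketched on a tiny stand-in frame. The frame `toy` and the labels `r1`, `r2`, `r3` are assumptions made up for illustration; only `set_index` and `pd.Index` come from pandas itself.

```python
import pandas as pd

# Hypothetical toy frame standing in for the review data
toy = pd.DataFrame({"beer_id": [7, 7, 9], "abv": [5.0, 5.0, 6.5]})

# Option 1: promote an existing column to the Index
by_beer = toy.set_index("beer_id")

# Option 2: assign a custom Index object with one label per row
custom = toy.copy()
custom.index = pd.Index(["r1", "r2", "r3"], name="review")

print(by_beer.index.name)   # beer_id
print(custom.index.name)    # review
```

Note that `set_index` moves the column out of the data: `by_beer` no longer has a `beer_id` column.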

In [None]:
# What column names do we have to use as an Index?

list(df.columns)

While each review is unique, we may want to reference this data set by other classifications like `beer_id`.

In [None]:
# Use the 'beer_id' column as the Index

dfb = df.set_index('beer_id', inplace=False)  # default inplace=False, more on this later
dfb.head(3)

In [None]:
# Notice that even these beer IDs are not unique. That's okay.

dfb.index.is_unique

In [None]:
# Use .describe() on the string columns and compare "count" vs "unique"
# (beer_id is now the Index; check it with dfb.index.nunique() vs len(dfb))

dfb.select_dtypes(include=['object']).describe()

In [None]:
# Example: This index is NOT unique: 
# There are 240 entries for beer_id 1904

dfb.loc[1904]

## Hierarchical Indexing

The default Index has only 1 "level" and it is not named

In [None]:
# Inspect the Index attributes
print( type(df.index) )
print( df.index.nlevels )  # number/count of levels in the Index
print( df.index.names )    # names of levels in the Index

The Index set to `beer_id` is named, but still has only 1 "level"

In [None]:
# Inspect the Index attributes
print( type(dfb.index) )
print( dfb.index.nlevels )  # number/count of levels in the Index
print( dfb.index.names )    # names of levels in the Index

For high dimensional data, hierarchical indexing is ideal
* Pandas supports multi-level indexing with `MultiIndex`
* a `MultiIndex` can be built from multiple columns, e.g. `profile_name`, `beer_id`, and `time`. 
* This labels every row by the values in those columns; the rows themselves are not combined or summarized
* Each of these three columns becomes a `level` of the Index.
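
Before applying this to the full data set, here is a minimal sketch on three made-up reviews. The rows are hypothetical; the column names simply mirror the ones used in this lesson.

```python
import pandas as pd

# Hypothetical toy reviews; column names mirror the lesson's data
toy = pd.DataFrame({
    "profile_name": ["bob", "amy", "amy"],
    "beer_id": [2, 2, 1],
    "time": pd.to_datetime(["2009-11-02", "2009-11-01", "2009-10-30"]),
    "review_overall": [4.0, 3.5, 4.5],
})

# Build a three-level index and sort it level by level
toy_reviews = toy.set_index(["profile_name", "beer_id", "time"]).sort_index()

print(toy_reviews.index.nlevels)       # 3
print(list(toy_reviews.index.names))   # ['profile_name', 'beer_id', 'time']
print(len(toy_reviews) == len(toy))    # True: same number of rows as before
```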

In [None]:
# sort_index orders the rows by profile_name, then beer_id, then time;
# a sorted MultiIndex is needed for slicing with .loc
reviews = df.set_index(['profile_name', 'beer_id', 'time']).sort_index()
reviews.head(5)

# Notice first 3 "column" labels: these are the Index levels

Now the Index has 3 levels and each is named

In [None]:
# Inspect the Index attributes
print( type(reviews.index) )
print( reviews.index.nlevels )  # number/count of levels in the Index
print( reviews.index.names )    # names of levels in the Index

Now we can easily subset the data using the levels of the hierarchical index

In [None]:
ind_column = 'beer_name'
reviews[ind_column]

When working with hierarchical Indexes, `pd.IndexSlice` provides a convenient way to use the `.loc` selection method to select some or all of the levels of a multilevel Index. Here, we want to select reviews by '05Harley' and only the 'beer_name' column.

For completeness I've selected all entries in the `beer_id` and `time` index levels as well. 
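
It can help to look at what `pd.IndexSlice` actually produces. The sketch below uses a made-up three-row frame (all names and values are hypothetical) and shows that `idx[...]` simply builds the tuple of labels and slices that `.loc` expects.

```python
import pandas as pd

idx = pd.IndexSlice

# Hypothetical toy data with a sorted three-level index
toy = pd.DataFrame(
    {"beer_name": ["Stout", "Porter", "Pils"]},
    index=pd.MultiIndex.from_tuples(
        [("amy", 1, "2009-10"), ("amy", 2, "2009-11"), ("bob", 2, "2009-11")],
        names=["profile_name", "beer_id", "time"],
    ),
).sort_index()

# idx[...] returns a plain tuple: one label plus two "select all" slices
print(idx["amy", :, :])

# Use that tuple as the row selector and pick a single column
amy_names = toy.loc[idx["amy", :, :], "beer_name"]
print(amy_names.tolist())   # ['Stout', 'Porter']
```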

In [None]:
ind_column  = 'beer_name'
ind_profile = '05Harley'

idx = pd.IndexSlice

reviews.loc[idx[ind_profile, :, :], ind_column]

Alternatively, I can just provide `ind_profile` to `idx`.

In [None]:
reviews.loc[idx[ind_profile], ind_column]

Let's find all of 05Harley's reviews for beer id 1307.

In [None]:
ind_column  = 'beer_name'
ind_profile = '05Harley'
ind_beer_id = 1307

reviews.loc[idx[ind_profile,ind_beer_id], ind_column]

#reviews[ind_column][ind_profile][ind_beer_id]

In [None]:
# The whole multiindex is unique
reviews.index.is_unique

Using a hierarchical index is not only more convenient, but also faster than using fancy and boolean selection.
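
As a rough sketch of the two approaches, the example below selects the same value both ways on a made-up frame (hypothetical data). It only checks that the results agree; the actual speed difference depends on the size of the data and can be measured in a notebook with `%timeit`.

```python
import pandas as pd

# Hypothetical toy data
flat = pd.DataFrame({
    "profile_name": ["amy", "bob", "amy", "cat"],
    "beer_id": [1, 1, 2, 3],
    "abv": [5.0, 5.0, 6.5, 4.2],
})
indexed = flat.set_index(["profile_name", "beer_id"]).sort_index()

# Boolean selection: builds a True/False mask over every row
by_mask = flat.loc[(flat["profile_name"] == "amy") & (flat["beer_id"] == 2), "abv"]

# Hierarchical selection: looks the labels up in the sorted index
by_index = indexed.loc[("amy", 2), "abv"]

print(by_mask.tolist(), by_index)
```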

In [None]:
# Create a Series, indexed by profile_name (people who reviewed beers)
# Take the top 5 most prolific reviewers.
top_reviewers = (reviews
                        .index
                        .get_level_values('profile_name')
                        .value_counts()
                        .head(5)
                )
top_reviewers

In [None]:
# The .index attribute is itself a Numpy-like array
top_reviewers.index

## Iterating over the MultiIndex

In [None]:
reviews.head(3)

In [None]:
# Example: iterating with a hierarchical index

for index, row in reviews.head().iterrows():
    print('Index:', index, ', Row:', '%s' % row['abv'])

In [None]:
# Example: iterating with a hierarchical index, printing only one level

for index, row in reviews.head().iterrows():
    print('Index:beer_id', index[1], ', Row:', '%s' % row['abv'])

## MultiIndex .loc accessor

When using MultiIndex and Hierarchical indexers
* the `.loc` accessor accepts one entry for each `level`; trailing levels that are left out are selected in full.
* a slice or index value can be provided for each level.

Usage:
```
df.loc[level0, level1, level2, etc.]
```

**Remember**: slicing with `.loc` is **INCLUSIVE**: the right endpoint of the slice is returned.
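
A quick way to see the inclusive endpoint is to compare label slicing with positional slicing on a small made-up Series (hypothetical values):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = s.loc["a":"c"]    # label slice: "c" is included
by_position = s.iloc[0:2]    # positional slice: position 2 is excluded

print(by_label.tolist())     # [10, 20, 30]
print(by_position.tolist())  # [10, 20]
```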

In [None]:
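pd.options.display.max_rows = 4
# Pass the row selection as one tuple and select all columns explicitly,
# so the three level entries cannot be mistaken for row/column indexers
reviews.loc[idx[top_reviewers.index, :, :], :]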

The more complete usage of MultiIndex and `.loc` is to provide the index selection in a tuple whose length is the number of `level`s.
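
The sketch below shows the tuple form on a made-up two-level frame (hypothetical data). A single tuple selects one row as a Series; wrapping the tuple in a list keeps the result as a DataFrame.

```python
import pandas as pd

# Hypothetical toy data with a two-level index
toy = pd.DataFrame(
    {"abv": [5.0, 6.5, 4.2]},
    index=pd.MultiIndex.from_tuples(
        [("amy", 1), ("amy", 2), ("bob", 2)],
        names=["profile_name", "beer_id"],
    ),
)

key = ("amy", 2)            # one entry per level
row = toy.loc[key, :]       # a Series holding that row's columns
frame = toy.loc[[key]]      # list of tuples: a one-row DataFrame

print(type(row).__name__, type(frame).__name__)
```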

In [None]:
reviews.loc[[('BeerFMAndy',100,pd.Timestamp('2009-10-22 03:39:21'))]]

## Multi-axis indexers

In [None]:
# Here are the scores for all 91 of BeerFMAndy's reviews
# for November only
idx=pd.IndexSlice
reviews.loc[idx['BeerFMAndy',:,'2009-11':'2009-11'],['beer_name','review_overall']]

In [None]:
# Alternative syntax using slice objects
# slice(None) == :
reviews.loc[('BeerFMAndy',slice(None),slice('2009-11','2009-11')),['beer_name','review_overall']]

In [None]:
reviews.loc[idx[top_reviewers.index, [92,54919], :], ['beer_name','beer_style']]

In [None]:
#! conda install -y numexpr
# SQL like query
(reviews
        .query('profile_name in @top_reviewers.index and \
                beer_id in [92,54919]')
        [['beer_name','beer_style']]
 )

In [None]:
# All reviews for beer_id 92 and 54919
reviews.loc[idx[:, [92, 54919], :], ['beer_name' ,'beer_style']]

# Categorical data types

In addition to the NumPy `dtypes`, Pandas introduces a `categorical` dtype.

## Overview of categoricals

The categorical data type is useful in the following cases:

* A string variable consisting of only a few different values. 
    * Converting such a string variable to a categorical variable will save some memory.
* The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). 
    * By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
* As a signal to other Python libraries that this column should be treated as a categorical variable
    * e.g. to use suitable statistical methods or plot types.
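
The ordering case can be sketched as follows. This assumes a pandas version that provides `pd.CategoricalDtype` (newer than the 0.16 minimum mentioned in Setup); the Series itself is made up for illustration.

```python
import pandas as pd

words = pd.Series(["three", "one", "two"])

# Plain strings sort alphabetically
print(words.sort_values().tolist())      # ['one', 'three', 'two']

# An ordered categorical sorts by the declared category order
order = pd.CategoricalDtype(["one", "two", "three"], ordered=True)
ranked = words.astype(order)
print(ranked.sort_values().tolist())     # ['one', 'two', 'three']
print(ranked.max())                      # three
```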

In [None]:
df.info()

## Categorical Example: Decrease Memory Use

Candidates for `categorical` dtypes are columns that have only a small fraction of unique strings. In the next cell we'll see that there are only 104 unique `beer_style`s out of the 50000 independent reviews.

In [None]:
# Select all columns of dtype object 
# Find out which ones have fewest unique values

df.select_dtypes(include=['object']).describe()

In [None]:
df[['beer_style']]

In [None]:
# fancy selection returns a DataFrame (not a Series) with a single column:

df[['beer_style']].info()

Convert to categorical:
* `beer_style` column converted to a `categorical`
* reassign column back into the DataFrame
* memory usage for this column has decreased!
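
The same effect can be checked on a made-up column without loading the data file. The Series `styles` below is hypothetical: three distinct strings repeated many times, which is the situation where a categorical helps most.

```python
import pandas as pd

# Hypothetical column: 3 distinct styles repeated many times
styles = pd.Series(["American IPA", "Stout", "Pilsner"] * 10_000)

# deep=True counts the memory held by the strings themselves
as_object = styles.memory_usage(deep=True)
as_category = styles.astype("category").memory_usage(deep=True)

print(as_object, as_category)
print(as_category < as_object)   # True for this toy column
```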

In [None]:
df['beer_style'] = df['beer_style'].astype('category')
df[['beer_style']].info()

### A slightly harder way to select IPA styles

In [None]:
cats = (df
          .beer_style
          .cat
          .categories[df.beer_style.cat.categories.str.contains('IPA')]
        )
cats

In [None]:
# generate the boolean indexer
df.beer_style.isin(cats)

In [None]:
df.loc[df.beer_style.isin(cats)]

## Categoricals: How it Works

`categorical` series or columns store only the unique strings and provide an integer `code` lookup. 

Notice the dtype of the code series object. Pandas will select the smallest integer dtype that can hold the number of unique categories.
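
Here is a minimal sketch of that lookup on a four-element made-up Series (hypothetical values). Each entry is stored as a small integer code that points into the list of categories.

```python
import pandas as pd

styles = pd.Series(["Stout", "IPA", "Stout", "Pilsner"], dtype="category")

print(styles.cat.categories.tolist())   # ['IPA', 'Pilsner', 'Stout']
print(styles.cat.codes.tolist())        # [2, 0, 2, 1]
print(styles.cat.codes.dtype)           # int8

# The original strings can be rebuilt from the codes and the categories
rebuilt = styles.cat.categories[styles.cat.codes].tolist()
print(rebuilt)
```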

In [None]:
df.beer_style.cat.codes

In [None]:
df.beer_style.cat.categories

In [None]:
df.beer_style.cat.ordered

***

# Section Review

* Indexing
   * Attribute and Bracket style indexing
   * Column then row indexing
   * Setting/Changing the Index
   * Boolean indexing
   * Hierarchical indexing with MultiIndex
   * .loc[] and .iloc[]
* Categorical dtype
