<a href="https://colab.research.google.com/github/ProfessorPatrickSlatraigh/CST3512/blob/main/CST3512_D308_Class19_Indexing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##CST3512 Class 19    
**Row and Column Indexing**    


1. **Hierarchical Indexing**    
2. **Dewey Decimal Dictionary**     



This notebook is based on [Section 8.1 Hierarchical Indexing](https://wesmckinney.com/book/data-wrangling.html) from Chapter 8 - Data Wrangling:     

  - Join,     
  - Combine, and     
  - Reshape     
  
  in Wes McKinney's 'Python for Data Analysis'    



In many applications, data may be spread across a number of files or databases or be arranged in a form that is not convenient to analyze. This chapter focuses on tools to help combine, join, and rearrange data.    

This notebook introduces the concept of **hierarchical indexing** in pandas, which is used extensively in some of these operations. Chapter 8 of the book then digs into the particular data manipulations. Various applied usages of these tools can be seen in [Data Analysis Examples](https://wesmckinney.com/book/data-wrangling.html#data-analysis-examples).



---



##**Housekeeping**    

Import required modules    


In [None]:
# Import pandas 
import pandas as pd

# Import numpy   
import numpy as np

##**1. Hierarchical Indexing**    


**Hierarchical indexing** is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis. Another way of thinking about it is that it provides a way for you to work with higher dimensional data in a lower dimensional form. Let’s start with a simple example: create a Series with a list of lists (or arrays) as the index:

###First, using the default index values...

In [None]:
data_0 = pd.Series(np.random.randn(9))

print(data_0)  

In [None]:
print(data_0[5])

###Then, using a list of unique values as row indices...   

In [None]:
data_1 = pd.Series(np.random.randn(9),
       index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])

print(data_1)    

In [None]:
print(data_1['d'])

###Of course, while this may work, it would be hard to make use of...

In [None]:
data_2 = pd.Series(np.random.randn(9),
       index=['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'])

print(data_2)    

In [None]:
print(data_2['b'])
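
A quick check, offered as a sketch with made-up values rather than the notebook's random data, shows why duplicate labels are awkward: the index is no longer unique, so a single label can return several rows.

```python
import pandas as pd

# Illustrative values (not the random data above) so the result is predictable
s = pd.Series([10, 20, 30, 40, 50],
              index=['a', 'a', 'b', 'b', 'c'])

print(s.index.is_unique)   # False: 'a' and 'b' each appear twice
print(s['b'])              # a two-row Series, not a single value
print(s['c'])              # a scalar, because 'c' appears only once
```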

###But, if we layer iterables to create unique indexing combinations...   

In [None]:
data_3 = pd.Series(np.random.randn(9),
       index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
       [1, 2, 3, 1, 3, 1, 2, 2, 3]])

print(data_3)    

In [None]:
corp = pd.Series(['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth', 'ninth'], 
index=[['Americas', 'Americas', 'Americas', 'EMEA', 'EMEA', 'AsiaPac', 'AsiaPac', 'Corp', 'Corp'], 
[101, 201, 301, 101, 301, 101, 201, 201, 301]])

print(corp)    

What you’re seeing is a view of a Series with a **MultiIndex** as its index. The “gaps” in the index display mean “use the label directly above”:

*A conventional, single iterable of index values as we created in `data_1`:*    

In [None]:
print(data_1.index)

*A <b>MultiIndex</b> with a combination of iterable index values as we created in `data_3`:*    





In [None]:
print(data_3.index)

*A <b>MultiIndex</b> with a combination of iterable index values as we created in `corp`:*  

In [None]:
print(corp.index)
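
To look inside a MultiIndex, the sketch below rebuilds the same two-level labels used for `data_3` (with placeholder values `0..8` in place of the random numbers) and inspects the levels and the per-row labels.

```python
import pandas as pd

# Same index labels as `data_3`; the values 0..8 are placeholders
s = pd.Series(range(9),
              index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 3, 1, 2, 2, 3]])

print(s.index.nlevels)               # number of index levels
print(s.index.levels)                # unique labels stored for each level
print(s.index.get_level_values(0))   # outer label for every row
print(s.index.tolist()[:3])          # each row label is a tuple
```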

With a hierarchically indexed object, so-called partial indexing is possible, enabling you to concisely select subsets of the data:

In [None]:
data_3['b']

In [None]:
corp['EMEA']

In [None]:
data_3['c']

In [None]:
data_3['b':'c']

In [None]:
data_3['b':'d']

In [None]:
data_3.loc[['b', 'd']]

In [None]:
corp.loc[['EMEA', 'Americas', 'AsiaPac']]

Selection is even possible from an “inner” level. Here I select all of the values having the value "2" from the second index level:

In [None]:
data_3.loc[:, 2]

Or, all of the company `201` rows from the `corp` Series:    

In [None]:
corp.loc[:,201]
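
As a small sketch of the selection styles discussed above, the example below uses the `data_3` index labels with placeholder values. Passing a tuple to `.loc` addresses both levels at once, and `.xs()` is shown as an alternative way to select on the inner level.

```python
import pandas as pd

# Same index labels as `data_3`; the values 0..8 are placeholders
s = pd.Series(range(9),
              index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 3, 1, 2, 2, 3]])

one_value = s.loc[('b', 3)]      # both levels given: a single value
inner = s.loc[:, 2]              # every outer label, inner label 2
same_inner = s.xs(2, level=1)    # cross-section on the inner level

print(one_value)
print(inner)
```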

Hierarchical indexing plays an important role in reshaping data and group-based operations like forming a pivot table. For example, you can rearrange this data into a DataFrame using its `.unstack()` method:

In [None]:
data_3.unstack()

In [None]:
corp.unstack()
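
Because not every outer/inner combination exists in the Series, the unstacked result contains missing values. The sketch below, again using placeholder values on the `data_3` index labels, counts those gaps and shows the `fill_value` argument of `Series.unstack()` as one way to fill them.

```python
import pandas as pd

# Same index labels as `data_3`; the values 0..8 are placeholders
s = pd.Series(range(9),
              index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 3, 1, 2, 2, 3]])

wide = s.unstack()                      # inner level becomes the columns
print(wide.shape)                       # 4 outer labels by 3 inner labels
print(int(wide.isna().sum().sum()))     # combinations absent from the Series

filled = s.unstack(fill_value=0)        # fill the gaps instead of leaving NaN
print(filled)
```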

The inverse operation of `.unstack()` is `.stack()`:

In [None]:
data_3.unstack().stack()

In [None]:
corp.unstack().stack()

`.stack()` and `.unstack()` are explored in more detail in [Chapter 8 of Wes McKinney's Python for Data Analysis](https://wesmckinney.com/book/data-wrangling.html).

---

With a DataFrame, either axis can have a hierarchical index:


In [None]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
            index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
            columns=[['Ohio', 'Ohio', 'Colorado'],
            ['Green', 'Red', 'Green']])

In [None]:
print(frame)

The hierarchical levels can have names (as strings or any Python objects). If so, these will show up in the console output:


In [None]:
# Assign key1 and key2 as `frame` index hierarchy names, respectively   
frame.index.names = ['key1', 'key2'] 

# Assign state and color as `frame` column hierarchy names, respectively 
frame.columns.names = ['state', 'color']

In [None]:
print(frame)

In [None]:
# Rename the `frame` index levels to Region and Product   
frame.index.names = ['Region', 'Product'] 

# Keep state and color as the `frame` column level names 
frame.columns.names = ['state', 'color']

In [None]:
print(frame)

***Caution***    
*Be careful to note that the index level names `Region` and `Product` are not part of the row labels (the `frame.index` values). Likewise, `state` and `color` name the column levels and are not column labels.*
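
The distinction between level names and labels can be checked directly. This sketch rebuilds the same `frame` and prints the names separately from the row labels.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['Region', 'Product']
frame.columns.names = ['state', 'color']

print(list(frame.index.names))   # the level names
print(frame.index.tolist())      # the row labels: tuples that omit the names
```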

With partial column indexing you can similarly select groups of columns:

In [None]:
frame['Ohio']
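
Column selection can also go deeper than the outer level. The sketch below rebuilds `frame` and contrasts an outer-level selection with a tuple that names both levels and with `.xs()` on the inner `color` level.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.columns.names = ['state', 'color']

ohio = frame['Ohio']                               # DataFrame: Green and Red
ohio_red = frame[('Ohio', 'Red')]                  # Series: one exact column
green = frame.xs('Green', axis=1, level='color')   # Green column of each state

print(ohio)
print(ohio_red)
print(green)
```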

A `MultiIndex` can be created by itself and then reused; the columns in the preceding DataFrame with level names could be created like this:

In [None]:
pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                          ['Green', 'Red', 'Green']],
                          names=['state', 'color'])
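
As a sketch of the reuse mentioned above, the same `MultiIndex` object can be passed as the `columns` of more than one DataFrame. The frames `first` and `second` and their values are illustrative only. `pd.MultiIndex.from_tuples` is shown as another constructor that produces an equal index.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                  ['Green', 'Red', 'Green']],
                                 names=['state', 'color'])

# Reuse one column index for two illustrative frames
first = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=cols)
second = pd.DataFrame(np.ones((2, 3)), columns=cols)

# The same index built from (state, color) tuples
same = pd.MultiIndex.from_tuples([('Ohio', 'Green'),
                                  ('Ohio', 'Red'),
                                  ('Colorado', 'Green')],
                                 names=['state', 'color'])

print(first)
print(cols.equals(same))
```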

###Reordering and Sorting Levels    



At times you may need to rearrange the order of the levels on an axis or sort the data by the values in one specific level. The `swaplevel` method takes two level numbers or names and returns a new object with the levels interchanged (but the data is otherwise unaltered):

In [None]:
# frame.swaplevel('key1', 'key2')  # not applicable due to `Region` `Product` assignment earlier
frame.swaplevel('Region', 'Product')  

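`sort_index`, on the other hand, sorts the data. By default it sorts lexicographically using all index levels, but passing the `level` argument makes it sort by a single level (or a subset of levels). When swapping levels, it’s not uncommon to also use `sort_index` so that the result is lexicographically sorted by the indicated level: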

In [None]:
frame.sort_index(level=1)

In [None]:
frame.swaplevel(0, 1).sort_index(level=0)

***Note:***    

*Data selection performance is much better on hierarchically indexed objects if the index is lexicographically sorted starting with the outermost level—that is, the result of calling `sort_index(level=0)` or `sort_index()`.*    
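
Whether an index is lexicographically sorted can be checked with `is_monotonic_increasing`. The sketch below rebuilds `frame`, swaps its levels, and shows that the swapped index is unsorted until `sort_index()` is called.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['Region', 'Product']

swapped = frame.swaplevel('Region', 'Product')
print(swapped.index.is_monotonic_increasing)   # False: rows keep their old order

ordered = swapped.sort_index()
print(ordered.index.is_monotonic_increasing)   # True: sorted from the outer level
print(ordered.index.tolist())
```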



###Summary Statistics by Level    



Descriptive and summary statistics on DataFrame and Series can be aggregated by index level on a particular axis by grouping on that level with `groupby(level=...)`. (Older pandas releases also accepted a `level` argument directly on methods such as `sum`; that argument was removed in pandas 2.0.) Consider the above DataFrame; we can aggregate by level on either the rows or columns like so:

In [None]:
frame

In [None]:
# frame.groupby(level='key2').sum()   # not applicable due to `Region` `Product` assignment earlier
frame.groupby(level='Product').sum()

In [None]:
# Group the columns by transposing first; `groupby(..., axis=1)` is deprecated in recent pandas
frame.T.groupby(level='color').sum().T

Internally, this utilizes pandas’s `groupby` machinery, which is discussed in more detail in the book [Python for Data Analysis](https://wesmckinney.com/book/data-aggregation.html).
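
To make the level-wise sums concrete, the sketch below rebuilds `frame` so the totals can be verified by hand. The column grouping is written with a transpose, an assumption made here so the sketch does not depend on the `axis` argument of `groupby`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['Region', 'Product']
frame.columns.names = ['state', 'color']

# Rows: add the two regions together for each product
by_product = frame.groupby(level='Product').sum()

# Columns: add the two Green columns together; Red stays as it is
by_color = frame.T.groupby(level='color').sum().T

print(by_product)
print(by_color)
```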

---

###Indexing with a DataFrame's Columns    



It’s not unusual to want to use one or more columns from a DataFrame as the row index; alternatively, you may wish to move the row index into the DataFrame’s columns. Here’s an example DataFrame:

In [None]:
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
            'c': ['one', 'one', 'one', 'two', 'two',
                 'two', 'two'],
            'd': [0, 1, 2, 0, 1, 2, 3]})

In [None]:
print(frame)

DataFrame’s `set_index` function will create a new DataFrame using one or more of its columns as the index:

In [None]:
frame2 = frame.set_index(['c', 'd'])

In [None]:
print(frame2)

By default the columns are removed from the DataFrame, though you can leave them in by passing `drop=False` to `set_index`:

In [None]:
frame.set_index(['c', 'd'], drop=False)

`reset_index`, on the other hand, does the opposite of `set_index`; the hierarchical index levels are moved into the columns:

In [None]:
frame2.reset_index()
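
A short round-trip sketch using the same example DataFrame: after `set_index` followed by `reset_index`, the data is unchanged but the former index columns come back at the front.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two',
                            'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})

frame2 = frame.set_index(['c', 'd'])
restored = frame2.reset_index()

print(list(restored.columns))                         # index columns now lead
print(restored[['a', 'b', 'c', 'd']].equals(frame))   # same data once reordered
```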



---



####**Related Exercise**


*See the following section of this notebook or the notebook ['Dewey_Dictionary'](https://bit.ly/dewey_notebook) for a related exercise on hierarchical indexing using the Dewey Decimal System.* 



---



##**2. Dewey Dictionary**    

Loads the Dewey Decimal System **codes** and **categories** to a dictionary using a [reference from the University of Illinois Library](https://www.library.illinois.edu/infosci/research/guides/dewey/).   

Makes the Dewey `code:category` dictionary available as a `pandas` DataFrame or as a Python Dictionary.     



First, download the `.csv` file from the ProfessorPatrickSlatraigh GitHub repository to the current working directory:     
* https://raw.githubusercontent.com/ProfessorPatrickSlatraigh/data/main/dewey_codes_categories.csv

In [None]:
!curl 'https://raw.githubusercontent.com/ProfessorPatrickSlatraigh/data/main/dewey_codes_categories.csv' -o dewey_dictionary.csv

###Housekeeping    

Import modules required:    
* **csv** - to read a csv file into a variable    
* **pandas** - for dataframes, etc.
* **numpy** - for arrays   


In [None]:
import csv    

import pandas as pd    

import numpy as np    

###Read the `CSV` into a Dictionary    

Create an empty `dewey_dict` dictionary and populate it with `key:value` pairs read from the two columns in the `.csv` file, excluding the header row.    


In [None]:
# Create an empty dictionary for Dewey code:category pairs   
dewey_dict = {}

try: 
    with open('dewey_dictionary.csv', mode='r') as source:
        csv_read = csv.reader(source)
        next(csv_read)              # to skip the header row in the csv_read file
        for line in csv_read: 
            # print(line)           # scaffolding to peek at lines in csv_read
            # print(line[0])        # scaffolding to peek at col 0 in csv_read
            # print(type(line[0]))  # scaffolding to peek at col 0 type in csv_read
            # wait = input('Hit Enter to continue.') # wait for scaffolding output
            dewey_dict[line[0]] = line[1] # dict entry (key=1st col, value=2nd col)
 
    print('Created the `dewey_dict` Dictionary.')
except Exception as err:   # report the cause instead of hiding it
    print('Error encountered attempting to create `dewey_dict`:', err)

Descriptive information on the `dewey_dict` Dictionary. 

In [None]:
# print the length of the dictionary (# of key:value pairs)
print(len(dewey_dict))

In [None]:
# print the populated dictionary   
print(dewey_dict)



---



###Read the .CSV into a Dataframe then Create a Dictionary    

Use pandas' built-in `read_csv()` function with a few parameters that describe the `.csv` file format, then convert the result to a dictionary with the pandas `to_dict()` method.


* `header=0` specifies that the first row of the file holds the column names.    
* `index_col=0` specifies which column is used as the labels for the object that `read_csv()` returns. In this case, the first column (position 0) supplies the keys.    
* `.squeeze('columns')` converts the resulting one-column DataFrame into a Series. Only one column of values remains because the first column is used as the index. (Older pandas releases accepted a `squeeze=True` parameter in `read_csv()`; it was removed in pandas 2.0.)    


In [None]:
try: 
    # Use pandas `read_csv` to read the file, then squeeze the single value column to a Series
    df_dewey = pd.read_csv('dewey_dictionary.csv', header=0, index_col=0).squeeze('columns')
    
    # Use pandas `to_dict()` to assign the Series index:value pairs to a dictionary
    dict_dewey = df_dewey.to_dict()
    
    print('Created `df_dewey` DataFrame and `dict_dewey` Dictionary.')
except Exception as err:   # report the cause instead of hiding it
    print('Error attempting to create `df_dewey` and/or `dict_dewey`:', err)

Descriptive information on `df_dewey` (a Series, with `dewey_code` as the index).

In [None]:
df_dewey.describe()

In [None]:
df_dewey.head()

Descriptive information on the `dict_dewey` Dictionary.

In [None]:
print(len(dict_dewey))

In [None]:
print(dict_dewey)
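
One detail worth checking when comparing the two dictionaries: `csv.reader` always yields strings, while `read_csv()` infers types, so codes written with leading zeros (such as `000`) could be parsed as integers. The sketch below assumes a header of `dewey_code,dewey_category` and uses a few made-up sample rows in place of the downloaded file. Passing `dtype=str` keeps the codes as text.

```python
import io
import pandas as pd

# Made-up sample rows for illustration only; the header names are assumed
sample = io.StringIO(
    "dewey_code,dewey_category\n"
    "000,General works\n"
    "010,Bibliography\n"
    "100,Philosophy\n"
)

# Read every column as text, then use the code column as the index
codes = pd.read_csv(sample, dtype=str).set_index('dewey_code')['dewey_category']

print(codes.index.tolist())   # codes keep their leading zeros
print(codes.to_dict())
```

Keeping the codes as strings also makes it straightforward to work with their individual characters later.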



---



###**<u>Exercise</u>**    

Add your code below to take a DataFrame with Dewey Decimal System `dewey_code` and `dewey_category` columns and transform the DataFrame to include a hierarchical structure of the following columns, derived from the `dewey_code` column:    
* **dewey_level1** - based on the **first** character in `dewey_code`    
* **dewey_level2** - based on the **second** character in `dewey_code`    
* **dewey_level3** - based on the **third** character in `dewey_code`    



*Note:*    

*Refer to the first section of this notebook or the Colab notebook on [**Hierarchical Indexing**](https://bit.ly/hierarchical_indexing) for reference/refresher information.*    

In [None]:
### YOUR CODE HERE ###
### add snippets below, if you like ###